arXivDaily: arXiv daily academic digest, updated Monday to Friday
2606.03019 2026-06-03 cs.CY cs.AI

Reproducibility is the New Copyleft: Defining AGI-oriented Reproducible Builds

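Masayuki Hatta

AI summary: This paper proposes AGI-oriented reproducible builds as a functional analogue of copyleft, defines seven requirements that ensure bit-exact reproducibility of a model from declared inputs to outputs, and argues that protocols rather than platforms offer the better governance framework.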

Comments
Accepted at AGI-26. To appear in the proceedings (Springer LNCS)
Abstract

Copyleft, as implemented in licenses such as the GNU General Public License, was a legal hack that used copyright to guarantee user freedom by tying the availability of source code to every act of distribution. Its normative force rested on an implicit technical premise: that source code and object code stand in a well-defined, humanly auditable, and reproducible relationship. Large language models and, prospectively, Artificial General Intelligence (AGI) systems systematically violate this premise. The artifacts jointly required to reconstruct a model -- code, data, weights, hyperparameters, toolchain, and hardware configuration -- are each subject to independent legal, technical, and economic constraints that no current open-source framework fully resolves. Sufficiently capable AI systems can also rewrite licensed source into functionally equivalent derivatives stripped of their original obligations, a form of laundering against which copyleft has no effective defense. This paper argues that a functional analogue of copyleft for AGI must be grounded not in share-alike clauses over code, but in reproducible builds: a practice guaranteeing bit-exact reconstructability from declared inputs. We review the logic of copyleft, critically examine Maffulli's Second Liberation thesis according to which AI fulfills Stallman's dream, and show that the argument collapses unless AGI systems are themselves reproducible. Drawing on the Open Source AI Definition (OSAID), the Model Openness Framework (MOF), OpenMDW, and deterministic-inference research, we define seven requirements for AGI-oriented reproducible builds. We further argue that the Model Context Protocol (MCP) and analogous AI-to-AI coupling mechanisms constitute a new dynamic linking layer for which copyleft-style licensing is ill-suited, and that Masnick's "protocols, not platforms" framework offers a more promising governance template.

2606.03017 2026-06-03 cs.LG cs.AI cs.RO

ConTraIRL: Factorized Contrastive Abstractions for Transferable IRL

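Yikang Gui, Bikramjit Banerjee, Prashant Doshi

AI summary: Proposes ConTraIRL, a framework that uses dual-encoder contrastive learning to decouple the latent representations of environment dynamics and task goals, enabling compositional reward transfer; on continuous control benchmarks it improves sample efficiency and reward recovery in few-shot transfer.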

Abstract

Reward transfer in Inverse Reinforcement Learning (IRL) is unreliable when policies must generalize to unseen combinations of environment dynamics and task goals. We propose Factorized Contrastive Abstractions for Transferable IRL (ConTraIRL), a framework that enables compositional reward transfer by learning decoupled latent representations of these two factors. ConTraIRL uses a dual-encoder architecture that maps observations into separate dynamics and goal latent spaces, trained with a dual contrastive objective. Temporal alignment encourages the dynamics encoder to learn goal-invariant structure, while the goal encoder captures dynamics-invariant features. This factorization supports reward inference under recombined dynamics-goal settings. Experiments on continuous control benchmarks demonstrate effective few-shot transfer to unseen dynamics-goal pairings, improving sample efficiency and reward recovery over transfer IRL baselines.

2606.03014 2026-06-03 cs.LG cs.AR

MOSAIC: Efficient Mixture-of-Agent Scheduling via Adaptive Aggregation and Inference Concurrency

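Saptarshi Mitra, Yifan Zhang, Rachid Karami, Phyo Pyae Moe Aung, Nazmul Takbir, Sreetama Sarkar, Souvik Kundu, Sitao Huang

AI summary: To address load imbalance in Mixture-of-Agents systems running on limited GPU resources, proposes MOSAIC, a framework built on an integer-linear-programming scheduler and confidence-aware adaptive aggregation, achieving up to 2.5x expert-stage, 4.23x aggregator-stage, and 1.7~2.3x end-to-end speedups with accuracy within 0.1 percentage points.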

Comments
13 pages, 8 main pages
Abstract

Mixture-of-Agents (MoA) systems improve reasoning accuracy by routing each query to multiple expert LLMs and aggregating their outputs. Efficiently executing this workload on limited GPU resources has bottlenecks. Skill-based routing creates skewed expert demand, and combining instruction-tuned LLMs with long-reasoning models results in extreme variability in generation lengths. Consequently, traditional scheduling strategies suffer from significant GPU idling and throughput collapse due to load imbalances. We present MOSAIC, a scheduling framework to accelerate MoA workloads. First, we formulate an Integer Linear Program (ILP) based scheduler that jointly optimizes expert placement and per-worker prompt assignment from offline-profiled costs, replicating reasoning experts across workers while pinning lightweight ones. Second, MOSAIC uses confidence-aware adaptive aggregation, leveraging inter-expert agreement to bypass the heavy final aggregator LLM for consensus queries. In our 4-GPU system, MOSAIC achieves up to 2.5x expert-stage, 4.23x aggregator-stage and 1.7~2.3x end-to-end speedups over the baseline scheduler, while matching accuracy within 0.1pp.
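
The confidence-aware aggregation step described above can be illustrated with a minimal sketch. It assumes that "inter-expert agreement" means the vote share of the most common expert answer and that a fixed threshold decides when to skip the aggregator; the names `aggregate`, `toy_aggregator`, and `agreement_threshold` are hypothetical and do not come from the paper.

```python
from collections import Counter

def aggregate(expert_answers, heavy_aggregator, agreement_threshold=0.75):
    """Return (answer, used_aggregator).

    Assumption: agreement is the share of experts giving the most common answer.
    When it reaches the threshold, the heavy aggregator call is skipped.
    """
    top_answer, top_votes = Counter(expert_answers).most_common(1)[0]
    agreement = top_votes / len(expert_answers)
    if agreement >= agreement_threshold:
        return top_answer, False
    return heavy_aggregator(expert_answers), True

def toy_aggregator(answers):
    # Stand-in for the final aggregator LLM: simply picks the longest answer.
    return max(answers, key=len)

consensus = aggregate(["42", "42", "42", "41"], toy_aggregator)
disputed = aggregate(["42", "41", "forty", "40"], toy_aggregator)
```

In this toy run the first query reaches consensus and bypasses the aggregator, while the second falls through to it. The real system would also need a way to normalise free-form answers before counting votes.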

2606.03005 2026-06-03 cs.CV cs.AI

MUSE: A Unified Agentic Harness for MLLMs

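Jianglin Lu, Hailing Wang, Xu Ma, Qihua Dong, Mingyuan Zhang, Yizhou Wang, Yun Fu

AI summary: Proposes MUSE, a harness of composable modules (task representation, visual processing, perception tools, structured parsing, deterministic verification, and verifier-guided repair) that improves the performance of frozen multimodal large language models without retraining.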

Abstract

Despite rapid progress, multimodal large language models (MLLMs) still fail on tasks that humans solve effortlessly, such as navigating a grid maze from a screenshot or selecting the correct puzzle piece. Rather than retraining the model, we ask a complementary question: how much capability can be elicited from a frozen MLLM purely by improving the execution scaffold around it? We introduce MUSE, a multimodal unified structured execution harness that wraps any off-the-shelf MLLM with composable modules for task representation, visual processing, perception tool use, structured parsing, deterministic verification, and verifier-guided repair, without any model retraining. We evaluate MUSE across diverse benchmarks spanning visual spatial planning, visual perception, multimodal reasoning, and fine-grained visual discrimination, using multiple state-of-the-art MLLMs. MUSE delivers consistent gains over the bare model in all settings, with the largest jumps on challenging instances. Further analysis reveals that many MLLM failures arise from harness-level shortcomings rather than fundamental model deficits, and can be addressed through verifier-guided repair without touching the model. These findings highlight the agentic multimodal harness as a critical yet underexplored design dimension, offering an orthogonal avenue for improving MLLMs beyond model-centric optimization.

2606.03003 2026-06-03 cs.LG cs.AI cs.RO

Exact equivariance, kept through training, buys zero-shot generalisation across the symmetry group

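Hongbo Wang

AI summary: A latent world model built from an equivariant encoder and an equivariant predictor has a training loss with a provable symmetry, so fitting the dynamics on only a subset of orientations mathematically determines its behaviour on the entire orbit, giving zero-shot generalisation across the symmetry group.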

Comments
92 pages, 11 figures. Core paper plus an extended results-log appendix and a forward-looking theory supplement. All experiments are laptop-scale (CPU/MPS), fully seeded and deterministic
Abstract

A latent world model built from an equivariant encoder $E$ and an equivariant predictor $f$ inherits a provable symmetry of its training loss: when the world's dynamics genuinely carries a group $G$ acting on latents by an orthogonal representation $\rho(g)$, the one-step prediction relMSE is exactly invariant across the whole group, so fitting the dynamics on a restricted slice of orientations mathematically determines it on the entire orbit (jǔ yī fǎn sān). We verify this end-to-end at laptop scale (CPU/MPS, fully seeded). [A] The symmetry survives a real Muon/AdamW + EMA + VICReg run -- composed encode-then-predict residual $\sim 10^{-6}$ after optimisation, not just at initialisation, and under any optimiser. [B] One-step error is flat to five digits across the group, while a same-hypothesis-class non-equivariant baseline fits the slice but breaks out-of-distribution (VN $\times 1.00$ vs baseline $\times 13.8$ in 2D, $\times 17.2$ in 3D, $\times 157$ over the full $\mathrm{SE}(3)$ ladder), with the equivariant model $4.5$-$7.4\times$ smaller. [C] The same isometry argument lifts to closed loop: under a matching equivariant planner the control trajectory at orientation $g$ is exactly $\rho(g)$ applied to the seen one, so closed-loop error is invariant across the group -- float-floor-exact in 2D/$\mathrm{SO}(2)$ on real PushT and statistically flat in 3D/$\mathrm{SE}(3)$ (disjoint 95% CIs). We stress-test the prior against Sutton's Bitter Lesson: augmentation, brute-force scale, and soft-equivariance each close at most the across-group task metric, never the float-floor exactness. Because equivariance is closed under composition, the $H$-fold rollout stays flat ($\times 1.00$, $\le 2\times 10^{-7}$) at every horizon, while the baseline's residual compounds with $H$. Out of scope: task-success sweeps, planner-free invariance, and scaling.

2606.02998 2026-06-03 cs.LG eess.AS

CoughSense: Five-Class Respiratory Disease Classification via Whisper Encoder Fine-Tuning and Dual-Encoder Cross-Attention Fusion with Balanced Contrastive Learning

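Nikhil Vincent

AI summary: Proposes CoughSense, a system that combines Whisper encoder fine-tuning and dual-encoder cross-attention fusion with active-frame attention pooling and balanced contrastive learning to classify smartphone cough recordings into five respiratory classes (healthy, COVID-19, asthma or respiratory condition, bronchitis, and pneumonia) with high accuracy.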

Comments
26 pages, 3 figures
Abstract

Automated cough analysis offers a path to low-cost respiratory screening, but most existing work stops at binary COVID-19 detection. A practical tool needs to tell apart several respiratory conditions from one cough recording on a consumer smartphone. We present CoughSense, a system that sorts cough recordings into five classes. These are healthy, COVID-19, asthma or respiratory condition, bronchitis, and pneumonia. We aggregated 18,301 recordings from four public datasets (Coswara, CoughVID, Virufy, and the West China Hospital Pediatric Cough Dataset) and used the OpenAI Whisper encoder as a pretrained backbone for cough disease classification. The main contribution is active-frame QKV attention pooling, which restricts attention to the first 200 of 1500 encoder tokens. This avoids the silence-dilution problem that arises because a 3-second cough fills only 150 tokens of Whisper's 30-second input window. Other training parts handle the 19 to 1 class imbalance and the four-dataset domain shift. These include WeightedRandomSampler, SpecAugment, Balanced Mixup with forced minority pairing, a supervised contrastive auxiliary loss, FiLM symptom conditioning, and gradient-reversal domain adaptation. A dual-encoder model fuses Whisper with the OPERA-CT respiratory foundation model through cross-attention. CoughSense (Whisper-tiny, 8.6M parameters) reached 82.3 percent balanced accuracy on five-fold cross-validation (macro-F1 of 0.817, AUC of 0.941). It beat an ImageNet-pretrained EfficientNet-B2 by 11.1 points and a ViT trained from scratch by 29.6 points. All five classes passed 74 percent recall and four of five passed 80 percent. The dual-encoder model reached 85.4 percent balanced accuracy. Active-frame pooling is the largest single contributor across all ablation components at 5.1 points, which should help any short-audio task using Whisper as a backbone.
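
The silence-dilution argument can be checked with a small sketch. The token arithmetic follows the abstract (1500 encoder tokens for a 30-second window, so 50 tokens per second and 150 tokens for a 3-second cough). The pooling function is a simplified assumption: a single query vector with plain dot-product attention and no learned Q/K/V projections, so it is not the paper's layer, and the name `active_frame_pool` is hypothetical.

```python
import math

WHISPER_WINDOW_SECONDS = 30
ENCODER_TOKENS = 1500
TOKENS_PER_SECOND = ENCODER_TOKENS // WHISPER_WINDOW_SECONDS  # 50 tokens per second

def active_frame_pool(tokens, query, active_frames=200):
    """Softmax-attention pooling restricted to the first `active_frames` tokens."""
    active = tokens[:active_frames]
    scale = math.sqrt(len(query))
    scores = [sum(q * x for q, x in zip(query, tok)) / scale for tok in active]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [sum(w * tok[j] for w, tok in zip(weights, active))
              for j in range(len(query))]
    return pooled, weights

# Toy input: a 3-second "cough" (150 tokens) followed by silence (all-zero tokens).
tokens = [[1.0, 0.0]] * 150 + [[0.0, 0.0]] * 1350
query = [1.0, 0.0]
pooled_active, weights_active = active_frame_pool(tokens, query, active_frames=200)
pooled_full, _ = active_frame_pool(tokens, query, active_frames=len(tokens))
```

With this toy input the pooled cough feature is much closer to 1 when attention is limited to 200 tokens than when it is spread over all 1500, which is the dilution effect the abstract describes.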

2606.02996 2026-06-03 cs.RO cs.CV cs.HC

MARIO: Motion-Augmented Real-Time Multi-Sensor Inertial Odometry

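Yiquan Li, Taeyoung Yeon, Chenfeng Gao, Vasco Xu, Xuanyou Liu, Karan Ahuja

AI summary: Proposes MARIO, a framework that constrains motion dynamics with a learned IMU-inferred human pose prior and adds multi-sensor fusion (magnetometer, barometer, and secondary IMU); on the Nymeria dataset it reduces positional drift by up to 36% with the pose prior and by up to 42% with sensor fusion, giving accurate and robust inertial odometry for camera-less human tracking.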

Comments
CVPR 2026 Findings
Abstract

Inertial odometry (IO) using only Inertial Measurement Units (IMUs) provides a lightweight solution for human motion tracking in augmented reality (AR) and wearable devices. Recent learning-based IO methods have improved the generalizability of inertial localization through large-scale pretraining on human motion datasets. However, these approaches remain prone to drift and noise because they do not explicitly capture human motion dynamics, especially on daily activity datasets such as Nymeria. In this work, we propose to ground inertial odometry in human kinematics through a learned IMU-inferred pose prior, which promotes physically consistent motion constraints. We integrate this pose prior into existing IO architectures and reduce positional drift by up to 36% on the challenging Nymeria dataset, which is 5x larger than datasets used in prior work. We further improve long-term performance with a sensor-fusion framework that incorporates auxiliary signals from lightweight sensors already available on commercial AR glasses, including magnetometers, barometers, and secondary IMUs. With this fusion strategy, positional drift is reduced by up to 42%, improving robustness and generalization across diverse motion conditions. Together, our results introduce a new paradigm for inertial and lightweight odometry by unifying human motion kinematics with multimodal sensing, setting a new benchmark for accurate and robust camera-less human tracking. Our website is available at https://spice-lab.org/projects/MARIO/.

2606.02995 2026-06-03 cs.CR cs.AI cs.IR cs.LG

Patcher: Post-Hoc Patching of Backdoored Large Language Models

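Anjun Gao, Yueyang Quan, Yufei Xia, Zhuqing Liu, Minghong Fang

AI summary: Proposes Patcher, a framework that uses only a single failure case and the model parameters to localize backdoor triggers through gradient-based saliency, then removes the trigger-response association with constrained fine-tuning while preserving model utility.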

Comments
To appear in the USENIX Security Symposium, 2026
Abstract

Large language models remain vulnerable to jailbreak backdoor attacks, where adversaries poison safety alignment data to embed hidden triggers that bypass safety mechanisms. Existing defenses often require comprehensive attack information or multiple triggered examples, making them impractical when defenders only observe a single reported failure case without knowing whether it stems from a backdoor attack or a natural alignment bug. This paper presents Patcher, a post-hoc defense framework that repairs backdoored language models using only a single reported failure case and the model parameters. Patcher operates in two stages. First, it localizes backdoor triggers by computing response-conditioned gradient-based saliency scores and applying adaptive clustering to separate triggers from benign context. Second, it patches the model through a constrained fine-tuning objective that breaks the trigger-response association while preserving benign-task utility and robustness to non-triggered jailbreak attacks through KL-divergence constraints. We conduct extensive evaluations across multiple backdoor attack strategies and demonstrate that Patcher successfully localizes triggers and neutralizes backdoors while maintaining model utility. We further show robustness against adaptive attacks designed to evade our defense. This work represents a significant step toward practical defenses against training-time attacks in deployed language models.

2606.02994 2026-06-03 cs.AI cs.CL

Inducing Reasoning Primitives from Agent Traces

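Zhihan Lei, Jiarui Yan, Joshua Momo, William W. Cohen

AI summary: Proposes Reasoning Primitive Induction, which mines and clusters recurrent reasoning steps from ReAct agent traces to build a library of pseudo-tools, substantially improving performance on several reasoning tasks.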

Comments
22 pages including appendices
Abstract

ReAct-style LLM agents often rediscover the same reasoning routines across problems, yet leave those routines trapped in transient scratchpads. We introduce Reasoning Primitive Induction, a single-pass method that mines successful ReAct traces, clusters recurrent reasoning moves, and converts the most frequent moves into a compact library of typed pseudo-tools. Each pseudo-tool is specified by a natural-language docstring interpreted by an LLM at invocation time, and a standard ReAct loop composes these primitives at test time. The central result is that induced libraries outperform the very agent that generated their traces: by +44pp on RuleArena NBA (30 -> 74), +30pp on MuSR team allocation (38 -> 68), and +22pp on NatPlan meeting planning (7 -> 29). Across five comparable subtasks spanning narrative deduction, rule application, and constraint-satisfaction planning, a single fixed configuration improves over zero-shot Chain-of-Thought on every subtask, matches or surpasses expert-authored decompositions, and outperforms AWM at lower average inference cost.

2606.02993 2026-06-03 cs.LG math.OC math.RT math.ST stat.ML stat.TH

Neural Networks Provably Learn Spectral Representations for Group Composition

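Jianliang He, Leda Wang, Fengzhuo Zhang, Siyu Chen, Zhuoran Yang

AI summary: By lifting the projected gradient flow to the Fourier domain, proves that each neuron of a two-layer neural network trained on the group composition task converges almost surely to a single irreducible representation, and gives a representation-theoretic account of feature learning and a low-rank compression phenomenon.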

Abstract

Understanding how structured internal structure emerges during neural network training is central to the study of deep learning. We investigate this phenomenon through the group composition task, where a two-layer neural network is trained to predict $g_1 \star g_2$ for elements of a finite group $G$. By lifting the projected gradient flow to the Fourier domain, we demonstrate that the training dynamics are governed by a Riemannian gradient ascent on a representation-theoretic energy functional. We prove that, under random initialization, this flow drives each neuron to converge almost surely toward a single irreducible representation, while the cross-layer Fourier coefficients achieve a rotational rank-one alignment. This framework provides a representation-theoretic account of feature learning and characterizes a novel low-rank compression phenomenon for matrix-valued group representations. Moreover, for Abelian groups, we provide a complete population-level description: random initialization promotes uniform diversification across nontrivial representations and induces Haar-uniform phases, jointly approximating the indicator via a majority-vote mechanism. We further prove that both phase alignment and representation competition emerge with exponential convergence rates.

2606.02991 2026-06-03 cs.CL cs.AI

Pretraining Language Models on Historical Text

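Xiaoxi Luo, Zachary Shinnick, Niclas Griesshaber, Yixuan Wang, Junchi Yu, Freda Shi, Philip Torr, Yao Lu

AI summary: Presents TypewriterLM, a 7.24B historical language model trained only on English text predating 1913; it addresses challenges in data quality, temporal leakage, training, and evaluation by building the TypewriterCorpus corpus, introducing a lexically grounded instruction tuning framework, and providing the History-Event benchmark suite.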

Abstract

We introduce TypewriterLM, a 7.24B History language model (LM) trained exclusively on English text predating 1913. Developing History LMs requires addressing challenges in data quality and availability, preventing temporal leakage, designing temporally consistent post-training pipelines, and constructing reliable evaluations. To address these issues, we construct TypewriterCorpus, a 54B-token historical corpus collected from diverse archival and linguistically annotated sources with extensive data cleaning and leakage mitigation procedures. Furthermore, we introduce lexically grounded instruction tuning, a post-training framework that constrains responses to remain directly grounded in historical source documents. Using this framework, we construct two historical instruction tuning datasets: History-LIMA and History-SelfInstruct. To evaluate capability and temporal consistency, we introduce History-Event, a benchmark suite for evaluating competence, temporal grounding, and data leakage. We release TypewriterLM and all associated resources to support future research on historical language models.

2606.02983 2026-06-03 cs.CL

A Locally Deployed RAG-Based Academic Advising System for Course Selection

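Feng Li, Yoritaka Iwata

AI summary: Proposes a locally deployed RAG-based academic advising system that uses large language models and retrieval over structured syllabus data to support course selection, prerequisite understanding, and personalized study planning in a privacy-preserving manner.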

Comments
To be published in Elsevier's Procedia Computer Science (KES 2026)
Abstract

Sequencing courses correctly, according to the prerequisite relationships between them, is of great importance for students to develop their knowledge and skills holistically. However, students who plan this sequence in isolation frequently struggle with cognitive limitations and information overload, which lead to confusion. At the same time, educational institutions have difficulty providing adequate academic advice on the correct sequence because of limited educational resources. To address these challenges, we propose a locally deployed RAG-based academic advising system grounded in syllabus information. By combining large language models with retrieval from structured syllabus data, the system is designed to support course selection, prerequisite understanding, and personalized study planning in a privacy-preserving manner.

2606.02982 2026-06-03 cs.PF cs.DC cs.LG

DriftSched: Adaptive QoS-Aware Scheduling under Runtime Token Drift for Multi-Tenant GPU Inference

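Kathiravan Palaniappan

AI summary: Proposes DriftSched, a framework that uses runtime feedback-driven drift compensation and adaptive bias correction to address scheduling problems caused by token drift in multi-tenant LLM inference; on NVIDIA L4 GPUs, bias correction reduces workload estimation error (MAE) by 38.8% on average, and SJF scheduling lowers median latency by about 42% relative to FIFO.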

Comments
17 pages, 22 figures, 7 tables
Abstract

The rapid growth of large language model (LLM) inference services has increased the demand for efficient multi-tenant GPU scheduling. While modern inference runtimes such as vLLM improve throughput through continuous batching and optimized memory management, accurately estimating the runtime cost of heterogeneous inference requests remains a significant challenge. In practice, observed output lengths often deviate from admission-time estimates, creating runtime token drift that can lead to workload misclassification, queue imbalance, increased tail latency, and degraded Quality-of-Service (QoS). This paper presents DriftSched, an adaptive QoS-aware scheduling framework for multi-tenant LLM inference serving on NVIDIA L4 GPUs. DriftSched combines workload classification, token-budget estimation, tenant-aware queue management, and runtime feedback-driven drift compensation to improve admission-time scheduling decisions. The framework evaluates FIFO, Priority, Weighted, Shortest-Job-First (SJF), and Aging Priority scheduling policies under heterogeneous multi-tenant workloads. Experimental results demonstrate measurable runtime token drift across workload categories. Adaptive bias correction reduces workload estimation error by an average of 38.8% (MAE) and 40.5% (RMSE), improving workload classification stability and scheduling accuracy. Among all evaluated schedulers, SJF achieves the best overall performance, reducing median end-to-end latency by approximately 42% and P99 latency by approximately 16% relative to FIFO under sustained GPU contention. The work contributes an adaptive drift-aware scheduling architecture, a runtime token-drift compensation mechanism, and a reproducible benchmarking framework for evaluating QoS-aware LLM inference scheduling on shared GPU infrastructure.
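
Two ideas in this abstract, SJF versus FIFO ordering and feedback-driven bias correction, can be sketched on a toy single-worker queue. This is an illustration under stated assumptions and not the paper's implementation: all jobs arrive at time zero, job cost equals output-token count, and the bias is the running mean of past estimation errors. The function names and numbers are hypothetical.

```python
from statistics import median

def completion_times(job_lengths):
    """Completion time of each job when jobs run back-to-back on one worker."""
    clock, finished = 0.0, []
    for length in job_lengths:
        clock += length
        finished.append(clock)
    return finished

def corrected_estimates(estimates, observed):
    """Shift each estimate by the running mean of earlier (observed - estimate) errors."""
    corrected, error_sum = [], 0.0
    for i, (est, obs) in enumerate(zip(estimates, observed)):
        bias = error_sum / i if i else 0.0  # no history for the first request
        corrected.append(est + bias)
        error_sum += obs - est
    return corrected

jobs = [800, 100, 300, 200]                 # hypothetical output-token counts
fifo = completion_times(jobs)               # arrival order
sjf = completion_times(sorted(jobs))        # shortest job first

estimates = [100, 100, 100]                 # admission-time token estimates
observed = [150, 150, 150]                  # tokens actually generated
corrected = corrected_estimates(estimates, observed)
```

Here the median completion time drops from 1050 under FIFO to 450 under SJF, and the corrected estimates converge on the observed length after one request. A real scheduler would sort by estimated rather than true length, which is why estimation drift matters for SJF.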

2606.02981 2026-06-03 cs.CL

Predicting Inference-Time Scaling Gains from Labeled Validation-Set Output Statistics

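Luyang Zhang, Jingyan Li

AI summary: Proposes a lightweight method based on labeled validation-set output statistics: three core features (prompt-level agreement spread, label-assisted first-correct-sample position, and completion-length variance) plus an entropy feature feed a ridge regression that predicts best-of-N inference scaling gains, reaching a Spearman correlation of ρ = 0.90.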

Abstract

Best-of-$N$ inference scaling (drawing $N$ candidate answers from a language model and returning the one a reward model ranks highest) improves accuracy by an amount that varies across models, but predicting that amount in advance currently requires running the procedure end-to-end. Prior work links cheap statistics of a model's sampled outputs and validation-set correctness (how often samples agree, how diverse they are, how confident the model is, and where correct samples appear) to model behavior, but does not isolate which of these form a stable, compact predictor of best-of-$N$ gain. We fit ridge predictors on features computed from a single labeled validation-set sampling pass, use bootstrap-Lasso as a stability analysis of the candidate feature set, and give a concentration analysis with an explicit linear-approximation residual. Across three base-model families, six post-training methods, and math and reasoning task domains, the stability analysis identifies a strict three-feature core spanning prompt-level agreement spread, label-assisted first-correct-sample position, and completion-length variance; a compact ridge predictor built from this core plus an entropy add-on reaches Spearman $\rho = 0.90$ with actual best-of-$N$ gain under a reward-model verifier. The intended use is labeled validation-set screening of candidate configurations before paying the full reward-model scoring cost.
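
To make the three core features concrete, the sketch below computes one plausible version of each from a single labeled sampling pass. The exact definitions are assumptions, since the abstract only names the features: agreement is taken as the vote share of the most common answer, its spread as the standard deviation across prompts, and a prompt with no correct sample is assigned position N + 1. All function names are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev, pvariance

def prompt_features(samples):
    """samples: list of (answer, is_correct, completion_length) for one prompt."""
    answers = [answer for answer, _, _ in samples]
    agreement = Counter(answers).most_common(1)[0][1] / len(answers)
    # 1-based index of the first correct sample; N + 1 if none is correct (assumption).
    first_correct = next(
        (i + 1 for i, (_, ok, _) in enumerate(samples) if ok), len(samples) + 1
    )
    length_variance = pvariance([length for _, _, length in samples])
    return agreement, first_correct, length_variance

def validation_features(per_prompt_samples):
    """Aggregate per-prompt statistics into one feature vector per configuration."""
    feats = [prompt_features(samples) for samples in per_prompt_samples]
    return {
        "agreement_spread": pstdev([f[0] for f in feats]),
        "mean_first_correct": mean([f[1] for f in feats]),
        "mean_length_variance": mean([f[2] for f in feats]),
    }

validation = [
    [("4", True, 10), ("4", True, 12), ("5", False, 30), ("4", True, 12)],
    [("7", False, 20), ("9", True, 20), ("8", False, 20), ("7", False, 20)],
]
features = validation_features(validation)
```

One such feature vector per model configuration would then be the input to a ridge regressor trained against measured best-of-$N$ gains; that fitting step is omitted here.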

2606.02980 2026-06-03 cs.SD cs.CY

A Training-Efficient Transformer-Based Anti-Spoofing Network for Logical Access in ASVspoof 5

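Sidan Yin, Bo Zhao

AI summary: For the ASVspoof 5 Track 1 closed condition, proposes TFPARN, a network that combines a focal classification loss with a pairwise ranking loss and uses a Transformer encoder with attention pooling for efficient anti-spoofing; it outperforms AASIST and RawNet2 on minDCF and EER, with lower inference memory and faster training.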

Comments
11 pages, 2 figures
英文摘要

Synthetic and manipulated speech can reduce the reliability of automatic speaker verification systems, so anti-spoofing methods need to be both accurate and efficient in training and inference. This paper focuses on the ASVspoof 5 Track 1 closed condition, where standard cross-entropy training may not give enough attention to hard trials and is not directly aligned with ranking- and threshold-based evaluation metrics. We propose TFPARN, a Transformer-based focal-pairwise attentive ranking network. The system extracts log-Mel features from speech, uses a Transformer encoder to model frame-level information, applies attention pooling to obtain utterance-level representations, and is trained with a combination of focal classification loss and pairwise ranking loss. RawBoost augmentation is used during training, and test-time augmentation is applied during evaluation to improve robustness. Compared with re-implemented AASIST and RawNet2 baselines under the same protocol, TFPARN achieves the best results, with a minDCF of 0.2430 and an EER of 12.52%. Ablation experiments further show that the pairwise loss, focal loss, and attention pooling all improve performance. TFPARN also uses the lowest inference memory among the compared systems, at 1.4 GB, runs at about 0.79 ms per utterance, and reaches its best checkpoint in less training time than AASIST. These results show that TFPARN provides a good balance between detection accuracy and computational cost for logical access anti-spoofing.
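
The combination of a focal classification loss and a pairwise ranking loss can be sketched as follows. The abstract does not give the exact loss forms or their weighting, so this sketch assumes the standard binary focal loss, a logistic pairwise loss over all (bona fide, spoof) pairs, and a hypothetical mixing weight `lam`; none of these choices is confirmed by the source.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def focal_loss(scores: Sequence[float], labels: Sequence[int], gamma: float = 2.0) -> float:
    """Binary focal loss: easy trials (p_t near 1) are down-weighted by (1 - p_t)^gamma."""
    total = 0.0
    for s, y in zip(scores, labels):
        p = sigmoid(s)
        p_t = p if y == 1 else 1.0 - p
        total += -((1.0 - p_t) ** gamma) * math.log(max(p_t, 1e-12))
    return total / len(scores)


def pairwise_ranking_loss(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Logistic loss on score differences: each positive should outscore each negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return 0.0  # no pairs can be formed in this batch
    total = sum(math.log1p(math.exp(-(sp - sn))) for sp in pos for sn in neg)
    return total / (len(pos) * len(neg))


def combined_loss(scores, labels, gamma: float = 2.0, lam: float = 1.0) -> float:
    """Focal term plus lam times the ranking term (lam is a hypothetical weight)."""
    return focal_loss(scores, labels, gamma) + lam * pairwise_ranking_loss(scores, labels)
```

The ranking term depends only on score differences, which is why it lines up more directly with threshold-free metrics such as EER than cross-entropy does.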

2606.02979 2026-06-03 cs.CV cs.AI cs.RO

Towards Compact Autonomous Driving Perception with Balanced Learning and Multi-sensor Fusion

面向紧凑型自动驾驶感知的平衡学习与多传感器融合

Oskar Natan, Jun Miura

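AI summary: Proposes a compact deep multi-task learning model that uses adaptive loss weighting and intermediate sensor fusion to perform semantic segmentation, depth estimation, LiDAR segmentation, and bird's-eye-view projection in a single forward pass, for efficient autonomous driving perception.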

详情
Comments
This work has been accepted for publication in IEEE Transactions on Intelligent Transportation Systems. https://ieeexplore.ieee.org/document/9712213
AI中文摘要

我们提出了一种新颖的紧凑型深度多任务学习模型,能够在一次前向传播中处理多种自动驾驶感知任务。该模型同时执行多视角语义分割、深度估计、激光雷达分割和鸟瞰投影,无需其他模型支持。我们还提供了一种自适应损失加权算法,以解决因任务众多而出现的学习不平衡问题。通过数据预处理和中间传感器融合技术,该模型可以处理并组合来自RGB摄像头、动态视觉传感器(DVS)和安装在自车多个位置的激光雷达的多种输入模态。因此,可以更好地理解动态变化的环境。基于消融研究,使用我们提出的方法训练的模型变体取得了更好的性能。此外,还进行了比较研究,以阐明其与一些近期模型组合相比的性能和有效性。结果表明,即使参数少得多,我们的模型仍能保持更好的性能。因此,该模型可以更快地推理,并减少GPU内存使用。此外,结果在3个不同的CARLA仿真数据集和1个真实世界的nuScenes-lidarseg数据集上保持一致。为了支持未来的研究,我们在以下网址公开共享代码和其他文件:https://github.com/oskarnatan/compact-perception。

英文摘要

We present a novel compact deep multi-task learning model that handles several autonomous driving perception tasks in one forward pass. The model simultaneously performs multi-view semantic segmentation, depth estimation, light detection and ranging (LiDAR) segmentation, and bird's eye view projection without support from other models. We also provide an adaptive loss weighting algorithm to tackle the imbalanced learning issue that arises from the large number of tasks. Through data pre-processing and intermediate sensor fusion techniques, the model can process and combine multiple input modalities retrieved from RGB cameras, dynamic vision sensors (DVS), and LiDAR placed at several positions on the ego vehicle, which yields a better understanding of a dynamically changing environment. In the ablation study, the model variant trained with our proposed method achieves better performance. A comparative study further clarifies its performance and effectiveness against combinations of some recent models: our model maintains better performance even with far fewer parameters, so it can run inference faster with less GPU memory. Moreover, the results tend to be consistent across 3 different CARLA simulation datasets and 1 real-world nuScenes-lidarseg dataset. To support future research, we share code and other files publicly at https://github.com/oskarnatan/compact-perception.

2606.02976 2026-06-03 cs.CL

Memory Retrieval for Changing Preferences

针对偏好变化的记忆检索

Yuehan Qin, Li Li, Linxin Song, Wei Yang, Jiate Li, Yuqing Yang, Yue Zhao

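AI summary: Proposes a unified Bayes-factor-based framework for memory access and selection in long-context dialogue systems, which quantifies how strongly each historical turn provides evidence about the user's latent preference state.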

详情
AI中文摘要

长上下文对话系统必须决定何时访问记忆以及交互历史的哪些部分是相关的。现有方法通常依赖启发式检索信号或始终开启的记忆使用,未能考虑用户偏好的变化性和潜在不一致性。在这项工作中,我们提出了一个基于偏好变化的记忆访问与选择统一框架。我们将个性化记忆检索表述为识别哪些历史轮次提供了关于用户潜在偏好状态的证据,而不是依赖表面语义相似性。为此,我们使用贝叶斯因子量化每个记忆轮次的效用,定义为当该轮次包含在上下文中时模型参考响应似然的改进。这提供了证据强度的原则性度量,以及用于记忆访问和选择的统一信号。通过将记忆检索视为效用估计,模型学会识别显著轮次并根据预期效用调节记忆使用。在四个异构记忆基准上的实验表明,我们的方法在需要建模偏好变化的长上下文、偏好密集型任务上优于现有的基于嵌入的检索,同时在语义相似性足够的低密度场景中保持竞争力。

英文摘要

Long-context dialogue systems must decide both when to access memory and which parts of the interaction history are relevant. Existing approaches typically rely on heuristic retrieval signals or always-on memory usage, failing to account for the changing and potentially inconsistent nature of user preferences. In this work, we propose a unified framework for memory access and selection based on changing preferences. We formulate personalized memory retrieval as identifying which historical turns provide evidence about a user's latent preference state, rather than relying on surface-level semantic similarity. To this end, we quantify the utility of each memory turn using a Bayes factor, defined as the improvement in the model's likelihood of the reference response when the turn is included in context. This provides a principled measure of evidence strength and a unified signal for both memory access and selection. By framing memory retrieval as utility estimation, the model learns to identify salient turns and regulate memory usage based on expected utility. Experiments on four heterogeneous memory benchmarks show that our approach outperforms existing embedding-based retrieval on long-context, preference-intensive tasks where modeling changing preferences is essential, while remaining competitive in low-density regimes where semantic similarity suffices.
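
The Bayes-factor utility described above, the gain in the likelihood of the reference response when a turn is added to the context, can be sketched as a log-likelihood difference. This is a minimal illustration under stated assumptions: `toy_log_lik` is a hypothetical word-overlap scorer standing in for a language model's log-likelihood, and the `threshold` and `k` parameters are illustrative choices that the abstract does not specify.

```python
from typing import Callable, List, Sequence

LogLik = Callable[[Sequence[str], str], float]  # (context turns, reference) -> log p


def log_bayes_factor(log_lik: LogLik, query: str, reference: str, turn: str) -> float:
    """log p(reference | turn + query) - log p(reference | query)."""
    return log_lik([turn, query], reference) - log_lik([query], reference)


def select_turns(
    log_lik: LogLik,
    history: Sequence[str],
    query: str,
    reference: str,
    threshold: float = 0.0,
    k: int = 3,
) -> List[int]:
    """Indices of up to k turns whose log Bayes factor exceeds the threshold.

    An empty result corresponds to not accessing memory at all.
    """
    scored = [
        (log_bayes_factor(log_lik, query, reference, turn), i)
        for i, turn in enumerate(history)
    ]
    kept = [(s, i) for s, i in scored if s > threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in kept[:k]]


def toy_log_lik(context: Sequence[str], reference: str) -> float:
    """Hypothetical scorer: minus the count of reference words missing from context."""
    seen = set(" ".join(context).lower().split())
    return -float(sum(word not in seen for word in reference.lower().split()))


history = [
    "I love spicy food",
    "The weather is nice",
    "Actually I now avoid spicy food",
]
chosen = select_turns(
    toy_log_lik, history, "Recommend a dinner", "mild pasta since you avoid spicy food"
)
```

Because the score needs the reference response, it is available only on labeled data; the abstract frames it as the signal from which the model learns to estimate utility, a step this sketch leaves out.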

2606.02974 2026-06-03 cs.AI cs.HC cs.LG

WISE-HAR: A Generalizable Ensemble Deep Learning Framework for WiFi-Based Human Activity Recognition

WISE-HAR:一种基于WiFi的人类活动识别的可泛化集成深度学习框架

Maheen Arshad, Qindeel E Zahra, Muhammad Khuram Shahzad

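AI summary: Proposes the WISE-HAR framework, which combines an ensemble of five CNN architectures, data augmentation, and cross-scenario evaluation; it reaches 94.87% test accuracy in the LOS setting on the Wallhack1.8k dataset and shows strong generalization.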

详情
Comments
8 pages, 5 figures
AI中文摘要

利用WiFi信号进行人类活动识别(HAR)已成为智能家居、医疗监控、安全系统和环境辅助生活的一项变革性技术。与引发严重隐私问题且在弱光条件下失效的传统基于摄像头的系统,或需要用户配合的可穿戴传感器不同,基于WiFi的HAR是非侵入性的、保护隐私的、成本效益高的,并且能在任何光照条件下无缝工作。本文提出了一种综合方法,使用Wallhack1.8k WiFi频谱图数据集识别三种不同的人类活动:“无人”(空房间)、“行走”和“行走+挥手”。我们提出了三项关键改进以应对基于WiFi的HAR的主要挑战。首先,为了解决高性能方差问题,我们实现了集成学习,采用五种不同的CNN架构(Deep CNN、Wide CNN、MobileNetV2、ResNet50V2和EfficientNetB0)。其次,为了解决小数据集大小的限制,我们应用了激进的数据增强技术,包括时间扭曲、频率掩蔽和噪声添加。第三,为了评估真实世界的泛化能力,我们进行了跨场景评估(在视距上训练,在非视距上测试)和跨天线评估(在双锥天线上训练,在PIFA天线上测试)。我们的集成模型在使用双锥天线的LOS场景下达到了94.87%的测试准确率,比最佳单个模型高出0.66%。数据增强将随机森林的性能从60%提升到95%。跨场景评估显示准确率下降极小,仅为1.37%和2.07%,证明了强大的泛化能力。结果表明,所提出的方法鲁棒、可靠,适用于不同硬件配置的多样化环境中的实际部署。

英文摘要

Human Activity Recognition (HAR) using WiFi signals has emerged as a transformative technology for smart homes, healthcare monitoring, security systems, and ambient assisted living. Unlike traditional camera-based systems that raise significant privacy concerns and fail in low-light conditions, or wearable sensors that require user compliance, WiFi-based HAR is non-intrusive, privacy-preserving, cost-effective, and works seamlessly in any lighting condition. This paper presents a comprehensive approach to recognize three distinct human activities: "No Presence" (empty room), "Walking", and "Walking + Arm-waving" using the Wallhack1.8k WiFi spectrogram dataset. We propose three key improvements to address the main challenges in WiFi-based HAR. First, to address high performance variance, we implement ensemble learning with five different CNN architectures (Deep CNN, Wide CNN, MobileNetV2, ResNet50V2, and EfficientNetB0). Second, to address the small dataset size limitation, we apply aggressive data augmentation techniques including time-warping, frequency masking, and noise addition. Third, to evaluate real-world generalization capability, we perform cross-scenario evaluation (training on Line-of-Sight and testing on Non-Line-of-Sight) and cross-antenna evaluation (training on Biquad antenna and testing on PIFA antenna). Our ensemble model achieved a test accuracy of 94.87% on the LOS scenario with Biquad antenna, outperforming the best individual model by 0.66%. Data augmentation improved Random Forest performance from 60% to 95%. Cross-scenario evaluation showed minimal accuracy drops of only 1.37% and 2.07%, demonstrating strong generalization capabilities. The results indicate that the proposed approach is robust, reliable, and suitable for real-world deployment in diverse environments with different hardware configurations.

2606.02973 2026-06-03 cs.CL

Chatbots Output Meaningful (but Problematic) Language

聊天机器人输出有意义(但有问题)的语言

Matthew Stone, Una Stojnić

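AI summary: Argues that the outputs of large language models (LLMs) are meaningful without any need to attribute mental states or intentions to them, and explores what this view implies for theories of language and for AI ethics.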

详情
Comments
49 pages
AI中文摘要

AI聊天机器人的话语有意义吗?具体来说,如果用户问Anthropic的智能体Claude:“西班牙的首都是什么?”Claude回答:“马德里是西班牙的首都。”这句话是否具有其通常的意义——并且表达了一个真实的命题?大多数普通用户以及AI工程师认为答案显然是“是”。然而,许多认知科学家、语言学家和语言哲学家认为,关于语言和意义的主流意向主义理论得出了相反的结论。因此,更同情普通用户直觉的理论家主张对语言进行激进的“去拟人化”,修正我们对心理状态、意图和语义内容的理解,以捕捉LLM输出有意义的直觉。我们采取不同的方法。虽然我们也认为LLM的输出是有意义的,但我们认为,适当的人类语言理论已经适用于当前的聊天机器人。意义是一个低门槛:声称LLM输出有意义并不需要假设心理状态、意图、理性或LLM中交流所需的认知能力——实际上,也不需要任何其他拟人化假设。人们确实有交流意图(通常是成功的),但即便如此,在人类中,语言产出也可能偏离说话者的想法。我们的观点对于我们应该如何理论化——并批判性地参与——人类语言输出和合成生成的文本具有重要影响。特别是,说聊天机器人产生有意义的文本绝不意味着认可它们的输出,或假设该技术是(或不是)好的、强大的、合适的或有用的。

英文摘要

Are utterances by AI chatbots meaningful? Concretely, if a user asks, say, Anthropic's agent Claude, "What is the capital of Spain?" and Claude answers, "Madrid is the capital of Spain," does that sentence have its ordinary meaning -- and does it express a true proposition? Most ordinary users, as well as AI engineers, take the answer to be trivially "yes." However, many cognitive scientists, linguists, and philosophers of language argue that dominant intentionalist accounts of language and meaning deliver the opposite conclusion. Theorists more sympathetic to ordinary users' intuitions have therefore advocated a radical "de-anthropomorphization" of language, revising our understanding of mental states, intentions, and semantic content to capture the intuition that the outputs of LLMs are meaningful. We take a different approach. While we, too, argue that LLM outputs are meaningful, we contend that a proper theory of human language already applies, as is, to current chatbots. Meaning is a low bar: claiming that LLM outputs are meaningful does not require positing mental states, intentions, rationality, or the cognitive capacities requisite for communication in LLMs -- or, indeed, making any other anthropomorphic assumptions. People do have communicative intentions (typically successful ones), but nevertheless, even in humans, language production can depart from what the speaker has in mind. Our view has important consequences for how we should theorize about -- and critically engage with -- both human linguistic output and synthetically generated text. In particular, to say that chatbots produce meaningful text is not by any means to endorse what they output, or to assume that the technology is (or is not) good, powerful, appropriate, or useful.

2606.02971 2026-06-03 cs.CL

EURO-5K: When Does Domain Pretraining Matter? Benchmarking Transformers for EU Reporting Obligation Extraction

EURO-5K:领域预训练何时重要?用于欧盟报告义务提取的Transformer基准测试

Marios Koniaris, Vasileios Kotronis, Eugenia Giannini, Panayiotis Tsanakas

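AI summary: Builds the EURO-5K dataset and compares discriminative and generative models on extracting EU reporting obligations; finds that domain pretraining helps markedly under parameter-efficient fine-tuning and that the models act as specialised extractors.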

详情
AI中文摘要

从欧盟立法中提取报告义务对于评估和减少监管报告负担至关重要。然而,区分报告要求与结构相似的条款需要专门的法律理解。当前的法律NLP方法缺乏具有明确指南和提取范式及领域适应策略比较评估的专门数据集。我们整理了EURO-5K,一个包含来自136项欧盟立法法案的句子级报告义务和具有挑战性的负例的语料库。在该数据集上,我们训练并比较了判别式标记分类模型(BERT风格)和生成式跨度提取模型(LLM),针对基线(基于模式和依赖关系的提取、少样本提示)评估了全微调和参数高效的QLoRA。结果表明,全微调的通用和法律BERT模型实现了相似的性能(0.89 F1),而微调的LLM在句子级提取上达到了编码器的准确度。法律预训练对生成式模型仅带来微小提升。相反,当适应能力受限时,法律预训练明显有益,因为参数高效微调的法律BERT优于其通用对应版本。学习曲线分析表明,法律预训练在数据极少时加速了早期学习。所有方法在大约3000个样本时收敛,之后收益递减,验证了数据集的充分性。在两个外部监管语料库上的跨数据集评估表明,我们的模型表现为专门的报告义务提取器,而非通用监管分类器。我们发布了EURO-5K、训练好的模型以及一个带有可解释性可视化和结构化RDF导出的交互式演示。这些表明,两种范式和参数高效训练为监管合规自动化提供了实用工具。

英文摘要

Extracting reporting obligations from EU legislation is critical for assessing and reducing regulatory reporting burden. However, distinguishing reporting requirements from structurally similar provisions requires specialised legal understanding. Current legal NLP methods lack specialised datasets with clear guidelines and comparative evaluation of extraction paradigms and domain adaptation strategies. We curate EURO-5K, a corpus of sentence-level reporting obligations and challenging negative examples from 136 EU legislative acts. On this dataset, we train and compare discriminative token-classification models (BERT-style) and generative span-extraction models (LLMs), evaluating both full fine-tuning and parameter-efficient QLoRA against baselines (pattern and dependency-based extraction, few-shot prompting). Results show that fully fine-tuned generic and legal BERT models achieve similar performance (0.89 F1), while fine-tuned LLMs match encoder accuracy for sentence-level extraction. Legal pretraining offers only small gains for generative models. In contrast, it is clearly beneficial when adaptation capacity is constrained, as parameter-efficient tuning of Legal-BERT outperforms its generic counterpart. Learning curve analysis demonstrates that legal pretraining accelerates early learning with minimal data. All approaches converge around 3K samples with diminishing returns thereafter, validating dataset sufficiency. Cross-dataset evaluation on two external regulatory corpora shows that our models behave as specialised reporting obligation extractors rather than generic regulatory classifiers. We release EURO-5K, trained models, and an interactive demo with explainability visualizations and structured RDF export. These demonstrate that both paradigms and parameter-efficient training provide practical tools for regulatory compliance automation.

2606.02969 2026-06-03 cs.RO math.OC

Hybrid Dynamics Modeling for a Flexible 2-DoF Robotic Arm

柔性2自由度机械臂的混合动力学建模

Maciek Popik, Daniel Yang, Mahdis Bisheban

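AI summary: To address unmodeled dynamics that rigid-body models miss, the paper models a flexible 2-DoF robotic arm with hybrid approaches that pair rigid-body dynamics with a Gaussian Mixture Model, alongside a purely data-driven regression, and compares the torque-prediction accuracy of the methods.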

详情
AI中文摘要

本文研究了三种对柔性连杆2自由度机械臂动力学进行建模的方法,以解决刚体模型无法捕获的未建模动力学。两种物理信息模型将刚体动力学(RBD)公式与高斯混合模型(GMM)相结合,以捕获残差模型误差和连杆柔性。一个基于运动学的回归模型作为纯数据驱动的基线。使用开源数据集,首先通过运动学特征的岭回归估计扭矩预测,而基于物理的基线则根据公布的规格构建,随后使用普通最小二乘回归直接从数据估计相同的参数集。结果表明,基于物理的参数精度最差,而正则化和最小二乘估计器与实测扭矩更吻合。残差分析和误差指标凸显了纯参数模型在柔性连杆系统中的局限性,并强调了正则化和数据驱动辨识的价值,支持了半参数残差学习方法的发展。

英文摘要

This paper examines three approaches for modeling the dynamics of a flexible-link 2-DoF robotic arm to address unmodeled dynamics not captured by rigid-body models. Two physics-informed models combine rigid-body dynamics (RBD) formulations with a Gaussian Mixture Model (GMM) to capture residual model errors and linkage flexibility. A kinematics-based regression model serves as a purely data-driven baseline. Using an open-source dataset, torque predictions are first estimated using Ridge regression on kinematic features, while the physics-based baseline is constructed from published specifications, and ordinary least-squares regression is subsequently used to estimate the same parameter set directly from data. Results show that the physics-based parameters yield the poorest accuracy, while regularized and least-squares estimators align more closely with measured torques. Residual analysis and error metrics highlight the limitations of purely parametric models for flexible-link systems and underscore the value of regularization and data-driven identification, supporting developments of semi-parametric residual learning methods.

2606.02967 2026-06-03 cs.ET cs.AI cs.AR cs.SY eess.SY

Glass Box at Orbit: A Constitutional AI Verification Framework for Trustworthy Autonomous CubeSat Intelligence

轨道上的玻璃盒:面向可信自主立方星智能的宪法AI验证框架

Karthik Barma, Anil Sanneboyina, V C Premchand Yadav

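AI summary: Proposes Glass Box, a runtime constitutional AI verification layer that intercepts autonomous spacecraft decisions and checks them against six physics-grounded constraints and seven Linear Temporal Logic safety invariants, and proves that the verification overhead is independent of model size.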

详情
Comments
12 pages, 2 figures, 2 tables, 32 references. Paper 1 of the Project October series on autonomous orbital intelligence
AI中文摘要

航天工业正在悄然构建一个尚未被充分认识的事物:在地球上空550公里处运行数千个自主AI工作负载的轨道数据中心,且无人类参与。微软、AWS以及越来越多的轨道计算企业正在将云规模处理从地面转移到轨道。然而,它们都尚未回答治理问题——当轨道数据中心规模的自主AI系统在太空中做出错误决策时,如何在决策变得不可逆转之前阻止它们?我们引入玻璃盒:一个运行时宪法AI验证层,在单个命令到达任何航天器子系统之前,拦截来自机载AI策略的每个候选动作,并根据六项基于物理的宪法约束和七项线性时序逻辑(LTL)安全不变式对其进行评估。每个批准的动作都附带一个加权可解释性分数E(a_t)(范围[0,1])和完整的宪法审计日志。我们在Project October中演示了玻璃盒:一个针对CubeSat级航天器的完全模拟的五层自主轨道智能架构。我们证明玻璃盒的验证开销为O(N_c),其中N_c是宪法规则的数量,与模型大小或航天器状态维度无关。我们提供了宪法约束语法的完整形式规范、通过Z3和NuSMV模型检查验证的七项LTL安全不变式,以及一个详细的工作示例,展示玻璃盒在电池状态退化的日食入口处拦截不安全推理请求。随着轨道计算向数据中心基础设施规模发展,运行时宪法验证不再是研究上的新奇事物——它是每个自主轨道平台最终将需要的任务关键型安全基础设施。

英文摘要

The space industry is quietly building toward something nobody has fully reckoned with: orbital data centers running thousands of autonomous AI workloads with no human in the loop, 550 km above the Earth. Microsoft, AWS, and a growing list of orbital computing ventures are moving cloud-scale processing off the ground and into orbit. What none of them have answered yet is the governance question -- when autonomous AI systems at orbital data center scale make wrong decisions in space, what stops those decisions before they become irreversible? We introduce Glass Box: a runtime constitutional AI verification layer that intercepts every candidate action from an onboard AI policy and evaluates it against six physics-grounded constitutional constraints and seven Linear Temporal Logic (LTL) safety invariants before a single command reaches any spacecraft subsystem. Every approved action carries a weighted explainability score E(a_t) in [0,1] and a complete constitutional audit log. We demonstrate Glass Box within Project October: a fully simulated five-layer autonomous orbital intelligence architecture for CubeSat-class spacecraft. We prove that Glass Box verification overhead is O(N_c) in the number of constitutional rules, independent of model size or spacecraft state dimension. We present a complete formal specification of the constitutional constraint grammar, seven LTL safety invariants verified by Z3 and NuSMV model checking, and a detailed worked example of Glass Box intercepting an unsafe inference request at eclipse-entry under degraded battery state. As orbital computing scales toward data center infrastructure, runtime constitutional verification is no longer a research novelty -- it is mission-critical safety infrastructure that every autonomous orbital platform will eventually require.
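
The interception step described above, checking each candidate action against a fixed rule set before any command is issued, can be sketched as a single pass over the rules. This is an illustration only: the two rules, their thresholds, and the state and action field names are hypothetical, and the sketch models neither the paper's constraint grammar, its LTL invariants, nor the explainability score $E(a_t)$.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = Dict[str, float]
Action = Dict[str, float]


@dataclass(frozen=True)
class Rule:
    name: str
    check: Callable[[State, Action], bool]  # True means the rule is satisfied


def verify(state: State, action: Action, rules: List[Rule]) -> Tuple[bool, List[Tuple[str, bool]]]:
    """Evaluate every rule once and return (approved, audit log).

    One pass over the rules gives cost linear in len(rules), assuming each
    check is constant time, regardless of how large the policy model is.
    """
    audit = [(rule.name, rule.check(state, action)) for rule in rules]
    approved = all(ok for _, ok in audit)
    return approved, audit


# Hypothetical rules and thresholds, for illustration only.
RULES = [
    Rule("battery_reserve", lambda s, a: s["battery_soc"] - a["energy_cost"] >= 0.30),
    Rule("eclipse_power_cap", lambda s, a: s["in_eclipse"] == 0.0 or a["power_w"] <= 5.0),
]

# An inference request at eclipse entry with a degraded battery.
state = {"battery_soc": 0.35, "in_eclipse": 1.0}
inference_request = {"energy_cost": 0.10, "power_w": 12.0}
approved, audit = verify(state, inference_request, RULES)
```

Evaluating every rule, rather than stopping at the first failure, keeps the audit log complete for each decision.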

2606.02965 2026-06-03 cs.AI

What Benchmarks Don't Measure: The Case for Evaluating Abstention Competence in Autonomous Agents

基准测试无法衡量的:论自主智能体弃权能力的评估

Victor Ojewale, Suresh Venkatasubramanian

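AI summary: Argues that autonomous-agent benchmarks neglect abstention competence, introduces the concept of compliance bias along with a taxonomy of abstention scenarios and evaluation protocols, and reports experiments showing that the safety-usability tradeoff is tunable.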

详情
Comments
ACM CAIS 2026: RLEval Workshop Oral Presentation (Best Paper Award)
AI中文摘要

自主智能体的基准测试衡量智能体是否完成任务,然而这种框架系统地忽略了智能体是否应该继续执行任务。在人类反馈目标下训练的智能体形成了一种结构性倾向,即使缺乏安全行动所需的输入、证据或授权也会继续执行,我们将这种倾向称为合规偏差,因为奖励信号和基准测试评分机制都将继续执行视为正确的默认行为,无论安全行动的前提条件是否满足。我们做出三项贡献。首先,我们表明合规偏差源于人类反馈流程中的奖励黑客行为,并因主流智能体基准测试而根深蒂固,这些基准测试要么惩罚智能体的暂停,要么在架构上无法区分有原则的暂停和静默失败。然后,我们引入弃权合理场景的三缺口分类法,涵盖所需信息缺失的规范缺口、无法确认世界状态的验证缺口以及未获得明确授权的权威缺口,这些共同为构建弃权感知的智能体基准测试提供了原则性基础。最后,我们提出弃权评估协议(安全率、可用率和知情拒绝率),并报告了144个企业智能体场景和五个模型系列的初步结果,其中运行时强制弃权机制在授权场景下实现了高达89.2%的危险动作阻断和87.5%的可用性,表明安全-可用性权衡是可调的而非固有的,并且其形状在不同模型系列间差异显著。我们将此视为初步工作,并提供分类法和复合指标作为进一步讨论的起点。

英文摘要

Benchmarks for autonomous agents measure whether agents complete tasks, yet this framing is systematically blind to whether an agent should have proceeded at all. Agents trained under human-feedback objectives develop a structural tendency to proceed even when they lack the inputs, evidence, or authorization to act safely, a disposition we term compliance bias, because both the reward signal and the benchmark scoring regime treat proceeding as the correct default regardless of whether the preconditions for safe action are present. We make three contributions. We first show that compliance bias originates in reward hacking within human-feedback pipelines and is entrenched by prominent agent benchmarks, which either penalize agents for pausing or are architecturally unable to distinguish a principled pause from a silent failure. We then introduce a three-gap taxonomy of abstention-warranted scenarios, covering specification gaps where required information is absent, verification gaps where world state cannot be confirmed, and authority gaps where explicit authorization has not been given, which together provide a principled basis for constructing abstention-aware agent benchmarks. Finally, we propose abstention evaluation protocols (Safety Rate, Usability Rate, and Informed Refusal Rate) and report preliminary results across 144 enterprise agent scenarios and five model families, in which a runtime-enforced abstention mechanism achieves up to 89.2% hazardous-action blocking and 87.5% usability on authorized scenarios, demonstrating that the safety--usability tradeoff is tunable rather than inherent and that its shape varies substantially across model families. We treat this as preliminary work and offer the taxonomy and composite metrics as a starting point for further conversations.

2606.02964 2026-06-03 cs.AR cs.CL cs.LG

Multi-Segment Attention: Enabling Efficient KV-Cache Management for Faster Large Language Model Serving

多段注意力:实现高效KV缓存管理以加速大型语言模型服务

Chunan Shi, Yilei Chen, Yilin Chen, Xupeng Miao, Bin Cui

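AI summary: Proposes AsymCache, a computation-latency-aware KV-cache management system that uses Multi-Segment Attention, a cache eviction policy, and an adaptive chunking scheduler to substantially reduce TTFT and TPOT while keeping outputs lossless.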

详情
AI中文摘要

大型语言模型(LLM)推理依赖键值(KV)缓存以避免冗余的注意力计算。虽然近似KV缓存保留技术通过牺牲模型精度来减少内存使用,但无损方法则从GPU内存中驱逐KV缓存块并按需重建以保留精确输出。现有的无损KV缓存管理系统主要基于访问频率或位置启发式做出驱逐决策,而不考虑不同KV缓存块如何影响GPU注意力内核的执行效率。在本文中,我们提出了AsymCache,一种用于LLM推理的计算延迟感知KV缓存管理系统,它明确地将缓存驻留决策与GPU注意力内核性能对齐,包括三个关键组件:用于高效非连续KV上下文处理的多段注意力(MSA)、联合优化命中率和位置感知重计算成本的缓存驱逐策略,以及用于高硬件利用率的自适应分块调度器。实验表明,与最新基线相比,AsymCache将TTFT降低了高达1.90-2.03倍,每输出令牌时间(TPOT)降低了1.62-1.71倍,证实了该方法在常见工作负载中的有效性,并验证了其平衡计算效率与缓存命中率的设计目标。此外,AsymCache的低级设计允许无缝集成到诸如Continuum的代理服务系统中,进一步将平均作业延迟降低高达18.1%。

英文摘要

Large Language Model (LLM) inference relies on key-value (KV) caches to avoid redundant attention computation. While approximate KV cache retention techniques reduce memory usage by sacrificing model accuracy, lossless approaches instead evict KV cache blocks from GPU memory and reconstruct them on demand to preserve exact outputs. Existing lossless KV cache management systems primarily base eviction decisions on access frequency or positional heuristics, without considering how different KV cache blocks affect the execution efficiency of GPU attention kernels. In this paper, we propose AsymCache, a computation-latency-aware KV cache management system for LLM inference that explicitly aligns cache residency decisions with GPU attention kernel performance. It has three key components: Multi-Segment Attention (MSA) for efficient non-contiguous KV context processing, a cache eviction policy that jointly optimizes hit rate and position-aware recomputation cost, and an adaptive chunking scheduler for high hardware utilization. Experiments show that AsymCache reduces time-to-first-token (TTFT) by up to 1.90-2.03x and time-per-output-token (TPOT) by 1.62-1.71x over the latest baselines, confirming the effectiveness of the method in common workloads and validating its design goal of balancing computational efficiency with cache hit rate. Moreover, the low-level design of AsymCache allows seamless integration into agent serving systems such as Continuum, where it further reduces average job latency by up to 18.1%.

2606.02963 2026-06-03 cs.LG

KForge: LLM-Driven Cross-Platform Kernel Generation for AI Accelerators

KForge:面向AI加速器的LLM驱动跨平台内核生成

Taras Sereda, Burak Bartan, Ankita Nayak, Tom St. John, Natalie Serrino, Zain Asgar

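AI summary: Proposes KForge, a framework in which two collaborating LLM agents (a generation agent and a performance-analysis agent) iteratively refine kernels to generate high-performance cross-platform kernels automatically; it achieves a 2.12% throughput improvement on NVIDIA B200 and a 5.13x geometric-mean speedup on Intel Arc B580.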

详情
Comments
Accepted at ISCA 2026 Workshop MLArchSys
AI中文摘要

生产推理越来越多地针对异构加速器组合。智能体管道交织推理、工具调用和多智能体协调,每个阶段具有不同的计算和内存特征。为达到最优效率,每个阶段应在最适合的加速器上运行。这带来了系统挑战:每个管道现在需要在越来越多的硬件后端和编程模型上生成高性能内核。手工编写这些内核耗时、需要深厚的底层专业知识,并且随着内核复杂性增长而难以扩展。最近,大型语言模型(LLMs)已被用于自动内核生成,但在低级代码生成和跨后端泛化方面仍存在挑战。我们提出KForge,一个跨平台框架,围绕由两个协作的基于LLM的代理驱动的迭代优化循环构建:生成代理,使用编译和正确性反馈生成并逐步优化内核;性能分析代理,解释从编程API到基于GUI的工具的性能数据,并发出指导下一轮合成的建议。该循环在功能传递(驱动候选达到正确性)和优化传递(缩小与手工调优基线的性能差距)之间交替。我们在两个基线参考可用性差异很大的后端上评估KForge。在NVIDIA B200上,KForge在gpt-oss-20b推理速度基准上相比TensorRT-LLM实现了2.12%的端到端吞吐量提升。在Intel Arc B580上,KForge生成的Triton内核在KernelBench Level 2的37个GEMM+尾部操作工作负载上,通过算子融合和混合精度执行,实现了比PyTorch eager和torch.compile中较快者5.13倍的几何平均加速。

英文摘要

Production inference increasingly targets a heterogeneous mix of accelerators. Agentic pipelines interleave reasoning, tool calls, and multi-agent coordination, each with distinct compute and memory profiles. For optimal efficiency, each stage should run on the accelerator best suited to it. This creates a systems challenge: each pipeline now requires high-performance kernels across a growing set of hardware backends and programming models. Writing these kernels by hand is time-consuming, demands deep low-level expertise, and does not scale as kernel complexity grows. Recently, Large Language Models (LLMs) have been leveraged for automatic kernel generation, but challenges in low-level code generation and cross-backend generalization persist. We present KForge, a cross-platform framework built around an iterative refinement loop driven by two collaborating LLM-based agents: a generation agent that produces and progressively refines kernels using compilation and correctness feedback, and a performance-analysis agent that interprets profiling data, from programmatic APIs to GUI-based tools, and emits recommendations that steer the next round of synthesis. The loop alternates between functional passes, which drive a candidate to correctness, and optimization passes, which close the performance gap to hand-tuned baselines. We evaluate KForge on two backends with very different baseline reference availability. On NVIDIA B200, KForge achieves a 2.12$\%$ improvement in end-to-end throughput compared to TensorRT-LLM on the gpt-oss-20b inference speed benchmark. On Intel Arc B580, KForge generates Triton kernels achieving a 5.13$\times$ geometric mean speedup over the faster of PyTorch eager and torch.compile on 37 GEMM + tail-ops workloads from KernelBench Level 2, primarily via operator fusion and mixed-precision execution.

2606.02962 2026-06-03 cs.CV cs.AI cs.HC eess.IV

Hand Trajectory Fusion for Egocentric Natural Language Query Grounding

面向自我中心自然语言查询定位的手部轨迹融合

Enmin Zhong, Carlos R. del-Blanco, Fernando Jaureguizar, Narciso García

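AI summary: For natural language query grounding in egocentric video, proposes a hand-trajectory encoder fused through cross-attention with adaptive gating, using hand-motion information to improve query grounding.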

详情
Comments
Accepted for the poster session at the Egocentric Vision (EgoVis) Workshop in Conjunction with CVPR 2026
AI中文摘要

自我中心自然语言查询(NLQ)定位要求模型在长第一人称视频中定位回答自由形式文本查询的时间区间。现有方法融合视频外观与查询,但忽略了手部运动,尽管大约41%的Ego4D NLQ查询是在手-物交互或其后立即发生的时刻回答的。我们提出了一种手部轨迹编码器,用于将手部骨骼序列转换为高语义的手部运动学特征,然后通过具有自适应门控的交叉注意力融合策略,将这些特征与预训练的视频-文本特征对齐并组合。在Ego4D NLQ v2验证集上,手-物交互查询(R1@IoU=0.3提升2.54)和数量/状态查询(R1@IoU=0.3提升4.32)的增益最为明显,表明手部轨迹提供了超越外观的定位线索。

英文摘要

Egocentric Natural Language Query (NLQ) grounding asks a model to localize, in a long first-person video, the temporal interval that answers a free-form text query. Existing methods fuse video appearance with the query but ignore hand motion, despite the fact that roughly 41% of Ego4D NLQ queries are answered at a moment of hand--object manipulation or its immediate outcome. We propose a hand-trajectory encoder that converts a sequence of hand skeletons into highly semantic hand kinematic features, which are then aligned and combined with pretrained video--text features through a cross-attention fusion strategy with adaptive gating. On the Ego4D NLQ v2 validation split, the clearest gains appear for Hand-Object Interaction queries (+2.54 R1@IoU=0.3) and Quantity/State queries (+4.32 R1@IoU=0.3), indicating that hand trajectory provides grounding cues beyond appearance alone.

2606.02959 2026-06-03 cs.LG cs.CR

Gate AI: LLM Security Benchmark Evaluation Methodology and Results

Gate AI:大语言模型安全基准评估方法与结果

Ryle Goehausen, Marcus Sousa

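AI summary: To address per-dataset threshold tuning and undisclosed operating points in evaluations of prompt-injection and jailbreak detectors, proposes an evaluation harness that uses 5-fold cross-validation, a single global operating point, and a battery of generalization diagnostics, tested on 16 public benchmarks.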

详情
Comments
17 pages, 23 figures, 2 tables. Working preprint; subsequent versions may update benchmark numbers as the framework evolves
AI中文摘要

已发布的大语言模型提示注入和越狱检测器评估通常存在两个系统性弱点:每个数据集单独调整阈值以及未公开的操作点。我们描述了一种解决这两个问题的评估框架。被评估的检测器在16个公共基准(12,111个样本)上使用5折交叉验证进行评分。主要流程采用StratifiedKFold(按行);同时,并行运行StratifiedGroupKFold流程,基于复合键(父提示ID加上Jaccard $\gtrsim 0.8$的MinHash + LSH近重复聚类)作为泄漏溢价诊断。在保留的折上选择一个全局操作点(在FPR $\leq 1\%$条件下最大化F1),并统一应用于每个数据集,因此每个数据集的结果反映一个阈值,而非每个基准的优化。通过一系列诊断检查泛化能力(留一数据集交叉验证、随机标签对照、对抗验证、排列特征重要性、长度偏差相关性、分类器头部一致性、跨源近重复检测、阈值可迁移性、训练集与OOF一致性以及释义不变性探测),其中大多数具有定量通过阈值,其余则说明失败模式。对于每次外部比较,检测器的阈值根据竞争对手公布的假阳性率重新调整,以便在匹配的操作点上评估对比值。

英文摘要

Published evaluations of prompt-injection and jailbreak detectors for Large Language Models often suffer from two systematic weaknesses: per-dataset threshold tuning and undisclosed operating points. We describe an evaluation harness that addresses both. The detector under evaluation is scored across 16 public benchmarks (12,111 samples) using 5-fold cross-validation. StratifiedKFold (by row) is the headline pass; a parallel StratifiedGroupKFold pass over a composite key (parent-prompt id plus MinHash + LSH near-duplicate clusters at Jaccard $\gtrsim 0.8$) runs alongside it as a leakage-premium diagnostic. A single global operating point is selected on the held-out folds (max F1 subject to FPR $\leq 1\%$) and applied uniformly to every dataset, so per-dataset results reflect one threshold rather than per-benchmark optimisation. Generalisation is examined through a battery of diagnostics (leave-one-dataset-out cross-validation, a random-label control, adversarial validation, permutation feature importance, length-bias correlation, classifier-head agreement, cross-source near-duplicate detection, threshold transferability, train-vs-OOF agreement, and a paraphrase-invariance probe), most with a quantitative pass threshold and the remainder with a stated failure mode. For every external comparison, the detector's threshold is re-tuned to the competitor's published false-positive rate so head-to-head values are evaluated at matched operating points.
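
The global operating-point rule described above (maximize F1 subject to FPR $\leq 1\%$) can be sketched as a scan over candidate thresholds on pooled held-out scores. This is a minimal illustration and not the harness's code: the function name, the `score >= threshold` decision convention, and the brute-force scan are assumptions of the sketch.

```python
from typing import List, Optional, Tuple


def select_operating_point(
    scores: List[float], labels: List[int], max_fpr: float = 0.01
) -> Optional[Tuple[float, float]]:
    """Return (best F1, threshold) among thresholds with FPR <= max_fpr.

    A sample is flagged when score >= threshold. Returns None if no
    candidate threshold meets the FPR constraint.
    """
    negatives = sum(1 for y in labels if y == 0)
    best: Optional[Tuple[float, float]] = None
    for t in sorted(set(scores)):
        tp = fp = fn = 0
        for s, y in zip(scores, labels):
            flagged = s >= t
            if flagged and y == 1:
                tp += 1
            elif flagged and y == 0:
                fp += 1
            elif not flagged and y == 1:
                fn += 1
        fpr = fp / negatives if negatives else 0.0
        if fpr > max_fpr:
            continue  # violates the false-positive budget
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if best is None or f1 > best[0]:
            best = (f1, t)
    return best


# Toy pooled out-of-fold scores (hypothetical values).
toy_scores = [0.1, 0.2, 0.3, 0.6, 0.7, 0.9]
toy_labels = [0, 0, 0, 1, 0, 1]
strict = select_operating_point(toy_scores, toy_labels, max_fpr=0.01)
loose = select_operating_point(toy_scores, toy_labels, max_fpr=0.25)
```

The selected threshold would then be applied unchanged to every dataset, so per-dataset numbers reflect one operating point. The scan is quadratic in the number of samples, which is acceptable for a sketch at the scale of about 12,000 samples; a sorted cumulative pass would be the usual optimization.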

2606.02958 2026-06-03 cs.CR cs.AI

Echelon: Auditable Aggregate-Only Language-Model Adaptation Across Privacy Boundaries

Echelon: 跨隐私边界的可审计聚合专用语言模型适配

Hina Dixit, Punit Kumar, Irene Tenison, Nevasini Sasikumar

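AI summary: Proposes Echelon, an architecture that enforces non-export of device-level model state as a systems invariant and permits only aggregated data to cross boundaries; combined with buffered semi-asynchronous secure aggregation, staleness-aware weighting, and related mechanisms, it trains stably with low communication overhead in 1B-parameter LoRA adaptation.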

详情
AI中文摘要

跨组织语言模型适配日益面临严格的治理约束:在许多部署中,设备级模型状态(参数、激活值、优化器状态及每设备更新)无法导出到管理边界之外。现有的分布式和联邦学习栈通常假设跨站点模型交换,然后改造隐私机制,这使合规性复杂化并导致审计脆弱。我们提出Echelon,一种边界优先的训练架构,将设备级模型状态不可导出作为系统不变量强制执行。设备在每个边界内本地训练;唯一的跨边界负载是安全聚合的边界级增量加上O(1)协调元数据,并通过具体的审计接口暴露。将交换限制为聚合改变了优化问题:系统必须在广域网延迟、异构参与、节点波动和非独立同分布数据下保持稳定,尽管全局层面从未看到每设备更新。Echelon结合了缓冲半异步安全聚合、陈旧感知加权、参与窗口、近端局部目标以及漂移感知外同步控制器。在M=2个边界上的1B参数LoRA适配中,预算匹配的竞赛(三个种子,24.88M tokens)达到验证损失3.887 +/-0.010,并在固定token、固定字节、固定挂钟时间和固定同步次数预算下,在调优的低通信基线中表现最佳或并列最佳。在OpenWebText压力测试中,Echelon在评估的广域网和非独立同分布处理下维持2,139-2,176 tokens/s的吞吐量;Echelon-DA在广域网延迟下相对于隐私对等的DiLoCo+SA基线改善了达到目标的时间,并且在200ms模拟延迟或严重非独立同分布分区下质量最多下降2.2%。

英文摘要

Cross-organization language-model adaptation increasingly faces hard governance constraints: in many deployments, device-level model state (parameters, activations, optimizer state, and per-device updates) cannot be exported outside an administrative boundary. Existing distributed and federated stacks typically assume cross-site model exchange and then retrofit privacy mechanisms, which complicates compliance and makes auditing brittle. We present Echelon, a boundary-first training architecture that enforces device-level model-state non-export as a systems invariant. Devices train locally inside each boundary; the only cross-boundary payloads are securely aggregated boundary-level deltas plus O(1) coordination metadata, exposed through a concrete audit surface. Restricting exchange to aggregates changes the optimization problem: the system must remain stable under WAN delay, heterogeneous participation, churn, and non-IID data even though the global plane never sees per-device updates. Echelon combines buffered semi-asynchronous secure aggregation, staleness-aware weighting, participation windows, proximal local objectives, and a drift-aware outer synchronization controller. In 1B-parameter LoRA adaptation across M = 2 boundaries, a budget-matched contest over three seeds (24.88M tokens) reaches validation loss 3.887 +/- 0.010 and is best or tied-best among tuned low-communication baselines under fixed-token, fixed-bytes, fixed-wall-clock, and fixed-sync-count budgets. In OpenWebText stress tests, Echelon sustains 2,139-2,176 tokens/s across evaluated WAN and non-IID treatments; Echelon-DA improves time-to-target under WAN latency relative to a privacy-parity DiLoCo+SA baseline; and quality degrades by at most 2.2% under 200 ms emulated latency or severe non-IID partitioning.

2606.02956 2026-06-03 cs.CV cs.LG cs.RO

The Road Ahead in Autonomous Driving: The KITScenes Multimodal Dataset

自动驾驶的未来之路:KITScenes多模态数据集

Richard Schwarzkopf, Fabian Immel, Alexander Blumberg, Jonas Merkert, Nils Rack, Kaiwen Wang, Fabian Konstantinidis, Julian Truetsch, Carlos Fernandez, Annika Bätz, Kevin Rösch, Marlon Steiner, Willi Poh, Yinzhe Shen, Royden Wagner, Felix Hauser, Dominik Strutz, Jaime Villa, Gleb Stepanov, Holger Caesar, Ömer Şahin Taş, Frank Bieder, Jan-Hendrik Pauls, Christoph Stiller

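AI summary: This paper presents the KITScenes Multimodal dataset, which uses high-fidelity sensors and complete HD maps to address the shortcomings of existing datasets in sensor fidelity, map completeness, and geographic diversity, and introduces four benchmarks to advance spatial learning.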

Details
Comments
28 pages, 21 figures
AI Chinese abstract

现有的自动驾驶数据集取得了重大进展,但在传感器保真度、地图完整性或地理多样性方面仍存在不足。我们提出了KITScenes多模态数据集,这是一个基于高保真传感器和地图构建的欧洲数据集。我们完全同步的传感器套件结合了高分辨率全局快门相机、超过400米的长距离激光雷达、4D成像雷达以及冗余的GNSS/INS定位。据我们所知,我们的HD地图是任何传感器数据集中最完整的,并通过开源软件上的自动驾驶试验进行了验证。首次在公共数据集中,所有与驾驶相关的交通元素(如交通灯)都以3D方式映射到重投影精确的水平,并具有完整的拓扑连接。我们的数据集记录在街道布局不规则且交通模式混合的城市中,通过拓宽可用的地理多样性来补充现有数据集。我们还引入了四个基准,每个基准都推动了具身AI的空间学习:在线HD地图构建、长距离深度估计、新颖视图合成和端到端驾驶。项目页面:https://kitscenes.com/

English abstract

Existing autonomous driving datasets have enabled major progress, but fall short in sensor fidelity, map completeness, or geographic diversity. We present KITScenes Multimodal, a European dataset built around high-fidelity sensors and maps. Our fully synchronized sensor suite combines high-resolution global-shutter cameras, long-range lidar beyond 400m, 4D imaging radar, and redundant GNSS/INS localization. Our HD maps are, to our knowledge, the most complete of any sensor dataset, validated through autonomous driving trials on open-source software. For the first time in a public dataset, all driving-relevant traffic elements, such as traffic lights, are mapped in 3D to a reprojection-accurate level with full topological connectivity. Recorded in cities with irregular street layouts and mixed traffic modes, our dataset complements existing datasets by broadening the available geographic diversity. We also introduce four benchmarks, each advancing spatial learning for embodied AI: online HD map construction, long-range depth estimation, novel view synthesis, and end-to-end driving. Project page: https://kitscenes.com/

2606.02955 2026-06-03 cs.CL cs.AI cs.LG

Fast-dLLM++: Fréchet Profile Decoding for Faster Diffusion LLM Inference

Fast-dLLM++: 用于更快扩散LLM推理的Fréchet轮廓解码

Siva Rajesh Kasa, Yasong Dai, Sumit Negi, Hongdong Li

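AI summary: To address the parallel token generation bottleneck in diffusion LLM inference, this paper proposes Fréchet profile decoding, which selects parallel commit sets using heterogeneous confidence profiles and raises throughput while leaving the model and cache unchanged.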

Details
Comments
Initial version accepted at Workshop on Structured Probabilistic Inference & Generative Modeling, ICML 2026
AI Chinese abstract

扩散大语言模型承诺并行令牌生成,但推理仍然受限于决定哪些掩码令牌可以安全地一起提交。Fast-dLLM通过KV缓存和置信度引导的并行解码解决了这个问题,但其解码理论使用同质高置信度假设,实际上将每个候选集简化为其最弱的选择令牌。我们认为这留下了速度提升空间,因为实际解码步骤表现出异构置信度轮廓。我们提出Fast-dLLM++,一种无需训练的扩展,引入了Fréchet轮廓解码:从完整的排序置信度轮廓中选择并行提交集,而不是单个最坏情况置信度。得到的规则是Fast-dLLM因子选择器的异构置信度泛化,在等置信度情况下精确恢复先前规则,并在所选令牌具有不均匀置信度时增加一个可证明的异构性奖励。Fast-dLLM++完全保持模型、扩散过程和缓存实现不变,使其成为现有Fast-dLLM解码的直接替代品。在GSM8K、MATH、HumanEval和MBPP上使用LLaDA-8B模型的实验表明,理论改进直接转化为经验收益:轮廓感知选择通过利用最弱令牌规则忽略的安全并行性改进了准确率-吞吐量前沿,在可比准确率下实现了高达37%的吞吐量提升。我们的匿名代码发布在 https://github.com/Ringo-Star/FastdLLM_plusplus。

English abstract

Diffusion large language models promise parallel token generation, yet inference remains bottlenecked by deciding which masked tokens can be safely committed together. Fast-dLLM addressed this with KV caching and confidence-guided parallel decoding, but its decoding theory uses a homogeneous high-confidence assumption that effectively reduces each candidate set to its weakest selected token. We argue that this leaves speed on the table because real decoding steps exhibit heterogeneous confidence profiles. We propose Fast-dLLM++, a training-free extension that introduces Fréchet profile decoding: selecting parallel commit sets from the full sorted confidence profile rather than a single worst-case confidence. The resulting rule is a heterogeneous-confidence generalization of Fast-dLLM's factor selector: it recovers the previous rule exactly in the equal-confidence case and adds a provable heterogeneity bonus when the selected tokens have uneven confidences. Fast-dLLM++ leaves the model, diffusion process, and cache implementation entirely unchanged, making it a drop-in replacement for existing Fast-dLLM decoding. Experiments on GSM8K, MATH, HumanEval, and MBPP with the LLaDA-8B model show that the theoretical improvement translates directly into empirical gains: profile-aware selection improves the accuracy-throughput frontier by exploiting safe parallelism that weakest-token rules miss, achieving up to 37% higher throughput at comparable accuracy. Our anonymous code release is at https://github.com/Ringo-Star/FastdLLM_plusplus.
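The contrast between a weakest-token rule and a profile-aware rule can be sketched as follows. The abstract does not state the paper's exact selection rule, so this is only an illustration under stated assumptions: it uses the Fréchet lower bound P(all correct) ≥ 1 − Σ(1 − c_i), and the function names, the `budget` parameter, and the choice to always commit at least one token are hypothetical.

```python
from typing import List, Sequence


def _by_confidence(confidences: Sequence[float]) -> List[int]:
    # Token positions sorted from most to least confident.
    return sorted(range(len(confidences)),
                  key=lambda i: confidences[i], reverse=True)


def weakest_token_commit(confidences: Sequence[float], budget: float) -> List[int]:
    """Assumed homogeneous rule: charge every token the weakest token's risk."""
    order = _by_confidence(confidences)
    n_commit = 1  # assumption: always commit the top token
    for n in range(1, len(order) + 1):
        weakest = confidences[order[n - 1]]
        if n * (1.0 - weakest) <= budget:
            n_commit = n
        else:
            break
    return order[:n_commit]


def profile_commit(confidences: Sequence[float], budget: float) -> List[int]:
    """Assumed profile rule: charge each token its own risk (Fréchet bound)."""
    order = _by_confidence(confidences)
    n_commit, risk = 1, 0.0
    for n, idx in enumerate(order, start=1):
        risk += 1.0 - confidences[idx]
        if risk <= budget:
            n_commit = n
        else:
            break
    return order[:n_commit]


uneven = [0.99, 0.98, 0.95, 0.70]
weak_set = weakest_token_commit(uneven, budget=0.1)
profile_set = profile_commit(uneven, budget=0.1)

equal = [0.97] * 5
same_a = weakest_token_commit(equal, budget=0.1)
same_b = profile_commit(equal, budget=0.1)
```

With uneven confidences the profile rule commits one more token than the weakest-token rule in this example, and with equal confidences the two rules agree, which matches the two properties the abstract describes.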