arXivDaily: daily arXiv digest, updated Monday through Friday
2606.16465 2026-06-16 cs.AI cs.CE New submission

When Agent Automation Becomes Profitable: Quantifying and Insuring Autonomous AI Risk through Trace-Economic Underwriting

Binyan Xu, Xilin Dai, Fan Yang, Kehuan Zhang

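Affiliations: The Chinese University of Hong Kong; University of Science and Technology of China

AI summary: Proposes trace-economic underwriting, which quantifies risk at the customer-task-trace episode level and transfers it to insurance so that autonomous AI deployment becomes economically acceptable; experiments show pricing error falling from $17.7K to $569 and a 72% reduction in CVaR95.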

Comments 26 pages, 14 figures, 29 tables

Abstract

AI agents can now take irreversible actions in operational systems, but agent-caused losses are still not clearly assigned, priced, or transferred. Providers often disclaim consequential damages, users are left with uncompensated losses, and default human review limits the efficiency gains of automation. We ask when autonomous AI deployment can become economically acceptable despite failure risk. Our answer is to quantify risk at the customer-task-trace episode level and transfer it through insurance. Automation is acceptable when its expected benefit exceeds the premium, control cost, and remaining risk. This requires a defined role with bounded permissions and comparable traces. We introduce trace-economic underwriting, which maps tool-use traces to customer exposure and claimable loss, then uses this representation for pricing, control, and risk transfer. It uses deterministic economic labels rather than an LLM judge. In our trace-to-loss testbed, trace-economic pricing reduces pricing MAE from $17.7K to $569 and removes regressive cross-subsidy. A 300-trace expert audit accepts 295 labels unchanged. On 1,000 real SWE-smith traces, trace-conditioned controls reduce CVaR95 by 72%. Theorem 1 gives a finite-sample scope condition. We release code, labels, and audit sheets.
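
The acceptance rule and the tail-risk metric named in this abstract can be illustrated with a short Python sketch. The function names, the dollar figures, and the simple empirical CVaR estimator are assumptions made for illustration; the abstract does not describe the authors' pricing model or code.

```python
def automation_acceptable(expected_benefit, premium, control_cost, residual_risk):
    """Rule stated in the abstract: benefit must exceed the three cost terms."""
    return expected_benefit > premium + control_cost + residual_risk


def cvar(losses, tail=0.05):
    """Empirical CVaR: mean of the worst `tail` fraction of losses.

    tail=0.05 corresponds to CVaR95 (an assumed, simple estimator).
    """
    k = max(1, round(tail * len(losses)))  # number of episodes in the tail
    worst = sorted(losses, reverse=True)[:k]
    return sum(worst) / k


# Hypothetical per-episode figures in dollars.
accept = automation_acceptable(1000, premium=300, control_cost=200, residual_risk=400)
reject = automation_acceptable(1000, premium=600, control_cost=200, residual_risk=400)
tail_loss = cvar(list(range(1, 101)))  # losses of $1 .. $100

print(accept, reject, tail_loss)
```

In this toy setting a control that trims the largest losses would lower `tail_loss` without necessarily changing the mean loss, which is why a tail metric is reported separately from pricing error.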

2606.16462 2026-06-16 cs.LG cs.AI New submission

Learning aligned EEG representations with subject-specific encoders

Bruna J. Lopes, Gabriel Schwartz, Sylvain Chevallier, Raphael Y. de Camargo, Bruno Aristimunha

Affiliations: University of São Paulo; Université Paris-Saclay, Inria TAU team, LISN-CNRS; Institut de neuromodulation, GHU Paris, psychiatrie et neurosciences, centre hospitalier Sainte-Anne, pôle hospitalo-universitaire 15, Université Paris Cité; Federal University of ABC (UFABC); Yneuro; Swartz Center for Computational Neuroscience (SCCN), Institute for Neural Computation (INC), University of California San Diego

AI summary: Replaces the shared encoder with subject-specific encoders followed by a common classifier to achieve cross-subject EEG alignment; experiments show that the approach internalises the role of Euclidean Alignment, increases class distinctiveness, and identifies encoder selection for unseen subjects as the main bottleneck.

Abstract

Cross-subject EEG decoding promises more training data, but it also exposes neural networks to strong inter-subject distribution shifts. We study whether task supervision and architecture alone can learn subject-aligned representations. We replace a shared EEG encoder with subject-specific encoders followed by a common classifier, and compare this hybrid model with standard EEGNet, AttentionBaseNet, and CTNet baselines with Euclidean Alignment (EA) on four motor-imagery datasets. EA improves shared encoders by recentering subject covariances, but the hybrid encoder largely internalises this role: validation-loss curves and latent-distance analyses change little when EA is removed. Subject-specific heads increase class distinctiveness and place each subject close to its own latent manifold, improving most subjects while leaving a method-sensitive subset. These results support subject-specific encoders as a learned alignment mechanism for EEG decoding and identify head selection for unseen subjects as the remaining bottleneck.
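
The "recentering of subject covariances" attributed to Euclidean Alignment can be sketched as follows. This is a minimal illustration of EA as it is commonly defined (whiten each subject's trials by the inverse square root of that subject's mean trial covariance), restricted to two channels so that the matrix square root has a closed form. The trial values and helper names are hypothetical, and this is not the authors' code.

```python
import math


def covariance(x):
    """Covariance of one trial; x has 2 rows (channels) and T columns (samples)."""
    t = len(x[0])
    return [[sum(x[i][n] * x[j][n] for n in range(t)) / t for j in range(2)]
            for i in range(2)]


def mean_matrix(ms):
    return [[sum(m[i][j] for m in ms) / len(ms) for j in range(2)] for i in range(2)]


def inv_sqrt_spd_2x2(m):
    """Inverse square root of a 2x2 symmetric positive-definite matrix."""
    s = math.sqrt(m[0][0] * m[1][1] - m[0][1] * m[1][0])
    t = math.sqrt(m[0][0] + m[1][1] + 2 * s)
    root = [[(m[0][0] + s) / t, m[0][1] / t], [m[1][0] / t, (m[1][1] + s) / t]]
    det = root[0][0] * root[1][1] - root[0][1] * root[1][0]
    return [[root[1][1] / det, -root[0][1] / det],
            [-root[1][0] / det, root[0][0] / det]]


def euclidean_align(trials):
    """Map every trial X to R^(-1/2) X, with R the subject's mean covariance."""
    w = inv_sqrt_spd_2x2(mean_matrix([covariance(x) for x in trials]))
    return [[[sum(w[i][k] * x[k][n] for k in range(2)) for n in range(len(x[0]))]
             for i in range(2)] for x in trials]


# Two hypothetical trials from one subject (2 channels, 3 samples each).
trials = [[[1.0, 2.0, 3.0], [0.5, -1.0, 2.0]],
          [[2.0, 0.0, -1.0], [1.0, 1.0, 3.0]]]
aligned = euclidean_align(trials)
mean_cov = mean_matrix([covariance(x) for x in aligned])
print(mean_cov)  # close to the identity matrix after alignment
```

After alignment every subject's mean covariance is the identity, which is the shared reference point that the abstract says a subject-specific encoder can learn on its own.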

2606.16461 2026-06-16 cs.LG New submission

Privacy from Symmetry: Orthogonally Equivariant Transformers for LLM Inference

Alexander Yukhimchuk, Andrey Shulga, Mladen Kolar, Martin Takáč

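Affiliations: MBZUAI (Mohamed bin Zayed University of Artificial Intelligence); University of Southern California

AI summary: To address hidden representations in split inference being recoverable by nearest-neighbor search, proposes orthogonal obfuscation and the ConjFormer architecture, which is O(d)-equivariant; without noise injection or heavy cryptography, top-10 token recovery falls from over 35% to at most 1.3% while perplexity rises by only 0.4%.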

Abstract

Running large language models locally is often impractical, pushing inference on sensitive text to third-party providers. Split inference partially mitigates this by keeping tokens on the client and sending only hidden representations, but these representations can still be recovered via nearest-neighbor search against the public embedding table. We propose an orthogonal obfuscation procedure in which the client multiplies embeddings by a secret orthogonal matrix before transmission. To enable correct inference under arbitrary rotations, we introduce ConjFormer, a transformer variant that is exactly $\mathrm{O}(d)$-equivariant via a lightweight normalization change (scalar RMSNorm) together with blockwise orthogonal conjugation of all linear weights. As a result, the server performs the full forward pass entirely in the rotated basis and never observes unrotated hidden states. Experiments on GPT-2 and Llama 3.2 1B models fine-tuned on PubMed show that orthogonal obfuscation eliminates direct cosine nearest-neighbor inversion and reduces token recovery from over 35% top-10 to at most 1.3%, while increasing perplexity by only 0.4% after fine-tuning. These results indicate that enforcing symmetry at the architectural level can provide a practical defense for privacy-preserving LLM inference without noise injection or heavy cryptographic machinery.
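
The core identity behind orthogonal obfuscation can be checked on a toy example: if the client sends Qx and the server holds the conjugated weight Q W Qᵀ, the server's output equals Q(Wx), so the computation stays in the rotated basis. The sketch below assumes a single 2x2 linear layer, a column-vector convention, and a plain rotation as the secret key; the blockwise conjugation and scalar RMSNorm of ConjFormer are not reproduced here.

```python
import math


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def matvec(m, v):
    return [sum(row[k] * v[k] for k in range(len(v))) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def rotation(theta):
    """2x2 rotation matrix, standing in for the client's secret orthogonal key."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


Q = rotation(1.1)                            # hypothetical secret, kept on the client
W = [[2.0, 1.0], [0.0, 3.0]]                 # hypothetical weight of the original model
W_conj = matmul(matmul(Q, W), transpose(Q))  # weight the server would hold

x = [1.0, -2.0]                              # plain embedding, never transmitted
sent = matvec(Q, x)                          # what the client transmits
server_out = matvec(W_conj, sent)            # computed entirely in the rotated basis
reference = matvec(Q, matvec(W, x))          # rotated version of the plain output

max_gap = max(abs(a - b) for a, b in zip(server_out, reference))
norm_gap = abs(math.hypot(*sent) - math.hypot(*x))
print(max_gap < 1e-9, norm_gap < 1e-9)
```

Because Q preserves norms and inner products, quantities that depend only on those (such as a scalar RMS normalisation) are unaffected by the rotation, while a direct lookup of `sent` against the public embedding table no longer matches the original token vector.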

2606.16458 2026-06-16 cs.RO New submission

RHO: Your Coding Agent is Secretly a Roboticist

Karim Elmaaroufi, Justin Svegliato, Sarunas Kalade, Graham Schelle, Sanjit A. Seshia, Matei Zaharia

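Affiliations: University of California, Berkeley; AMD

AI summary: Proposes the RHO paradigm, which searches at training time for neurosymbolic multi-file policy repositories to achieve efficient zero-shot generalization on robot tasks, reaching success rates of 45% on LIBERO-PRO and 70% on Robosuite and significantly outperforming existing methods.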

Comments 46 pages, 9 figures, 15 tables. Project page: https://rho-robotics.github.io

Abstract

Code-as-Policies (CaP) has shown that large language models (LLMs) can write code to solve robotics tasks by composing perception, planning, and control primitives. Recent CaP systems, however, rely on multi-turn code-generation loops at test time, which is often infeasible for real-time robot control. We introduce Robotics Harness Optimization (RHO), a novel paradigm in which tool-enabled coding agents, at training time, propose and search for interpretable, neurosymbolic multi-file policy repositories (Repositories-as-Policies) that compose these primitives rather than a single prompt, function, or file. RHO searches with reflective feedback from environment reward and execution rather than teleoperation demonstrations. It generalizes to perturbed pick-and-place settings such as LIBERO-PRO, where OpenVLA scores 0.0% and $\pi_{0.5}$ averages 12.83%. Using the same low-level primitives, RHO reaches a 45.0% success rate, 2.5x higher than the strongest multi-turn agentic system and 3.5x higher than $\pi_{0.5}$. On Robosuite, RHO sets a new state of the art of 70.0%, exceeding the prior multi-turn record of 68.29% while using single-turn execution with no corrective LLM code edits at deployment. When an LLM is used in the control loop, as on RAI's O3DE benchmark, RHO optimizes the deployed agent's multi-file harness of prompts, tools, and control code, improving held-out success from 23.5% to 44.3% with 20% less wall-clock time and 27% fewer tool calls.

2606.16457 2026-06-16 cs.CV cs.GR New submission

ResEdit: Residual embeddings for precise generative image editing

Ahmet Canberk Baykal, Valentin Deschaintre, Yannick Hold-Geoffroy, Michael Fischer, Anna Frühstück, Cengiz Öztireli, Iliyan Georgiev

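Affiliations: Adobe Research; University of Cambridge

AI summary: Proposes a residual image encoding as additional conditioning, combined with a gradient-reversal optimization strategy, to achieve high-fidelity, precise edits while preserving image identity and global consistency.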

Comments Accepted to the EGSR 2026 journal track

Abstract

Conditional diffusion image generators can be repurposed for editing through inversion, without the need for large-scale paired fine-tuning data. However, producing high-quality, targeted edits while maintaining image identity and global consistency remains challenging, as weakly conditioned inversion often embeds conflicting image features into the noise. We demonstrate that incorporating a residual image encoding as additional conditioning enables both improved identity preservation and better editability. We optimize this residual encoding to provide a strong conditioning signal for reconstruction, thereby reducing the reliance on inversion and susceptibility to its aforementioned pitfalls. To ensure this residual does not interfere with desired edits, we incorporate a gradient reversal-based optimization strategy that disentangles the residual from the edited condition. We illustrate our method's ability to produce high-fidelity results across precise intrinsic-based editing and relighting, and show proof-of-concept text-guided manipulation.

2606.16456 2026-06-16 cs.LG cs.AI New submission

SPRI: SVD-Partitioned Residual Initialization for Data-Constrained MoE Upcycling

Weiqiao Shan, Ruixiang Mao, Yuang Li, Yuhao Zhang, Yingfeng Luo, Tong Zheng, Chen Xu, Yucheng Qiao, Chunxiang Jin, Yi Yuan, Jingdong Chen, Tong Xiao, Jingbo Zhu

Affiliations: Northeastern University, China; Huawei TSC, China; CUHK-Shenzhen, China; University of Maryland, USA; Harbin Engineering University, China; Inclusion AI, Ant Group; NiuTrans Research, China

AI summary: Proposes SPRI, which initializes MoE experts with SVD-partitioned residuals of pretrained FFN weights and adds a two-stage training strategy, substantially improving performance on data-constrained multilingual speech translation.

Comments 8 pages, 12 tables, 3 figures

Abstract

Mixture-of-Experts (MoE) models enable efficient scaling, but training them from scratch remains prohibitively expensive. MoE upcycling mitigates this cost by converting pretrained dense models into sparse MoE models. However, existing upcycling methods typically rely on large-scale continued training and often perform poorly under data-constrained supervised adaptation, due to either homogeneous experts or overly disruptive perturbations to pretrained parameters. In this setting, effective upcycling must leverage pretrained weight structure while introducing sufficient diversity among routed experts. To this end, we propose SVD-Partitioned Residual Initialization (SPRI), which distributes SVD-partitioned residuals derived from pretrained feed-forward network (FFN) weights across routed experts, introducing controlled expert diversity grounded in pretrained spectral structure. We further introduce a two-stage training strategy to improve adaptation stability. We evaluate SPRI on multilingual speech-to-text translation, where limited supervised data challenges MoE upcycling and multiple target languages provide natural routing heterogeneity. On CoVoST2 across 15 En-to-XX directions, SPRI improves average BLEU and COMET over fully fine-tuned dense models by 2.58 and 3.32 points, respectively, and outperforms the prior best MoE upcycling baseline by 3.39 BLEU and 4.34 COMET points.

2606.16454 2026-06-16 cs.LG cs.AI New submission

SDS-LoRA: Overcoming Anisotropic Gradient Scaling in Low-Rank Adaptation

Junghun Oh, Sungyong Baik, Kyoung Mu Lee

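Affiliations: Seoul National University; Hanyang University

AI summary: Proposes SDS-LoRA, which structurally decouples singular values from the backward pass, removing the rank reduction and suboptimal alignment caused by anisotropic gradient scaling in LoRA and improving convergence speed and adaptation performance.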

Abstract

Low-Rank Adaptation (LoRA) enables efficient adaptation of large pre-trained models to downstream tasks by parameterizing weight updates with low-rank matrices. In this paper, we investigate the limitations of the LoRA parameterization from a geometric perspective. Specifically, we show that when a full fine-tuning gradient is backpropagated to the low-rank matrices, it undergoes anisotropic scaling driven by their singular values. We argue that this phenomenon is undesirable because it distorts the full fine-tuning gradient by skewing it toward dominant singular directions while suppressing others. Our analyses demonstrate that anisotropic gradient scaling reduces the effective rank of the low-rank matrices' gradients and results in suboptimal alignment between the full fine-tuning gradient and its low-rank approximation in LoRA, thereby exacerbating the gap to full fine-tuning. To address these limitations, we propose a new low-rank parameterization, SDS-LoRA, which structurally decouples singular values from the backward pass. Our method ensures that the full fine-tuning gradient backpropagates only through the orthonormal bases of the low-rank matrices' subspaces, independent of their scales. Convergence analysis demonstrates that while LoRA's convergence rate degrades with the condition number of the low-rank matrices, SDS-LoRA remains independent of it. Experimental results across natural language and vision benchmarks show that SDS-LoRA improves loss convergence and reduces the gap to full fine-tuning, significantly enhancing adaptation performance.
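
The anisotropic scaling described in this abstract follows from the chain rule: with the standard LoRA update ΔW = BA, the gradient reaching A is BᵀG, where G is the full fine-tuning gradient, so each direction of G is multiplied by the corresponding singular value of B. The toy sketch below assumes a diagonal B (so its orthonormal basis U is the identity) and a made-up G. It contrasts BᵀG with UᵀG only to show the effect and does not implement the SDS-LoRA parameterization itself.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


# Hypothetical low-rank factor with singular values 10 and 0.1.
B = [[10.0, 0.0], [0.0, 0.1]]
U = [[1.0, 0.0], [0.0, 1.0]]   # orthonormal basis of B's column space
G = [[1.0, 1.0], [1.0, 1.0]]   # hypothetical full fine-tuning gradient dL/dW

grad_A_lora = matmul(transpose(B), G)    # rows scaled by 10 and by 0.1
grad_A_basis = matmul(transpose(U), G)   # rows keep the same magnitude

ratio_lora = grad_A_lora[0][0] / grad_A_lora[1][0]
ratio_basis = grad_A_basis[0][0] / grad_A_basis[1][0]
print(ratio_lora, ratio_basis)
```

The gap between the two ratios is the condition number of B in this example, which matches the abstract's statement that LoRA's convergence rate degrades with the conditioning of the low-rank matrices.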

2606.16448 2026-06-16 cs.CV New submission

Hierarchical Fine-Grained Aerial Object Detection

Yan Zhang, Fang Xu, Wen Yang, Gui-Song Xia

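Affiliations: School of Computer Science, Wuhan University; School of Artificial Intelligence, Wuhan University; School of Electronic Information, Wuhan University

AI summary: Proposes ExpertDet, which uses structured prior knowledge through Vision-aware Masked Attribute Modeling and Hierarchical Visual Instance Promotion to strengthen fine-grained aerial object detection, and outperforms existing methods on the new PSP benchmark.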

Comments 15 pages

Abstract

Fine-grained aerial object detection, driven by the intrinsic granularity of real-world object categories, is crucial for advanced scene understanding in remote sensing. Existing methods largely inherit the paradigm of coarse-grained object detection, relying solely on single-label supervision and thus struggling to distinguish model-level categories with subtle structural differences. However, for each specific model (e.g., Boeing 787), structured prior knowledge such as attributes and hierarchies offers discriminative semantics across multiple granularities. Motivated by this, we present ExpertDet, a scheme that incorporates expert-informed cues to enhance fine-grained aerial object detection. Specifically, we design Vision-aware Masked Attribute Modeling (VMAM), which aligns attribute semantics with visual structures by reconstructing randomly masked attributes from visual cues, enabling the detector to capture subtle structural distinctions. We further propose Hierarchical Visual Instance Promotion (HierVIP), which builds a visual prototype tree based on hierarchical relations and imposes taxonomy-aware constraints to preserve cross-level semantic continuity while enhancing category discrimination. Moreover, we curate PSP, a new fine-grained object detection benchmark for Precise recognition of model-specific Ships and Planes from aerial imagery. It covers 106 ship classes and 30 airplane models, the most extensive collection of model-specific categories among existing aerial object detection datasets. We benchmark state-of-the-art object detection algorithms on PSP. Extensive evaluation demonstrates that ExpertDet consistently outperforms other fine-grained competitors across hierarchy levels. The dataset, benchmark, and code are available at https://nnnnerd.github.io/PSP-Benchmark/.

2606.16447 2026-06-16 cs.RO cs.AI New submission

Training and Evaluating Diffusion Policies with Long Context Lengths

Abhinav Agarwal, Adam Wei, Taylan Kargin, Michael Zeng, Cole Becker, Arif Kerem Dayi, Pablo Parrilo, Asuman Ozdaglar, Russ Tedrake

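Affiliations: Massachusetts Institute of Technology

AI summary: Presents the first detailed study of the effect of context length in imitation learning, finds that naively scaling context length is less brittle than previously claimed, and proposes jointly training policies at multiple context lengths to reduce sample complexity.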

Abstract

Imitation learning has enabled highly dexterous robotic manipulation from RGB observations. Policies trained with these methods, however, typically condition robot actions on only a short history of observations. These policies cannot solve tasks that require memory and can get stuck repeatedly executing the same failing motions. In this work, we first benchmark policy performance as context length is incrementally increased from short to long, across a spectrum of tasks with varying local stability and memory requirements, and in multiple data regimes. To our knowledge, this is the first study to investigate context length in imitation learning at this level of detail. Our results challenge prior claims: naively scaling context length is not as brittle as the literature suggests. With an appropriate conditioning method and denoising backbone (UNet+Cross-Attention), single-task policies achieve high success rates on many tasks in the usual data regime even with naive scaling. Next, we propose a training algorithm to jointly train policies at multiple context lengths, further reducing the sample complexity of long-context learning. Finally, we apply our findings to re-evaluate some previously proposed solutions to long-context imitation learning.

2606.16436 2026-06-16 cs.RO cs.CV New submission

V2P-Manip: Learning Dexterous Manipulation from Monocular Human Videos

Kaihan Chen, Yanming Shao, Haifeng Ji, Xiaokang Yang, Yao Mu

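Affiliations: Zhejiang University; Shanghai Jiao Tong University; Shanghai AI Laboratory

AI summary: Proposes the V2P-Manip framework, which extracts visually faithful and physically plausible trajectories from monocular human demonstration videos, uses a two-stage refinement to enforce spatial alignment and physical consistency, and significantly outperforms prior methods on the TACO and OakInk benchmarks.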

Abstract

Achieving autonomous robotic dexterous manipulation requires precise, human-like action sequences at scale. As a scalable supplement to costly teleoperation data, extracting trajectories with both visual fidelity and physical plausibility from monocular videos represents a promising frontier in embodied AI. To this end, we introduce V2P-Manip, an efficient framework designed to learn dexterous manipulation policies directly from human demonstration videos. We establish an efficient, integrated pipeline encompassing 3D asset acquisition, trajectory estimation, and dexterous policy learning. To bridge the gap between visual perception and physical constraints, we introduce a two-stage refinement process to enforce spatial alignment and physical consistency. Evaluations on the TACO and OakInk benchmarks demonstrate that our approach significantly outperforms previous methods in pose accuracy, adaptability to unstructured environments, and training efficiency. Ultimately, experimental results confirm an average success rate of over 75% across multiple synthetic manipulation tasks and validate the adaptability of the extracted manipulation priors across diverse dexterous hand embodiments.

2606.16434 2026-06-16 cs.LG cs.AI New submission

Autonomous End-to-End SOH Prediction Services for Battery Systems via Temporal-Contrastive Representation Learning

Junting Wen, Dan Li, Qihao Quan, Xiwen Wang, Hang Yang, Zhaohong Meng, Zigui Jiang, Changlin Yang, Tianle Liu, Diego Muñoz-Carpintero, Jian Lou

Affiliations: School of Software Engineering, Sun Yat-sen University; Tianneng Battery Group Co., Ltd; School of Communication Engineering, Hangzhou Dianzi University; Institute of Engineering Science, Universidad de O’Higgins

AI summary: Proposes TC-SOH, a modular service architecture that uses a temporal-contrastive mechanism and a cross-window prediction task to extract degradation-relevant representations from raw data for autonomous end-to-end SOH prediction, reducing MAPE by 1.91 times and RMSE by 2.13 times across four datasets.

Abstract

Accurate state of health (SOH) estimation is a critical diagnostic service for lithium-ion battery management. However, reliance on labor-intensive manual feature engineering and opaque black-box models hinders scalable industrial deployment. To address this, we introduce TC-SOH: a modular, plug-and-play service architecture for autonomous, end-to-end SOH prediction. TC-SOH employs a temporal-contrastive mechanism and a cross-window prediction pretext task to extract degradation-relevant representations directly from raw operational data. To improve transparency, we connect model efficacy with representation diagnostics: visualization, sensitivity analysis, redundancy analysis, bidirectional probing, future-SOH probing, and temporal shuffling show that learned features overlap with selected expert descriptors while retaining additional SOH-relevant variation, and that ordered temporal context improves subsequent-SOH prediction. Across four public datasets, TC-SOH outperforms the considered physics-informed and data-driven baselines, reducing MAPE by a factor of 1.91 and RMSE by a factor of 2.13.

2606.16432 2026-06-16 cs.CL cs.AI New submission

ACCORD: Action-Conditioned Contextual Grounding for Language Agents

Lai Jiang, Cheng Qian, Zhenhailong Wang, Pan Lu, Heng Ji, Hao Peng

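Affiliations: University of Illinois Urbana-Champaign; Stanford University

AI summary: User instructions are often underspecified because of implicit assumptions about the environment, which causes LLM agents to fail; the proposed ACCORD framework actively probes for missing information before each action and integrates trajectory context, requires no additional training, and significantly improves task completion on AppWorld and AlfWorld.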

Abstract

User instructions are often underspecified because humans rely on implicit assumptions about the surrounding environment. For large language model (LLM) agents operating in information-rich digital and physical environments, these assumptions cannot be inferred from the instruction alone; they must be recovered from the current state of tools, data, interfaces, and observations. Effective execution therefore requires agents to identify missing context, ground it in observed evidence, and carry it forward into subsequent actions. We show that current agents often fail to do so. They act from assumed rather than observed specifics, overlook information they could have gathered, and fail to incorporate evidence that has already been returned. Building on this insight, we propose ACCORD (Action-Conditioned Contextual Grounding), a simple and effective agent framework for adaptive grounding. Before each action, ACCORD actively probes the environment for missing information and integrates relevant context from the agent's trajectory that would otherwise be overlooked. Requiring no additional training or task-success signals, ACCORD improves task-goal completion on AppWorld by up to +20.6 points with GPT-5-mini, from 42.0% to 62.6%, compared to strong baselines. These gains persist with a substantially stronger base model (+10.8 with Claude-4.5-sonnet), an open-weight model (+10.1 with Qwen3.5-27B-FP8), and on the embodied AlfWorld benchmark (+7.4 success rate with GPT-5-mini).

2606.16429 2026-06-16 cs.LG cs.CL New submission

Taylor-Calibrate: Principled Initialization for Hybrid Linear Attention Distillation

Zhongzhu Zhou, Qingyang Wu, Junxiong Wang, Mayank Mishra, Shuaiwen Leon Song, Ben Athiwaratkun, Chenfeng Xu

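Affiliations: The University of Sydney; Together AI; University of California, Berkeley; The University of Texas at Austin; Microsoft

AI summary: Proposes Taylor-Calibrate, which initializes hybrid linear attention student models from Taylor-guided teacher attention statistics, substantially reducing the number of training tokens needed for distillation.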

Comments 24 pages, 9 figures

Abstract

Hybrid linear attention models offer an appealing path to faster long-context inference: they reduce the quadratic cost and KV-cache burden of full softmax attention while retaining much of the quality of Transformer models. A practical way to obtain such models is to convert a pretrained Transformer instead of pretraining a new architecture from scratch, but this conversion is still brittle. Simply copying the teacher attention projections into a Gated DeltaNet (GDN) student does not specify the new recurrent decay, write, and output-gating dynamics. As a result, the converted model often starts in a poor dynamical regime and must spend many distillation tokens repairing initialization rather than learning the remaining teacher behavior. We propose Taylor-Calibrate, a lightweight initialization method for hybrid GDN students. The method uses Taylor-guided teacher attention statistics to set the value projection, memory timescale, write gates, and output gate, then applies a short per-layer alignment step to match each converted layer to the teacher output. Across four teacher settings and three retained-layer policies, Taylor-Calibrate gives substantially stronger zero-shot students, with up to an 88x improvement in a representative ablation, and reaches matched recovery targets with 4.9x to 9.2x fewer training tokens than naive conversion.

2606.16428 2026-06-16 cs.CL cs.AI cs.HC New submission

LectūraAgents: A Multi-Agent Framework for Adaptive Personalized AI-Assisted Learning and Embodied Teaching

Jaward Sesay, Yue Yu, Siwei Dong, Yemin Shi, Guangyao Chen, Börje F. Karlsson

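Affiliations: Beijing Institute of Technology; Peking University; Cornell University; Beijing Academy of Artificial Intelligence

AI summary: Proposes LectūraAgents, a multi-agent framework that achieves end-to-end personalized learning through a hierarchical architecture and an adaptive embodied teaching mechanism (e.g., gestures, highlighting), and designs a Teaching Action-Speech Alignment algorithm to improve coherence, outperforming existing methods across multiple course levels.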

Abstract

Effective personalized AI-assisted learning demands systems that can not only generate accurate learner-specific educational materials, but also dynamically adapt their instruction to diverse learners. However, existing educational agents have primarily focused on lecture content automation and simulations, which often fall short of modelling multimodal and embodied instructional methods tailored to the individual learner. To this end, we propose LectūraAgents, a multi-agent framework that enables personalized learning through end-to-end adaptive embodied teaching. At its core, LectūraAgents mirrors a professor-student relationship, in which a ProfessorAgent leads a collaborative team of specialized subordinate agents through research, planning, review, and embodied delivery of lecture contents that adapt to a learner's needs. The framework offers three main contributions: (1) a hierarchical multi-agent architecture for end-to-end personalized learning; (2) an adaptive embodied teaching mechanism, wherein the ProfessorAgent executes visible and pedagogically motivated teaching actions (e.g., handwriting, highlighting, underlining) over contents in a teaching environment; and (3) a Teaching Action-Speech Alignment (TASA) algorithm that employs salience-based heuristics and temporal semantic segmentation to generate coherent teaching action sequences aligned with learner profiles. We evaluate LectūraAgents on diverse courses at high school, undergraduate, and graduate levels using sample-specific rubric-based analysis, with generated lecture materials and teaching actions assessed and validated by expert educators. Experimental results show consistent gains in lecture content quality, embodied teaching quality, assessment, and personalization over existing approaches, positioning LectūraAgents as a pedagogically well-grounded framework for personalized learning at scale.

2606.16421 2026-06-16 cs.CV 新提交

Beer-Lambert Guided Representation Learning for Unsupervised Anomaly Detection in Sub-THz Food Inspection Images

比尔-朗伯引导的表示学习用于亚太赫兹食品检测图像中的无监督异常检测

Gyutae Hwang, Sang Jun Lee

发表机构 * Division of Electronics and Information Engineering, Jeonbuk National University(全北国立大学电子与信息工程学部)

AI总结 提出比尔-朗伯引导的表示学习框架,通过衰减分解模块约束学生表示,在亚太赫兹食品检测图像中实现无监督异常检测,并引入留一食品协议评估泛化能力。

Comments 6 pages, 3 figures

详情
AI中文摘要

食品制造需要可靠的检测系统来检测异物污染并维护产品安全。亚太赫兹透射成像提供了依赖于材料的衰减特性,有助于检测食品中的低密度污染物。然而,现有的无监督异常检测方法主要依赖于RGB预训练的视觉表示,这可能无法充分捕捉亚太赫兹图像的透射行为。本文提出了一种比尔-朗伯引导的表示学习框架,用于亚太赫兹食品检测图像中的无监督异常检测。该方法引入了一个衰减分解模块作为辅助正则化模块,在训练过程中通过衰减重建来约束学生表示。除了传统的单类设置外,我们还引入了一种留一食品协议,以评估在未见食品类别下的泛化能力。在Inline-Food-Inspection-THz数据集上的实验结果表明,所提出的方法在整体异常检测性能上优于基线方法。

英文摘要

Food manufacturing requires reliable inspection systems to detect foreign material contamination and maintain product safety. Sub-THz transmission imaging provides material-dependent attenuation characteristics that are useful for detecting low-density contaminants in food products. However, existing unsupervised anomaly detection methods mainly rely on RGB-pretrained visual representations, which may not adequately capture the transmission behavior of Sub-THz images. This paper proposes a Beer-Lambert guided representation learning framework for unsupervised anomaly detection in Sub-THz food inspection images. The proposed method introduces an attenuation decomposition module as an auxiliary regularization module that constrains student representations through attenuation reconstruction during training. In addition to the conventional one-class setting, we introduce a Leave-One-Food-Out protocol to evaluate generalization capability under unseen food categories. Experimental results on the Inline-Food-Inspection-THz dataset show that the proposed method improves overall anomaly detection performance over the baseline method.
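
As background for the attenuation idea in this abstract, the Beer-Lambert law itself can be sketched in a few lines. This is only an illustration of the physical law, not the paper's attenuation decomposition module; the coefficients, thicknesses, and function names below are hypothetical.

```python
import math


def transmitted_intensity(i0: float, alpha: float, thickness: float) -> float:
    """Beer-Lambert law: I = I0 * exp(-alpha * d)."""
    return i0 * math.exp(-alpha * thickness)


def log_attenuation(i0: float, i: float) -> float:
    """Log-attenuation A = -ln(I / I0), which equals alpha * d for one layer."""
    return -math.log(i / i0)


def layered_attenuation(layers: list[tuple[float, float]]) -> float:
    """For stacked layers the log-attenuations add: A = sum(alpha_k * d_k)."""
    return sum(alpha * d for alpha, d in layers)


# Hypothetical numbers: a 2.0 cm slab of food with alpha = 0.8 per cm,
# versus the same slab where 0.5 cm is replaced by a denser inclusion.
clean = layered_attenuation([(0.8, 2.0)])
contaminated = layered_attenuation([(0.8, 1.5), (2.0, 0.5)])
recovered = log_attenuation(1.0, transmitted_intensity(1.0, 0.8, 2.0))
```

Because log-attenuation is additive over the path, a foreign object shows up as a local deviation from the attenuation expected for the food alone, which is the kind of physical structure an RGB-pretrained representation has no reason to encode.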

2606.16415 2026-06-16 cs.AI 新提交

Posterior Twins: Distributional Behavioral Simulation for Enterprise Decisions

后验孪生:面向企业决策的分布行为模拟

Ankit Das

发表机构 * Twinning Labs, Inc.(Twinning Labs公司)

AI总结 提出后验孪生方法,通过记忆驱动的数字孪生将模拟行为表示为决策条件下的更新分布,在226例基准上评估模型,发现模态准确率与分布保真度揭示不同操作区域,TL-Twin Alpha实现最低Wasserstein-1距离(1.16)。

Comments 13 pages, 2 figures

详情
AI中文摘要

企业行为模拟不仅需要产生合理的响应。许多决策取决于在拟议行动下群体的形态:哪些细分群体接受、拒绝、犹豫或进入风险敏感状态。本文介绍了后验孪生(Posterior Twins),一种记忆驱动的数字孪生方法,将可能的行为表示为特定决策上下文下的更新分布。我们在一个包含226个保留示例的行为响应基准上评估了一系列Twinning Labs行为模型操作点,并报告了模态准确率和Wasserstein-1距离。结果表明,模态准确率和分布保真度识别出不同的操作区域。在报告的结果集中,TL-Twin Alpha实现了最低的观测Wasserstein-1距离($W_1 = 1.16$),而TL-Twin Delta和TL-Twin Gamma在模态准确率前沿附近提供了平衡的操作点。本文将这些结果视为系统结果:受控记忆、行为模型路由、场景编排、分布聚合和可审计性对于将模拟行为转化为可重用的企业决策证据是必要的。

英文摘要

Enterprise behavioral simulation requires more than producing a plausible response. Many decisions depend on the shape of a population under a proposed action: which segments accept, defect, hesitate, or move into risk-sensitive states. This paper introduces Posterior Twins, a memory-grounded digital-twin approach that represents likely behavior as an updated distribution under a specific decision context. We evaluate a family of Twinning Labs behavioral-model operating points on a 226-example held-out behavioral-response benchmark and report both modal accuracy and Wasserstein-1 distance. The results show that modal accuracy and distributional fidelity identify different operating regimes. TL-Twin Alpha achieves the lowest observed Wasserstein-1 distance in the reported result set ($W_1 = 1.16$), while TL-Twin Delta and TL-Twin Gamma provide balanced operating points near the modal-accuracy frontier. The paper frames these results as a systems result: governed memory, behavioral model routing, scenario orchestration, distributional aggregation, and auditability are necessary for turning simulated behavior into reusable enterprise decision evidence.
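
The distinction between modal accuracy and distributional fidelity can be made concrete with a small sketch. It assumes responses fall on an ordered scale with unit spacing and that "modal accuracy" means the predicted mode matches the observed mode; the paper's exact definitions may differ, and the distributions below are made up.

```python
def wasserstein1(p: list[float], q: list[float]) -> float:
    """W1 between two distributions on the ordered support 0..K-1.

    In one dimension with unit spacing, W1 is the sum of absolute
    differences between the two cumulative distribution functions.
    """
    cdf_gap = 0.0
    total = 0.0
    for pk, qk in zip(p, q):
        cdf_gap += pk - qk
        total += abs(cdf_gap)
    return total


def mode(p: list[float]) -> int:
    """Index of the most probable category."""
    return max(range(len(p)), key=lambda k: p[k])


# Hypothetical observed population response (e.g. accept / hesitate / reject).
observed = [0.5, 0.3, 0.2]
overconfident = [1.0, 0.0, 0.0]   # right mode, wrong shape
calibrated = [0.45, 0.35, 0.2]    # right mode, close shape

same_mode = mode(overconfident) == mode(observed) == mode(calibrated)
w1_overconfident = wasserstein1(overconfident, observed)
w1_calibrated = wasserstein1(calibrated, observed)
```

Both predictions score identically on the mode, yet their Wasserstein-1 distances differ by more than an order of magnitude, which is why the two metrics can single out different operating points.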

2606.16414 2026-06-16 cs.CV 新提交

Instance-Aware Knowledge Distillation for Semi-Supervised Learning of an On-Board Multi-Task Dense Prediction Model for Collision Avoidance System

面向碰撞避免系统的半监督学习中的实例感知知识蒸馏用于车载多任务密集预测模型

Gyutae Hwang, Sang Jun Lee

发表机构 * Division of Electronics and Information Engineering, Jeonbuk National University(全北国立大学电子与信息工程学部)

AI总结 提出实例感知知识蒸馏框架,利用教师模型领域先验和基础模型实例知识生成伪标签,训练轻量学生模型在边缘设备上实时执行多任务密集预测,在实例分割上超越教师,计算量降低22.68倍。

Comments 13 pages, 7 figures

详情
AI中文摘要

碰撞避免系统已发展为基于摄像头的深度学习方法用于驾驶场景理解。然而,在乡村俱乐部等边缘环境中的部署受到有限计算资源和不可靠通信基础设施的限制。此外,为目标领域构建大规模数据集涉及大量标注成本。为了解决这些限制,我们提出了一种实例感知知识蒸馏框架用于半监督学习。具体来说,我们通过利用来自教师的领域先验和来自基础模型的实例中心知识生成减轻教师偏差的伪标签。训练后的轻量学生模型被部署在所提出的碰撞避免系统中,并实时执行多个密集预测任务。该系统检测前方障碍物并将其空间信息编码为控制器局域网消息,用于自动导引车操作。为此,我们构建了一个大规模的乡村俱乐部数据集,并对所提出的系统进行了现场验证。实验结果表明,学生在实例分割上优于大型教师,同时减轻了单目深度估计中的性能下降。与教师相比,学生将FLOPs减少了22.68倍,参数减少了14.33倍,在低成本边缘设备上实现了6.46 FPS。

英文摘要

Collision avoidance systems have evolved toward camera-based deep learning approaches for driving scene understanding. However, deployment in edge environments such as country clubs is constrained by limited computational resources and unreliable communication infrastructure. Moreover, constructing large-scale datasets for the target domain involves substantial annotation cost. To address these limitations, we propose an instance-aware knowledge distillation framework for semi-supervised learning. Specifically, we generate pseudo labels that mitigate teacher bias by leveraging domain priors from the teacher and instance-centric knowledge from foundation models. The trained lightweight student is deployed in the proposed collision avoidance system and performs multiple dense prediction tasks in real-time. The system detects frontal obstacles and encodes their spatial information into controller area network messages for automated guided vehicle operation. To achieve this, we construct a large-scale country club dataset and perform field validation of the proposed system. Experimental results demonstrate that the student outperforms the large teacher in instance segmentation while mitigating performance degradation in monocular depth estimation. Compared with the teacher, the student reduces FLOPs by 22.68$\times$ and parameters by 14.33$\times$, achieving 6.46 FPS on a low-cost edge device.

2606.16413 2026-06-16 cs.RO cs.HC 新提交

An Augmented Reality Brain-Robot Interface for Generalist Robot Arm Manipulation

面向通用机械臂操控的增强现实脑-机器人接口

Shangkai Zhang, Rousslan Fernand Julien Dossa, Luca Nunziante, Marina Di Vincenzo, Kai Arulkumaran

发表机构 * Araya Inc.(Araya公司)

AI总结 提出结合眼动追踪与运动想象脑电的增强现实脑机接口,实现通用机器人臂的直观操控,通过18人实验验证了多步骤日常任务的有效性和良好可用性。

Comments Accepted at the 2026 IEEE International Conference on Robot and Human Interactive Communication (RO-MAN)

详情
AI中文摘要

增强现实(AR)与基于脑电图的脑机接口(BCI)的融合为辅助目的提供了直观控制机器人的有前景途径。然而,现有的AR脑机接口(BRI)系统通常受限于特定任务结构,限制了其在真实环境中的实用性。我们提出了一种面向通用机器人臂操控的AR BRI,它将基于注视的对象选择与运动想象动作控制相结合。我们的系统使用眼动追踪进行直观对象定位,并通过上下文感知的视觉覆盖(“放置”和“使用”)在共享自主框架内引导用户完成任务。我们通过一项可行性研究评估了该界面,18名健康参与者执行了三种多步骤日常生活活动:饮水、使用抽屉和操作烤箱。结果表明,这种交互范式实现了有效的顺序任务执行和高用户参与度,获得了“良好”的可用性评级(SUS > 70)。这些发现支持了所提出的交互范式用于复杂BCI驱动的机器人辅助的可行性,并激励了未来针对预期目标人群的评估。项目网站:https://ar-bri-manip.github.io/。

英文摘要

The integration of augmented reality (AR) and EEG-based brain-computer interfaces (BCIs) offers a promising path for enabling intuitive control of robots for assistive purposes. However, existing AR brain-robot interface (BRI) systems are often constrained to task-specific structures, limiting their utility in real-world environments. We present an AR BRI designed for generalist robot arm manipulation that combines gaze-based object selection with motor imagery action control. Our system uses eye-tracking for intuitive object targeting and context-aware visual overlays ("Place" and "Use") to guide the user through tasks within a shared autonomy framework. We evaluated the interface through a feasibility study with 18 healthy participants performing three multi-step activities of daily living: drinking, using a drawer, and operating an oven. Our results demonstrate that this interaction paradigm enables effective sequential task execution and high user engagement, achieving a "Good" usability rating (SUS > 70). These findings support the feasibility of the proposed interaction paradigm for complex BCI-driven robotic assistance, and motivate future evaluation with the intended target population. Project website: https://ar-bri-manip.github.io/.

2606.16412 2026-06-16 cs.SD eess.AS math.HO math.NT 新提交

An Asymmetric Formula for Interval Consonance and its Relation to Harmonic Coincidence

音程协和度的不对称公式及其与谐波重合的关系

David De Roure

发表机构 * University of Oxford(牛津大学) Royal Northern College of Music(皇家北方音乐学院)

AI总结 提出一个不对称公式 f(p/q) = p + Ω(q) 用于度量音程协和度,并证明在标准协和数据上表现良好,同时揭示了与欧拉和谐波重合模型的联系。

Comments Working note to support OEIS submissions

详情
AI中文摘要

欧拉的 Gradus Suavitatis (1739) 通过公式 G(p/q) = 1 + Ω(p) + Ω(q) 为音程 p/q 分配一个不协和值,其中 Ω(n) = \sum_i e_i(p_i - 1) 对 n 的加权质数指数求和。我们提出更简单的不对称公式 f(p/q) = p + Ω(q),该公式对分子和分母区别对待,并在标准协和数据上表现相当。我们还表明,在谐波被整数索引并均匀计数到固定截断水平的模型下,Gradus 等价于加权谐波重合计数,权重为 w(n) = Ω(n),从而将其与伽利略早期的脉冲重合模型 (1638) 联系起来。该公式自然生成一个互质整数三角形 T(n,k) = n + Ω(k),其最右对角线给出了超特定(连续谐波)音程的两阶段不协和。公式 f 允许在谐波背景和部分识别方面进行简单的两阶段解释,我们将其作为一种推测性的感知假设提出。

英文摘要

Euler's Gradus Suavitatis (1739) assigns a dissonance value to a musical interval p/q by the formula G(p/q) = 1 + Ω(p) + Ω(q), where Ω(n) = \sum_i e_i(p_i - 1) sums the weighted prime exponents of n. We propose the simpler asymmetric formula f(p/q) = p + Ω(q), which treats numerator and denominator differently and performs comparably on standard consonance data. We also show that, under a model in which harmonics are integer-indexed and counted uniformly up to a fixed truncation level, Gradus is equivalent to a weighted harmonic coincidence count with weights w(n) = Ω(n), connecting it to Galileo's earlier pulse-coincidence model (1638). The formula naturally generates a coprime integer triangle T(n,k) = n + Ω(k), whose rightmost diagonal gives the two-stage dissonance of the superparticular (consecutive-harmonic) intervals. The formula f admits a simple two-stage interpretation in terms of harmonic context and partial recognition, which we offer as a speculative perceptual hypothesis.
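
The two formulas in this abstract are easy to evaluate directly. The sketch below implements the weighted prime-exponent sum, Euler's Gradus, and the asymmetric formula f as stated above; the function names are my own, and reducing p/q to lowest terms first is an assumption made for convenience.

```python
from math import gcd


def prime_factors(n: int) -> dict[int, int]:
    """Return {prime: exponent} for n >= 1 by trial division."""
    factors: dict[int, int] = {}
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors[d] = factors.get(d, 0) + 1
            n //= d
        d += 1
    if n > 1:
        factors[n] = factors.get(n, 0) + 1
    return factors


def weighted_omega(n: int) -> int:
    """Sum of e_i * (p_i - 1) over the prime factorization of n."""
    return sum(e * (p - 1) for p, e in prime_factors(n).items())


def gradus(p: int, q: int) -> int:
    """Euler's Gradus Suavitatis: 1 + weighted_omega(p) + weighted_omega(q)."""
    g = gcd(p, q)
    return 1 + weighted_omega(p // g) + weighted_omega(q // g)


def asymmetric_f(p: int, q: int) -> int:
    """The asymmetric formula from the abstract: p + weighted_omega(q)."""
    g = gcd(p, q)
    return p // g + weighted_omega(q // g)


# Octave, perfect fifth, perfect fourth, major third, minor third.
intervals = [(2, 1), (3, 2), (4, 3), (5, 4), (6, 5)]
table = [(p, q, gradus(p, q), asymmetric_f(p, q)) for p, q in intervals]
```

For these five intervals the Gradus values are 2, 4, 5, 7, 8 and the f values are 2, 4, 6, 7, 10, so the two measures agree on the ordering here while f penalizes a large numerator more heavily.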

2606.16411 2026-06-16 cs.LG 新提交

Not all Jensen-Shannon Divergence Estimators are Equal

并非所有 Jensen-Shannon 散度估计器都是等价的

Alba Garrido, Alejandro Almodóvar, Mar Elizo, Patricia A. Apellániz, Santiago Zazo, Juan Parras

发表机构 * Information Processing and Telecommunications Center, ETSI Telecomunicación, Universidad Politécnica de Madrid(马德里理工大学电信工程学院信息处理与电信中心)

AI总结 针对合成表格数据保真度评估中 Jensen-Shannon 散度估计协议不明确的问题,系统研究了不同估计器族、采样协议等因素对估计值的影响,揭示了边际估计器的依赖盲性和分类器估计器的敏感性,并提出了后验校正方法。

详情
AI中文摘要

Jensen-Shannon 散度被广泛报道为合成表格数据保真度的标量度量。然而,在实践中,它是使用通常未明确说明的协议从有限样本中估计的。这造成了一个测量问题。尽管总体散度定义明确,但经验值取决于估计器族、采样协议、校准、维度和类别平衡。我们表明,不同的协议可能产生不可比较的值:基于边际的估计器忽略联合分布中的依赖关系,可能严重低估散度,而基于分类器的估计器捕获联合结构,但表现出强烈的估计器依赖性。我们在具有参考散度的受控设置和真实世界合成表格基准上系统地研究了这种行为。我们的分析揭示了边际估计器中的依赖盲性、类别不平衡下的先验偏移偏差以及高维中的估计器敏感性。为了解决先验偏移,我们推导了基于分类器的 Jensen-Shannon 估计的闭式后验校正。我们的结果表明,经验 Jensen-Shannon 散度值本质上依赖于协议,因此明确指定估计程序对于有意义的比较是必要的。我们提供了实用指南和一个用于估计器感知的 Jensen-Shannon 评估的开源工具。

英文摘要

The Jensen-Shannon divergence is widely reported as a scalar measure of fidelity for synthetic tabular data. Yet, in practice, it is estimated from finite samples using protocols that are often underspecified. This creates a measurement problem. Although the population divergence is well defined, the empirical value depends on the estimator family, sampling protocol, calibration, dimensionality, and class balance. We show that different protocols can yield non-comparable values: marginal-based estimators ignore dependencies in the joint distribution and can severely underestimate divergence, while classifier-based estimators capture joint structure but exhibit strong estimator dependence. We systematically study this behavior across controlled settings with reference divergences and real-world synthetic tabular benchmarks. Our analysis reveals dependence blindness in marginal estimators, prior-shift bias under class imbalance, and estimator sensitivity in high dimensions. To address prior shift, we derive a closed-form posterior correction for classifier-based Jensen-Shannon estimation. Our results show that empirical Jensen-Shannon divergence values are inherently protocol-dependent, making explicit specification of the estimation procedure necessary for meaningful comparison. We provide practical guidelines and an open-source tool for estimator-aware Jensen-Shannon evaluation.
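
The "dependence blindness" of marginal estimators mentioned in this abstract can be reproduced with a two-column toy example. This is a minimal sketch on exact discrete distributions rather than finite samples, and it assumes a marginal-based score is the average of per-column divergences; it is not the paper's tool or protocol.

```python
import math


def jsd_bits(p: dict, q: dict) -> float:
    """Jensen-Shannon divergence in bits between two discrete distributions."""
    total = 0.0
    for x in set(p) | set(q):
        px, qx = p.get(x, 0.0), q.get(x, 0.0)
        m = 0.5 * (px + qx)
        if px > 0:
            total += 0.5 * px * math.log2(px / m)
        if qx > 0:
            total += 0.5 * qx * math.log2(qx / m)
    return total


def marginal(joint: dict, axis: int) -> dict:
    """Marginalize a joint distribution over tuples onto one column."""
    out: dict = {}
    for key, prob in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + prob
    return out


# "Real" data: the two binary columns always agree.
real = {(0, 0): 0.5, (1, 1): 0.5}
# "Synthetic" data: same marginals, but the columns always disagree.
synthetic = {(0, 1): 0.5, (1, 0): 0.5}

joint_jsd = jsd_bits(real, synthetic)
marginal_jsd = sum(
    jsd_bits(marginal(real, axis), marginal(synthetic, axis)) for axis in (0, 1)
) / 2
```

The joint divergence reaches its maximum of 1 bit because the supports are disjoint, while the marginal average is exactly 0, so a report that only says "JSD" without naming the protocol could describe either number.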

2606.16409 2026-06-16 cs.CL 新提交

PathRouter: Aligning Rewards with Retrieval Quality in Agentic Graph Retrieval-Augmented Generation

PathRouter: 在智能体图检索增强生成中对齐奖励与检索质量

Bo Wang, Heyan Huang, Yaolin Li, Wei Tang, Yuan Zhang, Wenbo Li, Mingze Gao, Ge Shi, Chong Feng

发表机构 * Beijing Institute of Technology(北京理工大学) Joy Future Academy

AI总结 针对智能体图RAG中答案路径奖励混淆和搜索更新模糊问题,提出PathRouter框架,通过路径感知训练联合评估答案正确性与证据路径重叠,并引入冻结金证据教师提供token级KL指导,在六个QA基准上显著提升F1和证据路径重叠。

详情
AI中文摘要

智能体图RAG训练语言模型代理迭代检索并推理图结构证据,通过高效导航复杂信息网络实现更准确和上下文感知的决策。然而,仅基于结果的强化学习存在答案路径奖励混淆,即正确答案可能来自捷径而非有用证据路径。它还表现出搜索更新模糊,因为标量轨迹级反馈未指示应调整哪些检索动作。为缓解这些缺陷,我们提出PathRouter,一种用于智能体图RAG的路径感知训练框架。PathRouter联合评估每条轨迹的答案正确性和证据路径重叠,产生四种轨迹类别,并采用差异化的GRPO优势缩放,抑制捷径强化同时保留证据寻求行为。对于证据贫乏的轨迹,冻结的金证据教师提供推理和搜索查询token上的token级KL指导,排除答案token以避免直接响应模仿。在三个模型大小、六个QA基准上的实验表明,PathRouter一致提升了答案F1和证据路径重叠,与强基线相比,3B模型平均F1提升3.1,7B模型提升4.9。

英文摘要

Agentic GraphRAG trains language-model agents to iteratively retrieve and reason over graph-structured evidence, enabling more accurate and context-aware decision-making by efficiently navigating complex information networks. However, outcome-only reinforcement learning suffers from answer-path reward aliasing, where correct answers may come from shortcuts rather than useful evidence paths. It also exhibits search-update ambiguity, as scalar trajectory-level feedback does not indicate which retrieval actions to adjust. To mitigate these shortcomings, we present PathRouter, a path-aware training framework for agentic GraphRAG. PathRouter jointly evaluates each trajectory along answer correctness and evidence-path overlap, yielding four trajectory categories with differentiated GRPO advantage scaling that suppresses shortcut reinforcement while preserving evidence-seeking behavior. For evidence-poor trajectories, a frozen gold-evidence teacher provides token-level KL guidance on reasoning and search-query tokens, excluding answer tokens to avoid direct response imitation. Experiments on six QA benchmarks across three model sizes show that PathRouter consistently improves answer F1 and evidence-path overlap, achieving average F1 gains of 3.1 on 3B and 4.9 on 7B models compared to a strong baseline.

2606.16408 2026-06-16 cs.LG 新提交

MUNI: Multimodal Unified Latent Diffusion for Coherent Any-to-Any Generation

MUNI:面向连贯任意到任意生成的多模态统一潜在扩散

Kyeongmin Yeo, Yunhong Min, Minhyuk Sung

发表机构 * KAIST(韩国科学技术院)

AI总结 提出MUNI框架,通过端到端多模态潜在扩散和路由训练目标,实现任意到任意生成,在条件生成上匹配或超越基线,并在无条件连贯性上取得最大优势。

Comments Project page: https://muni-proj.github.io/

详情
AI中文摘要

我们提出MUNI,一个端到端的多模态潜在扩散框架,用于任意到任意生成,通过共享随机潜在变量统一了子集条件跨模态生成和无条件联合采样。现有的多模态生成模型大多基于LLM,这限制了利用特定模态生成器,并且需要文本配对数据进行训练。最近的基于扩散和流的任意到任意扩展采取了不同方向,但仍依赖于文本对齐嵌入、完全配对训练或匹配维度的确定性映射。MUNI基于两个互补贡献,一个架构上的,一个在训练目标上。首先,我们将潜在扩散扩展到端到端的多模态任意到任意生成:不是标准的两个阶段方案(预计算冻结潜在空间然后在其上拟合先验),MUNI联合训练特定模态编码器、表达性解码器和单个共享的基于流的先验,在一个目标下。其次,我们识别出多模态变分推断的标准聚合规则在与学习到的先验和表达性解码器结合时是不充分的。一个合适的共享潜在变量必须同时满足生成模态间的连贯性、子集潜在变量的预测充分性以及潜在内容的最小性。我们提出一个路由训练目标,其结构选择使潜在变量与这些标准对齐,并在可实现设置中允许最小充分性表征。在PolyMNIST-Quadrant-Labels和一个大规模图像-文本-音频基准上的实验表明,MUNI在条件生成上匹配或超过最强基线,同时在无条件连贯性上打开最大差距。项目页面:https://muni-proj.github.io/。

英文摘要

We introduce MUNI, an end-to-end multimodal latent diffusion framework for any-to-any generation that unifies subset-conditioned cross-modal generation and unconditional joint sampling through a shared stochastic latent. Existing multimodal generative models are largely LLM-based, which limits leveraging modality-specific generators and requires text-paired data for training. Recent diffusion- and flow-based any-to-any extensions take a different direction but still rely on text-aligned embeddings, fully-paired training, or matched-dimensionality deterministic mappings. MUNI rests on two complementary contributions, one architectural and one in the training objective. First, we extend latent diffusion to multimodal any-to-any generation end-to-end: instead of the standard two-stage recipe that precomputes a frozen latent space and then fits a prior over it, MUNI jointly trains modality-specific encoders, expressive decoders, and a single shared flow-based prior under one objective. Second, we identify that the standard aggregation rules of multimodal variational inference are insufficient once coupled with a learned prior and expressive decoders. A suitable shared latent must simultaneously satisfy coherence across generated modalities, predictive sufficiency of subset latents, and minimality of the latent content. We propose a routed training objective whose structural choices align the latent with these criteria and admit a minimal-sufficiency characterization in the realizable setting. Experiments on PolyMNIST-Quadrant-Labels and a large-scale image-text-audio benchmark show MUNI matching or exceeding the strongest baselines on conditional generation while opening its largest margins on unconditional coherence. Project page: https://muni-proj.github.io/.

2606.16407 2026-06-16 cs.CL cs.LG 新提交

A Mechanistic Understanding of Pronoun Fidelity in LLMs

对大型语言模型中代词忠实性的机制理解

Katharina Trinley, Jesujoba O. Alabi, Dietrich Klakow, Vagrant Gautam

发表机构 * Saarland University(萨尔大学) Heidelberg Institute for Theoretical Studies(海德堡理论研究所)

AI总结 通过因果分析发现,代词忠实性由组实体绑定、近因偏差和刻板印象偏差三种因果子空间共同作用,解释了91-99.5%的行为。

详情
AI中文摘要

忠实且稳健的代词使用对于公平和连贯的生成至关重要,然而当多个指代对象使用不同代词时,大型语言模型大多会失败。为了研究推理、重复和偏差在此任务中的相互作用,先前的工作完全依赖行为方法,这可能无法反映模型的内部运作。因此,我们提供了关于代词忠实性的机制性、模型内部视角,测试了三种机制——组实体绑定(G)、近因偏差(R)和刻板印象偏差(S)——是否在多个SOTA语言模型中因果实现。使用无界分布式对齐搜索,我们发现三者作为因果子空间共存,分布在网络深度上。没有单一机制能完全解释模型行为,但三者的组合一致地解释了91-99.5%。注意力头分析进一步揭示了两种竞争的复制路径;组绑定和刻板印象共享一个局部化的概念级路径,检索绑定的职业-代词单元,而近因使用分布式的令牌级路径,重复表面形式。总之,代词忠实性源于同时活跃的因果子空间之间的竞争。

英文摘要

Faithful and robust pronoun use is important for fair and coherent generations, yet large language models largely fail when multiple referents use different pronouns. To study the interplay of reasoning, repetition, and bias in this task, prior work relies exclusively on behavioural approaches, which may not reflect a model's internal workings. Therefore, we provide a mechanistic, model-internal perspective on pronoun fidelity, testing whether three mechanisms -- group entity binding (G), recency bias (R), and stereotypical bias (S) -- are causally implemented across several SOTA language models. Using Boundless Distributed Alignment Search, we find all three coexist as causal subspaces distributed across network depth. No single mechanism fully explains model behaviour, but a combination of the three consistently accounts for 91-99.5%. An attention head analysis further reveals two competing copying routes; group binding and stereotype share a localized concept-level route that retrieves a bound occupation-pronoun unit, while recency uses a distributed token-level route that repeats surface forms. In sum, pronoun fidelity arises from competition between simultaneously active causal subspaces.

2606.16401 2026-06-16 cs.CV 新提交

RGFVR: Reference-Guided Face Video Restoration with Flow Matching

RGFVR: 基于参考引导的流匹配人脸视频修复

Cem Eteke, Batuhan Tosun, Eckehard Steinbach

发表机构 * Chair of Media Technology, Munich Institute of Robotics and Machine Intelligence, School of Computation, Information, and Technology, Technical University of Munich(慕尼黑工业大学计算、信息与技术学院慕尼黑机器人与机器智能研究所媒体技术教席)

AI总结 提出一种主体无关的参考引导框架,通过双模态感知-描述身份条件注入预训练流匹配文本到视频生成器,结合两阶段训练策略,在降采样、模糊、噪声和压缩伪影等退化下提升人脸视频修复的保真度、时间一致性和身份保持。

详情
AI中文摘要

从退化观测中恢复人脸视频具有挑战性,因为它需要同时恢复视觉保真度、时间一致性和主体身份。现有方法要么是无参考的,当个体特定面部细节丢失时可能导致身份丢失,要么是主体特定的,限制了对未见身份的泛化。我们提出了一种主体无关的参考引导框架,用于身份保持的人脸视频修复。我们的方法将双模态感知-描述身份条件引入预训练的基于流的文本到视频生成器,并采用两阶段训练策略来增强修复过程中的身份引导。实验表明,我们的方法提高了修复保真度、时间一致性和身份保持,在包括降采样、模糊、噪声和压缩伪影在内的挑战性视频退化下实现了优越性能。代码可在 https://github.com/batuhanntosun/RG-FVR 获取。

英文摘要

Face video restoration from degraded observations is challenging, as it requires simultaneously recovering visual fidelity, temporal consistency, and subject identity. Existing approaches are often either reference-free, which can lead to identity loss when person-specific facial details are lost, or subject-specific, which limits generalization to unseen identities. We propose a subject-agnostic, reference-guided framework for identity-preserving face video restoration. Our method introduces bimodal perceptual-descriptive identity conditioning into a pretrained flow-based text-to-video generator and employs a two-stage training strategy to strengthen identity guidance during restoration. Experiments show that our approach improves restoration fidelity, temporal consistency, and identity preservation, achieving superior performance under challenging video degradations, including downsampling, blur, noise, and compression artifacts. The code is available at https://github.com/batuhanntosun/RG-FVR.

2606.16400 2026-06-16 cs.RO 新提交

SemGeoNav: A Safety-Guided Visual Navigation Approach with Semantic Reasoning and Geometric Planning

SemGeoNav:一种结合语义推理与几何规划的安全引导视觉导航方法

Yu Liu, Zongyang Chen, Yan Guo, Chao Liu, Xianfei Pan

发表机构 * College of Intelligence Science and Technology, National University of Defense Technology(国防科技大学智能科学学院)

AI总结 提出SemGeoNav分层视觉导航框架,融合端到端模型的高层语义推理与几何方法的可靠局部规划,实现鲁棒图像导航并显著提升避障能力,在真实机器人上优于ViNT和NoMaD。

Comments The paper has been accepted by ICGNC 2026

详情
AI中文摘要

基于学习的视觉导航增强了语义目标到达能力。然而,由于其黑箱特性,纯端到端模型通常缺乏显式的几何约束,导致在开放环境中避障不可预测且不可靠。相反,传统几何规划器确保安全性,但难以处理高维视觉目标。为了解决这些限制,我们提出了SemGeoNav,一种新颖的分层视觉导航框架。它紧密集成了端到端模型的高层语义推理与基于几何方法的可靠局部规划能力,实现了鲁棒的基于图像的导航,同时显著改善了避障。此外,我们引入了一种时间轨迹平滑机制,以确保机器人运动连续稳定。我们在真实环境中的Unitree Go2四足机器人上评估了SemGeoNav。结果表明,SemGeoNav优于现有代表性方法(包括ViNT和NoMaD),实现了更高的成功率和更短的导航时间。

英文摘要

Learning-based visual navigation has enhanced semantic goal-reaching capabilities. However, due to their black-box nature, purely end-to-end models often lack explicit geometric constraints, leading to unpredictable and unreliable obstacle avoidance in open environments. Conversely, traditional geometric planners ensure safety but struggle with high-dimensional visual targets. To address these limitations, we propose SemGeoNav, a novel hierarchical visual navigation framework. It tightly integrates the high-level semantic reasoning of end-to-end models with the reliable local planning ability of geometry-based methods, achieving robust image-based navigation while significantly improving obstacle avoidance. Furthermore, we introduce a temporal trajectory smoothing mechanism to ensure continuous and stable robot motion. We evaluated SemGeoNav on a Unitree Go2 quadruped robot in real-world environments. The results demonstrate that SemGeoNav outperforms existing representative methods, including ViNT and NoMaD, achieving higher success rates and shorter navigation times.

2606.16396 2026-06-16 cs.CV eess.IV 新提交

SP$^3$: Spherical Priors for Plug-and-Play Restoration

SP$^3$:用于即插即用恢复的球面先验

Sean Man, Ron Raphaeli, Matan Kleiner, Or Ronai

发表机构 * Technion – Israel Institute of Technology(以色列理工学院) Independent Researcher(独立研究员)

AI总结 提出SP$^3$算法,用球面编码器替代去噪器作为生成先验,通过半二次分裂实现快速图像恢复,速度比零样本扩散方法快3-630倍。

详情
AI中文摘要

在本文中,我们介绍了SP$^3$,一种新颖的即插即用算法,通过用球面编码器(SE)作为生成先验替代去噪器,加速最大后验图像恢复。SP$^3$利用SE紧密结构的潜在空间作为自然图像流形上的鲁棒投影,来近似难处理的近端先验步骤。通过半二次分裂,将该投影与闭式数据一致性步骤交替进行,实现了无需推理期间梯度计算的稳定收敛。这种独特的公式解锁了“任意时刻”恢复能力,从第一次迭代起就能产生清晰、合理的图像。在各种图像恢复任务上的评估表明,SP$^3$实现了与最先进的零样本扩散和流方法相当的感知质量,同时速度提升3-630倍。

英文摘要

In this paper, we introduce SP$^3$, a novel Plug-and-Play algorithm that accelerates maximum a posteriori image restoration by replacing denoisers with Spherical Encoders (SE) as generative priors. SP$^3$ approximates the intractable proximal prior step by using the SE's tightly structured latent space as a robust projection onto the natural image manifold. Alternating this projection with a closed-form data-consistency step, via Half-Quadratic Splitting, achieves stable convergence without requiring gradient computation during inference. This unique formulation unlocks "anytime" restoration capabilities, producing sharp, plausible images from the first iteration. Evaluations across a variety of image restoration tasks demonstrate that SP$^3$ achieves perceptual quality comparable to state-of-the-art zero-shot diffusion and flow methods while being $3$-$630\times$ faster.
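
The alternation described here (a prior projection followed by a closed-form data-consistency step) follows the general Half-Quadratic Splitting pattern, which can be sketched on a 1-D inpainting toy problem. Everything specific below is an assumption for illustration: the binary mask degradation, the penalty weight mu, and above all the moving-average `smooth_prior`, which merely stands in for a learned prior and has nothing to do with an actual Spherical Encoder.

```python
def data_step(y: list[float], mask: list[int], z: list[float], mu: float) -> list[float]:
    """Closed-form minimizer of 0.5*||mask*x - y||^2 + 0.5*mu*||x - z||^2.

    For a binary mask the problem separates per element:
    x_i = (mask_i * y_i + mu * z_i) / (mask_i + mu).
    """
    return [(m * yi + mu * zi) / (m + mu) for yi, m, zi in zip(y, mask, z)]


def smooth_prior(x: list[float]) -> list[float]:
    """Stand-in prior step: a 3-tap moving average with edge replication."""
    n = len(x)
    return [(x[max(i - 1, 0)] + x[i] + x[min(i + 1, n - 1)]) / 3.0 for i in range(n)]


def hqs_restore(y, mask, prior_step, mu: float = 0.5, iterations: int = 20):
    """Alternate the prior step and the data-consistency step."""
    x = list(y)
    for _ in range(iterations):
        z = prior_step(x)             # pull the estimate toward the prior
        x = data_step(y, mask, z, mu)  # re-impose agreement with observations
    return x


# A constant signal with the middle sample missing (mask = 0 there).
y = [1.0, 1.0, 0.0, 1.0, 1.0]
mask = [1, 1, 0, 1, 1]
restored = hqs_restore(y, mask, smooth_prior)
```

Neither step needs a gradient through the prior: the prior is only ever called as a forward map, and the data step is a per-element formula, which is the structural reason such schemes can be fast at inference time.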

2606.16392 2026-06-16 cs.CV 新提交

Towards UAV Image Dehazing: A UAV Atmospheric Scattering Model, Benchmark, and Geometry-Aware Deep Unfolding Network

面向无人机图像去雾:无人机大气散射模型、基准与几何感知深度展开网络

Wenxuan Fang, Jiangwei Weng, Yu Zheng, Junkai Fan, Guangfa Wang, Xiang Chen, Jian Yang, Jun Li

发表机构 * Nanjing University of Science and Technology(南京理工大学)

AI总结 提出无人机大气散射模型(UASM)描述非均匀雾分布,并设计几何感知深度展开网络(GP-DUN),通过隐式几何估计、几何感知梯度下降和池化专家近端映射模块实现高效去雾,在合成和真实数据集上超越现有方法。

详情
AI中文摘要

在无人机应用中,雾霾会显著模糊远处细节并削弱结构信息,阻碍细节恢复。当前无人机场景仍面临两个关键挑战:(i) 真实世界的成对雾/干净图像难以获取,而经典大气散射模型不足以建模无人机图像中空间非均匀的雾霾;(ii) 现有去雾方法难以去除无人机图像上部区域积累的浓雾。为解决这些问题,我们首先提出无人机大气散射模型(UASM),该模型显式地结合飞行高度、俯仰角和消光系数来表征无人机成像中的非均匀雾霾分布。基于UASM,我们开发了一个物理驱动的去雾框架,称为几何感知近端深度展开网络(GP-DUN)。具体来说,GP-DUN由三个关键模块组成:隐式几何估计器(LGE),用于推断与无人机成像几何一致的透射率;几何感知梯度下降模块(GeoGDM),将UASM嵌入数据保真项并执行物理一致的闭式更新;以及池化专家近端映射模块(PE-PMM),学习隐式先验以恢复超出显式物理建模能力的纹理和结构。此外,我们进一步构建了UASM-HazeSet,它提供可控的成对合成数据以及2,285张真实无人机雾霾图像用于测试。大量实验表明,GP-DUN在UASM-HazeSet和真实无人机雾霾基准上均持续优于现有方法。

英文摘要

In UAV applications, haze significantly obscures distant details and weakens structural information, hindering the recovery of details. Current UAV scenarios still face two key challenges: (i) paired hazy/clean images from the real world are unobtainable, while the classical atmospheric scattering model is inadequate for modeling the spatially non-uniform haze in UAV imagery; (ii) existing dehazing methods struggle to remove the heavy haze accumulated in the upper regions of UAV images. To address these issues, we first propose a UAV Atmospheric Scattering Model (UASM), which explicitly incorporates flight altitude, viewing pitch, and extinction to characterize the non-uniform haze distribution in UAV imaging. Based on UASM, we develop a physics-driven dehazing framework, termed Geometry-aware Proximal Deep Unfolding Network (GP-DUN). Specifically, GP-DUN consists of three key modules: a Latent Geometry Estimator (LGE) that infers transmittance consistent with UAV imaging geometry, a Geometry-aware Gradient Descent Module (GeoGDM) that embeds UASM into the data-fidelity term and performs physics-consistent closed-form updates, and a Pooling-Expert Proximal Mapping Module (PE-PMM) that learns an implicit prior to restore textures and structures beyond the capability of explicit physical modeling. In addition, we further construct UASM-HazeSet, which provides controllable paired synthetic data together with 2,285 real UAV haze images for testing. Extensive experiments show that GP-DUN consistently outperforms existing methods on both UASM-HazeSet and real UAV haze benchmarks.
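
To see why haze piles up in the upper part of a downward-looking UAV image, it helps to combine the classical atmospheric scattering model, I = J*t + A*(1 - t) with t = exp(-beta * d), with a simple viewing geometry. The sketch below is not the paper's UASM: the flat-ground slant range d = h / sin(theta) and all numeric values are assumptions chosen only for illustration.

```python
import math


def slant_range(altitude_m: float, depression_deg: float) -> float:
    """Distance to flat ground along a ray pointing depression_deg below horizontal."""
    return altitude_m / math.sin(math.radians(depression_deg))


def transmission(beta: float, distance_m: float) -> float:
    """Classical transmission t = exp(-beta * d), with beta the extinction coefficient."""
    return math.exp(-beta * distance_m)


def hazy_pixel(j: float, airlight: float, t: float) -> float:
    """Classical scattering model: I = J * t + A * (1 - t)."""
    return j * t + airlight * (1.0 - t)


def dehaze_pixel(i: float, airlight: float, t: float, t_min: float = 0.05) -> float:
    """Invert the model; t is clamped to avoid dividing by a near-zero value."""
    t = max(t, t_min)
    return (i - airlight) / t + airlight


# Hypothetical flight: 100 m altitude, extinction 0.002 per metre.
beta, altitude = 0.002, 100.0
t_top = transmission(beta, slant_range(altitude, 10.0))     # ray near the horizon
t_bottom = transmission(beta, slant_range(altitude, 60.0))  # ray pointing steeply down
t_mid = transmission(beta, slant_range(altitude, 30.0))
roundtrip = dehaze_pixel(hazy_pixel(0.3, 0.9, t_mid), 0.9, t_mid)
```

Rays near the top of the frame graze the ground at a shallow angle, travel much farther through the haze, and therefore have far lower transmission than rays near the bottom, which is the spatial non-uniformity a single global depth-free model cannot express.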

2606.16388 2026-06-16 cs.LG 新提交

Robust Neural Tucker Factorization with Bias Correction and Adaptive Initialization

鲁棒神经Tucker分解:偏差校正与自适应初始化

Yuchao Su, Yixin Ran

发表机构 * School of Computer Science and Engineering, Chongqing University of Science and Technology(重庆科技大学计算机科学与工程学院) College of Computer and Information Science, School of Software, Southwest University(西南大学计算机与信息科学学院 软件学院)

AI总结 提出KaBiN模型,结合Kaiming初始化和偏差校正,解决高维不完全张量补全中初始化不当和偏差缺失导致的优化不稳定问题。

Comments 9 pages, 3 figures, 106 conferences

详情
AI中文摘要

高维不完全(HDI)张量广泛应用于交通和气候领域,但稀疏观测使得准确补全困难。跨不同多模态场的固有非线性动态和非平稳变化严重阻碍了传统线性重构框架的有效性。神经Tucker分解为建模张量模式间的高阶交互提供了有效框架。通过将底层结构特征参数化为连续潜在空间,神经表示规避了经典代数的刚性低秩约束。然而,其性能仍可能受到实现层面选择的影响,尤其是参数初始化和最终输出映射的偏差配置。次优初始化常导致立方扩展交互空间中的方差爆炸,将后续非线性激活边界推入严重梯度饱和区域,而忽略专用平移参数迫使交互权重隐式吸收全局统计偏差。本文提出一种简单有效的神经Tucker分解模型,结合Kaiming初始化和偏差校正(KaBiN),用于HDI张量补全。所提模型对嵌入和Tucker线性参数采用Kaiming均匀初始化,并在输出映射中采用简单偏差校正。通过优雅地将全局均值偏移与局部结构表示解耦,该框架提供了高度稳定且条件良好的优化景观。在三个真实HDI张量数据集上的实验表明,KaBiN在引入最小计算开销的同时,实现了优于原始NeuTucF的性能。

英文摘要

High-dimensional incomplete (HDI) tensors are widely used in traffic and climate applications, but sparse observations make accurate completion difficult. The intrinsic non-linear dynamics and non-stationary variations across distinct multi-modal fields severely hinder the efficacy of conventional linear reconstruction frameworks. Neural Tucker factorization provides an effective framework for modeling high-order interactions among tensor modes. By parameterizing underlying structural characteristics into continuous latent spaces, neural representations circumvent the rigid low-rank constraints of classical algebra. However, its performance can still be affected by implementation-level choices, especially parameter initialization and the bias configuration of the final output mapping. Suboptimal initializations frequently lead to variance explosion across the cubically expanded interaction spaces, driving the subsequent non-linear activation boundaries into severe gradient saturation zones, while the omission of a dedicated translation parameter forces interaction weights to implicitly absorb global statistical deviations. This paper proposes a simple yet effective neural Tucker factorization model with Kaiming initialization and bias correction (KaBiN) for HDI tensor completion. The proposed model utilizes Kaiming uniform initialization for the embedding and Tucker linear parameters, and adopts a simple bias correction in output mapping. By elegantly decoupling global mean shifts from local structural representations, the framework provides a highly stable and well-conditioned optimization landscape. Experiments on three real-world HDI tensor datasets show that KaBiN achieves better performance than the original NeuTucF, while introducing minimal computational overhead.
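
The two implementation choices named in this abstract can be illustrated in isolation. The Kaiming uniform rule below is the standard one (samples from U(-b, b) with b = gain * sqrt(3 / fan_in)); the bias part is only one plausible reading of "bias correction in output mapping", namely a scalar offset added to the interaction score, initialised here to the observed mean purely for illustration. The names and data are hypothetical and this is not the KaBiN code.

```python
import math
import random


def kaiming_uniform(fan_in: int, fan_out: int, gain: float = math.sqrt(2.0),
                    seed: int = 0) -> list[list[float]]:
    """Sample a fan_out x fan_in weight matrix from U(-b, b), b = gain*sqrt(3/fan_in)."""
    rng = random.Random(seed)
    bound = gain * math.sqrt(3.0 / fan_in)
    return [[rng.uniform(-bound, bound) for _ in range(fan_in)] for _ in range(fan_out)]


def predict(interaction_score: float, bias: float) -> float:
    """Output mapping with an explicit offset (assumed form of the bias term)."""
    return interaction_score + bias


# Hypothetical observed tensor entries; their mean becomes the initial offset,
# so the interaction weights only need to model deviations from it.
observed = [3.2, 2.8, 3.0, 3.4]
bias = sum(observed) / len(observed)

weights = kaiming_uniform(fan_in=8, fan_out=4)
bound = math.sqrt(2.0) * math.sqrt(3.0 / 8)
```

Scaling the bound by 1/sqrt(fan_in) keeps the variance of each pre-activation roughly independent of layer width, and a separate offset means the factors do not have to encode the global mean themselves.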

2606.16384 2026-06-16 cs.LG 新提交

Mixtures of Subspaces for Bandwidth Efficient Context Parallel Training

子空间混合:面向带宽高效上下文并行训练

Sameera Ramasinghe, Ajanthan Thalaiyasingam, Hadi Mohaghegh Dolatabadi, Gil Avraham, Violetta Shevchenko, Yan Zuo, Chamin Hewa Koneputugodage, Alexander Long

发表机构 * Pluralis Research

AI总结 提出一种基于子空间混合的压缩方法,在低带宽分布式训练中实现超过95%的通信压缩,支持百亿参数模型在100K上下文长度下高效训练。

详情
AI中文摘要

预训练具有扩展上下文窗口的语言模型增强了它们在生成过程中利用丰富信息的能力。现有方法将输入序列分割成块,广播到多个设备,并逐块计算注意力,这带来了显著的通信开销。虽然在高速集群中可行,但这些方法在低带宽连接上的去中心化训练中不实用。我们提出了一种用于去中心化设置中通信高效上下文并行的压缩方法,实现了超过95%的显著压缩率,开销极小且无收敛损失。我们的关键洞察是通过高效重参数化,将激活输出动态约束到学习到的子空间混合,从而利用其内在的低秩结构。我们展示了将十亿参数去中心化模型扩展到超过100K令牌的上下文长度,在慢至300Mbps的网络上,匹配了在100Gbps互连上的集中式模型的壁钟收敛速度。

英文摘要

Pretraining language models with extended context windows enhances their ability to leverage rich information during generation. Existing methods split input sequences into chunks, broadcast them across multiple devices, and compute attention block by block, which incurs significant communication overhead. While feasible in high-speed clusters, these methods are impractical for decentralized training over low-bandwidth connections. We propose a compression method for communication-efficient context parallelism in decentralized settings, achieving a remarkable compression rate of over 95% with negligible overhead and no loss in convergence. Our key insight is to exploit the intrinsic low-rank structure of activation outputs by dynamically constraining them to learned mixtures of subspaces via efficient reparameterizations. We demonstrate scaling billion-parameter decentralized models to context lengths exceeding 100K tokens on networks as slow as 300 Mbps, matching the wall-clock convergence speed of centralized models on 100 Gbps interconnects.

2606.16383 2026-06-16 cs.CL 新提交

Surpassing Scale by Efficiency: A Compact 135M Parameter Foundational LLM Natively Adapted for the Bangla Language

通过效率超越规模:一个紧凑的135M参数基础LLM原生适配孟加拉语

Rabindra Nath Nandi

发表机构 * Independent Researcher(独立研究者)

AI总结 提出一个135M参数的紧凑型解码器模型bangla-smollm-135m,通过确定性交集-追加词元合并策略解决孟加拉语子词碎片化,在零样本多任务基准上匹配或超越两倍大小模型。

Comments Submitted to a Workshop

详情
AI中文摘要

虽然自然语言处理领域由数十亿参数的架构主导,但它们在低资源、非拉丁文字中的部署对于边缘配置、移动系统和分散的本地硬件来说在计算上仍然难以承受。本文提出了bangla-smollm-135m,一个高度紧凑的1.35亿参数仅解码器基础模型,专为孟加拉语脚本的高效语言建模而设计。通过利用TituLLMs和SmolLM2-135M之间的确定性交集-追加词元合并策略,该模型克服了子词脚本碎片化,同时不破坏早期预训练参数状态。在零样本多任务基准评估(PIQA_bn、OpenBookQA_bn、CommonsenseQA_bn和Bangla_MMLU)中,bangla-smollm-135m匹配或超越了其两倍大小的模型(Gemma-3-270m),并与1B参数级别的模型达到同等水平。该模型可在rnnandi/bangla-smollm-135m获取。

英文摘要

While the NLP landscape is dominated by multi-billion parameter architectures, their deployment in low-resource, non-Latin scripts remains computationally prohibitive for edge configurations, mobile systems, and decentralized local hardware. This paper presents bangla-smollm-135m, a highly compact 135-million parameter decoder-only foundational model engineered explicitly for high-efficiency language modeling in the Bangla script. By leveraging a deterministic intersect-and-append token merging strategy between TituLLMs and SmolLM2-135M, the model overcomes subword script fragmentation without destabilizing early pretrained parameter states. In zero-shot multi-task benchmark evaluations (PIQA_bn, OpenBookQA_bn, CommonsenseQA_bn, and Bangla_MMLU), bangla-smollm-135m matches or outperforms models twice its size (Gemma-3-270m) and achieves parity with models in the 1B parameter tier. The model is available at rnnandi/bangla-smollm-135m
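
One way to read "intersect-and-append" is sketched below: tokens shared by both vocabularies keep their base-model ids, and tokens found only in the second vocabulary are appended after the last base id in a fixed order. This is my interpretation of the phrase, not the author's published procedure, and the toy vocabularies are hypothetical.

```python
def intersect_and_append(base_vocab: dict[str, int], new_tokens: list[str]) -> dict[str, int]:
    """Merge new_tokens into base_vocab without changing any existing id.

    Tokens already present (the intersection) keep their base id; the rest
    are appended in sorted order so the result is deterministic.
    """
    merged = dict(base_vocab)
    next_id = max(base_vocab.values()) + 1 if base_vocab else 0
    for token in sorted(set(new_tokens)):
        if token in merged:
            continue  # shared token: reuse the base-model id and embedding row
        merged[token] = next_id
        next_id += 1
    return merged


# Toy example: a tiny base vocabulary and a few Bangla-side tokens.
base = {"<pad>": 0, "the": 1, "ing": 2}
bangla_side = ["বাংলা", "the", "ভাষা"]
merged = intersect_and_append(base, bangla_side)
```

Under this reading, every pretrained embedding row stays at its original index and only the appended rows need fresh initialization, which would be consistent with the abstract's remark about not destabilizing pretrained parameter states.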