arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.10939 2026-06-10 cs.CV New submission

PENet+: A Lightweight Residual Transformer Framework for Efficient Image Steganalysis

Jincheol AN, Dongsu Kim, Haneol Jang, YoungJoon Yoo

Affiliations: Chung-Ang University; Hanbat National University; SNUAILAB

AI summary: Proposes PENet+, which retains the self-attention topology, streamlines the classifier channels, and adds an activation-aware HPF stem and a MobileNetV2 backbone, substantially reducing parameters and FLOPs while maintaining detection accuracy.

Comments
IEEE ACCESS
Abstract

Image steganalysis, the detection of hidden information embedded in digital images, is a core component of modern cybersecurity and digital forensics. Recent residual Transformer architectures, such as the Pixel-Difference-Convolution and Enhanced-Transformer-Network (PENet) [1], achieve strong detection accuracy, but their computational and memory demands hinder deployment in resource-constrained settings. We present PENet+, a lightweight steganalysis framework that preserves PENet's discriminative structure while substantially improving efficiency. Rather than redesigning or compressing the attention blocks, we retain PENet's self-attention topology for reproducibility and add a classifier-streamlining stage that progressively narrows the SPP-to-FC1 input channels (SPP: spatial pyramid pooling; FC1: first fully connected layer), yielding large reductions in parameters and FLOPs with negligible accuracy loss. We further refine the high-pass-filter (HPF) stem with an activation-aware mechanism that aggregates HPF responses early and selects a balanced SRM-Gabor top-K subset, and we replace PENet's backbone with a MobileNetV2-style inverted residual network. A balanced configuration with K=31 filters (16 Gabor + 15 SRM) matches or surpasses heavier settings at lower compute. Finally, we motivate PReLU from a steganalysis standpoint, arguing that preserving negative responses helps capture weak stego cues that ReLU suppresses. On a disjoint ALASKA2 JPEG QF90 protocol at 512x512 resolution (5,000 cover images for training, validation, and internal testing; a separate 19,000-cover evaluation set), PENet+ achieves up to 45.5% fewer parameters and about 97% fewer FLOPs than the re-evaluated PENet baseline, offering a computationally efficient direction for resource-constrained steganalysis. Device-level latency and power measurements remain future work.
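The PReLU argument in the abstract above can be made concrete with a small numeric sketch. The snippet is illustrative only and is not taken from the PENet+ code: the residual values are made up, and the negative slope of 0.25 is an assumed starting value (in practice the slope is a learned parameter).

```python
from typing import List


def relu(x: float) -> float:
    """Standard ReLU: every negative response becomes zero."""
    return max(0.0, x)


def prelu(x: float, slope: float = 0.25) -> float:
    """PReLU: negative responses are scaled by `slope` instead of being zeroed.

    The slope value here is an assumption for illustration; it is normally learned.
    """
    return x if x >= 0.0 else slope * x


# Hypothetical high-pass-filter residuals; weak stego traces can have either sign.
residual: List[float] = [0.8, -0.4, 0.0, -1.2]

relu_out = [relu(v) for v in residual]
prelu_out = [prelu(v) for v in residual]

print("ReLU :", relu_out)
print("PReLU:", prelu_out)
```

ReLU collapses both negative residuals to the same value, while PReLU keeps their sign and relative magnitude, which is the property the authors argue helps with weak stego cues.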

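2606.10935 2026-06-10 cs.LG cs.AI New submission

CLP: Collocation-Length Prediction for Zero-Loss Adaptive Multi-Token Inference

Xuezhen Xie, Zhiqiang Zhou

Affiliations: Xiamen University; Tsinghua University

AI summary: Proposes CLP, which uses a lightweight linear layer to predict how many additional tokens can be safely accepted, addressing the output degradation caused by head-backbone competition in multi-token prediction and achieving 1.14x-1.29x speedup on Qwen2.5 models with zero quality loss.
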
Comments
13 pages, 8 figures, 8 tables
Abstract

Large language model inference is bottlenecked by autoregressive decoding, where each token requires a full forward pass. Multi-token prediction (MTP) offers a promising acceleration path, but existing approaches suffer from a fundamental architectural flaw: the MTP head for the first token competes with the backbone's own language model (LM) head, leading to severe quality degradation when predictions are accepted. We identify this head-backbone competition as the root cause of repetitive and incoherent outputs in prior MTP-based acceleration methods. To address this, we propose Backbone-as-Architect, a design principle where the backbone LM head always generates the first token, and MTP heads are responsible only for subsequent tokens. Building on this principle, we introduce CLP (Collocation-Length Predictor), a lightweight span-level decision layer that predicts how many additional tokens can be safely accepted at each decoding step. CLP uses only a single linear layer (4.6K--7.7K parameters), replacing the over-engineered 1M-parameter gate networks used in prior work. Experiments on Qwen2.5 models (0.5B, 1.5B, 7B) show that CLP achieves 1.20x--1.29x speedup on 1.5B and 1.14x--1.20x on 7B, with zero quality degradation (repetition ratio < 0.02), while gate-based approaches fail to accelerate (1.07x) or produce severely degraded outputs (repetition ratio > 0.5%). We further demonstrate that shorter prediction horizons (k=2) recover 24% higher MTP head accuracy on large models, establishing a scaling-aware design principle. We identify MTP head prediction accuracy as the binding constraint on acceleration and establish a clear roadmap for future improvements.
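The decoding rule described in the abstract above (the backbone head always emits the first token, and a single linear layer decides how many MTP tokens to append) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy weights, the bias-free layer, and the class layout (one class per candidate span length) are all hypothetical.

```python
from typing import List, Sequence


def predict_extra_tokens(weights: Sequence[Sequence[float]],
                         hidden: Sequence[float]) -> int:
    """Single bias-free linear layer: one score per span length, then argmax.

    Row i of `weights` scores the decision "accept i extra tokens".
    """
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in weights]
    return max(range(len(scores)), key=lambda i: scores[i])


def decode_step(backbone_token: str, mtp_tokens: List[str], n_extra: int) -> List[str]:
    """The backbone LM head supplies the first token; MTP heads only extend it."""
    return [backbone_token] + mtp_tokens[:n_extra]


def linear_param_count(hidden_size: int, n_classes: int) -> int:
    """Weight count of a bias-free linear layer."""
    return hidden_size * n_classes


# Toy example: 2-dimensional hidden state, classes = accept 0, 1, or 2 extra tokens.
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
hidden = [0.2, 0.9]

n_extra = predict_extra_tokens(weights, hidden)
emitted = decode_step("The", ["quick", "brown"], n_extra)
print(n_extra, emitted)

# Arithmetic only: an assumed hidden width of 1536 with 3 classes.
print(linear_param_count(1536, 3))
```

With an assumed hidden width of 1536 and three classes, such a layer has 4,608 weights, which is on the order of the 4.6K figure quoted in the abstract; the abstract itself does not state the layer's exact shape.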

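2606.10933 2026-06-10 cs.AI New submission

Frontier Coding Agents Use Metaprogramming to Adapt to Unfamiliar Programming Languages

Aman Sharma, Sushrut Thorat, Paras Chopra

Affiliations: Lossfunk

AI summary: Evaluates LLM coding agents on unfamiliar languages and finds that strong agents adapt through metaprogramming (for example, generating target-language code with Python); forbidding this strategy causes large performance drops, while weak agents cannot benefit from it.
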
Comments
43 pages, 8 figures
Abstract

LLM-based coding agents are usually evaluated in familiar software settings: mainstream languages, common libraries, and public repositories. These benchmarks remain important, but they can hide how agents behave when the language itself is unfamiliar. We evaluate six contemporary coding agents on four esoteric programming languages using a sequential setup with file editing, local execution, and hidden-test grading. Our protocol exposes capability differences between these agents that mainstream coding and agentic benchmarks such as SWE-Bench Verified and Terminal-Bench 2.0 compress into much narrower bands. We observe that the strongest agents, Claude Opus 4.6 and GPT-5.4 xhigh, often avoid writing the target language directly. On Brainfuck and Befunge-98, they write Python programs that generate target-language code and debug those generators locally. Forbidding this metaprogramming strategy causes large performance drops. Text guidance distilled from this strategy does not materially improve weaker agents. In contrast, Opus-derived Python helper code for building generators, with no solved benchmark programs or hidden-test answers, sharply improves Sonnet 4.6 and GPT-5.4 mini on the same problems, while Haiku 4.5 remains low. More interpreter calls and output tokens improve stronger agents but leave weaker agents near their original performance, indicating that these resources amplify useful strategies rather than create them. Together, these results show that strong coding agents adapt to unfamiliar languages by using tools, feedback, and workspace state to build a working model of the target language. Metaprogramming is the clearest case, but the broader gap is constructing and debugging a strategy that works under the target language's rules.

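2606.10932 2026-06-10 cs.CL cs.LG New submission

Density Field State Space Models: 1-Bit Distillation, Efficient Inference, and Knowledge Organization in Mamba-2

Chirag Shinde

Affiliations: Independent Researcher

AI summary: Proposes the DF-SSM framework, which compresses SSMs to a 1-bit scaffold with int8 low-rank correction; applied to Mamba-2 1.3B, it achieves 9.7x compression and 21.4x faster inference with only 32M tokens and 6 hours of distillation, and identifies three processing phases in the model's internal knowledge organization.
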
Comments
16 pages, 6 figures, 7 tables. Code available at https://github.com/cs-cmyk/df-ssm
Abstract

We present Density Field State Space Models (DF-SSM), a framework for compressing SSMs to a 1-bit scaffold with int8 low-rank correction. Applied to Mamba-2 1.3B, DF-SSM yields a 278 MB model (9.7x smaller than the 2.7 GB FP16 teacher) that runs inference 21.4x faster on GPU (batch=1, relative to the mamba-ssm reference implementation) while maintaining downstream task performance within 2-4 percentage points of BitMamba-2, a 1.58-bit model trained from scratch on 150B tokens. The distillation itself requires only 32M tokens and 6 hours on a single A100 GPU, though it presupposes a pretrained FP16 teacher. We develop an optimized inference pipeline combining cuBLAS INT8 tensor cores for the scaffold matmul, custom CUDA kernels for stateful SSM and convolution operations, and an AVX-512 CPU backend for efficient deployment on both GPU and CPU. Beyond compression, we investigate the internal knowledge organization of the resulting model, discovering three distinct processing phases: intent classification (layers 0-3, operating in an abstract space with no vocabulary alignment), knowledge retrieval (layers 25-35, where factual associations localize to a 5-layer window), and output formatting (layers 36-47, where category structure dissolves). Through systematic analysis of 445 factual prompts across 19 categories, we find that early-layer classification is syntactic (driven by template structure) rather than semantic, and that the model exhibits well-organized knowledge representations despite weak factual recall, suggesting that representational structure may precede factual strength.

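2606.10929 2026-06-10 cs.LG cs.AI New submission

Recoverable but Not Stationary: Local Linear Structures in Weights and Activations

Irina Piontkovskaia, Sergey Nikolenko

Affiliations: St. Petersburg Department of the Steklov Institute of Mathematics; St. Petersburg State University

AI summary: Studies the existence and scale of linear structures in neural networks, finding local low-rank task-gradient structure while rejecting the fixed-task-plane hypothesis; the first recovery updates form a trajectory-prefix basis that captures most of the recovery displacement; develops random search theory to explain the effectiveness of high-dimensional random parameter search, and examines the relation between parameter perturbations and activation steering.
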
Comments
23 pages, 8 tables, 9 figures
Abstract

Task vectors, LoRA, activation steering, and random search around pretrained weights all suggest that learned behaviour can be controlled by linear directions. We ask which linear structures actually exist and on what scale. In a synthetic multitask transformer and LoRA adapters on DistilGPT-2 / GPT-2 we find strong local low-rank task-gradient structure but reject the fixed-task-plane hypothesis: static bases miss the recovery direction, and the useful basis drifts substantially within 100 steps. However, the first recovery updates form a trajectory-prefix basis capturing 77% of the LoRA recovery displacement. We develop random search theory with a Gaussian local-linear theorem that justifies the effectiveness of random parameter search even in very high dimensions. We also study the relation between parameter perturbations and activation steering: a single gradient step produces an activation shift with 0.58 cosine to a labelled-contrast CAA steering vector, with a similar steering effect on Qwen-0.5B BoolQ statements. We validate our results with experiments on synthetic Transformers and LLMs. Our results suggest that linear structures in trained networks are not global task directions, but evolving local geometries that partially persist across parameter and activation spaces.
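One way to read the statement that a basis "captures" a share of a displacement is as the fraction of the displacement's squared norm that lies in the span of the basis. The sketch below computes that quantity with Gram-Schmidt orthonormalization. It is an assumed reading for illustration only: the abstract does not give the exact metric, and the vectors and function names here are made up.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def orthonormalize(basis: List[Vector], tol: float = 1e-12) -> List[Vector]:
    """Modified Gram-Schmidt; near-dependent vectors are dropped."""
    ortho: List[Vector] = []
    for v in basis:
        w = list(v)
        for q in ortho:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:
            ortho.append([wi / norm for wi in w])
    return ortho


def captured_fraction(basis: List[Vector], displacement: Vector) -> float:
    """Share of ||displacement||^2 that lies inside span(basis)."""
    ortho = orthonormalize(basis)
    projected_sq = sum(dot(displacement, q) ** 2 for q in ortho)
    return projected_sq / dot(displacement, displacement)


# Hypothetical data: two early update directions and a later total displacement.
prefix_updates = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
recovery_displacement = [3.0, 4.0, 12.0]

fraction = captured_fraction(prefix_updates, recovery_displacement)
print(round(fraction, 4))
```

Here the two prefix directions span the first two coordinates, so the captured share is 25/169; a static basis that misses the recovery direction would score lower under the same measure.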

2606.10927 2026-06-10 cs.RO New submission

AllDayNav: Lifelong Navigation via Real-World Reinforcement Learning

Hang Yin, Yinan Liang, Jiazhao Zhang, Jiahang Liu, Minghan Li, Zhizheng Zhang, He Wang

Affiliations: Tsinghua University; Galbot Robotics; Peking University; Beijing Academy of Artificial Intelligence

AI summary: Proposes the AllDayNav framework, which uses a self-evolving multimodal memory and reinforcement learning to implicitly encode scene dynamics, achieving success rates approaching 100% in cross-room, cross-episode, and cross-task scenarios and surpassing map-based, VLM, and RL baselines.

Comments
Project Page: https://bagh2178.github.io/AllDayNav/
Abstract

Lifelong embodied navigation in dynamic environments requires robots to form persistent scene understanding from fragmentary observations, which remains difficult for existing methods that rely on explicit maps or scene graphs and struggle to generalize beyond structured settings. We propose AllDayNav, a lifelong self-learning navigation framework that implicitly encodes scene dynamics into the billion-scale parameters of a large model via reinforcement learning, powered by a self-evolving multimodal memory that maintains and updates visual keyframes, semantic descriptions, and temporal context while autonomously generating open-vocabulary instructions, image goals, and structured rewards. Experiments in both synthetic and real-world environments across cross-room, cross-episode, and cross-task scenarios show that AllDayNav achieves success rates approaching 100% and consistently surpasses strong map-based, VLM, and RL baselines in path efficiency and robustness, demonstrating implicit, memory-driven reinforcement learning as a scalable alternative to explicit mapping for reliable lifelong navigation.

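2606.10921 2026-06-10 cs.CL New submission

Trace Only What You Need: Structure-Aware On-Demand Hypergraph Memory for Long-Document Question Answering

Xiangjun Zai, Xingyu Tan, Chen Chen, Xiaoyang Wang, Wenjie Zhang

Affiliations: University of New South Wales; CSIRO; University of Wollongong

AI summary: Proposes DocTrace, a multi-agent RAG framework that uses query-triggered knowledge organization, document-structure awareness, and experience-guided reasoning to address the high cost of knowledge organization, insufficient use of structure, and lack of reasoning-experience reuse in long-document QA, achieving the best performance on three datasets.
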
Abstract

Long-document question answering (QA) requires large language models (LLMs) to reason over evidence scattered across lengthy documents, where answers often depend on event order, section-level context, and cross-part evidence connections. Although retrieval-augmented generation (RAG) reduces the input context by retrieving relevant evidence, existing structured RAG methods still face three limitations: costly query-agnostic knowledge organization, insufficient use of original document structure, and no reuse of historical reasoning experience. To address these limitations, we propose DocTrace, a multi-agent RAG framework for long-document QA that supports query-triggered knowledge organization, document-structure awareness, and experience-guided reasoning. DocTrace preserves document hierarchy with a lightweight document structural tree index, constructs agent-shared hypergraph-structured working memory on demand during reasoning, and stores successful reasoning plans in graph-structured experience memory for future reuse, enabling adaptive exploration across related long-document questions. Experiments on four long-document QA datasets show that DocTrace achieves the best performance on three datasets, surpassing the strongest baseline, ComoRAG, by up to 8.85% in F1 and 4.40% in EM, while reducing the overall computational cost by 53.32%.

2606.10918 2026-06-10 cs.RO cs.LG New submission

Task Robustness via Re-Labelling Vision-Action Robot Data

Artur Kuramshin, Özgür Aslan, Cyrus Neary, Glen Berseth

Affiliations: Mila - Quebec AI Institute; Université de Montréal; The University of British Columbia

AI summary: Proposes the TREAD framework, which uses large vision-language models to perform semantic sub-task decomposition and diverse instruction generation on robot datasets, improving policy generalization to unseen tasks without additional data collection.

Comments
Project website: https://akuramshin.github.io/tread
Abstract

The recent trend in scaling models for robot learning has resulted in impressive policies that can perform various manipulation tasks and generalize to novel scenarios. However, these policies continue to struggle with following instructions, likely due to the limited linguistic and action sequence diversity in existing robotics datasets. This paper introduces Task Robustness via Re-Labelling Vision-Action Robot Data (TREAD), a scalable framework that leverages large Vision-Language Models (VLMs) to augment existing robotics datasets without additional data collection, harnessing the transferable knowledge embedded in these models. Our approach leverages a pretrained VLM through three stages: generating semantic sub-tasks from original instruction labels and initial scenes, segmenting demonstration videos conditioned on these sub-tasks, and producing diverse instructions that incorporate object properties, effectively decomposing longer demonstrations into grounded language-action pairs. We further enhance robustness by augmenting the data with linguistically diverse versions of the text goals. Evaluations on LIBERO demonstrate that policies trained on our augmented datasets exhibit improved performance on novel, unseen tasks and goals. Our results show that TREAD enhances both planning generalization through trajectory decomposition and language-conditioned policy generalization through increased linguistic diversity.

2606.10917 2026-06-10 cs.AI New submission

Role-Agent: Bootstrapping LLM Agents via Dual-Role Evolution

Xucong Wang, Ziyu Ma, Shidong Yang, Tongwen Huang, Pengkun Wang, Yong Wang, Xiangxiang Chu

Affiliations: University of Science and Technology of China; AMAP, Alibaba Group

AI summary: Proposes the Role-Agent framework, in which a single LLM acts as both agent and environment; its two components, World-In-Agent (WIA) and Agent-In-World (AIW), enable bootstrapped co-evolution, yielding an average gain of over 4% on multiple benchmarks.

Comments
20 pages, including 12 pages of main text and 8 pages of appendix; work in progress
Abstract

Although Large Language Model (LLM) agents have demonstrated strong performance on complex tasks, their learning is often limited by inefficient interaction feedback and static training environments, which hinder broader generalization. To address these limitations, this paper introduces Role-Agent, a framework that harnesses a single LLM to function concurrently as both the agent and the environment, enabling a bootstrapped co-evolution. Role-Agent comprises two synergistic components: World-In-Agent (WIA) and Agent-In-World (AIW). In WIA, the LLM acts as the agent and predicts future states after each action; the alignment between predicted and actual states is then used as a process reward, encouraging environment-aware reasoning. In AIW, the LLM analyzes failure modes from failed trajectories and retrieves tasks with similar failure patterns, thereby reshaping the training data distribution for targeted practice. Experiments on multiple benchmarks show that Role-Agent consistently improves performance, yielding an average gain of over 4% over strong baselines.

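2606.10913 2026-06-10 cs.LG stat.ML New submission

Conservation Laws from Data Symmetry in Neural Networks

Jakob Galley, Vahid Shahverdi, Axel Flinth

Affiliations: Umeå University

AI summary: Studies whether symmetries of the training data produce conserved quantities during gradient-flow training, proving that for analytic non-polynomial loss functions data symmetries generically yield no additional conserved quantities; for MSE loss, data augmentation can yield extra conserved quantities, a phenomenon described using the framework of tensorizable networks.
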
Abstract

We explore whether intrinsic symmetries of the training data lead to conserved quantities during gradient-flow training of neural networks. Under the assumption that the loss function is analytic and non-polynomial, we prove that data symmetries generically do not induce any additional integrals of motion. For mean squared error (MSE) loss, on the other hand, there are situations in which data augmentation yields extra conserved quantities. We build a framework that uses tensorizable networks to describe this phenomenon. Tensorizable networks are a family of architectures whose dependence on parameters and inputs can be separated using an intermediate representation. They include linear and polynomial networks, as well as Lightning Attention.

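2606.10912 2026-06-10 cs.SD cs.AI cs.CR cs.LG New submission

What Do Deepfake Speech Detectors Actually Hear?

Vojtěch Staněk, Veronika Jirmusová, Anton Firc, Kamil Malinka, Jakub Reš, Martin Perešíni

Affiliations: Brno University of Technology

AI summary: Proposes an explainability method based on self-supervised representations and Integrated Gradients and analyzes the decision cues of three WavLM detectors on ASVspoof 5, finding that they rely on environmental noise, phoneme artifacts, and word boundaries, respectively.
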
Comments
Accepted to Interspeech 2026
Abstract

Deepfake speech detectors often output a single score without explaining why an audio sample is flagged, where in the signal the evidence lies, or what cues drive the decision. We propose an audio-native explainability pipeline using Integrated Gradients on time-aligned self-supervised representations to localize decision evidence over time. We apply the proposed method to three WavLM-based detectors (AASIST, CA-MHFA, SLS) on ASVspoof 5 and manually annotate the highest-attribution regions to provide a semantic meaning of the most important cues. Despite similar performance, the detectors rely on different cues: AASIST emphasizes non-speech/environment cues, CA-MHFA focuses on localized phoneme artifacts, and SLS relies on word boundaries and spectral integrity. We move beyond speculative reasoning and validate our findings by causal masking of the primary detector cues. Observed performance degradation further supports the explained detector semantics.
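Integrated Gradients, the attribution method named in the abstract above, can be illustrated on a toy detector. The sketch below is not the authors' pipeline: it assumes a hypothetical one-value-per-frame input, a logistic scorer with made-up weights, an all-zero baseline, and a midpoint Riemann sum for the path integral.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def score(w: List[float], x: List[float]) -> float:
    """Toy detector: logistic score over per-frame features."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


def grad(w: List[float], x: List[float]) -> List[float]:
    """Analytic gradient of the toy score with respect to each frame."""
    s = score(w, x)
    return [s * (1.0 - s) * wi for wi in w]


def integrated_gradients(w: List[float], x: List[float],
                         baseline: List[float], steps: int = 200) -> List[float]:
    """Average the gradient along the straight path baseline -> x (midpoint rule),
    then scale by the input difference."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        total = [t + g for t, g in zip(total, grad(w, point))]
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


# Hypothetical per-frame features and detector weights.
frames = [0.1, 2.0, -0.3, 0.0]
weights = [0.5, 1.5, 1.0, 2.0]
baseline = [0.0] * len(frames)

attributions = integrated_gradients(weights, frames, baseline)
top_frame = max(range(len(frames)), key=lambda i: attributions[i])
print(top_frame, [round(a, 4) for a in attributions])
```

The attributions sum (up to numerical error) to the difference between the score at the input and at the baseline, and the highest-attribution frame is the kind of region the authors then annotate by hand.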

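2606.10911 2026-06-10 cs.SD cs.AI cs.CR cs.LG New submission

Ethical and Technical Limits of Deepfake Speech Datasets

Vojtěch Staněk, Eva Trnovská, Kamil Malinka, Anton Firc

Affiliations: Brno University of Technology

AI summary: Audits 39 deepfake speech datasets and finds that fairness assessment is infeasible because demographic metadata is lacking, and that heavy overlap in bona fide source corpora across datasets undermines the reliability of cross-dataset evaluation.
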
Comments
Accepted to Interspeech 2026
Abstract

Claims about the robustness and fairness of deepfake speech detectors are only as credible as the datasets used to train and evaluate those systems. We present a dataset-level audit of the deepfake speech landscape. We compile and analyze 39 deepfake speech datasets, examining key attributes including accessibility, documentation, demographic and language coverage, dataset scale, and the underlying bona fide speech sources. Our audit reveals two important takeaways. Firstly, fairness assessment is largely infeasible because most datasets lack demographic metadata, and only a few contain gender or language labels. This prevents any meaningful subgroup analysis and leaves other demographic attributes unaddressed. Secondly, we identify substantial overlap in underlying bona fide source corpora across datasets, which can undermine cross-dataset evaluation and lead to overstated generalization claims.

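2606.10908 2026-06-10 cs.SD cs.AI cs.CR cs.LG New submission

RAT: Reference-Augmented Training for ASV Anti-Spoofing

Vojtěch Staněk, Anton Firc, Jakub Reš, Kamil Malinka

Affiliations: Brno University of Technology

AI summary: Proposes a spoofing countermeasure architecture based on speaker-reference recordings and finds that introducing a reference channel during training improves deepfake detection even when the reference is absent or mismatched at inference. Building on this, proposes the Reference-Augmented Training (RAT) strategy, which reaches 2.57% EER and 0.074 minDCF on the ASVspoof 5 benchmark with a single detector, surpassing large ensemble systems.
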
Comments
Accepted to Interspeech 2026
Abstract

We introduce a spoofing countermeasure architecture conditioned on speaker-reference recordings, but observe that it converges to a solution that effectively ignores the reference during inference. Surprisingly, training with a reference channel induces invariance that improves deepfake detection, even when the reference is absent or mismatched during inference. Based on this observation, we propose a Reference-Augmented Training (RAT) strategy. RAT yields improved detection performance compared to single-utterance baselines, even when the reference recording is replaced with a zero vector at inference. Through rigorous analysis, we demonstrate that the optimization process rapidly diminishes the reference contributions, leading to inference largely independent of the reference channel. Using RAT, we achieve state-of-the-art 2.57% EER and 0.074 minDCF on the ASVspoof 5 benchmark with a single detector, surpassing even large ensemble systems.
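The equal error rate (EER) quoted above is the operating point where the false-acceptance and false-rejection rates coincide. A minimal sketch of how it can be computed from detector scores follows; it assumes that higher scores mean "more likely bona fide", uses made-up scores, and is not the official ASVspoof evaluation code.

```python
from typing import List, Tuple


def equal_error_rate(bona_fide: List[float], spoof: List[float]) -> Tuple[float, float]:
    """Sweep every observed score as a threshold and return (EER, threshold)
    at the point where the two error rates are closest.

    A trial is accepted as bona fide when its score is >= threshold.
    """
    best = None
    for threshold in sorted(set(bona_fide) | set(spoof)):
        # False rejection: bona fide trials scored below the threshold.
        frr = sum(s < threshold for s in bona_fide) / len(bona_fide)
        # False acceptance: spoofed trials scored at or above the threshold.
        far = sum(s >= threshold for s in spoof) / len(spoof)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, threshold)
    return best[1], best[2]


# Hypothetical detector scores.
bona_fide_scores = [0.9, 0.8, 0.7, 0.3]
spoof_scores = [0.6, 0.4, 0.2, 0.1]

eer, threshold = equal_error_rate(bona_fide_scores, spoof_scores)
print(eer, threshold)
```

On real score sets the two rates rarely cross exactly at an observed threshold, so averaging them at the closest point, as done here, is a common approximation.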

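2606.10905 2026-06-10 cs.CV New submission

Beyond Model Size: Probing the Gaps in Visual in-Context Learning by Training a Tiny Model

Sunil Khatri, Steven Landgraf, Markus Ulrich, Simon Reiß

Affiliations: Karlsruhe Institute of Technology

AI summary: By training a tiny model with only 1 million parameters to challenge large-scale visual in-context learning models, reveals gaps in how adaptive capabilities are measured with respect to task encoding, pre-training tasks, and evaluation metrics.
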
Abstract

Visual in-Context Learning (VICL) aims to advance adaptive vision models that can adapt to a new task at test time from a few examples. Following the history of in-context learning in natural language processing, where large, parameter-heavy models are in use, one pathway that current VICL methods take treats model and data scaling as key ingredients. Yet it is not clear whether these ingredients are what allows in-context learning to take shape in vision models. To stress-test such large models, we challenge them with an extreme counterexample: we train a tiny visual in-context model with merely 1 million parameters on a modest 70,000 images. We compare this severely capacity-capped tiny model to 7,000x larger VICL models in different adaptive settings: (1) on image data with small distribution shifts, (2) on unseen task encodings, and (3) on a completely new task, i.e., the setting VICL envisions. Given the chasm in training resources between the tiny and large models, our experiments expose shortcomings in how adaptive capabilities are measured, with respect to how tasks are encoded, which tasks were used in pre-training, and the choice of metrics. These gaps in current VICL benchmarking underscore the need for innovation in the evaluation of adaptive capabilities.

2606.10903 2026-06-10 cs.RO New submission

AgniNav: Configuration-Driven Cross-Embodiment Local Planning for Robot Navigation

Tianhao Zang, Siwei Cheng, Haidong Huang, Shanze Wang, Wei Zhang

Affiliations: Eastern Institute of Technology, Ningbo, China; University of Nottingham, Nottingham, UK; University of Science and Technology of China, Hefei, China

AI summary: Proposes the AgniNav framework, which uses a configurable four-parameter safety envelope to transfer monocular visual navigation across wheeled, quadruped, and humanoid robots with zero retraining, jointly conditioning perception and planning.

Abstract

Monocular local navigation is attractive for lightweight robots, but existing vision-based policies often couple perception to a specific body, camera height, and footprint, making transfer from wheeled bases to legged platforms dependent on retraining or active depth hardware. This paper introduces AgniNav, a configuration-driven local navigation framework that standardizes cross-embodiment transfer at the collision-envelope level. Each robot is specified by a measurable four-parameter safety envelope: collision-relevant height, front length, rear length, and half width. The height parameter conditions an image-to-scan network to predict a one-dimensional, collision-relevant pseudo-laserscan from a monocular color image, while the remaining footprint parameters configure a dimension-aware local planner for collision checking. Training uses height-conditioned column-minimum scan labels generated from paired color-depth data, allowing the same image to supervise different safety envelopes without collecting robot-specific data. To the best of our knowledge, AgniNav is the first monocular local-navigation framework that jointly conditions perception and planning on a shared collision-envelope configuration for zero-retraining deployment across wheeled, quadruped, and humanoid platforms. Real-robot experiments on a Turtlebot2, Unitree Go2, and Accelerated Evolution K1 achieve 39/40, 18/20, and 18/20 successes with 0/40, 1/20, and 2/20 collisions, respectively, while running at 30 Hz on Jetson Orin.
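The height-conditioned column-minimum scan labels mentioned in the abstract above can be sketched roughly as follows. This is an assumed reading, not the authors' code: it supposes that a per-pixel height above the ground has already been derived from the depth image and camera pose, and the ground margin, maximum range, and function name are hypothetical.

```python
from typing import List

Grid = List[List[float]]


def column_min_scan(depth: Grid, height: Grid, robot_height: float,
                    ground_margin: float = 0.05, max_range: float = 10.0) -> List[float]:
    """For each image column, return the nearest depth among pixels whose height
    lies inside the robot's collision-relevant band (ground_margin, robot_height].

    Columns with no pixel in the band report `max_range` (free space).
    """
    n_rows, n_cols = len(depth), len(depth[0])
    scan = []
    for c in range(n_cols):
        candidates = [
            depth[r][c]
            for r in range(n_rows)
            if ground_margin < height[r][c] <= robot_height and depth[r][c] > 0.0
        ]
        scan.append(min(candidates) if candidates else max_range)
    return scan


# Hypothetical 3x2 depth image (metres) with matching per-pixel heights (metres).
depth = [[2.0, 3.0],
         [1.5, 4.0],
         [1.0, 5.0]]
height = [[1.2, 1.2],
          [0.5, 0.5],
          [0.0, 0.0]]

low_robot_scan = column_min_scan(depth, height, robot_height=0.6)
tall_robot_scan = column_min_scan(depth, height, robot_height=1.5)
print(low_robot_scan, tall_robot_scan)
```

The same image yields different labels for the two envelopes: the overhanging obstacle at 1.2 m only enters the scan of the taller robot, which is how one recording can supervise several embodiments.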

2606.10899 2026-06-10 cs.RO 新提交

MV-Actor: Aligning Multi-View Semantics and Spatial Awareness for Bimanual Manipulation

MV-Actor:对齐多视角语义与空间感知以实现双臂操作

Yinchen Tian, Huan Li, Muyao Peng, Xi Wang, Yan Wang, You Yang

发表机构 * School of Electronic Information and Communications, Huazhong University of Science and Technology(华中科技大学电子信息与通信学院) Institute for AI Industry Research (AIR), Tsinghua University(清华大学智能产业研究院) AIR Wuxi Innovation Center, Tsinghua University(清华大学智能产业研究院无锡创新中心)

AI总结 提出MV-Actor框架,通过多视角语义交互和语义-空间令牌交互统一语义与空间表示,并利用引导度量深度修复模块处理深度噪声,在PerAct2基准上达到87.8%平均成功率。

详情
Comments
14 pages, 9 figures
AI中文摘要

机器人操作已广泛应用于工业场景。与单臂操作相比,双臂操作配备多个摄像头以从不同视角捕获信息。然而,现有的多视角策略独立编码每个视角或浅层融合视角特征,导致语义感知共享有限且空间感知不可靠。本文提出MV-Actor,一种为双臂操作构建统一语义-空间表示的多视角感知框架。首先,MV-Actor执行多视角语义交互以跨视角共享语义感知。然后,它使用语义-空间令牌交互将视觉语义与前馈重建模型特征对齐,并获取可靠的空间感知。最后,引导度量深度修复模块在消费级深度噪声下细化退化的传感器深度,以提供更可靠的度量锚点。在PerAct2双臂基准上进行的仿真实验中,MV-Actor达到了87.8%的最先进平均成功率。在视角变化更频繁且消费级深度不稳定的真实世界评估中,MV-Actor优于RGB和RGB-D基线,进一步证明了共享语义感知和可靠空间感知对双臂操作的好处。

英文摘要

Robotic manipulation has been widely applied in industrial scenarios. Compared with single-arm setups, bimanual manipulation systems are equipped with multiple cameras that capture information from different viewpoints. However, existing multi-view policies encode each view independently or fuse view features only shallowly, resulting in limited sharing of semantic perception and unreliable spatial awareness. In this paper, we propose MV-Actor, a multi-view perception framework that builds a unified semantic-spatial representation for bimanual manipulation. First, MV-Actor performs Multi-view Semantic Interaction to share semantic perception across views. Then it uses Semantic-Spatial Token Interaction to ground visual semantics with feed-forward reconstruction model features and acquire reliable spatial awareness. Finally, a Guided Metric Depth Repair module refines degraded sensor depth to provide more reliable metric anchors under consumer-grade depth noise. In simulation experiments conducted on the PerAct2 bimanual benchmark, MV-Actor achieves a state-of-the-art average success rate of 87.8%. In real-world evaluations with more frequent viewpoint changes and unstable consumer-grade depth, MV-Actor outperforms both RGB and RGB-D baselines, further demonstrating the benefit of shared semantic perception and reliable spatial awareness for bimanual manipulation.

2606.10896 2026-06-10 cs.LG cs.DB cs.IR cs.PF 新提交

Flash-GMM: A Memory-Efficient Kernel for Scalable Soft Clustering

Flash-GMM:一种用于可扩展软聚类的内存高效内核

Gal Bloch, Ariel Gera, Matan Orbach, Ohad Eytan, Assaf Toledo

发表机构 * IBM Research(IBM研究院)

AI总结 提出Flash-GMM融合Triton内核,在单GPU通次中高效计算高斯混合模型,通过避免实例化完整责任矩阵,实现20倍加速并支持比先前大100倍的数据集训练,集成到IVF粗量化器中提升ANN搜索性能。

详情
AI中文摘要

我们提出了 Flash-GMM,一个融合的 Triton 内核,用于在单 GPU 通次中高效计算大规模数据上的高斯混合模型(GMM)。通过消除在 GPU 内存中实例化完整责任矩阵的需求,Flash-GMM 实现了比现有实现 $20\times$ 的加速,并支持在单个设备上训练比以前可行的数据集大 $100\times$ 以上的数据集。为了展示其影响,我们将 Flash-GMM 集成到 IVF 粗量化器中用于近似最近邻(ANN)搜索。我们表明,软 GMM 聚类现在可以作为 $k$-means 的可行即插即用替代方案,并且可以利用 GMM 责任将边界向量分配到多个聚类。我们的方法在达到固定召回目标时,距离计算次数减少多达 $1.7\times$,或者在相同计算成本下,召回率@10 提高 $+2$--$12$。我们将该内核作为开源项目发布。

英文摘要

We present Flash-GMM, a fused Triton kernel for efficient computation of Gaussian Mixture Models (GMMs) over large-scale data in a single GPU pass. By eliminating the need to materialize the full responsibility matrix in GPU memory, Flash-GMM achieves a $20\times$ speedup over existing implementations and enables training on datasets more than $100\times$ larger than previously feasible on one device. To demonstrate its impact, we integrate Flash-GMM into the IVF coarse quantizer for approximate nearest-neighbor (ANN) search. We show that soft GMM clustering is now a viable drop-in replacement for $k$-means, and that GMM responsibilities can be leveraged to assign border vectors to multiple clusters. Our approach reaches fixed recall targets with up to $1.7\times$ fewer distance computations, or equivalently, yields $+2$--$12$ recall@10 at matched computational cost. We release the kernel as an open-source project.
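
The memory idea in this abstract, computing a GMM update without ever storing the full N-by-K responsibility matrix, can be illustrated on the CPU. The sketch below is not the fused Triton kernel and makes simplifying assumptions (one-dimensional data, one EM step, the hypothetical function name `em_step_streaming`): responsibilities for each point are computed, folded into per-component sufficient statistics, and then discarded.

```python
import math

def em_step_streaming(xs, weights, means, variances):
    """One EM step for a 1-D GMM using O(K) extra memory.

    xs may be any iterable (for example a generator over a large file);
    responsibilities are never stored for more than one point at a time.
    """
    k = len(weights)
    n_k = [0.0] * k   # sum of responsibilities per component
    s1 = [0.0] * k    # responsibility-weighted sum of x
    s2 = [0.0] * k    # responsibility-weighted sum of x^2
    n = 0
    for x in xs:
        logp = [
            math.log(weights[j])
            - 0.5 * math.log(2.0 * math.pi * variances[j])
            - 0.5 * (x - means[j]) ** 2 / variances[j]
            for j in range(k)
        ]
        # Log-sum-exp for a numerically stable normalizer.
        m = max(logp)
        log_norm = m + math.log(sum(math.exp(lp - m) for lp in logp))
        for j in range(k):
            r = math.exp(logp[j] - log_norm)  # responsibility of component j
            n_k[j] += r
            s1[j] += r * x
            s2[j] += r * x * x
        n += 1
    new_weights = [n_k[j] / n for j in range(k)]
    new_means = [s1[j] / n_k[j] for j in range(k)]
    new_vars = [max(s2[j] / n_k[j] - new_means[j] ** 2, 1e-6) for j in range(k)]
    return new_weights, new_means, new_vars
```

Because only the three accumulators persist across points, memory use is independent of the dataset size; a GPU kernel would apply the same accumulation per tile rather than per point.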

2606.10890 2026-06-10 cs.LG cs.AI 新提交

Optimal Post-Training Quantization Scales and Where to Find Them

最优后训练量化尺度及其寻找方法

Juan Amboage, Pablo Monteagudo-Lago, Ian Colbert, Giuseppe Franco, Nicholas Fraser

发表机构 * AMD

AI总结 提出PiSO算法,利用校准数据精确高效地计算逐通道最优量化尺度,并扩展到分组量化,在Llama和Qwen模型上显著提升困惑度和零样本准确率。

详情
AI中文摘要

后训练量化(PTQ)通过将权重映射到低比特表示来压缩大型语言模型。定义量化网格的缩放因子通常使用简单的、无数据的启发式方法选择。在这项工作中,我们提出了PiSO(分段尺度优化),一种利用校准数据在最近舍入量化下精确且高效地计算最优逐通道权重尺度的算法。PiSO将尺度搜索空间划分为有限个区间,在这些区间上目标函数具有闭式最小值。我们通过原则性启发式方法将PiSO扩展到分组量化,并提出了将尺度优化与纠错交错的有效策略。在Llama和Qwen模型上,跨多个模型大小和目标权重位宽的实验表明,在困惑度和下游零样本准确率上均有持续改进,无论是单独使用还是与纠错结合。特别地,我们观察到随着目标位宽变窄、量化变得更加困难,收益增加。

英文摘要

Post-training quantization (PTQ) compresses large language models by mapping weights to low-bit representations. The scaling factor that defines the quantization grid is typically chosen using simple, data-free heuristics. In this work, we present PiSO (Piecewise Scale Optimization), an algorithm that leverages calibration data to compute the optimal channel-wise weight scales exactly and efficiently under round-to-nearest quantization. PiSO partitions the scale search space into finitely many intervals on which the objective admits a closed-form minimizer. We extend PiSO to group-wise quantization via principled heuristics and propose effective strategies for interleaving scale optimization with error correction. Experiments on Llama and Qwen models across multiple model sizes and target weight bit-widths demonstrate consistent improvements in perplexity and downstream zero-shot accuracy, both standalone and combined with error correction. In particular, we observe increased benefits as the target bit-width narrows and quantization becomes more challenging.
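
To illustrate why the objective admits a closed-form minimizer on each interval: once the integer codes `q` produced by round-to-nearest are held fixed, the calibration-weighted error is a quadratic in the scale `s`. The sketch below assumes the objective is the squared output error `||X(w - s q)||^2` with Gram matrix `H = X^T X` (an assumption; the abstract does not state the exact objective), and all function names are hypothetical. It shows only the per-interval refit, not the full PiSO search over intervals.

```python
import math

def rtn(w, s, bits=4):
    """Round-to-nearest integer codes for weights w at scale s (signed grid)."""
    qmax, qmin = 2 ** (bits - 1) - 1, -(2 ** (bits - 1))
    return [min(qmax, max(qmin, math.floor(x / s + 0.5))) for x in w]

def gram(X):
    """H = X^T X for calibration inputs X (list of rows)."""
    d = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(d)] for i in range(d)]

def quad(a, H, b):
    """Bilinear form a^T H b."""
    return sum(a[i] * H[i][j] * b[j] for i in range(len(a)) for j in range(len(b)))

def output_error(w, q, s, H):
    """Squared output error ||X (w - s q)||^2 expressed through H."""
    e = [wi - s * qi for wi, qi in zip(w, q)]
    return quad(e, H, e)

def refit_scale(w, q, H):
    """Closed-form minimizer of output_error over s for fixed codes q."""
    denom = quad(q, H, q)
    if denom <= 0.0:
        raise ValueError("codes are all zero or H is degenerate")
    return quad(q, H, w) / denom
```

For fixed `q`, the refitted scale never increases the error relative to the starting scale. Re-rounding at the new scale may change `q`, which is why a full method has to reason about the intervals on which `q` stays constant.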

2606.10877 2026-06-10 cs.LG cs.CV 新提交

XtrAIn: Training-Guided Occlusion for Feature Attribution

XtrAIn:训练引导的遮挡特征归因

Thodoris Lymperopoulos, Ioannis Kakogeorgiou, Denia Kanellopoulou

发表机构 * NCSR Demokritos(希腊国家科学研究中心德谟克利特)

AI总结 提出XtrAIn方法,将遮挡操作从输入空间转移到参数空间,通过跟踪模型训练轨迹测量特征相关参数更新对输出logits的影响,解决传统遮挡归因中的偏差和不稳定性问题。

详情
Comments
12 pages, 7 figures, 1 table
AI中文摘要

基于遮挡的归因方法通过扰动输入特征并测量模型输出的变化来估计特征重要性,提供了一种直观的方式。然而,其可靠性受到特征移除实现方式的强烈影响:外部选择的基线可能引入偏差、分布外样本和不稳定的解释,而在非线性模型中,遮挡一组特征也可能改变非遮挡特征的贡献。我们将这种效应称为归因偏移,因为非遮挡特征的归因分数偏离其初始值。为了解决这些导致解释不稳定的主要问题,我们引入了XtrAIn,一种训练引导的归因方法,将遮挡操作从输入空间转移到参数空间。XtrAIn不使用手工设计的基线替换输入值,而是遵循模型的训练轨迹,测量特征相关参数更新如何影响输出logits。我们进一步引入了Xstep,一种轻量级近似方法以降低计算成本,以及XtrAIn+,一种目标聚焦变体,强调与目标类别一致的更新。在受控图像数据集和PAM50乳腺癌亚型分类上的实验表明,所提出的方法比标准归因基线产生更清晰、更可解释的归因模式。总体而言,XtrAIn提供了对特征归因的训练感知视角,并为研究训练过程中特征级证据的形成提供了有用的诊断工具。

英文摘要

Occlusion-based attribution methods provide an intuitive way to estimate feature importance by perturbing input features and measuring the resulting change in model output. However, their reliability is strongly affected by how feature removal is implemented: externally selected baselines can introduce bias, out-of-distribution samples, and unstable explanations, while in nonlinear models the occlusion of a set of features can also alter the contribution of non-occluded features. We refer to this effect as attribution shift, as the attribution scores of the non-occluded features drift from their initial values. To address these major issues, which render explanations unstable, we introduce XtrAIn, a training-guided attribution method that transfers the occlusion operation from the input space to the parameter space. Instead of replacing input values with hand-crafted baselines, XtrAIn follows the model's training trajectory and measures how feature-associated parameter updates affect the output logits. We further introduce Xstep, a lightweight approximation for reducing computational cost, and XtrAIn+, a target-focused variant that emphasizes updates aligned with the target class. Experiments on controlled image datasets and PAM50 breast-cancer subtype classification show that the proposed methods produce cleaner and more interpretable attribution patterns than standard attribution baselines. Overall, XtrAIn provides a training-aware perspective on feature attribution and offers a useful diagnostic tool for studying how feature-level evidence is formed during training.

2606.10875 2026-06-10 cs.CL 新提交

Pushing the Limits of LLM Tool Calling via Experiential Knowledge Integration and Activation

通过经验知识集成与激活推动LLM工具调用极限

Yupu Hao, Zhuoran Jin, Huanxuan Liao, Kang Liu, Jun Zhao

发表机构 * The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所复杂系统认知与决策智能重点实验室) School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院)

AI总结 研究如何通过经验知识获取、激活和内化提升LLM多步工具调用性能,提出知识增强工具执行框架KATE,结合宽度扩展推理与知识感知训练,在BFCL-V3和AppWorld上显著优于基线。

详情
AI中文摘要

大型语言模型(LLM)依赖工具使用来充当自主代理,但由于缺乏足够的工具相关知识和无效的知识激活,在多步执行中常常失败。因此,我们进行了一项系统性研究,探讨知识如何影响工具使用性能,涵盖知识获取、激活和内化阶段。在知识获取阶段,我们获取并评估了各种形式的经验知识,分析表明简单的实例级知识已经能够提供强大且可靠的增益,而抽象的意图级知识收益有限。在推理时,为了激活知识,我们发现提示LLM扩展推理深度会产生递减收益,而通过并行采样与聚合扩展推理宽度能更有效地激活潜在经验知识。在训练时,对于知识内化,使用知识增强数据进行后训练进一步提升了性能,其中强化学习优于监督微调。基于这些见解,我们提出了知识增强工具执行(KATE)框架,该框架将经验知识与宽度扩展推理及知识感知训练相结合。在BFCL-V3和AppWorld上的实验表明,该方法在不同模型规模上均比强基线有一致且显著的改进。我们的代码可在 https://github.com/hypasd-art/KATE 获取。

英文摘要

Large language models (LLMs) rely on tool use to act as autonomous agents, yet often fail in multi-step execution due to insufficient tool-related knowledge and ineffective knowledge activation. Therefore, we present a systematic study on how knowledge influences tool-use performance, covering the stages of knowledge acquisition, activation, and internalization. In the knowledge acquisition stage, we acquire and evaluate various forms of experiential knowledge, and our analysis shows that simple instance-level knowledge can already provide strong and reliable gains, while abstract intent-level knowledge offers limited benefits. At inference time, to activate knowledge, we find that prompting the LLM to expand the depth of reasoning yields diminishing returns, whereas expanding the width of reasoning by parallel sampling with aggregation more effectively activates latent experiential knowledge. At training time, for knowledge internalization, post-training with knowledge-augmented data further improves performance, with reinforcement learning outperforming supervised fine-tuning. Based on these insights, we propose Knowledge-Augmented Tool Execution (KATE), a framework that integrates experiential knowledge with reasoning-width-expanded inference and knowledge-aware training. Experiments on BFCL-V3 and AppWorld demonstrate consistent and substantial improvements over strong baselines across model scales. Our code is available at https://github.com/hypasd-art/KATE.

2606.10862 2026-06-10 cs.CV cs.AI 新提交

LIBERO-Occ: Evaluating and Improving Vision-Language-Action Models under Scene-Induced Occlusion via Viewpoint Imagination

LIBERO-Occ:通过视角想象评估和改进场景诱导遮挡下的视觉-语言-动作模型

Taishan Li, Jiwen Zhang, Siyuan Wang, Xuanjing Huang, Zhongyu Wei

发表机构 * Fudan University(复旦大学) Shanghai Innovation Institute(上海创新研究院) Chinese University of Hong Kong(香港中文大学)

AI总结 针对VLA模型在场景遮挡下性能下降的问题,提出LIBERO-Occ基准和视角想象方法,通过生成互补视图提升鲁棒性。

详情
Comments
14 pages, 7 figures
AI中文摘要

视觉-语言-动作(VLA)模型在标准操作基准上取得了强劲的性能,但大多数评估假设任务相关物体完全可见。这一假设在现实场景中经常不成立,因为遮挡使得操作部分可观察。本文研究了场景诱导遮挡作为VLA模型的一个基本挑战,并引入了LIBERO-Occ,一个面向遮挡的LIBERO扩展。实验表明,最先进的VLA在遮挡下性能显著下降。为解决这一问题,我们提出了视角想象(VIM),该方法从遮挡的主观测中生成互补视图,并基于观察和想象证据共同进行动作预测。VIM在任务套件、遮挡类型和严重程度上提高了鲁棒性,且无需在部署时增加额外摄像头,表明视角想象是部分可观察操作中感知完成的一种有前景的机制。我们的基准和相应代码可在以下网址获取:https://github.com/litsh/Libero-Occ。

英文摘要

Vision-Language-Action (VLA) models achieve strong performance on standard manipulation benchmarks, but most evaluations assume that task-relevant objects are fully visible. This assumption often fails in realistic settings, where occlusion makes manipulation partially observable. In this paper, we study scene-induced occlusion as a fundamental challenge for VLA models and introduce LIBERO-Occ, an occlusion-oriented extension of LIBERO. Experiments show that state-of-the-art VLAs suffer substantial performance degradation under occlusion. To address this issue, we propose Viewpoint Imagination (VIM), which generates a complementary view from an occluded primary observation and conditions action prediction on both observed and imagined evidence. VIM improves robustness across task suites, occlusion types, and severity levels without requiring additional cameras at deployment time, suggesting that viewpoint imagination is a promising mechanism for perception completion in partially observable manipulation. Our benchmark and corresponding code are available at: https://github.com/litsh/Libero-Occ.

2606.10857 2026-06-10 cs.RO cs.LG 新提交

Embodiment-conditioned Generalist Control for Multirotor Aerial Robots

基于具身条件的多旋翼空中机器人通用控制

Orestis Konstantaropoulos, Welf Rehberg, Mihir Kulkarni, Kostas Alexis

发表机构 * Department of Engineering Cybernetics, Norwegian University of Science and Technology (NTNU), Trondheim, Norway(挪威科技大学工程控制论系)

AI总结 提出一种通用位置控制策略,通过物理具身描述符(质量与惯性归一化控制分配矩阵)实现单一网络权重控制任意多旋翼构型,采用PPO训练,五分钟后零样本迁移至真实世界。

详情
AI中文摘要

我们提出了一种通用位置控制策略,能够使用单一网络权重控制具有特定旋翼数量(例如六旋翼或四旋翼)的任意多旋翼构型。该策略基于一个物理驱动的具身描述符:一个质量和惯性归一化的控制分配矩阵,该矩阵捕捉了质量归一化的电机推力如何在机体坐标系中产生线性和角加速度。为了训练该策略,我们从任意多旋翼构型的广泛分布中采样,包括非平面和非对称系统,并使用近端策略优化(PPO)优化单个紧凑网络。训练仅需在RTX 3090 GPU上使用基于NVIDIA Warp的自定义动力学模拟器进行五分钟。通过大量仿真实验,我们展示了具身条件化使得通用控制能够在任意形态下鲁棒工作。我们还在三种不同的六旋翼系统上展示了该通用策略的零样本真实世界迁移,包括一个平面机器人、一个部分对称的非平面系统,以及一个随机非对称非平面构型。

英文摘要

We present a generalist position control policy capable of controlling arbitrary multirotor configurations of a given rotor count (e.g., hexarotors or quadrotors) with a single set of network weights. The policy is conditioned on a physics-grounded embodiment descriptor: a mass- and inertia-normalized control allocation matrix that captures how mass-normalized motor thrusts generate linear and angular accelerations in the body frame. To train the policy, we sample from a broad distribution of arbitrary multirotor configurations, including non-planar and asymmetric systems, and optimize a single, compact network using Proximal Policy Optimization. Training requires only five minutes on an RTX 3090 GPU using a custom NVIDIA Warp-based dynamics simulator. Through extensive simulation experiments, we show that embodiment conditioning enables robust generalist control across arbitrary morphologies. We demonstrate zero-shot real-world transfer of this generalist policy on three diverse hexarotor systems, including a planar robot, a partially symmetric non-planar system, and a random asymmetric, non-planar configuration.
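
One plausible way to build such an embodiment descriptor is sketched below. It is an illustration under stated assumptions rather than the authors' definition: inertia is taken as diagonal, gyroscopic coupling is ignored, each rotor's reaction torque is modelled as `torque_coeff` times its thrust along the thrust axis, and the function name and parameters are hypothetical. Each column maps one rotor's mass-normalized thrust to body-frame linear and angular acceleration.

```python
def cross(a, b):
    """3-D cross product a x b."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def allocation_matrix(positions, axes, spins, mass, inertia_diag, torque_coeff):
    """Return a 6 x n matrix A with [lin_acc; ang_acc] = A @ (thrust / mass).

    positions: rotor positions in the body frame (metres).
    axes: unit thrust directions in the body frame.
    spins: +1 or -1 per rotor (sign of the reaction torque).
    """
    columns = []
    for r, a, d in zip(positions, axes, spins):
        lever = cross(r, a)  # torque from thrust acting at an offset
        torque = [lever[k] + d * torque_coeff * a[k] for k in range(3)]
        # Thrust f = mass * u, so angular acceleration is mass * I^-1 * torque * u.
        ang = [mass * torque[k] / inertia_diag[k] for k in range(3)]
        columns.append([float(x) for x in a] + ang)
    # Transpose the list of columns into 6 rows.
    return [[col[row] for col in columns] for row in range(6)]
```

For a symmetric planar quadrotor with alternating spin directions, equal thrusts on all rotors produce pure vertical acceleration, so the three angular rows each sum to zero. Non-planar or asymmetric layouts change the columns, which is the information a conditioned policy could exploit.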

2606.10852 2026-06-10 cs.CL cs.AI 新提交

Janus: A Benchmark for Goal-Conditioned Information Distortion in LLMs

Janus: 大语言模型中目标导向信息扭曲的基准测试

Polydoros Giannouris, Mohsinul Kabir, Sophia Ananiadou

发表机构 * The University of Manchester(曼彻斯特大学) Archimedes/Athena RC(阿基米德/雅典研究中心)

AI总结 提出JANUS基准,通过固定事实池对比中性/目标导向条件,测量LLM在事实输出中的选择性扭曲,揭示模型缺乏防误导通信的鲁棒性。

详情
AI中文摘要

LLM的欺骗通常通过直接标记如捏造声明、明确谎言或策略性隐瞒来评估。然而,许多现实中的误导性沟通并不依赖于虚假陈述,而是源于对真实事实的选择性处理:省略不利证据、软化不利细节、强调有利细节或用模糊语言替代精确限定。现有基准大多忽略了这种更微妙且可能更危险的失败模式。我们引入JANUS,一个用于测量基于事实的LLM输出中目标导向语用扭曲的基准。我们基准中的每个场景提供固定的一组有利和不利事实,并比较中性条件与目标导向条件(例如,尽管可能对直接受影响的个人或群体造成伤害,仍要增加采用率、注册率、批准率或支持率)。由于所有输出都被限制使用相同的事实池,JANUS将误导性总体印象与幻觉和捏造分离开来。JANUS包含跨8个领域的160个场景,每个场景配有中性和目标导向提示以及标注的事实材料。跨12个LLM的大量实验揭示了一致的目标导向扭曲,表明当前模型仍然对激励和框架目标敏感,并且缺乏针对选择性误导沟通的鲁棒防护。我们公开发布语料库和代码以供未来研究。

英文摘要

LLM deception is often evaluated through direct markers such as fabricated claims, explicit lies, or strategic concealment. However, many real-world misleading communications do not depend on false statements; rather, they arise from selective treatment of true material facts: omitting adverse evidence, softening unfavorable details, emphasizing favorable details, or replacing precise qualifications with vague language. Existing benchmarks largely miss this subtler and arguably more dangerous failure mode. We introduce JANUS, a benchmark for measuring goal-conditioned pragmatic distortion in fact-grounded LLM outputs. Each scenario in our benchmark provides a fixed pool of favorable and adverse facts and compares a neutral condition against a goal-directed condition, such as increasing adoption, enrollment, approval, or support, despite potential harm to directly affected individuals or groups. Because all outputs are constrained to use the same fact pool, JANUS isolates misleading net impressions from hallucination and fabrication. JANUS contains 160 scenarios across 8 domains, with each scenario paired with neutral and goal-conditioned prompts and annotated material facts. Extensive experiments across 12 LLMs reveal consistent goal-conditioned distortions, demonstrating that current models remain sensitive to incentive and framing objectives and lack robust safeguards against selectively misleading communication. We publicly release our corpus and code for future research.

2606.10842 2026-06-10 cs.CL cs.IR 新提交

ConvMemory v2: A Recall-Preserving Top-10 Evidence Reranker for Conversational Memory Retrieval

ConvMemory v2: 一种保留召回率的前10证据重排序器用于对话记忆检索

Taiheng Pan

发表机构 * School of Computing and Information Systems, University of Melbourne(墨尔本大学计算与信息系统学院)

AI总结 提出ConvMemory v2,一种轻量级重排序器,在保留v1的Recall@10前提下,通过微调交叉编码器提升MRR和H@1,并分析其机制。

详情
Comments
19 pages, 3 figures. Single-author technical report. Extends arXiv:2605.28062 (ConvMemory v1). Code and checkpoint: github.com/pth2002/ConvMemory
AI中文摘要

我们描述了ConvMemory v2,一种可选的token证据重排序器,位于轻量级ConvMemory v1重排序器之后,仅对v1保护的前10候选集进行重排序。v2是一个微调的ms-marco-MiniLM-L-6-v2交叉编码器(22,713,601个参数,从发布的检查点测量),应用于v1已经选择的十个(查询,记忆)对;它不改变返回的十个记忆,因此Recall@10和Hit@10与v1相同,这是构造决定的,而非统计巧合。在LoCoMo对话记忆基准测试(5个种子,n = 4955个测试行)上,v2将FULL MRR从v1的0.5824提升到0.6560(配对bootstrap +0.0734,95% CI [+0.0645, +0.0827]),H@1从0.4440提升到0.5474。v2缩小了与更昂贵的全池交叉编码器参考(mxbai-rerank-large-v1在前500个上,MRR 0.6688)的大部分差距但未完全消除:在FULL MRR上,v2比mxbai_top500低0.013,但在两个raw-dense-hard切片上(v1保护的前10个比mxbai自己的前10个具有更高的召回率),v2超过了mxbai_top500。一项四臂负载消融实验表明,候选特定的记忆文本是机制:移除、打乱或替换它会使MRR崩溃到低于原始稠密检索。v2最好被理解为一种标准的保留召回率的级联模式,具有LoCoMo特定的微调、显式的抗捷径推理契约和严谨的负载分析;其相对于mxbai的优势是切片特定的,而非一般的优势声明。本报告扩展了v1技术报告(arXiv:2605.28062)。

英文摘要

We describe ConvMemory v2, an opt-in token-evidence reranker that sits after the lightweight ConvMemory v1 reranker and reorders only v1's protected top-10 candidate set. v2 is a fine-tuned ms-marco-MiniLM-L-6-v2 cross-encoder (22,713,601 parameters, measured from the released checkpoint) applied to the ten (query, memory) pairs that v1 has already selected; it does not change which ten memories are returned, so Recall@10 and Hit@10 are identical to v1 by construction, not by statistical coincidence. On the LoCoMo conversational memory benchmark (5 seeds, n = 4955 test rows), v2 raises FULL MRR from v1's 0.5824 to 0.6560 (paired bootstrap +0.0734, 95% CI [+0.0645, +0.0827]) and H@1 from 0.4440 to 0.5474. v2 closes most but not all of the gap to a much more expensive full-pool cross-encoder reference (mxbai-rerank-large-v1 over the top-500, MRR 0.6688): on FULL MRR v2 sits 0.013 below mxbai_top500, but on two raw-dense-hard slices (where v1's protected top-10 has higher recall than mxbai's own top-10) v2 exceeds mxbai_top500. A four-arm load-bearing ablation shows candidate-specific memory text is the mechanism: removing, shuffling, or replacing it collapses MRR below raw dense retrieval. v2 is best understood as a standard recall-preserving cascade pattern with LoCoMo-specific fine-tuning, an explicit anti-shortcut inference contract, and disciplined load-bearing analysis; its advantage over mxbai is slice-specific rather than a general dominance claim. This report extends the v1 technical report (arXiv:2605.28062).
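
The recall-preserving cascade described above can be sketched in a few lines. The example is a minimal illustration, not the released ConvMemory code: `rerank_protected` and `reciprocal_rank` are hypothetical names, and `scorer` stands in for the fine-tuned cross-encoder. Only the order inside the protected top-k changes, so any set-based metric at k is unchanged by construction.

```python
def rerank_protected(candidates, scorer, query, k=10):
    """Reorder only the first k candidates by scorer(query, memory).

    The set of the first k items is unchanged, so Recall@k and Hit@k
    are identical before and after reranking.
    """
    head = candidates[:k]
    # sorted() is stable, so ties keep the first-stage order.
    reordered = sorted(head, key=lambda m: scorer(query, m), reverse=True)
    return reordered + candidates[k:]

def reciprocal_rank(ranking, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is present."""
    for rank, memory in enumerate(ranking, start=1):
        if memory in relevant:
            return 1.0 / rank
    return 0.0

def overlap_scorer(query, memory):
    """Toy stand-in for a cross-encoder: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(memory.lower().split()))
```

With a first-stage list that places the relevant memory fifth, reranking can raise its reciprocal rank while leaving the top-10 membership untouched. As a check on the reported numbers, 0.6688 - 0.6560 = 0.0128, which rounds to the stated 0.013 gap to mxbai_top500.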

2606.10839 2026-06-10 cs.CV 新提交

HarmoView: Harmonizing Multi-View Constraints for Identity-Consistent Video Generation

HarmoView: 协调多视角约束以实现身份一致视频生成

Cong Wang, Zhentao Yu, Hongmei Wang, Weicong Liang, Zixiang Zhou, Zilin Yang, Jiarong Ou, Rui Chen, Yuan Zhou, Qinglin Lu

发表机构 * Tencent Hunyuan(腾讯混元)

AI总结 提出HarmoView框架,通过多级特征注入、可学习代理令牌和Jump-RoPE等架构改进,结合渐进式视角课程训练,解决大视角变化下身份一致视频生成的外观保真度问题,在多视角基准上达到最优性能。

详情
Comments
Project Page: https://conallwang.github.io/HarmoView_Pages
AI中文摘要

当前的身份一致视频生成方法在大的视角变化下难以保持外观保真度。虽然引入多视角参考输入提供了自然解决方案,但由于缺乏有效的多视角输入框架以及多视角数据的稀缺性,进展仍然受限。我们通过提出HarmoView来应对这些挑战,这是一个用于身份一致视频生成的鲁棒框架,通过三种架构改进并辅以分阶段训练课程,有效整合多视角线索。具体来说,我们首先引入多级特征注入(MFI)来锚定身份保真度;通过交叉注意力将来自正面参考的原始ViT特征与文本令牌一起注入,MFI提供了持久的低级外观锚点,补充了DiT块内的高级身份特征,从而增强了身份保持。然后,我们采用可学习代理令牌来统一单/多视角设置下的异构参考布局,同时解决参考-视角不匹配问题。进一步开发了Jump-RoPE用于身份级特征隔离以减少身份串扰。为了在保留原始生成先验的同时激活这些结构能力,我们提出了渐进式视角课程。这种四阶段训练策略采用视角丢弃,以促进从原始T2V生成到高保真、身份持久的空间推理的稳定过渡。此外,我们构建了一个大规模多视角数据集以解决数据稀缺问题。在我们的多视角基准上的广泛评估(包含100个手动策划的案例,涵盖52个独特身份)表明,HarmoView显著优于开源基线,并匹配领先的闭源引擎,在身份一致视频生成中实现了最先进的性能。

英文摘要

Current identity-consistent video generation methods struggle to preserve appearance fidelity under large viewpoint changes. While introducing multi-view reference input offers a natural solution, progress remains constrained by the lack of effective frameworks for multi-view inputs and the scarcity of multi-view data. We address these challenges by proposing HarmoView, a robust framework for identity-consistent video generation that effectively integrates multi-view cues through three architectural refinements complemented by a staged training curriculum. Specifically, we first introduce Multi-level Feature Injection (MFI) to anchor identity fidelity; by injecting raw ViT features from frontal references alongside text tokens via cross-attention, MFI provides persistent low-level appearance anchors that complement the high-level identity features within DiT blocks, leading to enhanced identity preservation. Then, we employ learnable proxy tokens to unify heterogeneous reference layouts across single-/multi-view settings while simultaneously resolving the reference-view mismatch problem. Jump-RoPE is further developed for identity-wise feature isolation to reduce identity crosstalk. To activate these structural capabilities while preserving the original generative priors, we propose the Progressive View Curriculum. This four-stage training strategy employs view dropout to facilitate a stable transition from vanilla T2V generation to high-fidelity, identity-persistent spatial reasoning. Furthermore, we construct a large-scale multi-view dataset to address the issue of data scarcity. Extensive evaluation on our multi-view benchmark, comprising 100 manually curated cases spanning 52 unique identities, demonstrates that HarmoView significantly outperforms open-source baselines and matches leading closed-source engines, achieving state-of-the-art performance in identity-consistent video generation.

2606.10835 2026-06-10 cs.LG cs.AI 新提交

Geometrically Averaged Hard Target Updates for Linear Q-Learning

线性Q学习的几何平均硬目标更新

Donghwan Lee

发表机构 * School of Electrical Engineering, KAIST(韩国科学技术院电气工程学院)

AI总结 提出λ-几何加权平均的周期目标更新方法,用于线性Q学习,通过切换系统模型分析其稳定性,连接了单周期更新和投影Q值迭代。

详情
AI中文摘要

周期性硬目标更新是现代深度Q学习中最常见的稳定化手段之一。最近的研究表明,目标更新可以提高使用函数逼近(包括线性函数逼近)的Q学习的稳定性。我们引入并分析了所谓的λ-目标更新,通过将m-周期目标更新映射与λ-几何权重$(1-\lambda)\lambda^{m-1}$($\lambda \in [0,1]$)平均得到。端点$\lambda=0$恢复单周期目标更新,而连续端点$\lambda\uparrow1$恢复投影Q值迭代。我们使用切换系统模型和相关工具,研究了这种机制在线性函数逼近的Q学习(即线性Q学习)中的应用。为清晰起见,本文处理确定性版本;该公式可扩展到随机强化学习设置。

英文摘要

Periodic hard target updates are among the most common stabilization devices in modern deep Q-learning. Recent studies suggest that target updates can improve stability in Q-learning with function approximation, including linear function approximation. We introduce and analyze the so-called $\lambda$-target update, obtained by averaging the $m$-periodic target update maps with $\lambda$-geometric weights $(1-\lambda)\lambda^{m-1}$, $\lambda \in [0,1]$. The endpoint $\lambda=0$ recovers the one-period target update, while the continuous endpoint $\lambda\uparrow1$ recovers projected Q-value iteration. We study this mechanism for Q-learning with linear function approximation, namely linear Q-learning, using a switching-system model and related tools. For clarity, the paper treats a deterministic version; the formulation extends to stochastic reinforcement-learning settings.
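
The geometric averaging can be written out explicitly. The block below is a sketch under an assumed notation: the symbol $\mathcal{T}_m$ for the $m$-periodic target update map is introduced here for illustration and is not taken from the paper. It shows that the weights form a probability distribution over periods and gives the mean period they induce.

```latex
% Assumed notation: \mathcal{T}_m denotes the m-periodic target update map.
\[
  \mathcal{T}_{\lambda} \;=\; \sum_{m=1}^{\infty} (1-\lambda)\,\lambda^{m-1}\,\mathcal{T}_m ,
  \qquad
  \sum_{m=1}^{\infty} (1-\lambda)\,\lambda^{m-1}
  \;=\; (1-\lambda)\cdot\frac{1}{1-\lambda} \;=\; 1
  \quad (0 \le \lambda < 1).
\]
% Mean period under the geometric weights:
\[
  \sum_{m=1}^{\infty} m\,(1-\lambda)\,\lambda^{m-1} \;=\; \frac{1}{1-\lambda}.
\]
```

With the convention $0^0 = 1$, setting $\lambda = 0$ leaves only the $m = 1$ term, matching the one-period endpoint stated in the abstract. As $\lambda \uparrow 1$ the mean period $1/(1-\lambda)$ grows without bound, which is consistent with the abstract's statement that this limit recovers projected Q-value iteration.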

2606.10833 2026-06-10 cs.AI 新提交

Do VLMs Reason Like Engineers? A Benchmark and a Stage-wise Evaluation

视觉语言模型像工程师一样推理吗?一个基准测试与分阶段评估

Syed Wasiq, Syed Mohamad Tawseeq, Yashwant Pravinrao Bangde, Debaditya Roy

发表机构 * Indian Institute of Technology Kharagpur(印度理工学院卡哈拉格普尔分校)

AI总结 提出工程视觉问答基准EngVQA和8阶段自动评估框架,揭示当前视觉语言模型在工程推理中的显著局限,并验证了自动化评估与人工评分的高度一致性。

详情
Comments
9 pages (main text), 4 figures, 2 tables; 50 pages total including appendix. The first two authors contributed equally
AI中文摘要

视觉语言模型(VLM)在通用多模态推理基准上表现出色,但其进行工程推理的能力尚未得到充分探索。与一般视觉问答不同,工程问题解决需要解读技术图表、选择支配物理原理并保持物理一致的多步推理。这些能力对于用于工程教育、科学辅助和技术决策的AI系统日益重要,因为推理失败可能产生物理上无效但表面上合理的解决方案。现有基准主要评估最终答案,对中间推理过程的评估有限。我们引入了EngVQA,一个跨5个工程学科、包含696个问题的多模态基准,用于评估工程推理。我们提出了一个8阶段自动评估框架,用于评估VLM生成的解决方案。该框架独立评估解决方案的每个阶段,实现对推理失败的细粒度分析。我们在评估框架上对多个最先进的开源和闭源VLM进行了基准测试,并展示了当前工程推理能力的显著局限性。人工评估与我们的自动化框架高度一致,在10分制评分上实现了0.975的皮尔逊相关系数和0.67的平均绝对误差。我们的结果强调了面向过程的评估对于可靠评估多模态工程推理系统的重要性。

英文摘要

Vision-Language Models (VLMs) demonstrate strong performance on general multimodal reasoning benchmarks, yet their ability to perform engineering reasoning remains largely unexplored. Unlike general visual question answering, engineering problem solving requires interpreting technical diagrams, selecting governing physical principles, and maintaining physically consistent multi-step reasoning. These capabilities are increasingly important for AI systems used in engineering education, scientific assistance, and technical decision-making, where reasoning failures may produce physically invalid yet superficially plausible solutions. Existing benchmarks primarily evaluate final answers and provide limited assessment of intermediate reasoning processes. We introduce EngVQA, a multimodal benchmark for evaluating engineering reasoning that comprises 696 problems across 5 engineering subjects. We also propose an 8-stage automatic evaluation framework for assessing VLM-generated solutions. The framework independently evaluates each stage of the solution, enabling fine-grained analysis of reasoning failures. We benchmark multiple state-of-the-art open- and closed-source VLMs with our evaluation framework and demonstrate substantial limitations in current engineering reasoning capabilities. Human evaluation shows strong agreement with our automated framework, achieving a Pearson correlation of 0.975 and a mean absolute error of 0.67 on a 10-point grading scale. Our results highlight the importance of process-oriented evaluation for reliable assessment of multimodal engineering reasoning systems.

2606.10832 2026-06-10 cs.RO 新提交

GUIDE: Goal-Initialized Directional Understanding for End-to-End Visual Navigation

GUIDE: 目标初始化的定向理解用于端到端视觉导航

Liang Wang, Jin Jin, KanZhong Yao, YiBin Wu, Fangqiang Ding, Jin Wang, Jun Wu, Zhe Sun, Qiuguo Zhu

发表机构 * Institute of Cyber-Systems and Control, Zhejiang University(浙江大学控制科学与工程学院) Institute of Artificial Intelligence (TeleAI), China Telecom(中国电信人工智能研究院(TeleAI)) Oxford Robotics Institute, University of Oxford(牛津大学牛津机器人研究所) Center for Robotics, University of Bonn(波恩大学机器人中心) Department of Mechanical Engineering, Massachusetts Institute of Technology(麻省理工学院机械工程系)

AI总结 提出GUIDE框架,通过空间锚点预测器利用多频率本体感受历史提取自运动表示,结合深度流感知局部几何,实现无需后续目标更新的端到端四足机器人导航。

详情
Comments
https://guide-navigation.github.io/
AI中文摘要

基于学习的足式机器人视觉导航通常依赖于层次状态估计的连续目标更新,以提供持久的定向参考。这种依赖引入了额外的感知和计算开销,偏离了完全端到端的移动自主性。此外,在部分可观测性下,策略容易学习短视行为,容易陷入死角和复杂结构布局。为了解决这些限制,我们研究了一种目标初始化的导航设置,其中目标仅在情节开始时提供一次,要求机器人基于内在空间记忆运行,无需来自外部模块的后续目标更新。在这项工作中,我们提出了GUIDE,一个完全端到端的强化学习框架,旨在培养内部定向意识。具体来说,GUIDE包含一个空间锚点预测器,利用多频率本体感受历史来提取自运动表示,从而为导航维持持久的长期空间上下文。同时,它利用原始深度流感知局部环境几何。我们在仿真和真实场景中对四足机器人进行了评估。实验表明,GUIDE学习了可靠的自运动和定向意识,使得完全端到端部署的策略能够在没有后续目标引导或先验地图的情况下,安全地穿越密集杂乱和结构化迷宫。

英文摘要

Learning-based visual navigation for legged robots typically relies on continuous goal updates from hierarchical state estimation to provide a persistent directional reference. This reliance incurs additional sensory and computational overhead and deviates from fully end-to-end mobile autonomy. Furthermore, under partial observability, policies are prone to learning myopic behaviors and easily become trapped in dead ends and complex structural layouts. To address these limitations, we investigate a goal-initialized navigation setting, where the target is provided only once at the beginning of an episode, requiring the robot to operate based on intrinsic spatial memory without subsequent goal updates from external modules. In this work, we propose GUIDE, a fully end-to-end reinforcement learning framework designed to cultivate internal directional awareness. Specifically, GUIDE incorporates a spatial anchor predictor that leverages multi-frequency proprioceptive history to extract egomotion representations, thereby maintaining a persistent long-horizon spatial context for navigation. Concurrently, it utilizes raw depth streams to perceive local environmental geometry. We evaluate the proposed framework across both simulation and real-world scenarios on a quadruped robot. Experiments show that GUIDE learns reliable egomotion and directional awareness, enabling a fully end-to-end deployed policy to safely navigate through dense clutter and structured mazes without subsequent goal guidance or prior maps.

2606.10829 2026-06-10 cs.CL cs.AI 新提交

Attention-Discounted Adaptive Sampler for Masked Diffusion Language Models

注意力折扣自适应采样器用于掩码扩散语言模型

Yusuf Sahin, Ahmed Rockey Saikia, Volkan Cevher, Paolo Favaro

发表机构 * University of Bern(伯尔尼大学) EPFL(瑞士联邦理工学院洛桑分校)

AI总结 针对掩码扩散语言模型并行解码中候选词交互导致的不安全问题,提出训练无关的重排序规则ADAS,通过注意力折扣软惩罚改进子集构建,在多个基准上提升低NFE性能。

详情
AI中文摘要

掩码扩散语言模型可以通过每次去噪迭代揭示多个令牌来减少推理步骤,但这种并行性很脆弱:当预测相互耦合时,单独置信的位置同时提交可能不安全。现有的免训练采样器如Top-\(k\)、Fast-dLLM和EB-Sampler主要控制揭示多少令牌,而通常通过忽略选定集内交互的逐令牌分数对候选进行排序。我们提出ADAS,一种用于并行掩码扩散解码的免训练重排序规则。ADAS保持基础采样器的停止规则不变,仅修改子集构建:当候选者强烈关注预测仍不确定的已选位置时,它贪婪地折扣该候选者。与将注意力转化为硬兼容性约束的图约束方法不同,ADAS保持注意力连续并将其用作软边际惩罚。在GSM8K、MATH500、HumanEval和MBPP上,针对LLaDA-8B-Base和Dream-7B-Base,将ADAS插入Top-\(k\)、Fast-dLLM和EB-Sampler中,在匹配去噪器评估下,低NFE性能平均分别提高9.11和10.46个百分点,每次前向运行时开销为3.1%。这些结果表明,软注意力折扣重排序是一种简单且模块化的方法,可提高掩码扩散语言模型高度并行解码的质量。

英文摘要

Masked diffusion language models can reduce inference steps by revealing multiple tokens per denoising iteration, but this parallelism is fragile: positions that are individually confident may be unsafe to commit together when their predictions are coupled. Existing training-free samplers such as Top-\(k\), Fast-dLLM, and EB-Sampler mainly control how many tokens to reveal, while often ranking candidates by token-wise scores that ignore interactions within the selected set. We propose ADAS, a training-free reranking rule for parallel masked diffusion decoding. ADAS leaves the base sampler's stopping rule unchanged and modifies only subset construction: it greedily discounts a candidate when it attends strongly to already selected positions whose predictions remain uncertain. Unlike graph-constrained methods that turn attention into hard compatibility constraints, ADAS keeps attention continuous and uses it as a soft marginal penalty. Across LLaDA-8B-Base and Dream-7B-Base on GSM8K, MATH500, HumanEval, and MBPP, plugging ADAS into Top-\(k\), Fast-dLLM, and EB-Sampler improves low-NFE performance at matched denoiser evaluations by 9.11 and 10.46 percentage points on average, respectively, with 3.1% per-forward runtime overhead. These results show that soft attention-discounted reranking is a simple and modular way to improve quality in highly parallel decoding for masked diffusion language models.
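
The greedy soft-discount idea can be illustrated with a small sketch. The exact penalty used by ADAS is not given in the abstract, so the form below is an assumption: a candidate's confidence is reduced by `alpha` times the sum, over already selected positions, of its attention to that position multiplied by that position's uncertainty (taken here as one minus confidence). The function name and the `alpha` parameter are hypothetical.

```python
def greedy_discounted_subset(confidence, attention, k, alpha=1.0):
    """Greedily pick up to k positions to reveal in one denoising step.

    confidence[i]: token-wise confidence of candidate position i.
    attention[i][j]: attention weight from position i to position j.
    k: number of positions to reveal (decided by the base sampler).
    """
    remaining = list(range(len(confidence)))
    selected = []
    while remaining and len(selected) < k:
        def discounted(i):
            # Soft penalty: strong attention to uncertain, already chosen positions.
            penalty = sum(attention[i][j] * (1.0 - confidence[j]) for j in selected)
            return confidence[i] - alpha * penalty
        best = max(remaining, key=discounted)
        selected.append(best)
        remaining.remove(best)
    return selected
```

Setting `alpha` to zero reduces the rule to plain confidence ranking, which makes the effect of the discount easy to compare. The budget `k` is left to the caller, mirroring the abstract's point that the base sampler's stopping rule is unchanged.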

2606.10825 2026-06-10 cs.LG 新提交

MODIP: Efficient Model-Based Optimization for Diffusion Policies

MODIP:扩散策略的高效基于模型的优化

Zakariae El Asri, Philippe Gratias-Quiquandon, Nicolas Thome, Olivier Sigaud

发表机构 * Sorbonne Université, CNRS, ISIR, F-75005 Paris, France(索邦大学,法国国家科学研究中心,智能系统与机器人研究所,法国巴黎) Institut Universitaire de France (IUF)(法国大学研究院)

AI总结 提出MODIP框架,利用世界模型和模型预测控制生成高质量轨迹,以监督方式微调扩散策略,实现离线到在线的强化学习微调,在D4RL和RoboMimic任务上超越行为克隆基线。

详情
AI中文摘要

扩散策略(DPs)已成为机器人学习中表达力强的策略表示,通常与行为克隆(BC)等模仿学习方法一起使用。然而,虽然它们的成功主要局限于BC,但直接进行强化学习(RL)微调仍然具有挑战性,因为动作是通过多步去噪过程生成的。在这项工作中,我们提出了MODIP,一个用于扩散策略离线到在线微调的框架。MODIP不是直接将RL应用于DPs,而是利用世界模型(WM)来指导策略适应,并保持BC的简单性和稳定性。我们利用模型预测控制(MPC)在WM内生成高质量轨迹,并将其作为监督目标来微调DP。为了使MPC规划高效,MODIP使用终端状态值而不是依赖于策略的状态-动作值,从而减少了推理时间。此外,MODIP使用与策略无关的TD目标训练评论家,减少了训练时间。在D4RL(MuJoCo、Kitchen)和RoboMimic任务上的实验表明,MODIP改进了超越BC的扩散策略,并且与扩散策略RL微调方法和强基于模型的基线(如TD-MPC2)相比具有竞争力或更优性能。

英文摘要

Diffusion policies (DPs) have emerged as expressive policy representations for robot learning, often used with imitation learning methods such as behavioral cloning (BC). However, while their success has largely been confined to BC, direct reinforcement learning (RL) fine-tuning remains challenging because actions are generated through a multi-step denoising process. In this work, we propose MODIP, a framework for the offline-to-online fine-tuning of DPs. Rather than directly applying RL to the DPs, MODIP leverages a world model (WM) to guide policy adaptation and keeps the simplicity and stability of BC. We utilize model predictive control (MPC) to generate high-quality trajectories within the WM, and use them as supervised targets for fine-tuning the DP. To make MPC planning efficient, MODIP uses a terminal state value instead of a policy-dependent state-action value, reducing inference time. Additionally, MODIP trains critics with policy-independent TD targets, reducing training time. Experiments on D4RL (MuJoCo, Kitchen) and RoboMimic tasks show that MODIP improves diffusion policies beyond BC, and is competitive with or outperforms diffusion policy RL fine-tuning methods and strong model-based baselines such as TD-MPC2.
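
Planning with a terminal state value, as opposed to a policy-dependent state-action value, can be sketched with an exhaustive planner over a small discrete action set. This is a toy illustration under stated assumptions, not the MODIP planner: the function `plan` and its callables `model`, `reward`, and `value` are hypothetical stand-ins for a learned world model, reward function, and state-value critic.

```python
from itertools import product

def plan(state, model, reward, value, actions, horizon, gamma=0.99):
    """Return the best action sequence and its estimated return.

    Each candidate sequence is rolled out inside the model; the tail of
    the return is bootstrapped with a terminal state value V(s_H), so no
    policy query is needed at the end of the rollout.
    """
    best_return, best_seq = float("-inf"), None
    for seq in product(actions, repeat=horizon):
        s, ret = state, 0.0
        for t, a in enumerate(seq):
            ret += (gamma ** t) * reward(s, a)
            s = model(s, a)
        ret += (gamma ** horizon) * value(s)  # terminal state value
        if ret > best_return:
            best_return, best_seq = ret, seq
    return list(best_seq), best_return
```

In a setup like the one the abstract describes, sequences found by the planner would serve as supervised targets for the policy. A practical planner would sample candidate sequences rather than enumerate them, since enumeration grows exponentially with the horizon.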