arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.12918 2026-06-12 cs.CR cs.AI New submission

MAStrike: Shapley-Guided Collusive Red-Teaming on Multi-Agent Systems

Chejian Xu, Zhaorun Chen, Jingyang Zhang, Freddy Lecue, Avni Kothari, Sarah Tan, Wenbo Guo, Bo Li

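AI summary: Proposes MAStrike, a framework that uses Shapley value analysis to identify vulnerable agent coalitions in multi-agent systems, generates role-aware adversarial attacks, and iteratively refines them to bypass defenses; it substantially outperforms heuristic baselines.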

Abstract

Hierarchical multi-agent systems (MAS) are rapidly being deployed in high-stakes workflows across domains such as finance and software engineering. In these systems, safety and security are inherently distributed across role-specialized agents, significantly expanding the attack surface, particularly under coordinated adversarial behaviors such as privilege escalation and cross-agent collusion. Existing red-teaming approaches for MAS remain limited: they rely on heuristic selection of target agents and perturb isolated message streams, leaving critical questions unanswered, namely which agents are most responsible for system safety and how compromised agents can coordinate to bypass defenses. We propose MAStrike, a closed-loop framework for collusive red-teaming in hierarchical MAS. We introduce the first agent-level Shapley value analysis for MAS, quantifying each agent's marginal contribution to system robustness under task-specific distributions. Guided by this attribution, MAStrike identifies vulnerable agent coalitions and generates coordinated, role-aware adversarial manipulations. These attacks are iteratively refined through structured causal diagnosis, attributing failure cases to uncompromised agents that block adversarial attempts. We further build a comprehensive MAS red-teaming benchmark and controllable environments spanning diverse hierarchical topologies and domains, including finance, software engineering, and CRM. Extensive experiments across MAS built on multiple frontier models show that MAStrike substantially outperforms heuristic baselines. Our analysis further uncovers non-trivial Shapley value distributions and higher-order interaction structures among agents, revealing critical vulnerabilities and coordination patterns that are overlooked by prior single-agent or template-based methods.
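
To make the attribution idea concrete, here is a minimal Python sketch of exact Shapley values over a small set of agents. It is not the MAStrike implementation: the agent names, the `toy_attack_success` value function, and the choice to score a coalition by attack success when its members are compromised are all illustrative assumptions. Exact enumeration needs on the order of 2^n value evaluations, so it only suits a handful of agents.

```python
from itertools import combinations
from math import factorial


def shapley_values(agents, value):
    """Exact Shapley value of each agent for a coalition value function."""
    n = len(agents)
    phi = {a: 0.0 for a in agents}
    for a in agents:
        others = [x for x in agents if x != a]
        for r in range(n):
            # Weight of a coalition of size r that does not contain `a`.
            weight = factorial(r) * factorial(n - r - 1) / factorial(n)
            for subset in combinations(others, r):
                s = frozenset(subset)
                # Marginal contribution of `a` when it joins coalition s.
                phi[a] += weight * (value(s | {a}) - value(s))
    return phi


def toy_attack_success(coalition):
    """Hypothetical value: attack success rate when `coalition` is compromised."""
    if {"planner", "executor"} <= coalition:
        return 1.0
    if "executor" in coalition:
        return 0.4
    return 0.0


agents = ["planner", "executor", "reviewer"]  # hypothetical roles
phi = shapley_values(agents, toy_attack_success)
print(phi)
```

In this toy game the executor receives most of the credit, the planner matters only jointly with the executor, and the reviewer gets zero; the values sum to the value of the full coalition.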

2606.12917 2026-06-12 cs.LG New submission

Where Computation Lives Inside TabPFN: Causal Localisation of Attention Head Function

Atharva Gupta, Dhruv Kumar, Murari Mandal, Saurabh Deshpande

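AI summary: Using activation patching, ablation, and attention entropy analysis, finds that one attention head in TabPFN 2.5 has 2 to 5 times the causal necessity of the other heads at its peak layer, that its dominant layer shifts with task complexity, and that the remaining heads show symmetric late-layer profiles.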

Comments
Accepted to Workshop FMSD @ ICML 2026

Abstract

We present the first causal mechanistic analysis of a tabular foundation model, investigating how TabPFN 2.5's feature-wise attention heads distribute computation across layers. Using activation patching, ablation, and attention entropy across two synthetic regression datasets, we find clear temporal specialisation: one head's causal necessity dominates that of the others by 2 to 5 times at its peak layer, with its dominant layer shifting across tasks of different complexity, while the remaining heads exhibit symmetric late-layer profiles. Attention entropy and patching provide convergent evidence for the computationally active layers of the dominant head. We additionally investigate inference-time steerability via contrastive activation steering, which fails to transfer across samples. We attribute this result to TabPFN's in-context learning mechanism, which encodes task structure through context-dependent attention rather than the stable parametric directions that make steering tractable in language models.
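
Attention entropy, one of the three probes named above, can be sketched in a few lines of Python. The sketch assumes attention weights are already available as rows that sum to 1; how they would be extracted from TabPFN is not shown, and the toy arrays are made up for illustration.

```python
import numpy as np


def attention_entropy(attn, eps=1e-12):
    """Shannon entropy (in nats) of each attention row.

    `attn` has shape (..., queries, keys) and each row sums to 1.
    Low entropy means the head focuses on few keys; high entropy means diffuse attention.
    """
    attn = np.asarray(attn, dtype=float)
    return -np.sum(attn * np.log(attn + eps), axis=-1)


uniform = np.full((1, 4), 0.25)                   # toy diffuse head
peaked = np.array([[0.97, 0.01, 0.01, 0.01]])     # toy focused head
print(attention_entropy(uniform), attention_entropy(peaked))
```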

2606.12916 2026-06-12 cs.AI cs.CL cs.LG New submission

MDForge: Agentic Molecular Dynamics Pipeline Design under Sparse Simulator Feedback

Zehong Wang, Yijun Ma, Connor R. Schmidt, Tianyi Ma, Weixiang Sun, Ziming Li, Xiaoguang Guo, Chuxu Zhang, Matthew J. Webber, Yanfang Ye

Affiliations: University of Notre Dame; University of Connecticut

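AI summary: Proposes MDForge, an LLM agent that densifies sparse rewards through multi-agent debate to design molecular dynamics pipelines automatically; it reaches expert-level performance on SAMPL benchmarks and discovers a novel high-affinity CB[7] binder.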

Abstract

Molecular dynamics (MD) is the canonical in-silico method for atomistic molecular science, simulating molecular behavior from first-principle physics. Designing an MD pipeline for a new system requires substantial expert knowledge: running it on even one molecule is expensive, ruling out trial-and-error. We automate this expert pipeline-design process with an LLM agent. Unlike existing MD agents that orchestrate a predefined tool set, we treat pipeline design as open-ended code generation in which the agent's behavior is reshaped online by verbal reward. Specifically, we build MDForge, an LLM agent whose in-context update rule densifies the sparse reward via a multi-agent debate among physics experts. On three SAMPL host-guest binding free-energy benchmarks, MDForge automatically designs MD pipelines competitive with human experts. Deployed on a library of unseen candidate guests, its CB[7] pipeline discovers a novel binder that wet-lab competition NMR confirms is a high-affinity, picomolar CB[7] binder. Our data and code are available at this https URL.

2606.12913 2026-06-12 cs.LG cs.CV New submission

Selecting Samples on Graphs: A Unified Dataset Pruning Framework for Lossless Training Acceleration

Dongyue Wu, Zilin Guo, Xiaoyu Li, Jiajia Liu, Jingdong Chen, Nong Sang, Changxin Gao

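AI summary: Proposes a unified graph-based dataset pruning framework that models the dataset as a weighted graph, selects samples by solving a Maximum Weight Clique Problem with a greedy algorithm, outperforms existing methods across pruning ratios, and cuts ImageNet-1k training time by over 40% without losing accuracy.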

Comments
ICML 2026

Abstract

The rapid growth of modern training datasets has significantly increased computational cost, motivating dataset pruning (DP) methods, which retain only a subset of informative samples to reduce training cost. Existing pruning criteria typically rely on either intrinsic signals that assess samples independently or extrinsic signals that promote diversity via pairwise relations. While effective in their own specific regimes, each captures only one aspect of sample utility and lacks robustness across different pruning ratios or data distributions. In this work, we present a unified graph-based DP framework. By modeling the dataset as a weighted graph, where node weights encode intrinsic value and edge weights encode extrinsic value, DP can be cast as a Maximum Weight Clique Problem (MWCP). Although MWCP is NP-hard, its structure admits a principled greedy solution based on sample-wise marginal gains. Under a few mild conditions, we further prove that this unified objective enjoys a formal approximation guarantee, which applies to a broad family of importance metrics and provides practical design guidelines. Extensive experiments show that our method outperforms existing DP methods while substantially reducing training cost, cutting training time by over 40% without sacrificing accuracy on ImageNet-1k with ResNet-50.
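
The greedy idea can be illustrated with a short Python sketch. It is a generic marginal-gain selection under an assumed objective (sum of selected node weights plus sum of edge weights between selected pairs), not the paper's algorithm; the function name `greedy_select` and the toy weights are hypothetical.

```python
import numpy as np


def greedy_select(node_w, edge_w, k):
    """Pick k samples, each time taking the one with the largest marginal gain.

    Assumed objective: f(S) = sum of node weights in S + sum of edge weights
    over pairs in S. The gain of adding v is node_w[v] + sum(edge_w[v, u] for u in S).
    """
    node_w = np.asarray(node_w, dtype=float)
    edge_w = np.asarray(edge_w, dtype=float)
    gain = node_w.copy()                     # gain of each sample w.r.t. the empty set
    available = np.ones(len(node_w), dtype=bool)
    selected = []
    for _ in range(k):
        best = int(np.argmax(np.where(available, gain, -np.inf)))
        selected.append(best)
        available[best] = False
        gain += edge_w[best]                 # update gains with edges to the new member
    return selected


# Toy data: samples 0 and 1 are both valuable but redundant (edge weight 0),
# while sample 2 is less valuable alone but diverse relative to sample 0.
node_w = [1.0, 0.9, 0.2]
edge_w = [[0.0, 0.0, 0.8],
          [0.0, 0.0, 0.1],
          [0.8, 0.1, 0.0]]
print(greedy_select(node_w, edge_w, 2))
```

With these toy weights the second pick is sample 2 rather than the higher-scoring but redundant sample 1, which is the interplay of intrinsic and extrinsic value that the abstract describes.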

2606.12911 2026-06-12 cs.CL New submission

PiDA: Phonetically-Informed Data Augmentation for Robust Vietnamese Speech Translation

Giang Son Nguyen, Tung X. Nguyen, Hieu Minh Truong, Nhu Vo, Wray Buntine, Dung D. Le

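AI summary: To address ASR error propagation in cascaded speech translation, proposes PiDA, a phonetically informed data augmentation method that generates similar-sounding substitutions from phonetic word embeddings; on FLEURS Vietnamese-English it improves translation of erroneous ASR outputs by up to +2.04 BLEU.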

Comments
Accepted to INTERSPEECH 2026

Abstract

Cascaded speech translation (ST) systems suffer from error propagation when Automatic Speech Recognition (ASR) outputs incorrect transcripts. We present the first systematic categorization of ASR errors for Vietnamese ST, classifying substitution errors by phonetic cause and quantifying their impact on downstream Neural Machine Translation (NMT) performance using Linear Mixed-Effects Modelling. We confirm that most ASR substitution errors arise from phonetic confusions rather than random noise, and that these phonetic errors significantly degrade ST quality. Motivated by this finding, we propose Phonetically-Informed Data Augmentation (PiDA), which generates ASR-like corruptions by substituting words with phonetically similar alternatives using phonetic word embeddings. Fine-tuning on a PiDA-augmented version of FLEURS Vietnamese-English improves translation of erroneous ASR outputs (up to +2.04 BLEU over standard fine-tuning) while also slightly improving clean-text performance.

2606.12910 2026-06-12 cs.RO cs.AI cs.CV eess.SY New submission

Bounding Boxes as Goals: Language-Conditioned Grasping via Neuro-Symbolic Planning

Allison Andreyev, Landon Eum, Nestor Tiglao, Romel Gomez

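AI summary: Proposes the GRASP framework, which uses a pretrained VLM to translate natural-language queries into neuro-symbolic goal states grounded through bounding-box detection, enabling zero-shot tabletop manipulation without task-specific training.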

Comments
Project website: this https URL

Abstract

For robotics to be effectively integrated into household or industrial environments, machines must adapt to natural-language prompts in real time. Although Vision-Language Models (VLMs) have enabled zero-shot generalization in robot task and motion planning (TAMP), current state-of-the-art approaches often remain computationally "heavyweight" or require extensive training on thousands of demonstrations. We present GRASP (Grounded Reasoning and Symbolic Planning), a framework designed as a step toward open-vocabulary tabletop manipulation. Our approach leverages a pretrained VLM to translate natural-language queries into neuro-symbolic goal states, grounded in the physical world via a bounding-box detection pipeline. Unlike methods that rely on fixed color lists or hard-coded coordinates, GRASP enables robots to interpret abstract spatial concepts such as "top shelf" and execute tasks without additional fine-tuning. We achieve 73.3% overall success across 90 real-robot trials at three difficulty levels, requiring no task-specific training.

2606.12908 2026-06-12 cs.CL New submission

SENTINEL: Failure-Driven Reinforcement Learning for Training Tool-Using Language Model Agents

Ziyi Wang, Yuxuan Lu, Yimeng Zhang, Qun Liu, Chen Luo, Jiri Gesi, Hanqing Lu, Yisi Sang, Manling Li, Jing Huang, Dakuo Wang

Affiliations: Northeastern University; Independent Researcher; Northwestern University

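AI summary: Proposes SENTINEL, a framework that converts agent failures into targeted training tasks; on Tau2-Bench Retail it raises the Pass^1 of Qwen3-4B from 66.4 to 74.9 and outperforms reinforcement learning on general synthetic tasks.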

Abstract

Language model agents are increasingly effective in solving realistic tasks through multi-turn tool use. However, training reliable tool-using agents remains challenging in practice. While reinforcement learning provides an on-policy paradigm for improving agents from their own environment interactions, its effectiveness depends heavily on the training task distribution. When tasks are fixed before training, the task distribution can become increasingly mismatched with the policy's evolving capabilities, causing many rollouts to be spent on uninformative tasks. We propose SENTINEL, a failure-driven reinforcement learning framework that turns the Solver's rollout failures into targeted training tasks. SENTINEL follows a Controller-Proposer-Solver loop: the Controller analyzes failed trajectories and summarizes recurring error patterns, the Proposer generates executable tasks that stress these weaknesses, and the Solver is trained on the targeted tasks. On Tau2-Bench Retail with Qwen3-4B-Thinking-2507, SENTINEL improves Pass^1 from 66.4 to 74.9 and outperforms RL on general synthetic tasks across Pass^k metrics. These results demonstrate that model failures provide an effective and scalable source of targeted training signal for improving tool-using language model agents.
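
For readers unfamiliar with the metric: Pass^k, as commonly defined for the tau-bench family of benchmarks, is the probability that all k independent trials of a task succeed, which makes it stricter than the best-of-k Pass@k. The Python sketch below uses the usual combinatorial estimator from n trials with c successes; that it matches the exact evaluation protocol behind the numbers above is an assumption, and the function name `pass_hat_k` is hypothetical.

```python
from math import comb


def pass_hat_k(successes, trials, k):
    """Estimate the probability that k trials drawn from `trials` all succeed.

    Uses C(successes, k) / C(trials, k) for a single task; a benchmark score
    would average this quantity over tasks.
    """
    if not 1 <= k <= trials:
        raise ValueError("k must be between 1 and the number of trials")
    return comb(successes, k) / comb(trials, k)


# A task solved in 3 of 4 trials.
print(pass_hat_k(3, 4, 1), pass_hat_k(3, 4, 2), pass_hat_k(3, 4, 4))
```

For k = 1 the estimate reduces to the plain success rate, and it can only stay flat or fall as k grows, which is why Pass^k is read as a reliability measure.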

2606.12904 2026-06-12 cs.IR cs.CL cs.HC cs.SI New submission

Trait, Not State: The Durability of Reading Identity in Social Highlighting

Kazuki Nakayashiki, Keisuke Watanabe

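AI summary: Freezes each reader's first six months of highlighting as a profile and tracks later selections, finding that the reading-selection signature stays stable over gaps of 24 months or more, which indicates a trait rather than a state.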

Comments
12 pages, 3 figures, 3 tables

Abstract

Prior work on a social web highlighter located individuality in selection -- which documents a person chooses to highlight -- but measured it cross-sectionally. We ask the temporal question: is a reader's selection signature a trait or a state? We freeze each reader's first six months of highlighting as a profile and track its own-vs-other advantage on their later selections at growing gaps (to 24+ months), with negatives drawn from the same calendar era -- so supply drift cannot masquerade as personal drift -- at a coarse global level and at a fine level whose negatives and controls come from the reader's own interest neighborhood; the anchor cell reproduces the prior cross-sectional level (+0.188 vs +0.169), validating the harness. Four results. Within the same users, the fine-layer advantage shows no statistically detectable paired decline at any horizon (6-12 month retention R = 1.00 [0.85, 1.18], n = 212; the farthest bin is compatible with a modest decline; the only contrast whose interval excludes zero is the coarse layer at 12-24 months, about 13%). The signal is not reducible to repeated domains (~90% survives excluding all profile sources). Within-person drift is slow (a recent-half profile beats the old half by +0.042). Prospectively, personal profiles -- even one built from a reader's earliest documents, median 20 months before evaluation -- rank their next reads at roughly 3x the AP of every simple non-personal prior tested. We use "trait" operationally (a stable signature under continued engagement); the scope is heavy, long-tenured readers of one platform, and exposure is not separable from choice.

2606.12903 2026-06-12 cs.CL New submission

X-MADAM-RAG: Diagnosing and Handling Chinese-English Evidence Conflict in Retrieval-Augmented Generation

Yongqi Kang, Yu Fu, Yong Zhao

Affiliations: Sichuan University

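AI summary: Proposes the X-MADAM-RAG pipeline, which handles Chinese-English evidence conflict in RAG by decomposing evidence handling into candidate extraction, visible-evidence repair, deterministic grouping, and conflict-aware aggregation; it achieves high accuracy on a controlled benchmark, but document-level extraction proves to be the main bottleneck.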

Abstract

Retrieval-augmented generation (RAG) systems may receive evidence that is not merely noisy but mutually contradictory. This issue becomes particularly salient in multilingual settings, where retrieved Chinese and English evidence may support incompatible answer candidates. We study this problem through X-RAMDocs-ZHEN, a controlled Chinese-English benchmark derived from RAMDocs for diagnosing evidence conflict in RAG. The benchmark contains 300 examples across six balanced conditions, including monolingual support, bilingual agreement, reversed conflict directions, and conflict with optional noise. We further examine X-MADAM-RAG, an interpretable pipeline that decomposes evidence handling into per-document candidate extraction, visible-evidence repair, deterministic candidate grouping, and conflict-aware aggregation. On the original controlled benchmark with Qwen2.5-7B-Instruct, X-MADAM-RAG achieves 0.9667 strict accuracy and 0.9767 conflict-aware success, outperforming an evidence-normalized single-call baseline. However, a zero-call rule-only extractor reaches 1.0000 on the same benchmark, revealing strong template regularity. To probe this limitation, we construct a deterministic naturalized stress test that removes explicit answer templates while preserving candidate strings. On its 100-sample subset, rule-only extraction falls to 0.0000, but X-MADAM-RAG also drops to 0.3000 strict accuracy, below both naive and evidence-normalized baselines. A privileged oracle remains perfect, indicating that document-level extraction is the main bottleneck. These findings position X-RAMDocs-ZHEN and X-MADAM-RAG as diagnostic tools for controlled evidence conflict rather than as evidence of general hallucination detection or robustness to natural retrieval.

2606.12902 2026-06-12 cs.CL New submission

PRISM: Prosody-Integrated Multi-Agent Reasoning Framework for Empathetic Spoken Dialogue

Wen Zhang, Xiaocui Yang, Zhuoyue Gao, Shi Feng, Daling Wang, Yifei Zhang

Affiliations: School of Computer Science and Engineering, Northeastern University

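AI summary: Proposes PRISM, a multi-agent framework that decouples speech perception, response generation, and speech synthesis and adds a prosody-to-language translation mechanism, delivering prosodically appropriate, knowledge-integrated empathetic spoken dialogue.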

Comments
Accepted to Interspeech 2026

Abstract

Empathetic spoken dialogue systems require not only semantically appropriate responses but also emotionally aligned prosodic expression. However, cascade pipelines often discard acoustic cues during speech-to-text conversion, while end-to-end speech models lack interpretable control over emotion and knowledge integration. To address these challenges, we propose PRISM, a multi-agent framework for empathetic spoken dialogue that decouples speech perception, response generation, and speech synthesis into coordinated components. PRISM introduces a prosody-to-language translation mechanism to stabilize large language model reasoning and enables on-demand invocation of external knowledge tools for empathetic dialogue generation. Experimental results demonstrate that PRISM achieves consistent improvements in empathy, prosodic appropriateness, and text response generation quality across objective and subjective metrics. Our code is available at: this https URL.

2606.12900 2026-06-12 cs.AI cs.CL cs.LG New submission

Zero-source LLM Hallucination Detection with Human-like Criteria Probing

Jiahao Yang, Shuhai Zhang, Hailong Kang, Feng Liu, Qi Chen, Mingkui Tan

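AI summary: Proposes the HCPD paradigm, which emulates the multi-faceted reasoning of human evaluators through a human-like criteria probing mechanism and combines reward-based alignment with multi-sample aggregation for effective, interpretable hallucination detection under the zero-source constraint.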

Comments
Accepted at ICML 2026

Abstract

Large language models (LLMs) often hallucinate by generating factually incorrect or unfaithful content, posing significant risks to their safe use. Detecting such hallucinations is particularly challenging under the zero-source constraint, where no model internals or external references are available, and detection must rely solely on the textual query-answer pair. In this paper, we propose Human-like Criteria Probing for Hallucination Detection (HCPD), a paradigm that emulates the multi-faceted reasoning of human evaluators. Its core is a Human-like Criteria Probing (HCP) mechanism, in which an LLM agent adaptively decomposes its judgment into a weighted set of interpretable criteria and aggregates criterion-specific scores into a final truthfulness measure. To achieve this adaptive capability, we introduce a reward-based alignment scheme using only weak supervision from semantic consistency. At inference, we employ a multi-sampling aggregation strategy to ensure robust decisions while preserving full interpretability. We further provide theoretical analysis supporting the reliability of our approach. Extensive experiments show that HCPD consistently outperforms state-of-the-art baselines, offering an effective and explainable solution for zero-source hallucination detection. Code is available at this https URL.

2606.12898 2026-06-12 cs.CV cs.CL New submission

Magnifying What Matters: Attention-Guided Adaptive Rendering for Visual Text Comprehension

Shenglai Zeng, Qirui Wang, Kai Guo, Xinnan Dai, Xianxuan Long, Hui Liu

Affiliations: Michigan State University; Xi’an Jiaotong University

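AI summary: To address the gap between localization and utilization that vision-language models show on visual text comprehension, proposes AGAR, a training-free, model-agnostic attention-guided adaptive rendering method that improves performance by enlarging key text spans.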

Abstract

Visual Text Comprehension (VTC) renders text into images for a vision-language model (VLM) to read, sidestepping LLM context-window limits and powering applications from long-page OCR to multi-page memory QA. Yet existing VTC pipelines treat rendering and layout as a fixed, content-agnostic preprocessing step and offer little mechanistic understanding of how VLMs internally process visualized text. Through a focused empirical study on VTC QA tasks, we reveal that VLMs exhibit a localization-without-utilization regime: evidence-localizing attention emerges sharply in the middle-to-late layers and is largely decoupled from answer correctness, yet simply enlarging the localized spans on the rendered page recovers a large fraction of the failures. Building on these observations, we propose AGAR (Attention-Guided Adaptive Rendering), a training-free, model-agnostic method that leverages a VLM's own middle-to-late layer attention to identify the top-K important visual patches, maps them back to word spans, and re-renders the page with those spans enlarged before re-inferring the answer. Extensive experiments across nine VTC benchmarks (short-form, long-context, and multi-page memory QA) and four VLM backbones show that AGAR (i) consistently improves off-the-shelf VLMs as a plug-and-play enhancement, (ii) composes with VLM post-training to yield further gains, and (iii) remains robust under both visual- and text-side input degradation.

2606.12897 2026-06-12 cs.CL New submission

SafeLLM: Extraction as a Hallucination-Resistant Alternative to Rewriting in Safety-Critical Settings

Julia Ive, Felix Jozsa, Evridiki Georgaki, Nabeel Sheikh, Emma Cattell, Nick Jackson, Paulina Bondaronek, Ciaran Scott Hill, Richard Dobson

Affiliations: Institute of Health Informatics, University College London; National Hospital for Neurology and Neurosurgery; Somerset NHS Foundation Trust; King's College Hospital; King's College London

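AI summary: Proposes extraction as a hallucination-resistant alternative to rewriting-based RAG; a line-number selection strategy achieves high recall (up to 95%) with low hallucination on safety-critical documents and outperforms direct copying and safety-focused approaches.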

Abstract

Large language models (LLMs) are increasingly used to access organisational documentation, including standard operating procedures (SOPs), HR policies and institutional guidelines. However, retrieval-augmented generation (RAG) systems that rely on free-form rewriting can introduce hallucinations and unstable trade-offs between completeness and conciseness, particularly in safety- and compliance-critical settings. Objectives: To evaluate extraction as a hallucination-resistant alternative to rewriting-based RAG and compare strategies that balance precision, recall and safety across document types and model scales. Methods: We compare multiple prompting strategies, including line-number-based source selection, extraction of relevant guideline sentences with explicit safety annotations, and a multi-stage pipeline that refines draft answers using supporting evidence from source guidelines. Experiments are conducted on documents of varying length and structure, including local NHS acute care and oncology guidelines and UK-wide NICE guidelines, using both frontier-scale and locally deployable models. Performance is assessed using automatic metrics and human expert evaluation of relevance and completeness. Results: Line-number selection achieves the strongest results, outperforming direct copying and safety-focused strategies across both large and small models while maintaining high term recall (up to 95%) and close alignment with source text. Safety-oriented approaches improve precision but introduce systematic omissions, while multi-stage filtering further amplifies this trade-off. Performance varies with document structure: line-based extraction excels in protocol-like content, whereas alternative strategies perform better on more verbose documents (up to 97% term recall).
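
The line-number idea can be sketched in Python as follows. This is an illustration of the general technique, not the authors' prompts or code: the function names, the placeholder document, and the assumption that a model replies with a list of integers are all hypothetical. The point is that the answer is assembled from verbatim source lines, so the model never gets to reword them.

```python
def number_lines(document):
    """Prefix each line with its 1-based number, ready to place in a prompt."""
    lines = document.splitlines()
    return "\n".join(f"{i}: {line}" for i, line in enumerate(lines, start=1))


def extract_by_line_numbers(document, selected):
    """Return the source lines named by `selected`, verbatim and in document order.

    Out-of-range or repeated numbers (for example, from a faulty model reply)
    are dropped rather than guessed at.
    """
    lines = document.splitlines()
    valid = sorted({n for n in selected if 1 <= n <= len(lines)})
    return [lines[n - 1] for n in valid]


# Placeholder text standing in for a guideline document.
doc = "Check the label.\nRecord the time.\nNotify the supervisor."
prompt_view = number_lines(doc)
# Hypothetical model reply: relevant line numbers, with one invalid and one repeated.
answer = extract_by_line_numbers(doc, [3, 1, 7, 3])
print(prompt_view)
print(answer)
```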

2606.12896 2026-06-12 cs.LG cs.AI cs.CR New submission

PolicyGuard: Towards Test-time and Step-level Adversary Defense for Reinforcement Learning Agent

Junfeng Guo, Heng Huang

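AI summary: Proposes PolicyGuard, a test-time, step-level backdoor defense based on Gaussian Process posterior variance that adapts pseudo trajectories to compute per-step uncertainty; across seven RL games it reaches average AUROCs of 0.856 and 0.859.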

Abstract

While real-world applications of reinforcement learning (RL) are becoming increasingly popular, the security of RL systems deserves more attention and exploration. In particular, recent work has revealed that RL agents are vulnerable to backdoor attacks, where a victim agent behaves normally under standard conditions but executes malicious actions when a specific trigger is activated. Existing backdoor defenses for RL either require access to the agent's internal parameters, operate only at the model or trajectory level, or are limited to specific attack types. To ensure the security of RL agents, we propose PolicyGuard, a test-time, step-level backdoor defense that leverages Gaussian Process (GP) posterior variance and adapts pseudo trajectories to enable uncertainty computation for individual time steps. We also provide theoretical foundations to explain the efficacy of GP posterior variance. Extensive experiments across seven RL games demonstrate that PolicyGuard achieves state-of-the-art detection performance in most cases, with an average AUROC of 0.856 for perturbation-based attacks and 0.859 for adversary-agent attacks.
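
The quantity at the centre of this abstract, GP posterior variance, can be sketched with NumPy. The sketch only shows the standard formula with an RBF kernel; the idea of treating the training inputs as states seen in clean rollouts, the toy 2-D states, and the hyperparameters are illustrative assumptions, and the paper's pseudo-trajectory construction is not reproduced here.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between the rows of a and the rows of b."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2 * a @ b.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)


def gp_posterior_variance(x_train, x_test, noise=1e-3, length_scale=1.0):
    """Posterior variance k(x*, x*) - k*^T (K + noise I)^{-1} k* at each test point."""
    k_tt = rbf_kernel(x_train, x_train, length_scale) + noise * np.eye(len(x_train))
    k_st = rbf_kernel(x_test, x_train, length_scale)
    solved = np.linalg.solve(k_tt, k_st.T)          # (K + noise I)^{-1} k*
    prior = np.ones(len(x_test))                    # the RBF kernel has k(x, x) = 1
    return prior - np.sum(k_st * solved.T, axis=1)


# Hypothetical 2-D states: three reference states, one nearby query, one far-away query.
x_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
x_test = np.array([[0.1, 0.0], [5.0, 5.0]])
var = gp_posterior_variance(x_train, x_test)
print(var)
```

The variance is small near the reference states and close to the prior value of 1 far from them, which is what lets it act as a per-step anomaly score in a scheme of this kind.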

2606.12895 2026-06-12 cs.LG New submission

LongSpike: Fractional Order Spiking State Space Models for Efficient Long Sequence Learning

Xinrui He, Qiyu Kang, Xuhao Li, Zheng-Jun Zha

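AI summary: Proposes LongSpike, a framework that brings fractional-order state-space models (f-SSM) into spiking neural networks and uses long-memory kernels for efficient long-sequence learning, outperforming existing SNNs on several benchmarks.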

Abstract

Spiking Neural Networks (SNNs) are well-regarded for their biological plausibility and energy efficiency in processing sequential data. However, dominant SNN architectures typically rely on first-order Ordinary Differential Equations (ODEs) to govern neuronal state transitions. This first-order assumption imposes a "memoryless" bottleneck, limiting the model's capacity to capture the complex, long-range dependencies inherent in long-sequence tasks. In this work, we propose LongSpike, a novel SNN framework that integrates fractional-order State-Space Modeling, or f-SSM, from control theory into the spiking domain. By extending traditional integer-order SSMs to the fractional-calculus regime, LongSpike enables the hierarchical integration of neuronal dynamics with long-memory kernels. To mitigate the computational overhead and parallelization challenges typically associated with fractional operators, we leverage a state-space formulation that supports efficient, parallel training. Empirical evaluations on challenging benchmarks, including Long Range Arena (LRA), large-scale WikiText-103, and Speech Commands, demonstrate that LongSpike outperforms state-of-the-art SNNs in accuracy while preserving sparse synaptic computation. The code is available at this https URL.

2606.12890 2026-06-12 cs.RO 新提交

Learning to Adapt: Representation-Based Reinforcement Learning for Multi-Task Skill Transfer

学会适应:基于表示的多任务技能迁移强化学习

Aryan Naveen, Haitong Ma, Haldun Balim, Na Li

发表机构 * Massachusetts Institute of Technology(麻省理工学院) Harvard School of Engineering and Applied Sciences(哈佛大学工程与应用科学学院)

AI总结 提出RepMT-SAC框架,通过谱MDP分解捕获可迁移动力学,实现任务无关核心与最小任务特定调整的价值函数结构,在四旋翼轨迹跟踪任务上性能优于基线最高达30%。

详情
Comments
8 pages, 4 figures, 1 table
AI中文摘要

强化学习在学习复杂控制策略方面取得了显著成功,但由于样本效率低和跨任务泛化能力差,其适用性仍然有限。在这项工作中,我们提出了RepMT-SAC,一个多任务强化学习框架,能够实现高效的知识共享和稳健的新任务迁移。RepMT-SAC使用谱MDP分解来捕获可迁移的动力学,将价值函数结构化为一个任务无关的核心和最小的任务特定调整。这种设计允许在分布内任务上具有强大的零样本性能,并在分布外任务上实现快速的少样本适应。我们在四旋翼轨迹跟踪任务上评估了RepMT-SAC在分布内和分布外上下文中的表现,证明其性能优于基线方法高达30%。

英文摘要

Reinforcement learning has achieved remarkable success in learning complex control policies, yet its applicability remains limited due to sample inefficiency and poor generalization across tasks. In this work, we propose RepMT-SAC, a framework for multi-task RL that enables efficient knowledge sharing and robust transfer to new tasks. RepMT-SAC uses spectral MDP decomposition to capture transferable dynamics, structuring the value function into a task-agnostic core with a minimal task-specific adjustment. This design allows for strong zero-shot performance on in-distribution tasks and rapid few-shot adaptation to out-of-distribution tasks. We evaluate RepMT-SAC on quadcopter trajectory-following tasks across in-distribution and out-of-distribution contexts, demonstrating that it outperforms baselines by up to 30%.

2606.12887 2026-06-12 cs.CR cs.DC cs.NI 新提交

LNTest: A Testbed for Evaluating Bitcoin Lightning Network-Based Botnets

LNTest: 评估基于比特币闪电网络的僵尸网络的测试平台

Thomas Bakaysa, Ahmet Kurt, Abdul-Salem Beibitkhan, Jesus Maria Romo Diaz de Leon, Tag Kalat, Joshua Kramer, Estela Rodriguez, Abraham Watkins, Abdullah Aydeger

AI总结 提出LNTest测试平台,通过容器化闪电网络节点模拟僵尸网络,发现D-LNBot协议生成聚类链而非均匀链,命令传播呈线性复杂度,且覆盖拓扑影响拆除策略效果。

详情
Comments
Accepted at the 21st International Conference on Availability, Reliability and Security (ARES 2026)
AI中文摘要

比特币的闪电网络(LN)可被利用作为僵尸网络的隐蔽、低成本的命令与控制(C&C)通道,如LNBot和D-LNBot设计所示。然而,两者仍仅为通过模拟评估的概念验证原型,关于实际拓扑形成、传播复杂性和抗拆除能力的关键问题尚未解答。我们提出LNTest,首个基于LN的僵尸网络可重用测试平台,基于Core Lightning节点构建,通过Docker容器化并在共享的Bitcoin Core regtest链上运行。LNTest支持三种覆盖拓扑模式(确定性链、自主对等发现和用户提供的图),从而能够跨不同僵尸网络结构进行受控实验。使用LNTest,我们报告三个主要发现。首先,D-LNBot的自主形成协议并未产生其设计中的均匀链;相反,它创建了一个聚类链,其中派系由桥接节点连接,移除这些节点会导致网络分裂。其次,命令传播与僵尸网络规模呈线性关系($\Theta(n)$),而非先前声称的$O(m \log n)$,并且更高的邻居连接度不会带来任何增益。第三,覆盖拓扑决定了拆除策略的有效性:均匀度链抵抗针对性移除但在随机故障下脆弱,无标度拓扑呈现相反模式,而自主形成的聚类链在两种情况下都脆弱,使其成为三者中最易受攻击的。LNTest作为开源发布,附带一个可重现所有实验的脚本,以支持基于LN的僵尸网络防御的可重复研究。

英文摘要

Bitcoin's Lightning Network (LN) can be exploited as a covert, low-cost command-and-control (C&C) channel for botnets, as demonstrated by the LNBot and D-LNBot designs. However, both remain proof-of-concept prototypes evaluated only through simulation, leaving key questions about real-world topology formation, propagation complexity, and resilience to takedowns unanswered. We present LNTest, the first reusable testbed for LN-based botnets, built from Core Lightning nodes containerized with Docker over a shared Bitcoin Core regtest chain. LNTest supports three overlay topology modes (a deterministic chain, autonomous peer discovery, and user-supplied graphs), enabling controlled experiments across different botnet structures. Using LNTest, we report three main findings. First, D-LNBot's autonomous formation protocol does not produce the uniform chain from its design; instead, it creates a clustered chain in which cliques are linked by bridge nodes whose removal fragments the network. Second, command propagation scales linearly with botnet size ($\Theta(n)$), not the $O(m \log n)$ previously claimed, and gains nothing from higher neighbor connectivity. Third, the overlay topology determines the effectiveness of takedown strategies: uniform-degree chains resist targeted removal but fragment under random failure, scale-free topologies show the opposite pattern, and the autonomous clustered chain is fragile under both, making it the most vulnerable of the three. LNTest is released as open source, with a script that reproduces all our experiments, to support reproducible research on LN-based botnet defenses.
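The first finding, that bridge nodes between cliques are single points of fragmentation, can be illustrated on a toy graph. This is a minimal sketch assuming a hypothetical eight-edge topology built by `clustered_chain`; it does not use LNTest or reproduce the paper's experiments.

```python
from collections import deque


def components(adj, removed=frozenset()):
    """Count connected components of an undirected graph, ignoring removed nodes."""
    seen, count = set(removed), 0
    for start in adj:
        if start in seen:
            continue
        count += 1
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return count


def clustered_chain():
    """Hypothetical toy overlay: two 3-node cliques joined through bridge node 'b'."""
    edges = [
        ("a1", "a2"), ("a2", "a3"), ("a1", "a3"),  # first clique
        ("a3", "b"), ("b", "c1"),                  # bridge node links the cliques
        ("c1", "c2"), ("c2", "c3"), ("c1", "c3"),  # second clique
    ]
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


graph = clustered_chain()
whole = components(graph)                                # intact overlay
without_bridge = components(graph, removed={"b"})        # targeted removal of the bridge
without_clique_member = components(graph, removed={"a1"})  # removal inside a clique
```

Removing the bridge splits the toy overlay into two components, while removing an ordinary clique member leaves it connected.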

2606.12886 2026-06-12 cs.CV cs.AI 新提交

Bridging Modal Isolation in Interleaved Thinking: Supervising Modality Transitions via Stepwise Reinforcement

交错思维中的模态隔离桥接:通过逐步强化监督模态转换

Tingyu Li, Le Zhou, Siyuan Li, Yujun Wu, Xinglong Xu, Jingxuan Wei, Conghui He, Cheng Tan

发表机构 * Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) Shanghai Jiaotong University(上海交通大学) Zhejiang University(浙江大学) University of Chinese Academy of Sciences(中国科学院大学)

AI总结 提出MoTiF框架,通过反射式SFT和Flow-GRPO优化模态转换保真度,解决交错思维中图像与文本脱节的模态隔离问题,提升跨模态一致性和任务准确性。

详情
Comments
22 pages, 5 figures, 6 tables
AI中文摘要

交错思维是一种统一的多模态模型交替进行文本推理和视觉生成的方法,在空间和物理任务上显示出潜力。然而,在复杂的长链场景中,我们识别出一个基本故障模式:生成的图像偏离文本上下文,而后续文本忽略视觉证据,导致两种模态交替但并未真正相互通知。我们将其称为模态隔离,并归因于模态边界处的信息损失累积。我们将每个推理循环分解为原子操作,并定义模态转换损失,量化每个边界处的跨模态幻觉(文本到图像)和视觉利用不足(图像到文本)。我们提出MoTiF(模态转换保真度),一个两阶段训练框架,直接优化这些转换:反射式SFT训练模型检测和恢复错误的视觉输出;Flow-GRPO通过强化学习提高图像生成保真度。MoTiF中的所有训练信号来自转换级保真度而非最终任务准确性。在四个视觉谜题基准测试中,这种转换级监督显著提高了跨模态一致性和最终任务准确性。结果表明,有效的交错推理需要在模态边界处进行明确的结构监督,而不仅仅是扩展或最终任务优化。

英文摘要

Interleaved thinking, where a unified multimodal model alternates between textual reasoning and visual generation, has shown promise on spatial and physical tasks. However, in complex long-chain scenarios, we identify a fundamental failure mode: generated images diverge from the textual context while subsequent text ignores the visual evidence, causing the two modalities to alternate without genuinely informing each other. We term this Modal Isolation and attribute it to compounding information loss at modality boundaries. We decompose each reasoning cycle into atomic operations and define modality transition loss, quantifying cross-modal hallucination (text-to-image) and visual utilization deficit (image-to-text) at each boundary. We propose MoTiF (Modality Transition Fidelity), a two-stage training framework that directly optimizes these transitions: Reflective SFT trains the model to detect and recover from erroneous visual outputs; Flow-GRPO improves image generation fidelity via reinforcement learning. All training signals in MoTiF derive from transition-level fidelity rather than end-task accuracy. Across four visual puzzle benchmarks, this transition-level supervision substantially improves both cross-modal coherence and final task accuracy. The results demonstrate that effective interleaved reasoning requires explicit structural supervision at modality boundaries, not merely scaling or end-task optimization.

2606.12885 2026-06-12 cs.NE 新提交

Mixed-Categorical Black-Box Optimization via Information-Geometric Bilevel Decomposition

混合类别黑箱优化:基于信息几何的双层分解

Marc Ong, Shinichi Shirakawa, Youhei Akimoto

AI总结 针对混合类别-连续黑箱优化中类别与连续变量强交互导致性能下降的问题,提出信息几何双层优化框架,外层优化类别变量,内层优化连续变量,并通过热启动策略降低计算成本,在二元-连续域上优于现有方法。

详情
Comments
Accepted at PPSN 2026
AI中文摘要

混合类别-连续优化出现在许多实际领域中,但仍然具有挑战性。在黑箱设置中,基于进化策略的方法在将CMA-ES的效率和鲁棒性扩展到混合变量空间方面显示出前景。然而,当存在强类别-连续交互时,这些方法的性能会下降,因为它们的基础搜索分布假设类别变量和连续变量之间独立。为了解决这一限制,我们提出了一个双层优化框架,通过在外循环中优化类别变量,在内循环中优化每个类别配置下的连续变量,显式地捕获这种交互。我们将双层问题的每一层都表述为信息几何优化下的随机松弛。为了减轻双层优化固有的高计算成本,我们引入了一种热启动策略,通过选择多个缓存配置中的最佳配置并在每次迭代后更新缓存来加速下层搜索。在二元-连续域上的实验结果表明,所提出的方法在交互处理能力上优于现有的最先进方法,同时在涵盖先前报告和新提出的交互类型的基准测试中计算效率也更高。

英文摘要

Mixed categorical-continuous optimization arises in many practical domains, yet remains challenging. In the black-box setting, evolution strategy-based approaches have shown promise in extending the efficiency and robustness of the CMA-ES to mixed-variable spaces. However, these methods perform worse when strong categorical-continuous interactions are present, as their underlying search distributions assume independence between categorical and continuous variables. To address this limitation, we propose a bilevel optimization framework that explicitly captures such interactions by optimizing over categorical variables in an outer loop, and over continuous variables conditioned on each categorical configuration in an inner loop. We formulate each level of the bilevel problem as a stochastic relaxation under information-geometric optimization. To mitigate the high computational cost inherent to bilevel optimization, we introduce a warm-starting strategy that accelerates the lower-level search by selecting the best among multiple cached configurations and updating the cache after each iteration. Experimental results on binary-continuous domains demonstrate that the proposed method outperforms existing state-of-the-art approaches in interaction-handling capability while also being more computationally efficient across benchmarks encompassing both previously reported and newly proposed types of interaction.

2606.12883 2026-06-12 cs.AI 新提交

The Hidden Power of Scaling Factor in LoRA Optimization

缩放因子在LoRA优化中的隐藏力量

Zicheng Zhang, Haoran Li, Jiaxing Wang, Guoqiang Gong, Anqi Li, Yudong Hu, Ting Xiong, Yurong Gao, Junxing Hu, Zhida Jiang, Yifeng Zhang, Pengzhang Liu, Qixia Jiang

发表机构 * School of Mathematical Sciences, UCAS(中国科学院大学数学科学学院) School of Mathematical Sciences, NKU(南开大学数学科学学院) School of Advanced Interdisciplinary Sciences, UCAS(中国科学院大学前沿交叉科学学院)

AI总结 本文揭示LoRA中缩放因子α与学习率功能不同,α主导优化效果,通过信号-漂移框架发现α能放大任务信号而不增加漂移比,并提出LoRA-α框架以简化超参数搜索并提升性能。

详情
AI中文摘要

在低秩适应(LoRA)中,缩放因子α通常被视为学习率的简单补充,但其在优化中的作用仍未被充分理解。本文揭示缩放因子α和学习率功能不同,α成为有效优化的主导驱动因素,带来无法通过单独缩放学习率复现的收益。通过大量实证分析和理论信号-漂移框架的协同作用,我们发现了关于LoRA缩放机制的三点发现:首先,LoRA的频谱抑制平滑了优化景观,使得标准超参数过于保守,造成优化差距。其次,当利用这种平滑性加速收敛时,α通过放大任务信号而不增加漂移比,优于学习率。第三,最优缩放因子与秩呈次线性关系,由平方根定律很好地刻画,且系数出乎意料地大,揭示了现有秩相关启发式方法的缩放不足。基于这些见解,我们提出LoRA-α,一个极简框架,将α恢复到其原则性状态,使LoRA与标准小学习率兼容。跨多种任务的广泛评估表明,LoRA-α在简化超参数搜索的同时持续提升性能,释放了LoRA的学习潜力。

英文摘要

In Low-Rank Adaptation (LoRA), the scaling factor $\alpha$ is often treated as a mere complement to the learning rate, yet its role in optimization remains poorly understood. In this paper, we reveal that the scaling factor $\alpha$ and the learning rate function differently, with $\alpha$ emerging as the dominant driver of effective optimization, delivering gains that cannot be replicated by learning rate scaling alone. Through the synergy of extensive empirical analysis and a theoretical Signal-Drift framework, we uncover three findings about LoRA's scaling mechanism: First, LoRA's spectral suppression smooths the optimization landscape, rendering standard hyperparameters overly conservative and creating an optimization gap. Second, when leveraging this smoothness to accelerate convergence, $\alpha$ outperforms the learning rate by amplifying the task signal without increasing the drift ratio. Third, the optimal scaling factor follows a sublinear relationship with the rank, well characterized by a square-root law with an unexpectedly large coefficient, revealing the insufficient scaling of existing rank-tied heuristics. Based on these insights, we propose LoRA-$\alpha$, a minimalist framework that restores $\alpha$ to its principled regime, making LoRA compatible with standard small learning rates. Extensive evaluations across diverse tasks demonstrate that LoRA-$\alpha$ consistently improves performance while streamlining hyperparameter search, unleashing the learning potential of LoRA.
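For orientation, here is where $\alpha$ enters the usual LoRA parameterization: the low-rank update is scaled as $(\alpha / r)\, B A$ before being added to the frozen weight. The sketch below shows only that standard scaling. The helper `sqrt_law_alpha` and its `coeff` argument are hypothetical stand-ins for the square-root law mentioned in the abstract, which reports neither the coefficient nor the exact form.

```python
import math


def lora_delta(B, A, alpha, rank):
    """Return the scaled low-rank update (alpha / rank) * B @ A as nested lists.

    B has shape (d_out, rank) and A has shape (rank, d_in).
    """
    scale = alpha / rank
    rows, cols = len(B), len(A[0])
    return [
        [scale * sum(B[i][k] * A[k][j] for k in range(rank)) for j in range(cols)]
        for i in range(rows)
    ]


def sqrt_law_alpha(rank, coeff):
    """Hypothetical square-root rule alpha = coeff * sqrt(rank).

    coeff is a placeholder; the abstract gives no value for it.
    """
    return coeff * math.sqrt(rank)


# Rank-1 toy example: B is 2x1, A is 1x2.
B = [[1.0], [2.0]]
A = [[3.0, 4.0]]
delta = lora_delta(B, A, alpha=2.0, rank=1)
delta_doubled = lora_delta(B, A, alpha=4.0, rank=1)  # doubling alpha doubles the update
```

Doubling `alpha` rescales the forward contribution of the adapter directly, whereas changing the learning rate only changes how fast `A` and `B` move. This is one plain way to see why the two knobs need not be interchangeable.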

2606.12882 2026-06-12 cs.AI 新提交

HarnessBridge: Learnable Bidirectional Controller for LLM Agent Harness

HarnessBridge: 用于LLM智能体框架的可学习双向控制器

Xiaoxuan Wang, Haixin Wang, Alexander Taylor, Jason Cong, Yizhou Sun, Wei Wang

AI总结 提出HarnessBridge,一种轻量级可学习框架控制器,通过双向投影参数化智能体-环境接口,减少令牌使用和轨迹长度,并泛化到更大模型。

详情
AI中文摘要

大型语言模型越来越多地被部署为用于长周期任务的智能体,但其性能不仅受模型能力和环境设计的影响,还受调节智能体-环境交互的框架的影响。现有的框架大多是手动设计的,随着轨迹变长和交互变得更加复杂,它们难以扩展。在这项工作中,我们探究框架是否可以通过一个可学习的即插即用模块生成,该模块可以以端到端的方式进行训练。我们引入了HarnessBridge,一种轻量级可学习框架控制器,它将智能体-环境接口参数化为双向投影。HarnessBridge学习两个双向投影:观测投影,将原始轨迹提炼为紧凑的、与决策相关的状态;以及动作投影,将提议的动作转换为可执行的转换或基于轨迹的拒绝。我们在框架监督数据集上通过统一指令调优训练HarnessBridge。在Terminal-Bench 2.0和SWE-bench Verified上,HarnessBridge匹配或超越了强大的专用框架,同时大幅减少了令牌使用和轨迹长度,并从较小的生成器泛化到较大的商业模型。

英文摘要

Large language models are increasingly deployed as agents for long-horizon tasks, yet their performance is shaped not only by model capability and environment design, but also by the harness that mediates agent-environment interaction. Existing harnesses are largely manually engineered, making them difficult to scale as trajectories grow longer and interactions become more complex. In this work, we ask whether a harness can be generated by a learnable plug-in module that can be trained in an end-to-end fashion. We introduce HarnessBridge, a lightweight learnable harness controller that parameterizes the agent-environment interface as a bidirectional projection. HarnessBridge learns two projections: observation projection, which distills raw trajectories into compact, decision-relevant states, and action projection, which converts proposed actions into executable transitions or trajectory-grounded rejections. We train HarnessBridge on a harness supervision dataset via unified instruction tuning. On Terminal-Bench 2.0 and SWE-bench Verified, HarnessBridge matches or surpasses strong specialized harnesses while substantially reducing token usage and trajectory length, and generalizes from smaller generators to larger commercial models.

2606.12881 2026-06-12 cs.CL cs.LG 新提交

Direct Preference Optimization for Chatbot Fine-Tuning: An Empirical Study

面向聊天机器人微调的直接偏好优化:一项实证研究

Yvonne Qiu, Dezhi Yu, ShuoJia Fu

AI总结 本文实证研究直接偏好优化(DPO)在聊天机器人微调中的应用,表明其简化训练流程、提升计算效率且性能有竞争力,但存在训练不稳定性。

详情
Comments
7 pages, 3 figures, 1 table
AI中文摘要

我们提出了一种使用直接偏好优化(DPO)微调大型语言模型的方法,这是一种强化学习技术。我们的实验结果表明,DPO简化了训练流程,提高了计算效率,并实现了有竞争力的性能。使用BLEU、ROUGE和余弦相似度指标的评估表明,模型有效学习并收敛,尽管需要进一步研究以解决观察到的训练不稳定性。

英文摘要

We present an approach to fine-tuning large language models using Direct Preference Optimization (DPO), a reinforcement learning technique. Our experimental results demonstrate that DPO simplifies the training pipeline, improves computational efficiency, and achieves competitive performance. The evaluation using BLEU, ROUGE, and cosine similarity metrics indicates effective learning and convergence, though further investigation is needed to address observed training instability.
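As background, the DPO objective for a single preference pair can be written in a few lines. This is a sketch of the standard DPO loss, not the authors' training code; the argument names and the default `beta=0.1` are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    the trainable policy or the frozen reference model.
    """
    # How much more the policy prefers the chosen response than the reference does.
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    x = beta * margin
    # -log(sigmoid(x)) computed as a numerically stable softplus(-x).
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# Policy identical to the reference: zero margin, so the loss equals log(2).
at_reference = dpo_loss(-5.0, -7.0, -5.0, -7.0)

# Policy raises the chosen response and lowers the rejected one: loss drops.
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

The loss needs only log-probabilities from the policy and a frozen reference, with no separately trained reward model or sampling loop. This is the sense in which DPO simplifies the training pipeline.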

2606.12871 2026-06-12 cs.AI 新提交

DailyReport: An Open-ended Benchmark for Evaluating Search Agents on Daily Search Tasks

DailyReport: 一个用于评估搜索代理在日常搜索任务上的开放式基准

Jingxuan Han, Wei Liu, Mingyang Zhu, Youpeng Wang, Ziwen Wang, Lin Qiu, Xuezhi Cao, Xunliang Cai, Zheren Fu, Licheng Zhang, Zhendong Mao

发表机构 * University of Science and Technology of China(中国科学技术大学) Meituan(美团)

AI总结 提出DailyReport基准,包含150个开放式日常搜索任务和3546个级联评分标准,通过分解子任务和维度评估,揭示当前搜索代理系统仍未能满足用户期望。

详情
AI中文摘要

搜索代理(SAs)通常利用大型语言模型(LLMs)通过自主探索网络资源并将信息综合成全面响应来支持复杂的信息寻求任务。对于SAs的评估,先前的基准主要关注在真实用户场景中不太可能出现的专门任务。此外,它们依赖于粗略的任务级评分标准,通常限制了评估的可解释性。为弥补这一差距,我们引入了DailyReport,一个用于评估SA在日常搜索任务上能力的开放式基准。它包含150个开放式任务,配有3546个相关评分标准,捕捉了真实用户广泛讨论和及时的信息需求。每个任务被分解为子任务,并通过跨解缠维度的级联评分标准进行评估。通过级联性能归因和以用户为中心的聚合,我们为每个维度推导出高度可解释的分数,以及一个用户偏好分数。我们在17个代理系统上的结果表明,当前系统仍未能达到用户的期望。为促进未来研究,我们的数据集和代码已通过此 https URL 公开。

英文摘要

Search Agents (SAs) typically leverage large language models (LLMs) to support complex information-seeking tasks by autonomously exploring web sources and synthesizing information into comprehensive responses. For SA evaluation, prior benchmarks mainly focus on specialized tasks that are unlikely to arise in real-world user scenarios. Moreover, their reliance on coarse task-level rubrics often limits evaluation interpretability. To bridge this gap, we introduce DailyReport, an open-ended benchmark to evaluate SA capabilities on daily search tasks. It contains 150 open-ended tasks with 3,546 associated rubrics, capturing widely discussed and timely information demands of real-world users. Each task is decomposed into subtasks and evaluated with cascade rubrics across disentangled dimensions. Through cascade performance attribution and user-centric aggregation, we derive highly interpretable scores for each dimension, along with a user preference score. Our results on 17 agentic systems show that current systems still fall short of users' expectations. To facilitate future research, our dataset and code are made publicly available at this https URL.

2606.12869 2026-06-12 cs.CV 新提交

Learning Task-Aware Sampling with Shared Saliency through Density-Equalizing Mappings

通过密度均衡映射学习具有共享显著性的任务感知采样

Tsz Lok Ip, Han Zhang, Lok Ming Lui

发表机构 * Department of Mathematics, The Chinese University of Hong Kong(香港中文大学数学系) Department of Mathematics, City University of Hong Kong(香港城市大学数学系)

AI总结 提出DECNN框架,利用密度均衡映射根据数据空间重要性动态重分配卷积计算资源,实现任务自适应采样,提升模型效率与可解释性。

详情
Comments
16 pages, 10 figures
AI中文摘要

在基于图像和表面的学习任务中,卷积特征通常使用在整个域上均匀采样的感受野来提取。然而,信息丰富的结构在实践中很少均匀分布,通常集中在局部区域。这种现象在医学影像中尤为常见,其中病理变化在空间上受限。因此,均匀卷积将相同的计算量分配给信息丰富和信息不丰富的区域,导致特征提取效率低下和模型容量利用不充分。为了解决这个问题,我们提出了一个任务自适应采样框架,根据数据的空间重要性动态重分配计算注意力。具体来说,我们引入了密度均衡卷积神经网络(DECNN),它通过密度均衡映射,利用学习到的密度函数来引导卷积。密度函数编码了不同区域的相对重要性,并诱导一种变换,放大信息丰富的区域,同时压缩不太相关的区域。结果,卷积感受野在域上非均匀地重新分布,使得在任务相关区域能够进行更密集的采样。通过将这种重要性驱动的变换与卷积相结合,DECNN执行自适应特征提取,将计算资源集中在信息丰富的结构上。这导致更有效地利用模型容量,产生一个轻量级但表达力强的架构,同时生成可解释的显著性图。在图像分类和颅面表面分析上的实验表明,DECNN以更少的参数实现了竞争性或更优的性能,准确识别任务相关区域,并在复杂的几何变化下保持鲁棒性。

英文摘要

In image and surface-based learning tasks, convolutional features are typically extracted using receptive fields that are sampled uniformly across the entire domain. However, informative structures are rarely distributed uniformly in practice and are often concentrated in localized regions. Such phenomena are particularly common in medical imaging, where pathological changes are spatially confined. Consequently, uniform convolution allocates equal computational effort to both informative and uninformative regions, resulting in inefficient feature extraction and suboptimal utilization of model capacity. To address this issue, we propose a framework for task-adaptive sampling that dynamically redistributes computational attention according to the spatial importance of the data. Specifically, we introduce the Density-Equalizing Convolutional Neural Network (DECNN), which employs density-equalizing mappings to guide convolution through a learned density function. The density function encodes the relative importance of different regions and induces a transformation that enlarges informative areas while compressing less relevant ones. As a result, convolutional receptive fields are redistributed non-uniformly over the domain, enabling denser sampling in task-relevant regions. By coupling this importance-driven transformation with convolution, DECNN performs adaptive feature extraction that focuses computational resources on informative structures. This leads to more efficient use of model capacity, yielding a lightweight yet expressive architecture while simultaneously producing an interpretable saliency map. Experiments on image classification and craniofacial surface analysis demonstrate that DECNN achieves competitive or superior performance with fewer parameters, accurately identifies task-relevant regions, and remains robust under complex geometric variations.
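The idea of enlarging informative regions so that uniform sampling in the mapped domain becomes dense sampling in the original one has a simple one-dimensional analogue: place sample points so that every gap holds the same amount of density mass. The sketch below assumes a piecewise-constant density on equal-width bins of [0, 1]. It is not DECNN's mapping on images or surfaces, and `equalized_samples` is a hypothetical helper.

```python
def equalized_samples(density, n_samples):
    """Place n_samples points on [0, 1] so each gap holds equal density mass.

    density: positive weights for equal-width bins covering [0, 1].
    This inverts the cumulative density, the 1-D density-equalizing map.
    """
    total = float(sum(density))
    width = 1.0 / len(density)
    points = []
    for i in range(n_samples):
        target = (i + 0.5) / n_samples * total  # mass that should lie left of the point
        acc = 0.0
        for b, rho in enumerate(density):
            if acc + rho >= target:
                # Interpolate linearly inside the bin that contains the target mass.
                points.append((b + (target - acc) / rho) * width)
                break
            acc += rho
    return points


# Flat density: the samples stay uniformly spaced.
uniform = equalized_samples([1, 1, 1, 1], 4)

# Density concentrated in the middle third: most samples land there.
focused = equalized_samples([1, 6, 1], 8)
in_middle = sum(1 for p in focused if 1 / 3 < p < 2 / 3)
```

With the middle bin carrying six of eight units of mass, six of the eight samples fall in the middle third. In DECNN the density is learned rather than fixed, according to the abstract.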

2606.12867 2026-06-12 cs.LG 新提交

SMGFM: Spectral Multimodal Graph Pretraining for Multimodal-Attributed Graphs

SMGFM: 面向多模态属性图的谱多模态图预训练

Zhengyu Wu, Xu Wang, Hongchao Qin, Xunkai Li, Guang Zeng, Rong-Hua Li, Guoren Wang

AI总结 提出SMGFM框架,利用图频谱分解区分结构诱导语义与模态特有语义,通过频带路由实现跨模态融合,在图级和模态级任务上取得最优性能。

详情
AI中文摘要

多模态属性图(MAGs)将图拓扑结构与来自文本、图像等模态的节点语义相结合。传统的图学习通过耦合拓扑与节点特征来上下文化节点语义。然而,这种耦合设计在MAGs中变得棘手,因为结构诱导和模态固有的语义可能对下游任务产生不同贡献。结构诱导语义通过平滑拓扑变化促进关系一致性,而模态固有语义通常编码局部、细粒度的区分,不应被统一平滑或对齐。因此,关键挑战在于跨模态融合前识别语义角色。为此,我们利用图频率变化作为先验,其中低频分量捕获拓扑一致语义,高频分量保留模态特定语义。基于这一直觉,我们提出SMGFM,一种谱多模态图预训练框架,将每个模态特定的节点信号分解为图频带,并在跨模态交互前分配频带级语义角色。具体地,SMGFM使用可扩展的切比雪夫滤波器构建频率解析的模态令牌,通过拓扑条件路由估计其耦合可靠性,并在融合前进行频带-模态交互。其频率路由目标在平滑共识路由的同时保留模态特定路由,减轻空间域纠缠和统一跨模态对齐。在MAG数据集上的大量实验表明,SMGFM在图级和模态级任务上均达到最先进性能。

英文摘要

Multimodal-attributed graphs (MAGs) couple graph topology with node semantics from text, images, and other modalities. Traditional graph learning contextualizes node semantics by coupling topology with node features. However, this coupling design becomes troublesome in MAGs, where structure-induced and modality-intrinsic semantics may contribute differently to downstream tasks. Structure-induced semantics promote relational consistency through smooth topological variation, whereas modality-intrinsic semantics often encode local, fine-grained distinctions that should not be uniformly smoothed or aligned. Therefore, the key challenge is to identify semantic roles before cross-modal fusion. To this end, we leverage graph-frequency variation as a prior, where low-frequency components capture topology-consistent semantics and high-frequency components preserve modality-specific semantics. Based on this intuition, we propose SMGFM, a spectral multimodal graph pretraining framework that decomposes each modality-specific node signal into graph-frequency bands and assigns band-level semantic roles before cross-modal interaction. Concretely, SMGFM constructs frequency-resolved modality tokens with scalable Chebyshev filters, estimates their coupling reliability through topology-conditioned routing, and performs band-modality interaction before fusion. Its frequency-routed objectives align smooth consensus routes while preserving modality-specific routes, mitigating spatial-domain entanglement and uniform cross-modal alignment. Extensive experiments conducted on the MAG datasets demonstrate that SMGFM achieves state-of-the-art performance across graph-level and modality-level tasks.
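Chebyshev filters avoid an explicit eigendecomposition by applying the polynomial recurrence $T_k(\tilde{L})x = 2\tilde{L}\,T_{k-1}(\tilde{L})x - T_{k-2}(\tilde{L})x$ to a node signal $x$, where $\tilde{L}$ is the graph Laplacian rescaled to have eigenvalues in $[-1, 1]$. The sketch below shows only this generic recurrence on a two-node graph. How SMGFM turns the resulting terms into modality tokens and routes them is not specified in the abstract, and the function names are illustrative.

```python
def matvec(M, x):
    """Dense matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def chebyshev_bands(L_scaled, x, order):
    """Return [T_0(L)x, ..., T_order(L)x] via the three-term Chebyshev recurrence."""
    bands = [list(x)]
    if order >= 1:
        bands.append(matvec(L_scaled, x))
    for _ in range(2, order + 1):
        nxt = matvec(L_scaled, bands[-1])
        bands.append([2.0 * a - b for a, b in zip(nxt, bands[-2])])
    return bands


# Two nodes joined by one edge: normalized Laplacian [[1, -1], [-1, 1]],
# largest eigenvalue 2, so the rescaled operator is L - I.
L_scaled = [[0.0, -1.0], [-1.0, 0.0]]

smooth = chebyshev_bands(L_scaled, [1.0, 1.0], 2)        # constant (low-frequency) signal
oscillating = chebyshev_bands(L_scaled, [1.0, -1.0], 2)  # alternating (high-frequency) signal
```

The constant signal changes sign in the first-order term while the alternating signal does not, so a weighted combination of the terms can separate smooth from oscillating components. Each term costs one sparse matrix-vector product in practice, which is what makes such filters scalable.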

2606.12864 2026-06-12 cs.SE cs.AI 新提交

Beyond Problem Solving: UOJ-Bench for Evaluating Code Generation, Hacking, and Repair in Competitive Programming

超越问题求解:用于评估竞赛编程中代码生成、攻击和修复的UOJ-Bench基准

Tingqiang Xu, Hangrui Zhou, Tianle Cai, Alex Gu, Kaifeng Lyu

AI总结 提出UOJ-Bench基准,通过代码生成、攻击和修复三项任务评估LLM在竞赛编程中的问题求解与人类代码错误识别能力,发现最强模型在一次性评估中无法识别超过50%的错误提交,但测试时扩展可提升至90%以上,且能在约30个问题中发现超过5%的满分提交中的错误。

详情
AI中文摘要

尽管大型语言模型(LLM)在竞赛编程中表现出色,但其在相同环境下支持人类学习的作用仍基本未被探索。本文介绍UOJ-Bench,一个旨在评估LLM不仅解决问题能力,还能识别人类编写代码中错误的基准——这是传统上通过在线评测系统运行测试用例支持的关键教育活动。UOJ-Bench包含三个不同任务:代码生成、代码攻击和代码修复,所有任务均基于Universal Online Judge(UOJ)上的真实代码提交构建,并通过UOJ的原生评测基础设施进行评估。我们的结果表明,在一次性评估下,即使最强的模型也无法识别超过50%的被UOJ用户发现错误的提交。虽然测试时扩展将成功率提升至90%以上,但模型推理带来的巨大计算成本限制了其大规模部署的实用性。尽管存在这些限制,我们发现,在测试时扩展下,最佳性能模型可以在大约30个问题中识别超过5%的满分提交中的错误,这表明前沿LLM已经能够提供超越标准评测系统的补充信号。

英文摘要

Despite strong performance in competitive programming, the role of Large Language Models (LLMs) in supporting human learning in the same setting remains largely unexplored. In this work, we introduce UOJ-Bench, a benchmark designed to evaluate not only the problem-solving ability of LLMs, but also their ability to identify errors in human-written code -- a crucial educational activity traditionally supported by running test cases over online judge systems. UOJ-Bench consists of three distinct tasks: code generation, code hacking, and code repair, all constructed from real-world code submissions on the Universal Online Judge (UOJ) and evaluated through UOJ's native judging infrastructure. Our results show that under one-shot evaluation, even the strongest models fail to identify errors in more than 50% of a set of submissions that have been found to be incorrect by UOJ users. While test-time scaling improves success rates to above 90%, the substantial computational costs incurred from model inference limit its practicality for large-scale deployment. Despite these limitations, we find that the best-performing models under test-time scaling can uncover errors in over 5% of full-score submissions across roughly 30 problems, suggesting that frontier LLMs can already provide complementary signals beyond standard judging systems.

2606.12863 2026-06-12 cs.LG 新提交

Multimodal Graph Negative Learning

多模态图负学习

Zhengyu Wu, Xu Wang, Hongchao Qin, Xunkai Li, Guang Zeng, Rong-Hua Li, Guoren Wang

AI总结 提出GraphMNL框架,通过负学习解决多模态属性图中节点级分支语义不平衡问题,避免主导分支偏差传播,在Grocery和Reddit M数据集上取得最优性能。

详情
AI中文摘要

多模态属性图(MAGs)将图拓扑与异构模态属性(如文本和图像)集成,从而能够对复杂关系系统进行更丰富的建模。然而,这种表达能力也使得MAGs上的学习依赖于多个语义源,包括结构拓扑、文本和视觉属性,每个都可以被视为节点表示的一个分支。当这些分支在语义信息量和可靠性上因节点而异时,就会出现节点级分支语义不平衡:一个分支为某个节点提供判别性语义,但由于模态质量或结构上下文的偏差,可能会误导另一个节点。现有方法通常通过跨分支一致性或对齐来缓解这种异质性,隐含地将主导预测视为可靠监督。当主导分支有偏差时,强制模仿可能会将其偏差传播到其他分支,并抑制对分类有用的原始语义。我们提出GraphMNL,一种图感知的多模态负学习框架,通过使用负学习作为跨分支指导来解决这个问题。该模型不强制劣质分支模仿教师预测,而是教导它们节点不太可能属于哪些类别。GraphMNL构建分支库,通过图感知可靠性仲裁识别主导和劣质分支,门控不稳定传输,并对非目标类别应用目标保持负学习。这种设计将目标监督与分支指导解耦,使得监督损失学习正确类别,而当分支一致性不可靠时,负学习抑制不太可能的备选类别。通过全面的实验评估,GraphMNL在Grocery数据集上达到72.47%的准确率,在Reddit M数据集上达到76.60的F1分数,取得了最佳性能。

英文摘要

Multimodal attributed graphs (MAGs) integrate graph topology with heterogeneous modality attributes, such as text and images, thereby enabling richer modeling of complex relational systems. However, such expressiveness also makes learning on MAGs depend on multiple semantic sources, including structural topology, textual and visual attributes, each of which can be regarded as a branch for node representation. Node-level branch semantic imbalance arises when these branches differ across nodes in semantic informativeness and reliability: a branch that provides discriminative semantics for one node may mislead another due to bias in modality quality or structural context. Existing methods often mitigate such heterogeneity through cross-branch agreement or alignment, implicitly treating the dominant prediction as reliable supervision. When the dominant branch is biased, forced imitation may propagate its bias to other branches and suppress original semantics that are useful for classification. We propose GraphMNL, a graph-aware multimodal negative learning framework that addresses this issue by using Negative Learning as cross-branch guidance. Instead of forcing inferior branches to imitate a teacher prediction, the model teaches them which classes a node is unlikely to belong to. GraphMNL builds a branch library, identifies dominant and inferior branches via graph-aware reliability arbitration, gates unstable transfer, and applies target-preserving negative learning over non-target classes. This design decouples target supervision from branch guidance so that supervised losses learn the correct class, while Negative Learning suppresses unlikely alternatives when branch agreement is unreliable. In a comprehensive experimental evaluation, GraphMNL achieves the best performance, with 72.47% accuracy on Grocery and a 76.60 F1 score on Reddit M.
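Negative learning penalizes probability mass on classes a sample is believed not to belong to, typically through a term of the form $-\log(1 - p_k)$ for each such class $k$. The sketch below is a generic version of that loss which skips the labeled target class, in the spirit of the "target-preserving" description. The choice of `negative_classes`, which in GraphMNL would come from its branch arbitration and gating, is simply passed in as an assumption.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def negative_learning_loss(logits, negative_classes, target):
    """Sum of -log(1 - p_k) over the given non-target classes.

    negative_classes: classes another branch considers unlikely (assumed given).
    target: the labeled class, always excluded so its probability is never pushed down.
    """
    probs = softmax(logits)
    loss = 0.0
    for k in negative_classes:
        if k == target:
            continue
        loss += -math.log(max(1.0 - probs[k], 1e-12))
    return loss


# Uniform prediction over three classes, class 1 flagged as unlikely.
before = negative_learning_loss([0.0, 0.0, 0.0], negative_classes=[1], target=0)

# Lowering the logit of class 1 reduces the penalty.
after = negative_learning_loss([0.0, -5.0, 0.0], negative_classes=[1], target=0)

# The target class is never penalized, even if it appears in the negative set.
protected = negative_learning_loss([0.0, 0.0, 0.0], negative_classes=[0], target=0)
```

The loss only says "not this class", so it constrains the prediction without requiring the receiving branch to copy a possibly biased teacher distribution.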

2606.12859 2026-06-12 cs.RO 新提交

AIR-VLA+: Decoupling Movement and Manipulation via Cascaded Dual-Action Decoders with Asymmetric MoE for Aerial Robots

AIR-VLA+: 通过级联双动作解码器与非对称MoE解耦空中机器人的移动与操作

Jianli Sun, Bin Tian, Qiyao Zhang, Zijian Liu, Yutong Wang, Zhiyong Cui, Bai Li, Yisheng Lv, Yonglin Tian

发表机构 * The Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) School of Automation, Beijing Institute of Technology(北京理工大学自动化学院) College of Automotive and Energy Engineering, Tongji University(同济大学汽车与能源工程学院) School of Transportation Science and Engineering, Beihang University(北京航空航天大学交通科学与工程学院) Information Science, East China Normal University(华东师范大学信息科学)

AI总结 针对空中机器人移动与操作在动作尺度、动力学和控制目标上的显著差异,提出级联双动作解码器与非对称MoE架构,实现解耦协调控制,在AIR-VLA基准上取得48.0平均分,任务完成度提升80.2%。

详情
AI中文摘要

空中操作系统长期以来在端到端控制中遭受表示耦合问题,因为平台级无人机(UAV)移动与末端执行器级机械臂操作在动作尺度、动力学和控制目标上存在显著差异。本文提出AIR-VLA+,一种专为空中操作设计的流匹配动作生成架构,具有级联双动作解码器和非对称特征级混合专家(MoE)。我们构建了级联的操作和移动解码器,使无人机在移动过程中单向观察机械臂的意图以实现工作流协调,同时隔离无人机移动信息反向传播对机械臂操作稳定性的影响。针对空中操作中无人机移动高度依赖高层语义并负责任务状态转换的特点,我们为无人机移动解码器设计了输入特征增强模块,该模块引入隐式视觉抓取投影器以感知夹爪与物体的交互状态,并注入压缩的全局语义特征。在无人机移动解码器内部,我们部署了隐式MoE架构,使不同的移动专家在训练过程中自发地对不同任务阶段表现出能力倾向。通过在特征流形上进行密集软混合计算,无人机移动获得了更强的任务阶段适应性。在标准化AIR-VLA基准上的实验表明,我们的方法以48.0的总体平均分全面超越所有基线。与单头$\pi_{0.5}$策略相比,整体任务完成分数提高了80.2%,有效缓解了复合机器人的异构协调控制冲突。

英文摘要

Aerial manipulation systems have long suffered from representation coupling in end-to-end control, as platform-level Unmanned Aerial Vehicle (UAV) movement and end-effector-level arm manipulation differ substantially in action scale, dynamics, and control objectives. In this paper, we propose AIR-VLA+, a flow matching action generation architecture specifically designed for aerial manipulation, featuring cascaded dual-action decoders and an asymmetric feature-level Mixture of Experts (MoE). We construct cascaded manipulation and movement decoders, allowing the UAV to unidirectionally observe the manipulator's intent during movement to achieve workflow coordination, while isolating the impact of UAV movement information backpropagation on arm manipulation stability. Addressing the characteristic that UAV movement is highly dependent on high-level semantics and responsible for task state transitions in aerial manipulation, we design an input feature enhancement module for the UAV movement decoder. This module introduces an implicit visual grasp projector to perceive the interaction state between the gripper and the object, and injects compressed global semantic features. Within the UAV movement decoder, we deploy an implicit MoE architecture, enabling different movement experts to spontaneously exhibit capacity inclinations for various task stages during training. Through dense soft blending computation on the feature manifold, the UAV movement is endowed with stronger task-stage adaptability. Experiments on the standardized AIR-VLA benchmark demonstrate that our method comprehensively surpasses all baselines with an overall average score of 48.0. The overall task completion score improves by 80.2% compared to the single-head $\pi_{0.5}$ policy, effectively mitigating the heterogeneous coordinated control conflicts of composite robots.

2606.12854 2026-06-12 cs.CL q-bio.QM 新提交

Small LLMs for Biomedical Claim Verification: Cost-Effective Fine-Tuning, Structural Dataset Shortcuts, and Cross-Domain Generalization

小型LLM用于生物医学声明验证:成本效益微调、结构性数据集捷径与跨域泛化

Gaurav Kumar

发表机构 * Moveworks AI University of California San Diego(加州大学圣迭戈分校)

AI总结 通过QLoRA微调小型LLM(Phi-3-mini、Qwen2.5-3B、Mistral-7B),其中Mistral-7B QLoRA在生物医学声明验证中超越GPT-4o和GPT-5(F1提升最高达12%),并发现SciFact数据集的结构性伪影,表明在结构稳健的数据上训练可实现鲁棒的跨域迁移。

详情
Comments
8 pages, 2 figures, 12 tables. To appear at BioNLP Workshop, ACL 2026
AI中文摘要

大型语言模型如GPT-4o和GPT-5在生物医学声明验证上表现出强大的零样本性能,但成本和透明度限制了其可扩展使用。我们通过QLoRA在SciFact和HealthVer上微调了三个小型LLM:Phi-3-mini(3.8B)、Qwen2.5-3B和Mistral-7B,首次研究了QLoRA模型与GPT-4o及微调BioLinkBERT编码器的对比。Mistral-7B QLoRA在仅使用1,008个训练样本的情况下,以极低的成本超越了GPT-4o和GPT-5(F1提升高达12%)。我们进行了广泛的域内和跨域评估:在SciFact上训练的模型在HealthVer上测试,反之亦然,并匹配模型大小以隔离数据集结构与数据量的影响。我们识别了SciFact中一个先前未报告的结构性伪影,该伪影夸大了域内得分,并通过双向域外评估表明,在结构稳健的数据上训练能够实现鲁棒的跨域迁移。我们计划发布所有代码和适配器检查点。

英文摘要

Large Language Models such as GPT-4o and GPT-5 achieve strong zero-shot performance on biomedical claim verification, but cost and opacity limit scalable use. We fine-tune three small LLMs: Phi-3-mini (3.8B), Qwen2.5-3B, and Mistral-7B, via QLoRA on SciFact and HealthVer, providing the first study of QLoRA models against GPT-4o and fine-tuned BioLinkBERT encoders. Mistral-7B QLoRA surpasses both GPT-4o and GPT-5 (up to 12% F1 gain) at a fractional cost using just 1,008 training examples. We conduct extensive in-domain and cross-domain evaluation: models trained on SciFact tested on HealthVer and vice versa, at matched sizes to isolate dataset structure from data quantity. We identify a previously unreported structural artifact in SciFact that inflates in-domain scores, and show through bidirectional out-of-domain evaluation that training on structurally sound data enables robust cross-domain transfer. We plan to release all code and adapter checkpoints.

2606.12852 2026-06-12 cs.AI 新提交

WISE: A Long-Horizon Agent in Minecraft with Why-Which Reasoning

WISE:具有Why-Which推理的Minecraft长时域智能体

Renmin Cheng, Changhao Chen (The Hong Kong University of Science and Technology (Guangzhou))

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 提出WISE框架,通过因果事件图增强情景记忆,将what-where-when记忆与which-why推理关联起来,结合机会主义任务调度和多尺度探索,显著提升长时域稀疏任务的成功率和效率。

详情
AI中文摘要

通过采用LLM增强的分层方法,在Minecraft等环境中开发通用具身智能体取得了快速进展。尽管前景广阔,但低级控制器由于重复执行失败常常成为性能瓶颈。我们认为,一个关键限制不仅是缺乏情景记忆,而且是将what-where-when记忆与which-why推理解耦。为了解决这个问题,我们提出WISE(Which-Why Informed Semantic Explorer),一个长时域智能体框架,其增强的低级控制器配备因果事件图,通过将观察与任务相关性关联的显式因果结构来增强情景记忆。与先前依赖特征相似性进行检索的工作(如MrSteve)不同,WISE能够在视角变化下实现稳健回忆,并通过因果推理支持机会主义任务重排序。基于这种记忆,我们提出一个机会主义任务调度器,当检测到因果相关机会时动态重新优先化子任务。我们进一步为WISE配备多尺度渐进探索策略,为下游推理提供空间上全面的观察。实验表明,WISE在长时域稀疏任务上大幅提高了任务成功率和效率,特别是在需要自适应决策的场景中。

英文摘要

Rapid advances have been made in developing general-purpose embodied agents in environments like Minecraft through the adoption of LLM-augmented hierarchical approaches. Despite their promise, low-level controllers often become performance bottlenecks due to repeated execution failures. We argue that a key limitation is not only the lack of episodic memory, but also the decoupling of what-where-when memory from which-why reasoning. To address this, we propose WISE (Which-Why Informed Semantic Explorer), a long-horizon agent framework with an enhanced low-level controller equipped with a Causal Event Graph that augments episodic memory with explicit causal structure linking observations to task relevance. Unlike prior work such as MrSteve, which relies on feature similarity for retrieval, WISE enables robust recall under viewpoint changes and supports opportunistic task reordering through causal reasoning. Building on this memory, we propose an Opportunistic Task Scheduler that dynamically re-prioritizes subtasks when causally relevant opportunities are detected. We further equip WISE with a multi-scale progressive exploration strategy to provide spatially comprehensive observations for downstream reasoning. Experiments show that WISE substantially improves task success and efficiency on long-horizon sparse tasks, particularly in settings requiring adaptive decision-making.