arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.12902 2026-06-12 cs.CL New submission

PRISM: Prosody-Integrated Multi-Agent Reasoning Framework for Empathetic Spoken Dialogue

Wen Zhang, Xiaocui Yang, Zhuoyue Gao, Shi Feng, Daling Wang, Yifei Zhang

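Affiliations: School of Computer Science and Engineering, Northeastern University (China)

AI summary: Proposes PRISM, a multi-agent framework that decouples speech perception, response generation, and speech synthesis and introduces a prosody-to-language translation mechanism, achieving prosodic appropriateness and knowledge integration in empathetic spoken dialogue.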

Comments
Accepted to Interspeech 2026

Abstract

Empathetic spoken dialogue systems require not only semantically appropriate responses but also emotionally aligned prosodic expression. However, cascade pipelines often discard acoustic cues during speech-to-text conversion, while end-to-end speech models lack interpretable control over emotion and knowledge integration. To address these challenges, we propose PRISM, a multi-agent framework for empathetic spoken dialogue that decouples speech perception, response generation, and speech synthesis into coordinated components. PRISM introduces a prosody-to-language translation mechanism to stabilize large language model reasoning and enables on-demand invocation of external knowledge tools for empathetic dialogue generation. Experimental results demonstrate that PRISM achieves consistent improvements in empathy, prosodic appropriateness, and text response generation quality across objective and subjective metrics. Our code is available at: this https URL.

2606.12898 2026-06-12 cs.CV cs.CL New submission

Magnifying What Matters: Attention-Guided Adaptive Rendering for Visual Text Comprehension

Shenglai Zeng, Qirui Wang, Kai Guo, Xinnan Dai, Xianxuan Long, Hui Liu

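Affiliations: Michigan State University; Xi’an Jiaotong University

AI summary: To address the gap between localization and utilization that vision-language models exhibit on visual text comprehension tasks, proposes AGAR, a training-free, model-agnostic attention-guided adaptive rendering method that improves model performance by enlarging key text spans.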

Abstract

Visual Text Comprehension (VTC) renders text into images for a vision-language model (VLM) to read, sidestepping LLM context-window limits and powering applications from long-page OCR to multi-page memory QA. Yet existing VTC pipelines treat rendering and layout as a fixed, content-agnostic preprocessing step and offer little mechanistic understanding of how VLMs internally process visualized text. Through a focused empirical study on VTC QA tasks, we reveal that VLMs exhibit a localization-without-utilization regime: evidence-localizing attention emerges sharply in the middle-to-late layers and is largely decoupled from answer correctness, yet simply enlarging the localized spans on the rendered page recovers a large fraction of the failures. Building on these observations, we propose AGAR (Attention-Guided Adaptive Rendering), a training-free, model-agnostic method that leverages a VLM's own middle-to-late layer attention to identify the top-K important visual patches, maps them back to word spans, and re-renders the page with those spans enlarged before re-inferring the answer. Extensive experiments across nine VTC benchmarks (short-form, long-context, and multi-page memory QA) and four VLM backbones show that AGAR (i) consistently improves off-the-shelf VLMs as a plug-and-play enhancement, (ii) composes with VLM post-training to yield further gains, and (iii) remains robust under both visual- and text-side input degradation.
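
The select-and-enlarge step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the patch-grid layout, the `word_boxes` format, the helper names, and the 2.0 enlargement factor are all assumptions made for illustration, and the attention scores are supplied by hand rather than read from a VLM.

```python
from typing import Dict, List, Tuple

Patch = Tuple[int, int]  # (row, col) on the visual patch grid


def top_k_patches(scores: List[List[float]], k: int) -> List[Patch]:
    """Return the k patches with the highest attention score (ties broken by position)."""
    flat = [(s, r, c) for r, row in enumerate(scores) for c, s in enumerate(row)]
    flat.sort(key=lambda t: (-t[0], t[1], t[2]))
    return [(r, c) for _, r, c in flat[:k]]


def words_hit(patches: List[Patch], word_boxes: Dict[int, Tuple[int, int, int]]) -> List[int]:
    """Map patches back to word indices.

    word_boxes maps a word index to (row, first_col, last_col), i.e. the
    inclusive span of patches the rendered word covers (hypothetical format).
    """
    hit = set()
    for idx, (row, c0, c1) in word_boxes.items():
        if any(r == row and c0 <= c <= c1 for r, c in patches):
            hit.add(idx)
    return sorted(hit)


def font_scales(words: List[str], hit: List[int], scale: float = 2.0) -> List[Tuple[str, float]]:
    """Assign an enlarged font scale to the selected words; others stay at 1.0."""
    selected = set(hit)
    return [(w, scale if i in selected else 1.0) for i, w in enumerate(words)]
```

A renderer would then redraw the page using the per-word scales and the model would answer again from the new image; both of those steps are outside this sketch.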

2606.12897 2026-06-12 cs.CL New submission

SafeLLM: Extraction as a Hallucination-Resistant Alternative to Rewriting in Safety-Critical Settings

Julia Ive, Felix Jozsa, Evridiki Georgaki, Nabeel Sheikh, Emma Cattell, Nick Jackson, Paulina Bondaronek, Ciaran Scott Hill, Richard Dobson

Affiliations: Institute of Health Informatics, University College London; National Hospital for Neurology and Neurosurgery; Somerset NHS Foundation Trust; King's College Hospital; King's College London

AI summary: Proposes extraction as a hallucination-resistant alternative to rewriting-based RAG; a line-number selection strategy achieves high term recall (up to 95%) and low hallucination on safety-critical documents, outperforming direct copying and safety-oriented approaches.

Abstract

Large language models (LLMs) are increasingly used to access organisational documentation, including standard operating procedures (SOPs), HR policies and institutional guidelines. However, retrieval-augmented generation (RAG) systems that rely on free-form rewriting can introduce hallucinations and unstable trade-offs between completeness and conciseness, particularly in safety- and compliance-critical settings. Objectives: To evaluate extraction as a hallucination-resistant alternative to rewriting-based RAG and compare strategies that balance precision, recall and safety across document types and model scales. Methods: We compare multiple prompting strategies, including line-number-based source selection, extraction of relevant guideline sentences with explicit safety annotations, and a multi-stage pipeline that refines draft answers using supporting evidence from source guidelines. Experiments are conducted on documents of varying length and structure, including local NHS acute care and oncology guidelines and UK-wide NICE guidelines, using both frontier-scale and locally deployable models. Performance is assessed using automatic metrics and human expert evaluation of relevance and completeness. Results: Line-number selection achieves the strongest results, outperforming direct copying and safety-focused strategies across both large and small models while maintaining high term recall (up to 95%) and close alignment with source text. Safety-oriented approaches improve precision but introduce systematic omissions, while multi-stage filtering further amplifies this trade-off. Performance varies with document structure: line-based extraction excels in protocol-like content, whereas alternative strategies perform better on more verbose documents (up to 97% term recall).
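
Line-number-based source selection can be sketched as follows: the source is shown to the model with numbered lines, the model replies with line numbers only, and the answer is assembled by copying those lines verbatim, so the model never rewrites the guideline text. This is a sketch under assumptions, not the paper's pipeline: the reply format (comma-separated numbers and ranges) and all function names are hypothetical, and the model call itself is omitted.

```python
import re
from typing import List


def number_lines(text: str) -> str:
    """Prefix each source line with its 1-based number for inclusion in a prompt."""
    return "\n".join(f"{i}: {line}" for i, line in enumerate(text.splitlines(), start=1))


def parse_selection(reply: str, n_lines: int) -> List[int]:
    """Parse a reply such as '2, 4-6' into sorted, unique, in-range line numbers."""
    picked = set()
    for start, end in re.findall(r"(\d+)(?:\s*-\s*(\d+))?", reply):
        lo = max(int(start), 1)
        hi = min(int(end) if end else int(start), n_lines)  # clamp to the document
        picked.update(range(lo, hi + 1))
    return sorted(picked)


def extract(text: str, reply: str) -> List[str]:
    """Return the selected source lines verbatim; out-of-range numbers are dropped."""
    lines = text.splitlines()
    return [lines[n - 1] for n in parse_selection(reply, len(lines))]
```

Because the output is copied from the source, any hallucinated line number either selects real text or is discarded, which is the property that makes extraction attractive in safety-critical settings.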

2606.12890 2026-06-12 cs.RO New submission

Learning to Adapt: Representation-Based Reinforcement Learning for Multi-Task Skill Transfer

Aryan Naveen, Haitong Ma, Haldun Balim, Na Li

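Affiliations: Massachusetts Institute of Technology; Harvard School of Engineering and Applied Sciences

AI summary: Proposes RepMT-SAC, a framework that uses spectral MDP decomposition to capture transferable dynamics and structures the value function into a task-agnostic core with a minimal task-specific adjustment, outperforming baselines by up to 30% on quadcopter trajectory-following tasks.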

Comments
8 pages, 4 figures, 1 table

Abstract

Reinforcement learning has achieved remarkable success in learning complex control policies, yet its applicability remains limited due to sample inefficiency and poor generalization across tasks. In this work, we propose RepMT-SAC, a framework for multi-task RL that enables efficient knowledge sharing and robust transfer to new tasks. RepMT-SAC uses spectral MDP decomposition to capture transferable dynamics, structuring the value function into a task-agnostic core with a minimal task-specific adjustment. This design allows for strong zero-shot performance on in-distribution tasks and rapid few-shot adaptation to out-of-distribution tasks. We evaluate RepMT-SAC on quadcopter trajectory-following tasks across in-distribution and out-of-distribution contexts, demonstrating that it outperforms baselines by up to 30%.

2606.12886 2026-06-12 cs.CV cs.AI New submission

Bridging Modal Isolation in Interleaved Thinking: Supervising Modality Transitions via Stepwise Reinforcement

Tingyu Li, Le Zhou, Siyuan Li, Yujun Wu, Xinglong Xu, Jingxuan Wei, Conghui He, Cheng Tan

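Affiliations: Shanghai Artificial Intelligence Laboratory; Shanghai Jiaotong University; Zhejiang University; University of Chinese Academy of Sciences

AI summary: Proposes MoTiF, a framework that optimizes modality transition fidelity through Reflective SFT and Flow-GRPO, addressing Modal Isolation (the disconnect between images and text in interleaved thinking) and improving cross-modal coherence and task accuracy.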

Comments
22 pages, 5 figures, 6 tables

Abstract

Interleaved thinking, where a unified multimodal model alternates between textual reasoning and visual generation, has shown promise on spatial and physical tasks. However, in complex long-chain scenarios, we identify a fundamental failure mode: generated images diverge from the textual context while subsequent text ignores the visual evidence, causing the two modalities to alternate without genuinely informing each other. We term this Modal Isolation and attribute it to compounding information loss at modality boundaries. We decompose each reasoning cycle into atomic operations and define modality transition loss, quantifying cross-modal hallucination (text-to-image) and visual utilization deficit (image-to-text) at each boundary. We propose MoTiF (Modality Transition Fidelity), a two-stage training framework that directly optimizes these transitions: Reflective SFT trains the model to detect and recover from erroneous visual outputs; Flow-GRPO improves image generation fidelity via reinforcement learning. All training signals in MoTiF derive from transition-level fidelity rather than end-task accuracy. Across four visual puzzle benchmarks, this transition-level supervision substantially improves both cross-modal coherence and final task accuracy. The results demonstrate that effective interleaved reasoning requires explicit structural supervision at modality boundaries, not merely scaling or end-task optimization.

2606.12883 2026-06-12 cs.AI New submission

The Hidden Power of Scaling Factor in LoRA Optimization

Zicheng Zhang, Haoran Li, Jiaxing Wang, Guoqiang Gong, Anqi Li, Yudong Hu, Ting Xiong, Yurong Gao, Junxing Hu, Zhida Jiang, Yifeng Zhang, Pengzhang Liu, Qixia Jiang

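Affiliations: School of Mathematical Sciences, University of Chinese Academy of Sciences (UCAS); School of Mathematical Sciences, Nankai University (NKU); School of Advanced Interdisciplinary Sciences, UCAS

AI summary: Shows that in LoRA the scaling factor α and the learning rate function differently and that α is the dominant driver of effective optimization; using a Signal-Drift framework, finds that α amplifies the task signal without increasing the drift ratio, and proposes LoRA-α to streamline hyperparameter search and improve performance.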

Abstract

In Low-Rank Adaptation (LoRA), the scaling factor $\alpha$ is often treated as a mere complement to the learning rate, yet its role in optimization remains poorly understood. In this paper, we reveal that the scaling factor $\alpha$ and the learning rate function differently, with $\alpha$ emerging as the dominant driver of effective optimization, delivering gains that cannot be replicated by learning rate scaling alone. Through the synergy of extensive empirical analysis and a theoretical Signal-Drift framework, we uncover three findings about LoRA's scaling mechanism: First, LoRA's spectral suppression smooths the optimization landscape, rendering standard hyperparameters overly conservative and creating an optimization gap. Second, when leveraging this smoothness to accelerate convergence, $\alpha$ outperforms the learning rate by amplifying the task signal without increasing the drift ratio. Third, the optimal scaling factor follows a sublinear relationship with the rank, well characterized by a square-root law with an unexpectedly large coefficient, revealing the insufficient scaling of existing rank-tied heuristics. Based on these insights, we propose LoRA-$\alpha$, a minimalist framework that restores $\alpha$ to its principled regime, making LoRA compatible with standard small learning rates. Extensive evaluations across diverse tasks demonstrate that LoRA-$\alpha$ consistently improves performance while streamlining hyperparameter search, unleashing the learning potential of LoRA.
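
For readers unfamiliar with where $\alpha$ enters, the sketch below shows the conventional LoRA forward pass, in which the low-rank update $BA$ is multiplied by $\alpha / r$, together with a helper for a square-root relationship between $\alpha$ and the rank. The abstract does not state the fitted coefficient or the exact parameterization, so `coeff` is a placeholder and reading the law as $\alpha = c\sqrt{r}$ is an assumption; pure-Python lists stand in for tensors.

```python
import math
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in m]


def lora_forward(w: Matrix, a: Matrix, b: Matrix, x: Vector, alpha: float) -> Vector:
    """Compute W x + (alpha / r) * B (A x), where r is the number of rows of A."""
    rank = len(a)
    base = matvec(w, x)
    delta = matvec(b, matvec(a, x))
    scale = alpha / rank  # conventional LoRA scaling
    return [y + scale * d for y, d in zip(base, delta)]


def sqrt_law_alpha(rank: int, coeff: float) -> float:
    """alpha = coeff * sqrt(rank); coeff is a placeholder, not a value from the paper."""
    return coeff * math.sqrt(rank)
```

In the forward pass $\alpha$ simply rescales the update; the abstract's claim is about training dynamics, where changing $\alpha$ is reported to behave differently from changing the learning rate.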

2606.12871 2026-06-12 cs.AI New submission

DailyReport: An Open-ended Benchmark for Evaluating Search Agents on Daily Search Tasks

Jingxuan Han, Wei Liu, Mingyang Zhu, Youpeng Wang, Ziwen Wang, Lin Qiu, Xuezhi Cao, Xunliang Cai, Zheren Fu, Licheng Zhang, Zhendong Mao

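Affiliations: University of Science and Technology of China; Meituan

AI summary: Introduces DailyReport, a benchmark of 150 open-ended daily search tasks with 3,546 cascade rubrics; evaluation over decomposed subtasks and disentangled dimensions shows that current search agent systems still fall short of user expectations.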

Abstract

Search Agents (SAs) typically leverage large language models (LLMs) to support complex information-seeking tasks by autonomously exploring web sources and synthesizing information into comprehensive responses. For SA evaluation, prior benchmarks mainly focus on specialized tasks that are unlikely to arise in real-world user scenarios. Moreover, their reliance on coarse task-level rubrics often limits evaluation interpretability. To bridge this gap, we introduce DailyReport, an open-ended benchmark to evaluate SA capabilities on daily search tasks. It contains 150 open-ended tasks with 3,546 associated rubrics, capturing widely discussed and timely information demands of real-world users. Each task is decomposed into subtasks and evaluated with cascade rubrics across disentangled dimensions. Through cascade performance attribution and user-centric aggregation, we derive highly interpretable scores for each dimension, along with a user preference score. Our results on 17 agentic systems show that current systems still fall short of users' expectations. To facilitate future research, our dataset and code are made publicly available at this https URL.

2606.12869 2026-06-12 cs.CV New submission

Learning Task-Aware Sampling with Shared Saliency through Density-Equalizing Mappings

Tsz Lok Ip, Han Zhang, Lok Ming Lui

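Affiliations: Department of Mathematics, The Chinese University of Hong Kong; Department of Mathematics, City University of Hong Kong

AI summary: Proposes DECNN, a framework that uses density-equalizing mappings to dynamically redistribute convolutional computation according to the spatial importance of the data, enabling task-adaptive sampling and improving model efficiency and interpretability.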

Comments
16 pages, 10 figures

Abstract

In image and surface-based learning tasks, convolutional features are typically extracted using receptive fields that are sampled uniformly across the entire domain. However, informative structures are rarely distributed uniformly in practice and are often concentrated in localized regions. Such phenomena are particularly common in medical imaging, where pathological changes are spatially confined. Consequently, uniform convolution allocates equal computational effort to both informative and uninformative regions, resulting in inefficient feature extraction and suboptimal utilization of model capacity. To address this issue, we propose a framework for task-adaptive sampling that dynamically redistributes computational attention according to the spatial importance of the data. Specifically, we introduce the Density-Equalizing Convolutional Neural Network (DECNN), which employs density-equalizing mappings to guide convolution through a learned density function. The density function encodes the relative importance of different regions and induces a transformation that enlarges informative areas while compressing less relevant ones. As a result, convolutional receptive fields are redistributed non-uniformly over the domain, enabling denser sampling in task-relevant regions. By coupling this importance-driven transformation with convolution, DECNN performs adaptive feature extraction that focuses computational resources on informative structures. This leads to more efficient use of model capacity, yielding a lightweight yet expressive architecture while simultaneously producing an interpretable saliency map. Experiments on image classification and craniofacial surface analysis demonstrate that DECNN achieves competitive or superior performance with fewer parameters, accurately identifies task-relevant regions, and remains robust under complex geometric variations.
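
The idea that a density function can enlarge important regions so that uniform sampling becomes denser there has a simple one-dimensional analogue, sketched below. In 1D the density-equalizing map is the normalized cumulative integral of the density, and pulling uniformly spaced points back through it concentrates samples where the density is high. This is only an illustration: the paper works on images and surfaces with a learned density, and the piecewise-constant density and the function name here are assumptions.

```python
from typing import List


def sample_positions(density: List[float], n_samples: int) -> List[float]:
    """Place n_samples points on [0, 1] so that each carries equal density mass.

    density holds strictly positive weights on equal-width bins covering [0, 1].
    """
    total = sum(density)
    width = 1.0 / len(density)

    # Cumulative mass at the bin edges: the 1D density-equalizing map.
    cdf = [0.0]
    for d in density:
        cdf.append(cdf[-1] + d / total)

    positions = []
    for i in range(n_samples):
        u = (i + 0.5) / n_samples  # uniformly spaced targets in the mapped domain
        j = 0
        while j < len(density) - 1 and cdf[j + 1] <= u:
            j += 1  # find the bin whose mass interval contains u
        frac = (u - cdf[j]) / (cdf[j + 1] - cdf[j])
        positions.append((j + frac) * width)  # pull back to the original domain
    return positions
```

With a uniform density the samples stay evenly spaced; with a density three times higher on the right half, three of every four samples land there.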

2606.12859 2026-06-12 cs.RO New submission

AIR-VLA+: Decoupling Movement and Manipulation via Cascaded Dual-Action Decoders with Asymmetric MoE for Aerial Robots

Jianli Sun, Bin Tian, Qiyao Zhang, Zijian Liu, Yutong Wang, Zhiyong Cui, Bai Li, Yisheng Lv, Yonglin Tian

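Affiliations: The Institute of Automation, Chinese Academy of Sciences; School of Automation, Beijing Institute of Technology; College of Automotive and Energy Engineering, Tongji University; School of Transportation Science and Engineering, Beihang University; Information Science, East China Normal University

AI summary: To handle the substantial differences in action scale, dynamics, and control objectives between movement and manipulation in aerial robots, proposes cascaded dual-action decoders with an asymmetric MoE architecture for decoupled yet coordinated control, reaching an average score of 48.0 on the AIR-VLA benchmark and improving the task completion score by 80.2% over the single-head $\pi_{0.5}$ policy.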

Abstract

Aerial manipulation systems have long suffered from representation coupling in end-to-end control, as platform-level Unmanned Aerial Vehicle (UAV) movement and end-effector-level arm manipulation differ substantially in action scale, dynamics, and control objectives. In this paper, we propose AIR-VLA+, a flow matching action generation architecture specifically designed for aerial manipulation, featuring cascaded dual-action decoders and an asymmetric feature-level Mixture of Experts (MoE). We construct cascaded manipulation and movement decoders, allowing the UAV to unidirectionally observe the manipulator's intent during movement to achieve workflow coordination, while isolating the impact of UAV movement information backpropagation on arm manipulation stability. Addressing the characteristic that UAV movement is highly dependent on high-level semantics and responsible for task state transitions in aerial manipulation, we design an input feature enhancement module for the UAV movement decoder. This module introduces an implicit visual grasp projector to perceive the interaction state between the gripper and the object, and injects compressed global semantic features. Within the UAV movement decoder, we deploy an implicit MoE architecture, enabling different movement experts to spontaneously exhibit capacity inclinations for various task stages during training. Through dense soft blending computation on the feature manifold, the UAV movement is endowed with stronger task-stage adaptability. Experiments on the standardized AIR-VLA benchmark demonstrate that our method comprehensively surpasses all baselines with an overall average score of 48.0. The overall task completion score improves by 80.2% compared to the single-head $\pi_{0.5}$ policy, effectively mitigating the heterogeneous coordinated control conflicts of composite robots.
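
"Dense soft blending" in a Mixture of Experts generally means that every expert processes the input and the outputs are combined with softmax gate weights, rather than routing to a few top experts. The sketch below shows that generic computation only; it does not model the paper's asymmetric or implicit design, and the linear gate, the expert callables, and all names are assumptions.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]


def softmax(logits: Vector) -> Vector:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dense_soft_moe(
    x: Vector,
    experts: List[Callable[[Vector], Vector]],
    gate_weights: List[Vector],
) -> Tuple[Vector, Vector]:
    """Blend the outputs of all experts with softmax gate weights.

    gate_weights has one row per expert; the gate logit is the dot product
    of that row with the input feature x (a hypothetical linear gate).
    """
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    gates = softmax(logits)
    outputs = [expert(x) for expert in experts]  # every expert runs (dense)
    dim = len(outputs[0])
    blended = [sum(g * out[d] for g, out in zip(gates, outputs)) for d in range(dim)]
    return blended, gates
```

Because the gate is soft, experts can specialize gradually during training, which is one plausible reading of the stage-specific behavior the abstract describes.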

2606.12854 2026-06-12 cs.CL q-bio.QM New submission

Small LLMs for Biomedical Claim Verification: Cost-Effective Fine-Tuning, Structural Dataset Shortcuts, and Cross-Domain Generalization

Gaurav Kumar

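Affiliations: Moveworks AI; University of California San Diego

AI summary: Fine-tunes small LLMs (Phi-3-mini, Qwen2.5-3B, Mistral-7B) with QLoRA for biomedical claim verification; Mistral-7B QLoRA surpasses GPT-4o and GPT-5 (up to 12% F1 gain). Also identifies a structural artifact in the SciFact dataset and shows that training on structurally sound data enables robust cross-domain transfer.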

Comments
8 pages, 2 figures, 12 tables. To appear at BioNLP Workshop, ACL 2026

Abstract

Large Language Models such as GPT-4o and GPT-5 achieve strong zero-shot performance on biomedical claim verification, but cost and opacity limit scalable use. We fine-tune three small LLMs: Phi-3-mini (3.8B), Qwen2.5-3B, and Mistral-7B, via QLoRA on SciFact and HealthVer, providing the first study of QLoRA models against GPT-4o and fine-tuned BioLinkBERT encoders. Mistral-7B QLoRA surpasses both GPT-4o and GPT-5 (up to 12% F1 gain) at a fractional cost using just 1,008 training examples. We conduct extensive in-domain and cross-domain evaluation: models trained on SciFact tested on HealthVer and vice versa, at matched sizes to isolate dataset structure from data quantity. We identify a previously unreported structural artifact in SciFact that inflates in-domain scores, and show through bidirectional out-of-domain evaluation that training on structurally sound data enables robust cross-domain transfer. We plan to release all code and adapter checkpoints.

2606.12852 2026-06-12 cs.AI New submission

WISE: A Long-Horizon Agent in Minecraft with Why-Which Reasoning

Renmin Cheng, Changhao Chen (The Hong Kong University of Science and Technology (Guangzhou))

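Affiliations: The Hong Kong University of Science and Technology (Guangzhou)

AI summary: Proposes WISE, a framework that augments episodic memory with a Causal Event Graph linking what-where-when memory to which-why reasoning and combines it with an opportunistic task scheduler and multi-scale exploration, substantially improving success rate and efficiency on long-horizon sparse tasks.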

Abstract

Rapid advances have been made in developing general-purpose embodied agents in environments like Minecraft through the adoption of LLM-augmented hierarchical approaches. Despite their promise, low-level controllers often become performance bottlenecks due to repeated execution failures. We argue that a key limitation is not only the lack of episodic memory, but also the decoupling of what-where-when memory from which-why reasoning. To address this, we propose WISE (Which-Why Informed Semantic Explorer), a long-horizon agent framework with an enhanced low-level controller equipped with a Causal Event Graph that augments episodic memory with explicit causal structure linking observations to task relevance. Unlike prior work such as MrSteve, which relies on feature similarity for retrieval, WISE enables robust recall under viewpoint changes and supports opportunistic task reordering through causal reasoning. Building on this memory, we propose an Opportunistic Task Scheduler that dynamically re-prioritizes subtasks when causally relevant opportunities are detected. We further equip WISE with a multi-scale progressive exploration strategy to provide spatially comprehensive observations for downstream reasoning. Experiments show that WISE substantially improves task success and efficiency on long-horizon sparse tasks, particularly in settings requiring adaptive decision-making.

2606.12848 2026-06-12 cs.AI econ.GN New submission

(Human) Attention Is (Still) All You Need: Human oversight makes AI-assisted social science reliable

Chen Zhu, Xiaolu Wang, Weilong Zhang

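Affiliations: China Agricultural University; University of Cambridge

AI summary: Proposes HLER, a human-in-the-loop decision architecture built on pre-commitment, decision sequencing, accountability, and attention allocation, which reduces the critical failure rate of AI-assisted research from 72% to 16%.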

Abstract

Large language models (LLMs) are increasingly used for tasks once reserved for trained researchers, including hypothesis generation, specification choice, and drafting conclusions. We argue that the reliability of AI-assisted research depends not only on model capability, but also on how cognitive labour is structured between humans and machines. We study this problem through Human-in-the-Loop Economic Research (HLER), a decision architecture based on pre-commitment, decision sequencing, accountability, and attention allocation. In a pre-specified 2×4 factorial experiment with 280 complete research runs across four datasets, an unconstrained multi-agent baseline produced critical failures in 72% of runs. Using the same underlying model, the same agent decomposition, and identical prompts for the shared reasoning agents, HLER reduced the failure rate to 16% by imposing three architectural commitments: LLMs reason but do not execute data work, data and estimation are handled deterministically, and three human decision gates bind the workflow. Fisher's exact test rejects equality of failure rates at p<0.001. Reliability gains were largest on the least publicly represented dataset, a Qing-dynasty population register, consistent with a task-based production model with Fréchet-distributed output quality. An 80-run ablation suggests that deterministic computation and human gates contribute independently, with exploratory evidence of complementarity. We interpret HLER as a research harness rather than an autonomous AI scientist: it sharply reduces failures, makes residual weaknesses more visible, and prevents unreliable claims from being advanced as publication-ready outputs.

2606.12847 2026-06-12 cs.CV New submission

Language-Guided Abstraction for Visual Reasoning

Xu-Jing Ye, Yuan-Gen Wang, Ruping Wang

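Affiliations: School of Artificial Intelligence, Guangzhou University; Traditional Chinese Medicine Hospital of Zengcheng District, Guangzhou

AI summary: Proposes L-VARC, a framework that enhances visual reasoning through a language-guided Learning Using Privileged Information (LUPI) branch, with a Semantic Compression Module and a Cross-Attention Projector, outperforming existing methods on ARC with 18M parameters.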

Abstract

The Abstraction and Reasoning Corpus (ARC) is viewed as a critical avenue to Artificial General Intelligence (AGI), as it requires models to learn abstract transformation rules from few-shot examples and then generalize to new tasks. However, prevalent ARC methods are either language-only or vision-only (i.e., VARC). The former depend heavily on LLMs, consuming billions of parameters. The latter often struggle to capture high-level semantics, leading to overfitting on pixel-level patterns. To bridge this gap, we propose L-VARC, a novel framework that enhances visual reasoning via a language-guided Learning Using Privileged Information (LUPI) branch. Specifically, we design a Semantic Compression Module by feeding a unified, task-agnostic prompt into DeepSeek-V3. In this way, the raw LARC (a crowd-sourced language description dataset) can be substantially refined and structured to fit the context-length constraint of standard text encoders (e.g., CLIP). Moreover, we design a Cross-Attention Projector to align visual features with semantic embeddings, aiming to guide the training of the ARC model. Notably, the LUPI branch is used only during training and is discarded at inference, thereby yielding a lightweight model with a mere 18 million parameters. Extensive experiments demonstrate that our L-VARC effectively leverages linguistic priors to boost visual reasoning and outperforms state-of-the-art methods. Ablation studies further confirm the contribution of the two new designs to the L-VARC framework. The code is available at this https URL.

2606.12841 2026-06-12 cs.LG cs.AI New submission

TimeROME-DLM: Temporal Causal Tracing and Low-Rank Inference-Time Knowledge Editing for Masked Diffusion Language Models

Zhengtao Yao, Liuyang Song, Hongbo Zhang, Chenhao Wei, Haoyan Xu, Guang Yang, Siheng Wang

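Affiliations: Shanghai Jiao Tong University; Nanyang Technological University; National University of Singapore; University of Science and Technology of China

AI summary: Proposes TimeROME-DLM, the first training-free, gradient-free, inference-time knowledge-editing framework for masked diffusion language models; it locates key coordinates via temporal causal tracing and applies a low-rank residual edit, efficiently removing facts while preserving model performance.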

英文摘要

Masked diffusion language models (MDLMs) such as LLaDA now rival autoregressive (AR) LLMs, but every existing knowledge-editing and unlearning method (ROME, MEMIT, etc.) targets AR transformers and either makes assumptions that fail under iterative denoising, or requires gradient updates whose backward-pass activations cost tens of GB of extra VRAM and which collapse MDLMs at standard learning rates. We introduce TimeROME-DLM, the first training-free, gradient-free, inference-time knowledge-editing framework for MDLMs. It couples two components: a Temporal Indirect Effect (TIE) causal-tracing protocol that identifies, for each fact, the coordinate whose intervention most strongly drives the object prediction at later denoising steps; and a closed-form, low-rank residual edit memory that aggregates subject keys and target deltas across all forget facts and applies a single ridge-regularised update at that coordinate at every diffusion forward, with sparsification to limit utility spillover. Backbone weights stay frozen; only three hyperparameters (alpha, lambda, q) are tuned on a small validation split. On TOFU forget01 with TOFU-finetuned LLaDA-8B-Base, TimeROME-DLM cuts forget-set log-probability by roughly 83 nats. The same configuration transfers to LLaDA-8B-Instruct, Dream-7B, MMaDA-8B, DiffuLLaMA-7B, and LLaDA-MoE-1.4B. It keeps retain-set log-probability nearly flat (within ~1 nat at the utility-safe operating point) across 50 sequentially inserted facts, delivers a four- to fourteen-fold wall-clock speedup with zero additional VRAM over the strongest converged training-time baseline, and scales sub-linearly to 400 facts. TimeROME-DLM closes the locate-then-edit gap between AR LLMs and MDLMs at a fraction of the computational cost.
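
The "single ridge-regularised update" mentioned above can be illustrated in the simplest one-fact case. The sketch below is a toy under stated assumptions, not the paper's implementation: it takes one hypothetical subject key `key` and one hypothetical target delta `delta`, and minimises ||W k - d||^2 + lam * ||W||_F^2, whose closed form is the rank-one matrix W = d k^T / (k.k + lam). The function names and the example numbers are illustrative; how the paper aggregates many facts and sparsifies the result is not shown.

```python
def ridge_rank_one_edit(key, delta, lam):
    """Closed-form minimiser of ||W k - d||^2 + lam * ||W||_F^2 for one (k, d) pair.

    Returns W as a list of rows. W is rank one: d k^T / (k.k + lam).
    """
    scale = 1.0 / (sum(x * x for x in key) + lam)
    return [[d_i * k_j * scale for k_j in key] for d_i in delta]


def apply_edit(weight, key):
    """Residual contribution W k that would be added at the edited coordinate."""
    return [sum(w * x for w, x in zip(row, key)) for row in weight]


key = [1.0, 2.0, 2.0]   # hypothetical subject key, k.k = 9
delta = [3.0, -3.0]     # hypothetical target delta
W = ridge_rank_one_edit(key, delta, lam=1.0)
residual = apply_edit(W, key)  # delta shrunk by the factor 9 / (9 + 1)
```

In this toy, a larger `lam` shrinks the residual toward zero, which is the usual role of the ridge term: it trades edit strength against the size of the change.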

2606.12837 2026-06-12 cs.CL 新提交

LoHoSearch: Benchmarking Long-Horizon Search Agents Beyond the Human Difficulty Ceiling

LoHoSearch: 超越人类难度上限的长时域搜索代理基准测试

Jiarui Zhao, Rongzhi Zhang, Lingchuan Liu, Hao Yang, Xunliang Cai, Xi Su

发表机构 * Meituan(美团)

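AI summary: Proposes the LoHoSearch benchmark, which automatically constructs 544 complex questions from a knowledge graph of over 7 million Wikipedia entities; in evaluation the strongest model reaches only 34.74% accuracy, indicating difficulty beyond the ceiling of human-authored benchmarks.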

详情
AI中文摘要

以BrowseComp为代表的搜索代理基准在过去一年中迅速饱和,最强模型已超过90%准确率。由于这些基准主要由人类编写,标注者缺乏对实体统计的全局视角,无法系统性地最大化搜索空间大小和结构复杂性,这造成了难以突破的难度上限。为解决这一问题,我们引入了LoHoSearch(长时域搜索代理),一个包含544个人工验证问题、覆盖11个领域的挑战性基准。LoHoSearch通过基于覆盖超过700万维基百科实体的知识图谱的自动化流水线构建,该流水线选择具有大搜索空间的关系,并将其组装成结构复杂且具有知识图谱验证的唯一答案的问题。我们的评估表明,即使是最强模型也仅达到34.74%的准确率,且现有的上下文管理策略(最佳提升+6.8%)带来的增益远小于先前基准。LoHoSearch为评估搜索代理中的长时域推理和上下文管理提供了更高要求的标准。

英文摘要

Search agent benchmarks exemplified by BrowseComp have rapidly saturated over the past year, with the strongest models surpassing 90% accuracy. Since these benchmarks are predominantly human-authored, annotators lack a global perspective on entity statistics and cannot systematically maximize search space size and structural complexity. This creates a difficulty ceiling that is hard to break. To address this, we introduce LoHoSearch (Long-Horizon Search Agents), a challenging benchmark comprising 544 human-verified questions across 11 domains. LoHoSearch is constructed via an automated pipeline built upon a knowledge graph covering over 7 million Wikipedia entities, which selects relations with large search spaces and assembles them into structurally complex questions with KG-verified unique answers. Our evaluation demonstrates that even the strongest model achieves only 34.74% accuracy, and existing context management strategies (best +6.8%) yield far smaller gains than on prior benchmarks. LoHoSearch provides a more demanding standard for evaluating long-horizon reasoning and context management in search agents.

2606.12834 2026-06-12 cs.AI 新提交

Fantastic Scientific Agents and How to Build Them: AgentBuild for Rietveld Refinement

神奇的科学智能体及其构建方法:用于Rietveld精修的AgentBuild

Woong Shin, Craig A. Bridges, Marshall T. McDonnell, Rafael Ferreira da Silva

发表机构 * UT-Battelle, LLC(UT-Battelle有限责任公司) US Department of Energy (DOE)(美国能源部)

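AI summary: Proposes the AgentBuild framework, which automatically builds scientific agents from a scientist-authored contract (a rubric, a curriculum, and a knowledge base) for Rietveld refinement of X-ray diffraction data, enabling reusable agent compilation rather than manual tuning.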

详情
AI中文摘要

随着科学工作流从确定性可执行文件转向基于LLM的智能体,现有的开发实践(如微调、强化学习和即时运行)掩盖了科学家的判断。我们建议将智能体构建视为一个工作流阶段,并引入AgentBuild,它根据科学家编写的合同构建科学智能体。该合同是一个版本控制的评分标准、一个难度分级的课程和一个精心策划的外部知识库。基于评分标准的裁判门控一个元优化编码智能体,该智能体在声明的边界内编辑智能体,因此构建编译的是智能体,而不是科学家的判断。我们通过MCP和A2A背后的GSAS-II将其实例化用于X射线衍射数据的Rietveld精修,其中空白框架构建运行通过锂镧锆氧(LLZO)信噪比阶梯,达到4小时扫描作为前沿案例,并暴露了工作流范围限制。相同的评分标准既奖励可信的拟合,也评分轨迹范围,使前沿成为合同失败而非模式拟合失败。随着基础模型的发展,重新运行AgentBuild是重新调整,而不是重建,科学家编写的合同仍然是持久的资产。

英文摘要

As scientific workflows shift from deterministic executables to LLM-based agents, the development practices on offer, such as fine-tuning, reinforcement learning, and prompt-and-go, bury the scientist's judgment. We propose treating agent construction as a workflow stage and introduce AgentBuild, which builds a scientific agent from a contract the scientist authors. The contract is a version-controlled rubric, a difficulty-graded curriculum, and a curated external knowledge base. A rubric-driven judge gates a meta-optimizer coding agent that edits the agent within a declared boundary, so the build compiles the agent, not the scientist's judgment. We instantiate this for Rietveld refinement of X-ray diffraction data through GSAS-II behind MCP and A2A, where a blank-harness construction run progresses through a lithium lanthanum zirconium oxide (LLZO) signal-to-noise ladder, reaches the 4 hour scan as a frontier case, and exposes the workflow-scope limits that remain. The same rubric that rewards credible fits also scores trajectory scope, making the frontier a contract failure rather than a pattern-fitting failure. As base models evolve, re-running AgentBuild is a re-tune, not a rebuild, and the scientist's authored contract remains the durable asset.

2606.12830 2026-06-12 cs.CV cs.AI 新提交

Perceive, Interact, Reason: Building Tool-Augmented Visual Agents for Spatial Reasoning

感知、交互、推理:构建工具增强的视觉智能体用于空间推理

Changye Li, Meng Lu, Yi Wu, Ligeng Zhu

发表机构 * Tsinghua University(清华大学) Virginia Tech(弗吉尼亚理工大学) NVIDIA(英伟达)

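AI summary: Proposes the PERIA agent, which strengthens VLM spatial reasoning with vision perception and vision interaction tools and outperforms similarly sized baselines by 7.0%-14.8% across 13 benchmarks.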

详情
AI中文摘要

尽管最近的视觉语言模型(VLM)展示了强大的多模态理解能力,但在需要主动证据获取和多步视觉交互的空间推理任务中仍存在局限。这种局限性表明,仅依赖视觉编码器的隐式视觉表示不足以恢复细粒度的空间证据。我们引入了PERception-Interaction-reason Agent(PERIA),一种用于地图推理、视觉探测和视觉重建等空间推理任务的工具增强视觉智能体。PERIA使用两类轻量工具:视觉感知工具用于暴露文本、符号和空间证据,以及视觉交互工具用于操作视觉上下文、追踪路径和验证空间关系。为了训练PERIA,我们开发了一种统一方案,结合了监督式工具使用轨迹合成、复合奖励和观察松弛的组内组策略优化(OR-GIGPO),以实现有效的多工具行为。在来自8个数据集的13个基准上的实验表明,PERIA-8B在分布内基准上比Qwen3-8B骨干网络提高了10.0%,在分布外基准上提高了4.4%,同时比之前类似规模的先进基线高出7.0%-14.8%。它还实现了与更大模型(如Qwen3-VL-235B-A22B-Thinking和GPT-5)相当的性能,证明了PERIA在增强空间推理能力方面的有效性。

英文摘要

While recent vision-language models (VLMs) demonstrate strong multimodal understanding, they remain limited in spatial reasoning tasks that require active evidence acquisition and multi-step visual interaction. This limitation suggests that relying solely on implicit visual representations from vision encoders is insufficient for recovering fine-grained spatial evidence. We introduce PERception-Interaction-reason Agent (PERIA), a tool-augmented visual agent for spatial reasoning tasks across map reasoning, visual probing, and vision reconstruction. PERIA uses two lightweight tool families: vision perception tools for exposing textual, symbolic, and spatial evidence, and vision interaction tools for manipulating visual context, tracing paths, and verifying spatial relations. To train PERIA, we develop a unified recipe that combines supervised tool-use trajectory synthesis, composite rewards, and Observation-Relaxed Group-in-Group Policy Optimization (OR-GIGPO) for effective multi-tool behavior. Experiments on 13 benchmarks from 8 datasets show that PERIA-8B improves over the Qwen3-8B backbone by 10.0% on in-distribution benchmarks and 4.4% on out-of-distribution benchmarks, while outperforming previous state-of-the-art baselines of similar size by 7.0%-14.8%. It also achieves performance comparable to much larger models such as Qwen3-VL-235B-A22B-Thinking and GPT-5, demonstrating the effectiveness of PERIA in enhancing spatial reasoning capabilities.

2606.12826 2026-06-12 cs.CV cs.AI 新提交

DIMOS: Disentangling Instance-level Moving Object Segmentation

DIMOS: 解耦实例级运动目标分割

Hongxiang Huang, Hongwei Ren, Xiaopeng Lin, Yulong Huang, Zeke Xie, Bojun Cheng

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

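AI summary: Proposes a dual-disentangling feature extraction framework that separates appearance and motion information in the image and event modalities and fuses them through multi-granularity cross-modal alignment, achieving state-of-the-art moving instance segmentation, especially for small objects under fast motion and low light.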

详情
AI中文摘要

运动实例分割(MIS)因其在交通监控、自动驾驶和动物追踪等领域的广泛应用而日益受到关注。事件相机记录异步亮度变化,提供高时间分辨率和动态范围,使其对运动信息高度敏感。通过融合事件和图像特征,事件中的运动线索可以补充图像中的空间细节,从而提升MIS的性能。然而,当前的多模态MIS方法仍然难以分割小的运动实例,因为事件相机在有限分辨率下往往产生稀疏特征。此外,事件特征将外观属性与运动线索纠缠在一起,进一步限制了有效的跨模态融合。为解决这些挑战,我们首先提出一个双解耦特征提取框架,在图像和事件模态内分离并提取外观和运动信息,从而改善特征密度。随后,引入多粒度跨模态对齐,以对齐跨模态分布和语义一致的特征,实现具有丰富空间和时间细节的更有效融合。实验结果表明,我们的方法在多模态MIS中达到了最先进的性能,特别是在快速运动和低光等挑战性条件下的小实例分割方面。

英文摘要

Moving instance segmentation (MIS) has attracted increasing attention due to its broad applications in traffic surveillance, autonomous driving, and animal tracking. Event cameras record asynchronous brightness changes, providing high temporal resolution and dynamic range, which makes them highly sensitive to motion information. By fusing event and image features, motion cues from events can complement spatial details from images, enhancing the performance of MIS. However, current multimodal MIS methods still struggle to segment small moving instances, as event cameras often yield sparse features under limited resolution. Moreover, event features entangle appearance attributes with motion cues, which further restricts effective cross-modal fusion. To address these challenges, we first propose a dual-disentangling feature extraction framework that separates and extracts appearance and motion information within both image and event modalities, thereby improving feature density. Subsequently, a multi-granularity cross-modal alignment is introduced to align distributionally and semantically consistent features across modalities, enabling more effective fusion with rich spatial and temporal details. Experimental results demonstrate that our method achieves state-of-the-art performance in multimodal MIS, especially for small instances under challenging conditions such as fast motion and low-light settings.

2606.12821 2026-06-12 cs.AI cs.ET 新提交

GeoNatureAgent Benchmark: Benchmarking LLM Agents for Environmental Geospatial Analysis Across Frontier and Open-Weight Foundation Models

GeoNatureAgent Benchmark:面向前沿与开源基础模型的环境地理空间分析LLM智能体基准测试

Gabriel Diaz-Ireland, Diego Prieto-Herráez, Mario García Peces, Javier Velázquez, Devika Jain

发表机构 * Universidad Católica de Ávila (UCAV)(阿维拉天主教大学) Johns Hopkins University(约翰霍普金斯大学) Independent Researcher(独立研究者) Center for Geographic Analysis, Harvard University(哈佛大学地理分析中心)

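AI summary: Proposes the first benchmark that evaluates environmental analysis agents through structured tool calls to a real API, comprising 93 tasks; Claude Sonnet 4 leads, open-weight models are more cost-effective, and comparison tasks remain universally unsolved.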

详情
Comments
Preprint. 10 pages, 8 figures. Submitted to ACM SIGSPATIAL 2026
AI中文摘要

环境科学家在数据整理而非分析上花费了不成比例的精力,而自动化地理空间工作流的AI智能体仍未得到验证:没有基准通过结构化工具调用评估智能体对真实API的操作。我们引入了GeoNatureAgent Benchmark,这是首个通过结构化工具调用生产级地理空间API进行环境分析智能体的基准。它包含18个类别的93个任务,涵盖市政分析、多轮对话、空间推理、跨指标综合、错误处理与恢复、排序、比较、多语言理解、栖息地分析和任务拒绝。任务通过一个开放、可自托管的API进行评估,该API通过16个工具提供西班牙和葡萄牙的三个环境指标。我们评估了七个LLM(Claude Sonnet 4、DeepSeek V3.2、GLM-5、Gemini 2.5 Pro、Qwen3-235B、GPT-OSS-120B、Llama 4 Scout),在三个温度1.0的随机种子下,报告能力与每案例成本作为正交轴。我们发现:(1)Claude Sonnet 4以60.8%±0.8%领先,其次是DeepSeek V3.2的56.3%±3.1%,其他模型均未超过51%;(2)成本-准确率帕累托前沿主要由开源模型占据,DeepSeek V3.2以11倍低的成本(每案例0.011美元)提供Claude 93%的能力;(3)比较任务普遍未解决(接近值比较上为0%),暴露了系统性的推理限制;(4)针对真实API的结构化工具调用比通用GIS基准更具区分度,准确率低25-35个百分点。我们进一步展示了可扩展性,将葡萄牙的BigEarthNet V2土地覆盖与西班牙的CO2和侵蚀指标集成。该基准、工具集和可自托管API均已公开。

英文摘要

Environmental scientists spend disproportionate effort on data wrangling rather than analysis, and AI agents that automate geospatial workflows remain unvalidated: no benchmark evaluates agents operating through structured tool calling against real APIs. We introduce the GeoNatureAgent Benchmark, the first benchmark for environmental analysis agents that operate via structured tool calls to a production-style geospatial API. It comprises 93 tasks across 18 categories, covering municipality analysis, multi-turn conversation, spatial reasoning, cross-indicator synthesis, error handling and recovery, ranking, comparison, multilingual understanding, habitat analysis, and task rejection. Tasks are evaluated against an open, self-hostable API serving three environmental indicators across Spain and Portugal via sixteen tools. We evaluate seven LLMs (Claude Sonnet 4, DeepSeek V3.2, GLM-5, Gemini 2.5 Pro, Qwen3-235B, GPT-OSS-120B, Llama 4 Scout) under three temperature-1.0 seeds, reporting capability and per-case cost as orthogonal axes. We find: (1) Claude Sonnet 4 leads at 60.8% +/- 0.8%, followed by DeepSeek V3.2 at 56.3% +/- 3.1%, with no other model above 51%; (2) the cost-accuracy Pareto frontier is occupied mostly by open-weight models, with DeepSeek V3.2 offering 93% of Claude's capability at 11x lower cost ($0.011/case); (3) comparison tasks remain universally unsolved (0% on close-value comparisons), exposing systematic reasoning limits; and (4) structured tool calling against a real API is more discriminative than general-purpose GIS benchmarks, with accuracies 25-35 points lower. We further show extensibility by integrating BigEarthNet V2 land cover for Portugal alongside Spanish CO2 and erosion indicators. The benchmark, harness, and self-hostable API are publicly available.
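
The cost-capability claim in finding (2) can be checked with a few lines of arithmetic. The sketch below uses only figures quoted in the abstract. The implied Claude Sonnet 4 per-case cost is derived here on the assumption that "11x lower cost" means a plain ratio of per-case costs; it is not a number the abstract reports.

```python
claude_acc = 60.8       # percent, from the abstract
deepseek_acc = 56.3     # percent, from the abstract
deepseek_cost = 0.011   # USD per case, from the abstract
cost_ratio = 11         # "11x lower cost", from the abstract

# Share of Claude's accuracy that DeepSeek V3.2 retains.
relative_capability = deepseek_acc / claude_acc

# Derived under the ratio assumption above, not reported by the authors.
implied_claude_cost = deepseek_cost * cost_ratio
```

The ratio comes out at about 0.926, which matches the "93% of Claude's capability" figure after rounding.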

2606.12818 2026-06-12 cs.CL cs.AI 新提交

Localizing Anchoring Pathways in Language Models

定位语言模型中的锚定路径

Hillary N. Owusu, Sarah Wiegreffe, Naomi H. Feldman

发表机构 * University of Maryland, College Park(马里兰大学帕克分校)

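AI summary: Studies the anchoring effect, in which irrelevant numbers in a prompt sway a language model's numerical reasoning; using a logit-difference metric and circuit attribution, it finds that edge-level methods outperform node-level methods and characterizes how anchoring pathways are shared and how they transfer.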

详情
AI中文摘要

提示中的无关数字可以改变语言模型的判断,在数值推理中产生锚定效应。我们使用共享答案选项的受控多项选择设置,研究这种锚定敏感信号在语言模型内部的携带位置。我们定义了一个logit差值度量,比较正确答案选项与对应锚点的答案选项,并验证其追踪行为锚定。通过对7B-8B Qwen和Llama基础及指令微调模型进行基于归因的电路定位,我们发现边级方法比节点级方法更忠实地恢复该信号。低锚和高锚电路在模型内部强迁移,表明跨锚定方向存在共享路径结构。然而,基础模型和指令微调变体之间的稀疏迁移可靠性较低,表明后训练改变了哪些路径最重要。总体而言,我们的结果为锚定相关决策信号如何在语言模型内部携带提供了机制性解释。

英文摘要

Irrelevant numbers in a prompt can shift language model judgments, producing anchoring effects in numerical reasoning. We study where this anchor-sensitive signal is carried inside language models using a controlled multiple-choice setup with shared answer options. We define a logit-difference metric comparing the correct answer option with the answer option corresponding to the anchor, and validate that it tracks behavioral anchoring. Using attribution-based circuit localization on 7B--8B Qwen and Llama base and instruction-tuned models, we find that edge-level methods recover this signal more faithfully than node-level methods. Low- and high-anchor circuits transfer strongly within a model, suggesting shared pathway structure across anchor direction. However, sparse transfer across base and instruction-tuned variants is less reliable, indicating that post-training changes which pathways matter most. Overall, our results provide a mechanistic account of how anchoring-related decision signals are carried inside language models.
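
A logit-difference metric of the kind described above can be sketched in a few lines. This is a minimal illustration, assuming the model's logits over the shared answer options are already available as a dictionary; the option labels, the logit values, and the function name are hypothetical, and the sign convention (correct minus anchor) is an assumption.

```python
def logit_difference(option_logits, correct_option, anchor_option):
    """Logit of the correct option minus logit of the anchor-matching option.

    Positive values favour the correct answer; negative values favour the anchor.
    """
    return option_logits[correct_option] - option_logits[anchor_option]


# Hypothetical logits over four shared answer options.
logits = {"A": 2.0, "B": 3.5, "C": 0.5, "D": -1.0}
ld = logit_difference(logits, correct_option="A", anchor_option="B")
```

Here `ld` is negative, meaning this toy model prefers the option that matches the anchor over the correct one.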

2606.12817 2026-06-12 cs.AI 新提交

Teach-and-Repeat: Accurately Extracting Operational Knowledge from Mobile Screen Demonstrations to Empower GUI Agents

Teach-and-Repeat: 从移动屏幕演示中准确提取操作知识以赋能GUI智能体

Yudong Zhang (1), Lei Hu (1), Daoyang Liu (2), Jiawei Liu (1), Yangfan Luo (1), Xingyu Liu (1), Zuojian Wang (1), Zhilin Gao (1) ((1) Honor Device Co., Ltd, (2) The Chinese University of Hong Kong, Hong Kong, China)

发表机构 * Honor Device Co., Ltd(荣耀终端有限公司) The Chinese University of Hong Kong(香港中文大学)

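AI summary: Proposes Teach VLM, which generates operational knowledge from keyframes extracted from demonstration videos, and builds a data flywheel to address training-data scarcity; it achieves state-of-the-art benchmark performance and raises the task success rate of downstream agents.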

详情
Comments
20 pages, 9 figures. Yudong Zhang and Lei Hu contributed equally to this work. Xingyu Liu, Zuojian Wang, and Zhilin Gao are corresponding authors
AI中文摘要

理解移动设备上的数字世界正从静态UI感知转向动态动作理解。这种能力使模型能够将视觉状态转换转化为操作知识,定义为描述动作类型、目标UI元素、文本参数和执行顺序的简短自然语言句子。然而,由于跨应用的UI设计高度多样化和异构,现有视觉语言模型(VLM)难以准确推断这些底层操作。为弥补这一差距,我们引入了Teach VLM,这是一个核心模型,旨在通过从演示视频中提取和分析与操作相关的关键帧,将移动屏幕轨迹转化为逐步操作知识。为解决对齐训练数据稀缺的问题,我们开发了一个系统性的数据飞轮以实现可扩展的数据采集。我们进一步引入了一个新颖的中文移动屏幕教学基准用于细粒度评估。基于Teach VLM,我们提出了Teach-and-Repeat范式,其中生成的操作知识作为可解释的程序化参考,指导下游基于屏幕的执行智能体。大量评估表明,Teach VLM显著优于强VLM基线,在操作语义预测中达到了最先进的性能。此外,在Android World中的实验表明,我们的范式为下游智能体带来了持续的任务成功率提升。Teach VLM和Teach-and-Repeat范式共同提供了一条从原始演示到可复用任务自动化的实用路径。

英文摘要

Understanding the digital world on mobile devices is shifting from static UI perception to dynamic action comprehension. This capability enables models to convert visual state transitions into operational knowledge, defined as short natural-language sentences that describe action types, target UI elements, textual arguments, and execution orders. However, due to the highly diverse and heterogeneous UI designs across applications, existing vision-language models (VLMs) struggle to accurately infer these underlying operations. To bridge this gap, we introduce Teach VLM, a core model designed to translate mobile screen trajectories into step-wise operational knowledge by extracting and analyzing operation-related keyframes from demonstration videos. To address the scarcity of aligned training data, we develop a systematic data flywheel for scalable data acquisition. We further introduce a novel Chinese Mobile Screen Teach Benchmark for fine-grained evaluation. Building upon Teach VLM, we propose the Teach-and-Repeat paradigm, where the generated operational knowledge serves as an interpretable procedural reference to guide downstream screen-based execution agents. Extensive evaluations demonstrate that Teach VLM significantly outperforms strong VLM baselines, achieving state-of-the-art performance in operation semantics prediction. Furthermore, experiments in Android World show that our paradigm yields consistent Task Success Rate improvements for downstream agents. Together, Teach VLM and the Teach-and-Repeat paradigm offer a practical pathway from raw demonstrations to reusable task automation.

2606.12814 2026-06-12 cs.RO cs.AI 新提交

Stubborn: A Streamlined and Unified Reinforcement Learning Framework for Robust Motion Tracking and Fall Recovery for Humanoids

Stubborn: 一种用于人形机器人鲁棒运动跟踪与摔倒恢复的流线型统一强化学习框架

Xiao Ren, Yuhui Yang, Zongbiao Weng, Zhijie Liu, He Kong

发表机构 * Southern University of Science and Technology(南方科技大学)

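AI summary: Proposes the Stubborn framework, which unifies humanoid motion tracking and fall recovery through an asymmetric Actor-Critic architecture, a yaw-aligned representation, a Bernoulli probabilistic termination mechanism, and an adaptive sampling strategy, achieving performance competitive with state-of-the-art methods along with robustness gains.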

详情
AI中文摘要

最近的强化学习方法在改善人形机器人运动跟踪性能和实现扰动下的摔倒恢复方面显示出巨大潜力。然而,现有大多数工作将运动跟踪和摔倒恢复视为不同任务,需要多阶段训练,并配备专门的恢复奖励和/或独立的恢复策略。此外,现有的基于强化学习的方法通常在严重跟踪失败后立即终止训练回合,限制了在不稳定或摔倒状态下的恢复导向探索。为了解决上述问题,我们提出了Stubborn,一个流线型统一的强化学习框架,用于实现鲁棒的人形机器人运动跟踪和摔倒恢复。具体来说,Stubborn采用非对称Actor-Critic架构,包含三个主要组件。首先,采用偏航对齐的跟踪表示,以减少对全局漂移和航向扰动的敏感性,同时保留与重力相关的平衡信息。其次,我们引入基于伯努利的概率终止机制,使策略能够在不同失败模式下鼓励探索摔倒恢复行为。第三,我们提出一种概率终止和跟踪误差驱动的策略,根据跟踪性能动态重塑采样分布,提高困难运动片段和不稳定状态的训练效率。与最先进方法的广泛比较和消融研究表明,Stubborn取得了有竞争力的性能,所提出的概率终止机制和自适应采样策略有助于性能和鲁棒性的提升。真实世界演示请参见此https URL。

英文摘要

Recent reinforcement learning approaches have shown great promise in improving humanoid motion tracking performance and achieving fall recovery under disturbances. However, most existing works treat motion tracking and fall recovery as different tasks and require multi-stage training with specialized recovery rewards and/or separate recovery policies. Moreover, existing reinforcement learning-based methods often terminate training episodes immediately after severe tracking failures, limiting recovery-oriented exploration in unstable or fallen states. To address these issues, we propose Stubborn, a streamlined and unified reinforcement learning framework for robust humanoid motion tracking and fall recovery. Specifically, Stubborn uses an asymmetric Actor-Critic architecture and consists of three major components. First, a yaw-aligned tracking representation is adopted to reduce sensitivity to global drift and heading disturbances while preserving gravity-related balance information. Second, we introduce a Bernoulli-based probabilistic termination mechanism that encourages the policy to explore fall-recovery behaviors under varying failure modes. Third, we propose a sampling strategy driven by probabilistic termination and tracking error that dynamically reshapes the sampling distribution based on tracking performance, increasing the training efficiency for difficult motion segments and unstable states. Extensive comparisons with state-of-the-art (SOTA) methods and ablation studies show that Stubborn achieves competitive performance and that the proposed probabilistic termination mechanism and adaptive sampling strategy contribute to the performance and robustness gains. For real-world demonstrations, please refer to this https URL.
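
The idea behind Bernoulli-based probabilistic termination can be sketched as follows. This is a minimal illustration rather than the authors' code: when a failure is detected, the episode ends only with some probability, so the policy sometimes keeps acting from unstable or fallen states. The function name and the probability value 0.3 are hypothetical; the abstract does not say how the probability is set or how it varies across failure modes.

```python
import random


def should_terminate(failure_detected, terminate_prob, rng):
    """On failure, end the episode only with probability `terminate_prob`."""
    if not failure_detected:
        return False
    # One Bernoulli draw: True with probability terminate_prob.
    return rng.random() < terminate_prob


rng = random.Random(0)  # fixed seed so the simulation is repeatable
outcomes = [should_terminate(True, 0.3, rng) for _ in range(10_000)]
termination_rate = sum(outcomes) / len(outcomes)
```

With `terminate_prob` set to 1.0 this reduces to the conventional rule of terminating immediately on failure, which is the behaviour the abstract identifies as limiting recovery-oriented exploration.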

2606.12807 2026-06-12 cs.CL 新提交

Detect, Remask, Repair: Diffusion Editing for Faithful Summarization of Evolving Contexts

检测、重掩、修复:面向动态上下文忠实摘要的扩散编辑

Hao Zou, Zachary Horvitz, Chandhru Karthick, Zhou Yu, Kathleen McKeown

发表机构 * Columbia University(哥伦比亚大学)

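AI summary: Proposes the DETECT-REMASK-REPAIR framework, which uses masked diffusion language models to identify and repair outdated content in summaries, performing localized faithfulness repair while preserving supported content, and introduces the StreamSum benchmark for evaluation.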

详情
AI中文摘要

现实世界事件的摘要可能随着上下文演变和新信息的到来而过时。常见的做法是从更新后的上下文生成新摘要,但完全重新生成会丢弃之前的草稿,可能掩盖变化,并且当只有少数声明不支持时可能不必要。我们研究局部忠实性修复:在保留支持内容的同时更新现有摘要中的过时片段。我们提出DETECT-REMASK-REPAIR,一个基于扩散的框架,通过掩码扩散语言模型识别、重新掩码并修复过时区域。为了评估动态上下文摘要,我们引入了StreamSum,一个合成事件时间线的基准。在DialogSum和StreamSum上的实验表明,局部扩散修复提供了一种可控的替代完全重写的方法:忠实性导向的修复改进了早期草稿,一步修复将修复成本降低到半秒以下,该框架实现了跨数据集的忠实性-速度-保留权衡。我们还发现该框架可以作为事后修正步骤,提高自回归系统的忠实性。

英文摘要

Summaries of real-world events can become outdated as contexts evolve and new information arrives. A common response is to generate a new summary from the updated context, but full regeneration discards the previous draft, can obscure what changed, and may be unnecessary when only a few claims are unsupported. We study localized faithfulness repair: updating outdated spans in an existing summary while preserving supported content. We propose DETECT-REMASK-REPAIR, a diffusion-based framework that identifies, remasks, and repairs outdated regions with masked diffusion language models. To evaluate evolving-context summarization, we introduce StreamSum, a benchmark of synthetic event timelines. Experiments on DialogSum and StreamSum show that localized diffusion repair provides a controllable alternative to full rewriting: faithfulness-steered repair improves early drafts, one-step repair reduces repair cost to under half a second, and the framework enables faithfulness-speed-preservation tradeoffs across datasets. We also find that the framework can provide a post-hoc correction step that improves faithfulness for autoregressive systems.
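
The remask step can be pictured with a small sketch. It assumes a detector has already flagged outdated spans as half-open token index ranges, and it replaces those tokens with a mask symbol while leaving supported tokens untouched. The detection step and the diffusion-based repair are not shown, and the mask string, function name, and example summary are hypothetical.

```python
MASK = "[MASK]"  # hypothetical mask symbol


def remask(tokens, outdated_spans):
    """Return a copy of `tokens` with each (start, end) span replaced by MASK."""
    out = list(tokens)  # copy, so the previous draft is preserved
    for start, end in outdated_spans:
        for i in range(start, end):
            out[i] = MASK
    return out


summary = "The meeting is on Monday at noon".split()
# Suppose the updated context moved the meeting, so token 4 is outdated.
masked = remask(summary, [(4, 5)])
```

Only the masked positions would then be handed to the repair model, which is what keeps the edit local.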

2606.12797 2026-06-12 cs.AI 新提交

The Containment Gap: How Deployed Agentic AI Frameworks Fail Public-Facing Safety Requirements

遏制缺口:已部署的自主AI框架如何未能满足面向公众的安全要求

Md Jafrin Hossain, Mohammad Arif Hossain, Weiqi Liu, Nirwan Ansari

发表机构 * New Jersey Institute of Technology(新泽西理工学院)

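AI summary: Finds that mainstream agentic AI frameworks lack architectural-level safety guarantees and that missing memory integrity enables targeted corruption; proposes lightweight containment mechanisms that eliminate the tested attack vectors.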

详情
Comments
ICML 2026 (AI4GOOD Workshop)
AI中文摘要

自主调用工具、维护持久内存并执行多步计划的大语言模型系统越来越多地部署在面向公众的领域,包括政府服务、医疗分诊和财务咨询。我们询问用于构建这些系统的框架是否提供架构级结构安全保证。应用从自主架构的组合模型导出的六项遏制原则,我们审计了三个主流框架(LangChain、AutoGPT和OpenAI Agents SDK),发现没有一个原生合规。内存完整性,一种针对最普遍漏洞类别的防御,在三个评估框架中均未观察到。我们通过实证验证这些发现:在基于LangChain构建的模拟政府福利代理中,单次内存投毒写入在所有测试种子和后端上引起持久定向腐败,使目标申请人的错误拒绝率升至88.9%。在复杂的五因素政策下,同一攻击保持总体准确率,同时将目标错误拒绝率提高3.5倍,使腐败难以通过标准监控检测。然后我们引入两种轻量级遏制机制:内存完整性验证器和策略门,它们以亚毫秒开销(每次调用<0.2ms)消除了两种攻击向量。我们得出结论,当前的自主框架生态系统可能尚未满足面向公众部署的默认安全期望,并概述了优先架构干预措施,以实现在高风险、对社会有影响的应用程序中的可信部署。

英文摘要

Agentic large language model systems that autonomously invoke tools, maintain persistent memory, and execute multi-step plans are increasingly deployed in public-facing domains, including government services, healthcare triage, and financial advising. We ask whether the frameworks used to build these systems provide architectural-level structural safety guarantees. Applying six containment principles derived from a compositional model of agentic architectures, we audit three dominant frameworks (LangChain, AutoGPT, and OpenAI Agents SDK) and find no native compliance in any of them. Memory integrity, a defense against one of the most prevalent vulnerability classes, is not observed in any of the three evaluated frameworks. We validate these findings empirically: in a simulated government benefits agent built on LangChain, a single memory-poisoning write induces persistent targeted corruption across all tested seeds and backends, increasing the wrongful denial rate for targeted applicants to 88.9%. Under a complex five-factor policy, the same attack preserves aggregate accuracy while increasing targeted wrongful denials by 3.5x, rendering the corruption difficult to detect through standard monitoring. We then introduce two lightweight containment mechanisms: a memory integrity validator and a policy gate, which eliminate both attack vectors with sub-millisecond overhead (<0.2ms per call). We conclude that the current agentic framework ecosystem may not yet meet secure-by-default expectations for public-facing deployments and outline priority architectural interventions to enable trustworthy deployment in high-stakes, socially impactful applications.
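
One way a memory integrity check could work is sketched below. The abstract does not describe how the authors' validator is built, so this is a swapped-in, generic technique: each memory entry is stored with an HMAC tag, and reads fail if the content no longer matches its tag. The key, store layout, and function names are all hypothetical.

```python
import hashlib
import hmac

SECRET_KEY = b"demo-key-not-for-production"  # hypothetical key for illustration


def sign_entry(content: str) -> str:
    """HMAC-SHA256 tag over the entry content."""
    return hmac.new(SECRET_KEY, content.encode("utf-8"), hashlib.sha256).hexdigest()


def write_memory(store: dict, key: str, content: str) -> None:
    """Trusted write path: store the content together with its tag."""
    store[key] = (content, sign_entry(content))


def read_memory(store: dict, key: str) -> str:
    """Return the content only if its tag still verifies."""
    content, tag = store[key]
    if not hmac.compare_digest(tag, sign_entry(content)):
        raise ValueError(f"memory entry {key!r} failed integrity check")
    return content


store = {}
write_memory(store, "applicant_42", "eligible: income below threshold")
# Simulate a poisoning write that bypasses write_memory and reuses the old tag.
store["applicant_42"] = ("ineligible: flagged", store["applicant_42"][1])
```

Reading `applicant_42` after the simulated poisoning raises `ValueError`. A check like this only helps if untrusted inputs cannot reach the signing key or the trusted write path.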

2606.12790 2026-06-12 cs.CL 新提交

GENIE: A Fine-Grained Measure for Novelty

GENIE:一种细粒度新颖性度量方法

Ramya Namuduri, Manya Wadhwa, Anshun Asher Zheng, Greg Durrett, Junyi Jessy Li

发表机构 * The University of Texas at Austin(德克萨斯大学奥斯汀分校) New York University(纽约大学)

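AI summary: Proposes the GENIE metric, which measures the novelty of model-generated content at a fine-grained level along task-specific features, overcoming the inability of holistic metrics to capture the high dimensionality of novelty.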

详情
AI中文摘要

大型语言模型在各项任务中持续表现出缺乏创造力和多样性。先前的工作主要关注模型是否能够生成创造性输出。本文旨在考虑新颖性,并以任务特定方式研究模型生成内容的新颖性。我们提出了一种细粒度评估指标GENIE,用于根据响应群体中的任务特定特征来衡量响应的新颖性。我们表明,与GENIE不同,整体指标难以捕捉新颖性的高维性,并且无法提供关于它们针对哪些属性的见解。最后,我们使用GENIE来衡量解决创造力问题的缓解方法的有效性,以更好地理解这些方法在哪些方面可以提高新颖性。

英文摘要

Large Language Models have consistently demonstrated a lack of creativity and diversity across tasks. Prior work has focused on whether models are capable of generating creative outputs. Here, we consider novelty and investigate, in a task-specific manner, what makes model-generated content novel or not. We propose GENIE, a fine-grained evaluation metric that measures the novelty of responses along task-specific features with respect to a population of responses. We show that, unlike GENIE, holistic metrics struggle to capture the high dimensionality of novelty and do not provide insight into which properties they target. Finally, we use GENIE to measure the effectiveness of mitigation methods that address creativity, to better understand where these methods can improve novelty.

2606.12789 2026-06-12 cs.CL cs.IR 新提交

How Fine-Grained Should a RAG Benchmark Be? A Hierarchical Framework for Synthetic Question Generation

RAG基准测试应该有多细粒度?一个用于合成问题生成的层次化框架

Chase M. Fensore, Kaustubh Dhole, Jason Fan, Eugene Agichtein, Joyce C. Ho

发表机构 * Department of Computer Science, Emory University(埃默里大学计算机科学系)

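AI summary: Proposes HieraRAG, a hierarchical framework that uses synthetic question generation to study granularity in RAG benchmarks; it finds that the optimal granularity varies by dimension and introduces a Coherence Ratio metric.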

详情
AI中文摘要

评估检索增强生成(RAG)系统需要能够捕捉多样化问题特征的基准测试,然而实践者缺乏关于在哪些维度上变化以及以何种粒度变化的经验指导。我们提出了HieraRAG,一个用于研究RAG基准测试构建中粒度的层次化框架,将最优粒度定义为在给定RAG配置下最大化区分能力(各类别生成质量的标准差)的水平。作为案例研究,我们从FineWeb-10BT中生成了5,872个合成问答对,涵盖3个维度(问题复杂度、答案类型、语言变异)和3个粒度级别(2、4和8个类别)。使用BM25+Falcon-3-10B流水线,最优粒度因维度而异:复杂度受益于细粒度区分(区分能力:0.053),而答案类型和语言变异在中等粒度达到峰值。我们引入了一致性比率度量来量化细粒度划分是否干净地细分父类别,揭示了维度间的结构差异(问题复杂度:0.40 vs. 答案类型:1.44)。对110个分层问答对的人工评估确认了合成质量。虽然这些具体发现反映的是单一配置,但HieraRAG为实践者提供了可移植的程序和验证度量,以确定其自身RAG设置中的评估粒度。

英文摘要

Evaluating retrieval-augmented generation (RAG) systems requires benchmarks that capture diverse question characteristics, yet practitioners lack empirical guidance on which dimensions to vary and at what granularity. We present HieraRAG, a hierarchical framework for studying granularity in RAG benchmark construction, defining optimal granularity as the level that maximizes discriminative power (the standard deviation of generation quality across categories) within a given RAG configuration. As a case study, we generate 5,872 synthetic question-answer (QA) pairs from FineWeb-10BT across 3 dimensions (Question Complexity, Answer Type, Linguistic Variation) at 3 granularity levels (2, 4, and 8 categories). With a BM25+Falcon-3-10B pipeline, optimal granularity varies by dimension: complexity benefits from fine-grained distinctions (discriminative power: 0.053) while answer type and linguistic variation peak at medium granularity. We introduce a Coherence Ratio metric to quantify whether fine-grained splits cleanly subdivide parent categories, revealing structural differences across dimensions (Question Complexity: 0.40 vs. Answer Type: 1.44). Human evaluation of 110 stratified QA pairs confirms synthetic quality. While these specific findings reflect a single configuration, HieraRAG provides a portable procedure and validation metric for practitioners to determine evaluation granularity within their own RAG settings.
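
The abstract defines discriminative power as the standard deviation of generation quality across categories. A minimal sketch of that computation follows. The quality scores and category names are made up for illustration, and the use of the population standard deviation (`pstdev`) rather than the sample version is an assumption, since the abstract does not specify which one is used.

```python
from statistics import pstdev


def discriminative_power(quality_by_category):
    """Standard deviation of per-category generation quality."""
    return pstdev(list(quality_by_category.values()))


# Hypothetical per-category quality scores at two granularity levels.
coarse = {"simple": 0.60, "complex": 0.62}
fine = {"c1": 0.55, "c2": 0.60, "c3": 0.62, "c4": 0.67}

coarse_power = discriminative_power(coarse)
fine_power = discriminative_power(fine)
```

In this toy data the finer split spreads the categories further apart, so it has the higher discriminative power. Under the abstract's definition, the optimal granularity is whichever level maximises this value for a given RAG configuration.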

2606.12783 2026-06-12 cs.AI 新提交

A Tutorial on World Models and Physical AI

世界模型与物理AI教程

Il-Seok Oh

发表机构 * Department of Computer Science and Artificial Intelligence/CAIIT, Jeonju, Jeonbuk, South Korea(韩国全北全州计算机科学与人工智能系/CAIIT)

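AI summary: Presents a unified framework that distinguishes explicit from implicit world models and discusses their application to physical AI domains such as robotics and autonomous driving, along with the challenges on the path to artificial general intelligence.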

详情
AI中文摘要

世界建模正成为构建具备预测、推理和决策能力的智能系统的核心原则。显式世界模型与隐式世界模型之间存在一个核心区别:前者学习结构化动态以进行基于推演的推理和规划,后者则将预测结构编码到可扩展的学习表示中。这些互补范式为机器人、自动驾驶等领域的物理AI奠定了基础,使其能够在现实世界约束下实现超越反应式控制的智能。近期的基础模型进一步指明了通向集成感知、预测和行动的通用系统的路径。尽管进展迅速,但在层次推理、长时域规划和自主目标形成方面仍存在重大挑战,这些对于迈向通用人工智能至关重要。本教程提出了一个连贯的框架,其中多种世界建模方法通过共享的预测结构得以统一,并通过这种结构的表示和利用方式加以区分。

英文摘要

World modeling is emerging as a central principle for building intelligent systems capable of prediction, reasoning, and decision making. A central distinction can be drawn between explicit world models, which learn structured dynamics for rollout-based reasoning and planning, and implicit world models, which encode predictive structure within scalable learned representations. These complementary paradigms provide a foundation for physical AI in domains such as robotics and autonomous driving, enabling intelligence beyond reactive control under real-world constraints. Recent foundation models further suggest a pathway toward unified systems integrating perception, prediction, and action. Despite rapid progress, major challenges remain in hierarchical reasoning, long-horizon planning, and autonomous goal formation, which are critical for advancing toward artificial general intelligence. This tutorial presents a coherent framework in which diverse world modeling approaches are unified through shared predictive structure and differentiated by how such structure is represented and exploited.

2606.12780 2026-06-12 cs.LG cs.CL 新提交

ProPlay: Procedural World Models for Self-Evolving LLM Agents

ProPlay: 用于自我进化LLM智能体的程序化世界模型

Yijun Ma, Zehong Wang, Yiyang Li, Ziming Li, Xiaoguang Guo, Weixiang Sun, Chuxu Zhang, Yanfang Ye

发表机构 * University of Notre Dame(圣母大学) University of Connecticut(康涅狄格大学)

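AI summary: Proposes ProPlay, a procedural world model that uses procedure-level preplay and a causal procedure graph to let LLM agents self-evolve in partially observable environments without external supervision.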

详情
AI中文摘要

自我进化智能体应能在无外部监督下通过交互改进,但在部分可观测环境中仍困难,智能体必须主动探索、从有限反馈中学习,并决定何时信任先前经验。现有的LLM智能体方法通常依赖记忆或规划模块,但很少在它们之间闭环以持续完善对环境动态的内部理解。我们提出ProPlay,一种程序化世界模型,支持程序级预演,智能体可利用学到的世界知识排练未来的程序路径。ProPlay不将经验表示为孤立的规则或低层动作约束,而是将成功轨迹抽象为程序,并在捕获任务阶段间因果转换的程序图中组织它们。每个转换与一个可靠性记录嵌入相关联,以从过去结果中估计其任务特定贡献。在每个回合前,ProPlay在已知图结构上模拟未来程序轨迹作为结构化软指导;执行后,它利用环境反馈精炼图。在公开基准上的实验表明,ProPlay在环境理解和自我进化能力上持续优于强基线。我们的代码已在此https URL发布。

英文摘要

Self-evolving agents are expected to improve through interaction without external supervision, but this remains difficult in partially observable environments where agents must explore actively, learn from limited feedback, and decide when to trust prior experience. Existing LLM-agent methods often rely on memory or planning modules, yet they rarely close the loop between them to continually refine an internal understanding of environment dynamics. We introduce ProPlay, a procedural world model that supports procedure-level preplay, where agents can rehearse future procedural paths using the learned world knowledge. Rather than representing experience as isolated rules or low-level action constraints, ProPlay abstracts successful trajectories into procedures and organizes them in a procedure graph that captures causal transitions among task stages. Each transition is associated with a reliability record embedding to estimate its task-specific contribution from past outcomes. Before each episode, ProPlay simulates future procedural trajectories over known graph structures as structured soft guidance; after execution, it refines the graph using environment feedback. Experiments on public benchmarks show that ProPlay consistently improves environment understanding and self-evolution capability over strong baselines. Our code has been released at this https URL.

2606.12767 2026-06-12 cs.AI New submission

Constructing Evaluation Datasets for Procedural Reasoning: Balancing Naturalness, Grounding, and Multi-Hop Coverage

构建程序性推理评估数据集:平衡自然性、基础性和多跳覆盖

Sarah Elshabrawy, Rahul K. Dass, Ashok K. Goel

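Affiliations: Georgia Institute of Technology

AI summary: Studies how question generation strategies based on the Task-Method-Knowledge (TMK) model affect the quality of procedural and multi-hop reasoning datasets, proposes a grounding validation framework, and finds that strict TMK generation performs best on grounding and usability.

Details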
Comments
10 pages, 2 numbered figures. Workshop submission to HAIL @ AIED 2026
AI Chinese abstract

评估AI辅助学习系统中的程序性推理需要问答数据集,这些数据集既要像学习者一样,又要基于系统预期使用的教学知识。我们研究了基于TMK的问题生成策略如何影响程序性和多跳推理的数据集质量。我们比较了三种策略:从任务-方法-知识(TMK)模型严格生成、先转录后基于TMK过滤的生成、以及结合转录和结构化指导的TMK感知生成。为了评估生成的项目,我们引入了一个基于从TMK模型中提取的闭集证据单元的基础性验证框架。该框架衡量答案是否由底层表示支持、问题是否自包含、以及是否针对多跳程序性推理。在23个教学主题和690个生成的问答对中,严格TMK生成实现了最强的整体质量,其中96.5%的问题有基础,92.6%的问题可用。先转录生成产生更像学习者的问题,但更多是上下文依赖或基础薄弱的问题,而TMK感知生成产生较高的原始多跳覆盖率但基础性较低。这些结果表明,程序丰富性和自然措辞并不能保证表示基础性,这促使在AI辅助学习中的评估数据集需要进行显式的表示感知验证。

English abstract

Evaluating procedural reasoning in AI-supported learning systems requires question-answer datasets that are both learner-like and grounded in the instructional knowledge the system is expected to use. We study how TMK-based question generation strategies affect dataset quality for procedural and multi-hop reasoning. We compare three strategies: strict generation from Task-Method-Knowledge (TMK) models, transcript-first generation with post-hoc TMK filtering, and TMK-aware generation that combines transcripts with structured guidance. To evaluate generated items, we introduce a grounding validation framework based on closed-set evidence units extracted from TMK models. The framework measures whether answers are supported by the underlying representation, whether questions are self-contained, and whether they target multi-hop procedural reasoning. Across 23 instructional topics and 690 generated question-answer pairs, strict TMK generation achieves the strongest overall quality, with 96.5% grounded questions and 92.6% usable questions. Transcript-first generation produces more learner-like questions but more context-dependent or weakly grounded items, while TMK-aware generation yields high raw multi-hop coverage but lower grounding. These results show that procedural richness and natural phrasing do not guarantee representational grounding, motivating explicit representation-aware validation for evaluation datasets in AI-supported learning.
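
The three checks named in this abstract (answer support, self-containment, multi-hop targeting) can be sketched against a closed set of evidence units. The abstract does not specify the actual procedure, so everything here is an assumption: the `QAItem` fields, the rule that an answer is grounded when every unit it cites is in the closed set, the rule that "usable" means grounded and self-contained, the two-unit threshold for multi-hop, and the toy unit IDs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QAItem:
    question: str
    cited_units: frozenset  # IDs of evidence units the answer relies on (hypothetical)
    self_contained: bool    # assumed to be judged upstream, e.g. by an annotator


def validate(item, evidence_units):
    """Check one item against the closed set of evidence units."""
    grounded = bool(item.cited_units) and item.cited_units <= evidence_units
    return {
        "grounded": grounded,
        "usable": grounded and item.self_contained,        # assumed definition
        "multi_hop": grounded and len(item.cited_units) >= 2,  # assumed threshold
    }


def rates(items, evidence_units):
    """Fraction of items passing each check."""
    if not items:
        raise ValueError("need at least one item")
    checks = [validate(item, evidence_units) for item in items]
    return {key: sum(c[key] for c in checks) / len(checks)
            for key in ("grounded", "usable", "multi_hop")}


# Toy evidence set and items, invented for illustration only.
evidence = frozenset({"task:boil_water", "method:heat_pot", "knowledge:boiling_point"})
items = [
    QAItem("Which method achieves the boil-water task, and what knowledge does it rely on?",
           frozenset({"method:heat_pot", "knowledge:boiling_point"}), True),
    QAItem("What does it do next?", frozenset({"method:heat_pot"}), False),
    QAItem("Why is salt added?", frozenset({"knowledge:salt"}), True),
]
summary = rates(items, evidence)
```

The second toy item is grounded but context-dependent, and the third is self-contained but cites a unit outside the closed set, which mirrors the abstract's point that natural phrasing alone does not guarantee grounding.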

2606.12764 2026-06-12 cs.LG cs.CL cs.CR New submission

Detecting Functional Memorization in Code Language Models

检测代码语言模型中的功能记忆

Matthieu Meeus, Anil Ramakrishna, Matthew Grange, Zheng Xu, Luca Melis

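Affiliations: Meta; Imperial College London

AI summary: Studies functional memorization in code language models using a counterfactual setup that compares a model exposed to the target code against an unexposed reference model; with textual and functional similarity measures, it finds functional memorization beyond what textual overlap detects.

Details
AI Chinese abstract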

大型语言模型(LLMs)越来越多地被用于大规模生成代码。同时,先前的工作通过审计训练示例与模型生成之间的文本重叠,研究了训练数据是否可以从模型输出中恢复。然而,代码可能在功能上等价而在文本上不相似。在这项工作中,我们研究了功能记忆:提取超出逐字指标检测的功能逻辑。我们为Olmo-3-32B构建了一个反事实设置,将中期训练模型(暴露于目标代码)与预训练参考模型(未暴露)进行比较。我们使用Python函数签名提示两个模型,并测量文本和功能相似性(即LLM作为评判者、基于执行)。我们的结果显示了功能记忆的明确证据,突出了需要超越文本重叠的审计指标。

English abstract

Large language models (LLMs) are increasingly used to generate code at scale. Meanwhile, prior work has investigated whether training data may be recoverable from model outputs, by auditing the textual overlap between training examples and model generations. Code, however, can be functionally equivalent while textually dissimilar. In this work, we study functional memorization: extraction of functional logic beyond what verbatim metrics detect. We construct a counterfactual setup for Olmo-3-32B, comparing a midtrained model (exposed to target code) against a pretrained reference (not exposed). We prompt both models with Python function signatures and measure both textual and functional similarity (i.e., LLM-as-a-judge, execution-based). Our results show clear evidence of functional memorization, highlighting the need for auditing metrics that go beyond textual overlap.
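
The distinction between textual and functional similarity, and the exposed-versus-reference comparison, can be illustrated with a minimal sketch. It is not the paper's pipeline: the character-level `difflib` ratio, the execution-based agreement score over a handful of inputs, the `exposure_gap` helper, and the three toy code strings are all assumptions chosen for illustration, and the LLM-as-a-judge metric is not covered.

```python
import difflib


def textual_similarity(a, b):
    """Character-level overlap ratio between two source strings."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def load_function(source, name):
    namespace = {}
    exec(source, namespace)  # toy only: a real audit would sandbox untrusted code
    return namespace[name]


def functional_similarity(src_a, src_b, name, inputs):
    """Fraction of inputs on which both implementations return equal outputs."""
    fa, fb = load_function(src_a, name), load_function(src_b, name)
    matches = 0
    for args in inputs:
        try:
            matches += fa(*args) == fb(*args)
        except Exception:
            pass  # a crash on either side counts as a mismatch
    return matches / len(inputs)


def exposure_gap(target, exposed_gen, reference_gen, name, inputs):
    """Hypothetical signal: extra functional similarity of the exposed model's output."""
    return (functional_similarity(target, exposed_gen, name, inputs)
            - functional_similarity(target, reference_gen, name, inputs))


# Made-up strings standing in for training code and two model generations.
TARGET = "def total(xs):\n    acc = 0\n    for x in xs:\n        acc += x\n    return acc\n"
EXPOSED_GEN = "def total(xs):\n    return sum(xs)\n"
REFERENCE_GEN = "def total(xs):\n    return len(xs)\n"
INPUTS = [([1, 2, 3],), ([],), ([-5, 5],)]

text_sim = textual_similarity(TARGET, EXPOSED_GEN)
func_sim = functional_similarity(TARGET, EXPOSED_GEN, "total", INPUTS)
gap = exposure_gap(TARGET, EXPOSED_GEN, REFERENCE_GEN, "total", INPUTS)
```

In this toy case the exposed generation differs textually from the target yet agrees with it on every input, which is the kind of case a purely verbatim metric would miss.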