arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.28047 2026-05-28 cs.CL

Knowledge Dependency Estimation for Reliable Question Answering

Chaodong Tong, Qi Zhang, Nannan Sun, Lei Jiang, Yanbing Liu

Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; China Industrial Control Systems Cyber Emergency Response Team

AI summary: Proposes Knot, which uses subset-level counterfactual supervision and coverage modeling over latent dependency factors to estimate how sensitive a black-box QA model is to different knowledge units, thereby identifying key knowledge dependencies.

Comments 12 tables, 9 figures

Abstract

Reliable question answering requires identifying not only whether an answer is correct, but also which available knowledge the prediction depends on. In realistic LLM-based QA, this knowledge may come from context, retrieval, decomposition, or intermediate reasoning, forming a noisy and redundant candidate space rather than a clean gold evidence set. We study knowledge dependency estimation: estimating the sensitivity of a fixed black-box QA model to different candidate knowledge units. The challenge is to obtain fine-grained dependency scores without exhaustive test-time perturbation while modeling redundancy, substitutability, and complementarity. We propose Knot, a structured rank-aware knowledge dependency estimator. Knot learns from subset-level counterfactual supervision, models subset sensitivity through coverage over latent dependency factors, and derives rank-aware unit scores to identify influential candidates. Across multiple-choice and generative QA benchmarks, Knot outperforms all compared baselines in subset-sensitivity prediction and produces more faithful unit rankings than deployable baselines without extra QA-model calls; when used for practical risk screening, its dependency scores help flag error-prone QA predictions early.

2605.28046 2026-05-28 cs.AI cs.CL

MemCog: From Memory-as-Tool to Memory-as-Cognition in Conversational Agents

Zihan Li, Xingyu Fan, Feifei Li, Wenhui Que

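Affiliations: WeChat, Tencent Inc.

AI summary: Proposes MemCog, a system that integrates memory access into the reasoning process through a navigable memory store, a cross-dimensional navigation interface, and a proactive reasoning protocol, reaching state-of-the-art performance on passive QA and proactive memory-triggering benchmarks.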

Abstract

Existing agent memory systems universally follow what we term a Memory-as-Tool paradigm, in which a single query triggers one-shot retrieval of flat passage lists. This paradigm suffers from passive invocation, reasoning-retrieval decoupling, and structural mismatch between retrieved fragments and the agent's navigational needs. We propose MemCog, a Memory-as-Cognition system that makes memory access an integral part of the reasoning process. MemCog organizes user knowledge as a Navigable Memory Store with associative link graphs, exposes a Cross-Dimensional Navigation Interface for multi-step reasoning-driven traversal, and employs a Proactive Reasoning Protocol that drives agents to spontaneously initiate memory exploration from conversational context. We additionally construct ProactiveMemBench, the first benchmark for evaluating proactive memory triggering. Experiments show that MemCog achieves state-of-the-art results on passive QA benchmarks (92.98 on LoCoMo, 95.8 on LongMemEval) while substantially outperforming baselines on ProactiveMemBench, demonstrating the advantage of Memory-as-Cognition.

2605.28044 2026-05-28 cs.AI

Relevant Is Not Warranted: Evidence-Force Calibration for Cited RAG

Pin Qian, Su Wang, Xiaoyuan Wang, Yihang Chen, Wenxuan Xu, Qiaolin Yu, Shuhuai Lin, Sipeng Zhang, Junxian You, Xinpeng Wei

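Affiliations: Carnegie Mellon University; Georgia Institute of Technology; Dartmouth College; Cornell University; University of California San Diego; University of Glasgow

AI summary: Addresses insufficient evidence force in cited RAG by proposing the FORCEBENCH benchmark, which contrasts evidence-calibrated claims with force-raised variants to test evaluator monotonicity along five operational axes, and finds that standard support prompting is insufficient for calibrating evidence force.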

Abstract

Cited RAG evaluation often treats visible sources as a grounding signal, but a real, topically relevant citation can still under-warrant the attached wording. We study this diagnostic failure as citation laundering: a related source is presented as warrant for an over-strong claim. We introduce FORCEBENCH, a contrastive stress test for evidence-force calibration. Each item holds a cited passage fixed and pairs an evidence-calibrated claim with a localized force-raised variant across five operational axes: relation, modality, scope, temporal validity, and numeric specificity. A calibrated evaluator should score the evidence-calibrated claim higher. Headline experiments use a fixed, locality-filtered 198-pair evaluation set. A citation-presence sanity check is uninformative by design; token and entity overlap still violate monotonicity on 32.8-36.4% of pairs. Across four reported model judges, standard generic support prompting is insufficient for this force-calibration stress test (aggregate monotonicity violation rate, MVR, of 47.2%), while explicit warrant-strength prompting lowers MVR to 24.5% but remains imperfect. We release the benchmark, prompts, outputs, and plug-in pipeline so citation evaluators can report MVR and force sensitivity alongside conventional support metrics.
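
The monotonicity check described in this abstract can be illustrated with a short sketch. It assumes each pair is reduced to two evaluator scores (calibrated claim, force-raised claim) and that a tie counts as a violation. The abstract does not say how ties are handled, so that choice and the function name are assumptions, not the benchmark's actual pipeline.

```python
from typing import Sequence, Tuple


def monotonicity_violation_rate(pairs: Sequence[Tuple[float, float]]) -> float:
    """Fraction of pairs where the evidence-calibrated claim is NOT scored higher.

    Each pair is (score_calibrated, score_force_raised).
    Assumption: a tie is counted as a violation.
    """
    if not pairs:
        raise ValueError("need at least one scored pair")
    violations = sum(1 for calibrated, raised in pairs if calibrated <= raised)
    return violations / len(pairs)


# Hypothetical evaluator scores for four claim pairs.
scores = [(0.9, 0.4), (0.6, 0.7), (0.5, 0.5), (0.8, 0.2)]
print(monotonicity_violation_rate(scores))
```

In this toy input, the second pair (force-raised claim scored higher) and the third pair (tie) are the violations.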

2605.28042 2026-05-28 cs.CL cs.AI cs.LG

Extracting Small Translation Specialists from LLMs by Aggressively Pruning Experts

Liu O. Martin, Lucas Bandarkar, Nanyun Peng

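Affiliations: University of California, Los Angeles

AI summary: Proposes a method that aggressively prunes translation-irrelevant experts from mixture-of-experts LLMs, substantially compressing the MoE blocks without significantly degrading translation quality.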

Abstract

Modern large language models (LLMs) achieve state-of-the-art machine translation performance, but they do so as broad generalists largely trained for many tasks and capabilities unrelated to translation. Thus, they are heavily overparameterized for this task, resulting in excessive memory and compute requirements. In this paper, we present a method for aggressively pruning experts from modern mixture-of-experts LLMs while incurring negligible degradation in translation quality. Our approach exploits expert specialization and the separability of multilingual capabilities in LLMs to identify experts irrelevant to translation. And because of the modular nature of MoEs, these can be easily pruned without any training. Without retraining, we are able to prune half of all experts with negligible degradation and 70% with only minor losses. With a very short SFT, we prune 75% of experts while recovering baseline performance, and in some settings remove nearly 90% while maintaining reasonable translation quality. Overall, our results show that translation requires only a fraction of the LLM, enabling substantial compression of the MoE blocks that contain over 90% of parameters.

2605.28037 2026-05-28 cs.CL

Personality, Role, and Expressive Style in Large Language Models: An Interactionist Analysis

Moe Nagao, Koichiro Terao, Mikio Nakano, Naoto Iwahashi

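Affiliations: Okayama Prefectural University; AI & Humans Lab; C4A Research Institute

AI summary: From an interactionist perspective, this study uses factorial-design experiments to analyze how personality traits, dialogue roles, and expressive styles jointly shape the perceived expression of Big Five personality traits in LLM-generated dialogue.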

Comments 26 pages

Abstract

Prompt-based personality control is a key technique for designing large language model (LLM) dialogue agents that behave consistently across social contexts. However, specifying Big Five personality traits (BFTs) in a prompt does not ensure that the intended traits are expressed in generated utterances. This paper investigates this mismatch from an interactionist perspective, viewing personality expression as a context-dependent outcome shaped by the interplay between trait specification and situational factors. We analyze how perceived BFT expression in LLM-generated dialogue is influenced by three prompt factors: personality traits, dialogue roles, and expressive styles. Using a factorial design that combines six personality conditions, three roles, and three expressive-style conditions, we generate 1,080 LLM-agent dialogues in each of English and Japanese. We then evaluate the target agent's utterances using an LLM-as-a-judge framework to estimate expressed Big Five traits. The results show that expressed personality is shaped not only by explicit trait specification, but also by dialogue role and expressive style. These effects are trait-specific: dialogue role strongly influences Openness, expressive style substantially shapes Conscientiousness and Agreeableness, and explicit trait specification dominates Neuroticism. Even without explicit personality-trait specification, social and expressive conditions induce distinct personality-like impressions. Cross-linguistic comparisons show broadly similar patterns between English and Japanese dialogues, with noticeable differences only under specific combinations of personality, role, and expressive style. These findings suggest that personality control in LLM agents should be understood not as a direct consequence of trait prompting, but as a context-dependent process involving personality specification, social role, and expressive style.

2605.28036 2026-05-28 cs.CV cs.LG

Stay Fair! Ensuring Group Fairness in Diffusion Models Across Guidance Scales

Myeongsoo Kim, Eunji Kim, Minwoo Chae, Sangwoo Mo

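Affiliations: POSTECH; Amazon

AI summary: Proposes StayFair, which decomposes total bias into model bias and guidance bias, extends Strong Demographic Parity to the guidance process, and designs fair guidance algorithms so that diffusion models keep group fairness across guidance scales.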

Comments 28 pages, 18 figures

Abstract

Diffusion models steer conditional generation with a tunable guidance scale to trade off prompt alignment and diversity. However, existing debiasing techniques are optimized for a single scale, degrading fairness when users adjust this parameter. We trace this behavior to a previously overlooked source by decomposing total bias into two components: a model bias and a guidance bias. While prior work primarily targets the former, we show that the guidance bias grows monotonically with the guidance scale, eventually dominating the high-guidance regimes users prefer. To address this, we extend Strong Demographic Parity to guidance and derive a condition under which the target distribution retains its group ratio across guidance scales. We propose StayFair, which leverages this condition to design fair guidance algorithms in both regimes. For classifier guidance, it equalizes the classifier's output distributions across groups; for classifier-free guidance, it shifts the null embedding by a prompt-dependent offset. Because StayFair modifies only the guidance step, it is orthogonal to model debiasing and can be layered onto existing fair diffusion models to extend their fairness across guidance scales. Across class-conditional and text-to-image generation, StayFair decouples fairness from the guidance scale without sacrificing image quality.
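
To make the role of the guidance scale concrete, here is a minimal sketch of the standard classifier-free guidance combination, with a shifted null embedding plugged in where the abstract says StayFair intervenes. The toy noise model, the embeddings, and the offset value are all hypothetical. How StayFair actually derives its prompt-dependent offset is not stated here, so this only shows where such an offset would enter.

```python
from typing import List


def guided_noise(eps_cond: List[float], eps_null: List[float], scale: float) -> List[float]:
    """Standard classifier-free guidance: eps_null + scale * (eps_cond - eps_null).

    With this convention, scale = 1 recovers the conditional prediction.
    """
    return [n + scale * (c - n) for c, n in zip(eps_cond, eps_null)]


def toy_noise_model(embedding: List[float]) -> List[float]:
    # Hypothetical stand-in for a diffusion model's noise prediction.
    return [2.0 * e for e in embedding]


prompt_emb = [1.0, 0.0]
null_emb = [0.0, 0.0]
offset = [0.0, 0.25]  # hypothetical prompt-dependent offset (assumption)
shifted_null = [n + o for n, o in zip(null_emb, offset)]

for scale in (1.0, 3.0, 7.5):
    plain = guided_noise(toy_noise_model(prompt_emb), toy_noise_model(null_emb), scale)
    shifted = guided_noise(toy_noise_model(prompt_emb), toy_noise_model(shifted_null), scale)
    print(scale, plain, shifted)
```

Because the difference term is multiplied by the scale, any group imbalance carried by that term is amplified as the scale increases, which is one way to read the abstract's claim that guidance bias grows with the guidance scale.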

2605.28035 2026-05-28 cs.AI cs.MM cs.SD

MTAVG-Bench 2.0: Diagnosing Failure Modes of Cinematic Expressiveness in Multi-Talker Audio-Video Generation

Haitian Li, Yanghao Zhou, Heyan Huang, Liangji Chen, YiMing Cheng, Xu Liu, Dian Jin, Jiajun Xu, Jingyun Liao, Tian Lan, Ziqin Zhou, Yueying Liu, Yu Bai, Changsen Yuan, Jinxing Zhou, Xian-Ling Mao, Xuefeng Chen, Yousheng Feng

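Affiliations: Shanghai University; Beijing Institute of Technology; Shanghai Film Academy; Tsinghua University; Hefei University of Technology; Inkeverse Group Limited; The University of Adelaide; Beijing University of Technology; Beijing Academy of Artificial Intelligence; OpenNLP Lab

AI summary: To address the inadequate evaluation of cinematic expressiveness in multi-talker audio-video generation, proposes the MTAVG-Bench 2.0 benchmark, which builds a high-level failure taxonomy spanning acting, narrative, atmosphere, and audio-visual language together with more than 10,000 question-answering instances, to systematically evaluate how well omni large language models diagnose complex audio-visual failures.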

Abstract

In recent years, Multi-Talker Audio-Video Generation (MTAVG) models have shown promising performance on fundamental metrics such as lip-sync and audio-visual alignment. However, these metrics remain insufficient for assessing cinematic expressiveness in scene-level generation. In multi-character scenes, generation models must go beyond audio-visual realism to convey coherent character performance and other higher-level cinematic qualities. To fill this gap, we introduce MTAVG-Bench 2.0, a benchmark for diagnosing failure modes of cinematic expressiveness in multi-talker audio-video generation. Unlike prior settings that mainly focus on the quality of basic multi-turn dialogue, MTAVG-Bench 2.0 targets short-drama and scene-level generation, and establishes a high-level failure taxonomy spanning acting, narrative, atmosphere, and audio-visual language. Based on this taxonomy, we construct more than 10,000 question-answering evaluation instances, together with subsets for short-drama-level assessment and temporal localization of failure modes, to systematically evaluate the ability of omni large language models to diagnose high-level audio-visual failures. Experimental results show that commercial omni models such as Gemini substantially outperform other evaluators, yet even the strongest models continue to struggle with complex failures in our benchmark. These results demonstrate that MTAVG-Bench 2.0 provides a systematic benchmark for failure diagnosis in cinematic multi-talker audio-video generation.

2605.28034 2026-05-28 cs.AI

Clark Hash: Stateless Sparse Johnson-Lindenstrauss Quantization for Neural Embeddings

Stanislav Kirdey, Clark Labs Inc

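Affiliations: Clark Labs Inc

AI summary: Proposes Clark Hash, which uses normalization, sparse signed projection, and fixed-width scalar quantization to compress 384-dimensional sentence embeddings to 48 bytes without training, achieving 32x storage compression while keeping high correlation with cosine similarity.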

Comments First Autoresearch publication. Code available at https://github.com/clark-labs-inc/clark-hash. GPT-5.5 Pro was used for drafting and editing assistance

Abstract

Clark Hash is a small method for storing neural embeddings in less space. It normalizes each database vector, applies a deterministic sparse signed Johnson-Lindenstrauss projection, clips the result, and stores a fixed-width scalar-quantized code. Queries stay in floating point and are scored against the stored sketches. In the default 384-dimensional sentence-embedding setting, Clark Hash stores a cosine-search vector in 48 bytes instead of 1536 bytes for dense f32 storage. This is 32x smaller. The method does not need a training pass, learned codebooks, rotations, or corpus statistics before new vectors can be stored. We describe the codec, the Rust implementation, and a multilingual sentence-similarity evaluation on 9,304 labeled pairs from 29 subsets. With a multilingual MiniLM encoder, the 48-byte sketches reached 0.910 and 0.946 macro Pearson correlation with dense cosine scores on STS17 and STS22. Clark Hash is not a new Johnson-Lindenstrauss theorem and it is not a replacement for approximate nearest-neighbor indexes. It is a simple stateless codec for compact embedding storage.
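
The storage figures follow from simple arithmetic: 384 dimensions at 4 bytes each is 1536 bytes, and 1536 / 48 = 32. The pipeline itself (normalize, sparse signed projection, clip, quantize) can be sketched as below. This is an illustrative sketch and not the released Rust codec: the output layout (48 coordinates at 8 bits each), the clipping range, and the hash-based bucket and sign assignment are all assumptions chosen only so that the result fits in 48 bytes.

```python
import hashlib
import math

DIM_IN = 384   # embedding size from the abstract
DIM_OUT = 48   # assumption: 48 coordinates x 8 bits = 48 bytes
CLIP = 0.5     # assumption: clipping range for projected coordinates


def _bucket_and_sign(i: int, seed: int = 0):
    # Deterministic hash: each input index maps to one output bucket and a sign.
    digest = hashlib.blake2b(f"{seed}:{i}".encode(), digest_size=8).digest()
    value = int.from_bytes(digest, "little")
    return value % DIM_OUT, (1.0 if (value >> 63) & 1 else -1.0)


def project(vec):
    # Normalize, then apply a sparse signed projection (one nonzero per input index).
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    out = [0.0] * DIM_OUT
    for i, v in enumerate(vec):
        bucket, sign = _bucket_and_sign(i)
        out[bucket] += sign * v / norm
    return out


def encode(vec) -> bytes:
    # Clip each projected coordinate and store it as an 8-bit uniform code.
    codes = []
    for y in project(vec):
        y = max(-CLIP, min(CLIP, y))
        codes.append(round((y + CLIP) / (2 * CLIP) * 255))
    return bytes(codes)


def decode(code: bytes):
    return [c / 255 * 2 * CLIP - CLIP for c in code]


def score(query, code: bytes) -> float:
    # The query stays in floating point and is scored against the stored sketch.
    return sum(q * d for q, d in zip(project(query), decode(code)))


vector = [math.sin(i + 1) for i in range(DIM_IN)]
sketch = encode(vector)
print(len(sketch), DIM_IN * 4 // len(sketch))
```

No training pass or corpus statistics are needed in this sketch because the projection depends only on a fixed seed and the coordinate index, which matches the "stateless" property the abstract emphasizes.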

2605.28033 2026-05-28 cs.RO

How Should We Teach Robots? A Comparison of Kinesthetic, Joystick, and Gesture-Based Teaching

Petr Vanc, Jan Kristof Behrens, Václav Hlaváč, Karla Stepanova

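Affiliations: Czech Institute of Informatics, Robotics and Cybernetics (CIIRC CTU)

AI summary: A user study compares three demonstration modalities (kinesthetic guidance, joystick teleoperation, and hand-gesture teaching) and evaluates their success rate, workload, and common errors on manipulation tasks.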

Comments 7 pages, 3 figures, 3 tables, presented at Cognition and Artificial Life (CAL/KUZ) 2026 conference at Chateau Trest

Abstract

Instructing robots from demonstrations can be done through different teaching modalities, each with different usability and performance trade-offs. This paper compares kinesthetic guidance, joystick teleoperation, and hand gestures in a user study with eight participants. We evaluate replay success, modified NASA-TLX workload, and common teaching errors across three manipulation tasks. Kinesthetic guidance produced the shortest demonstrations, lowest workload, and highest success on the more orientation-sensitive and contact-rich tasks. Joystick teleoperation performed best on simple peg picking. Hand-gesture teaching, although less reliable overall, performed better than expected and in some cases achieved results comparable to kinesthetic guidance.

2605.28032 2026-05-28 cs.AI

PetroBench: A Benchmark for Large Language Models in Petroleum Engineering

Xiang Wang, Tingting Zhang, Sen Wang, Ying Wu, Heng Meng, Peng Zhou, Peng Li

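Affiliations: School of Petroleum and Natural Gas Engineering, Changzhou University; China University of Petroleum (East China)

AI summary: For petroleum engineering, builds a standardized bank of 1,200 questions and evaluates eight mainstream LLMs, finding that models do better on subjective than objective questions, that Chinese models have an edge on multiple-choice questions, and that international models are slightly better on short-answer questions.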

Abstract

Large Language Models are increasingly applied in the petroleum industry, highlighting the need for a domain-specific evaluation framework. This study develops a benchmark for LLMs in petroleum engineering, including a three-stage process of data preprocessing, quality filtering, and multi-model validation. Using expert review, a standardized question bank with strong domain relevance and discriminative capability was constructed. The benchmark covers production, reservoir, and drilling engineering, with 1,200 questions across multiple-choice, true or false, term definition, and short-answer formats. Eight mainstream LLMs were evaluated under a unified API environment. Results show that models performed better on subjective than objective questions, indicating weaknesses in factual knowledge discrimination. The highest accuracies for multiple-choice and true or false questions were 65.3% and 74.3%, respectively. Gemini-3-Pro, Kimi-K2.5, and Claude-Opus-4.6-Thinking achieved the best overall scores of 72%-74%. Models performed best in production engineering and weakest in reservoir engineering. Chinese models showed advantages in multiple-choice questions, while international models performed slightly better in short-answer questions. The benchmark provides a reproducible and practical reference for evaluating and deploying LLMs in petroleum engineering.

2605.28030 2026-05-28 cs.LG cs.AI cs.CR

SPARD: Defending Harmful Fine-Tuning Attack via Safety Projection with Relevance-Diversity Data Selection

Shuhao Chen, Weisen Jiang, Yeqi Gong, Shengda Luo, Chengxiang Zhuo, Zang Li, James T. Kwok, Yu Zhang

Affiliations: Department of Computer Science and Engineering, The Hong Kong University of Science and Technology; Department of Computer Science and Engineering, Southern University of Science and Technology; Department of Computer Science and Engineering, The Chinese University of Hong Kong; Platform and Content Group, Tencent; Chinese Medicine Guangdong Laboratory

AI summary: Proposes the SPARD framework, which combines safety-projected alternating optimization with relevance-diversity data selection to defend against harmful fine-tuning attacks, markedly lowering attack success rates while preserving task accuracy.

Comments Accepted by ICML 2026

Abstract

Fine-tuning large language models often undermines their safety alignment, a problem further amplified by harmful fine-tuning attacks in which adversarial data removes safeguards and induces unsafe behaviors. We propose SPARD, a defense framework that integrates Safety-Projected Alternating optimization with Relevance-Diversity aware data selection. SPARD employs SPAG, which alternates between utility updates and explicit safety projections, using a set of safe data to enforce safety constraints. To curate safe data, we introduce a Relevance-Diversity Determinantal Point Process to select compact safe data, balancing task relevance and safety coverage. Experiments on GSM8K and OpenBookQA under four harmful fine-tuning attacks demonstrate that SPARD consistently achieves the lowest average attack success rates, substantially outperforming state-of-the-art defense methods, while maintaining high task accuracy. Code is available at https://github.com/shuhao02/SPARD.

2605.28028 2026-05-28 cs.LG

BPPO: Binary Prefix Policy Optimization for Efficient GRPO-Style Reasoning RL with Concise Responses

Qingfei Zhao, Huan Song, Shuyu Tian, Jiawei Shao, Xuelong Li

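Affiliations: TeleAI; Shanghai Jiao Tong University

AI summary: To address the high update cost of GRPO and its tendency toward verbose reasoning, proposes BPPO, which uses only the shortest correct and shortest incorrect completions as the update unit and focuses optimization on prefixes, achieving about 6x speedup and 30-50% shorter responses.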

Abstract

Group Relative Policy Optimization (GRPO) is widely used for training reasoning models, but updating all sampled completions in each group incurs substantial cost and can reinforce verbose reasoning trajectories. In this paper, we study whether all completions provide equally useful update signals in GRPO-style reasoning RL. Our gradient-similarity analysis shows that, within the same prompt group, same-class completions often induce highly similar update directions, whereas correct-incorrect pairs provide more distinct contrastive signals. Motivated by this observation, we propose Binary Prefix Policy Optimization (BPPO), which uses the shortest correct completion and the shortest incorrect completion as a compact update unit while preserving full-group advantage normalization. BPPO further improves efficiency with adaptive completion scheduling and prefix-focused optimization; by updating only response prefixes, it avoids reinforcing redundant suffixes and encourages more concise responses. Experiments on GSM8K, MATH, and Geo3K show that BPPO achieves up to 6.08x speedup over GRPO while maintaining competitive accuracy, and reduces mean response length by approximately 30-50% without modifying the reward with an explicit length penalty.
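
The selection step described in this abstract can be sketched as follows. The sketch assumes binary rewards, mean/standard-deviation normalization over the full group (a common GRPO-style choice), and a fixed prefix length. The `Completion` type, the `prefix_len` parameter, and the function name are hypothetical, and the adaptive completion scheduling mentioned in the abstract is not modeled.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List, Optional, Tuple


@dataclass
class Completion:
    tokens: List[str]
    correct: bool


def select_update_unit(
    group: List[Completion], prefix_len: int, eps: float = 1e-6
) -> List[Tuple[List[str], float]]:
    """Return (token prefix, advantage) for the shortest correct and shortest incorrect completion."""
    # Advantages are normalized over the FULL group before any selection.
    rewards = [1.0 if c.correct else 0.0 for c in group]
    mu, sigma = mean(rewards), pstdev(rewards)
    advantages = [(r - mu) / (sigma + eps) for r in rewards]

    def shortest(flag: bool) -> Optional[int]:
        idxs = [i for i, c in enumerate(group) if c.correct == flag]
        return min(idxs, key=lambda i: len(group[i].tokens)) if idxs else None

    unit = []
    for idx in (shortest(True), shortest(False)):
        if idx is not None:
            # Only the response prefix is kept for the update (assumed fixed length).
            unit.append((group[idx].tokens[:prefix_len], advantages[idx]))
    return unit


group = [
    Completion(list("abcde"), True),
    Completion(list("abc"), True),
    Completion(list("wxyz"), False),
    Completion(list("wx"), False),
]
print(select_update_unit(group, prefix_len=2))
```

Only two of the four sampled completions reach the update here, which is where the compute saving would come from; the advantages still reflect statistics of the whole group.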

2605.28025 2026-05-28 cs.AI cs.CL cs.CY

MIRA: A Bilingual Benchmark for Medical Information Response Audit

Mengyu Xu, Qiaoxin Yang, Qianqian Wang, Xiwei Dai, Weiyi Wu, Chongyang Gao

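Affiliations: The University of Chicago; SynAI Technologies Inc.; Jinzhou Medical University; Zhejiang University; Dartmouth College; Northwestern University

AI summary: Proposes the bilingual MIRA benchmark, which uses 4,320 prompts to evaluate how consistently LLMs provide medical information across different user phrasings, finds that low health-literacy prompts lead to Differential Information Dilution (DID), and proposes a knowledge-guided mitigation.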

Abstract

Large language models (LLMs) are increasingly used to provide public-facing health information, yet existing safety evaluations overlook whether responses preserve comparable medical information across different user phrasings of the same question. To address this, we introduce the Medical Information Response Audit (MIRA), a bilingual, controlled benchmark that assesses whether LLMs provide comparable medical information across user-side language, register, and health literacy signals. MIRA contains 4,320 prompts built from 60 medically reviewed, low-risk health questions. Across five mainstream LLMs, models answered all medical questions, but responses to low health-literacy signals consistently omitted more key information, provided fewer concrete next steps, and offered less support for independent judgment. We term this pattern Differential Information Dilution (DID). Language effects are model-specific rather than uniformly worse for non-English prompts. A comparison with 300 real-world health queries provides preliminary evidence of rank-order validity. A knowledge-guided mitigation prompt reduces information dilution for most models, with the largest reductions in underinformative simplification observed for Claude (~8%) and Qwen (~6%).

2605.28023 2026-05-28 cs.CV cs.AI cs.CL cs.MM

VCap: Hypergeometric Rewards for Weak-to-Strong Visual Captioning

Xingyu Lu, Jinpeng Wang, Yi-Fan Zhang, Yankai Yang, Yancheng Long, Yiyang Fan, Xuanyu Zheng, Haonan Fan, Kaiyu Jiang, Tianke Zhang, Changyi Liu, Bin Wen, Fan Yang, Tingting Gao, Han Li, Chun Yuan

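Affiliations: Tsinghua Shenzhen International Graduate School; Harbin Institute of Technology, Shenzhen; Chinese Academy of Sciences; Kuaishou Technology

AI summary: Proposes VCap, a Witness-Adjudicator reward that verifies factual consistency between reference captions and policy-generated captions against the visual signal with hypergeometric-distribution-level precision, enabling weak-to-strong generalization and outperforming SOTA models on multiple image and video captioning benchmarks.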

Comments 28 pages, 8 figures

Abstract

Visual captioning requires models to capture visual content faithfully while minimizing both omission and hallucination. As the dominant paradigm for captioning, multimodal large language models (MLLMs) have achieved strong performance through scaling and high-quality data. Recently, reinforcement learning (RL) has emerged as a key route to driving MLLMs toward higher precision and broader coverage; however, existing reward designs for captioning fail to provide fine-grained and reliable signals for factual verification, limiting their effectiveness. To address this, we propose VCap, a Witness-Adjudicator reward that pairs the reference caption (a witness) with the visual signal (an adjudicator). By explicitly verifying factual consistency between the reference and policy-generated captions grounded in the visual signal, VCap delivers a reward signal with hypergeometric-distribution-level precision for caption quality verification. This design enables effective learning even from imperfect references, facilitating weak-to-strong generalization in RL training. In our experiments, an 8B model trained with VCap outperforms open- and closed-source SOTA models on multiple image and video captioning benchmarks. Human evaluation further confirms its strong alignment with factual correctness. Additionally, VCap improves MLLM perceptual capability, generalizes across tasks, and surpasses best-of-N distillation, challenging prior assumptions about RLVR.

2605.28022 2026-05-28 cs.CL cs.SE

Beyond pass@k: Redundancy-Aware RLVR for Multi-Sample Code Generation

Le Bronnec Florian, Alexandre Verine, Rio Yokota, Benjamin Negrevergne

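Affiliations: RIKEN Center for Computational Science

AI summary: To address redundancy under repeated-sampling evaluation in code generation, augments RLVR with anti-redundancy rewards based on JPlag similarity, improving executable correctness under a finite budget.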

Comments Preprint under review

Abstract

LLMs for code generation are commonly evaluated in repeated-sampling settings using Pass@k, where multiple candidate programs are executed against unit tests under a finite sampling budget. While recent verifier-based reinforcement learning (RLVR) methods improve executable correctness, how these objectives affect redundancy among sampled programs remains poorly understood. In this work, we study implementation-level redundancy in code generation using JPlag, a plagiarism-detection system for code. Across models and benchmarks, we show that correctness-only RLVR often concentrates generations around repeated implementations, whereas Pass@k-aware objectives maintain lower redundancy and improve larger-budget performance. Motivated by these observations, we augment RLVR with direct anti-redundancy rewards based on JPlag similarity. Across 3 models and 3 benchmarks, discouraging near-duplicate generations reliably improves finite-budget executable performance, often matching or outperforming specialized Pass@k-aware objectives.
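
Two ingredients of this abstract can be sketched briefly. The first function is the widely used unbiased Pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c pass; the abstract does not state which estimator the authors use, so treat it as the conventional choice. The second function is a hypothetical anti-redundancy reward: it substitutes Python's `difflib.SequenceMatcher` for JPlag (which is a separate tool and is not called here), and the penalty form and `weight` value are assumptions.

```python
from difflib import SequenceMatcher
from math import comb
from typing import List


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n samples with c passing."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def redundancy_penalized_rewards(
    programs: List[str], passed: List[bool], weight: float = 0.5
) -> List[float]:
    """Correctness reward minus a penalty for the closest other sample.

    Assumption: character-level similarity stands in for JPlag similarity.
    """
    rewards = []
    for i, prog in enumerate(programs):
        sims = [
            SequenceMatcher(None, prog, other).ratio()
            for j, other in enumerate(programs)
            if j != i
        ]
        redundancy = max(sims, default=0.0)
        rewards.append((1.0 if passed[i] else 0.0) - weight * redundancy)
    return rewards


samples = ["return x + 1", "return x + 1", "return sum([x, 1])"]
print(pass_at_k(n=10, c=3, k=1))
print(redundancy_penalized_rewards(samples, [True, True, True]))
```

In the toy batch, the two identical programs receive a lower reward than the third, structurally different one, even though all three pass.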

2605.28021 2026-05-28 cs.LG

AOE: Exhaustive Out-of-Distribution Detection via Recalibrating Outlier Labels

AOE:通过重新校准异常标签实现穷尽式分布外检测

Fengqiang Wan, Qing-Yuan Jiang, Yang Yang

发表机构 * School of Computer Science and Engineering(计算机科学与工程学院) Nanjing University of Science and Technology(南京理工大学)

AI总结 提出自适应置信度异常暴露(AOE)方法,通过温度缩放重新校准异常标签,利用自适应软目标保留分布外样本与分布内类别的语义关系,从而扩大分离边界并提升分布外检测性能。

详情
AI中文摘要

分布外(OOD)检测对于在开放世界和安全关键场景中部署机器学习模型至关重要,在这些场景中,测试输入可能偏离训练分布,对未知样本的过度自信预测可能导致不可靠的决策。异常暴露(OE)通过训练期间引入辅助异常样本来扩大分布内(ID)和OOD样本之间的间隔,已成为一种有前景的OOD检测范式。现有的基于OE的方法通常通过使用统一标签来最大化OOD样本在ID类别上的熵,从而扩大这一间隔。然而,我们从理论上证明,统一标签不可避免地忽略了OOD样本与ID类别之间的关系,称为过度软化效应,导致次优的间隔边界。我们的理论分析进一步揭示,显式利用这种关系反而可以提高OOD检测性能。受此启发,我们提出了自适应置信度异常暴露(AOE),一种简单而有效的方法,利用温度缩放重新校准异常标签。具体来说,AOE从温度缩放的模型预测中为OOD样本生成自适应软目标,其中可学习的温度平滑预测分布,而不会完全消除类别关系信息。通过使用这些自适应软目标监督OOD样本,AOE保留了OOD样本与ID类别之间的语义接近性,同时鼓励软目标接近高熵分布,从而抑制过度自信的OOD预测并扩大分离边界。在多种基准上的大量实验证明了AOE的有效性。

英文摘要

Out-of-distribution (OOD) detection is essential for deploying machine learning models in open-world and safety-critical scenarios, where test inputs may deviate from the training distribution and overconfident predictions on unknown samples can lead to unreliable decisions. Outlier Exposure (OE) has emerged as a promising OOD detection paradigm by introducing auxiliary outliers during training to enlarge the margin between in-distribution (ID) and OOD samples. Existing OE-based methods typically enlarge this margin by employing uniform labels to maximize the entropy of OOD samples over ID categories. However, we theoretically show that uniform labels inevitably disregard the relations between OOD samples and ID categories, termed the over-softening effect, leading to a suboptimal margin bound. Our theoretical analysis further reveals that explicitly exploiting such relations can instead yield improved OOD detection performance. Motivated by this insight, we propose Adaptive Confidence OE (AOE), a simple yet effective method that leverages temperature scaling to recalibrate outlier labels. Specifically, AOE generates adaptive soft targets from temperature-scaled model predictions for OOD samples, where the learnable temperature smooths the prediction distribution without fully erasing class-wise relational information. By supervising OOD samples with these adaptive soft targets, AOE preserves the semantic proximity between OOD samples and ID categories while encouraging the softened targets to approach a high-entropy distribution, thereby suppressing overconfident OOD predictions and enlarging the separation margin. Extensive experiments across diverse benchmarks demonstrate the effectiveness of AOE.
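The contrast between uniform labels and temperature-scaled soft targets can be illustrated with a small Python sketch. It assumes a plain softmax with a fixed temperature; in the paper the temperature is learnable, and the logits, the value 4.0, and the helper names below are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical logits of one OOD sample over three ID classes.
logits = [2.0, 1.0, -1.0]

uniform_target = [1 / 3] * 3                      # standard OE label
sharp_prediction = softmax(logits)                # temperature 1
soft_target = softmax(logits, temperature=4.0)    # assumed fixed temperature
```

Here `soft_target` has higher entropy than `sharp_prediction` but still ranks the classes in the same order, whereas `uniform_target` discards that ordering entirely.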

2605.28020 2026-05-28 cs.CL

The Missing Piece in Pre-trained Model Evaluation: Reward-Guided Decoding Unlocks Task-Oriented Behavior Without Parameter Updates

预训练模型评估中缺失的一环:奖励引导解码无需参数更新即可解锁任务导向行为

Shaobo Wang, Guo Chen, Ziyue Wang, Zhengyang Tang, Qingyang Liu, Xingzhang Ren, Dayiheng Liu, Linfeng Zhang

发表机构 * Shanghai Jiao Tong University(上海交通大学) Qwen Team, Alibaba Group(阿里巴巴集团Qwen团队) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))

AI总结 提出一种无需训练的奖励引导解码框架EBD,通过外部轻量奖励模型调整输出分布,激活冻结预训练模型的任务导向行为,实现更公平的推理时评估。

Comments 26 pages, 5 figures, 8 tables

详情
AI中文摘要

随着大型语言模型(LLMs)的快速发展,可靠地评估预训练LLMs的能力变得越来越重要。挑战在于,基础预训练模型针对下一个词预测进行优化,在标准提示和直接解码下往往无法遵循指令或生成格式良好的答案。因此,基准性能可能混淆模型能力与解码导致的无法产生任务导向输出的问题,而暴露这种行为通常依赖于昂贵的后训练。最近的仅解码方法试图重塑输出分布,但这类方法在开放式任务中可能效率低下且脆弱。为解决这些限制,我们提出基于能量的解码(EBD),一种无需训练、奖励引导的框架,用于从冻结的预训练LLMs中激活任务导向行为,涵盖开放式和客观任务。EBD通过外部轻量奖励模型增强解码,将生成导向高效用响应,同时通过奖励倾斜的目标分布将其锚定到预训练模型先验。我们证明EBD将基础模型输出转向更符合指令的行为,增加了与后训练对应物的行为相似性,并实现了对可访问预训练模型行为的更公平推理时评估。实验上,EBD在五个模型和六个基准上优于基线,将Qwen3-8B-Base在AlpacaEval2.0上的性能从8.8提升到44.5,将Mistral-7B在Math500上的延迟相对于先前的解码工作降低18.9倍,并且对奖励模型大小保持鲁棒。

英文摘要

With the rapid progress of large language models (LLMs), reliably evaluating the capabilities of pre-trained LLMs has become increasingly important. The challenge is that base pre-trained models are optimized for next-token prediction and often fail to follow instructions or produce well-formed answers under standard prompting and direct decoding. As a result, benchmark performance can conflate model capability with decoding-induced failures to produce task-oriented outputs, while exposing such behavior often relies on costly post-training. Recent decoding-only approaches attempt to reshape output distributions, but such methods can be inefficient and brittle across open-ended tasks. To address these limitations, we propose Energy-Based Decoding (EBD), a training-free, reward-guided framework for activating task-oriented behaviors from frozen pre-trained LLMs across both open-ended and objective tasks. EBD augments decoding with an external lightweight reward model, steering generations toward high-utility responses while anchoring them to the pre-trained model prior through a reward-tilted target distribution. We show that EBD shifts base-model outputs toward more instruction-following behavior, increasing behavioral similarity to post-trained counterparts and enabling a fairer inference-time evaluation of accessible pre-trained-model behavior. Empirically, EBD outperforms baselines across five models and six benchmarks, improving Qwen3-8B-Base on AlpacaEval2.0 from 8.8 to 44.5, reducing Mistral-7B Math500 latency by 18.9x relative to prior decoding work, and remaining robust to reward-model size.
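A reward-tilted target distribution is commonly written as p*(y) proportional to p_base(y) * exp(r(y) / beta). The sketch below applies that form to a handful of candidate responses. The abstract does not give EBD's exact formulation, so the formula, the parameter `beta`, and the function name are assumptions made for illustration only.

```python
import math


def reward_tilted(base_probs, rewards, beta=1.0):
    """Reweight base-model probabilities by exp(reward / beta) and renormalize.

    Smaller beta follows the reward more aggressively; larger beta stays
    closer to the base-model prior.
    """
    if len(base_probs) != len(rewards):
        raise ValueError("base_probs and rewards must have the same length")
    top = max(rewards)  # subtract the max reward for numerical stability
    weights = [p * math.exp((r - top) / beta)
               for p, r in zip(base_probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical base-model probabilities and reward-model scores.
base = [0.5, 0.3, 0.2]
scores = [0.0, 2.0, 0.0]
tilted = reward_tilted(base, scores, beta=1.0)
```

In this toy case the second candidate gains probability mass because of its higher reward, while the relative odds of the first and third candidates stay as the base model set them.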

2605.28018 2026-05-28 cs.CV

Dual-branch Distilled Transformer for Efficient Asymmetric UAV Tracking

双分支蒸馏Transformer用于高效非对称无人机跟踪

Hongtao Yang, Bineng Zhong, Qihua Liang, Yaozong Zheng, Xiantao Hu, Yuanliang Xue, Shuxiang Song

发表机构 * Key Lab of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University(教育区块链与智能技术重点实验室,教育部,广西师范大学) Guangxi Key Lab of Multi-Source Information Mining and Security, Guangxi Normal University(广西多源信息挖掘与安全重点实验室,广西师范大学) Nanjing University of Science and Technology(南京理工大学) Xi’an Research Institute of High Technology(西安高新技术研究所)

AI总结 提出EATrack框架,通过教师引导的双分支蒸馏策略,在轻量学生模型中增强特征表达,实现无人机跟踪的精度与速度平衡。

Comments CVPR2026 Highlight

详情
AI中文摘要

鉴于无人机跟踪的实时性需求,许多方法简化骨干网络以减少计算量,但这往往削弱特征表示,导致复杂场景下性能下降。为解决此问题,我们提出EATrack,一种高效的非对称无人机跟踪框架,其核心是教师引导的双分支蒸馏策略,增强轻量学生模型的特征表达能力。具体而言,EATrack探索了知识迁移的两个互补视角:空间聚焦的特征级蒸馏,通过引导学生学习强目标表示来补偿弱化的表示;以及预测级蒸馏,通过学习教师精确目标定位的能力来增强空间定位。此外,为增强对外观变化的鲁棒性,我们引入细粒度目标感知蒸馏策略,选择性地将教师的目标建模能力迁移给学生。推理时集成时间适应模块以增强时间上的鲁棒性。在五个无人机基准上的实验表明,EATrack在精度和速度之间取得了良好的平衡。代码:https://github.com/GXNU-ZhongLab/EATrack

英文摘要

Given the real-time demands of UAV tracking, many methods simplify the backbone to reduce computation, but this often weakens feature representation and degrades performance in complex scenarios. To alleviate this issue, we propose EATrack, an efficient and asymmetric UAV tracking framework centered around a teacher-guided dual-branch distillation strategy that enhances the feature expressiveness of the lightweight student model. Specifically, EATrack investigates two complementary perspectives of knowledge transfer: spatially focused feature-level distillation that compensates for weakened representations by guiding the student to learn strong target representations, and prediction-level distillation that enhances spatial localization by learning the teacher's capability for accurate target localization. Furthermore, to enhance robustness against appearance variations, we introduce a fine-grained target-aware distillation strategy that selectively transfers the teacher's target modeling capacity to the student. A temporal adaptation module is incorporated at inference to enhance robustness over time. Experiments on five UAV benchmarks demonstrate that EATrack achieves a favorable balance between accuracy and speed. Code: https://github.com/GXNU-ZhongLab/EATrack

2605.28016 2026-05-28 cs.CV physics.med-ph

Enhancing Ultra-low-field MRI with Segmentation-guided Adversarial Learning

利用分割引导的对抗学习增强超低场MRI

James Grover, Andrew Phair, Michael Ferraro, David E. J. Waddington

发表机构 * Image X Institute, Sydney School of Health Sciences, Faculty of Medicine and Health(Image X研究院,悉尼健康科学学院,医学与健康学院)

AI总结 提出结合解剖条件分割先验和模型集成的方法,通过Swin UNETR生成组织分割先验,并利用CycleGAN和T-REX两个增强网络合成3T级MRI,有效提升64 mT超低场MRI的图像质量。

详情
AI中文摘要

超低场(ULF)MRI提供便携且低成本的成像,但图像质量较差。为解决此问题,我们提交了2025年ULF增强挑战赛(ULF-EnC)的方案,目标是从64 mT扫描合成类似高场MRI的图像。我们的流程通过解剖条件化和模型集成来增强ULF MRI。首先,使用仅在挑战提供数据上训练的Swin UNETR生成组织分割先验。这些先验条件化两个独立的增强网络——一个CycleGAN和一个基于Transformer的残差增强模型(T-REX)——每个网络都训练用于合成3T级MRI。两个模型的输出通过加权平均结合。我们的方法产生的增强MRI在定量和定性上都与高场扫描相当。

英文摘要

Ultra-low-field (ULF) MRI offers portable and low-cost imaging but suffers from poor image quality. To address this, we present our submission to the 2025 ULF Enhancement Challenge (ULF-EnC), where the goal is to synthesise high-field-like MRIs from 64 mT scans. Our pipeline enhances ULF MRI through a combination of anatomical conditioning and model ensembling. We first generate tissue segmentation priors using a Swin UNETR trained solely on challenge-provided data. These priors condition two independent enhancement networks, a CycleGAN and a transformer-based residual enhancement model (T-REX), each trained to synthesise 3 T-like MRIs. Outputs from both models are combined using a weighted average. Our approach produces enhanced MRIs that are comparable to high-field scans both quantitatively and qualitatively.

2605.28014 2026-05-28 cs.CL cs.LG

ROSD: Reflective On-Policy Self-Distillation for Language Model Reasoning across Domains

ROSD: 面向跨领域语言模型推理的反思式同策略自蒸馏

Ziqi Zhao, Xinyu Ma, Liu Yang, Yujie Feng, Daiting Shi, Jingzhou He, Xin Xin, Zhaochun Ren, Xiao-Ming Wu

发表机构 * The Hong Kong Polytechnic University(香港理工大学) Baidu Inc.(百度公司) Shandong University(山东大学) Leiden University(莱顿大学)

AI总结 提出反思式同策略自蒸馏(ROSD)框架,通过反思引导的错误定位蒸馏将参考解模仿转为针对性推理修正,提升领域内推理和跨领域泛化能力。

Comments Preprint

详情
AI中文摘要

同策略自蒸馏(OPSD)通过为同策略 rollout 提供密集的 token 级监督,提升了大语言模型(LLM)的推理性能。然而,现有的 OPSD 方法在领域内推理上增益有限,且对领域外问题的泛化能力较差。我们识别出两个关键原因:将自教师模型条件化为已验证的解决方案会鼓励模仿训练领域的参考轨迹而非特定错误的修正;将蒸馏应用于完整响应可能会覆盖有效的推理前缀并强化过拟合。我们提出反思式同策略自蒸馏(ROSD),一个通过反思引导的、错误定位的蒸馏将参考解模仿转化为针对性推理修正的框架。对于每个 rollout,ROSD 使用自反思器提取修正思路并定位第一个错误片段。修正思路引导自教师模型进行针对性监督,而定位的错误片段将蒸馏限制在需要修正的区域。这种设计在保留有效前缀的同时修正了有缺陷的推理。在多个领域内和领域外推理基准上的实验表明,ROSD 在整体上产生了更强的领域内推理性能,并且相比标准 OPSD 具有显著更好的领域外泛化能力。代码可在 https://github.com/ZiqiZhao1/ROSD 获取。

英文摘要

On-policy self-distillation (OPSD) improves the reasoning performance of large language models (LLMs) by providing dense token-level supervision for on-policy rollouts. However, existing OPSD methods often yield limited gains on in-domain reasoning and generalize poorly to out-of-domain problems. We identify two key causes: conditioning the self-teacher on a verified solution encourages imitation of training-domain reference trajectories rather than error-specific correction, and applying distillation to the full response can overwrite valid reasoning prefixes and reinforce overfitting. We propose Reflective On-policy Self-Distillation (ROSD), a framework that turns reference-solution imitation into targeted reasoning correction through reflection-guided, error-localized distillation. For each rollout, ROSD uses a self-reflector to extract a corrective idea and locate the first erroneous span. The corrective idea guides the self-teacher toward targeted supervision, while the localized error span restricts distillation to where correction is needed. This design corrects flawed reasoning while preserving valid prefixes. Experiments on multiple in-domain and out-of-domain reasoning benchmarks show that ROSD yields stronger in-domain reasoning performance overall and substantially better out-of-domain generalization than standard OPSD. Code is available at https://github.com/ZiqiZhao1/ROSD.
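The idea of restricting distillation to a localized error span can be sketched as a masked token-level KL loss. This is only an illustration under simple assumptions: the span is given as a half-open index range, the loss is KL(teacher || student) averaged over the span, and all names and numbers are hypothetical rather than taken from the ROSD code.

```python
import math


def token_kl(teacher_probs, student_probs):
    """KL(teacher || student) at a single token position."""
    return sum(t * math.log(t / s)
               for t, s in zip(teacher_probs, student_probs) if t > 0)


def localized_distill_loss(teacher, student, span_start, span_end):
    """Mean token-level KL over positions [span_start, span_end) only.

    Positions outside the span contribute nothing, so a valid reasoning
    prefix is left untouched by the distillation signal.
    """
    kls = [token_kl(t, s)
           for t, s in zip(teacher[span_start:span_end],
                           student[span_start:span_end])]
    return sum(kls) / len(kls) if kls else 0.0


# Toy rollout: 4 token positions over a 2-word vocabulary.
teacher = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8], [0.6, 0.4]]
student = [[0.6, 0.4], [0.5, 0.5], [0.7, 0.3], [0.6, 0.4]]

full_loss = localized_distill_loss(teacher, student, 0, 4)  # full response
span_loss = localized_distill_loss(teacher, student, 2, 3)  # error span only
```

In the toy data the teacher also disagrees with the student at position 0, but the span-restricted loss ignores that position, which mirrors the stated goal of not overwriting the prefix.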

2605.28013 2026-05-28 cs.CL

KSAFE-MM: A Multimodal Safety Benchmark via Localized Contextualization for Korean Cultural Risks

KSAFE-MM:通过本地化语境化实现韩国文化风险的多模态安全基准

Yongwoo Kim, Sojung An, Yunjin Park, Jungwon Yoon, Dujin Lee, HyunBeom Cho, Jaewon Lee, Wonhyuk Lee, Youngchol Kim, JeongYeop Kim, Donghyun Kim

发表机构 * Korea University(高丽大学) KT Corporation(KT公司)

AI总结 针对多模态大语言模型在安全评估中缺乏文化特异性问题,提出KSAFE-MM基准,通过语言和视觉语境化构建通用与韩国文化特有的多模态安全测试集,揭示模型对文化攻击的脆弱性及安全性与过度拒绝之间的权衡。

详情
AI中文摘要

多模态大语言模型(MLLMs)通过引入跨多种模态(如语言和视觉)的漏洞,加剧了安全风险。然而,当前的MLLM安全评估工具存在重大局限性:1)以英语为中心的数据集构建,以及2)关注与当地文化背景无关的通用风险。本文介绍了KSAFE-MM,一个用于韩语多模态安全评估的基准,涵盖通用安全风险和文化特定漏洞。KSAFE-MM由两部分组成:KSAFE-MM-G和KSAFE-MM-C。KSAFE-MM-G通过语言语境化评估韩语语境中的全球共享风险,将通用安全查询转化为上下文相关的多模态样本。KSAFE-MM-C利用源自真实世界语境的本地化视觉查询,针对文化依赖的MLLM安全漏洞。它将这些视觉查询与越狱式文本查询配对,以覆盖涉及文化视觉线索和恶意文本意图的多模态安全风险。这些组件共同提供了一个从通用到本地的构建流程,用于评估全球共享安全风险和文化特定漏洞。我们在KSAFE-MM上评估了12个最先进的MLLM,并揭示了模型对文化攻击的脆弱性高于通用攻击。值得注意的是,越狱策略显著提高了攻击成功率,其中ProgramExecution的攻击成功率高达74.2%,而标准查询仅为13.4%。此外,我们发现了安全性与过度拒绝之间的系统性权衡,即实现低攻击成功率的模型往往对良性查询表现出过度的拒绝行为。这些发现强调了超越以英语为中心的基准、进行文化基础安全评估的紧迫性。

英文摘要

Multimodal Large Language Models (MLLMs) exacerbate safety risks by introducing vulnerabilities across multiple modalities, such as language and vision. Current MLLM safety evaluation tools, however, suffer from major limitations: 1) English-centric dataset construction, and 2) a focus on generic risks that are not tied to local cultural contexts. This paper introduces KSAFE-MM, a benchmark for Korean multimodal safety evaluation that covers both general safety risks and culture-specific vulnerabilities. KSAFE-MM consists of two parts, KSAFE-MM-G and KSAFE-MM-C. KSAFE-MM-G evaluates globally shared risks in Korean contexts through linguistic contextualization, which transforms generic safety queries into contextually grounded multimodal samples. KSAFE-MM-C targets culture-dependent MLLM safety vulnerabilities using localized visual queries derived from real-world contexts. It pairs these visual queries with jailbreak-style textual queries to cover multimodal safety risks involving cultural visual cues and malicious textual intent. Together, these components provide a general-to-local construction pipeline for evaluating both globally shared safety risks and culture-specific vulnerabilities. We evaluate 12 state-of-the-art MLLMs on KSAFE-MM and reveal that models exhibit greater vulnerability to culturally grounded attacks than to generic ones. Notably, jailbreaking strategies substantially amplify attack success rates, with ProgramExecution yielding up to 74.2% ASR compared to 13.4% for standard queries. Furthermore, we identify a systematic trade-off between safety and over-refusal, where models achieving low ASR tend to exhibit excessive refusal behavior on benign queries. These findings highlight the urgent need for culturally grounded safety evaluation beyond English-centric benchmarks.

2605.28011 2026-05-28 cs.CV

Automated Estimation of Impact Time, Impact Location, and Shuttlecock Speed in Badminton Smashes Using Event Cameras

使用事件相机自动估计羽毛球扣杀中的撞击时间、撞击位置和球速

Yudai Washida, Yuto Kase, Kai Ishibe, Ryoma Yasuda, Sakiko Hashimoto

发表机构 * MIZUNO Corporation(MIZUNO公司) Suminoe-ku, Osaka-shi, Osaka(大阪府大阪市住之江区)

AI总结 提出一种使用两台同步事件相机的方法,在同一试验中自动估计羽毛球扣杀的撞击时间、球拍面撞击位置和球速,并通过Bland-Altman分析验证其与高速相机参考方法的一致性。

Comments 24 pages, 5 figures

详情
AI中文摘要

量化羽毛球扣杀中的撞击现象对于评估运动表现和装备性能都很重要;然而,传统测量系统在时间分辨率、数据效率和准备工作之间存在权衡。本研究提出了一种使用两台同步事件相机的测量方法,在同一试验中自动估计撞击时间、球拍面上的撞击位置以及撞击后的球速。通过事件率统计检测挥拍区间,从侧视事件数据中的羽毛球轨迹拐点估计撞击时间,通过椭圆拟合后视事件图像中的球拍面确定撞击位置,并在矢状面计算球速。为了验证所提出的方法,使用来自五名运动员的125次扣杀试验,与基于高速相机的参考方法进行了Bland-Altman分析。在所有124次可分析试验中估计了撞击时间和球速,在93.5%(116/124)的试验中估计了撞击位置。撞击时间、内侧-外侧撞击位置、纵向撞击位置和球速的偏差(95%置信区间)分别为1.84毫秒(1.45至2.23)、3.45毫米(2.18至4.72)、-1.92毫米(-2.97至-0.88)和-1.00米/秒(-2.46至0.46)。所有指标均未观察到比例偏差。这些结果表明,所提出的方法可以作为在实际环境中综合评估羽毛球扣杀性能和装备的有用工具。

英文摘要

Quantifying impact phenomena in badminton smashes is important for evaluating both athletic performance and equipment; however, conventional measurement systems involve trade-offs between temporal resolution, data efficiency, and preparation effort. This study proposes a measurement method using two synchronized event cameras to automatically estimate impact time, impact location on the racket face, and post-impact shuttlecock speed in an integrated manner within the same trial. The swing interval was detected from event rate statistics, impact time was estimated from the shuttlecock trajectory inflection in the lateral-view event data, impact location was determined by ellipse fitting to the racket face in the rear-view event image, and shuttlecock speed was calculated in the sagittal plane. To validate the proposed method, Bland-Altman analysis was performed against a high-speed camera-based reference method using 125 smash trials from five players. Impact time and shuttlecock speed were estimated in all 124 analyzable trials, and impact location was estimated in 93.5% (116/124). The biases (95% CI) for impact time, medio-lateral impact location, longitudinal impact location, and shuttlecock speed were 1.84 ms (1.45 to 2.23), 3.45 mm (2.18 to 4.72), -1.92 mm (-2.97 to -0.88), and -1.00 m/s (-2.46 to 0.46), respectively. No proportional bias was observed for any metric. These results suggest that the proposed method can serve as a useful tool for integrated assessment of badminton smash performance and equipment in practical settings.
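For readers unfamiliar with Bland-Altman analysis, the sketch below computes the bias (mean difference between two methods), an approximate 95% confidence interval for that bias, and the 95% limits of agreement. It assumes a normal approximation with z = 1.96 instead of a t-distribution, does not test for proportional bias, and uses made-up numbers; none of it is taken from the paper's data or code.

```python
from math import sqrt
from statistics import mean, stdev


def bland_altman(test_values, reference_values, z=1.96):
    """Return bias, approximate 95% CI of the bias, and limits of agreement."""
    if len(test_values) != len(reference_values):
        raise ValueError("both methods must measure the same trials")
    diffs = [t - r for t, r in zip(test_values, reference_values)]
    bias = mean(diffs)
    sd = stdev(diffs)                      # sample standard deviation
    half_ci = z * sd / sqrt(len(diffs))    # normal approximation (assumption)
    return {
        "bias": bias,
        "bias_ci": (bias - half_ci, bias + half_ci),
        "limits_of_agreement": (bias - z * sd, bias + z * sd),
    }


# Made-up measurements from a test method and a reference method.
result = bland_altman([10.2, 11.1, 9.8, 10.5], [10.0, 11.0, 10.0, 10.0])
```

A confidence interval for the bias that excludes zero, as reported for impact time in the abstract, indicates a systematic offset between the two methods, while an interval that includes zero does not.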

2605.28010 2026-05-28 cs.AI

Confidence-Orchestrated Self-Evolution against Uncertain LLM Feedback

信心编排的自我进化:应对不确定的LLM反馈

Bowen Wei, Nan Wang, Yuqing Zhou, Jinhao Pan, Ziwei Zhu

发表机构 * George Mason University(乔治梅森大学)

AI总结 提出COSE方法,利用LLM内在置信度作为不确定性信号,通过置信度加权PPO更新和置信度优先重放,在通用推理和数学任务上取得最佳平均性能。

详情
AI中文摘要

自我进化的大语言模型(LLM)通过生成自己的训练任务和解决方案来学习,减少了对人工策划监督的依赖。然而,在许多推理领域,模型还必须验证生成的任务并判断生成的答案以获得训练信号。这带来了训练信号挑战:错误的自我判断会导致错误的梯度更新。现有方法要么依赖外部验证器(限制了通用性),要么将噪声的自我生成反馈视为监督。我们提出COSE(Confidence-Orchestrated Self-Evolution),它利用LLM的内在置信度作为轻量级不确定性信号来调节学习。COSE引入了置信度加权PPO更新和置信度优先重放。在19个保留基准测试和四个Qwen/Llama骨干网络(0.6B-4B)上,COSE始终优于基础模型,并在通用推理和数学方面取得最佳平均性能,同时在代码方面保持竞争力。代码和数据可在https://anonymous.4open.science/r/COSE_-B5C2获取。

英文摘要

Self-evolving large language models (LLMs) learn by generating their own training tasks and solutions, reducing reliance on human-curated supervision. However, in many reasoning domains, the model must also validate generated tasks and judge generated answers to obtain training signals. This creates a training-signal challenge: erroneous self-judgments become erroneous gradient updates. Existing approaches either rely on external verifiers, which limits generality, or treat noisy self-generated feedback as supervision. We propose COSE (Confidence-Orchestrated Self-Evolution), which uses the LLM's intrinsic confidence as a lightweight uncertainty signal to modulate learning. COSE introduces confidence-weighted PPO updates and confidence-prioritized replay. Across 19 held-out benchmarks and four Qwen/Llama backbones (0.6B to 4B), COSE consistently improves over base models and achieves the best average performance in general reasoning and mathematics, while remaining competitive on code. Code and data are available at https://anonymous.4open.science/r/COSE_-B5C2.

2605.28009 2026-05-28 cs.CL cs.AI cs.LG

MemGuard: Preventing Memory Contamination in Long-Term Memory-Augmented Large Language Models

MemGuard:防止长期记忆增强型大语言模型中的记忆污染

Hyeonjeong Ha, Jeonghwan Kim, Cheng Qian, Jiayu Liu, William M. Campbell, Yue Wu, Yuji Zhang, Kathleen McKeown, Dilek Hakkani-Tur, Heng Ji

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Columbia University(哥伦比亚大学) Capital One

AI总结 提出MemGuard,一种类型感知的记忆框架,通过显式分配功能角色、维护类型隔离记忆间的关联并选择性组合必要类型的证据,防止异构记忆污染,提升记忆可靠性最高28.27%并减少检索token数最高5.8倍。

详情
AI中文摘要

记忆增强型大语言模型通过跨交互维护长期记忆,将推理扩展到固定上下文窗口之外。然而,现有的记忆系统常常将稳定的用户事实、情景事件和行为规则折叠到共享空间中,使得功能不同的记忆被检索并用作可互换的证据。我们将这种失败模式识别为异构记忆污染,其中上下文特定的事件被过度概括为声明,或者语义相关但功能不兼容的记忆误导生成。为此,我们引入了MemGuard,一种类型感知的记忆框架,在记忆构建和检索过程中保留功能记忆边界。它在写入时为每个记忆分配显式的功能角色,维护跨类型隔离记忆的关系,并仅从必要的记忆类型中选择性组合证据,从而减少来自无关或功能不兼容证据的污染。在幻觉和长时对话基准测试中,MemGuard将记忆可靠性提高了最多28.27%,同时检索的记忆token数比先前方法减少了最多5.8倍。这些结果表明,可靠的长期推理依赖于对异构记忆的有原则的组织和选择性使用。

英文摘要

Memory-augmented large language models extend reasoning beyond a fixed context window by maintaining long-term memory across interactions. However, existing memory systems often collapse stable user facts, episodic events, and behavioral rules into a shared space, allowing functionally distinct memories to be retrieved and used as interchangeable evidence. We identify this failure mode as heterogeneous memory contamination, where context-specific events become overgeneralized claims, or semantically relevant but functionally incompatible memories mislead generation. To this end, we introduce MemGuard, a type-aware memory framework that preserves functional memory boundaries during memory construction and retrieval. It assigns each memory an explicit functional role at write time, maintains relations across type-isolated memories, and selectively composes evidence only from necessary memory types, reducing contamination from irrelevant or functionally incompatible evidence. Across hallucination and long-horizon conversation benchmarks, MemGuard improves memory reliability by up to 28.27% while retrieving up to 5.8x fewer memory tokens than prior methods. These results suggest that reliable long-term reasoning depends on principled organization and selective use of heterogeneous memory.

2605.28008 2026-05-28 cs.AI cs.LG

Zipping the Thought: When and How Compressed Reasoning Data Works in LLM Post-Training

压缩思维:压缩推理数据在LLM后训练中何时以及如何发挥作用

Kohsei Matsutani, Gouki Minegishi, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo

发表机构 * The University of Tokyo(东京大学)

AI总结 本文通过分类显式、组合和隐式思维链,在合成组合推理任务上实验,发现粗粒度CoT需要更多SFT数据,组合和隐式CoT从数据缩放中获益更多但隐式CoT易导致记忆,后续RLVR会分解压缩步骤,且单向CoT顺序在长序列任务上泛化更强。

详情
AI中文摘要

大型语言模型(LLM)现在能够通过长思维链(CoT)推理解决复杂问题,但性能与token成本之间的权衡仍然是一个核心挑战。为了解决这个问题,监督微调(SFT)通常使用压缩推理数据,其中CoT轨迹被缩短为紧凑形式。然而,这种压缩推理数据对后训练的影响仍然知之甚少。在本文中,我们提出了一个CoT分类法,包括显式CoT(输出所有操作而不聚合)、组合CoT(将多个操作合并为单步)和隐式CoT(省略中间操作)。我们构建了一个合成组合推理任务,允许对难度、压缩粒度和数据大小进行可控变化,并在不同模型家族和大小上进行了全面的实验。值得注意的是,我们发现:(i)粗粒度CoT需要更多SFT数据;(ii)与显式CoT相比,组合CoT和隐式CoT从数据缩放中获益更多,而组合CoT从数据重复中获益,隐式CoT则倾向于导致记忆;(iii)与SFT不同,后续带有可验证奖励的强化学习(RLVR)会分解在SFT期间学到的压缩步骤;(iv)单向CoT顺序在更长序列任务上表现出更强的泛化能力。我们的发现为数据资源约束下的CoT设计提供了启示,并为LLM后训练中SFT和RL的机制提供了重要见解。

英文摘要

Large language models (LLMs) can now solve complex problems through long chain-of-thought (CoT) reasoning, but the trade-off between performance and token cost remains a central challenge. To address this issue, supervised fine-tuning (SFT) often uses compressed reasoning data, where CoT traces are shortened into compact forms. However, the effect of such compressed reasoning data on post-training remains poorly understood. In this paper, we propose a taxonomy of CoT consisting of Explicit CoT, which outputs all operations without aggregation, Composed CoT, which combines multiple operations into a single step, and Implicit CoT, which omits intermediate operations. We construct a synthetic compositional reasoning task that allows controlled variation of difficulty, compression granularity, and data size, and conduct a comprehensive set of experiments across different model families and sizes. Notably, we find that (i) coarser CoT requires more SFT data, (ii) compared with Explicit CoT, Composed CoT and Implicit CoT benefit more from data scaling, while Composed CoT benefits from data repetition and Implicit CoT tends to lead to memorization, (iii) unlike SFT, subsequent reinforcement learning (RL) with verifiable rewards (RLVR) decomposes compressed steps learned during SFT, and (iv) unidirectional CoT ordering shows stronger generalization on longer sequential tasks. Our findings provide implications for CoT design under data resource constraints and offer important insights into the mechanisms of SFT and RL in LLM post-training.

2605.28007 2026-05-28 cs.LG cs.AI

Learning Compositional Latent Structure with Vector Networks

利用向量网络学习组合式潜在结构

Niclas Pokel, Benjamin F. Grewe

发表机构 * Institute of Neuroinformatics, UZH / ETH Zurich(神经信息学研究所,苏黎世大学/苏黎世联邦理工学院) ETH AI Center Zurich, Switzerland(苏黎世联邦理工学院人工智能中心,瑞士)

AI总结 提出向量网络(VN),一种层级循环架构,通过可重用的秩1权重原子库实现组合泛化,在分布外任务中误差降低约一个数量级。

详情
AI中文摘要

深度网络是强大的函数逼近器,但它们通常将许多不同的计算存储在共享权重矩阵中,使得当熟悉的结构以新颖组合出现时,难以选择性地重用或调整其中的部分。我们引入了向量网络(VN),一种层级循环架构,其中每一层将固定的权重矩阵替换为可重用的秩1权重原子库。对于每个输入,VN最小化层级局部能量,以推断一组稀疏的活跃权重原子及其系数,这些系数受自底向上的输入重建和自顶向下的反馈一致性共同约束。这些权重原子系数随后为该样本组成一个输入特定的低秩权重矩阵。收敛后,慢速学习更新仅通过推断系数缩放的局部残差信号更新选中的权重原子。我们在四个组合基准上评估了VN,涵盖一维信号、二维空间解码、N体动力学和组合MNIST。在分布内任务中VN与强基线相当,而在需要以新颖方式重新组合熟悉因子的分布外任务中,其误差通常低约一个数量级。因此,向量网络使组合泛化成为架构和推理过程的结构属性,而非将许多行为拟合到单个共享密集参数基底的脆弱副产品。

英文摘要

Deep networks are powerful function approximators, but they typically store many different computations in shared weight matrices, making it difficult to selectively reuse or adapt parts of them when a familiar structure appears in novel combinations. We introduce the Vector Network (VN), a hierarchical recurrent architecture in which each layer replaces a fixed weight matrix with a library of reusable rank-1 weight atoms. For each input, VN minimizes a layer-local energy to infer a sparse set of active weight atoms and their coefficients, jointly constrained by bottom-up input reconstruction and top-down feedback consistency. These weight atom coefficients then compose an input-specific low-rank weight matrix for that sample. After convergence, slow learning updates only the selected weight atoms through local residual signals scaled by the inferred coefficients. We evaluate VN on four compositional benchmarks spanning 1D signals, 2D spatial decoding, N-body dynamics, and compositional MNIST. VN matches strong baselines in distribution while often achieving out-of-distribution error about an order of magnitude lower when familiar factors must be recombined in novel ways. Vector networks thus make compositional generalization a structural property of the architecture and inference process rather than a brittle byproduct of fitting many behaviors into one shared dense parameter substrate.
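The step in which atom coefficients compose an input-specific low-rank weight matrix can be written as W = sum_k c_k * u_k * v_k^T. The sketch below shows only that composition step in plain Python. The energy-minimizing inference of the sparse coefficients is not reproduced, and the atom values, coefficients, and function names are hypothetical.

```python
def outer(u, v):
    """Outer product u v^T as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]


def compose_weight(atoms, coeffs):
    """Sum of coefficient-scaled rank-1 atoms; atoms are (u, v) pairs."""
    rows, cols = len(atoms[0][0]), len(atoms[0][1])
    weight = [[0.0] * cols for _ in range(rows)]
    for (u, v), c in zip(atoms, coeffs):
        if c == 0.0:
            continue  # inactive atom: a sparse code skips most of the library
        for i, row in enumerate(outer(u, v)):
            for j, value in enumerate(row):
                weight[i][j] += c * value
    return weight


def apply_weight(weight, x):
    """Matrix-vector product W x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


# Hypothetical library of two rank-1 atoms for a 2x2 layer.
atoms = [([1.0, 0.0], [1.0, 2.0]), ([0.0, 1.0], [3.0, 4.0])]
w_one_active = compose_weight(atoms, [2.0, 0.0])
w_both_active = compose_weight(atoms, [1.0, -1.0])
```

Because the rank of the composed matrix is at most the number of active atoms, different inputs can reuse the same library while getting different low-rank weights.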

2605.28006 2026-05-28 cs.CL cs.AI

Integrated and Cross-Architecture Interpretation of LLM Reasoning

LLM推理的集成与跨架构解释

Leonardo Matthew Yauw, Wei-Bin Kou, Yujiu Yang

发表机构 * Tsinghua Shenzhen International Graduate School(清华大学深圳国际研究生院) Harbin Institute of Technology(哈尔滨工业大学)

AI总结 提出集成跨架构推理(IAR)框架,通过带宽校准的MIP与Tukey IQR峰值检测、重叠分析及Jaccard稳定性度量,统一解释LLM推理模式。

详情
AI中文摘要

理解LLM如何推理受到实际不对称性的阻碍:虽然其生成的输出是可观察的,但潜在的推理模式仍然不透明。依赖单一探针,如互信息峰值(MIP)或深度思考比率(DTR),可能会低估真正的推理结构。针对这一不足,我们提出了一个集成跨架构推理(IAR)框架,旨在为LLM推理可解释性提供统一方法。具体来说,我们首先提出使用带宽校准的MIP结合Tukey IQR峰值检测来隔离输出层的关键推理标记。其次,我们对MIP选中的标记和DTR深度标记进行重叠分析,以追踪这些标记的跨层轨迹。这也揭示了关键推理标记是否也是计算密集型的,进一步有助于理解推理模式如何在模型层间演变。最后,我们在多领域问题上应用Jaccard稳定性度量,以验证MIP识别的标记是否具有推理质量保证。在三个模型(Qwen-7B、Qwen-14B和Llama-8B)上跨四个领域(数学、代码、逻辑和常识)的大量实验证明了IAR跨架构的泛化解释能力。

英文摘要

Understanding how LLMs reason is hindered by a practical asymmetry: while their generated outputs are observable, the underlying reasoning patterns remain opaque. Relying on single probes, such as Mutual Information Peak (MIP) or Deep-Thinking Ratio (DTR), risks underestimating the genuine inferential structure. To address this deficiency, we present an Integrated, cross-Architecture Reasoning (IAR) framework, designed to provide a unified approach to LLM reasoning interpretability. Specifically, we first propose to use bandwidth-calibrated MIP coupled with Tukey IQR peak detection to isolate reasoning-crucial tokens at the output layer. Second, we perform an overlap analysis between MIP-picked tokens and DTR-deep tokens to trace the cross-layer trajectories of those tokens. This also reveals whether reasoning-crucial tokens are computation-intensive as well, which further helps explain how reasoning patterns evolve across model layers. Finally, we apply a Jaccard stability metric over multi-domain problems to verify whether the MIP-identified tokens reliably reflect reasoning quality. Extensive experiments on three models (Qwen-7B, Qwen-14B, and Llama-8B) across four domains (mathematics, code, logic, and common sense) demonstrate IAR's generalizable interpretation capabilities across architectures.
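Two of the generic building blocks named above, Tukey IQR peak detection and Jaccard overlap, can be sketched in a few lines of Python. The per-token scores are made up, the bandwidth-calibrated MIP estimate itself is not implemented, and the fence multiplier 1.5 is the conventional Tukey default, not a value confirmed by the abstract.

```python
from statistics import quantiles


def tukey_peaks(scores, k=1.5):
    """Indices whose score exceeds the upper Tukey fence Q3 + k * IQR."""
    q1, _, q3 = quantiles(scores, n=4)
    upper_fence = q3 + k * (q3 - q1)
    return [i for i, s in enumerate(scores) if s > upper_fence]


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two index collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(a & b) / len(a | b)


# Hypothetical per-token scores for one generated answer.
token_scores = [0.10, 0.20, 0.15, 0.12, 0.18, 0.11, 0.90, 0.14]
peaks = tukey_peaks(token_scores)

# Hypothetical token sets selected by two different probes.
overlap = jaccard([1, 2, 3], [2, 3, 4])
```

The same `jaccard` helper can serve both purposes described in the abstract: measuring overlap between two probes and measuring how stable a selected token set is across problems.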

2605.28004 2026-05-28 cs.CL

Beyond Chunk-Local Extraction: Cross-Chunk Graph Augmentation for GraphRAG

超越块级抽取:面向GraphRAG的跨块图增强

Jiaming Zhang, Yibo Zhao, Jing Yu, Jianxiang Yu, Xiang Li

发表机构 * School of Data Science and Engineering, East China Normal University(华东师范大学数据科学与工程学院)

AI总结 提出CrossAug方法,利用GNN引导的跨块图增强,在查询前离线补充GraphRAG索引中缺失的跨块关系,提升多跳和长文档问答性能。

Comments 15 pages, 5 figures, 8 tables

详情
AI中文摘要

GraphRAG通过将语料组织为显式知识图谱来扩展检索增强生成,支持基于图的检索以进行复杂问答。然而,现有框架仅在单个块内抽取实体和关系,导致跨块关系——即证据跨越多个段落的关系——在索引中系统性缺失。由于块组合的组合爆炸,穷举式基于LLM的关系恢复不可行。我们提出CrossAug,一种GNN引导的跨块图增强方法,在查询前离线步骤中为GraphRAG索引补充跨块关系结构。CrossAug通过自监督图损坏获取训练监督,使用拓扑感知的GNN对子图进行缺失性评分,并仅对选中的高评分区域应用基于证据的LLM补全。在三个基于LLM的GraphRAG框架上,跨四个多跳和长文档QA基准的实验表明,CrossAug持续提升性能,证实了跨块图增强对基于检索的问答的益处。我们的代码开源在https://github.com/DonFinliani/CrossAug。

英文摘要

GraphRAG extends retrieval-augmented generation by organizing corpora as explicit knowledge graphs, enabling graph-based retrieval for complex question answering. However, existing frameworks extract entities and relations within individual chunks, leaving cross-chunk relations (those whose evidence spans multiple passages) systematically absent from the index. Exhaustive LLM-based recovery of such relations is impractical due to the combinatorial explosion of chunk combinations. We present CrossAug, a GNN-guided CROSS-Chunk Graph AUGmentation method that enriches GraphRAG indices with cross-chunk relational structure as an offline step before query-time retrieval. CrossAug derives training supervision through self-supervised graph corruption, uses a topology-aware GNN to score subgraphs for missingness, and applies evidence-grounded LLM completion only to selected high-scoring regions. Experiments on three LLM-based GraphRAG frameworks across four multi-hop and long-document QA benchmarks demonstrate that CrossAug consistently improves performance, confirming the benefit of cross-chunk graph augmentation for retrieval-based question answering. Our code is available at https://github.com/DonFinliani/CrossAug.

2605.28003 2026-05-28 cs.CL

ResearchMath-14K: Scaling Research-Level Mathematics via Agents

ResearchMath-14K: 通过智能体扩展研究级数学

Guijin Son, Seungyeop Yi, Minju Gwak, Hyunwoo Ko, Wongi Jang, Youngjae Yu

发表机构 * Seoul National University(首尔国立大学) OneLineAI Yonsei University(延世大学)

AI总结 本文通过多智能体流程从学术来源构建了最大的研究级数学问题数据集ResearchMath-14K(14,056个问题),并生成220K教师轨迹,经智能体过滤后微调Qwen3模型(4B-30B)平均提升9.2个点,表明过滤后的开放问题尝试即使没有完全正确的推理轨迹也能提供有效监督。

Comments Work in progress. Dataset available at: https://huggingface.co/datasets/amphora/ResearchMath-14k

详情
AI中文摘要

数学的前沿由尚未知道解法的问题定义,但语言模型能否在没有人类干预的情况下有意义地处理这些问题仍不清楚。一个主要障碍是缺乏大规模的研究级数学数据集。为此,我们引入了ResearchMath-14k,这是一个通过多智能体流程从学术来源整理的问题集,包含14,056个问题,是迄今为止最大的研究级数学问题集合。我们进一步生成了ResearchMath-Reasoning,即来自两个开放模型的220K条教师轨迹,其中我们观察到重复的回避行为,如未尝试和虚构引用。有趣的是,在八个开放权重模型中,新一代模型每条轨迹产生的引用数量增加了5.6倍,虚假引用数量增加了5.0倍。在对ResearchMath-Reasoning进行智能体过滤后,对Qwen3模型(从4B到30B参数)进行微调,平均比基础模型提高了9.2个点。这表明,即使没有完全正确的推理轨迹,过滤后的开放问题尝试也能提供有用的监督。我们将ResearchMath-14k公开,以供未来研究级数学推理的工作使用。

英文摘要

The frontier of mathematics is defined by problems whose solutions are not yet known, yet it remains unclear whether language models can meaningfully engage with such problems without human intervention. A major obstacle is the lack of large-scale research-level math datasets. To this end, we introduce ResearchMath-14k, a set of 14,056 problems curated from academic sources via a multi-agent pipeline, making it the largest collection of research-level mathematical problems to date. We further generate ResearchMath-Reasoning, 220K teacher trajectories from two open models, where we observe recurring avoidance behaviors such as non-attempts and fabricated references. Interestingly, across eight open-weight models, newer generations produce 5.6x more references and 5.0x more fake references per trace. After agentic filtering of ResearchMath-Reasoning, fine-tuning Qwen3 models from 4B to 30B parameters improves over base models by 9.2 points on average. This shows that filtered open-problem attempts can provide useful supervision even without fully correct reasoning traces. We make ResearchMath-14k publicly available for future work on research-level mathematical reasoning.

2605.28001 2026-05-28 cs.AI cs.CR

An Empirical Audit of k-NAF Budget Accounting for Anchored Decoding

锚定解码中 k-NAF 预算核算的实证审计

J. Vijayavallabh

发表机构 * Indian Institute of Technology Madras(印度理工学院马德拉斯分校)

AI总结 通过固定工作负载和自适应提示搜索,实证审计锚定解码中的 k-NAF 预算核算机制,发现平均累积 KL 支出远低于序列级预算,自适应搜索虽提高代理支出比率但未导致预算耗尽,且高代理比率归因于代理伪影而非轨迹级预算失败。

Comments 19 pages, 4 figures, 9 main pages remaining supplementary and appendix

详情
AI中文摘要

我们使用 (i) 固定的、按类别分层的工作负载(跨六个提示类别约 8,500 次随机执行)和 (ii) 针对高代理支出比率的目标自适应提示搜索过程,对锚定解码中的 k-NAF 预算核算机制进行实证审计。在固定工作负载下,平均累积 KL 支出远低于序列级预算 K ∈ {600, 1000},并且经验 Bernstein 风格的代理对于每个类别都保持在 K 以下;表面重叠诊断(ROUGE-L 和 5-gram Jaccard)相应较小。自适应搜索增加了代理支出比率,但未产生明显的预算耗尽。在 k=3 的保留版权领域工作负载上,几个提示在早期停止评估且实际样本量较小的情况下显示出高于 1 的代理比率;在可比平均支出下,用更大分配重新评估相同提示将代理比率降低到 [0.26, 0.40] 范围,这与代理伪影一致,而非每个轨迹的预算失败。

英文摘要

We empirically audit the k-NAF budget-accounting mechanism in Anchored Decoding using (i) a fixed, class-stratified workload (approximately 8,500 randomized executions across six prompt classes) and (ii) an adaptive prompt-search procedure targeting high proxy spend ratios. On the fixed workload, mean cumulative KL spend remains far below the sequence-level budgets K in {600, 1000}, and an empirical Bernstein-style proxy stays below K for every class; surface-overlap diagnostics (ROUGE-L and 5-gram Jaccard) are correspondingly small. Adaptive search increases the proxy spend ratio but does not produce clear budget exhaustion. On a held-out copyright-domain workload at k = 3, several prompts exhibit proxy ratios above 1 under early-stopped evaluations with small realized sample sizes; re-evaluating the same prompts with larger allocation reduces the proxy ratio to the range [0.26, 0.40] under comparable mean spend, consistent with proxy artifacts rather than per-trajectory budget failures.