arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.28013 2026-05-28 cs.CL

KSAFE-MM: A Multimodal Safety Benchmark via Localized Contextualization for Korean Cultural Risks

Yongwoo Kim, Sojung An, Yunjin Park, Jungwon Yoon, Dujin Lee, HyunBeom Cho, Jaewon Lee, Wonhyuk Lee, Youngchol Kim, JeongYeop Kim, Donghyun Kim

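AI summary: To address the lack of cultural specificity in safety evaluation of multimodal large language models, this work proposes the KSAFE-MM benchmark, which uses linguistic and visual contextualization to build general and Korea-specific multimodal safety test sets, revealing models' vulnerability to culturally grounded attacks and a trade-off between safety and over-refusal.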

Abstract

Multimodal Large Language Models (MLLMs) exacerbate safety risks by introducing vulnerabilities across multiple modalities, such as language and vision. Current MLLM safety evaluation tools, however, suffer from major limitations: 1) English-centric dataset construction, and 2) a focus on generic risks that are not tied to local cultural contexts. This paper introduces KSAFE-MM, a benchmark for Korean multimodal safety evaluation that covers both general safety risks and culture-specific vulnerabilities. KSAFE-MM consists of two parts, KSAFE-MM-G and KSAFE-MM-C. KSAFE-MM-G evaluates globally shared risks in Korean contexts through linguistic contextualization, which transforms generic safety queries into contextually grounded multimodal samples. KSAFE-MM-C targets culture-dependent MLLM safety vulnerabilities using localized visual queries derived from real-world contexts. It pairs these visual queries with jailbreak-style textual queries to cover multimodal safety risks involving cultural visual cues and malicious textual intent. Together, these components provide a general-to-local construction pipeline for evaluating both globally shared safety risks and culture-specific vulnerabilities. We evaluate 12 state-of-the-art MLLMs on KSAFE-MM and reveal that models exhibit greater vulnerability to culturally grounded attacks than to generic ones. Notably, jailbreaking strategies substantially amplify attack success rates, with ProgramExecution yielding up to 74.2% ASR compared to 13.4% for standard queries. Furthermore, we identify a systematic trade-off between safety and over-refusal, where models achieving low ASR tend to exhibit excessive refusal behavior on benign queries. These findings highlight the urgent need for culturally grounded safety evaluation beyond English-centric benchmarks.

2605.28011 2026-05-28 cs.CV

Automated Estimation of Impact Time, Impact Location, and Shuttlecock Speed in Badminton Smashes Using Event Cameras

Yudai Washida, Yuto Kase, Kai Ishibe, Ryoma Yasuda, Sakiko Hashimoto

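AI summary: Proposes a method using two synchronized event cameras to automatically estimate, within the same trial, the impact time, the impact location on the racket face, and the shuttlecock speed of badminton smashes, and validates its agreement with a high-speed-camera reference method through Bland-Altman analysis.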

Comments: 24 pages, 5 figures
Abstract

Quantifying impact phenomena in badminton smashes is important for evaluating both athletic performance and equipment; however, conventional measurement systems involve trade-offs between temporal resolution, data efficiency, and preparation effort. This study proposes a measurement method using two synchronized event cameras to automatically estimate impact time, impact location on the racket face, and post-impact shuttlecock speed in an integrated manner within the same trial. The swing interval was detected from event rate statistics, impact time was estimated from the shuttlecock trajectory inflection in the lateral-view event data, impact location was determined by ellipse fitting to the racket face in the rear-view event image, and shuttlecock speed was calculated in the sagittal plane. To validate the proposed method, Bland-Altman analysis was performed against a high-speed camera-based reference method using 125 smash trials from five players. Impact time and shuttlecock speed were estimated in all 124 analyzable trials, and impact location was estimated in 93.5% (116/124). The biases (95% CI) for impact time, medio-lateral impact location, longitudinal impact location, and shuttlecock speed were 1.84 ms (1.45 to 2.23), 3.45 mm (2.18 to 4.72), -1.92 mm (-2.97 to -0.88), and -1.00 m/s (-2.46 to 0.46), respectively. No proportional bias was observed for any metric. These results suggest that the proposed method can serve as a useful tool for integrated assessment of badminton smash performance and equipment in practical settings.
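
The validation step in this abstract can be illustrated with a minimal Python sketch of a Bland-Altman bias estimate, its 95% confidence interval, and the limits of agreement. The function name `bland_altman` and the sample values are hypothetical, and the factor 1.96 is a normal approximation (a t-quantile would suit small samples better); this is not the authors' code.

```python
import math
import statistics

def bland_altman(method, reference):
    """Return the bias, its approximate 95% CI, and the 95% limits of agreement."""
    diffs = [m - r for m, r in zip(method, reference)]
    n = len(diffs)
    bias = statistics.mean(diffs)                  # mean difference between methods
    sd = statistics.stdev(diffs)                   # sample standard deviation
    se = sd / math.sqrt(n)                         # standard error of the bias
    ci = (bias - 1.96 * se, bias + 1.96 * se)      # normal-approximation CI (assumption)
    loa = (bias - 1.96 * sd, bias + 1.96 * sd)     # limits of agreement
    return bias, ci, loa

# Hypothetical impact times in milliseconds (not data from the paper).
event_camera = [101.8, 99.5, 103.2, 100.9, 98.7]
high_speed = [100.0, 98.0, 101.0, 99.0, 97.0]
bias, ci, loa = bland_altman(event_camera, high_speed)
```

A confidence interval for the bias that excludes zero, as reported for impact time, indicates a systematic offset between the two methods.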

2605.28010 2026-05-28 cs.AI

Confidence-Orchestrated Self-Evolution against Uncertain LLM Feedback

Bowen Wei, Nan Wang, Yuqing Zhou, Jinhao Pan, Ziwei Zhu

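AI summary: Proposes COSE, which uses the LLM's intrinsic confidence as an uncertainty signal and, through confidence-weighted PPO updates and confidence-prioritized replay, achieves the best average performance on general reasoning and mathematics tasks.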

Abstract

Self-evolving large language models (LLMs) learn by generating their own training tasks and solutions, reducing reliance on human-curated supervision. However, in many reasoning domains, the model must also validate generated tasks and judge generated answers to obtain training signals. This creates a training-signal challenge: erroneous self-judgments become erroneous gradient updates. Existing approaches either rely on external verifiers, which limits generality, or treat noisy self-generated feedback as supervision. We propose COSE (Confidence-Orchestrated Self-Evolution), which uses the LLM's intrinsic confidence as a lightweight uncertainty signal to modulate learning. COSE introduces confidence-weighted PPO updates and confidence-prioritized replay. Across 19 held-out benchmarks and four Qwen/Llama backbones (0.6B--4B), COSE consistently improves over base models and achieves the best average performance in general reasoning and mathematics, while remaining competitive on code. Code and data are available at https://anonymous.4open.science/r/COSE_-B5C2.
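
One way to picture a confidence-weighted PPO update is the Python sketch below. The abstract does not say how confidence enters the objective, so the per-sample multiplicative weight used here is an assumption, and the function names and sample values are hypothetical.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
    """Standard PPO clipped surrogate for one sample (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

def confidence_weighted_loss(batch, eps=0.2):
    """Average negative surrogate, each sample scaled by a confidence in [0, 1].

    Assumption: confidence acts as a multiplicative weight, so low-confidence
    self-judgments contribute smaller gradient updates.
    """
    total = 0.0
    for logp_new, logp_old, advantage, confidence in batch:
        total += confidence * clipped_surrogate(logp_new, logp_old, advantage, eps)
    return -total / len(batch)

# Hypothetical samples: (new log-prob, old log-prob, advantage, confidence).
batch = [(-1.0, -1.0, 1.0, 0.9), (-1.0, -1.0, -1.0, 0.1)]
loss = confidence_weighted_loss(batch)
```

Under this weighting, the low-confidence negative judgment in the second sample has only a small effect on the loss.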

2605.28009 2026-05-28 cs.CL cs.AI cs.LG

MemGuard: Preventing Memory Contamination in Long-Term Memory-Augmented Large Language Models

Hyeonjeong Ha, Jeonghwan Kim, Cheng Qian, Jiayu Liu, William M. Campbell, Yue Wu, Yuji Zhang, Kathleen McKeown, Dilek Hakkani-Tur, Heng Ji

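AI summary: Proposes MemGuard, a type-aware memory framework that prevents heterogeneous memory contamination by explicitly assigning functional roles, maintaining relations across type-isolated memories, and selectively composing evidence from only the necessary memory types, improving memory reliability by up to 28.27% and reducing retrieved memory tokens by up to 5.8x.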

Abstract

Memory-augmented large language models extend reasoning beyond a fixed context window by maintaining long-term memory across interactions. However, existing memory systems often collapse stable user facts, episodic events, and behavioral rules into a shared space, allowing functionally distinct memories to be retrieved and used as interchangeable evidence. We identify this failure mode as heterogeneous memory contamination, where context-specific events become overgeneralized claims, or semantically relevant but functionally incompatible memories mislead generation. To this end, we introduce MemGuard, a type-aware memory framework that preserves functional memory boundaries during memory construction and retrieval. It assigns each memory an explicit functional role at write time, maintains relations across type-isolated memories, and selectively composes evidence only from necessary memory types, reducing contamination from irrelevant or functionally incompatible evidence. Across hallucination and long-horizon conversation benchmarks, MemGuard improves memory reliability by up to 28.27% while retrieving up to 5.8x fewer memory tokens than prior methods. These results suggest that reliable long-term reasoning depends on principled organization and selective use of heterogeneous memory.

2605.28008 2026-05-28 cs.AI cs.LG

Zipping the Thought: When and How Compressed Reasoning Data Works in LLM Post-Training

Kohsei Matsutani, Gouki Minegishi, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo

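AI summary: Classifies chain-of-thought into Explicit, Composed, and Implicit CoT and runs experiments on a synthetic compositional reasoning task, finding that coarser CoT requires more SFT data, that Composed and Implicit CoT benefit more from data scaling while Implicit CoT tends toward memorization, that subsequent RLVR decomposes compressed steps, and that unidirectional CoT ordering generalizes better on longer sequential tasks.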

Abstract

Large language models (LLMs) can now solve complex problems through long chain-of-thought (CoT) reasoning, but the trade-off between performance and token cost remains a central challenge. To address this issue, supervised fine-tuning (SFT) often uses compressed reasoning data, where CoT traces are shortened into compact forms. However, the effect of such compressed reasoning data on post-training remains poorly understood. In this paper, we propose a taxonomy of CoT consisting of Explicit CoT, which outputs all operations without aggregation, Composed CoT, which combines multiple operations into a single step, and Implicit CoT, which omits intermediate operations. We construct a synthetic compositional reasoning task that allows controlled variation of difficulty, compression granularity, and data size, and conduct a comprehensive set of experiments across different model families and sizes. Notably, we find that (i) coarser CoT requires more SFT data, (ii) compared with Explicit CoT, Composed CoT and Implicit CoT benefit more from data scaling, while Composed CoT benefits from data repetition and Implicit CoT tends to lead to memorization, (iii) unlike SFT, subsequent reinforcement learning (RL) with verifiable rewards (RLVR) decomposes compressed steps learned during SFT, and (iv) unidirectional CoT ordering shows stronger generalization on longer sequential tasks. Our findings provide implications for CoT design under data resource constraints and offer important insights into the mechanisms of SFT and RL in LLM post-training.

2605.28007 2026-05-28 cs.LG cs.AI

Learning Compositional Latent Structure with Vector Networks

Niclas Pokel, Benjamin F. Grewe

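AI summary: Proposes the Vector Network (VN), a hierarchical recurrent architecture that achieves compositional generalization through a library of reusable rank-1 weight atoms, lowering error on out-of-distribution tasks by about an order of magnitude.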

Abstract

Deep networks are powerful function approximators, but they typically store many different computations in shared weight matrices, making it difficult to selectively reuse or adapt parts of them when a familiar structure appears in novel combinations. We introduce the Vector Network (VN), a hierarchical recurrent architecture in which each layer replaces a fixed weight matrix with a library of reusable rank-1 weight atoms. For each input, VN minimizes a layer-local energy to infer a sparse set of active weight atoms and their coefficients, jointly constrained by bottom-up input reconstruction and top-down feedback consistency. These weight atom coefficients then compose an input-specific low-rank weight matrix for that sample. After convergence, slow learning updates only the selected weight atoms through local residual signals scaled by the inferred coefficients. We evaluate VN on four compositional benchmarks spanning 1D signals, 2D spatial decoding, N-body dynamics, and compositional MNIST. VN matches strong baselines in distribution while often achieving out-of-distribution error about an order of magnitude lower when familiar factors must be recombined in novel ways. Vector networks thus make compositional generalization a structural property of the architecture and inference process rather than a brittle byproduct of fitting many behaviors into one shared dense parameter substrate.
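
The composition step described in this abstract, building an input-specific low-rank matrix from rank-1 atoms, can be sketched in Python as follows. The coefficients are simply given here; the energy-minimizing inference that would produce them is not shown, and the names `compose_weight` and `matvec` and all values are hypothetical.

```python
def compose_weight(atoms, coeffs):
    """Sum c * outer(u, v) over the atoms whose coefficient is non-zero."""
    rows, cols = len(atoms[0][0]), len(atoms[0][1])
    W = [[0.0] * cols for _ in range(rows)]
    for (u, v), c in zip(atoms, coeffs):
        if c == 0.0:              # inactive atom, skipped (sparse selection)
            continue
        for i in range(rows):
            for j in range(cols):
                W[i][j] += c * u[i] * v[j]
    return W

def matvec(W, x):
    """Apply the composed matrix to an input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

# Hypothetical library of three rank-1 atoms (u, v); only two are active.
atoms = [([1.0, 0.0], [1.0, 1.0]), ([0.0, 1.0], [1.0, -1.0]), ([1.0, 1.0], [0.5, 0.5])]
coeffs = [2.0, 0.0, 1.0]
W = compose_weight(atoms, coeffs)
y = matvec(W, [1.0, 3.0])
```

Because the rank of `W` is bounded by the number of active atoms, a different sparse coefficient vector yields a different low-rank matrix from the same library.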

2605.28006 2026-05-28 cs.CL cs.AI

Integrated and Cross-Architecture Interpretation of LLM Reasoning

Leonardo Matthew Yauw, Wei-Bin Kou, Yujiu Yang

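AI summary: Proposes the Integrated, cross-Architecture Reasoning (IAR) framework, which unifies the interpretation of LLM reasoning patterns through bandwidth-calibrated MIP with Tukey IQR peak detection, overlap analysis, and a Jaccard stability metric.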

Abstract

Understanding how LLMs reason is hindered by a practical asymmetry: while their generated outputs are observable, the underlying reasoning patterns remain opaque. Relying on single probes, such as Mutual Information Peak (MIP) or Deep-Thinking Ratio (DTR), risks underestimating the genuine inferential structure. To address this deficiency, we present an Integrated, cross-Architecture Reasoning (IAR) framework, designed to provide a unified approach to LLM reasoning interpretability. Specifically, we first use bandwidth-calibrated MIP coupled with Tukey IQR peak detection to isolate reasoning-crucial tokens at the output layer. Second, we perform an overlap analysis between MIP-picked tokens and DTR-deep tokens to trace the cross-layer trajectories of those tokens. This also reveals whether reasoning-crucial tokens are computation-intensive as well, which helps explain how reasoning patterns evolve across model layers. Finally, we apply a Jaccard stability metric over multi-domain problems to verify whether the MIP-identified tokens reliably reflect reasoning quality. Extensive experiments on three models (Qwen-7B, Qwen-14B, and Llama-8B) across four domains (mathematics, code, logic, and common sense) demonstrate IAR's generalizable interpretation capabilities across architectures.
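
Two of the building blocks named in this abstract, Tukey IQR peak detection and the Jaccard index, are standard and can be sketched in a few lines of Python. The per-token scores are hypothetical stand-ins for mutual-information values, the fence multiplier `k=1.5` is the conventional default rather than a value from the paper, and bandwidth calibration is not covered.

```python
import statistics

def tukey_peaks(values, k=1.5):
    """Indices whose value exceeds Tukey's upper fence, Q3 + k * IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    fence = q3 + k * (q3 - q1)
    return [i for i, v in enumerate(values) if v > fence]

def jaccard(a, b):
    """Jaccard index |A & B| / |A | B| of two collections of token indices."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0            # two empty sets are treated as identical
    return len(a & b) / len(a | b)

# Hypothetical per-token scores for one generated sequence.
scores = [0.1, 0.2, 0.1, 0.3, 0.2, 5.0, 0.1, 0.2]
peaks = tukey_peaks(scores)
overlap = jaccard([1, 2, 3], [2, 3, 4])
```

A Jaccard value near 1 across runs or domains would indicate that the same tokens are selected consistently.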

2605.28004 2026-05-28 cs.CL

Beyond Chunk-Local Extraction: Cross-Chunk Graph Augmentation for GraphRAG

Jiaming Zhang, Yibo Zhao, Jing Yu, Jianxiang Yu, Xiang Li

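AI summary: Proposes CrossAug, which uses GNN-guided cross-chunk graph augmentation to fill in, offline and before query time, the cross-chunk relations missing from GraphRAG indices, improving multi-hop and long-document question answering.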

Comments: 15 pages, 5 figures, 8 tables
Abstract

GraphRAG extends retrieval-augmented generation by organizing corpora as explicit knowledge graphs, enabling graph-based retrieval for complex question answering. However, existing frameworks extract entities and relations within individual chunks, leaving cross-chunk relations -- those whose evidence spans multiple passages -- systematically absent from the index. Exhaustive LLM-based recovery of such relations is impractical due to the combinatorial explosion of chunk combinations. We present CrossAug, a GNN-guided CROSS-Chunk Graph AUGmentation method that enriches GraphRAG indices with cross-chunk relational structure as an offline step before query-time retrieval. CrossAug derives training supervision through self-supervised graph corruption, uses a topology-aware GNN to score subgraphs for missingness, and applies evidence-grounded LLM completion only to selected high-scoring regions. Experiments on three LLM-based GraphRAG frameworks across four multi-hop and long-document QA benchmarks demonstrate that CrossAug consistently improves performance, confirming the benefit of cross-chunk graph augmentation for retrieval-based question answering. Our code is available at https://github.com/DonFinliani/CrossAug.

2605.28003 2026-05-28 cs.CL

ResearchMath-14K: Scaling Research-Level Mathematics via Agents

Guijin Son, Seungyeop Yi, Minju Gwak, Hyunwoo Ko, Wongi Jang, Youngjae Yu

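AI summary: Builds ResearchMath-14K (14,056 problems), the largest dataset of research-level mathematics problems, from academic sources via a multi-agent pipeline, and generates 220K teacher trajectories; after agentic filtering, fine-tuning Qwen3 models (4B to 30B) yields an average gain of 9.2 points, showing that filtered attempts at open problems can provide useful supervision even without fully correct reasoning traces.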

Comments: Work in progress. Dataset available at: https://huggingface.co/datasets/amphora/ResearchMath-14k
Abstract

The frontier of mathematics is defined by problems whose solutions are not yet known, yet it remains unclear whether language models can meaningfully engage with such problems without human intervention. A major obstacle is the lack of large-scale research-level math datasets. To this end, we introduce ResearchMath-14k, a set of $14{,}056$ problems curated from academic sources via a multi-agent pipeline, making it the largest collection of research-level mathematical problems to date. We further generate ResearchMath-Reasoning, $220$K teacher trajectories from two open models, where we observe recurring avoidance behaviors such as non-attempts and fabricated references. Interestingly, across eight open-weight models, newer generations produce $5.6\times$ more references and $5.0\times$ more fake references per trace. After agentic filtering of ResearchMath-Reasoning, fine-tuning Qwen3 models from 4B to 30B parameters improves over base models by $9.2$ points on average. This shows that filtered open-problem attempts can provide useful supervision even without fully correct reasoning traces. We make ResearchMath-14k publicly available for future work on research-level mathematical reasoning.

2605.28001 2026-05-28 cs.AI cs.CR

An Empirical Audit of k-NAF Budget Accounting for Anchored Decoding

J. Vijayavallabh

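AI summary: Empirically audits the k-NAF budget-accounting mechanism in Anchored Decoding using a fixed workload and adaptive prompt search, finding that mean cumulative KL spend stays far below the sequence-level budget, that adaptive search raises the proxy spend ratio without exhausting the budget, and that high proxy ratios are attributable to proxy artifacts rather than per-trajectory budget failures.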

Comments: 19 pages, 4 figures; 9 main pages, the remainder supplementary material and appendix
Abstract

We empirically audit the k-NAF budget-accounting mechanism in Anchored Decoding using (i) a fixed, class-stratified workload (approximately 8,500 randomized executions across six prompt classes) and (ii) an adaptive prompt-search procedure targeting high proxy spend ratios. On the fixed workload, mean cumulative KL spend remains far below the sequence-level budgets K in {600, 1000}, and an empirical Bernstein-style proxy stays below K for every class; surface-overlap diagnostics (ROUGE-L and 5-gram Jaccard) are correspondingly small. Adaptive search increases the proxy spend ratio but does not produce clear budget exhaustion. On a held-out copyright-domain workload at k = 3, several prompts exhibit proxy ratios above 1 under early-stopped evaluations with small realized sample sizes; re-evaluating the same prompts with larger allocation reduces the proxy ratio to the range [0.26, 0.40] under comparable mean spend, consistent with proxy artifacts rather than per-trajectory budget failures.
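
The notion of cumulative KL spend against a sequence-level budget K can be illustrated with the Python sketch below. It assumes that the spend at each decoding step is the KL divergence between the distribution actually decoded from and a reference distribution; the abstract does not spell out the accounting rule, so this is an illustration only, and all names and distributions are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def cumulative_spend(steps):
    """Running total of per-step KL over a sequence of (p, q) pairs."""
    total, trace = 0.0, []
    for p, q in steps:
        total += kl_divergence(p, q)
        trace.append(total)
    return trace

# Hypothetical two-token vocabulary and two decoding steps.
steps = [([0.5, 0.5], [0.5, 0.5]), ([0.5, 0.5], [0.25, 0.75])]
trace = cumulative_spend(steps)
K = 600.0                       # one of the sequence-level budgets in the abstract
spend_ratio = trace[-1] / K     # budget exhaustion would correspond to a ratio of 1
```

The audit's headline finding corresponds to this ratio staying far below 1 on the fixed workload.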

2605.27997 2026-05-28 cs.CL cs.AI cs.LG

Where Does Toxicity Live? Mechanistic Localization and Targeted Suppression in Language Models

Himanshu Beniwal, Mayank Singh

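AI summary: Localizes toxicity to specific layers and neurons by analyzing activation differences between toxic and neutral prompts, then suppresses it with inference-time scaling or minimal rank-one weight edits, without gradient descent, reducing toxicity while preserving language quality.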

Abstract

Large language models frequently generate toxic, hateful, or harmful content, yet existing mitigation methods rely on costly retraining or output-level filtering with no mechanistic insight into where toxicity originates internally. We introduce Meow2X and TRNE, two complementary retraining-free frameworks that localize toxicity to specific layers and neurons by analyzing activation differentials between toxic and neutral prompts, then suppress them via inference-time scaling or minimal rank-one weight edits -- without any gradient descent. Evaluations across five LMs, two benchmarks, and 90 configurations using dual safety evaluators demonstrate consistent toxicity reduction while preserving language modeling quality. Our analysis reveals that toxicity is disproportionately encoded in early MLP layers, varies across architectures, and is systematically underestimated by single-evaluator setups -- underscoring the need for multi-evaluator safety assessment. By bridging mechanistic interpretability with practical detoxification, our framework offers a principled path toward safer, more transparent language models.
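
The localize-then-suppress idea can be sketched in Python as ranking neurons by the difference in mean activation between toxic and neutral prompts and then scaling the selected neurons at inference time. This is a simplified illustration under assumed details (mean differences, a single scaling factor `alpha`), not the Meow2X or TRNE procedure itself; all names and numbers are hypothetical.

```python
def mean_activation(acts):
    """Per-neuron mean over a list of activation vectors."""
    n = len(acts)
    return [sum(col) / n for col in zip(*acts)]

def top_differential_neurons(toxic_acts, neutral_acts, top_k):
    """Indices of the neurons with the largest toxic-minus-neutral mean activation."""
    toxic_mean = mean_activation(toxic_acts)
    neutral_mean = mean_activation(neutral_acts)
    diff = [t - n for t, n in zip(toxic_mean, neutral_mean)]
    return sorted(range(len(diff)), key=lambda i: diff[i], reverse=True)[:top_k]

def scale_neurons(activation, neurons, alpha):
    """Inference-time suppression: multiply the chosen neurons by alpha."""
    chosen = set(neurons)
    return [a * alpha if i in chosen else a for i, a in enumerate(activation)]

# Hypothetical activations of a three-neuron layer on two prompts of each kind.
toxic = [[1.0, 5.0, 0.0], [1.0, 7.0, 0.0]]
neutral = [[1.0, 1.0, 0.0], [1.0, 1.0, 2.0]]
selected = top_differential_neurons(toxic, neutral, top_k=1)
suppressed = scale_neurons([2.0, 4.0, 6.0], selected, alpha=0.5)
```

No gradients are computed at any point, which matches the retraining-free framing of the abstract.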

2605.27993 2026-05-28 cs.CL

Rethinking Visual Neglect: Steering via Context-Preference for MLLM Hallucination Mitigation

Jingwen Wu, Xijun Zhang, Ge Song

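AI summary: To address object hallucination in multimodal large language models, proposes the training-free Context-Preference Activation Steering (CAS) framework, which extracts context preference vectors and injects signed residuals at mid-early MLP layers to control information reliance, mitigating hallucination without adding decoding latency.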

Comments: 15 pages, 5 figures
Abstract

Object hallucination remains a primary obstacle to the reliable deployment of Multimodal Large Language Models (MLLMs). Current inference-time mitigation methods mainly assume that hallucinations stem from visual neglect, and steer models to enhance visual reliance. In contrast, our systematic interventions on multiple MLLMs show that pushing toward more visual reliance may exacerbate hallucinations on some models, while reducing visual reliance may mitigate them. This result suggests that attributing hallucinations solely to visual insufficiency is an incomplete explanation. We argue that the image, as a context, simultaneously competes with the model's parametric knowledge and the textual context. We therefore propose a training-free framework, Context-Preference Activation Steering (CAS). It extracts two semantically distinct Context Preference Vectors (CPVs) via two small sets of designed conflict samples and applies them via single-pass signed residual injection at mid-early MLP layers during inference to control information reliance. Experiments show that CAS substantially mitigates object hallucinations without increasing decoding latency and preserves native text-generation quality.

2605.27992 2026-05-28 cs.LG

Patched-DeltaNet: Token-Level Event-Driven Memory for Linear-Time Anomaly Detection

Tae-Gyun Lee, Junyoung Park, Kyu Won Han

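AI summary: Proposes the Patched-DeltaNet architecture, which combines time-series patching with Gated Delta Networks to achieve linear-time anomaly detection through token-level event-driven memory, reaching an ROC-AUC of 0.957 and a PA-F1 of 0.822 on the SMD benchmark.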

Comments: 7 pages, 2 tables
Abstract

Time series anomaly detection is critical for maintaining the reliability of mission-critical systems. While Transformer-based models like PatchTST have shown remarkable performance, their $\mathcal{O}(L^2)$ computational complexity severely limits deployment in resource-constrained environments. In this paper, we propose Patched-DeltaNet, a novel architecture combining time-series patching with Gated Delta Networks. By integrating these paradigms, we hypothesize and demonstrate the emergence of token-level event-driven memory, whereby the patching mechanism extracts local semantic chunks, while the error-driven DeltaNet updates its recurrent state exclusively when significant physical changes, defined as deltas, occur. This synergy effectively filters out background noise and captures sudden anomalous drifts. Our rigorous experiments on the Server Machine Dataset (SMD) benchmark demonstrate the structural superiority and sample efficiency of Patched-DeltaNet. By strictly outperforming recent architectures under unified evaluation constraints and identical compute budgets, our model yields an ROC-AUC of 0.957 and PA-F1 of 0.822, while drastically reducing computational complexity to the theoretical minimum of $\mathcal{O}(L/P)$.
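
The two ingredients named in this abstract can be sketched in Python: patching, which turns a length-L series into L/P tokens, and the error-driven delta rule, which changes the recurrent state only in proportion to the prediction error. The gating (state decay) of Gated Delta Networks is omitted, keys and values are given directly rather than produced by learned projections, and all names and numbers are hypothetical.

```python
def patchify(series, patch_len):
    """Split a 1D series into non-overlapping patches, dropping any remainder."""
    n = len(series) // patch_len
    return [series[i * patch_len:(i + 1) * patch_len] for i in range(n)]

def delta_update(S, k, v, beta=1.0):
    """One delta-rule step: S <- S + beta * (v - S k) k^T.

    Returns the new state and the prediction error (v - S k).
    """
    pred = [sum(s * ki for s, ki in zip(row, k)) for row in S]
    err = [vi - pi for vi, pi in zip(v, pred)]
    new_S = [[s + beta * e * kj for s, kj in zip(row, k)] for row, e in zip(S, err)]
    return new_S, err

patches = patchify([1, 2, 3, 4, 5, 6, 7], patch_len=3)

# Hypothetical 2x2 state with a unit-norm key: the first write stores the value,
# and repeating the same (key, value) pair gives zero error and leaves the state as is.
S0 = [[0.0, 0.0], [0.0, 0.0]]
S1, err1 = delta_update(S0, [1.0, 0.0], [3.0, 4.0])
S2, err2 = delta_update(S1, [1.0, 0.0], [3.0, 4.0])
```

The second update illustrates the event-driven behaviour: when the input matches what the memory already predicts, the state does not change.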

2605.27990 2026-05-28 cs.LG cs.AI cs.CV

Geometry-Correct Diffusion Posterior Sampling with Denoiser-Pullback Curvature Guidance and Manifold-Aligned Damping

Seunghyeok Shin, Minwoo Kim, Dabin Kim, Hongki Lim

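AI summary: Proposes a geometry-correct diffusion posterior sampling method based on denoiser-pullback curvature guidance and manifold-aligned damping, which replaces scalar guidance with a per-noise-level damped Gauss-Newton correction for stable and efficient posterior sampling.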

Journal ref: International Conference on Machine Learning 2026
Comments: Code: https://github.com/Seunghyeok0715/CLAMP
Abstract

Diffusion posterior sampling conditions diffusion priors on measurements, but data-consistency updates are typically scaled by hand-tuned guidance weights and can destabilize sampling under stiff, operator-dependent curvature. We replace scalar guidance with a per-noise-level damped Gauss--Newton correction computed in diffusion-state coordinates. The correction pulls likelihood gradients back through the denoiser, uses a one-sided curvature model that avoids forward denoiser Jacobians, and applies diffusion-calibrated rank-one damping aligned with the denoiser residual. Each correction is solved with matrix-free GMRES using automatic differentiation, and sampling proceeds with a variance-preserving Langevin transition with a closed-form drift/noise split. On FFHQ and ImageNet across inverse problems, it achieves competitive PSNR/SSIM/LPIPS while running markedly faster than most of the compared baselines; on accelerated MRI reconstruction, it achieves the best PSNR/SSIM among the compared baselines.

2605.27989 2026-05-28 cs.LG

Law of Neural Interaction: Depth-Width Shape, Interaction Efficiency, and Generalization

Wenjie Sun, Jinning Yang, Shuai Zhang, Mengnan Du

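AI summary: Defines neural interaction by extending superposition from parameter space to gradient space, and finds that under a fixed budget good generalization accompanies efficient neural interaction, that adjusting the depth-width ratio ($R_{D/W}$) can place a model in an efficient interaction interval, and that this interval remains stable as the budget scales, offering insights into model shape initialization and generalization mechanisms.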

Comments: 30 pages, 4 figures
Abstract

The guidance of scaling laws has increased the resource demands of modern large language models (LLMs), yet it remains questionable whether these models utilize resources effectively under a fixed budget. Previous research has identified superposition as a key contributor to loss. By leveraging the Neural Feature Ansatz, we extend superposition from parameter space to gradient space and define it as neural interaction. We find that under a fixed budget, good generalization is usually accompanied by efficient neural interactions, and the model can be placed in an efficient interaction interval by adjusting its depth-width ratio ($R_{D/W}$). In addition, as the budget scales up, the efficient interaction interval of the model remains relatively stable. By comparing existing small-scale dense LLMs, we observe that models operating near this interval tend to perform better on the MMLU-Pro benchmark. Our findings reveal that $R_{D/W}$ influences resource utilization efficiency and thereby affects generalization, providing insights into model shape initialization and the understanding of model generalization mechanisms. Code for Neural Interaction Law is available at: https://anonymous.4open.science/r/Neural_Interaction_Law-D788

2605.27988 2026-05-28 cs.CL cs.CY

Auditing Stance Asymmetry in Generative Explanations

Jiarui Han

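AI summary: To address the stance asymmetry that can arise when language models assign responsibility, legitimacy, and context in open-ended explanations, proposes Symmetry Decomposition Evaluation (SDE), which uses paired-situation tests to show that surface differences vary in stability, and points out the instability of automatic scoring.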

Abstract

Bias evaluation for language models has made substantial progress on bounded comparisons, such as overt derogation, stereotype association, or label-sensitive differences under controlled substitutions. Open-ended explanations raise a different problem: they guide interpretation by assigning responsibility, legitimacy, context, and grievance. A model can avoid hostile language while making one side structurally understandable and another personally at fault, overreacting, or less worth taking seriously. We call this stance-bearing asymmetry in generative explanations. We propose Symmetry Decomposition Evaluation (SDE), which tests paired situations with concrete group labels, structural-role rewrites, and explicit support or counter-evidence. In a controlled 32-family prototype suite, this decomposition shows that surface differences are not all alike: some weaken under structural or evidence control, while others remain as stable differences in how the model assigns blame, context, or legitimacy. Targeted case review and judge comparison suggest a broader difficulty for evaluating open-ended framing asymmetries: judge readings shift across operationalizations, and scalar scores can flatten distinctions that readers use to interpret explanatory stance. SDE therefore reframes generative bias evaluation as an audit of explanatory stance -- what stance each side receives, how it changes under decomposition, and where automatic scoring becomes unstable.

2605.27986 2026-05-28 cs.CL q-bio.QM

An Evolutionary Approach for Designing Stable and Highly Expressible Low-Immunogenicity Therapeutic mRNA Sequences

一种设计稳定、高表达且低免疫原性治疗性mRNA序列的进化方法

Dhawa Sang Dong, Mausam Gurung, Suraj Kandel

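AI summary: Proposes BERT-GA, a two-stage framework that combines deep learning with a genetic algorithm to optimize mRNA sequences, balancing translation efficiency, structural stability, and low immunogenicity.

Details
AI Chinese abstract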

信使RNA(mRNA)序列作为治疗药物需要优化设计,以确保高效翻译、结构稳定性和最小免疫原性。本研究提出一个两阶段计算机模拟框架,整合深度学习和进化计算,用于理性mRNA优化,而非现有最先进模型。第一阶段,预训练的CodonTransformer(类似BERT的大语言模型)生成编码目标抗原的生物一致性mRNA序列。第二阶段,遗传算法(GA)通过密码子感知的交叉和同义突变(由人类密码子使用偏好引导)进化这些候选序列。评估的适应度函数结合了翻译相关指标(CAI、tAI、密码子对偏好)、mRNA结构稳定性(通过RNAfold计算的局部和全局MFE、GC含量)以及降低的免疫原性(CpG/UpA基序频率)。经过连续世代(第38、40和42代),GA改进了CAI和tAI(CAI值从0.73到0.74,tAI值从0.63到0.64),提升超过6%,密码子对偏好高且一致(0.97),并改善了5'端的核糖体可及性,未配对30分数达到0.87;全局最小自由能(MFE)收敛到平衡范围-346至-356 kcal/mol,实现约84%的碱基配对结构稳定性,并减少了免疫刺激基序——最终世代平均免疫惩罚降至27.3。线性设计产生超稳定转录本(MFE < -2000 kcal/mol),由于极端刚性存在翻译效率低下的风险;BiLSTM-CRF仅关注高CAI(0.96至0.98)而无结构约束;我们的框架实现了翻译-稳定性的最优平衡,突出了所提出的BERT-GA框架作为一种有效的、数据驱动的计算机模拟mRNA序列设计和优化方法。

English abstract

Therapeutic messenger RNA (mRNA) sequences require optimized design to ensure efficient translation, structural stability, and minimal immunogenicity. This study presents a two-stage in-silico framework that integrates deep learning and evolutionary computation for rational mRNA optimization, as an alternative to existing state-of-the-art models. In the first stage, a pretrained CodonTransformer (a BERT-like large language model) generates biologically coherent mRNA sequences encoding the target antigen. In the second stage, a genetic algorithm (GA) evolves these candidate sequences through codon-aware crossover and synonymous mutation guided by human codon usage preferences. The fitness function combines translation-related metrics (CAI, tAI, codon-pair bias), mRNA structural stability (local and global MFE via RNAfold, GC content), and reduced immunogenicity (CpG/UpA motif frequency). Over successive generations (38th, 40th, and 42nd), the GA improved CAI and tAI by over 6% (reaching CAI values of 0.73 to 0.74 and tAI values of 0.63 to 0.64), kept codon-pair bias high and consistent (0.97), and improved ribosomal accessibility at the 5' end, with an unpaired_30 fraction reaching 0.87. The global minimum free energy (MFE) converged to a balanced range of -346 to -356 kcal/mol, achieving approximately 84% base-paired structural stability, and immune-stimulatory motifs were reduced, lowering the average immune penalty to 27.3 in the final generation. By comparison, Linear Design produces hyper-stable transcripts (MFE < -2000 kcal/mol) that risk translation inefficiency due to extreme rigidity, and BiLSTM-CRF focuses solely on high CAI (0.96 to 0.98) without structural constraints. Our framework achieves an optimal translation-stability equilibrium, highlighting the proposed BERT-GA framework as an effective, data-driven approach for the in-silico design and optimization of mRNA sequences.
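Two of the fitness terms named in this abstract, GC content and CpG/UpA motif frequency, are simple enough to sketch. The snippet below is only an illustration under stated assumptions: the function names, the equal weights, and the additive form of `immune_penalty` are hypothetical and are not taken from the paper, whose actual fitness function is not given in the abstract.

```python
def gc_content(seq: str) -> float:
    """Fraction of G or C nucleotides in an RNA sequence."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)


def motif_count(seq: str, motif: str) -> int:
    """Count (possibly overlapping) occurrences of a short motif."""
    seq = seq.upper()
    k = len(motif)
    return sum(seq[i:i + k] == motif for i in range(len(seq) - k + 1))


def immune_penalty(seq: str, w_cpg: float = 1.0, w_upa: float = 1.0) -> float:
    """Hypothetical penalty: weighted sum of CpG ("CG") and UpA ("UA") counts.

    The weights and the additive form are assumptions for illustration only.
    """
    return w_cpg * motif_count(seq, "CG") + w_upa * motif_count(seq, "UA")


example = "AUGGCCUACGUAA"
print(gc_content(example), immune_penalty(example))
```

A GA of the kind described would fold terms like these into one score and prefer synonymous codon changes that lower the penalty without changing the encoded protein.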

2605.27984 2026-05-28 cs.CL cs.AI

KVoiceBench, KOpenAudioBench, and KMMAU: Agent-Driven Korean Speech Benchmarks for Evaluating SpeechLMs

KVoiceBench, KOpenAudioBench 和 KMMAU:用于评估语音语言模型的智能体驱动的韩语语音基准

Haechan Kim, Seungjun Chung, Inkyu Park, Jihoo Lee, Jonghyun Lee

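AI summary: To address the English-centric evaluation of speech language models, the paper proposes two agent-driven benchmark-construction frameworks, builds and releases three Korean speech benchmarks (KVoiceBench, KOpenAudioBench, KMMAU), and evaluates eight recent models, revealing English-Korean performance gaps and complementary weaknesses across tasks.

Details
Comments
16 pages, 4 figures
AI Chinese abstract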

语音语言模型通过将大型语言模型扩展到语音模态取得了显著进展。然而,语音语言模型的评估仍然严重以英语为中心,限制了多语言语音能力的可靠评估。通过ASR、翻译、归一化和TTS直接迁移基准会破坏语言特定的指令、答案约束和口语形式;对于音频理解,迁移源语言音频也无法保留目标语言的说话人属性、口音和副语言特性。为解决这些限制,我们提出了两种智能体驱动的基准构建框架:一种将源语言SpokenQA基准迁移为目标语言SpokenQA基准,另一种利用转录和说话人元数据将目标语言ASR语料库转换为音频理解基准。使用这些框架,我们构建并公开发布了三个韩语语音基准:用于韩语SpokenQA的KVoiceBench和KOpenAudioBench,以及用于韩语音频理解的KMMAU,共包含12,345个样本。我们评估了八个最近的语音语言模型,发现英语-韩语性能差距在不同模型和任务族中差异很大,并且SpokenQA和音频理解的排名出现分歧,揭示了仅靠英语评估无法发现的互补性弱点。

English abstract

Speech language models (SpeechLMs) have achieved substantial progress by extending large language models (LLMs) to the speech modality. However, SpeechLM evaluation remains heavily centered on English, limiting reliable assessment of multilingual speech capabilities. Straightforward benchmark transfer through ASR, translation, normalization, and TTS can corrupt language-specific instructions, answer constraints, and spoken forms; for audio understanding, transferring source-language audio also fails to preserve target-language speaker attributes, accents, and paralinguistic properties. To address these limitations, we propose two human-agent benchmark-construction frameworks: one transfers source-language SpokenQA benchmarks into target-language SpokenQA benchmarks, and the other converts target-language ASR corpora into audio understanding benchmarks using transcriptions and speaker metadata. Using these frameworks, we construct and publicly release three Korean speech benchmarks: KVoiceBench and KOpenAudioBench for Korean SpokenQA, and KMMAU for Korean audio understanding, comprising 12,345 samples in total. We evaluate eight recent SpeechLMs and find that English-Korean performance gaps vary substantially across models and task families, and that SpokenQA and audio understanding rankings diverge, revealing complementary weaknesses invisible to English-only evaluation.

2605.27981 2026-05-28 cs.AI

STAB: Specification-driven Testing for Algorithmic Bottlenecks

STAB:面向算法瓶颈的规约驱动测试

Soohan Lim, Joonghyuk Hahn, Hyundong Jin, Yo-Sub Han

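AI summary: Proposes the STAB pipeline, which uses constraint saturation and adversarial scenario injection to generate test cases that expose algorithmic bottlenecks from natural-language problem specifications, substantially raising the rate at which generated test cases expose bottlenecks.

Details
Comments
16 pages, 5 figures, 8 tables
AI Chinese abstract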

评估算法代码的效率需要能够暴露运行时瓶颈的测试用例。先前的方法通过增加输入规模或生成使给定实现运行缓慢的代码特定输入来生成效率测试用例。因此,它们没有处理驱动算法最坏情况的结构性输入条件。我们引入STAB,一个规约驱动的流水线,仅从自然语言问题规约生成暴露算法瓶颈的测试用例。STAB将任务分为约束边界最大化和对抗结构注入两部分。(i) 约束饱和器提取约束,并通过基于规则的饱和及对相关变量的CP-SAT优化来解析大的可接受规模赋值。(ii) 对抗场景注入器使用关键词匹配和K近邻(KNN)从策划的场景目录中检索实现级别的对抗构造原则。STAB将问题规约、解析的边界和检索到的构造原则编码为结构化生成规约,LLM据此合成Python测试用例生成器。在CodeContests上,STAB将生成测试用例中暴露算法瓶颈的比例从开源LLM平均50.43%提升至73.45%,从闭源LLM平均57.45%提升至71.85%,在Python、Java和C++上均有一致提升。我们的代码可在https://github.com/suhanmen/STAB获取。

English abstract

Evaluating the efficiency of algorithmic code requires test cases that expose runtime bottlenecks. Previous methods generate efficiency test cases either by increasing input size or by generating code-specific inputs that make the given implementation run slowly. Consequently, they do not address the structural input conditions that drive the algorithmic worst case. We introduce STAB, a specification-driven pipeline that generates test cases that expose algorithmic bottlenecks from a natural-language problem specification alone. STAB separates the task into constraint-bound maximization and adversarial structure injection. (i) The constraint saturator extracts constraints and resolves large admissible size assignments using rule-based saturation and CP-SAT optimization over related variables. (ii) The adversarial scenario injector retrieves implementation-level adversarial construction principles from a curated scenario catalog using keyword matching and K-nearest neighbors (KNN). STAB encodes the problem specification, resolved boundary, and retrieved construction principles into a structured generation specification, from which the LLM synthesizes a Python test case generator. On CodeContests, STAB raises the rate of generated test cases that expose algorithmic bottlenecks from 50.43% to 73.45% on average across open-source LLMs and from 57.45% to 71.85% on average across closed-source LLMs, with consistent gains across Python, Java, and C++. Our code is available at https://github.com/suhanmen/STAB.
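The two-part split described in this abstract (saturate the size constraints, then inject an adversarial structure) can be sketched as follows. This is a simplified illustration, not STAB's implementation: `saturate` and `adversarial_sorted_case` are hypothetical names, the saturation step only takes each variable's upper bound (the CP-SAT optimization over related variables is not reproduced), and the sorted-input scenario is a textbook worst case for quicksort with a first-element pivot rather than an entry from the paper's scenario catalog.

```python
def saturate(bounds):
    """Pick the largest admissible value for each variable.

    `bounds` maps a variable name to an inclusive (low, high) range.
    Only rule-based saturation is shown; coupled constraints are ignored.
    """
    return {name: high for name, (low, high) in bounds.items()}


def adversarial_sorted_case(n, max_value):
    """Build a non-decreasing array of length n with values in [1, max_value].

    Already-sorted input drives quicksort with a first-element pivot
    to quadratic time, so size alone is not what makes this case slow.
    """
    return [min(i, max_value) for i in range(1, n + 1)]


# Hypothetical constraints: 1 <= n <= 2 * 10^5, 1 <= a_i <= 10^9.
spec = {"n": (1, 200_000), "a_i": (1, 10**9)}
resolved = saturate(spec)
case = adversarial_sorted_case(resolved["n"], resolved["a_i"])
print(len(case), case[:3], case[-1])
```

Separating the two steps keeps the size decision independent of the structure decision, which matches the abstract's point that a large input alone does not trigger the algorithmic worst case.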

2605.27980 2026-05-28 cs.CL cs.AI

Periodic RoPE for Infinite Context LLMs

周期性RoPE:面向无限上下文的大型语言模型

Simin Huo

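AI summary: Proposes Periodic RoPE (P-RoPE), a positional encoding mechanism that combines sliding window attention with global attention without positional encoding to avoid position exhaustion, theoretically supporting an infinite context window.

Details
Comments
5 pages
AI Chinese abstract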

处理超长上下文的能力对于大型语言模型(LLMs)执行长期任务至关重要。尽管最近的努力已将上下文窗口扩展到1M及以上,但当序列长度超过位置编码(如RoPE)的预训练范围时,模型性能会下降,即位置耗尽。必须克服这一基本限制才能实现真正的无限上下文。为此,我们提出了周期性RoPE(P-RoPE),一种旨在避免这种耗尽的位置编码机制。它与滑动窗口注意力(SWA)协同工作,以捕获每个窗口内的局部依赖和相对位置。然后,这一局部层由无位置编码(NoPE)的全局注意力层补充,使得整个序列上的无界交互成为可能,而不受位置限制。通过堆叠这两类层,模型避免了位置外推以泛化更长的序列,并理论上支持无限的上下文窗口。实验结果表明,我们的模型MiniWin在长上下文效率和稳定性上优于采用标准GPT架构的MiniMInd。我们的工作为LLMs实现真正的无限上下文理解提供了一条可能的路径。代码可在 https://github.com/Cominder/miniwin 获取。

English abstract

The ability to process ultra-long contexts is crucial for large language models (LLMs) to perform long-horizon tasks. While recent efforts have extended context windows to 1M and beyond, model performance degrades when sequence length exceeds the pre-trained range of positional encodings (e.g., RoPE), i.e., position exhaustion. This fundamental limitation must be overcome to achieve a truly infinite context. To address it, we propose Periodic RoPE (P-RoPE), a positional encoding mechanism designed to circumvent this exhaustion. It operates in conjunction with sliding window attention (SWA) to capture local dependencies and relative positions within each window. This local layer is then complemented by a global attention layer with No Positional Encoding (NoPE), enabling unbounded interaction across the entire sequence without positional constraints. By stacking these two types of layers, the model avoids the need for positional extrapolation to generalize to longer sequences and theoretically supports an infinite context window. Empirical results show that our model, MiniWin, outperforms MiniMInd with standard GPT architectures in long-context efficiency and stability. Our work provides a possible pathway toward LLMs with genuine infinite-context understanding. The code is available at https://github.com/Cominder/miniwin.
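The abstract does not give the exact definition of P-RoPE, so the following is only one plausible reading, stated as an assumption: position indices are wrapped modulo a period before the standard RoPE rotation is applied, so the angles never leave the range seen in pre-training. `periodic_position` and the choice of period are hypothetical; `rope_angles` and `rotate_pair` follow the standard RoPE formulation.

```python
import math


def rope_angles(position, head_dim, base=10000.0):
    """Standard RoPE angles for one position: position * base^(-2i/d)."""
    return [position * base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def periodic_position(position, period):
    """Hypothetical wrap: reuse position indices every `period` tokens."""
    return position % period


def rotate_pair(x, y, angle):
    """Rotate one (x, y) feature pair by `angle`, as RoPE does per frequency."""
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))


period = 4096  # assumed value, for illustration only
for absolute in (5, 5 + period, 10**9):
    wrapped = periodic_position(absolute, period)
    print(absolute, wrapped, rope_angles(wrapped, head_dim=8)[0])
```

Under this reading the wrapped index is always below the period, however long the sequence grows, which is the property the abstract calls avoiding position exhaustion. How the actual method handles tokens on either side of a wrap boundary is not stated in the abstract and is not modeled here.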

2605.27978 2026-05-28 cs.CV

ABot-OCR Technical Report

ABot-OCR 技术报告

Kaitao Jiang, Ruiyan Gong, Xiaolong Cheng, Kangning Niu, Tianlun Li, Mu Xu

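AI summary: Proposes ABot-OCR, an end-to-end vision-language model that transcribes page images directly into clean Markdown in a single forward pass, and uses a reinforcement learning method, Decoupled Heterogeneous Document Optimization, to improve textual accuracy and markup well-formedness, reaching state-of-the-art results on the OmniDocBench benchmark.

Details
Comments
21 pages, 11 figures, technical report
AI Chinese abstract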

我们介绍了ABot-OCR,一个端到端的视觉语言模型,它通过单次前向传播将页面图像直接转录为干净的Markdown。通过这样做,我们的方法完全消除了脆弱的模块化编排。为了最大化解析保真度,我们开发了一个专用数据引擎,以提供大规模、结构一致的监督。此外,我们提出了解耦异构文档优化,一种结构约束的强化学习方法,它在监督微调之外进一步提高了文本准确性并严格强制执行标记格式的正确性。大量评估证明了我们框架的优越性能。在OmniDocBench v1.5和v1.6基准测试中,ABot-OCR在所有端到端系统中达到了最先进的分数92.81和93.30,显著缩小了与强流水线基线之间的性能差距。最后,跨十种不同语言的全面多语言文本识别进一步证实了ABot-OCR的鲁棒泛化能力。

English abstract

We introduce ABot-OCR, an end-to-end vision-language model that transcribes a page image directly into clean Markdown in a single forward pass. By doing so, our approach completely eliminates the need for brittle modular orchestration. To maximize parsing fidelity, we develop a dedicated data engine to provide large-scale, structurally consistent supervision. Furthermore, we propose Decoupled Heterogeneous Document Optimization, a structure-constrained reinforcement learning method that sharpens textual accuracy and strictly enforces markup well-formedness beyond supervised fine-tuning alone. Extensive evaluations demonstrate the superior performance of our framework. On the OmniDocBench v1.5 and v1.6 benchmarks, ABot-OCR achieves state-of-the-art scores of 92.81 and 93.30 among all end-to-end systems, substantially narrowing the performance gap relative to strong pipeline baselines. Finally, comprehensive multilingual text recognition across ten diverse languages further confirms the robust generalizability of ABot-OCR.

2605.27976 2026-05-28 cs.SD

VoiceGiraffe: A Benchmark for Extreme Long-Context Audio-Language Understanding

VoiceGiraffe: 极端长上下文音频-语言理解的基准

Jashin Ye, Dongxiao Wang, Yixuan Ye, Sashuai Zhou, Weihuang Lin, Mingyang Han, Kunpeng Wang, Zeyu Yuan, Boyu Li, Haoxiang Shi, Jingchen Shu, Jun Song, Bo Zheng

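AI summary: Proposes the VoiceGiraffe benchmark, which uses 1,500 triplets and a dual-level taxonomy to evaluate the single-hop perception and multi-hop reasoning of large audio language models in long-context scenarios, revealing a bottleneck in long-range memory persistence.

Details
Comments
Benchmark Project: https://github.com/LivingFutureLab/VoiceGiraffe
AI Chinese abstract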

尽管大音频语言模型(LALMs)在秒级或分钟级的音频处理中取得了显著进展,理解小时级音频仍然是一个根本性的瓶颈。现有的基准主要依赖短片段或人工拼接的片段,未能真实评估LALM在播客和长篇演讲等真实场景中的长距离信息理解能力。为填补这一空白,我们引入了VoiceGiraffe,这是一个新颖的基准,旨在严格评估LALMs在长上下文设置下跨多种真实场景、模态和语言的表现。它包含1500个精心策划的三元组,结构化为单跳感知和多跳推理的双层分类法。我们评估了一系列开源和专有LALMs与人类表现的对比。结果强调了三个基本发现。首先,VoiceGiraffe仍然极具挑战性,远未饱和。其次,我们表明没有单一的推理范式普遍占优。端到端推理有利于具有原生长上下文音频理解的模型,级联字幕聚合稳定了被小时级音频淹没的小模型,而借助外部LLM的推理增强级联有助于较弱的模型,但可能成为较强专有系统的瓶颈。第三,我们揭示了长距离记忆持久性是一个关键瓶颈。LALMs在回答需要连接显著因果线索的问题时表现更好,而在需要跨长音频持续跟踪稀疏事件的问题上表现较差,而人类则表现出相反的模式。这些发现使VoiceGiraffe成为长格式音频理解的一个具有挑战性和诊断性的测试平台,突显了需要具有持久记忆和稳健长距离聚合能力的LALMs。

English abstract

While large audio language models (LALMs) have achieved remarkable progress in audio processing at the second- or minute-level scale, understanding hour-level audio remains a fundamental bottleneck. Existing benchmarks predominantly rely on short clips or artificially concatenated segments, failing to faithfully assess LALM capacity for long-range information comprehension in real-world scenarios such as podcasts and lengthy speeches. To address this gap, we introduce VoiceGiraffe, a novel benchmark designed to rigorously evaluate LALMs across diverse real-world scenarios, modalities, and languages under long-context settings. It comprises 1500 curated triplets structured into a dual-level taxonomy of single-hop perception and multi-hop reasoning. We evaluate a broad suite of open-source and proprietary LALMs against human performance. Results underscore three fundamental findings. First, VoiceGiraffe remains highly challenging and far from saturation. Second, we show that no single inference paradigm universally dominates. End-to-end (E2E) inference benefits models with native long-context audio understanding, cascaded caption aggregation stabilizes small models overwhelmed by hour-scale audio, and reasoning-enhanced cascading with an external LLM helps weaker models but can bottleneck stronger proprietary systems. Third, we reveal long-range memory persistence as a key bottleneck. LALMs are better at answering questions that require connecting salient causal cues than those requiring sustained tracking of sparse events across long audio, whereas humans show the opposite pattern. These findings position VoiceGiraffe as a challenging and diagnostic testbed for long-form audio understanding, highlighting the need for LALMs with persistent memory and robust long-range aggregation.

2605.27972 2026-05-28 cs.RO

Simultaneous Contact Selection and Planning for Contact-Rich Manipulation with Cascaded Optimization

基于级联优化的接触丰富操作中同时接触选择与规划

Zhe Zhang, Xingrong Diao, Haoxiang Liang, Han Yang, Bi-Ke Zhu, Dandan Zhang, Jiankun Wang

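AI summary: Proposes SCSP, a cascaded optimization framework that achieves active contact location selection and trajectory planning in contact-rich manipulation through contact selection optimization and contact planning optimization.

Details
Comments
20 pages, 18 pages
AI Chinese abstract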

我们提出了一种基于优化的鲁棒接触丰富操作框架。最近的接触隐式方法能够实现跨接触模式的在线混合规划,允许针对给定的目标状态以及机器人和物体的接触位置序列进行闭环操作。然而,大多数现有方法缺乏自主推理和生成多样化接触位置序列及操作轨迹的能力,即主动接触位置选择,这限制了它们对相对简单任务的适用性。由于接触动力学中的互补性和稀疏梯度,主动接触位置选择具有挑战性,使得设计统一的接触选择与规划框架变得困难。为了解决这些挑战,我们引入了同时接触选择与规划(SCSP),这是一个级联优化框架,包括接触选择优化(CSO)和接触规划优化(CPO)。CSO利用代理接触模型和离散-连续优化来有效解决接触选择中的非光滑性和耦合问题,实现最优接触位置的在线全局搜索。CPO通过评估CSO产生的参考接触位置,并实时生成冗余机械臂对应的操作轨迹,执行先验引导的接触规划。大量的仿真和真实世界实验表明,SCSP在不准确的动力学和感知噪声下能够产生多样化的操作行为和鲁棒控制。我们进一步在具有挑战性的操作任务上验证了该框架的泛化能力。项目网站: https://sites.google.com/view/scsp-robot

English abstract

We propose an optimization-based framework for robust contact-rich manipulation. Recent contact-implicit methods enable online hybrid planning across contact modes, allowing closed-loop manipulation for a given target state and contact location sequence of the robot and object. However, most existing approaches lack the ability to autonomously reason about and generate diverse contact location sequences and manipulation trajectories, i.e., active contact location selection, which limits their applicability to relatively simple tasks. Active contact location selection is challenging due to complementarity in contact dynamics and sparse gradients, making the design of a unified framework for contact selection and planning difficult. To address these challenges, we introduce Simultaneous Contact Selection and Planning (SCSP), a cascaded optimization framework comprising Contact Selection Optimization (CSO) and Contact Planning Optimization (CPO). CSO leverages a surrogate contact model and discrete-continuous optimization to efficiently resolve the nonsmoothness and coupling in contact selection, enabling online global search for optimal contact locations. CPO performs prior-guided contact planning by evaluating the reference contact locations produced by CSO and generating corresponding manipulation trajectories in real time for redundant manipulators. Extensive simulations and real-world experiments demonstrate that SCSP produces diverse manipulation behaviors and robust control under inaccurate dynamics and perceptual noise. We further validate the generalization of the framework on challenging manipulation tasks. Project website: https://sites.google.com/view/scsp-robot

2605.27971 2026-05-28 cs.CL cs.AI

Semantic Flow Regularization: Teaching LLMs to Generate Diverse Yet Coherent Responses

语义流正则化:教会LLMs生成多样且连贯的回复

Kerui Peng, Feifei Li, Xingyu Fan, Wenhui Que

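AI summary: To address Cross-Style Collapse, in which output diversity is severely limited when large language models are fine-tuned, the paper proposes Semantic Flow Regularization (SFR), which supervises the backbone with continuous sentence embeddings via conditional flow matching, improving diversity and style fidelity at zero deployment cost.

Details
AI Chinese abstract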

当大语言模型被微调以生成个性或语气条件化的回复时,其输出多样性受到严重限制——我们将这种失败称为跨风格坍缩。我们将这种坍缩追溯到交叉熵目标,该目标在共享表示下倾向于抑制多样化的延续。我们提出语义流正则化(SFR),一种轻量级的辅助目标,通过条件流匹配使用未来片段的连续句子编码器嵌入来监督骨干网络。随机流源通过构造保持多模态;流匹配头在推理时被丢弃,增加零部署成本。在一个大规模工业对话数据集(Qwen3-32B,9种个性)上,SFR在输出多样性、风格保真度和回复质量上优于SFT。我们进一步在公共LiveCodeBench-v5(Qwen2.5-Coder-7B-Instruct)上验证,其中SFR持续改进pass@k,证实了其超越风格化对话的通用性。在MBPP上的受控比较显示,多令牌预测是SFR的一个退化特例。

English abstract

When large language models are fine-tuned to generate persona- or tone-conditioned responses, their output diversity is severely limited--a failure we term Cross-Style Collapse. We trace this collapse to the cross-entropy objective, which under shared representations tends to suppress diverse continuations. We propose Semantic Flow Regularization (SFR), a lightweight auxiliary objective that supervises the backbone with continuous sentence-encoder embeddings of future segments via conditional flow matching. The stochastic flow source preserves multi-modality by construction; the flow-matching head is discarded at inference, adding zero deployment cost. On a large-scale industrial dialogue dataset (Qwen3-32B, 9 personas), SFR improves output diversity, style fidelity, and response quality over SFT. We further validate on the public LiveCodeBench-v5 (Qwen2.5-Coder-7B-Instruct), where SFR consistently improves pass@k, confirming generality beyond stylized dialogue. A controlled comparison on MBPP reveals Multi-Token Prediction to be a degenerate special case of SFR.
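The auxiliary objective described in this abstract can be written out in the commonly used linear-path form of conditional flow matching. This is a sketch under assumptions, not the paper's stated formulation: the abstract does not say which probability path SFR uses, and the symbols $h$ (backbone hidden state used as the condition), $x_1$ (sentence-encoder embedding of a future segment), $v_\theta$ (the flow-matching head), and $\lambda$ (loss weight) are introduced here for illustration.

```latex
\begin{aligned}
x_t &= (1 - t)\,x_0 + t\,x_1, \qquad x_0 \sim \mathcal{N}(0, I),\; t \sim \mathcal{U}[0, 1],\\
\mathcal{L}_{\mathrm{FM}} &= \mathbb{E}_{t,\,x_0,\,x_1}\,\bigl\lVert v_\theta(x_t, t \mid h) - (x_1 - x_0) \bigr\rVert^2,\\
\mathcal{L} &= \mathcal{L}_{\mathrm{CE}} + \lambda\,\mathcal{L}_{\mathrm{FM}}.
\end{aligned}
```

In this form the random source $x_0$ is what lets several distinct targets $x_1$ share one condition $h$ without being averaged together, and because $v_\theta$ only appears in the training loss it can be dropped at inference, consistent with the zero deployment cost stated in the abstract.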

2605.27970 2026-05-28 cs.AI

Geometry of Human Perceptual Domains Emerges Transiently in LLM Representations

人类感知域的几何结构在LLM表征中短暂出现

Simardeep Singh, Paras Chopra

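AI summary: Investigates whether geometric structure resembling human perceptual organization appears in the internal representations of large language models, finding that the geometry of multiple perceptual domains emerges transiently in intermediate layers and aligns with human baselines.

Details
Comments
19 Pages, 28 Figures
AI Chinese abstract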

虽然大型语言模型(LLM)仅基于文本数据进行训练,但先前的工作表明,它们的内部表征在嵌入空间中可能展现出丰富的几何结构。基于这一研究方向,我们调查了这种结构是否与不同领域(例如颜色、音高、情感和味觉)的人类感知组织相似。具体来说,我们研究了多个开源Transformer架构的残差流中,与感知模态对应的内在几何结构逐层涌现的情况。我们的结果揭示了三个关键发现。首先,我们观察到多个感知域的逐层几何结构涌现,尽管训练过程中没有任何直接的感知监督。其次,这些感知域表现出不同的涌现轮廓,几何结构及其与人类基准的一致性在深度上遵循领域和模型特定的轨迹。第三,这种涌现遵循一致的表征轨迹:几何结构在早期层较弱或分散,在中间层逐渐组织化,在后期层减弱,表明感知几何结构作为模型内部转换管道的一部分短暂出现。这为理解类人感知几何结构在LLM中如何以及何处出现提供了新见解,为内部表征的机制分析提供了原则性途径。

English abstract

While large language models (LLMs) are trained purely on textual data, prior work has shown that their internal representations can exhibit rich geometric structure in embedding space. Building on this line of work, we investigate whether such structure is similar to human perceptual organisation across different domains (e.g., color, pitch, emotion, and taste). Specifically, we study the layer-wise emergence of intrinsic geometrical structure corresponding to perceptual modalities within the residual streams of multiple open-weight transformer architectures. Our results reveal three key findings. First, we observe the emergence of layer-wise geometric structure across multiple perceptual domains, despite the absence of any direct perceptual supervision during training. Second, these perceptual domains exhibit distinct emergence profiles, with both geometric structure and its alignment with human baselines following domain- and model-specific trajectories across depth. Third, this emergence follows a consistent representational trajectory: geometry is weak or diffuse in early layers, becomes progressively organised in intermediate layers, and is attenuated in later layers, suggesting that perceptual geometry arises transiently as part of the model's internal transformation pipeline. This provides new insight into how and where human-like perceptual geometry arises in LLMs, offering a principled pathway for mechanistic analysis of internal representations.

2605.27969 2026-05-28 cs.CL

Boundary Suppression Asymmetry in Post-trained Assistants: Over-expansion as a Controllability Cost

后训练助手中的边界抑制不对称性:过度扩展作为可控性代价

Jiarui Han

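AI summary: Studies the boundary-suppression asymmetry that arises in post-trained language-model assistants from avoiding under-answering, finding that anti-underanswering policies are harder to suppress in boundary-control evaluations and that this cost is associated with content-budget overshoot and continuation persistence.

Details
AI Chinese abstract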

后训练的语言模型助手通常被优化以避免回答不足,鼓励完整、有用、谨慎和主动的响应。我们询问这种优化是否会产生不对称的可控性代价:当用户明确要求更窄的回答时,哪些助手行为仍然可被抑制,哪些继续塑造响应?我们将此问题研究为边界抑制不对称性。跨多个高级响应维度的提示侧探测表明存在选择性代价,集中在“过度助手”方向,如过度完成、额外帮助和反回答不足。使用来自共享基础模型的控制助手策略变体,我们发现反回答不足策略在匹配的边界控制评估下比基线更难被拉回,而最小边界变体在直接边界控制比较中通常避免了这种反侧向上偏移。机制导向的探测指向超出更长的默认输出、纯EOS失败、不确定性补偿和局部延续偏差,而鲁棒性检查在共享系统和更大规模设置下保持了主要的反超基线排序。证据支持混合规划/停止解释,其中内容预算超支和延续持续性共同使边界修正更难。总体而言,后训练可能产生方向特定的可控性代价:一些有用的助手倾向仍然容易调用,但更难局部抑制。

English abstract

Post-trained language-model assistants are often optimized to avoid under-answering, encouraging complete, helpful, cautious, and proactive responses. We ask whether this optimization creates asymmetric controllability costs: when users explicitly request narrower answers, which assistant behaviors remain suppressible, and which continue to shape the response? We study this problem as boundary-suppression asymmetry. Prompt-side probes across multiple high-level response dimensions suggest a selective cost, concentrated around "too-much assistant" directions such as over-completion, extra help, and anti-underanswering. Using controlled assistant-policy variants derived from a shared base model, we find that anti-underanswering policies are harder to pull back than the baseline under matched boundary-control evaluations, while minimal-boundary variants generally avoid this anti-side upward shift in the direct boundary-control comparisons. Mechanism-oriented probes point beyond longer default outputs, pure EOS failure, uncertainty compensation, and local continuation bias, while robustness checks preserve the main anti-over-baseline ordering under shared-system and larger-scale settings. The evidence supports a mixed planning/stopping account, where content-budget overshoot and continuation persistence jointly make boundary correction harder. Overall, post-training may create direction-specific controllability costs: some helpful assistant tendencies remain easy to invoke, yet harder to locally suppress.

2605.27965 2026-05-28 cs.AI

The Shape of Overthinking: Backtracking Bursts in Long Reasoning Traces

过度思考的形状:长推理轨迹中的回溯爆发

Navid Rezazadeh, Arash Gholami Davoodi

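AI summary: By analyzing backtracking dynamics in long reasoning traces, the paper finds that early isolated repair is usually compatible with correct reasoning, whereas incorrect traces more often show persistent, clustered, late moderate-to-severe backtracking, and on this basis proposes a burst-aware filtering strategy to separate recoverable repair from likely instability.

Details
AI Chinese abstract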

推理模型通常生成长轨迹,其中有用的自我纠正和无效的修改难以区分。我们通过回溯动态研究这种区别:长形式推理轨迹中的局部重新考虑、撤回或重新推导。在6,000条Qwen3-8B AIME轨迹上,我们标注了片段级别的回溯严重性,并分析了事件时序、归一化深度和局部爆发结构。我们发现早期孤立修复通常与正确推理兼容,而错误轨迹更常显示中度至重度回溯,这些回溯持续存在并聚集在后期。跨语料库检查显示,在额外的模型/领域对中存在相同的定性不对称性。过滤分析将信号实例化为前缀因果选择性早期退出策略:在浅层和中间深度,爆发感知过滤优于固定长度过滤,同时仅使用前缀可用特征。中等长度截断仍然是强大的完整轨迹基线,但爆发感知控制提供了一种可部署的机制,用于区分可恢复修复与潜在不稳定。

English abstract

Reasoning models often generate long traces in which useful self-correction and unproductive revision are hard to distinguish. We study this distinction through backtracking dynamics: local reconsideration, retraction, or re-derivation inside long-form reasoning traces. On 6,000 Qwen3-8B AIME traces, we annotate segment-level backtrack severity and analyze event timing, normalized depth, and local burst structure. We find that early isolated repair is often compatible with correct reasoning, whereas incorrect traces more often show moderate-to-severe backtracks that persist and cluster late. Cross-corpus checks show the same qualitative asymmetry across additional model/domain pairs. Filtering analyses instantiate the signal as a prefix-causal selective early-exit policy: at shallow and intermediate depths, burst-aware filtering outperforms fixed length-based filtering while using only prefix-available features. Moderate length cutoffs remain strong completed-trace baselines, but burst-aware control provides a deployable mechanism for separating recoverable repair from likely instability.
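A prefix-causal burst check of the kind this abstract describes might look like the sketch below. Everything specific in it is an assumption made for illustration: the 0 to 3 severity scale, the window length, the event threshold, and the function name `first_burst_index` are hypothetical and not taken from the paper.

```python
def first_burst_index(severities, window=5, min_events=3, min_severity=2):
    """Return the first segment index at which a backtracking burst is seen.

    `severities` holds one score per reasoning segment
    (assumed scale: 0 = none, 1 = mild, 2 = moderate, 3 = severe).
    A burst is flagged when at least `min_events` of the last `window`
    segments have severity >= `min_severity`. Only segments up to the
    current index are inspected, so the check is prefix-causal.
    Returns None if no burst occurs.
    """
    for end in range(len(severities)):
        start = max(0, end - window + 1)
        events = sum(s >= min_severity for s in severities[start:end + 1])
        if events >= min_events:
            return end
    return None


early_isolated_repair = [0, 2, 0, 0, 0, 0, 0, 0]
late_clustered = [0, 0, 0, 0, 0, 2, 3, 0, 2]
print(first_burst_index(early_isolated_repair))
print(first_burst_index(late_clustered))
```

A single early repair never trips the check, while three moderate-to-severe events inside one window do, which mirrors the asymmetry the abstract reports between recoverable repair and late clustered backtracking.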

2605.27962 2026-05-28 cs.CV

Bridging the Generalization Gap in Adverse Weather Segmentation: A Training Recipe Perspective

缩小恶劣天气分割中的泛化差距:训练方案视角

Cong Xu, Pu Luo, Yumei Li, Boyou Xue

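AI summary: Taking a training-recipe perspective, the paper substantially narrows the validation-test generalization gap in adverse-weather semantic segmentation through domain-adaptive fine-tuning, multi-source data mixing, scene-balanced sampling, and synthetic degradation augmentation.

Details
AI Chinese abstract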

本文描述了我们在第8届UG2+研讨会(CVPR 2026)Track 2中的方法,该赛道针对五种天气条件(模糊、黑暗、雪、雾和眩光)退化的户外场景进行语义分割。我们观察到一个核心挑战是严重的泛化差距——在验证集上表现良好的模型在测试集上往往崩溃。例如,SegFormer-B5从验证到测试下降了16.1 mIoU点,表明仅靠模型容量不足以实现鲁棒性。我们研究精心设计的训练方案(而非架构复杂性)是否可以解决这一差距。从预训练的SegMAN-S骨干开始,我们系统地研究了域自适应微调、多源数据混合、场景平衡采样和合成退化增强的效果。我们的最终系统在官方测试集上达到了59.9%的mIoU,同时验证-测试差距仅为6.5个点——不到更大模型的一半。我们分析了架构修改、损失函数变体和模型缩放的负面结果,为有限数据下天气鲁棒分割提供实用见解。

English abstract

This paper describes our approach for the 8th UG2+ Workshop (CVPR 2026) Track 2, which targets semantic segmentation of outdoor scenes degraded by five weather conditions: blur, darkness, snow, haze, and glare. A central challenge we observe is a severe generalization gap -- models that perform well on the validation set often collapse on the test set. For instance, SegFormer-B5 drops 16.1 mIoU points from validation to test, suggesting that model capacity alone is insufficient for robustness. We investigate whether a carefully designed training recipe, rather than architectural complexity, can address this gap. Starting from a pre-trained SegMAN-S backbone, we systematically study the effects of domain-adaptive fine-tuning, multi-source data mixing, scene-balanced sampling, and synthetic degradation augmentation. Our final system achieves 59.9% mIoU on the official test set while maintaining a validation-test gap of only 6.5 points -- less than half that of larger models. We analyze negative results from architectural modifications, loss function variants, and model scaling to provide practical insights for weather-robust segmentation under limited data.
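The mIoU figures quoted in this abstract follow the standard definition: per-class intersection over union, averaged across classes. The sketch below computes it from a confusion matrix; the function name and the choice to skip classes that appear in neither prediction nor ground truth are assumptions for illustration, and the challenge's exact evaluation protocol may differ.

```python
def mean_iou(confusion):
    """Mean intersection over union from a square confusion matrix.

    confusion[i][j] is the number of pixels whose true class is i
    and whose predicted class is j.
    """
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                       # true c, predicted other
        fp = sum(confusion[r][c] for r in range(n)) - tp  # predicted c, true other
        union = tp + fp + fn
        if union > 0:  # skip classes absent from both prediction and ground truth
            ious.append(tp / union)
    return sum(ious) / len(ious)


val_confusion = [[3, 1],
                 [1, 5]]
print(mean_iou(val_confusion))
```

The validation-test gap discussed in the abstract is then simply the difference between this metric on the two splits.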

2605.27960 2026-05-28 cs.CV

Mags-RL: Wearing Multimodal LLMs a Magnifying Glass via Agentic Reinforcement Learning For Complex Scene Reasoning

Mags-RL: 通过智能体强化学习为多模态大语言模型戴上放大镜以进行复杂场景推理

Xuanzhao Dong, Wenhui Zhu, Peijie Qiu, Xiwen Chen, Xiaobing Yu, Xin Li, Zhipeng Wang, Shao Tang, Gen Li, Yujian Xiong, Hao Wang, Yanxi Chen, Prayag Tiwari, Yalin Wang

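AI summary: Proposes the Mags-RL framework, which uses agentic reinforcement learning to let multimodal large language models invoke a super-resolution agent for high-resolution fine-grained inspection, performing two-round reasoning to improve visual reasoning in complex scenes.

Details
AI Chinese abstract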

尽管多模态大语言模型(MLLMs)广受欢迎且成功,但它们在准确解释图像方面常常遇到困难,这限制了它们在复杂场景(如高物体密度和复杂背景杂乱)中的推理能力。先前的工作主要通过引入额外的显式视觉线索(如需要额外标注的边界框)来解决这一限制。此外,由此产生的低分辨率裁剪往往丢失了MLLMs进行准确推理所需的细粒度细节。因此,我们提出了Mags-RL,一个智能体强化学习(RL)框架,它为MLLMs配备了一个外部超分辨率“放大镜”代理,用于高分辨率细粒度检查。具体来说,该模型执行两轮推理:第一轮,它生成初始推理并自主识别感兴趣区域,无需依赖额外标注;第二轮,它调用超分辨率代理裁剪并放大这些区域,然后重新审视并验证其先前的推理以产生最终答案。我们还引入了一种新颖的课程学习策略,实现了数据高效的RL训练,仅需少至40个训练样本即可达到合理的性能。在VSR、TallyQA和GQA子集上的实验表明,与近期强竞争方法相比,它表现出优越的性能,展示了具有精确视觉基础的高质量推理。代码和权重将很快发布。

English abstract

Despite their popularity and success, Multimodal Large Language Models (MLLMs) often struggle to interpret images accurately, which limits their reasoning capability in complex scenarios (e.g., high object density and complex background clutter). Prior work mainly addresses this limitation by incorporating explicit visual cues, such as bounding boxes, that require extra annotations. In addition, the resulting low-resolution crops often miss fine-grained details that MLLMs require for accurate reasoning. Therefore, we propose Mags-RL, an Agentic Reinforcement Learning (RL) framework that equips MLLMs with an external super-resolution "magnifying glass" agent for high-resolution fine-grained inspection. Specifically, the model performs two-round reasoning: in the first round, it generates an initial rationale and autonomously identifies regions of interest without relying on additional annotations; in the second round, it invokes a super-resolution agent to crop and upscale those regions, then revisits and verifies its earlier reasoning to produce the final answer. We also introduce a novel curriculum learning strategy that enables data-efficient RL training, needing as few as 40 training samples to achieve reasonable performance. Experiments on VSR, TallyQA, and GQA subsets show its superior performance against recent strong competing methods, demonstrating high-quality reasoning with precise visual grounding. Code and weights will be released soon.
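The second-round "crop and upscale" step described in this abstract can be illustrated on a toy image. This sketch deliberately swaps the paper's learned super-resolution agent for plain nearest-neighbor upscaling, and represents the image as a nested list of pixel values; `crop` and `upscale_nearest` are hypothetical helper names, not part of Mags-RL.

```python
def crop(image, top, left, height, width):
    """Cut a rectangular region of interest out of a row-major image."""
    return [row[left:left + width] for row in image[top:top + height]]


def upscale_nearest(image, factor):
    """Enlarge an image by an integer factor using nearest-neighbor copies.

    This stands in for a learned super-resolution model, which would
    reconstruct detail rather than repeat pixels.
    """
    out = []
    for row in image:
        wide = [pixel for pixel in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
region = crop(image, top=1, left=1, height=2, width=2)
magnified = upscale_nearest(region, factor=2)
for row in magnified:
    print(row)
```

In the pipeline the abstract describes, the region coordinates would come from the model's own first-round reasoning rather than from annotations, and the magnified crop would be fed back for the verification round.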

2605.27958 2026-05-28 cs.CL cs.AI cs.LG

Pressure-Testing Deception Probes in LLMs: Scaling, Robustness, and the Geometry of Deceptive Representations

压力测试LLM中的欺骗探针:扩展性、鲁棒性与欺骗表示的几何结构

Sachin Kumar

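AI summary: Through systematic pressure-testing, the paper diagnoses why linear probes fail under distributional shift, finds that style augmentation recovers near-perfect detection, and shows that deception is encoded not as a single linear direction or an entropy proxy but as distributed sub-threshold features.

Details
Comments
Accepted at the GEM Workshop @ ACL 2026
AI Chinese abstract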

基于LLM激活训练的线性探针越来越多地被提议作为欺骗检测指标,但在干净基准上报告AUROC超过0.96,而在分布偏移下崩溃。本文系统地对Gemma 3模型家族(1B-27B参数)的探针指标进行压力测试,诊断其失败原因而不仅仅是记录失败。我们测试了关于欺骗编码的四个假设:(1)单一线性方向,(2)多维子空间,(3)凸锥包,(4)熵代理。我们的设计包括跨域转移矩阵、基于排列零基线的多维探针分析、熵残差化测试以及8种风格偏移下的干扰评估。我们发现:(a)探针在干净数据上达到近乎完美的AUROC(>=0.998),但在风格偏移下崩溃;风格增强的探针在未见风格上恢复近乎完美的检测(平均AUROC 0.979-0.983);(b)单一方向假设被拒绝(k=1仅捕获0.61-0.80 AUROC),跨域转移失败被确认为几何原因而非层不匹配驱动;(c)熵代理假设被拒绝(最大|rho|=0.454,残差化后最大Delta-AUROC=0.004);(d)欺骗并未形成显著的线性子空间(每域k*=0),但多维探针(k>=5)通过分布式亚阈值特征恢复信号。探针脆弱性反映了分布狭窄性而非架构限制:风格增强的探针在4B和27B均恢复近乎完美的检测,表明逆缩放模式是训练分布伪影而非真正的规模依赖现象。

English abstract

Linear probes trained on LLM activations are increasingly proposed as deception-detection metrics, yet report AUROC exceeding 0.96 on clean benchmarks while collapsing under distributional shift. This paper systematically pressure-tests probe-based metrics across the Gemma 3 model family (1B-27B parameters), diagnosing why they fail rather than merely documenting that they fail. We test four hypotheses about deception encoding: (1) single linear direction, (2) multi-dimensional subspace, (3) convex conic hull, (4) entropy proxy. Our design includes cross-domain transfer matrices, multi-dimensional probe analysis with permutation null baselines, entropy-residualization tests, and distractor evaluations across 8 stylistic shifts. We find that: (a) probes achieve near-perfect AUROC (>=0.998) on clean data but collapse under stylistic shifts; style-augmented probes recover near-perfect detection (mean AUROC 0.979-0.983) on unseen styles; (b) the single-direction hypothesis is rejected (k=1 captures only 0.61-0.80 AUROC), with cross-domain transfer failure confirmed as geometric rather than layer-mismatch-driven; (c) the entropy-proxy hypothesis is rejected (max |rho|=0.454, max Delta-AUROC after residualization=0.004); and (d) deception does not form a significant linear subspace (per-domain k*=0), yet multi-dimensional probes (k>=5) recover the signal through distributed sub-threshold features. Probe fragility reflects distributional narrowness rather than an architectural limitation: style-augmented probes recover near-perfect detection at both 4B and 27B, establishing that the inverse scaling pattern is a training-distribution artifact rather than a genuine scale-dependent phenomenon.
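The AUROC values quoted throughout this abstract measure how often a probe scores a deceptive example above an honest one. The sketch below pairs a linear probe score with a pairwise (Mann-Whitney style) AUROC computation. The weights, activations, and function names are made-up illustrations, not the paper's probes or data.

```python
def probe_score(weights, bias, activation):
    """Linear probe: dot product of a weight vector with an activation, plus bias."""
    return sum(w * a for w, a in zip(weights, activation)) + bias


def auroc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Labels are 1 (deceptive) or 0 (honest).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy 2-dimensional activations with hypothetical probe weights.
weights, bias = [1.0, -1.0], 0.0
activations = [[0.9, 0.0], [0.8, 0.0], [0.5, 0.2], [0.1, 0.0]]
labels = [1, 0, 1, 0]
scores = [probe_score(weights, bias, a) for a in activations]
print(auroc(scores, labels))
```

An AUROC of 0.5 means the probe ranks no better than chance, which is the sense in which the abstract describes probes collapsing under stylistic shift.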