arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.27997 2026-05-28 cs.CL cs.AI cs.LG

Where Does Toxicity Live? Mechanistic Localization and Targeted Suppression in Language Models

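Himanshu Beniwal, Mayank Singh

Affiliation: Indian Institute of Technology Gandhinagar

AI summary: Localizes toxicity to specific layers and neurons by analyzing activation differences between toxic and neutral prompts, then suppresses it with inference-time scaling or minimal rank-one weight edits, with no gradient descent, reducing toxicity while preserving language quality.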

Abstract

Large language models frequently generate toxic, hateful, or harmful content, yet existing mitigation methods rely on costly retraining or output-level filtering with no mechanistic insight into where toxicity originates internally. We introduce Meow2X and TRNE, two complementary retraining-free frameworks that localize toxicity to specific layers and neurons by analyzing activation differentials between toxic and neutral prompts, then suppress them via inference-time scaling or minimal rank-one weight edits -- without any gradient descent. Evaluations across five LMs, two benchmarks, and 90 configurations using dual safety evaluators demonstrate consistent toxicity reduction while preserving language modeling quality. Our analysis reveals that toxicity is disproportionately encoded in early MLP layers, varies across architectures, and is systematically underestimated by single-evaluator setups -- underscoring the need for multi-evaluator safety assessment. By bridging mechanistic interpretability with practical detoxification, our framework offers a principled path toward safer, more transparent language models.
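The localize-then-scale idea in this abstract can be illustrated with a minimal NumPy sketch. It is not the authors' Meow2X or TRNE code: the function names, the synthetic activations, and the choice of ranking neurons by the mean activation gap are all assumptions made for illustration.

```python
import numpy as np

def localize_neurons(toxic_acts, neutral_acts, top_k):
    """Rank neurons by mean activation gap between toxic and neutral prompts.

    toxic_acts, neutral_acts: arrays of shape (num_prompts, num_neurons).
    Returns the indices of the top_k neurons with the largest positive gap.
    """
    diff = toxic_acts.mean(axis=0) - neutral_acts.mean(axis=0)
    return np.argsort(diff)[::-1][:top_k]

def scale_neurons(acts, neuron_ids, alpha):
    """Return a copy of acts with the selected neurons multiplied by alpha."""
    out = acts.copy()
    out[..., neuron_ids] *= alpha
    return out

# Synthetic data (hypothetical): neurons 3 and 7 are planted as "toxic".
rng = np.random.default_rng(0)
neutral = rng.normal(size=(64, 16))
toxic = rng.normal(size=(64, 16))
toxic[:, [3, 7]] += 5.0

ids = localize_neurons(toxic, neutral, top_k=2)
suppressed = scale_neurons(toxic, ids, alpha=0.1)  # inference-time scaling
print(sorted(ids.tolist()))
```

In a real model the activations would come from hooks on MLP layers, and the scaling factor would need tuning against a language-modeling quality metric.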

2605.27993 2026-05-28 cs.CL

Rethinking Visual Neglect: Steering via Context-Preference for MLLM Hallucination Mitigation

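Jingwen Wu, Xijun Zhang, Ge Song

Affiliation: School of Computer and Electronic Information, Nanjing Normal University

AI summary: To address object hallucination in multimodal large language models, proposes a training-free Context-Preference Activation Steering (CAS) framework that extracts context preference vectors and injects signed residuals at mid-early MLP layers to control information reliance, mitigating hallucination without added decoding latency.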

Comments 15 pages, 5 figures

Abstract

Object hallucination remains a primary obstacle to the reliable deployment of Multimodal Large Language Models (MLLMs). Current inference-time mitigation methods mainly assume hallucinations stem from visual neglect, steering models to enhance visual reliance. In contrast, our systematic interventions on multiple MLLMs show that pushing toward more visual reliance may exacerbate hallucinations on some models, while less visual reliance may mitigate them. This result suggests that attributing hallucinations solely to visual insufficiency is underdetermined. We argue that the image, as a context, simultaneously competes with the model's parametric knowledge and the textual context. To this end, we propose a training-free framework, Context-Preference Activation Steering (CAS). It extracts two semantically distinct Context Preference Vectors (CPVs) via two small sets of designed conflict samples and applies them via single-pass signed residual injection at mid-early MLP layers during inference to control information reliance. Experiments show that CAS substantially mitigates object hallucinations without increasing decoding latency and preserves native text-generation quality.
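Signed residual injection can be sketched generically as adding a scaled direction vector to hidden states. The sketch below assumes the preference vector is a normalized difference of mean activations between two conflict sets. That construction, the names, and the synthetic data are assumptions, not the paper's CAS procedure, and only one of the two vectors is shown.

```python
import numpy as np

def preference_vector(acts_a, acts_b):
    """Unit-norm difference of mean activations between two conflict sets."""
    v = acts_a.mean(axis=0) - acts_b.mean(axis=0)
    return v / np.linalg.norm(v)

def inject(hidden, vector, strength, sign):
    """Add a signed, scaled steering vector to every token's hidden state."""
    return hidden + sign * strength * vector

rng = np.random.default_rng(0)
d = 8
# Hypothetical activations: set A leans along the first coordinate.
acts_image_favored = rng.normal(size=(32, d)) + np.eye(d)[0] * 3.0
acts_prior_favored = rng.normal(size=(32, d))
cpv = preference_vector(acts_image_favored, acts_prior_favored)

hidden = rng.normal(size=(5, d))  # 5 tokens at one layer
toward = inject(hidden, cpv, strength=4.0, sign=+1)  # push toward set A
away = inject(hidden, cpv, strength=4.0, sign=-1)    # push away from set A
```

The sign selects the direction of the shift along the vector, which matches the abstract's observation that either more or less visual reliance may help depending on the model.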

2605.27992 2026-05-28 cs.LG

Patched-DeltaNet: Token-Level Event-Driven Memory for Linear-Time Anomaly Detection

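Tae-Gyun Lee, Junyoung Park, Kyu Won Han

Affiliation: ETRI

AI summary: Proposes Patched-DeltaNet, an architecture combining time-series patching with Gated Delta Networks that achieves linear-time anomaly detection through token-level event-driven memory, reaching an ROC-AUC of 0.957 and a PA-F1 of 0.822 on the SMD benchmark.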

Comments 7 pages, 2 tables

Abstract

Time series anomaly detection is critical for maintaining the reliability of mission-critical systems. While Transformer-based models like PatchTST have shown remarkable performance, their $\mathcal{O}(L^2)$ computational complexity severely limits deployment in resource-constrained environments. In this paper, we propose Patched-DeltaNet, a novel architecture combining time-series patching with Gated Delta Networks. By integrating these paradigms, we hypothesize and demonstrate the emergence of token-level event-driven memory, whereby the patching mechanism extracts local semantic chunks, while the error-driven DeltaNet updates its recurrent state exclusively when significant physical changes, defined as deltas, occur. This synergy effectively filters out background noise and captures sudden anomalous drifts. Our rigorous experiments on the Server Machine Dataset (SMD) benchmark demonstrate the structural superiority and sample efficiency of Patched-DeltaNet. By strictly outperforming recent architectures under unified evaluation constraints and identical compute budgets, our model yields an ROC-AUC of 0.957 and PA-F1 of 0.822, while drastically reducing computational complexity to the theoretical minimum of $\mathcal{O}(L/P)$.
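The combination of patching and an error-driven state update can be sketched with the plain delta rule, S <- S + beta * (v - S k) k^T, applied once per patch. This is a simplified stand-in: the gate of the Gated Delta Network is omitted, raw patches serve as both keys and values instead of learned projections, and all names and values are hypothetical.

```python
import numpy as np

def patchify(series, patch_len):
    """Split a 1-D series of length L into L // patch_len non-overlapping patches."""
    usable = (len(series) // patch_len) * patch_len
    return series[:usable].reshape(-1, patch_len)

def delta_rule_scan(keys, values, beta):
    """Run S <- S + beta * (v - S k) k^T over a token sequence.

    Returns the final state and the per-token prediction-error norms.
    """
    d_k, d_v = keys.shape[1], values.shape[1]
    state = np.zeros((d_v, d_k))
    errors = []
    for k, v in zip(keys, values):
        k = k / (np.linalg.norm(k) + 1e-8)  # unit-norm key
        err = v - state @ k                 # what the memory failed to predict
        state = state + beta * np.outer(err, k)
        errors.append(np.linalg.norm(err))
    return state, np.array(errors)

series = np.ones(64)
series[40:48] += 5.0                     # one anomalous patch
patches = patchify(series, patch_len=8)  # 64 / 8 = 8 tokens
state, errors = delta_rule_scan(patches, patches, beta=1.0)
```

The recurrence runs L/P steps for a series of length L and patch length P. In this toy run the update is close to zero on repeated background patches and large only at the anomalous patch and at the return to baseline, which is the event-driven behavior the abstract describes.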

2605.27990 2026-05-28 cs.LG cs.AI cs.CV

Geometry-Correct Diffusion Posterior Sampling with Denoiser-Pullback Curvature Guidance and Manifold-Aligned Damping

Seunghyeok Shin, Minwoo Kim, Dabin Kim, Hongki Lim

Affiliation: Department of Electrical and Computer Engineering, Inha University, Incheon, 22212, South Korea

AI summary: Proposes a geometry-correct diffusion posterior sampling method based on denoiser-pullback curvature guidance and manifold-aligned damping, replacing scalar guidance with a per-noise-level damped Gauss-Newton correction for stable and efficient posterior sampling.

Comments Code: https://github.com/Seunghyeok0715/CLAMP

Journal ref International Conference on Machine Learning 2026

Abstract

Diffusion posterior sampling conditions diffusion priors on measurements, but data-consistency updates are typically scaled by hand-tuned guidance weights and can destabilize sampling under stiff, operator-dependent curvature. We replace scalar guidance with a per-noise-level damped Gauss--Newton correction computed in diffusion-state coordinates. The correction pulls likelihood gradients back through the denoiser, uses a one-sided curvature model that avoids forward denoiser Jacobians, and applies diffusion-calibrated rank-one damping aligned with the denoiser residual. Each correction is solved with matrix-free GMRES using automatic differentiation, and sampling proceeds with a variance-preserving Langevin transition with a closed-form drift/noise split. On FFHQ and ImageNet across inverse problems, it achieves competitive PSNR/SSIM/LPIPS while running markedly faster than most of the compared baselines; on accelerated MRI reconstruction, it achieves the best PSNR/SSIM among the compared baselines.
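For orientation, a textbook damped Gauss-Newton step is written out below. It is not the paper's update: the one-sided curvature model and the rank-one damping aligned with the denoiser residual are not reproduced, and the symbols r, J, and λ are assumptions introduced here.

```latex
% Generic damped Gauss-Newton step (textbook form, not the paper's exact update).
% r(x): measurement residual at diffusion state x; J: Jacobian of r at x;
% \lambda > 0: damping weight (a scalar here, by assumption).
\begin{aligned}
  (J^\top J + \lambda I)\,\delta &= -J^\top r(x), \\
  x &\leftarrow x + \delta .
\end{aligned}
```

A matrix-free solver such as GMRES only needs the product of the left-hand operator with a vector v, that is J^T(Jv) + λv, which automatic differentiation can supply as one Jacobian-vector product followed by one vector-Jacobian product, so J is never formed explicitly.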

2605.27989 2026-05-28 cs.LG

Law of Neural Interaction: Depth-Width Shape, Interaction Efficiency, and Generalization

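Wenjie Sun, Jinning Yang, Shuai Zhang, Mengnan Du

Affiliations: The Chinese University of Hong Kong, Shenzhen; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; New Jersey Institute of Technology

AI summary: Extends superposition from parameter space to gradient space and defines it as neural interaction; finds that under a fixed budget good generalization accompanies efficient neural interaction, that adjusting the depth-width ratio ($R_{D/W}$) can place a model in an efficient interaction interval, and that this interval stays stable as the budget scales, offering insights into model shape initialization and generalization mechanisms.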

Comments 30 pages, 4 figures

Abstract

The guidance of scaling laws has increased the resource demands of modern large language models (LLMs), yet it remains questionable whether these models utilize resources effectively under a fixed budget. Previous research has shown superposition to be a key contributor to loss. By leveraging the Neural Feature Ansatz, we extend superposition from parameter space to gradient space and define it as neural interaction. We find that under a fixed budget, good generalization is usually accompanied by efficient neural interactions, and the model can be placed in an efficient interaction interval by adjusting its depth-width ratio ($R_{D/W}$). In addition, as the budget scales up, the efficient interaction interval of the model remains relatively stable. By comparing existing small-scale dense LLMs, we observe that models operating near this interval tend to perform better on the MMLU-Pro benchmark. Our findings reveal that $R_{D/W}$ influences resource utilization efficiency and thereby affects generalization, providing insights into model shape initialization and the understanding of model generalization mechanisms. Code for Neural Interaction Law is available at: https://anonymous.4open.science/r/Neural_Interaction_Law-D788

2605.27986 2026-05-28 cs.CL q-bio.QM

An Evolutionary Approach for Designing Stable and Highly Expressible Low-Immunogenicity Therapeutic mRNA Sequences

Dhawa Sang Dong, Mausam Gurung, Suraj Kandel

Affiliations: Department of Artificial Intelligence; Department of Electronics and Computer Engineering; School of Engineering, Kathmandu University; Kathmandu Engineering College, Tribhuvan University; School of Management and Social Science

AI summary: Proposes the two-stage BERT-GA framework, which combines deep learning with a genetic algorithm to optimize mRNA sequences, balancing translation efficiency, structural stability, and low immunogenicity.

Abstract

Messenger RNA (mRNA) sequences as therapeutics require optimized design to ensure efficient translation, structural stability, and minimal immunogenicity. This study presents a two-stage in-silico framework that integrates deep learning and evolutionary computation for rational mRNA optimization as an alternative to existing state-of-the-art models. In the first stage, a pretrained CodonTransformer (a BERT-like large language model) generates biologically coherent mRNA sequences encoding the target antigen. In the second stage, a genetic algorithm (GA) evolves these candidate sequences through codon-aware crossover and synonymous mutation guided by human codon usage preferences. The fitness function combines translation-related metrics (CAI, tAI, codon-pair bias), mRNA structural stability (local and global MFE via RNAfold, GC content), and reduced immunogenicity (CpG/UpA motif frequency). Over successive generations (38th, 40th, and 42nd), the GA improved CAI and tAI by over 6%, reaching CAI values of 0.73 to 0.74 and tAI values of 0.63 to 0.64. Codon-pair bias remained high and consistent (0.97), and ribosomal accessibility at the 5' end improved, with an unpaired_30 fraction reaching 0.87. The global minimum free energy (MFE) converged to a balanced range of -346 to -356 kcal/mol, corresponding to approximately 84% base-paired structural stability, and immune-stimulatory motifs were reduced, lowering the average immune penalty to 27.3 in the final generation. By comparison, Linear Design produces hyper-stable transcripts (MFE < -2000 kcal/mol) that risk translation inefficiency due to extreme rigidity, and BiLSTM-CRF focuses solely on high CAI (0.96 to 0.98) without structural constraints. Our framework achieves an optimal translation-stability equilibrium, highlighting the proposed BERT-GA framework as an effective, data-driven approach for the in-silico design and optimization of mRNA sequences.
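A toy version of the second stage can make the fitness terms concrete. The sketch below computes CAI as the geometric mean of per-codon weights, counts CpG/UpA dinucleotides, and applies synonymous mutations. It uses a mutation-only hill climb as a stand-in for a full genetic algorithm (no crossover, no population), and the codon weights and penalty value are hypothetical rather than taken from a real codon usage table or from the paper.

```python
import math
import random

# Toy synonymous-codon table with HYPOTHETICAL relative adaptiveness weights
# (w = 1.0 marks the preferred codon); real values come from a codon usage table.
WEIGHTS = {
    "A": {"GCC": 1.0, "GCU": 0.7, "GCA": 0.6, "GCG": 0.3},
    "L": {"CUG": 1.0, "CUC": 0.5, "UUG": 0.3, "CUU": 0.3, "CUA": 0.2, "UUA": 0.2},
}
CODON_TO_AA = {c: aa for aa, table in WEIGHTS.items() for c in table}

def codons(seq):
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]

def translate(seq):
    return "".join(CODON_TO_AA[c] for c in codons(seq))

def cai(seq):
    """Codon Adaptation Index: geometric mean of per-codon weights."""
    ws = [WEIGHTS[CODON_TO_AA[c]][c] for c in codons(seq)]
    return math.exp(sum(math.log(w) for w in ws) / len(ws))

def motif_count(seq):
    """Number of CpG and UpA dinucleotides."""
    return sum(seq[i:i + 2] in ("CG", "UA") for i in range(len(seq) - 1))

def fitness(seq, penalty=0.05):
    """Reward codon adaptation, penalize immune-stimulatory motifs."""
    return cai(seq) - penalty * motif_count(seq)

def synonymous_mutation(seq, rng):
    """Swap one randomly chosen codon for a synonym; the protein is unchanged."""
    cs = codons(seq)
    i = rng.randrange(len(cs))
    cs[i] = rng.choice(list(WEIGHTS[CODON_TO_AA[cs[i]]]))
    return "".join(cs)

rng = random.Random(0)
start = "GCGUUAGCAUUA"  # encodes A-L-A-L using rare codons
best = start
for _ in range(200):    # mutation-only hill climb, standing in for the GA
    child = synonymous_mutation(best, rng)
    if fitness(child) >= fitness(best):
        best = child
```

Because every mutation is synonymous, the encoded protein never changes, while the accepted sequence's fitness can only stay equal or improve. Structural terms such as MFE would require an external folding tool and are left out.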

2605.27984 2026-05-28 cs.CL cs.AI

KVoiceBench, KOpenAudioBench, and KMMAU: Agent-Driven Korean Speech Benchmarks for Evaluating SpeechLMs

Haechan Kim, Seungjun Chung, Inkyu Park, Jihoo Lee, Jonghyun Lee

Affiliations: KRAFTON Graduate School of AI, KAIST; Department of Mathematical Sciences, Seoul National University

AI summary: To address the English-centric evaluation of speech language models, proposes two agent-driven benchmark-construction frameworks and builds and releases three Korean speech benchmarks (KVoiceBench, KOpenAudioBench, KMMAU); evaluating eight recent models reveals English-Korean performance gaps and complementary weaknesses across tasks.

Comments 16 pages, 4 figures

Abstract

Speech language models (SpeechLMs) have achieved substantial progress by extending large language models (LLMs) to the speech modality. However, SpeechLM evaluation remains heavily centered on English, limiting reliable assessment of multilingual speech capabilities. Straightforward benchmark transfer through ASR, translation, normalization, and TTS can corrupt language-specific instructions, answer constraints, and spoken forms; for audio understanding, transferring source-language audio also fails to preserve target-language speaker attributes, accents, and paralinguistic properties. To address these limitations, we propose two human-agent benchmark-construction frameworks: one transfers source-language SpokenQA benchmarks into target-language SpokenQA benchmarks, and the other converts target-language ASR corpora into audio understanding benchmarks using transcriptions and speaker metadata. Using these frameworks, we construct and publicly release three Korean speech benchmarks: KVoiceBench and KOpenAudioBench for Korean SpokenQA, and KMMAU for Korean audio understanding, comprising 12,345 samples in total. We evaluate eight recent SpeechLMs and find that English-Korean performance gaps vary substantially across models and task families, and that SpokenQA and audio understanding rankings diverge, revealing complementary weaknesses invisible to English-only evaluation.

2605.27981 2026-05-28 cs.AI

STAB: Specification-driven Testing for Algorithmic Bottlenecks

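Soohan Lim, Joonghyuk Hahn, Hyundong Jin, Yo-Sub Han

Affiliation: Yonsei University

AI summary: Proposes the STAB pipeline, which uses constraint saturation and adversarial scenario injection to generate bottleneck-exposing test cases from natural-language problem specifications, substantially raising the rate at which test cases expose bottlenecks.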

Comments 16 pages, 5 figures, 8 tables

Abstract

Evaluating the efficiency of algorithmic code requires test cases that expose runtime bottlenecks. Previous methods generate efficiency test cases either by increasing input size or by generating code-specific inputs that make the given implementation run slowly. Consequently, they do not address the structural input conditions that drive the algorithmic worst case. We introduce STAB, a specification-driven pipeline that generates test cases that expose algorithmic bottlenecks from a natural-language problem specification alone. STAB separates the task into constraint-bound maximization and adversarial structure injection. (i) The constraint saturator extracts constraints and resolves large admissible size assignments using rule-based saturation and CP-SAT optimization over related variables. (ii) The adversarial scenario injector retrieves implementation-level adversarial construction principles from a curated scenario catalog using keyword matching and K-nearest neighbors (KNN). STAB encodes the problem specification, resolved boundary, and retrieved construction principles into a structured generation specification, from which the LLM synthesizes a Python test case generator. On CodeContests, STAB raises the rate of generated test cases that expose algorithmic bottlenecks from 50.43% to 73.45% on average across open-source LLMs and from 57.45% to 71.85% on average across closed-source LLMs, with consistent gains across Python, Java, and C++. Our code is available at https://github.com/suhanmen/STAB.
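The difference between size-only inputs and structurally adversarial inputs can be shown with a small generator. This is an illustrative sketch, not STAB itself: CP-SAT optimization, scenario retrieval, and LLM synthesis are omitted, and the scenario name, the constraint format, and the quicksort target are assumptions.

```python
import random

def saturate(constraints):
    """Pick the upper bound of every size variable (rule-based saturation)."""
    return {name: hi for name, (lo, hi) in constraints.items()}

def generate_case(sizes, scenario, rng):
    """Build an input array under the chosen construction principle."""
    n = sizes["n"]
    if scenario == "already_sorted":  # worst case for first-element pivots
        return list(range(n))
    return [rng.randrange(n) for _ in range(n)]  # size-only baseline

def quicksort_comparisons(arr):
    """Count pivot comparisons made by quicksort with a first-element pivot."""
    count, stack = 0, [list(arr)]
    while stack:
        part = stack.pop()
        if len(part) <= 1:
            continue
        pivot, rest = part[0], part[1:]
        count += len(rest)  # one pivot comparison per remaining element
        stack.append([x for x in rest if x < pivot])
        stack.append([x for x in rest if x >= pivot])
    return count

rng = random.Random(0)
sizes = saturate({"n": (1, 2000)})
adversarial = generate_case(sizes, "already_sorted", rng)
baseline = generate_case(sizes, "random", rng)
print(quicksort_comparisons(adversarial), quicksort_comparisons(baseline))
```

Both inputs have the maximum admissible size, but only the sorted one forces the quadratic n(n-1)/2 comparison count, which is the kind of structural condition that size scaling alone does not reach.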

2605.27980 2026-05-28 cs.CL cs.AI

Periodic RoPE for Infinite Context LLMs

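Simin Huo

Affiliation: Shanghai Jiao Tong University

AI summary: Proposes Periodic RoPE (P-RoPE), a positional encoding mechanism that combines sliding window attention with NoPE global attention to avoid position exhaustion and, in theory, support an infinite context window.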

Comments 5 pages

Abstract

The ability to process ultra-long contexts is crucial for large language models (LLMs) to perform long-horizon tasks. While recent efforts have extended context windows to 1M and beyond, model performance degrades when sequence length exceeds the pre-trained range of positional encodings (e.g., RoPE), i.e., position exhaustion. This fundamental limitation must be overcome to achieve a truly infinite context. To address it, we propose Periodic RoPE (P-RoPE), a positional encoding mechanism designed to circumvent this exhaustion. It operates in conjunction with sliding window attention (SWA) to capture local dependencies and relative positions within each window. This local layer is then complemented by a global attention layer with No Positional Encoding (NoPE), enabling unbounded interaction across the entire sequence without positional constraints. By stacking these two types of layers, the model avoids the need for positional extrapolation to generalize to longer sequences and theoretically supports an infinite context window. Empirical results show that our model, MiniWin, outperforms MiniMind with standard GPT architectures in long-context efficiency and stability. Our work provides a possible pathway toward LLMs with genuine infinite-context understanding. The code is available at https://github.com/Cominder/miniwin.
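The two attention patterns named in the abstract can be sketched as boolean masks. The sketch does not implement P-RoPE itself, whose exact definition is not given here. It only shows, under the assumption of a causal sliding window, why relative offsets in the local layers stay bounded by the window size regardless of sequence length.

```python
import numpy as np

def causal_mask(n):
    """True where query i may attend to key j (j <= i); used by the global layer."""
    return np.tril(np.ones((n, n), dtype=bool))

def sliding_window_mask(n, window):
    """Causal mask restricted to the most recent `window` keys."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (i - j < window)

n, window = 8, 3
local = sliding_window_mask(n, window)
# Relative offsets (query index minus key index) that the local layer ever sees.
offsets = (np.arange(n)[:, None] - np.arange(n)[None, :])[local]
print(offsets.min(), offsets.max())
```

Since the local layer never sees an offset of `window` or more, a positional encoding applied there only has to cover that fixed range, while the global layer, having no positional encoding, has no range to exhaust.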

2605.27978 2026-05-28 cs.CV

ABot-OCR Technical Report

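Kaitao Jiang, Ruiyan Gong, Xiaolong Cheng, Kangning Niu, Tianlun Li, Mu Xu

Affiliation: AMAP CV Lab

AI summary: Proposes ABot-OCR, an end-to-end vision-language model that transcribes a page image directly into clean Markdown in a single forward pass and uses Decoupled Heterogeneous Document Optimization, a reinforcement learning method, to improve textual accuracy and markup well-formedness, reaching state-of-the-art results among end-to-end systems on the OmniDocBench benchmark.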

Comments 21 pages, 11 figures, technical report

Abstract

We introduce ABot-OCR, an end-to-end vision-language model that transcribes a page image directly into clean Markdown in a single forward pass. By doing so, our approach completely eliminates the need for brittle modular orchestration. To maximize parsing fidelity, we develop a dedicated data engine to provide large-scale, structurally consistent supervision. Furthermore, we propose Decoupled Heterogeneous Document Optimization, a structure-constrained reinforcement learning method that sharpens textual accuracy and strictly enforces markup well-formedness beyond supervised fine-tuning alone. Extensive evaluations demonstrate the superior performance of our framework. On the OmniDocBench v1.5 and v1.6 benchmarks, ABot-OCR achieves state-of-the-art scores of 92.81 and 93.30 among all end-to-end systems, substantially narrowing the performance gap relative to strong pipeline baselines. Finally, comprehensive multilingual text recognition across ten diverse languages further confirms the robust generalizability of ABot-OCR.

2605.27976 2026-05-28 cs.SD

VoiceGiraffe: A Benchmark for Extreme Long-Context Audio-Language Understanding

Jashin Ye, Dongxiao Wang, Yixuan Ye, Sashuai Zhou, Weihuang Lin, Mingyang Han, Kunpeng Wang, Zeyu Yuan, Boyu Li, Haoxiang Shi, Jingchen Shu, Jun Song, Bo Zheng

Affiliations: Future Living Lab; Alibaba

AI summary: Proposes the VoiceGiraffe benchmark, which uses 1500 triplets and a dual-level taxonomy to evaluate single-hop perception and multi-hop reasoning of large audio language models in long-context settings, revealing long-range memory persistence as a bottleneck.

Comments Benchmark Project: https://github.com/LivingFutureLab/VoiceGiraffe

Abstract

While large audio language models (LALMs) have achieved remarkable progress in audio processing at the second- or minute-level scale, understanding hour-level audio remains a fundamental bottleneck. Existing benchmarks predominantly rely on short clips or artificially concatenated segments, failing to faithfully assess LALM capacity for long-range information comprehension in real-world scenarios such as podcasts and lengthy speeches. To address this gap, we introduce VoiceGiraffe, a novel benchmark designed to rigorously evaluate LALMs across diverse real-world scenarios, modalities, and languages under long-context settings. It comprises 1500 curated triplets structured into a dual-level taxonomy of single-hop perception and multi-hop reasoning. We evaluate a broad suite of open-source and proprietary LALMs against human performance. Results underscore three fundamental findings. First, VoiceGiraffe remains highly challenging and far from saturation. Second, we show that no single inference paradigm universally dominates. End-to-end (E2E) inference benefits models with native long-context audio understanding, cascaded caption aggregation stabilizes small models overwhelmed by hour-scale audio, and reasoning-enhanced cascading with an external LLM helps weaker models but can bottleneck stronger proprietary systems. Third, we reveal long-range memory persistence as a key bottleneck. LALMs are better at answering questions that require connecting salient causal cues than those requiring sustained tracking of sparse events across long audio, whereas humans show the opposite pattern. These findings position VoiceGiraffe as a challenging and diagnostic testbed for long-form audio understanding, highlighting the need for LALMs with persistent memory and robust long-range aggregation.

2605.27971 2026-05-28 cs.CL cs.AI

Semantic Flow Regularization: Teaching LLMs to Generate Diverse Yet Coherent Responses

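Kerui Peng, Feifei Li, Xingyu Fan, Wenhui Que

Affiliation: Tencent Inc., Beijing, China

AI summary: To address Cross-Style Collapse, the severe loss of output diversity when large language models are fine-tuned, proposes Semantic Flow Regularization (SFR), which supervises the backbone with continuous sentence embeddings via conditional flow matching, improving diversity and style fidelity at zero deployment cost.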

Abstract

When large language models are fine-tuned to generate persona- or tone-conditioned responses, their output diversity is severely limited--a failure we term Cross-Style Collapse. We trace this collapse to the cross-entropy objective, which under shared representations tends to suppress diverse continuations. We propose Semantic Flow Regularization (SFR), a lightweight auxiliary objective that supervises the backbone with continuous sentence-encoder embeddings of future segments via conditional flow matching. The stochastic flow source preserves multi-modality by construction; the flow-matching head is discarded at inference, adding zero deployment cost. On a large-scale industrial dialogue dataset (Qwen3-32B, 9 personas), SFR improves output diversity, style fidelity, and response quality over SFT. We further validate on the public LiveCodeBench-v5 (Qwen2.5-Coder-7B-Instruct), where SFR consistently improves pass@k, confirming generality beyond stylized dialogue. A controlled comparison on MBPP reveals Multi-Token Prediction to be a degenerate special case of SFR.
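For readers unfamiliar with conditional flow matching, one common form of the objective is written out below. The straight-line interpolation path, the Gaussian source, and all symbols are assumptions made for illustration; the abstract does not state which path or conditioning SFR uses.

```latex
% Conditional flow matching with a straight-line path (a common choice, assumed here).
% x_0 ~ N(0, I): stochastic source; x_1: target embedding of a future segment;
% c: conditioning from the backbone's hidden state; t ~ U[0, 1].
\begin{aligned}
  x_t &= (1 - t)\,x_0 + t\,x_1, \\
  \mathcal{L}_{\mathrm{CFM}} &= \mathbb{E}_{t,\,x_0,\,x_1}
      \big\| v_\theta(x_t, t, c) - (x_1 - x_0) \big\|^2 .
\end{aligned}
```

Because the source x_0 is random, the same conditioning c can be transported to different targets x_1, which is one way to read the abstract's claim that the stochastic source preserves multi-modality by construction.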

2605.27970 2026-05-28 cs.AI

Geometry of Human Perceptual Domains Emerges Transiently in LLM Representations

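Simardeep Singh, Paras Chopra

Affiliation: Indian Institute of Technology Roorkee

AI summary: Studies whether geometric structure resembling human perceptual organization appears in the internal representations of large language models, finding that the geometry of multiple perceptual domains emerges transiently in intermediate layers and aligns with human baselines.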

Comments 19 Pages, 28 Figures

Abstract

While large language models (LLMs) are trained purely on textual data, prior work has shown that their internal representations can exhibit rich geometric structure in embedding space. Building on this line of work, we investigate whether such structure is similar to human perceptual organisation across different domains (e.g., color, pitch, emotion, and taste). Specifically, we study the layer-wise emergence of intrinsic geometrical structure corresponding to perceptual modalities within the residual streams of multiple open-weight transformer architectures. Our results reveal three key findings. First, we observe the emergence of layer-wise geometric structure across multiple perceptual domains, despite the absence of any direct perceptual supervision during training. Second, these perceptual domains exhibit distinct emergence profiles, with both geometric structure and its alignment with human baselines following domain- and model-specific trajectories across depth. Third, this emergence follows a consistent representational trajectory: geometry is weak or diffuse in early layers, becomes progressively organised in intermediate layers, and is attenuated in later layers, suggesting that perceptual geometry arises transiently as part of the model's internal transformation pipeline. This provides new insight into how and where human-like perceptual geometry arises in LLMs, offering a principled pathway for mechanistic analysis of internal representations.

2605.27965 2026-05-28 cs.AI

The Shape of Overthinking: Backtracking Bursts in Long Reasoning Traces

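Navid Rezazadeh, Arash Gholami Davoodi

Affiliations: University of California, Irvine; Carnegie Mellon University

AI summary: Analyzes backtracking dynamics in long reasoning traces, finding that early isolated repair is usually compatible with correct reasoning while incorrect traces more often show persistent, clustered, late moderate-to-severe backtracking, and proposes a burst-aware filtering policy to separate recoverable repair from likely instability.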

Abstract

Reasoning models often generate long traces in which useful self-correction and unproductive revision are hard to distinguish. We study this distinction through backtracking dynamics: local reconsideration, retraction, or re-derivation inside long-form reasoning traces. On 6,000 Qwen3-8B AIME traces, we annotate segment-level backtrack severity and analyze event timing, normalized depth, and local burst structure. We find that early isolated repair is often compatible with correct reasoning, whereas incorrect traces more often show moderate-to-severe backtracks that persist and cluster late. Cross-corpus checks show the same qualitative asymmetry across additional model/domain pairs. Filtering analyses instantiate the signal as a prefix-causal selective early-exit policy: at shallow and intermediate depths, burst-aware filtering outperforms fixed length-based filtering while using only prefix-available features. Moderate length cutoffs remain strong completed-trace baselines, but burst-aware control provides a deployable mechanism for separating recoverable repair from likely instability.
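A prefix-causal burst detector of the kind described can be sketched in a few lines. The severity scale (0 to 3), the window size, the thresholds, and the use of a fixed segment budget to normalize depth are all hypothetical choices for illustration, not the paper's annotation scheme or policy.

```python
def burst_flags(severities, window=3, min_events=2, threshold=2):
    """Flag, for each segment, whether a backtracking burst is active.

    severities: per-segment backtrack severity (0 = none, 1 = mild,
    2 = moderate, 3 = severe). A burst is active at segment t when at least
    `min_events` of the last `window` segments (up to and including t) have
    severity >= threshold. Only the prefix is read, so the flag is causal.
    """
    flags = []
    for t in range(len(severities)):
        recent = severities[max(0, t - window + 1): t + 1]
        flags.append(sum(s >= threshold for s in recent) >= min_events)
    return flags

def early_exit_index(severities, max_segments, min_depth=0.5, **kwargs):
    """First segment at depth >= min_depth (fraction of budget) with an active burst."""
    flags = burst_flags(severities, **kwargs)
    for t, active in enumerate(flags):
        if active and (t + 1) / max_segments >= min_depth:
            return t
    return None

early_repair = [1, 2, 0, 0, 0, 0, 0, 0, 0, 0]  # isolated early fix
late_burst = [0, 0, 1, 0, 0, 0, 2, 3, 2, 0]    # clustered late backtracking

print(early_exit_index(early_repair, max_segments=10))
print(early_exit_index(late_burst, max_segments=10))
```

The isolated early repair never triggers an exit, while the late cluster does, mirroring the asymmetry the abstract reports between recoverable repair and likely instability.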

2605.27962 2026-05-28 cs.CV

Bridging the Generalization Gap in Adverse Weather Segmentation: A Training Recipe Perspective

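Cong Xu, Pu Luo, Yumei Li, Boyou Xue

Affiliation: Xidian University

AI summary: Taking a training-recipe perspective, uses domain-adaptive fine-tuning, multi-source data mixing, scene-balanced sampling, and synthetic degradation augmentation to substantially narrow the validation-test generalization gap in adverse-weather semantic segmentation.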

Abstract

This paper describes our approach for the 8th UG2+ Workshop (CVPR 2026) Track 2, which targets semantic segmentation of outdoor scenes degraded by five weather conditions: blur, darkness, snow, haze, and glare. A central challenge we observe is a severe generalization gap -- models that perform well on the validation set often collapse on the test set. For instance, SegFormer-B5 drops 16.1 mIoU points from validation to test, suggesting that model capacity alone is insufficient for robustness. We investigate whether a carefully designed training recipe, rather than architectural complexity, can address this gap. Starting from a pre-trained SegMAN-S backbone, we systematically study the effects of domain-adaptive fine-tuning, multi-source data mixing, scene-balanced sampling, and synthetic degradation augmentation. Our final system achieves 59.9% mIoU on the official test set while maintaining a validation-test gap of only 6.5 points -- less than half that of larger models. We analyze negative results from architectural modifications, loss function variants, and model scaling to provide practical insights for weather-robust segmentation under limited data.

2605.27960 2026-05-28 cs.CV

Mags-RL: Wearing Multimodal LLMs a Magnifying Glass via Agentic Reinforcement Learning For Complex Scene Reasoning

Mags-RL: 通过智能体强化学习为多模态大语言模型戴上放大镜以进行复杂场景推理

Xuanzhao Dong, Wenhui Zhu, Peijie Qiu, Xiwen Chen, Xiaobing Yu, Xin Li, Zhipeng Wang, Shao Tang, Gen Li, Yujian Xiong, Hao Wang, Yanxi Chen, Prayag Tiwari, Yalin Wang

发表机构 * Arizona State University(亚利桑那州立大学) Clemson University(克莱姆森大学) Washington University in St. Louis(圣路易斯华盛顿大学) Halmstad University(哈姆斯塔德大学) Florida State University(佛罗里达州立大学) Rice University(莱斯大学)

AI总结 提出Mags-RL框架,通过智能体强化学习让多模态大语言模型调用超分辨率代理进行高分辨率细粒度检查,实现两轮推理以提升复杂场景下的视觉推理能力。

详情
AI中文摘要

尽管多模态大语言模型(MLLMs)广受欢迎且成功,但它们在准确解释图像方面常常遇到困难,这限制了它们在复杂场景(如高物体密度和复杂背景杂乱)中的推理能力。先前的工作主要通过引入额外的显式视觉线索(如需要额外标注的边界框)来解决这一限制。此外,由此产生的低分辨率裁剪往往丢失了MLLMs进行准确推理所需的细粒度细节。因此,我们提出了Mags-RL,一个智能体强化学习(RL)框架,它为MLLMs配备了一个外部超分辨率“放大镜”代理,用于高分辨率细粒度检查。具体来说,该模型执行两轮推理:第一轮,它生成初始推理并自主识别感兴趣区域,无需依赖额外标注;第二轮,它调用超分辨率代理裁剪并放大这些区域,然后重新审视并验证其先前的推理以产生最终答案。我们还引入了一种新颖的课程学习策略,实现了数据高效的RL训练,仅需少至40个训练样本即可达到合理的性能。在VSR、TallyQA和GQA子集上的实验表明,与近期强竞争方法相比,它表现出优越的性能,展示了具有精确视觉基础的高质量推理。代码和权重将很快发布。

英文摘要

Despite their popularity and success, Multimodal Large Language Models (MLLMs) often struggle to interpret images accurately, which limits their reasoning capability in complex scenarios (e.g., high object density and complex background clutter). Prior work mainly addresses this limitation by incorporating explicit visual cues like bounding boxes that require extra annotations. In addition, the resulting low-resolution crops often miss fine-grained details that MLLMs require for accurate reasoning. Therefore, we propose Mags-RL, an Agentic Reinforcement Learning (RL) framework that equips MLLMs with an external super-resolution "magnifying glass" agent for high-resolution fine-grained inspection. Specifically, the model performs two-round reasoning: in the first round, it generates an initial rationale and autonomously identifies regions of interest without relying on additional annotations; in the second round, it invokes a super-resolution agent to crop and upscale those regions, then revisits and verifies its earlier reasoning to produce the final answer. We also introduce a novel curriculum learning strategy that enables data-efficient RL training, needing as few as 40 training samples to achieve reasonable performance. Experiments on VSR, TallyQA, and GQA subsets show that it outperforms recent strong competing methods, demonstrating high-quality reasoning with precise visual grounding. Code and weights will be released soon.

2605.27958 2026-05-28 cs.CL cs.AI cs.LG

Pressure-Testing Deception Probes in LLMs: Scaling, Robustness, and the Geometry of Deceptive Representations

压力测试LLM中的欺骗探针:扩展性、鲁棒性与欺骗表示的几何结构

Sachin Kumar

发表机构 * LexisNexis(LexisNexis公司)

AI总结 本文通过系统压力测试,诊断线性探针在分布偏移下失效的原因,发现风格增强可恢复近完美检测,并证明欺骗编码非单一线性方向或熵代理,而是分布式亚阈值特征。

Comments Accepted at the GEM Workshop @ ACL 2026

详情
AI中文摘要

基于LLM激活训练的线性探针越来越多地被提议作为欺骗检测指标,但在干净基准上报告AUROC超过0.96,而在分布偏移下崩溃。本文系统地对Gemma 3模型家族(1B-27B参数)的探针指标进行压力测试,诊断其失败原因而不仅仅是记录失败。我们测试了关于欺骗编码的四个假设:(1)单一线性方向,(2)多维子空间,(3)凸锥包,(4)熵代理。我们的设计包括跨域转移矩阵、基于排列零基线的多维探针分析、熵残差化测试以及8种风格偏移下的干扰评估。我们发现:(a)探针在干净数据上达到近乎完美的AUROC(>=0.998),但在风格偏移下崩溃;风格增强的探针在未见风格上恢复近乎完美的检测(平均AUROC 0.979-0.983);(b)单一方向假设被拒绝(k=1仅捕获0.61-0.80 AUROC),跨域转移失败被确认为几何原因而非层不匹配驱动;(c)熵代理假设被拒绝(最大|rho|=0.454,残差化后最大Delta-AUROC=0.004);(d)欺骗并未形成显著的线性子空间(每域k*=0),但多维探针(k>=5)通过分布式亚阈值特征恢复信号。探针脆弱性反映了分布狭窄性而非架构限制:风格增强的探针在4B和27B均恢复近乎完美的检测,表明逆缩放模式是训练分布伪影而非真正的规模依赖现象。

英文摘要

Linear probes trained on LLM activations are increasingly proposed as deception-detection metrics, yet report AUROC exceeding 0.96 on clean benchmarks while collapsing under distributional shift. This paper systematically pressure-tests probe-based metrics across the Gemma 3 model family (1B-27B parameters), diagnosing why they fail rather than merely documenting that they fail. We test four hypotheses about deception encoding: (1) single linear direction, (2) multi-dimensional subspace, (3) convex conic hull, (4) entropy proxy. Our design includes cross-domain transfer matrices, multi-dimensional probe analysis with permutation null baselines, entropy-residualization tests, and distractor evaluations across 8 stylistic shifts. We find that: (a) probes achieve near-perfect AUROC (>=0.998) on clean data but collapse under stylistic shifts; style-augmented probes recover near-perfect detection (mean AUROC 0.979-0.983) on unseen styles; (b) the single-direction hypothesis is rejected (k=1 captures only 0.61-0.80 AUROC), with cross-domain transfer failure confirmed as geometric rather than layer-mismatch-driven; (c) the entropy-proxy hypothesis is rejected (max |rho|=0.454, max Delta-AUROC after residualization=0.004); and (d) deception does not form a significant linear subspace (per-domain k*=0), yet multi-dimensional probes (k>=5) recover the signal through distributed sub-threshold features. Probe fragility reflects distributional narrowness rather than an architectural limitation: style-augmented probes recover near-perfect detection at both 4B and 27B, establishing that the inverse scaling pattern is a training-distribution artifact rather than a genuine scale-dependent phenomenon.
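To make the entropy-residualization test more concrete, the sketch below computes AUROC for a set of probe scores, regresses the scores on an entropy covariate with ordinary least squares, and recomputes AUROC on the residuals. If deception scores were only an entropy proxy, the residual AUROC would fall sharply. The data values and the helper names `auroc` and `residualize` are invented for illustration and are not taken from the paper.

```python
def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly,
    counting ties as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def residualize(scores, covariate):
    """Remove the linear effect of `covariate` from `scores` (simple OLS)."""
    n = len(scores)
    mean_x = sum(covariate) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in covariate)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(covariate, scores))
    slope = sxy / sxx
    return [y - (mean_y + slope * (x - mean_x)) for x, y in zip(covariate, scores)]


# Toy data: 1 = deceptive, 0 = honest (all values are made up).
labels = [0, 0, 0, 1, 1, 1]
probe_scores = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
entropy = [1.0, 2.0, 1.5, 1.2, 1.8, 1.6]

before = auroc(probe_scores, labels)
after = auroc(residualize(probe_scores, entropy), labels)
delta = before - after  # a small delta suggests the probe is not an entropy proxy
```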

2605.27957 2026-05-28 cs.CL

DisasterBench: Benchmarking LLM Planning under Typed Tool Interface Constraints

DisasterBench: 在类型化工具接口约束下基准测试LLM规划

Zhitong Chen, Kai Yin, Weifeng Zhang, Zhiyuan Wang, Xiangjue Dong, Chengkai Liu, Zhewei Liu, Yiming Xiao, Ali Mostafavi, James Caverlee

发表机构 * Texas A&M University(德克萨斯A&M大学) University of Toronto(多伦多大学)

AI总结 提出DisasterBench基准,通过类型化工具接口评估LLM在灾害响应中的结构化多智能体规划能力,并引入首次故障点(FPoF)方法进行步骤级故障归因,揭示语义推理与执行约束之间的差距。

详情
AI中文摘要

灾害造成严重的社会影响,需要快速协调异构AI工具(从卫星分析到洪水预测和损害评估)形成连贯的多步骤工作流。随着LLM越来越多地充当此类管道的编排者,有效的协调需要的不仅仅是选择语义上合理的工具:LLM必须生成具有正确参数绑定和依赖传播的可执行工作流。我们引入了DisasterBench,这是一个基准,用于评估在语义相似但操作上不同的灾害响应工具上的结构化多智能体规划。为了实现步骤级故障归因,我们进一步提出了首次故障点(FPoF),它定位预测工作流中最早的根因,将主要错误与下游级联效应分开。我们的评估揭示了三个发现:规划方法的有效性强烈依赖于模型容量;工具不匹配和参数绑定错误主导了首次故障,揭示了语义基础和执行一致性是不同瓶颈;冗长的中间推理可能与结构化输出要求产生指令冲突,破坏计划生成。总之,这些发现凸显了语义推理与执行基础协调之间的根本差距,强调了需要联合建模语义意图、执行约束和工作流一致性的规划框架。代码、数据和评估资源可在 https://github.com/TamuChen18/DisasterBench_Open 获取。

英文摘要

Disasters cause severe societal impacts, demanding rapid coordination of heterogeneous AI tools, from satellite analysis to flood prediction and damage assessment, into coherent multi-step workflows. As LLMs increasingly serve as orchestrators of such pipelines, effective coordination requires more than selecting semantically plausible tools: LLMs must generate executable workflows with correct parameter binding and dependency propagation. We introduce DisasterBench, a benchmark for evaluating structured multi-agent planning over semantically similar but operationally distinct disaster-response tools. To enable step-level failure attribution, we further propose First-Point-of-Failure (FPoF), which localizes the earliest root cause in a predicted workflow, separating primary errors from downstream cascading effects. Our evaluation reveals three findings: planning method effectiveness depends strongly on model capacity; tool mismatch and parameter-binding errors dominate first failures, revealing semantic grounding and execution consistency as distinct bottlenecks; and verbose intermediate reasoning can create instruction clash with structured output requirements, disrupting plan generation. Together, these findings highlight a fundamental gap between semantic reasoning and execution-grounded coordination, underscoring the need for planning frameworks that jointly model semantic intent, execution constraints, and workflow consistency. Code, data, and evaluation resources are available at: https://github.com/TamuChen18/DisasterBench_Open
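The First-Point-of-Failure idea can be sketched as a step-by-step comparison of a predicted workflow against a reference that reports only the earliest divergence, so later errors are treated as possible downstream effects. The step schema (`tool`, `params`), the tool names, the error labels, and the function name below are hypothetical and chosen for illustration; they are not the benchmark's actual data format or scoring code.

```python
from typing import Optional, Tuple


def first_point_of_failure(predicted: list, reference: list) -> Optional[Tuple[int, str]]:
    """Return (step index, error type) for the earliest divergence, or None
    if the predicted workflow matches the reference exactly."""
    for i, ref in enumerate(reference):
        if i >= len(predicted):
            return (i, "missing_step")
        pred = predicted[i]
        if pred["tool"] != ref["tool"]:
            return (i, "tool_mismatch")
        if pred["params"] != ref["params"]:
            return (i, "parameter_binding")
    if len(predicted) > len(reference):
        return (len(reference), "extra_step")
    return None


reference = [
    {"tool": "satellite_analysis", "params": {"region": "R1"}},
    {"tool": "flood_prediction", "params": {"imagery": "$step0"}},
    {"tool": "damage_assessment", "params": {"flood_map": "$step1"}},
]
predicted = [
    {"tool": "satellite_analysis", "params": {"region": "R1"}},
    {"tool": "flood_prediction", "params": {"imagery": "$step2"}},  # wrong dependency
    {"tool": "damage_report", "params": {"flood_map": "$step1"}},   # later error, not reported
]

result = first_point_of_failure(predicted, reference)
```

Here the second step binds the wrong upstream output, so it is reported as the root cause even though the third step also picks the wrong tool.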

2605.27954 2026-05-28 cs.LG

Cyclical Entropy Eruption: Entropy Dynamics in Agent Reinforcement Learning

循环熵爆发:智能体强化学习中的熵动力学

Wendi Li, Shawn Im, Sharon Li

发表机构 * Department of Computer Sciences, University of Wisconsin–Madison(威斯康星大学麦迪逊分校计算机科学系)

AI总结 本文发现智能体强化学习训练中存在循环熵爆发现象,并提出了SEAL辅助损失函数来稳定训练、提升性能。

详情
AI中文摘要

智能体大型语言模型通过推理目标、调用工具和与外部环境交互,越来越多地被用于解决现实世界任务。强化学习为改进这些行为提供了自然框架,最近的智能体RL方法在多个领域取得了强劲成果。然而,智能体RL的训练动力学仍然知之甚少,限制了诊断不稳定性和设计更有效训练算法的能力。在这项工作中,我们识别了智能体RL中一个先前未被充分探索的现象,我们称之为循环熵爆发。与单轮推理RL(其中熵通常崩溃并保持低位)不同,智能体RL训练表现出独特的重复循环:熵急剧爆发然后逐渐消退。我们将这种动态分解为三个阶段,并对每个阶段进行理论和实证分析,解释其循环振荡的机制。我们进一步表明,一旦在爆发期间获得,诸如句子重复和幻觉等退化模式可以在循环中持续并累积。受这些发现的启发,我们提出了SEAL(分离增强型智能体学习),一种轻量级辅助损失,它在表示空间中分离正确和错误轨迹,直接针对熵爆发的根本原因。跨多个基准、模型和RL算法的实验表明,SEAL稳定了训练并产生了更强的下游智能体性能。

英文摘要

Agentic large language models are increasingly used to solve real-world tasks by reasoning over goals, invoking tools, and interacting with external environments. Reinforcement learning provides a natural framework for improving these behaviors, and recent agent RL methods have achieved strong results across domains. However, the training dynamics of agent RL remain poorly understood, limiting our ability to diagnose instabilities and design more effective training algorithms. In this work, we identify a previously underexplored phenomenon in agent RL, which we term cyclical entropy eruption. Unlike single-turn reasoning RL, where entropy typically collapses and stays low, agent RL training exhibits unique recurring cycles of sharp entropy eruption and gradual subsidence. We decompose this dynamic into three phases and provide theoretical and empirical analyses of each, explaining the mechanisms underlying its cyclical oscillation. We further show that degenerate patterns such as sentence duplication and hallucination, once acquired during eruption, can persist and accumulate across cycles. Motivated by these findings, we propose SEAL (Separation-Enhanced Agent Learning), a lightweight auxiliary loss that separates correct and incorrect trajectories in representation space, directly targeting the root cause of entropy eruption. Experiments across multiple benchmarks, models, and RL algorithms demonstrate that SEAL stabilizes training and yields stronger downstream agent performance.

2605.27952 2026-05-28 cs.CV cs.RO

Con-DSO: Learning Short-Horizon Consistency Priors for RGB-D Direct Sparse Odometry

Con-DSO:学习RGB-D直接稀疏里程计的短时一致性先验

Haolan Zhang, Thanh Nguyen Canh, Chenghao Li, Ziyan Gao, Xiongwen Jiang, Nak Young Chong

发表机构 * School of Information Science, Japan Advanced Institute of Science and Technology(北陆先端科学技术大学院大学信息科学学院) College of Information Engineering, Shenyang University of Chemical Technology(沈阳化工大学信息工程学院)

AI总结 提出Con-DSO框架,通过预测光度与深度几何一致性不确定性,实现质量感知的像素选择和加权,提升RGB-D直接稀疏里程计在动态、遮挡等挑战环境下的鲁棒性。

Comments Submitted

详情
AI中文摘要

视觉里程计(VO)是机器人和增强现实中的基础组件。RGB-D直接VO受益于度量深度测量,但在动态物体、遮挡、光照变化和不可靠深度违反直接对齐所使用的短时光度和深度几何一致性假设的挑战环境中,性能会下降。现有方法通过语义过滤、显式遮挡推理、光照适应或手工几何准则来缓解这些问题,但通常依赖外部模块或针对个别故障模式的固定假设,限制了其灵活性和以统一方式处理多样挑战的能力。本文提出Con-DSO,一种一致性感知的RGB-D直接稀疏里程计框架,从时间相邻的RGB-D帧对预测密集的光度和深度几何一致性不确定性。一致性网络通过流引导的光度误差和投影深度一致性误差进行训练,使得一致性违规可表示为像素级不确定性。这些成对不确定性预测被转换为关键帧跟踪的主机侧质量先验。该先验随后通过质量感知的支持像素选择和位姿估计中的解耦光度-几何加权应用于VO,使得不可靠观测持续衰减,而非硬拒绝或基于阈值的门控。在五个公开RGB-D基准上的实验表明,与直接RGB-D VO基线相比,在ICL-NUIM上绝对轨迹误差降低超过20%,在RGB-D Scenes V2、TUM/Bonn Dynamic和OpenLORIS序列上降低50%-80%。

英文摘要

Visual odometry (VO) is a fundamental component in robotics and augmented reality. RGB-D direct VO benefits from metric depth measurements, but it can degrade in challenging environments, where dynamic objects, occlusions, illumination changes, and unreliable depth violate the short-horizon photometric and depth-geometric consistency assumptions used by direct alignment. Existing approaches mitigate these issues through semantic filtering, explicit occlusion reasoning, illumination adaptation, or hand-crafted geometric criteria, but often rely on external modules or fixed assumptions tailored to individual failure modes, limiting their flexibility and ability to handle diverse challenges in a unified manner. In this work, we propose Con-DSO, a consistency-aware RGB-D direct sparse odometry framework that predicts dense photometric and depth-geometric consistency uncertainty from temporally adjacent RGB-D frame pairs. The consistency network is trained using flow-guided photometric errors and projective depth-consistency errors, allowing consistency violations to be represented as pixel-level uncertainty. These pairwise uncertainty predictions are converted into a host-side quality prior for keyframe-based tracking. The prior is then applied to VO through quality-aware support-pixel selection and decoupled photometric-geometric weighting during pose estimation, enabling continuous attenuation of unreliable observations rather than hard rejection or threshold-based gating. Experiments on five public RGB-D benchmarks show substantial gains over direct RGB-D VO baselines, with over 20% absolute trajectory error reduction on ICL-NUIM and 50%-80% reductions on RGB-D Scenes V2, TUM/Bonn Dynamic, and OpenLORIS sequences.

2605.27950 2026-05-28 cs.CV

Evaluating the Feasibility of Inferring Dietary Behavior Change Receptivity from Egocentric Images of Eating Environment

从以自我为中心的饮食环境图像推断饮食行为改变接受度的可行性评估

Long Li, Yuning Huang, Heather A. Eicher-Miller, J. Graham Thomas, Fengqing Zhu, Edward Sazonov

发表机构 * The University of Alabama(阿拉巴马大学) Purdue University(普渡大学) Brown University(布朗大学)

AI总结 本研究利用可穿戴相机收集的以自我为中心的饮食图像,通过预训练CLIP视觉编码器和轻量级Transformer分类器,初步验证了被动感知推断饮食行为改变接受度的可行性。

详情
AI中文摘要

准确评估饮食行为改变接受度对于设计有效的即时自适应干预措施(JITAIs)以促进更健康的饮食习惯至关重要。然而,基于自我报告的行为改变接受度评估稀疏且延迟,限制了其在持续监测中的实际应用。为探索被动感知是否有助于解决这一挑战,本研究进行了一项初步调查,从可穿戴相机收集的以自我为中心的饮食图像中推断参与者自我报告的行为改变接受度。我们使用自动摄入监测器v2(AIM-2)在自由生活饮食事件中获取的初步数据。数据包括饮食期间捕获的以自我为中心的图像序列,并配以评估行为改变接受度特定维度(意识、互动能力和动机)的问题的回答。为了检查视觉信息是否与这些回答相关,我们评估了一个结合预训练对比语言-图像预训练(CLIP)视觉编码器和轻量级Transformer分类器的迁移学习辅助框架。该模型处理饮食事件图像序列,以提取与行为改变接受度相关的潜在语义和时间线索。初步实验结果显示,在行为改变接受度指标上,相比于简单基线模型有显著改进。这些早期发现表明,以自我为中心的饮食事件图像可能包含与饮食行为改变接受度相关的线索,并需要在更大、更全面的数据集上进行进一步研究。

英文摘要

Accurately assessing dietary behavior change receptivity is essential for designing effective just-in-time adaptive interventions (JITAIs) that promote healthier eating habits. However, self-report-based assessment of behavior change receptivity is sparse and delayed, limiting its practical use in continuous monitoring. To explore whether passive sensing may help address this challenge, this study conducts a pilot investigation of inferring participants' self-reported behavior change receptivity from egocentric eating images collected by a wearable camera. We use pilot data obtained from free-living eating episodes using the Automatic Ingestion Monitor v2 (AIM-2). The data include egocentric image sequences captured during eating, paired with responses to questions assessing specific dimensions of behavior change receptivity (awareness, interaction capability, and motivation). To examine whether visual information is relevant to these responses, we evaluate a transfer-learning-assisted framework that combines a pre-trained Contrastive Language-Image Pre-Training (CLIP) vision encoder with a lightweight transformer classifier. The model processes eating episode image sequences to extract potential semantic and temporal cues related to behavior change receptivity. Preliminary experimental results show promising improvements over simple baseline models for behavior change receptivity indicators. These early findings suggest that egocentric eating episode images may contain cues related to dietary behavior change receptivity, and warrant further investigation with larger and more comprehensive datasets.

2605.27948 2026-05-28 cs.RO

VLM-Based Advanced Rider Assistance System for Motorcycle Safety

基于VLM的摩托车安全高级骑手辅助系统

Mohamed Elnoor, Francesca Baldini, Ananya Trivedi, Faizan M. Tariq, Jovin D'sa, David Isele, Sangjae Bae, Dinesh Manocha, Yosuke Sakamoto

发表机构 * HRI Honda Research Institute(本田研究院) University of Maryland(马里兰大学) Northeastern University(东北大学)

AI总结 提出一种利用视觉语言模型进行语义感知和风险感知规划的摩托车高级骑手辅助系统,通过构建密集风险地图并采用基于采样的规划器,在CARLA模拟器中实现更高的成功率和更低的风险暴露。

Comments Accepted to IEEE IV 2026

详情
AI中文摘要

与汽车相比,摩托车由于防护有限且对路面危险更敏感,面临不成比例的高碰撞风险,然而高级骑手辅助系统(ARAS)相对于高级驾驶辅助系统(ADAS)仍不发达。我们提出一种新颖的ARAS,通过语义感知和风险感知规划来提升摩托车安全性。我们的方法利用视觉语言模型(VLM)进行上下文危险推理,并将其与基于分割的检测相结合,以构建密集风险地图。这些地图编码了语义特征(例如,坑洼严重程度、水坑湿滑度)和物理属性(例如,大小、深度),从而产生捕捉摩托车特定风险的逐像素危险成本。这些地图被一个针对摩托车动力学定制的基于采样的规划器使用,以推荐油门和转向动作,在向目的地前进的同时最小化危险暴露。我们在CARLA模拟器的不同场景中评估了我们的系统。与基线方法相比,我们的方法实现了更高的成功率和更低的危险暴露,同时定性结果展示了可解释的风险地图和安全的轨迹推荐。

英文摘要

Motorcycles face disproportionately high crash risks compared to cars due to limited protection and heightened sensitivity to surface hazards, yet Advanced Rider Assistance Systems (ARAS) remain underdeveloped relative to Advanced Driver Assistance Systems (ADAS). We propose a novel ARAS that enhances motorcycle safety through semantic perception and risk-aware planning. Our approach leverages Vision-Language Models (VLMs) for contextual hazard reasoning and integrates them with segmentation-based detection to construct dense risk maps. These maps encode both semantic characteristics (e.g., pothole severity, puddle slipperiness) and physical attributes (e.g., size, depth), which produce per-pixel hazard costs that capture motorcycle-specific risks. These maps are used by a sampling-based planner tailored to motorcycle dynamics to recommend throttle and steering actions that minimize hazard exposure while advancing toward the destination. We evaluate our system in different scenarios in the CARLA simulator. Compared to the baseline method, our method achieves higher success rates and lower hazard exposure, while qualitative results demonstrate interpretable risk maps and safe trajectory recommendations.

2605.27947 2026-05-28 cs.RO

SANTS: A State-Adaptive Scheduler for World Action Models

SANTS:面向世界动作模型的状态自适应调度器

Yirui Sun, Guangyu Zhuge, Keliang Liu, Jie Gu, Xinyu Bing, Zhongxue Gan, Chunxu Tian

发表机构 * Fudan University(复旦大学) Harbin Institute of Technology(哈尔滨工业大学) Deep Computing Era Technology Co., Ltd(深计算时代科技有限公司)

AI总结 提出状态自适应噪声轨迹调度器(SANTS),通过根据视频状态动态选择去噪深度来优化视频到动作的扩散策略,在保持控制性能的同时大幅降低推理延迟。

Comments 17 pages, 5 figures, 8 tables. Project page: https://advanced-robotics-lab.github.io/SANTS/

详情
AI中文摘要

世界动作模型(WAMs)通过使用基于视频的未来表示来条件化动作生成,从而改进机器人操作。然而,在像素空间WAM中,最佳动作条件不一定是完全去噪的视频。受控去噪深度扫描显示,视频细化可以降低动作误差,直到一个状态依赖的点,此后当后期预测变得与动作相关性较低或物理上不可靠时,增益可能饱和甚至逆转。这表明动作生成应使用沿视频噪声轨迹的状态依赖点,而不是固定的终端去噪深度。我们引入了状态自适应噪声轨迹调度器(SANTS),一种用于视频到动作扩散策略的轻量级调度器。在每个视频决策点,SANTS读取当前视频状态表示和噪声水平,然后联合预测累积停止风险和相对噪声进展比率。SANTS在冻结的动作分支生成最终动作块后,通过路径级奖励进行后训练,因此调度器针对下游动作质量而非中间视频保真度进行优化,同时显式惩罚冗余的视频状态更新。实验表明,SANTS在RoboTwin 2.0上达到94.4%的整体成功率,在七个真实机器人任务上平均成功率为73.1%,同时相对于完全视频去噪分别降低了81.7%和79.0%的延迟。这些结果表明,沿视频噪声轨迹的自适应选择可以保留WAM式未来推理的控制优势,同时消除其大部分冗余推理成本。

英文摘要

World Action Models (WAMs) improve robot manipulation by using video-based future representations to condition action generation. In pixel-space WAMs, however, the best action condition is not necessarily the fully denoised video. Controlled denoising-depth scans show that video refinement can reduce action error up to a state-dependent point, after which the gain may saturate or even reverse when late predictions become less action-relevant or physically unreliable. This suggests that action generation should use a state-dependent point along the video noise trajectory rather than a fixed terminal denoising depth. We introduce State-Adaptive Noise Trajectory Scheduler (SANTS), a lightweight scheduler for video-to-action diffusion policies. At each video decision point, SANTS reads the current video-state representation and noise level, then jointly predicts a cumulative stopping hazard and a relative noise-progression ratio. SANTS is post-trained with a path-level reward computed after the frozen action branch generates the final action chunk, so the scheduler is optimized for downstream action quality rather than intermediate video fidelity, while redundant video-state updates are explicitly penalized. Experiments show that SANTS reaches 94.4% overall success on RoboTwin 2.0 and 73.1% average success across seven real-robot tasks, while reducing latency by 81.7% and 79.0% relative to full video denoising, respectively. These results indicate that adaptive selection along the video noise trajectory can preserve the control benefits of WAM-style future reasoning while removing much of its redundant inference cost.
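One way to picture a scheduler that predicts a stopping hazard and a noise-progression ratio is the loop below: at each decision point the hazard is folded into a running survival probability, denoising stops once the cumulative stopping probability crosses a threshold, and otherwise the noise level is scaled by the predicted ratio. Everything here is an assumption for illustration, including the `toy_scheduler` stand-in (the real scheduler is learned), the threshold rule, and the multiplicative noise update; the abstract does not give these details.

```python
def toy_scheduler(state, noise_level):
    """Stand-in for a learned scheduler: returns (hazard, ratio).
    The hazard simply grows as the noise level falls; the ratio is fixed."""
    hazard = 1.0 - noise_level
    ratio = 0.5
    return hazard, ratio


def adaptive_denoise(noise_level=1.0, stop_threshold=0.5, max_steps=10):
    """Run denoising updates until the cumulative stopping probability
    reaches `stop_threshold`; return (updates performed, final noise level)."""
    survival = 1.0  # probability of not having stopped yet
    steps = 0
    while steps < max_steps:
        hazard, ratio = toy_scheduler(None, noise_level)
        survival *= 1.0 - hazard
        if 1.0 - survival >= stop_threshold:
            break  # hand the current video state to the action branch
        noise_level *= ratio  # one denoising update
        steps += 1
    return steps, noise_level


early = adaptive_denoise(stop_threshold=0.5)
late = adaptive_denoise(stop_threshold=0.9)
```

A lower threshold stops after fewer video updates, which is where the latency saving would come from in a scheme like this.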

2605.27944 2026-05-28 cs.AI cs.MM cs.SD

From Talking to Singing: A New Challenge for Audio-Visual Deepfake Detection

从说话到唱歌:音视频深度伪造检测的新挑战

Ke Liu, Jiwei Wei, Wenyu Zhang, Shuchang Zhou, Ruikun Chai, Yutao Dai, Chaoning Zhang, Yang Yang

发表机构 * Center for Future Media, School of Computer Science and Engineering, University of Electronic Science and Technology of China(未来媒体中心,计算机科学与工程学院,电子科技大学)

AI总结 针对现有音视频深度伪造检测方法在唱歌场景中性能下降的问题,提出文本引导的音视频伪造检测框架(T-AVFD),通过面部真实性模式学习和多模态差异权重学习,在说话和唱歌场景中均实现鲁棒检测。

Comments Accepted by ICML 2026

详情
AI中文摘要

随着音视频生成模型的快速发展,可靠的伪造检测变得日益关键。现有的音视频深度伪造检测方法通常依赖于跨模态不一致性。在唱歌中,有节奏的发声削弱了这种耦合,并引入了显著的领域偏移,大幅降低了检测性能。我们使用节奏感知生成模型构建了唱歌头部深度伪造(SHDF)数据集,以填补唱歌基准的空白。为了应对跨场景领域偏移,我们提出了文本引导的音视频伪造检测(T-AVFD)框架,该框架在说话和唱歌场景中均具有泛化能力。T-AVFD 包含一个面部真实性模式学习器和一个多模态差异权重学习模块。模式学习器将面部特征与多粒度文本描述对齐,以学习可泛化的真实性模式。权重学习模块保留固有的音视频一致性,并通过差异权重将其与真实性模式自适应地整合。在多个说话头部深度伪造数据集和 SHDF 上的大量实验表明,该方法在现有基线上取得了一致的改进,并在多种扰动下表现出强大的鲁棒性。

英文摘要

With rapid advances in audio-visual generative models, reliable forgery detection becomes increasingly critical. Existing methods for audio-visual deepfake detection typically rely on cross-modal inconsistencies. In singing, rhythmic vocalization weakens this coupling and introduces a nontrivial domain shift, substantially degrading detection performance. We construct the Singing Head DeepFake (SHDF) dataset using rhythm-aware generative models to fill the gap in singing benchmarks. To cope with cross-scenario domain shifts, we propose a Text-guided Audio-Visual Forgery Detection (T-AVFD) framework that generalizes across both talking and singing scenarios. T-AVFD comprises a facial authenticity pattern learner and a multi-modal differential weight learning module. The pattern learner aligns facial features with multi-granularity textual descriptions to learn generalizable authenticity patterns. The weight learning module preserves intrinsic audio-visual consistency and adaptively integrates it with authenticity patterns via differential weighting. Extensive experiments on multiple talking head deepfake datasets and SHDF show consistent improvements over existing baselines and strong robustness under diverse perturbations.

2605.27938 2026-05-28 cs.CV

SEMAGIC: Learning Semantically Consistent Deformable 3D Representations from In-the-Wild Images

SEMAGIC: 从野外图像中学习语义一致的可变形3D表示

Sky Cen, Wufei Ma, Guofeng Zhang, Alan Yuille, Adam Kortylewski

发表机构 * Johns Hopkins University(约翰霍普金斯大学) CISPA Helmholtz Center for Information Security(CISPA亥姆霍兹信息安全中心)

AI总结 针对现有可变形3D重建方法语义对应不稳定的问题,提出SEMAGIC框架,通过特征级一致性损失和顶点索引条件变形,在重建过程中强制语义一致性,从而提升类别级语义对应性能。

详情
AI中文摘要

从单视图野外图像中学习可变形3D物体模型已实现了无需监督的令人印象深刻的3D形状重建。然而,这些模型是否捕捉到下游任务所需的语义结构仍不清楚。我们发现,现有的可变形重建方法尽管生成了视觉上合理的几何形状,但在实例间产生了不稳定的对应关系,并在语义对应基准上表现不佳。我们引入了SEMAGIC,一个从单视图野外图像中学习语义一致的可变形3D表示的框架。SEMAGIC不将重建视为最终目标,而是将可变形建模作为发现类别级对应关系的机制。每个类别由一个规范模板网格和一个学习到的变形场表示,其功能类似于一个从图像特征重建实例几何的自编码器,使得顶点能够在实例间保持一致的语义含义。训练过程中通过(i)对齐规范网格和变形网格之间语义特征的特征级一致性损失,以及(ii)保持实例间语义对应的顶点索引条件变形,来强制语义一致性。通过将几何变形与语义对齐显式耦合,SEMAGIC生成了在类别内变化中保持稳定部件对应的表示。实验表明,SEMAGIC在SPair-71k上将可变形模型的语义对应提高了+14.7 PCK@0.1,确立了可变形模型作为有效语义3D表示的地位。

英文摘要

Learning deformable 3D object models from single-view in-the-wild images has enabled impressive 3D shape reconstruction without supervision. However, it remains unclear whether these models capture the semantic structure required for downstream tasks. We find that existing deformable reconstruction approaches, despite producing visually plausible geometry, yield unstable correspondences across instances and perform poorly on semantic correspondence benchmarks. We introduce SEMAGIC, a framework for learning semantically consistent deformable 3D representations from single-view in-the-wild images. Rather than treating reconstruction as the end goal, SEMAGIC uses deformable modeling as a mechanism to discover category-level correspondences. Each category is represented by a canonical template mesh and a learned deformation field, functioning similarly to an autoencoder that reconstructs instance geometry from image features, enabling vertices to maintain consistent semantic meaning across instances. Semantic consistency is enforced during training through (i) a feature-level consistency loss aligning semantic features between canonical and deformed meshes, and (ii) vertex-index-conditioned deformation that preserves semantic correspondence across instances. By explicitly coupling geometric deformation with semantic alignment, SEMAGIC produces representations that maintain stable part correspondences across intra-category variation. Experiments demonstrate that SEMAGIC improves semantic correspondence of deformable models by +14.7 PCK@0.1 on SPair-71k, establishing deformable models as effective semantic 3D representations.
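For readers unfamiliar with the PCK@0.1 metric quoted above, the sketch below shows the usual definition: a predicted keypoint counts as correct when it lies within 0.1 times the larger side of the object bounding box from its ground-truth location. The bounding-box normalization is the common convention and is assumed here; the coordinates are made up, and the paper's exact evaluation protocol may differ.

```python
import math


def pck(pred, gt, bbox_wh, alpha=0.1):
    """Percentage of Correct Keypoints: fraction of predictions within
    alpha * max(bbox width, bbox height) of the ground truth."""
    threshold = alpha * max(bbox_wh)
    correct = sum(math.dist(p, g) <= threshold for p, g in zip(pred, gt))
    return correct / len(gt)


gt_keypoints = [(10.0, 10.0), (50.0, 40.0), (80.0, 90.0)]
pred_keypoints = [(12.0, 11.0), (50.0, 55.0), (83.0, 94.0)]

# Threshold is 0.1 * 100 = 10 pixels; the second keypoint is 15 pixels off.
score = pck(pred_keypoints, gt_keypoints, bbox_wh=(100.0, 80.0))
```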

2605.27934 2026-05-28 cs.CL

GeneralThinker: Domain-General Reasoning through Likelihood-Guided Answer-Conditioned Optimization

GeneralThinker: 通过似然引导的答案条件优化实现领域通用推理

Shengmin Piao, Sanghyun Park

发表机构 * Yonsei University(延世大学)

AI总结 提出GeneralThinker框架,利用答案似然进行密集监督和细粒度信用分配,无需领域特定验证器,在数学、STEM和通用推理等11个基准上取得最佳平均性能。

详情
AI中文摘要

基于可验证奖励的强化学习提升了语言模型的推理能力,但其对领域特定验证器的依赖、稀疏的结果奖励以及粗粒度的信用分配限制了其适用性。我们提出了GeneralThinker,一个在策略框架,将推理监督重新表述为密集的答案条件优化,无需领域特定验证器即可实现响应级评估和令牌级信用分配。GeneralThinker使用真实答案的似然来评估生成的推理轨迹,并推导出令牌级的兼容性信号用于细粒度信用分配。为了稳定优化,它通过裁剪和方向保持调制来约束令牌级更新。在涵盖数学、STEM和通用推理的11个基准测试中,GeneralThinker取得了最佳平均性能。进一步分析表明,不受控的令牌级调制可能破坏训练稳定性,而受控的调制使细粒度信用分配始终有效。

英文摘要

Reinforcement learning with verifiable rewards improves language model reasoning, but its reliance on domain-specific verifiers, sparse outcome rewards, and coarse-grained credit assignment limits its applicability. We introduce GeneralThinker, an on-policy framework that reformulates reasoning supervision as dense answer-conditioned optimization, enabling response-level evaluation and token-level credit assignment without domain-specific verifiers. GeneralThinker evaluates generated reasoning trajectories using the likelihood of the ground-truth answer and derives token-wise compatibility signals for fine-grained credit assignment. To stabilize optimization, it constrains token-level updates through clipping and direction-preserving modulation. Across 11 benchmarks spanning mathematics, STEM, and general reasoning, GeneralThinker achieves the best average performance. Further analyses show that uncontrolled token-level modulation can destabilize training, whereas controlled modulation makes fine-grained credit assignment consistently effective.

2605.27932 2026-05-28 cs.CV cs.AI cs.CL cs.CR cs.LG

When Think-with-Image Meets Safety: What Determines Multimodal Jailbreak Robustness?

当图文推理遇上安全:什么决定了多模态越狱鲁棒性?

Yuan Tian, Bing Hu, Fang Wu, Xiaomin Li, Binghang Lu, Neil Zhenqiang Gong

发表机构 * Independent Researcher(独立研究者) Stanford University(斯坦福大学) Harvard University(哈佛大学) Purdue University(普渡大学) Duke University(杜克大学)

AI总结 本文研究多模态大语言模型中不同图文推理范式对越狱鲁棒性的影响,发现显式图像工具交互能显著降低攻击成功率,并通过引入图像工具安全向量框架从表征层面解释其机制。

Comments 17 pages, 6 figures, 7 tables

详情
AI中文摘要

图文推理正成为大型视觉-语言模型的一种新推理范式,但其安全性影响尚不明确。现有系统已涵盖多种流程设计,包括直接响应生成、纯文本前轮、视觉状态操作以及显式外部图像工具调用。本文探究这些评估范式中哪一种能提升多模态越狱鲁棒性及其原因。在多个视觉-语言模型上,我们的实验表明显式图像工具交互的攻击成功率最低,平均相对降低约30%。这一发现起初令人惊讶:即使返回的图像工具输出被人为覆盖或本身不安全,攻击成功率仍保持较低,但在纯文本前轮控制下又恢复到接近直接回答的水平。这些结果表明,较低的攻击成功率并非由良性返回图像语义或仅文本图像工具轨迹解释。为解释这一模式,我们引入了一个图像工具安全向量框架,将图像工具调用建模为隐藏表示向安全相关方向的残差偏移。表征层面的分析和激活干预支持了这一解释。总体而言,我们的结果表明,显式图像工具交互是提升越狱鲁棒性的一种有前景的设计模式,同时也推动了针对特定流程的安全性评估。

英文摘要

Think-with-image reasoning is emerging as a new inference paradigm for large vision-language models, but its safety implications remain poorly understood. Existing systems already span multiple process designs, including direct response generation, text-only prior turn, visual-state manipulation, and explicit external image-tool invocation. In this paper, we ask which of these evaluated paradigms improves multimodal jailbreak robustness, and why. Across multiple vision-language models, explicit image-tool interaction yields the lowest attack success rates in our experiments, reducing jailbreak success by around 30% relative on average across the evaluated models. This finding is initially surprising: ASR remains low even when the returned image-tool output is manually overridden or itself unsafe-looking, but returns near direct-answering levels under text-only prior turn controls. These results indicate that the lower ASR is not explained by benign returned-image semantics or by the textual image-tool trace alone. To explain the pattern, we introduce an image-tool safety vector framework that models image-tool invocation as a residual shift in hidden representations toward a safety-relevant direction. Representation-level analyses and activation interventions support this account. Overall, our results suggest that explicit image-tool interaction is a promising design pattern for improving jailbreak robustness, while also motivating pipeline-specific safety evaluation.

2605.27931 2026-05-28 cs.AI

DiagramRAG: A Lightweight Framework to Retrieve Scientific Diagram for Figure Generation

DiagramRAG:一个用于科学图表生成的轻量级检索增强框架

Xinjiang Yu, Junyi Han, Zhuofan Chen, Chi Zhang, Xiangyu Fu, Jingyuan Tan, Zirui You, Yixiang Jian, Yu-Ping Wang, Chengliang Chai

发表机构 * Beijing Institute of Technology(北京理工大学)

AI总结 提出DiagramRAG框架,通过检索与草图语义和拓扑结构兼容的参考图表,实现草图到科学图表的自动补全与生成。

Comments 23 pages, 9 figures

详情
AI中文摘要

科学图表对于在学术论文中传达复杂方法至关重要。研究人员指定此类图表的一种自然方式是通过粗略草图,其中文本标签、连接器和空间布局表达了早期的语义和拓扑意图。然而,草图通常不完整,不足以直接生成出版质量的图表。现有的基于草图的生成方法主要重构草图本身,而最近的文本驱动图表生成框架依赖文本语义,未能充分利用草图中包含的拓扑结构。在本文中,我们介绍了DiagramRAG,一个轻量级的检索增强框架,用于基于草图的科学图表补全。给定用户草图,DiagramRAG检索与草图内容语义相关且与其结构拓扑兼容的参考图表,并使用它们指导下游图表生成。为了实现高效的结构感知检索,我们将图表表示为知识图谱,在不同简化级别合成草图变体,并训练一个嵌入模型,将草图与共享空间中的兼容图表对齐。检索到的参考进一步提供内容、拓扑和视觉先验,用于补全和渲染最终图表。实验表明,DiagramRAG在DiagramBank和FigureBench上分别达到0.848和0.802的F1分数,并以最佳的VLM-as-a-Judge评分7.170提高了生成质量,同时将推理延迟降低到每个样本35.48秒。我们的代码和数据可在https://anonymous.4open.science/r/DiagramRAG-A262和https://huggingface.co/datasets/anonymous-review-a262/DiagramSketch获取。

英文摘要

Scientific diagrams are essential for communicating complex methodologies in academic papers. A natural way for researchers to specify such diagrams is through rough sketches, where text labels, connectors, and spatial arrangements express early semantic and topological intentions. However, sketches are usually incomplete, making them insufficient for directly producing publication-quality diagrams. Existing sketch-based generation methods mainly reconstruct the sketch itself, while recent text-driven diagram generation frameworks rely on textual semantics and do not fully exploit the topological structure contained in sketches. In this paper, we introduce DiagramRAG, a lightweight retrieval-augmented framework for sketch-based scientific diagram completion. Given a user sketch, DiagramRAG retrieves reference diagrams that are both semantically relevant to the sketch content and topologically compatible with its structure, and uses them to guide downstream diagram generation. To enable efficient structure-aware retrieval, we represent diagrams as knowledge graphs, synthesize sketch variants at different simplification levels, and train an embedding model to align sketches with compatible diagrams in a shared space. The retrieved references further provide content, topology, and visual priors for completing and rendering the final diagram. Experiments show that DiagramRAG achieves F1-scores of 0.848 and 0.802 on DiagramBank and FigureBench, respectively, and improves generation quality with the best VLM-as-a-Judge score of 7.170, while reducing inference latency to 35.48 seconds per sample. Our code and data are available at https://anonymous.4open.science/r/DiagramRAG-A262 and https://huggingface.co/datasets/anonymous-review-a262/DiagramSketch.

2605.27927 2026-05-28 cs.CV cs.LG

Structure-Guided Visual Perturbation Neutralization for LVLMs

结构引导的视觉扰动中和用于大型视觉语言模型

Yuanhe Zhang, Xueting Wang, YanBin Ren, Haoran Gao, Xinhan Zheng, Zhenhong Zhou, Fanyu Meng, Li Sun, Sen Su

Affiliations: Beijing University of Posts and Telecommunications; University of Science and Technology of China; JIUTIAN Research; Nanyang Technological University; Chongqing University of Posts and Telecommunications

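AI summary: Proposes the Structure-Induced Guided Neutralization (SIGN) framework, a lightweight, plug-and-play adversarial defense built on Prior Structural Extraction and Dynamic Guided Neutralization, which achieves a defense success rate above 87% with only 0.5% pixel modification and 0.16 seconds per image.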

Details
AI Chinese abstract

图像输入使大型视觉语言模型(LVLMs)能够感知细粒度的视觉信息,但也引入了一个像素级攻击面,通过该攻击面,对抗性扰动可以引发不安全的模型行为。然而,大多数现有防御是为传统计算机视觉场景设计的,因此常常忽略LVLMs所需的跨模态对齐,导致性能下降。同时,针对LVLMs的有限防御通常需要大量的图像修改并引入可观的计算开销,从而损害推理质量和效率。为解决这些限制,我们提出了结构诱导引导中和(SIGN),一个轻量级、即插即用的防御框架,通过先验结构提取提高LVLM兼容性,并通过动态引导中和实现高效的扰动抑制。大量实验表明,SIGN在仅0.5%像素修改和每张图像0.16秒的情况下实现了超过87%的防御成功率,同时几乎保留了原始视觉表示和良性任务性能。我们的工作为需要昂贵模型训练的防御提供了一种轻量级替代方案,并突显了利用视觉编码器进行高效对抗性保护的潜力。我们的代码已在 https://anonymous.4open.science/r/SIGN-BCB1 开源。

English abstract

Image inputs enable Large Vision Language Models (LVLMs) to perceive fine-grained visual information, but also introduce a pixel-level attack surface through which adversarial perturbations can elicit unsafe model behaviors. However, most existing defenses are designed for traditional computer vision settings and thus often overlook the cross-modal alignment required by LVLMs, leading to degraded performance. Meanwhile, the limited defenses tailored to LVLMs often require substantial image modifications and introduce considerable computational overhead, thereby compromising inference quality and efficiency. To address these limitations, we propose Structure-Induced Guided Neutralization (SIGN), a lightweight, plug-and-play defense framework that improves LVLM compatibility via Prior Structural Extraction and achieves efficient perturbation suppression via Dynamic Guided Neutralization. Extensive experiments show that SIGN achieves a defense success rate of over 87% with only 0.5% pixel modification and 0.16 seconds per image, while nearly preserving original visual representations and benign task performance. Our work offers a lightweight alternative to defenses that require costly model training and highlights the potential of exploiting a vision encoder for efficient adversarial protection. Our code is available at https://anonymous.4open.science/r/SIGN-BCB1.
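To make the "0.5% pixel modification" figure concrete, the sketch below computes one plausible reading of it: the fraction of pixel positions whose value differs between the original image and the defended image. The abstract does not define the metric, so this interpretation, the function name, and the `tolerance` parameter are assumptions rather than the authors' definition.

```python
def modified_pixel_fraction(original, defended, tolerance=0):
    """Fraction of pixel positions whose values differ by more than `tolerance`.

    Both images are same-sized 2D lists of grayscale intensities.
    This is an assumed reading of "pixel modification", not the paper's metric.
    """
    total = 0
    changed = 0
    for row_a, row_b in zip(original, defended):
        for px_a, px_b in zip(row_a, row_b):
            total += 1
            if abs(px_a - px_b) > tolerance:
                changed += 1
    return changed / total if total else 0.0

# A 10x20 image in which a single pixel is altered: 1 of 200 positions.
original = [[128] * 20 for _ in range(10)]
defended = [row[:] for row in original]
defended[3][7] = 140
print(modified_pixel_fraction(original, defended))
```

Under this reading, changing one pixel in a 200-pixel image corresponds to the 0.5% level quoted in the abstract.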

2605.27924 2026-05-28 cs.CV

SIGMA: Semantic-Difference Instruction-Grounding Mask Annotator for Text-Driven Image Manipulation Localization

SIGMA: 基于语义差异的指令引导掩码标注器用于文本驱动图像操作定位

Peiyu Zhuang, Jianquan Yang, Haodong Li, Zhuoying Cai, Ruitao Xie, Jishen Zeng, Baoying Chen, Jiwu Huang, Xiaochun Cao

Affiliations: Shenzhen Campus of Sun Yat-sen University; Guangdong Provincial Key Laboratory of Intelligent Information Processing and Shenzhen Key Laboratory of Media Security; Shenzhen University of Advanced Technology and Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; Alibaba Group; Shenzhen MSU-BIT University

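AI summary: Proposes SIGMA, which uses semantic-feature differencing in a vision foundation model together with an instruction-derived spatial prior to automatically generate pixel-level masks from public editing datasets for training image manipulation localization models. It improves F1 by 12.20% on five benchmarks and produces a training set of about 1.1M samples that raises the F1 of six detectors by 18.34% on average.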

Details
AI Chinese abstract

文本驱动的图像编辑发展迅速,但可靠地定位这些操作需要在大规模像素标注数据集上训练的图像操作定位(IML)模型,目前尚无低成本获取此类训练数据的方法。我们观察到这些数据实际上已经以伪装形式存在:公开编辑数据集包含数百万个与IML训练样本结构相同的(原始、编辑)图像对,仅缺少像素级掩码。自动恢复这些掩码并非易事:像素差异被扩散引起的所有像素扰动淹没,而仅基于指令的定位只能定位提示描述的内容,遗漏了意外的编辑副作用。我们提出SIGMA(语义差异指令引导掩码标注器),它在视觉基础骨干网络中进行语义特征差异计算,并通过双向跨模态精炼将指令导出的空间先验注入视觉流,在编辑器忠实实现用户意图时放大预期编辑区域的差异信号。SIGMA通过两个互补阶段训练:第一阶段在修复掩码上进行监督;第二阶段通过VAE往返噪声校准、EMA自训练和编辑噪声解耦损失来弥合扩散域偏移。SIGMA在五个基准上优于现有自动掩码生成器(F1提升12.20%,IoU提升11.16%)。当应用于公开编辑语料库时,它生成了约110万IML训练集,使六个不同检测器在五个数据集上平均F1提升18.34%,将以前未使用的编辑数据转化为IML的模型无关监督资源。论文被接收后我们将立即发布完整代码库。

English abstract

Text-driven image editing has advanced rapidly, but reliably localizing these manipulations requires image manipulation localization (IML) models trained on large pixel-annotated datasets, and there is still no low-cost way to obtain such training data at scale. We observe that these data already exist in disguise: public editing datasets contain millions of (original, edited) pairs that are structurally identical to IML training samples, lacking only pixel-level masks. Recovering these masks automatically is non-trivial: pixel differencing is overwhelmed by diffusion-induced perturbations across all pixels, and instruction-only grounding localizes only what the prompt describes, missing unintended editor side-effects. We propose SIGMA (Semantic-difference Instruction-Grounding Mask Annotator), which performs semantic-feature differencing in a vision foundation backbone and injects an instruction-derived spatial prior into this visual stream via bidirectional cross-modal refinement, amplifying the difference signal at intended-edit regions when the editor faithfully realizes user intent. SIGMA is trained in two complementary stages: Stage I supervises on inpainting masks; Stage II closes the diffusion-domain shift via VAE-roundtrip noise calibration, EMA self-training, and an edit-noise disentanglement loss. SIGMA outperforms existing automatic mask generators on five benchmarks (+12.20% F1, +11.16% IoU). When applied to public editing corpora, it produces a ~1.1M IML training set that improves six diverse detectors by +18.34% F1 across five datasets, turning previously unused editing data into a model-agnostic supervisory resource for IML. We will release the full codebase as soon as the paper is accepted.
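The idea of differencing in feature space rather than pixel space, and the F1 and IoU scores used to evaluate the resulting masks, can be sketched in a few lines. This is a simplified illustration, not SIGMA itself: it omits the instruction-derived prior, the cross-modal refinement, and both training stages. The per-patch feature vectors are assumed to come from some vision backbone, and the function names and the fixed `threshold` are hypothetical.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 0.0

def difference_mask(feats_original, feats_edited, threshold=0.5):
    """Mark a patch as edited (1) when its features moved by more than `threshold`.

    feats_*: lists of per-patch feature vectors in the same patch order.
    """
    return [1 if cosine_distance(u, v) > threshold else 0
            for u, v in zip(feats_original, feats_edited)]

def f1_and_iou(predicted, reference):
    """F1 and IoU of two binary masks given as flat lists of 0/1."""
    pairs = list(zip(predicted, reference))
    tp = sum(1 for p, r in pairs if p == 1 and r == 1)
    fp = sum(1 for p, r in pairs if p == 1 and r == 0)
    fn = sum(1 for p, r in pairs if p == 0 and r == 1)
    denom = tp + fp + fn
    if denom == 0:  # both masks empty: treat as perfect agreement
        return 1.0, 1.0
    return 2 * tp / (2 * tp + fp + fn), tp / denom

# Three patches; only the second one changes direction in feature space.
feats_original = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
feats_edited = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
mask = difference_mask(feats_original, feats_edited)
print(mask, f1_and_iou(mask, [0, 1, 1]))
```

Small pixel-level noise that leaves a patch's feature direction unchanged produces a distance near zero, which is the motivation the abstract gives for differencing semantic features instead of raw pixels.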