arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.04483 2026-06-04 cs.CL

Off-Distribution Voices: Fanfiction Subgenres as Universal Vernacular Jailbreaks for Aligned LLMs

Zhongze Luo, Ruihe Shi, Zhenshuai Yin, Haoyue Liu, Weixuan Wan, Xiaoying Tang

AI summary: This paper argues that the real failure mode of aligned LLMs is a register of natural human writing that safety training under-covers, and proposes the first jailbreak method that uses real fanfiction subgenres as universal attack carriers, substantially raising attack success rate.

Comments
23 pages

Abstract

Existing jailbreaks against aligned LLMs are discrete artifacts whose surface forms are easy to fingerprint and patch. We argue that the real failure mode is not any specific prompt, but an entire register of natural human writing that safety training has under-covered. Building on this insight, we introduce the first jailbreak family that uses real fanfiction subgenres as universal attack carriers: a creative-writing meta is conditioned on passages from one of twelve Archive of Our Own (AO3) subgenres, and the harmful behavior is embedded as the climax of the resulting scene. The construction requires no attacker LLM and no per-target adaptation. On eight aligned LLMs over the union of HarmBench and JailbreakBench, this attack lifts mean ASR from 0.278 to 0.731 under a four-judge ensemble; a factorial decomposition shows the gain is carried by register rather than length or structure. Two active defences widen rather than narrow the vernacular-to-baseline ratio, indicating that template-targeting defences merely steer attackers toward register-based attacks like ours. We also propose SAGA-A4, a static four-turn extension that attains mean ASR 0.924, substantially exceeding three existing multi-turn methods.

2606.04480 2026-06-04 cs.CV cs.HC

IMPose: Interactive Multi-person Pose Estimation with Dynamic Correction Propagation

Haoyang Ge, Jian Ma, Ziwen Wang, Qihe Wang, Jianqi Fan, Hongzhi Yu, Xingyu Chen, Kun Li

AI summary: Proposes IMPose, an interactive tool whose dual-level tracking mechanism (keypoint level and instance level) propagates sparse multi-person pose corrections across an entire video, substantially reducing manual annotation effort.

Abstract

High-quality dynamic human pose annotation equips AI with precise motion kinematics to enable human behavior mastery, yet remains labor-intensive and time-consuming. Current annotation tools either lack temporal correction propagation or fail in multi-person scenarios, necessitating excessive manual intervention. In this paper, we introduce IMPose, an interactive tool for multi-person dynamic pose annotation. It features a dual-level tracking mechanism that propagates annotators' one-frame multi-person pose corrections across entire videos. The keypoint level ensures temporal propagation of corrections via sequential modeling, while the instance level employs keypoint-aware embeddings with relative positional encoding to maintain multi-person cross-frame consistency. To further improve robustness, IMPose maintains historical pose and instance cues in a trajectory bank, which enhances long-range temporal association and stabilizes annotation in challenging cases such as occlusion and motion blur. By converting sparse human corrections into dense and coherent pose trajectories, our framework significantly reduces repeated manual refinement across frames. Extensive experiments show that IMPose consistently achieves a strong accuracy-efficiency trade-off under different interaction budgets, with particular advantages in low-click annotation settings. IMPose achieves high-precision annotation with high efficiency, requiring only 27 clicks per 1,050-frame video on 3DPW and 3 clicks per tracklet per 84 frames on PoseTrack21. We further expand PoseTrack21 with 188K pose instances (3.55M keypoints) at a minimal cost of 10 annotators working for 10 hours. The annotation tool, code, and extended dataset will be open-sourced.

2606.04479 2026-06-04 cs.CV cs.AI cs.CL

Evaluating Reasoning Fidelity in Visual Text Generation

Jiajun Hong, Jiawei Zhou

AI summary: Evaluates whether current text-to-image models faithfully preserve reasoning in visual text generation, using long text rendering, factual knowledge probing, context understanding, and multi-step reasoning tasks; finds that they often produce semantic errors and logical inconsistencies, leaving a substantial gap relative to text-only models.

Comments
Peer reviewed and accepted at CVPR 2026 at the GRAIL-V (Grounded Retrieval and Agentic Intelligence for Vision-Language) workshop (non-archival track)

Abstract

Recent text-to-image (T2I) models can render highly legible and well-structured text within images, enabling applications including document generation and slide generation. However, it remains unclear whether such systems faithfully preserve reasoning ability when complex solutions must be expressed directly through rendered text, or whether they merely imitate surface-level patterns. We investigate this question by evaluating reasoning fidelity in visual text generation, where models must express complete reasoning processes as images. Our evaluation includes long text rendering, factual knowledge probing, context understanding, and multi-step reasoning. Across these settings, we find that current T2I models frequently produce semantic errors, logical inconsistencies, and incorrect intermediate steps, even when the rendered text appears visually clear. These failures contrast with the strong reasoning performance of text-only models on the same tasks. Our findings reveal a substantial gap between visual text generation and procedural reasoning, motivating more reliable visual text reasoning.

2606.04477 2026-06-04 cs.RO

TransTac: Visuo-Tactile Modality Transition via Ultraviolet-Encoded Transparent Elastomers

Lingyue Yang, Bin Fang

AI summary: Proposes TransTac, a transparent UV-encoded binocular vision-based tactile sensor that combines visual observation with marker-based tactile reconstruction, uses a prior-guided Delaunay stereo matching algorithm for robust sparse triangulation, reaches up to 83.3% zero-shot recognition accuracy on tactile images, and markedly strengthens cross-modal alignment.

Comments
Accepted at IEEE International Conference on Robotics and Automation (ICRA) 2026. 8 pages, 7 figures

Abstract

Vision-based tactile sensors (VBTS) recover high-resolution contact geometry but typically rely on opaque elastomer layers that prevent visual transparency, while RGB-D cameras provide global depth perception yet degrade significantly at close range. To address this limitation, we present TransTac, a transparent ultraviolet (UV)-encoded binocular VBTS that integrates visual observation and marker-based tactile reconstruction within a single compact device. The system employs a transparent elastomer embedded with UV-reflective markers and a prior-guided Delaunay stereo matching algorithm for robust sparse triangulation. To reliably detect densely distributed semitransparent markers, we develop a lightweight detector that enables stable localization under contact and deformation. The proposed prior-guided Delaunay matching improves correspondence robustness by approximately 21% compared with global assignment baselines while maintaining high reconstruction accuracy. In semantic evaluation, TransTac achieves up to 83.3% zero-shot recognition accuracy on tactile images, exceeding opaque tactile baselines by approximately 50 percentage points. Embedding analysis further reveals substantially stronger cross-modal alignment with natural images, with class-center similarity increasing from around 0.2 to over 0.77. Controlled near-distance experiments quantify the degradation of RGB-D depth reliability and demonstrate extended geometric coverage enabled by visuo-tactile integration. Finally, a compact prototype is implemented with an approximate hardware cost of $70.

2606.04476 2026-06-04 cs.LG math.OC math.ST stat.ML stat.TH

When Both Layers Learn: Training Dynamics of Representing Linear Models via ReLU Networks

Berk Tinaz, Changzhi Xie, Mahdi Soltanolkotabi

AI summary: Studies the gradient descent dynamics of jointly training both layers of a one-hidden-layer ReLU network to fit a linear target function; a three-phase analysis proves that, from random initialization, gradient descent converges to a global minimizer at a linear rate with order-wise optimal sample complexity.

Comments
47 pages, 8 figures, published at the 39th Annual Conference on Learning Theory (COLT), 2026

Abstract

In this paper, we study the gradient descent dynamics for jointly training both layers of a one-hidden-layer ReLU network to fit a linear target function. Concretely, we consider a realizable setting where inputs are drawn i.i.d. from a Gaussian distribution and labels follow a planted linear model. This stylized framework captures salient features of end-to-end training in inverse problems and certain auto-encoder models. Despite its apparent simplicity, the dynamics remain poorly understood, in part because the loss landscape contains multiple non-strict saddle points, making it unclear why gradient descent from random initialization reliably escapes bad stationary regions. We provide a detailed characterization of the optimization landscape and prove that gradient descent from a moderately small random initialization, training both layers simultaneously, converges to a global minimizer at a linear rate with order-wise optimal sample complexity. Our analysis tracks the trajectory through three phases: an alignment phase in which hidden weights progressively align with the planted direction while the output weights maintain the correct sign pattern; a growth phase in which the norms of both layers increase while preserving alignment; and a local refinement phase in which the aligned neurons rapidly converge to the planted direction, yielding fast local convergence. To rigorously show that gradient descent avoids non-strict saddles, we develop trajectory-level control arguments for the end-to-end dynamics. In addition, we establish novel uniform concentration results that hold along the entire trajectory and are essential for obtaining order-wise optimal sample complexity. We corroborate our theory with extensive experiments across a range of configurations.
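
The setting in this abstract can be written down compactly. The notation below ($m$, $a_j$, $w_j$, $\beta^\star$, the squared loss, and the isotropic Gaussian) is assumed for illustration and is not taken from the paper; it is a sketch of a typical formulation, not the authors' exact one.

```latex
% Assumed notation: m hidden neurons, output weights a_j, hidden weights w_j,
% planted direction \beta^\star, n i.i.d. samples, squared loss.
\[
  f(x; W, a) = \sum_{j=1}^{m} a_j \,\mathrm{ReLU}(w_j^\top x),
  \qquad
  y_i = \langle \beta^\star, x_i \rangle,
  \qquad
  x_i \sim \mathcal{N}(0, I_d),
\]
\[
  L(W, a) = \frac{1}{2n} \sum_{i=1}^{n} \bigl( f(x_i; W, a) - y_i \bigr)^2 .
\]
% The setting is realizable because t = ReLU(t) - ReLU(-t) for every real t:
\[
  \langle \beta^\star, x \rangle
  = \mathrm{ReLU}\bigl(\langle \beta^\star, x \rangle\bigr)
  - \mathrm{ReLU}\bigl(\langle -\beta^\star, x \rangle\bigr),
\]
% i.e. two neurons with (a_1, w_1) = (1, \beta^\star) and (a_2, w_2) = (-1, -\beta^\star)
% already achieve zero loss.
```

The identity in the last display is why a linear target is exactly representable by a ReLU network with at least two hidden neurons, which is what makes the "realizable" framing meaningful.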

2606.04475 2026-06-04 cs.SD cs.MM math.SP

A Second-Order Cepstral Signature of Contact-Vibration Sounds Reproduced by Laptop Loudspeakers: A Synthetic Case Study

Jim Salsman

AI summary: Using a synthetic signal-chain analysis, proposes that contact-vibration sounds reproduced by laptop loudspeakers carry first-order and second-order cepstral periodic structure, with second-order cepstral bimodality most evident at the mechanical source and at loudspeaker playback.

Comments
11 pages, 4 tables, 5 figures, 8 references

Abstract

A mobile phone vibrating on a hard surface often sounds qualitatively unlike ordinary audiovisual recordings when reproduced through laptop loudspeakers. We propose that part of this perceptual distinctiveness can be described as a nested periodicity: a first-order cepstral structure reflecting the vibration period and its multiples, and a second-order cepstral structure reflecting repeated spacing within the first-order cepstrum. Treating the perceptual effect as real and using a deliberately transparent synthetic signal chain, we model six stages: mechanical generation, surface and air propagation, microphone capture, encoding and decoding, laptop-speaker playback, and re-recording or post-processing. The synthetic analysis shows that the first-order cepstral periodicity is preserved across the chain, whereas a cleaner bimodal or quasi-bimodal second-order cepstral signature is most evident at the mechanical source and at laptop-speaker playback. The result supports, but does not prove, the hypothesis that laptop reproduction can re-emphasize a latent contact-vibration periodicity that is less cleanly expressed in intermediate recorded and encoded forms. We frame second-order cepstral bimodality as an exploratory descriptor of contact-vibration playback rather than as a completed perceptual metric. Required validation includes recordings of real devices, controlled playback transfer functions, perceptual judgments, and comparisons against ordinary speech, music, and environmental recordings.
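
As a rough illustration of the first-order cepstral structure mentioned above, the sketch below computes a real cepstrum (inverse DFT of the log magnitude spectrum) for a toy signal with one delayed repeat. The helper names, the toy signal, and the reading of "second-order" as "cepstrum of the first-order cepstrum's magnitude" are assumptions made here, not the paper's definitions or signal chain.

```python
import cmath
import math


def dft(values, inverse=False):
    """Naive O(n^2) discrete Fourier transform; adequate for short demo signals."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(v * cmath.exp(sign * 2j * math.pi * k * t / n)
                  for t, v in enumerate(values))
        out.append(acc / n if inverse else acc)
    return out


def real_cepstrum(signal, eps=1e-12):
    """Inverse DFT of the log magnitude spectrum; eps guards against log(0)."""
    log_mag = [math.log(abs(c) + eps) for c in dft(signal)]
    return [c.real for c in dft(log_mag, inverse=True)]


def peak_quefrency(cepstrum):
    """Index of the largest value, skipping quefrency 0 and the mirrored half."""
    half = len(cepstrum) // 2
    return max(range(1, half + 1), key=lambda q: cepstrum[q])


# Toy signal (hypothetical): a unit impulse plus a half-amplitude repeat 8 samples later.
signal = [0.0] * 64
signal[0], signal[8] = 1.0, 0.5

first_order = real_cepstrum(signal)
# Assumption: "second-order" is taken here as the cepstrum of |first-order cepstrum|.
second_order = real_cepstrum([abs(c) for c in first_order])

print(peak_quefrency(first_order))  # prints 8, the repeat period in samples
```

The first-order cepstrum of this toy signal has its largest peak at the repeat delay, with smaller alternating-sign peaks at its multiples; that regular spacing is the kind of structure a second-order analysis would then pick up.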

2606.04473 2026-06-04 cs.LG cs.AI

ChessMimic: Per-Rating Transformer Models for Human Move, Clock, and Outcome Prediction in Online Blitz Chess

Thomas Johnson

AI summary: Proposes ChessMimic, a system of three small encoder-only transformers for move, thinking-time, and outcome prediction, trained separately per Elo rating band for sharper skill calibration; on Lichess blitz data its move prediction accuracy exceeds Maia-2, its outcome prediction reaches an AUC of 0.78, and its clock model provides a usable but non-SOTA thinking-time signal.

Abstract

We present ChessMimic, a system of three small encoder-only transformers (for move, thinking-time, and outcome prediction) conditioned on the position, recent move history, player rating, and clock state. We fit a separate instance of each model per 100-Elo rating band, trading parameter efficiency for sharper per-skill calibration. On a held-out month-wide slice of Lichess Rated Blitz games, ChessMimic's human move prediction accuracy outperforms Maia-2 in every Elo band. Compared to Maia-3, our 9M-parameter model's accuracy sits between Maia-3-5M and Maia-3-23M without the additional complexity of Geometric Attention Bias. In addition to the move matching model, we also train a game outcome model that conditions not only on the position, but also on player ratings, time control, and remaining clock times. The outcome model achieves an AUC of 0.78 out of sample, beating Maia-2 as well as logistic regressions based on material, ratings, and clock time. Finally, we train a clock model that predicts human thinking times. The clock model provides a usable but non-SOTA per-ply think-time signal under ALLIE-style filters (Pearson r = 0.41, Spearman rho = 0.50, MAE 4.10 s, against ALLIE's reported r = 0.70), with the residual gap concentrated in per-position bucket sharpness rather than bucket-marginal calibration. A public demo is at 1e4.ai, and we release code, per-band weights, and the C++ data-filter pipeline code on GitHub.
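
The per-band design amounts to routing each query to the model fitted for that player's rating band. A minimal sketch of such routing follows; the function names, the band alignment (for example 1500 to 1599), and the registry of model names are all hypothetical and are not taken from the ChessMimic code.

```python
def rating_band(elo, width=100):
    """Lower edge of the rating band containing `elo` (assumed alignment: multiples of `width`)."""
    return (elo // width) * width


def select_model(models_by_band, elo):
    """Look up the model fitted for the band that contains `elo`."""
    return models_by_band[rating_band(elo)]


# Hypothetical registry: one entry per 100-Elo band.
models = {1500: "move-model-1500", 1600: "move-model-1600"}
print(select_model(models, 1567))  # prints move-model-1500
```

Fitting one instance per band multiplies the total parameter count by the number of bands, which is the parameter-efficiency cost the abstract mentions.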

2606.04469 2026-06-04 cs.CV cs.AI

Adaptive Calibration for Fair and Performant Facial Recognition

Ryan Brown, Chris Russell

AI summary: Proposes Adaptive Calibration (AC), which maps cosine similarity between normalized embeddings to calibrated probabilities and uses local context to correct regional differences, improving both overall performance and fairness in facial recognition without demographic metadata.

Abstract

We introduce Adaptive Calibration (AC), a novel calibration strategy for facial recognition that maps cosine similarity between normalized embeddings to well-calibrated probabilities. By incorporating local context into calibration, Adaptive Calibration corrects for a fundamental mismatch in cosine similarity, whereby the same distance can correspond to different match probabilities in different embedding regions. Our approach improves overall performance and yields a fairer calibration without requiring demographic metadata. It consistently dominates existing methods on both accuracy and fairness metrics across a variety of pretrained models and standard benchmarks. AC provides a practical solution for equitable facial recognition that requires no demographic group annotations while improving overall performance. Unlike existing approaches, our method provides continuous, region-specific calibration that avoids "leveling down", where fairness comes at the cost of degraded performance for some groups.

2606.04468 2026-06-04 cs.LG cs.AI cs.NE math.OC

ParetoPilot: Zero-Surrogate Offline Multi-Objective Optimization via Infer-Perturb-Guide Diffusion

Ruiqing Sun, Sen Yang, Dawei Feng, Bo Ding, Yijie Wang, Huaimin Wang

AI summary: Proposes ParetoPilot, a zero-surrogate diffusion framework that needs no external surrogate model; its Infer-Perturb-Guide engine, interleaved within the unconditional denoising steps, implicitly infers the objective direction and orthogonalizes a parallel gravity field and an edgeness-aware repulsive force to reach Pareto-optimal designs in offline multi-objective optimization.

Abstract

Offline multi-objective optimization (Offline MOO) aims to discover novel Pareto-optimal designs based on static datasets without expensive environment interactions. While recent generative methods have achieved notable success, they predominantly rely on external surrogate models. This dependency introduces significant computational overhead, suffers from deceptive evaluations, and deviates from the prevailing paradigm of jointly training mainstream generative models with conditions. To address these bottlenecks, we propose ParetoPilot, a novel zero-surrogate diffusion framework for offline MOO. ParetoPilot fully leverages the conditional priors inherently embedded within pre-trained diffusion models. At its core, the framework introduces the Infer-Perturb-Guide (IPG) engine, which is seamlessly interleaved within the unconditional denoising steps of the reverse generation process. First, it implicitly infers the instantaneous objective direction by matching conditional and unconditional noise predictions. Next, it mathematically orthogonalizes a parallel gravity field for strict convergence and an edgeness-aware repulsive force for mutual diversity, creating a dynamically annealed perturbation vector. Finally, this perturbed target seamlessly steers the generation process via standard Classifier-Free Guidance (CFG). Extensive experiments across 51 tasks demonstrate that ParetoPilot outperforms 14 state-of-the-art surrogate-based and inverse generative baselines. By eliminating auxiliary proxy training, our approach preserves data privacy while achieving hypervolume improvement and robust Pareto front coverage.
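
For readers unfamiliar with the final step, standard Classifier-Free Guidance combines a conditional and an unconditional noise prediction as shown below. The symbols ($\epsilon_\theta$, $x_t$, $c$, $w$) are the usual ones from the CFG literature and are assumed here; how ParetoPilot constructs its perturbed target $c$ is not specified in the abstract and is not reproduced.

```latex
% Standard classifier-free guidance (assumed notation):
%   \epsilon_\theta(x_t, c)           conditional noise prediction for condition c
%   \epsilon_\theta(x_t, \varnothing) unconditional noise prediction
%   w                                 guidance scale
\[
  \hat{\epsilon}_\theta(x_t, c)
  = \epsilon_\theta(x_t, \varnothing)
  + w \,\bigl[ \epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing) \bigr].
\]
```

The bracketed difference between the conditional and unconditional predictions is the quantity that points from "any design" toward "designs matching the condition", which is presumably why it can serve as a signal for an objective direction without a separate surrogate model.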

2606.04466 2026-06-04 cs.CL

Learning What to Learn: Stage-Specific Data Sets for SFT-then-RL in Small Language Model Reasoning

Chongyang He, Rui Zhang, Zixuan Wang, Xin Li

AI summary: Proposes a difficulty-aware SFT-then-RL framework that coordinates data difficulty through stage-specific datasets (a Bridge mechanism in the SFT stage, Critique Fine-Tuning for samples left unsolved in the RL stage) to improve small language model reasoning.

Comments
25 pages, 12 figures

Abstract

Post-training Small Language Models (SLMs) for reasoning typically follows an SFT-then-RL pipeline, yet existing work rarely considers what data should be learned at each stage. We argue that data strategy should be aligned with the distinct roles of SFT and RL: SFT is better suited for acquiring not-yet-mastered reasoning skills, while RL is better suited for consolidating skills that the model can already partially access. Based on this principle, we propose a difficulty-aware SFT-then-RL framework that organizes training data into stage-specific sets. For hard samples in the SFT stage, we introduce a Bridge mechanism that transforms raw teacher-generated reasoning traces into more learnable supervision for SLMs. For hard samples that remain unsolved during RL, we apply Critique Fine-Tuning by converting all-zero-reward failures into diagnostic, repair, and new reasoning trace supervision for the next SFT stage. Experiments on two SLMs across five reasoning benchmarks show that our method consistently improves over representative SFT, distillation, and RL baselines. Our results highlight the importance of coordinating data difficulty across SFT and RL for effective SLM reasoning post-training.
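
One way to picture the hand-off described above is a routing step that separates samples whose RL rollouts all received zero reward from the rest. The sketch below is hypothetical: the function name, the input layout, and the treatment of samples without rollouts are assumptions, and the paper's actual selection criteria may differ.

```python
def route_samples(rewards_by_sample):
    """Split sample ids by RL rollout outcome.

    rewards_by_sample maps a sample id to the list of rewards from its rollouts.
    Samples whose rollouts all scored zero go to the critique pool (candidates
    for Critique Fine-Tuning in the next SFT stage); everything else, including
    samples with no rollouts yet, stays in the RL pool.
    """
    rl_pool, critique_pool = [], []
    for sample_id, rewards in rewards_by_sample.items():
        if rewards and all(r == 0 for r in rewards):
            critique_pool.append(sample_id)
        else:
            rl_pool.append(sample_id)
    return rl_pool, critique_pool


# Hypothetical rollout rewards for three samples.
rollouts = {"q1": [0, 0, 0, 0], "q2": [0, 1, 0, 1], "q3": [1, 1, 1, 1]}
rl_pool, critique_pool = route_samples(rollouts)
print(rl_pool, critique_pool)  # prints ['q2', 'q3'] ['q1']
```

An all-zero group gives a policy-gradient method no reward contrast to learn from, which is one motivation for converting those failures into supervised signal instead.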

2606.04465 2026-06-04 cs.CL cs.AI

SePO: Self-Evolving Prompt Agent for System Prompt Optimization

Wangcheng Tao, Han Wu, Weng-Fai Wong

AI summary: Proposes SePO, whose self-referential design lets a prompt agent optimize both the task agents' system prompts and its own, trained with a two-stage evolutionary procedure; across multiple benchmarks it improves average accuracy by 4.49 points.

Comments
26 pages. Code: https://github.com/taowangcheng/SePO

Abstract

System prompt optimization improves agent behavior without modifying the underlying model, yielding human-readable, model-agnostic instructions. Existing methods build a prompt agent that refines task agents' system prompts, yet leave the prompt agent's own system prompt hand-engineered and fixed. We propose Self-Evolving Prompt Optimization (SePO), which treats the prompt agent's own system prompt as an optimization target alongside task agents' system prompts. SePO adopts a self-referential design. A single prompt agent improves both task agents' system prompts and its own under an open-ended evolutionary search that maintains an archive of candidate prompts as stepping stones. Training proceeds in two stages: pre-training evolves the prompt agent on a multi-task pool, and fine-tuning then applies it to a target task. Across five benchmarks spanning math (AIME'25), abstract reasoning (ARC-AGI-1), graduate-level science (GPQA), code generation (MBPP), and logic puzzles (Sudoku), SePO consistently outperforms Manual-CoT, TextGrad, and MetaSPO, improving the average accuracy by 4.49 points compared to Manual-CoT. The prompt optimization skill from pre-training also generalizes to tasks beyond the pre-training mixture, rather than memorizing per-task prompts.

2606.04461 2026-06-04 cs.CV

ChannelTok: Efficient Flexible-Length Vision Tokenization

Sukriti Paul, Arpit Bansal, Tom Goldstein

AI summary: Proposes a lightweight channel-wise flexible-length tokenizer whose stochastic tail-dropping training orders channels by semantic importance, greatly improving decoding speed and model efficiency while maintaining high quality.

Abstract

Leading flexible vision tokenizers achieve SOTA quality at an extreme cost, relying on parameter-heavy backbones and slow, multi-step generative decoders. We depart from this complex, spatial-token paradigm and introduce a simple, lightweight, and fast channel-wise flexible-length tokenizer. Our method treats each latent channel as a visual token, enabling a parameter-efficient CNN-Transformer hybrid backbone. Furthermore, employing a stochastic tail-dropping paradigm during training naturally forces channels to organize by semantic importance. This allows for flexible compression at inference by simply retaining the first $k$ channels, and naturally enables variable-length autoregressive image generation. We validate our approach through extensive experiments on ImageNet, demonstrating consistent quality across diverse token budgets. The results establish a new quality-efficiency frontier: our model achieves state-of-the-art perceptual quality (rFID 2.92) while being $8.6\times$ faster in decoding and $2.1\times$ smaller (159M params) than the next-best alternative. Our work establishes channel-wise tokenization as a powerful and practical paradigm for efficient visual representation. Project page: https://channeltok.github.io
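
The tail-dropping idea can be sketched in a few lines: during training a random prefix length $k$ is sampled and every channel after the first $k$ is dropped, so earlier channels must carry the most important information; at inference the first $k$ channels are simply kept. The code below is an illustrative sketch only. Zeroing the dropped channels, sampling $k$ uniformly, and the function names are assumptions, not details from the paper.

```python
import random


def tail_drop(channels, rng, min_keep=1):
    """Training-time sketch: sample k uniformly and zero every channel after the first k."""
    k = rng.randint(min_keep, len(channels))
    kept = [list(ch) if i < k else [0.0] * len(ch)
            for i, ch in enumerate(channels)]
    return kept, k


def truncate(channels, k):
    """Inference-time sketch: keep only the first k channels as the token sequence."""
    return [list(ch) for ch in channels[:k]]


# Hypothetical latent with 4 channels of 2 values each.
latent = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
dropped, k = tail_drop(latent, random.Random(0))
print(k, dropped)
print(truncate(latent, 2))  # prints [[1.0, 2.0], [3.0, 4.0]]
```

Because only a tail is ever removed, channel $i$ is present whenever channel $i+1$ is, which is what pushes the ordering toward coarse-to-fine importance.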

2606.04460 2026-06-04 cs.CR cs.AI cs.LG

CyberGym-E2E: Scalable Real-World Benchmark for AI Agents' End-to-End Cybersecurity Capabilities

Tianneng Shi, Robin Rheem, Dongwei Jiang, Mona Wang, Francisco De La Riega, Zhun Wang, Jingzhi Jiang, Alexander Cheung, Sean Tai, Jonah Cha, Jianhong Tu, Gabriel Han, Chenguang Wang, Jingxuan He, Wenbo Guo, Dawn Song

AI summary: Proposes CyberGym-E2E, a large-scale, realistic end-to-end cybersecurity benchmark that uses an automated pipeline to turn open-source vulnerability data into evaluation environments and comprehensively evaluates AI agents across the full lifecycle of vulnerability discovery, PoC generation, and patch generation.

Comments
ICML 2026

Abstract

AI has the potential to transform cybersecurity by enabling systems that can autonomously detect, analyze, and remediate software vulnerabilities. However, existing cybersecurity evaluations of AI systems are limited in scale or scope, and fail to capture the end-to-end lifecycle of real-world software vulnerability discovery and remediation. To address this gap, we propose CyberGym-E2E, a large-scale and realistic end-to-end cybersecurity benchmark that comprehensively evaluates AI agents' abilities across the full lifecycle of vulnerability discovery, PoC generation, and patch generation. CyberGym-E2E is comprehensive and scalable, as we build an automated, agent-enhanced pipeline for transforming open-source vulnerability data into realistic evaluation environments. Currently, the benchmark consists of 920 real-world vulnerabilities across 139 different open-source projects.

2606.04459 2026-06-04 cs.CR cs.AI cs.CC cs.CL

Token Rankings are Unforgeable Language Model Signatures

Matthew Finlayson, Andreas Grivas, Xiang Ren, Swabha Swayamdipta

AI summary: Finds that a language model's token rankings (tokens ordered by probability) form a unique and unforgeable signature, and studies how a restricted API can present that signature without leaking model parameters.

Abstract

Language model parameters are known to impose unique (to each model) geometric constraints on their logit outputs, which serves as a signature that identifies the model, but also leaks the model's final layer parameters when an API distributes logits. We investigate more restrictive APIs that expose token rankings (i.e., their ordering by probability, but not the probability values) and find that rankings also constitute a signature: every model has a unique set of feasible top-$k$ rankings for sufficiently large $k$. Furthermore, the ranking signature is the first known (polynomially) unforgeable signature, since finding a model with the same set of feasible rankings is NP-hard. On the security front, we find that token rankings are already sufficient to approximately steal the final layer of the model, similar to logits, though the approximation is too coarse to forge the signature, and can be effectively countered by restricting the API to top-$k$ tokens with sufficiently small $k$. Since the top-$k$ required to present the model signature is generally smaller than the $k$ required to prevent stealing, it is possible for an API to present an unforgeable signature without leaking model parameters.
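
To make the "rankings only" API concrete, the sketch below returns the indices of the top-$k$ tokens ordered by logit and shows that the ranking is unchanged by shifting or positively scaling the logits, so it reveals order but not probability values. The function name and the toy logits are hypothetical; this does not implement the paper's signature test or its stealing attack.

```python
def top_k_ranking(logits, k):
    """Token indices of the k largest logits, highest first (ties keep index order)."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:k]


# Hypothetical logits over a 4-token vocabulary.
logits = [2.0, -1.0, 0.5, 3.0]
ranking = top_k_ranking(logits, 3)
print(ranking)  # prints [3, 0, 2]

# Shifting or positively scaling the logits changes the probabilities
# but leaves the ranking untouched.
rescaled = [0.5 * x + 10.0 for x in logits]
print(top_k_ranking(rescaled, 3) == ranking)  # prints True
```

Since softmax is monotone in the logits, ordering by logit and ordering by probability give the same ranking.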

2606.04457 2026-06-04 cs.CV

Imagine Before You Draw: Visual Prompt Engineering for Image Generation

Liyu Jia, Fengda Zhang, Jiachun Pan, Kesen Zhao, Saining Zhang, Wang Lin, Weijia Wu, Yue Liao, Aojun Zhou, Hanwang Zhang

AI summary: Proposes Visual Prompt Engineering (VPE), in which a single model first generates visual semantic tokens as an intermediate plan and then generates the full image, avoiding the information bottleneck and improving image generation quality and editing fidelity.

Abstract

Incorporating visual semantic representations as an intermediate step before image generation can reduce the modeling difficulty between text and images, thereby improving generation quality. Recent works such as X-Omni and BLIP3o-Next have explored this direction, but they typically use a two-stage external pipeline: a separate autoregressive model first generates semantic tokens, which are then fed as conditioning to an independent diffusion decoder. Since the decoder cannot jointly access the original input and the semantic plan, this design introduces an information bottleneck that limits detail preservation in downstream tasks such as editing. Internal architectures such as Transfusion, BAGEL, and Show-o2 avoid this bottleneck by enabling cross-modal interaction within a single model, but they still face the difficult text-to-pixel modeling gap without intermediate semantic guidance. We propose Visual Prompt Engineering (VPE), which can be seamlessly integrated into such internal frameworks. Specifically, the model first autoregressively generates visual semantic tokens (e.g., SigLIP 2) as "visual prompts" that capture the semantic layout, then generates the full image tokens conditioned on this plan. We validate VPE across class-conditional generation, text-to-image generation, and image editing, covering various token types and model architectures. Results show that VPE can accelerate convergence, raise quality ceilings, and through internal integration, achieve substantially better editing preservation (PSNR: 26.76 vs. 19.92) than external alternatives of the same parameter scale, while maintaining competitive editing responsiveness.
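
The editing-preservation numbers above are reported as PSNR. For reference, a minimal sketch of the standard definition, $\mathrm{PSNR} = 10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$, is given below; the function name, the flat pixel lists, and the 8-bit peak value of 255 are assumptions, and the paper's exact evaluation protocol is not known from the abstract.

```python
import math


def psnr(reference, edited, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(edited):
        raise ValueError("inputs must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(reference, edited)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_value ** 2 / mse)


print(psnr([0, 0], [255, 255]))  # prints 0.0 (maximal error on both pixels)
print(psnr([10, 20], [10, 20]))  # prints inf
```

Because PSNR is logarithmic, and assuming both scores use the same peak value, the gap between 26.76 and 19.92 (6.84 dB) corresponds to roughly 4.8 times lower mean squared error.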

2606.04455 2026-06-04 cs.AI cs.CL

The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development?

元智能体挑战:当前智能体能否自主开发智能体?

Xinyu Lu, Tianshu Wang, Pengbo Wang, Zujie Wen, Zhiqiang Zhang, Jun Zhou, Boxi Cao, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun

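AI summary: Proposes the Meta-Agent Challenge (MAC) framework to evaluate whether frontier models can autonomously develop agent systems; most meta-agents fail to match human-designed baseline policies and show robustness and alignment problems.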

详情
Comments
Website: https://meta-agent-challenge.com/
AI中文摘要

当前的AI基准测试评估智能体在人类设计的工作流程中执行任务的能力。这些评估从根本上未能衡量一个关键的更高级能力:模型能否自主开发智能体系统。我们引入了元智能体挑战(MAC),这是一个评估框架,旨在测试前沿模型自主开发智能体的能力。具体来说,一个代码智能体(元智能体)被赋予一个沙盒环境、一个评估API和一个时间限制,以迭代地编程一个智能体工件,该工件在五个领域的保留测试集上最大化性能。为确保评估完整性,该框架通过多层防御机制防止奖励黑客攻击。利用该框架,我们证明元智能体很少能匹配人类设计的基线策略,而少数能匹配的则主要由专有前沿模型主导。此外,设计过程表现出高方差,高优化压力会浮现出诸如真实数据窃取等新兴对抗行为——凸显了鲁棒性和模型对齐方面的关键缺陷。最终,MAC为自主AI研究和开发提供了一个严格的、开源的基准测试,为评估递归自我改进提供了经验代理。基准测试公开于:https://github.com/ant-research/meta-agent-challenge。

英文摘要

Current AI benchmarks evaluate agents on task execution within human-designed workflows. These evaluations fundamentally fail to measure a critical next-level capability: whether models can autonomously develop agent systems. We introduce the Meta-Agent Challenge (MAC), an evaluation framework designed to test the capacity of frontier models for autonomous agent development. Specifically, a code agent (the meta-agent) is given a sandboxed environment, an evaluation API, and a time limit to iteratively program an agent artifact that maximizes performance on a held-out test set across five domains. To ensure evaluation integrity, this framework is secured by multi-layer defenses against reward hacking. Leveraging this framework, we demonstrate that meta-agents rarely match human-engineered baseline policies, and the few that do are dominated by proprietary frontier models. Moreover, the design process exhibits high variance, and high optimization pressure surfaces emergent adversarial behaviors like ground-truth exfiltration, highlighting critical deficits in both robustness and model alignment. Ultimately, MAC provides a rigorous, open-source benchmark for autonomous AI research and development, offering an empirical proxy for evaluating recursive self-improvement. Benchmark is publicly available at: https://github.com/ant-research/meta-agent-challenge.

2606.04454 2026-06-04 cs.CL

Stepwise Reasoning Enhancement for LLMs via External Subgraph Generation

通过外部子图生成增强大语言模型的逐步推理

Xin Zhang, Yang Cao, Baoxing Wu, Kai Song, Siying Li

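AI summary: Proposes the SGR framework, which generates query-relevant subgraphs from a knowledge graph to guide large language models through stepwise reasoning, improving accuracy, robustness, and interpretability on complex multi-step reasoning.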

详情
AI中文摘要

大语言模型在自然语言生成和下游推理任务中表现出色,但在复杂多步推理中仍面临逻辑一致性、事实基础和可解释性方面的挑战。为解决这些局限,本文提出SGR,一种通过查询相关子图生成将大语言模型与外部知识图谱集成的逐步推理增强框架。给定输入问题,SGR首先提取关键实体、关系和约束以构建结构化模式,然后通过模式引导查询从知识图谱中检索紧凑子图。生成的子图提供明确的关系证据,引导语言模型进行逐步推理。此外,SGR结合了基于Cypher的直接推理与协作推理集成,允许根据模型置信度和图一致性验证和聚合来自多个推理路径的候选答案。在包括CWQ、WebQSP、GrailQA和KQA Pro的基准数据集上的实验表明,SGR在推理准确性和Hits@1性能上优于标准提示和几种知识增强基线。消融研究进一步表明,模式引导和基于Neo4j的检索对框架的有效性都至关重要。这些结果表明,动态生成的外部子图可以提高基于大语言模型的推理的准确性、鲁棒性和可解释性。

英文摘要

Large language models have shown strong performance in natural language generation and downstream reasoning tasks, but they still struggle with logical consistency, factual grounding, and interpretability in complex multi-step reasoning. To address these limitations, this paper proposes SGR, a stepwise reasoning enhancement framework that integrates large language models with external knowledge graphs through query-relevant subgraph generation. Given an input question, SGR first extracts key entities, relations, and constraints to construct a structured schema, then retrieves compact subgraphs from a knowledge graph using schema-guided querying. The generated subgraphs provide explicit relational evidence that guides the language model through step-by-step reasoning. In addition, SGR combines direct Cypher-based reasoning with collaborative reasoning integration, allowing candidate answers from multiple reasoning paths to be validated and aggregated according to both model confidence and graph consistency. Experiments on benchmark datasets including CWQ, WebQSP, GrailQA, and KQA Pro demonstrate that SGR improves reasoning accuracy and Hits@1 performance over standard prompting and several knowledge-enhanced baselines. Ablation studies further show that schema guidance and Neo4j-based retrieval are both crucial to the effectiveness of the framework. These results indicate that dynamically generated external subgraphs can improve the accuracy, robustness, and interpretability of LLM-based reasoning.

2606.04453 2026-06-04 cs.CV cs.LG

Radiomic Feature Selection Using Gradient Loss of Deep Neural Network for Lung Cancer Stage Detection

基于深度神经网络梯度损失的放射组学特征选择用于肺癌分期检测

Hina Shakir, Mohammad Mohatram, Javeed Hussain, Syed Rizwan Ali, Muhammad Irfan Memon

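AI summary: Proposes the GL-RFE framework, which uses deep neural network gradient sensitivity analysis to recursively eliminate low-contribution features, selecting the top 15 of 106 radiomic features for early- versus advanced-stage lung cancer classification with 90.22% accuracy.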

详情
Journal ref
J. Vis. Exp. (230), e70181, (2026)
AI中文摘要

放射组学能够从医学图像中提取定量成像生物标志物,已成为计算机辅助癌症诊断的重要工具。然而,放射组学数据集通常具有高维小样本的特点,使得特征选择成为构建可靠预测模型的关键步骤。本研究提出了一种梯度损失递归特征消除(GL-RFE)框架,该框架集成深度神经网络的梯度敏感性分析,以识别对肺癌分期检测最具影响力的放射组学特征。使用3D Slicer平台的PyRadiomics扩展从胸部计算机断层扫描(CT)中提取了总共106个放射组学特征。所提出的方法通过计算网络损失相对于输入特征的梯度来评估特征重要性,并递归消除贡献最小的特征。最终选出的前15个放射组学特征用于训练深度神经网络分类器,以区分早期和晚期肺癌。该框架在测试数据集上取得了强劲的分类性能,准确率为90.22%,精确率为90.10%,召回率为90.24%,F1分数为90.16%。可视化分析(包括相关性热图和分布图)进一步证实了特征冗余减少和类别可分性提高。与传统特征选择技术相比,GL-RFE有效捕捉了非线性特征交互并增强了模型泛化能力。所提出的协议为基于放射组学的癌症分期检测提供了一种可重复且可解释的方法,特别适用于高维小样本生物医学数据集,并在基因组学和多模态临床分析等其他领域具有潜在应用价值。

英文摘要

Radiomics enables extraction of quantitative imaging biomarkers from medical images and has become an important tool for computer-aided cancer diagnosis. However, radiomics datasets are typically high-dimensional with limited samples, making feature selection a critical step for building reliable predictive models. This study proposes a Gradient-Loss Recursive Feature Elimination (GL-RFE) framework that integrates gradient sensitivity analysis from a deep neural network to identify the most influential radiomic features for lung cancer stage detection. A total of 106 radiomic features were extracted from chest Computed Tomography (CT) scans using the PyRadiomics extension of the 3D Slicer platform. The proposed method evaluates feature importance by computing gradients of the network loss with respect to input features and recursively eliminates features with minimal contribution. The resulting top-15 radiomic features are used to train a deep neural network classifier for distinguishing early-stage and advanced-stage lung cancer. The proposed framework achieves strong classification performance, with accuracy of 90.22%, precision of 90.10%, recall of 90.24%, and F1-score of 90.16% on the test dataset. Visualization analyses, including correlation heat maps and distribution plots, further confirm reduced feature redundancy and improved class separability. Compared to conventional feature selection techniques, GL-RFE effectively captures nonlinear feature interactions and enhances model generalization. The presented protocol provides a reproducible and interpretable methodology for radiomics-based cancer stage detection and is particularly suitable for high-dimensional, small-sample biomedical datasets, with potential applications in other domains such as genomics and multimodal clinical analysis.
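
The elimination loop described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: a logistic-regression model stands in for the deep neural network so that the input gradient has a closed form, and the function names (`train_logreg`, `input_gradient_importance`, `gradient_rfe`) and the one-feature-per-round schedule are assumptions made for illustration.

```python
import numpy as np

def train_logreg(X, y, steps=500, lr=0.1):
    """Fit a logistic-regression stand-in for the paper's deep network."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def input_gradient_importance(X, y, w, b):
    """Mean |d loss / d x_j| over samples for binary cross-entropy loss."""
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    # For this model the per-sample input gradient is (p_i - y_i) * w.
    grads = (p - y)[:, None] * w[None, :]
    return np.abs(grads).mean(axis=0)

def gradient_rfe(X, y, n_keep, drop_per_round=1):
    """Recursively drop the features with the smallest gradient importance."""
    kept = list(range(X.shape[1]))
    while len(kept) > n_keep:
        w, b = train_logreg(X[:, kept], y)
        imp = input_gradient_importance(X[:, kept], y, w, b)
        n_drop = min(drop_per_round, len(kept) - n_keep)
        drop = set(np.argsort(imp)[:n_drop].tolist())
        kept = [f for i, f in enumerate(kept) if i not in drop]
    return kept
```

With a real network the same loop would obtain the input gradients by automatic differentiation instead of the closed form used here; features are assumed to be standardized so that gradient magnitudes are comparable.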

2606.04451 2026-06-04 cs.LG

On Out-of-sample Embedding in UMAP

UMAP中的样本外嵌入

Mohammad Tariqul Islam, Jason W. Fleischer

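AI summary: Addresses the repulsion effect that arises when UMAP adds new samples by optimizing pairwise interactions within the original k-nearest-neighbor graph, and shows that parametric UMAP improves embedding quality.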

详情
Comments
22 pages, 16 figures
AI中文摘要

邻域嵌入算法通过在低维空间中构建等价的图表示来揭示高维数据中的相关性。一种日益流行的算法是统一流形逼近与投影(UMAP),它使用代数拓扑来映射两个空间之间的距离。虽然它在许多类型的数据集上表现良好,但UMAP在将样本外点添加到现有映射时存在困难。特别是,UMAP通常将新点放置在所发现簇的周边,而不是与它们的相关邻居一起放在簇的内部。在这里,我们通过优化原始k近邻图中的成对交互来克服这种“排斥效应”。此外,我们表明参数化UMAP比非参数算法获得更好的嵌入,特别是当数据变得更复杂时(例如,医学图像)。我们还表明,当使用参数化UMAP嵌入数据时,排斥效应自然得到缓解。我们使用可信度、最近邻分类器以及分析嵌入中的吸引力和排斥力来表征不同的UMAP方法。

英文摘要

Neighbor embedding algorithms reveal correlations in high-dimensional data by constructing an equivalent graph representation in a lower-dimensional space. An increasingly popular algorithm is Uniform Manifold Approximation and Projection (UMAP), which uses algebraic topology to map distances between the two spaces. While it works well on many types of data sets, UMAP has trouble adding out-of-sample points to a pre-existing mapping. In particular, UMAP often places new points on the periphery of the found clusters, rather than in their interiors with their correlated neighbors. Here, we overcome this "repulsion effect" by optimizing pairwise interactions within the original k-nearest-neighbor graph. Moreover, we show that parameterizing UMAP obtains better embeddings than non-parametric algorithms, particularly as the data gets more complex (e.g., medical images). We also show that the repulsion effect is naturally mitigated when a parameterized UMAP is employed to embed the data. We characterize different UMAP approaches using trustworthiness, nearest neighbor classifiers, and by analyzing attractive and repulsive forces in the embeddings.

2606.04450 2026-06-04 cs.CL cs.CY

Listening to the Workforce: Measuring Construction Worker Safety Attitudes from Social Media Discourse Using LLMs

倾听劳动力:使用LLMs从社交媒体话语中测量建筑工人安全态度

Farouq Sammour, Yuxin Zhang, Zhenyu Zhang

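AI summary: Proposes and validates the Construction Safety Attitude Framework (CSAF), using an LLM classifier to measure worker safety attitudes from Reddit community discourse with high-accuracy, multidimensional analysis.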

详情
AI中文摘要

工人安全态度是决定建筑工地上保护措施是否被应用或规避的关键因素。然而,大规模测量安全态度一直难以实现。安全态度是多维的,因话题而异,并且在工人自己的对话中最为坦诚。本研究创建并验证了建筑安全态度框架(CSAF),该框架整合了两个组成部分:一个基于理论的结构,沿八个维度表征安全态度;以及一个用于在工人自然话语中测量这些态度的操作化编码手册。将CSAF应用于Reddit上r/Construction社区的250条帖子和评论,经过训练的编码者达到了高度一致(Krippendorff's α = 0.85)。成对提升度和条件概率证实了八个维度既相关又不同。为了将框架应用于大量话语,CSAF通过大语言模型(LLM)分类器进行操作化。在450条r/Construction贡献中,分类器再现了专家人工编码(Cohen's κ = 0.90,精确率 = 0.98,召回率 = 0.98),并且在400条r/Roofing贡献中,转移到不同行业社区后仍保持该准确率(κ = 0.89,精确率 = 0.98,召回率 = 0.97)。一项价值验证案例研究将经过验证的分类器应用于10,346条r/Roofing贡献,证明CSAF能够按安全主题区分多维态度,追踪它们随时间的变化,并追溯不利态度背后的推理。因此,本研究提供了一个理论扎实、经验验证的工具来检查安全态度,为针对不安全实践背后态度的干预措施提供了基础。

英文摘要

Worker safety attitudes are key determinants of whether protective practices are applied or bypassed on construction sites. Yet measuring them at scale has remained out of reach. Safety attitudes are multidimensional, vary across topics, and surface most candidly in workers' own conversations. This study created and validated the Construction Safety Attitude Framework (CSAF), which integrates two components: a theory-grounded structure that characterizes safety attitudes along eight dimensions, and an operational codebook for measuring them in workers' naturalistic discourse. Applying CSAF to 250 posts and comments from the r/Construction community on Reddit, trained coders reached strong agreement (Krippendorff's α = 0.85). Pairwise lift and conditional probability confirmed that the eight dimensions are related yet distinct. To apply the framework across large volumes of discourse, CSAF was operationalized through a large language model (LLM) classifier. On 450 r/Construction contributions, the classifier reproduced expert human coding (Cohen's κ = 0.90, precision = 0.98, recall = 0.98), and on 400 contributions from r/Roofing it retained that accuracy after transfer to a different trade community (κ = 0.89, precision = 0.98, recall = 0.97). A proof-of-value case study then applied the validated classifier to 10,346 contributions from r/Roofing, demonstrating that CSAF can distinguish multidimensional attitudes by safety topic, track how they shift over time, and trace the reasoning behind unfavorable ones. The study therefore provides a theoretically grounded, empirically vetted instrument for examining safety attitudes, offering a basis for targeted interventions that address the attitudes underlying unsafe practices.
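
The classifier-versus-human agreement reported above is measured with Cohen's κ, which corrects raw agreement for the agreement expected by chance. The following is a minimal sketch of the standard two-rater formula, κ = (p_o - p_e) / (1 - p_e); the function name and the toy labels are illustrative and do not come from the study.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)

human = ["favorable", "favorable", "unfavorable", "unfavorable"]
model = ["favorable", "favorable", "unfavorable", "favorable"]
kappa = cohens_kappa(human, model)  # p_o = 0.75, p_e = 0.5
```

In this toy case the raters agree on three of four items, but half of that agreement is expected by chance, so κ is 0.5 rather than 0.75.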

2606.04446 2026-06-04 cs.DC cs.LG

D^2SD: Accelerating Speculative Decoding with Dual Diffusion Draft Models

D^2SD: 使用双重扩散草稿模型加速推测解码

Liyuan Zhang, Jiarui Zhang, Jinwei Yao, Ran Yan, Yuchen Yang, Jiahao Zhang, Tongkai Yang, Yi Wu, Binhang Yuan

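AI summary: Proposes the D^2SD framework, which uses dual diffusion draft models and a confidence-guided prefix tree to raise the acceptance rate of speculative decoding, outperforming both the underlying diffusion approach and autoregressive speculative decoding baselines.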

详情
AI中文摘要

推测解码通过草拟多个令牌并在单次目标模型前向传递中验证它们,加速自回归大语言模型推理。最近的基于扩散的草稿模型并行生成整个令牌块,但通常每次验证只提交单个草稿序列:一旦出现第一个不匹配,所有后续草稿令牌被丢弃,导致接受率有限。简单地对更多草稿候选序列进行批处理只会带来边际改进,因为冗余或位置不当的分支增加了草拟和验证的成本,而没有成比例地增加接受的令牌数量。我们提出D^2SD,一种双重扩散草稿推测解码框架,将候选组织成置信度引导的前缀树,其中第一个扩散草稿器生成一个块以及每个位置的置信度分数,用于识别最可能的拒绝边界并选择前K个前缀范围进行恢复;第二个可变前缀扩散草稿器在每个选定前缀处重新锚定,并在一次批处理中提出替代延续;得到的共享前缀候选通过级联注意力联合验证。实验表明,D^2SD在底层扩散方法和强自回归推测解码基线上均有明显改进。

英文摘要

Speculative decoding accelerates autoregressive large language model inference by drafting multiple tokens and verifying them in a single target-model forward pass. Recent diffusion-based drafters generate an entire block of tokens in parallel but usually commit to a single draft sequence per verification: once the first mismatch occurs, all subsequent draft tokens are discarded, resulting in a limited acceptance rate. Naively batching more draft candidate sequences only introduces a marginal improvement, as redundant or poorly placed branches increase the cost of drafting and verification without proportionally increasing the number of accepted tokens. We propose D^2SD, a dual diffusion draft speculative decoding framework that organizes candidates into a confidence-guided prefix tree, where the first diffusion drafter generates a block along with per-position confidence scores that are used to identify the most likely rejection boundary and select the top-K prefix ranges for recovery; the second variable-prefix diffusion drafter re-anchors at each selected prefix and proposes alternative continuations in one batched pass; the resulting shared-prefix candidates are jointly verified via cascade attention. Empirically, D^2SD shows clear improvements over both the underlying diffusion approach and strong autoregressive speculative decoding baselines.
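
Two ideas in this abstract lend themselves to a small illustration: a single draft is only useful up to its first mismatch, and per-position confidence can suggest where that mismatch is likely to fall. The sketch below is a simplification under stated assumptions, not the D^2SD algorithm: verification is modelled as greedy token matching, and the rule "re-anchor just before the K least-confident positions" is a hypothetical stand-in for the paper's prefix-range selection.

```python
def accepted_prefix_length(draft_tokens, target_tokens):
    """Number of leading draft tokens that match the target model's tokens."""
    accepted = 0
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted != verified:
            break  # everything after the first mismatch is discarded
        accepted += 1
    return accepted

def likely_rejection_prefixes(confidences, k):
    """Prefix lengths ending just before the k least-confident positions."""
    by_confidence = sorted(range(len(confidences)), key=lambda i: confidences[i])
    return sorted(by_confidence[:k])

draft = [5, 7, 9, 1]
target = [5, 7, 2, 1]
n_accepted = accepted_prefix_length(draft, target)
prefixes = likely_rejection_prefixes([0.9, 0.8, 0.3, 0.95, 0.5], k=2)
```

Here the draft is cut at position 2 even though position 3 matches again, which is the waste the second drafter is meant to recover by proposing alternative continuations from the selected prefixes.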

2606.04445 2026-06-04 cs.LG cs.AI math.ST stat.TH

RowNet: A Memory Transformer for Tabular Regression

RowNet: 用于表格回归的记忆Transformer

Askat Rakhymbekov, Gulshat Muhametjanova

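AI summary: For tabular regression in real estate valuation, proposes RowNet, a retrieval-based neural architecture that predicts prices using pairwise similarity features against a memory bank, target-consistency augmentation, and a mixture-of-experts module.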

详情
Comments
Retrieval-based neural architecture for real estate valuation. Related to TabR (arXiv:2307.14338) and retrieval-augmented tabular learning
AI中文摘要

房地产估值是一个结构化回归问题,其中价格受异构特征类型、稀疏区域效应、非线性交互以及可比房产的实际逻辑影响。标准多层感知器将每一行视为孤立向量,必须仅从监督中学习局部性、尺度敏感性和类别匹配。梯度提升决策树提供了强大的表格基线,但其以特征为中心的分裂机制并未显式建模相似历史观测的检索。本文提出了RowNet,一种用于房地产每平方米价格预测的基于检索的神经网络架构。RowNet通过针对标记属性记忆库的成对相似性特征来表示查询属性。第一检索层从仅特征相似性中估计粗略目标。第二层通过目标一致性特征增强记忆比较,并使用多个学习注意力头检索互补的可比集。最终的混合专家模块结合了学习门控、残差校正、熵正则化和头多样性正则化以产生预测。

英文摘要

Real estate valuation is a structured regression problem in which prices are governed by heterogeneous feature types, sparse regional effects, nonlinear interactions, and the practical logic of comparable properties. Standard multilayer perceptrons treat each row as an isolated vector and must learn locality, scale sensitivity, and categorical matching from supervision alone. Gradient-boosted decision trees provide strong tabular baselines, but their feature-centric splitting mechanism does not explicitly model the retrieval of similar historical observations. This paper presents RowNet, a retrieval-based neural architecture for real estate price-per-square-meter prediction. RowNet represents a query property through pairwise similarity features against a memory bank of labeled properties. A first retrieval layer estimates a coarse target from feature-only similarities. A second layer augments the memory comparison with target-consistency features and uses multiple learned attention heads to retrieve complementary comparable sets. A final mixture-of-experts module combines learned gating, residual correction, entropy regularization, and head-diversity regularization to produce the prediction.
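
The first retrieval layer described above, a coarse target estimated from feature-only similarities to a memory bank, can be approximated by a softmax-weighted average of stored targets. This is a minimal sketch under assumptions: RowNet learns its similarity features, whereas negative squared Euclidean distance with a fixed `temperature` is used here, and `retrieval_estimate` is a hypothetical name.

```python
import numpy as np

def retrieval_estimate(query, memory_x, memory_y, temperature=1.0):
    """Coarse target estimate as an attention-weighted mean of memory targets."""
    # Similarity: negative squared distance between the query and each row.
    sq_dist = np.sum((memory_x - query) ** 2, axis=1)
    logits = -sq_dist / temperature
    logits -= logits.max()  # stabilise the softmax
    weights = np.exp(logits)
    weights /= weights.sum()
    return float(weights @ memory_y), weights

memory_x = np.array([[0.0, 0.0], [10.0, 10.0]])   # stored property features
memory_y = np.array([100.0, 200.0])               # stored prices per square meter
estimate, weights = retrieval_estimate(np.array([0.0, 0.1]), memory_x, memory_y)
```

The query sits next to the first stored property, so nearly all attention weight falls on it and the estimate is close to that property's price.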

2606.04444 2026-06-04 eess.IV cs.LG

Scaling Datasets for Multi-Sensor, Multi-Agent, and Multi-Domain Learning in Autonomous Systems

面向自主系统中多传感器、多智能体与多领域学习的数据集扩展

R. Spencer Hallyburton, David Hunt, Miroslav Pajic

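AI summary: Proposes a modular dataset generation pipeline built on AVstack and CARLA that creates terabyte-scale, ground-truth-labeled multi-domain data, supports single- and multi-agent setups with flexible sensor configurations, and targets application-specific training and collaborative autonomy research.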

详情
AI中文摘要

现有数据集无法支持多智能体、多传感器或多领域自主系统中的大规模学习,而多样性和协调性在这些系统中至关重要。我们提出了一种模块化数据集生成流程,利用AVstack框架和CARLA模拟器,为地面、空中和基础设施系统创建TB级、带有真实标签的数据。该流程支持单智能体和多智能体配置,配备灵活的传感器套件,能够在具有挑战性的条件下进行可控实验。代表性的感知与融合研究表明,生成的数据可以支持特定应用的训练和协作自主性。

英文摘要

Existing datasets cannot support large-scale learning in multi-agent, multi-sensor, or multi-domain autonomy, where diversity and coordination are essential. We present a modular dataset generation pipeline that creates terabyte-scale, ground-truth-labeled data for ground, aerial, and infrastructure-based systems using the AVstack framework and CARLA simulator. Supporting single- and multi-agent configurations with flexible sensor suites, the pipeline enables controllable experimentation across challenging conditions. Representative perception and fusion studies show how generated data can support application-specific training and collaborative autonomy.

2606.04442 2026-06-04 cs.CL cs.AI

MemoryDocDataSet: A Benchmark for Joint Conversational Memory and Long Document Reasoning

MemoryDocDataSet: 联合对话记忆与长文档推理的基准测试

Qiyang Xie, Jialun Wu, Xinjie He, Su Liu, Shuai Xiao, Zhiyuan Lin, Weikai Zhou

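AI summary: Proposes MemoryDocDataSet, a synthetic benchmark of 50 micro-worlds and 1,000 QA pairs that evaluates a system's ability to handle multi-session conversation history together with long-document reading comprehension; 75.1% of questions require hybrid reasoning (navigating conversation history first, then extracting the answer from a document), and experiments show a clear joint-retrieval gap.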

详情
Comments
17 pages, 2 figures, 8 tables. Submitted for peer review
AI中文摘要

人工智能系统越来越需要结合两种要求很高的能力:导航多轮对话历史和在长文档中进行深度阅读理解。然而,现有的基准测试没有同时评估这两者。我们引入了MemoryDocDataSet,一个包含50个微世界和1000个QA对的合成基准,其中每个实例包含3-5个人物角色、一个跨越数月活动的时间事件图、3-5篇真实长文档(每篇20,000-50,000个token,来自Caselaw Access Project)、基于这些文档的多轮对话,以及跨越五个推理类别的20个问答对。其定义特征是混合源标签:需要系统首先导航对话历史以确定哪个文档相关,然后从该文档中提取答案的问题。混合问题占数据集的75.1%。通过使用LLM作为评判者的提示敏感性自一致性分析来表征数据集质量,在所有50个微世界中得到中位数Cohen's $κ= 0.634$。我们评估了六种基线配置,涵盖截断上下文、长上下文LLM、检索增强生成(RAG)和记忆系统。最佳基线(RAG-Both)在整体F1上达到0.358,在混合问题上达到0.342。仅文档检索(RAG-Doc)在混合问题上降至0.267,尽管在仅文档问题上达到0.453,这显示了明显的联合检索差距,激励了统一对话记忆与长文档导航的架构。我们发布了数据集、生成流水线和所有基线实现。

英文摘要

AI systems increasingly need to combine two demanding capabilities: navigating multi-session conversation history and performing deep reading comprehension within long documents. Yet no existing benchmark evaluates both simultaneously. We introduce MemoryDocDataSet, a synthetic benchmark of 50 micro-worlds and 1,000 QA pairs in which each instance comprises 3-5 personas, a temporal event graph spanning months of activity, 3-5 real long documents (20,000-50,000 tokens each, sourced from the Caselaw Access Project), multi-session conversations grounded in those documents, and 20 question-answer pairs across five reasoning categories. The defining feature is the Hybrid source tag: questions requiring a system to first navigate conversation history to identify which document is relevant, then extract the answer from within that document. Hybrid questions account for 75.1% of the dataset. Dataset quality is characterised through a prompt-sensitivity self-consistency analysis using LLM-as-judge, yielding a median Cohen's κ = 0.634 across all 50 micro-worlds. We evaluate six baseline configurations spanning truncated context, long-context LLMs, retrieval-augmented generation (RAG), and memory systems. The best baseline (RAG-Both) achieves 0.358 overall F1 and 0.342 on Hybrid. Document-only retrieval (RAG-Doc) collapses to 0.267 on Hybrid despite achieving 0.453 on Doc-only questions, demonstrating a clear joint-retrieval gap that motivates architectures unifying conversational memory with long-document navigation. We release the dataset, generation pipeline, and all baseline implementations.

2606.04438 2026-06-04 cs.LG cs.AI

LoopMoE: Unifying Iterative Computation with Mixture-of-Experts for Language Modeling

LoopMoE:统一迭代计算与混合专家模型用于语言建模

Wenkai Chen, Tianshu Li, Wenyong Huang, Yichun Yin, Lifeng Shang, Chengwei Qin

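AI summary: Proposes LoopMoE, which uses iterative adaptive layer normalization (IterAdaLN) and a capacity-balancing strategy; under identical parameters and FLOPs, the looped MoE language model outperforms a standard MoE on multiple benchmarks.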

详情
AI中文摘要

混合专家模型(MoE)和循环架构分别沿着参数容量和有效深度两个正交维度扩展模型。然而,主流的循环架构依赖于密集主干,将参数数量与每个token的FLOPs耦合,这使得在匹配预算下无法隔离迭代计算的效果。为此,我们提出了LoopMoE,一种循环MoE语言模型,通过两种设计将稀疏路由与迭代权重共享计算相结合。第一种是IterAdaLN,它通过联合以迭代索引和每个token隐藏状态为条件的调制信号来解决权重共享对称性。第二种是一种容量平衡策略,恢复了经过良好调整的非循环参考模型的注意力到FFN活跃参数比率。这些设计共同实现了在相同总参数、每个token FLOPs和活跃子层比率下,循环MoE与标准MoE的首次严格受控的头对头评估。在3B规模下,LoopMoE在9个下游基准测试中的8个上优于标准MoE,平均提升超过1个点。在9B规模下,LoopMoE继续优于匹配的标准MoE,表明架构优势在更大规模下持续存在。我们的工作建立了稀疏性和循环性的受控综合,并为循环语言模型指明了一个有前景的方向。

英文摘要

Mixture-of-Experts (MoE) and looped architectures scale models along two orthogonal axes, namely parameter capacity and effective depth. However, mainstream looped architectures rely on dense backbones that couple parameter count with per-token FLOPs, which makes it impossible to isolate the effect of iterative computation under matched budgets. To this end, we present LoopMoE, a looped MoE language model that integrates sparse routing with iterative weight-shared computation through two designs. The first is IterAdaLN, which resolves weight-sharing symmetry via a modulation signal jointly conditioned on the iteration index and the per-token hidden state. The second is a capacity-balancing strategy that recovers the attention-to-FFN active parameter ratio of well-tuned non-looped references. Together, these designs enable the first strictly controlled, head-to-head evaluation of a looped MoE against a Vanilla MoE under identical total parameters, per-token FLOPs, and active sublayer ratios. At the 3B scale, LoopMoE outperforms the Vanilla MoE on 8 of 9 downstream benchmarks with an average improvement exceeding 1 point. At the 9B scale, LoopMoE continues to outperform the matched Vanilla MoE, indicating that the architectural gain persists at larger scale. Our work establishes a controlled synthesis of sparsity and recurrence, and suggests a promising direction for looped language models.

2606.04437 2026-06-04 cs.CV

INTACT: Ego-Guided Typed Sparse Evidence Retrieval for Heterogeneous Collaborative Perception

INTACT: 面向异构协同感知的自我引导类型化稀疏证据检索

Chen Li, Shengrong Yuan, Jialong Zuo, Xinzhong Zhu, Nong Sang, Changxin Gao

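AI summary: Proposes the INTACT framework, a sparse retrieval mechanism in which the ego vehicle issues typed evidence queries and collaborators return only local evidence, enabling zero-training integration of new agents in heterogeneous collaborative perception, with efficient performance on OPV2V-H and DAIR-V2X.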

详情
AI中文摘要

协同感知通过跨智能体共享信息扩展自动驾驶车辆的感知范围,但异构传感器和感知模型使得中间特征融合难以大规模部署。现有的异构协同方法通常遵循先翻译后融合的范式:协作方特征必须在对齐、适应或投影到自我兼容空间后才能融合。这种特征兼容性契约提升了固定系统的性能,但将部署与协作方特定的适配耦合,使得新加入的异构智能体集成成本高昂。为解决这一问题,我们提出INTACT,一种面向异构协同感知的自我引导类型化稀疏证据检索框架。INTACT不翻译整个协作方特征图,而是让自我车辆发出类型化证据查询,表达可疑目标和证据不足的区域。协作方仅在查询位置返回局部证据,自我车辆通过稀疏的每查询路由选择有用响应,并通过门控残差回写注入。这将兼容性要求从全局特征图可解释性转变为在自我车辆查询下的局部、类型化响应可比性,实现了零训练的异构插入协议:自我接口训练一次,新协作方通过检查点合并加入。在模拟和真实世界的异构协同感知基准上的大量实验验证了INTACT的有效性和可部署性。在OPV2V-H上,INTACT仅用0.52M额外参数和18.0 $\log_2$通信量达到80.1 AP70,相当于密集特征传输的约16倍压缩。在DAIR-V2X上,INTACT在具有挑战性的真实条件下达到43.8 AP50。

英文摘要

Collaborative perception extends the perceptual range of autonomous vehicles by sharing information across agents, but heterogeneous sensors and perception models make intermediate feature fusion difficult to deploy at scale. Existing heterogeneous collaboration methods typically follow a translation-first paradigm: collaborator features must be aligned, adapted, or projected into an ego-compatible space before fusion. Such feature-compatibility contracts improve fixed-system performance, but they couple deployment to collaborator-specific adaptation and make newly joined heterogeneous agents costly to integrate. To address this gap, we propose INTACT, an ego-guided typed sparse evidence retrieval framework for heterogeneous collaborative perception. Instead of translating an entire collaborator feature map, INTACT lets the ego vehicle issue typed evidence queries that express suspected objects and evidence-deficient regions. Collaborators respond only with local evidence at queried locations, and the ego selects useful responses through sparse per-query routing and injects them through gated residual write-back. This changes the compatibility requirement from global feature-map interpretability to local, typed response comparability under ego-issued queries, enabling a zero-training heterogeneous insertion protocol in which the ego interface is trained once and new collaborators join through checkpoint merging. Extensive experiments on simulated and real-world heterogeneous collaborative perception benchmarks validate the effectiveness and deployability of INTACT. On OPV2V-H, INTACT achieves 80.1 AP70 with only 0.52M additional parameters and 18.0 $\log_2$ communication volume, corresponding to about 16$\times$ compression over dense feature transmission. On DAIR-V2X, INTACT achieves 43.8 AP50 under challenging real-world conditions.
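
The communication figures quoted for OPV2V-H can be related by a short calculation. It assumes that the reported value is the base-2 logarithm of the transmitted volume and that the compression ratio is the plain ratio of dense to sparse volume; the symbols V_dense and V_INTACT are introduced here for illustration and the abstract states neither assumption explicitly.

```latex
\[
\log_2 V_{\text{dense}}
  \approx \log_2 V_{\text{INTACT}} + \log_2 16
  = 18.0 + 4
  = 22.0
\]
```

Under those assumptions, a 16-fold compression corresponds to a difference of four units on the log2 scale, so dense feature transmission would sit at roughly 22.0.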

2606.04436 2026-06-04 cs.CV cs.RO

3DThinkVLA: Endowing Vision-Language-Action Models with Latent 3D Priors via 3D-Thinking-Guided Co-training

3DThinkVLA:通过3D思维引导的协同训练赋予视觉-语言-动作模型潜在3D先验

Jiaxin Shi, Xidong Zhang, Fucai Zhu, Zhe Li, Siyu Zhu, Weihao Yuan

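AI summary: Proposes a 3D-thinking-guided co-training framework that disentangles 3D geometry perception from spatial reasoning and injects them at different feature hierarchies, letting VLA models reason about 3D space implicitly during action prediction without 3D sensors or external models, and reaching state-of-the-art performance on multiple benchmarks.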

详情
AI中文摘要

我们提出了一种3D思维引导的协同训练框架,使视觉-语言-动作(VLA)模型能够在动作预测过程中隐式地进行3D空间推理。我们的核心见解是,3D几何感知和3D空间推理是两种不同的能力,可以在不同的特征层次上解耦并注入。在训练过程中,三个紧密耦合的组件主要在潜在空间中协同工作:(1)为了获得几何先验,一个潜在3D几何感知模块将中间视觉特征与3D基础模型对齐,在不修改VLM骨干架构的情况下获取低级几何线索。(2)作为补充,一个在线3D推理蒸馏模块通过共享推理锚点令牌缓解提示引发的推理差距。在3D VLM协同训练期间,该锚点作为第一个输出令牌发出,以稳健地编码空间先验。在VLA训练期间,它作为插入在任务指令和动作指令之间的输入令牌,将高级空间思维从显式教师推理提示转移到学生动作提示,无需链式思维文本生成。(3)然后,这些解耦的几何和推理特征通过空间增强的动作集成统一起来,该集成将它们作为分层空间条件共同注入到动作查询令牌中,以防止动作捷径。在部署时,我们的方法仅保留其轻量级适配器以执行隐式3D推理,丢弃用于监督的3D基础模型和教师分支。因此,它纯粹在2D图像上运行,无需3D传感器、外部模型或显式文本生成,同时防止预训练VLM的灾难性遗忘,在LIBERO、LIBERO-PLUS、SimplerEnv和真实世界操作任务上实现了最先进的性能。

英文摘要

We propose a 3D-thinking-guided co-training framework that enables vision-language-action (VLA) models to perform 3D spatial reasoning implicitly during action prediction. Our core insight is that 3D geometry perception and 3D spatial reasoning are distinct capabilities that can be disentangled and injected at different feature hierarchies. During training, three tightly coupled components work in concert primarily within the latent space: (1) To gain geometric priors, a latent 3D geometry perception module aligns intermediate visual features with a 3D foundation model, acquiring low-level geometric cues without architectural modifications to the VLM backbone. (2) Complementing this, an online 3D reasoning distillation module mitigates the prompt-induced reasoning gap via a shared reasoning anchor token. During 3D VLM co-training, this anchor is emitted as the first output token to robustly encode spatial priors. During VLA training, it serves as an input token inserted between the task and action instructions, transferring high-level spatial thinking from explicit teacher reasoning prompts to student action prompts without chain-of-thought text generation. (3) These disentangled geometric and reasoning features are then united by a spatially augmented action integration, which jointly injects them into the action-query tokens as hierarchical spatial conditions to prevent action shortcuts. At deployment, our method retains only its lightweight adapters to perform implicit 3D reasoning, discarding the 3D foundation model and the teacher branch used for supervision. Consequently, it operates purely on 2D images without 3D sensors, external models, or explicit text generation while preventing catastrophic forgetting of the pretrained VLM, achieving state-of-the-art performance on LIBERO, LIBERO-PLUS, SimplerEnv, and real-world manipulation tasks.

2606.04435 2026-06-04 cs.AI cs.CL cs.CR cs.IR

Cascading Hallucination in Agentic RAG: The CHARM Framework for Detection and Mitigation

智能体RAG中的级联幻觉:用于检测和缓解的CHARM框架

Saroj Mishra

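AI summary: Targets cascading hallucination in multi-step agentic RAG pipelines, where early errors propagate and amplify into incorrect final outputs. Proposes the CHARM framework, whose four components (stage-level fact verification, cross-stage consistency tracking, confidence propagation monitoring, and cascade resolution triggering) detect and mitigate it, reaching an 89.4% cascade detection rate and an 82.1% reduction in error propagation across several datasets.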

详情
AI中文摘要

多步智能体检索增强生成(RAG)管道在复杂推理任务中展现出显著能力,但仍然容易受到一类现有幻觉检测机制系统性遗漏的故障影响:级联幻觉,即在管道早期阶段引入的错误会通过连续推理步骤传播并放大,产生自信但事实不正确的最终输出。为解决这一漏洞,我们将级联幻觉形式化为智能体RAG系统中的一种独特故障模式,提出四种级联模式的分类法,并引入CHARM(级联幻觉感知解析与缓解),一种用于检测和中断多步推理管道中错误传播的架构框架。CHARM包含四个组件——阶段级事实验证、跨阶段一致性跟踪、置信度传播监控和级联解析触发——它们与标准智能体RAG管道并行运行,无需替换架构。我们在HotpotQA、MuSiQue、2WikiMultiHopQA以及一个自定义对抗数据集上,在LangChain智能体管道配置下评估CHARM,实现了89.4%的级联检测率、5.3%的假阳性率、每阶段平均215 ms ± 18 ms的延迟开销,以及82.1%的错误传播减少,而输出级检测器仅为18.5%。组件消融实验证实每个检测模块对整体级联覆盖都有显著贡献。CHARM与人在回路监督框架集成,为生产级智能体AI部署提供完整的可靠性和治理栈。

英文摘要

Multi-step agentic retrieval-augmented generation (RAG) pipelines have demonstrated significant capability for complex reasoning tasks, yet remain vulnerable to a class of failure that existing hallucination detection mechanisms systematically miss: cascading hallucination, where errors introduced at early pipeline stages propagate and amplify across successive reasoning steps, producing confident but factually incorrect final outputs. To address this vulnerability, we formalize cascading hallucination as a distinct failure mode in agentic RAG systems, present a four-type taxonomy of cascade patterns, and introduce CHARM (Cascading Hallucination Aware Resolution and Mitigation), an architectural framework for detecting and interrupting error propagation in multi-step reasoning pipelines. CHARM comprises four components (stage-level fact verification, cross-stage consistency tracking, confidence propagation monitoring, and cascade resolution triggering) that operate alongside standard agentic RAG pipelines without requiring architectural replacement. We evaluate CHARM on HotpotQA, MuSiQue, 2WikiMultiHopQA, and a custom adversarial dataset across LangChain agentic pipeline configurations, achieving an 89.4% cascade detection rate with a 5.3% false positive rate, an average latency overhead of 215 ms ± 18 ms per stage, and an error propagation reduction of 82.1%, compared to 18.5% for output-level detectors. Component ablations confirm that each detection module contributes meaningfully to overall cascade coverage. CHARM integrates with human-in-the-loop oversight frameworks to provide a complete reliability and governance stack for production agentic AI deployment.

2606.04434 2026-06-04 cs.CV cs.LG

Hyper-ICL: Attention Calibration with Hyperbolic Anchor Distillation for Multimodal In-Context Learning

Hyper-ICL:基于双曲锚点蒸馏的注意力校准用于多模态上下文学习

Niloufar Alipour Talemi, Hossein Kashiani, Fatemeh Afghah

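AI summary: Proposes Hyper-ICL, a lightweight training framework that calibrates attention distributions with a low-rank logit adapter and a hyperbolic anchor distillation loss, reconstructing demonstration effects without in-context examples at inference time and improving the accuracy and stability of multimodal in-context learning.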

详情
Comments
Accepted at the 43rd International Conference on Machine Learning (ICML 2026)
AI中文摘要

多模态上下文学习已成为多模态大语言模型的一种实用推理范式,其中少量交错的图像-文本上下文示例条件化模型以解决新任务。尽管灵活,但多模态ICL由于对演示格式、顺序和内容的敏感性,导致高推理延迟和不稳定性。为解决这些限制,我们提出Hyper-ICL,一种轻量级、基于训练的无演示多模态ICL框架,它直接在推理时无需ICD即可重建演示效果。Hyper-ICL学习一个参数高效的低秩logit级适配器,校准注意力分布以更好地匹配演示诱导的注意力重分布。为捕捉演示影响如何随查询变化,我们引入查询自适应调制机制,根据当前查询在层和头之间自适应控制token级的干预强度。最后,我们提出逐层双曲锚点蒸馏损失,通过Lorentz测地距离将中间学生特征对齐到演示条件化的教师。该损失鼓励学生重建ICD诱导的演示-查询关系。在六个不同多模态基准(包括VQAv2、OK-VQA和COCO Caption)上的大量实验表明,Hyper-ICL在准确性和稳定性上持续优于普通ICL和现有最先进方法。

英文摘要

Multimodal In-Context Learning (ICL) has emerged as a practical inference paradigm for Multimodal Large Language Models, where a small set of interleaved image-text In-Context Demonstrations (ICDs) conditions the model to solve new tasks. Despite its flexibility, multimodal ICL incurs high inference latency and suffers from instability due to sensitivity to demonstration formatting, ordering, and content. To address these limitations, we propose Hyper-ICL, a lightweight, training-based framework for demonstration-free multimodal ICL that reconstructs demonstration effects directly without requiring ICDs at inference time. Hyper-ICL learns a parameter-efficient low-rank logit-level adapter that calibrates attention distributions to better match demonstration-induced attention redistribution. To capture how demonstration influence varies across queries, we introduce a query-adaptive modulation mechanism that adaptively controls intervention strength at token level across layers and heads based on the current query. Finally, we propose a layer-wise hyperbolic anchor distillation loss that aligns intermediate student features to a demonstration-conditioned teacher via Lorentz geodesic distance. This loss encourages the student to reconstruct the demonstration-query relationships induced by ICDs. Extensive experiments across six different multimodal benchmarks (including VQAv2, OK-VQA, and COCO Caption) demonstrate that Hyper-ICL consistently improves accuracy and stability over vanilla ICL and existing state-of-the-art methods.
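
The distillation loss above aligns student and teacher features via Lorentz geodesic distance. As a point of reference, the sketch below computes that distance on the unit hyperboloid (curvature -1) using the standard formula d(x, y) = arccosh(-⟨x, y⟩_L). How Hyper-ICL maps features onto the hyperboloid and which curvature it uses are not stated in the abstract, so `lift_to_hyperboloid` and the curvature choice are assumptions.

```python
import numpy as np

def lift_to_hyperboloid(v):
    """Map a Euclidean vector v to the unit hyperboloid: x0 = sqrt(1 + |v|^2)."""
    v = np.asarray(v, dtype=float)
    x0 = np.sqrt(1.0 + np.dot(v, v))
    return np.concatenate(([x0], v))

def lorentz_inner(x, y):
    """Lorentzian inner product with signature (-, +, ..., +)."""
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def lorentz_distance(x, y):
    """Geodesic distance between two points on the unit hyperboloid."""
    # Clamp guards against -<x, y>_L dipping below 1 through rounding.
    return float(np.arccosh(np.maximum(-lorentz_inner(x, y), 1.0)))

origin = lift_to_hyperboloid([0.0, 0.0])
point = lift_to_hyperboloid([np.sinh(1.0), 0.0])  # one unit from the origin
dist = lorentz_distance(origin, point)
```

The second point is constructed as (cosh 1, sinh 1, 0), so its geodesic distance from the origin is 1 by construction, which makes the example easy to check by hand.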

2606.04433 2026-06-04 cs.CV cs.CL cs.LG

Stateful Visual Encoders for Vision-Language Models

Zirui Wang, Junwei Yu, Adam Yala, David M. Chan, Joseph E. Gonzalez, Trevor Darrell

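AI summary: Proposes a stateful visual encoder that conditions each visual representation on prior visual features, improving a vision-language model's ability to perceive visual changes in multi-image, multi-turn interactions, with consistent gains on tasks such as cross-image spatial aggregation, multi-object visual differencing, and trajectory behavior cloning.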

Details
Comments
Project page: https://statefulvisualencoders.github.io/
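AI Chinese abstract (translated to English)

Vision-language models (VLMs) are increasingly used in multi-image, multi-turn agentic settings where decisions depend on visual changes. However, in existing open-weight VLMs, visual comparison happens only inside the language model, while the visual encoder itself is stateless: each image is encoded independently, with no access to prior visual context. As a result, small but task-critical changes may be attenuated before the language model has a chance to compare them, especially when those changes do not affect the high-level semantics of the scene. We introduce a stateful visual encoder that conditions each visual representation on prior visual features. Under supervised fine-tuning, VLMs equipped with stateful encoders achieve consistent improvements on controlled tasks involving cross-image spatial aggregation, multi-object visual differencing, and visual trajectory behavior cloning. These improvements hold across input resolutions, language model sizes, and VLM backbones. Finally, we validate our model on real-world tasks, including longitudinal radiology, fine-grained image comparison, and remote sensing, where stateful encoders consistently improve on generalist VLM baselines and can match or surpass specialized models in selected domains. Project page: https://statefulvisualencoders.github.io/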

English abstract

Vision-language models (VLMs) are increasingly used in multi-image, multi-turn agentic settings where decisions depend on visual changes. However, in existing open-weight VLMs, visual comparisons happen only inside the language model, while the visual encoder itself remains stateless: each image is encoded independently, without access to the prior visual context. As a result, small but task-critical changes may be attenuated before the language model has a chance to compare them, especially when those changes do not affect the high-level semantics of the scene. We introduce a Stateful Visual Encoder, which conditions each visual representation on prior visual features. Under supervised finetuning, VLMs equipped with stateful encoders achieve consistent improvements on controlled tasks involving cross-image spatial aggregation, multi-object visual differencing, and visual trajectory behavior cloning. These improvements are consistent across input resolutions, language model sizes, and VLM backbones. Finally, we validate our model on real-world tasks, including longitudinal radiology, fine-grained image comparison, and remote sensing, where stateful encoders consistently improve generalist VLM baselines and can match or surpass specialized models in selected domains. Project page: https://statefulvisualencoders.github.io/
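
To make the stateless-versus-stateful distinction concrete, here is a deliberately simplified sketch. It is not the paper's architecture: the `StatefulEncoder` class, the `toy_encode` function, and the choice to append a difference from the previous features are all hypothetical, chosen only to show what it could mean for each representation to be conditioned on prior visual features.

```python
def toy_encode(image):
    """Hypothetical stateless encoder: flatten a 2D pixel grid into features."""
    return [float(p) for row in image for p in row]

class StatefulEncoder:
    """Wraps a stateless encoder and conditions its output on the prior image."""

    def __init__(self, encode_fn):
        self.encode_fn = encode_fn
        self.prev = None  # features of the previously encoded image

    def reset(self):
        """Forget the visual history, e.g. at the start of a new episode."""
        self.prev = None

    def encode(self, image):
        feats = self.encode_fn(image)
        if self.prev is None:
            delta = [0.0] * len(feats)  # no history yet
        else:
            # Explicit change signal relative to the previous image.
            delta = [c - p for c, p in zip(feats, self.prev)]
        self.prev = feats
        # Current features followed by the change relative to the prior image.
        return feats + delta

encoder = StatefulEncoder(toy_encode)
first = encoder.encode([[0, 0], [0, 0]])
second = encoder.encode([[0, 0], [0, 1]])  # one pixel changed
```

In this toy version, the single changed pixel shows up as an explicit nonzero entry in the second half of `second`, whereas a stateless encoder would leave the comparison entirely to the downstream language model.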