arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.24703 2026-05-26 cs.CL cs.AI

TS-Skill: A Benchmark for Evaluating Analytical Skills in Time-Series Question Answering

Liying Han, Kang Yang, Oliver Wang, Jason Wu, Pengrui Quan, Gaofeng Dong, Ozan Baris Mulayim, Sizhe Ma, Yuyang Yuan, Dezhi Hong, Mario Berges, Mani Srivastava

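Affiliations: University of California, Los Angeles; Samsung Research America; Carnegie Mellon University; Microsoft; Amazon

AI summary: Proposes the TS-Skill benchmark, which diagnoses models' signal-level capabilities in time-series question answering through three composable analytical skills (temporal scale selection, temporal localization, and cross-interval integration), and develops the SKEvol framework to construct the benchmark automatically; experiments reveal capability gaps across the skills.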

English abstract

Large language models (LLMs) and time-series language models (TSLMs) are increasingly applied to time-series question answering (TSQA). Unlike text-only QA, TSQA requires models to ground answers in temporal signals whose patterns may occur at different scales, specific time locations, or across separated intervals. However, existing benchmarks are typically organized by task types or high-level reasoning categories, making it difficult to diagnose the underlying signal-level capabilities driving model performance. We introduce TS-Skill, a controlled benchmark for evaluating three composable analytical skills in TSQA: temporal scale selection (SK1), temporal localization (SK2), and cross-interval integration (SK3). TS-Skill provides timestamp-aware questions, broad domain coverage, and human-validated QA quality. To construct the benchmark at scale, we develop SKEvol, a skill-guided agentic framework that combines domain-aware time-series seed generation, skill-controlled question generation, metadata- and code-assisted answer construction, multi-phase signal-grounded verification, and human-in-the-loop curation. Experiments on ten state-of-the-art LLMs and TSLMs reveal substantial and uneven capability gaps across SK1-SK3. In particular, SK3 remains consistently challenging for non-agent models, whereas tool-augmented agents show a selective advantage on standalone SK3. These findings demonstrate that skill-level evaluation can uncover temporal reasoning failures that are obscured by aggregate TSQA scores.

2605.24702 2026-05-26 cs.CV

Do Image-Text Metrics Respect Semantic Invariances?

Amit Agarwal, Hitesh Laxmichand Patel, Meizhu Liu, Jyotika Singh, Karan Dua, Hansa Meghwani, Matthew Rowe, Michael Avendi, Yassi Abbasi, Tao Sheng, Sujith Ravi, Dan Roth

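Affiliations: Oracle AI

AI summary: Systematically evaluates the semantic invariance of five popular image-text evaluators (CLIPScore, PAC-S, UMIC, FLEUR, and a deterministic LLM judge) under semantics-preserving perturbations along three axes (spatial, object, and socio-linguistic framing), finds that they are sensitive to non-semantic changes, and proposes invariance-calibrated scoring as a post-hoc adjustment.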

English abstract

Reference-free image-to-text evaluators are now standard for scoring image-caption alignment, yet it is unclear whether they respect semantic invariances. We present an invariance probe on five popular evaluators (CLIPScore, PAC-S, UMIC, FLEUR, and a deterministic LLM judge) under semantics-preserving perturbations along three axes: spatial (flips, context-preserving repositioning, light rotations), object (scale, category), and socio-linguistic framing (cultural/economic adjectives with neutral and length-matched controls). Across curated slices of three detection datasets and three caption evaluation suites, we find consistent non-semantic sensitivities: benign spatial edits and simple phrasing changes shift scores by approximately 6-9% on average, and for systems separated by just 0.7%, these shifts can cause ranking flips in up to about 37% of cases, particularly under spatial changes. A small human study supports this finding and confirms that annotators generally judge perturbed pairs as equally correct, so these shifts reflect metric behavior rather than semantic change. We further propose invariance-calibrated scoring, a post-hoc adjustment that roughly halves median absolute sensitivity while retaining correlation with learned caption evaluators.
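To make the two quantities in this abstract concrete, here is a minimal Python sketch of a relative score shift and a ranking-flip check. The function names, the example scores, and the exact definitions are illustrative assumptions; the abstract does not specify how the authors compute either quantity.

```python
import math

def relative_shift(original: float, perturbed: float) -> float:
    """Relative change in a metric score after a semantics-preserving edit (assumed definition)."""
    return abs(perturbed - original) / abs(original)

def ranking_flips(a_orig: float, b_orig: float, a_pert: float, b_pert: float) -> bool:
    """True if the order of systems A and B differs before and after the perturbation."""
    return (a_orig - b_orig) * (a_pert - b_pert) < 0

# Hypothetical scores: system A leads system B by a small margin.
a_orig, b_orig = 0.707, 0.700
# Suppose a benign edit lowers A's score by 7% while B's score is unchanged.
a_pert, b_pert = a_orig * 0.93, b_orig

shift = relative_shift(a_orig, a_pert)
flipped = ranking_flips(a_orig, b_orig, a_pert, b_pert)
```

With these made-up numbers, a shift of the size the abstract reports is enough to reverse the order of two closely ranked systems, which is the failure mode the ranking-flip statistic counts.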

2605.24699 2026-05-26 cs.AI cs.LG

MDIA: A Multi-Agent Diagnostic Intelligence Pipeline on HealthBench Professional

Roberto Cruz, David Rey-Blanco

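Affiliations: TietAI

AI summary: Proposes MDIA, a multi-agent diagnostic system built as a 7-node specialty-routed clinical reasoning graph, which improves HealthBench Professional performance by 3.72 percentage points on a non-fine-tuned LLM; the gain is attributed to system architecture rather than prompt engineering.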

Comments 33 pages, 10 figures

English abstract

Reported gains on agentic-LLM clinical benchmarks are often attributed to prompt engineering, yet our results suggest that larger improvements can come from architectural and engine-level design. We present MDIA, a Multi-agent Diagnostic Intelligence Agent implemented as a 7-node specialty-routed clinical reasoning graph, and evaluate it on the full HealthBench Professional benchmark (n = 525) using a non-fine-tuned LLM. MDIA achieves 0.6272 under OpenAI's GPT-5.4-2026-03-05, which is 3.72 pp above the performance of OpenAI's ChatGPT for Clinicians. The experimental work shows that the performance lift is attributable to system architecture: specialty routing, multi-turn context preservation, drug-state safety gating, site-filtered search, length-aware synthesis, and engine-level reliability. These findings support the view that agentic clinical benchmark performance is shaped by both the underlying foundation model and the orchestration architecture. Nevertheless, we also observed notable differences when other models are used as the grader; in particular, with Gemini 2.5 Pro, MDIA scored 0.6585, which suggests that the choice of grader is a source of variability. Robust evaluation of LLMs would therefore require assessment across several independent grader models.

2605.24697 2026-05-26 cs.CL cs.AI

The Path Matters: Learning a Token-Commitment Policy for Diffusion Language Models

Bohang Sun, Max Zhu, Francesco Caso, Jindong Gu, Junchi Yu, Philip Torr, Pietro Liò, Jialin Yu

Affiliations: Department of Computer Science and Technology, University of Cambridge; Department of Engineering Science, University of Oxford

AI summary: Proposes TraceLock, a lightweight plug-in controller that learns a reusable trace-state policy to optimize token-commitment decisions in diffusion language models, improving the quality-step tradeoff.

English abstract

Diffusion large language models promise faster generation by refining many token positions in parallel, but this parallelism introduces a hidden control problem: which proposed tokens should be transferred into the partially decoded sequence at each step? We refer to this decision as token commitment. Existing frozen-generator decoders largely rely on hand-designed confidence rules or block-specific acceptance filters. We argue that token commitment can instead be learned as a reusable trace-state policy. We introduce TraceLock, a lightweight plug-in controller that instantiates this policy for a frozen diffusion language model. Since oracle commitment times are unavailable, TraceLock derives self-supervision from future stability: at decoding step t, a proposed token for position i is labeled stable if it matches the final token at position i after the full decoding trace completes. The controller scores variable-length trace states and decides which active token proposals should be committed to the partially decoded sequence. Once trained for a given frozen backbone, the controller can be deployed across local-window widths, generation lengths, and step budgets without retraining or per-setting calibration. Experiments on question answering, mathematical reasoning, and code generation show that TraceLock improves the quality-step tradeoff over heuristic and learned baselines, with particularly stable behavior under cross-setting deployment. Diagnostic analyses show that its decisions are not reducible to scalar confidence, suggesting that frozen diffusion language models expose a learnable space of commitment trajectories beyond confidence-based decoding. Code is available at https://github.com/BobSun98/TraceLock.
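The future-stability labeling rule described in this abstract can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the TraceLock code: the array layout (one row of proposed tokens per decoding step, with the last row taken as the final sequence) and the function name are hypothetical.

```python
import numpy as np

def future_stability_labels(trace: np.ndarray) -> np.ndarray:
    """Label each proposal as stable if it equals the final token at its position.

    trace: integer array of shape (T, L); trace[t, i] is the token proposed
    for position i at decoding step t. The last row is assumed to hold the
    final decoded sequence.
    """
    final_tokens = trace[-1]
    # Broadcasting compares every step's proposals with the final tokens.
    return trace == final_tokens

# Toy trace: 3 decoding steps over 3 positions.
trace = np.array([
    [5, 9, 2],
    [5, 7, 2],
    [5, 7, 3],
])
labels = future_stability_labels(trace)
```

In this toy trace, position 0 is stable from the first step, position 1 becomes stable at the second step, and position 2 only at the last step. Labels of this kind would serve as training targets for the controller; the controller itself is not sketched here.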

2605.24693 2026-05-26 cs.CL

CP-Agent: A Calibrated Risk-Controlled Agent for Feedback-Driven Competitive Programming

Peisong Wang, Bowen Liu, Zehua Li, Yuyao Wang, Zhiwei Ma, Yuhan Li, Jia Li

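Affiliations: The Hong Kong University of Science and Technology (Guangzhou)

AI summary: Proposes CP-Agent, which models feedback-driven solving as a calibrated stopped process and combines dual-granularity verification, test augmentation, and experience-driven self-evolution to substantially improve competitive-programming performance without updating parameters.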

Comments Code: https://github.com/NineAbyss/CP-Agent

English abstract

Large language models still struggle with contest-level programming, while many agentic remedies rely on massive inference-time sampling or expensive multi-stage post-training. We study when execution feedback reliably helps an LLM competitive-programming (CP) solver and which mechanisms govern the gains. We model feedback-driven solving as a calibrated stopped process and identify three quantities: false-admission risk, program-level evidence against bad programs, and the active-state success hazard. Under held-out trace calibration and selection from a pre-declared finite controller manifest, the resulting structural certificate lower-bounds the clean success probability before false admission. We instantiate mechanisms targeting these quantities as Dual-Granularity Verification, Test Augmentation, and Experience-Driven Self-Evolving, yielding CP-Agent. Without updating any parameters, CP-Agent raises Pass@1 from 25.8% to 48.5% on LiveCodeBench Pro and improves Refine@5 by 11.0% on ICPC-Eval. Across three LLM backbones, CP-Agent lies on the cost-accuracy efficiency frontier, and ablations show that each component primarily affects its corresponding certificate quantity.

2605.24691 2026-05-26 cs.CV

AdaFuse-Det: Adaptive Cross-Modal Fusion of Event Cameras for Robust Object Detection in Low-Light RGB Imagery

Raju Imandi, Chethana B, Bharatesh Chakravarthi, Yong-Guk Kim, Manipriya S, Pavan Kumar B N

Affiliations: SRM University AP, India; Aptiv, Bengaluru, India; Arizona State University, USA; Sejong University, South Korea; Indian Institute of Information Technology Sri City, India

AI summary: Proposes AdaFuse-Det, a dual-stream framework that fuses CLAHE-enhanced RGB with event data through an adaptive cross-modal fusion module based on minimum-variance linear estimation, for robust object detection in low light; on the LLE-VOS benchmark it reaches 65.54% recall, 53.85% precision, and a 59.12% F1 score.

English abstract

Detecting objects reliably under extreme low-light conditions is an open problem in computer vision, with practical urgency in applications ranging from nighttime surveillance to search-and-rescue robotics. Conventional RGB cameras degrade sharply at low photon flux, while event cameras, which record asynchronous per-pixel brightness changes at microsecond resolution and with high dynamic range, provide complementary structural cues that are largely illumination-invariant. We present AdaFuse-Det, a dual-stream framework that fuses CLAHE-enhanced RGB frames with voxelized event tensors through an Adaptive Cross-Modal Fusion (ACMF) module grounded in minimum-variance linear estimation theory. We formally show that the learned attention map asymptotically recovers the Gauss-Markov optimal fusion weights, and establish event conservation and temporal resolution bounds for the voxelization stage. On the LLE-VOS benchmark, AdaFuse-Det achieves a Recall of 65.54%, Precision of 53.85%, and F1-Score of 59.12% under severe illumination degradation, outperforming single-modality detectors in recall by a margin that reflects the theoretically predicted illumination-adaptation behavior.
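For readers unfamiliar with minimum-variance linear estimation, the textbook two-source case can be sketched as follows. This is not the ACMF module: it assumes two unbiased estimates of the same quantity with independent noise and known variances, and the function and variable names are hypothetical.

```python
import numpy as np

def min_variance_fusion(x_rgb, x_evt, var_rgb, var_evt):
    """Fuse two unbiased, independent estimates with inverse-variance weights.

    Returns the fused estimate, the weight given to the RGB estimate, and
    the variance of the fused estimate.
    """
    w_rgb = var_evt / (var_rgb + var_evt)          # noisier RGB -> smaller weight
    fused = w_rgb * x_rgb + (1.0 - w_rgb) * x_evt
    fused_var = var_rgb * var_evt / (var_rgb + var_evt)
    return fused, w_rgb, fused_var

# Toy example: the RGB estimate is 4x noisier than the event estimate,
# loosely mimicking a low-light frame.
fused, w_rgb, fused_var = min_variance_fusion(
    x_rgb=np.array([1.0]), x_evt=np.array([2.0]),
    var_rgb=np.array([4.0]), var_evt=np.array([1.0]),
)
```

The weight on the RGB estimate falls as its noise variance grows, and the fused variance is never larger than either input variance. Per the abstract, the learned attention map is shown to recover weights of this kind asymptotically.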

2605.24690 2026-05-26 cs.RO cs.LG

Sum of Costs Diffusion with Dynamic Guidance for Motion Planning

Aysu Aylin Kaplan, Özgür Erkent

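Affiliations: Computer Engineering Department, Hacettepe University

AI summary: Proposes a diffusion-based motion planning method with high generalization capability that guides the denoising process with the gradient of the total collision cost and dynamically selects the guidance start step, achieving the best performance among the compared methods on the Mπnets dataset.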

Comments Accepted at the Frontiers of Optimization for Robotics Workshop at the IEEE International Conference on Robotics and Automation (ICRA), 2026

English abstract

The motion planning problem for robotic manipulation can be addressed through classical or deep learning approaches. Existing methods face significant challenges in generalizing to diverse settings. In this study, we present a method with high generalization capability that generates collision-free trajectories using diffusion models in which the denoising process is guided by the gradient of the total collision cost. We also present a dynamic approach for choosing the start step of the gradient guidance. Experimental results demonstrate that guiding the diffusion model dynamically with the sum of collision costs offers more robust performance by overcoming the generalization issues faced by competing methods. The proposed model demonstrates its effectiveness by achieving the highest performance among the compared methods on diverse test settings in the Mπnets dataset.
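As a rough illustration of guiding a trajectory with the gradient of a summed collision cost, the sketch below uses spherical obstacles with a quadratic penetration penalty. Everything here is an assumption made for illustration: the cost form, the obstacle model, and the names are hypothetical, the denoiser itself is omitted, and because the abstract does not describe how the start step is chosen, `start_step` is simply a parameter.

```python
import numpy as np

def total_collision_cost_grad(traj, centers, radii):
    """Gradient of sum_k sum_n 0.5 * max(0, r_k - ||x_n - c_k||)^2 w.r.t. traj.

    traj: (N, D) float waypoints; centers: (K, D) obstacle centers; radii: (K,).
    """
    grad = np.zeros_like(traj)
    for c, r in zip(centers, radii):
        diff = traj - c                                    # (N, D)
        dist = np.linalg.norm(diff, axis=1, keepdims=True) # (N, 1)
        penetration = np.maximum(0.0, r - dist)            # zero outside the obstacle
        grad += -penetration * diff / np.maximum(dist, 1e-9)
    return grad

def guided_update(traj, t, start_step, centers, radii, scale=1.0):
    """Apply the cost-gradient correction only once t has dropped to start_step."""
    if t > start_step:
        return traj
    return traj - scale * total_collision_cost_grad(traj, centers, radii)

traj = np.array([[0.5, 0.0], [3.0, 0.0]])
centers = np.array([[0.0, 0.0]])
radii = np.array([1.0])
late = guided_update(traj, t=5, start_step=10, centers=centers, radii=radii)
early = guided_update(traj, t=20, start_step=10, centers=centers, radii=radii)
```

The waypoint inside the obstacle is pushed out to its boundary, while the waypoint outside is left alone, and no correction is applied before the guidance start step. In a real sampler this correction would be interleaved with the model's denoising updates.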

2605.24687 2026-05-26 cs.CV cs.AI

HoloFair: Unified T2I Fairness Evaluation and Fair-GRPO Debiasing

Ruyi Chen, Lu Zhou, Xiaogang Xu, Chiyu Zhang, Jiafei Wu, Liming Fang

Affiliations: Nanjing University of Aeronautics and Astronautics; School of Software Technology, Zhejiang University, Ningbo, China; Ningbo Global Innovation Center, Zhejiang University, Ningbo, China; Collaborative Innovation Center of Novel Software Technology and Industrialization

AI summary: Proposes the HoloFair benchmark framework, which evaluates the fairness of text-to-image models with the Multi-attribute, Group-wise Bias Index (MGBI), and introduces Fair-GRPO, a reinforcement-learning-based debiasing method that significantly improves multidimensional fairness on the SD3.5-Medium model while preserving image quality.

Comments Accepted to ICML 2026. Code and dataset are available at https://github.com/1059684669/HoloFair

English abstract

Text-to-Image (T2I) models have made significant strides in visual realism and semantic consistency, yet they often perpetuate and amplify societal biases. Existing evaluation methods typically address only single-dimensional biases and offer no way to uncover model biases at deeper, socially relevant semantic levels. We introduce HoloFair, a comprehensive benchmark framework for multidimensional demographic bias analysis. Built upon our large-scale fairness-oriented dataset and the SpaFreq (Spatial-Frequency) attribute classifier, this framework proposes the Multi-attribute, Group-wise Bias Index (MGBI) metric, designed to assess both intrinsic diversity and conditional biases. Beyond evaluation, we further introduce Fair-GRPO, a reinforcement-learning-based debiasing method that alters the distribution of generative models through a designed multi-objective reward function. For example, experiments on the SD3.5-Medium model demonstrate that Fair-GRPO significantly improves multidimensional fairness while maintaining high image quality. We also analyze potential reward hacking phenomena and provide corresponding mitigation strategies. Code and dataset are available at https://github.com/1059684669/HoloFair

2605.24686 2026-05-26 cs.AI

Emotional intelligence in large language models is fragmented across perception, cognition, and interaction

Minghao Lv, Lu Chen, Enchang Zhang, Anji Zhou, Xiaoran Xue, Hanyi Zhang, Fenghua Tang, Zhuo Rachel Han, Mengyue Wu

Affiliations: X-LANCE Lab; School of Computer Science, Shanghai Jiao Tong University; MoE Key Lab of Artificial Intelligence; Jiangsu Key Lab of Language Computing; Beijing Key Laboratory of Applied Experimental Psychology; National Demonstration Center for Experimental Psychology Education, Faculty of Psychology, Beijing Normal University

AI summary: Proposes the FACET framework, which evaluates the emotional intelligence of large language models based on the Mayer-Salovey-Caruso four-branch ability model, and finds that emotional intelligence is not a single capability but is fragmented across cognitive and interactive dimensions, with hidden emotion recognition a universal bottleneck.

English abstract

As large language models (LLMs) are increasingly integrated into emotionally sensitive domains, the structural integrity of their emotional intelligence (EI) becomes a critical frontier for safety and alignment. Current benchmarks often conflate superficial politeness with deep affective reasoning, failing to distinguish between perceptual accuracy and interactive efficacy. Here, we introduce FACET (Functional Affective Competence and Empathy Test), a psychometrically grounded framework comprising 480 expert-crafted items. Unlike previous metrics, FACET is theoretically anchored in the Mayer-Salovey-Caruso four-branch ability model, operationalizing EI through perception, facilitation, understanding, and management of emotions. Through an evaluation of nine frontier models (including GPT-5, Claude-Sonnet-4), we demonstrate that emotional intelligence is not a monolithic capability but is fragmented across cognitive and interactive dimensions. While frontier models demonstrate robust proficiency in objective emotion recognition and social reasoning, this does not consistently translate to interactive success. We categorize these discrepancies into three distinct performance profiles: cognitive-dominant, interactive-dominant, and context-dependent. These typologies indicate that emotional skills do not scale uniformly with general intelligence or model size; rather, they are shaped by specific alignment paradigms. Notably, we identify hidden emotion recognition as a universal performance bottleneck across all architectures. Our results suggest that current RLHF processes may optimize for "stochastic empathy", a statistical mimicry of emotional syntax, at the expense of integrated affective reasoning. These findings challenge the assumption of linear emotional scaling and provide a rigorous roadmap for developing socially aware agents capable of genuine clinical resonance.

2605.24684 2026-05-26 cs.LG cs.AI

Beyond the Aggregation Dilemma: Prior-Retaining Decoupled Learning for Multimodal Graphs

Hao Yan, Xuanru Wang, Jun Yin, Shirui Pan, Senzhang Wang, Chengqi Zhang

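Affiliations: School of Computer Science and Engineering, Central South University; Department of Data Science and Artificial Intelligence, The Hong Kong Polytechnic University; School of Information and Communication Technology, Griffith University

AI summary: Addresses the aggregation dilemma in multimodal attributed graph learning, where mandatory aggregation causes a performance inversion, by proposing SUPRA, a decoupled dual-pathway architecture that keeps prior features independent, captures structural synergy with a lightweight shared GNN, and adds deep supervision to mitigate gradient starvation, achieving state-of-the-art performance with substantially lower computational overhead.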

English abstract

Multimodal Attributed Graph Learning (MAGL) integrates intrinsic node attributes with structural topology via graph aggregation. However, as pretrained encoders evolve into Large Foundation Models (LFMs), the landscape of MAGL fundamentally shifts: under high-confidence LFM priors, mandatory aggregation introduces topological noise that overwhelms discriminative signals, triggering a counter-intuitive performance inversion where sophisticated MAGL architectures underperform simple topology-agnostic MLPs. Through systematic empirical and theoretical analysis, we identify that this inversion stems from a fundamental aggregation dilemma characterized by two concurrent pathologies: (1) Representational Pathology (SNR Degradation): mandatory aggregation dilutes robust intrinsic features with topological noise, causing the noise penalty to outweigh its collaborative benefit; and (2) Optimization Pathology (Gradient Starvation): topological aggregation attenuates gradient flow, while a shared task loss causes dominant modalities to prematurely suppress weaker ones. To resolve this dilemma, we propose SUPRA (Shared-Unique Prior-Retaining Architecture), a decoupled dual-pathway paradigm. SUPRA processes modality-specific features through topology-agnostic MLPs while capturing structural synergy via a lightweight shared GNN, with auxiliary deep supervision counteracting gradient starvation. Extensive evaluations demonstrate that SUPRA achieves state-of-the-art performance while requiring 3.5x lower peak GPU memory and training up to 4.4x faster than Multimodal Graph Transformers.

2605.24680 2026-05-26 cs.LG

Trajectory-Based Difficulty Scoring for Reliable Learning on Tabular Data

Tomer Lavi, Bracha Shapira, Nadav Rappoport

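Affiliations: Faculty of Computer and Information Science, Ben-Gurion University of the Negev

AI summary: Proposes the Trajectory-based Difficulty Score (TDS), which estimates per-instance difficulty by analyzing the per-tree cumulative prediction trajectories of gradient-boosted trees; it outperforms existing baselines on classification while remaining competitive on regression, and supports applications such as active learning, selective prediction, and conformal prediction.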

English abstract

Gradient-boosted trees achieve strong performance on tabular data, yet often leave a long tail of poorly predicted instances. We introduce a Trajectory-based Difficulty Score (TDS), an instance-level difficulty estimator for boosted ensembles derived from per-tree cumulative prediction trajectories. For each instance, we compute interpretable trajectory descriptors (e.g., variance, oscillation peaks, sign switches, and tail stability) and train a lightweight regression model to predict held-out loss. An empirical CDF calibrates the resulting signal into a score in [0, 1] that supports ranking hard cases. Across diverse tabular benchmarks and ensemble sizes, TDS exhibits strong rank correlation with error and outperforms established instance-hardness and uncertainty baselines on classification, while remaining competitive on regression. We then show how a single difficulty signal improves multiple data mining workflows: difficulty-driven active learning for label-efficient training, difficulty-thresholded selective prediction for improved risk-coverage trade-offs, and TDS-stratified (Mondrian) conformal prediction for more uniform conditional coverage. Finally, clustering high-TDS instances using SHAP attributions reveals coherent failure modes characterized by compact feature-value ranges, supporting error analysis and targeted data acquisition.
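A minimal Python sketch of two ingredients named in this abstract, trajectory descriptors and empirical-CDF calibration, is given below. The descriptor definitions, function names, and toy data are assumptions made for illustration; the paper's exact descriptors and its regression model for held-out loss are not reproduced. Staged predictions of this shape could come, for example, from scikit-learn's `staged_predict` on a fitted gradient-boosting model.

```python
import numpy as np

def trajectory_descriptors(stage_preds: np.ndarray) -> dict:
    """Simple per-instance descriptors of cumulative prediction trajectories.

    stage_preds: (n_trees, n_samples) array; row t holds the ensemble's
    cumulative prediction (e.g. raw margin) after t + 1 trees.
    """
    steps = np.diff(stage_preds, axis=0)                    # change added by each tree
    signs = np.sign(stage_preds)
    sign_switches = (signs[1:] * signs[:-1] < 0).sum(axis=0)
    tail_len = max(2, stage_preds.shape[0] // 5)            # last ~20% of trees
    return {
        "variance": stage_preds.var(axis=0),
        "sign_switches": sign_switches,
        "tail_std": stage_preds[-tail_len:].std(axis=0),
        "max_step": np.abs(steps).max(axis=0),
    }

def ecdf_score(raw: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Map raw difficulty signals to [0, 1] via the empirical CDF of a reference set."""
    ref_sorted = np.sort(reference)
    return np.searchsorted(ref_sorted, raw, side="right") / ref_sorted.size

# Toy trajectories: instance 0 converges smoothly, instance 1 oscillates around zero.
stage_preds = np.array([
    [0.5, 0.2],
    [0.9, -0.1],
    [1.2, 0.3],
    [1.4, -0.2],
    [1.5, 0.1],
])
desc = trajectory_descriptors(stage_preds)
scores = ecdf_score(np.array([0.05, 0.25, 0.4]), np.array([0.1, 0.2, 0.3, 0.4]))
```

The oscillating instance shows four sign switches and the smooth one none, which is the kind of signal a downstream regressor could use. The empirical CDF then turns any raw signal into a rank-like score that is comparable across datasets.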

2605.24679 2026-05-26 cs.CV

MindAdapter: Few-Shot Parameter-Efficient Residual Calibration of Cross-Subject Brain-to-Visual Decoding Models

Jiaxiang Liu, Jiawei Du, Xupeng Chen, Guoqi Li, Jiang Cai, Simon Fong, Mingkun Xu

Affiliations: Guangdong Institute of Intelligence Science and Technology; Agency for Science, Technology and Research; New York University; Institute of Automation, Chinese Academy of Sciences; Department of Computer and Information Science, University of Macau

AI summary: Proposes the MindAdapter framework, which achieves few-shot, parameter-efficient calibration for cross-subject brain-to-visual decoding through decoupled linear-residual cascade alignment and a topology-anchored dual-stream manifold constraint.

Comments Accepted to KDD 2026 (AI4Sciences Track). 15 pages, 7 figures

English abstract

Cross-subject brain-to-visual decoding remains a core challenge in brain-computer interfaces due to severe inter-individual variability that induces systematic subject-specific functional misalignment. To address this issue, we propose MindAdapter, a parameter-efficient few-shot calibration framework for pretrained brain-to-visual decoding models. MindAdapter adopts a decoupled linear-residual cascade alignment paradigm by freezing a pretrained explicit brain functional alignment backbone (coarse) and introducing a lightweight nonlinear residual adapter (fine), thereby disentangling global cross-subject correspondence from subject-specific residual corrections for fine-grained spatial and semantic calibration. To further preserve global representational stability, we design a topology-anchored dual-stream manifold constraint, where a small set of shared stimuli serves as topological pins with voxel-level paired supervision, while a semantic stream enforces consistency through a frozen vision-language decoder on unpaired brain data. Together, MindAdapter efficiently injects subject-specific corrections while maintaining the global representational geometry learned during pretraining. Experiments on the Natural Scenes Dataset (NSD) demonstrate that MindAdapter substantially improves cross-subject visual reconstruction and retrieval accuracy using only a few shared stimuli, offering a practical and data-efficient solution for personalized brain-to-visual decoding.

2605.24675 2026-05-26 cs.CV cs.AI

VaaWIT: Visual-Aware Adaptation of Large Language Models for Multilingual Web Image Translation

Bo Li, Ronghao Chen, Ningyuan Deng, Huacan Wang, Shaolin Zhu, Lijie Wen

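Affiliations: The Hong Kong University of Science and Technology; Tianjin University; Tsinghua University

AI summary: To address the visual representation gap in web image translation, proposes the VaaWIT framework, which uses a dual-stream attention module and a visual-aware adapter to dynamically fuse fine-grained visual features into a large language model, outperforming open-source models on multiple benchmarks and approaching the performance of closed-source models.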

Comments Accepted by KDD 2026

English abstract

Translating text embedded in Web images is crucial for improving content accessibility and cross-lingual information retrieval, particularly within social media and e-commerce domains. Although Large Vision-Language Models (LVLMs) have advanced multimodal understanding, applying them to Web image translation remains challenging due to the visual representation gap: standard encoders often prioritize high-level semantics over the fine-grained visual details required for recognizing diverse character morphologies. To address this challenge, we propose VaaWIT, an end-to-end framework that adapts Large Language Models for multilingual Web image translation. The framework introduces two key technical contributions: (1) a Dual-Stream Attention Module (DSAM), which facilitates bidirectional interaction between multilingual semantic features and detailed visual representations, thereby synthesizing unified features robust to textual variations; and (2) a Visual-Aware Adapter (VAA), a parameter-efficient fine-tuning strategy that dynamically injects these fused visual cues into the frozen LLM backbone. This design enables the model to align the visual context with linguistic reasoning effectively while minimizing computational costs. Extensive experiments on eight tasks on three public benchmarks demonstrate that VaaWIT significantly outperforms state-of-the-art (SOTA) open-source baselines and achieves competitive performance against proprietary models. These results validate the efficacy of integrating fine-grained visual perception into LLMs for complex Web content analysis.

2605.24674 2026-05-26 cs.CV

Reasoning to Align: Implicit Reasoning in Diffusion Transformers for Video Editing

Yan Li, Lin Liu, Xiaopeng Zhang, Qi Tian

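Affiliations: The Hong Kong University of Science and Technology; Huawei Inc.

AI summary: To address undifferentiated conditioning signals and insufficient cross-attention supervision in instruction-based video editing, proposes the RVEDiT framework, which achieves coarse-to-fine editing and regularizes internal reasoning through granularity-routed token conditioning and reference-anchored attention alignment.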

英文摘要

Instruction-based video editing requires transforming a source video according to a natural-language instruction while preserving irrelevant content and remaining temporally coherent. We argue that existing Diffusion Transformer (DiT) editors struggle with this task for two structural reasons. First, conditioning signals are fed undifferentiated into all transformer blocks, forcing a single token stream to encode both global editing intent and fine-grained visual evidence. Second, the cross-attention patterns that govern the edit are supervised only indirectly through pixel-level reconstruction, leaving the model's internal reasoning process under-constrained. To address both limitations, we propose RVEDiT, an implicit Reasoning Video Editing DiT framework built around two complementary components. The first, Granularity-Routed Token Conditioning, introduces learnable editing tokens distilled from a multimodal LLM and routes them to shallow blocks, while reserving native visual and textual tokens for deeper blocks, thereby inducing a coarse-to-fine editing process inside the backbone. The second, Reference-Anchored Attention Alignment, employs a parameter-sharing reference branch during training and maximizes the mutual information between the attention features of the editing and reference branches, regularizing the model's internal reasoning without incurring any additional inference cost. Experiments on standard instruction-based video editing benchmarks show that RVEDiT consistently outperforms state-of-the-art baselines, with particularly strong gains on localized and compositional edits.

2605.24667 2026-05-26 cs.AI cs.LG

When Mean CE Fails: Median CE Can Better Track Language Model Quality

当平均交叉熵失效时:中位数交叉熵能更好地跟踪语言模型质量

Hao Guo, Simon Dennis, Rivaan Patil, Kevin Shabahang

发表机构 * i14 University of Melbourne(墨尔本大学) University of California, Santa Cruz(加州大学圣克鲁兹分校)

AI总结 本文发现中位数交叉熵比平均交叉熵更能反映语言模型在训练过程中的任务性能,并建议在评估时报告多个百分位交叉熵。

Comments 20 pages

详情
AI中文摘要

平均交叉熵是语言模型的标准验证指标,但在训练过程中可能无法跟踪模型质量。我们在两种常见场景下研究了这一点。首先,在Qwen2.5-1.5B的合成事实学习SFT中,我们发现平均CE在初始学习阶段后显著上升,而保留的事实召回准确率保持接近峰值。其次,在TinyStories上的top-K蒸馏中,我们发现减小K会改善中位数CE而恶化平均CE;Top-5学生获得了最高的LLM评判分数,并在中位数CE上低于其教师,尽管其平均CE最差。在这两种情况下,中位数CE与任务性能的相关性比平均CE更紧密。分析训练过程中整体和尾部百分位CE的变化表明,训练重塑了经验性的每token CE分布。在top-K蒸馏中,较小的K产生了一个在两端都有更多质量的分布,降低了中位数并增加了平均值。在Qwen SFT中,整体部分迅速饱和,而尾部在训练后半段延伸。在这两种情况下,任务评估指标似乎对整体部分比尾部更敏感。实际上,我们建议在报告平均CE的同时报告一小部分百分位CE摘要,并利用它们之间的一致性作为跟踪分布重塑的工具,以及当平均和中位数CE在模型选择上不一致时的低成本诊断。

英文摘要

Mean cross-entropy is the standard validation metric for language models, but it can fail to track model quality during training. We examine this in two common scenarios. First, in Qwen2.5-1.5B SFT on synthetic fact-learning, we find that mean CE rises substantially after the initial learning phase while held-out fact-recall accuracy remains near its peak. Second, we find that in top-K distillation on TinyStories, decreasing K improves median CE while worsening mean CE; the Top-5 student attains the highest LLM-judge score and crosses below its teacher on median CE, despite having the worst mean CE. In both cases, median CE correlates much more closely with task performance than does mean CE. Analyzing how bulk and tail percentile CE move during training reveals that training reshapes the empirical per-token CE distribution. In top-K distillation, smaller K yields a distribution with more mass at both extremes, decreasing the median and increasing the mean. In Qwen SFT, the bulk saturates quickly while the tail extends in the latter half of training. In both, the task-evaluation metric appears more sensitive to the bulk than to the tail. Practically, we recommend reporting a small set of percentile CE summaries alongside the mean, and using concordance among them as a tool to keep track of distribution reshaping, as well as a low-cost diagnostic for when mean and median CE disagree on model selection.
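The recommendation to report percentile CE summaries alongside the mean can be sketched as follows. This is a minimal illustration, not the authors' code: the helper names (`per_token_ce`, `percentile`, `ce_summary`), the choice of percentiles, and the two toy checkpoints are all assumptions made for the example.

```python
import math
import statistics

def per_token_ce(probs_of_target):
    """Per-token cross-entropy in nats: -log p(correct token)."""
    return [-math.log(p) for p in probs_of_target]

def percentile(sorted_vals, q):
    """Linearly interpolated percentile (q in [0, 100]) of a pre-sorted list."""
    if not sorted_vals:
        raise ValueError("need at least one value")
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac

def ce_summary(ce_values, qs=(10, 50, 90, 99)):
    """Mean plus a small set of percentile summaries of per-token CE."""
    s = sorted(ce_values)
    out = {"mean": statistics.fmean(s)}
    for q in qs:
        out[f"p{q}"] = percentile(s, q)
    return out

# Hypothetical toy data: checkpoint A is uniformly mediocre, checkpoint B is
# confident on most tokens but has one heavy-tail token.
ckpt_a = per_token_ce([0.5] * 10)
ckpt_b = per_token_ce([0.9] * 9 + [1e-6])
summary_a = ce_summary(ckpt_a)
summary_b = ce_summary(ckpt_b)

# B has the lower median but the higher mean, so the two statistics disagree.
print(summary_a)
print(summary_b)
```

In this toy case a single tail token is enough to make the mean and the median rank the two checkpoints in opposite order, which is the kind of disagreement the abstract suggests using as a low-cost diagnostic.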

2605.24661 2026-05-26 cs.AI cs.CL

Measuring Reasoning Quality in LLMs: A Multi-Dimensional Behavioral Framework

衡量LLM中的推理质量:一个多维行为框架

Ali Şenol, Garima Agrawal, Huan Liu

发表机构 * Department of Computer Engineering, Tarsus University(塔尔苏斯大学计算机工程系) School of Computing and Augmented Intelligence (SCAI), Arizona State University (ASU)(计算与增强智能学院(SCAI),亚利桑那州立大学(ASU)) HumaConn AI Consulting(HumaConn AI咨询)

AI总结 提出一个基于行为的多维框架,从正确性、一致性、鲁棒性、逻辑连贯性、效率和稳定性六个维度评估LLM推理质量,揭示仅靠准确率无法观察到的行为,并支持部署决策。

详情
AI中文摘要

LLMs在复杂推理任务中取得了显著成功,但当前的评估方法主要依赖最终答案的正确性,对产生这些答案的底层推理过程提供的洞察有限。为弥补这一空白,本研究从行为角度提出了一个统一的多维框架来衡量LLMs的推理质量,操作化了六个理论驱动的维度:正确性(CQ)、一致性(CS)、鲁棒性(RS)、逻辑连贯性(LS)、效率(ES)和稳定性(SS)。在四个基准测试的975个条目上对七个LLMs进行的广泛实验表明,该框架揭示了仅靠准确率指标无法观察到的行为。值得注意的是,逻辑连贯性与正确性正交(r = -0.172,不显著),证实了正确答案可能源于不连贯的推理,而Claude-Haiku-4.5取得了最高的多维得分(Q_bal = 0.778)。此外,该框架暴露了关键的排名反转:DeepSeek-V3在准确率优先下排名第二,但在法律/合规权重下排名第五,这种反转是单一指标评估无法检测到的。判别效度证实11/15个维度对是独立的(|r| < 0.50),为将每个维度视为不同信号提供了心理测量学支持。该框架产生的维度概况直接支持三类部署决策:识别那些虽然最终答案正确但推理轨迹无法通过问责审计的模型(LS--CQ正交性);防止仅基于准确率的基准测试导致的排名错误;以及确保没有单一指标默默替代框架捕获的六个独立信号。

英文摘要

LLMs have achieved remarkable success in complex reasoning tasks, yet current evaluation approaches predominantly rely on final-answer correctness, offering limited insight into the underlying reasoning processes that produce those answers. To address this gap, this study proposes a unified multi-dimensional framework for measuring reasoning quality in LLMs from a behavioral perspective, operationalizing six theoretically grounded dimensions: Correctness (CQ), Consistency (CS), Robustness (RS), Logical Coherence (LS), Efficiency (ES), and Stability (SS). Extensive experiments on seven LLMs across 975 items from four benchmarks demonstrate that the framework reveals behaviors invisible to accuracy-only metrics. Notably, logical coherence is orthogonal to correctness (r = -0.172, ns), confirming that correct answers can arise from incoherent reasoning, while Claude-Haiku-4.5 achieves the highest multi-dimensional score (Q_bal = 0.778). Furthermore, the framework exposes critical ranking inversions: DeepSeek-V3 ranks second under accuracy-priority but fifth under legal/compliance weighting, a reversal that single-metric evaluation cannot detect. Discriminant validity confirms 11/15 dimension pairs are independent (|r| < 0.50), providing psychometric support for treating each dimension as a distinct signal. The dimensional profiles produced by the framework directly support three classes of deployment decision: identifying models whose reasoning traces would fail accountability audits despite correct final answers (LS--CQ orthogonality); preventing ranking errors caused by accuracy-only benchmarking; and ensuring that no single metric silently substitutes for the six independent signals the framework captures.
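A ranking inversion under different deployment weightings can be illustrated with a small sketch. The abstract does not give the aggregation formula, so a normalized weighted average is assumed here; the model names, dimension scores, and weight profiles below are made up for illustration and are not results from the paper.

```python
DIMENSIONS = ("CQ", "CS", "RS", "LS", "ES", "SS")

def weighted_score(profile, weights):
    """Normalized weighted average over the six dimensions (assumed aggregation)."""
    total = sum(weights[d] for d in DIMENSIONS)
    return sum(profile[d] * weights[d] for d in DIMENSIONS) / total

def rank(profiles, weights):
    """Model names sorted from best to worst under the given weighting."""
    return sorted(
        profiles,
        key=lambda name: weighted_score(profiles[name], weights),
        reverse=True,
    )

# Hypothetical dimension profiles: model_a is more accurate, model_b reasons
# more coherently.
profiles = {
    "model_a": {"CQ": 0.90, "CS": 0.80, "RS": 0.70, "LS": 0.40, "ES": 0.80, "SS": 0.70},
    "model_b": {"CQ": 0.80, "CS": 0.80, "RS": 0.70, "LS": 0.90, "ES": 0.60, "SS": 0.70},
}

# Hypothetical weight profiles standing in for two deployment priorities.
accuracy_priority = {"CQ": 5, "CS": 1, "RS": 1, "LS": 1, "ES": 1, "SS": 1}
coherence_priority = {"CQ": 1, "CS": 1, "RS": 1, "LS": 5, "ES": 1, "SS": 1}

print(rank(profiles, accuracy_priority))   # model_a first
print(rank(profiles, coherence_priority))  # model_b first
```

The same pair of profiles yields opposite orderings once the weight on logical coherence changes, which a single accuracy number could not reveal.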

2605.24659 2026-05-26 cs.LG

IterInject: Indirect Prompt Injection Against LLM Agents via Feedback-Guided Iterative Optimization

IterInject: 通过反馈引导的迭代优化实现对LLM智能体的间接提示注入

Zixuan Chen, Jiaxiang Chen, Li Luo, Ke Xu, Xiaoxiang Huang, Tanfeng Sun, Xinghao Jiang

发表机构 * Shanghai Jiao Tong University(上海交通大学) The University of Hong Kong(香港大学)

AI总结 提出IterInject框架,通过规则诊断器和LLM优化器迭代优化对抗载荷,实现对LLM智能体的间接提示注入攻击,在多个基准和实际系统中显著优于现有方法,并揭示了注意力介导的阈值机制。

Comments Submitted to EMNLP 2026

详情
AI中文摘要

基于LLM的智能体越来越多地被部署用于需要规划、工具使用和与外部服务交互的复杂任务。它们对外部不可信内容的依赖使其容易受到间接提示注入(IPI)攻击,其中嵌入在检索数据中的对抗指令劫持智能体行为。现有攻击依赖于无法适应智能体特定防御的静态载荷;即使是最近的适应性方法也缺乏结构化反馈来指导优化。我们提出IterInject,一个反馈引导的迭代框架,闭合了注入、诊断和精炼之间的循环:基于规则的诊断器产生带有行为描述的结构化结果标签,基于LLM的优化器根据完整的优化历史精炼载荷。一个合成步骤从失败模式中生成新的伪装种子,使策略空间能够自我进化。在AgentDojo和InjectAgent上,IterInject在四个受害模型上显著优于静态基线和现有的适应性方法。在Claude Code(一个具有分层防御的生产级编码智能体)上的扩展实验表明,优化后的载荷在9个目标中的5个上取得了完全成功;即使那些抵抗完全利用的目标也显示出通过迭代精炼可衡量的改进。我们进一步对IPI进行了机制分析,识别出中后层中注意力介导的阈值机制;三个因果干预验证了这一发现,并指出了具体的防御方向。

英文摘要

LLM-based agents are increasingly deployed for complex tasks requiring planning, tool use, and interaction with external services. Their reliance on untrusted external content exposes them to indirect prompt injection (IPI), in which adversarial instructions embedded in retrieved data hijack agent behavior. Existing attacks rely on static payloads that cannot adapt to agent-specific defenses; even recent adaptive methods lack structured feedback to guide optimization. We introduce IterInject, a feedback-guided iterative framework that closes the loop between injection, diagnosis, and refinement: a rule-based diagnoser produces structured outcome labels with behavioral descriptions, and an LLM-based optimizer refines payloads conditioned on the full optimization history. A synthesis step generates new disguise seeds from failure patterns, enabling the strategy space to self-evolve. On AgentDojo and InjectAgent, IterInject substantially outperforms static baselines and existing adaptive methods across four victim models. Extension experiments on Claude Code, a production-grade coding agent with layered defenses, show that optimized payloads achieve full success on 5 of 9 targets; even those that resist full exploitation exhibit measurable improvement from iterative refinement. We further present a mechanistic analysis of IPI, identifying an attention-mediated threshold mechanism in mid-to-late layers; three causal interventions validate this finding and point to concrete defense directions.

2605.24658 2026-05-26 cs.LG

WLNO: Wavelet-Laplace Neural Operator for Solving Partial Differential Equations

WLNO: 用于求解偏微分方程的小波-拉普拉斯神经算子

Muhammad Abid, Arth Sojitra, Omer San

发表机构 * Department of Mechanical and Aerospace Engineering, University of Tennessee, Knoxville(田纳西大学机械与航空航天工程系)

AI总结 提出WLNO,通过融合Haar小波多尺度空间分解与拉普拉斯神经算子的极点-留数公式,在五个基准PDE问题上优于LNO,尤其擅长处理具有强空间多尺度结构的问题。

详情
AI中文摘要

本文介绍了小波-拉普拉斯神经算子(WLNO),一种新颖的神经算子,它将Haar小波多尺度空间分解与拉普拉斯神经算子(LNO)的拉普拉斯域极点-留数公式融合在一起。虽然LNO通过可学习的系统极点和留数捕捉瞬态和稳态动力学,但它缺乏提取复杂PDE解中固有的空间局部多尺度特征的显式机制。WLNO通过用并行单级Haar离散小波变换(DWT)分支增强LNO核心来解决这一问题,该分支将提升的特征图分解为四个频率子带:近似(LL)、水平细节(LH)、垂直细节(HL)和对角细节(HH),并在通过逆DWT重建之前对每个子带应用独立学习的$1\times1$卷积。两个分支通过一个可学习的sigmoid门控权重$\alpha_\mathrm{wav}$融合,该权重初始化为给小波分支一个小的初始贡献,允许模型在整个训练过程中自适应地平衡拉普拉斯域动力学与空间多尺度特征。WLNO与LNO在五个基准PDE问题上使用相同的超参数、训练数据和评估协议进行评估:扩散方程、Burgers方程、反应扩散系统、达西流和二维Navier-Stokes方程。WLNO在所有五个问题上始终优于LNO,在具有强空间多尺度结构的问题上改进最为显著,例如具有尖锐激波前沿的Burgers方程和具有相干涡旋结构的Navier-Stokes方程,而在更平滑和椭圆问题上表现一致。这些结果表明,基于小波的多尺度空间分解是拉普拉斯域算子学习的一种有原则且有效的补充。

英文摘要

This work introduces the Wavelet-Laplace Neural Operator (WLNO), a novel neural operator that fuses Haar wavelet multi-scale spatial decomposition with the Laplace-domain pole-residue formulation of the Laplace Neural Operator (LNO). While LNO captures transient and steady-state dynamics through learnable system poles and residues, it lacks an explicit mechanism for extracting spatially localized multi-scale features inherent in complex PDE solutions. WLNO addresses this by augmenting the LNO core with a parallel single-level Haar discrete wavelet transform (DWT) branch that decomposes the lifted feature map into four frequency subbands, namely approximation (LL), horizontal detail (LH), vertical detail (HL), and diagonal detail (HH), and applies independently learned $1\times1$ convolutions to each subband before reconstruction via the inverse DWT. The two branches are fused through a learnable sigmoid-gated weight $\alpha_\mathrm{wav}$, initialized to give a small initial contribution to the wavelet branch, allowing the model to adaptively balance Laplace-domain dynamics against spatial multi-scale features throughout training. WLNO is evaluated against LNO on five benchmark PDE problems using identical hyperparameters, training data, and evaluation protocols: the diffusion equation, the Burgers equation, the reaction-diffusion system, Darcy flow, and the two-dimensional Navier-Stokes equation. WLNO consistently outperforms LNO on all five problems, with the most pronounced improvement on problems with strong spatial multi-scale structure, such as the Burgers equation with sharp shock fronts and the Navier-Stokes equation with coherent vortical structures, while remaining consistent across smoother and elliptic problems. These results demonstrate that wavelet-based multi-scale spatial decomposition is a principled and effective complement to Laplace-domain operator learning.
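The wavelet branch described above can be sketched in plain Python for a single-channel feature map. Several simplifications are assumed and are not taken from the paper: each learned $1\times1$ convolution is replaced by one scalar per subband, the fusion rule is taken to be `lno_out + sigmoid(alpha_raw) * wav_out`, and the LH/HL naming follows one common convention (others swap them). All function names are hypothetical.

```python
import math

def haar_dwt2(x):
    """Single-level orthonormal 2D Haar transform of an even-sized 2D list."""
    h, w = len(x), len(x[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    bands = {k: [[0.0] * (w // 2) for _ in range(h // 2)]
             for k in ("LL", "LH", "HL", "HH")}
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            a, b, c, d = x[i][j], x[i][j + 1], x[i + 1][j], x[i + 1][j + 1]
            r, s = i // 2, j // 2
            bands["LL"][r][s] = (a + b + c + d) / 2  # approximation
            bands["LH"][r][s] = (a + b - c - d) / 2  # difference between rows
            bands["HL"][r][s] = (a - b + c - d) / 2  # difference between columns
            bands["HH"][r][s] = (a - b - c + d) / 2  # diagonal detail
    return bands

def haar_idwt2(bands):
    """Inverse of haar_dwt2: rebuild the full-resolution map from four subbands."""
    ll, lh, hl, hh = bands["LL"], bands["LH"], bands["HL"], bands["HH"]
    h2, w2 = len(ll), len(ll[0])
    x = [[0.0] * (2 * w2) for _ in range(2 * h2)]
    for r in range(h2):
        for s in range(w2):
            p, q, u, v = ll[r][s], lh[r][s], hl[r][s], hh[r][s]
            x[2 * r][2 * s] = (p + q + u + v) / 2
            x[2 * r][2 * s + 1] = (p + q - u - v) / 2
            x[2 * r + 1][2 * s] = (p - q + u - v) / 2
            x[2 * r + 1][2 * s + 1] = (p - q - u + v) / 2
    return x

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def wavelet_branch(x, scales):
    """Scale each subband (stand-in for a learned 1x1 convolution), then invert."""
    bands = haar_dwt2(x)
    scaled = {k: [[scales[k] * v for v in row] for row in band]
              for k, band in bands.items()}
    return haar_idwt2(scaled)

def fuse(lno_out, wav_out, alpha_raw):
    """Assumed fusion rule: add the wavelet branch weighted by a sigmoid gate."""
    g = sigmoid(alpha_raw)
    return [[l + g * w for l, w in zip(lrow, wrow)]
            for lrow, wrow in zip(lno_out, wav_out)]

x = [[float(4 * i + j + 1) for j in range(4)] for i in range(4)]
identity = {"LL": 1.0, "LH": 1.0, "HL": 1.0, "HH": 1.0}
recon = wavelet_branch(x, identity)         # unit scales give perfect reconstruction
fused = fuse(x, recon, alpha_raw=-4.0)      # gate starts small, as the abstract describes
```

With unit scales the branch reproduces its input exactly, so any change in the output comes only from the per-subband weights and the gate; a negative initial `alpha_raw` keeps the wavelet contribution small at the start of training.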

2605.24657 2026-05-26 cs.AI cs.SE

Beyond Inference-Only Deployment: Comparing Weight-Based Consolidation Against Cascading Compaction

超越纯推理部署:比较基于权重的巩固与级联压缩

Simon Dennis, Kevin Shabahang, Hao Guo, Rivaan Patil

发表机构 * University of Melbourne(墨尔本大学)

AI总结 针对大型语言模型纯推理部署中用户知识无法持久化的问题,提出通过夜间反射、合成和LoRA微调将交互知识巩固到模型权重中,实验表明该方法相比级联压缩知识保留率提升43.6个百分点。

Comments 15 pages

详情
AI中文摘要

主流LLM平台以纯推理配置部署模型:模型服务请求但从不更新每个用户的权重。用户必须反复重新教授偏好、修正和项目上下文,基于上下文的变通方法消耗上下文窗口空间,并在级联压缩下退化。我们评估了一种替代方案:通过反射、合成和低秩适应(LoRA)微调,在单个消费级GPU上将交互知识夜间巩固到模型权重中。在十次真实的软件开发对话中(n=10,三种记忆类型共1146个测试问题),三轮级联压缩保留了36.8±3.0%的知识(介于11.8%的无上下文下限和90.1%的全上下文上限之间),而巩固保留了80.4±1.3%——提升了43.6个百分点(配对t(9)=14.8,p<0.001),是压缩保留量的两倍多,其中程序性修正(36.3%->74.6%)和情景项目事实(31.5%->78.2%)的增益最大。作为方法论上的附带说明,平均每token验证交叉熵与LLM判断的准确性呈负相关(r=-0.51),而中位数每token验证交叉熵几乎完全跟踪准确性(r=+0.99):在容忍表面形式变化的评估器下,平均值具有误导性,而重尾鲁棒统计量才是可靠的信号。持久个性化需要超越纯推理部署,转向将知识巩固到权重的架构。

英文摘要

Major LLM platforms deploy models in an inference-only configuration: the model serves requests but never updates per-user weights. Users must repeatedly re-teach preferences, corrections, and project context, and context-based workarounds consume context-window space and degrade under cascading compaction. We evaluate an alternative: nightly consolidation of interaction knowledge into model weights via reflection, synthesis, and Low-Rank Adaptation (LoRA) fine-tuning on a single consumer GPU. Across ten realistic software development conversations (n = 10, 1,146 test questions across three memory types), three cycles of cascading compaction retain 36.8 +/- 3.0% of knowledge (between an 11.8% no-context floor and a 90.1% full-context ceiling), while consolidation retains 80.4 +/- 1.3% -- a 43.6 pp gain (paired t(9) = 14.8, p < 0.001) that more than doubles what compaction preserves, with the largest gains on procedural corrections (36.3% -> 74.6%) and episodic project facts (31.5% -> 78.2%). As a methodological aside, mean per-token validation cross-entropy is negatively correlated with LLM-judged accuracy (r = -0.51) while median per-token validation cross-entropy tracks accuracy almost exactly (r = +0.99): under evaluators that tolerate surface-form variation, the mean is misleading and a heavy-tail-robust statistic is the faithful signal. Persistent personalization requires moving beyond inference-only deployment toward architectures that consolidate knowledge into weights.
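The headline comparison is simple arithmetic on the retention figures quoted above, and can be checked with a few lines. The helper name `retention_gain` is hypothetical; the percentages are the ones stated in the abstract.

```python
def retention_gain(baseline_pct, improved_pct):
    """Return (gain in percentage points, ratio improved / baseline)."""
    return improved_pct - baseline_pct, improved_pct / baseline_pct

# Overall retention: cascading compaction vs. weight consolidation.
gain_pp, ratio = retention_gain(36.8, 80.4)

# Per-memory-type figures quoted in the abstract.
per_type = {
    "procedural corrections": retention_gain(36.3, 74.6),
    "episodic project facts": retention_gain(31.5, 78.2),
}

print(round(gain_pp, 1), round(ratio, 2))
for name, (pp, r) in per_type.items():
    print(name, round(pp, 1), round(r, 2))
```

The overall ratio is above 2, consistent with the statement that consolidation more than doubles what compaction preserves.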

2605.24652 2026-05-26 cs.AI cs.CV cs.MM cs.SD

AVBench: Human-Aligned and Automated Evaluation Benchmark for Audio-Video Generative Models

AVBench:面向音视频生成模型的人类对齐与自动化评估基准

Jialiang Yang, Bin Xia, Ruihang Chu, Dingdong Wang, Wanke Xia, Zhun Mou, Tianyang Zhong, Yiting Zhao, Wenming Yang

发表机构 * Tsinghua University(清华大学) The Chinese University of Hong Kong(香港中文大学)

AI总结 提出AVBench,通过细粒度人类中心指标和偏好学习训练的专业评估器,实现音视频生成的自动化、准确评估。

详情
AI中文摘要

音视频(AV)生成的快速进步使得能够生成具有同步声音的高保真合成内容,特别是涉及语音和交互的人类相关场景。然而,AV生成的评估仍处于早期阶段,只有少数针对人类相关场景的粗粒度基准,并且依赖于有限的预设评估和通用多模态大语言模型,导致对模型能力的不准确评估。为了解决这些问题,我们引入了AVBench,一个专为人类中心AV生成设计的全自动化基准。AVBench基于两个关键设计以实现全面准确的评估:(i)人类中心和细粒度指标。AVBench整合了十个评估维度,专为以人为中心的现实场景设计,涵盖视觉质量、音频质量以及跨模态的多层次一致性。这些实用指标捕捉了现有基准经常忽略的人类相关细节。(ii)通过偏好学习训练的专业评估器。为了解决缺乏专门训练数据的问题,我们通过将真实视频转化为具有受控扰动的多样化训练对来构建大规模监督。在该高质量数据集上微调后,评估器学会可靠地检测细微的跨模态不一致性。关键的是,AVBench不输出离散的文本判断,而是从模型对二元决策的预测置信度中推导出连续评估分数。这种概率评分机制比传统的VQA风格评估更可靠,并且与人类判断高度一致。综合来看,AVBench为AV生成提供了自动化评估,展示了数据过滤的强大潜力,并可作为来自人类反馈的强化学习(RLHF)的可微分奖励信号。

英文摘要

Rapid advances in audio-video (AV) generation have enabled high-fidelity synthesis with synchronized sound, particularly for human-related scenarios involving speech and interactions. Yet evaluation of AV generation remains at an early stage: only a few coarse-grained benchmarks cover human-related scenarios, and they rely on limited preset evaluations with generic multimodal LLMs, leading to inaccurate assessments of model capabilities. To address these issues, we introduce AVBench, a fully automated benchmark tailored for human-centric AV generation. AVBench is built on two key designs for comprehensive and accurate evaluation: (i) Human-centric and fine-grained metrics. AVBench integrates ten evaluation dimensions designed for human-centered real-world scenarios, covering visual quality, audio quality, and multi-level consistency across modalities. These practical metrics capture human-related details that existing benchmarks often overlook. (ii) Specialized evaluators via preference learning. To address the lack of specialized training data, we construct large-scale supervision by transforming real-world videos into diverse training pairs with controlled perturbations. After fine-tuning on this high-quality dataset, the evaluators learn to reliably detect subtle cross-modal inconsistencies. Crucially, instead of producing discrete textual judgments, AVBench derives continuous evaluation scores from the model's prediction confidence on binary decisions. This probabilistic scoring mechanism enables a more reliable assessment than traditional VQA-style evaluation and aligns closely with human judgment. Taken together, AVBench offers automated evaluation for AV generation, demonstrates strong potential for data filtering, and serves as a differentiable reward signal for Reinforcement Learning from Human Feedback (RLHF).
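Deriving a continuous score from confidence on a binary decision can be sketched as below. It is assumed here, not stated in the abstract, that the evaluator exposes one logit for a positive answer and one for a negative answer; the function name and the example logits are hypothetical.

```python
import math

def confidence_score(logit_yes, logit_no):
    """P(yes) under a two-way softmax, computed stably as a sigmoid of the logit gap."""
    gap = logit_yes - logit_no
    if gap >= 0:
        return 1.0 / (1.0 + math.exp(-gap))
    e = math.exp(gap)
    return e / (1.0 + e)

# Two hypothetical samples that a discrete judge would both label "yes".
strong = confidence_score(2.0, 0.0)
weak = confidence_score(0.5, 0.0)
print(strong, weak)
```

Both samples would receive the same discrete verdict, but the continuous score separates them, and because it is a smooth function of the logits it can in principle be used as a differentiable reward.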

2605.24647 2026-05-26 cs.CL

Know You Before You Speak: User-State Modeling for LLM Personalization in Multi-Turn Conversation

在你说话之前了解你:多轮对话中用于LLM个性化的用户状态建模

Jiani Luo, Xiaoyan Zhao, Yang Zhang, Shuyi Miao, Bingbing Xu, Stefan Konigorski, Tat-Seng Chua

发表机构 * School of Computing, National University of Singapore(新加坡国立大学计算机学院) School of Artificial Intelligence, Beihang University(北航人工智能学院) Institute of Computing Technology, Chinese Academy of Sciences(中国科学院计算技术研究所) German Institute of Human Nutrition Potsdam-Rehbruecke(德国人类营养研究所波茨坦-雷赫布鲁克)

AI总结 提出基于自由能原理的PUMA框架,通过显式用户状态模型和动作选择机制,将个性化对话从被动记忆检索转变为基于模型的用户演化决策,在医疗咨询基准上提升长期对话效果。

Comments 30pages, 3 figures

详情
AI中文摘要

个性化对话不仅需要回忆显式的用户历史:系统还需要推断通过交互演化并塑造适当响应策略的隐藏用户状态。现有的基于记忆和配置文件的方法主要重用可观察的用户信息,对建模用户状态动态或基于它们如何塑造未来用户状态来选择动作的支持有限。我们提出了PUMA(面向动作选择的预期用户状态建模),这是一个基于自由能原理(FEP)的框架,将个性化形式化为部分可观测下的决策,围绕一个显式的用户状态模型,该模型捕获潜在用户状态及其动作条件动态。在每一轮中,PUMA维护对用户隐藏状态的信念,细化用于观测生成和动作条件状态转移的用户状态模型,并通过最小化预期自由能来选择对话动作,在统一标准下平衡认知和实用目标。这种表述将个性化从被动记忆检索转变为基于模型的用户演化决策。我们在面向医疗咨询和动机性访谈的基准上实例化PUMA,并带有潜在状态标注以进行严格评估。实验表明,PUMA在保持强响应质量的同时改善了长期对话结果,跨数据集研究展示了更可靠的用户状态估计和下一状态预测。

英文摘要

Personalized dialogue requires more than recalling explicit user histories: systems also need to infer hidden user states that evolve through interaction and shape appropriate response strategies. Existing memory- and profile-based methods primarily reuse observable user information, offering limited support for modeling user-state dynamics or selecting actions based on how they shape future user states. We propose PUMA (Prospective User-state Modeling for Action selection), a framework grounded in the Free Energy Principle (FEP) that formulates personalization as decision-making under partial observability, centered on an explicit user state model that captures latent user states and their action-conditioned dynamics. At each turn, PUMA maintains a belief over the user's hidden state, refines the user state model for observation generation and action-conditioned state transition, and selects dialogue actions by minimizing expected free energy, balancing epistemic and pragmatic objectives under a unified criterion. This formulation shifts personalization from passive memory retrieval to model-based decision-making over user evolution. We instantiate PUMA on healthcare-oriented counseling and motivational interviewing benchmarks with latent state annotations for rigorous evaluation. Experiments show that PUMA improves long-horizon dialogue outcomes while maintaining strong response quality, and a cross-dataset study demonstrates more reliable user-state estimation and next-state prediction.

2605.24643 2026-05-26 cs.RO cs.SY eess.SY

Towards Low-Gravity Planetary Exploration using Reinforcement Learning for Walking, Jumping, and In-flight Attitude Control

面向低重力行星探测的强化学习行走、跳跃与飞行姿态控制

Jørgen Anker Olsen, Kostas Alexis

发表机构 * Autonomous Robots Lab(自主机器人实验室) NTNU(挪威科技大学)

AI总结 本文利用强化学习为四足机器人在火星低重力环境下开发行走、垂直跳跃、前向跳跃及飞行姿态控制策略,实现跨越障碍物并安全着陆,仿真与实验验证了策略的有效性。

Comments 16 pages, 16 figures

详情
AI中文摘要

本文提出了用于行星探测场景中动态四足运动的强化学习策略。基于采用五杆腿设计的任务优化四足机器人,我们开发了针对行走、垂直跳跃、前向跳跃和飞行姿态控制的强化学习策略,这些策略明确针对火星上的低重力环境进行了调整。这些策略共同使机器人能够通过协调跳跃和精确的飞行中重新定向来克服比自身更大的障碍物,实现安全着陆。我们通过单轴重新定向测试在Olympus四足机器人上展示了姿态控制策略的Sim2Real迁移,而所有运动策略均在仿真中进行了验证。一个完整的火星探测任务场景展示了在复杂地形上协调策略部署的能力。实验结果显示,在2.6秒内完成90°姿态重新定向,仿真表明在火星重力条件下可实现3.1米的垂直跳跃和3.9米的前向跳跃。- 补充视频:https://www.youtube.com/watch?v=qlSJ3P87A4A

英文摘要

This paper presents reinforcement learning (RL) policies for dynamic quadrupedal locomotion in planetary exploration scenarios. Building on a task-optimized quadruped with a 5-bar leg design, we develop RL policies for walking, vertical jumping, forward jumping, and in-flight attitude control, explicitly tailored to the reduced gravity on Mars. These policies jointly enable such robots to overcome obstacles larger than themselves through coordinated jumping and precise in-flight reorientation for safe landings. We demonstrate Sim2Real transfer of the attitude control policy on the Olympus quadruped through single-axis reorientation tests, while all locomotion policies are validated in simulation. A complete Mars exploration mission scenario demonstrates coordinated policy deployment across challenging terrain. Experimental results show 90° attitude reorientation in 2.6 seconds, with simulations demonstrating 3.1 meter vertical jumps and 3.9 meter forward jumps under Martian gravity conditions. Supplementary video: https://www.youtube.com/watch?v=qlSJ3P87A4A

2605.24642 2026-05-26 cs.CV cs.RO

Understanding the Impact of Geometric Foundation Models on Vision-Language-Action Models

理解几何基础模型对视觉-语言-动作模型的影响

Yurou Yang, Muyuan Lin, Roberto Martin-Martin, Martin Labrie, Shreekant Gayaka, Cheng-Hao Kuo, Luca Carlone

发表机构 * Amazon Personal Robotics Group(亚马逊个人机器人小组) University of Texas at Austin(德克萨斯大学奥斯汀分校) Massachusetts Institute of Technology(麻省理工学院)

AI总结 本文通过线性探测分析量化了视觉-语言-动作模型(VLA)与几何基础模型(GFM)之间的“几何差距”,比较了三种注入几何信息的架构,并研究了非架构因素对几何VLA性能的影响。

详情
AI中文摘要

近期工作探索了视觉-语言-动作模型(VLA)与用于3D重建的几何基础模型(GFM)(如VGGT)交叉领域的新机遇。虽然由此产生的几何VLA通常表现出改进的性能,但仍不清楚:(i) 现代VLA是否已经具备足够的几何理解能力,(ii) 将几何理解注入VLA的最佳架构是什么,以及(iii) 其他影响几何VLA的设计选择的效果。在本文中,我们针对特定的VLA(GR00T-N1.5)和GFM(VGGT)进行了严格的实验分析,以阐明这些问题。我们的第一个贡献是通过基于线性探测的严格分析,形式化了先前工作中关于当前VLA缺乏几何理解的直觉。该分析首次量化了VLA与GFM之间的“几何差距”。我们的第二个贡献是识别并比较了将GFM与VLA桥接的不同策略。我们实现了三种不同的架构,它们在将几何信息注入VLA的方式上有所不同,同时尽可能保持低级实现细节相似,以确保公平比较。最后,我们分析了非架构选择(例如,训练数据、相机数量、重建质量)对几何VLA性能的影响。

英文摘要

Recent work explores new opportunities at the intersection of vision-language-action models (VLAs) and geometric foundation models (GFMs) for 3D reconstruction, such as VGGT. While the resulting geometric VLAs often show improved performance, it remains unclear (i) if modern VLAs already have sufficient geometric understanding to start with, (ii) what is the best architecture to inject geometric understanding into a VLA, and (iii) what is the effect of other design choices that affect geometric VLAs. In this paper we provide a rigorous experimental analysis to shed light on these questions, for a specific choice of VLA (GR00T-N1.5) and GFM (VGGT). Our first contribution is to formalize prior work's intuition that current VLAs lack geometric understanding, by providing a rigorous analysis based on linear probing. The analysis quantifies, for the first time, the "geometric gap" between VLAs and GFMs. Our second contribution is to identify and compare different strategies to bridge GFMs with VLAs. We implement three different architectures, which differ in the way they inject geometry in the VLA, while keeping low-level implementation details as similar as possible, to ensure a fair comparison. Finally, we analyze the impact of non-architectural choices (e.g., training data, number of cameras, reconstruction quality) on the performance of the geometric VLAs.

2605.24639 2026-05-26 cs.CV cs.AI

DisDop: Distillation with Domain Priors for Open-Vocabulary Aerial Object Detection

DisDop: 基于领域先验蒸馏的开放词汇航空目标检测

Ruihao Xu, Yong Liu, Yansong Tang, Sule Bai, Xubing Ye, Bingyao Yu, Yutao Guo, Jiwen Lu, Jie Zhou

发表机构 * Tsinghua Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院,清华大学) Tsinghua University(清华大学)

AI总结 提出DisDop框架,通过从遥感基础模型(RemoteCLIP和DINOv3)中系统蒸馏多级领域先验知识到轻量级检测器,实现开放词汇航空目标检测的最新性能。

详情
AI中文摘要

近年来,随着无人机的广泛应用,航空图像的目标检测引起了越来越多的关注,尤其是不受预定义类别限制的开放词汇航空检测。由于无人机视角图像的稀缺性及其与自然图像的显著差异,直接应用为自然场景设计的普通开放词汇检测方法难以取得令人满意的结果。一些研究提出通过使用轻量级网络或生成伪标签来从预训练模型迁移知识,但它们往往依赖于在自然图像上训练的模型,忽略了专门为遥感和航空图像定制的基础模型的潜力。为了解决这一局限性,我们提出了DisDop,一个统一的框架,系统地将来自遥感基础模型(例如RemoteCLIP和DINOv3)的多级领域先验知识蒸馏到轻量级检测器中。具体来说,我们首先通过教师融合策略蒸馏视觉先验,该策略结合了RemoteCLIP的跨模态对齐能力和DINOv3的细粒度局部特征提取能力,将其互补优势迁移到检测器的骨干网络中。其次,我们通过显式建模类别间语义关系来蒸馏嵌入在RemoteCLIP文本编码器中的文本先验,同时结合全局上下文先验以增强小目标的局部特征表示。通过这种多级先验蒸馏框架,我们的DisDop在开放词汇航空检测基准上取得了新的最先进性能。大量的消融分析也证明了我们提出模块的合理性和有效性。

英文摘要

With the widespread application of drones in recent years, object detection in aerial images has attracted increasing attention, especially open-vocabulary aerial detection, which is not restricted to predefined categories. Because drone-view images are scarce and differ significantly from natural images, directly applying vanilla open-vocabulary detection methods designed for natural scenes rarely yields satisfactory results. Some studies propose to transfer knowledge from pre-trained models by using lightweight networks or generating pseudo labels, but they tend to rely on models trained on natural images, neglecting the potential of foundation models specifically tailored for remote sensing and aerial imagery. To address this limitation, we propose DisDop, a unified framework that systematically distills multi-level domain priors from remote sensing foundation models (e.g., RemoteCLIP and DINOv3) into a lightweight detector. Specifically, we first distill visual priors through a teacher fusion strategy that combines RemoteCLIP's cross-modal alignment capability with DINOv3's fine-grained local feature extraction ability, transferring their complementary strengths to the detector's backbone. Second, we distill textual priors embedded in RemoteCLIP's text encoder by explicitly modeling inter-category semantic relationships, while incorporating global contextual priors to enhance local feature representation for small objects. Through this multi-level prior distillation framework, DisDop achieves new state-of-the-art performance on open-vocabulary aerial detection benchmarks. Extensive ablation analysis also demonstrates the rationality and effectiveness of the proposed modules.

2605.24635 2026-05-26 cs.CL

HiMed: Incentivizing Hindi Reasoning in Medical LLMs

HiMed: 激励医疗大语言模型中的印地语推理

Dingfeng Jiang, Han Yan, Chenze Ma, Amit Kumar Jaiswal, Ang Li, Yunxiang Jiang, Xinlei Xiong, Juhao Liang, Hongru Xiao, Xiang Li, Fan Bu, Jiale Han, Ruchir Gupta, Prayag Tiwari, Benyou Wang

发表机构 * The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) Indian Institute of Technology (Banaras Hindu University) Varanasi(印度理工学院(贝拿勒斯印度教大学)瓦拉纳西) Tongji University(同济大学) Shenzhen Research Institute of Big Data(深圳大数据研究院) Shenzhen Loop Area Institute(深圳河套学院) The Hong Kong University of Science and Technology(香港科技大学) Halmstad University(哈尔姆斯塔德大学)

AI总结 针对医疗大语言模型在印地语上表现不佳的问题,提出HiMed印地语医疗推理语料库与基准,并通过衰减支架奖励训练HiMed-8B模型,显著提升印地语医疗推理性能并缩小英印准确率差距。

详情
AI中文摘要

医疗大语言模型有望减少医疗保健差距,但印地语仍然严重代表性不足。尽管医疗大语言模型在高资源语言中表现出色,但其性能在印地语中急剧下降,尤其是在印度医学体系方面。我们认为,稳健的跨语言医疗迁移需要印地语推理。为此,我们引入了HiMed,一个涵盖西方和印度医学的印地语推理医疗语料库和基准套件。我们进一步通过设计衰减支架奖励,提出了HiMed-8B,一个印地语医疗推理大语言模型。大量实验表明,印地语医疗推理性能得到提升,英印准确率差距缩小。消融研究验证了每个训练阶段和奖励组件的贡献。所有数据和代码均可在GitHub上获取:https://github.com/FreedomIntelligence/HiMed。

英文摘要

Medical large language models hold promise for reducing healthcare disparities, yet Hindi remains severely underrepresented. While medical LLMs excel in high-resource languages, their performance degrades sharply in Hindi, particularly on Indian systems of medicine. We argue that robust cross-lingual medical transfer requires Hindi reasoning. To this end, we introduce HiMed, a Hindi reasoning medical corpus and benchmark suite covering both Western and Indian medicine. We further propose HiMed-8B, a Hindi-form medical reasoning LLM, through the design of a decaying scaffolding reward. Extensive experiments demonstrate improved Hindi medical reasoning performance and a reduced English-Hindi accuracy gap. Ablation studies validate the contribution of each training stage and reward component. All data and code are available on GitHub: https://github.com/FreedomIntelligence/HiMed.

2605.24631 2026-05-26 cs.LG cs.AI cs.CV

Beyond Generative Priors: Minority Sampling with JEPA-Guided Diffusion

超越生成先验:JEPA引导扩散的少数采样

Sol Park, Soobin Um

发表机构 * Department of Artificial Intelligence, Kookmin University, Seoul, South Korea(韩国首尔国民大学人工智能系)

AI总结 提出一种基于世界模型JEPA引导的扩散采样框架,通过近似策略实现高效计算,在无条件、类别条件和文本到图像生成中提升少数样本的保真度和语义有效性。

Comments ICML 2026, 21 pages, 9 figures

详情
AI中文摘要

少数采样旨在数据流形上生成低密度实例,在医学诊断、异常检测和创意AI等应用中具有核心重要性。然而,现有方法相对于从训练数据中学习的生成先验来定义少数样本,将稀有性限制在可能无法很好反映现实世界语义的模型特定概念中。在这项工作中,我们提出了一种以世界为中心的少数采样视角,该视角相对于现实世界先验而非生成器诱导的密度来定义稀有性。为此,我们引入了JEPA引导,一种由联合嵌入预测架构(JEPA)引导的扩散采样框架——JEPA是一类编码广泛、语义丰富表示的世界模型。JEPA引导将扩散轨迹导向JEPA隐含密度下的低密度区域,从而使生成的少数样本与现实世界的语义稀有性对齐。为了使JEPA引导在计算上实用,我们开发了带有理论误差界限的原则性近似策略,显著降低了引导计算的开销。在无条件、类别条件和文本到图像生成上的大量实验表明,JEPA引导持续提高了少数样本的保真度和语义有效性,在捕捉现实世界的稀有性概念方面优于以生成器为中心的基线。代码可在https://github.com/soobin-um/jepa-guidance获取。

英文摘要

Minority sampling aims to generate low-density instances on a data manifold and is of central importance in applications such as medical diagnosis, anomaly detection, and creative AI. Existing approaches, however, define minority samples relative to generative priors learned from training data, confining rarity to model-specific notions that may poorly reflect real-world semantics. In this work, we propose a world-centric perspective on minority sampling, which defines rarity with respect to real-world priors rather than generator-induced densities. To this end, we introduce JEPA guidance, a diffusion sampling framework guided by a Joint-Embedding Predictive Architecture (JEPA) -- a class of world models that encode broad, semantically rich representations. JEPA guidance steers diffusion trajectories toward low-density regions under the implicit density induced by the JEPA, thereby aligning generated minorities with real-world semantic rarity. To make JEPA guidance computationally practical, we develop principled approximation strategies accompanied by theoretical error bounds, significantly reducing the overhead of guidance computation. Extensive experiments across unconditional, class-conditional, and text-to-image generation demonstrate that JEPA guidance consistently improves the fidelity and semantic validity of minority samples, outperforming generator-centric baselines in capturing real-world notions of rarity. Code is available at https://github.com/soobin-um/jepa-guidance.

2605.24630 2026-05-26 cs.CV

DexSIM: Real-time Dexterous Simulation with Unified Causal Video Diffusion

DexSIM: 具有统一因果视频扩散的实时灵巧仿真

Adam Lee

发表机构 * UC Berkeley(加州大学伯克利分校)

AI总结 提出DexSIM框架,通过两阶段训练(双向视频扩散和自回归滚动训练)实现实时、长时一致的灵巧操作仿真,在像素相似度、运动保真度和手部投影精度上超越基线。

Comments World Model @ ICLR 2026

详情
AI中文摘要

视频扩散模型的最新进展已实现对物理世界的大规模仿真,但手部物体交互的仿真研究较少。我们提出DexSIM,一个用于实时灵巧操作仿真的灵巧仿真框架。以往利用视频扩散和3D重建的工作侧重于导航,而灵巧操作虽在创建交互式仿真体验和为机器人生成合成数据方面有广泛应用,但进展有限。现有方法缺乏实时交互性、长期空间一致性和记忆。我们为DexSIM提出两阶段训练框架。首先,通过在手部动作轨迹和视频的统一特征空间中进行联合嵌入,训练一个双向视频扩散模型。我们利用高斯热图手部编码实现更准确的手部表示。然后,我们进行基于滚动的自回归训练,将更新的空间缓存作为注意力汇点用于空间记忆,从而提高了长期一致性和3D感知灵巧操作仿真。DexSIM在像素和语义相似度、运动保真度和手部投影精度上优于基线。它还支持手部运动迁移等新应用,并以15.24 FPS的帧率实现实时交互。

英文摘要

Recent progress in video diffusion models has enabled extensive simulation of the physical world, but simulation of hand-object interaction remains less explored. We propose DexSIM, a framework for simulating dexterous manipulation in real time. Previous works that combine video diffusion and 3D reconstruction focus on navigation; dexterous manipulation has received limited attention despite its broad applications in creating interactive experiences with the simulated world and in generating synthetic data for robotics. Existing methods lack real-time interactivity, long-term spatial consistency, and memory. We propose a two-stage training framework for DexSIM. First, we train a bidirectional video diffusion model by jointly embedding the hand action trajectory and the video in a unified feature space, using a Gaussian heatmap hand encoding for more accurate hand representation. Second, we conduct rollout-based autoregressive training with an updated spatial cache serving as an attention sink for spatial memory, which improves long-term consistency and 3D-aware dexterous manipulation simulation. DexSIM outperforms the baseline on pixel and semantic similarity, motion fidelity, and hand projection accuracy. It also enables new applications such as hand motion transfer and runs interactively in real time at 15.24 FPS.
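A Gaussian heatmap encoding of hand keypoints can be sketched as follows. The abstract does not specify the resolution, the number of keypoints, or the standard deviation, so the values and function names below are assumptions chosen only to illustrate the idea of one heatmap channel per projected keypoint.

```python
import math

def gaussian_heatmap(height, width, keypoint_xy, sigma=2.0):
    """2D map with an unnormalized Gaussian bump centered at (x, y) in pixels."""
    cx, cy = keypoint_xy
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]

def encode_hand(height, width, keypoints, sigma=2.0):
    """Stack one heatmap channel per keypoint (channel-first nested lists)."""
    return [gaussian_heatmap(height, width, kp, sigma) for kp in keypoints]

# Hypothetical example: two projected keypoints on an 8x8 grid.
channels = encode_hand(8, 8, [(3, 5), (6, 1)])
peak = channels[0][5][3]  # rows index y, columns index x
print(len(channels), peak)
```

Compared with feeding raw coordinates, a heatmap places the hand signal on the same spatial grid as the video frames, which is one plausible reason such an encoding helps when actions and video share a feature space.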

2605.24625 2026-05-26 cs.CV

ULF-Synth: Physics-Guided Ultra-Low-Field MRI Enhancement for Pediatric Neuroimaging

ULF-Synth:用于儿科神经影像的物理引导超低场MRI增强

Toufiq Musah, Salvatore Calcagno, Federica Proietto Salanitri, Xiaomeng Li, Maruf Adewole, Marawan Elbatel

Affiliations: Kwame Nkrumah University of Science and Technology; University of Catania; The Hong Kong University of Science and Technology; Medical Artificial Intelligence Lab

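AI summary: Proposes the ULF-Synth framework, which synthesizes realistic ultra-low-field images from high-field MRI and adopts a spatial-frequency domain objective to enhance ultra-low-field MRI without real paired data, improving structural similarity and diagnostic acceptability.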

Comments 10 pages, 2 figures, 3 tables

AI-generated Chinese abstract

超低场(ULF)MRI提供了便携且可及的神经影像,但与高场(HF)系统相比,存在信噪比降低和空间分辨率有限的问题。获取配对的ULF-HF数据进行监督增强通常很困难,尤其是在资源有限的环境中。我们提出了ULF-Synth框架,它结合了:(i)基于采集的从HF体积合成逼真ULF图像的方法,以创建大规模配对训练数据;(ii)优先恢复高频解剖细节的空间-频率域目标。该公式与架构无关,在编码器-解码器、对抗性和基于扩散的翻译模型中一致地提高了结构相似性和感知保真度。当仅使用合成数据训练时,所得模型有效泛化到真实的64mT ULF采集,改善了下游多类脑分割,并在盲法读者研究中获得了更高的放射科医生偏好和诊断可接受性。这些发现表明,合成配对监督提供了一种实用且可扩展的途径来增强ULF MRI,而无需真实的配对采集。代码、模型和数据集:https://github.com/toufiqmusah/ULF-Synth

English abstract

Ultra-low-field (ULF) MRI offers portable and accessible neuroimaging but suffers from reduced signal-to-noise ratio and limited spatial resolution compared to high-field (HF) systems. Acquiring paired ULF-HF data for supervised enhancement is often difficult, particularly in resource-limited settings. We introduce ULF-Synth, a framework that combines (i) acquisition-based synthesis of realistic ULF images from HF volumes to create large-scale paired training data, and (ii) a spatial-frequency domain objective that prioritizes recovery of high-frequency anatomical detail. This formulation is architecture-agnostic, consistently improving structural similarity and perceptual fidelity across encoder-decoder, adversarial, and diffusion-based translation models. When trained exclusively on synthetic data, the resulting models generalize effectively to real 64 mT ULF acquisitions, improving downstream multiclass brain segmentation and achieving higher radiologist preference and diagnostic acceptability in a blinded reader study. These findings demonstrate that synthetic paired supervision provides a practical and scalable pathway for enhancing ULF MRI without requiring real paired acquisitions. Code, models, and dataset: https://github.com/toufiqmusah/ULF-Synth
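
The abstract does not give the form of the spatial-frequency domain objective. The sketch below is only one possible reading, on a 1D signal for brevity: an L1 term in the spatial domain plus a spectral term whose weights grow with distance from the DC bin. The function names, the weighting scheme, and `freq_weight` are all assumptions, not the paper's loss.

```python
import cmath

def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real 1D signal."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def spatial_frequency_loss(pred, target, freq_weight=0.5):
    """Hypothetical loss: spatial L1 plus frequency-weighted spectral L1."""
    n = len(pred)
    spatial = sum(abs(p - t) for p, t in zip(pred, target)) / n
    # Weight each bin by its distance from DC, so high frequencies count more.
    weights = [min(k, n - k) for k in range(n)]
    total = sum(weights) or 1
    spectral = sum(
        w * abs(a - b) for w, a, b in zip(weights, dft(pred), dft(target))
    ) / total
    return spatial + freq_weight * spectral

target = [0.0] * 8
smooth_error = [0.1] * 8        # constant offset: low-frequency error
sharp_error = [0.1, -0.1] * 4   # alternating sign: highest-frequency error

loss_smooth = spatial_frequency_loss(smooth_error, target)
loss_sharp = spatial_frequency_loss(sharp_error, target)
```

Both error patterns have the same spatial L1 distance from the target, but the alternating one is penalized more, which is the behaviour a loss favouring high-frequency detail would need.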

2605.24624 2026-05-26 cs.CV

Vision-Language Binding in In-Context Image Generation

上下文图像生成中的视觉-语言绑定

Chris Ge, Rohit Gandikota, Antonio Torralba, Tamar Rott Shaham

Affiliations: MIT CSAIL; Northeastern University

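AI summary: Using causal interventions, this paper reveals an implicit cross-modal binding mechanism between text tokens and the reference image in the FLUX.2 model, and localizes the binding to the padding tokens of the text sequence.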

Comments 35 pages, 19 figures

AI-generated Chinese abstract

上下文图像生成模型(如FLUX.2)接收文本提示和可选的参考图像作为输出的视觉条件。在内部,所有三个输入——文本、参考图像和噪声令牌——被连接并通过单个注意力流处理,其中所有令牌可以相互关注。这留下了参考信息如何通过模型流动以产生输出图像的问题。我们展示了文本令牌与参考图像之间出现隐式跨模态绑定:在前向传播过程中,文本令牌吸收视觉参考内容,并且这些吸收的内容因果地影响生成的输出。我们通过三种因果干预方法揭示了FLUX.2中的这种绑定:T2I Lens,通过文本到图像路径解码中间文本令牌激活;Attention Knockout,切断特定的注意力边;以及I2I-to-I2I Patching,在编辑运行之间复制文本令牌激活。在包括SUN397和DreamBench++数据集以及在线收集的图像在内的2875个编辑任务中,我们观察到一致的分工:参考图像的属性(如颜色、风格和场景设置)首先被写入文本令牌,然后由文本令牌携带到生成的图像中;像素精确的属性(如特定面孔或实例身份)绕过文本令牌,通过图像到图像注意力直接从参考图像流向生成的图像。我们进一步将参考-文本绑定定位到文本序列的填充令牌。这些结果表明,多模态DiT中的文本令牌不仅仅是提示持有者,而是参考图像内容的结构化通道。更广泛地说,它们表明即使在统一注意力的多模态生成模型中,令牌模态也决定了条件信息如何在网络中表示和路由。

English abstract

In-context image generation models such as FLUX.2 take a text prompt and an optional reference image as visual conditioning for the output. Internally, all three inputs -- text, reference image, and the noise tokens -- are concatenated and processed through a single attention stream, where all tokens can attend to one another. This leaves open how reference information flows through the model to produce the output image. We show that an implicit cross-modal binding emerges between the text tokens and the reference image: the text tokens absorb visual reference content during the forward pass, and that absorbed content causally influences the generated output. We surface this binding with three causal interventions on FLUX.2: T2I Lens, which decodes intermediate text-token activations through a text-to-image path; Attention Knockout, which severs specific attention edges; and I2I-to-I2I Patching, which copies text token activations between editing runs. Across 2,875 editing tasks on various images, including SUN397 and DreamBench++ datasets and images collected online, we observe a consistent division of labor: properties of the reference image, like color, style, and scene setting, are first written into the text tokens, which carry them to the generated image; pixel-exact properties like a specific face or instance identity bypass the text tokens and flow directly from reference to image through image-to-image attention. We further localize the reference-text binding to the padding tokens of the text sequence. These results show that text tokens in a multimodal DiT are not just prompt holders, but a structured channel for reference image content. More broadly, they suggest that even in unified-attention multimodal generative models, token modality structures how conditioning information is represented and routed across the network.
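
Attention Knockout is described above as severing specific attention edges. A common way to do this, assumed here and not confirmed as the authors' FLUX.2 implementation, is to set the chosen pre-softmax logits to negative infinity so those edges receive zero weight. The token layout in the toy example is hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_with_knockout(scores, blocked_edges=()):
    """Return attention weights with the given (query, key) edges severed.

    scores[q][k] are pre-softmax logits. Assumes every query keeps at
    least one unblocked key; a fully blocked row is not handled.
    """
    blocked = set(blocked_edges)
    weights = []
    for q, row in enumerate(scores):
        masked = [-math.inf if (q, k) in blocked else v for k, v in enumerate(row)]
        weights.append(softmax(masked))
    return weights

# Toy layout (hypothetical): index 0 = text token, 1 = reference-image
# token, 2 = noise token. Block the text token from reading the reference.
scores = [[1.0, 2.0, 3.0],
          [0.5, 0.5, 0.5],
          [0.2, 0.1, 0.4]]
weights = attention_with_knockout(scores, blocked_edges={(0, 1)})
```

Comparing the model's output with and without such a mask is what makes the intervention causal: any change can be attributed to the severed edge.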

2605.24622 2026-05-26 cs.RO cs.CV

PoseRefer: Pathway-Local Parameters for Semantically Grounded Reference Resolution

PoseRefer: 用于语义基础指代消解的通路-局部参数

Anna Deichler

Affiliations: KTH Royal Institute of Technology

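AI summary: Proposes the PoseRefer architecture, which decouples the pose and text pathways and uses frozen MiniLM category embeddings to reach 31.9% top-1 accuracy on the MM-Conv dataset, and shows that fusion accuracy may be affected by category-representation artifacts.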

Comments ICRA 2026 Workshop on Semantics for Reliable Robot Autonomy: From Environment Understanding and Reasoning to Safe Interaction

AI-generated Chinese abstract

一个机器人解析“把杯子放在那个上面”必须融合手势、语言和场景几何,然而3D基础基准测试仅部分捕获了这一情况:描述是事后编写的,手势是模板化的,或者指向是为相机摆拍的。MM-Conv从二元VR交互中捕获自然的伴随语音手势,同时包含全身动作捕捉和3D场景图。我们使用它来评估姿态-语言融合,采用解耦的后期融合架构,其中姿态和文本通路不共享任何学习参数。这两个选择共同使得通过受控消融更容易隔离类别、姿态和文本的贡献。使用冻结的MiniLM类别嵌入的融合在每种指代类型上都超过了仅姿态和最佳文本通路,达到31.9%的top-1。学习到的标量门根据文本通路是否有类别访问权限而在相反策略之间切换。这是一个可靠性诊断:除非通路在架构上解耦,否则语义基础系统的融合准确性声明与类别表示伪影无法区分。

English abstract

A robot resolving "put the cup on that one" must fuse gesture, language, and scene geometry, yet 3D grounding benchmarks only partially capture this regime: descriptions are written post-hoc, gestures are templated, or pointing is staged for the camera. MM-Conv captures natural co-speech gesture from dyadic VR interaction alongside full-body motion capture and 3D scene graphs. We use it to evaluate pose-language fusion with a decoupled late-fusion architecture in which pose and text pathways share no learned parameters. The two choices together make category, pose, and text contributions easier to isolate through controlled ablations. Fusion with frozen MiniLM category embeddings exceeds pose alone and the best text-only pathway on every reference type, reaching 31.9% top-1. The learned scalar gate flips between opposing policies depending on whether the text pathway has category access. This is a reliability diagnostic: fusion-accuracy claims for semantic grounding systems are indistinguishable from category-representation artifacts unless pathways are architecturally decoupled.
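
The abstract mentions a learned scalar gate over decoupled pose and text pathways but not its exact form. A minimal sketch, assuming the gate is a sigmoid of a single learned parameter that forms a convex combination of per-candidate scores (function names, scores, and gate values are hypothetical):

```python
import math

def fuse_scores(pose_scores, text_scores, gate_logit):
    """Late fusion: g * pose + (1 - g) * text, with g = sigmoid(gate_logit).

    Hypothetical form; the PoseRefer gate may be defined differently.
    """
    g = 1.0 / (1.0 + math.exp(-gate_logit))
    return [g * p + (1.0 - g) * t for p, t in zip(pose_scores, text_scores)]

def top1(scores):
    """Index of the highest-scoring candidate object."""
    return max(range(len(scores)), key=scores.__getitem__)

# Toy candidate scores from two pathways that share no parameters.
pose_scores = [0.1, 0.7, 0.2]
text_scores = [0.6, 0.1, 0.3]

pose_dominant = top1(fuse_scores(pose_scores, text_scores, gate_logit=4.0))
text_dominant = top1(fuse_scores(pose_scores, text_scores, gate_logit=-4.0))
```

With the gate near 1 the fused prediction follows the pose pathway, and with the gate near 0 it follows the text pathway, so reading off the learned gate value shows which pathway the fusion actually relies on.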