arXivDaily: daily arXiv academic digest, updated Monday through Friday
2605.15763 2026-05-18 cs.CL cs.AI

CompactQE: Interpretable Translation Quality Estimation via Small Open-Weight LLMs

Kamil Guttmann, Zofia Fraś, Artur Nowakowski, Krzysztof Jassem

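AI summary This paper proposes CompactQE, which uses small open-weight LLMs for translation quality estimation, producing quality scores, error annotations, suggested corrections, and full post-edits; its system-level correlation with human judgments exceeds that of traditional metrics and human inter-annotator agreement.

AI abstract (translated from Chinese)

Current state-of-the-art quality estimation (QE) for machine translation relies on large proprietary LLMs, which raises data privacy concerns. We show that smaller open-source LLMs (<30B parameters) are a viable, cost-effective, and privacy-preserving alternative. Using a single-pass prompting strategy, our models simultaneously generate quality scores, MQM error annotations, suggested error corrections, and full post-edits. Our analysis shows that these models correlate strongly with human judgments at the system level, outperforming traditional neural metrics, fine-tuned models, and human inter-annotator agreement, and effectively approximating the capabilities of much larger proprietary LLMs.

English abstract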

Current state-of-the-art Quality Estimation (QE) in machine translation relies on massive, proprietary LLMs, raising data privacy concerns. We demonstrate that smaller, open-source LLMs (<30B parameters) are a viable, cost-effective and privacy-preserving alternative. Using a single-pass prompting strategy, our models simultaneously generate quality scores, MQM error annotations, suggested error corrections, and full post-editions. Our analysis shows these models achieve highly competitive system-level correlations with human judgments that outperform traditional neural metrics, fine-tuned models, and human inter-annotator agreement, effectively approximating the capabilities of much larger proprietary LLMs.

2605.15761 2026-05-18 cs.LG

A Unified Perturbation Framework for Analyzing Leaderboard Stability and Manipulation

Hosna Oyarhoseini, Jimmy Lin, Amir-Hossein Karimi

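AI summary This paper proposes a unified perturbation framework for analyzing the robustness of Bradley-Terry leaderboards under structured data modifications. It studies how Drop, Add, and Flip perturbations affect leaderboard stability, shows that modern leaderboards are non-robust on all three objectives, and provides an auditing tool.

AI abstract (translated from Chinese)

Evaluation leaderboards such as LMArena play a central role in benchmarking large language models by aggregating pairwise human preferences over models, yet the robustness of these rankings remains poorly understood. We propose a unified perturbation framework that uses influence-based approximations to analyze the robustness of Bradley-Terry leaderboards under structured data modifications. The framework studies three match-level perturbations (Drop, Add, and Flip) together with player removal, and evaluates their effects on top-k membership, global ranking consistency (via Kendall's tau), and confidence-interval-based uncertainty. On Chatbot Arena and six additional pairwise-comparison datasets, we show that modern leaderboards are non-robust on all three objectives: targeted perturbations of under 1% of the data can change the top-ranked model, lower Kendall's tau, and alter confidence intervals. Beyond robustness auditing, we also show that the same influence scores enable efficient targeted perturbations that promote or demote specific models and reduce a target model's uncertainty with fewer actions than previous manipulation and active-sampling baselines. By summarizing these effects in normalized dataset-level robustness scores, our framework offers a practical tool for auditing leaderboard stability and motivating more robust evaluation protocols.

English abstract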

Evaluation leaderboards such as LMArena play a central role in benchmarking large language models by aggregating pairwise human preferences into model rankings, yet the robustness of these rankings remains poorly understood. We present a unified perturbation framework for analyzing Bradley-Terry leaderboards under structured data modifications using influence-based approximations. Our framework studies three match-level perturbations -- Drop, Add, and Flip -- together with player removal, and evaluates their effects on top-k membership, global ranking consistency via Kendall's tau, and confidence-interval-based uncertainty. Across Chatbot Arena and six additional pairwise-comparison datasets, we show that modern leaderboards are non-robust across all three objectives: sub-1% targeted perturbations can change the top-ranked model, degrade Kendall's tau, and alter confidence intervals. Beyond robustness auditing, we show that the same influence scores enable efficient targeted perturbations, promoting or demoting specific models and reducing target-model uncertainty with fewer actions than previous manipulation and active-sampling baselines. By summarizing these effects with normalized dataset-level robustness scores, our framework provides a practical and helpful tool for auditing leaderboard stability and motivating more robust evaluation protocols.
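
To make the Drop perturbation and the Kendall's tau objective concrete, here is a minimal sketch. It is not the paper's method: it swaps the influence-based approximation for a brute-force leave-one-out refit, fits Bradley-Terry strengths with the classic MM update, and uses hypothetical function names and toy data.

```python
from itertools import combinations


def fit_bradley_terry(matches, n_players, iters=200):
    """Estimate Bradley-Terry strengths with the classic MM update.

    matches is a list of (winner, loser) index pairs.
    """
    wins = [0] * n_players
    pair_counts = {}
    for winner, loser in matches:
        wins[winner] += 1
        key = (min(winner, loser), max(winner, loser))
        pair_counts[key] = pair_counts.get(key, 0) + 1

    p = [1.0] * n_players
    for _ in range(iters):
        new_p = []
        for i in range(n_players):
            # Sum over every pairing that involves player i.
            denom = sum(n / (p[a] + p[b])
                        for (a, b), n in pair_counts.items() if i in (a, b))
            new_p.append(wins[i] / denom if denom > 0 else p[i])
        total = sum(new_p)
        p = [x * n_players / total for x in new_p]  # fix the overall scale
    return p


def kendall_tau(x, y):
    """Kendall's tau-a between two score lists of equal length."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (len(x) * (len(x) - 1) // 2)


def single_drop_flips(matches, n_players):
    """Brute-force Drop audit (an assumption, not the influence approximation):
    return indices of matches whose removal changes the top-ranked player."""
    def top(p):
        return max(range(n_players), key=p.__getitem__)

    base_top = top(fit_bradley_terry(matches, n_players))
    flips = []
    for k in range(len(matches)):
        perturbed = matches[:k] + matches[k + 1:]
        if top(fit_bradley_terry(perturbed, n_players)) != base_top:
            flips.append(k)
    return flips
```

The refit costs one full Bradley-Terry fit per candidate match, which is the expense that an influence-based approximation is meant to avoid on real leaderboard data.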

2605.15760 2026-05-18 cs.CV

Learn2Splat: Extending the Horizon of Learned 3DGS Optimization

Naama Pearl, Stefano Esposito, Haofei Xu, Amit Peleg, Patricia Gschossmann, Lorenzo Porzi, Peter Kontschieder, Gerard Pons-Moll, Andreas Geiger

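AI summary This paper proposes a learned optimizer whose meta-learning scheme extends the optimization horizon, improving reconstruction quality and stability in both sparse-view and dense-view settings and achieving zero-shot generalization.

AI abstract (translated from Chinese)

3D Gaussian Splatting (3DGS) optimization typically uses standard optimizers (Adam, SGD). Although stable across diverse scenes, standard optimizers are general-purpose and not tailored to the structure of the problem. In particular, they produce independent parameter updates that cannot capture the structural and spatial relationships within a scene, leading to inefficient optimization and slow convergence. Recent work has introduced learned optimizers that predict correlated updates from inter-parameter and inter-Gaussian dependencies. However, these methods are trained for a fixed number of iterations and rely on manually scheduled learning rates to avoid degradation. This paper proposes a learned optimizer that avoids degradation over extended optimization horizons without auxiliary mechanisms. To this end, we propose a meta-learning scheme that extends the optimization horizon through a checkpoint buffer and an optimizer rollout strategy, combined with an architecture that encodes gradient scale information. Results show improved early novel view synthesis quality together with stability over long horizons and zero-shot generalization. To support our findings, we introduce the first unified framework for training and evaluating both learned and conventional optimizers in sparse-view and dense-view settings. Code and models will be released publicly. Our project page is available at https://naamapearl.github.io/learn2splat

English abstract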

3D Gaussian Splatting (3DGS) optimization is most commonly performed using standard optimizers (Adam, SGD). While stable across diverse scenes, standard optimizers are general-purpose and not tailored to the structure of the problem. In particular, they produce independent parameter updates that do not capture the structural and spatial relationships within a scene, leading to inefficient optimization and slow convergence. Recent works introduced learned optimizers that predict correlated updates informed by inter-parameter and inter-Gaussian dependencies. However, these methods are trained for a fixed number of optimization iterations and rely on manually scheduled learning rates to avoid degradation. In this paper, we introduce a learned optimizer for 3DGS that avoids degradation over extended optimization horizons without auxiliary mechanisms. To enable this, we propose a meta-learning scheme that extends the optimization horizon via a checkpoint buffer and an optimizer rollout strategy, combined with an architecture that encodes gradient scale information in its latent states. Results show improved early novel view synthesis quality while remaining stable over long horizons, with zero-shot generalization to unseen reconstruction settings. To support our findings, we introduce the first unified framework for training and evaluating both learned and conventional optimizers across sparse and dense view settings. Code and models will be released publicly. Our project page is available at https://naamapearl.github.io/learn2splat .

2605.15755 2026-05-18 cs.CV

Attribute-Grounded Selective Reasoning for Artwork Emotion Understanding with Multimodal Large Language Models

Cheng Zhang, Yuer Liu, Zhiyu Zhou, Hongxia Xie, Wen-Huang Cheng

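AI summary This paper proposes Attribute-Grounded Selective Reasoning for artwork emotion understanding with multimodal large language models, and introduces an attribute-bottleneck-guided framework that improves emotion prediction accuracy and yields more compact explanations.

AI abstract (translated from Chinese)

Multimodal large language models (MLLMs) can generate fluent explanations of artwork emotion, but they often suffer from attribute flooding: they enumerate many visible formal attributes without identifying which cues actually support the emotional judgment. This paper therefore formulates artwork emotion understanding as Attribute-Grounded Selective Reasoning (AGSR), in which predefined formal attributes serve as evidence units and only emotionally relevant attributes should enter the final interpretation. To make the problem measurable, we extend EmoArt, originally introduced at ACM MM 2025 as a resource of 132,664 artworks with content, formal-attribute, valence-arousal, and emotion annotations, by adding a human salience extension that covers 1,400 artworks and was annotated by 15 art-trained annotators. The extension provides instance-level supervision for distinguishing attributes that are merely present from those that are emotionally salient. We further propose FAB-G (Formal-Attribute Bottleneck-Guided reasoning), a supervised multi-agent framework that first predicts attribute-level salience and then restricts downstream emotional analysis to the retained cues. Experiments show that FAB-G yields consistent gains in emotion, arousal, and valence prediction, achieves stronger agreement with human-marked salient attributes under the Dice and Tversky metrics, and produces more compact final explanations than prompting-based baselines. Cross-dataset evaluation further suggests that attribute-grounded salience selection transfers beyond EmoArt's source distribution, while also revealing attribute-specific boundary cases. The dataset and project page are available at https://zhiliangzhang.github.io/EmoArt-130k/

English abstract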

Multimodal large language models (MLLMs) can produce fluent artwork emotion explanations, but they often suffer from attribute flooding: they enumerate many visible formal attributes without identifying which cues actually support the affective judgment. We therefore formulate artwork emotion understanding as Attribute-Grounded Selective Reasoning (AGSR), where predefined formal attributes serve as evidence units and only emotionally operative attributes should enter the final interpretation. To make this problem measurable, we extend EmoArt, originally introduced at ACM MM 2025 as a 132,664-artwork resource with content, formal-attribute, valence-arousal, and emotion annotations, by adding a 1,400-artwork human salience extension annotated by 15 art-trained annotators. This extension provides instance-level supervision for distinguishing attributes that are merely present from those that are emotionally salient. We further propose FAB-G (Formal-Attribute Bottleneck-Guided reasoning), a supervised multi-agent framework that first predicts attribute-level salience and then constrains downstream emotional analysis to the retained cues. Experiments show that FAB-G yields consistent gains in emotion, arousal, and valence prediction, achieves stronger agreement with human-marked salient attributes under Dice and Tversky metrics, and produces substantially more compact final explanations than prompting-based baselines. Cross-dataset evaluation further suggests that attribute-grounded salience selection transfers beyond the source distribution of EmoArt, while also revealing attribute-specific boundary cases. The dataset and project page are available at https://zhiliangzhang.github.io/EmoArt-130k/
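
The Dice and Tversky metrics mentioned above have standard set-based definitions, sketched below for comparing a predicted set of salient attributes with a human-marked set. The attribute names and the alpha/beta weights are illustrative assumptions; the abstract does not state which weights the paper uses.

```python
def dice(pred, gold):
    """Dice coefficient between two attribute sets: 2|A & B| / (|A| + |B|)."""
    pred, gold = set(pred), set(gold)
    if not pred and not gold:
        return 1.0  # two empty selections agree trivially
    return 2 * len(pred & gold) / (len(pred) + len(gold))


def tversky(pred, gold, alpha=0.5, beta=0.5):
    """Tversky index: TP / (TP + alpha * FP + beta * FN).

    alpha weights attributes selected but not marked by humans (FP);
    beta weights human-marked attributes that were missed (FN).
    With alpha = beta = 0.5 this equals the Dice coefficient.
    """
    pred, gold = set(pred), set(gold)
    tp = len(pred & gold)
    fp = len(pred - gold)
    fn = len(gold - pred)
    denom = tp + alpha * fp + beta * fn
    return tp / denom if denom else 1.0


# Hypothetical attribute names, for illustration only.
predicted = {"color", "line", "texture"}
human_marked = {"color", "line"}
print(dice(predicted, human_marked))                 # 0.8
print(tversky(predicted, human_marked, 0.5, 0.5))    # 0.8
```

Raising alpha above beta penalizes over-selection more heavily, which is one way such a metric could reflect the attribute-flooding failure mode.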

2605.15753 2026-05-18 cs.RO cs.CV

Hierarchical and Holistic Open-Vocabulary Functional 3D Scene Graphs for Indoor Spaces

Xinggang Hu, Chenyangguang Zhang, Alexandros Delitzas, Xiangkui Zhang, Marc Pollefeys, Francis Engelmann, Xiangyang Ji

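AI summary This paper proposes an open-vocabulary pipeline that combines 2D visual grounding with 3D graph optimization to address scene graph inference over small-scale, dense, and similar instances, using temporal graph optimization and global hierarchy shaping to improve the generation of functional 3D scene graphs for indoor spaces.

AI abstract (translated from Chinese)

Functional 3D scene graphs offer a flexible representation for 3D scene understanding and robotic manipulation, defined by object nodes, interactive elements, and functional relationship edges. However, their potential remains underexplored because existing benchmarks have limited coverage and previous pipelines are overly simple in design. This paper therefore extends benchmark coverage by introducing dense tabletop objects and explicit multi-level functional relationships. The extension introduces key challenges, including the handling of small-scale, dense, and similar instances, the lack of visual anchoring in relational reasoning, instance confusion during cross-frame fusion, and attribution uncertainty under dynamic viewpoints. To address these issues, we propose an open-vocabulary pipeline based on 2D visual grounding and 3D graph optimization. Specifically, we anchor fine-grained functional edges from 2D visual evidence and associate nodes across frames in 3D using multiple cues. In addition, edge association is formulated as temporal graph optimization that integrates evidence accumulation, entropy regularization, and temporal smoothing to robustly determine each node's functional connections. Finally, global hierarchy shaping recovers the hierarchical graph structure. Extensive experiments show that the proposed method reliably infers functional 3D scene graphs in challenging real-world scenes, further unlocking their potential for practical applications.

English abstract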

Functional 3D scene graphs offer a versatile and flexible representation for 3D scene understanding and robotic manipulation, defined by object nodes, interactive elements, and functional relationship edges. However, their potential remains underexplored due to the limited coverage of existing benchmarks and the overly straightforward design of previous pipelines, which primarily focus on large-scale furniture but lack hierarchical structures. Therefore, in this work, we extend the benchmark coverage by introducing dense tabletop objects and explicit multi-level functional relationships. This expansion introduces critical challenges involving small-scale, dense, and similar instances, along with a lack of visual anchoring in relational reasoning, instance confusion during cross-frame fusion, and attribution uncertainty under dynamic viewpoints. To address these issues, we propose an open-vocabulary pipeline based on 2D visual grounding and 3D graph optimization. Specifically, we anchor fine-grained functional edges from 2D visual evidence, and associate nodes across frames in 3D using multiple cues. Furthermore, edge association is formulated as temporal graph optimization, integrating evidence accumulation, entropy regularization, and temporal smoothing to robustly determine the functional connections of each node. Finally, global hierarchy shaping is performed to recover the hierarchical graph structure. Extensive experiments demonstrate that the proposed method can reliably infer functional 3D scene graphs in challenging real-world scenes, thereby further unlocking their potential for practical applications.

2605.15737 2026-05-18 cs.CV

BARRIER: Bounded Activation Regions for Robust Information Erasure

Jan Miksa, Patryk Krukowski, Przemysław Spurek, Dawid Damian Rymarczyk, Marcin Sendera

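AI summary BARRIER intervenes on the dynamic geometry of hidden-layer activations and uses interval arithmetic to protect neutral concepts, achieving stable information erasure while preserving the integrity of other representations.

AI abstract (translated from Chinese)

Machine unlearning faces a critical bottleneck. Traditional methods focus primarily on erasing targeted concepts but often cause unintended suppression of other important representations. To address this, BARRIER shifts the intervention from static model weights to the dynamic geometry of hidden-layer activations. Using interval arithmetic on an SVD-projected activation space, it encloses the target region in a bounding hypercube, ensuring rigorous protection of the retain distribution. This geometric construction turns knowledge preservation from an empirical heuristic into an optimization target with a probabilistic tail bound on functional drift. Crucially, this stability permits aggressive unlearning updates within the forget region. Experiments show that BARRIER matches state-of-the-art trade-offs across classifiers and diffusion models, maximizing targeted concept erasure while protecting the integrity of other representations. Code is available at https://github.com/OneAndZero24/BARRIER.

English abstract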

Machine unlearning has reached a critical bottleneck. As traditional weight-space interventions focus primarily on erasing targeted concepts, they often fail to prevent the unintended suppression of other significant representations. This leads to substantial collateral damage, with essential knowledge being forgotten, because these methods lack formal mathematical guarantees for the preservation of neutral concepts. To avoid degradation, they are frequently forced into conservative updates. We propose BARRIER (Bounded Activation Regions for Robust Information Erasure), a paradigm-shifting framework that shifts the locus of intervention from static model weights to the dynamic geometry of hidden-layer activations. Unlike existing methods, BARRIER employs Interval Arithmetic (IA) on SVD-based projections of the activation space to encapsulate the specific target region within a bounding hypercube. By driving unlearning updates exclusively within this forget interval and mathematically bounding the model response on the complement, we ensure rigorous protection of the retain distribution. This geometric construction transforms the preservation of knowledge from an empirical heuristic into a formal optimization target with a probabilistic tail bound on functional drift. Crucially, this stability permits highly aggressive unlearning updates within the forget region. Empirical evaluations demonstrate that BARRIER matches state-of-the-art trade-offs across classifiers and diffusion models, maximizing targeted concept erasure while safeguarding the integrity of all other representations. Our code is available at https://github.com/OneAndZero24/BARRIER.
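
The geometric idea of enclosing a forget region in an axis-aligned box within a projected activation space can be sketched as follows. This only illustrates the enclosure and a membership test: the projection basis is passed in directly (the paper derives it via SVD), the margin parameter and all function names are hypothetical, and the unlearning update and the tail bound are not modeled.

```python
def project(basis, x):
    """Coordinates of activation x along each row of the projection basis."""
    return [sum(b * v for b, v in zip(row, x)) for row in basis]


def bounding_box(basis, forget_activations, margin=0.0):
    """Per-coordinate [low, high] intervals covering the projected forget set.

    margin is a hypothetical slack added on each side of every interval.
    """
    coords = [project(basis, a) for a in forget_activations]
    dims = range(len(basis))
    lows = [min(c[d] for c in coords) - margin for d in dims]
    highs = [max(c[d] for c in coords) + margin for d in dims]
    return lows, highs


def inside(basis, box, x):
    """True if the projection of x falls within every interval of the box."""
    lows, highs = box
    return all(lo <= z <= hi
               for lo, z, hi in zip(lows, project(basis, x), highs))


# Toy example: project 3-d activations onto their first two axes.
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
box = bounding_box(basis, [[1.0, 2.0, 9.0], [3.0, 1.0, -4.0]])
print(inside(basis, box, [2.0, 1.5, 100.0]))  # True: within both intervals
print(inside(basis, box, [0.0, 1.5, 0.0]))    # False: first coordinate too low
```

An update rule that acts only on inputs for which `inside` is true would leave activations outside the box untouched by construction, which is the intuition behind restricting updates to a forget interval.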

2605.15736 2026-05-18 cs.CV cs.AI

BiomedAP: A Vision-Informed Dual-Anchor Framework with Gated Cross-Modal Fusion for Robust Medical Vision-Language Adaptation

Huanyang Tong, Kai Liu, Fangjun Kuang, Huiling Chen

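AI summary BiomedAP uses gated cross-modal fusion and a dual-anchor constraint to make medical vision-language models more robust to prompt variations; experiments show it outperforms baseline methods on multiple benchmarks.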

Comments CVPR2026 Workshop

English abstract

Biomedical Vision-Language Models (VLMs) have shown remarkable promise in few-shot medical diagnosis but face a critical bottleneck: fragility to prompt variations. Existing adaptation frameworks typically optimize visual and textual prompts as independent streams, relying on ideal "Golden Prompts". In clinical reality, where descriptions are often noisy and heterogeneous, this modality isolation leads to unstable cross-modal alignment. To address this, we propose BiomedAP, a vision-informed dual-anchor framework with gated cross-modal fusion. BiomedAP enforces synergistic alignment through two mechanisms: (1) Gated Cross-Modal Fusion, which enables layer-wise interaction between modalities, acting as a dynamic noise regulator to suppress irrelevant textual cues; and (2) a Dual-Anchor Constraint that regularizes learnable prompts toward stable semantic centroids derived from both expert templates (High Anchors) and few-shot visual prototypes (Low Anchors). Extensive experiments across 11 benchmarks demonstrate that BiomedAP consistently surpasses baselines, achieving competitive few-shot accuracy and markedly enhanced robustness under prompt perturbations. Our code is available at: https://github.com/tongdiedie/BiomedAP. Keywords: Vision-Language Models; Prompt Learning; Parameter-Efficient Fine-Tuning; Few-shot Learning

2605.15734 2026-05-18 cs.AI

Can We Trust AI-Inferred User States. A Psychometric Framework for Validating the Reliability of Users States Classification by LLMs in Operational Environments

Izabella Krzeminska, Michal Butkiewicz, Ewa Komkowska

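AI summary This paper empirically tests the assumption behind using large language models to assess user states, examines the psychometric reliability of AI measures, and proposes a replicable evaluation framework to support more reliable AI design of adaptive systems.

Comments Full survey article with data tables for further possible replicability and comparison

AI abstract (translated from Chinese)

The use of large language models to assess user states in conversational and adaptive systems rests on the assumption that the metrics used for such assessment are stable and interpretable at the level of individual scores. This paper tests that assumption empirically, focusing on the psychometric reliability of artificial intelligence (AI) measures. The study used replication evaluation procedures to assess the repeatability of a broad set of metrics across three bimodal large language models (GPT-4o audio, Gemini 2.0 Flash, Gemini 2.5 Flash). The analyses cover both individual-score reliability and aggregated reliability, allowing us to distinguish metrics that may be useful for real-time adaptation from those that retain value only in aggregated analyses. The results show that metric reliability cannot be treated as a default property in interpretive domains. Instability at the level of individual scores rules out interpreting those scores as indicators of user state in real-time adaptive systems, even when the metrics are stable after aggregation. At the same time, the study indicates that individually unstable metrics can retain analytical utility in post-hoc studies that identify interaction rules and their relationships with user experience parameters such as satisfaction, trust, and engagement. Besides quantifying the severity of the problem (only 31 of 213 metrics met the criteria), the main contribution of this paper is a replicable evaluation framework that makes measurable assessment of metric applicability possible. This approach supports more responsible AI design, in which interpreting results requires explicit validation of reliability and monitoring for violations over time.

English abstract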

The use of large language models to assess user states in conversational and adaptive systems is based on the assumption that the metrics used for such assessment are stable and interpretable at the level of individual scores. This paper empirically tests this assumption, focusing on the psychometric reliability of artificial intelligence (AI) measures of user states. This study employed replication evaluation procedures to assess the repeatability of a broad set of metrics across three different bimodal large language models (GPT-4o audio, Gemini 2.0 Flash, Gemini 2.5 Flash). Analyses include both individual score reliability and aggregated reliability, allowing us to distinguish metrics potentially useful for real-time adaptation from those that retain their value only in aggregated analyses. The results demonstrate that metric reliability cannot be considered a default property in interpretive domains. The lack of stability at the level of individual scores precludes the interpretation of such scores as indicators of user state in real-time adaptive systems, even if these metrics demonstrate stability after aggregation. At the same time, the study indicates that individually unstable metrics can retain analytical utility in post-hoc studies, identifying rules governing interactions and their relationships with user experience parameters such as satisfaction, trust, and engagement. The main contribution of this work, besides quantifying the severity of the problem (only 31 of 213 metrics met the criteria), is the proposal of a replicable evaluation framework, enabling measurable evaluations of metric applicability. This approach supports more responsible AI design of adaptive systems, in which the interpretation of results requires explicit validation of reliability and monitoring for violations over time.

2605.15728 2026-05-18 cs.CV cs.AI

DecomPose: Disentangling Cross-Category Optimization Contention for Category-Level 6D Object Pose Estimation

Yifan Gao, Lu Zou, Zhangjin Huang, Guoping Wang

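AI summary This paper proposes DecomPose, a framework that uses a data-driven difficulty proxy and an asymmetric branching strategy to disentangle cross-category optimization contention and improve category-level 6D pose estimation.

AI abstract (translated from Chinese)

Category-level 6D object pose estimation is usually modeled as a multi-category joint learning problem, but geometric heterogeneity across categories entangles incompatible optimization signals in shared modules, producing gradient conflicts and negative transfer. To address this, we first introduce gradient-based diagnostics to quantify module-level cross-category contention. Building on the diagnostic results, we propose the DecomPose framework, which mitigates optimization contention through: (1) difficulty-aware gradient decoupling, which groups categories using a data-driven difficulty proxy and routes each instance to a group-specific correspondence branch to isolate incompatible updates; and (2) stability-driven asymmetric branching, which assigns higher-capacity branches to structurally simple categories as stable optimization anchors while constraining complex categories with lightweight branches to suppress noisy updates and alleviate negative transfer. Extensive experiments on REAL275, CAMERA25, and HouseCat6D show that DecomPose effectively reduces cross-category optimization contention and achieves superior pose estimation performance across multiple benchmarks.

English abstract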

Category-level 6D object pose estimation is typically formulated as a multi-category joint learning problem with fully shared model parameters. However, pronounced geometric heterogeneity across categories entangles incompatible optimization signals in shared modules, resulting in gradient conflicts and negative transfer during training. To address this challenge, we first introduce gradient-based diagnostics to quantify module-level cross-category contention. Building on these diagnostics, we propose DecomPose, a difficulty-aware decomposition framework that mitigates optimization contention via: (1) difficulty-aware gradient decoupling, which groups categories using a data-driven difficulty proxy and routes each instance to a group-specific correspondence branch to isolate incompatible updates; and (2) stability-driven asymmetric branching, which assigns higher-capacity branches to structurally simple categories as stable optimization anchors while constraining complex categories with lightweight branches to suppress noisy updates and alleviate negative transfer. Extensive experiments on REAL275, CAMERA25, and HouseCat6D demonstrate that DecomPose effectively reduces cross-category optimization contention and delivers superior pose estimation performance across multiple benchmarks.
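
One common way to quantify gradient conflict in a shared module is the cosine similarity between per-category gradients, with a negative value indicating that two categories pull the shared parameters in opposing directions. The abstract does not say which diagnostic the paper uses, so the sketch below is an assumption; the category names and gradient values are made up for illustration.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def conflict_rate(grads_by_category):
    """Fraction of category pairs whose gradients have negative cosine."""
    pairs = list(combinations(grads_by_category, 2))
    conflicts = sum(
        1 for a, b in pairs
        if cosine(grads_by_category[a], grads_by_category[b]) < 0
    )
    return conflicts / len(pairs)


# Hypothetical per-category gradients for one shared module.
grads = {
    "mug": [1.0, 0.0],
    "laptop": [-1.0, 0.1],
    "bowl": [1.0, 0.2],
}
print(conflict_rate(grads))  # 2 of the 3 pairs conflict
```

Computing this per module, rather than over the whole network, is what would make such a diagnostic "module-level".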

2605.15726 2026-05-18 cs.AI cs.CL

Nudging Beyond the Comfort Zone: Efficient Strategy-Guided Exploration for RLVR

Chanuk Lee, Sangwoo Park, Minki Kang, Sung Ju Hwang

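AI summary This paper proposes NudgeRL, a framework that uses Strategy Nudging for structured, diversity-driven exploration, improving RLVR performance on math benchmarks more efficiently than standard GRPO and oracle-guided methods.

Comments 28 pages, 7 figures

AI abstract (translated from Chinese)

Reinforcement learning with verifiable rewards (RLVR) has become a scalable paradigm for improving the reasoning capabilities of large language models. However, its effectiveness is limited by exploration: the policy can only improve on trajectories it has already sampled. Increasing the number of rollouts alleviates this issue but is computationally expensive, and existing methods offer limited control over what is explored. This paper proposes the NudgeRL framework, which introduces Strategy Nudging: each rollout is conditioned on a lightweight strategy-level context to induce diverse reasoning trajectories without relying on expensive oracle supervision. To learn from such structured exploration, we further propose a unified objective that decomposes the reward signal into inter-context and intra-context components and incorporates a distillation objective to transfer the discovered behaviors back to the base policy. Experiments show that, on average across five challenging math benchmarks, NudgeRL outperforms the oracle-guided RL baseline and also outperforms standard GRPO run with up to 8 times larger rollout budgets. These results indicate that structured, context-driven exploration can serve as an efficient and scalable alternative to brute-force rollout scaling and to methods based on privileged information. Code is available at https://github.com/tally0818/NudgeRL.

English abstract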

Reinforcement learning with verifiable rewards (RLVR) has emerged as a scalable paradigm for improving the reasoning capabilities of large language models. However, its effectiveness is fundamentally limited by exploration: the policy can only improve on trajectories it has already sampled. While increasing the number of rollouts alleviates this issue, such brute-force scaling is computationally expensive, and existing approaches that modify the optimization objective provide limited control over what is explored. In this work, we propose NudgeRL, a framework for structured and diversity-driven exploration in RLVR. Our approach introduces Strategy Nudging, which conditions each rollout on lightweight, strategy-level contexts to induce diverse reasoning trajectories without relying on expensive oracle supervision. To effectively learn from such structured exploration, we further propose a unified objective, which decomposes the reward signal into inter- and intra-context components and incorporates a distillation objective to transfer discovered behaviors back to the base policy. Empirically, NudgeRL outperforms standard GRPO with up to 8 times larger rollout budgets, while also outperforming the oracle-guided RL baseline on average across five challenging math benchmarks. These results demonstrate that structured, context-driven exploration can serve as an efficient and scalable alternative to both brute-force rollout scaling and feasibility-oriented methods based on privileged information. Our code is available at https://github.com/tally0818/NudgeRL.

2605.15725 2026-05-18 cs.CV cs.AI cs.RO

DiLA: Disentangled Latent Action World Models

Tianqiu Zhang, Muyang Lyu, Yufan Zhang, Fang Fang, Si Wu

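AI summary DiLA resolves the trade-off between action abstraction and generation fidelity through content-structure disentanglement, enabling high-quality video generation and action transfer.

Comments Project Page: http://disentangled-latent-action-world-models.github.io

AI abstract (translated from Chinese)

Latent Action Models (LAMs) learn world models by inferring abstract actions between consecutive frames, but they face a trade-off between action abstraction and generation fidelity. Existing methods usually address this with two-stage training or by restricting predictions to optical flow. This paper proposes DiLA, a disentangled latent action world model that resolves the trade-off through content-structure disentanglement. Our key finding is that disentanglement and latent action learning co-evolve: the predictive bottleneck in latent action learning drives disentanglement, forcing the model to compress spatial layouts into a structure pathway while offloading visual details to a separate content pathway for generation. This synergy yields a continuous, semantically structured latent action space without sacrificing generation quality. DiLA performs strongly in video generation quality, action transfer, visual planning, and manifold interpretability. These findings establish DiLA as a unified framework that achieves both high-level action abstraction and high-fidelity generation, advancing the frontier of self-supervised world model learning.

English abstract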

Latent Action Models (LAMs) enable the learning of world models from unlabeled video by inferring abstract actions between consecutive frames. However, LAMs face a fundamental trade-off between action abstraction and generation fidelity. Existing methods typically circumvent this issue by using two-stage training with pre-trained world models or by limiting predictions to optical flow. In this paper, we introduce DiLA, a novel Disentangled Latent Action world model that aims to resolve this trade-off via content-structure disentanglement. Our key insight is that disentanglement and latent action learning are co-evolving: the predictive bottleneck inherent in latent action learning serves as a driving force for disentanglement, compelling the model to distill spatial layouts into the structure pathway while offloading visual details to a separate content pathway for generation. This synergy yields a continuous, semantically structured latent action space without compromising generative quality. DiLA achieves superior results in video generation quality, action transfer, visual planning, and manifold interpretability. These findings establish DiLA as a unified framework that simultaneously achieves high-level action abstraction and high-fidelity generation, advancing the frontier of self-supervised world model learning.

2605.15723 2026-05-18 cs.LG cs.CV

GOMA: Toward Structure-Driven Multimodal Alignment from a Graph Signal Smoothing Perspective

Xu Wang, Xunkai Li, Yinlin Zhu, Rong-Hua Li, Guoren Wang

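AI summary GOMA uses a unified design to address topological barriers, smoothing control, and information preservation in multimodal alignment, achieving state-of-the-art or tied retrieval performance on seven multimodal graph benchmarks while remaining stable.

AI abstract (translated from Chinese)

Multimodal alignment is usually learned from isolated image-text pairs with CLIP-style dual encoders, ignoring the relational context among entities. In multimodal attributed graphs (MAGs), nodes carry multimodal attributes and edges encode corpus structure, providing a natural setting for refining frozen vision-language embeddings. This refinement is challenging: visual, textual, and cross-modal relations often induce different neighborhood geometries, and unrestricted graph propagation can quickly over-smooth retrieval representations. Using graph context effectively requires simultaneously breaking modality-specific topological barriers, controlling the smoothing regime, and preserving informative smoothing before semantic boundaries collapse. We propose Graph-Optimized Multimodal Alignment (GOMA), a structure-driven post-alignment framework that treats frozen multimodal embeddings as graph signals and addresses these requirements through a unified retrieval-oriented design. GOMA decouples three key design choices: where messages should flow, how multimodal evidence should propagate, and which smoothing depth should be retained. Concretely, it learns modality-aware propagation operators, performs finite-step coupled smoothing without diagonal cross-modal shortcuts, and adaptively reads out node-specific smoothing trajectories to preserve useful smoothing before collapse. All experiments follow a transductive MAG retrieval protocol in which the graph serves only as unlabeled context and diagonal self-pair edges are removed. On seven MAG benchmarks, GOMA achieves the best or tied-best retrieval performance and is substantially more stable than the strongest graph competitor, showing that MAG structure can serve as an effective post-encoder for frozen multimodal embeddings.

English abstract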

Multimodal alignment is commonly learned from isolated image-text pairs via CLIP-style dual encoders, leaving the relational context among entities largely unused. Multimodal attributed graphs (MAGs), where nodes carry multimodal attributes and edges encode corpus structure, provide a natural setting for refining frozen vision-language embeddings. This refinement is challenging: visual, textual, and cross-modal relations often induce different neighborhood geometries, while unrestricted graph propagation can quickly over-smooth retrieval representations. Effectively leveraging graph context therefore requires simultaneously breaking modality-specific topological barriers, controlling the smoothing regime, and preserving informative smoothing before semantic boundaries collapse. We propose Graph-Optimized Multimodal Alignment (GOMA), a structure-driven post-alignment framework that views frozen multimodal embeddings as graph signals and addresses these requirements through a unified retrieval-oriented design. GOMA decouples three key design choices: where messages should flow, how multimodal evidence should propagate, and which smoothing depth should be retained. Concretely, it learns modality-aware propagation operators, performs finite-step coupled smoothing without diagonal cross-modal shortcuts, and adaptively reads out node-specific smoothing trajectories to preserve useful smoothing before collapse. All experiments follow a transductive MAG retrieval protocol where the graph serves only as unlabeled context and diagonal self-pair edges are removed. On seven MAG benchmarks, GOMA achieves state-of-the-art or tied state-of-the-art retrieval and remains substantially more stable than the strongest graph competitor, demonstrating that MAG structure can serve as an effective post-encoder for frozen multimodal embeddings.
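
The over-smoothing risk described above can be seen in a few lines. The sketch below applies a fixed neighborhood-averaging operator for a finite number of steps and keeps the whole trajectory, so a later readout could choose a depth per node. GOMA learns modality-aware operators and an adaptive readout; the fixed operator, the function names, and the toy graph here are simplifying assumptions.

```python
def smoothing_trajectory(adj, feats, steps):
    """Return [X_0, X_1, ..., X_steps], where each step replaces a node's
    features with the mean over itself and its neighbors."""
    n, dim = len(feats), len(feats[0])
    trajectory = [[row[:] for row in feats]]
    for _ in range(steps):
        prev = trajectory[-1]
        nxt = []
        for i in range(n):
            nbrs = [i] + [j for j in range(n) if j != i and adj[i][j]]
            nxt.append([sum(prev[j][d] for j in nbrs) / len(nbrs)
                        for d in range(dim)])
        trajectory.append(nxt)
    return trajectory


def spread(feats):
    """Range of the first feature across nodes: a crude collapse indicator."""
    column = [row[0] for row in feats]
    return max(column) - min(column)


# Path graph 0 - 1 - 2 with one scalar feature per node.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
traj = smoothing_trajectory(adj, [[0.0], [3.0], [6.0]], steps=2)
print([spread(x) for x in traj])  # [6.0, 3.0, 1.5]
```

The spread halves at every step in this toy case, which is why a finite step count and a depth-aware readout matter: nodes become harder to tell apart the longer propagation runs.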

2605.15722 2026-05-18 cs.LG cs.AI cs.CV eess.SP

Bidirectional Fusion Guided by Cardiac Patterns for Semi-Supervised ECG Segmentation

Jeonghwa Lim, Minje Park, Sunghoon Joo

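AI summary This paper proposes CardioMix, a framework that improves ECG segmentation with a bidirectional CutMix strategy guided by cardiac patterns; experiments show it outperforms existing methods across diverse datasets and labeled ratios.

Comments 11 pages, 6 figures, 6 tables

AI abstract (translated from Chinese)

Accurate ECG delineation, the segmentation of meaningful waveform features, is crucial for cardiovascular diagnosis. However, the scarcity of annotated data poses a major challenge for training deep learning models. Conventional semi-supervised semantic segmentation (SemiSeg) methods focus mainly on consistency over unlabeled data and underuse the information exchange possible between the labeled and unlabeled sets. To address this, we introduce CardioMix, an ECG segmentation framework built on a bidirectional CutMix strategy guided by cardiac patterns. The method enriches the labeled set with realistic variations from unlabeled data while applying stronger supervisory signals to the unlabeled set, and it ensures that all augmented samples remain physiologically meaningful. The framework is designed as a plug-and-play module that is highly compatible with various SemiSeg algorithms. Extensive experiments on SemiSegECG, a public multi-dataset benchmark, show that CardioMix outperforms existing CutMix-based fusion strategies across diverse datasets and labeled ratios.

English abstract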

Accurate delineation of electrocardiogram (ECG), the segmentation of meaningful waveform features, is crucial for cardiovascular diagnostics. However, the scarcity of annotated data poses a significant challenge for training deep learning models. Conventional semi-supervised semantic segmentation (SemiSeg) methods primarily focus on consistency from unlabeled data, underutilizing the information exchange possible between labeled and unlabeled sets. To address this, we introduce CardioMix, a framework built on a bidirectional CutMix strategy guided by cardiac patterns for ECG segmentation. This approach enriches the labeled set with realistic variations from unlabeled data while simultaneously applying stronger supervisory signals to the unlabeled set, as the cardiac pattern-guided mixing ensures all augmented samples remain physiologically meaningful. Our framework is designed as a plug-and-play module, demonstrating high compatibility with various SemiSeg algorithms. Extensive experiments on SemiSegECG, a public multi-dataset benchmark for ECG delineation, demonstrate that CardioMix consistently outperforms existing CutMix-based fusion strategies across diverse datasets and labeled ratios as a plug-and-play module compatible with various SemiSeg algorithms.
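
A bidirectional CutMix on 1D signals can be sketched as swapping the same segment in both directions between a labeled and an unlabeled sample, carrying the labels (or pseudo-labels) along with the signal. How CardioMix picks the segment from cardiac patterns is not described in the abstract, so the `start` and `end` indices below are simply passed in as a stand-in for boundaries a pattern detector might supply; all names are hypothetical.

```python
def cutmix_1d(dst_signal, dst_labels, src_signal, src_labels, start, end):
    """Paste src[start:end] into a copy of dst, for both signal and labels."""
    mixed_signal = dst_signal[:start] + src_signal[start:end] + dst_signal[end:]
    mixed_labels = dst_labels[:start] + src_labels[start:end] + dst_labels[end:]
    return mixed_signal, mixed_labels


def bidirectional_mix(lab_signal, lab_mask, unl_signal, unl_pseudo, start, end):
    """Mix in both directions over the same segment.

    Returns (unlabeled pasted into labeled, labeled pasted into unlabeled).
    unl_pseudo holds pseudo-labels, e.g. from a teacher model (an assumption).
    """
    into_labeled = cutmix_1d(lab_signal, lab_mask,
                             unl_signal, unl_pseudo, start, end)
    into_unlabeled = cutmix_1d(unl_signal, unl_pseudo,
                               lab_signal, lab_mask, start, end)
    return into_labeled, into_unlabeled


# Toy signals; mask values are per-sample class ids.
into_lab, into_unl = bidirectional_mix(
    [1, 1, 1, 1], [0, 1, 1, 0],      # labeled signal and ground-truth mask
    [9, 9, 9, 9], [2, 2, 0, 0],      # unlabeled signal and pseudo-labels
    start=1, end=3,
)
print(into_lab)  # ([1, 9, 9, 1], [0, 2, 0, 0])
print(into_unl)  # ([9, 1, 1, 9], [2, 1, 1, 0])
```

In the second output the pasted segment carries ground-truth labels into the unlabeled sample, which is one reading of the "stronger supervisory signals" the abstract mentions.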

2605.15721 2026-05-18 cs.CL

Contexting as Recommendation: Evolutionary Collaborative Filtering for Context Engineering

Jiachen Zhu, Zhuoying Ou, Congmin Zheng, Yuxiang Chen, Zeyu Zheng, Rong Shan, Lingyu Yang, Lionel Z. Wang, Weiwen Liu, Yong Yu, Weinan Zhang, Jianghao Lin

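AI summary This paper frames context engineering as a recommendation problem and proposes the Neural Collaborative Context Engineering framework, which performs dynamic instance-wise routing to improve the personalized performance of LLM context engineering.

AI abstract (translated from Chinese)

Large language models (LLMs) are highly sensitive to their input contexts, which has driven the development of automated context engineering. However, most existing methods treat it as a global search problem, seeking a single context strategy that maximizes average performance over a dataset. This paper formulates context engineering as a recommendation problem and introduces the Neural Collaborative Context Engineering (NCCE) framework, which shifts optimization from static global search to dynamic instance-wise routing. NCCE first builds a diverse catalog of anchor contexts and then applies a novel Context-CF Co-Evolution mechanism. This stage establishes a synergistic feedback loop: a lightweight Neural Collaborative Filtering (NCF) model learns instance-context preferences to guide the generation of specialized context variants, while newly evaluated contexts continually refine the NCF model's understanding of latent preferences. At inference time, the trained NCF model acts as a context router, dynamically assigning the most suitable context strategy to each unseen instance. Theoretical proofs and comprehensive experiments show that, by matching individual inputs with their optimal contexts, NCCE significantly improves task accuracy, highlighting the importance of personalization in LLM context engineering.

English abstract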

Large Language Models (LLMs) are highly sensitive to their input contexts, motivating the development of automated context engineering. However, existing methods predominantly treat this as a global search problem, seeking a single context strategy that maximizes average performance across a dataset. This restrictive assumption overlooks the fact that different inputs often require distinct guidance, leaving substantial instance-level performance gains untapped. In this paper, we propose a paradigm shift by formulating context engineering as a recommendation problem. We introduce Neural Collaborative Context Engineering (NCCE), a framework that transitions optimization from a static global search to dynamic, instance-wise routing. NCCE first bootstraps a diverse catalog of anchor contexts and then employs a novel Context-CF Co-Evolution mechanism. This stage establishes a synergistic feedback loop: a lightweight Neural Collaborative Filtering (NCF) model learns instance-context preferences to guide the generation of specialized context variants, while the newly evaluated contexts continuously refine the NCF model's understanding of latent preferences. At inference time, the trained NCF model acts as a context router, dynamically assigning the most suitable context strategy to each unseen instance. Theoretical proofs and comprehensive experiments demonstrate that by matching individual inputs with their optimal contexts, NCCE significantly improves task accuracy, highlighting the critical importance of personalization in LLM context engineering.

2605.15720 2026-05-18 cs.CV cs.LG

Semi-MedRef: Semi-Supervised Medical Referring Image Segmentation with Cross-Modal Alignment

Yuchen Li, Zhen Zhao, Yi Liu, Luping Zhou

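AI summary This paper proposes Semi-MedRef, a framework whose three components maintain consistency between medical images and positional language; experiments show it outperforms other methods under low-label conditions.

AI abstract (translated from Chinese)

Medical referring image segmentation (MRIS) requires pixel-level masks aligned with textual descriptions of anatomical locations, which makes annotation costly in low-label regimes. Semi-supervised learning (SSL) can ease this burden by exploiting unlabeled data, but its success depends on maintaining reliable image-text alignment under perturbation. Most existing SSL methods use independent or simple multimodal perturbations (such as left-right flips) and do not fully address cross-modal alignment under strong augmentation; CutMix, although highly effective in single-modal SSL, remains underexplored in multimodal settings because it disrupts image-text consistency. This paper proposes Semi-MedRef, a teacher-student SSL framework with three alignment-preserving components: T-PatchMix, a cross-modal CutMix-style augmentation that synchronizes patch mixing with referring expressions through position-constrained and probability-driven rules; PosAug, a position-aware text augmentation that masks or fuzzes anatomical phrases; and ITCL, a position-guided image-text contrastive learning module that uses positional pseudo-labels to construct soft anatomical positives and strengthen medically grounded cross-modal alignment. Experiments on QaTa-COV19 and MosMedData+ show that Semi-MedRef outperforms both fully supervised and semi-supervised baselines across all label regimes.

English abstract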

Medical referring image segmentation (MRIS) requires pixel-level masks aligned with textual descriptions of anatomical locations, making annotation costly in low-label regimes. Semi-supervised learning (SSL) can mitigate this burden by leveraging unlabeled data, but its success hinges on maintaining reliable image-text alignment under perturbations. Most existing SSL-based referred segmentation methods use either independent or simplistic multi-modal perturbations (e.g., left-right flips), without fully addressing cross-modal alignment under strong augmentation, while CutMix, highly effective in single-modal SSL, remains underexplored in multi-modal settings due to its tendency to disrupt image-text coherence. We propose Semi-MedRef, a teacher-student SSL framework designed to explicitly maintain consistency between medical images and positional language through three alignment-preserving components: T-PatchMix, a cross-modal CutMix-style augmentation that synchronizes patch mixing with referring expressions via position-constrained and probability-driven rules; PosAug, a position-aware text augmentation that masks or fuzzes anatomical phrases; and ITCL, a position-guided image-text contrastive learning module, which leverages positional pseudo-labels to construct soft anatomical positives and strengthen medically grounded cross-modal alignment. Experiments on QaTa-COV19 and MosMedData+ demonstrate that Semi-MedRef consistently outperforms both fully supervised and semi-supervised baselines across all label regimes.

2605.15713 2026-05-18 cs.RO cs.AI

Learning Dynamic Pick-and-Place for a Legged Manipulator

面向足式机械臂的动态抓取与放置学习

Moonkyu Jung, Jiseong Lee, Zhengmao He, Donghoon Youm, Juhyeok Mun, HyeongJun Kim, Hyunsik Oh, Donghyuk Choi, Jungwoo Hur, Jie Song, Jemin Hwangbo

AI总结 本文提出一种分层强化学习框架,用于四足机械臂的动态抓取与放置任务,通过模拟和现实实验验证了其在不同负载和工作空间下的高成功率。

Comments Accepted to IEEE Robotics and Automation Letters 2026

详情
Journal ref
IEEE Robotics and Automation Letters, vol. 11, no. 6, pp. 7652-7659, 2026
AI中文摘要

四足机械臂通过结合敏捷移动与多功能臂控制,扩展了机器人静态操作的能力。然而,实现精确操作的同时保持协调移动仍是一个重大挑战。本文提出了一种分层强化学习框架,用于四足机械臂的动态抓取与放置任务。该框架包含一个显式的质量估计模块,能够实现对不同重量物体的自适应全身控制。在模拟中,系统在负载达2.3kg时的成功率高达86.05%。通过六个代表性场景的现实实验,验证了该方法在不同物体物理属性(尺寸和质量)和任务高度下的有效性。在垂直工作空间从地面到1.1米高桌面的范围内,系统在负载达1.3kg时的平均成功率为73.3%,平均执行时间为4.06秒。与以往处理轻质物体并执行慢速分步操作的方法不同,本文的方法利用移动和操作的同时进行,实现了动态连续执行。这些结果展示了四足移动机械臂在适应性、全身抓取与放置任务中处理更重负载和扩展工作空间的潜力。

英文摘要

Legged manipulators extend robotic capabilities beyond static manipulation by integrating agile locomotion with versatile arm control. However, achieving precise manipulation while maintaining coordinated locomotion remains a major challenge. This work presents a hierarchical reinforcement learning framework for dynamic pick-and-place tasks using a quadruped equipped with a 6-DOF robotic arm. The framework incorporates an explicit mass estimation module enabling adaptive whole-body control for objects with varying weights. In simulation, the system achieves an 86.05% success rate with payloads up to 2.3 kg. The approach is further validated through real-world experiments across six representative scenarios with controlled variations in object physical properties (size and mass) and task heights. Specifically, within a wide vertical workspace ranging from ground level to 1.1 m-high tabletops, the system demonstrates an average success rate of 73.3% for payloads up to 1.3 kg, with an average execution time of 4.06 s. Unlike prior works that handle lightweight objects and execute pick-and-place with slow, piecewise motions, the proposed framework exploits concurrent locomotion and manipulation for dynamic, continuous execution. These results demonstrate the potential of quadrupedal mobile manipulators for adaptive, whole-body pick-and-place with heavier payloads and extended workspaces.

2605.15711 2026-05-18 cs.CV

EntropyScan: Towards Model-level Backdoor Detection in LVLMs via Visual Attention Entropy

EntropyScan:基于视觉注意力熵的LVLM模型级后门检测

Xuanyu Ge, Zhongqi Wang, Jie Zhang, Shiguang Shan, Xilin Chen

AI总结 本文提出EntropyScan,一种轻量且不依赖触发器的模型级后门检测方法,通过量化视觉注意力分布的结构扭曲来检测后门模型,实验显示其在两个LVLM架构和三种高级攻击场景中达到98.5%的F1分数和96.6%的AUC。

Comments 20 pages, 6 figures, 8 tables

详情
AI中文摘要

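大型视觉语言模型(LVLMs)在各类任务中展现出卓越能力,但仍易受后门攻击。现有防御方法主要集中于样本级防御,依赖对训练数据或触发器的了解;而判断给定模型是否被植入后门仍是一个关键但尚未探索的任务。为填补这一空白,我们提出EntropyScan,一种轻量且不依赖触发器的LVLM模型级后门检测方法。我们首先观察到,后门注入会破坏跨模态对齐,导致模型在良性样本上的视觉注意力分配出现明显的结构异常。基于这一观察,EntropyScan通过量化此类注意力偏差来检测后门模型:它从大语言模型(LLM)的初始层提取视觉注意力分布,并利用Tsallis熵刻画这些结构扭曲;再在少量良性样本上进行参考锚定的Z-score归一化,从而有效识别后门模型。在两种LVLM架构和三种高级攻击场景上的大量实验表明,EntropyScan的平均F1分数达到98.5%,AUC达到96.6%。代码即将公开。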

英文摘要

Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across various tasks, yet they remain vulnerable to backdoor attacks. Existing defense methods predominantly focus on sample-level defense, which relies on knowledge of the training data or triggers. However, identifying whether a given model is backdoored remains a critical but unexplored task. To fill this gap, we propose EntropyScan, a lightweight and trigger-agnostic method for model-level backdoor detection in LVLMs. We first observe that backdoor injection disrupts cross-modal alignment, resulting in pronounced structural anomalies in visual attention allocation on benign samples. Based on this insight, EntropyScan detects backdoored models by quantifying such attention deviations. Specifically, it extracts visual attention distributions from the initial layers of the Large Language Model (LLM) and applies Tsallis entropy to capture these structural distortions. By employing a reference-anchored Z-score normalization on a small set of benign samples, it effectively identifies the backdoored model. Extensive experiments across two LVLM architectures and three advanced attack scenarios show that EntropyScan achieves an average F1 score of 98.5% and an AUC of 96.6%. Our code will be publicly available soon.
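The two quantities named in the abstract, Tsallis entropy over a visual attention distribution and a reference-anchored Z-score, can be sketched as follows. This is only an illustrative sketch: the entropic index `q`, the per-model aggregation (a plain mean), the random stand-in attention data, and the threshold of 3 are assumptions, not details taken from the paper.

```python
import numpy as np

def tsallis_entropy(p, q=2.0):
    """Tsallis entropy S_q(p) = (1 - sum_i p_i^q) / (q - 1), for q != 1."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()  # normalize attention weights into a distribution
    return (1.0 - np.sum(p ** q)) / (q - 1.0)

def model_score(attn_maps, q=2.0):
    """Mean Tsallis entropy over benign samples (one attention row per sample)."""
    return float(np.mean([tsallis_entropy(a, q) for a in attn_maps]))

def reference_z_score(score, reference_scores):
    """Z-score of a suspect model's score against scores of clean reference models."""
    ref = np.asarray(reference_scores, dtype=float)
    return (score - ref.mean()) / ref.std(ddof=1)

rng = np.random.default_rng(0)
# Hypothetical data: 16 benign samples x 64 visual tokens per model.
clean_refs = [model_score(rng.dirichlet(np.ones(64), size=16)) for _ in range(8)]
# A suspect model whose attention is far more concentrated than the references.
suspect = model_score(rng.dirichlet(np.full(64, 0.2), size=16))
z = reference_z_score(suspect, clean_refs)
flagged = abs(z) > 3.0  # decision threshold is an assumption
```

In this toy setup the suspect model's attention is sharply peaked, so its entropy falls well below the reference band and it is flagged.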

2605.15710 2026-05-18 cs.CL

SMMBench: A Benchmark for Source-Distributed Multimodal Agent Memory

SMMBench:一种用于源分布多模态智能体记忆的基准测试

Huacan Chai, Yukai Wang, Yingxuan Yang, Dan Peng, Yuanyi Song, Zhihui Fu, Weiwen Liu, Jianghao Lin, Jun Wang, Weinan Zhang

AI总结 SMMBench旨在评估智能体在多源分布证据下进行多模态推理、冲突解决和行动预测的能力,揭示当前系统在处理碎片化异构数据时的不足。

详情
AI中文摘要

现有多模态记忆推理基准主要在预编排上下文中评估系统,但未能充分评估智能体能否利用跨独立来源分布的证据。我们提出SMMBench,评估智能体能否从多个来源检索、对齐和组合多模态证据,而非在单一整理上下文中推理。该基准包含1877个样本,基于264个来源。实验表明,当前系统在这些能力上仍存在困难,凸显源分布多模态记忆对多模态智能体的重要性。数据可在https://huggingface.co/datasets/HuacanChai/SMMBench获取。

英文摘要

Existing benchmarks for multimodal memory reasoning largely evaluate systems within pre-assembled contexts, but under-evaluate whether agents can use evidence distributed across independently originated sources. We argue that source-distributed memory composition is an important and under-examined bottleneck in multimodal agent memory, especially when relevant evidence is fragmented across heterogeneous artifacts such as conversations, profiles, screenshots, tables, images, and documents. To address this gap, we introduce the Source-distributed Multimodal Memory Benchmark (SMMBench), which measures whether agents can retrieve, align, and compose multimodal evidence scattered across multiple sources rather than reason within a single curated context. SMMBench evaluates four core capabilities: (1) cross-source multimodal reasoning; (2) conflict resolution; (3) preference reasoning; (4) memory-grounded action prediction. The benchmark contains 1877 samples grounded in 264 sources. Experiments on representative memory-style and retrieval-based baselines show that current systems still struggle on these capabilities, positioning source-distributed multimodal memory as an important and still under-evaluated challenge for multimodal agents. Our data are available at https://huggingface.co/datasets/HuacanChai/SMMBench.

2605.15708 2026-05-18 cs.CV

3D Segmentation Using Viewpoint-Dependent Spatial Relationships

基于视角依赖空间关系的3D分割

Ayaka Nanri, Klara Reichard, Mert Kiray, Federico Tombari, Benjamin Busam, Asako Kanezaki

AI总结 本文提出一个包含22万样本的3D参照分割数据集,通过密集视角采样扩展至数千万样本,研究视角依赖空间关系对3D大模型的影响,提升分割精度并提高mIoU至0.47。

详情
AI中文摘要

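近期3D数据集和多模态模型的进步显著提升了自然语言3D场景理解。然而,大多数3D参照分割方法未显式表示观察者视角,导致“左”“右”“前”“后”等空间关系存在歧义且难以评估。我们提出一个视角感知的3D参照分割数据集,包含22万个基准样本,并可通过密集视角采样扩展至数千万个视角条件样本。在该数据集中,目标物体只能通过以观察者为中心的空间关系来识别,因此必须进行视角条件下的定位。我们利用相机位姿自动标注以观察者为中心的关系(左/右、前/后)以及与视角无关的关系(上/下)。基于该基准,我们在零样本设置下评估了若干现有3D大型多模态模型,发现它们难以处理依赖视角的空间指令。我们进一步提出一种编码相机位姿的视角表示,使模型以观察视角为条件,从而提升视角依赖关系上的分割精度,将mIoU从0.30提高到0.47。数据集、代码和训练好的模型将在论文录用后公开。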

英文摘要

Recent advances in 3D datasets and multimodal models have greatly improved natural language 3D scene understanding. However, most 3D referring segmentation methods do not explicitly represent the observer viewpoint, making spatial relations such as "left," "right," "front," and "behind" ambiguous and difficult to evaluate. We introduce a viewpoint-aware 3D referring segmentation dataset containing 220k benchmark samples, and scalable to tens of millions of viewpoint-conditioned samples through dense viewpoint sampling. In this dataset, target objects can only be identified through observer-centric spatial relations, making viewpoint-conditioned grounding necessary. We construct the benchmark by leveraging camera poses to automatically annotate observer-centric relations (left/right, front/behind) together with viewpoint-independent relations (above/under). Using this benchmark, we evaluate several existing 3D large multimodal models in a zero-shot setting and find that current models struggle with viewpoint-dependent spatial instructions. We further study how explicit viewpoint information can be incorporated into 3D large multimodal models. We introduce a viewpoint representation that encodes camera poses and conditions the model on the observation viewpoint, improving segmentation accuracy on viewpoint-dependent relations and increasing mIoU from 0.30 to 0.47 compared to a model without viewpoint conditioning. The dataset, code, and trained models will be made publicly available upon acceptance.

2605.15705 2026-05-18 cs.RO cs.AI

Feedback World Model Enables Precise Guidance of Diffusion Policy

反馈世界模型使扩散策略获得精准指导

Tuo An, Jindou Jia, Gen Li, Jingliang Li, Chuhao Zhou, Pengfei Liu, Bofan Lyu, Jiaqi Bai, Xinying Guo, Geng Li, Jianfei Yang

AI总结 本文提出反馈世界模型,通过实时反馈修正预测误差,提升机器人决策性能,实验显示在分布偏移下预测准确率和策略表现显著提升。

Comments 21 pages, 9 figures

详情
AI中文摘要

世界模型旨在通过预测动作后果来提高机器人决策能力。然而,当机器人遇到训练分布外的状态时,其预测往往不可靠,限制了实际应用。我们发现执行本身提供了一个自然但未被充分利用的信号:每次动作后,机器人直接观察到真实的下一状态,揭示了预测与实际结果之间的不匹配。基于这一见解,我们提出反馈世界模型,一种在推理时闭合预测与观测之间回路的新范式。与将世界模型视为静态开环预测器不同,我们的方法维护一个轻量级反馈状态,并在线更新以迭代修正未来预测,利用实时观测补偿模型误差,而无需额外训练数据或参数更新。我们证明这一过程可以被解释为潜空间观测器,并在温和条件下具有收敛保证。我们进一步引入动作感知引导,通过强调动作可控的成分并抑制无关变化,更好地将修正后的预测转化为控制。在LIBERO-Plus、Robomimic和真实世界操作任务上的实验表明,我们的方法在分布偏移下显著提高了预测准确性和策略性能。特别是,它将世界模型预测误差降低了多达76.4%,并将分布外(OOD)成功率提高了30%。这些结果表明,在推理时纳入实时反馈为静态世界建模提供了一个简单而有力的替代方案。

英文摘要

World models aim to improve robotic decision making by predicting the consequences of actions. However, in practice, their predictions often become unreliable once the robot encounters states outside the training distribution, limiting their effectiveness at deployment. We observe that execution itself provides a natural but underutilized signal: after each action, the robot directly observes the true next state, revealing the mismatch between predicted and actual outcomes. Building on this insight, we propose feedback world model, a new paradigm that closes the loop between prediction and observation at inference time. Instead of treating the world model as a static open-loop predictor, our method maintains a lightweight feedback state that is updated online to iteratively correct future predictions, compensating for model errors using real-time observations without additional training data or parameter updates. We show that this process can be interpreted as a latent-space observer and admits convergence guarantees under mild conditions. We further introduce action-aware guidance to better translate corrected predictions into control by emphasizing action-controllable components while suppressing irrelevant variations. Experiments on LIBERO-Plus, Robomimic, and real-world manipulation tasks demonstrate that our method substantially improves both prediction accuracy and policy performance under distribution shift. In particular, it reduces world model prediction error by up to 76.4% and improves out-of-distribution (OOD) success rate by 30%. These results show that incorporating real-time feedback at inference time provides a simple yet powerful alternative to static world modeling.
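The idea of a lightweight feedback state that is updated online from the prediction error can be illustrated with a toy observer-style loop. Everything here is a hypothetical stand-in: the linear `world_model`, the offset `true_dynamics`, the additive correction `c`, and the scalar `gain` are assumptions chosen for illustration, not the paper's architecture or update rule.

```python
import numpy as np

def world_model(z, a):
    """Stand-in for a learned latent dynamics model (hypothetical)."""
    return 0.9 * z + a

def true_dynamics(z, a):
    """Stand-in environment, offset from the model to mimic distribution shift."""
    return 0.9 * z + a + 0.5

def rollout(steps=50, gain=0.5, use_feedback=True):
    z = np.zeros(2)
    c = np.zeros(2)  # lightweight feedback state; no model parameters are updated
    errors = []
    for _ in range(steps):
        a = np.array([0.1, -0.1])
        z_pred = world_model(z, a) + c     # corrected prediction
        z_next = true_dynamics(z, a)       # state observed after executing the action
        err = z_next - z_pred              # mismatch between prediction and observation
        errors.append(float(np.linalg.norm(err)))
        if use_feedback:
            c = c + gain * err             # online observer-style correction
        z = z_next
    return errors

closed_loop = rollout(use_feedback=True)
open_loop = rollout(use_feedback=False)
```

In this toy case the model error is a constant bias, so the closed-loop prediction error shrinks geometrically while the open-loop error stays constant.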

2605.15701 2026-05-18 cs.CL cs.AI

H-Mem: A Novel Memory Mechanism for Evolving and Retrieving Agent Memory via a Hybrid Structure

H-Mem: 一种通过混合结构进化和检索智能体记忆的新型记忆机制

Jiawei Yu, Yixiang Fang, Xilin Liu, Yuchi Ma

AI总结 H-Mem通过混合结构有效建模智能体记忆的长期演化并高效检索记忆数据,提升问答任务性能。

详情
AI中文摘要

在基于大语言模型(LLM)的智能体(如OpenClaw和Manus)中,记忆数据无处不在。尽管近期有研究尝试利用智能体的记忆来提高问答(QA)任务的性能,但缺乏有效建模记忆数据随时间演化和高效检索的原理性机制,导致记忆利用效率低下。为此,我们提出了H-Mem,一种通过混合结构实现的新型记忆机制,能够有效建模智能体记忆的长期演化,并提供高效的记忆检索方法。特别是,H-Mem构建了时间与语义树结构,使短期记忆数据逐步演变为长期记忆数据,后者为前者提供总结信息,同时构建知识图谱以捕捉记忆中实体之间的关系。此外,通过利用树和图结构的混合特性,H-Mem提供了有效的记忆检索方法。在三个智能体记忆基准测试中,H-Mem在问答任务上实现了最先进的性能。

英文摘要

Memory data are ubiquitous in Large Language Model (LLM)-based agents (e.g., OpenClaw and Manus). A few recent works have attempted to exploit agents' memory for improving their performance on the question-answering (QA) task, but they lack a principled mechanism for effectively modeling how memory data evolve over time and for retrieving memory data effectively, leading to poor performance in memory utilization. To fill this gap, we present H-Mem, a novel memory mechanism via a hybrid structure that can not only effectively model the evolution of agent memory over a long period of time, but also provide an efficient memory retrieval approach. Particularly, H-Mem builds a temporal and semantic tree structure that allows the short-term memory data to evolve progressively into long-term memory data, where the latter provides summarized information about the former, while simultaneously constructing a knowledge graph to capture the relationships between entities in memory. Moreover, it offers an effective memory retrieval approach by exploiting the hybrid structure of the tree and graph structures. Extensive experiments on three agent memory benchmarks show that H-Mem achieves state-of-the-art performance on the QA task.

2605.15700 2026-05-18 cs.LG

AGOP-IxG: A Gradient Covariance Filter for Local Feature Attribution on Tabular Data, with a Controlled Benchmark

AGOP-IxG:一种用于表格数据局部特征归因的梯度协方差滤波器,配有受控基准

Raj Kiran Gupta Katakam

AI总结 本文提出AGOP-IxG,一种用于表格分类器的快速样本归因方法,通过预乘样本梯度与Top-K秩截断的平均梯度外积矩阵,对比四个常用基线方法,在为AutoML从业者设计的受控表格基准上进行评估。

Comments 12 pages, 2 figures, 3 tables. Submitted to AutoML Conference 2026 (ABCD Track)

详情
AI中文摘要

自动化机器学习流水线越来越多地生成需要向终端用户、审计员和下游决策系统解释预测的模型。最广泛使用的特征归因方法(SHAP、集成梯度、LIME)通常是通过惯例而非测量保真度来选择的,因为严格评估受到真实数据上缺乏真实归因的阻碍。我们提出了AGOP-IxG,一种针对表格分类器的快速样本归因方法,该方法将样本梯度左乘一个Top-K秩截断的平均梯度外积矩阵,并在为AutoML从业者设计的受控表格基准上将其与四个广泛使用的基线方法进行对比评估。在第一部分中,我们构建了三个合成的多类表格任务(线性、稀疏非线性、交互式),其中每个样本的真实归因可以解析或数值计算,我们比较了五种方法:AGOP-IxG、SHAP(DeepExplainer)、集成梯度、InputXGradient和LIME。AGOP-IxG在所有三个合成数据集上的Spearman秩相关性和噪声特征质量上领先,并在交互数据集上的Top-K精度上领先。在所有设置中,AGOP-IxG的速度比SHAP快约350倍至1650倍。在第二部分中,我们使用ROAR协议在Adult Income和Credit Card Default上评估全局忠实性;各方法的相对AUC聚集在约1.7%的范围内,这与AGOP-IxG针对样本级局部归因而非全局特征排名进行优化相一致。

英文摘要

Automated machine learning pipelines increasingly produce models whose predictions must be explained to end users, auditors, and downstream decision systems. The most widely used feature attribution methods (SHAP, Integrated Gradients, LIME) are typically chosen by convention rather than measured fidelity, because rigorous evaluation is impeded by the absence of ground-truth attribution on real data. We propose AGOP-IxG, a fast per-sample attribution method for tabular classifiers that pre-multiplies the per-sample gradient by a top-$K$ rank-truncated Average Gradient Outer Product matrix, and evaluate it against four widely-used baselines on a controlled tabular benchmark designed for AutoML practitioners. In Part 1, we construct three synthetic multi-class tabular tasks (linear, sparse nonlinear, interaction-based) where ground-truth attribution per sample is analytically or numerically derivable, and compare five methods: AGOP-IxG, SHAP (DeepExplainer), Integrated Gradients, InputXGradient, and LIME. AGOP-IxG leads on Spearman rank correlation and noise feature mass on all three synthetic datasets, and on top-$k$ precision on the interaction dataset. Across all settings, AGOP-IxG is approximately $350\times$ to $1{,}650\times$ faster than SHAP. In Part 2, we evaluate global faithfulness on Adult Income and Credit Card Default using the ROAR protocol; the methods cluster within $\sim 1.7\%$ relative AUC, consistent with AGOP-IxG being optimized for per-sample local attribution rather than global feature ranking.
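A minimal NumPy sketch of the operation the abstract describes, pre-multiplying a per-sample gradient by a top-K rank-truncated Average Gradient Outer Product (AGOP) matrix, might look like this. The function names, the final element-wise multiplication by the input (as in Input x Gradient), and the toy linear gradients are assumptions for illustration; the paper's exact normalization and pipeline may differ.

```python
import numpy as np

def agop_top_k(grads, k):
    """Average Gradient Outer Product, truncated to its top-k eigenpairs."""
    G = np.asarray(grads, dtype=float)       # shape (n_samples, n_features)
    agop = G.T @ G / G.shape[0]              # mean of g g^T over samples
    vals, vecs = np.linalg.eigh(agop)        # symmetric matrix, ascending eigenvalues
    idx = np.argsort(vals)[::-1][:k]         # indices of the k largest eigenvalues
    return (vecs[:, idx] * vals[idx]) @ vecs[:, idx].T

def agop_ixg(x, grad, agop_k):
    """Input-times-gradient with the gradient filtered by the truncated AGOP."""
    return x * (agop_k @ grad)

rng = np.random.default_rng(0)
# Toy setting: gradients of a linear model that uses only the first two features,
# plus small noise on every coordinate.
w = np.array([2.0, -1.0, 0.0, 0.0])
grads = w + 0.01 * rng.standard_normal((200, 4))
agop_1 = agop_top_k(grads, k=1)
attr = agop_ixg(np.ones(4), grads[0], agop_1)
```

Here the rank-1 filter projects each noisy gradient onto the dominant gradient direction, so attribution mass on the two unused features is suppressed.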

2605.15692 2026-05-18 cs.LG stat.ML

Tighter Regret Bounds for Contextual Action-Set Reinforcement Learning

更紧的基于上下文动作集强化学习的遗憾界

Zijun Chen, Zihan Zhang

AI总结 本文研究了具有固定奖励和转移函数的回合制强化学习,但每个回合的动作集依赖于回合。通过MVP算法,建立了对抗性和随机性情境下的更紧遗憾界,并推导了样本复杂度和间隙依赖的遗憾界。

详情
AI中文摘要

我们研究了具有固定奖励和转移函数的回合制强化学习,但每个回合的动作集依赖于回合。性能通过累积遗憾衡量,即$\sum_{k=1}^K [V^{*,M^k} - V^{π^k,M^k}]$,其中$M^k$表示第$k$个回合的动作上下文。我们证明MVP算法可以自然扩展到此框架并享有强理论保证。特别是,我们建立了对抗性情境下的最小最大遗憾界$\widetilde{O}(\sqrt{SAH^3K\log L})$,其中$L$表示可能的上下文数量。此结果意味着在随机性情境下的遗憾界为$\widetilde{O}(\sqrt{SAH^3K})$。我们进一步将随机性遗憾保证转换为固定上下文分布的样本复杂度界$\widetilde{O}(SAH^3/ε^2)$。此外,我们推导了一个依赖间隙的遗憾界$\widetilde O\left( \inf_{p\in [0,1)} \left( \frac{1}{Δ_{\min}^{p}} + pKΔ_{\min}^{p} \right)\log K \cdot \mathrm{poly}(S,A,H) \right)$,其中$Δ_{\min}^{p}$是子最优$(h,s,a)$三元组的全局$p$-修剪正间隙底。此界在相关子最优间隙较大的情况下可以显著改进最小最大速率。

英文摘要

We study episodic reinforcement learning with fixed reward and transition functions, but with episode-dependent admissible action sets that are observed at the start of each episode. Performance is measured by cumulative regret against the episode-wise optimal value, $\sum_{k=1}^K [V^{*,M^k} - V^{π^k,M^k}]$, where $M^k$ represents the action context in the $k$-th episode. We show that the MVP algorithm naturally extends to this framework and enjoys strong theoretical guarantees. In particular, we establish a minimax regret bound of $\widetilde{O}(\sqrt{SAH^3K\log L})$ for adversarial contexts, where $L$ denotes the number of possible contexts. This result implies a regret bound of $\widetilde{O}(\sqrt{SAH^3K})$ for stochastic contexts. We further translate the stochastic regret guarantee into a sample complexity bound of $\widetilde{O}(SAH^3/ε^2)$ for a fixed context distribution. In addition, we derive a gap-dependent regret bound of \[ \widetilde O\left( \inf_{p\in [0,1)} \left( \frac{1}{Δ_{\min}^{p}} + pKΔ_{\min}^{p} \right)\log K \cdot \mathrm{poly}(S,A,H) \right), \] where $Δ_{\min}^{p}$ is the global $p$-trimmed positive-gap floor over suboptimal $(h,s,a)$ triples. This bound can substantially improve upon the minimax rate when the relevant suboptimality gaps are large.
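The step from the stochastic regret bound to the sample complexity bound can be sketched with the standard online-to-batch argument. This is an illustration under stated assumptions, not necessarily the paper's proof: contexts $M^k$ are taken to be i.i.d. from a fixed distribution $\mathcal{D}$, and the output policy $\pi^{\mathrm{out}}$ (a hypothetical name) is drawn uniformly from $\pi^1,\dots,\pi^K$.

```latex
% Assumptions: M^k ~ D i.i.d.; \pi^{out} is uniform over \pi^1, ..., \pi^K.
\begin{align*}
\mathbb{E}_{M\sim\mathcal{D}}\!\left[V^{*,M} - V^{\pi^{\mathrm{out}},M}\right]
  &= \frac{1}{K}\,\mathbb{E}\!\left[\sum_{k=1}^{K}\left(V^{*,M^k} - V^{\pi^k,M^k}\right)\right] \\
  &\le \frac{1}{K}\,\widetilde{O}\!\left(\sqrt{SAH^3K}\right)
   = \widetilde{O}\!\left(\sqrt{\frac{SAH^3}{K}}\right).
\end{align*}
% The right-hand side is at most \varepsilon once
\[
K = \widetilde{O}\!\left(\frac{SAH^3}{\varepsilon^2}\right).
\]
```

Halving the target accuracy $\varepsilon$ therefore multiplies the required number of episodes by four, up to logarithmic factors.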

2605.15689 2026-05-18 cs.CV

How to Choose Your Teacher for Fine Grained Image Recognition

如何为细粒度图像识别选择教师

Oswin Gosal, Edwin Arkel Rios, Augusto Christian Surya, Fernando Mikael, Bo-Cheng Lai, Min-Chun Hu

AI总结 本文提出基于教师预测比例的教师选择指标Ratio 1-2,在一千多次实验中使教师选择效果较先前方法提升18%,并使小型学生模型在细粒度图像识别中获得最高17%的准确率提升。

Comments Accepted to The 13th Workshop on Fine-Grained Visual Categorization (FGVC13) @ CVPR 2026. Main: 6 pages, 3 figures, 4 tables

详情
AI中文摘要

细粒度图像识别用于分类如鸟类物种或汽车型号等子类别。尽管最先进的模型准确率高,但往往资源消耗过大,难以部署在受限设备上。知识蒸馏通过将大型教师模型的知识转移到小型学生模型中解决此问题。选择合适的教师模型是关键挑战,因为它对学生模型的性能影响很大。本文引入基于教师预测比例的教师选择指标Ratio 1-2。对涵盖3个学生模型、8个教师模型、8个数据集和4种训练策略的一千多次实验的分析显示,该指标使教师选择效果较先前方法提升18%,并使小型学生模型获得最高17%的准确率提升。实验代码库可在https://github.com/arkel23/FGIR-KD-Teacher获取。

英文摘要

Fine-grained image recognition classifies subcategories such as bird species or car models. While state-of-the-art (SOTA) models are accurate, they are often too resource-intensive for deployment on constrained devices. Knowledge distillation addresses this by transferring knowledge from a large teacher model to a smaller student model. A key challenge is selecting the right teacher, as it heavily impacts student performance. This paper introduces a teacher selection metric, Ratio 1-2, based on teacher prediction ratios. Extensive analysis of over one thousand experiments across 3 students, 8 teachers, and 8 datasets under 4 training strategies demonstrates that our metric improves teacher selection by 18% over previous methods, enabling small student models to achieve up to 17% accuracy gains. Experiment codebase is available at: https://github.com/arkel23/FGIR-KD-Teacher.

2605.15684 2026-05-18 cs.CV

ElasticDiT: Efficient Diffusion Transformers via Elastic Architecture and Sparse Attention for High-Resolution Image Generation on Mobile Devices

ElasticDiT:通过弹性架构和稀疏注意力实现高效扩散变换器,用于移动设备上的高分辨率图像生成

Kunpeng Du, Haizhen Xie, Sen Lu, Lei Yu, Binglei Bao, Huaao Tang, Chuntao Liu, Hao Wu, Yang Zhao, Zhicai Huang, Heyuan Gao, Zhijun Tu, Jie Hu, Xinghao Chen

AI总结 本文提出ElasticDiT,通过弹性架构和稀疏注意力机制,在移动设备上实现高效扩散变换器,平衡图像质量和计算效率,同时减少内存占用。

详情
AI中文摘要

扩散变换器(DiT)架构是高保真图像生成的最新范式,支撑如Stable Diffusion-3和FLUX.1等模型。然而,将这些模型部署到资源受限的移动设备上会带来极高的计算和内存开销。尽管效率驱动的方法如Linear-DiT和静态剪枝缓解了瓶颈,但通常会带来质量下降。不同于云环境,移动约束要求一种单模型范式,能够动态平衡保真度和延迟。我们引入ElasticDiT,通过调整空间压缩比和DiT块深度实现这种动态权衡。通过整合Shift Sparse Block Attention(SSBA)和Tiny DWT-Distilled VAE(T-DVAE),ElasticDiT在保持图像质量的同时减少了推理延迟和内存占用。实验表明,ElasticDiT能够在一个参数集内覆盖广泛的保真度-延迟权衡范围。通过联合调整压缩和深度,单个ElasticDiT模型可以动态重新配置以超越任务特定的基线。具体而言,我们的flex lite变体实现了32.87的HPS,超过了Flux模型,同时借助SSBA在84.16%的平均稀疏度下保持有竞争力的质量。此外,即插即用的T-DVAE仅需标准VAE的1/8计算成本即可实现SD3级的重建,而Flow-GRPO提升了语义对齐(GenEval:从66.93到73.62)。这些结果表明,ElasticDiT提供了一种多功能、硬件适应性的解决方案,消除了对多个专用模型的需求,为未来移动设备上的高分辨率图像生成提供了有前景的路径。

英文摘要

The Diffusion Transformer (DiT) architecture is the state-of-the-art paradigm for high-fidelity image generation, underpinning models like Stable Diffusion-3 and FLUX.1. However, deploying these models on resource-constrained mobile devices entails prohibitive computational and memory overhead. While efficiency-driven approaches like Linear-DiT and static pruning alleviate bottlenecks, they often incur quality degradation. Unlike cloud environments, mobile constraints require a single-model paradigm that dynamically balances fidelity and latency. We introduce ElasticDiT, which achieves this dynamic trade-off by adjusting spatial compression ratios and DiT block depths. By integrating Shift Sparse Block Attention (SSBA) and a Tiny DWT-Distilled VAE (T-DVAE), ElasticDiT reduces inference latency and memory footprint while maintaining image quality. Experiments confirm that ElasticDiT effectively covers a wide range of fidelity-latency trade-offs within a single set of parameters. By jointly adjusting compression and depth, a single ElasticDiT model can be reconfigured on the fly to outperform task-specific baselines. Specifically, our flex lite variant achieves an HPS of 32.87, surpassing the Flux model, while maintaining competitive quality at 84.16% average sparsity through SSBA. Furthermore, the plug-and-play T-DVAE provides SD3-level reconstruction at only 1/8 of the computational cost of standard VAEs, and Flow-GRPO boosts semantic alignment (GenEval: 66.93 to 73.62). These results demonstrate that ElasticDiT offers a versatile, hardware-adaptive solution that eliminates the need for multiple specialized models, providing a promising path for future high-resolution image generation on mobile devices.

2605.15682 2026-05-18 cs.CV

DreamSR: Towards Ultra-High-Resolution Image Super-Resolution via a Receptive-Field Enhanced Diffusion Transformer

DreamSR:通过增强感受野的扩散变换器实现超高清图像超分辨率

Qingji Dong, Hang Dong, Mingqin Chen, Rui Zhang, Yitong Wang

AI总结 DreamSR通过双分支MM-ControlNet和增强感受野策略,解决超分辨率中局部过生成和细节合成问题,实现高质量细节恢复。

详情
AI中文摘要

大规模预训练扩散模型因强大的生成先验通过文本引导被广泛应用于实际图像超分辨率。然而,当使用基于补丁的推理策略超分辨率处理高分辨率图像时,现有扩散基超分辨率方法常因LR图像全局提示与每次推理步骤中局部补丁不完整语义信息之间的不匹配而产生过生成问题。另一方面,现有方法由于网络设计和训练策略过度强调全局生成能力,也难以在局部补丁中生成细节纹理。为了解决这个问题,我们提出了DreamSR,一种新的超分辨率模型,通过抑制局部过生成并提高细节合成,从而实现具有超高质量细节的视觉忠实结果。具体来说,我们提出了一个双分支MM-ControlNet,其中ControlNet使用补丁级提示生成局部文本特征,而预训练的DiT使用全局提示生成全局文本特征,从而缓解过生成并确保补丁间的语义一致性。我们还设计了全面的训练策略,包含阶段特定的数据处理管道和增强感受野策略,增强模型捕捉补丁信息和有效恢复局部纹理的能力。广泛的实验表明,DreamSR优于最先进的方法,提供高质量的超分辨率结果。代码和模型可在https://github.com/jerrydong0219/DreamSR上获得。

英文摘要

Large-scale pre-trained diffusion models have been extensively adopted for real-world image super-resolution (SR) because of their powerful generative priors through textual guidance. However, when super-resolving high-resolution images with a patch-wise inference strategy, most existing diffusion-based SR methods tend to suffer from over-generation, due to the misalignment between the global prompt from the low-resolution (LR) image and the incomplete semantic information of local patches during each inference step. In addition, most existing methods fail to generate detailed textures in local patches because their network designs and training strategies overemphasize global generation capabilities. To address these issues, we present DreamSR, a novel SR model that suppresses local over-generation and improves fine-detail synthesis, thereby achieving visually faithful results with ultra-high-quality details. Specifically, we propose a dual-branch MM-ControlNet, where the ControlNet generates local textual features with patch-level prompts while the pre-trained DiT provides global textual features with global prompts, thereby mitigating over-generation and ensuring semantic consistency across patches. We also design a comprehensive training strategy with stage-specific data processing pipelines and a Receptive-Field Enhancement strategy, enhancing the model's capability to capture patch information and effectively restore local textures. Extensive experiments demonstrate that DreamSR outperforms state-of-the-art methods, providing high-quality SR results. Code and model are available at https://github.com/jerrydong0219/DreamSR.

2605.15680 2026-05-18 cs.CL cs.LG q-bio.QM

Few-Shot Large Language Models for Actionable Triage Categorization of Online Patient Inquiries

少样本大语言模型在在线患者咨询可操作分诊中的应用

Liqi Zhou, Jiafu Li

AI总结 本文研究少样本条件下大语言模型在在线患者咨询分诊中的应用,通过构建不同数据集比较TF-IDF和BioBERT与六个LLM在0-shot、4-shot和12-shot条件下的表现,发现Claude Haiku 4.5在12-shot条件下达到0.475的宏F1值,优于监督基线模型。

Comments 4 figures, 19 tables, 23 pages (including appendix and reference)

详情
AI中文摘要

在线患者咨询通常非正式、不完整且在专业评估前撰写,但仍需路由至适当的临床随访级别。我们将此任务定义为四类可操作分诊任务——自我护理、预约就诊、紧急医生审查或紧急转诊,并探讨在低资源标注条件下,提示式大语言模型(LLMs)是否能支持此类路由。使用公开的HealthCareMagic-100K语料库,我们构建了300例人工校准的金标准评估集、700例自动标注的银色训练集和40例少样本池。我们比较了在银色标签上训练的TF-IDF和BioBERT基线模型与六个提示式LLM在0-shot、4-shot和12-shot条件下的表现。我们通过宏F1值以及安全意识指标,包括紧急召回率、漏诊率和严重漏诊率进行评估。最强的LLM(Claude Haiku 4.5,12-shot)达到宏F1值0.475,优于最佳监督基线模型(BioBERT,0.378)的点估计,且置信区间有重叠。少样本提示和两模型一致性在标签依赖方式上有所帮助:自我护理一致性可靠,紧急医生审查不可靠。我们得出结论,LLM可以支持分诊优先级和选择性的人类审核,但不能自主部署。

英文摘要

Online patient inquiries are often informal, incomplete, and written before professional assessment, yet they must still be routed to an appropriate level of clinical follow-up. We study this as a four-class actionable triage task (self-care, schedule-visit, urgent-clinician-review, or emergency-referral) and ask whether prompted large language models (LLMs) can support such routing under low-resource labeling conditions. Using the public HealthCareMagic-100K corpus, we construct a 300-example human-calibrated gold evaluation set, a 700-example auto-labeled silver training set, and a 40-example few-shot pool. We compare Term Frequency-Inverse Document Frequency (TF-IDF) and Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT) baselines trained on silver labels against six prompted LLMs under 0-shot, 4-shot, and 12-shot conditions. We evaluate with macro-$F_1$ alongside safety-aware metrics, including emergency-recall, under-triage rate, and severe under-triage rate. The strongest LLM (Claude Haiku 4.5, 12-shot) reaches macro-$F_1$ 0.475, exceeding the best supervised baseline (BioBERT, 0.378) on point estimate, with overlapping confidence intervals. Few-shot prompting and two-model agreement help in label-dependent ways: self-care agreement is reliable, urgent-clinician-review is not. We conclude that LLMs can support triage prioritization and selective human review, but not autonomous deployment.
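The evaluation metrics named above can be sketched in plain Python. The definitions here are assumptions for illustration: under-triage is taken to mean predicting a less urgent level than the gold label, and "severe" under-triage a gap of at least two levels; the paper may define them differently. The six example labels are made up.

```python
LEVELS = ["self-care", "schedule-visit", "urgent-clinician-review", "emergency-referral"]
RANK = {label: i for i, label in enumerate(LEVELS)}  # higher rank = more urgent

def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over the four triage levels."""
    f1s = []
    for label in LEVELS:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)

def emergency_recall(gold, pred):
    """Share of gold emergency-referral cases that were predicted as such."""
    hits = [p == "emergency-referral" for g, p in zip(gold, pred) if g == "emergency-referral"]
    return sum(hits) / len(hits) if hits else 0.0

def under_triage_rate(gold, pred, min_gap=1):
    """Share of cases predicted at least `min_gap` levels less urgent than gold."""
    return sum(RANK[g] - RANK[p] >= min_gap for g, p in zip(gold, pred)) / len(gold)

# Made-up example labels
gold = ["self-care", "schedule-visit", "urgent-clinician-review",
        "emergency-referral", "emergency-referral", "self-care"]
pred = ["self-care", "self-care", "urgent-clinician-review",
        "emergency-referral", "schedule-visit", "schedule-visit"]

scores = {
    "macro_f1": macro_f1(gold, pred),
    "emergency_recall": emergency_recall(gold, pred),
    "under_triage": under_triage_rate(gold, pred),
    "severe_under_triage": under_triage_rate(gold, pred, min_gap=2),
}
```

Reporting the safety-aware rates next to macro-$F_1$ matters because an emergency case routed to a routine visit is far more costly than the reverse, and macro-$F_1$ alone treats both errors the same.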

2605.15677 2026-05-18 cs.CL cs.CV

VCG-Bench: Towards A Unified Visual-Centric Benchmark for Structured Generation and Editing

VCG-Bench:迈向统一的视觉导向基准,用于结构化生成与编辑

Xiaoyan Su, Peijie Dong, Zhenheng Tang, Song Tang, Yuyao Zhai, Kaitao Lin, Liang Chen, Gai Yuhang, Yuyu Luo, Qiang Wang, Xiaowen Chu

AI总结 本文提出VCG-Bench,一个统一的视觉导向mxGraph任务基准,通过符号逻辑和XML实现精确的图表生成与编辑,解决现有方法在结构化任务中的局限性。

Comments Accepted by ICML2026, 37 pages, 10 figures

详情
AI中文摘要

尽管视觉语言模型(VLMs)迅速发展,但在处理专业工作流程中至关重要的结构化、可控图表任务方面仍存在关键差距。现有方法主要依赖像素级合成,其在可编辑性和保真度上存在固有限制。本文提出一种新的图表即代码范式,利用mxGraph可扩展标记语言(XML)进行精确的图表生成与编辑。我们提出了VCG-Bench,一个统一的视觉导向mxGraph任务基准。VCG-Bench包括:(1)一个包含1,449种不同图表的分类数据集,涵盖6个领域和15个子领域;(2)一种整合生成(视觉到代码)和可编辑性(代码到代码)的范式定义;(3)一种定制的评估协议,采用多维指标,如mxGraph执行成功率、风格一致性分数(SCS)等。实验结果突显了当前最先进(SOTA)VLMs在结构保真度和指令合规性方面的挑战,反映了其视觉和推理能力。

英文摘要

Despite the rapid advancements in Vision-Language Models (VLMs), a critical gap remains in their ability to handle structured, controllable diagrammatic tasks essential for professional workflows. Existing methods predominantly rely on pixel-based synthesis, which operates in probabilistic pixel spaces and is inherently limited in editability and fidelity. Instead, we propose a new Diagram-as-Code paradigm with symbolic logic that leverages mxGraph Extensible Markup Language (XML) for precise diagram generation and editing. We present VCG-Bench, a unified benchmark for visual-centric mxGraph tasks. VCG-Bench comprises: (1) a taxonomized dataset of 1,449 diverse diagrams spanning 6 domains and 15 sub-domains, (2) a paradigm definition that integrates Generation (Vision-to-Code) and Editability (Code-to-Code), (3) a Tailored Evaluation Protocol employing multi-dimensional metrics such as mxGraph Execution Success Rate, Style Consistency Score (SCS), etc. Experimental results highlight the challenges faced by current State-of-the-Art (SOTA) VLMs in structured fidelity and instruction compliance, reflecting their vision and reasoning capabilities.
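To make the Diagram-as-Code idea concrete, the sketch below holds a two-node mxGraph XML diagram, checks that it is well-formed with consistent cell references, and applies a code-to-code edit. The checker is only a rough proxy written for illustration; it is not the benchmark's Execution Success Rate metric, and the helper names `check_mxgraph` and `rename_node` are hypothetical.

```python
import xml.etree.ElementTree as ET

DIAGRAM = """<mxGraphModel>
  <root>
    <mxCell id="0"/>
    <mxCell id="1" parent="0"/>
    <mxCell id="2" value="Start" style="rounded=1;" vertex="1" parent="1">
      <mxGeometry x="20" y="20" width="80" height="40" as="geometry"/>
    </mxCell>
    <mxCell id="3" value="End" style="rounded=1;" vertex="1" parent="1">
      <mxGeometry x="200" y="20" width="80" height="40" as="geometry"/>
    </mxCell>
    <mxCell id="4" edge="1" parent="1" source="2" target="3">
      <mxGeometry relative="1" as="geometry"/>
    </mxCell>
  </root>
</mxGraphModel>"""

def check_mxgraph(xml_text):
    """Return True if the XML parses and every parent/source/target id exists."""
    try:
        model = ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    if model.tag != "mxGraphModel":
        return False
    cells = model.findall("./root/mxCell")
    ids = {c.get("id") for c in cells}
    for cell in cells:
        for attr in ("parent", "source", "target"):
            ref = cell.get(attr)
            if ref is not None and ref not in ids:
                return False
    return bool(cells)

def rename_node(xml_text, cell_id, new_label):
    """Code-to-code edit: change the label of one cell, leaving the rest intact."""
    model = ET.fromstring(xml_text)
    for cell in model.iter("mxCell"):
        if cell.get("id") == cell_id:
            cell.set("value", new_label)
    return ET.tostring(model, encoding="unicode")

edited = rename_node(DIAGRAM, "3", "Finish")
```

Because the diagram is symbolic, the edit touches exactly one attribute and leaves geometry and connectivity unchanged, which is the editability advantage over pixel-based synthesis that the abstract points to.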

2605.15676 2026-05-18 cs.CL

Dynamic Chunking for Diffusion Language Models

扩散语言模型的动态分块

Yichen Zhu, Xiaoming Shi, Peng Zhao, Weiyu Chen, Debing Zhang, James Kwok

AI总结 本文提出动态分块扩散模型,通过内容定义语义分块替代固定位置分块,提升序列结构利用效率,在参数规模达1.5B的下游任务中表现更优。

详情
AI中文摘要

块离散扩散语言模型将序列自回归地分解为固定大小的位置块,将块内并行去噪与块间条件解耦。我们认为这种刚性划分浪费了序列中已有的结构:以位置而非内容定义的块将语义连贯的token分开,将不相关的token分组。我们引入动态分块扩散模型(DCDM),用内容定义的语义分块替代位置块。其核心是Chunking Attention,一个可微层,将token路由到由可学习子空间参数化的K个聚类中,并通过扩散目标端到端塑造形状。所得聚类分配诱导出一个chunk因果注意力掩码,在此掩码下,离散扩散去噪器将序列似然自回归地分解为语义分块,严格推广块离散扩散。在参数规模达1.5B的下游任务中,DCDM在无结构和位置块扩散基线中均表现更优,优势在不同规模和训练早期均稳定。

英文摘要

Block discrete diffusion language models factorize a sequence autoregressively over fixed-size positional blocks, decoupling within-block parallel denoising from across-block conditioning. We argue that this rigid partition wastes structure already present in the sequence: blocks defined by position rather than by content separate semantically coherent tokens and group unrelated ones together. We introduce the Dynamic Chunking Diffusion Model (DCDM), which replaces positional blocks with content-defined semantic chunks. At its core is Chunking Attention, a differentiable layer that routes tokens into $K$ clusters parameterized by learnable subspaces and shaped end-to-end by the diffusion objective. The resulting cluster assignments induce a chunk-causal attention mask under which a discrete diffusion denoiser factorizes the sequence likelihood autoregressively over semantic chunks, strictly generalizing block discrete diffusion. On downstream benchmarks at parameter scales up to 1.5B, DCDM consistently improves over both unstructured and positional-block diffusion baselines, with the advantage stable across scales and visible early in training.
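One way to picture the chunk-causal attention mask is as a comparison of per-token chunk indices. The sketch below assumes each token already carries an integer chunk id and that chunks are generated in increasing id order; how DCDM orders its learned clusters is not stated in the abstract, so that ordering is an assumption.

```python
import numpy as np

def chunk_causal_mask(chunk_ids):
    """mask[i, j] is True when token i may attend to token j.

    A token sees every token in its own chunk (parallel denoising within a chunk)
    and in all earlier chunks (autoregressive conditioning across chunks).
    """
    c = np.asarray(chunk_ids)
    return c[None, :] <= c[:, None]

# Content-defined chunks need not be contiguous in position.
semantic = chunk_causal_mask([0, 1, 0, 2])

# Fixed-size positional blocks are the special case chunk_id = position // block_size.
positional = chunk_causal_mask(np.arange(4) // 2)
```

With positional ids the mask reduces to the familiar block lower-triangular pattern, which is the sense in which content-defined chunks generalize block discrete diffusion.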

2605.15675 2026-05-18 cs.LG cs.AI

Interaction-Aware Influence Functions for Group Attribution

面向群体归因的交互感知影响函数

Jaeseung Heo, Kyeongheung Yun, Youngbin Choi, Sehyun Hwang, Jungseul Ok, Dongwoo Kim

AI总结 本文提出交互感知影响函数,通过考虑样本间相互作用来改进群体归因,实验显示其在多个任务中优于传统方法。

详情
AI中文摘要

影响函数近似于移除训练样本如何改变感兴趣的量(称为目标函数),如留出损失。为估计一组样本的影响,常规做法是对其成员的个体影响求和。然而,这种求和无法捕捉样本对目标的联合影响:一对样本可能是冗余的或互补的,但求和无法区分这些情况。我们提出交互感知影响函数,通过在训练得到的参数附近将目标展开到二阶,获得一个估计器,该估计器在标准求和基础上增加了一个成对交互项,用于刻画两个样本对目标影响的对齐程度。我们在两种设置下进行了实验评估。首先,在涵盖逻辑回归、MLP和ResNet-9的六个数据集-模型组合上,该估计器对留组重训练结果的跟踪显著优于一阶影响方法。其次,当用作Llama-3.1-8B指令微调数据的贪心选择规则时,它在七个下游任务中的五个上优于先前基于影响和基于表示相似性的基线,而在该情形下,标准的基于影响的选择表现不如随机选择。

英文摘要

Influence functions approximate how removing a training example changes a quantity of interest, called the target function, such as a held-out loss. To estimate the influence of a group of examples, the standard practice is to sum the individual influences of its members. However, this sum does not capture how examples jointly affect the target: a pair of examples may be redundant or complementary, but the sum cannot distinguish these cases. We propose an interaction-aware influence function that characterizes how interactions between examples influence the target. By expanding the target to second order around the trained parameters, we obtain an estimator that augments the standard sum with a pairwise interaction term that captures the alignment between two examples' effects on the target. We empirically evaluate our estimator in two settings. First, on six dataset-model pairs spanning logistic regression, MLPs, and ResNet-9, our estimator tracks leave-group-out retraining substantially better than first-order influence across all settings. Second, when used as a greedy selection rule for instruction-tuning data on Llama-3.1-8B, it beats prior influence-based and representation-similarity baselines on five of seven downstream tasks, in a regime where standard influence-based selection underperforms random selection.
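The origin of a pairwise interaction term can be seen from a generic second-order Taylor expansion. The notation below is an illustrative assumption rather than the paper's: $\hat\theta$ denotes the trained parameters, $f$ the target function, $\delta_i$ the first-order parameter change from removing example $i$, and $S$ the removed group. The paper's estimator may include further corrections.

```latex
% Illustrative notation (assumed): \Delta_S approximates the parameter change
% from removing the whole group S as the sum of individual changes \delta_i.
\begin{align*}
\Delta_S &\approx \sum_{i \in S} \delta_i, \\
f(\hat\theta + \Delta_S) - f(\hat\theta)
  &\approx \nabla f(\hat\theta)^{\top} \Delta_S
     + \tfrac{1}{2}\, \Delta_S^{\top}\, \nabla^2 f(\hat\theta)\, \Delta_S \\
  &= \underbrace{\sum_{i \in S} \Big( \nabla f(\hat\theta)^{\top} \delta_i
     + \tfrac{1}{2}\, \delta_i^{\top}\, \nabla^2 f(\hat\theta)\, \delta_i \Big)}_{\text{per-example terms}}
   \;+\; \underbrace{\sum_{\substack{i, j \in S \\ i < j}}
     \delta_i^{\top}\, \nabla^2 f(\hat\theta)\, \delta_j}_{\text{pairwise interactions}}.
\end{align*}
```

The cross terms $\delta_i^{\top} \nabla^2 f(\hat\theta)\, \delta_j$ depend on how aligned the two examples' parameter shifts are under the curvature of $f$, which is the kind of pairwise information that a plain sum of individual influences discards.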