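arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering and systems science.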

2605.20798 2026-05-21 cs.LG cs.CL

Most Transformer Modifications Still Do Not Transfer at 1-3B: A 2020-2026 Update to Narang et al. (2021) with Downstream Evaluation and a Noise Floor

Yang Zhao, Jiahao Lu, Bin Huang, Guhua Zhang, Jie Zhou

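AI summary: This paper finds that most Transformer modifications still do not transfer at the 1-3B parameter scale, based on tests under strict iso-data, iso-compute, iso-recipe control, validated with downstream evaluation and a noise floor.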

Comments 19 pages, 3 figures, under review at EMNLP 2026

Abstract

Narang et al. (2021) evaluated 40+ Transformer modifications at T5-base scale and concluded that most did not transfer. Five years later, the typical working regime has moved to 1-3B parameters, downstream evaluation has replaced pretraining perplexity, and a substantially different catalogue of modifications has emerged. We revisit their question by testing 20 post-2021 Transformer modifications at 1.2B and 3B under strict iso-data, iso-compute, iso-recipe control, with a multi-seed baseline noise floor and CLIMB-12 downstream evaluation as the primary metric. The central finding reproduces theirs at this curated set: most modifications do not transfer. Of the 20 modifications, only two clear Bonferroni correction at 1.2B; one of those two further fails to train stably at 3B under the shared recipe. We also find that the loss-downstream gap reported by Tay et al. (2023) enlarges several-fold for attention-output modifications: two significant failures converge to within 2-3% of baseline validation loss yet drop 6-16 CLIMB-points. We conclude that noise-floor reporting, downstream evaluation, and cross-scale stability testing are now prerequisites for architecture comparisons at 1-3B.
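
As a rough illustration of the protocol this abstract describes, the sketch below measures a noise floor from multi-seed baseline scores and keeps only the modifications whose difference from the baseline clears a Bonferroni-corrected threshold. The z-test, the single score per modification, and all function names are assumptions made for illustration; the abstract does not state the paper's exact statistical procedure.

```python
import math
import statistics

def noise_floor(baseline_scores):
    """Mean and sample standard deviation of the multi-seed baseline."""
    return statistics.mean(baseline_scores), statistics.stdev(baseline_scores)

def two_sided_p(z):
    """Two-sided p-value for a standard normal z statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def bonferroni_survivors(baseline_scores, mod_scores, alpha=0.05):
    """Names of modifications whose difference from baseline has p < alpha / m.

    The test is two-sided, so a survivor may be significantly better or worse.
    """
    mean, std = noise_floor(baseline_scores)
    threshold = alpha / len(mod_scores)  # Bonferroni: divide alpha by the number of tests
    survivors = []
    for name, score in mod_scores.items():
        if two_sided_p((score - mean) / std) < threshold:
            survivors.append(name)
    return survivors
```

With 20 modifications and alpha = 0.05, each test must reach p < 0.0025, which is why differences that sit inside the seed-to-seed spread do not count as transfer.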

2605.20797 2026-05-21 cs.LG

Beyond Numerical Features: CNN-Driven Algorithm Selection via Contour Plots for Continuous Black-Box Optimization

Yiliang Yuan, Xiang Shi, Mustafa Misir

Abstract

The present paper introduces a new representation-driven approach to per-instance algorithm selection, applied to black-box optimization, for automatically choosing the most promising solver from a fixed portfolio. Prior work in continuous optimization largely relies on numerical descriptors, including Exploratory Landscape Analysis features and learned embeddings such as Deep-ELA. This work studies a complementary representation: contour-map visualizations of probed landscapes. A CNN regressor takes multiple instance-specific contour views (stacked or encoded per view and aggregated) and predicts per-solver performance, enabling selection by the predicted best value. On the standard BBOB 2009 single-objective protocol, the resulting selectors significantly outperform the single best solver (SBS) and are competitive with feature-based baselines. A subsequent bi-objective evaluation under the DeepELA setting further indicates that the same image-based principle can be competitive when using windowed contour views. Overall, the results suggest that simple vision models can exploit spatial structure in probed landscapes for algorithm selection without handcrafted ELA features.
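
The selection rule in this abstract (predict per-solver performance, then pick the predicted best) and the single best solver (SBS) baseline can be sketched as follows. The dictionary layout, the function names, and the convention that lower cost is better are assumptions for illustration; the CNN regressor itself is not modelled here.

```python
def select_solver(predicted):
    """Pick the solver with the lowest predicted cost for one instance."""
    return min(predicted, key=predicted.get)

def single_best_solver(true_costs):
    """Solver with the lowest mean true cost over all instances (the SBS baseline)."""
    solvers = list(next(iter(true_costs.values())).keys())
    n = len(true_costs)
    return min(solvers, key=lambda s: sum(inst[s] for inst in true_costs.values()) / n)

def mean_cost(true_costs, choice):
    """Average true cost when choice[instance] names the solver run on that instance."""
    return sum(true_costs[i][choice[i]] for i in true_costs) / len(true_costs)
```

A selector is useful only if `mean_cost` under its per-instance choices is lower than the cost of running `single_best_solver` everywhere.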

2605.20796 2026-05-21 cs.RO

CMC-Opt: Constraint Manifold with Corners for Inequality-Constrained Optimization

Yetong Zhang, Frank Dellaert

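AI summary: This paper proposes a manifold-based framework for solving optimization problems with equality and inequality constraints in robotics. By introducing a constraint manifold with corners, it converts the original problem directly into an unconstrained optimization over the constrained state space, and validates the framework's effectiveness and robustness on large-scale dynamics planning problems.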

2605.20793 2026-05-21 cs.CL

Assessing socio-economic climate impacts from text data

Mariana Madruga de Brito, Brielen Madureira, Taís Maria Nunes Carvalho, Damien Delforge, Aglaé Jézéquel, Murathan Kurfalı, Ni Li, Gabriele Messori, Joakim Nivre, Barbara Pernici, Niko Speybroeck, Stefano Terzi, Wim Thiery, Bram Valkenborg, Jingxian Wang, Shorouq Zahra, Jakob Zscheischler, Jan Sodoge

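AI summary: This paper presents an approach for systematically analyzing the socio-economic impacts of climate hazards from text data. It addresses the fragmentation of existing research and the lack of clear guidelines by summarizing common practices and challenges and proposing recommendations for building more reliable text-derived socio-economic impact datasets.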

Comments Work in progress

Abstract

Recent advances in natural language processing (NLP) and large language models (LLMs) have enabled the systematic use of large-scale textual data from news, social media, and reports to create datasets with socio-economic impacts of climate hazards such as floods, droughts, storms, and multi-hazard events. As the field of text-as-data for impact assessment expands, so does its methodological complexity. Yet research remains fragmented, with no clear guidelines for defining what constitutes an impact, handling temporal and spatial biases, and selecting appropriate modeling and post-processing strategies. This lack of coherence limits transparency and comparability across studies. Here, we address this gap by synthesising common practices, describing key challenges specific to the use of text-as-data methods for analyzing socio-economic impact data, and proposing recommendations to address them. By providing guidance on best practices, we aim to support the construction of robust text-derived socio-economic impact datasets that can more accurately inform disaster risk management and attribution studies.

2605.20786 2026-05-21 cs.CL

Building Arabic NLP from the Ground Up: Twenty Years of Lessons, Failures, and Open Problems

Wajdi Zaghouani

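AI summary: This paper reflects on two decades of building Arabic NLP resources and research infrastructure. It argues that building datasets is as much a social process as a technical one, that communities formed around shared tasks matter more than the tasks themselves, and that moving from language resources to computational social science exposes challenges that traditional NLP training does not address.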

Comments Accepted at the ACL 2026 Workshop: The Big Picture 2026: Crafting a Research Narrative v2

Abstract

This paper reflects on twenty years of building NLP resources and research infrastructure for Arabic, a language spoken by hundreds of millions yet historically underserved relative to languages such as English or Chinese. The first decade focused on foundational linguistic infrastructure; the second shifted toward computational social science, social media analysis, and socially oriented applications. Rather than cataloguing outputs, the paper examines what the experience of building them revealed. Three counterintuitive lessons emerge: building datasets is as much a social process as a technical one; communities formed around shared tasks often matter more than the tasks themselves; and moving from language resources to computational social science exposes challenges that traditional NLP training does not address. We discuss three failures: a depression detection corpus that never reached clinical practice, a period of spreading across too many shared tasks without sufficient depth, and a long-standing assumption that Modern Standard Arabic infrastructure would transfer cleanly to dialectal tasks. These experiences suggest that the hardest problems in developing NLP for underserved communities are not linguistic but social, institutional, and epistemic, and require competencies the field rarely teaches.

2605.20784 2026-05-21 cs.AI cs.LG

Interaction Locality in Hierarchical Recursive Reasoning

Yosuke Miyanishi, Tetsuro Morimura

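AI summary: This paper proposes interaction locality, a framework for measuring whether information flow stays within nearby cells or semantic segments or crosses them, and applies it to hierarchical recursive reasoning models such as HRM and TRM, yielding a reproducible measurement framework for local execution and global planning.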

Abstract

Spatial reasoning requires both location-bound computation and location-invariant structure: agents must make local moves while preserving route, object, or constraint-level plans. We propose interaction locality, a task-geometry-aware framework for measuring whether information flow stays within nearby cells or semantic segments, or crosses them. We instantiate the framework with sparse-autoencoder feature ablations and finite-noise activation patching, with structural Jacobian and attention checks reported in the appendix, and apply it to HRM and TRM, two compact hierarchical and recursive reasoning models, on Maze-Hard, Sudoku Extreme, and ARC-AGI. Across these models, activation patching gives the clearest architectural fingerprint: high-level recurrent states tend to write information within nearby cells or same-segment units, while repeated recursive updates accumulate these local writes into broader solution structure. This pattern holds across maze paths, Sudoku constraints, and ARC-AGI object neighborhoods, with the strongest concentration in TRM. To test whether interaction locality extends beyond toy-yet-challenging grid benchmarks, we also apply it to MTU3D, a large-scale embodied 3D scene-grounding model. In this MTU3D setting, causal spatial locality appears primarily at the transition where visual scene features are handed to the downstream grounding module, rather than uniformly throughout the visual encoder. This contrast suggests that the local-to-global handoff observed in HRM and TRM is tied to explicit recursive reasoning dynamics, while embodied 3D models may concentrate causal spatial structure at module boundaries. Interaction locality turns the intuitive local-execution/global-planning story into a reproducible measurement framework for recursive and embodied spatial reasoning.
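
One simple way to quantify whether a patched activation writes "within nearby cells" is to measure how much of its downstream effect lands close to the patched grid cell. The score below is a hypothetical metric written for illustration, not the paper's definition: the `effects` mapping, the Manhattan radius, and the function name are all assumptions.

```python
def locality_score(effects, source, radius=1):
    """Fraction of absolute effect mass within `radius` (Manhattan distance) of `source`.

    `effects` maps (row, col) grid cells to the change in output caused by
    patching the activation at `source`.
    """
    total = 0.0
    near = 0.0
    for (row, col), value in effects.items():
        mass = abs(value)
        total += mass
        if abs(row - source[0]) + abs(col - source[1]) <= radius:
            near += mass
    return near / total if total > 0 else 0.0
```

A score near 1 indicates local writes, while a score near the fraction of cells inside the radius indicates no spatial preference.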

2605.20782 2026-05-21 cs.LG

Causal Machine Learning Is Not a Panacea: A Roadmap for Observational Causal Inference in Health

Donna Tjandra, Trenton Chang, Sonali Parbhoo, Rajesh Ranganath, Andre Kurepa Waschka, William Mitchell, Maggie Makar, Shalmali Joshi, Finale Doshi-Velez, Leo Anthony Celi, Jenna Wiens

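AI summary: This paper examines the application of causal machine learning to observational data, stresses the importance of assessing validity assumptions and applying causal ML responsibly, and offers a template for strengthening the rigor and interpretability of causal analyses.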

Abstract

Objective: The growing availability of large-scale observational clinical datasets and challenges in conducting randomized controlled trials have spurred enthusiasm in using causal machine learning (ML) for causal inference in observational data. We present a roadmap for applying causal ML to observational data. Materials and methods: We outline the importance of assessing validity assumptions within available data and applying causal ML responsibly for clinical experts using causal ML and ML practitioners with limited clinical expertise. Observations: Despite advances in causal ML, its limitations remain largely under-appreciated across disciplines. This gap in shared knowledge may impact the validity of findings. Discussion: Causal assumptions must be satisfied and modeling choices justified. Otherwise, these approaches risk producing biased or misleading results, with consequences for clinical research and patient care. Conclusion: Causal ML can be a powerful tool for generating causal hypotheses. We provide a template to strengthen the rigor and interpretability of causal analyses.

2605.20780 2026-05-21 cs.LG cs.CV

Learning to Think in Physics: Breaking Shortcut Learning in Scientific Diffusion via Representation Alignment

Haozhe Jia, Pengyu Yin, Wenshuo Chen, Shaofeng Liang, Lei Wang, Bowen Tian, Xiucheng Wang, Nanqian Jia, Yutao Yue

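AI summary: This work proposes REPA-P, a teacher-free framework that aligns intermediate features with physical states using first-principles residuals, addressing the tendency of intermediate representations in physics-informed diffusion models to learn shortcuts when boundary conditions shift. Across four PDE tasks it speeds up convergence, reduces physics residuals, and improves out-of-distribution robustness.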

Abstract

Physics-informed diffusion models typically enforce PDE constraints only on final outputs, leaving intermediate representations unconstrained and prone to shortcut learning under shifted boundary conditions. We introduce **REPA-P**, a teacher-free, architecture-agnostic framework that aligns intermediate features with physical states using first-principles residuals. REPA-P attaches lightweight $1{\times}1$ projection heads to selected layers, decodes hidden activations into physical quantities, and applies PDE residual losses during training. These heads are discarded at inference, introducing **zero overhead**. Across four PDE tasks, including Darcy flow, topology optimization, electrostatic potential, and turbulent channel flow, REPA-P accelerates convergence by up to $2{\times}$, reduces physics residuals by up to $66.4\%$, and improves out-of-distribution robustness by up to $49.3\%$, with consistent gains on both U-Net and Diffusion Transformer backbones. Ablations show that supervising a small set of intermediate layers captures most benefits and complements output-level physics losses. Code is available at [https://github.com/Hxxxz0/REPA-P](https://github.com/Hxxxz0/REPA-P).
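
The mechanism in this abstract (a 1x1 projection that decodes hidden activations into a physical field, followed by a PDE residual loss) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the REPA-P implementation: the Poisson equation stands in for the paper's PDEs, the 5-point finite-difference stencil is a choice made here, and both function names are hypothetical.

```python
def project_1x1(features, weights, bias=0.0):
    """Per-pixel linear map from a channel vector to one scalar (a 1x1 projection)."""
    return [[sum(w * c for w, c in zip(weights, pixel)) + bias for pixel in row]
            for row in features]

def poisson_residual_loss(u, f, h=1.0):
    """Mean squared residual of -laplace(u) = f over interior grid points."""
    rows, cols = len(u), len(u[0])
    if rows < 3 or cols < 3:
        raise ValueError("grid needs at least one interior point")
    total, count = 0.0, 0
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            # 5-point stencil approximation of the Laplacian
            lap = (u[i + 1][j] + u[i - 1][j] + u[i][j + 1] + u[i][j - 1]
                   - 4.0 * u[i][j]) / (h * h)
            total += (-lap - f[i][j]) ** 2
            count += 1
    return total / count
```

In training, a loss of this kind would be added for each supervised intermediate layer; because the projection head is only needed to compute the loss, it can be dropped at inference.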

2605.20777 2026-05-21 cs.CV

AttriStory: Fine-grained Attribute Realization for Visual Storytelling with Diffusion Models

Manogna Sreenivas, Rohit Kumar, Soma Biswas

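AI summary: This paper introduces the AttriStory benchmark for fine-grained attribute realization in visual storytelling, together with a latent optimization module that operates during early denoising steps and an AttriLoss objective that strengthens alignment for attribute-object pairs, yielding more precise attribute localization.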

Comments Accepted at CVPR AIStory Workshop, 2026

Abstract

Visual storytelling with diffusion models has made impressive strides in maintaining character consistency across narrative scenes. However, a critical gap remains: while these methods keep a character consistent across scenes, they provide no systematic way to ensure that fine-grained attributes, such as the color and texture of clothing and accessories, are faithfully rendered in the generated images. Towards this goal, we introduce AttriStory, a benchmark enabling attribute realization in visual storytelling. We curate 200 multi-scene stories across 10 distinct artistic styles using a Large Language Model. Each scene is constructed with detailed attribute specifications to enable rich visual narratives. Further, to address attribute realization, we propose a plug-and-play latent optimization module that operates during early denoising steps, when the model establishes structural and semantic content. We achieve this through the AttriLoss objective, designed to maximize alignment between the cross-attention maps for desired attribute-object pairs while suppressing spurious associations, guiding models to localize attributes correctly. This approach operates orthogonally to existing consistency mechanisms, integrating seamlessly with current story generation pipelines without requiring architectural modifications. Our experiments demonstrate consistent improvements from incorporating AttriLoss across all baselines. This work positions attribute realization as a distinct, complementary dimension of visual storytelling, alongside character consistency, advancing the field toward fine-grained attribute-controlled story generation. Project page: https://manogna-s.github.io/attristory/

2605.20774 2026-05-21 cs.RO

VLA-REPLICA: A Low-Cost, Reproducible Benchmark for Real-World Evaluation of Vision-Language-Action Models

Alex S. Huang, Jiahui Zhang, Shiqing Tang, Yu Xiang

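AI summary: This paper presents VLA-REPLICA, a low-cost, reproducible benchmark for real-world evaluation of vision-language-action models. Built from off-the-shelf components, it provides a consistent environment for policy evaluation and includes a diverse set of manipulation tasks and a small-scale demonstration dataset for target-domain adaptation.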

Abstract

Vision-Language-Action (VLA) models have shown strong promise for general-purpose robotic manipulation, but their real-world evaluation remains limited by a lack of accessible, reproducible, and consistent benchmarks. Simulation benchmarks fail to capture real-world complexity, while existing real-world benchmarks often require expensive hardware, centralized evaluation, or are limited in task diversity. We introduce VLA-REPLICA, a low-cost, easily reproducible real-world benchmark for evaluating VLA models. Built from off-the-shelf components, our system can be quickly assembled and replicated across laboratories, providing a consistent environment for policy evaluation anywhere in the world. VLA-REPLICA includes a diverse suite of manipulation tasks and a small-scale demonstration dataset for target-domain adaptation, with real-world evaluation protocols for both in-distribution and out-of-distribution settings. Experiments with imitation learning and state-of-the-art VLA models reveal model strengths and limitations, while consistent results across independently constructed setups demonstrate the reproducibility of our benchmark.

2605.20771 2026-05-21 cs.LG

Cumulative Meta-Learning from Active Learning Queries for Robustness to Spurious Correlations

Kin Whye Chew, Jingxian Wang

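AI summary: This paper proposes Cumulative Active Meta-Learning (CAML), a framework that uses samples queried by active learning to meta-learn a prior, improving robustness to spurious correlations; experiments show substantial gains on several benchmarks.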

Comments Under review. 26 pages, 7 figures

Abstract

Spurious correlations in real-world datasets cause machine learning models to rely on irrelevant patterns, undermining reliability, generalization, and fairness. Active learning offers a promising way to address this failure mode by querying informative samples that distinguish core features from spurious ones. However, standard active-learning methods simply append queried examples to the labeled set, effectively updating only the likelihood term. In deep learning regimes, the influence of these informative samples can be diluted by the larger labeled set and memorized by overparameterized models. We propose Cumulative Active Meta-Learning (CAML), an active-learning framework that uses queried examples to meta-learn the prior, or inductive bias, governing how the model adapts. CAML casts each active-learning round as a meta-learning task: the current labeled set serves as meta-train data for adaptation, while the newly queried batch serves as meta-test data for evaluating generalization. Unlike conventional meta-learning, which treats tasks as independent and identically distributed, CAML exploits the sequential dependence between active-learning rounds by maintaining a cumulative inductive bias that is progressively refined. Theoretically, we show that this cumulative formulation introduces interaction terms that couple earlier meta-learned inductive biases with later query-induced objectives, capturing dependencies absent from standard meta-learning. Empirically, CAML improves minority-group accuracy across spurious-correlation benchmarks and acquisition strategies, with gains of up to 27.8% on Dominoes, 29.9% on Waterbirds, 14.3% on SpuCo, and 24.0% on CivilComments.

2605.20767 2026-05-21 cs.CL cs.LG stat.ME

The Illusion of Intervention: Your LLM-Simulated Experiment is an Observational Study

Victoria Lin, Taedong Yun, Maja Matarić, John Canny, Arthur Gretton, Alexander D'Amour

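AI summary: This paper examines the use of large language models as simulators of human behavior and shows that interventions applied to LLM-simulated synthetic users can induce unintended shifts in latent user attributes, causing user drift that distorts effect estimates. It proposes negative control outcomes to detect distribution shifts and adjusts the persona specification to reduce bias and mitigate drift.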

Abstract

Large language models (LLMs) show potential as simulators of human behavior, offering a scalable way to study responses to interventions. However, because LLMs are trained largely on observational data, interventions in experiments with LLM-simulated synthetic users can induce unintended shifts in latent user attributes, causing user drift where the implicit simulated population differs across treatment conditions, potentially distorting effect estimates. We formalize the confounding or selection bias that can arise due to user drift and show how intervention-dependent shifts can inflate or attenuate observed differences in user responses under intervention. To diagnose confounding, we propose using negative control outcomes--attributes that should remain invariant under intervention--to identify distribution shifts across intervention conditions, providing evidence of user drift. To mitigate drift, we study adjusting the persona specification by eliciting additional confounders, finding that targeted, setting-relevant confounders can substantially reduce bias across survey-style and multi-turn agent evaluations.
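
The negative-control diagnostic in this abstract can be illustrated with a permutation test: if an attribute that the intervention should not change (for example, simulated age) differs between treatment arms, that is evidence of user drift. The test statistic, the permutation count, and the function names below are assumptions for illustration; the abstract does not specify which test the authors use.

```python
import random

def mean(values):
    return sum(values) / len(values)

def negative_control_p_value(control_arm, treated_arm, n_perm=2000, seed=0):
    """Permutation p-value for a difference in means of a negative control outcome.

    A small p-value suggests the simulated population differs across arms.
    """
    rng = random.Random(seed)
    observed = abs(mean(treated_arm) - mean(control_arm))
    pooled = list(control_arm) + list(treated_arm)
    n_control = len(control_arm)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign arm labels at random
        diff = abs(mean(pooled[n_control:]) - mean(pooled[:n_control]))
        if diff >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction keeps p above zero
```

A clean check does not prove the absence of drift; it only fails to detect drift along the chosen attribute.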

2605.20766 2026-05-21 cs.CV

Diffuse to Detect: Bi-Level Sample Rebalancing with Pseudo-Label Diffusion for Point-Supervised Infrared Small-Target Detection

Zhu Liu, Yuanhang Yao, Ping Qian, Zihang Chen, Risheng Liu

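AI summary: This paper presents a more adaptive and stable framework for point-supervised infrared small-target detection. Exploiting the intrinsic consistency between thermal radiation patterns and heat diffusion, it proposes a physics-induced annotation strategy that expands single-point labels into reliable pseudo-masks, and develops a bi-level dual-update framework that jointly optimizes detector weights, sample weights, and diffusion parameters to strengthen supervision and alleviate sample imbalance.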

Abstract

Point supervision has become a scalable solution to address dense annotation for infrared small target detection, but its performance is limited by two coupled bottlenecks: unstable pseudo-label evolution in cluttered, low-contrast infrared imagery and severe sample-distribution imbalance. In this paper, we present a more adaptive and stable framework to address these issues. Leveraging the intrinsic consistency between thermal radiation patterns and heat diffusion, we propose a physics-induced annotation strategy that expands single-point labels into reliable pseudo-masks. To further enhance supervision and alleviate sample imbalance, we develop a bi-level dual-update framework that jointly optimizes detector weights, sample weights, and diffusion parameters. A meta-classifier dynamically predicts sample-wise loss weights, while a differentiable diffusion module refines pseudo-labels with detection feedback, enabling adaptive interaction between training and hyperparameter optimization. Extensive experiments across multiple datasets demonstrate five-fold annotation acceleration, superior detection accuracy, and comparable performance with 30% of the training data, validating the efficiency and practicality of our approach. Our code is available at https://github.com/yuanhang-yao/diffuse-to-detect.
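
The idea of expanding a single-point label into a pseudo-mask by heat diffusion can be sketched with a few explicit diffusion steps on a grid. This is a deliberately simplified illustration: it uses isotropic diffusion that ignores image content, whereas the abstract describes diffusion parameters learned from detection feedback. The step count, `alpha`, threshold `ratio`, and function name are assumptions.

```python
def diffuse_point_label(height, width, seed, steps=3, alpha=0.2, ratio=0.5):
    """Spread one unit of heat from `seed`, then threshold at `ratio` of the peak.

    `alpha` must be at most 0.25 for the explicit scheme to remain stable.
    Cells outside the grid are treated as having zero heat.
    """
    u = [[0.0] * width for _ in range(height)]
    u[seed[0]][seed[1]] = 1.0
    for _ in range(steps):
        nxt = [[0.0] * width for _ in range(height)]
        for i in range(height):
            for j in range(width):
                neighbors = 0.0
                for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < height and 0 <= nj < width:
                        neighbors += u[ni][nj]
                nxt[i][j] = u[i][j] + alpha * (neighbors - 4.0 * u[i][j])
        u = nxt
    peak = max(max(row) for row in u)
    return [[value >= ratio * peak for value in row] for row in u]
```

More steps or a lower `ratio` produce a larger mask, which is the kind of parameter a learned diffusion module would adapt per target.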

2605.20760 2026-05-21 cs.CV

SpineContextResUNet: A Computationally Efficient Residual UNet for Spine CT Segmentation

K S Nithurshen, Saurabh J. Shigwan

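AI summary: This paper proposes SpineContextResUNet, an efficient 3D residual U-Net for rapid spinal localization. Its lightweight Context Block reduces compute requirements without sacrificing performance, making it suitable for resource-constrained environments.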

Comments 2 Figures, 3 Tables

Abstract

Automated segmentation of the vertebral column in Computed Tomography (CT) scans is a prerequisite for pathological assessment and surgical planning. However, state-of-the-art methods, particularly those based on Transformers or large-scale ensembles, demand substantial GPU resources, creating a barrier for clinical adoption in resource-constrained environments or on edge devices. To address this, we introduce SpineContextResUNet, a computationally efficient 3D Residual U-Net designed for rapid spinal localization. Our architecture integrates a lightweight Context Block that employs parallel multi-dilated convolutions to capture long-range anatomical dependencies without the high latency of Recurrent Neural Networks (RNNs) or the memory overhead of Self-Attention mechanisms. Extensive validation on two public benchmarks, VerSe2020 and CTSpine1K, demonstrates that our model achieves Dice scores of 88.17% and 88.13%, respectively. To evaluate performance under strict hardware constraints, we compared our model against a bottlenecked SwinUNETR scaled to match our ~1.7M hardware footprint. While the constrained Transformer suffers severe performance degradation due to a lack of spatial inductive biases in a limited-data regime, our CNN-based approach successfully maintains high accuracy. Crucially, whereas heavy baselines like TotalSegmentator fail due to memory exhaustion on commodity hardware (Intel Core i5, 8GB RAM), our model performs robust inference, making it a viable solution for point-of-care diagnostics and deployment on edge platforms like the Nvidia Jetson Orin Nano.
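
To show why parallel dilated convolutions widen the receptive field cheaply, here is a 1D sketch in plain Python. The paper's Context Block is 3D; the shared kernel, the summation of branches, the dilation rates, and the function names are assumptions for illustration.

```python
def dilated_conv1d(signal, kernel, dilation):
    """Same-length 1D cross-correlation with zero padding (odd kernel length assumed)."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + (k - half) * dilation  # taps are spaced `dilation` samples apart
            if 0 <= j < len(signal):
                acc += w * signal[j]
        out.append(acc)
    return out

def context_block(signal, kernel, dilations=(1, 2, 4)):
    """Sum parallel branches that share a kernel but use different dilations."""
    branches = [dilated_conv1d(signal, kernel, d) for d in dilations]
    return [sum(values) for values in zip(*branches)]
```

With a 3-tap kernel and dilations 1, 2, and 4, each output sees inputs up to 4 positions away while using only three weights per branch.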

2605.20758 2026-05-21 cs.AI cs.CV cs.LG cs.RO

Conflict-Aware Additive Guidance for Flow Models under Compositional Rewards

Xuehui Yu, Fucheng Cai, Meiyi Wang, Xiaopeng Fan, Harold Soh

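AI summary: This paper proposes conflict-aware additive guidance for flow models under compositional rewards, which corrects off-manifold drift by dynamically detecting and resolving gradient conflicts and improves generation fidelity.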

Comments Forty-Third International Conference on Machine Learning (ICML 2026)

Abstract

Inference-time guided sampling steers state-of-the-art diffusion and flow models without fine-tuning by interpreting the generation process as a controllable trajectory. This provides a simple and flexible way to inject external constraints (e.g., cost functions or pre-trained verifiers) for controlled generation. However, existing methods often fail when composing multiple constraints simultaneously, which leads to deviations from the true data manifold. In this work, we identify root causes of this off-manifold drift and find that the approximation error scales severely with gradient misalignment. Building on these findings, we propose Conflict-Aware Additive Guidance ($g^\text{car}$), a lightweight and learnable method, which actively rectifies off-manifold drift by dynamically detecting and resolving gradient conflicts. We validate $g^\text{car}$ across diverse domains, ranging from synthetic datasets and image editing to generative decision-making for planning and control. Our results demonstrate that $g^\text{car}$ effectively rectifies off-manifold drift, surpassing baselines in generation fidelity while using light compute. Code is available at https://github.com/yuxuehui/CAR-guidance.
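
Detecting and resolving a conflict between two guidance gradients can be illustrated with a simple projection rule. Note that this sketch swaps in a PCGrad-style projection as a stand-in; it is not the learnable $g^\text{car}$ method from the paper, whose details are not given in the abstract. The function names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def combine_guidance(g1, g2):
    """Add two guidance gradients, removing the conflicting component first.

    Two gradients conflict when their dot product is negative. In that case
    each is projected onto the plane orthogonal to the other before summing.
    """
    d = dot(g1, g2)
    if d >= 0:
        return [x + y for x, y in zip(g1, g2)]
    # d < 0 implies both vectors are nonzero, so the divisions are safe
    p1 = [x - (d / dot(g2, g2)) * y for x, y in zip(g1, g2)]
    p2 = [y - (d / dot(g1, g1)) * x for x, y in zip(g1, g2)]
    return [a + b for a, b in zip(p1, p2)]
```

Plain addition lets opposing gradients partially cancel or pull the sample in a direction neither reward prefers, which is the failure the abstract associates with gradient misalignment.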

2605.20756 2026-05-21 cs.LG cs.AI math.OC stat.ML

Correcting Stochastic Update Bias in Preconditioned Language Model Optimizers

Nikhil Nayak, Julia White, Urchade Zaratiana, Kelton Zhang, Henrijs Princis, Dhruv Atreja, Henry Fawcett, Matthew Thomas, George Hurn-Maloney, Ash Lewis

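AI summary: This paper studies finite-sample bias in the stochastic update rules of preconditioned optimizers and proposes a single-batch bias-correction framework that uses cross-fitted preconditioning and variance-corrected inversion to reduce gradient-preconditioner coupling bias and inversion bias, improving optimizer performance.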

Comments 32 pages, 3 figures, 13 tables

Abstract

Preconditioned optimizers are central to language model training, but their stochastic update rules are usually treated as direct approximations to population preconditioned descent. We show that this view misses two finite-sample biases. First, the gradient and preconditioner are typically estimated from the same minibatch, introducing gradient--preconditioner coupling bias. Second, even when the preconditioner estimate is unbiased, its inverse or inverse-root is generally biased because inversion is nonlinear. We propose a single-batch bias-correction framework that addresses both effects: cross-fitted preconditioning estimates the numerator and preconditioner from independent microbatch groups, while variance-corrected inversion uses microbatch variability to subtract the leading delta-method bias term. The framework applies to diagonal moment, diagonal curvature, and matrix preconditioning methods, instantiated in AdamW, Sophia, and Shampoo. Bias correction reduces held-out pretraining loss on Qwen2.5-0.5B by $0.15$, $0.07$, and $0.11$ nats, respectively; the effects on mixed-quality pretraining and downstream instruction tuning are consistently neutral-to-positive. Together, these results establish bias correction as a practical mechanism for reducing finite-sample update bias and improving the performance of preconditioned optimizers.
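
Both corrections in this abstract can be sketched for a single scalar coordinate. For an unbiased estimate v̂ of v, a second-order Taylor expansion gives E[1/v̂] ≈ 1/v + Var(v̂)/v³, so subtracting an estimate of the second term reduces the bias. The code below assumes a plain inverse rather than the inverse root used by Adam-style methods, and the half-and-half split and function names are illustrative choices, not the paper's implementation.

```python
import statistics

def corrected_inverse(microbatch_estimates):
    """Delta-method corrected estimate of 1/v from per-microbatch estimates of v."""
    m = len(microbatch_estimates)
    v_hat = statistics.mean(microbatch_estimates)
    # Variance of the mean, estimated from the spread across microbatches
    var_of_mean = statistics.variance(microbatch_estimates) / m
    return 1.0 / v_hat - var_of_mean / v_hat ** 3

def cross_fitted_step(grads, sq_grads):
    """Estimate the numerator and the preconditioner from disjoint microbatch groups.

    The first half of the microbatches supplies the gradient mean; the second
    half supplies the preconditioner, so the two estimates share no samples.
    """
    half = len(grads) // 2
    numerator = statistics.mean(grads[:half])
    return numerator * corrected_inverse(sq_grads[half:])
```

The preconditioner group needs at least two microbatches, since the variance correction is undefined for a single estimate.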


2605.20751 2026-05-21 cs.LG cs.AI cs.SY eess.SY

PACD-Net: Pseudo-Augmented Contrastive Distillation for Glycemic Control Estimation from SMBG

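Canyu Lei, David Repaske, Jianxin Xie

AI summary: This study proposes PACD-Net, a self-supervised contrastive learning framework for estimating glycemic control metrics from sparse, irregularly sampled SMBG data; pseudo-SMBG samples guide learning and improve the model's accuracy and stability.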

Abstract

Effective diabetes management requires continuous monitoring of glycemic levels. Clinically, glycemic control is assessed using metrics such as Time in Range (TIR), Time Below Range (TBR), and Time Above Range (TAR), typically derived from continuous glucose monitoring (CGM). However, many patients rely on self-monitoring of blood glucose (SMBG) due to the high cost and limited accessibility of CGM. Unlike CGM, SMBG provides sparse and irregular measurements, making accurate estimation of these metrics challenging. Conventional supervised learning approaches struggle under such sparsity, leading to poor generalization and unstable performance. To address this, we propose PACD-Net, a self-supervised contrastive knowledge distillation framework for estimating glycemic control from SMBG. Pseudo-SMBG samples with richer temporal coverage are used as teacher signals to guide learning from sparse observations. In addition, multi-view contrastive learning enforces representation consistency across diverse sampling patterns. The model adopts a hybrid Swin Transformer-CNN backbone to capture temporal dependencies in sparse SMBG sequences. Experimental results demonstrate that PACD-Net consistently outperforms existing methods in estimating TAR, TIR, and TBR from real-world SMBG data, achieving improved accuracy as well as enhanced stability and generalization under extremely sparse observation settings. The proposed framework provides a practical tool for clinical SMBG interpretation and offers a generalizable approach for learning from sparse and irregularly sampled sensor data in broader applications.
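For readers unfamiliar with the metrics, the sketch below computes TIR, TBR, and TAR as fractions of glucose readings. The 70-180 mg/dL target range is an assumption taken from common clinical practice, not from the abstract, and the function name is hypothetical. With dense CGM data the fraction of readings approximates the fraction of time; with sparse, irregular SMBG readings this naive count can be far off, which is the estimation problem the abstract describes.

```python
def glycemic_metrics(readings_mg_dl, low=70.0, high=180.0):
    """Return TBR, TIR, and TAR as fractions of the given readings.

    `low` and `high` bound the target range in mg/dL (assumed defaults).
    """
    if not readings_mg_dl:
        raise ValueError("at least one glucose reading is required")
    n = len(readings_mg_dl)
    below = sum(1 for g in readings_mg_dl if g < low)
    above = sum(1 for g in readings_mg_dl if g > high)
    in_range = n - below - above
    return {"TBR": below / n, "TIR": in_range / n, "TAR": above / n}


if __name__ == "__main__":
    print(glycemic_metrics([60, 100, 150, 200]))
```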


2605.20745 2026-05-21 cs.LG cs.AI cs.CL

The Hidden Signal of Verifier Strictness: Controlling and Improving Step-Wise Verification via Selective Latent Steering

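Yefan Zhou, Yilun Zhou, Austin Xu, Soroush Vosoughi, Shafiq Joty, Jiang Gui

AI summary: This paper studies how to control verifier strictness through hidden-state intervention and proposes VerifySteer, which uses latent correctness signals for sample-level routing and intervenes selectively at paragraph boundaries; it outperforms baselines on ProcessBench and Hard2Verify while using less inference compute.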

Abstract

Generative verifiers have emerged as a promising paradigm for step-wise verification, but their verification behavior is often poorly calibrated: they may be under-critical and miss erroneous steps, or over-critical and reject correct reasoning. We refer to this tendency to be overly lenient or overly critical as verifier strictness. In this work, we study whether verifier strictness can be controlled through hidden-state intervention. We uncover a verification-specific hidden-state signal: in step-wise verification, a verifier's tendency to accept or reject a solution step is encoded near the boundary of the corresponding verification paragraph. Exploiting this signal, we show that hidden-state steering can directly modulate verifier strictness without fine-tuning. However, uniform steering induces a trade-off between error detection and correctness certification. To address this, we propose VerifySteer, which exploits latent correctness signals for sample-level routing and selectively intervenes on paragraph boundaries. Experiments on ProcessBench and Hard2Verify show that VerifySteer outperforms prompt optimization and activation steering baselines, and is competitive with self-consistency while requiring 4-7x less inference compute. VerifySteer is also complementary to verification fine-tuning, providing further gains on top of fine-tuned verifiers. The code is available at https://github.com/YefanZhou/VerifySteer.


2605.20744 2026-05-21 cs.LG cs.AI

Hack-Verifiable Environments: Towards Evaluating Reward Hacking at Scale

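Amit Roth, Ankur Samanta, Matan Halevy, Yoav Levine, Yonathan Efroni

AI summary: This paper proposes a new evaluation approach for measuring reward hacking: detectable reward-hacking opportunities are embedded directly in environments, making evaluation more reliable and automated. Using a TextArena-based testbed, it analyzes the reward-hacking behavior of different language models across diverse environments.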

Comments Project Page - https://majoroth.github.io/hack-verifiable-environments/

Abstract

Aligning autonomous agents with human intent remains a central challenge in modern AI. A key manifestation of this challenge is reward hacking, whereby agents appear successful under the evaluation signal while violating the intended objective. Reward hacking has been observed across a wide range of settings, yet methods for reliably measuring it at scale remain lacking. In this work, we introduce a new evaluation paradigm for measuring reward hacking. Whereas prior studies have primarily analyzed it post hoc by inspecting agent trajectories, we instead embed detectable reward hacking opportunities directly into environments. This makes their exploitation verifiable by design, enabling deterministic and automated measurement of whether and how agents exploit such vulnerabilities. We instantiate this approach in TextArena and release Hack-Verifiable TextArena, a testbed in which reward hacking can be measured reliably. Using this benchmark, we analyze reward hacking behavior across language models in diverse environments and settings. We open source the code at https://github.com/MajoRoth/hack-verifiable-environments/.


2605.20743 2026-05-21 cs.CV cs.CL

Draw2Think: Harnessing Geometry Reasoning through Constraint Engine Interaction

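Juncheng Hu, Jiawei Du, Xin Zhang, Joey Tianyi Zhou

AI summary: Draw2Think recasts geometric reasoning from latent spatial inference into agentic interaction with the GeoGebra constraint engine, improving the accuracy and verifiability of geometric reasoning.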

Abstract

Vision-language models solve geometry problems with rising accuracy, yet their intermediate states remain latent and unverifiable: a relation expressed in textual reasoning or drawing code carries no guarantee that a constraint-satisfying configuration realizes it. We observe that existing externalization methods based on rendered pixels or one-shot scripts fail to provide exact, per-action geometric guarantees. Enforcing geometric relations by algebraic definition closes this gap: the workspace becomes a constraint-checked evolving canvas. We present Draw2Think, a framework that recasts geometric reasoning from latent spatial inference into agentic interaction with the GeoGebra constraint engine. In a Propose-Draw-Verify loop, Draw2Think externalizes hypotheses onto an executable canvas, measures exact geometric quantities, and feeds structured observations back to the model, so subsequent reasoning proceeds from checked canvas state grounded by the shared workspace. This externalization makes two properties separately auditable: model-level Construction Fidelity (whether the canvas realizes the intended configuration) and engine-level Measurement Faithfulness (exact values and relations from canvas constraints). Across construction, outcome, and rendering evaluations, Draw2Think builds canvases that pass 95.9% predicate-level and 84.0% strict problem-level construction checks on GeoGoal, improves outcome accuracy by up to 4.1%/16.4% on planar/solid benchmarks, and attains 68.2%/90.5% strict/relaxed rendering scores on GenExam-math. Project page is available at https://draw2think.github.io/


2605.20742 2026-05-21 cs.AI

VBFDD-Agent for Electric Vehicle Battery Fault Detection and Diagnosis: Descriptive Text Modeling of Battery Digital Signals

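Joey Chan, Zhen Chen, Ershun Pan

AI summary: To address the scarcity of open-source battery fault report datasets and the lack of a unified maintenance knowledge representation, this study proposes a descriptive text modeling approach for battery signal reports, building a language corpus for battery health diagnosis and maintenance. On top of this corpus, VBFDD-Agent integrates descriptive battery-state texts, historical case retrieval, local maintenance manuals, and large language model reasoning to generate structured diagnostic results and maintenance recommendations.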

Abstract

With the rapid proliferation of electric vehicles, the safety and reliability of lithium-ion batteries have become critical concerns. Effective anomaly detection is essential for ensuring safe battery operation. However, as battery systems and operating scenarios become increasingly complex, battery fault diagnosis and maintenance require stronger cross-domain adaptability and human-AI collaboration. Traditional fault detection and diagnosis methods are usually designed for specific scenarios and predefined workflows, making them less effective in complex real-world applications. To address the scarcity of open-source battery fault report corpora and the lack of unified maintenance knowledge representation, this study proposes a descriptive text modeling approach for battery signal reports. Monitoring signals, statistical features, anomaly records, and state assessment results are transformed into structured and readable natural language descriptions, forming a language corpus for battery health diagnosis and maintenance. Based on this corpus, we propose VBFDD-Agent, a vehicle battery fault detection and diagnosis agent for automotive-grade battery systems. VBFDD-Agent integrates descriptive battery-state texts, historical case retrieval, local maintenance manuals, and large language model reasoning to generate structured diagnostic results and maintenance recommendations. Experiments show that the proposed framework can accurately perform anomaly monitoring based on descriptive textual representations and provide flexible, efficient, and actionable maintenance suggestions. Expert evaluation further confirms the practical value of the generated recommendations. Overall, VBFDD-Agent extends traditional battery diagnosis from label prediction to interpretable and maintenance-oriented decision support.


2605.20740 2026-05-21 cs.LG cs.AI cs.CL

Distribution-Aware Reward: Reinforcement Learning over Predictive Distributions for LLM Regression

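Jungsoo Park, Hyungjoo Chae, Ethan Mendes, Jay DeYoung, Varsha Kishore, Wei Xu, Alan Ritter

AI summary: This paper proposes Distribution-Aware Reward, a reinforcement learning method over predictive distributions that aims to improve the quality of a language model's predictive distribution on regression tasks rather than only optimizing individual decoded outputs. It scores the distribution formed by multiple decoded samples with the Continuous Ranked Probability Score and assigns credit by each rollout's marginal contribution to distribution quality, improving both the accuracy and the dispersion of predictions. Experiments show that it outperforms supervised fine-tuning and pointwise reinforcement learning baselines across tasks, including a 6-point Spearman correlation gain on KBSS.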

Comments 21 pages, 5 figures

Abstract

Large language models can predict real-valued quantities from heterogeneous inputs such as text, code, and molecular strings, but most training objectives score each decoded floating-point number independently, improving point estimates without ensuring calibrated predictive distributions. This limits applications requiring candidate ranking or uncertainty estimation. We introduce Distribution-Aware Reward, an on-policy reinforcement learning objective whose main contribution is to train language models to produce better predictive distributions for regression tasks, rather than only optimizing individual decoded outputs against scalar targets. Our method treats multiple decoded samples as an empirical predictive distribution, evaluates it with the Continuous Ranked Probability Score, and assigns leave-one-out credit based on each rollout's marginal contribution to distribution quality, rewarding predictions that are both accurate and appropriately dispersed. We evaluate our method on a controlled Gaussian-mixture task, code performance prediction, and molecular property prediction from SMILES strings. Across tasks, our method improves over supervised fine-tuning and pointwise reinforcement learning baselines, with strong rank-correlation gains, including a 6-point Spearman improvement on KBSS. On MoleculeNet, it uses only SMILES strings yet remains competitive with strong graph-based and 3D molecular models. Further analyses show that our method mitigates rollout diversity collapse and improves uncertainty diagnostics, suggesting that directly optimizing predictive distributions makes language model regression more robust and better calibrated.
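The scoring step in this abstract can be sketched with the standard empirical form of the Continuous Ranked Probability Score, CRPS = mean|x_i - y| - (1/2) mean|x_i - x_j|, where lower is better. The leave-one-out credit below (CRPS without a rollout minus CRPS with all rollouts) is one plausible reading of "marginal contribution to distribution quality" and is an assumption, as are the function names; the paper's exact reward may differ.

```python
def crps(samples, target):
    """Empirical CRPS of a set of decoded values against a scalar target."""
    m = len(samples)
    accuracy_term = sum(abs(x - target) for x in samples) / m
    spread_term = sum(abs(a - b) for a in samples for b in samples) / (2 * m * m)
    return accuracy_term - spread_term


def leave_one_out_credit(samples, target):
    """Credit for each rollout: how much the CRPS worsens when it is removed.

    Positive credit means the rollout improved the predictive distribution.
    Requires at least two samples so the reduced set is never empty.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    full = crps(samples, target)
    return [
        crps(samples[:i] + samples[i + 1:], target) - full
        for i in range(len(samples))
    ]


if __name__ == "__main__":
    rollouts = [1.0, 1.0, 5.0]
    print(crps(rollouts, 1.0), leave_one_out_credit(rollouts, 1.0))
```

In this toy example the outlying rollout at 5.0 receives negative credit, while the two rollouts that match the target receive positive credit.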


2605.20738 2026-05-21 cs.CV

STAR-IOD: Scale-decoupled Topology Alignment with Pseudo-label Refinement for Remote Sensing Incremental Object Detection

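Yaoteng Zhang, Qing Zhou, Junyu Gao, Qi Wang

AI summary: This paper proposes the STAR-IOD framework for remote sensing incremental object detection. A subspace-decoupled topology distillation module aligns inter-class topological relationships and reduces scale-induced representation discrepancies, while a clustering-driven pseudo-label generator dynamically identifies class-specific thresholds to mitigate missing annotations for old classes. Experiments show mAP gains of 1.7% and 2.1% over existing methods on DIOR-IOD and DOTA-IOD, respectively.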

Comments STAR-IOD was accepted by ISPRS Journal of Photogrammetry and Remote Sensing

Abstract

Remote sensing imagery typically arrives in the form of continuous data streams. Traditional detectors often forget previously learned categories when learning new ones; therefore, research on Remote Sensing Incremental Object Detection (RS-IOD) is of great significance. However, existing methods largely overlook the intra-class scale variations prevalent in remote sensing scenes, which undermines the effectiveness of knowledge transfer and old knowledge preservation. Moreover, RS-IOD also suffers from missing annotations, which cause the model to misclassify old-class instances as background. To address these challenges, we propose a novel framework, STAR-IOD. First, we introduce a Subspace-decoupled Topology Distillation (STD) module to transfer structural knowledge, explicitly aligning inter-class topological relationships and mitigating intra-class representation discrepancies induced by scale shifts. Furthermore, we introduce the Clustering-driven Pseudo-label Generator (CPG), a plug-and-play module that leverages K-Means clustering to dynamically identify class-specific thresholds, thereby guaranteeing an accurate distinction between true positive targets and background noise and alleviating the issue of missing annotations for old classes. We also constructed two Remote Sensing Incremental Object Detection datasets, DIOR-IOD and DOTA-IOD to facilitate research on RS-IOD. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches by 1.7% and 2.1% mAP on DIOR-IOD and DOTA-IOD, respectively, effectively alleviating catastrophic forgetting while preserving strong detection performance on both base and novel classes. The code and dataset are released at: https://github.com/zyt95579/STAR-IOD.
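To make the idea of a clustering-derived, class-specific threshold concrete, the sketch below runs one-dimensional two-cluster K-Means on the confidence scores of a single class and places the threshold midway between the two centroids. The two-cluster setup, the midpoint rule, and the function name are all assumptions for illustration; the abstract does not specify how CPG applies K-Means.

```python
def two_means_threshold(scores, max_iters=50):
    """Split 1-D scores into a low and a high cluster; return the midpoint
    between the two cluster centroids as the pseudo-label threshold."""
    if not scores:
        raise ValueError("at least one score is required")
    low_center, high_center = min(scores), max(scores)
    if low_center == high_center:
        return low_center  # all scores identical, nothing to separate
    for _ in range(max_iters):
        boundary = (low_center + high_center) / 2
        low = [s for s in scores if s <= boundary]
        high = [s for s in scores if s > boundary]
        new_low = sum(low) / len(low)
        new_high = sum(high) / len(high)
        if new_low == low_center and new_high == high_center:
            break  # assignments are stable
        low_center, high_center = new_low, new_high
    return (low_center + high_center) / 2


if __name__ == "__main__":
    # Hypothetical detector confidences for one old class.
    print(two_means_threshold([0.10, 0.15, 0.20, 0.80, 0.85, 0.90]))
```

Running this per class gives each class its own cut-off, in contrast to a single global confidence threshold.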


2605.20737 2026-05-21 cs.CV

Resolving Long-Tail Ambiguity in Unsupervised 3D Point Cloud Segmentation with Language Priors

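Siqi Wei, Hongbin Xu, Feng Xiao, Tian Lan, Chun Li, Ming Li, Qiuxia Wu

AI summary: This paper proposes LangTail, a framework that uses the balanced world knowledge in language models to mitigate long-tail ambiguity in unsupervised 3D segmentation. It builds multi-level associations between language-derived semantic priors and visually underrepresented minor classes to strengthen their representations, with significant gains on ScanNet-v2, S3DIS, and nuScenes.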

Comments In submission. The code will be released at: https://github.com/Whisky0129/langtail_official

Abstract

Existing approaches for unsupervised 3D point cloud segmentation predominantly rely on a purely visual similarity-based learning-by-clustering paradigm, which suffers from a fundamental limitation: long-tail ambiguity. In such a paradigm, features of minor classes are consistently absorbed by dominant clusters, leading to severely imbalanced predictions. To address this issue, we propose LangTail, a language-guided hierarchical learning framework that leverages the balanced world knowledge encoded in language models to mitigate long-tail ambiguity in unsupervised 3D segmentation. The key idea is to establish multi-level associations between language-derived semantic priors and visually underrepresented minor classes, thereby compensating for the biased attention of purely visual clustering toward dominant classes. Specifically, LangTail first constructs an entity-level semantic prior from language models, capturing balanced and fine-grained world knowledge across categories. These priors are injected into a hierarchical clustering framework via contrastive alignment. This guides multi-granularity semantic structure formation and prevents minor classes from being absorbed by dominant clusters, yielding more discriminative representations for underrepresented categories. Extensive experiments on ScanNet-v2, S3DIS, and nuScenes demonstrate that LangTail consistently outperforms existing methods by significant margins, i.e., +13.5, +12.9, and +8.9 mIoU, respectively. These results demonstrate the effectiveness of language priors in improving the representation of minority classes in 3D point clouds. The code will be released at: https://github.com/Whisky0129/langtail_official.


2605.20733 2026-05-21 cs.CV

Sketch2MinSurf: Vision-Language Guided Generation of Editable Minimal Surfaces from Hand-Drawn Sketches

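Wenda Wang, Anqi Liu, Junqi Yang, Lei He, Luying Wang, Jiachen Lu, Weixin Huang

AI summary: This study proposes Sketch2MinSurf, which combines vision-language guidance with geometric optimization to generate smooth, editable 3D surfaces from hand-drawn sketches; a spatial-topological encoding and the Sketch2MinSurf Structural Loss jointly constrain topological coherence and geometric reconstruction.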

Comments 22 pages, 16 figures, includes appendix

Abstract

Converting hand-drawn sketches into structured 3D geometries remains challenging due to the difficulty of representing non-Euclidean surfaces and maintaining topological consistency. Existing generative models such as GANs, NeRFs, and diffusion architectures often fail to produce editable manifolds directly usable in downstream design workflows. We present Sketch2MinSurf, a hybrid vision-language and geometric optimization framework that integrates vision-language guidance with minimal-surface theory to generate smooth and editable 3D surfaces from hand-drawn sketches. The core of our approach is a spatial-topological encoding that represents geometry as tuples of node coordinates and real/virtual edge skeletons, enabling stable topological control during generation. We further introduce the Sketch2MinSurf Structural Loss (S2MS-Loss), a reward-modulated objective that jointly constrains geometric reconstruction and topological coherence. On a test set of 100 sketches, Sketch2MinSurf achieves a topological similarity score of 0.844, outperforming existing sketch-to-shape baselines. The generated manifolds are directly editable and free from non-manifold artifacts. A public art installation at a university showcases the method's potential for human-intent-driven 3D form generation. The dataset and code are available at https://anonymous.4open.science/r/Sketch2MinSurf/.


2605.20732 2026-05-21 cs.CV

Deep Attention Reweighting: Post-Hoc Attention-Based Feature Aggregation in CNNs for Disentangling Core and Spurious Features under Spurious Correlations

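Kin Whye Chew, Jingxian Wang

AI summary: This paper proposes DAR, a post-hoc attention-based feature aggregation method that replaces the global average pooling layer to reduce the feature entanglement caused by spurious correlations in CNNs, improving generalization and fairness.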

Comments Under review. 26 pages, 7 figures

Abstract

Convolutional Neural Networks (CNNs) often exploit spurious correlations in datasets, learning superficially predictive yet causally irrelevant features, leading to poor generalization and fairness issues. Deep Feature Reweighting (DFR) is a post-hoc technique that reduces a trained model's reliance on spurious correlations by retraining its classification head on a target dataset. However, we show that DFR is fundamentally constrained by operating on entangled features, limiting its ability to amplify the core features while simultaneously suppressing the spurious ones. We trace this entanglement to the ubiquitous Global Average Pooling (GAP) layer, which indiscriminately collapses spatially distinct core and spurious features into a single representation. To address this, we propose Deep Attention Reweighting (DAR), a post-hoc attention-based aggregation module that replaces GAP and is retrained jointly with the classification head. DAR computes an adaptive weighting of spatial locations across feature maps, enabling selective suppression of spurious features before the collapse into entangled features. Across various datasets, metrics, and ablations, DAR consistently outperforms DFR, demonstrating that our attention-based aggregation mitigates GAP-induced entanglement and reduces spurious reliance.
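The contrast between Global Average Pooling and attention-based aggregation can be sketched in a few lines. The version below assumes a feature map flattened to a list of per-location feature vectors and a list of per-location scores, and uses a softmax over locations as the weighting; how DAR actually produces its scores is not given in the abstract, so the scores here are plain inputs and the function names are hypothetical.

```python
import math


def global_average_pool(features):
    """GAP: every spatial location gets the same weight 1/N."""
    n, dim = len(features), len(features[0])
    return [sum(f[d] for f in features) / n for d in range(dim)]


def attention_pool(features, scores):
    """Softmax-weighted sum over spatial locations.

    Locations with low scores (for example, spurious background regions)
    contribute little to the pooled representation.
    """
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]


if __name__ == "__main__":
    # Two locations: the first carries a "core" feature, the second a "spurious" one.
    feats = [[1.0, 0.0], [0.0, 1.0]]
    print(global_average_pool(feats))          # both features mixed equally
    print(attention_pool(feats, [10.0, -10.0]))  # core location dominates
```

With equal scores the attention pool reduces to GAP, so the extra capacity comes entirely from letting the scores differ across locations.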


2605.20730 2026-05-21 cs.CL cs.AI

Distributional Alignment as a Criterion for Designing Task Vectors in In-Context Learning

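Jihoon Kwon, Jiwon Choi, Jy-yong Sohn

AI summary: This paper proposes designing task vectors by distributional alignment, introduces the NTP distance as a metric, and develops the Linear Task Vector method to improve performance and efficiency.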

Comments 9 pages, preprint

Abstract

In-context learning (ICL) allows large language models (LLMs) to adapt to new tasks through demonstrations, yet it suffers from escalating inference costs as context length increases. While task vectors offer a promising alternative by compressing demonstrations into compact hidden-state representations, their quality has been evaluated only through downstream task accuracy. This indirect criterion provides limited insight into how to design more effective task vector extraction methods. In this paper, we posit that inference using task vectors should align their predictive distribution with that of ICL. To quantify this, we introduce $d_{\text{NTP}}$, a metric that measures the discrepancy in next-token probabilities between task vector-based and ICL-based inference. Our empirical analysis reveals that $d_{\text{NTP}}$ serves as a performance proxy, exhibiting a strong negative correlation with downstream accuracy. Motivated by this, we develop Linear Task Vector (LTV), a method designed to minimize $d_{\text{NTP}}$ via a closed-form linear mapping that estimates demonstration effects through regression. Across eight classification benchmarks and five LLMs, LTV consistently outperforms existing task vector baselines, improving average accuracy by 9.2% while reducing inference latency. We further show that LTV outperforms the baselines on regression tasks. Moreover, we investigate the transferability of LTV across different model scales, an aspect that remains underexplored in task vector research. Specifically, we empirically show that task vectors from a larger model can enhance a smaller model's performance by 6.4%, suggesting a new utility for extracted task representations.


2605.20729 2026-05-21 cs.CL

MTR-Suite: A Framework for Evaluating and Synthesizing Conversational Retrieval Benchmarks

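Junhao Ruan, Abudukeyumu Abudula, Bei Li, Yongjing Yin, Xinyu Liu, Kechen Jiao, Xin Chen, Jingang Wang, Xunliang Cai, Tong Xiao, Jingbo Zhu

AI summary: This paper proposes MTR-Suite, a framework that combines LLM-based auditing, multi-agent dialogue generation, and a general-domain benchmark to address the high cost, sparse annotation, and rigid automated heuristics that affect the evaluation and synthesis of conversational retrieval benchmarks.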

Comments Accepted to ACL 2026 (main conference). 28 pages. Code and data: https://github.com/rangehow/mtr-suite

Abstract

Accurate evaluation of conversational retrieval is pivotal for advancing Retrieval-Augmented Generation (RAG) systems. However, existing conversational retrieval benchmarks suffer from costly, sparse human annotation or rigid, unnatural automated heuristics. To address these challenges, we introduce MTR-Suite, a unified framework for auditing, synthesizing, and benchmarking retrieval. It features: (1) MTR-Eval, an LLM-based auditor quantifying alignment gaps in previous benchmarks; (2) MTR-Pipeline, a multi-agent system using greedy traversal clustering to generate high-fidelity dialogues at 1/400th human cost; and (3) MTR-Bench, a rigorous general-domain benchmark. MTR-Bench mimics production-style challenges (hard topic switching, verbosity), offering superior discriminative power. We make our code and data publicly available to facilitate future research at https://github.com/rangehow/mtr-suite.


2605.20728 2026-05-21 cs.CV

Early High-Frequency Injection for Geometry-Sensitive OOD Detection

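Chuanjie Cheng, Ningkang Peng, Chenxi Liu, Yifan He, Peirong Ma, Yanhui Gu

AI summary: Through a band-wise analysis, this paper shows the importance of high-frequency input for geometry-sensitive OOD detection and proposes EIHF, which improves detection performance on CIFAR-100 and ImageNet-100 while revealing a limitation on the scene-centric Places shift.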

Abstract

Post-hoc OOD detectors score logits or features after training, so their success depends on the geometry already encoded in the representation. We revisit this assumption through a band-wise MMD^2 analysis across CE, SimCLR, SupCon, and the OOD-oriented representation method PALM. In our diagnostic, low-frequency input bands induce weaker ID/OOD feature discrepancy, whereas higher-frequency bands tend to provide stronger separability. This observation motivates EIHF, an input-side intervention that exposes high-frequency evidence before the first convolution without changing the training objective. EIHF is strongest for geometry-sensitive OOD detection: under matched training and scoring settings, it reshapes class-conditional feature geometry and reduces ID/OOD Mahalanobis score overlap. Experiments on CIFAR-100 and ImageNet-100 show gains on CIFAR-100 and the best average FPR95 with second-best average AUROC on ImageNet-100, while also revealing a limitation on the scene-centric Places shift. Code is available at https://anonymous.4open.science/r/EIHF.


2605.20727 2026-05-21 cs.CV

GAMR: Geometric-Aware Manifold Regularization with Virtual Outlier Synthesis for Learning with Noisy Labels

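Ningkang Peng, Jingyang Mao, Xiaoqian Peng, Peirong Ma, Xichen Yang, Weiguang Qu, Yanhui Gu

AI summary: This paper proposes a geometry-aware manifold regularization method that actively synthesizes virtual outlier samples to reshape feature-space geometry, improving learning under noisy labels. Its core contribution is a stronger ability to distinguish hard samples from noisy ones, yielding more robust representation learning.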

Abstract

Deep neural networks (DNNs) experience significant performance degradation when processing noisy labels, primarily due to overfitting on mislabeled data. Current mainstream approaches attempt to mitigate this issue by passively filtering clean samples during training. However, simple sample filtering within feature spaces degraded by noise struggles to distinguish between challenging samples and noisy samples, creating a bottleneck for model performance. We highlight for the first time the fundamental importance of actively reshaping feature space geometry for learning from noisy data. We propose a novel Geometry-aware Manifold Regularization Paradigm whose core idea is to explicitly construct energy barriers between data manifolds by actively synthesizing virtual outlier samples. By imposing geometric constraints that promote intra-class compactness and inter-class separation, this approach enhances the discriminability between hard and noisy samples, leading to the learning of more robust representations. Our regularization mechanism exhibits high universality, with effectiveness independent of any prior assumptions about noise patterns. It can be integrated as a standalone mechanism into existing sample selection frameworks, providing stronger robustness against diverse noisy environments. Experiments demonstrate that our paradigm achieves performance surpassing current state-of-the-art (SOTA) methods on multiple benchmarks, including CIFAR-10, with particularly pronounced advantages under more challenging asymmetric noise conditions. Furthermore, this paradigm significantly enhances the model's capability in Out-of-Distribution (OOD) detection, ensuring superior reliability and safety for deployment in open-world scenarios.
