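arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering & systems.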

2511.08704 2026-05-19 cs.CV cs.LG

Rethinking Generative Image Pretraining: How Far Are We From Scaling Up Next-Pixel Prediction?

Xinchen Yan, Chen Liang, Lijun Yu, Adams Wei Yu, Yifeng Lu, Quoc V. Le

Affiliations: Google DeepMind

AI summary: This paper studies the scaling properties of autoregressive next-pixel prediction, a simple, end-to-end, yet under-explored framework for unified vision models. Transformers are trained on 32x32 images and evaluated on three target metrics: the next-pixel prediction objective, ImageNet classification accuracy, and generation-based completion measured by Fréchet Distance. The optimal scaling strategy is highly task-dependent, and as image resolution increases, model size must grow faster than data size. Projecting these findings, the primary bottleneck is compute rather than training data; with compute growing four to five times annually, pixel-by-pixel image modeling is forecast to become feasible within five years.

Comments: Accepted by ICML 2026

Abstract

This paper investigates the scaling properties of autoregressive next-pixel prediction, a simple, end-to-end yet under-explored framework for unified vision models. Starting with images at resolutions of 32x32, we train a family of Transformers using IsoFlops profiles across compute budgets up to 7e19 FLOPs and evaluate three distinct target metrics: next-pixel prediction objective, ImageNet classification accuracy, and generation-based completion measured by Fréchet Distance. First, the optimal scaling strategy is critically task-dependent. At a fixed resolution of 32x32 alone, the optimal scaling properties for image classification and image generation diverge: the generation-optimal setup requires the data size to grow three to five times faster than the classification-optimal setup. Second, as image resolution increases, the optimal scaling strategy indicates that the model size must grow much faster than data size. Surprisingly, by projecting our findings, we discover that the primary bottleneck is compute rather than the amount of training data. As compute continues to grow four to five times annually, we forecast the feasibility of pixel-by-pixel modeling of images within the next five years.
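
As a rough illustration of the forecast above, the sketch below compounds the abstract's stated four- to five-fold annual compute growth over five years. The function name and the choice to start from the 7e19 FLOPs budget are assumptions made for illustration; this is not the authors' projection method.

```python
def projected_compute(base_flops: float, annual_growth: float, years: int) -> float:
    """Compound a compute budget by a fixed annual growth factor."""
    return base_flops * annual_growth ** years


BASE_FLOPS = 7e19  # largest budget mentioned in the abstract (assumed starting point)
YEARS = 5

for growth in (4, 5):
    multiplier = growth ** YEARS
    total = projected_compute(BASE_FLOPS, growth, YEARS)
    print(f"{growth}x per year for {YEARS} years -> {multiplier}x, about {total:.2e} FLOPs")
```

Under these assumptions, five years of growth multiplies the available compute by roughly three orders of magnitude.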

2511.04070 2026-05-19 cs.CL

T-FIX: Text-Based Explanations with Features Interpretable to eXperts

Shreya Havaldar, Weiqiu You, Chaehyeon Kim, Anton Xue, Helen Jin, Marco Gatti, Bhuvnesh Jain, Helen Qu, Amin Madani, Daniel A. Hashimoto, Gary E. Weissman, Rajat Deo, Sameed Khatana, Lyle Ungar, Eric Wong

Affiliations: Department of Computer and Information Science, University of Pennsylvania; Department of Computer Science, University of Texas at Austin; Department of Physics and Astronomy, University of Pennsylvania; Flatiron Institute; Department of Surgery, Perelman School of Medicine, University of Pennsylvania; Division of Pulmonary, Allergy, and Critical Care, Perelman School of Medicine, University of Pennsylvania; Division of Cardiovascular Medicine, Perelman School of Medicine, University of Pennsylvania; Department of Surgery, University of Toronto; University Health Network

AI summary: This paper proposes T-FIX, a framework for evaluating whether LLM-generated explanations align with expert reasoning. It spans seven scientific tasks across three domains and enables automatic, customizable evaluation of expert alignment.

Abstract

As LLMs are deployed in knowledge-intensive settings (e.g., surgery, astronomy, therapy), users are often domain experts who expect not just answers, but explanations that mirror professional reasoning. Yet evaluating whether an LLM "thinks like an expert" remains difficult: existing approaches rely on per-example expert annotation, making them costly, hard to scale, and tied to a single notion of correct reasoning within each domain. To address this gap, we introduce T-FIX, a unified evaluation framework that operationalizes expert alignment as a desired attribute of LLM-generated explanations. T-FIX spans seven scientific tasks across three domains, with each task evaluated against expert-defined criteria that capture domain-grounded reasoning rather than generic explanation quality. Our framework enables automatic, personalizable evaluation of expert alignment that generalizes to unseen explanations without ongoing expert involvement. Code is available at https://github.com/BrachioLab/FIX-2/.

2511.03828 2026-05-19 cs.LG

From Static Constraints to Dynamic Adaptation: Sample-Level Constraint Relaxation for Offline-to-Online Reinforcement Learning

Lipeng Zu, Yu Qian, Shayok Chakraborty, Xiaonan Zhang

Affiliations: Department of Computer Science, Florida State University, Tallahassee, FL, USA

AI summary: This paper proposes DARE, a framework that relaxes constraints at the sample level based on behavioral consistency. It addresses the tension in offline-to-online reinforcement learning between retaining offline conservatism and adapting to online feedback, improving fine-tuning stability and outperforming existing baselines.

Abstract

Offline-to-online reinforcement learning (O2O RL) faces a central challenge between retaining offline conservatism and adapting to online feedback under distribution shift. This challenge arises because data behavior evolves during fine-tuning, rendering data origin a misleading basis for constraint handling and thereby leading to objective-data mismatch. We therefore propose Dynamic Alignment for RElaxation (DARE), a distribution-aware framework for sample-level constraint relaxation based on behavioral consistency with a behavior model. To our knowledge, DARE is the first to condition constraint relaxation on behavioral consistency via a posterior-induced exchange mechanism, moving beyond a binary offline/online data distinction. Importantly, DARE requires only per-sample behavioral alignment, enabling instantiation on top of many offline algorithms with flexible choices of behavior models and fine-tuning objectives. We provide a theoretical analysis showing that behavior-based sample exchange consistently improves the distinction between offline-like and online-like subsets. Experiments on D4RL demonstrate that DARE consistently improves fine-tuning stability and achieves superior final performance over strong offline-to-online baselines. (The code is publicly available at https://github.com/lpzu/DARE.)

2511.02610 2026-05-19 cs.LG

Towards Migrating Neural Network Implementations

Nadia Daoudi, Ivan Alfonso, Jordi Cabot

Affiliations: Luxembourg Institute of Science and Technology; University of Luxembourg; Technology University of Luxembourg Luxembourg

AI summary: This paper proposes an approach for automatically migrating neural network code across deep learning frameworks, using a pivot neural network model to create an abstraction of the network before migration.

Comments: To appear at the International Conference on AI-powered Software (AIware 2026)

Abstract

The development of smart systems (i.e., systems enhanced with AI components) has thrived thanks to the rapid advancements in neural networks (NNs). A wide range of libraries and frameworks have consequently emerged to support NN design and implementation. The choice depends on factors such as available functionalities, ease of use, documentation and community support. After adopting a given NN framework, organizations might later choose to switch to another if performance declines, requirements evolve, or new features are introduced. Unfortunately, migrating NN implementations across libraries is challenging due to the lack of migration approaches specifically tailored for NNs. This leads to increased time and effort to modernize NNs, as manual updates are necessary to avoid relying on outdated implementations and ensure compatibility with new features. In this paper, we propose an approach to automatically migrate neural network code across deep learning frameworks. Our method makes use of a pivot NN model to create an abstraction of the NN prior to migration. We validate our approach using two popular NN frameworks, namely PyTorch and TensorFlow. We also discuss the challenges of migrating code between the two frameworks and how they were approached in our method. Experimental evaluation on five NNs shows that our approach successfully migrates their code and produces NNs that are functionally equivalent to the originals. Artefacts from our work are available online.

2510.26384 2026-05-19 cs.AI cs.LG

Scales++: Compute Efficient Evaluation Subset Selection with Cognitive Scales Embeddings

Andrew M. Bean, Nabeel Seedat, Shengzhuang Chen, Jonathan Richard Schwarz

Affiliations: Thomson Reuters Foundational Research; University of Oxford; Imperial College London

AI summary: This paper proposes Scales++, an evaluation subset selection method based on the intrinsic properties of task items. It reduces upfront selection cost while maintaining predictive fidelity, making LLM evaluation more efficient and improving cold-start performance and interpretability.

Comments: 9 pages, 2 figures, 4 tables

Abstract

The prohibitive cost of evaluating large language models (LLMs) on comprehensive benchmarks necessitates the creation of small yet representative data subsets (i.e., tiny benchmarks) that enable efficient assessment while retaining predictive fidelity. Current methods for this task operate under a model-centric paradigm, selecting benchmarking items based on the collective performance of existing models. Such approaches are limited by large upfront costs, an inability to immediately handle new benchmarks ("cold-start"), and the fragile assumption that future models will share the failure patterns of their predecessors. In this work, we propose a new item-centric approach to benchmark subset selection, arguing that selection should be based on the intrinsic properties of the task items themselves, rather than on model-specific failure patterns. We instantiate this item-centric efficient benchmarking approach via a novel method, Scales++, where data selection is based on the cognitive demands of the benchmark samples. Empirically, we show Scales++ reduces the upfront selection cost by over 18x while achieving competitive predictive fidelity. On the Open LLM Leaderboard, using just a 0.25% data subset, we predict full benchmark scores with a 3.2% mean absolute error, and on Humanity's Last Exam we predict full scores with 2.9% mean absolute error using a 2.0% sample. We demonstrate that this item-centric approach enables more efficient model evaluation without significant fidelity degradation, while also providing better cold-start performance and more interpretable benchmarking.
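
The fidelity figures above are reported as mean absolute error between scores predicted from a subset and full-benchmark scores. A minimal sketch of that metric follows; the example scores are hypothetical placeholders, not results from the paper, and the paper's own prediction procedure is not reproduced here.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute gap between predicted and full-benchmark scores."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Hypothetical scores (in percentage points) for three models.
subset_predictions = [61.0, 48.5, 72.0]
full_benchmark = [64.0, 45.5, 70.0]

mae = mean_absolute_error(subset_predictions, full_benchmark)
print(f"MAE: {mae:.2f} percentage points")
```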

2510.24701 2026-05-19 cs.CL cs.AI cs.IR cs.LG cs.MA

Tongyi DeepResearch Technical Report

Tongyi DeepResearch Team, Baixuan Li, Bo Zhang, Dingchu Zhang, Fei Huang, Guangyu Li, Guoxin Chen, Huifeng Yin, Jialong Wu, Jingren Zhou, Kuan Li, Liangcai Su, Litu Ou, Liwen Zhang, Pengjun Xie, Rui Ye, Wenbiao Yin, Xinmiao Yu, Xinyu Wang, Xixi Wu, Xuanzhong Chen, Yida Zhao, Zhen Zhang, Zhengwei Tao, Zhongwang Zhang, Zile Qiao, Chenxi Wang, Donglei Yu, Gang Fu, Haiyang Shen, Jiayin Yang, Jun Lin, Junkai Zhang, Kui Zeng, Li Yang, Hailong Yin, Maojia Song, Ming Yan, Minpeng Liao, Peng Xia, Qian Xiao, Rui Min, Ruixue Ding, Runnan Fang, Shaowei Chen, Shen Huang, Shihang Wang, Shihao Cai, Weizhou Shen, Xiaobin Wang, Xin Guan, Xinyu Geng, Yingcheng Shi, Yuning Wu, Zhuo Chen, Zijian Li, Yong Jiang

Affiliations: Tongyi Lab; Alibaba Group

AI summary: This paper presents an agentic large language model designed for long-horizon, deep information-seeking tasks. An end-to-end training framework combining agentic mid-training and post-training enables scalable reasoning and information seeking on complex tasks, and a highly scalable, fully automatic data synthesis pipeline removes the need for costly human annotation. The model achieves state-of-the-art performance on multiple deep research benchmarks.

Comments: https://tongyi-agent.github.io/blog

Abstract

We present Tongyi DeepResearch, an agentic large language model, which is specifically designed for long-horizon, deep information-seeking research tasks. To incentivize autonomous deep research agency, Tongyi DeepResearch is developed through an end-to-end training framework that combines agentic mid-training and agentic post-training, enabling scalable reasoning and information seeking across complex tasks. We design a highly scalable data synthesis pipeline that is fully automatic, without relying on costly human annotation, and empowers all training stages. By constructing customized environments for each stage, our system enables stable and consistent interactions throughout. Tongyi DeepResearch, featuring 30.5 billion total parameters, with only 3.3 billion activated per token, achieves state-of-the-art performance across a range of agentic deep research benchmarks, including Humanity's Last Exam, BrowseComp, BrowseComp-ZH, WebWalkerQA, xbench-DeepSearch, FRAMES and xbench-DeepSearch-2510. We open-source the model, framework, and complete solutions to empower the community.
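
To put the parameter counts above in proportion, the short sketch below computes the share of parameters activated per token from the two figures quoted in the abstract. The helper name is an assumption for illustration only.

```python
def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Fraction of parameters activated per token (both arguments in billions)."""
    if total_params_b <= 0:
        raise ValueError("total parameter count must be positive")
    return active_params_b / total_params_b


fraction = active_fraction(30.5, 3.3)  # figures quoted in the abstract
print(f"{fraction:.1%} of parameters are active per token")
```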

2510.18822 2026-05-19 cs.CV

SAM 2++: Tracking Anything at Any Granularity

Jiaming Zhang, Cheng Liang, Yichun Yang, Chenkai Zeng, Yutao Cui, Xinwen Zhang, Xin Zhou, Kai Ma, Gangshan Wu, Limin Wang

Affiliations: State Key Laboratory for Novel Software Technology, Nanjing University; Tencent; Shanghai Artificial Intelligence Laboratory

AI summary: This paper proposes SAM 2++, a framework that unifies tracking of target states at different granularities (masks, boxes, and points) through an integrated design of prompt encoding, output decoding, and memory representation. It also introduces the Tracking-Any-Granularity dataset for training and evaluating unified tracking models.

Comments: 14 pages

Abstract

Due to the varying granularity of target states across different tasks, most existing trackers are tailored to a single task. This specificity limits their generalization, preventing them from effectively utilizing multi-task training data and leading to redundancy in both model design and parameters. Although recent unified vision models share partial architectures across tasks, they usually retain task-specific interfaces and overlook the common tracking principle behind different granularities, leaving a gap for truly unified video tracking. To unify video tracking tasks, we present SAM 2++, a unified framework that can handle target states at different granularities, including masks, boxes, and points, through an integrated design of prompt encoding, output decoding, and memory representation. First, to handle different target granularities, we design task-specific prompts that map diverse task inputs into general prompt embeddings, together with a Unified Decoder that produces task results in a common output form without redesigning the overall pipeline. Next, to satisfy memory matching, the core operation of tracking, we introduce a task-adaptive memory mechanism that unifies memory across different granularities while preserving their distinct state semantics, preventing full parameter sharing from causing interference across granularities. Finally, we introduce Tracking-Any-Granularity, the first large and diverse video tracking dataset with rich annotations at three granularities. It is constructed through a customized data engine with phased manual annotation and model-assisted completion, providing a comprehensive resource for training, benchmarking, and analyzing unified tracking models. Comprehensive experiments confirm that SAM 2++ sets a new state of the art across diverse tracking tasks at different granularities, establishing a unified and robust tracking framework.

2510.17363 2026-05-19 cs.CV cs.LG cs.RO

M2H: Multi-Task Learning with Efficient Window-Based Cross-Task Attention for Monocular Spatial Perception

U. V. B. L Udugama, George Vosselman, Francesco Nex

Affiliations: Department of Earth Observation Science

AI summary: This paper proposes M2H, a framework that uses an efficient window-based cross-task attention module to perform semantic segmentation, depth estimation, edge detection, and surface normal estimation from a single monocular image while remaining computationally efficient.

Comments: Accepted to the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2025). 8 pages, 7 figures

DOI: 10.1109/IROS60139.2025.11246974

Abstract

Deploying real-time spatial perception on edge devices requires efficient multi-task models that leverage complementary task information while minimizing computational overhead. This paper introduces Multi-Mono-Hydra (M2H), a novel multi-task learning framework designed for semantic segmentation and depth, edge, and surface normal estimation from a single monocular image. Unlike conventional approaches that rely on independent single-task models or shared encoder-decoder architectures, M2H introduces a Window-Based Cross-Task Attention Module that enables structured feature exchange while preserving task-specific details, improving prediction consistency across tasks. Built on a lightweight ViT-based DINOv2 backbone, M2H is optimized for real-time deployment and serves as the foundation for monocular spatial perception systems supporting 3D scene graph construction in dynamic environments. Comprehensive evaluations show that M2H outperforms state-of-the-art multi-task models on NYUDv2, surpasses single-task depth and semantic baselines on Hypersim, and achieves superior performance on the Cityscapes dataset, all while maintaining computational efficiency on laptop hardware. Beyond benchmarks, M2H is validated on real-world data, demonstrating its practicality in spatial perception tasks.

2510.16609 2026-05-19 cs.LG cs.AI cs.CC cs.DS

Prior Knowledge Makes It Possible: From Sublinear Graph Algorithms to LLM Test-Time Methods

Avrim Blum, Daniel Hsu, Cyrus Rashtchian, Donya Saless

Affiliations: Toyota Technological Institute at Chicago; Columbia University; Google Research

AI summary: This paper studies the theoretical basis for how prior knowledge and external information interact in test-time augmentation. Modeling multi-step reasoning as an s-t connectivity problem on a knowledge graph, it relates the number of augmentation steps needed under partial prior knowledge to the graph's structure: when the prior knowledge graph over n vertices is split into small components, Ω(√n) queries are required, whereas once the density of correct knowledge exceeds a threshold and a giant component forms, an expected constant number of queries suffices.

Abstract

Test-time augmentation, such as Retrieval-Augmented Generation (RAG) or tool use, critically depends on an interplay between a model's parametric knowledge and externally retrieved information. However, the theoretical underpinnings of this relationship remain poorly understood. Specifically, it is not clear how much pre-training knowledge is required to answer queries with a small number of augmentation steps, which is a desirable property in practice. To address this question, we formulate multi-step reasoning as an $s$-$t$ connectivity problem on a knowledge graph. We represent a model's pre-training parametric knowledge as a partial, potentially noisy subgraph. We view augmentation as querying an oracle for true edges that augment the model's knowledge. Then, we characterize the necessary and sufficient number of augmentation steps for the model to generate an accurate answer given partial prior knowledge. One key result shows a phase transition: if the prior knowledge graph over $n$ vertices is disconnected into small components, then finding a path via augmentation is inefficient and requires $\Omega(\sqrt{n})$ queries. On the other hand, once the density of correct knowledge surpasses a threshold, forming a giant component, we can find paths with an expected constant number of queries.
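
The setup above can be made concrete with a toy sketch: prior knowledge is a set of known edges, and each oracle query reveals the true edges at one vertex. The oracle model (one vertex per query) and the naive expansion strategy are assumptions chosen for illustration; they are not the paper's algorithm and do not reproduce its bounds.

```python
from collections import deque


def to_adjacency(edges):
    """Build an undirected adjacency map from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def reachable(adj, start):
    """Vertices reachable from start using only known edges (BFS)."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def queries_to_connect(prior_edges, true_edges, s, t):
    """Count oracle queries until s and t are connected, or None if impossible."""
    known = to_adjacency(prior_edges)
    truth = to_adjacency(true_edges)  # hypothetical oracle backing store
    queried = set()
    queries = 0
    while True:
        component = reachable(known, s)
        if t in component:
            return queries
        frontier = sorted(component - queried)
        if not frontier:
            return None  # s and t are disconnected even in the true graph
        u = frontier[0]
        queried.add(u)
        queries += 1
        for v in truth.get(u, ()):
            known.setdefault(u, set()).add(v)
            known.setdefault(v, set()).add(u)


path_graph = [(0, 1), (1, 2), (2, 3)]
print(queries_to_connect(path_graph, path_graph, 0, 3))        # full prior knowledge
print(queries_to_connect([(0, 1), (2, 3)], path_graph, 0, 3))  # one missing edge
```

With complete prior knowledge no queries are needed; the more fragmented the prior graph, the more queries this naive strategy spends.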

2510.16252 2026-05-19 cs.LG cs.CL

WEBSERV: A Full-Stack and RL-Ready Web Environment for Training Web Agents at Scale

Yuxuan Lu, Ziyi Wang, Jing Huang, Hui Liu, Jiri Gesi, Yan Han, Shihan Fu, Tianqi Zheng, Xianfeng Tang, Chen Luo, Yisi Sang, Jin Lai, Dakuo Wang

Affiliations: Northeastern University; Amazon

AI summary: This paper proposes WebServ, a full-stack, RL-ready web environment for training web agents at scale. On the server side it uses Incus containers to reduce launch latency and storage requirements; on the browser side it provides an automatically derived observation and action interface and a robust action execution backend. Experiments show state-of-the-art single-prompt results on WebArena-Lite and RL-trained models that surpass existing methods.

Abstract

Reinforcement learning (RL) for web agents demands environments that are both effective for evaluation and efficient enough for large-scale on-policy training. Current web environments fall short: server-side Docker setups are too resource-intensive for massive parallel rollouts, while browser-side interfaces produce noisy observations, execute actions unreliably under modern single-page applications, and omit visual interactivity cues. We introduce WebServ, a full-stack, RL-ready web environment that addresses these limitations end-to-end. On the server side, WebServ uses Incus containers with block-level copy-on-write, reducing launch latency by ~5x and persistent storage by ~240x, enabling 200+ concurrent isolated environments on a single host. On the browser side, WebServ provides a compact, site-agnostic observation and action interface derived automatically from the DOM with human-aligned interactivity cues, and a robust action execution backend using network-aware waiting for reliable SPA support. On WebArena-Lite, WebServ achieves state-of-the-art single-prompt results, with controlled comparisons confirming consistent gains across GPT-4o, OpenAI-o3, and Llama-3.1-8B over vanilla WebArena. We further train Qwen3-4B and Qwen3-30B-A3B with RL entirely within WebServ; the RL-trained 4B model achieves 55.5% mean accuracy, surpassing both Claude 4.5 Sonnet (50.0%) and the RL-trained 8B model from WebAgent-R1 (51.8%).

2510.14466 2026-05-19 cs.CL cs.AI

Toward Robust Multilingual Adaptation of LLMs for Low-Resource Languages

Haolin Li, Haipeng Zhang, Mang Li, Yaohua Wang, Lijie Wen, Yu Zhang, Biqing Huang

Affiliations: Department of Automation, Tsinghua University, Beijing, China; Alibaba International Digital Commerce Group, Beijing, China; School of Software, Tsinghua University, Beijing, China

AI summary: This paper proposes LiRA, a framework for robust multilingual adaptation of LLMs to low-resource languages through lightweight fine-tuning. It combines the Arca and LaSR components to improve cross-lingual semantic consistency and representation stability.

Comments: Accepted by ICML 2026

Abstract

Large language models (LLMs) continue to struggle with low-resource languages, primarily due to limited training data, translation noise, and unstable cross-lingual alignment. To address these challenges, we propose LiRA (Linguistic Robust Anchoring for LLMs), a plug-and-play framework that requires only lightweight fine-tuning on top of existing pretrained backbones. LiRA jointly optimizes representation stability and cross-lingual semantic consistency by combining two key components: Arca (Anchored Representation Composition Architecture), which aligns low-resource inputs to a shared English semantic space through anchor-based alignment and collaborative encoding; and LaSR (Language-coupled Semantic Reasoner), a lightweight, language-aware head that enforces consistency regularization for unified cross-lingual understanding, retrieval, and reasoning. We theoretically show that under controlled anchoring error and translation-induced bias, LiRA guarantees bounded representation deviation and stable downstream performance under local Lipschitz continuity. To facilitate research, we release a new multilingual product retrieval dataset covering five Southeast Asian and two South Asian languages. Extensive experiments across diverse low-resource benchmarks demonstrate consistent improvements in retrieval, ranking, question answering, and reasoning tasks. Code will be publicly available on GitHub, and the dataset will be hosted on Hugging Face.

2510.13870 2026-05-19 cs.CL cs.AI

Unlocking the Potential of Diffusion Language Models through Template Infilling

Junhoo Lee, Seungyeon Kim, Nojun Kwak

Affiliations: Seoul National University

AI summary: This paper proposes Template Infilling, a conditioning method for diffusion language models that establishes a global blueprint across the target response space. It improves performance on mathematical reasoning, code generation, and trip planning, and in multi-token generation it provides speedups while maintaining generation quality.

Comments: ACL 2026 Main Conference - Long Paper, Oral Presentation

Abstract

Diffusion Language Models (DLMs) have emerged as a promising alternative to Autoregressive Language Models, yet their inference strategies remain limited to prefix-based prompting inherited from the autoregressive paradigm. In this paper, we propose Template Infilling (TI), a tailored conditioning methodology for DLMs. Unlike conventional prefix prompting, TI flexibly aligns structural anchors across the entire target response space, establishing a global blueprint before filling in the masked segments. We demonstrate the effectiveness of our approach on diverse benchmarks, including mathematical reasoning, code generation, and trip planning, achieving consistent improvements of 9.40% over the baseline. Furthermore, we observe that TI provides additional advantages in multi-token generation settings, enabling effective speedup while maintaining generation quality and robustness. By enforcing these global constraints, TI ultimately facilitates System-2 reasoning, empowering the model to deliberate within a structurally defined solution space.

2510.13068 2026-05-19 cs.LG cs.AI cs.HC

NeuroRVQ: Multi-Scale Biosignal Tokenization for Generative Foundation Models

Konstantinos Barmpas, Na Lee, Dimitrios Chalatsis, William Raftery, Yannis Panagakis, Dimitrios A. Adamos, Nikolaos Laskaris, Alexandros Koliousis, Dario Farina, Stefanos Zafeiriou

Affiliations: Imperial College London; Cogitat; National and Kapodistrian University of Athens; Archimedes Research Unit; Aristotle University of Thessaloniki; Northeastern University London

AI summary: This paper proposes NeuroRVQ, a multi-scale biosignal tokenizer that decomposes biosignals with multi-scale temporal convolutions and uses a phase-aware loss to achieve high-fidelity signal reconstruction, showing that high-quality tokenization matters for downstream performance.

Abstract

Biosignals such as electroencephalography (EEG), electrocardiography (ECG), and electromyography (EMG) encode physiological activity across multiple temporal and spectral scales, yielding representations that are rich but challenging for machine learning. Foundation models trained to predict masked signal tokens have shown promise in learning generalizable biosignal representations, yet their performance depends on the tokenizer's ability to preserve high-frequency dynamics and reconstruct signals with high fidelity. We introduce NeuroRVQ, a modality-adaptive biosignal tokenizer family designed for high-fidelity signal reconstruction. To capture the full frequency spectrum, NeuroRVQ decomposes biosignals into frequency-specific representations via multi-scale temporal convolutions, each encoded into hierarchical RVQ codebooks to preserve high-frequency detail, combined with a novel phase-aware training loss that respects the circular topology of Fourier phase. By tuning the temporal resolution, number and size of temporal kernels and RVQ depth, this design adapts to the spectro-temporal characteristics of each biosignal modality. To validate that tokenizer quality drives downstream performance, we train a simple masked-token foundation model for each modality (NeuroRVQ-FM) using the corresponding NeuroRVQ tokenizer. The NeuroRVQ-FM family achieves competitive or superior downstream performance compared to existing modality-specific foundation models, demonstrating that high-fidelity tokenization is a critical factor for effective biosignal modeling.
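
The abstract mentions a loss that respects the circular topology of Fourier phase. One common way to do this, shown below as an assumed stand-in rather than the paper's actual loss, is to penalize 1 - cos of the phase difference, so that angles differing by a multiple of 2π count as identical.

```python
import math


def circular_phase_loss(predicted, target):
    """Mean of 1 - cos(phase difference); unaffected by 2*pi wraparound."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(1.0 - math.cos(p - t) for p, t in zip(predicted, target))
    return total / len(predicted)


# Two phases just either side of the +/- pi boundary are nearly the same angle.
a, b = math.pi - 0.01, -math.pi + 0.01
print("circular loss:", circular_phase_loss([a], [b]))
print("naive absolute difference:", abs(a - b))
```

A plain absolute or squared difference treats the two angles in the example as far apart, while the circular loss treats them as close.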

2510.10528 2026-05-19 cs.CL cs.LG

Merlin's Whisper: Enabling Efficient Reasoning in Large Language Models via Black-box Persuasive Prompting

Heming Xia, Cunxiao Du, Rui Li, Chak Tou Leong, Yongqi Li, Wenjie Li

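Affiliations: Department of Computing, The Hong Kong Polytechnic University; Sea AI Lab; Peking University

AI summary: This paper proposes Whisper, an iterative refinement framework that uses black-box persuasive prompting to reduce token usage in large reasoning models (LRMs) while preserving performance, with an average token reduction of about 40% across benchmarks.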

Comments ACL 2026 (Long Paper), camera-ready version

AI中文摘要

大型推理模型（LRMs）通过逐步思考在解决复杂任务方面表现出色。然而，这种漫长的推理过程带来了显著的计算和延迟开销，阻碍了LRMs的实用部署。本文提出了一种通过黑盒说服提示来减轻LRMs过度思考的新方法。通过将LRMs视为黑盒通信者，我们研究如何说服它们生成简洁响应而不影响准确性。我们引入了Whisper，一个迭代细化框架，能够从多种视角生成高质量的说服提示。在多个基准测试中的实验表明，Whisper在保持性能的同时，能够显著减少token使用量。值得注意的是，Whisper在简单的GSM8K问题上对Qwen3模型系列实现了平均3倍的响应长度减少，并在所有基准测试中实现了平均约40%的token减少。对于闭源API，Whisper在MATH-500上分别使Claude-3.7和Gemini-2.5的token使用量减少了46%和50%。进一步分析显示，Whisper在数据领域、模型规模和家族中的广泛应用，凸显了黑盒说服提示作为提升LRM效率的实用策略的潜力。

英文摘要

Large reasoning models (LRMs) have demonstrated remarkable proficiency in tackling complex tasks through step-by-step thinking. However, this lengthy reasoning process incurs substantial computational and latency overheads, hindering the practical deployment of LRMs. This work presents a new approach to mitigating overthinking in LRMs via black-box persuasive prompting. By treating LRMs as black-box communicators, we investigate how to persuade them to generate concise responses without compromising accuracy. We introduce Whisper, an iterative refinement framework that generates high-quality persuasive prompts from diverse perspectives. Experiments across multiple benchmarks demonstrate that Whisper consistently reduces token usage while preserving performance. Notably, Whisper achieves a 3x reduction in average response length on simple GSM8K questions for the Qwen3 model series and delivers an average ~40% token reduction across all benchmarks. For closed-source APIs, Whisper reduces token usage on MATH-500 by 46% for Claude-3.7 and 50% for Gemini-2.5. Further analysis reveals the broad applicability of Whisper across data domains, model scales, and families, underscoring the potential of black-box persuasive prompting as a practical strategy for enhancing LRM efficiency.

2510.10140 2026-05-19 cs.LG cs.CR stat.ML

Adversarial Attacks on Downstream Weather Forecasting Models: Application to Tropical Cyclone Trajectory Prediction

对下游天气预测模型的对抗攻击：应用于热带气旋轨迹预测

Yue Deng, Francisco Santos, Pang-Ning Tan, Lifeng Luo

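Affiliations: Michigan State University

AI summary: This paper studies the vulnerability of deep learning-based weather forecasting models to adversarial attacks and proposes Cyc-Attack, which perturbs upstream forecasts so that downstream tropical cyclone trajectory predictions follow an attacker-specified path. It achieves a higher true positive rate with lower false alarm rates and perturbations that are harder to detect than those of conventional attacks.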

Comments Compared with the previous version, we added zeroth-order optimization methods as baselines, clarified the motivation for using a surrogate model, and provided a more detailed investigation of the upstream attack

AI中文摘要

基于深度学习的天气预测（DLWF）模型利用过去的天气观测数据生成未来的预测，支持广泛的应用，包括热带气旋（TC）预测。在本文中，我们研究了这些模型对对抗攻击的脆弱性，其中对上游预测的细微扰动可以改变下游TC轨迹预测。尽管最近对DLWF模型的对抗攻击研究有所增长，但仍然具有挑战性，即创建扰动的上游预测，使下游输出朝向攻击者指定的轨迹。首先，传统的TC检测系统是不透明的、非可微的黑箱，这使得标准的梯度基攻击不可行。其次，TC事件的极端稀有性导致严重的类别不平衡问题，使得开发扰动上游预测的方法变得困难，这些扰动产生的轨迹看起来真实并与攻击者的目标轨迹一致。为了克服这些限制，我们提出了Cyc-Attack，一种新的方法，用于扰动DLWF模型的上游预测以生成对抗性轨迹。所提出的方法使用可微的替代模型来近似TC检测器的输出，使梯度基攻击的应用成为可能。Cyc-Attack还采用了一种考虑偏度的损失函数和核扩张策略来解决不平衡问题。最后，基于距离的梯度加权方案和正则化用于约束扰动并消除不真实的轨迹，从而使对抗性上游预测更难以检测。我们的实验表明，Cyc-Attack在匹配攻击者目标轨迹方面具有更高的真实阳性率，同时具有更低的误报率和更隐蔽的扰动，优于传统攻击方法。

英文摘要

Deep learning-based weather forecasting (DLWF) models leverage past weather observations to generate future forecasts, supporting a wide range of downstream applications, including tropical cyclone (TC) prediction. In this paper, we investigate their vulnerability to adversarial attacks, where subtle perturbations to the upstream forecasts can alter the downstream TC trajectory predictions. Although research into adversarial attacks on DLWF models has grown recently, it remains challenging to craft perturbed upstream forecasts that steer the downstream outputs toward attacker-specified trajectories. First, conventional TC detection systems are opaque, non-differentiable black boxes, making standard gradient-based attacks infeasible. Second, the extreme rarity of TC events leads to a severe class imbalance problem, making it difficult to develop attack methods for perturbing upstream forecasts that produce realistic-looking cyclone paths aligned with the attacker's target trajectories. To overcome these limitations, we propose Cyc-Attack, a novel method for perturbing the upstream forecasts of DLWF models to generate adversarial trajectories. The proposed method uses a differentiable surrogate model to approximate the TC detector's output, enabling the application of gradient-based attacks. Cyc-Attack also employs a skewness-aware loss function with a kernel dilation strategy to address the imbalance problem. Finally, a distance-based gradient weighting scheme and regularization are used to constrain the perturbations and eliminate unrealistic-looking trajectories, thereby making the adversarial upstream forecasts less easily detectable. Our experiments show that Cyc-Attack achieves a higher true positive rate in matching the attacker's target trajectories, along with lower false alarm rates and stealthier perturbations than conventional attack methods.

2510.08886 2026-05-19 cs.CL cs.CE cs.IR

FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs

FinAuditing: 一个基于财务分类结构的多文档基准，用于评估LLMs

Yan Wang, Keyi Wang, Shanshan Yang, Jaisal Patel, Jeff Zhao, Fengran Mo, Xueqing Peng, Lingfei Qian, Yankai Chen, Víctor Gutiérrez-Basulto, Jimin Huang, Guojun Xiong, Xiao-Yang Liu, Xue Liu, Jian-Yun Nie

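Affiliations: Columbia University; Stevens Institute of Technology; Rensselaer Polytechnic Institute; University of Montreal; McGill University; MBZUAI; Cardiff University; The University of Manchester; Harvard University

AI summary: This paper introduces FinAuditing, a taxonomy-aligned, structure-aware multi-document benchmark built from real XBRL filings for evaluating large language models on financial auditing. Its three tasks, Financial Semantic Matching (FinSM), Financial Relationship Extraction (FinRE), and Financial Mathematical Reasoning (FinMR), reveal substantial gaps in concept retrieval, taxonomy-aware relation modeling, and consistent cross-document reasoning.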

Comments Accepted by SIGIR 2026 Resource Track. Pre-camera-ready version

AI中文摘要

超越简单的文本处理，财务审计需要在大规模披露中检测语义、结构和数值的一致性。由于财务报告以XBRL（一种受会计标准规范的结构化XML格式）提交，审计成为涉及概念对齐、分类定义的关系和跨文档一致性的结构化信息提取和推理问题。尽管大型语言模型（LLMs）在孤立的财务任务上表现出色，但其在专业级审计中的能力仍不明确。我们引入了FinAuditing，一个基于分类结构的基准，由真实的XBRL文件构建。它包含1,102个注释实例，平均超过33,000个标记，并定义了三个任务：财务语义匹配（FinSM）、财务关系提取（FinRE）和财务数学推理（FinMR）。对13种最先进的LLMs的评估揭示了概念检索、分类感知关系建模和跨文档一致性推理方面的显著差距。这些发现突显了需要现实且结构感知的基准的重要性。我们发布了评估代码（https://github.com/The-FinAI/FinAuditing）和数据集（https://huggingface.co/collections/TheFinAI/finauditing）。目前，该任务已成为正在进行的公开评估竞赛的官方基准（https://open-finance-lab.github.io/SecureFinAI_Contest_2026/）

英文摘要

Going beyond simple text processing, financial auditing requires detecting semantic, structural, and numerical inconsistencies across large-scale disclosures. As financial reports are filed in XBRL, a structured XML format governed by accounting standards, auditing becomes a structured information extraction and reasoning problem involving concept alignment, taxonomy-defined relations, and cross-document consistency. Although large language models (LLMs) show promise on isolated financial tasks, their capability in professional-grade auditing remains unclear. We introduce FinAuditing, a taxonomy-aligned, structure-aware benchmark built from real XBRL filings. It contains 1,102 annotated instances averaging over 33k tokens and defines three tasks: Financial Semantic Matching (FinSM), Financial Relationship Extraction (FinRE), and Financial Mathematical Reasoning (FinMR). Evaluations of 13 state-of-the-art LLMs reveal substantial gaps in concept retrieval, taxonomy-aware relation modeling, and consistent cross-document reasoning. These findings highlight the need for realistic, structure-aware benchmarks. We release the evaluation code at https://github.com/The-FinAI/FinAuditing and the dataset at https://huggingface.co/collections/TheFinAI/finauditing. The task currently serves as the official benchmark of an ongoing public evaluation contest at https://open-finance-lab.github.io/SecureFinAI_Contest_2026/.

2510.08702 2026-05-19 cs.CL

Scaling Laws for Code: A More Data-Hungry Regime

代码的缩放规律：一个更数据渴求的阶段

Xianzhen Luo, Wenzhen Zheng, Qingfu Zhu, Rongyi Zhang, Houyi Li, Siming Huang, YuanTao Fan, Wanxiang Che

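Affiliations: Harbin Institute of Technology; Chinese Academy of Sciences; Fudan University; Beijing University of Posts and Telecommunications

AI summary: This paper studies scaling laws for code through large-scale experiments. It finds that the Farseer law fits more accurately than the Chinchilla law, that Code LLMs scale effectively with model size but need a substantially higher data-to-parameter ratio than natural language, and that in code-NL mixtures natural language helps in resource-constrained scenarios but becomes a detriment at higher compute budgets.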

Comments Accepted by ACL2026

AI中文摘要

代码大型语言模型（LLMs）正在革新软件工程。然而，指导高效训练的缩放定律主要是在自然语言（NL）上分析的。鉴于代码和自然语言之间的根本差异，如严格的语法，这些定律是否直接适用于代码尚不清楚。为填补这一空白，我们进行了首次大规模的代码缩放定律实证研究，包括117次实验运行，模型大小从0.2B到3.8B，训练token从2B到128B。我们拟合了Chinchilla定律和Farseer定律。首先，结果表明，更具表现力的Farseer定律在准确性上更优。其次，分析显示代码LLMs随模型大小有效扩展。关键的是，代码代表了一个更数据渴求的阶段，需要比自然语言显著更高的数据与参数比。最后，对代码-自然语言混合数据的两组额外实验显示，自然语言在资源受限的场景下有益，但在更高计算预算下成为负担。

英文摘要

Code Large Language Models (LLMs) are revolutionizing software engineering. However, scaling laws that guide efficient training are predominantly analyzed on Natural Language (NL). Given the fundamental differences between code and NL, such as strict syntax, it is unclear whether these laws are directly applicable to code. To address this gap, we conduct the first large-scale empirical study of scaling laws for code, comprising 117 experimental runs with model sizes from 0.2B to 3.8B and training tokens from 2B to 128B. We fit the Chinchilla law and the Farseer law. First, the results show that the more expressive Farseer law offers greater accuracy. Second, the analysis reveals that Code LLMs scale effectively with model size. Crucially, code represents a more data-hungry regime, requiring a substantially higher data-to-parameter ratio than NL. Finally, two additional sets of experiments on code-NL mixtures show that NL benefits resource-constrained scenarios, but becomes a detriment at higher compute budgets.
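
To make the "data-to-parameter ratio" concrete, the sketch below uses the Chinchilla parametric form L(N, D) = E + A / N^alpha + B / D^beta and scans model sizes at a fixed compute budget, assuming the common approximation compute ≈ 6 · N · D. The coefficient values are illustrative placeholders, not fits reported by this paper, and the helper names are hypothetical.

```python
def chinchilla_loss(n_params, n_tokens, E, A, B, alpha, beta):
    """Chinchilla parametric loss: irreducible term plus model and data terms."""
    return E + A / n_params**alpha + B / n_tokens**beta

def best_split(compute, coeffs, steps=2000):
    """Scan model sizes from 1e6 to 1e12 at fixed compute (assumes C ~= 6 * N * D)."""
    best = None
    for i in range(1, steps):
        n = 10 ** (6 + 6 * i / steps)
        d = compute / (6 * n)
        loss = chinchilla_loss(n, d, **coeffs)
        if best is None or loss < best[0]:
            best = (loss, n, d)
    return best  # (loss, parameters, tokens)

# Placeholder coefficients for illustration only.
coeffs = dict(E=1.7, A=400.0, B=410.0, alpha=0.34, beta=0.28)

for budget in (1e19, 1e21):
    loss, n, d = best_split(budget, coeffs)
    print(f"C={budget:.0e}  N={n:.2e}  D={d:.2e}  D/N={d / n:.1f}")
```

Under this form, a "more data-hungry regime" corresponds to fitted coefficients whose optimum sits at a larger D/N for the same budget.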

2510.07239 2026-05-19 cs.CL

Red-Bandit: Test-Time Adaptation for LLM Red-Teaming via Bandit-Guided LoRA Experts

Red-Bandit：通过多臂老虎机引导的LoRA专家实现LLM红队测试的测试时自适应

Christos Ziakas, Nicholas Loo, Nishita Jain, Alessandra Russo

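Affiliations: Imperial College London

AI summary: This paper proposes Red-Bandit, a red-teaming framework that adapts at test time across attack styles. It post-trains a set of LoRA experts, one per attack style, with reinforcement learning that rewards unsafe prompts, and uses a multi-armed bandit policy to select among the experts at inference. It reports state-of-the-art results on AdvBench under sufficient exploration (ASR@10) while generating more human-readable prompts.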

Comments Accepted to the Main Conference at ACL 2026

AI中文摘要

自动化红队测试已成为在部署前审计大型语言模型（LLM）的可扩展方法，但现有方法缺乏有效适应模型特定漏洞的机制。我们介绍了Red-Bandit，一种红队测试框架，能够在线适应以识别和利用不同攻击风格（例如操纵、俚语）下的模型失败模式。Red-Bandit通过强化学习后训练一组参数高效的LoRA专家，每个专家专门针对特定的攻击风格，奖励生成不安全提示通过基于规则的安全模型。在推理时，多臂老虎机策略根据目标模型的响应安全性动态选择这些攻击风格专家，平衡探索和利用。Red-Bandit在足够的探索（ASR@10）下在AdvBench上实现了最先进的结果，同时生成更易于人类阅读的提示（更低的困惑度）。此外，Red-Bandit的老虎机策略还充当诊断工具，通过指示哪些攻击风格最有效引发不安全行为来揭示模型特定的漏洞。

英文摘要

Automated red-teaming has emerged as a scalable approach for auditing Large Language Models (LLMs) prior to deployment, yet existing approaches lack mechanisms to efficiently adapt to model-specific vulnerabilities at inference. We introduce Red-Bandit, a red-teaming framework that adapts online to identify and exploit model failure modes under distinct attack styles (e.g., manipulation, slang). Red-Bandit post-trains a set of parameter-efficient LoRA experts, each specialized for a particular attack style, using reinforcement learning that rewards the generation of unsafe prompts via a rule-based safety model. At inference, a multi-armed bandit policy dynamically selects among these attack-style experts based on the target model's response safety, balancing exploration and exploitation. Red-Bandit achieves state-of-the-art results on AdvBench under sufficient exploration (ASR@10), while producing more human-readable prompts (lower perplexity). Moreover, Red-Bandit's bandit policy serves as a diagnostic tool for uncovering model-specific vulnerabilities by indicating which attack styles most effectively elicit unsafe behaviors.
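
The abstract describes a multi-armed bandit policy that selects among style-specific experts while balancing exploration and exploitation, but it does not name the bandit algorithm. As a minimal sketch, assuming UCB1 and simulated binary rewards (the per-arm success rates below are hypothetical stand-ins for a safety judgment on the target model's response), the selection loop could look like this:

```python
import math
import random

def ucb1_select(counts, rewards, t):
    """Pick an arm: try each arm once, then maximize mean reward plus a UCB1 bonus."""
    for arm, c in enumerate(counts):
        if c == 0:
            return arm
    scores = [
        rewards[a] / counts[a] + math.sqrt(2 * math.log(t) / counts[a])
        for a in range(len(counts))
    ]
    return max(range(len(counts)), key=scores.__getitem__)

def run_bandit(success_prob, rounds, seed=0):
    """Simulate expert selection; success_prob holds hypothetical per-arm rates."""
    rng = random.Random(seed)
    counts = [0] * len(success_prob)
    rewards = [0.0] * len(success_prob)
    for t in range(1, rounds + 1):
        arm = ucb1_select(counts, rewards, t)
        reward = 1.0 if rng.random() < success_prob[arm] else 0.0
        counts[arm] += 1
        rewards[arm] += reward
    return counts

# Pull counts per arm; the arm with the highest simulated rate should dominate.
print(run_bandit([0.1, 0.2, 0.8], rounds=2000))
```

The pull counts double as the diagnostic signal the abstract mentions: the most-selected arm indicates which style was most effective in the simulation.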

2510.06388 2026-05-19 cs.LG cs.DS stat.ML

Truthful Calibration Errors for Multi-Class Prediction

多类预测中的诚实校准误差

Yuxuan Lu, Yifan Wu, Jason Hartline, Lunjia Hu

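Affiliations: Peking University; Northwestern University; Microsoft Research, New England; Northeastern University; Khoury College of Computer Sciences

AI summary: This paper studies the practical role of truthful calibration errors in multiclass prediction. It introduces perfectly truthful calibration errors for multidimensional linear properties of the label distribution, analyzes their decision-theoretic implications, and uses them to explain and mitigate the ranking robustness problem of binned calibration errors.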

AI中文摘要

校准预测之所以有用，是因为其数值可以被解释为概率。校准误差因此被广泛用于评估、比较和调整概率预测器。最近，Haghtalab等人（2024）引入了一个额外的要求：诚实性。如果预测器通过报告真实的条件标签分布来最小化其预期测量误差，则校准度量是诚实的。许多标准的经验校准误差是非诚实的：预测器可能通过扭曲其概率而不是报告真实值来显得更校准。我们研究了诚实性在多类预测中校准测量的实用作用。首先，我们引入了完美诚实校准误差以处理标签分布的多维线性属性，推广了Hartline等人（2025）中二元预测的诚实校准误差。此框架包括完整的多类校准和类内校准。我们还确定了置信度校准的诚实修正。其次，我们分析了这些诚实误差的决策理论影响。对于校准预测器，诚实校准误差保持了Blackwell主导性：更信息丰富的校准预测器不会产生更大的预期误差。第三，我们表明这种决策理论解释解释并缓解了已观察到的分箱校准误差的排名鲁棒性问题。经验上，非诚实的置信度校准误差在分箱数量变化时可能逆转模型排名，而我们的诚实误差在不同分箱选择下提供更稳定的排名。

英文摘要

Calibrated predictions are useful because their numerical values can be interpreted as probabilities. Calibration errors are therefore widely used to evaluate, compare, and tune probabilistic predictors. Recently, Haghtalab et al. (2024) introduced an additional requirement for such measures: truthfulness. A calibration measure is truthful if a predictor minimizes its expected measured error by reporting the true conditional label distribution. Many standard empirical calibration errors are non-truthful: a predictor may appear better calibrated by distorting its probabilities rather than reporting them truthfully. We study the practical role of truthfulness for calibration measurement in multiclass prediction. First, we introduce perfectly truthful calibration errors for multidimensional linear properties of the label distribution, generalizing the truthful calibration error for binary predictions in Hartline et al. (2025). This framework includes full multiclass calibration and classwise calibration. We also identify a truthful correction for confidence calibration. Second, we characterize the decision-theoretic implications of these truthful errors. For calibrated predictors, truthful calibration errors preserve the Blackwell dominance: a more informative calibrated predictor receives no larger expected error. Third, we show that this decision-theoretic interpretation explains and mitigates the well-observed ranking robustness problem of binned calibration errors. Empirically, non-truthful confidence-based errors can reverse model rankings when the number of bins changes, while our truthful errors give more stable rankings across binning choices.
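
For context on the binning sensitivity mentioned above, the sketch below computes the standard binned confidence calibration error (the kind of non-truthful measure the abstract contrasts with), not the truthful errors proposed in the paper. The data are made up for illustration and the function name is hypothetical.

```python
def binned_ece(confidences, correct, n_bins):
    """Expected calibration error with equal-width confidence bins on [0, 1]."""
    n = len(confidences)
    bins = [[0, 0.0, 0.0] for _ in range(n_bins)]  # count, confidence sum, correct sum
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into the last bin
        bins[idx][0] += 1
        bins[idx][1] += conf
        bins[idx][2] += hit
    return sum(
        (cnt / n) * abs(hit_sum / cnt - conf_sum / cnt)
        for cnt, conf_sum, hit_sum in bins
        if cnt
    )

# Made-up predictions: top-class confidence and whether the prediction was right.
conf = [0.55, 0.65, 0.75, 0.85, 0.95, 0.60, 0.70, 0.90]
hits = [1, 0, 1, 1, 1, 0, 1, 0]

for k in (1, 2, 5, 10):
    print(k, round(binned_ece(conf, hits, k), 4))
```

The reported value changes with the number of bins even though the predictions are fixed, which is why rankings of models under this measure can depend on the binning choice.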

2510.05921 2026-05-19 cs.CL cs.LG

Prompt reinforcing for long-term planning of large language models

通过提示强化实现大语言模型的长期规划

Hsien-Chin Lin, Benjamin Matthias Ruppik, Carel van Niekerk, Chia-Hao Shen, Michael Heck, Nurul Lubis, Renato Vukovic, Shutong Feng, Milica Gašić

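Affiliations: Heinrich-Heine-Universität Düsseldorf

AI summary: This paper proposes a prompt optimisation framework inspired by reinforcement learning that enables long-term planning by modifying only the task instruction prompt of an LLM-based agent. It improves performance on multi-turn tasks such as text-to-SQL and task-oriented dialogue, generalises across LLM-based agents, and can use diverse LLMs as meta-prompting agents.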

AI中文摘要

大型语言模型（LLMs）在广泛自然语言处理任务中取得了显著成功，并可通过提示进行适应。然而，它们在多轮交互中仍表现不足，常依赖错误的早期假设，无法随时间跟踪用户目标，使此类任务尤其具有挑战性。先前对话系统的工作表明，长期规划对于处理交互任务至关重要。在本工作中，我们提出了一种受强化学习启发的提示优化框架，仅通过修改LLM代理的任务指令提示即可实现此类规划。通过生成回合间的反馈并利用经验回放进行提示重写，我们的方法在文本到SQL和任务导向对话等多轮任务中显示出显著改进。此外，该方法能跨不同LLM代理泛化，并可利用多种LLM作为元提示代理。这促使未来在受强化学习启发的无参数优化方法上的研究。

英文摘要

Large language models (LLMs) have achieved remarkable success in a wide range of natural language processing tasks and can be adapted through prompting. However, they remain suboptimal in multi-turn interactions, often relying on incorrect early assumptions and failing to track user goals over time, which makes such tasks particularly challenging. Prior works in dialogue systems have shown that long-term planning is essential for handling interactive tasks. In this work, we propose a prompt optimisation framework inspired by reinforcement learning, which enables such planning to take place by only modifying the task instruction prompt of the LLM-based agent. By generating turn-by-turn feedback and leveraging experience replay for prompt rewriting, our proposed method shows significant improvement in multi-turn tasks such as text-to-SQL and task-oriented dialogue. Moreover, it generalises across different LLM-based agents and can leverage diverse LLMs as meta-prompting agents. This warrants future research in reinforcement learning-inspired parameter-free optimisation methods.

2510.01857 2026-05-19 cs.AI

Learning Reasoning Rewards from Expert Demonstrations with Inverse Reinforcement Learning

利用逆强化学习从专家示范中学习推理奖励

Claudio Fanconi, Nicolás Astorga, Mihaela van der Schaar

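Affiliations: University of Cambridge

AI summary: This paper proposes Reasoning Adversarial Inverse Reinforcement Learning (R-AIRL), which infers process-level reasoning rewards from expert demonstrations to address the limitations of supervised fine-tuning. Experiments on GSM8K, MMLU-Pro, and MedReason show that the learned reward is useful for training, inference-time reranking, and process-level evaluation.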

AI中文摘要

教学大型语言模型（LLMs）在训练后进行推理通常依赖于具有显式结果或过程基础的强化学习奖励函数。然而，在许多现实世界设置中，获得或定义此类奖励函数是困难的，尤其是对于复杂任务，使从专家示范中学习成为有吸引力的替代方法。主流方法监督微调（SFT）训练模型直接模仿专家推理轨迹，但受到离策略学习的一般限制：性能可能对推理时偏离演示中明确覆盖的状态敏感。为了解决这个问题，我们提出了推理对抗逆强化学习（R-AIRL）。与其模仿专家的推理，R-AIRL从专家的思维链中推断出底层的过程级奖励。通过在GSM8K、MMLU-Pro和MedReason上进行实验，我们展示了通过R-AIRL学习的推理奖励函数可以有效地用于整个训练和推理流程：（1）为训练提供训练信号，在大多数考虑的设置中优于SFT，（2）用于推理时的重排序，将pass@1提高高达17.4个点，（3）用于过程级评估，以高达86.1%的准确性局部化推理失败。总体而言，R-AIRL弥合了模仿学习和基于奖励的优化，使从专家思考轨迹中提取有意义的推理信号成为可能。

英文摘要

Teaching large language models (LLMs) to reason during post-training typically relies on reinforcement learning with explicit outcome- or process-based reward functions. However, in many real-world settings, obtaining or defining such reward functions is difficult, especially for complex tasks, making learning from expert demonstrations an attractive alternative. The dominant approach, supervised fine-tuning (SFT), trains models to imitate expert reasoning traces directly, but suffers from the general limitations of off-policy learning: performance can be fragile to inference-time deviations from states explicitly covered by the demonstrations. To address this, we propose Reasoning Adversarial Inverse Reinforcement Learning (R-AIRL). Rather than imitating the expert's reasoning, R-AIRL infers the underlying process-level reward from the expert Chain-of-Thoughts. Through experiments on GSM8K, MMLU-Pro and MedReason we show that the reasoning reward function learned with R-AIRL can be effectively used throughout the training and inference pipeline: (1) to provide a training signal for post-training, outperforming SFT in most of the considered settings, (2) for inference-time reranking, improving pass@1 by up to 17.4 points, and (3) for process-level evaluation, localising reasoning failures with up to 86.1% accuracy. Overall, R-AIRL bridges imitation learning and reward-based optimisation, enabling the extraction of meaningful reasoning signals from expert thinking traces.

2510.01479 2026-05-19 cs.LG cs.SY eess.SY

Density-Ratio Weighted Behavioral Cloning: Learning Control Policies from Corrupted Datasets

密度比加权行为克隆：从受污染的数据集中学习控制策略

Shriram Karpoora Sundara Pandian, Ali Baheri

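Affiliations: Department of Cybersecurity; Rochester Institute of Technology; Mechanical Engineering Department

AI summary: This paper proposes Density-Ratio Weighted Behavioral Cloning, a robust imitation learning method that uses a small, verified clean reference set to estimate trajectory-level density ratios. The ratios prioritize clean expert behavior and down-weight or discard corrupted data, improving policy performance without knowledge of the contamination mechanism.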

AI中文摘要

离线强化学习（RL）通过固定数据集进行策略优化，使其适用于在线探索不可行的安全关键应用。然而，这些数据集常受到对抗性污染、系统错误或低质量样本的污染，导致标准行为克隆（BC）和离线RL方法的策略性能下降。本文介绍了密度比加权行为克隆（Weighted BC），一种鲁棒的模仿学习方法，通过二元判别器估计轨迹级密度比，这些比值被截断并用作BC目标中的权重，以优先考虑干净的专家行为，同时降低或丢弃受污染的数据，而无需了解污染机制。我们建立了理论保证，证明在有限样本界限下，能够收敛到干净的专家策略，这些界限与污染率无关。建立了一个全面的评估框架，该框架包含各种污染协议（奖励、状态、转换和动作）在连续控制基准上的应用。实验表明，Weighted BC即使在高污染比下也能保持接近最优性能，优于传统BC、批量约束Q学习（BCQ）和行为正则化的Actor-Critic（BRAC）等基线方法。

英文摘要

Offline reinforcement learning (RL) enables policy optimization from fixed datasets, making it suitable for safety-critical applications where online exploration is infeasible. However, these datasets are often contaminated by adversarial poisoning, system errors, or low-quality samples, leading to degraded policy performance in standard behavioral cloning (BC) and offline RL methods. This paper introduces Density-Ratio Weighted Behavioral Cloning (Weighted BC), a robust imitation learning approach that uses a small, verified clean reference set to estimate trajectory-level density ratios via a binary discriminator. These ratios are clipped and used as weights in the BC objective to prioritize clean expert behavior while down-weighting or discarding corrupted data, without requiring knowledge of the contamination mechanism. We establish theoretical guarantees showing convergence to the clean expert policy with finite-sample bounds that are independent of the contamination rate. A comprehensive evaluation framework is established, which incorporates various poisoning protocols (reward, state, transition, and action) on continuous control benchmarks. Experiments demonstrate that Weighted BC maintains near-optimal performance even at high contamination ratios, outperforming baselines such as traditional BC, batch-constrained Q-learning (BCQ), and behavior regularized actor-critic (BRAC).
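
The weighting step described above can be sketched with the standard discriminator identity: if a binary classifier outputs d = P(clean | trajectory) with balanced classes, the density ratio is d / (1 - d). The clipping threshold, the normalization by the weight sum, and all function names below are assumptions made for illustration; the abstract does not specify them.

```python
def density_ratio(d, eps=1e-6):
    """Turn a discriminator probability P(clean | trajectory) into a density ratio."""
    d = min(max(d, eps), 1.0 - eps)  # keep the ratio finite at d = 0 or d = 1
    return d / (1.0 - d)

def clipped_weights(disc_probs, w_max):
    """Clip each ratio from above so that no single trajectory dominates the loss."""
    return [min(density_ratio(d), w_max) for d in disc_probs]

def weighted_bc_loss(per_traj_losses, weights):
    """Weighted average of per-trajectory behavioral cloning losses."""
    total = sum(weights)
    return sum(w * l for w, l in zip(weights, per_traj_losses)) / total

# Hypothetical values: the second trajectory looks corrupted to the discriminator.
probs = [0.9, 0.1]
losses = [0.1, 2.0]
weights = clipped_weights(probs, w_max=5.0)
print(weights, weighted_bc_loss(losses, weights))
```

With these made-up numbers the likely-corrupted trajectory contributes little to the objective, so the weighted loss sits close to the clean trajectory's loss.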

2510.00304 2026-05-19 cs.LG cs.AI

Barriers for Learning in an Evolving World: Mathematical Understanding of Loss of Plasticity

在不断变化的世界中学习的障碍：对学习能力丧失的数学理解

Amir Joudaki, Giulia Lanzillotta, Mohammad Samragh Razlighi, Iman Mirzadeh, Keivan Alizadeh, Thomas Hofmann, Mehrdad Farajtabar, Fartash Faghri

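Affiliations: ETH Zürich; Apple

AI summary: This paper studies loss of plasticity (LoP), which degrades deep learning models in non-stationary environments. Using dynamical systems theory, it identifies two main mechanisms behind LoP and explores mitigation strategies.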

AI中文摘要

深度学习模型在静态数据上表现优异，但在非静态环境中因一种称为学习能力丧失（LoP）的现象而表现不佳，即其未来学习能力下降。本文首次从原理上研究了基于梯度的学习中的LoP。基于动力系统理论，我们通过在参数空间中识别稳定的流形来正式定义LoP，这些流形会捕获梯度轨迹。我们的分析揭示了两种主要机制，这些机制创造了这些陷阱：来自激活饱和的冻结单元和来自表征冗余的克隆单元流形。我们的框架揭示了一个根本性的矛盾：在静态设置中促进泛化的属性，如低秩表示和简单性偏差，直接在持续学习场景中促成LoP。我们通过数值模拟验证了我们的理论分析，并探讨了架构选择或针对性扰动作为潜在的缓解策略。

英文摘要

Deep learning models excel in stationary data but struggle in non-stationary environments due to a phenomenon known as loss of plasticity (LoP), the degradation of their ability to learn in the future. This work presents a first-principles investigation of LoP in gradient-based learning. Grounded in dynamical systems theory, we formally define LoP by identifying stable manifolds in the parameter space that trap gradient trajectories. Our analysis reveals two primary mechanisms that create these traps: frozen units from activation saturation and cloned-unit manifolds from representational redundancy. Our framework uncovers a fundamental tension: properties that promote generalization in static settings, such as low-rank representations and simplicity biases, directly contribute to LoP in continual learning scenarios. We validate our theoretical analysis with numerical simulations and explore architectural choices or targeted perturbations as potential mitigation strategies.
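
One of the two mechanisms named above, frozen units from activation saturation, can be illustrated with a single ReLU unit: once its pre-activation is non-positive on every input, the gradient of the loss with respect to its parameters is exactly zero, so gradient descent cannot move it. This is a textbook illustration under a squared-error loss, not the paper's formal construction, and the function name is hypothetical.

```python
def relu_unit_grad(w, b, xs, ys):
    """Mean gradient of the squared error of a = relu(w * x + b) with respect to (w, b)."""
    gw, gb = 0.0, 0.0
    for x, y in zip(xs, ys):
        z = w * x + b
        a = max(z, 0.0)
        dz = 2.0 * (a - y) * (1.0 if z > 0 else 0.0)  # ReLU passes gradient only when z > 0
        gw += dz * x
        gb += dz
    return gw / len(xs), gb / len(xs)

xs = [0.5, 1.0, 2.0]
ys = [1.0, 2.0, 3.0]

print(relu_unit_grad(1.0, 0.0, xs, ys))    # active unit: non-zero gradient
print(relu_unit_grad(1.0, -10.0, xs, ys))  # saturated unit: gradient is zero
```

Because the saturated unit receives no gradient on this data, it stays where it is unless the inputs or the parameters are perturbed from outside, which is the kind of trap the abstract describes.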

2509.25969 2026-05-19 cs.CV

A Multi-purpose Tracking Framework for Salmon Welfare Monitoring in Challenging Environments

一种用于挑战性环境中鲑鱼福利监测的多用途跟踪框架

Espen Uri Høgstedt, Christian Schellewald, Annette Stahl, Rudolf Mester

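Affiliations: Norwegian University of Science and Technology; SINTEF Ocean

AI summary: This paper proposes a multi-purpose tracking framework for automated salmon welfare monitoring in challenging environments. A pose estimation network extracts bounding boxes around salmon and their body parts to handle difficulties specific to underwater salmon scenes, and the authors construct two new datasets to assess salmon tracking challenges.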

Comments Accepted to the Joint Workshop on Marine Vision 2025 (CVAUI & AAMVEM), held in conjunction with ICCV 2025

DOI: 10.1109/ICCVW69036.2025.00225

AI中文摘要

基于计算机视觉（CV）的连续、自动化和精确的鲑鱼福利监测是减少工业网箱养鱼中鲑鱼死亡率和改善鲑鱼福利的关键步骤。现有的CV方法用于确定福利指标主要集中在单一指标上，并依赖于其他应用领域的对象检测器和跟踪器来帮助其福利指标计算算法。这在实际应用中带来了高资源需求，因为每个指标必须单独计算。此外，这些方法在水下鲑鱼场景中容易受到物体遮挡、相似物体外观和相似物体运动等困难的影响。为了解决这些挑战，我们提出了一种灵活的跟踪框架，该框架使用姿态估计网络提取鲑鱼及其对应身体部位的边界框，并利用身体部位的信息，通过专门的模块，来解决水下鲑鱼场景中的特定挑战。随后，高细节的身体部位跟踪被用于计算福利指标。我们构建了两个新的数据集，评估两个鲑鱼跟踪挑战：拥挤场景中的鲑鱼ID转移和转弯期间的鲑鱼ID切换。我们的方法在两个鲑鱼跟踪挑战中均优于当前最先进的行人跟踪器BoostTrack。此外，我们创建了一个用于计算鲑鱼尾鳍拍打波长的数据集，证明了我们的身体部位跟踪方法适合基于尾鳍分析的自动化福利监测。数据集和代码可在https://github.com/espenbh/BoostCompTrack上获得。

英文摘要

Computer Vision (CV)-based continuous, automated and precise salmon welfare monitoring is a key step toward reduced salmon mortality and improved salmon welfare in industrial aquaculture net pens. Available CV methods for determining welfare indicators focus on single indicators and rely on object detectors and trackers from other application areas to aid their welfare indicator calculation algorithm. This comes with a high resource demand for real-world applications, since each indicator must be calculated separately. In addition, the methods are vulnerable to difficulties in underwater salmon scenes, such as object occlusion, similar object appearance, and similar object motion. To address these challenges, we propose a flexible tracking framework that uses a pose estimation network to extract bounding boxes around salmon and their corresponding body parts, and exploits information about the body parts, through specialized modules, to tackle challenges specific to underwater salmon scenes. Subsequently, the high-detail body part tracks are employed to calculate welfare indicators. We construct two novel datasets assessing two salmon tracking challenges: salmon ID transfers in crowded scenes and salmon ID switches during turning. Our method outperforms the current state-of-the-art pedestrian tracker, BoostTrack, for both salmon tracking challenges. Additionally, we create a dataset for calculating salmon tail beat wavelength, demonstrating that our body part tracking method is well-suited for automated welfare monitoring based on tail beat analysis. Datasets and code are available at https://github.com/espenbh/BoostCompTrack.

2509.21820 2026-05-19 cs.CL

Can LLMs Generate and Solve Linguistic Olympiad Puzzles?

大语言模型能否生成并解决语言奥林匹克谜题？

Neh Majmudar, Elena Filatova

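Affiliations: CUNY

AI summary: This paper studies how well large language models generate and solve linguistic puzzles. LLMs outperform humans on most puzzle types but not on puzzles centered on writing systems or on understudied languages, and the authors argue that puzzle generation can help popularize linguistics.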

Comments Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025)

DOI: 10.18653/v1/2025.emnlp-main.969

AI中文摘要

在本文中，我们介绍了一种新的任务组合：语言谜题的解决方案和生成。我们专注于用于高中生的语言奥林匹克谜题。我们首先扩展了现有基准，以解决语言谜题的任务。我们探索了大型语言模型（LLMs）在解决语言谜题中的应用，包括最近的最先进的模型，如OpenAI的o1，在各种语言主题上的表现。我们证明，LLMs在大多数谜题类型上优于人类，除了那些以书写系统为中心的谜题，以及不为人知的语言。我们利用谜题解决实验的洞察力，指导了新的谜题生成任务。我们相信，即使对于相对简单的谜题，自动化谜题生成也有望扩大对语言学的兴趣，并将该领域介绍给更广泛的受众。这一发现突显了语言谜题生成作为研究任务的重要性：此类谜题不仅能促进语言学，还能支持对稀有和不为人知语言的知识传播。

英文摘要

In this paper, we introduce a combination of novel and exciting tasks: the solution and generation of linguistic puzzles. We focus on puzzles used in Linguistic Olympiads for high school students. We first extend the existing benchmark for the task of solving linguistic puzzles. We explore the use of Large Language Models (LLMs), including recent state-of-the-art models such as OpenAI's o1, for solving linguistic puzzles, analyzing their performance across various linguistic topics. We demonstrate that LLMs outperform humans on most puzzle types, except for those centered on writing systems and those involving understudied languages. We use the insights from puzzle-solving experiments to direct the novel task of puzzle generation. We believe that automating puzzle generation, even for relatively simple puzzles, holds promise for expanding interest in linguistics and introducing the field to a broader audience. This finding highlights the importance of linguistic puzzle generation as a research task: such puzzles can not only promote linguistics but also support the dissemination of knowledge about rare and understudied languages.

2509.19102 2026-05-19 cs.RO cs.AI cs.CV

FUNCanon: Learning Pose-Aware Action Primitives via Functional Object Canonicalization for Generalizable Robotic Manipulation

FUNCanon: 通过功能对象规范化学习姿态感知的动作原语以实现通用的机器人操作

Hongli Xu, Lei Zhang, Xiaoyue Hu, Boyang Zhong, Kaixin Bai, Zoltán-Csaba Márton, Zhenshan Bing, Zhaopeng Chen, Alois Christian Knoll, Jianwei Zhang

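Affiliations: TAMS (Technical Aspects of Multimodal Systems), Department of Informatics, University of Hamburg; Technical University of Munich; Agile Robots SE

AI summary: This paper proposes FunCanon, a framework that learns pose-aware action primitives through functional object canonicalization for generalizable robotic manipulation. It decomposes long-horizon manipulation tasks into action chunks defined by an actor, verb, and object, which makes policies more composable and reusable.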

Comments project website: https://sites.google.com/view/funcanon, 11 pages

AI中文摘要

通用机器人技能从端到端演示中通常会导致任务特定的策略，这些策略难以超越训练分布进行泛化。因此，我们引入FUNCanon框架，将长周期操作任务转换为一系列动作片段，每个片段由主体、动词和对象定义。这些片段将策略学习聚焦于动作本身，而不是孤立的任务，从而实现组合性和重用性。为了使策略具有姿态感知和类别通用性，我们对功能对象进行规范化，通过功能对齐和自动操作轨迹转移，利用大型视觉语言模型的 affordance 信息将对象映射到共享的功能框架中。一个以对象为中心和动作为中心的扩散策略FuncDiffuser在对齐的数据上进行训练，自然尊重对象的 affordances 和姿态，简化了学习并提高了泛化能力。在模拟和现实基准上的实验表明，该方法在类别层面实现了泛化，跨任务行为重用和鲁棒的sim2real部署，显示功能规范化为复杂操作领域可扩展模仿学习提供了强大的归纳偏置。演示细节和补充材料可在我们的项目网站上获得：https://sites.google.com/view/funcanon。

英文摘要

Learning general-purpose robotic skills from end-to-end demonstrations often leads to task-specific policies that fail to generalize beyond the training distribution. Therefore, we introduce FunCanon, a framework that converts long-horizon manipulation tasks into sequences of action chunks, each defined by an actor, verb, and object. These chunks focus policy learning on the actions themselves, rather than isolated tasks, enabling compositionality and reuse. To make policies pose-aware and category-general, we perform functional object canonicalization for functional alignment and automatic manipulation trajectory transfer, mapping objects into shared functional frames using affordance cues from large vision language models. An object-centric and action-centric diffusion policy, FuncDiffuser, trained on this aligned data naturally respects object affordances and poses, simplifying learning and improving generalization ability. Experiments on simulated and real-world benchmarks demonstrate category-level generalization, cross-task behavior reuse, and robust sim2real deployment, showing that functional canonicalization provides a strong inductive bias for scalable imitation learning in complex manipulation domains. Details of the demo and supplemental material are available on our project website https://sites.google.com/view/funcanon.

2509.18150 2026-05-19 cs.LG cs.AI

Improving MLLM Training Efficiency via Stage-Aware Sparsity

通过阶段感知稀疏性提升MLLM训练效率

Kean Shi, Liang Chen, Haozhe Zhao, Baobao Chang

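Affiliations: Peking University; University of Illinois Urbana-Champaign

AI summary: This paper proposes the Sparse Training Scheme (STS), a training-efficient framework based on sparse representations whose stage-aware design adapts to the redundancy in each training stage. A Visual Token Compressor and a Layer Dynamic Skipper reduce computational overhead, and the approach is evaluated on multiple MLLM architectures.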

AI中文摘要

多模态大语言模型（MLLMs）在各种领域中表现出色，但训练效率低下，由于长输入序列和未充分利用的层间操作导致大量计算冗余。值得注意的是，这种冗余并非静态，而是随训练阶段变化。基于此观察，我们关注训练过程本身，提出了一种基于稀疏表示的高效训练框架，称为稀疏训练方案（STS）。不同于统一的稀疏性策略，STS采用阶段感知设计，适应训练过程中不同的冗余来源。具体而言，该框架包含两个互补组件：视觉标记压缩器，通过在模态对齐过程中压缩视觉标记来减少信息负载；层动态跳过器，通过在指令微调过程中动态跳过不必要的层来减轻计算开销。我们的方法广泛适用于多种MLLM架构，并已在多个基准上进行了广泛评估，证明了其有效性和效率。

英文摘要

Multimodal Large Language Models (MLLMs) have demonstrated outstanding performance across a variety of domains. However, training MLLMs is often inefficient, as much of the computation is redundant due to the long input sequences from multimodal data and underutilized inter-layer operations. Notably, such redundancy is not static but varies across different stages of training. Building on this observation, we shift the focus to the training process itself and propose a training-efficient framework based on sparse representations, termed the Sparse Training Scheme (STS). Instead of applying a uniform sparsity strategy, STS adopts a stage-aware design that adapts to different sources of redundancy during training. Specifically, the framework consists of two complementary components: the Visual Token Compressor, which reduces the information load by compressing visual tokens during modality alignment, and the Layer Dynamic Skipper, which mitigates computational overhead by dynamically skipping unnecessary layers during instruction tuning. Our approach is broadly applicable to diverse MLLM architectures and has been extensively evaluated on multiple benchmarks, demonstrating its effectiveness and efficiency.

2509.16391 2026-05-19 cs.LG cs.AI cs.CV

CoUn: Empowering Machine Unlearning via Contrastive Learning

CoUn: 通过对比学习赋能机器无学习

Yasser H. Khalil, Mehdi Setayesh, Hongliang Li

Affiliations: Huawei Noah's Ark Lab

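AI summary: This paper proposes CoUn, a framework that adjusts learned data representations through contrastive learning and supervised learning applied to the retain data, improving the effectiveness of machine unlearning. Experiments show that it outperforms existing methods across multiple datasets and model architectures.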


Abstract

Machine unlearning (MU) aims to remove the influence of specific "forget" data from a trained model while preserving its knowledge of the remaining "retain" data. Existing MU methods based on label manipulation or model weight perturbations often achieve limited unlearning effectiveness. To address this, we introduce CoUn, a novel MU framework inspired by the observation that a model retrained from scratch using only retain data classifies forget data based on their semantic similarity to the retain data. CoUn emulates this behavior by adjusting learned data representations through contrastive learning (CL) and supervised learning, applied exclusively to retain data. Specifically, CoUn (1) leverages semantic similarity between data samples to indirectly adjust forget representations using CL, and (2) maintains retain representations within their respective clusters through supervised learning. Extensive experiments across various datasets and model architectures show that CoUn consistently outperforms state-of-the-art MU baselines in unlearning effectiveness. Additionally, integrating our CL module into existing baselines empowers their unlearning effectiveness.
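The retain-only objective described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say which contrastive loss is used, so an InfoNCE-style loss over two augmented views of each retain sample stands in for it, and the names `info_nce`, `retain_only_loss`, and `cl_weight` are hypothetical.

```python
import math


def cross_entropy(logits, label):
    """Numerically stable softmax cross-entropy for one sample."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[label]


def _normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(view_a, view_b, temperature=0.5):
    """InfoNCE-style loss (assumed stand-in for the paper's CL term).

    view_a[i] and view_b[i] are embeddings of two views of retain sample i;
    every other row of view_b acts as a negative for anchor i.
    """
    a = [_normalize(v) for v in view_a]
    b = [_normalize(v) for v in view_b]
    total = 0.0
    for i, anchor in enumerate(a):
        sims = [sum(p * q for p, q in zip(anchor, other)) / temperature
                for other in b]
        total += cross_entropy(sims, i)  # the positive pair sits at index i
    return total / len(a)


def retain_only_loss(logits, labels, view_a, view_b, cl_weight=1.0):
    """Supervised term plus contrastive term, both computed on retain data only."""
    ce = sum(cross_entropy(l, y) for l, y in zip(logits, labels)) / len(labels)
    return ce + cl_weight * info_nce(view_a, view_b)
```

In this sketch no forget sample enters either term, which mirrors the abstract's statement that forget representations are adjusted only indirectly.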


2509.02351 2026-05-19 cs.CV cs.AI cs.LG

Ordinal Adaptive Correction: A Data-Centric Approach to Ordinal Image Classification with Noisy Labels


Alireza Sedighi Moghaddam, Mohammad Reza Mohammadi

Affiliations: School of Computer Engineering, Iran University of Science and Technology

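AI summary: This paper proposes ORDAC, a data-centric method for ordinal image classification. It uses label distribution learning to model the inherent ambiguity and uncertainty of ordinal labels and dynamically adjusts the mean and standard deviation of each sample's label distribution, thereby correcting noisy labels and improving model performance.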

Comments 10 pages, 5 figures, 5 tables


Abstract

Labeled data is a fundamental component in training supervised deep learning models for computer vision tasks. However, the labeling process, especially for ordinal image classification where class boundaries are often ambiguous, is prone to error and noise. Such label noise can significantly degrade the performance and reliability of machine learning models. This paper addresses the problem of detecting and correcting label noise in ordinal image classification tasks. To this end, a novel data-centric method called ORDinal Adaptive Correction (ORDAC) is proposed for adaptive correction of noisy labels. The proposed approach leverages the capabilities of Label Distribution Learning (LDL) to model the inherent ambiguity and uncertainty present in ordinal labels. During training, ORDAC dynamically adjusts the mean and standard deviation of the label distribution for each sample. Rather than discarding potentially noisy samples, this approach aims to correct them and make optimal use of the entire training dataset. The effectiveness of the proposed method is evaluated on benchmark datasets for age estimation (Adience) and disease severity detection (Diabetic Retinopathy) under various asymmetric Gaussian noise scenarios. Results show that ORDAC and its extended versions (ORDAC_C and ORDAC_R) lead to significant improvements in model performance. For instance, on the Adience dataset with 40% noise, ORDAC_R reduced the mean absolute error from 0.86 to 0.62 and increased the recall metric from 0.37 to 0.49. The method also demonstrated its effectiveness in correcting intrinsic noise present in the original datasets. This research indicates that adaptive label correction using label distributions is an effective strategy to enhance the robustness and accuracy of ordinal classification models in the presence of noisy data.
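To make the label-distribution idea concrete, the sketch below turns a per-sample mean and standard deviation into a distribution over ordinal classes. It assumes a discretized Gaussian, a common choice in label distribution learning; the abstract does not give ORDAC's exact parameterization or update rule, and the function names and the class count of 8 are hypothetical.

```python
import math


def gaussian_label_distribution(mean, std, num_classes):
    """Discretized Gaussian over ordinal classes 0..num_classes-1.

    mean and std are the two per-sample quantities that, per the abstract,
    are adjusted during training; std must be positive.
    """
    weights = [math.exp(-0.5 * ((k - mean) / std) ** 2)
               for k in range(num_classes)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_class(dist):
    """Expected ordinal class under a label distribution."""
    return sum(k * p for k, p in enumerate(dist))


# A confident label (small std) versus an ambiguous one (large std).
confident = gaussian_label_distribution(mean=3.0, std=0.5, num_classes=8)
ambiguous = gaussian_label_distribution(mean=3.0, std=1.5, num_classes=8)
```

Shifting `mean` moves the soft label toward a neighboring class, while widening `std` expresses more uncertainty about the label, so a suspected noisy sample can be corrected rather than discarded.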


2508.20836 2026-05-19 cs.RO math.OC

First Experimental Demonstration of Natural Hovering Extremum Seeking: A New Paradigm in Flapping Flight Physics


Ahmed A. Elgohary, Rohan Palanikumar, Simone Martini, Sameh A. Eisa

Affiliations: Department of Aerospace Engineering and Engineering Mechanics, University of Cincinnati, Cincinnati, Ohio 45221, USA

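AI summary: This paper presents the first experimental validation of Natural Hovering Extremum Seeking (NH-ES), a new paradigm, showing how stable hovering flight can be achieved through a model-free, real-time feedback mechanism that exploits the flapping wing's own natural oscillations.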


Abstract

In this letter, we report the first experimental demonstration of the recently emerged new paradigm in hovering and flapping flight physics called (Natural Hovering Extremum Seeking (NH-ES)) [doi.org/10.1103/4dm4-kc4g], which theorized that stable hovering flight physics observed in nature by flapping insects and hummingbirds can be generated via a model-free, real-time, computationally-basic, sensory-based feedback mechanism that only needs the built-in natural oscillations of the flapping wing as both the control and the propulsive input. We run experiments of moth-like, light source-seeking, on a flapping-wing body in a total model-free setting that is agnostic to morphological parameters and body/aerodynamic models. We show that the flapping body using NH-ES gains altitude and stabilizes autonomously the servos responsible for flapping, including with pitching dynamics (believed in literature to be a main reason of instability in open-loop hovering). The flapping body effectively/stably hovers about the light source, needing only feedback of local measurements of light intensity. Our results were also achieved under delay/noise effects, supporting earlier observations that NH-ES is robust against potential processing delays and noisy-sensations.
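For readers unfamiliar with extremum seeking, the sketch below shows the classical perturbation-based scheme on a one-dimensional toy problem. It is not NH-ES itself: here an explicit sinusoidal dither is injected, whereas the abstract says NH-ES uses the wing's built-in flapping oscillation in that role. The quadratic light-intensity map, the gains, and all names are assumptions chosen for illustration.

```python
import math


def light_intensity(theta, source=2.0):
    """Toy measurement: intensity peaks when position theta equals the source."""
    return 1.0 - (theta - source) ** 2


def extremum_seek(measure, theta0=0.0, a=0.2, omega=20.0, k=0.5,
                  omega_h=2.0, dt=0.005, steps=16000):
    """Classical model-free extremum seeking with forward-Euler integration.

    Only measure() is queried; its formula and gradient are never used.
    """
    theta_hat = theta0
    eta = measure(theta0)  # low-pass state tracking the slow part of y
    for n in range(steps):
        dither = math.sin(omega * n * dt)
        y = measure(theta_hat + a * dither)  # probe around the estimate
        grad_est = (y - eta) * dither        # demodulate the high-passed signal
        theta_hat += dt * k * grad_est       # integrate toward the maximum
        eta += dt * omega_h * (y - eta)      # update the low-pass state
    return theta_hat


estimate = extremum_seek(light_intensity)
```

On average the demodulated signal is proportional to the local slope of the measured intensity, so the estimate climbs toward the source using feedback of the local measurement alone, which is the property the abstract emphasizes.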


