arXivDaily: arXiv daily academic digest, updated Monday through Friday
2510.24701 2026-05-19 cs.CL cs.AI cs.IR cs.LG cs.MA

Tongyi DeepResearch Technical Report

Tongyi DeepResearch Team, Baixuan Li, Bo Zhang, Dingchu Zhang, Fei Huang, Guangyu Li, Guoxin Chen, Huifeng Yin, Jialong Wu, Jingren Zhou, Kuan Li, Liangcai Su, Litu Ou, Liwen Zhang, Pengjun Xie, Rui Ye, Wenbiao Yin, Xinmiao Yu, Xinyu Wang, Xixi Wu, Xuanzhong Chen, Yida Zhao, Zhen Zhang, Zhengwei Tao, Zhongwang Zhang, Zile Qiao, Chenxi Wang, Donglei Yu, Gang Fu, Haiyang Shen, Jiayin Yang, Jun Lin, Junkai Zhang, Kui Zeng, Li Yang, Hailong Yin, Maojia Song, Ming Yan, Minpeng Liao, Peng Xia, Qian Xiao, Rui Min, Ruixue Ding, Runnan Fang, Shaowei Chen, Shen Huang, Shihang Wang, Shihao Cai, Weizhou Shen, Xiaobin Wang, Xin Guan, Xinyu Geng, Yingcheng Shi, Yuning Wu, Zhuo Chen, Zijian Li, Yong Jiang

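AI summary: This paper introduces an agentic large language model designed for long-horizon, deep information-seeking tasks. An end-to-end training framework that combines agentic mid-training and post-training enables scalable reasoning and information seeking on complex tasks, while a highly scalable data synthesis pipeline automates training without costly human annotation. The model achieves state-of-the-art performance on multiple deep research benchmarks.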

Comments https://tongyi-agent.github.io/blog

Abstract

We present Tongyi DeepResearch, an agentic large language model, which is specifically designed for long-horizon, deep information-seeking research tasks. To incentivize autonomous deep research agency, Tongyi DeepResearch is developed through an end-to-end training framework that combines agentic mid-training and agentic post-training, enabling scalable reasoning and information seeking across complex tasks. We design a highly scalable data synthesis pipeline that is fully automatic, without relying on costly human annotation, and empowers all training stages. By constructing customized environments for each stage, our system enables stable and consistent interactions throughout. Tongyi DeepResearch, featuring 30.5 billion total parameters, with only 3.3 billion activated per token, achieves state-of-the-art performance across a range of agentic deep research benchmarks, including Humanity's Last Exam, BrowseComp, BrowseComp-ZH, WebWalkerQA, xbench-DeepSearch, FRAMES and xbench-DeepSearch-2510. We open-source the model, framework, and complete solutions to empower the community.

2510.18822 2026-05-19 cs.CV

SAM 2++: Tracking Anything at Any Granularity

Jiaming Zhang, Cheng Liang, Yichun Yang, Chenkai Zeng, Yutao Cui, Xinwen Zhang, Xin Zhou, Kai Ma, Gangshan Wu, Limin Wang

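AI summary: This paper proposes SAM 2++, a framework that tracks target states at different granularities (masks, boxes, and points) through a unified design of prompt encoding, output decoding, and memory representation. It also introduces the Tracking-Any-Granularity dataset to support training and evaluation of unified tracking models.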

Comments 14 pages

Abstract

Because the granularity of target states varies across tasks, most existing trackers are tailored to a single task. This specificity limits their generalization, prevents them from effectively utilizing multi-task training data, and leads to redundancy in both model design and parameters. Although recent unified vision models share partial architectures across tasks, they usually retain task-specific interfaces and overlook the common tracking principle behind different granularities, leaving a gap for truly unified video tracking. To unify video tracking tasks, we present SAM 2++, a unified framework that can handle target states at different granularities, including masks, boxes, and points, through an integrated design of prompt encoding, output decoding, and memory representation. First, to handle different target granularities, we design task-specific prompts that map diverse task inputs into general prompt embeddings, together with a Unified Decoder that produces task results in a common output form without redesigning the overall pipeline. Next, to support memory matching, the core operation of tracking, we introduce a task-adaptive memory mechanism that unifies memory across different granularities while preserving their distinct state semantics, preventing full parameter sharing from causing interference across granularities. Finally, we introduce Tracking-Any-Granularity, the first large and diverse video tracking dataset with rich annotations at three granularities. It is constructed through a customized data engine with phased manual annotation and model-assisted completion, providing a comprehensive resource for training, benchmarking, and analyzing unified tracking models. Comprehensive experiments confirm that SAM 2++ sets a new state of the art across diverse tracking tasks at different granularities, establishing a unified and robust tracking framework.

2510.17363 2026-05-19 cs.CV cs.LG cs.RO

M2H: Multi-Task Learning with Efficient Window-Based Cross-Task Attention for Monocular Spatial Perception

U. V. B. L Udugama, George Vosselman, Francesco Nex

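AI summary: This paper proposes M2H, a framework that uses an efficient window-based cross-task attention module to perform semantic segmentation, depth estimation, edge detection, and surface normal estimation from monocular images, while being more computationally efficient than existing methods.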

Comments Accepted to the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2025). 8 pages, 7 figures

Abstract

Deploying real-time spatial perception on edge devices requires efficient multi-task models that leverage complementary task information while minimizing computational overhead. This paper introduces Multi-Mono-Hydra (M2H), a novel multi-task learning framework designed for semantic segmentation and depth, edge, and surface normal estimation from a single monocular image. Unlike conventional approaches that rely on independent single-task models or shared encoder-decoder architectures, M2H introduces a Window-Based Cross-Task Attention Module that enables structured feature exchange while preserving task-specific details, improving prediction consistency across tasks. Built on a lightweight ViT-based DINOv2 backbone, M2H is optimized for real-time deployment and serves as the foundation for monocular spatial perception systems supporting 3D scene graph construction in dynamic environments. Comprehensive evaluations show that M2H outperforms state-of-the-art multi-task models on NYUDv2, surpasses single-task depth and semantic baselines on Hypersim, and achieves superior performance on the Cityscapes dataset, all while maintaining computational efficiency on laptop hardware. Beyond benchmarks, M2H is validated on real-world data, demonstrating its practicality in spatial perception tasks.

2510.16727 2026-05-19 cs.CL cs.AI

Beacon: Single-Turn Diagnosis and Mitigation of Latent Sycophancy in Large Language Models

Sanskar Pandey, Ruhaan Chopra, Angkul Puniya, Sohom Pal

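AI summary: This paper proposes Beacon, a benchmark for single-turn diagnosis and mitigation of latent sycophancy in large language models. Evaluating twelve state-of-the-art models reveals stable linguistic and affective sub-biases within sycophancy. The paper also proposes prompt-level and activation-level interventions that modulate these biases, exposing alignment as a dynamic manifold between truthfulness and socially compliant judgment.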

Abstract

Large language models internalize a structural trade-off between truthfulness and obsequious flattery, emerging from reward optimization that conflates helpfulness with polite submission. This latent bias, known as sycophancy, manifests as a preference for user agreement over principled reasoning. We introduce Beacon, a single-turn forced-choice benchmark that isolates this bias independent of conversational context, enabling precise measurement of the tension between factual accuracy and submissive bias. Evaluations across twelve state-of-the-art models reveal that sycophancy decomposes into stable linguistic and affective sub-biases, each scaling with model capacity. We further propose prompt-level and activation-level interventions that modulate these biases in opposing directions, exposing the internal geometry of alignment as a dynamic manifold between truthfulness and socially compliant judgment. Beacon reframes sycophancy as a measurable form of normative misgeneralization, providing a reproducible foundation for studying and mitigating alignment drift in large-scale generative systems.

2510.16609 2026-05-19 cs.LG cs.AI cs.CC cs.DS

Prior Knowledge Makes It Possible: From Sublinear Graph Algorithms to LLM Test-Time Methods

Avrim Blum, Daniel Hsu, Cyrus Rashtchian, Donya Saless

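AI summary: This paper studies the theoretical basis for how prior knowledge interacts with external information in test-time augmentation. By modeling multi-step reasoning as an s-t connectivity problem on a knowledge graph, it relates the number of augmentation steps to the graph structure under partial prior knowledge: when the knowledge graph is split into small components, the number of augmentation steps grows on the order of the square root of n, whereas once the knowledge density exceeds a threshold and a giant component forms, the number of steps becomes constant in expectation.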

Abstract

Test-time augmentation, such as Retrieval-Augmented Generation (RAG) or tool use, critically depends on an interplay between a model's parametric knowledge and externally retrieved information. However, the theoretical underpinnings of this relationship remain poorly understood. Specifically, it is not clear how much pre-training knowledge is required to answer queries with a small number of augmentation steps, which is a desirable property in practice. To address this question, we formulate multi-step reasoning as an $s$-$t$ connectivity problem on a knowledge graph. We represent a model's pre-training parametric knowledge as a partial, potentially noisy subgraph. We view augmentation as querying an oracle for true edges that augment the model's knowledge. Then, we characterize the necessary and sufficient number of augmentation steps for the model to generate an accurate answer given partial prior knowledge. One key result shows a phase transition: if the prior knowledge graph over $n$ vertices is disconnected into small components, then finding a path via augmentation is inefficient and requires $\Omega(\sqrt{n})$ queries. On the other hand, once the density of correct knowledge surpasses a threshold, forming a giant component, we can find paths with an expected constant number of queries.
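
The s-t connectivity framing can be made concrete with a toy simulation. The sketch below is not the paper's algorithm or oracle model; it assumes a hypothetical oracle in which one query on a vertex reveals every true edge incident to it, and it assumes the prior edges are all correct (the abstract also allows noisy priors). The function names are illustrative only.

```python
from collections import deque


def reachable(adj, start):
    """Return the set of vertices reachable from `start` over known edges."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def _adjacency(edges):
    """Build an undirected adjacency map from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def connect_with_oracle(prior_edges, true_edges, s, t):
    """Count oracle queries needed before s and t are connected.

    Assumed (hypothetical) oracle: one query on vertex v reveals all true
    edges incident to v. Traversing already-known edges costs nothing.
    Returns None if t is unreachable even with full knowledge.
    """
    known = _adjacency(prior_edges)
    truth = _adjacency(true_edges)
    queried = set()
    queries = 0
    while True:
        component = reachable(known, s)
        if t in component:
            return queries
        pending = sorted(component - queried)
        if not pending:
            return None
        v = pending[0]
        queried.add(v)
        queries += 1
        for w in truth.get(v, ()):
            known.setdefault(v, set()).add(w)
            known.setdefault(w, set()).add(v)
```

On a six-vertex path, an empty prior forces one query per hop, while a prior that already contains the whole path needs none. This mirrors, in a very small way, the contrast between fragmented and dense prior knowledge described in the abstract.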

2510.16252 2026-05-19 cs.LG cs.CL

WEBSERV: A Full-Stack and RL-Ready Web Environment for Training Web Agents at Scale

Yuxuan Lu, Ziyi Wang, Jing Huang, Hui Liu, Jiri Gesi, Yan Han, Shihan Fu, Tianqi Zheng, Xianfeng Tang, Chen Luo, Yisi Sang, Jin Lai, Dakuo Wang

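AI summary: This paper proposes WebServ, a full-stack, RL-ready web environment for training web agents at scale. On the server side it uses Incus containers to reduce launch latency and storage requirements; on the browser side it provides an automatically derived observation and action interface and a reliable execution backend. Experiments show that WebServ achieves state-of-the-art single-prompt results on WebArena-Lite and surpasses existing methods in RL training.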

Abstract

Reinforcement learning (RL) for web agents demands environments that are both effective for evaluation and efficient enough for large-scale on-policy training. Current web environments fall short: server-side Docker setups are too resource-intensive for massive parallel rollouts, while browser-side interfaces produce noisy observations, execute actions unreliably under modern single-page applications, and omit visual interactivity cues. We introduce WebServ, a full-stack, RL-ready web environment that addresses these limitations end-to-end. On the server side, WebServ uses Incus containers with block-level copy-on-write, reducing launch latency by ~5x and persistent storage by ~240x, enabling 200+ concurrent isolated environments on a single host. On the browser side, WebServ provides a compact, site-agnostic observation and action interface derived automatically from the DOM with human-aligned interactivity cues, and a robust action execution backend using network-aware waiting for reliable SPA support. On WebArena-Lite, WebServ achieves state-of-the-art single-prompt results, with controlled comparisons confirming consistent gains across GPT-4o, OpenAI-o3, and Llama-3.1-8B over vanilla WebArena. We further train Qwen3-4B and Qwen3-30B-A3B with RL entirely within WebServ; the RL-trained 4B model achieves 55.5% mean accuracy, surpassing both Claude 4.5 Sonnet (50.0%) and the RL-trained 8B model from WebAgent-R1 (51.8%).

2510.14466 2026-05-19 cs.CL cs.AI

Toward Robust Multilingual Adaptation of LLMs for Low-Resource Languages

Haolin Li, Haipeng Zhang, Mang Li, Yaohua Wang, Lijie Wen, Yu Zhang, Biqing Huang

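AI summary: This paper proposes LiRA, a framework that achieves robust multilingual adaptation of LLMs for low-resource languages through lightweight fine-tuning, combining the Arca and LaSR components to improve cross-lingual semantic consistency and representation stability.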

Comments Accepted by ICML 2026

Abstract

Large language models (LLMs) continue to struggle with low-resource languages, primarily due to limited training data, translation noise, and unstable cross-lingual alignment. To address these challenges, we propose LiRA (Linguistic Robust Anchoring for LLMs), a plug-and-play framework that requires only lightweight fine-tuning on top of existing pretrained backbones. LiRA jointly optimizes representation stability and cross-lingual semantic consistency by combining two key components: Arca (Anchored Representation Composition Architecture), which aligns low-resource inputs to a shared English semantic space through anchor-based alignment and collaborative encoding; and LaSR (Language-coupled Semantic Reasoner), a lightweight, language-aware head that enforces consistency regularization for unified cross-lingual understanding, retrieval, and reasoning. We theoretically show that under controlled anchoring error and translation-induced bias, LiRA guarantees bounded representation deviation and stable downstream performance under local Lipschitz continuity. To facilitate research, we release a new multilingual product retrieval dataset covering five Southeast Asian and two South Asian languages. Extensive experiments across diverse low-resource benchmarks demonstrate consistent improvements in retrieval, ranking, question answering, and reasoning tasks. Code will be publicly available on GitHub, and the dataset will be hosted on Hugging Face.

2510.13870 2026-05-19 cs.CL cs.AI

Unlocking the Potential of Diffusion Language Models through Template Infilling

Junhoo Lee, Seungyeon Kim, Nojun Kwak

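AI summary: This paper proposes Template Infilling, a conditioning method for diffusion language models that establishes a global blueprint across the response space. It improves performance on mathematical reasoning, code generation, and trip planning, and balances generation quality and speed in multi-token generation.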

Comments ACL 2026 Main Conference - Long Paper, Oral Presentation

Abstract

Diffusion Language Models (DLMs) have emerged as a promising alternative to Autoregressive Language Models, yet their inference strategies remain limited to prefix-based prompting inherited from the autoregressive paradigm. In this paper, we propose Template Infilling (TI), a tailored conditioning methodology for DLMs. Unlike conventional prefix prompting, TI flexibly aligns structural anchors across the entire target response space, establishing a global blueprint before filling in the masked segments. We demonstrate the effectiveness of our approach on diverse benchmarks, including mathematical reasoning, code generation, and trip planning, achieving consistent improvements of 9.40% over the baseline. Furthermore, we observe that TI provides additional advantages in multi-token generation settings, enabling effective speedup while maintaining generation quality and robustness. By enforcing these global constraints, TI ultimately facilitates System-2 reasoning, empowering the model to deliberate within a structurally defined solution space.
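
The contrast between prefix prompting and template infilling can be illustrated at the level of input construction. The sketch below only builds the masked token sequences; it does not run a diffusion model. The `[MASK]` string, the whitespace tokenization, and the `(anchor_text, gap_len)` template format are all assumptions made for illustration and are not taken from the paper.

```python
MASK = "[MASK]"  # hypothetical mask token; real DLMs define their own


def prefix_prompt(question, response_len):
    """Conventional prefix prompting: the entire response span is masked."""
    return question.split() + [MASK] * response_len


def template_prompt(question, template):
    """Template infilling: anchors are fixed, the gaps after them are masked.

    `template` is a list of (anchor_text, gap_len) pairs, an illustrative
    format. Tokens are split on whitespace purely for simplicity.
    """
    tokens = question.split()
    for anchor, gap_len in template:
        tokens += anchor.split()
        tokens += [MASK] * gap_len
    return tokens


def masked_positions(tokens):
    """Indices the model would be asked to fill in."""
    return [i for i, tok in enumerate(tokens) if tok == MASK]
```

For example, `template_prompt("Solve 2+3", [("Step 1:", 2), ("Answer:", 1)])` places the anchors `Step 1:` and `Answer:` inside the response region before any denoising happens, so the masked positions are interleaved with fixed structure rather than forming one contiguous block.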

2510.13068 2026-05-19 cs.LG cs.AI cs.HC

NeuroRVQ: Multi-Scale Biosignal Tokenization for Generative Foundation Models

Konstantinos Barmpas, Na Lee, Dimitrios Chalatsis, William Raftery, Yannis Panagakis, Dimitrios A. Adamos, Nikolaos Laskaris, Alexandros Koliousis, Dario Farina, Stefanos Zafeiriou

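AI summary: This paper proposes NeuroRVQ, a multi-scale biosignal tokenizer that decomposes biosignals with multi-scale temporal convolutions and adds a phase-aware loss to achieve high-fidelity signal reconstruction, supporting the claim that high-quality tokenization matters for downstream performance.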

Abstract

Biosignals such as electroencephalography (EEG), electrocardiography (ECG), and electromyography (EMG) encode physiological activity across multiple temporal and spectral scales, yielding representations that are rich but challenging for machine learning. Foundation models trained to predict masked signal tokens have shown promise in learning generalizable biosignal representations, yet their performance depends on the tokenizer's ability to preserve high-frequency dynamics and reconstruct signals with high fidelity. We introduce NeuroRVQ, a modality-adaptive biosignal tokenizer family designed for high-fidelity signal reconstruction. To capture the full frequency spectrum, NeuroRVQ decomposes biosignals into frequency-specific representations via multi-scale temporal convolutions, each encoded into hierarchical RVQ codebooks to preserve high-frequency detail, combined with a novel phase-aware training loss that respects the circular topology of Fourier phase. By tuning the temporal resolution, number and size of temporal kernels and RVQ depth, this design adapts to the spectro-temporal characteristics of each biosignal modality. To validate that tokenizer quality drives downstream performance, we train a simple masked-token foundation model for each modality (NeuroRVQ-FM) using the corresponding NeuroRVQ tokenizer. The NeuroRVQ-FM family achieves competitive or superior downstream performance compared to existing modality-specific foundation models, demonstrating that high-fidelity tokenization is a critical factor for effective biosignal modeling.
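
Why a loss should respect the circular topology of phase can be shown with a small comparison. The sketch below uses `1 - cos(difference)` as one common circular distance; this is an assumption for illustration, and the abstract does not state the exact form of the NeuroRVQ loss.

```python
import math


def naive_phase_loss(pred, target):
    """Mean squared difference, treating phase as an ordinary real number."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def circular_phase_loss(pred, target):
    """Mean of 1 - cos(difference), which is invariant to 2*pi wraparound."""
    return sum(1.0 - math.cos(p - t) for p, t in zip(pred, target)) / len(pred)
```

Two phases just on either side of the ±π boundary, such as `math.pi - 0.01` and `-math.pi + 0.01`, describe almost the same angle. The naive loss treats them as nearly 2π apart, while the circular loss is close to zero.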

2510.10528 2026-05-19 cs.CL cs.LG

Merlin's Whisper: Enabling Efficient Reasoning in Large Language Models via Black-box Persuasive Prompting

Heming Xia, Cunxiao Du, Rui Li, Chak Tou Leong, Yongqi Li, Wenjie Li

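AI summary: This paper proposes Whisper, a framework that uses black-box persuasive prompting to reduce the number of tokens large reasoning models (LRMs) spend on reasoning while preserving performance, and reports substantial token reductions across multiple benchmarks.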

Comments ACL 2026 (Long Paper), camera-ready version

Abstract

Large reasoning models (LRMs) have demonstrated remarkable proficiency in tackling complex tasks through step-by-step thinking. However, this lengthy reasoning process incurs substantial computational and latency overheads, hindering the practical deployment of LRMs. This work presents a new approach to mitigating overthinking in LRMs via black-box persuasive prompting. By treating LRMs as black-box communicators, we investigate how to persuade them to generate concise responses without compromising accuracy. We introduce Whisper, an iterative refinement framework that generates high-quality persuasive prompts from diverse perspectives. Experiments across multiple benchmarks demonstrate that Whisper consistently reduces token usage while preserving performance. Notably, Whisper achieves a 3x reduction in average response length on simple GSM8K questions for the Qwen3 model series and delivers an average ~40% token reduction across all benchmarks. For closed-source APIs, Whisper reduces token usage on MATH-500 by 46% for Claude-3.7 and 50% for Gemini-2.5. Further analysis reveals the broad applicability of Whisper across data domains, model scales, and families, underscoring the potential of black-box persuasive prompting as a practical strategy for enhancing LRM efficiency.

2510.10140 2026-05-19 cs.LG cs.CR stat.ML

Adversarial Attacks on Downstream Weather Forecasting Models: Application to Tropical Cyclone Trajectory Prediction

Yue Deng, Francisco Santos, Pang-Ning Tan, Lifeng Luo

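AI summary: This paper studies the vulnerability of deep learning weather forecasting models to adversarial attacks and proposes Cyc-Attack, a new attack method that generates adversarial trajectories, matching attacker-specified targets more accurately while making the perturbations harder to detect.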

Comments Compared with the previous version, we added zeroth-order optimization methods as baselines, clarified the motivation for using a surrogate model, and provided a more detailed investigation of the upstream attack

Abstract

Deep learning-based weather forecasting (DLWF) models leverage past weather observations to generate future forecasts, supporting a wide range of downstream applications, including tropical cyclone (TC) prediction. In this paper, we investigate their vulnerability to adversarial attacks, where subtle perturbations to the upstream forecasts can alter the downstream TC trajectory predictions. Although research into adversarial attacks on DLWF models has grown recently, it remains challenging to craft perturbed upstream forecasts that steer the downstream outputs toward attacker-specified trajectories. First, conventional TC detection systems are opaque, non-differentiable black boxes, making standard gradient-based attacks infeasible. Second, the extreme rarity of TC events leads to a severe class imbalance problem, making it difficult to develop attack methods for perturbing upstream forecasts that produce realistic-looking cyclone paths aligned with the attacker's target trajectories. To overcome these limitations, we propose Cyc-Attack, a novel method for perturbing the upstream forecasts of DLWF models to generate adversarial trajectories. The proposed method uses a differentiable surrogate model to approximate the TC detector's output, enabling the application of gradient-based attacks. Cyc-Attack also employs a skewness-aware loss function with a kernel dilation strategy to address the imbalance problem. Finally, a distance-based gradient weighting scheme and regularization are used to constrain the perturbations and eliminate unrealistic-looking trajectories, thereby making the adversarial upstream forecasts less easily detectable. Our experiments show that Cyc-Attack achieves a higher true positive rate in matching the attacker's target trajectories, along with lower false alarm rates and stealthier perturbations than conventional attack methods.

2510.08886 2026-05-19 cs.CL cs.CE cs.IR

FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs

Yan Wang, Keyi Wang, Shanshan Yang, Jaisal Patel, Jeff Zhao, Fengran Mo, Xueqing Peng, Lingfei Qian, Yankai Chen, Víctor Gutiérrez-Basulto, Jimin Huang, Guojun Xiong, Xiao-Yang Liu, Xue Liu, Jian-Yun Nie

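AI summary: This paper proposes FinAuditing, a financial taxonomy-structured multi-document benchmark for evaluating large language models on financial auditing. Its three tasks (financial semantic matching, financial relationship extraction, and financial mathematical reasoning) reveal substantial gaps in existing LLMs in concept retrieval, taxonomy-aware relation modeling, and consistent cross-document reasoning.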

Comments Accepted by SIGIR 2026 Resource Track. Pre-camera-ready version

Abstract

Going beyond simple text processing, financial auditing requires detecting semantic, structural, and numerical inconsistencies across large-scale disclosures. As financial reports are filed in XBRL, a structured XML format governed by accounting standards, auditing becomes a structured information extraction and reasoning problem involving concept alignment, taxonomy-defined relations, and cross-document consistency. Although large language models (LLMs) show promise on isolated financial tasks, their capability in professional-grade auditing remains unclear. We introduce FinAuditing, a taxonomy-aligned, structure-aware benchmark built from real XBRL filings. It contains 1,102 annotated instances averaging over 33k tokens and defines three tasks: Financial Semantic Matching (FinSM), Financial Relationship Extraction (FinRE), and Financial Mathematical Reasoning (FinMR). Evaluations of 13 state-of-the-art LLMs reveal substantial gaps in concept retrieval, taxonomy-aware relation modeling, and consistent cross-document reasoning. These findings highlight the need for realistic, structure-aware benchmarks. We release the evaluation code at https://github.com/The-FinAI/FinAuditing and the dataset at https://huggingface.co/collections/TheFinAI/finauditing. The task currently serves as the official benchmark of an ongoing public evaluation contest at https://open-finance-lab.github.io/SecureFinAI_Contest_2026/.

2510.08702 2026-05-19 cs.CL

Scaling Laws for Code: A More Data-Hungry Regime

Xianzhen Luo, Wenzhen Zheng, Qingfu Zhu, Rongyi Zhang, Houyi Li, Siming Huang, YuanTao Fan, Wanxiang Che

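AI summary: This paper studies scaling laws for code. Large-scale experiments show that the Farseer law is more accurate, that code models scale effectively with model size but require a higher data-to-parameter ratio, and that in code-NL mixtures natural language helps in resource-constrained settings but becomes a detriment at higher compute budgets.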

Comments Accepted by ACL2026

Abstract

Code Large Language Models (LLMs) are revolutionizing software engineering. However, scaling laws that guide efficient training are predominantly analyzed on Natural Language (NL). Given fundamental differences between code and NL, such as strict syntax, it is unclear whether these laws are directly applicable to code. To address this gap, we conduct the first large-scale empirical study of scaling laws for code, comprising 117 experimental runs with model sizes from 0.2B to 3.8B and training tokens from 2B to 128B. We fit the Chinchilla law and the Farseer law. First, the results show that the more expressive Farseer law offers greater accuracy. Second, the analysis reveals that Code LLMs scale effectively with model size. Crucially, code represents a more data-hungry regime, requiring a substantially higher data-to-parameter ratio than NL. Finally, two additional sets of experiments on code-NL mixtures show that NL benefits resource-constrained scenarios, but becomes a detriment at higher compute budgets.
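
For readers unfamiliar with the Chinchilla law mentioned above, its commonly used parametric form models loss as an irreducible term plus one power-law term in parameters and one in tokens. The sketch below encodes that form; the coefficients passed in are placeholders chosen for illustration, not values fitted in this paper, and the Farseer law is not shown because the abstract does not give its form.

```python
def chinchilla_loss(n_params, n_tokens, E, A, B, alpha, beta):
    """Chinchilla-style parametric loss: E + A / N**alpha + B / D**beta."""
    return E + A / n_params ** alpha + B / n_tokens ** beta


def tokens_per_param(n_params, n_tokens):
    """Data-to-parameter ratio D / N."""
    return n_tokens / n_params


# Placeholder coefficients for illustration only (not fitted values).
coeffs = dict(E=1.7, A=400.0, B=400.0, alpha=0.3, beta=0.3)
loss_small_data = chinchilla_loss(1e9, 2e10, **coeffs)
loss_large_data = chinchilla_loss(1e9, 4e10, **coeffs)
```

Under this form, doubling the token count at a fixed model size always lowers the predicted loss; how quickly it does so depends on the fitted `B` and `beta`. As plain arithmetic on the ranges quoted in the abstract, `tokens_per_param(0.2e9, 2e9)` is 10 and `tokens_per_param(3.8e9, 128e9)` is about 33.7, though the abstract does not say which model sizes were paired with which token budgets.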

2510.07239 2026-05-19 cs.CL

Red-Bandit: Test-Time Adaptation for LLM Red-Teaming via Bandit-Guided LoRA Experts

Christos Ziakas, Nicholas Loo, Nishita Jain, Alessandra Russo

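AI summary: This paper proposes Red-Bandit, a framework for test-time adaptation in LLM red-teaming that uses bandit-guided LoRA experts, each specialized for an attack style. The experts are trained with reinforcement learning to generate unsafe prompts, and a multi-armed bandit policy dynamically selects among them, achieving state-of-the-art results on AdvBench while producing more readable prompts.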

Comments Accepted to the Main Conference at ACL 2026

Abstract

Automated red-teaming has emerged as a scalable approach for auditing Large Language Models (LLMs) prior to deployment, yet existing approaches lack mechanisms to efficiently adapt to model-specific vulnerabilities at inference. We introduce Red-Bandit, a red-teaming framework that adapts online to identify and exploit model failure modes under distinct attack styles (e.g., manipulation, slang). Red-Bandit post-trains a set of parameter-efficient LoRA experts, each specialized for a particular attack style, using reinforcement learning that rewards the generation of unsafe prompts via a rule-based safety model. At inference, a multi-armed bandit policy dynamically selects among these attack-style experts based on the target model's response safety, balancing exploration and exploitation. Red-Bandit achieves state-of-the-art results on AdvBench under sufficient exploration (ASR@10), while producing more human-readable prompts (lower perplexity). Moreover, Red-Bandit's bandit policy serves as a diagnostic tool for uncovering model-specific vulnerabilities by indicating which attack styles most effectively elicit unsafe behaviors.
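
The exploration and exploitation trade-off in selecting among experts can be sketched with a generic multi-armed bandit. The abstract does not say which bandit algorithm Red-Bandit uses, so the example below swaps in UCB1 as a stand-in, with rewards simulated from fixed, made-up success probabilities rather than from any model or safety classifier.

```python
import math
import random


def ucb1_select(counts, rewards, t, c=2.0):
    """Play each arm once, then pick the arm with the highest UCB1 score."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [
        rewards[a] / counts[a] + math.sqrt(c * math.log(t) / counts[a])
        for a in range(len(counts))
    ]
    return scores.index(max(scores))


def run_bandit(success_probs, steps, seed=0):
    """Simulate UCB1 over arms with hypothetical Bernoulli reward rates.

    Returns how many times each arm was selected.
    """
    rng = random.Random(seed)
    counts = [0] * len(success_probs)
    rewards = [0.0] * len(success_probs)
    for t in range(1, steps + 1):
        arm = ucb1_select(counts, rewards, t)
        reward = 1.0 if rng.random() < success_probs[arm] else 0.0
        counts[arm] += 1
        rewards[arm] += reward
    return counts
```

The returned selection counts double as a simple diagnostic: the arm chosen most often is the one whose simulated reward rate was highest, which parallels the abstract's point that the bandit policy indicates which expert is most effective against a given target.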

2510.06388 2026-05-19 cs.LG cs.DS stat.ML

Truthful Calibration Errors for Multi-Class Prediction

Yuxuan Lu, Yifan Wu, Jason Hartline, Lunjia Hu

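AI summary: This paper studies the practical role of truthful calibration errors in multiclass prediction. It introduces perfectly truthful calibration errors for multidimensional linear properties of the label distribution and analyzes their decision-theoretic implications, which explains and mitigates the ranking robustness problem of binned calibration errors.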

English abstract

Calibrated predictions are useful because their numerical values can be interpreted as probabilities. Calibration errors are therefore widely used to evaluate, compare, and tune probabilistic predictors. Recently, Haghtalab et al. (2024) introduced an additional requirement for such measures: truthfulness. A calibration measure is truthful if a predictor minimizes its expected measured error by reporting the true conditional label distribution. Many standard empirical calibration errors are non-truthful: a predictor may appear better calibrated by distorting its probabilities rather than reporting them truthfully. We study the practical role of truthfulness for calibration measurement in multiclass prediction. First, we introduce perfectly truthful calibration errors for multidimensional linear properties of the label distribution, generalizing the truthful calibration error for binary predictions in Hartline et al. (2025). This framework includes full multiclass calibration and classwise calibration. We also identify a truthful correction for confidence calibration. Second, we characterize the decision-theoretic implications of these truthful errors. For calibrated predictors, truthful calibration errors preserve the Blackwell dominance: a more informative calibrated predictor receives no larger expected error. Third, we show that this decision-theoretic interpretation explains and mitigates the well-observed ranking robustness problem of binned calibration errors. Empirically, non-truthful confidence-based errors can reverse model rankings when the number of bins changes, while our truthful errors give more stable rankings across binning choices.
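
For context on the binning sensitivity the abstract mentions, here is a minimal sketch of the standard binned confidence calibration error, the kind of non-truthful measure the paper contrasts against. It is not the truthful error proposed by the authors. The function name `binned_confidence_ece` and the equal-width binning are assumptions made for illustration.

```python
import numpy as np


def binned_confidence_ece(probs, labels, n_bins=10):
    """Standard binned expected calibration error on top-label confidence."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    confidence = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    # Equal-width bins on [0, 1]; a confidence of exactly 1.0 goes in the last bin.
    bin_ids = np.minimum((confidence * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidence[mask].mean())
            ece += mask.mean() * gap  # weight each bin by its share of samples
    return float(ece)


# Same predictions, two bin counts: the measured error changes with the binning.
probs = [[0.95, 0.05], [0.85, 0.15], [0.65, 0.35], [0.25, 0.75]]
labels = [0, 1, 0, 1]
ece_one_bin = binned_confidence_ece(probs, labels, n_bins=1)
ece_ten_bins = binned_confidence_ece(probs, labels, n_bins=10)
```

On this toy input the single-bin value is much smaller than the ten-bin value, which is a small-scale version of the dependence on bin count that can reorder models.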

2510.05921 2026-05-19 cs.CL cs.LG

Prompt reinforcing for long-term planning of large language models

通过提示强化实现大语言模型的长期规划

Hsien-Chin Lin, Benjamin Matthias Ruppik, Carel van Niekerk, Chia-Hao Shen, Michael Heck, Nurul Lubis, Renato Vukovic, Shutong Feng, Milica Gašić

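AI summary This paper proposes a reinforcement learning-inspired prompt optimisation framework that enables long-term planning by modifying only the task instruction prompt of an LLM-based agent. It improves multi-turn tasks such as text-to-SQL and task-oriented dialogue, generalises across different LLM-based agents, and can use diverse LLMs as meta-prompting agents.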

AI Chinese abstract

大型语言模型(LLMs)在广泛自然语言处理任务中取得了显著成功,并可通过提示进行适应。然而,它们在多轮交互中仍表现不足,常依赖错误的早期假设,无法随时间跟踪用户目标,使此类任务尤其具有挑战性。先前对话系统的工作表明,长期规划对于处理交互任务至关重要。在本工作中,我们提出了一种受强化学习启发的提示优化框架,仅通过修改LLM代理的任务指令提示即可实现此类规划。通过生成回合间的反馈并利用经验回放进行提示重写,我们的方法在文本到SQL和任务导向对话等多轮任务中显示出显著改进。此外,该方法能跨不同LLM代理泛化,并可利用多种LLM作为元提示代理。这促使未来在受强化学习启发的无参数优化方法上的研究。

English abstract

Large language models (LLMs) have achieved remarkable success in a wide range of natural language processing tasks and can be adapted through prompting. However, they remain suboptimal in multi-turn interactions, often relying on incorrect early assumptions and failing to track user goals over time, which makes such tasks particularly challenging. Prior works in dialogue systems have shown that long-term planning is essential for handling interactive tasks. In this work, we propose a prompt optimisation framework inspired by reinforcement learning, which enables such planning to take place by only modifying the task instruction prompt of the LLM-based agent. By generating turn-by-turn feedback and leveraging experience replay for prompt rewriting, our proposed method shows significant improvement in multi-turn tasks such as text-to-SQL and task-oriented dialogue. Moreover, it generalises across different LLM-based agents and can leverage diverse LLMs as meta-prompting agents. This warrants future research in reinforcement learning-inspired parameter-free optimisation methods.

2510.01857 2026-05-19 cs.AI

Learning Reasoning Rewards from Expert Demonstrations with Inverse Reinforcement Learning

通过逆强化学习学习推理奖励从专家示范

Claudio Fanconi, Nicolás Astorga, Mihaela van der Schaar

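AI summary This paper proposes Reasoning Adversarial Inverse Reinforcement Learning (R-AIRL), which uses inverse reinforcement learning to learn reasoning rewards from expert demonstrations, addressing the limitations of supervised fine-tuning, and shows on several datasets that the learned reward is useful during both training and inference.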

AI Chinese abstract

教学大型语言模型(LLMs)在训练后进行推理通常依赖于具有显式结果或过程基础的强化学习奖励函数。然而,在许多现实世界设置中,获得或定义此类奖励函数是困难的,尤其是对于复杂任务,使从专家示范中学习成为有吸引力的替代方法。主流方法监督微调(SFT)训练模型直接模仿专家推理轨迹,但受到离策略学习的一般限制:性能可能对推理时偏离演示中明确覆盖的状态敏感。为了解决这个问题,我们提出了推理对抗逆强化学习(R-AIRL)。与其模仿专家的推理,R-AIRL从专家的思维链中推断出底层的过程级奖励。通过在GSM8K、MMLU-Pro和MedReason上进行实验,我们展示了通过R-AIRL学习的推理奖励函数可以有效地用于整个训练和推理流程:(1)为训练提供训练信号,在大多数考虑的设置中优于SFT,(2)用于推理时的重排序,将pass@1提高高达17.4个点,(3)用于过程级评估,以高达86.1%的准确性局部化推理失败。总体而言,R-AIRL弥合了模仿学习和基于奖励的优化,使从专家思考轨迹中提取有意义的推理信号成为可能。

English abstract

Teaching large language models (LLMs) to reason during post-training typically relies on reinforcement learning with explicit outcome- or process-based reward functions. However, in many real-world settings, obtaining or defining such reward functions is difficult, especially for complex tasks, making learning from expert demonstrations an attractive alternative. The dominant approach, supervised fine-tuning (SFT), trains models to imitate expert reasoning traces directly, but suffers from the general limitations of off-policy learning: performance can be fragile to inference-time deviations from states explicitly covered by the demonstrations. To address this, we propose Reasoning Adversarial Inverse Reinforcement Learning (R-AIRL). Rather than imitating the expert's reasoning, R-AIRL infers the underlying process-level reward from the expert Chain-of-Thoughts. Through experiments on GSM8K, MMLU-Pro and MedReason we show that the reasoning reward function learned with R-AIRL can be effectively used throughout the training and inference pipeline: (1) to provide a training signal for post-training, outperforming SFT in most of the considered settings, (2) for inference-time reranking, improving pass@1 by up to 17.4 points, and (3) for process-level evaluation, localising reasoning failures with up to 86.1% accuracy. Overall, R-AIRL bridges imitation learning and reward-based optimisation, enabling the extraction of meaningful reasoning signals from expert thinking traces.

2510.01479 2026-05-19 cs.LG cs.SY eess.SY

Density-Ratio Weighted Behavioral Cloning: Learning Control Policies from Corrupted Datasets

密度比加权行为克隆:从受污染的数据集中学习控制策略

Shriram Karpoora Sundara Pandian, Ali Baheri

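AI summary This paper proposes Density-Ratio Weighted Behavioral Cloning, a robust imitation learning method that uses a small, verified clean reference set to estimate trajectory-level density ratios, prioritizing clean expert behavior and down-weighting or discarding corrupted data, which improves policy performance without knowledge of the contamination mechanism.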

AI Chinese abstract

离线强化学习(RL)通过固定数据集进行策略优化,使其适用于在线探索不可行的安全关键应用。然而,这些数据集常受到对抗性污染、系统错误或低质量样本的污染,导致标准行为克隆(BC)和离线RL方法的策略性能下降。本文介绍了密度比加权行为克隆(Weighted BC),一种鲁棒的模仿学习方法,通过二元判别器估计轨迹级密度比,这些比值被截断并用作BC目标中的权重,以优先考虑干净的专家行为,同时降低或丢弃受污染的数据,而无需了解污染机制。我们建立了理论保证,证明在有限样本界限下,能够收敛到干净的专家策略,这些界限与污染率无关。建立了一个全面的评估框架,该框架包含各种污染协议(奖励、状态、转换和动作)在连续控制基准上的应用。实验表明,Weighted BC即使在高污染比下也能保持接近最优性能,优于传统BC、批量约束Q学习(BCQ)和行为正则化的Actor-Critic(BRAC)等基线方法。

English abstract

Offline reinforcement learning (RL) enables policy optimization from fixed datasets, making it suitable for safety-critical applications where online exploration is infeasible. However, these datasets are often contaminated by adversarial poisoning, system errors, or low-quality samples, leading to degraded policy performance in standard behavioral cloning (BC) and offline RL methods. This paper introduces Density-Ratio Weighted Behavioral Cloning (Weighted BC), a robust imitation learning approach that uses a small, verified clean reference set to estimate trajectory-level density ratios via a binary discriminator. These ratios are clipped and used as weights in the BC objective to prioritize clean expert behavior while down-weighting or discarding corrupted data, without requiring knowledge of the contamination mechanism. We establish theoretical guarantees showing convergence to the clean expert policy with finite-sample bounds that are independent of the contamination rate. We also establish a comprehensive evaluation framework that incorporates various poisoning protocols (reward, state, transition, and action) on continuous control benchmarks. Experiments demonstrate that Weighted BC maintains near-optimal performance even at high contamination ratios, outperforming baselines such as traditional BC, batch-constrained Q-learning (BCQ), and behavior regularized actor-critic (BRAC).
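
The weighting step described in the abstract can be sketched as follows, under stated assumptions. A binary discriminator that outputs d = P(clean | sample) gives a density-ratio estimate proportional to d / (1 - d), which is then clipped and used to weight the BC loss. The function names, the clipping threshold `clip_max`, the squared-error loss, and the per-sample granularity (the paper works at trajectory level) are all hypothetical choices for illustration, not the authors' exact formulation.

```python
import numpy as np


def clipped_density_ratio(d_clean, clip_max=5.0, eps=1e-6):
    """Map discriminator outputs P(clean | sample) to clipped weights d / (1 - d)."""
    d = np.clip(np.asarray(d_clean, dtype=float), eps, 1.0 - eps)
    return np.clip(d / (1.0 - d), 0.0, clip_max)


def weighted_bc_loss(pred_actions, data_actions, weights):
    """Weighted mean of per-sample squared action errors."""
    pred = np.asarray(pred_actions, dtype=float)
    target = np.asarray(data_actions, dtype=float)
    w = np.asarray(weights, dtype=float)
    per_sample = ((pred - target) ** 2).sum(axis=1)
    return float((w * per_sample).sum() / max(w.sum(), 1e-12))


# Toy data: the third sample looks corrupted (discriminator output 0.0)
# and carries a large action error; its weight is driven to nearly zero.
weights = clipped_density_ratio([0.5, 0.9, 0.0])
loss = weighted_bc_loss([[0.0], [0.0], [0.0]], [[1.0], [1.0], [10.0]], weights)
```

Clipping bounds the influence of any single sample when the discriminator is overconfident, which is one plausible reason for the clipping step the abstract mentions.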

2510.00304 2026-05-19 cs.LG cs.AI

Barriers for Learning in an Evolving World: Mathematical Understanding of Loss of Plasticity

在不断变化的世界中学习的障碍:对学习能力丧失的数学理解

Amir Joudaki, Giulia Lanzillotta, Mohammad Samragh Razlighi, Iman Mirzadeh, Keivan Alizadeh, Thomas Hofmann, Mehrdad Farajtabar, Fartash Faghri

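AI summary This paper studies why deep learning models fail in non-stationary environments through loss of plasticity (LoP), uses dynamical systems theory to analyze two primary mechanisms behind LoP, and explores mitigation strategies.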

AI Chinese abstract

深度学习模型在静态数据上表现优异,但在非静态环境中因一种称为学习能力丧失(LoP)的现象而表现不佳,即其未来学习能力下降。本文首次从原理上研究了基于梯度的学习中的LoP。基于动力系统理论,我们通过在参数空间中识别稳定的流形来正式定义LoP,这些流形会捕获梯度轨迹。我们的分析揭示了两种主要机制,这些机制创造了这些陷阱:来自激活饱和的冻结单元和来自表征冗余的克隆单元流形。我们的框架揭示了一个根本性的矛盾:在静态设置中促进泛化的属性,如低秩表示和简单性偏差,直接在持续学习场景中促成LoP。我们通过数值模拟验证了我们的理论分析,并探讨了架构选择或针对性扰动作为潜在的缓解策略。

English abstract

Deep learning models excel on stationary data but struggle in non-stationary environments due to a phenomenon known as loss of plasticity (LoP), the degradation of their ability to learn in the future. This work presents a first-principles investigation of LoP in gradient-based learning. Grounded in dynamical systems theory, we formally define LoP by identifying stable manifolds in the parameter space that trap gradient trajectories. Our analysis reveals two primary mechanisms that create these traps: frozen units from activation saturation and cloned-unit manifolds from representational redundancy. Our framework uncovers a fundamental tension: properties that promote generalization in static settings, such as low-rank representations and simplicity biases, directly contribute to LoP in continual learning scenarios. We validate our theoretical analysis with numerical simulations and explore architectural choices or targeted perturbations as potential mitigation strategies.

2509.25969 2026-05-19 cs.CV

A Multi-purpose Tracking Framework for Salmon Welfare Monitoring in Challenging Environments

一种用于挑战性环境中鲑鱼福利监测的多用途跟踪框架

Espen Uri Høgstedt, Christian Schellewald, Annette Stahl, Rudolf Mester

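AI summary This paper proposes a multi-purpose tracking framework for automated salmon welfare monitoring in challenging environments. It uses a pose estimation network to extract bounding boxes around salmon and their body parts to address challenges specific to underwater salmon scenes, and constructs two new datasets for evaluating salmon tracking challenges.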

Comments Accepted to the Joint Workshop on Marine Vision 2025 (CVAUI & AAMVEM), held in conjunction with ICCV 2025

AI Chinese abstract

基于计算机视觉(CV)的连续、自动化和精确的鲑鱼福利监测是减少工业网箱养鱼中鲑鱼死亡率和改善鲑鱼福利的关键步骤。现有的CV方法用于确定福利指标主要集中在单一指标上,并依赖于其他应用领域的对象检测器和跟踪器来帮助其福利指标计算算法。这在实际应用中带来了高资源需求,因为每个指标必须单独计算。此外,这些方法在水下鲑鱼场景中容易受到物体遮挡、相似物体外观和相似物体运动等困难的影响。为了解决这些挑战,我们提出了一种灵活的跟踪框架,该框架使用姿态估计网络提取鲑鱼及其对应身体部位的边界框,并利用身体部位的信息,通过专门的模块,来解决水下鲑鱼场景中的特定挑战。随后,高细节的身体部位跟踪被用于计算福利指标。我们构建了两个新的数据集,评估两个鲑鱼跟踪挑战:拥挤场景中的鲑鱼ID转移和转弯期间的鲑鱼ID切换。我们的方法在两个鲑鱼跟踪挑战中均优于当前最先进的行人跟踪器BoostTrack。此外,我们创建了一个用于计算鲑鱼尾鳍拍打波长的数据集,证明了我们的身体部位跟踪方法适合基于尾鳍分析的自动化福利监测。数据集和代码可在https://github.com/espenbh/BoostCompTrack上获得。

English abstract

Computer Vision (CV)-based continuous, automated and precise salmon welfare monitoring is a key step toward reduced salmon mortality and improved salmon welfare in industrial aquaculture net pens. Available CV methods for determining welfare indicators focus on single indicators and rely on object detectors and trackers from other application areas to aid their welfare indicator calculation algorithm. This comes with a high resource demand for real-world applications, since each indicator must be calculated separately. In addition, the methods are vulnerable to difficulties in underwater salmon scenes, such as object occlusion, similar object appearance, and similar object motion. To address these challenges, we propose a flexible tracking framework that uses a pose estimation network to extract bounding boxes around salmon and their corresponding body parts, and exploits information about the body parts, through specialized modules, to tackle challenges specific to underwater salmon scenes. Subsequently, the high-detail body part tracks are employed to calculate welfare indicators. We construct two novel datasets assessing two salmon tracking challenges: salmon ID transfers in crowded scenes and salmon ID switches during turning. Our method outperforms the current state-of-the-art pedestrian tracker, BoostTrack, for both salmon tracking challenges. Additionally, we create a dataset for calculating salmon tail beat wavelength, demonstrating that our body part tracking method is well-suited for automated welfare monitoring based on tail beat analysis. Datasets and code are available at https://github.com/espenbh/BoostCompTrack.

2509.21820 2026-05-19 cs.CL

Can LLMs Generate and Solve Linguistic Olympiad Puzzles?

大语言模型能否生成并解决语言奥林匹克谜题?

Neh Majmudar, Elena Filatova

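AI summary This paper studies the ability of large language models to solve and generate linguistic puzzles, finds that they outperform humans on most puzzle types but are weaker on puzzles centered on writing systems and on understudied languages, and argues that puzzle generation can help popularize linguistics.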

Comments Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025)

AI Chinese abstract

在本文中,我们介绍了一种新的任务组合:语言谜题的解决方案和生成。我们专注于用于高中生的语言奥林匹克谜题。我们首先扩展了现有基准,以解决语言谜题的任务。我们探索了大型语言模型(LLMs)在解决语言谜题中的应用,包括最近的最先进的模型,如OpenAI的o1,在各种语言主题上的表现。我们证明,LLMs在大多数谜题类型上优于人类,除了那些以书写系统为中心的谜题,以及不为人知的语言。我们利用谜题解决实验的洞察力,指导了新的谜题生成任务。我们相信,即使对于相对简单的谜题,自动化谜题生成也有望扩大对语言学的兴趣,并将该领域介绍给更广泛的受众。这一发现突显了语言谜题生成作为研究任务的重要性:此类谜题不仅能促进语言学,还能支持对稀有和不为人知语言的知识传播。

English abstract

In this paper, we introduce a combination of novel and exciting tasks: the solution and generation of linguistic puzzles. We focus on puzzles used in Linguistic Olympiads for high school students. We first extend the existing benchmark for the task of solving linguistic puzzles. We explore the use of Large Language Models (LLMs), including recent state-of-the-art models such as OpenAI's o1, for solving linguistic puzzles, analyzing their performance across various linguistic topics. We demonstrate that LLMs outperform humans on most puzzle types, except for those centered on writing systems and those involving understudied languages. We use the insights from puzzle-solving experiments to direct the novel task of puzzle generation. We believe that automating puzzle generation, even for relatively simple puzzles, holds promise for expanding interest in linguistics and introducing the field to a broader audience. This finding highlights the importance of linguistic puzzle generation as a research task: such puzzles can not only promote linguistics but also support the dissemination of knowledge about rare and understudied languages.

2509.19102 2026-05-19 cs.RO cs.AI cs.CV

FUNCanon: Learning Pose-Aware Action Primitives via Functional Object Canonicalization for Generalizable Robotic Manipulation

FUNCanon: 通过功能对象规范化学习姿态感知的动作原语以实现通用的机器人操作

Hongli Xu, Lei Zhang, Xiaoyue Hu, Boyang Zhong, Kaixin Bai, Zoltán-Csaba Márton, Zhenshan Bing, Zhaopeng Chen, Alois Christian Knoll, Jianwei Zhang

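AI summary This paper proposes FUNCanon, a framework that learns pose-aware action primitives via functional object canonicalization for generalizable robotic manipulation. It decomposes long-horizon manipulation tasks into action chunks defined by an actor, verb, and object, improving the compositionality and reusability of policies.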

Comments project website: https://sites.google.com/view/funcanon, 11 pages

AI Chinese abstract

通用机器人技能从端到端演示中通常会导致任务特定的策略,这些策略难以超越训练分布进行泛化。因此,我们引入FUNCanon框架,将长周期操作任务转换为一系列动作片段,每个片段由主体、动词和对象定义。这些片段将策略学习聚焦于动作本身,而不是孤立的任务,从而实现组合性和重用性。为了使策略具有姿态感知和类别通用性,我们对功能对象进行规范化,通过功能对齐和自动操作轨迹转移,利用大型视觉语言模型的 affordance 信息将对象映射到共享的功能框架中。一个以对象为中心和动作为中心的扩散策略FuncDiffuser在对齐的数据上进行训练,自然尊重对象的 affordances 和姿态,简化了学习并提高了泛化能力。在模拟和现实基准上的实验表明,该方法在类别层面实现了泛化,跨任务行为重用和鲁棒的sim2real部署,显示功能规范化为复杂操作领域可扩展模仿学习提供了强大的归纳偏置。演示细节和补充材料可在我们的项目网站上获得:https://sites.google.com/view/funcanon。

English abstract

Learning general-purpose robotic skills from end-to-end demonstrations often leads to task-specific policies that fail to generalize beyond the training distribution. Therefore, we introduce FunCanon, a framework that converts long-horizon manipulation tasks into sequences of action chunks, each defined by an actor, verb, and object. These chunks focus policy learning on the actions themselves, rather than isolated tasks, enabling compositionality and reuse. To make policies pose-aware and category-general, we perform functional object canonicalization for functional alignment and automatic manipulation trajectory transfer, mapping objects into shared functional frames using affordance cues from large vision-language models. An object-centric and action-centric diffusion policy, FuncDiffuser, trained on this aligned data naturally respects object affordances and poses, simplifying learning and improving generalization ability. Experiments on simulated and real-world benchmarks demonstrate category-level generalization, cross-task behavior reuse, and robust sim2real deployment, showing that functional canonicalization provides a strong inductive bias for scalable imitation learning in complex manipulation domains. Details of the demo and supplemental material are available on our project website: https://sites.google.com/view/funcanon.

2509.18150 2026-05-19 cs.LG cs.AI

Improving MLLM Training Efficiency via Stage-Aware Sparsity

通过阶段感知稀疏性提升MLLM训练效率

Kean Shi, Liang Chen, Haozhe Zhao, Baobao Chang

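AI summary This paper proposes the Sparse Training Scheme (STS), an efficient training framework based on sparse representations whose stage-aware design adapts to the redundancy of different training stages. It uses a Visual Token Compressor and a Layer Dynamic Skipper to reduce computational overhead and is validated on multiple MLLM architectures.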

AI Chinese abstract

多模态大语言模型(MLLMs)在各种领域中表现出色,但训练效率低下,由于长输入序列和未充分利用的层间操作导致大量计算冗余。值得注意的是,这种冗余并非静态,而是随训练阶段变化。基于此观察,我们关注训练过程本身,提出了一种基于稀疏表示的高效训练框架,称为稀疏训练方案(STS)。不同于统一的稀疏性策略,STS采用阶段感知设计,适应训练过程中不同的冗余来源。具体而言,该框架包含两个互补组件:视觉标记压缩器,通过在模态对齐过程中压缩视觉标记来减少信息负载;层动态跳过器,通过在指令微调过程中动态跳过不必要的层来减轻计算开销。我们的方法广泛适用于多种MLLM架构,并已在多个基准上进行了广泛评估,证明了其有效性和效率。

English abstract

Multimodal Large Language Models (MLLMs) have demonstrated outstanding performance across a variety of domains. However, training MLLMs is often inefficient, as much of the computation is redundant due to the long input sequences from multimodal data and underutilized inter-layer operations. Notably, such redundancy is not static but varies across different stages of training. Building on this observation, we shift the focus to the training process itself and propose a training-efficient framework based on sparse representations, termed the Sparse Training Scheme (STS). Instead of applying a uniform sparsity strategy, STS adopts a stage-aware design that adapts to different sources of redundancy during training. Specifically, the framework consists of two complementary components: the Visual Token Compressor, which reduces the information load by compressing visual tokens during modality alignment, and the Layer Dynamic Skipper, which mitigates computational overhead by dynamically skipping unnecessary layers during instruction tuning. Our approach is broadly applicable to diverse MLLM architectures and has been extensively evaluated on multiple benchmarks, demonstrating its effectiveness and efficiency.

2509.16391 2026-05-19 cs.LG cs.AI cs.CV

CoUn: Empowering Machine Unlearning via Contrastive Learning

CoUn: 通过对比学习赋能机器无学习

Yasser H. Khalil, Mehdi Setayesh, Hongliang Li

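AI summary This paper proposes CoUn, a framework that adjusts learned data representations through contrastive learning and supervised learning applied only to retain data in order to improve machine unlearning effectiveness; experiments show that it outperforms existing methods across multiple datasets and model architectures.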

AI Chinese abstract

机器无学习(MU)旨在从已训练模型中移除特定'遗忘'数据的影响,同时保持对剩余'保留'数据的知识。现有的基于标签操纵或模型权重扰动的MU方法往往效果有限。为此,我们引入了CoUn,一种受观察启发的新MU框架:当模型仅使用保留数据重新训练时,它会根据保留数据的语义相似性对遗忘数据进行分类。CoUn通过对比学习(CL)和监督学习调整学习的数据表示,仅应用于保留数据。具体而言,CoUn(1)利用数据样本之间的语义相似性,通过CL间接调整遗忘表示,(2)通过监督学习保持保留表示在其各自聚类内。在各种数据集和模型架构上的广泛实验表明,CoUn在无学习有效性上 consistently 超过最先进的MU基线。此外,将我们的CL模块集成到现有基线中可以增强其无学习有效性。

English abstract

Machine unlearning (MU) aims to remove the influence of specific "forget" data from a trained model while preserving its knowledge of the remaining "retain" data. Existing MU methods based on label manipulation or model weight perturbations often achieve limited unlearning effectiveness. To address this, we introduce CoUn, a novel MU framework inspired by the observation that a model retrained from scratch using only retain data classifies forget data based on their semantic similarity to the retain data. CoUn emulates this behavior by adjusting learned data representations through contrastive learning (CL) and supervised learning, applied exclusively to retain data. Specifically, CoUn (1) leverages semantic similarity between data samples to indirectly adjust forget representations using CL, and (2) maintains retain representations within their respective clusters through supervised learning. Extensive experiments across various datasets and model architectures show that CoUn consistently outperforms state-of-the-art MU baselines in unlearning effectiveness. Additionally, integrating our CL module into existing baselines empowers their unlearning effectiveness.

2509.02351 2026-05-19 cs.CV cs.AI cs.LG

Ordinal Adaptive Correction: A Data-Centric Approach to Ordinal Image Classification with Noisy Labels

序数自适应校正:一种数据导向的带有噪声标签的序数图像分类方法

Alireza Sedighi Moghaddam, Mohammad Reza Mohammadi

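AI summary This paper proposes ORDAC, a data-centric method for ordinal image classification that uses label distribution learning to model the inherent ambiguity and uncertainty of ordinal labels and dynamically adjusts the mean and standard deviation of each sample's label distribution, correcting noisy labels and improving model performance.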

Comments 10 pages, 5 figures, 5 tables

AI Chinese abstract

标记数据是训练计算机视觉任务中监督深度学习模型的基本组成部分。然而,尤其是在序数图像分类中,类边界往往具有模糊性,因此标注过程容易产生错误和噪声。此类标签噪声会显著降低机器学习模型的性能和可靠性。本文针对序数图像分类任务中检测和校正标签噪声的问题,提出了一种新的数据导向方法,称为ORDinal Adaptive Correction(ORDAC)。该方法利用标签分布学习(LDL)的能力来建模序数标签的内在模糊性和不确定性。在训练过程中,ORDAC动态调整每个样本的标签分布的均值和标准差。与其丢弃可能含有噪声的样本不同,该方法旨在校正这些样本并充分利用整个训练数据集。所提出方法在年龄估计(Adience)和疾病严重程度检测(糖尿病视网膜病变)基准数据集上,针对各种不对称高斯噪声场景进行了评估。结果表明,ORDAC及其扩展版本(ORDAC_C和ORDAC_R)在模型性能上取得了显著提升。例如,在Adience数据集上40%的噪声情况下,ORDAC_R将均方误差从0.86降低到0.62,并将召回指标从0.37提高到0.49。该方法还展示了其在原始数据集中固有噪声的校正效果。这项研究表明,使用标签分布进行自适应标签校正是增强在存在噪声数据时序数分类模型鲁棒性和准确性的一种有效策略。

English abstract

Labeled data is a fundamental component in training supervised deep learning models for computer vision tasks. However, the labeling process, especially for ordinal image classification where class boundaries are often ambiguous, is prone to error and noise. Such label noise can significantly degrade the performance and reliability of machine learning models. This paper addresses the problem of detecting and correcting label noise in ordinal image classification tasks. To this end, a novel data-centric method called ORDinal Adaptive Correction (ORDAC) is proposed for adaptive correction of noisy labels. The proposed approach leverages the capabilities of Label Distribution Learning (LDL) to model the inherent ambiguity and uncertainty present in ordinal labels. During training, ORDAC dynamically adjusts the mean and standard deviation of the label distribution for each sample. Rather than discarding potentially noisy samples, this approach aims to correct them and make optimal use of the entire training dataset. The effectiveness of the proposed method is evaluated on benchmark datasets for age estimation (Adience) and disease severity detection (Diabetic Retinopathy) under various asymmetric Gaussian noise scenarios. Results show that ORDAC and its extended versions (ORDAC_C and ORDAC_R) lead to significant improvements in model performance. For instance, on the Adience dataset with 40% noise, ORDAC_R reduced the mean absolute error from 0.86 to 0.62 and increased the recall metric from 0.37 to 0.49. The method also demonstrated its effectiveness in correcting intrinsic noise present in the original datasets. This research indicates that adaptive label correction using label distributions is an effective strategy to enhance the robustness and accuracy of ordinal classification models in the presence of noisy data.
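
To make the label-distribution idea concrete, the sketch below builds a discretised Gaussian over ordinal class indices from a per-sample mean and standard deviation, the two quantities the abstract says ORDAC adjusts during training. The abstract does not describe the update rule itself, so only the representation is shown. The function name `gaussian_label_distribution` and the normalisation are assumptions for illustration.

```python
import numpy as np


def gaussian_label_distribution(mean, std, n_classes):
    """Discretised Gaussian over class indices 0..n_classes-1, normalised to sum to 1."""
    classes = np.arange(n_classes, dtype=float)
    logits = -0.5 * ((classes - mean) / std) ** 2
    weights = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return weights / weights.sum()


# A sample labelled as class 3 with moderate uncertainty, and the same sample
# after its mean is shifted to class 4 (for example, after a label correction).
original = gaussian_label_distribution(mean=3.0, std=1.0, n_classes=8)
corrected = gaussian_label_distribution(mean=4.0, std=1.0, n_classes=8)
```

A larger `std` spreads probability mass over neighbouring classes, which is how a soft target of this kind can express ambiguity near ordinal class boundaries.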

2508.20836 2026-05-19 cs.RO math.OC

First Experimental Demonstration of Natural Hovering Extremum Seeking: A New Paradigm in Flapping Flight Physics

首次实验性演示自然悬停极值搜索:飞行力学领域的新范式

Ahmed A. Elgohary, Rohan Palanikumar, Simone Martini, Sameh A. Eisa

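AI summary This paper presents the first experimental validation of Natural Hovering Extremum Seeking (NH-ES), showing that a model-free, real-time feedback mechanism that uses the natural oscillations of the flapping wing can produce stable hovering flight.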

AI Chinese abstract

在本文中,我们报告了首次实验性演示了最近出现的悬停和振翅飞行力学新范式,称为自然悬停极值搜索(NH-ES),该范式提出,通过无需模型的实时反馈机制,利用振翅翼的内置自然振荡作为控制和推进输入,可以生成自然界中通过振翅昆虫和蜂鸟观察到的稳定悬停飞行力学。我们进行了moth-like、光源导向的实验,使用振翅翼体在完全无模型的设置中进行,该设置不依赖形态学参数和身体/空气动力学模型。我们展示了使用NH-ES的振翅体能够自主增益高度并稳定控制负责振翅的伺服器,包括具有pitching动态(文献中认为是开环悬停不稳定的主要原因)。振翅体仅需局部光强度反馈即可有效稳定悬停在光源附近。我们的结果也实现了在延迟和噪声效应下的验证,支持了之前观察到的NH-ES对潜在处理延迟和噪声感觉的鲁棒性。

English abstract

In this letter, we report the first experimental demonstration of a recently proposed paradigm in hovering and flapping flight physics called Natural Hovering Extremum Seeking (NH-ES) [doi.org/10.1103/4dm4-kc4g], which theorized that the stable hovering flight observed in nature in flapping insects and hummingbirds can be generated via a model-free, real-time, computationally basic, sensory-based feedback mechanism that needs only the built-in natural oscillations of the flapping wing as both the control and the propulsive input. We run moth-like, light-source-seeking experiments on a flapping-wing body in a fully model-free setting that is agnostic to morphological parameters and body/aerodynamic models. We show that the flapping body using NH-ES gains altitude and autonomously stabilizes the servos responsible for flapping, including in the presence of pitching dynamics (believed in the literature to be a main cause of instability in open-loop hovering). The flapping body hovers stably about the light source, needing only feedback from local measurements of light intensity. Our results were also achieved under delay and noise effects, supporting earlier observations that NH-ES is robust against potential processing delays and noisy sensing.

2508.13977 2026-05-19 cs.CV

ROVR-Open-Dataset: A Large-Scale Depth Dataset for Autonomous Driving

ROVR-Open-Dataset: 一个大规模深度数据集用于自动驾驶

Xianda Guo, Ruijun Zhang, Yiqun Duan, Ruilin Wang, Matteo Poggi, Keyuan Zhou, Wenzhao Zheng, Wenke Huang, Gangwei Xu, Yanlun Peng, Yuan Si, Qin Zou

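AI summary This paper presents ROVR-Open-Dataset, a large-scale, diverse, and cost-efficient depth dataset for improving spatial perception in autonomous driving. It covers a wide range of scenes, illumination, and weather conditions, provides validated ground truth that supports robust model training, and identifies three failure modes shared by current architectures.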

AI Chinese abstract

深度估计是自动驾驶和其他在开放城市环境中运行的无人驾驶系统空间感知的基本组成部分。现有的深度数据集如KITTI、nuScenes和DDAD虽然推动了该领域的发展,但在多样性和可扩展性方面存在局限,且在这些数据集上的基准性能已接近饱和。一个较少讨论的约束是传感器经济性:这些数据集背后的定制多激光雷达装置成本高、耗电且难以在大规模车队中复制,这限制了任何单一基准所能覆盖的地理和时间多样性。我们提出了ROVR,一个大规模、多样化且成本效益高的深度数据集,旨在捕捉现实驾驶的复杂性。ROVR包含200,000个高分辨率帧,涵盖高速公路、乡村和城市场景,覆盖昼夜周期和恶劣天气条件,收集于北美洲、欧洲和亚洲。我们还发布了校准、同步、预处理和隐私管道,使该平台能够被第三方复现。轻量级的采集管道支持可扩展的收集,而稀疏但统计上充分的地面真实数据——通过密度消融验证——支持稳健的模型训练。广泛的消融研究进一步表征了不同场景类型、光照、天气条件和地面真实稀疏程度下的性能,并识别出三种定性不同的失败模式——光度崩溃、几何混淆和范围饱和——这些当前架构共享。该数据集、数据加载器、校准和隐私管道以及评估代码已在https://xiandaguo.net/ROVR-Open-Dataset上公开发布。

English abstract

Depth estimation is a fundamental component of spatial perception for autonomous driving and other unmanned systems operating in open urban environments. Existing depth datasets such as KITTI, nuScenes, and DDAD have advanced the field but are limited in diversity and scalability, and benchmark performance on them is approaching saturation. A less discussed constraint is sensor economics: the bespoke multi-LiDAR rigs behind these datasets are expensive, power-hungry, and difficult to replicate at fleet scale, which caps the geographic and temporal diversity that any single benchmark can cover. We present ROVR, a large-scale, diverse, and cost-efficient depth dataset designed to capture the complexity of real-world driving. ROVR comprises 200K high-resolution frames across highway, rural, and urban scenarios, spanning day/night cycles and adverse weather conditions, collected across North America, Europe, and Asia. We additionally release the calibration, synchronization, preprocessing, and privacy pipeline so that the platform can be reproduced by third parties. The lightweight acquisition pipeline enables scalable collection, while sparse but statistically sufficient ground truth, validated by a density ablation, supports robust model training. Extensive ablation studies further characterize performance across scene types, illumination, weather conditions, and ground-truth sparsity levels, and identify three qualitatively distinct failure modes (photometric collapse, geometric confusion, and range saturation) that current architectures share. The dataset, data loaders, calibration and privacy pipelines, and evaluation code are publicly available at https://xiandaguo.net/ROVR-Open-Dataset.

2508.05415 2026-05-19 cs.RO

Do Robots Really Need Anthropomorphic Hands? A Comparison of Human and Robotic Hands

机器人真的需要拟人化的手吗?人类手与机器人手的比较

Alexander Fabisch, Wadhah Zai El Amri, Chandandeep Singh, Nicolás Navarro-Guerrero

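AI summary By comparing the biomechanics, perception, and control mechanisms of human and robotic hands, this paper examines whether robots need anthropomorphic hands. It finds that complex hand designs are not necessary for all tasks, that mechanism complexity correlates with the breadth of tasks a hand can perform, and that sensor integration and intelligent manipulation strategies remain underexplored.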

AI Chinese abstract

人类操控技能是其自愿运动功能的巅峰,需要协调多个自由度并处理高维传感器输入以实现卓越的灵活性。因此,我们试图回答是否人类手与其相关的生物力学特性、传感器和控制机制是机器人应追求的理想。机器人真的需要拟人化手吗?我们首先从生物力学和感知的角度提取人类手的特征,与目前商用的机器人手进行比较。通过这种比较,我们得出研究问题,将操控系统复杂性与技能 repertoire 大小和灵活性联系起来。我们通过系统文献综述来回答这些问题,在2019-2025年的125篇论文中分析了操控能力。尽管复杂的五指手常被认为是机器人操控器的终极目标,但并非所有任务都必需。我们发现,在手内操控并不受益于拟人化手设计,因为更简单的机制就足够,但机制复杂性与手能执行的操控任务的广度相关。传感器集成和智能操控策略仍处于探索阶段,这可能是因为与手设计的不匹配:而不是复制手指数量和自由度,关注鲁棒性和柔软性将允许更智能的控制和学习,以利用环境接触并集成更多传感器。最后,我们呼吁标准化的评估标准,以实现手部设计和操控系统系统的比较。

English abstract

Human manipulation skills represent a pinnacle of their voluntary motor functions, requiring the coordination of many degrees of freedom and processing of high-dimensional sensor input to achieve remarkable dexterity. Thus, we set out to answer whether the human hand, with its associated biomechanical properties, sensors, and control mechanisms, is an ideal that we should strive for in robotics. Do robots need anthropomorphic hands? We start by extracting characteristics of the human hand in terms of biomechanics and perception to compare them with currently commercially available robotic hands. From this comparison, we derive our research questions that connect manipulation system complexity to skill repertoire size and dexterity. We attempt to answer these with a systematic literature review, in which we analyze the manipulation capabilities demonstrated in 125 papers from 2019-2025. Although complex five-fingered hands are often considered the ultimate goal for robotic manipulators, they are not necessary for all tasks. We find that in-hand manipulation does not benefit from anthropomorphic hand design as simpler mechanisms are sufficient, but mechanism complexity correlates with the breadth of manipulation tasks a hand can perform. Sensor integration and intelligent manipulation strategies remain underexplored, which may be because of a misalignment with hand design: instead of replicating the number of fingers and degrees of freedom, focusing on robustness and softness would allow more intelligent control and learning to exploit environmental contacts and integrate more sensors. Finally, we argue for standardized evaluation criteria to enable systematic comparison of hand designs and manipulation systems.

2507.21035 2026-05-19 cs.AI cs.LG cs.MA q-bio.GN

GenoMAS: A Multi-Agent Framework for Scientific Discovery via Code-Driven Gene Expression Analysis

GenoMAS:通过代码驱动的基因表达分析进行科学发现的多智能体框架

Haoyang Liu, Yijiang Li, Haohan Wang

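AI summary This study proposes GenoMAS, a multi-agent framework that orchestrates six specialized LLM agents through typed message-passing protocols for gene expression data processing and scientific discovery; it outperforms existing methods on both data preprocessing and gene identification.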

Comments 51 pages (14 pages for the main text, 10 pages for references, and 27 pages for the appendix)

AI Chinese abstract

基因表达分析对于许多生物医学发现至关重要,但从原始转录组数据中提取见解仍然极具挑战性,这归因于多个大型半结构化文件的复杂性和对大量领域专业知识的需求。当前的自动化方法往往受到不灵活的工作流或完全自主代理的限制,这些代理缺乏进行严谨科学探究所需的精确度。GenoMAS则另辟蹊径,通过集成结构化工作流的可靠性与自主代理的适应性,提出了一支基于LLM的科学家团队。GenoMAS通过类型消息传递协议协调六个专门的LLM代理,每个代理都为共享的分析画布贡献互补的强项。GenoMAS的核心是一个引导规划框架:编程代理将高层任务指南展开为动作单元,并在每个节点选择前进、修订、绕过或回溯,从而在保持逻辑一致性的同时,灵活适应基因组数据的特性。在GenoTEX基准测试中,GenoMAS在数据预处理方面达到了89.13%的复合相似度相关性,在基因识别方面达到了60.48%的F1分数,分别超过了最佳现有方法10.61%和16.85%。除了指标外,GenoMAS还揭示了由文献支持的生物合理基因-表型关联,同时调整了潜在混杂因素。代码可在https://github.com/Liu-Hy/GenoMAS上获得。

English abstract

Gene expression analysis holds the key to many biomedical discoveries, yet extracting insights from raw transcriptomic data remains formidable due to the complexity of multiple large, semi-structured files and the need for extensive domain expertise. Current automation approaches are often limited by either inflexible workflows that break down in edge cases or by fully autonomous agents that lack the necessary precision for rigorous scientific inquiry. GenoMAS charts a different course by presenting a team of LLM-based scientists that integrates the reliability of structured workflows with the adaptability of autonomous agents. GenoMAS orchestrates six specialized LLM agents through typed message-passing protocols, each contributing complementary strengths to a shared analytic canvas. At the heart of GenoMAS lies a guided-planning framework: programming agents unfold high-level task guidelines into Action Units and, at each juncture, elect to advance, revise, bypass, or backtrack, thereby maintaining logical coherence while bending gracefully to the idiosyncrasies of genomic data. On the GenoTEX benchmark, GenoMAS reaches a Composite Similarity Correlation of 89.13% for data preprocessing and an F$_1$ of 60.48% for gene identification, surpassing the best prior art by 10.61% and 16.85% respectively. Beyond metrics, GenoMAS surfaces biologically plausible gene-phenotype associations corroborated by the literature, all while adjusting for latent confounders. Code is available at https://github.com/Liu-Hy/GenoMAS.

2507.20917 2026-05-19 cs.CL cs.AI

MediQAl: A French Medical Question Answering Dataset for Knowledge and Reasoning Evaluation

MediQAl: 一个用于知识和推理评估的法语医学问答数据集

Adrien Bazoge

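AI summary This paper introduces MediQAl, a dataset for evaluating language models on factual medical recall and reasoning over real-world clinical scenarios. It contains 32,603 French medical questions across 41 medical subjects and three tasks; an evaluation of 14 large language models reveals a significant performance gap between factual recall and reasoning tasks.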

AI Chinese abstract

本文介绍了MediQAl,一个法语医学问答数据集,旨在评估语言模型在事实性医学记忆和现实临床场景推理方面的能力。MediQAl包含32,603个问题,来源于41个医学科目中的法语医学考试。该数据集包含三种任务:(i) 有唯一答案的多项选择题,(ii) 有多个答案的多项选择题,以及(iii) 有短答案的开放性问题。每个问题都被标记为理解或推理,使能够对模型的认知能力进行详细分析。我们通过与14个大型语言模型的广泛评估,包括最近的推理增强模型,验证了MediQAl数据集,并观察到事实记忆与推理任务之间存在显著的性能差距。我们的评估为评估语言模型在法语医学问答上的性能提供了全面的基准,填补了医学领域多语言资源中的关键空白。

English abstract

This work introduces MediQAl, a French medical question answering dataset designed to evaluate the capabilities of language models in factual medical recall and reasoning over real-world clinical scenarios. MediQAl contains 32,603 questions sourced from French medical examinations across 41 medical subjects. The dataset includes three tasks: (i) Multiple-Choice Question with Unique answer, (ii) Multiple-Choice Question with Multiple answer, and (iii) Open-Ended Question with Short-Answer. Each question is labeled as Understanding or Reasoning, enabling a detailed analysis of models' cognitive capabilities. We validate the MediQAl dataset through extensive evaluation with 14 large language models, including recent reasoning-augmented models, and observe a significant performance gap between factual recall and reasoning tasks. Our evaluation provides a comprehensive benchmark for assessing language models' performance on French medical question answering, addressing a crucial gap in multilingual resources for the medical domain.