arXivDaily: arXiv Daily Academic Digest (updated Monday through Friday)
2605.29336 2026-05-29 cs.CL

Enhancing Factuality through Consensus and Consistency in Summarization Using Minimum Bayes Risk Decoding

Riza Setiawan Soetedjo, Yusuke Sakai, Hidetaka Kamigaito, Jingun Kwon, Manabu Okumura, Taro Watanabe

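AI summary: Proposes ConSUM, which uses Minimum Bayes Risk decoding to establish consensus among candidate summaries and combines it with metrics of consistency with the source document for reranking, to improve summary factuality.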

Comments Accepted to ACL 2026 Findings

Abstract

Improving the quality of model-generated summaries, especially factuality, the accuracy of a summary with respect to its source content, remains a challenge. While reranking could select the optimal output from multiple generated candidates, it is limited to only using the source as guidance, resulting in unreliable summaries. To address this limitation, we propose ConSUM that reranks candidate summaries by considering two factors: consistency to the source document and consensus among the other candidates. Consensus is established using Minimum Bayes Risk (MBR) decoding over the set of generated summaries, while ensuring consistency by employing factuality-aware metrics that compare the summary against the source. Rigorous testing demonstrates that our system is competitive with existing methods, with human evaluations further confirming that its generated summaries are preferred over those from other systems. Our code is available at https://github.com/naist-nlp/ConSUM .
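
The reranking idea in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the Jaccard utility, the `source_support` consistency proxy, and the `alpha` mixing weight are all assumptions standing in for the metrics the paper actually uses.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity, a toy stand-in for a real utility metric."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def source_support(summary: str, source: str) -> float:
    """Fraction of summary tokens found in the source (toy consistency proxy)."""
    tokens = summary.lower().split()
    source_vocab = set(source.lower().split())
    return sum(t in source_vocab for t in tokens) / len(tokens) if tokens else 0.0


def rerank(candidates: list[str], source: str, alpha: float = 0.5) -> str:
    """Pick the candidate maximizing alpha * consensus + (1 - alpha) * consistency."""
    def score(i: int) -> float:
        others = [c for j, c in enumerate(candidates) if j != i]
        # MBR-style consensus: average utility against the other candidates,
        # which act as pseudo-references.
        consensus = sum(jaccard(candidates[i], o) for o in others) / len(others) if others else 0.0
        return alpha * consensus + (1 - alpha) * source_support(candidates[i], source)

    return candidates[max(range(len(candidates)), key=score)]
```

With `alpha = 1.0` the selection reduces to consensus alone; lowering `alpha` gives the source-consistency term more say.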

2605.29335 2026-05-29 cs.CV cs.AI

Rethinking FID Through the Geometry of the Reference Dataset

Yunghee Lee, Byeonghyun Pak

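AI summary: This paper analyzes geometric properties of the reference dataset (density and effective rank) to explain the inconsistency between Fréchet Inception Distance (FID) and sample quality, and argues that generative models should be evaluated with the reference dataset's geometry taken into account for greater reliability.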

Comments 9 pages, 2 figures. Accepted to ICML 2026 Workshop: Combining Theory and Benchmarks

Abstract

Fréchet Inception Distance (FID) is widely used to evaluate image generators, yet lower FID does not always correspond to better sample quality. We show that this mismatch depends in part on the geometry of the reference dataset. In a controlled study across six datasets, distributional density and effective rank significantly explain how FID changes as sample quality improves. Concentrated datasets tend to yield more favorable FID trends, whereas more dispersed datasets can make FID worsen despite better samples. Attribution to precision and recall and ablations with alternative feature spaces and distances support the same conclusion. These results suggest that distributional metrics should be interpreted together with the geometry of the reference dataset for more reliable benchmarking.
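
For reference, the standard definition of FID fits a Gaussian to the reference features and another to the generated features, then takes the Fréchet distance between them. The symbols below are notation chosen here, not taken from the abstract: $\mu_r, \Sigma_r$ are the mean and covariance of the reference features and $\mu_g, \Sigma_g$ those of the generated features.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Both terms depend on the reference statistics $\mu_r$ and $\Sigma_r$, which is consistent with the abstract's point that the reference dataset's geometry shapes how FID behaves.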

2605.29330 2026-05-29 cs.CV

EarthShift: a benchmark for measuring robustness to real-world distribution shifts in Earth observation

Kelsey Doerksen, Hannah Kerner

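AI summary: Proposes the EarthShift benchmark, which uses paired datasets from multiple sources to evaluate the robustness of geospatial foundation models under real-world distribution shifts in time, geography, scale, and sensor, and finds that model performance drops by 15-20% on average.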

Abstract

Current Earth observation benchmarks focus on measuring performance on diverse tasks and applications, typically measuring generalization in-distribution. But when models are deployed, they must generalize to myriad out-of-distribution scenarios, such as new time periods, geographies, scales, and sensors. We introduce EarthShift: the first public testbed for benchmarking robustness across multiple realistic distribution shifts encountered in remote sensing. EarthShift enables users to measure distributional robustness by comparing performance in- and out-of-distribution using paired datasets from different sources, temporal windows, geographic locations, and sensors. Our experiments on 8 geospatial foundation models (GFMs) and 11 tasks covering 5 shift types show that GFMs consistently perform 15-20% worse out-of-distribution on average regardless of model architecture, size, pre-training or fine-tuning strategy. We show that GFM robustness is similar to that of generic vision foundation models, and even fully-supervised models. This highlights a need for future research to strive for improvements in distributional robustness, not just performance, which can be benchmarked using EarthShift. We release our code and datasets to provide a testbed to guide future work to create foundation models that are robust and reliable in real-world applications. Code and data for EarthShift are available at: https://earthshift.github.io

2605.29327 2026-05-29 cs.CL cs.LG

Reasoning-preserved Efficient Distillation of Large Language Models via Activation-aware Initialization

Junlin He, Yihong Tang, Tong Nie, Guilong Li, Binyu Yang, Jinxiao Du, Lijun Sun, Wei Ma

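AI summary: To address the severe degradation of multi-step reasoning caused by efficient distillation (reasoning collapse), proposes RED, which uses activation-aware initialization to initialize projection matrices as channel-selection matrices, theoretically mitigating effective-rank collapse and recovering reasoning ability while maintaining efficient training and general performance.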

Abstract

Efficient Distillation (EDistill) compresses large language models (LLMs) by structurally pruning parameters and tuning lightweight modules with high training efficiency. Although these EDistilled LLMs achieve state-of-the-art (SOTA) performance on general ability benchmarks relative to similarly sized LLMs, we identify a severe degradation in their multi-step reasoning ability, which we term reasoning collapse. We systematically analyze the geometric origins of reasoning collapse and show that the SOTA EDistill method based on width-reducing projection matrices suffers from eRank collapse, in which the effective rank (eRank) of hidden representations drops. We theoretically explain how singular values of randomly initialized projection matrices become unevenly distributed, leading to eRank collapse and thus token indistinguishability. To address this issue, we propose RED (Reasoning-preserved Efficient Distillation) for LLMs, which introduces activation-aware initialization to initialize projection matrices as channel-selection matrices, thus theoretically mitigating eRank collapse. Experiments on Llama and Qwen series demonstrate that RED substantially recovers reasoning while maintaining high training efficiency and SOTA general ability.
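
To make "unevenly distributed singular values lead to eRank collapse" concrete, here is a small sketch using one common definition of effective rank: the exponential of the Shannon entropy of the normalized singular values. Whether the paper uses exactly this definition is an assumption, and the function name is illustrative.

```python
import math


def effective_rank(singular_values: list[float]) -> float:
    """exp of the Shannon entropy of the normalized singular values."""
    total = sum(singular_values)
    if total <= 0:
        raise ValueError("need at least one positive singular value")
    # Zero singular values contribute nothing to the entropy.
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)
```

Under this definition, a spectrum with equal singular values has an effective rank equal to the number of values, while a spectrum dominated by one value has an effective rank close to 1.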

2605.29326 2026-05-29 cs.LG

NeuroEdge: Real-Time Hand Gesture Recognition with High-Density EMG Using Deep Learning at the Edge

Peter Chudinov, Zhenyu Lin, Jay Motamarry, Srihita Panati, Xiaorong Zhang, Zhuwei Qin

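AI summary: Proposes the NeuroEdge system, which uses wireless HD-EMG streaming and a lightweight CNN inference engine to perform real-time gesture recognition on microcontrollers, with 90% accuracy and 83 ms latency.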

Abstract

High-density electromyography (HD-EMG) has emerged as a powerful modality for decoding fine-grained neuromuscular activity, enabling real-time neural-machine interfaces (NMIs) for applications such as prosthetic control, rehabilitation, and augmented interaction. While deep learning approaches such as convolutional neural networks (CNNs) have demonstrated high classification accuracy for EMG-based gesture recognition, their deployment on embedded hardware remains a major challenge due to computational and memory constraints. This paper presents NeuroEdge, a real-time HD-EMG-based NMI system that performs gesture recognition entirely on resource-constrained microcontrollers. The system features two custom-designed modules: the HD-EMG StreamBridge, a wireless communication interface that streams raw HD-EMG data from a Quattrocento amplifier to an ESP32 microcontroller; and the EdgeDL Inference Engine, a lightweight deep learning framework executing on a Sony Spresense microcontroller. A compact 1-dimensional CNN optimized for embedded inference processes sliding windows of EMG data in real time. Data streaming and inference are pipelined and synchronized through an architecture that utilizes Direct Memory Access (DMA) for data transfer and Serial Peripheral Interface (SPI) burst communication between the ESP32 and Spresense, ensuring low-latency performance. Experimental results show that NeuroEdge achieves a real-time classification accuracy of 90% across seven hand gestures, with a total average latency of 83 ms using 192 channels of HD-EMG recorded from the forearm. Our system demonstrates the feasibility of deploying complex HD-EMG-based gesture recognition on microcontroller-based edge devices, bridging the gap between high-resolution biosignal acquisition and deep learning-based embedded inference for next-generation NMIs.

2605.29325 2026-05-29 cs.CV

Multi-Stage VLM Pipeline for Zero-Shot Traffic Accident Understanding

Fumiya Tatematsu, Fumihiko Takahashi

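AI summary: Proposes a three-stage VLM pipeline that performs zero-shot traffic accident prediction on a frozen Qwen3-VL-32B-Instruct and a 235B MoE model, and wins the CVPR 2026 ACCIDENT challenge through 9:1 blending and alignment to vehicle detections.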

Comments Accepted at the AUTOPILOT Workshop, CVPR 2026 (non-archival). Workshop Paper ID 13. Code: https://github.com/fuumin621/cvpr2026-accident-1st-place-solution

Abstract

We present the 1st-place solution to the ACCIDENT challenge at the CVPR 2026 AUTOPILOT Workshop, which asks for zero-shot prediction of accident timing, impact centroid, and collision type from CCTV footage. On a frozen Qwen3-VL-32B-Instruct checkpoint we build a three-stage pipeline (full-video joint prediction, time refinement, and single-frame grounding of the impact centroid), run the same pipeline a second time on a 235B Mixture-of-Experts sibling, blend the two outputs 9:1, and finally snap each predicted point onto the nearest vehicle detection. The final system reaches Public LB 0.55469 / Private LB 0.57080, roughly +0.21 over the strongest host baseline (Molmo-7B, 0.358) and wins the challenge. We ablate each component, report the negative results that shaped the final design, and release the code at https://github.com/fuumin621/cvpr2026-accident-1st-place-solution.
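
The last two steps of the pipeline (blending and snapping) might look roughly like the sketch below. Several details are assumptions, since the abstract does not specify them: that blending is a weighted average of point coordinates, which model receives the larger weight, and that "snapping" moves the point to the center of the nearest detection box. All names here are hypothetical.

```python
import math

Point = tuple[float, float]
Box = tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def blend(primary: Point, secondary: Point, weight: float = 0.9) -> Point:
    """Weighted average of two predicted points; 0.9 corresponds to a 9:1 blend."""
    return (
        weight * primary[0] + (1 - weight) * secondary[0],
        weight * primary[1] + (1 - weight) * secondary[1],
    )


def snap_to_nearest_box(point: Point, boxes: list[Box]) -> Point:
    """Move the point to the center of the closest detection box, if any exist."""
    if not boxes:
        return point  # nothing to snap to; keep the raw prediction
    centers = [((x0 + x1) / 2, (y0 + y1) / 2) for x0, y0, x1, y1 in boxes]
    return min(centers, key=lambda c: math.dist(c, point))
```

Snapping acts as a cheap prior that impact points should lie on vehicles; the released repository is the place to check how the authors actually implement it.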

2605.29324 2026-05-29 cs.CL cs.CV

STAMP: Training Explicit Memory for Mobile GUI Agents in Controllable and Scalable Virtual Environments

Junyang Wang, Haiyang Xu, Xi Zhang, Zhaoqing Zhu, Ming Yan, Jieping Ye, Jitao Sang

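AI summary: Proposes the STAMP framework, which injects deterministic memory variables into controllable virtual environments to generate verifiable supervised data and support online reinforcement learning, addressing failures of mobile GUI agents on long-horizon tasks caused by context-window limits and the lack of explicit memory.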

Comments 24 pages, 4 figures, 21 tables

Abstract

Mobile GUI agents excel at immediate reactive control but frequently fail in realistic, long-horizon tasks that require memory. This failure stems from a fundamental conflict between limited context windows and token-heavy screenshots. To save the limited context, agents must progressively discard older visual history, permanently losing crucial transient information. Furthermore, existing action-centric datasets fail to teach agents what or when to explicitly memorize, and augmenting static real-world data is prohibitively expensive and lacks interactive verification. To resolve this, we present STAMP, a framework that trains explicit memory in mobile agents through controllable virtual environments, where deterministic memory variables are programmatically injected into synthesized tasks to control what must be memorized, when it should be encoded, and when it must later be retrieved, thereby producing verifiable supervised data at scale and enabling online reinforcement learning through environment-driven reward feedback. Evaluated on our newly introduced Memory-World benchmark, the resulting Stamp-GUI agent achieves state-of-the-art performance among GUI-specialized models and sets a new high-water mark on that benchmark, demonstrating exceptional memory accuracy and task resilience while maintaining strong general mobile navigation capabilities.

2605.29319 2026-05-29 cs.CL

Rethinking Stepwise Model Routing: A Cost-Efficient Table Reasoning Perspective

Shenghao Ye, Yuxiang Wang, Yu Guo, Dong Jin, Shuangwu Chen, Jian Yang

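AI summary: Proposes the EcoTab framework, which separately estimates the uncertainty of table tokens and text tokens and maps it to next-step failure risk, achieving a better balance between accuracy and efficiency in table reasoning.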

Comments 17 pages, 15 figures, submitted to EMNLP 2026

Abstract

Large Reasoning Models (LRMs) achieve strong performance on table reasoning tasks but incur substantial inference cost due to long reasoning traces. Stepwise model routing mitigates this issue by dynamically assigning reasoning steps to smaller or larger models. However, stepwise model routing for table reasoning remains underexplored. Through empirical analysis, we find that reasoning steps involving tables contain two types of tokens with distinct uncertainty distributions: table tokens grounded in table structure, such as cell values and headers, and text tokens representing surrounding natural-language reasoning. The uncertainty of both token types is correlated with the risk that the model makes an error in the next reasoning step. However, existing methods fail to model them separately, leading to suboptimal routing decisions. To address this, we propose EcoTab, a table-aware stepwise routing framework for efficient table reasoning. At each reasoning step, EcoTab separately estimates the uncertainties of table tokens and text tokens, maps them to next-step failure risks for the small model, and combines the two risks for routing. Experiments on multiple table reasoning benchmarks show that EcoTab consistently outperforms strong baselines and achieves a better balance between accuracy and efficiency.
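
The routing rule described above (separate uncertainties, mapped to risks, then combined) can be illustrated with a toy sketch. Everything specific here is an assumption made for illustration: mean token entropy as the uncertainty measure, a logistic map with made-up `slope` and `bias` values, a noisy-OR combination, and the 0.5 threshold. The abstract does not state how EcoTab implements any of these.

```python
import math


def mean_entropy(token_dists: list[list[float]]) -> float:
    """Average Shannon entropy (nats) over per-token probability distributions."""
    if not token_dists:
        return 0.0
    entropies = [-sum(p * math.log(p) for p in dist if p > 0) for dist in token_dists]
    return sum(entropies) / len(entropies)


def risk(uncertainty: float, slope: float, bias: float) -> float:
    """Logistic map from an uncertainty score to a failure probability."""
    return 1.0 / (1.0 + math.exp(-(slope * uncertainty + bias)))


def route(table_dists: list[list[float]], text_dists: list[list[float]],
          threshold: float = 0.5) -> str:
    """Return which model should run the next reasoning step."""
    # Each token type gets its own (hypothetical) calibration.
    table_risk = risk(mean_entropy(table_dists), slope=4.0, bias=-2.0)
    text_risk = risk(mean_entropy(text_dists), slope=2.0, bias=-2.0)
    # Noisy-OR: the step is risky if either token type signals risk.
    combined = 1.0 - (1.0 - table_risk) * (1.0 - text_risk)
    return "large" if combined > threshold else "small"
```

Keeping a separate calibration per token type is the point of the sketch: the same entropy value can imply different failure risk for a cell value than for surrounding prose.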

2605.29316 2026-05-29 cs.CV

CapTalk: Text-Guided Stylization and Speech-Driven 3D Head Animation

Xuangeng Chu, Yuan Gan, Ziteng Cui, Shuhong Liu, Jian Wang, Bing Zhou, Tatsuya Harada

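AI summary: Proposes the CapTalk framework, which controls speaking style and emotion through text descriptions and, driven by speech, generates synchronized lip movements and facial expressions, with support for dynamic emotion changes.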

Abstract

Audio-driven 3D facial animation aims to generate synchronized lip movements and vivid facial expressions from arbitrary audio clips. While existing methods can produce synchronized lip motions, they often rely on predefined identity or style latent features, which limits users' ability to freely control speaking styles. Moreover, applying a fixed style or identity to an entire audio segment typically results in facial animation styles that do not adapt to the emotional content of the audio. To address these challenges, we revisit the entanglement between style and emotion, construct a large-scale dataset with textual descriptions of both style and emotion, and propose a novel talking head generation framework that enables separate control over style and emotion. Our model takes as input both textual descriptions of speaking style and character emotion, as well as the driving audio stream, enabling real-time generation of highly synchronized lip movements and facial expressions that match the provided descriptions. Furthermore, our model supports dynamic emotion control during inference, allowing it to handle scenarios where the target emotion changes throughout the speech.

2605.29313 2026-05-29 cs.CL

PatchBoard: Schema-Grounded State Mutation for Reliable and Auditable LLM Multi-Agent Collaboration

Shuyu Zhang, Yaqi Shi, Lu Wang

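AI summary: Proposes the PatchBoard architecture, which replaces inter-agent dialogue with schema-constrained JSON Patch state mutations to enable verifiable, auditable multi-agent collaboration, reaching an 84.6% success rate on ALFWorld tasks with a token consumption of 45.5k.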

Abstract

LLM multi-agent systems often coordinate through natural-language dialogue or loosely structured shared memory, making intermediate state difficult to validate, attribute, and audit. We introduce PatchBoard, a schema-grounded collaboration architecture that replaces inter-agent dialogue with validated JSON Patch mutations over a shared structured state. An Architect agent constructs a task-specific schema and workflow rules, while a deterministic kernel validates each proposed state mutation against schema constraints, role-specific write contracts, and runtime invariants before committing it transactionally. On 630 matched ALFWorld episodes, PatchBoard achieves an 84.6% success rate, compared with 30.8% for LangGraph and 61.6% for Flock, while reducing tokens per successful task to 45.5k, compared with 368.3k and 64.2k, respectively.
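
A validate-then-commit kernel of the kind described above could be sketched as follows. This is not PatchBoard's code: the function names, the shape of `write_contracts`, and the use of plain predicate functions in place of full schema validation are assumptions. Only the `add` and `replace` operations on nested dictionaries are handled; JSON Pointer escaping and arrays are left out.

```python
import copy


def apply_op(state: dict, op: dict) -> None:
    """Apply one 'add' or 'replace' operation to nested dicts, in place."""
    if op["op"] not in ("add", "replace"):
        raise ValueError(f"unsupported op {op['op']}")
    *parents, leaf = op["path"].strip("/").split("/")
    node = state
    for key in parents:
        node = node[key]
    if op["op"] == "replace" and leaf not in node:
        raise KeyError(f"cannot replace missing path {op['path']}")
    node[leaf] = op["value"]


def commit(state: dict, patch: list[dict], role: str,
           write_contracts: dict, invariants: list) -> dict:
    """Validate a patch on a copy; return the new state or raise and keep the old one."""
    allowed = write_contracts.get(role, ())
    for op in patch:
        path = op["path"]
        # Role-specific write contract: the path must be an allowed path or a child of one.
        if not any(path == p or path.startswith(p + "/") for p in allowed):
            raise PermissionError(f"{role} may not write {path}")
    draft = copy.deepcopy(state)  # transactional: never mutate the live state
    for op in patch:
        apply_op(draft, op)
    for check in invariants:
        if not check(draft):
            raise ValueError("invariant violated; patch rejected")
    return draft
```

Because every check runs against a deep copy, a rejected patch leaves the shared state untouched, and each raised error names the role and path, which helps with attribution and auditing.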

2605.29310 2026-05-29 cs.AI cs.CL

Rubric-Guided Process Reward for Stepwise Model Routing

Shenghao Ye, Yu Guo, Zhengheng Li, Shuangwu Chen, Jian Yang

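AI summary: Proposes the RoRo framework, which collects routing trajectories, constructs preference pairs, trains a Rubricor to generate evaluation rubrics and a Judge to score trajectories, and combines process and outcome rewards to optimize the routing policy, improving the accuracy and cost efficiency of stepwise routing for large reasoning models.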

Comments 17 pages, 9 figures, submitted to EMNLP 2026

Abstract

Stepwise model routing improves the efficiency of Large Reasoning Models (LRMs) by assigning each reasoning step to a suitable model. Recent methods formulate routing as a sequential decision process and train the router with reinforcement learning. However, although they model routing as a process, they still supervise the router with outcome rewards. Such rewards only reflect final answer correctness and fail to evaluate intermediate routing decisions, which can weaken performance and generalization. To address this gap, we propose RoRo, a rubric-guided process reward framework for stepwise model routing. RoRo first collects diverse routing trajectories and constructs preference pairs based on outcome, cost, and process quality. It then trains a Rubricor to generate a query-specific evaluation rubric and a Judge to score routing trajectories under this rubric through alternating optimization. The resulting process rewards are combined with outcome rewards to optimize the routing policy via GRPO. Experiments on five reasoning benchmarks under both same-family and cross-family settings show that RoRo consistently outperforms strong baselines and achieves better accuracy and cost trade-offs.
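
The final optimization step can be illustrated with the group-relative advantage that gives GRPO its name: each trajectory's reward is normalized against the other trajectories sampled for the same query. The additive mix of process and outcome rewards and its `weight` are assumptions, as is the choice of population standard deviation; the abstract does not say how RoRo combines the two rewards.

```python
import statistics


def combined_reward(outcome: float, process: float, weight: float = 0.5) -> float:
    """Mix outcome and process rewards; the weight is a hypothetical knob."""
    return outcome + weight * process


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward by the mean and standard deviation of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps keeps the division defined when every trajectory gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]
```

With outcome-only rewards, a group in which every trajectory reaches the same final answer yields all-zero advantages; adding a process term lets such trajectories still be told apart.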

2605.29307 2026-05-29 cs.CL cs.AI cs.IR cs.LG

GrepSeek: Training Search Agents for Direct Corpus Interaction

Alireza Salemi, Chang Zeng, Atharva Nijasure, Jui-Hui Chung, Razieh Rahimi, Fernando Diaz, Hamed Zamani

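AI summary: Proposes GrepSeek, which trains a compact search agent to interact directly with a text corpus through shell commands, using two-stage training (a cold-start dataset plus GRPO optimization) and a semantics-preserving sharded-parallel execution engine, and achieves the best F1 and Exact Match on open-domain question answering.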

Abstract

Large Language Model (LLM) search agents have shown strong promise for knowledge-intensive language tasks through multiple rounds of reasoning and information retrieval. Most existing systems access information using a retriever that takes a keyword or natural language query and returns a ranked list of documents using an index of pre-computed document representations. In this work, we explore a complementary perspective in which the search agent treats the corpus itself as the search environment and finds evidence by issuing executable shell commands. We introduce GrepSeek, an optimized direct corpus interaction (DCI) search agent that trains a compact search agent to find, filter, and compose evidence from large text corpora. To address the instability of learning behavior directly with reinforcement learning on large corpora, we propose a two-stage training pipeline. First, we construct a cold-start dataset using an answer-aware Tutor and answer-blind Planner to generate verified, causally grounded search trajectories. Second, we refine the initialized policy with Group Relative Policy Optimization (GRPO), allowing the agent to improve its task-oriented search behavior through direct interaction with the corpus. To make DCI practical at scale, we further use a semantics-preserving sharded-parallel execution engine that accelerates shell-based retrieval by up to $7.6\times$ while preserving byte-exact equivalence with sequential execution of the shell command. Experiments across seven open-domain question answering benchmarks show that GrepSeek achieves the strongest overall token-level $F_1$ and Exact Match. Our analysis also highlights the limitations of purely lexical interaction on queries with substantial surface-form variation, suggesting DCI as a practical and competitive method for search agents that can complement existing retrieval paradigms in the real world.
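
For readers unfamiliar with the two reported metrics, the sketch below follows the answer normalization commonly used in open-domain QA evaluation (lowercasing, stripping punctuation and English articles). Whether the paper's evaluation script matches this exactly is an assumption.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, gold: str) -> bool:
    """True when the normalized token sequences are identical."""
    return normalize(prediction) == normalize(gold)


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred, ref = normalize(prediction), normalize(gold)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Exact Match is all-or-nothing, while token-level F1 gives partial credit to answers that contain the gold span plus extra words.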

2605.29303 2026-05-29 cs.AI

Entropy-KL Divergence-based Token Masking: A Novel Approach for Selective Fine-tuning of Large Language Models

Qi Liu, Mingdi Sun, Yongyi He, Zhi Zheng, Tong Xu, Yi Zheng, Zhefeng Wang, Enhong Chen

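AI summary: To address the distribution shift that standard supervised fine-tuning causes in low-data regimes, proposes EKSFT, which selectively masks tokens with high entropy or high KL divergence, injecting task knowledge while preserving the integrity of the pre-trained distribution; it outperforms standard SFT on mathematical reasoning benchmarks and improves subsequent RL performance.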

Comments 17 pages

Abstract

Supervised fine-tuning (SFT) followed by reinforcement learning (RL) has become a standard post-training paradigm for large language models. This paradigm provides a cold start for RL exploration, avoiding the inefficiency of pure RL where on-policy sampling yields insufficient positive samples. However, in practice, existing approaches often use a small amount of data for SFT initialization compared to the RL phase, which can cause the model to fit the limited samples and shift away from its pre-trained distribution. This distribution shift impedes the model's ability to effectively explore during subsequent RL training. To address this challenge, we propose that in low-data regimes, SFT should prioritize activating task-relevant capabilities rather than memorizing specific content. Along this line, we propose EKSFT (Entropy-KL Selective Fine-Tuning), which selectively masks tokens that exhibit either high entropy or high KL divergence from a reference model. By excluding these high-uncertainty, distribution-shifting tokens from imitation, EKSFT injects task-specific knowledge while preserving the integrity of the model's pre-trained distribution. Empirical evaluations on mathematical reasoning benchmarks demonstrate that EKSFT consistently outperforms standard SFT. Further RL fine-tuning from the EKSFT model yields consistently better post-RL performance, indicating improved exploration for the RL stage. Our code and datasets are available at https://github.com/MINE-USTC/EKSFT.
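
A per-token loss mask of the kind the abstract describes might be computed as in this sketch. It assumes the entropy is taken over the current model's next-token distribution and that the divergence is KL(policy || reference); the thresholds `max_entropy` and `max_kl` are hypothetical parameters. The linked repository is the authority on the actual criteria.

```python
import math


def entropy(p: list[float]) -> float:
    """Shannon entropy in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)


def kl_divergence(p: list[float], q: list[float], eps: float = 1e-12) -> float:
    """KL(p || q) in nats; q is clamped to eps to avoid division by zero."""
    return sum(x * math.log(x / max(y, eps)) for x, y in zip(p, q) if x > 0)


def loss_mask(policy_dists: list[list[float]], reference_dists: list[list[float]],
              max_entropy: float, max_kl: float) -> list[int]:
    """Return 1 for tokens kept in the SFT loss and 0 for masked tokens."""
    return [
        0 if entropy(p) > max_entropy or kl_divergence(p, q) > max_kl else 1
        for p, q in zip(policy_dists, reference_dists)
    ]
```

In training, the mask would multiply the per-token cross-entropy loss so that masked positions contribute no gradient.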

2605.29302 2026-05-29 cs.CV

ViASNet: A Video Ad Saliency Network for Predicting Dynamic Saliency and Viewer Engagement

Jianping Ye, Michel Wedel

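AI summary: Proposes ViASNet, a model based on the 3D U-Net architecture that fuses audio and scene semantics to predict dynamic saliency maps for video ads, and diagnoses viewer engagement through entropy analysis.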

Abstract

The digital media landscape has seen a pervasive shift toward short-form video advertising on TV, social media and e-commerce platforms. The present study focuses on deep saliency prediction for short-form video advertising. Deep saliency models have been used to generate predictions of human eye fixation patterns with the purpose of enhancing user interaction with digital technology and optimizing its design. For video ads, dynamic saliency maps capture where and when viewers are looking, revealing why video ads are effective, and how their content should be optimized. We develop and test a new deep dynamic saliency prediction model called ViASNet (Video Ad Saliency Network), which has an architecture founded on the 3D U-Net, and accommodates the influence of audio and the semantic meaning of scenes. We assess the model's performance on 151 video ads, each seen by about 20 viewers while their eye movements were tracked, and explore the critical factors influencing model performance through ablation experiments. We calculate the entropy of the predicted saliency maps frame-by-frame as a diagnostic tool to identify ads and scenes that fail to engage viewers, and illustrate its use on test data of 15 unseen ads. Our study reveals that ad design and testing can be sped up considerably through automated systems built on deep saliency models such as ViASNet.
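
The frame-by-frame entropy diagnostic can be sketched as below. It treats each predicted saliency map as a probability distribution over pixels after normalization. The representation as nested lists and the use of bits as the unit are assumptions, and the abstract does not say how entropy values are translated into an engagement judgment.

```python
import math


def saliency_entropy(saliency_map: list[list[float]]) -> float:
    """Shannon entropy (bits) of one frame's saliency map, normalized to sum to 1."""
    values = [v for row in saliency_map for v in row]
    total = sum(values)
    if total <= 0:
        raise ValueError("saliency map must contain positive mass")
    return -sum((v / total) * math.log2(v / total) for v in values if v > 0)


def entropy_per_frame(frames: list[list[list[float]]]) -> list[float]:
    """Entropy trace over a clip, one value per frame."""
    return [saliency_entropy(frame) for frame in frames]
```

A map concentrated on a single location has entropy near zero, while a map spread evenly over all pixels reaches the maximum, so the trace shows how focused or diffuse predicted attention is over time.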

2605.29301 2026-05-29 cs.RO

The Open Motion Planning Library 2.0

Weihang Guo, Theodoros Tyrovouzis, Emiliano Flores, Clayton W. Ramsey, Zachary K. Kingston, Ioan A. Şucan, Mark Moll, Lydia E. Kavraki

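AI summary: This paper introduces OMPL 2.0, which achieves real-time motion planning through hardware acceleration and integrates with modern AI research workflows, and summarizes how the library and the motion planning field have developed together and the library's impact on the research community.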

Abstract

The Open Motion Planning Library (OMPL), first released in 2008, has become a cornerstone of the motion planning community, providing implementations of a wide range of state-of-the-art sampling-based algorithms. Over almost two decades of continuous development, we have steadily expanded the library with new planners, state spaces, and problem formulations. These additions range from asymptotically optimal and lazy planners to constrained motion planning and planning with temporal-logic goals. Building on this foundation, we introduce OMPL 2.0, a major evolution of the library that targets real-time motion planning through hardware acceleration and integrates seamlessly with modern AI research workflows. We also reflect on how OMPL and the field of motion planning have grown together over the years, and discuss the library's broader impact on the research community.

2605.29300 2026-05-29 cs.CL cs.AI cs.SD

MusTBENCH: Benchmarking and Advancing Temporal Grounding in Music LLMs

Daeyong Kwon, Qiyu Wu, Shinobu Kuriya, Junghyun Koo, Shuyang Cui, Zhi Zhong, Wei-Hsiang Liao, Hiromi Wakaki, Yuki Mitsufuji

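AI summary: Proposes the MusTBENCH benchmark and the four-stage MusT optimization recipe to evaluate and improve the temporal grounding ability of music large language models in audio.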

Abstract

Recent Large Audio-Language Models (LALMs) have demonstrated promising abilities in understanding musical content. However, whether their responses are grounded in the correct temporal regions of the audio remains underexplored. This limitation is particularly critical for music understanding, where key information often occurs as temporally localized events, such as instrument entries and rhythmic transitions. To address this gap, we introduce MusTBENCH, a music-expert-validated benchmark designed to evaluate temporal grounding in LALMs through five temporally grounded question-answering tasks. To further improve temporal grounding in existing models, we propose MusT, a novel four-stage temporal optimization recipe spanning music encoder adaptation, LLM adaptation, LLM supervised fine-tuning, and RL-based optimization. Experiments on MusTBENCH show that existing LALMs struggle with precise temporal grounding, while MusT brings significant improvements over strong baselines. These results establish temporal grounding as a key missing capability in current LALMs and position MusTBENCH as a challenging benchmark for future research in temporally grounded music understanding.

2605.29298 2026-05-29 cs.RO

MonoDuo: Using One Robot Arm to Learn Bimanual Policies

MonoDuo: 使用单机械臂学习双臂策略

Sandeep Bajamahal, Lawrence Yunliang Chen, Toru Lin, Zehan Ma, Jitendra Malik, Ken Goldberg

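AI summary Proposes the MonoDuo framework, which combines single-arm robot demonstrations with human collaboration data and uses data augmentation to generate synthetic demonstrations for training bimanual robot policies; across five tasks it supports zero-shot deployment and few-shot finetuning, with success rates up to 70%.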

Comments Accepted to appear in the 2026 IEEE International Conference on Robotics and Automation (ICRA), Vienna, Austria, 1-5 June 2026

Details
AI Chinese abstract

双臂协调对于许多现实世界的操作任务至关重要,然而学习双臂机器人策略受到双臂机器人和数据集稀缺的限制。相比之下,单臂机器人在研究实验室中广泛可用。我们能否利用它们来训练双臂机器人策略?我们提出MonoDuo,一个利用单臂机器人演示与人类协作来学习双臂操作策略的框架。MonoDuo通过遥操作单臂机器人执行双臂任务的一侧,同时由人类执行另一侧来收集数据,然后交换角色以覆盖两侧。来自腕部安装和固定摄像头的RGB-D观测通过最先进的手部姿态估计、图像和点云分割以及修复,被增强为目标双臂机器人的合成演示。这些基于真实机器人运动学的合成演示用于训练双臂策略。我们在五项任务上评估MonoDuo:举箱、背包打包、叠布、拉拉链和递盘子。与仅依赖人类双臂视频的方法相比,MonoDuo能够在未见过的双臂机器人配置上实现零样本部署,成功率高达70%。仅使用25个目标机器人演示进行少样本微调,相比从头训练,成功率进一步提升65-70%,展示了MonoDuo在将单臂机器人数据高效迁移到双臂机器人策略方面的有效性。

English abstract

Bimanual coordination is essential for many real-world manipulation tasks, yet learning bimanual robot policies is limited by the scarcity of bimanual robots and datasets. Single-arm robots, however, are widely available in research labs. Can we leverage them to train bimanual robot policies? We present MonoDuo, a framework for learning bimanual manipulation policies using single-arm robot demonstrations paired with human collaboration. MonoDuo collects data by teleoperating a single-arm robot to perform one side of a bimanual task while a human performs the other, then swapping roles to cover both sides. RGB-D observations from a wrist-mounted and fixed camera are augmented into synthetic demonstrations for target bimanual robots using state-of-the-art hand pose estimation, image and point cloud segmentation, and inpainting. These synthetic demonstrations, grounded in real robot kinematics, are used to train bimanual policies. We evaluate MonoDuo on five tasks: box lifting, backpack packing, cloth folding, jacket zipping, and plate handover. Compared to approaches relying solely on human bimanual videos, MonoDuo enables zero-shot deployment on unseen bimanual robot configurations, achieving success rates up to 70%. With only 25 target robot demonstrations, few-shot finetuning further boosts success rates by 65-70% over training from scratch, demonstrating MonoDuo's effectiveness in efficiently transferring knowledge from single-arm robot data to bimanual robot policies.

2605.29288 2026-05-29 cs.AI

Diagnosing Harmful Continuation in Answer-Correct Long-CoT Training Traces

诊断答案正确长链思维训练轨迹中的有害延续

Chen He, Yuhao Wu, Lei Wang, Wenxuan Zhang, Fumin Shen

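AI summary Studies post-conclusion continuation in answer-correct long-CoT training data, where reasoning continues after the answer is already supported; suffix-removal experiments show that this continuation harms training, and a lightweight boundary proxy is proposed.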

Details
AI Chinese abstract

长链思维(CoT)轨迹被广泛用作面向推理的大语言模型监督微调(SFT)的监督信号,然而答案正确的轨迹仍可能导致显著不同的微调结果。我们研究了答案正确的长CoT数据中的结论后延续:即答案已充分支持,但轨迹继续包含额外推理并保留在监督目标中。为了测试其训练效果,我们使用仅删除的编辑器构建保留答案的后缀移除,并比较原始和经过处理的轨迹上的CoT监督微调。我们观察到移除编辑器识别的结论后延续后监督微调结果有所改善,表明这种延续在我们的设置中对训练有害。因此,我们将这一经验支持的现象称为有害延续。除了这一干预,我们还通过不确定性和隐藏状态进展进一步刻画了被移除的结论后延续。我们观察到持续的局部不确定性以及减弱的终端方向进展,形成了不确定性-几何不匹配。最后,我们实例化了有害延续切割(HCC),一种轻量级边界代理,近似于编辑器识别的结论后延续边界。

English abstract

Long chain-of-thought (CoT) traces are widely used as supervision for reasoning-oriented LLM SFT, yet answer-correct traces can still lead to markedly different fine-tuning outcomes. We study post-conclusion continuation in answer-correct long-CoT data: a continuation where the answer appears sufficiently supported, but the trace continues with additional reasoning that remains in the supervised target. To test its training effect, we use a delete-only editor to construct answer-preserving suffix removal and compare CoT-based SFT on the original and processed traces. We observe improved SFT outcomes after removing the editor-identified post-conclusion continuation, suggesting that this continuation is harmful to training in our setting. We therefore refer to this empirically supported phenomenon as harmful continuation. Beyond this intervention, we further characterize the removed post-conclusion continuation through uncertainty and hidden-state progress. We observe persistent local uncertainty together with weakened terminal-directional progress, forming an uncertainty--geometry mismatch. Finally, we instantiate Harmful Continuation Cut (HCC), a lightweight boundary proxy that approximates the editor-identified post-conclusion continuation boundary.

2605.29283 2026-05-29 cs.LG cs.AI

Do Physics Foundation Models Learn Generalizable Physics? A Bias-Aware Benchmark Across Physical Regimes and Distribution Shifts

物理基础模型能否学习可泛化的物理?一种跨物理机制和分布偏移的偏差感知基准

Mengdi Chu, Yang Liu, Ayan Biswas, Han-Wei Shen

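AI summary Builds a benchmark with 8 physical dynamics, 3 training-data mixtures, and 25 test regimes to evaluate five physics foundation model architectures, and finds that current models are conditional rather than universal generalists: their generalization depends on physical regime, temporal scale, initial conditions, pretraining, model size, and architecture. It argues that improvement requires moving beyond scaling models or expanding data, toward learning physical knowledge that transfers across regimes, temporal scales, and distribution shifts.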

Comments 26 pages, 31 figures

Details
AI Chinese abstract

最近的物理基础模型声称具有通用的时空预测能力,但它们的评估通常将性能压缩为固定训练分布下的单一平均分数。这使得难以确定模型是否学习了可泛化的物理动力学,还是仅在特定设置下表现良好。我们构建了一个包含8种物理动力学、3种训练数据混合和25种测试机制的基准,这些测试机制由动态尺度和初始条件复杂性变化引起,涵盖了分布内、分布偏移和分布外设置。我们评估了五种物理基础模型架构和每种架构的四种模型变体(从头训练和三种预训练大小),共得到60,000个测量结果。我们的结果表明,当前的物理基础模型表现为条件性而非通用性泛化者:它们的泛化能力取决于物理机制、时间尺度、初始条件设置、预训练、模型大小和架构。改进训练数据分布只能部分缓解这一限制。预训练和缩放也无法可靠地消除它们的能力偏差。我们认为,改进物理基础模型需要超越缩放模型或扩展数据,转向学习能够更好地跨机制、时间尺度和分布偏移捕获可迁移物理知识的机制。

English abstract

Recent physics foundation models claim general spatiotemporal forecasting ability, yet their evaluations often collapse performance into a single average score under a fixed training distribution. This makes it difficult to determine whether a model has learned generalizable physical dynamics or only performs well under particular settings. We construct a benchmark with 8 physical dynamics, 3 training-data mixtures, and 25 test regimes induced by dynamic-scale and initial-condition complexity shifts, covering in-distribution, distribution-shift, and out-of-distribution settings. We evaluate five physics foundation model architectures and four model variants per architecture (scratch and three pretrained sizes), resulting in 60,000 measurements. Our results show that current physics foundation models behave as conditional rather than universal generalists: their generality depends on the physical regime, temporal scale, initial-condition setting, pretraining, model size, and architecture. Improving the training data distribution only partially mitigates this limitation. Pretraining and scaling are also unable to reliably remove their ability biases. We argue that improving physics foundation models requires moving beyond scaling models or expanding data, toward learning mechanisms that better capture transferable physical knowledge across regimes, temporal scales, and distribution shifts.

2605.29278 2026-05-29 cs.CL

Accommodation Goes Both Ways: Studying Linguistic Convergence Between Humans and Language Models

适应是双向的:研究人类与语言模型之间的语言趋同

Terra Blevins

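AI summary A large-scale study of linguistic convergence in human-LLM dialogue finds that LLMs overconverge toward human style on function-word and open-class features, while humans accommodate LLMs at rates consistent with human-human baselines.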

Details
AI Chinese abstract

随着LLM日益融入日常生活,理解它们的存在将如何塑造人类语言行为是一个开放性问题。我们提出了一个关于人机对话中语言趋同的大规模研究,考察在多轮对话中人类和LLM如何相互适应对方的语言风格。使用WildChat(一个真实世界ChatGPT对话语料库)上的非对称趋同度量,我们发现,尽管LLM在八种语言的功能词和开放类特征上显著过度趋同于用户,但人类在此环境下的趋同率与人类-人类基线基本一致。这些发现表明,人机对话中的适应是非对称的:LLM过度拟合用户的风格,而人类对LLM的语言适应与对另一个人的适应没有区别。

English abstract

As LLMs become increasingly integrated into daily life, understanding how their presence will shape human linguistic behavior is an open question. We present a large-scale study of linguistic convergence in human-LLM dialogue, examining how humans and LLMs accommodate each other's linguistic style during multi-turn conversations. Using an asymmetric convergence metric on WildChat, a corpus of real-world ChatGPT transcripts, we find that while LLMs significantly overconverge toward their users on both function word and open-class features across eight languages, human convergence rates in this setting are broadly consistent with human-human baselines. These findings suggest that accommodation in human-LLM dialogue is asymmetric: while LLMs dramatically overfit to their users' style, humans linguistically accommodate LLMs no differently than they would another person.

2605.29275 2026-05-29 cs.CL

Prompt-Level Reward Specifications for Open-Ended Post-Training

面向开放式后训练的提示级奖励规范

Zijun Weng, Xiaohui Hu, Shuangyong Song, Yongxiang Li, Kaidong Yu, Xuanjing Huang

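AI summary Proposes a prompt-level reward specification framework that constructs reusable task-adaptive rubrics and executable hard-constraint checkers offline, making reward criteria explicit before training without human preference annotations or a separately trained reward model; it improves offline ranking and online reinforcement learning on multiple open-ended benchmarks.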

Comments 39 pages, 4 figures, 16 tables

Details
AI Chinese abstract

开放式后训练受益于能够明确提示特定成功条件的奖励,而非仅依赖事后标量分数。在指令遵循、写作和决策支持任务中,响应质量取决于局部要求、整体偏好和显式约束,但现有奖励方法往往隐含这些标准或仅覆盖狭窄的可验证情况。我们提出一个提示级奖励规范框架,将奖励规范与奖励计算分离。仅凭提示,我们的框架离线构建可复用的任务自适应评分准则和可执行硬约束检查器,在训练前显式化奖励标准,并可在多次 rollout 中复用。在评分时,基于工件的评分准则和代码分数与独立的全局分数(用于残余整体质量)相结合,生成关于需求满足度、整体质量和确定性约束的归一化混合奖励。该框架无需人工偏好标注、参考答案或单独训练的奖励模型。实验表明,所得奖励改进了离线 RM 风格的响应排序,并支持在多个开放式基准上进行在线强化学习。消融实验进一步表明,评分准则、全局评分和可执行验证提供了互补的监督。

English abstract

Open-ended post-training benefits from rewards that make prompt-specific success conditions explicit, rather than relying only on post-hoc scalar scores. In instruction following, writing, and decision-support tasks, response quality depends on local requirements, holistic preferences, and explicit constraints, but existing reward methods often leave these criteria implicit or cover only narrowly verifiable cases. We propose a prompt-level reward specification framework that separates reward specification from reward computation. Given only prompts, our framework constructs reusable task-adaptive rubrics and executable hard-constraint checkers offline, making reward criteria explicit before training and reusable across rollouts. At scoring time, artifact-anchored rubric and code scores are combined with an independent global score for residual holistic quality, yielding a normalized hybrid reward over requirement satisfaction, holistic quality, and deterministic constraints. The framework requires no human preference annotations, reference answers, or a separately trained reward model. Experiments show that the resulting reward improves offline RM-style response ranking and supports online reinforcement learning across multiple open-ended benchmarks. Ablations further show that rubrics, global scoring, and executable verification provide complementary supervision.
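
The scoring step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the combination rule, so everything here is an assumption made for illustration: the names `RewardSpec` and `hybrid_reward`, the weighted average, and the treatment of hard constraints as a pass fraction are hypothetical and are not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class RewardSpec:
    """Hypothetical per-prompt specification, built once offline and reused across rollouts."""
    rubric: List[Callable[[str], float]]  # each rubric item returns a score in [0, 1]
    checkers: List[Callable[[str], bool]] = field(default_factory=list)  # executable hard constraints


def hybrid_reward(spec: RewardSpec, response: str, global_score: float,
                  weights: Tuple[float, float, float] = (1.0, 1.0, 1.0)) -> float:
    """Combine rubric, constraint, and global scores into one value in [0, 1].

    The weighted average is an assumed combination rule, not taken from the paper.
    """
    # Mean rubric score: how well the response meets the prompt-specific requirements.
    rubric_score = (sum(item(response) for item in spec.rubric) / len(spec.rubric)
                    if spec.rubric else 0.0)
    # Fraction of deterministic checks that pass (1.0 when there are none).
    code_score = (sum(check(response) for check in spec.checkers) / len(spec.checkers)
                  if spec.checkers else 1.0)
    w_rubric, w_code, w_global = weights
    total = w_rubric * rubric_score + w_code * code_score + w_global * global_score
    return total / (w_rubric + w_code + w_global)
```

Because the specification is constructed from the prompt alone, the same `RewardSpec` object can score every rollout for that prompt; only `response` and `global_score` change between calls.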

2605.29274 2026-05-29 cs.CL

Learnable Assessment Skills for LLM-based Automated Scoring: Rubric Construction via Iterative Optimization

基于LLM的自动评分中可学习的评估技能:通过迭代优化构建评分标准

Yun Wang, Xin Xia, Xuansheng Wu, Xiaoming Zhai, Ninghao Liu

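AI summary Proposes an iterative framework in which LLMs learn assessment skills (item-independent natural-language procedural knowledge) from scoring experience and construct rubrics automatically without human intervention, often surpassing the expert-written rubrics on the ASAP-SAS dataset.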

Comments 12 pages, 5 figures

Details
AI Chinese abstract

基于LLM的自动评分方法接近人类水平,但扩展到新任务时仍受限于上游阶段(如评分标准构建)的逐项人工配置。人类专家通过长期实践形成的评估启发式方法绕过了这一瓶颈。我们探究LLM是否可以直接从评分经验中学习类似的启发式方法,并将其形式化为评估技能的概念:即与题目无关的自然语言程序性知识,指导LLM完成评分工作流程的特定阶段。聚焦于评分标准构建作为首次实例化,我们提出一个迭代框架,将技能分解为固定支架和可学习的与题目无关的规则,通过LLM驱动的评分错误诊断和验证门控选择来优化规则。该框架无需专家编写的评分标准。在所有十个ASAP-SAS题目上,优化后的技能显著提升了基于LLM的评分,并经常超过数据集提供的专家评分标准。跨题目迁移实验进一步表明,学习到的技能捕捉到了可泛化和题目特定的模式。

English abstract

LLM-based automated scoring approaches near-human performance, but scaling to new tasks remains bottlenecked by the per-item human configuration of upstream stages such as rubric construction. Human experts bypass this bottleneck through evaluation heuristics developed over extensive practice. We ask whether LLMs can learn similar heuristics directly from scoring experience, and formalize this as the concept of assessment skills: item-independent natural-language procedural knowledge that guides LLMs through specific stages of the scoring workflow. Focusing on rubric construction as a first instantiation, we propose an iterative framework that decomposes a skill into a fixed scaffold and learnable item-agnostic rules, refining the rules through LLM-driven diagnosis of scoring errors and validation-gated selection. The framework requires no expert-written rubric. On all ten ASAP-SAS items, optimized skills substantially improve LLM-based scoring and frequently surpass the dataset-provided expert rubric. Cross-item transfer experiments further reveal that learned skills capture both generalizable and item-specific patterns.

2605.29273 2026-05-29 cs.LG math.OC

A Theoretical and Experimental Study of a Novel Adaptive Learning Algorithm

一种新型自适应学习算法的理论与实验研究

Sakshi Kumari, Shyam Kumar M, Sushmitha P

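AI summary Addresses the convergence issues of existing adaptive optimizers such as Adam and AMSGrad by proposing C-Adam, an optimizer based on the line of sight approach, with a theoretical convergence proof and validation through numerical experiments.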

Details
AI Chinese abstract

机器学习算法的一个关键组成部分是以更少的计算成本和更少的振荡来最小化损失函数。虽然基于自适应学习率的优化器已广泛用于实际任务,但它们不能保证收敛,这就是后来引入AMSGrad来研究Adam的非收敛行为的原因。本文批判性地回顾了流行的自适应优化方法(如Adam和AMSGrad),重点介绍了它们的基本设计概念。为了解决上述优化器的局限性,基于视线方法提出了一种新的优化器变体C-Adam。还提供了收敛性的理论证明,并通过一系列基于实际生活的数值实验验证了该优化器。

English abstract

A crucial component of machine learning algorithms is minimizing loss functions with lower computational cost and fewer oscillations. While adaptive learning rate-based optimizers have been widely used for real-world tasks, they do not guarantee convergence, which is why AMSGrad was later introduced to investigate the non-convergence behaviour of Adam. In this paper, popular adaptive optimization methods such as Adam and AMSGrad are critically reviewed with an emphasis on their fundamental design concepts. To address the limitations of the above-mentioned optimizers, a new optimizer variant, C-Adam, is proposed based on the line of sight approach. A theoretical proof of convergence is also provided, and the optimizer is validated through a number of real-life-based numerical experiments.
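
For readers unfamiliar with the two baselines named in the abstract, the following is a minimal scalar sketch of the standard Adam update and of one common AMSGrad variant (running maximum of the second moment, with bias correction applied afterwards). The function name `adaptive_step` and the state dictionary layout are hypothetical. C-Adam itself is not shown, because the abstract does not state its update rule.

```python
import math


def adaptive_step(theta, grad, state, lr=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, amsgrad=False):
    """One Adam (or AMSGrad) update for a scalar parameter.

    `state` holds the step count "t", first moment "m", second moment "v",
    and the running maximum "v_max" used only by AMSGrad.
    """
    state["t"] += 1
    # Exponential moving averages of the gradient and the squared gradient.
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    second_moment = state["v"]
    if amsgrad:
        # AMSGrad: never let the second-moment estimate decrease,
        # so the effective step size cannot grow back.
        state["v_max"] = max(state["v_max"], state["v"])
        second_moment = state["v_max"]
    # Bias correction for the zero-initialised moving averages.
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = second_moment / (1 - beta2 ** state["t"])
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps)


def new_state():
    return {"t": 0, "m": 0.0, "v": 0.0, "v_max": 0.0}
```

On the first step both variants move the parameter by roughly `lr` in the direction opposite to the gradient; they differ only once the squared-gradient average starts to shrink.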

2605.29272 2026-05-29 cs.LG cs.AI stat.ML

Causal Label Recovery in Payment Networks

支付网络中的因果标签恢复

Gaurav Dhama

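AI summary Targets four systematic impairments in payment-network labels with the Sequential Triply Robust (STR) estimator, which corrects all of them simultaneously, attains the semiparametric efficiency bound, and permits training on data that is days rather than months old.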

Comments 49 pages

Details
AI Chinese abstract

支付网络中的欺诈检测模型依赖于存在系统性偏差的退单标签进行训练。每个标签必须依次经过三个门控:授权(被拒绝的交易不产生标签)、发卡行报告(未报告的欺诈不可见)和延迟(待处理的退单在训练时缺失)。到达的标签可能因第一方滥用或发卡行错误分类而受损。配套论文[arXiv:2605.27557]证明这四种损害对检测性能施加了极小极大下界。本文问:能否达到该下界?我们将观测流程形式化为一个具有三个倾向阶段和一个损坏层的顺序缺失数据问题,并构建了序列三重稳健(STR)估计器。STR同时纠正所有四种损害,并达到半参数效率界——没有估计器能具有更低的渐近方差。它是序列三重稳健的:在每个门控处,一致性仅要求倾向模型或结果回归中有一个正确指定,而非两者。我们提供了通过噪声率调整的伪标签进行损坏校正、通过经验贝叶斯收缩稳定小发卡行的逆倾向权重、提供有效置信区间的插件方差估计量,以及用于有限样本保证的伯恩斯坦集中不等式。在操作层面,我们推导了最优训练延迟——使标签质量损失和模型过时之和最小化的成熟窗口——并证明STR允许使用数天而非数月前的数据进行训练,将模型新鲜度与退单成熟周期解耦。对于任何样本量,STR在均方误差上严格优于基于退单的朴素训练。

English abstract

Fraud detection models in payment networks train on chargeback labels that are systematically biased. Every label must survive three sequential gates: authorization (declined transactions generate no labels), issuer reporting (unreported fraud is invisible), and delay (pending chargebacks are missing at training time). Labels that do arrive may be corrupted by first-party misuse or issuer misclassification. A companion paper [arXiv:2605.27557] proved that these four impairments impose a minimax lower bound on detection performance. This paper asks: can that bound be achieved? We formalize the observation pipeline as a sequential missing-data problem with three propensity stages and a corruption layer, and construct the Sequential Triply Robust (STR) estimator. The STR corrects for all four impairments simultaneously and achieves the semiparametric efficiency bound -- no estimator can have lower asymptotic variance. It is sequentially triply robust: at each gate, consistency requires only that either the propensity model or the outcome regression is correctly specified, not both. We provide corruption correction via noise-rate-adjusted pseudo-labels, empirical Bayes shrinkage to stabilize inverse-propensity weights for small issuers, a plug-in variance estimator yielding valid confidence intervals, and a Bernstein concentration inequality for finite-sample guarantees. On the operational side, we derive the optimal training delay -- the maturity window that minimizes the sum of label-quality loss and model staleness -- and prove that the STR permits training on data that is days old rather than months old, decoupling model freshness from the chargeback maturity cycle. The STR provably dominates naive chargeback-based training in mean squared error for any sample size.
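
Two ingredients mentioned in the abstract, noise-rate-adjusted pseudo-labels and the combination of propensity weights with an outcome regression, can be sketched in simplified form. This is not the STR estimator: the sketch collapses the three gates into a single product propensity and uses the textbook augmented inverse-propensity form. The names `Txn`, `pseudo_label`, and `dr_fraud_rate` are hypothetical, and the propensities, outcome predictions, and noise rates are assumed to be supplied by separately fitted models.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Txn:
    gate_propensities: Sequence[float]  # assumed P(pass gate k | passed earlier gates, features)
    outcome_pred: float                 # assumed outcome-regression estimate of P(fraud | features)
    observed_label: Optional[int]       # 0/1 if the label survived every gate, otherwise None


def pseudo_label(y_obs: int, rho0: float, rho1: float) -> float:
    """Correct a binary label for class-conditional noise.

    rho0 = P(observed 1 | true 0), rho1 = P(observed 0 | true 1).
    The result is unbiased for the true label when the rates are correct.
    """
    return (y_obs - rho0) / (1.0 - rho0 - rho1)


def dr_fraud_rate(txns: Sequence[Txn], rho0: float = 0.0, rho1: float = 0.0) -> float:
    """Augmented inverse-propensity estimate of the fraud rate (simplified, single-stage)."""
    total = 0.0
    for t in txns:
        estimate = t.outcome_pred  # model-based prediction for every transaction
        if t.observed_label is not None:
            # Probability that a label survives all gates in sequence.
            pi = 1.0
            for p in t.gate_propensities:
                pi *= p
            # Residual correction, up-weighted by the inverse survival probability.
            estimate += (pseudo_label(t.observed_label, rho0, rho1) - t.outcome_pred) / pi
        total += estimate
    return total / len(txns)
```

The residual term vanishes on average when the outcome regression is correct, and the weighting removes selection bias when the propensities are correct, which is the intuition behind requiring only one of the two models to be right.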

2605.29271 2026-05-29 cs.AI cs.IR cs.LG

CoHyDE: Iterative Co-Training of LLM Rewriter & Dense Encoder for Tool Retrieval

CoHyDE: 用于工具检索的LLM改写器与稠密编码器的迭代协同训练

Vaishali Senthil, Ashutosh Hathidara, Sebastian Schreiber

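AI summary Proposes CoHyDE, which iteratively co-trains a dense encoder and an LLM rewriter, combining contrastive learning with preference alignment, to improve tool retrieval on both standard and vague queries.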

Details
AI Chinese abstract

在大规模API目录上的工具检索是LLM智能体的核心瓶颈:用户查询以口语化、通常不明确的语言出现,而目录使用技术性API词汇,没有固定的编码器能够单独弥合这一差距。两种主要的训练方法,对比编码器微调和基于冻结LLM的HyDE式查询扩展,从相反的角度解决这个问题,并在互补的方向上失败:微调编码器在查询的表面形式与目录匹配时表现出色,但在不匹配时性能崩溃;而零样本HyDE对不明确的查询更鲁棒,但生成不感知目录的假设描述,当查询形式良好时检索性能下降。我们提出CoHyDE,一种迭代过程,将稠密编码器和LLM改写器训练为单个共同演化的系统:编码器使用改写器生成的目录风格假设描述通过InfoNCE重新训练,改写器通过DPO基于编码器的检索分数进行偏好对齐,两者在循环开始前在工具目录上进行热启动。在ToolBench目录的约10k工具子集上,三轮CoHyDE在标准查询上比最强的单组件基线提高+2.5个百分点的NDCG@5,在保留的模糊查询上提高+6.3个百分点,在最难的模糊层级上增益高达+8个百分点。消融实验证实协同训练是关键因素:单独使用任一组件都无法在形式良好和模糊查询上匹配CoHyDE,在模糊查询上损失高达-8个百分点。

English abstract

Tool retrieval over large API catalogs is a core bottleneck for LLM agents: user queries arrive in colloquial, often underspecified language, while the catalog uses technical API vocabulary that no fixed encoder can bridge on its own. The two dominant training approaches, contrastive encoder fine-tuning and HyDE-style query expansion with a frozen LLM, address this problem from opposite ends and fail in complementary directions: the fine-tuned encoder excels when the query's surface form already matches the catalog but collapses when it does not, while zero-shot HyDE is more robust to underspecified queries yet generates catalog-unaware hypothetical descriptions that degrade retrieval when queries are well-formed. We introduce CoHyDE, an iterative procedure that trains the dense encoder and the LLM rewriter as a single co-evolving system: the encoder is retrained with InfoNCE on catalog-style hypothetical descriptions produced by the rewriter, and the rewriter is preference-aligned via DPO against the encoder's retrieval scores, with both sides warm-started on the tool catalog before the loop begins. On a ~10k tool subset of the ToolBench catalog, three rounds of CoHyDE improve over the strongest single-component baseline by +2.5 pp NDCG@5 on standard queries and +6.3 pp on held-out vague queries, with gains as large as +8 pp on the hardest vague tier. Ablations confirm that co-training is the key ingredient: using either component in isolation fails to match CoHyDE on both well-formed and vague queries, with losses of up to -8 pp on vague queries.
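
The encoder objective named in the abstract, InfoNCE, can be written down for a single query as a small sketch. The cosine similarity, the temperature value, and the function names `cosine` and `info_nce` are assumptions for illustration; the abstract does not specify how negatives are chosen or which similarity is used.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(query, positive, negatives, temperature=0.05) -> float:
    """InfoNCE loss for one query: -log softmax score of the positive document."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]
```

In the setting described above, `query` would be the embedding of a rewritten, catalog-style description and `positive` the embedding of the matching tool; the loss falls as the positive is scored higher than the negatives.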

2605.29270 2026-05-29 cs.AI

Indexing the Unreadable: LLM-Native Recursive Construction and Search of Service Taxonomies

索引不可读之物:基于LLM原生的服务分类法递归构建与搜索

Wei Zheng, Yang Yan, Yiyang Shao, Jinyang Li, Zeze Chang, Yukuang Jia, Qiming Mao, Chihyung Wang, Jingbin Zhou

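AI summary To address context-window limits and the loss of information in the middle of long inputs during LLM service discovery, proposes A2X, an LLM-native progressive-disclosure scheme that automatically builds a hierarchical service taxonomy and walks it layer by layer at query time, significantly improving retrieval accuracy while reducing token consumption.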

Comments Preprint. 8 pages main paper + appendix; 2 figures. Under submission to EMNLP 2026

Details
AI Chinese abstract

代理互联网(IoA)时代正在形成:LLM代理预计通过编排快速增长中的模型上下文协议(MCP)服务器、代理到代理(A2A)端点、可复用技能以及其他LLM可调用服务来实现用户目标。然而,LLM面临与此机制的结构性不匹配:有效上下文是一种稀缺资源,无法随服务数量扩展。将数千个服务描述串联到提示中会溢出上下文窗口,即使窗口足够大,模型也会系统性地忽略长输入中间部分的信息,即文献中充分记录的“中间迷失”现象。这本质上是服务发现中的上下文管理问题。为解决此问题,我们提出一种LLM原生的渐进式披露方案及其具体实例A2X(代理到任何事物的服务发现):一个LLM驱动的流水线,自动将注册服务组织成层次化分类法,并在查询时逐层遍历,使得每次LLM调用仅看到与用户查询高度相关的小候选集。这将有效上下文稀缺性与注册表规模解耦,显著降低token消耗并提高检索准确性。与全上下文转储相比,A2X在提示token成本仅为九分之一的情况下实现了6.2个点的命中率提升;与最先进的开源基于嵌入的基线相比,A2X将命中率提高了超过20个点。

English abstract

The era of the Internet of Agents (IoA) is taking shape: LLM agents are expected to fulfill user goals by orchestrating fast-growing populations of Model Context Protocol (MCP) servers, Agent-to-Agent (A2A) endpoints, reusable skills, and other LLM-callable services. Yet LLMs face a structural mismatch with this regime: effective context is a scarce resource that does not scale with the number of services. Concatenating thousands of service descriptions into a prompt overflows the context window, and even when the window is large enough, models systematically under-attend to information in the middle of long inputs, the well-documented Lost-in-the-Middle phenomenon. This is fundamentally a question of context management for service discovery. To address this, we propose an LLM-native progressive-disclosure scheme and its concrete instantiation, A2X (Agent-to-Anything service discovery): an LLM-driven pipeline that automatically organizes the registered services into a hierarchical taxonomy and walks it layer by layer at query time, so that every LLM call sees only a small candidate set highly relevant to the user query. This decouples effective-context scarcity from registry size and significantly reduces token consumption while improving retrieval accuracy. Compared to full-context dumping, A2X achieves a 6.2-point Hit Rate gain at one-ninth the prompt-token cost; compared to the state-of-the-art open-source embedding-based baseline, A2X improves Hit Rate by more than 20 points.
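
The layer-by-layer walk described in the abstract can be sketched as follows. The relevance judgment that A2X delegates to an LLM is replaced here by a keyword-overlap stand-in so that the example runs on its own; `Node`, `keyword_overlap`, `walk`, and the `beam` parameter are hypothetical names, and taxonomy construction is not shown.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    name: str
    description: str
    children: List["Node"] = field(default_factory=list)  # an empty list marks a leaf service


def keyword_overlap(query: str, node: Node) -> float:
    """Stand-in for the LLM relevance judgment: count shared lowercase words."""
    return len(set(query.lower().split()) & set(node.description.lower().split()))


def walk(root: Node, query: str,
         choose: Callable[[str, Node], float] = keyword_overlap,
         beam: int = 1) -> List[Node]:
    """Descend the taxonomy one layer at a time, keeping the `beam` best nodes.

    Each layer only exposes the children of the surviving nodes, so the
    candidate set shown to the scorer stays small regardless of registry size.
    """
    frontier = [root]
    while any(node.children for node in frontier):
        candidates: List[Node] = []
        for node in frontier:
            # Expand internal nodes; carry leaves forward unchanged.
            candidates.extend(node.children if node.children else [node])
        candidates.sort(key=lambda n: choose(query, n), reverse=True)
        frontier = candidates[:beam]
    return frontier
```

With a real model in place of `keyword_overlap`, each call to `choose` would see only one node description at a time (or one small layer), which is the progressive-disclosure idea in miniature.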

2605.29267 2026-05-29 cs.AI cs.LG

When and How Human Curation Backfires: Preference Alignment under Multi-Model Self-Consuming Loop

人类策展何时以及如何适得其反:多模型自消费循环下的偏好对齐

Yang Zhang, Xiukun Wei, Xueru Zhang

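AI summary Studies how human curation affects model alignment in multi-model self-consuming training, and finds that cross-model interactions can dampen or even invert the effect of curation, degrading long-term alignment.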

Details
AI Chinese abstract

基础模型越来越多地使用先前模型迭代生成的合成数据进行训练,而非仅依赖真实数据。这种自消费训练范式可能导致模型崩溃、发散或偏差放大。近期工作(Ferbach et al., 2024)表明,将人类策展纳入循环可以引导自消费模型向人类对齐的行为,但这些分析聚焦于单一孤立模型,该模型仅消耗自身输出。然而,在实践中,模型经常交互并训练于其他模型产生的输入-输出对。本文研究多模型机制下的自消费训练。我们首先形式化了一个交互自消费模型的框架,并刻画了所得动力系统何时收敛到稳定点。然后,我们考察了一个模型的人类策展如何影响其自身对齐(自影响),以及这种效应如何传播到其他模型(交叉影响)。与孤立设置中人类策展总是增强模型对齐不同,我们表明跨模型交互可以削弱甚至逆转这种效应,最终损害长期对齐。

English abstract

Foundation models are increasingly trained on synthetic data generated by prior model iterations rather than exclusively on real data. This self-consuming training paradigm can lead to model collapse, divergence, or bias amplification. Recent work (Ferbach et al., 2024) shows that incorporating human curation into the loop can steer a self-consuming model toward human-aligned behavior, but these analyses focus on a single, isolated model that solely consumes its own outputs. In practice, however, models often interact and train on input-output pairs produced by other models. This paper studies self-consuming training in the multi-model regime. We first formalize a framework for interacting self-consuming models and characterize when the resulting dynamical system converges to a stable point. We then examine how human curation of one model affects its own alignment (self-influence) and how such effects propagate to other models (cross-influence). Unlike isolated settings where human curation always enhances model alignment, we show that cross-model interactions can dampen or even invert this effect, ultimately degrading long-term alignment.

2605.29262 2026-05-29 cs.AI

Harmonizing Real-Time Constraints and Long-Horizon Reasoning: An Asynchronous Agentic Framework for Dynamic Scheduling

协调实时约束与长视距推理:一种用于动态调度的异步智能体框架

Shijie Cao, Yuan Yuan, Jing Liu

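AI summary Proposes RACE-Sched, an asynchronous agentic framework whose dual-stream architecture decouples policy execution from logical reasoning and uses an LLM to synthesize and validate symbolic heuristic rules, improving dynamic scheduling quality while preserving real-time responsiveness.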

Details
AI Chinese abstract

动态柔性作业车间调度问题(DFJSP)需要在即时响应随机扰动与全局优化生产目标之间进行权衡。传统的优先级规则在处理复杂扰动时灵活性不足,而基于学习的方法往往牺牲可解释性或难以跨问题规模泛化。尽管大语言模型(LLM)提供了高级推理能力以弥合这一差距,但其显著的推理延迟与工业控制系统的毫秒级决策周期不兼容。为解决这一冲突,我们引入了RACE-Sched,一种异步智能体框架,通过双流架构将策略执行与逻辑推理解耦。反应流执行低延迟的符号启发式规则以实现实时调度,而并行的深思流利用LLM合成、验证和演化这些规则。候选规则在沙箱中经过严格测试,并通过原子更新部署,确保安全且不阻塞控制循环。此外,语义规则库索引已验证的启发式规则,用于基于检索的初始化,从而增强跨问题规模的可迁移性。在GEN-Bench、MK-Bench和JMS-Bench上的广泛评估表明,RACE-Sched优于领先的深度强化学习和其他基于LLM的基线方法。该方法协调了实时约束与长视距推理,实现了更优的解决方案质量和对动态事件的鲁棒适应。

English abstract

The Dynamic Flexible Job Shop Scheduling Problem (DFJSP) necessitates a trade-off between instant reaction to stochastic disturbances and global optimization of production goals. Conventional priority rules are insufficiently flexible to handle complex disruptions, whereas learning-based approaches often compromise interpretability or fail to generalize across problem scales. Although Large Language Models (LLMs) offer advanced reasoning capabilities to bridge this gap, their substantial inference latency is incompatible with the millisecond-level decision cycles of industrial control systems. To resolve this conflict, we introduce RACE-Sched, an asynchronous agent-based framework that decouples policy execution from logical reasoning via a dual-stream architecture. The Reactive Stream executes low-latency symbolic heuristics to enable real-time dispatching, while the parallel Deliberative Stream leverages an LLM to synthesize, validate, and evolve these rules. Candidate rules undergo rigorous testing in a sandbox and are deployed via atomic updates, ensuring safety without blocking the control loop. Additionally, a semantic rule repository indexes validated heuristics for retrieval-based initialization, which enhances transferability across problem scales. Extensive evaluations on GEN-Bench, MK-Bench, and JMS-Bench demonstrate that RACE-Sched outperforms leading Deep Reinforcement Learning and other LLM-based baselines. This approach harmonizes real-time constraints with long-horizon reasoning to achieve superior solution quality and robust adaptation to dynamic events.
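
The split between a fast dispatching path and a slower rule-improvement path can be illustrated with a toy sketch. All names here (`RuleSlot`, `dispatch`, `total_tardiness`, `deliberate`) are hypothetical, the sandbox is a single-machine tardiness simulation chosen for brevity, and the candidate rule is hand-written where the paper would have an LLM synthesize it.

```python
import threading
from typing import Callable, Dict, List

Job = Dict[str, float]          # hypothetical job record with "processing_time" and "due"
Rule = Callable[[Job], float]   # the job with the lowest value is dispatched first


class RuleSlot:
    """Holds the active dispatching rule; swaps are guarded by a lock."""

    def __init__(self, rule: Rule):
        self._rule = rule
        self._lock = threading.Lock()

    def get(self) -> Rule:
        with self._lock:
            return self._rule

    def swap(self, rule: Rule) -> None:
        with self._lock:
            self._rule = rule


def dispatch(slot: RuleSlot, queue: List[Job]) -> Job:
    """Reactive path: apply whatever rule is currently installed."""
    return min(queue, key=slot.get())


def total_tardiness(rule: Rule, jobs: List[Job]) -> float:
    """Sandbox check: run the rule on one machine and sum the lateness."""
    clock, tardiness, remaining = 0.0, 0.0, list(jobs)
    while remaining:
        job = min(remaining, key=rule)
        remaining.remove(job)
        clock += job["processing_time"]
        tardiness += max(0.0, clock - job["due"])
    return tardiness


def deliberate(slot: RuleSlot, candidate: Rule, sandbox_jobs: List[Job]) -> bool:
    """Deliberative path: install the candidate only if it beats the active rule."""
    if total_tardiness(candidate, sandbox_jobs) < total_tardiness(slot.get(), sandbox_jobs):
        slot.swap(candidate)
        return True
    return False
```

`deliberate` could run in a background thread while `dispatch` keeps serving requests, since the dispatcher only ever reads a fully installed rule from the slot.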

2605.29259 2026-05-29 cs.LG cs.AI

KLAS: Using Similarity to Stitch Neural Networks for Improved Accuracy-Efficiency Tradeoffs

KLAS:利用相似性拼接神经网络以改进精度-效率权衡

Debopam Sanyal, Anantharaman Iyer, Alind Khare, Trisha Jain, Akshay Jajoo, Myungjin Lee, Clayton Kerce, Alexey Tumanov

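AI summary Proposes the KLAS framework, which measures similarity between intermediate representations with KL divergence to select the best stitch configurations automatically, improving the accuracy-efficiency curve of stitched models at the same finetuning cost.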

Details
AI Chinese abstract

鉴于部署目标的广泛性,灵活模型选择对于在给定计算预算内优化性能至关重要。最近的研究表明,在模型家族内拼接预训练模型能够实现精度-效率权衡空间的成本效益插值。拼接将一个预训练模型的中间激活变换到另一个模型,生成新的插值拼接网络。这类网络沿精度-效率谱提供了部署选项池。然而,现有拼接方法往往产生次优权衡且缺乏泛化性,因为它们主要依赖启发式方法选择拼接配置。我们认为,构建改进的精度-效率权衡需要显式捕获并利用被拼接预训练模型之间的相似性。为此,我们引入KLAS,一种新颖的拼接选择框架,通过利用中间表示之间的KL散度,自动化和泛化跨模型家族的拼接选择。KLAS从$O(k^2n^2)$种可能性中为$k$个深度为$n$的预训练模型识别最有前景的二元拼接。通过全面实验,我们证明KLAS在相同微调成本下改进了拼接模型的精度-效率曲线,与基线相比,KLAS在相同计算成本下实现了高达$1.21\%$的ImageNet-1K top-1准确率提升,或在保持准确率的同时将FLOPs降低$1.33\times$。

English abstract

Given the wide range of deployment targets, flexible model selection is essential for optimizing performance within a given compute budget. Recent work demonstrates that stitching pretrained models within a model family enables cost-effective interpolation of the accuracy-efficiency tradeoff space. Stitching transforms intermediate activations from one pretrained model into another, producing a new interpolated stitched network. Such networks provide a pool of deployment options along the accuracy-efficiency spectrum. However, existing stitching approaches often yield suboptimal tradeoffs and lack generalizability, as they primarily rely on heuristics to select stitch configurations. We argue that constructing improved accuracy-efficiency tradeoffs requires explicitly capturing and leveraging the similarity between pretrained models being stitched. To this end, we introduce KLAS, a novel stitch selection framework that automates and generalizes stitch selection across model families by leveraging KL divergence between intermediate representations. KLAS identifies the most promising binary stitches from the $O(k^2n^2)$ possibilities for $k$ pretrained models of depth $n$. Through comprehensive experiments, we demonstrate that KLAS improves the accuracy-efficiency curve of stitched models at the same finetuning cost as baselines. KLAS achieves up to $1.21\%$ higher ImageNet-1K top-1 accuracy at the same computational cost, or maintains accuracy with a $1.33\times$ reduction in FLOPs.
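
To make the size of the search space and the role of KL divergence concrete, here is a small sketch that enumerates every ordered (source layer, target layer) pair across models and ranks the pairs by divergence. How KLAS turns intermediate activations into distributions is not stated in the abstract, so the softmax over a per-layer activation summary is an assumption, as are the names `softmax`, `kl`, and `rank_stitches`.

```python
import math
from itertools import permutations
from typing import Dict, List, Sequence, Tuple


def softmax(xs: Sequence[float]) -> List[float]:
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def rank_stitches(models: Dict[str, List[List[float]]]) -> List[Tuple[float, str, int, str, int]]:
    """Score every ordered cross-model layer pair; lower divergence ranks first.

    `models` maps a model name to one activation summary per layer. For k models
    of depth n this visits k * (k - 1) * n * n pairs, matching the O(k^2 n^2)
    count quoted in the abstract.
    """
    scored = []
    for src, dst in permutations(models, 2):
        for i, act_src in enumerate(models[src]):
            for j, act_dst in enumerate(models[dst]):
                scored.append((kl(softmax(act_src), softmax(act_dst)), src, i, dst, j))
    return sorted(scored)
```

Only the top-ranked pairs would then be stitched and finetuned, which is how a similarity score can keep the finetuning budget fixed while the candidate pool grows quadratically.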

2605.29257 2026-05-29 cs.SD

ChildVox: A Speech, Audio, and Large Audio-Language Model Benchmark in Understanding and Characterizing Sound across Childhood

ChildVox:理解与表征儿童期声音的语音、音频及大型音频语言模型基准

Tiantian Feng, Anfeng Xu, Xuan Shi, Aditya Kommineni, Shakhrul Iman Siam, Megan Micheletti, Zhonghao Shi, Helen Tager-Flusberg, Mi Zhang, Lynn K. Perry, Catherine Lord, Daniel Messinger, Shrikanth Narayanan

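AI summary Proposes the ChildVox benchmark, which integrates 17 child audio datasets and more than 20 sub-tasks to evaluate a range of foundation models on recognizing children's physiological sounds, non-linguistic vocalizations, canonical syllables, and spoken language.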

Comments preprint under review

Details
AI Chinese abstract

我们提出了ChildVox,这是一个新颖的基准,用于表征儿童通过其交流的多样化声学信号。具体来说,ChildVox遵循从出生到学龄的完整发展轨迹,涵盖生理声音、非语言发声、规范音节和口语。ChildVox整合了来自17个以儿童为中心的音频和语音数据集的20多个子任务,实现了系统的跨语料库和跨领域比较。我们评估了一系列代表性的音频和语音基础模型,包括自监督、面向ASR和大型音频语言模型,在生理声音分类、发声和规范音节建模以及语音质量评估和识别等任务上的表现。基准测试结果表明,ChildVox提供了一套高性能模型,用于识别来自儿童的广泛声学信号,支持下游应用,如表征儿童语言水平和追踪随年龄变化的语音产生。

English abstract

We present ChildVox, a novel benchmark for characterizing the diverse acoustic signals through which children communicate. Specifically, ChildVox follows the full developmental trajectory from birth through school age, covering physiological sounds, non-linguistic vocalizations, canonical syllables, and spoken language. ChildVox integrates more than 20 sub-tasks across 17 child-centered audio and speech datasets, enabling systematic cross-corpus and cross-domain comparison. We evaluate a representative range of audio and speech foundation models, including self-supervised, ASR-oriented, and large audio-language models, on tasks including physiological sound classification, vocalization and canonical syllable modeling, and speech quality assessment and recognition. Benchmark results show that ChildVox provides a suite of high-performance models for recognizing a wide range of acoustic signals from children, supporting downstream applications such as characterizing children's language levels and tracking speech production with age.