arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.22106 2026-05-22 cs.AI

ArborKV: Structure-Aware KV Cache Management for Scaling Tree-based LLM Reasoning

Yeqiu Chen, Ziyan Liu, Zhenxin Huang, Runquan Gui, Hong Wang, Lei Liu

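AI summary: This paper proposes ArborKV, a structure-aware KV cache management method that combines a lightweight value estimator with a tree-aware allocation policy to perform purely token-extractive eviction with lazy rehydration, reducing KV memory use while preserving high accuracy and enabling larger tree-structured reasoning searches under a fixed hardware budget.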

English abstract

Recent progress in LLM reasoning has increasingly shifted from single-pass generation to explicit search over intermediate reasoning states. Tree-of-Thoughts (ToT) organizes inference as a tree-structured search with branching and backtracking, but it substantially amplifies the Key-Value (KV) cache: retaining KV states for a frontier of partial trajectories quickly becomes a memory bottleneck that limits throughput and constrains search depth and width under fixed hardware budgets. We address this challenge by observing that KV reuse in ToT-style inference is governed by search dynamics: near-term decoding depends primarily on the active branch and its ancestors, whereas inactive subtrees have low short-term reuse probability yet must remain recoverable for backtracking. Motivated by this, we propose ArborKV, a structure-aware eviction framework that couples a lightweight value estimator with a tree-aware allocation policy and performs purely token-extractive eviction with lazy rehydration to support revisits. Experiments on ToT-style reasoning benchmarks show that ArborKV reduces peak KV memory by up to ~4x while preserving near-full-retention accuracy, so that larger search configurations that would otherwise run out of memory fit within a fixed device budget.
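
The observation that near-term decoding depends mainly on the active branch and its ancestors can be illustrated with a minimal Python sketch. It is not the ArborKV algorithm: the abstract does not describe the value estimator or the allocation policy, so the function names, the toy tree, and the fixed `inactive_keep_ratio` of 0.25 are all assumptions made for illustration.

```python
def active_path(parent, node):
    """Return the node and all of its ancestors, from the node up to the root."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path


def plan_budgets(parent, tokens, active, inactive_keep_ratio=0.25):
    """Assign a per-node token budget (hypothetical policy, not the paper's).

    Nodes on the active path keep all of their tokens; every other node
    keeps only a fraction, standing in for token-extractive eviction.
    """
    keep_full = set(active_path(parent, active))
    budgets = {}
    for node, n_tokens in tokens.items():
        if node in keep_full:
            budgets[node] = n_tokens
        else:
            # Keep at least one token so the node can still be revisited.
            budgets[node] = max(1, int(n_tokens * inactive_keep_ratio))
    return budgets


# Toy search tree: parent pointers and the number of cached tokens per node.
parent = {"root": None, "a": "root", "b": "root", "a1": "a", "a2": "a"}
tokens = {"root": 40, "a": 32, "b": 32, "a1": 24, "a2": 24}

budgets = plan_budgets(parent, tokens, active="a1")
print(budgets)
print(sum(tokens.values()), "->", sum(budgets.values()))
```

In this toy tree the retained token count drops from 152 to 110 while the active path `root -> a -> a1` stays intact. A real system would also need the lazy rehydration step, recomputing evicted entries when the search backtracks into an inactive subtree, which this sketch omits.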

2605.22104 2026-05-22 cs.CV

OPERA: An Agent for Image Restoration with End-to-End Joint Planning-Execution Optimization

Feng Zhu, Shuyang Xie, Yihan Zeng, Ming Liu, Wangmeng Zuo

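AI summary: This study proposes the OPERA framework, which jointly optimizes restoration planning and tool execution end to end to handle complex mixed degradations in image restoration, outperforming existing agent-based methods and all-in-one models.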

English abstract

Real-world image restoration is challenging due to complex and interacting mixed degradations. Recent agent-based approaches address this problem by composing multiple task-specific restoration tools. However, empirical analysis reveals that their performance is fundamentally limited by implicitly constrained planning spaces and the lack of coordination among independently pretrained tools. To address these issues, we propose OPERA (Optimized Planning-Execution Restoration Agent), a framework that jointly optimizes restoration planning and tool execution in an end-to-end manner. On the planning side, OPERA uses reinforcement learning to directly optimize tool composition over a combinatorial plan space, with the final restoration quality as the reward. On the execution side, OPERA introduces agent-guided co-training of restoration tools, enabling them to learn cooperative behaviors under sequential composition. Extensive experiments on multi-degradation benchmarks and real-world datasets demonstrate that OPERA consistently outperforms both all-in-one restoration models and existing agent-based methods across diverse and complex degradation scenarios.

2605.22102 2026-05-22 cs.AI

ExComm: Exploration-Stage Communication for Error-Resilient Agentic Test-Time Scaling

Woomin Song, Beomjun Kim, Daewon Choi, Sai Muralidhar Jayanthi, Saket Dingliwal, Jinwoo Shin, Aram Galstyan

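AI summary: This paper proposes ExComm, a communication protocol for exploration-stage agentic test-time scaling that periodically audits agent belief states to detect cross-agent factual conflicts and resolves them through a dedicated tool-based verification loop, making test-time scaling more error-resilient.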

English abstract

A common failure mode in long-horizon agentic test-time scaling is error propagation, where factual errors or invalid deductions introduced at intermediate steps persist in the agent's belief state and contaminate later reasoning. Existing test-time scaling methods provide limited control over this process, as they often rely on agents to detect their own mistakes, select among flawed trajectories, or refine solutions only after errors have already shaped the reasoning path. We propose ExComm, a communication protocol for exploration-stage agentic test-time scaling. ExComm is motivated by the empirical observation that the majority of intermediate errors in parallel agentic reasoning produce detectable cross-agent factual conflicts. Leveraging the iterative structure of agentic workflows, ExComm periodically audits agent belief states to detect such conflicts, resolves them through a dedicated tool-based verification loop, and returns concise, targeted feedback to the involved agents. Corrections are incorporated through soft belief updates, which append verified feedback rather than overwriting existing beliefs. To keep communication from collapsing trajectory diversity, ExComm also introduces a trajectory diversification module that redirects redundant trajectories toward orthogonal strategies. Experiments on AIME 2024, AIME 2025, and GAIA with Gemini-2.5-Flash-Lite and Qwen3.5-4B show that ExComm consistently outperforms strong test-time scaling baselines, achieving average performance gains of 5.7% and 5.0% over the best-performing baselines, respectively. Further analyses demonstrate improved error recovery, favorable scaling behavior, stronger diversity than adapted communication baselines, and the best performance-cost trade-off among the evaluated methods.

2605.22099 2026-05-22 cs.CL

A Comparative Study of Language Models for Khmer Retrieval-Augmented Question Answering

Sereiwathna Ros, Phannet Pov, Ratanaktepi Chhor, Kimleang Ly, Wan-Sup Cho, Saksonita Khoeurn

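AI summary: Targeting Khmer, a low-resource non-Latin-script language, this paper compares several language models on retrieval-augmented question answering and finds that retriever choice is a key factor in performance, while generators perform differently across metrics.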

Comments 14 pages, 1 figure

English abstract

Retrieval-Augmented Generation (RAG) has emerged as a promising paradigm for grounding large language model (LLM) outputs in retrieved evidence, thereby reducing hallucination and improving factual accuracy. Its efficacy, however, remains largely unexamined for low-resource, non-Latin-script languages such as Khmer. In this paper, we present a RAG-based question answering system for Khmer-language telecom-domain documents. We conduct a two-phase comparative evaluation. First, we benchmark three embedding models: BGE-M3 (567M), Jina-Embeddings-v3 (570M), and Qwen3-Embedding (597M), for dense retrieval over Khmer documents. BGE-M3 consistently performs best, achieving a Hit Rate@3 of 0.285, File Hit Rate@3 of 0.700, MRR@3 of 0.221, and Precision@3 of 0.112, substantially outperforming the other retrievers. Second, using BGE-M3 as the selected retriever, we evaluate five generator backends: Qwen3 (8B), Qwen3.5 (9B), Sailor2-8B-Chat, SeaLLMs-v3-7B-Chat, and Llama-SEA-LION-v2-8B-IT, on a curated golden dataset of 200 Khmer question-answer pairs. To quantify system performance, we apply six RAGAS-inspired metrics: faithfulness, answer relevance, context relevance, factual correctness, answer similarity, and answer correctness. The results show no single model dominates across all metrics: Qwen3.5-9B achieves the highest faithfulness (0.859) and context relevance (0.726), Qwen3-8B attains the highest factual correctness (0.380), and SeaLLMs-v3-7B-Chat performs best on answer relevance (0.867), answer similarity (0.836), and answer correctness (0.599). These findings highlight that retriever choice remains a major bottleneck for Khmer RAG, while generator strengths vary depending on whether the priority is grounding, factual precision, or semantic similarity.
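
The retrieval metrics named above can be sketched in a few lines of Python. This uses the common textbook definitions of Hit Rate@k, MRR@k, and Precision@k; the paper's exact definitions (for example how File Hit Rate@3 aggregates chunks into files) are not given in the abstract, and the chunk identifiers below are made up for illustration.

```python
def hit_rate_at_k(ranked, relevant, k=3):
    """1.0 if any of the top-k retrieved items is relevant, else 0.0."""
    return 1.0 if any(doc in relevant for doc in ranked[:k]) else 0.0


def mrr_at_k(ranked, relevant, k=3):
    """Reciprocal rank of the first relevant item within the top k, else 0.0."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def precision_at_k(ranked, relevant, k=3):
    """Fraction of the top-k slots occupied by relevant items."""
    return sum(doc in relevant for doc in ranked[:k]) / k


def evaluate(queries, k=3):
    """Average each metric over (ranked_list, relevant_set) pairs."""
    n = len(queries)
    return {
        "hit_rate": sum(hit_rate_at_k(r, rel, k) for r, rel in queries) / n,
        "mrr": sum(mrr_at_k(r, rel, k) for r, rel in queries) / n,
        "precision": sum(precision_at_k(r, rel, k) for r, rel in queries) / n,
    }


# Two hypothetical queries: one with a hit at rank 2, one with no hit.
queries = [
    (["c1", "c2", "c3"], {"c2"}),
    (["c4", "c5", "c6"], {"c9"}),
]
scores = evaluate(queries, k=3)
print(scores)
```

With a single relevant chunk per query, Precision@3 can never exceed one third, which is one reason it can sit well below Hit Rate@3 for the same retriever.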

2605.22098 2026-05-22 cs.CV cs.AI cs.LG

TextTeacher: What Can Language Teach About Images?

Tobias Christian Nauen, Stanislav Frolov, Brian Bernhard Moser, Federico Raue, Ahmed Anwar, Andreas Dengel

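AI summary: This study proposes TextTeacher, a method that injects the semantic knowledge of a language model into image classification training to improve vision models while keeping the inference-time model simple.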

Comments Published at TMLR

Journal ref
Transactions on Machine Learning Research, ISSN 2835-8856, 2026

English abstract

The platonic representation hypothesis suggests that sufficiently large models converge to a shared representation geometry, even across modalities. Motivated by this, we ask: Can the semantic knowledge of a language model efficiently improve a vision model? As an answer, we introduce TextTeacher, a simple auxiliary objective that injects text embeddings as additional information into image classification training. TextTeacher uses readily available image captions, a pre-trained and frozen text encoder, and a lightweight projection to produce semantic anchors that efficiently guide representations during training while leaving the inference-time model unchanged. On ImageNet with standard ViT backbones, TextTeacher improves accuracy by up to +2.7 percentage points (p.p.) and yields consistent transfer gains (on average +1.0 p.p.) under the same recipe and compute. It outperforms vision knowledge distillation, yielding higher accuracy at a constant compute budget, or similar accuracy 33% faster. Our analysis indicates that TextTeacher acts as a feature-space preconditioner, shaping deeper layers in the first stages of training, and aiding generalization by supplying complementary semantic cues. TextTeacher adds negligible overhead, requires no costly multimodal training of the target model, and preserves the simplicity and latency of pure vision models. Project page with code and captions: https://nauen-it.de/publications/text-teacher
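
One way to picture an auxiliary objective of this kind is a classification loss plus a term that pulls a projected image feature toward a frozen caption embedding. The sketch below is an assumption-laden illustration, not the TextTeacher loss: the abstract does not state the form of the auxiliary term, so the cosine-alignment term, the `weight` of 0.5, and all function names are hypothetical.

```python
import math


def cross_entropy(logits, label):
    """Numerically stable softmax cross-entropy for one example."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def project(weights, x):
    """Lightweight linear projection: one output per row of the weight matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def auxiliary_text_loss(logits, label, image_feat, weights, text_emb, weight=0.5):
    """Classification loss plus a hypothetical cosine-alignment term.

    text_emb stands in for the output of a frozen text encoder; only the
    projection and the vision model would receive gradients in training.
    """
    anchor_term = 1.0 - cosine(project(weights, image_feat), text_emb)
    return cross_entropy(logits, label) + weight * anchor_term


logits = [0.0, 0.0]
image_feat = [1.0, 0.0]
weights = [[2.0, 0.0], [0.0, 2.0]]

aligned = auxiliary_text_loss(logits, 0, image_feat, weights, text_emb=[1.0, 0.0])
misaligned = auxiliary_text_loss(logits, 0, image_feat, weights, text_emb=[0.0, 1.0])
print(aligned, misaligned)
```

Because the extra term only appears in the training loss, dropping the projection and the text embeddings at inference leaves the vision model unchanged, which matches the property the abstract emphasizes.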

2605.22096 2026-05-22 cs.CV

VISTA: Validation-Guided Integration of Spatial and Temporal Foundation Models with Anatomical Decoding for Rare-Pathology VCE Event Detection -- after competition results

Bo-Cheng Qiu, Fang-Ying Lin, Ming-Han Sun, Yu-Fan Lin, Chia-Ming Lee, Chih-Chung Hsu

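AI summary: This paper proposes VISTA, a framework that combines spatial and temporal foundation models with anatomical decoding and uses validation-guided weighted fusion and temporal event decoding to improve rare-pathology VCE event detection; after further post-competition tuning it ranked second.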

English abstract

Capsule endoscopy event detection is challenging because clinically relevant findings are sparse, visually heterogeneous, and evaluated at the event level rather than by frame accuracy. We propose VISTA, a metric-aligned multi-backbone framework for the RAREVISION task. VISTA combines EndoFM-LV for temporal context and DINOv3 ViTL/16 for frame-level visual semantics, followed by a Diverse Head Ensemble (DHE), Validation-Guided Weighted Fusion (VGWF), and Anatomy-Aware Temporal Event Decoding (ATED). The original official submission achieved hidden-test temporal mAP@0.5 of 0.3530 and mAP@0.95 of 0.3235. After the competition, extending local threshold refinement with a global coarse search improved performance to 0.3726 mAP@0.5 and 0.3431 mAP@0.95, ranking Team ACVLab second in the post-competition evaluation.
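
Event-level metrics such as temporal mAP@0.5 and mAP@0.95 rest on the temporal intersection-over-union between a predicted event interval and a ground-truth interval. The Python sketch below shows only that building block, using the standard interval IoU; the challenge's full protocol (time units, matching rules, per-class averaging) is not described in the abstract, and the example intervals are invented.

```python
def temporal_iou(pred, truth):
    """Intersection over union of two (start, end) intervals."""
    (p_start, p_end), (t_start, t_end) = pred, truth
    intersection = max(0.0, min(p_end, t_end) - max(p_start, t_start))
    union = (p_end - p_start) + (t_end - t_start) - intersection
    return intersection / union if union > 0 else 0.0


def is_match(pred, truth, threshold):
    """A prediction counts as a hit when its IoU reaches the threshold."""
    return temporal_iou(pred, truth) >= threshold


predicted_event = (10.0, 20.0)
annotated_event = (12.0, 22.0)

iou = temporal_iou(predicted_event, annotated_event)
print(round(iou, 4))
print(is_match(predicted_event, annotated_event, 0.5))
print(is_match(predicted_event, annotated_event, 0.95))
```

The same prediction is accepted at the 0.5 threshold and rejected at 0.95, which is why the stricter metric rewards precise event boundaries rather than approximate localization.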

2605.22090 2026-05-22 cs.AI

A Camera-Cooperative ISAC Framework for Multimodal Non-Cooperative UAVs Sensing

Wenfeng Wu, Luping Xiang, Kun Yang

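AI summary: This paper proposes a camera-cooperative ISAC framework that uses multimodal sensing for efficient UAV beam steering and tracking, improving sensing accuracy and resource efficiency.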

English abstract

The detection of non-cooperative unmanned aerial vehicles (UAVs) presents significant challenges for Integrated Sensing and Communication (ISAC) systems due to the inherent limitations of single-modal perception and the competition for shared communication and sensing resources. To address these challenges, this paper proposes a novel Camera-Cooperative ISAC (CC-ISAC) framework that employs multimodal sensing to enable efficient UAV beam steering and tracking. The proposed framework employs cameras for coarse-grained airspace monitoring and utilizes ISAC for fine-grained, high-precision sensing, forming a complementary perception loop that enhances both sensing accuracy and resource efficiency. Within this framework, two key modules are developed: (1) a Vision-to-Echo Data Alignment (V2EDA) model that aligns visual and echo-domain features through cross-attention mechanisms, and (2) a Multimodal Fusion-Based Estimation (MMFE) model that integrates historical multimodal data with current observations for robust state estimation. Extensive evaluations conducted on the DeepSense 6G dataset demonstrate that the proposed framework achieves an average reduction of 71% in beam steering overhead and 1.69-11.15% in tracking overhead while maintaining high angular estimation accuracy. The CC-ISAC framework effectively mitigates resource contention between sensing and communication, enabling reliable UAV surveillance while freeing substantial system resources for additional communication tasks, thereby representing a practical advancement in ISAC system design.

2605.22089 2026-05-22 cs.CV cs.AI

LVDrive: Latent Visual Representation Enhanced Vision-Language-Action Autonomous Driving Model

Xiaodong Mei, Diankun Zhang, Hongwei Xie, Guang Chen, Hangjun Ye, Dan Xu

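AI summary: This paper proposes LVDrive, a vision-language-action autonomous driving model that adds a future scene prediction task and learns semantically rich scene representations in a high-level latent space, improving closed-loop driving performance.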

English abstract

Vision-Language-Action (VLA) models have emerged as a promising framework for end-to-end autonomous driving. However, existing VLAs typically rely on sparse action supervision, which underutilizes their powerful scene understanding and reasoning capabilities. Recent attempts to incorporate dense visual supervision via world modeling often overemphasize pixel-level image reconstruction, neglecting semantically meaningful scene representation learning. In this work, we propose LVDrive, a Latent Visual representation enhanced VLA framework for autonomous driving. LVDrive introduces a future scene prediction task into the VLA paradigm, where future representations are learned entirely in a high-level latent space under auxiliary supervision from a pretrained vision backbone. Departing from inefficient autoregressive generation, we jointly model future scene and motion prediction within a unified embedding space, processed in a single forward pass to perform future-aware reasoning. We further design a two-stage trajectory decoding strategy that explicitly leverages the learned latent future representations to refine trajectory generation. Extensive experiments on the challenging Bench2Drive benchmark demonstrate that LVDrive achieves significant improvements in closed-loop driving performance, outperforming both action-supervised methods and image-reconstruction-based world model approaches.

2605.21413 2026-05-22 cs.AI

Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work

Haiyang Shen, Jiuzheng Wang, Taian Guo, Mugeng Liu, Wenchun Jing, Chongyang Pan, Siqi Zhong, Zhiyang Chen, Weichen Bi, Yudong Han, Xiaoying Bai, Yun Ma

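AI summary: This paper proposes teaching AI through benchmark construction and presents QuestBench as a course-based practice that helps students understand their responsibility in knowledge work in the AI era.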

Comments 24 pages, 5 figures, 4 tables

English abstract

As AI becomes part of everyday learning, many courses teach students to use it mainly as a productivity tool: how to prompt, search, summarize, write, code, and use tools more efficiently. We argue that AI education also needs a setting in which students learn to test AI and understand their own role in judging machine-produced knowledge. To this end, we introduce a course-based practice that teaches AI through benchmark construction, using deep research systems as a concrete example of AI-era knowledge work. Students turn disciplinary knowledge into verifiable expert-level questions, review one another's designs for ambiguity and shortcuts, and evaluate AI systems on the resulting tasks. This activity gives students direct exposure to a powerful tool while asking them to specify what a trustworthy answer would require. The produced benchmark, QuestBench, consists of 256 questions across 14 humanities and social-science domains. Evaluation on QuestBench shows that student-designed tasks reveal hidden failures in current deep research systems: across thirteen evaluated systems, the mean question-level pass rate is only 16.85%, and the best-performing system, GPT-5.5, reaches a 57.58% pass rate. The failures are educationally useful because they show how fluent, source-backed answers can still miss the right query, source, term, or evidence standard. Reflections from five student contributors suggest that benchmark construction can help students see professional knowledge not only as content AI may retrieve, but as the basis for judging AI outputs. We present QuestBench as a benchmark artifact and as a reusable classroom setting for a larger educational question: how students can remain responsible knowledge actors as AI enters learning and professional work. The dataset is available at https://huggingface.co/datasets/PKUAIWeb/QuestBench/tree/main.

2605.21273 2026-05-22 cs.CV

DriveMA: Rethinking Language Interfaces in Driving VLAs with One-Step Meta-Actions

Weicheng Zheng, Yixin Huang, Qiao Sun, Derun Li, Hang Zhao

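AI summary: This paper proposes DriveMA, which replaces conventional natural-language reasoning with one-step meta-actions, addressing three bottlenecks of language interfaces in driving VLAs and enabling more efficient end-to-end planning.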

Comments We withdraw this submission because the current version contains a mismatch between the paper title, conceptual framing, and the intended contribution of the work. To avoid potential misunderstanding by readers, the authors have decided to withdraw this version and substantially revise the title, organization, and presentation before any future submission

English abstract

Driving Vision-Language-Action Models (Driving VLAs) commonly introduce natural-language reasoning as an intermediate interface for end-to-end planning, but reasoning-centric interfaces face three practical bottlenecks: obtaining high-quality reasoning annotations is difficult, generating and understanding long reasoning chains is challenging for compact models, and inference latency is substantially increased. In this paper, we rethink the design of language interfaces in Driving VLAs and show that concise one-step meta-actions are a simple yet effective alternative to verbose reasoning. Meta-actions provide semantic decision grounding while remaining low-entropy and automatically derivable from expert trajectories, enabling scalable supervision and reliable trajectory conditioning. Building on this interface, we propose DriveMA, which combines action-centric supervised training with a turn-level credit-assignment reinforcement learning framework that jointly optimizes meta-action correctness, trajectory quality, and trajectory–meta-action consistency. Experiments show that DriveMA achieves a new state of the art on the Waymo End-to-End Driving Challenge with a 2B model, reaching a Rater Feedback Score (RFS) of 8.060, while its 4B version further improves the state of the art to 8.079; DriveMA also obtains competitive performance on NAVSIM. Ablations demonstrate that one-step meta-actions offer a better practical trade-off between expressiveness, predictability, and inference efficiency than natural-language reasoning or finer-grained action sequences. Code, data, and models will be released to facilitate future research.

2605.21214 2026-05-22 cs.LG cs.AI

Behavior-Consistent Deep Reinforcement Learning

Marcel Hussing, Liv G. d'Aliberti, Claas Voelcker, Benjamin Eysenbach, Eric Eaton

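AI summary: This paper proposes a behavior-consistent deep reinforcement learning approach that controls the distributional similarity of policies to reduce policy divergence across training runs, reducing return variance without sacrificing performance.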

English abstract

Reinforcement learning (RL) often exhibits high variance across training runs, leading to unreliable performance and posing a major challenge to deployment in real-world domains. In this work, we address the challenge of cross-run policy divergence by formalizing the problem of behavior-consistent RL, where the objective is to obtain policies that are both high-performing and distributionally similar across training runs. Our key observation is that maximum-entropy RL provides a direct mechanism for controlling behavioral divergence by anchoring runs to a common (uniform) prior. We prove that, for Boltzmann policies, choosing the temperature proportional to $Q$-function disagreement bounds the pairwise KL divergence between the induced policies. However, we also show that naïvely increasing entropy might impair policy optimization while amplifying off-policy error. Building upon these observations, we propose $Q$-value Expectile Disagreement (QED), a state-dependent temperature schedule that uses double-critic disagreement as a single-run proxy for cross-run disagreement. Empirically, we demonstrate that across 18 continuous-control tasks, QED reduces across-run divergence by two orders of magnitude without sacrificing performance, resulting in a considerable reduction in return variance at modest sample-efficiency costs.
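
The link between temperature and cross-run divergence can be checked numerically for a single state. The Python sketch below builds Boltzmann policies from two hypothetical Q-value vectors (standing in for two training runs) and compares their KL divergence at a low and a high temperature. It also checks the elementary bound KL <= 2 * eps / tau, where eps is the largest Q-value gap; this is a standard consequence of the softmax form and is not claimed to be the paper's exact bound. QED's state-dependent schedule and the expectile disagreement measure are not reproduced here.

```python
import math


def boltzmann(q_values, temperature):
    """Softmax policy pi(a) proportional to exp(Q(a) / temperature)."""
    scaled = [q / temperature for q in q_values]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical Q-values for the same state from two independent runs.
q_run_a = [1.0, 0.0, -1.0]
q_run_b = [0.5, 0.5, -1.0]
eps = max(abs(a - b) for a, b in zip(q_run_a, q_run_b))  # sup-norm disagreement

kl_low_temp = kl_divergence(boltzmann(q_run_a, 0.5), boltzmann(q_run_b, 0.5))
kl_high_temp = kl_divergence(boltzmann(q_run_a, 5.0), boltzmann(q_run_b, 5.0))

print(kl_low_temp, kl_high_temp)
print(kl_low_temp <= 2 * eps / 0.5, kl_high_temp <= 2 * eps / 5.0)
```

Raising the temperature pushes both policies toward the uniform prior, so their divergence shrinks, but the policies also become less greedy. That tension is the reason the abstract warns against naively increasing entropy.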

2605.21143 2026-05-22 cs.SD cs.LG

CoarseSoundNet: Building a reliable model for ecological soundscape analysis

Alexander Gebhard, Andreas Triantafyllopoulos, Dominik Arend, Sandra Müller, Svenja Schmidt, Michael Scherer-Lorenzen, Björn W. Schuller

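AI summary: This paper proposes CoarseSoundNet, a model that classifies biophony, geophony, and anthropophony under realistic noisy conditions, and systematically studies model architectures, training data, and evaluation strategies to improve generalization to passive acoustic monitoring recordings.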

Comments Currently under review

English abstract

A soundscape is composed of three types of sound: biophony (sounds made by animals), geophony (natural abiotic sounds) and anthropophony (sounds made by humans). A key research question in the field of soundscape ecology is how these components interact with each other, specifically how biophony responds to geophony and anthropophony. Nevertheless, as of today, there are not many analytical instruments that enable the distinct quantification of these elements. Recent machine learning (ML) approaches aim to support automated analysis but often rely on task-specific or clean data, limiting generalisation to noisy passive acoustic monitoring (PAM) recordings. This study presents a clear and reproducible structure to build ML models for coarse soundscape classification and introduces CoarseSoundNet, a deep learning model trained to distinguish biophony, geophony, and anthropophony under realistic PAM conditions. We systematically investigate model architectures, the influence of an additional training class, data composition, and evaluation strategies. Our findings suggest that model performance improves with additional PAM data, especially when similar to the target domain, and by introducing an explicit silence class during training. Class-specific decision thresholds and duration-based constraints further enhance performance, particularly for anthropophony and geophony. Error analyses exhibit challenges for anthropophony due to masking effects and confusions for silence and insect sounds for geophony and biophony. Finally, we conduct an ecological case study which shows that pre-filtering recordings with CoarseSoundNet yields acoustic index trends comparable to ground-truth filtering, supporting its use as an effective preprocessing tool for ecoacoustic analyses.

2605.20975 2026-05-22 cs.LG cs.CR

Choose Wisely and Privately: Proactive Client Selection for Fair and Efficient Federated Learning

Adda Akram Bendoukha, Heber Hwang Arcolezi, Nesrine Kaaniche, Aymen Boudguiga

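AI summary: This paper proposes a proactive client selection framework that, before training begins, searches for a federation of clients whose combined data meet utility and fairness requirements, improving the efficiency and fairness of federated learning.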

English abstract

Federated Learning enables collaborative model training across decentralized data sources without data transfer. Averaging-based FL is limited by the presence of non-IID data, which negatively impacts convergence speed and final model accuracy. Conventional alternatives suffer from significant inefficiency. Clients with noisy or highly heterogeneous data contribute expensive gradient computations that are either discarded or heavily down-weighted before aggregation. These reactive approaches waste computational resources, require more communication rounds, and result in unnecessary privacy exposure. In this paper, we propose a proactive client selection framework that aims to find an optimal federation of clients whose combined data match utility and fairness requirements before training begins. Our method relies on mutual information computed from differentially private contingency tables to quantify the relevance of cross-feature correlations in the union dataset. We introduce a Potential Federation Loss (PFL) over the set of fixed-size federations, which balances two objectives: maximizing collective data utility and ensuring fair cross-feature correlations to prevent group unfairness. Client selection is expressed as an optimal subset search problem over the PFL objective, which we solve using simulated annealing under strong differential privacy guarantees for clients' local statistics. Experimental results on four benchmarks show faster, fairer, and more accurate models trained on optimally found federations, compared to uniform sampling, even when state-of-the-art adaptive aggregation or sampling strategies are employed.
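
The quantity at the core of this method, mutual information computed from a contingency table, can be sketched in Python as follows. The sketch uses the standard plug-in estimate from raw counts and deliberately leaves out the differential privacy step (adding calibrated noise to the counts before they are shared), the Potential Federation Loss, and the simulated annealing search, none of which are specified in enough detail in the abstract. The example tables are invented.

```python
import math


def mutual_information(table):
    """Plug-in mutual information (in nats) of a two-way contingency table.

    table[i][j] is the count of records with feature value i and
    label or attribute value j.
    """
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    mi = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            if count > 0:
                joint = count / total
                # joint / (p_row * p_col) simplifies to count * total / (row * col)
                ratio = count * total / (row_sums[i] * col_sums[j])
                mi += joint * math.log(ratio)
    return mi


independent = [[25, 25], [25, 25]]   # feature tells nothing about the label
dependent = [[50, 0], [0, 50]]       # feature determines the label

mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
print(mi_independent, mi_dependent)
```

For a federation, the tables of the selected clients would be summed cell by cell before computing the score. Because only counts are exchanged, noise can be added to them locally, which is presumably what makes the pre-training selection compatible with the privacy guarantees the abstract mentions.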

2605.20578 2026-05-22 cs.SD cs.CV

A strongly annotated passive acoustic dataset for tropical bird monitoring

Daniela Ruiz, Juan Sebastián Ulloa, Zhongqi Miao, Nicolás Betancourt, Maria Paula Toro-Gómez, Andrés Hernández, Bruno Demuro, Eliana Barona-Cortés, Angela Mendoza-Henao, Andrés Sierra-Ricaurte, Sebastián Pérez-Peña, Rahul Dodhia, Pablo Arbeláez, Juan M. Lavista Ferres

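AI summary: This paper presents PteroSet, a dataset for tropical bird monitoring with strongly annotated audio released in a COCO-inspired JSON format, which serves as a machine learning benchmark and comes with a deep learning baseline for binary bird detection.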

English abstract

Passive acoustic monitoring enables continuous, non-invasive biodiversity assessment across diverse ecosystems. The scale of these datasets has driven the adoption of machine learning, with supervised approaches showing strong performance. However, supervised methods require time-resolved annotated datasets, which remain scarce, especially in complex tropical soundscapes. We present PteroSet, a curated dataset of strongly annotated Neotropical bird vocalizations recorded in Puerto Asis (Putumayo) and Pivijay (Magdalena), Colombia, between 2023 and 2025. The dataset comprises 563 recordings (73.62 h) and 15,372 time-frequency annotations, including 6,702 events identified to the species level across 168 species. We release the annotations in a COCO-inspired JSON schema that unifies audio files, taxonomic categories, and labels for machine learning workflows. Beyond providing annotated data, PteroSet serves as a realistic benchmark that highlights key characteristics of tropical soundscapes, including acoustic co-occurrence and domain shift across recording sites. We provide a deep learning baseline for binary bird detection, demonstrating PteroSet's usability and the challenges it presents.

2605.20342 2026-05-22 cs.CV

ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning

Zuhao Yang, Kaichen Zhang, Sudong Wang, Keming Wu, Zhongyu Yang, Bo Li, Xiaojuan Qi, Shijian Lu, Xingxuan Li, Lidong Bing

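AI summary This paper proposes ParaVT, an end-to-end reinforcement learning framework for parallel video tool calling. It introduces the PARA-GRPO mechanism to resolve the Tool Prior Paradox and improves long-video understanding.

Comments Project Page: https://evolvinglmms-lab.github.io/ParaVT/

Details
AI-translated abstract

Training large multimodal models (LMMs) with reinforcement learning (RL) to natively invoke video-processing tools (such as cropping) has become a promising route to long-video understanding. Existing native-RL methods, however, dispatch tool calls sequentially (one per turn): a single wrong crop propagates errors without peer correction, multi-turn tool calls corrupt the context, and inference cost grows linearly with the number of turns. We introduce ParaVT, the first multi-agent end-to-end RL-trained framework for parallel video tool calling, which dispatches multiple time-window crops within a single turn to obtain cleaner context and better fault tolerance. Applying standard RL to ParaVT, however, reveals an obstacle we call the Tool Prior Paradox: the pretrained tool priors that enable tool exploration also destabilize the cold-started structural format and expose a skip-tool reward shortcut under temperature sampling. A cross-model contrast with a weaker-prior LMM supports this claim: the format stays stable but RL elicits zero tool calls, indicating that prior strength is the shared driver of both format collapse and tool exploration. We propose PARA-GRPO (Parseability-Anchored and Ratio-gAted GRPO), which augments standard RL with two complementary mechanisms: (i) a targeted format reward applied only at the structural-token positions most prone to collapse, and (ii) per-prompt frame-budget randomization, which creates training prompts where calling the tool yields a measurable reward signal over skipping it. Across six long-video understanding benchmarks, ParaVT improves over the Qwen3-VL baseline by 7.9% on average, and PARA-GRPO raises training-time format compliance from 0.13 to 0.64. As tool capabilities become increasingly internalized in modern LMMs, RL must cooperate with the resulting priors, and ParaVT offers a general recipe for agentic RL. Code, data, and model weights are publicly available.

English abstract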

Training large multimodal models (LMMs) via reinforcement learning (RL) to natively invoke video-processing tools (e.g., cropping) has become a promising route to long-video understanding. However, existing native-RL methods dispatch tool calls sequentially (i.e., one per turn): a single wrong crop propagates errors without peer correction, multi-turn tool calls corrupt context, and inference cost scales linearly with the number of turns. We introduce ParaVT, the first multi-agent end-to-end RL-trained framework for Parallel Video Tool calling, dispatching multiple time-window crops in a single turn for cleaner context and better fault tolerance. Yet applying standard RL to ParaVT reveals an obstacle we term the Tool Prior Paradox: the pretrained tool priors that enable tool exploration also destabilize cold-started structural format and expose the skip-tool reward shortcut under temperature sampling. A cross-model contrast on a weaker-prior LMM supports this claim: format stays stable but RL elicits zero tool calls, indicating that prior strength is the shared driver of both format collapse and tool exploration. We propose PARA-GRPO (Parseability-Anchored and Ratio-gAted GRPO), which augments standard RL with two complementary mechanisms: (i) a targeted format reward applied only at the structural-token positions most prone to collapse, and (ii) a per-prompt frame-budget randomization that creates training prompts where calling the tool yields a measurable reward signal over skipping it. Across six long-video understanding benchmarks, ParaVT improves over the Qwen3-VL baseline by +7.9% on average, with PARA-GRPO lifting training-time format compliance from 0.13 to 0.64. As tool capabilities become increasingly internalized in modern LMMs, RL must cooperate with the resulting priors, and ParaVT offers a general recipe for agentic RL. Code, data, and model weights are publicly available.

2605.20069 2026-05-22 cs.LG cs.GT

Smooth Partial Lotteries for Stable Randomized Selection

Alexander Goldberg, Giulia Fanti, Nihar B. Shah

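AI summary This paper proposes smoothness as a design principle for partial lotteries, defined as a Lipschitz condition on the mapping from scores to selection probabilities. It introduces the Clipped Linear Lottery mechanism, proves that it strikes a better balance between smoothness and regret, and validates it experimentally on real-world data.

Details
AI-translated abstract

Competitive selection processes, from scientific funding to admissions and hiring, use evaluations to score candidates and ultimately select a subset of them based on those scores. Recently, many organizations have adopted partial lotteries, which randomize selection based on evaluation scores. Existing lottery designs, however, are inherently unstable: a small change to a single candidate's score can cause a large shift in that candidate's selection probability. This instability undermines a key goal of lotteries, namely reducing the influence of fine-grained score distinctions near the decision boundary. We propose smoothness as a design principle for partial lotteries and formalize it as a Lipschitz condition on the mapping from scores to selection probabilities. We introduce the Clipped Linear Lottery, a simple mechanism in which selection probability varies linearly with estimated quality between an upper threshold, above which we always accept, and a lower threshold, below which we always reject. We prove that the worst-case regret of the Clipped Linear Lottery matches a lower bound for any smooth selection rule up to a factor of $(1 - k/n)$, where $k/n$ is the acceptance rate. We compare smooth selection with other stability notions such as individual fairness and differential privacy, and show that the Clipped Linear Lottery achieves a better smoothness-regret tradeoff than the alternatives. Experiments on real peer-review data from ICLR 2025, NeurIPS 2024, and the Swiss National Science Foundation show that existing lottery designs are highly unstable in practice, even under perturbations to a single score. Our experiments also confirm the tightness of our theoretical analysis and show that the Clipped Linear Lottery achieves a better smoothness-utility tradeoff than the alternatives in practice.

English abstract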

Competitive selection processes, from scientific funding to admissions and hiring, use evaluations to score candidates, and eventually choose a subset of them based on those scores. Recently, many organizations have adopted partial lotteries, which randomize selection based on evaluation scores. However, existing lottery designs are inherently unstable, as a small change to a single candidate's score can cause large shifts in their selection probabilities. This instability undermines a key goal of lotteries: reducing the influence of fine-grained score distinctions near the decision boundary. We propose smoothness as a design principle for partial lotteries, formalizing it as a Lipschitz condition on the mapping from review scores over candidates to selection probabilities. We introduce the Clipped Linear Lottery, a simple mechanism in which selection probabilities scale linearly with estimated quality between an upper threshold, above which we always accept, and a lower threshold, below which we always reject. We prove that the Clipped Linear Lottery's worst-case regret matches a lower bound for any smooth selection rule up to a factor of $(1 - k/n)$, where $k/n$ is the acceptance rate. We compare smooth selection to other stability notions like Individual Fairness and Differential Privacy, showing that the Clipped Linear Lottery achieves a better smoothness-regret tradeoff than alternatives. Experiments on real peer review data from ICLR 2025, NeurIPS 2024, and the Swiss National Science Foundation demonstrate that existing lottery designs are highly unstable in practice even under perturbations to a single score. Our experiments also confirm the tightness of our theoretical analysis and show that our proposed Clipped Linear Lottery achieves a better smoothness-utility tradeoff than alternatives in practice.
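The score-to-probability map described above can be sketched in a few lines. This is a minimal illustration of the clipped linear shape only: the function name, the thresholds, and the example scores are assumptions, and the abstract does not say how the thresholds are chosen or how exactly $k$ candidates are drawn from the resulting probabilities.

```python
def clipped_linear_probability(score: float, lower: float, upper: float) -> float:
    """Map a score to a selection probability.

    The probability is 0 at or below `lower`, 1 at or above `upper`,
    and rises linearly in between.
    """
    if upper <= lower:
        raise ValueError("upper threshold must exceed lower threshold")
    fraction = (score - lower) / (upper - lower)
    return min(1.0, max(0.0, fraction))


# Hypothetical review scores and thresholds, chosen only for illustration.
scores = [2.0, 4.0, 5.0, 6.0, 8.0]
probabilities = [clipped_linear_probability(s, lower=4.0, upper=6.0) for s in scores]
print(probabilities)  # [0.0, 0.0, 0.5, 1.0, 1.0]
```

In this sketch the map is Lipschitz with constant `1 / (upper - lower)`: moving a single score by `d` changes its probability by at most `d / (upper - lower)`, whereas a hard cutoff can flip the probability from 0 to 1 under an arbitrarily small change.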

2605.19965 2026-05-22 cs.LG eess.SP

Normative Networks for Source Separation via Local Plasticity and Dendritic Computation

Bariscan Bozkurt, Efe Ali Gorguner, Francesco Innocenti, Rafal Bogacz

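AI summary This paper proposes Predictive Entropy Maximization, a source-separation method based on local plasticity and dendritic computation. It maximizes a regularized second-order entropy over structured source domains, remains robust under increasing source correlation and observation noise, and performs well against both biologically plausible algorithms and exact baselines.

Details
AI-translated abstract

Blind source separation (BSS) is a natural framework for studying how latent causes can be recovered from sensory mixtures, but deriving online, biologically plausible algorithms for structured (that is, constrained to known domains) and potentially correlated sources remains challenging. Recent work has derived neural networks for BSS from the maximization of an entropy measure, but their online implementations involve complex, nonlocal recurrent dynamics. Motivated by this perspective, we propose Predictive Entropy Maximization, which achieves competitive BSS performance using only local weight updates. The method uses a close approximation of the entropy measure, yielding an objective function with easily interpretable components. Minimizing this objective leads to a predictive neural architecture in which feedforward synapses follow an error-driven rule (realizable through dendritic mechanisms), lateral inhibitory connections are learned with local Hebbian plasticity, and source-domain constraints are enforced through simple output nonlinearities. We derive explicit spectral bounds on the surrogate error that characterize when the approximation is accurate. Empirically, Predictive Entropy Maximization remains robust under increasing source correlation and observation noise, outperforms biologically plausible algorithms that rely on stronger independence or decorrelation assumptions, and remains competitive with exact determinant-based and correlative-information-based baselines. These results show how local plasticity and adaptive lateral inhibition can emerge from maximizing a regularized second-order entropy over structured source domains. Our implementation code is available at https://github.com/BariscanBozkurt/Predictive-Entropy-Maximization.

English abstract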

Blind source separation (BSS) is a natural framework for studying how latent causes may be recovered from sensory mixtures, but deriving online and biologically plausible algorithms for structured (i.e., constrained to known domains) and potentially correlated sources remains challenging. Recent work has derived neural networks for BSS from maximization of an entropy measure, yet its online implementations involve complex and nonlocal recurrent dynamics. Motivated by this perspective, we propose Predictive Entropy Maximization, which achieves competitive performance in BSS, using only local weight updates. The method employs a close approximation of an entropy measure, yielding an objective function with easily interpretable components. Minimizing this objective leads to a predictive neural architecture in which feedforward synapses follow an error-driven rule (that can be realized through dendritic mechanisms), lateral inhibitory connections are learned with local Hebbian plasticity, and source-domain constraints are enforced through simple output nonlinearities. We derive explicit spectral bounds on the surrogate error, characterizing when the approximation is accurate. Empirically, Predictive Entropy Maximization remains robust under increasing source correlation and observation noise, outperforms biologically plausible algorithms that rely on stronger independence or decorrelation assumptions, and remains competitive with exact determinant- and correlative-information-based baselines. These results show how local plasticity and adaptive lateral inhibition can emerge from maximizing a regularized second-order entropy over structured source domains. Our implementation code is available at https://github.com/BariscanBozkurt/Predictive-Entropy-Maximization.

2605.19578 2026-05-22 cs.CV cs.AI

Lens Privacy Sealing: A New Benchmark and Method for Physical Privacy-Preserving Action Recognition

Mengyuan Liu, Ziyi Wang, Peiming Li, Junsong Yuan

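AI summary This paper proposes Lens Privacy Sealing (LPS), a hardware solution that physically obscures the camera lens with adjustable laminating film for low-cost pre-sensor privacy protection. It introduces the P$^3$AR dataset for privacy-preserving action recognition and the MSPNet framework to handle the video degradation caused by LPS. Experiments show that MSPNet offers advantages in both action recognition accuracy and privacy protection.

Comments Accepted by IEEE Transactions on Image Processing (TIP), 2026

Details
AI-translated abstract

RGB camera-based surveillance systems enable human action recognition for public safety and healthcare but raise serious privacy concerns. Existing methods rely on post-capture algorithms, which cannot protect privacy during data acquisition. We propose Lens Privacy Sealing (LPS), a simple hardware solution that physically obscures the camera lens with adjustable laminating film, providing pre-sensor privacy protection at minimal cost. Unlike software methods or expensive engineered optics, LPS achieves strong privacy through stochastic multi-layer scattering that is physically irreversible. We introduce the P$^3$AR dataset for privacy-preserving action recognition, which comprises a large-scale replay-captured subset (P$^3$AR-NTU, 114K videos) and a real-world collected subset (P$^3$AR-PKU), both with privacy attribute annotations. To handle the video degradation caused by LPS, we propose MSPNet, a single-stage framework that combines an Inter-Frame Noise Suppressor (IFNS) and a Cross-Frame Semantic Aggregator (CFSA) and uses contrastive language-image pre-training for robust semantic extraction. Extensive experiments show that MSPNet with IFNS and CFSA nearly doubles action recognition accuracy relative to baseline methods while suppressing identity recognition to low levels. Comprehensive validation shows that LPS achieves a better privacy-utility trade-off than state-of-the-art hardware methods, resists reconstruction attacks including PSF inversion and data-driven recovery, and generalizes well across optical configurations and challenging environments. Code is available at https://github.com/wangzy01/MSPNet.

English abstract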

RGB camera-based surveillance systems enable human action recognition for public safety and healthcare, yet raise serious privacy concerns. Existing methods rely on post-capture algorithms, which fail to protect privacy during data acquisition. We propose Lens Privacy Sealing (LPS), a simple hardware solution that physically obscures camera lenses with adjustable laminating film, providing pre-sensor privacy protection at minimal cost. Unlike software methods or expensive engineered optics, LPS achieves strong privacy through stochastic multi-layer scattering that is physically irreversible. We introduce the P$^3$AR dataset for privacy-preserving action recognition, featuring both large-scale replay-captured (P$^3$AR-NTU, 114K videos) and real-world collected (P$^3$AR-PKU) subsets with privacy attribute annotations. To handle video degradation from LPS, we propose MSPNet, a single-stage framework incorporating Inter-Frame Noise Suppressor (IFNS) and Cross-Frame Semantic Aggregator (CFSA), enhanced by contrastive language-image pre-training for robust semantic extraction. Extensive experiments demonstrate that MSPNet with IFNS and CFSA nearly doubles action recognition accuracy compared to baseline methods while suppressing identity recognition to low levels. Comprehensive validation shows LPS achieves a superior privacy-utility trade-off compared to state-of-the-art hardware methods, resists reconstruction attacks including PSF inversion and data-driven recovery, and generalizes robustly across optical configurations and challenging environments. Code is available at https://github.com/wangzy01/MSPNet.

2605.19329 2026-05-22 cs.CV cs.AI

RE-VLM: Event-Augmented Vision-Language Model for Scene Understanding

Hanqing Liu, Mingjie Liu, Luoping Cui, Endian Lin, Donghong Jiang, Chuang Zhu

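AI summary This paper proposes RE-VLM, a dual-stream vision-language model that combines RGB images and event streams to improve scene understanding under both normal and adverse conditions. Using the high-temporal-resolution, wide-dynamic-range data provided by event cameras, RE-VLM outperforms existing models on scene captioning and visual question answering.

Comments 10 pages, 6 figures, 6 tables

Details
AI-translated abstract

Conventional vision-language models (VLMs) struggle to interpret scenes captured under adverse conditions (such as low light, high dynamic range, or fast motion) because standard RGB images degrade in these environments. Event cameras provide a complementary modality: they asynchronously record per-pixel brightness changes with high temporal resolution and wide dynamic range, preserving motion cues where frames fail. We propose RE-VLM, the first dual-stream vision-language model that jointly uses RGB images and event streams for robust scene understanding under both normal and challenging conditions. RE-VLM employs parallel RGB and event encoders together with a progressive training strategy that aligns heterogeneous visual features with language. To address the scarcity of RGB-Event-Text supervision, we further propose a graph-driven pipeline that converts synchronized RGB-Event streams into verifiable scene graphs, from which we synthesize captions and question-answer pairs. To develop and evaluate RE-VLM, we construct two datasets: PEOD-Chat, which targets illumination-challenged scenes, and RGBE-Chat, which covers diverse scenarios. On captioning and VQA benchmarks, RE-VLM consistently outperforms state-of-the-art RGB-only and event-only models with comparable parameter counts, with particularly large gains under challenging conditions. These results demonstrate the effectiveness of event-augmented VLMs for robust vision-language understanding across a wide range of real-world environments.

English abstract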

Conventional vision-language models (VLMs) struggle to interpret scenes captured under adverse conditions (e.g., low light, high dynamic range, or fast motion) because standard RGB images degrade in such environments. Event cameras provide a complementary modality: they asynchronously record per-pixel brightness changes with high temporal resolution and wide dynamic range, preserving motion cues where frames fail. We propose RE-VLM, the first dual-stream vision-language model that jointly leverages RGB images and event streams for robust scene understanding across both normal and challenging conditions. RE-VLM employs parallel RGB and event encoders together with a progressive training strategy that aligns heterogeneous visual features with language. To address the scarcity of RGB-Event-Text supervision, we further propose a graph-driven pipeline that converts synchronized RGB-Event streams into verifiable scene graphs, from which we synthesize captions and question-answer (QA) pairs. To develop and evaluate RE-VLM, we construct two datasets: PEOD-Chat, targeting illumination-challenged scenes, and RGBE-Chat, covering diverse scenarios. On captioning and VQA benchmarks, RE-VLM consistently outperforms state-of-the-art RGB-only and event-only models with comparable parameter counts, with particularly large gains under challenging conditions. These results demonstrate the effectiveness of event-augmented VLMs in achieving robust vision-language understanding across a wide range of real-world environments.

2605.18507 2026-05-22 cs.CV

Weakly Supervised Cross-Modal Learning for 4D Radar Scene Flow Estimation

Jingyun Fu, Zhiyu Xiang, Na Zhao

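AI summary This paper proposes a weakly supervised framework for learning radar scene flow that uses images and odometry as auxiliary supervision, combining instance-aware self-supervised losses with a rigid loss for static regions to achieve more efficient scene flow estimation.

Comments Accepted by ICML2026

Details
AI-translated abstract

Because ground-truth data for 4D radar scene flow is difficult to obtain, previous methods typically rely on either self-supervised losses or cross-modal supervision from 3D LiDAR data, 2D images, and odometry. Self-supervised methods, however, often yield suboptimal results because of radar's inherently low-fidelity measurements, while existing cross-modal supervised methods introduce complex multi-task architectures and require costly LiDAR sensors to generate pseudo radar scene flow labels from pretrained 3D tracking models. To overcome these limitations, we propose a task-specific iterative framework that uses only images and odometry as auxiliary supervision during training. Specifically, we use off-the-shelf 2D tracking and segmentation algorithms to obtain tracked instance masks and project them into 3D space to provide instance-level semantic guidance; for static regions, we combine vehicle odometry with radar's intrinsic motion cues to construct a rigid static loss. Extensive experiments on the real-world View-of-Delft (VoD) dataset show that our method not only surpasses state-of-the-art cross-modal supervised methods that rely on dense LiDAR point clouds but also outperforms existing fully supervised scene flow estimation methods. The code is open-sourced at https://github.com/FuJingyun/IterFlow.

English abstract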

Due to the difficulty of obtaining ground-truth data for 4D radar scene flow estimation, previous methods typically rely on either self-supervised losses or cross-modal supervision using 3D LiDAR data, 2D images, and odometry. However, self-supervised approaches often yield suboptimal results due to radar's inherently low-fidelity measurements, while existing cross-modal supervised methods introduce complex multi-task architectures and require costly LiDAR sensors to generate pseudo radar scene flow labels from pretrained 3D tracking models. To overcome these limitations, we propose a task-specific iterative framework for weakly supervised radar scene flow learning, using only images and odometry for auxiliary supervision during training. Specifically, we establish two novel instance-aware self-supervised losses by exploiting off-the-shelf 2D tracking and segmentation algorithms to obtain tracked instance masks, which are back-projected into 3D space to provide instance-level semantic guidance; for static regions, we integrate vehicle odometry with radar's intrinsic motion cues to construct a rigid static loss. Extensive experiments on the real-world View-of-Delft (VoD) dataset demonstrate that our method not only surpasses state-of-the-art cross-modal supervised approaches that rely on 3D multi-object tracking on dense LiDAR point clouds but also outperforms existing fully supervised scene flow estimation methods. The code is open-sourced at https://github.com/FuJingyun/IterFlow.

2605.18047 2026-05-22 cs.RO

FUSE: A Framework for Unified State Estimation in Vehicular and Robotic SLAM Systems

Wei Wu, Honglin Chen, Wenhan Cao, Yao Lyu, Shaobing Xu, Kun Jiang, Jiangtao Li, Tao Zhang, Lei Guo, Shengbo Eben Li

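AI summary This paper proposes FUSE, a framework for unified state estimation in vehicular and robotic SLAM systems. By separating temporal processing, local geometric association, estimator formulation, and map-update policy, it makes state-estimation design more flexible and accurate.

Details
AI-translated abstract

Under mixed-rate sensing, tightly coupled SLAM formulations often bind temporal processing, local geometric association, estimator formulation, and map-update policy into method-specific designs. This binding makes it difficult to change one design choice without re-engineering the rest of the state-estimation process. This paper presents FUSE, a framework for unified state estimation in vehicular and robotic SLAM systems. FUSE organizes the state-estimation interface around observation ingestion, propagation, update, and state query, and uses this interface to separate temporal processing, residual-ready local geometric association, estimator formulation, and map-update policy. A LiDAR-IMU instantiation is developed to examine the framework under mixed-rate sensing and directional degeneracy, where high-rate inertial propagation, LiDAR-triggered geometric update, residual screening, and degeneracy-aware correction operate through the same interface boundaries. On a 418 m loop-corridor sequence, the instantiation reports an end-to-end trajectory error of 1.626 m, a 7.9% relative error reduction compared with Faster-LIO. The results support FUSE as a framework for organizing state-estimation design choices and show how the evaluated instantiation regularizes updates along weakly observable directions.

English abstract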

Tightly coupled SLAM formulations under mixed-rate sensing often bind temporal processing, local geometric association, estimator formulation, and map-update policy into method-specific designs. Such binding makes it difficult to vary one design choice without re-engineering the rest of the state-estimation process. This paper presents FUSE, a framework for unified state estimation in vehicular and robotic SLAM systems. FUSE organizes the state-estimation interface around observation ingestion, propagation, update, and state query, and uses this interface to separate temporal processing, residual-ready local geometric association, estimator formulation, and map-update policy. A LiDAR-IMU instantiation is developed to examine the framework under mixed-rate sensing and directional degeneracy, where high-rate inertial propagation, LiDAR-triggered geometric update, residual screening, and degeneracy-aware correction operate through the same interface boundaries. On a 418 m loop-corridor sequence, the instantiation reports a 1.626 m end-to-end trajectory error, corresponding to a 7.9% relative error reduction compared with Faster-LIO, the lowest-error baseline on this sequence. The results support FUSE as a framework for organizing state-estimation design choices and show how the evaluated instantiation regularizes updates along weakly observable directions.
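The four-operation interface named in the abstract (observation ingestion, propagation, update, state query) can be sketched as an abstract base class. This is only a toy under stated assumptions: the class and method names, the `Observation` record, and the one-dimensional estimator are hypothetical and are not taken from the FUSE code. The high-rate sensor is simplified to a direct velocity reading and the low-rate sensor to a position fix.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Observation:
    timestamp: float
    kind: str      # hypothetical labels: "imu" (high rate) or "lidar" (low rate)
    value: float


class StateEstimator(ABC):
    """Interface built around ingestion, propagation, update, and query."""

    @abstractmethod
    def ingest(self, obs: Observation) -> None: ...

    @abstractmethod
    def propagate(self, timestamp: float) -> None: ...

    @abstractmethod
    def update(self, obs: Observation) -> None: ...

    @abstractmethod
    def query(self) -> float: ...


class Toy1DEstimator(StateEstimator):
    """Toy 1D position estimator; not the paper's LiDAR-IMU instantiation."""

    def __init__(self, gain: float = 0.5) -> None:
        self.position = 0.0
        self.velocity = 0.0
        self.time = 0.0
        self.gain = gain

    def ingest(self, obs: Observation) -> None:
        # Temporal processing: always bring the state to the observation time.
        self.propagate(obs.timestamp)
        if obs.kind == "imu":
            self.velocity = obs.value  # simplification: reading taken as velocity
        else:
            self.update(obs)

    def propagate(self, timestamp: float) -> None:
        self.position += self.velocity * (timestamp - self.time)
        self.time = timestamp

    def update(self, obs: Observation) -> None:
        # Fixed-gain correction toward the measured position.
        self.position += self.gain * (obs.value - self.position)

    def query(self) -> float:
        return self.position


estimator = Toy1DEstimator()
for obs in [Observation(0.0, "imu", 1.0), Observation(1.0, "imu", 1.0),
            Observation(2.0, "lidar", 3.0)]:
    estimator.ingest(obs)
print(estimator.query())  # 2.5
```

Because callers only touch the four methods, the fixed-gain `update` could be swapped for a different estimator formulation without changing how observations are ingested or how the state is queried, which is the kind of separation the abstract describes.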

2605.17950 2026-05-22 cs.RO cs.SY eess.SY

Active Defense Against False Data Injection Attacks in Robotic Manipulators

Gabriele Gualandi, Carl Mikael Larsson, Alessandro V. Papadopoulos

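AI summary This paper proposes two defense methods, anomaly-aware virtual damping and manipulability reduction, to improve the ability of robotic manipulators to withstand finite-horizon false data injection attacks, and validates their effectiveness in simulation.

Comments Extended 8-page version containing full proofs. An abridged 6-page version has been accepted for publication in the Proceedings of the 23rd IFAC World Congress (2026). v3: Minor typographical fixes and updated reference formatting

Details
AI-translated abstract

Robotic systems are vulnerable to false data injection attacks (FDIAs), in which an adversary corrupts sensor signals to gain malicious control. Feedback linearization exposes robotic systems to an integrator vulnerability, making them susceptible to stealthy attacks that can cause significant deviations in end-effector behavior without raising alarms. This paper addresses the resilience of manipulators against finite-horizon FDIAs by formalizing two defense methods, anomaly-aware virtual damping and manipulability reduction, with probabilistic guarantees on nominal task execution. Simulations on a 7-DOF redundant manipulator show that the proposed defenses substantially reduce the impact of FDIAs compared with using only a threshold-based ADS such as the chi-squared detector, while preserving nominal task performance in the absence of attack.

English abstract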

Robotic systems are vulnerable to False Data Injection Attacks (FDIAs), where adversaries corrupt sensor signals to gain malicious control. Feedback linearization exposes robotic systems to integrator vulnerability, making them susceptible to stealthy attacks that can cause significant deviations in end-effector behavior without raising alarms. This paper addresses the resilience of manipulators against finite-horizon FDIAs by formalizing two defense methods, namely anomaly-aware virtual damping and manipulability reduction, with probabilistic guarantees on nominal task execution. Simulations on a 7-DOF redundant manipulator show that the proposed defenses substantially reduce the impact of FDIAs compared with using solely a threshold-based ADS such as the chi-squared detector, while preserving nominal task performance in the absence of attack.
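For readers unfamiliar with the threshold-based baseline, the sketch below shows a generic chi-squared residual test and why a stealthy attack can slip past it. It is a textbook-style illustration, not the paper's detector: the function names, the diagonal-covariance simplification, and the residual values are assumptions. The threshold 5.991 is the 95% critical value of the chi-squared distribution with 2 degrees of freedom.

```python
def chi_squared_statistic(residual, variances):
    """Sum of squared residuals, each normalized by its variance.

    Assumes independent residual components (diagonal covariance).
    """
    return sum(r * r / v for r, v in zip(residual, variances))


def raises_alarm(residual, variances, threshold):
    """Flag an anomaly when the statistic exceeds the threshold."""
    return chi_squared_statistic(residual, variances) > threshold


THRESHOLD_2DOF_95 = 5.991  # 95% critical value, chi-squared with 2 degrees of freedom

variances = [0.01, 0.04]       # hypothetical sensor noise variances
nominal = [0.1, -0.2]          # statistic about 2.0, below the threshold
blatant_attack = [0.3, 0.4]    # statistic about 13.0, detected
stealthy_attack = [0.2, 0.2]   # statistic about 5.0, passes undetected

for name, residual in [("nominal", nominal), ("blatant", blatant_attack),
                       ("stealthy", stealthy_attack)]:
    print(name, raises_alarm(residual, variances, THRESHOLD_2DOF_95))
```

The stealthy case is the gap the abstract points to: a bias that stays under the threshold at every step never triggers the alarm, yet it can accumulate through the integrator, which is why the paper pairs detection with active defenses.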

2605.16984 2026-05-22 cs.CL

Closing the Gap at CRAC 2026: Two-Stage Adaptation for LLM-Based Multilingual Coreference Resolution

Antoine Bourgois, Olga Seminck, Thierry Poibeau

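AI summary This paper presents an LLM-based approach to multilingual coreference resolution. Using a two-stage adaptation strategy, it achieves an average CoNLL F1 score of 74.32 in the CRAC 2026 shared task and ranks first in the LLM track.

Details
AI-translated abstract

We present our submission to the LLM track of the 2026 Computational Models of Reference, Anaphora and Coreference (CRAC 2026) shared task. With an average CoNLL F1 score of 74.32 on the official test set, our system ranked first in the LLM track and third overall. The system is based on the Gemma-3-27b model, fine-tuned with a two-stage strategy: a multilingual base adapter followed by dataset-specific adapters. We represent mention spans by their headword in an XML-inspired format with local reindexing and annotate documents iteratively. These design choices proved effective across languages, document lengths, and annotation guidelines.

English abstract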

We present our submission to the LLM track of the 2026 Computational Models of Reference, Anaphora and Coreference (CRAC 2026) shared task. With an average CoNLL F1 score of 74.32 on the official test set, our system ranked first in the LLM track, and third overall. Our system is based on the Gemma-3-27b model, fine-tuned using a two-stage strategy with a multilingual base adapter followed by dataset-specific adapters. We represent mention spans by their headword using an XML-inspired format with local reindexing and annotate documents iteratively. These design choices proved effective across languages, document lengths, and annotation guidelines.

2605.16545 2026-05-22 cs.LG cs.AI cs.CL

Symphony for Speech-to-Text: Supporting Real-Time Medical Voice Interfaces

Arne Nix, Robert James, Lasse Borgholt, Anna B. Ekner, Lana Krumm, Julius Severin, Dan Engel, Lars Maaløe, Jakob Havtorn

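AI summary This paper presents Symphony for Speech-to-Text, a medical-grade real-time speech recognition system. It decomposes transcription into specialized components for recognition, formatting, and contextual correction to optimize medical term recall and produce clinically structured text in real time. It substantially outperforms existing systems in medical settings while remaining on par in general-domain settings.

Comments Updated with a correction and improvement to Symphony's performance in spoken punctuation evaluation (R_punct, P_punct)

Details
AI-translated abstract

After decades of use in dictation and, more recently, ambient documentation, speech is emerging as a primary way of interacting with technology and AI, and healthcare is no exception. Medical speech recognition nevertheless remains difficult: systems must capture specialized terminology, resolve contextual ambiguity, and render measurements, abbreviations, and clinical shorthand precisely. Existing solutions are typically optimized either for general-purpose transcription or for narrow dictation workflows, which limits their reliability in safety-critical settings and their usefulness for broader clinical workflows. We introduce Symphony for Speech-to-Text, a medical-grade speech recognition system for real-time streaming and batch file-based clinical use. Symphony decomposes the transcription process into specialized components for recognition, formatting, and contextual correction, optimizing medical term recall while producing clinically structured text in real time and adapting across use cases. Evaluations on public benchmarks and medical speech datasets show that Symphony substantially outperforms state-of-the-art systems in clinical settings while matching or exceeding them in general-domain settings, suggesting robust generalization rather than overfitting. We release a clinical benchmark dataset to support reliable validation and further progress in medical speech recognition. Symphony is available through a production-grade API for live dictation, conversational transcription, and batch audio file processing.

English abstract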

After decades of use in dictation and, more recently, ambient documentation, speech is emerging as a primary modality for interacting with technology and AI in healthcare. Yet medical speech recognition remains difficult: systems must capture specialized terminology, resolve contextual ambiguity, and render measurements, abbreviations, and clinical shorthand precisely. Existing solutions are typically optimized either for general-purpose transcription or narrow dictation workflows, limiting their reliability in safety-critical settings and their usefulness for broader clinical workflows. We introduce Symphony for Speech-to-Text, a medical-grade speech recognition system for real-time streaming and batch file-based clinical use. Symphony decomposes the transcription process into specialized components for recognition, formatting, and contextual correction to optimize medical term recall while producing clinically structured text in real time and adapting across use cases. Evaluations on public benchmark and medical speech datasets show that Symphony substantially outperforms state-of-the-art systems in clinical settings while matching or exceeding them in general-domain settings, suggesting robust generalization rather than overfitting. We release a clinical benchmark dataset to support reliable validation and further progress in medical speech recognition. Symphony is available through a production-grade API for live dictation, conversational transcription, and batch audio file processing.

2605.15153 2026-05-22 cs.RO cs.AI

Pelican-Unify 1.0: A Unified Embodied Intelligence Model for Understanding, Reasoning, Imagination and Action

Yi Zhang, Yinda Chen, Che Liu, Zeyuan Ding, Jin Xu, Shilong Zou, Junwei Liao, Jiayu Hu, Xiancong Ren, Xiaopeng Zhang, Yechi Liu, Haoyuan Shi, Zecong Tang, Haosong Sun, Renwen Cui, Kuishu Wu, Wenhai Liu, Yang Xu, Yingji Zhang, Yidong Wang, Senkang Hu, Jinpeng Lu, Nga Teng Chan, Yechen Wu, Zeting Liu, Xianzhou Hou, Yong Dai, Jian Tang, Xiaozhu Ju

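AI summary This paper presents Pelican-Unify 1.0, the first embodied foundation model trained according to the principle of unification. A single vision-language model serves as the unified understanding module, mapping scenes, instructions, visual contexts, and action histories into a shared semantic space, and as the unified reasoning module, producing task-, action-, and future-oriented chains of thought. The final hidden state is projected into a dense latent variable, from which a Unified Future Generator produces future videos and actions.

Details
AI-translated abstract

We present Pelican-Unify 1.0, the first embodied foundation model trained according to the principle of unification. Pelican-Unify 1.0 uses a single vision-language model (VLM) as a unified understanding module, mapping scenes, instructions, visual contexts, and action histories into a shared semantic space. The same VLM also serves as a unified reasoning module, autoregressively producing task-, action-, and future-oriented chains of thought in a single forward pass and projecting the final hidden state into a dense latent variable. A Unified Future Generator (UFG) then conditions on this latent variable and jointly generates future videos and future actions through two modality-specific output heads within the same denoising process. The language, video, and action losses are all backpropagated into the shared representation, so the model jointly optimizes understanding, reasoning, imagination, and action during training instead of training three isolated expert systems. Experiments show that unification does not imply compromise. With a single checkpoint, Pelican-Unify 1.0 performs strongly on all three capabilities: 64.7 on eight VLM benchmarks, the best among comparable-scale models; 66.03 on WorldArena, ranking first; and 93.5 on RoboTwin, the second-best average among the compared action methods. These results show that the unified paradigm preserves specialist strength while bringing understanding, reasoning, imagination, and action into one model.

English abstract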

We present Pelican-Unify 1.0, the first embodied foundation model trained according to the principle of unification. Pelican-Unify 1.0 uses a single VLM as a unified understanding module, mapping scenes, instructions, visual contexts, and action histories into a shared semantic space. The same VLM also serves as a unified reasoning module, autoregressively producing task-, action-, and future-oriented chains of thought in a single forward pass and projecting the final hidden state into a dense latent variable. A Unified Future Generator (UFG) then conditions on this latent variable and jointly generates future videos and future actions through two modality-specific output heads within the same denoising process. The language, video, and action losses are all backpropagated into the shared representation, enabling the model to jointly optimize understanding, reasoning, imagination, and action during training, rather than training three isolated expert systems. Experiments demonstrate that unification does not imply compromise. With a single checkpoint, Pelican-Unify 1.0 achieves strong performance across all three capabilities: 64.7 on eight VLM benchmarks, the best among comparable-scale models; 66.03 on WorldArena, ranking first; and 93.5 on RoboTwin, the second-best average among compared action methods. These results show that the unified paradigm succeeds in preserving specialist strength while bringing understanding, reasoning, imagination, and action into one model.

2605.15040 2026-05-22 cs.AI cs.CL

Orchard: An Open-Source Agentic Modeling Framework

Baolin Peng, Wenlin Yao, Qianhui Wu, Hao Cheng, Xiao Yu, Rui Yang, Tao Ge, Alessandro Sordoni, Xingdi Yuan, Yelong Shen, Pengcheng He, Tong Zhang, Zhou Yu, Jianfeng Gao

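AI summary This paper presents Orchard, an open-source agentic modeling framework. With a lightweight environment service and three agentic modeling recipes, it enables reusable agent data, training, and evaluation across domains.

Details
AI-translated abstract

Agentic modeling aims to turn large language models into autonomous agents that can solve complex tasks through planning, reasoning, tool use, and multi-turn interaction with environments. Despite major investment, open research remains constrained by infrastructure and training gaps. Many high-performing systems rely on proprietary codebases, models, or services, while most open-source frameworks focus on orchestration and evaluation rather than scalable agent training. We present Orchard, an open-source framework for scalable agentic modeling. At its core is Orchard Env, a lightweight environment service that provides reusable primitives for sandbox lifecycle management across task domains, agent harnesses, and pipeline stages. On top of Orchard Env, we build three agentic modeling recipes. Orchard-SWE targets coding agents: it distills 107K trajectories from MiniMax-M2.5 and Qwen3.5-397B, introduces credit-assignment SFT to learn from the productive segments of unresolved trajectories, and applies Balanced Adaptive Rollout for reinforcement learning. Starting from Qwen3-30B-A3B-Thinking, Orchard-SWE reaches 64.3% on SWE-bench Verified after SFT and 67.5% after SFT+RL, setting a new state of the art among open-source models of comparable size. Orchard-GUI trains a 4B vision-language computer-use agent with only 0.4K distilled trajectories and 2.2K open-ended tasks, reaching success rates of 74.1%, 67.0%, and 64.0% on WebVoyager, Online-Mind2Web, and DeepShop, respectively, which makes it the strongest open-source model while remaining competitive with proprietary systems. Orchard-Claw targets personal assistant agents: trained with only 0.2K synthetic tasks, it reaches 59.6% pass@3 on Claw-Eval and 73.9% when paired with the stronger ZeroClaw harness. Collectively, these results show that a lightweight, open, harness-agnostic environment layer enables reusable agentic data, training recipes, and evaluations across domains.

English abstract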

Agentic modeling aims to transform LLMs into autonomous agents capable of solving complex tasks through planning, reasoning, tool use, and multi-turn interaction with environments. Despite major investment, open research remains constrained by infrastructure and training gaps. Many high-performing systems rely on proprietary codebases, models, or services, while most open-source frameworks focus on orchestration and evaluation rather than scalable agent training. We present Orchard, an open-source framework for scalable agentic modeling. At its core is Orchard Env, a lightweight environment service providing reusable primitives for sandbox lifecycle management across task domains, agent harnesses, and pipeline stages. On top of Orchard Env, we build three agentic modeling recipes. Orchard-SWE targets coding agents. We distill 107K trajectories from MiniMax-M2.5 and Qwen3.5-397B, introduce credit-assignment SFT to learn from productive segments of unresolved trajectories, and apply Balanced Adaptive Rollout for RL. Starting from Qwen3-30B-A3B-Thinking, Orchard-SWE achieves 64.3% on SWE-bench Verified after SFT and 67.5% after SFT+RL, setting a new state of the art among open-source models of comparable size. Orchard-GUI trains a 4B vision-language computer-use agent using only 0.4K distilled trajectories and 2.2K open-ended tasks. It achieves 74.1%, 67.0%, and 64.0% success rates on WebVoyager, Online-Mind2Web, and DeepShop, respectively, making it the strongest open-source model while remaining competitive with proprietary systems. Orchard-Claw targets personal assistant agents. Trained with only 0.2K synthetic tasks, it achieves 59.6% pass@3 on Claw-Eval and 73.9% when paired with a stronger ZeroClaw harness. Collectively, these results show that a lightweight, open, harness-agnostic environment layer enables reusable agentic data, training recipes, and evaluations across domains.

2605.14926 2026-05-22 cs.CV

SCRWKV: Ultra-Compact Structure-Calibrated Vision-RWKV for Topological Crack Segmentation

Hanxu Zhang, Chen Jia, Hui Liu, Xu Cheng, Fan Shi, Shengyong Chen

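AI summary This paper proposes SCRWKV, an efficient network that uses a novel Structure-Field Encoder and a lightweight decoder for high-precision topological segmentation of structural cracks. It performs strongly on multiple benchmarks with complex textures and severe interference, reaching an F1 score of 84.28% and an mIoU of 85.12% with only 1.22M parameters.

Comments Accepted by ICML 2026

Details
AI-translated abstract

Achieving pixel-level accurate segmentation of structural cracks across diverse scenarios remains a formidable challenge. Existing methods face significant bottlenecks in balancing crack topology modeling against computational efficiency and often fail to reconcile high segmentation quality with low resource demands. To address these limitations, we propose the Ultra-Compact Structure-Calibrated Vision RWKV (SCRWKV), a network that achieves high-precision modeling through a novel Structure-Field Encoder (SFE) backbone while maintaining linear complexity. The SFE integrates the Adaptive Multi-scale Cascaded Modulator (AMCM) to enhance texture representation and uses the Structure-Calibrated Insight Unit (SCIU) as its core engine. Specifically, the SCIU employs the Geometry-guided Bidirectional Structure Transformation (GBST) to capture topological correlations and integrates the Dynamic Self-Calibrating Decay (DSCD) into Dy-WKV to suppress noise propagation. In addition, we introduce a lightweight Cross-Scale Harmonic Fusion (CSHF) decoder for precise feature aggregation. Systematic evaluations on multiple benchmarks with complex textures and severe interference show that SCRWKV, with only 1.22M parameters, significantly outperforms state-of-the-art methods. On the TUT dataset, the model reaches an F1 score of 84.28% and an mIoU of 85.12%, confirming its potential for efficient real-world deployment. The code is available at https://github.com/zhxhzy/SCRWKV.

English abstract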

Achieving pixel-level accurate segmentation of structural cracks across diverse scenarios remains a formidable challenge. Existing methods face significant bottlenecks in balancing crack topology modeling with computational efficiency, often failing to reconcile high segmentation quality with low resource demands. To address these limitations, we propose the Ultra-Compact Structure-Calibrated Vision RWKV (SCRWKV), a network that achieves high-precision modeling via a novel Structure-Field Encoder (SFE) backbone while maintaining linear complexity. The SFE integrates the Adaptive Multi-scale Cascaded Modulator (AMCM) to enhance texture representation and utilizes the Structure-Calibrated Insight Unit (SCIU) as its core engine. Specifically, the SCIU employs the Geometry-guided Bidirectional Structure Transformation (GBST) to capture topological correlations and integrates the Dynamic Self-Calibrating Decay (DSCD) into Dy-WKV to suppress noise propagation. Furthermore, we introduce a lightweight Cross-Scale Harmonic Fusion (CSHF) decoder to achieve precise feature aggregation. Systematic evaluations on multiple benchmarks characterized by complex textures and severe interference demonstrate that SCRWKV, with only 1.22M parameters, significantly outperforms SOTA methods. Achieving an F1 score of 0.8428 and mIoU of 0.8512 on the TUT dataset, the model confirms its robust potential for efficient real-world deployment. The code is available at https://github.com/zhxhzy/SCRWKV.
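As a reminder of what the two reported metrics measure, the sketch below computes F1 and mean IoU for a binary crack mask. It assumes that mIoU is averaged over the crack and background classes, which is a common convention but is not stated in the abstract; the function names and the tiny flattened masks are illustrative only.

```python
def confusion_counts(pred, target):
    """Count TP, FP, FN, TN for flattened binary masks (1 = crack)."""
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 0)
    return tp, fp, fn, tn


def f1_score(pred, target):
    """F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the denominator is 0."""
    tp, fp, fn, _ = confusion_counts(pred, target)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mean_iou(pred, target):
    """Average IoU over the crack and background classes (assumed convention)."""
    tp, fp, fn, tn = confusion_counts(pred, target)
    crack_denom = tp + fp + fn
    background_denom = tn + fp + fn
    crack_iou = tp / crack_denom if crack_denom else 0.0
    background_iou = tn / background_denom if background_denom else 0.0
    return (crack_iou + background_iou) / 2


# Tiny flattened masks, invented for illustration.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
print(round(f1_score(pred, target), 4), mean_iou(pred, target))  # 0.6667 0.5
```

Cracks usually occupy a small fraction of the pixels, so the background IoU tends to be high; this is one reason mIoU can exceed F1 on the same predictions, as in the numbers reported above.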

2605.14257 2026-05-22 cs.CL

Sakura at BEA 2026 Shared Task 1: What Makes Vocabulary Difficult?

Adam Nohejl, Xuanxin Wu, Yusuke Ide, Maria Angelica Riera Machin, Yi-Ning Chang, Hitomi Yanaka

AI总结 本文提出两种模型用于预测词汇难度:一种高精度的黑盒模型,在公开赛道取得最佳成绩,另一种可解释模型,优于微调编码器基线。黑盒模型通过软目标损失函数微调LLM,在评分任务中达到r>0.91的精度,而可解释模型在保持强相关性(r>0.77)的同时,揭示了影响每个项目难度的因素。进一步分析显示,英国理事会知识型词汇列表(KVL)中词汇难度常受拼写难度或测试项目构造影响,而不仅仅是词汇本身的生产难度。

Comments To be published in Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026)

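Details
AI Chinese abstract (translated)

We describe two types of models for vocabulary difficulty prediction: a high-accuracy black-box model, which achieved the top shared task result in the open track, and an explainable model, which outperforms a fine-tuned encoder baseline. As the black-box model, we fine-tuned an LLM with a soft-target loss function so that it applies effectively to the rating task, achieving r > 0.91. The explainable model provides insights into the factors that affect the difficulty of each item while maintaining a strong correlation (r > 0.77). We further analyze the results, demonstrating that the difficulty of items in the British Council's Knowledge-based Vocabulary Lists (KVL) is often affected by spelling difficulty or the construction of the test items, and not only by the production difficulty of the words themselves. Our code is available online at https://github.com/ynklab/vocabulary-difficulty.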

English abstract

We describe two types of models for vocabulary difficulty prediction: a high-accuracy black-box model, which achieved the top shared task result in the open track, and an explainable model, which outperforms a fine-tuned encoder baseline. As the black-box model, we fine-tuned an LLM using a soft-target loss function for effective application to the rating task, achieving r > 0.91. The explainable model provides insights into what impacts the difficulty of each item while maintaining a strong correlation (r > 0.77). We further analyze the results, demonstrating that the difficulty of items in the British Council's Knowledge-based Vocabulary Lists (KVL) is often affected by spelling difficulty or the construction of the test items, in addition to the genuine production difficulty of the words. We make our code available online at https://github.com/ynklab/vocabulary-difficulty .
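The abstract does not define its soft-target loss, so the following is only one common reading, offered as a sketch: a continuous difficulty rating is spread over the two nearest integer labels, and the model's label distribution is trained with cross-entropy against that soft target. The 1 to 5 label scale, the function names, and the interpolation scheme are all assumptions, not details from the paper.

```python
import math


def soft_targets(rating, labels):
    """Spread a continuous rating over its two nearest labels (consecutive integers assumed)."""
    lo, hi = labels[0], labels[-1]
    rating = min(max(rating, lo), hi)          # clamp to the label range
    idx = rating - lo
    i = min(int(math.floor(idx)), len(labels) - 2)
    frac = idx - i
    target = [0.0] * len(labels)
    target[i] = 1.0 - frac
    target[i + 1] = frac
    return target


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_cross_entropy(logits, target):
    """Cross-entropy between a soft target distribution and softmax(logits)."""
    probs = softmax(logits)
    return -sum(q * math.log(p) for q, p in zip(target, probs) if q > 0)


def expected_rating(logits, labels):
    """Decode a continuous rating as the probability-weighted mean label."""
    return sum(k * p for k, p in zip(labels, softmax(logits)))


LABELS = [1, 2, 3, 4, 5]                       # hypothetical rating scale
target = soft_targets(2.25, LABELS)            # weight 0.75 on label 2, 0.25 on label 3
print(target, soft_cross_entropy([0.0] * 5, target))
```

A soft target of this kind keeps the ordinal information that a one-hot label would discard, which is one plausible reason to prefer it for a rating task evaluated by correlation.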

2605.13989 2026-05-22 cs.CL

VectraYX-Nano: A 42M-Parameter Spanish Cybersecurity Language Model with Curriculum Learning and Native Tool Use

VectraYX-Nano: A 42-Million-Parameter Spanish Cybersecurity Language Model with Curriculum Learning and Native Tool Use

Juan S. Santillana

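AI summary This paper presents VectraYX-Nano, a Spanish-language cybersecurity language model, and demonstrates its application and improvements in the cybersecurity domain through curriculum learning and native tool invocation.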

Comments 24 pages, 5 figures, 12 tables. v3: post-Chinchilla compute ablation (v8-v15), Globant affiliation finalized, EMNLP Findings 2026 submission. Released model: VectraYX-Nano v7 (42M params, GGUF Q4 ~20 MB, native MCP)

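Details
AI Chinese abstract (translated)

We present VectraYX-Nano, a 41.95M-parameter decoder-only language model trained from scratch for the Spanish-language cybersecurity domain, with a Latin-American regional focus and native tool invocation via the Model Context Protocol (MCP). The model has four contributions. (i) Corpus: VectraYX-Sec-ES, a 170M-token Spanish corpus built by an eight-VM distributed pipeline at a cost of about 25 USD of cloud compute and split into three curriculum phases (conversational 42M, cybersecurity 118M, offensive tooling 10M). (ii) Architecture: a 42M Transformer decoder with GQA, QK-Norm, RMSNorm, SwiGLU, RoPE, and z-loss, paired with a domain-balanced 16,384-token byte-fallback BPE. (iii) Curriculum learning with replay across the three phases yields a monotonic loss descent (9.80 -> 3.17 -> 3.00 -> 2.16); after SFT (loss 1.74), the v2 bootstrap-ablation reference attains a conversational gate of 0.775 +/- 0.043 on B5, and a controlled Phase-2 replay sweep over {0,5,10,25,50}% saturates B5 at >=25% replay. (iv) Two empirical findings, both with N=4. A controlled bootstrap-corpus ablation across v2 (OpenSubs), v4 (mC4-ES), and v6 (60/25/15 OpenSubs/mC4/Wiki) exposes a loss-versus-register inversion: lower-perplexity bootstraps yield measurably worse conversational behavior (v2 > v4 > v6 on B5 at every paired seed). The B4 (tool-selection) floor of 0.000 is an artifact of corpus density, not a capacity gate: rebalancing the SFT mixture to a tool-use ratio of 1:21 yields VectraYX-Nano v7, the released headline configuration, which reaches B4 = 0.230 +/- 0.052 at 42M while retaining B1 = 0.332 +/- 0.005 and B5 = 0.725 +/- 0.130; a LoRA replication on a 260M from-scratch mid-tier model reaches 0.445 +/- 0.201. The released GGUF is 96 MB in F16, achieves sub-second TTFT on commodity hardware under llama.cpp, and is, to our knowledge, the first published Spanish-native cybersecurity LLM with end-to-end MCP integration.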

English abstract

We present VectraYX-Nano, a 41.95M-parameter decoder-only language model trained from scratch in Spanish for cybersecurity, with a Latin-American regional focus and native tool invocation via the Model Context Protocol (MCP). The model has four contributions. (i) Corpus: VectraYX-Sec-ES, a 170M-token Spanish corpus assembled by an eight-VM distributed pipeline at ~$25 USD of cloud compute and split into three curriculum phases (conversational 42M, cybersecurity 118M, offensive tooling 10M). (ii) Architecture: a 42M Transformer decoder with GQA, QK-Norm, RMSNorm, SwiGLU, RoPE and z-loss, paired with a domain-balanced 16,384-token byte-fallback BPE. (iii) Curriculum with replay across the three phases yields a monotonic loss descent (9.80 -> 3.17 -> 3.00 -> 2.16); after SFT (loss 1.74) the v2 bootstrap-ablation reference attains a conversational gate of 0.775 +/- 0.043 on B5 over N=4 seeds, and a controlled Phase-2 replay sweep over {0,5,10,25,50}% saturates B5 at >=25% replay. (iv) Two empirical findings, both N=4. A controlled bootstrap-corpus ablation across v2 (OpenSubs), v4 (mC4-ES), and v6 (60/25/15 OpenSubs/mC4/Wiki) exposes a loss-versus-register inversion: lower-perplexity bootstraps yield measurably worse conversational behavior (v2 > v4 > v6 on B5 at every paired seed). The B4 (tool-selection) floor of 0.000 is a corpus-density artifact, not a capacity gate: rebalancing the SFT mixture to tool-use ratio 1:21 yields VectraYX-Nano v7, the released headline configuration, reaching B4 = 0.230 +/- 0.052 at 42M while retaining B1 = 0.332 +/- 0.005 and B5 = 0.725 +/- 0.130; a LoRA replication on a 260M from-scratch mid-tier reaches 0.445 +/- 0.201. The released GGUF is 96 MB in F16, runs sub-second TTFT on commodity hardware under llama.cpp, and is, to our knowledge, the first published Spanish-native cybersecurity LLM with end-to-end MCP integration.
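The curriculum-with-replay idea in contribution (iii) can be sketched generically: while training on a later phase, a fixed share of each batch is drawn from earlier phases. The abstract does not describe how replay is implemented, so the sampling scheme, the `mix_batch` helper, and the toy data below are assumptions; only the phase token counts and the sweep values come from the abstract.

```python
import random

# Token counts per curriculum phase in millions, as stated in the abstract.
PHASE_TOKENS_M = {"conversational": 42, "cybersecurity": 118, "offensive_tooling": 10}

# Replay fractions covered by the Phase-2 sweep in the abstract.
REPLAY_SWEEP = [0.0, 0.05, 0.10, 0.25, 0.50]


def mix_batch(current, earlier, batch_size, replay_fraction, rng):
    """Build one batch: a fixed share replayed from earlier phases, the rest from the current one."""
    if not 0.0 <= replay_fraction <= 1.0:
        raise ValueError("replay_fraction must be in [0, 1]")
    n_replay = round(batch_size * replay_fraction) if earlier else 0
    replayed = rng.choices(earlier, k=n_replay) if n_replay else []
    fresh = rng.choices(current, k=batch_size - n_replay)
    batch = replayed + fresh
    rng.shuffle(batch)                         # avoid a fixed replay-first ordering
    return batch


# Toy stand-ins for tokenized samples (hypothetical, not the real corpus).
rng = random.Random(0)
phase1 = [("conversational", i) for i in range(100)]
phase2 = [("cybersecurity", i) for i in range(100)]
batch = mix_batch(phase2, phase1, batch_size=8, replay_fraction=0.25, rng=rng)
print(sum(PHASE_TOKENS_M.values()), batch)
```

The three phase sizes sum to the stated 170M tokens, and a replay fraction of 0.25 corresponds to the point at which the abstract reports that B5 saturates.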

2605.12058 2026-05-22 cs.LG cs.AI

Holder Policy Optimisation

Hölder Policy Optimisation

Yuxiang Chen, Dingli Liang, Yihang Chen, Ziqin Gong, Chenyang Le, Zhaokai Wang, Jiachen Zhu, Lingyu Yang, Jianghao Lin, Weinan Zhang, Jun Wang

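AI summary This paper proposes the HölderPO framework, which unifies token-level probability aggregation through the Hölder mean to address the training collapse and unsatisfactory performance caused by fixed aggregation mechanisms. It proves theoretically how different values of p balance gradient concentration against variance, and uses a dynamic annealing algorithm to schedule p over the training lifecycle. Experiments show better stability and convergence across multiple mathematical benchmarks.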

Details

English abstract

Group Relative Policy Optimisation (GRPO) enhances large language models by estimating advantages across a group of sampled trajectories. However, mapping these trajectory-level advantages to policy updates requires aggregating token-level probabilities within each sequence. Relying on a fixed aggregation mechanism for this step fundamentally limits the algorithm's adaptability. Empirically, we observe a critical trade-off: certain fixed aggregations frequently suffer from training collapse, while others fail to yield satisfactory performance. To resolve this, we propose HölderPO, a generalised policy optimisation framework unifying token-level probability aggregation via the Hölder mean. By explicitly modulating the parameter $p$, our framework provides continuous control over the trade-off between gradient concentration and variance bounds. Theoretically, we prove that a larger $p$ concentrates the gradient to amplify sparse learning signals, whereas a smaller $p$ strictly bounds gradient variance. Because no static configuration can universally resolve this concentration-stability trade-off, we instantiate the framework with a dynamic annealing algorithm that progressively schedules $p$ across the training lifecycle. Extensive evaluations demonstrate superior stability and convergence over existing baselines. Specifically, our approach achieves a state-of-the-art average accuracy of $54.9\%$ across multiple mathematical benchmarks, yielding a substantial $7.2\%$ relative gain over standard GRPO and securing an exceptional $93.8\%$ success rate on ALFWorld.
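For readers unfamiliar with the Hölder (power) mean, the sketch below computes it over a sequence of token probabilities. The definition is standard: M_p(x) = ((1/n) Σ x_i^p)^(1/p), with the geometric mean as the p = 0 limit. How HölderPO inserts this mean into the GRPO objective is not given in the abstract, and the linear schedule shown is purely hypothetical; the abstract says only that p is scheduled progressively, without stating direction or shape.

```python
import math


def holder_mean(values, p):
    """Power (Hölder) mean of strictly positive numbers; p == 0 gives the geometric mean."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("values must be non-empty and strictly positive")
    n = len(values)
    if p == 0:
        return math.exp(sum(math.log(v) for v in values) / n)
    return (sum(v ** p for v in values) / n) ** (1.0 / p)


def linear_schedule(step, total_steps, p_start, p_end):
    """Hypothetical annealing of p: linear interpolation from p_start to p_end."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return p_start + frac * (p_end - p_start)


# Toy token probabilities for one sampled sequence (not data from the paper).
token_probs = [0.9, 0.5, 0.1]
for p in (-1, 0, 1, 2):
    print(p, holder_mean(token_probs, p))
```

The mean is non-decreasing in p: small p is dominated by the least likely tokens, while large p is dominated by the most likely ones, which gives some intuition for treating p as a continuous control knob.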