arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.12412 2026-06-11 cs.CV cs.AI New submission

Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models

Cheng-Yu Yang, Shao-Yuan Lo, Yu-Lun Liu

Affiliations: National Yang Ming Chiao Tung University; National Taiwan University

AI summary: Visual-token importance in vision-language models changes with decoder depth. Reroute is a training-free method that replaces irreversible token removal with recoverable routing, improving grounding under aggressive token reduction while maintaining general VQA performance.

Comments
Code: this https URL

Abstract

Vision-language models (VLMs) project images into hundreds to thousands of visual tokens, making decoder inference expensive in both attention computation and KV-cache memory. Existing visual-token reduction methods largely follow a rank-and-remove paradigm: they score visual tokens, keep a compact subset, and permanently discard the rest. We show that this irreversible action is fragile because visual-token importance changes across decoder depth; tokens ranked low at one stage may become relevant in later layers, especially for grounding-sensitive queries. We propose Reroute, a training-free plug-in that replaces removal with recoverable routing. At each routing stage, selected visual tokens pass through decoder blocks, while deferred tokens bypass the stage and re-enter the candidate pool at the next routing decision. Reroute reuses existing attention-score ranking rules and stage-wise schedules, preserving the theoretical TFLOPs and KV-cache budget class of the pruning method it augments. Across FastV, PDrop, and Nüwa variants on LLaVA-1.5 and Qwen backbones, Reroute improves grounding under aggressive token reduction while maintaining general VQA performance. These results suggest that VLM token reduction should not be viewed only as irreversible pruning, but also as recoverable routing. The code can be found here: this https URL
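
The contrast between rank-and-remove and recoverable routing can be sketched in a few lines of Python. This is a toy illustration only: it tracks token ids and made-up per-stage scores, and the function names and the decision to leave the candidate pool untouched between stages are assumptions for illustration, not the authors' implementation (which must also carry hidden states for deferred tokens).

```python
from typing import Dict, List

def select_top_k(scores: Dict[int, float], pool: List[int], k: int) -> List[int]:
    """Return the k token ids in `pool` with the highest score (ties broken by id)."""
    return sorted(pool, key=lambda t: (-scores[t], t))[:k]

def rank_and_remove(stage_scores: List[Dict[int, float]], k: int) -> List[List[int]]:
    """Irreversible pruning: a token dropped at one stage never returns."""
    pool = list(stage_scores[0].keys())
    processed = []
    for scores in stage_scores:
        pool = select_top_k(scores, pool, k)  # only survivors stay in the pool
        processed.append(sorted(pool))
    return processed

def recoverable_routing(stage_scores: List[Dict[int, float]], k: int) -> List[List[int]]:
    """Recoverable routing: deferred tokens skip the stage but stay candidates."""
    pool = list(stage_scores[0].keys())
    processed = []
    for scores in stage_scores:
        selected = select_top_k(scores, pool, k)  # these pass through the blocks
        processed.append(sorted(selected))        # the pool itself is unchanged
    return processed

# Hypothetical scores: token 2 is unimportant at stage 0 but important at stage 1.
stage_scores = [
    {0: 0.9, 1: 0.8, 2: 0.1, 3: 0.2},
    {0: 0.3, 1: 0.7, 2: 0.95, 3: 0.1},
]
pruned = rank_and_remove(stage_scores, k=2)
routed = recoverable_routing(stage_scores, k=2)
print(pruned)  # [[0, 1], [0, 1]]
print(routed)  # [[0, 1], [1, 2]]
```

Both variants process two tokens per stage, so the per-stage budget is the same; only the routed variant can pick up token 2 once it becomes relevant.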

2606.12411 2026-06-11 cs.CL cs.LG New submission

Context-Driven Incremental Compression for Multi-Turn Dialogue Generation

Yeongseo Jung, Jaehyeok Kim, Eunseo Jung, Jiachuan Wang, Yongqi Zhang, Ka Chun Cheung, Simon See, Lei Chen

Affiliations: The Hong Kong University of Science and Technology; NVIDIA AI Technology Center; Shanghai Jiao Tong University; The Chinese University of Hong Kong

AI summary: Proposes Context-Driven Incremental Compression (C-DIC), which uses revisable per-thread compression states and a lightweight retrieve-revise-write-back loop to share information across turns and stabilize performance in long dialogues.

Comments
Accepted at ICML 2026

Abstract

Modern conversational agents condition on an ever-growing dialogue history at each turn, incurring redundant attention and encoding costs that grow with conversation length. Naive truncation or summarization degrades fidelity, while existing context compressors lack cross-turn memory sharing or revision, causing information loss and compounding errors in long dialogues. We revisit context compression under conversational dynamics and empirically demonstrate its fragility. To improve both efficiency and robustness, we introduce Context-Driven Incremental Compression (C-DIC), which treats a conversation as interleaved contextual threads and stores revisable per-thread compression states in a single, compact dialogue memory. At each turn, a lightweight retrieve, revise, and write-back loop shares information across turns and updates stale memories, stabilizing long-horizon behavior. In addition, we adapt truncated backpropagation-through-time (TBPTT) to our multi-turn setting, learning cross-turn dependencies without full-history backpropagation. Extensive experiments on long-form dialogue benchmarks demonstrate the superior performance and efficiency of C-DIC; notably, C-DIC shows stable inference latency and perplexity over hundreds of dialogue turns, supporting a scalable path to high-quality dialogue modeling.
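
The retrieve, revise, and write-back loop over per-thread states can be pictured with a small data-structure sketch. Everything here is an assumption for illustration: C-DIC learns its compressed states, whereas this toy uses truncated strings, hand-assigned thread labels, and a hypothetical `MAX_STATE_CHARS` cap to stand in for a fixed-size memory slot.

```python
from typing import Dict

MAX_STATE_CHARS = 40  # hypothetical cap standing in for a fixed-size compressed state

def revise(old_state: str, new_text: str) -> str:
    """Toy revision: put the newest information first, then truncate to the cap."""
    merged = f"{new_text} | {old_state}" if old_state else new_text
    return merged[:MAX_STATE_CHARS]

def process_turn(memory: Dict[str, str], thread: str, utterance: str) -> str:
    """Retrieve the thread's state, revise it with the new turn, and write it back."""
    state = memory.get(thread, "")    # retrieve
    state = revise(state, utterance)  # revise
    memory[thread] = state            # write back
    return state

memory: Dict[str, str] = {}
turns = [
    ("travel", "wants a flight to Tokyo"),
    ("food", "is vegetarian"),
    ("travel", "now prefers Osaka"),  # updates the stale travel state
]
for thread, utterance in turns:
    process_turn(memory, thread, utterance)
```

The memory holds one bounded entry per thread regardless of how many turns have passed, which is the property the abstract ties to stable latency over long dialogues.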

2606.12407 2026-06-11 cs.CV New submission

How Seemingly Inconsequential Design Choices Dictate Performance of LLMs in Pathology

Kian R. Weihrauch, Thomas A. Buckley, William Lotter, Arjun K. Manrai

Affiliations: Massachusetts Institute of Technology; Harvard Medical School; Dana-Farber Cancer Institute

AI summary: A systematic factorial analysis shows that adjusting the input configuration, such as patch size and magnification, substantially improves general-purpose LLMs on pathology slide tasks and narrows the gap to specialized models.

Abstract

General-purpose large language models (LLMs) are routinely used as baselines when evaluating specialized pathology models on whole-slide images (WSIs). Because WSIs exceed contemporary model context limits, LLM baselines routinely use small, high-magnification patches processed independently via majority voting, without systematic evaluation of seemingly inconsequential design choices such as patch size, patch count, and magnification. Generalist LLMs have consistently underperformed specialized systems, reinforcing the perception that domain-specific training or architectural adaptation is necessary for pathology tasks involving WSIs. Here, we conduct a systematic factorial analysis of four input design factors: inference mode, patch size, magnification, and patch count. We demonstrate that prior studies have overstated the gap between specialized models and general-purpose LLMs by choosing non-optimized input configurations. On the MultiPathQA benchmark, switching to a single balanced configuration (large patches at lower magnification, processed jointly) raises GPT-5 from 15.1% to 39.5% on cancer-type classification (TCGA) and from 38.1% to 62.9% on organ classification (GTEx). Per-task optimization yields further gains up to 43.9% (TCGA) and 71.6% (GTEx). The same configuration generalizes to two other models and to a fully held-out CPTAC cohort, where it improves Gemini 3 Flash by 23.4 percentage points without any task-specific tuning.
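
The baseline setup the abstract criticizes, independent per-patch predictions aggregated by majority voting, can be sketched as follows. The labels and the `majority_vote` helper are hypothetical; the paper's actual prompting and aggregation code is not shown in this listing.

```python
from collections import Counter
from typing import List

def majority_vote(patch_predictions: List[str]) -> str:
    """Return the most common per-patch label (the first-seen label wins ties)."""
    if not patch_predictions:
        raise ValueError("need at least one patch prediction")
    return Counter(patch_predictions).most_common(1)[0][0]

# Hypothetical per-patch outputs for one slide, each produced independently.
slide_label = majority_vote(["lung", "breast", "lung", "colon", "lung"])
print(slide_label)  # lung
```

Joint processing, by contrast, would pass all patches to the model in one request, so no aggregation step of this kind is needed.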

2606.12406 2026-06-11 cs.RO cs.AI cs.LG eess.SY New submission

FACTR 2: Learning External Force Sensing for Commodity Robot Arms Improves Policy Learning

Steven Oh, Jason Jingzhou Liu, Tony Tao, Philip Han, Kenneth Shaw, Satoshi Funabashi, Ruslan Salakhutdinov, Deepak Pathak

Affiliations: Carnegie Mellon University; Waseda University

AI summary: Proposes NEXT, a data-driven method that needs no dedicated force sensors. It trains in 1 minute from 10 minutes of free-motion data, achieves estimates comparable to dedicated joint-torque sensors, and, combined with the FIRST sampling strategy, improves policy learning.

Comments
Website at this https URL

Abstract

Contact-rich manipulation requires force sensitivity, but many robot arms lack dedicated force sensors due to their high cost. We present Neural External Torque Estimation (NEXT), a data-driven method that estimates external joint torques without needing any dedicated force sensors. NEXT trains in 1 minute from only 10 minutes of free-motion data, yet achieves estimates comparable to dedicated joint-torque sensors. NEXT enables force-feedback teleoperation on low-cost arms and improves policy learning through Force-Informed Re-Sampling Training (FIRST), which up-samples pre-contact and contact segments during behavior cloning. Across five long-horizon tasks, FIRST outperforms prior force-aware policies by over 17% in task progress. Together, NEXT and FIRST bring force-aware teleoperation and policy learning to off-the-shelf robots without additional sensing hardware. Video results and code are available at this https URL
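
Up-sampling pre-contact and contact segments during behavior cloning amounts to weighted sampling over training segments. The sketch below is a minimal illustration under stated assumptions: the segment labels, the 3x weights, and the helper name are hypothetical, and the abstract does not say how FIRST derives its weights from the estimated torques.

```python
import random
from typing import Dict, List

def sampling_probabilities(labels: List[str], weights: Dict[str, float]) -> List[float]:
    """Turn per-label weights into a normalized sampling distribution over segments."""
    raw = [weights[label] for label in labels]
    total = sum(raw)
    return [w / total for w in raw]

# Hypothetical segment labels for one demonstration.
labels = ["free", "free", "free", "pre_contact", "contact", "free"]
# Hypothetical weights: contact-related segments are drawn 3x as often.
weights = {"free": 1.0, "pre_contact": 3.0, "contact": 3.0}

probs = sampling_probabilities(labels, weights)

# Draw a training batch of segment indices with a seeded generator.
rng = random.Random(0)
batch = rng.choices(range(len(labels)), weights=probs, k=8)
```

With these weights the two contact-related segments together receive 60% of the sampling mass even though they make up a third of the data.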

2606.12403 2026-06-11 cs.RO New submission

World Pilot: Steering Vision-Language-Action Models with World-Action Priors

Zefu Lin, Rongxu Cui, Junjia Xu, Xiaojuan Jin, Wenling Li, Lue Fan, Zhaoxiang Zhang

Affiliations: Institute of Automation, Chinese Academy of Sciences (CASIA); Nanjing University; Beihang University

AI summary: Proposes the World Pilot framework, which supplies a VLA model with scene-evolution priors and trajectory-level motion hints from a World-Action Model (WAM) through two pathways, Latent Steering and Action Steering. It reaches a Total success rate of 84.7% on the LIBERO-Plus zero-shot OOD benchmark and the highest success rate on several real-robot manipulation tasks.

Comments
Project Website: this https URL

Abstract

Vision-Language-Action (VLA) models inherit semantic grounding from large-scale pretraining and perform competently across in-distribution manipulation tasks. This grounding, however, is built on static image-text pairs, whereas manipulation is a continuous, contact-rich process whose dynamics this pretraining cannot capture. We present World Pilot, a VLA framework that augments the policy with priors from a World-Action Model (WAM), routed into the decision chain through two complementary pathways. Latent Steering conditions the perception layer on a scene-evolution latent, and Action Steering supplies an anticipated trajectory as a motion prior to the action generator. Together the two priors equip the VLA with an anticipated view of the scene and a trajectory-level motion hint alongside its semantic conditioning, and the scene-evolution prior remains effective even when supplied by a video-pretrained world model that has not been action-post-trained. World Pilot attains a state-of-the-art Total success rate of 84.7% on the LIBERO-Plus zero-shot OOD benchmark and the highest success rate on every real-robot setting across four manipulation tasks, with the largest margins under shifts in viewpoint, geometry, deformable state, and pose. Project Website: this https URL

2606.12402 2026-06-11 cs.RO cs.AI cs.CV New submission

DIRECT: When and Where Should You Allocate Test-Time Compute in Embodied Planners?

Jadelynn Dao, Milan Ganai, Yasmina Abukhadra, Ajay Sridhar, Mozhgan Nasr Azadani, Katie Luo, Clark Barrett, Jiajun Wu, Chelsea Finn, Marco Pavone

Affiliations: Stanford University; University of Waterloo; NVIDIA

AI summary: Proposes DIRECT, a routing framework that allocates compute per prompt from multimodal scene context to improve the success-cost Pareto frontier. Experiments show that different scaling axes yield different capability gains, and on a physical robot the router matches or exceeds a stronger model at lower latency.

Abstract

Vision-Language Models (VLMs) are increasingly deployed as high-level planners for embodied agents, with an emerging strategy of scaling test-time compute to improve capability. However, we observe that doing so increases latency, token usage, and FLOPs while yielding uneven, often diminishing gains in downstream success, limiting where embodied agents can be deployed. We argue that choosing when and where to spend test-time compute is central to bringing frontier performance to the real world. We introduce DIRECT, a routing framework that uses multimodal scene context to allocate compute per prompt, improving the success-cost Pareto frontier over fixed model selection. Across three dominant scaling axes, namely chain-of-thought depth, model size, and memory history, our experiments on VLABench and RoboMME show that test-time compute is not a uniform lever: different axes yield qualitatively distinct capability gains. We validate these insights on a physical Franka arm in a DROID setup spanning zero-shot manipulation and long-horizon chaining, where our router matches or exceeds a stronger model's success rate at up to 65% lower average latency. Ultimately, our results show that naively scaling test-time compute is wasteful, and that DIRECT can provide frontier-level embodied planning in robotic systems at a fraction of the cost. Project page can be found at this http URL.
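
The success-cost Pareto frontier and a per-prompt routing rule can be illustrated with a short sketch. The configuration names, costs, and success rates are invented, and the rule "pick the cheapest configuration predicted to meet a target" is an assumed stand-in for the learned router described in the abstract.

```python
from typing import List, Tuple

Config = Tuple[str, float, float]  # (name, cost, success rate)

def pareto_frontier(configs: List[Config]) -> List[str]:
    """Keep configurations that no cheaper-or-equal configuration matches or beats."""
    frontier = []
    best_success = float("-inf")
    # Scan from cheapest to most expensive; at equal cost, higher success first.
    for name, cost, success in sorted(configs, key=lambda c: (c[1], -c[2])):
        if success > best_success:
            frontier.append(name)
            best_success = success
    return frontier

def route(predicted: List[Config], target: float) -> str:
    """Pick the cheapest configuration predicted to reach `target` success."""
    sufficient = [c for c in predicted if c[2] >= target]
    if sufficient:
        return min(sufficient, key=lambda c: c[1])[0]
    # Nothing meets the target: fall back to the best success, cheapest on ties.
    return max(predicted, key=lambda c: (c[2], -c[1]))[0]

# Hypothetical per-prompt predictions (cost in arbitrary latency units).
configs = [
    ("small", 1.0, 0.60),
    ("small+cot", 2.0, 0.55),
    ("large", 4.0, 0.80),
    ("large+cot", 6.0, 0.80),
]
print(pareto_frontier(configs))  # ['small', 'large']
print(route(configs, 0.7))       # large
```

In this toy example chain-of-thought adds cost without adding success, so both chain-of-thought variants fall off the frontier, which mirrors the abstract's point that extra test-time compute is not a uniform lever.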

2606.12400 2026-06-11 cs.CL cs.IR New submission

Doc-to-Atom: Learning to Compile and Compose Memory Atoms

Xingjian Diao, Wenbo Li, Yashas Malur Saidutta, Avinash Amballa, Lazar Valkov, Srinivas Chappidi

Affiliations: AI Center-Mountain View, Samsung Electronics; Dartmouth College

AI summary: Proposes the Doc2Atom framework, which decomposes documents into semantically typed knowledge atoms compiled into micro-LoRA adapters. A lightweight query router selects the relevant atoms and assembles them into a query-specific adapter, addressing interference and scalability problems in document compression and outperforming Doc-to-LoRA on six QA benchmarks.

Comments
20 pages

Abstract

Long input sequences are central to document understanding and multi-step reasoning in Large Language Models, yet the quadratic cost of attention makes inference both memory-intensive and slow. Context distillation mitigates this by compressing contextual information into model parameters, and recent work such as Doc-to-LoRA amortizes context distillation into a single forward pass that generates one LoRA adapter per document. However, producing a single monolithic adapter for all queries leads to irrelevant-query interference, limited compositional recall, and poor scalability to long-document reasoning. To address these challenges, we propose Doc-to-Atom (Doc2Atom), a compositional parametric memory framework that decomposes each document into semantically typed knowledge atoms. Each atom is compiled into an independent micro-LoRA adapter and a provenance retrieval key. At inference time, a lightweight query router selects and assembles only the relevant atoms into a query-specific adapter, which is then injected into a frozen base model. The entire system is trained end-to-end through a multi-objective distillation framework. Experiments on six diverse QA benchmarks demonstrate that Doc2Atom outperforms Doc-to-LoRA baselines while reducing the memory cost of document internalization.
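
The select-and-assemble step can be sketched with rank-1 updates in plain Python. This is a toy under explicit assumptions: each atom is a hypothetical triple of retrieval key and rank-1 factors, routing is top-k cosine similarity, and composition is a plain sum of the selected updates. The abstract does not specify any of these choices.

```python
import math
from typing import List, Tuple

Vector = List[float]
Atom = Tuple[Vector, Vector, Vector]  # (retrieval key, b, a): rank-1 update b a^T

def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def route(query: Vector, atoms: List[Atom], k: int) -> List[int]:
    """Return indices of the k atoms whose keys are most similar to the query."""
    ranked = sorted(range(len(atoms)), key=lambda i: -cosine(query, atoms[i][0]))
    return ranked[:k]

def assemble(atoms: List[Atom], chosen: List[int]) -> List[List[float]]:
    """Sum the selected rank-1 updates into one query-specific weight delta."""
    rows, cols = len(atoms[0][1]), len(atoms[0][2])
    delta = [[0.0] * cols for _ in range(rows)]
    for i in chosen:
        _, b, a = atoms[i]
        for r in range(rows):
            for c in range(cols):
                delta[r][c] += b[r] * a[c]
    return delta

# Three hypothetical atoms; the middle one is unrelated to the query.
atoms = [
    ([1.0, 0.0], [1.0, 0.0], [0.0, 2.0]),
    ([0.0, 1.0], [0.0, 1.0], [3.0, 0.0]),
    ([0.9, 0.1], [1.0, 1.0], [1.0, 0.0]),
]
chosen = route([1.0, 0.0], atoms, k=2)
delta = assemble(atoms, chosen)
print(chosen)  # [0, 2]
print(delta)   # [[1.0, 2.0], [1.0, 0.0]]
```

Because the unrelated atom never enters the sum, its update cannot interfere with this query, which is the motivation the abstract gives for replacing a single monolithic adapter.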

2606.12397 2026-06-11 cs.LG cs.AI cs.CL New submission

Redesign Mixture-of-Experts Routers with Manifold Power Iteration

Songhao Wu, Ang Lv, Ruobing Xie, Yankai Lin

Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; Large Language Model Department, Tencent

AI summary: Proposes aligning each router row with the principal singular direction of its expert matrix and redesigns the router with Manifold Power Iteration (MPI). A "Power-then-Retract" paradigm achieves the alignment, convergence is proven theoretically, and experiments at 1B to 11B parameters confirm improved models.

Comments
Preprint

Abstract

The router is the cornerstone component of Mixture-of-Experts (MoE) models. Serving as expert proxies, the rows of the router matrix compute their similarity to the MoE inputs to determine which subset of experts is activated. Ideally, each router row condenses its expert matrix into a representative vector, such that its dot product with a token better reflects token-expert affinity. However, no design principle exists to enforce this condensation. In this paper, we propose to align each router row with the principal singular direction of the associated expert, as this direction provides the most expressive mathematical description of a matrix. Based on this principle, we propose a router redesign with Manifold Power Iteration (MPI). Specifically, it introduces a "Power-then-Retract" paradigm, where a power iteration step is performed on the router weights, followed by a retraction that imposes a norm constraint to ensure both efficiency and stability. Theoretically, we show that MPI drives router rows to converge toward the principal singular directions of their associated experts. Empirically, we pretrain MoE models across scales from 1B to 11B parameters to confirm that this alignment facilitates more effective MoE models.
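
The "Power-then-Retract" idea can be illustrated with textbook power iteration followed by renormalization. The sketch below assumes a single expert matrix `W`, a router row `r` living in the expert's input space, a power step on `W^T W`, and retraction to the unit sphere. How MPI interleaves these steps with gradient updates during pretraining is not described in the abstract, so this is not the authors' procedure.

```python
import math
from typing import List

Matrix = List[List[float]]
Vector = List[float]

def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix m by vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in m]

def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]

def retract(v: Vector) -> Vector:
    """Project back onto the unit sphere (the norm constraint)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def power_then_retract(w: Matrix, r: Vector, steps: int = 50) -> Vector:
    """Alternate a power step on W^T W with a unit-norm retraction."""
    wt = transpose(w)
    for _ in range(steps):
        r = matvec(wt, matvec(w, r))  # power step: r <- W^T W r
        r = retract(r)                # retraction: r <- r / ||r||
    return r

# Toy expert matrix whose principal right singular direction is [1, 0].
expert = [[3.0, 0.0], [0.0, 1.0]]
router_row = power_then_retract(expert, [1.0, 1.0])
```

Starting from `[1, 1]`, each step multiplies the first coordinate by 9 and the second by 1 before renormalizing, so the row converges to `[1, 0]`, the direction associated with the largest singular value.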

2606.12396 2026-06-11 cs.CV cs.RO New submission

VLGA: Vision-Language-Geometry-Action Models for Autonomous Driving

Jin Yao, Dhruva Dixith Kurra, Tom Lampo, Zezhou Cheng, Danhua Guo, Burhan Yaman

Affiliations: Uber AV Labs; University of Virginia

AI summary: Proposes VLGA, which introduces geometry as a fourth modality supervised by a per-pixel pointmap regression loss, enabling dense 3D world reconstruction and reaching state-of-the-art results on nuScenes and Bench2Drive.

Comments
Project page: this https URL

Abstract

Vision-language-action (VLA) models can describe scenes and reason about them in language, yet still struggle to ground their actions in the dense 3D world around them. Existing approaches either inject features from a frozen 3D foundation model without an objective that ensures the policy uses them, or constrain geometry with sparse box and map losses that provide no dense spatial signal. We introduce VLGA, the first vision-language-action model supervised to reconstruct the dense 3D world it drives through. VLGA introduces geometry as a fourth modality alongside vision, language, and action through a dedicated expert supervised by a per-pixel pointmap regression loss against LiDAR. Extensive experiments on the challenging nuScenes and Bench2Drive datasets, for open-loop and closed-loop evaluation respectively, show the superiority of VLGA over counterpart VLA methods. In particular, on open-loop nuScenes, VLGA sets a new state of the art among VLA methods without ego status, with the lowest L2 error (0.50 m average) and 3-second collision rate (0.18%). On closed-loop Bench2Drive, VLGA attains the state-of-the-art driving score of 79.08, +0.71 over the strongest prior VLA, at comparable efficiency and comfort.

2606.12392 2026-06-11 cs.CL cs.AI New submission

System Report for CCL25-Eval Task 5: New Dataset and LoRA-Fine-Tuned Qwen2.5

Haotao Xie

Affiliations: Hangzhou International Innovation Institute, Beihang University

AI summary: For classical poetry translation and emotional understanding, builds the high-quality instruction dataset CCPoetry-49K and fine-tunes Qwen2.5-14B with LoRA to obtain PoetryQwen, which scores 0.757 on CCL25-Eval Task 5, a 9.7% improvement over the baseline.

Abstract

Recently, large language models (LLMs) have achieved promising progress in the fields of classical Chinese translation and the generation of classical poetry. However, domain-specific research on precise translation and affective-semantic understanding of classical poetry remains limited. The main challenge is that most studies treat the poetic appreciation task as a general-domain problem, neglecting the distinctive features of poetic appreciation, while high-quality, domain-specific datasets are extremely scarce. To address this limitation, we decompose the task into three subtasks: term interpretation, semantic interpretation, and emotional inference. Based on multiple open-source datasets, we perform data cleansing and alignment to construct the Classical Chinese Poetry Instruction Pair Dataset (CCPoetry-49K), which comprises 49,404 high-quality instruction-response pairs explicitly optimized for this domain. We then propose a domain-specialized LLM, called PoetryQwen, by applying Low-Rank Adaptation (LoRA) to fine-tune the Qwen2.5-14B model. Experimental results on the CCL25-Eval Task 5 benchmark demonstrate that PoetryQwen achieves a score of 0.757, representing a 9.7% improvement over the Qwen2.5-14B-Instruct baseline (0.690). These findings clearly indicate that PoetryQwen significantly enhances performance in precise translation and emotional understanding of classical poetry. We present a new dataset and methodological considerations intended to support the domain-specific optimization of LLMs.
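
The reported 9.7% figure is a relative improvement over the baseline score, not a gain of 9.7 points. A quick check with the two scores from the abstract (the helper name is just for illustration):

```python
def relative_improvement_pct(baseline: float, score: float) -> float:
    """Relative improvement of `score` over `baseline`, in percent."""
    return (score - baseline) / baseline * 100.0

gain = relative_improvement_pct(0.690, 0.757)  # scores quoted in the abstract
absolute_gain = 0.757 - 0.690

print(round(gain, 1))           # 9.7
print(round(absolute_gain, 3))  # 0.067
```

The absolute difference is 0.067 on the task metric; dividing by the 0.690 baseline gives the 9.7% quoted above.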

2606.12386 2026-06-11 cs.LG cs.AI New submission

ATLAS: Active Theory Learning for Automated Science

Noémi Éltető, Nathaniel D. Daw, Kimberly L. Stachenfeld, Kevin J. Miller

Affiliations: Google DeepMind; Princeton University; Columbia University; University College London

AI summary: Proposes the ATLAS framework, which uses active learning to iterate between generating sparse neural network hypotheses and designing experiments that optimally distinguish them. It recovers reinforcement learning agents in bandit tasks with 5-10x better sample efficiency than random experimentation.

Abstract

Advancing scientific understanding through mechanistic modeling requires posing the right experimental questions to yield maximally informative data. To automate this pursuit within cognitive science, we introduce ATLAS (Active Theory Learning for Automated Science), an active learning framework for the data-driven discovery of interpretable behavioral models. ATLAS iterates between generating mechanistic hypotheses, instantiated as a diverse ensemble of sparse neural networks (Disentangled RNNs), and designing experiments that optimally distinguish between them. We test this approach on the problem of recovering reinforcement learning agents from their behavior in bandit tasks. ATLAS designs varied sequences of qualitatively novel experiments with temporal structure tailored to underlying agent characteristics. The models trained on these experiments are evaluated against a comprehensive set of metrics for mechanistic modeling that capture behavioral, structural, and computational similarity. ATLAS achieves a 5-10x improvement in sample efficiency across all metrics compared to random experimentation, and its performance is further validated against expert-designed experiments derived from literature. These in silico results showcase ATLAS's potential to accelerate human-interpretable insights in cognitive science and other domains where scientific inquiry relies on discovering mechanistic models.

2606.12385 2026-06-11 cs.CL New submission

Which Models Are Our Models Built On? Auditing Invisible Dependencies in Modern LLMs

Sanjay Adhikesaven, Haoxiang Sun, Sewon Min

Affiliations: University of California, Berkeley; Allen Institute for AI

AI summary: Proposes ModSleuth, a system that recursively reconstructs LLM dependency graphs and reveals hidden dependency issues such as multi-hop license obligations and train-evaluation coupling. The tool and the dependency graphs are released to support transparent analysis.

Abstract

Modern LLM training pipelines increasingly rely on other models to generate data, filter corpora, judge outputs, and guide development decisions. These dependencies are recursive: a model may depend on an upstream artifact whose own dependencies are documented only in separate releases and artifacts. As a result, the full dependency structure is fragmented across heterogeneous public artifacts, with complexity and recursive depth far outpacing humans' ability to trace. We introduce ModSleuth, an agentic system that recursively reconstructs LLM dependency graphs from public artifacts with source-grounded evidence. We find that the primary challenge is no longer information extraction, but defining what constitutes a dependency and reconciling artifact references across inconsistent documentation. We address these challenges through a formalization that distinguishes direct and indirect dependencies, represents heterogeneous pipeline roles through operation-centered relationships, and resolves artifact identities across names, versions, and repositories. Applying ModSleuth to four public-artifact-rich LLM releases, we recover 1,060 source-verified dependencies and construct large-scale dependency graphs of modern LLM development. These graphs reveal multi-hop license obligations, train-evaluation coupling, discrepancies between released and training-time artifacts, and documentation inconsistencies that would otherwise be difficult to uncover. We release ModSleuth and the resulting dependency graphs to support transparent analysis of the increasingly complex ecosystems underlying modern LLMs.

2606.12384 2026-06-11 cs.LG cs.AI New submission

APPO: Agentic Procedural Policy Optimization

Xucong Wang, Ziyu Ma, Yong Wang, Yuxiang Ji, Shidong Yang, Guanhua Chen, Pengkun Wang, Xiangxiang Chu

Affiliations: University of Science and Technology of China; AMAP, Alibaba Group; Southern University of Science and Technology

AI summary: Proposes APPO, which improves credit assignment in agentic reinforcement learning through fine-grained branching and procedure-level advantage scaling, improving strong baselines by nearly 4 points across 13 benchmarks.

Comments
25 pages, including 14 pages of main text and 11 pages of appendix; work in progress

Abstract

Recent advances in agentic Reinforcement Learning (RL) have substantially improved the multi-turn tool-use capabilities of large language model agents. However, most existing methods assign credit over coarse heuristic units, such as tool-call boundaries or fixed workflows, making it difficult to identify which intermediate decisions influence downstream outcomes. In this work, we study agentic RL from two perspectives: where to branch and how to assign credit after branching. Our pilot analysis shows that influential decision points are broadly distributed throughout the generated sequence rather than concentrated at tool calls, while token entropy alone does not reliably reflect their impact on final outcomes. Motivated by these observations, we propose Agentic Procedural Policy Optimization (APPO), which shifts branching and credit assignment from coarse interaction units to fine-grained decision points in the sequence. APPO selects branching locations using a Branching Score that combines token uncertainty with policy-induced likelihood gains of subsequent continuations, enabling more targeted exploration while filtering out spurious high-entropy positions. It further introduces procedure-level advantage scaling to better distribute credit across branched rollouts. Experiments on 13 benchmarks show that APPO consistently improves strong agentic RL baselines by nearly 4 points, while keeping tool calls efficient and maintaining behavioral interpretability.

2606.12378 2026-06-11 cs.CV cs.AI New submission

Illumination-Robust Camera-Based Heart-Rate Estimation for Physiological Sensing in Robots

Zhi Wei Xu, Torbjörn E. M. Nordling

Affiliations: National Cheng Kung University

AI summary: Proposes an end-to-end spatial-temporal transformer framework combining PRNet 3D face alignment, illumination augmentation, residual temporal standardization, and hybrid temporal-frequency supervision. On a dataset with varied illumination it achieves an HR MAE of 0.79 bpm and a correlation of 0.982, a 93.6% error reduction relative to PhysFormer.

Comments
8 pages, 4 figures

Abstract

Physiological awareness is important for service, social, and assistive robots that interact with humans in everyday environments. Remote photoplethysmography (rPPG) enables non-contact heart-rate (HR) estimation from an RGB camera, making it a promising sensing modality for robot-mounted vision systems. However, illumination variation remains a major barrier to robust deployment. This paper presents an end-to-end spatial-temporal transformer framework for remote HR estimation on a new dataset with varied illumination. Our estimator integrates PRNet-based 3D face alignment, clip-level illumination augmentation, the Residual Temporal Standardization Module, and controlled hybrid temporal-frequency supervision. The training objective combines a Soft-Shifted Pearson waveform loss with a spectral Kullback-Leibler divergence loss, where a tuned weight ($\beta$) controls the contribution of frequency-domain heart-rate guidance. Experiments on a static all-level mix protocol covering three illumination levels show that $\beta = 5$ provides the strongest result among the tested $\beta$ settings, achieving a best-run HR mean absolute error (MAE) of 0.79 bpm and an HR correlation of 0.982. Compared with the PhysFormer baseline evaluated on our dataset, our estimator reduces HR MAE by 93.6%, while increasing HR correlation from 0.088 to 0.982, making it usable when illumination varies.

2606.12374 2026-06-11 cs.RO cs.CV 新提交

Semantically-Aware Diver Activity Recognition Framework for Effective Underwater Multi-Human-Robot Collaboration

语义感知的潜水员活动识别框架用于有效的水下多人类-机器人协作

Sadman Sakib Enan, Junaed Sattar

发表机构 * University of Minnesota(明尼苏达大学)

AI总结 提出DAR-Net框架,结合Transformer时间推理与像素级场景监督,通过多损失训练对齐全局活动识别与局部人机交互语义,解决低可见度水下环境中的潜水员活动识别问题,并发布首个水下潜水员活动数据集UDA。

详情
AI中文摘要

有效的人机多体协作对于在具有挑战性和高风险的水下环境中扩展人类主导的操作至关重要。为了使自主水下航行器(AUV)成为真正的队友,它们必须能够理解周围环境并识别潜水员的活动,以提供帮助并确保安全。为此,我们引入了DAR-Net,一种新颖的基于Transformer的框架,用于分析复杂的水下场景并对潜水员活动进行分类。我们的贡献在于一种语义引导的学习公式,它将基于Transformer的时间推理与像素级场景监督相结合。这种多损失训练策略明确地将全局活动识别与局部人机交互语义对齐,这在低可见度水下条件下尤为关键。为了解决该领域数据稀缺的重大挑战,我们首次提出了水下潜水员活动(UDA)数据集,这是一个基础资源,包含超过2600张带有像素级掩码的注释图像。通过在受控环境中进行严格的实验评估,我们证明DAR-Net在识别六种不同潜水员活动方面达到了有希望的准确性,优于现有最先进的模型。虽然该数据集提供了关键的基线,但我们的工作作为开创性的一步,为未来研究奠定了基础,并促进了更智能、协作的水下机器人系统的发展。

English abstract

Effective multi-human-robot collaboration is essential for expanding human-led operations in the challenging and high-risk underwater environment. For autonomous underwater vehicles (AUVs) to become true teammates, they must be able to comprehend their surroundings and recognize a diver's activities to offer assistance and ensure safety. Towards this goal, we introduce DAR-Net, a novel transformer-based framework that analyzes complex underwater scenes to classify diver activities. Our contribution lies in a semantically guided learning formulation that couples transformer-based temporal reasoning with pixel-level scene supervision. This multi-loss training strategy explicitly aligns global activity recognition with local human-robot interaction semantics, which is particularly critical in low-visibility underwater conditions. To address the significant challenge of data scarcity in this domain, we present the first-ever Underwater Diver Activity (UDA) dataset, a foundational resource containing over 2,600 annotated images with pixel-level masks. Through rigorous experimental evaluations in a controlled environment, we demonstrate that DAR-Net achieves promising accuracy in recognizing six distinct diver activities, outperforming state-of-the-art models. While this dataset provides a crucial baseline, our work serves as a pioneering step, laying the groundwork for future research and facilitating the development of more intelligent, collaborative underwater robotic systems.

2606.12372 2026-06-11 cs.RO cs.LG New submission

UniIntervene: Agentic Intervention for Efficient Real-World Reinforcement Learning

UniIntervene:用于高效现实世界强化学习的智能干预

Haoyuan Deng, Yitong Gao, Yudong Lin, Haichao Liu, Zhenyu Wu, Ziwei Wang

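Affiliations: Nanyang Technological University; Beijing University of Posts and Telecommunications

AI summary: Proposes UniIntervene, an agentic intervention model that detects unproductive exploration and autonomously recovers the policy toward high-value states; on real-world robotic manipulation tasks it raises the average success rate by 8.6% and reduces human interventions by 57%.

Comments
Project page: this https URL
AI Chinese abstract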

人在回路强化学习(HiL-RL)已成为现实世界机器人操作的有效范式,能够通过人类指导实现在线策略改进。然而,当前的HiL-RL框架仍然依赖频繁的人类干预来纠正策略,使其脱离低效探索,这导致高昂的人力成本并限制了现实世界的可扩展性。为解决这一问题,我们提出UniIntervene,一种智能干预模型,它能够检测低效探索并自主将策略恢复至高价值状态,从而接管人类操作员的大部分干预工作。具体而言,UniIntervene首先执行未来条件化的动作价值估计,预测当前动作的潜在后果并评估其诱导价值,从而提供更稳定的进展信号。在此基础上,一个时间价值风险评论家聚合最近的价值动态,并在估计价值出现持续停滞或下降时触发干预。当需要干预时,UniIntervene从过去干预事件的内存中检索高价值恢复目标,并通过目标条件化的恢复策略生成可执行的纠正动作。通过这种方式,UniIntervene将干预从被动的人类纠正转变为价值感知的恢复过程,从而实现高效的现实世界强化学习。在多种真实世界操作任务上的大量实验表明,与最先进的HiL-RL基线相比,UniIntervene将平均成功率提高了8.6%,同时将人类干预减少了57%。

English abstract

Human-in-the-loop reinforcement learning (HiL-RL) has emerged as an effective paradigm for real-world robotic manipulation, enabling online policy improvement with human guidance. However, current HiL-RL frameworks remain intervention-intensive, relying on frequent human corrections to redirect the policy out of unproductive exploration, which incurs high labor cost and limits real-world scalability. To address this, we propose UniIntervene, an agentic intervention model that detects unproductive exploration and autonomously recovers the policy toward high-value states, taking over the bulk of interventions from human operators. Specifically, UniIntervene first performs future-conditioned action-value estimation, predicting the latent consequence of the current action and evaluating its induced value, which provides a more stable progress signal. Building on this, a temporal value-risk critic aggregates recent value dynamics and triggers intervention when the estimated value exhibits sustained stagnation or degradation. When intervention is required, UniIntervene retrieves a high-value recovery target from a memory of past intervention episodes and produces executable corrective actions through a goal-conditioned recovery policy. In this way, UniIntervene turns intervention from passive human correction into a value-aware recovery process for efficient real-world RL. Extensive experiments on diverse real-world manipulation tasks demonstrate that UniIntervene improves the average success rate by 8.6% while reducing human interventions by 57% relative to state-of-the-art HiL-RL baselines.
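
The trigger rule described above (intervene on sustained stagnation or degradation of the estimated value) can be illustrated with a hand-written heuristic. This is a sketch only: the paper's temporal value-risk critic is a learned component, and the class name, window size, gain threshold, and patience below are all hypothetical.

```python
from collections import deque

class StagnationTrigger:
    """Fires when the value gain over a sliding window stays small for several steps."""

    def __init__(self, window=5, min_gain=0.01, patience=3):
        self.values = deque(maxlen=window)  # most recent value estimates
        self.min_gain = min_gain            # smallest gain that counts as progress
        self.patience = patience            # consecutive stalled steps before firing
        self.stalled_steps = 0

    def update(self, value):
        """Record a new value estimate and return True if intervention should start."""
        self.values.append(value)
        if len(self.values) < self.values.maxlen:
            return False  # not enough history yet
        gain = self.values[-1] - self.values[0]
        self.stalled_steps = self.stalled_steps + 1 if gain < self.min_gain else 0
        return self.stalled_steps >= self.patience

# Hypothetical usage: values rise, then plateau.
trigger = StagnationTrigger(window=3, min_gain=0.05, patience=2)
print([trigger.update(v) for v in [0.1, 0.2, 0.3, 0.4, 0.4, 0.4, 0.4]])
```

Requiring several consecutive stalled steps, rather than a single one, keeps a brief plateau from triggering an intervention.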

2606.12371 2026-06-11 cs.CV New submission

A Turbo-Inference Strategy for Object Detection and Instance Segmentation

一种用于目标检测和实例分割的涡轮推理策略

Zhen Zhao, Gang Zhang, Xiaolin Hu, Liang Tang

Affiliations: School of Technology, Beijing Forestry University; Department of Computer Science and Technology, Tsinghua University; Beijing National Research Center for Information Science and Technology, Tsinghua University; Chinese Institute for Brain Research (CIBR)

AI summary: Proposes a turbo-inference strategy that iteratively exploits complementary information between detection and segmentation; a turbo-detection head and a turbo-segmentation head form a closed loop that improves the accuracy of both without retraining.

Comments
Preprint version of an article published in Computer Vision and Image Understanding
AI Chinese abstract

目标检测和实例分割任务密切相关。现有的自上而下实例分割方法通常遵循先检测后分割的范式,即先使用初始检测器识别并用边界框定位对象,然后在每个边界框内分割实例掩码。在这种方法中,检测精度直接影响后续分割性能。然而,以往的研究很少探讨实例分割任务对目标检测的影响。本文提出一种用于自上而下方法的涡轮推理策略,该策略迭代利用检测和分割任务之间的互补信息。具体来说,我们设计了两个模块:涡轮检测头和涡轮分割头,它们促进任务之间的通信。这两个模块形成一个闭环,交织检测和分割结果,而无需重新训练模型。在COCO、iFLYTEK和Cityscapes数据集上的综合实验表明,我们的方法在计算成本增加的情况下,显著提高了检测和分割精度。所提出的方法代表了预测精度和推理速度之间的权衡。代码可在以下网址获取:https://this URL。

English abstract

Object detection and instance segmentation tasks are closely related. Existing top-down instance segmentation methods usually follow a detect-then-segment paradigm, where an initial detector is used to recognize and localize objects with bounding boxes, followed by the segmentation of an instance mask within each bounding box. In such methods, the detection accuracy directly influences the subsequent segmentation performance. However, previous research has seldom explored the impact of the instance segmentation task on object detection. In this paper, we present a turbo-inference strategy for top-down methods that iteratively leverages the complementary information between the detection and segmentation tasks. Specifically, we design two modules, a turbo-detection head and a turbo-segmentation head, which facilitate communication between the tasks. The two modules form a closed loop that interlaces the detection and segmentation results without retraining the model. Comprehensive experiments on the COCO, iFLYTEK, and Cityscapes datasets demonstrate that our method substantially enhances both detection and segmentation accuracy at some additional computational cost. The proposed method represents a tradeoff between prediction accuracy and inference speed. Code is available at this https URL.

2606.12370 2026-06-11 cs.LG cs.CL New submission

Breaking Entropy Bounds: Accelerating RL Training via MTP with Rejection Sampling

打破熵界:通过带拒绝采样的多令牌预测加速强化学习训练

Yucheng Li, Huiqiang Jiang, Yang Xu, Jianxin Yang, Yi Zhang, Yizhong Cao, Yuhao Shen, Fan Zhou, Rui Men, Jianwei Zhang, An Yang, Bowen Yu, Bo Zheng, Fei Huang, Junyang Lin, Dayiheng Liu, Jingren Zhou

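Affiliations: Qwen Team, Alibaba Inc

AI summary: Addresses the drop in multi-token prediction (MTP) acceptance rates caused by entropy fluctuation during RL training; proposes Bebop, which uses probabilistic rejection sampling and an end-to-end TV loss to reach acceptance rates of up to 95% and speedups of up to 1.8x.

AI Chinese abstract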

强化学习(RL)已成为现代大型语言模型的关键组成部分,但展开阶段仍是RL训练流程中的主要瓶颈。尽管多令牌预测(MTP)通过推测解码提供了一种自然的加速方案,但许多研究观察到MTP接受率在RL训练期间显著下降,导致加速效果有限。为解决这一瓶颈,我们提出Bebop,对LLM后训练中的MTP进行系统研究,并提供将MTP集成到大规模RL流水线中的实用方案。首先,我们揭示MTP接受率根本上受模型熵波动的限制,其与RL阶段熵的上升呈现清晰的负线性关系。其次,我们证明与贪婪草稿采样相比,概率拒绝采样在很大程度上减轻了RL中熵引入的干扰。我们进一步发现,传统的MTP训练目标(交叉熵或KL)在此类设置中次优,因此我们提出一种新颖的端到端TV损失,直接优化多步拒绝采样接受率,带来约10%的接受率提升,在数学推理、代码生成和智能体任务中实现高达95%的接受率和高达25%的额外推理吞吐量增益。第三,我们测试了RL期间的各种在线MTP训练策略,并表明使用端到端TV损失和拒绝采样的预RL MTP训练在整个RL过程中保持一致的接受率和加速,消除了昂贵的在线MTP更新需求。我们提供了大量实验和分析来验证我们的发现。实验结果表明,我们的方法在Qwen3.5、Qwen3.6和Qwen3.7模型的异步RL训练中实现了高达1.8倍的端到端加速。

English abstract

Reinforcement learning (RL) has become a key component in modern large language models, yet the rollout stage remains the key bottleneck in RL training pipelines. Although Multi-Token Prediction (MTP) offers a natural solution to accelerate rollouts through speculative decoding, many studies have observed that MTP acceptance rates degrade significantly during RL training, leading to limited speedup performance. To address this bottleneck, we present Bebop, a systematic study of MTP in LLM post-training, and offer practical recipes to integrate MTP into large-scale RL pipelines. First, we reveal that the MTP acceptance rate is fundamentally bounded by the fluctuation of model entropy, which demonstrates a clear negative linear relationship with the rise of entropy in the RL stage. Second, we show that probabilistic rejection sampling largely alleviates the disturbance introduced by entropy in RL compared to greedy draft sampling. We further identify that the conventional MTP training objectives (cross-entropy or KL) are suboptimal in such settings, and therefore we propose a novel end-to-end TV loss that directly optimizes multi-step rejection sampling acceptance rate, yielding ~10% acceptance rate improvements, achieving up to 95% acceptance rates and up to 25% extra inference throughput gains across mathematical reasoning, code generation, and agentic tasks. Third, we test various online MTP training strategies during RL and show that pre-RL MTP training with e2e TV loss and rejection sampling achieves a consistent acceptance rate and speedup throughout the entire RL, eliminating the need for costly online MTP updating. We provide extensive experiments and analysis that validate our findings. Experimental results show our method achieves up to 1.8x end-to-end acceleration in async RL training of Qwen3.5, Qwen3.6, and Qwen3.7 models.
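
One way to see why a total-variation (TV) objective relates to acceptance rate: in standard speculative sampling, a draft token drawn from $q$ is accepted with probability $\min(1, p(x)/q(x))$, so the expected acceptance rate is $\sum_x \min(p(x), q(x)) = 1 - \mathrm{TV}(p, q)$. The sketch below checks this single-step identity numerically. It is not the paper's e2e TV loss, whose multi-step form the abstract does not specify, and the distributions and function names are made up for illustration.

```python
import random

def acceptance_probability(p_target, q_draft):
    """Expected single-step acceptance rate: sum_x min(p(x), q(x))."""
    return sum(min(p, q) for p, q in zip(p_target, q_draft))

def total_variation(p_target, q_draft):
    """TV distance between two distributions over the same vocabulary."""
    return 0.5 * sum(abs(p - q) for p, q in zip(p_target, q_draft))

def rejection_sample_step(p_target, q_draft, rng):
    """Draw x ~ q, accept with prob min(1, p(x)/q(x)); on rejection,
    resample from the residual max(p - q, 0), which random.choices normalizes."""
    x = rng.choices(range(len(q_draft)), weights=q_draft)[0]
    if rng.random() < min(1.0, p_target[x] / q_draft[x]):
        return x, True
    residual = [max(p - q, 0.0) for p, q in zip(p_target, q_draft)]
    return rng.choices(range(len(residual)), weights=residual)[0], False

# Hypothetical toy distributions over a three-token vocabulary.
p = [0.5, 0.3, 0.2]   # target model
q = [0.4, 0.4, 0.2]   # draft (MTP) head
rng = random.Random(0)
accepted = sum(rejection_sample_step(p, q, rng)[1] for _ in range(20000))
print(acceptance_probability(p, q), 1 - total_variation(p, q), accepted / 20000)
```

Under this identity, lowering the TV distance between the draft head and the target model raises the per-token acceptance rate directly, whereas cross-entropy or KL only do so indirectly.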

2606.12366 2026-06-11 cs.RO New submission

APT: Action Expert Pretraining Improves Instruction Generalization of Vision-Language-Action Policies

APT: 动作专家预训练提升视觉-语言-动作策略的指令泛化能力

Kechun Xu, Zhenjie Zhu, Anzhe Chen, Rong Xiong, Yue Wang

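Affiliations: Zhejiang University; Zhejiang Humanoid Robot Innovation Center

AI summary: Addresses the poor generalization of continuous action experts to out-of-distribution language instructions; proposes APT, a two-stage training method that first pretrains the action expert as a vision-action prior and then injects language through gated fusion, improving generalization to unseen instructions.

AI Chinese abstract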

视觉-语言-动作(VLA)模型将预训练的视觉-语言模型(VLM)与连续动作专家结合,在操作任务上取得了强劲性能,但对分布外(OOD)语言指令的泛化能力仍然较差。一个已知挑战是VLA数据中的结构不平衡,其中语言的多样性远低于视觉和动作内容,使得策略容易依赖视觉捷径。虽然离散动作方法通过视觉-语言联合训练缓解了这一问题,但连续动作专家缺乏此类保护:它们从随机初始化开始,完全从不平衡数据中学习,产生噪声梯度,破坏VLM并无法利用其语言能力。我们从贝叶斯角度出发,将策略分解为与语言无关的视觉-动作(VA)先验和语言条件化的VLA似然,并提出APT,一种强调动作专家预训练的两阶段训练方法。在第一阶段,动作专家作为VA先验,在冻结的VLM提供的视觉-动作对上进行预训练,绕过了语言不平衡问题。在第二阶段,通过门控融合机制注入语言标记,该机制整合VLM特征的同时保留已学习的视觉运动先验。APT适用于主流VLA架构,包括π和GR00T风格架构。综合实验验证了APT在未见指令和组合任务上取得了一致的性能提升。项目页面:此 https URL

English abstract

Vision-Language-Action (VLA) models that couple pretrained Vision-Language Models (VLMs) with continuous action experts have achieved strong manipulation performance, yet generalization to out-of-distribution (OOD) language instructions remains poor. A known challenge is the structural imbalance in VLA data, where language is far less diverse than visual and action content, making policies prone to visual shortcuts. While discrete-action methods mitigate this through vision-language co-training, continuous action experts lack such protection: they start from random initialization and learn entirely from imbalanced data, producing noisy gradients that corrupt the VLM and fail to exploit its language capability. We address this from a Bayesian perspective, factorizing the policy into a language-agnostic Vision-Action (VA) prior and a language-conditioned VLA likelihood, and propose APT, a two-stage training method emphasizing Action expert PreTraining. In Stage 1, the action expert is pretrained as a VA prior on vision-action pairs from a frozen VLM, bypassing the language imbalance. In Stage 2, language tokens are injected through a gated fusion mechanism that integrates VLM features while preserving the learned visuomotor prior. APT applies to mainstream VLA architectures, including the $\pi$ and GR00T-style architectures. Comprehensive experiments validate that APT achieves consistent gains on unseen instructions and compositional tasks. Project Page: this https URL

2606.12365 2026-06-11 cs.RO cs.AI New submission

Ambient Diffusion Policy: Imitation Learning from Suboptimal Data in Robotics

环境扩散策略:从次优数据中进行机器人模仿学习

Adam Wei, Nicholas Pfaff, Thomas Cohn, Arif Kerem Dayı, Constantinos Daskalakis, Giannis Daras, Russ Tedrake

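Affiliations: MIT

AI summary: Proposes Ambient Diffusion Policy, which extracts useful features from suboptimal data through noise-dependent data usage; it outperforms existing co-training baselines across six tasks, by up to 33%.

Comments
14 pages (main body), 52 pages total. Project website: this https URL
AI Chinese abstract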

我们提出环境扩散策略,一种从机器人次优数据中进行模仿学习的简单且原则性的方法。高质量、特定任务的机器人数据收集昂贵且耗时,而低质量或分布外演示的次优数据集则丰富。现有的在机器人中同时训练两种数据源的方法通常无法分离次优样本中的有意义和有害特征。相比之下,我们的方法通过引入机器人协同训练的新轴:噪声依赖的数据使用,仅提取有用特征。环境扩散策略在训练期间将次优数据的贡献限制在仅高和低扩散时间。为了严格证明我们的方法,我们首先观察到机器人动作数据表现出频谱幂律。这在我们利用的最优扩散策略上引出了两个重要性质:全局到局部层次结构和局部性。我们使用简化模型从理论上形式化这一讨论。我们的实验在六项任务上验证了环境扩散策略对四种类型的次优动作数据(噪声轨迹、模拟到现实差距、任务不匹配和大规模数据混合)的有效性。结果表明,它有效地从任意来源的次优数据中学习。值得注意的是,当扩展到Open X-Embodiment(一个具有异质数据质量和非结构化分布偏移的大规模数据集)时,它比现有协同训练基线高出33%。总体而言,环境扩散策略提高了次优演示的实用性,并扩展了机器人中可用数据源的范围。

English abstract

We propose Ambient Diffusion Policy, a simple and principled method for imitation learning from suboptimal data in robotics. High-quality, task-specific robot data is expensive and time-consuming to collect, while suboptimal datasets with lower-quality or out-of-distribution demonstrations are abundant. Existing methods that co-train on both data sources in robotics often fail to separate the meaningful and the harmful features in the suboptimal samples. In contrast, our method extracts only the useful features by introducing a new axis to co-training in robotics: noise-dependent data usage. Ambient Diffusion Policy restricts the contribution of suboptimal data during training to only the high and low diffusion times. To rigorously justify our approach, we first observe that robot action data exhibits a spectral power law. This induces two important properties on the optimal Diffusion Policy that we exploit: a global-to-local hierarchy and locality. We theoretically formalize this discussion using a simplified model. Our experiments validate Ambient Diffusion Policy on four types of suboptimal action data (noisy trajectories, sim-to-real gap, task mismatch, and large-scale data mixtures) across six tasks. The results show that it effectively learns from arbitrary sources of suboptimal data. Notably, it outperforms existing co-training baselines by up to 33% when scaled to Open X-Embodiment - a large dataset with heterogeneous data quality and unstructured distribution shifts. Overall, Ambient Diffusion Policy increases the utility of suboptimal demonstrations and expands the set of usable data sources in robotics.
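
The noise-dependent data usage described above can be pictured as a per-sample mask over the diffusion time. In this rough sketch, time is assumed to be normalized to $[0, 1]$, and the thresholds `t_low` and `t_high` are hypothetical values, not taken from the paper.

```python
def use_sample(is_suboptimal, t, t_low=0.2, t_high=0.8):
    """Decide whether a sample contributes to the loss at diffusion time t.

    High-quality data is used at every t; suboptimal data only at the
    low-noise and high-noise ends (thresholds are illustrative assumptions).
    """
    if not is_suboptimal:
        return True
    return t <= t_low or t >= t_high

def loss_weights(suboptimal_flags, times):
    """Per-sample 0/1 weights for a batch, to be multiplied into the denoising loss."""
    return [1.0 if use_sample(flag, t) else 0.0
            for flag, t in zip(suboptimal_flags, times)]

# Hypothetical batch: two clean samples and two suboptimal ones.
print(loss_weights([False, False, True, True], [0.5, 0.9, 0.5, 0.9]))
```

The mask leaves the training loop otherwise unchanged, so the same idea could be added to an existing co-training setup by weighting the per-sample loss.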

2606.12364 2026-06-11 cs.LG New submission

On Subquadratic Architectures: From Applications to Principles

关于次二次架构:从应用到原理

Anamaria-Roberta Hartl, Levente Zólyomi, David Stap, Pieter-Jan Hoedt, Niklas Schmidinger, Lukas Hauzenberger, Sebastian Böck, Günter Klambauer, Sepp Hochreiter

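Affiliations: ELLIS Unit Linz, LIT AI Lab, Institute for Machine Learning, Johannes Kepler University Linz; NVIDIA

AI summary: Compares three subquadratic architectures (xLSTM, Mamba-2, and Gated DeltaNet) and finds that xLSTM performs best in code pre-training, distillation, and time-series pre-training, an advantage attributed to its flexible and stable gated memory-correction mechanism.

AI Chinese abstract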

Transformer主导现代序列建模,但其二次注意力机制带来了巨大的计算成本。次二次架构提供了一种可扩展的替代方案。然而,目前尚不清楚哪些设计能产生最有效的序列模型。我们比较了三种领先的方法:xLSTM、Mamba-2和Gated DeltaNet。我们在具有复杂依赖关系的任务上评估这些模型:(1)代码模型预训练,(2)从大型语言模型蒸馏代码模型,以及(3)时间序列基础模型预训练。在这些设置中,xLSTM提供了最强的整体性能。为了解释xLSTM的优势,我们提出了一个统一的公式并分析了底层架构机制,重点关注状态跟踪和记忆动态。我们的结果表明,xLSTM通过其门控方案实现了更灵活和稳定的记忆校正。我们在受控的合成长度泛化任务上证实了这些发现。总体而言,我们的发现表明,xLSTM在复杂任务上的收益源于稳健的状态跟踪和积累。

English abstract

Transformers dominate modern sequence modeling, but their quadratic attention incurs substantial computational cost. Subquadratic architectures offer a scalable alternative. However, it remains unclear which designs yield the most effective sequence models. We compare three leading approaches: xLSTM, Mamba-2, and Gated DeltaNet. We evaluate these models on tasks with complex dependencies: (1) code-model pre-training, (2) distillation of code models from large language models, and (3) pre-training of time-series foundation models. Across these settings, xLSTM delivers the strongest overall performance. To explain xLSTM's advantage, we present a unified formulation and analyze the underlying architectural mechanisms, focusing on state tracking and memory dynamics. Our results show that xLSTM enables more flexible and stable memory correction via its gating scheme. We corroborate these findings on controlled synthetic length-generalization tasks. Overall, our findings indicate that xLSTM's gains on complex tasks stem from robust state tracking and accumulation.

2606.12362 2026-06-11 cs.LG cs.AI New submission

Latent World Recovery for Multimodal Learning with Missing Modalities

缺失模态下的多模态学习中的潜在世界恢复

Hui Wang, Tianyu Ren, Joseph Butler, Christopher Baker, Karen Rafferty, Simon McDade

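Affiliations: Queen's University Belfast

AI summary: Proposes the Latent World Recovery (LWR) framework, which uses neighbor-based latent alignment and availability-aware fusion to make robust multimodal predictions under missing modalities while avoiding errors from explicit reconstruction.

AI Chinese abstract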

我们研究了缺失模态下的多模态学习,特别受到生物科学应用的启发,在这些应用中,当需要做出决策时,异构模态通常仅部分可用。我们提出了潜在世界恢复(LWR),这是一个基于两个关键思想的框架:(i) 来自不同模态的特定模态嵌入在共享潜在空间中对齐,以及 (ii) 通过仅融合在训练和推理时实际可用的模态嵌入来构建统一表示。LWR 不填补缺失模态或要求固定的模态集,而是将每个模态视为对底层潜在状态的部分感知,并直接从观察到的模态执行可用性感知表示学习。这种基于邻居的潜在对齐和可用性感知模态融合的结合,使得在部分观测下能够进行鲁棒的多模态预测,同时避免了显式重构缺失模态带来的误差传播。我们在真实世界的不完整多组学基准上评估了所提出的框架,并证明它为下游任务(如癌症表型分类和生存预测)提供了一种有效的方法。

English abstract

We study multimodal learning under missing modalities, with particular motivation from bioscience applications in which heterogeneous modalities are often only partially available when decisions need to be made. We propose Latent World Recovery (LWR), a framework built on two key ideas: (i) modality-specific embeddings from different modalities are aligned in a shared latent space, and (ii) a unified representation is constructed by fusing only the embeddings of the modalities that are actually available at both training and inference time. Rather than imputing missing modalities or requiring a fixed modality set, LWR treats each modality as a partial perception of an underlying latent state and performs availability-aware representation learning directly from the observed modalities. This combination of neighbor-based latent alignment and availability-aware modality fusion enables robust multimodal prediction under partial observation, while avoiding error propagation from explicit reconstruction of missing modalities. We evaluate the proposed framework on real-world incomplete multi-omics benchmarks and demonstrate that it provides an effective approach to downstream tasks such as cancer phenotype classification and survival prediction.
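
Idea (ii) above, fusing only the embeddings of modalities that are actually present, can be sketched as a masked average. The abstract does not state the fusion operator, so mean pooling is an assumption here, and the function name and toy vectors are hypothetical.

```python
def fuse_available(embeddings, available):
    """Average the embeddings of available modalities; missing ones are skipped,
    not imputed. Mean pooling is an illustrative choice, not the paper's operator."""
    used = [e for e, is_there in zip(embeddings, available) if is_there]
    if not used:
        raise ValueError("at least one modality must be available")
    dim = len(used[0])
    return [sum(e[i] for e in used) / len(used) for i in range(dim)]

# Hypothetical sample with three modalities, the second one missing.
sample = [[1.0, 2.0], None, [3.0, 4.0]]
mask = [True, False, True]
print(fuse_available(sample, mask))
```

Because the missing modality never enters the computation, no reconstruction error can propagate into the fused representation, which is the property the abstract highlights.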

2606.12352 2026-06-11 cs.RO cs.AI New submission

CHORUS: Decentralized Multi-Embodiment Collaboration with One VLA Policy

CHORUS: 基于单一VLA策略的去中心化多体协作

Ria Doshi, Tian Gao, Annie Chen, Chelsea Finn, Jeannette Bohg

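Affiliations: Stanford University

AI summary: Proposes CHORUS, a framework that uses the visuomotor priors of pretrained vision-language-action models to achieve decentralized multi-robot collaboration without inference-time communication, outperforming baselines in real-world experiments.

Comments
Project Website: this https URL
AI Chinese abstract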

多机器人协作使机器人能够高效完成从通过门搬运沙发到建筑工地组装结构等各种任务。然而,在移动多机器人环境中实现这种协调仍然具有挑战性:基于团队联合观测的集中式方法随团队规模扩展性差,而为每个机器人训练一个策略的去中心化方法通常需要显式对齐程序或推理时信息共享来克服部分可观测性。我们的关键见解是,预训练的视觉-语言-动作(VLA)模型的视觉运动先验应能够仅从每个机器人的局部观测实现反应式去中心化协作,无需这些推理时假设。我们提出CHORUS,一个适配单一VLA骨干以控制多样化多机器人团队的框架。推理时,每个机器人运行CHORUS的独立副本,仅基于其自身观测和机器人标识提示。在包括移动卷尺测量、图书馆书籍交接和洗衣篮抬举的真实实验中,CHORUS相比去中心化从头训练模型提升64个百分点,对队友行为的反应性提升40个百分点,并优于集中式基线。这些结果表明,共享VLA骨干能够实现去中心化多机器人协作,无需每个机器人的独立策略或推理时机器人间通信。

English abstract

Multi-robot collaboration allows robots to efficiently take on a wide range of tasks, from moving a couch through a doorway to assembling structures on a construction site. However, achieving such coordination in mobile multi-robot settings remains challenging: centralized methods conditioned on the combined observations of a team scale poorly with team size, and decentralized methods that train one policy per robot often require explicit alignment procedures or information sharing at inference time to overcome partial observability. Our key insight is that the visuomotor priors of pretrained vision-language-action (VLA) models should enable reactive, decentralized collaboration from each robot's local observations alone, without these inference-time assumptions. We propose CHORUS, a framework that adapts a single VLA backbone to control diverse multi-robot teams. At inference time, each robot runs an independent copy of CHORUS, conditioned only on its own observations and a robot-identifying prompt. In real-world experiments including mobile tape measurement, library book handovers, and laundry basket lifting, CHORUS achieves a 64-percentage-point improvement over decentralized, from-scratch models, improves reactivity to teammate behavior by 40 percentage points, and outperforms centralized baselines. Together, these results show that a shared VLA backbone is capable of achieving decentralized multi-robot collaboration, without per-robot policies or inter-robot communication at inference.

2606.12349 2026-06-11 cs.RO eess.SY New submission

Traceable Virtual Sea Trials in the Marine Robotics Unity Simulator for Manoeuvring Assessment of Unmanned Surface Vehicles

面向无人水面艇操纵性评估的海洋机器人Unity仿真器中可追溯虚拟海试

Paria Rezayan

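Affiliations: School of Engineering and Built Environment, Sheffield Hallam University

AI summary: Addresses the difficulty of obtaining data for identifying USV hydrodynamic derivatives by building a standardised virtual sea-trial framework in the MARUS simulator; automated TC/ZZ manoeuvre execution, data acquisition, and a post-processing pipeline produce repeatable datasets aligned with IMO/ITTC metrics, and a case study demonstrates the framework's effectiveness.

AI Chinese abstract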

精确识别水动力导数对于无人水面艇(USV)的控制与导航至关重要,但物理海试的高保真操纵数据受成本和安全性限制。回转试验(TC)和Z形试验(ZZ)仍是IMO和ITTC评估程序的基础。本文扩展了海洋机器人Unity仿真器(MARUS),引入标准化虚拟海试框架,用于TC/ZZ机动的自动化执行和数据生成,包括可追溯的命令-执行日志记录、面向系统辨识(SI)的数据调理以及自动提取符合IMO/ITTC的操纵性指标。一个关键贡献是专用的TC/ZZ数据采集和后处理管道,提高了基于仿真的机动的可重复性和可审计性,同时生成适用于水动力导数辨识和数字孪生工作流的SI就绪数据集。另一个特点是差动推力转向的显式命令-执行分离,其中输入记录为有序的等效舵命令,而实际执行则记录为基于施加推力的执行级代理。案例研究结果表明了可重复且合规的机动行为。对于TC试验,左舷和右舷之间的归一化进距差异约为3.9%,战术直径差异约为4.6%至4.7%。对于ZZ试验,±10度和±20度机动下的第一和第二超越角超调量均保持在1度以下,满足IMO标准,而峰值偏航速率约为4.1至5.8度/秒。总体而言,该框架提供了一种可重复且可审计的虚拟海试工作流,用于生成符合IMO/ITTC的数据集,并支持系统辨识、水动力导数估计和数字孪生校准。

English abstract

Accurate identification of hydrodynamic derivatives is essential for control and navigation of Unmanned Surface Vehicles (USVs), but high-fidelity manoeuvring data from physical sea trials are constrained by cost and safety. Turning Circle (TC) and Zig-Zag (ZZ) trials remain fundamental to IMO and ITTC assessment procedures. This paper extends the Marine Robotics Unity Simulator (MARUS) by introducing a standardised Virtual Sea Trial framework for automated execution and data generation of TC/ZZ manoeuvres, with traceable command-actuation logging, system-identification (SI)-focused data conditioning, and automated extraction of IMO/ITTC-aligned manoeuvring metrics. A key contribution is a dedicated TC/ZZ data acquisition and post-processing pipeline, improving the repeatability and auditability of simulator-based manoeuvres while producing SI-ready datasets for hydrodynamic-derivative identification and digital-twin workflows. Another feature is explicit command-execution separation for differential-thrust steering, where inputs are recorded as ordered rudder-equivalent commands and realised actuation is logged as an execution-level proxy derived from applied thrust. Case-study results demonstrate repeatable and compliant manoeuvre behaviour. For TC tests, the normalised advance differs by approximately 3.9 percent between port and starboard sides, while the tactical diameter differs by approximately 4.6 to 4.7 percent. For ZZ tests, first and second overshoot excesses remain below 1 degree for both +/- 10 degree and +/- 20 degree manoeuvres, satisfying IMO criteria, while peak yaw rates range from approximately 4.1 to 5.8 deg/s. Overall, the framework provides a repeatable and auditable virtual sea-trial workflow for generating IMO/ITTC-aligned datasets and supporting system identification, hydrodynamic-derivative estimation, and digital-twin calibration.

2606.12346 2026-06-11 cs.CV cs.AI cs.LG New submission

Atlas H&E-TME: Scalable AI-Based Tissue Profiling at Expert Pathologist-Level Accuracy

Atlas H&E-TME:基于AI的可扩展组织分析,达到专家病理学家级别的准确性

Kai Standvoss, Miriam Hägele, Rosemarie Krupar, Julika Ribbat-Idel, Jennifer Altschüler, Gerrit Erdmann, Hans Pinckaers, Evelyn Ramberger, Madleen Drinkwitz, Ádám Nárai, Alexander Möllers, Katja Lingelbach, Sebastian Kons, Lukas Hönig, Recepcan Adigüzel, Joana Baião, Alberto Megina Gonzalo, Marius Teodorescu, Marie-Lisa Eich, Paolo Chetta, Shakil Merchant, Verena Aumiller, Simon Schallenberg, Andrew Norgan, Klaus-Robert Müller, Lukas Ruff, Maximilian Alber, Frederick Klauschen

Affiliations: Aignostics, Germany; Institute of Pathology, Charité – Universitätsmedizin Berlin, Germany; Berlin Institute of Health, Charité – Universitätsmedizin Berlin, Germany; Massachusetts General Hospital, Department of Pathology, Harvard Medical School, Boston, MA, US; Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, MN, US; Machine Learning Group, Technische Universität Berlin, Germany; BIFOLD – Berlin Institute for the Foundations of Learning and Data, Germany; Department of Artificial Intelligence, Korea University, Republic of Korea; Max-Planck Institute for Informatics, Germany; German Cancer Research Center (DKFZ) & German Cancer Consortium (DKTK), Berlin & Munich Partner Sites, Germany; Institute of Pathology, Ludwig-Maximilians-Universität München, Germany; Bavarian Cancer Research Center (BZKF), Germany

AI summary: Proposes Atlas H&E-TME, a system that uses pathology foundation models to predict tissue quality, tissue regions, and cell types; validated against an IHC-informed consensus and benchmarked on 200,000+ annotations, it matches or exceeds pathologist performance across multiple cancer types.

AI Chinese abstract

苏木精和伊红(H&E)染色是组织病理学的基石,然而对H&E全切片图像(WSI)进行可扩展的定量分析仍然是计算病理学中的核心挑战。我们提出了Atlas H&E-TME,这是一个基于Atlas病理基础模型家族的AI系统,可预测多种癌症类型的组织质量、组织区域和细胞类型标签,在细胞级分辨率下每张切片产生超过4,500个定量读数。验证此类系统的关键挑战在于克服H&E-only金标准固有的形态模糊性,以及依赖免疫组织化学(IHC)等模态的更可靠参考的可扩展性有限。我们通过一个双重验证框架解决了这一问题,该框架将生物学深度的基础与技术及形态学的广度相结合。在深度方面,我们提出了一种IHC引导的多病理学家共识协议,该协议显著提高了相较于传统H&E-only注释的评分者间一致性。这产生了一个分子学基础的参考,我们据此比较Atlas H&E-TME和仅使用H&E的病理学家。在广度方面,我们在超过20万个高置信度H&E-only病理学家注释上对Atlas H&E-TME进行了基准测试,这些注释涵盖1,500多个病例,跨越八种癌症类型及其最常见的转移部位,亚型覆盖每种癌症类型>90%的临床病例,来自25个以上来源和8种以上扫描仪型号。与IHC引导的共识相比,Atlas H&E-TME达到或超过了病理学家仅使用H&E的性能,并在这一广泛的形态学和技术范围内一致且稳健地泛化。通过这种方式,Atlas H&E-TME将H&E切片——病理学中最普遍的数据——转化为一个可扩展的、定量的肿瘤及其微环境窗口,为转化和临床研究中下一代基于组织的生物标志物奠定了基础。

English abstract

Hematoxylin and eosin (H&E) staining is the cornerstone of histopathology, yet scalable, quantitative analysis of H&E whole-slide images (WSIs) remains a central challenge in computational pathology. We present Atlas H&E-TME, an AI-based system built on the Atlas family of pathology foundation models that predicts tissue quality, tissue region, and cell type labels across multiple cancer types, yielding over 4,500 quantitative readouts per slide at cell-level resolution. A key challenge to validating such systems is overcoming morphological ambiguity inherent to H&E-only ground truth and the limited scalability of more informed references drawing on modalities such as immunohistochemistry (IHC). We address this with a dual validation framework combining biologically grounded depth with technical and morphological breadth. For depth, we propose an IHC-informed multi-pathologist consensus protocol that substantially improves inter-rater agreement over conventional H&E-only annotation. This yields a molecularly grounded reference against which we compare Atlas H&E-TME and pathologists working from H&E alone. For breadth, we benchmark Atlas H&E-TME on over 200,000 high-confidence H&E-only pathologist annotations across 1,500+ cases spanning eight cancer types and their most common metastatic sites, with subtypes covering >90% of clinical cases per cancer type, drawn from 25+ sources and 8+ scanner models. Benchmarked against the IHC-informed consensus, Atlas H&E-TME matches or exceeds pathologist H&E-only performance and generalizes consistently and robustly across this broad morphological and technical scope. In doing so, Atlas H&E-TME turns the H&E slide -- the most ubiquitous data in pathology -- into a scalable, quantitative window into the tumor and its microenvironment, laying a foundation for the next generation of tissue-based biomarkers in translational and clinical research.

2606.12344 2026-06-11 cs.LG cs.CL New submission

Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks

Claw-SWE-Bench:评估OpenClaw风格代理框架在编码任务上的基准

Mengyu Zheng, Kai Han, Boxun Li, Haiyang Xu, Yuchuan Tian, Wei He, Hang Zhou, Jianyuan Guo, Hailin Hu, Lin Ma, Chao Xu, Guohao Dai, Lixue Xia, Yunchao Wei, Yunhe Wang, Yu Wang

Affiliations: TokenRhythm Technologies; Infinigence AI; Peking University; City University of Hong Kong; SEE Fund; Shanghai Jiaotong University; Beijing Jiaotong University; Tsinghua University

AI summary: Proposes the Claw-SWE-Bench benchmark, which evaluates heterogeneous agent harnesses uniformly through an adapter protocol; finds that adapter design is critical to coding performance and that model and harness choice significantly affect pass rate and cost.

AI Chinese abstract

通用代理(如OpenClaw)越来越多地被用作自主工具使用者,但其编码能力难以在SWE-bench下衡量:通用代理本身不满足评分所需的干净Docker工作区、补丁和预测合约。我们引入了Claw-SWE-Bench,一个多语言SWE-bench风格的基准和适配器协议,使异构代理框架(即claws)在公平设置下具有可比性,包括固定提示、运行时预算、工作区合约、补丁提取过程和评估器。完整基准包含8种语言、43个仓库的350个GitHub问题解决实例,这些实例来自SWE-bench-Multilingual和SWE-bench-Verified-Mini,经过未来提交清理。我们还发布了Claw-SWE-Bench Lite用于更快验证,这是一个通过成本感知、排名感知程序从17个校准列中选出的80个实例子集。在完整基准上,使用最小直接差异适配器的OpenClaw仅获得19.1%的Pass@1,而完整适配器在相同GLM 5.1骨干下达到73.4%,表明适配器设计对于使OpenClaw风格的框架有效执行编码任务至关重要。在OpenClaw × 9模型扫描和5框架 × 2模型扫描中,模型选择使Pass@1变化29.4个百分点,固定模型下框架选择变化27.4个百分点;精度相似的系统在总API成本上可能差异很大。因此,Claw-SWE-Bench将框架和成本核算视为SWE风格编码代理评估的第一类轴,提供了完整基准和低成本参考集,用于可重复比较。数据可在https://this URL 和 https://this URL 获取。

English abstract

General-purpose agents such as OpenClaw are increasingly used as autonomous tool users, but their coding ability is difficult to measure under SWE-bench: a generic agent does not by itself satisfy the clean Docker workspace, patch, and prediction contract required for scoring. We introduce Claw-SWE-Bench, a multilingual SWE-bench-style benchmark and adapter protocol that makes heterogeneous agent harnesses, or claws, comparable under fair settings including a fixed prompt, runtime budget, workspace contract, patch extraction procedure, and evaluator. The full benchmark contains 350 GitHub issue-resolution instances across 8 languages and 43 repositories, drawn from SWE-bench-Multilingual and SWE-bench-Verified-Mini after future-commit cleanup. We also release Claw-SWE-Bench Lite for faster validation, which is an 80-instance subset selected by a cost-aware, rank-aware procedure over 17 calibration columns. On the full benchmark, OpenClaw with a minimal direct-diff adapter scores only $19.1\%$ Pass@1, whereas the full adapter reaches $73.4\%$ with the same GLM 5.1 backbone, showing that adapter design is essential for enabling OpenClaw-style harnesses to perform coding tasks effectively. Across an OpenClaw $\times$ nine-model sweep and a five-claw $\times$ two-model sweep, model choice changes Pass@1 by $29.4$ pp and harness choice by $27.4$ pp under fixed models; systems with similar accuracy can differ substantially in total API cost. Claw-SWE-Bench therefore treats harness and cost accounting as first-class axes of SWE-style coding-agent evaluation, providing both a full benchmark and a low-cost reference set for reproducible comparison. The data is available at this https URL and this https URL.

2606.12342 2026-06-11 cs.CL cs.AI cs.ET cs.LG New submission

ALIGNBEAM: Inference-Time Alignment Transfer via Cross-Vocabulary Logit Mixing

ALIGNBEAM: 通过跨词汇表logit混合实现推理时对齐迁移

Chirag Chawla, Pratinav Seth, Vinay Kumar Sankarapu

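Affiliations: Lexsi Labs

AI summary: Addresses the loss of safety caused by domain fine-tuning of large models; proposes ALIGNBEAM, a training-free method that translates anchor-model logits token by token and selects the safest candidate, transferring safety alignment across vocabularies while keeping task accuracy and inference overhead within practical bounds.

AI Chinese abstract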

领域微调会降低大型语言模型的安全性:微调后的专家模型容易顺从以领域语言表述的有害提示。现有的推理时防御方法通过混合来自安全锚模型的logit,但要求两个模型共享词汇表,这使得它们无法用于安全性退化最严重的跨族专家模型。我们提出ALIGNBEAM,一种无需训练的方法,通过在每个解码步骤逐token将锚模型logit翻译为目标模型的词汇表来解除这一限制;然后一个小型LLM法官从K个候选续写中选择最安全的。无需改变权重,并且可以在部署时调整安全-效用权衡而无需重新训练。在跨词汇表和同词汇表评估对中,ALIGNBEAM显著提高了对抗性基准上的拒绝率,同时将任务准确性和推理开销保持在实用范围内。结果表明,安全对齐可以在推理时在不同模型族之间迁移,而无需修改任一模型的权重。

English abstract

Domain fine-tuning degrades the safety of large language models: fine-tuned specialists readily comply with harmful prompts framed in domain language. Existing inference-time defenses that mix logits from a safe anchor model require both models to share a vocabulary, which rules them out for the cross-family specialists where safety is most degraded. We present ALIGNBEAM, a training-free method that lifts this restriction by translating anchor logits into the target model's vocabulary token-by-token at each decoding step; a small LLM judge then selects the safest among K candidate continuations. No weights are changed, and the safety-utility trade-off can be tuned at deployment without retraining. Across both cross-vocabulary and same-vocabulary evaluation pairs, ALIGNBEAM substantially raises refusal on adversarial benchmarks while keeping task accuracy and inference overhead within practical bounds. The results show that safety alignment can be transferred between model families at inference time, without touching either model's weights.

2606.12340 2026-06-11 cs.CV New submission

Echoes of the Prior: A Computational Phenomenology of Forgetting

Gege Gao, Bernhard Schölkopf, Andreas Geiger

Affiliations: Eberhard Karl University of Tübingen; Max Planck Institute for Intelligent Systems; Tübingen AI Center

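AI summary: By inducing synaptic decay in a feed-forward 3D reconstruction model, the work visualizes the subjective phenomenology of forgetting, using the neural network as a cognitive proxy to explore neuromorphic aesthetics.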

Abstract

Memory is not merely the storage of data; it is the scaffolding of reality. When biological memory fades, the world does not simply turn black; it regresses into an unrecognizable chaos. Echoes of the Prior is an interactive installation that attempts to visualize this subjective phenomenology of forgetting. By inducing controlled synaptic decay within a feed-forward 3D reconstruction model, we create an artistic analogy for the erosion of the brain's predictive priors. We position the neural network not as a tool for engineering, but as a cognitive proxy: a silicon brain whose structural degeneration evokes the disorienting, poetic, and terrifying experience of losing one's grip on the world. Ultimately, we offer this framework as a catalyst, inviting the wider community to explore the uncharted potential of neuromorphic aesthetics in visualizing the fragility of intelligence. An interactive demo is available at this https URL.

2606.12339 2026-06-11 cs.SD cs.RO New submission

Fast-SDE: Efficient Single-Microphone Sound Source Distance Estimation in Reverberant Environments

Jiang Wang, Runwu Shi, Yaozhong Kang, Benjamin Yen, Takeshi Ashizawa, Kazuhiro Nakadai

Affiliation: Institute of Science Tokyo

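AI summary: Proposes Fast-SDE, a lightweight single-microphone framework based on subband processing, for efficient sound source distance estimation on resource-constrained robot platforms.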

Comments
To appear in the 35th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN)
Abstract

Sound source distance estimation (SDE) is a critical capability in human-robot interaction. An inappropriate interaction distance not only reduces the reliability of speech acquisition and understanding, but also compromises the naturalness and comfort of the interaction. Most existing SDE methods rely on microphone arrays; however, multi-microphone systems typically require careful hardware synchronization, geometric calibration, and additional space and computational resources, which limits their applicability to size-constrained and compute-limited embodied platforms. To alleviate these issues, we propose Fast-SDE, a lightweight single-microphone SDE framework suited for deployment on robot platforms with limited computational resources and strict size constraints. Specifically, Fast-SDE employs a subband-based backbone that decomposes the frequency axis into multiple subbands, rather than processing the entire spectrum with a wide full-band backbone. A shared subband encoder then maps each subband to a compact latent representation and learns the relationship between acoustic structure and time-frequency patterns. Finally, a lightweight regression head converts the fused subband representations into the estimated distance. Extensive simulation and real-world experiments demonstrate the merits of the proposed method. To benefit the broader research community, we have open-sourced our code at this https URL.
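
The subband decomposition described in the abstract can be sketched as follows. The abstract gives no band count, band widths, or encoder details, so the near-equal-width split, the `num_subbands` parameter, and the per-frame mean used as a stand-in for the learned shared encoder are all assumptions for illustration, not the Fast-SDE implementation.

```python
def split_subbands(spectrogram, num_subbands):
    """Split a [frequency][time] magnitude spectrogram into contiguous,
    near-equal-width subbands along the frequency axis."""
    num_bins = len(spectrogram)
    if not 1 <= num_subbands <= num_bins:
        raise ValueError("num_subbands must be between 1 and the number of bins")
    base, extra = divmod(num_bins, num_subbands)
    bands, start = [], 0
    for i in range(num_subbands):
        width = base + (1 if i < extra else 0)  # first bands absorb the remainder
        bands.append(spectrogram[start:start + width])
        start += width
    return bands

def shared_encoder(band):
    """Placeholder for a learned encoder: mean magnitude per time frame.
    The same function is applied to every subband, i.e. it is shared."""
    num_frames = len(band[0])
    return [sum(row[t] for row in band) / len(band) for t in range(num_frames)]

# Toy spectrogram: 10 frequency bins x 3 time frames.
spec = [[float(f + t) for t in range(3)] for f in range(10)]
bands = split_subbands(spec, num_subbands=4)
latents = [shared_encoder(band) for band in bands]
```

Each subband is much narrower than the full spectrum, so an encoder shared across bands sees small inputs and its parameters are reused for every band. That is one plausible reading of why a subband backbone would be lighter than a wide full-band one.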

2606.12334 2026-06-11 cs.LG cs.RO New submission

Fourier Features Let Agents Learn High Precision Policies with Imitation Learning

Balázs Gyenes, Emiliyan Gospodinov, Jan Frieling, Enrico Krohmer, Nicolas Schreiber, Xiaogang Jia, Niklas Freymuth, Gerhard Neumann

Affiliations: Karlsruhe Institute of Technology; FZI Research Center for Information Technology

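AI summary: Proposes a Fourier feature mapping in the point cloud encoder to counter the low-frequency bias of neural networks that hampers high-precision manipulation, significantly improving performance on multiple benchmarks and on a real robot.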

Comments
Published as a conference paper at ICML 2026
Abstract

High-precision robotic manipulation requires fine-grained spatial reasoning that is often difficult to achieve with RGB-only policies due to depth ambiguity and perspective scale issues. Policies that leverage 3D information directly, such as those based on point clouds, offer a stronger geometric prior than purely image-based ones, yet their performance remains highly task-dependent. We hypothesize that this discrepancy may be due to the spectral bias of neural networks towards learning low-frequency functions, which especially affects architectures conditioned on slow-moving Cartesian features. We thus propose to map point clouds from Cartesian space into high-dimensional Fourier space, effectively equipping the point cloud encoder with direct access to high-frequency features. We experimentally validate the use of Fourier features on challenging manipulation tasks from the RoboCasa and ManiSkill3 benchmarks and on a real robot setup. Despite their simplicity, we find that Fourier features provide significant benefits across diverse encoder architectures and benchmarks and are robust across hyperparameters. Our results indicate that Fourier features let policies leverage geometric details more effectively than Cartesian features, showing their potential as a general-purpose tool for point cloud-based imitation learning. We provide source code and videos on our project page: this https URL
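
One common form of Fourier feature mapping applies sines and cosines at geometrically spaced frequencies to each coordinate. The abstract does not specify the paper's exact mapping, so the power-of-two frequencies, the `num_freqs` parameter, and the function names below are assumptions. This is the positional-encoding style popularized by NeRF, which is not necessarily the authors' variant.

```python
import math

def fourier_features(point, num_freqs=4):
    """Map an (x, y, z) point to [sin(2^k * pi * c), cos(2^k * pi * c)] for
    each coordinate c and k = 0 .. num_freqs - 1 (assumed frequency schedule)."""
    features = []
    for coord in point:
        for k in range(num_freqs):
            angle = (2.0 ** k) * math.pi * coord
            features.append(math.sin(angle))
            features.append(math.cos(angle))
    return features

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two points 1 mm apart along x (coordinates in metres).
p = (0.100, 0.200, 0.300)
q = (0.101, 0.200, 0.300)

cartesian_gap = distance(p, q)
fourier_gap = distance(fourier_features(p, num_freqs=8),
                       fourier_features(q, num_freqs=8))
```

With these settings, the two points are far more separated in the encoded space than in Cartesian space. This is the intuition behind giving an encoder direct access to high-frequency components when a task depends on millimetre-scale differences.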