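arXivDaily daily academic digest: synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and more.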

2606.12412 2026-06-11 cs.CV cs.AI New submission

Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models

Cheng-Yu Yang, Shao-Yuan Lo, Yu-Lun Liu

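Affiliations: National Yang Ming Chiao Tung University; National Taiwan University

AI summary: To address the fact that visual-token importance in vision-language models changes with decoder depth, the paper proposes Reroute, a training-free recoverable routing method that replaces irreversible removal with recoverable routing, improving grounding under aggressive token reduction while maintaining general VQA performance.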

Comments: Code: this https URL

English abstract

Vision-language models (VLMs) project images into hundreds to thousands of visual tokens, making decoder inference expensive in both attention computation and KV-cache memory. Existing visual-token reduction methods largely follow a rank-and-remove paradigm: they score visual tokens, keep a compact subset, and permanently discard the rest. We show that this irreversible action is fragile because visual-token importance changes across decoder depth; tokens ranked low at one stage may become relevant in later layers, especially for grounding-sensitive queries. We propose Reroute, a training-free plug-in that replaces removal with recoverable routing. At each routing stage, selected vision tokens pass through decoder blocks, while deferred tokens bypass the stage and re-enter the candidate pool at the next routing decision. Reroute reuses existing attention-score ranking rules and stage-wise schedules, preserving the theoretical TFLOPs and KV-cache budget class of the pruning method it augments. Across FastV, PDrop, and Nüwa variants on LLaVA-1.5 and Qwen backbones, Reroute improves grounding under aggressive token reduction while maintaining general VQA performance. These results suggest that VLM token reduction should not be viewed only as irreversible pruning, but also as recoverable routing. The code can be found here: this https URL
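
The routing loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `score_fn`, `block_fn`, and `budgets` are hypothetical stand-ins for the attention-score ranking rule, the decoder blocks of one stage, and the stage-wise schedule, and each token's hidden state is reduced to a single float.

```python
from typing import Callable, Dict, List


def reroute(tokens: Dict[int, float],
            budgets: List[int],
            score_fn: Callable[[int, float, int], float],
            block_fn: Callable[[float], float]) -> Dict[int, float]:
    """Toy recoverable routing: deferred tokens skip a stage but stay in the pool."""
    state = dict(tokens)  # token id -> hidden value (a scalar, for simplicity)
    for stage, k in enumerate(budgets):
        # Re-rank ALL tokens at every stage, including previously deferred ones.
        ranked = sorted(state, key=lambda t: score_fn(t, state[t], stage), reverse=True)
        selected = set(ranked[:k])
        for t in selected:
            state[t] = block_fn(state[t])  # selected tokens pass through the blocks
        # Deferred tokens bypass the stage unchanged and are reconsidered next stage.
    return state
```

Unlike rank-and-remove, no token ever leaves `state`, so a token ranked low at stage 0 can still be selected at stage 1.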

2606.12411 2026-06-11 cs.CL cs.LG New submission

Context-Driven Incremental Compression for Multi-Turn Dialogue Generation

Yeongseo Jung, Jaehyeok Kim, Eunseo Jung, Jiachuan Wang, Yongqi Zhang, Ka Chun Cheung, Simon See, Lei Chen

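Affiliations: The Hong Kong University of Science and Technology; NVIDIA AI Technology Center; Shanghai Jiao Tong University; The Chinese University of Hong Kong

AI summary: Proposes Context-Driven Incremental Compression (C-DIC), which uses revisable per-thread compression states and a lightweight retrieve-revise-write-back loop to share information across turns and stabilize performance in long dialogues.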

Comments: Accepted at ICML 2026

English abstract

Modern conversational agents condition on an ever-growing dialogue history at each turn, incurring redundant attention and encoding costs that grow with conversation length. Naive truncation or summarization degrades fidelity, while existing context compressors lack cross-turn memory sharing or revision, causing information loss and compounding errors in long dialogues. We revisit context compression under conversational dynamics and empirically demonstrate its fragility. To improve both efficiency and robustness, we introduce Context-Driven Incremental Compression (C-DIC), which treats a conversation as interleaved contextual threads and stores revisable per-thread compression states in a single, compact dialogue memory. At each turn, a lightweight retrieve, revise, and write-back loop shares information across turns and updates stale memories, stabilizing long-horizon behavior. In addition, we adapt truncated backpropagation-through-time (TBPTT) to our multi-turn setting, learning cross-turn dependencies without full-history backpropagation. Extensive experiments on long-form dialogue benchmarks demonstrate the superior performance and efficiency of C-DIC; notably, C-DIC shows stable inference latency and perplexity over hundreds of dialogue turns, supporting a scalable path to high-quality dialogue modeling.
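
A toy sketch of the retrieve, revise, and write-back loop may help make the per-thread memory concrete. Everything here is an assumption for illustration: the class `ThreadMemory`, the word-overlap retrieval, and the set union standing in for the learned revise step are hypothetical, not the C-DIC implementation.

```python
from typing import Dict, Optional, Set


class ThreadMemory:
    """Toy dialogue memory: one revisable state per contextual thread."""

    def __init__(self) -> None:
        self.states: Dict[int, Set[str]] = {}

    def retrieve(self, turn_words: Set[str]) -> Optional[int]:
        # Pick the thread whose state shares the most words with the new turn.
        best, best_overlap = None, 0
        for tid, state in self.states.items():
            overlap = len(state & turn_words)
            if overlap > best_overlap:
                best, best_overlap = tid, overlap
        return best

    def update(self, turn: str) -> int:
        words = set(turn.lower().split())
        tid = self.retrieve(words)
        if tid is None:
            tid = len(self.states)  # no related thread: open a new one
            self.states[tid] = set()
        # Revise and write back: memory grows per thread, not per turn.
        self.states[tid] = self.states[tid] | words
        return tid
```

The point of the sketch is the shape of the loop: each turn touches one compact state instead of re-encoding the full history.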

2606.12407 2026-06-11 cs.CV New submission

How Seemingly Inconsequential Design Choices Dictate Performance of LLMs in Pathology

Kian R. Weihrauch, Thomas A. Buckley, William Lotter, Arjun K. Manrai

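Affiliations: Massachusetts Institute of Technology; Harvard Medical School; Dana-Farber Cancer Institute

AI summary: A systematic factorial analysis finds that adjusting the input configuration, such as patch size and magnification, substantially improves general-purpose LLMs on pathology slide tasks and narrows the gap to specialized models.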

English abstract

General-purpose large language models (LLMs) are routinely used as baselines when evaluating specialized pathology models on whole-slide images (WSIs). Because WSIs exceed contemporary model context limits, LLM baselines routinely use small, high-magnification patches processed independently via majority voting, without systematic evaluation of seemingly inconsequential design choices such as patch size, patch count, and magnification. Generalist LLMs have consistently underperformed specialized systems, reinforcing the perception that domain-specific training or architectural adaptation is necessary for pathology tasks involving WSIs. Here, we conduct a systematic factorial analysis of four input design factors: inference mode, patch size, magnification, and patch count. We demonstrate that prior studies have overstated the gap between specialized models and general-purpose LLMs by choosing non-optimized input configurations. On the MultiPathQA benchmark, switching to a single balanced configuration (large patches at lower magnification, processed jointly) raises GPT-5 from 15.1% to 39.5% on cancer-type classification (TCGA) and from 38.1% to 62.9% on organ classification (GTEx). Per-task optimization yields further gains up to 43.9% (TCGA) and 71.6% (GTEx). The same configuration generalizes to two other models and to a fully held-out CPTAC cohort, where it improves Gemini 3 Flash by 23.4 percentage points without any task-specific tuning.
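
The two inference modes contrasted in the abstract differ only in how patches reach the model, which a short sketch can show. The callables `classify_one` and `classify_many` are hypothetical placeholders for a model call; nothing here reflects the paper's actual prompts or code.

```python
from collections import Counter
from typing import Callable, List, Sequence


def majority_vote(patches: Sequence[str], classify_one: Callable[[str], str]) -> str:
    """Independent mode: one model call per patch, then a vote over the labels."""
    votes = Counter(classify_one(p) for p in patches)
    return votes.most_common(1)[0][0]


def joint_prediction(patches: Sequence[str], classify_many: Callable[[List[str]], str]) -> str:
    """Joint mode: a single model call that sees all patches together."""
    return classify_many(list(patches))
```

For the reported GPT-5 numbers, the switch to the balanced configuration corresponds to gains of 24.4 percentage points on TCGA (15.1% to 39.5%) and 24.8 on GTEx (38.1% to 62.9%).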

2606.12406 2026-06-11 cs.RO cs.AI cs.LG eess.SY New submission

FACTR 2: Learning External Force Sensing for Commodity Robot Arms Improves Policy Learning

Steven Oh, Jason Jingzhou Liu, Tony Tao, Philip Han, Kenneth Shaw, Satoshi Funabashi, Ruslan Salakhutdinov, Deepak Pathak

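Affiliations: Carnegie Mellon University; Waseda University

AI summary: Proposes NEXT, a data-driven method that needs no dedicated force sensors; it trains in 1 minute from 10 minutes of free-motion data, achieves estimates comparable to dedicated joint-torque sensors, and, combined with the FIRST resampling strategy, improves policy learning.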

Comments: Website at this https URL

English abstract

Contact-rich manipulation requires force sensitivity, but many robot arms lack dedicated force sensors due to their high cost. We present Neural External Torque Estimation (NEXT), a data-driven method that estimates external joint torques without needing any dedicated force sensors. NEXT trains in 1 minute from only 10 minutes of free-motion data, yet achieves estimates comparable to dedicated joint-torque sensors. NEXT enables force-feedback teleoperation on low-cost arms and improves policy learning through Force-Informed Re-Sampling Training (FIRST), which up-samples pre-contact and contact segments during behavior cloning. Across five long-horizon tasks, FIRST outperforms prior force-aware policies by over 17% in task progress. Together, NEXT and FIRST bring force-aware teleoperation and policy learning to off-the-shelf robots without additional sensing hardware. Video results and code are available at this https URL
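
Up-sampling pre-contact and contact segments during behavior cloning can be sketched as weighted sampling over training indices. The phase labels, the `boost` factor, and both function names are assumptions made for illustration; the abstract does not specify how FIRST assigns weights.

```python
import random
from typing import List, Sequence


def first_weights(phases: Sequence[str], boost: float = 3.0) -> List[float]:
    """Give pre-contact and contact samples a larger sampling weight."""
    return [boost if p in ("pre_contact", "contact") else 1.0 for p in phases]


def sample_batch(phases: Sequence[str], batch_size: int,
                 boost: float = 3.0, seed: int = 0) -> List[int]:
    """Draw training indices with replacement, up-sampling contact-related segments."""
    rng = random.Random(seed)
    weights = first_weights(phases, boost)
    return rng.choices(range(len(phases)), weights=weights, k=batch_size)
```

In this sketch the phase labels would come from the estimated external torque, for example by thresholding its magnitude, which is again an assumption rather than the paper's stated procedure.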

2606.12403 2026-06-11 cs.RO New submission

World Pilot: Steering Vision-Language-Action Models with World-Action Priors

Zefu Lin, Rongxu Cui, Junjia Xu, Xiaojuan Jin, Wenling Li, Lue Fan, Zhaoxiang Zhang

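Affiliations: Institute of Automation, Chinese Academy of Sciences (CASIA); Nanjing University; Beihang University

AI summary: Proposes the World Pilot framework, which gives VLA models a scene-evolution prior and trajectory-level motion hints through two pathways driven by a World-Action Model (WAM), Latent Steering and Action Steering; it reaches an 84.7% Total success rate on the LIBERO-Plus zero-shot OOD benchmark and the highest success rate across multiple real-robot manipulation tasks.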

Comments: Project Website: this https URL

English abstract

Vision-Language-Action (VLA) models inherit semantic grounding from large-scale pretraining and perform competently across in-distribution manipulation tasks. This grounding, however, is built on static image-text pairs, whereas manipulation is a continuous, contact-rich process whose dynamics this pretraining cannot capture. We present World Pilot, a VLA framework that augments the policy with priors from a World-Action Model (WAM), routed into the decision chain through two complementary pathways. Latent Steering conditions the perception layer on a scene-evolution latent, and Action Steering supplies an anticipated trajectory as a motion prior to the action generator. Together the two priors equip the VLA with an anticipated view of the scene and a trajectory-level motion hint alongside its semantic conditioning, and the scene-evolution prior remains effective even when supplied by a video-pretrained world model that has not been action-post-trained. World Pilot attains a state-of-the-art Total success rate of 84.7% on the LIBERO-Plus zero-shot OOD benchmark and the highest success rate on every real-robot setting across four manipulation tasks, with the largest margins under shifts in viewpoint, geometry, deformable state, and pose. Project Website: this https URL

2606.12402 2026-06-11 cs.RO cs.AI cs.CV New submission

DIRECT: When and Where Should You Allocate Test-Time Compute in Embodied Planners?

Jadelynn Dao, Milan Ganai, Yasmina Abukhadra, Ajay Sridhar, Mozhgan Nasr Azadani, Katie Luo, Clark Barrett, Jiajun Wu, Chelsea Finn, Marco Pavone

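Affiliations: Stanford University; University of Waterloo; NVIDIA

AI summary: Proposes DIRECT, a routing framework that allocates compute per prompt based on multimodal scene context and improves the success-cost Pareto frontier; experiments show that different scaling axes yield different capability gains, and on a physical robot the router matches or exceeds a stronger model at lower latency.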

English abstract

Vision-Language Models (VLMs) are increasingly deployed as high-level planners for embodied agents, with an emerging strategy of scaling test-time compute to improve capability. However, we observe that doing so increases latency, token usage, and FLOPs while yielding uneven, often diminishing gains in downstream success, limiting where embodied agents can be deployed. We argue that choosing when and where to spend test-time compute is central to bringing frontier performance to the real world. We introduce DIRECT, a routing framework that uses multimodal scene context to allocate compute per prompt, improving the success-cost Pareto frontier over fixed model selection. Across three dominant scaling axes, namely chain-of-thought depth, model size, and memory history, our experiments on VLABench and RoboMME show that test-time compute is not a uniform lever: different axes yield qualitatively distinct capability gains. We validate these insights on a physical Franka arm in a DROID setup spanning zero-shot manipulation and long-horizon chaining, where our router matches or exceeds a stronger model's success rate at up to 65% lower average latency. Ultimately, our results show that naively scaling test-time compute is wasteful, and that DIRECT can provide frontier-level embodied planning in robotic systems at a fraction of the cost. Project page can be found at this http URL.
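
Per-prompt compute allocation can be pictured as choosing among configurations with different costs. The sketch below is a guess at the general shape of such a router, not DIRECT itself: the `Config` type, the `target` threshold, and the assumption that a success predictor already exists are all hypothetical.

```python
from typing import Dict, NamedTuple


class Config(NamedTuple):
    name: str
    cost: float  # e.g. expected latency in seconds (hypothetical units)


def route(predicted_success: Dict[Config, float], target: float) -> Config:
    """Pick the cheapest configuration predicted to reach the target success rate."""
    good_enough = [c for c, p in predicted_success.items() if p >= target]
    if good_enough:
        return min(good_enough, key=lambda c: c.cost)
    # Nothing reaches the target: fall back to the most promising configuration.
    return max(predicted_success, key=predicted_success.get)
```

Easy prompts are served by the cheap configuration, and only prompts the cheap one is predicted to fail are escalated, which is how a router can trade little success for much lower average latency.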

2606.12400 2026-06-11 cs.CL cs.IR New submission

Doc-to-Atom: Learning to Compile and Compose Memory Atoms

Xingjian Diao, Wenbo Li, Yashas Malur Saidutta, Avinash Amballa, Lazar Valkov, Srinivas Chappidi

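Affiliations: AI Center-Mountain View, Samsung Electronics; Dartmouth College

AI summary: Proposes the Doc2Atom framework, which decomposes documents into semantically typed knowledge atoms compiled into micro-LoRA adapters; a lightweight query router selects the relevant atoms and assembles them into a query-specific adapter, addressing the interference and scalability problems of document compression and outperforming Doc-to-LoRA on six QA benchmarks.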

Comments: 20 pages

English abstract

Long input sequences are central to document understanding and multi-step reasoning in Large Language Models, yet the quadratic cost of attention makes inference both memory-intensive and slow. Context distillation mitigates this by compressing contextual information into model parameters, and recent work such as Doc-to-LoRA amortizes context distillation into a single forward pass that generates one LoRA adapter per document. However, producing a single monolithic adapter for all queries leads to irrelevant-query interference, limited compositional recall, and poor scalability to long-document reasoning. To address these challenges, we propose Doc-to-Atom (Doc2Atom), a compositional parametric memory framework that decomposes each document into semantically typed knowledge atoms. Each atom is compiled into an independent micro-LoRA adapter and a provenance retrieval key. At inference time, a lightweight query router selects and assembles only the relevant atoms into a query-specific adapter, which is then injected into a frozen base model. The entire system is trained end-to-end through a multi-objective distillation framework. Experiments on six diverse QA benchmarks demonstrate that Doc2Atom outperforms Doc-to-LoRA baselines while reducing the memory cost of document internalization.
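
Selecting atoms by retrieval key and assembling a query-specific adapter might look roughly like the sketch below. It assumes that each atom is a triple `(key, B, A)`, that relevance is a dot product between the query and the key, and that assembly means summing the low-rank updates `B @ A`; the abstract confirms none of these details.

```python
from typing import List, Optional, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product, enough for a toy example."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def compose_adapter(query: List[float],
                    atoms: List[Tuple[List[float], Matrix, Matrix]],
                    top_k: int) -> Optional[Matrix]:
    """Sum the low-rank updates B @ A of the top_k atoms whose keys best match the query."""
    scored = sorted(atoms,
                    key=lambda atom: sum(q * k for q, k in zip(query, atom[0])),
                    reverse=True)
    delta: Optional[Matrix] = None
    for _key, b, a in scored[:top_k]:
        update = matmul(b, a)
        delta = update if delta is None else [
            [d + u for d, u in zip(d_row, u_row)] for d_row, u_row in zip(delta, update)
        ]
    return delta  # None when no atom was selected
```

Because only the selected atoms contribute, an unrelated query does not receive the weight update of an irrelevant part of the document, which is the interference problem the abstract attributes to a single monolithic adapter.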

2606.12397 2026-06-11 cs.LG cs.AI cs.CL New submission

Redesign Mixture-of-Experts Routers with Manifold Power Iteration

Songhao Wu, Ang Lv, Ruobing Xie, Yankai Lin

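Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; Large Language Model Department, Tencent

AI summary: Proposes aligning each router row with the principal singular direction of its expert matrix and redesigns the router with Manifold Power Iteration (MPI), which achieves the alignment through a "Power-then-Retract" paradigm; convergence is proven theoretically, and experiments at 1B to 11B parameters confirm improved model quality.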

Comments: Preprint

English abstract

The router is the cornerstone component of Mixture-of-Experts (MoE) models. Serving as expert proxies, the rows of the router matrix compute their similarity to the MoE inputs to determine which subset of experts is activated. Ideally, each router row is designed to encode the expert matrix into this representative vector, such that its dot product with a token can better reflect token-expert affinity. However, no design principle exists to enforce this condensation. In this paper, we propose to align each router row with the principal singular direction of the associated expert, as this direction provides the most expressive mathematical description of a matrix. Based on this principle, we propose a router redesign with Manifold Power Iteration (MPI). Specifically, it introduces a "Power-then-Retract" paradigm, where a power iteration step is performed on the router weights, followed by a retraction to impose a norm constraint to ensure both efficiency and stability. Theoretically, we show that MPI drives router rows to converge toward the principal singular directions of associated experts. Empirically, we pretrain MoE models at scales from 1B to 11B parameters to confirm that this alignment facilitates more effective MoE models.
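
The "Power-then-Retract" step can be illustrated with classical power iteration on a single vector. This is a textbook sketch under assumptions, not the MPI training procedure: it takes one expert matrix `w`, treats the router row `r` as a vector in its input space, and uses normalization to unit length as the retraction.

```python
import math
from typing import List

Matrix = List[List[float]]


def power_then_retract(w: Matrix, r: List[float], steps: int = 50) -> List[float]:
    """Iterate r <- normalize(W^T W r); r approaches the top right singular vector of W."""
    for _ in range(steps):
        # Power step, part 1: compute W r.
        wr = [sum(w_ij * r_j for w_ij, r_j in zip(row, r)) for row in w]
        # Power step, part 2: compute W^T (W r).
        r = [sum(row[j] * wr_i for row, wr_i in zip(w, wr)) for j in range(len(r))]
        # Retraction: project back onto the unit sphere (assumes a nonzero vector).
        norm = math.sqrt(sum(x * x for x in r))
        r = [x / norm for x in r]
    return r
```

For a diagonal matrix with singular values 3 and 1, each step shrinks the ratio of the minor component to the major one by a factor of 9, so the vector converges quickly to the principal direction.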

2606.12396 2026-06-11 cs.CV cs.RO New submission

VLGA: Vision-Language-Geometry-Action Models for Autonomous Driving

Jin Yao, Dhruva Dixith Kurra, Tom Lampo, Zezhou Cheng, Danhua Guo, Burhan Yaman

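Affiliations: Uber AV Labs; University of Virginia

AI summary: Proposes VLGA, which introduces geometry as a fourth modality supervised by a per-pixel pointmap regression loss to reconstruct the dense 3D world, reaching state-of-the-art results on nuScenes and Bench2Drive.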

Comments: Project page: this https URL

English abstract

Vision-language-action (VLA) models can describe scenes and reason about them in language, yet still struggle to ground their actions in the dense 3D world around them. Existing approaches either inject features from a frozen 3D foundation model without an objective that ensures the policy uses them, or constrain geometry with sparse box and map losses that provide no dense spatial signal. We introduce VLGA, the first vision-language-action model supervised to reconstruct the dense 3D world it drives through. VLGA introduces geometry as a fourth modality alongside vision, language, and action through a dedicated expert supervised by a per-pixel pointmap regression loss against LiDAR. Extensive experiments conducted on challenging nuScenes and Bench2Drive datasets for open-loop and closed-loop evaluations, respectively, show the superiority of VLGA over counterpart VLA methods. In particular, on open-loop nuScenes, VLGA sets a new state of the art among VLA methods without ego status, with the lowest L2 (0.50 m average) and 3-second collision rate (0.18%). On closed-loop Bench2Drive, VLGA attains the state-of-the-art driving score of 79.08, +0.71 over the strongest prior VLA, at comparable efficiency and comfort.
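
A per-pixel pointmap regression loss against LiDAR can be sketched as a masked distance between predicted and measured 3D points. The L1 distance and the convention that `None` marks a pixel without a LiDAR return are assumptions; the abstract does not state the exact loss form.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float, float]


def pointmap_l1_loss(pred: List[List[Point]],
                     lidar: List[List[Optional[Point]]]) -> float:
    """Mean L1 distance between predicted and LiDAR 3D points over supervised pixels."""
    total, count = 0.0, 0
    for pred_row, lidar_row in zip(pred, lidar):
        for p, target in zip(pred_row, lidar_row):
            if target is None:  # no LiDAR return projected onto this pixel
                continue
            total += sum(abs(a - b) for a, b in zip(p, target))
            count += 1
    return total / count if count else 0.0
```

Every pixel with a LiDAR return contributes a term, which is what makes the signal dense compared with the sparse box and map losses mentioned in the abstract.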

2606.12395 2026-06-11 cs.CR New submission

MARCIM-WG: A cyber wargame proposal based on math modeling applied in a naval scenario

Diego Cabuya-Padilla, Daniel Díaz-López, Carlos Castaneda-Marroquín

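AI summary: Proposes MARCIM-WG, a learning-oriented cyber wargame designed following the NATO methodology that combines a physical board with computational simulation; it is specified through high-level and low-level design and validated in a fictional maritime scenario, where the intervention group's situational-awareness competency improves by 34 percentage points.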

Comments: 8 pages, 5 figures, 2 tables, paper in proceedings of the XI National Cybersecurity Research Conference (JNIC) in Barcelona, Spain, May, 2026

English abstract

As maritime operations increasingly depend on interconnected digital ecosystems, cyber incidents can propagate across maritime networks and degrade critical services. Strengthening strategic Cyber Situational Awareness (CSA) therefore requires training mechanisms that expose decision-makers to evolving attack dynamics, constrained resources, and the need to align actions with incident-response procedures. This paper introduces MARCIM-WG, a learning-oriented maritime cyberdefense wargame designed following the NATO wargaming methodology and implemented as a hybrid tabletop experience combining a physical board (tokens, indicators, and special cards) with analytically-assisted adjudication supported by a computational simulation model. The proposal is specified through High-Level Design (HLD) and Low-Level Design (LLD) specifications and instantiated in a fictional maritime cyber crisis scenario to enable structured decision cycles, friction, and measurable consequences. Validation combines (i) an operational scenario-based assessment under three configurations (pessimistic, neutral/most likely, optimistic) to verify decision sensitivity and outcome coherence, and (ii) a CSA competency and learning-outcome evaluation using a comparative design against an equivalent control group. Results show a +34.0 percentage-point improvement in the intervention group, with the largest gains in comprehension-related competencies.

2606.12392 2026-06-11 cs.CL cs.AI New submission

System Report for CCL25-Eval Task 5: New Dataset and LoRA-Fine-Tuned Qwen2.5

Haotao Xie

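Affiliations: The Hangzhou International Innovation Institute, Beihang University

AI summary: For classical poetry translation and emotional understanding, the paper builds CCPoetry-49K, a high-quality instruction dataset, and fine-tunes Qwen2.5-14B with LoRA to obtain PoetryQwen, which scores 0.757 on CCL25-Eval Task 5, a 9.7% improvement over the baseline.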

English abstract

Recently, large language models (LLMs) have achieved promising progress in the fields of classical Chinese translation and the generation of classical poetry. However, domain-specific research on precise translation and affective-semantic understanding of classical poetry remains limited. The main challenge is that most studies treat the poetic appreciation task as a general-domain problem, neglecting the distinctive features of poetic appreciation, while high-quality and domain-specific datasets are extremely limited. To address this limitation, we decompose the task into three subtasks: term interpretation, semantic interpretation, and emotional inference. Based on multiple open-source datasets, we perform data cleansing and alignment to construct the Classical Chinese Poetry Instruction Pair Dataset (CCPoetry-49K), which comprises 49,404 high-quality instruction-response pairs explicitly optimized for this domain. We then propose a domain-specialized LLM, called PoetryQwen, by applying Low-Rank Adaptation (LoRA) to fine-tune the Qwen2.5-14B model. Experimental results on the CCL25-Eval Task 5 benchmark demonstrate that PoetryQwen achieves a score of 0.757, representing a 9.7% improvement over the Qwen2.5-14B-Instruct baseline (0.690). These findings clearly indicate that PoetryQwen significantly enhances performance in precise translation and emotional understanding of classical poetry. We present a new dataset and methodological considerations intended to support the domain-specific optimization of LLMs.
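
The reported 9.7% is a relative gain over the baseline score, not an absolute difference, as a quick check shows. The helper name `relative_improvement` is introduced here only for illustration.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


gain = relative_improvement(0.757, 0.690)
print(f"{gain:.1f}%")  # prints 9.7%
```

The absolute difference between the two scores is 0.067.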

2606.12387 2026-06-11 cs.DB cs.AI New submission

TAHOE: Text-to-SQL with Automated Hint Optimization from Experience

Zhiyi Chen, Jie Song, Peng Li

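AI summary: Proposes the TAHOE system, which turns debugging traces into a structured Hint Bank through an error-driven hint learning pipeline and models user intents with a Strategy Layer, substantially improving Text-to-SQL on Spider 2.0-Snow without updating model parameters.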

English abstract

Large Language Models (LLMs) have democratized database access through Text-to-SQL, but moving from prototypes to production remains difficult. Real deployments must handle strict SQL dialects, massive schemas, and evolving user preferences, while supervised fine-tuning is costly and rigid and agentic test-time scaling is expensive. We present Tahoe, a system that treats prompt optimization as a dynamic data management problem. Tahoe uses an error-driven hint learning pipeline across Development and Deployment to consolidate debugging traces into a structured Hint Bank. Compiler feedback is distilled into reusable Syntax Hints for dialect-specific rules, while execution and user feedback are converted into Semantic Hints for schema- and user-specific logic. Tahoe further introduces a Strategy Layer that models conflicting user intents as competing strategies under shared natural-language triggers, with recency signals and post-learning attribution statistics that summarize empirical success, harm, inertness, and support. At inference time, Tahoe retrieves relevant hints and guides the LLM through Logic Planning followed by SQL Synthesis. We implement and evaluate the development-phase workflow, leaving deployment-time human-feedback updates for future work. On Spider 2.0-Snow, Tahoe substantially improves Text-to-SQL without updating model parameters. On 113 supervised Spider 2.0-Snow-0212 examples using GPT-5.5, Tahoe raises pass rate from 61.95 percent to 79.42 percent and pass-at-4 from 72.57 percent to 87.61 percent, achieves 100 percent Snowflake syntax pass rate, and reduces average compiler-feedback critic rounds from 2.79 to 0.12 per sampled candidate. The same Hint Bank also transfers to weaker backbones, including a 19.7 percentage-point pass-rate gain on Doubao-2.0-lite.
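
Retrieving relevant hints at inference time can be pictured with a very small lookup. The record layout (`trigger`, `text`), the word-overlap score, and the function name are hypothetical; the abstract does not describe how Tahoe stores or ranks hints.

```python
from typing import Dict, List


def retrieve_hints(question: str, hint_bank: List[Dict[str, str]], top_k: int = 2) -> List[str]:
    """Rank stored hints by how many trigger words they share with the question."""
    words = set(question.lower().split())
    scored = [(len(words & set(h["trigger"].lower().split())), h["text"]) for h in hint_bank]
    relevant = [(score, text) for score, text in scored if score > 0]
    relevant.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep bank order
    return [text for _, text in relevant[:top_k]]
```

The retrieved hint texts would then be placed in the prompt ahead of the planning and SQL synthesis steps, so the model itself is never updated.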

2606.12386 2026-06-11 cs.LG cs.AI New submission

ATLAS: Active Theory Learning for Automated Science

Noémi Éltető, Nathaniel D. Daw, Kimberly L. Stachenfeld, Kevin J. Miller

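Affiliations: Google DeepMind; Princeton University; Columbia University; University College London

AI summary: Proposes the ATLAS framework, which uses active learning to iterate between generating sparse neural-network hypotheses and designing experiments that optimally distinguish them; it recovers reinforcement learning agents in bandit tasks with 5-10x better sample efficiency than random experimentation.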

English abstract

Advancing scientific understanding through mechanistic modeling requires posing the right experimental questions to yield maximally informative data. To automate this pursuit within cognitive science, we introduce ATLAS (Active Theory Learning for Automated Science), an active learning framework for the data-driven discovery of interpretable behavioral models. ATLAS iterates between generating mechanistic hypotheses, instantiated as a diverse ensemble of sparse neural networks (Disentangled RNNs), and designing experiments that optimally distinguish between them. We test this approach on the problem of recovering reinforcement learning agents from their behavior in bandit tasks. ATLAS designs varied sequences of qualitatively novel experiments with temporal structure tailored to underlying agent characteristics. The models trained on these experiments are evaluated against a comprehensive set of metrics for mechanistic modeling that capture behavioral, structural, and computational similarity. ATLAS achieves a 5-10x improvement in sample efficiency across all metrics compared to random experimentation, and its performance is further validated against expert-designed experiments derived from literature. These in silico results showcase ATLAS's potential to accelerate human-interpretable insights in cognitive science and other domains where scientific inquiry relies on discovering mechanistic models.
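
Designing an experiment that distinguishes competing hypotheses is often approximated by picking the candidate on which an ensemble disagrees most. The sketch below uses the variance of predicted choice probabilities as the disagreement measure; that criterion and the function name are assumptions, since the abstract does not give the ATLAS objective.

```python
from statistics import pvariance
from typing import Any, Callable, Sequence


def most_informative(experiments: Sequence[Any],
                     ensemble: Sequence[Callable[[Any], float]]) -> Any:
    """Return the candidate experiment on which the ensemble's predictions disagree most."""
    def disagreement(experiment: Any) -> float:
        return pvariance([model(experiment) for model in ensemble])
    return max(experiments, key=disagreement)
```

An experiment on which all hypotheses predict the same behavior cannot separate them, so it scores zero and is never preferred over one that splits the ensemble.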

2606.12385 2026-06-11 cs.CL New submission

Which Models Are Our Models Built On? Auditing Invisible Dependencies in Modern LLMs

Sanjay Adhikesaven, Haoxiang Sun, Sewon Min

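Affiliations: University of California, Berkeley; Allen Institute for AI

AI summary: Proposes ModSleuth, a system that recursively reconstructs LLM dependency graphs and reveals hidden dependency issues such as multi-hop license obligations and train-evaluation coupling; the tool and dependency graphs are released to support transparent analysis.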

English abstract

Modern LLM training pipelines increasingly rely on other models to generate data, filter corpora, judge outputs, and guide development decisions. These dependencies are recursive: a model may depend on an upstream artifact whose own dependencies are documented only in separate releases and artifacts. As a result, the full dependency structure is fragmented across heterogeneous public artifacts, with complexity and recursive depth far outpacing humans' ability to trace. We introduce ModSleuth, an agentic system that recursively reconstructs LLM dependency graphs from public artifacts with source-grounded evidence. We find that the primary challenge is no longer information extraction, but defining what constitutes a dependency and reconciling artifact references across inconsistent documentation. We address these challenges through a formalization that distinguishes direct and indirect dependencies, represents heterogeneous pipeline roles through operation-centered relationships, and resolves artifact identities across names, versions, and repositories. Applying ModSleuth to four public-artifact-rich LLM releases, we recover 1,060 source-verified dependencies and construct large-scale dependency graphs of modern LLM development. These graphs reveal multi-hop license obligations, train-evaluation coupling, discrepancies between released and training-time artifacts, and documentation inconsistencies that would otherwise be difficult to uncover. We release ModSleuth and the resulting dependency graphs to support transparent analysis of the increasingly complex ecosystems underlying modern LLMs.
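
The distinction between direct and indirect dependencies amounts to a graph traversal once the graph exists. The sketch below assumes the graph is already given as an adjacency mapping with made-up artifact names; reconstructing it from public artifacts, which is the hard part ModSleuth addresses, is out of scope here.

```python
from typing import Dict, List, Set, Tuple


def split_dependencies(graph: Dict[str, List[str]], root: str) -> Tuple[Set[str], Set[str]]:
    """Return (direct, indirect) dependencies of `root` in a dependency graph."""
    direct = set(graph.get(root, []))
    seen: Set[str] = set()
    stack = list(direct)
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, []))  # follow upstream artifacts recursively
    seen.discard(root)  # guard against cycles that lead back to the root
    return direct, seen - direct
```

Multi-hop license obligations are the kind of issue that only appears in the second set: an artifact the release never names can still sit upstream of it.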

2606.12384 2026-06-11 cs.LG cs.AI new submission

APPO: Agentic Procedural Policy Optimization

Xucong Wang, Ziyu Ma, Yong Wang, Yuxiang Ji, Shidong Yang, Guanhua Chen, Pengkun Wang, Xiangxiang Chu

Affiliations: University of Science and Technology of China; AMAP, Alibaba Group; Southern University of Science and Technology

AI summary: Proposes APPO, which improves credit assignment in agentic reinforcement learning through fine-grained branching and procedure-level advantage scaling, with an average improvement of nearly 4 points on 13 benchmarks.

Comments: 25 pages, including 14 pages of main text and 11 pages of appendix; work in progress

English abstract

Recent advances in agentic Reinforcement Learning (RL) have substantially improved the multi-turn tool-use capabilities of large language model agents. However, most existing methods assign credit over coarse heuristic units, such as tool-call boundaries or fixed workflows, making it difficult to identify which intermediate decisions influence downstream outcomes. In this work, we study agentic RL from two perspectives: where to branch and how to assign credit after branching. Our pilot analysis shows that influential decision points are broadly distributed throughout the generated sequence rather than concentrated at tool calls, while token entropy alone does not reliably reflect their impact on final outcomes. Motivated by these observations, we propose Agentic Procedural Policy Optimization (APPO), which shifts branching and credit assignment from coarse interaction units to fine-grained decision points in the sequence. APPO selects branching locations using a Branching Score that combines token uncertainty with policy-induced likelihood gains of subsequent continuations, enabling more targeted exploration while filtering out spurious high-entropy positions. It further introduces procedure-level advantage scaling to better distribute credit across branched rollouts. Experiments on 13 benchmarks show that APPO consistently improves strong agentic RL baselines by nearly 4 points, while keeping efficient tool-calls and maintaining behavior interpretability.

2606.12382 2026-06-11 cs.NE cs.AI new submission

SPEA2$^+$: Improved Density Estimation in SPEA2 with Provable Runtime Guarantees

Duc-Cuong Dang, Andre Opris, Dirk Sudholt

AI summary: To address the insufficient diversity SPEA2 maintains when handling dominated solutions, proposes SPEA2$^+$, which improves density estimation by using all pairwise distances and achieves the same performance guarantees as other prominent algorithms on the OneTrapZeroTrap benchmark.

Comments: To appear in the Proceedings of PPSN 2026

English abstract

The Strength Pareto Evolutionary Algorithm 2 (SPEA2) is a popular and prominent evolutionary algorithm for solving multi-objective optimisation problems. Despite its popularity, theoretical analyses of SPEA2 have only appeared recently. Moreover, these analyses focus exclusively on how SPEA2 handles non-dominated solutions and disregard the algorithmic components responsible for handling dominated solutions. We conduct a first runtime analysis of SPEA2 for which these components are analysed. We prove that, unlike other prominent algorithms, including NSGA-II, NSGA-III and SMS-EMOA under the same setting of constant population size and duplicate elimination, SPEA2 is unable to cover the Pareto front of the OneTrapZeroTrap benchmark efficiently. Our results indicate that using k-th nearest-neighbour distance in the fitness assignment provides an insufficient signal to maintain diversity among dominated individuals. To address this issue, we propose an improved variant, SPEA2$^+$, that considers all pairwise distances. The new algorithm achieves the same performance guarantees as the other prominent algorithms on OneTrapZeroTrap, while matching the performance of the original SPEA2 on simpler problems. Experimental results complement our theoretical findings.
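
For context, the k-th nearest-neighbour density term that this abstract criticises can be sketched in a few lines of Python. The formula D(i) = 1 / (σ_i^k + 2), where σ_i^k is the distance from individual i to its k-th nearest neighbour in objective space, follows the original SPEA2 description. The function name and the toy points are illustrative assumptions, and the sketch does not attempt to reproduce SPEA2$^+$, whose use of all pairwise distances is not specified in the abstract.

```python
import math


def kth_nn_density(points, k):
    """SPEA2-style density D(i) = 1 / (sigma_i^k + 2) for each objective vector."""
    densities = []
    for i, p in enumerate(points):
        # Distances from individual i to every other individual, ascending.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        sigma_k = dists[k - 1]  # distance to the k-th nearest neighbour
        densities.append(1.0 / (sigma_k + 2.0))
    return densities


# Three toy objective vectors; the third is isolated.
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
densities = kth_nn_density(points, k=1)
```

Only one distance per individual enters the score, so individuals whose k-th neighbour is equally far away receive the same density regardless of how the remaining individuals are arranged. This is consistent with the abstract's claim that the signal can be too weak to maintain diversity among dominated individuals.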

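2606.12378 2026-06-11 cs.CV cs.AI new submission

Illumination-Robust Camera-Based Heart-Rate Estimation for Physiological Sensing in Robots

Zhi Wei Xu, Torbjörn E. M. Nordling

Affiliations: National Cheng Kung University

AI summary: Proposes an end-to-end spatial-temporal transformer framework that combines PRNet-based 3D face alignment, illumination augmentation, residual temporal standardization, and hybrid temporal-frequency supervision. On a dataset with varied illumination it achieves a heart-rate MAE of 0.79 bpm and a correlation of 0.982, a 93.6% error reduction relative to PhysFormer.

Comments: 8 pages, 4 figures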

English abstract

Physiological awareness is important for service, social, and assistive robots that interact with humans in everyday environments. Remote photoplethysmography (rPPG) enables non-contact heart-rate (HR) estimation from an RGB camera, making it a promising sensing modality for robot-mounted vision systems. However, illumination variation remains a major barrier to robust deployment. This paper presents an end-to-end spatial-temporal transformer framework for remote HR estimation on a new dataset with varied illumination. Our estimator integrates PRNet-based 3D face alignment, clip-level illumination augmentation, the Residual Temporal Standardization Module, and controlled hybrid temporal-frequency supervision. The training objective combines a Soft-Shifted Pearson waveform loss with a spectral Kullback-Leibler divergence loss, where a tuned weight ($\beta$) controls the contribution of frequency-domain heart-rate guidance. Experiments on a static all-level mix protocol covering three illumination levels show that $\beta=5$ provides the strongest result among the tested beta settings, achieving a best-run HR mean absolute error (MAE) of 0.79 bpm and an HR correlation of 0.982. Compared with the PhysFormer baseline evaluated on our dataset, our estimator reduces HR MAE by 93.6%, while increasing HR correlation from 0.088 to 0.982, making it usable when illumination varies.

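2606.12375 2026-06-11 cs.CE math.NA physics.comp-ph new submission

A coupled finite element formulation for chemo-mechano-thermodynamical contact and its application to bonding and debonding

Roger A. Sauer

AI summary: Proposes a coupled finite element formulation, based on the contact theory of Sauer et al., for simulating chemo-mechano-thermodynamical large-deformation contact. It focuses on the evolution of bonding and debonding and its coupling to the mechanical and thermal contact state, and illustrates the formulation's generality with several examples.

Comments: 42 pages, 22 figures, 6 tables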

English abstract

This work presents a finite element formulation for coupled chemo-mechano-thermodynamical large deformation contact. The formulation is based on the contact theory of Sauer et al. (2022) that contains six coupled (but separate) fields: the deformation and temperature of the two contacting bodies, as well as an interfacial bonding field and interfacial temperature. The latter is governed by the chemical and mechanical energy dissipation at the interface. Here the focus is placed on the evolution of bonding and debonding, and how it is coupled to the mechanical and thermal contact state. Several elementary models are proposed for this based on a quadratic contact potential. The resulting contact formulation becomes very general and versatile, which is illustrated by several challenging examples. They include pressure- and gap-dependent bonding, exothermic bonding reactions, thermal hardening and thermal expansion, as well as simultaneous bonding and debonding. They are based on a monolithic finite element implementation using classical and isogeometric shape functions together with implicit time integration. Its full linearization, required for the Newton-Raphson solution method, is also provided. If bonding sites are material points, the bonding variable can be condensed out locally.

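2606.12374 2026-06-11 cs.RO cs.CV new submission

Semantically-Aware Diver Activity Recognition Framework for Effective Underwater Multi-Human-Robot Collaboration

Sadman Sakib Enan, Junaed Sattar

Affiliations: University of Minnesota

AI summary: Proposes DAR-Net, a framework that combines transformer-based temporal reasoning with pixel-level scene supervision and uses multi-loss training to align global activity recognition with local human-robot interaction semantics, targeting diver activity recognition in low-visibility underwater environments. Also releases UDA, the first underwater diver activity dataset.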

English abstract

Effective multi-human-robot collaboration is essential for expanding human-led operations in the challenging and high-risk underwater environment. For autonomous underwater vehicles (AUVs) to become true teammates, they must be able to comprehend their surroundings and recognize a diver's activities to offer assistance and ensure safety. Towards this goal, we introduce DAR-Net, a novel transformer-based framework that analyzes complex underwater scenes to classify diver activities. Our contribution lies in a semantically guided learning formulation that couples transformer-based temporal reasoning with pixel-level scene supervision. This multi-loss training strategy explicitly aligns global activity recognition with local human-robot interaction semantics, which is particularly critical in low-visibility underwater conditions. To address the significant challenge of data scarcity in this domain, we present the first-ever Underwater Diver Activity (UDA) dataset, a foundational resource containing over 2,600 annotated images with pixel-level masks. Through rigorous experimental evaluations in a controlled environment, we demonstrate that DAR-Net achieves promising accuracy in recognizing six distinct diver activities, outperforming state-of-the-art models. While this dataset provides a crucial baseline, our work serves as a pioneering step, laying the groundwork for future research and facilitating the development of more intelligent, collaborative underwater robotic systems.

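2606.12373 2026-06-11 cs.CL new submission

Verifiable Environments Are LEGO Bricks: Recursive Composition for Reasoning Generalization

Hao Xiang, Qiaoyu Tang, Le Yu, Yaojie Lu, Xianpei Han, Ben He, Le Sun, Bowen Yu, Peng Wang, Hongyu Lin, Dayiheng Liu

AI summary: Proposes RACES, a framework that treats verifiable environments as recursively composable building blocks and automatically generates composite environments through four composition operators. It improves DeepSeek-R1-Distill-Qwen-14B by an average of 3.1 points on six unseen benchmarks and, with only 50 base environments, matches the performance obtained with 300 environments.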

English abstract

Reinforcement Learning (RL) with verifiable environments has emerged as a powerful approach for enhancing the reasoning capabilities of Large Language Models (LLMs). While prior research demonstrates that scaling environment quantity improves RL performance, existing manual or individual construction methods suffer from linear scaling limits, thereby hindering scalable reasoning generalization. This paper introduces RACES (Recursive Automated Composition for Environment Scaling), a framework that conceptualizes verifiable environments as composable building blocks that can be recursively assembled. The key insight is that when the codomain (output type) of one environment matches the domain (input type) of another, they can be automatically fused into a new verifiable environment, enabling recursive composition. RACES is implemented with 300 individual environments and defines a set of composition operators (SEQUENTIAL, PARALLEL, SORT, and SELECT) that induce diverse reasoning patterns. Extensive experiments show that RL training on these composite environments consistently enhances reasoning generalization. Specifically, RACES improves DeepSeek-R1-Distill-Qwen-14B by an average of 3.1 points (from 48.2 to 51.3) and boosts Qwen3-14B performance from 58.8 to 61.1 on six benchmarks, which are unseen during the construction of training environments. Moreover, RACES achieves performance comparable to training on 300 individual environments using only 50 base environments, demonstrating significant efficiency in environment utilization.
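
The key insight, that two environments can be fused when the codomain of one matches the domain of the other, can be sketched as typed function composition. The Python example below is a toy illustration under stated assumptions, not the RACES implementation: the `Env` class, the `sequential` helper, and the three example environments are hypothetical, and only a sequential-style operator is shown.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Env:
    """A toy verifiable environment: a typed task with a reference solver."""
    name: str
    domain: type      # input type
    codomain: type    # output type
    solve: Callable[[Any], Any]  # reference solver that defines the correct answer

    def verify(self, task_input: Any, answer: Any) -> bool:
        # An answer is accepted when it equals the reference solution.
        return self.solve(task_input) == answer


def sequential(first: Env, second: Env) -> Env:
    """Fuse two environments when first's output type is second's input type."""
    if first.codomain is not second.domain:
        raise TypeError(f"cannot compose {first.name} with {second.name}")
    return Env(
        name=f"{first.name} >> {second.name}",
        domain=first.domain,
        codomain=second.codomain,
        solve=lambda x: second.solve(first.solve(x)),
    )


count_words = Env("count_words", str, int, lambda s: len(s.split()))
is_even = Env("is_even", int, bool, lambda n: n % 2 == 0)
negate = Env("negate", bool, bool, lambda b: not b)

# The composite is itself an Env, so it can be composed again (recursion).
two_step = sequential(count_words, is_even)
three_step = sequential(two_step, negate)
```

Because the result of `sequential` has its own domain, codomain, and verifier, composites can serve as inputs to further composition, which is how a small set of base environments could yield many distinct training environments.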

2606.12372 2026-06-11 cs.RO cs.LG new submission

UniIntervene: Agentic Intervention for Efficient Real-World Reinforcement Learning

Haoyuan Deng, Yitong Gao, Yudong Lin, Haichao Liu, Zhenyu Wu, Ziwei Wang

Affiliations: Nanyang Technological University; Beijing University of Posts and Telecommunications

AI summary: Proposes UniIntervene, an agentic intervention model that detects unproductive exploration and autonomously recovers the policy toward high-value states. On real-world robotic manipulation tasks it raises the average success rate by 8.6% and reduces human interventions by 57%.

Comments: Project page: this https URL

English abstract

Human-in-the-loop reinforcement learning (HiL-RL) has emerged as an effective paradigm for real-world robotic manipulation, enabling online policy improvement with human guidance. However, current HiL-RL frameworks remain intervention-intensive, relying on frequent human corrections to redirect the policy out of unproductive exploration, which incurs high labor cost and limits real-world scalability. To address this, we propose UniIntervene, an agentic intervention model that detects unproductive exploration and autonomously recovers the policy toward high-value states, taking over the bulk of interventions from human operators. Specifically, UniIntervene first performs future-conditioned action-value estimation, predicting the latent consequence of the current action and evaluating its induced value, which provides a more stable progress signal. Building on this, a temporal value-risk critic aggregates recent value dynamics and triggers intervention when the estimated value exhibits sustained stagnation or degradation. When intervention is required, UniIntervene retrieves a high-value recovery target from a memory of past intervention episodes and produces executable corrective actions through a goal-conditioned recovery policy. In this way, UniIntervene turns intervention from passive human correction into a value-aware recovery process for efficient real-world RL. Extensive experiments on diverse real-world manipulation tasks demonstrate that UniIntervene improves the average success rate by 8.6% while reducing human interventions by 57% relative to state-of-the-art HiL-RL baselines.

2606.12371 2026-06-11 cs.CV new submission

A Turbo-Inference Strategy for Object Detection and Instance Segmentation

Zhen Zhao, Gang Zhang, Xiaolin Hu, Liang Tang

Affiliations: School of Technology, Beijing Forestry University; Department of Computer Science and Technology, Tsinghua University; Beijing National Research Center for Information Science and Technology, Tsinghua University; Chinese Institute for Brain Research (CIBR)

AI summary: Proposes a turbo-inference strategy that iteratively exploits the complementary information between detection and segmentation. A turbo-detection head and a turbo-segmentation head form a closed loop that improves the accuracy of both tasks without retraining.

Comments: Preprint version of an article published in Computer Vision and Image Understanding

English abstract

Object detection and instance segmentation tasks are closely related. Existing top-down instance segmentation methods usually follow a detect-then-segment paradigm, where an initial detector is used to recognize and localize objects with bounding boxes, followed by the segmentation of an instance mask within each bounding box. In such methods, the detection accuracy directly influences the subsequent segmentation performance. However, previous research has seldom explored the impact of the instance segmentation task on object detection. In this paper, we present a turbo-inference strategy for top-down methods that iteratively leverages the complementary information between the detection and segmentation tasks. Specifically, we design two modules, a turbo-detection head and a turbo-segmentation head, which facilitate communication between the tasks. The two modules form a closed loop that interlaces the detection and segmentation results without retraining the model. Comprehensive experiments on the COCO, iFLYTEK, and Cityscapes datasets demonstrate that our method substantially enhances both detection and segmentation accuracies with a certain increase in computational cost. The proposed method represents a tradeoff between prediction accuracy and inference speed. Code is available at this https URL.

2606.12370 2026-06-11 cs.LG cs.CL new submission

Breaking Entropy Bounds: Accelerating RL Training via MTP with Rejection Sampling

Yucheng Li, Huiqiang Jiang, Yang Xu, Jianxin Yang, Yi Zhang, Yizhong Cao, Yuhao Shen, Fan Zhou, Rui Men, Jianwei Zhang, An Yang, Bowen Yu, Bo Zheng, Fei Huang, Junyang Lin, Dayiheng Liu, Jingren Zhou

Affiliations: Qwen Team, Alibaba Inc

AI summary: To address the drop in multi-token prediction acceptance rates caused by entropy fluctuation during RL training, proposes Bebop, which uses probabilistic rejection sampling and an end-to-end TV loss to reach acceptance rates of up to 95% and speedups of up to 1.8x.

English abstract

Reinforcement learning (RL) has become a key component in modern large language models, yet the rollout stage remains the key bottleneck in RL training pipelines. Although Multi-Token Prediction (MTP) offers a natural solution to accelerate rollouts through speculative decoding, many studies have observed that MTP acceptance rates degrade significantly during RL training, leading to limited speedup performance. To address this bottleneck, we present Bebop, a systematic study of MTP in LLM post-training, and offer practical recipes to integrate MTP into large-scale RL pipelines. First, we reveal that the MTP acceptance rate is fundamentally bounded by the fluctuation of model entropy, which demonstrates a clear negative linear relationship with the rise of entropy in the RL stage. Second, we show that probabilistic rejection sampling largely alleviates the disturbance introduced by entropy in RL compared to greedy draft sampling. We further identify that the conventional MTP training objectives (cross-entropy or KL) are suboptimal in such settings, and therefore we propose a novel end-to-end TV loss that directly optimizes multi-step rejection sampling acceptance rate, yielding ~10% acceptance rate improvements, achieving up to 95% acceptance rates and up to 25% extra inference throughput gains across mathematical reasoning, code generation, and agentic tasks. Third, we test various online MTP training strategies during RL and show that pre-RL MTP training with e2e TV loss and rejection sampling achieves a consistent acceptance rate and speedup throughout the entire RL, eliminating the need for costly online MTP updating. We provide extensive experiments and analysis that validate our findings. Experimental results show our method achieves up to 1.8x end-to-end acceleration in async RL training of Qwen3.5, Qwen3.6, and Qwen3.7 models.
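
The link between rejection sampling and a total-variation (TV) objective can be made concrete for a single token position. In standard speculative sampling, a token x drawn from the draft distribution q is accepted with probability min(1, p(x)/q(x)) under the target distribution p, so the expected acceptance rate is the sum of min(p(x), q(x)), which equals 1 - TV(p, q). The Python sketch below checks this identity on toy distributions. The function names and numbers are illustrative, and whether Bebop's multi-step end-to-end TV loss takes exactly this form is an assumption, since the abstract does not give the formula.

```python
def acceptance_rate(target, draft):
    """Expected probability that a token sampled from `draft` is accepted."""
    # Accept x ~ draft with probability min(1, target[x] / draft[x]);
    # averaging over x gives sum_x min(target[x], draft[x]).
    return sum(min(p, q) for p, q in zip(target, draft))


def total_variation(target, draft):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(p - q) for p, q in zip(target, draft))


# Toy next-token distributions over a three-token vocabulary.
target = [0.5, 0.3, 0.2]
draft = [0.6, 0.3, 0.1]
rate = acceptance_rate(target, draft)
tv = total_variation(target, draft)
```

Under this identity, reducing the TV distance between draft and target raises the acceptance rate directly, whereas cross-entropy or KL objectives only bound it indirectly.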

2606.12369 2026-06-11 cs.CY new submission

Should LLM Agents Decide in Social Simulations? Comparing Finite-State and LLM-Based Decision Policies

Alejandro Buitrago López, Javier Pastor-Galindo, José A. Ruipérez-Valiente

AI summary: Evaluates whether LLMs used as action selectors in online social network simulations preserve an interpretable reference policy. Finds that LLMs can approximate the policy in some configurations but do not preserve it reliably, and that they are far slower than Markov chain sampling.

English abstract

Large language models (LLMs) are increasingly used as decision-making components in social simulations. This introduces a methodological risk: the simulation may deviate from the explicit behavioral policy defined by the researcher. In online social network (OSN) simulations, action choices shape system dynamics, interaction patterns, and model interpretability. This paper evaluates whether LLM action selectors preserve an interpretable reference policy in an OSN simulation. The reference is a finite state machine implemented as a first-order Markov model, with transition probabilities depending on the user type. The evaluation uses a synthetic network with 1,000 agents and 10,000 action decisions. Three open-weight LLMs are tested: LLaMA 3.1, GPT-OSS, and Mistral 24B. Each model is evaluated under three prompting strategies: base, guided, and probabilistic. Alignment is measured using Jensen-Shannon Divergence with Laplace smoothing, and execution time is reported. Results show that LLMs can approximate the reference policy in some configurations, but do not preserve it reliably. Alignment varies across models and prompts, and additional guidance can introduce systematic action biases. Even the best-aligned LLM configurations are several hundred times slower than direct Markov chain sampling. These findings indicate that LLM-based action selection is not a direct replacement for explicit decision policies: it can alter the intended behavior while increasing computational cost.
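
The alignment metric named in this abstract, Jensen-Shannon divergence with Laplace smoothing, can be sketched as follows. This is a minimal Python illustration, not the authors' evaluation code: the action counts, the smoothing constant `alpha=1.0`, and the choice of base-2 logarithms are assumptions.

```python
import math


def laplace_smooth(counts, alpha=1.0):
    """Turn raw action counts into probabilities with add-alpha smoothing."""
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits, bounded between 0 and 1."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical counts of three actions chosen from the same state.
reference_counts = [70, 20, 10]   # sampled from the reference Markov policy
llm_counts = [90, 10, 0]          # chosen by an LLM action selector
divergence = js_divergence(laplace_smooth(reference_counts),
                           laplace_smooth(llm_counts))
```

Smoothing keeps actions the LLM never selected from receiving exactly zero probability, and the symmetric divergence gives a bounded score where 0 means the LLM reproduces the reference distribution.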

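2606.12368 2026-06-11 cs.CV new submission

DepthMaster: Unified Monocular Depth Estimation for Perspective and Panoramic Images

Pengfei Wang, Shihao Wang, Liyi Chen, Zhiyuan Ma, Guowen Zhang, Lei Zhang

AI summary: Proposes DepthMaster, a unified framework that decomposes panoramas into overlapping perspective patches and introduces a correspondence consistency loss and virtual projection cameras as geometric priors. This addresses the geometric discrepancy between perspective and panoramic depth estimation and the scarcity of panoramic data, and achieves state-of-the-art zero-shot performance on 13 datasets.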

English abstract

While monocular depth estimation has achieved significant progress, achieving generalized metric depth estimation for both narrow field-of-view (FoV) perspectives and $360^\circ$ panoramas remains an unsolved challenge. Existing methods are often tailored to specific camera types and struggle to produce accurate metric depth that generalizes across diverse settings. This limitation stems from two key challenges: the inherent geometric discrepancy between perspective and panoramic cameras, and the scarcity of panoramic training data with metric annotations. In this work, we introduce DepthMaster, a unified metric depth estimation framework. Rather than employing specialized networks to learn spherical distortions, we reformulate the problem by decomposing panoramic images into overlapping perspective patches. Crucially, distinct from prior projection-based methods that rely on ad-hoc architectural modifications to handle boundaries, we introduce a novel Correspondence Consistency Loss (CCL) and inject virtual projection cameras as geometric priors, allowing us to seamlessly stitch the patches while avoiding specialized operators and keeping the backbone largely compatible with standard Transformer designs. This strategy also resolves the geometric differences by unifying all inputs into a canonical perspective representation, and effectively circumvents data scarcity by directly unlocking powerful metric priors from vast perspective datasets. Trained on a mixed dataset that contains only one panorama dataset, DepthMaster achieves state-of-the-art zero-shot performance on 13 diverse datasets, outperforming not only universal methods but also leading specialist models in both perspective and panoramic domains.

2606.12366 2026-06-11 cs.RO new submission

APT: Action Expert Pretraining Improves Instruction Generalization of Vision-Language-Action Policies

Kechun Xu, Zhenjie Zhu, Anzhe Chen, Rong Xiong, Yue Wang

Affiliations: Zhejiang University; Zhejiang Humanoid Robot Innovation Center

AI summary: To address the poor generalization of continuous action expert models to out-of-distribution language instructions, proposes APT, a two-stage training method that first pretrains the action expert as a vision-action prior and then injects language through gated fusion, significantly improving generalization.

English abstract

Vision-Language-Action (VLA) models that couple pretrained Vision-Language Models (VLMs) with continuous action experts have achieved strong manipulation performance, yet generalization to out-of-distribution (OOD) language instructions remains poor. A known challenge is the structural imbalance in VLA data, where language is far less diverse than visual and action content, making policies prone to visual shortcuts. While discrete-action methods mitigate this through vision-language co-training, continuous action experts lack such protection: they start from random initialization and learn entirely from imbalanced data, producing noisy gradients that corrupt the VLM and fail to exploit its language capability. We address this from a Bayesian perspective, factorizing the policy into a language-agnostic Vision-Action (VA) prior and a language-conditioned VLA likelihood, and propose APT, a two-stage training method emphasizing Action expert PreTraining. In Stage 1, the action expert is pretrained as a VA prior on vision-action pairs from a frozen VLM, bypassing the language imbalance. In Stage 2, language tokens are injected through a gated fusion mechanism that integrates VLM features while preserving the learned visuomotor prior. APT applies to mainstream VLA architectures, including the $\pi$ and GR00T-style architectures. Comprehensive experiments validate that APT achieves consistent gains on unseen instructions and compositional tasks. Project Page: this https URL

2606.12365 2026-06-11 cs.RO cs.AI new submission

Ambient Diffusion Policy: Imitation Learning from Suboptimal Data in Robotics

Adam Wei, Nicholas Pfaff, Thomas Cohn, Arif Kerem Dayı, Constantinos Daskalakis, Giannis Daras, Russ Tedrake

Affiliations: MIT

AI summary: Proposes Ambient Diffusion Policy, which extracts useful features from suboptimal data through noise-dependent data usage and outperforms existing methods on six tasks, by up to 33%.

Comments: 14 pages (main body), 52 pages total. Project website: this https URL

English abstract

We propose Ambient Diffusion Policy, a simple and principled method for imitation learning from suboptimal data in robotics. High-quality, task-specific robot data is expensive and time-consuming to collect, while suboptimal datasets with lower-quality or out-of-distribution demonstrations are abundant. Existing methods that co-train on both data sources in robotics often fail to separate the meaningful and the harmful features in the suboptimal samples. In contrast, our method extracts only the useful features by introducing a new axis to co-training in robotics: noise-dependent data usage. Ambient Diffusion Policy restricts the contribution of suboptimal data during training to only the high and low diffusion times. To rigorously justify our approach, we first observe that robot action data exhibits a spectral power law. This induces two important properties on the optimal Diffusion Policy that we exploit: a global-to-local hierarchy and locality. We theoretically formalize this discussion using a simplified model. Our experiments validate Ambient Diffusion Policy on four types of suboptimal action data (noisy trajectories, sim-to-real gap, task mismatch, and large-scale data mixtures) across six tasks. The results show that it effectively learns from arbitrary sources of suboptimal data. Notably, it outperforms existing co-training baselines by up to 33% when scaled to Open X-Embodiment - a large dataset with heterogeneous data quality and unstructured distribution shifts. Overall, Ambient Diffusion Policy increases the utility of suboptimal demonstrations and expands the set of usable data sources in robotics.
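
The idea of noise-dependent data usage, where suboptimal samples contribute only at high and low diffusion times, can be sketched as a simple gating rule on the training loss. The Python example below is a hedged illustration rather than the paper's method: the `use_sample` helper, the normalisation of diffusion time to [0, 1], and the thresholds `t_low=0.2` and `t_high=0.8` are all assumptions, since the abstract gives no concrete values.

```python
def use_sample(is_suboptimal, t, t_low=0.2, t_high=0.8):
    """Decide whether a sample contributes to the loss at diffusion time t.

    Clean, task-specific samples are used at every diffusion time.
    Suboptimal samples are used only at low or high diffusion times.
    """
    if not is_suboptimal:
        return True
    return t <= t_low or t >= t_high


# Which of five diffusion times would admit a suboptimal sample?
times = [0.0, 0.1, 0.5, 0.9, 1.0]
mask = [use_sample(True, t) for t in times]
```

In a training loop, such a mask would multiply the per-sample denoising loss so that suboptimal demonstrations are ignored at intermediate noise levels.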

URL PDF HTML ☆

赞 0 踩 0

2606.12364 2026-06-11 cs.LG 新提交

On Subquadratic Architectures: From Applications to Principles

关于次二次架构：从应用到原理

Anamaria-Roberta Hartl, Levente Zólyomi, David Stap, Pieter-Jan Hoedt, Niklas Schmidinger, Lukas Hauzenberger, Sebastian Böck, Günter Klambauer, Sepp Hochreiter

Affiliations: ELLIS Unit Linz, LIT AI Lab, Institute for Machine Learning, Johannes Kepler University Linz; NVIDIA

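AI summary: The paper compares three subquadratic architectures (xLSTM, Mamba-2, and Gated DeltaNet) and finds that xLSTM performs best in code pre-training, distillation, and time-series pre-training. It attributes this advantage to a flexible and stable gated memory-correction mechanism.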

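AI abstract (translated from Chinese)

Transformers dominate modern sequence modeling, but their quadratic attention mechanism incurs a large computational cost. Subquadratic architectures offer a scalable alternative. It remains unclear, however, which designs produce the most effective sequence models. We compare three leading approaches: xLSTM, Mamba-2, and Gated DeltaNet. We evaluate these models on tasks with complex dependencies: (1) pre-training of code models, (2) distillation of code models from large language models, and (3) pre-training of time-series foundation models. Across these settings, xLSTM delivers the strongest overall performance. To explain this advantage, we present a unified formulation and analyze the underlying architectural mechanisms, focusing on state tracking and memory dynamics. Our results show that xLSTM achieves more flexible and stable memory correction through its gating scheme. We confirm these findings on controlled synthetic length-generalization tasks. Overall, our findings indicate that xLSTM's gains on complex tasks stem from robust state tracking and accumulation.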

English abstract

Transformers dominate modern sequence modeling, but their quadratic attention incurs substantial computational cost. Subquadratic architectures offer a scalable alternative. However, it remains unclear which designs yield the most effective sequence models. We compare three leading approaches: xLSTM, Mamba-2, and Gated DeltaNet. We evaluate these models on tasks with complex dependencies: (1) code-model pre-training, (2) distillation of code models from large language models, and (3) pre-training of time-series foundation models. Across these settings, xLSTM delivers the strongest overall performance. To explain xLSTM's advantage, we present a unified formulation and analyze the underlying architectural mechanisms, focusing on state tracking and memory dynamics. Our results show that xLSTM enables more flexible and stable memory correction via its gating scheme. We corroborate these findings on controlled synthetic length-generalization tasks. Overall, our findings indicate that xLSTM's gains on complex tasks stem from robust state tracking and accumulation.
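To make "subquadratic" and "gated memory" concrete, here is a minimal sketch of a generic gated outer-product recurrence with a fixed-size state. It is not the paper's unified formulation, and xLSTM, Mamba-2, and Gated DeltaNet each differ in how they compute gates and update the state. The function name and the scalar per-step gates are assumptions made for illustration. Each step costs a fixed amount of work regardless of position, so the total cost grows linearly with sequence length.

```python
from typing import List, Sequence


def gated_memory_scan(queries: Sequence[Sequence[float]],
                      keys: Sequence[Sequence[float]],
                      values: Sequence[Sequence[float]],
                      forget_gates: Sequence[float],
                      input_gates: Sequence[float]) -> List[List[float]]:
    """Run a gated matrix-memory recurrence over a sequence.

    State update (illustrative): S <- f * S + i * outer(v, k)
    Readout:                     y  = S @ q
    """
    if not keys:
        return []
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_k for _ in range(d_v)]  # fixed-size memory
    outputs: List[List[float]] = []
    for q, k, v, f, i in zip(queries, keys, values, forget_gates, input_gates):
        for r in range(d_v):
            for c in range(d_k):
                # Forget gate decays old memory; input gate scales the new write.
                state[r][c] = f * state[r][c] + i * v[r] * k[c]
        outputs.append([sum(state[r][c] * q[c] for c in range(d_k))
                        for r in range(d_v)])
    return outputs
```

In this sketch, a forget gate below 1 shrinks what was stored earlier, which is the simplest form of the memory correction the abstract refers to.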

2606.12362 2026-06-11 cs.LG cs.AI New submission

Latent World Recovery for Multimodal Learning with Missing Modalities

Hui Wang, Tianyu Ren, Joseph Butler, Christopher Baker, Karen Rafferty, Simon McDade

Affiliations: Queen's University Belfast

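AI summary: Proposes the Latent World Recovery (LWR) framework, which uses neighbor-based latent alignment and availability-aware fusion to make robust multimodal predictions under missing modalities while avoiding the errors of explicit reconstruction.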

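AI abstract (translated from Chinese)

We study multimodal learning with missing modalities, motivated in particular by bioscience applications, where heterogeneous modalities are often only partially available at the time a decision must be made. We propose Latent World Recovery (LWR), a framework built on two key ideas: (i) modality-specific embeddings from different modalities are aligned in a shared latent space, and (ii) a unified representation is built by fusing only the embeddings of modalities that are actually available at training and inference time. Rather than imputing missing modalities or requiring a fixed set of modalities, LWR treats each modality as a partial perception of an underlying latent state and performs availability-aware representation learning directly on the observed modalities. Combining neighbor-based latent alignment with availability-aware modality fusion enables robust multimodal prediction under partial observation and avoids the error propagation caused by explicitly reconstructing missing modalities. We evaluate the framework on real-world incomplete multi-omics benchmarks and show that it offers an effective approach to downstream tasks such as cancer phenotype classification and survival prediction.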

English abstract

We study multimodal learning under missing modalities, with particular motivation from bioscience applications in which heterogeneous modalities are often only partially available when decisions need to be made. We propose Latent World Recovery (LWR), a framework built on two key ideas: (i) modality-specific embeddings from different modalities are aligned in a shared latent space, and (ii) a unified representation is constructed by fusing only the embeddings of the modalities that are actually available at both training and inference time. Rather than imputing missing modalities or requiring a fixed modality set, LWR treats each modality as a partial perception of an underlying latent state and performs availability-aware representation learning directly from the observed modalities. This combination of neighbor-based latent alignment and availability-aware modality fusion enables robust multimodal prediction under partial observation, while avoiding error propagation from explicit reconstruction of missing modalities. We evaluate the proposed framework on real-world incomplete multi-omics benchmarks and demonstrate that it provides an effective approach to downstream tasks such as cancer phenotype classification and survival prediction.
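Availability-aware fusion, as described above, can be sketched as pooling only the embeddings that are present for a given sample. This is a minimal illustration, not the LWR implementation: the abstract does not specify the fusion operator, so the mean used here is an assumption, and the function name and modality names in the usage note are hypothetical.

```python
from typing import Dict, List, Optional


def fuse_available(embeddings: Dict[str, Optional[List[float]]]) -> List[float]:
    """Fuse only the modalities that are present (value is not None).

    Mean pooling is an assumed fusion operator chosen for simplicity.
    Missing modalities are skipped rather than imputed.
    """
    available = [e for e in embeddings.values() if e is not None]
    if not available:
        raise ValueError("at least one modality must be available")
    dim = len(available[0])
    if any(len(e) != dim for e in available):
        raise ValueError("embeddings must share one latent dimension")
    return [sum(e[i] for e in available) / len(available) for i in range(dim)]
```

Calling `fuse_available({"rna": [1.0, 3.0], "methylation": None, "protein": [3.0, 5.0]})` averages the two observed embeddings and ignores the missing one, so the same function works for any subset of modalities.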

2606.12360 2026-06-11 cs.LG New submission

Anatomy of Post-Training: Using Interpretability to Characterize Data and Shape the Learning Signal

Leon Bergen, Usha Bhalla, Sidharth Baskaran, Max Loeffler, Raphael Sarfati, Dhruvil Gala, Ryan Panwar, Santiago Aranguri, Thomas Fel, Atticus Geiger, Matthew Kowal, Siddharth Boppana, Daniel Balsam, Owen Lewis, Jack Merullo, Thomas McGrath, Ekdeep Singh Lubana

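AI summary: Proposes an interpretability-based, data-centric post-training pipeline that uses statistical hypotheses to identify latent concepts in preference data, enabling fine-grained feedback and reducing spurious correlations and undesirable behaviors.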

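AI abstract (translated from Chinese)

Language-model post-training is the main stage at which model behavior is shaped, yet it still consists largely of optimizing scalar rewards that summarize diverse desiderata. This abstraction leaves practitioners with almost no insight into what the data actually teaches the model, so models learn spurious correlations and develop undesirable behaviors such as over-stylization and sycophancy. To address this problem, we ask: can a preference dataset be inspected before optimization, and can we decide at the level of concepts which behaviors the model should be allowed to learn? Motivated by this question, we introduce a data-centric post-training pipeline that uses interpretability protocols to develop statistical hypotheses about the latent concepts that distinguish preferred from dispreferred generations, making those concepts explicit for fine-grained user feedback. Building on this view, we unify several interpretability-based training protocols as ways of shaping rewards through feature or data interventions. Empirically, we show that our pipeline diagnoses undesirable signals in existing preference data, mitigates off-target learning, and can also help amplify or shape desired properties such as safeguards and model personality. More broadly, our results suggest that interpretability can turn post-training from the optimization of opaque proxy rewards into a process of auditing and shaping the learning signal itself.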

English abstract

Language-model post-training is the main stage at which model behavior is shaped, yet it still largely involves optimization of scalar rewards that summarize diverse desiderata. This abstraction gives practitioners little visibility into what their data actually teaches models, allowing spurious correlations to be learned by a model and inducing undesirable behaviors such as over-stylization and sycophancy. To address this problem, we ask: can we inspect a preference dataset before optimization and decide, at the level of concepts, which behaviors a model should be allowed to learn? Motivated by this, we introduce a data-centric post-training pipeline that uses interpretability protocols to develop statistical hypotheses for the latent concepts separating preferred from dispreferred generations, making them explicit for fine-grained user feedback. Building on this view, we unify several interpretability-based training protocols as ways of shaping rewards via feature or data interventions. Empirically, we show that our pipeline diagnoses undesirable signals in existing preference data, mitigates off-target learning, and can also help amplify or shape desired properties such as safeguards and model personality. More broadly, our results suggest that interpretability can turn post-training from optimizing opaque proxy rewards into a process of auditing and sculpting the learning signal itself.
