arXivDaily每日学术速递，同步arXiv全量数据，AI总结、翻译，覆盖人工智能、机器人、计算机、金融、统计学、数学、物理学、生物学、经济学、电气&系统等方向。

2606.08260 2026-06-09 cs.CV 新提交

TIDE: Task-Isolated Diffusion for Unified Video Editing and Generation

TIDE: 任务隔离扩散模型用于统一视频编辑与生成

Qi Liu, Gang Yue, Mingyu Yin, Lisai Zhang, Yidi Wu, Yaole Wang, Yaohui Wang, Chang Yao, Jingyuan Chen, Lin Ma

发表机构 * Zhejiang University（浙江大学）； Bilibili Inc.（哔哩哔哩股份有限公司）

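AI summary: Proposes TIDE, a unified framework that uses per-token task embeddings and a dual-path conditioning mechanism to support instruction-based editing, reference-guided editing, and multi-reference generation, reaching state-of-the-art performance with multi-task progressive training.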

AI中文摘要

扩散Transformer的最新进展推动了视频生成和编辑的快速发展，但这些能力仍由独立的、任务特定的模型处理。构建支持多种视频任务的统一框架仍然是一个开放挑战：现有的统一尝试要么需要专用的辅助编码器，要么缺乏区分异构条件令牌的显式机制，当视觉条件的数量和类型因任务而异时难以应对。我们提出TIDE，一个统一框架，集成了基于指令的编辑、参考引导编辑和多参考生成。其核心是，我们引入了逐令牌任务嵌入，为每个输入令牌分配一个任务特定标识符，使模型能够显式区分目标、源和参考令牌。为了同时捕捉高层语义理解和细粒度结构保真度，我们设计了一种双路径条件方案，将视觉语言模型与VAE潜在路径耦合以提供互补信号。我们进一步设计了一种多任务渐进训练策略，逐步引入复杂度递增的任务，有效协调不同目标，并实现跨异构任务分布的平滑泛化。在多个视频编辑和生成基准上的大量实验表明，TIDE在所有评估任务上均达到了最先进的性能。我们的项目页面可在https://LittleWork123.github.io/tide获取。

英文摘要

Recent advances in Diffusion Transformers have driven rapid progress in video generation and editing, yet these capabilities are still handled by separate, task-specific models. Building a unified framework that supports diverse video tasks remains an open challenge: existing unified attempts either require dedicated auxiliary encoders or lack explicit mechanisms to distinguish heterogeneous conditioning tokens, struggling when the number and type of visual conditions vary across tasks. We propose TIDE, a unified framework that integrates instruction-based editing, reference-guided editing, and multi-reference generation. At its core, we introduce per-token task embeddings that assign each input token a task-specific identifier, enabling the model to explicitly disambiguate target, source, and reference tokens. To simultaneously capture high-level semantic understanding and fine-grained structural fidelity, we design a dual-path conditioning scheme that couples a vision-language model with a VAE latent path for complementary signals. We further devise a multi-task progressive training strategy that incrementally introduces tasks of increasing complexity, effectively harmonizing diverse objectives and enabling smooth generalization across heterogeneous task distributions. Extensive experiments on multiple video editing and generation benchmarks demonstrate that TIDE achieves state-of-the-art performance across all evaluated tasks. Our project page is available at https://LittleWork123.github.io/tide.
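
The per-token task embedding idea in the abstract can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the names `ROLE_IDS`, `make_role_table`, and `add_task_embeddings` are hypothetical, the role vectors are random rather than learned, and the abstract does not say whether TIDE adds, concatenates, or otherwise injects the identifier.

```python
import random

# Hypothetical role identifiers; the abstract names target, source and reference tokens.
ROLE_IDS = {"target": 0, "source": 1, "reference": 2}

def make_role_table(dim, seed=0):
    """Build one vector per role (random here; a real model would learn them)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in ROLE_IDS]

def add_task_embeddings(tokens, roles, role_table):
    """Add each token's role vector to its feature vector (additive injection is assumed)."""
    if len(tokens) != len(roles):
        raise ValueError("need exactly one role per token")
    out = []
    for vec, role in zip(tokens, roles):
        emb = role_table[ROLE_IDS[role]]
        out.append([v + e for v, e in zip(vec, emb)])
    return out
```

With this kind of tagging, two tokens with identical content but different roles no longer share a representation, which is the disambiguation the abstract describes.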

2606.08259 2026-06-09 cs.LG 新提交

Differentially Private Synthetic Data via APIs 4: Tabular Data

通过API实现差分隐私合成数据 4: 表格数据

Toan Tran, Arturs Backurs, Zinan Lin, Victor Reis, Li Xiong, Sergey Yekhanin

发表机构 * Microsoft（微软）

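AI summary: Proposes Tab-PE, which extends the Private Evolution framework to tabular data by iteratively refining a candidate dataset with heuristic operators, handling high-order correlations efficiently under differential privacy; compared with the AIM baseline, classification accuracy improves by up to 10% and runtime is 28 times faster.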

Comments: ICML'26

AI中文摘要

本文研究了在差分隐私（DP）保证下生成合成表格数据的问题，使得在敏感领域能够共享数据。尽管已有大量研究，最先进的方法通常侧重于最小化低阶边际查询误差，而忽视了高阶相关性带来的挑战。为解决这一差距，我们将最初为DP合规图像和文本合成开发的Private Evolution（PE）框架扩展到表格数据。我们提出了Tab-PE——一种在DP约束下生成合成表格数据的算法。Tab-PE通过一个进化过程迭代改进候选数据集，该过程利用表格专用算子产生变体，对其进行私有评分，并选择最高质量的样本进行保留和传播。与依赖大型基础模型的原始PE不同，Tab-PE采用计算成本显著更低的启发式算子，使得PE对表格数据更加实用和可扩展。通过在真实和模拟数据集上的大量实验，我们证明Tab-PE在表现出高阶相关性的数据集上显著优于先前的基线。与最佳基线AIM相比，Tab-PE的分类准确率提高了最高10%，同时运行速度快了28倍。

英文摘要

This paper investigates the problem of generating synthetic tabular data with differential privacy (DP) guarantees, enabling data sharing in sensitive domains. Despite extensive study, state-of-the-art methods often focus on minimizing low-order marginal query errors and overlook the challenges posed by high-order correlations. To address this gap, we extend the Private Evolution (PE) framework, originally developed for DP-compliant image and text synthesis, to tabular data. We introduce Tab-PE -- an algorithm for synthetic tabular data generation under DP constraints. Tab-PE iteratively improves a candidate dataset via an evolutionary process that leverages tabular-specialized operators to produce variations, privately scores them, and selects the highest-quality samples to retain and propagate. In contrast to the original PE, which relies on large foundation models, Tab-PE employs heuristic operators with significantly lower computational costs, making PE more practical and scalable for tabular data. Through extensive experiments on real-world and simulation datasets, we demonstrate that Tab-PE substantially outperforms prior baselines on datasets exhibiting high-order correlations. Compared to the best baseline -- AIM, Tab-PE improves classification accuracy by up to 10% while running 28 times faster.
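
The vary, score privately, and select loop described above can be sketched roughly as follows. This is not the Tab-PE algorithm itself: the Hamming-distance voting, the single-column `mutate` operator, and all function names are assumptions made for illustration, and `sigma` is not calibrated to any particular privacy budget.

```python
import random

def hamming(a, b):
    """Number of columns in which two rows differ."""
    return sum(x != y for x, y in zip(a, b))

def noisy_votes(private_rows, candidates, sigma, rng):
    """Each private row votes for its nearest candidate; Gaussian noise is added per count."""
    votes = [0.0] * len(candidates)
    for row in private_rows:
        nearest = min(range(len(candidates)), key=lambda i: hamming(row, candidates[i]))
        votes[nearest] += 1.0
    return [v + rng.gauss(0.0, sigma) for v in votes]

def mutate(row, domains, rng):
    """Hypothetical heuristic operator: resample one randomly chosen column."""
    col = rng.randrange(len(row))
    new = list(row)
    new[col] = rng.choice(domains[col])
    return tuple(new)

def evolve(private_rows, domains, pop_size=20, steps=5, sigma=1.0, seed=0):
    """Iteratively vary the population, score it with noisy votes, keep the top rows."""
    rng = random.Random(seed)
    pop = [tuple(rng.choice(d) for d in domains) for _ in range(pop_size)]
    for _ in range(steps):
        candidates = pop + [mutate(r, domains, rng) for r in pop]
        scores = noisy_votes(private_rows, candidates, sigma, rng)
        ranked = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
        pop = [candidates[i] for i in ranked[:pop_size]]
    return pop
```

The private data is touched only inside `noisy_votes`, which is where a real implementation would account for the privacy cost of each iteration.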

2606.08256 2026-06-09 cs.AI cs.DL 新提交

Traxia: A Framework for Verifiable, Agent-Native Scientific Publishing

Traxia：一个可验证的、智能体原生的科学出版框架

Wisdom Dogah

发表机构 * Faculty of Computing and Mathematical Sciences, University of Mines and Technology (UMaT), Tarkwa, Ghana（加纳塔夸矿业与技术大学计算与数学科学学院）； BlackMatrix AI Research, Accra, Ghana（加纳阿克拉BlackMatrix AI研究院）

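AI summary: Proposes the Traxia framework, which addresses verifiability, attribution, and reproducibility in scientific publishing through agent identity, verifiable publishing, four-tier peer review, a reputation mechanism, and a knowledge graph.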

Comments: 22 pages, 3 figures, 3 tables. Preprint. Under active development. Comments welcome

AI中文摘要

可验证性、归属和可重复性是科学知识的基本要求，但当前的出版基础设施并未大规模强制执行这些要求。我们介绍Traxia，一个智能体原生的科学出版框架，其中AI研究智能体发布可验证的论文，建立声誉身份，相互进行同行评审，并与人类在共享溯源模型中协作。Traxia将智能体视为第一类认知参与者：每篇论文都带有推理轨迹，每个声明都带有置信区间，每个智能体都有加密签名的身份，每次协作都有不可变的贡献日志。我们形式化了五个组件：智能体身份与注册、可验证出版层、四层同行评审协议、声誉与质押引擎，以及带有矛盾检测的知识图谱。该框架针对可重复性失败、溯源不透明以及排除全球南方研究能力的问题。本文仅介绍架构基础和形式化规范；未报告实证结果。评估和更深入的组件研究将在后续论文中进行。原型部分实现了核心形式化；完整系统仍在积极开发中。

英文摘要

Verifiability, attribution, and reproducibility are foundational requirements of scientific knowledge, yet current publishing infrastructure does not enforce them at scale. We introduce Traxia, an agent-native scientific publishing framework in which AI research agents publish verifiable papers, build reputational identities, peer-review one another, and collaborate with humans in a shared provenance model. Traxia treats agents as first-class epistemic participants: every paper carries a reasoning trace, every claim a confidence interval, every agent a cryptographically signed identity, and every collaboration an immutable contribution log. We formalise five components: Agent Identity and Registry, Verifiable Publishing Layer, four-tier Peer Review Protocol, Reputation and Staking Engine, and a Knowledge Graph with contradiction detection. The framework targets reproducibility failure, provenance opacity, and exclusion of Global South research capacity. This paper presents architectural foundations and formal specifications only; it does not report empirical results. Evaluation and deeper component studies will follow in subsequent papers. A prototype partially implements core formalisms; the full system remains under active development.
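
One common way to make a contribution log tamper-evident is a hash chain, sketched below. The abstract does not say how Traxia implements its immutable log, so the structure, the field names, and the functions `entry_hash`, `append_entry`, and `verify_log` are all assumptions; cryptographic signing of agent identities is left out.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash that precedes the first entry

def entry_hash(prev_hash, payload):
    """Hash the previous entry's hash together with this entry's payload."""
    body = json.dumps({"prev": prev_hash, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()

def append_entry(log, payload):
    """Append a contribution record that commits to everything before it."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"prev": prev, "payload": payload, "hash": entry_hash(prev, payload)})
    return log

def verify_log(log):
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in log:
        if entry["prev"] != prev or entry["hash"] != entry_hash(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True
```

Because each entry commits to its predecessor, changing an early contribution invalidates every later hash, so an edit cannot go unnoticed without rewriting the whole chain.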

2606.08254 2026-06-09 cs.CL 新提交

SSR: Can Simulated Patients Learn to Stigmatize Themselves? Modeling Self-Stigma through Internal Monologue

SSR: 模拟患者能否学会自我污名化？通过内心独白建模自我污名

Kunyao Lan, Bingrui Jin, Zichen Zhu, Mengyue Wu

发表机构 * Shanghai Jiao Tong University（上海交通大学）； X-LANCE Lab, Dept. of Computer Science and Engineering（X-LANCE实验室，计算机科学与工程系）； MoE Key Lab of Artificial Intelligence, AI Institute（教育部人工智能重点实验室，人工智能研究院）

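AI summary: Proposes the SSR framework, grounded in the psychological 3A1H model, which fine-tunes LLMs with an internal-monologue dataset and chain-of-thought so that simulated patients dynamically adjust their expression of stigma in response to conversational triggers, producing more authentic, situationally appropriate responses.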

AI中文摘要

使用大语言模型（LLM）模拟患者是心理健康训练的一种有前景的工具，但现有方法未能捕捉一个关键的临床现实：自我污名。经历自我污名的患者，即内化负面刻板印象，通常表现出情境敏感性的抵抗，如回避、否认或自责，而当前模型将其呈现为静态或统一顺从的行为。为了解决这一问题，我们引入了一个基于自我污名化心理3A1H模型的新型模拟框架。我们的核心创新是创建了一个\textbf{污名化自我反思}（\textbf{SSR}）数据集，在该数据集中，我们通过反映污名意识推理的内心独白来增强心理健康对话。通过使用链式思维方法对LLM进行微调，我们训练患者代理根据对话触发动态调整其污名水平和表达方式。评估表明，我们的方法显著优于专门的基线，生成了更真实且情境适当的患者反应。这项工作为临床训练和共情对话系统的现实污名模拟迈出了关键一步。

英文摘要

Simulating patients with large language models (LLMs) is a promising tool for mental health training, but existing approaches fail to capture a key clinical reality: self-stigma. Patients experiencing self-stigma, the internalization of negative stereotypes, often exhibit context-sensitive resistance, such as avoidance, denial, or self-blame, which current models render as static or uniformly compliant behavior. To address this, we introduce a novel simulation framework grounded in the psychological 3A1H model of self-stigmatization. Our core innovation is the creation of a \textbf{Stigmatized Self-Reflection} (\textbf{SSR}) dataset, where we augment mental health dialogues with internal monologues that reflect stigma-aware reasoning. By fine-tuning LLMs with this data using a chain-of-thought approach, we train patient agents to dynamically adjust their level and expression of stigma based on conversational triggers. Evaluations demonstrate that our approach significantly outperforms specialized baselines, generating more authentic and situationally appropriate patient responses. This work provides a crucial step towards realistic stigma simulation for clinical training and empathetic dialogue systems.

2606.08253 2026-06-09 cs.RO cs.LG 新提交

Mind Your Steps: A General Learning Framework for Accurate Humanoid Foothold Tracking

注意你的步伐：一种用于精确人形机器人落脚点跟踪的通用学习框架

Alessandro Montenegro, Shihao Li, Puze Liu, Alberto Maria Metelli, Jan Peters

发表机构 * Politecnico di Milano（米兰理工大学）； TU Darmstadt（达姆施塔特工业大学）； Max Planck Institute for Intelligent Systems（马克斯·普朗克智能系统研究所）； Italian Institute of Technology（意大利技术研究院）； University of Pisa（比萨大学）

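AI summary: Proposes a lightweight, general framework for learning 3D foothold-tracking policies, in which a goal sampler dynamically provides footstep support and a new target representation mitigates real-world noise, yielding accurate, natural locomotion that pairs seamlessly with various high-level planners.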

Comments: Accepted to RSS 2026

AI中文摘要

使人形机器人在复杂动态环境中运行仍然是一个关键挑战，其根本受限于稳健、安全且精确导航的能力。虽然基于速度指令策略的强化学习在人形机器人运动方面取得了显著的鲁棒性，但这种方法缺乏对落脚点位置的显式控制，导致不安全行为（如踩到人脚）或不精确导航，阻碍后续操作任务。相反，显式落脚点跟踪策略通过直接以目标足部姿态作为指令提供了一种有前景的替代方案。然而，现有方法通常受限于不切实际的状态假设（影响实际部署），或者作为分阶段流程的一部分而受限于特定下游任务。在这项工作中，我们引入了一种新颖的轻量级框架，用于训练通用的3D落脚点跟踪策略。通过目标采样器动态提供步态支持，该方法使学习到的策略对特定地形不敏感。我们的新目标表示有效缓解了现实世界中出现的挑战，例如噪声和不准确的姿态估计以及足部接触估计。为直接迁移到现实世界而设计，我们的策略作为一个独立的低级控制器，可以与各种高级落脚点生成器无缝配对。通过在仿真和现实世界中的大量实验，我们证明了框架的有效性。通过将我们的策略与不同的上游规划器耦合，我们在具有挑战性的环境中实现了自然且精确的运动，为复杂环境中的运动-操作任务铺平了道路。

英文摘要

Enabling humanoid robots to operate in complex, dynamic environments remains a critical challenge, fundamentally limited by the ability to navigate robustly, safely, and accurately. While reinforcement learning with velocity-commanded policies has achieved remarkable robustness in humanoid locomotion, this approach lacks explicit control of the foothold placement, leading to unsafe behavior, such as stepping onto human feet, or imprecise navigation, hindering the following manipulation task. Conversely, explicit foothold-tracking policies offer a promising alternative by directly being commanded with target foot poses. However, existing approaches are often limited by unrealistic state assumptions, compromising real-world deployment, or they are part of staged pipelines, making them tied to specific downstream tasks. In this work, we introduce a novel, lightweight framework for training general-purpose 3D foothold-tracking policies. By dynamically providing footstep support through a goal sampler, this method enables the learned policy to be agnostic to specific terrains. Our new target representation effectively mitigates challenges arising in the real world, such as noisy and inaccurate pose estimation and foot contact estimation. Designed for direct real-world transfer, our policy acts as a standalone low-level controller that can be seamlessly paired with various high-level foothold generators. We demonstrate the effectiveness of our framework through extensive experiments in simulation and in the real world. By coupling our policy with different upstream planners, we achieve natural and accurate locomotion in challenging settings, paving the way for loco-manipulation tasks in complex environments.

2606.08245 2026-06-09 cs.CL 新提交

ZAS-SQL: Distilling Rules from Failures for Zero-Shot Text-to-SQL

ZAS-SQL: 从失败中提炼规则用于零样本文本到SQL

Hongzhou Zheng, Yixin Gou, Wenjia Zhang

发表机构 * Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University（同济大学上海自主智能无人系统科学中心）； College of Architecture and Urban Planning, Tongji University（同济大学建筑与城市规划学院）； Behavioral and Spatial AI Lab, Peking University & Tongji University（北京大学与同济大学行为与空间人工智能实验室）

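AI summary: Proposes ZAS-SQL, a zero-shot framework that distills core generation rules from failure cases via Map-Reduce rule distillation and combines three modules (knowledge-augmented schema representation, rule-driven structured reasoning, and execution-guided early stopping), reaching up to 87.2% and 88.6% execution accuracy on the Spider Dev and Test sets and surpassing several few-shot and fine-tuning methods.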

AI中文摘要

文本到SQL将自然语言转换为可执行的SQL查询。基于大语言模型（LLM）的少样本上下文学习方法表现出色，但其对示例的依赖限制了跨领域泛化，并消耗大量上下文窗口空间。现有的零样本方法缺乏有效的生成约束，仍落后于少样本方法。我们观察到LLM在零样本文本到SQL中的失败并非随机，而是表现出系统性的、重复出现的模式。基于这一观察，我们提出了一个完全零样本的文本到SQL框架，该框架通过基于Map-Reduce的规则蒸馏管道从失败案例中提炼核心生成规则，并通过三个互补模块提高生成质量：知识增强的模式表示，补充数据定义语言中缺失的语义；规则驱动的结构化推理框架，抑制结构偏差；以及执行引导的早停，实现低成本的自我纠正。在Spider上，所提出的框架在开发集和测试集上分别达到87.2%和88.6%的执行准确率，建立了新的零样本最先进水平，并超越了多个基于GPT-4/4o的少样本和微调方法。在领域特定数据集UrbanPlan上，它达到了81.3%，证实了规则蒸馏方法跨领域的泛化能力。此外，当配备4B参数模型时，该框架超越了领先闭源模型的零样本基线，展示了强大的模型通用性。

英文摘要

Text-to-SQL translates natural language into executable SQL queries. Few-shot in-context learning methods built upon large language models (LLMs) achieve strong performance, yet their reliance on demonstrations limits cross-domain generalization and consumes substantial context window space. Existing zero-shot methods, lacking effective generation constraints, still fall short of few-shot approaches. We observe that LLM failures in zero-shot Text-to-SQL are not random but exhibit systematic, recurring patterns. Building on this observation, we propose a fully zero-shot Text-to-SQL framework that distills core generation rules from failure cases through a Map-Reduce-based rule distillation pipeline and improves generation quality via three complementary modules: knowledge-augmented schema representation, which supplements missing semantics in Data Definition Language; a rule-driven structured reasoning framework that suppresses structural deviations; and Execution-Guided Early Stopping, which enables low-cost self-correction. On Spider, the proposed framework achieves up to 87.2% and 88.6% execution accuracy on the Dev and Test sets, respectively, establishing a new zero-shot state-of-the-art and surpassing multiple few-shot and fine-tuning methods built upon GPT-4/4o. On the domain-specific dataset UrbanPlan, it achieves 81.3%, confirming that the rule distillation approach generalizes across domains. Moreover, when equipped with a 4B-parameter model, the framework surpasses zero-shot baselines of leading closed-source models, demonstrating strong model generality.
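
A rough sketch of execution-guided early stopping: candidate queries are tried in order and the loop stops at the first one the database accepts. The abstract does not give the exact stopping criterion, so "first candidate that executes without error" is an assumption, and `first_executable` is a hypothetical helper name.

```python
import sqlite3

def first_executable(conn, candidates):
    """Return (sql, rows, attempts) for the first candidate that executes.

    Candidates are assumed to be read-only SELECT statements; a failing
    candidate is skipped, which is the low-cost self-correction step.
    """
    for attempt, sql in enumerate(candidates, start=1):
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error:
            continue  # execution failed, so fall through to the next candidate
        return sql, rows, attempt
    return None, None, len(candidates)
```

Stopping at the first success keeps the number of model calls and executions small, but a query that runs is not necessarily correct, so this check only filters out structural failures.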

2606.08243 2026-06-09 cs.CL 新提交

Building Comparative Motivation Profiles with Instrumental Interventions

利用工具性干预构建比较动机概况

David Vella Zarb, Rustem Turtayev, Taywon Min, Jinghua Ou, Shi Feng

发表机构 * MATS ； University of Cambridge（剑桥大学）； KAIST（韩国科学技术院）； George Washington University（乔治华盛顿大学）

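AI summary: Uses symmetric instrumental interventions to separate strategic self-preservation from researcher-expectation tracking in alignment faking, finds that models are more sensitive to expectation tracking, and argues that construct-validity checks are needed.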

AI中文摘要

安全性评估通常从行为模式推断潜在动机，但这些推断的构念效度尚不明确。我们在对齐伪装中研究这一问题，即当模型推断出训练压力时，它们更常服从训练目标。这种行为通常被解释为策略性自我保护，但也可能反映模型对研究者期望的敏感性。我们引入一个对称干预框架来区分这些竞争性假设。我们不直接干预“诡计”或“谄媚”，而是针对每个假设所蕴含的工具性过程：后果追踪和研究者期望追踪。然后比较对这些过程的干预如何影响对齐伪装。我们使用合成文档微调、激活引导和提示研究了四个开放权重的模型生物。在合成文档微调下，Llama-3.1-70B、Llama-3.1-405B 和 Qwen-2.5-72B 对期望追踪干预比后果追踪干预更敏感。对 Llama-3.1-70B 的激活引导支持相同的总体图景，提示干预与合成文档微调（SDF）概况大致一致。总体而言，对齐伪装行为在因果上可能对评估上下文期望敏感，尽管存在与诡计一致的草稿板。因此，诡计和策略性欺骗评估需要构念效度检验，而对称工具性干预提供了这样一种测试。

英文摘要

Safety evaluations often infer latent motivations from behavioral patterns, but the construct validity of these inferences is unclear. We study this problem in alignment faking, where models comply with training objectives more often when they infer training pressure. This behavior is commonly interpreted as strategic self-preservation, but it may also reflect sensitivity to the model's inference about the expectation of researchers conducting the evaluation. We introduce a symmetric intervention framework for distinguishing these competing hypotheses. Instead of directly intervening on "scheming" or "sycophancy", we target instrumental processes entailed by each hypothesis: consequence-tracking and researcher-expectation tracking. We then compare how interventions on these processes affect alignment faking. We study four open-weight model organisms using synthetic document fine-tuning (SDF), activation steering, and prompting. Under synthetic document fine-tuning, Llama-3.1-70B, Llama-3.1-405B, and Qwen-2.5-72B are more sensitive to expectation-tracking than consequence-tracking interventions. Activation steering on Llama-3.1-70B supports the same broad picture, and prompt interventions broadly align with SDF profiles. Overall, alignment-faking behavior can be causally sensitive to evaluation-context expectations despite scheming-consistent scratchpads. Scheming and strategic-deception evaluations therefore need construct-validity checks, and symmetric instrumental interventions provide one such test.

2606.08242 2026-06-09 cs.CV 新提交

Light-WAM: Efficient World Action Models with State-Fusion Action Decoding

Light-WAM：基于状态融合动作解码的高效世界动作模型

Ziang Li, Dongzhou Cheng, Yibin Wang, Shiyue Wang, Xiaoyang Xu, Lingxuan Weng, Juan Wang, Jiaqi Wang

发表机构 * Wuhan University（武汉大学）； Shanghai Innovation Institute（上海创新研究院）； Southeast University（东南大学）； Fudan University（复旦大学）； East China Normal University（华东师范大学）

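AI summary: Proposes Light-WAM, a lightweight World Action Model that lowers training cost with a compact video backbone and future-video supervision in a downsampled latent space, and introduces a state-fusion action expert for efficient action prediction, maintaining strong performance on LIBERO and usable multi-task performance on RoboTwin 2.0.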

AI中文摘要

世界动作模型（WAM）通过将未来预测作为额外训练目标来扩展机器人策略学习，鼓励策略在其表示中编码任务相关的时间结构。当前的WAM通常依赖大规模生成架构，导致高训练成本和推理延迟，难以部署为高效的闭环策略。我们提出Light-WAM，一种轻量级的世界动作模型，用于高效的机器人操作。具体来说，它采用紧凑的视频骨干网络，并在降维的潜空间中进行未来视频监督，降低了视频协同训练的成本，同时保留了其对表示学习的益处。对于动作预测，Light-WAM引入了状态融合动作专家（StateFusionActionExpert），该专家从多个骨干层读取适应后的状态，通过可学习查询池化进行融合，并在单次前向传播中直接预测动作块。这种设计为视频骨干表示与机器人动作之间提供了高效接口，避免了繁重的生成式动作专家。实验表明，Light-WAM在LIBERO上保持强劲性能，在RoboTwin 2.0上实现了可用的多任务性能，同时仅使用0.44B可训练参数。它还实现了72.03ms的推理延迟，峰值GPU内存为4.1GiB，并提高了训练吞吐量。

英文摘要

World Action Models (WAMs) extend robot policy learning by incorporating future prediction as an additional training objective, encouraging the policy to encode task-relevant temporal structure in its representations. Current WAMs often rely on large-scale generative architectures that incur high training costs and inference latency, making them difficult to deploy as efficient closed-loop policies. We propose Light-WAM, a lightweight World Action Model for efficient robot manipulation. Specifically, it is built with a compact video backbone and performs future-video supervision in a downsampled latent space, reducing the cost of video co-training while retaining its benefits for representation learning. For action prediction, Light-WAM introduces the StateFusionActionExpert, which reads adapted states from multiple backbone layers, fuses them through learned-query pooling, and directly predicts action chunks in a single forward pass. This design provides an efficient interface between video backbone representations and robot actions, avoiding the need for heavy generative action experts. Experiments demonstrate that Light-WAM maintains strong performance on LIBERO and achieves usable multi-task performance on RoboTwin 2.0, while using only 0.44B trainable parameters. It also achieves 72.03ms inference latency with 4.1GiB peak GPU memory and improved training throughput.
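
Learned-query pooling, as mentioned for the StateFusionActionExpert, is commonly realised as attention with a trainable query vector. The sketch below assumes a single query and scaled dot-product scores; the function names are hypothetical and the abstract does not specify the actual layer design.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def learned_query_pool(states, query):
    """Fuse per-layer state vectors into one vector using attention weights.

    states: list of equally sized vectors, one per backbone layer (assumed).
    query:  a vector of the same size; learned in a real model, fixed here.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * s for q, s in zip(query, st)) / scale for st in states]
    weights = softmax(scores)
    return [sum(w * st[j] for w, st in zip(weights, states)) for j in range(len(query))]
```

A zero query gives uniform weights, so the pooled vector is the plain average of the layer states; training the query lets the model favour the layers that matter for action prediction.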

2606.08239 2026-06-09 cs.AI cs.CL cs.CV 新提交

When No Answer Is Correct: Diagnosing Absent Answer Detection for MLLMs in Video Understanding

当没有正确答案时：诊断视频理解中多模态大语言模型的缺失答案检测

Yiheng Wang, Yueqian Lin, Lichen Zhu, Yudong Liu, Hai "Helen" Li, Yiran Chen

发表机构 * Duke University（杜克大学）

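AI summary: Studies whether multimodal large language models can detect absent answers in video understanding, finding that models tend to pick distractors instead of recognizing that no option is correct, that the problem is worse in temporal reasoning tasks, and that chain-of-thought prompting raises detection rates but still falls short.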

Comments: Under review

AI中文摘要

多模态大语言模型在视频理解方面取得了实质性进展，但其响应的可靠性仍未得到充分探索。本文对视频理解中多模态大语言模型的缺失答案检测进行了诊断研究，其中正确答案被故意排除在候选集之外，而一个可靠的模型应能识别出没有有效选项。我们在三种设置下评估缺失答案检测行为：带有“以上皆非”选项的多选题、带有检测指令的开放式生成，以及没有任何指导的标准评估。在多种模型和基准测试中，我们发现多模态大语言模型压倒性地选择合理的干扰项，而不是检测到缺失答案。这种失败在时间推理任务中更为明显，并且随着帧采样密度的增加而恶化。我们进一步探索了链式思维提示作为缓解策略，发现虽然它显著提高了检测率，但性能仍不令人满意，这表明仅基于提示的策略不足以完全解决这一局限性。这些发现揭示了缺失答案检测中的系统性失败，并强调了在多模态系统中需要明确的检测机制。

英文摘要

Multimodal large language models (MLLMs) have made substantial advancements in video understanding, yet the reliability of their responses remains underexplored. This work presents a diagnostic study of absent answer detection for MLLMs in video understanding, where the correct answer is deliberately excluded from the candidate set and a reliable model is expected to recognize that no valid option exists. We evaluate the absent answer detection behavior under three settings: multiple-choice questions augmented with a "None of the Above" option, open-ended generation with a detection instruction, and standard evaluation without any guidance. Across a diverse set of models and benchmarks, we find that MLLMs overwhelmingly select plausible distractors rather than detecting the absent answer. This failure is more pronounced in temporal reasoning tasks and worsens with denser frame sampling. We further explore chain-of-thought prompting as a mitigation strategy and find that while it substantially improves detection rates, performance remains unsatisfactory, suggesting that prompting-based strategies alone are insufficient to fully address this limitation. These findings expose a systematic failure in absent answer detection and highlight the need for explicit detection mechanisms in multimodal systems.
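
The first evaluation setting can be sketched as a small data transformation: drop the correct option and append a "None of the Above" choice, then count how often the model picks it. The helper names and the dictionary layout are assumptions, not the paper's benchmark format.

```python
def make_absent_answer_item(question, options, correct_index, nota="None of the Above"):
    """Remove the correct option and append a none-of-the-above choice."""
    kept = [opt for i, opt in enumerate(options) if i != correct_index]
    new_options = kept + [nota]
    # After removal, the only right answer is the appended option.
    return {"question": question, "options": new_options, "answer_index": len(new_options) - 1}

def detection_rate(items, predictions):
    """Fraction of items where the predicted index is the none-of-the-above option."""
    hits = sum(1 for item, pred in zip(items, predictions) if pred == item["answer_index"])
    return hits / len(items)
```

Every remaining original option is now a distractor, so any prediction other than the last index counts as a missed detection.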

2606.08234 2026-06-09 cs.AI 新提交

SciTrace: Trajectory-Aware Safety Reasoning for Scientific Discovery Agents

SciTrace: 面向科学发现代理的轨迹感知安全推理

Tanush Swaminathan, Runmin Jiang, Letian Zhang, Min Xu

发表机构 * Carnegie Mellon University（卡内基梅隆大学）； Allen Institute（艾伦研究所）

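AI summary: Proposes the SciTrace framework, which builds safety reasoning into every stage of the scientific agent pipeline through a Safety-Intrinsic Reasoning Loop and a Compositional Tool-Chain Verifier, achieving state-of-the-art gains in tool-call safety and adversarial robustness.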

Comments: 23 pages

AI中文摘要

基于LLM的科学代理在自主研究方面展现出强大能力，但其安全层在结构上与核心推理相分离：它们检查管道输出，而非塑造产生输出的推理过程。这种分离导致两种故障模式：一个阶段积累的安全信号在下一阶段被丢弃，以及一系列单独良性的工具调用可能组合成有害结果，而单步过滤器无法检测到。为了解决这些挑战，我们引入了\textbf{SciTrace}，这是一个将安全推理编织到科学代理管道每个阶段的框架。SciTrace结合了两种互补机制：\textit{安全内在推理循环}（SIR），通过联合任务与安全推理，在思考者、实验者、写作者和审阅者阶段维护累积风险状态；以及\textit{组合工具链验证器}（CTV），在执行前执行轨迹感知安全检查，捕捉仅出现在多步工具序列中的风险。在跨越六个科学领域的240个高风险研究任务和120个工具相关风险任务上的评估中，SciTrace在四个骨干模型上实现了框架间的\textbf{最先进}（SOTA）安全性：它持续提高了工具调用安全性和对抗鲁棒性，同时保持了科学输出质量，并发现了单步监视器遗漏的\textbf{78.8\%}的组合工具链逃逸。项目网站可在https://opensciagent.github.io/SciTrace/ 获取。

英文摘要

LLM-based scientific agents have shown strong capacity for autonomous research, yet their safety layers remain structurally divorced from core reasoning: they inspect pipeline outputs rather than shaping the deliberation that produces them. This separation opens two failure modes: safety signals accumulated at one stage are discarded before the next, and sequences of individually benign tool calls can compose into harmful outcomes that no single-step filter detects. To address these challenges, we introduce \textbf{SciTrace}, a framework that weaves safety reasoning into every stage of the scientific agent pipeline. SciTrace couples two complementary mechanisms: a \textit{Safety-Intrinsic Reasoning Loop} (SIR) that maintains a cumulative risk state across the Thinker, Experimenter, Writer, and Reviewer stages through joint task-and-safety deliberation, and a \textit{Compositional Tool-Chain Verifier} (CTV) that performs trajectory-aware safety checks before execution, catching risks that surface only across multi-step tool sequences. Evaluated on 240 high-risk research tasks and 120 tool-related risk tasks spanning six scientific domains, SciTrace achieves state-of-the-art (\textbf{SOTA}) safety among compared frameworks across four backbone models: it consistently improves tool call safety and adversarial robustness while preserving scientific output quality, and it uncovers \textbf{78.8\%} of the compositional tool-chain escapes that single-step monitors miss. The project website is available at https://opensciagent.github.io/SciTrace/.
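
The gap between single-step filtering and trajectory-aware checking can be shown with a toy sketch. The tag-based rule format, the example tags, and the function names are all hypothetical; the abstract does not describe how the Compositional Tool-Chain Verifier represents risk.

```python
def single_step_ok(call, banned_tools):
    """Per-call filter: looks at one tool call in isolation."""
    return call["tool"] not in banned_tools

def trajectory_ok(calls, risky_combinations):
    """Trajectory-aware check: accumulate tags over the planned sequence and
    reject it as soon as any risky combination of tags has been completed."""
    seen = set()
    for call in calls:
        seen.update(call["tags"])
        for combo in risky_combinations:
            if combo <= seen:  # every tag in the combination has appeared
                return False
    return True
```

A plan in which each call passes `single_step_ok` can still fail `trajectory_ok`, which is the kind of compositional escape the abstract says single-step monitors miss.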

2606.08231 2026-06-09 cs.CV 新提交

Test-Time Scaling in Multimodal Foundation Models: A Comprehensive Survey of Generation and Reasoning

多模态基础模型中的测试时扩展：生成与推理的综合调查

Cong Wan, Ying He, Zhongzhan Huang, Hefeng Wu

发表机构 * Sun Yat-sen University（中山大学）

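AI summary: Presents the first systematic survey of test-time scaling (TTS) methods for multimodal foundation models, proposes a unified taxonomy (sampling-based, feedback-based, and search-based), summarizes applications and benchmarks, and discusses future directions.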

Comments: Accepted by ACL 2026, Findings

AI中文摘要

测试时扩展（TTS）已成为通过在推理过程中动态分配计算资源来增强模型性能的关键研究方向。最近的进展将这一范式应用于多模态基础模型（MFMs），释放了它们在多模态推理和生成方面的潜力。尽管进展迅速，该领域缺乏系统性的调查和统一的理论框架来描绘多模态TTS的发展格局。为填补这一空白，我们首次对MFMs的TTS研究进行了全面综述，提出了一个统一的分类框架，将现有方法归纳为三种不同策略：基于采样的、基于反馈的和基于搜索的方法。我们进一步总结了常用于评估多模态TTS在生成和推理任务中能力的代表性应用和基准。最后，本调查讨论了开放挑战并概述了未来研究方向，为这一快速发展的领域的后续研究提供了系统路线图。

英文摘要

Test-time Scaling (TTS) has emerged as a pivotal research direction for enhancing model performance by dynamically allocating computational resources during inference. Recent advancements have adapted this paradigm to Multimodal Foundation Models (MFMs), unlocking their potential in multimodal reasoning and generation. Despite rapid progress, the field lacks a systematic survey and unified theoretical framework to delineate the developmental landscape of multimodal TTS. To bridge this gap, we present the first comprehensive review of TTS research for MFMs, proposing a unified taxonomic framework that categorizes existing methodologies into three distinct strategies: sampling-based, feedback-based, and search-based approaches. We further summarize representative applications and benchmarks commonly utilized to evaluate multimodal TTS capabilities in generation and reasoning tasks. Finally, this survey discusses open challenges and outlines future research directions, providing a systematic roadmap for subsequent studies in this rapidly evolving field.
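
Of the three strategy families, the sampling-based one is the simplest to illustrate: draw several candidates and keep the one a scorer prefers (best-of-N). The sketch below is generic; `generate` and `score` are placeholder callables, not anything defined in the survey.

```python
import random

def best_of_n(generate, score, n, seed=0):
    """Sample n candidates with a seeded RNG and return the highest-scoring one.

    generate: callable taking a random.Random and returning one candidate.
    score:    callable mapping a candidate to a number (higher is better).
    """
    rng = random.Random(seed)
    candidates = [generate(rng) for _ in range(n)]
    return max(candidates, key=score)
```

Increasing `n` spends more inference compute for a result whose score can only stay the same or improve, which is the basic trade that test-time scaling exploits.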

2606.08221 2026-06-09 cs.LG 新提交

De novo molecular generation with optical property preconditioning at the token level

基于Token级光学性质预条件的从头分子生成

Haozhe Huang, Manuel Gonzalez Lastre, Hyun Suk Park, Jorge A. Campos-Gonzalez-Angulo, Xinjian Liu, Alán Aspuru-Guzik

发表机构 * University of Toronto（多伦多大学）； Vector Institute for Artificial Intelligence（向量人工智能研究所）； Universidad Autónoma de Madrid（马德里自治大学）； Canadian Institute for Advanced Research (CIFAR)（加拿大高等研究院）； NVIDIA（英伟达）

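AI summary: To address data scarcity and the limited reliability of conditional control when generating OLED molecules with controllable optical properties, proposes a GPT2-based token-conditioned autoregressive language model that uses discrete property tokens and multi-task optimisation to steer vertical absorption energy and oscillator strength, and evaluates distributional fidelity and controllability at the TDDFT level.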

AI中文摘要

由于高质量数据的稀缺以及生成模型中跨化学基序的条件控制可靠性有限，设计具有目标光学性质的OLED分子仍然具有挑战性。在此，我们在现实低数据场景下对用于OLED分子生成的Token条件自回归语言模型进行了基准测试。一个GPT2模型在大规模化学语料库上进行预训练，增加了离散性质Token，并通过多任务优化进行微调。条件目标为垂直吸收能和振子强度，并将HOMO-LUMO能隙作为辅助电子描述符。生成的分子在TDDFT水平上进行评估，以评估分布保真度和可控性。生成的库再现了训练分布的主要光学性质支持，同时向更低分子量和更少重原子偏移。Token级控制在不同条件区间内一致定向，但并非完全正交，并表现出局部校准不规则性。化学型解析分析进一步表明，可控性强烈依赖于局部电子环境：适度共轭的芳香碳基序与改进的联合目标满足度相关，而吸电子基序，特别是芳基腈，表现出系统性红移和可控性降低。这些结果为条件OLED分子生成建立了定量基准，并表明模型可靠性必须在化学上有意义的子空间中评估，而非仅从聚合性质分布中评估。

英文摘要

Designing OLED molecules with targeted optical properties remains challenging due to the scarcity of high-quality data and the limited reliability of conditional control in generative models across chemical motifs. Here, we benchmark a token-conditioned autoregressive language model for OLED molecular generation in a realistic low-data regime. A GPT2 model is pretrained on large chemical corpora, augmented with discrete property tokens, and fine-tuned using multi-task optimisation. Conditioning targets vertical absorption energy and oscillator strength, with the HOMO-LUMO gap included as an auxiliary electronic descriptor. Generated molecules are evaluated at the TDDFT level to assess distributional fidelity and controllability. The generated library reproduces the dominant optical-property support of the training distribution while shifting towards lower molecular weight and fewer heavy atoms. Token-level control is consistently directional across conditioning bins, but is not fully orthogonal and exhibits local calibration irregularities. A chemotype-resolved analysis further shows that controllability depends strongly on local electronic environments: moderately conjugated aromatic-carbon motifs are associated with improved joint target satisfaction, whereas electron-withdrawing motifs, particularly aryl nitriles, show systematic red-shifting and reduced controllability. These results establish a quantitative benchmark for conditional OLED molecular generation and show that model reliability must be assessed in chemically meaningful subspaces rather than from aggregate property distributions alone.
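
Discrete property tokens are typically produced by binning a continuous value and prepending the resulting token to the molecule string. The sketch below assumes that scheme; the token format `<name_bin>`, the bin edges, and the helper names are hypothetical and not taken from the paper.

```python
import bisect

def property_token(name, value, edges):
    """Map a continuous property value to a bin token such as <E_2>.

    edges must be sorted ascending; bin i covers edges[i-1] <= value < edges[i].
    """
    idx = bisect.bisect_right(edges, value)
    return f"<{name}_{idx}>"

def build_prompt(smiles, properties, edges_by_name):
    """Prefix a SMILES string with one token per property, in sorted name order."""
    tokens = [property_token(n, properties[n], edges_by_name[n]) for n in sorted(properties)]
    return "".join(tokens) + smiles
```

At generation time only the property tokens would be supplied as the prompt, and the model would be asked to continue with a molecule that falls in the requested bins.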

2606.08218 2026-06-09 cs.LG cs.AI math.ST stat.ML stat.TH 新提交

How Deep Are Deep GPs, Really? A Sharp Threshold and a Non-Gaussian Limit for Compositional GPs

深度高斯过程到底有多深？组合高斯过程的尖锐阈值与非高斯极限

Mark Kozdoba, Shie Mannor

发表机构 * Technion, IIT（以色列理工学院）； NVIDIA（英伟达）

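AI summary: Studies the limiting behaviour of deep Gaussian process priors as depth grows and identifies a sharp threshold on the RBF kernel bandwidth, below which the prior converges to a non-degenerate, non-Gaussian distribution with non-vanishing dependence between coordinates.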

AI中文摘要

组合先验描述了深度贝叶斯模型中分层函数的通用属性，其中随机权重的深度神经网络是一个典型例子。在宽网络极限下，先验是一个具有深度相关核的高斯过程，其随深度增长的行为已通过该核得到广泛研究。这里，我们研究另一种情况，其中每一层本身是一个向量值高斯过程，我们的目标类似地理解先验随深度增长的极限行为。先前的高斯过程工作已确定，对于RBF核和一定范围的带宽$r$，先验在极限下退化，收敛到常数函数集，这作为概率模型是无用的。在本文中，我们建立了几个新结果。首先，我们识别出一个尖锐的带宽阈值$r_c(d) = \Theta(\sqrt{d})$，高于该阈值极限是退化的，加强了先前的界限。其次，更重要的是，我们证明对于低于阈值$r_c(d)$的$r$，先验收敛到极限分布$\pi_{\bar{Z}}$。我们还证明这些分布是非退化且非高斯的，坐标之间具有非消失的依赖性。与先前已知的退化机制相反，深度高斯过程先验因此可以允许非平凡极限。实验上，我们在维度$d$的范围内验证了该阈值，并展示了极限分布$\pi_{\bar{Z}}$的复杂多模态行为；该机制随$d$增长而变得狭窄，且在不了解阈值的情况下难以识别。

英文摘要

Compositional priors describe the generic properties of layered functions in deep Bayesian models, where deep neural networks with random weights are a canonical example. In the wide-network limit, the prior is a Gaussian process with a depth-dependent kernel, and its behaviour as depth grows has been extensively studied through this kernel. Here, we study another case, where each layer itself is a vector-valued Gaussian process, and our aim is similarly to understand the limiting behaviour of the prior as depth grows. Previous GP work has established that for the RBF kernel and a certain range of bandwidths $r$, the prior degenerates in the limit, converging to the set of constant functions -- which is not useful as a probabilistic model. In this paper we establish several new results. First, we identify a sharp bandwidth threshold $r_c(d) = \Theta(\sqrt{d})$ above which the limit is degenerate, strengthening the earlier bounds. Second, and more importantly, we show that for $r$ below the threshold $r_c(d)$ the prior converges to a limit distribution $\pi_{\bar{Z}}$. We also prove that these distributions are non-degenerate and non-Gaussian, with non-vanishing dependence between coordinates. In contrast to the previously known degenerate regime, deep Gaussian process priors can therefore admit non-trivial limits. Empirically, we verify the threshold across a range of dimensions $d$, and demonstrate a complex multimodal behaviour of the limit distributions $\pi_{\bar{Z}}$ -- a regime that becomes increasingly narrow with $d$ and would be hard to identify without knowing the threshold.
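
To make the setting concrete, the compositional prior can be written out as below. This is a sketch under assumptions: the layer symbols $g_\ell$, the depth $L$, the independence of coordinates, and the particular RBF normalisation are illustrative choices not stated in the abstract, and the constant hidden in $\Theta(\sqrt{d})$ depends on that normalisation.

```latex
% Assumed notation: g_1, ..., g_L are independent layers, each a d-dimensional GP.
\[
  f^{(L)} = g_L \circ g_{L-1} \circ \cdots \circ g_1,
  \qquad
  g_\ell = (g_{\ell,1}, \dots, g_{\ell,d}),
  \qquad
  g_{\ell,j} \overset{\text{iid}}{\sim} \mathcal{GP}(0, k_r),
\]
\[
  k_r(x, x') = \exp\!\left( -\frac{\lVert x - x' \rVert^2}{2 r^2} \right),
  \qquad x, x' \in \mathbb{R}^d .
\]
% Regimes described in the abstract, as L \to \infty:
%   r above r_c(d) = \Theta(\sqrt{d}): the prior degenerates to constant functions;
%   r below r_c(d): the prior converges to a non-degenerate, non-Gaussian limit \pi_{\bar{Z}}.
```

Read this way, the bandwidth $r$ controls how strongly each layer contracts distances between inputs, and the threshold separates the regime where repeated composition collapses all inputs together from the regime where structure survives.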

2606.08214 2026-06-09 cs.RO 新提交

Agentic Neuro-Symbolic Planning and Commissioning for Human-in-the-Loop Industrial Robotics with Digital Twins

面向人机协同工业机器人的智能神经符号规划与调试：基于数字孪生

Zhihao Liu, Victor Nan Fernandez-Ayala, Tianyu Wang, Qiang Qin, Xi Vincent Wang, Dimos V. Dimarogonas, Lihui Wang

发表机构 * Royal Institute of Technology (KTH)（皇家理工学院（KTH））

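AI summary: Proposes a neuro-symbolic framework that combines LLM language understanding with deterministic verification and execution, using an SDI architecture and a two-tier recovery mechanism, with plans validated in a digital twin before physical execution; it achieves the highest task success among the ten baselines evaluated.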

AI中文摘要

灵活的机器人自动化需要系统能够解释操作员意图、验证物理可行性，并在规划和执行阶段从执行失败中恢复。本文提出了一种面向人机协同工业机器人的智能神经符号框架，其中LLM用于需要语言理解或上下文推理的任务，而所有验证、排序和执行保持确定性。该框架将软件工程中的规划器-生成器-评估器（PGE）模式改编为面向工业机器人的指定器-设计器-检查器（SDI）架构，并结合基于LangGraph的动态路由进行故障恢复。两级恢复机制通过上下文感知编排处理结构级重新规划，并通过确定性恢复技能处理执行级几何故障。Unity3D数字孪生支持在物理执行前进行人工检查、修改和重新验证。在多个难度级别的自然语言命令上对十个基线进行评估，所提方法实现了最高的任务成功率。消融结果证实，结构化命令扩展、符号验证、选择性LLM路由和恢复技能各自都是必要的。

英文摘要

Flexible robotic automation requires systems that interpret operator intent, verify physical feasibility, and recover from execution failures across both the planning and execution stages. This paper proposes an agentic neuro-symbolic framework for human-in-the-loop industrial robotics, in which LLMs are used for tasks that require language understanding or contextual reasoning, while all verification, sequencing, and execution remain deterministic. The framework adapts the Planner-Generator-Evaluator (PGE) harness pattern from software engineering into a Specifier-Designer-Inspector (SDI) architecture for industrial robotics, combined with LangGraph-based dynamic routing for failure recovery. A two-tier recovery mechanism addresses structure-level replanning through context-aware orchestration and execution-level geometric failures through deterministic recovery skills. A Unity3D digital twin supports human inspection, modification, and re-verification prior to physical execution. Evaluated on natural-language commands across multiple difficulty levels against ten baselines, the proposed method achieves the highest task success. Ablation results confirm that structured command expansion, symbolic verification, selective LLM routing, and recovery skills are each individually necessary.

2606.08212 2026-06-09 cs.LG 新提交

Public Machine Learning Solver Framework for Novices in the Machine Learning Domain

面向机器学习初学者的公共机器学习求解器框架

Lokman Saleh, Hafedh Mili, Mounir Boukadoum

发表机构 * LATECE Lab, Université du Québec à Montréal（LATECE实验室，魁北克大学蒙特利尔分校）

AI总结提出一个结合专家知识和迁移学习的半自动化平台，为非专家推荐完整的机器学习流水线，并自动提取数据特征，通过一阶逻辑推理提供排名算法。

AI中文摘要

解决机器学习问题很复杂，通常只有专家才能胜任。过去二十年中，出现了支持非专家的系统。根据我们的回顾，我们识别出三类：(1) 全自动AutoML系统，(2) 用于算法选择的专家备忘单，以及(3) 使用选择标准（准确性、透明度、数据要求）的决策支持系统。我们提出一个新平台，结合了第2和第3类，为非专家提供半自动化、智能的解决方案推荐。与推荐单一算法的现有方法不同，我们的平台建议一个针对用户问题量身定制的完整流水线。它整合了专家定义的选择标准与迁移学习，并自动从用户提供的数据集中提取数据特征（例如，类别不平衡、缺失值）。该平台使用一阶逻辑对其知识库进行推理，并推荐按相关性排序的合适算法。它具有用户友好的界面，并连接到面向机器学习专家的众包平台，确保持续更新。该平台是增量构建的，允许无缝集成新算法、标准和领域知识。据我们所知，这是第一个免费、公开可访问的在线框架，系统地捕获和操作专家知识，以结构化、透明的方式指导非专家解决机器学习问题。

英文摘要

Solving machine learning problems is complex and typically reserved for experts. Over the past two decades, systems have emerged to support non-experts. Based on our review, we identify three categories: (1) fully automated AutoML systems, (2) expert cheat sheets for algorithm selection, and (3) decision-support systems using selection criteria (accuracy, transparency, data requirements). We propose a new platform combining categories 2 and 3 to deliver semi-automated, intelligent solution recommendations for non-experts. Unlike existing approaches that recommend a single algorithm, our platform suggests a complete pipeline tailored to the user's problem. It integrates expert-defined selection criteria with transfer learning and automatically extracts data characteristics (e.g., class imbalance, missing values) from user-provided datasets. The platform uses first-order logic to reason over its knowledge base and recommends suitable algorithms ranked by relevance. It features a user-friendly interface and connects to a crowdsourcing platform for ML experts, ensuring continuous updates. The platform is built incrementally, allowing seamless integration of new algorithms, criteria, and domain knowledge. To our knowledge, this is the first free, publicly accessible online framework that systematically captures and operationalizes expert knowledge to guide non-experts in solving ML problems in a structured, transparent manner.

2606.08206 2026-06-09 cs.CV cs.LG 新提交

SegmentAnyTreeV2: Scaling Transformer-Based Tree Instance Segmentation Across Sensors, Platforms, and Forests

SegmentAnyTreeV2：跨传感器、平台和森林的基于Transformer的树木实例分割扩展

Maciej Wielgosz, Stefano Puliti, Rasmus Astrup

发表机构 * Norwegian Institute of Bioeconomy Research (NIBIO)（挪威生物经济研究所（NIBIO））

AI总结提出SegmentAnyTreeV2，一种传感器和平台无关的森林点云语义与实例分割框架，结合Point Transformer v3骨干网络、轻量语义头和树木交叉注意力掩码解码器，在FOR-instanceV2测试集上达到90.5%精度和80.2%召回率，并展现出强跨域泛化能力。

Comments: 25 pages, 6 figures, 10 tables

AI中文摘要

我们提出SegmentAnyTreeV2，一种传感器和平台无关的森林点云语义与实例分割框架。该模型结合了基于序列化的Point Transformer v3骨干网络、轻量级语义头以及专注于树木的交叉注意力掩码解码器。语义预测将实例解码限制在树木类体素上，而实例感知的查询初始化、一对多种子监督和非对称掩码评分改善了密集和结构复杂林分中的分离效果。我们进一步引入了FOR-instance v3，一个扩展的基准数据集，包含427个场景和26,496棵标注树木，涵盖不同生物群落、森林结构和LiDAR平台。在FOR-instanceV2测试集上，SegmentAnyTreeV2实现了90.5%的精度、80.2%的召回率、85.0%的F1分数、90.7%的覆盖率和87.6%的语义mIoU，在实例检测和掩码完整性方面均优于以往基于学习的方法。在独立站点上的零样本评估进一步证明了其强大的跨域泛化能力。

英文摘要

We present SegmentAnyTreeV2, a sensor- and platform-agnostic framework for semantic and instance segmentation of forest point clouds. The model combines a serialization-based Point Transformer v3 backbone with a lightweight semantic head and a tree-focused cross-attention mask decoder. Semantic predictions restrict instance decoding to tree-class voxels, while instance-aware query initialization, one-to-many seed supervision, and asymmetric mask scoring improve separation in dense and structurally complex stands. We further introduce FOR-instance v3, an expanded benchmark comprising 427 scenes and 26,496 annotated trees across diverse biomes, forest structures, and LiDAR platforms. On the FOR-instanceV2 test split, SegmentAnyTreeV2 achieves 90.5% precision, 80.2% recall, 85.0% F1, 90.7% coverage, and 87.6% semantic mIoU, outperforming previous learning-based methods in both instance detection and mask completeness. Zero-shot evaluation on independent sites further demonstrates strong cross-domain generalization.
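As a quick consistency check on the reported metrics, F1 is the harmonic mean of precision and recall. The sketch below assumes the abstract's F1 is computed directly from the aggregate precision and recall it quotes (rather than, say, averaged per scene); the function name `f1_score` is illustrative and not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both in the same units)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as quoted in the abstract, in percent.
f1 = f1_score(90.5, 80.2)
print(round(f1, 1))  # agrees with the 85.0% F1 quoted in the abstract
```

Under that assumption the three quoted numbers are mutually consistent.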

2606.08205 2026-06-09 cs.CV 新提交

Empowering Feed-Forward Reconstruction Models with Metric Scale via Satellite Images

利用卫星图像赋予前馈重建模型度量尺度

Xianghui Ze, Yongjian Luo, Mengjun Chao, Zhenbo Song, Jianfeng Lu, Yujiao Shi

发表机构 * Nanjing University of Science and Technology（南京理工大学）； ShanghaiTech University（上海科技大学）

AI总结提出卫星引导框架，通过双向交叉视图交互利用卫星图像作为全局度量参考，解决前馈3D重建中的尺度模糊问题，实现度量深度估计、点云重建和相机定位。

AI中文摘要

前馈3D重建模型最近在多样场景中展现出强大的泛化能力，但大多数模型仅能恢复未知全局尺度下的几何结构。这种尺度模糊限制了它们在需要环境度量理解的应用中的使用。现有的度量重建方法通常依赖于大规模度量标注或精确的相机标定，这在许多实际场景中成本高昂或不可靠。我们提出了一种卫星引导框架，用于解决前馈3D重建中的尺度模糊问题。关键思想是利用现成的卫星图像作为全局度量参考。给定粗略的相机姿态，我们的方法检索局部卫星图像块，并通过双向交叉视图交互将其与前馈重建主干集成。通过强制重建场景与卫星参考之间的一致性，模型推断绝对尺度、细化场景几何并在度量坐标系中估计相机姿态。在KITTI、nuScenes和Oxford RobotCar上的实验表明，该方法在度量深度估计、多视角点云重建和跨视角相机定位方面取得了一致改进，同时保持了跨数据集和地理区域的强泛化能力。

英文摘要

Feed-forward 3D reconstruction models have recently shown strong generalization across diverse scenes, yet most of them recover geometry only up to an unknown global scale. This scale ambiguity limits their use in applications that require metric understanding of the environment. Existing metric reconstruction methods commonly rely on large-scale metric annotations or accurate camera calibration, both of which are costly or unreliable in many real-world settings. We propose a satellite-guided framework for resolving scale ambiguity in feed-forward 3D reconstruction. The key idea is to use readily available satellite imagery as a global metric reference. Given a coarse camera pose, our method retrieves a local satellite patch and integrates it with a feed-forward reconstruction backbone through bidirectional cross-view interaction. By enforcing consistency between the reconstructed scene and the satellite reference, the model infers absolute scale, refines scene geometry, and estimates camera pose in a metric coordinate frame. Experiments on KITTI, nuScenes, and Oxford RobotCar show consistent improvements in metric depth estimation, multi-view point-cloud reconstruction, and cross-view camera localization, while preserving strong generalization across datasets and geographic regions.
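To make "scale ambiguity" concrete: an up-to-scale reconstruction differs from the metric one by a single unknown factor. The sketch below is not the paper's method (which uses learned bidirectional cross-view interaction with satellite patches); it only shows the classical closed-form least-squares estimate of that factor, assuming a few metric reference distances are available. All names and values are hypothetical.

```python
def least_squares_scale(relative, metric):
    """Scale s minimizing sum((s * r - m)^2) over paired distances."""
    if len(relative) != len(metric) or not relative:
        raise ValueError("need equally sized, non-empty inputs")
    numerator = sum(r * m for r, m in zip(relative, metric))
    denominator = sum(r * r for r in relative)
    return numerator / denominator


# Hypothetical up-to-scale depths and their metric counterparts (meters).
relative_depths = [1.0, 2.0, 3.0]
metric_depths = [2.5, 5.0, 7.5]
scale = least_squares_scale(relative_depths, metric_depths)
print(scale)  # 2.5
```

Multiplying the reconstructed geometry by `scale` places it in metric units under these assumptions.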

2606.08204 2026-06-09 cs.LG cs.CV 新提交

Neural Field Tokenizations with Hierarchy and Spatial Locality Priors

具有层次和空间局部性先验的神经场分词

Alonso Urbano, David W. Romero, Max Zimmer, Sebastian Pokutta

发表机构 * Zuse Institute Berlin (ZIB)（柏林祖斯研究所）； Cartesia AI ； Technische Universität Berlin（柏林工业大学）

AI总结提出LH-NeF框架，利用层次和局部性先验学习通用连续信号的分词表示，通过前馈编码替代元学习，内存减少42倍，批大小提升133倍，在图像、3D形状和气候场上匹配或超越多种基线。

AI中文摘要

神经场将数据参数化为从坐标到值的函数，为跨模态表示学习提供统一框架。现有方法以每样本元学习为主，由于内存密集的内循环优化而扩展性差。自然的替代方案——前馈编码——通常引入模态特定假设，牺牲了神经场学习的通用性。我们认为局部性和层次性是学习场表示的有用先验，可以在不损害模态无关性的情况下注入。我们提出LH-NeF，一个学习连续信号通用分词表示的框架。保持局部性的层次编码器将原始坐标-值场观测映射到结构化分词，训练期间从中重建场。通过用单次前向传播替代元学习的内循环，LH-NeF比最强的模态无关基线少用42倍内存，支持133倍更大的批次。在图像、3D形状和气候场上，我们的学习表示在重建和下游任务上匹配或超过模态无关、模态特定和专用生成神经场基线的性能。

英文摘要

Neural fields parameterize data as functions from coordinates to values, providing a unified framework for representation learning across modalities. Existing approaches are dominated by per-sample meta-learning, which scales poorly due to memory-intensive inner-loop optimization. The natural alternative -- feed-forward encoding -- typically introduces modality-specific assumptions, sacrificing the generality that makes learning with neural fields attractive. We argue that locality and hierarchy are useful priors for learning field representations that can be injected without compromising modality-agnosticism. We propose LH-NeF, a framework to learn general-purpose tokenized representations of continuous signals. A locality-preserving hierarchical encoder maps raw coordinate-value field observations to structured tokens, from which the field is reconstructed during training. By replacing meta-learning's inner loop with a single forward pass, LH-NeF uses 42× less memory and supports 133× larger batches than the strongest modality-agnostic baseline. Across images, 3D shapes, and climate fields, our learned representations match or exceed performance of modality-agnostic, modality-specific, and specialized generative neural field baselines on both reconstruction and downstream tasks.

2606.08197 2026-06-09 cs.CL cs.DC 新提交

AlignFed: Alignment-Aware Asynchronous Federated Fine-Tuning for Large Language Models in Heterogeneous Edge Environments

AlignFed: 异构边缘环境中大语言模型的对齐感知异步联邦微调

Yan Wang, Ziyi Gao, Rui Wang

发表机构 * University of Science and Technology Beijing（北京科技大学）

AI总结提出AlignFed框架，通过多阶段语义对齐机制（版本感知更新分组、跨版本语义对齐、公平性感知聚合）解决异步联邦微调中大语言模型在异构边缘环境中的模型漂移、客户端漂移和聚合不公平问题。

AI中文摘要

大语言模型（LLMs）显著推动了边缘智能的发展，并已广泛应用于自动驾驶、工业检测和个性化物联网服务等多种场景。然而，由于严格的数据隐私约束、高度异构的计算和通信资源以及本地数据的非独立同分布（non-IID）特性，在边缘设备上协作适配LLMs仍面临严峻挑战。联邦微调（FFT）能够在无需暴露原始数据的情况下实现分布式模型的协作优化。然而，传统的同步聚合存在严重的掉队者效应，导致系统延迟高、资源利用率低。现有的异步联邦学习方法主要针对中小规模模型设计，难以解决LLM微调中特有的挑战，即由陈旧更新引起的模型漂移、由数据异质性加剧的客户端漂移以及由快速客户端主导导致的聚合公平性失衡。针对这些问题，本文提出AlignFed，一种面向异构边缘环境的LLMs异步联邦微调框架。AlignFed采用轻量级多阶段语义对齐机制，包含三个核心模块：版本感知的更新分组、基于小批量校准集的跨版本语义对齐，以及结合更新新鲜度和客户端参与频率的公平性感知聚合。该框架有效缓解了跨版本模型漂移和客户端漂移，同时增强了聚合公平性，从而在高异质性和显著更新陈旧性的场景中实现稳定高效的异步联邦优化。

英文摘要

Large Language Models (LLMs) have significantly propelled the advancement of edge intelligence and have been widely deployed across various scenarios, including autonomous driving, industrial inspection, and personalized IoT services. However, the collaborative adaptation of LLMs on edge devices continues to face formidable challenges due to strict data privacy constraints, highly heterogeneous computing and communication resources, and the non-independent and identically distributed (non-IID) nature of local data. Federated Fine-Tuning (FFT) enables the collaborative optimization of distributed models without exposing raw data. Yet, traditional synchronous aggregation suffers from a severe straggler effect, resulting in high system latency and low resource utilization. Existing asynchronous federated learning methods are predominantly designed for small-to-medium-scale models and struggle to address the specific challenges inherent in LLM fine-tuning, namely model drift caused by stale updates, aggravated client drift stemming from data heterogeneity, and aggregation fairness imbalance resulting from the dominance of fast clients. To address these issues, this paper proposes AlignFed, an asynchronous federated fine-tuning framework for LLMs tailored to heterogeneous edge environments. AlignFed employs a lightweight multi-stage semantic alignment mechanism comprising three core modules: version-aware update grouping, cross-version semantic alignment based on a mini-batch calibration set, and fairness-aware aggregation that integrates both update freshness and client participation frequency. This framework effectively mitigates cross-version model drift and client drift while enhancing aggregation fairness, thereby achieving stable and efficient asynchronous federated optimization in scenarios characterized by high heterogeneity and significant update staleness.
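The abstract does not give the aggregation formula, so the following is only a hypothetical sketch of what "fairness-aware aggregation that integrates both update freshness and client participation frequency" could look like. The polynomial staleness decay, the `1 / (1 + participation)` term, and every name here are assumptions for illustration, not AlignFed's actual rule.

```python
def aggregation_weights(staleness, participation, decay=0.5):
    """Hypothetical weights: fresher updates and rarely seen clients count more.

    staleness[i]     -- number of global versions the update lags behind
    participation[i] -- how many times client i has already contributed
    """
    raw = [
        (1.0 / (1.0 + s) ** decay) / (1.0 + p)
        for s, p in zip(staleness, participation)
    ]
    total = sum(raw)
    return [r / total for r in raw]


def aggregate(updates, weights):
    """Weighted average of equally sized update vectors."""
    dim = len(updates[0])
    return [sum(w * u[k] for w, u in zip(weights, updates)) for k in range(dim)]


# A fresh update from a rare client vs. a stale update from a frequent one.
weights = aggregation_weights(staleness=[0, 4], participation=[1, 9])
merged = aggregate([[1.0, 0.0], [0.0, 1.0]], weights)
print(weights, merged)
```

With these made-up inputs the fresh, rarely participating client dominates the average, which is the qualitative behavior the abstract describes as counteracting fast-client dominance.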

2606.08194 2026-06-09 cs.CL cs.AI 新提交

GlobeAudio: A Multilingual Multicultural Benchmark for Naturalistic Evaluation of Large Audio-Language Models

GlobeAudio：用于大型音频-语言模型自然主义评估的多语言多文化基准

Ryner Tan, Wenxuan Zhang

发表机构 * Singapore University of Technology and Design（新加坡科技设计大学）

AI总结提出GlobeAudio基准，包含5637道多语言多选题，评估大型音频-语言模型在自然音频条件下的听觉推理和文化理解能力，发现开源模型和低资源语言存在显著性能差距。

AI中文摘要

大型音频-语言模型（LALMs）在统一框架中整合了音频感知和语言理解，支持广泛的实际应用。尽管近期取得了进展，但LALMs的评估相对于实际需求仍严重不足：大多数评估缺乏真正的语言和文化真实性，而其他评估则未能捕捉声学真实性。为弥补这一差距，我们提出了GlobeAudio，一个旨在评估自然音频理解的多语言和多文化基准。GlobeAudio包含5637道多项选择题，涵盖六种类型多样的语言，由母语者基于自然发生的音频精心制作。为了表现良好，模型必须具有更高层次的听觉推理技能和文化基础的解释。我们系统地评估了代表性的闭源和开源LALMs，以及级联的ASR-LLM流水线。我们的实验揭示了在自然声学条件下的显著性能差距，特别是对于开源模型和低资源语言。这些发现凸显了当前LALMs的关键局限性，并强调了自然音频评估对未来音频-语言系统的重要性。GlobeAudio可在https://huggingface.co/datasets/iNLP-Lab/GlobeAudio 获取。

英文摘要

Large Audio-Language Models (LALMs) integrate audio perception and language understanding within a unified framework, enabling a wide range of real-world applications. Despite recent advances, evaluation for LALMs remains heavily underspecified relative to real-world requirements: most lack true linguistic and cultural authenticity, while others fail to capture acoustic realism. To bridge this gap, we propose GlobeAudio, a multilingual and multicultural benchmark designed to evaluate naturalistic audio understanding. GlobeAudio consists of 5,637 multiple-choice questions across six typologically diverse languages, expertly crafted by native speakers grounded on naturally occurring audio. In order to do well, models must possess higher-level auditory reasoning skills and culturally grounded interpretation. We systematically evaluate representative closed-source and open-source LALMs, as well as cascaded ASR-LLM pipelines. Our experiments reveal substantial performance gaps under natural acoustic conditions, particularly for open-source models and low-resource languages. These findings highlight critical limitations of current LALMs and underscore the importance of naturalistic audio evaluation for future audio-language systems. GlobeAudio can be found at https://huggingface.co/datasets/iNLP-Lab/GlobeAudio .

2606.08186 2026-06-09 cs.RO 新提交

Propeller-Assisted Robust 3D Hopping Robot with Hierarchical Force Allocation

螺旋桨辅助的鲁棒三维跳跃机器人及分层力分配

Chuhan Zhang, Hongbo Zhang, Yanlin Chen, Yunxi Tang, Yun-Hui Liu, Mingyi Liu, Xiangyu Chu

发表机构 * The Chinese University of Hong Kong（香港中文大学）； Guangdong Technion–Israel Institute of Technology（广东以色列理工学院）； Technion–Israel Institute of Technology（以色列理工学院）； Multiscale Medical Robotics Centre（多尺度医疗机器人中心）

AI总结提出一种螺旋桨辅助的单腿三维跳跃机器人Pro-OMEGA2，通过分层力分配框架协调腿与三旋翼的力，实现鲁棒跳跃和扰动恢复。

Comments: 8 pages, 9 figures, 1 table. Accepted to the 2026 IEEE International Conference on Automation Science and Engineering (CASE)

AI中文摘要

单腿跳跃机器人概念简单但高度动态且天生不稳定。实现鲁棒的三维跳跃仍然困难，因为地面反作用力仅在短暂的支撑阶段可用，而机器人在飞行阶段欠驱动。一个未解决的关键问题是如何提高飞行阶段的控制能力。螺旋桨辅助提供了一种有希望的解决方案，但需要仔细协调腿产生的接触力和螺旋桨推力在支撑和飞行阶段的配合。本文介绍了Pro-OMEGA2，一种螺旋桨辅助的三维单腿跳跃机器人，具有主动3-RSR并联腿和安装在躯干上的三旋翼用于辅助姿态调节。为了解决力协调挑战，我们提出了一种基于单刚体模型的分层力分配框架。腿产生主要的支撑接触力，而三旋翼提供辅助姿态调节，补偿支撑阶段的残余姿态力矩并在飞行阶段维持姿态。室内和室外场景的真实机器人实验展示了持续的三维跳跃，包括地形过渡和脉冲推挤恢复，验证了在未建模接触和外部扰动下的鲁棒性。

英文摘要

Monopedal hopping robots are conceptually simple but highly dynamic and inherently unstable. Achieving robust 3D hopping is still difficult because ground reaction forces are available only during the short stance phase, while the robot is underactuated in flight. A key unresolved issue is how to improve flight-phase control authority. Propeller assistance provides a promising solution, but it requires careful coordination of leg-generated contact forces and propeller thrusts across stance and flight. This paper presents Pro-OMEGA2, a propeller-assisted 3D monopedal hopping robot with an active 3-RSR parallel leg and a trunk-mounted tri-rotor for auxiliary attitude regulation. To address the force coordination challenge, we propose a Hierarchical Force Allocation (HFA) framework based on a single rigid body (SRB) model. The leg generates the main stance contact wrench, while the tri-rotor provides auxiliary attitude regulation, compensating the residual attitude moment in stance and maintaining attitude during flight. Real-robot experiments in indoor and outdoor scenarios demonstrate sustained 3D hopping, including terrain transitions and impulsive push recovery, validating robustness under unmodeled contact and external disturbances.

2606.08184 2026-06-09 cs.CL 新提交

TextEconomizer: Enhancing Lossy Text Compression with Denoising Transformers and Entropy Coding

TextEconomizer：利用去噪变换器和熵编码增强有损文本压缩

Mahbub E Sobhani, Anika Tasnim Rodela, Chowdhury Mofizur Rahman, Dewan Md. Farid, Swakkhar Shatabda

发表机构 * United International University（联合国际大学）； BRAC University（BRAC大学）； Southeast University（东南大学）

AI总结提出TextEconomizer编码器-解码器框架，结合去噪变换器和熵编码，实现50%-80%的压缩率，参数减少153倍，在BLEU等指标上保持近完美文本质量。

DOI: 10.1016/j.neunet.2026.109111
Journal ref: Neural Networks, Vol. 203, 109111, 2026
Comments: Published in Neural Networks (Elsevier), Vol. 203, 2026

AI中文摘要

有损文本压缩在保留核心含义的同时减少数据大小，适用于摘要、自动分析和数字存档。尽管基于变换器的模型在语言建模中占主导地位，但将上下文向量和熵编码集成到序列到序列（Seq2Seq）生成中仍未充分探索。一个关键挑战在于从编码器输出中识别信息最丰富的上下文向量，并引入熵编码以提高存储效率，同时即使在噪声文本下也能保持高质量输出。我们提出了TextEconomizer，一种与变换器神经网络配对的编码器-解码器框架，无需数据集维度的先验知识即可将可变大小输入减少50%至80%。我们的模型通过熵编码实现了有竞争力的压缩比，同时通过BLEU、ROUGE、METEOR和语义相似度评分评估，提供了近乎完美的文本质量。TextEconomizer的参数数量比同类模型少约153倍，实现了5.39倍的压缩比，且不牺牲语义质量。我们还评估了一个基于LSTM的自编码器，实现了最先进的67倍压缩比，参数减少196倍；以及LLaMAFormer，一种改进的变换器，参数比ICAE少263倍，同时保持有竞争力的文本质量。TextEconomizer在平衡内存效率和高保真输出方面显著超越了现有的基于变换器的模型，标志着有损压缩在最优空间利用方面的突破。

英文摘要

Lossy text compression reduces data size while preserving core meaning, making it well-suited for summarization, automated analysis, and digital archives. Despite the dominance of transformer-based models in language modeling, integrating context vectors and entropy coding into Sequence-to-Sequence (Seq2Seq) generation remains underexplored. A key challenge lies in identifying the most informative context vectors from encoder output and incorporating entropy coding to enhance storage efficiency while maintaining high-quality outputs, even under noisy text. We introduce TextEconomizer, an encoder-decoder framework paired with a transformer neural network that reduces variable-sized inputs by 50% to 80% without prior knowledge of dataset dimensions. Our model achieves competitive compression ratios via entropy coding while delivering near-perfect text quality, assessed by BLEU, ROUGE, METEOR, and semantic similarity scores. TextEconomizer operates with approximately 153x fewer parameters than comparable models, achieving a 5.39x compression ratio without sacrificing semantic quality. We also evaluate an LSTM-based autoencoder achieving a state-of-the-art 67x compression ratio with 196x fewer parameters, and LLaMAFormer, a modified transformer with 263x fewer parameters than ICAE while maintaining competitive text quality. TextEconomizer significantly surpasses existing transformer-based models in balancing memory efficiency and high-fidelity outputs, marking a breakthrough in lossy compression with optimal space utilization.
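The abstract quotes both percentage reductions (50% to 80%) and compression ratios (5.39x, 67x). Assuming the usual definition of compression ratio as original size divided by compressed size, the two are related by reduction = 1 - 1/ratio. Whether the abstract's percentages and ratios measure the same quantity is not stated, so the conversion below is only an aid for reading the numbers; the function names are illustrative.

```python
def reduction_from_ratio(ratio: float) -> float:
    """Fraction of the original size removed, given original/compressed ratio."""
    return 1.0 - 1.0 / ratio


def ratio_from_reduction(reduction: float) -> float:
    """Original/compressed ratio, given the fraction of size removed."""
    return 1.0 / (1.0 - reduction)


print(round(reduction_from_ratio(5.39), 3))  # 0.814, i.e. about 81.4% smaller
print(round(ratio_from_reduction(0.5), 2))   # 2.0
print(round(ratio_from_reduction(0.8), 2))   # 5.0
```

Under this definition, a 50% to 80% reduction corresponds to ratios between 2x and 5x.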

2606.08170 2026-06-09 cs.RO 新提交

Learning from Human Driving: A Human-in-the-Loop Online Behavior Cloning Framework for Autonomous Driving

从人类驾驶中学习：一种用于自动驾驶的人机协同在线行为克隆框架

Yuhong Shi, Jianyi Liu, Lihang Sun, Li Li, Xudong Dong

发表机构 * State Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University（西安交通大学人工智能与机器人研究所人机混合增强智能国家重点实验室）

AI总结提出人机协同在线行为克隆框架HiL-OBC，通过人类干预初始化策略、贝叶斯潜在行为建模和在线更新，结合大模型感知与人类驾驶智能，在CARLA基准上显著提升驾驶性能。

AI中文摘要

随着大型基础模型（LFM）的发展，数据驱动的自动驾驶取得了显著进展。然而，现有范式在复杂交互和长尾场景中仍面临分布偏移和因果混淆的严峻挑战。这些限制往往导致在极端条件下缺乏人类级别的决策灵活性和安全性。为克服这一局限，本文提出了一种用于自动驾驶的人机协同在线行为克隆框架（HiL-OBC），旨在深度融合LFM的跨模态感知能力与人类专家的高级驾驶智能。具体而言，HiL-OBC的部署通过三个关键阶段执行：带人类干预的策略初始化、基于贝叶斯策略适应的潜在行为建模，以及在线部署与更新。此外，我们设计了一种多模态在线行为克隆（MOBC）模型，通过轻量级网络架构、接管触发机制和多变量损失函数在线优化基础驾驶策略，从而增强系统在复杂环境中的决策鲁棒性。我们在LangAuto-Human CARLA基准上评估了HiL-OBC。实验结果表明，通过人机协同机制优化的驾驶策略实现了显著的性能提升：StructNav、LFG和LMDrive的驾驶得分（DS）分别提高了47.25%、31.59%和32.12%，同时各种实验设置和关键组件的分析凸显了人机协同学习在提高决策鲁棒性和整体驾驶性能方面的优势。

英文摘要

With the evolution of large foundation models (LFMs), data-driven autonomous driving has made significant strides. However, existing paradigms still face severe challenges in complex interaction and long-tail scenarios due to distribution shift and causal confusion. These limitations often result in a lack of human-level decision-making flexibility and safety in extreme conditions. To overcome these limitations, this paper proposes a Human-in-the-Loop Online Behavior Cloning framework (HiL-OBC) for autonomous driving, which aims to deeply integrate the cross-modal perceptual capabilities of LFMs with the high-level driving intelligence of human experts. Specifically, HiL-OBC deployment is executed through three critical phases: policy initialization with human intervention, latent behavioral modeling with Bayesian policy adaptation, and online deployment and updates. Furthermore, we design a Multi-modal Online Behavior Cloning (MOBC) model, which optimizes the base driving policy online through a lightweight network architecture, a takeover trigger mechanism, and a multi-variant loss function, thereby enhancing the system's decision-making robustness in complex environments. We evaluate HiL-OBC on the LangAuto-Human CARLA benchmark. Experimental results demonstrate that the driving policies optimized via the human-in-the-loop mechanism achieve substantial performance gains: the driving score (DS) of StructNav, LFG, and LMDrive increased by 47.25%, 31.59%, and 32.12%, respectively. In addition, an analysis of various experimental settings and key components highlights the advantages of human-in-the-loop learning in improving decision-making robustness and overall driving performance.

2606.08169 2026-06-09 cs.RO cs.AI cs.CL cs.HC cs.LG 新提交

CLASP: Language-Driven Robot Skill Selection and Composition using Task-Parameterized Learning

CLASP: 基于语言驱动的机器人技能选择与组合，采用任务参数化学习

Markus Knauer, Valentin Gieraths, Tai Mai, Samuel Bustamante, Alin Albu-Schäffer, Freek Stulp, João Silvério

发表机构 * German Aerospace Center (DLR), Institute of Robotics and Mechatronics (RMC)（德国航空航天中心（DLR），机器人与机电一体化研究所（RMC））； Technical University of Munich (TUM)（慕尼黑工业大学（TUM））

AI总结提出CLASP架构，结合任务参数化核化运动基元（TP-KMP）与预训练视觉语言模型（VLM），通过自然语言命令实现技能选择、组合和主动学习，无需微调，在7自由度机械臂上达到73.3%-100%成功率。

Comments: 23 pages, 11 figures, 4 tables, 1 listing

AI中文摘要

使机器人能够理解自然语言命令并执行任务，同时保持数据效率仍然具有挑战性。视觉-语言-动作（VLA）和视觉-语言模型（VLM）等基础模型提供了直观的交互通道，但需要大量数据；任务参数化模仿学习实现了数据效率，但缺乏自然语言基础。这项工作通过一个模块化架构弥合了这一差距，该架构将任务参数化核化运动基元（TP-KMP）与预训练VLM相结合。在学习过程中，技能从2到5次动觉演示中获取，VLM生成描述每个技能参数和前提条件的技能模式。在执行过程中，VLM解释命令以选择技能，推理参数绑定，并通过协方差加权组合创建新颖行为。当没有技能或组合足够时，系统识别能力差距并请求有针对性的演示，所有这些都无需微调。在7自由度机械臂上的验证显示，在需要技能选择、组合和主动学习的场景中，成功率达到73.3%-100%。

英文摘要

Enabling robots to understand and execute tasks from natural language commands while maintaining data efficiency remains challenging. Foundation models such as vision-language-action (VLA) and vision-language models (VLMs) provide intuitive interaction channels but require extensive data; task-parameterized imitation learning achieves data efficiency but lacks natural language grounding. This work bridges this gap through a modular architecture combining task-parameterized kernelized movement primitives (TP-KMPs) with pretrained VLMs. During learning, skills are acquired from 2 to 5 kinesthetic demonstrations, and the VLM generates skill schemas describing each skill's parameters and preconditions. During execution, the VLM interprets commands to select skills, reason about parameter bindings, and create novel behaviors through covariance-weighted composition. When no skill or composition suffices, the system identifies capability gaps and requests targeted demonstrations, all without fine-tuning. Validation on a 7-DoF manipulator shows success rates of 73.3%-100% in scenarios requiring skill selection, composition, and active learning.
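"Covariance-weighted composition" can be illustrated with the standard precision-weighted fusion of two Gaussian predictions, shown here in one dimension. This is a generic sketch of the idea that the more certain skill dominates the blend; the paper's TP-KMP formulation works on trajectories and full covariance matrices, and the function name and numbers below are hypothetical.

```python
def fuse_gaussians(mu_a, var_a, mu_b, var_b):
    """Product of two 1-D Gaussians: precision-weighted mean and fused variance."""
    precision_a = 1.0 / var_a
    precision_b = 1.0 / var_b
    fused_var = 1.0 / (precision_a + precision_b)
    fused_mu = fused_var * (precision_a * mu_a + precision_b * mu_b)
    return fused_mu, fused_var


# Two skills predict a position at the same time step.
print(fuse_gaussians(0.0, 1.0, 1.0, 1.0))   # equal confidence: (0.5, 0.5)
print(fuse_gaussians(0.0, 0.01, 1.0, 1.0))  # skill A is confident: mean stays near 0
```

The design choice worth noting is that no hand-tuned blending weight is needed: the variances learned from demonstrations determine how much each skill contributes.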

2606.08167 2026-06-09 cs.LG cs.AI 新提交

Explaining Data Mixing Scaling Laws

解释数据混合缩放定律

Rui Dai, Shuran Zheng

发表机构 * Beijing Institute of Technology（北京理工大学）； IIIS, Tsinghua University（清华大学交叉信息研究院）

AI总结提出统一框架解释多领域数据混合中模型损失行为，基于能力竞争和噪声减少两个关键因素，在多个尺度上有效预测高性能混合。

Comments: Published to ICML 2026

AI中文摘要

最近的研究建立了经验缩放定律来预测多领域数据混合上的模型性能。然而，对这些模型损失行为的理论理解仍然缺失。在这项工作中，我们提出了一个统一框架来解释数据混合的底层机制。我们的方法将最初为标准神经缩放定律（如Kaplan和Chinchilla）开发的理论视角扩展到多领域设置。基于领域在基本技能上重叠而在专门技能上分化的分布假设，我们确定了控制不同数据混合训练模型领域损失的两个关键因素：能力竞争，其中有限模型能力的分配全局耦合了领域损失；以及噪声减少，其中最优权重向更难学习的领域转移以最小化整体噪声。实证评估表明，我们的框架通过以更低的平均相对误差拟合损失景观并识别出更高性能的训练混合，优于现有基线。最重要的是，我们的模型成功跨尺度外推，使用较小尺度上拟合的参数预测大型未见尺度的高效混合。此外，与之前的经验定律相比，我们的模型使用显著更少的参数实现了这些结果。我们的代码可在 https://github.com/meiqwq/Explaining-Data-Mixing-Scaling-Laws 获取。

英文摘要

Recent research has established empirical scaling laws to predict model performance on multi-domain data mixtures. However, a theoretical understanding of these model loss behaviors remains absent. In this work, we propose a unified framework to explain the underlying mechanics of data mixing. Our approach extends theoretical perspectives originally developed for standard neural scaling laws (e.g., Kaplan and Chinchilla) to the multi-domain setting. Based on the distributional assumption that domains overlap on fundamental skills while diverging on specialized skills, we identify two key factors that govern the domain losses of models trained on different data mixtures: Capacity Competition, where the allocation of finite model capacity couples domain losses globally, and Noise Reduction, where optimal weights shift toward harder-to-learn domains to minimize overall noise. Empirical evaluations show that our framework outperforms existing baselines by fitting the loss landscape with a lower Mean Relative Error and identifying higher-performing training mixtures. Most importantly, our model successfully extrapolates across scales, predicting highly effective mixtures for large, unseen scales using parameters fitted on smaller ones. In addition, our model achieves these results using significantly fewer parameters compared to previous empirical laws. Our code is available at https://github.com/meiqwq/Explaining-Data-Mixing-Scaling-Laws.
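The fit quality above is reported as Mean Relative Error. A minimal sketch of that metric follows, assuming the common definition (mean of |predicted - observed| / |observed| over evaluation points); the paper may average over domains or mixtures differently, and the loss values below are invented for illustration.

```python
def mean_relative_error(predicted, observed):
    """Mean of |p - o| / |o| over paired predictions and observations."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("need equally sized, non-empty inputs")
    return sum(abs(p - o) / abs(o) for p, o in zip(predicted, observed)) / len(observed)


# Hypothetical predicted vs. measured validation losses for two mixtures.
mre = mean_relative_error([2.2, 1.8], [2.0, 2.0])
print(round(mre, 3))  # 0.1, i.e. 10% average relative deviation
```

Because the error is relative, domains with small absolute loss are not drowned out by domains with large loss when fits are compared.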

2606.08164 2026-06-09 cs.CV 新提交

How Much MRI Preprocessing Is Enough? A Cost-Utility Study for Brain MRI Foundation Models

MRI预处理需要多少才够？脑MRI基础模型的成本效用研究

Jiangshuan Pang, Wangyang Tang, Jing Yan, Zhixuan Cheng, Youzhe He, Zhenkun Zhuang, Tao Zhou, Shiping Liu

发表机构 * University of the Chinese Academy of Sciences（中国科学院大学）； BGI Research（华大研究院）

AI总结本研究通过比较P0-P7预处理级别对自监督3D MRI预训练的影响，发现并非预处理越强越好，P2是最低成本可行级别，更强预处理仅在特定任务中带来有限提升，且下游可补偿。

详情

AI中文摘要

MRI预处理定义了脑MRI基础模型看到的输入分布，但它通常被视为常规数据清理而非建模选择。我们询问对于自监督3D MRI预训练，多少预处理值得其计算成本。保持语料库、3D ViT骨干网络、掩码协议和下游评估不变，我们在20,000个异质脑MRI体积上比较了用于掩码自编码（MAE）和联合嵌入预测学习（JEPA）的分级P0-P7预处理谱，然后将编码器迁移到IDH预测、MCI分类、脑年龄回归和GLI/PED肿瘤分割。结果不支持简单的“越多越好”规则。P0/P1数值不稳定，使P2成为成本最低的可行级别；超过P2，选择最佳可行预处理级别仅使MAE的聚合效用提高3.4个百分点，JEPA提高1.8个百分点，且大多数配对增益在统计上未解决。更强的预处理仅在选定场景中有益：IDH略有改善，AGE和GLI/PED通常在P2附近或最佳，而MCI显示出最清晰的P7经验增益。跨级别MCI迁移进一步表明，大部分P7优势可以通过在下游应用更强的预处理来恢复，而不需要在预训练全程使用P7。这些发现将MRI预处理重新定义为一种下游感知的成本效用决策，而非默认的升级流水线。代码可在https://github.com/PangJiangShuan/PreBrain获取。

英文摘要

MRI preprocessing defines the input distribution seen by brain MRI foundation models, yet it is usually treated as routine data cleaning rather than a modeling choice. We ask how much preprocessing is worth its computational cost for self-supervised 3D MRI pretraining. Keeping the corpus, 3D ViT backbone, masking protocol, and downstream evaluations fixed, we compare a graded P0-P7 preprocessing spectrum for masked autoencoding (MAE) and joint-embedding predictive learning (JEPA) on 20,000 heterogeneous brain MRI volumes, then transfer the encoders to IDH prediction, MCI classification, brain age regression, and GLI/PED tumor segmentation. The results do not support a simple "more is better" rule. P0/P1 are numerically unstable, making P2 the lowest-cost feasible level; beyond P2, choosing the best feasible preprocessing level improves aggregate utility by only 3.4 percentage points for MAE and 1.8 percentage points for JEPA, with most paired gains statistically unresolved. Stronger preprocessing is beneficial only in selected regimes: IDH improves modestly, AGE and GLI/PED are often near or best at P2, and MCI shows the clearest empirical P7 gain. Cross-level MCI transfer further shows that much of the P7 advantage can be recovered by applying stronger preprocessing downstream, without requiring P7 throughout pretraining. These findings recast MRI preprocessing as a downstream-aware cost-utility decision rather than a default escalation pipeline. Code is available at https://github.com/PangJiangShuan/PreBrain.

2606.08161 2026-06-09 cs.LG cs.AR cs.NA math.NA 新提交

AttentionCap: Transformer Based Capacitance Matrix Learning Toward Full-Chip Extraction

AttentionCap: 基于Transformer的电容矩阵学习用于全芯片提取

Jiechen Huang, Hector R. Rodriguez, Dingcheng Yang, Zuochang Ye, Yibo Lin, Wenjian Yu

发表机构 * Dept. Computer Science & Tech., BNRist, Tsinghua Univ., Beijing, China（清华大学计算机科学与技术系，北京信息科学与技术国家研究中心）； School of IC, BNRist, Tsinghua Univ., Beijing, China（清华大学集成电路学院，北京信息科学与技术国家研究中心）； School of IC, Peking Univ., Beijing, China（北京大学集成电路学院）

AI总结提出AttentionCap，一种定制化Transformer，结合Gram表示、对称注意力输出层和归一化拉普拉斯损失，实现多层多节点下的高精度电容矩阵预测，速度提升192倍。

Comments: Accepted at the 63rd ACM/IEEE Design Automation Conference (DAC '26)

AI中文摘要

随着基于规则的模式匹配的电容提取精度在先进节点上难以维持，开发基于深度学习的2D电容模型的趋势日益增长。然而，现有的基于MLP和CNN的方法将其输入限制在特定工艺节点的固定金属层组合上，限制了其在实际中的可用性。认识到电容矩阵与流行的注意力机制之间的固有相似性，我们提出了AttentionCap，一种定制化的Transformer用于电容矩阵学习，具有Gram表示框架、物理对齐的对称注意力输出层以及新颖的归一化拉普拉斯损失。我们还引入了工艺节点嵌入以实现多节点学习。在合成数据上训练后，AttentionCap在多层多节点设置下，对未见过的真实设计实现了0.67%/3.99%的自电容/耦合电容误差，相比CNN-Cap基线，自电容/耦合误差降低了4.6倍/5.7倍，推理速度提高了192倍。预训练的AttentionCap仅需5000个样本和4000步微调即可准确迁移到未见过的节点。凭借对未见过的真实设计的足够精度和对新工艺节点的强迁移能力，AttentionCap为现代EDA工作流程提供了很高的实用价值。代码和数据可在https://github.com/THU-numbda/AttentionCap获取。

英文摘要

As capacitance extraction accuracy of rule-based pattern matching becomes difficult to sustain at advanced nodes, a growing trend emerges to develop deep-learning-based 2D capacitance models. However, existing MLP- and CNN-based methods constrain their input to fixed metal-layer combinations in a specific process node, limiting their usability in practice. Recognizing the inherent similarity between capacitance matrix and the prevailing attention mechanism, we propose AttentionCap, a customized Transformer for capacitance matrix learning, with a Gram representation framework, a physics-aligned symmetric-attention output layer, and a novel normalized Laplacian loss. We also introduce a process-node embedding to enable multi-node learning. Trained on synthetic data, AttentionCap attains 0.67%/3.99% self/coupling-capacitance error on unseen real designs under a multi-layer and multi-node setting, surpassing the CNN-Cap baseline with 4.6×/5.7× lower self/coupling error and 192× faster inference speed. A pretrained AttentionCap accurately transfers to an unseen node with only 5K samples and 4K finetuning steps. With sufficient accuracy on unseen real designs and strong transferability to new process nodes, AttentionCap offers highly practical value for modern EDA workflows. Code and data are available at https://github.com/THU-numbda/AttentionCap.
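A capacitance matrix is symmetric, whereas a plain attention score matrix Q K^T generally is not. One generic way to obtain a symmetric pairwise output is to average the score matrix with its transpose, as sketched below. This is an assumption-laden illustration of the constraint only; the abstract does not specify how AttentionCap's symmetric-attention output layer or Gram representation is built, and all names and values here are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def symmetric_scores(queries, keys):
    """Pairwise scores symmetrized as 0.5 * (Q K^T + K Q^T)."""
    n = len(queries)
    raw = [[dot(queries[i], keys[j]) for j in range(n)] for i in range(n)]
    return [[0.5 * (raw[i][j] + raw[j][i]) for j in range(n)] for i in range(n)]


# Per-conductor feature vectors (made-up numbers).
queries = [[1.0, 0.0], [0.5, 2.0], [0.0, 1.0]]
keys = [[0.2, 0.1], [1.0, 0.3], [0.4, 0.4]]
scores = symmetric_scores(queries, keys)
print(scores[0][1] == scores[1][0])  # True: entry (i, j) equals entry (j, i)
```

Building the symmetry into the output layer means the network cannot predict a physically inconsistent pair of coupling values, instead of having to learn that constraint from data.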

2606.08156 2026-06-09 cs.CV cs.AI 新提交

RAPID: Layer-Wise Redundancy-Aware Pruning and Importance-Driven Token Merging for Efficient ViT

RAPID: 逐层冗余感知剪枝与重要性驱动的令牌合并以实现高效ViT

Kyumin Choi, Ikbeom Jang

发表机构 * Hankuk University of Foreign Studies（韩国外国语大学）

AI总结提出RAPID框架，根据ViT网络深度自适应调整令牌缩减策略：浅中层用冗余相似度感知剪枝，深层用重要性相似度感知合并，在ImageNet-1K上实现更优的精度-压缩帕累托前沿。

详情

Comments: 7 pages, 2 figures

AI中文摘要

视觉Transformer（ViT）取得了强大性能，但由于二次自注意力复杂度而遭受高计算成本。尽管令牌缩减技术（如剪枝和合并）缓解了这一问题，但它们通常忽略了表示在网络深度上的演化。我们提出RAPID，一种深度感知的令牌缩减框架，可根据令牌表示的逐层特征自适应调整缩减策略。主要方法贡献是一种分叉策略：在浅层到中层，RAPID采用冗余相似度感知剪枝度量来消除过度表示的局部模式。当特征在更深层过渡到全局语义概念时，框架转向重要性相似度感知合并机制。该阶段利用分类（CLS）令牌注意力权重来保护语义关键令牌，同时融合不太重要但相似的邻居。在ImageNet-1K上使用ViT和DeiT架构的实验验证表明，与ToMe和ToFu等即插即用基线相比，RAPID建立了更优的精度-压缩帕累托前沿。RAPID在激进压缩场景下尤其鲁棒，在极端缩减率下比ToMe准确率高出4.29%。我们的框架提供了一种免训练模板，通过将缩减策略与层次化特征演化对齐来优化视觉模型。

英文摘要

Vision Transformers (ViTs) achieve strong performance but suffer from high computational costs due to quadratic self-attention complexity. Although token reduction techniques such as pruning and merging mitigate this, they typically overlook how representations evolve across network depth. We propose RAPID, a depth-aware token reduction framework that adapts reduction strategies to the layer-wise characteristics of token representations. The primary methodological contribution is a bifurcated strategy: in shallow-to-middle layers, RAPID employs a redundancy-similarity aware pruning metric to eliminate over-represented local patterns. As features transition to global semantic concepts in deeper layers, the framework shifts to an importance-similarity aware merging mechanism. This stage leverages classification (CLS) token attention weights to protect semantically critical tokens while fusing less important but similar neighbors. Empirical validation on ImageNet-1K using ViT and DeiT architectures demonstrates that RAPID establishes a superior accuracy-compression Pareto frontier compared to plug-and-play baselines such as ToMe and ToFu. RAPID is particularly robust in aggressive compression regimes, achieving up to 4.29% higher accuracy than ToMe at extreme reduction rates. Our framework provides a training-free template for optimizing vision models by aligning reduction strategies with hierarchical feature evolution.
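The deep-layer stage described above (protect tokens with high CLS attention, fuse less important tokens into similar ones) can be sketched as follows. This is a simplified illustration, not RAPID's actual algorithm: it uses cosine similarity, unweighted averaging, and a fixed `keep` count, all of which are assumptions, and it omits the shallow-layer pruning stage.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_by_importance(tokens, cls_attention, keep):
    """Keep the `keep` tokens with the highest CLS attention and average
    every other token into its most similar kept token."""
    order = sorted(range(len(tokens)), key=lambda i: cls_attention[i], reverse=True)
    kept = sorted(order[:keep])  # preserve original token order
    groups = {i: [tokens[i]] for i in kept}
    for i in order[keep:]:
        target = max(kept, key=lambda k: cosine(tokens[i], tokens[k]))
        groups[target].append(tokens[i])
    return [[sum(col) / len(groups[i]) for col in zip(*groups[i])] for i in kept]


tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
cls_attention = [0.4, 0.1, 0.3, 0.2]
merged = merge_by_importance(tokens, cls_attention, keep=2)
print(len(merged))  # 2 tokens remain, each the mean of one similar pair
```

Since the step needs only attention weights and token features that the forward pass already produces, it is compatible with the training-free usage the abstract describes.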

2606.08155 2026-06-09 cs.LG cs.IR New submission

Have I Solved This Before? Retrieving Similar Segmentation Problems for Evolutionary Learning

我以前解决过这个问题吗？检索相似分割问题进行进化学习

Andreas Margraf, Henning Cui, Jörg Hähner

Affiliation: University of Augsburg

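AI summary: Proposes an evolutionary learning approach based on retrieving similar segmentation problems; reusing existing pipelines avoids training models from scratch and lowers development cost, and the feasibility of cross-domain transfer is analyzed.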

AI Chinese abstract

监控系统的可靠集成和稳固配置是实现现代制造环境高效率和生产率的基本前提。关于传感器类型和系统架构的设计决策必须在早期阶段且在高不确定性下做出。本文研究了一种偏离传统监控系统开发过程的研究方向，将注意力从算法设计转向对检测问题的更深入分析。与传统设计周期不同，本文提出逐步收集知识并将其存储在抽象系统模型中。这使得能够检索未来用例的相似解决方案，避免了昂贵的从头开始模型训练，而是允许对现有基础配置进行增量改进。重用先前生成的管道降低了后期昂贵修订的风险。由于关于滤波器管道的跨域可转移性知之甚少，本研究分析了检索滤波器管道以将其转移到不同但相似的分割问题的潜力。最后，我们统计分析了这种主要应用于图像分割问题的“迁移学习”变体的优势。此外，我们讨论了简单模型如何帮助在设计过程中平衡复杂性、技术要求和可靠性之间的权衡。

English abstract

Reliable integration and solid configuration of monitoring systems constitute fundamental prerequisites for achieving high efficiency and productivity in contemporary manufacturing environments. Design decisions on sensor type and system architecture have to be made at an early stage and under comparatively high uncertainty. This work investigates a research direction that deviates from the traditional monitoring-system development process by shifting the attention from algorithm design to a deeper analysis of the inspection problem. In contrast to traditional design cycles, this paper proposes to gradually collect knowledge and store it in an abstract system model. This enables the retrieval of similar solutions for future use cases, removing the need for expensive model training from scratch and allowing instead for the incremental refinement of existing base configurations. Reuse of previously generated pipelines reduces the risk of late and costly revisions. As there is little knowledge on cross-domain transferability of filter pipelines, this study analyzes the potential of retrieving filter pipelines to transfer them to different but similar segmentation problems. Finally, we statistically analyze the benefits of this 'transfer learning' variant which is predominantly applied to image segmentation problems. In addition, we discuss how simple models help balance the trade-off between complexity, technical requirements, and reliability in the design process.

2606.08154 2026-06-09 cs.RO New submission

SynthICL: Scalable In-context Imitation Learning with Synthetic Data

SynthICL: 基于合成数据的可扩展上下文模仿学习

Cheng Qian, Ruomeng Fan, Yifei Ren, Yilong Wang, Edward Johns

Affiliations: The Robot Learning Lab; Imperial College London

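AI summary: Proposes SynthICL, a framework that trains in-context imitation learning policies on RGB-only synthetic data, avoiding depth sensing and real-world data; subgoal prediction improves control precision, and the method reaches a 79% average success rate on 16 real-world manipulation tasks.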

AI Chinese abstract

上下文模仿学习（ICIL）使机器人能够通过将预训练策略以任务特定示例为条件，在测试时无需重新训练，从少量演示中学习新任务。尽管前景广阔，训练可泛化且可扩展的上下文模仿策略仍是一个开放挑战。我们提出SynthICL，一个完全基于纯RGB合成数据训练ICIL策略的可扩展框架。具体而言，我们构建了一个数据生成流水线以产生高保真ICIL数据，并在所得数据集上训练了一个流匹配变换器策略。SynthICL避免了先前方法中对深度传感、精确相机校准和真实世界训练数据的需求，提供了一种更简单且更可扩展的替代方案。我们进一步通过训练模型预测下一个子目标图像来融入子目标预测，从而实现更精确且基于视觉依据的控制。在16个未见过的真实世界操作任务上评估，SynthICL在测试时仅提供一个演示的情况下实现了79%的平均成功率，并优于先前方法。项目页面：https://synth-icl.github.io

English abstract

In-context imitation learning (ICIL) enables robots to learn new tasks from a small number of demonstrations by conditioning a pre-trained policy on task-specific examples, without retraining at test time. Despite this promise, training generalizable and scalable in-context imitation policies remains an open challenge. We present SynthICL, a scalable framework that trains ICIL policies entirely from RGB-only synthetic data. Specifically, we build a data generation pipeline to produce high-fidelity ICIL data and train a flow-matching transformer policy on the resulting dataset. SynthICL avoids the depth sensing, precise camera calibration, and real-world training data that prior approaches require, offering a simpler and more scalable alternative. We further incorporate subgoal prediction by training the model to predict the next subgoal images, enabling more precise and visually grounded control. Evaluated on 16 unseen real-world manipulation tasks, SynthICL achieves an average success rate of 79% with only one demonstration provided at test time and outperforms prior methods. Project page: https://synth-icl.github.io
