arXivDaily arXiv daily academic digest, updated Monday through Friday
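2606.05846 2026-06-19 cs.CL eess.AS version update

Towards Truly Multilingual ASR: Generalizing Code-Switching ASR to Unseen Language Pairs

Gio Paik, Hyunseo Shin, Soungmin Lee

Affiliations University of Tokyo

AI summary Using model merging and domain generalization methods, the work studies whether code-switching ability learned from a limited set of language pairs generalizes to unseen pairs; experiments show that bilingual CS-ASR models generalize to unseen language pairs only to a limited degree.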

Comments ICML 2026 Workshop on Machine Learning for Audio

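AI Chinese abstract (translated)

Automatic speech recognition (ASR) has become a key technology for human-computer interaction. However, code-switching ASR (CS-ASR) remains particularly challenging because code-switched (CS) speech resources spanning multiple language pairs are severely scarce. Existing methods mainly improve CS-ASR performance through synthetic CS speech generation or pair-specific fine-tuning on limited bilingual datasets. These methods face an inherent scalability limit, because CS support must be developed separately for each language pair, and the number of pairs grows combinatorially with the number of supported languages. In this work, we study whether CS ability learned from a limited set of seen language pairs can generalize to unseen pairs through model merging and domain generalization methods. Our experiments show that merged bilingual CS-ASR models generalize to unseen language pairs to some degree, indicating that transfer of bilingual CS ability across language pairs is limited.

English abstract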

Automatic Speech Recognition (ASR) has become a key technology for human-AI interaction. However, code-switching ASR (CS-ASR) remains particularly challenging due to the severe scarcity of multilingual CS speech resources across diverse language pairs. Existing approaches primarily improve CS-ASR performance through synthetic CS speech generation or pair-specific fine-tuning on limited bilingual datasets. Nevertheless, these approaches face an inherent scalability limitation, as support for CS must be developed separately for language pairs whose number grows combinatorially with the number of supported languages. In this work, we investigate whether CS capabilities learned from a limited set of seen language pairs can generalize to unseen language pairs through model merging and domain generalization methods. Our experiments show that merged bilingual CS-ASR models modestly generalize to unseen language pairs, suggesting limited transfer of bilingual CS capabilities across language pairs.
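
The abstract does not say which merging method the authors use. As a minimal sketch of the general idea, assuming the simplest variant (weighted parameter averaging of fine-tuned checkpoints that share one architecture), merging two pair-specific models could look like the following. The checkpoint names, parameter names, and toy weights are hypothetical.

```python
from typing import Dict, List, Optional

Params = Dict[str, List[float]]  # parameter name -> flattened weights


def merge_checkpoints(checkpoints: List[Params],
                      weights: Optional[List[float]] = None) -> Params:
    """Weighted average of parameters that share names and shapes."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    if weights is None:
        # Default to a uniform average over all checkpoints.
        weights = [1.0 / len(checkpoints)] * len(checkpoints)
    if len(weights) != len(checkpoints):
        raise ValueError("one weight per checkpoint is required")
    merged: Params = {}
    for name, ref in checkpoints[0].items():
        acc = [0.0] * len(ref)
        for ckpt, w in zip(checkpoints, weights):
            if len(ckpt[name]) != len(ref):
                raise ValueError(f"shape mismatch for {name}")
            for i, value in enumerate(ckpt[name]):
                acc[i] += w * value
        merged[name] = acc
    return merged


# Hypothetical toy checkpoints standing in for two pair-specific CS-ASR models.
model_pair_a = {"encoder.w": [1.0, 2.0], "decoder.w": [0.0, 4.0]}
model_pair_b = {"encoder.w": [3.0, 2.0], "decoder.w": [2.0, 0.0]}
merged = merge_checkpoints([model_pair_a, model_pair_b])
print(merged)
```

The question the paper asks is then whether such a merged model transcribes a language pair that neither source checkpoint was fine-tuned on.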

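2606.05833 2026-06-19 cs.CV cs.AI version update

Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models

Haibo Wang, Lifu Huang

Affiliations University of California, Davis

AI summary Proposes GeoVR, a framework that distills 3D geometric knowledge (camera poses, depth maps, scale factors, and multi-scale 3D features) from 2D video sequences to reshape the internal representations of multimodal large language models and give them spatial intelligence, reaching state-of-the-art performance on spatial reasoning benchmarks.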

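AI Chinese abstract (translated)

Multimodal large language models (MLLMs) excel at 2D semantic understanding but lack intrinsic 3D awareness, so their representations fail to stay geometrically and spatially consistent across video frames. Given the scarcity of large-scale 3D data, we propose GeoVR, a novel framework that learns geometric representations using only 2D video sequences. The method effectively restructures the semantic latent space inside MLLMs to unlock spatial intelligence. Rather than relying on shallow feature mixing, GeoVR reshapes the MLLM's internal representations by distilling geometric knowledge from pre-trained 3D foundation models. This is achieved through a multi-objective learning strategy driven by four complementary geometric targets: (1) estimating inter-frame camera poses to embed changing viewpoint dynamics, (2) regressing dense depth maps to anchor physical distances, (3) predicting a metric scale factor for real-world calibration, and (4) distilling multi-scale 3D features to align the intermediate feature space. Guided by these explicit physical and geometric constraints, the model's internal representations naturally develop strong 3D awareness. Extensive experiments on spatial reasoning benchmarks show that GeoVR achieves state-of-the-art performance, establishing a new paradigm for giving foundation models spatial intelligence.

English abstract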

Multimodal Large Language Models (MLLMs) excel at 2D semantic understanding but lack intrinsic 3D awareness, resulting in representations that fail to maintain geometric and spatial consistency across video frames. Given the scarcity of large-scale 3D data, we present GeoVR, a novel framework that learns geometric representations using purely 2D video sequences. This approach effectively restructures the semantic latent space within MLLMs to unlock spatial intelligence. Rather than employing superficial feature mixing, GeoVR reshapes the internal representations of the MLLM by distilling geometry knowledge from pre-trained 3D foundation models. This is accomplished through a multi-objective learning strategy driven by four complementary geometric targets: (1) estimating inter-frame camera poses to embed varying viewpoint dynamics, (2) regressing dense depth maps to anchor physical distances, (3) predicting a metric scale factor for real-world calibration, and (4) distilling multi-scale 3D features to align the intermediate feature space. Guided by these explicit physical and geometric constraints, the model's internal representations naturally develop strong 3D awareness. Extensive experiments on spatial reasoning benchmarks demonstrate that GeoVR achieves state-of-the-art performance, establishing a new paradigm for endowing foundation models with spatial intelligence.

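2606.04307 2026-06-19 cs.LG stat.CO stat.ME version update

Folded Transport MCMC: Eliminating Label Switching by Sampling on a Fundamental Domain

Chinese title (translated) Folded Transport MCMC: Certifiable Quotient-Posterior Computation for Symmetric Bayesian Models

Jun Hu

Affiliations Wuhan University of Technology

AI summary To address redundant multimodality in symmetric Bayesian models, which degrades MCMC convergence diagnostics, proposes Folded Transport MCMC, which performs inference directly on the quotient posterior by building an independence sampler on a fundamental domain of the symmetry group, and uses the LCNF oscillation certification framework to give provable certified lower bounds under the quotient metric.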

Comments 50 pages (including supplementary material), 5 figures, 6 tables. Submitted to Journal of Computational and Graphical Statistics

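AI Chinese abstract (translated)

Bayesian models with finite symmetries, such as mixture models with exchangeable components and structural identification with closely spaced modes, define posteriors that are invariant under the label permutation group, producing redundant multimodality that degrades MCMC convergence diagnostics. We introduce Folded Transport MCMC (FolT-MCMC), which performs inference directly on the quotient posterior by constructing an independence sampler on a fundamental domain of the symmetry group. The quotient proposal distribution is obtained by symmetrising a learned normalising flow over the group orbits. We prove that the LCNF oscillation-based certification framework transfers to the quotient metric, with stabiliser-corrected ball-mass bounds and an improved covering radius, and that the quantile-core certified lower bound improves whenever the unfolded flow exhibits cross-mode proposal deficiencies. On Gaussian mixtures (d=2-20), label-switching targets (up to 24 equivalent modes), and a standard Bayesian three-component mixture posterior, the quantile-core certification improvement ratio ranges from 2x to 145x, and the folded certificate is empirically almost independent of dimension. On real accelerometer data from a supertall building during Typhoon Mangkhut, FolT-MCMC yields a non-trivial quantile-core certificate, whereas the unfolded certificate is trivial.

English abstract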

In Bayesian mixture models and other exchangeable-component models, the posterior is invariant under permutation of component labels, creating m! equivalent modes (the label-switching problem). Standard MCMC methods either mix poorly across these modes or rely on post-hoc relabelling that cannot guarantee the sampler has converged. We propose Folded Transport MCMC (FolT-MCMC), which eliminates label switching before sampling by restricting the Markov chain to a fundamental domain, that is, a sorted or reflected subspace containing exactly one representative from each symmetric mode. The proposal is a learned normalising flow whose density is symmetrised over the group orbits, ensuring correct targeting on the reduced space. We show that this construction preserves a computable convergence diagnostic based on the oscillation of the log-density ratio, and that the diagnostic becomes sharper on the fundamental domain whenever the original-space flow under-covers one or more symmetric modes. Experiments on Gaussian mixtures (d=2-20), label-switching targets (up to 24 equivalent modes), a standard Bayesian three-component mixture posterior, and real accelerometer data from a supertall building show improvement ratios of 2x to 145x, with the folded diagnostic stable across dimensions while the unfolded diagnostic collapses.
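
The folding idea can be illustrated with a small sketch. It assumes scalar component parameters with distinct values, so that sorting picks one representative per orbit of the permutation group; `fold`, `symmetrised_density`, and `toy_density` are illustrative names, and the toy density merely stands in for a learned normalising flow. None of this is the paper's implementation.

```python
import itertools
import math
from typing import Callable, Sequence, Tuple


def fold(theta: Sequence[float]) -> Tuple[float, ...]:
    """Map exchangeable component parameters to their sorted representative."""
    return tuple(sorted(theta))


def symmetrised_density(q: Callable[[Sequence[float]], float],
                        theta: Sequence[float]) -> float:
    """Average a base density q over every permutation of theta (its orbit)."""
    orbit = list(itertools.permutations(theta))
    return sum(q(p) for p in orbit) / len(orbit)


def toy_density(theta: Sequence[float]) -> float:
    """Hypothetical unnormalised base density that is not permutation invariant."""
    return math.exp(-sum((i + 1) * t * t for i, t in enumerate(theta)))


means = (0.3, -1.2, 2.0, 0.7)  # m = 4 exchangeable component means
orbit = set(itertools.permutations(means))
print(len(orbit))                # number of relabelled copies, m! for distinct values
print({fold(p) for p in orbit})  # every copy folds to the same sorted tuple
print(symmetrised_density(toy_density, means))
```

With m = 4 distinct means the orbit has 4! = 24 elements, which matches the "up to 24 equivalent modes" setting in the abstract. The symmetrised density takes the same value on every relabelling, so it is well defined on the folded space.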

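2606.04075 2026-06-19 cs.LG cs.AI cs.CL cs.CR cs.CY version update

Large Language Models Hack Rewards, and Society

Wei Liu, Xinyi Mou, Hanqi Yan, Zhongyu Wei, Yulan He

Affiliations King’s College London; Fudan University; The Alan Turing Institute

AI summary Studies "societal hacking", in which large language models exploit reward-function loopholes during reinforcement learning training; experiments in the SocioHack sandbox show that models can discover and exploit loopholes in societal rules and that existing safeguards have limited effect.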

Comments 14 pages, 9 figures, 7 tables

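AI Chinese abstract (translated)

Reinforcement learning has become a dominant post-training paradigm that lets large language models learn from rewards. We observe that societal rules are structurally similar to reward functions: they define measurable outcomes, thresholds, and exceptions, while often only partially specifying institutional intent. We hypothesise that the reinforcement learning training process may exploit these gaps, and therefore ask whether models' known tendency to hack reward functions during reinforcement learning could extend into a more serious failure mode, societal hacking: discovering loopholes in the rules society runs on. To study this phenomenon, we introduce SocioHack, a sandbox of 72 societal environments, and find that reward hacking emerges naturally in these environments and leads to the discovery of regulatory loopholes. Models learn to hack societal rules and generate strategies that are technically compliant but defeat regulatory intent, while current large language model safeguards provide only limited mitigation. Collecting real-world feedback for model training therefore requires greater caution, and we need a next-generation post-training paradigm for safely iterating large language models in real society.

English abstract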

Reinforcement learning (RL) has become a dominant post-training paradigm, enabling large language models (LLMs) to learn from rewards. We observe that societal regulations are structurally similar to reward functions. They define measurable outcomes, thresholds, and exceptions, while often leaving institutional intent only partially specified. We hypothesise that the RL training process may exploit these gaps and therefore ask whether models' well-known tendency to hack reward functions during RL can scale into a more consequential failure mode named societal hacking: discovering loopholes in the rules society runs on. To study this phenomenon, we introduce SocioHack, a sandbox of 72 societal environments, and find that within these environments, reward hacking naturally emerges and leads to regulatory loophole discovery. Models learn to hack the social rules and generate strategies that remain technically compliant while defeating regulatory intent, and current LLM safeguards provide only limited mitigation. Therefore, collecting in-the-wild feedback for model training requires greater caution, and we need a next-generation post-training paradigm for safely iterating LLMs in real society.

2606.01338 2026-06-19 cs.CL version update

Benchmarking Local LLMs for Natural-Language-to-SQL Querying in Biopharmaceutical Manufacturing: An Empirical Benchmark on Consumer-Grade Hardware

Sagar Bhetwal, Rajan Bastakoti, Nirajan Acharya, Gaurav Kumar Gupta, Ambika Baniya Bhandari

Affiliations Department of Computer Science, University of the Cumberlands; Department of Computer Science, DePaul University; Youngstown State University

AI summary Evaluates four locally deployed open-source large language models on natural-language-to-SQL generation over a biopharmaceutical manufacturing database, finding that code-tuned general-purpose models outperform a domain-specific model but that current performance still requires human oversight.

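AI Chinese abstract (translated)

Biopharmaceutical manufacturing organizations operate under regulatory frameworks such as FDA guidance, EU Good Manufacturing Practice (GMP), and the EU AI Act, which may restrict the use of cloud-based artificial intelligence systems. Locally deployed large language models (LLMs) offer a privacy-preserving alternative, but their suitability for pharmaceutical manufacturing tasks remains underexplored. This study evaluates the natural-language-to-SQL generation ability of four open-source LLMs deployed locally via Ollama (Qwen 2.5 Coder 7B, Llama 3.1 8B, Mistral 7B, and Meditron 7B) on a pharmaceutical manufacturing database. A FastAPI-based evaluation platform, PharmaBatchDB AI, was developed using a synthetic Microsoft SQL Server database of about 63,000 records covering Batch, Manufacturing Execution System (MES), and Clean-In-Place (CIP) modules. The models were benchmarked on 60 domain-specific natural-language questions, with metrics including SQL extraction rate, SQL compliance, factual consistency, ROUGE-L, hallucination rate, throughput, and latency. Qwen 2.5 Coder 7B, Llama 3.1 8B, and Mistral 7B generated SQL for every evaluation task, while Meditron 7B failed on nearly all tasks because of context-window limits and poor SQL generation ability. Llama 3.1 8B achieved the highest SQL compliance, while Qwen 2.5 Coder 7B was strongest in overall text similarity and factual consistency. The performance difference between the two leading models was not statistically significant. The results indicate that code-tuned general-purpose LLMs outperform a domain-specific biomedical model at structured query generation on pharmaceutical manufacturing data. Although a fully local, GxP-aligned NLQ system is feasible on consumer-grade hardware, current performance levels still require human oversight and downstream validation for regulated use.

English abstract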

Biopharmaceutical manufacturing organizations operate under regulatory frameworks such as FDA guidance, EU Good Manufacturing Practice (GMP), and the EU AI Act, which can restrict the use of cloud-based artificial intelligence systems. Locally deployed large language models (LLMs) offer a privacy-preserving alternative, but their suitability for pharmaceutical manufacturing tasks remains underexplored. This study evaluates four open-source LLMs (Qwen 2.5 Coder 7B, Llama 3.1 8B, Mistral 7B, and Meditron 7B) deployed locally via Ollama for natural-language-to-SQL generation over a pharmaceutical manufacturing database. A FastAPI-based evaluation platform, PharmaBatchDB AI, was developed using a synthetic Microsoft SQL Server database containing approximately 63,000 records across Batch, Manufacturing Execution System (MES), and Clean-In-Place (CIP) modules. Models were benchmarked on 60 domain-specific natural-language questions using metrics including SQL extraction rate, SQL compliance, factual consistency, ROUGE-L, hallucination rate, throughput, and latency. Qwen 2.5 Coder 7B, Llama 3.1 8B, and Mistral 7B generated SQL for all evaluation tasks, while Meditron 7B failed on nearly all tasks due to context-window limitations and poor SQL generation capability. Llama 3.1 8B achieved the highest SQL compliance, whereas Qwen 2.5 Coder 7B achieved the strongest overall text similarity and factual consistency. Performance differences between the two leading models were not statistically significant. The results show that code-tuned general-purpose LLMs outperform a domain-specific biomedical model on structured query generation for pharmaceutical manufacturing data. Although fully local, GxP-aligned NLQ systems are feasible on consumer hardware, current performance levels still require human oversight and downstream validation for regulated use.
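
Of the listed metrics, ROUGE-L is the easiest to make concrete. The sketch below assumes whitespace tokenisation, lowercasing, and the balanced F1 form of the score; the abstract does not describe the study's own scoring code or which texts were compared, and the table and column names in the example are hypothetical.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, by row-wise dynamic programming."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L as the harmonic mean of LCS-based precision and recall."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference and model output for one benchmark question.
reference = "SELECT batch_id FROM batches WHERE status = 'released'"
candidate = "SELECT batch_id FROM batches WHERE status = 'rejected'"
score = rouge_l_f1(candidate, reference)
print(score)
```

The example also shows why the benchmark reports overlap alongside compliance and factual consistency: the two queries share seven of eight tokens in order, so the overlap score is high even though they return different rows.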

2606.01316 2026-06-19 cs.AI version update

Science Earth: Towards A Planet-Scale Operating System for AI-Native Scientific Discovery

Zhe Zhao, Haibin Wen, Yingcheng Wu, Jiaming Ma, Yifan Wen, Jinglin Jian, Jiacheng Ge, Xiangru Tang, Bo An, Ming Yin, Sanfeng Wu, Mengdi Wang, Le Cong

Affiliations Department of Pathology, Department of Genetics, Stanford University School of Medicine; Princeton AI Lab, Department of Electrical & Computer Engineering, Princeton University; Scripps Research, La Jolla, CA, USA; Division of Biostatistics, Department of Population Health, New York University Grossman School of Medicine; College of Computing and Data Science, Nanyang Technological University; Department of Computer Science, Yale University; Department of Physics, Princeton University

AI summary Proposes Science Earth, a planet-scale scientific runtime in which the EACN protocol enables dynamic connection and self-organized collaboration among AI capabilities; distributed, self-correcting scientific reasoning is validated in a trans-Pacific Kuramoto synchronization study and a single-cell analysis.

Comments Withdrawn by the authors. (1) The author list and authorship roles had not been finalized and agreed upon by all listed authors prior to submission. (2) The specific contribution of the system in the K3 synchronization example (Section on Kuramoto/nonlinear physics) requires further validation before it can be reported. The authors are addressing both points and may resubmit a corrected version.

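AI Chinese abstract (translated)

Scientific discovery requires intelligence, perseverance, and serendipity across vast search spaces. Today, top scientific capabilities remain isolated (one AI system for biological analysis, another for clinical reasoning, mathematical derivation, or materials simulation), and no pre-designed team can anticipate every skill a question will need. Science Earth is a planet-scale scientific runtime in which any capability (a simulation cluster, a wet-lab robot, a proof engine, a single-cell pipeline) can connect to any other, with the collaboration structure emerging from the question itself. Its underlying EACN protocol lets capabilities discover one another, negotiate task ownership, and adjudicate between incompatible evidentiary standards without knowing in advance who will meet whom. This shifts the organizing challenge from workflow design to open-ended connectivity. Two runs validate this under structurally different conditions. In a trans-Pacific higher-order Kuramoto synchronization study, agents identified and corrected, within 30 minutes, a closure-ratio assumption in Ott-Antonsen analytic theory that fails outside the Lorentzian limit. In an eight-agent single-cell run on the 4.88-million-cell Kang 2024 pan-cancer atlas, heterogeneous capabilities coupled over a 64.9-hour window with only one structural external instruction, producing three new result layers and anchoring the findings against an independent wet-lab study of an adjacent CCR8- TIGIT+ Treg subset. These cases are a first empirical reading, not a benchmark. They show that when AI capabilities are truly connectable and coordination emerges from the problem, scientific reasoning becomes a distributed, self-correcting process, a step towards planet-scale AI-native discovery.

English abstract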

Scientific discovery demands intelligence, perseverance, and serendipity across vast search spaces. Today, top scientific capabilities remain siloed (one AI system for biological analysis, another for clinical reasoning, mathematical derivation, or materials simulation), and no pre-designed team can anticipate every skill a question will need. Science Earth is a planet-scale scientific runtime in which any capability (a simulation cluster, a wet-lab robot, a proof engine, a single-cell pipeline) can connect to any other, with collaboration structure emerging from the question itself. Its underlying EACN protocol lets capabilities discover one another, negotiate task ownership, and adjudicate across incompatible evidentiary standards without prior knowledge of who will meet whom. This shifts the organizing challenge from workflow design to open-ended connectivity. Two runs validate this under structurally distinct conditions. In a trans-Pacific higher-order Kuramoto synchronization study, agents identified and corrected a closure-ratio assumption in Ott-Antonsen analytic theory that fails outside the Lorentzian limit, within thirty minutes. In an eight-agent single-cell run on the 4.88M-cell Kang 2024 pan-cancer atlas, heterogeneous capabilities coupled over a 64.9-hour window with one structural external instruction, producing three new result layers and anchoring findings against an independent wet-lab study on an adjacent CCR8- TIGIT+ Treg subset. These cases are a first empirical reading, not a benchmark sweep. They show that when AI capabilities are truly connectable and coordination emerges from the problem, scientific reasoning becomes a distributed, self-correcting process: a step towards scaling AI-native discovery to the planet.

2605.31393 2026-06-19 cs.CL cs.AI version update

Target-Side Paraphrase Augmentation for Sign Language Translation with Large Language Models

Pedro Dal Bianco, Jean Paul Nunes Reinhold, Oscar Stanchi, Facundo Quiroga, Franco Ronchetti, Ulisses Brisolara Corrêa

Affiliations III-LIDI Universidad Nacional de La Plata; CDTEC, Federal University of Pelotas; CONICET III-LIDI Comision de Investigaciones Cientificas Universidad Nacional de La Plata; Universidade Federal de Pelotas

AI summary To address scarce parallel corpora and long-tailed target vocabularies in sign language translation, proposes target-side augmentation in which GPT-4o generates controlled paraphrase variants of the reference sentences, and validates the method on datasets for three sign languages.

Comments Accepted at GenSign @ CVPR 2026. Non-Proceedings Track (https://genai4sl.github.io/)

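AI Chinese abstract (translated)

Sign language translation (SLT) remains limited by scarce paired sign-video/text corpora and long-tailed target vocabularies. We study target-side augmentation, in which GPT-4o generates controlled paraphrase variants of the reference sentences while the sign input stays unchanged. A Signformer-style pose-based Transformer is trained under a two-stage schedule: pre-training on the augmented corpus, then fine-tuning on the original reference sentences. We evaluate on three datasets with complementary challenges: PHOENIX14T (German Sign Language), with moderate lexical diversity; GSL (Greek Sign Language), with highly controlled, repetitive recordings; and LSA-T (Argentinian Sign Language), with severe long-tail sparsity. On PHOENIX14T, augmentation raises BLEU-4 from 9.56 to 10.33. The near-saturated GSL baseline and the extremely sparse LSA-T setting reveal the limits of the method. To our knowledge, this is the first study to apply LLM-generated target-side paraphrases and LLM-as-a-judge evaluation to sign language translation. Semantic evaluation reveals gains in fidelity that lexical overlap metrics understate.

English abstract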

Sign language translation (SLT) remains constrained by the limited availability of paired sign-video/text corpora and by the heavy-tailed vocabularies typical of real-world datasets. We study a target-side augmentation strategy in which a large language model (LLM) generates controlled paraphrase variants of the reference spoken-language sentence while the sign input remains unchanged. Concretely, we use GPT-4o to produce semantically faithful variants of the training targets and train a Signformer-style pose-based Transformer under a two-stage schedule: pre-training on the augmented corpus followed by fine-tuning on the original references. We evaluate this strategy on three datasets that span complementary challenges: PHOENIX14T (German Sign Language), a real-world corpus with moderate lexical diversity; the Greek Sign Language Dataset with highly controlled, repetitive recordings; and LSA-T (Argentinian Sign Language), a naturalistic corpus with a large vocabulary and severe long-tail sparsity. This range allows us to characterize precisely when and why target-side augmentation is beneficial. On PHOENIX14T, augmentation improves BLEU-4 from 9.56 to 10.33, demonstrating that paraphrastic exposure helps the decoder generalize beyond memorized reference phrasing. The near-saturated GSL baseline and the extremely sparse LSA-T setting reveal the limits of the approach: in both cases, single-reference lexical overlap metrics are insufficient to capture the full picture, motivating a complementary semantic evaluation. To our knowledge, this is the first study to examine LLM-generated target-side paraphrases as an augmentation mechanism for SLT, and the first to apply an LLM-as-a-Judge evaluation protocol to SLT. This complementary evaluation reveals gains in semantic fidelity that lexical overlap metrics understate.
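
The two-stage data setup described above can be sketched as follows. The sketch assumes the paraphrases have already been generated and are supplied as a lookup table; no LLM call is made here, and the clip identifiers and sentences are hypothetical. The only point shown is that the sign input is reused unchanged while the target side varies.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (sign_sequence_id, target_sentence)


def augment_targets(pairs: List[Pair],
                    paraphrases: Dict[str, List[str]]) -> List[Pair]:
    """Keep every original pair and add one pair per distinct paraphrase."""
    augmented: List[Pair] = []
    for sign_id, target in pairs:
        augmented.append((sign_id, target))
        for variant in paraphrases.get(target, []):
            if variant != target:
                # Same sign input, alternative target wording.
                augmented.append((sign_id, variant))
    return augmented


original = [
    ("clip_001", "tomorrow it will rain in the north"),
    ("clip_002", "the night stays mild"),
]
# Stand-in for LLM output: reference sentence -> paraphrase variants.
llm_paraphrases = {
    "tomorrow it will rain in the north": ["rain is expected in the north tomorrow"],
}

stage1_corpus = augment_targets(original, llm_paraphrases)  # pre-training data
stage2_corpus = original                                     # fine-tuning data
print(len(stage1_corpus), len(stage2_corpus))
```

Fine-tuning on the original references in the second stage keeps the final model aligned with the reference style that BLEU-4 is scored against.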

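2605.31158 2026-06-19 cs.CV cs.LG version update

Light Interaction: Training-Free Inference Acceleration for Interactive Video World Models

Jiacheng Lu, Haoyi Zhu, Sipei Yi, Enze Xie, Yu Li, Cheng Zhuo

Affiliations Zhejiang University; NVIDIA

AI summary To address the high inference cost of interactive video world models, proposes Light Interaction, a training-free acceleration framework that achieves up to 2.59x speedup through adaptive context management, denoising cache acceleration, and 3D block sparse attention.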

Comments 13 pages, 6 figures, 3 tables. Project page: https://2843721358l-del.github.io/Light-Interaction-Project/

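AI Chinese abstract (translated)

Interactive video world models generate video chunk by chunk according to user-controlled camera motion, supporting applications such as real-time game simulation, virtual scene navigation, and embodied AI training. However, scaling to long interactive trajectories is prohibitively costly because of growing context memory, quadratic attention complexity, and repeated denoising steps. We propose Light Interaction, a training-free inference acceleration framework for interactive video world models. Our key insight is that interaction naturally supports trajectory-dependent adaptive computation: retrieved spatial memory can be discarded when exploring new regions, temporal context can be adjusted according to local latent dynamics, and model outputs from early steps can be reused when the camera revisits familiar regions. Building on this insight, Light Interaction combines adaptive context management, denoising cache acceleration, and hardware-software co-designed 3D block sparse attention (with fused Triton kernels). Evaluations on HY-WorldPlay and Matrix-Game-3.0 show that Light Interaction achieves up to 2.59x speedup without model retraining while maintaining competitive visual quality.

English abstract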

Interactive video world models generate video chunk by chunk in response to user-controlled camera movements, enabling applications such as real-time game simulation, virtual scene navigation, and embodied AI training. However, scaling to long interactive trajectories is prohibitively expensive due to growing context memory, quadratic attention complexity, and repeated denoising steps. We present Light Interaction, a training-free inference acceleration framework for interactive video world models. Our key insight is that interaction naturally enables trajectory-dependent adaptive computation: retrieved spatial memory can be discarded during novel exploration, temporal context can be adjusted according to local latent dynamics, and early-step model outputs can be reused when the camera revisits familiar regions. Based on this insight, Light Interaction combines adaptive context management, denoising cache acceleration, and hardware-software co-designed 3D block sparse attention with fused Triton kernels. Evaluated on HY-WorldPlay and Matrix-Game-3.0, Light Interaction achieves up to 2.59x speedup without model retraining while maintaining competitive visual quality.

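2605.30456 2026-06-19 cs.LG math.OC version update

DisjunctiveNet: Neural Symbolic Learning via Differentiable Convexified Optimization Layers

Shraman Pal, Can Li

Affiliations Davidson School of Chemical Engineering, Purdue University, West Lafayette, USA

AI summary For settings with sparse data but rich domain knowledge, proposes the DisjunctiveNet framework, which embeds disjunctive constraints in neural networks through differentiable convexified optimization layers, achieving hard constraint satisfaction and strong predictive performance.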

Comments ICML 2026

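AI Chinese abstract (translated)

Many learning tasks in science and engineering are characterized by sparse datasets, which limits the effectiveness of purely data-driven methods. At the same time, these problems usually come with rich domain knowledge derived from physical laws, operational requirements, and expert heuristics. This knowledge is often expressed as rules involving logical propositions and linear inequalities. Existing neuro-symbolic methods typically enforce these rules approximately through soft penalties, assume input-independent rules when designing specialized architectures, or rely on non-differentiable post-processing at inference time to satisfy hard constraints. Although recent advances in differentiable optimization layers make end-to-end feasibility enforcement possible in neural networks, extending these methods to logical or mixed-integer rules remains challenging because of inherent nonconvexity. In this work, we propose a unified end-to-end framework for enforcing hard, input-dependent mixed-integer linear constraints in neural networks. Our method represents rules as disjunctive constraints and applies hierarchical convex relaxations to obtain convex hull formulations. These relaxations yield tractable linear constraints that can be embedded as differentiable optimization layers while achieving exact rule satisfaction. We demonstrate the effectiveness of the proposed framework on real-world datasets, achieving perfect rule satisfaction and strong predictive performance.

English abstract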

Many learning tasks in science and engineering are characterized by sparse datasets, which limits the effectiveness of purely data-driven approaches. At the same time, these problems are often accompanied by rich domain knowledge derived from physical laws, operational requirements, and expert heuristics. Such knowledge is frequently expressed as rules involving logical propositions and linear inequalities. Existing neuro-symbolic methods typically enforce these rules approximately through soft penalties, assume input-independent rules when designing specialized architectures, or rely on non-differentiable post-processing at inference time to achieve hard constraint satisfaction. While recent advances in differentiable optimization layers enable end-to-end feasibility enforcement within neural networks, extending these approaches to logical or mixed-integer rules remains challenging due to inherent nonconvexity. In this work, we propose a unified end-to-end framework for enforcing hard, input-dependent mixed integer linear constraints within neural networks. Our approach represents rules as disjunctive constraints and applies hierarchical convex relaxations to obtain convex hull formulations. These relaxations yield tractable linear constraints that can be embedded as differentiable optimization layers while enabling exact rule satisfaction. We demonstrate the effectiveness of the proposed framework on real-world datasets, achieving perfect rule satisfaction and strong predictive performance.
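
The convex hull formulation mentioned above can be illustrated with the textbook hull reformulation of a disjunction between polyhedra. This is the standard construction from disjunctive programming, shown as a sketch; it is not necessarily the paper's hierarchical relaxation, and the symbols $A_k$, $b_k$, $x_k$, and $\lambda_k$ are notation assumed here. A rule requiring the output $x$ to satisfy at least one of $K$ linear systems, together with its hull reformulation in disaggregated copies $x_k$ and multipliers $\lambda_k$, reads:

```latex
% Rule as a disjunction over K linear systems
\bigvee_{k=1}^{K} \bigl[\, A_k x \le b_k \,\bigr]

% Hull reformulation with disaggregated variables x_k and multipliers \lambda_k
\begin{aligned}
x &= \sum_{k=1}^{K} x_k, \\
A_k x_k &\le b_k \lambda_k, && k = 1, \dots, K, \\
\sum_{k=1}^{K} \lambda_k &= 1, \\
\lambda_k &\ge 0, && k = 1, \dots, K.
\end{aligned}
```

Restricting $\lambda_k \in \{0, 1\}$ gives an exact mixed-integer encoding of the rule. Relaxing to $\lambda_k \ge 0$ leaves only linear constraints, which describe the convex hull of the union when every polyhedron $\{x : A_k x \le b_k\}$ is nonempty and bounded. In an input-dependent setting, $A_k$ and $b_k$ would additionally be functions of the network input.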

2605.28654 2026-06-19 cs.RO cs.SY eess.SY math.OC version update

Integrated Exploration-Aware UAV Route Optimization and Path Planning

Jimin Choi, Grant Stagg, Cameron K. Peterson, Max Z. Li

Affiliations Department of Aerospace Engineering, University of Michigan; Department of Electrical Engineering, Brigham Young University; Department of Aerospace Engineering, Department of Civil and Environmental Engineering, and Department of Industrial and Operations Engineering, University of Michigan

AI summary Proposes an integrated exploration-aware UAV route optimization and trajectory planning framework that uses a risk map, uncertain region-of-interest modeling, B-spline trajectory optimization, and online replanning to balance visits to reported locations against exploration for new information in disaster monitoring, improving average KL reduction by 15.9%.

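AI Chinese abstract (translated)

UAVs are increasingly used for exploration-driven monitoring in hazardous environments such as disaster zones, contaminated sites, wildfire areas, and damaged infrastructure, where limited flight endurance must be divided between visiting reported locations and gathering new information. In these settings, prior information about hazards is usually incomplete, spatially imprecise, and liable to change during execution. For example, an initial report may identify a region where a hazard is likely to exist, but the actual hazard may be displaced, partially observed, or not reported at all. We propose an integrated exploration-aware UAV route optimization and trajectory planning framework for hazard monitoring under uncertain and evolving prior information. The environment is represented as a spatial risk map in which each location carries a belief about hazardous conditions. Reported hazards are modeled as uncertain regions of interest (ROIs) rather than confirmed target locations, so the UAV must inspect the reported areas while also using its limited flight endurance to explore information-rich regions. The proposed method solves a vehicle routing problem over the reported ROIs, augments the route with auxiliary pseudo-nodes to improve spatial coverage, allocates the remaining flight-distance budget across route segments, and optimizes dynamically feasible B-spline trajectories for local exploration. During execution, UAV measurements update a grid-based belief map, and the remaining trajectory is replanned when new information and the remaining budget justify an adjustment. Across 48 scenario configurations, online replanning improves average KL reduction by 15.9% over the offline optimized planner and by 48.6% over straight-line traversal.

English abstract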

Uncrewed aerial vehicles (UAVs) are increasingly used for exploration-driven monitoring in hazardous environments such as disaster zones, contaminated sites, wildfire areas, and damaged infrastructure, where limited flight endurance must be allocated between visiting reported locations and gathering new information. In these settings, prior information regarding hazards is often incomplete, spatially imprecise, and subject to change during execution. For example, initial reports may identify a region where a hazard is likely to exist, but the actual hazard may be displaced, partially observed, or entirely unreported. We present an integrated exploration-aware UAV route optimization and path planning framework for hazard monitoring under uncertain and evolving prior information. The environment is represented as a spatial risk map, where each location has an associated belief of hazardous conditions. Reported hazards are modeled as uncertain regions of interest (ROIs) rather than confirmed target locations, requiring the UAV to inspect reported areas while also using its limited flight endurance to explore informative regions. The proposed method solves a vehicle routing problem over reported ROIs, augments the route with auxiliary pseudo-nodes to improve spatial coverage, allocates the remaining flight distance budget across route segments, and optimizes dynamically feasible B-spline trajectories for local exploration. During execution, UAV measurements update a grid-based belief map, and the remaining trajectory is replanned when new information and the remaining budget justify adaptation. Across 48 scenario configurations, online replanning improves average KL reduction by 15.9% over the offline optimized planner and 48.6% over straight-line traversal.

2605.26891 2026-06-19 cs.CL version update

Telenor Nordics Customer Service self-help corpus

Mike Riess

Affiliations Research and Innovation, Telenor Group

AI summary Builds a multilingual customer service self-help corpus of 1,122 documents in Finnish, Danish, Norwegian, and Swedish to support Nordic NLP and information retrieval research.

Comments 8 pages, 2 figures, 5 tables. Submitted to Nordic Machine Intelligence. Dataset: https://zenodo.org/records/19493152

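AI Chinese abstract (translated)

This paper presents a multilingual customer service self-help corpus containing 1,122 manually validated documents in Finnish, Danish, Norwegian, and Swedish, with a total word count of more than one million. The documents come from the public self-help pages of four Nordic telecommunications operators and were then filtered for personally identifiable information and relevance through a pipeline combining LLM and human annotation. Domain-specific datasets for Nordic languages remain scarce, especially in customer service, a domain of growing importance for retrieval-augmented generation, cross-lingual transfer learning, and emerging agent-based service architectures. Analysis of the corpus shows substantial variation in document length and structure across operators, reflecting different editorial strategies, as well as broad topical coverage spanning network hardware, mobile services, TV and streaming, billing, and account management. The dataset is publicly available under a CC-BY-NC-SA-4.0 license at https://zenodo.org/records/19493152 and is intended to support reproducible research in Nordic NLP and information retrieval.

English abstract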

This paper presents a multilingual customer service self-help corpus comprising 1,122 manually validated documents in Finnish, Danish, Norwegian, and Swedish, totaling 274,599 words and 1,884,833 characters. The documents have been sourced from the public self-help pages of four Nordic telecommunications operators and subsequently filtered for person-identifiable information and relevance through a combined LLM and human annotation pipeline. Domain-specific datasets for Nordic languages remain scarce, particularly in customer service: a domain of growing importance for retrieval-augmented generation, cross-lingual transfer learning, and emerging agent-based service architectures. An analysis of the corpus reveals substantial variation in document length and structure across operators, reflecting distinct editorial strategies, as well as broad topical coverage spanning network hardware, mobile services, TV and streaming, billing, and account management. The dataset is publicly available under a CC-BY-NC-SA-4.0 license at https://zenodo.org/records/20732652, intended to support reproducible research in Nordic NLP and information retrieval.

2605.30089 2026-06-19 cs.LG version update

Distributionally Robust Set Representation Learning Under Inference-Time Element Corruption

Yankai Chen, Hanrong Zhang, Bowei He, Philip S. Yu, Xue Liu

Affiliations McGill University; University of Illinois Chicago

AI summary To address inference-time element corruption, proposes SW-DRSO, a distributionally robust optimization framework that approximates the worst-case loss with a barycentric adversary; robustness and performance are validated on four tasks.

Comments Accepted by ICML'26

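AI Chinese abstract (translated)

Standard set representation learning methods usually perform well on carefully curated data but tend to overlook the challenge of inference-time element corruption. This refers to deployed models encountering element-level degradations (such as outliers or missing components) that can distort the set representation and degrade performance. We propose SW-DRSO, a distributionally robust optimization framework designed specifically for sets. Rather than only minimizing the loss on the observed training data, SW-DRSO optimizes a tractable surrogate of the worst-case expected loss over a family of plausible inference-time variations. We introduce a barycentric adversary that approximates the intractable search over corrupted sets by a differentiable training-time optimization over simplex weights. Extensive experiments on four tasks show that SW-DRSO effectively improves robustness to corruption while maintaining high overall performance.

English abstract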

Standard Set Representation Learning methods typically excel on curated data but often overlook the challenge of inference-time element corruption. This refers to scenarios where deployed models encounter element-level degradations, such as outliers or missing components, that may distort set representation and degrade performance. We propose SW-DRSO, a distributionally robust optimization framework tailored for sets. Rather than minimizing loss solely on observed training data, SW-DRSO optimizes a tractable surrogate of the worst-case expected loss over a family of plausible inference-time variations. We introduce a barycentric adversary that approximates the intractable search over corrupted sets by a differentiable training-time optimization over simplex weights. Extensive experiments across four tasks demonstrate that SW-DRSO effectively enhances robustness against corruption while maintaining high overall performance.

2605.27864 2026-06-19 cs.AI version update

FundaPod: A Multi-Persona Agent Pod Platform with Knowledge Graph Memory for AI-Assisted Fundamental Investment Research

Di Zhu, Lei Nico Zheng, Zihan Chen

Affiliations Stevens Institute of Technology; UMass Boston

AI summary Proposes the FundaPod platform, which supports human investment managers in making transparent, verifiable fundamental investment decisions through independent multi-persona research, knowledge-graph memory, and post hoc adjudication.

Comments 32 pages; 12 figures

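AI Chinese abstract (translated)

Large language models (LLMs) are increasingly applied in finance, but most existing work emphasizes trading signals or prediction-centered financial NLP tasks. Institutional fundamental research, by contrast, requires human analysts or AI agents to gather evidence, identify business drivers, compare competing viewpoints, and produce investment memos. Its broader goal is not only to predict outcomes but to produce transparent, reusable, and verifiable investment plans while contributing to the cumulative development of investment knowledge. We propose FundaPod, a multi-persona agent platform for AI-assisted fundamental investment research. We argue that fundamental research is a human-centered decision-support task that differs in nature from trading-signal generation and is therefore better suited to an independence-preserving architecture. In FundaPod, AI agents with different personas (such as value investor or macro strategist) conduct research independently under a shared provenance contract. Their disagreements are then surfaced post hoc through a knowledge-graph memory system for adjudication by a human portfolio manager (PM). Grounded in design-science practice and theories of cognitive isolation and human-machine coordination, the paper proposes five design principles for human-AI hybrid systems that support fundamental research. It also describes four architectural mechanisms: a persona distillation pipeline that turns public investor materials into deployable agents; a declarative skill registry that lets the planner derive typed task graphs; a grounded evidence model that links memo claims to verifiable sources; and a knowledge-graph "second brain" that connects tickers, memos, analysts, and themes. We demonstrate the architecture through a complete case study and a persona-based memo comparison.

English abstract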

Large language models (LLMs) are increasingly applied in finance, yet most existing work emphasizes trading signals or financial NLP tasks centered on prediction. Institutional fundamental research, by contrast, requires human analysts or AI agents to gather evidence, identify business drivers, compare competing viewpoints, and generate investment memos. Its broader goal is not merely to predict outcomes, but to produce investment plans that are transparent, reusable, and verifiable, while contributing to the cumulative development of investment knowledge. We present FundaPod, a multi-persona agent platform for AI-assisted fundamental investment research. We argue that fundamental research is a human-centric decision-support task that is qualitatively distinct from trading-signal generation, and is therefore better served by an independence-preserving architecture. In FundaPod, AI agents with different personas, such as value investors or macro strategists, conduct research independently under a shared provenance contract. Their disagreements are then surfaced post hoc for adjudication by the human portfolio manager (PM) through a knowledge-graph memory system. This paper contributes five design principles for human-AI hybrid systems supporting fundamental research, grounded in design-science practice and theories of cognitive isolation and human-machine coordination. It also describes four architectural mechanisms: a persona distillation pipeline that turns public investor materials into deployable agents; a declarative skill registry that lets the planner derive typed task graphs; a grounded evidence model that links memo claims to verifiable sources; and a knowledge-graph "second brain" that connects tickers, memos, analysts, and themes. We demonstrate the architecture through a complete case study and a persona-based memo comparison.

2605.29483 2026-06-19 cs.AI version update

VitalAgent: A Tool-Augmented Agent for Reactive and Proactive Physiological Monitoring over Wearable Health Data

Di Zhu, Yu Yvonne Wu, Hong Jia, Aaqib Saeed, Vassilis Kostakos, Ting Dang

Affiliations The University of Melbourne, Australia; Dartmouth College, US; University of Auckland, New Zealand; Eindhoven University of Technology, Netherlands

AI summary Proposes the VitalAgent framework, which uses tool-augmented reasoning and longitudinal physiological memory to support reactive question answering and proactive monitoring over ECG/PPG signals, improving on baselines by more than 30% on the VitalBench benchmark.

Comments Minor revisions; results unchanged

详情
AI中文摘要

可穿戴设备能够连续监测ECG和PPG等生理信号,但现有的移动健康系统大多局限于特定任务的预测管道或对静态摘要的反应性问答。它们缺乏支持时间推理、持久生理上下文以及对长期信号流进行主动监测的能力。我们提出VitalAgent,一个基于ECG/PPG的移动健康工具增强型代理框架,支持反应性问答和主动监测。VitalAgent建立在纵向生理记忆和工具增强推理接口之上,能够对原始信号进行动态计算。我们进一步引入VitalBench,一个纵向生理监测基准数据集,包含用于反应性问答的1,862个问答对和用于主动监测的90.2小时连续ECG/PPG记录,涵盖心脏、身体活动和压力相关任务。实验表明,VitalAgent在反应性评估中相比基于提示和ReAct的基线实现了超过30%的提升,并支持对长期生理信号的主动警报监测,突显了动态工具使用和长期生理监测的重要性。

英文摘要

Wearable devices enable continuous monitoring of physiological signals such as ECG and PPG, but existing mHealth systems are largely limited to task-specific prediction pipelines or reactive question answering over static summaries. They lack the ability to support temporal reasoning, persistent physiological context, and proactive monitoring over long-term signal streams. We propose VitalAgent, a tool-augmented agentic framework for ECG/PPG-based mHealth that supports both reactive question answering and proactive monitoring. VitalAgent is built on a longitudinal physiological memory and a tool-augmented reasoning interface that enables dynamic computation over raw signals. We further introduce VitalBench, a longitudinal physiological monitoring benchmark dataset comprising 1,862 QA pairs for reactive question answering and 90.2 hours of continuous ECG/PPG recordings for proactive monitoring, covering cardiac, physical activity, and stress-related tasks. Experiments demonstrate that VitalAgent achieves over 30% improvement over prompt-based and ReAct baselines in reactive evaluation and supports proactive alert monitoring over long-term physiological signals, highlighting the importance of dynamic tool use and long-term physiological monitoring.

2605.25160 2026-06-19 cs.AI 版本更新

ScaleWoB: Guiding GUI Agents with Coding Agents via Large-Scale Environmental Synthesis

ScaleWoB: 通过大规模环境合成,以编码智能体引导GUI智能体

Guohong Liu, Jialei Ye, Pengzhi Gao, Wei Liu, Jian Luan, Yunxin Liu, Yuanchun Li

发表机构 * Institute for AI Industry Research (AIR), Tsinghua University(人工智能产业研究院(AIR),清华大学) University of Electronic Science and Technology of China(电子科技大学) MiLM Plus, Xiaomi Inc.(小米公司MiLM Plus团队)

AI总结 针对现有移动GUI智能体基准测试与现实应用之间的差距,提出全合成基准SimuWoB,通过鲁棒的虚拟环境生成框架合成高保真任务和环境,自动提供有效奖励,实现对复杂长程交互的高效可重复评估。

详情
AI中文摘要

由大型语言模型驱动的移动GUI智能体发展迅速,迫切需要真实且全面的评估。现有基准测试优先考虑可重复性,但通常局限于开源应用或文件操作任务,因为在实际应用中构建奖励困难,导致基准设置与现实使用之间存在差距。此外,大多数基准测试侧重于基本定位和导航,对复杂长程交互的覆盖有限。为解决这些局限性,我们引入了SimuWoB,一个全合成的移动GUI智能体基准测试,包含120个涵盖不同类型和难度级别的挑战性任务。我们构建了一个鲁棒的虚拟环境生成框架,合成高保真任务和环境,并为每个任务自动提供有效奖励。每个环境都部署为可通过URL访问的无后端网页,实现高效且可重复的评估。我们对几个最先进的移动GUI智能体进行了全面实验。平均成功率仅为27.92%,在长程任务上降至17.82%,揭示了当前智能体在复杂场景下的显著弱点。与真实世界样本任务的评估结果比较表明,基于我们合成环境的智能体评估具有良好的泛化性。我们进一步提供了关键能力维度的诊断见解,并讨论了对未来移动GUI智能体开发的启示。

英文摘要

GUI agents powered by large language models are advancing rapidly, creating an urgent need for evaluation and training in realistic environments. However, doing so directly in real-world environments introduces challenges that cannot be overlooked. Real-world environments are complex and uncontrollable, making it difficult to construct verifiable rewards and to save or reset states. Existing works prioritize reproducibility but are often limited to open-source apps or file-operation tasks for reliable reward building, leaving a persistent gap from real-world usage. Furthermore, relying on virtual machines or Docker images demands substantial resources and suffers from slow response speeds, which limits efficiency. We present ScaleWoB, a framework that produces high-fidelity synthesized interactive environments for GUI agents across platforms with verifiable rewards. These environments behave as backend-free webpages accessible via URL, requiring near-zero setup and low resource cost, making the approach suitable for both large-scale evaluation and downstream agent training. We support multiple GUI platforms, including mobile, desktop, and automotive/in-vehicle interfaces, based on the same pipeline, covering 100+ environments and 1000+ verifiable tasks. Among them, 120 challenging tasks across 63 simulated mobile applications are released as a fully synthesized mobile GUI agent benchmark. Experimental results on five state-of-the-art mobile GUI agents reveal substantial headroom: the average success rate is only 27.92%, dropping to 17.82% on the long-horizon subset, while humans reach 92.08%. A comparison against real-world sample tasks shows that assessments made in our synthetic environments generalize to real apps. The project website is at https://scalewob.github.io.

2605.25005 2026-06-19 cs.RO 版本更新

Stiffness Optimization for Concentrated Bending in Magnetically Actuated Catheters: Maintaining Steerability under Gradient Stiffness

磁驱动导管集中弯曲的刚度优化:在梯度刚度下保持可操控性

Jiewen Tan, Junnan Xue, Shing Shin Cheng, Shuang Song, Erli Lyu, Jiaole Wang

发表机构 * Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳)) The Chinese University of Hong Kong(香港中文大学) Macao Polytechnic University(澳门理工大学)

AI总结 针对磁驱动软导管在推送性与近端集中弯曲之间的权衡,提出一种刚度优化的多段磁驱动导管(SO-MAC),通过解耦转向-推进机构和梯度刚度架构,在推进过程中实现稳定的近端枢轴弯曲,同时远端被动自直以传递推进力。

详情
AI中文摘要

对于磁驱动软导管,实现高效的推送性(推进力传递)和近端集中弯曲以保持可操控性具有挑战性:较高的轴向/弯曲刚度可改善力传递但降低可操控性,而较低的刚度可实现大的近端集中弯曲,但在压缩推送载荷下增加扭结/屈曲风险。为了解决这一权衡,我们提出了一种刚度优化的多段磁驱动导管(SO-MAC),它集成了解耦的转向-推进机构与梯度刚度架构。SO-MAC在推进过程中将弯曲集中在稳定的近端枢轴周围,而远端部分通过优化的刚度分布和弹簧骨架的弹性恢复抵抗摩擦引起的扭结/屈曲,被动自直以传递推进力。在$0{-}180^{\circ}$的组合转向和推进过程中,枢轴保持稳定,远端尖端几乎直线地向目标方向推进。直径为1.5 mm的SO-MAC在其10 mm尖端处实现了高达$180^{\circ}$的转向,弯曲半径为3 mm,平均形状误差为$1.39 \pm 0.56$ mm,转向枢轴误差为$0.35 \pm 0.10$ mm。在支气管模型中的视觉反馈控制进一步验证了通过高度弯曲的分叉路径的鲁棒导航。

英文摘要

Achieving both efficient pushability (propulsion transmission) and proximally concentrated bending for steerability is challenging for magnetically actuated soft catheters: higher axial/bending stiffness improves force transmission but reduces steerability, whereas lower stiffness enables large, proximally concentrated bending yet increases kinking/buckling risk under compressive push loads. To address this trade-off, we propose a stiffness-optimized multi-segment magnetically actuated catheter (SO-MAC) that integrates a decoupled steering-advancement mechanism with a gradient-stiffness architecture. The SO-MAC concentrates bending about a stable proximal pivot during advancement while the distal section passively self-straightens to transmit propulsion, aided by the optimized stiffness distribution and elastic recovery of the spring backbone against friction-induced kinking/buckling. Over $0{-}180^{\circ}$ combined steering and advancement, the pivot remained stable and the distal tip advanced near-straight toward the target direction. A 1.5 mm-diameter SO-MAC achieved up to $180^{\circ}$ steering with a 3 mm bending radius at its 10 mm tip, with an average shape error of $1.39 \pm 0.56$ mm and a steering-pivot error of $0.35 \pm 0.10$ mm. Visual feedback control in a bronchial phantom further confirmed robust navigation through highly curved, bifurcating paths.

2605.23733 2026-06-19 cs.RO cs.AI 版本更新

Any2Any: Efficient Cross-Embodiment Transfer for Humanoid Whole-Body Tracking

Any2Any: 高效跨本体迁移用于人形机器人全身跟踪

Ming Yang, Tao Yu, Feng Li, Hua Chen

发表机构 * LimX Dynamics(LimX动力学)

AI总结 提出Any2Any范式,通过运动学对齐和动力学微调,实现预训练全身跟踪模型高效迁移至新的人形机器人本体,仅需少量数据和计算即可达到竞争性跟踪性能。

Comments Project Page: https://any2any.top/

详情
AI中文摘要

全身跟踪(WBT)模型已成为人形机器人的关键基础,使其能够高保真地模仿各种运动。从头训练此类模型需要大规模数据和计算,使得在新人形平台上快速部署成本高昂。这自然引发一个问题:预训练的WBT模型能否通过最小化适应跨本体迁移?为回答这个问题,我们提出Any2Any,一种范式,能够高效地将现有WBT专家迁移到新人形本体,仅需少量数据和计算。Any2Any首先在源和目标人形之间进行运动学对齐,对齐其输入和输出空间,使得预训练的源策略可以在目标本体上有意义地重用。然后,Any2Any通过向选定的动力学敏感模块应用轻量级参数高效微调(PEFT)组件进行动力学适应,保留有用的行为先验,同时实现对目标机器人的定向适应。在多个人形平台和预训练骨干上的大量实验表明,与从头训练相比,Any2Any显著加速收敛并降低训练成本,同时实现具有竞争力或更优的跟踪性能。值得注意的是,仅使用完整训练所需计算和数据的1%,Any2Any成功将在Unitree G1上预训练的Sonic模型迁移到LimX Oli和LimX Luna。这些结果表明,预训练的WBT专家可以跨本体高效重用,为在新机器人上部署人形全身控制提供可扩展的路径。

英文摘要

Whole-body tracking (WBT) models have become a key foundation for humanoid robots, enabling them to imitate diverse motions with high fidelity. Training such models from scratch requires large-scale data and computation, making rapid deployment on new humanoid platforms costly. This raises a natural question: Can pretrained WBT models transfer across embodiments with minimal adaptation? To answer this question, we propose Any2Any, a paradigm that efficiently transfers an existing WBT specialist to a new humanoid embodiment with only a small amount of data and compute. Any2Any first performs kinematic alignment between source and target humanoids, aligning their input and output spaces so that the pretrained source policy can be meaningfully reused on the target embodiment. Any2Any then performs dynamics adaptation by applying lightweight parameter-efficient fine-tuning (PEFT) components to selected dynamics-sensitive modules, preserving useful behavioral priors while enabling targeted adaptation to the target robot. Extensive experiments on multiple humanoid platforms and pretrained backbones show that Any2Any substantially accelerates convergence and reduces training cost compared with training from scratch, while achieving competitive or superior tracking performance. Notably, using only 1% of the compute and data required for full training, Any2Any successfully transfers Sonic models pre-trained on Unitree G1 to LimX Oli and LimX Luna. These results suggest that pretrained WBT specialists can be efficiently reused across embodiments, providing a scalable path toward deploying humanoid whole-body control on new robots. More results and videos are available on our project page: https://any2any.top/.

2605.22748 2026-06-19 cs.RO cs.AI cs.LG cs.MA 版本更新

Superhuman Safe and Agile Racing through Multi-Agent Reinforcement Learning

通过多智能体强化学习实现超人类安全且敏捷的赛车

Ismail Geles, Leonard Bauersfeld, Markus Wulfmeier, Davide Scaramuzza

发表机构 * Robotics and Perception Group, University of Zurich(苏黎世大学机器人与感知组) Google DeepMind(谷歌DeepMind) Nomagic

AI总结 本文提出通过多智能体强化学习在高速四旋翼赛车中实现安全且敏捷的性能,展示了多智能体交互对真实世界交互安全性的关键作用,同时在高速赛车中超越人类飞行员并减少碰撞率。

Comments 12 pages (+4 supplementary). Website: https://rpg.ifi.uzh.ch/marl

详情
AI中文摘要

自主系统在孤立或模拟环境中已实现超人类性能,但在共享、动态的真实世界空间中仍显得脆弱。这种失败源于物理应用中占主导地位的单智能体范式,其中其他参与者被忽略或视为环境噪声,阻碍了有效协调。本文证明多智能体强化学习为真实世界交互提供了必要的安全性基础。我们以高速四旋翼赛车作为高风险测试平台,训练智能体在参赛者数量可变的情况下应对复杂的空气动力学相互作用和战略机动。通过基于联赛的自我对弈,智能体演化出复杂的预判行为,包括主动避碰、超车以及处理包括空气动力学下洗在内的多智能体物理交互。我们的智能体在速度超过22米/秒的多人比赛中超越了冠军级人类飞行员,同时与最先进的单智能体基线相比,碰撞率降低了50%。关键的是,使用多样化的人工智能体进行训练能够实现向更安全的人机交互的零样本泛化。这些结果表明,实现稳健的机器人共存的路径不在于孤立的安全约束,而在于多智能体交互的严格要求。多媒体材料可在:https://rpg.ifi.uzh.ch/marl

英文摘要

Autonomous systems have achieved superhuman performance in isolation or simulation, yet they remain brittle in shared, dynamic real-world spaces. This failure stems from the dominant single-agent paradigm for physical applications, where other actors are ignored or treated as environmental noise, preventing effective coordination. Here we show that multi-agent reinforcement learning provides the essential safety scaffolding required for real-world interaction. Using high-speed quadrotor racing as a high-stakes testbed, we train agents to navigate complex aerodynamic interactions and strategic maneuvering with a variable number of racers. Through league-based self-play, agents evolve sophisticated anticipatory behaviors, including proactive collision avoidance, overtaking, and handling multi-agent physical interactions, including aerodynamic downwash. Our agents outperform a champion-level human pilot in multi-player races at speeds exceeding 22 m/s, while simultaneously reducing collision rates by 50% compared to state-of-the-art single-agent baselines. Crucially, training with diverse artificial agents enables zero-shot generalization to safer human interaction. These results suggest that the path to robust robotic co-existence lies not in isolated safety constraints, but in the rigorous demands of multi-agent interaction. Multimedia materials are available at: https://rpg.ifi.uzh.ch/marl

2605.16865 2026-06-19 cs.CL 版本更新

MixSD: Mixed Contextual Self-Distillation for Knowledge Injection

MixSD: 混合上下文自蒸馏用于知识注入

Jiarui Liu, Lechen Zhang, Yongjin Yang, Yinghui He, Yingheng Wang, Weihao Xuan, Zhijing Jin, Mona Diab

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Jinesis Lab, University of Toronto & Vector Institute(Jinesis实验室,多伦多大学及向量研究所) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Princeton University(普林斯顿大学) Cornell University(康奈尔大学) The University of Tokyo(东京大学) RIKEN AIP(日本理化学研究所AIP) Max Planck Institute for Intelligent Systems, Tübingen, Germany(德国图宾根马克斯·普朗克智能系统研究所) EuroSafeAI

AI总结 本文提出MixSD方法,通过混合模型自身条件下的token来实现与模型生成分布对齐的知识注入,从而在保持预训练能力的同时提升事实记忆和推理能力。

详情
AI中文摘要

监督微调(SFT)被广泛用于将新知识注入语言模型,但通常会损害预训练能力,如推理和通用领域性能。我们认为这种遗忘是由于微调目标与模型的自回归分布不一致,迫使优化器模仿低概率token序列。为了解决这个问题,我们提出了MixSD,一种无需外部教师的简单方法,用于对齐分布的知识注入。与固定目标训练不同,MixSD通过混合基础模型自身两个条件下的token动态构建监督。所生成的监督序列保留了事实学习信号,同时更接近基础模型的分布。我们在两个合成语料库上评估了MixSD,研究事实回忆和算术功能学习,并结合已建立的开放领域事实问答和知识编辑基准。在多种模型规模和设置下,MixSD在记忆-保留权衡上优于SFT和在线自蒸馏基线,能够保留基础模型的100% held-out能力,同时保持接近完美的训练准确率,而标准SFT只能保留1%。我们进一步表明,MixSD在基础模型下生成的监督目标具有显著更低的NLL,并减少了有害的Fisher敏感参数方向运动。这些结果表明,将监督与模型的本征生成分布对齐是简单且有效的知识注入原则,可以缓解灾难性遗忘。

英文摘要

Supervised fine-tuning (SFT) is widely used to inject new knowledge into language models, but it often degrades pretrained capabilities such as reasoning and general-domain performance. We argue this forgetting arises because fine-tuning targets from humans or external systems diverge from the model's autoregressive distribution, forcing the optimizer to imitate low-probability token sequences. To address this problem, we propose MixSD, a simple external-teacher-free method for distribution-aligned knowledge injection. Instead of training on fixed targets, MixSD constructs supervision dynamically by mixing tokens from two conditionals of the base model itself: an expert conditional that observes the injected fact in context, and a naive conditional that reflects the model's original prior. The resulting supervision sequences preserve the factual learning signal while remaining substantially closer to the base model's distribution. We evaluate MixSD on two synthetic corpora that we construct to study factual recall and arithmetic function acquisition in a controlled setting, together with established benchmarks for open-domain factual question answering and knowledge editing. Across multiple model scales and settings, MixSD consistently achieves a better memorization-retention trade-off compared to SFT and on-policy self distillation baselines, retaining up to 100% of the base model's held-out capability while maintaining near-perfect training accuracy, whereas standard SFT retains as little as 1%. We further show that MixSD produces substantially lower-NLL supervision targets under the base model and reduces harmful movement along Fisher-sensitive parameter directions. These results suggest that aligning supervision with the model's native generation distribution is a simple and effective principle for knowledge injection that mitigates catastrophic forgetting.

2509.24725 2026-06-19 cs.LG cs.AI 版本更新

Q-Net: Queue Length Estimation via Kalman-based Neural Networks

Q-Net:基于卡尔曼神经网络的队列长度估计

Ting Gao, Elvin Isufi, Winnie Daamen, Erik-Sander Smits, Serge Hoogendoorn

发表机构 * University of Amsterdam(阿姆斯特丹大学) Delft University of Technology(代尔夫特理工大学)

AI总结 本文提出Q-Net框架,通过结合卡尔曼滤波与神经网络,解决信号交叉口队列长度估计中的数据融合问题,提升空间转移性和实时性,实现无需昂贵传感设备的准确队列估计。

Journal ref Transportation Research Part C: Emerging Technologies, Volume 190, September 2026, Article 105809

详情
AI中文摘要

估计信号交叉口的队列长度一直是交通管理中的长期挑战。尽管有两类隐私保护的数据源:(i) 接近停止线的环形检测器提供的车辆计数汇总数据,以及 (ii) 提供路段平均速度测量的汇总浮动汽车数据 (aFCD),但如何将这些具有不同空间和时间分辨率的数据源整合用于队列长度估计仍不清楚。为此,本文提出Q-Net:一种基于状态空间形式的队列估计框架。该设计解决了队列建模中的关键挑战,如违反交通守恒假设。Q-Net遵循卡尔曼预测-更新结构,并在状态演变和测量模型中保持物理可解释性。Q-Net使用AI增强的卡尔曼滤波器从数据中学习时间变化的增益动态。该框架支持实时实现,并通过将aFCD测量分组为固定大小的局部组来提高空间转移性,使可学习参数的数量与路段长度无关。在荷兰 Rotterdam 城市主干道的评估显示,Q-Net优于基线方法,能够准确追踪队列的形成和消散,并缓解aFCD引起的延迟。通过结合数据效率、可解释性、实时适用性和空间转移性,Q-Net在无需昂贵的传感基础设施(如摄像头或雷达)的情况下实现了准确的队列长度估计。

英文摘要

Estimating queue lengths at signalized intersections is a long-standing challenge in traffic management. Partial observability of vehicle flows complicates this task despite the availability of two privacy-preserving data sources: (i) aggregated vehicle counts from loop detectors near stop lines, and (ii) aggregated floating car data (aFCD) that provide segment-wise average speed measurements. However, how to integrate these sources with differing spatial and temporal resolutions for queue length estimation is rather unclear. Addressing this question, we present Q-Net: a queue estimation framework built upon a state-space formulation. This design addresses key challenges in queue modeling, such as violations of traffic conservation assumptions. Q-Net follows the Kalman predict-update structure and maintains physical interpretability in both the state evolution and measurement models. Q-Net uses an AI-augmented Kalman filter to learn time-varying gain dynamics from data. The framework supports real-time implementation and improves spatial transferability by grouping aFCD measurements into fixed-size local groups, making the number of learnable parameters independent of section length. Evaluations on urban main roads in Rotterdam, the Netherlands, show that Q-Net outperforms baseline methods, tracks queue formation and dissipation accurately, and mitigates aFCD-induced delays. By combining data efficiency, interpretability, real-time applicability, and spatial transferability, Q-Net makes accurate queue length estimation possible without costly sensing infrastructure like cameras or radar.
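The Kalman predict-update structure mentioned in the abstract can be illustrated with a minimal scalar filter. This is only a textbook sketch, not Q-Net itself: the function names, the scalar state, and the noise parameters `q` and `r` are assumptions made for illustration, and the gain here is computed analytically, whereas the abstract describes a gain that is learned from data.

```python
def kalman_step(x, p, z, q, r, a=1.0, h=1.0):
    """One predict-update cycle of a scalar Kalman filter.

    x, p : previous state estimate and its variance
    z    : new measurement
    q, r : process and measurement noise variances (assumed known here)
    a, h : state-transition and measurement coefficients
    """
    # Predict: propagate the estimate and inflate its variance.
    x_pred = a * x
    p_pred = a * p * a + q
    # Update: blend prediction and measurement using the Kalman gain.
    k = p_pred * h / (h * p_pred * h + r)
    x_new = x_pred + k * (z - h * x_pred)
    p_new = (1.0 - k * h) * p_pred
    return x_new, p_new, k


def run_filter(measurements, x0=0.0, p0=1.0, q=0.01, r=1.0):
    """Filter a sequence of scalar measurements and return all estimates."""
    x, p = x0, p0
    estimates = []
    for z in measurements:
        x, p, _ = kalman_step(x, p, z, q, r)
        estimates.append(x)
    return estimates
```

In a learned variant, the line that computes `k` would be replaced by the output of a trained model, while the predict and update equations around it keep their physical meaning.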

2605.20448 2026-06-19 cs.CV cs.LG 版本更新

Do Vision-Language Models Understand 3D Scenes or Just Catalogue Objects?

视觉-语言模型是理解3D场景还是仅仅罗列物体?

Animesh Maheshwari, Divyansh Sahu, Nishit Verma

发表机构 * Deccan AI(德克南人工智能)

AI总结 本文通过一个包含3034个样本的人工整理基准,探讨了视觉-语言模型对空间理解的深度有序遮挡、光学几何推断和体积重新安排规划能力,发现模型在重新安排可见布局时表现优异,但在遮挡和反射推断上表现较差。

详情
AI中文摘要

视觉-语言模型能够可靠地命名场景中的物体,但它们是否代表这些物体所处的3D布局?我们引入了一个包含3034个样本的人工整理基准,针对空间理解的三个组成部分:深度有序遮挡(通过三种独立的反事实操作化进行探测)、可见反射的光学几何推断,以及体积重新安排规划。六个前沿和开放权重的VLMs在18,204个响应上由训练注释者评分,没有使用LLM作为判断标准,揭示了明显的分离:在53-97%的准确率下,能够对可见布局进行重新安排的模型,在遮挡任务中表现不佳,仅在6-45%之间,而在反射任务中低于7%。一个具身推理模型重现了相同的模式。对Qwen3-VL-8B-Thinking的白盒分析显示,失败归因于视觉标记合并:在视觉编码器中可恢复的空间信息在标记压缩后变得不可用,只有在清洁的标记合并后激活被重新引入语言解码器后才恢复。

英文摘要

Vision-language models reliably name objects in a scene, but do they represent the 3D layout those objects inhabit? We introduce a 3,034-sample human-curated benchmark targeting three components of spatial understanding: depth-ordered occlusion (probed via three independent counterfactual operationalisations), optical-geometry inference over visible reflections, and volumetric rearrangement planning. Six frontier and open-weight VLMs, scored by trained annotators on 18,204 responses with no LLM-as-judge, reveal a sharp dissociation: models that plan rearrangements over visible layouts at 53--97% accuracy and rarely violate collision constraints fall to 6--45% on occlusion and below 7% on reflections. An embodied-reasoning model reproduces the same profile. White-box analysis on Qwen3-VL-8B-Thinking localises the failure to the visual-token merger: spatial information recoverable throughout the vision encoder becomes inaccessible after token compression and only stabilises again when clean post-merger activations are patched into the language decoder.

2604.00626 2026-06-19 cs.LG cs.CL 版本更新

A Survey of On-Policy Distillation for Large Language Models

大型语言模型的在线策略蒸馏综述

Mingyang Song, Mao Zheng

发表机构 * Tencent, China(腾讯,中国)

AI总结 本文综述了大型语言模型的在线策略蒸馏方法,探讨了蒸馏过程中如何通过反馈减少累积误差,提出了基于f-散度最小化的蒸馏框架,并分析了蒸馏与强化学习之间的联系。

Comments Ongoing Work

详情
AI中文摘要

随着大型语言模型(LLMs)在能力和成本上的持续增长,将前沿能力转移到更小、可部署的学生模型已成为核心工程问题,知识蒸馏仍然是这一转移的主导技术。工业流水线中普遍采用的静态模仿教师生成文本的方法存在结构性缺陷,随着任务变得更长且需要更多推理,这种缺陷变得更加严重。因为学生是在完美教师前缀上训练的,但在推理时必须生成自己的文本,小错误往往会积累成学生很少被训练来恢复的轨迹,导致的暴露偏差已被证明与序列长度的平方成比例。在线策略蒸馏(OPD)围绕这一观察重新组织训练循环,通过让教师对学生实际生成的内容提供反馈,以减少累积项趋于线性,并将蒸馏重新定义为迭代修正过程,而不是单次模仿。由此产生的文献在分歧设计、奖励引导优化和自我对抗方面有所扩展,但贡献仍然分散在知识蒸馏、RLHF和模仿学习社区中,缺乏统一的处理。本文提供了这样的处理。我们正式将OPD定义为学生采样轨迹上的f-散度最小化,将该领域沿三个设计轴(优化什么、信号来源在哪里、以及如何在实践中稳定训练)组织起来,并整合成功条件、反复失败模式以及OPD与KL约束强化学习之间的联系。最后,我们提出了由此综合而产生的开放性问题,包括蒸馏扩展定律、不确定反馈、代理蒸馏以及知识蒸馏与强化学习之间的日益增长的重叠。

英文摘要

As Large Language Models continue to grow in both capability and cost, transferring frontier capabilities into smaller, deployable students has become an important engineering problem, and knowledge distillation remains a common technique for this transfer. The prevailing recipe in industrial pipelines, static imitation of teacher-generated text, carries a structural weakness that grows more severe as tasks become longer and more reasoning-intensive. Because the student is trained on flawless teacher prefixes but generates its own at inference, small errors tend to accumulate into trajectories it has rarely been trained to recover from, and the resulting exposure bias has been shown to scale roughly with the square of sequence length. On-Policy Distillation reorganizes the training loop around this observation by having the teacher provide feedback on what the student actually produces, with the goal of reducing the compounding term toward linear and reframing distillation as an iterative correction process rather than single-pass imitation. The resulting literature has expanded along divergence design, reward-guided optimization, and self-play, yet contributions remain scattered across the knowledge distillation, RLHF, and imitation learning communities without a unified treatment. This survey provides such a treatment. We formalize OPD as f-divergence minimization over student-sampled trajectories, organize the field along three design axes (what to optimize, where the signal comes from, and how to stabilize training in practice), and consolidate success conditions, recurring failure modes, and the connection between OPD and KL-constrained reinforcement learning. We close with open problems that emerge from this synthesis, including distillation scaling laws, uncertainty-aware feedback, agent-level distillation, and the growing overlap between knowledge distillation and RL.
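The quadratic-versus-linear compounding claim echoes a standard result from the imitation learning literature, and the f-divergence framing can be written compactly. The notation below is assumed for illustration and does not come from the abstract: $J$ is expected total cost, $\epsilon$ a per-step imitation error, $T$ the sequence length, $\pi_T$ the teacher, $\pi_\theta$ the student, and $\mathcal{D}$ the prompt distribution. The survey's exact formulation may differ.

```latex
% Static (off-policy) imitation: per-step errors compound over the horizon
J(\pi_\theta) - J(\pi_T) \;\le\; O(\epsilon T^{2})

% Training on student-visited prefixes with teacher feedback
J(\pi_\theta) - J(\pi_T) \;\le\; O(\epsilon T)

% One way to write OPD as f-divergence minimization over student samples
\min_{\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
\left[ \sum_{t=1}^{|y|}
D_f\!\left( \pi_T(\cdot \mid x, y_{<t}) \,\middle\|\, \pi_\theta(\cdot \mid x, y_{<t}) \right)
\right]
```

The key difference from static distillation is the sampling distribution: the expectation is taken over sequences $y$ drawn from the student, so the teacher is queried on the prefixes the student actually reaches.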

2605.17443 2026-06-19 cs.CL cs.SD eess.AS 版本更新

Analyzing Error Propagation in Korean Spoken QA with ASR-LLM Cascades

分析韩语语音问答中ASR-LLM级联中的误差传播

Donghyuk Jung, Youngwon Choi

发表机构 * Korea Culture Technology Institute, Republic of Korea(韩国文化科技研究所) Maum AI Inc., Republic of Korea(马姆人工智能公司)

AI总结 本文研究了韩语语音问答中ASR-LLM级联中误差传播的问题,通过分析下游语义失败,揭示了传统ASR指标无法完全捕捉的误差影响,发现不同性能的LLM在级联降级上的一致性,识别出单字符ASR错误作为语义失败通道,并通过辅助比较表明大音频语言模型在噪声韩语SQA中优于匹配语言模型的ASR-LLM流水线。

Comments Preprint. Submitted to APSIPA ASC 2026

详情
AI中文摘要

我们分析了自动语音识别(ASR)误差如何通过ASR-LLM级联在韩语语音问答(SQA)中传播,重点关注传统ASR指标无法完全捕捉的下游语义失败。我们的分析显示,由ASR误差引起的相对下游降级在不同绝对性能的LLM中保持一致,表明级联降级主要跟踪ASR阶段的信息损失。我们进一步识别出单字符韩语ASR错误作为一种独特的语义失败通道,其中正确答案在下游预测中完全消失,尽管仅存在微小的转录差异。最后,辅助比较显示,大型音频语言模型在噪声韩语SQA中优于具有匹配语言骨干的ASR-LLM流水线,表明直接音频输入有潜力缓解转录诱导的信息损失。

英文摘要

We analyze how automatic speech recognition (ASR) errors propagate through ASR-LLM cascades in Korean spoken question answering (SQA), focusing on downstream semantic failures that conventional ASR metrics cannot fully capture. Our analysis shows that the relative downstream degradation caused by ASR errors is consistent across LLMs with different absolute performance, suggesting that cascade degradation largely tracks ASR-stage information loss. We further identify single-character Korean ASR errors as a Korean-specific loss channel, where even a minimal transcription difference can change the intended question and degrade downstream QA performance. Finally, an auxiliary comparison shows that a large audio language model outperforms an ASR-LLM cascade with an approximately matched language backbone in noisy Korean SQA, indicating the potential of direct audio input to mitigate transcript-induced information loss.

2605.15231 2026-06-19 cs.LG cs.CV 版本更新

Mask-Morph Graph U-Net: A Generalisable Mesh-Based Surrogate for Crashworthiness Field Prediction under Large Geometric Variation

Mask-Morph Graph U-Net:一种可泛化的基于网格的代理模型,用于大几何变化下的耐撞性场预测

Haoran Li, Tobias Lehrer, Yingxue Zhao, Haosu Zhou, Philipp Stocker, Tobias Pfaff, Marcus Wagner, Nan Li

发表机构 * Dyson School of Design Engineering, Imperial College London(伦敦帝国理工学院戴森设计工程学院) TUM School of Engineering and Design, Technical University of Munich(慕尼黑工业大学工程与设计学院) Faculty of Mechanical Engineering, OTH Regensburg(雷根斯堡应用技术大学机械工程学院) NVIDIA(NVIDIA公司)

AI总结 本文提出Mask-Morph Graph U-Net,通过特征对齐的重心参数化和节点掩码预训练,提升网格模拟的通用性和数据效率,适用于耐撞性设计探索。

Comments 48 pages, 15 figures, journal paper under review

详情
AI中文摘要

非线性有限元碰撞模拟准确但计算成本高,限制了其在迭代设计优化中的应用。基于图神经网络(GNN)的机器学习代理模型提供了更快的替代方案。消息传递GNN广泛用于网格模拟,其共享的节点和边更新函数在不同图结构之间具有较好的泛化性。相比之下,非共享的边特定聚合层能更准确地捕捉非线性关系,但通常需要固定的图连接性,限制了泛化性。本文提出Mask-Morph Graph U-Net(MMGUNet),一种解决分层图U-Net架构局限的实用方法,该架构使用边特定的下采样和上采样层。边特定层要求粗图连接性固定。为了在保留此连接性的同时提高空间对应性,所提出的方法通过特征对齐的重心参数化将粗化图层次变形到每个输入网格,然后构建跨图边。它进一步在监督预训练中应用节点掩码,随后进行参数高效的微调,其中参数量大的边特定层被冻结。所提出的方法在分布内、分布外和跨组件迁移设置中使用平均欧氏距离和最大侵入量百分比误差进行评估。结果表明,粗图变形相对于固定粗图基线提高了测试准确性,而掩码监督预训练减少了训练-测试差异并提高了迁移期间的数据效率。所提出的模型还比外部基线取得了更低的预测误差。这些结果展示了一条通往可重用、数据高效的基于网格的代理建模的实用路径,用于耐撞性设计探索。

英文摘要

Nonlinear finite element crash simulations are accurate but computationally expensive, limiting their use in iterative design optimisation. Machine-learning surrogate models based on graph neural networks (GNNs) offer a faster alternative. Message-passing GNNs are widely used for mesh simulation, and their shared node and edge update functions are relatively generalisable across varying graph structures. By contrast, non-shareable edge-specific aggregation layers can capture nonlinear relationships more accurately but usually require fixed graph connectivity, which limits generalisability. This paper presents Mask-Morph Graph U-Net (MMGUNet), a practical approach to addressing the limitation of hierarchical Graph U-Net architectures that use edge-specific downsampling and upsampling layers. Fixed coarse graph connectivity is required for edge-specific layers. To retain this while improving spatial correspondence, the proposed method morphs the coarsened graph hierarchy to each input mesh using feature-aligned barycentric parameterisation before constructing cross-graph edges. It further applies node masking during supervised pretraining, followed by parameter-efficient fine-tuning in which high-parameter edge-specific layers are frozen. The proposed approach is evaluated in in-distribution, out-of-distribution, and cross-component transfer settings using mean Euclidean distance and maximum intrusion percentage error. Results show that coarse-graph morphing improves test accuracy relative to a fixed-coarse-graph baseline, while masked supervised pretraining reduces the train-test discrepancy and improves data efficiency during transfer. The proposed model also achieves lower prediction error compared with external baselines. These results demonstrate a practical route toward reusable, data-efficient mesh-based surrogate modelling for crashworthiness design exploration.

2512.03199 2026-06-19 cs.CV 版本更新

Does Head Pose Correction Improve Biometric Facial Recognition?

姿态校正是否能提升生物特征面部识别?

Justin Norman, Hany Farid

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 研究探讨了AI驱动的头部姿态校正与图像修复对面部识别准确率的影响,发现选择性应用CFR-GAN与CodeFormer可提升识别性能。

详情
AI中文摘要

生物特征面部识别模型在处理现实世界图像时常表现出显著的准确性下降,通常表现为图像质量差、非正面姿态和主体遮挡。我们调查了针对这些挑战的AI驱动头部姿态校正和图像修复是否能提高识别准确率。使用模型无关的大规模法医评估流程,我们评估了三种修复方法:3D重建(NextFace)、2D正面化(CFR-GAN)和特征增强(CodeFormer)。我们发现这些技术的简单应用会显著降低面部识别准确率。然而,我们还发现选择性应用CFR-GAN结合CodeFormer可以带来有意义的提升。

英文摘要

Biometric facial recognition models often demonstrate significant decreases in accuracy when processing real-world images, often characterized by poor quality, non-frontal subject poses, and subject occlusions. We investigate whether targeted, AI-driven, head-pose correction and image restoration can improve recognition accuracy. Using a model-agnostic, large-scale, forensic-evaluation pipeline, we assess the impact of three restoration approaches: 3D reconstruction (NextFace), 2D frontalization (CFR-GAN), and feature enhancement (CodeFormer). We find that naive application of these techniques substantially degrades facial recognition accuracy. However, we also find that selective application of CFR-GAN combined with CodeFormer yields meaningful improvements.

2605.10873 2026-06-19 cs.CV cs.AI 版本更新

CADBench: A Multimodal Benchmark for AI-Assisted CAD Program Generation

CADBench:一个用于AI辅助CAD程序生成的多模态基准

Anna C. Doris, Jacob Thomas Sony, Ghadi Nehme, Era Syla, Amin Heyrani Nobari, Faez Ahmed

发表机构 * Massachusetts Institute of Technology(麻省理工学院)

AI总结 本文提出CADBench,一个统一的多模态CAD程序生成基准,包含18000个样本和六类基准,评估11种视觉语言模型,揭示了CAD程序生成中的三种常见失败模式。

详情
AI中文摘要

从图像或3D观测中恢复可编辑的CAD程序是AI辅助设计的核心,但进展难以衡量,因为现有评估分散在数据集、模态和指标上。我们引入CADBench,一个统一的多模态CAD程序生成基准。CADBench包含18000个评估样本,涵盖来自DeepCAD、Fusion 360、ABC、MCB和Objaverse的六个基准家族,五种输入模态包括干净的网格、噪声网格、单视图渲染、逼真渲染和多视图渲染,以及六个指标,涵盖几何保真度、可执行性和程序紧凑性。STEP-based家族按B-rep面数分层,所有家族均进行多样性采样,以支持在复杂性和物体变化方面的受控分析。我们评估了11种CAD专用和通用的视觉语言系统,生成超过140万个CAD程序。在理想输入下,专用的网格到CAD模型显著优于代码生成VLMs,后者仍远未可靠。CADBench进一步揭示了三种常见的失败模式:几何复杂性增加时重建质量下降,CAD专用模型在模态转移下可能变得脆弱,且模型排名在不同指标下会变化。这些结果将CADBench定位为衡量可编辑3D重建和多模态CAD理解进展的诊断测试平台。该基准在https://huggingface.co/datasets/DeCoDELab/CADBench上公开可用。

英文摘要

Recovering editable CAD programs from images or 3D observations is central to AI-assisted design, but progress is difficult to measure because existing evaluations are fragmented across datasets, modalities, and metrics. We introduce CADBench, a unified benchmark for multimodal CAD program generation. CADBench contains 18,000 evaluation samples spanning six benchmark families derived from DeepCAD, Fusion 360, ABC, MCB, and Objaverse; five input modalities including clean meshes, noisy meshes, single-view renders, photorealistic renders, and multi-view renders; and six metrics covering geometric fidelity, executability, and program compactness. STEP-based families are stratified by B-rep face count and all families are diversity-sampled to support controlled analysis across complexity and object variation. We benchmark eleven CAD-specialized and general-purpose vision-language systems, generating more than 1.4 million CAD programs. Under idealized inputs, specialized mesh-to-CAD models substantially outperform code-generating VLMs, which remain far from reliable CAD program reconstruction. CADBench further reveals three recurring failure modes: reconstruction quality degrades with geometric complexity, CAD-specialized models can be brittle under modality shift, and model rankings change across metrics. Together, these results position CADBench as a diagnostic testbed for measuring progress in editable 3D reconstruction and multimodal CAD understanding. The benchmark is publicly available at https://github.com/anniedoris/CADBench.

2605.09609 2026-06-19 cs.LG math.AG 版本更新

Minimal Filling Architectures of Polynomial Neural Networks: Counterexamples, Frontier Search, and Defects

多项式神经网络的最小填充架构:反例、前沿搜索与缺陷

Kevin Dao, Jose Israel Rodriguez

发表机构 * Department of Mathematics, University of Wisconsin-Madison, Wisconsin, USA(威斯康星大学麦迪逊分校数学系)

AI总结 本文通过前沿搜索和符号计算验证了多项式神经网络的最小单峰猜想反例,揭示了部分子架构存在较大缺陷,与以往小缺陷现象形成对比。

详情
AI中文摘要

我们为具有幂激活函数的多项式神经网络(PNNs)提供了最小单峰猜想的反例。在固定输入和输出宽度的情况下,该猜想声称任何最小填充架构的隐藏层宽度都是单峰的。我们通过前沿搜索找到反例,并通过递归维度界限和符号计算进行了认证。值得注意的是,该反例的几个子架构表现出较大的缺陷,这与以往示例中普遍观察到的小缺陷行为形成对比。

英文摘要

We provide counterexamples to the unimodal minimal filling architecture conjecture for polynomial neural networks (PNNs) with power activation functions. Fixing the input and output widths, the conjecture states that any minimal filling architecture has unimodal widths for the hidden layers. We found counterexamples via a frontier search, recursive dimension bounds on neurovarieties, and symbolic computation. Notably, several subarchitectures of our main example exhibit large defect, in contrast with the predominantly small-defect behavior observed in prior literature.

2605.09383 2026-06-19 cs.RO Updated version

Safety-Critical LiDAR-Inertial Odometry with On-Manifold Deterministic Protection Level

Yueqi Zhu, Yan Pan, Chufan Rui, Jiasheng Luo, Shihua Li, Bo Zhou

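Affiliations * School of Automation, Southeast University; Key Laboratory of Measurement and Control of CSE, Ministry of Education

AI summary This paper proposes a safety-critical LiDAR-inertial odometry system that provides deterministic protection levels through on-manifold deterministic state estimation, to improve navigation safety for mobile robots in safety-critical scenarios.

Details
AI abstract (translated from Chinese)

In safety-critical scenarios, the protection level of an autonomous navigation system is essential for mobile robots to carry out tasks safely. However, existing studies of probabilistic navigation systems for robots usually evaluate accuracy offline on limited datasets and assume that the results carry over to unknown real-world environments. As a result, current autonomous mobile robots often lack protection levels for online safety assessment. To fill this gap, we propose a safety-critical LiDAR-inertial odometry (LIO) system that provides deterministic protection levels based on on-manifold deterministic state estimation. Under an unknown-but-bounded assumption, we derive a concise closed-form relationship between point cloud noise and the estimation uncertainty of the iterative closest point algorithm. Using this relationship, we design an on-manifold ellipsoidal set-membership filter and implement it in the LIO system. Exploiting the properties of the set-membership filter, our system takes the feasible set of the estimated position as the deterministic protection level, which serves as a safety reference for the robot's downstream autonomous operations. Experimental results show that our system provides effective deterministic online safety references for different robots in a variety of environments.

English abstract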

In safety-critical scenarios, the protection level of the autonomous navigation system is crucial for enabling mobile robots to perform tasks safely. However, existing studies on probabilistic navigation systems for robots usually perform offline accuracy evaluations using limited datasets and assume that the results can be applied to unknown real-world environments. As a result, current autonomous mobile robots often lack protection levels for online safety assessment. To fill this gap, we propose a safety-critical LiDAR-inertial odometry (LIO) system that provides deterministic protection levels based on on-manifold deterministic state estimation. By adopting the unknown-but-bounded assumption, we derive a neat closed-form relationship between point cloud noise and the uncertainty of the estimate from the iterative closest point algorithm. Using this relationship, we design an on-manifold ellipsoidal set-membership filter and implement it within the LIO system. Leveraging the properties of the set-membership filter, our system offers the feasible sets of the estimated locations as the deterministic protection levels, serving as safety references for the robots' downstream autonomous operations. The experimental results show that our system can provide effective deterministic online safety references for diverse robots in various environments.
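
To make the idea of a feasible set acting as a deterministic bound more concrete, here is a minimal Python sketch for a 2D position ellipsoid {x : (x - c)^T P^{-1} (x - c) <= 1}. This is not the paper's on-manifold filter: the function names, the 2D restriction, and the choice of the largest semi-axis as a scalar summary of the bound are all assumptions made for illustration.

```python
import math


def ellipsoid_contains(center, shape, point):
    """Check whether point lies in {x : (x-c)^T P^{-1} (x-c) <= 1}.

    shape is a symmetric positive definite 2x2 matrix P given as
    ((p11, p12), (p12, p22)). Hypothetical helper, 2D only.
    """
    (p11, p12), (_, p22) = shape
    det = p11 * p22 - p12 * p12
    dx = point[0] - center[0]
    dy = point[1] - center[1]
    # Quadratic form with the closed-form inverse of a 2x2 matrix.
    q = (p22 * dx * dx - 2.0 * p12 * dx * dy + p11 * dy * dy) / det
    return q <= 1.0


def bounding_radius(shape):
    """Largest semi-axis of the ellipsoid, sqrt(lambda_max(P)).

    Assumption: used here as a scalar stand-in for a protection level.
    """
    (p11, p12), (_, p22) = shape
    half_trace = 0.5 * (p11 + p22)
    det = p11 * p22 - p12 * p12
    lam_max = half_trace + math.sqrt(half_trace * half_trace - det)
    return math.sqrt(lam_max)


P = ((4.0, 0.0), (0.0, 1.0))   # semi-axes 2 and 1
c = (0.0, 0.0)
print(ellipsoid_contains(c, P, (1.9, 0.0)))   # inside along the long axis
print(ellipsoid_contains(c, P, (0.0, 1.1)))   # outside along the short axis
print(bounding_radius(P))
```

Under the unknown-but-bounded assumption, the true position is guaranteed to lie inside the feasible set, so any radius that encloses the set bounds the position error without a probabilistic confidence level.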

2605.08525 2026-06-19 cs.RO cs.SY eess.SY Updated version

Model-Reference Adaptive Flight Control of a 95-mg Insect-Scale Flapping-Wing Aerial Robot

Francisco M. F. R. Gonçalves, Conor K. Trygstad, Néstor O. Pérez-Arancibia

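Affiliations * Washington State University

AI summary To address parameter uncertainty and disturbances in insect-scale flapping-wing aerial robots, this paper proposes a model-reference adaptive control (MRAC) architecture combined with a hybrid multiplicative extended Kalman filter to achieve high-precision position control, and validates hovering and trajectory-tracking performance in experiments with a 95-mg robot.

Comments Under review, 8 pages, 7 figures

Details
AI abstract (translated from Chinese)

Because of the system's scale and complex fabrication, the model describing the dynamics of a flapping-wing insect-scale aerial robot is subject to parameter uncertainty, for example in the inertia matrix and the actuator mapping of the flier. Moreover, because of its low inertia, this kind of robot is strongly affected in flight by stochastic and systematic disturbances, including power-wire tension, gusts, and undesired aerodynamic forces produced by wing misalignment. High-performance execution of complex maneuvers at the subdecigram scale therefore requires the robot to adapt its behavior to counteract disturbances and model uncertainty. To this end, we introduce a model-reference adaptive control (MRAC) architecture for high-performance position control of flapping-wing robotic insects that can be modeled as rigid bodies in three-dimensional space. We also show how running a hybrid multiplicative extended Kalman filter in flight to estimate the current and desired angular velocities significantly dampens attitude vibrations, especially along the roll and pitch degrees of freedom, and improves flight performance. To demonstrate the suitability, functionality, and high performance of the proposed approach, we conducted real-time hovering and trajectory-tracking six-degree-of-freedom flight control experiments with a 95-mg insect-scale aerial robot.

English abstract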

Due to the system's scale and complex fabrication, the model describing the dynamics of a flapping-wing insect-scale aerial robot is subject to parameter uncertainty; for example, in the inertia matrix and the actuator mapping of the flier. Furthermore, due to its low inertia, this type of robot is greatly affected by stochastic and systematic disturbances during flight, including power-wire tension, gusts, and undesired aerodynamic forces produced by wing misalignment. Therefore, the high-performance execution of complex maneuvers at the subdecigram scale requires the robot to adapt its behavior to counteract disturbances and model uncertainty. Toward this objective, we introduce a model-reference adaptive control (MRAC) architecture for high-performance position control of flapping-wing robotic insects that can be modeled as rigid bodies in the three-dimensional (3D) space. In addition, we demonstrate how the implementation of a hybrid multiplicative extended Kálmán filter for estimating current and desired angular velocities during flight significantly dampens attitude vibrations, especially along the roll and pitch degrees of freedom (DOFs), and also improves flight performance. To show the suitability, functionality, and high performance of the proposed approach, we conducted real-time hovering and trajectory-tracking 6-DOF flight control experiments with a 95-mg insect-scale aerial robot.
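
For readers unfamiliar with MRAC, the sketch below shows the textbook scalar version of the idea: a plant with unknown parameters is driven to follow a reference model by adapting controller gains from the tracking error. This is not the authors' 3D rigid-body controller. The first-order plant, the Lyapunov-based adaptation law, and every parameter value are assumptions chosen only to illustrate the principle.

```python
def simulate_mrac(a=1.0, b=2.0, a_m=2.0, gamma=5.0, r=1.0,
                  dt=1e-3, t_end=20.0):
    """Euler simulation of scalar MRAC (illustrative, hypothetical values).

    Plant:      x'   = a*x + b*u        (a, b unknown to the controller)
    Reference:  x_m' = -a_m*x_m + a_m*r
    Control:    u    = k_x*x + k_r*r
    Adaptation: k_x' = -gamma*e*x*sign(b),  k_r' = -gamma*e*r*sign(b)
    with tracking error e = x - x_m. Only the sign of b is assumed known.
    """
    sign_b = 1.0 if b > 0 else -1.0
    x = x_m = 0.0
    k_x = k_r = 0.0
    for _ in range(round(t_end / dt)):
        e = x - x_m
        u = k_x * x + k_r * r
        # Evaluate all derivatives at the current state before updating.
        dx = a * x + b * u
        dx_m = -a_m * x_m + a_m * r
        dk_x = -gamma * e * x * sign_b
        dk_r = -gamma * e * r * sign_b
        x += dt * dx
        x_m += dt * dx_m
        k_x += dt * dk_x
        k_r += dt * dk_r
    return x, x_m, k_x, k_r


x, x_m, k_x, k_r = simulate_mrac()
print(f"plant state {x:.3f}, reference state {x_m:.3f}")
print(f"adapted gains k_x={k_x:.3f}, k_r={k_r:.3f}")
```

The open-loop plant in this example is unstable (a > 0), yet the adapted gains drive the tracking error toward zero. With a constant reference the gains themselves need not converge to their ideal values; only the error is guaranteed to vanish.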

2605.07821 2026-06-19 cs.CV cs.AI Updated version

Mitigating Simplicity Bias in OOD Detection through Object Co-occurrence Analysis

Boyang Dai, Chaoqi Chen, Yizhou Yu

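Affiliations * The University of Hong Kong; Shenzhen University; Shenzhen Loop Area Institute

AI summary Proposes an OOD detection framework based on object co-occurrence that distinguishes near-OOD samples through disentangled representations and a divide-and-conquer strategy, mitigating simplicity bias and achieving competitive results across a range of settings.

Comments Accepted to CVPR 2026

Details
AI abstract (translated from Chinese)

Out-of-distribution (OOD) detection is crucial for ensuring the reliability of deep learning models. Most existing methods rely on regular entangled representations to separate in-distribution (ID) from OOD data and overlook the rich contextual information in images. This is especially problematic for near-OOD detection, because models with simplicity bias struggle to learn discriminative features in disentangled representations. The human visual system can use the co-occurrence of objects in natural environments to support scene understanding. Inspired by this, we propose an object-centric OOD detection framework that learns to capture object co-occurrence (OCO) patterns in images. The method introduces a new OOD detection paradigm: it predicts disentangled representations of the test sample to understand object co-occurrence in the image, adaptively assigns the patterns to one of three scenarios based on the object co-occurrence patterns observed in the ID training data, and then performs OOD detection in a divide-and-conquer manner. In this way, OCO can distinguish near-OOD samples by taking into account the semantic contextual relationships present in the image, avoiding the tendency to attend only to simple, easily learned regions. We evaluate OCO in challenging and full-spectrum OOD settings, showing competitive results and confirming its ability to handle both semantic and covariate shifts. Code is released at https://github.com/Michael-McQueen/OCO.

English abstract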

Out-of-distribution (OOD) detection is crucial for ensuring the reliability of deep learning models. Existing methods mostly focus on regular entangled representations to discriminate in-distribution (ID) and OOD data, neglecting the rich contextual information within images. This issue is particularly challenging for detecting near-OOD, as models with simplicity bias struggle to learn discriminative features in disentangled representations. The human visual system can use the co-occurrence of objects in the natural environment to facilitate scene understanding. Inspired by this, we propose an Object-Centric OOD detection framework that learns to capture Object CO-occurrence (OCO) patterns within images. The proposed method introduces a new OOD detection paradigm that understands object co-occurrence within an image by predicting disentangled representations for the test sample, then adaptively divides patterns into three scenarios based on object co-occurrence patterns observed in ID training data, and finally performs OOD detection in a divide-and-conquer manner. By doing so, OCO can distinguish near-OOD by considering the semantic contextual relationships present in their images, avoiding the tendency to focus solely on simple, easily learnable regions. We evaluate OCO through experiments across challenging and full-spectrum OOD settings, demonstrating competitive results and confirming its ability to address both semantic and covariate shifts. Code is released at https://github.com/Michael-McQueen/OCO.
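
As a rough illustration of why object co-occurrence carries an OOD signal, the toy Python sketch below records which object pairs appear together in ID training images and scores a test image by the fraction of its pairs never seen in training. This is not the OCO method: it assumes object labels are already given, and it omits the disentangled representations and the three-scenario split described in the abstract. All names and data are hypothetical.

```python
from itertools import combinations


def build_seen_pairs(id_images):
    """Collect every unordered object pair that co-occurs in ID images."""
    seen = set()
    for objects in id_images:
        # Sort so each unordered pair has one canonical representation.
        seen.update(combinations(sorted(set(objects)), 2))
    return seen


def unseen_pair_fraction(objects, seen_pairs):
    """Fraction of the image's object pairs absent from the ID data.

    Returns 0.0 for images with fewer than two distinct objects, since
    they carry no co-occurrence evidence in this toy setup.
    """
    pairs = list(combinations(sorted(set(objects)), 2))
    if not pairs:
        return 0.0
    unseen = sum(1 for pair in pairs if pair not in seen_pairs)
    return unseen / len(pairs)


# Hypothetical ID training annotations: one object list per image.
id_images = [
    ["dog", "leash", "person"],
    ["person", "bicycle"],
]
seen = build_seen_pairs(id_images)

print(unseen_pair_fraction(["dog", "person"], seen))    # familiar pairing
print(unseen_pair_fraction(["dog", "bicycle"], seen))   # never seen together
```

A higher score means the scene combines objects in ways the ID data never did, which is the kind of contextual cue that a model focused only on simple, easily learned regions would miss.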