arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.06219 2026-06-05 cs.RO cs.AI

CLEAR: Cognition and Latent Evaluation for Adaptive Routing in End-to-End Autonomous Driving

Yining Xing, Zehong Ke, Zhiyuan Liu, Yanbo Jiang, Wenhao Yu, Jianqiang Wang

Affiliations: arXiv.org cs.RO (Robotics)

AI summary: Proposes the CLEAR framework, which replaces the multi-step denoising of diffusion models with a single-step conditional drift and combines the Drive-JEPA visual encoder with a fine-tuned Qwen 3.5 0.8B for semantic reasoning, enabling efficient multi-modal planning and reaching 93.7 PDMS on NAVSIM v1.

Abstract

End-to-end autonomous driving models often struggle to balance multi-modal maneuver generation with real-time inference constraints. While diffusion models successfully capture diverse driving behaviors, their iterative denoising process incurs unacceptable latency for safety-critical deployment. To address this, we propose CLEAR (Cognition and Latent Evaluation for Adaptive Routing), a framework that combines ultra-fast generative planning with deep semantic reasoning. CLEAR employs Drive-JEPA as the visual encoder and replaces the multi-step denoising chain with a single-step conditional drift in a VAE latent space, introducing a conditioning coefficient to balance diversity and expert precision. Meanwhile, we fully fine-tune Qwen 3.5 0.8B on driving QA pairs to extract scene-aware hidden states. These states guide both an Adaptive Scheduler, which selects the conditioning coefficient $\alpha$ and sample count $N$ from a discrete set of predefined schemes, and a cross-attention scorer that selects the optimal trajectory from candidates. On the NAVSIM v1 benchmark, CLEAR achieves a state-of-the-art PDMS of 93.7. Our results demonstrate that high-fidelity, multi-modal planning can be executed efficiently without dense geometric annotations or iterative sampling.

2606.06218 2026-06-05 cs.RO cs.AI

TAM: Torque Adaptation Module for Robust Motion Transfer in Manipulation

Dongwon Son, Florian Shkurti, Jason Lee, Naman Shah, Beomjoon Kim, Dieter Fox

Affiliations: KAIST; Allen Institute for AI; University of Toronto; University of Washington

AI summary: Proposes the Torque Adaptation Module (TAM), which corrects torque commands using a history encoder and a torque adaptor, enabling motion transfer across robots or payloads without training the policy under domain randomization or re-collecting data.

Abstract

A policy tuned for one robot often behaves differently on another, whether due to the sim-to-real gap, unknown payloads, or the differing dynamics of two instances of the same robot. In contact-rich, dynamic manipulation, even small motion discrepancies can result in failure to track reference motion, since they disrupt the timing and modes of contact. Common remedies, such as domain randomization or system identification, either produce overly conservative task policies or require data that must be recollected for each robot or payload. We introduce the Torque Adaptation Module (TAM), a learned module that adapts the torque commands sent to the robot to match the behavior of an ideal robot. TAM operates between the low-level controller that tracks the policy's actions and the robot's torque interface. It includes a history encoder that embeds proprioceptive history into a latent state and a torque adaptor that computes residual torque corrections. Because TAM depends only on proprioceptive history and not on policy observations or the action space, the same TAM weights can be reused to adapt policies with different action spaces (joint targets, end-effector targets, or direct torques). The policies themselves do not need to be trained with domain randomization of robot parameters. Instead, we offload the need for domain randomization to TAM by training it entirely in randomized simulation, using multi-robot pretraining followed by a robot-specific fine-tuning step that still requires no real-robot data. We evaluate TAM zero-shot on a real Franka Panda robot across dynamic manipulation tasks that include a vision-based box pushing policy (from RL), a flip policy (from BC), and MPC ball-on-plate balancing. Our experiments show that TAM improves zero-shot real-robot execution compared to online system identification and RMA baselines and enables robust dynamic manipulation performance.

2606.06217 2026-06-05 cs.CV cs.AI

DisasterBench: A Multimodal Benchmark for UAV-Based Disaster Response in Complex Environments

Tan Zhang, Quanyou Li, Lu Zhang, Jun Liu, Xiaofeng Zhu, Ping Hu

Affiliations: University of Electronic Science and Technology of China

AI summary: Proposes DisasterBench, a multimodal benchmark covering 14 disaster scene types and 9 response tasks, and designs DisasterVL, a lightweight model optimized with a three-stage pipeline for efficient inference on edge devices.

Abstract

When a disaster unfolds, responders must answer not only what is happening, but also why it is happening, what will happen next, and what to do now, often from noisy low-altitude UAV views and under tight on-site compute constraints. However, most existing multimodal benchmarks emphasize perception (e.g., recognition/description), cover limited disaster types, and provide insufficient support for the multi-stage reasoning required in practical emergency response. We introduce DisasterBench, a multi-stage multimodal reasoning benchmark for UAV-based disaster response in complex environments. DisasterBench spans 14 disaster-related scene types and 9 response-critical tasks across pre-, during-, and post-disaster stages, with fine-grained disaster-task mappings that explicitly test causal attribution, propagation prediction, damage analysis, and decision-oriented reasoning. To enable reasoning on the edge, we further propose DisasterVL, a lightweight multimodal model optimized with a three-stage pipeline combining domain instruction tuning, chain-of-thought-guided multimodal alignment, and reinforcement learning-based policy optimization. Experiments across 21 popular MLLMs show that our 2B-parameter DisasterVL outperforms all evaluated open-source models and substantially narrows the gap to state-of-the-art closed-source models, achieving GPT-4o-comparable reasoning accuracy with superior efficiency. The project page is available at https://github.com/TanmouTT/DisasterBench.

2606.06211 2026-06-05 cs.CL cs.SD eess.AS

FiLM-Based Speaker Conditioning of a SpeechLLM for Pathological Speech Recognition

Fernando López, Santosh Kesiraju, Jordi Luque

Affiliations: Telefónica Innovación Digital, Spain; Universidad Autónoma de Madrid, Spain; Brno University of Technology, Czech Republic

AI summary: Proposes injecting x-vector speaker information into every Transformer layer of a frozen ASR encoder via Feature-wise Linear Modulation (FiLM), adapting to pathological speakers without modifying the base model weights, with recognition performance competitive with established adaptation strategies while retaining performance on non-conditioned speech.

Comments
Accepted in Odyssey 2026: The Speaker and Language Recognition Workshop

Abstract

Automatic speech recognition (ASR) has advanced remarkably for standard speech; however, pathological speech from neurological conditions remains a significant challenge. We investigate speaker conditioning via Feature-wise Linear Modulation (FiLM), injecting x-vector-derived information into each transformer layer of a frozen ASR encoder to adapt internal representations to individual pathological speakers without modifying base model weights. We benchmark this for the ASR task against standard and parameter-efficient fine-tuning baselines, complemented by post-processing, on Spanish and English pathological speech. Additionally, we evaluate if the adapted model preserves the ability to answer speech-related questions. Results show that speaker-conditioned ASR is competitive with established adaptation strategies while retaining performance on non-conditioned speech.
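
The FiLM operation named in the abstract can be illustrated with a minimal sketch. FiLM scales and shifts each hidden feature with a gain and bias predicted from a conditioning vector, here standing in for the x-vector. The projection weights, the toy dimensions, and the helper names `linear` and `film` are all assumptions made for illustration and are not taken from the paper.

```python
def linear(weights, bias, x):
    """Dense layer on plain lists: returns weights @ x + bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def film(hidden, speaker_vec, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise Linear Modulation: gamma * hidden + beta, where gamma and
    beta are predicted from the speaker embedding (hypothetical projections)."""
    gamma = linear(w_gamma, b_gamma, speaker_vec)
    beta = linear(w_beta, b_beta, speaker_vec)
    return [g * h + b for g, h, b in zip(gamma, hidden, beta)]


# Hypothetical toy sizes: 2-dim speaker vector, 3-dim hidden state.
hidden = [1.0, -2.0, 0.5]
zeros = [[0.0, 0.0] for _ in range(3)]
# With zero projection weights, gamma = 1 and beta = 0, so FiLM is the identity.
identity_out = film(hidden, [0.0, 0.0], zeros, [1.0] * 3, zeros, [0.0] * 3)
```

Initializing the modulation to the identity, as in the toy call, is one common way to leave a frozen encoder's behavior unchanged at the start of adaptation; whether the paper does this is not stated in the abstract.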

2606.06207 2026-06-05 cs.AI cs.LG

Unsupervised Pattern Analysis in Japanese Veterinary Toxicology: A Regulatory-Compliant Framework for Cross-Species Risk Assessment

Yukiko Kawakami, Mohammad Shirazi, Ryo Shimizuwa, Saito Shinoda, Alireza Mortazavi, Matsumoto Kawahara

Affiliations: Graduate School of Information Sciences, Tohoku University

AI summary: Proposes a regulatory-integrated unsupervised framework that clusters adverse drug events from the NVAL database and identifies biologically meaningful cross-species toxicity patterns.

Comments
Submitted to IEEE Transactions on Biomedical Engineering

Abstract

Veterinary pharmacovigilance systems are essential for monitoring adverse drug events (ADEs), yet existing approaches often fail to capture region-specific toxicity patterns shaped by local biological and regulatory contexts. In Japan, these challenges are amplified by species-specific metabolic differences and reporting practices defined by the Ministry of Agriculture, Forestry, and Fisheries (MAFF). Most prior work relies on prediction-oriented models, limiting mechanistic interpretability. This study proposes a regulatory-integrated unsupervised framework for pattern discovery using the National Veterinary Assay Laboratory (NVAL) database. ADEs are encoded into organ system-aligned representations and adjusted for species-specific reporting biases, enabling cross-species comparison. Similarity-based clustering and dimensionality reduction are applied to identify latent toxicity structures. Analysis of 4,120 high-confidence ADE reports (9,080 drug-ADE combinations) identified three significant species clusters (p < 0.01), including hepatic-dominant patterns in companion animals (0.42 $\pm$ 0.06), renal toxicity in ruminants (0.39 $\pm$ 0.07), and dermatological sensitivity in sheep (0.35 $\pm$ 0.07). Drug-level clustering achieved 83% alignment with pharmacological classes, while cosine similarity outperformed alternative metrics (silhouette score: 0.48; cluster precision: 87%). Regulatory validation showed strong agreement with established classifications. These findings demonstrate that regulation-aligned unsupervised analysis can uncover biologically meaningful, region-specific toxicity patterns, providing an interpretable and scalable framework for veterinary drug safety assessment.
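
The abstract reports that cosine similarity outperformed alternative metrics for clustering. A minimal sketch of the measure follows, applied to two hypothetical organ-system profiles; the vector layout and the values are invented for illustration and do not come from the NVAL data.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical ADE profiles: fraction of reports per organ system
# (hepatic, renal, dermatological) for two drugs.
drug_a = [0.6, 0.3, 0.1]
drug_b = [0.3, 0.15, 0.05]  # same proportions as drug_a at half the magnitude
similarity = cosine_similarity(drug_a, drug_b)
```

Because cosine similarity ignores vector magnitude, two drugs with the same proportional organ-system profile score close to 1 even when their report volumes differ, which may be one reason it suits data with uneven reporting rates.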

2606.06205 2026-06-05 cs.LG

Non-Negative Matrix Factorization for Event Data

Raphaël Romero

Affiliations: Ghent University

AI summary: Proposes EventNMF, a continuous-time non-negative matrix factorization method that models event times directly, factorizing the intensity function through a B-spline basis to avoid the information loss of binning, and shows that standard binned approaches are a special case.

Abstract

Continuous-time event data, in which entities emit instantaneous events over time, arises naturally across many domains such as neuroscience, seismology, and social networks. Non-negative matrix factorization (NMF) is a natural tool to uncover interpretable structure in such data, but it has so far only been applied after binning or smoothing the entity-level counting measures. This preprocessing step comes with the risk of erasing entity-level heterogeneities and fine-grained temporal features. In this paper, we introduce EventNMF, a continuous-time non-negative factorization model that operates directly on event times: each entity's events are modeled as a Poisson process whose intensity factorizes through a non-negative B-spline basis, and a simple estimation procedure recovers interpretable temporal templates shared across entities. The resulting method is mathematically principled, easy to implement, and computationally efficient. We further show that standard binned-count approaches arise as the special case of degree-zero splines, explore bias-variance tradeoffs and compare against existing methods on a synthetic latent factor model, and demonstrate the effectiveness of EventNMF on several real-world applications.
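
The claim that binned-count approaches arise from degree-zero splines can be checked numerically in a small sketch. With a piecewise-constant intensity, the Poisson-process log-likelihood (the sum of log-intensities at the events minus the integrated intensity) depends on the event times only through the per-bin counts. The event times, rates, and function names below are hypothetical, and the sketch covers a single entity with no factorization.

```python
import math


def loglik_from_events(event_times, rates, bin_width):
    """Poisson-process log-likelihood for a piecewise-constant intensity,
    evaluated directly on the event times."""
    last = len(rates) - 1
    log_term = sum(math.log(rates[min(int(t / bin_width), last)]) for t in event_times)
    return log_term - bin_width * sum(rates)


def loglik_from_counts(counts, rates, bin_width):
    """The same log-likelihood computed from per-bin event counts only."""
    return sum(c * math.log(r) for c, r in zip(counts, rates)) - bin_width * sum(rates)


events = [0.1, 0.4, 1.2, 2.5, 2.7, 2.9]  # hypothetical event times on [0, 3)
rates = [2.0, 1.0, 3.0]                   # hypothetical intensity in each unit-width bin
counts = [2, 1, 3]                        # number of events falling in each bin

ll_events = loglik_from_events(events, rates, 1.0)
ll_counts = loglik_from_counts(counts, rates, 1.0)
```

The two values agree up to floating-point rounding. A higher-degree spline basis would make the log-intensity vary within a bin, so the exact event times would then carry information that the counts discard.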

2606.06203 2026-06-05 cs.CL cs.AI

Dense Contexts Are Hard Contexts: Lexical Density Limits Effective Context in LLMs

Giovanni Dettori, Matteo Boffa, Danilo Giordano, Idilio Drago, Marco Mellia

Affiliations: Department of Computer Science, Politecnico di Torino; Department of Computer Science, University of Turin

AI summary: Using three "find-the-needle" style benchmarks, the paper finds that lexical density (the rate at which a context introduces distinct information) is a third factor, alongside input length and the position of relevant information, that systematically reduces the effective context window of LLMs, and shows that reducing density generally restores performance.

Comments
20 pages, 6 figures

Abstract

Input length and the position of relevant information are widely cited as the primary causes of degraded LLM long-context performance. Here, we study lexical density -- the rate at which a context introduces distinct information -- as a third, largely overlooked factor that systematically reduces the effective context window of LLMs. We quantify the impact of lexical density on open-weight LLMs (9B-685B) using three "find-the-needle" style benchmarks with identical length (~12k tokens) and controlled needle position, but increasing density of information. We observe a sharp performance collapse in higher-density benchmarks: models that are near-perfect in sparse contexts drop below 60% retrieval score on denser ones. To rule out task-type confounds, we vary and control the density within each benchmark while keeping all other properties unchanged. Reducing density generally restores performance, especially in the high-density regimes where degradation appears. These results show that effective context capacity is a function of lexical density, with direct implications for real-world LLM systems operating on compact, information-rich inputs.
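
The abstract does not say how lexical density is measured. As a rough stand-in, the sketch below uses the ratio of distinct whitespace tokens to total tokens (a type-token ratio). This proxy and the function name are assumptions for illustration, not the paper's definition.

```python
def distinct_token_ratio(text):
    """Fraction of whitespace-separated tokens that are distinct (case-insensitive).
    A crude proxy for how quickly a context introduces new information."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


sparse = "the cat sat the cat sat the cat sat"                 # 3 distinct of 9 tokens
dense = "alpha beta gamma delta epsilon zeta eta theta iota"   # 9 distinct of 9 tokens

sparse_ratio = distinct_token_ratio(sparse)
dense_ratio = distinct_token_ratio(dense)
```

Two contexts of identical length can differ sharply on such a measure, which is the kind of length-controlled comparison the abstract describes.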

2606.06200 2026-06-05 cs.SD eess.AS

Learning Emotion-discriminative Representations for Zero-Shot Cross-lingual Speech Emotion Recognition

Jinyi Mi, Ding Ma, Tomoki Toda

Affiliations: Graduate School of Informatics, Nagoya University, Japan; Information Technology Center, Nagoya University, Japan

AI summary: To address cross-language distribution mismatch and the lack of emotion annotations in the target language in zero-shot cross-lingual speech emotion recognition, proposes an emotion-discriminative representation learning method that combines supervised contrastive learning with speaker adversarial learning, significantly improving recognition performance.

Comments
Accepted to Interspeech 2026

Abstract

Zero-shot cross-lingual speech emotion recognition (SER) remains challenging due to distribution mismatches across languages and the lack of emotion annotations in the target language. Under such conditions, models trained solely on source-language data frequently suffer from degraded generalization when evaluated on unseen target languages. To address this limitation, we propose an emotion-discriminative representation learning method that integrates supervised contrastive learning and speaker adversarial learning. The contrastive learning promotes cross-lingual emotion alignment, while speaker adversarial learning suppresses speaker-related cues to encourage speaker-invariant representations. Experimental results under a zero-shot cross-lingual SER setting demonstrate that the proposed method significantly improves SER performance over conventional training strategies.

2606.06199 2026-06-05 cs.CV cs.GR

SC-MFJ: A Simple Haptic Quality Metric for Medical Image Segmentation

Souraj Adhikary, Negar Chabi, Andre Mastmeyer

Affiliations: Jade University of Applied Sciences

AI summary: To meet the surface-quality demands of haptic rendering in surgical simulation, proposes the SC-MFJ metric, which measures contact-force jerk along virtual stylus walks and reveals differences in haptic quality that geometric metrics cannot detect.

Comments
11 pages, 5 figures, 5 tables, http://www.wscg.eu/

Abstract

Standard segmentation metrics such as Dice and Hausdorff distance measure geometric overlap but say nothing about whether a segmented surface is suitable for haptic rendering in surgical simulation. We propose SC-MFJ (Surface-Constrained Mean Force Jerk), a simple, inexpensive metric that samples a segmented organ surface with many short virtual stylus walks and measures how jerky the resulting contact forces are. The metric is computed from existing segmentation outputs and uses roughly one minute of CPU time per case. We evaluate three pancreas CT segmentation approaches (binary nnU-Net output, Gaussian-smoothed output, and learned signed distance function (SDF) regression) across 80 cases in five-fold cross-validation. SC-MFJ reveals a 147x gap in haptic quality between the raw binary baseline and simple Gaussian post-processing, a difference entirely invisible to Dice and HD95. It also shows that learned SDF regression, despite requiring full model retraining, produces more variable haptic quality than Gaussian smoothing, with a case-level standard deviation of 168 N/s² compared with 22 N/s² for Gaussian. A second evaluation on the LiTS liver dataset (131 cases) confirms the generality of these findings: the binary-to-Gaussian gap widens to 189x, and Gaussian smoothing again produces consistently low force jerk across all folds. Our results suggest that for haptic simulation applications, a one-line post-processing step may be sufficient, and that a cheap metric like SC-MFJ can flag problems that geometric metrics miss.
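
The abstract does not give the exact formula for SC-MFJ. As a rough sketch of the idea, the function below computes the mean absolute second finite difference of a force trace sampled along one stylus walk, a choice made only because its units match the N/s² reported above. The traces, the time step, and the function name are hypothetical.

```python
def mean_force_jerk(forces, dt):
    """Mean absolute second finite difference of a force trace, in N/s^2.
    An assumed stand-in for the paper's force-jerk measure."""
    if len(forces) < 3:
        return 0.0
    second_diffs = [
        abs(forces[i + 1] - 2.0 * forces[i] + forces[i - 1]) / (dt * dt)
        for i in range(1, len(forces) - 1)
    ]
    return sum(second_diffs) / len(second_diffs)


smooth = [0.0, 1.0, 2.0, 3.0, 4.0]   # force rising linearly along the walk
stepped = [0.0, 0.0, 4.0, 4.0, 4.0]  # force jumping as on a voxel staircase

smooth_score = mean_force_jerk(smooth, 1.0)
stepped_score = mean_force_jerk(stepped, 1.0)
```

Both traces cover the same force range, yet only the stepped one scores above zero, which mirrors how a staircase-like binary surface could match a smooth one on overlap metrics while feeling rough to a haptic device.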

2606.06197 2026-06-05 cs.CL cs.AI

Improving Answer Extraction in Context-based Question Answering Systems Using LLMs

Hafez Abdelghaffar, Ahmed Alansary, Ali Hamdi

Affiliations: University of Science and Technology of China

AI summary: To address inaccurate answer extraction by existing QA systems on complex or ambiguous queries, proposes fine-tuning pre-trained large language models, reaching a ROUGE-L of 86.84%, a BLEU of 28.24%, and a BERTScore of 95.38% on SQuAD1.1.

Comments
7 pages, IMSA2026

Abstract

Question answering (QA) systems have achieved notable progress with the advent of large language models (LLMs). However, they still face challenges in accurately extracting and generating precise answers from given contexts, particularly when dealing with complex or ambiguous queries. Existing approaches often struggle with contextual understanding, answer consistency, and generalization across diverse domains. In this work, we propose a question answering system based on large language models, where the input consists of a textual context and a corresponding question, and the output is a concise and accurate answer. The motivation behind this research lies in addressing the limitations of current QA systems, particularly their tendency to produce irrelevant or imprecise responses despite having access to the correct context. Our methodology involves fine-tuning a pre-trained LLM on a benchmark QA dataset to improve its contextual comprehension and answer extraction capabilities. Specifically, we utilize the Stanford Question Answering Dataset (SQuAD1.1), which provides high-quality context-question-answer triplets for supervised training and evaluation. Experimental results show that the fine-tuned RoBERTa-base model achieves the highest performance, attaining a ROUGE-L score of 86.84%, a BLEU score of 28.24%, and a BERTScore of 95.38%. These results indicate strong accuracy and answer relevance, demonstrating the effectiveness of the proposed approach for context-based question answering tasks. Furthermore, the findings confirm that targeted fine-tuning substantially improves the reliability and precision of QA systems.
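
ROUGE-L, the first metric reported above, scores a candidate answer by its longest common subsequence (LCS) with the reference. The sketch below implements a balanced F1 variant on whitespace tokens. The tokenization and the equal weighting of precision and recall are simplifying assumptions, so the numbers will not necessarily match those from the evaluation toolkit used in the paper.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """Balanced ROUGE-L F-score on whitespace tokens."""
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference answer and extracted answer.
score = rouge_l_f1("in the late 1990s", "the late 1990s")
```

Here the LCS covers all three candidate tokens, giving precision 1 and recall 3/4. Short extractive answers that are subsequences of the reference tend to score high on ROUGE-L, which may help explain the gap between the ROUGE-L and BLEU figures reported above.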

2606.06194 2026-06-05 cs.RO cs.CV

ActiveMimic: Egocentric Video Pretraining with Active Perception

Xingyao Lin, Guojin Zhong, Tianyi Lu, Ziyi Ye, Yichen Zhu, Zuxuan Wu, Yu-Gang Jiang

Affiliations: Fudan University; Shanghai Innovation Institute; Current Robotics NeoteAI

AI summary: Proposes ActiveMimic, a framework that recovers synchronized camera and wrist trajectories from egocentric human video, models camera motion as a viewpoint action, and jointly learns active perception and manipulation skills, so that the pretrained model matches models pretrained on robot data on robot tasks.

Comments
Project Page: https://activemimic.github.io/

Abstract

Egocentric human video offers a scalable alternative to robot data for pretraining, yet models pretrained on such video consistently underperform those pretrained on robot data. We attribute this gap to a missing signal, the active perception behavior in egocentric videos, where humans continuously reposition their viewpoint during manipulation, inducing camera motion that standard pipelines treat as noise. To address this, we present ActiveMimic, a pretraining framework that recovers synchronized camera and wrist trajectories from a single body-worn RGB camera, models camera motion as a viewpoint action, and jointly learns active perception and manipulation from in-the-wild egocentric human video before adapting to a target robot. Empirically, real-world experiments across tasks with diverse active perception demands show that ActiveMimic consistently surpasses baselines pretrained on human video and matches state-of-the-art models pretrained on robot data. Further analysis provides evidence that active perception capability originates from egocentric human video pretraining rather than robot-specific fine-tuning, confirming active perception as the key to unlocking egocentric human video for robot pretraining.

2606.06188 2026-06-05 cs.CL

The Tell-Tale Norm: $\ell_2$ Magnitude as a Signal for Reasoning Dynamics in Large Language Models

Jinyang Zhang, Hongxin Ding, Yue Fang, Weibin Liao, Muyang Ye, Junfeng Zhao, Yasha Wang

Affiliations: University of Science and Technology of China

AI summary: Proposes the $\ell_2$ norm of hidden states as an endogenous signal of reasoning intensity in large language models, validates it through sparse autoencoder analysis, theoretical proof, and causal interventions, and introduces three $\ell_2$-norm-based test-time scaling techniques to improve reasoning performance.

Comments
ICML

Abstract

Recent work has sought to understand reasoning in Large Language Models (LLMs), yet a principled, model-intrinsic signal that captures their layer-wise reasoning dynamics remains underexplored. We bridge this gap by demonstrating that the $\ell_2$ norm of hidden states serves as an endogenous signal of the model's reasoning intensity. Using Sparse Autoencoders (SAEs) as a diagnostic probe, we observe that LLMs' internal reasoning is marked by a sharp increase in reasoning feature activations concentrated in late layers. Motivated by this pattern, we establish a formal link between reasoning intensity and the model's latent geometry and theoretically prove that the $\ell_2$ norm of hidden states bounds the activation strength of SAE reasoning features. Empirical correlation analysis and causal interventions further validate the $\ell_2$ norm as a faithful indicator, where heightened norms consistently correspond to critical reasoning steps. We then introduce three test-time scaling techniques guided by $\ell_2$ norms: (i) Adaptive Layer-wise Reasoning Recursion, (ii) Endogenous Reasoning State Steering, and (iii) $\ell_2$-guided Response Selection, which require no additional training or data and are compatible with advanced inference engines. Experiments across model architectures and benchmarks show that $\ell_2$-norm-based techniques significantly improve reasoning performance, offering a principled yet simple lens to perceive and control LLM latent reasoning dynamics. Our code is available at https://github.com/zjy1298/The-Tell-Tale-Norm.
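
The response-selection idea can be sketched in a few lines: compute the $\ell_2$ norm of each hidden state and prefer the candidate whose states have the largest average norm. The aggregation rule (a plain mean), the toy vectors, and the function names are assumptions for illustration. The paper's actual selection criterion is not given in the abstract.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def select_response(candidates):
    """Return the label of the candidate with the highest mean hidden-state norm.
    `candidates` maps a label to a list of hidden-state vectors (hypothetical layout)."""
    def mean_norm(states):
        return sum(l2_norm(h) for h in states) / len(states)
    return max(candidates, key=lambda name: mean_norm(candidates[name]))


candidates = {
    "response_a": [[3.0, 4.0], [0.0, 1.0]],  # norms 5 and 1, mean 3
    "response_b": [[6.0, 8.0], [0.0, 2.0]],  # norms 10 and 2, mean 6
}
chosen = select_response(candidates)
```

In practice the hidden states would come from a model's intermediate layers. The toy dictionary only shows the bookkeeping.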

2606.06186 2026-06-05 cs.CV

Adversarial Attacks Already Tell the Answer: Directional Bias-Guided Test-time Defense for Vision-Language Models

Liangsheng Liu, Si Chen, Jiamin Wu, Weiwei Feng, Zhixin Cheng, Xiaotian Yin, Wenfei Yang, Tianzhu Zhang

Affiliations: University of Science and Technology of China; National Key Laboratory of Deep Space Exploration, Deep Space Exploration Laboratory; The Chinese University of Hong Kong; Zhejiang University; Ant Group

AI summary: Proposes Directional Bias-guided Defense (DBD), which exploits the observation that adversarial samples shift along a dominant direction in CLIP's feature space, estimating the Defense Direction and recovering robust representations with a DB-score-based two-stream reconstruction strategy, and achieves state-of-the-art adversarial robustness on 15 datasets while preserving clean accuracy.

Comments
Accepted by ICLR2026

Abstract

Vision-Language Models (VLMs), such as CLIP, have shown strong zero-shot generalization but remain highly vulnerable to adversarial perturbations, posing serious risks in real-world applications. Test-time defenses for VLMs have recently emerged as a promising and efficient approach to defend against adversarial attacks without requiring costly large-scale retraining. In this work, we uncover a surprising phenomenon: under diverse input transformations, adversarial images in CLIP's feature space consistently shift along a dominant direction, in contrast to the dispersed patterns of clean images. We hypothesize that this dominant shift, termed the Defense Direction, opposes the adversarial shift, pointing features back toward their correct class centers. Building on this insight, we propose Directional Bias-guided Defense (DBD), a test-time framework that estimates the Defense Direction and employs a DB-score-based two-stream reconstruction strategy to recover robust representations. Experiments on 15 datasets demonstrate that DBD not only achieves SOTA adversarial robustness while preserving clean accuracy, but also reveals the counterintuitive result that adversarial accuracy can even surpass clean accuracy. This demonstrates that adversarial perturbations inherently encode directional priors about the true decision boundary.

2606.06178 2026-06-05 cs.LG cs.AI cs.CL

Learning to Route LLMs from Implicit Cost-Performance Preferences via Meta-Learning

Jiahao Zeng, Ming Tang, Ningning Ding

Affiliations: Hong Kong University of Science and Technology (Guangzhou); Southern University of Science and Technology

AI summary: Proposes MetaRouter, a meta-learning framework that learns users' implicit cost-performance preferences from a few interactions to enable personalized LLM routing, outperforming baselines on both in-distribution and out-of-distribution tasks.

Abstract

Large language models (LLMs) present a trade-off between performance and cost, where more powerful models incur greater expense. LLM routing aims to reduce expenses while maintaining performance by sending each query to the most suitable model. However, existing methods adapt poorly to differing user cost-performance preferences. To address this gap, we introduce a novel perceptive LLM routing paradigm for personalized, user-centric cost-performance optimization, which efficiently learns users' implicit preferences from only a few interactions. To handle the challenge of heterogeneous user needs, we formulate preference profiles as a set of distinct tasks in a contextual bandit setting and propose MetaRouter, a meta-learning framework designed for preference-aware LLM routing. Experimental results show that MetaRouter outperforms strong baselines on both in-distribution and out-of-distribution tasks. Furthermore, it exhibits high efficiency in learning user preferences, robustness to changes in the routable LLMs, and scalability to multi-model routing.
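
To make the cost-performance trade-off concrete, the sketch below routes a query to the model that maximizes predicted quality minus a user-specific cost weight times cost. This linear scalarization, the model table, and the `route` function are assumptions for illustration. The abstract does not state MetaRouter's objective, and the cost weight here is supplied by hand, not learned from interactions.

```python
def route(models, cost_weight):
    """Pick the model with the best quality minus cost_weight * cost.
    `cost_weight` stands in for a user's (normally implicit) preference."""
    best = max(models, key=lambda m: m["quality"] - cost_weight * m["cost"])
    return best["name"]


# Hypothetical predicted quality and per-query cost for two routable models.
models = [
    {"name": "small", "quality": 0.70, "cost": 1.0},
    {"name": "large", "quality": 0.90, "cost": 10.0},
]

quality_first = route(models, cost_weight=0.0)    # cost is ignored
budget_minded = route(models, cost_weight=0.05)   # cost is penalized
```

The same query goes to different models as the weight changes, which is why a router that assumes a single fixed trade-off cannot serve users with different preferences.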

2606.06177 2026-06-05 cs.CL cs.HC

Ouvia: A User-centered Framework for Measuring Usability of Speech Translation in Real-World Communication Scenarios

Giuseppe Attanasio, Beatrice Savoldi, Daniel Chechelnitsky, Matteo Negri, Marine Carpuat, Maarten Sap, André F. T. Martins

Affiliations: Instituto de Telecomunicações; Fondazione Bruno Kessler; Carnegie Mellon University; University of Maryland; Instituto Superior Técnico

AI summary: Proposes the Ouvia framework, which evaluates the user-perceived usability of speech translation from more than 1,750 interactions in real healthcare and everyday scenarios, finding that modern ST is only partially usable (about half of interactions are rated usable) and that QA-based evaluation predicts usability better than standard approaches.

Comments
Code and data at https://github.com/g8a9/ouvia

Abstract

Speech translation (ST) is increasingly adopted in user applications, yet its evaluation largely focuses on decontextualized testbeds and holistic quality, rather than end users' communication needs. We introduce Ouvia, an evaluation framework for measuring user-perceived usability of speech translation outputs in real-world settings. Ouvia focuses on one-to-one communication: an English speaker needs to convey a request to a Portuguese speaker, and the message is automatically translated. Through a custom web app and multi-phase study design, we collect more than 1,750 such interactions in healthcare and everyday situations, mediated by four ST systems, involving speakers from three English dialects and two genders. We find that modern ST serves people only to a limited extent -- only around half of interactions are rated as usable -- with significant gaps in reported usability across demographic groups. Moreover, among quality metrics, we find that QA-based evaluation is a substantially stronger predictor of real-world usability than standard approaches. Together, these findings stress the importance of situated, user-centered evaluation frameworks that go beyond holistic quality scores and attend to who the technology serves -- and how well.

2606.06176 2026-06-05 cs.CV

RQUL-UIE: Revitalizing Quality-Unstable Labels for Underwater Image Enhancement via In-Dataset Self-Supervision

RQUL-UIE: 通过数据集内自监督重振质量不稳定标签用于水下图像增强

Haochen Hu, Yanrui Bin, Chih-yung Wen, Bing Wang

发表机构 * The Hong Kong Polytechnic University(香港理工大学)

AI总结 提出一种基于扩散模型的数据集内自监督学习策略,通过评估标签质量并量化噪声级别进行分步去噪监督,结合傅里叶细化网络,有效利用不稳定标签提升水下图像增强质量。

详情
AI中文摘要

水下图像增强对于减轻水介质引起的退化至关重要。尽管基于学习的方法取得了显著进展,但大多数依赖于具有不稳定标签质量的配对数据集,这限制了模型性能。本文提出了一种基于扩散的数据集内自监督学习策略,旨在利用训练标签的质量分布。具体地,我们通过预训练扩散模型的语义感知嵌入以无需训练的方式评估标签质量。这些质量分数随后被量化为噪声级别索引,指导多步去噪过程以进行级别监督。该机制防止低质量标签降低模型性能,同时最大化其在训练中的效用。此外,引入基于傅里叶的细化网络以显式重建高频分量。大量评估表明,我们的方法在恢复质量上始终优于最先进的方法。代码和预训练模型将在接收后提供链接。

英文摘要

Underwater Image Enhancement (UIE) is essential for mitigating degradations caused by the water medium. Although learning-based methods have advanced significantly, most rely on paired datasets with unstable label quality, which bottlenecks model performance. This paper proposes a diffusion-based, in-dataset self-supervised learning strategy designed to exploit the quality distribution of training labels. Specifically, we evaluate label quality via semantic perception embeddings from a pre-trained diffusion model in a training-free manner. These quality scores are subsequently quantized into noise-level indices, guiding a multi-step denoising process for level-wise supervision. This mechanism prevents low-quality labels from degrading the model while maximizing their utility during training. Furthermore, a Fourier-based refinement network is incorporated to explicitly reconstruct high-frequency components. Extensive evaluations demonstrate that our method consistently outperforms SOTA approaches in restoration quality. The code and pre-trained model will be released upon acceptance.

2606.06174 2026-06-05 cs.LG stat.AP

Learning to model pediatric asthma exacerbation from multiple risk factors: a case study in coastal Virginia

学习从多风险因素建模儿童哮喘加重:弗吉尼亚沿海地区案例研究

Jonathan Colen, Eric Werner, Maryam Golbazi, Heather Richter, Diana McSpadden, Amy Quinn, Jocel Santos, Mary Jane Darling, Mary Margaret Gleason

发表机构 * Joint Institute on Advanced Computing for Energy and Science(联合能源与科学高级计算研究所) Old Dominion University(欧道明大学) School of Data Science(数据科学学院) Eastern Virginia Medical School(东部弗吉尼亚医学院) Macon & Joan Brock Virginia Health Sciences(马科恩与乔安·布罗克弗吉尼亚健康科学) Children’s Hospital of the King’s Daughters(国王女儿儿童医院) Children’s Specialty Group(儿童专科组) Institute for Coastal Adaptation and Resilience(海岸适应与韧性研究所) Chief Data Office(首席数据办公室) Thomas Jefferson National Accelerator Facility(托马斯·杰弗逊国家加速器设施) Department of Psychiatry(精神病学系) Boston Children’s Hospital(波士顿儿童医院)

AI总结 本研究通过比较广义线性模型、神经网络和稀疏字典学习框架,建模弗吉尼亚沿海地区儿童哮喘加重与空气污染、气象及社区社会经济因素的关系,并识别关键风险因素。

详情
Comments
22 pages, 6 figures (5 supplemental)
AI中文摘要

儿童哮喘是一种常见疾病,受空气污染、气象和社区级社会经济因素加剧。在大型时空数据集中建模哮喘加重(AE)需要厘清多个贡献因素的影响。在本案例研究中,我们比较了三种平衡预测能力与可解释性的技术,以预测汉普顿路(弗吉尼亚沿海地区,包括7个城市,人口超过150万)的AE。在整理环境空气污染测量值、天气数据和社区机会指标后,我们建模了2018-2023年区域儿童医院及附属机构的邮政编码级急性AE就诊情况。广义线性模型(GLM)提供基线,神经网络(NN)作为最大预测目标。为了桥接统计模型和深度学习,我们开发了一个基于稀疏字典学习的框架,以识别和解释简约的非线性交互方程。在比较每个模型的预测性能后,我们估计了输入暴露变量导致的AE相对风险,并发现各框架间的一致性。我们的工作将统计模型与可解释机器学习模型联系起来,突出了可能影响AE的协同交互作用,并可能为未来研究指导弗吉尼亚沿海地区的公共卫生干预措施。

英文摘要

Childhood asthma is a common illness exacerbated by air pollution as well as meteorological and neighborhood-level socioeconomic factors. Modeling asthma exacerbation (AE) in large spatiotemporal datasets requires disentangling impacts from multiple contributors. In this case study, we compared three techniques that balance predictive power with interpretability to predict AE in Hampton Roads, a coastal Virginia region comprising 7 cities and over 1.5 million people. After collating ambient air pollution measurements, weather data, and measures of neighborhood opportunity, we modeled zip code-level acute AE visits to a regional children's hospital and affiliated providers from 2018-2023. Generalized linear models (GLM) provided a baseline while neural networks (NN) served as a maximally predictive target. To bridge between statistical models and deep learning, we developed a framework based on sparse dictionary learning to identify and interpret parsimonious nonlinear interacting equations. After comparing each model's predictive performance, we estimated relative risks for AE due to input exposure variables and found consensus across frameworks. Our work links statistical and interpretable machine learning models to highlight possible synergistic interactions influencing AE, and may enable future studies to guide public health interventions in coastal Virginia.

2606.06168 2026-06-05 cs.AI cs.CL

ProSarc: Prosody-Aware Sarcasm Recognition Framework via Temporal Prosodic Incongruity

ProSarc: 通过时间韵律不协调性进行韵律感知的讽刺识别框架

Prathamjyot Singh, Ashima Sood, Sahil Sharma, Jasmeet Singh

发表机构 * Department of Computer Science and Engineering, Thapar Institute of Engineering and Technology, Patiala, India(计算机科学与工程系,泰帕尔工程与技术学院,印度帕蒂亚拉) School of Computing, Engineering and Intelligent Systems, Ulster University, Londonderry, United Kingdom(计算学、工程与智能系统学院,阿尔斯特大学,英国伦敦德里) School of Computing, Ulster University, Belfast, United Kingdom(计算学学院,阿尔斯特大学,英国贝尔法斯特)

AI总结 提出ProSarc,一个仅利用音频的框架,通过建模局部韵律动态与话语级情感基线之间的时间韵律不协调性来检测讽刺,在MUStARD++等数据集上取得最优性能。

详情
Comments
Accepted at Interspeech 2026, Sydney
AI中文摘要

我们提出了ProSarc,一个仅利用音频的框架,通过建模时间韵律不协调性(即局部韵律动态与话语级情感基线之间的不匹配)来检测讽刺。双编码路径——全局情感编码器和时间韵律编码器(BiLSTM + 多头注意力)——馈送到韵律不协调性分析器,该分析器产生一个标量不协调性分数用于分类。蒙特卡洛dropout提供不确定性估计,基于注意力的机制无需帧级标签即可定位讽刺起始点。ProSarc在MUStARD++(F1=75.3)上优于先前的纯音频方法,并泛化到自发性语音(PodSarc,F1=62.9)和跨语言语音(MuSaG,F1=65.6)。十次运行验证证实了不协调性建模的贡献(Wilcoxon p=0.002,Cohen's d=1.51)。人工评估表明,模型不确定性追踪感知模糊性,预测的起始点与人工标注的时间窗口对齐。

英文摘要

We present ProSarc, an audio-only framework that detects sarcasm by modelling temporal prosodic incongruity, that is, the mismatch between local prosodic dynamics and the utterance-level emotional baseline. Dual encoding paths, a Global Emotion Encoder and a Temporal Prosody Encoder (BiLSTM + multi-head attention), feed a Prosodic Incongruity Analyzer that produces a scalar incongruity score for classification. Monte Carlo dropout provides uncertainty estimates, and an attention-based mechanism localises sarcastic onset without frame-level labels. ProSarc outperforms prior audio-only methods on MUStARD++ (F1=75.3) and generalises to spontaneous (PodSarc, F1=62.9) and cross-lingual speech (MuSaG, F1=65.6). Ten-run validation confirms the contribution of incongruity modelling (Wilcoxon p=0.002, Cohen's d=1.51). Human evaluation shows that model uncertainty tracks perceptual ambiguity and predicted onsets align with human-annotated temporal windows.

2606.06164 2026-06-05 cs.LG physics.comp-ph

On the training of physics-informed neural operators for solving parametric partial differential equations

关于物理信息神经算子训练以求解参数化偏微分方程的研究

Nanxi Chen, Chuanjie Cui, Airong Chen, Sifan Wang, Rujin Ma

发表机构 * College of Civil Engineering, Tongji University(同济大学土木工程学院) Department of Engineering Science, University of Oxford(牛津大学工程科学系) Institute for Foundations of Data Science, Yale University(耶鲁大学数据科学基础研究所)

AI总结 本文系统研究了物理信息神经算子(PINO)的训练策略,包括架构设计、优化器选择、损失平衡和配置点采样,通过实验发现CViT架构表现稳定,并揭示了梯度冲突和因果违反等优化问题,表明PINN的缓解算法在PINO中同样有效。

详情
AI中文摘要

物理信息神经算子(PINO)旨在通过使用控制物理作为监督来学习偏微分方程的解算子,而不是仅仅依赖配对的输入-输出模拟数据。通过将物理约束纳入训练目标,PINO结合了神经算子的跨实例泛化能力和物理信息学习的数据效率。尽管有这一前景,如何高效且稳健地训练PINO仍不如数据驱动神经算子或物理信息神经网络(PINN)的训练那样被充分理解。为弥补这一差距,我们考察了PINO训练流程的关键组成部分,包括架构设计、优化器选择、损失平衡和配置点采样策略。我们研究了三种代表性算子骨干:深度算子网络(DeepONet)、傅里叶神经算子(FNO)和连续视觉Transformer(CViT),涵盖五个不同的参数化PDE系统。结果表明,CViT在考虑的基准测试中提供了一致且稳定的强性能。除了架构,我们发现先前在PINN训练中识别出的几种优化病理自然出现在PINO中,包括梯度冲突和因果违反。我们还发现为PINN开发的缓解算法在PINO设置中仍然有效。我们进一步比较了不同数据体制下的物理信息和数据驱动训练,揭示精心设计的物理信息训练流程可以匹配,并在某些情况下超越纯数据驱动神经算子。综合来看,这些发现提供了对PINO训练中优化挑战的系统性实证理解,并为高效稳健的物理信息算子学习提供了实用流程。代码和数据可在 https://github.com/NanxiiChen/PI-CViT 获取。

英文摘要

Physics-informed neural operators (PINOs) aim to learn solution operators for partial differential equations by using the governing physics as supervision, rather than relying solely on paired input-output simulation data. By incorporating physical constraints into the training objective, PINOs combine the cross-instance generalization of neural operators with the data efficiency of physics-informed learning. Despite this promise, how to train PINOs efficiently and robustly remains less well-understood than the training of either data-driven neural operators or physics-informed neural networks (PINNs). To bridge this gap, we examine key components of the PINO training pipeline, including architecture design, optimizer choice, loss balancing, and collocation-point sampling strategy. We study three representative operator backbones, Deep Operator Network (DeepONet), Fourier Neural Operator (FNO), and Continuous Vision Transformer (CViT), across five diverse parametric PDE systems. Our results show that CViT provides consistently strong and stable performance across the considered benchmarks. Beyond architecture, we find that several optimization pathologies previously identified in PINN training naturally arise in PINOs, including gradient conflicts and causal violation. We also find that mitigation algorithms developed for PINNs remain effective in the PINO setting. We further compare physics-informed and data-driven training under different data regimes, revealing that a carefully designed physics-informed training pipeline can match, and in some cases, outperform purely data-driven neural operators. Taken together, these findings provide a systematic empirical understanding of the optimization challenges in PINO training and inform a practical pipeline for efficient and robust physics-informed operator learning. Code and data are available at https://github.com/NanxiiChen/PI-CViT.

2606.06160 2026-06-05 cs.AI cs.CL

Where does Absolute Position come from in decoder-only Transformers?

在仅解码器Transformer中,绝对位置从何而来?

Valeria Ruscio, Umberto Nanni, Fabrizio Silvestri

发表机构 * Sapienza University of Rome(罗马大学萨皮恩扎分校) Intuition Machines(直觉机器)

AI总结 本文研究了RoPE训练的仅解码器Transformer中绝对位置信息的来源,发现因果掩码和残差流是导致绝对位置泄露的两个关键组件,并提出了通过替换BOS嵌入来减少残差流成分的方法。

详情
AI中文摘要

RoPE训练的Transformer在其注意力模式中区分绝对位置,尽管RoPE在内积中仅编码相对偏移。我们将这种泄露追溯到两个架构组件。因果掩码是第一个:其每个查询的softmax分母按构造依赖于绝对查询位置。残差流提供第二个。在因果注意力下,位置$0$处的激活仅关注自身,并作为封闭动力系统从该位置token的嵌入运行;下游注意力通过sink-reading头读取该轨迹。这两个组件在我们研究的所有三种架构中都存在,但以架构特定的平衡出现:NTK缩放抑制残差流组件,滑动窗口注意力使其随深度累积,而标准RoPE介于两者之间。在前向传播前替换\texttt{BOS}嵌入可消除早期查询中$40\%$的残差流组件。注意力sink是锚定在token上的稳定器,传递位置$0$处token的确定性指纹,当该token是自动预置的\texttt{BOS}时,该指纹跨输入恒定,否则随其变化。

英文摘要

RoPE-trained transformers distinguish absolute position in their attention patterns, even though RoPE encodes only relative offsets in the inner product. We trace this leakage to two architectural components. The causal mask is responsible for the first: its per-query softmax denominator depends on the absolute query position by construction. The residual stream supplies the second. Under causal attention the activation at position $0$ attends only to itself and runs as a closed dynamical system from the embedding of the token at that position; downstream attention reads this trajectory through sink-reading heads. Both components appear in all three architectures we study, in architecturally specific balance: NTK scaling suppresses the residual-stream component, sliding-window attention allows it to accumulate with depth, and standard RoPE sits between. Replacing the \texttt{BOS} embedding before the forward pass removes $40\%$ of the residual-stream component at early queries. Attention sinks are token-anchored stabilizers that pass forward a deterministic fingerprint of the token at position $0$, constant across inputs when that token is the auto-prepended \texttt{BOS} and varying with it otherwise.
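The causal-mask component described in this abstract can be illustrated with a toy case. The sketch below is not the authors' code: it assumes every attention logit is identical (zero), so neither content nor relative offset carries any signal, and the helper name `causal_softmax_row` is hypothetical. Even under that assumption, the softmax denominator of each query grows with its absolute position, because a query at position q sees exactly q + 1 keys.

```python
import math

def causal_softmax_row(logits, query_pos):
    """Softmax restricted to keys 0..query_pos, i.e. one row of causal attention."""
    visible = logits[: query_pos + 1]
    peak = max(visible)                      # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in visible]
    denom = sum(exps)
    return [e / denom for e in exps], denom

seq_len = 6
logits = [0.0] * seq_len                     # assumption: identical score for every key

denominators = []
first_row = None
for q in range(seq_len):
    weights, denom = causal_softmax_row(logits, q)
    if q == 0:
        first_row = weights                  # position 0 can only attend to itself
    denominators.append(denom)

print(denominators)
```

With equal logits the denominator simply counts the visible keys, so it identifies the absolute query position. With real logits the dependence is less direct, but the number of terms in the sum is still fixed by the position.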

2606.06158 2026-06-05 cs.CV

Adaptive Tokenisation Via Temporal Redundancy Masking And Latent Inpainting

通过时间冗余掩码和潜在修复的自适应分词

Kevin Dave, Sai Aditya Patkuri, Chhaya Kumar Das, Gouranga Bala, R. Venkatesh Babu, Rajeshkumar SA

发表机构 * Phronetic AI IISc Bangalore IIT Bombay

AI总结 提出一种无参数的自适应视频分词机制,利用冻结连续分词器的潜在空间中的时间冗余,通过阈值丢弃冗余位置,并使用轻量级潜在修复变压器重建,实现内容驱动的令牌分配和高效推理。

详情
AI中文摘要

自适应视频分词旨在根据序列的底层视觉复杂度动态分配令牌预算。当前的连续方法通过迭代二值化搜索或训练神经回归器实现,而离散方法通常需要全速率解码器来估计信息内容。我们证明这些计算开销并非必要。我们表明,冻结的连续视频分词器的潜在空间固有地编码了可直接利用的时间冗余:潜在表示在连续帧之间变化最小的空间位置携带接近零的额外信息。我们引入了一种无参数的自适应令牌分配机制,该机制对每个位置的时间L1差异应用固定阈值,识别并丢弃冗余的潜在位置。因此,压缩率自然地从输入内容中产生,而不是自上而下地强制执行:静态场景被积极压缩,而高度动态的序列保留更多令牌。为了重建丢弃的位置,我们提出了潜在修复变压器(LIT),一种轻量级的分解时空注意力架构。得到的推理流水线非常高效,仅需一次编码器前向传播和一次LIT前向传播,消除了辅助路由网络的需求。在TokenBench和DAVIS(近期分词器使用的标准基准)上的评估表明,我们的框架产生了有意义的、内容驱动的令牌分配,同时保持了有竞争力的重建保真度,并且相比连续自适应基线(ElasticTok-CV)实现了31倍的推理加速,相比离散信息论基线(InfoTok)实现了约2倍的加速。

英文摘要

Adaptive video tokenisation seeks to dynamically allocate token budgets based on the underlying visual complexity of a sequence. Current continuous-regime approaches achieve this via iterative binarised searches or trained neural regressors, while discrete methods often require a full-rate decoder pass to estimate information content. We demonstrate that such computational overheads are not strictly necessary. We show that the latent space of a frozen continuous video tokeniser inherently encodes temporal redundancy that can be exploited directly: spatial positions whose latent representations change minimally between consecutive frames carry near-zero additional information. We introduce a parameter-free adaptive token allocation mechanism that applies a fixed threshold to per-position temporal-L1 differences, identifying and dropping redundant latent positions. Consequently, the compression rate emerges naturally from the input content rather than being enforced top-down: static scenes get compressed aggressively, while highly dynamic sequences retain more tokens. To reconstruct the dropped positions, we propose the Latent Inpainting Transformer (LIT), a lightweight factorised spatial-temporal attention architecture. The resulting inference pipeline is highly efficient, requiring only a single encoder pass and one LIT forward pass, eliminating the need for auxiliary routing networks. Evaluations across TokenBench and DAVIS, which are the standard benchmarks used by recent tokenisers~\cite{infotok, agarwal2025cosmos}, indicate that our framework yields meaningful, content-driven token allocation while maintaining competitive reconstruction fidelity, and delivers a $31\times$ inference-time speedup over the continuous adaptive baseline (ElasticTok-CV) and an $\approx2\times$ speedup over the discrete information-theoretic baseline (InfoTok).
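The thresholding step in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: latents are plain nested lists indexed as `latents[t][p]` (frame, spatial position, then channels), the L1 difference is averaged over channels, the first frame is always kept, and the names `temporal_keep_mask`, `kept_fraction` and `tau` are hypothetical.

```python
def temporal_keep_mask(latents, tau):
    """Return mask[t][p] = True where the latent at frame t, position p is kept.

    A position is dropped when its mean absolute (L1) change from the
    previous frame does not exceed the fixed threshold tau.
    """
    mask = [[True] * len(latents[0])]        # frame 0 has no predecessor: keep all
    for t in range(1, len(latents)):
        row = []
        for prev, cur in zip(latents[t - 1], latents[t]):
            l1 = sum(abs(c - p) for c, p in zip(cur, prev)) / len(cur)
            row.append(l1 > tau)
        mask.append(row)
    return mask

def kept_fraction(mask):
    flat = [keep for row in mask for keep in row]
    return sum(flat) / len(flat)

# Toy data: 3 frames, 2 spatial positions, 2 channels per position.
static = [[[0.5, -0.2], [1.0, 0.0]]] * 3     # identical frames
moving = [
    [[0.0, 0.0], [1.0, 0.0]],
    [[0.9, 0.4], [1.0, 0.0]],                # position 0 changes, position 1 does not
    [[0.1, 1.2], [1.0, 0.05]],
]

static_mask = temporal_keep_mask(static, tau=0.1)
moving_mask = temporal_keep_mask(moving, tau=0.1)
print(kept_fraction(static_mask), kept_fraction(moving_mask))
```

The threshold is the same for both clips, yet the static clip keeps only its first frame while the moving clip keeps every changing position. In that sense the compression rate follows from the content rather than from a preset budget.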

2606.06156 2026-06-05 cs.LG

Trust-Aware Predictive Emissions Monitoring for Gas Turbine Fleets with Limited Labelled Data

面向有限标注数据的燃气轮机机群信任感知预测性排放监测

Rebecca Potts, Aiden Durrant, Rick Hackney, Georgios Leontidis

发表机构 * School of Natural and Computing Sciences University of Aberdeen(自然与计算科学学院 阿伯丁大学) School of Computing Sciences University of East Anglia(计算科学学院 东安格利亚大学) Siemens Energy Industrial Turbomachinery Ltd.(西门子能源工业涡轮机有限公司) Department of Physics and Technology UiT The Arctic University of Norway(物理与技术系 UiT 挪威北极大学)

AI总结 提出一种信任感知概率框架,结合多头循环预测模型、置信度估计、集成不确定性量化、辅助特征预测、特征空间距离分析和运行范围诊断,在少量标注数据下实现机群级NOx预测,并提供可解释的逐样本信任分数以指示未标注涡轮机的预测可靠性。

详情
Comments
14 pages, 6 figures, 6 tables
AI中文摘要

基于机器学习的预测性排放监测系统为直接排放测量提供了一种实用替代方案,但当排放标签仅适用于一小部分资产时,其在燃气轮机机群中的部署具有挑战性。在这项工作中,提出了一种信任感知概率框架,用于在有限标注监督下进行机群级燃气轮机NOx预测。该框架结合了多头循环预测模型与学习到的置信度估计、集成不确定性量化、辅助特征预测、特征空间距离分析和运行范围诊断。这些信号在标注数据上进行校准,以产生可解释的逐样本信任分数,为未标注涡轮机上的预测可靠性提供指标,支持识别在机群部署中应更加谨慎对待的预测。基于置信度的过滤将平均绝对误差从全覆盖时的0.202降低到最高置信度10%预测的0.070,表明置信度估计与预测误差有显著关联。未标注和分布外样本表现出增加的不确定性和降低的置信度,表明该框架对分布偏移做出了适当响应。结果表明,所提出的信任框架为未标注涡轮机上的排放预测提供了可操作的可靠性信息,支持PEMS在工业机群中更透明和可信的部署。

英文摘要

Machine learning-based predictive emissions monitoring systems offer a practical alternative to direct emissions measurement, but their deployment across gas turbine fleets is challenging when emissions labels are available for only a small subset of assets. In this work, a trust-aware probabilistic framework is proposed for fleet-level gas turbine NOx prediction under limited labelled supervision. The framework combines a multi-head recurrent prediction model with learned confidence estimation, ensemble-based uncertainty quantification, auxiliary feature prediction, feature-space distance analysis, and operating-range diagnostics. These signals are calibrated on labelled data to produce interpretable per-sample trust scores, providing indicators of prediction reliability on unlabelled turbines, supporting the identification of predictions that should be treated with greater caution during fleet-level deployment. Confidence-based filtering reduces MAE from 0.202 at full coverage to 0.070 for the highest-confidence 10\% of predictions, demonstrating that confidence estimates are meaningfully related to prediction error. Unlabelled and out-of-distribution samples exhibit increased uncertainty and reduced confidence, indicating that the framework responds appropriately to distributional shift. The results show that the proposed trust framework provides actionable reliability information for emissions prediction on unlabelled turbines, supporting more transparent and trustworthy deployment of PEMS across industrial fleets.
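The confidence-based filtering result in this abstract (MAE reported at full coverage and for the highest-confidence 10% of predictions) corresponds to a standard risk-coverage evaluation. The sketch below shows one way to compute it. The function name `mae_at_coverage` is hypothetical, and the error and confidence values are made-up toy numbers, not data from the paper.

```python
def mae_at_coverage(errors, confidences, coverage):
    """Mean absolute error over the `coverage` fraction of samples
    with the highest confidence scores."""
    ranked = sorted(zip(confidences, errors), key=lambda pair: pair[0], reverse=True)
    k = max(1, round(coverage * len(ranked)))   # always keep at least one sample
    kept = [abs(err) for _, err in ranked[:k]]
    return sum(kept) / len(kept)

# Toy values: confidence is (by construction) higher where the error is lower.
errors      = [0.05, 0.40, 0.10, 0.30, 0.02, 0.25, 0.08, 0.50, 0.15, 0.20]
confidences = [0.90, 0.20, 0.80, 0.30, 0.95, 0.40, 0.85, 0.10, 0.70, 0.50]

mae_full = mae_at_coverage(errors, confidences, coverage=1.0)
mae_top10 = mae_at_coverage(errors, confidences, coverage=0.1)
print(mae_full, mae_top10)
```

If the confidence scores carry no information about the error, the two values are close on average. A clear drop at low coverage is the behaviour the abstract reports.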

2606.06155 2026-06-05 cs.RO cs.CV cs.MM

AffordanceVLA: A Vision-Language-Action Model Empowering Action Generation through Affordance-Aware Understanding

AffordanceVLA:一种通过可供性感知理解赋能动作生成的视觉-语言-动作模型

Qize Yu, Jiadi You, Yuran Wang, Jiaqi Liang, Bowen Ping, Yang Tian, Yue Chen, Minghong Cai, Zeying Gong, Ruihai Wu, Yinchuan Li, Junwei Liang, Yingcong Chen

发表机构 * Peking University(北京大学) Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) The Chinese University of Hong Kong(香港中文大学) Knowin AI

AI总结 提出AffordanceVLA框架,通过引入结构化可供性预测作为任务导向的中间表示,解决VLA模型中语义空间与具身控制策略的结构不匹配问题,实现精确的感知-动作映射。

详情
Comments
Preprint. Code and project page are available. Code: https://github.com/Skywalker-yqz/AffordanceVLA Project page: https://skywalker-yqz.github.io/AffordanceVLA/
AI中文摘要

视觉-语言-动作(VLA)模型利用预训练视觉-语言模型(VLM)的丰富世界知识来实现指令跟随的机器人操作。然而,VLM语义空间与具身控制策略之间的结构不匹配常常阻碍精确感知-动作映射的学习。为解决这一挑战,我们提出\textbf{AffordanceVLA},一个统一框架,引入结构化可供性预测作为任务导向的中间表示,以建立更精确和鲁棒的感知-动作映射。具体而言,我们通过三个互补组件逐步建模操作先验:1) \textbf{Which2Act},通过视觉潜在预测进行以物体为中心的定位以抑制干扰;2) \textbf{Where2Act},通过可供性图估计进行2D交互定位;3) \textbf{How2Act},用于引导操作策略的3D几何推理。这些可供性线索提供了空间定位、语义条件化和动作耦合的中间表示,从而自然地桥接视觉、语言和动作。我们将这些模块集成到具有专门专家的混合Transformer(MoT)架构中,并使用三阶段训练策略和渐进式数据课程训练模型。为克服机器人数据集中密集可供性标签的稀缺性,我们还开发了一个鲁棒的自动化数据增强流水线。在仿真和真实世界中的大量实验表明,AffordanceVLA在多种操作场景中实现了强大的性能。

英文摘要

Vision-Language-Action (VLA) models leverage the rich world knowledge of pretrained vision-language models (VLMs) to enable instruction-following robotic manipulation. However, the structural mismatch between VLM semantic spaces and embodied control policies often hinders the learning of precise perception--action mappings. To address this challenge, we propose \textbf{AffordanceVLA}, a unified framework that introduces structured affordance forecasting as a task-oriented intermediate representation to establish a more precise and robust perception--action mapping. Specifically, we progressively model manipulation priors through three complementary components: 1) \textbf{Which2Act} for object-centric grounding via visual latent prediction to suppress distractions; 2) \textbf{Where2Act} for 2D interaction localization via affordance map estimation; and 3) \textbf{How2Act} for 3D geometric reasoning to guide manipulation policies. These affordance cues provide spatially grounded, semantically conditioned, and action-coupled intermediate representations, thereby naturally bridging vision, language and action. We integrate these modules into a Mixture-of-Transformer (MoT) architecture with specialized experts and train the model using a three-stage training strategy with a progressive data curriculum. To overcome the scarcity of dense affordance labels in robotic datasets, we also develop a robust automated data augmentation pipeline. Extensive experiments in simulation and the real world demonstrate that AffordanceVLA achieves strong performance across diverse manipulation scenarios.

2606.06154 2026-06-05 cs.AI

Amortizing Federated Adaptation: Hypernetwork Driven LoRA for Personalized Foundation Models

摊销联邦自适应:基于超网络的LoRA用于个性化基础模型

Sunny Gupta, Shambhavi Shanker, Amit Sethi

发表机构 * Indian Institute of Technology, Bombay(印度理工学院孟买分校)

AI总结 提出HyperLoRA框架,通过超网络驱动的LoRA生成和乘积空间聚合,解决联邦LoRA中的结构聚合偏差和客户端初始化滞后问题,实现高效个性化、无偏聚合和更快收敛。

详情
Comments
Accepted at International Workshop on Federated Learning in the Age of Foundation Models In Conjunction with IJCAI 2026 (FL@FM-IJCAI'26)
AI中文摘要

使用低秩自适应(LoRA)对基础模型进行联邦微调为分布式学习提供了一种通信高效的解决方案。然而,现有的联邦LoRA方法存在两个基本限制:(1)结构聚合偏差,即独立平均低秩因子无法近似真实的组合更新;(2)客户端初始化滞后,即客户端在通信轮次中反复重新初始化LoRA参数,导致收敛变慢。我们提出HyperLoRA,一个统一的框架,通过超网络驱动的LoRA生成和乘积空间聚合的摊销联邦自适应来解决这两个问题。HyperLoRA不是进行迭代的逐客户端优化,而是使用一个学习到的生成器,将客户端分布特征映射到LoRA初始化,从而有效摊销每个客户端的自适应。在服务器端,我们引入一个学习到的聚合模块,直接在低秩乘积空间中合成更新,消除了因子级平均的不一致性。一个轻量级的残差校正模块进一步提高了在异质(非IID)客户端分布下的稳定性。通过用学习到的算子替代迭代优化和启发式平均,HyperLoRA共同实现了高效个性化、无偏聚合和更快的收敛。在联邦视觉和视觉-语言基准上的实验表明,与先前的联邦LoRA方法相比,HyperLoRA实现了更快的收敛速度、对分布偏移更强的鲁棒性以及更强的个性化性能。

英文摘要

Federated fine-tuning of foundation models using Low-Rank Adaptation (LoRA) offers a communication-efficient solution for distributed learning. However, existing federated LoRA methods suffer from two fundamental limitations: (1) structural aggregation bias, where independently averaging low-rank factors fails to approximate the true combined update, and (2) client-side initialization lag, as clients repeatedly reinitialize LoRA parameters across communication rounds, slowing convergence. We propose HyperLoRA, a unified framework that addresses both issues with amortized federated adaptation via hypernetwork-driven LoRA generation and product-space aggregation. Instead of iterative per-client optimization, HyperLoRA employs a learned generator that maps client distribution signatures to LoRA initializations, effectively amortizing per-client adaptation. On the server side, we introduce a learned aggregation module that directly synthesizes updates in the low-rank product space, eliminating the inconsistencies of factor-wise averaging. A lightweight residual correction module further improves stability under heterogeneous (non-IID) client distributions. By replacing iterative optimization and heuristic averaging with learned operators, HyperLoRA jointly enables efficient personalization, unbiased aggregation, and faster convergence. Experiments on federated vision and vision-language benchmarks show that HyperLoRA achieves improved convergence speed, greater robustness to distribution shift, and stronger personalization performance compared to prior federated LoRA methods.
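The structural aggregation bias named in this abstract can be checked on a two-client toy case. The sketch below only illustrates the bias; it does not implement HyperLoRA's learned aggregation. It assumes rank-1 LoRA factors stored as plain lists, so each client update is the outer product of a column factor `b` and a row factor `a`; all helper names are hypothetical.

```python
def outer(b, a):
    """Rank-1 update B @ A for a column factor b and a row factor a."""
    return [[bi * aj for aj in a] for bi in b]

def mean_vectors(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def mean_matrices(matrices):
    n = len(matrices)
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [[sum(m[i][j] for m in matrices) / n for j in range(cols)]
            for i in range(rows)]

# Two clients with (b, a) factor pairs; each client's update is outer(b, a).
clients = [
    ([1.0, 0.0], [1.0, 0.0]),
    ([0.0, 1.0], [0.0, 1.0]),
]

# Average of the actual updates: mean_i(B_i A_i).
true_average = mean_matrices([outer(b, a) for b, a in clients])

# Factor-wise averaging: mean_i(B_i) times mean_i(A_i).
factor_average = outer(
    mean_vectors([b for b, _ in clients]),
    mean_vectors([a for _, a in clients]),
)

print(true_average)
print(factor_average)
```

The average of the products is a diagonal, rank-2 matrix, while the product of the averaged factors is a rank-1 matrix with equal entries. Averaging factors independently therefore does not recover the combined update in general.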

2606.06148 2026-06-05 cs.LG

Tight list replicability bounds via a novel sphere covering theorem

通过新颖的球面覆盖定理实现紧的列表可复现性界

Ari Blondal, Hamed Hatami, Pooya Hatami, Chavdar Lalov, Sivan Tretiak

发表机构 * McGill University(麦吉尔大学) Ohio State University(俄亥俄州立大学)

AI总结 针对列表可复现性中列表大小与精度参数及假设类复杂度之间的关系,本文利用Borsuk-Ulam定理证明了一个新颖的拓扑球面覆盖定理,并由此得到VC类列表大小与精度的紧界,以及大间隔半空间的最优列表大小。

详情
Comments
17 pages, 2 figures
AI中文摘要

近年来,列表可复现性已成为学习理论中形式化可复现性的一个框架。一个核心问题是所需列表大小如何与精度参数及假设类的自然复杂度度量相关。为了获得列表可复现性的紧界,我们证明了一个新颖的拓扑球面覆盖定理,该定理源自Borsuk-Ulam定理。具体而言,如果$d$维球面被开集覆盖,且每个开集位于一个开半球内,那么这些集合中必有$d+1$个具有公共交集。利用这一结果,我们得到了VC类列表大小与精度之间关系的紧界。我们还证明,对于大间隔半空间,只要间隔不是太大,最优列表大小等于环境维度。然而,当间隔非常大时,我们设计了一个可复现算法,实现了最小列表大小$\lceil d/2 \rceil + 1$。

英文摘要

In recent years, list replicability has emerged as a framework for formalizing reproducibility in learning theory. A central question is how the required list size relates to the accuracy parameter and natural complexity measures of the hypothesis class. To achieve sharp bounds on list replicability, we prove a novel topological sphere covering theorem, derived from the Borsuk-Ulam theorem. Specifically, if the $d$-sphere is covered by open sets, each of which lies in an open hemisphere, then $d+1$ of these sets must have a common intersection. Using this result, we obtain a sharp bound on the relationship between list size and accuracy for VC classes. We also show that for large-margin half-spaces, provided the margin is not too large, the optimal list size equals the ambient dimension. However, when the margin is taken to be very large, we devise a replicable algorithm achieving the minimal list size of $\lceil d/2 \rceil + 1$.

2606.06147 2026-06-05 cs.AI

WorldFly: A World-Model-Based Vision-Language-Action Model for UAV Navigation

WorldFly: 基于世界模型的视觉-语言-动作模型用于无人机导航

Shengtao Zheng, Kai Li, Weichen Zhang, Yu Meng, Chen Gao, Xinlei Chen, Yong Li, Xiao-Ping Zhang

发表机构 * Tsinghua Shenzhen International Graduate School(清华大学深圳国际研究生院) BNRist, Tsinghua University(清华大学北京信息科学与技术国家研究中心)

AI总结 提出WorldFly框架,通过双分支耦合流匹配机制联合生成未来视频预测和导航动作,解决城市峡谷中严重遮挡和视角剧变下的无人机导航问题。

详情
AI中文摘要

端到端的视觉-语言-动作(VLA)模型在无人机导航中显示出潜力。然而,现有方法通常依赖历史观测直接预测动作,在密集城市环境中常因严重遮挡和急转弯导致视角剧变而表现不佳。我们认为,世界模型固有的“想象”未来状态的能力对于在这种部分可观测性下做出稳健决策至关重要。为此,我们构建了一个具有挑战性的城市峡谷遍历基准,专门用于评估在严重遮挡和视角剧变场景下的空间理解能力。基于此,我们提出了WorldFly,一种新颖的基于世界模型的VLA框架,采用双分支耦合流匹配机制联合生成未来视频预测和导航动作,从而通过空间想象显式引导智能体的策略。在我们基准上的大量评估表明,WorldFly优于其他基线,特别是在未见过的环境中,验证了将世界模型集成到具身空中智能体中的有效性。

英文摘要

End-to-end Vision-Language-Action (VLA) models have shown promise in UAV navigation. However, existing approaches typically rely on historical observations to directly predict actions, often struggling in dense urban environments where severe occlusions and sharp turns result in drastic viewpoint transitions. We argue that the ability to "imagine" future states -- inherent in World Models -- is critical for robust decision-making under such partial observability. To address this, we construct a challenging Urban Canyon Traversal Benchmark, specifically designed to evaluate spatial understanding in scenarios characterized by severe occlusions and drastic viewpoint transitions. Building on this benchmark, we propose WorldFly, a novel world-model-based VLA framework that employs a dual-branch coupled flow matching mechanism to jointly generate future video predictions and navigation actions, thereby explicitly guiding the agent's policy via spatial imagination. Extensive evaluations on our benchmark demonstrate that WorldFly outperforms other baselines, particularly in unseen environments, validating the effectiveness of integrating world models into embodied aerial agents.

2606.06142 2026-06-05 cs.CV

Computation-Aware Event-to-Frame Reconstruction via Selective Attention

计算感知的基于选择性注意力的事件到帧重建

Jingqian Wu, Yunbo Jia, Edmund Y. Lam

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出一种高效的事件到帧重建框架,通过循环编码器-解码器、选择性上下文融合和轻量级混合注意力机制,在保持重建质量的同时降低计算复杂度。

详情
AI中文摘要

事件到帧(E2F)重建将异步事件流与基于帧的视觉流水线连接起来,但现有方法通常在重建质量和计算效率之间面临权衡。在这项工作中,我们提出了一种高效的E2F框架,强调因果时间建模和计算感知设计。该架构采用循环编码器-解码器,以紧凑的隐藏状态逐步聚合事件信息。为了提高在快速运动和光照变化下的鲁棒性,引入了一种选择性上下文融合策略,将事件驱动的特征与先验强度线索相结合。在此融合过程中,一种轻量级混合注意力机制增强了特征选择性,而无需依赖繁重的注意力操作。在标准基准上的实验结果表明,所提出的方法在保持重建性能竞争力的同时,在准确性和模型复杂度之间取得了良好的平衡。

英文摘要

Event-to-frame (E2F) reconstruction bridges asynchronous event streams with frame-based vision pipelines, but existing methods often face a trade-off between reconstruction quality and computational efficiency. In this work, we propose an efficient E2F framework that emphasizes causal temporal modeling and computation-aware design. The architecture adopts a recurrent encoder-decoder to incrementally aggregate event information with compact hidden states. To improve robustness under fast motion and illumination variations, a selective context fusion strategy is introduced to integrate event-driven features with prior intensity cues. Within this fusion process, a lightweight hybrid attention mechanism enhances feature selectivity without relying on heavy attention operations. Experimental results on standard benchmarks demonstrate that the proposed approach achieves competitive reconstruction performance while maintaining a favorable balance between accuracy and model complexity.

2606.06139 2026-06-05 cs.RO

MotionDisco: Motion Discovery for Extreme Humanoid Loco-Manipulation

MotionDisco: 用于极端人形机器人移动操作的运动发现

Ilyass Taouil, Michal Ciebelski, Shafeef Omar, Haizhou Zhao, Angela Dai, Aaron M. Johnson, Majid Khadiv

发表机构 * Technical University of Munich, Germany(慕尼黑技术大学) New York University, USA(纽约大学) Carnegie Mellon University, USA(卡内基梅隆大学)

AI总结 提出MotionDisco框架,通过大语言模型引导的进化搜索和顺序运动动力学轨迹优化,从零开始自动发现长时域、接触丰富的人形机器人移动操作技能,并在真实机器人上部署。

详情
AI中文摘要

我们提出MotionDisco,一个从零开始发现接触丰富、长时域人形机器人移动操作运动的框架,无需依赖遥操作或从人类演示中重定向运动。这是具有挑战性的,因为可能的接触交互空间随任务时域和场景中物体数量呈组合增长。MotionDisco通过将大语言模型(LLM)引导的进化搜索与高效的顺序运动动力学轨迹优化器和剪枝策略相结合,实现对交互序列的快速搜索,从而快速发现新技能。通过大量消融研究,我们展示了LLM引导的搜索在多个具有挑战性的长时域任务中成功发现了全身轨迹。最后,通过在发现的轨迹上训练强化学习跟踪策略,我们将运动迁移到真实人形机器人上。这是第一项完全通过自动进化搜索发现并部署长时域人形机器人移动操作技能的工作。实验补充视频见:https://youtu.be/DHiVz34QYlw。

英文摘要

We present MotionDisco, a framework that discovers contact-rich, long-horizon humanoid loco-manipulation motions from scratch, without relying on teleoperation or motion retargeting from human demonstrations. This is challenging because the space of possible contact interactions grows combinatorially with the task horizon and the number of objects in the scene. MotionDisco couples a large language model (LLM) guided evolutionary search over sequences of interactions with an efficient sequential kinodynamic trajectory optimizer and a pruning strategy, enabling the rapid discovery of novel skills. Through extensive ablation studies, we show that our LLM-guided search discovers successful whole-body trajectories across several challenging long-horizon tasks. Finally, by training reinforcement learning tracking policies on the discovered trajectories, we transfer the motions to a real humanoid robot. This is the first work to discover and deploy long-horizon humanoid loco-manipulation skills entirely through automated evolutionary search. Supplementary videos of the experiments are available at: https://youtu.be/DHiVz34QYlw.

2606.06130 2026-06-05 cs.RO

Towards Realistic 3D Sonar Simulation

面向真实3D声纳仿真

Youssef Attia, Davide Costa, Francesco Wanderlingh, Filippo Campagnaro, Enrico Simetti

发表机构 * IEEE

AI总结 本文提出一种模块化架构,结合GPU加速图形引擎与物理声学传播原理,在NVIDIA Isaac Sim中实现基于Water Linked 3D-15传感器的体积3D声纳模型,并通过硬件在环配置验证其有效性。

详情
AI中文摘要

随着水下机器人研究日益涉及复杂的三维感知和自主导航,声纳仿真的保真度已成为算法开发的关键因素。当前的仿真框架通常依赖于几何驱动的渲染,将3D声纳近似为水下的LiDAR等效物,这未能考虑基本的声学现象,如折射、多径干扰和相位相关的信号形成。本文提出了一种用于真实3D声纳仿真的模块化架构,该架构将GPU加速的图形引擎与基于物理的声学传播原理相结合。我们在NVIDIA Isaac Sim环境中实现了一个体积3D声纳模型,该模型以Water Linked 3D-15传感器为原型,并将其集成到一个全面的水下仿真框架中。该系统通过硬件在环配置进行了验证,其中在NVIDIA Jetson Orin Nano上执行的改进FastLIO2 SLAM流水线使用合成3D声纳、DVL、IMU和压力数据进行传感器融合。最后,提供了模拟输出与来自港口板桩检查的真实数据之间的定性比较,描述了剩余的模拟到现实差距,并建立了迈向完全声学驱动的体积感知的路线图。

英文摘要

As underwater robotics research increasingly addresses complex 3D perception and autonomous navigation, the fidelity of sonar simulation has become a key factor in algorithm development. Current simulation frameworks typically rely on geometry-driven rendering, approximating 3D sonar as an underwater equivalent to LiDAR, which fails to account for fundamental acoustic phenomena such as refraction, multi-path interference, and phase-dependent signal formation. This paper proposes a modular architecture for realistic 3D sonar simulation that integrates GPU-accelerated graphics engines with physically grounded acoustic propagation principles. We implement a volumetric 3D sonar model within the NVIDIA Isaac Sim environment, modeled after the Water Linked 3D-15 sensor, and integrate it into a comprehensive underwater simulation framework. The system is validated through a hardware-in-the-loop configuration, where a modified FastLIO2 SLAM pipeline, executed on an NVIDIA Jetson Orin Nano, performs sensor fusion using synthetic 3D sonar, DVL, IMU, and pressure data. Finally, a qualitative comparison between simulated outputs and real-world data from harbor sheet-pile inspections is provided, characterizing the remaining sim-to-real gap and establishing a roadmap toward fully acoustics-driven volumetric sensing.

2606.06123 2026-06-05 cs.LG stat.ML

Adaptive state-action abstractions via rate-distortion

基于率失真的自适应状态-动作抽象

Fernando E. Rosas

Affiliations: Department of Informatics, University of Sussex; Department of Brain Science, Imperial College London; Centre for Eudaimonia and Human Flourishing, University of Oxford

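AI summary: Proposes building soft state-action abstractions from rate-distortion principles and using a performance certificate to adjust abstraction granularity dynamically, achieving near-optimal performance while compressing state and action information.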

Comments
28 pages, 2 figures
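AI Chinese abstract (translated)

When learning to walk, infants seem to tackle a coarse version of the problem first (stay upright, reach the caregiver) and refine it only when further practice at that resolution no longer pays off. Reinforcement learning offers many techniques for building simple versions of complex tasks, but lacks general principles for dynamically adjusting the granularity of these abstractions during learning. This paper proposes one such principle: refine the abstraction as soon as the learning error within it becomes comparable to the error induced by the abstraction itself. Here, we study one way of formalising this principle through a performance certificate that decomposes value error into two terms: a learning error bound captured by a Bellman residual, and an abstraction error bound given by a bisimulation metric. The resulting switching strategy is implemented with soft state-action abstractions built from rate-distortion principles, whose resolution along the state and action axes can be adjusted continuously. We validate this construction in a range of tabular settings, showing that near-optimal performance can be achieved under substantial lossy compression of state and action information.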

English abstract

When learning to walk, infants seem to address a coarse version of the problem first - stay upright, reach the caregiver - and refine it only when further practice at that resolution stops paying off. Reinforcement learning offers multiple techniques for building simple versions of complex tasks, but lacks general principles for how to dynamically adjust the granularity of these abstractions during learning. This paper proposes one such principle: refine the abstraction as soon as the learning error within it becomes comparable to the error induced by the abstraction itself. Here, we investigate one way of formalising this principle via a performance certificate that decomposes value error into two terms: a learning error bound captured by a Bellman residual, and an abstraction error bound given by a bisimulation metric. The resulting switching strategy is implemented by soft state-action abstractions built from rate-distortion principles, whose resolution along state and action axes can be continuously adjusted. We validate this construction in a range of tabular settings, showing that near-optimal performance can be achieved under substantial lossy compression of state and action information.
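The switching principle in the abstract can be sketched for a tabular MDP as follows. This is a minimal illustration under stated assumptions, not the paper's construction: the learning error is bounded with the standard contraction argument (sup-norm Bellman residual divided by 1 - gamma), the abstraction error bound is simply passed in as a number rather than computed from a bisimulation metric, and the names `bellman_residual` and `should_refine` are hypothetical.

```python
import numpy as np

def bellman_residual(Q, P, R, gamma):
    """Sup-norm Bellman optimality residual of Q.

    Q: (S, A) action values, P: (S, A, S) transition probabilities,
    R: (S, A) expected rewards, gamma: discount factor in [0, 1).
    """
    # One Bellman optimality backup: R + gamma * E[max_a' Q(s', a')]
    target = R + gamma * (P @ Q.max(axis=1))
    return float(np.abs(target - Q).max())

def should_refine(Q, P, R, gamma, abstraction_error_bound):
    """Refine once the learning error bound drops to the abstraction error bound.

    abstraction_error_bound is an assumed input here; the paper derives it
    from a bisimulation metric, which this sketch does not implement.
    """
    learning_error_bound = bellman_residual(Q, P, R, gamma) / (1.0 - gamma)
    return learning_error_bound <= abstraction_error_bound

# Hypothetical 2-state, 2-action MDP with deterministic transitions and unit rewards.
P = np.zeros((2, 2, 2))
P[:, 0, 0] = 1.0  # action 0 always leads to state 0
P[:, 1, 1] = 1.0  # action 1 always leads to state 1
R = np.ones((2, 2))
gamma = 0.9

print(should_refine(np.zeros((2, 2)), P, R, gamma, abstraction_error_bound=5.0))
print(should_refine(np.full((2, 2), 10.0), P, R, gamma, abstraction_error_bound=5.0))
```

With an untrained Q the learning error still dominates, so more practice at the current resolution is worthwhile; once Q is close to the fixed point, the abstraction itself becomes the bottleneck and refinement is triggered.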