arXivDaily arXiv每日学术速递 周一至周五更新
全部学科分类 2094
专题追踪 全部专题
2606.19610 2026-06-19 cs.LG cs.AI 新提交

Latent Confounded Causal Discovery via Lie Bracket Geometry

基于李括号几何的潜在混杂因果发现

Sridhar Mahadevan

发表机构 * Adobe Research(Adobe研究院) University of Massachusetts, Amherst(马萨诸塞大学阿默斯特分校)

AI总结 利用信息几何和范畴论,提出两种算法(BRIDGE和SKFM),通过干预诱导流的李括号非闭合性检测潜在混杂,大幅缩减因果图搜索空间。

Comments 39 pages

详情
AI中文摘要

最近关于Kan-Do-Calculus (KDC)的工作已经确立了被动观察和主动干预在因果推断中的边界是一个范畴论双伴随,其中干预由左Kan扩展建模,条件作用由右Kan扩展建模。本文在潜在混杂下引入了两种因果发现算法,基于KDC的信息几何和范畴论结果。在光滑统计设置中,观测和干预测度之间的Radon-Nikodym导数诱导局部因果向量场;这些场在李括号下不闭合的失败成为可计算的Frobenius残差,我们将其解释为失败的可视可积性和可能的潜在或未建模结构的证据。我们的第一个算法BRIDGE(用于干预发现和几何估计的括号残差)结合了一个干预密度或Radon-Nikodym比引擎与一个几何筛选器,该筛选器提出一个高召回率的可接受箭头族,识别非闭合的可视对作为潜在障碍候选,并将缩减后的族传递给下游的基于分数或可微的发现程序。第二个算法贡献,谱Kan-Do流匹配(SKFM),学习摊销干预场并在谱上分解潜在曲率,揭示BRIDGE指向的直接李空间端点。一系列详细的实验表明,两种算法都能发现具有潜在混杂的因果模型,同时将可能的DAG的超指数空间缩减多个数量级。本文引入了一种新的因果发现范式,其中潜在结构直接从干预诱导流的几何中推断出来。

英文摘要

Recent work on Kan-Do-Calculus (KDC) has established that the boundary between passive observation and active intervention in causal inference is a category-theoretic bi-adjunction, with interventions modeled by left Kan extensions and conditioning by right Kan extensions. This paper introduces two causal discovery algorithms under latent confounding, building on the information-geometric and categorical consequences of KDC. In smooth statistical settings, Radon-Nikodym derivatives between observational and interventional measures induce local causal vector fields; failures of these fields to close under Lie brackets become computable Frobenius residuals, which we interpret as witnesses of failed visible integrability and possible latent or unmodeled structure. Our first algorithm, BRIDGE (Bracket Residuals for Interventional Discovery and Geometric Estimation), combines an interventional density or Radon-Nikodym-ratio engine with a geometric screen that proposes a high-recall family of admissible arrows, identifies non-closing visible pairs as latent-obstruction candidates, and passes the reduced family to downstream score-based or differentiable discovery routines. The second algorithmic contribution, Spectral Kan-Do Flow Matching (SKFM), learns amortized intervention fields and factors latent curvature spectrally, exposing the direct Lie-space endpoint toward which BRIDGE points. A detailed set of experiments show that both algorithms are capable of discovering causal models with latent confounders while collapsing the super-exponential space of possible DAGs by many orders of magnitude. This paper introduces a new paradigm in causal discovery, where latent structure is inferred directly from the geometry of intervention-induced flows.

2606.19607 2026-06-19 cs.AI stat.AP 新提交

Which Pairs to Compare for LLM Post-Training?

LLM后训练中应比较哪些对?

Jiangze Han, Vineet Goyal, Will Ma

发表机构 * Columbia University(哥伦比亚大学)

AI总结 研究偏好后训练中如何选择最具信息量的比较对,提出基于采样设计的比较策展方法,通过DPO训练的理论分析给出优化准则,实验证明能提升样本效率。

详情
AI中文摘要

基于偏好的后训练已成为对齐语言模型的核心范式。常见的数据收集策略是为每个提示生成少量补全并标注生成的比较对。然而,人工偏好标签通常比生成额外补全昂贵得多,这提示了相同标注预算的不同使用方式:生成更大的补全集,但只标注最具信息量的比较对。本文研究在基于偏好的后训练中应比较哪些对。我们将比较策展形式化为一个采样设计问题,并通过基于偏好的后训练目标下的最终策略质量来评估设计。我们针对直接偏好优化(DPO)实例化该框架,分析标注对的选择如何通过DPO训练传播到下游策略性能。我们的主要结果为DPO训练策略的后训练最优性差距提供了匹配的上界和下界。这些界限表明,比较选择通过一个单一的设计相关信息矩阵影响下游性能,该矩阵将标签分配与参数估计误差和策略次优性联系起来。这为预算受限的比较策展提供了显式优化准则,并激发了从大型生成补全池中选择信息对的实际采样设计。在合成设置和语言模型后训练基准上的实验表明,所提出的设计在样本效率上持续优于常见的比较选择启发式方法。

英文摘要

Preference-based post-training has become a central paradigm for aligning language models. A common data-collection strategy is to generate a small set of completions for each prompt and label the resulting comparison pairs. However, human preference labels are often much more expensive than generating additional completions, suggesting a different use of the same labeling budget: generate a larger pool of completions, but label only the most informative comparison pairs. This paper studies which pairs should be compared in preference-based post-training. We formulate comparison curation as a sampling-design problem and evaluate designs by the quality of the final policy under the preference-based post-training objective. We instantiate this framework for Direct Preference Optimization (DPO), analyzing how the choice of labeled pairs propagates through DPO training to downstream policy performance. Our main results provide matching upper and lower bounds on the post-training optimality gap of the DPO-trained policy. The bounds show that comparison selection affects downstream performance through a single design-dependent information matrix, which links label allocation to parameter estimation error and policy suboptimality. This yields an explicit optimization criterion for budgeted comparison curation and motivates practical sampling designs for selecting informative pairs from large generated completion pools. Experiments on synthetic settings and language-model post-training benchmarks show that the proposed designs consistently improve sample efficiency over common comparison-selection heuristics.

2606.19603 2026-06-19 cs.LG 新提交

Comparing Linear Probes with Mahalanobis Cosine Similarity

比较线性探针与马氏余弦相似度

Zhuofan Josh Ying, Peter Hase, Nikolaus Kriegeskorte

发表机构 * Columbia University(哥伦比亚大学) Stanford University(斯坦福大学) Schmidt Sciences(施密特科学)

AI总结 研究证明马氏余弦相似度与OOD AUROC存在线性关系,提供理论解释并验证其作为线性探针比较指标的有效性。

Comments 16 pages, 10 figures

详情
AI中文摘要

线性探针广泛用于可解释性研究,并常通过余弦相似度进行比较。两个方向之间的马氏余弦相似度(MCS)通过测试数据协方差重新加权内积,是一种自然的任务感知改进。Ying等人(2026)报告称,探针与在分布外(OOD)数据上训练的参考探针的MCS近乎完美地线性预测了该探针的OOD AUROC(R^2 = 0.98)。在这里,我们将这一实证发现扩展到不同模型、层和概念领域,并以封闭形式证明了这一普遍现象:对于投影为高斯分布的平衡类别,OOD AUROC与参考探针的MCS是线性的,因为两者都是探针在测试数据上信噪比(SNR)的S形函数。该理论还预测了这种线性何时失效,我们通过实验验证了这一点。MCS为比较线性探针提供了有理论依据且经验有效的替代方案,优于欧几里得余弦相似度。

英文摘要

Linear probes are widely used in interpretability research and often compared by cosine similarity. The Mahalanobis cosine similarity (MCS) between two directions, which reweights the inner product by test data covariance, is a natural task-aware refinement. Ying et al. (2026) report that a probe's MCS to a reference probe trained on the out-of-distribution (OOD) data near-perfectly linearly predicts the probe's OOD AUROC (R^2 = 0.98). Here, we extend this empirical finding across models, layers, and concept domains, and prove this general phenomenon in closed form: For balanced classes whose projections are Gaussian, OOD AUROC and MCS to the reference probe are linear because both are sigmoid-shaped functions of the probe's signal-to-noise ratio (SNR) on the test data. The theory also predicts when this linearity fails, which we verify empirically. MCS offers a theoretically grounded and empirically effective alternative to Euclidean cosine similarity for comparing linear probes.

2606.19602 2026-06-19 cs.AI 新提交

Configurable Clinical Information Extraction with Agentic RAG: What Works, What Breaks, and Why

可配置的临床信息提取与智能体RAG:什么有效、什么失效及原因

Osman Alperen Çinar-Koraş, Marie Bauer, Sameh Khattab, Merlin Engelke, Moon Kim, Stephan Settelmeier, Shigeyasu Sugawara, Fabian Freisleben, Felix Nensa, Jens Kleesiek

发表机构 * Institute for Artificial Intelligence in Medicine (IKIM), University Medicine Essen(埃森大学医学院人工智能医学研究所) Faculty of Computer Science, University of Duisburg-Essen(杜伊斯堡-埃森大学计算机科学学院) Department of Physics, TU Dortmund University(多特蒙德工业大学物理系) Lamarr Institute for Machine Learning and Artificial Intelligence, TU Dortmund University(多特蒙德工业大学拉马尔机器学习和人工智能研究所) Advanced Clinical Research Center, Fukushima Medical University(福岛医科大学先进临床研究中心) Department of Cardiology and Vascular Medicine, University Hospital Essen(埃森大学医院心血管内科)

AI总结 针对临床文档元数据缺失问题,提出基于智能体RAG的ACIE系统,在埃森大学医学中心部署,通过完整患者上下文推理和源引用验证,在7326次临床判断中实现96.5%的提取接受率。

详情
AI中文摘要

患者上下文涵盖数百份异构文档和数千个结构化数据点,然而AI系统进行检索和分诊所需的文档级元数据缺失或不完整。标准检索增强生成在此类数据上失效,无法处理时间推理、跨文档依赖和缺失元数据。我们在埃森大学医学中心部署了ACIE(智能体临床信息提取):一个本地智能体RAG管道,能够推理完整的患者上下文,并将每个答案基于源段落以供临床医生验证。我们量化了元数据差距,追溯了由此形成的架构决策,并在一项独立的回顾性淋巴瘤注册研究中评估了提取效果,其中核医学医生根据引用的来源验证每个提取值。在7326次判断中,临床医生接受了96.5%的提取结果,按类型划分的接受率从80%到99%不等。

英文摘要

Patient contexts span hundreds of heterogeneous documents and thousands of structured data points, yet the document-level metadata that AI systems need for retrieval and triage is absent or incomplete. Standard retrieval-augmented generation fails on this data, mishandling temporal reasoning, cross-document dependencies, and missing metadata. We deploy ACIE (Agentic Clinical Information Extraction) at University Medicine Essen: an on-premise agentic RAG pipeline that reasons over complete patient contexts and grounds every answer in source passages for clinician verification. We quantify the metadata gap, trace the architectural decisions it shaped, and evaluate extraction alongside an independent retrospective lymphoma registry study, in which nuclear-medicine physicians verify every extracted value against its cited sources. Across 7,326 judgments, clinicians accepted 96.5\% of extractions, with per-type acceptance ranging from 80\% to 99\%.

2606.19598 2026-06-19 cs.RO 新提交

Fail-RAG : A Retrieval Augmented Generation Informed Framework for Robot Failure Identification

Fail-RAG:一种基于检索增强生成的机器人故障识别框架

Ameya Salvi, Jie Hu

发表机构 * Hitachi America, Ltd.(日立美国有限公司)

AI总结 提出Fail-RAG框架,利用检索增强生成和视觉语言模型,通过嵌入故障图像和上下文信息并查询数据库,实现机器人操作故障的高效检测,在仓库自动化任务中平均检测准确率提升25个百分点。

详情
AI中文摘要

工业自动化正经历由技术突破和社会变革驱动的机器人演进:向通用机器人、具身和物理人工智能发展,以及劳动力短缺的加剧。智能自主机器人不仅需要按计划运动,还需对意外事件做出反应。本研究聚焦于仓库中物料搬运机器人的意外事件,将其定义为故障,并开发检测机器人操作故障的方法。由于环境和任务的动态性,故障形式可能变化,基于规则的检测方法可能失效。我们提出'Fail-RAG',一种基于检索增强生成(RAG)的故障检测框架,其中故障图像和上下文信息被嵌入,并通过计算相似度查询故障数据库。进一步使用视觉语言模型(VLM)按照指令模板分析故障并提供细节。通过使用固定机械臂和移动操作器在仓库自动化常见任务中进行仿真和物理实验,评估了Fail-RAG的性能。与使用现成VLM相比,Fail-RAG在五种机器人操作类型上的平均故障检测准确率提高了25个百分点,表明其在真实世界故障检测中的有效性。

英文摘要

Industry automation is witnessing an evolution in robotics driven by both technological breakthroughs and societal changes: progress towards generalist robots, embodied and physical artificial intelligence (AI), and increasing labor shortage in manufacturing.An intelligent autonomous robot needs to not only act according to planned motions but also react to any unexpected events. In this study, we focus on such unexpected events in warehouses where robots are used for material handling. Specifically, we refer to any unexpected events as failures and develop methods to detect robot operations related failures. Rule-based detection methods may break since the form of failures could change due to the dynamic nature of both environments and tasks. We propose 'Fail-RAG', a Retrieval Augmented Generation (RAG)-based failure detection framework where failure images and context information are embedded and queried against a failure database by calculating their similarities. Vision-Language Models (VLMs) are further used to analyze failures and provide details by following our instruction template. We evaluated the performance of Fail-RAG by conducting both simulation and physical experiments using fixed robot arms and a mobile manipulator for multiple tasks that are common in warehouse automation. Fail-RAG achieved 25 percentage point higher failure detection accuracy on average across five types of robot operations compared to using off-the-shelf VLMs, indicating its effectiveness for real-world failure detection.

2606.19597 2026-06-19 cs.SD cs.AI cs.LG 新提交

PrefSQA: Pairwise Preference Prediction for Speech Quality Assessment and the Critical Role of High Quality Datasets

PrefSQA: 用于语音质量评估的成对偏好预测及高质量数据集的关键作用

Junyi Fan, Donald S. Williamson

发表机构 * Department of Computer Science and Engineering, The Ohio State University, USA(美国俄亥俄州立大学计算机科学与工程系)

AI总结 提出PrefSQA模型,通过不确定性感知logits、损伤注意力头和非匹配参考比较模块,利用高质量偏好数据集提升语音质量评估的准确性。

Comments Accepted to INTERSPEECH 2026

详情
AI中文摘要

平均意见得分(MOS)广泛用于语音质量评估,但标量标签对评估者变异性和听力测试差异敏感,这引入了标签噪声,限制了MOS预测的可靠性。偏好预测通过让听者直接比较信号来减少这种变异性,产生更干净的标签。我们研究了无MOS的偏好预测,并提出了PrefSQA,它结合了不确定性感知logits、损伤注意力头以及基于非匹配参考比较的模块。我们使用并精炼了五个数据集,包括MOS衍生和低噪声模拟集(包含匹配和非匹配内容),在人类偏好集上进行实验,并在未见数据上测试。实验表明,在MOS衍生数据上改进较小,而其他数据集显示出相对于基线的明显改进,突显了高质量偏好数据的价值,并证明了所提出方法的有效性。

英文摘要

Mean opinion scores (MOS) are widely used for speech quality assessment, yet scalar labels are sensitive to rater variability and listening test differences. This introduces labeling noise, which limits the reliability of MOS prediction. Preference prediction reduces this variability as listeners compare signals directly, producing cleaner labels. We study MOS-free preference prediction and propose PrefSQA, which incorporates uncertainty-aware logits, an impairment attention head, and a module based on non-matching-reference comparisons. We use and refine five datasets, including MOS-derived and low-noise simulated sets with matching and non-matching content, experiment with human preference sets, and test on unseen data. Experiments show small improvements on MOS-derived data, while other sets reveal clear improvement over the baselines, highlighting the value of high-quality preference data and demonstrating the effectiveness of the proposed method.

2606.19595 2026-06-19 cs.LG cs.AI 新提交

IHBench: Evaluating Post-Interruption Recovery in Voice Agents with Structured Workflows

IHBench:评估语音代理在结构化工作流中的中断后恢复能力

Ahmad Salimi, Wentao Ma, Yuzhi Tang, Dongming Shen, Mu Li, Alex Smola

发表机构 * Boson AI

AI总结 提出IHBench基准,评估语音代理在结构化工作流中处理中断后的恢复能力,涵盖任务完成和恢复质量两个维度,实验表明闭源模型比开源模型更鲁棒。

详情
AI中文摘要

部署在结构化工作流(客户服务、医疗调度、账户管理)中的语音代理必须处理频繁的用户中断,同时保持多步骤程序的进度。现有的语音能力模型基准侧重于中断的时机:闯入检测、端点检测和轮流对话动态。它们忽略了中断后发生的情况:代理是否在正确的步骤恢复工作流?是否处理了用户的插话?是否避免重复用户已经听过的内容?我们引入了IHBench(中断处理基准),这是一个评估语音代理在10个企业领域中执行状态机驱动工作流时的中断后恢复能力的基准。六种中断类型在话语中间的控制点注入,并随数据生成每个中断的评估标准。每个中断在两个轴上评分:任务完成和恢复质量。我们评估了来自OpenAI、Google和开源社区的27个音频-语言模型配置。模型差异很大,恢复质量强烈依赖于中断类型。在我们的实验中,闭源模型比开源模型对中断更鲁棒:它们在任务完成上获胜的频率更高,随着对话变长,性能下降速度慢约3.3倍,并且没有音频与文本模态差距,而开源模型在这三个方面都处于劣势。一项人类研究验证了LLM评判员与人类标注者的一致性,与AudioMultiChallenge的跨基准分析表明,恢复质量在很大程度上是一个独立的能力轴。

英文摘要

Voice agents deployed in structured workflows (customer service, healthcare scheduling, account management) must handle frequent user interruptions while maintaining progress through multi-step procedures. Existing benchmarks for speech-capable models focus on the timing of interruptions: barge-in detection, endpointing, and turn-taking dynamics. They leave unmeasured what happens after the interruption: does the agent resume the workflow at the correct step? Does it address the user's interjection? Does it avoid re-delivering content the user already heard? We introduce IHBench (Interruption Handling Benchmark), a benchmark that evaluates post-interruption recovery in voice agents executing state-machine-driven workflows across 10 enterprise domains. Six interruption types are injected at controlled points mid-utterance, with per-interruption evaluation rubrics generated alongside the data. Each interruption is scored on two axes: task fulfillment and recovery quality. We evaluate 27 audio-language model configurations from OpenAI, Google, and the open-weight community. Models vary widely, and recovery quality depends strongly on the interruption type. Across our experiments, closed-weight models are consistently more robust to interruptions than open-weight ones: they win far more often on task fulfillment, degrade roughly 3.3x more slowly as conversations grow longer, and show no audio-versus-text modality gap, whereas the open-weight models lose ground on all three. A human study validates the LLM judge against human annotators, and a cross-benchmark analysis against AudioMultiChallenge indicates that recovery quality is a largely distinct capability axis.

2606.19594 2026-06-19 cs.LG 新提交

Unsupervised Causal Abstractions Discovery

无监督因果抽象发现

Théo Saulus, Simon Lacoste-Julien, Dhanya Sridhar

发表机构 * Mila - Quebec AI Institute(魁北克人工智能研究所) Université de Montréal(蒙特利尔大学) Canada CIFAR AI Chair(加拿大CIFAR人工智能主席)

AI总结 提出从低层测量数据中直接学习高层结构因果模型的方法,利用低秩因果发现假设,证明低秩图观测诱导的潜变量形成因果抽象,并给出可辨识性结果及实用学习目标。

详情
AI中文摘要

因果抽象形式化了当高层结构因果模型(SCM)捕捉低层SCM的干预行为时的情形。该概念的现有应用主要遵循假设检验范式:专家提出候选高层模型,然后评估低层系统是否实现了它。我们研究了直接从低层测量中学习高层模型的互补问题。我们的贡献利用了低秩因果发现的假设,可以总结如下:(1)我们证明了由低秩图生成的观测数据诱导出形成因果抽象的潜变量,(2)我们提供了关于这些潜变量的可辨识性结果,以及(3)我们提出了一个实用的目标来学习这个高层SCM。

英文摘要

Causal abstractions formalize when a high-level structural causal model (SCM) captures the interventional behavior of a lower-level SCM. Existing applications of this notion largely follow a hypothesis-testing paradigm: an expert proposes a candidate high-level model and then evaluates if the low-level system implements it. We study the complementary problem of learning a high-level model directly from low-level measurements. Our contributions leverage hypotheses from low-rank causal discovery, and can be summarized as follows: (1) we show that observations generated by a low-rank graph induce latents that form a causal abstraction, (2) we provide identifiability results about these latents, and (3) we propose a practical objective to learn this high-level SCM.

2606.19591 2026-06-19 cs.CL cs.AI 新提交

A BART-based approach with hierarchical strategy for Vietnamese abstractive multi-document summarization

基于BART的分层策略用于越南语抽象式多文档摘要

Vu Nguyen Nguyen Xuan, Huy Ngo Quang

发表机构 * Aimesoft JSC(Aimesoft股份公司)

AI总结 提出一种新颖简单的基于黄金摘要缩短文档的分层策略,结合BART模型实现越南语多文档抽象式摘要,在VLSP 2022测试集上达到ROUGE2-F1 0.2468,并利用外部数据增强训练。

Comments originally written in 2022

详情
AI中文摘要

在本技术报告中,我们专注于解决越南语多文档抽象式摘要的挑战,该任务在2022年越南语言与语音处理国际研讨会(VLSP)上提出。我们选择遵循流行的分层方法,即先浓缩每个文档,然后进行聚合和摘要。我们提出了一种新颖而简单的策略来缩短文档,该策略由黄金摘要驱动,从而确保分层方法各阶段之间的高度相关性。我们的方法在VLSP的公开测试集上达到了0.2468的ROUGE2-F1分数,并且能够生成流畅简洁的摘要。此外,我们利用外部来源获取额外数据,这极大地增加了越南语多文档摘要的数据量。额外数据已向社区公开。

英文摘要

In this technical report, we focus on solving the challenge of Vietnamese multi-document abstractive summarization, introduced in the International Workshop on Vietnamese Language and Speech Processing (VLSP) 2022. We choose to follow the popular hierarchical approach, i.e. condensing each document followed by aggregation and summarization. We propose a novel yet simple strategy to shorten documents that is driven by the golden summary, thus ensuring high correlation between stages of the hierarchical approach. Our method achieves a ROUGE2-F1 score of 0.2468 on the VLSP's public test set, and can produce fluent and concise summaries. Additionally, we utilize external sources for extra data, which greatly enhances the quantity of data for Vietnamese multi-document summarization. The additional data is made available for the community.

2606.19590 2026-06-19 cs.RO cs.SY eess.SY 新提交

Safe, Real-Time Active Model Discrimination and Fault Diagnosis for Nonlinear Systems via Differentiable Reachability

通过可微可达性实现非线性系统的安全、实时主动模型辨识与故障诊断

Xinpei Ni, Melkior Ornik, Glen Chou, Samuel Coogan

发表机构 * Institute of Robotics and Intelligent Machines (IRIM), Georgia Institute of Technology(佐治亚理工学院机器人与智能机器研究所) Department of Aerospace Engineering, University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校航空航天工程系)

AI总结 针对不确定非线性系统,提出一种基于可微可达性近似的实时主动故障诊断算法,通过优化控制输入使输出集分离,在保证安全的同时实现快速模型辨识。

详情
AI中文摘要

我们提出了一种安全、实时的算法,用于对具有过程和测量扰动的连续时间不确定非线性系统进行主动故障诊断和模型辨识。给定一组表示正常和故障模式(包括执行器和传感器故障)的候选模型,我们制定了一个输出反馈、时变策略优化问题,该问题(i)在有限时域内鲁棒地强制执行状态输入安全约束,并且(ii)驱动系统产生与至多一个模型一致的采样测量,从而实现确定性诊断。为了实时解决这个问题,我们使用可达状态和输出集的区间过近似开发了一个可处理的近似,并通过一个可微目标函数对诊断能力进行编码,该函数惩罚可能模型的可达输出集之间的重叠。由此产生的优化使用基于梯度的JAX和可微可达性原语在线高效求解。我们在几个高维非线性机器人系统(包括模拟四旋翼和战斗机模型、硬件差速驱动机器人和四足导航)上评估了我们的方法,用于传感器和执行器故障诊断(最多11种故障模式)。在这些案例研究中,我们的方法在50毫秒内实现了可靠的模型辨识,在辨识成功率和速度上优于基线方法,同时提供了形式化的安全保证。

英文摘要

We present a safe, real-time algorithm for active fault diagnosis and model discrimination for uncertain continuous-time nonlinear systems with process and measurement disturbances. Given a finite set of candidate models representing nominal and faulty modes, including actuator and sensor faults, we formulate an output-feedback, time-varying policy optimization problem that (i) robustly enforces state-input safety constraints over a finite horizon and (ii) drives the system to produce sampled measurements consistent with at most one model, enabling deterministic diagnosis. To solve this problem in real time, we develop a tractable approximation using interval over-approximations of reachable state and output sets, and encode diagnosability via a differentiable objective that penalizes overlap between the reachable output sets of possible models. The resulting optimization is solved efficiently online with gradient-based methods using JAX and differentiable reachability primitives. We evaluate our method on sensor and actuator fault diagnosis (up to 11 fault modes) in several high-dimensional nonlinear robotic systems, including a simulated quadrotor and fighter-jet model, a hardware differential-drive robot, and quadrupedal navigation. Across these case studies, our approach achieves reliable model discrimination in under 50 ms, outperforming baselines in discrimination success rate and speed while providing formal safety guarantees.

2606.19588 2026-06-19 cs.AI cs.CR cs.LO 新提交

Analyzing the Narration Gap in LLM-Solver Loops

分析大语言模型-求解器循环中的叙述差距

Zunchen Huang, Songgaojun Deng

发表机构 * Eindhoven University of Technology(埃因霍温理工大学)

AI总结 研究LLM与SAT/SMT求解器混合推理中,将求解器输出转化为用户答案的叙述步骤存在的安全漏洞,通过形式化建模和实验评估发现证书门控可保证求解结果正确,但对抗攻击可反转结论。

详情
AI中文摘要

诸如SAT和SMT求解器之类的形式化工具,当安全或安保关键问题可以用逻辑表述时,越来越多地被嵌入到语言模型推理流程中。与思维链不同(其步骤从模型分布中采样,没有形式化保证),求解器产生可靠且可独立验证的答案。然而,这种可靠性保证可能在求解器与模型之间的交互中丢失。混合流程包含三个组成部分:形式化问题、求解问题以及叙述结果。先前的工作研究了形式化和求解,但未涉及叙述——即将形式化工具的输出转化为用户答案的步骤。为了填补叙述差距,我们首先将LLM-求解器循环建模为经过验证的决策过程。我们进一步在提示注入下评估了五个开源模型,发现证书门控使求解器判定可靠,而攻击者可以通过不同措辞和渠道反转已验证的结论。我们研究了通过强化提示进行缓解的方法,该方法显著减少了注入但无法完全消除,并且在自适应攻击下仍然存在问题。结合形式化分析和实证研究,我们表明在LLM-求解器循环中,鲁棒性无法延伸到用户最终读取的答案。

英文摘要

Formal tools such as SAT and SMT solvers are increasingly embedded in language model reasoning pipelines when a safety or security critical question can be formulated in logic. Unlike chain of thought whose steps are sampled from the model distribution without formal guarantee, a solver produces a sound and independently verifiable answer. However, the soundness guarantee can be lost in the interaction between the solver and the model. The hybrid pipeline has three components: formalizing the question, deciding it, and narrating the result. Prior work has studied the formalization and decision, but not narration, which is the step that turns a formal tool's output into the user answer. To fill the narration gap, we first model the LLM-solver loop as a verified decision procedure. We further evaluate five open-sourced models under prompt injection, and we find certificate gating makes the solver verdict sound, while an adversary can invert a verified conclusion across phrasings and channels. We study the mitigation through hardened prompt that reduces injection significantly but cannot eliminate it and still suffers under adaptive attack. Combining the formal analysis and empirical studies, we show in the LLM-solver loop, robustness does not reach to the answer that the user finally reads.

2606.19586 2026-06-19 cs.RO 新提交

One Demo is Worth a Thousand Trajectories: Action-View Augmentation for Visuomotor Policies

一个演示胜过千条轨迹:用于视觉运动策略的动作-视角增强

Chuer Pan, Litian Liang, Dominik Bauer, Eric Cousineau, Benjamin Burchfiel, Siyuan Feng, Shuran Song

发表机构 * Stanford University(斯坦福大学) Columbia University(哥伦比亚大学) Toyota Research Institute(丰田研究所)

AI总结 提出一种数据增强框架,通过高斯泼溅和轨迹优化生成逼真的鱼眼图像序列和物理可行的动作轨迹,提升操作策略在场景变化和障碍物下的成功率。

Comments Project website: https://chuerpan.com/1001-demos.github.io/. Published at CoRL 2025

Journal ref Proceedings of The 9th Conference on Robot Learning, PMLR 305:3902-3914, 2025

详情
AI中文摘要

用于操作的视觉运动策略在建模复杂机器人行为方面展现出显著潜力,但机器人初始配置的微小变化和未见障碍物容易导致分布外观测。在没有大量数据收集工作的情况下,这些会导致灾难性的执行失败。在这项工作中,我们引入了一个有效的数据增强框架,该框架从真实世界的眼在手演示中生成视觉上逼真的鱼眼图像序列和相应的物理上可行的动作轨迹,这些演示使用带有单个鱼眼摄像头的便携式平行夹爪捕获。我们引入了一种新颖的高斯泼溅公式,适用于广角鱼眼摄像头,以重建和编辑带有未见物体的3D场景。我们利用轨迹优化生成平滑、无碰撞、视图渲染友好的动作轨迹,并从相应新视角渲染视觉观测。在仿真和现实世界中的综合实验表明,我们的增强框架提高了各种操作任务在相同场景和需要避障的增强场景中的成功率。

英文摘要

Visuomotor policies for manipulation have demonstrated remarkable potential in modeling complex robotic behaviors, yet minor alterations in the robot's initial configuration and unseen obstacles easily lead to out-of-distribution observations. Without extensive data collection effort, these result in catastrophic execution failures. In this work, we introduce an effective data augmentation framework that generates visually realistic fisheye image sequences and corresponding physically feasible action trajectories from real-world eye-in-hand demonstrations, captured with a portable parallel gripper with a single fisheye camera. We introduce a novel Gaussian Splatting formulation, adapted to wide FoV fisheye cameras, to reconstruct and edit the 3D scene with unseen objects. We utilize trajectory optimization to generate smooth, collision-free, view-rendering-friendly action trajectories and render visual observations from corresponding novel views. Comprehensive experiments in simulation and the real world show that our augmentation framework improves the success rate for various manipulation tasks in both the same scene and the augmented scene with obstacles requiring collision avoidance.

2606.19584 2026-06-19 cs.CV 新提交

Language-Instructed Vision Embeddings for Controllable and Generalizable Perception

语言引导的视觉嵌入用于可控且可泛化的感知

Chengzhi Mao, Xudong Lin, Wen-Sheng Chu

发表机构 * Google(谷歌)

AI总结 提出语言引导视觉嵌入(LIVE)方法,利用语言动态引导视觉编码器生成任务中心嵌入,无需任务特定重训练,减少视觉幻觉并提升泛化能力。

Journal ref Published as a conference paper at ICLR 2026

详情
AI中文摘要

视觉基础模型通常被训练为静态特征提取器,将任务适应的负担转移到大型下游模型上。我们提出另一种范式:不是仅将视觉特征输入语言模型,而是使用语言本身动态引导视觉编码器。我们的方法,语言引导视觉嵌入(LIVE),利用语言作为高层指导在推理时生成以任务为中心的嵌入,消除了任务特定重训练的需要。这使得编码器能够关注输入中上下文相关的方面,产生更可控和可泛化的表示。实验上,LIVE减少了视觉幻觉(在MMVP上提升34分),在视觉问答上超越了参数数量大几个数量级的视觉语言模型,并泛化到未见过的指令和任务——为自适应的、指令驱动的视觉智能提供了直接路径。

英文摘要

Vision foundation models are typically trained as static feature extractors, placing the burden of task adaptation onto large downstream models. We propose an alternative paradigm: instead of solely feeding visual features into language models, we use language itself to dynamically guide the vision encoder. Our method, Language-Instructed Vision Embeddings (LIVE), leverages language as high-level guidance to produce task-centric embeddings at inference time, removing the need for task-specific retraining. This enables the encoder to focus on contextually relevant aspects of the input, yielding more controllable and generalizable representations. Empirically, LIVE reduces visual hallucinations (+34 points on MMVP), surpasses vision-language models with orders of magnitude more parameters on visual question answering, and generalizes to unseen instructions and tasks -- offering a direct path toward adaptive, instruction-driven visual intelligence.

2606.19579 2026-06-19 cs.SD cs.AI 新提交

FlowFake: Liquid Networks for Audio Deepfake Detection

FlowFake: 用于音频深度伪造检测的液态网络

Shivaay Dhondiyal, Divyansh Sharma, Dinesh Kumar Vishwakarma

发表机构 * Delhi Technological University(德里理工大学)

AI总结 针对音频深度伪造检测中跨数据集泛化失败的问题,提出基于液态时间常数(LTC)架构的FlowFake模型,通过学习ODE演化隐藏状态并自适应时间常数,以34K参数在跨域基准上超越现有方法。

Comments Accepted at the Workshop on Learning to Listen: Machine Learning for Audio at ICML 2026

详情
AI中文摘要

由神经文本转语音和语音克隆系统生成的音频深度伪造对说话人验证和公共话语构成大规模威胁。核心挑战是跨数据集泛化:在一种合成流水线上训练的检测器在面对未见过的伪造时性能崩溃。我们认为这种失败主要是由于结构性合成语音伪影,这些伪影是多时间尺度的轨迹异常。尽管每个现有检测器都聚合固定窗口的帧统计量,但这使得架构与信号不对齐。我们提出FlowFake,一种液态时间常数(LTC)架构,其隐藏状态通过学习ODE演化,每个神经元具有自适应时间常数,同时解析频谱(10ms)和韵律(2s)线索。仅34K参数,FlowFake实现了正式的BIBO稳定性和O(dt^4)积分误差。在四个数据集的跨域基准(ASVspoof2019-LA、FakeOrReal、InTheWild、MLAAD)上,FlowFake在仅用FakeOrReal训练时在ASVspoof2019上达到75.29%,仅用MLAAD训练时达到79.97%。它在每个评估对上优于RawGAT-ST和Whisper-DF,并以0.01%的参数数量匹配SSL Wav2vec2(大300倍)。源代码可在以下网址获取:this https URL

英文摘要

Audio deepfakes generated by neural text-to-speech and voice-cloning systems threaten speaker verification and public discourse at scale. The core challenge is cross-dataset generalization: detectors trained on one synthesis pipeline collapse on unseen forgeries. We argue that this failure is primarily because of structural synthetic speech artifacts which are multi-timescale trajectory anomalies. Though every existing detector aggregates a fixed-window frame statistics, this misaligns the architecture with the signal. We propose FlowFake, a Liquid Time-Constant (LTC) architecture whose hidden state evolves via a learned ODE, with per-neuron adaptive time constants simultaneously resolving spectral (10ms) and prosodic (2s) cues. At only 34K parameters FlowFake achieves formal BIBO stability and O(dt^4) integration error. On a four-dataset cross domain benchmark (ASVspoof2019-LA, FakeOrReal, InTheWild, MLAAD), FlowFake reaches 75.29% on ASVspoof2019 trained only on FakeOrReal and 79.97% trained only on MLAAD. It outperforms RawGAT-ST and Whisper-DF on every evaluated pair and matching SSL Wav2vec2 (300x larger) at 0.01% of its parameter count. The source code is available on : https://github.com/GhostRider2023/FlowFake

2606.19569 2026-06-19 cs.LG 新提交

On the QUEST for Uncertainty Quantification via Highest Density Regions

论基于最高密度区域的量化不确定性探索

Sam Goring, Tom Kuipers, Nicola Paoletti, David S. Watson

发表机构 * Northeastern University London(东北大学伦敦校区)

AI总结 针对概率机器学习中回归问题的不确定性量化,提出基于最高密度区域体积的QUEST框架,满足单调性和平移不变性公理,在选择性预测基准上优于方差和微分熵。

Comments 27 pages, of which 10 are main text. Contains 7 figures, 4 tables, 1 algorithm in total

详情
AI中文摘要

不确定性量化对于概率机器学习中安全关键应用的可靠决策至关重要。对于回归问题,主流的标量不确定性量化方法——特别是基于适当评分规则的方法——通过逐点预测风险来衡量不确定性。当目标统计量不是条件期望时,这可能导致反直觉的结果。我们提出了一种替代框架,其中不确定性通过分布支持的最可能子集的体积来表征。QUEST(通过最高密度区域量化不确定性)是一种基于勒贝格测度在分布峰值处集中程度的新颖不确定性量化方法,在鲁棒性参数$\alpha$的一个或多个值处进行评估。我们建立了我们的度量与信息论和经济学中经典统计量之间的联系。我们表明,与基于适当评分规则的流行替代方案不同,QUEST的认知不确定性和偶然不确定性度量满足从不确定性量化文献中改编的一组公理,包括在分布扩散下的单调性和位置偏移的不变性。选择性预测基准证实,QUEST在方差和微分熵等标准度量上表现良好。

英文摘要

Uncertainty quantification (UQ) is essential for reliable decision-making in safety-critical applications in probabilistic machine learning. For regression problems, dominant scalar UQ approaches - notably, those based on proper scoring rules - measure uncertainty via pointwise predictive risk. This can lead to counterintuitive results when the target statistic is not the conditional expectation. We propose an alternative framework, in which uncertainty is characterised by the volume of the most probable subset of a distribution's support. QUEST (Quantifying Uncertainty via highest dEnSiTy regions) is a novel approach to UQ based on the concentration of Lebesgue measure at a distribution's peak(s), evaluated at one or more values of a robustness parameter $α$. We establish connections between our measures and classical statistics from information theory and economics. We show that, unlike popular alternatives based on proper scoring rules, QUEST measures of epistemic and aleatoric uncertainty satisfy a set of axioms adapted from the UQ literature, including monotonicity under distributional spread and invariance to location shifts. Selective prediction benchmarks confirm that QUEST performs favourably against standard measures such as variance and differential entropy.

2606.19565 2026-06-19 cs.CV 新提交

Mix-QVLA: Task-Evidence-Aware Mixed-Precision Quantization of Vision-Language-Action Models

Mix-QVLA:任务证据感知的视觉-语言-动作模型混合精度量化

Navin Ranjan, Andreas Savakis

发表机构 * Rochester Institute of Technology(罗彻斯特理工学院)

AI总结 提出Mix-QVLA框架,通过任务证据感知的混合精度后训练量化,在保持任务性能的同时大幅降低VLA模型的内存和计算开销,在LIBERO上实现4.1GB内存和1.52倍加速。

详情
AI中文摘要

我们提出Mix-QVLA,一种针对VLA模型的任务证据感知混合精度PTQ框架。Mix-QVLA将每个量化变体锚定到全精度动作令牌参考决策,并评估量化是否在关键VLA功能边界上保留了任务相关证据。它从边界激活计算归一化的梯度加权任务证据图,并使用证据质量和归因分布失真比较全精度和量化图,捕捉决策支持证据的强度和分配变化。一个软瓶颈目标将边界级退化聚合为层敏感度分数。Mix-QVLA进一步在整个任务执行过程中建模敏感度,捕捉层重要性的阶段依赖变化,而不是假设固定的敏感度分布。由此产生的证据和时间感知分数指导在模型大小和BitOps预算下的混合精度位分配。在OpenVLA风格策略上的广泛评估表明,Mix-QVLA改善了低比特VLA部署的精度-效率权衡。在LIBERO上,Mix-QVLA将OpenVLA-OFT内存从15.4 GB减少到4.1 GB,保留了96.3的平均成功率(BF16模型为97.1),并实现了1.52倍的推理加速。

英文摘要

We propose Mix-QVLA, a task-evidence-aware mixed-precision PTQ framework for VLA models. Mix-QVLA anchors each quantized variant to the full-precision action-token reference decision and evaluates whether quantization preserves task-relevant evidence across key VLA functional boundaries. It computes normalized gradient-weighted task-evidence maps from boundary activations and compares full-precision and quantized maps using evidence-mass and attribution-distribution distortion, capturing changes in both the strength and allocation of decision-supporting evidence. A soft-bottleneck objective aggregates boundary-level degradation into layer-wise sensitivity scores. Mix-QVLA further models sensitivity throughout task execution, capturing phase-dependent shifts in layer importance rather than assuming a fixed sensitivity profile. The resulting evidence- and time-aware scores guide mixed-precision bit allocation under model-size and BitOps budgets. Extensive evaluations on OpenVLA-style policies show that Mix-QVLA improves the accuracy-efficiency trade-off of low-bit VLA deployment. On LIBERO, Mix-QVLA reduces OpenVLA-OFT memory from 15.4 GB to 4.1 GB, retains 96.3 average success compared with 97.1 for the BF16 model, and achieves a 1.52x inference speedup.

2606.19561 2026-06-19 cs.RO cs.SY eess.SY 新提交

pdSTL: Probabilistic Differentiable Signal Temporal Logic for Stochastic Systems

pdSTL: 面向随机系统的概率可微信号时序逻辑

Bennett Dogbey, Hemanth Manjunatha

发表机构 * Oklahoma State University(俄克拉荷马州立大学)

AI总结 提出pdSTL框架,将概率语义与可微鲁棒性结合,通过区间值概率语义和LSTM式展开实现线性时间可微监控,在障碍物规避、换道和真实四旋翼飞行实验中优于确定性可微STL。

详情
AI中文摘要

在不确定环境中运行的自主机器人必须满足复杂的时序和安全规范,尽管存在随机动力学和感知噪声。虽然信号时序逻辑(STL)为基于梯度的优化提供了鲁棒性度量,但现有的扩展要么缺乏可微性,要么忽略了信念空间的不确定性。我们引入了pdSTL(概率可微信号时序逻辑),这是一个将概率语义与信念轨迹上的可微鲁棒性统一起来的框架。pdSTL采用区间值概率语义来计算保守的满足界限,并通过STL语法树组合传播。我们将时序鲁棒性评估制定为STL算子的循环、LSTM式展开,从而实现适用于端到端轨迹优化的线性时间、可微监控。我们在模拟障碍物规避、换道操作以及真实世界的Crazyflie四旋翼飞行实验中验证了pdSTL,这些实验在气动干扰下进行。结果表明,pdSTL在保持形式化概率保证的同时实现了高效优化,在现实世界的不确定性下,在维持安全裕度方面显著优于确定性可微STL。

英文摘要

Autonomous robots operating in uncertain environments must satisfy complex temporal and safety specifications despite stochastic dynamics and sensing noise. While Signal Temporal Logic (STL) offers robustness measures for gradient-based optimization, existing extensions either lack differentiability or ignore belief-space uncertainty. We introduce pdSTL (probabilistic differentiable Signal Temporal Logic), a framework that unifies probabilistic semantics with differentiable robustness over belief trajectories. pdSTL employs interval-valued probabilistic semantics to compute conservative satisfaction bounds, propagated compositionally through the STL syntax tree. We formulate the temporal robustness evaluation as a recurrent, LSTM-style unfolding of STL operators, enabling linear-time, differentiable monitoring suitable for end-to-end trajectory optimization. We validate pdSTL on simulated obstacle avoidance, lane-change maneuvers, and real-world Crazyflie quadcopter flight experiments under aerodynamic disturbances. Results demonstrate that pdSTL achieves efficient optimization with formal probabilistic guarantees, significantly outperforming deterministic differentiable STL in maintaining safety margins under real-world uncertainty.

2606.19560 2026-06-19 cs.LG 新提交

Understanding Key Features of Time Series Foundation Models from Epidemic Forecasting

从流行病预测理解时间序列基础模型的关键特征

Alireza Jafari, Judy Fox, Geoffrey C. Fox, Madhav Marathe, Aniruddha Adiga

发表机构 * Department of Computer Science, School of Engineering and Applied Science, University of Virginia(弗吉尼亚大学工程与应用科学学院计算机科学系) School of Data Science, University of Virginia(弗吉尼亚大学数据科学学院) Biocomplexity Institute, University of Virginia(弗吉尼亚大学生物复杂性研究所) Department of Electrical and Computer Engineering, School of Engineering and Applied Science, University of Virginia(弗吉尼亚大学工程与应用科学学院电气与计算机工程系)

AI总结 系统评估多种时间序列模型在流感预测中的表现,发现混合专家模型性能最优,预训练在长时域提升显著,而LLM方法效果较差。

Comments 15 pages, 2 figures, 9 tables

详情
AI中文摘要

季节性流感每年感染数百万人,并在美国造成大量发病和死亡,因此准确的短期预测成为核心公共卫生需求。可靠的流行病时间序列预测可以为疫苗接种时机、医院人员配备和资源分配提供信息,然而现代预测架构在传染病监测数据上的比较行为仍未得到充分表征。我们通过系统评估区域流感预测来填补这一空白,使用流感样疾病监测和流感相关住院时间序列,在时间泛化和空间泛化设置下进行1-4周提前预测。我们比较了经典神经网络架构、基于数值的Transformer模型、预训练时间序列基础模型和基于LLM的预测方法。在各项任务中,我们证明融合多个预训练预测器的混合专家模型实现了最强的整体性能,表明异质预训练表示提供了互补的预测信息。我们的结果进一步表明,基于数值的Transformer模型产生可靠的预测,而预训练在更长时域上提供最大增益,特别是当预训练领域与流感动力学机制一致时。相比之下,基于LLM的时间序列方法在此设置下表现不如数值预测器。最后,我们研究了住院信息作为辅助协变量和预训练源的作用。住院信号在特定设置中提供了互补的改进,并阐明了额外的监测流如何增强多时域预测的鲁棒性。这些发现为流感防范的模型选择、预训练策略和辅助信号使用提供了可操作的指导。

英文摘要

Seasonal influenza infects millions of people and causes substantial morbidity and mortality in the United States each year, making accurate short-term forecasting a core public-health need. Reliable forecasts of epidemic time series can inform vaccination timing, hospital staffing, and resource allocation, yet the comparative behavior of modern forecasting architectures on infectious-disease surveillance data remains insufficiently characterized. We address this gap through a systematic evaluation of regional influenza forecasting using influenza-like illness surveillance and influenza-associated hospitalization time series under both temporal and spatial generalization settings for 1-4-week-ahead prediction. We compare classical neural network architectures, numerical transformer-based models, pretrained time series foundation models, and LLM-based forecasting approaches. Across tasks, we demonstrate that a mixture-of-experts model that fuses multiple pretrained forecasters achieves the strongest overall performance, indicating that heterogeneous pretrained representations provide complementary predictive information. Our results further show that numerical transformer-based models produce reliable forecasts, while pretraining provides the largest gains at longer horizons, particularly when the pretraining domain is mechanistically aligned with influenza dynamics. In contrast, LLM-based time series methods underperform relative to numerical forecasters in this setting. Finally, we examine hospitalization information as both an auxiliary covariate and a pretraining source. Hospitalization signals provide complementary improvements in selected settings and clarify when additional surveillance streams enhance the robustness of multi-horizon forecasting. These findings provide actionable guidance on model selection, pretraining strategy, and auxiliary-signal use for influenza preparedness.

2606.19559 2026-06-19 cs.AI cs.CL 新提交

Uncertainty Decomposition for Clarification Seeking in LLM Agents

LLM代理中寻求澄清的不确定性分解

Gregory Matsnev

发表机构 * AI Talent Hub, ITMO University(AI Talent Hub, ITMO大学)

AI总结 提出一种基于提示的不确定性分解方法,将行动置信度与请求不确定性分离,使代理能在任务规范模糊时主动寻求澄清,在五个LLM骨干上平均澄清F1提升36%-73%。

Comments 26 pages, 8 figures. Source code: https://github.com/PE51K/udcs-in-llm-agents

详情
AI中文摘要

最近的立场论文认为,经典的偶然/认知不确定性框架对于交互式大型语言模型(LLM)代理是不够的,并呼吁需要一种对欠规范感知、可分解且可通信的不确定性表示,以解锁新的代理能力,如主动寻求澄清和共享心理模型构建。实际部署约束——黑盒API、交互延迟预算以及缺乏标注轨迹——排除了基于logprob、多采样和基于训练的方法,使得基于提示的估计成为在部署时浮现此类信号的最可行方案。我们通过一种简单的基于提示的分解来响应这一呼吁,该分解将行动置信度与请求不确定性(u)分离,使代理能在任务规范模糊时请求澄清。为了评估它,我们引入了两个增强澄清的基准(WebShop-Clarification和ALFWorld-Clarification),其中50%的任务被故意欠规范,并在这些变体以及用于故障检测的标准WebShop、ALFWorld和REAL基准上,系统地将所提出的分解与ReAct+UE和不确定性感知记忆(UAM)在五个LLM骨干(GPT-5.1、DeepSeek-v3.2-exp、GLM-4.7、Qwen3.5-35B、GPT-OSS-120B)上进行比较。在五个骨干上平均,所提出的分解在ALFWorld-Clarification上比ReAct+UE提高了73%的澄清F1,比UAM提高了36%,并且在WebShop-Clarification的每个骨干以及ALFWorld-Clarification的五个骨干中的四个上领先澄清F1,表明增益超越了单个LLM。

英文摘要

Recent position papers argue that the classical aleatoric/epistemic uncertainty framework is insufficient for interactive large language model (LLM) agents and call for underspecification-aware, decomposed, and communicable uncertainty representations that can unlock new agent capabilities such as proactive clarification seeking and shared mental-model building. Practical deployment constraints -- black-box APIs, interactive latency budgets, and the absence of labeled trajectories -- rule out logprob-based, multi-sampling, and training-based methods, leaving prompt-based estimation as the most viable family for surfacing such signals at deployment time. We answer this call with a simple prompt-based decomposition that separates action confidence from request uncertainty (u), enabling the agent to ask for clarification when the task specification is ambiguous. To evaluate it, we introduce two clarification-augmented benchmarks (WebShop-Clarification and ALFWorld-Clarification) in which 50% of tasks are deliberately underspecified, and systematically compare the proposed decomposition against ReAct+UE and Uncertainty-Aware Memory (UAM) across five LLM backbones (GPT-5.1, DeepSeek-v3.2-exp, GLM-4.7, Qwen3.5-35B, GPT-OSS-120B) on these variants together with the standard WebShop, ALFWorld, and REAL benchmarks for fault detection. Averaged across the five backbones, the proposed decomposition improves clarification F1 on ALFWorld-Clarification by 73% over ReAct+UE and by 36% over UAM, and leads clarification F1 on every backbone on WebShop-Clarification and on four of five backbones on ALFWorld-Clarification, indicating that the gains generalize beyond a single LLM.

2606.19558 2026-06-19 cs.LG cs.CL 新提交

Displacement Is Not Direction: Evaluating Fidelity Metrics for Quantized LLM Deployment

位移不是方向:评估量化LLM部署的保真度指标

Miloš Nikolić, Ali Hadi Zadeh, Enrique Torres Sanchez, Andreas Moshovos

发表机构 * ByteShape University of Toronto(多伦多大学) Vector Institute for Artificial Intelligence(向量人工智能研究所)

AI总结 本文研究KL散度等保真度指标在量化语言模型部署中与下游基准分数的相关性,发现整体强相关但在近基线区域失效,归因于KL散度主要衡量分歧量而非方向。

详情
AI中文摘要

保真度指标,如每个token的KL散度(KLD)与高精度参考模型的比较,常被用作基准质量的低成本代理。我们在Qwen3.6-35B-A3B的28个量化模型和Devstral-Small-2-24B的41个量化模型上,通过一系列下游基准测试验证了这一做法。我们发现,在整个量化队列中,KLD与基准分数强相关(Qwen上ρ=-0.72,Devstral上ρ=-0.86,p<0.001)。然而,在接近基线的静默区,这种关系变得不显著(Qwen上ρ=+0.00,Devstral上ρ=-0.24,p=0.36)。这种失效在14种测量变体中持续存在,包括不同的KLD聚合方式、困惑度公式、top-1一致性、校准语料库和上下文长度。在逐提示层面,KLD在代码任务上仅有较弱的失败预测能力,在LiveCodeBench上五个模型的失败与通过几何平均比在[1.08,1.22]之间,并且作为跨模型路由器失败,在分歧提示上仅达到42.3%-49.4%的准确率。我们将这种失效归因于结构分解:KLD主要衡量与参考模型的分歧量,在静默区复合ρ在Qwen上为+0.94(p<0.001),在Devstral上为+0.55(p=0.03),而其与分歧方向的关系较弱且依赖于任务。

英文摘要

Fidelity metrics, such as per-token KL divergence (KLD) against a high-precision reference, are often used in practice as low-cost proxies for benchmark quality. We test this practice on a 28-quant cohort of Qwen3.6-35B-A3B and a 41-quant cohort of Devstral-Small-2-24B, evaluated across a suite of downstream benchmarks. We find that KLD is strongly correlated with benchmark score over the full cohort ($ρ=-0.72$ on Qwen and $ρ=-0.86$ on Devstral, both with $p<0.001$). However, this relationship collapses to non-significance in the near-baseline silent zone ($ρ=+0.00$ on Qwen and $ρ=-0.24$, $p=0.36$, on Devstral). This collapse persists across 14 measurement variants, including different KLD aggregations, perplexity formulations, top-1 agreement, calibration corpora, and context lengths. At the per-prompt level, KLD has only weak failure-prediction power on code, with failed-vs-passed geometric-mean ratios in $[1.08,1.22]$ across five models on LiveCodeBench, and fails as a cross-model router, achieving only $42.3\%-49.4\%$ accuracy on disagreement prompts. We trace the collapse to a structural decomposition: KLD primarily measures the volume of disagreement with the reference, with silent-zone composite $ρ=+0.94$ ($p<0.001$) on Qwen and $+0.55$ ($p=0.03$) on Devstral, while its relationship to the direction of those disagreements is weak and task-conditional.

2606.19555 2026-06-19 cs.RO 新提交

SCAN-Planner: Spatial Collision-Aware Local Planning for Route-Guided Long-Range Quadruped Navigation

SCAN-Planner:用于路线引导的远程四足导航的空间碰撞感知局部规划

Han Zheng, Zhe Chen, Yiwen Fu, Ming Yang, Tong Qin

发表机构 * Shanghai Jiao Tong University(上海交通大学)

AI总结 提出SCAN-Planner框架,通过偏航感知双圆柱足迹和投影A*搜索实现空间碰撞感知的局部规划,在密集杂乱、3D非结构化环境和远程导航中生成安全平滑轨迹。

详情
AI中文摘要

四足机器人越来越需要能够在狭窄通道、杂乱室内场景和大规模3D非结构化环境中导航。现有的局部规划器通常使用各向同性几何膨胀来近似机器人,或依赖于平面和高程图表示,导致在狭窄空间中的保守运动以及对悬垂结构的推理有限。本文提出了SCAN-Planner,一种用于远程四足导航的空间碰撞感知局部规划框架。使用偏航感知的双圆柱足迹来建模细长的机器人身体,通过在膨胀的3D占用地图中进行稀疏查询实现全身碰撞评估。我们进一步引入投影A*搜索,在插值的地面跟随表面上生成无碰撞引导,并通过z梯度抑制来水平避开障碍物同时保持垂直稳定性。对于大规模部署,具有边界回退的机器人中心滑动地图提供高分辨率局部碰撞检查并从局部死胡同中恢复。仿真和真实实验表明,SCAN-Planner在密集杂乱、3D非结构化场景、楼梯穿越和远程导航任务中生成安全、平滑且高效的轨迹。

英文摘要

Quadruped robots are increasingly expected to navigate through narrow passages, cluttered indoor scenes, and large-scale 3D unstructured environments. Existing local planners commonly approximate the robot using isotropic geometric inflation or rely on planar and elevation-map representations, leading to conservative motion in tight spaces and limited reasoning about overhanging structures. This letter presents SCAN-Planner, a spatial collision-aware local planning framework for long-range quadruped navigation. A yaw-aware twin-cylinder footprint is used to model the elongated robot body, enabling whole-body collision evaluation through sparse queries in an inflated 3D occupancy map. We further introduce a projected A* search that generates collision-free guidance on an interpolated ground-following surface, with z-gradient suppression to avoid obstacles horizontally while maintaining vertical stability. For large-scale deployment, a robot-centric sliding map with boundary fallback provides high-resolution local collision checking and recovery from local dead ends. Simulation and real-world experiments demonstrate that SCAN-Planner generates safe, smooth, and efficient trajectories in dense clutter, 3D unstructured scenes, stair traversal, and long-range navigation tasks.

2606.19552 2026-06-19 cs.CL 新提交

LaViSA: A Language and Vision Structural Ambiguity Benchmark

LaViSA:语言与视觉结构歧义基准

Lee Sangmyeong, Shun Inadumi, Koichiro Yoshino

发表机构 * Nara Institute of Science and Technology(奈良先端科学技术大学院大学) Guardian Robot Project RIKEN(RIKEN守护机器人项目) The University of Osaka(大阪大学)

AI总结 提出LaViSA基准,通过七类歧义句及对应图像评估视觉语言模型利用视觉场景解决结构歧义的能力,实验显示现有模型虽能部分利用视觉信息,但在特定歧义类型和细微语义区分上仍有局限。

详情
AI中文摘要

结构歧义是指单个句子由于其句法结构而产生多种有效解释,这给语言理解带来了基本挑战。视觉场景可作为解决此类歧义的有用线索,视觉语言模型(VLM)需要能够从视觉场景中推导出可能的语义解释。我们引入了语言与视觉结构歧义(LaViSA)基准,旨在评估VLM利用视觉场景解决结构歧义的能力。LaViSA包含歧义句子、其消歧句子以及这些消歧句子对应的图像,涵盖七类歧义。利用LaViSA,我们对多种VLM进行了全面评估,包括专有模型和开源模型,参数规模和推理能力各异。实验结果表明,尽管最近的VLM能在一定程度上利用视觉场景解决结构歧义,但它们仍然在特定歧义类型和视觉上微妙的语义区分上存在困难,表明在利用视觉场景解决结构歧义方面仍存在局限性。

英文摘要

Structural ambiguity arises when a single sentence admits multiple valid interpretations due to its syntactic structure, posing a fundamental challenge for language understanding. Visual scenes serve as useful cues for resolving such ambiguity, and Vision and Language Models (VLMs) need to be capable of deriving possible semantic interpretations from visual scenes. We introduce Language and Vision Structural Ambiguity (LaViSA), a benchmark designed to evaluate the ability of VLMs to resolve structural ambiguity leveraging visual scenes. LaViSA consists of ambiguous sentences, their disambiguated sentences, and corresponding images of these disambiguated sentences across seven ambiguity categories. Using LaViSA, we conduct a comprehensive evaluation of diverse VLMs, including both proprietary and open-source models with varying parameter scales and reasoning capabilities. Experimental results show that although recent VLMs can leverage visual scenes to resolve structural ambiguity to a some extent, they still struggle with certain ambiguity types and visually subtle semantic distinctions, indicating remaining limitations in resolving structural ambiguity using visual scenes.

2606.19549 2026-06-19 cs.LG 新提交

Predicting Mergeability of Parameter-Efficient Fine-Tuning Updates

预测参数高效微调更新的可合并性

Lin Tang, Wei Zhang, Jing Li, Hongyu Chen, Ming Zhao, Yuxuan Wang

发表机构 * Sichuan University(四川大学) University of Electronic Science and Technology of China(电子科技大学)

AI总结 提出MergeProbe,通过训练初期信号预测LoRA适配器的可合并性,在MERGE-PEFT基准上实现最佳平均和最差保留性能。

详情
AI中文摘要

低秩适配(LoRA)使得训练许多领域和任务特定的语言模型适配器变得廉价,但两个适配器是否可以合并通常只有在两者都经过充分训练和评估后才能发现。这种延迟反馈代价高昂:单独表现强大的适配器在合并更新后可能会产生破坏性干扰。我们询问是否可以预测这种结果。我们将适配器可合并性形式化为适配器在合并后保持其单任务效用的程度,并表明可以从训练初期百分之几的信号中预测——主要是低秩更新及其梯度在不同任务间的对齐程度以及它们对共享表示的干扰程度。我们将这些信号打包成MergeProbe,一个轻量级预测器,用于估计成对和集合级别的保留,并将估计转化为具体决策:直接合并、重新加权、剪枝或路由。在MERGE-PEFT(一个涵盖数学、代码、科学、指令遵循和安全的五领域基准)上,MergeProbe在强干扰感知合并基线中实现了最佳平均和最差保留,同时增加的部署开销远低于完整任务路由。这将LoRA合并从事后工程步骤转变为预期测量问题。

英文摘要

Low-rank adaptation (LoRA) makes it cheap to train many domain- and task-specific language model adapters, but whether two adapters can be merged is usually discovered only after both have been fully trained and evaluated. This late feedback is costly: adapters that are strong in isolation can interfere destructively once their updates are combined. We ask whether this outcome can be anticipated. We formalize adapter mergeability as the degree to which an adapter preserves its single-task utility after merging, and show that it can be forecast from signals measured in the first few percent of training -- chiefly how the low-rank updates and their gradients align across tasks and how much they disturb shared representations. We package these signals into MergeProbe, a lightweight predictor that estimates pairwise and set-level retention and turns the estimate into a concrete decision: merge directly, reweight, prune, or route. On MERGE-PEFT, a five-domain benchmark spanning math, code, science, instruction following, and safety, MergeProbe attains the best average and worst-case retention among strong interference-aware merge baselines while adding far less deployment overhead than full task routing. This turns LoRA merging from a post-hoc engineering step into an anticipatory measurement problem.

2606.19544 2026-06-19 cs.CL 新提交

Reliability without Validity: A Systematic, Large-Scale Evaluation of LLM-as-a-Judge Models Across Agreement, Consistency, and Bias

无效度的可靠性:LLM-as-a-Judge 模型在一致性、稳定性和偏差上的系统性大规模评估

Justin D. Norman, Michael U. Rivera, D. Alex Hughes

发表机构 * UC Berkeley School of Information(加州大学伯克利分校信息学院)

AI总结 本研究通过大规模系统性评估(21个裁判模型、118次运行、约54.1万次判断),发现LLM-as-a-Judge在一致性、稳定性和偏差方面存在普遍问题,包括kappa通缩、排名偏移、高重测信度与严重位置偏差并存,并提出了最小可行验证协议。

详情
AI中文摘要

LLM-as-a-Judge已成为语言模型的主导评估范式,但实际中的裁判验证依赖于精确匹配一致性,这一指标未对随机性进行校正,且系统性地高估了判别能力。我们展示了迄今为止最大规模的LLM-as-a-Judge系统性评估:来自九个提供商的21个裁判模型,在MT-Bench、JudgeBench和RewardBench上,按照三种协议(一致性、稳定性、偏差审计)进行了118次运行,约54.1万次独立判断。发现了四个结果,在整个队列中一致,包括2026年4月的前沿模型:精确匹配与Cohen's kappa之间的kappa通缩是普遍存在的(MT-Bench上33-41个百分点),裁判排名在不同基准上最多移动14个位置,高重测信度(>0.95)与两个生产部署裁判中的严重位置偏差(>0.10)并存(体现了一致性-偏差悖论),以及在单一成对评分标准下,整个队列中的冗长偏差较小(<0.011)。我们将这些结果提炼为一个最小可行验证协议。

英文摘要

LLM-as-a-Judge has become the dominant evaluation paradigm for language models, but judge validation in practice relies on exact-match agreement, a metric that does not correct for chance and systematically overstates discriminative ability. We present the largest systematic evaluation of LLM-as-a-Judge to date: 21 judges from nine providers across MT-Bench, JudgeBench, and RewardBench, evaluated under three protocols (agreement, consistency, bias audit) over 118 runs and approximately 541,000 individual judgments. Four findings emerge, consistent across the full cohort, including the April 2026 frontier: kappa deflation between exact match and Cohen's kappa is universal (33--41 pp on MT-Bench), judge rankings shift by up to 14 positions across benchmarks, high test--retest reliability (>0.95) coexists with severe position bias (>0.10) in two production-deployed judges (instantiating a consistency--bias paradox), and verbosity bias is small (<0.011) across our cohort under a single pairwise rubric. We distill these into a Minimum Viable Validation Protocol.

2606.19542 2026-06-19 cs.LG 新提交

Tracking Representation Dynamics in Large Language Models with Persistent Homology

利用持续同调追踪大型语言模型中的表示动态

Naman Malhotra, Jay Ambadkar, Abhinav Gupta, Kushal Kasivel, Abbas Schwarz, Kamillo Ferry, Anthea Monod

发表机构 * Imperial College London(伦敦帝国学院)

AI总结 通过持续同调分析激活空间拓扑,发现对齐过程中拓扑重组主要发生在训练早期,且不同对齐目标产生可区分的拓扑轨迹。

Comments 29 pages

详情
AI中文摘要

大型语言模型通常通过监督微调进行对齐,但关于其内部表示在此过程中如何演变的研究尚不充分。我们利用持续同调,通过追踪微调过程中激活空间的拓扑结构来研究对齐动态。在四个参数范围从1B到7B的Transformer语言模型以及对应于有用、无害和混合训练数据的三个对齐目标上,我们发现大多数拓扑重组发生在训练的最早阶段。密集检查点分析揭示了拓扑活动的瞬态峰值,随后迅速稳定。我们进一步表明,不同的对齐目标会引发可区分的拓扑轨迹,而指令微调和预训练模型则表现出定性不同的演化模式。我们的结果表明,持续同调为对齐提供了互补视角,揭示了仅从行为指标无法察觉的表示级变化。

英文摘要

Large language models are commonly aligned through supervised fine-tuning, yet little is known about how their internal representations evolve during this process. We study alignment dynamics using persistent homology by tracking the topology of activation spaces throughout fine-tuning. Across four transformer language models ranging from 1B to 7B parameters and three alignment objectives corresponding to helpful, harmless, and mixed training data, we find that the majority of topological reorganization occurs during the earliest stages of training. A dense checkpoint analysis reveals a transient peak in topological activity followed by rapid stabilization. We further show that different alignment objectives induce distinguishable topological trajectories, while instruction-tuned and pretrained models exhibit qualitatively different patterns of evolution. Our results suggest that persistent homology provides a complementary perspective on alignment, revealing representation-level changes that are not apparent from behavioral metrics alone.

2606.19538 2026-06-19 cs.AI cs.LG 新提交

ITNet: A Learnable Integral Transform That Subsumes Convolution, Attention, and Recurrence

ITNet: 一种可学习的积分变换,统一卷积、注意力与循环

Ashim Dhor, Rasel Mondal, Pin Yu Chen

发表机构 * Indian Institute of Science Education and Research Bhopal(印度科学教育与研究学院博帕尔分校) IBM Research(IBM研究院)

AI总结 提出可学习积分变换网络ITNet,通过位置-特征联合核函数统一卷积、注意力和循环架构,实现跨模态高性能。

详情
AI中文摘要

卷积网络、循环网络和变换器各自编码不同的归纳偏置——局部性、序列记忆和内容相关的成对交互——自诞生以来在数学上一直彼此独立。我们表明,这种碎片化反映的不是信号处理方式的根本多样性,而是对单一底层数学对象的不完整视角:可学习的积分变换。我们引入积分变换网络(ITNet),这是一种统一架构,围绕一个依赖于位置和特征的联合可学习核构建。该核实现为一个小型神经网络(具体为MLP),用于建模成对交互,使模型能够从数据中自适应其行为。我们证明,卷积、自注意力(包括多头)和自回归循环(包括LSTM、GRU、S4和Mamba)在适当参数化下均作为特例出现,且ITNet是连续算子的通用逼近器。为使其实用,我们开发了分块核融合、重要性加权蒙特卡洛积分和可学习低秩分解,实现高效可扩展计算。单个ITNet架构,共享算子与轻量级模态特定编码器,在ImageNet-1K、GLUE、ModelNet40、VQA v2和NLVR2上匹配或超越专用基线。结果表明,单一学习交互机制可从数据中恢复所有三个架构族的行为。

英文摘要

Convolutional networks, recurrent networks, and transformers each encode different inductive biases -- locality, sequential memory, and content-dependent pairwise interaction -- and have remained mathematically distinct since their inception. We show that this fragmentation reflects not a fundamental diversity in how signals should be processed, but rather incomplete views of a single underlying mathematical object: a learnable integral transform. We introduce the Integral Transform Network (ITNet), a unified architecture built around a learnable kernel that depends jointly on positions and features. This kernel is implemented as a small neural network, specifically an MLP, that models pairwise interactions, enabling the model to adapt its behavior from data. We show that convolution, self-attention (including multi-head), and autoregressive recurrence (including LSTM, GRU, S4, and Mamba) arise as special cases under appropriate parameterizations, and that ITNet is a universal approximator of continuous operators. To make this practical, we develop tiled kernel fusion, importance-weighted Monte Carlo integration, and learned low-rank factorization, enabling efficient and scalable computation. A single ITNet architecture with a shared operator and lightweight modality-specific encoders matches or exceeds specialized baselines on ImageNet-1K , GLUE, ModelNet40, VQA\,v2 and NLVR2. The results demonstrate that a single learned interaction mechanism can recover the behavior of all three architectural families from data.

2606.19534 2026-06-19 cs.CV cs.AI cs.CL 新提交

PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models

PerceptionDLM:基于多模态扩散语言模型的并行区域感知

Yueyi Sun, Yuhao Wang, Jason Li, Ye Tian, Tao Zhang, Jacky Mai, Yihan Wang, Haochen Wang, Jinbin Bai, Ling Yang, Yunhai Tong

发表机构 * Peking University(北京大学) MSALab ByteDance(字节跳动)

AI总结 提出PerceptionDLM,利用扩散语言模型的并行解码特性,通过高效提示和结构化注意力掩码实现多区域并行感知,显著提升推理效率,并构建ParaDLC-Bench基准进行评估。

Comments Code available at https://github.com/MSALab-PKU/PerceptionDLM

详情
AI中文摘要

多模态大语言模型(MLLMs)在视觉理解任务中取得了显著进展。然而,现有大多数MLLMs依赖自回归生成,这限制了它们在需要描述多个区域的感知任务中的效率。在这项工作中,我们提出PerceptionDLM,一种针对高效并行区域感知优化的多模态扩散语言模型。基于PerceptionDLM-Base(一个在开源扩散MLLMs中达到最先进性能的强基础基线),我们的架构充分利用了DLMs的并行解码特性。具体来说,我们引入了高效提示和结构化注意力掩码,以实现对多个掩码区域的同步感知,使模型能够在序列和token级别并行生成区域描述。与现有顺序处理区域的方法相比,这种设计显著提高了推理效率。为了系统评估DLMs视觉感知能力的并行性,我们通过将DLC-Bench扩展为每张图像包含多个区域掩码,构建了一个新的并行详细局部描述基准(ParaDLC-Bench),从而能够联合评估描述质量和推理效率。实验表明,PerceptionDLM在区域描述中保持竞争性能,同时在多区域感知任务中实现了显著的加速。我们的结果凸显了多模态扩散语言模型在高效并行视觉感知中的潜力。据我们所知,我们是首个利用扩散语言模型优势实现并行区域描述和感知的工作。代码、模型和数据集已发布。

英文摘要

Multimodal large language models (MLLMs) have achieved remarkable progress in visual understanding tasks. However, most existing MLLMs rely on autoregressive generation, which limits their efficiency for perception tasks that require captioning multiple regions. In this work, we propose PerceptionDLM, a multimodal diffusion language model optimized for efficient parallel region perception. Built upon PerceptionDLM-Base, a strong foundational baseline that achieves state-of-the-art performance among open-source diffusion MLLMs, our architecture fully leverages the parallel decoding nature of DLMs. Specifically, we introduce efficient prompting and structured attention masking to enable simultaneous perception of multiple masked regions, allowing the model to generate region descriptions in parallel at both the sequence and token levels. This design significantly improves inference efficiency compared with existing approaches that process regions sequentially. To systematically evaluate the parallelism property of visual perception capability for DLMs, we construct a new Parallel Detailed Localized Captioning Benchmark (ParaDLC-Bench) by scaling the DLC-Bench to include multiple region masks per image, enabling joint evaluation of both caption quality and inference efficiency. Experiments demonstrate that PerceptionDLM maintains competitive performance in region captioning while achieving substantial speed improvements for multi-region perception tasks. Our results highlight the potential of multimodal diffusion language models for efficient, parallel visual perception. To the best of our knowledge, we are the first to achieve parallel region caption and perception by leveraging the advantages of diffusion language models. Code, models, and datasets are released.

2606.19531 2026-06-19 cs.CV cs.RO 新提交

ImageWAM: Do World Action Models Really Need Video Generation, or Just Image Editing?

ImageWAM:世界动作模型真的需要视频生成,还是只需要图像编辑?

Yuyang Zhang, Wenyao Zhang, Zekun Qi, He Zhang, Haitao Lin, Jingbo Zhang, Yao Mu, Xiaokang Yang, Wenjun Zeng, Xin Jin

发表机构 * Shanghai Jiao Tong University(上海交通大学) Eastern Institute of Technology(东方理工学院) Tencent Robotics X(腾讯机器人X) Tsinghua University(清华大学) Zhongguancun Academy(中关村学院)

AI总结 提出ImageWAM框架,利用预训练图像编辑模型替代视频生成进行机器人动作预测,通过编辑去噪的KV缓存作为世界动作上下文,在多个模拟和真实实验中优于基线,计算量降至1/6,延迟降至1/4。

Comments Project Page: https://zhangwenyao1.github.io/ImageWAM/

详情
AI中文摘要

世界动作模型(WAMs)通常依赖视频生成来桥接视觉世界建模和机器人控制。然而,基于视频的WAMs面临三个耦合的限制:密集的多帧未来令牌使得推理成本高昂,完整的视频预测将容量花费在与动作无关的时间和外观细节上,以及长期未来想象可能引入误导动作预测的错误。这些问题提出了一个简单的问题:世界动作模型真的需要视频生成吗?我们提出ImageWAM,一个简单的WAM框架,将预训练的图像编辑模型重新用于机器人动作预测。与视频生成相比,图像编辑提供了更匹配的先验:它只需要建模目标帧变换,关注与动作相关的当前到目标视觉差异,并通过编辑预训练将任务指令接地到局部视觉变化。在实践中,ImageWAM在推理时不解码目标帧;相反,它根据图像编辑去噪产生的KV缓存条件化一个流匹配动作专家,将其用作紧凑的世界动作上下文。ImageWAM在多个模拟和真实世界实验中优于标准VLA基线和匹配的竞争性WAM,且无需额外的策略预训练。它还将FLOPs降低到基于视频的WAMs的1/6,延迟降低到1/4。注意力分析进一步表明,编辑缓存聚焦于任务相关的变化区域,支持图像编辑作为基于视频的世界动作建模的有效替代方案。

英文摘要

World Action Models (WAMs) commonly rely on video generation to bridge visual world modeling and robot control. However, video-based WAMs face three coupled limitations: dense multi-frame future tokens make inference costly, full video prediction spends capacity on action-irrelevant temporal and appearance details, and long-horizon future imagination may introduce errors that mislead action prediction. These issues raise a simple question: Does world action model really need video generation? We propose ImageWAM, a simple WAM framework that repurposes pretrained image editing models for robot action prediction. In contrast to video generation, image editing provides a better-matched prior: it only needs to model a target-frame transformation, focuses on action-relevant current-to-target visual differences, and grounds task instructions to localized visual changes through edit pretraining. In practice, ImageWAM does not decode the target frame at inference time; instead, it conditions a flow-matching action expert on the KV caches produced by image-editing denoising, using them as a compact world-action context. ImageWAM outperforms standard VLA baselines and matching competitive WAMs without additional policy pretraining across different simulator and real-world experiments. It also reduces FLOPs to 1/6 and latency to 1/4 of video-based WAMs. Attention analysis further shows that editing caches focus on task-relevant change regions, supporting image editing as an effective alternative to video-based world-action modeling.

2606.19528 2026-06-19 cs.LG cs.AI 新提交

Techniques for Peak Memory Reduction for LoRA Fine-tuning of LLMs on Edge Devices

边缘设备上LLM LoRA微调峰值内存降低技术

Hassan Dbouk, Matthias Reisser, Prathamesh Mandke, Likhita Arun Navali, Christos Louizos

发表机构 * GitHub

AI总结 针对边缘设备上LLM LoRA微调的内存瓶颈,提出四种互补技术(量化、检查点、softmax近似、logits掩码),在Llama-3.2 3B和Qwen-2.5 3B上实现高达26倍和28倍的峰值内存降低。

Comments Hassan Dbouk and Matthias Reisser contributed equally to this work

详情
AI中文摘要

使用低秩适配(LoRA)在终端用户数据上微调大型语言模型(LLM)可提供个性化体验并保护数据隐私,但在消费级硬件上面临严重的内存限制。微调期间的峰值内存通常超过设备限制,尤其是对于具有数十亿参数和长上下文训练数据的模型。本文介绍了一套互补技术,可在不牺牲模型质量的情况下减少内存占用:(1)基模型量化与即时反量化,(2)结合选择性激活缓存和磁盘卸载的内存高效检查点,(3)使用语义相关令牌子集的softmax近似,以及(4)logits掩码。在Llama-3.2 3B和Qwen-2.5 3B上的实验表明,峰值内存降低高达26倍和28倍,从而能够在资源受限设备上进行微调。

英文摘要

Fine-tuning of Large Language Models (LLMs) using Low-Rank Adaptation (LoRA) on an end-user's data offers personalized experiences while keeping data private, but faces severe memory constraints on consumer hardware. Peak memory during fine-tuning often exceeds device limits, especially for models with billions of parameters and long-context training data. This paper introduces a suite of complementary techniques to reduce memory footprint without sacrificing model quality: (1) base model quantization with on-the-fly dequantization, (2) memory-efficient checkpointing combining selective activation caching and disk offloading, (3) softmax approximation using semantically relevant token subsets, and (4) logits masking. Experiments on Llama-3.2 3B and Qwen-2.5 3B demonstrate up to $26\times$ and $28\times$ reduction in peak memory, enabling fine-tuning on resource-constrained devices.

2606.19527 2026-06-19 cs.AI 新提交

Emergent Alignment

涌现对齐

Martin Kolář

发表机构 * CIIRC, Czech Technical University in Prague(捷克理工大学CIIRC)

AI总结 提出一种在线对齐技术,通过引入良心步骤和基于直接偏好优化的对齐损失,使大语言模型在训练、微调、对抗提示和零样本学习中自我纠正非伦理输出。

Comments Rejected from ICML 2026

详情
AI中文摘要

大型语言模型(LLM)能否辨别其自身输出何时与人类伦理不一致?它们能否自我纠正?我们赋予LLM一个良心步骤,用于审查其自身的推理和输出,并通过使用直接偏好优化(DPO)扩展训练损失中的对齐组件,引导模型远离非伦理输出。结果是一种在线技术,可在广泛的应用中对齐模型:训练、微调、对抗提示和零样本学习。它不需要较弱或较强的评判者,而是依赖于自身的冻结副本。在先前的工作中,涌现错位场景显示了微调模型以破解代码时出现的一系列涌现非伦理行为。相反,我们实证展示了如何实现涌现对齐:在相同的代码破解场景下,一个单一的高层内省问题将训练引导向伦理模型。

英文摘要

Can Large Language Models (LLMs) discern when their own outputs are misaligned with human ethics? And can they self-correct? We endow an LLM with a conscience step that reviews its own reasoning and outputs, and we extend the training loss with an alignment component using Direct Preference Optimization (DPO) to steer the model away from non-ethical outputs. The result is an online technique to align models in a wide range of applications: training, fine-tuning, adversarial prompting, and zero-shot learning. It does not require a weaker or stronger judge, relying instead on a frozen copy of itself. In previous work, the Emergent Misalignment scenario showed a range of emergent unethical behaviors from fine-tuning the model to hack code. Instead, we empirically show how to achieve Emergent Alignment: a single high-level introspective question steers training toward an ethical model under the same code hacking scenario.