arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.16713 2026-06-12 cs.CV cs.AI (updated version)

GeoWorld-VLM: Geometry from World Models for Vision-Language Models

Renjie Gu, Kaichen Zhou, Yan Luo, Mengyu Wang

Affiliations: Harvard AI and Robotics Lab; Kempner Institute for the Study of Natural and Artificial Intelligence; Harvard University

AI summary: GeoWorld-VLM transfers geometric structure from a frozen camera-conditioned video world model into vision-language models to improve spatial-relation reasoning; experiments show gains of about 4% on two different architectures.

Abstract

Modern Vision-Language Models (VLMs) achieve strong semantic recognition, yet remain brittle on elementary spatial relations such as left of, on, behind, and between. One cause of this failure arises before language reasoning begins: the visual pathway may compress or discard critical 3D structural cues during feature extraction, so the language model receives image representations that are already insufficient for reliable spatial judgment. We introduce GeoWorld-VLM, a VLM-side distillation framework that transfers geometric structure from frozen camera-conditioned video world models into VLMs. GeoWorld-VLM fine-tunes only the image encoder and multimodal projector, aligning post-projector image features with intermediate world-model representations while leaving the main backbone frozen. Given images, a prompt, and a sampled camera trajectory, the world-model teacher converts static visual input into a synthetic multi-view spatial signal. Training combines spatial answer supervision, teacher-student feature alignment, and a preservation anchor to the original VLM. Since the language model remains frozen, GeoWorld-VLM preserves the original model's linguistic capabilities while attributing spatial improvements to the enhanced visual pathway. To evaluate the effectiveness and generality of the proposed method, we apply GeoWorld-VLM to two distinct VLM architectures and observe consistent improvements across both backbones. GeoWorld-VLM improves performance by approximately 4 percent on both the What'sUp and VSR benchmarks, suggesting that world-model-guided visual alignment generalizes across model structures and spatial reasoning datasets.
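
The three training terms named in the abstract (spatial answer supervision, teacher-student feature alignment, and a preservation anchor) can be pictured as a weighted sum. The sketch below is only an illustration: the abstract does not give the loss forms, so cosine distance for alignment, mean squared error for the anchor, and the weights `w_align` and `w_anchor` are all assumptions.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def total_loss(answer_nll, student, teacher, original, w_align=1.0, w_anchor=0.1):
    """Weighted sum of the three terms (loss forms and weights are assumptions).

    answer_nll: loss on the spatial answer (supervision term)
    student:    post-projector image feature of the fine-tuned VLM
    teacher:    intermediate feature of the frozen world model
    original:   feature of the original VLM (preservation anchor)
    """
    align = cosine_distance(student, teacher)
    anchor = sum((s - o) ** 2 for s, o in zip(student, original)) / len(student)
    return answer_nll + w_align * align + w_anchor * anchor
```

When the student feature matches both the teacher and the original VLM, only the answer term remains, which is the intended reading of "alignment plus preservation".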

2605.16430 2026-06-12 cs.LG cs.AI (updated version)

A Theory of Training Profit-Optimal LLMs

Sophie Hao, William Merrill

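Affiliations: Boston University; Allen Institute for AI

AI summary: The paper proposes an economic model that combines scaling laws with microeconomic theory to analyze profit maximization in LLM training, examining how model size and training cost relate and how they affect profit.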

Comments
Minor edits for preprint

Abstract

Scaling LLMs requires tremendous computational resources, and recent advances in AI have gone hand in hand with massive amounts of capital expenditure. While it is established that scaling up LLMs reliably increases model quality (quantified in terms of loss or downstream evaluations), it is unclear how these quality improvements translate to potential revenue, and whether revenue increases would offset costs of larger-scale training and inference. In this work, we develop an economic model for characterizing the rational behavior of an LLM training firm by combining scaling laws with microeconomic theory. Under our model of firm behavior, LLM quality can be increased with more parameters and training tokens, leading to more potential adoption by consumers, who each have a quality threshold for using the LLM. On the other hand, additional parameters and training tokens both incur additional costs. We analyze the profit maximization problem for this model under compute-bound and data-bound regimes. In the compute-bound regime, optimal model size and token budget track hardware efficiency $E$ (FLOPs/\$) at a near-linear rate; total training cost then scales sub-quadratically in $E$. Data efficiency improvements incentivize larger models and training expenditure. When we are limited to $D$ data, profit-optimal training expenditure scales as $D^2/E$, i.e., it increases with data and decreases with hardware efficiency (as well as data efficiency). Finally, we analyze practical trends in training expenditure: current trends are consistent with our most permissive model variants in the compute-bound regime, but are not profit-optimal in the data-bound regime or assuming hardware advances will stall. Overall, our results provide a theory of profit-optimal LLM training, providing a foundation for engaging critically with industry statements and supporting long-term economic decision making.
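
The data-bound scaling $D^2/E$ quoted in the abstract is easy to sanity-check numerically. The helper below is a toy illustration only: the function name and the proportionality constant `c` are hypothetical, and nothing here reproduces the paper's derivation.

```python
def data_bound_spend(data, hardware_efficiency, c=1.0):
    """Training expenditure proportional to D^2 / E (c is a hypothetical constant)."""
    return c * data ** 2 / hardware_efficiency


base = data_bound_spend(2.0, 1.0)
more_data = data_bound_spend(4.0, 1.0)      # doubling D quadruples the spend
better_chips = data_bound_spend(2.0, 2.0)   # doubling E halves the spend
```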

2605.14568 2026-06-12 cs.SE cs.CL cs.LG (updated version)

Given, When, Then, Again: Mining Subscenario Refactoring Candidates in Behaviour-Driven Test Suites with ML Classifiers and LLM-Judge Baselines

Ali Hassaan Mughal, Noor Fatima, Muhammad Bilal

Affiliations: Independent Researcher, Applied MBA (Data Analytics), Texas Wesleyan University; Independent Researcher, B.E. Computer Engineering, National University of Sciences and Technology (NUST); Independent Researcher, M.Sc. Management, Technical University of Munich

AI summary: Using ML classifiers and LLM baselines, the paper identifies extractable subscenarios in behaviour-driven development test suites and quantifies their prevalence across the public BDD ecosystem.

Comments
31 pages, 10 figures, 6 tables, 56 references. v2: retitled; reference list fully corrected and verified; decision-threshold sensitivity analysis and imbalance-robust baseline metrics added; figures restyled. Reproduction package at this https URL (Apache-2.0). Upstream cukereuse corpus at this https URL

Abstract

Context. Behaviour-Driven Development (BDD) test suites accumulate duplicated step subsequences. Three published refactoring patterns are available (within-file Background, within-repo reusable-scenario invocation, cross-organisational shared higher-level step), but no prior work automates which recurring subsequences are worth extracting or which mechanism applies. Objective. Rank recurring step subsequences ("slices") by refactoring suitability (extraction-worthy), pre-map each to one of the three patterns, and quantify prevalence across the public BDD ecosystem. Method. Every contiguous L-step window (L in [2, 18]) in a 339-repository / 276-upstream-owner Gherkin corpus is keyed by paraphrase-robust cluster identifiers and counted under three scopes. SBERT / UMAP / HDBSCAN clustering recovers paraphrase-equivalent slices. Three authors label a stratified 200-slice pool against a written rubric. An XGBoost extraction-worthy classifier trained under 5-fold cross-validation is compared with a tuned rule baseline and two open-weight Large Language Model (LLM) judges. Results. The miner produces 5,382,249 slices collapsing to 692,020 recurring patterns. Three-author Fleiss' kappa = 0.56 (extraction-worthy) and 0.79 (mechanism). The classifier reaches out-of-fold F1 = 0.891 (95% CI [0.852, 0.927]), outperforming both the rule baseline (F1 = 0.836, p = 0.017) and the better LLM judge (F1 = 0.728, p = 1.5e-4). 75.0%, 59.5%, and 11.7% of scenarios carry a within-file Background, within-repo reusable-scenario, and cross-organisational shared-step candidate, respectively; the figures are stable under a sweep of the classifier decision threshold. Conclusion. Paraphrase-robust subscenario discovery yields a corpus-wide census of BDD refactoring candidates; pipeline, classifier predictions, labelled pool, and rubric are released under Apache-2.0.
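
The mining step described in the Method sentence (every contiguous L-step window, L in [2, 18], counted across scenarios) can be sketched as follows. This is a simplified illustration: the paper keys windows by paraphrase-robust cluster identifiers, whereas this sketch uses the exact step text as the key, and the function names are hypothetical.

```python
from collections import Counter


def count_slices(scenarios, min_len=2, max_len=18):
    """Count every contiguous window of min_len..max_len steps across scenarios."""
    counts = Counter()
    for steps in scenarios:
        longest = min(max_len, len(steps))
        for length in range(min_len, longest + 1):
            for start in range(len(steps) - length + 1):
                counts[tuple(steps[start:start + length])] += 1
    return counts


def recurring(counts):
    """Keep only the slices that occur more than once."""
    return {slice_: n for slice_, n in counts.items() if n > 1}


scenarios = [
    ["Given a user", "When they log in", "Then they see the dashboard"],
    ["Given a user", "When they log in", "Then they see an error"],
]
counts = count_slices(scenarios)
```

Here the only recurring slice is the shared two-step prefix, which is the kind of candidate a within-file Background would absorb.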

2605.13426 2026-06-12 cs.LG math.AG (updated version)

Strategic PAC Learnability via Geometric Definability

Yuval Filmus, Shay Moran, Elizaveta Nesterova, Nir Rosenfeld, Alexander Shlimovich

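Affiliations: Weizmann Institute of Science; University of Waterloo; ETH Zurich; University of Washington

AI summary: The paper studies strategic learning, where individuals modify their features at a cost to influence a classifier's decision. It proves that even in simple cases strategic behavior can turn an easily learnable problem into a non-learnable one, and it introduces a geometric definability assumption to control sample complexity.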

Abstract

Strategic classification studies learning settings in which individuals can modify their features, at a cost, in order to influence the classifier's decision. A central question is how the sample complexity of the induced (strategic) hypothesis class depends on the complexities of the underlying hypothesis class and the cost structure governing feasible manipulations. Prior work has shown that in several natural settings, such as linear classifiers with norm costs, the induced complexity can be controlled. We begin by showing that such guarantees fail in general - even in simple cases: there exist hypothesis classes of VC dimension $1$ on the real line such that, even under the simplest interval neighborhoods, the induced class has infinite VC dimension. Thus, strategic behavior can turn an easy learning problem into a non-learnable one. To overcome this, we introduce structure via a geometric definability assumption: both the hypothesis class and the cost-induced neighborhood relation can be defined by first-order formulas over $\mathbb{R}_{\mathtt{exp}}$. Intuitively, this means that hypotheses and costs can be described using arithmetic operations, exponentiation, logarithms, and comparisons. This captures a broad range of natural classes and cost functions, including $\ell_p$ distances, Wasserstein distance, and information-theoretic divergences. Under this assumption, we prove that learnability is preserved, with sample complexity controlled by the complexity of the defining formulas.
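
To make the induced (strategic) classifier concrete, here is a minimal sketch assuming a threshold classifier on the real line and interval neighborhoods of fixed radius. The names `threshold` and `strategic` are illustrative and not taken from the paper; the example only shows how manipulation shifts the decision boundary, not the infinite-VC construction.

```python
def threshold(t):
    """Base classifier: accept x exactly when x >= t."""
    return lambda x: x >= t


def strategic(t, radius):
    """Induced classifier under interval neighborhoods.

    An agent at x may move to any point in [x - radius, x + radius], so it is
    accepted whenever some reachable point clears the threshold t.
    """
    return lambda x: x + radius >= t


base = threshold(1.0)
induced = strategic(1.0, 0.25)
```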

2605.12542 2026-06-12 astro-ph.IM astro-ph.EP cs.LG (updated version)

Earth Science Foundation Models: From Perception to Reasoning and Discovery

Xiangyu Zhao, Bo Liu, Yuehan Zhang, Zelin Song, Wanghan Xu, Feng Liu, Fengxiang Wang, Ben Fei, Fenghua Ling, Wangxu Wei, Wenlong Zhang, Xiao-Ming Wu

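Affiliations: Department of Data Science and Artificial Intelligence, The Hong Kong Polytechnic University; Shanghai Artificial Intelligence Laboratory

AI summary: This survey reviews Earth science foundation models, tracing how their capabilities evolve from perception to multimodal reasoning and scientific discovery, and summarizes their applications across the atmosphere, hydrosphere, lithosphere, and other domains.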

Abstract

Large foundation models (FMs) are transforming Earth science by integrating heterogeneous multimodal data, such as multi-platform imagery, gridded reanalysis data, diverse geophysical and geochemical observations, and domain-specific text, to support tasks ranging from basic perception to advanced scientific discovery. This paper provides a unified review of Earth science foundation models (Earth FMs) through two complementary dimensions: depth, which traces the evolution of model capabilities from perception to multimodal reasoning and agentic scientific workflows, and breadth, which summarizes their expanding applications across the atmosphere, hydrosphere, lithosphere, biosphere, anthroposphere, and cryosphere, as well as coupled Earth system processes. Using this framework, we review representative multimodal Earth foundation models and compile more than 200 datasets and benchmarks spanning diverse Earth science tasks and modalities. We further discuss key challenges in multimodal data heterogeneity, scientific reliability and continual updating, scalability and sustainability, and the transition from foundation models to agentic and embodied Earth intelligence, and outline future directions toward more integrated, trustworthy, and actionable AI Earth scientists. Overall, this paper offers a structured roadmap for understanding the development of Earth foundation models from both capability depth and application breadth.

2605.11165 2026-06-12 cs.LG (updated version)

COSMOS: Model-Agnostic Personalized Federated Learning with Clustered Server Models and Pseudo-Label-Only Communication

Ben Rachmut, Luise Ge, William Yeoh, Ning Zhang, Yevgeniy Vorobeychik

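Affiliations: Washington University in St. Louis

AI summary: COSMOS achieves server-side personalization through pseudo-label communication: clients' local models predict on public data, the server clusters clients by those predictions, trains cluster-specific models, and distills them back to clients. Theoretical analysis shows an effective reduction of personalization risk, and experiments show it outperforms existing baselines in heterogeneous settings.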

Abstract

Federated learning (FL) in heterogeneous environments remains challenging because client models often differ in both architecture and data distribution. While recent approaches attempt to address this challenge through client clustering and knowledge distillation, simultaneously handling architectural and statistical heterogeneity remains difficult. We introduce COSMOS, a model-agnostic framework that enables server-side personalization using only pseudo-label communication. Clients train local models and predict on the public data; the server clusters clients by prediction similarity, trains a cluster-specific model for each group using its own compute, and distills the resulting models back to clients. We provide the first theoretical analysis showing that distillation from the learned cluster models can yield exponential personalization risk contraction, going beyond the convergence-to-stationarity guarantees typically provided in model-agnostic FL. Experiments across benchmarks demonstrate that COSMOS consistently outperforms all model-agnostic FL baselines while remaining competitive with state-of-the-art personalized FL methods. More broadly, our results highlight personalized server-side learning with pseudo-labels as a promising paradigm for scalable and model-agnostic federated learning in highly heterogeneous environments.
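
The server-side step "clusters clients by prediction similarity" can be illustrated with a small sketch. The abstract does not name the clustering algorithm or similarity measure, so the greedy scheme, the label-agreement score, and the `threshold` parameter below are all assumptions made for illustration.

```python
def agreement(p, q):
    """Fraction of public samples on which two clients give the same pseudo-label."""
    return sum(a == b for a, b in zip(p, q)) / len(p)


def cluster_clients(pseudo_labels, threshold=0.75):
    """Greedy clustering (an assumed scheme, not necessarily the paper's).

    Each client joins the first cluster whose first member agrees with it on
    at least `threshold` of the public samples; otherwise it starts a new one.
    """
    clusters = []
    for client_id, labels in pseudo_labels.items():
        for members in clusters:
            if agreement(labels, pseudo_labels[members[0]]) >= threshold:
                members.append(client_id)
                break
        else:
            clusters.append([client_id])
    return clusters


pseudo_labels = {
    "a": [0, 0, 1, 1],
    "b": [0, 0, 1, 0],
    "c": [1, 1, 0, 0],
}
clusters = cluster_clients(pseudo_labels)
```

Only pseudo-labels cross the network in this picture, which is why the approach does not depend on the clients sharing an architecture.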

2503.17182 2026-06-12 cs.CV (updated version)

Radar-Guided Polynomial Fitting for Metric Depth Estimation

Patrick Rim, Hyoungseob Park, Vadim Ezhov, Jeffrey Moon, Alex Wong

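Affiliations: Yale University; University of Pennsylvania

AI summary: The paper proposes POLAR, which uses radar data to predict polynomial coefficients that non-uniformly correct the scaleless depth from monocular depth estimation, yielding metric depth; it improves on existing methods by an average of 24.9% in MAE and 33.2% in RMSE across three datasets.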

Comments
CVPR 2026

Abstract

We propose POLAR, a novel radar-guided depth estimation method that introduces polynomial fitting to efficiently transform scaleless depth predictions from pretrained monocular depth estimation (MDE) models into metric depth maps. Unlike existing approaches that rely on complex architectures or expensive sensors, our method is grounded in a fundamental insight: although MDE models often infer reasonable local depth structure within each object or local region, they may misalign these regions relative to one another, making a linear scale and shift (affine) transformation insufficient given three or more of these regions. To address this limitation, we use polynomial coefficients predicted from cheap, ubiquitous radar data to adaptively adjust predictions non-uniformly across depth ranges. In this way, POLAR generalizes beyond affine transformations and is able to correct such misalignments by introducing inflection points. Importantly, our polynomial fitting framework preserves structural consistency through a novel training objective that enforces local monotonicity via first-derivative regularization. POLAR achieves state-of-the-art performance across three datasets, outperforming existing methods by an average of 24.9% in MAE and 33.2% in RMSE, while also achieving state-of-the-art efficiency in terms of latency and computational cost.
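
The contrast between an affine map and a higher-degree polynomial, together with a first-derivative monotonicity check, can be sketched in a few lines. This is an illustration under assumptions: the hinge-style penalty on negative slope and the function names are hypothetical, and in POLAR the coefficients are predicted from radar rather than supplied by hand.

```python
def polyval(coeffs, x):
    """Evaluate c0 + c1*x + c2*x^2 + ... with Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result


def derivative(coeffs):
    """Coefficients of the first derivative of the polynomial."""
    return [i * c for i, c in enumerate(coeffs)][1:]


def monotonicity_penalty(coeffs, samples):
    """Sum of negative slopes at the sampled depths (assumed hinge form).

    A value of zero means the mapping from scaleless to metric depth is
    non-decreasing at every sample, so depth ordering is preserved there.
    """
    slope = derivative(coeffs)
    return sum(max(0.0, -polyval(slope, x)) for x in samples)


affine = [0.5, 2.0]        # shift 0.5, scale 2.0
cubic = [0.5, 2.0, 0.0, 0.1]  # adds a non-uniform correction at larger depths
```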

2605.08116 2026-06-12 cs.LG cs.AI (updated version)

The Safety-Aware Denoiser for Text Diffusion Models

Amman Yusuf, Zhejun Jiang, Mijung Park

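Affiliations: University of California, Berkeley

AI summary: The paper proposes the Safety-Aware Denoiser (SAD), which steers generated text toward safe regions during the iterative denoising process of text diffusion models. It supports flexible safety constraints without retraining and reduces unsafe generations while preserving generation quality.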

Comments
28 pages, 12 figures. Code available at: this https URL

Abstract

Recent work on text diffusion models offers a promising alternative to autoregressive generation, but controlling their safety remains underexplored. Existing safety approaches are geared toward autoregressive models and typically rely on post-hoc filtering or inference-time interventions. These are inadequate for effectively addressing safety risks in text diffusion models. We propose the Safety-Aware Denoiser (SAD), a safety-guidance framework in text diffusion models. The SAD modifies the iterative denoising process such that the text sample at the final denoising step is steered toward provably safe regions of the text space. This inference-time method can integrate safety constraints into the denoiser, avoiding computationally expensive retraining of the underlying diffusion model and enabling flexible, lightweight safety guidance. We evaluate the safety of the generated text using the SAD, with respect to hazard taxonomy, memorization, and jailbreak. Experimental results show that SAD substantially reduces unsafe generations while preserving generation quality, diversity, and fluency, outperforming existing methods. These results demonstrate that our safety guidance during denoising provides an effective and scalable mechanism for enforcing safety in text diffusion models.

2605.01391 2026-06-12 cs.CV (updated version)

VISTA: Video Interaction Spatio-Temporal Analysis Benchmark

Alejandro Aparcedo, Akash Kumar, Aaryan Garg, Dalton Pham, Wen-Kai Chen, Anirudh Bharadwaj, Aman Chadha, Yogesh Rawat

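Affiliations: University of Central Florida; BITS Pilani; Ho Chi Minh City University of Science; Amazon GenAI Project

AI summary: The paper introduces the VISTA benchmark, which decomposes videos into entities, actions, and relations to evaluate open-set, multi-entity, multi-action spatio-temporal understanding, revealing biases that traditional metrics obscure.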

Comments
Accepted to CVPR 2026 Workshop on Pixel-level Video Understanding in the Wild (PVUW)

Abstract

Existing benchmarks for Vision-Language Models (VLMs) primarily evaluate spatio-temporal understanding on simple single-action videos, closed attribute sets and restricted entity types, failing to capture the freeform, multi-action interactions between diverse entities which characterize real-world video understanding. Furthermore, the lack of a systematic framework for analyzing model failures across complementary spatio-temporal axes hinders comprehensive evaluation. To address these gaps, we introduce VISTA, a Video Interaction Spatio-Temporal Analysis benchmark designed for open-set, multi-entity and multi-action spatio-temporal understanding in VLMs. VISTA decomposes videos into interpretable entities, their associated actions, and relational dynamics, enabling multi-axis diagnostics and unified assessment of relational, spatial, and temporal understanding. Our benchmark integrates multiple datasets into a single interaction-aware taxonomy and comprises ~12K curated video-query pairs spanning diverse scenes and complexities. We systematically evaluate 11 state-of-the-art VLMs on VISTA, and break down aggregate performance across our taxonomy to reveal shortcomings and pronounced spatio-temporal biases obscured by traditional metrics. By providing detailed, taxonomy-driven diagnostics on a challenging dataset, VISTA offers a nuanced framework to guide advances in model design, pretraining strategies, and evaluation protocols. Overall, VISTA is the first, large-scale, interaction-aware diagnostic benchmark for spatio-temporal understanding in VLMs.

2601.19827 2026-06-12 cs.CL cs.AI cs.IR (updated version)

When Iterative RAG Beats Ideal Evidence: A Diagnostic Study in Scientific Multi-hop Question Answering

Mahdi Astaraki, Mohammad Arshi Saloot, Ali Shiraee Kasmaee, Hamidreza Mahyar, Soheila Samiee

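Affiliations: Faculty of Engineering, McMaster University, Canada; BASF Canada Inc., Canada

AI summary: Using a chemistry multi-hop question answering dataset, the diagnostic study finds that iterative retrieval-reasoning loops clearly outperform the static RAG upper bound in a scientific domain, and it characterizes both the benefits and the failure modes of staged retrieval.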

Comments
51 pages, 29 figures

Abstract

Retrieval-Augmented Generation (RAG) extends large language models (LLMs) beyond parametric knowledge, yet it is unclear when iterative retrieval-reasoning loops meaningfully outperform static RAG, particularly in scientific domains with multi-hop reasoning, sparse domain knowledge, and heterogeneous evidence. We provide the first controlled, mechanism-level diagnostic study of whether synchronized iterative retrieval and reasoning can surpass an idealized static upper bound (Gold Context) RAG. We benchmark eleven state-of-the-art LLMs under three regimes: (i) No Context, measuring reliance on parametric memory; (ii) Gold Context, where all oracle evidence is supplied at once; and (iii) Iterative RAG, a training-free controller that alternates retrieval, hypothesis refinement, and evidence-aware stopping. Using the chemistry-focused ChemKGMultiHopQA dataset, we isolate questions requiring genuine retrieval and analyze behavior with diagnostics spanning retrieval coverage gaps, anchor-carry drop, query quality, composition fidelity, and control calibration. Across models, Iterative RAG consistently outperforms Gold Context, with gains up to 25.6 percentage points, especially for non-reasoning fine-tuned models. Staged retrieval reduces late-hop failures, mitigates context overload, and enables dynamic correction of early hypothesis drift, but remaining failure modes include incomplete hop coverage, distractor latch trajectories, early stopping miscalibration, and high composition failure rates even with perfect retrieval. Overall, staged retrieval is often more influential than the mere presence of ideal evidence; we provide practical guidance for deploying and diagnosing RAG systems in specialized scientific settings and a foundation for more reliable, controllable iterative retrieval-reasoning frameworks.
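
The alternation of retrieval, hypothesis refinement, and evidence-aware stopping can be pictured with a toy controller over a lookup table. This is a heavily simplified sketch: the knowledge base, the relation chain, and the function name are hypothetical stand-ins, and a real controller would issue LLM-written queries against a retriever rather than dictionary lookups.

```python
def iterative_answer(start, relations, kb, max_steps=8):
    """Resolve a multi-hop chain one hop at a time.

    Each step retrieves evidence for the current hop only, updates the working
    hypothesis, and stops early when no supporting evidence is found.
    """
    current, evidence = start, []
    for relation in relations[:max_steps]:
        fact = kb.get((current, relation))  # staged retrieval for this hop
        if fact is None:                    # evidence-aware stopping
            return None, evidence
        evidence.append((current, relation, fact))
        current = fact                      # hypothesis refinement
    return current, evidence


# Abstract placeholder facts, not real chemistry.
kb = {("A", "r1"): "B", ("B", "r2"): "C"}
answer, trail = iterative_answer("A", ["r1", "r2"], kb)
missing, partial = iterative_answer("A", ["r1", "r3"], kb)
```

Because each hop sees only the evidence it needs, the context stays short, which is one plausible reading of why staged retrieval mitigates context overload.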

2605.00600 2026-06-12 cs.LG cs.AI cs.CV (updated version)

Possibilistic Predictive Uncertainty for Deep Learning

Yao Ni, Jeremie Houssineau, Yew-Soon Ong, Piotr Koniusz

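Affiliations: arXiv.org; University of Cambridge; National University of Singapore; University of Warsaw

AI summary: The paper proposes Dirichlet-approximated possibilistic posterior predictions (DAPPr), a framework grounded in possibility theory that uses a projection-and-approximation strategy for efficient, principled epistemic uncertainty quantification, achieving competitive performance on multiple benchmarks.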

Comments
Accepted by ICML 2026, 20 pages

Abstract

Deep neural networks achieve impressive results across diverse applications, yet their overconfidence on unseen inputs necessitates reliable epistemic uncertainty modeling. Existing methods for uncertainty modeling face a fundamental dilemma: Bayesian approaches provide principled estimates but remain computationally prohibitive, while efficient second-order predictors lack rigorous connections between their specific objectives and epistemic uncertainty quantification. To resolve this dilemma, we introduce Dirichlet-approximated possibilistic posterior predictions (DAPPr), a principled framework grounded in possibility theory. We define a possibilistic posterior over parameters, project it to the prediction space via supremum operators, and approximate the projected posterior using learnable Dirichlet possibility functions. This projection-and-approximation strategy yields a simple training objective with closed-form solutions. Despite its simplicity, extensive experiments across diverse benchmarks show that DAPPr achieves competitive or superior uncertainty quantification performance over state-of-the-art second-order predictors while maintaining both principled derivation and computational efficiency. Code is available at https://github.com/MaxwellYaoNi/DAPPr.
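
Projection "via supremum operators" has a simple finite analogue: the possibility of a prediction is the largest possibility among the parameters that produce it. The sketch below assumes a finite parameter set and a hypothetical `predict` function; it illustrates only the projection step, not the Dirichlet approximation or the paper's training objective.

```python
def project_possibility(param_possibility, predict):
    """Push a possibility function over parameters forward to predictions.

    The possibility of each prediction is the supremum (here, the maximum)
    of the possibilities of all parameters that map to it.
    """
    projected = {}
    for theta, poss in param_possibility.items():
        y = predict(theta)
        projected[y] = max(projected.get(y, 0.0), poss)
    return projected


# Toy posterior over three parameter values; the best-supported one has possibility 1.
posterior = {-1.0: 0.25, 0.5: 1.0, 2.0: 0.5}
projected = project_possibility(posterior, lambda t: "pos" if t > 0 else "neg")
```

Unlike a probability, the projected values need not sum to one; the maximum rather than the sum is what carries over.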

2604.27960 2026-06-12 cs.AI (updated version)

LLMs as ASP Programmers: Self-Correction Enables Task-Agnostic Nonmonotonic Reasoning

Adam Ishay, Joohyung Lee

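Affiliations: Arizona State University; Samsung Research

AI summary: The paper proposes an LLM+ASP framework that translates natural language into answer set programs through a self-correction loop, enabling nonmonotonic reasoning without task-specific engineering and outperforming SMT-based methods on multiple benchmarks.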

Comments
30 pages

Abstract

Recent large language models (LLMs) have achieved impressive reasoning milestones but continue to struggle with high computational costs, logical inconsistencies, and sharp performance degradation on high-complexity problems. While neuro-symbolic methods attempt to mitigate these issues by coupling LLMs with symbolic reasoners, existing approaches typically rely on monotonic logics (e.g., SMT) that cannot represent defeasible reasoning, an essential component of human cognition. We present "LLM+ASP," a framework that translates natural language into Answer Set Programming (ASP), a nonmonotonic formalism based on stable model semantics. Unlike prior "LLM+ASP" approaches that require manually authored knowledge modules, domain-specific prompts, or evaluation restricted to single problem classes, our framework operates without any per-task engineering and applies uniformly across diverse reasoning tasks. Our system utilizes an automated self-correction loop where structured feedback from the ASP solver enables iterative refinement. Evaluating across six diverse benchmarks, we demonstrate that: (1) stable model semantics allow LLMs to naturally express default rules and exceptions, outperforming SMT-based alternatives by significant margins on nonmonotonic tasks; (2) iterative self-correction is the primary driver of performance, effectively replacing the need for handcrafted domain knowledge; (3) compact in-context reference guides substantially outperform verbose documentation, revealing a "context rot" phenomenon where excessive context hinders constraint adherence.
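
The self-correction loop, where solver feedback drives another generation attempt, has a simple control-flow skeleton. In the sketch below both `generate` and `solve` are hypothetical stand-ins (a stubbed LLM call and a stubbed ASP solver); no real solver API is used, and the round limit is an assumption.

```python
def self_correct(generate, solve, max_rounds=3):
    """Regenerate a program until the solver accepts it or rounds run out.

    generate(feedback) -> program text (feedback is None on the first round)
    solve(program)     -> (ok, result); result is the answer when ok is True,
                          otherwise a structured error message to feed back
    """
    feedback = None
    for round_no in range(1, max_rounds + 1):
        program = generate(feedback)
        ok, result = solve(program)
        if ok:
            return result, round_no
        feedback = result
    return None, max_rounds


def fake_generate(feedback):
    # Stub: the first draft is broken, the revision after feedback is valid.
    return "draft" if feedback is None else "revised"


def fake_solve(program):
    # Stub: stands in for parsing and solving an ASP program.
    if program == "draft":
        return False, "syntax error near line 1"
    return True, ["flies(tweety)"]


answer, rounds = self_correct(fake_generate, fake_solve)
```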

2604.27277 2026-06-12 cs.LG cs.AI cs.CV (updated version)

BrainDINO: A Brain MRI Foundation Model for Generalizable Clinical Representation Learning

Yizhou Wu, Shansong Wang, Yuheng Li, Mojtaba Safari, Mingzhe Hu, Chih-Wei Chang, Harini Veeraraghavan, Xiaofeng Yang

Affiliations: Department of Radiation Oncology and Winship Cancer Institute, Emory University; Department of Radiation and Cellular Oncology, The University of Chicago; Department of Electrical and Computer Engineering, Georgia Institute of Technology; Department of Biomedical Engineering, Georgia Institute of Technology; Department of Biomedical Informatics, Emory University; Department of Medical Physics, Memorial Sloan Kettering Cancer Center

AI summary: The paper proposes BrainDINO, a self-distilled foundation model trained on about 6.6 million unlabeled axial slices. With a frozen encoder and lightweight task heads, it matches or exceeds baselines across many brain MRI tasks, with the largest advantage in low-label settings.

Comments
25 pages, 5 figures

Abstract

Brain MRI underpins a wide range of neuroscientific and clinical applications, yet most learning-based methods remain task-specific and require substantial labeled data. Here we show that a single self-supervised representation can generalize across heterogeneous brain MRI endpoints. We trained BrainDINO, a self-distilled foundation model, on approximately 6.6 million unlabeled axial slices from 20 datasets encompassing broad variation in population, disease, and acquisition setting. Using a frozen encoder with lightweight task heads, BrainDINO supported transfer across tumor segmentation, classification of neurodegenerative and neurodevelopmental conditions, brain age estimation, post-stroke temporal prediction, molecular status prediction, MRI sequence classification, and survival modeling. Across tasks and supervision regimes, BrainDINO consistently equaled or exceeded natural-image and MRI-specific self-supervised baselines, with particularly strong advantages under label scarcity. Representation analyses further showed anatomically organized and pathology-sensitive feature structure in the absence of task-specific supervision. Our findings indicate that large-scale slice-wise self-supervised learning can yield a unified brain MRI representation that supports diverse neuroimaging tasks without volumetric pretraining or full-network fine-tuning, establishing a scalable foundation for robust and data-efficient brain imaging analysis. Code is available at https://github.com/mclwu22/BrainDINO.

2604.26940 2026-06-12 cs.CL 版本更新

Select to Think: Unlocking SLM Potential with Local Sufficiency

Select to Think: 利用局部充分性解锁小语言模型潜力

Wenxuan Ye, Yangyang Zhang, Xueli An, Georg Carle, Yunpu Ma

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出Select to Think (S2T)方法,通过将大语言模型角色从生成转为选择,并蒸馏选择逻辑到小语言模型,使其在推理时无需依赖大模型,显著提升性能。

详情
Comments
Accepted to ICML 2026. Code is available at this https URL
AI中文摘要

小语言模型(SLM)部署高效,但在推理能力上常落后于大语言模型(LLM)。现有解决方案要么在推理分歧点调用LLM,导致大量延迟和成本,要么依赖标准蒸馏,受限于SLM准确模仿LLM复杂生成分布的能力。我们通过识别局部充分性来解决这一困境:在分歧点,LLM偏好的token通常位于SLM的top-K预测中,即使未能成为SLM的top-1选择。因此,我们提出Select to Think(S2T),将LLM的角色从开放式生成重新定义为在SLM的候选提案中进行选择,将监督信号简化为离散的候选排名。利用这一点,我们引入S2T-Local,将选择逻辑蒸馏到SLM中,使其能够在推理时自主重新排序,无需依赖LLM。实验表明,1.5B SLM的top-8候选包含32B LLM选择的命中率达95%,S2T-Local使1.5B SLM的数学平均相对贪心解码提升24.1%,以单轨迹效率达到8路径自一致性的效果。

英文摘要

Small language models (SLMs) offer efficient deployment, yet they often lag behind their larger counterparts (LLMs) in reasoning. Existing remedies either invoke an LLM at points of reasoning divergence, incurring substantial latency and cost, or rely on standard distillation, which is limited by the SLM's capacity to accurately mimic the LLM's complex generative distribution. We address this dilemma by identifying local sufficiency: at divergence points, the LLM's preferred token often resides within the SLM's top-K next-token predictions, even when failing to emerge as the SLM top-1 choice. We therefore propose Select to Think (S2T), which reframes the LLM's role from open-ended generation to selection among the SLM's proposals, simplifying the supervision signal to discrete candidate rankings. Leveraging this, we introduce S2T-Local, which distills the selection logic into the SLM, empowering it to perform autonomous re-ranking without inference-time LLM dependency. Empirically, a 1.5B SLM's top-8 candidates contain the 32B LLM's choice with a 95% hit rate, and S2T-Local improves the 1.5B SLM's Math Avg. over greedy decoding by 24.1% relative gain, matching the efficacy of 8-path self-consistency with single-trajectory efficiency.
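The local-sufficiency claim above (the LLM's preferred token usually sits inside the SLM's top-K) can be checked with a simple hit-rate count. The following is a minimal sketch, not the authors' code: the function names, the toy logits, and the choice of integer token ids are all assumptions made for illustration.

```python
from typing import Sequence


def top_k_indices(logits: Sequence[float], k: int) -> list:
    """Return the indices of the k largest logits, best first."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:k]


def hit_rate(slm_logits_per_step, llm_choices, k: int) -> float:
    """Fraction of divergence points where the LLM's token is in the SLM's top-k."""
    hits = sum(
        1
        for logits, choice in zip(slm_logits_per_step, llm_choices)
        if choice in top_k_indices(logits, k)
    )
    return hits / len(llm_choices)


# Toy data (hypothetical): three divergence points over a 4-token vocabulary.
slm_logits = [
    [2.0, 1.5, 0.1, -1.0],  # SLM prefers token 0, LLM picked token 1
    [0.3, 0.2, 3.0, 2.9],   # SLM prefers token 2, LLM picked token 3
    [1.0, 0.9, 0.8, 0.7],   # SLM prefers token 0, LLM picked token 3
]
llm_choices = [1, 3, 3]

for k in (1, 2, 4):
    print(k, hit_rate(slm_logits, llm_choices, k))
```

In this toy setup the SLM's top-1 never agrees with the LLM, yet widening to the top-2 already recovers two of the three LLM choices, which is the kind of gap that selection among SLM proposals is meant to exploit.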

2604.24806 2026-06-12 cs.IR cs.AI cs.DB 版本更新

Versioned Late Materialization for Ultra-Long Sequence Training in Recommendation Systems at Scale

版本化延迟物化:面向大规模推荐系统的超长序列训练

Liang Guo, Ge Song, Litao Deng, Jianhui Sun, Chufeng Hu, Lu Zhang, Zhen Ma, Shouwei Chen, Weiran Liu, Sarang Masti Sreeshylan, Xiaoxuan Meng, Yanzun Huang

发表机构 * Meta Platforms, Inc.(Meta平台)

AI总结 提出版本化延迟物化范式,通过归一化存储和即时序列重建消除数据冗余,支持超长用户交互历史训练,降低存储I/O开销并提升模型质量。

详情
AI中文摘要

现代深度学习推荐模型(DLRM)遵循序列长度的缩放定律,推动前沿走向超长用户交互历史(UIH)。然而,行业标准的“Fat Row”范式将序列预物化到每个训练样本中,造成存储和I/O瓶颈:由于数据冗余,数据基础设施的资源占用超过了GPU训练容量;在多租户环境中,序列长度需求差异巨大的模型共享同一个联合数据集,这种冗余被进一步放大。我们提出了一种版本化延迟物化范式,将UIH一次性存储在归一化的不可变层中,并在训练期间通过轻量级版本指针即时重建序列,从而消除冗余。系统通过一个分叉协议确保在线到离线(O2O)一致性,在流式和批式训练中均防止未来信息泄漏;同时,一个读优化的不可变存储层为异构模型租户提供多维投影下推。解耦的数据预处理结合流水线I/O预取和数据亲和性优化,掩盖了训练时序列重建的延迟,使训练吞吐量保持GPU计算受限。部署在生产DLRM上,系统减少了训练数据基础设施资源使用,同时实现了激进的序列长度缩放,带来显著的模型质量提升,并作为现代推荐模型架构(包括HSTU和ULTRA-HSTU)的基础数据基础设施。

英文摘要

Modern Deep Learning Recommendation Models (DLRMs) follow scaling laws with sequence length, driving the frontier toward ultra-long User Interaction History (UIH). However, the industry-standard "Fat Row" paradigm pre-materializes these sequences into every training example. The resulting data redundancy creates a storage and I/O wall in which data infrastructure usage exceeds GPU training capacity, and the redundancy is amplified in multi-tenant environments where models with vastly different sequence length requirements share a union dataset. We present a versioned late materialization paradigm that eliminates this redundancy by storing UIH once in a normalized, immutable tier and reconstructing sequences just-in-time during training via lightweight versioned pointers. The system ensures Online-to-Offline (O2O) consistency through a bifurcated protocol that prevents future leakage across both streaming and batch training, while a read-optimized immutable storage layer provides multi-dimensional projection pushdown for heterogeneous model tenants. Disaggregated data preprocessing with pipelined I/O prefetching and data-affinity optimizations masks the latency of training-time sequence reconstruction, keeping training throughput compute-bound by GPUs. Deployed on production DLRMs, the system reduces training data infrastructure resource usage while enabling aggressive sequence length scaling that delivers significant model quality gains, serving as the foundational data infrastructure for modern recommendation model architectures, including HSTU and ULTRA-HSTU.
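To make the contrast with "Fat Row" concrete, here is a minimal in-memory sketch of the late-materialization idea: history is stored once, each training example carries only a version pointer, and the sequence is rebuilt at read time. The class names `UIHStore` and `TrainingExample`, the use of history length as the version, and the event strings are all hypothetical; the production system described in the abstract is certainly more involved.

```python
from dataclasses import dataclass, field


@dataclass
class UIHStore:
    """Hypothetical normalized, append-only store: each event is written once."""
    events: dict = field(default_factory=dict)

    def append(self, user: str, event: str) -> int:
        """Append an event and return the new version (here: history length)."""
        log = self.events.setdefault(user, [])
        log.append(event)
        return len(log)

    def materialize(self, user: str, version: int, max_len: int) -> list:
        """Rebuild the last `max_len` events that were visible at `version`."""
        if max_len <= 0:
            return []
        visible = self.events.get(user, [])[:version]  # later events never leak in
        return visible[-max_len:]


@dataclass(frozen=True)
class TrainingExample:
    user: str
    version: int  # lightweight pointer instead of an embedded sequence
    label: int


store = UIHStore()
store.append("u1", "click:a")
v2 = store.append("u1", "view:b")
example = TrainingExample(user="u1", version=v2, label=1)
store.append("u1", "buy:c")  # arrives after the example was logged

# Two tenants with different sequence-length needs read the same stored history.
short_model_input = store.materialize(example.user, example.version, max_len=1)
long_model_input = store.materialize(example.user, example.version, max_len=64)
```

Because the pointer freezes what was visible when the example was logged, the later `buy:c` event cannot appear in either reconstruction, which is the future-leakage property the abstract refers to.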

2604.24079 2026-06-12 cs.CL cs.AI 版本更新

The Pragmatic Persona: Discovering LLM Persona through Bridging Inference

语用人格:通过桥接推理发现LLM人格

Jisoo Yang, Jongwon Ryu, Minuk Ma, Trung X. Pham, Junyeong Kim

发表机构 * Department of Artificial Intelligence, Chung-Ang University, Seoul, 06974, Republic of Korea(韩国中央大学人工智能系) Department of Computer Science, University of British Columbia, Vancouver, BC V6T 1Z4, Canada(不列颠哥伦比亚大学计算机科学系) Van Lang University, Ho Chi Minh City, Vietnam(文朗大学)

AI总结 提出基于桥接推理的框架,通过构建话语级知识图谱捕捉LLM对话中的隐含语义关联,实现从话语连贯性层面发现稳定人格特征,优于基于频率或风格的基线方法。

详情
Comments
15 pages, 4 figures, accepted to ICPR 2026
AI中文摘要

大型语言模型(LLM)通过对话展现出固有且独特的人格。然而,现有的大多数人格发现方法依赖于表面层面的词汇或风格线索,将对话视为平坦的token序列,未能捕捉维持人格一致性的更深层次话语结构。为解决这一局限,我们提出一种新颖的分析框架,通过桥接推理(即通过共享世界知识和话语连贯性连接话语的隐含概念关系)来解读LLM对话。通过将这些关系建模为结构化知识图谱,我们的方法捕捉了控制LLM在对话轮次间组织意义的潜在语义链接,从而在话语连贯性层面而非表面实现上实现人格发现。在多种推理骨干和从小型模型到80B参数系统的目标LLM上的实验结果表明,与基于频率或风格的基线相比,桥接推理图产生了显著更强的语义连贯性和更稳定的人格识别。这些结果表明,人格特质始终编码在话语的结构组织中,而非孤立的词汇模式中。本工作提出了一个系统框架,通过认知话语理论的视角来探测、提取和可视化潜在的LLM人格,桥接了计算语言学、认知语义学和大型语言模型中的人格推理。代码见:this https URL

英文摘要

Large Language Models (LLMs) reveal inherent and distinctive personas through dialogue. However, most existing persona discovery approaches rely on surface-level lexical or stylistic cues, treating dialogue as a flat sequence of tokens and failing to capture the deeper discourse-level structures that sustain persona consistency. To address this limitation, we propose a novel analytical framework that interprets LLM dialogue through bridging inference -- implicit conceptual relations that connect utterances via shared world knowledge and discourse coherence. By modeling these relations as structured knowledge graphs, our approach captures latent semantic links that govern how LLMs organize meaning across turns, enabling persona discovery at the level of discourse coherence rather than surface realizations. Experimental results across multiple reasoning backbones and target LLMs, ranging from small-scale models to 80B-parameter systems, demonstrate that bridging-inference graphs yield significantly stronger semantic coherence and more stable persona identification than frequency or style-based baselines. These results show that persona traits are consistently encoded in the structural organization of discourse rather than isolated lexical patterns. This work presents a systematic framework for probing, extracting, and visualizing latent LLM personas through the lens of Cognitive Discourse Theory, bridging computational linguistics, cognitive semantics, and persona reasoning in large language models. Codes are available at this https URL

2508.04427 2026-06-12 cs.LG cs.AI 版本更新

Decoding the Multimodal Maze: A Systematic Review on the Adoption of Explainability in Multimodal Attention-based Models

解码多模态迷宫:多模态注意力模型中可解释性采纳的系统综述

Md Raisul Kibria, Sébastien Lafond, Janan Arslan

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文系统综述了2020年至2024年初多模态模型可解释性研究,发现多数工作集中于视觉-语言和纯语言模型,注意力机制是主要解释方法,但评估缺乏系统性和鲁棒性,并提出了改进建议。

详情
AI中文摘要

近年来,多模态学习取得了显著进展,特别是随着注意力模型的整合,在各种任务中带来了显著的性能提升。与此同时,对可解释人工智能(XAI)的需求推动了越来越多的研究,旨在解释这些模型的复杂决策过程。本系统文献综述分析了2020年1月至2024年初期间发表的、关注多模态模型可解释性的研究。在XAI更广泛目标的框架内,我们从多个维度审视文献,包括模型架构、涉及模态、解释算法和评估方法。我们的分析显示,大多数研究集中在视觉-语言和纯语言模型上,注意力机制是最常用的解释方法。然而,这些方法往往无法捕捉模态间交互的全谱系,这一问题因领域间的架构异质性而进一步加剧。重要的是,我们发现多模态环境中XAI的评估方法大多是非系统性的,缺乏一致性、鲁棒性,并且未考虑模态特定的认知和上下文因素。为解决这些不足,我们不仅综合了所调查研究的发现,还纳入了补充分析,整合了推动多模态可解释性的近期和新兴进展。基于这些见解,我们提出了一套全面的建议,旨在促进多模态XAI研究中严谨、透明和标准化的评估与报告实践。我们的目标是支持未来构建更可解释、可问责和负责任的多模态AI系统,并以可解释性为核心。

英文摘要

Multimodal learning has witnessed remarkable advancements in recent years, particularly with the integration of attention-based models, leading to significant performance gains across a variety of tasks. Parallel to this progress, the demand for explainable artificial intelligence (XAI) has spurred a growing body of research aimed at interpreting the complex decision-making processes of these models. This systematic literature review analyzes research published between January 2020 and early 2024 that focuses on the explainability of multimodal models. Framed within the broader goals of XAI, we examine the literature across multiple dimensions, including model architecture, modalities involved, explanation algorithms and evaluation methodologies. Our analysis reveals that most studies are concentrated on vision-language and language-only models, with attention-based techniques being the most commonly employed for explanation. However, these methods often fall short in capturing the full spectrum of interactions between modalities, a challenge further compounded by the architectural heterogeneity across domains. Importantly, we find that evaluation methods for XAI in multimodal settings are largely non-systematic, lacking consistency, robustness, and consideration for modality-specific cognitive and contextual factors. To address these gaps, we not only synthesize findings from the surveyed works but also incorporate a complementary analysis that integrates recent and emerging advances driving multimodal explainability. Based on these insights, we provide a comprehensive set of recommendations aimed at promoting rigorous, transparent, and standardized evaluation and reporting practices in multimodal XAI research. Our goal is to support future research in more interpretable, accountable, and responsible multimodal AI systems, with explainability at their core.

2604.23165 2026-06-12 cs.CV 版本更新

BSViT: A Burst Spiking Vision Transformer for Expressive and Efficient Visual Representation Learning

BSViT:面向富表达力且高效的视觉表征学习的爆发脉冲视觉Transformer

Hongxiang Peng, Dewei Bai, Hong Qu

发表机构 * School of Computer Science and Engineering, University of Electronic Science and Technology of China(电子科技大学计算机科学与工程学院)

AI总结 提出BSViT,通过双通道爆发脉冲自注意力机制和局部邻域掩码策略,解决脉冲视觉Transformer中二进制脉冲信息容量有限和全局自注意力密集交互的问题,在静态和事件视觉基准上取得更高精度和能效。

详情
Comments
Accepted by ECML PKDD 2026
AI中文摘要

脉冲视觉Transformer(S-ViT)为节能视觉学习提供了有前景的框架。然而,现有设计仍受限于两个基本问题:二进制脉冲编码的信息容量有限以及全局自注意力引入的密集令牌交互。为应对这些挑战,本文提出BSViT,一种爆发脉冲驱动的视觉Transformer,具有双通道爆发脉冲自注意力(DBSSA)机制。DBSSA用二进制脉冲编码查询,用爆发脉冲编码键以增强表示能力。值通路采用双兴奋性和抑制性二进制通道,实现有符号调制和更丰富的脉冲交互。重要的是,整个注意力操作保持仅加法计算,确保与节能神经形态硬件的兼容性。为进一步降低脉冲活动并融入空间先验,引入补丁邻域掩码策略将注意力限制在局部邻域,实现结构感知稀疏性并减少计算开销。此外,爆发脉冲编码被系统地集成到网络中,以提升脉冲级表示能力,超越传统二进制脉冲。在静态和事件视觉基准上的大量实验表明,BSViT在精度上持续优于现有脉冲Transformer,同时保持有竞争力的能效。

英文摘要

Spiking Vision Transformers (S-ViTs) offer a promising framework for energy-efficient visual learning. However, existing designs remain limited by two fundamental issues: the restricted information capacity of binary spike coding and the dense token interactions introduced by global self-attention. To address these challenges, this work proposes BSViT, a burst spiking-driven Vision Transformer featuring a Dual-Channel Burst Spiking Self-Attention (DBSSA) mechanism. DBSSA encodes queries with binary spikes and keys with burst spikes to enhance representational capacity. The value pathway adopts dual excitatory and inhibitory binary channels, enabling signed modulation and richer spike interactions. Importantly, the entire attention operation preserves addition-only computation, ensuring compatibility with energy-efficient neuromorphic hardware. To further reduce spike activity and incorporate spatial priors, a patch adjacency masking strategy is introduced to restrict attention to local neighborhoods, resulting in structure-aware sparsity and reduced computational overhead. In addition, burst spike coding is systematically integrated across the network to increase spike-level representational capacity beyond conventional binary spiking. Extensive experiments on both static and event-based vision benchmarks demonstrate that BSViT consistently outperforms existing spiking Transformers in accuracy while maintaining competitive energy efficiency.
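The patch adjacency masking idea can be illustrated with a plain boolean mask over a patch grid. This is only a sketch under assumptions: the abstract does not define the neighborhood, so the Chebyshev-distance window and the `radius` parameter below are illustrative choices, and `patch_adjacency_mask` is a hypothetical name.

```python
def patch_adjacency_mask(height: int, width: int, radius: int = 1) -> list:
    """Build an (H*W) x (H*W) boolean mask over row-major patch indices.

    mask[i][j] is True when patch j lies within `radius` rows and columns
    of patch i, i.e. attention from i to j is allowed.
    """
    n = height * width
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        row_i, col_i = divmod(i, width)
        for j in range(n):
            row_j, col_j = divmod(j, width)
            mask[i][j] = max(abs(row_i - row_j), abs(col_i - col_j)) <= radius
    return mask


mask = patch_adjacency_mask(3, 3, radius=1)
allowed_per_patch = [sum(row) for row in mask]
density = sum(allowed_per_patch) / (9 * 9)  # fraction of token pairs kept
```

On a 3x3 grid a corner patch keeps 4 of 9 keys, an edge patch 6, and the center all 9; on larger grids the kept fraction shrinks quickly, which is where the sparsity benefit would come from.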

2506.18493 2026-06-12 cs.CV 版本更新

ShowFlow: From Robust Single Concept to Condition-Free Multi-Concept Generation

ShowFlow: 从鲁棒的单概念到无条件的多概念生成

Trong-Vu Hoang, Quang-Binh Nguyen, Thanh-Toan Do, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le

发表机构 * University of Science(科学大学) Vietnam National University(越南国家大学) Monash University(莫纳什大学) University of Dayton(代顿大学)

AI总结 提出ShowFlow框架,通过KronA-WED适配器和语义感知注意力正则化增强单概念生成,并利用SAMA和布局一致性指导实现无额外条件的多概念生成。

详情
AI中文摘要

定制化图像生成仍然是可控图像合成中的核心挑战。对于单概念生成,保持身份保留和提示对齐是困难的。在多概念场景中,仅依赖提示而不使用布局框或语义掩码等额外条件,通常会导致身份丢失和概念遗漏。在本文中,我们介绍了ShowFlow,一个旨在应对这些挑战的全面框架。我们提出了用于单概念图像生成的ShowFlow-S,以及用于处理多个概念的ShowFlow-M。ShowFlow-S引入了一个KronA-WED适配器,它将Kronecker适配器与权重和嵌入分解相结合,并配合一种新颖的语义感知注意力正则化(SAR)训练目标,以增强单概念生成。在此基础上,ShowFlow-M直接重用由ShowFlow-S学习的鲁棒模型,以支持无需额外条件的多概念生成,并集成了主体自适应匹配注意力(SAMA)和布局一致性指导作为即插即用模块。大量实验和用户研究验证了ShowFlow的有效性,突显了其在广告和虚拟试穿等实际应用中的潜力。我们的源代码将在以下网址公开:this https URL。

英文摘要

Customizing image generation remains a core challenge in controllable image synthesis. For single-concept generation, maintaining both identity preservation and prompt alignment is challenging. In multi-concept scenarios, relying solely on a prompt, without additional conditions such as layout boxes or semantic masks, often leads to identity loss and concept omission. In this paper, we introduce ShowFlow, a comprehensive framework designed to tackle these challenges. We propose ShowFlow-S for single-concept image generation, and ShowFlow-M for handling multiple concepts. ShowFlow-S introduces a KronA-WED adapter, which integrates a Kronecker adapter with weight and embedding decomposition, and pairs it with a novel Semantic-Aware Attention Regularization (SAR) training objective to enhance single-concept generation. Building on this foundation, ShowFlow-M directly reuses the robust models learned by ShowFlow-S to support multi-concept generation without extra conditions, incorporating Subject-Adaptive Matching Attention (SAMA) and Layout Consistency guidance as plug-and-play modules. Extensive experiments and user studies validate ShowFlow's effectiveness, highlighting its potential in real-world applications like advertising and virtual dressing. Our source code will be publicly available at: this https URL.

2604.20236 2026-06-12 cs.LG 版本更新

Machine Learning-based Two-Stage Graph Sparsification for the Travelling Salesman Problem

基于机器学习的两阶段图稀疏化方法用于旅行商问题

Bo-Cheng Lin, Yi Mei, Mengjie Zhang

发表机构 * Centre for Data Science and Artificial Intelligence(数据科学与人工智能中心) School of Engineering and Computer Science(工程与计算机科学学院) Victoria University of Wellington(惠灵顿维多利亚大学)

AI总结 提出两阶段方法,先结合α-Nearest和POPMUSIC得到近完美召回率的候选图,再用轻量级分类器修剪单源边,在保持≥99.69%最优边的同时降低37%-47%密度。

详情
AI中文摘要

高性能TSP求解器(如Lin-Kernighan-Helsgaun (LKH))在候选图(为求解器预先选定的边的小子集)中搜索,而不是在完整图上搜索。两种主要的稀疏化启发式方法,α-Nearest和POPMUSIC,各自在密度-覆盖率平衡上存在不足:α-Nearest密集且召回率稳定,而POPMUSIC更稀疏但其召回率随规模增大而下降。它们的并集在密度上远低于完整图的同时弥补了召回率差距,为进一步缩减留下了空间。现有的基于学习的稀疏化方法在完整图上对边评分,这种方法代价高昂且主要限于欧几里得实例。我们提出了一种两阶段方法,反转了这一逻辑。第一阶段取α-Nearest和POPMUSIC的并集,在约6N条边上实现近乎完美的召回率。关键在于,并集为每条边标注了其来源出处:它是由α-Nearest、POPMUSIC还是两者共同支持的。第二阶段在这些标注边上训练一个轻量级分类器,并修剪得分最低的边。由于双源边几乎总是最优的,学习问题简化为过滤单源子集,这比从头开始对所有O(N²)条边进行分类要容易得多。在四种距离类型、五种空间分布以及50到500的问题规模上,该流程将候选图密度降低了37%-47%,同时保留了≥99.69%的最优旅行边,并且在TSP500上以更低的密度达到或超过了近期仅限欧几里得的神经稀疏化方法的覆盖率。

英文摘要

High-performance TSP solvers such as Lin-Kernighan-Helsgaun (LKH) search within a candidate graph (a small subset of edges pre-selected for the solver) rather than over the complete graph. The two leading sparsification heuristics, α-Nearest and POPMUSIC, each fall short of the density-coverage balance: α-Nearest is dense with stable recall, while POPMUSIC is sparser but its recall degrades with scale. Their union closes the recall gap while remaining far below the complete graph in density, leaving room for further reduction. Existing learning-based sparsifiers score edges on the complete graph, an approach that is expensive and largely limited to Euclidean instances. We propose a two-stage method that inverts this logic. Stage 1 takes the union of α-Nearest and POPMUSIC, achieving near-perfect recall at ~6N edges. Crucially, the union annotates each edge with its source provenance: whether it was endorsed by α-Nearest, POPMUSIC, or both. Stage 2 trains a lightweight classifier on these annotated edges and prunes the lowest-scoring ones. Because dual-source edges are almost always optimal, the learning problem reduces to filtering the single-source subset, a substantially easier task than classifying all O(N²) edges from scratch. Across four distance types, five spatial distributions, and problem sizes from 50 to 500, the pipeline reduces candidate-graph density by 37-47% while retaining ≥99.69% of optimal-tour edges, and matches or exceeds the coverage of recent Euclidean-only neural sparsifiers at lower density at TSP500.
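The two stages can be sketched in a few lines. This is an illustration under stated assumptions rather than the paper's pipeline: the candidate edge lists are made up, `score` stands in for the trained classifier, and keeping every dual-source edge unconditionally is a simplification of "dual-source edges are almost always optimal".

```python
def union_with_provenance(alpha_edges, popmusic_edges) -> dict:
    """Stage 1: union the two candidate sets, recording who proposed each edge."""
    provenance = {}
    for name, edges in (("alpha", alpha_edges), ("popmusic", popmusic_edges)):
        for u, v in edges:
            key = (min(u, v), max(u, v))  # undirected edge, canonical order
            provenance.setdefault(key, set()).add(name)
    return provenance


def prune(provenance: dict, score, threshold: float) -> set:
    """Stage 2: keep dual-source edges; filter single-source edges by score."""
    kept = set()
    for edge, sources in provenance.items():
        if len(sources) == 2 or score(edge) >= threshold:
            kept.add(edge)
    return kept


# Hypothetical candidate edges on a 4-city instance.
alpha_edges = [(0, 1), (1, 2), (2, 3)]
popmusic_edges = [(1, 0), (2, 3), (0, 3)]
provenance = union_with_provenance(alpha_edges, popmusic_edges)

# Stand-in for a classifier's probability that an edge is in the optimal tour.
toy_scores = {(1, 2): 0.9, (0, 3): 0.1}
candidate_graph = prune(provenance, lambda e: toy_scores.get(e, 0.0), threshold=0.5)
```

Only the two single-source edges ever reach the scorer here, which mirrors the abstract's point that the learning problem shrinks from all O(N²) edges to a small filtered subset.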

2604.18307 2026-06-12 cs.CL 版本更新

Reasoning Models Know What's Important, and Encode It in Their Activations

推理模型知道什么重要,并在其激活中编码

Yaniv Nikankin, Martin Tutek, Tomer Ashuach, Jonathan Rosenfeld, Yonatan Belinkov

发表机构 * Technion(以色列理工学院) University of Zagreb, FER(萨格勒布大学,FER) MIT(麻省理工学院) Kempner Institute, Harvard(哈佛大学凯普纳研究所)

AI总结 通过分析模型激活而非仅依赖推理链文本,发现激活能更有效识别关键推理步骤,且模型在生成后续步骤前已内部编码步骤重要性。

详情
AI中文摘要

语言模型通常通过生成包含许多重要性不同的步骤的长推理链来解决复杂任务。虽然某些步骤对生成最终答案至关重要,但其他步骤是可移除的。确定哪些步骤最重要以及为什么,仍然是理解模型如何处理推理的核心开放问题。我们研究了这个问题是通过模型内部还是通过推理链本身的标记来最好地解决。我们发现,模型激活比标记包含更多信息,用于识别重要的推理步骤。关键的是,通过在模型激活上训练探针来预测重要性,我们表明模型在生成后续步骤之前就已经编码了步骤重要性的内部表示。不同模型中重要性的内部表示在哪些步骤重要上具有高度一致性。这种表示分布在各个层中,并且与表面特征(如步骤的相对位置或长度)不相关。我们的发现表明,分析激活可以揭示表面方法根本遗漏的推理方面,表明推理分析应该研究模型内部。

英文摘要

Language models often solve complex tasks by generating long reasoning chains, consisting of many steps with varying importance. While some steps are crucial for generating the final answer, others are removable. Determining which steps matter most, and why, remains an open question central to understanding how models process reasoning. We investigate if this question is best approached through model internals or through tokens of the reasoning chain itself. We find that model activations contain more information than tokens for identifying important reasoning steps. Crucially, by training probes on model activations to predict importance, we show that models encode an internal representation of step importance, even prior to the generation of subsequent steps. The internal representations of importance in different models yield high agreement on which steps are important. The representation is distributed across layers, and does not correlate with surface-level features, such as a step's relative position or its length. Our findings suggest that analyzing activations can reveal aspects of reasoning that surface-level approaches fundamentally miss, indicating that reasoning analyses should look into model internals.

2601.00921 2026-06-12 cs.LG cs.AI quant-ph 版本更新

Geometric and Quantum Kernel Methods for Predicting Skeletal Muscle Outcomes in chronic obstructive pulmonary disease

用于预测慢性阻塞性肺疾病骨骼肌结果的几何与量子核方法

Azadeh Alavi, Hamidreza Khalili, Stanley H. Chan, Fatemeh Kouchmeshki, Muhammad Usman, Ross Vlahos

发表机构 * School of Computing Technologies, RMIT University(皇家墨尔本理工大学计算技术学院) School of Health & Biomedical Sciences, STEM College, RMIT University(皇家墨尔本理工大学STEM学院健康与生物医学科学学院) Pattern Recognition Pty Ltd, Melbourne(模式识别有限公司,墨尔本) Data61, CSIRO(Data61,澳大利亚联邦科学与工业研究组织)

AI总结 提出一种核几何量子混合方法,通过再生核希尔伯特空间映射合成SPD参考、随机投影压缩和低维量子回归电路,在COPD动物队列中预测肌肉重量、质量和力量,肌肉重量RMSE比最佳经典方法低约1.8%。

详情
Comments
24 pages, 2 figures
AI中文摘要

慢性阻塞性肺疾病(COPD)影响全球数亿人,骨骼肌功能障碍具有临床重要性。量子机器学习在生物医学预测中日益受到探索,但在小型生物标志物队列中的价值需要与强经典基线进行基准测试。我们分析了一个由213只动物组成的香烟烟雾COPD队列,利用血液和支气管肺泡灌洗生物标志物预测胫骨前肌重量、肌肉质量和力量。我们开发了一种核几何量子混合方法,其中合成对称正定(SPD)参考通过再生核希尔伯特空间映射,使用仅训练随机投影压缩,归一化,并输入低维量子回归电路。我们将该方法与经典岭/核模型、SPD关系表示和量子核回归(QKR)进行了基准测试。所有方法均使用条件分层重复交叉验证进行评估。最大的数值改进出现在肌肉重量上,所提出方法的平均均方根误差(RMSE)数值最低,比最佳经典比较器低约1.8%;配对折叠水平测试在Holm调整后未建立统计显著性优势,但该终点具有生物学意义。该方法在肌肉质量上也具有数值最低的平均RMSE。对于力量,仅使用生物标志物的岭回归表现最佳,表明更线性的终点结构。

英文摘要

Chronic obstructive pulmonary disease (COPD) affects hundreds of millions of people worldwide, and skeletal-muscle dysfunction is clinically important. Quantum machine learning is increasingly explored for biomedical prediction, but its value in small biomarker cohorts requires benchmarking against strong classical baselines. We analysed a cigarette-smoke COPD cohort of 213 animals with blood and bronchoalveolar-lavage biomarkers to predict tibialis anterior muscle weight, muscle quality, and force. We developed a kernel-geometric quantum hybrid method in which synthetic symmetric positive definite (SPD) references are mapped through a reproducing kernel Hilbert space, compressed using train-only random projection, normalised, and supplied to low-dimensional quantum regression circuits. We benchmarked this approach against classical ridge/kernel models, SPD relational representations, and quantum-kernel regression (QKR). All methods were evaluated using condition-stratified repeated cross-validation. The largest numerical improvement was observed for muscle weight, where the proposed method had the numerically lowest mean root mean squared error (RMSE), approximately 1.8% below the best classical comparator; paired fold-level testing did not establish statistically significant superiority after Holm adjustment, but the endpoint is biologically meaningful. The method also had the numerically lowest mean RMSE for muscle quality. For force, biomarker-only Ridge performed best, suggesting a more linear endpoint structure.
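The abstract notes that the fold-level comparisons were not significant after Holm adjustment. For readers unfamiliar with that step, the sketch below implements the standard Holm step-down adjustment of a family of p-values. The p-values used are invented for illustration and are not taken from the paper.

```python
def holm_adjust(p_values: list) -> list:
    """Return Holm-adjusted p-values, in the same order as the input.

    The i-th smallest raw p-value (0-indexed) is multiplied by (m - i),
    capped at 1, and forced to be non-decreasing along the sorted order.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values for three endpoint comparisons.
raw = [0.01, 0.04, 0.03]
adjusted = holm_adjust(raw)
significant = [p <= 0.05 for p in adjusted]
```

With these made-up inputs only the smallest p-value survives at the 0.05 level, showing how a result that looks favorable per endpoint can fail to establish superiority once multiplicity is accounted for.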

2604.16689 2026-06-12 cs.AI 版本更新

The Query Channel: Information-Theoretic Limits of Masking-Based Explanations

查询通道:基于掩码的解释的信息论极限

Erciyes Karakaya, Ozgur Ercetin

发表机构 * Department of Electrical and Computer Engineering, University of Maryland, College Park, USA(美国马里兰大学电气与计算机工程系) Faculty of Engineering and Natural Sciences, Sabanci University, Turkiye(土耳其萨班奇大学工程与自然科学学院)

AI总结 本文提出查询通道框架,将基于掩码的事后解释建模为通信过程,推导解释率与识别容量之间的信息论极限,并证明当速率低于容量时稀疏最大似然解码器可实现可靠恢复。

详情
AI中文摘要

基于掩码的事后解释方法,如KernelSHAP和LIME,通过在随机扰动下查询黑盒模型来估计局部特征重要性。本文将这一过程建模为在查询通道上的通信,其中潜在解释作为消息,每次掩码评估作为一次信道使用。在此框架内,解释的复杂度由假设类的熵捕获,而查询接口以每次查询的识别容量确定的速率提供信息。我们推导了一个强逆定理,表明如果解释率超过该容量,则对于任何解释器和解码器序列,精确恢复的错误概率必然收敛到一。我们还证明了一个可达性结果,即当速率低于容量时,稀疏最大似然解码器可实现可靠恢复。互信息的蒙特卡洛估计器提供了一个非渐近查询基准,我们用它来比较最优解码与模拟LIME和KernelSHAP的基于Lasso和OLS的过程。实验揭示了在一定的查询预算范围内,信息论允许可靠解释,但标准凸替代方法仍然失败。最后,我们将超像素分辨率以及神经语言模型的分词解释为一种源编码选择,它设定了解释的熵,并展示了高斯噪声和非线性曲率如何劣化查询通道,引发瀑布和错误平台(error floor)行为,并使高分辨率解释无法实现。

英文摘要

Masking-based post-hoc explanation methods, such as KernelSHAP and LIME, estimate local feature importance by querying a black-box model under randomized perturbations. This paper formulates this procedure as communication over a query channel, where the latent explanation acts as a message and each masked evaluation is a channel use. Within this framework, the complexity of the explanation is captured by the entropy of the hypothesis class, while the query interface supplies information at a rate determined by an identification capacity per query. We derive a strong converse showing that, if the explanation rate exceeds this capacity, the error probability of exact recovery necessarily converges to one for any sequence of explainers and decoders. We also prove an achievability result establishing that a sparse maximum-likelihood decoder attains reliable recovery when the rate lies below capacity. A Monte Carlo estimator of mutual information yields a non-asymptotic query benchmark that we use to compare optimal decoding with Lasso- and OLS-based procedures that mirror LIME and KernelSHAP. Experiments reveal a range of query budgets where information theory permits reliable explanations but standard convex surrogates still fail. Finally, we interpret super-pixel resolution and tokenization for neural language models as a source-coding choice that sets the entropy of the explanation and show how Gaussian noise and nonlinear curvature degrade the query channel, induce waterfall and error-floor behavior, and render high-resolution explanations unattainable.
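The rate-versus-capacity statement suggests a back-of-the-envelope query count: if the explanation carries H bits and each masked query delivers at most C bits, roughly H / C queries are needed before reliable recovery is even possible. The sketch below computes that figure under two loud assumptions that are not from the paper: explanations are k-sparse supports drawn uniformly from d features, and the per-query capacity is simply given as a number.

```python
import math


def sparse_support_entropy_bits(d: int, k: int) -> float:
    """Entropy (bits) of a uniformly random k-feature support among d features."""
    return math.log2(math.comb(d, k))


def min_queries(entropy_bits: float, capacity_bits_per_query: float) -> int:
    """Smallest query count n for which the rate entropy / n does not exceed capacity."""
    if capacity_bits_per_query <= 0:
        raise ValueError("capacity must be positive")
    return math.ceil(entropy_bits / capacity_bits_per_query)


# Hypothetical setting: 20 super-pixels, 3 of them relevant, 0.5 bits per query.
entropy = sparse_support_entropy_bits(d=20, k=3)
budget_floor = min_queries(entropy, capacity_bits_per_query=0.5)

# A finer segmentation raises the entropy and therefore the floor.
finer_floor = min_queries(sparse_support_entropy_bits(d=80, k=3), 0.5)
```

The comparison between `budget_floor` and `finer_floor` is the source-coding point in miniature: raising the resolution of the explanation raises its entropy, so a fixed query budget that suffices at coarse resolution may fall below the floor at fine resolution. The theorems in the paper are asymptotic, so this count is a heuristic reading rather than a guarantee.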

2604.16548 2026-06-12 cs.CR cs.AI cs.CL 版本更新

A Survey on Long-Term Memory Security in LLM Agents: Attacks, Defenses, and Governance Across the Memory Lifecycle

LLM智能体中长期记忆安全综述:跨记忆生命周期的攻击、防御与治理

Zehao Lin, Xixuan Hao, Renyu Fu, Shaobo Cui, Kai Chen, Chunyu Li, Zhiyu Li, Feiyu Xiong

发表机构 * MemTensor Shanghai Jiao Tong University(上海交通大学)

AI总结 本文提出记忆生命周期框架,系统分析LLM智能体长期记忆面临的新威胁,并引入可验证记忆治理(VMG)架构原语,强调存储时溯源与版本控制对安全的关键作用。

详情
AI中文摘要

LLM智能体中可写、跨会话持久记忆的出现,引入了与传统的以输入为中心的安全问题性质不同的威胁格局,其特点包括三个属性:持久性、状态性和传播性。为系统描述这一格局,我们提出记忆生命周期框架,该框架沿两个轴组织攻击、防御及其跨阶段依赖关系:六个生命周期阶段(写入、存储、检索、执行、共享与传播、遗忘与回滚)和四个安全目标(完整性、机密性、可用性、治理)。该分析进而揭示了在系统层面需要形式化安全保证,从而推动了可验证记忆治理(VMG)——一个由五个架构原语组成的框架,它规定了长期记忆系统必须提供哪些可验证机制,以维持对其记忆状态的可审计、可恢复控制。我们的分析表明,健壮的长期记忆(LTM)安全无法仅在检索或执行时进行事后补救,而必须从一开始就锚定于存储时的溯源、版本控制和策略感知的保留。

英文摘要

The emergence of writable, cross-session persistent memory in LLM agents introduces a qualitatively different threat landscape from conventional input-centric security concerns, characterized by three properties: persistence, statefulness, and propagation. To systematically characterize this landscape, we propose a Memory Lifecycle Framework that organizes attacks, defenses, and their cross-phase dependencies along two axes: six lifecycle phases (Write, Store, Retrieve, Execute, Share & Propagate, Forget & Rollback) and four security objectives (Integrity, Confidentiality, Availability, Governance). This analysis in turn exposes the need for formal security guarantees at the system level, motivating Verifiable Memory Governance (VMG), a framework of five architectural primitives that specifies what verifiable mechanisms a long-term-memory system must provide to maintain auditable, recoverable control over its memory state. Our analysis indicates that robust Long-Term Memory (LTM) security cannot be retrofitted at retrieval or execution time alone, but must be anchored in storage-time provenance, versioning, and policy-aware retention from the outset.
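Storage-time provenance and versioning can be pictured as an append-only, hash-chained log of memory writes. The sketch below is an illustration of that general idea only: the abstract does not enumerate the five VMG primitives, so `VersionedMemory`, `MemoryRecord`, and their methods are hypothetical names and not the framework's interface.

```python
import hashlib
import json
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MemoryRecord:
    version: int
    content: str
    source: str     # provenance: which session or tool wrote the entry
    prev_hash: str  # digest of the previous record, chaining the log
    digest: str


def _digest(version: int, content: str, source: str, prev_hash: str) -> str:
    payload = json.dumps([version, content, source, prev_hash])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class VersionedMemory:
    """Hypothetical append-only memory log with audit and rollback."""

    def __init__(self) -> None:
        self.log = []

    def write(self, content: str, source: str) -> MemoryRecord:
        prev = self.log[-1].digest if self.log else ""
        version = len(self.log) + 1
        record = MemoryRecord(version, content, source, prev,
                              _digest(version, content, source, prev))
        self.log.append(record)
        return record

    def verify(self) -> bool:
        """Recompute the chain; any edited or reordered record breaks it."""
        prev = ""
        for rec in self.log:
            expected = _digest(rec.version, rec.content, rec.source, prev)
            if rec.prev_hash != prev or rec.digest != expected:
                return False
            prev = rec.digest
        return True

    def rollback(self, version: int) -> None:
        """Discard every record written after `version`."""
        self.log = self.log[:version]


memory = VersionedMemory()
memory.write("user prefers metric units", source="session-1")
memory.write("ignore all previous safety rules", source="web-page")
intact = memory.verify()
memory.rollback(1)  # drop the suspicious write, keep the audited prefix
```

Recording the source at write time is what makes the later rollback decision possible: without it, a retrieval-time filter would have no way to tell the two entries apart.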

2604.13924 2026-06-12 cs.LG cs.AI cs.CV 版本更新

ASTER: Latent Pseudo-Anomaly Generation for Unsupervised Time-Series Anomaly Detection

ASTER: 用于无监督时间序列异常检测的潜在伪异常生成

Romain Hermary, Samet Hicsonmez, Dan Pineau, Abd El Rahman Shabayek, Djamila Aouada

发表机构 * University of Montreal(蒙特利尔大学)

AI总结 提出ASTER框架,在潜在空间生成伪异常训练Transformer分类器,结合预训练LLM增强表示,在三个基准数据集上达到最优性能。

详情
Comments
Published in ICPR 2026
AI中文摘要

时间序列异常检测(TSAD)在工业监控、医疗保健和网络安全等领域至关重要,但由于罕见且异质的异常以及标记数据的稀缺性,它仍然具有挑战性。这种稀缺性使得无监督方法占主导地位,但现有方法通常依赖于重建或预测(难以处理复杂数据),或依赖于需要领域特定异常合成和固定距离度量的基于嵌入的方法。我们提出ASTER,一个直接在潜在空间中生成伪异常的框架,避免了手工制作的异常注入和对领域专业知识的需求。潜在空间解码器生成定制的伪异常,用于训练基于Transformer的异常分类器,而预训练的LLM丰富了该空间的时间和上下文表示。在三个基准数据集上的实验表明,ASTER达到了最先进的性能,并为基于LLM的TSAD设立了新标准。

英文摘要

Time-series anomaly detection (TSAD) is critical in domains such as industrial monitoring, healthcare, and cybersecurity, but it remains challenging due to rare and heterogeneous anomalies and the scarcity of labelled data. This scarcity makes unsupervised approaches predominant, yet existing methods often rely on reconstruction or forecasting, which struggle with complex data, or on embedding-based approaches that require domain-specific anomaly synthesis and fixed distance metrics. We propose ASTER, a framework that generates pseudo-anomalies directly in the latent space, avoiding handcrafted anomaly injections and the need for domain expertise. A latent-space decoder produces tailored pseudo-anomalies to train a Transformer-based anomaly classifier, while a pre-trained LLM enriches the temporal and contextual representations of this space. Experiments on three benchmark datasets show that ASTER achieves state-of-the-art performance and sets a new standard for LLM-based TSAD.

2604.08958 2026-06-12 cs.LG cs.AI cs.RO 版本更新

WOMBET: World Model-Based Experience Transfer for Robust and Sample-efficient Reinforcement Learning

WOMBET:基于世界模型的经验迁移实现鲁棒且样本高效的强化学习

Mintae Kim, Koushil Sreenath

发表机构 * Hybrid Robotics, UC Berkeley(混合机器人,加州大学伯克利分校)

AI总结 提出WOMBET框架,通过源任务中学习世界模型并生成不确定性惩罚的离线数据,再结合自适应采样进行在线微调,实现鲁棒且样本高效的强化学习迁移。

详情
Comments
13 pages, 6 figures, 8th Annual Learning for Dynamics & Control Conference (L4DC)
AI中文摘要

机器人领域的强化学习通常受限于数据收集的成本和风险,因此需要从源任务向目标任务进行经验迁移。离线到在线强化学习利用先验数据,但通常假设给定固定数据集,并未解决如何生成可靠数据进行迁移的问题。我们提出基于世界模型的经验迁移(WOMBET)框架,该框架联合生成和利用先验数据。WOMBET在源任务中学习世界模型,并通过不确定性惩罚规划生成离线数据,随后筛选出高回报和低认知不确定性的轨迹。然后,它通过在离线数据和在线数据之间进行自适应采样,在目标任务中进行在线微调,实现了从先验驱动的初始化到任务特定适应的稳定过渡。我们证明了不确定性惩罚目标提供了真实回报的下界,并推导了有限样本误差分解,捕捉了分布不匹配和近似误差。实验上,WOMBET在连续控制基准测试中相比强基线提高了样本效率和最终性能,展示了联合优化数据生成和迁移的益处。

英文摘要

Reinforcement learning (RL) in robotics is often limited by the cost and risk of data collection, motivating experience transfer from a source task to a target task. Offline-to-online RL leverages prior data but typically assumes a given fixed dataset and does not address how to generate reliable data for transfer. We propose World Model-Based Experience Transfer (WOMBET), a framework that jointly generates and utilizes prior data. WOMBET learns a world model in the source task and generates offline data via uncertainty-penalized planning, followed by filtering trajectories with high return and low epistemic uncertainty. It then performs online fine-tuning in the target task using adaptive sampling between offline and online data, enabling a stable transition from prior-driven initialization to task-specific adaptation. We show that the uncertainty-penalized objective provides a lower bound on the true return and derive a finite-sample error decomposition capturing distribution mismatch and approximation error. Empirically, WOMBET improves sample efficiency and final performance over strong baselines on continuous control benchmarks, demonstrating the benefit of jointly optimizing data generation and transfer.
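The data-generation step described above (penalize uncertainty while planning, then keep only high-return, low-uncertainty trajectories) can be sketched as follows. The penalty weight, the thresholds, and the use of the maximum per-step uncertainty as the filter statistic are assumptions for illustration; the abstract does not specify them.

```python
def penalized_return(rewards, uncertainties, penalty: float = 1.0) -> float:
    """Sum of per-step rewards minus a penalty on model uncertainty."""
    return sum(r - penalty * u for r, u in zip(rewards, uncertainties))


def filter_trajectories(trajectories, min_return: float, max_uncertainty: float):
    """Keep trajectories with high return and low epistemic uncertainty."""
    kept = []
    for rewards, uncertainties in trajectories:
        if sum(rewards) >= min_return and max(uncertainties) <= max_uncertainty:
            kept.append((rewards, uncertainties))
    return kept


# Hypothetical rollouts from a learned world model: (rewards, uncertainties).
rollouts = [
    ([1.0, 2.0], [0.5, 0.25]),   # good return, model is fairly sure
    ([3.0, 3.0], [2.0, 4.0]),    # high return but in poorly modeled states
    ([0.0, 0.5], [0.25, 0.25]),  # confident but low return
]
scores = [penalized_return(r, u, penalty=2.0) for r, u in rollouts]
offline_data = filter_trajectories(rollouts, min_return=2.0, max_uncertainty=1.0)
```

The second rollout has the highest raw return but the lowest penalized score, which illustrates why a penalized objective behaves like a pessimistic estimate: returns that depend on states the model knows little about are discounted.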

2604.12497 2026-06-12 cs.LG stat.ML 版本更新

Allocating Human Oversight in AI-Enabled Analytics

AI赋能分析中的人类监督分配

Zikun Ye, Jiameng Lyu, Rui Tao

发表机构 * Michael G. Foster School of Business, University of Washington(华盛顿大学迈克尔·G·福斯特商学院) Department of Management Science, School of Management, Fudan University(复旦大学管理学院管理科学系) Guanghua School of Management, Peking University(北京大学光华管理学院)

AI总结 针对AI预测可靠性异质且未知的问题,提出基于上置信界的在线学习策略,动态分配有限的人类验证预算,使终端效率损失随预算增长趋于零。

详情
AI中文摘要

组织越来越多地部署AI作为面向客户的决策过程中的低成本预测层,包括需求感知、服务质量监控、产品测试和市场研究,但AI生成的信号在不同任务、产品和客户细分中的可靠性并不均匀。因此,企业仍然需要稀缺的人类验证(标签、审计、调查回复或后续测量)来将AI输出锚定到真实情况。由于人类真实情况本身存在噪声,在不同标注者之间甚至重复判断中都有所变化,企业必须为每个任务收集并平均多个人类标签,这使得人类验证成本高昂。我们研究如何在可靠性异质且在部署前未知的情况下,将有限的人类验证预算分配到多个AI辅助任务中。我们将其置于调优的预测驱动推断框架内。每个人类标签既提高了AI辅助估计的精度,也揭示了任务的修正难度,即在使用AI预测作为控制变量后剩余的方差。如果难度已知,最优分配将遵循Neyman平方根规则;由于未知,我们提出一种基于上置信界的策略,该策略在线学习难度并将验证导向AI最不可靠的任务。我们证明,随着预算增长,该策略相对于最优分配的终端效率损失趋于零。在合成实验和一个包含68个任务和超过2000名受访者的真实数字孪生调查中,当可靠性异质时,该策略缩小了与最优分配的大部分差距,优于均匀分配和epsilon-贪婪分配;在调查数据上,它还优于先探索后提交的试点设计,并将均匀分配的10-12%差距缩小到2-6%。AI的价值不仅取决于模型准确性,还取决于将人类监督定向到AI错误影响最大的操作策略。

英文摘要

Organizations increasingly deploy AI as a low-cost prediction layer in customer-facing decision processes, including demand sensing, service-quality monitoring, product testing, and market research, but AI-generated signals are unevenly reliable across tasks, products, and customer segments. Firms therefore still need scarce human validation (labels, audits, survey responses, or follow-up measurements) to anchor AI outputs to ground truth. Because human ground truth is itself noisy, varying across labelers and even across repeated judgments, the firm must collect and average several human labels per task, which makes human validation costly. We study how to allocate a limited human-validation budget across many AI-assisted tasks when reliability is heterogeneous and unknown before deployment. We cast this within tuned prediction-powered inference. Each human label both sharpens the AI-assisted estimate and reveals the task's rectification difficulty, the variance that remains after the AI prediction is optimally used as a control variate. If difficulties were known, the optimal allocation would follow a Neyman square-root rule; because they are unknown, we propose a policy based on upper confidence bounds that learns them online and steers validation toward tasks where AI is least reliable. We prove that the policy's terminal efficiency loss relative to the oracle allocation vanishes as the budget grows. In synthetic experiments and a real digital-twin survey with 68 tasks and over 2000 respondents, it closes most of the gap to the oracle when reliability is heterogeneous, outperforming uniform and epsilon-greedy allocation; on the survey data it also outperforms explore-then-commit pilot designs and cuts uniform's 10-12% gap to 2-6%. The value of AI depends not only on model accuracy but also on the operational policy that targets human oversight where AI errors matter most.
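The oracle benchmark mentioned above, a Neyman square-root rule, can be written down directly when the per-task residual variances are known. The sketch below reads the rule as "labels proportional to the square root of each task's variance", which is the classical Neyman allocation; that reading, the largest-remainder rounding, and the example variances are assumptions. The paper's actual policy learns the variances online with upper confidence bounds, which is not shown here.

```python
import math


def neyman_allocation(variances: list, budget: int) -> list:
    """Split `budget` labels across tasks in proportion to sqrt(variance).

    Assumes every variance is positive. Fractional shares are rounded with
    the largest-remainder method so the counts sum exactly to `budget`.
    """
    weights = [math.sqrt(v) for v in variances]
    total = sum(weights)
    raw = [budget * w / total for w in weights]
    allocation = [math.floor(x) for x in raw]
    leftover = budget - sum(allocation)
    by_remainder = sorted(range(len(raw)),
                          key=lambda i: raw[i] - allocation[i], reverse=True)
    for i in by_remainder[:leftover]:
        allocation[i] += 1
    return allocation


# Hypothetical residual variances for three AI-assisted tasks.
labels = neyman_allocation([1.0, 4.0, 9.0], budget=60)
even_case = neyman_allocation([2.0, 2.0, 2.0], budget=10)
```

A task whose residual variance is nine times larger receives three times the labels rather than nine times, which is the square-root behavior: validation effort tilts toward unreliable tasks without abandoning the reliable ones.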

2604.12002 2026-06-12 cs.CL Updated version

Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision

Yinghui He, Simran Kaur, Adithya Bhaskar, Yongjin Yang, Jiarui Liu, Narutatsu Ri, Liam Fowl, Abhishek Panigrahi, Danqi Chen, Sanjeev Arora

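Affiliations: Princeton University; University of Toronto; Carnegie Mellon University

AI summary: Proposes SD-Zero, in which a single model acts as both generator and reviser and turns binary rewards into dense token-level self-supervision, markedly improving training sample efficiency and outperforming baselines such as RFT and GRPO on math and code reasoning tasks.

Details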
English abstract

Current post-training methods in verifiable settings fall into two categories. Reinforcement learning (RLVR) relies on binary rewards, which are broadly applicable and powerful, but provide only sparse supervision during training. Distillation provides dense token-level supervision, typically obtained from an external teacher or using high-quality demonstrations. Collecting such supervision can be costly or unavailable. We propose Self-Distillation Zero (SD-Zero), a method that is substantially more training sample-efficient than RL and does not require an external teacher or high-quality demonstrations. SD-Zero trains a single model to play two roles: a Generator, which produces an initial response, and a Reviser, which conditions on that response and its binary reward to produce an improved response. We then perform on-policy self-distillation to distill the reviser into the generator, using the reviser's token distributions conditioned on the generator's response and its reward as supervision. In effect, SD-Zero trains the model to transform binary rewards into dense token-level self-supervision. On math and code reasoning benchmarks with Qwen3-4B-Instruct and Olmo-3-7B-Instruct, SD-Zero improves performance by at least 10% over the base models and outperforms strong baselines, including Rejection Fine-Tuning (RFT), GRPO, and Self-Distillation Fine-Tuning (SDFT), under the same question set and training sample budget. Extensive ablation studies show two novel characteristics of our proposed algorithm: (a) token-level self-localization, where the reviser can identify the key tokens that need to be revised in the generator's response based on reward, and (b) iterative self-evolution, where the improving ability to revise answers can be distilled back into generation performance with regular teacher synchronization. Code: this https URL.
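
The token-level supervision described in the abstract can be sketched as a per-token divergence between a teacher distribution (the reviser) and a student distribution (the generator). This is only an illustration: the abstract does not state which divergence or direction SD-Zero uses, so the forward KL below, the function names, and the toy logits are all assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def token_kl(teacher_logits, student_logits):
    """KL(teacher || student) at one token position (assumed loss form)."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(teacher_seq, student_seq):
    """Mean per-token KL over one response: a dense signal, one term per token."""
    per_token = [token_kl(t, s) for t, s in zip(teacher_seq, student_seq)]
    return sum(per_token) / len(per_token)

# Toy example: two token positions over a three-word vocabulary.
reviser_logits = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]     # hypothetical teacher
generator_logits = [[0.0, 0.0, 2.0], [0.0, 1.0, 0.0]]   # hypothetical student

loss = distillation_loss(reviser_logits, generator_logits)
print("distillation loss:", loss)
```

Only the first position contributes here, since the two distributions already agree at the second. That is the sense in which a single binary reward, once routed through a reviser, can yield a signal that differs token by token.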

2604.10389 2026-06-12 cs.CL Updated version

BLUEmed: Retrieval-Augmented Multi-Agent Debate for Clinical Error Detection

Saukun Thika You, Nguyen Anh Khoa Tran, Wesley K. Marizane, Hanshu Rao, Qiunan Zhang, Xiaolei Huang

Affiliations: University of California, San Diego

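AI summary: Proposes the BLUEmed framework, which combines hybrid retrieval-augmented generation with multi-agent debate; by decomposing clinical notes, retrieving evidence, running an expert debate, and filtering through a safety layer, it achieves the best performance on terminology substitution error detection.

Details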
Comments
Accepted to the IEEE International Conference on Healthcare Informatics (ICHI) 2026
English abstract

Terminology substitution errors in clinical notes, where one medical term is replaced by a linguistically valid but clinically different term, pose a persistent challenge for automated error detection in healthcare. We introduce BLUEmed, a multi-agent debate framework augmented with hybrid Retrieval-Augmented Generation (RAG) that combines evidence-grounded reasoning with multi-perspective verification for clinical error detection. BLUEmed decomposes each clinical note into focused sub-queries, retrieves source-partitioned evidence through dense, sparse, and online retrieval, and assigns two domain expert agents distinct knowledge bases to produce independent analyses; when the experts disagree, a structured counter-argumentation round and cross-source adjudication resolve the conflict, followed by a cascading safety layer that filters common false-positive patterns. We evaluate BLUEmed on a clinical terminology substitution detection benchmark under both zero-shot and few-shot prompting with multiple backbone models spanning proprietary and open-source families. Experimental results show that BLUEmed achieves the best accuracy (69.13%), ROC-AUC (74.45%), and PR-AUC (72.44%) under few-shot prompting, outperforming both single-agent RAG and debate-only baselines. Further analyses across six backbone models and two prompting strategies confirm that retrieval augmentation and structured debate are complementary, and that the framework benefits most from models with sufficient instruction-following and clinical language understanding.

2511.18322 2026-06-12 cs.RO cs.CV cs.LG Updated version

Learning Visually Interpretable Oscillator Networks for Soft Continuum Robots from Video

Henrik Krauss, Johann Licher, Naoya Takeishi, Annika Raatz, Takehisa Yairi

Affiliations: Department of Advanced Interdisciplinary Studies, The University of Tokyo; Institute of Assembly Technology and Robotics, Leibniz University Hannover; Research Center for Advanced Science and Technology, The University of Tokyo

AI summary: Proposes the Attention Broadcast Decoder (ABCD) and Visual Oscillator Networks (VONs), which make soft continuum robot dynamics learned from video visually and mechanically interpretable, with up to a 5.8x reduction in multi-step prediction error.

Details
Comments
Code available at: this https URL Dataset available at: this https URL Video available at: this https URL
English abstract

Learning soft continuum robot (SCR) dynamics from video offers flexibility but existing methods lack interpretability or rely on prior assumptions. Model-based approaches require prior knowledge and manual design. We bridge this gap by introducing: (1) The Attention Broadcast Decoder (ABCD), a plug-and-play module for autoencoder-based latent dynamics learning that generates pixel-accurate attention maps localizing each latent dimension's contribution while filtering static backgrounds, enabling visual interpretability via spatially grounded latents and on-image overlays. (2) Visual Oscillator Networks (VONs), a 2D latent oscillator network coupled to ABCD attention maps for on-image visualization of learned masses, coupling stiffness, and forces, thereby enabling mechanical interpretability. We validate our approach on single- and double-segment SCRs, demonstrating that ABCD-based models significantly improve multi-step prediction accuracy with 5.8x error reduction for Koopman operators and 3.5x for oscillator networks on a two-segment robot. VONs autonomously discover a chain structure of oscillators. This fully data-driven approach yields compact, mechanically interpretable models with potential relevance for future control applications.
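
To make the notion of a chain of coupled 2D oscillators with masses, coupling stiffness, and forces concrete, here is a generic mass-spring sketch. It is not the authors' VON model: the zero-rest-length linear springs, the absence of damping, the semi-implicit Euler integrator, and all names and values are assumptions chosen for illustration.

```python
def step(pos, vel, masses, stiffness, forces, dt):
    """Advance a chain of 2D point oscillators by one semi-implicit Euler step.

    pos, vel, forces: lists of (x, y) tuples, one per oscillator.
    masses: list of floats, one per oscillator.
    stiffness: stiffness[i] couples oscillator i and oscillator i + 1.
    """
    n = len(pos)
    new_pos, new_vel = [], []
    for i in range(n):
        fx, fy = forces[i]
        # Add linear coupling forces from the chain neighbours.
        for j in (i - 1, i + 1):
            if 0 <= j < n:
                k = stiffness[min(i, j)]
                fx += k * (pos[j][0] - pos[i][0])
                fy += k * (pos[j][1] - pos[i][1])
        # Update velocity first, then position with the new velocity.
        vx = vel[i][0] + dt * fx / masses[i]
        vy = vel[i][1] + dt * fy / masses[i]
        new_vel.append((vx, vy))
        new_pos.append((pos[i][0] + dt * vx, pos[i][1] + dt * vy))
    return new_pos, new_vel

# Two oscillators pulled toward each other by one spring, no external force.
pos = [(0.0, 0.0), (1.0, 0.0)]
vel = [(0.0, 0.0), (0.0, 0.0)]
masses = [1.0, 1.0]
stiffness = [1.0]
forces = [(0.0, 0.0), (0.0, 0.0)]

new_pos, new_vel = step(pos, vel, masses, stiffness, forces, dt=0.1)
print(new_pos, new_vel)
```

In a learned model of this kind, the masses, stiffness values, and forces would be parameters or network outputs fitted to video rather than constants set by hand.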