arXivDaily: arXiv daily academic digest, updated Monday through Friday
2601.21789 2026-05-27 cs.LG cs.AI stat.ML

ECSEL: Explainable Classification via Signomial Equation Learning

Adia Lumadjeng, Ilker Birbil, Erman Acar

Affiliations: Amsterdam Business School, University of Amsterdam, Amsterdam, the Netherlands; Institute for Informatics, University of Amsterdam; Institute for Logic, Language and Computation, University of Amsterdam

AI summary: Proposes ECSEL, which performs explainable classification by learning closed-form expressions in the form of signomial equations; on symbolic regression benchmarks it recovers more target equations with less computation, while maintaining classification accuracy and interpretability.

Comments: 9 pages, 4 figures, accepted at ICML 2026

Abstract

We introduce ECSEL, an explainable classification method that learns formal expressions in the form of signomial equations, motivated by the observation that many symbolic regression benchmarks admit compact signomial structure. ECSEL directly constructs a structural, closed-form expression that serves as both a classifier and an explanation. On standard symbolic regression benchmarks, our method recovers a larger fraction of target equations than competing state-of-the-art approaches while requiring substantially less computation. Leveraging this efficiency, ECSEL achieves classification accuracy competitive with established machine learning models without sacrificing interpretability. Further, we show that ECSEL satisfies some desirable properties regarding global feature behavior, decision-boundary analysis, and local feature attributions. Experiments on benchmark datasets and two real-world case studies (e-commerce and fraud detection) demonstrate that the learned equations expose dataset biases, support counterfactual reasoning, and yield actionable insights.
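To make the term concrete: a signomial is a sum of terms, each a real coefficient times a product of positive inputs raised to real exponents. The sketch below only evaluates such an expression and passes it through a logistic link. The coefficients, the exponents, and the logistic link are illustrative assumptions, not ECSEL's learned model or training procedure, which the abstract does not specify.

```python
import math

def signomial(x, coeffs, exponents):
    """Evaluate sum_k coeffs[k] * prod_j x[j] ** exponents[k][j] for positive x."""
    if any(v <= 0 for v in x):
        raise ValueError("this sketch assumes strictly positive inputs")
    total = 0.0
    for c, exps in zip(coeffs, exponents):
        term = c
        for v, a in zip(x, exps):
            term *= v ** a
        total += term
    return total

def predict_proba(x, coeffs, exponents, bias=0.0):
    """Hypothetical binary classifier: logistic link on the signomial score."""
    return 1.0 / (1.0 + math.exp(-(signomial(x, coeffs, exponents) + bias)))

# Made-up model: score = 2.0 * x0^1.5 * x1^-1 - 0.5 * x1^2
coeffs = [2.0, -0.5]
exponents = [[1.5, -1.0], [0.0, 2.0]]
score = signomial([4.0, 2.0], coeffs, exponents)
prob = predict_proba([4.0, 2.0], coeffs, exponents)
```

Because every exponent and coefficient is visible, the effect of each feature on the score can be read directly from the expression, which is the sense in which one equation can serve as both classifier and explanation.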

2511.16870 2026-05-27 cs.CV cs.LG

Align & Invert: Solving Inverse Problems with Diffusion and Flow-based Models via Representation Alignment

Loukas Sfountouris, Giannis Daras, Paris Giampouras

Affiliations: Massachusetts Institute of Technology

AI summary: Proposes aligning the internal representations of diffusion or flow-based models with a pretrained self-supervised encoder (DINOv2) via representation alignment (REPA) to guide inverse-problem reconstruction at inference time, substantially improving reconstruction quality and perceptual realism.

Abstract

Enforcing alignment between the internal representations of diffusion or flow-based generative models and those of pretrained self-supervised encoders has recently been shown to provide a powerful inductive bias, improving both convergence and sample quality. In this work, we extend this idea to inverse problems, where pretrained generative models are employed as priors. We propose applying representation alignment (REPA) between diffusion or flow-based models and a DINOv2 visual encoder to guide the reconstruction process at inference time. Although ground-truth signals are unavailable in inverse problems, we empirically show that aligning model representations of approximate target features can substantially enhance reconstruction quality and perceptual realism. We provide theoretical results showing (a) that REPA regularization can be viewed as a variational approach for minimizing a divergence measure in the DINOv2 embedding space, and (b) that, under certain regularity assumptions, REPA updates steer the latent diffusion states toward those of the clean image. These results offer insights into the role of REPA in improving perceptual fidelity. Finally, we demonstrate the generality of our approach by integrating REPA into multiple state-of-the-art inverse problem solvers, and provide extensive experiments on super-resolution, box inpainting, Gaussian deblurring, and motion deblurring, confirming that our method consistently improves reconstruction quality while also providing efficiency gains by reducing the number of required discretization steps.

2601.21576 2026-05-27 cs.AI

Chain Of Thought Compression: A Theoretical Analysis

Juncai Li, Ru Li, Yuxiang Zhou, Boxiang Ma, Jeff Z. Pan

Affiliations: School of Computer and Information Technology, Shanxi University, Taiyuan, Shanxi, China; Queen Mary, University of London, UK; School of Informatics, University of Edinburgh, UK

AI summary: Introduces Order-r Interaction theory to prove that, in implicit chain-of-thought compression, the learning signal for high-order logical dependencies decays exponentially, and proposes the ALiCoT framework, which overcomes this signal decay by aligning latent token distributions with intermediate reasoning states, achieving a 54.4x speedup with performance comparable to explicit CoT.

Abstract

Chain-of-Thought (CoT) has unlocked advanced reasoning abilities of Large Language Models (LLMs) with intermediate steps, yet incurs prohibitive computational costs due to the generation of extra tokens. Recent studies empirically show that compressing reasoning steps into latent states, or implicit CoT compression, offers a token-efficient alternative. However, the mechanism behind CoT compression remains unclear. In this paper, we provide the first theoretical analysis of the difficulty of learning to internalize intermediate reasoning steps. By introducing Order-r Interaction, we prove that the learning signal for high-order logical dependencies decays exponentially when solving irreducible problems, where skipping intermediate steps inevitably leads to high-order interaction barriers. To empirically validate this, we introduce NatBool-DAG, a challenging benchmark designed to enforce irreducible logical reasoning and eliminate semantic shortcuts. Guided by our theoretical findings, we propose ALiCoT (Aligned Implicit CoT), a novel framework that overcomes the signal decay by aligning latent token distributions with intermediate reasoning states. Experimental results demonstrate that ALiCoT successfully unlocks efficient reasoning: it achieves a 54.4x speedup while maintaining performance comparable to explicit CoT.

2601.20796 2026-05-27 cs.CL cs.LG

Dissecting Multimodal In-Context Learning: Modality Asymmetries and Circuit Dynamics in modern Transformers

Yiran Huang, Karsten Roth, Quentin Bouniot, Wenjia Xu, Zeynep Akata

Affiliations: Technical University of Munich; Munich Center for Machine Learning; DeepMind; Beijing University of Posts and Telecommunications

AI summary: Uses controlled experiments to study the basic mechanisms of multimodal in-context learning in modern Transformers, finds a learning asymmetry between modalities, and reveals the induction-style circuit mechanism behind it.

Comments: ICML 2026 Spotlight

Abstract

Transformer-based multimodal large language models often exhibit in-context learning (ICL) abilities. Motivated by this phenomenon, we ask: how do transformers learn to associate information across modalities from in-context examples? We investigate this question through controlled experiments on small transformers trained on synthetic classification tasks, enabling precise manipulation of data statistics and model architecture. We begin by revisiting core principles of unimodal ICL in modern transformers. While several prior findings replicate, we find that Rotary Position Embeddings (RoPE) increase the data complexity threshold for ICL. Extending to the multimodal setting reveals a fundamental learning asymmetry: when pretrained on high-diversity data from a primary modality, surprisingly low data complexity in the secondary modality suffices for multimodal ICL to emerge. Mechanistic analysis shows that both settings rely on an induction-style mechanism that copies labels from matching in-context exemplars; multimodal training refines and extends these circuits across modalities. Our findings provide a mechanistic foundation for understanding multimodal ICL in modern transformers and introduce a controlled testbed for future investigation. Code is available at: https://github.com/YiranHuangIrene/multimodal-icl

2508.02806 2026-05-27 cs.CV cs.LG

PyCAT4: A Hierarchical Vision Transformer-based Framework for 3D Human Pose Estimation

Zongyou Yang, Jonathan Loo, Yinghan Hou

Affiliations: Department of Computer Science, University College London; School of Electronic Engineering, Queen Mary University of London; Department of Earth Science, Imperial College London

AI summary: Proposes the PyCAT4 framework, which optimizes the Pymaf network by introducing a self-attention-based Transformer feature extraction layer, feature temporal fusion, and a spatial pyramid structure, significantly improving detection capability for 3D human pose estimation on the COCO and 3DPW datasets.

Comments: 10 pages, 20 figures

Abstract

Recently, a significant improvement in the accuracy of 3D human pose estimation has been achieved by combining convolutional neural networks (CNNs) with pyramid grid alignment feedback loops. Additionally, innovative breakthroughs have been made in the field of computer vision through the adoption of Transformer-based temporal analysis architectures. Given these advancements, this study aims to deeply optimize and improve the existing Pymaf network architecture. The main innovations of this paper include: (1) introducing a Transformer feature extraction network layer based on self-attention mechanisms to enhance the capture of low-level features; (2) enhancing the understanding and capture of temporal signals in video sequences through feature temporal fusion techniques; (3) implementing spatial pyramid structures to achieve multi-scale feature fusion, effectively balancing differences in feature representation across scales. The new PyCAT4 model obtained in this study is validated through experiments on the COCO and 3DPW datasets. The results demonstrate that the proposed improvement strategies significantly enhance the network's detection capability in human pose estimation, further advancing the development of human pose estimation technology.

2601.18904 2026-05-27 cs.SD cs.AI cs.CL

MetaSICL: Adapting Auditory LLM via Meta Speech In-Context Learning

Haolong Zheng, Siyin Wang, Zengrui Jin, Mark Hasegawa-Johnson

Affiliations: University of Illinois Urbana-Champaign; Tsinghua University

AI summary: Proposes MetaSICL, which uses high-resource speech data and meta-learning to strengthen the in-context learning ability of auditory large language models, outperforming direct fine-tuning in low-resource scenarios.

Abstract

Auditory Large Language Models (LLMs) have demonstrated strong performance across a wide range of speech and audio understanding tasks. Nevertheless, they often struggle when applied to low-resource tasks. When in-domain labeled data are scarce or mismatched with the true test distribution, direct fine-tuning can be brittle. In-Context Learning (ICL) provides a training-free, inference-time solution by adapting auditory LLMs through conditioning on a few in-domain demonstrations. In this work, we first show that vanilla ICL improves zero-shot performance across diverse speech and audio tasks for selected models, which suggests that this ICL adaptation capability can be generalized to the multimodal setting. Building on this, we propose Meta Speech In-Context Learning (MetaSICL), a post-training recipe that uses only high-resource speech data from various tasks to strengthen the model's in-context learning capability. Experiments indicate that our proposed method outperforms direct fine-tuning in low-resource scenarios.

2601.18381 2026-05-27 cs.AI cs.SE

AI Agent for Reverse-Engineering Legacy Finite-Difference Code and Translating to Devito

Yinghan Hou, Zongyou Yang

Affiliations: Department of Earth Science and Engineering, Imperial College London; Department of Computer Science, University College London

AI summary: Proposes an integrated AI agent framework that combines retrieval-augmented generation (RAG) with open-source large language models to convert legacy finite-difference code to the Devito environment through a multi-stage iterative workflow, and introduces a reinforcement-learning-inspired feedback mechanism for dynamic, adaptive code translation.

Comments: 14 pages, 7 figures

Abstract

To facilitate the transformation of legacy finite difference implementations into the Devito environment, this study develops an integrated AI agent framework. Retrieval-Augmented Generation (RAG) and open-source Large Language Models are combined through multi-stage iterative workflows in the system's hybrid LangGraph architecture. The agent constructs an extensive Devito knowledge graph through document parsing, structure-aware segmentation, extraction of entity relationships, and Leiden-based community detection. GraphRAG optimisation enhances query performance across semantic communities that include seismic wave simulation, computational fluid dynamics, and performance tuning libraries. A reverse engineering component derives three-level query strategies for RAG retrieval through static analysis of Fortran source code. To deliver precise contextual information for language model guidance, the multi-stage retrieval pipeline performs parallel searching, concept expansion, community-scale retrieval, and semantic similarity analysis. Code synthesis is governed by Pydantic-based constraints to guarantee structured outputs and reliability. A comprehensive validation framework integrates conventional static analysis with the G-Eval approach, covering execution correctness, structural soundness, mathematical consistency, and API compliance. The overall agent workflow is implemented on the LangGraph framework and adopts concurrent processing to support quality-based iterative refinement and state-aware dynamic routing. The principal contribution lies in the incorporation of feedback mechanisms motivated by reinforcement learning, enabling a transition from static code translation toward dynamic and adaptive analytical behavior.

2512.01556 2026-05-27 cs.AI cs.CL cs.LG

LEC: Linear Expectation Constraints for Selection-Conditioned Risk Control in Selective Prediction and Routing Systems

Zhiyuan Wang, Aniri, Tianlong Chen, Yue Zhang, Heng Tao Shen, Xiaoshuang Shi, Kaidi Xu

Affiliations: University of Electronic Science and Technology of China; Munich Center for Machine Learning; University of North Carolina at Chapel Hill; Shandong University; Tongji University; City University of Hong Kong

AI summary: Proposes the LEC framework, which recasts selective prediction as a decision problem via linear expectation constraints, uses a calibration set under an exchangeability assumption to compute a retention-maximizing threshold subject to a risk constraint, and extends to two-model routing systems with selection-conditioned error control.

Comments: Accepted by ICML 2026 Regular

Abstract

Foundation models often generate unreliable answers, while heuristic uncertainty estimators fail to fully distinguish correct from incorrect outputs, causing users to accept erroneous answers without any statistical guarantee. We address this problem through selection-conditioned risk control, aiming to ensure that an accepted prediction has an error probability no larger than a user-specified risk level. To this end, we propose LEC, a principled framework that reframes selective prediction as a decision problem governed by a linear expectation constraint over selection and error indicators. This formulation directly controls the ratio between the expected number of accepted errors and the expected number of accepted predictions, which corresponds to the marginal error probability conditioned on selection. Under exchangeability, we derive a finite-sample sufficient condition that relies only on a held-out calibration set, enabling the computation of a risk-constrained, retention-maximizing threshold. Furthermore, we extend LEC to two-model routing systems: if the primary model's uncertainty exceeds its calibrated threshold, the input is delegated to a subsequent model, while maintaining system-level selection-conditioned error control. Experiments on both closed-ended and open-ended question answering (QA) and vision question answering (VQA) demonstrate that LEC maintains the prescribed risk level in accepted predictions and substantially improves sample retention compared to baselines.
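The ratio constraint described above, expected accepted errors divided by expected accepted predictions at most alpha, is equivalent to a constraint that is linear in the indicators: E[(error - alpha) * selected] <= 0. The sketch below applies an empirical version of this to a calibration set and picks the largest uncertainty threshold that satisfies it. The `slack` parameter is a hypothetical placeholder for a finite-sample correction, and the abstain branch in `route` is an assumption; the paper's actual sufficient condition is not given in the abstract.

```python
def calibrate_threshold(uncertainties, errors, alpha, slack=0.0):
    """Largest t such that accepting u <= t satisfies
    sum_i (errors[i] - alpha) * 1[u_i <= t] + slack <= 0.
    Returns None when no threshold qualifies."""
    pairs = sorted(zip(uncertainties, errors))
    best, running, i = None, 0.0, 0
    while i < len(pairs):
        t = pairs[i][0]
        # Samples tied at the same uncertainty are accepted together.
        while i < len(pairs) and pairs[i][0] == t:
            running += pairs[i][1] - alpha
            i += 1
        if running + slack <= 0:
            best = t
    return best

def route(u_primary, t_primary, u_secondary, t_secondary):
    """Two-model routing: delegate when the primary model is too uncertain."""
    if u_primary <= t_primary:
        return "primary"
    if u_secondary <= t_secondary:
        return "secondary"
    return "abstain"  # assumption: reject when both models are too uncertain

# Toy calibration set: uncertainty scores and 0/1 error indicators.
u = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
e = [0, 0, 0, 1, 0, 1]
threshold = calibrate_threshold(u, e, alpha=0.25)
```

On this toy set the selected threshold accepts five samples with one error, an empirical accepted-error rate of 0.2, which is within the 0.25 target; accepting the sixth sample would push the rate to one third.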

2601.15283 2026-05-27 cs.CV cs.GR

LuxRemix: Lighting Decomposition and Remixing for Indoor Scenes

Ruofan Liang, Norman Müller, Ethan Weber, Duncan Zauss, Nandita Vijaykumar, Peter Kontschieder, Christian Richardt

Affiliations: Meta Reality Labs; University of Toronto

AI summary: Proposes an image-based light decomposition model that decomposes indoor lighting into individual light sources from a multi-view scene capture and, via multi-view lighting harmonization, integrates the result into a relightable 3D Gaussian splatting representation for interactive light-source editing.

Comments: CVPR 2026. Project page: https://luxremix.github.io

Abstract

We present a novel approach for interactive light editing in indoor scenes from a single multi-view scene capture. Our method leverages a generative image-based light decomposition model that factorizes complex indoor scene illumination into its constituent light sources. This factorization enables independent manipulation of individual light sources, specifically allowing control over their state (on/off), chromaticity, and intensity. We further introduce multi-view lighting harmonization to ensure consistent propagation of the lighting decomposition across all scene views. This is integrated into a relightable 3D Gaussian splatting representation, providing real-time interactive control over the individual light sources. Our results demonstrate highly photorealistic lighting decomposition and relighting outcomes across diverse indoor scenes. We evaluate our method on both synthetic and real-world datasets and provide a quantitative and qualitative comparison to state-of-the-art techniques. For video results and interactive demos, see https://luxremix.github.io.
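Once a scene has been split into one image per light source, the three controls named above (state, chromaticity, intensity) can be illustrated with plain additive light transport: the edited image is the sum of the per-light images, each scaled by its gain and tint. The sketch assumes linear RGB values and nested-list images; it illustrates the remixing arithmetic only, not the paper's decomposition model or its Gaussian splatting renderer.

```python
def remix(light_images, states, intensities, tints):
    """Sum per-light RGB images: out = sum_k on_k * gain_k * tint_k * image_k."""
    height, width = len(light_images[0]), len(light_images[0][0])
    out = [[[0.0, 0.0, 0.0] for _ in range(width)] for _ in range(height)]
    for image, on, gain, tint in zip(light_images, states, intensities, tints):
        if not on:
            continue  # a switched-off light contributes nothing
        for y in range(height):
            for x in range(width):
                for c in range(3):
                    out[y][x][c] += gain * tint[c] * image[y][x][c]
    return out

# Two hypothetical 1x1 per-light images in linear RGB.
lamp = [[[0.4, 0.3, 0.2]]]
window = [[[0.1, 0.2, 0.5]]]

# Double the lamp, halve its blue channel, and switch the window light off.
edited = remix([lamp, window], [True, False], [2.0, 1.0],
               [(1.0, 1.0, 0.5), (1.0, 1.0, 1.0)])

# With both lights on at unit gain and neutral tint, the sum is the original image.
original = remix([lamp, window], [True, True], [1.0, 1.0],
                 [(1.0, 1.0, 1.0), (1.0, 1.0, 1.0)])
```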

2601.14702 2026-05-27 cs.AI cs.CV cs.RO

Drive-P2D: A Progressive Perception-to-Decision Benchmark for VLMs in Autonomous Driving

Zecong Tang, Zixu Wang, Yifei Wang, Weitong Lian, Tianjian Gao, Haoran Li, Tengju Ru, Lingyi Meng, Zhejun Cui, Yichen Zhu, Qi Kang, Kaixuan Wang, Yu Zhang

Affiliations: Zhejiang University; The University of Hong Kong

AI summary: Proposes the Drive-P2D benchmark, which uses a separated reasoning-and-answer protocol to evaluate the perception-to-decision ability of vision-language models at the Object, Scene, and Decision levels and to analyze their error modes.

Abstract

Autonomous driving requires reliable perception and safe decision-making in complex scenarios. Recent vision-language models (VLMs) demonstrate reasoning and generalization abilities, opening new possibilities for autonomous driving; however, existing benchmarks often evaluate perception and decision-making separately, limit failure analysis with choice-only formats, or introduce evaluation bias through LLM-scored long-form outputs. To address these issues, we present Drive-P2D, a progressive perception-to-decision benchmark with 6,650 questions across Object, Scene, and Decision levels. Drive-P2D adopts a separated reasoning-and-answer protocol: final answers are scored objectively, while reasoning is analyzed to identify error modes exposed along the progressive perception-to-decision chain. We evaluate mainstream VLMs across all scenarios and high-risk scenarios, and further characterize the perception-to-decision capability boundary through correlation analysis and similar-scene robustness testing. Reasoning further exposes failure modes such as logical reasoning errors and semantic feature omissions, and we train a lightweight analyzer model to automate large-scale error-mode annotation of reasoning. Together, these designs provide practical insights for building safer and more reliable VLMs for real-world autonomous driving.

2508.03774 2026-05-27 cs.LG cs.AI

A Physics-Informed Hierarchical Neural Network for Microwave Scattering Analysis of 3D PEC Targets

Rui Zhu, Yuexing Peng, George C. Alexandropoulos, Wenbo Wang

Affiliations: Key Laboratory of Universal Wireless Communication, Ministry of Education, School of Information and Communication Engineering, Beijing University of Posts and Telecommunications; Department of Informatics and Telecommunications, National and Kapodistrian University of Athens

AI summary: Proposes a U-shaped physics-informed neural network (U-PINet) that combines a near-field graph encoder with an octree-based hierarchical multi-scale fusion module and is trained on the residual of the electric-field integral equation, enabling efficient and accurate microwave scattering analysis of 3D PEC targets.

Comments: Submitted to an IEEE Journal

Abstract

Accurate modeling of scattering from three-dimensional (3D) perfectly electrically conducting (PEC) targets at microwave frequencies constitutes a fundamental objective in computational electromagnetics, particularly for radar cross section (RCS) prediction and microwave scattering analysis. Classical solvers, such as the method of moments and the Multilevel Fast Multipole Algorithm (MLFMA), provide high physical fidelity but become costly under repeated-query scenarios involving many incidence configurations or frequencies, whereas purely data-driven surrogates often lack accuracy on geometrically complex targets. This paper proposes a U-shaped physics-informed artificial neural network (U-PINet) for 3D microwave scattering analysis. Inspired by the near-far field decomposition of MLFMA, U-PINet combines a near-field graph encoder, parameterized by learnable univariate basis functions, with a hierarchical multi-scale fusion module organized on an octree partition. The proposed network is trained against a discretized residual of the electric-field integral equation at surface collocation points, without requiring reference current labels. Experiments on canonical and geometrically complex 3D PEC targets, conducted under multiple frequency and polarization configurations and assessed through bistatic RCS reconstruction, show that U-PINet outperforms representative physics-informed baselines and yields substantial runtime savings over the classical MLFMA solver under repeated-query scenarios.

2601.12809 2026-05-27 cs.CV cs.AI cs.LG

Left-Right Symmetry Breaking in CLIP-style Vision-Language Models Trained on Synthetic Spatial-Relation Data

Takaki Yamamoto, Chihiro Noguchi, Toshihiro Tanizawa

Affiliations: InfoTech, Toyota Motor Corporation

AI summary: Uses a controllable 1D image-text testbed to study how Transformer-based vision-language encoders trained with CLIP-style contrastive learning acquire left-right relational understanding through interactions between positional and token embeddings, and finds that label diversity matters more than layout diversity.

Comments: Accepted at ICML 2026

Abstract

Spatial understanding remains a key challenge in vision-language models. Yet it is still unclear whether such understanding is truly acquired, and if so, through what mechanisms. We present a controllable 1D image-text testbed to probe how left-right relational understanding emerges in Transformer-based vision and text encoders trained with a CLIP-style contrastive objective. We train lightweight Transformer-based vision and text encoders end-to-end on paired descriptions of one- and two-object scenes and evaluate generalization to unseen object pairs while systematically varying label and layout diversity. We find that contrastive training learns left-right relations and that label diversity, more than layout diversity, is the primary driver of generalization in this setting. To gain mechanistic understanding, we perform an attention decomposition and show that interactions between positional and token embeddings induce a horizontal attention gradient that breaks left-right symmetry in the encoders; ablating this contribution substantially reduces left-right discrimination. Our results provide mechanistic insight into when and how CLIP-style models acquire relational competence.
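For readers unfamiliar with the training signal referred to above, a CLIP-style contrastive objective is commonly written as a symmetric cross-entropy over cosine-similarity logits, where the i-th image and the i-th text in a batch form the positive pair. The sketch below is a generic pure-Python version; the temperature value and the toy embeddings are assumptions, and it is not the authors' training code.

```python
import math

def normalize(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def diagonal_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    loss = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        loss += log_sum_exp - row[i]
    return loss / len(logits)

def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over image-to-text and text-to-image logits."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_t))

# Toy batch: embeddings for two scenes and their two captions.
images = [[1.0, 0.0], [0.0, 1.0]]
matched = clip_loss(images, [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when each caption embedding points the same way as its own image and large when the captions are swapped, so a model can only reduce it on left-right captions by encoding which object is on which side.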

2601.08267 2026-05-27 cs.CL

Med-CoReasoner: Reducing Language Disparities in Medical Reasoning via Language-Informed Co-Reasoning

Fan Gao, Sherry T. Tong, Jiwoong Sohn, Jiahao Huang, Junfeng Jiang, Ding Xia, Piyalitt Ittichaiwong, Kanyakorn Veerakanjana, Hyunjae Kim, Qingyu Chen, Edison Marrese Taylor, Kazuma Kobayashi, Akiko Aizawa, Irene Li

Affiliations: The University of Tokyo; ETH Zürich; National Institute of Informatics; Siriraj Informatics and Data Innovation Center; Yale University

AI summary: Proposes the Med-CoReasoner framework, which integrates local clinical knowledge into an English logical scaffold through parallel English and local-language reasoning, structured concept abstraction, and concept-level alignment and retrieval, narrowing the multilingual gap in medical reasoning and improving multilingual reasoning performance by an average of 5% across three benchmarks.

Abstract

While reasoning-enhanced large language models perform strongly on English medical tasks, a persistent multilingual gap remains, with substantially weaker reasoning in local languages, limiting equitable global medical deployment. To bridge this gap, we introduce Med-CoReasoner, a language-informed co-reasoning framework that elicits parallel English and local-language reasoning, abstracts them into structured concepts, and integrates local clinical knowledge into an English logical scaffold via concept-level alignment and retrieval. This design combines the structural robustness of English reasoning with the practice-grounded expertise encoded in local languages. To evaluate multilingual medical reasoning beyond multiple-choice settings, we construct MultiMed-X, a benchmark covering seven languages with expert-annotated long-form question answering and natural language inference tasks, comprising 350 instances per language. Experiments across three benchmarks show that Med-CoReasoner improves multilingual reasoning performance by an average of 5%, with particularly substantial gains in low-resource languages. Moreover, model distillation and expert evaluation analysis further confirm that Med-CoReasoner produces clinically sound and culturally grounded reasoning traces.

2601.08146 2026-05-27 cs.CL cs.AI cs.LG

Beyond Transfer Accuracy: Faithful Circuits for Controlled Low-Resource Adaptation

Khumaisa Nur'aini, Ayu Purwarianti, Alham Fikri Aji, Derry Wijaya

Affiliations: Monash University Indonesia; Institut Teknologi Bandung; MBZUAI; Boston University

AI summary: Adapts Contextual Decomposition for Transformers (CD-T) to counterfactual-free circuit discovery via label-balanced activation means and task-directional relevance scoring, and uses the discovered circuits for Circuit-Targeted Supervised Fine-Tuning (CT-SFT), which minimizes catastrophic forgetting in low-resource cross-lingual sentiment transfer relative to global fine-tuning.

Abstract

Existing circuit discovery methods rely on templated tasks with clean counterfactuals, limiting their use on diverse natural text. We adapt Contextual Decomposition for Transformers (CD-T) for unstructured settings via label-balanced activation means and task-directional relevance scoring, enabling counterfactual-free circuit discovery. We leverage these circuits for Circuit-Targeted Supervised Fine-Tuning (CT-SFT), restricting parameter updates to task-relevant heads and LayerNorm. Experiments on NusaX cross-lingual sentiment transfer show that CT-SFT is highly competitive for low-resource adaptation. While non-circuit sparse updates and full fine-tuning sometimes match target accuracy through capacity recruitment, CT-SFT uniquely minimizes catastrophic forgetting, preserving source-language and related-task performance. Extensions to XNLI confirm these findings hold across broader tasks and model families, demonstrating that circuit-targeted adaptation provides a safer, causally grounded alternative to global fine-tuning.
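The core of restricting parameter updates to a discovered circuit is a mask over parameter groups: groups inside the circuit receive gradient updates and everything else stays frozen. The sketch below shows this with plain SGD on named lists of weights. The group names, the learning rate, and the choice of SGD are hypothetical and for illustration only; the abstract does not describe the optimizer or how heads are indexed.

```python
def masked_sgd_step(params, grads, trainable, lr=0.1):
    """Apply SGD only to groups named in `trainable`; copy the rest unchanged."""
    updated = {}
    for name, values in params.items():
        if name in trainable:
            updated[name] = [w - lr * g for w, g in zip(values, grads[name])]
        else:
            updated[name] = list(values)  # frozen: gradient is ignored
    return updated

# Hypothetical parameter groups for one Transformer layer.
params = {
    "layer0.head3": [1.0, 2.0],
    "layer0.head5": [0.5, -0.5],
    "layer0.layernorm": [1.0, 1.0],
    "layer0.mlp": [0.3, 0.7],
}
grads = {name: [0.5, -0.5] for name in params}

# Assumed circuit: one task-relevant head plus LayerNorm.
circuit = {"layer0.head3", "layer0.layernorm"}
new_params = masked_sgd_step(params, grads, circuit)
```

Leaving the non-circuit weights untouched is what makes it plausible that source-language behavior is preserved, since those weights are identical before and after adaptation.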

2511.02360 2026-05-27 cs.CV cs.CL

LaRe: Latent Refocusing for Multimodal Reasoning

Jizheng Ma, Xiaofei Zhou, Geyuan Zhang, Yanlong Song, Han Yan

Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences

AI summary: Proposes the LaRe paradigm, which performs visual refocusing within the latent space and, combined with semantic augmentation training, improves reasoning accuracy while substantially reducing the number of tokens needed for inference.

Abstract

Chain of Thought (CoT) reasoning enhances logical performance by decomposing complex tasks, yet its multimodal extension faces a trade-off. The prevailing Thinking with Images paradigm achieves visual refocusing by explicitly cropping image regions, but incurs rapidly growing computational overhead. The emerging line of latent-space reasoning reduces token consumption, but lacks the capacity for dynamic refocusing. We argue that this trade-off stems from a tacitly accepted premise that effective visual refocusing must occur in the form of explicit tokens. Building on this, we propose Latent Refocusing (LaRe), a new multimodal reasoning paradigm in which visual refocusing takes place entirely within the latent space. We further design a semantic augmentation training strategy that ensures the semantic structure of the latent space through a visual reconstruction objective. Experimental evaluations demonstrate that LaRe improves average accuracy by 7.6% compared to existing baselines while reducing the number of tokens required for inference by 59.7%. When scaled to an 8B-parameter Vision-Language Model backbone, LaRe achieves performance comparable to state-of-the-art methods, demonstrating the efficacy of our proposed latent refocusing paradigm for multimodal reasoning.

2512.01572 2026-05-27 cs.LG cs.AI physics.app-ph

Reconstructing Multi-Scale Physical Fields from Extremely Sparse Measurements with an Autoencoder-Diffusion Cascade

使用自编码器-扩散级联从极度稀疏测量中重建多尺度物理场

Letian Yi, Tingpeng Zhang, Mingyuan Zhou, Guannan Wang, Quanke Su, Zhilu Lai

发表机构 * Internet of Things Thrust(物联网方向) Intelligent Transportation Thrust(智能交通方向) Marine Hydrodynamic Research Facility(海洋流体研究设施) Department of Civil and Environmental Engineering(土木与环境工程系)

AI总结 提出Cascaded Sensing框架,通过粗尺度确定性估计和细尺度条件扩散模型级联,解决极度稀疏测量下物理场重建的不适定性和多模态后验问题。

Comments 34 pages, 22 figures

详情
AI中文摘要

极端传感器稀疏性使得全场重建成为科学传感中一个根本性的不适定问题,其目标是从稀疏测量中推断物理场。在此情况下,后验严重欠约束且固有地多模态,使其近似高度病态。具体而言,确定性映射会坍塌不确定性,直接条件学习无法覆盖可能的观测条件解空间,而似然引导采样对噪声和传感器配置高度敏感。这些限制导致后验估计不稳定,并突显了以结构化方式建模不确定性的必要性。为此,我们提出了Cascaded Sensing,一个跨尺度重构后验推理的分层框架。Cas-Sensing不直接建模全场后验,而是首先通过确定性粗阶段估计器解决全局结构模糊性。一个基于神经算子的功能自编码器,使用掩码输入训练,将稀疏观测映射到粗尺度结构场,其作用类似于最大后验估计器,选择主导全局配置。该结构锚点固定了后验的主要自由度,并将问题转化为一个条件更好的残差推理任务。然后,一个条件扩散模型仅学习细化尺度的残差分布,将采样限制在合理解的稳定邻域内,并抑制观测一致模式之间的竞争。为了增强在不同传感条件下的鲁棒性,我们引入了掩码级联训练,通过中间粗重建使模型暴露于多样的稀疏观测模式。在推理过程中,流形约束引导将观测一致性作为细化机制而非全局模式选择过程来实施。

英文摘要

Extreme sensor sparsity makes full-field reconstruction a fundamentally ill-posed problem in scientific sensing, where the goal is to infer physical fields from sparse measurements. In this regime, the posterior is severely underconstrained and inherently multimodal, making its approximation highly ill-conditioned. Specifically, deterministic mappings collapse uncertainty, direct conditional learning cannot cover the space of possible observation-conditioned solutions, and likelihood-guided sampling becomes highly sensitive to noise and sensor configurations. These limitations result in unstable posterior estimates and highlight the need for modeling uncertainty in a structural manner. To this end, we propose Cascaded Sensing, a hierarchical framework that restructures posterior inference across scales. Rather than modeling the full-field posterior directly, Cas-Sensing first resolves global structural ambiguity through a deterministic coarse-stage estimator. A neural-operator-based functional autoencoder, trained with masked inputs, maps sparse observations to a coarse-scale structural field, acting analogously to a maximum a posteriori estimator that selects the dominant global configuration. This structural anchor fixes the principal degrees of freedom of the posterior and transforms the problem into a better-conditioned residual inference task. A conditional diffusion model then learns only the refined-scale residual distribution, confining sampling to a stable neighborhood of plausible solutions and suppressing competition among observation-consistent modes. To enhance robustness under varying sensing conditions, we introduce mask-cascade training, which exposes the model to diverse sparse observation patterns through intermediate coarse reconstructions. During inference, manifold-constrained guidance enforces observation consistency as a refinement mechanism rather than a global mode-selection process.
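
One way to read the cascade described above is as a two-stage factorisation of the reconstruction. The notation below is an assumption introduced for illustration (the abstract defines no symbols): $y$ denotes the sparse measurements, $f_\phi$ the coarse-stage autoencoder, $\hat{x}_c$ the coarse field, and $r$ the fine-scale residual sampled by the conditional diffusion model $p_\theta$.

```latex
\begin{align}
  \hat{x}_c &= f_\phi(y)
    && \text{(deterministic coarse estimate, MAP-like anchor)} \\
  r &\sim p_\theta\!\left(r \mid \hat{x}_c,\, y\right)
    && \text{(conditional diffusion over the residual only)} \\
  \hat{x} &= \hat{x}_c + r
    && \text{(full-field reconstruction)}
\end{align}
```

Under this reading, the diffusion model never has to choose among global modes; that choice is fixed by $\hat{x}_c$, and sampling only explores the neighborhood around it.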

2601.09886 2026-05-27 cs.CL

Clozing the Gap: Exploring Why Language Model Surprisal Outperforms Cloze Surprisal

缩小差距:探究为何语言模型惊奇度优于完形填空惊奇度

Sathvik Nair, Byung-Doh Oh

发表机构 * University of Maryland(马里兰大学) Nanyang Technological University(南洋理工大学)

AI总结 本研究通过三个假设(低分辨率、语义相似词区分、低频词概率准确性)解释了语言模型概率在预测处理努力上优于完形填空数据的原因。

Comments 18 pages, 10 figures, accepted to ACL 2026 Main Conference

详情
AI中文摘要

一个词的可预测性可以通过两种方式量化:使用人类对完形填空任务的响应或使用语言模型(LM)的概率。当用作处理努力的预测因子时,LM概率优于从完形填空数据得出的概率。然而,重要的是要确定LM概率之所以如此是出于正确的原因,因为不同的预测因子可能导致关于预测在语言理解中作用的科学结论不同。我们提供了关于LM概率优势的三个假设的证据:不受低分辨率影响、区分语义相似的词、以及准确分配低频词的概率。这些结果呼吁努力提高完形填空研究的分辨率,同时进行实验以确定类似人类的预测是否也对LM概率所做的细粒度区分同样敏感。

英文摘要

How predictable a word is can be quantified in two ways: using human responses to the cloze task or using probabilities from language models (LMs). When used as predictors of processing effort, LM probabilities outperform probabilities derived from cloze data. However, it is important to establish that LM probabilities do so for the right reasons, since different predictors can lead to different scientific conclusions about the role of prediction in language comprehension. We present evidence for three hypotheses about the advantage of LM probabilities: not suffering from low resolution, distinguishing semantically similar words, and accurately assigning probabilities to low-frequency words. These results call for efforts to improve the resolution of cloze studies, coupled with experiments on whether human-like prediction is also as sensitive to the fine-grained distinctions made by LM probabilities.
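
The low-resolution hypothesis can be made concrete with a small sketch. Surprisal is the negative log probability of a word; with N cloze responses, the smallest nonzero probability is 1/N, so finite cloze surprisal is capped at log2(N) bits, whereas an LM can assign arbitrarily small probabilities. The function names below are illustrative and do not come from the paper.

```python
import math
from collections import Counter


def cloze_probabilities(responses):
    """Relative frequency of each completion among human cloze responses."""
    counts = Counter(responses)
    total = len(responses)
    return {word: count / total for word, count in counts.items()}


def surprisal_bits(probability):
    """Surprisal in bits: -log2 p(word | context)."""
    return -math.log2(probability)


def max_finite_cloze_surprisal(num_responses):
    """Largest finite surprisal a cloze study with this many responses can express."""
    return math.log2(num_responses)
```

For example, a study with 32 participants cannot distinguish any two words whose true surprisal exceeds 5 bits: both are either produced once or never produced at all.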

2601.08375 2026-05-27 cs.CV

Source-Free Domain Adaptation for Geospatial Point Cloud Semantic Segmentation

地理空间点云语义分割的无源域适应

Yuan Gao, Di Cao, Xiaohuan Xi, Sheng Nie, Shaobo Xia, Cheng Wang

发表机构 * Aerospace Information Research Institute, Chinese Academy of Sciences(中国科学院航天信息研究所) International Research Center of Big Data for Sustainable Development Goals(可持续发展目标大数据国际研究中心) University of Chinese Academy of Sciences(中国科学院大学) Zhengzhou Institute for Advanced Research of Henan Polytechnic University(河南理工大学郑州研究院) Henan Polytechnic University(河南理工大学) School of Aeronautic Engineering, Changsha University of Science and Technology(长沙理工大学航空工程学院) China University of Geosciences, Beijing(中国地质大学(北京))

AI总结 提出LoGo无源域适应框架,通过局部类平衡原型估计和全局最优传输分布对齐,解决地理空间点云语义分割中的域偏移问题。

详情
AI中文摘要

3D地理空间点云的语义分割是遥感应用的基础,但由区域和采集相关变化引起的域偏移通常会降低模型性能。尽管域适应可以缓解这种偏移,但现有方法通常需要访问源域数据,由于隐私问题和监管政策,这往往不可行。为了解决这个问题,我们提出了LoGo(局部-全局双共识),一种新颖的无源无监督域适应(SFUDA)框架,仅需要预训练模型和无标签目标数据。在局部层面,我们引入了一个类平衡原型估计模块,确保即使对于样本稀缺的尾部类别也能生成鲁棒的特征原型,有效缓解长尾分布引起的特征崩溃。在全局层面,我们引入了一个基于最优传输的全局分布对齐模块,将伪标签分配公式化为全局优化问题,有效纠正局部贪婪分配中头部类别的过度主导,从而防止模型预测严重偏向多数类别。最后,我们提出了一种双一致性伪标签过滤机制,仅保留局部多增强集成预测与全局最优传输分配一致的高置信度伪标签用于自训练。在两个具有挑战性的基准测试(包括跨场景和跨传感器设置)上的大量实验表明,LoGo始终优于现有的最先进方法。源代码可在 https://github.com/GYproject/LoGo-SFUDA 获取。

英文摘要

Semantic segmentation of 3D geospatial point clouds is fundamental to remote sensing applications, yet domain shifts caused by regional and acquisition-related variations often degrade model performance. Although domain adaptation can mitigate such shifts, existing methods typically require access to source-domain data, which is often infeasible due to privacy concerns and regulatory policies. To address this, we propose LoGo (Local-Global Dual-Consensus), a novel source-free unsupervised domain adaptation (SFUDA) framework requiring only a pretrained model and unlabeled target data. At the local level, we introduce a class-balanced prototype estimation module that ensures that robust feature prototypes can be generated even for sample-scarce tail classes, effectively mitigating the feature collapse caused by long-tailed distributions. At the global level, we introduce an optimal transport-based global distribution alignment module that formulates pseudo-label assignment as a global optimization problem, effectively correcting the over-dominance of head classes inherent in local greedy assignments, and thereby preventing model predictions from being severely biased towards majority classes. Finally, we propose a dual-consistency pseudo-label filtering mechanism that retains only high-confidence pseudo-labels where local multi-augmented ensemble predictions align with global optimal transport assignments for self-training. Extensive experiments on two challenging benchmarks, encompassing cross-scene and cross-sensor settings, demonstrate that LoGo consistently outperforms existing state-of-the-art methods. The source code is available at https://github.com/GYproject/LoGo-SFUDA.
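
To illustrate how a global optimal-transport assignment can correct head-class dominance, here is a minimal Sinkhorn-style sketch. It is an assumption that a Sinkhorn iteration is an appropriate stand-in (the abstract only says "optimal transport"), and the uniform class prior, function names, and toy probabilities are hypothetical.

```python
def sinkhorn_assign(probs, class_prior, n_iters=50):
    """Rescale per-point class probabilities (all strictly positive) so that
    column sums match `class_prior` and every point carries mass 1/N."""
    n, k = len(probs), len(class_prior)
    plan = [list(row) for row in probs]
    for _ in range(n_iters):
        for j in range(k):  # match the target class marginal
            col = sum(plan[i][j] for i in range(n))
            for i in range(n):
                plan[i][j] *= class_prior[j] / col
        for i in range(n):  # each point keeps total mass 1/N
            row = sum(plan[i])
            for j in range(k):
                plan[i][j] *= (1.0 / n) / row
    return plan


def pseudo_labels(plan):
    """Hard label per point: the class with the largest mass in its row."""
    return [max(range(len(row)), key=row.__getitem__) for row in plan]


# Toy case: greedy argmax puts all four points in class 0 (the head class).
probs = [[0.9, 0.1], [0.8, 0.2], [0.6, 0.4], [0.55, 0.45]]
greedy = pseudo_labels(probs)
balanced = pseudo_labels(sinkhorn_assign(probs, [0.5, 0.5]))
```

In this toy case the two least confident points are reassigned to the tail class, because the column constraint forces the plan to spend half of its mass there.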

2601.07737 2026-05-27 cs.CV cs.AI

Seeing vs. Believing: Evaluating the Language Bias of Open-Source MLLMs in Counter-Intuitive Scenes

看见 vs. 相信:评估开源多模态大模型在反直觉场景中的语言偏见

Chen Ling, Tongwei Zhang, Hanqian Li, Nai Ding

发表机构 * Zhejiang University(浙江大学) Beijing University of Posts and Telecommunications(北京邮电大学) Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 为评估多模态大模型处理反直觉动作场景的能力,提出CAIT基准(400个高保真合成场景),发现开源模型因语言先验而忽视视觉证据,性能接近随机水平,而链式思维推理虽提升准确率但导致过度思考拒绝视觉内容,通过微调和结构化提示可缓解此偏见。

详情
AI中文摘要

多模态大语言模型(MLLMs)在主流视觉理解任务中表现出色,但其处理违背日常常识的动作场景的能力尚未得到充分测试。为填补这一空白,我们引入了CAIT,一个包含400个高保真合成场景的基准,专注于反直觉的视觉动作,例如“兔子在追老虎”,其中视觉证据明确违背常识预期。我们评估了人类、领先的专有模型(如Claude和Gemini)以及14个代表性的开源MLLMs。人类达到近乎完美的性能(约0.95准确率),专有模型表现出稳健的理解(达到0.88准确率),而标准的开源指令微调模型性能处于随机水平。进一步分析表明,这种失败是由强烈的语言先验驱动的:模型不信任视觉输入,而是自动用统计上常见的文本描述覆盖异常的视觉信号。尽管引入链式思维推理机制可以提高准确率,但会显著减慢响应速度并产生新的失败模式:模型过度思考场景,仅仅因为违反现实物理定律而拒绝接受实际的视觉内容。最后,我们证明有针对性的微调和结构化提示可以有效缓解这种对语言先验的依赖,使开源模型能够基于实际视觉证据准确地进行推理。

英文摘要

Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance in mainstream visual understanding tasks, but their ability to process action scenes that contradict everyday common sense remains undertested. To address this gap, we introduce CAIT, a benchmark comprising 400 high-fidelity synthetic scenes focused on counter-intuitive visual actions, such as "a rabbit is chasing a tiger", where visual evidence explicitly contradicts common-sense expectations. We evaluate humans, leading proprietary models (e.g., Claude and Gemini), and 14 representative open-source MLLMs. While humans achieve near-perfect performance (around 0.95 accuracy) and proprietary models demonstrate robust understanding (up to 0.88 accuracy), standard open-source instruction-tuned models perform at chance level. Further analysis demonstrates that this failure is driven by a strong language prior: rather than trusting the visual input, these models automatically override the anomalous visual signals with statistically common text descriptions. Although introducing Chain-of-Thought reasoning mechanisms can improve accuracy, it significantly slows down the response and generates a new failure mode: models overthink the scenario and refuse to accept the actual visual content simply because it violates real-world physical laws. Finally, we demonstrate that targeted fine-tuning and structured prompting can effectively mitigate this reliance on language priors, enabling open-source models to accurately ground their reasoning in actual visual evidence.

2601.07284 2026-05-27 cs.RO

AdaMorph: Unified Motion Retargeting via Embodiment-Aware Adaptive Transformers

AdaMorph: 通过具身感知自适应变换器实现统一运动重定向

Haoyu Zhang, Shibo Jin, Lusong Li, Jun Li, Liang Lin, Xiaodong He, Zecui Zeng

发表机构 * JD Explore Academy(京东探索学院)

AI总结 提出AdaMorph统一框架,利用具身感知自适应变换器将人体运动重定向到多种机器人形态,实现零样本泛化。

详情
AI中文摘要

将人体运动重定向到异构机器人是机器人学中的一个基本挑战,主要由于不同具身之间的严重运动学和动力学差异。现有解决方案通常训练特定于具身的模型,这扩展性差且无法利用共享的运动语义。为了解决这个问题,我们提出了AdaMorph,一个统一的神经重定向框架,使单个模型能够将人体运动适应到多种机器人形态。我们的方法将重定向视为一个条件生成任务。我们将人体运动映射到一个与形态无关的潜在意图空间,并利用双用途提示机制来条件化生成。不同于简单的输入拼接,我们利用自适应层归一化(AdaLN)根据具身约束动态调制解码器的特征空间。此外,我们通过基于课程的训练目标强制执行物理合理性,通过积分确保方向和轨迹一致性。在12个不同的人形机器人上的实验结果表明,AdaMorph有效地统一了跨异构拓扑的控制,在保持源行为动态本质的同时,对未见过的复杂运动表现出强大的零样本泛化能力。

英文摘要

Retargeting human motion to heterogeneous robots is a fundamental challenge in robotics, primarily due to the severe kinematic and dynamic discrepancies between varying embodiments. Existing solutions typically resort to training embodiment-specific models, which scales poorly and fails to exploit shared motion semantics. To address this, we present AdaMorph, a unified neural retargeting framework that enables a single model to adapt human motion to diverse robot morphologies. Our approach treats retargeting as a conditional generation task. We map human motion into a morphology-agnostic latent intent space and utilize a dual-purpose prompting mechanism to condition the generation. Instead of simple input concatenation, we leverage Adaptive Layer Normalization (AdaLN) to dynamically modulate the decoder's feature space based on embodiment constraints. Furthermore, we enforce physical plausibility through a curriculum-based training objective that ensures orientation and trajectory consistency via integration. Experimental results on 12 distinct humanoid robots demonstrate that AdaMorph effectively unifies control across heterogeneous topologies, exhibiting strong zero-shot generalization to unseen complex motions while preserving the dynamic essence of the source behaviors.
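
Adaptive Layer Normalization replaces the fixed scale and shift of LayerNorm with values predicted from a conditioning vector. The sketch below shows one common formulation, `(1 + gamma(c)) * LN(x) + beta(c)`; whether AdaMorph uses exactly this form is an assumption, as are the single linear projections and the interpretation of `cond` as an embodiment embedding.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weights, bias, c):
    """Plain affine map: weights @ c + bias."""
    return [sum(w * ci for w, ci in zip(row, c)) + b for row, b in zip(weights, bias)]


def ada_layer_norm(x, cond, w_scale, b_scale, w_shift, b_shift):
    """Modulate normalized features with a scale and shift predicted from `cond`."""
    gamma = linear(w_scale, b_scale, cond)  # per-feature scale offset
    beta = linear(w_shift, b_shift, cond)   # per-feature shift
    normed = layer_norm(x)
    return [(1.0 + g) * n + b for g, n, b in zip(gamma, normed, beta)]
```

With all projection weights at zero the layer reduces to ordinary LayerNorm, so the conditioning only changes behaviour where the embodiment vector calls for it. This is the usual argument for modulation over input concatenation.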

2601.06580 2026-05-27 cs.CL

Stylistic Evolution and LLM Neutrality in Singlish Language

新加坡英语中的文体演变与LLM中立性

Linus Tze En Foo, Weihan Angela Ng, Wenkai Li, Lynnette Hui Xian Ng

发表机构 * Independent Researcher(独立研究者) ETH Zürich(苏黎世联邦理工学院) Carnegie Mellon University(卡内基梅隆大学)

AI总结 通过分析十年间非正式数字信息的文体变化,研究大型语言模型(LLM)能否生成时间中立的输出,发现文体可分离性随时间距离增加,且LLM在真实性和时间中立性之间存在结构性权衡。

详情
AI中文摘要

新加坡英语是一种根植于新加坡多语言环境的克里奥尔语,随着社会和技术变革持续演变。我们考察了十年间非正式数字信息的历时文体变化,并探究大型语言模型(LLM)能否生成时间中立的输出,以近似该变体的稳定本质。使用词汇、语用、心理语言学和基于编码器的特征,我们发现文体可分离性随时间距离增加而增强,这主要由长度和复杂度等结构特征驱动。与零分布基线相比,大多数LLM未能同时实现真实性和时间中立性,揭示了一种结构性权衡:生成真实新加坡英语的模型继承了其时间偏差,而时间中立的模型则产生不真实的输出。这些发现将时间中立性定位为评估LLM社会方言基础的诊断指标。

英文摘要

Singlish is a creole rooted in Singapore's multilingual environment that continues to evolve alongside social and technological change. We examine diachronic stylistic change across a decade of informal digital messages and ask whether Large Language Models (LLMs) can generate temporally neutral outputs approximating the stable essence of the variety. Using lexical, pragmatic, psycholinguistic, and encoder-based features, we find that stylistic separability increases with temporal distance, driven primarily by structural features such as length and complexity. Evaluated against a null distribution baseline, most LLMs fail to achieve both authenticity and temporal neutrality simultaneously, revealing a structural trade-off: models generating realistic Singlish inherit its temporal biases, while temporally neutral models produce inauthentic outputs. These findings position temporal neutrality as a diagnostic metric for assessing sociolectal grounding in LLMs.

2601.05899 2026-05-27 cs.AI

TowerMind: A Tower Defence Game Learning Environment and Benchmark for LLM as Agents

TowerMind: 一个用于LLM作为智能体的塔防游戏学习环境与基准

Dawei Wang, Chengming Zhou, Di Zhao, Xinyuan Liu, Marci Chi Ma, Gary Ushaw, Richard Davison

发表机构 * Newcastle University(纽卡斯尔大学) University of Auckland(奥克兰大学)

AI总结 本文提出TowerMind,一个基于塔防子类型的轻量级、多模态游戏环境,用于评估大语言模型在长期规划和决策中的能力,并揭示其与人类专家的性能差距及关键局限性。

Comments AAAI 2026 Oral

详情
AI中文摘要

近年来,大语言模型(LLM)的突破性进展使其成为智能体的一种有前景的范式,其中长期规划和决策作为适应不同场景和任务的核心通用能力逐渐凸显。实时策略(RTS)游戏因其固有的游戏玩法需要宏观战略规划和微观战术调整与行动执行,成为评估这两种能力的理想测试平台。现有的基于RTS游戏的环境要么计算需求较高,要么缺乏对文本观察的支持,这限制了RTS游戏在LLM评估中的应用。受此启发,我们提出了TowerMind,一种基于RTS游戏子类型——塔防(TD)的新型环境。TowerMind保留了RTS游戏评估LLM的关键优势,同时具有低计算需求和多模态观察空间,包括基于像素、文本和结构化游戏状态的表示。此外,TowerMind支持模型幻觉评估,并提供高度的可定制性。我们设计了五个基准关卡,以评估几种广泛使用的LLM在不同多模态输入设置下的表现。结果揭示了LLM与人类专家在能力和幻觉维度上的明显性能差距。实验进一步突出了LLM行为的关键局限性,例如规划验证不足、决策缺乏多终性以及行动使用效率低下。我们还评估了两种经典强化学习算法:Ape-X DQN和PPO。通过提供轻量级和多模态设计,TowerMind补充了现有的基于RTS游戏的环境格局,并为AI智能体领域引入了一个新的基准。源代码已在GitHub上公开(https://github.com/tb6147877/TowerMind)。

英文摘要

Recent breakthroughs in Large Language Models (LLMs) have positioned them as a promising paradigm for agents, with long-term planning and decision-making emerging as core general-purpose capabilities for adapting to diverse scenarios and tasks. Real-time strategy (RTS) games serve as an ideal testbed for evaluating these two capabilities, as their inherent gameplay requires both macro-level strategic planning and micro-level tactical adaptation and action execution. Existing RTS game-based environments either suffer from relatively high computational demands or lack support for textual observations, which has constrained the use of RTS games for LLM evaluation. Motivated by this, we present TowerMind, a novel environment grounded in the tower defense (TD) subgenre of RTS games. TowerMind preserves the key evaluation strengths of RTS games for assessing LLMs, while featuring low computational demands and a multimodal observation space, including pixel-based, textual, and structured game-state representations. In addition, TowerMind supports the evaluation of model hallucination and provides a high degree of customizability. We design five benchmark levels to evaluate several widely used LLMs under different multimodal input settings. The results reveal a clear performance gap between LLMs and human experts across both capability and hallucination dimensions. The experiments further highlight key limitations in LLM behavior, such as inadequate planning validation, a lack of multifinality in decision-making, and inefficient action use. We also evaluate two classic reinforcement learning algorithms: Ape-X DQN and PPO. By offering a lightweight and multimodal design, TowerMind complements the existing RTS game-based environment landscape and introduces a new benchmark for the AI agent field. The source code is publicly available on GitHub (https://github.com/tb6147877/TowerMind).

2601.05729 2026-05-27 cs.CV

TAGRPO: Boosting GRPO on Image-to-Video Generation with Direct Trajectory Alignment

TAGRPO: 通过直接轨迹对齐提升图像到视频生成中的GRPO

Jin Wang, Jianxiang Lu, Guangzheng Xu, Comi Chen, Haoyu Yang, Linqing Wang, Peng Chen, Mingtao Chen, Zhichao Hu, Longhuang Wu, Shuai Shao, Qinglin Lu, Ping Luo

发表机构 * The University of Hong Kong(香港大学) Tencent Hunyuan(腾讯混元)

AI总结 针对图像到视频生成中GRPO优化效果不佳的问题,提出基于对比学习的TAGRPO框架,通过中间潜变量对齐高奖励轨迹并远离低奖励轨迹,结合记忆库提升多样性,显著优于DanceGRPO。

Comments 18 pages, 12 figures

详情
AI中文摘要

近期研究表明,将组相对策略优化(GRPO)集成到流匹配模型中,特别是在文本到图像和文本到视频生成中,具有显著效果。然而,我们发现将这些技术直接应用于图像到视频(I2V)模型往往无法带来一致的奖励提升。为解决这一局限,我们提出了TAGRPO,一个受对比学习启发的鲁棒后训练框架,适用于I2V模型。我们的方法基于以下观察:从相同初始噪声生成的rollout视频为优化提供了更优的指导。基于这一洞察,我们提出了一种应用于中间潜变量的新型GRPO损失,鼓励直接对齐高奖励轨迹,同时最大化与低奖励轨迹的距离。此外,我们引入了一个用于rollout视频的记忆库,以增强多样性并降低计算开销。尽管方法简单,TAGRPO在I2V生成中相比DanceGRPO取得了显著改进。相关成果将在 https://tagrpo.github.io/ 更新。

英文摘要

Recent studies have demonstrated the efficacy of integrating Group Relative Policy Optimization (GRPO) into flow matching models, particularly for text-to-image and text-to-video generation. However, we find that directly applying these techniques to image-to-video (I2V) models often fails to yield consistent reward improvements. To address this limitation, we present TAGRPO, a robust post-training framework for I2V models inspired by contrastive learning. Our approach is grounded in the observation that rollout videos generated from identical initial noise provide superior guidance for optimization. Leveraging this insight, we propose a novel GRPO loss applied to intermediate latents, encouraging direct alignment with high-reward trajectories while maximizing distance from low-reward counterparts. Furthermore, we introduce a memory bank for rollout videos to enhance diversity and reduce computational overhead. Despite its simplicity, TAGRPO achieves significant improvements over DanceGRPO in I2V generation. The deliverables will be updated at https://tagrpo.github.io/ .
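
GRPO scores each rollout relative to the other rollouts in its group, which is what makes "high-reward" and "low-reward" trajectories well defined. The sketch below shows the standard group-relative advantage and a simple split into the two sets; the split rule is a hypothetical illustration, and the actual TAGRPO loss on intermediate latents is not specified in the abstract.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one rollout group (mean 0, unit scale)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def split_rollouts(rewards):
    """Indices of rollouts to align with (positive advantage) and to move
    away from (negative advantage). Hypothetical selection rule."""
    advantages = group_relative_advantages(rewards)
    high = [i for i, a in enumerate(advantages) if a > 0]
    low = [i for i, a in enumerate(advantages) if a < 0]
    return high, low
```

Because the abstract notes that rollouts sharing the same initial noise give better guidance, a group here would plausibly consist of videos generated from one image and one noise sample.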

2601.03525 2026-05-27 cs.LG cs.AI

Beyond Binary: Turning Partial Success into Dense Verifiable Rewards for Reinforcement Learning in Code Generation

超越二元:将部分成功转化为代码生成中强化学习的密集可验证奖励

Longwen Wang, Yirui Liu, Xuan'er Wu, Xiaohui Hu, Yuankai Fan, Kaidong Yu, Qizhen Weng, Wei Xi, Xuelong Li

发表机构 * Institute of Artificial Intelligence, China Telecom (TeleAI)(中国电信人工智能研究院(TeleAI)) Xingchen AGI Lab, China Telecom Artificial Intelligence Technology (Beijing) Co., Ltd(中国电信人工智能技术(北京)有限公司Xingchen AGI实验室) National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Xi’an Jiaotong University(人机混合增强智能国家重点实验室,西安交通大学)

AI总结 提出VeRPO框架,利用代码测试的部分成功作为可验证密集奖励,通过动态密度校准局部奖励修正基数偏差,并与全局执行结果结合,提升代码生成强化学习的性能。

详情
AI中文摘要

有效的奖励设计是代码生成强化学习(RL)中的核心挑战。主流的测试套件级结果奖励强制执行功能正确性但导致稀疏性,而外部奖励模型(RM)提供密集监督但代价是错位和额外开销。由于代码评估自然产生多个测试用例级结果,部分成功(即通过部分测试用例)提供了内在的、可验证的密集监督来源。在本文中,我们提出VeRPO(可验证密集奖励策略优化),一个系统地将可验证的部分成功转化为可靠密集奖励的RL框架。我们使用加权和公式分析部分成功奖励,理论上识别出一个关键的基数偏差,导致策略更新不成比例地偏向于从简单测试成功中获益,而非在前沿测试上取得进展。基于此,VeRPO引入了一个动态的、密度校准的局部奖励,明确纠正这种偏差,并从部分成功中提供稳健的密集监督。为了增强与端到端功能正确性的一致性,VeRPO进一步将局部密集奖励与全局执行结果相结合。在多种基准和设置上的大量实验表明,VeRPO优于结果驱动和基于RM的基线,实现了高达+8.83 pass@1的提升,且时间成本可忽略不计(<0.02%),GPU内存开销为零。

英文摘要

Effective reward design is a central challenge in Reinforcement Learning (RL) for code generation. Mainstream test-suite-level outcome rewards enforce functional correctness but induce sparsity, while external Reward Models (RMs) provide dense supervision at the cost of misalignment and additional overhead. Since code evaluation naturally yields multiple test-case-level outcomes, partial success, i.e., passing a subset of test cases, offers an intrinsic, verifiable source of dense supervision. In this paper, we propose VeRPO (Verifiable Dense Reward Policy Optimization), an RL framework that systematically turns verifiable partial success into reliable dense rewards. We analyze partial-success rewards using a weighted sum formulation, theoretically identifying a critical cardinality bias that causes policy updates to disproportionately favor gains from easy-test successes over progress on frontier tests. Based on this, VeRPO introduces a dynamic, density-calibrated local reward that explicitly corrects this bias and provides robust dense supervision from partial success. To enhance alignment with end-to-end functional correctness, VeRPO further integrates the local dense reward with global execution outcomes. Extensive experiments across diverse benchmarks and settings demonstrate that VeRPO outperforms outcome-driven and RM-based baselines, achieving up to +8.83 pass@1 gain with negligible time cost (< 0.02%) and zero GPU memory overhead.
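
The cardinality bias is easy to see in a toy weighted-sum reward. With uniform weights, passing many easy tests outscores passing one hard test; reweighting by difficulty reverses the ordering. The inverse-pass-rate weighting below is a hypothetical stand-in for illustration only and is not VeRPO's density-calibrated reward, whose exact form the abstract does not give.

```python
def weighted_partial_reward(passed, weights):
    """Weighted fraction of test cases passed, in [0, 1]."""
    total = sum(weights)
    return sum(w for ok, w in zip(passed, weights) if ok) / total


def difficulty_weights(pass_rates, eps=1e-6):
    """Hypothetical calibration: rarely passed tests receive larger weight."""
    return [1.0 / (rate + eps) for rate in pass_rates]


# Three easy tests (90% of samples pass them) and one frontier test (10%).
pass_rates = [0.9, 0.9, 0.9, 0.1]
easy_only = [True, True, True, False]
hard_only = [False, False, False, True]

uniform = [1.0, 1.0, 1.0, 1.0]
calibrated = difficulty_weights(pass_rates)
```

Under `uniform`, `easy_only` scores 0.75 against 0.25 for `hard_only`; under `calibrated`, the scores are roughly 0.25 and 0.75, so progress on the frontier test is no longer drowned out.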

2601.05028 2026-05-27 cs.LG

Approximate Equivariance via Projection-based Regularisation

基于投影正则化的近似等变性

Torben Berndt, Jan Stühmer

发表机构 * Heidelberg Institute for Theoretical Studies(海德堡理论研究所) Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院)

AI总结 提出一种基于投影的正则化方法,通过在线性层中分解等变与非等变分量并惩罚非等变算子范数,实现高效且精确的近似等变性,在SO(3)等连续群上优于样本基方法。

详情
AI中文摘要

等变性是神经网络中一种强大的归纳偏置,能够提高泛化能力和物理一致性。然而,最近非等变模型因其更好的运行时性能以及现实应用中可能出现的不完美对称性而重新受到关注。这推动了近似等变模型的发展,这些模型在尊重对称性和拟合数据分布之间取得了平衡。该领域现有的方法通常使用基于样本的正则化器,这些正则化器依赖于训练时的数据增强,导致较高的样本复杂度,特别是对于$SO(3)$等连续群。相反,本文通过基于投影的正则化器来处理近似等变性,该正则化器利用线性层到等变和非等变分量的正交分解。与现有方法不同,本文在算子层面上对整个群轨道上的非等变性进行惩罚,而不是逐点惩罚。我们提出了一个数学框架,用于在空间域和谱域中精确且高效地计算非等变性惩罚。在我们的实验中,我们的方法在模型性能和效率上始终优于先前的近似等变性方法,与基于样本的正则化器相比,实现了显著的运行时增益。

英文摘要

Equivariance is a powerful inductive bias in neural networks, improving generalisation and physical consistency. Recently, however, non-equivariant models have regained attention, due to their better runtime performance and imperfect symmetries that might arise in real-world applications. This has motivated the development of approximately equivariant models that strike a middle ground between respecting symmetries and fitting the data distribution. Existing approaches in this field usually apply sample-based regularisers which depend on data augmentation at training time, incurring a high sample complexity, in particular for continuous groups such as $SO(3)$. This work instead approaches approximate equivariance via a projection-based regulariser which leverages the orthogonal decomposition of linear layers into equivariant and non-equivariant components. In contrast to existing methods, this penalises non-equivariance at an operator level across the full group orbit, rather than point-wise. We present a mathematical framework for computing the non-equivariance penalty exactly and efficiently in both the spatial and spectral domain. In our experiments, our method consistently outperforms prior approximate equivariance approaches in both model performance and efficiency, achieving substantial runtime gains over sample-based regularisers.
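
The orthogonal decomposition can be illustrated on a small finite group. For cyclic shifts acting on R^n, averaging a weight matrix over the group projects it onto the equivariant (circulant) matrices, and the squared distance to that projection measures non-equivariance for the whole group at once, without sampling inputs. This is a sketch under simplifying assumptions: the paper targets continuous groups such as SO(3), and the Frobenius norm used here is a choice made for the example.

```python
def project_equivariant(w):
    """Group-average W over cyclic shifts:
    P(W)[i][j] = mean over k of W[(i + k) % n][(j + k) % n]."""
    n = len(w)
    return [
        [sum(w[(i + k) % n][(j + k) % n] for k in range(n)) / n for j in range(n)]
        for i in range(n)
    ]


def non_equivariance_penalty(w):
    """Squared Frobenius norm of the non-equivariant component W - P(W)."""
    p = project_equivariant(w)
    n = len(w)
    return sum((w[i][j] - p[i][j]) ** 2 for i in range(n) for j in range(n))
```

A circulant matrix incurs zero penalty, while any other matrix is penalized in proportion to how far it is from commuting with every shift.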

2410.00995 2026-05-27 cs.LG

CktGen: Automated Analog Circuit Design with Generative Artificial Intelligence

CktGen: 基于生成式人工智能的自动化模拟电路设计

Yuxuan Hou, Hehe Fan, Jianrong Zhang, Yue Zhang, Hua Chen, Min Zhou, Faxin Yu, Roger Zimmermann, Yi Yang

发表机构 * College of Computer Science and Technology(计算机科学与技术学院) Australian Artificial Intelligence Institute(澳大利亚人工智能研究所) School of Aeronautics and Astronautics(航空宇航科学学院) School of Computing(计算科学学院)

AI总结 提出CktGen,一种基于条件变分自编码器的模拟电路生成方法,通过解耦电路与规格编码并采用对比训练和分类器引导,实现从目标规格到有效电路的生成,显著优于现有方法。

Comments Paper accepted by Engineering

详情
AI中文摘要

模拟电路的自动综合面临重大挑战。大多数现有方法将问题表述为单目标优化任务,忽略了给定电路类型的设计规格在不同应用中的广泛变化。为了解决这个问题,我们引入了规格条件模拟电路生成,这是一项根据目标规格直接生成模拟电路的任务。其动机是利用现有的设计良好的电路来提高模拟电路设计的自动化程度。具体来说,我们提出了CktGen,一种简单而有效的变分自编码器,它将离散化的规格和电路映射到联合潜在空间,并从该潜在向量重建电路。值得注意的是,由于单个规格可能对应多个有效电路,简单地将规格信息融合到生成模型中无法捕捉这些一对多的关系。为了解决这个问题,我们解耦了电路和规格的编码,并对齐它们映射的潜在空间。然后,我们采用带有过滤掩码的对比训练来最大化编码电路和规格之间的差异。此外,分类器引导与潜在特征对齐促进了共享相同规格的电路的聚类,避免了模型崩溃为平凡的一对一映射。通过根据规格规范化潜在空间,我们可以搜索满足有效目标规格的最优电路。我们在开放电路基准上进行了全面实验,并引入了评估跨模型一致性的指标。实验结果表明,CktGen相比最先进的方法取得了显著改进。

英文摘要

The automatic synthesis of analog circuits presents significant challenges. Most existing approaches formulate the problem as a single-objective optimization task, overlooking that design specifications for a given circuit type vary widely across applications. To address this, we introduce specification-conditioned analog circuit generation, a task that directly generates analog circuits based on target specifications. The motivation is to leverage existing well-designed circuits to improve automation in analog circuit design. Specifically, we propose CktGen, a simple yet effective variational autoencoder that maps discretized specifications and circuits into a joint latent space and reconstructs the circuit from that latent vector. Notably, as a single specification may correspond to multiple valid circuits, naively fusing specification information into the generative model does not capture these one-to-many relationships. To address this, we decouple the encoding of circuits and specifications and align their mapped latent space. Then, we employ contrastive training with a filter mask to maximize differences between encoded circuits and specifications. Furthermore, classifier guidance along with latent feature alignment promotes the clustering of circuits sharing the same specification, avoiding model collapse into trivial one-to-one mappings. By canonicalizing the latent space with respect to specifications, we can search for an optimal circuit that meets valid target specifications. We conduct comprehensive experiments on the open circuit benchmark and introduce metrics to evaluate cross-model consistency. Experimental results demonstrate that CktGen achieves substantial improvements over state-of-the-art methods.

2601.03089 2026-05-27 cs.CL cs.AI cs.LG

Faithfulness Evaluation for Decoder-only LLM Attributions with Controlled Retained Information

基于受控保留信息的仅解码器LLM归因忠实性评估

Xin Huang, Antoni B. Chan

发表机构 * City University of Hong Kong(香港城市大学)

AI总结 针对现有软扰动忠实性指标因保留词数不同导致评估偏差的问题,提出π-Soft-NC和π-Soft-NS框架,通过控制期望保留概率公平比较归因方法,并引入专用于自回归解码器LLM的梯度归因方法Grad-ELLM。

详情
AI中文摘要

大型语言模型(LLM)越来越多地使用输入归因方法进行评估,但比较这些解释仍然具有挑战性。现有的软扰动忠实性指标,如Soft-NC和Soft-NS,可能将归因质量与扰动期间保留的词数混为一谈:平均得分较高的归因方法可能保留更多词,从而获得膨胀的分数。为解决此问题,我们提出π-Soft-NC和π-Soft-NS,这是一个在相同期望保留概率下比较归因方法的评估框架,从而控制保留词数。我们进一步引入Grad-ELLM,一种针对自回归仅解码器LLM定制的基于梯度的归因方法,该方法在每个解码步骤将梯度导出的通道重要性与注意力导出的标记重要性相结合。在Llama和Mistral上的分类和开放生成任务实验表明,Grad-ELLM在π-Soft-NC下实现了强全面性导向的忠实性,而在π-Soft-NS下没有主导方法。我们的评估指标为比较LLM的可解释人工智能方法提供了一个严格的框架,将支持该领域的进展。

英文摘要

Large Language Models (LLMs) are increasingly evaluated with input attribution methods, yet comparing such explanations remains challenging. Existing soft-perturbation faithfulness metrics, such as Soft-NC and Soft-NS, can conflate attribution quality with the number of words retained during perturbation: attribution methods with larger average scores may keep more words and therefore obtain inflated scores. To address this issue, we propose $\pi$-Soft-NC and $\pi$-Soft-NS, an evaluation framework that compares attribution methods under the same expected retaining probability, thus controlling the number of retained words. We further introduce Grad-ELLM, a gradient-based attribution method tailored to autoregressive decoder-only LLMs, which combines gradient-derived channel importance with attention-derived token importance at each decoding step. Experiments on classification and open-generation tasks with Llama and Mistral show that Grad-ELLM achieves strong comprehensiveness-oriented faithfulness under $\pi$-Soft-NC, while there is no dominant method under $\pi$-Soft-NS. Our evaluation metric serves as a rigorous framework to compare XAI methods for LLMs, which will support progress in the field.
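
Controlling the expected retaining probability can be sketched as a calibration step applied to each method's scores before perturbation. The rescale-and-clip rule and bisection search below are assumptions chosen for illustration; the abstract does not describe the actual calibration procedure, and the function name is hypothetical.

```python
def calibrate_retention(scores, target, iters=60):
    """Map positive attribution scores to keep-probabilities min(1, c * s),
    choosing c so that the mean keep-probability equals `target`."""
    if not 0.0 < target < 1.0 or min(scores) <= 0.0:
        raise ValueError("need 0 < target < 1 and strictly positive scores")

    def mean_keep(c):
        return sum(min(1.0, c * s) for s in scores) / len(scores)

    lo, hi = 0.0, 1.0
    while mean_keep(hi) < target:  # grow the bracket until it covers the target
        hi *= 2.0
    for _ in range(iters):  # bisection: mean_keep is non-decreasing in c
        mid = (lo + hi) / 2.0
        if mean_keep(mid) < target:
            lo = mid
        else:
            hi = mid
    return [min(1.0, hi * s) for s in scores]
```

After calibration, a method that emits uniformly large scores and one that emits small scores both retain the same expected fraction of words, so any remaining difference in the metric reflects how the mass is distributed across words.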

2601.01668 2026-05-27 cs.CL cs.AI

EHRSummarizer: A Privacy-Aware, FHIR-Native Reference Architecture for Source-Grounded EHR Summarization

EHRSummarizer:一种隐私感知、FHIR原生的源接地EHR摘要参考架构

Houman Kazemzadeh, Nima Minaifar, Kamyar Naderi, Sho Tabibzadeh

发表机构 * MedLedger365 MedConnect365 Xylemed Kypath Associates Inc.

AI总结 提出一种隐私感知、FHIR原生的参考架构EHRSummarizer,通过检索HL7 FHIR R4资源并约束生成源接地摘要,以支持临床病历审查。

Comments 15 pages, 2 figures, 2 tables. Version 2 clarifies missing-data status handling, medication-status ambiguity, controlled narrative-document handling, source-grounded resource grouping, and future source-to-summary traceability

详情
AI中文摘要

临床医生通常需要浏览碎片化的电子健康记录(EHR)界面,以整合患者问题、用药、近期就诊和纵向趋势的连贯图像。本文描述了EHRSummarizer,一种用于结构化EHR摘要的隐私感知、FHIR原生参考架构。该架构检索一组目标性的高收益HL7 FHIR R4资源,将其标准化为临床上下文包,并使用受约束的摘要阶段生成源接地摘要,旨在支持病历审查。该架构进一步阐明了缺失数据状态处理、用药状态模糊性、在可用时对叙述性临床文档的受控使用,以及未来的源到摘要可追溯性。本文描述的是参考架构和原型行为,而非经过验证的临床干预、自主临床决策支持系统或临床获益证据。在合成和测试FHIR环境上的原型演示展示了端到端行为和输出格式;然而,本文未报告临床结果、受控工作流研究或基准结果。我们概述了一个评估计划,重点关注忠实性、遗漏风险、时间正确性、可用性、隐私和操作监控,以指导未来的机构评估。

英文摘要

Clinicians routinely navigate fragmented electronic health record (EHR) interfaces to assemble a coherent picture of a patient's problems, medications, recent encounters, and longitudinal trends. This manuscript describes EHRSummarizer, a privacy-aware, FHIR-native reference architecture for structured EHR summarization. The architecture retrieves a targeted set of high-yield HL7 FHIR R4 resources, normalizes them into a clinical context package, and uses a constrained summarization stage to produce source-grounded summaries intended to support chart review. The architecture further clarifies missing-data status handling, medication-status ambiguity, controlled use of narrative clinical documents when available, and future source-to-summary traceability. The manuscript describes a reference architecture and prototype behavior rather than a validated clinical intervention, autonomous clinical decision-support system, or evidence of clinical benefit. Prototype demonstrations on synthetic and test FHIR environments illustrate end-to-end behavior and output formats; however, this manuscript does not report clinical outcomes, controlled workflow studies, or benchmark results. We outline an evaluation plan centered on faithfulness, omission risk, temporal correctness, usability, privacy, and operational monitoring to guide future institutional assessment.
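
The "clinical context package" step can be pictured as grouping the entries of a FHIR Bundle by resource type while recording which categories came back empty. This is a minimal sketch, not the EHRSummarizer implementation: the selected resource types, the output layout, and the function name are assumptions; only the `Bundle.entry[].resource` structure and the `resourceType` and `id` fields come from the FHIR standard.

```python
def build_context_package(bundle, wanted=("Condition", "MedicationRequest", "Encounter", "Observation")):
    """Group Bundle resources by type and flag categories with no data."""
    grouped = {rtype: [] for rtype in wanted}
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        rtype = resource.get("resourceType")
        if rtype in grouped:
            # Keep a source reference so summary statements can be traced back.
            grouped[rtype].append(f"{rtype}/{resource.get('id')}")
    missing = [rtype for rtype, refs in grouped.items() if not refs]
    return {"resources": grouped, "missing": missing}
```

Reporting `missing` explicitly lets a downstream summarizer say "no medication data retrieved" instead of implying that the patient takes no medication.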

2601.01608 2026-05-27 cs.CV

Guiding Token-Sparse Diffusion Models

引导令牌稀疏扩散模型

Felix Krause, Stefan Andreas Baumann, Johannes Schusterbauer, Olga Grebenkova, Ming Gui, Vincent Tao Hu, Björn Ommer

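Affiliations * CompVis

AI summary To address the weak response of sparsely trained diffusion models to classifier-free guidance at inference, the paper proposes a token-level Sparse Guidance method that lowers compute cost while keeping outputs high-quality and high-variance.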

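AI abstract (translated from Chinese)

Diffusion models produce high-quality images but are expensive to train and run. Recent work exploits the inherent redundancy of visual content, training on only a subset of the visual information to reduce training cost. Although these methods make training cheaper and more effective, sparsely trained diffusion models perform poorly at inference because they respond too weakly to classifier-free guidance (CFG). To address this, we propose Sparse Guidance (SG). Rather than using conditional dropout as the signal for guiding the diffusion model, SG uses token-level sparsity. As a result, SG better preserves the high variance of the conditional prediction and produces high-quality, high-variance outputs. By exploiting token-level sparsity at inference, SG improves fidelity at lower compute, reaching 1.58 FID on the commonly used ImageNet-256 benchmark with 25% fewer FLOPs and saving up to 58% of FLOPs at matched baseline quality. To demonstrate the effectiveness of Sparse Guidance, we train a 2.5B text-to-image diffusion model with training-time sparsity and apply SG at inference. SG improves composition and human preference scores while increasing throughput.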

English abstract

Diffusion models deliver high quality in image synthesis but remain expensive during training and inference. Recent works have leveraged the inherent redundancy in visual content to make training more affordable by training only on a subset of visual information. While these methods were successful in providing cheaper and more effective training, sparsely trained diffusion models struggle at inference because they respond only weakly to Classifier-free Guidance (CFG), which leads to underwhelming performance. To overcome this, we propose Sparse Guidance (SG). Instead of using conditional dropout as a signal to guide diffusion models, SG uses token-level sparsity. As a result, SG better preserves the high variance of the conditional prediction, achieving good-quality, high-variance outputs. Leveraging token-level sparsity at inference, SG improves fidelity at lower compute, achieving 1.58 FID on the commonly used ImageNet-256 benchmark with 25% fewer FLOPs, and yields up to 58% FLOP savings at matched baseline quality. To demonstrate the effectiveness of Sparse Guidance, we train a 2.5B text-to-image diffusion model using training-time sparsity and leverage SG during inference. SG achieves improvements in composition and human preference score while also increasing throughput.
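For readers unfamiliar with guidance, the generic extrapolation step can be sketched as below. In standard CFG the weak prediction is the unconditional one, which the model learns through conditional dropout. According to the abstract, SG instead derives the guiding signal from token-level sparsity. How that sparse prediction is computed is not stated here, so `weak` is only a placeholder, and the function name and arguments are illustrative assumptions rather than the authors' code.

```python
import numpy as np


def guided_prediction(strong, weak, scale):
    """Extrapolate from a weak prediction toward (and past) a strong one.

    Standard CFG: strong = conditional output, weak = unconditional output.
    scale = 1 returns the strong prediction unchanged; scale > 1 amplifies
    the difference between the two.
    """
    strong = np.asarray(strong, dtype=float)
    weak = np.asarray(weak, dtype=float)
    return weak + scale * (strong - weak)
```

The formula only has an effect when the two predictions actually differ. This is consistent with the abstract's point that a model responding weakly to the guiding signal gains little from guidance.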

2601.00575 2026-05-27 cs.CL

InfoSynth: Information-Guided Benchmark Synthesis for LLMs

Ishir Garg, Neel Kolhe, Xuandong Zhao, Dawn Song

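Affiliations * UC Berkeley

AI summary Proposes InfoSynth, a framework based on information theory (KL divergence and entropy) that automatically generates difficult, diverse Python coding benchmarks, with accurate test cases and solutions 97% of the time.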

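AI abstract (translated from Chinese)

Large language models (LLMs) have made notable progress in reasoning and code generation, but efficiently creating new benchmarks to evaluate these capabilities remains a challenge. Traditional benchmark creation relies on manual effort, which is costly and time-consuming. In addition, existing benchmarks often contaminate LLM training data, so novel and diverse benchmarks are needed to assess models' true capabilities accurately. This paper introduces InfoSynth, a new framework that automatically generates and evaluates reasoning benchmarks based on information-theoretic principles. We propose metrics based on KL divergence and entropy that quantify a benchmark's novelty and diversity without costly model evaluations. Building on this framework, we develop an end-to-end pipeline that uses genetic algorithms and iterative code feedback to synthesize robust Python coding problems from seed datasets. Our method generates accurate test cases and solutions for new problems 97% of the time, and the synthesized benchmarks are consistently more difficult than those of prior work. Our algorithm also provides a way to control the novelty/diversity and difficulty of the generated problems. InfoSynth offers a scalable, self-verifying pipeline for building high-quality, challenging coding benchmarks for LLMs. Project page: https://ishirgarg.github.io/infosynth_web/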

English abstract

Large language models (LLMs) have demonstrated significant advancements in reasoning and code generation, but efficiently creating new benchmarks to evaluate these capabilities remains a challenge. Traditional benchmark creation relies on manual human effort, which is expensive and time-consuming. Furthermore, existing benchmarks often contaminate LLM training data, necessitating novel and diverse benchmarks to accurately assess their genuine capabilities. This work introduces InfoSynth, a novel framework for automatically generating and evaluating reasoning benchmarks guided by information-theoretic principles. We propose metrics based on KL-divergence and entropy to quantify benchmark novelty and diversity without relying on costly model evaluations. Building on this framework, we develop an end-to-end pipeline that synthesizes robust Python coding problems from seed datasets using genetic algorithms and iterative code feedback. Our method generates accurate test cases and solutions to new problems 97% of the time, and the synthesized benchmarks consistently exhibit higher difficulty compared to prior works. Moreover, our algorithm provides a method for controlling the novelty/diversity and difficulty of generated problems. InfoSynth offers a scalable, self-verifying pipeline for constructing high-quality, challenging coding benchmarks for LLMs. Project Page: https://ishirgarg.github.io/infosynth_web/
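The two quantities named in the abstract can be illustrated on plain discrete distributions. This is a minimal sketch of the textbook definitions only. How InfoSynth actually estimates the distributions for a benchmark is not described in the abstract, so the histogram inputs, the function names, and the `eps` floor are assumptions.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution p."""
    return -sum(x * math.log(x) for x in p if x > 0)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; q is floored at eps to avoid division by zero."""
    return sum(x * math.log(x / max(y, eps)) for x, y in zip(p, q) if x > 0)
```

Read informally, a higher entropy of a benchmark's own distribution suggests more diverse problems, and a larger KL divergence from a seed distribution suggests greater novelty. That reading follows the abstract's framing and is not a statement of the paper's exact estimators.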