arXivDaily: arXiv daily academic digest, updated Monday to Friday
2603.11045 2026-05-15 cs.LG cond-mat.mtrl-sci cs.AI cs.CV physics.ins-det

Neural Field Thermal Tomography: A Differentiable Physics Framework for Non-Destructive Evaluation

Tao Zhong, Yixun Hu, Dongzhe Zheng, Aditya Sood, Christine Allen-Blanchette

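Affiliations * Princeton University

AI summary This paper proposes Neural Field Thermal Tomography (NeFTY), a neural-field method for label-free three-dimensional inverse heat conduction. It represents the diffusivity as a continuous coordinate-based neural network and runs a differentiable implicit-Euler heat solver at every optimization step, so the governing equation holds exactly at the discretization level rather than as a soft constraint. On synthetic 3D benchmarks, NeFTY substantially outperforms soft-constrained physics-informed neural networks and a voxel-grid baseline, and on real thermography data it surpasses classical signal-processing baselines in defect segmentation and depth estimation.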

Comments 37 pages, 19 figures

Abstract

Inverse problems for stiff parabolic partial differential equations (PDEs), such as the inverse heat conduction problem (IHCP), are severely ill-posed: the forward map rapidly damps high-frequency interior structure before it reaches the boundary. Soft-constrained physics-informed neural networks (PINNs), which embed the PDE as a residual penalty, suffer from gradient pathology in this regime and tend to fit boundary measurements while leaving the interior field essentially untouched. We propose Neural Field Thermal Tomography (NeFTY), a hard-constrained neural field framework for label-free three-dimensional inverse heat conduction. NeFTY represents the unknown diffusivity as a continuous coordinate-based neural network, and at every optimization step passes the candidate field through a differentiable implicit-Euler heat solver with harmonic-mean interface flux, so that the governing PDE holds exactly on the discretization rather than as a soft penalty. Adjoint gradients propagate the surface reconstruction error back to the network weights at solver-level memory cost, making test-time inversion tractable on a single GPU. Across synthetic 3D benchmarks, NeFTY substantially outperforms soft-constrained PINN variants and a voxel-grid baseline on label-free volumetric recovery, and it transfers to real thermography data, surpassing classical signal-processing baselines in both defect segmentation and depth estimation. Additional details at https://cab-lab-princeton.github.io/nefty/
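
The hard-constrained forward step described in the abstract can be illustrated in one dimension. The sketch below is not the authors' solver: it assumes a 1D cell-centered grid, insulated (zero-flux) ends, and a dense linear solve, and the names `implicit_euler_step`, `alpha`, and the "defect" region are hypothetical. It shows one implicit-Euler step of u_t = d/dx(alpha u_x) in which each interface flux uses the harmonic mean of the two neighbouring diffusivities.

```python
import numpy as np

def implicit_euler_step(u, alpha, dt, dx):
    """One backward-Euler step of u_t = d/dx(alpha * u_x) with insulated ends."""
    n = u.size
    # Harmonic mean of neighbouring cell diffusivities at each interior face.
    face = 2.0 * alpha[:-1] * alpha[1:] / (alpha[:-1] + alpha[1:])
    r = dt / dx**2 * face
    # Assemble A = I + dt * (discrete diffusion operator); A is symmetric.
    A = np.eye(n)
    idx = np.arange(n - 1)
    A[idx, idx] += r
    A[idx + 1, idx + 1] += r
    A[idx, idx + 1] -= r
    A[idx + 1, idx] -= r
    return np.linalg.solve(A, u)

# Hypothetical test case: a low-diffusivity inclusion in a uniform bar,
# with all heat initially deposited in the first cell.
alpha = np.full(50, 1e-1)
alpha[20:30] = 1e-3
u0 = np.zeros(50)
u0[0] = 1.0
u = u0.copy()
for _ in range(100):
    u = implicit_euler_step(u, alpha, dt=0.5, dx=1.0)
```

Because the update solves the discretized equation exactly at every step, total heat is conserved to solver precision regardless of the candidate `alpha`; in a differentiable framework the same linear solve would also carry gradients back to the diffusivity field.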

2603.03577 2026-05-15 cs.CV cs.RO

From Local Matches to Global Masks: Template-Guided Instance Detection and Segmentation in Open-World Scenes

Qifan Zhang, Sai Haneesh Allu, Jikai Wang, Yangxiao Lu, Yu Xiang

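Affiliations * IRVLUTD

AI summary This paper studies how to detect and segment novel object instances in open-world scenes from a small set of template images. It proposes L2G-Det, a local-to-global detection framework that generates candidate points through dense patch-level matching between the templates and the query image, filters them with a candidate selection module, and uses them to prompt an augmented Segment Anything Model (SAM) for instance segmentation. The method avoids explicit object proposals and improves detection and segmentation under occlusion and background clutter.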

Comments Accepted to Robotics: Science and Systems (RSS) 2026. Project page: https://irvlutd.github.io/L2G/

Abstract

Detecting and segmenting novel object instances in open-world environments is a fundamental problem in robotic perception. Given only a small set of template images, a robot must locate and segment a specific object instance in a cluttered, previously unseen scene. Existing proposal-based approaches are highly sensitive to proposal quality and often fail under occlusion and background clutter. We propose L2G-Det, a local-to-global instance detection framework that bypasses explicit object proposals by leveraging dense patch-level matching between templates and the query image. Locally matched patches generate candidate points, which are refined through a candidate selection module to suppress false positives. The filtered points are then used to prompt an augmented Segment Anything Model (SAM) with instance-specific object tokens, enabling reliable reconstruction of complete instance masks. Experiments demonstrate improved performance over proposal-based methods in challenging open-world settings.

2603.02115 2026-05-15 cs.RO cs.AI cs.LG

Robometer: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons

Anthony Liang, Yigit Korkmaz, Jiahui Zhang, Minyoung Hwang, Abrar Anwar, Sidhant Kaushik, Aditya Shah, Alex S. Huang, Luke Zettlemoyer, Dieter Fox, Yu Xiang, Anqi Li, Andreea Bobu, Abhishek Gupta, Stephen Tu, Erdem Biyik, Jesse Zhang

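Affiliations * Univ. of Southern California; UT Dallas; MIT; Indep. Researcher; Univ. of Washington; Ai2; NVIDIA

AI summary This paper proposes Robometer, a scalable framework for general-purpose robot reward modeling based on trajectory comparisons. It combines intra-trajectory progress supervision with inter-trajectory preference supervision through a dual objective: a frame-level progress loss on expert data that anchors reward magnitude, and a trajectory-comparison preference loss that imposes global ordering constraints across trajectories of the same task, which allows learning from both real and augmented failed trajectories. To support training at scale, the authors curate RBM-1M, a dataset of over one million trajectories. Experiments show that Robometer learns more generalizable reward functions than prior methods on benchmarks and in real-world evaluations.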

Comments 33 pages, 17 figures

Journal ref RSS 2026

Abstract

General-purpose robot reward models are typically trained to predict absolute task progress from expert demonstrations, providing only local, frame-level supervision. While effective for expert demonstrations, this paradigm scales poorly to large-scale robotics datasets where failed and suboptimal trajectories are abundant and assigning dense progress labels is ambiguous. We introduce Robometer, a scalable reward modeling framework that combines intra-trajectory progress supervision with inter-trajectory preference supervision. Robometer is trained with a dual objective: a frame-level progress loss that anchors reward magnitude on expert data, and a trajectory-comparison preference loss that imposes global ordering constraints across trajectories of the same task, enabling effective learning from both real and augmented failed trajectories. To support this formulation at scale, we curate RBM-1M, a reward-learning dataset comprising over one million trajectories spanning diverse robot embodiments and tasks, including substantial suboptimal and failure data. Across benchmarks and real-world evaluations, Robometer learns more generalizable reward functions than prior methods and improves robot learning performance across a diverse set of downstream applications. Code, model weights, and videos at https://robometer.github.io/.
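
The dual objective can be sketched as follows. The abstract does not give the loss formulas, so this sketch assumes a mean-squared-error progress loss and a Bradley-Terry style preference loss; the function names, the example scores, and the weight of 1.0 are all hypothetical.

```python
import numpy as np

def progress_loss(pred, target):
    """Frame-level regression loss on expert progress labels in [0, 1] (assumed MSE)."""
    return float(np.mean((pred - target) ** 2))

def preference_loss(score_preferred, score_other):
    """Assumed Bradley-Terry form: -log sigmoid(s_preferred - s_other), averaged."""
    margin = score_preferred - score_other
    # log(1 + exp(-margin)) computed stably.
    return float(np.mean(np.logaddexp(0.0, -margin)))

# Hypothetical predictions: per-frame progress for one expert clip, and
# trajectory-level scores for (successful, failed) pairs of the same task.
expert_pred = np.array([0.0, 0.3, 0.5, 0.8, 1.0])
expert_target = np.linspace(0.0, 1.0, 5)
success_scores = np.array([0.9, 0.7])
failure_scores = np.array([0.2, 0.4])

total_loss = progress_loss(expert_pred, expert_target) \
    + 1.0 * preference_loss(success_scores, failure_scores)
```

The progress term needs dense labels and so applies only to expert data, while the preference term needs only an ordering between two trajectories, which is why it can use failed and suboptimal data.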

2602.21302 2026-05-15 cs.RO

Learning Dynamic Rope Manipulation Using Task-Level Iterative Learning Control

Krishna Suresh, Chris Atkeson

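Affiliations * Carnegie Mellon University

AI summary This paper proposes a Task-Level Iterative Learning Control method for dynamic rope manipulation, demonstrated on a non-planar task called the flying knot. Using a single human demonstration and a simplified rope model, the method learns directly on hardware without large amounts of demonstration data or simulation. At each iteration it solves a quadratic program that inverts a model of the robot and rope, turning task-space errors into action updates. On 7 kinds of rope with different materials and dimensions, learning reaches a 100% success rate within 10 trials, and the method transfers between most rope types in 2 to 5 trials.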

Comments Project website: https://flying-knots.github.io

Abstract

We introduce a Task-Level Iterative Learning Control method for dynamic manipulation of ropes. We demonstrate this method on a non-planar rope manipulation task called the flying knot. Using a single human demonstration and a simplified rope model, the method learns directly on hardware without reliance on large amounts of demonstration data or massive amounts of simulation. At each iteration, the algorithm inverts a model of the robot and rope by solving a quadratic program to propagate task-space errors into action updates. We evaluate performance across 7 different kinds of ropes, including chain, latex surgical tubing, and braided and twisted ropes, ranging in thickness from 7 to 25 mm and in density from 0.013 to 0.5 kg/m. Learning achieves a 100% success rate within 10 trials on all ropes. Furthermore, the method can successfully transfer between most rope types in 2-5 trials. Project website: https://flying-knots.github.io
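
The idea of turning a task-space error into an action update through a quadratic program can be sketched on a toy linear system. This is not the authors' formulation: it assumes an unconstrained, damped least-squares QP (which has a closed-form solution), and `J_model`, `J_true`, the damping value, and the target are hypothetical stand-ins for the rope model, the real hardware, and the task.

```python
import numpy as np

def ilc_update(J_model, task_error, damping=1e-3):
    """Solve min_du ||J_model du - task_error||^2 + damping * ||du||^2 in closed form."""
    n = J_model.shape[1]
    H = J_model.T @ J_model + damping * np.eye(n)
    return np.linalg.solve(H, J_model.T @ task_error)

# Hypothetical toy setup: the real system responds 20% more strongly than the model.
J_model = np.array([[1.0, 0.2], [0.1, 0.8]])
J_true = 1.2 * J_model
target = np.array([1.0, -0.5])

u = np.zeros(2)
errors = []
for trial in range(10):
    error = target - J_true @ u          # "execute the trial" and measure the task error
    errors.append(float(np.linalg.norm(error)))
    u = u + ilc_update(J_model, error)   # model-based correction of the action
final_error = float(np.linalg.norm(target - J_true @ u))
```

Even though the model is wrong, each trial corrects the action using the measured error, so the error shrinks across trials as long as the model is close enough to the real system.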

2602.19532 2026-05-15 cs.RO cs.SY eess.SY

Bellman Value Decomposition for Task Logic in Safe Optimal Control

William Sharpless, Oswin So, Dylan Hirsch, Sylvia Herbert, Chuchu Fan

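Affiliations * UCSD; MIT

AI summary This work addresses complex combinations of goal and safety specifications in high-dimensional safe optimal control and proposes a method based on Bellman Value decomposition. The authors prove that the Bellman Value of a complex task defined in temporal logic decomposes into a graph of Bellman Values connected by the Reach-Avoid, Avoid, and novel Reach-Avoid-Loop Bellman equations, which organizes the task logic naturally. They further propose VDPPO, which embeds the decomposed Value graph into a two-layer neural network that bootstraps the implicit dependencies. Simulated and hardware experiments on high-dimensional tasks show large improvements over existing baselines, with safety and liveness balanced automatically.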

Abstract

Real-world tasks involve nuanced combinations of goal and safety specifications. In high dimensions, the challenge is exacerbated: formal automata become cumbersome, and the combination of sparse rewards tends to require laborious tuning. In this work, we consider the innate structure of the Bellman Value as a means to naturally organize the problem for improved automatic performance. Namely, we prove the Bellman Value for a complex task defined in temporal logic can be decomposed into a graph of Bellman Values, connected by a set of well-known Bellman equations (BEs): the Reach-Avoid BE, the Avoid BE, and a novel type, the Reach-Avoid-Loop BE. To solve the Value and optimal policy, we propose VDPPO, which embeds the decomposed Value graph into a two-layer neural net, bootstrapping the implicit dependencies. We conduct a variety of simulated and hardware experiments to test our method on complex, high-dimensional tasks involving heterogeneous teams and nonlinear dynamics. Ultimately, we find this approach greatly improves performance over existing baselines, balancing safety and liveness automatically.

2602.13483 2026-05-15 cs.LG cs.AI

Finding Interpretable Prompt-Specific Circuits in Language Models

Gabriel Franco, Lucas M. Tassis, Azalea Rohr, Mark Crovella

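Affiliations * Department of Computer Science; Boston University; Faculty of Computing & Data Sciences

AI summary This paper studies the internal circuits that language models use to solve tasks, focusing on why each attention head attends to particular token pairs. The authors introduce ACC++, an improved circuit-tracing method based on the principle of attention-causal communication (ACC), which extracts causal circuit components and the low-dimensional signals passed between them from a single forward pass, without replacement models or patching. Experiments show that a substantial portion of ACC++ signals are interpretable across multiple models, and that the resulting circuits reveal how model behavior depends on prompt structure and on language.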

Abstract

Understanding the internal circuits that language models use to solve tasks remains a central challenge in mechanistic interpretability. A crucial part of finding circuits is understanding why each attention head attends where it does. To this end, we introduce ACC++, an improved circuit-tracing method based on the principle of attention-causal communication (ACC) [1], which identifies signals, i.e., contents of low dimensional subspaces that cause attention on a token pair. ACC++ extracts circuits from a single forward pass, without replacement models or patching. Circuits identified by ACC++ consist of components that are causal for the model's attention decisions, together with the low-dimensional signals used to communicate between them. Here, we first detail the conceptual advances that ACC++ makes over previous work. We then show that across multiple models, a substantial portion of ACC++ signals are interpretable: many signals admit a short natural-language description. We next present a number of new insights into model behavior obtained via ACC++. First, we use ACC++'s interpretable circuits to characterize the sensitivity of indirect object identification (IOI) circuits to prompt structure. We find that prompt-specific circuits form well-defined clusters, and across clusters, heads receive systematically different signals corresponding to distinct mechanisms for identifying the IO name. Next, in multilingual IOI, ACC++ circuits show that while model components are reused across languages, signals are often language-specific. In a four-language IOI case study, cross-language circuit distances are consistent with linguistic relatedness. Together, these results show that ACC++ can shed light on a broad spectrum of model behaviors.

2602.07519 2026-05-15 cs.LG

PALMS: A Computational Implementation for Pavlovian Associative Learning Models' Simulation

Martin Fixman, Alessandro Abati, Julián Jiménez Nimmo, Sean Lim, Esther Mondragón

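Affiliations * Artificial Intelligence Research Centre (CitAI), Department of Computer Science, City St George's, University of London, London, United Kingdom; Centre for Computational and Animal Learning Research, CAL-R

AI summary This paper introduces PALMS, a Python tool for simulating Pavlovian associative learning models. Besides the canonical Rescorla-Wagner model, it implements attentional models, including Pearce-Kaye-Hall, Mackintosh Extended, and Le Pelley's Hybrid, together with a novel Rescorla-Wagner extension whose unified variable learning rate reconciles opposing theoretical views. PALMS provides a graphical interface for entering entire experimental designs and supports simulations with hundreds of stimuli as well as configural cues, which broadens the models' predictive scope and gives neuroscientists a tool for studying and refining experimental designs.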

Comments PALMS is licensed under the open-source GNU Lesser General Public License 3.0. The environment source code and the latest multiplatform release build are accessible as a GitHub repository at https://github.com/cal-r/PALMS-Simulator

Abstract

In contrast to static formalisms, computational definitions describe the operational mechanisms of a model. Simulations are an essential part of the cycle of theory development and refinement, assisting researchers in formulating the precise definitions that models require, and making accurate predictions. This manuscript introduces a computational implementation of Pavlovian learning models in a Python environment, termed Pavlovian Associative Learning Models' Simulation (PALMS). In addition to the canonical Rescorla-Wagner model, attentional approaches are implemented, including Pearce-Kaye-Hall, Mackintosh Extended, Le Pelley's Hybrid, and a novel extension of the Rescorla-Wagner model featuring a unified variable learning rate that synthesises Mackintosh's and Pearce and Hall's opposing conceptualisations. To our knowledge, only the first attentional model has been previously specified computationally in a general design tool. PALMS integrates a graphical interface that permits the input of entire experimental designs in an alphanumeric format, akin to that used by experimental neuroscientists. It uniquely enables the simulation of experiments involving hundreds of stimuli, such as those used with human participants, and the computation of configural cues and configural-cue compounds across all models, thereby substantially broadening their predictive capabilities. A comprehensive description of the models' implementation is provided in the paper. We evaluate PALMS by simulating five published experiments in the associative learning literature that assessed the predictive scope of existing models, and we show that this implementation provides neuroscientists with a useful tool for identifying critical variables, refining experimental designs, making precise predictions, comparing model fitness, and formulating new theoretical approaches.
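
As a concrete instance of the canonical model mentioned above, the Rescorla-Wagner rule updates each present cue by alpha * beta * (lambda - sum of associative strengths of the present cues). The sketch below is an independent illustration and does not use the PALMS code or its input format; the function name, trial encoding, and parameter values are hypothetical. It reproduces the classic blocking effect.

```python
def rescorla_wagner(trials, alpha, beta=0.5, lam=1.0):
    """Run Rescorla-Wagner over trials given as (cues, reinforced) pairs."""
    V = {cue: 0.0 for cue in alpha}
    for cues, reinforced in trials:
        target = lam if reinforced else 0.0
        # Shared prediction error: all present cues compete for the same outcome.
        error = target - sum(V[c] for c in cues)
        for c in cues:
            V[c] += alpha[c] * beta * error
    return V

alpha = {"A": 0.3, "B": 0.3}  # hypothetical cue saliences

# Blocking group: A+ training first, then the compound AB+.
blocked = rescorla_wagner([("A", True)] * 20 + [("AB", True)] * 20, alpha)
# Control group: compound AB+ only.
control = rescorla_wagner([("AB", True)] * 20, alpha)
```

In the blocking group, A already predicts the outcome when B is introduced, so the prediction error is small and B acquires little strength; in the control group, A and B share the error and B ends near half of lambda.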

2602.05319 2026-05-15 cs.LG

Accelerated Sequential Flow Matching: A Bayesian Filtering Perspective

Yinan Huang, Hans Hao-Hsun Hsu, Junran Wang, Bo Dai, Pan Li

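Affiliations * Georgia Institute of Technology

AI summary This paper proposes Sequential Bayesian Flow Matching, a framework for sequential probabilistic inference from streaming observations. Inspired by Bayesian filtering, it learns a probability flow that transports the posterior distribution from one time step to the next, conditioned on new observations. Instead of repeatedly sampling from a non-informative initial distribution, it uses the previous belief as an informative source distribution, which makes sampling substantially faster. On several scientific forecasting and decision-making tasks, it performs competitively with full-step diffusion while using far fewer sampling steps, substantially reducing inference latency.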

Abstract

Sequential probabilistic inference from streaming observations requires modeling distributions over future trajectories as new observations arrive. Although diffusion and flow-matching models are effective at capturing high-dimensional, multimodal distributions, their deployment in real-time streaming settings typically relies on repeatedly sampling from a non-informative initial distribution. This results in substantial inference latency, particularly when multiple samples are needed to characterize the predictive distribution. In this work, we introduce Sequential Bayesian Flow Matching, a framework inspired by Bayesian filtering. By learning a probability flow that transports the posterior distribution from one time step to the next time step conditioned on new observations, it mirrors the recursive structure of Bayesian belief updates. Crucially, by using the previous belief as an informative source distribution, it enables substantially faster sampling than naive resampling from scratch. Across scientific forecasting tasks spanning accelerator beam spill dynamics, fluid dynamics, and weather forecasting, as well as decision-making benchmarks, our method achieves performance competitive with full-step diffusion on distributional metrics while using far fewer sampling steps, substantially reducing inference latency. Our code is available at https://github.com/Graph-COM/Sequential_Flow_Matching.

2602.04585 2026-05-15 cs.CV

ImmuVis: Hyperconvolutional Foundation Model for Imaging Mass Cytometry

Dawid Uchal, Marcin Możejko, Krzysztof Gogolewski, Piotr Kupidura, Szymon Łukasik, Jakub Giezgała, Tomasz Nocoń, Kacper Pietrzyk, Robert Pieniuta, Mateusz Sulimowicz, Michal Orzyłowski, Tomasz Siłkowski, Karol Zagródka, Eike Staub, Ewa Szczurek

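Affiliations * Faculty of Mathematics, Informatics and Mechanics, University of Warsaw; Merck Healthcare KGaA; Institute of AI for Health, Helmholtz Munich

AI summary This paper presents ImmuVis, a family of efficient foundation models for imaging mass cytometry (IMC) data. Marker-adaptive hyperconvolutions address the lack of a fixed channel space in IMC, so a single model can handle the differing marker sets used across studies. ImmuVis is pretrained on the large-scale IMC17M dataset, costs less compute than transformer-based alternatives, performs well on virtual staining and classification tasks, and provides calibrated uncertainty estimates, making it a practical framework for real-world IMC modeling.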

Comments 38 pages, 19 figures

Abstract

We present ImmuVis, a family of efficient foundation models for imaging mass cytometry (IMC), a high-throughput multiplex imaging technology that handles molecular marker measurements as image channels and enables large-scale spatial tissue profiling. Unlike natural images, multiplex imaging lacks a fixed channel space, as real-world marker sets vary across studies, violating a core assumption of standard vision backbones. To address this, ImmuVis introduces marker-adaptive hyperconvolutions that generate convolutional kernels from learned marker embeddings, enabling a single model to operate on arbitrary measured marker subsets without retraining. We pretrain ImmuVis on the largest dataset to date, IMC17M (28 cohorts, 24,405 images, 265 markers, over 17M patches), using self-supervised masked reconstruction. ImmuVis outperforms state-of-the-art baselines and ablations in virtual staining and downstream classification tasks at substantially lower compute cost than transformer-based alternatives, and is the sole model that provides calibrated uncertainty via a heteroscedastic likelihood objective. These results position ImmuVis as a practical framework for real-world IMC modeling.
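
The idea of generating convolution kernels from marker embeddings can be sketched in a few lines. This is a toy illustration under stated assumptions, not the ImmuVis architecture: the hypernetwork is a single random linear map (`hyper_W`), the embeddings are random rather than learned, and all sizes and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N_MARKERS, EMB, C_OUT, K = 6, 4, 3, 3
marker_emb = rng.normal(size=(N_MARKERS, EMB))    # learned in practice
hyper_W = rng.normal(size=(EMB, C_OUT * K * K))   # hypothetical linear hypernetwork

def hyperconv(image, marker_ids):
    """Valid cross-correlation; image is (C_in, H, W), one channel per measured marker."""
    # One (C_OUT, K, K) kernel per input channel, generated from that marker's embedding.
    kernels = (marker_emb[marker_ids] @ hyper_W).reshape(len(marker_ids), C_OUT, K, K)
    c_in, h, w = image.shape
    out = np.zeros((C_OUT, h - K + 1, w - K + 1))
    for c in range(c_in):
        for i in range(K):
            for j in range(K):
                patch = image[c, i:i + h - K + 1, j:j + w - K + 1]
                out += kernels[c, :, i, j][:, None, None] * patch
    return out

image = rng.normal(size=(3, 8, 8))
ids = [0, 2, 5]                                   # which markers the channels measure
out = hyperconv(image, ids)
perm = [2, 0, 1]
out_perm = hyperconv(image[perm], [ids[p] for p in perm])
out_two = hyperconv(image[:2], ids[:2])           # a smaller marker panel, same weights
```

Because each kernel is tied to a marker identity instead of a channel position, the same weights accept any subset or ordering of markers, which is the property the abstract attributes to hyperconvolutions.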

2602.02427 2026-05-15 cs.LG

Embedding Perturbation may Better Reflect Intermediate-Step Uncertainty in LLM Reasoning

Qihao Wen, Jiahao Wang, Yang Nan, Pengfei He, Ravi Tandon, Han Xu

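Affiliations * University of Arizona; Michigan State University

AI summary This paper studies how to quantify the uncertainty of intermediate steps in large language model (LLM) reasoning more accurately. The authors find that incorrect reasoning steps tend to contain tokens that are highly sensitive to perturbations of the preceding token embeddings, and they use this sensitivity to flag uncertain or possibly incorrect steps. Experiments show that perturbation-based uncertainty metrics outperform probability-based, sampling-based, and Bayesian-based baselines while being simple and efficient.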

Abstract

Large Language Models (LLMs) have achieved significant breakthroughs across diverse domains; however, they can still produce unreliable or misleading outputs. For responsible LLM application, Uncertainty Quantification (UQ) techniques are used to estimate a model's uncertainty about its outputs, indicating the likelihood that those outputs may be problematic. For LLM reasoning tasks, it is essential to estimate the uncertainty not only for the final answer, but also for the intermediate steps of the reasoning, as this can enable more fine-grained and targeted interventions. In this study, we explore which UQ metrics better reflect the LLM's "intermediate uncertainty" during reasoning. Our study reveals that an LLM's incorrect reasoning steps tend to contain tokens that are highly sensitive to perturbations of the preceding token embeddings, indicating the model's uncertainty among multiple competing continuations. Uncertain (possibly incorrect) intermediate steps can therefore be identified in practice by using this sensitivity score as guidance. In our experiments, such perturbation-based metrics achieve stronger uncertainty quantification performance than baselines including probability-based, sampling-based, and Bayesian-based methods, while also being simple and efficient.

2601.22197 2026-05-15 cs.LG cs.AI eess.SP

Neural Signals Generate Clinical Notes in the Wild

Jathurshan Pradeepkumar, Zheng Chen, Jimeng Sun

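Affiliations * University of Illinois Urbana-Champaign; SANKEN, Osaka University

AI summary Generating clinical reports that summarize abnormal patterns, diagnostic findings, and clinical interpretations from long-term electroencephalography (EEG) recordings remains labor-intensive. This paper presents CELM, described as the first clinical EEG-to-Language foundation model that summarizes long-duration, variable-length EEG recordings and generates clinical reports end to end at multiple scales. CELM combines pretrained EEG foundation models with language models and is trained on a large-scale dataset of 9,922 reports paired with approximately 11,000 hours of EEG recordings from 9,048 patients; the benchmark is released with an automated report-structuring pipeline. CELM outperforms existing methods across evaluation settings, and clinical experts rate its reports as more clinically coherent, diagnostically reliable, and better aligned with expert interpretation.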

Abstract

Generating clinical reports that summarize abnormal patterns, diagnostic findings, and clinical interpretations from long-term EEG recordings remains labor-intensive. We present CELM, the first clinical EEG-to-Language foundation model capable of summarizing long-duration, variable-length EEG recordings and performing end-to-end clinical report generation at multiple scales. CELM integrates pretrained EEG foundation models with language models to enable scalable multimodal learning. We curate a large-scale clinical EEG dataset containing 9,922 reports paired with approximately 11,000 hours of EEG recordings from 9,048 patients to train CELM, and release the benchmark with an automated report-structuring pipeline to facilitate future research. Experimental results show that CELM consistently outperforms existing methods across all evaluation settings. Importantly, we further conduct human evaluation with clinical experts, demonstrating that CELM generates reports that are more clinically coherent, diagnostically reliable, and better aligned with expert interpretation. We release our model and benchmark construction pipeline at https://github.com/Jathurshan0330/CELM.

2601.21929 2026-05-15 cs.LG

LoRIF: Low-Rank Influence Functions for Scalable Training Data Attribution

Shuangqi Li, Hieu Le, Jingyi Xu, Mathieu Salzmann

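Affiliations * EPFL; UNC Charlotte; Stony Brook University

AI summary Training data attribution (TDA) identifies which training examples most influenced a model's prediction. LoRIF is a gradient-based attribution method that exploits the low-rank structure of gradients to remove the storage and computation bottlenecks of attribution over large training sets. It stores low-rank factors of projected per-example gradients and uses truncated singular value decomposition (SVD) with the Woodbury identity, which lowers storage and memory requirements while matching or exceeding the attribution quality of LoGRA on large models and datasets.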

Abstract

Training data attribution (TDA) identifies which training examples most influenced a model's prediction. Influence function methods are a theoretically grounded family of TDA methods and exploit gradients. To overcome the scalability challenge arising from gradient computation, the most popular strategy is random projection (e.g., TRAK, LoGRA). However, this still faces two bottlenecks when scaling to large training sets and high-quality attribution: (i) storing and loading projected per-example gradients for all $N$ training examples, where query latency is dominated by I/O; and (ii) forming the $D \times D$ inverse Hessian approximation, which costs $O(D^2)$ memory. Both bottlenecks scale with the projection dimension $D$, yet increasing $D$ is necessary for attribution quality, which creates a quality-scalability tradeoff. We introduce LoRIF (Low-Rank Influence Functions), which exploits the low-rank structure of gradients to address both bottlenecks. First, we store rank-$c$ factors of projected per-example gradients rather than full matrices, reducing storage and query-time I/O from $O(D)$ to $O(c\sqrt{D})$ per layer per sample. Second, we use truncated SVD with the Woodbury identity to approximate the inverse Hessian term in an $r$-dimensional subspace, reducing memory from $O(D^2)$ to $O(Dr)$. On models from 0.1B to 70B parameters trained on datasets with millions of examples, LoRIF achieves up to 20$\times$ storage reduction and query-time speedup compared to LoGRA, while matching or exceeding its attribution quality. LoRIF makes gradient-based TDA practical at frontier scale.
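
The second step (truncated SVD plus the Woodbury identity) can be checked numerically on a small example. The sketch below is not the LoRIF implementation: it assumes a damped Hessian approximation of the form lam * I + V_r diag(s_r) V_r^T, where V_r holds the top-r right singular vectors of a gradient matrix and s_r the squared singular values; the function name, sizes, and damping `lam` are hypothetical. For orthonormal V_r, the Woodbury identity reduces to (q - V_r diag(s_r / (s_r + lam)) V_r^T q) / lam.

```python
import numpy as np

def inverse_hessian_apply(V_r, s_r, lam, q):
    """Apply (lam*I + V_r diag(s_r) V_r^T)^{-1} to q while storing only O(D r) numbers."""
    coeff = s_r / (s_r + lam)
    return (q - V_r @ (coeff * (V_r.T @ q))) / lam

rng = np.random.default_rng(0)
N, D, r, lam = 200, 64, 8, 1.0
G = rng.normal(size=(N, D))                  # stand-in for projected per-example gradients
_, sig, Vt = np.linalg.svd(G, full_matrices=False)
V_r, s_r = Vt[:r].T, sig[:r] ** 2            # top-r subspace and its eigenvalues

q = rng.normal(size=D)                       # stand-in for a query gradient
fast = inverse_hessian_apply(V_r, s_r, lam, q)

# Dense reference: forms the full D x D matrix, which is the O(D^2) cost to avoid.
dense = np.linalg.solve(lam * np.eye(D) + V_r @ np.diag(s_r) @ V_r.T, q)
```

The low-rank path never forms the D x D matrix, so memory grows as O(Dr), matching the scaling stated in the abstract.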

2601.20173 2026-05-15 cs.LG cs.HC

MAPLE: Self-Supervised Learning-Enhanced Nonlinear Dimensionality Reduction for Visual Analysis

Zeyang Huang, Takanori Fujiwara, Angelos Chatzimparmpas, Wandrille Duchemin, Andreas Kerren

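Affiliations * Linköping University; University of Arizona; Utrecht University; University of Basel; Linnaeus University

AI summary This paper proposes MAPLE, a nonlinear dimensionality reduction method that enhances UMAP through improved manifold modeling. MAPLE uses self-supervised learning with maximum manifold capacity representations (MMCRs) to encode low-dimensional manifold geometry more efficiently, compressing variance among locally similar data points while amplifying variance among dissimilar ones. It is particularly suited to biological or image data with substantial intra-cluster variance and curved manifold structure. Evaluations show clearer cluster separation and finer subcluster resolution than UMAP at tractable computational cost.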

Abstract

We present a new nonlinear dimensionality reduction method, MAPLE, that enhances UMAP by improving manifold modeling. MAPLE employs a self-supervised learning approach to more efficiently encode low-dimensional manifold geometry. Central to this approach are maximum manifold capacity representations (MMCRs), which help untangle complex manifolds by compressing variances among locally similar data points while amplifying variance among dissimilar data points. This design is particularly effective for high-dimensional data with substantial intra-cluster variance and curved manifold structures, such as biological or image data. Our qualitative and quantitative evaluations demonstrate that MAPLE can produce clearer visual cluster separations and finer subcluster resolution than UMAP while maintaining tractable computational cost.

2601.02179 2026-05-15 cs.CL

Confidence Estimation for LLMs in Multi-turn Interactions

Caiqi Zhang, Ruihan Yang, Xiaochen Zhu, Chengzu Li, Tiancheng Hu, Yijiang River Dong, Deqing Yang, Nigel Collier

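Affiliations * University of Cambridge; Fudan University

AI summary This paper studies confidence estimation for large language models in multi-turn conversations. Current research focuses mostly on single-turn settings, while the dynamics of confidence as context accumulates and ambiguity is progressively resolved remain largely unexplored. The authors establish an evaluation framework grounded in per-turn calibration and monotonicity of confidence as more information becomes available, and introduce new metrics and a method for generating controlled evaluation datasets. Experiments show that widely used confidence techniques struggle in multi-turn dialogues, whereas the proposed logit-based probe, P(Sufficient), tracks evidence accumulation more effectively.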

Comments ACL 2026 Findings

Abstract

While confidence estimation is a promising direction for mitigating hallucinations in Large Language Models (LLMs), current research overwhelmingly focuses on single-turn settings. The dynamics of model confidence in multi-turn conversations, where context accumulates and ambiguity is progressively resolved, remain largely unexplored. This work presents the first systematic study of confidence estimation in multi-turn interactions, establishing a formal evaluation framework grounded in two key desiderata: per-turn calibration and monotonicity of confidence as more information becomes available. To facilitate this, we introduce novel metrics, including a length-normalized Expected Calibration Error (InfoECE), and a new "Hinter-Guesser" paradigm for generating controlled evaluation datasets. Our experiments reveal that widely-used confidence techniques struggle with calibration and monotonicity in multi-turn dialogues. In contrast, a novel logit-based probe we introduce, P(Sufficient), proves comparatively more effective, robustly tracking evidence accumulation and distinguishing it from conversational filler. Our work provides a foundational methodology for developing more reliable and trustworthy conversational agents.
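
The two desiderata, per-turn calibration and monotonicity, can be made concrete with simple measurements. The sketch below computes the standard binned Expected Calibration Error and a naive monotonicity check; it does not implement the paper's InfoECE, whose definition is not given in the abstract, and the function names and example values are hypothetical.

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """Standard binned ECE: bin-weighted gap between accuracy and mean confidence."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return float(ece)

def monotonicity_violation_rate(conf_per_turn):
    """Fraction of consecutive turns in one dialogue where confidence drops."""
    diffs = np.diff(np.asarray(conf_per_turn, dtype=float))
    return float(np.mean(diffs < 0)) if diffs.size else 0.0

# Hypothetical per-turn confidences for one dialogue in which every turn adds evidence.
violations = monotonicity_violation_rate([0.2, 0.5, 0.4, 0.9])
overconfident = expected_calibration_error([0.9, 0.9], [0, 0])
```

Per-turn calibration would apply the first function to the confidences collected at each turn across many dialogues, and the second function flags dialogues where confidence falls even though information only increases.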

2512.09115 2026-05-15 cs.CV

SuperF: Neural Implicit Fields for Multi-Image Super-Resolution

Sander Riisøen Jyhne, Christian Igel, Morten Goodwin, Per-Arne Andersen, Serge Belongie, Nico Lang

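Affiliations * University of Agder; University of Copenhagen

AI summary This paper proposes SuperF, a multi-image super-resolution method that improves optical resolution from multiple low-resolution frames taken with sub-pixel shifts. Built on coordinate-based neural networks (neural fields), it shares one implicit neural representation (INR) across frames and jointly optimizes frame alignment and reconstruction, which avoids the "hallucinated" structures common in single-image super-resolution. SuperF does not rely on high-resolution training data, and experiments on simulated satellite bursts and ground-level images from handheld cameras give compelling results at upsampling factors of up to 8.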

Comments Published at ICLR 2026, Project website: https://sjyhne.github.io/superf/, 23 pages, 13 figures, 8 tables

Abstract

High-resolution imagery is often hindered by limitations in sensor technology, atmospheric conditions, and costs. Such challenges occur in satellite remote sensing, but also with handheld cameras, such as our smartphones. Hence, super-resolution aims to enhance the image resolution algorithmically. Since single-image super-resolution requires solving an inverse problem, such methods must exploit strong priors, e.g. learned from high-resolution training data, or be constrained by auxiliary data, e.g. by a high-resolution guide from another modality. While qualitatively pleasing, such approaches often lead to "hallucinated" structures that do not match reality. In contrast, multi-image super-resolution (MISR) aims to improve the (optical) resolution by constraining the super-resolution process with multiple views taken with sub-pixel shifts. Here, we propose SuperF, a test-time optimization approach for MISR that leverages coordinate-based neural networks, also called neural fields. Their ability to represent continuous signals with an implicit neural representation (INR) makes them an ideal fit for the MISR task. The key characteristic of our approach is to share an INR for multiple shifted low-resolution frames and to jointly optimize the frame alignment with the INR. Our approach advances related INR baselines, adopted from burst fusion for layer separation, by directly parameterizing the sub-pixel alignment as optimizable affine transformation parameters and by optimizing via a super-sampled coordinate grid that corresponds to the output resolution. Our experiments yield compelling results on simulated bursts of satellite imagery and ground-level images from handheld cameras, with upsampling factors of up to 8. A key advantage of SuperF is that this approach does not rely on any high-resolution training data.

2512.07805 2026-05-15 cs.LG cs.AI cs.CL

Group Representational Position Encoding

Yifan Zhang, Zixiang Chen, Yifeng Liu, Zhen Qin, Huizhuo Yuan, Kangping Xu, Yang Yuan, Quanquan Gu, Andrew Chi-Chih Yao

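Affiliations * Princeton University; University of California, Los Angeles; IIIS, Tsinghua University

AI summary This paper presents GRAPE, a unified positional encoding framework based on group actions that covers both multiplicative and additive mechanisms. Multiplicative GRAPE uses the matrix exponential of a rank-2 skew-symmetric generator to produce relative, norm-preserving position maps; it recovers RoPE exactly and extends to richer cross-subspace coupling. Additive GRAPE builds on rank-1 or low-rank unipotent actions, recovers ALiBi and the Forgetting Transformer (FoX) as exact special cases, and preserves streaming cacheability. GRAPE offers a principled design space for positional encoding in long-context models that unifies and extends existing methods.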

Comments Published in ICLR 2026. Project Page: https://github.com/model-architectures/GRAPE

Abstract

We present GRAPE (Group Representational Position Encoding), a unified framework for positional encoding based on group actions. GRAPE unifies two families of mechanisms: (i) multiplicative rotations (Multiplicative GRAPE) in $\operatorname{SO}(d)$ and (ii) additive logit biases (Additive GRAPE) arising from unipotent actions in the general linear group $\mathrm{GL}$. In Multiplicative GRAPE, a position $n \in \mathbb{Z}$ (or $t \in \mathbb{R}$) acts as $\mathbf{G}(n) = \exp(n \, \omega \, \mathbf{L})$ with a rank-2 skew-symmetric generator $\mathbf{L} \in \mathbb{R}^{d \times d}$, yielding a relative, compositional, norm-preserving map with a closed-form matrix exponential. RoPE is recovered exactly when the $d/2$ planes correspond to canonical coordinate pairs with a log-uniform spectrum. Learned commuting subspaces and compact non-commuting mixtures strictly extend this geometry to capture cross-subspace feature coupling at $O(d)$ and $O(r d)$ cost per head, respectively. In Additive GRAPE, additive logits arise from rank-1 (or low-rank) unipotent actions, recovering ALiBi and the Forgetting Transformer (FoX) as exact special cases while preserving an exact relative law and streaming cacheability. Overall, GRAPE provides a principled design space for positional geometry in long-context models, subsuming RoPE and ALiBi as special cases. Project page: https://github.com/model-architectures/GRAPE.
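
The closed-form exponential for a rank-2 skew-symmetric generator can be verified numerically. The sketch below assumes the parameterization L = a b^T - b a^T with orthonormal vectors a and b (any overall scale can be absorbed into omega); this parameterization and the function names are illustrative assumptions, not taken from the paper. Under this assumption L^3 = -L, so exp(theta L) = I + sin(theta) L + (1 - cos(theta)) L^2.

```python
import numpy as np

def rank2_generator(a, b):
    """Skew-symmetric rank-2 generator L = a b^T - b a^T for orthonormal a, b."""
    return np.outer(a, b) - np.outer(b, a)

def grape_rotation(n, omega, L):
    """exp(n * omega * L) via the closed form I + sin(t) L + (1 - cos(t)) L^2."""
    theta = n * omega
    return np.eye(L.shape[0]) + np.sin(theta) * L + (1.0 - np.cos(theta)) * (L @ L)

rng = np.random.default_rng(0)
d, omega = 8, 0.3
Q, _ = np.linalg.qr(rng.normal(size=(d, 2)))   # two orthonormal columns
L = rank2_generator(Q[:, 0], Q[:, 1])

G3, G7, G4 = (grape_rotation(n, omega, L) for n in (3, 7, 4))
q, k = rng.normal(size=d), rng.normal(size=d)
score_abs = (G3 @ q) @ (G7 @ k)                # query at position 3, key at position 7
score_rel = q @ (G4 @ k)                       # depends only on the offset 7 - 3 = 4
```

The two scores agree because G(m)^T G(n) = G(n - m), which is the relative law; choosing a and b as a canonical coordinate pair gives a single RoPE-style plane rotation.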

2512.02920 2026-05-15 cs.LG cs.CV cs.SI

Learning Multimodal Embeddings for Traffic Accident Prediction and Causal Estimation

Ziniu Zhang, Minxuan Duan, Haris N. Koutsopoulos, Hongyang R. Zhang

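Affiliations * Northeastern University

AI summary This paper studies traffic accident prediction and causal analysis using road network data together with satellite imagery. The authors build a multimodal dataset spanning six U.S. states with nine million accident records and one million high-resolution satellite images, annotated with weather statistics, road type, and traffic volume, and evaluate multimodal learning methods that fuse visual and network embeddings. Combining the two modalities improves prediction, reaching an average AUROC of 90.1%, and the causal analysis finds that precipitation, road type, and seasonal patterns significantly affect accident rates.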

Comments 17 pages. Appeared in KDD 2026

Abstract

We consider analyzing traffic accident patterns using both road network data and satellite images aligned to road graph nodes. Previous work for predicting accident occurrences relies primarily on road network structural features while overlooking physical and environmental information from the road surface and its surroundings. In this work, we construct a large multimodal dataset spanning six U.S. states, containing nine million traffic accident records from official sources, and one million high-resolution satellite images for each node of the road network. Additionally, every node is annotated with features such as the region's weather statistics and road type (e.g., residential vs. motorway), and each edge is annotated with traffic volume information (i.e., Average Annual Daily Traffic). Utilizing this dataset, we conduct a comprehensive evaluation of multimodal learning methods that integrate both visual and network embeddings. Our findings show that integrating both data modalities improves prediction accuracy, achieving an average AUROC of 90.1%, a 3.7% gain over graph neural network models that use only graph structures. With the improved embeddings, we conduct a causal analysis using a matching estimator to identify the key factors influencing traffic accidents. We find that accident rates rise by 24% under higher precipitation, by 22% on higher-speed roads such as motorways, and by 29% due to seasonal patterns, after adjusting for other confounding factors. Ablation studies confirm that satellite imagery features are essential for achieving accurate prediction.

2512.01766 2026-05-15 cs.LG

On the Unreasonable Effectiveness of Last-layer Retraining

John C. Hill, Tyler LaBonte, Xinchen Zhang, Vidya Muthukumar

Affiliations: School of Electrical and Computer Engineering; Georgia Institute of Technology; H. Milton Stewart School of Industrial and Systems Engineering

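AI summary: This paper studies why last-layer retraining (LLR) is effective at improving robustness on minority groups. LLR improves worst-group accuracy even when retraining uses an imbalanced subset of the training set. The experiments indicate that the benefit comes mainly from better group balance in the retraining set rather than from the initially hypothesized mitigation of neural collapse. The paper also analyzes how the recent CB-LLR and AFR algorithms improve robustness through implicit group balancing.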

Abstract

Last-layer retraining (LLR) methods, wherein the last layer of a neural network is reinitialized and retrained on a held-out set following ERM training, have garnered interest as an efficient approach to rectify dependence on spurious correlations and improve performance on minority groups. Surprisingly, LLR has been found to improve worst-group accuracy even when the held-out set is an imbalanced subset of the training set. We initially hypothesize that this "unreasonable effectiveness" of LLR is explained by its ability to mitigate neural collapse through the held-out set, resulting in the implicit bias of gradient descent benefiting robustness. Our empirical investigation does not support this hypothesis. Instead, we present strong evidence for an alternative hypothesis: that the success of LLR is primarily due to better group balance in the held-out set. We conclude by showing how the recent algorithms CB-LLR and AFR perform implicit group-balancing to elicit a robustness improvement.

2511.17299 2026-05-15 cs.RO

MonoSpheres: Large-Scale Monocular SLAM-Based UAV Exploration through Perception-Coupled Mapping and Planning

Tomáš Musil, Matěj Petrlík, Martin Saska

Affiliations: Department of Cybernetics, Faculty of Electrical Engineering, Czech Technical University in Prague

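AI summary: This paper presents MonoSpheres, a monocular-vision method for large-scale autonomous UAV exploration that addresses the sparse depth data, free-space gaps, and depth uncertainty that arise when exploring 3D environments with only a monocular camera. Perception-coupled mapping and planning modules enable safe exploration of unstructured indoor and outdoor environments, and the authors report what is, to their knowledge, the first monocular 3D autonomous exploration in real-world outdoor environments. Experiments validate the approach, and the code is open-sourced to support future research.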

Comments 8 pages, 9 figures, accepted to IEEE Robotics and Automation Letters

Abstract

Autonomous exploration of unknown environments is a key capability for mobile robots, but it is largely unsolved for robots equipped with only a single monocular camera and no dense range sensors. In this paper, we present a novel approach to monocular vision-based exploration that can safely cover large-scale unstructured indoor and outdoor 3D environments by explicitly accounting for the properties of a sparse monocular SLAM frontend in both mapping and planning. The mapping module solves the problems of sparse depth data, free-space gaps, and large depth uncertainty by oversampling free space in texture-sparse areas and keeping track of obstacle position uncertainty. The planning module handles the added free-space uncertainty through rapid replanning and perception-aware heading control. We further show that frontier-based exploration is possible with sparse monocular depth data when parallax requirements and the possibility of textureless surfaces are taken into account. We evaluate our approach extensively in diverse real-world and simulated environments, including ablation studies. To the best of the authors' knowledge, the proposed method is the first to achieve 3D monocular exploration in real-world unstructured outdoor environments. We open-source our implementation to support future research.

2511.07308 2026-05-15 cs.LG

Can Stationary Distributions of Scale-Invariant Neural Networks Be Described by the Thermodynamics of an Ideal Gas?

Ildus Sadrtdinov, Ekaterina Lobacheva, Ivan Klimov, Mikhail Burtsev, Mikhail I. Katsnelson, Dmitry Vetrov

Affiliations: Constructor University; Mila – Quebec AI Institute; Université de Montréal; London Institute for Mathematical Sciences; Institute for Molecules and Materials, Radboud University

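AI summary: This paper develops a thermodynamic framework describing the stationary distributions of stochastic gradient descent (SGD) with weight decay in scale-invariant neural networks. It draws analogies between training hyperparameters (such as learning rate and weight decay) and thermodynamic variables (such as temperature, pressure, and volume), and shows through theory and experiments a close correspondence between SGD dynamics and ideal gas behavior. The framework offers a physical perspective on training and may help guide hyperparameter tuning and the design of learning rate schedulers.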

Comments Accepted at IJCAI-ECAI 2026 (the 35th International Joint Conference on Artificial Intelligence)

Abstract

Understanding the training dynamics of deep neural networks remains a major open problem, with physics-inspired approaches offering promising insights. Building on this perspective, we develop a thermodynamic framework to describe the stationary distributions of stochastic gradient descent (SGD) with weight decay for scale-invariant neural networks, a setting that both reflects practical architectures with normalization layers and permits theoretical analysis. We establish analogies between training hyperparameters (e.g., learning rate, weight decay) and thermodynamic variables such as temperature, pressure, and volume. Starting with a simplified isotropic noise model, we uncover a close correspondence between SGD dynamics and ideal gas behavior, validated through theory and simulation. Extending to training of neural networks, we show that key predictions of the framework, including the behavior of stationary entropy, align closely with experimental observations. This framework provides a principled foundation for interpreting training dynamics and may guide future work on hyperparameter tuning and the design of learning rate schedulers.

2510.23477 2026-05-15 cs.CL

MMTutorBench: The First Multimodal Benchmark for AI Math Tutoring

Tengchao Yang, Sichen Guo, Mengzhao Jia, Jiaming Su, Yuanyang Liu, Zhihan Zhang, Meng Jiang

Affiliations: Tongji University; Fudan University; University of Notre Dame; Nanjing University of Posts and Telecommunications

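AI summary: MMTutorBench is the first multimodal benchmark for evaluating AI math tutoring, testing whether models can solve problems, diagnose student difficulties, and guide students step by step. It contains 685 problems built around pedagogically significant key steps, each paired with problem-specific rubrics, organized into three tasks: Insight Discovery, Operation Formulation, and Operation Execution. Experiments show that current leading multimodal large language models still fall well short of human tutors and that input variants such as OCR pipelines affect tutoring quality, underscoring the benchmark's value for evaluating and advancing AI math tutoring systems.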

Abstract

Effective math tutoring requires not only solving problems but also diagnosing students' difficulties and guiding them step by step. While multimodal large language models (MLLMs) show promise, existing benchmarks largely overlook these tutoring skills. We introduce MMTutorBench, the first benchmark for AI math tutoring, consisting of 685 problems built around pedagogically significant key-steps. Each problem is paired with problem-specific rubrics that enable fine-grained evaluation across six dimensions, and structured into three tasks: Insight Discovery, Operation Formulation, and Operation Execution. We evaluate 12 leading MLLMs and find clear performance gaps between proprietary and open-source systems, substantial room for improvement compared to human tutors, and consistent trends across input variants: OCR pipelines degrade tutoring quality, few-shot prompting yields limited gains, and our rubric-based LLM-as-a-Judge proves highly reliable. These results highlight both the difficulty and diagnostic value of MMTutorBench for advancing AI tutoring.

2510.18326 2026-05-15 cs.CV

Enhancing Few-Shot Classification of Benchmark and Disaster Imagery with ABHFA-Net

Gao Yu Lee, Tanmoy Dam, Md Meftahul Ferdaus, Daniel Puiu Poenar, Vu Duong

Affiliations: School of Mechanical and Aerospace Engineering (MAE), NTU; Department of Computer Science, The University of New Orleans

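AI summary: As natural and human-induced disasters become more frequent, visual recognition systems must remain robust when labeled data is limited. This paper proposes ABHFA-Net, an attention and Bhattacharyya distance-based feature aggregation network, to improve few-shot classification on benchmark and disaster imagery. The method models class prototypes as probability distributions and classifies via Bhattacharyya distance, and adds a spatial channel attention mechanism and a contrastive softmax loss to improve feature discrimination and class separability. Experiments show strong performance on several benchmark and real-world disaster datasets, with the disaster imagery results highlighted in particular.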

Comments Revised and Submitted to SN Computer journal

Abstract

The rising incidence of natural and human-induced disasters necessitates robust visual recognition systems capable of operating under limited labeled data conditions. However, disaster-related image classification remains challenging due to data scarcity, high intra-class variability, and domain-specific complexities in remote sensing imagery. To address these challenges, we propose the Attention Bhattacharyya Distance-based Feature Aggregation Network (ABHFA-Net), a novel few-shot learning (FSL) framework that models class prototypes as probability distributions and performs classification via Bhattacharyya distance-based comparison. Our approach integrates a spatial channel attention mechanism to enhance discriminative feature learning in the few-shot context and introduces a Bhattacharyya-based contrastive softmax loss for improved class separability. Extensive experiments on both benchmark datasets (CIFAR-FS, FC-100, miniImageNet, tieredImageNet) and real-world disaster datasets (AIDER, CDD, MEDIC) demonstrate the effectiveness of the proposed method. In particular, ABHFA-Net achieves 80.7% and 92.3% accuracy on CIFAR-FS under 5-way 1-shot and 5-shot settings, respectively, outperforming existing state-of-the-art methods. On disaster datasets, the model consistently improves classification performance, achieving up to 68.2% (1-shot) and 78.3% (5-shot) accuracy on AIDER, highlighting its robustness in real-world scenarios. These results establish ABHFA-Net as a strong and practical solution for few-shot disaster image classification, particularly in data-scarce and time-critical remote sensing applications. The code repository for our work is available at https://github.com/GreedYLearner1146/ABHFA-Net.
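
To make the classification rule in the abstract above concrete, here is a minimal Python sketch of nearest-prototype classification under the Bhattacharyya distance. It assumes that each class prototype and the query are diagonal Gaussians (one mean and one variance per feature); the abstract does not state the distribution family, and the names `bhattacharyya_diag` and `classify` are illustrative, not taken from the ABHFA-Net repository.

```python
import math

def bhattacharyya_diag(mu1, var1, mu2, var2):
    """Bhattacharyya distance between two diagonal Gaussians.

    Assumption: distributions are Gaussian with diagonal covariance,
    so the distance is a sum of independent per-feature terms.
    """
    dist = 0.0
    for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2):
        v = 0.5 * (v1 + v2)                              # averaged variance
        dist += 0.125 * (m1 - m2) ** 2 / v               # mean-separation term
        dist += 0.5 * math.log(v / math.sqrt(v1 * v2))   # variance-mismatch term
    return dist

def classify(query_mu, query_var, prototypes):
    """Return the class whose prototype is closest to the query.

    `prototypes` maps a class name to a (mean, variance) pair.
    """
    return min(
        prototypes,
        key=lambda name: bhattacharyya_diag(query_mu, query_var, *prototypes[name]),
    )

# Hypothetical two-class example with 2-dimensional features.
prototypes = {
    "flood": ([0.0, 0.0], [1.0, 1.0]),
    "fire": ([3.0, 3.0], [1.0, 1.0]),
}
label = classify([0.5, 0.2], [1.0, 1.0], prototypes)
```

Unlike a plain Euclidean distance between prototype means, this distance also grows when the two variances differ, which is one reason to model prototypes as distributions rather than points.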

2510.16196 2026-05-15 cs.CV cs.AI

Seeing Through the Brain: New Insights from Decoding Visual Stimuli with fMRI

Zheng Huang, Enpei Zhang, Weikang Qiu, Yinghao Cai, Carl Yang, Elynn Chen, Xiang Zhang, Rex Ying, Dawei Zhou, Yujun Yan

Affiliations: Dartmouth College; Yale University; Emory University; New York University; UNC Charlotte; Virginia Tech

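AI summary: This paper studies reconstructing visual stimuli from functional magnetic resonance imaging (fMRI) signals to understand how the brain encodes visual information. It finds that fMRI signals are more similar to the text space of a language model than to vision-based or joint text-image spaces, and argues that a structured text space better captures the compositional nature of visual stimuli. Building on this, the authors propose PRISM, which projects fMRI signals into a structured text space and combines an object-centric diffusion module with an attribute relationship search module, reducing perceptual loss by up to 8% on real-world datasets.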

Abstract

Understanding how the brain encodes visual information is a central challenge in neuroscience and machine learning. A promising approach is to reconstruct visual stimuli, essentially images, from functional Magnetic Resonance Imaging (fMRI) signals. This involves two stages: transforming fMRI signals into a latent space and then using a pretrained generative model to reconstruct images. The reconstruction quality depends on how similar the latent space is to the structure of neural activity and how well the generative model produces images from that space. Yet, it remains unclear which type of latent space best supports this transformation and how it should be organized to represent visual stimuli effectively. We present two key findings. First, fMRI signals are more similar to the text space of a language model than to either a vision based space or a joint text image space. Second, text representations and the generative model should be adapted to capture the compositional nature of visual stimuli, including objects, their detailed attributes, and relationships. Building on these insights, we propose PRISM, a model that Projects fMRI sIgnals into a Structured text space as an interMediate representation for visual stimuli reconstruction. It includes an object centric diffusion module that generates images by composing individual objects to reduce object detection errors, and an attribute relationship search module that automatically identifies key attributes and relationships that best align with the neural activity. Extensive experiments on real world datasets demonstrate that our framework outperforms existing methods, achieving up to an 8% reduction in perceptual loss. These results highlight the importance of using structured text as the intermediate space to bridge fMRI signals and image reconstruction.

2510.07060 2026-05-15 cs.CL

Does Local News Stay Local?: Online Content Shifts in Sinclair-Acquired Stations

Miriam Wanner, Sophia Hager, Anjalie Field

Affiliations: Johns Hopkins University

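AI summary: This paper studies how acquisition by Sinclair affects the content of local news stations. Using computational methods to compare content before and after acquisition and against national news outlets, it finds that acquired stations report more often on national news, reduce coverage of local topics, and increase coverage of polarizing national topics. The study highlights the potential influence of media ownership changes on news content.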

Comments Published at NLP+CSS Workshop @ ACL 2026

Abstract

Local news stations are often considered to be reliable sources of non-politicized information, particularly local concerns that residents care about. Because these stations are trusted news sources, viewers are particularly susceptible to the information they report. The Sinclair Broadcast Group is a broadcasting company that has acquired many local news stations in the last decade. We investigate the effects of local news stations being acquired by Sinclair: how does coverage change? We use computational methods to investigate changes in internet content put out by local news stations before and after being acquired by Sinclair and in comparison to national news outlets. We find clear evidence that local news stations report more frequently on national news at the expense of local topics, and that their coverage of polarizing national topics increases.

2510.00231 2026-05-15 cs.LG cs.AI

The Pitfalls of KV Cache Compression

Alex Chen, Renato Geh, Aditya Grover, Guy Van den Broeck, Daniel Israel

Affiliations: University of California, Los Angeles

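AI summary: This paper examines pitfalls of KV cache compression in realistic settings, particularly performance degradation under multi-instruction prompting. It evaluates five KV cache compression methods on large language models and finds that some instructions degrade sharply under compression, to the point of being ignored entirely, using system prompt leakage as a case study of the effect on instruction following. It identifies key factors behind leakage and proposes simple changes to KV cache eviction policies that improve overall performance on multi-instruction tasks.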

Comments In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics, ACL 2026

Abstract

KV cache compression promises increased throughput and efficiency with negligible loss in performance. While the gains in throughput are indisputable and recent literature has indeed shown minimal degradation on particular benchmarks, in general the consequences of compression in realistic scenarios such as multi-instruction prompting have been insufficiently studied. In this paper, we identify several pitfalls that practitioners should be aware of when deploying KV cache compressed LLMs. We evaluate five KV cache compression methods (StreamingLLM, SnapKV, TOVA, H2O, and K-Norm) on Llama3.1 8B and Qwen2.5 14B under multi-instruction prompting with IFEval. Importantly, we show that certain instructions degrade much more rapidly with compression, effectively causing them to be completely ignored by the LLM. As a practical example, we highlight system prompt leakage as a case study, empirically demonstrating the impact of compression on leakage and general instruction-following. We identify several factors that contribute to system prompt leakage: compression method, instruction order, and KV eviction bias. We then propose simple changes to KV cache eviction policies that can reduce the impact of these factors and improve the overall performance in multi-instruction tasks.
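
As a rough illustration of why eviction can make a model ignore some instructions but not others, the Python sketch below applies a simplified position-based policy in the spirit of StreamingLLM: keep a few initial "sink" tokens plus a recent window, and evict everything in between. The prompt layout, the budget values, and the helper names `sink_plus_window_keep` and `surviving_fraction` are hypothetical; this is not the paper's evaluation code or its proposed eviction policy.

```python
def sink_plus_window_keep(positions, n_sink, window):
    """Return the token positions retained by a sink-plus-recent-window policy."""
    positions = list(positions)
    if len(positions) <= n_sink + window:
        return positions  # cache budget not exceeded, nothing is evicted
    # Keep the first n_sink tokens and the last `window` tokens.
    return positions[:n_sink] + positions[len(positions) - window:]

def surviving_fraction(span, kept):
    """Fraction of an instruction's tokens that remain in the cache."""
    kept = set(kept)
    return sum(1 for p in span if p in kept) / len(span)

# Hypothetical 100-token prompt containing three 10-token instructions.
instructions = {
    "A": range(0, 10),    # at the start of the prompt
    "B": range(40, 50),   # in the middle
    "C": range(90, 100),  # at the end
}
kept = sink_plus_window_keep(range(100), n_sink=4, window=16)
survival = {name: surviving_fraction(span, kept) for name, span in instructions.items()}
# survival == {"A": 0.4, "B": 0.0, "C": 1.0}
```

In this toy setting the middle instruction loses all of its cached tokens while the last one is untouched, which matches the abstract's point that instruction order and eviction bias determine which instructions degrade first.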

2509.14159 2026-05-15 cs.RO

MIMIC-D: Multi-modal Imitation for MultI-agent Coordination with Decentralized Diffusion Policies

Dayi Dong, Maulik Bhatt, Seoyeon Choi, Negar Mehr

Affiliations: Department of Mechanical Engineering, University of California Berkeley

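AI summary: As robots are deployed more widely, their ability to coordinate with other robots and humans on multi-modal tasks becomes essential. Standard imitation learning often fails to capture the multiple valid behaviors in multi-modal expert demonstrations, and existing diffusion-based multi-agent methods usually rely on centralized planning or explicit communication. This paper proposes MIMIC-D, a decentralized diffusion-based multi-agent imitation learning framework that jointly trains all agents' policies using only local information to achieve implicit coordination, and shows robust multi-modal coordination in simulation and hardware experiments.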

Comments 8 pages, 4 figures, 5 tables

Abstract

As robots become more integrated in society, their ability to coordinate with other robots and humans on multi-modal tasks (those with multiple valid solutions) is crucial. Such behaviors can be learned from expert demonstrations via imitation learning (IL), but when expert demonstrations are multi-modal, standard IL approaches usually average across modes or collapse to a single mode, preventing effective coordination. Inspired by diffusion models' ability to capture complex multi-modal trajectory distributions in single-agent settings, we develop a diffusion-based framework for coordinated multi-modal behavior in multi-agent systems. However, existing multi-agent diffusion approaches typically require a centralized planner or explicit communication among agents. This assumption can fail in real-world scenarios where robots must operate independently or with agents like humans that they cannot directly communicate with. Therefore, we propose MIMIC-D, a joint training with decentralized execution paradigm for multi-modal multi-agent IL via diffusion. We jointly train all agents' policies with only local information to achieve implicit coordination. In simulation and hardware experiments, our method exhibits robust multi-modal coordination behavior in various tasks and environments, improving upon state-of-the-art baselines.

2509.01416 2026-05-15 cs.LG

MD-PNOP: Equation-Recast Neural Operators for Minimal-Data Extrapolation and PDE Solver Acceleration

Qiyun Cheng, Md Hossain Sahadath, Huihua Yang, Shaowu Pan, Wei Ji

Affiliations: Department of Mechanical, Aerospace, and Nuclear Engineering

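AI summary: This work proposes MD-PNOP, a framework for accelerating parametric partial differential equation (PDE) solvers and extrapolating across parameters from minimal data. It recasts parameter-induced operator differences as additional source terms and solves iteratively with a pretrained neural operator, extrapolating from a single training configuration to many unseen parameter settings without retraining. Experiments show a substantial reduction in solve time while keeping the governing equations enforced, with applications to complex problems such as neutron transport.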

Abstract

The computational overhead of traditional numerical solvers for partial differential equations (PDEs) remains a critical bottleneck for large-scale parametric studies and design optimization. We introduce a Minimal-Data Parametric Neural Operator Preconditioning (MD-PNOP) framework, which establishes a new strategy for accelerating parametric PDE solvers while strictly preserving physical constraints. To address the extrapolation limitation of neural operators, parameter-induced operator difference is recast as additional source terms and incorporated into an iterative solution scheme using a pretrained neural operator. This equation-recast formulation enables systematic parameter extrapolation from a single training configuration to a broad range of unseen parameter settings without retraining. The neural operator predictions are then embedded into iterative PDE solvers as improved initial guesses, thereby reducing convergence iterations without sacrificing accuracy. Unlike purely data-driven approaches, MD-PNOP guarantees that the governing equations remain fully enforced, eliminating concerns regarding loss of physics or interpretability. The framework is architecture-agnostic and is demonstrated using both DeepONet and FNO for Boltzmann transport equation solvers in neutron transport applications. Numerical results demonstrate that neural operators trained on a single set of constant parameters successfully accelerate solutions with heterogeneous, sinusoidal, and discontinuous parameter distributions. Moreover, MD-PNOP consistently achieves approximately 50% reduction in computational time while maintaining full-order fidelity for fixed-source, single-group eigenvalue, and multigroup coupled eigenvalue problems.
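
One way to read the equation-recast idea in the abstract above is as a fixed-point iteration. The sketch below assumes a linear parametric problem and introduces its own notation: $A(p)$ for the operator at parameter $p$, $p_0$ for the single training configuration, and $G_\theta$ for the pretrained neural operator approximating $A(p_0)^{-1}$. The abstract gives no formulas, so the paper's actual formulation may differ.

```latex
\begin{align*}
  A(p)\,u &= f
    && \text{target problem at an unseen parameter } p \\
  A(p_0)\,u &= f + \bigl[A(p_0) - A(p)\bigr]\,u
    && \text{operator difference moved to the source side} \\
  u^{(k+1)} &= G_\theta\!\Bigl(f + \bigl[A(p_0) - A(p)\bigr]\,u^{(k)}\Bigr)
    && \text{iterate with } G_\theta \approx A(p_0)^{-1}
\end{align*}
```

If $G_\theta$ were exact, any fixed point of this iteration would satisfy $A(p)\,u = f$, and the iteration would converge when the spectral radius of $A(p_0)^{-1}\bigl[A(p_0) - A(p)\bigr]$ is below one. Because $G_\theta$ is only approximate, the abstract's design of handing the result to a conventional iterative solver as an initial guess is what keeps the final solution at full-order accuracy.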

2506.11067 2026-05-15 cs.CL

A Large Language Model Based Pipeline for Review of Systems Entity Recognition from Clinical Notes

Hieu Nghiem, Zhuqi Miao, Hemanth Reddy Singareddy, Jivan Lamichhane, Abdulaziz Ahmed, Johnson Thomas, Dursun Delen, William Paiva

Affiliations: Department of Computer Science; Department of Management; Department of Medicine; Center for Health Systems Science and Information Systems; The State University of New York; Upstate Medical University; Oklahoma State University; Innovation New Paltz; Department of Health Services; Center for Health Systems; Science and Information Systems University of Alabama at Stillwater, OK, USA; Innovation Birmingham, AL, USA

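AI summary: This study proposes a cost-effective pipeline based on large language models (LLMs) for automatically recognizing Review of Systems (ROS) entities in clinical notes, such as diseases and symptoms and their associated body systems. It uses four open-source LLMs and introduces a novel attribution algorithm to improve entity recognition accuracy. Results show strong performance across tasks and good prospects for resource-limited settings.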

Comments Accepted by IEEE EMBC 2026. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

Abstract

Objective: Develop a cost-effective, large language model (LLM)-based pipeline for automatically extracting Review of Systems (ROS) entities from clinical notes. Materials and Methods: The pipeline extracts the ROS section from the clinical note using SecTag header terminology, followed by few-shot LLMs to identify ROS entities such as diseases or symptoms, their positive/negative status and associated body systems. We implemented the pipeline using four open-source LLMs: llama3.1:8b, gemma3:27b, mistral3.1:24b and gpt-oss:20b. Additionally, we introduced a novel attribution algorithm that aligns LLM-identified ROS entities with their source text, addressing non-exact and synonymous matches. The evaluation was conducted on 24 general medicine notes containing 340 annotated ROS entities. Results: Open-source LLMs enable a local, cost-efficient pipeline while delivering promising performance. Larger models like Gemma, Mistral, and Gpt-oss demonstrate robust performance across three entity recognition tasks of the pipeline: ROS entity extraction, negation detection and body system classification (highest F1 score = 0.952). With the attribution algorithm, all models show improvements across key performance metrics, including higher F1 score and accuracy, along with lower error rate. Notably, the smaller Llama model also achieved promising results despite using only one-third the VRAM of larger models. Discussion and Conclusion: From an application perspective, our pipeline provides a scalable, locally deployable solution for easing the ROS documentation burden. Open-source LLMs offer a practical AI option for resource-limited healthcare settings. Methodologically, our newly developed algorithm facilitates accuracy improvements for zero- and few-shot LLMs in named entity recognition.

2506.04646 2026-05-15 cs.RO cs.LG

ActivePusher: Active Learning and Planning with Residual Physics for Nonprehensile Manipulation

Zhuoyun Zhong, Seyedali Golestaneh, Constantinos Chamzas

Affiliations: Department of Robotics Engineering, Worcester Polytechnic Institute (WPI)

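AI summary: This paper proposes ActivePusher, a framework for active learning and planning in nonprehensile manipulation such as pushing and rolling. It combines residual-physics modeling with uncertainty-based active learning to collect the most informative training data efficiently, and integrates with model-based motion planners to improve the reliability of long-horizon planning. Experiments in simulation and in the real world show higher data efficiency and planning success rates.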

Comments Accepted by the 2026 IEEE International Conference on Robotics & Automation (ICRA 2026)

Abstract

Planning with learned dynamics models offers a promising approach toward versatile real-world manipulation, particularly in nonprehensile settings such as pushing or rolling, where accurate analytical models are difficult to obtain. However, collecting training data for learning-based methods can be costly and inefficient, as it often relies on randomly sampled interactions that are not necessarily the most informative. Furthermore, learned models tend to exhibit high uncertainty in underexplored regions of the skill space, undermining the reliability of long-horizon planning. To address these challenges, we propose ActivePusher, a novel framework that combines residual-physics modeling with uncertainty-based active learning, to focus data acquisition on the most informative skill parameters. Additionally, ActivePusher seamlessly integrates with model-based kinodynamic planners, leveraging uncertainty estimates to bias control sampling toward more reliable actions. We evaluate our approach in both simulation and real-world environments, and demonstrate that it consistently improves data efficiency and achieves higher planning success rates in comparison to baseline methods. The source code is available at https://github.com/elpis-lab/ActivePusher.

2505.23912 2026-05-15 cs.CL cs.AI

LoVeC: Reinforcement Learning for Better Verbalized Confidence in Long-Form Generations

Caiqi Zhang, Xiaochen Zhu, Chengzu Li, Nigel Collier, Andreas Vlachos

Affiliations: University of Cambridge

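AI summary: This paper proposes LoVeC, a reinforcement learning method that trains a model to append an interpretable numerical confidence score to each statement during long-form generation, as a signal of factuality. It addresses the computational cost and limited generalization of existing approaches, giving more efficient and robust confidence estimation on long-form question answering. Experiments on three datasets show better calibration and cross-domain generalization, with a 20-fold speedup over traditional self-consistency methods.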

Comments ACL 2026 Main

Abstract

Hallucination remains a major challenge for the safe and trustworthy deployment of large language models (LLMs) in factual content generation. Prior work has explored confidence estimation as an effective approach to hallucination detection, but often relies on post-hoc self-consistency methods that require computationally expensive sampling. Verbalized confidence offers a more efficient alternative, but existing approaches are largely limited to short-form question answering (QA) tasks and do not generalize well to open-ended generation. In this paper, we propose LoVeC (Long-form Verbalized Confidence), a novel reinforcement learning based method that trains LLMs to append an on-the-fly numerical confidence score to each generated statement during long-form generation. The confidence score serves as a direct and interpretable signal of the factuality of generation. We introduce two evaluation settings, free-form tagging and iterative tagging, to assess different verbalized confidence estimation methods. Experiments on three long-form QA datasets show that our RL-trained models achieve better calibration and generalize robustly across domains. Also, our method is highly efficient, being 20 times faster than traditional self-consistency methods while achieving better calibration.