arXivDaily: arXiv daily academic digest, updated Monday to Friday
2602.08885 2026-06-01 cs.LG cs.AI cs.SC

Breaking the Simplification Bottleneck in Amortized Neural Symbolic Regression

Paul Saegert, Ullrich Köthe

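AI summary: To address slow expression simplification in amortized symbolic regression, the paper proposes SimpliPy, a rule-based simplification engine that achieves a 100-fold speed-up and thereby improves model accuracy and scalability.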

Comments main text: 8 pages, 7 figures; appendix: 12 pages, 11 figures; code available at https://github.com/psaegert/simplipy and https://github.com/psaegert/flash-ansr; v2: Fixed rendering artifact in Figure 7; v3: Fixed Figure 3 title and formula; v4: Fixed Eq (1), example in App. M, Fig 13; v5: ICML 2026 Camera-Ready Version

Abstract

Symbolic regression (SR) aims to discover interpretable analytical expressions that accurately describe observed data. Amortized SR promises to be much more efficient than the predominant genetic programming SR methods, but currently struggles to scale to realistic scientific complexity. We find that a key obstacle is the lack of a fast reduction of equivalent expressions to a concise normalized form. Amortized SR has addressed this with general-purpose Computer Algebra Systems (CAS) like SymPy, but the high computational cost severely limits training and inference speed. We propose SimpliPy, a rule-based simplification engine achieving a 100-fold speed-up over SymPy at comparable quality. This enables substantial improvements in amortized SR, including scalability to much larger training sets, more efficient use of the per-expression token budget, and systematic training set decontamination with respect to equivalent test expressions. We demonstrate these advantages in our Flash-ANSR framework, which achieves much better accuracy than amortized baselines (NeSymReS, E2E) on the FastSRB benchmark. Moreover, it performs on par with state-of-the-art direct optimization (PySR) while recovering more concise rather than more complex expressions with increasing inference budget.
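
As a rough illustration of what rule-based reduction to a normalized form means, the following Python sketch rewrites small expression trees bottom-up. It is not SimpliPy's API or rule set: the nested-tuple encoding, the function name `simplify`, and the handful of rules are assumptions made for this example.

```python
# Toy rule-based simplifier over nested tuples such as ("+", "x", 0).
# Illustrative only: the encoding and the rules are assumptions, not SimpliPy's.

def simplify(expr):
    """Rewrite an expression bottom-up using a few algebraic identities."""
    if not isinstance(expr, tuple):
        return expr  # leaf: a variable name or a number
    op, a, b = expr[0], simplify(expr[1]), simplify(expr[2])
    if op not in ("+", "*"):
        raise ValueError(f"unsupported operator: {op!r}")

    # Constant folding: both operands are numbers.
    numeric = (int, float)
    if isinstance(a, numeric) and isinstance(b, numeric):
        return a + b if op == "+" else a * b

    # Identity and annihilator rules.
    if op == "+":
        if a == 0:
            return b
        if b == 0:
            return a
    if op == "*":
        if a == 0 or b == 0:
            return 0
        if a == 1:
            return b
        if b == 1:
            return a

    # Canonical operand order for commutative operators, so that
    # equivalent expressions such as x*y and y*x share one form.
    a, b = sorted((a, b), key=repr)
    return (op, a, b)
```

Because equivalent inputs collapse to the same tuple, a plain equality check on the simplified forms can flag duplicates, which is the kind of use the abstract describes for training-set decontamination.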

2506.01928 2026-06-01 cs.CL cs.LG

Esoteric Language Models: A Family of Any-Order Diffusion LLMs

Subham Sekhar Sahoo, Zhihan Yang, Yash Akhauri, Johnna Liu, Deepansha Singh, Zhoujun Cheng, Zhengzhong Liu, Eric Xing, John Thickstun, Arash Vahdat

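AI summary: Proposes Eso-LMs, which fuse the autoregressive and masked diffusion paradigms; causal attention enables exact likelihood computation and KV caching, setting a new state of the art on the speed-quality Pareto frontier.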

Comments ICML 2026

Abstract

Diffusion-based language models offer a compelling alternative to autoregressive (AR) models by enabling parallel and controllable generation. Within this family, Masked Diffusion Models (MDMs) currently perform best but still underperform AR models in perplexity and lack key inference-time efficiency features, most notably KV caching. We introduce Eso-LMs, a new family of models that fuses AR and MDM paradigms, smoothly interpolating between their perplexities while overcoming their respective limitations. Unlike prior work, which uses transformers with bidirectional attention as MDM denoisers, we exploit the connection between MDMs and Any-Order autoregressive models and adopt causal attention. This design lets us compute the exact likelihood of MDMs for the first time and, crucially, lets us introduce KV caching for MDMs for the first time while preserving parallel generation, significantly improving inference efficiency. Combined with an optimized sampling schedule, Eso-LMs establish a new state of the art on the speed-quality Pareto frontier for unconditional generation. We provide the code, model checkpoints, and a video tutorial on the project page: https://s-sahoo.com/Eso-LMs.

2602.18333 2026-06-01 cs.LG cs.CL

On the "Induction Bias" in Sequence Models

M. Reza Ebrahimi, Michaël Defferrard, Sunny Panchal, Roland Memisevic

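AI summary: Large-scale experiments compare the data efficiency of Transformers and RNNs on state-tracking tasks, finding that Transformers need more training data and struggle to share weights across lengths, whereas RNNs learn effectively through weight sharing.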

Comments Accepted to the International Conference on Machine Learning (ICML) 2026

Abstract

Despite the remarkable practical success of transformer-based language models, recent work has raised concerns about their ability to perform state tracking. In particular, a growing body of literature has shown this limitation primarily through failures in out-of-distribution (OOD) generalization, such as length extrapolation. In this work, we shift attention to the in-distribution implications of these limitations. We conduct a large-scale experimental study of the data efficiency of transformers and recurrent neural networks (RNNs) across multiple supervision regimes. We find that the amount of training data required by transformers grows much more rapidly with state-space size and sequence length than for RNNs. Furthermore, we analyze the extent to which learned state-tracking mechanisms are shared across different sequence lengths. We show that transformers exhibit negligible or even detrimental weight sharing across lengths, indicating that they learn length-specific solutions in isolation. In contrast, recurrent models exhibit effective amortized learning by sharing weights across lengths, allowing data from one sequence length to improve performance on others. Together, these results demonstrate that state tracking remains a fundamental challenge for transformers, even when training and evaluation distributions match.

2602.17531 2026-06-01 cs.LG cs.AI

Position: Evaluation of ECG Representations Must Be Fixed

Zachary Berger, Daniel Prakah-Asante, John Guttag, Collin M. Stultz

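AI summary: The paper argues that benchmarking practice for 12-lead ECG representation learning must be improved so that progress is reliable and aligned with clinical objectives, and recommends broadening the evaluation scope, adopting best practices, and using a random encoder as a baseline.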

Comments Project website at https://ecgfix.csail.mit.edu/

Abstract

This position paper argues that current benchmarking practice in 12-lead ECG representation learning must be fixed to ensure progress is reliable and aligned with clinically meaningful objectives. The field has largely converged on three public multi-label benchmarks (PTB-XL, CPSC2018, CSN) dominated by arrhythmia and waveform-morphology labels, even though the ECG is known to encode substantially broader clinical information. We argue that downstream evaluation should expand to include an assessment of structural heart disease and patient-level forecasting, in addition to other evolving ECG-related endpoints, as relevant clinical targets. Next, we outline evaluation best practices for multi-label, imbalanced settings, and show that when they are applied, the literature's current conclusion about which representations perform best is altered. Furthermore, we demonstrate the surprising result that a randomly initialized encoder with linear evaluation matches state-of-the-art pre-training on many tasks. This motivates the use of a random encoder as a reasonable baseline model. We substantiate our observations with an empirical evaluation of five representative ECG pre-training approaches across six evaluation settings: the three standard benchmarks, a structural disease dataset, hemodynamic inference, and patient forecasting.

2602.13110 2026-06-01 cs.CL cs.AI

SCOPE: Selective Conformal Optimized Pairwise LLM Judging

Sher Badshah, Ali Emami, Hassan Sajjad

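AI summary: Proposes the SCOPE framework, which calibrates an acceptance threshold to control the error rate among non-abstained judgments, and introduces Bidirectional Preference Entropy (BPE) as a bias-neutral uncertainty signal, enabling reliable, high-coverage pairwise LLM evaluation.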

Comments Accepted at ICML 2026. 23 pages (9 main plus appendix), 7 figures, 11 tables

Abstract

Large language models (LLMs) are increasingly used as scalable judges in pairwise evaluation, but they remain prone to miscalibration and biases. We propose SCOPE (Selective Conformal Optimized Pairwise Evaluation), a framework that calibrates an acceptance threshold so that, under exchangeability, the error rate among non-abstained judgments is at most a user-specified level $α$. To supply SCOPE with a bias-neutral uncertainty signal, we introduce Bidirectional Preference Entropy (BPE), which queries the judge under both response positions and converts the order-averaged preference probability into an entropy-based score. Across various pairwise judging benchmarks, BPE outperforms standard confidence proxies in calibration and discrimination, while SCOPE consistently satisfies the target risk bound (empirical FDR $\approx 0.097$ to $0.099$ at $α= 0.10$) and retains substantial coverage. Compared to vanilla baselines, SCOPE accepts up to $2.4\times$ more judgments under the same risk constraint, demonstrating that BPE enables reliable and high-coverage LLM-based evaluation.
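
The two ingredients described above can be sketched in a few lines of Python. This is a simplified reading of the abstract, not the authors' implementation: the function names, the use of binary entropy in bits, and the conservative `(errors + 1) / (accepted + 1)` estimate are assumptions standing in for the paper's actual calibration procedure.

```python
import math

def bpe(p_a_first, p_a_second):
    """Binary entropy (bits) of the order-averaged probability that A wins.

    p_a_first:  judge's P(A preferred) when A is shown in the first position
    p_a_second: judge's P(A preferred) when A is shown in the second position
    """
    p = 0.5 * (p_a_first + p_a_second)
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def calibrate_threshold(scores, correct, alpha):
    """Largest uncertainty threshold t such that, on the calibration set,
    judgments with score <= t have a conservative error estimate
    (errors + 1) / (accepted + 1) of at most alpha. Returns None if no
    threshold qualifies (the judge then always abstains)."""
    pairs = sorted(zip(scores, correct))
    best, errors = None, 0
    for i, (score, ok) in enumerate(pairs):
        errors += 0 if ok else 1
        # Only evaluate a threshold once all items tied at this score are counted.
        last_of_tie = i + 1 == len(pairs) or pairs[i + 1][0] != score
        if last_of_tie and (errors + 1) / (i + 2) <= alpha:
            best = score
    return best
```

A judge that flips its answer with the response order (for example 0.9 and 0.1) averages to 0.5 and receives the maximum entropy of 1 bit, so position bias shows up as uncertainty rather than as confidence.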

2602.16682 2026-06-01 cs.CV

SAW-Bench: Learning Situated Awareness in the Real World

Chuhan Li, Rilyn Han, Joy Hsu, Yongyuan Liang, Rajiv Dhawan, Jiajun Wu, Ming-Hsuan Yang, Xin Eric Wang

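AI summary: Proposes the SAW-Bench benchmark, which uses self-recorded videos and question-answer pairs to evaluate the observer-centric situated awareness of multimodal foundation models and reveals a substantial gap between human and model performance.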

Abstract

A core aspect of human perception is situated awareness, the ability to relate ourselves to the surrounding physical environment and reason over possible actions in context. However, most existing benchmarks for multimodal foundation models (MFMs) emphasize environment-centric spatial relations (relations among objects in a scene), while largely overlooking observer-centric relationships that require reasoning relative to the agent's viewpoint, pose, and motion. To bridge this gap, we introduce SAW-Bench (Situated Awareness in the Real World), a novel benchmark for evaluating egocentric situated awareness using real-world videos. SAW-Bench comprises 786 self-recorded videos captured with Ray-Ban Meta (Gen 2) smart glasses spanning diverse indoor and outdoor environments, and over 2,071 human-annotated question-answer pairs. It probes a model's observer-centric understanding with six different awareness tasks. Our comprehensive evaluation reveals a human-model performance gap of 37.66%, even with the best-performing MFM, Gemini 3 Flash. Beyond this gap, our in-depth analysis uncovers several notable findings; for example, while models can exploit partial geometric cues in egocentric videos, they often fail to infer a coherent camera geometry, leading to systematic spatial reasoning errors. We position SAW-Bench as a benchmark for situated spatial intelligence, moving beyond passive observation to understanding physically grounded, observer-centric dynamics.

2602.16305 2026-06-01 cs.SD cs.LG

BAT: Better Audio Transformer Guided by Convex Gated Probing

Houtan Ghaffari, Lukas Rauch, Christoph Scholz, Paul Devos

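AI summary: Proposes Convex Gated Probing (CGP), which uses a gating mechanism to exploit all frozen layers and narrows the gap between probing and fine-tuning in audio self-supervised learning; guided by CGP, the authors rework the SSL pipeline into the Better Audio Transformer (BAT), which sets new state-of-the-art results on audio benchmarks.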

Comments Accepted @ ICML26

Abstract

Probing is widely adopted in computer vision to faithfully evaluate self-supervised learning (SSL) embeddings, as finetuning may misrepresent their inherent quality. In contrast, audio SSL models still rely on finetuning because simple probing fails to unlock their full potential and alters their rankings when competing on AudioSet. Hence, a robust and efficient probing mechanism is required to guide the trajectory of audio SSL towards reliable and reproducible methods. We introduce Convex Gated Probing (CGP), a prototype-based method that significantly closes the gap between finetuning and probing in audio. CGP efficiently utilizes all frozen layers via a gating mechanism and exposes the location of latent task-relevant information. Guided by CGP as a reliable post-hoc evaluation probe, we rework the entire SSL pipeline of current best performing audio models that use legacy implementations of prior SSL methods. By refining data preprocessing, model architecture, and pretraining recipe, we introduce Better Audio Transformer (BAT), and establish new SOTA on audio benchmarks.

2602.15778 2026-06-01 cs.CL

*-PLUIE: Personalisable metric with Llm Used for Improved Evaluation

Quentin Lemesle, Léane Jourdan, Daisy Munson, Pierre Alain, Jonathan Chevelu, Arnaud Delhay, Damien Lolive

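AI summary: Proposes *-PLUIE, a personalisable, perplexity-based LLM-judge metric whose task-specific prompting variants correlate more strongly with human judgment while keeping computational cost low.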

Comments Accepted at *SEM 2026

Abstract

Evaluating the quality of automatically generated text often relies on LLM-as-a-judge (LLM-judge) methods. While effective, these approaches are computationally expensive and require post-processing. To address these limitations, we build upon ParaPLUIE, a perplexity-based LLM-judge metric that estimates confidence over "Yes/No" answers without generating text. We introduce *-PLUIE, task-specific prompting variants of ParaPLUIE, and evaluate their alignment with human judgement. Our experiments show that personalised *-PLUIE achieves stronger correlations with human ratings while maintaining low computational cost.
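
One plausible reading of a Yes/No confidence score that needs no text generation is a log-likelihood ratio between the two candidate answers. The Python sketch below assumes the caller has already obtained per-token log-probabilities for each answer from the judge model; the function names and the sigmoid mapping are illustrative assumptions, not the published definition of ParaPLUIE or *-PLUIE.

```python
import math

def yes_no_score(logprobs_yes, logprobs_no):
    """Log-likelihood ratio log p(Yes) - log p(No).

    Each argument is the list of per-token log-probabilities the judge
    assigns to that answer given the prompt (assumed to be precomputed).
    """
    return sum(logprobs_yes) - sum(logprobs_no)

def yes_probability(score):
    """Map the log-ratio to P(Yes), renormalized over the two answers only."""
    return 1.0 / (1.0 + math.exp(-score))
```

A positive score means the judge finds "Yes" more likely than "No", and scoring takes a single forward pass per answer, which is where the low computational cost would come from under this reading.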

2602.15634 2026-06-01 cs.LG

Beyond ReLU: Bifurcation, Oversmoothing, and Topological Priors

Erkan Turan, Gaspard Abel, Maysam Behmanesh, Emery Pierson, Maks Ovsjanikov

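AI summary: Reinterprets oversmoothing in graph neural networks through bifurcation theory, finds that replacing ReLU with a particular class of activation functions destabilizes the homogeneous state and induces non-homogeneous patterns that resist oversmoothing, and derives a bifurcation-aware initialization.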

Abstract

Graph Neural Networks (GNNs) learn node representations through iterative network-based message-passing. While powerful, deep GNNs suffer from oversmoothing, where node features converge to a homogeneous, non-informative state. We re-frame this problem of representational collapse from a bifurcation theory perspective, characterizing oversmoothing as convergence to a stable "homogeneous fixed point." Our central contribution is the theoretical discovery that this undesired stability can be broken by replacing standard monotone activations (e.g., ReLU) with a class of functions. Using Lyapunov-Schmidt reduction, we analytically prove that this substitution induces a bifurcation that destabilizes the homogeneous state and creates a new pair of stable, non-homogeneous patterns that provably resist oversmoothing. Our theory predicts a precise, nontrivial scaling law for the amplitude of these emergent patterns, which we quantitatively validate in experiments. Finally, we demonstrate the practical utility of our theory by deriving a closed-form, bifurcation-aware initialization and showing its utility in real benchmark experiments.
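
The homogeneous fixed point is easy to observe in a stripped-down setting. The Python sketch below repeatedly applies mean aggregation on a four-node path graph, with no learned weights and no activation; this linear toy model is an assumption chosen for illustration and is not the architecture or the bifurcation analysis from the paper.

```python
def propagate(features, neighbors):
    """One linear message-passing step: each node averages itself and its neighbors."""
    out = []
    for i, nbrs in enumerate(neighbors):
        group = [i] + list(nbrs)
        out.append(sum(features[j] for j in group) / len(group))
    return out

def spread(features):
    """Gap between the largest and smallest node feature (0 means homogeneous)."""
    return max(features) - min(features)

# Path graph 0 - 1 - 2 - 3 with one scalar feature per node.
neighbors = [[1], [0, 2], [1, 3], [2]]
x = [1.0, 0.0, 0.0, -1.0]

history = [spread(x)]
for _ in range(50):
    x = propagate(x, neighbors)
    history.append(spread(x))
```

The spread never increases, because every step is a convex combination of existing values, and after 50 layers the node features are numerically indistinguishable. That collapse is the state the abstract proposes to destabilize by changing the activation function.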

2602.15293 2026-06-01 cs.LG cs.AI cs.CL stat.ML

The Information Geometry of Softmax: Probing and Steering

Kiho Park, Todd Nief, Yo Joong Choe, Victor Veitch

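AI summary: Studies, from an information-geometric perspective, how AI systems encode semantic structure in the geometry of their representation spaces, and proposes "dual steering", a method that uses linear probes to robustly steer representations toward a particular concept.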

Comments Code is available at https://github.com/KihoPark/dual-steering

Journal ref: In Proceedings of the 43rd International Conference on Machine Learning (ICML), 2026

Abstract

This paper concerns the question of how AI systems encode semantic structure into the geometric structure of their representation spaces. The motivating observation is that the natural geometry of these representation spaces should reflect the way models use representations to produce behavior. We focus on the important special case of representations that define softmax distributions. In this case, we argue that the natural geometry is information geometry. Our focus is on the role of information geometry on semantic encoding and the linear representation hypothesis. As an illustrative application, we develop "dual steering", a method for robustly steering representations to exhibit a particular concept using linear probes. We prove that dual steering optimally modifies the target concept while minimizing changes to off-target concepts. Empirically, we find that dual steering enhances the controllability and stability of concept manipulation.

2602.15018 2026-06-01 cs.RO cs.CV

Neurosim: A Fast Simulator for Neuromorphic Robot Perception

Richeek Das, Pratik Chaudhari

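AI summary: Presents the Neurosim and Cortex libraries, which support training and closed-loop testing of neuromorphic perception and control algorithms through high-speed sensor simulation and low-latency communication.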

Comments 11 pages, 6 figures

Abstract

Neurosim is a fast, real-time, high-performance library for simulating sensors such as dynamic vision sensors, RGB cameras, depth sensors, and inertial sensors. It can also simulate agile dynamics of multi-rotor vehicles in complex and dynamic environments. Neurosim can achieve frame rates as high as ~2700 FPS on a desktop GPU. Neurosim integrates with a ZeroMQ-based communication library called Cortex to facilitate seamless integration with machine learning and robotics workflows. Cortex provides a high-throughput, low-latency message-passing system for Python and C++ applications, with native support for NumPy arrays and PyTorch tensors. This paper discusses the design philosophy behind Neurosim and Cortex. It demonstrates how they can be used to (i) train neuromorphic perception and control algorithms, e.g., using self-supervised learning on time-synchronized multi-modal data, and (ii) test real-time implementations of these algorithms in closed loop. Neurosim and Cortex are available at https://github.com/grasp-lyrl/neurosim.

2602.14441 2026-06-01 cs.CV

D-SECURE: Dual-Source Evidence Combination for Unified Reasoning in Misinformation Detection

Samudi Amarasinghe, Gagandeep Singh, Priyanka Singh

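AI summary: Proposes the D-SECURE framework, which combines internal manipulation detection (HAMMER) with external evidence retrieval (DEFAME) to provide unified reasoning and explainable reports for multimodal misinformation.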

Abstract

Multimodal misinformation increasingly mixes realistic image edits with fluent but misleading text, producing persuasive posts that are difficult to verify. Existing systems usually rely on a single evidence source. Content-based detectors identify local inconsistencies within an image and its caption but cannot determine global factual truth. Retrieval-based fact-checkers reason over external evidence but treat inputs as coarse claims and often miss subtle visual or textual manipulations. This separation creates failure cases where internally consistent fabrications bypass manipulation detectors and fact-checkers verify claims that contain pixel-level or token-level corruption. We present D-SECURE, a framework that combines internal manipulation detection with external evidence-based reasoning for news-style posts. D-SECURE integrates the HAMMER manipulation detector with the DEFAME retrieval pipeline. DEFAME performs broad verification, and HAMMER analyses residual or uncertain cases that may contain fine-grained edits. Experiments on DGM4 and ClaimReview samples highlight the complementary strengths of both systems and motivate their fusion. We provide a unified, explainable report that incorporates manipulation cues and external evidence.

2602.13069 2026-06-01 cs.LG cs.CL

Memory-Efficient Structured Backpropagation for On-Device LLM Fine-Tuning

Juneyoung Park, Yuri Hong, Seongwan Kim, Jaeho Lee

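AI summary: Proposes MeSP, which manually derives backward passes that exploit LoRA's low-rank structure, computing mathematically identical gradients while reducing memory by 49% on average and making fine-tuning feasible on memory-constrained devices.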

Comments ACL2026

Abstract

On-device fine-tuning enables privacy-preserving personalization of large language models, but mobile devices impose severe memory constraints, typically 6-12 GB shared across all workloads. Existing approaches force a trade-off between exact gradients with high memory (MeBP) and low memory with noisy estimates (MeZO). We propose Memory-efficient Structured Backpropagation (MeSP), which bridges this gap by manually deriving backward passes that exploit LoRA's low-rank structure. Our key insight is that the intermediate projection $h = xA$ can be recomputed during backward at minimal cost since rank $r \ll d_{in}$, eliminating the need to store it. MeSP achieves 49% average memory reduction compared to MeBP on Qwen2.5 models (0.5B-3B) while computing mathematically identical gradients. Our analysis also reveals that MeZO's gradient estimates show near-zero correlation with true gradients (cosine similarity $\approx 0.001$), explaining its slow convergence. MeSP reduces peak memory from 361MB to 136MB for Qwen2.5-0.5B, enabling fine-tuning scenarios previously infeasible on memory-constrained devices.
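
The recomputation idea can be shown on the low-rank branch alone. The Python sketch below is a minimal illustration under stated assumptions: it covers only $y = (xA)B$ without the frozen base weight, uses plain lists instead of a tensor library, and the function names are hypothetical rather than taken from the MeSP code.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def lora_forward(x, A, B):
    """Low-rank branch y = (x A) B. The intermediate h = x A is not kept."""
    h = matmul(x, A)
    return matmul(h, B)

def lora_backward(x, A, B, grad_y):
    """Gradients of the branch, recomputing h = x A rather than storing it.

    Shapes: x is n x d_in, A is d_in x r, B is r x d_out, grad_y is n x d_out.
    """
    h = matmul(x, A)                       # recomputed, n x r (cheap when r << d_in)
    grad_B = matmul(transpose(h), grad_y)  # r x d_out
    grad_h = matmul(grad_y, transpose(B))  # n x r
    grad_A = matmul(transpose(x), grad_h)  # d_in x r
    grad_x = matmul(grad_h, transpose(A))  # n x d_in
    return grad_A, grad_B, grad_x
```

The gradients are the same as those obtained by caching $h$ in the forward pass; the only change is that one extra $n \times d_{in}$ by $d_{in} \times r$ product replaces an $n \times r$ buffer held between forward and backward.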

2602.12686 2026-06-01 cs.RO

SignScene: Visual Sign Grounding for Mapless Navigation

Nicky Zimmerman, Joel Loo, Benjamin Koh, Zishuo Wang, David Hsu

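AI summary: Proposes SignScene, a sign-centric spatial-semantic representation that uses vision-language models to map the semantic instructions on signs to scene elements and navigational actions for mapless navigation, reaching 88% grounding accuracy on 114 queries.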

Comments Under review for a conference

Abstract

Navigational signs enable humans to navigate unfamiliar environments without maps. This work studies how robots can similarly exploit signs for mapless navigation in the open world. A central challenge lies in interpreting signs: real-world signs are diverse and complex, and their abstract semantic contents need to be grounded in the local 3D scene. We formalize this as sign grounding, the problem of mapping semantic instructions on signs to corresponding scene elements and navigational actions. Recent Vision-Language Models (VLMs) offer the semantic common-sense and reasoning capabilities required for this task, but are sensitive to how spatial information is represented. We propose SignScene, a sign-centric spatial-semantic representation that captures navigation-relevant scene elements and sign information, and presents them to VLMs in a form conducive to effective reasoning. We evaluate our grounding approach on a dataset of 114 queries collected across nine diverse environment types, achieving 88% grounding accuracy and significantly outperforming baselines. Finally, we demonstrate that it enables real-world mapless navigation on a Spot robot using only signs.

2602.11802 2026-06-01 cs.LG

Structural Bias Beyond Homophily: A Study of Fairness in Link Prediction

Lilian Marey, Mathilde Perez, Tiphaine Viard, Charlotte Laclau

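AI summary: By formalizing topological bias measures and introducing a method for generating synthetic graphs with controlled structural properties, the study shows empirically that graph topology correlates strongly with fairness in link prediction and that existing fairness-aware methods remain sensitive to structural biases beyond homophily.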

Abstract

Graph link prediction (LP) plays a critical role in socially impactful applications such as job recommendation and friendship formation, making fairness a critical concern in this task. While many fairness-aware methods manipulate graph structures to mitigate prediction disparities, the topological biases inherent to social graphs remain poorly understood and are consistently conflated with homophily alone. In this work, we study the relationship between structural biases and fairness outcomes in LP. To this end, we formalize a taxonomy of topological bias measures and introduce a graph generation method producing a diverse corpus of synthetic graphs with controlled structural properties. Using this corpus, we show empirically that fairness outcomes are strongly correlated with graph topology, and that current fairness-aware methods remain sensitive to structural biases beyond homophily. These findings highlight the need for structurally grounded evaluations in fair graph learning.

2602.11216 2026-06-01 cs.LG physics.bio-ph

Protein Language Model Embeddings Improve Generalization of Implicit Transfer Operators

Panagiotis Antoniadis, Beatrice Pavesi, Simon Olsson, Ole Winther

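AI summary: Proposes PLaTITO, which incorporates protein language model embeddings into implicit transfer operators to improve data efficiency and generalization across molecular systems in molecular dynamics, achieving state-of-the-art equilibrium sampling on out-of-distribution protein systems.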

Comments 29 pages, 14 figures and 11 tables, Accepted at ICML 2026

Abstract

Molecular dynamics (MD) is a central computational tool in physics, chemistry, and biology, enabling quantitative prediction of experimental observables as expectations over high-dimensional molecular distributions such as Boltzmann distributions and transition densities. However, conventional MD is fundamentally limited by the high computational cost required to generate independent samples. Generative molecular dynamics (GenMD) has recently emerged as an alternative, learning surrogates of molecular distributions either from data or through interaction with energy models. While these methods enable efficient sampling, their transferability across molecular systems is often limited. In this work, we show that incorporating auxiliary sources of information can improve the data efficiency and generalization of transferable implicit transfer operators (TITO) for molecular dynamics. We find that coarse-grained TITO models are substantially more data-efficient than Boltzmann Emulators, and that incorporating protein language model (pLM) embeddings further improves out-of-distribution generalization. Our approach, PLaTITO, achieves state-of-the-art performance on equilibrium sampling benchmarks for out-of-distribution protein systems, including fast-folding proteins. We further study the impact of additional conditioning signals such as structural embeddings, temperature, and large-language-model-derived embeddings on model performance.

2602.11208 2026-06-01 cs.LG

Adaptive Physics Transformer with Fused Global-Local Attention for Subsurface Energy Systems

Xin Ju, Nok Hei Fung, Yuyan Zhang, Carl Jacquemyn, Matthew Jackson, Randolph Settgast, Sally M. Benson, Gege Wen

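AI summary: Proposes the Adaptive Physics Transformer (APT), which fuses a graph-based encoder with a global attention mechanism to handle heterogeneous meshes and coupled physics in subsurface energy systems; it outperforms existing architectures on regular and irregular grids and is the first to learn directly from high-resolution adaptive mesh refinement simulations.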

Abstract

The Earth's subsurface is a cornerstone of modern society, providing essential energy resources like hydrocarbons, geothermal, and minerals while serving as the primary reservoir for CO$_2$ sequestration. However, full-physics numerical simulations of these systems are notoriously computationally expensive due to geological heterogeneity, high resolution requirements, and the tight coupling of physical processes with distinct propagation time scales. Here we propose the Adaptive Physics Transformer (APT), a geometry-, mesh-, and physics-agnostic neural operator that explicitly addresses these challenges. APT fuses a graph-based encoder to extract high-resolution local heterogeneous features with a global attention mechanism to resolve long-range physical impacts. Our results demonstrate that APT outperforms state-of-the-art architectures in subsurface tasks across both regular and irregular grids with robust super-resolution capabilities. Notably, APT is the first architecture that learns directly from high-resolution adaptive mesh refinement simulations. We also demonstrate APT's favorable scaling behavior and cross-dataset learning capability, positioning it as a robust and scalable backbone for large-scale subsurface foundation model development.

2602.11137 2026-06-01 cs.LG cs.AI cs.CL

Weight Decay Improves Language Model Plasticity

权重衰减提升语言模型可塑性

Tessa Han, Sebastian Bordt, Hanlin Zhang, Sham Kakade

AI总结 本文通过系统实验表明,预训练中较大的权重衰减能提高模型的可塑性,使微调后下游性能更优,并揭示了其促进线性可分表示、正则化注意力矩阵和减少过拟合的机制。

详情
AI中文摘要

大型语言模型通常分两个主要阶段训练:预训练以产生基础模型,然后进一步训练以提高下游性能。然而,超参数优化和缩放定律主要从基础模型验证损失的角度研究,忽略了一个关键的模型属性:下游适应性。在这项工作中,我们从模型可塑性的角度研究预训练,即基础模型在额外训练后成功适应下游任务的能力。我们关注权重衰减的作用,这是预训练中的一个关键正则化参数,并通过系统实验表明,较大的权重衰减提高了预训练模型的可塑性,导致微调后下游性能提升更大。这种效应可能导致反直觉的权衡,即预训练后表现较差的基础模型在进一步训练后可能表现更好。对权重衰减对模型行为的机制影响的进一步研究表明,它鼓励线性可分的表示,正则化注意力矩阵,并减少对训练数据的过拟合。这些发现共同强调了预训练模型可塑性的重要性,使用交叉熵损失作为超参数优化的唯一指标的局限性,以及单个优化超参数在塑造模型行为中的多方面作用。

英文摘要

Large language models are typically trained in two broad phases: pretraining to produce a base model, followed by further training to improve downstream performance. However, hyperparameter optimization and scaling laws are studied primarily from the perspective of the base model's validation loss, overlooking a crucial model property: downstream adaptability. In this work, we study pretraining from the perspective of model plasticity, that is, the ability of the base model to successfully adapt to downstream tasks upon additional training. We focus on the role of weight decay, a key regularization parameter during pretraining, and show through systematic experiments that larger weight decay increases the plasticity of the pretrained model, resulting in greater performance gains downstream after fine-tuning. This effect can lead to counterintuitive trade-offs where base models that perform worse after pretraining can perform better after further training. Further investigation of weight decay's mechanistic effects on model behavior reveals that it encourages linearly separable representations, regularizes attention matrices, and reduces overfitting on the training data. Together, these findings highlight the importance of pretrained model plasticity, the limits of using cross-entropy loss as the sole metric for hyperparameter optimization, and the multifaceted role that a single optimization hyperparameter plays in shaping model behavior.

2602.11083 2026-06-01 cs.LG cs.CR

Token-Efficient Change Detection in LLM APIs

LLM API中的令牌高效变化检测

Timothée Chauvin, Clément Lalanne, Erwan Le Merrer, Jean-Michel Loubes, François Taïani, Gilles Tredan

AI总结 提出基于边界输入的黑盒变化检测方案B3IT,在仅观察输出令牌的条件下实现低成本、高性能的LLM变化检测。

Comments ICML 2026

详情
AI中文摘要

远程检测LLM中的变化是一个难题。现有方法要么在大规模部署时成本过高,要么需要初始的白盒访问模型权重或灰盒访问对数概率。我们的目标是实现低成本和严格的黑盒操作,仅观察输出令牌。我们的方法依赖于我们称为边界输入的特定输入,对于这些输入,存在多个输出顶部令牌。从统计角度来看,最优变化检测取决于模型的雅可比矩阵和输出分布的Fisher信息。在低温状态下分析这些量表明,边界输入能够实现强大的变化检测测试。基于这一见解,我们提出了黑盒边界输入跟踪(B3IT)方案。大量的体内和体外实验表明,对于非推理测试端点,边界输入很容易找到,并且性能与最佳可用的灰盒方法相当。与现有方法相比,B3IT将成本降低了30倍,同时在严格的黑盒设置中运行。

英文摘要

Remote change detection in LLMs is a difficult problem. Existing methods are either too expensive for deployment at scale, or require initial white-box access to model weights or grey-box access to log probabilities. We aim to achieve both low cost and strict black-box operation, observing only output tokens. Our approach hinges on specific inputs we call Border Inputs, for which there exists more than one output top token. From a statistical perspective, optimal change detection depends on the model's Jacobian and the Fisher information of the output distribution. Analyzing these quantities in low-temperature regimes shows that border inputs enable powerful change detection tests. Building on this insight, we propose the Black-Box Border Input Tracking (B3IT) scheme. Extensive in-vivo and in-vitro experiments show that border inputs are easily found for non-reasoning tested endpoints, and achieve performance on par with the best available grey-box approaches. B3IT reduces costs by $30\times$ compared to existing methods, while operating in a strict black-box setting.

2602.10809 2026-06-01 cs.CV cs.IR

DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories

DeepImageSearch: 多模态智能体在视觉历史中上下文感知图像检索的基准测试

Chenlong Deng, Mengjie Deng, Junjie Wu, Dun Zeng, Teng Wang, Qingsong Xie, Jiadeng Huang, Shengjie Ma, Changwang Zhang, Zhaoxiang Wang, Jun Wang, Yutao Zhu, Zhicheng Dou

AI总结 提出DeepImageSearch范式,将图像检索重构为自主探索任务,通过构建DISBench基准和模块化智能体框架,验证了在视觉历史中基于隐式上下文线索进行多步推理的必要性。

Comments 18 pages, 6 figures

详情
AI中文摘要

现有的多模态检索系统在语义匹配方面表现出色,但隐含地假设查询-图像相关性可以孤立地衡量。这种范式忽略了真实视觉流中固有的丰富依赖关系,其中信息分布在时间序列中,而不是局限于单个快照。为了弥合这一差距,我们引入了DeepImageSearch,一种新颖的智能体范式,将图像检索重构为自主探索任务。模型必须规划并在原始视觉历史上执行多步推理,以基于隐式上下文线索定位目标。我们构建了DISBench,一个基于互联视觉数据的具有挑战性的基准。为了解决创建上下文相关查询的可扩展性挑战,我们提出了一种人机协作流程,利用视觉语言模型挖掘潜在的时空关联,在人工验证之前有效地卸载密集的上下文发现。此外,我们使用一个配备细粒度工具和用于长程导航的双记忆系统的模块化智能体框架构建了一个稳健的基线。大量实验表明,DISBench对最先进的模型提出了重大挑战,突出了将智能体推理纳入下一代检索系统的必要性。

英文摘要

Existing multimodal retrieval systems excel at semantic matching but implicitly assume that query-image relevance can be measured in isolation. This paradigm overlooks the rich dependencies inherent in realistic visual streams, where information is distributed across temporal sequences rather than confined to single snapshots. To bridge this gap, we introduce DeepImageSearch, a novel agentic paradigm that reformulates image retrieval as an autonomous exploration task. Models must plan and perform multi-step reasoning over raw visual histories to locate targets based on implicit contextual cues. We construct DISBench, a challenging benchmark built on interconnected visual data. To address the scalability challenge of creating context-dependent queries, we propose a human-model collaborative pipeline that employs vision-language models to mine latent spatiotemporal associations, effectively offloading intensive context discovery before human verification. Furthermore, we build a robust baseline using a modular agent framework equipped with fine-grained tools and a dual-memory system for long-horizon navigation. Extensive experiments demonstrate that DISBench poses significant challenges to state-of-the-art models, highlighting the necessity of incorporating agentic reasoning into next-generation retrieval systems.

2602.10324 2026-06-01 cs.AI cs.CL cs.CY cs.HC

Discovering Differences in Strategic Behavior Between Humans and LLMs

发现人类与LLM在战略行为上的差异

Caroline Wang, Daniel Kasenberg, Kim Stachenfeld, Pablo Samuel Castro

AI总结 使用AlphaEvolve程序发现工具,从数据中直接发现可解释的人类和LLM行为模型,揭示在迭代石头剪刀布中前沿LLM比人类具有更深层次的战略行为。

Comments Accepted to ICML 2026

详情
AI中文摘要

随着大型语言模型(LLM)越来越多地部署在社交和战略场景中,了解它们的行为在何处以及为何与人类行为产生差异变得至关重要。虽然行为博弈论(BGT)为分析行为提供了框架,但现有模型并未完全捕捉到人类或像LLM这样的黑箱非人类代理的独特行为。我们采用AlphaEvolve这一前沿程序发现工具,直接从数据中发现可解释的人类和LLM行为模型,从而能够开放式地发现驱动人类和LLM行为的结构因素。我们对迭代石头剪刀布的分析表明,前沿LLM可能比人类具有更深层次的战略行为。这些结果为理解驱动人类和LLM在战略互动中行为差异的结构性差异奠定了基础。

英文摘要

As Large Language Models (LLMs) are increasingly deployed in social and strategic scenarios, it becomes critical to understand where and why their behavior diverges from that of humans. While behavioral game theory (BGT) provides a framework for analyzing behavior, existing models do not fully capture the idiosyncratic behavior of humans or black-box, non-human agents like LLMs. We employ AlphaEvolve, a cutting-edge program discovery tool, to directly discover interpretable models of human and LLM behavior from data, thereby enabling open-ended discovery of structural factors driving human and LLM behavior. Our analysis on iterated rock-paper-scissors reveals that frontier LLMs can be capable of deeper strategic behavior than humans. These results provide a foundation for understanding structural differences driving differences in human and LLM behavior in strategic interactions.

2602.10286 2026-06-01 cs.LG

What Does Preference Learning Recover from Pairwise Comparison Data?

成对比较数据中的偏好学习恢复了什么?

Rattana Pukdee, Maria-Florina Balcan, Pradeep Ravikumar

AI总结 本文通过条件偏好分布(CPRD)形式化成对比较数据中的偏好信息,分析了Bradley-Terry模型在数据违反假设时的恢复能力,并揭示了影响样本效率的关键因素(边界和连通性)。

详情
Journal ref
ICML 2026
AI中文摘要

成对偏好学习是机器学习的核心,最近应用于将语言模型与人类偏好对齐。典型数据集由三元组 $(x, y^+, y^-)$ 组成,其中对于上下文 $x$,响应 $y^+$ 优于响应 $y^-$。Bradley-Terry (BT) 模型是主要方法,将偏好概率建模为潜在得分差异的函数。标准实践假设数据遵循此模型,并相应地学习潜在得分。然而,真实数据可能违反这一假设,目前尚不清楚 BT 学习在这种情况下恢复了什么。从三元组比较数据出发,我们通过条件偏好分布 (CPRD) 形式化其编码的偏好信息。我们给出了 BT 适用于建模 CPRD 的精确条件,并确定了影响样本效率的因素——即边界和连通性。这些结果共同为理解偏好学习实际恢复了什么提供了以数据为中心的基础。

英文摘要

Pairwise preference learning is central to machine learning, with recent applications in aligning language models with human preferences. A typical dataset consists of triplets $(x, y^+, y^-)$, where response $y^+$ is preferred over response $y^-$ for context $x$. The Bradley--Terry (BT) model is the predominant approach, modeling preference probabilities as a function of latent score differences. Standard practice assumes data follows this model and learns the latent scores accordingly. However, real data may violate this assumption, and it remains unclear what BT learning recovers in such cases. Starting from triplet comparison data, we formalize the preference information it encodes through the conditional preference distribution (CPRD). We give precise conditions for when BT is appropriate for modeling the CPRD, and identify factors governing sample efficiency -- namely, margin and connectivity. Together, these results offer a data-centric foundation for understanding what preference learning actually recovers.
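As a brief illustration of the Bradley-Terry model mentioned in the abstract above, the following minimal sketch computes the preference probability as a logistic function of the latent score difference and the corresponding average negative log-likelihood. The function names and the example scores are hypothetical and are not taken from the paper.

```python
import math


def bt_preference_prob(score_preferred: float, score_rejected: float) -> float:
    """Standard Bradley-Terry form: P(y+ beats y-) = sigmoid(s(y+) - s(y-))."""
    diff = score_preferred - score_rejected
    return 1.0 / (1.0 + math.exp(-diff))


def bt_negative_log_likelihood(score_pairs) -> float:
    """Average negative log-likelihood over (s(y+), s(y-)) score pairs."""
    total = sum(math.log(bt_preference_prob(a, b)) for a, b in score_pairs)
    return -total / len(score_pairs)


# Hypothetical latent scores for three triplets (x, y+, y-).
example_pairs = [(2.0, 0.5), (1.0, 1.0), (0.2, 1.4)]
example_loss = bt_negative_log_likelihood(example_pairs)
```

Only the score difference enters the probability, so adding a constant to all scores for a given context leaves the model's predictions unchanged.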

2602.07721 2026-06-01 cs.LG cs.CL cs.DB

ParisKV: Fast and Drift-Robust KV-Cache Retrieval for Long-Context LLMs

ParisKV:面向长上下文LLM的快速且漂移鲁棒的KV缓存检索

Yanlin Qi, Xinhang Chen, Huiqiang Jiang, Qitong Wang, Botao Peng, Themis Palpanas

AI总结 提出基于碰撞候选选择和量化内积重排序的GPU原生KV缓存检索框架ParisKV,在百万token上下文中实现低延迟、高吞吐且分布漂移鲁棒的检索,性能优于或持平全注意力。

Comments Accepted to the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

KV缓存检索对于长上下文LLM推理至关重要,但现有方法在处理大规模分布漂移和高延迟时存在困难。我们提出ParisKV,一种基于碰撞候选选择、随后使用量化内积重排序估计器的漂移鲁棒、GPU原生的KV缓存检索框架。对于百万token上下文,ParisKV通过统一虚拟寻址(UVA)支持CPU卸载的KV缓存,实现按需的top-$k$获取,开销极小。ParisKV在长输入和长生成基准测试中匹配或超越全注意力质量。它实现了最先进的长上下文解码效率:即使在长上下文的批大小为1时,也能匹配或超过全注意力速度;在全注意力可运行范围内提供高达2.8倍的吞吐量;并扩展到全注意力内存不足的百万token上下文。在百万token规模下,与两个最先进的KV缓存Top-$k$检索基线MagicPIG和PQCache相比,ParisKV分别将解码延迟降低了17倍和44倍。代码可在https://github.com/amy-77/ParisKV/tree/main获取。

英文摘要

KV-cache retrieval is essential for long-context LLM inference, yet existing methods struggle with distribution drift and high latency at scale. We introduce ParisKV, a drift-robust, GPU-native KV-cache retrieval framework based on collision-based candidate selection, followed by a quantized inner-product reranking estimator. For million-token contexts, ParisKV supports CPU-offloaded KV caches via Unified Virtual Addressing (UVA), enabling on-demand top-$k$ fetching with minimal overhead. ParisKV matches or outperforms full attention quality on long-input and long-generation benchmarks. It achieves state-of-the-art long-context decoding efficiency: it matches or exceeds full attention speed even at batch size 1 for long contexts, delivers up to 2.8$\times$ higher throughput within full attention's runnable range, and scales to million-token contexts where full attention runs out of memory. At million-token scale, ParisKV reduces decode latency by 17$\times$ and 44$\times$ compared to MagicPIG and PQCache, respectively, two state-of-the-art KV-cache Top-$k$ retrieval baselines. Code is available at https://github.com/amy-77/ParisKV/tree/main.

2602.09276 2026-06-01 cs.CL cs.AI cs.LG

Effective Reasoning Chains Reduce Intrinsic Dimensionality

有效推理链降低内在维度

Archiki Prasad, Mandar Joshi, Kenton Lee, Mohit Bansal, Peter Shaw

AI总结 本文通过内在维度量化推理链有效性,发现有效推理策略能降低任务内在维度,并在GSM8K上验证其与泛化性能的强负相关。

Comments ICML (spotlight) camera-ready; 22 pages, 3 figures

详情
AI中文摘要

思维链推理及其变体显著提升了语言模型在复杂推理任务上的性能,但不同策略促进泛化的精确机制仍不明确。虽然当前解释常指向增加测试时计算或结构引导,但建立这些因素与泛化之间一致、可量化的联系仍具挑战。本文中,我们将内在维度识别为表征推理链有效性的定量度量。内在维度量化了在给定任务上达到特定准确率阈值所需的最小模型维度数。通过固定模型架构并改变不同推理策略下的任务表述,我们证明有效推理策略持续降低任务的内在维度。在GSM8K上使用Gemma-3 1B和4B验证这一点,我们观察到推理策略的内在维度与其在分布内和分布外数据上的泛化性能之间存在强负相关。我们的发现表明,有效推理链通过使用更少参数更好地压缩任务来促进学习,为分析推理过程提供了新的定量度量。

英文摘要

Chain-of-thought (CoT) reasoning and its variants have substantially improved the performance of language models on complex reasoning tasks, yet the precise mechanisms by which different strategies facilitate generalization remain poorly understood. While current explanations often point to increased test-time computation or structural guidance, establishing a consistent, quantifiable link between these factors and generalization remains challenging. In this work, we identify intrinsic dimensionality as a quantitative measure for characterizing the effectiveness of reasoning chains. Intrinsic dimensionality quantifies the minimum number of model dimensions needed to reach a given accuracy threshold on a given task. By keeping the model architecture fixed and varying the task formulation through different reasoning strategies, we demonstrate that effective reasoning strategies consistently reduce the intrinsic dimensionality of the task. Validating this on GSM8K with Gemma-3 1B and 4B, we observe a strong inverse correlation between the intrinsic dimensionality of a reasoning strategy and its generalization performance on both in-distribution and out-of-distribution data. Our findings suggest that effective reasoning chains facilitate learning by better compressing the task using fewer parameters, offering a new quantitative metric for analyzing reasoning processes.
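The definition of intrinsic dimensionality given in the abstract above (the minimum number of model dimensions needed to reach an accuracy threshold) can be sketched as a simple search over measured accuracies. This is only an illustration of the definition: the function name, the strategy labels, and all accuracy values below are hypothetical and not results from the paper.

```python
def intrinsic_dimensionality(accuracy_by_dim: dict, threshold: float):
    """Return the smallest dimension whose accuracy reaches the threshold.

    accuracy_by_dim maps a number of trainable dimensions to the accuracy
    obtained when training is restricted to that many dimensions.
    Returns None if no tested dimension reaches the threshold.
    """
    reaching = [d for d, acc in accuracy_by_dim.items() if acc >= threshold]
    return min(reaching) if reaching else None


# Hypothetical accuracy curves for two reasoning strategies.
direct_answer = {100: 0.10, 1_000: 0.18, 10_000: 0.31, 100_000: 0.42}
with_reasoning = {100: 0.22, 1_000: 0.41, 10_000: 0.55, 100_000: 0.60}

dim_direct = intrinsic_dimensionality(direct_answer, threshold=0.40)
dim_reasoning = intrinsic_dimensionality(with_reasoning, threshold=0.40)
```

In this made-up example the strategy with reasoning chains reaches the threshold at a smaller dimension, which is the kind of comparison the abstract describes.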

2602.08964 2026-06-01 cs.LG cs.AI cs.CL cs.CY

A Behavioural and Representational Evaluation of Goal-Directedness in Language Model Agents

语言模型智能体中目标导向性的行为与表征评估

Raghu Arghal, Fade Chen, Niall Dalton, Evgenii Kortukov, Calum McNamara, Angelos Nalmpantis, Moksh Nirvaan, Gabriele Sarti, Mario Giulianelli

AI总结 本文提出一种结合行为评估与内部表征可解释性分析的目标导向性评估框架,并以LLM智能体在2D网格世界中的导航为例,验证了其行为与表征的一致性。

Comments Proceedings of the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

理解智能体的目标有助于解释和预测其行为,但目前尚无可靠的方法来归因智能系统的目标。我们提出一个评估目标导向性的框架,该框架将行为评估与基于可解释性的模型内部表征分析相结合。作为案例研究,我们考察了一个在二维网格世界中导航至目标状态的LLM智能体。在行为上,我们评估智能体在不同网格大小、障碍物密度和目标结构下的最优策略,发现其性能随任务难度扩展,同时对保持难度的变换和多目标结构具有鲁棒性。然后,我们使用探测方法解码环境及多步行动计划的内部表征。我们发现,LLM智能体非线性地编码了一个粗略的空间地图,保留了关于其位置和目标位置的任务相关近似线索;其行动与这些内部表征大致一致;推理过程重新组织这些表征,从空间线索转向即时行动选择。我们的研究结果支持这样的观点:除了行为评估之外,还需要内省检查来表征智能体如何表示和追求其目标。

英文摘要

Understanding an agent's goals helps explain and predict its behaviour, yet there is no established methodology for reliably attributing goals to agentic systems. We propose a framework for evaluating goal-directedness that integrates behavioural evaluation with interpretability-based analyses of models' internal representations. As a case study, we examine an LLM agent navigating a 2D grid world towards a goal state. Behaviourally, we evaluate the agent against optimal policies across varying grid sizes, obstacle densities, and goal structures, finding that performance scales with task difficulty while remaining robust to difficulty-preserving transformations and multi-goal structures. We then use probing methods to decode internal representations of the environment and multi-step action plans. We find that the LLM agent non-linearly encodes a coarse spatial map, preserving approximate task-relevant cues about its position and the goal location; that its actions are broadly consistent with these internal representations; and that reasoning reorganises them, shifting from spatial cues towards immediate action selection. Our findings support the view that introspective examination is required beyond behavioural evaluations to characterise how agents represent and pursue their objectives.

2602.08267 2026-06-01 cs.LG cs.AI

Inverting Data Transformations via Diffusion Sampling

通过扩散采样逆变换数据变换

Jinwoo Kim, Sékou-Oumar Kaba, Jiyun Park, Seunghoon Hong, Siamak Ravanbakhsh

AI总结 提出一种在一般李群上通过扩散采样逆变换未知变换的方法,用于恢复原始数据分布,并在测试时等变性应用中提升预训练神经网络的鲁棒性。

Comments 31 pages, 11 figures

详情
AI中文摘要

我们研究了一般李群上的变换逆问题:一个数据被未知群元素变换,目标是恢复一个逆变换,将其映射回原始数据分布。这种未知变换在机器学习和科学建模中广泛出现,会显著扭曲观测数据。我们采用概率视角,将变换的后验建模为玻尔兹曼分布,由数据空间上的能量函数定义。为了从该后验中采样,我们引入了一个李群上的扩散过程,该过程保持所有更新在流形上,并且仅需在关联的李代数中进行计算。我们的方法,即变换逆能量扩散(TIED),依赖于一个新的平凡化目标分数恒等式,能够高效地对变换后验进行基于分数的采样。作为一个关键应用,我们专注于测试时等变性,其目标是提高预训练神经网络对输入变换的鲁棒性。在图像单应性和PDE对称性上的实验表明,TIED可以在测试时将变换后的输入恢复到训练分布,表现出优于强规范化和采样基线的性能。代码可在 https://github.com/jw9730/tied 获取。

英文摘要

We study the problem of transformation inversion on general Lie groups: a datum is transformed by an unknown group element, and the goal is to recover an inverse transformation that maps it back to the original data distribution. Such unknown transformations arise widely in machine learning and scientific modeling, where they can significantly distort observations. We take a probabilistic view and model the posterior over transformations as a Boltzmann distribution defined by an energy function on the data space. To sample from this posterior, we introduce a diffusion process on Lie groups that keeps all updates on-manifold and only requires computations in the associated Lie algebra. Our method, Transformation-Inverting Energy Diffusion (TIED), relies on a new trivialized target-score identity that enables efficient score-based sampling of the transformation posterior. As a key application, we focus on test-time equivariance, where the objective is to improve the robustness of pretrained neural networks to input transformations. Experiments on image homographies and PDE symmetries demonstrate that TIED can restore transformed inputs to the training distribution at test time, showing improved performance over strong canonicalization and sampling baselines. Code is available at https://github.com/jw9730/tied.

2506.00175 2026-06-01 cs.LG cs.AI

Who Gets Credit or Blame? Attributing Accountability in Modern AI Systems

谁获得功劳或责备?在现代AI系统中分配责任

Shichang Zhang, Hongzhe Du, Jiaqi W. Ma, Himabindu Lakkaraju

AI总结 提出一个归因框架,通过反事实问题量化模型开发各阶段(预训练、微调等)对最终行为的影响,并设计无需重训练的估计器,成功识别并移除多阶段任务中的虚假关联。

详情
AI中文摘要

现代AI系统通常通过多个阶段开发——预训练、微调轮次以及后续的适应或对齐,每个阶段都建立在先前阶段之上并以不同方式更新模型。这引发了一个关键的责任问题:当部署的模型成功或失败时,哪个阶段负责,以及负责到什么程度?我们提出了责任归因问题,用于将模型行为追溯到模型开发过程的特定阶段。为了解决这一挑战,我们提出了一个通用框架,回答关于阶段效应的反事实问题:如果特定阶段的更新没有发生,模型的行为会如何改变?在此框架内,我们引入了无需重新训练模型即可高效量化阶段效应的估计器,考虑了数据和模型优化动态的关键方面,包括学习率调度、动量和权重衰减。我们证明了我们的方法成功量化了每个阶段对模型行为的责任。基于归因结果,我们的方法可以识别并移除在图像分类和文本毒性检测任务中跨多个阶段开发时学到的虚假相关性。我们的方法为模型分析提供了实用工具,并代表了向更负责任的AI发展迈出的重要一步。

英文摘要

Modern AI systems are typically developed through multiple stages (pretraining, fine-tuning rounds, and subsequent adaptation or alignment), where each stage builds on the previous ones and updates the model in distinct ways. This raises a critical question of accountability: when a deployed model succeeds or fails, which stage is responsible, and to what extent? We pose the accountability attribution problem for tracing model behavior back to specific stages of the model development process. To address this challenge, we propose a general framework that answers counterfactual questions about stage effects: how would the model's behavior have changed if the updates from a particular stage had not occurred? Within this framework, we introduce estimators that efficiently quantify stage effects without retraining the model, accounting for both the data and key aspects of model optimization dynamics, including learning rate schedules, momentum, and weight decay. We demonstrate that our approach successfully quantifies the accountability of each stage to the model's behavior. Based on the attribution results, our method can identify and remove spurious correlations learned during image classification and text toxicity detection tasks that were developed across multiple stages. Our approach provides a practical tool for model analysis and represents a significant step toward more accountable AI development.

2602.07928 2026-06-01 cs.LG cs.AI

A Kinetic Energy Perspective of Flow Matching

流匹配的动能视角

Ziyun Li, Huancheng Hu, Soon Hoe Lim, Xuyu Li, Fei Gao, Enmao Diao, Zezhen Ding, Michalis Vazirgiannis, Henrik Bostrom

AI总结 本文引入动能路径能量(KPE)作为流匹配生成模型的诊断工具,发现其与语义保真度和数据稀疏性相关,并基于此提出无训练的动能轨迹塑形(KTS)策略以改善生成质量。

Comments ICML 2026 Spotlight

详情
AI中文摘要

基于流的生成模型可以通过物理视角来审视:采样通过积分学习到的速度场将粒子从噪声传输到数据,每个样本对应一条具有自身动力学努力的轨迹。受经典力学启发,我们引入了动能路径能量(KPE),这是一种类似作用量的每样本诊断指标,用于测量沿常微分方程(ODE)轨迹累积的动能努力。实验上,KPE表现出两种稳健的对应关系:(i) 较高的KPE预测更强的语义保真度;(ii) 高KPE轨迹落在稀疏表示区域。我们进一步提供了将轨迹能量与数据稀疏性联系起来的理论保证。矛盾的是,这种相关性是非单调的。在足够高的能量下,生成可能退化为记忆。利用经验流匹配的闭式公式,我们表明极端能量驱动轨迹接近训练样本的副本。这产生了金发姑娘原则,并激发了动能轨迹塑形(KTS),一种无训练的两阶段推理策略,该策略增强早期运动并强制执行后期软着陆,从而减少记忆并提高基准任务上的生成质量。

英文摘要

Flow-based generative models can be viewed through a physics lens: sampling transports a particle from noise to data by integrating a learned velocity field, and each sample corresponds to a trajectory with its own dynamical effort. Motivated by classical mechanics, we introduce Kinetic Path Energy (KPE), an action-like, per-sample diagnostic that measures the accumulated kinetic effort along an ordinary differential equation (ODE) trajectory. Empirically, KPE exhibits two robust correspondences: (i) higher KPE predicts stronger semantic fidelity; (ii) high-KPE trajectories land in sparse representation regions. We further provide theoretical guarantees linking trajectory energy to data sparsity. Paradoxically, this correlation is non-monotonic. At sufficiently high energy, generation can degenerate into memorization. Leveraging the closed-form formula of empirical flow matching, we show that extreme energies drive trajectories toward near-copies of training examples. This yields a Goldilocks principle and motivates Kinetic Trajectory Shaping (KTS), a training-free two-phase inference strategy that boosts early motion and enforces a late-time soft landing, reducing memorization and improving generation quality across benchmark tasks.
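The abstract above does not give a formula for Kinetic Path Energy. As a rough sketch, one plausible reading of "accumulated kinetic effort along an ODE trajectory" is the time integral of half the squared velocity norm, accumulated while integrating the velocity field. That reading is an assumption, as are the function name, the Euler integrator, and the constant toy velocity field used below.

```python
def kinetic_path_energy(velocity, x0, n_steps=100):
    """Euler-integrate dx/dt = velocity(x, t) on [0, 1] and accumulate
    0.5 * ||v||^2 * dt along the path (assumed form of the diagnostic).

    Returns the accumulated energy and the final point of the trajectory.
    """
    dt = 1.0 / n_steps
    x = list(x0)
    energy = 0.0
    for step in range(n_steps):
        t = step * dt
        v = velocity(x, t)
        energy += 0.5 * sum(vi * vi for vi in v) * dt
        x = [xi + vi * dt for xi, vi in zip(x, v)]
    return energy, x


def toy_velocity(x, t):
    """Hypothetical stand-in for a learned velocity field: constant drift."""
    return [3.0, 4.0]


energy, x_end = kinetic_path_energy(toy_velocity, x0=[0.0, 0.0])
```

For this constant field the speed is 5 throughout, so the accumulated value is approximately 0.5 * 25 = 12.5 and the trajectory ends near (3, 4). A learned model would replace `toy_velocity` with its own velocity network.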

2602.07905 2026-06-01 cs.AI

MedCoG: Maximizing LLM Inference Density in Medical Reasoning via Meta-Cognitive Regulation

MedCoG:通过元认知调节最大化医学推理中的LLM推理密度

Yu Zhao, Hao Guan, Yongcheng Jing, Ying Zhang, Dacheng Tao

AI总结 提出MedCoG框架,利用元认知评估动态调节知识使用,以缓解推理扩展定律下的收益递减,提升推理效率与准确性。

Comments Accepted by ICML 2026

详情
AI中文摘要

大型语言模型(LLM)在复杂医学推理中展现出强大潜力,但在推理扩展定律下面临收益递减。现有研究通过添加各种知识类型增强LLM,但额外成本转化为准确性的效果尚不明确。本文探索LLM的元认知(即对其自身认知状态的自我评估)如何调节推理过程。具体而言,我们提出MedCoG,一种带有知识图谱的医学元认知智能体,其中对任务复杂性、熟悉度和知识密度的元认知评估动态调节程序性、情景性和事实性知识的利用。这种以LLM为中心的按需推理旨在通过(1)避免无差别扩展以降低成本,(2)过滤干扰知识以提高准确性,来缓解扩展定律下的收益递减。为验证这一点,我们经验性地刻画了扩展曲线,并引入推理密度来量化推理效率。实验表明MedCoG在五个医学基准困难集上的有效性和高效性,实现了6.2倍的推理密度。此外,Oracle研究凸显了元认知调节的巨大潜力。

英文摘要

Large Language Models (LLMs) have shown strong potential in complex medical reasoning yet face diminishing gains under inference scaling laws. While existing studies augment LLMs with various knowledge types, it remains unclear how effectively the additional costs translate into accuracy. In this paper, we explore how meta-cognition of LLMs, i.e., their self-assessment of their own cognitive states, can regulate the reasoning process. Specifically, we propose MedCoG, a Medical Meta-Cognition Agent with Knowledge Graph, where the meta-cognitive assessments of task complexity, familiarity, and knowledge density dynamically regulate utilization of procedural, episodic, and factual knowledge. The LLM-centric on-demand reasoning aims to mitigate the diminishing returns under scaling laws by (1) reducing costs via avoiding indiscriminate scaling and (2) improving accuracy via filtering out distractive knowledge. To validate this, we empirically characterize the scaling curve and introduce inference density to quantify inference efficiency. Experiments demonstrate the effectiveness and efficiency of MedCoG on five hard sets of medical benchmarks, yielding 6.2x inference density. Furthermore, the Oracle study highlights the significant potential of meta-cognitive regulation.

2602.07864 2026-06-01 cs.CV

Thinking in Structures: Evaluating Spatial Intelligence in Constraint-Governed Spaces

在结构中思考:评估约束空间中的空间智能

Chen Yang, Guanxin Lin, Youquan He, Peiyao Chen, Guanghe Liu, Yufan Mo, Zhouyuan Xu, Linhao Wang, Guohui Zhang, Zihang Zhang, Shenxiang Zeng, Chen Wang, Jiansheng Fan

AI总结 提出SSI-Bench基准,通过结构约束下的空间推理任务评估视觉语言模型的空间智能,发现模型与人类存在巨大差距。

Comments ICML 2026, Project Page: https://ssi-bench.github.io

详情
AI中文摘要

空间智能对视觉语言模型(VLM)至关重要,然而许多以场景为中心的基准评估的是无约束环境,其中单个图像可能允许多种合理的3D解释。我们引入了SSI-Bench,一个用于约束空间中结构中心空间推理(SCSR)的VQA基准。它基于复杂的真实世界3D结构,利用几何、拓扑和物理可行性方面的结构约束,使组件关系从视觉证据中更加确定。该基准包含1000个涵盖几何和拓扑推理的排序问题,其中正确的排序需要解决所有候选对象的3D关系,对空间理解提出了更强的要求。它通过完全以人为中心的流程创建,包括超过400研究员小时的图像整理、组件标注和问题设计。评估31个VLM揭示了与人类的巨大差距:最好的开源模型达到22.2%的准确率,最强的闭源模型达到33.6%,而人类得分为91.6%。进一步的结果表明,思维链推理仅带来微小的提升,错误分析揭示了当前模型在约束空间中空间理解的根本局限性。项目页面:https://ssi-bench.github.io。

英文摘要

Spatial intelligence is crucial for vision--language models (VLMs), yet many scene-centric benchmarks evaluate unconstrained environments where a single image may admit multiple plausible 3D interpretations. We introduce SSI-Bench, a VQA benchmark for Structure-Centric Spatial Reasoning (SCSR) in constraint-governed spaces. Built from complex real-world 3D structures, it uses structural constraints from geometry, topology, and physical feasibility to make component relations more determinate from visual evidence. The benchmark contains 1,000 ranking questions spanning geometric and topological reasoning, where correct ordering requires resolving all candidate-wise 3D relations, imposing stronger demands on spatial understanding. It is created through a fully human-centered pipeline with over 400 researcher-hours of image curation, component annotation, and question design. Evaluating 31 VLMs reveals a large gap to humans: the best open-source model achieves 22.2% accuracy and the strongest closed-source model reaches 33.6%, while humans score 91.6%. Further results show that chain-of-thought reasoning brings only marginal gains, and error analysis reveals fundamental limitations in current models' spatial understanding within constraint-governed spaces. Project page: https://ssi-bench.github.io.