arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.15864 2026-05-28 cs.CV cs.CL

Are VLMs Seeing or Just Saying? Uncovering the Illusion of Visual Re-examination

Chufan Shi, Cheng Yang, Yaokang Wu, Linghao Jin, Bo Shui, Taylor Berg-Kirkpatrick, Xuezhe Ma

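AI summary Using the image-swap probing framework VisualSwap and the 800-pair benchmark VS-Bench, the authors find that when vision-language models claim to "re-check the image" during reasoning, the statement is mostly a textual pattern rather than genuine visual re-examination. Thinking models are more susceptible, and user instructions restore visual grounding whereas self-reflection does not.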

Comments ICML 2026 Oral

English abstract

Vision-Language Models (VLMs) often produce self-reflective statements like "let me check the figure again" during reasoning. Do such statements trigger genuine visual re-examination, or are they merely learned textual patterns? We investigate this via VisualSwap, an image-swap probing framework: after a model reasons over an image, we replace it with a visually similar but semantically different one and test whether the model notices. We introduce VS-Bench, 800 image pairs curated from MathVista, MathVerse, MathVision, and MMMU-Pro. Experiments on Qwen3-VL, Kimi-VL, and ERNIE-VL reveal a striking failure: models overwhelmingly miss the swap, with accuracy dropping by up to 60%. Counterintuitively, thinking models are nearly 3x more vulnerable than their instructed counterparts, and scaling offers no mitigation. Multi-turn user instructions restore visual grounding, but self-generated reflective statements during continuous generation do not. Attention analysis explains why: user instructions substantially elevate attention to visual tokens, whereas self-reflection does not. Current VLMs tend to say rather than actually see when claiming to perform visual re-examination. Our code and dataset are available at the project page: https://visualswap.github.io
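
The swap protocol described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Model` callable interface, the `SwapPair` fields, the image file names, and both toy models are hypothetical stand-ins introduced here only to show how a "noticed" outcome differs from a "missed" one.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SwapPair:
    original_image: str   # opaque image handle (hypothetical)
    swapped_image: str    # visually similar, semantically different image
    original_answer: str
    swapped_answer: str


# Hypothetical model interface: (current_image, reasoning_so_far) -> final answer.
Model = Callable[[str, str], str]


def probe(model: Model, pair: SwapPair, reasoning_so_far: str) -> str:
    """Swap the image after the first reasoning pass and classify the final answer."""
    final = model(pair.swapped_image, reasoning_so_far)
    if final == pair.swapped_answer:
        return "noticed"
    if final == pair.original_answer:
        return "missed"
    return "other"


# Toy ground truth and toy models, used only to exercise the probe.
GROUND_TRUTH = {"chart_a.png": "12", "chart_b.png": "21"}


def saying_model(image: str, reasoning_so_far: str) -> str:
    # Ignores the image and repeats the last token of its own earlier reasoning.
    return reasoning_so_far.split()[-1]


def seeing_model(image: str, reasoning_so_far: str) -> str:
    # Re-reads whichever image is currently in context.
    return GROUND_TRUTH[image]


pair = SwapPair("chart_a.png", "chart_b.png", "12", "21")
trace = "Let me check the figure again. The bar reads 12"
print(probe(saying_model, pair, trace))  # missed
print(probe(seeing_model, pair, trace))  # noticed
```

A benchmark-level score would then be the fraction of pairs labelled "missed"; how the paper aggregates outcomes into its accuracy drop is not specified in the abstract.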

2605.15523 2026-05-28 cs.CV

Self-Prompting Diffusion Transformer for Open-Vocabulary Scene Text Editing via In-Context Learning

Hongxi Li, Tong Wang, Chengjing Wu, Tianbao Liu, Jiangtao Yao, Xiaochao Qu, Xinxiao Wu, Luoqi Liu, Ting Liu

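AI summary Proposes a self-prompting scene text editing method that constructs style and glyph prompts and uses the in-context learning capability of a multi-modal diffusion transformer to achieve open-vocabulary, style-consistent text editing.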

Comments ICML 2026

English abstract

Scene text editing aims to modify text in a target region of an image while preserving surrounding background style and texture. Existing methods rely solely on image background information while neglecting the visual details of target regions, which discards stylistic features in the original text and essentially degrades the task to text rendering. Moreover, the conditions imposed by a pre-trained glyph encoder limit the scope of editable text. To address these issues, this paper proposes a self-prompting scene text editing method that constructs style and glyph prompts directly from the original image, without introducing additional style or glyph encoders. We employ a two-stage training strategy: the diffusion transformer is first trained on large-scale self-supervised data and then refined using a small set of paired images. By leveraging the in-context learning capability of the Multi-Modal Diffusion Transformer (MM-DiT), the method achieves open-vocabulary and style-consistent text editing. Experimental results on various languages demonstrate that our method achieves state-of-the-art performance in both text accuracy and style consistency. Our project page: hongxiii.github.io/mstedit.

2605.15250 2026-05-28 cs.LG cs.AI

GQLA: Group-Query Latent Attention for Hardware-Adaptive Large Language Model Decoding

Fanxu Meng

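AI summary Proposes GQLA, a minimal modification of MLA that exposes two equivalent decoding paths (MQA and GQA), so that a single set of weights reaches optimal performance on different hardware (such as H100 and H20) and supports zero-redundancy tensor parallelism. TransGQLA converts pretrained GQA checkpoints into GQLA models and substantially compresses the KV cache.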

Comments https://github.com/MuLabPKU/TransArch

English abstract

Multi-head Latent Attention (MLA), the attention used in DeepSeek-V2/V3, jointly compresses keys and values into a low-rank latent and matches the H100 roofline almost perfectly. Its trained weights, however, expose only one decoding path - an absorbed MQA form - which ties efficient inference to H100-class compute-bandwidth ratios, forfeits tensor parallelism along the head axis, and yields no Multi-Token Prediction (MTP) gain on commodity inference GPUs such as the export-restricted H20. We propose Group-Query Latent Attention (GQLA), a minimal modification of MLA whose trained weights expose two algebraically equivalent decoding paths over the same parameters: an MQA-absorb path identical to MLA's, and a GQA path with a per-group expanded cache. The runtime picks the path that matches the target hardware - no retraining, no custom kernels - so a single set of GQLA weights pins the rooflines of both H100 (MQA-absorb, s_q=1) and H20 (GQA + MTP, s_q=2), while supporting up to 8-way zero-redundancy tensor parallelism on the GQA path. To avoid pretraining from scratch we extend TransMLA into TransGQLA, which converts a pretrained GQA checkpoint into a GQLA model; on LLaMA-3-8B it compresses the per-token KV cache to 28.125% of the GQA baseline on the MQA-absorb path while structurally preserving GQA-level traffic on the per-group path.
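
The 28.125% figure can be checked with a few lines of arithmetic. The LLaMA-3-8B attention shape (8 key-value heads of dimension 128) comes from the public model configuration. The latent layout is an assumption: a 512-dimensional compressed latent plus a 64-dimensional decoupled RoPE key, as in DeepSeek-V2's MLA. The abstract does not state these sizes, so this is one decomposition that reproduces the reported ratio, not a confirmed description of TransGQLA.

```python
# LLaMA-3-8B grouped-query attention shape (public model configuration).
num_kv_heads = 8
head_dim = 128

# GQA baseline: one key and one value vector per KV head, per token, per layer.
gqa_cache_per_token = 2 * num_kv_heads * head_dim  # 2048 elements

# ASSUMED latent layout (DeepSeek-V2-style MLA sizes, not given in the abstract).
latent_dim = 512  # compressed joint key-value latent
rope_dim = 64     # decoupled rotary-position key
latent_cache_per_token = latent_dim + rope_dim  # 576 elements

ratio = latent_cache_per_token / gqa_cache_per_token
print(f"{ratio:.5%}")  # 28.12500%
```

Under these assumed sizes the MQA-absorb path stores 576 elements per token per layer instead of 2048.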

2605.14809 2026-05-28 cs.LG

GFMate: Empowering Graph Foundation Models with Test-time Prompt Tuning

Yan Jiang, Ruihong Qiu, Zi Huang

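AI summary Proposes GFMate, a pre-training-agnostic test-time graph prompt tuning method. Centroid and layer prompts avoid entanglement with source domains and pre-training strategies, and a test-time complementary learning objective exploits both labelled and unlabelled target-domain data, yielding improvements of up to 30.63% on 12 benchmark datasets.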

English abstract

Graph prompt tuning has shown great potential in graph learning by introducing trainable prompts to enhance the model performance in conventional single-domain scenarios. Recent research has extended graph prompts to improve Graph Foundation Models (GFMs) by few-shot tuning auxiliary prompts. Despite their progress, most existing methods embed source-domain information into prompts, which either serve as input to GFMs or are encoded during model pre-training. Such prompt entanglement with specific source domains and GFM pre-training strategies restricts their generalisability to other domains and different GFMs. Furthermore, existing GFM prompts merely rely on few-shot tuning for adaptation, neglecting the rich information in unlabelled target domain test data. Motivated by these insights, this paper aims to empower GFMs with pre-training-agnostic test-time graph prompt tuning, named GFMate. GFMate introduces centroid and layer prompts applied after pre-training on target domains, avoiding entanglement with specific source domains and model pre-training. In addition, a test-time complementary learning objective is devised to exploit both labelled and unlabelled target domain data for effective test-time prompt tuning. Extensive experiments on 12 benchmark datasets demonstrate the superior performance and efficiency of GFMate, achieving improvements of up to 30.63%. Code is available at https://github.com/YanJiangJerry/GFMate.

2605.14284 2026-05-28 cs.LG

Smooth Multi-Policy Causal Effect Estimation in Longitudinal Settings

Wenxin Chen, Weishen Pan, Kyra Gan, Fei Wang

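AI summary For causal effect estimation over multiple dynamic treatment policies, proposes a policy-aware reparameterization of iterative conditional expectation, implemented as PEQ-Net, which enables joint estimation through shared representations and trains the policy encoder with kernel mean embeddings to reduce finite-sample variance.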

English abstract

Comparative evaluation of multiple dynamic treatment policies is essential for healthcare and policy decisions, yet conventional longitudinal causal inference methods estimate each policy in isolation, preventing information sharing across counterfactuals. We demonstrate that this separate estimation paradigm induces a structurally uncontrolled second-order bias, inflating finite-sample variance even after standard debiasing with longitudinal targeted maximum likelihood estimation (LTMLE). To address this, we propose a policy-aware reparameterization of Iterative Conditional Expectation (ICE) Q-functions that enables joint estimation through shared representations. We implement this approach in the Policy-Encoded Q Network (PEQ-Net), an architecture centered on a shared policy encoder. The encoder is trained using kernel mean embeddings, ensuring that the learned representation space reflects population-level policy dissimilarities. After applying an LTMLE correction step, we prove this design imposes a structural constraint on the second-order remainder, thereby stabilizing finite-sample variance. Experiments on semi-synthetic datasets demonstrate that PEQ-Net consistently outperforms existing ICE-based methods, achieving substantial reductions in root-mean-square error, particularly when evaluating closely related policies.

2601.16312 2026-05-28 cs.CL cs.AI

Teaching and Evaluating LLMs to Reason About Polymer Design Related Tasks

Dikshya Mohanty, Mohammad Saqib Hasan, Syed Mostofa Monsur, Size Zheng, Benjamin Hsiao, Niranjan Balasubramanian

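AI summary Introduces the PolyBench benchmark dataset and a knowledge-augmented reasoning distillation method, bringing small- and mid-sized language models close to frontier closed-source LLMs on polymer design tasks.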

English abstract

Research in AI4Science has shown promise in many science applications, including polymer design. However, current LLMs are ineffective in this problem space because: (i) most models lack polymer-specific knowledge, and (ii) existing aligned models have limited coverage of knowledge and capabilities relevant to polymer design. Addressing this, we introduce PolyBench, a large-scale training and test benchmark dataset of more than 125K polymer design-related tasks, leveraging a knowledge base of more than 13 million data points obtained from experimental and synthetic data sources to ensure broad coverage of polymers and their properties. For effective alignment using PolyBench, we introduce a knowledge-augmented reasoning distillation method that augments this dataset with structured CoT. Furthermore, tasks in PolyBench are organized from simple to complex analytical reasoning problems, enabling generalization tests and diagnostic probes across the problem space. Experiments show that small- and mid-sized language models (SLMs) with 7B to 32B parameters, trained on PolyBench, outperform similar-sized models and remain competitive with closed-source frontier LLMs on PolyBench's test dataset, while demonstrating performance gains on external polymer benchmarks. The dataset and associated code are available at https://github.com/StonyBrookNLP/PolyBench.

2605.13743 2026-05-28 cs.LG

GHGbench: A Unified Multi-Entity, Multi-Task Benchmark for Carbon Emission Prediction

Yifan Duan, Siyuan Zheng, Lihuan Li, Chao Xue, Flora Salim

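AI summary Introduces GHGbench, a unified open dataset and benchmark for company- and building-level greenhouse-gas emission prediction, which uses multimodal data fusion and standardized evaluation to expose structural difficulty and out-of-distribution generalization gaps.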

English abstract

Open datasets and benchmarks for entity-level carbon-emission prediction remain fragmented across access, scale, granularity, and evaluation. We introduce GHGbench, an open dataset and benchmark for company- and building-level greenhouse-gas prediction. The company track contains 32,000+ company-year records from 12,000+ firms with Scope 1+2 and Scope 3 disclosures and financial/sectoral signals; the building track harmonises 491,591 building-year records from 13 open sources into a single schema across 26 metropolitan areas (10 U.S., 15 Australian, 1 Singaporean), with climate covariates and multimodal remote-sensing embeddings. GHGbench defines canonical splits with in-distribution and cross-region/city transfer as primary tasks and temporal hold-out plus short-horizon forecasting as supplementary appendix evidence; headline baselines span gradient-boosted trees, a tabular foundation model, MLP, FT-Transformer, and multimodal fusion, with an LLM panel as auxiliary, all evaluated under multi-seed paired-bootstrap tests. Three benchmark-level findings emerge: (i) building emissions are structurally harder than company emissions; (ii) the in-distribution to out-of-distribution gap dwarfs any within-model gap across both the company track and the building track, and a tabular foundation model is, to our knowledge, the first baseline to open a paired-bootstrap-significant gap over tuned trees on a multi-city building-emissions task; (iii) multimodal remote-sensing embeddings help precisely where tabular generalisation breaks. GHGbench also exposes catastrophic city transfer and the sector-factor lookup ceiling as systematic failure modes. Code and reconstruction recipes are available at GHGbench.

2605.13517 2026-05-28 cs.CV cs.AI cs.LG

ArcVQ-VAE: A Spherical Vector Quantization Framework with ArcCosine Additive Margin

Jaeyung Kim, YoungJoon Yoo

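AI summary To address the limited codebook capacity that restricts the representational power of VQ-VAE, proposes the ArcVQ-VAE framework, which introduces a spherical angular-margin prior (Ball-Bounded Norm Regularization plus an ArcCosine Additive Margin Loss) to make latent representations more discriminative and uniformly dispersed, improving codebook utilization and achieving competitive performance on image reconstruction and generation.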

Comments To appear in Proceedings of the 43rd International Conference on Machine Learning (ICML 2026)

English abstract

Vector Quantized Variational Autoencoder (VQ-VAE) has become a fundamental framework for learning discrete representations in image modeling. However, VQ-VAE models must tokenize entire images using a finite set of codebook vectors, and this capacity limitation restricts their ability to capture rich and diverse representations. In this paper, we propose ArcCosine Additive Margin VQ-VAE (ArcVQ-VAE), a novel vector quantization framework that introduces a spherical angular-margin prior (SAMP) for the codebook of a conventional VQ-VAE. The proposed SAMP consists of Ball-Bounded Norm Regularization, which constrains all codebook vectors within a time-dependent Euclidean ball, and ArcCosine Additive Margin Loss, which encourages greater angular separability among latent vectors. This formulation promotes more discriminative and uniformly dispersed latent representations within the constrained space, thereby improving effective latent-space coverage and leading to improved codebook utilization. Experimental results on standard image reconstruction and generation tasks show that ArcVQ-VAE achieves competitive performance against baseline models in terms of reconstruction accuracy, representation diversity, and sample quality. The code is available at: https://github.com/goals4292/ArcVQ-VAE

2506.22726 2026-05-28 cs.CV cs.LG

XTransfer: Modality-Agnostic Few-Shot Model Transfer for Human Sensing at the Edge

Yu Zhang, Xi Zhang, Hualin Zhou, Xinyuan Chen, Shang Gao, Hong Jia, Jianfei Yang, Yuankai Qi, Tao Gu

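AI summary Proposes XTransfer, which achieves modality-agnostic few-shot model transfer through model repairing and layer recombining, reducing the costs of sensor data collection, model training, and edge deployment.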

Comments Accepted at ICML2026

Journal ref
Proceedings of the 43rd International Conference on Machine Learning (ICML 2026), Seoul, South Korea, 6-11 July 2026

English abstract

Deep learning for human sensing on edge systems presents significant potential for smart applications. However, its training and development are hindered by the limited availability of sensor data and resource constraints of edge systems. While transferring pre-trained models to different sensing applications is promising, existing methods often require extensive sensor data and computational resources, resulting in high costs and limited transferability. In this paper, we propose XTransfer, a first-of-its-kind method enabling modality-agnostic, few-shot model transfer with resource-efficient design. XTransfer flexibly uses pre-trained models and transfers knowledge across different modalities by (i) model repairing, which safely mitigates modality shift by adapting pre-trained layers with only a small amount of sensor data, and (ii) layer recombining, which efficiently searches and recombines layers of interest from source models in a layer-wise manner to restructure models. We benchmark various baselines across diverse human sensing datasets spanning different modalities. The results show that XTransfer achieves state-of-the-art performance while significantly reducing the costs of sensor data collection, model training, and edge deployment.

2605.12929 2026-05-28 cs.CV cs.AI

Anatomy-Slot: Unsupervised Anatomical Factorization for Homologous Bilateral Reasoning in Retinal Diagnosis

Yingzhe Ma, Xiao Yang, Yuguo Yin, Zheyu Wang

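AI summary Proposes Anatomy-Slot, which uses an unsupervised anatomical bottleneck to decompose patch tokens into structurally coherent slots for anatomical regions and aligns the slots of both eyes via bidirectional cross-attention. On ODIR-5K it improves AUC by 4.2 points over a ViT-L baseline, supporting the hypothesis that explicit structural correspondence improves diagnosis.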

Comments 15 pages, 3 figures

English abstract

Retinal diagnosis is inherently bilateral: clinicians compare homologous structures across eyes (e.g., optic disc asymmetry), yet most deep models operate on monocular representations. We investigate whether explicit structural correspondence improves diagnosis, and propose Anatomy-Slot to operationalize this hypothesis. Anatomy-Slot introduces an unsupervised anatomical bottleneck by decomposing patch tokens into a set of emergent, structurally-coherent slots that correspond to anatomical regions, then aligning these slots across eyes via bidirectional cross-attention. On ODIR-5K with $n=10$ seeds, the method improves AUC by $4.2$ points over a matched ViT-L baseline (95% CIs; Wilcoxon signed-rank test, $W=0$, $p=0.002$). Pairing disruption and stress testing under Gaussian noise provide controlled tests of correspondence dependence and robustness under corruption. We further report quantitative optic disc grounding on REFUGE and cross-attention localization analysis. Beyond the reported gains, these results indicate that object-centric anatomical correspondence offers a principled path toward interpretable diagnostic systems aligned with clinical bilateral comparison.
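
The reported $W=0$, $p=0.002$ with $n=10$ seeds is what an exact two-sided Wilcoxon signed-rank test gives when the method beats the baseline on every seed. The sketch below enumerates all sign patterns to confirm this. It assumes the differences are paired per seed, non-zero, and untied, which the abstract does not state; the function name is illustrative.

```python
from itertools import product


def exact_wilcoxon_two_sided_p(n: int, w_observed: int) -> float:
    """Exact two-sided p-value for W = min(W+, W-).

    Assumes n non-zero, untied paired differences, so the ranks are 1..n and
    every sign pattern is equally likely under the null hypothesis.
    """
    ranks = range(1, n + 1)
    total = n * (n + 1) // 2
    hits = 0
    for signs in product((0, 1), repeat=n):
        w_plus = sum(rank for rank, sign in zip(ranks, signs) if sign)
        if min(w_plus, total - w_plus) <= w_observed:
            hits += 1
    return hits / 2 ** n


p = exact_wilcoxon_two_sided_p(n=10, w_observed=0)
print(p)  # 0.001953125
```

Only the all-positive and all-negative patterns give $W=0$, so the p-value is $2/2^{10}$, which rounds to 0.002. This is also the smallest two-sided p-value the exact test can produce with ten pairs.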

2604.04295 2026-05-28 cs.CL

Adaptive Cost-Efficient Evaluation for Reliable Patent Claim Generation

Yongmin Yoo, Qiongkai Xu, Longbing Cao

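AI summary Proposes ACE, a two-stage framework that uses the categorical structure of patent errors for uncertainty-aware routing. A first-stage encoder predicts a distribution over error types, and claims whose entropy exceeds a threshold are passed to a second-stage expert LLM that runs a schema-constrained Chain-of-Patent-Thought protocol. ACE surpasses a 70B-parameter LLM baseline while cutting costs by 78%.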

English abstract

Automated patent claim validation demands low error tolerance. However, existing approaches face a rigidity-resource dilemma: lightweight encoders cannot track long-range legal dependencies, while exhaustive LLM verification incurs 4-5X higher overhead at million-claim scale. A naive confidence-based cascade cannot resolve this because binary validity scores fail to distinguish structurally distinct error types, which require different reasoning depths. We propose a two-stage framework, Adaptive Cost-efficient Evaluation (ACE), which exploits the categorical structure of patent errors for uncertainty-aware routing. In the first stage, a fine-tuned encoder projects claims into a K+1 distribution over legal error types, whose predictive entropy serves as the routing signal. Claims exceeding an entropy threshold are escalated to the second stage, where an expert LLM executes a schema-constrained Chain-of-Patent-Thought (CoPT) protocol to map claim elements against 35 U.S.C. standards; the schema constraint reduces per-claim latency by 42% while producing legally grounded verdicts. We further present a 40,000-claim dataset, ACE-40k, with MPEP-grounded annotations, where ACE surpasses competitive baselines including a supervised 70B-parameter LLM while reducing costs by 78%. On real USPTO rejection data, the routing mechanism transfers without re-calibration, reducing inference time by 60% while maintaining competitive recall.
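
The entropy-based routing step can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' code: the class count (K = 4 error types plus one "valid" class), the threshold value, and the stage labels are hypothetical.

```python
import math
from typing import Sequence


def predictive_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def route(probs: Sequence[float], threshold: float) -> str:
    """Keep low-entropy claims in stage 1 and escalate high-entropy claims."""
    if predictive_entropy(probs) > threshold:
        return "stage2_expert_llm"
    return "stage1_encoder"


# ASSUMED setup: K = 4 error types plus one "valid" class, and a threshold set
# to half of the maximum possible entropy log(K + 1).
threshold = 0.5 * math.log(5)

confident = [0.96, 0.01, 0.01, 0.01, 0.01]  # encoder is sure of one class
uncertain = [0.2, 0.2, 0.2, 0.2, 0.2]       # encoder cannot separate the classes

print(route(confident, threshold))  # stage1_encoder
print(route(uncertain, threshold))  # stage2_expert_llm
```

Unlike a single validity score, the entropy of the full K+1 distribution separates a claim that is confidently invalid for one reason from a claim whose error type is ambiguous.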

2512.21075 2026-05-28 cs.LG cs.AI math.PR stat.ML

Feature Learning Dynamics in Infinite-Depth Neural Networks

Zihan Yao, Ruoyu Wu, Tianxiang Gao

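AI summary Studies the forward-backward coupling caused by weight reuse in one-layer ResNets under depth-μP scaling, proves that it vanishes with width at initialization but produces a nontrivial correlation term during training, and derives the Neural Feature Dynamics (NFD) SDE system in the infinite-depth limit.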

English abstract

Deep neural networks have achieved remarkable success in practice, yet a mechanistic understanding of how features evolve during training remains incomplete, especially in the large-depth limit. For ResNets under depth-$μ$P scaling, prior work treats the layer index $\ell$ as a continuous time $t_\ell = \ell/L$, yielding SDE descriptions of the training dynamics. A key unresolved issue is that backpropagation reuses each forward weight matrix $W_\ell$ through its transpose $W_\ell^\top$, creating correlations between forward features and backward gradients whose behavior and role in feature learning remain unclear. We study this reused-weight forward--backward coupling in one-layer ResNets under depth-$μ$P. Using conditional Gaussian representations, we explicitly separate the coupling terms induced by weight reuse from decoupled Gaussian fluctuations before taking any network limit. At initialization, we prove that the coupling is a finite-width effect and vanishes at rate $O(n^{-1})$, uniformly over depth. During training, however, SGD induces a nontrivial forward--backward correlation term that survives the infinite-width limit. The key depth effect is that, under depth-$μ$P scaling, this surviving term is higher order in depth and its accumulated contribution over layers becomes negligible as $L\to\infty$. This depth-induced suppression motivates Neural Feature Dynamics (NFD), a forward--backward SDE system with decoupled backward weights that retains the feature-gradient covariance structure generated during training. Under nondegeneracy assumptions, we prove that the finite-network training dynamics converge to their NFD limit with an $O(L^{-1})$ depth-discretization error, while the reused-weight coupling term has a faster $O(L^{-2})$ decay. These results provide a rigorous infinite-depth limit for the feature-learning dynamics of one-layer ResNets under depth-$μ$P.

2605.12515 2026-05-28 cs.CL

Mitigating Cross-Lingual Cultural Inconsistencies in LLMs via Consensus-Driven Preference Optimisation

Lucas Resck, Isabelle Augenstein, Anna Korhonen

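AI summary Proposes the C-3PO framework, which uses consensus-driven preference optimisation to mitigate the cross-lingual cultural inconsistency that multilingual LLMs show when the prompt language changes even though the user's identity is explicit, significantly improving the consistency metric κ_S.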

Comments 24 pages, 13 figures, 11 tables

English abstract

Despite their impressive capabilities, multilingual large language models (MLLMs) frequently exhibit inconsistent behaviour when the prompt's language changes. While such adaptation is generally desirable, it becomes a critical failure when a user's identity is explicitly defined. For instance, given a fixed British persona and an ambiguous everyday knowledge query about literature, the prompt's language frequently overwrites the system persona -- yielding Shakespeare in English but Cervantes in Spanish. To robustly quantify this Cross-lingual Cultural Inconsistency, we introduce Singleton Fleiss's $κ_S$, a metric mathematically resilient to hallucinations. For mitigation, we propose Cross-lingual Cultural Consistent Preference Optimisation (C-3PO), a consensus-driven alignment framework. C-3PO achieves up to a 0.13-point absolute increase in $κ_S$ over unaligned models, consistently outperforming strong prompting and representation steering baselines whilst preserving explicit user identities, cultural neutrality and intrinsic cultural knowledge. Empirical evaluations demonstrate this inconsistency disproportionately affects lower-resource languages like Indonesian and Persian. Finally, early decoding of intermediate layers reveals that MLLMs implicitly personalise outputs towards the prompt language's stereotypical culture as forward-pass representations stabilise.
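
For orientation, the standard Fleiss's kappa that the proposed metric builds on can be computed as below. This sketch implements only the textbook statistic; the abstract does not define the Singleton variant $κ_S$, so no attempt is made to reproduce it. Treating prompt languages as "raters" and answer cultures as categories is an assumption made here for illustration.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Standard Fleiss's kappa.

    counts[i][j] is the number of raters that assigned item i to category j.
    Every item must have the same number of raters (at least two), and the
    ratings must not all fall into a single category (that would divide by zero).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])
    # Overall share of ratings that fall into each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Per-item agreement: fraction of rater pairs that agree on the item.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    return (p_bar - p_expected) / (1 - p_expected)


# Illustrative data: 2 queries, 3 prompt languages, 2 candidate answer cultures.
consistent = [[3, 0], [0, 3]]    # every language yields the same answer per query
inconsistent = [[2, 1], [1, 2]]  # answers shift with the prompt language

print(fleiss_kappa(consistent))        # 1.0
print(fleiss_kappa(inconsistent) < 0)  # True
```

A value of 1 means the answer never changes with the prompt language, while values at or below 0 mean agreement is no better than chance.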

2605.11755 2026-05-28 cs.LG cs.CV stat.ML

One-Step Generative Modeling via Wasserstein Gradient Flows

Jiaqi Han, Puheng Li, Qiushan Guo, Renyuan Xu, Stefano Ermon, Emmanuel J. Candès

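AI summary Proposes the W-Flow framework, which compresses the Wasserstein-gradient-flow evolution from a reference distribution to a target distribution into one-step generation and uses the Sinkhorn divergence for efficient optimal transport, reaching 1.29 FID on ImageNet 256×256 with roughly 100× faster sampling.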

Comments 40 pages, 14 figures

English abstract

Diffusion models and flow-based methods have shown impressive generative capability, especially for images, but their sampling is expensive because it requires many iterative updates. We introduce W-Flow, a framework for training a generator that transforms samples from a simple reference distribution into samples from a target data distribution in a single step. This is achieved in two steps: we first define an evolution from the reference distribution to the target distribution through a Wasserstein gradient flow that minimizes an energy functional; second, we train a static neural generator to compress this evolution into one-step generation. We instantiate the energy functional with the Sinkhorn divergence, which yields an efficient optimal-transport-based update rule that captures global distributional discrepancy and improves coverage of the target distribution. We further prove that the finite-sample training dynamics converge to the continuous-time distributional dynamics under suitable assumptions. Empirically, W-Flow sets a new state of the art for one-step ImageNet 256$\times$256 generation, achieving 1.29 FID, with improved mode coverage and domain transfer. Compared to multi-step diffusion models with similar FID scores, our method yields approximately 100$\times$ faster sampling. These results show that Wasserstein gradient flows provide a principled and effective foundation for fast and high-fidelity generative modeling.
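
The Sinkhorn divergence named in the abstract can be illustrated on tiny one-dimensional point sets. The sketch below uses the standard debiased form, $S_\varepsilon(\alpha,\beta) = \mathrm{OT}_\varepsilon(\alpha,\beta) - \tfrac12\mathrm{OT}_\varepsilon(\alpha,\alpha) - \tfrac12\mathrm{OT}_\varepsilon(\beta,\beta)$, with log-domain Sinkhorn iterations. It is not the W-Flow training code: the squared-distance cost, uniform weights, `eps`, and iteration count are assumptions chosen for readability.

```python
import math
from typing import Sequence


def _logsumexp(values: Sequence[float]) -> float:
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def entropic_ot(xs: Sequence[float], ys: Sequence[float],
                eps: float = 1.0, iters: int = 200) -> float:
    """Entropic OT value between uniform empirical measures on 1-D points.

    Uses the squared-distance cost and returns the dual value <a, f> + <b, g>
    after alternating log-domain Sinkhorn updates of the potentials f and g.
    """
    n, m = len(xs), len(ys)
    log_a, log_b = -math.log(n), -math.log(m)
    cost = [[(x - y) ** 2 for y in ys] for x in xs]
    f = [0.0] * n
    g = [0.0] * m
    for _ in range(iters):
        f = [-eps * _logsumexp([log_b + (g[j] - cost[i][j]) / eps for j in range(m)])
             for i in range(n)]
        g = [-eps * _logsumexp([log_a + (f[i] - cost[i][j]) / eps for i in range(n)])
             for j in range(m)]
    return sum(f) / n + sum(g) / m


def sinkhorn_divergence(xs: Sequence[float], ys: Sequence[float],
                        eps: float = 1.0, iters: int = 200) -> float:
    """Debiased Sinkhorn divergence: OT(x, y) - OT(x, x) / 2 - OT(y, y) / 2."""
    return (entropic_ot(xs, ys, eps, iters)
            - 0.5 * entropic_ot(xs, xs, eps, iters)
            - 0.5 * entropic_ot(ys, ys, eps, iters))


same = sinkhorn_divergence([0.0, 1.0, 2.0], [0.0, 1.0, 2.0])
far = sinkhorn_divergence([0.0, 1.0, 2.0], [5.0, 6.0, 7.0])
print(same, far > same)
```

The two self-transport terms remove the entropic bias, so the divergence is zero for identical samples and grows as the samples move apart. Because every point of one set is coupled to points of the other, the quantity reflects global distributional discrepancy, which is the property the abstract highlights.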

2605.11544 2026-05-28 cs.AI cs.LO

Optimal LTLf Synthesis

Yujian Cao, Sven Schewe, Qiyi Tang, Shufang Zhu

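AI summary Introduces optimal LTLf synthesis, which maximizes the number of objectives that can be guaranteed in order to synthesize strategies when a multi-objective specification cannot be fully realised, and validates the approach's feasibility experimentally.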

English abstract

Strategy synthesis typically follows an all-or-nothing paradigm, returning unrealisable whenever a specification cannot be guaranteed in an uncertain environment. In this paper, we introduce optimal LTLf synthesis, where the goal is to realise as many objectives as possible from a given specification consisting of multiple objectives, especially for the case that they are not all jointly realisable. We first consider max-guarantee synthesis, which commits to a maximal set of objectives that we can a priori guarantee to realise. We then introduce max-observation synthesis, which maximises a posteriori realised objectives that may be incomparable on different executions. Finally, we present incremental max-observation synthesis, which further improves strategies by exploiting opportunities for stronger guarantees when they arise during an execution. Experimental results show that different variations of optimal synthesis scale broadly equally well, solving a large fraction of the benchmark instances within the given timeout, demonstrating the practical feasibility of the approach.

2605.10583 2026-05-28 cs.CV

FrequencyCT: Frequency Domain Self-supervised Low-dose CT Denoising

FrequencyCT:频域自监督低剂量CT去噪

Guoquan Wei, Liu Shi, Chong Chen, Qiegen Liu

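AI summary Proposes FrequencyCT, a zero-shot self-supervised method for low-dose CT denoising that generates pseudo-samples in the frequency domain by exploiting the different distributions of noise and true signal, and stabilizes optimization through data truncation; experiments confirm its clinical potential.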

详情
AI中文摘要

尽管对计算机断层扫描(CT)去噪进行了广泛研究,但很少有研究利用投影域数据特性来减轻噪声相关性。为填补这一空白,本文提出FrequencyCT,这是第一种在频域中为低剂量CT去噪生成伪样本的零样本自监督方法。具体而言,通过利用噪声和真实信号在频域分布上的差异,提出了一种区域低频锚定技术。对高频区域应用相位保持噪声和掩膜扰动,生成用于自监督的伪样本。基于含噪投影的噪声方差与底层真实信号之间的指数相关性,对生成的样本进行一致的数据截断,以稳定优化梯度。在多个公开和真实数据集上的评估结果证实了本研究的临床应用潜力,为去噪领域提供了创新视角。代码可在 https://github.com/yqx7150/FrequencyCT 获取。

英文摘要

Despite extensive research on computed tomography (CT) denoising, few studies exploit projection-domain data characteristics to mitigate noise correlation. To bridge this gap, this work proposes FrequencyCT, the first zero-shot self-supervised method for pseudo-sample generation in the frequency domain for low-dose CT denoising. Specifically, by exploiting the distinct frequency-domain distributions of noise and true signal, a regional low-frequency anchoring technique is proposed. Applying phase-preserving noise and mask perturbations to the high-frequency region generates pseudo-samples for self-supervision. Driven by the exponential correlation between noise variance of noisy projections and the underlying true signal, consistent data truncation is applied to the generated samples to stabilize optimization gradients. Evaluation results on multiple public and real datasets confirm the clinical application potential of this research, which provides an innovative perspective for the field of denoising. The code is available at: https://github.com/yqx7150/FrequencyCT.

2605.10581 2026-05-28 cs.CV

Polygon-mamba: Retinal vessel segmentation using polygon scanning mamba and space-frequency collaborative attention

Polygon-mamba: 使用多边形扫描Mamba和空间-频率协同注意力进行视网膜血管分割

Yuanyuan Peng, Wen Li

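AI summary To address the difficulty of segmenting small retinal vessels, proposes a hybrid CNN-Mamba fusion network that uses polygon scanning Mamba and a space-frequency collaborative attention mechanism to preserve pixel connectivity and enhance key features, achieving strong performance on three public datasets.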

详情
AI中文摘要

视网膜血管分割对于眼部疾病的诊断和评估至关重要。值得注意的是,小血管的分割一直被认为是一项具有挑战性和复杂性的任务。为了应对这一挑战,我们设计了一种混合CNN-Mamba融合网络,该网络集成了多边形扫描Mamba和空间-频率协同注意力机制,用于检测小血管。考虑到传统的水平-垂直扫描Mamba架构可能会破坏目标结构的拓扑完整性,并导致视网膜小血管的局部不连续性,我们提出了一种多边形扫描视觉状态空间模型(PS-VSS),通过多层反向扫描方式识别小血管结构特征。该方法有效保留了像素连通性,从而显著减轻了小血管信息的丢失。此外,众所周知,空间域优先考虑位置和结构信息,而频率域强调全局感知和局部细节成分,我们在跳跃连接中引入了空间-频率协同注意力机制(SFCAM),以从空间域和频率域提取高效特征。该策略使模型能够动态增强关键特征,同时有效抑制杂乱信息。为了评估模型的有效性,我们在三个公开数据集:DRIVE、STARE和CHASE_DB1上进行了测试。与手动标注相比,我们的模型在三个数据集上的F1分数分别为0.8283、0.8282和0.8251,曲线下面积(AUC)值分别为0.9806、0.9840和0.9866,灵敏度(SE)值分别为0.8268、0.8314和0.8484。通过视觉检查和定量分析验证了模型的有效性。

英文摘要

Retinal vessel segmentation is crucial for the diagnosis and assessment of ocular diseases. Notably, segmentation of small retinal vessels has been consistently recognized as a challenging and complex task. To tackle this challenge, we design a hybrid CNN-Mamba fusion network that integrates polygon scanning mamba and a space-frequency collaborative attention mechanism for the detection of small vessels. Considering that the traditional mamba architecture with horizontal-vertical scanning may compromise the topological integrity of target structures and result in local discontinuities in small retinal vessels, we present a polygon scanning visual state space model (PS-VSS) that identifies small-vessel structural features through multi-layer reverse scanning. This effectively preserves pixel connectivity, thereby substantially mitigating the loss of information pertaining to small vessels. Furthermore, because the spatial domain prioritizes positional and structural information while the frequency domain emphasizes global perception and local detail components, a space-frequency collaborative attention mechanism (SFCAM) is introduced within the skip connection to extract efficient features from both domains. This strategy empowers the model to dynamically enhance key features while effectively suppressing clutter. To assess the efficacy of our model, we tested it on three publicly available datasets: DRIVE, STARE, and CHASE_DB1. Compared to manual annotations, our model achieved F1 scores of 0.8283, 0.8282, and 0.8251, Area Under Curve (AUC) values of 0.9806, 0.9840, and 0.9866, and Sensitivity (SE) values of 0.8268, 0.8314, and 0.8484 on the three datasets, respectively. The effectiveness of our model was validated through both visual inspection and quantitative analysis.

2605.10325 2026-05-28 cs.AI

Verifiable Process Rewards for Agentic Reasoning

可验证过程奖励用于智能体推理

Huining Yuan, Zelai Xu, Huaijie Wang, Xiangmin Yi, Jiaxuan Gao, Xiao-Ping Zhang, Yu Wang, Chao Yu, Yi Wu

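AI summary Proposes the Verifiable Process Rewards (VPR) framework, which uses symbolic or algorithmic oracles to turn dense intermediate-step verification into turn-level supervision for reinforcement learning, addressing credit assignment in long-horizon agentic reasoning, and validates it in three settings: search-based, constraint-based, and posterior-based verification.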

Comments Corrected minor typos and LLM-assisted data extraction errors. The main conclusions are unchanged

详情
AI中文摘要

来自可验证奖励的强化学习(RLVR)提升了大型语言模型(LLMs)的推理能力,但现有方法大多依赖稀疏的结果级反馈。这种稀疏性在长程智能体推理中造成了信用分配难题:一个轨迹可能因包含许多正确的中间决策而失败,或因包含有缺陷的决策而成功。在这项工作中,我们研究了一类密集可验证的智能体推理问题,其中中间动作可以通过符号或算法预言机进行客观检查。我们提出了可验证过程奖励(VPR),一个将此类预言机转化为密集的逐轮监督信号用于强化学习的框架,并在三个代表性场景中实例化:基于搜索验证的动态演绎、基于约束验证的逻辑推理和基于后验验证的概率推理。我们进一步提供了理论分析,表明密集的验证器基础奖励可以通过提供更局部的学习信号来改善长程信用分配,其收益取决于验证器的可靠性。实验上,VPR在受控环境中优于结果级奖励和基于rollout的过程奖励基线,更重要的是,它能够迁移到通用和智能体推理基准测试中,表明可验证的过程监督可以培养适用于训练环境之外的通用推理技能。我们的结果表明,VPR是一种有前景的方法,用于在可靠的中间验证可用时增强LLM智能体,同时也强调了其对预言机质量的依赖性,以及将VPR扩展到结构化程度较低、开放环境中的开放性挑战。

英文摘要

Reinforcement learning from verifiable rewards (RLVR) has improved the reasoning abilities of large language models (LLMs), but most existing approaches rely on sparse outcome-level feedback. This sparsity creates a credit assignment challenge in long-horizon agentic reasoning: a trajectory may fail despite containing many correct intermediate decisions, or succeed despite containing flawed ones. In this work, we study a class of densely-verifiable agentic reasoning problems, where intermediate actions can be objectively checked by symbolic or algorithmic oracles. We propose Verifiable Process Rewards (VPR), a framework that converts such oracles into dense turn-level supervision for reinforcement learning, and instantiate it in three representative settings: search-based verification for dynamic deduction, constraint-based verification for logical reasoning, and posterior-based verification for probabilistic inference. We further provide a theoretical analysis showing that dense verifier-grounded rewards can improve long-horizon credit assignment by providing more localized learning signals, with the benefit depending on the reliability of the verifier. Empirically, VPR outperforms outcome-level reward and rollout-based process reward baselines across controlled environments, and more importantly, transfers to both general and agentic reasoning benchmarks, suggesting that verifiable process supervision can foster general reasoning skills applicable beyond the training environments. Our results indicate that VPR is a promising approach for enhancing LLM agents whenever reliable intermediate verification is available, while also highlighting its dependence on oracle quality and the open challenge of extending VPR to less structured, open-ended environments.

2602.05198 2026-05-28 cs.RO

Informative Path Planning with Guaranteed Estimation Uncertainty

具有保证估计不确定性的信息性路径规划

Kalvik Jakkala, Saurav Agarwal, Jason O'Kane, Srinivas Akella

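AI summary Proposes a three-stage approach that uses a Gaussian process model to convert uncertainty constraints into coverage maps and plans a near-shortest path, achieving informative path planning with guaranteed estimation uncertainty in complex environments.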

Comments 15 pages, 11 figures, RSS 2026

详情
AI中文摘要

环境监测机器人通常需要在严格的资源约束下估计数据场(例如盐度、温度、水深)。经典的往复式割草机式测量提供了几何覆盖保证,但可能因过度采样可预测区域而浪费精力。相比之下,信息性路径规划(IPP)方法利用空间相关性减少过度采样,但通常不提供估计质量的保证。本文通过解决复杂环境中具有保证估计不确定性的IPP来弥合这些方法:计算最短路径,其测量确保高斯过程(GP)后验方差——一种内在的不确定性度量,在GP模型下下界均方预测误差——在监测区域上由用户指定的阈值上界。 我们提出了一种高效环境监测的三阶段方法:(i)从先验信息学习GP模型;(ii)将GP核转换为二元覆盖图,识别不确定性可降低到目标阈值以下的位置;(iii)规划一条近最短路径以满足全局不确定性约束。我们的方法结合非平稳核以捕捉异质现象中空间变化的关联性,并适应具有障碍物的非凸环境。我们为传感位置选择以及在旅行预算下的联合选择与路由问题提供了近最优近似保证。在真实世界地形数据上的实验表明,我们的规划器以比代表性基线更少的传感位置和更短的旅行距离实现了不确定性目标。此外,使用自主水面和水下车辆进行的现场实验验证了该方法的实际可行性。我们的代码可在www.sgp-tools.com获取。

英文摘要

Environmental monitoring robots often need to estimate data fields (e.g., salinity, temperature, bathymetry) under tight resource constraints. Classical boustrophedon lawnmower surveys provide geometric coverage guarantees but can waste effort by oversampling predictable regions. In contrast, informative path planning (IPP) methods leverage spatial correlations to reduce oversampling, yet typically offer no guarantees on estimation quality. This paper bridges these approaches by addressing IPP with guaranteed estimation uncertainty in complex environments: computing the shortest path whose measurements ensure that the Gaussian process (GP) posterior variance -- an intrinsic uncertainty measure that lower-bounds the mean-squared prediction error under the GP model -- is upper bounded by a user-specified threshold over the monitoring region. We propose a three-stage approach for efficient environmental monitoring: (i) learning a GP model from prior information; (ii) transforming the GP kernel into binary coverage maps that identify locations where uncertainty can be reduced below a target threshold; and (iii) planning a near-shortest route to satisfy the global uncertainty constraint. Our approach incorporates non-stationary kernels to capture spatially varying correlations in heterogeneous phenomena and accommodates non-convex environments with obstacles. We provide near-optimal approximation guarantees for both sensing-location selection and the joint selection-and-routing problem under a travel budget. Experiments on real-world topographic data demonstrate that our planners achieve uncertainty targets with fewer sensing locations and shorter travel distances than representative baselines. Furthermore, field experiments with autonomous surface and underwater vehicles validate the real-world feasibility of the approach. Our code is available at: www.sgp-tools.com
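
The uncertainty constraint described above, keeping the GP posterior variance below a user-specified threshold everywhere in the monitoring region, can be made concrete with a small sketch. This is not the paper's coverage-map planner: it assumes a stationary unit-variance RBF kernel, ignores routing and obstacles, and uses a simple greedy rule (sense at the currently most uncertain location). All function names are illustrative.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    """Unit-variance squared-exponential kernel between rows of a and b."""
    sq_dist = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dist / lengthscale**2)

def posterior_variance(targets, sensed, lengthscale=1.0, noise=1e-2):
    """GP posterior variance at each target after noisy measurements at `sensed`."""
    if len(sensed) == 0:
        return np.ones(len(targets))  # prior variance of the unit-variance kernel
    k_ss = rbf_kernel(sensed, sensed, lengthscale) + noise * np.eye(len(sensed))
    k_ts = rbf_kernel(targets, sensed, lengthscale)
    solved = np.linalg.solve(k_ss, k_ts.T)  # shape (num_sensed, num_targets)
    return 1.0 - np.sum(k_ts * solved.T, axis=1)

def greedy_sensing(targets, threshold, lengthscale=1.0, noise=1e-2):
    """Add the most uncertain target until the maximum variance meets the threshold."""
    chosen = []
    while True:
        sensed = targets[np.array(chosen, dtype=int)]
        var = posterior_variance(targets, sensed, lengthscale, noise)
        if var.max() <= threshold or len(chosen) == len(targets):
            return chosen, var
        chosen.append(int(np.argmax(var)))
```

The posterior variance depends only on where measurements are taken, not on the measured values, which is why sensing locations can be selected before any data is collected. The threshold must sit above the noise floor for the loop to stop early.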

2605.10073 2026-05-28 cs.CL

Heterogeneous Dependency Graph-Guided Attention for Patent Representation Learning

异构依赖图引导的专利表示学习注意力机制

Yongmin Yoo, Qiongkai Xu, Zhangkai Wu, Longbing Cao

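AI summary To address the overlooked dependency hierarchy among patent claims, proposes the Patent Heterogeneous Attention Graph Encoder (PHAGE), which builds a typed graph separating legal citations from technical relations, introduces a connectivity mask with learnable biases to project claim-level topology into token-level attention, and adds dual-granularity contrastive learning, outperforming domain-adapted and citation-aware baselines on classification, retrieval, and clustering.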

详情
AI中文摘要

预训练语言模型通过将权利要求编码为扁平令牌序列来推进专利分类和检索,但忽略了权利要求之间的依赖层次。将层次结构融入自注意力面临两个挑战。首先,权利要求依赖涉及不同可靠性的关系类型:不加区分地对待它们会使有噪声的技术关系污染更清洁的法律引用信号。其次,当依赖图在权利要求级别定义时,Transformer模型会失败,因为它们在令牌级别操作;广播权利要求级别的邻接可能会稀释跨无关令牌对的结构信息。一种新颖的专利异构注意力图编码器(PHAGE)解决了这些挑战。为了处理异构依赖,PHAGE构建了一个类型图,将法律引用与技术关系区分为不同的边类型。为了弥合层次差距,PHAGE引入了一个带有可学习关系感知偏置的连通性掩码,将权利要求级别的拓扑投射到令牌级别的注意力中。PHAGE学习一个双粒度对比目标,以将表示与专利间分类法和专利内拓扑对齐。实验表明,PHAGE在专利分类、检索和聚类上优于领域自适应和引用感知基线。PHAGE揭示,专利内权利要求拓扑比专利间结构捕获了更强的归纳偏置。

英文摘要

Pre-trained language models advance patent classification and retrieval by encoding claims as flat token sequences, yet overlook the dependency hierarchy among claims. Incorporating the hierarchy into self-attention poses two challenges. First, claim dependencies involve relation types with varying reliability: treating them indiscriminately allows noisy technical relations to corrupt cleaner legal citation signals. Second, when the dependency graph is defined over claims, Transformer models fail as they operate at the token level; broadcasting claim-level adjacency can dilute structural information across unrelated token pairs. A novel Patent Heterogeneous Attention Graph Encoder (PHAGE) addresses these challenges. To handle heterogeneous dependencies, PHAGE constructs a typed graph to separate legal citations from technical relations as distinct edge types. To bridge the hierarchy gap, PHAGE introduces a connectivity mask with learnable relation-aware biases to project a claim-level topology into token-level attention. PHAGE learns a dual-granularity contrastive objective to align representations with inter-patent taxonomy and intra-patent topology. Experiments show that PHAGE outperforms domain-adapted and citation-aware baselines on patent classification, retrieval, and clustering. PHAGE reveals that the intra-patent claim topology captures stronger inductive bias than the inter-patent structure.
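
Projecting a claim-level typed graph into token-level attention can be sketched as an additive bias matrix. This is an illustration under assumptions, not PHAGE's implementation: the edge-type codes, the `type_bias` dictionary (fixed scalars here, learnable parameters in a real model), and the function name are all hypothetical.

```python
import numpy as np

LEGAL, TECHNICAL = 1, 2  # hypothetical edge-type codes; 0 means no edge

def token_attention_bias(claim_of_token, claim_edges, type_bias, masked=-1e9):
    """Build a (T, T) additive attention bias from a (C, C) typed claim graph.

    claim_of_token: claim index of each token, length T
    claim_edges:    integer matrix, claim_edges[i, j] = type of edge i -> j
    type_bias:      mapping from edge type to a scalar bias
    """
    claim_of_token = np.asarray(claim_of_token)
    # Look up, for every token pair, the edge type between their claims.
    edge = claim_edges[claim_of_token[:, None], claim_of_token[None, :]]
    bias = np.full(edge.shape, masked, dtype=float)  # unconnected pairs are masked
    for edge_type, value in type_bias.items():
        bias[edge == edge_type] = value
    # Tokens inside the same claim attend to each other with no extra bias.
    bias[claim_of_token[:, None] == claim_of_token[None, :]] = 0.0
    return bias
```

The resulting matrix would be added to the attention logits before the softmax, so each relation type shifts attention by its own amount while unrelated token pairs receive effectively zero weight.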

2508.11011 2026-05-28 cs.CV

Are Large Pre-trained Vision Language Models Effective Construction Safety Inspectors?

大型预训练视觉语言模型能否成为有效的施工安全检查员?

Xuezheng Chen, Zhengbo Zou

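AI summary This paper presents the ConstructionSite 10k dataset, containing 10,000 construction site images annotated for three tasks, and evaluates the zero-shot and few-shot generalization of large pre-trained vision language models, providing a benchmark for construction safety inspection.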

详情
AI中文摘要

施工安全检查通常涉及人类检查员在现场识别安全问题。随着强大的视觉语言模型(VLM)的兴起,研究人员正在探索将其用于从现场图像中检测安全违规等任务。然而,目前缺乏公开数据集来全面评估和进一步微调VLM在施工安全检查中的应用。当前VLM的应用使用小型监督数据集,限制了它们在未直接训练的任务中的适用性。在本文中,我们提出了ConstructionSite 10k数据集,包含10,000张施工场地图像,并为三个相互关联的任务提供标注,包括图像描述、安全违规视觉问答(VQA)和施工元素视觉定位。随后我们对当前最先进的大型预训练VLM的评估显示,它们在零样本和小样本设置下具有显著的泛化能力,但需要额外训练才能应用于实际施工场地。该数据集允许研究人员使用新的架构和技术训练和评估自己的VLM,为施工安全检查提供了有价值的基准。

英文摘要

Construction safety inspections typically involve a human inspector identifying safety concerns on-site. With the rise of powerful Vision Language Models (VLMs), researchers are exploring their use for tasks such as detecting safety rule violations from on-site images. However, there is a lack of open datasets to comprehensively evaluate and further fine-tune VLMs in construction safety inspection. Current applications of VLMs use small, supervised datasets, limiting their applicability in tasks they are not directly trained for. In this paper, we propose the ConstructionSite 10k, featuring 10,000 construction site images with annotations for three inter-connected tasks, including image captioning, safety rule violation visual question answering (VQA), and construction element visual grounding. Our subsequent evaluation of current state-of-the-art large pre-trained VLMs shows notable generalization abilities in zero-shot and few-shot settings, while additional training is needed to make them applicable to actual construction sites. This dataset allows researchers to train and evaluate their own VLMs with new architectures and techniques, providing a valuable benchmark for construction safety inspection.

2605.08938 2026-05-28 cs.AI cs.LG

Can We Formally Verify Neural PDE Surrogates? SMT Compilation of Small Fourier Neural Operators

我们能否形式化验证神经PDE代理模型?小傅里叶神经操作符的SMT编译

Ali Baheri, Ignacio Laguna Peralta

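AI summary This paper compiles the spectral convolution of Fourier Neural Operators (FNOs) into a linear map and implements exact or approximate SMT encodings in Z3 to formally verify small FNO surrogates, revealing a tradeoff between soundness and scalability.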

详情
AI中文摘要

傅里叶神经操作符(FNO)可以极大地加速PDE模拟,但它们通常在没有形式化保证其保留基本物理结构的情况下使用。我们表明,一旦训练权重和网格固定,FNO中的谱卷积是一个线性映射。因此,完整的前向传播是分段线性的,并且可以在Z3的线性实数算术中精确表示。我们研究了两种编码。精确编码将谱卷积编译为稠密矩阵乘法,对于证明和反例都是可靠的。更轻量的冻结编码用常数替换谱路径,使其更快但近似。在10个用于一维对流-扩散-反应的小型FNO代理模型(85到117个参数,网格8到32)上,精确编码在线性(无ReLU)模型上给出了2个可靠的正性证明,5个可靠的正性反例,以及10个可靠的质量违反反例;其余3个在ReLU模型上的正性查询超时。对于质量不增加,Z3在10个模型中的7个上找到了比基于梯度的伪造和蒙特卡洛更差的反例。冻结编码可扩展到网格大小64,且正性检查亚秒级,但它不再为原始FNO提供证书。总体而言,结果明确了可靠性与可扩展性之间的权衡,并指出了对生产规模神经操作符进行形式化验证所需的条件。

英文摘要

Fourier Neural Operators (FNOs) can greatly accelerate PDE simulation, but they are often used without formal guarantees that they preserve basic physical structure. We show that, once the trained weights and grid are fixed, the spectral convolution in an FNO is a linear map. As a result, the full forward pass is piecewise-linear and can be represented exactly in Z3's linear real arithmetic. We study two encodings. The exact encoding compiles the spectral convolution into a dense matrix multiplication, which is sound for both proofs and counterexamples. The lighter frozen encoding replaces the spectral path with a constant, making it faster but approximate. On 10 small FNO surrogates for 1D advection-diffusion-reaction (85 to 117 parameters, grids 8 to 32), the exact encoding gives 2 sound positivity proofs on linear (ReLU-free) models, 5 sound positivity counterexamples, and 10 sound mass-violation counterexamples; the remaining 3 positivity queries on ReLU models time out. For mass non-increase, Z3 finds worse counterexamples than both gradient-based falsification and Monte Carlo on 7 of 10 models. The frozen encoding scales to grid size 64 with sub-second positivity checks, but it no longer provides certificates for the original FNO. Overall, the results make the soundness--scalability tradeoff explicit and point to what is needed for formal verification of production-scale neural operators.
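
The claim that a spectral convolution becomes a linear map once weights and grid are fixed can be checked numerically. The sketch below assumes a single-channel 1D layer that keeps the lowest few Fourier modes; it recovers the dense matrix by applying the layer to the standard basis vectors. The function names are illustrative, and the subsequent SMT encoding step is not shown.

```python
import numpy as np

def spectral_conv(u, weights):
    """Single-channel 1D spectral convolution keeping the lowest len(weights) modes."""
    n = u.shape[-1]
    u_hat = np.fft.rfft(u)
    out_hat = np.zeros_like(u_hat)
    k = len(weights)
    out_hat[:k] = u_hat[:k] * weights  # fixed complex multiplier per retained mode
    return np.fft.irfft(out_hat, n=n)

def as_dense_matrix(weights, n):
    """Dense (n, n) real matrix M with M @ u == spectral_conv(u, weights)."""
    # Column j is the layer's response to the j-th standard basis vector.
    return np.stack([spectral_conv(e, weights) for e in np.eye(n)], axis=1)
```

Because the matrix entries are plain real constants, each output becomes a fixed linear combination of the inputs, which is the form a linear-real-arithmetic solver can consume.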

2605.08758 2026-05-28 cs.RO cs.AI math.OC

Omni-scale Learning-based Sequential Decision Framework for Order Fulfillment of Tote-handling Robotic Systems

基于全尺度学习的料箱搬运机器人系统订单履行序贯决策框架

Jiaxin Liu, Peng Yang, Yuping Li, Xinyue Xie

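AI summary For order-fulfillment decisions in tote-handling robotic systems, proposes OLSF-TRS, a generalized and scalable sequential decision framework combining structured combinatorial optimization with multi-agent reinforcement learning; it achieves average optimality gaps below 3.5% on small-scale systems and reduces tote movements by 8-12% relative to heuristic baselines in large-scale scenarios while remaining real-time.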

Comments 35 pages, 5 figures

详情
AI中文摘要

受电子商务和小批量生产的快速扩张推动,成品、半成品和原材料的内部物流负载单元规模正在稳步缩小。料箱正逐渐取代托盘成为主要的搬运和存储容器。这一转变将料箱搬运机器人系统推向了自动化订单履行中心的前沿。料箱搬运机器人系统的订单履行决策具有共同的订单-料箱-机器人序贯决策性质。现有研究主要针对特定系统的决策机制,难以泛化或迁移到其他场景。我们提出了一种基于全尺度学习的料箱搬运机器人系统订单履行序贯决策框架(OLSF-TRS),这是一个通用且可扩展的序贯决策框架,结合了结构化组合优化与多智能体强化学习,以协调订单、料箱和机器人决策。在小规模料箱搬运机器人系统上,OLSF-TRS在两种不同的系统配置下实现了接近最优的性能,平均最优性差距低于3.5%。在大规模场景中,OLSF-TRS在两种不同类型的系统上始终优于启发式基线,与基于规则的最先进方法相比,总料箱移动量减少了8-12%和超过30%,同时保持实时响应。这些改进转化为切实的运营效益,包括成本降低、能耗降低和吞吐量稳定性增强。所提出的框架为广泛部署的料箱搬运机器人系统提供了一种高效且统一的订单履行决策框架,支持电子商务和工业物流领域的高质量订单履行。

英文摘要

Driven by the rapid expansion of e-commerce and small-batch production, the size of the intralogistics load unit of finished goods, semi-finished goods and raw materials is steadily shrinking. Totes are gradually replacing pallets as the primary handling and storage container. This shift has propelled tote-handling robotic systems to the forefront of automated order fulfillment centers. The order-fulfillment decisions of tote-handling robotic systems share a common order-tote-robot sequential decision-making nature. Existing studies primarily focus on decision mechanisms tailored to particular systems, making it difficult to generalize or transfer them to other contexts. We propose an Omni-scale Learning-based Sequential Decision Framework for Order Fulfillment of Tote-handling Robotic Systems (OLSF-TRS), a generalized and scalable sequential decision framework that combines structured combinatorial optimization with multi-agent reinforcement learning to coordinate order, tote, and robot decisions. On small-scale tote-handling robotic systems, OLSF-TRS achieves near-optimal performance with average optimality gaps below 3.5% across two distinct system configurations. In large-scale scenarios, OLSF-TRS consistently outperforms heuristic baselines across two different system types, reducing total tote movements by 8-12% and over 30% compared to SOTA rule-based approaches, while maintaining real-time responsiveness. These improvements translate into tangible operational benefits, including cost reduction, lower energy consumption, and enhanced throughput stability. The proposed framework delivers an efficient and unified order fulfillment decision-making framework for widely deployed tote-handling robotic systems, supporting high-quality order fulfillment in both e-commerce and industrial logistics sectors.

2605.08678 2026-05-28 cs.LG

MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI

MLS-Bench:对构建更好AI的AI系统的全面且严格评估

Bohan Lyu, Yucheng Yang, Siqiao Huang, Jiaru Zhang, Qixin Xu, Xinghan Li, Xinyang Han, Yicheng Zhang, Huaqing Zhang, Runhan Huang, Kaicheng Yang, Zitao Chen, Wentao Guo, Junlin Yang, Xinyue Ai, Wenhao Chai, Yadi Cao, Ziran Yang, Kun Wang, Dapeng Jiang, Huan-ang Gao, Shange Tang, Chengshuai Shi, Simon S. Du, Max Simchowitz, Jiantao Jiao, Dawn Song, Chi Jin

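AI summary Proposes the MLS-Bench benchmark, with 140 tasks across 12 domains, to evaluate whether AI systems can invent generalizable and scalable machine learning methods; finds that current agents still fall far short of humans at method invention and that the bottleneck is scientific insight rather than search or compute alone.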

详情
AI中文摘要

现代AI的进步由跨设置泛化且可扩展到更大规模的机器学习方法驱动。随着大型语言模型在推理、编码和工程任务中展现出高级能力,理解它们是否能发现此类方法(而不仅仅是应用现有方法)变得越来越重要。我们引入了MLS-Bench,一个用于评估AI系统能否发明通用且可扩展的机器学习方法的基准。MLS-Bench包含12个领域的140个任务,每个任务要求智能体改进机器学习系统或算法的一个目标组件,并证明该改进在受控设置和规模下具有泛化性。我们发现,当前智能体在可靠地超越人类设计的方法方面仍相差甚远,并且工程风格的调优对它们来说比真正的方法发明更容易。我们进一步研究了测试时缩放、自适应计算分配和上下文提供对智能体发现性能的影响,以及它们行为的案例研究。我们的分析表明,瓶颈不仅在于提出新方法,还在于规划、验证和扩展关于这些方法的声明所需的科学洞察。仅靠更多的搜索、计算或上下文并不能消除这一瓶颈。我们构建并维护一个社区平台以进行累积和可比较的迭代,并在https://mls-bench.com发布数据和代码。

英文摘要

Modern AI progress has been driven by ML methods that are generalizable across settings and scalable to larger regimes. As large language models demonstrate advanced capabilities in reasoning, coding, and engineering tasks, it is increasingly important to understand whether they can discover such methods rather than only apply existing ones. We introduce MLS-Bench, a benchmark for evaluating whether AI systems can invent generalizable and scalable ML methods. MLS-Bench contains 140 tasks across 12 domains, each requiring an agent to improve one targeted component of an ML system or algorithm and demonstrate that the improvement generalizes across controlled settings and scales. We find that current agents remain far from reliably surpassing human-designed methods, and that engineering-style tuning is easier for them than genuine method invention. We further study the effects of test-time scaling, adaptive compute allocation, and context provision on agents' discovery performance, together with case studies of their behavior. Our analyses suggest that the bottleneck is not only in proposing new methods, but also in the scientific insight needed to plan, validate, and scale claims about them. More search, compute, or context alone does not remove this bottleneck. We build and maintain a community platform for cumulative and comparable iteration, and release the data and code at https://mls-bench.com.

2604.24938 2026-05-28 cs.LG cs.AI cs.CL

Rethinking Layer Redundancy: Calibration Matters More Than Search in LLM Depth Pruning

重新思考层冗余:校准比搜索在LLM深度剪枝中更重要

Minkyu Kim, Vincent-Daniel Yun, Youngrae Kim, Suin Cho, Woosang Lim, Sunwoo Lee

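AI summary This paper finds experimentally that, in depth pruning of large language models, the calibration configuration affects pruning patterns and performance far more than the choice of search algorithm.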

Comments Preprint

详情
AI中文摘要

深度剪枝通过移除Transformer块来提高大型语言模型的推理效率。先前的工作通常将层冗余视为预训练网络固有的结构属性,强调重要性标准和搜索算法来识别可移除的层。在本研究中,我们从功能角度实证研究深度剪枝。通过评估不同校准配置和多种搜索算法下的代表性LLM系列,我们展示了不同配置会产生不同的剪枝模式。此外,在固定校准配置下,复杂的搜索算法相比简单的一次性方法仅带来边际性能提升,并收敛到相似的剪枝子集。总体而言,我们的结果表明,校准配置在塑造剪枝模式和校准困惑度方面比搜索算法的选择起着更大的作用,同时对下游推理准确性的方差贡献相当。这表明未来的剪枝工作可能受益于优先考虑校准配置而非搜索复杂性。

英文摘要

Depth pruning improves the inference efficiency of large language models by removing Transformer blocks. Prior work typically treats layer redundancy as an inherent structural property of pretrained networks, emphasizing importance criteria and search algorithms to identify removable layers. In this study, we empirically investigate depth pruning from a functional perspective. Evaluating representative LLM families across diverse calibration configurations and multiple search algorithms, we show that different configurations produce different pruning patterns. Furthermore, under a fixed calibration configuration, complex search algorithms yield marginal performance improvements over simple one-shot methods, converging to similar pruned subsets. Overall, our results suggest that the calibration configuration plays a substantially larger role than the choice of search algorithm in shaping pruning patterns and calibration perplexity, while contributing comparably to variance in downstream reasoning accuracy. This indicates that future pruning efforts may benefit from prioritizing the calibration configuration over search complexity.

2605.06915 2026-05-28 cs.LG

LLMs are not (consistently) Bayesian: Quantifying internal (in)consistencies of LLMs' probabilistic beliefs

LLMs 并非(一致地)贝叶斯:量化 LLMs 概率信念的内部(不)一致性

Chacha Chen, Matthew Jörke, Adam Goliński, Masha Fedzechkina, Guillermo Sapiro, Sinead Williamson, Nicholas Foti

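AI summary This paper introduces the information processing gap to study internal inconsistencies in how LLMs update probabilistic beliefs, and finds that non-Bayesian heuristic updates often outperform exact Bayesian computation on downstream tasks, indicating that the LLMs' probabilistic models of the world are misspecified.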

详情
AI中文摘要

现代人工智能系统正被部署在医学、科学和法律等复杂领域,在这些领域中,它们不仅需要产生正确的答案,还需要在新证据出现时表示和更新关于世界的不确定性信念。我们引入了一种新颖的技术,将 LLMs 作为信息处理规则进行研究,并利用信息处理差距来研究 LLMs 如何从证据中更新其概率信念的内部(不)一致性。我们的广泛实验评估了 LLMs 将证据纳入其信念的多种方法。其中一些方法产生(近乎)贝叶斯更新;其他方法似乎使用学习到的启发式。令人惊讶的是,非贝叶斯启发式更新在下游任务性能上通常优于精确贝叶斯计算——这表明 LLMs 的世界概率模型存在错误设定。最后,我们展示了我们的度量如何提供诊断,以识别 LLM 驱动的推理系统中的问题。

英文摘要

Modern AI systems are being deployed in complex domains such as medicine, science, and law, where it is important that they not only produce correct answers, but also represent and update uncertain beliefs about the world as new evidence arrives. We introduce the novel technique of studying LLMs as information processing rules and utilize the information processing gap to study the internal (in)consistencies of how LLMs update their probabilistic beliefs from evidence. Our extensive experiments evaluate multiple approaches in which LLMs can incorporate evidence into their beliefs. Some of these approaches produce (nearly) Bayesian updates; others seem to use a learned heuristic. Surprisingly, the non-Bayesian heuristic updates often outperform exact Bayesian computation in terms of downstream task performance -- indicating the LLMs' probabilistic models of the world are misspecified. Lastly, we show how our measure can provide diagnostics to identify issues with LLM-powered inferential systems.
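
For readers unfamiliar with the reference point, an exact Bayesian update over a discrete set of hypotheses looks as follows. This is a generic sketch and does not reproduce the paper's information processing gap; the KL divergence shown here is only one assumed way to compare a reported belief against the Bayesian posterior, and the hypothesis names in the usage note are made up.

```python
import math

def bayes_update(prior, likelihood):
    """Posterior over hypotheses given prior P(h) and likelihood P(evidence | h)."""
    joint = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(joint.values())
    return {h: p / evidence for h, p in joint.items()}

def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same hypotheses."""
    return sum(p[h] * math.log(p[h] / q[h]) for h in p if p[h] > 0)
```

With a prior of 0.01 on `"disease"` and likelihoods of 0.9 and 0.05 for a positive test under `"disease"` and `"healthy"`, the posterior on `"disease"` is 0.009 / 0.0585, roughly 0.154. A belief reported by a model after seeing the same evidence could then be compared to this posterior with `kl_divergence`.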

2510.25781 2026-05-28 cs.LG cs.AI cs.NA cs.NE math.NA

A Practitioner's Guide to Kolmogorov-Arnold Networks

Kolmogorov-Arnold网络实践指南

Amir Noorizadegan, Sifan Wang, Leevan Ling, Juan P. Dominguez-Morales

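AI summary This paper systematically reviews KANs, which are inspired by the Kolmogorov superposition theorem, covering theoretical foundations, design axes (basis functions), and recent advances, and provides a practical selection guide and future directions.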

详情
AI中文摘要

Kolmogorov-Arnold网络(KAN)的设计灵感来源于Kolmogorov叠加定理(而非由其决定),已成为MLP的结构化替代方案。本综述对快速扩展的KAN文献进行了系统全面的概述。综述围绕三个核心主题组织:(i)阐明KAN与Kolmogorov叠加理论(KST)、MLP和经典核方法之间的关系;(ii)将基函数作为中心设计轴进行分析;(iii)总结在准确性、效率、正则化和收敛性方面的最新进展。最后,我们提供了实用的“选择你的KAN”指南,并概述了开放的研究挑战和未来方向。随附的GitHub仓库为正在进行的KAN研究提供了结构化参考。

英文摘要

Kolmogorov-Arnold Networks (KANs), whose design is inspired, rather than dictated, by the Kolmogorov superposition theorem, have emerged as a structured alternative to MLPs. This review provides a systematic and comprehensive overview of the rapidly expanding KAN literature. The review is organized around three core themes: (i) clarifying the relationships between KANs and Kolmogorov superposition theory (KST), MLPs, and classical kernel methods; (ii) analyzing basis functions as a central design axis; and (iii) summarizing recent advances in accuracy, efficiency, regularization, and convergence. Finally, we provide a practical "Choose-Your-KAN" guide and outline open research challenges and future directions. The accompanying GitHub repository serves as a structured reference for ongoing KAN research.

2605.00435 2026-05-28 cs.CL cond-mat.dis-nn cs.AI nlin.CD

Escaping Mode Collapse in LLM Generation via Geometric Regulation

通过几何调控逃离大语言模型生成中的模式崩溃

Xin Du, Kumiko Tanaka-Ishii

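AI summary This paper takes a dynamical-systems view that interprets mode collapse as geometric collapse and proposes RMR, a lightweight online state-space intervention (regulating self-reinforcing directions in the Transformer value cache via low-rank damping), which substantially reduces mode collapse and enables stable generation at extremely low entropy rates.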

Comments Accepted to ICML 2026

详情
AI中文摘要

模式崩溃是生成建模中的一个持续挑战,在自回归文本生成中表现为从显式循环到逐渐失去多样性和轨迹过早收敛等行为。我们采用动力系统视角,将模式崩溃重新解释为由*几何崩溃*引起的状态空间可访问性降低:在生成过程中,模型的内部轨迹被限制在其表示空间的低维区域。这意味着模式崩溃并非纯粹的token级现象,无法通过符号约束或仅概率解码启发式可靠解决。基于这一视角,我们提出*强化模式调控*(RMR),一种轻量级的在线状态空间干预方法,用于调控Transformer值缓存中占主导地位的自强化方向(实现为低秩阻尼)。在多个大型语言模型上,RMR显著减少了模式崩溃,并能够在极低熵率(低至0.8 nats/步)下实现稳定生成,而标准解码通常在2.0 nats/步附近崩溃。

英文摘要

Mode collapse is a persistent challenge in generative modeling and appears in autoregressive text generation as behaviors ranging from explicit looping to gradual loss of diversity and premature trajectory convergence. We take a dynamical-systems view and reinterpret mode collapse as reduced state-space accessibility caused by *geometric collapse*: during generation, the model's internal trajectory becomes confined to a low-dimensional region of its representation space. This implies mode collapse is not purely a token-level phenomenon and cannot be reliably solved by symbolic constraints or probability-only decoding heuristics. Guided by this perspective, we propose *Reinforced Mode Regulation* (RMR), a lightweight, online state-space intervention that regulates dominant self-reinforcing directions in the Transformer value cache (implemented as low-rank damping). Across multiple large language models, RMR substantially reduces mode collapse and enables stable generation at extremely low entropy rates (down to 0.8 nats/step), whereas standard decoding typically collapses near 2.0 nats/step.
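
The phrase "low-rank damping" can be illustrated with a generic linear-algebra sketch. The abstract does not say how RMR identifies the dominant self-reinforcing directions, so the choice below (top right singular vectors of a cached value matrix) is purely an assumption, as are the function name and the `rank` and `damping` parameters.

```python
import numpy as np

def damp_dominant_directions(values, rank=1, damping=0.5):
    """Shrink the components of a (T, d) value matrix along its top `rank` directions.

    Assumption: dominant directions are taken to be the leading right
    singular vectors; the paper's actual criterion may differ.
    """
    _, _, vt = np.linalg.svd(values, full_matrices=False)
    basis = vt[:rank]                # (rank, d), orthonormal rows
    coeffs = values @ basis.T        # (T, rank) coordinates along those directions
    # Subtract a fraction of the dominant components; the rest is untouched.
    return values - damping * coeffs @ basis
```

With `damping=0.5`, each row keeps half of its component along the selected directions and all of its component in the orthogonal complement, so the update is a rank-`rank` correction to the cache.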

2508.05417 2026-05-28 cs.CV

Smoothing Slot Attention Iterations and Recurrences

平滑槽注意力迭代与循环

Rongzhen Zhao, Wenyan Yang, Juho Kannala, Joni Pajarinen

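AI summary To address the lack of sample-specific cues in cold-start queries on images and first video frames, and the homogeneous aggregation transforms across video frames, proposes SmoothSA, which smooths Slot Attention iterations and recurrences by preheating cold-start queries and differentiating iteration counts, improving object discovery, recognition, and reasoning.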

Comments Accepted to ICML 2026

详情
AI中文摘要

槽注意力(SA)是主流面向对象学习(OCL)的核心。图像特征可以通过SA迭代地细化冷启动查询槽来聚合成对象级表示。对于视频,这种聚合通过SA在帧间共享的循环进行,查询在第一帧冷启动,之后从上一帧的槽过渡。然而,冷启动查询缺乏样本特异性,从而阻碍了图像或视频第一帧的精确聚合;非第一帧的查询已经具有样本特异性,因此需要与第一帧不同的聚合变换。我们通过SmoothSA解决这些问题:(1)为了平滑图像或视频第一帧上的SA迭代,我们通过OCL内部自蒸馏的微型模块预热冷启动查询,使其具有丰富的输入特征信息;(2)为了平滑视频第一帧和非第一帧之间的SA循环,我们分别使用完整迭代和单次迭代来区分同质的聚合变换。在目标发现、识别和视觉推理上的综合实验验证了我们方法的有效性。进一步的视觉分析阐明了其潜在机制。我们的源代码、模型检查点和训练日志可在https://github.com/Genera1Z/SmoothSA获取。

英文摘要

Slot Attention (SA) lies at the heart of mainstream Object-Centric Learning (OCL). Image features can be aggregated into object-level representations by SA iteratively refining cold-start query slots. For video, such aggregation proceeds by SA recurrently shared across frames, with queries cold-started on the first frame and transitioned from the previous frame's slots thereafter. However, cold-start queries lack sample-specific cues, which hinders precise aggregation on an image or a video's first frame; non-first frames' queries are already sample-specific and thus require aggregation transforms different from those of the first frame. We address these issues with our SmoothSA: (1) To smooth SA iterations on an image or a video's first frame, we preheat cold-start queries with rich input-feature information, using a tiny module self-distilled inside OCL; (2) To smooth SA recurrences across a video's first and non-first frames, we differentiate the homogeneous aggregation transforms by using full and single iterations respectively. Comprehensive experiments on object discovery, recognition and visual reasoning validate our method's effectiveness. Further visual analyses illuminate the underlying mechanisms. Our source code, model checkpoints and training logs are provided on https://github.com/Genera1Z/SmoothSA.

2604.27251 2026-05-28 cs.CL cs.AI

Compliance versus Sensibility: On the Reasoning Controllability in Large Language Models

服从与感知:大型语言模型中的推理可控性研究

Xingwei Tan, Marco Valentino, Mahmud Elahi Akhter, Yuxiang Zhou, Maria Liakata, Nikolaos Aletras

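AI summary Through the lens of reasoning conflicts, systematically studies whether large language models prioritize compliance with instructions or sensibility when an induced logical schema conflicts with the one the task expects, and explores internal detection and activation-level intervention.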

详情
AI中文摘要

大型语言模型(LLMs)已知通过预训练数据中的共享推理模式获得推理能力,并通过思维链(CoT)实践进一步激发。然而,基本推理模式(如归纳、演绎和溯因)能否与具体问题实例解耦,仍然是模型可控性的关键挑战,并有助于阐明推理可控性。在本文中,我们首次通过推理冲突的视角系统研究这一问题:推理冲突是指通过强制使用偏离目标任务预期逻辑模式而引发的参数信息与上下文信息之间的显性张力。我们的评估表明,LLMs 始终优先考虑感知合理性而非服从性,尽管存在冲突指令,仍倾向于采用任务合适的推理模式。我们进一步证明推理冲突在内部是可检测的,因为在冲突期间置信度分数显著下降。探测实验确认推理类型从中间层到后期层线性编码,表明存在激活级可控性的潜力。利用这些见解,我们引导模型朝向服从性,将指令遵循度提高多达 29%。总体而言,我们的发现表明,虽然 LLM 推理锚定于具体实例,但主动的机制性干预可以有效地将逻辑模式与数据解耦,为改进可控性、忠实性和泛化性提供了一条路径。

英文摘要

Large Language Models (LLMs) are known to acquire reasoning capabilities through shared inference patterns in pre-training data, which are further elicited via Chain-of-Thought (CoT) practices. However, whether fundamental reasoning patterns, such as induction, deduction, and abduction, can be decoupled from specific problem instances remains a critical challenge for model controllability, and for shedding light on reasoning controllability. In this paper, we present the first systematic investigation of this problem through the lens of reasoning conflicts: an explicit tension between parametric and contextual information induced by mandating logical schemata that deviate from those expected for a target task. Our evaluation reveals that LLMs consistently prioritize sensibility over compliance, favoring task-appropriate reasoning patterns despite conflicting instructions. We further demonstrate that reasoning conflicts are internally detectable, as confidence scores significantly drop during conflicting episodes. Probing experiments confirm that reasoning types are linearly encoded from middle-to-late layers, indicating the potential for activation-level controllability. Leveraging these insights, we steer models towards compliance, increasing instruction following by up to 29%. Overall, our findings establish that while LLM reasoning is anchored to concrete instances, active mechanistic interventions can effectively decouple logical schemata from data, offering a path toward improved controllability, faithfulness, and generalizability.