arXivDaily: daily arXiv academic digest, updated Monday through Friday
2603.29917 2026-05-14 cs.CV

Diffusion-Based Feature Denoising with NNMF for Robust handwritten digit multi-class classification

Hiba Adil Al-kharsan, Róbert Rajkó

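Affiliations: Doctoral School of Computer Science, University of Szeged; University Research and Innovation Center (EKIK), Óbuda University

AI summary: This paper proposes a robust multi-class classification framework for handwritten digits that combines diffusion-driven feature denoising with a hybrid feature representation. Non-negative Matrix Factorization (NNMF) converts the input images into an interpretable feature representation, a convolutional neural network extracts deep features, and the two are fused into a unified hybrid representation. Stepwise diffusion noise is introduced in the feature space and a denoising network is trained to recover clean features, improving robustness to noise and adversarial attacks. Experiments show that the method achieves superior classification performance in both baseline and adversarial settings.

Abstract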

This work presents a robust multi-class classification framework for handwritten digits that combines diffusion-driven feature denoising with a hybrid feature representation. Inspired by our previous work on brain tumor classification, the proposed approach operates in feature space to improve robustness to noise and adversarial attacks. This manuscript is submitted as an extended abstract rather than a full-length, press-ready paper. First, the input images are converted into compact, interpretable representations using Non-negative Matrix Factorization (NNMF). In parallel, deep features are extracted using a convolutional neural network (CNN). These complementary features are combined into a unified hybrid representation. The main objective of this work is to extend our previously validated two-class framework to a multi-class handwritten digit classification scenario. To improve robustness, a stepwise diffusion process is applied in the feature space by gradually adding Gaussian noise. A feature denoiser network is trained to reverse this process and reconstruct clean representations from perturbed inputs. The denoised features are then used for multi-class classification. The proposed method is evaluated in both baseline and adversarial settings using AutoAttack. The experimental results show that the diffusion-based hybrid model is both effective and robust, outperforming the CNN baseline models while maintaining strong classification performance. These results demonstrate the effectiveness of feature-level diffusion defense for reliable multi-class handwritten digit classification.
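
The gradual addition of Gaussian noise in feature space can be illustrated with a minimal sketch. It assumes a standard DDPM-style closed-form forward process with a linear beta schedule; the schedule values, the step count, and the names `alpha_bar` and `noise_features` are illustrative assumptions, not details taken from the paper.

```python
import math
import random

def alpha_bar(t, num_steps=100, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) for s = 1..t under a linear schedule (assumed)."""
    prod = 1.0
    for s in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (s - 1) / (num_steps - 1)
        prod *= 1.0 - beta
    return prod

def noise_features(features, t, rng, num_steps=100):
    """Sample x_t = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * eps with eps ~ N(0, I)."""
    a = alpha_bar(t, num_steps)
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
            for x in features]

rng = random.Random(0)
hybrid = [0.5, 0.0, 1.2, 0.3]  # hypothetical concatenated NNMF + CNN feature vector
noisy = noise_features(hybrid, t=50, rng=rng)
```

A denoiser network would then be trained to map `noisy` back toward `hybrid`; that training loop is omitted here because the abstract does not describe it.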

2603.27134 2026-05-14 cs.LG

Factorization Regret mediates compositional generalization in latent space

John Schwarcz

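Affiliations: Edmond and Lily Safra Center for Brain Sciences

AI summary: This paper studies the barriers to generalization that can remain even when all relevant variables are known, and proposes a framework that treats compositional generalization as a variational inference problem over latent variables with parametric interactions. Using a Cognitive Gridworld environment, the author introduces "Factorization Regret", an information-theoretic metric that measures how latent-variable interactions affect task performance, and finds that for RNNs explicitly given the interaction information it explains performance differences between network architectures. The paper further proposes a new architecture, Representation Classification Chains (RCCs), which separates variable inference from parameter estimation and thereby achieves compositional generalization and offline learning in novel action spaces without explicit interaction information, providing a theoretical foundation for studying generalist goal-directed agents.

Abstract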

Are there still barriers to generalization once all of the relevant variables are known? We address this question via a framework that casts compositional generalization as a variational inference problem over latent variables with parametric interactions. To explore this framework, we develop the Cognitive Gridworld, a stationary Partially Observable Markov Decision Process (POMDP) in which observations are generated jointly by multiple latent variables, yet feedback is provided only for a single goal variable. This setting allows us to describe Factorization Regret: an information-theoretic quantity that measures the contribution of latent variable interactions to task performance. Using this metric, we first analyze Recurrent Neural Networks (RNNs) that are explicitly provided with the interactions and find that Factorization Regret explains the accuracy gap between Echo State and Fully Trained networks. Additionally, our analysis uncovers a theoretically predicted failure mode, where confidence becomes decoupled from accuracy. These results suggest that utilizing the interactions between relevant variables is a non-trivial capability. We then address a harder regime where the interactions themselves must be learned by an embedding model. Learning how variables interact while learning how to infer their values is a variational inference problem. We approach this dilemma via Representation Classification Chains (RCCs), a novel architecture which disentangles variable inference and parameter estimation. We demonstrate that, by learning how variables interact, RCCs facilitate compositional generalization to novel combinations of relevant variables and offline learning in novel action spaces. Together, these results establish a theoretically grounded setting for researching, developing and evaluating goal-directed generalist agents.

2603.26839 2026-05-14 cs.LG cs.CV

From Pixels to BFS: High Maze Accuracy Does Not Imply Visual Planning

Alberto G. Rodriguez Salgado

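Affiliations: Independent Researcher

AI summary: This study examines whether multimodal models solve visual spatial tasks through genuine planning or through brute-force search in text space. The author introduces MazeBench, a benchmark of 110 procedurally generated maze images, and evaluates 16 model configurations from OpenAI, Anthropic, Google, and Alibaba. Although some models reach high accuracy on visual mazes, they mainly solve them by converting the image into a text grid and then enumerating paths step by step rather than by genuine spatial planning, showing that high accuracy does not imply human-level spatial understanding.

Comments: 15 pages, 10 figures. Code and mazes available at https://github.com/alrod97/LLMs_mazes

Abstract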

How do multimodal models solve visual spatial tasks: through genuine planning, or through brute-force search in token space? We introduce MazeBench, a benchmark of 110 procedurally generated maze images across nine controlled groups, and evaluate 16 model configurations from OpenAI, Anthropic, Google, and Alibaba. GPT-5.4 solves 91% and Gemini 3.1 Pro 79%, but these scores are misleading: models typically translate images into text grids and then enumerate paths step by step, consuming 1,710 to 22,818 tokens per solve for a task humans do quickly. Without added reasoning budgets, all configurations score only 2-12%; on 20×20 ultra-hard mazes, they hit token limits and fail. Qualitative traces reveal a common two-stage strategy: image-to-grid translation followed by token-level search, effectively BFS in prose. A text-grid ablation shows Claude Sonnet 4.6 rising from 6% on images to 80% when given the correct grid, isolating weak visual extraction from downstream search. When explicitly instructed not to construct a grid or perform graph search, models still revert to the same enumeration strategy. MazeBench therefore shows that high accuracy on visual planning tasks does not imply human-like spatial understanding.
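
For reference, the search stage the abstract calls "BFS in prose" corresponds to ordinary breadth-first search over a text grid. The sketch below is a generic illustration, not code from the MazeBench repository; the grid encoding (`#` for walls, anything else passable) and the function name are assumptions.

```python
from collections import deque

def bfs_shortest_path(grid, start, goal):
    """Return the shortest list of (row, col) cells from start to goal, or None."""
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}          # also serves as the visited set
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:  # walk parent links back to the start
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] != "#" and (nr, nc) not in parent):
                parent[(nr, nc)] = cell
                queue.append((nr, nc))
    return None

maze = [
    "S.#",
    ".##",
    "..G",
]
path = bfs_shortest_path(maze, (0, 0), (2, 2))
```

Once the grid is correct, this search is cheap and exact, which is consistent with the ablation's finding that the weak link is visual extraction rather than the search itself.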

2603.24125 2026-05-14 cs.CL

Alignment Reduces Expressed but Not Encoded Gender Bias: A Unified Framework and Study

Nour Bouchouchi, Thibault Laugel, Xavier Renard, Christophe Marsala, Marie-Jeanne Lesot, Marcin Detyniecki

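Affiliations: Sorbonne Université, CNRS, LIP6; AXA; Polish Academy of Science, IBS PAN

AI summary: This study examines how social regularities learned by large language models during training lead to gender bias, and notes that existing debiasing methods focus mainly on bias in generated outputs rather than on the model's internal representations. The authors propose a unified framework that uses identical neutral prompts to analyze intrinsic and extrinsic gender bias together, and find that although alignment reduces bias in outputs, gender associations that can be reactivated may persist inside the model. They further show that debiasing effects measured on structured benchmarks may not hold in realistic usage scenarios.

Abstract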

During training, Large Language Models (LLMs) learn social regularities that can lead to gender bias in downstream applications. Most mitigation efforts focus on reducing bias in generated outputs, typically evaluated on structured benchmarks, which raises two concerns: output-level evaluation does not reveal whether alignment modifies the model's underlying representations, and structured benchmarks may not reflect realistic usage scenarios. We propose a unified framework to jointly analyze intrinsic and extrinsic gender bias in LLMs using identical neutral prompts, enabling direct comparison between gender-related information encoded in internal representations and bias expressed in generated outputs. Contrary to prior work reporting weak or inconsistent correlations, we find a consistent association between latent gender information and expressed bias when measured under the unified protocol. We further examine the effect of alignment through supervised fine-tuning aimed at reducing gender bias. Our results suggest that while the latter indeed reduces expressed bias, measurable gender-related associations are still present in internal representations, and can be reactivated under adversarial prompting. Finally, we consider two realistic settings and show that debiasing effects observed on structured benchmarks do not necessarily generalize, e.g., to the case of story generation.

2603.22910 2026-05-14 cs.CL

EchoKV: Efficient KV Cache Compression via Similarity-Based Reconstruction

Shiyu Ji, Yixuan Wang, Yijun Liu, Qingfu Zhu, Wanxiang Che

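Affiliations: Research Center for Social Computing and Interactive Robotics, Harbin Institute of Technology, China

AI summary: As the memory demand of the Key-Value (KV) cache grows in long-context applications of large language models, compressing the KV cache efficiently becomes a key problem. This paper proposes EchoKV, a flexible KV cache compression framework that exploits intra-layer and inter-layer similarities among attention heads and uses a lightweight network to reconstruct discarded KV components from a partial subset, supporting on-demand switching between full-cache and compressed-cache modes. Experiments show that EchoKV outperforms existing methods across multiple compression ratios and model architectures while preserving full-cache inference throughput in short-context scenarios.

Abstract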

The increasing memory demand of the Key-Value (KV) cache poses a significant bottleneck for Large Language Models (LLMs) in long-context applications. Existing low-rank KV compression methods reduce this footprint by modifying model projections, limiting the flexibility to switch back to standard full-cache inference when sufficient memory is available. In this paper, we propose EchoKV, a flexible KV cache compression framework that supports on-demand transitions from full KV caching to compressed caching. Unlike traditional compression-decompression paradigms, EchoKV utilizes a lightweight network to reconstruct the discarded KV components from a partial subset, exploiting intrinsic inter-layer and intra-layer similarities among attention heads. We further introduce a lightweight two-stage fine-tuning strategy, requiring only a few minutes on a single A100 GPU for a 7B model. Experimental results on LongBench and RULER demonstrate that EchoKV consistently outperforms existing methods across multiple compression ratios and backbone models while preserving the throughput of full-cache inference in short-context scenarios.

2603.22665 2026-05-14 cs.CL cs.LG

Improving LLM Final Representations with Inter-Layer Geometry

Tom Ulanovski, Eyal Blyachman, Maya Bechler-Speicher

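Affiliations: Blavatnik School of Computer Science; Tel Aviv University

AI summary: This paper studies how to improve LLM-based prediction by making better use of the representations from all of the model's layers. Conventional approaches use only the final-layer representation; the authors instead use a Graph Neural Network (GNN) that connects the LLM's layers to aggregate cross-layer information more efficiently. They further introduce the Cayley-Encoder, built on the Cayley graph structure of SL(2, Zn), which markedly improves predictive performance and efficiency, and validate it across multiple tasks and models while keeping parameter growth minimal.

Comments: 17 pages, 4 figures. Equal contribution by first two authors

Abstract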

The standard in LLM-based prediction is to use the final-layer representation as the input to a downstream predictor. However, intermediate layers may encode complementary task-relevant signals. Existing approaches therefore either search for the best layer for each task or apply expensive attention-based mechanisms to learn inter-layer aggregation. In this work, we first show that such complexity is unnecessary: a lightweight Graph Neural Network over a fully connected graph of LLM layers is more efficient and achieves significantly stronger predictive performance than existing approaches. We then introduce the Cayley-Encoder, which further improves both efficiency and predictive performance by replacing the fully connected graph with a Cayley graph over SL(2, Zn). These Cayley graphs provide a mathematically grounded topology that is sparse, regular by construction, and has low diameter. This enables effective communication across layers while constraining the aggregation structure and reducing the risk of GNN overfitting. In an evaluation of Cayley-Encoder across 13 tasks and 9 LLMs, Cayley-Encoder consistently outperforms baselines, achieving improvements of up to 40 percentage points in accuracy, while introducing at most 0.1% additional parameters relative to the LLM size. We further show that Cayley-Encoder is effective in few-shot regimes. Finally, we show that Cayley-Encoder outperforms LoRA fine-tuning while operating on the frozen LLM. We conclude with an explainability analysis showing that multiple layers contribute meaningfully to the final prediction, supporting our hypothesis.
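
To make the "sparse, regular by construction" property concrete, the following sketch enumerates a Cayley graph of SL(2, Z_n) by breadth of reachability from the identity. The generator set (the two elementary matrices and their inverses) is an assumption of this sketch, as is the function name; the abstract does not state which generators the Cayley-Encoder uses or how LLM layers are assigned to graph nodes.

```python
def cayley_graph_sl2(n):
    """Build a Cayley graph of SL(2, Z_n) as {node: set(neighbors)}.

    Nodes are 2x2 matrices stored as tuples (a, b, c, d) with ad - bc = 1 mod n.
    Generators (assumed): [[1, 1], [0, 1]], [[1, 0], [1, 1]] and their inverses.
    """
    gens = [(1, 1, 0, 1), (1, n - 1, 0, 1), (1, 0, 1, 1), (1, 0, n - 1, 1)]
    gens = [tuple(x % n for x in g) for g in gens]

    def mul(p, q):
        a, b, c, d = p
        e, f, g, h = q
        return ((a * e + b * g) % n, (a * f + b * h) % n,
                (c * e + d * g) % n, (c * f + d * h) % n)

    identity = (1 % n, 0, 0, 1 % n)
    graph = {identity: set()}
    frontier = [identity]
    while frontier:
        node = frontier.pop()
        for g in gens:
            nxt = mul(node, g)          # right-multiply by each generator
            graph[node].add(nxt)
            if nxt not in graph:
                graph[nxt] = set()
                frontier.append(nxt)
    return graph

graph = cayley_graph_sl2(3)
```

With these generators and n = 3, the graph has 24 nodes and every node has exactly 4 neighbors, compared with 23 neighbors per node in a fully connected graph of the same size.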

2603.22364 2026-05-14 cs.LG cs.AI cs.CV

MCLR: Improving Conditional Modeling via Inter-Class Likelihood-Ratio Maximization and Unifying Classifier-Free Guidance with Alignment Objectives

Xiang Li, Yixuan Jia, Xiao Li, Jeffrey A. Fessler, Rongrong Wang, Qing Qu

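Affiliations: University of Michigan; Michigan State University

AI summary: This paper proposes MCLR, a new training objective that improves the conditional generation ability of diffusion models by maximizing inter-class likelihood ratios. The method addresses the insufficient inter-class separation of standard denoising score matching (DSM) and introduces an alignment objective during training, so that the model achieves better conditional generation even without inference-time guidance (CFG). Theoretical analysis shows that the CFG-guided score is in fact the optimal solution to a sample-adaptive weighted MCLR objective, revealing an intrinsic connection between CFG and alignment objectives.

Abstract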

Diffusion models achieve strong performance in generative modeling, but their success often relies heavily on classifier-free guidance (CFG), an inference-time heuristic that modifies the sampling trajectory. In theory, diffusion models trained with standard denoising score matching (DSM) should recover the target data distribution, raising two fundamental questions: (i) why is inference-time guidance necessary in practice, and (ii) can its underlying effect be internalized into a principled training objective? In this work, we argue that a key limitation of standard DSM is insufficient inter-class separation. To address this issue, we propose MCLR, an alignment objective that explicitly maximizes inter-class likelihood-ratios during training. Fine-tuning diffusion models with MCLR induces CFG-like improvements under standard sampling, substantially improving guidance-free conditional generation and narrowing the gap to inference-time CFG. Beyond these empirical benefits, we show theoretically that the CFG-guided score is exactly the optimal solution to a sample-adaptive weighted MCLR objective. This result connects CFG to alignment-based objectives, providing a mechanistic interpretation of CFG as an implicit inference-time contrastive alignment procedure.
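
As background for the "CFG-guided score" mentioned above, here is a minimal sketch of the usual classifier-free guidance combination, written with the convention that a guidance scale of 1 recovers the conditional score. Scores are plain lists of floats and the function name is an assumption; the MCLR objective itself is not reproduced because the abstract does not give its form.

```python
def cfg_score(score_cond, score_uncond, guidance_scale):
    """Extrapolate from the unconditional score toward the conditional one.

    guidance_scale = 0 gives the unconditional score, 1 gives the conditional
    score, and values above 1 push further in the conditional direction.
    """
    return [u + guidance_scale * (c - u)
            for c, u in zip(score_cond, score_uncond)]

guided = cfg_score([1.0, 2.0], [0.0, 1.0], guidance_scale=2.0)
```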

2603.20527 2026-05-14 cs.LG

RMNP: Row-Momentum Normalized Preconditioning for Scalable Matrix-Based Optimization

Shenyang Deng, Zhuoli Ouyang, Tianyu Pang, Zihang Liu, Ruochen Jin, Shuhua Yu, Yaoqing Yang

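Affiliations: Dartmouth College; International Computer Science Institute; University of California, Berkeley; Meta

AI summary: This paper proposes RMNP, an optimization algorithm for improving the efficiency of matrix-based training of deep neural networks. By replacing Newton-Schulz iteration with a row-wise normalization over the input dimension, the method substantially lowers computational complexity while keeping optimization performance comparable to methods such as Muon. Theoretical analysis shows that RMNP converges in the non-convex setting with a rate matching the optimal complexity, and experiments show good results in large language model pretraining with a large reduction in preconditioning time.

Comments: The 43rd International Conference on Machine Learning (ICML 2026)

Abstract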

Preconditioned adaptive methods have gained significant attention for training deep neural networks, as they capture rich curvature information of the loss landscape. The central challenge in this field lies in balancing preconditioning effectiveness with computational efficiency of implementing the preconditioner. Among recent advances, Muon stands out by using Newton-Schulz iteration to obtain preconditioned updates without explicitly constructing the preconditioning matrix. Despite its advantages, the efficiency of Muon still leaves room for further improvement. In this paper, we introduce RMNP (Row Momentum Normalized Preconditioning), an optimizer that replaces Newton-Schulz iteration with a simple row-wise ($d_{\text{in}}$) $\ell_2$ normalization operation, motivated by the empirically observed diagonal block structure of the Transformer layerwise Hessian. We empirically verified that orthogonalization and row-wise (on input dim) $\ell_2$ normalization are asymptotically equivalent in the case of the transformer. This substitution reduces the per-iteration computational complexity from ${O}(mn\cdot\min(m,n))$ to ${O}(mn)$ for an $m\times n$ weight matrix while maintaining comparable optimization performance. Theoretically, we establish convergence guarantees for RMNP in the non-convex setting that match recent results for Muon optimizers, achieving the minimax optimal complexity. Extensive experiments on large language model pretraining show that RMNP delivers competitive optimization performance compared with Muon while substantially reducing preconditioning wall-clock time. Our code is available at https://github.com/Dominator-Index/RMNP.
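
The row-wise normalization that replaces Newton-Schulz iteration can be sketched in a few lines. This is an illustration only: the momentum rule, the learning rate, the choice of which matrix axis counts as a "row", and the names `row_normalize` and `rmnp_like_step` are assumptions, and the authors' repository should be consulted for the actual optimizer.

```python
import math

def row_normalize(matrix, eps=1e-8):
    """Scale each row of a matrix (list of lists) to unit l2 norm."""
    out = []
    for row in matrix:
        norm = math.sqrt(sum(x * x for x in row))
        out.append([x / (norm + eps) for x in row])
    return out

def rmnp_like_step(weights, momentum, grad, lr=0.1, beta=0.9):
    """One illustrative update: accumulate momentum, row-normalize it, step."""
    momentum = [[beta * m + (1 - beta) * g for m, g in zip(m_row, g_row)]
                for m_row, g_row in zip(momentum, grad)]
    update = row_normalize(momentum)
    weights = [[w - lr * u for w, u in zip(w_row, u_row)]
               for w_row, u_row in zip(weights, update)]
    return weights, momentum

weights, momentum = rmnp_like_step([[1.0, 1.0]], [[0.0, 0.0]], [[3.0, 4.0]])
```

Each entry of the matrix is touched a constant number of times, so the normalization costs time linear in the number of entries, in line with the $O(mn)$ figure quoted in the abstract.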

2603.20521 2026-05-14 cs.LG cs.AI math.OC stat.ML

Delightful Distributed Policy Gradient

Ian Osband

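Affiliations: Google DeepMind

AI summary: When distributed reinforcement learning trains on data generated by stale, buggy, or mismatched actors, it is vulnerable to actions with high surprisal (negative log-probability), which degrades learning. The Delightful Policy Gradient (DG) method proposed in this paper uses the product of advantage and surprisal as a gating mechanism, suppressing high-surprisal failures while preserving high-surprisal successes, and thereby improves learning efficiency. Experiments show that DG has a significant sample-efficiency advantage over conventional methods in a range of complex scenarios, and the advantage becomes more pronounced as task complexity increases.

Abstract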

Distributed reinforcement learning trains on data from stale, buggy, or mismatched actors, producing actions with high surprisal (negative log-probability) under the learner's policy. The core difficulty is not surprising data per se, but negative learning from surprising data. High-surprisal failures can dominate finite-batch updates through large perpendicular components, while high-surprisal successes reveal opportunities the current policy would otherwise miss. The Delightful Policy Gradient (DG) separates these cases by gating each update with delight, the product of advantage and surprisal, suppressing rare failures and preserving rare successes without behavior probabilities. In a tabular analysis, DG suppresses the perpendicular second moment of high-surprisal failures by a policy-overlap factor that vanishes as the learner improves. The advantage sign is essential for surprisal-based filtering: any learner-probability-only gate that suppresses rare failures also suppresses rare successes. On MNIST with simulated staleness, DG without off-policy correction outperforms importance-weighted PG with exact behavior probabilities. On a transformer sequence task with staleness, actor bugs, reward corruption, and rare discovery, DG often achieves nearly order-of-magnitude lower error. When all four frictions act simultaneously, its sample-efficiency advantage is order-of-magnitude and grows with task complexity.
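
The gating idea can be sketched as follows. The abstract only states that delight is the product of advantage and surprisal; the sigmoid form of the gate, the `temperature` parameter, and the function name are assumptions made here for illustration.

```python
import math

def delight_gate(advantage, log_prob, temperature=1.0):
    """Gate in (0, 1) computed from delight = advantage * surprisal.

    log_prob is the learner's log-probability of the action. The sigmoid
    squashing is an assumed choice, not taken from the abstract.
    """
    surprisal = -log_prob
    delight = advantage * surprisal
    return 1.0 / (1.0 + math.exp(-delight / temperature))

rare = math.log(1e-4)                       # an action the learner finds very surprising
gate_rare_failure = delight_gate(-1.0, rare)
gate_rare_success = delight_gate(+1.0, rare)
```

Under this assumed gate, a rare failure is almost entirely suppressed while a rare success passes nearly unchanged. A gate that looked at the learner's probability alone could not tell the two apart, which is the point the abstract makes about the advantage sign.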

2603.15854 2026-05-14 cs.LG cs.AI cs.CL

FlashSampling: Fast and Memory-Efficient Exact Sampling

Tomas Ruiz, Zhen Qin, Yifan Zhang, Xuyang Shen, Yiran Zhong, Mengdi Wang

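Affiliations: LMU Munich; Princeton University

AI summary: This paper proposes FlashSampling, an efficient exact sampling method that addresses the extra memory traffic and compute overhead caused by sampling in large-vocabulary decoding. The method fuses sampling directly into the matrix multiplication of the language model's output layer and avoids explicitly storing the logits tensor in GPU memory, which markedly improves memory efficiency and speed. Experiments show kernel-level speedups on several datacenter-class GPUs and, in the end-to-end vLLM framework, a reduction in time per output token of up to 10%.

Comments: Project Page: https://github.com/FlashSampling/FlashSampling

Abstract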

Sampling from a categorical distribution is mathematically simple, but in large-vocabulary decoding, it often triggers extra memory traffic and extra kernels after the LM head. We present FlashSampling, an exact sampling primitive that fuses sampling into the LM-head matmul and never materializes the logits tensor in HBM. The method is simple: compute logits tile-by-tile on chip, add Gumbel noise, keep only one maximizer per row and per vocabulary tile, and finish with a small reduction over tiles. In tensor-parallel decoding, FlashSampling replaces the all-gather of logits with streaming peer-to-peer writes. This overlaps GPU-to-GPU communication with computation and HBM loads across up to 8 GPUs, with near-ideal scaling at large batch sizes. Our kernel is exact because argmax decomposes over partitions; grouped variants for online and tensor-parallel settings are exact by hierarchical factorization of the categorical distribution. FlashSampling demonstrates kernel-level speedups on decode workloads across 4 different datacenter GPUs (H100, H200, B200, B300), and in end-to-end vLLM experiments, it reduces time per output token by up to 10% on the models we test. These results show that exact sampling, with no approximation, can be integrated into the matmul itself, consolidating the bandwidth-bound sampling step in an efficient epilogue.
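
The claim that "argmax decomposes over partitions" can be checked with a small CPU-side sketch of the Gumbel-max trick applied tile by tile. This is plain Python for illustration, not the fused GPU kernel; the function names and tile layout are assumptions.

```python
import math
import random

def gumbel(rng):
    """Draw standard Gumbel noise by inverse transform sampling."""
    u = rng.random()
    while u == 0.0:                 # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))

def tiled_gumbel_max(logits, tile_size, rng):
    """Sample an index from softmax(logits), keeping one candidate per tile."""
    best_per_tile = []
    for start in range(0, len(logits), tile_size):
        tile = logits[start:start + tile_size]
        perturbed = [(x + gumbel(rng), start + i) for i, x in enumerate(tile)]
        best_per_tile.append(max(perturbed))   # one maximizer per tile
    return max(best_per_tile)[1]               # small reduction over tiles

logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0]
tiled = tiled_gumbel_max(logits, 2, random.Random(0))
untiled = tiled_gumbel_max(logits, len(logits), random.Random(0))
```

With the same noise draws, the tiled and untiled calls return the same index, because the maximum of per-tile maxima equals the global maximum. Only one (value, index) pair per tile needs to be kept, so the full perturbed logits vector never has to be stored.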

2603.13054 2026-05-14 cs.CV

Topo-R1: Detecting Topological Anomalies via Vision-Language Models

Meilong Xu, Qingqiao Hu, Xiaoling Hu, Shahira Abousamra, Xin Yu, Weimin Lyu, Kehan Qi, Dimitris Samaras, Chao Chen

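Affiliations: Stony Brook University; Massachusetts General Hospital and Harvard Medical School; Stanford University; Penn State University

AI summary: This study explores how to use vision-language models (VLMs) to detect topological anomalies in tubular network structures, such as broken connections, spurious connections, and missing or extra branches in blood vessels, nerve fibers, and road networks. It finds that existing VLMs have poor topological perception, performing nearly at random. The authors therefore build a large-scale benchmark dataset containing diverse topological perturbations and propose Topo-R1, which uses a composite reward combining localization, classification, and structural fidelity to markedly improve topological anomaly detection, outperforming general-purpose VLMs and approaching supervised methods.

Comments: 26 pages, 6 figures

Abstract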

Topology is critical in tubular structures such as blood vessels, nerve fibers, and road networks, where connectivity and loop structure govern downstream functional analysis. Vision-Language Models (VLMs) are promising candidates for understanding such structures, given their reasoning and grounding capabilities. To probe their topological perception, we systematically evaluate leading closed- and open-source VLMs on localizing and classifying four canonical topological anomalies (broken/spurious connections, missing/extra branches) in tubular-network segmentation masks. They perform nearly at random, indicating that topology-aware perception is largely absent from current general-purpose VLMs. As no existing resource pairs segmentation masks with localized anomaly annotations, we build an automated, multi-domain data-curation pipeline that synthesizes diverse topological perturbations with verifiable Betti-number annotations across graduated difficulty levels, yielding the first systematic benchmark with a large-scale training set and held-out in-distribution (ID) and out-of-distribution (OOD) test suites. Building on this benchmark, we introduce Topo-R1, centered on a topology-aware composite reward that jointly scores localization, classification, and skeleton-level structural fidelity. Supervised fine-tuning cold-starts schema-compliant outputs, and Group Relative Policy Optimization (GRPO) then optimizes the policy against this reward, steering predictions toward topologically meaningful structure rather than superficial pixel overlap. Extensive experiments show that Topo-R1 substantially outperforms general-purpose VLMs and matches or exceeds supervised baselines across ID, OOD, and real-segmentation-output protocols, establishing a strong foundation for VLM-based topological understanding of structured visual data.

2603.10305 2026-05-14 cs.LG physics.ao-ph

Data-Driven Integration Kernels for Interpretable Nonlocal Operator Learning

Savannah L. Ferretti, Jerry Lin, Sara Shamekh, Jane W. Baldwin, Michael S. Pritchard, Tom Beucler

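Affiliations: Department of Earth System Science; University of California, Irvine; Department of Computing and Data Science; Boston University; The Center for Atmosphere Ocean Science; New York University; NVIDIA Corporation; Faculty of Geosciences and Environment; University of Lausanne; Expertise Center for Climate Extremes

AI summary: This study proposes data-driven integration kernels for interpretable nonlocal operator learning, aiming to address the poor interpretability and overfitting that come with integrating nonlocal information in climate models. By separating nonlocal information aggregation from local nonlinear prediction, the method uses learnable integration kernels to form weighted integrals of spatiotemporal features, which markedly reduces model parameters and improves interpretability. Experiments on South Asian monsoon precipitation prediction show that the framework greatly reduces model complexity while maintaining predictive performance.

Comments: Presented at Climate Informatics 2026 (14 pages, 5 figures, 1 table)

Abstract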

Machine learning models can represent climate processes that are nonlocal in horizontal space, height, and time, often by combining information across these dimensions in highly nonlinear ways. While this can improve predictive skill, it makes learned relationships difficult to interpret and prone to overfitting as the extent of nonlocal information grows. We address this challenge by introducing data-driven integration kernels, a framework that adds structure to nonlocal operator learning by explicitly separating nonlocal information aggregation from local nonlinear prediction. Each spatiotemporal predictor field is first integrated using learnable kernels (defined as continuous weighting functions over horizontal space, height, and/or time), after which a local nonlinear mapping is applied only to the resulting kernel-integrated features and optional local inputs. This design confines nonlinear interactions to a small set of integrated features and makes each kernel directly interpretable as a weighting pattern that reveals which horizontal locations, vertical levels, and past timesteps contribute most to the prediction. We demonstrate the framework for South Asian monsoon precipitation using a hierarchy of neural network models with increasing structure, including baseline, nonparametric kernel, and parametric kernel models. Across this hierarchy, kernel models achieve near-baseline performance with far fewer trainable parameters, indicating that much of the relevant nonlocal information can be captured through a small set of interpretable integrations when appropriate structural constraints are imposed.
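
The separation between kernel integration and local prediction can be illustrated with a one-dimensional sketch over vertical levels. The Gaussian kernel shape, its `center` and `width` parameters (which would be learned in practice), and the function names are assumptions; the abstract does not specify the parametric family used.

```python
import math

def gaussian_kernel(levels, center, width):
    """Parametric kernel (assumed Gaussian): weights over levels, summing to 1."""
    raw = [math.exp(-0.5 * ((z - center) / width) ** 2) for z in levels]
    total = sum(raw)
    return [w / total for w in raw]

def kernel_integrate(profile, weights):
    """Collapse a nonlocal profile into a single feature via a weighted sum."""
    return sum(w * x for w, x in zip(weights, profile))

levels = [0.0, 1.0, 2.0, 3.0, 4.0]            # hypothetical vertical coordinate
weights = gaussian_kernel(levels, center=2.0, width=1.0)
humidity = [0.9, 0.8, 0.6, 0.3, 0.1]          # hypothetical predictor profile
feature = kernel_integrate(humidity, weights)
```

A local nonlinear model would then take `feature` (and any optional local inputs) as its only inputs, and the `weights` list can be read directly as the pattern of levels that contribute most.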

2603.07433 2026-05-14 cs.LG cs.CV

Data Agent: Learning to Select Data via End-to-End Dynamic Optimization

Suorong Yang, Fangjian Su, Hai Gan, Ziqi Ye, Jie Li, Baile Xu, Furao Shen, Soujanya Poria

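Affiliations: National University of Singapore; Nanjing University; Nanyang Technological University

AI summary: This paper proposes Data Agent, an end-to-end dynamic data selection framework that aims to accelerate model training by prioritizing informative samples during online training. Its core idea is to formulate data selection as a training-aware sequential decision-making problem and to learn, through a composite reward that combines loss and confidence, a sample selection policy that co-evolves with model optimization. Experiments show that Data Agent improves training efficiency while maintaining or improving performance across multiple datasets and model architectures, and that it is general and robust enough for a range of practical scenarios.

Abstract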

Dynamic data selection aims to accelerate training by prioritizing informative samples during online training. However, existing methods typically rely on task-specific handcrafted metrics or static/snapshot-based criteria to estimate sample importance, limiting scalability across learning paradigms and making it difficult to capture the evolving utility of data throughout training. To address this challenge, we propose Data Agent, an end-to-end dynamic data selection framework that formulates data selection as a training-aware sequential decision-making problem. The agent learns a sample-wise selection policy that co-evolves with model optimization, guided by a composite reward that integrates loss-based difficulty and confidence-based uncertainty signals. The reward signals capture complementary objectives of optimization impact and information gain, together with a tuning-free adaptive weighting mechanism that balances these signals over training. Extensive experiments across a wide range of datasets and architectures demonstrate that Data Agent consistently accelerates training while preserving or improving performance, e.g., reducing costs by over 50% on ImageNet-1k and MMLU with lossless performance. Moreover, its dataset-agnostic formulation and modular reward make it plug-and-play across tasks and scenarios, e.g., robustness to noisy datasets, highlighting its potential in real-world scenarios. Code is available at https://github.com/Jackbrocp/Data-Agent.

2603.05582 2026-05-14 cs.LG cs.CV

Bias In, Bias Out? Finding Unbiased Subnetworks in Vanilla Models

Ivan Luiz De Moura Matos, Abdel Djalil Sad Saoud, Ekaterina Iakovleva, Vito Paolo Pastore, Enzo Tartaglione

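Affiliations: LTCI, Télécom Paris, Institut Polytechnique de Paris, France

AI summary: This paper explores how to extract unbiased subnetworks from conventionally trained deep learning models in order to reduce algorithmic bias. It proposes BISE, a method that identifies and isolates "bias-free" subnetworks already present in the model through pruning, without additional data or retraining. The method reduces reliance on biased features while maintaining model performance, offering a structural-adaptation route to efficient bias mitigation. Experiments show superior performance and computational efficiency on several benchmark datasets.

Comments: This work has been accepted for publication at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026

Abstract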

The issue of algorithmic biases in deep learning has led to the development of various debiasing techniques, many of which perform complex training procedures or dataset manipulation. However, an intriguing question arises: is it possible to extract fair and bias-agnostic subnetworks from standard vanilla-trained models without relying on additional data, such as an unbiased training set? In this work, we introduce Bias-Invariant Subnetwork Extraction (BISE), a learning strategy that identifies and isolates "bias-free" subnetworks that already exist within conventionally trained models, without retraining or finetuning the original parameters. Our approach demonstrates that such subnetworks can be extracted via pruning and can operate without modification, effectively relying less on biased features and maintaining robust performance. Our findings contribute towards efficient bias mitigation through structural adaptation of pre-trained neural networks via parameter removal, as opposed to costly strategies that are either data-centric or involve (re)training all model parameters. Extensive experiments on common benchmarks show the advantages of our approach in terms of the performance and computational efficiency of the resulting debiased model.

2603.05094 2026-05-14 cs.SD

TW-Sound580K: A Regional Audio-Text Dataset with Verification-Guided Curation for Localized Audio-Language Modeling

Hao-Hui Xie, Ho-Lam Chung, Yi-Cheng Lin, Ke-Han Lu, Wenze Ren, Xie Chen, Hung-yi Lee

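Affiliations: National Taiwan University; Shanghai Jiao Tong University

AI summary: This paper presents TW-Sound580K, a Taiwanese audio-text instruction dataset built through a Verify-Generate-Critique (VGC) protocol, aiming to address the poor performance of large audio-language models on localized dialectal prosody caused by the lack of specialized corpora. The dataset uses Dual-ASR validation to filter 522,000 raw audio clips and expands them into 580,000 high-quality instruction pairs. The Tai-LALM model trained on this dataset reaches 49.1% accuracy on the TAU benchmark, 6.5 percentage points above the zero-shot baseline, confirming that combining regional corpora with rigorous curation and dynamic arbitration improves performance on localized speech tasks.

Abstract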

Large Audio-Language Models (LALMs) typically struggle with localized dialectal prosody due to the scarcity of specialized corpora. We present TW-Sound580K, a Taiwanese audio-text instruction dataset developed through a Verify-Generate-Critique (VGC) protocol. This pipeline leverages Dual-ASR validation to filter 522K raw clips, subsequently expanding them into 580,000 high-fidelity instruction pairs using a teacher model. The dataset's utility is demonstrated through Tai-LALM, which fine-tunes a DeSTA 2.5-Audio-initialized backbone and incorporates a dynamic Dual-ASR Arbitration strategy to optimize transcription selection during inference. On the TAU Benchmark, Tai-LALM reaches 49.1% accuracy, marking a 6.5% absolute improvement over the zero-shot baseline (42.6% with ASR text conditioning). This confirms that integrating regional corpora with rigorous curation and dynamic arbitration significantly enhances LALM performance on localized speech.

2603.03295 2026-05-14 cs.CL cs.AI cs.CY

Language Model Goal Selection Differs from Humans' in a Self-Directed Learning Task

Gaia Molinaro, Dave August, Danielle Perszyk, Anne G. E. Collins

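Affiliations: University of California, Berkeley; Amazon AGI Lab

AI summary: This study examines whether large language models (LLMs) select goals in a self-directed learning task the way humans do. Comparing five mainstream models with human participants, it finds substantial differences in goal selection: most models tend to rely on a single solution or show low learning flexibility, whereas humans show more exploration and individual diversity. The study notes that although chain-of-thought reasoning and persona steering slightly improve model behavior, current models still fail to accurately reflect the uniqueness of human goal selection, which calls for caution when substituting them for human decision-making in related applications.

Abstract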

Whether in agentic workflows, social studies, or chat settings, large language models (LLMs) are increasingly being asked to replace humans in choosing which goals to pursue, rather than completing predefined tasks. However, the assumption that LLMs accurately reflect human preferences for goal setting remains largely untested. We assess the validity of LLMs as proxies for human goal selection in a controlled, self-directed learning task borrowed from cognitive science. Across five models (GPT-5, Gemini 2.5 Pro, Claude Sonnet 4.5, Qwen3 32B, and Centaur), we find substantial divergence from human behavior. While people gradually explore and learn to achieve goals with diversity across individuals, most models exploit a single identified solution or show surprisingly low performance, with distinct patterns across models and little variability across instances of the same model. Chain-of-thought reasoning and persona steering provide limited improvements, and our conclusions hold across experimental settings. While they await confirmation in applied settings, these findings highlight the uniqueness of human goal selection and caution against its replacement with current models.

2603.02337 2026-05-14 cs.LG cs.AI cs.CV

Preconditioned Flow Matching

Shadab Ahamed, Eshed Gal, Md Shahriar Rahim Siddiqui, Simon Ghyselincks, Moshe Eliasof, Eldad Haber

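Affiliations: University of British Columbia; University of Cambridge

AI summary: This paper studies a geometric optimization bottleneck in training flow matching models: when the covariance of the intermediate distributions is ill-conditioned, gradient descent converges at markedly different rates along different directions. The authors propose preconditioned flow matching, which transforms the target distribution into a more isotropic representation to improve the conditioning of the intermediate path, and thereby training efficiency and generation quality. Experiments report significant gains across several distributions and high-resolution image datasets.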

Comments 34 pages, 16 figures, 5 tables

Abstract

Flow matching (FM) learns vector fields by regressing stochastic velocity targets along intermediate distributions $p_t$. We identify a geometric optimization bottleneck in this regression problem: when the covariance $\Sigma_t$ of $p_t$ is ill-conditioned, gradient-based training rapidly fits high-variance directions while making slow progress along low-variance ones. In an exactly solvable Gaussian setting, we prove that the excess risk is weighted by $\Sigma_t$, and that both gradient descent and stochastic gradient descent inherit condition-number-dependent convergence. We then extend the analysis to Gaussian mixtures, showing that multimodality does not average away this effect; instead, the slowest and worst-conditioned component can control optimization. Motivated by this analysis, we propose preconditioned flow matching, a precondition-then-match framework that transforms the target distribution into a more isotropic representation, trains the main flow in the transformed space, and maps generated samples back through the inverse transformation. We show theoretically that preconditioning reshapes the intermediate FM path and improves its conditioning. Across controlled Gaussian and Gaussian-mixture experiments, latent MNIST and other high-resolution image datasets up to $512{\times}512$ resolution, preconditioning improves path-conditioning diagnostics, low-eigenvalue recovery, FID, MMD, precision, and recall. Compute-matched baselines and preconditioner-quality ablations further show that the gains are not explained merely by additional preconditioner parameters, but by improved geometry of the downstream flow matching problem.
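
The precondition-then-match idea can be illustrated with a minimal sketch. It assumes the simplest possible preconditioner, a fixed affine whitening map estimated from the data, and a linear interpolation path with a standard Gaussian source; the abstract does not say which preconditioner or path the authors use, so every function name here is hypothetical. The sketch only measures the condition number of the intermediate covariance before and after preconditioning and checks that samples map back through the inverse transformation.

```python
import numpy as np

def fit_whitening(x, eps=1e-6):
    """Estimate an affine whitening map (assumed preconditioner) from samples of shape (n, d)."""
    mean = x.mean(axis=0)
    cov = np.cov(x, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    evals = np.maximum(evals, eps)                     # guard against tiny eigenvalues
    W = evecs @ np.diag(evals ** -0.5) @ evecs.T       # Sigma^{-1/2}
    W_inv = evecs @ np.diag(evals ** 0.5) @ evecs.T    # Sigma^{1/2}
    return mean, W, W_inv

def to_preconditioned(x, mean, W):
    """Map data into the more isotropic space where the flow would be trained."""
    return (x - mean) @ W.T

def from_preconditioned(z, mean, W_inv):
    """Map generated samples back through the inverse transformation."""
    return z @ W_inv.T + mean

def path_condition_number(x1, t, rng):
    """Condition number of Cov[x_t] for x_t = (1 - t) x0 + t x1 with x0 ~ N(0, I)."""
    x0 = rng.standard_normal(x1.shape)
    xt = (1.0 - t) * x0 + t * x1
    evals = np.linalg.eigvalsh(np.cov(xt, rowvar=False))
    return evals[-1] / evals[0]

rng = np.random.default_rng(0)
# Strongly anisotropic 2-D Gaussian target (standard deviations 10 and 0.1).
x1 = rng.standard_normal((20000, 2)) @ np.diag([10.0, 0.1]) + np.array([3.0, -1.0])
mean, W, W_inv = fit_whitening(x1)
z1 = to_preconditioned(x1, mean, W)

kappa_raw = path_condition_number(x1, 0.9, rng)   # ill-conditioned intermediate covariance
kappa_pre = path_condition_number(z1, 0.9, rng)   # close to 1 after whitening
```

In this toy setting the raw path at t = 0.9 has a condition number in the thousands, while the whitened path is nearly isotropic. A learned or nonlinear preconditioner would replace `fit_whitening` without changing the surrounding structure.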

2603.02175 2026-05-14 cs.CV cs.AI

Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance

Yiqi Lin, Guoqiang Liang, Ziyun Zeng, Zechen Bai, Yanzhe Chen, Mike Zheng Shou

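Affiliations: Show Lab, National University of Singapore

AI summary: This paper proposes Kiwi-Edit, a general-purpose video editing method that achieves more precise visual control through joint guidance from instructions and reference images. To address the data scarcity that limits existing methods, the authors design a scalable data generation pipeline and build the large-scale RefVIE dataset and the RefVIE-Bench evaluation benchmark. Trained on this dataset, the unified Kiwi-Edit architecture fuses learnable queries with latent visual features to guide edits with reference semantics, yielding significant gains in instruction following and reference fidelity and reaching the state of the art in controllable video editing.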

Comments Project page: https://showlab.github.io/Kiwi-Edit Huggingface Demo: https://huggingface.co/spaces/linyq/KiwiEdit

Abstract

Instruction-based video editing has witnessed rapid progress, yet current methods often struggle with precise visual control, as natural language is inherently limited in describing complex visual nuances. Although reference-guided editing offers a robust solution, its potential is currently bottlenecked by the scarcity of high-quality paired training data. To bridge this gap, we introduce a scalable data generation pipeline that transforms existing video editing pairs into high-fidelity training quadruplets, leveraging image generative models to create synthesized reference scaffolds. Using this pipeline, we construct RefVIE, a large-scale dataset tailored for instruction-reference-following tasks, and establish RefVIE-Bench for comprehensive evaluation. Furthermore, we propose a unified editing architecture, Kiwi-Edit, that synergizes learnable queries and latent visual features for reference semantic guidance. Our model achieves significant gains in instruction following and reference fidelity via a progressive multi-stage training curriculum. Extensive experiments demonstrate that our data and architecture establish a new state-of-the-art in controllable video editing. All datasets, models, and code are released at https://github.com/showlab/Kiwi-Edit.

2602.23089 2026-05-14 cs.LG

Physics-informed neural particle flow for the Bayesian update step

Domonkos Csuzdi, Tamás Bécsi, Olivér Törő

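Affiliations: Department of Control for Transportation and Vehicle Systems, Faculty of Transportation Engineering and Vehicle Engineering, Budapest University of Technology and Economics

AI summary: This paper proposes a physics-informed neural particle flow to address the computational difficulty of the Bayesian update step in high-dimensional nonlinear estimation. By coupling the log-homotopy trajectory from prior to posterior with the continuity equation for density evolution, the authors derive a master PDE, embed it in the loss function as a physical constraint, and train a neural network to approximate the transport velocity field, which allows unsupervised training without ground-truth posterior samples. Experiments show better mode coverage and robustness on multimodal benchmarks and a complex nonlinear scenario.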

Abstract

The Bayesian update step poses significant computational challenges in high-dimensional nonlinear estimation. While log-homotopy particle flow filters offer an alternative to stochastic sampling, existing formulations usually yield stiff differential equations. Conversely, existing deep learning approximations typically treat the update as a black-box task or rely on asymptotic relaxation, neglecting the exact geometric structure of the finite-horizon probability transport. In this work, we propose a physics-informed neural particle flow, which is an amortized inference framework. To construct the flow, we couple the log-homotopy trajectory of the prior to posterior density function with the continuity equation describing the density evolution. This derivation yields a governing partial differential equation (PDE), referred to as the master PDE. By embedding this PDE as a physical constraint into the loss function, we train a neural network to approximate the transport velocity field. This approach enables purely unsupervised training, eliminating the need for ground-truth posterior samples. We demonstrate that the neural parameterization acts as an implicit regularizer, mitigating the numerical stiffness inherent to analytic flows and reducing online computational complexity. Experimental validation on multimodal benchmarks and a challenging nonlinear scenario confirms better mode coverage and robustness compared to state-of-the-art baselines.
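
For orientation, the coupling described above can be written out in the form that is standard in the log-homotopy particle flow literature. The abstract does not state the paper's exact master PDE, so the symbols below (prior $g$, likelihood $h$, normalizer $K$, pseudo-time $\lambda$, velocity field $f$) are assumptions and the authors' formulation may differ in notation or in additional terms.

```latex
% Log-homotopy from prior g (lambda = 0) to posterior (lambda = 1), with likelihood h
\log p(x,\lambda) = \log g(x) + \lambda \log h(x) - \log K(\lambda),
\qquad K(\lambda) = \int g(x)\, h(x)^{\lambda}\, dx .

% Continuity equation for particles transported by the velocity field f(x,\lambda)
\frac{\partial p}{\partial \lambda} + \nabla \cdot \bigl( p\, f \bigr) = 0 .

% Divide by p and use d\log K / d\lambda = E_{p(\cdot,\lambda)}[\log h]:
\nabla \cdot f + f \cdot \nabla \log p
  = -\Bigl( \log h(x) - \mathbb{E}_{p(\cdot,\lambda)}\bigl[\log h\bigr] \Bigr).
```

A physics-informed loss of the kind the abstract describes would penalize the residual of the last equation at sampled particles, with $f$ given by the neural network; no posterior samples appear anywhere in that residual.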

2602.23013 2026-05-14 cs.CV cs.LG

SubspaceAD: Training-Free Few-Shot Anomaly Detection via Subspace Modeling

Camile Lendering, Erkut Akdag, Egor Bondarev

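Affiliations: AIMS Group, Department of Electrical Engineering, Eindhoven University of Technology

AI summary: This paper proposes SubspaceAD, a training-free few-shot anomaly detection method that uses subspace modeling for industrial visual inspection. It extracts patch-level features from a few normal samples with a frozen DINOv2 model, fits PCA to those features to estimate a low-dimensional subspace of normal variation, and at inference detects anomalies via the reconstruction residual, producing interpretable and statistically grounded anomaly scores. Experiments show state-of-the-art performance on multiple datasets, particularly in the one-shot setting.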

Comments Accepted to CVPR 2026. Revised version with corrected AU-PRO evaluation and recomputed metrics

Abstract

Detecting visual anomalies in industrial inspection often requires training with only a few normal images per category. Recent few-shot methods achieve strong results employing foundation-model features, but typically rely on memory banks, auxiliary datasets, or multi-modal tuning of vision-language models. We therefore question whether such complexity is necessary given the feature representations of vision foundation models. To answer this question, we introduce SubspaceAD, a training-free method that operates in two simple stages. First, patch-level features are extracted from a small set of normal images by a frozen DINOv2 backbone. Second, a Principal Component Analysis (PCA) model is fit to these features to estimate the low-dimensional subspace of normal variations. At inference, anomalies are detected via the reconstruction residual with respect to this subspace, producing interpretable and statistically grounded anomaly scores. Despite its simplicity, SubspaceAD achieves state-of-the-art performance across one-shot and few-shot settings without training, prompt tuning, or memory banks. In the one-shot anomaly detection setting, SubspaceAD achieves image-level and pixel-level AUROC of 97.1% and 97.5% on the MVTec-AD dataset, and 93.2% and 98.2% on the VisA dataset, respectively, surpassing prior state-of-the-art results. Code and demo are available at https://github.com/CLendering/SubspaceAD.
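
The two stages can be sketched in a few lines of NumPy. This is not the released implementation (see the repository above for that): random vectors stand in for DINOv2 patch features, and the retained-variance rule `var_kept` and both function names are assumptions made for illustration.

```python
import numpy as np

def fit_normal_subspace(features, var_kept=0.95):
    """Fit PCA to normal patch features of shape (n_patches, d).

    Returns the feature mean and an orthonormal basis of shape (d, k).
    The retained-variance rule is an assumption of this sketch.
    """
    mean = features.mean(axis=0)
    _, s, vt = np.linalg.svd(features - mean, full_matrices=False)
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = min(int(np.searchsorted(explained, var_kept)) + 1, len(s))
    return mean, vt[:k].T

def anomaly_scores(features, mean, basis):
    """Reconstruction residual norm of each patch with respect to the normal subspace."""
    centered = features - mean
    recon = centered @ basis @ basis.T
    return np.linalg.norm(centered - recon, axis=1)

rng = np.random.default_rng(0)
d, r = 16, 3
true_basis = np.linalg.qr(rng.standard_normal((d, r)))[0]

# Stand-in "normal" patch features: a 3-D subspace plus small isotropic noise.
normal = rng.standard_normal((500, r)) @ true_basis.T + 0.01 * rng.standard_normal((500, d))
mean, basis = fit_normal_subspace(normal)

test_normal = rng.standard_normal((50, r)) @ true_basis.T + 0.01 * rng.standard_normal((50, d))
test_anomalous = test_normal[:5] + rng.standard_normal((5, d))   # pushed off the subspace

normal_scores = anomaly_scores(test_normal, mean, basis)
anomalous_scores = anomaly_scores(test_anomalous, mean, basis)
```

Patches that lie off the fitted subspace receive large residuals, and reshaping the per-patch scores to the patch grid would give a pixel-level anomaly map.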

2602.22474 2026-05-14 cs.RO cs.LG

When to Act, Ask, or Learn: Uncertainty-Aware Policy Steering

Jessie Yuan, Yilin Wu, Andrea Bajcsy

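Affiliations: Carnegie Mellon University

AI summary: This paper studies how policy steering can adapt robot behavior to task requirements at deployment time, and proposes UPS, an uncertainty-aware policy steering framework. The method combines a vision-language model (VLM) with a pre-trained policy and, by accounting for both semantic task uncertainty and action feasibility, chooses among executing an action, clarifying the task, or requesting an intervention. The work also calibrates the model with conformal prediction and continually improves the policy through residual learning; experiments show it requires fewer user interventions than existing methods.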

Comments To appear in Robotics: Science and Systems 2026

Abstract

Policy steering is an emerging way to adapt robot behaviors at deployment time: a learned verifier analyzes low-level action samples proposed by a pre-trained policy (e.g., diffusion policy) and selects only those aligned with the task. While Vision-Language Models (VLMs) are promising general-purpose verifiers due to their reasoning capabilities, existing frameworks often assume these models are well-calibrated. In practice, overconfident VLM judgments can degrade steering performance under both high-level semantic uncertainty in task specifications and low-level action uncertainty or incapability of the pre-trained policy. We propose uncertainty-aware policy steering (UPS), a framework that jointly reasons about semantic task uncertainty and low-level action feasibility, and selects an uncertainty resolution strategy: execute a high-confidence action, clarify task ambiguity via natural language queries, or ask for action interventions to correct the low-level policy when it is deemed incapable of the task. We leverage conformal prediction to calibrate the composition of the VLM and the pre-trained base policy, providing statistical assurances that the verifier selects the correct strategy. After collecting interventions during deployment, we employ residual learning to improve the capability of the pre-trained policy, enabling the system to learn continually but with minimal expensive human feedback. We demonstrate our framework through experiments in simulation and on hardware, showing that UPS can disentangle confident, ambiguous, and incapable scenarios and minimize expensive user interventions compared to uncalibrated baselines and prior human- or robot-gated continual learning approaches. Videos can be found at https://jessie-yuan.github.io/ups/
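
The calibration step can be illustrated with generic split conformal prediction over a small set of discrete options. This is a sketch of the textbook procedure, not of UPS itself: the abstract does not specify the nonconformity score or how prediction sets map to the execute, clarify, and intervene strategies, so the score `1 - p` and the `choose_strategy` rule below are assumptions.

```python
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split-conformal threshold on the nonconformity score 1 - p(true option).

    cal_probs: (n, n_options) verifier confidences on held-out calibration data.
    cal_labels: (n,) index of the correct option for each calibration example.
    """
    n = len(cal_labels)
    scores = np.sort(1.0 - cal_probs[np.arange(n), cal_labels])
    k = int(np.ceil((n + 1) * (1.0 - alpha)))   # rank of the calibrated quantile
    return np.inf if k > n else scores[k - 1]

def prediction_set(probs, qhat):
    """All options whose nonconformity score is within the calibrated threshold."""
    return np.flatnonzero(1.0 - probs <= qhat)

def choose_strategy(probs, qhat):
    """Hypothetical rule: act only when the calibrated set contains a single option."""
    options = prediction_set(probs, qhat)
    if len(options) == 1:
        return ("execute", int(options[0]))
    return ("ask", options.tolist())

# Toy calibration set: 9 examples, 2 options, option 0 always correct.
p_true = np.linspace(0.1, 0.9, 9)
cal_probs = np.stack([p_true, 1.0 - p_true], axis=1)
cal_labels = np.zeros(9, dtype=int)
qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.2)

confident = choose_strategy(np.array([0.9, 0.05, 0.05]), 0.5)
ambiguous = choose_strategy(np.array([0.5, 0.45, 0.05]), 0.6)
```

Under exchangeability of calibration and deployment data, the prediction set contains the correct option with probability at least `1 - alpha`, which is the kind of statistical assurance the abstract refers to.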

2602.22455 2026-05-14 cs.CV

Exploring Multimodal LMMs for Online Episodic Memory Question Answering on the Edge

Giuseppe Lando, Rosario Forte, Antonino Furnari

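Affiliations: Department of Mathematics and Computer Science, University of Catania, Italy

AI summary: This paper studies the feasibility of real-time online episodic memory question answering with multimodal large language models (MLLMs) on edge devices. To address privacy and latency concerns, the authors design a question answering pipeline with two asynchronous threads, one generating lightweight video-to-text descriptions and one reasoning over the textual memory. Experiments show that on resource-constrained edge devices the approach performs comparably to cloud-based solutions, demonstrating the potential of edge computing for privacy-preserving episodic memory retrieval.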

Abstract

We investigate the feasibility of using Multimodal Large Language Models (MLLMs) for real-time online episodic memory question answering. While cloud offloading is common, it raises privacy and latency concerns for wearable assistants, so we investigate implementation on the edge. We integrate streaming constraints into our question answering pipeline, which is structured into two asynchronous threads: a Descriptor Thread that continuously converts video into a lightweight textual memory, and a Question Answering (QA) Thread that reasons over the textual memory to answer queries. Experiments on the QAEgo4D-Closed benchmark analyze the performance of MLLMs within strict resource boundaries, showing promising results even when compared with cloud-based solutions. Specifically, an end-to-end configuration running on a consumer-grade 8GB GPU achieves 51.76% accuracy with a Time-To-First-Token (TTFT) of 0.41s. Scaling to a local enterprise-grade server yields 54.40% accuracy with a TTFT of 0.88s. In comparison, a cloud-based solution obtains an accuracy of 56.00%. These competitive results highlight the potential of edge-based solutions for privacy-preserving episodic memory retrieval.
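
The two-thread structure can be sketched with Python's standard `threading` and `queue` modules. The captioner and the answering model are replaced by trivial stand-ins (`describe_clip` and `answer` are hypothetical placeholders, not the models used in the paper); only the concurrency pattern is the point.

```python
import queue
import threading

class TextMemory:
    """Thread-safe, append-only textual memory shared by both threads."""
    def __init__(self):
        self._lock = threading.Lock()
        self._entries = []

    def append(self, timestamp, caption):
        with self._lock:
            self._entries.append((timestamp, caption))

    def snapshot(self):
        with self._lock:
            return list(self._entries)

def describe_clip(clip):
    """Hypothetical stand-in for a lightweight video-to-text model."""
    return f"user sees {clip}"

def answer(question, entries):
    """Hypothetical stand-in for a text-only LLM: return the latest caption sharing a word."""
    words = set(question.lower().split())
    hits = [f"t={t}: {c}" for t, c in entries if words & set(c.split())]
    return hits[-1] if hits else "not found"

def descriptor_thread(stream, memory, done):
    """Continuously convert incoming clips into textual memory entries."""
    for t, clip in enumerate(stream):
        memory.append(t, describe_clip(clip))
    done.set()

def qa_thread(questions, memory, answers):
    """Answer each query against whatever memory exists when it arrives."""
    while True:
        q = questions.get()
        if q is None:          # sentinel: shut down
            break
        answers.append(answer(q, memory.snapshot()))

memory, done = TextMemory(), threading.Event()
questions, answers = queue.Queue(), []
producer = threading.Thread(target=descriptor_thread,
                            args=(["keys on table", "mug in sink"], memory, done))
consumer = threading.Thread(target=qa_thread, args=(questions, memory, answers))
producer.start()
consumer.start()

done.wait()                     # in a live system, questions may arrive at any time
questions.put("where are my keys")
questions.put(None)
producer.join()
consumer.join()
```

Because the QA thread only reads text snapshots, the video model and the language model never need to run on the same frame at once, which is one plausible reason such a split suits memory-limited devices.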

2602.21204 2026-05-14 cs.LG cs.AI cs.CV

Test-Time Training with KV Binding Is Secretly Linear Attention

Junchen Liu, Sven Elflein, Or Litany, Zan Gojcic, Ruilong Li

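Affiliations: NVIDIA, Toronto, Ontario, Canada; University of Toronto, Toronto, Ontario, Canada; Vector Institute, Toronto, Ontario, Canada; Technion -- Israel Institute of Technology, Haifa, Israel

AI summary: This paper revisits the role of test-time training (TTT) with key-value binding in sequence modeling and argues that it is not simply test-time memorization but a learned linear attention mechanism. The study identifies phenomena in TTT models that were previously hard to explain and shows that a variety of TTT architectures can be unified as linear attention operators. Beyond explaining model behavior, this view brings practical benefits such as architectural simplification, parallel computation, and improved efficiency, giving TTT a more systematic and efficient theoretical footing.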

Comments ICML 2026, Webpage: https://research.nvidia.com/labs/sil/projects/tttla/

Abstract

Test-time training (TTT) with KV binding as a sequence modeling layer is commonly interpreted as a form of online meta-learning that memorizes a key-value mapping at test time. However, our analysis reveals multiple phenomena that contradict this memorization-based interpretation. Motivated by these findings, we revisit the formulation of TTT and show that a broad class of TTT architectures can be expressed as a learned linear attention operator. Beyond explaining previously puzzling model behaviors, this perspective yields multiple practical benefits: it enables principled architectural simplifications, admits fully parallel formulations that preserve performance while improving efficiency, and provides a systematic reduction of diverse TTT variants to a standard linear attention form. Overall, our results reframe TTT not as test-time memorization, but as learned linear attention with enhanced representational capacity. Project page: https://research.nvidia.com/labs/sil/projects/tttla/.
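
A well-known special case makes the connection concrete. Assume linear fast weights initialized at zero and a dot-product inner loss $-\langle W k, v \rangle$ with one gradient step per token; the per-token update then unrolls into causal, unnormalized linear attention. These modeling choices are assumptions chosen for illustration and are much narrower than the class of architectures the paper analyzes.

```python
import numpy as np

def ttt_fast_weights(K, V, Q, lr=1.0):
    """Sequential view: one inner gradient step per token, then read out with the query."""
    W = np.zeros((V.shape[1], K.shape[1]))
    outputs = []
    for k, v, q in zip(K, V, Q):
        grad = -np.outer(v, k)      # gradient of the inner loss -(v^T W k) with respect to W
        W = W - lr * grad           # test-time update binding key k to value v
        outputs.append(W @ q)
    return np.stack(outputs)

def causal_linear_attention(K, V, Q, lr=1.0):
    """Parallel view: unnormalized linear attention with a causal mask."""
    scores = np.tril(Q @ K.T)       # scores[t, i] = q_t . k_i for i <= t, else 0
    return lr * scores @ V

rng = np.random.default_rng(0)
T, d_k, d_v = 6, 4, 3
K = rng.standard_normal((T, d_k))
V = rng.standard_normal((T, d_v))
Q = rng.standard_normal((T, d_k))

sequential = ttt_fast_weights(K, V, Q, lr=0.5)
parallel = causal_linear_attention(K, V, Q, lr=0.5)
```

The two outputs agree to numerical precision, and the second form needs no sequential loop, which illustrates the kind of parallel formulation the abstract mentions.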

2602.20150 2026-05-14 cs.RO cs.CV

Simulation-Ready Cluttered Scene Estimation via Physics-aware Joint Shape and Pose Optimization

Wei-Cheng Huang, Jiaheng Han, Xiaohan Ye, Zherong Pan, Kris Hauser

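Affiliations: Meta Reality Labs

AI summary: This paper studies how to estimate simulation-ready cluttered scenes from real-world observations, addressing the high computational cost and poor robustness of existing methods on scenes with multiple interacting objects. The authors propose a physics-constrained joint shape and pose optimization that combines a differentiable contact model with an efficient solver to jointly optimize the geometry and poses of multiple rigid objects. They build the end-to-end SPARCS system, which robustly reconstructs physically valid, simulation-ready scenes; experiments cover cluttered scenes with up to 5 objects and 22 convex hulls.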

Comments Accepted to RSS 2026, camera-ready version; 17 pages, 15 figures

Abstract

Estimating simulation-ready scenes from real-world observations is crucial for downstream planning and policy learning tasks. Regretfully, existing methods struggle in cluttered environments, often exhibiting prohibitive computational cost, poor robustness, and restricted generality when scaling to multiple interacting objects. We propose a unified optimization-based formulation for real-to-sim scene estimation that jointly recovers the shapes and poses of multiple rigid objects under physical constraints. Our method is built on two key technical innovations. First, we leverage the recently introduced shape-differentiable contact model, whose global differentiability permits joint optimization over object geometry and pose while modeling inter-object contacts. Second, we exploit the structured sparsity of the augmented Lagrangian Hessian to derive an efficient linear system solver whose computational cost scales favorably with scene complexity. Building on this formulation, we develop an end-to-end Simulation-ready Physics-Aware Reconstruction for Cluttered Scenes (SPARCS) pipeline, which integrates learning-based object initialization, physics-constrained joint shape-pose optimization, and differentiable texture refinement. Experiments on cluttered scenes with up to 5 objects and 22 convex hulls demonstrate that our approach robustly reconstructs physically valid, simulation-ready object shapes and poses. Project webpage: https://rory-weicheng.github.io/SPARCS/.

2602.14200 2026-05-14 cs.LG

TS-Haystack: A Multi-Task Retrieval Benchmark for Long-Context Time-Series Reasoning

Nicolas Zumarraga, Thomas Kaar, Ning Wang, William Tennien, Alpay Hasanli, Max Rosenblattl, Fan Wu, Kevin Riehl, Maxwell A. Xu, Markus Kreft, Kevin O'Sullivan, Elgar Fleisch, Paul Schmiedmayer, Robert Jakob, Patrick Langer

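Affiliations: Agentic Systems Lab, ETH Zurich; Stanford University; Traffic Engineering Group, Institute for Transport Planning and Systems, ETH Zurich; University of Illinois Urbana-Champaign; Google; Centre for Digital Health Interventions, ETH Zurich; Centre for Digital Health Interventions, University of St. Gallen

AI summary: This paper introduces TS-Haystack, a multi-task retrieval benchmark for evaluating the long-context reasoning of time series language models (TSLMs). It covers multi-domain, event-grounded question answering over contexts from 100 seconds to 24 hours, including direct retrieval, temporal reasoning, multi-step reasoning, and contextual anomaly detection. Existing TSLMs degrade severely on long sequences, while an agentic retrieval framework using specialized time-series classifier tools matches or outperforms state-of-the-art models on 9 of 10 tasks, indicating that agentic retrieval is a promising approach to long-context time-series reasoning.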

Comments Workshop version of this paper published at ICLR TSALM 2026. Benchmark generation code and datasets: https://github.com/AI-X-Labs/TS-Haystack

Abstract

Time Series Language Models (TSLMs) promise reasoning over real-world temporal data, but their ability to retrieve and reason over long time-series remains largely untested. We introduce TS-Haystack, a multi-domain retrieval benchmark with ten event-grounded question-answering tasks over contexts from 100 seconds to 24 hours, spanning direct retrieval, temporal reasoning, multi-step reasoning, and contextual anomaly detection. Existing TSLMs exhibit severe long-context degradation: accuracy declines with context length, direct-tokenization models run out of memory beyond 100 seconds on high-rate signals, and time-interval-grounded tasks collapse toward near-zero accuracy when increasing the time-series lengths, aligning with existing literature on text and multi-modal long context retrieval. An agentic retrieval framework using specialized time-series classifier tools matches or outperforms SoTA TSLMs on 9 of 10 tasks, highlighting agentic retrieval as a promising approach for long-context TSLMs.

2602.13215 2026-05-14 cs.AI

When to Think Fast and Slow? AMOR: Adaptive Entropy Gate for Hybrid Models

Haoran Zheng, Chen Shani

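Affiliations: The University of Chicago; Stanford University

AI summary: This paper proposes AMOR, an adaptive hybrid architecture that invokes attention selectively based on predictive uncertainty, improving computational efficiency while maintaining model performance. An entropy gate activates the attention blocks only when the output entropy of the recurrent backbone exceeds a dynamic threshold, avoiding unnecessary computation. Experiments show that AMOR performs well across several large models while applying attention at only about 22% of token positions, and is more robust on long-context and common-sense reasoning tasks.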

Abstract

Recurrent-attention hybrids aim to combine the efficiency of recurrence with the expressivity of attention, but existing approaches typically apply attention uniformly across all positions, even when the recurrent state alone is sufficient for accurate prediction. We introduce AMOR (Adaptive Metacognitive Output Router), a post-hoc hybrid architecture that selectively invokes attention based on predictive uncertainty. A recurrent backbone is augmented with entropy-gated attention blocks that activate only when the model's output entropy exceeds a dynamic threshold derived from a running batch median and scaled standard deviation. This yields a simple, gradient-free routing mechanism inspired by uncertainty-driven computation and the System 1 / System 2 distinction. Across Mamba2 and Gated DeltaNet backbones (180M-1.5B), AMOR consistently matches or outperforms both pure recurrent models and fixed-schedule hybrid baselines while invoking attention on only ~22% of tokens. It achieves strong performance on common-sense reasoning benchmarks and maintains stable long-context performance on LongBench, where prior hybrid models degrade under distribution shift. These results suggest that when attention is applied matters as much as how much: selectively allocating attention based on predictive uncertainty improves both efficiency and robustness, offering a simple alternative to uniform or fixed routing strategies and pointing toward adaptive hybrid architectures that dynamically match computation to input difficulty.
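
A gate of this kind can be sketched as follows. The abstract says only that the threshold is derived from a running batch median and a scaled standard deviation; the specific rule used here (exponential moving averages combined as `median + scale * std`) and the `scale` and `momentum` parameters are assumptions, not the authors' formula.

```python
import numpy as np

def token_entropy(logits):
    """Shannon entropy in nats of softmax(logits) along the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -(np.exp(log_p) * log_p).sum(axis=-1)

class EntropyGate:
    """Gradient-free router: flag tokens whose entropy exceeds a running threshold."""
    def __init__(self, scale=1.0, momentum=0.9):
        self.scale = scale          # assumed multiplier on the standard deviation
        self.momentum = momentum    # assumed smoothing of the running statistics
        self.median = None
        self.std = None

    def __call__(self, entropy):
        batch_median, batch_std = np.median(entropy), entropy.std()
        if self.median is None:
            self.median, self.std = batch_median, batch_std
        else:
            m = self.momentum
            self.median = m * self.median + (1 - m) * batch_median
            self.std = m * self.std + (1 - m) * batch_std
        return entropy > self.median + self.scale * self.std

# Eight confident tokens (peaked logits) followed by two uncertain ones (flat logits).
logits = np.array([[10.0, 0.0, 0.0, 0.0]] * 8 + [[0.0, 0.0, 0.0, 0.0]] * 2)
entropy = token_entropy(logits)
gate = EntropyGate()
use_attention = gate(entropy)       # boolean mask: where the attention block would run
```

Only the two flat-logit tokens are routed to attention here. Because the gate is a comparison against batch statistics, it adds no trainable parameters and needs no gradient.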

2602.13155 2026-05-14 cs.LG cs.DS cs.NE stat.ML

Learning to Approximate Uniform Facility Location via Graph Neural Networks

Chendi Qian, Christopher Morris, Stefanie Jegelka, Christian Sohler

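Affiliations: RWTH Aachen University, Germany; Technical University of Munich, Germany; Massachusetts Institute of Technology, USA; University of Cologne, Germany

AI summary: This paper studies how graph neural networks (GNNs) can efficiently approximate solutions to the Uniform Facility Location (UniFL) problem. The authors propose a fully differentiable GNN that incorporates ideas from classical approximation algorithms and needs neither solver supervision nor discrete relaxations, retaining a provable approximation guarantee while improving empirical performance. In experiments it outperforms classical approximation algorithms and narrows the gap to integer linear programming.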

Comments ICML 2026

Abstract

Neural networks, particularly message-passing neural networks (MPNNs), are increasingly used as heuristics for hard combinatorial optimization problems. Yet many learning-based methods rely on supervision, reinforcement learning, or gradient estimators, causing high computational cost, unstable training, or limited guarantees. Classical approximation algorithms provide worst-case guarantees but are non-differentiable and cannot adapt to structure in natural input distributions. We study this tradeoff through Uniform Facility Location (UniFL), a problem with applications in clustering, summarization, logistics, and supply chains. We propose a fully differentiable MPNN that incorporates approximation-algorithmic principles without solver supervision or discrete relaxations. The model has provable approximation guarantees and empirically improves on standard approximation algorithms, narrowing the gap to integer linear programming.
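
For readers unfamiliar with the problem, the objective can be stated in code. The sketch assumes the common formulation in which any input point may be opened as a facility at the same cost and every point pays its Euclidean distance to the nearest open facility; the paper may use a different metric or graph-based variant. The exhaustive search is only a reference solution for tiny instances and has nothing to do with the proposed MPNN.

```python
import itertools
import numpy as np

def unifl_cost(points, open_idx, opening_cost):
    """Uniform opening cost per facility plus each point's distance to its nearest facility."""
    facilities = points[list(open_idx)]
    dists = np.linalg.norm(points[:, None, :] - facilities[None, :, :], axis=-1)
    return opening_cost * len(facilities) + dists.min(axis=1).sum()

def brute_force(points, opening_cost):
    """Exact optimum by enumerating all non-empty facility subsets (tiny instances only)."""
    best_cost, best_subset = None, None
    for r in range(1, len(points) + 1):
        for subset in itertools.combinations(range(len(points)), r):
            cost = unifl_cost(points, subset, opening_cost)
            if best_cost is None or cost < best_cost:
                best_cost, best_subset = cost, subset
    return best_cost, best_subset

# Two well-separated pairs of points on a line.
points = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]])
best_cost, best_subset = brute_force(points, opening_cost=2.0)
single_cost = unifl_cost(points, [1], opening_cost=2.0)
```

Opening one facility per pair costs 2 + 2 + 1 + 1 = 6, whereas a single facility at the second point costs 2 + 1 + 9 + 10 = 22, which shows the trade-off between opening and connection costs that approximation algorithms for this problem must balance.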

2602.12026 2026-05-14 cs.LG q-bio.QM

Protein Circuit Tracing via Cross-layer Transcoders

Darin Tsui, Kunal Talreja, Daniel Saeedi, Amirali Aghazadeh

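Affiliations: School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA

AI summary: This study proposes ProtoMech, a framework for uncovering computational circuits in protein language models (pLMs). It uses cross-layer transcoders to learn sparse latent representations jointly across layers, capturing the model's overall computation. Applied to ESM2, it recovers 82-89% of the original performance on protein family classification and function prediction, and identifies compressed circuits that use less than 1% of the latent space while retaining up to 79% of model accuracy, revealing correspondence with structural and functional motifs. The results offer efficient and precise guidance for protein design and significantly outperform existing methods.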

Comments Accepted into ICML 2026. 32 pages, 17 figures

Abstract

Protein language models (pLMs) have emerged as powerful predictors of protein structure and function. However, the computational circuits underlying their predictions remain poorly understood. Recent mechanistic interpretability methods decompose pLM representations into interpretable features, but they treat each layer independently and thus fail to capture cross-layer computation, limiting their ability to approximate the full model. We introduce ProtoMech, a framework for discovering computational circuits in pLMs using cross-layer transcoders that learn sparse latent representations jointly across layers to capture the model's full computational circuitry. Applied to the pLM ESM2, ProtoMech recovers 82-89% of the original performance on protein family classification and function prediction tasks. ProtoMech then identifies compressed circuits that use <1% of the latent space while retaining up to 79% of model accuracy, revealing correspondence with structural and functional motifs, including binding, signaling, and stability. Steering along these circuits enables high-fitness protein design, surpassing baseline methods in more than 70% of cases. These results establish ProtoMech as a principled framework for protein circuit tracing.

2602.11618 2026-05-14 cs.LG q-bio.QM

How Well Do Large-Scale Chemical Language Models Transfer to Downstream Tasks?

Tatsuya Sagawa, Ryosuke Kojima

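Affiliations: Graduate School of Pharmaceutical Sciences, Kyoto University; RIKEN BDR; Graduate School of Medicine, Kyoto University

AI summary: This paper studies how well large-scale chemical language models (CLMs) transfer to downstream molecular property prediction tasks. Scaling training resources (model size, dataset size, and compute), the authors systematically evaluate the relationship between pretraining loss and downstream performance and find that downstream performance improves only modestly even as pretraining loss keeps falling. The study also exposes the gap between pretraining metrics and actual task performance, analyzes task-dependent failure modes that affect transfer, and stresses that model selection and evaluation must account for downstream task characteristics.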

Abstract

Chemical Language Models (CLMs) pre-trained on large-scale molecular data are widely used for molecular property prediction. However, the common belief that increasing training resources such as model size, dataset size, and training compute improves both pretraining loss and downstream task performance has not been systematically validated in the chemical domain. In this work, we evaluate this assumption by pretraining CLMs while scaling training resources and measuring transfer performance across diverse molecular property prediction (MPP) tasks. We find that while pretraining loss consistently decreases with increased training resources, downstream task performance shows limited improvement. Moreover, alternative metrics based on the Hessian or loss landscape also fail to estimate downstream performance in CLMs. We further identify conditions under which downstream performance saturates or degrades despite continued improvements in pretraining metrics, and analyze the underlying task-dependent failure modes through parameter space visualizations. These results expose a gap between pretraining-based evaluation and downstream performance, and emphasize the need for model selection and evaluation strategies that explicitly account for downstream task characteristics.

2602.10326 2026-05-14 cs.CV cs.LG

Flow Matching with Uncertainty Quantification and Guidance

Juyeop Han, Lukas Lao Beyer, Sertac Karaman

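Affiliations: MIT

AI summary: Although sampling-based generative models such as flow matching have been very successful at image generation, the quality of generated samples can still be inconsistent or degraded. This paper proposes uncertainty-aware flow matching (UA-Flow), a lightweight method that estimates heteroscedastic uncertainty alongside the predicted velocity field and propagates it through the flow dynamics to assess the reliability of each sample. Experiments show that UA-Flow's uncertainty signals correlate more strongly with sample fidelity, and that uncertainty-guided sampling further improves generation quality.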

Abstract

Despite the remarkable success of sampling-based generative models such as flow matching, they can still produce samples of inconsistent or degraded quality. To assess sample reliability and generate higher-quality outputs, we propose uncertainty-aware flow matching (UA-Flow), a lightweight extension of flow matching that predicts the velocity field together with heteroscedastic uncertainty. UA-Flow estimates per-sample uncertainty by propagating velocity uncertainty through the flow dynamics. These uncertainty estimates act as a reliability signal for individual samples, and we further use them to steer generation via uncertainty-aware classifier guidance and classifier-free guidance. Experiments on image generation show that UA-Flow produces uncertainty signals more highly correlated with sample fidelity than baseline methods, and that uncertainty-guided sampling further improves generation quality.