arXivDaily: arXiv daily academic digest, updated Monday to Friday
All subject categories: 2085
2601.03728 2026-05-11 cs.CV cs.AI

CSMCIR: CoT-Enhanced Symmetric Alignment with Memory Bank for Composed Image Retrieval

Zhipeng Qian, Zihan Liang, Yufei Ma, Ben Chen, Huangyu Dai, Yiwei Ma, Jiayi Ji, Chenyi Lei, Han Li, Xiaoshuai Sun

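AI summary: This paper studies the inconsistency between the representation spaces of queries and target images under heterogeneous modalities in composed image retrieval (CIR), and proposes a unified representation framework called CSMCIR. The method uses a Multi-level Chain-of-Thought (MCoT) prompting strategy to guide multimodal large language models to generate semantically compatible image captions, and combines it with a symmetric dual-tower architecture and an entropy-based dynamic memory bank strategy to narrow the cross-modal alignment gap. Experiments show that CSMCIR achieves superior retrieval performance and training efficiency on multiple benchmark datasets.

Abstract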

Composed Image Retrieval (CIR) enables users to search for target images using both a reference image and manipulation text, offering substantial advantages over single-modality retrieval systems. However, existing CIR methods suffer from representation space fragmentation: queries and targets comprise heterogeneous modalities and are processed by distinct encoders, forcing models to bridge misaligned representation spaces only through post-hoc alignment, which fundamentally limits retrieval performance. This architectural asymmetry manifests as three distinct, well-separated clusters in the feature space, directly demonstrating how heterogeneous modalities create fundamentally misaligned representation spaces from initialization. In this work, we propose CSMCIR, a unified representation framework that achieves efficient query-target alignment through three synergistic components. First, we introduce a Multi-level Chain-of-Thought (MCoT) prompting strategy that guides Multimodal Large Language Models to generate discriminative, semantically compatible captions for target images, establishing modal symmetry. Building upon this, we design a symmetric dual-tower architecture where both query and target sides utilize the identical shared-parameter Q-Former for cross-modal encoding, ensuring consistent feature representations and further reducing the alignment gap. Finally, this architectural symmetry enables an entropy-based, temporally dynamic Memory Bank strategy that provides high-quality negative samples while maintaining consistency with the evolving model state. Extensive experiments on four benchmark datasets demonstrate that our CSMCIR achieves state-of-the-art performance with superior training efficiency. Comprehensive ablation studies further validate the effectiveness of each proposed component.

2601.01822 2026-05-11 cs.RO cs.CV

DisCo-FLoc: Semantic-Free Floorplan Localization via $SE(2)$-Aware Contrastive Disambiguation

Ping Zhong, Shiyong Meng, Bolei Chen, Tao Zou, Chaoxu Mu, Jianxin Wang

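AI summary: Visual floorplan localization (FLoc) suffers from severe structural aliasing in repetitive, minimalist layouts, which degrades localization accuracy. This paper proposes DisCo-FLoc, a visual-geometric contrastive disambiguation method that requires no semantic annotations. It introduces a depth-aware Ray Regression Predictor (RRP) that projects monocular RGB images into geometry-aware ray features, and combines it with a spatially perturbed contrastive learning objective to improve angular discriminability and spatial separability. Experiments show that the method significantly outperforms state-of-the-art semantic-based methods on two challenging benchmarks, with particularly strong directional localization accuracy.

Comments 9 pages, 3 figures

Abstract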

Visual Floorplan Localization (FLoc) struggles with severe structural aliasing caused by repetitive minimalist layouts. This occurs because physically distant poses share highly similar visual-geometric features, which degrades spatial separability and angular discriminability. While existing methods attempt to mitigate these ambiguities by relying on costly semantic annotations, the resulting performance gains remain inherently limited. To address the above issues, we propose DisCo-FLoc, a semantic-free method for visual-geometric Contrastive Disambiguation. First, we introduce a depth-aware Ray Regression Predictor (RRP) that serves as a dense-to-ray geometric projector. By explicitly suppressing visual clutter along the vertical dimension, RRP projects monocular RGB images into 2D ray primitives, which are matched with floorplans to produce geometry-aware FLoc candidates. Second, to resolve the remaining ambiguity among these candidates, we propose a spatially perturbed contrastive objective to align RGB images with local floorplan structures and formulate a visual-geometric compatibility function. In particular, we meticulously construct positive and negative samples at both positional and directional levels through $SE(2)$ pose perturbations for contrastive learning, effectively achieving pose smoothness, spatial separability, and angular discriminability. The compatibility function enables DisCo-FLoc to disambiguate FLoc by using richer visual context beyond pure geometric layouts, without requiring any semantic annotations. Extensive experiments on two challenging visual FLoc benchmarks demonstrate that DisCo-FLoc significantly outperforms state-of-the-art semantic-based methods, especially narrowing the performance gap between positional and directional FLoc accuracy.
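
The pose-perturbation idea in this abstract can be illustrated with a minimal sketch. It composes a planar pose with a small $SE(2)$ offset to get a positive sample, and with a large translation or a large rotation to get positional and directional negatives. The function names, thresholds, and sampling scheme are illustrative assumptions, not the authors' implementation.

```python
import math
import random


def wrap_angle(theta):
    """Wrap an angle to the interval [-pi, pi]."""
    return math.atan2(math.sin(theta), math.cos(theta))


def se2_compose(pose, delta):
    """Apply an offset (dx, dy, dtheta), given in the pose's own frame, to a pose (x, y, theta)."""
    x, y, theta = pose
    dx, dy, dtheta = delta
    return (
        x + math.cos(theta) * dx - math.sin(theta) * dy,
        y + math.sin(theta) * dx + math.cos(theta) * dy,
        wrap_angle(theta + dtheta),
    )


def make_contrastive_poses(pose, rng, pos_shift=0.2, pos_turn=0.1,
                           neg_shift=2.0, neg_turn=math.pi / 2):
    """Return (positive, positional_negative, directional_negative) poses.

    All thresholds are hypothetical values chosen for illustration.
    """
    # Positive: a small perturbation in both position and heading.
    positive = se2_compose(pose, (
        rng.uniform(-pos_shift, pos_shift),
        rng.uniform(-pos_shift, pos_shift),
        rng.uniform(-pos_turn, pos_turn),
    ))
    # Positional negative: same heading, moved neg_shift away in a random direction.
    heading = rng.uniform(-math.pi, math.pi)
    positional_negative = se2_compose(
        pose, (neg_shift * math.cos(heading), neg_shift * math.sin(heading), 0.0))
    # Directional negative: same position, turned by at least neg_turn.
    turn = rng.choice([-1.0, 1.0]) * rng.uniform(neg_turn, math.pi)
    directional_negative = se2_compose(pose, (0.0, 0.0, turn))
    return positive, positional_negative, directional_negative


anchor = (1.0, 2.0, 0.3)
pos, neg_p, neg_d = make_contrastive_poses(anchor, random.Random(0))
```

In a contrastive objective, the image taken at `anchor` would be pulled toward the floorplan structure around `pos` and pushed away from the structures around `neg_p` and `neg_d`.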

2601.01285 2026-05-11 cs.CV

S2M-Net: Spectral-Spatial Mixing for Medical Image Segmentation with Morphology-Aware Adaptive Loss

Md. Sanaullah Chowdhury, Lameya Sabrin

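AI summary: This paper proposes S2M-Net, a new medical image segmentation network that aims to resolve the tension among local precision, global context, and computational efficiency. Through two modules, a spectral-selective token mixer and a morphology-aware adaptive segmentation loss, the network keeps a global receptive field while substantially reducing computational cost, and adapts automatically to the characteristics of different anatomical structures. Experiments show that S2M-Net outperforms existing methods on multiple medical imaging datasets while using far fewer parameters than transformer-based methods.

Comments I would like to withdraw the paper from arXiv because the current version contains issues that need to be carefully revised before public dissemination

Abstract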

Medical image segmentation requires balancing local precision for boundary-critical clinical applications, global context for anatomical coherence, and computational efficiency for deployment on limited data and hardware, a trilemma that existing architectures fail to resolve. Convolutional networks provide local precision at $\mathcal{O}(n)$ cost but have limited receptive fields, whereas vision transformers achieve global context through $\mathcal{O}(n^2)$ self-attention at prohibitive computational expense, causing overfitting on small clinical datasets. We propose S2M-Net, a 4.7M-parameter architecture that achieves $\mathcal{O}(HW \log HW)$ global context through two synergistic innovations: (i) Spectral-Selective Token Mixer (SSTM), which exploits the spectral concentration of medical images via truncated 2D FFT with learnable frequency filtering and content-gated spatial projection, avoiding quadratic attention cost while maintaining global receptive fields; and (ii) Morphology-Aware Adaptive Segmentation Loss (MASL), which automatically analyzes structure characteristics (compactness, tubularity, irregularity, scale) to modulate five complementary loss components through constrained learnable weights, eliminating manual per-dataset tuning. Comprehensive evaluation on 16 medical imaging datasets spanning 8 modalities demonstrates state-of-the-art performance: 96.12\% Dice on polyp segmentation, 83.77\% on surgical instruments (+17.85\% over the prior art), and 80.90\% on brain tumors, with consistent 3-18\% improvements over specialized baselines while using 3.5--6$\times$ fewer parameters than transformer-based methods.

2601.00889 2026-05-11 cs.LG

FANoS-v2: Feedback-Controlled Momentum with Thermostat Damping for Lightweight Neural Optimization

Nalin Dhiman

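AI summary: FANoS-v2 is a lightweight feedback-controlled neural network optimizer that regulates update energy with a scalar feedback controller and combines it with a thermostat damping mechanism to improve the stability and efficiency of optimization. It supports several preconditioning modes and provides diagnostics for stability analysis. Experiments show that FANoS-v2 achieves higher accuracy than AdamW on some tasks but at a higher wall-clock cost, indicating both its potential as a research optimizer and its current performance bottleneck.

Comments 17 pages, 3 figures, 5 tables

Abstract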

FANoS-v2 is a PyTorch optimizer that augments RMS-preconditioned momentum with a scalar feedback controller over update energy. The public reference implementation stores momentum in parameter-update units, applies a non-negative thermostat damping coefficient, supports diagonal, factored, and raw-gradient preconditioning, and exposes diagnostics intended for stability audits. This study gives a complete mathematical specification of the released optimizer, including the exact parameter-unit update, the study-equation physical update mode, bounded log-ratio thermostat control, adaptive preconditioner softening, warmup guardrails, and the experimental Fast profile. We report the v0.2 evidence: five-seed reduced-sample MNIST, Fashion-MNIST, and CIFAR-10 experiments show mean top-1 gains of 0.889, 2.197, and 2.666 percentage points over AdamW for Fast, but with 49.8\%, 61.6\%, and 56.8\% higher wall-clock time. Preliminary scientific, PINN, and EEG smoke tests are mixed and are treated as hypothesis-generating only. The evidence supports FANoS-v2 as an alpha-stage research optimizer with a reproducible lightweight-vision signal and an explicit runtime bottleneck.

2512.23770 2026-05-11 cs.LG cs.AI

SB-TRPO: Towards Safe Reinforcement Learning with Hard Constraints

Dominik Wagner, Ankit Kanwar, Luke Ong

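AI summary: In safety-critical domains, reinforcement learning agents must satisfy strict zero-cost safety constraints while completing their tasks. This paper proposes SB-TRPO, an algorithm that dynamically combines the natural policy gradients of reward and cost to optimize the policy while respecting safety constraints, achieving a good balance between safety and task performance. The method comes with theoretical guarantees of local progress on safety and performs strongly on multiple safe reinforcement learning tasks.

Abstract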

In safety-critical domains, reinforcement learning (RL) agents must often satisfy strict, zero-cost safety constraints while accomplishing tasks. Existing model-free methods frequently either fail to achieve near-zero safety violations or become overly conservative. We introduce Safety-Biased Trust Region Policy Optimisation (SB-TRPO), a principled algorithm for hard-constrained RL that dynamically balances cost reduction with reward improvement. At each step, SB-TRPO updates via a dynamic convex combination of the reward and cost natural policy gradients, ensuring a fixed fraction of optimal cost reduction while using remaining update capacity for reward improvement. Our method comes with formal guarantees of local progress on safety, while still improving reward whenever gradients are suitably aligned. Experiments on standard and challenging Safety Gymnasium tasks demonstrate that SB-TRPO consistently achieves the best balance of safety and task performance in the hard-constrained regime.
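
The "dynamic convex combination" described in this abstract can be pictured with a small linearized sketch. Assuming the reward and cost update directions are already computed, it picks the largest reward weight `mu` for which the combined step still achieves a fraction `beta` of the cost reduction predicted for the pure cost step. The names, the first-order cost model, and the omission of trust-region rescaling are simplifying assumptions, not the paper's exact update rule.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def safety_biased_step(reward_dir, cost_dir, cost_grad, beta=0.5):
    """Blend reward and cost directions as mu * reward_dir + (1 - mu) * cost_dir.

    First-order model (an assumption of this sketch): a step d changes the
    cost by dot(cost_grad, d). mu is the largest value in [0, 1] such that the
    blended step keeps at least a fraction beta of the pure cost step's
    predicted cost reduction.
    """
    c_cost = dot(cost_grad, cost_dir)      # predicted cost change of the pure cost step
    c_reward = dot(cost_grad, reward_dir)  # predicted cost change of the pure reward step
    if c_cost >= 0.0:
        # No predicted cost reduction is available, so there is nothing to preserve.
        mu = 1.0
    elif c_reward <= beta * c_cost:
        # The reward step alone already reduces cost enough.
        mu = 1.0
    else:
        # Solve mu * c_reward + (1 - mu) * c_cost = beta * c_cost for mu.
        mu = (beta - 1.0) * c_cost / (c_reward - c_cost)
    step = [mu * r + (1.0 - mu) * c for r, c in zip(reward_dir, cost_dir)]
    return mu, step


# Conflicting directions: the reward step would increase cost.
mu, step = safety_biased_step(
    reward_dir=[1.0, 1.0], cost_dir=[-1.0, 0.0], cost_grad=[1.0, 0.0], beta=0.5)
```

In this example the reward direction conflicts with safety, so most of the step goes to cost reduction; when the two directions are aligned the sketch returns `mu = 1.0` and the step is pure reward improvement.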

2512.20974 2026-05-11 cs.LG cs.AI cs.RO

Generalised Linear Models in Deep Bayesian RL with Learnable Basis Functions

Jingyang You, Hanna Kurniawati

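AI summary: This paper proposes GLiBRL, a deep Bayesian reinforcement learning method that uses learnable basis functions and generalised linear models to achieve fully tractable Bayesian inference over task parameters and model noise, yielding clearer task representations and better policy performance. The method supports exact marginal likelihood evaluation, integrates with both on-policy and off-policy RL algorithms, and generalises well. Experiments show that GLiBRL outperforms existing meta-reinforcement learning methods on multiple benchmarks, with improvements of up to 1.8$\times$.

Abstract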

Bayesian Reinforcement Learning (BRL), a subclass of Meta-Reinforcement Learning (Meta-RL), provides a principled framework for generalisation by explicitly incorporating Bayesian task parameters into transition and reward models. However, classical BRL methods assume known forms of transition and reward models. While recent deep BRL methods incorporate model learning to address this, applying neural networks directly to joint data and task parameters necessitates variational inference. This often yields indistinct task representations, compromising the resulting BRL policies. To overcome these limitations, we introduce Generalised Linear Models in Deep Bayesian RL with Learnable Basis Functions (GLiBRL). Our approach features fully tractable Bayesian inference over task parameters and model noise, alongside exact marginal likelihood evaluation for learning transition and reward models. The permutation-invariant nature of exact Bayesian inference in GLiBRL enables seamless integration with both on-policy and off-policy RL algorithms. We further show that GLiBRL admits a closed-form relationship between the $\mathcal{L}_2$ distance of its task representations and empirical kernel-based correspondence between task samples, which is to our knowledge the first such structural result for online deep BRL. GLiBRL is compared against representative and recent Meta-RL methods, and improves state-of-the-art performance on both MuJoCo and MetaWorld benchmarks by up to 1.8$\times$.

2512.19991 2026-05-11 cs.LG

Bloom Filter Encoding for Machine Learning

John Cartmell, Mihaela Cardei, Ionut Cardei

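AI summary: This paper proposes preprocessing machine learning data with a Bloom filter transform: hash-based encoding turns each sample into a compact bit-array representation, reducing memory use while obfuscating the original feature values. The method does not require keyed hashing, although a key can optionally be used to control the mapping. Experiments on six datasets from different domains show that models trained on Bloom filter encodings perform comparably to models trained on raw data or with standard dimensionality reduction on several datasets, with consistent memory savings. This suggests that the encoding can serve as an efficient, general-purpose preprocessing representation that suits a variety of learning tasks and provides a degree of data obfuscation.

Comments 14 pages, 7 figures

Abstract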

We present a method that uses a Bloom filter transform to preprocess data for machine learning. Each sample is encoded into a compact bit-array representation using hash-based encoding, producing a fixed-length feature space that reduces memory usage and obfuscates original feature values. The encoding does not rely on keyed hashing; however, a key can optionally be used to control the mapping and would be required to reproduce the representation. We evaluate the approach on six datasets spanning text, time-series, tabular, and image domains: SMS Spam Collection, ECG200, Adult 50K, CDC Diabetes, MNIST, and Fashion MNIST. Four classifiers are considered: Extreme Gradient Boosting, Deep Neural Networks, Convolutional Neural Networks, and Logistic Regression. Results show that models trained on Bloom filter encodings achieve performance comparable to models trained on raw data or standard dimensionality reduction techniques across several datasets, while providing consistent memory savings. These findings suggest that Bloom filter encodings can serve as an efficient, general-purpose pre-processing representation that preserves useful similarity structure for learning tasks while providing a degree of data obfuscation.
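
To make the encoding step concrete, here is a minimal sketch of a Bloom-filter-style transform that maps a sample's tokens to a fixed-length bit array. The tokenization, the bit-array size, the number of hash functions, and the use of salted SHA-256 are all illustrative assumptions; the paper's actual hashing scheme may differ.

```python
import hashlib


def bloom_encode(tokens, num_bits=64, num_hashes=3, key=b""):
    """Encode an iterable of string tokens as a fixed-length list of 0/1 bits.

    Each token sets up to num_hashes positions. Passing a non-empty key
    changes the token-to-position mapping, so the same key is needed to
    reproduce an encoding.
    """
    bits = [0] * num_bits
    for token in tokens:
        for i in range(num_hashes):
            # Salt the hash with the index i to simulate independent hash functions.
            digest = hashlib.sha256(key + bytes([i]) + token.encode("utf-8")).digest()
            index = int.from_bytes(digest[:8], "big") % num_bits
            bits[index] = 1
    return bits


# A toy text sample split into word tokens (hypothetical input).
features = bloom_encode(["free", "prize", "call"])
```

The resulting `features` vector has the same length for every sample, so it can be fed to a classifier directly. Because bits are combined with a logical OR, the encoding does not depend on token order.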

2512.17129 2026-05-11 cs.LG cs.MA cs.RO q-bio.QM

DiffeoMorph: Learning to Morph 3D Shapes Using Differentiable Agent-Based Simulations

Seong Ho Pahng, Guoye Guan, Benjamin Fefferman, Sahand Hormoz

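AI summary: This paper proposes DiffeoMorph, an end-to-end differentiable framework for learning morphogenesis protocols that guide a population of agents from an initial state into a target 3D shape. Built on an SE(3)-equivariant graph neural network, it lets each agent update its position and internal state from its own state and the signals it exchanges with other agents. The work introduces a shape-matching loss based on 3D Zernike polynomials that compares predicted and target shapes as continuous spatial distributions and is invariant to agent ordering, agent count, and global orientation while remaining sensitive to reflection. Experiments show that DiffeoMorph can generate complex 3D structures from simple initial conditions, providing a general framework for learning distributed control strategies in morphogenesis, swarm robotics, and programmable self-assembly.

Abstract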

Biological systems can form complex three-dimensional structures through the collective behavior of agents that share a common update rule and operate without central control. How such distributed control gives rise to precise global patterns remains a central question not only in developmental biology but also in distributed robotics, programmable matter, and multi-agent learning. Here, we introduce DiffeoMorph, an end-to-end differentiable framework for learning a morphogenesis protocol that guides a population of agents to morph into a target 3D shape. Each agent updates its position and internal state using an SE(3)-equivariant graph neural network, based on its own internal state and signals received from other agents. To train this system, we introduce a new shape-matching loss based on 3D Zernike polynomials, which compares the predicted and target shapes as continuous spatial distributions, not as discrete point clouds, and is invariant to agent ordering, number of agents, and global orientation. To achieve rotation invariance while preserving reflection sensitivity, we include an alignment step that optimally rotates the predicted Zernike spectrum to match the target before computing the loss. We perform benchmarking to establish the advantages of our shape-matching loss over other standard distance metrics for shape comparison tasks. We then demonstrate that DiffeoMorph can form a range of complex shapes from minimally patterned initial conditions. DiffeoMorph provides a general framework for learning distributed control strategies for morphogenesis, swarm robotics, and programmable self-assembly.

2512.15840 2026-05-11 cs.RO cs.CV

Large Video Planner Enables Generalizable Robot Control

Boyuan Chen, Tianyuan Zhang, Haoran Geng, Caiyi Zhang, Peihao Li, Kiwhan Song, William T. Freeman, Jitendra Malik, Pieter Abbeel, Russ Tedrake, Vincent Sitzmann, Yilun Du

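AI summary: This work proposes a general robot control approach based on large-scale video pretraining, aimed at robot decision-making across diverse tasks and environments. Unlike conventional vision-language-action (VLA) systems, it directly exploits the spatio-temporal sequences captured in video to build an open video model for robot planning. Trained on internet-scale video data, the model generates zero-shot video plans for new scenes and tasks, from which executable robot actions are extracted. Experiments indicate good generalization and feasibility on real robot tasks.

Comments 29 pages, 16 figures

Abstract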

General-purpose robots require decision-making models that generalize across diverse tasks and environments. Recent works build robot foundation models by extending multimodal large language models (MLLMs) with action outputs, creating vision-language-action (VLA) systems. These efforts are motivated by the intuition that MLLMs' large-scale language and image pretraining can be effectively transferred to the action output modality. In this work, we explore an alternative paradigm of using large-scale video pretraining as a primary modality for building robot foundation models. Unlike static images and language, videos capture spatio-temporal sequences of states and actions in the physical world that are naturally aligned with robotic behavior. We curate an internet-scale video dataset of human activities and task demonstrations, and train, for the first time at a foundation-model scale, an open video model for generative robotics planning. The model produces zero-shot video plans for novel scenes and tasks, which we post-process to extract executable robot actions. We evaluate task-level generalization through third-party selected tasks in the wild and real-robot experiments, demonstrating successful physical execution. Together, these results show robust instruction following, strong generalization, and real-world feasibility. We release both the model and dataset to support open, reproducible video-based robot learning. Our website is available at https://www.boyuan.space/large-video-planner/.

2512.15567 2026-05-11 cs.AI cond-mat.mtrl-sci cs.LG physics.chem-ph

Evaluating Large Language Models in Scientific Discovery

Zhangde Song, Jieyu Lu, Yuanqi Du, Botao Yu, Thomas M. Pruyn, Yue Huang, Kehan Guo, Xiuzhe Luo, Yuanhao Qu, Yi Qu, Yinkai Wang, Haorui Wang, Jeff Guo, Jingru Gan, Parshin Shojaee, Di Luo, Andres M Bran, Gen Li, Qiyuan Zhao, Shao-Xiong Lennon Luo, Yuxuan Zhang, Xiang Zou, Wanru Zhao, Yifan F. Zhang, Wucheng Zhang, Shunan Zheng, Saiyang Zhang, Sartaaj Takrim Khan, Mahyar Rajabi-Kochi, Samantha Paradi-Maropakis, Tony Baltoiu, Fengyu Xie, Tianyang Chen, Kexin Huang, Weiliang Luo, Meijing Fang, Xin Yang, Lixue Cheng, Jiajun He, Soha Hassoun, Xiangliang Zhang, Wei Wang, Chandan K. Reddy, Chao Zhang, Zhiling Zheng, Mengdi Wang, Le Cong, Carla P. Gomes, Chang-Yu Hsieh, Aditya Nandy, Philippe Schwaller, Heather J. Kulik, Haojun Jia, Huan Sun, Seyed Mohamad Moosavi, Chenru Duan

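AI summary: This paper proposes an evaluation framework grounded in real research scenarios for assessing large language models in scientific discovery, covering biology, chemistry, materials science, and physics. Expert-defined research projects are decomposed into modular scenarios from which vetted questions are generated, and models are evaluated at both the question level and the project level, including hypothesis generation, experiment design, and result interpretation. The study finds that current state-of-the-art LLMs still show a clear performance gap on scientific discovery tasks, that scaling up model size brings limited gains, and that existing models share systematic weaknesses in scientific reasoning.

Abstract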

Large language models (LLMs) are increasingly applied to scientific research, yet prevailing science benchmarks probe decontextualized knowledge and overlook the iterative reasoning, hypothesis generation, and observation interpretation that drive scientific discovery. We introduce a scenario-grounded benchmark that evaluates LLMs across biology, chemistry, materials, and physics, where domain experts define research projects of genuine interest and decompose them into modular research scenarios from which vetted questions are sampled. The framework assesses models at two levels: (i) question-level accuracy on scenario-tied items and (ii) project-level performance, where models must propose testable hypotheses, design simulations or experiments, and interpret results. Applying this two-phase scientific discovery evaluation (SDE) framework to state-of-the-art LLMs reveals a consistent performance gap relative to general science benchmarks, diminishing returns from scaling up model size and reasoning, and systematic weaknesses shared across top-tier models from different providers. Large performance variation across research scenarios means that the best-performing model changes from one scientific discovery project to another, suggesting that all current LLMs are far from general scientific "superintelligence". Nevertheless, LLMs already demonstrate promise in a great variety of scientific discovery projects, including cases where constituent scenario scores are low, highlighting the role of guided exploration and serendipity in discovery. This SDE framework offers a reproducible benchmark for discovery-relevant evaluation of LLMs and charts practical paths to advance their development toward scientific discovery.

2512.12116 2026-05-11 cs.LG stat.ML

Neural CDEs as Correctors for Learned Time Series Models

Muhammad Bilal Shahid, Zhanhong Jiang, Prajwal Koirala, Soumik Sarkar, Cody Fleming

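AI summary: This paper proposes a predictor-corrector framework for improving the multi-step forecasting performance of time-series models. The predictor generates multi-step forecasts, and the corrector uses a neural controlled differential equation to correct forecast errors; it handles irregularly sampled time series and is compatible with both continuous- and discrete-time predictors. The work also introduces two regularization strategies that improve the corrector's extrapolation and training efficiency, and provides theoretical guarantees on stability and convergence. Experiments show that the method improves forecasting accuracy across a variety of predictors, demonstrating its predictor-agnostic applicability.

Abstract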

Learned time-series models, whether continuous or discrete, are widely used for forecasting the states of dynamical systems but suffer from error accumulation in multi-step forecasts. To address this issue, we propose a Predictor-Corrector framework in which the Predictor is a learned time-series model that generates multi-step forecasts and the Corrector is a neural controlled differential equation that corrects the forecast errors. The Corrector works with irregularly sampled time series and is compatible with both continuous- and discrete-time Predictors. We further introduce two regularization strategies that improve the Corrector's extrapolation performance and accelerate its training. We also provide theoretical guarantees on the stability and convergence of the proposed framework. Experiments on synthetic, physics-based, and real-world datasets show that the proposed framework consistently improves forecasting performance across diverse Predictors, including neural ordinary differential equations, ContiFormer, and DLinear, demonstrating its predictor-agnostic nature.

2512.10371 2026-05-11 cs.AI

AgentProg: Empowering Long-Horizon GUI Agents with Program-Guided Context Management

Shizuo Tian, Hao Wen, Yuxuan Chen, Jiacheng Liu, Shanhui Zhao, Guohong Liu, Ju Ren, Yunxin Liu, Yuanchun Li

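AI summary: As mobile graphical user interface (GUI) agents are increasingly used for long-horizon task automation, efficiently managing the ever-growing interaction history has become a key challenge. This paper proposes AgentProg, a program-guided context management approach that restructures the interaction history as a program with variables and control flow, providing a systematic way to decide which information to keep and which to discard. AgentProg also introduces a global belief state mechanism to handle partial observability and environmental changes. Experiments show strong, stable performance on long-horizon tasks, outperforming existing methods.

Comments 16 pages, 8 figures

Abstract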

The rapid development of mobile GUI agents has stimulated growing research interest in long-horizon task automation. However, building agents for these tasks faces a critical bottleneck: the reliance on ever-expanding interaction history incurs substantial context overhead. Existing context management and compression techniques often fail to preserve vital semantic information, leading to degraded task performance. We propose AgentProg, a program-guided approach for agent context management that reframes the interaction history as a program with variables and control flow. Organizing information according to the structure of the program provides a principled mechanism to determine which information should be retained and which can be discarded. We further integrate a global belief state mechanism inspired by the Belief MDP framework to handle partial observability and adapt to unexpected environmental changes. Experiments on AndroidWorld and our extended long-horizon task suite demonstrate that AgentProg achieves state-of-the-art success rates on these benchmarks. More importantly, it maintains robust performance on long-horizon tasks while baseline methods experience catastrophic degradation. Our system is open-sourced at https://github.com/MobileLLM/AgentProg.

2512.05439 2026-05-11 cs.AI cs.FL

BEAVER: An Efficient Deterministic LLM Verifier

Tarun Suresh, Nalin Wadhwa, Debangshu Banerjee, Gagandeep Singh

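AI summary: As large language models move from research prototypes to production systems, practitioners need reliable ways to verify model outputs and assess tail risk for safe deployment. This paper proposes BEAVER, the first practical framework for computing deterministic, sound probability bounds on whether an LLM satisfies a safety property. The method systematically explores the model's output space using novel token trie and frontier data structures, maintaining provably sound bounds at every step. Experiments show that it identifies more risky instances at one tenth of the baselines' compute cost, surfacing tail risks that conventional methods tend to miss.

Abstract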

As large language models (LLMs) transition from research prototypes to production systems, practitioners often need reliable methods to verify model outputs and characterize tail risk for safe deployment. While sampling-based estimates provide an ad-hoc intuition of model behavior, they offer no sound guarantees. We present BEAVER, the first practical framework for computing deterministic, sound probability bounds on LLM satisfaction of safety properties. Given a prompt and any safety property, BEAVER systematically explores the model output space using novel Token trie and Frontier data structures, maintaining provably sound bounds at every iteration. We formalize the verification problem, prove soundness of our approach, and evaluate BEAVER on 4 safety properties across 12 open-weight LLMs. BEAVER identifies 2-3x more risky instances compared to baselines while taking 1/10 of the compute budget, surfacing tail risks that loose bounds and ad-hoc evaluation miss.
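
The idea of deterministic bounds that tighten with each expansion can be sketched on a toy model. The sketch below assumes a prefix-closed property (once a prefix is unsafe, every continuation is unsafe) and a hypothetical `next_token_probs` callback that returns the model's next-token distribution; it is a simplified illustration of frontier-based exploration, not BEAVER's actual data structures or algorithm.

```python
import heapq


def probability_bounds(next_token_probs, is_safe_prefix, eos="<eos>", max_expansions=10):
    """Return (lower, upper) bounds on P(the completed output is safe).

    lower: total probability of finished sequences known to be safe.
    upper: lower plus the probability mass still sitting on the frontier.
    Both bounds are valid after every expansion and tighten as more prefixes are expanded.
    """
    lower = 0.0
    frontier = [(-1.0, ())]  # max-heap on prefix probability (negated for heapq)
    for _ in range(max_expansions):
        if not frontier:
            break
        neg_p, prefix = heapq.heappop(frontier)
        for token, p in next_token_probs(prefix).items():
            mass = -neg_p * p
            if token == eos:
                lower += mass  # a finished, safe sequence
            elif is_safe_prefix(prefix + (token,)):
                heapq.heappush(frontier, (-mass, prefix + (token,)))
            # Unsafe prefixes are dropped: their mass can never count as safe.
    upper = lower + sum(-neg_p for neg_p, _ in frontier)
    return lower, upper


# Toy stand-in for an LLM: the same next-token distribution at every step.
def toy_model(prefix):
    return {"ok": 0.5, "bad": 0.2, "<eos>": 0.3}


def no_bad_token(prefix):
    return "bad" not in prefix


lo, hi = probability_bounds(toy_model, no_bad_token, max_expansions=2)
```

For this toy model the true probability of a safe output is 0.3 / (1 - 0.5) = 0.6, and the interval `[lo, hi]` contains that value after any number of expansions, which is the kind of guarantee sampling alone cannot give.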

2512.03476 2026-05-11 cs.LG cs.AI cs.MA cs.NA math.NA physics.comp-ph

ATHENA: Agentic Team for Hierarchical Evolutionary Numerical Algorithms

Juan Diego Toscano, Daniel T. Chen, George Em Karniadakis

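AI summary: ATHENA is an agentic team framework for hierarchical evolutionary numerical algorithms that aims to close the gap between theoretical design and computational implementation in scientific computing and scientific machine learning. Its core is the HENA loop, framed as a contextual bandit problem, which analyzes past trials to select structural actions and translates them into executable code that yields scientific rewards. ATHENA can autonomously discover mathematical symmetries, design stable numerical solvers, and combine symbolic and numerical methods to solve multiphysics problems, reaching super-human performance, and human-in-the-loop collaboration can further improve the accuracy of its results.

Abstract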

Bridging the gap between theoretical conceptualization and computational implementation is a major bottleneck in Scientific Computing (SciC) and Scientific Machine Learning (SciML). We introduce ATHENA (Agentic Team for Hierarchical Evolutionary Numerical Algorithms), an agentic framework designed as an Autonomous Lab to manage the end-to-end computational research lifecycle. Its core is the HENA loop, a knowledge-driven diagnostic process framed as a Contextual Bandit problem. Acting as an online learner, the system analyzes prior trials to select structural "actions" ($A_n$) from combinatorial spaces guided by expert blueprints (e.g., Universal Approximation, Physics-Informed constraints). These actions are translated into executable code ($S_n$) to generate scientific rewards ($R_n$). ATHENA transcends standard automation: in SciC, it autonomously identifies mathematical symmetries for exact analytical solutions or derives stable numerical solvers where foundation models fail. In SciML, it performs deep diagnosis to tackle ill-posed formulations and combines hybrid symbolic-numeric workflows (e.g., coupling PINNs with FEM) to resolve multiphysics problems. The framework achieves super-human performance, reaching validation errors of $10^{-14}$. Furthermore, collaborative "human-in-the-loop" intervention allows the system to bridge stability gaps, improving results by an order of magnitude. This paradigm shifts the focus from implementation mechanics to methodological innovation, accelerating scientific discovery.

2512.03454 2026-05-11 cs.CV cs.AI

Think Before You Drive: World Model-Inspired Multimodal Grounding for Autonomous Vehicles

Haicheng Liao, Huanming Shen, Bonan Wang, Yongkang Li, Yihong Tang, Chengyue Wang, Dingyi Zhuang, Kehua Chen, Hai Yang, Chengzhong Xu, Zhenning Li

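AI summary: This paper studies how autonomous vehicles can understand natural-language commands and accurately localize target objects. To address the weaknesses of existing methods on ambiguous or context-dependent commands, it proposes ThinkDeeper, a framework inspired by world models. The framework learns a latent state of the scene and predicts future spatial changes to support deeper understanding of commands and grounding decisions. The authors also build DrivePilot, a multi-source visual grounding dataset, and validate the method on multiple benchmarks, where it shows strong robustness and efficiency.

Abstract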

Interpreting natural-language commands to localize target objects is critical for autonomous driving (AD). Existing visual grounding (VG) methods for autonomous vehicles (AVs) typically struggle with ambiguous, context-dependent instructions, as they lack reasoning over 3D spatial relations and anticipated scene evolution. Grounded in the principles of world models, we propose ThinkDeeper, a framework that reasons about future spatial states before making grounding decisions. At its core is a Spatial-Aware World Model (SA-WM) that learns to reason ahead by distilling the current scene into a command-aware latent state and rolling out a sequence of future latent states, providing forward-looking cues for disambiguation. Complementing this, a hypergraph-guided decoder hierarchically fuses these states with the multimodal input, capturing higher-order spatial dependencies for robust localization. In addition, we present DrivePilot, a multi-source VG dataset in AD, featuring semantic annotations generated by a Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT)-prompted LLM pipeline. In extensive evaluations on six benchmarks, ThinkDeeper ranks #1 on the Talk2Car leaderboard and surpasses state-of-the-art baselines on the DrivePilot, MoCAD, and RefCOCO/+/g benchmarks. Notably, it shows strong robustness and efficiency in challenging scenes (long-text, multi-agent, ambiguity) and retains superior performance even when trained on 50% of the data.

2511.22316 2026-05-11 cs.LG

Outlier Smoothing with Closed-Form Rotations for W4A4 Large Language Model Quantization

Jinying Xiao, Bin Ji, Shasha Li, Xiaodong Liu, Ma Jun, Chao Wang, Wei Li, Ye Zhong, Xuan Xie, Nyima Tashi, Jie Yu

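AI summary: This paper studies convergence pathology in large language model (LLM) quantization and proposes SingleQuant, a single-pass quantization framework that decouples from quantization truncation, eliminating gradient noise and non-smoothness and improving both quantization efficiency and model performance. The method introduces an Alignment Rotation Transformation (ART) and a Uniformity Rotation Transformation (URT) that target different types of activation outliers, using closed-form optimal rotations and distribution reshaping respectively, which greatly accelerates quantization and improves task performance. Experiments show that SingleQuant outperforms existing methods on multiple tasks; for example, when quantizing LLaMA-2-13B it achieves a 1,400$\times$ speedup while raising average task performance by 0.57%.

Comments 9 pages, 4 figures

Abstract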

Large language model (LLM) quantization facilitates deploying LLMs in resource-limited settings, but existing methods that combine incompatible gradient optimization and quantization truncation lead to serious convergence pathology. This prolongs quantization time and degrades LLMs' task performance. Our studies confirm that the Straight-Through Estimator (STE) on Stiefel manifolds introduces non-smoothness and gradient noise, obstructing optimization convergence and blocking high-fidelity quantized LLM development despite extensive training. To tackle the above limitations, we propose SingleQuant, a single-pass quantization framework that decouples from quantization truncation, thereby eliminating the above non-smoothness and gradient noise factors. Specifically, SingleQuant constructs Alignment Rotation Transformation (ART) and Uniformity Rotation Transformation (URT) targeting distinct activation outliers, where ART achieves smoothing of outlier values via closed-form optimal rotations, and URT reshapes distributions through geometric mapping. Both matrices comprise strictly formulated Givens rotations with predetermined dimensions and rotation angles, enabling promising LLM task performance within a short time. Experimental results demonstrate SingleQuant's superiority over the selected baselines across diverse tasks on 7B-70B LLMs. To be more precise, SingleQuant enables quantized LLMs to achieve higher task performance while necessitating less time for quantization. For example, when quantizing LLaMA-2-13B, SingleQuant achieves a 1,400$\times$ quantization speedup and increases average task performance by 0.57\% compared to the selected best baseline.
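
As a rough illustration of how a Givens rotation with a closed-form angle can smooth an outlier, the sketch below rotates the plane spanned by an outlier coordinate and a small coordinate so that both end up with equal magnitude. The vector norm is preserved while the largest absolute value shrinks, which is what makes low-bit quantization easier. The angle rule and function names are assumptions for illustration and are not the paper's ART or URT construction.

```python
import math


def equalizing_givens_angle(a, b):
    """Closed-form angle that rotates the pair (a, b) onto (r, r) / sqrt(2), with r = hypot(a, b)."""
    return math.atan2(b, a) - math.pi / 4


def apply_givens(x, i, j, theta):
    """Apply a Givens rotation in the (i, j) plane; all other coordinates are unchanged."""
    c, s = math.cos(theta), math.sin(theta)
    y = list(x)
    y[i] = c * x[i] + s * x[j]
    y[j] = -s * x[i] + c * x[j]
    return y


# Toy activation vector with one outlier channel (index 0).
x = [8.0, 0.5, -0.3, 0.1]
theta = equalizing_givens_angle(x[0], x[3])
y = apply_givens(x, 0, 3, theta)
```

After the rotation, channels 0 and 3 both hold about 5.66 instead of 8.0 and 0.1. Because the rotation is orthogonal, a layer's output is unchanged if the transpose of the same rotation is folded into the weights that consume these activations.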

2511.18085 2026-05-11 cs.RO cs.AI

Continually Evolving Skill Knowledge in Vision Language Action Model

Yuxuan Wu, Guangming Wang, Zhiheng Yang, Tianchen Deng, Maoqing Yao, Brian Sheil, Hesheng Wang

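AI summary: This paper studies knowledge accumulation during continual learning in vision-language-action (VLA) models and proposes Stellar VLA, a knowledge-driven framework that adds no network parameters. The method jointly optimizes task representations and a knowledge space to enable self-evolving knowledge learning, and introduces a knowledge-guided expert routing mechanism to improve task adaptation. Experiments show that Stellar VLA performs strongly on the LIBERO benchmark and on a real dual-arm platform, with particularly good results on hierarchical manipulation tasks.

Abstract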

Vision-language-action (VLA) models show promising knowledge accumulation ability from pretraining, yet continual learning in VLA remains challenging, especially for efficient adaptation. Existing continual imitation learning (CIL) methods often rely on additional parameters or external modules, limiting scalability for large VLA models. We propose Stellar VLA, a knowledge-driven CIL framework that does not increase network parameters. Two progressively extended variants are designed: T-Stellar for flat task-centric modeling and TS-Stellar for hierarchical task-skill structure. Stellar VLA enables self-evolving knowledge learning by jointly optimizing task representations and a learned knowledge space. We propose a knowledge-guided expert routing mechanism conditioned on knowledge relation and Top-K semantic embeddings, enabling task specialization without increasing model size. Experiments on the LIBERO benchmark show that Stellar VLAs achieve strong performance among both VLA and CIL baselines, using only 1% data replay. Real-world evaluation on a dual-arm platform with distinct embodiment and scene configurations validates effective knowledge transfer. TS-Stellar excels in hierarchical manipulation, and visualizations reveal robust knowledge retention and task discovery. Project Website: https://stellarvla.github.io/

2511.15204 2026-05-11 cs.CV cs.AI

Physics-Based Benchmarking Metrics for Multimodal Synthetic Images

Kishor Datta Gupta, Marufa Kamal, Md. Mahfuzur Rahman, Fahad Rahman, Mohd Ariful Haque, Sunzida Siddique

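AI Summary Mainstream multimodal image evaluation metrics such as BLEU and CIDEr are limited in capturing semantic and structural accuracy, especially in domain-specific or context-dependent scenarios. This paper proposes PCMDE, a physics-constrained multimodal data evaluation metric that combines large language models, knowledge mapping, and vision-language models to measure the semantic and structural plausibility of synthetic images more accurately. The method uses a multi-stage architecture for feature extraction, confidence-weighted fusion, and physics-guided reasoning, improving the accuracy and applicability of evaluation.

Abstract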

Current state-of-the-art metrics such as BLEU, CIDEr, VQA score, SigLIP-2, and CLIPScore often fail to capture semantic or structural accuracy, especially in domain-specific or context-dependent scenarios. To overcome these limitations, this paper proposes a Physics-Constrained Multimodal Data Evaluation (PCMDE) metric that combines large language models with reasoning, knowledge-based mapping, and vision-language models. The architecture comprises three main stages: (1) extraction of spatial and semantic information as multimodal features through object detection and VLMs; (2) Confidence-Weighted Component Fusion for adaptive component-level validation; and (3) physics-guided reasoning using large language models to enforce structural and relational constraints (e.g., alignment, position, consistency).

2511.12090 2026-05-11 cs.CV

Teaching Prompts to Coordinate: Hierarchical Layer-Grouped Prompt Tuning for Continual Learning

Shengqin Jiang, Tianqi Kong, Yuankai Qi, Haokui Zhang, Lina Yao, Quan Z. Sheng, Qingshan Liu, Ming-Hsuan Yang

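AI Summary This paper studies prompt tuning for continual learning, which aims to adapt to new tasks by introducing learnable prompts without updating the pre-trained model's parameters while reducing forgetting of knowledge from earlier tasks. Existing methods typically attach an independent task-specific prompt to each network layer, but such highly flexible tuning can cause unnecessary updates in some layers and increase the risk of catastrophic forgetting. The authors therefore propose a hierarchical layer-grouped prompt tuning method that shares prompts within each group and generates sub-prompts from a single root prompt, strengthening coordination across layers and improving model stability. Experiments show that the method outperforms existing state-of-the-art methods on multiple benchmarks.

Comments We have reconsidered the issue, and an updated version will be released later

Abstract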

Prompt-based continual learning methods fine-tune only a small set of additional learnable parameters while keeping the pre-trained model's parameters frozen. This enables efficient adaptation to new tasks while mitigating the risk of catastrophic forgetting. These methods typically attach one independent task-specific prompt to each layer of the pre-trained model to locally modulate its features, ensuring that the layer's representation aligns with the requirements of the new task. However, although introducing learnable prompts independently at each layer provides high flexibility for adapting to new tasks, this overly flexible tuning can make certain layers susceptible to unnecessary updates. Because all prompts up to the current task are added together to form a final prompt for all seen tasks, the model may easily overwrite feature representations essential to previous tasks, which increases the risk of catastrophic forgetting. To address this issue, we propose a novel hierarchical layer-grouped prompt tuning method for continual learning. It improves model stability in two ways: (i) Layers in the same group share roughly the same prompts, which are adjusted by position encoding. This helps preserve the intrinsic feature relationships and propagation pathways of the pre-trained model within each group. (ii) It uses a single task-specific root prompt to learn to generate sub-prompts for each layer group. In this way, all sub-prompts are conditioned on the same root prompt, enhancing their synergy and reducing independence. Extensive experiments across four benchmarks demonstrate that our method achieves favorable performance compared with several state-of-the-art methods.

2511.09907 2026-05-11 cs.AI cs.CV

Learning to Pose Problems: Reasoning-Driven and Solver-Adaptive Data Synthesis

Yongxian Wei, Yilin Zhao, Zixuan Hu, Li Shen, Xinrui Chen, Runxi Cheng, Sinan Du, Hao Yu, Chun Yuan, Dian Li

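AI Summary This study proposes a reasoning-driven, solver-adaptive data synthesis method for training large reasoning models. The core idea is to plan problem-generation directions through explicit reasoning, adjust problem difficulty to the solver's ability, and use intermediate reasoning to increase problem diversity. Experiments show that the framework improves model performance on multiple mathematical and general reasoning benchmarks, with an average gain of 3.4%.

Abstract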

Data synthesis for training large reasoning models offers a scalable alternative to limited, human-curated datasets, enabling the creation of high-quality data. However, existing approaches face several challenges: (i) indiscriminate generation that ignores the solver's ability and yields low-value problems, or reliance on complex data pipelines to balance problem difficulty; and (ii) a lack of reasoning in problem generation, leading to shallow problem variants. In this paper, we develop a problem generator that reasons explicitly to plan problem directions before synthesis and adapts difficulty to the solver's ability. Specifically, we construct related problem pairs and augment them with intermediate problem-design CoT produced by a reasoning model. These data are used to bootstrap problem-design strategies in the generator. Then, we treat the solver's feedback on synthetic problems as a reward signal, enabling the generator to calibrate difficulty and produce complementary problems near the edge of the solver's competence. Extensive experiments on 10 mathematical and general reasoning benchmarks show that our proposed framework achieves a cumulative average improvement of 3.4%, demonstrating robust generalization across both language and vision-language models.

2511.09598 2026-05-11 cs.LG

Amortized Multi-Objective Optimization Across Tasks with Generative Solution Modeling

Tingyang Wei, Jiao Liu, Abhishek Gupta, Chin Chun Ooi, Puay Siew Tan, Yew-Soon Ong

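AI Summary This paper addresses the challenge of efficiently solving expensive multi-objective optimization problems (EMOPs) over a continuous task parameter space and proposes a parametric multi-objective Bayesian optimization method based on generative solution modeling. By learning an inverse model, the method amortizes optimization cost across the task parameter space and can directly predict solutions for any query task without re-evaluation. The core innovation is the combination of conditional generative models with an acquisition-function search that exploits inter-task synergies, which improves the efficiency and generalization of multi-task optimization.

Comments Accepted by IJCAI 2026

Abstract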

Many real-world applications require solving families of expensive multi-objective optimization problems (EMOPs) under varying operational conditions. This can be formulated as parametric expensive multi-objective optimization problems (P-EMOPs), where each task parameter defines a distinct optimization instance. Current multi-objective Bayesian optimization methods have been widely used for finding finite sets of Pareto optimal solutions for each task. However, P-EMOPs present a fundamental challenge: the continuous task parameter space can contain infinitely many distinct problems, each requiring separate expensive evaluations. To address this, we propose learning an inverse model to amortize the multi-objective optimization cost across the continuous task-preference space, enabling direct solution prediction for any query without the need for expensive re-evaluation. This paper introduces a novel parametric multi-objective Bayesian optimizer that learns this inverse model by alternating between (1) generative solution sampling via conditional generative models and (2) acquisition-driven search leveraging inter-task synergies. This approach enables effective optimization across multiple tasks and ultimately achieves direct solution prediction for unseen parameterized EMOPs without re-evaluations. We theoretically justify the faster convergence obtained by leveraging inter-task synergies through task-aware Gaussian processes. Building on this, empirical studies on synthetic and real-world benchmarks further verify the effectiveness of the proposed parametric optimizer.

2511.09117 2026-05-11 cs.CV

DKDS: A Benchmark Dataset of Degraded Kuzushiji Documents with Seals for Detection and Binarization

Rui-Yang Ju, Kohei Yamashita, Hirotaka Kameko, Shinsuke Mori

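AI Summary This paper presents DKDS, a new benchmark dataset for the recognition and binarization challenges that degradation and seals pose in documents written in Kuzushiji, a pre-modern Japanese cursive script. Built with the help of a trained expert, the dataset contains degraded documents with seals and defines two tracks: Kuzushiji character and seal detection, and document binarization. The study provides baseline results for several detection and binarization methods, offering a new experimental benchmark for related research.

Comments IJDAR 2026 (ICDAR-IJDAR Track)

Abstract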

Kuzushiji, a pre-modern Japanese cursive script, can currently be read and understood by only a few thousand trained experts in Japan. With the rapid development of deep learning, researchers have begun applying Optical Character Recognition (OCR) techniques to transcribe Kuzushiji into modern Japanese. Although existing OCR methods perform well on clean pre-modern Japanese documents written in Kuzushiji, they often fail to consider various types of noise, such as document degradation and seals, which significantly affect recognition accuracy. To the best of our knowledge, no existing dataset specifically addresses these challenges. To address this gap, we introduce the Degraded Kuzushiji Documents with Seals (DKDS) dataset as a new benchmark for related tasks. We describe the dataset construction process, which involves the assistance of a trained Kuzushiji expert, and define two benchmark tracks: (1) Kuzushiji character and seal detection and (2) document binarization. For the Kuzushiji character and seal detection track, we provide baseline results using several recent versions of YOLO to detect Kuzushiji characters and seals. For the document binarization track, we present baseline results from traditional binarization algorithms, traditional algorithms combined with K-means clustering, two state-of-the-art (SOTA) generative adversarial network (GAN) methods, and our improved conditional GAN (cGAN)-based method. The DKDS dataset and the implementation code for baseline methods are available at https://ruiyangju.github.io/DKDS.

2511.02805 2026-05-11 cs.CL cs.AI

MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning

Qianhao Yuan, Jie Lou, Zichao Li, Jiawei Chen, Yaojie Lu, Hongyu Lin, Le Sun, Debing Zhang, Xianpei Han

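AI Summary MemSearcher is an end-to-end reinforcement learning framework for training large language models to reason, search, and manage memory efficiently across multi-turn interactions. It maintains a compact memory that retains only information relevant to the current question, avoiding the overlong context and added compute cost that come from concatenating the full history. Experiments show that MemSearcher outperforms history-concatenation baselines on multiple public datasets while keeping token counts nearly constant across turns.

Comments Accepted to ACL 2026

Abstract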

LLM-based search agents often concatenate the full interaction history into the context, producing long and noisy inputs, and increasing compute cost and GPU memory overhead. To address this issue, we propose MemSearcher, an agent framework that maintains a compact memory during multi-turn interactions, retaining only question-relevant information and thereby keeping the context length stable across turns. Training MemSearcher is challenging because each trajectory spans multiple turns under different LLM contexts, making each turn an independent optimization target in reinforcement learning. We introduce multi-context GRPO, which propagates trajectory-level advantages to all turns for end-to-end optimization. Experiments demonstrate that MemSearcher outperforms strong history-concatenation (ReAct-style) baselines on a range of public datasets while maintaining nearly constant token counts across multi-turn interactions. The code and models will be publicly available at https://github.com/icip-cas/MemSearcher
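
The abstract above describes propagating trajectory-level advantages to all turns. A minimal sketch of that idea follows, assuming the usual GRPO-style normalization (reward minus group mean, divided by group standard deviation); the abstract does not give the paper's exact formulation, and the function names, rewards, and turn counts here are hypothetical.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each trajectory's reward against its sampling group.

    Assumption: standard GRPO-style normalization, not taken from the paper.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def propagate_to_turns(advantages, turns_per_trajectory):
    """Give every turn of a trajectory that trajectory's advantage.

    Each turn is generated under its own context, so each turn becomes a
    separate optimization target that shares the trajectory-level signal.
    """
    return [[adv] * n for adv, n in zip(advantages, turns_per_trajectory)]


# Hypothetical group of four sampled trajectories for one question.
rewards = [1.0, 0.0, 1.0, 0.0]
turns = [3, 2, 4, 1]

advantages = group_relative_advantages(rewards)
per_turn = propagate_to_turns(advantages, turns)

for reward, turn_advs in zip(rewards, per_turn):
    print(reward, turn_advs)
```

In this sketch a trajectory with many turns contributes more per-turn training targets than a short one; whether and how the paper reweights for that is not stated in the abstract.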

2510.19788 2026-05-11 cs.AI cs.LG

Benchmarking World-Model Learning with Environment-Level Queries

Archana Warrier, Dat Nguyen, Michelangelo Naim, Moksh Jain, Yichao Liang, Karen Schroeder, Cambridge Yang, Joshua B. Tenenbaum, Sebastian Vollmer, Kevin Ellis, Zenna Tavares

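AI Summary This study proposes WorldTest, a new evaluation protocol that tests whether the world model an agent learns supports diverse environment-level queries rather than only prediction tasks based on observed trajectories. The authors build AutumnBench, a benchmark of 43 interactive grid-world environments and 129 tasks, to evaluate humans and learning agents across query types. Experiments show that humans substantially outperform current frontier models on these tasks, which may be attributable to human advantages in exploration and belief updating. The work offers a new framework for evaluating the generality of world models and a reference for extending such evaluation to more complex domains.

Comments 34 pages, 10 figures

Abstract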

World models are central to building AI agents capable of flexible reasoning and planning. Yet current evaluations (i) test only properties measurable from observed interactions, such as next-frame prediction or task return, and (ii) do not test whether a learned model supports diverse queries about the environment. In contrast, humans build general-purpose models that can answer many different questions about an environment, including questions that require understanding global structure and counterfactual consequences. We propose WorldTest: a protocol for evaluating whether agents learn models that support multiple environment-level queries, that is, questions whose answers depend on properties of the full environment, not just observed trajectories. Individually, these queries can target properties (e.g., reachability or the effects of interventions) that no single rollout distribution determines. Collectively, they assess model generality across query types. We instantiate WorldTest as AutumnBench, a benchmark of 43 interactive grid-world environments and 129 tasks across three query families for both humans and learning agents. Experiments with 517 human participants and five frontier models show that humans substantially outperform these models, a gap we attribute to differences in exploration and belief updating. AutumnBench provides a framework for evaluating world-model learning in grid-world environments with environment-level queries, and WorldTest provides a template for extending such evaluations to richer domains.

2510.08638 2026-05-11 cs.CV cs.AI

Into the Rabbit Hull: From Task-Relevant Concepts in DINO to Minkowski Geometry

Thomas Fel, Binxu Wang, Michael A. Lepori, Matthew Kowal, Andrew Lee, Randall Balestriero, Sonia Joseph, Ekdeep S. Lubana, Talia Konkle, Demba Ba, Martin Wattenberg

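AI Summary This paper studies the nature of task-relevant concepts in DINOv2. Using the Linear Representation Hypothesis and sparse autoencoders (SAEs), the authors build a 32,000-unit dictionary to analyze how the model uses concepts across tasks. They find that classification, segmentation, and depth estimation each rely on different kinds of perceptual concepts, and that representations are partly dense rather than strictly sparse and exhibit geometric structure. Based on these findings, they propose the Minkowski Representation Hypothesis (MRH), which interprets vision-transformer representations as convex combinations of archetypes, providing a new theoretical framework for understanding the model's internal representations.

Comments Accepted at ICLR 2026

Journal ref ICLR 2024

Abstract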

DINOv2 is routinely deployed to recognize objects, scenes, and actions; yet the nature of what it perceives remains unknown. As a working baseline, we adopt the Linear Representation Hypothesis (LRH) and operationalize it using SAEs, producing a 32,000-unit dictionary that serves as the interpretability backbone of our study, which unfolds in three parts. In the first part, we analyze how different downstream tasks recruit concepts from our learned dictionary, revealing functional specialization: classification exploits "Elsewhere" concepts that fire everywhere except on target objects, implementing learned negations; segmentation relies on boundary detectors forming coherent subspaces; depth estimation draws on three distinct monocular depth cues matching visual neuroscience principles. Following these functional results, we analyze the geometry and statistics of the concepts learned by the SAE. We find that representations are partly dense rather than strictly sparse. The dictionary evolves toward greater coherence and departs from maximally orthogonal ideals (Grassmannian frames). Within an image, tokens occupy a low-dimensional, locally connected set that persists after removing position information. These signs suggest that representations are organized beyond linear sparsity alone. Synthesizing these observations, we propose a refined view: tokens are formed by combining convex mixtures of archetypes (e.g., a rabbit among animals, brown among colors, fluffy among textures). This structure is grounded in Gärdenfors' conceptual spaces and in the model's mechanism, as multi-head attention produces sums of convex mixtures, defining regions bounded by archetypes. We introduce the Minkowski Representation Hypothesis (MRH) and examine its empirical signatures and implications for interpreting vision-transformer representations.

2510.07926 2026-05-11 cs.CL

Comprehensiveness Metrics for Automatic Evaluation of Factual Recall in Text Generation

Adam Dejl, James Barry, Alessandra Pascale, Javier Carnerero Cano

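AI Summary This paper studies how to automatically evaluate the comprehensiveness of factual recall in text generated by large language models, focusing on detecting omitted information and underrepresented viewpoints. The authors propose three automatic metrics: a decomposition method based on natural language inference, a metric based on comparing question-answer pairs, and an end-to-end approach that uses an LLM directly to identify missing content. Experiments show that the end-to-end approach is highly effective but comparatively less robust and interpretable. The study also evaluates the comprehensiveness of answers from several popular open-weight LLMs when responding from multiple sources.

Comments ACL 2026 Findings

Abstract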

Despite demonstrating remarkable performance across a wide range of tasks, large language models (LLMs) have also been found to frequently produce outputs that are incomplete or selectively omit key information. In sensitive domains, such omissions can result in significant harm comparable to that posed by factual inaccuracies, including hallucinations. In this study, we address the challenge of evaluating the comprehensiveness of LLM-generated texts, focusing on the detection of missing information or underrepresented viewpoints. We investigate three automated evaluation metrics: (1) an NLI-based method that decomposes texts into atomic statements and uses natural language inference (NLI) to identify missing facts, (2) a Q&A-based metric that extracts question-answer pairs and compares responses across sources, and (3) an end-to-end approach that directly identifies missing content using LLMs. Our experiments demonstrate the surprising effectiveness of the simple end-to-end metric compared to more complex metrics, though at the cost of reduced robustness, interpretability and result granularity. We further assess the comprehensiveness of responses from several popular open-weight LLMs when answering user queries based on multiple sources.

2510.04850 2026-05-11 cs.CL cs.AI

Detecting Distillation Data from Reasoning Models

Hengxiang Zhang, Hyeong Kyu Choi, Sharon Li, Hongxin Wei

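AI Summary This paper studies how to detect the distillation data of reasoning models, that is, how to determine whether a given question was included in a model's distillation data. To address the challenge that distillation data is only partially available, the authors propose Token Probability Deviation (TPD), a detection method that analyzes the probability patterns of tokens the model generates to identify questions the model has seen. Experiments show that the method substantially improves detection performance on multiple distillation datasets, raising AUC by up to 31%.

Abstract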

Reasoning distillation has emerged as a prevailing paradigm for transferring reasoning capabilities from large reasoning models to small language models. Yet, reasoning distillation risks data contamination: benchmark data may inadvertently be included in the distillation data, thereby inflating model performance metrics. In this work, we formally define the distillation data detection task, which determines whether a given question is included in the model's distillation data. The unique challenge of this task lies in the partial availability of distillation data. To address this, we propose Token Probability Deviation (TPD), a detection method that leverages the probability patterns of output tokens generated by the model instead of input tokens. Our method is motivated by the observation that seen questions tend to elicit more near-deterministic tokens generated by the models than unseen ones. Our TPD score is thus designed to quantify the token-level deviation of generated tokens from a high-confidence reference probability. Consequently, seen questions can yield substantially lower TPD scores than unseen ones, enabling strong detection performance. Extensive experiments demonstrate the effectiveness of our approach, improving detection AUC by up to 31% on distillation datasets.
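
The abstract above describes a score that quantifies how far generated-token probabilities fall from a high-confidence reference. The sketch below is one plausible reading, not the paper's formula: it averages the shortfall of each token's probability below a reference value. The function name, the reference value of 0.9, the clipping at zero, and the example probabilities are all assumptions for illustration.

```python
def tpd_score(token_probs, reference=0.9):
    """Mean shortfall of generated-token probabilities below a reference.

    Assumed form for illustration: tokens at or above the reference add
    nothing, and less confident tokens add their distance below it.
    """
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    shortfalls = [max(0.0, reference - p) for p in token_probs]
    return sum(shortfalls) / len(shortfalls)


# Hypothetical probabilities the model assigned to its own generated tokens.
seen_question = [0.99, 0.97, 0.95, 0.98]    # near-deterministic generation
unseen_question = [0.60, 0.90, 0.40, 0.85]  # less confident generation

print("seen:  ", tpd_score(seen_question))
print("unseen:", tpd_score(unseen_question))
```

Under this reading, a lower score suggests the question was seen during distillation, so a detector would compare the score against a decision threshold chosen on held-out data.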

2510.01569 2026-05-11 cs.AI cs.CL

InvThink: Premortem Reasoning for Safer Language Models

Yubin Kim, Taehan Kim, Eugene Park, Chunjong Park, Cynthia Breazeal, Daniel McDuff, Hae Won Park

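AI Summary This paper proposes InvThink, a training and prompting framework that improves language model safety by requiring the model to enumerate, analyze, and constrain potential failures before generating its final response. The method splits generation into three steps: enumerate potential harms, analyze their consequences, and generate the response under explicit constraints. Compared with existing methods, it shows higher safety scores at larger model sizes and mitigates the safety tax. Experiments show that InvThink performs well on general safety tasks and also markedly reduces harmful behavior in professional ethics domains such as medicine, finance, and law, as well as in agentic alignment scenarios.

Abstract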

We present InvThink, a training and prompting framework that requires the model to enumerate, analyze, and constrain potential failures before generating its final response. Unlike existing safety alignment methods that optimize only for safe final responses, InvThink structures generation into three steps: (1) enumerate potential harms, (2) analyze their consequences, and (3) generate the response under explicit mitigation constraints. We report three findings: (i) InvThink shows higher safety scores at larger model sizes than existing safety prompting and alignment baselines. (ii) InvThink mitigates the safety tax: models trained with InvThink preserve their reasoning capability on standard benchmarks. (iii) Beyond general safety tasks, InvThink also reduces harmful behavior in professional ethics domains (medicine, finance, law) and in agentic misalignment scenarios, achieving up to a 32% reduction in harmfulness over zero-shot baselines and 16% over SafetyPrompt. We extend InvThink with supervised fine-tuning and GRPO-based reinforcement learning across three LLM families.

2510.01290 2026-05-11 cs.LG

ThinKV: Thought-Adaptive KV Cache Compression for Efficient Reasoning Models

Akshat Ramachandran, Marina Neseem, Charbel Sakr, Rangharajan Venkatesan, Brucek Khailany, Tushar Krishna

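AI Summary This paper proposes ThinKV, a key-value (KV) cache compression framework for more efficient inference in large reasoning models. Building on the observation that attention sparsity in the chain of thought (CoT) reveals distinct thought types with differing importance, the method applies a hybrid quantization-eviction strategy that sets token precision according to thought importance and progressively evicts tokens from less important thoughts during reasoning. Experiments show that ThinKV retains near-original accuracy while reducing the KV cache to less than 5% of its original size and significantly increasing inference throughput.

Comments ICLR 2026 (Oral)

Abstract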

The long-output context generation of large reasoning models enables extended chain of thought (CoT) but also drives rapid growth of the key-value (KV) cache, quickly overwhelming GPU memory. To address this challenge, we propose ThinKV, a thought-adaptive KV cache compression framework. ThinKV is based on the observation that attention sparsity reveals distinct thought types with varying importance within the CoT. It applies a hybrid quantization-eviction strategy, assigning token precision by thought importance and progressively evicting tokens from less critical thoughts as reasoning trajectories evolve. Furthermore, to implement ThinKV, we design a kernel that extends PagedAttention to enable efficient reuse of evicted tokens' memory slots, eliminating compaction overheads. Extensive experiments on DeepSeek-R1-Distill, GPT-OSS, and NVIDIA AceReason across mathematics and coding benchmarks show that ThinKV achieves near-lossless accuracy with less than 5% of the original KV cache, while improving performance with up to 5.8x higher inference throughput over state-of-the-art baselines.

2510.00436 2026-05-11 cs.AI cs.CL

Automated Evaluation can Distinguish the Good and Bad AI Responses to Patient Questions about Hospitalization

Sarvesh Soni, Dina Demner-Fushman

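AI Summary This study examines how automated methods can evaluate the quality of AI system responses to patients' hospitalization-related questions. The researchers collected responses from 28 AI systems to 100 patient cases and assessed them along three dimensions: whether the response answers the question, its use of clinical evidence, and its use of medical knowledge. Anchored to clinician-authored reference answers, automated evaluation effectively distinguished good from poor AI responses, suggesting that carefully designed automated evaluation can support comparative assessment of AI systems at scale and facilitate patient-clinician communication.

Comments Accepted for publication in npj Digital Medicine

Abstract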

Automated approaches to answering patient-posed health questions are on the rise, but selecting among systems requires reliable evaluation. The current gold standard for evaluating free-text artificial intelligence (AI) responses, human expert review, is labor-intensive and slow, limiting scalability. Automated metrics are promising yet variably aligned with human judgments and often context-dependent. To address the feasibility of automating the evaluation of AI responses to hospitalization-related questions posed by patients, we conducted a large systematic study of evaluation approaches. Across 100 patient cases, we collected responses from 28 AI systems (2800 total) and assessed them along three dimensions: whether a system response (1) answers the question, (2) appropriately uses clinical note evidence, and (3) uses general medical knowledge. Using clinician-authored reference answers to anchor metrics, automated rankings closely matched human ratings. Our findings suggest that carefully designed automated evaluation can scale comparative assessment of AI systems and support patient-clinician communication.