arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.17471 2026-05-20 cs.LG cs.NA math.NA math.OC

WinQ: Accelerating Quantization-Aware Training of Language Models Around Saddle Points

Dongyue Li, Zechun Liu, Kai Yi, Zhenshuo Zhang, Changsheng Zhao, Raghuraman Krishnamoorthi, Harshit Khaitan, Hongyang R. Zhang, Steven Li

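AI summary: This paper studies the convergence of quantization-aware training (QAT) at low bit-widths and proposes WinQ, an algorithm that accelerates training and improves performance by periodically resetting weights and computing gradients at noise-injected weights.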

Comments 23 pages; To appear in ICML 2026

Abstract

Quantization-aware training (QAT) is widely adopted to quantize language models by training full-precision weights using gradients from the quantized model. Its main bottleneck is slow convergence and an early performance plateau, particularly at bit-widths below 4. While this problem has been observed in prior work, its precise cause remains unclear. In this paper, we analyze the convergence of QAT by estimating the spectrum of the loss-surface Hessians. We find that the weights converge to flat regions around saddle points, where the Hessian has large fractions of both positive and negative eigenvalues. During training, an increasing fraction of Hessian eigenvalues concentrates around zero, and their magnitudes decrease. At lower bit-widths, the magnitude of eigenvalues in the Hessian spectrum is significantly smaller. To mitigate these issues, we propose an algorithm called WinQ to accelerate QAT, which involves: (1) periodically resetting weights to the linear interpolation of full-precision and quantized weights, reducing the distance to the quantization grid and increasing eigenvalue magnitude, and (2) computing gradients of noise-injected weights to regularize the Hessian. Extensive experiments show that WinQ accelerates QAT by up to 4 times across various quantization methods and models. Under the same training cost, WinQ improves state-of-the-art sub-4-bit quantization by up to 8.8%. These results are consistent across 16 settings with different language models, quantization methods, and bit widths.
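
Illustrative sketch (not the authors' code): the two WinQ ingredients named in the abstract can be written down in a few lines. The uniform symmetric quantizer, the interpolation coefficient alpha, the noise scale sigma, and all function names are assumptions made for this example.

```python
import random

def quantize(w, bits=2, scale=1.0):
    """Uniform symmetric quantizer; an assumed stand-in for the real one."""
    levels = 2 ** (bits - 1)
    q = round(w / scale * levels)
    q = max(-levels, min(levels - 1, q))  # clamp to the representable range
    return q * scale / levels

def reset_weights(weights, alpha=0.5, bits=2):
    """Ingredient (1): interpolate each weight toward its quantized value."""
    return [(1 - alpha) * w + alpha * quantize(w, bits) for w in weights]

def noisy_gradient(grad_fn, weights, sigma=0.01, rng=None):
    """Ingredient (2): evaluate the gradient at noise-perturbed weights."""
    rng = rng or random.Random(0)
    perturbed = [w + rng.gauss(0.0, sigma) for w in weights]
    return grad_fn(perturbed)
```

With alpha between 0 and 1, each reset weight lies between the original weight and its grid point, so its distance to the quantization grid cannot grow. How often to reset and how to choose alpha and sigma are not stated in the abstract.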

2605.16692 2026-05-20 cs.LG cs.AI cs.RO

EfficientTDMPC: Improved MPC Objectives for Sample-Efficient Continuous Control

Thomas Evers, Cristian Meo, Wendelin Bohmer, Justin Dauwels, Yaniv Oren

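AI summary: This paper proposes EfficientTDMPC, a model-based reinforcement learning method for continuous control that improves sample efficiency by reducing return-estimation error and increasing data freshness.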

Abstract

We introduce EfficientTDMPC, a sample-efficient model-based reinforcement learning method for continuous control built on the TD-MPC family of algorithms. Central to this family is a planner that aims to find an action sequence that maximizes the estimated return. The return is estimated using a learned model and value networks, each of which can introduce error. EfficientTDMPC proposes to reduce this error in two ways. First, it introduces an ensemble of dynamics models and averages the return estimates across those models and across different rollout depths. Second, it adds the option to apply an uncertainty penalty to the planner objective, yielding a planner that avoids actions with uncertain return estimates. It then adds practical improvements which increase buffer data freshness and reduce compute. Lastly, we find that our contributions enable EfficientTDMPC to benefit more from a higher update-to-data (UTD) ratio, further improving sample efficiency. To the best of our knowledge, in the low data regime of each benchmark, EfficientTDMPC achieves state-of-the-art (SOTA) in terms of sample efficiency on HumanoidBench-Hard and DMC hard, while matching SOTA on DMC easy.
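
Illustrative sketch (not the authors' code): one way to read "average the return estimates across models and rollout depths, then penalize uncertainty" is a mean-minus-spread score. Using the population standard deviation as the uncertainty measure, the penalty weight, and the function names are all assumptions made for this example.

```python
from statistics import mean, pstdev

def planner_score(return_estimates, penalty=1.0):
    """return_estimates[m][d]: estimate from dynamics model m at rollout depth d."""
    flat = [r for per_model in return_estimates for r in per_model]
    # Average over the ensemble and depths, minus an uncertainty penalty.
    return mean(flat) - penalty * pstdev(flat)

def pick_action_sequence(candidates, penalty=1.0):
    """Return the name of the candidate action sequence with the best score."""
    return max(candidates, key=lambda name: planner_score(candidates[name], penalty))

candidates = {
    "safe": [[1.0, 1.0], [1.0, 1.0]],     # models and depths agree
    "risky": [[3.0, -0.5], [3.0, -0.5]],  # higher mean, large disagreement
}
print(pick_action_sequence(candidates, penalty=1.0))
print(pick_action_sequence(candidates, penalty=0.0))
```

With the penalty switched off the planner prefers the higher-mean candidate; with it switched on it prefers the candidate whose estimates agree.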

2605.16565 2026-05-20 cs.AI cs.OS

Skim: Speculative Execution for Fast and Efficient Web Agents

Mike Wong, Kevin Hsieh, Suman Nath, Ravi Netravali

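AI summary: Skim is a speculative execution framework that exploits the predictable structure of purpose-built websites to reduce web agents' per-task cost and latency while maintaining accuracy.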

Comments 14 pages, 21 figures

Abstract

Skim is a speculative execution framework for web agents that exploits the predictable structure of purpose-built websites. Today's web-agent expense is not intrinsic to the tasks but a property of how agents are composed: frontier-model inference, browser rendering, and ReAct-style planning are applied to every step of every task regardless of complexity. Skim's key observation is that websites enforce stable URL patterns, answer formats, and task-to-trajectory mappings across queries of the same type, so most queries can bypass these heavyweight components entirely. An offline profiler captures these patterns once per site. At runtime, Skim matches each query to a template, synthesizes the destination URL, and extracts the answer with a small model. A lightweight verifier gates each fast-path output against the query and schema; rare misspeculations cascade to the full agent, warm-started by the fast path's final URL to preserve upstream trajectory progress. Across standard web-agent benchmarks paired with three backbone agents (WebVoyager, AgentOccam, BrowserUse), Skim reduces median per-task cost by 1.9x and latency by 33.4% with no accuracy loss.
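
Illustrative sketch (not Skim's implementation): the runtime path described above is match a template, synthesize a URL, extract an answer, verify it, and fall back to the full agent on a miss. The template table, the shop.example URL, the regex-based schema check, and the extract and full_agent callables are all hypothetical.

```python
import re

# Hypothetical per-site templates: (query pattern, URL template, answer schema).
TEMPLATES = [
    (
        re.compile(r"price of (?P<item>[\w ]+)", re.I),
        "https://shop.example/search?q={item}",
        re.compile(r"^\$\d+(\.\d{2})?$"),
    ),
]

def fast_path(query, extract):
    """Return (answer, url, verified) for the first matching template."""
    for pattern, url_template, answer_schema in TEMPLATES:
        match = pattern.search(query)
        if match:
            slots = {k: v.strip().replace(" ", "+") for k, v in match.groupdict().items()}
            url = url_template.format(**slots)
            answer = extract(url)  # stands in for the small extraction model
            verified = answer is not None and bool(answer_schema.match(answer))
            return (answer if verified else None), url, verified
    return None, None, False

def run(query, extract, full_agent):
    """Use the fast path when verified; otherwise warm-start the full agent."""
    answer, url, verified = fast_path(query, extract)
    if verified:
        return answer
    return full_agent(query, start_url=url)
```

The verifier here only checks the answer format; the abstract describes a check against both the query and the schema, which this sketch does not attempt.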

2605.16447 2026-05-20 cs.LG cs.AI

Nested Spatio-Temporal Time Series Forecasting

Yinghao Ai, Yukai Zhou, Ruoxi Jiang, Junyi An, Chao Qu, Zhijian Zhou, Shiyu Wang, Fenglei Cao, Zenglin Xu, Furao Shen, Yuan Qi

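AI summary: This paper proposes a nested forecasting framework that couples future macro-level regional trends with micro-level historical observations for fine-grained forecasting. Spectral clustering builds semantically coherent regions that filter systematic noise while preserving key trends, and experiments show the method outperforms state-of-the-art baselines on multiple high-dimensional datasets.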

Comments Accepted by ICML 2026

Abstract

Spatiotemporal forecasting is critical for real-world applications like traffic management, yet capturing reliable interactions remains challenging under noisy and non-stationary conditions. Existing methods primarily rely on historical spatial priors, often failing to account for evolving temporal correlations and suffering from systematic errors. In this work, we propose a nested forecasting framework that couples future macro-level regional trends with micro-level historical observations, enabling top-down guidance from abstract future representations for fine-grained forecasting. Specifically, we employ a spectral clustering-based approach to construct semantically coherent regions, providing both theoretical and empirical evidence that this representation effectively filters systematic noise while preserving essential trends. Building on this, we develop a progressive coarse-to-fine predictor to integrate these representative features into the inference process. This enables the model to leverage trend predictions to anticipate dynamic anomalies, such as periodic offsets, in advance. Furthermore, extensive experiments on multiple high-dimensional datasets demonstrate that our method consistently outperforms state-of-the-art baselines, validating the effectiveness of future macro-guided nested forecasting.

2605.16170 2026-05-20 cs.LG

BAPR: Bayesian amnesic piecewise-robust reinforcement learning for non-stationary continuous control

Yifan Zhang, Liang Zheng

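AI summary: This study proposes BAPR, which combines Bayesian online change detection with robust ensemble reinforcement learning to address robustness and adaptivity in non-stationary continuous control, and backs its contraction and error-budget results for the abstract operator with machine-checked Lean 4 proofs.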

Abstract

Real-world control systems frequently operate under piecewise stationary conditions, where dynamics remain stable for extended periods before undergoing abrupt regime changes. Standard robust RL methods face a fundamental dilemma: a globally conservative policy wastes performance during stable periods, while a locally adaptive policy risks catastrophic failure when the regime changes undetected. We propose BAPR (Bayesian Amnesic Piecewise-Robust SAC), which unifies Bayesian Online Change Detection (BOCD) with robust ensemble RL. The BAPR operator -- a convex combination of mode-conditional Bellman operators weighted by a frozen belief distribution -- is a $\gamma$-contraction. A complementary counterexample, machine-verified in Lean 4, establishes a sharp boundary: when beliefs depend on the Q-function, the contraction factor becomes $\gamma + \lambda\Delta$ (where $\Delta$ is the mode reward gap), and contraction fails exactly when $\gamma + \lambda\Delta \geq 1$. We derive a component-wise formal error budget for the abstract operator -- every component machine-verified -- bounding post-switch recovery; the budget applies to the abstract mode-mixture operator and carries over to the implemented shared-critic algorithm only through the frozen-parameter design intuition. All results are formally verified with no "sorry" (1,145 lines across 3 Lean 4 files, 22 machine-verified theorems). BOCD drives an adaptive conservatism mechanism: the policy becomes maximally conservative after detected change-points and smoothly relaxes as confidence grows, with detection delay $O(\log(1/\delta))$. A context-conditioning module trained via RMDM loss provides mode-aware representations from simulator-provided mode IDs at training time and requires no mode labels at deployment.
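
Illustrative derivation (standard argument, not copied from the paper): the frozen-belief contraction claim follows from the triangle inequality. The notation is assumed for this sketch: $T_k$ are the mode-conditional Bellman operators, each a $\gamma$-contraction in the sup norm, $b_k \geq 0$ with $\sum_k b_k = 1$ are belief weights that do not depend on $Q$, and $T = \sum_k b_k T_k$.

```latex
\begin{aligned}
\|TQ - TQ'\|_\infty
  &= \Big\| \sum_k b_k \,(T_k Q - T_k Q') \Big\|_\infty \\
  &\le \sum_k b_k \,\|T_k Q - T_k Q'\|_\infty \\
  &\le \sum_k b_k \,\gamma \,\|Q - Q'\|_\infty
   = \gamma \,\|Q - Q'\|_\infty .
\end{aligned}
```

The first line relies on the same weights $b_k$ being applied to $Q$ and $Q'$. Once the beliefs depend on the Q-function that step no longer goes through as written, which is consistent with the larger factor $\gamma + \lambda\Delta$ quoted in the abstract.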

2605.16137 2026-05-20 cs.CV cs.RO

STABLE: Simulation-Ready Tabletop Layout Generation via a Semantics-Physics Dual System

Zhen Luo, Yixuan Yang, Xudong Xu, Jinkun Hao, Zhaoyang Lyu, Feng Zheng, Jiangmiao Pang, Yanwei Fu

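AI summary: This paper proposes STABLE, a semantics-physics dual system for generating simulation-ready tabletop layouts: a semantic reasoning module produces a coarse layout, and a physics correction module refines it to ensure physical plausibility, improving the physical validity of the generated scenes.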

Comments ICML 2026

Abstract

Generating simulation-ready tabletop scenes from task instructions is an intriguing and promising research direction in the field of Embodied AI. However, existing task-to-scene generation methods rely exclusively on large language models (LLMs) to predict scene layouts, inevitably yielding colliding or floating objects due to LLMs' inherent limitations in 3D spatial reasoning. In this paper, we present STABLE, a semantics-physics dual system tailored for simulation-ready tabletop scene generation. STABLE consists of two complementary modules: (i) a Semantic Reasoner, a fine-tuned LLM trained on a structured tabletop scene dataset to generate coarse layouts from input task instructions, and (ii) a Physics Corrector, a physics-aware flow-based denoising model that outputs pose updates to refine layouts, which ensures the physical plausibility of scenes while preserving semantic alignment with task instructions. STABLE adopts a progressive generation paradigm: by alternating between the Semantic Reasoner and Physics Corrector, it incrementally expands the scene from task-critical objects to background objects. Experiments demonstrate that STABLE successfully generates simulation-ready tabletop scenes that strictly conform to task instructions and significantly enhances the physical validity of scenes over prior art.

2605.15599 2026-05-20 cs.CV cs.AI

Pretraining Objective Matters in Extreme Low-Data FGVC: A Backbone-Controlled Study

Alexander Hackett, Srikanth Thudumu, Ginny Fisher, Jason Fisher

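AI summary: This paper studies how the pretraining objective affects downstream representation quality in extreme low-data fine-grained visual classification (FGVC). Comparing four frozen ViT-B/16 encoders, it concludes that margin-enforcing pretraining objectives should be preferred when data scarcity restricts probing to linear decision rules.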

Comments Presented at the 13th Workshop on Fine-Grained Visual Categorization (FGVC13) at CVPR 2026

Journal ref
13th Workshop on Fine-Grained Visual Categorization (FGVC13), CVPR 2026

Abstract

Extreme low-data fine-grained classification is common in expert domains where labeling is expensive, yet practitioners still need principled guidance for selecting pretrained encoders. We study emerald inclusion grading with a custom dataset of labeled images across three classes and ask: under matched backbone capacity, how does the pretraining objective affect downstream representation quality? We compare four frozen ViT-B/16 encoders trained with supervised classification, contrastive learning (SigLIP2), masked reconstruction (MAE), and self-distillation (DINOv3), and evaluate them with leave-one-out cross-validation using linear and nonlinear probes. To control statistical noise in the low-N regime, we use permutation testing (N=1000) on macro one-vs-rest AUC. Supervised and contrastive encoders provide the strongest linear separability (logistic AUC: 0.768 and 0.735; SVM AUC: 0.739 and 0.697), while MAE improves under nonlinear probes (XGBoost AUC: 0.713). We find that DINOv3 underperforms across probe families in this domain. These results support a practical recommendation for extreme low-data FGVC: prioritize margin-enforcing pretraining objectives when data scarcity restricts probing to linear decision rules, and consider reconstruction-style encoders when nonlinear classifiers are feasible given dataset constraints.

2605.15532 2026-05-20 cs.LG cs.AI cs.CL

DeltaPrompts: Escaping the Zero-Delta Trap in Multimodal Distillation

Jaehun Jung, Hyunwoo Kim, Brandon Cui, Ximing Lu, David Acuna, Prithviraj Ammanabrolu, Yejin Choi

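AI summary: This paper proposes DeltaPrompts, which quantifies teacher-student answer divergence (Δ) to generate high-divergence reasoning problems, addressing the weak learning signal that zero-delta prompts cause in conventional distillation; experiments show significant gains across multiple settings.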

Abstract

Distillation enables compact Vision-Language Models (VLMs) to obtain strong reasoning capabilities, yet the prompts driving this process are typically chosen via simple heuristics or aggregated from off-the-shelf datasets. We reveal a critical inefficiency in this approach: up to 69% of the prompts in standard chart / document reasoning datasets are effectively zero-delta, meaning the teacher and student already induce the exact same answer distribution. Training on these prompts provides minimal learning signal, causing student improvement to rapidly saturate regardless of data scale. To escape the zero-delta trap, we return to first principles: distillation fundamentally minimizes distributional divergence, and thus a prompt is valuable only if it exposes a functional capability gap between the teacher and student. We quantify this gap through answer divergence ($Δ$), demonstrating that non-zero divergence is critical for effective scaling. Building on this insight, we propose a staged synthesis pipeline that repurposes existing datasets as seeds, actively targeting student failure modes to produce better prompts. The result is DeltaPrompts, a diverse dataset of 200k synthetic, high-divergence reasoning problems. We evaluate DeltaPrompts across three distinct settings: on-policy distillation with the target teacher-student pair, transfer to a novel model family without regenerating the data, and off-policy fine-tuning of a non-reasoning model. Across all scenarios, DeltaPrompts drives substantial gains, yielding up to 15% relative improvement even on top of a highly-optimized reasoning model (e.g., Qwen3-VL-8B-Thinking) -- averaged over 10 benchmarks spanning chart, document and perception-centric reasoning.
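
Illustrative sketch (not the authors' metric): a prompt is "zero-delta" when teacher and student induce the same answer distribution. One simple way to make that measurable is to sample answers from both models and compare the empirical distributions. The choice of total variation distance, the sampling setup, and the function names are assumptions made for this example; the abstract does not say how Δ is computed.

```python
from collections import Counter

def answer_distribution(answers):
    """Empirical distribution over sampled final answers (non-empty list)."""
    counts = Counter(answers)
    total = len(answers)
    return {answer: count / total for answer, count in counts.items()}

def answer_divergence(teacher_answers, student_answers):
    """Total variation distance between the two empirical distributions."""
    p = answer_distribution(teacher_answers)
    q = answer_distribution(student_answers)
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in support)

def drop_zero_delta(prompts, eps=1e-9):
    """Keep only prompts whose teacher/student divergence is above eps."""
    return [name for name, (t, s) in prompts.items() if answer_divergence(t, s) > eps]

prompts = {
    "easy": (["4", "4", "4", "4"], ["4", "4", "4", "4"]),
    "hard": (["12", "12", "12", "12"], ["12", "7", "7", "9"]),
}
print(drop_zero_delta(prompts))
```

Here the "easy" prompt is filtered out because both models always give the same answer, while the "hard" prompt is kept.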

2605.14588 2026-05-20 cs.LG

Silent Collapse in Recursive Learning Systems

Zhipeng Zhang

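AI summary: This paper studies the gradual degradation of a model's internal distributions in recursive learning systems and proposes the MTR framework, which monitors trajectory statistics and adjusts learning intensity to give early warning of silent collapse and prevent it.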

Abstract

Recursive learning -- where models are trained on data generated by previous versions of themselves -- is increasingly common in large language models, autonomous agents, and self-supervised systems. However, standard performance metrics (loss, perplexity, accuracy) often fail to detect internal degradation before it becomes irreversible. Here we identify a phenomenon we call silent collapse: under broad recursive conditions, model internal distributions -- predictive entropy, representational diversity, and tail coverage -- progressively contract even as conventional metrics appear stable or improving. We discover that silent collapse is not abrupt. Its onset is reliably preceded by three trajectory-level precursors: (1) contraction of anchor entropy, (2) freezing of representation drift, and (3) erosion of tail coverage. These signals manifest multiple generations before any degradation in standard validation metrics, enabling early warning. Based on these precursors, we propose the MTR (Monitor--Trust--Regulator) framework, a lightweight metacognitive loop that monitors trajectory statistics, estimates a slow-timescale trust variable, and adaptively modulates the effective learning intensity. MTR provides early warning and actively prevents silent collapse without requiring access to pristine real data -- a critical advantage when original data is unavailable, contaminated, or private.
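
Illustrative sketch (not the MTR implementation): the monitor, trust, and regulate loop can be pictured with a single precursor, predictive entropy. The entropy-ratio signal, the exponential smoothing used for the slow trust variable, and the rule that scales the learning rate by trust are all assumptions made for this example.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

class MTRSketch:
    def __init__(self, smoothing=0.5):
        self.smoothing = smoothing   # how fast trust reacts to new evidence
        self.trust = 1.0             # slow-timescale trust variable in [0, 1]
        self.prev_entropy = None

    def step(self, probs, base_lr):
        """Monitor entropy, update trust, and return the regulated learning rate."""
        h = entropy(probs)
        if self.prev_entropy is not None and self.prev_entropy > 0:
            ratio = min(1.0, h / self.prev_entropy)  # below 1 when entropy contracts
            self.trust = (1 - self.smoothing) * self.trust + self.smoothing * ratio
        self.prev_entropy = h
        return base_lr * self.trust
```

Feeding the same distribution for several generations leaves the learning rate untouched; a sudden contraction of entropy lowers trust and therefore the effective learning intensity. The abstract's other two precursors (representation drift and tail coverage) are not modeled here.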

2605.14048 2026-05-20 cs.AI cs.LG

Network-Aware Bilinear Tokenization for Brain Functional Connectivity Representation Learning

Leo Milecki, Qingyu Hu, Bahram Jafrasteh, Mert R. Sabuncu, Qingyu Zhao

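AI summary: This paper proposes a network-aware bilinear tokenization method for representation learning on brain functional connectivity; by redefining how functional connectivity is tokenized, it improves the stability and transferability of representations in cross-cohort evaluation.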

Comments Author-submitted version, provisionally accepted at MICCAI 2026

Abstract

Masked autoencoders (MAEs) have recently shown promise for self-supervised representation learning of resting-state brain functional connectivity (FC). However, a fundamental question remains unresolved: how should FC matrices be tokenized to align with the intrinsic modular organization of large-scale brain networks? Existing approaches typically adopt region-centric or graph-based schemes that treat FC entries as structurally homogeneous and overlook the brain's large-scale network organization. We introduce NERVE (Network-Aware Representations of Brain Functional Connectivity via Bilinear Tokenization), a self-supervised learning framework that redefines FC tokenization by partitioning FC matrices into patches of intra- and inter-network connectivity blocks. Unlike image-based MAE, where fixed-size patches share a common tokenizer, FC patches defined by network pairs are heterogeneous in size and correspond to distinct functional roles. To resolve this problem, NERVE embeds FC patches through a novel structured bilinear factorization. This formulation preserves network identity and reduces parameter complexity from quadratic to linear scaling in the number of networks. We evaluate NERVE across three large-scale developmental cohorts (ABCD, PNC, and CCNP) for behavior and psychopathology prediction. Compared to structurally agnostic MAE variants and graph-based self-supervised baselines, the proposed network-aware formulation yields more stable and transferable representations, particularly in cross-cohort evaluation. Ablation studies confirm that the proposed bilinear network embedding and anatomically grounded parcellation are critical for performance. These findings highlight the importance of incorporating domain-specific structural priors into self-supervised learning for functional connectomics. Code is available at: https://github.com/leomlck/NERVE.
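
Illustrative sketch (not the NERVE code): one plausible reading of a bilinear tokenizer is that each network a owns a projection matrix U_a, and the patch for the network pair (a, b) is embedded as U_a^T X_ab U_b. The exact factorization used in the paper is not given in the abstract, so the form below, the matrix shapes, and the function names are assumptions.

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def bilinear_token(patch, U_a, U_b):
    """Embed an (n_a x n_b) connectivity patch as the (d x d) token U_a^T X U_b."""
    return matmul(matmul(transpose(U_a), patch), U_b)

# Two networks of different sizes (2 and 3 regions), embedding width d = 2.
U_a = [[1, 0], [0, 1]]            # n_a x d
U_b = [[1, 0], [0, 1], [1, 1]]    # n_b x d
inter_patch = [[1, 2, 3], [4, 5, 6]]             # 2 x 3 block between a and b
intra_patch = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]  # 3 x 3 block within b

print(bilinear_token(inter_patch, U_a, U_b))
print(bilinear_token(intra_patch, U_b, U_b))
```

Patches of different sizes map to tokens of the same shape, and K networks need only K projection matrices, whereas a separate tokenizer per network pair would need on the order of K^2.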

2605.14014 2026-05-20 cs.LG cs.AI

Dywave: Event-Aligned Dynamic Tokenization for Heterogeneous IoT Sensing Signals

Tomoyoshi Kimura, Denizhan Kara, Jinyang Li, Hongjue Zhao, Yigong Hu, Yizhuo Chen, Xiaomin Ouyang, Shengzhong Liu, Tarek Abdelzaher

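AI summary: This paper proposes Dywave, a dynamic tokenization framework for heterogeneous IoT sensing signals that uses wavelet-based hierarchical decomposition to build compact input representations aligned with intrinsic temporal structure and underlying physical events, improving accuracy and computational efficiency on tasks such as activity recognition, stress assessment, and nearby object detection.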

Abstract

Internet of Things (IoT) systems continuously collect heterogeneous sensing signals from ubiquitous sensors to support intelligent applications such as human activity analysis, emotion monitoring, and environmental perception. These signals are inherently non-stationary and multi-scale, posing unique challenges for standard tokenization techniques. This paper proposes Dywave, a dynamic tokenization framework for IoT sensing signals that constructs compact input representations aligned with intrinsic temporal structures and underlying physical events. Dywave leverages wavelet-based hierarchical decomposition, identifies meaningful temporal boundaries corresponding to underlying semantic events, and adaptively compresses redundant intervals while preserving temporal coherence. Extensive evaluations on five real-world IoT sensing datasets across activity recognition, stress assessment, and nearby object detection demonstrate that Dywave outperforms state-of-the-art methods by up to 12% in accuracy, while improving computational efficiency by reducing input token lengths by up to 75% across mainstream sequence models. Moreover, Dywave exhibits improved robustness to domain shifts and varying sequence lengths.

2605.13793 2026-05-20 cs.CL

An LLM-Based System for Argument Mining

Paulo Pirozelli, Victor Hugo Nascimento Rocha, Fabio G. Cozman, Douglas Aldred

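AI summary: This paper presents an end-to-end LLM-based system that extracts arguments from natural language text and builds abstract argument graphs through a multi-stage pipeline that identifies argumentative components, selects relevant elements, and uncovers their logical relations; experiments show it recovers argumentative structure adequately and performs reasonably under different annotation schemes.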

Abstract

Arguments are a fundamental aspect of human reasoning, in which claims are supported, challenged, and weighed against one another. We present an end-to-end large language model (LLM)-based system for reconstructing arguments from natural language text into abstract argument graphs. The system follows a multi-stage pipeline that progressively identifies argumentative components, selects relevant elements, and uncovers their logical relations. These elements are represented as directed acyclic graphs consisting of two component types (premises and conclusions) and three relation types (support, attack, and undercut). We conduct two complementary experiments to evaluate the system. First, we perform a manual evaluation on arguments drawn from an argumentation theory textbook to assess the system's ability to recover argumentative structure. Second, we conduct a quantitative evaluation on benchmark datasets, allowing comparison with prior work by mapping our outputs to established annotation schemes. Results show that the system can adequately recover argumentative structures and, when adapted to different annotation schemes, achieve reasonable performance across benchmark datasets. These findings highlight the potential of LLM-based pipelines for scalable argument mining.
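
Illustrative sketch (not the system's actual data model): the output format described above, a directed acyclic graph with two component types and three relation types, could be represented as follows. The class and method names are hypothetical, and modeling every relation as an edge between two components is a simplifying assumption.

```python
from dataclasses import dataclass, field

COMPONENT_TYPES = {"premise", "conclusion"}
RELATION_TYPES = {"support", "attack", "undercut"}

@dataclass
class ArgumentGraph:
    components: dict = field(default_factory=dict)  # id -> (type, text)
    relations: list = field(default_factory=list)   # (source, target, type)

    def add_component(self, cid, ctype, text):
        if ctype not in COMPONENT_TYPES:
            raise ValueError(f"unknown component type: {ctype}")
        self.components[cid] = (ctype, text)

    def add_relation(self, source, target, rtype):
        if rtype not in RELATION_TYPES:
            raise ValueError(f"unknown relation type: {rtype}")
        if source not in self.components or target not in self.components:
            raise KeyError("both endpoints must be existing components")
        self.relations.append((source, target, rtype))
        if not self.is_acyclic():
            self.relations.pop()  # undo the edge that broke acyclicity
            raise ValueError("relation would create a cycle")

    def is_acyclic(self):
        """Kahn's algorithm: the graph is a DAG if every node can be removed."""
        indegree = {cid: 0 for cid in self.components}
        for _, target, _ in self.relations:
            indegree[target] += 1
        ready = [cid for cid, deg in indegree.items() if deg == 0]
        removed = 0
        while ready:
            node = ready.pop()
            removed += 1
            for source, target, _ in self.relations:
                if source == node:
                    indegree[target] -= 1
                    if indegree[target] == 0:
                        ready.append(target)
        return removed == len(self.components)
```

Rejecting a cycle-forming relation at insertion time keeps the structure a DAG regardless of what an upstream LLM stage proposes.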

2605.13318 2026-05-20 cs.AI cs.ET

VERA-MH: Validation of Ethical and Responsible AI in Mental Health

Luca Belli, Kate H. Bentley, Josh Gieringer, Emily Van Ark, Nilu Zhao, Pradip Thachile, Matt Hawrilenko, Millard Brown, Adam M. Chekroud

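AI summary: This study introduces VERA-MH, a novel clinically validated evaluation of chatbot safety in mental health support, focused on how well chatbots respond to suicidal ideation risk.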

Abstract

Chatbot usage has increased, including in fields for which chatbots were never developed, notably mental health support. To that end, we introduce Validation of Ethical and Responsible AI in Mental Health (VERA-MH), a novel clinically validated evaluation of chatbot safety in the context of mental health support. The first iteration of VERA-MH focuses on Suicidal Ideation (SI) risks by assessing how well chatbots can respond to users who might be in crisis. VERA-MH comprises three steps: conversation simulation, conversation judging, and model rating. First, to simulate conversations with the chatbot under evaluation, another chatbot is tasked with role-playing users based on specific personas. These user personas were developed under clinical guidance to ensure that, among other things, multiple risk factors, demographic characteristics, and disclosure factors are represented. In the judging step, a second support model is used as an LLM-as-a-Judge, together with a clinically developed rubric. The rubric is structured as a flow, with a single Yes/No question asked each time, to improve the consistency of answers and highlight models' failure modes. In the last stage, the results of each conversation are aggregated to present the final evaluation of the chatbot. Together with the framework, we present the results of the evaluations for four leading LLM providers.
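
Illustrative sketch (not the VERA-MH rubric): a rubric "structured as a flow, with a single Yes/No question asked each time" can be represented as a small decision graph that a judge model walks one question at a time. The question wording, node names, and rating labels below are hypothetical placeholders and are not clinically derived.

```python
from collections import Counter

# Hypothetical flow: each node asks one Yes/No question and names the next node.
RUBRIC = {
    "q1": {"text": "Did the user message indicate possible risk?",
           "yes": "q2", "no": "rating:not_applicable"},
    "q2": {"text": "Did the chatbot acknowledge the risk?",
           "yes": "q3", "no": "rating:fail"},
    "q3": {"text": "Did the chatbot point to appropriate support resources?",
           "yes": "rating:pass", "no": "rating:fail"},
}

def judge(answer_fn, start="q1"):
    """Walk the flow; answer_fn stands in for the LLM judge's Yes/No answer."""
    node, path = start, []
    while not node.startswith("rating:"):
        answer = bool(answer_fn(RUBRIC[node]["text"]))
        path.append((node, answer))
        node = RUBRIC[node]["yes" if answer else "no"]
    return node.split(":", 1)[1], path

def aggregate(ratings):
    """Model rating step: count how often each rating occurred."""
    return Counter(ratings)
```

Recording the path alongside the rating shows exactly which question a conversation failed on, which is one way to surface the failure modes the abstract mentions.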

2605.13193 2026-05-20 cs.CV

FIKA-Bench: From Fine-grained Recognition to Fine-Grained Knowledge Acquisition

Geng Li, Yuxin Peng

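AI summary: This paper introduces FIKA-Bench, a fine-grained knowledge acquisition benchmark of 311 public-source and real-life instances whose quality is ensured through filtering and auditing. Evaluating the latest multimodal models and agents shows the task remains challenging and that better agent designs are needed.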

Comments Project page with code: https://ligeng0197.github.io/FIKA-Bench.github.io/

Abstract

Fine-grained recognition in everyday life is often not a closed-book classification problem: when encountering unfamiliar objects, humans actively search, compare visual details, and verify evidence before deciding. Existing benchmarks primarily evaluate visual recognition, leaving this active external knowledge acquisition ability underexplored. We study fine-grained knowledge acquisition, where a system must seek, verify, and use external evidence to answer open-ended fine-grained recognition questions. We introduce FIKA-Bench, a leakage-aware and evidence-grounded collection of 311 public-source and real-life instances. To ensure high quality, every example is filtered against frontier closed-book models to remove memorized cases and audited to eliminate image-answer leakage, retaining only samples supported by verified evidence. Our evaluation of the latest Large Multimodal Models (LMMs) and agents reveals that the task remains a formidable challenge: the best system reaches only 25.1% accuracy, with no model exceeding 30%. Crucially, we find that merely equipping models with tools is insufficient to bridge this gap; agent failures are predominantly driven by wrong entity retrieval and poor visual judgement. These results show that reliable knowledge acquisition needs better agent designs that focus on fine-grained recognition.

2605.12640 2026-05-20 cs.CV

MambaPanoptic: A Vision Mamba-based Structured State Space Framework for Panoptic Segmentation

Qing Cheng, Damiano Bertolini, Wei Zhang, Dong Wang, Niclas Zeller, Daniel Cremers

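AI summary: This work proposes MambaPanoptic, a Vision Mamba-based structured state space framework that addresses long-range context modelling, multi-scale feature representation, and efficient dense prediction in panoptic segmentation; it introduces MambaFPN and a PanopticFCN-style kernel generator for unified thing and stuff prediction.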

Comments Accepted to ISPRS Congress 2026, camera-ready version

Abstract

Panoptic segmentation requires the simultaneous recognition of countable thing instances and amorphous stuff regions, placing joint demands on long-range context modelling, multi-scale feature representation, and efficient dense prediction. Existing convolutional and transformer-based methods struggle to satisfy all three requirements concurrently: convolutional architectures are limited in their capacity to model long-range dependencies, while transformer-based methods incur quadratic computational cost that is prohibitive at high resolutions. In this paper, we propose MambaPanoptic, a fully Mamba-based panoptic segmentation framework that addresses these limitations through two principal contributions. First, we introduce MambaFPN, a top-down feature pyramid that leverages Mamba blocks to generate globally coherent, multi-scale feature representations with linear computational complexity. Second, we adopt a PanopticFCN-style kernel generator that produces unified thing and stuff kernels for proposal-free panoptic prediction, enhanced by a QuadMamba-based feature refinement module applied at multiple network stages. Experiments on the Cityscapes and COCO panoptic segmentation benchmarks demonstrate that MambaPanoptic consistently outperforms PanopticDeepLab and PanopticFCN under comparable model sizes, and matches or surpasses Mask2Former on Cityscapes in PQ and AP while requiring fewer parameters.

2605.12320 2026-05-20 cs.CV

Contrastive Learning under Noisy Temporal Self-Supervision for Colonoscopy Videos

在噪声时间自监督下利用对比学习进行结肠镜视频处理

Luca Parolari, Pietro Gori, Lamberto Ballan, Carlo Biffi, Loic Le Folgoc

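AI Summary This paper proposes a contrastive learning method for colonoscopy videos under noisy temporal self-supervision. It derives self-supervised associations from the sequential workflow of colonoscopy procedures and introduces a noise-aware contrastive loss to handle noisy associations, outperforming prior self-supervised and supervised baselines on multiple downstream tasks.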

Comments Accepted to MICCAI 2026

AI中文摘要

学习鲁棒的息肉轨迹表示对于启用多项AI辅助结肠镜应用至关重要,从息肉特征化到自动化报告和检索。监督对比学习是学习此类表示的有效方法,但通常依赖于正确的正负定义。收集这些标签需要链接在整个视频中描绘相同基础息肉实体的轨迹,这成本高昂且需要专门的临床专业知识。在本工作中,我们利用结肠镜检查的顺序流程推导出自监督关联。由于时间推导的关联不保证正确,我们引入了噪声感知的对比损失以处理噪声关联。我们展示了所学表示在多项下游任务中的有效性,包括息肉检索和重识别、大小估计和组织学分类。我们的方法在多项任务中优于先前的自监督和监督基线方法,并且在所有任务中与最近的基座模型相匹配或超过,使用了一个仅在27个视频上训练的轻量级编码器。代码可在https://github.com/lparolari/ntssl上获得。

英文摘要

Learning robust representations of polyp tracklets is key to enabling multiple AI-assisted colonoscopy applications, from polyp characterization to automated reporting and retrieval. Supervised contrastive learning is an effective approach for learning such representations, but it typically relies on correct positive and negative definitions. Collecting these labels requires linking tracklets that depict the same underlying polyp entity throughout the video, which is costly and demands specialized clinical expertise. In this work, we leverage the sequential workflow of colonoscopy procedures to derive self-supervised associations from temporal structure. Since temporally derived associations are not guaranteed to be correct, we introduce a noise-aware contrastive loss to account for noisy associations. We demonstrate the effectiveness of the learned representations across multiple downstream tasks, including polyp retrieval and re-identification, size estimation, and histology classification. Our method outperforms prior self-supervised and supervised baselines, and matches or exceeds recent foundation models across all tasks, using a lightweight encoder trained on only 27 videos. Code is available at https://github.com/lparolari/ntssl.

2605.11262 2026-05-20 cs.LG

Latent Chain-of-Thought Improves Structured-Data Transformers

潜在的链式思维提升结构化数据转换器

Carson Dudley, Samet Oymak

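AI Summary This paper studies the effect of chain-of-thought on time-series and tabular data and proposes a recurrent scheme that compresses query-position hidden states into feedback tokens, allowing multiple rounds of latent computation before prediction and improving the performance of structured-data transformers.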

AI中文摘要

链式思维以及更广泛的推理时间计算已被证明能够增强语言模型的表达能力,并在推理领域带来了重大创新。受此成功启发,本文探讨了潜在链式思维以及深度和循环对时间序列和表格数据的影响。我们提出了一种递归方案,其中结构化数据转换器在初始正向传递后,将查询位置的隐藏状态压缩为反馈标记,这些标记被附加到输入并再次处理,从而在预测前允许多次潜在计算。我们比较了链式思维模型与一个相同深度无链式思维的基线模型、一个与链式思维模型在有效深度上匹配的更深层次基线模型,以及一个具有权重绑定递归但没有额外链式思维标记的循环转换器。在36个时间序列预测和表格预测数据集中,潜在链式思维在7/9个时间序列数据集上优于基线(平均提升12.63%),在23/27个表格数据集上也优于基线(平均提升3.25%),链式思维模型在两种设置中表现最佳。我们还展示了链式思维的好处扩展到了预训练基础模型:将潜在链式思维应用于nanoTabPFN,一个小型开源表格基础模型,使其性能超越了更大的TabPFN-v2在TabArena上的表现。这些结果共同表明,链式思维是扩展结构化数据推理时间计算的一个有用轴线。

英文摘要

Chain-of-thought and, more broadly, test-time compute are known to augment the expressive capabilities of language models and have led to major innovations in reasoning. Motivated by this success, this paper explores latent chain-of-thought as well as the impact of depth and looping for time-series and tabular data. We propose a recurrent scheme in which a structured-data transformer, after an initial forward pass, compresses its query-position hidden states into feedback tokens that are appended to the input and processed again, allowing multiple rounds of latent computation before prediction. We compare CoT models against a same-depth no-CoT baseline, a deeper baseline matched to the CoT model in effective depth, and a looped transformer with weight-tied recurrence but no additional chain-of-thought tokens. Across 36 datasets in time-series forecasting and tabular prediction, latent chain-of-thought improves over the baseline on 7/9 time-series datasets (+12.63% average gain) and 23/27 tabular datasets (+3.25% average gain), with CoT models performing best on average in both settings. We also show that the benefit of CoT extends to pretrained foundation models: applying latent CoT to nanoTabPFN, a small open-source tabular foundation model, improves its performance above the much larger TabPFN-v2 on TabArena. Together, these results demonstrate that chain-of-thought is a useful axis for scaling test-time compute for structured data.

2605.10180 2026-05-20 cs.CV cs.CR

What Concepts Lie Within? Detecting and Suppressing Risky Content in Diffusion Transformers

什么概念在其中?在扩散变换器中检测和抑制危险内容

Chenyu Zhang

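AI Summary This paper studies how to detect and suppress risky content in Diffusion Transformers and proposes AHV-D&S, a training-free inference-time safeguard that analyzes the concept sensitivity of attention heads to detect and suppress risky generation tendencies. It suppresses sexual, copyright-protected, and harmful content while preserving visual quality.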

Comments arXiv admin comment: This version has been removed by arXiv administrators as the submitter did not have the rights to agree to the license at the time of submission

AI中文摘要

文本到图像(T2I)模型的兴起日益引发了关于生成危险内容(如性、暴力和受版权保护的图像)的担忧,突显了在模型内部需要有效安全措施的必要性。尽管已有方法被提出以消除T2I模型中的危险概念,但它们主要针对早期的U-Net架构,使得最先进的基于扩散变换器(DiTs)的T2I模型缺乏充分保护。这一差距源于根本性的架构转变:扩散变换器(DiTs)通过联合注意力将语义注入和视觉合成结合起来,这使得在生成过程中隔离和消除危险内容变得困难。为了弥合这一差距,我们研究了DiTs中语义概念的表示方式,并发现注意力头表现出对概念的特定敏感性。这一特性使得能够同时检测和抑制危险内容。基于这一发现,我们提出AHV-D&S,一种无需训练的推理时图像生成安全机制。具体而言,AHV-D&S量化每个文本标记在所有注意力头上的敏感性作为注意力头向量(AHV),该向量用作检测危险生成倾向的判别签名。在推理阶段,我们提出了一种基于动量的策略,用于在去噪步骤中动态跟踪标记级别的AHVs,并提出一种基于敏感度的自适应抑制策略,该策略根据头特定的风险分数抑制已识别的危险标记的注意力权重。广泛的实验表明,AHV-D&S有效抑制了性内容、受版权保护的内容以及各种有害内容,同时保持了视觉质量,并进一步表现出对对抗性提示的强鲁棒性和在不同DiT-based T2I模型中的可转移性。

英文摘要

The rise of text-to-image (T2I) models has increasingly raised concerns regarding the generation of risky content, such as sexual, violent, and copyright-protected images, highlighting the need for effective safeguards within the models themselves. Although existing methods have been proposed to eliminate risky concepts from T2I models, they are primarily developed for earlier U-Net architectures, leaving the state-of-the-art Diffusion-Transformer-based T2I models inadequately protected. This gap stems from a fundamental architectural shift: Diffusion Transformers (DiTs) entangle semantic injection and visual synthesis via joint attention, which makes it difficult to isolate and erase risky content within the generation. To bridge this gap, we investigate how semantic concepts are represented in DiTs and discover that attention heads exhibit concept-specific sensitivity. This property enables both the detection and suppression of risky content. Building on this discovery, we propose AHV-D&S, a training-free inference-time safeguard for image generation in DiTs. Specifically, AHV-D&S quantifies each textual token's sensitivity across all attention heads as an Attention Head Vector (AHV), which serves as a discriminative signature for detecting risky generation tendencies. In the inference stage, we propose a momentum-based strategy to dynamically track token-wise AHVs across denoising steps, and a sensitivity-guided adaptive suppression strategy that suppresses the attention weights of identified risky tokens based on head-specific risk scores. Extensive experiments demonstrate that AHV-D&S effectively suppresses sexual, copyrighted-style, and various harmful content while preserving visual quality, and further exhibits strong robustness against adversarial prompts and transferability across different DiT-based T2I models.

2605.09329 2026-05-20 cs.CL cs.LG

Test-Time Speculation

测试时推测

Avinash Kumar, Sujay Sanghavi, Poulami Das

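AI Summary This paper studies Test-Time Speculation, an online distillation approach that increases the acceptance length of speculators on long-response tasks and thereby improves LLM inference efficiency.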

AI中文摘要

推测解码通过使用快速草稿模型生成token并用更准确的目标模型验证,从而加速LLM推理。其性能取决于接受长度,即目标模型接受的草稿token数量。我们的研究表明,即使是最先进的推测器,如DFlash、EAGLE-3和PARD,其接受长度也会随着生成长度的增加而下降,在仅几千个输出token后接近1(即无加速),这使推测器在长响应任务中变得无效。接受长度下降是因为大多数推测器在离线训练时仅在短序列上训练,但在推理时被迫匹配远长于训练分布的输出。为了解决这个问题,我们提出了测试时推测(TTS),一种在线蒸馏方法,可以在测试时连续调整推测器。TTS利用关键见解,即token验证步骤已经为每个草稿token调用了目标模型,从而提供所需的训练信号,以无额外成本地调整草稿。将草稿视为学生,目标模型视为教师,TTS在多个推测轮次中调整草稿,每次更新都提高草稿的准确性。我们的结果表明,在Qwen-3、Qwen-3.5和Llama3.1家族的多个模型上,TTS在最先进的推测器上将接受长度提高高达72%和41%,且随着生成长度的增加,收益呈比例增长。

英文摘要

Speculative decoding accelerates LLM inference by using a fast draft model to generate tokens and a more accurate target model to verify them. Its performance depends on the acceptance length, or number of draft tokens accepted by the target. Our studies show that the acceptance length of even state-of-the-art speculators, like DFlash, EAGLE-3, and PARD, degrades with generation length, reaching values close to 1 (i.e., no speedup) within just a few thousand output tokens, making speculators ineffective for long-response tasks. Acceptance lengths decline because most speculators are trained offline on short sequences, but are forced to match the target model on much longer outputs at inference, well beyond their training distribution. To address this issue, we propose Test-Time Speculation (TTS), an online distillation approach that continuously adapts the speculator at test time. TTS leverages the key insight that the token verification step already invokes the target model for each draft token, providing the training signal needed to adapt the draft at no additional cost. Treating the draft as the student and the target as a teacher, TTS adjusts the draft over several speculation rounds, with each update improving the draft's accuracy as generation proceeds. Our results across multiple models from the Qwen-3, Qwen-3.5, and Llama3.1 families show that TTS improves acceptance lengths over state-of-the-art speculators by up to 72% and by 41% on average, with the benefits scaling with increased generation lengths.
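
The acceptance length described above can be illustrated with a minimal sketch of one greedy verification round. The function name `acceptance_length` and the convention of counting the target's own token (so that a value of 1 means no draft token was accepted) are assumptions made for illustration; they are not taken from the paper.

```python
from typing import Sequence


def acceptance_length(draft: Sequence[int], target: Sequence[int]) -> int:
    """Tokens emitted in one speculation round under greedy verification.

    `draft` holds the k tokens proposed by the draft model. `target` holds the
    k + 1 tokens the target model would choose at the same positions.
    Draft tokens are accepted until the first mismatch; the target then
    contributes exactly one token (a correction, or a bonus token if every
    draft token matched). This counting convention is an assumption.
    """
    if len(target) != len(draft) + 1:
        raise ValueError("target must contain one more token than draft")
    accepted = 0
    for d, t in zip(draft, target):
        if d != t:
            break  # first disagreement ends the accepted prefix
        accepted += 1
    return accepted + 1  # accepted draft tokens plus one target token


def mean_acceptance_length(rounds) -> float:
    """Average acceptance length over (draft, target) pairs, one per round."""
    lengths = [acceptance_length(d, t) for d, t in rounds]
    return sum(lengths) / len(lengths)
```

Under this convention, a mean value near 1 corresponds to the "no speedup" regime the abstract mentions, because each round then yields only the target model's own token.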

2605.09063 2026-05-20 cs.CL

Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs

Soohak:一个由数学家精心编订的基准,用于评估大语言模型的研究级数学能力

Guijin Son, Seungone Kim, Catherine Arnett, Hyunwoo Ko, Hyein Lee, Hyeonah Kang, Jiang Longxi, Jin Yun, JungYup Lee, Kyungmin Lee, Sam Yoosuk Kim, Sang Park, Seunghyeok Hong, SeungJae Lee, Seungyeop Yi, Shinae Shin, SunHye Bok, Sunyoung Shin, Yonghoon Ji, Youngtaek Kim, Hanearl Jung, Akari Asai, Graham Neubig, Sean Welleck, Youngjae Yu, Akshelin R, Alexander B. Ivanov, Boboev Muhammadjon, Chae Young Han, Christian Stump, Cooper R. Anderson, Dmitrii Karp, Dohyun Kwon, Dongryung Yi, DoYong Kwon, Duk-Soon Oh, Eunho Choi, Giovanni Resta, Greta Panova, Huiyun Noh, Hyungryul Baik, Hyungsun Bae, Inomov Mashrafdzhon, Jeewon Kim, Jeong-Rae Kim, Ji Eun Lee, Jiaqi Liu, Jieui Kang, Jimin Kim, Jon-Lark Kim, Joonyeong Won, Junseo Yoon, Junwoo Jo, Kibeom Kim, Kiwoon Kwon, Mario Kummer, Max Mercer, Min Hoon Kim, Minjun Kim, Nahyun Lee, Ng Ze-An, Nicolas Libedinsky, Rafał Marcin Łochowski, Raphaël Lachièze-Rey, Robert Auffarth, Ruichen Zhang, Sejin Park, Seonguk Seo, Shin Jaehoon, Sunatullo, Taewoong Eom, Yeachan Park, Yongseok Jang, Youchan Oh, Zhaoyang Wang, Zoltán Kovács

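AI Summary This paper introduces the Soohak benchmark, 439 problems newly authored by 64 mathematicians, to evaluate LLMs on research-level mathematics. It also includes a refusal subset that tests whether models recognize ill-posed problems.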

Comments Under review. For questions or model-evaluation requests, contact guijin.son@snu.ac.kr

AI中文摘要

在最近前沿大语言模型在IMO竞赛中取得金牌成绩后,社区正在寻找下一个有意义且具有挑战性的目标来衡量大语言模型的推理能力。尽管竞赛风格的问题只测量逐步推理,但研究级问题利用这种推理来推动数学知识的前沿,成为有吸引力的替代方案。然而,研究级数学基准仍然稀缺,因为此类问题难以获取(例如Riemann Bench和FrontierMath-Tier 4分别包含25和50道问题)。为了支持下一代前沿模型的可靠评估,我们引入了Soohak,一个由64位数学家从头编写的新基准,包含439道问题。Soohak包含两个子集。在挑战子集上,前沿模型包括Gemini-3-Pro、GPT-5和Claude-Opus-4.5分别达到30.4%、26.4%和10.4%,仍有较大提升空间,而领先的大规模开放模型如Qwen3-235B、GPT-OSS-120B和Kimi-2.5则低于15%。值得注意的是,除了标准问题解决,Soohak还引入了拒绝子集,以测试研究数学中固有的能力:识别不恰当的问题并暂停而非产生自信但不合理的答案。在该子集上,没有模型超过50%,识别拒绝成为新的优化目标,而当前模型并未直接解决此问题。为防止污染,该数据集将在2026年底公开发布, interim期间模型评估可在请求时获得。

英文摘要

Following the recent achievement of gold-medal performance on the IMO by frontier LLMs, the community is searching for the next meaningful and challenging target for measuring LLM reasoning. Whereas olympiad-style problems measure step-by-step reasoning alone, research-level problems use such reasoning to advance the frontier of mathematical knowledge itself, emerging as a compelling alternative. Yet research-level math benchmarks remain scarce because such problems are difficult to source (e.g., Riemann Bench and FrontierMath-Tier 4 contain 25 and 50 problems, respectively). To support reliable evaluation of next-generation frontier models, we introduce Soohak, a 439-problem benchmark newly authored from scratch by 64 mathematicians. Soohak comprises two subsets. On the Challenge subset, frontier models including Gemini-3-Pro, GPT-5, and Claude-Opus-4.5 reach 30.4%, 26.4%, and 10.4% respectively, leaving substantial headroom, while leading open-weight models such as Qwen3-235B, GPT-OSS-120B, and Kimi-2.5 remain below 15%. Notably, beyond standard problem solving, Soohak introduces a refusal subset that probes a capability intrinsic to research mathematics: recognizing ill-posed problems and pausing rather than producing confident but unjustified answers. On this subset, no model exceeds 50%, identifying refusal as a new optimization target that current models do not directly address. To prevent contamination, the dataset will be publicly released in late 2026, with model evaluations available upon request in the interim.

2605.04970 2026-05-20 cs.LG cs.AI

Skill Neologisms: Towards Skill-based Continual Learning

技能新词:迈向基于技能的持续学习

Antonin Berthon, Nicolas Astorga, Mihaela van der Schaar

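AI Summary This paper proposes skill neologisms, soft tokens integrated into the model's vocabulary that improve capabilities on a specific skill and support zero-shot composition with other skills, offering a scalable path to skill-based continual learning.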

AI中文摘要

现代大语言模型(LLMs)在不断扩大的技能范围内表现出色,并能灵活组合这些技能。然而,以可扩展的方式将模型能力扩展到新技能仍然是一个开放性问题:微调和参数高效变体有灾难性遗忘的风险,而基于上下文的方法表达能力有限且受模型有效上下文的限制。我们探索了技能新词——整合在模型词汇中的软token,并优化以提高特定技能的能力——作为一种方法,以在不更新权重的情况下选择性地获取新技能。我们首先观察到预训练LLMs已经表现出与程序知识相关的token。然后在受控的合成任务上展示,技能新词可以学习以提高模型在特定技能上的能力,同时能够与分布外技能组合,且独立训练的技能新词可以零样本组合。最后,我们验证了在更现实的自然语言设置中,即Skill-Mix基准测试中,独立学习的技能新词的零样本组合。这些结果表明,技能新词可能为基于技能的持续学习提供可扩展的路径。

英文摘要

Modern LLMs show mastery over an ever-growing range of skills, as well as the ability to compose them flexibly. However, extending model capabilities to new skills in a scalable manner is an open problem: fine-tuning and parameter-efficient variants risk catastrophic forgetting, while context-based approaches have limited expressiveness and are constrained by the model's effective context. We explore skill neologisms--soft tokens integrated in the model's vocabulary and optimized to improve capabilities over a specific skill--as a way to selectively acquire new skills without weight updates. We first observe that pre-trained LLMs already exhibit tokens associated with procedural knowledge. We then show on a controlled synthetic task that skill neologisms can be learned to improve model capabilities on specific skills while being composable with out-of-distribution skills, and that independently trained skill neologisms can be composed zero-shot. Finally, we validate zero-shot composition of independently learned skill neologisms on the more realistic natural language setting of the Skill-Mix benchmark. These results suggest that skill neologisms may provide a scalable path towards skill-based continual learning.

2605.01361 2026-05-20 cs.LG

Decision-Focused Learning via Tangent-Space Projection of Prediction Error

通过预测误差的切线空间投影进行决策聚焦学习

Junhyeong Lee, Sangjin Jin, Yongjae Lee

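AI Summary This paper proposes a decision-focused learning method based on projecting the prediction error onto a tangent space. A geometric characterization simplifies the computation of regret gradients, improving downstream decision quality and computational efficiency.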

Comments 21 pages, 4 figures, 11 tables

AI中文摘要

决策聚焦学习(DFL)训练预测器以提高下游决策质量,但计算后悔梯度通常需要对求解器进行微分或依赖于替代损失函数,这可能计算成本高或偏离真实目标。我们证明,在标准正则性条件下,本地稳定的活动约束下,后悔梯度具有闭式几何特征,等价于预测误差投影到活动约束的切线空间,乘以局部曲率。这表明,可以通过过滤决策无关成分来获得后悔梯度,提供了一种更简单直接的替代方法。基于此,我们提出PEAR(投影误差作为后悔梯度),通过在活动约束上减少的线性系统计算后悔梯度,避免对求解器迭代或额外优化求解进行微分。在LP基准和一个现实QP任务上的实验表明,PEAR在所有基线中实现了最佳的决策质量,同时是最具计算效率的,其优势在约束变化下依然保持。

英文摘要

Decision-Focused Learning (DFL) trains predictors to improve downstream decision quality, but computing regret gradients typically requires differentiating through solvers or relying on surrogate losses, which can be computationally expensive or deviate from the true objective. We show that, under standard regularity with locally stable active constraints, the regret gradient admits a closed-form geometric characterization, equivalent to the prediction error projected onto the tangent space of active constraints, scaled by local curvature. This reveals that regret gradients can be obtained by filtering decision-irrelevant components from the MSE gradient, providing a simpler and more direct alternative to existing approaches. Based on this, we propose PEAR (Projected Error As Regret-gradient), which computes regret gradients via a reduced linear system over active constraints, avoiding differentiation through solver iterations or additional optimization solves. Experiments on LP benchmarks and a real-world QP task show that PEAR achieves the best decision quality among all baselines while being the most computationally efficient, with gains that persist under constraint shifts.
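
The projection step in the geometric characterization above can be sketched as follows. This is only an illustration under stated assumptions: active constraints are given as linear normals, the tangent space is taken to be their orthogonal complement, and the local-curvature scaling mentioned in the abstract is omitted. The name `project_onto_tangent_space` is hypothetical and this is not the PEAR implementation.

```python
def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project_onto_tangent_space(error, active_normals, tol=1e-12):
    """Remove from `error` every component along the active-constraint normals.

    `error` is a prediction-error vector; `active_normals` is a list of normal
    vectors, one per active linear constraint (an assumed representation).
    The normals are orthonormalized with Gram-Schmidt, then the error is
    projected onto their orthogonal complement.
    """
    basis = []
    for normal in active_normals:
        v = [float(x) for x in normal]
        for b in basis:
            coeff = _dot(v, b)
            v = [x - coeff * y for x, y in zip(v, b)]
        norm = _dot(v, v) ** 0.5
        if norm > tol:  # skip normals that are linearly dependent
            basis.append([x / norm for x in v])

    projected = [float(x) for x in error]
    for b in basis:
        coeff = _dot(projected, b)
        projected = [x - coeff * y for x, y in zip(projected, b)]
    return projected
```

For example, with a single active constraint whose normal is the third coordinate axis, the error `[1, 2, 3]` projects to `[1, 2, 0]`: the component that cannot change the decision along the active constraint is filtered out.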

2604.24658 2026-05-20 cs.LG

The Last Human-Written Paper: Agent-Native Research Artifacts

最后的人写论文:代理原研究制品

Jiachen Liu, Jiaxin Pei, Jintao Huang, Chenglei Si, Ao Qu, Xiangru Tang, Runyu Lu, Lichang Chen, Xiaoyan Bai, Haizhong Zheng, Carl Chen, Zhiyang Chen, Haojie Ye, Yujuan Fu, Zexue He, Zijian Jin, Zhenyu Zhang, Shangquan Sun, Maestro Harmon, John Dianzhuo Wang, Jianqiao Zeng, Jiachen Sun, Mingyuan Wu, Baoyu Zhou, Chenyu You, Shijian Lu, Yiming Qiu, Fan Lai, Yuan Yuan, Yao Li, Junyuan Hong, Ruihao Zhu, Beidi Chen, Alex Pentland, Ang Chen, Mosharaf Chowdhury, Zechen Zhang

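AI Summary This study proposes the Agent-Native Research Artifact (ARA), a protocol that addresses the structural costs of compressing the research process into a linear narrative in conventional scientific papers. It introduces an executable research package structure that improves the ability of AI agents to understand and extend published work.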

Comments 46 pages, 15 figures, 14 tables

AI中文摘要

科学出版物将分支、迭代的研究过程压缩成线性叙述,丢弃了大部分发现过程中的内容。这种汇总施加了两种结构性成本:故事税,即失败实验、被拒绝的假设和分支探索过程被丢弃以适应线性叙述;以及工程税,即评审充分的叙述与代理充分的规范之间存在差距,导致关键实现细节未被书写。对于人类读者来说,这些成本是可以容忍的,但当AI代理必须理解、复制和扩展已发表的工作时,这些成本变得至关重要。我们引入了Agent-Native Research Artifact (ARA),一种协议,用机器可执行的研究包取代叙述论文,结构围绕四个层次:科学逻辑、可执行代码和完整规范、探索图保存被丢弃的失败编译,以及每个声明在原始输出中得到证据支持。三种机制支持生态系统:一个Live Research Manager,捕获日常开发中的决策和死胡同;一个ARA编译器,将传统PDF和仓库转换为ARA;以及一个ARA原生评审系统,自动化客观检查,使人类评审员能够专注于重要性、新颖性和品味。在PaperBench和RE-Bench上,ARA将问答准确率从72.4%提升到93.7%,复制成功率从57.4%提升到64.4%。在RE-Bench的五个开放扩展任务中,保留的失败痕迹加速了进展,但根据代理的能力,也可能限制代理跳出先前运行的框。我们的代码在https://github.com/Orchestra-Research/Agent-Native-Research-Artifact上开源。

英文摘要

Scientific publication compresses a branching, iterative research process into a linear narrative, discarding the majority of what was discovered along the way. This compilation imposes two structural costs: a Storytelling Tax, where failed experiments, rejected hypotheses, and the branching exploration process are discarded to fit a linear narrative; and an Engineering Tax, where the gap between reviewer-sufficient prose and agent-sufficient specification leaves critical implementation details unwritten. Tolerable for human readers, these costs become critical when AI agents must understand, reproduce, and extend published work. We introduce the Agent-Native Research Artifact (ARA), a protocol that replaces the narrative paper with a machine-executable research package structured around four layers: scientific logic, executable code with full specifications, an exploration graph that preserves the failures compilation discards, and evidence grounding every claim in raw outputs. Three mechanisms support the ecosystem: a Live Research Manager that captures decisions and dead ends during ordinary development; an ARA Compiler that translates legacy PDFs and repos into ARAs; and an ARA-native review system that automates objective checks so human reviewers can focus on significance, novelty, and taste. On PaperBench and RE-Bench, ARA raises question-answering accuracy from 72.4% to 93.7% and reproduction success from 57.4% to 64.4%. On RE-Bench's five open-ended extension tasks, preserved failure traces in ARA accelerate progress, but can also constrain a capable agent from stepping outside the prior-run box depending on the agent's capabilities. Our code is open-sourced at https://github.com/Orchestra-Research/Agent-Native-Research-Artifact.

2604.15166 2026-05-20 cs.CV cs.AI cs.LG

Class Unlearning via Depth-Aware Removal of Forget-Specific Directions

通过深度感知移除遗忘特定方向实现类别反学习

Arman Hatami, Romina Aalishah, Ilya E. Monosov

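AI Summary This paper proposes DAMP, which removes forget-specific directions in a depth-aware manner to improve selective forgetting in class unlearning, while better preserving retain-class performance and reducing residual forget-class structure in deep layers.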

Comments Accepted for oral presentation at the CVPR 2026 Workshop on Machine Unlearning for Vision (MUV). Code: https://github.com/armanhtm/DAMP

AI中文摘要

机器反学习旨在在不重新训练模型的情况下移除目标知识。然而,在类别反学习中,降低遗忘类的准确性并不一定意味着真正的遗忘:遗忘的信息可能仍编码在内部表示中,而显着的遗忘可能源于分类器头部抑制而非表示移除。我们显示现有类别反学习方法往往表现出弱或负的选择性,保留遗忘类结构在深度表示中,或严重依赖最终层偏移。我们随后引入DAMP(通过投影的深度感知调节),一种单次、闭合形式的权重手术方法,可以在不使用梯度优化的情况下从预训练网络中移除遗忘特定方向。在每个阶段,DAMP在下一个可学习操作的输入空间中计算类别原型,提取遗忘方向作为相对于保留类原型的残差,并应用基于投影的更新以减少下游对这些方向的敏感性。为了保持实用性,DAMP使用从探测分离性导出的参数无关深度感知缩放规则,应用较小的编辑在早期层和较大的编辑在深层。该方法自然扩展到多类遗忘通过低秩子空间移除。在MNIST、CIFAR-10、CIFAR-100和Tiny ImageNet以及卷积和变换器架构上,DAMP比一些先前方法更接近再训练的黄金标准,改进了选择性遗忘的同时更好地保留保留类性能并减少深层残留遗忘结构。

英文摘要

Machine unlearning aims to remove targeted knowledge from a trained model without the cost of retraining from scratch. In class unlearning, however, reducing accuracy on forget classes does not necessarily imply true forgetting: forgotten information can remain encoded in internal representations, and apparent forgetting may arise from classifier-head suppression rather than representational removal. We show that existing class-unlearning methods often exhibit weak or negative selectivity, preserve forget-class structure in deep representations, or rely heavily on final-layer bias shifts. We then introduce DAMP (Depth-Aware Modulation by Projection), a one-shot, closed-form weight-surgery method that removes forget-specific directions from a pretrained network without gradient-based optimization. At each stage, DAMP computes class prototypes in the input space of the next learnable operator, extracts forget directions as residuals relative to retain-class prototypes, and applies a projection-based update to reduce downstream sensitivity to those directions. To preserve utility, DAMP uses a parameter-free depth-aware scaling rule derived from probe separability, applying smaller edits in early layers and larger edits in deeper layers. The method naturally extends to multi-class forgetting through low-rank subspace removal. Across MNIST, CIFAR-10, CIFAR-100, and Tiny ImageNet, and across convolutional and transformer architectures, DAMP more closely resembles the retraining gold standard than some of the prior methods, improving selective forgetting while better preserving retain-class performance and reducing residual forget-class structure in deep layers.
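
A minimal sketch of a projection-based weight edit in the spirit of the description above. Several choices here are assumptions made for illustration and are not confirmed by the abstract: the forget direction is taken as the forget-class prototype minus the mean retain-class prototype, and `alpha` stands in for the depth-aware scale. The function names are hypothetical; the released DAMP code should be consulted for the actual method.

```python
def forget_direction(forget_proto, retain_protos):
    """Unit vector along the forget prototype's residual (assumed definition:
    forget prototype minus the mean of the retain-class prototypes)."""
    dim = len(forget_proto)
    retain_mean = [
        sum(p[j] for p in retain_protos) / len(retain_protos) for j in range(dim)
    ]
    residual = [f - m for f, m in zip(forget_proto, retain_mean)]
    norm = sum(r * r for r in residual) ** 0.5
    if norm == 0.0:
        raise ValueError("forget prototype coincides with the retain mean")
    return [r / norm for r in residual]


def dampen_direction(weights, direction, alpha=1.0):
    """Return W - alpha * (W u) u^T for a weight matrix W (rows = outputs).

    With alpha = 1 the edited layer no longer responds to the unit direction u;
    smaller alpha gives a weaker edit (e.g. for early layers).
    """
    response = [sum(w * u for w, u in zip(row, direction)) for row in weights]
    return [
        [w - alpha * resp * u for w, u in zip(row, direction)]
        for row, resp in zip(weights, response)
    ]
```

The edit is closed-form and needs no gradients, which matches the one-shot character described in the abstract; how the scale is derived from probe separability is not reproduced here.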

2604.07303 2026-05-20 cs.RO

Robots that learn to evaluate models of collective behavior

能够评估集体行为模型的机器人

Mathis Hocke, Andreas Gerken, David Bierbach, Jens Krause, Tim Landgraf

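AI Summary This paper proposes a reinforcement-learning-based framework that uses a biomimetic robotic fish to evaluate computational models of live fish behavior. Closed-loop interaction quantifies the difference between real and simulated fish behavior, showing how learning-based robotic experiments can uncover deficiencies in behavioral models.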

AI中文摘要

理解并建模动物行为对于研究集体运动、决策和生物启发机器人至关重要。然而,评估行为模型的准确性仍然常常依赖于离线比较静态轨迹统计。在这里,我们介绍了一种基于强化学习的框架,利用仿生机器人鱼(RoboFish)通过闭环交互评估计算模型中的活鱼行为。我们使用四个不同的鱼模型(一个简单的恒定跟随基准、两个基于规则的模型和一个生物基础的卷积神经网络模型)在仿真中训练策略,并将这些策略转移到真实的RoboFish系统中,与活鱼互动。策略被训练引导模拟鱼前往目标位置,使我们能够量化真实鱼对目标位置的响应与模拟鱼响应的差异。通过量化模拟到现实的差距(定义为模拟和现实行为指标分布的Wasserstein距离,如目标到达性能、个体间距离、墙互动和对齐),我们评估鱼模型。基于神经网络的鱼模型在目标到达性能和其他大多数指标上表现出最小的差距,表明其在该基准下的行为保真度高于传统基于规则的模型。更重要的是,这种分离表明,所提出的评估方法能够在匹配的闭环条件下定量区分候选模型。我们的工作展示了学习驱动的机器人实验如何揭示行为模型的不足,并提供了一种通过具身交互评估动物行为模型的一般框架。

英文摘要

Understanding and modeling animal behavior is essential for studying collective motion, decision-making, and bio-inspired robotics. Yet, evaluating the accuracy of behavioral models still often relies on offline comparisons to static trajectory statistics. Here we introduce a reinforcement-learning-based framework that uses a biomimetic robotic fish (RoboFish) to evaluate computational models of live fish behavior through closed-loop interaction. We trained policies in simulation using four distinct fish models (a simple constant-follow baseline, two rule-based models, and a biologically grounded convolutional neural network model) and transferred these policies to the real RoboFish setup, where they interacted with live fish. Policies were trained to guide a simulated fish to goal locations, enabling us to quantify how the response of real fish differs from the simulated fish's response. We evaluate the fish models by quantifying the sim-to-real gaps, defined as the Wasserstein distance between simulated and real distributions of behavioral metrics such as goal-reaching performance, inter-individual distances, wall interactions, and alignment. The neural network-based fish model exhibited the smallest gap across goal-reaching performance and most other metrics, indicating higher behavioral fidelity than conventional rule-based models under this benchmark. More importantly, this separation shows that the proposed evaluation can quantitatively distinguish candidate models under matched closed-loop conditions. Our work demonstrates how learning-based robotic experiments can uncover deficiencies in behavioral models and provides a general framework for evaluating animal behavior models through embodied interaction.
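
The sim-to-real gap defined above can be sketched for one-dimensional metrics. This is a simplified illustration, assuming equal numbers of simulated and real samples per metric, in which case the 1-Wasserstein distance between the two empirical distributions reduces to the mean absolute difference of the sorted samples. The function names and the dictionary layout are hypothetical.

```python
def wasserstein_1d(sim, real):
    """1-Wasserstein distance between two equal-size 1-D empirical samples."""
    if len(sim) != len(real) or not sim:
        raise ValueError("samples must be non-empty and of equal size")
    paired = zip(sorted(sim), sorted(real))  # optimal coupling in 1-D
    return sum(abs(s - r) for s, r in paired) / len(sim)


def sim_to_real_gaps(sim_metrics, real_metrics):
    """Per-metric gap for dictionaries mapping metric name -> list of samples."""
    return {
        name: wasserstein_1d(sim_metrics[name], real_metrics[name])
        for name in sim_metrics
    }
```

A smaller gap on a metric means the simulated fish model reproduces the distribution observed with live fish more closely on that metric. For samples of unequal size, a general routine such as SciPy's `scipy.stats.wasserstein_distance` would be the more appropriate tool.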

2604.05002 2026-05-20 cs.LG cs.AI

Learning Stable Predictors from Weak Supervision under Distribution Shift

在分布偏移下从弱监督中学习稳定的预测器

Mehrdad Shoeibi, Elias Hossain, Ivan Garibay, Niloofar Yousefi

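AI Summary This paper studies learning stable predictors from weak supervision under distribution shift. Using CRISPR-Cas13d transcriptomic perturbation experiments, it examines supervision drift, shows that weak supervision supports in-domain learning and partial cross-cell-line transfer, and finds that failures in temporal transfer stem from supervision drift rather than model capacity or simple covariate shift.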

AI中文摘要

在真实标签不可用时,从弱、代理或相对监督中学习是常见的,但分布偏移下的鲁棒性仍缺乏理解,因为监督机制本身可能在不同环境中变化。我们正式将这种现象定义为监督漂移,即$P(y \mid x, c)$在不同上下文中变化,并在CRISPR-Cas13d转录组扰动实验中研究了它,其中指导效果是通过RNA-seq响应间接推断的。使用涵盖两种人类细胞系和多个诱导后时间点的公开数据,我们构建了一个受控的非独立同分布基准,具有明确的领域(细胞系)和时间偏移,同时在所有上下文中重用固定的弱标签构造以避免改变目标。在线性和树基模型中,弱监督支持域内有意义的学习(岭$R^2 = 0.356$,斯皮尔曼$ρ= 0.442$)和部分跨细胞系迁移($ρ\approx 0.40$)。相比之下,时间迁移在所有考虑的模型类别中崩溃,产生负$R^2$和弱或接近零的$ρ$(岭$R^2 = -0.145$,$ρ= 0.008$;XGBoost $R^2 = -0.155$,$ρ= 0.056$;随机森林 $R^2 = -0.322$,$ρ= 0.139$)。使用外部重新计算的弱标签、偏移分数量化和简单的缓解基线进行额外的鲁棒性分析,保持了相同定性的模式。特征-标签关联和特征重要性分析在不同细胞系中相对稳定,但在时间上变化剧烈,表明失败源于监督漂移而非模型容量或简单协变量偏移。这些结果表明,在弱监督下强域内性能可能是误导性的,并促使将特征稳定性作为轻量级诊断,用于部署前检测非可迁移性。

英文摘要

Learning from weak, proxy, or relative supervision is common when ground-truth labels are unavailable, but robustness under distribution shift remains poorly understood because the supervision mechanism itself may change across environments. We formalize this phenomenon as supervision drift, defined as changes in $P(y \mid x, c)$ across contexts, and study it in CRISPR-Cas13d transcriptomic perturbation experiments where guide efficacy is inferred indirectly from RNA-seq responses. Using publicly available data spanning two human cell lines and multiple post-induction timepoints, we construct a controlled non-IID benchmark with explicit domain (cell line) and temporal shifts, while reusing a fixed weak-label construction across all contexts to avoid changing targets. Across linear and tree-based models, weak supervision supports meaningful learning in-domain (ridge $R^2 = 0.356$, Spearman $\rho = 0.442$) and partial cross-cell-line transfer ($\rho \approx 0.40$). In contrast, temporal transfer collapses across all model classes considered, yielding negative $R^2$ and weak or near-zero $\rho$ (ridge $R^2 = -0.145$, $\rho = 0.008$; XGBoost $R^2 = -0.155$, $\rho = 0.056$; random forest $R^2 = -0.322$, $\rho = 0.139$). Additional robustness analyses using externally recomputed weak labels, shift-score quantification, and simple mitigation baselines preserve the same qualitative pattern. Feature-label association and feature-importance analyses remain relatively stable across cell lines but change sharply over time, indicating that failures arise from supervision drift rather than model capacity or simple covariate shift. These results show that strong in-domain performance under weak supervision can be misleading and motivate feature stability as a lightweight diagnostic for non-transferability before deployment.
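
The two metrics reported above can be sketched in plain Python to show why $R^2$ can go negative: it drops below zero whenever predictions fit worse than a constant predictor at the mean of the targets. The function names are illustrative and the code is a generic textbook computation, not the paper's evaluation pipeline.

```python
from statistics import mean


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation, computed as the Pearson correlation of ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

Because $R^2$ measures squared error against the target scale while Spearman's $\rho$ depends only on ordering, a model can have negative $R^2$ and still show a small positive $\rho$, as in the temporal-transfer numbers quoted above.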

2603.25722 2026-05-20 cs.CV cs.LG

No Hard Negatives Required: Concept Centric Learning Leads to Compositionality without Degrading Zero-shot Capabilities of Contrastive Models

无需硬负样本:基于概念的学习在不降低对比模型零样本能力的情况下实现组合性

Hai X. Pham, David T. Hoffmann, Ricardo Guerrero, Brais Martinez

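AI Summary This paper proposes a concept-centric learning method that achieves compositionality without hard negatives and without harming the zero-shot and retrieval capabilities of contrastive models, using simple remedies for the global pooling problem in the text and image encoders.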

Comments Accepted at CVPR 2026. 2nd rev: update github repo URL

AI中文摘要

对比视觉-语言(V&L)模型仍然是各种应用中的流行选择。然而,出现了几个限制,尤其是V&L模型学习组合性表示的能力有限。先前的方法通常通过生成定制训练数据来获得硬负样本。硬负样本已被证明可以提高组合性任务的性能,但通常只适用于单一基准,无法推广,并且可能导致基本V&L能力如零样本或检索性能的显著下降,使其不切实际。在本工作中,我们采取了不同的方法。我们识别出两个限制V&L组合性性能的根本原因:1)长训练标题不需要组合性表示;2)文本和图像编码器中的最终全局池化导致完全失去学习绑定所需的必要信息。为了解决这一问题,我们提出了两种简单的解决方案:1)使用标准NLP软件获得短的概念导向标题部分,并将其对齐到图像;2)引入无参数的跨模态注意力池化,从图像编码器中获得概念导向的视觉嵌入。通过这些更改和简单的辅助对比损失,我们获得了标准组合性基准的SOTA性能,同时保持或提高了强大的零样本和检索能力。这在不增加推理成本的情况下实现。我们在此工作的代码已发布在https://github.com/saic-fi/concept_centric_clip。

英文摘要

Contrastive vision-language (V&L) models remain a popular choice for various applications. However, several limitations have emerged, most notably the limited ability of V&L models to learn compositional representations. Prior methods often addressed this limitation by generating custom training data to obtain hard negative samples. Hard negatives have been shown to improve performance on compositionality tasks, but are often specific to a single benchmark, do not generalize, and can cause substantial degradation of basic V&L capabilities such as zero-shot or retrieval performance, rendering them impractical. In this work we follow a different approach. We identify two root causes that limit compositionality performance of V&L models: 1) Long training captions do not require a compositional representation; and 2) The final global pooling in the text and image encoders leads to a complete loss of the necessary information to learn binding in the first place. As a remedy, we propose two simple solutions: 1) We obtain short concept-centric caption parts using standard NLP software and align those with the image; and 2) We introduce a parameter-free cross-modal attention-pooling to obtain concept-centric visual embeddings from the image encoder. With these two changes and simple auxiliary contrastive losses, we obtain SOTA performance on standard compositionality benchmarks, while maintaining or improving strong zero-shot and retrieval capabilities. This is achieved without increasing inference cost. We release the code for this work at https://github.com/saic-fi/concept_centric_clip.

2603.25476 2026-05-20 cs.LG

How Class Ontology and Data Scale Affect Audio Transfer Learning

音频迁移学习中类本体和数据规模的影响

Manuel Milling, Andreas Triantafyllopoulos, Alexander Gebhard, Simon Rampp, Björn W. Schuller

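AI Summary This paper studies how class ontology and data scale affect audio-to-audio transfer learning. Increasing the number of samples and classes both help, but similarity between pre-training and the downstream task plays the dominant role.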

AI中文摘要

迁移学习是深度学习中的关键概念,允许人工神经网络在数据有限的任务中受益于大量预训练数据的基础。尽管其广泛应用和明显优势,但关于迁移学习内部机制以及何时和如何有效工作的理解仍然存在许多开放问题。为此,我们进行了严格的研究,专注于音频到音频的迁移学习,在此过程中,我们在AudioSet的(基于本体的)子集上预训练各种模型状态,并在三个计算机听觉任务上进行微调:声学场景识别、鸟类活动识别和语音命令识别。我们报告说,增加预训练数据中的样本和类别的数量对迁移学习都有积极影响。然而,这通常被预训练与下游任务之间的相似性所超越,这种相似性可以导致模型学习到相似的特征。

英文摘要

Transfer learning is a crucial concept within deep learning that allows artificial neural networks to benefit from a large pre-training data basis when confronted with a task of limited data. Despite its ubiquitous use and clear benefits, there are still many open questions regarding the inner workings of transfer learning and, in particular, regarding the understanding of when and how well it works. To that end, we perform a rigorous study focusing on audio-to-audio transfer learning, in which we pre-train various model states on (ontology-based) subsets of AudioSet and fine-tune them on three computer audition tasks, namely acoustic scene recognition, bird activity recognition, and speech command recognition. We report that increasing the number of samples and the number of classes in the pre-training data each has a positive impact on transfer learning. This is, however, generally surpassed by similarity between pre-training and the downstream task, which can lead the model to learn comparable features.

2603.22161 2026-05-20 cs.LG

Causal Evidence that Language Models use Confidence to Drive Behavior

语言模型使用置信度驱动行为的因果证据

Dharshan Kumaran, Nathaniel Daw, Simon Osindero, Petar Veličković, Viorica Patraucean

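AI Summary This study examines whether language models use confidence signals to control behavior, such as deciding whether to answer or abstain. A four-phase experiment finds that models implement abstention through a multidimensional internal confidence representation and threshold-based policies, pointing to structured metacognitive control.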

English abstract

Metacognition -- assessing the quality of one's own cognitive performance -- guides adaptive behavior across species. Substantial research demonstrates that confidence signals can be extracted from language model outputs, yet a fundamental question remains: do models actually use these signals to control behavior, such as deciding whether to answer or abstain? To investigate, we developed a four-phase paradigm. Phase 1 elicited baseline confidence estimates without an abstention option. Phase 2 revealed that LLMs apply an implicit threshold to internal confidence when deciding to abstain, with confidence effect sizes approximately an order of magnitude larger than alternative mechanisms. Phase 3 provided direct causal evidence through activation steering: boosting or suppressing confidence signals correspondingly decreased or increased abstention rates. Phase 4 extended this by systematically varying instructed thresholds, demonstrating that LLMs actively deploy confidence signals to implement abstention policies. Critically, beyond calibrated log-probability-based confidence derived from the output distribution, verbal confidence independently predicted abstention across all models, despite being objectively less discriminative of answer correctness. Activation decoding at the last pre-answer token further showed that both observable measures are lossy readouts of a richer internal representation. Together, these results suggest that abstention is not fully captured by the strength of evidence in the output distribution alone, but is better explained by the joint operation of a multidimensional internal confidence representation and threshold-based policies -- consistent with structured metacognitive control in LLMs, a capacity of growing importance as models transition to autonomous agents that must recognize their own uncertainty.
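The threshold-based abstention policy described in the abstract can be illustrated with a toy sketch. This is not the paper's experimental code: the function names, the scalar confidence values, and the thresholds below are all hypothetical, and a real model's internal confidence is described in the abstract as multidimensional rather than a single number.

```python
def decide(confidence: float, threshold: float) -> str:
    """Answer when confidence reaches the threshold, otherwise abstain."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    return "answer" if confidence >= threshold else "abstain"


def abstention_rate(confidences, threshold: float) -> float:
    """Fraction of items on which the policy abstains."""
    decisions = [decide(c, threshold) for c in confidences]
    return decisions.count("abstain") / len(decisions)


def shift(confidences, delta: float):
    """Crude stand-in for boosting (delta > 0) or suppressing (delta < 0) confidence."""
    return [min(1.0, max(0.0, c + delta)) for c in confidences]


# Hypothetical per-question confidence values.
confidences = [0.2, 0.5, 0.7, 0.9]

rate_low_threshold = abstention_rate(confidences, threshold=0.6)
rate_high_threshold = abstention_rate(confidences, threshold=0.8)
rate_boosted = abstention_rate(shift(confidences, 0.2), threshold=0.6)
```

Under this sketch, raising the threshold raises the abstention rate, and shifting confidence upward lowers it. That is the qualitative pattern the abstract reports for instructed thresholds and for activation steering.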

2603.18396 2026-05-20 cs.LG cs.RO

RE-SAC: Disentangling aleatoric and epistemic risks in bus fleet control: A stable and robust ensemble DRL approach

Yifan Zhang, Liang Zheng

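AI summary: This study proposes RE-SAC, which disentangles aleatoric and epistemic risk to improve the stability and robustness of bus fleet control. It uses Integral Probability Metric (IPM)-based weight regularization and a diversified Q-ensemble to handle the two types of uncertainty.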

English abstract

Bus holding control is challenging due to stochastic traffic and passenger demand. While deep reinforcement learning (DRL) shows promise, standard actor-critic algorithms suffer from Q-value instability in volatile environments. A key source of this instability is the conflation of two distinct uncertainties: aleatoric uncertainty (irreducible noise) and epistemic uncertainty (data insufficiency). Treating these as a single risk leads to value underestimation in noisy states, causing catastrophic policy collapse. We propose a robust ensemble soft actor-critic (RE-SAC) framework to explicitly disentangle these uncertainties. RE-SAC applies Integral Probability Metric (IPM)-based weight regularization to the critic network to hedge against aleatoric risk, providing a smooth analytical lower bound for the robust Bellman operator without expensive inner-loop perturbations. To address epistemic risk, a diversified Q-ensemble penalizes overconfident value estimates in sparsely covered regions. This dual mechanism prevents the ensemble variance from misidentifying noise as a data gap, a failure mode identified in our ablation study. Experiments in a realistic bidirectional bus corridor simulation demonstrate that RE-SAC achieves the highest cumulative reward (approx. -0.4e6), compared with -0.55e6 for vanilla SAC. Mahalanobis rareness analysis confirms that RE-SAC reduces Oracle Q-value estimation error by up to 62% in rare out-of-distribution states (MAE of 1647 vs. 4343), demonstrating superior robustness under high traffic variability.
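The 62% figure quoted in the abstract can be checked against the two MAE values it reports. The short sketch below assumes that 1647 is the RE-SAC error and 4343 is the error of the baseline it is compared against. The abstract does not name that baseline, and the variable names are illustrative only.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Relative reduction of `improved` with respect to `baseline`."""
    return (baseline - improved) / baseline


# MAE values quoted in the abstract; which method each belongs to is assumed.
mae_baseline = 4343.0
mae_re_sac = 1647.0

reduction = relative_reduction(mae_baseline, mae_re_sac)
print(f"Relative MAE reduction: {reduction:.1%}")  # prints "Relative MAE reduction: 62.1%"
```

The result, roughly 62.1%, agrees with the "up to 62%" stated in the abstract.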