arXivDaily: arXiv daily academic digest, updated Monday to Friday
All subject categories: 4160
2603.08000 2026-06-02 cs.CL cs.LG

SmartThinker: Progressive Chain-of-Thought Length Calibration for Efficient Large Language Model Reasoning

Chenzhi Hu, Qinzhe Hu, Yuhang Xu, Junyi Chen, Ruijie Wang, Shengzhong Liu, Jianxin Li, Fan Wu, Guihai Chen

Affiliations: Tsinghua University

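AI summary: To address redundant outputs from large reasoning models, the authors propose SmartThinker, a GRPO-based progressive CoT length calibration method that dynamically estimates the optimal length and modulates the length reward coefficient, shortening responses while improving accuracy.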

Comments Accepted by ICML 2026, 18 pages, 13 figures

Abstract

Large reasoning models (LRMs) like OpenAI o1 and DeepSeek-R1 achieve high accuracy on complex tasks by adopting long chain-of-thought (CoT) reasoning paths. However, the inherent verbosity of these processes frequently results in redundancy and overthinking. To address this issue, existing works leverage Group Relative Policy Optimization (GRPO) to reduce LRM output length, but their static length reward design cannot dynamically adapt according to the relative problem difficulty and response length distribution, causing over-compression and compromised accuracy. Therefore, we propose SmartThinker, a novel GRPO-based efficient reasoning method with progressive CoT length calibration. SmartThinker makes a two-fold contribution: First, it dynamically estimates the optimal length with peak accuracy during training and guides overlong responses toward it to reduce response length while sustaining accuracy. Second, it dynamically modulates the length reward coefficient to avoid the unwarranted penalization of correct reasoning paths. Extensive experiment results show that SmartThinker achieves up to 52.5% average length compression with improved accuracy, and achieves up to 16.6% accuracy improvement on challenging benchmarks like AIME25. The source code can be found at https://github.com/SJTU-RTEAS/SmartThinker.
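The following Python sketch makes the two mechanisms concrete. It shows how an accuracy-peak length estimate and a modulated length penalty could feed GRPO's group-normalized advantages. Everything beyond GRPO's within-group normalization is an assumption for illustration. The length bucketing, the linear overlong penalty, and the accuracy-scaled coefficient (`estimate_optimal_length`, `length_reward`, `base_coef`, `bucket_width`) are hypothetical and are not SmartThinker's published formulas.

```python
import math
from collections import defaultdict


def estimate_optimal_length(lengths, correct, bucket_width=512):
    """Hypothetical: centre of the length bucket with the highest accuracy.

    Ties go to the shortest bucket, since max() returns the first maximum
    and bucket ids are visited in ascending order.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for n, ok in zip(lengths, correct):
        bucket = n // bucket_width
        totals[bucket] += 1
        hits[bucket] += int(ok)
    best = max(sorted(totals), key=lambda b: hits[b] / totals[b])
    return (best + 0.5) * bucket_width


def length_reward(length, target, coef):
    """Hypothetical linear penalty applied only to responses longer than the target."""
    if length <= target:
        return 0.0
    return -coef * (length - target) / target


def grpo_advantages(lengths, correct, base_coef=0.5, bucket_width=512):
    """Correctness reward plus calibrated length reward, normalized within the group."""
    target = estimate_optimal_length(lengths, correct, bucket_width)
    group_acc = sum(correct) / len(correct)
    # Hypothetical modulation: weaker length pressure when group accuracy is low,
    # so long but necessary reasoning on hard prompts is not over-penalized.
    coef = base_coef * group_acc
    rewards = [float(ok) + length_reward(n, target, coef) for n, ok in zip(lengths, correct)]
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    # GRPO's group-relative advantage: standardize rewards across the sampled group.
    return [(r - mean) / (std + 1e-8) for r in rewards]


# One prompt, four sampled responses (token lengths and correctness are made up).
advantages = grpo_advantages([800, 1200, 3000, 4100], [True, True, True, False])
print([round(a, 3) for a in advantages])
```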

2603.07578 2026-06-02 cs.RO

Approximate Imitation Learning for Event-based Quadrotor Flight in Cluttered Environments

Nico Messikommer, Jiaxu Xing, Leonard Bauersfeld, Marco Cannici, Elie Aljalbout, Davide Scaramuzza

Affiliations: Robotics and Perception Group, University of Zurich, Switzerland

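AI summary: Proposes Approximate Imitation Learning, a framework that separates representation learning from policy search. It cuts training time for event-camera quadrotor flight policies from 52.44 hours to 1.86 hours (a 28x speedup) and validates high-speed flight in both simulation and the real world.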

Abstract

Event cameras offer high temporal resolution and low latency, making them ideal sensors for high-speed robotic applications where conventional cameras suffer from motion blur. However, their widespread adoption in robot learning is severely bottlenecked by the computational cost of simulating high-frequency event data during online training. In this work, we present Approximate Imitation Learning, a novel framework that fundamentally resolves this bottleneck, reducing policy training time for complex, agile drone flight from 52.44 hours to just 1.86 hours (a 28x computational speedup). Our key insight is to separate representation learning from policy search. We first leverage a large-scale offline dataset to learn a task-specific representation space. Subsequently, the policy is fine-tuned through online interactions that rely solely on lightweight state information, completely eliminating the need to render events during the active policy search phase. This training paradigm drastically reduces development overhead and enables event-based control policies to scale to complex environments. Furthermore, our approach eliminates the reliance on standard cameras or intermediate representations during deployment, mapping events directly to control commands. In simulation, our method matches or exceeds the performance of standard imitation learning baselines that require full online event rendering. Finally, we successfully validate the framework in the real world, demonstrating that a policy trained via this ultra-efficient paradigm enables a quadrotor to fly through highly cluttered environments at remarkable speeds of up to 9.8 m/s.

2602.01662 2026-06-02 cs.RO

PLanAR: Planning-Language-Grounded Agentic Reasoning for Robot Manipulation

Pengyuan Guo, Zhonghao Mai, Zhengtong Xu, Kaidi Zhang, Quan Khanh Luu, Heng Zhang, Zichen Miao, Arash Ajoudani, Zachary Kingston, Qiang Qiu, Yu She

Affiliations: Purdue University; Istituto Italiano di Tecnologia

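AI summary: Proposes PLanAR, a framework that uses a planning-language interface to define the VLM reasoning space. It enables open-vocabulary, long-horizon robot manipulation with stepwise verification and replanning.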

Comments New version with updated framing, contributions, experiments, and figures

Abstract

Recent advances in vision-language models (VLMs) have enabled increasing progress in real-world robot manipulation. However, long-horizon manipulation in unstructured environments requires VLMs to reason about changing scene states, action constraints, and execution outcomes, which remains difficult with natural language reasoning alone. We present PLanAR, a planning-language-grounded robot agent framework for open-vocabulary, long-horizon manipulation. PLanAR uses a planning-language interface to define the VLM reasoning space: object predicates represent scene states, action schemas specify robot skills with preconditions and effects, and symbolic plans provide executable intermediate representations. This interface enables stepwise verification: after each action, PLanAR uses onboard observations to check whether the expected symbolic effects have been achieved, allowing the VLM-based agent to update task states, detect failures, and replan when execution deviates from expectation. Across robot embodiments, VLM backends, and tasks including stacking, crossword solving, and long-horizon kitchen workflows, PLanAR demonstrates strong real-world capability while revealing key limitations of current VLMs in embodied reasoning.
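A minimal STRIPS-style sketch in Python can illustrate the planning-language interface and stepwise verification. Predicates are strings, and an action schema carries preconditions plus add and delete effects. Execution stops at the first action whose expected effects are not observed, which is the point where replanning would begin. The `ActionSchema` class, the predicate names, and the simulated `observe` callbacks are hypothetical. In PLanAR, the effect check is grounded in onboard observations by a VLM, not by a Python function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionSchema:
    """Hypothetical STRIPS-style schema: preconditions plus add/delete effects."""
    name: str
    preconditions: frozenset
    add_effects: frozenset
    del_effects: frozenset

    def applicable(self, state):
        return self.preconditions <= state

    def expected(self, state):
        # Symbolic prediction of the next state.
        return (state - self.del_effects) | self.add_effects

    def achieved(self, observed):
        # Stepwise verification: did the observed state realize the expected effects?
        return self.add_effects <= observed and not (self.del_effects & observed)


def execute_plan(plan, state, observe):
    """Execute actions one by one; return (state, failed_action or None).

    A non-None failed_action marks where an agent would update its task
    state and request a new plan.
    """
    for action in plan:
        if not action.applicable(state):
            return state, action
        state = observe(action, state)  # perceived state after executing the action
        if not action.achieved(state):
            return state, action
    return state, None


pick = ActionSchema("pick(cube)",
                    frozenset({"clear(cube)", "handempty"}),
                    frozenset({"holding(cube)"}),
                    frozenset({"handempty", "ontable(cube)"}))
stack = ActionSchema("stack(cube,box)",
                     frozenset({"holding(cube)", "clear(box)"}),
                     frozenset({"on(cube,box)", "handempty"}),
                     frozenset({"holding(cube)", "clear(box)"}))
start = frozenset({"clear(cube)", "handempty", "ontable(cube)", "clear(box)"})


def perfect_robot(action, state):
    return action.expected(state)


def slipping_robot(action, state):
    # Simulated failure: stacking silently does nothing.
    return state if action.name.startswith("stack") else action.expected(state)


print(execute_plan([pick, stack], start, slipping_robot)[1].name)  # stack(cube,box)
```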

2603.07109 2026-06-02 cs.AI

Vision Language Models Cannot Reason About Physical Transformation

Dezhi Luo, Yijiang Li, Maijunxian Wang, Tianwei Zhao, Bingyang Wang, Siheng Wang, Pinyuan Feng, Pooyan Rahmanzadehgervi, Ziqiao Ma, Hokin Deng

Affiliations: University of Science and Technology of China

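AI summary: Builds the ConservationBench benchmark and uses it to evaluate 112 vision language models on physical conservation tasks. The models fail to maintain transformation-invariant representations of physical properties in dynamic scenes.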

Comments Accepted by ICML 2026

Abstract

Understanding physical transformations is fundamental for reasoning in dynamic environments. While Vision Language Models (VLMs) show promise in embodied applications, whether they genuinely understand physical transformations remains unclear. We introduce ConservationBench, which evaluates conservation: whether physical quantities remain invariant under transformations. Spanning four properties with paired conserving/non-conserving scenarios, we generate and evaluate 23,040 questions across 112 VLMs. Results reveal systematic failure: performance remains near chance, and improvements on conservation tasks are accompanied by drops on controls. Control experiments show strong textual priors favoring invariance, yet models perform worse with actual visual content when performance is balanced across conserving and non-conserving scenarios. Neither temporal resolution, prompting, nor curated sampling helps. These findings show that current VLMs fail to maintain transformation-invariant representations of physical properties across dynamic scenes.
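A toy calculation shows why the benchmark pairs conserving with non-conserving scenarios. A responder driven only by a textual prior for invariance looks perfect on conserving items, yet it lands exactly at chance once both kinds are weighted equally. The item format and the `balanced_accuracy` helper below are hypothetical and are not ConservationBench's evaluation code.

```python
def balanced_accuracy(items, answer_fn):
    """items: (scenario, conserving) pairs; answer_fn(scenario) -> True means 'quantity unchanged'."""
    per_kind = {True: [], False: []}
    for scenario, conserving in items:
        per_kind[conserving].append(answer_fn(scenario) == conserving)
    acc_conserving = sum(per_kind[True]) / len(per_kind[True])
    acc_non_conserving = sum(per_kind[False]) / len(per_kind[False])
    return acc_conserving, acc_non_conserving, (acc_conserving + acc_non_conserving) / 2


# Four conserving and four non-conserving versions of a hypothetical pouring scene.
items = [(f"pour_{i}", i % 2 == 0) for i in range(8)]


def always_invariant(scenario):
    # A purely text-prior responder: it ignores the visual content entirely.
    return True


print(balanced_accuracy(items, always_invariant))  # (1.0, 0.0, 0.5)
```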

2603.06741 2026-06-02 cs.LG cs.AI cs.CV

Heterogeneous Decentralized Diffusion Models

Zhiying Jiang, Raihan Seraj, Marcos Villagra, Bidhan Roy

Affiliations: bagel.com

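AI summary: Proposes a heterogeneous decentralized training framework with three ingredients: experts that use different objectives (DDPM and Flow Matching) unified at inference, pretrained checkpoint conversion, and an efficient architecture. Together these greatly reduce compute and data requirements, so a single GPU (24-48 GB VRAM) is enough to participate in training.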

Comments Accepted to CVPR2026

Abstract

Training frontier-scale diffusion models often requires substantial computational resources concentrated in tightly-coupled clusters, limiting participation to well-resourced institutions. While Decentralized Diffusion Models (DDM) enable training multiple experts in isolation, existing approaches require 1176 GPU-days and homogeneous training objectives across all experts. We present an efficient framework that dramatically reduces resource requirements while supporting heterogeneous training objectives. Our approach combines three key contributions: (1) a heterogeneous decentralized training paradigm that allows experts to use different objectives (DDPM and Flow Matching), unified at inference time without any retraining; (2) pretrained checkpoint conversion from ImageNet-DDPM to Flow Matching objectives, accelerating convergence and enabling initialization without objective-specific pretraining; and (3) PixArt-α's efficient AdaLN-Single architecture, reducing parameters while maintaining quality. Experiments on LAION-Aesthetics show that, relative to the training scale reported for prior DDM work, our approach reduces the compute by 16× and data by 14×. Under aligned inference settings, our heterogeneous configuration achieves better FID and higher intra-prompt diversity than the homogeneous baseline. By eliminating synchronization requirements and enabling mixed DDPM/FM objectives, our framework makes decentralized generative model training accessible to contributors with single GPUs requiring only 24-48 GB VRAM.
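To combine DDPM and Flow Matching experts at inference, their outputs must be expressed in one parameterization. The per-coordinate Python sketch below assumes conventions that may differ from the paper's:

- Flow Matching uses the linear path x_t = (1 - t)·x0 + t·ε, with velocity target v = ε - x0.
- A DDPM expert predicts ε for a variance-preserving input α·x0 + σ·ε, with α² + σ² = 1.

Under these assumptions, an FM state is rescaled before it is passed to the DDPM expert, and the expert's ε prediction is then converted back to a velocity. The function names and the fixed-weight mixing are hypothetical.

```python
import math


def fm_state_for_ddpm(x_t, t):
    """Rescale an FM state x_t = (1-t)*x0 + t*eps to variance-preserving form.

    Returns (x_t / c, (alpha, sigma)) with c = sqrt((1-t)^2 + t^2), so that
    x_t / c = alpha*x0 + sigma*eps and alpha^2 + sigma^2 = 1.
    """
    c = math.sqrt((1 - t) ** 2 + t ** 2)
    return [x / c for x in x_t], ((1 - t) / c, t / c)


def eps_to_velocity(x_t, t, eps_hat):
    """Turn a DDPM noise prediction into the FM velocity v = eps - x0 (requires t < 1)."""
    x0_hat = [(x - t * e) / (1 - t) for x, e in zip(x_t, eps_hat)]
    return [e - x0 for e, x0 in zip(eps_hat, x0_hat)]


def mix_velocities(velocities, weights):
    """Hypothetical weighted ensemble once every expert is expressed as a velocity."""
    total = sum(weights)
    return [sum(w * v[i] for v, w in zip(velocities, weights)) / total
            for i in range(len(velocities[0]))]


# Sanity check with the true noise: the converted velocity equals eps - x0.
x0, eps, t = [0.5, -1.0], [0.2, 0.3], 0.4
x_t = [(1 - t) * a + t * e for a, e in zip(x0, eps)]
print(eps_to_velocity(x_t, t, eps))  # approximately [-0.3, 1.3]
```

In use, the DDPM expert would receive the rescaled state at the noise level given by `(alpha, sigma)`. Its ε prediction would then pass through `eps_to_velocity` before mixing with the Flow Matching experts' velocities.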

2603.06453 2026-06-02 cs.CV

Pinterest Canvas: Large-Scale Image Generation at Pinterest

Yu Wang, Eric Tzeng, Raymond Shiau, Jie Yang, Dmitry Kislyuk, Charles Rosenberg

Affiliations: Pinterest, Inc.

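AI summary: Presents Pinterest Canvas, a large-scale diffusion-based image generation system. It fine-tunes a foundation model into specialized models for specific tasks (such as background enhancement and aspect-ratio outpainting), achieving engagement lifts of 18.0% and 12.5%, respectively, in online experiments.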

Comments Accepted by KDD 2026 Applied Data Science Track

Abstract

While recent image generation models demonstrate a remarkable ability to handle a wide variety of image generation tasks, this flexibility makes them hard to control via prompting or simple inference adaptation alone, rendering them unsuitable for use cases with strict product requirements. In this paper, we introduce Pinterest Canvas, our large-scale image generation system built to support image editing and enhancement use cases at Pinterest. Canvas is first trained on a diverse, multimodal dataset to produce a foundational diffusion model with broad image-editing capabilities. However, rather than relying on one generic model to handle every downstream task, we instead rapidly fine-tune variants of this base model on task-specific datasets, producing specialized models for individual use cases. We describe key components of Canvas and summarize our best practices for dataset curation, training, and inference. We also showcase task-specific variants through case studies on background enhancement and aspect-ratio outpainting, highlighting how we tackle their specific product requirements. Online A/B experiments demonstrate that our enhanced images receive a significant 18.0% and 12.5% engagement lift, respectively, and comparisons with human raters further validate that our models outperform third-party models on these tasks. Finally, we showcase other Canvas variants, including multi-image scene synthesis and image-to-video generation, demonstrating that our approach can generalize to a wide variety of potential downstream tasks.

2603.06331 2026-06-02 cs.CV

WorldCache: Accelerating World Models for Free via Heterogeneous Token Caching

Weilun Feng, Guoxin Fan, Haotong Qin, Mingqiang Wu, Yuqi Li, Xiangqi Li, Zhulin An, Libo Huang, Dingrui Wang, Longlong Liao, Michele Magno, Yongjun Xu, Chuanguang Yang

Affiliations: University of Science and Technology of China

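AI summary: Targets slow inference in diffusion world models, which stems from token heterogeneity and non-uniform temporal dynamics. Proposes WorldCache, a caching framework built on curvature-guided heterogeneous token prediction and chaotic-prioritized adaptive skipping, achieving up to 3.7x speedup while maintaining 98% rollout quality.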

Comments Accepted by ICML 2026

Abstract

Diffusion-based world models have shown strong potential for unified world simulation, but iterative denoising remains too costly for interactive use and long-horizon rollouts. While feature caching can accelerate inference without training, we find that policies designed for single-modal diffusion transfer poorly to world models due to two world-model-specific obstacles: token heterogeneity from multi-modal coupling and spatial variation, and non-uniform temporal dynamics, where a small set of hard tokens drives error growth, making uniform skipping either unstable or overly conservative. We propose WorldCache, a caching framework tailored to diffusion world models. We introduce Curvature-guided Heterogeneous Token Prediction, which uses a physics-grounded curvature score to estimate token predictability and applies a Hermite-guided damped predictor to chaotic tokens with abrupt direction changes. We also design Chaotic-prioritized Adaptive Skipping, which accumulates a curvature-normalized, dimensionless drift signal and recomputes only when bottleneck tokens begin to drift. Experiments on diffusion world models show that WorldCache delivers up to 3.7× end-to-end speedups while maintaining 98% rollout quality, demonstrating its advantages and practicality in resource-constrained scenarios. Our code is released at https://github.com/FofGofx/WorldCache.
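The per-token caching logic can be sketched in three steps. First, measure how sharply a token's feature trajectory turns across denoising steps. Next, extrapolate from cached features with a step that shrinks as curvature grows. Finally, trigger a full recompute once the accumulated drift of the worst token crosses a threshold. This Python sketch substitutes simple stand-ins for the paper's components, and all of them are assumptions:

- the curvature measure (one minus the cosine between successive feature differences);
- the damped linear extrapolation (WorldCache itself uses a Hermite-guided predictor);
- `drift_threshold`.

```python
import math


def _diff(a, b):
    return [x - y for x, y in zip(a, b)]


def _norm(a):
    return math.sqrt(sum(x * x for x in a))


def curvature(f_prev2, f_prev, f_cur):
    """Direction change of a token's feature trajectory over the last three steps.

    Returns 1 - cos(angle between successive differences): 0 for a straight
    trajectory, 2 for a full reversal. It is dimensionless, so tokens are comparable.
    """
    d1, d2 = _diff(f_prev, f_prev2), _diff(f_cur, f_prev)
    denom = _norm(d1) * _norm(d2)
    if denom == 0:
        return 0.0
    return 1 - sum(x * y for x, y in zip(d1, d2)) / denom


def predict_next(f_prev2, f_prev, f_cur, damping=1.0):
    """Reuse cached features: linear extrapolation, damped for high-curvature tokens."""
    step = 1.0 / (1.0 + damping * curvature(f_prev2, f_prev, f_cur))
    return [c + step * (c - p) for c, p in zip(f_cur, f_prev)]


class DriftGate:
    """Accumulate the worst token's curvature; signal a full recompute past a threshold."""

    def __init__(self, drift_threshold=0.5):
        self.threshold = drift_threshold
        self.accumulated = 0.0

    def should_recompute(self, token_curvatures):
        # Bottleneck tokens (the largest curvature) drive the decision, not the mean.
        self.accumulated += max(token_curvatures)
        if self.accumulated >= self.threshold:
            self.accumulated = 0.0
            return True
        return False


print(predict_next([0.0], [1.0], [2.0]))  # [3.0]: straight trajectory, full step
```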

2602.06841 2026-06-02 cs.AI

From Features to Actions: Explainability in Traditional and Agentic AI Systems

Sindhuja Chaduvula, Jessee Ho, Kina Kim, Aravind Narayanan, Ahmed Y. Radwan, Mahshid Alinoori, Muskan Garg, Dhanesh Ramachandram, Shaina Raza

Affiliations: Vector Institute for Artificial Intelligence; Independent Researcher; Mayo Clinic

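AI summary: Compares attribution-based explanations with trace-based diagnostics in static and agentic settings. Attribution methods cannot reliably diagnose execution-level failures in agentic trajectories, whereas trajectory-level explainability better localizes behaviour breakdowns.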

Abstract

Over the last decade, Explainable AI has primarily focused on interpreting individual model predictions, producing post-hoc explanations that relate inputs to outputs under a fixed decision structure. Recent advances in large language models (LLMs) have enabled agentic AI systems whose behaviour unfolds over multi-step trajectories. In these settings, success and failure are determined by sequences of decisions rather than a single output. It remains unclear how explanation approaches designed for static predictions translate to agentic settings where behaviour emerges over time. In this work, we bridge this gap by comparing attribution-based explanations with trace-based diagnostics across both settings. Our results show that while attribution methods achieve stable feature rankings in static settings (Spearman ρ = 0.86), they cannot be applied reliably to diagnose execution-level failures in agentic trajectories. In contrast, trace-grounded rubric evaluation for agentic settings consistently localizes behaviour breakdowns and reveals that state tracking inconsistency is 2.7x more prevalent in failed runs and reduces success probability by 49%. These findings motivate a shift towards trajectory-level explainability for evaluating and diagnosing autonomous AI behaviour in agentic systems. Code: https://github.com/VectorInstitute/unified-xai-evaluation-framework Project page: https://vectorinstitute.github.io/unified-xai-evaluation-framework
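The static-setting stability figure (Spearman ρ = 0.86) is a rank correlation between two feature rankings, for example attributions from two runs or two explainers. Without ties, it has the closed form ρ = 1 - 6Σd²/(n(n² - 1)), where d is each feature's rank difference and n is the number of features. The Python sketch below computes ρ from raw attribution scores. The score vectors are invented for illustration and are not data from the paper.

```python
def ranks(scores):
    """Rank features by attribution score (1 = most important); assumes no ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    r = [0] * len(scores)
    for position, i in enumerate(order, start=1):
        r[i] = position
    return r


def spearman_rho(a, b):
    """Spearman rank correlation via the no-ties closed form."""
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ranks(a), ranks(b)))
    return 1 - 6 * d2 / (n * (n * n - 1))


run_1 = [0.42, 0.31, 0.12, 0.08, 0.05]  # illustrative attribution scores, run 1
run_2 = [0.40, 0.10, 0.33, 0.07, 0.06]  # run 2: second and third features swap ranks
print(spearman_rho(run_1, run_2))  # 0.9
```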

2603.04828 2026-06-02 cs.CL

From Unfamiliar to Familiar: Detecting Pre-training Data via Gradient Deviations in Large Language Models

Ruiqi Zhang, Lingxiang Wang, Hainan Zhang, Zhiming Zheng, Yanyan Lan

Affiliations: Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, Beihang University; School of Artificial Intelligence, Beihang University; Institute for AI Industry Research (AIR), Tsinghua University

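AI summary: Proposes GDS, which distinguishes pre-training members from non-members by analyzing the gradient deviation scores of target samples (update magnitude, update location, and concentration of neuron activation). The result is efficient pre-training data detection that transfers across datasets.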

Comments 17 pages, 8 figures

Abstract

Pre-training data detection for LLMs is essential for addressing copyright concerns and mitigating benchmark contamination. Existing methods mainly focus on likelihood-based statistical features or on heuristic signals before and after fine-tuning, but the former are susceptible to word frequency bias in corpora, and the latter strongly depend on the similarity of fine-tuning data. From an optimization perspective, we observe that during training, samples transition from unfamiliar to familiar in a manner reflected by systematic differences in gradient behavior. Familiar samples exhibit smaller update magnitudes, distinct update locations in model components, and more sharply activated neurons. Based on this insight, we propose GDS, a method that identifies pre-training data by probing Gradient Deviation Scores of target samples. Specifically, we first represent each sample using gradient profiles that capture the magnitude, location, and concentration of parameter updates across FFN and Attention modules, revealing consistent distinctions between member and non-member data. These features are then fed into a lightweight classifier to perform binary membership inference. Experiments on five public datasets show that GDS achieves state-of-the-art performance with significantly improved cross-dataset transferability over strong baselines. Further interpretability analyses reveal differences in gradient distributions, and the semi-supervised results offer a practical way to detect pre-training data.
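A gradient profile along the lines the abstract describes can be sketched as a small feature vector per sample. The concrete definitions below are assumptions, not GDS's exact features:

- magnitude: the overall L2 norm of the gradients;
- location: the attention share of squared-gradient mass;
- concentration: the fraction of squared-gradient mass in the top-k entries.

The module-naming convention is also an assumption. In practice, the gradients would come from backpropagating the target sample's loss, and the resulting vectors would be fed to a lightweight membership classifier.

```python
import math


def gradient_profile(grads, top_k=2):
    """Summarize one sample's gradients as magnitude, location and concentration features.

    grads: {module_name: [gradient entries]}; attention modules are assumed to
    contain 'attn' in their name and FFN modules 'ffn' (a hypothetical convention).
    """
    sq = {name: sum(g * g for g in vals) for name, vals in grads.items()}
    total_sq = sum(sq.values())
    if total_sq == 0:
        return {"magnitude": 0.0, "attn_share": 0.0, "concentration": 0.0}
    entries = sorted((g * g for vals in grads.values() for g in vals), reverse=True)
    return {
        # Magnitude: familiar (member) samples are expected to need smaller updates.
        "magnitude": math.sqrt(total_sq),
        # Location: how the update mass splits between attention and FFN modules.
        "attn_share": sum(v for k, v in sq.items() if "attn" in k) / total_sq,
        # Concentration: fraction of squared-gradient mass in the top-k entries.
        "concentration": sum(entries[:top_k]) / total_sq,
    }


print(gradient_profile({"layer0.attn": [3.0, 0.0], "layer0.ffn": [4.0]}))
# {'magnitude': 5.0, 'attn_share': 0.36, 'concentration': 1.0}
```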

2603.04430 2026-06-02 cs.LG

Flowers: A Warp Drive for Neural PDE Solvers

Till Muser, Alexandra Spitzer, Matti Lassas, Maarten V. de Hoop, Ivan Dokmanić

Affiliations: ETH Zurich; University of Helsinki; University of California, Berkeley

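AI summary: Proposes the Flowers architecture, which uses multihead warp fields to achieve adaptive global interactions at linear cost. It outperforms Fourier, convolution, and attention baselines on 2D/3D time-dependent PDE benchmarks.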

Abstract

We introduce Flowers, a neural architecture for learning PDE solution operators built entirely from multihead warps. Aside from pointwise channel mixing and a multiscale scaffold, Flowers use no Fourier multipliers, no dot-product attention, and no convolutional mixing. Each head predicts a displacement field and warps the mixed input features. Motivated by physics and computational efficiency, displacements are predicted pointwise, without any spatial aggregation, and nonlocality enters only through sparse sampling at source coordinates, one per head. Stacking warps in multiscale residual blocks yields Flowers, which implement adaptive, global interactions at linear cost. We theoretically motivate this design through three complementary lenses: flow maps for conservation laws, waves in inhomogeneous media, and a kinetic-theoretic continuum limit. Flowers achieve excellent performance on a broad suite of 2D and 3D time-dependent PDE benchmarks, particularly flows and waves. A compact 17M-parameter model consistently outperforms Fourier, convolution, and attention-based baselines of similar size, while a 150M-parameter variant improves over recent transformer-based foundation models with many more parameters, data, and training compute.
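A single warp head predicts a displacement at every grid point from that point's features alone. It then samples the features at the displaced coordinate, so each head costs time linear in the number of grid points. The 1D Python sketch below illustrates this under three assumptions: a hypothetical linear displacement predictor (`weight`, `bias`), linear interpolation with clamping at the boundary, and plain averaging across heads. Flowers' channel mixing and multiscale scaffold are not reproduced.

```python
def sample_linear(features, x):
    """Linearly interpolate a 1D feature array at a real coordinate, clamped to the grid."""
    n = len(features)
    x = min(max(x, 0.0), n - 1.0)
    i0 = int(x)
    i1 = min(i0 + 1, n - 1)
    frac = x - i0
    return (1 - frac) * features[i0] + frac * features[i1]


def warp_head(features, weight, bias):
    """One head: pointwise displacement d_i = weight * f_i + bias, then sample at i + d_i."""
    out = []
    for i, f in enumerate(features):
        d = weight * f + bias  # no spatial aggregation: depends on f_i only
        out.append(sample_linear(features, i + d))  # one sparse read per point
    return out


def multihead_warp(features, heads):
    """Average several heads; a real model would mix channels instead of averaging."""
    outs = [warp_head(features, w, b) for w, b in heads]
    return [sum(vals) / len(vals) for vals in zip(*outs)]


print(warp_head([0.0, 1.0, 4.0, 9.0], weight=0.0, bias=0.5))  # [0.5, 2.5, 6.5, 9.0]
```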

2602.23694 2026-06-02 cs.RO cs.AI

Interpretable Multimodal Gesture Recognition for Drone and Mobile Robot Teleoperation via Log-Likelihood Ratio Fusion

Seungyeol Baek, Jaspreet Singh, Lala Shakti Swarup Ray, Hymalai Bello, Paul Lukowicz, Sungho Suh

Affiliations: Department of Artificial Intelligence, Korea University; Department of Computer Science, RPTU Kaiserslautern-Landau; Embedded Intelligence, German Research Center for Artificial Intelligence (DFKI)

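AI summary: Proposes a multimodal gesture recognition framework that fuses inertial data from wrist-worn Apple Watches with capacitive sensing signals from custom gloves. A log-likelihood-ratio late-fusion strategy improves performance and provides interpretability, matching a vision-based baseline at lower computational cost.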

Abstract

Human operators are still frequently exposed to hazardous environments such as disaster zones and industrial facilities, where intuitive and reliable teleoperation of mobile robots and Unmanned Aerial Vehicles (UAVs) is essential. In this context, hands-free teleoperation enhances operator mobility and situational awareness, thereby improving safety in hazardous environments. While vision-based gesture recognition has been explored as one method for hands-free teleoperation, its performance often deteriorates under occlusions, lighting variations, and cluttered backgrounds, limiting its applicability in real-world operations. To overcome these limitations, we propose a multimodal gesture recognition framework that integrates inertial data (accelerometer, gyroscope, and orientation) from Apple Watches on both wrists with capacitive sensing signals from custom gloves. We design a late fusion strategy based on the log-likelihood ratio (LLR), which not only enhances recognition performance but also provides interpretability by quantifying modality-specific contributions. To support this research, we introduce a new dataset of 20 distinct gestures inspired by aircraft marshalling signals, comprising synchronized RGB video, IMU, and capacitive sensor data. Experimental results demonstrate that our framework achieves performance comparable to a state-of-the-art vision-based baseline while significantly reducing computational cost, model size, and training time, making it well suited for real-time robot control. We therefore underscore the potential of sensor-based multimodal fusion as a robust and interpretable solution for gesture-driven mobile robot and drone teleoperation.
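LLR late fusion can be sketched as summing, per class, each modality's log-likelihood ratio. This also exposes each modality's additive contribution to the decision. The Python sketch rests on three assumptions:

- each modality's classifier outputs calibrated posteriors;
- class priors are uniform, so the LLR equals the posterior log-odds plus log(K - 1) for K classes;
- modalities are conditionally independent given the class.

The modality names and probabilities are hypothetical, and the paper's LLR estimator may differ.

```python
import math


def llr_scores(posteriors, eps=1e-12):
    """Per-class log-likelihood ratio log p(x|c) / p(x|not c) from calibrated posteriors.

    With uniform priors over K classes the prior odds are 1/(K-1), so
    LLR(c) = log(p_c / (1 - p_c)) + log(K - 1).
    """
    k = len(posteriors)
    return [math.log(max(p, eps) / max(1 - p, eps)) + math.log(k - 1) for p in posteriors]


def fuse(modality_posteriors):
    """Late fusion: sum per-modality LLRs per class; return prediction and contributions."""
    per_modality = {m: llr_scores(p) for m, p in modality_posteriors.items()}
    num_classes = len(next(iter(per_modality.values())))
    fused = [sum(s[c] for s in per_modality.values()) for c in range(num_classes)]
    pred = max(range(num_classes), key=lambda c: fused[c])
    # Each modality's additive share of the winning class's score is its contribution.
    contributions = {m: s[pred] for m, s in per_modality.items()}
    return pred, fused, contributions


pred, fused, contrib = fuse({
    "imu": [0.7, 0.2, 0.1],    # hypothetical wrist-IMU classifier output
    "glove": [0.4, 0.5, 0.1],  # hypothetical capacitive-glove classifier output
})
print(pred)  # 0
```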

2602.22101 2026-06-02 cs.LG cs.AI

On Imbalanced Regression with Hoeffding Trees

Pantia-Marina Alchirch, Dimitrios I. Diochnos

Affiliations: University of Oklahoma

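AI summary: Addresses imbalanced regression on data streams by extending kernel density estimation to the streaming setting and integrating hierarchical shrinkage into incremental decision trees. Experiments show that KDE consistently improves early-stream performance.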

Comments 17 pages, 5 figures, 3 tables, 2 algorithms, authors' version of paper accepted in PAKDD 2026 special session on Data Science: Foundations and Applications (DSFA)

Abstract

Many real-world applications generate continuous data streams for regression. Hoeffding trees and their variants have a long-standing tradition due to their effectiveness, either alone or as base models in broader ensembles. Recent batch-learning work shows that kernel density estimation (KDE) improves smoothed predictions in imbalanced regression [Yang et al., 2021], while hierarchical shrinkage (HS) provides post-hoc regularization for decision trees without modifying their structure [Agarwal et al., 2022]. We extend KDE to streaming settings via a telescoping formulation and integrate HS into incremental decision trees. Empirical evaluation on standard online regression benchmarks shows that KDE consistently improves early-stream performance, whereas HS provides limited gains. Our implementation is publicly available at: https://github.com/marinaAlchirch/DSFA_2026.
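Agarwal et al. (2022) define hierarchical shrinkage for batch trees. A leaf prediction is written as a telescoping sum of mean differences along the root-to-leaf path. Each difference is divided by 1 + λ/N(parent), where N(parent) is the number of training samples in the parent node. The Python sketch below applies that formula to one path. The `(mean, count)` path representation is an assumption, and the sketch does not show how the authors maintain these statistics incrementally inside Hoeffding trees.

```python
def hs_predict(path, lam):
    """Hierarchical-shrinkage prediction along one root-to-leaf path.

    path: [(mean_y, n_samples), ...] ordered from root to leaf.
    lam:  shrinkage strength lambda >= 0 (lam = 0 recovers the plain leaf mean).
    """
    pred = path[0][0]  # start from the root mean
    for (parent_mean, parent_n), (child_mean, _) in zip(path, path[1:]):
        # Each telescoping step is shrunk more when the parent holds few samples.
        pred += (child_mean - parent_mean) / (1 + lam / parent_n)
    return pred


path = [(10.0, 1000), (14.0, 100), (30.0, 4)]  # root -> internal node -> sparse leaf
print(hs_predict(path, 0.0))    # 30.0
print(hs_predict(path, 100.0))  # pulled toward the root mean
```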

2512.00470 2026-06-02 cs.RO

LAP: Fast LAtent Diffusion Planner for Autonomous Driving

Jinhao Zhang, Wenlong Xia, Zhexuan Zhou, Haoming Song, Youmin Gong, Jie Mei

Affiliations: University of Science and Technology of China

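AI summary: Proposes LAP, a framework that plans in a VAE-learned latent space and produces high-quality plans with a single denoising step. It achieves state-of-the-art closed-loop performance among learning-based planners on the nuPlan benchmark, with up to 10x faster inference.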

Abstract

Diffusion models have demonstrated strong capabilities for modeling human-like driving behaviors in autonomous driving, but their iterative sampling process induces substantial latency, and operating directly on raw trajectory points forces the model to spend capacity on low-level kinematics, rather than high-level multi-modal semantics. To address these limitations, we propose LAtent Planner (LAP), a framework that plans in a VAE-learned latent space that disentangles high-level intents from low-level kinematics, enabling our planner to capture rich, multi-modal driving strategies. To bridge the representational gap between the high-level semantic planning space and the vectorized scene context, we introduce an intermediate feature alignment mechanism that facilitates robust information fusion. Notably, LAP can produce high-quality plans in one single denoising step, substantially reducing computational overhead. Through extensive evaluations on the large-scale nuPlan benchmark, LAP achieves state-of-the-art closed-loop performance among learning-based planning methods, while achieving an inference speed-up of up to 10x over previous SOTA approaches.

2503.11832 2026-06-02 cs.AI cs.LG

Safety Mirage: How Spurious Correlations Undermine VLM Safety Fine-Tuning and Can Be Mitigated by Machine Unlearning

Yiwei Chen, Yuguang Yao, Yihua Zhang, Bingquan Shen, Gaowen Liu, Sijia Liu

Affiliations: Michigan State University; National University of Singapore; Cisco Research

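AI summary: Identifies a "safety mirage" in the safety fine-tuning of vision language models (VLMs), in which spurious correlations cause vulnerability. Proposes machine unlearning as an alternative that substantially reduces attack success rates and unnecessary refusals.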

Comments Accepted to ICLR 2026

Abstract

Recent vision language models (VLMs) have made remarkable strides in generative modeling with multimodal inputs, particularly text and images. However, their susceptibility to generating harmful content when exposed to unsafe queries raises critical safety concerns. While current alignment strategies primarily rely on supervised safety fine-tuning with curated datasets, we identify a fundamental limitation we call the "safety mirage", where supervised fine-tuning inadvertently reinforces spurious correlations between superficial textual patterns and safety responses, rather than fostering deep, intrinsic mitigation of harm. We show that these spurious correlations leave fine-tuned VLMs vulnerable even to a simple one-word modification-based attack, where substituting a single word in text queries with a spurious correlation-inducing alternative can effectively bypass safeguards. Additionally, these correlations contribute to over-prudence, causing fine-tuned VLMs to refuse benign queries unnecessarily. To address these issues, we show that machine unlearning (MU) is a powerful alternative to supervised safety fine-tuning, as it avoids biased feature-label mappings and directly removes harmful knowledge from VLMs while preserving their general capabilities. Extensive evaluations across safety benchmarks show that MU-based alignment reduces the attack success rate by up to 60.27% and cuts unnecessary rejections by over 84.20%. WARNING: There exist AI generations that may be offensive in nature.

2603.03741 2026-06-02 cs.RO cs.AI

HALO: Learning Human-Robot Collaboration via Heterogeneous-Agent Lyapunov Policy Optimization

Hao Zhang, Yaru Niu, Yikai Wang, Ding Zhao, H. Eric Tseng

Affiliations: University of California, Berkeley; Stanford University

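AI summary: To address the generalization and robustness problems that diverse human behaviors and changing environments pose for human-robot collaboration, the paper proposes the Heterogeneous-Agent Lyapunov Policy Optimization (HALO) framework, which stabilizes decentralized multi-agent reinforcement learning through Lyapunov contraction and rectifies gradients with optimal quadratic projections, achieving monotonic contraction of the rationality gap and improving collaborative performance.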

Comments https://HaoZhang-THU.github.io/HALO/

English abstract

To improve generalization and resilience in human-robot collaboration (HRC), robots must contend with diverse combinations of human behaviors and contexts, motivating multi-agent reinforcement learning (MARL). However, inherent heterogeneity between robots and humans creates a rationality gap (RG), where decentralized policy updates deviate from cooperative joint optimization. The resulting learning problem is a general-sum differentiable game, so independent policy-gradient updates can oscillate or diverge without added structure. We propose heterogeneous-agent Lyapunov policy optimization (HALO), a framework that stabilizes decentralized MARL by enforcing Lyapunov-based contraction in policy-parameter space. Unlike Lyapunov-based safe RL, which targets state/trajectory constraints in constrained Markov decision processes, HALO uses Lyapunov certification to stabilize decentralized policy learning. HALO rectifies decentralized gradients via optimal quadratic projections, ensuring monotonic contraction of RG and enabling effective exploration of open-ended interaction spaces. Extensive simulations and real-world humanoid-robot experiments show that this certified stability improves generalization and robustness in collaborative corner cases. Our project website is available at https://HaoZhang-THU.github.io/HALO/.
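
HALO's gradient rectification is described as an optimal quadratic projection. As a minimal sketch, assume a single linearized Lyapunov decrease condition `grad_v @ d <= -margin` on the update direction `d`; the update closest to the raw decentralized gradient `g` is then a closed-form half-space projection. `grad_v`, `margin`, and `project_update` are hypothetical names, and the actual method may use a different Lyapunov function and several coupled constraints.

```python
import numpy as np


def project_update(g, grad_v, margin=0.0):
    """Closest update to g (Euclidean norm) satisfying grad_v . d <= -margin.

    Solves min_d ||d - g||^2 s.t. a . d <= b with a = grad_v, b = -margin
    (an assumed single linear constraint), i.e. a half-space projection.
    """
    g = np.asarray(g, dtype=float)
    a = np.asarray(grad_v, dtype=float)
    b = -margin
    violation = a @ g - b
    if violation <= 0.0:
        return g  # constraint already satisfied: keep the raw update
    if not a.any():
        raise ValueError("constraint cannot be satisfied with a zero grad_v")
    return g - (violation / (a @ a)) * a  # remove just enough along a
```

Updates that already decrease the assumed Lyapunov function pass through unchanged; only violating updates are bent, and only by the smallest Euclidean correction.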

2603.03202 2026-06-02 cs.CL

Code2Math: Can Your Code Agent Effectively Evolve Math Problems Through Exploration?

Dadi Guo, Yuejin Xie, Qingyu Liu, Weixian Huang, Jiayu Liu, Zhiyuan Fan, Qihan Ren, Shuai Shao, Tianyi Zhou, Jianjie Feng, Wenze Su, Yujiu Yang, Dongrui Liu, Yi R. Fung

Affiliations: Hong Kong University of Science and Technology; Tsinghua University; Zhejiang University; Nanjing Tech University; Shanghai Jiao Tong University; University of Michigan; Independent Researcher

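AI summary: This paper proposes a multi-agent framework in which code agents, through exploration, autonomously evolve existing math problems into more complex and more difficult variants, and verifies the solvability and increased difficulty of those variants.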

Comments 38 pages

English abstract

As large language models (LLMs) advance their mathematical capabilities toward the IMO and research level, the scarcity of challenging, high-quality problems has become a significant bottleneck for training, evaluation, and self-evolution of LLMs. Simultaneously, recent code agents have demonstrated sophisticated skills in agentic coding and reasoning, suggesting that code execution can serve as a scalable environment for mathematical experimentation. In this paper, we investigate the potential of code agents to autonomously evolve existing math problems into more complex variations. We introduce a multi-agent framework designed to perform problem evolution while validating the solvability and increased difficulty of the generated problems. Our experiments demonstrate that, given sufficient test-time exploration, code agents can synthesize new, solvable problems that are structurally distinct from and more challenging than the originals. This work provides empirical evidence that code-driven agents can serve as a viable mechanism for synthesizing high-difficulty mathematical reasoning problems within scalable computational environments. Code and data are available at https://github.com/TarferSoul/Code2Math.

2603.03291 2026-06-02 cs.CL cs.AI

One Bias After Another: Mechanistic Reward Shaping and Persistent Biases in Language Reward Models

Daniel Fein, Max Lamparth, Violet Xiang, Mykel J. Kochenderfer, Nick Haber

Affiliations: Stanford University; University of California, Berkeley

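AI summary: By systematically measuring biases in five high-quality reward models, this paper finds persistent problems such as length, sycophancy, and overconfidence, and proposes a simple post-hoc intervention (mechanistic reward shaping) to mitigate low-complexity biases.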

Comments ICML 2026 Camera-ready

English abstract

Reward Models (RMs) are crucial for online alignment of language models (LMs) with human preferences. However, RM-based preference-tuning is vulnerable to reward hacking, whereby LM policies learn undesirable behaviors from flawed RMs. By systematically measuring biases in five high-quality RMs, including the state-of-the-art, we find that, despite prior work, issues with length, sycophancy, and overconfidence persist. We also discover new biases toward model-specific "styles" and answer order. We categorize RM failures as tractable or resistant to linear intervention and propose a simple post-hoc intervention to mitigate low-complexity biases that arise from spurious correlations. Our proposed mechanistic reward shaping reduces targeted biases without degrading reward quality while using minimal labeled data. The method extends to new biases, operates on model internals, and generalizes out of distribution.
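
The abstract frames mechanistic reward shaping as a linear, post-hoc intervention on model internals. One simple linear intervention, assumed here for illustration and not confirmed as the paper's exact procedure, estimates a bias direction as the difference in mean hidden states between biased and unbiased examples (for instance, long vs. short responses) and projects that direction out before a linear reward head. `bias_direction`, `shaped_reward`, and `reward_head_w` are hypothetical.

```python
import numpy as np


def bias_direction(h_biased, h_unbiased):
    """Unit difference-in-means direction between two sets of hidden states (rows)."""
    d = h_biased.mean(axis=0) - h_unbiased.mean(axis=0)
    return d / np.linalg.norm(d)


def shaped_reward(h, reward_head_w, direction):
    """Linear reward computed after removing each row's component along `direction`."""
    h = np.atleast_2d(h)
    h_clean = h - np.outer(h @ direction, direction)  # project out the bias axis
    return h_clean @ reward_head_w
```

Only a handful of contrastive examples are needed to estimate such a direction, which fits the abstract's emphasis on minimal labeled data, although the paper's estimator may differ.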

2510.07650 2026-06-02 cs.LG cs.AI

Value Flows

Perry Dong, Chongyi Zheng, Chelsea Finn, Dorsa Sadigh, Benjamin Eysenbach

Affiliations: Stanford University; Princeton University

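AI summary: This paper uses flow-based generative models to estimate the full distribution of future returns, satisfies the distributional Bellman equation through a new flow-matching objective, and uses a flow derivative ODE to estimate return uncertainty for prioritized learning, improving average success rates by 1.3× in offline and online settings.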

Comments ICLR 2026

English abstract

While most reinforcement learning methods today flatten the distribution of future returns to a single scalar value, distributional RL methods exploit the return distribution to provide stronger learning signals and to enable applications in exploration and safe RL. The predominant approaches model the return distribution as a categorical distribution over discrete bins or estimate a finite number of quantiles, but they leave open questions about the fine-grained structure of the return distribution and about how to distinguish states with high return uncertainty for decision-making. The key idea in this paper is to use modern, flexible flow-based models to estimate the full future return distributions and identify those states with high return variance. We do so by formulating a new flow-matching objective that generates probability density paths satisfying the distributional Bellman equation. Building upon the learned flow models, we estimate the return uncertainty of distinct states using a new flow derivative ODE. We additionally use this uncertainty information to prioritize learning more accurate return estimates on certain transitions. We compare our method (Value Flows) with prior methods in the offline and offline-to-online settings. Experiments on 37 state-based and 25 image-based benchmark tasks demonstrate that Value Flows achieves a 1.3× improvement on average in success rates. Website: https://pd-perry.github.io/value-flows Code: https://github.com/chongyi-zheng/value-flows
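
To make the two ingredients concrete, the sketch below pairs a sample-based distributional Bellman backup (z = r + gamma * z') with a generic conditional flow-matching loss on a linear noise-to-target path. This standard flow-matching recipe is assumed for illustration only; the paper's objective, which constructs probability density paths satisfying the distributional Bellman equation, and its flow derivative ODE for uncertainty are not reproduced. In practice `next_return_samples` would be drawn from a target flow model at (s', a'); all function names are hypothetical.

```python
import numpy as np


def bellman_target_samples(reward, gamma, next_return_samples, done=False):
    """Distributional Bellman backup applied to samples: z = r + gamma * z'."""
    return reward + gamma * (1.0 - float(done)) * np.asarray(next_return_samples)


def flow_matching_loss(velocity_fn, target_samples, rng):
    """Conditional flow-matching loss on a linear noise-to-target path.

    x_t = (1 - t) * x0 + t * x1 has ground-truth velocity x1 - x0, so the
    velocity network is regressed onto that difference.
    """
    x1 = np.asarray(target_samples, dtype=float)
    x0 = rng.standard_normal(x1.shape)  # noise endpoint of the path
    t = rng.uniform(size=x1.shape)      # random time in [0, 1)
    xt = (1.0 - t) * x0 + t * x1
    return float(np.mean((velocity_fn(xt, t) - (x1 - x0)) ** 2))
```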

2603.03031 2026-06-02 cs.LG

Step-Level Sparse Autoencoder for Reasoning Process Interpretation

Xuan Yang, Jiayu Liu, Yuhang Lai, Hao Xu, Zhenya Huang, Ning Miao

Affiliations: University of Science and Technology of China

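AI summary: Proposes the step-level sparse autoencoder (SSAE), which forms an information bottleneck through conditional sparsification and separates the incremental information in reasoning steps from background information as sparse features, in order to interpret the reasoning process of large language models.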

English abstract

Large Language Models (LLMs) have achieved strong complex reasoning capabilities through Chain-of-Thought (CoT) reasoning. However, their reasoning patterns remain too complicated to analyze. While Sparse Autoencoders (SAEs) have emerged as a powerful tool for interpretability, existing approaches predominantly operate at the token level, creating a granularity mismatch when capturing more critical step-level information, such as reasoning direction and semantic transitions. In this work, we propose the step-level sparse autoencoder (SSAE), an analytical tool that disentangles different aspects of LLMs' reasoning steps into sparse features. Specifically, by precisely controlling the sparsity of a step feature conditioned on its context, we form an information bottleneck in step reconstruction, which splits incremental information from background information and disentangles it into several sparsely activated dimensions. Experiments on multiple base models and reasoning tasks show the effectiveness of the extracted features. Using linear probes, we can readily predict surface-level information, such as generation length and first-token distribution, as well as more complicated properties, such as the correctness and logicality of the step. These observations suggest that LLMs already at least partly know these properties during generation, which provides a foundation for the self-verification ability of LLMs. Our code is available at https://github.com/Miaow-Lab/SSAE.
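
A toy forward pass can illustrate the context-conditioned bottleneck. The encoder sees both the step embedding and its context, only the top-k dictionary activations survive, and the decoder adds the sparse code to a separate context path, so the code only needs to carry information the context does not already provide. Top-k sparsity, the concatenated encoder input, the linear context path, and all names are assumptions for illustration, and the weights are untrained.

```python
import numpy as np


def topk_sparsify(z, k):
    """Keep the k largest activations in each row and zero out the rest."""
    out = np.zeros_like(z)
    idx = np.argsort(z, axis=-1)[..., -k:]
    np.put_along_axis(out, idx, np.take_along_axis(z, idx, axis=-1), axis=-1)
    return out


class StepSAE:
    """Toy context-conditioned sparse autoencoder over step embeddings."""

    def __init__(self, d_model, d_dict, k, seed=0):
        rng = np.random.default_rng(seed)
        self.W_enc = rng.normal(0.0, 0.1, (2 * d_model, d_dict))  # sees [step; context]
        self.W_dec = rng.normal(0.0, 0.1, (d_dict, d_model))
        self.W_ctx = rng.normal(0.0, 0.1, (d_model, d_model))     # background path
        self.k = k

    def forward(self, step, context):
        z = np.maximum(np.concatenate([step, context], axis=-1) @ self.W_enc, 0.0)
        code = topk_sparsify(z, self.k)  # few active dims carry the new information
        recon = context @ self.W_ctx + code @ self.W_dec
        return code, recon
```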

2601.09566 2026-06-02 cs.CV cs.AI

Hot-Start Chinese Language Modeling: Visual Glyphs Accelerate Sample-Efficient Learning

Shuyang Xiang, Hao Guan

Affiliations: Independent Researcher; Institute of Software, Chinese Academy of Sciences

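AI summary: By rendering Chinese characters as visual glyph images, this paper studies the inductive bias they provide for character-level language modeling, finding that visual inputs produce a pronounced hot-start effect but reach the same final accuracy as index-based methods.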

Comments 15 pages, 5 figures, submitted to ACL 2026

English abstract

In this work, we study whether rendering Chinese characters as visual glyph images, rather than as the discrete token IDs used by mainstream LLMs, provides an inductive bias for character-level language modeling. Our central finding is double-edged: visual inputs produce a pronounced hot-start effect, more than doubling early-stage accuracy within the first epoch, at 0.4% of total training steps (12.3% with visual inputs vs. 5.8% for the index-based baseline), yet both approaches converge to essentially identical final accuracy (39%). This pattern holds across resolutions as low as 8x8 pixels, partial cropping up to 50%, and model scales from 110M to 1.78B parameters. The mechanism we identify is that glyph rendering pre-encodes radical-based structure into embedding space before any training (cosine similarity 0.27 vs. 0.002 for random embeddings), enabling faster alignment but not higher final capacity. Our results clarify both the promise and the fundamental limitation of visual representations as inductive biases for Chinese language modeling.

2603.02650 2026-06-02 cs.LG cs.AI cs.RO

Improving Diffusion Planners by Self-Supervised Action Gating with Energies

Yuan Lu, Dongqi Han, Yansen Wang, Dongsheng Li

Affiliations: University of Science and Technology of China

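AI summary: Proposes SAGE, which uses a latent consistency signal to re-rank trajectories at inference time and penalize dynamically inconsistent plans, improving the performance and robustness of diffusion planners.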

English abstract

Diffusion planners are a strong approach for offline reinforcement learning, but they can fail when value-guided selection favours trajectories that score well yet are locally inconsistent with the environment dynamics, resulting in brittle execution. We propose Self-supervised Action Gating with Energies (SAGE), an inference-time re-ranking method that penalises dynamically inconsistent plans using a latent consistency signal. SAGE trains a Joint-Embedding Predictive Architecture (JEPA) encoder on offline state sequences and an action-conditioned latent predictor for short-horizon transitions. At test time, SAGE assigns each sampled candidate an energy given by its latent prediction error and combines this feasibility score with value estimates to select actions. SAGE can be integrated into existing diffusion planning pipelines that sample trajectories and select actions via value scoring; it requires no environment rollouts and no policy re-training. Across locomotion, navigation, and manipulation benchmarks, SAGE improves the performance and robustness of diffusion planners.
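
The test-time selection step can be sketched as follows. Each sampled candidate's energy is its mean squared latent prediction error (predictor outputs vs. encodings of the actual next states), and candidates are ranked by value minus `lam` times energy. The linear combination, the weight `lam`, and the function name are assumptions; the abstract only says the feasibility score is combined with value estimates.

```python
import numpy as np


def sage_select(values, predicted_latents, target_latents, lam=1.0):
    """Pick the candidate with the best value minus lam * energy.

    predicted_latents, target_latents: (n_candidates, horizon, latent_dim).
    Energy = mean squared latent prediction error of each candidate plan.
    """
    values = np.asarray(values, dtype=float)
    err = np.asarray(predicted_latents) - np.asarray(target_latents)
    energy = np.mean(err ** 2, axis=tuple(range(1, err.ndim)))
    scores = values - lam * energy  # high value, low dynamics inconsistency
    return int(np.argmax(scores)), energy
```

With `lam=0` this reduces to plain value-guided selection, which is the failure mode the abstract describes; increasing `lam` trades value for dynamical consistency.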

2603.02288 2026-06-02 cs.CV eess.IV

AutoFFS: Adversarial Deformations for Facial Feminization Surgery Planning

Paul Friedrich, Florentin Bieder, Florian M. Thieringer, Philippe C. Cattin

Affiliations: Department of Biomedical Engineering, University of Basel; Department of Oral and Cranio-Maxillofacial Surgery, University Hospital Basel

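AI summary: Proposes the AutoFFS framework, which generates counterfactual skull morphologies through adversarial free-form deformations to provide a quantitative basis for facial feminization surgery planning.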

Comments Project Page: https://pfriedri.github.io/autoffs-io Code: https://github.com/pfriedri/autoffs

English abstract

Facial feminization surgery (FFS) is a key component of gender affirmation for transgender and gender diverse patients, aiming to reshape craniofacial structures toward a female morphology. Current surgical planning procedures largely rely on subjective clinical assessment, lacking quantitative and reproducible anatomical guidance. We therefore propose AutoFFS, a novel data-driven framework that generates counterfactual skull morphologies through adversarial free-form deformations. Our method performs a deformation-based targeted adversarial attack on an ensemble of pre-trained binary sex classifiers that learned sexual dimorphism, effectively transforming individual skull shapes toward the target sex. The generated counterfactual skull morphologies provide a quantitative foundation for preoperative planning in FFS, driving advances in this largely overlooked patient group. We validate our approach through classifier-based evaluation, propose Morphological Fréchet Distance (MFD) and Morphological Kernel Distance (MKD) to evaluate distributional alignment of generated and real populations, and perform a human perceptual study, confirming that the generated morphologies exhibit target sex characteristics.
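
The abstract does not define MFD. Its name suggests the Fréchet distance between Gaussians fitted to two feature sets, as in FID: d^2 = ||mu_a - mu_b||^2 + Tr(S_a + S_b - 2 (S_a S_b)^(1/2)). Assuming that form, and assuming each skull is summarized by a morphological feature vector with at least two dimensions (a hypothetical choice), a NumPy-only sketch is shown below.

```python
import numpy as np


def _sqrt_psd(m):
    """Symmetric PSD matrix square root via eigendecomposition."""
    w, v = np.linalg.eigh(m)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T


def frechet_distance(feats_a, feats_b):
    """Squared Frechet distance between Gaussians fitted to two feature sets (rows)."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    s = _sqrt_psd(cov_a)
    # Tr((cov_a cov_b)^(1/2)) equals Tr((s cov_b s)^(1/2)); the latter is symmetric PSD.
    cross = np.sqrt(np.clip(np.linalg.eigvalsh(s @ cov_b @ s), 0.0, None)).sum()
    return float(np.sum((mu_a - mu_b) ** 2) + np.trace(cov_a) + np.trace(cov_b) - 2.0 * cross)
```

Identical populations give a distance of zero, and a pure translation of one population by a vector c adds ||c||^2.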

2603.02238 2026-06-02 cs.LG cs.FL cs.LO

Length Generalization Bounds for Transformers

Andy Yang, Pascal Bergsträßer, Georg Zetzsche, David Chiang, Anthony W. Lin

Affiliations: University of Cambridge

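AI summary: This paper proves that C-RASP (a class of languages closely linked to transformers) admits no computable length generalization bound, but provides a computable, exponential, and optimal bound for its positive fragment (which is equivalent to fixed-precision transformers).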

Comments 22 pages

English abstract

Length generalization is a key property of a learning algorithm that enables it to make correct predictions on inputs of any length, given finite training data. To provide such a guarantee, one needs to be able to compute a length generalization bound, beyond which the model is guaranteed to generalize. This paper concerns the open problem of the computability of such generalization bounds for C-RASP, a class of languages which is closely linked to transformers. A positive partial result was recently shown by Chen et al. for C-RASP with only one layer and, under some restrictions, also with two layers. We provide a complete answer to this open problem. Our main result is the non-existence of computable length generalization bounds for C-RASP (already with two layers) and hence for transformers. To complement this, we provide a computable bound for the positive fragment of C-RASP, which we show to be equivalent to fixed-precision transformers. For both positive C-RASP and fixed-precision transformers, we show that the length complexity is exponential, and prove the optimality of the bounds.

2603.02237 2026-06-02 cs.LG cs.AI

Concept Heterogeneity-aware Representation Steering

Laziz U. Abdullaev, Noelle Y. L. Wong, Ryan T. Z. Lee, Shiqi Jiang, Khoi N. M. Nguyen, Tan M. Nguyen

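AI summary: To address the brittleness of global steering caused by non-homogeneous LLM representations, proposes CHaRS, an input-dependent steering method based on optimal transport that uses Gaussian mixture models and discrete optimal transport to achieve more effective behavioral control.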

Journal ref: ICML 2026

English abstract

Representation steering offers a lightweight mechanism for controlling the behavior of large language models (LLMs) by intervening on internal activations at inference time. Most existing methods rely on a single global steering direction, typically obtained via difference-in-means over contrastive datasets. This approach implicitly assumes that the target concept is homogeneously represented across the embedding space. In practice, however, LLM representations can be highly non-homogeneous, exhibiting clustered, context-dependent structure, which renders global steering directions brittle. In this work, we view representation steering through the lens of optimal transport (OT), noting that standard difference-in-means steering implicitly corresponds to the OT map between two identical distributions with differing first moments, yielding a global translation. To relax this restrictive assumption, we theoretically model source and target representations as Gaussian mixture models and formulate steering as a discrete OT problem between semantic latent clusters. From the resulting transport plan, we derive an explicit, input-dependent steering map via barycentric projection, producing a smooth, kernel-weighted combination of cluster-level shifts. We term this method Concept Heterogeneity-aware Representation Steering (CHaRS). Across numerous experimental settings, we show that CHaRS yields more effective behavioral control than global steering.
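
A small sketch of the pipeline: treat source and target cluster means as a discrete OT problem, take the barycentric projection of the transport plan to get one shift per source cluster, and apply a kernel-weighted mix of those shifts to each input. Several simplifications are assumptions rather than the paper's choices: the cost uses only cluster means (ignoring covariances), entropic Sinkhorn iterations stand in for an exact OT solver, and the kernel weights are isotropic Gaussian responsibilities with a hypothetical `bandwidth`.

```python
import numpy as np


def sinkhorn(a, b, cost, eps=0.05, iters=500):
    """Entropic OT plan between histograms a and b (stand-in for an exact solver)."""
    K = np.exp(-cost / eps)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]


def cluster_shifts(mu_src, w_src, mu_tgt, w_tgt, eps=0.05):
    """Per-source-cluster shift from the barycentric projection of the OT plan."""
    cost = ((mu_src[:, None, :] - mu_tgt[None, :, :]) ** 2).sum(-1)
    plan = sinkhorn(w_src, w_tgt, cost / cost.max(), eps)
    bary = (plan @ mu_tgt) / plan.sum(axis=1, keepdims=True)
    return bary - mu_src


def steer(x, mu_src, shifts, bandwidth=1.0):
    """Kernel-weighted mix of cluster shifts (isotropic Gaussian responsibilities)."""
    logits = -((x[:, None, :] - mu_src[None]) ** 2).sum(-1) / (2 * bandwidth ** 2)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    resp = np.exp(logits)
    resp /= resp.sum(axis=1, keepdims=True)
    return x + resp @ shifts
```

With source means (0, 0) and (10, 0) and target means (0, 1) and (10, -1), the two clusters shift in opposite directions: the global difference-in-means vector is zero, while this map moves each cluster toward its own target.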

2509.15394 2026-06-02 cs.LG

VMDNet: Temporal Leakage-Free Variational Mode Decomposition for Electricity Demand Forecasting

Weibin Feng, Ran Tao, John Cartlidge, Jin Zheng

Affiliations: UKRI EPSRC Doctoral Training Partnership; UKRI EPSRC AI for Collective Intelligence (AI4CI)

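AI summary: Proposes the VMDNet framework, which avoids temporal leakage through sample-wise variational mode decomposition, models each mode with frequency-aware embeddings and parallel temporal convolutional networks, and introduces a Stackelberg-game bilevel optimization for hyperparameter selection, outperforming existing methods in electricity demand forecasting.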

Comments 5 pages, 1 figure, 2 tables. Version 3: Accepted author manuscript for the 34th European Signal Processing Conference (EUSIPCO 2026), Bruges, Belgium. Improved figures, additional details on TCN-based parallel decoding, and extended literature review. Code and data available: https://github.com/weibin-feng/VMDNet

English abstract

Accurate electricity demand forecasting is challenging due to the strong multi-periodicity of real-world demand series, which makes effective modeling of recurrent temporal patterns crucial. Decomposition techniques make such structure explicit and thereby improve predictive performance. Variational Mode Decomposition (VMD) is a powerful signal-processing method for periodicity-aware decomposition and has seen growing adoption in recent years. However, existing studies often suffer from information leakage and rely on inappropriate hyperparameter tuning. To address these issues, we propose VMDNet, a causality-preserving framework that (i) applies sample-wise VMD to avoid temporal leakage; (ii) represents each decomposed mode with frequency-aware embeddings and decodes it using parallel temporal convolutional networks (TCNs), ensuring mode independence and efficient learning; and (iii) introduces a Stackelberg-game-inspired bilevel scheme to guide the selection of VMD's two key hyperparameters. Experiments on three widely used electricity demand datasets show that VMDNet consistently outperforms state-of-the-art baselines.
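
The leakage issue concerns when the decomposition sees the data. Decomposing the whole series and then slicing windows lets each window's modes depend on later values; decomposing every lookback window on its own does not. The sketch uses a crude FFT band split as a stand-in for VMD (it is not VMD) purely to show the sample-wise pattern; `fft_band_split` and `samplewise_windows` are hypothetical helpers.

```python
import numpy as np


def fft_band_split(x, n_modes):
    """Crude stand-in for VMD: split a signal into n_modes contiguous FFT bands."""
    spec = np.fft.rfft(x)
    edges = np.linspace(0, len(spec), n_modes + 1).astype(int)
    modes = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = np.zeros_like(spec)
        band[lo:hi] = spec[lo:hi]
        modes.append(np.fft.irfft(band, n=len(x)))
    return np.stack(modes)  # shape (n_modes, len(x)); the modes sum back to x


def samplewise_windows(series, lookback, n_modes):
    """Decompose each lookback window ending just before forecast point t on its own."""
    return np.stack([
        fft_band_split(series[t - lookback:t], n_modes)
        for t in range(lookback, len(series))
    ])
```

Perturbing values after a window's end leaves that window's sample-wise modes unchanged, whereas the same slice of a whole-series decomposition changes.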

2603.01302 2026-06-02 cs.RO

Hybrid TD3: Overestimation Bias Analysis and Stable Policy Optimization for Hybrid Action Space

Thanh-Tuan Tran, Thanh Nguyen Canh, Nak Young Chong, Xiem HoangVan

Affiliations: Department of Computer Science, University of California, Berkeley

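AI summary: To address the challenges of reinforcement learning in discrete-continuous hybrid action spaces, proposes the Hybrid TD3 algorithm, which theoretically analyzes overestimation bias and introduces a weighted clipped Q-learning target to achieve stable policy optimization.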

English abstract

Reinforcement learning in discrete-continuous hybrid action spaces presents fundamental challenges for robotic manipulation, where high-level task decisions and low-level joint-space execution must be jointly optimized. Existing approaches either discretize continuous components or relax discrete choices into continuous approximations, which suffer from scalability limitations and training instability in high-dimensional action spaces and under domain randomization. In this paper, we propose Hybrid TD3, an extension of Twin Delayed Deep Deterministic Policy Gradient (TD3) that natively handles parameterized hybrid action spaces in a principled manner. We conduct a rigorous theoretical analysis of overestimation bias in hybrid action settings, deriving formal bounds under twin-critic architectures and establishing a complete bias ordering across five algorithmic variants under synchronized Gaussian error assumptions. Building on this analysis, we introduce a weighted clipped Q-learning target that marginalizes over the discrete action distribution, achieving equivalent bias reduction to standard clipped minimization while improving policy smoothness. Experimental results demonstrate that Hybrid TD3 achieves superior training stability and competitive performance against state-of-the-art hybrid action baselines.
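
One plausible reading of the weighted clipped target is sketched below: take the TD3 minimum of the twin critics for each discrete action (paired with the continuous parameters the actor proposes for it), then average those clipped values under the policy's discrete-action probabilities instead of taking a hard maximum. Whether the paper applies the minimum before or after marginalizing is an assumption here, as are the array layout and the function name.

```python
import numpy as np


def weighted_clipped_target(reward, done, gamma, probs, q1, q2):
    """TD target that marginalizes clipped double-Q values over discrete actions.

    probs: (batch, K) policy probabilities of each discrete action at s'.
    q1, q2: (batch, K) twin-critic values of each discrete action paired with
            the continuous parameters the actor proposes for it.
    """
    clipped = np.minimum(q1, q2)                # TD3-style pessimism per action
    expected = np.sum(probs * clipped, axis=1)  # marginalize over the discrete choice
    return reward + gamma * (1.0 - done) * expected
```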

2603.01097 2026-06-02 cs.LG

Understanding LoRA as Knowledge Memory: An Empirical Analysis

Seungju Back, Dongwoo Lee, Naun Kang, Taehee Lee, S. K. Hong, Youngjune Gwon, Sungjin Ahn

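AI summary: Through a systematic empirical study, this paper treats low-rank adaptation (LoRA) as a modular knowledge memory, exploring its storage capacity, internalization optimization, scaling to multi-module systems, and long-context reasoning ability, and provides practical guidance on the operational boundaries of LoRA memory.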

Comments ICML 2026

English abstract

Continuous knowledge updating for pre-trained large language models (LLMs) is increasingly necessary yet remains challenging. Although inference-time methods like In-Context Learning (ICL) and Retrieval-Augmented Generation (RAG) are popular, they face constraints in context budgets, costs, and retrieval fragmentation. Departing from these context-dependent paradigms, this work investigates a parametric approach using Low-Rank Adaptation (LoRA) as a modular knowledge memory. Although a few recent works examine this concept, the fundamental mechanics governing its capacity and composability remain largely unexplored. We bridge this gap through the first systematic empirical study mapping the design space of LoRA-based memory, ranging from characterizing storage capacity and optimizing internalization to scaling multi-module systems and evaluating long-context reasoning. Rather than proposing a single architecture, we provide practical guidance on the operational boundaries of LoRA memory. Overall, our findings position LoRA as a complementary axis of memory alongside RAG and ICL, offering distinct advantages. Code and datasets are available at https://github.com/ahn-ml/Understanding-LoRA-as-Knowledge-Memory.
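
For reference, the standard LoRA parameterization adds a scaled low-rank product to a frozen weight, W + (alpha / r) B A, with B initialized to zero so that a fresh module is a no-op. The sketch shows that forward pass and one simple way to combine several modules (summing their updates into a dense weight). Treating each module as a separate memory and composing by summation are illustrative assumptions, not the paper's prescribed design.

```python
import numpy as np


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, W, r, alpha, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = W.shape
        self.W = W                                  # frozen base weight
        self.A = rng.normal(0.0, 0.01, (r, d_in))   # down-projection
        self.B = np.zeros((d_out, r))               # zero init: starts as a no-op
        self.scale = alpha / r

    def __call__(self, x):
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T


def compose(modules, W):
    """Merge several LoRA 'memory' modules into one dense weight by summing updates."""
    return W + sum(m.scale * m.B @ m.A for m in modules)
```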

2506.14003 2026-06-02 cs.LG

Unlearning Isn't Invisible: Detecting Unlearning Traces in LLMs from Model Outputs

Yiwei Chen, Soumyadeep Pal, Yimeng Zhang, Qing Qu, Sijia Liu

Affiliations: Michigan State University; University of Michigan, Ann Arbor; IBM Research

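AI summary: This paper finds that large language models leave detectable "fingerprints" in their behavior and internal representations after machine unlearning, and uses a classifier on prediction logits or text outputs to identify unlearned models with over 90% accuracy.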

Comments Accepted to ICLR 2026

English abstract

Machine unlearning (MU) for large language models (LLMs), commonly referred to as LLM unlearning, seeks to remove specific undesirable data or knowledge from a trained model, while maintaining its performance on standard tasks. While unlearning plays a vital role in protecting data privacy, enforcing copyright, and mitigating sociotechnical harms in LLMs, we identify a new vulnerability post-unlearning: unlearning trace detection. We discover that unlearning leaves behind persistent "fingerprints" in LLMs, detectable traces in both model behavior and internal representations. These traces can be identified from output responses, even when prompted with forget-irrelevant inputs. Specifically, even a simple supervised classifier can determine whether a model has undergone unlearning, using only its prediction logits or even its textual outputs. Further analysis shows that these traces are embedded in intermediate activations and propagate nonlinearly to the final layer, forming low-dimensional, learnable manifolds in activation space. Through extensive experiments, we demonstrate that unlearning traces can be detected with over 90% accuracy even under forget-irrelevant inputs, and that larger LLMs exhibit stronger detectability. These findings reveal that unlearning leaves measurable signatures, introducing a new risk of reverse-engineering forgotten information when a model is identified as unlearned, given an input query.
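
The claim that a simple supervised classifier suffices can be illustrated with logistic regression on output-only features. Here the features are each prompt's sorted top-k softmax probabilities, an assumed fingerprint chosen for illustration. The paper's exact features, classifier, and data are not reproduced, and the function names are hypothetical.

```python
import numpy as np


def logit_features(logits, k=5):
    """Sorted top-k softmax probabilities per prompt: a simple output-only fingerprint."""
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return -np.sort(-p, axis=-1)[..., :k]


def train_logreg(X, y, lr=0.5, steps=2000):
    """Plain gradient-descent logistic regression (label 1 = unlearned model)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w


def predict(w, X):
    """Hard 0/1 predictions from the learned weights."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return (Xb @ w > 0).astype(int)
```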

2603.00963 2026-06-02 cs.LG cs.CL

Stabilizing Policy Optimization via Logits Convexity

Hongzhan Chen, Tao Yang, Yuhua Zhu, Shiping Gao, Xiaojun Quan, Ting Yao

Affiliations: National University of Singapore; University of Science and Technology of China; University of California, Berkeley

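AI summary: To address the instability of reinforcement learning training, the paper analyzes the stability gap between supervised fine-tuning and reinforcement learning from a gradient perspective and proposes the Logits Convex Optimization (LCO) framework, which stabilizes policy optimization by emulating logits-level convexity. Experiments show that the method improves training stability and outperforms conventional methods on multiple benchmarks.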

English abstract

While reinforcement learning (RL) has been central to the recent success of large language models (LLMs), RL optimization is notoriously unstable, especially when compared to supervised fine-tuning (SFT). In this work, we investigate the stability gap between SFT and RL from a gradient-based perspective, and show that the convexity of the SFT loss with respect to model logits plays a key role in enabling stable training. Our theoretical analysis demonstrates that this property induces favorable gradient directionality during optimization. In contrast, Proximal Policy Optimization (PPO), a widely adopted policy gradient algorithm utilizing a clipped surrogate objective, lacks this stabilizing property. Motivated by this observation, we propose Logits Convex Optimization (LCO), a simple yet effective policy optimization framework that aligns the learned policy with an optimal target derived from the original RL objective, thereby emulating the stabilizing effects of logits-level convexity. Extensive experiments across multiple model families show that our LCO framework consistently improves training stability and outperforms conventional RL methods on a broad range of benchmarks.
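
The convexity property the analysis builds on is a standard fact: cross-entropy, -log softmax(z)[a], is convex in the logits z because its Hessian is diag(p) - p p^T, which is positive semidefinite. PPO's probability ratio, by contrast, is proportional to softmax(z)[a], whose Hessian in z can be indefinite. The sketch checks both numerically; it illustrates the contrast the abstract describes, not the LCO algorithm itself.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def ce_hessian(z):
    """Hessian of -log softmax(z)[a] w.r.t. logits z: diag(p) - p p^T (same for any a)."""
    p = softmax(z)
    return np.diag(p) - np.outer(p, p)


def prob_hessian(z, a):
    """Hessian of softmax(z)[a] w.r.t. z; PPO's unclipped ratio is linear in this term."""
    p = softmax(z)
    g = np.eye(len(z))[a] - p  # gradient of log p_a with respect to z
    return p[a] * (np.outer(g, g) - (np.diag(p) - np.outer(p, p)))
```

At uniform logits over three actions, the Hessian of softmax(z)[0] has eigenvalues -1/9, 0, and 1/9, so neither the unclipped surrogate nor its negation is convex there.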

2603.00829 2026-06-02 cs.CL cs.AI cs.LG

Constitutional Black-Box Monitoring for Scheming in LLM Agents

Simon Storf, Rich Barton-Cooper, James Peters-Gill, Marius Hobbhahn

Affiliations: University of Cambridge

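AI summary: Studies constitutional black-box monitors that detect scheming in LLM agents by observing only external inputs and outputs, and shows that monitors optimized on synthetic data generalize to more realistic environments.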

Comments Accepted at ICML 2026. Camera-ready version

English abstract

Safe deployment of Large Language Model (LLM) agents in autonomous settings requires reliable oversight mechanisms. A central challenge is detecting scheming, where agents covertly pursue misaligned goals. One approach to mitigating such risks is LLM-based monitoring: using language models to examine agent behaviors for suspicious actions. We study constitutional black-box monitors: prompted classifiers that detect scheming using only externally observable inputs and outputs, optimized on synthetic data generated from natural-language behavior specifications. We introduce two pipelines for generating synthetic agent trajectories, STRIDE (iterative refinement) and Gloom (agent-environment simulation), from which we generate 1,000 samples each. We optimize frontier LLM monitors on these datasets via prompt sweeps, human refinement, and automated prompt optimization, and evaluate performance on 7,500 held-out trajectories from ControlArena, a suite of grounded environments where agents operate in more realistic contexts. Our results demonstrate that monitors selected purely on synthetic data can generalize to more realistic environments, capturing a meaningful scheming signal. However, we find that performance saturates quickly in our setting, with simple prompt sweeps matching the results of more extensive optimization. Pushing beyond this limit yields no further improvements and instead leads to overfitting.