2602.03784 2026-05-22 cs.CL

Fix the Structural Bottleneck: Context Compression via Explicit Information Transmission

Jiangnan Ye, Hanqi Yan, Zhenyi Shen, Heng Chang, Ye Mao, Yulan He

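AI summary: This paper revisits context compression from a structural perspective, identifies two key bottlenecks in standard LLM-based compressors, and proposes ComprExIT, a framework that compresses context through explicit information transmission. Experiments on 12 datasets show that it outperforms strong soft-compression baselines, improving F1 while reducing compression cost.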

Abstract

Long-context LLM agents often struggle with growing token, memory, and latency costs, making efficient context compression essential for practical deployment. Existing LLM-as-a-compressor methods remain noticeably inferior to using the full context. We find that this gap partly stems from their inability to preserve contextual information effectively. In this work, we revisit context compression from a structural perspective and identify two key bottlenecks in standard LLM-based compressors: limited coordination among compression tokens during information aggregation, and layerwise dilution that weakens useful signals from intermediate hidden states. To address these limitations, we propose ComprExIT, a new context compression framework based on explicit information transmission. ComprExIT adaptively selects features across frozen LLM layers, then allocates information from anchors to compression slots through a globally coordinated transport plan. Experiments on 12 datasets show that ComprExIT consistently outperforms strong soft-compression baselines, improving average F1 by up to 18.5%, while adding only ~1% trainable parameters and achieving more than 2x faster compression than the fastest baselines. The code will be released upon acceptance.
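
The abstract does not say how the "globally coordinated transport plan" is computed. As a rough illustration only, the sketch below assumes an entropy-regularized optimal transport problem solved with Sinkhorn iterations; that choice, the uniform marginals, and the names `sinkhorn_plan`, `aggregate`, `cost`, and `epsilon` are all assumptions made here, not details from the paper.

```python
import math


def sinkhorn_plan(cost, epsilon=0.5, n_iters=200):
    """Entropy-regularized transport plan between uniform marginals.

    cost[i][j] is a hypothetical cost of sending anchor i to slot j.
    Returns plan[i][j] >= 0 whose columns sum to 1/n_slots and whose
    rows sum (after convergence) to 1/n_anchors.
    """
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    row_mass = [1.0 / n] * n
    col_mass = [1.0 / m] * m
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the target marginals.
        u = [row_mass[i] / sum(kernel[i][j] * v[j] for j in range(m))
             for i in range(n)]
        v = [col_mass[j] / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def aggregate(anchors, plan):
    """Mix anchor feature vectors into slots, using the plan as weights."""
    n, m, d = len(anchors), len(plan[0]), len(anchors[0])
    slots = []
    for j in range(m):
        mass = sum(plan[i][j] for i in range(n))
        slots.append([sum(plan[i][j] * anchors[i][k] for i in range(n)) / mass
                      for k in range(d)])
    return slots
```

Because every anchor must ship all of its mass and every slot must receive a fixed share, no slot can be ignored and no anchor can be dropped, which is one plausible reading of "globally coordinated".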

2602.02709 2026-05-22 cs.AI

ATLAS: A Multi-LLM Training Framework for EvoDPO with Adaptive Reference Evolution

Ujin Jeon, Jiyong Kwon, Madison Ann Sullivan, Caleb Eunho Lee, Guang Lin

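AI summary: This paper proposes ATLAS, a multi-LLM agent framework that uses adaptive reference evolution to address the overly conservative updates or training stagnation caused by fixed reference models, combining supporter-driven exploration with EvoDPO-driven stability to improve long-horizon, evaluator-driven self-improvement.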

Abstract

Recent multi-LLM agent systems have shown promising capabilities for automated problem-solving, yet they predominantly rely on frozen agents or static fine-tuning pipelines. To address this limitation, our primary contribution is ATLAS (Adaptive Task-distributed Learning for Agentic Self-evolution), a multi-agent framework where specialized meta-agents collaboratively train and refine an active agent toward a domain-specific policy. A core challenge in iterative preference learning within these pipelines is the reliance on fixed reference models, which typically leads to overly conservative updates or training stagnation. To overcome this, the framework's algorithmic engine utilizes Evolving Direct Preference Optimization (EvoDPO). EvoDPO employs an inspection agent to perform adaptive, proxy-KL gated reference policy updates based on continuous training telemetry. We evaluate this full framework across a diverse set of challenging environments, including non-stationary contextual bandits, partial differential equations (PINNs), and combinatorial optimization tasks (TSP, Bin Packing). Through comparison against fixed-reference, adaptive-reference, and external automated-discovery baselines, our results suggest that ATLAS combines supporter-driven exploration with EvoDPO-driven stability to improve long-horizon evaluator-driven self-improvement.

2602.02112 2026-05-22 cs.LG cs.AI cs.CL

Unifying Masked Diffusion Models with Various Generation Orders and Beyond

Chunsan Hong, Sanghyun Lee, Jong Chul Ye

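AI summary: This paper proposes the order-expressive masked diffusion model (OeMDM) and the learnable-order masked diffusion model (LoMDM), which unify diffusion generative processes with different generation orders and jointly learn the generation order and the diffusion backbone through a single objective, improving text generation performance.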

Comments Accepted at ICML 2026

Abstract

Masked diffusion models (MDMs) are a potential alternative to autoregressive models (ARMs) for language generation, but generation quality depends critically on the generation order. Prior work either hard-codes an ordering (e.g., blockwise left-to-right) or learns an ordering policy for a pretrained MDM, which incurs extra cost and can yield suboptimal solutions due to the two-stage optimization. Motivated by this, we propose order-expressive masked diffusion model (OeMDM) for a broad class of diffusion generative processes with various generation orders, enabling the interpretation of MDM, ARM, and block diffusion in a single framework. Furthermore, building on OeMDM, we introduce learnable-order masked diffusion model (LoMDM), which jointly learns the generation ordering and diffusion backbone through a single objective from scratch, enabling the diffusion model to generate text in context-dependent ordering. Empirically, we confirm that LoMDM outperforms various discrete diffusion models across multiple language modeling benchmarks.

2602.01334 2026-05-22 cs.CV

What Does Vision Tool-Use Reinforcement Learning Really Learn? Disentangling Tool-Induced and Intrinsic Effects for Crop-and-Zoom

Yan Ma, Weiyu Zhang, Tianle Li, Linge Du, Xuyang Shen, Pengfei Liu

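AI summary: This paper studies what vision tool-use reinforcement learning learns in the crop-and-zoom setting. It introduces the MED framework to disentangle intrinsic capability changes from tool-induced effects and finds that improvements are driven mainly by intrinsic learning, while tool-use RL mostly reduces tool-induced harm rather than mastering the tool.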

Comments ICML 2026 camera ready. Code: https://github.com/GAIR-NLP/Med

Abstract

Vision tool-use reinforcement learning (RL) can equip vision language models with visual operators such as crop-and-zoom and achieves strong performance gains, yet it remains unclear whether these gains are driven by improvements in tool use or evolving intrinsic capabilities. We introduce MED (Measure-Explain-Diagnose), a coarse-to-fine framework that disentangles intrinsic capability changes from tool-induced effects, decomposes the tool-induced performance difference into gain and harm terms, and probes the mechanisms driving their evolution. Across checkpoint-level analyses in the crop-and-zoom setting on two VLMs with different tool priors and six benchmarks, we find that improvements are dominated by intrinsic learning, while tool-use RL mainly reduces tool-induced harm (e.g., fewer call-induced errors and weaker tool schema interference) and yields limited progress in tool-based correction of intrinsic failures. Overall, in the crop-and-zoom setting studied here, current vision tool-use RL learns to coexist safely with tools rather than master them.

2602.00688 2026-05-22 cs.LG

Provably Protecting Fine-Tuned LLMs from Training Data Extraction while Preserving Utility

Tom Segal, Asaf Shabtai, Yuval Elovici

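AI summary: This paper proposes SCP-Δ_r, a Near Access Freeness (NAF)-based algorithm that operates on relative probabilities and uses a base model to smooth low-impact tokens. It achieves better theoretical bounds than existing NAF-based methods and, in practice, defends against training data extraction attacks with minimal performance loss.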

Comments 21 pages, 5 figures

Abstract

Fine-tuning large language models (LLMs) on sensitive datasets raises privacy concerns, as training data extraction (TDE) attacks can expose highly confidential information. Existing defenses against such attacks either lack formal privacy guarantees or incur substantial utility degradation. We observe that fine-tuning induces widespread probability shifts, yet preserving only a small subset of influential token-level deviations is sufficient; the remaining shifts can be aggressively smoothed with minimal impact on utility. Motivated by this insight, we propose SCP-$Δ_r$, a Near Access Freeness (NAF)-based algorithm that operates on relative probabilities and explicitly smooths low-impact tokens using a base model. SCP-$Δ_r$ achieves orders-of-magnitude better theoretical bounds than existing NAF-based methods and provides strong empirical protection against TDE attacks with minimal performance loss.

2601.23224 2026-05-22 cs.CV

Video-o3: Native Interleaved Clue Seeking for Long Video Multi-Hop Reasoning

Xiangyu Zeng, Zhiqiu Zhang, Yuhan Zhu, Xinhao Li, Zikang Wang, Changlian Ma, Qingyu Zhang, Zizheng Huang, Kun Ouyang, Tianxiang Jiang, Ziang Yan, Yi Wang, Hongjie Zhang, Yali Wang, Limin Wang

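AI summary: This work proposes Video-o3, a framework that improves multi-hop reasoning over long videos through iterative discovery of salient visual clues, fine-grained inspection of key segments, and adaptive termination. Experiments report 72.1% accuracy on MLVU and 46.5% on Video-Holmes.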

Comments 27 pages, 15 figures, 15 tables

Abstract

Existing multimodal large language models for long-video understanding predominantly rely on uniform sampling and single-turn inference, limiting their ability to identify sparse yet critical evidence amid extensive redundancy. We introduce Video-o3, a novel framework that supports iterative discovery of salient visual clues, fine-grained inspection of key segments, and adaptive termination once sufficient evidence is acquired. Technically, we address two core challenges in interleaved tool invocation. First, to mitigate attention dispersion induced by the heterogeneity of reasoning and tool-calling, we propose Task-Decoupled Attention Masking, which isolates per-step concentration while preserving shared global context. Second, to control context length growth in multi-turn interactions, we introduce a Verifiable Trajectory-Guided Reward that balances exploration coverage with reasoning efficiency. To support training at scale, we further develop a data synthesis pipeline and construct Seeker-173K, comprising 173K high-quality tool-interaction trajectories for effective supervised and reinforcement learning. Extensive experiments show that Video-o3 substantially outperforms state-of-the-art methods, achieving 72.1% accuracy on MLVU and 46.5% on Video-Holmes. These results demonstrate Video-o3's strong multi-hop evidence-seeking and reasoning capabilities, and validate the effectiveness of native tool invocation in long-video scenarios.

2601.20205 2026-05-22 cs.LG

Hyperparameter Transfer with Mixture-of-Expert Layers

Tianze Jiang, Blake Bordelon, Cengiz Pehlevan, Boris Hanin

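AI summary: This paper proposes a new parameterization for transformer models with Mixture-of-Experts layers that supports hyperparameter transfer when scaling model width, depth, number of experts, and expert (hidden) size. The parameterization is justified by a dynamical mean-field theory analysis, and experiments show reliable hyperparameter transfer across model scales.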

Comments ICML 2026

Abstract

Mixture-of-Experts (MoE) layers have emerged as an important tool in scaling up modern neural networks by decoupling total trainable parameters from activated parameters in the forward pass for each token. However, sparse MoEs add complexity to training due to (i) new trainable parameters (router weights) that, like all other parameter groups, require hyperparameter (HP) tuning; (ii) new architecture scale dimensions (number of and size of experts) that must be chosen and potentially taken large. To make HP selection cheap and reliable, we propose a new parameterization for transformer models with MoE layers when scaling model width, depth, number of experts, and expert (hidden) size. Our parameterization is justified by a novel dynamical mean-field theory (DMFT) analysis. When varying different model dimensions trained at a fixed token budget, we find empirically that our parameterization enables reliable HP transfer across models from 51M to over 2B total parameters. We further take HPs identified from sweeping small models on a short token horizon to train larger models on longer horizons and report performant model behaviors.
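
The decoupling of total from activated parameters mentioned in the abstract can be made concrete with a little arithmetic. The sketch below counts parameters for a single MoE feed-forward layer under simplifying assumptions chosen here for illustration (two weight matrices per expert, no biases, a linear router, top-k routing); the function name and the example sizes are hypothetical and not taken from the paper.

```python
def moe_ffn_param_counts(d_model, d_expert, n_experts, top_k):
    """Return (total, active_per_token) parameter counts for one MoE layer.

    Assumes each expert is a two-matrix MLP (d_model -> d_expert -> d_model)
    without biases, and the router is a single d_model x n_experts matrix.
    """
    per_expert = 2 * d_model * d_expert
    router = d_model * n_experts
    total = n_experts * per_expert + router
    # Each token only passes through its top_k experts (plus the router).
    active = top_k * per_expert + router
    return total, active


total, active = moe_ffn_param_counts(d_model=1024, d_expert=4096,
                                     n_experts=8, top_k=2)
print(total, active)
```

Growing `n_experts` increases the total count while leaving the per-token count almost unchanged, which is why the number and size of experts become separate scaling dimensions.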

2601.20107 2026-05-22 cs.CV cs.CL cs.IR

Structural Anchor Pruning: Training-Free Multi-Vector Compression for Visual Document Retrieval

Zhuchenyang Liu, Ziyu Hu, Yao Zhang, Yu Xiao

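AI summary: This paper proposes Structural Anchor Pruning (SAP), a training-free multi-vector compression method built from three components: Score Retention, SR-guided window selection, and a visual in-degree centrality scorer. Without any per-model parameter tuning, it prunes more than 90% of visual tokens while retaining over 90% of NDCG@5.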

Comments methodology revision and new title

Abstract

Recent Vision-Language Models (e.g., ColPali) enable fine-grained Visual Document Retrieval (VDR) but incur prohibitive multi-vector index storage overhead. Existing training-free pruning methods either rely on heuristic layer choices or degrade sharply under aggressive compression, leading prior work to argue that effective high-compression pruning requires query-dependent training. We challenge this view with Structural Anchor Pruning (SAP), a self-calibrating, training-free, and query-agnostic index-time pruning framework with three components: (i) Score Retention (SR), a white-box per-layer compression diagnostic; (ii) SR-guided window selection, a procedure that automatically locates the structural pruning region for any backbone with no per-model hyperparameters; and (iii) a visual in-degree centrality scorer that identifies anchor patches within the selected window. On the ViDoRe v1/v2 benchmarks across three architectures spanning 18, 28, and 36 backbone layers, SAP retains over 90% of NDCG@5 while pruning more than 90% of visual tokens, without any per-model parameter tuning. Our layer-resolved SR analysis reveals an Alignment-Aggregation Divergence: the document's visual structure is preserved as a stable "Structural Plateau" within the backbone, but the final layers reshape this representation into a sparse, query-aligned form that is no longer suitable for pruning. This is the mechanistic reason SAP succeeds where final-layer methods fail.
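
To make the third component more tangible, here is a minimal sketch of in-degree centrality over attention maps. It assumes that a patch's in-degree is the total attention it receives from all other patches, averaged over the layers in the selected window; the abstract does not define the scorer, so this definition, along with `in_degree_scores`, `select_anchors`, and `keep_ratio`, is an illustrative assumption.

```python
def in_degree_scores(attn_layers):
    """Per-patch in-degree, averaged over layers.

    attn_layers[l][q][k] is the attention weight from patch q to patch k
    in layer l of the (hypothetical) selected window.
    """
    n_layers = len(attn_layers)
    n_patches = len(attn_layers[0])
    scores = [0.0] * n_patches
    for attn in attn_layers:
        for q in range(n_patches):
            for k in range(n_patches):
                # Column sums: how much attention patch k receives.
                scores[k] += attn[q][k] / n_layers
    return scores


def select_anchors(scores, keep_ratio=0.1):
    """Indices of the highest-scoring patches, keeping at least one."""
    keep = max(1, int(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])
```

Since the scores depend only on the document's own attention maps, the selection can run once at index time without seeing any query, matching the query-agnostic setting the abstract describes.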

2601.07603 2026-05-22 cs.CV

UIKA: Fast Universal Head Avatar from Pose-Free Images

Zijian Wu, Boyao Zhou, Liangxiao Hu, Hongyu Liu, Yuan Sun, Xuan Wang, Xun Cao, Yujun Shen, Hao Zhu

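AI summary: This paper proposes UIKA, which produces an animatable Gaussian head avatar from an arbitrary number of pose-free inputs, including a single image, multi-view captures, and smartphone-captured videos. Unlike traditional avatar methods, UIKA rethinks the task in terms of model representation, network design, and data preparation: it introduces a UV-guided avatar modeling strategy, designs learnable UV tokens, and decodes UV information aggregated from all input views into canonical Gaussian attributes.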

Comments CVPR 2026 Highlight. Code: https://github.com/ant-research/UIKA

Abstract

We present UIKA, a feed-forward animatable Gaussian head model from an arbitrary number of pose-free inputs, including a single image, multi-view captures, and smartphone-captured videos. Unlike the traditional avatar method, which requires a studio-level multi-view capture system and reconstructs a human-specific model through a long-time optimization process, we rethink the task through the lenses of model representation, network design, and data preparation. First, we introduce a UV-guided avatar modeling strategy, in which each input image is associated with a pixel-wise facial correspondence estimation. Such correspondence estimation allows us to reproject each valid pixel color from screen space to UV space, which is independent of camera pose and character expression. Furthermore, we design learnable UV tokens on which the attention mechanism can be applied at both the screen and UV levels. The learned UV tokens can be decoded into canonical Gaussian attributes using aggregated UV information from all input views. To train our large avatar model, we additionally prepare a large-scale, identity-rich synthetic training dataset. Our method significantly outperforms existing approaches in both monocular and multi-view settings.

2601.04537 2026-05-22 cs.LG cs.CL

Linear Dynamics in the RLVR Training of Large Language Models

Tianle Wang, Jiayu Liu, Zhongyuan Wu, Shenghao Jin, Wei Chen, Hao Xu, Ning Miao

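AI summary: This paper studies the internal training dynamics of reinforcement learning with verifiable rewards (RLVR) for large language models and finds that RLVR enters a linear regime across model families and training configurations. Controlled experiments and theoretical analysis attribute this linearity to the high-variance, noisy nature of the training signal, and the linear structure proves both predictive and practically useful.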

Comments Major revision: substantially reorganized the manuscript and added a theoretical explanation section. The replacement is intended for the same arXiv paper; the core topic and contribution remain the same

Abstract

Reinforcement learning with verifiable rewards (RLVR) has driven significant performance gains in reasoning-oriented large language models (LLMs), yet its internal training dynamics remain largely a black box. In this work, we perform a comprehensive trajectory-level analysis of RLVR and uncover a striking regularity: across various model families, RL algorithms, and training configurations, RLVR consistently enters a robust linear regime, where both parameter weights and output log-probabilities, measured rigorously via teacher-forced evaluation, evolve in a highly linear manner ($R^2 > 0.7$). Through controlled experiments and theoretical analysis, we demonstrate that this linearity is not a coincidence, but stems from the high-variance, noisy nature of RLVR training signals, which act as a low-pass filter to concentrate optimization along a stable, low-dimensional drift. Moreover, we show that this linear structure is not merely descriptive but powerfully predictive and actionable. Specifically, weight-space extrapolation matches the performance of standard RL optimization while achieving a 6.1x training speedup through periodic re-grounding. Meanwhile, output-space extrapolation serves as a lightweight intervention that effectively bypasses late-stage model collapse, consistently outperforming standard RL across mathematical and coding benchmarks, with an average performance improvement of 4.2%. Our code is available at https://github.com/Miaow-Lab/RLVR-Linearity.
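
The two quantities this abstract relies on, a linearity score and a linear extrapolation of weights, can be sketched with ordinary least squares. The code below is an illustration under assumptions made here: it fits each parameter independently against the training step and reuses the abstract's $R^2 > 0.7$ figure as a per-parameter gate, which the paper may not do. The names `fit_line`, `extrapolate_weights`, and `min_r2` are hypothetical.

```python
def fit_line(steps, values):
    """Least-squares line through (step, value) pairs.

    Returns (slope, intercept, r_squared).
    """
    n = len(steps)
    mean_t = sum(steps) / n
    mean_y = sum(values) / n
    var_t = sum((t - mean_t) ** 2 for t in steps)
    slope = sum((t - mean_t) * (y - mean_y)
                for t, y in zip(steps, values)) / var_t
    intercept = mean_y - slope * mean_t
    ss_res = sum((y - (intercept + slope * t)) ** 2
                 for t, y in zip(steps, values))
    ss_tot = sum((y - mean_y) ** 2 for y in values)
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else 1.0
    return slope, intercept, r_squared


def extrapolate_weights(steps, checkpoints, target_step, min_r2=0.7):
    """Predict weights at target_step from earlier checkpoints.

    checkpoints[i] is a flat list of weights saved at steps[i]. Parameters
    whose trajectory is not linear enough keep their latest value.
    """
    latest = checkpoints[-1]
    predicted = []
    for k in range(len(latest)):
        trajectory = [ckpt[k] for ckpt in checkpoints]
        slope, intercept, r2 = fit_line(steps, trajectory)
        if r2 >= min_r2:
            predicted.append(intercept + slope * target_step)
        else:
            predicted.append(latest[k])
    return predicted
```

Under this reading, the "periodic re-grounding" in the abstract would correspond to running a few real RL steps after each extrapolated jump and refitting the line, though the abstract does not spell out the procedure.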

2512.20538 2026-05-22 cs.CV

AlignPose: Generalizable 6D Pose Estimation via Multi-view Feature-metric Alignment

Anna Šárová Mikeštíková, Médéric Fourmy, Martin Cífka, Josef Sivic, Vladimir Petrik

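AI summary: This paper proposes AlignPose, a multi-view 6D object pose estimation method that requires no object-specific training or symmetry annotation. A multi-view feature-metric refinement optimizes a single, consistent world-frame pose, and experiments on six datasets show that it outperforms other methods, especially on industrial datasets.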

Comments CVPR 2026

Abstract

Single-view RGB model-based object pose estimation methods achieve strong generalization but are fundamentally limited by depth ambiguity, clutter, and occlusions. Multi-view pose estimation methods have the potential to solve these issues, but existing works rely on precise single-view pose estimates or lack generalization to unseen objects. We address these challenges via the following three contributions. First, we introduce AlignPose, a 6D object pose estimation method that aggregates information from multiple extrinsically calibrated RGB views and does not require any object-specific training or symmetry annotation. Second, the key component of this approach is a new multi-view feature-metric refinement specifically designed for object pose. It optimizes a single, consistent world-frame object pose by minimizing the feature discrepancy between on-the-fly rendered object features and observed image features across all views simultaneously. Third, we report extensive experiments on six datasets (YCB-V, T-LESS, HouseCat6D, ITODD-MV, IPD, XYZ-IBD) using the BOP benchmark evaluation and show that AlignPose outperforms other published methods, especially on challenging industrial datasets where multiple views are readily available in practice.

2511.20785 2026-05-22 cs.CV

LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling

Zuhao Yang, Sudong Wang, Kaichen Zhang, Keming Wu, Sicong Leng, Yifan Zhang, Bo Li, Chengwei Qin, Shijian Lu, Xingxuan Li, Lidong Bing

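AI summary: This paper proposes LongVT, an end-to-end agentic framework that enables "Thinking with Long Videos" via interleaved Multimodal Chain-of-Tool-Thought. It uses the LMM's inherent temporal grounding ability as a native video cropping tool to reduce hallucinations in long-video reasoning, and introduces the VideoSIAH data suite to support training and evaluation.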

Comments CVPR 2026

Abstract

Large multimodal models (LMMs) have shown great potential for video reasoning with textual Chain-of-Thought. However, they remain vulnerable to hallucinations, especially when processing long-form videos where evidence is sparse and temporally dispersed. Inspired by how humans comprehend long videos, first skimming globally and then examining relevant clips for details, we introduce LongVT, an end-to-end agentic framework that enables "Thinking with Long Videos" via interleaved Multimodal Chain-of-Tool-Thought. Specifically, we exploit LMMs' inherent temporal grounding ability as a native video cropping tool to zoom in on a specific video clip and resample finer-grained video frames. This global-to-local reasoning loop continues until answers are grounded in retrieved visual evidence. Given the scarcity of fine-grained question-answering (QA) data for the long video reasoning task, we curate and will release a data suite named VideoSIAH to facilitate both training and evaluation. Our training dataset consists of 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.6K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning. Our evaluation benchmark consists of 1,280 QA pairs that are carefully curated through a semi-automatic data pipeline with human-in-the-loop validation. With a meticulously designed three-stage training strategy and extensive empirical validation, LongVT consistently outperforms existing strong baselines across four challenging long-video understanding and reasoning benchmarks. Our code, data, and model checkpoints are publicly available at https://github.com/EvolvingLMMs-Lab/LongVT.

2511.14220 2026-05-22 cs.LG cs.AI

Twice Sequential Monte Carlo for Tree Search

Yaniv Oren, Joery A. de Vries, Pascal R. van der Vaart, Matthijs T. J. Spaan, Wendelin Böhmer

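AI summary: This paper proposes Twice Sequential Monte Carlo Tree Search (TSMCTS), which reduces estimator variance and mitigates path degeneracy. It outperforms the SMC baseline and a modern version of MCTS in discrete and continuous environments while scaling well with sequential compute.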

Abstract

Model-based reinforcement learning (RL) methods that leverage search are responsible for many milestone breakthroughs in RL. Sequential Monte Carlo (SMC) recently emerged as an alternative to the Monte Carlo Tree Search (MCTS) algorithm which drove these breakthroughs. SMC is easier to parallelize and more suitable to GPU acceleration. However, it also suffers from large variance and path degeneracy which prevent it from scaling well with increased search depth, i.e., increased sequential compute. To address these problems, we introduce Twice Sequential Monte Carlo Tree Search (TSMCTS). Across discrete and continuous environments TSMCTS outperforms the SMC baseline as well as a popular modern version of MCTS as a policy improvement operator, scales favorably with sequential compute, reduces estimator variance and mitigates the effects of path degeneracy while retaining the properties that make SMC natural to parallelize.
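
The two failure modes named in the abstract, weight variance and path degeneracy, are properties of plain SMC and can be seen in a toy example. The sketch below shows the standard effective-sample-size diagnostic and multinomial resampling, and counts how many first-step ancestors survive repeated resampling. It illustrates the baseline problem only, not TSMCTS itself, and the random weights are a stand-in for real value estimates.

```python
import random


def effective_sample_size(weights):
    """ESS = 1 / sum(w_i^2) for normalized weights; ranges from 1 to N."""
    total = sum(weights)
    normalized = [w / total for w in weights]
    return 1.0 / sum(w * w for w in normalized)


def resample(particles, weights, rng):
    """Multinomial resampling: draw N particles proportionally to weight."""
    return rng.choices(particles, weights=weights, k=len(particles))


def surviving_ancestors(n_particles, depth, rng):
    """Count distinct first-step ancestors left after `depth` resamplings.

    Each particle carries the index of its first-step ancestor; weights are
    drawn at random here purely for illustration.
    """
    ancestors = list(range(n_particles))
    for _ in range(depth):
        weights = [rng.random() + 1e-9 for _ in ancestors]
        ancestors = resample(ancestors, weights, rng)
    return len(set(ancestors))


rng = random.Random(0)
print(effective_sample_size([1.0, 1.0, 1.0, 1.0]))  # uniform weights
print(surviving_ancestors(16, 50, rng))             # typically far below 16
```

As depth grows, the particles tend to collapse onto a few root actions, so the estimate at the root rests on very few independent samples. This is the sense in which deeper search alone may not help plain SMC.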

2511.07820 2026-05-22 cs.RO cs.AI cs.CV cs.GR cs.SY eess.SY

SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control

Zhengyi Luo, Ye Yuan, Tingwu Wang, Chenran Li, Fernando Castañeda, Sirui Chen, Zi-Ang Cao, Jiefeng Li, David Minor, Qingwei Ben, Jinhyung Park, David Sami, Zi Wang, Xingye Da, Runyu Ding, Cyrus Hogg, Lina Song, Edy Lim, Eugene Jeong, Tairan He, Haoru Xue, Wenli Xiao, Simon Yuen, Jan Kautz, Yan Chang, Umar Iqbal, Linxi "Jim" Fan, Yuke Zhu

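AI summary: This paper scales up motion tracking along model capacity, data, and compute to obtain a generalist humanoid controller capable of natural, robust whole-body movements, and shows both the scalability of motion tracking and its usefulness for downstream tasks.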

Comments Project page: https://nvlabs.github.io/SONIC/

Abstract

Despite the rise of billion-parameter foundation models trained across thousands of GPUs, similar scaling gains have not been shown for humanoid control. Current neural controllers for humanoids remain modest in size, target a limited set of behaviors, and are trained on a handful of GPUs. We show that scaling model capacity, data, and compute yields a generalist humanoid controller capable of natural, robust whole-body movements. We position motion tracking as a scalable task for humanoid control, leveraging dense supervision from diverse motion-capture data to acquire human motion priors without manual reward engineering. We build a foundation model for motion tracking by scaling along three axes: network size (1.2M to 42M parameters), dataset volume (100M+ frames from 700 hours of motion capture), and compute (21k GPU hours). Beyond demonstrating the benefits of scale, we further show downstream utility through: (1) a real-time kinematic planner bridging motion tracking to tasks such as navigation, enabling natural and interactive control, and (2) a unified token space supporting VR teleoperation and vision-language-action (VLA) models with a single policy. Through this interface, we demonstrate autonomous VLA-driven whole-body loco-manipulation requiring coordinated hand and foot placement. Scaling motion tracking exhibits favorable properties: performance improves steadily with compute and data diversity, and learned policies generalize to unseen motions, establishing motion tracking at scale as a practical foundation for humanoid control.

2511.02014 2026-05-22 cs.CV

Towards Selection of Large Multimodal Models as Engines for Burned-in Protected Health Information Detection in Medical Images

Tuan Truong, Guillermo Jimenez Perez, Pedro Osorio, Matthias Lenga

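AI summary: This paper studies the use of large multimodal models (LMMs) for detecting burned-in protected health information in medical images. Comparing three prominent models under different pipeline configurations, it finds that LMMs outperform conventional OCR, although the gain does not consistently carry over to overall detection accuracy; the largest gains appear on test cases with complex imprint patterns. It also offers model selection recommendations for specific operational constraints and a deployment strategy.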

Comments Accepted at EMBC 2026

Abstract

The detection of Protected Health Information (PHI) in medical imaging is critical for safeguarding patient privacy and ensuring compliance with regulatory frameworks. Traditional detection methodologies predominantly utilize Optical Character Recognition (OCR) models in conjunction with named entity recognition. However, recent advancements in Large Multimodal Models (LMMs) present new opportunities for enhanced text extraction and semantic analysis. In this study, we systematically benchmark three prominent closed and open-source LMMs, namely GPT-4o, Gemini 2.5 Flash, and Qwen 2.5 7B, utilizing two distinct pipeline configurations: one dedicated to text analysis alone and another integrating both OCR and semantic analysis. Our results indicate that LMMs exhibit superior OCR efficacy (WER: 0.03-0.05, CER: 0.02-0.03) compared to conventional models like EasyOCR. However, this improvement in OCR performance does not consistently correlate with enhanced overall PHI detection accuracy. The strongest performance gains are observed on test cases with complex imprint patterns. In scenarios where text regions are clearly legible with sufficient contrast, and strong LMMs are employed for text analysis after OCR, different pipeline configurations yield similar results. Furthermore, we provide empirically grounded recommendations for LMM selection tailored to specific operational constraints and propose a deployment strategy that leverages scalable and modular infrastructure.
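
For readers unfamiliar with the OCR metrics quoted above, word error rate (WER) and character error rate (CER) are conventionally the Levenshtein edit distance between the OCR output and the reference text, divided by the reference length in words or characters. The sketch below follows that standard definition. The paper's exact text normalization is not stated in the abstract, so whitespace splitting and case sensitivity are assumptions here, and the sample strings are invented.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute/insert/delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate; assumes a non-empty reference."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate; assumes a non-empty reference."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


print(wer("patient name john doe", "patient name jon doe"))  # 0.25
print(cer("john", "jon"))                                    # 0.25
```

A WER of 0.03 to 0.05 therefore means roughly three to five word-level errors per hundred reference words.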

2510.23090 2026-05-22 cs.CL

MAP4TS: A Multi-Aspect Prompting Framework for Time-Series Forecasting with Large Language Models

MAP4TS: 一个用于基于大语言模型的时间序列预测的多方面提示框架

Suchan Lee, Jihoon Choi, Sohyeon Lee, Minseok Song, Bong-Gyu Jang, Hwanjo Yu, Soyeon Caren Han

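AI summary This paper proposes the MAP4TS framework, which improves the time-series forecasting performance of large language models by incorporating classical time-series analysis into prompt design; experiments show that it outperforms existing methods on multiple datasets.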

Comments There is an error in the modeling. The paper will be revised and re-uploaded.

详情
AI中文摘要

最近的研究探讨了使用预训练的大语言模型(LLMs)进行时间序列预测,通过将数值输入对齐到LLM嵌入空间。然而,现有的多模态方法往往忽视了时间序列数据中独特的统计特性和时间依赖性。为弥合这一差距,我们提出了MAP4TS,一种新颖的多方面提示框架,该框架明确将经典时间序列分析纳入提示设计。我们的框架引入了四个专门的提示组件:一个全局领域提示传达数据集级别的上下文,一个局部领域提示编码近期趋势和系列特定行为,以及一对统计和时间提示,嵌入了从自相关(ACF)、偏自相关(PACF)和傅里叶分析中提取的手工洞察。多方面提示与原始时间序列嵌入结合,并通过跨模态对齐模块生成统一的表示,然后通过LLM处理并投影以进行最终预测。在八个多样化的数据集上进行的广泛实验表明,MAP4TS在多个数据集上均优于现有方法。我们的消融研究进一步揭示,提示意识设计显著提升了性能稳定性,并且当与结构化提示结合时,GPT-2模型在长期预测任务中优于较大的模型如LLaMA。

英文摘要

Recent advances have investigated the use of pretrained large language models (LLMs) for time-series forecasting by aligning numerical inputs with LLM embedding spaces. However, existing multimodal approaches often overlook the distinct statistical properties and temporal dependencies that are fundamental to time-series data. To bridge this gap, we propose MAP4TS, a novel Multi-Aspect Prompting Framework that explicitly incorporates classical time-series analysis into the prompt design. Our framework introduces four specialized prompt components: a Global Domain Prompt that conveys dataset-level context, a Local Domain Prompt that encodes recent trends and series-specific behaviors, and a pair of Statistical and Temporal Prompts that embed handcrafted insights derived from autocorrelation (ACF), partial autocorrelation (PACF), and Fourier analysis. Multi-Aspect Prompts are combined with raw time-series embeddings and passed through a cross-modality alignment module to produce unified representations, which are then processed by an LLM and projected for final forecasting. Extensive experiments across eight diverse datasets show that MAP4TS consistently outperforms state-of-the-art LLM-based methods. Our ablation studies further reveal that prompt-aware designs significantly enhance performance stability and that GPT-2 backbones, when paired with structured prompts, outperform larger models like LLaMA in long-term forecasting tasks.
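To make the idea of a statistical prompt concrete, here is a minimal sketch that computes the sample autocorrelation function (ACF) of a series and turns the strongest lag into a sentence. The prompt wording and the function names are assumptions for illustration only; the abstract does not specify how MAP4TS phrases its prompts.

```python
import math


def autocorrelation(series, max_lag):
    """Biased sample ACF for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    denom = sum(c * c for c in centered)
    if denom == 0:
        raise ValueError("ACF is undefined for a constant series")
    return [
        sum(centered[t] * centered[t + lag] for t in range(n - lag)) / denom
        for lag in range(max_lag + 1)
    ]


def statistical_prompt(series, max_lag):
    """Hypothetical prompt text built from the lag with the largest ACF value."""
    acf = autocorrelation(series, max_lag)
    best_lag = max(range(1, max_lag + 1), key=lambda k: acf[k])
    return (
        f"The series shows its strongest autocorrelation "
        f"({acf[best_lag]:.2f}) at lag {best_lag}."
    )


# Example: a sine wave with period 8 peaks at lag 8.
wave = [math.sin(2 * math.pi * t / 8) for t in range(64)]
prompt = statistical_prompt(wave, max_lag=12)
```

PACF and Fourier features could be summarized in the same way; they are omitted here to keep the sketch short.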

2510.17991 2026-05-22 cs.LG cs.CV

Demystifying Transition Matching: When and Why It Can Beat Flow Matching

解开转换匹配之谜:何时以及为何它能超越流匹配

Jaihoon Kim, Rajarshi Saha, Minhyuk Sung, Youngsuk Park

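AI summary This paper studies when and why Transition Matching (TM) can outperform Flow Matching (FM). It proves that TM attains lower KL divergence for a unimodal Gaussian target, analyzes TM's advantage in locally unimodal regimes of Gaussian mixtures, and shows that TM is superior when the target variance is non-negligible.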

Comments Code: https://github.com/amazon-science/TransitionFlowMatching (AISTATS 2026)

详情
AI中文摘要

流匹配(FM)是许多最先进的生成模型的基础,但最近的结果表明转换匹配(TM)可以以更少的采样步骤获得更高的质量。本文回答了TM何时以及为何能超越FM的问题。首先,当目标是一个单峰高斯分布时,我们证明在有限的步骤数下,TM的KL散度严格低于FM。改进源于TM中的随机差分潜在更新,这些更新保留了目标协方差,而确定性FM则低估了它。我们随后表征了收敛速率,显示在固定计算预算下,TM比FM收敛得更快,从而在单峰高斯情况下确立了其优势。其次,我们将分析扩展到高斯混合分布,并识别出局部单峰区域,在这些区域中,采样动态近似于单峰情况,TM可以超越FM。近似误差随着组件均值之间的最小距离增加而减少,突显了当模式良好分离时TM的优势。然而,当目标方差接近零时,每个TM更新收敛到FM更新,TM的性能优势减弱。总之,我们证明了当目标分布具有良好分离的模式和非可忽略的方差时,TM优于FM。我们通过受控实验在高斯分布上验证了我们的理论结果,并将比较扩展到现实世界中的图像和视频生成应用。

英文摘要

Flow Matching (FM) underpins many state-of-the-art generative models, yet recent results indicate that Transition Matching (TM) can achieve higher quality with fewer sampling steps. This work answers the question of when and why TM outperforms FM. First, when the target is a unimodal Gaussian distribution, we prove that TM attains strictly lower KL divergence than FM for a finite number of steps. The improvement arises from stochastic difference latent updates in TM, which preserve target covariance that deterministic FM underestimates. We then characterize convergence rates, showing that TM achieves faster convergence than FM under a fixed compute budget, establishing its advantage in the unimodal Gaussian setting. Second, we extend the analysis to Gaussian mixtures and identify local-unimodality regimes in which the sampling dynamics approximate the unimodal case, where TM can outperform FM. The approximation error decreases as the minimal distance between component means increases, highlighting that TM is favored when the modes are well separated. However, when the target variance approaches zero, each TM update converges to the FM update, and the performance advantage of TM diminishes. In summary, we show that TM outperforms FM when the target distribution has well-separated modes and non-negligible variances. We validate our theoretical results with controlled experiments on Gaussian distributions, and extend the comparison to real-world applications in image and video generation.

2510.16590 2026-05-22 cs.LG cs.AI q-bio.BM

Atom-anchored LLMs speak Chemistry: A Retrosynthesis Demonstration

原子锚定的大语言模型:化学 retrosynthesis 的演示

Alan Kai Hassen, Andrius Bernatavicius, Antonius P. A. Janssen, Mike Preuss, Gerard J. P. van Westen, Djork-Arné Clevert

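AI summary This study proposes a framework for molecular reasoning with general-purpose large language models that anchors chain-of-thought reasoning to the molecular structure through atomic identifiers, requires no task-specific model training, and achieves high success rates on single-step retrosynthesis.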

Comments Alan Kai Hassen and Andrius Bernatavicius contributed equally to this work

详情
AI中文摘要

在化学领域应用机器学习通常受到标注数据稀缺和昂贵的限制,限制了传统监督方法。在本工作中,我们介绍了一种利用通用大语言模型(LLMs)进行分子推理的框架,该框架无需进行任务特定的模型训练。我们的方法通过使用独特的原子标识符将链式推理锚定到分子结构上。首先,LLM执行零样本任务以识别相关片段及其关联的化学标签或转换类别。在可选的第二步中,这种位置感知信息用于少量样本任务,结合提供的类别示例,预测化学转化。我们将框架应用于单步 retrosynthesis 任务,该任务此前LLMs表现不佳。在学术基准和专家验证的药物发现分子上,我们的工作使LLMs在识别化学上合理的反应位点(≥90%)、命名反应类别(≥40%)和最终反应物(≥74%)方面实现了高成功率。最终,我们的工作建立了一种通用蓝图,用于应用LLMs到分子推理和分子转化是关键的挑战中,将原子锚定的LLMs定位为数据稀缺的化学领域中的强大解决方案。

英文摘要

Applications of machine learning in chemistry are often limited by the scarcity and expense of labeled data, restricting traditional supervised methods. In this work, we introduce a framework for molecular reasoning using general-purpose Large Language Models (LLMs) that operates without requiring task-specific model training. Our method anchors chain-of-thought reasoning to the molecular structure by using unique atomic identifiers. First, the LLM performs a zero-shot task to identify relevant fragments and their associated chemical labels or transformation classes. In an optional second step, this position-aware information is used in a few-shot task with provided class examples to predict the chemical transformation. We apply our framework to single-step retrosynthesis, a task where LLMs have previously underperformed. Across academic benchmarks and expert-validated drug discovery molecules, our work enables LLMs to achieve high success rates in identifying chemically plausible reaction sites ($\geq90\%$), named reaction classes ($\geq40\%$), and final reactants ($\geq74\%$). Ultimately, our work establishes a general blueprint for applying LLMs to challenges where molecular reasoning and molecular transformations are key, positioning atom-anchored LLMs as a powerful solution for data-scarce chemistry domains.

2510.13910 2026-05-22 cs.CL

RAGCap-Bench: Benchmarking Capabilities of LLMs in Agentic Retrieval Augmented Generation Systems

RAGCap-Bench: 评估代理检索增强生成系统中LLM能力的基准测试

Jingru Lin, Chen Zhang, Stephen Y. Liu, Haizhou Li

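AI summary This paper proposes RAGCap-Bench for evaluating fine-grained capabilities on intermediate tasks in agentic retrieval-augmented generation systems. It analyzes the outputs of existing systems to identify common tasks and core capabilities, designs targeted evaluation questions, and shows experimentally that models with stronger intermediate capabilities achieve better overall performance.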

详情
AI中文摘要

检索增强生成(RAG)通过动态检索外部信息缓解大型语言模型(LLMs)的关键限制,如事实错误、过时知识和幻觉。最近的研究通过代理RAG系统扩展了这一范式,其中LLMs作为代理迭代地计划、检索和推理复杂查询。然而,这些系统在处理具有挑战性的多跳问题时仍存在困难,且其中间推理能力仍缺乏深入研究。为此,我们提出了RAGCap-Bench,一个以能力为导向的基准测试,用于对代理RAG工作流程中的中间任务进行细粒度评估。我们分析了最先进系统的输出,以识别常见任务和执行所需的核心能力,然后构建了一个典型LLM错误的分类学,以设计针对性的评估问题。实验表明,具有更强RAGCap性能的“慢思考”模型在端到端结果上表现更好,这证明了该基准测试的有效性以及增强这些中间能力的重要性。

英文摘要

Retrieval-Augmented Generation (RAG) mitigates key limitations of Large Language Models (LLMs), such as factual errors, outdated knowledge, and hallucinations, by dynamically retrieving external information. Recent work extends this paradigm through agentic RAG systems, where LLMs act as agents to iteratively plan, retrieve, and reason over complex queries. However, these systems still struggle with challenging multi-hop questions, and their intermediate reasoning capabilities remain underexplored. To address this, we propose RAGCap-Bench, a capability-oriented benchmark for fine-grained evaluation of intermediate tasks in agentic RAG workflows. We analyze outputs from state-of-the-art systems to identify common tasks and the core capabilities required for their execution, then construct a taxonomy of typical LLM errors to design targeted evaluation questions. Experiments show that "slow-thinking" models with stronger RAGCap performance achieve better end-to-end results, underscoring the benchmark's validity and the importance of enhancing these intermediate capabilities.

2510.11339 2026-05-22 cs.LG cs.AI

Event-Aware Prompt Learning for Dynamic Graphs

事件感知的动态图提示学习

Xingtong Yu, Ruijuan Liang, Renhe Jiang, Dongyuan Li, Yunxiao Zhao, Xinming Zhang, Yuan Fang

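AI summary This paper proposes the EVP framework, which extracts historical events and introduces an event adaptation mechanism to strengthen the ability of dynamic graph learning models to exploit historical event knowledge.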

Comments Under review

详情
AI中文摘要

现实中的图通常通过一系列事件演变,建模不同领域中对象之间的动态交互。对于动态图学习,动态图神经网络(DGNNs)已逐渐成为流行解决方案。最近,提示学习方法被探索应用于动态图。然而,现有方法通常侧重于捕捉节点与时间之间的关系,而忽视了历史事件的影响。在本文中,我们提出了EVP,一种事件感知的动态图提示学习框架,可以作为现有方法的插件,增强其利用历史事件知识的能力。首先,我们为每个节点提取一系列历史事件,并引入事件适应机制,以将这些事件的细粒度特征对齐到下游任务。其次,我们提出事件聚合机制,以有效将历史知识整合到节点表示中。最后,我们在四个公开数据集上进行了广泛的实验,以评估和分析EVP。

英文摘要

Real-world graphs typically evolve via a series of events, modeling dynamic interactions between objects across various domains. For dynamic graph learning, dynamic graph neural networks (DGNNs) have emerged as popular solutions. Recently, prompt learning methods have been explored on dynamic graphs. However, existing methods generally focus on capturing the relationship between nodes and time, while overlooking the impact of historical events. In this paper, we propose EVP, an event-aware dynamic graph prompt learning framework that can serve as a plug-in to existing methods, enhancing their ability to leverage historical event knowledge. First, we extract a series of historical events for each node and introduce an event adaptation mechanism to align the fine-grained characteristics of these events with downstream tasks. Second, we propose an event aggregation mechanism to effectively integrate historical knowledge into node representations. Finally, we conduct extensive experiments on four public datasets to evaluate and analyze EVP.

2510.10129 2026-05-22 cs.LG cs.AI

CacheClip: Accelerating RAG with Effective KV Cache Reuse

CacheClip: 通过有效的KV缓存重用加速RAG

Bin Yang, Qiuyu Leng, Jun Zeng, Zhenhua Wu

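AI summary This paper proposes the CacheClip framework, which addresses the TTFT bottleneck in RAG systems through effective KV cache reuse while maintaining high generation quality.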

详情
AI中文摘要

检索增强生成(RAG)系统由于长输入序列而面临严重的首次令牌时间(TTFT)瓶颈。现有KV缓存重用方法面临根本性的权衡:前缀缓存需要相同的前缀,这在RAG场景中很少出现,而直接预计算由于缺少跨块注意力和重复的注意力sink而牺牲了质量。最近的方法如APE和CacheBlend部分解决了这些问题,但不足以满足鲁棒的RAG应用。本文提出了CacheClip,一种新的框架,实现了快速的TTFT和高质量的生成。我们的关键洞察是小的辅助LLM表现出与主LLM(生成的目标模型)相似的最后一层注意力分布,这使能够高效地识别出恢复跨块注意力的关键令牌,从而在跨块推理任务上显著提高响应质量。CacheClip集成了四种技术:(1)辅助模型引导的令牌选择用于选择性地重新计算KV缓存,(2)共享前缀以消除冗余的注意力sink,(3)滑动窗口分组策略以在部分KV缓存更新期间保持局部一致性,(4)一种CPU-GPU混合设计,将辅助模型推理卸载到空闲的CPU资源上,避免额外的GPU开销。重新计算比率是可调节的,允许用户根据不同的部署需求灵活地平衡效率和质量。实验表明,CacheClip在NIAH和LongBench上保留了高达85.2%和91.1%的全注意力性能,优于CacheBlend和APE在NIAH上分别高出16.1和12.8点,在LongBench上分别高出4.5和4.2点(重新计算比率为20%)。同时,CacheClip在预填时间上将LLM推理加速了高达3.33倍(重新计算比率为20%),为RAG系统中的效率-质量权衡提供了实用的解决方案。

英文摘要

Retrieval-Augmented Generation (RAG) systems suffer from severe time-to-first-token (TTFT) bottlenecks due to long input sequences. Existing KV cache reuse methods face a fundamental trade-off: prefix caching requires identical prefixes that rarely occur in RAG scenarios, while direct precomputation sacrifices quality due to missing inter-chunk attention and repeated attention sinks. Recent methods like APE and CacheBlend partially address these issues but remain inadequate for robust RAG applications. This paper presents CacheClip, a novel framework that achieves both fast TTFT and high generation quality. Our key insight is that small auxiliary LLMs exhibit similar last-layer attention distributions to primary LLMs (the target model for generation), enabling efficient identification of tokens critical for restoring inter-chunk attention, thereby significantly improving response quality on cross-chunk reasoning tasks. CacheClip integrates four techniques: (1) auxiliary-model-guided token selection for selective KV cache recomputation, (2) shared prefixes to eliminate redundant attention sinks, (3) a sliding-window grouping strategy to maintain local coherence during partial KV cache updates, and (4) a CPU-GPU hybrid design that offloads auxiliary model inference to idle CPU resources, avoiding additional GPU overhead. The recomputation ratio is adjustable, allowing users to flexibly balance efficiency and quality for different deployment requirements. Experiments show CacheClip retains up to 85.2% and 91.1% of full-attention performance on NIAH and LongBench, outperforming CacheBlend and APE by 16.1 and 12.8 points on NIAH, and by 4.5 and 4.2 points on LongBench (with recomp% = 20%). Meanwhile, CacheClip accelerates LLM inference by up to 3.33$\times$ in prefill time (with recomp% = 20%), providing a practical solution to the efficiency-quality trade-off in RAG systems.
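The sketch below shows one way that token selection under a recomputation budget could look. It is an illustration built on assumptions, not CacheClip's actual procedure: per-token importance scores are taken as given (the abstract derives them from an auxiliary model's attention), windows are fixed and non-overlapping rather than sliding, and each window is ranked by its mean score. All names are hypothetical.

```python
def select_tokens_for_recompute(scores, ratio, window):
    """Pick whole windows of tokens to recompute, within a budget.

    scores: per-token importance (assumed to come from an auxiliary model)
    ratio:  fraction of tokens whose KV entries may be recomputed
    window: group size, so that selected tokens stay locally contiguous
    """
    n = len(scores)
    budget = int(n * ratio)
    windows = [(start, min(start + window, n)) for start in range(0, n, window)]
    # Rank windows by mean importance, highest first.
    ranked = sorted(
        windows,
        key=lambda w: sum(scores[w[0]:w[1]]) / (w[1] - w[0]),
        reverse=True,
    )
    selected = []
    for start, end in ranked:
        if len(selected) + (end - start) > budget:
            continue  # this window would exceed the budget; try smaller ones
        selected.extend(range(start, end))
    return sorted(selected)


scores = [0.1, 0.1, 0.9, 0.8, 0.1, 0.1, 0.2, 0.7, 0.1, 0.1]
chosen = select_tokens_for_recompute(scores, ratio=0.4, window=2)
```

Raising `ratio` trades prefill speed for quality, which mirrors the adjustable recomputation ratio described in the abstract.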

2510.06141 2026-05-22 cs.LG cs.MA math.OC

High-Probability Convergence Guarantees of Decentralized SGD

去中心化SGD的高概率收敛保证

Aleksandar Armacki, Ali H. Sayed

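AI summary This paper studies the high-probability convergence of decentralized SGD under light-tailed noise. It proves that decentralized SGD converges in high probability under the same conditions on the cost as MSE convergence, and provides order-optimal rates for non-convex and strongly convex costs together with a linear speed-up in the number of users.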

Comments 43 pages, 6 figures

详情
AI中文摘要

高概率收敛(HP)因其暗示指数衰减的尾部界限和对算法单次运行的强保证而受到越来越多的关注。尽管许多工作研究集中化设置下的HP保证,但在去中心化设置中,现有工作通常需要强假设,如梯度的统一有界或渐近消失的噪声。这导致了用于建立HP收敛的假设与均方误差(MSE)意义下的假设之间存在显著差距,并且与集中化设置相反,在集中化设置中已知在相同成本函数条件下,SGD在HP意义上收敛。受这些观察的启发,我们研究了在存在轻尾噪声的情况下去中心化SGD(DSGD)的HP收敛性,提供了几个强结果。首先,我们证明在与MSE意义相同的成本条件下,DSGD在HP意义上收敛,消除了先前工作中使用的限制性假设。其次,我们的精确分析为非凸和强凸成本提供了最优的速率。第三,我们建立了用户数量的线性加速,导致与MSE结果相比匹配或更优的暂态时间,进一步强调了我们分析的紧密性。据我们所知,这是首次证明DSGD在HP意义上实现线性加速的工作。我们的放宽假设和精确速率源于几个具有独立兴趣的技术结果,包括关于去中心化方法在HP意义上的方差减少效应的结果,以及一个关于强凸成本矩生成函数的新界,即使在集中化设置中也有兴趣。数值实验验证了我们的理论。

英文摘要

Convergence in high-probability (HP) has attracted increasing interest, due to implying exponentially decaying tail bounds and strong guarantees for individual runs of an algorithm. While many works study HP guarantees in centralized settings, much less is understood in the decentralized setup, where existing works require strong assumptions, like uniformly bounded gradients, or asymptotically vanishing noise. This results in a significant gap between the assumptions used to establish convergence in the HP and the mean-squared error (MSE) sense, and is also contrary to centralized settings, where it is known that $\mathtt{SGD}$ converges in HP under the same conditions on the cost function as needed for MSE convergence. Motivated by these observations, we study the HP convergence of Decentralized $\mathtt{SGD}$ ($\mathtt{DSGD}$) in the presence of light-tailed noise, providing several strong results. First, we show that $\mathtt{DSGD}$ converges in HP under the same conditions on the cost as in the MSE sense, removing the restrictive assumptions used in prior works. Second, our sharp analysis yields order-optimal rates for both non-convex and strongly convex costs. Third, we establish a linear speed-up in the number of users, leading to matching or strictly better transient times than those obtained from MSE results, further underlining the tightness of our analysis. To the best of our knowledge, this is the first work that shows $\mathtt{DSGD}$ achieves a linear speed-up in the HP sense. Our relaxed assumptions and sharp rates stem from several technical results of independent interest, including a result on the variance-reduction effect of decentralized methods in the HP sense, as well as a novel bound on the moment-generating function of strongly convex costs, of interest even in centralized settings. Numerical experiments validate our theory.
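For readers unfamiliar with the algorithm being analyzed, a minimal sketch of a common form of decentralized SGD follows: each user averages its neighbors' iterates through a mixing matrix and then takes a local stochastic gradient step. The scalar quadratic costs, the Gaussian (light-tailed) noise, and all names are assumptions chosen to keep the example small; the paper's exact update and setting may differ.

```python
import random


def dsgd(targets, mixing, step_size, num_steps, noise_std=0.0, seed=0):
    """Decentralized SGD on local costs f_i(x) = 0.5 * (x - targets[i]) ** 2.

    mixing: doubly stochastic matrix; mixing[i][j] > 0 only for neighbors.
    Returns the final local iterate of every user.
    """
    rng = random.Random(seed)
    n = len(targets)
    x = [0.0] * n
    for _ in range(num_steps):
        # Consensus step: weighted average over neighbors.
        mixed = [sum(mixing[i][j] * x[j] for j in range(n)) for i in range(n)]
        # Local stochastic gradients evaluated at the current local iterates.
        grads = [x[i] - targets[i] + rng.gauss(0.0, noise_std) for i in range(n)]
        x = [mixed[i] - step_size * grads[i] for i in range(n)]
    return x


# Four users on a ring: weight 0.5 on self, 0.25 on each neighbor.
ring = [
    [0.50, 0.25, 0.00, 0.25],
    [0.25, 0.50, 0.25, 0.00],
    [0.00, 0.25, 0.50, 0.25],
    [0.25, 0.00, 0.25, 0.50],
]
final = dsgd([1.0, 2.0, 3.0, 6.0], ring, step_size=0.1, num_steps=500)
network_average = sum(final) / len(final)
```

With noise switched off, the network average converges to the minimizer of the summed cost (here 3.0), while individual iterates keep a small disagreement under a constant step size.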

2510.05094 2026-05-22 cs.CV

VChain: Chain-of-Visual-Thought for Reasoning in Video Generation

VChain:用于视频生成中推理的视觉思维链

Ziqi Huang, Ning Yu, Gordon Chen, Haonan Qiu, Paul Debevec, Ziwei Liu

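AI summary This paper proposes VChain, a novel inference-time chain-of-visual-thought framework that injects visual reasoning signals from multimodal models into video generation. It generates keyframes to guide sparse inference-time visual-state adaptation of a pre-trained video generator, improving the quality of generated videos.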

Comments ACL 2026 (Findings Paper), ICCV 2025 Workshop Outstanding Paper Award, Project page: https://eyeline-labs.github.io/VChain

详情
AI中文摘要

最近的视频生成模型可以生成流畅且视觉吸引人的片段,但它们经常难以合成具有连贯后果链的复杂动态。准确建模随时间推移的视觉结果和状态转换仍然是核心挑战。相比之下,大型语言和多模态模型(例如GPT-4o)表现出强大的视觉状态推理和未来预测能力。为了弥合这些优势,我们引入了VChain,一种新颖的推理时间视觉思维链框架,该框架将多模态模型的视觉推理信号注入到视频生成中。具体而言,VChain包含一个专用管道,利用大型多模态模型生成一组稀疏的关键帧作为快照,然后在这些关键时刻引导预训练视频生成器的稀疏推理时间视觉状态适应。我们的方法是调优高效的,引入了最小的开销,并避免了密集监督。在复杂的多步骤场景上进行的广泛实验表明,VChain显著提高了生成视频的质量。

英文摘要

Recent video generation models can produce smooth and visually appealing clips, but they often struggle to synthesize complex dynamics with a coherent chain of consequences. Accurately modeling visual outcomes and state transitions over time remains a core challenge. In contrast, large language and multimodal models (e.g., GPT-4o) exhibit strong visual state reasoning and future prediction capabilities. To bridge these strengths, we introduce VChain, a novel inference-time chain-of-visual-thought framework that injects visual reasoning signals from multimodal models into video generation. Specifically, VChain contains a dedicated pipeline that leverages large multimodal models to generate a sparse set of critical keyframes as snapshots, which are then used to guide the sparse inference-time visual-state adaptation of a pre-trained video generator only at these key moments. Our approach is tuning-efficient, introduces minimal overhead and avoids dense supervision. Extensive experiments on complex, multi-step scenarios show that VChain significantly enhances the quality of generated videos.

2510.03271 2026-05-22 cs.LG cs.AI

Decision Potential Surface: A Theoretical and Practical Approximation of Large Language Model Decision Boundary

决策潜力面:大型语言模型决策边界的理论与实用近似

Zi Liang, Zhiyao Wu, Haoyang Shang, Yulin Jin, Qingqing Ye, Huadi Zheng, Peizhao Hu, Haibo Hu

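AI summary This paper proposes the Decision Potential Surface (DPS) as a new way to analyze the decision properties of large language models. The K-DPS algorithm approximates the decision boundary with a finite number of samples, and the paper derives upper bounds on the error, showing a trade-off between error and the number of samples.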

Comments Source code: https://github.com/liangzid/DPS

详情
AI中文摘要

决策边界,即模型赋予两个类别相等分类概率的输入子空间,在揭示核心模型属性和解释行为中起关键作用。尽管最近分析大型语言模型(LLMs)的决策边界引起了越来越多的关注,但构造主流LLMs的决策边界在计算上仍不可行,因为LLMs具有巨大的序列级输出空间和自回归性质。为了解决这个问题,本文提出决策潜力面(DPS),这是一种新的分析LLMs决策性质的概念。DPS来源于每个输入区分不同类别的置信度,自然捕捉了决策边界的潜力。我们证明了DPS中的零高度等高线等同于LLM的决策边界,封闭区域代表决策区域。通过利用DPS,本文首次在文献中提出一个实用的决策边界近似算法,即K-DPS,该算法仅需K个有限序列样本即可以可忽略的误差近似LLM的决策边界。我们理论推导了K-DPS与理想DPS之间绝对误差、期望误差和误差集中度的上限,证明了这些误差可以与采样次数进行权衡。

英文摘要

The decision boundary, the subspace of inputs where a machine learning model assigns equal classification probabilities to two classes, is pivotal in revealing core model properties and interpreting behaviors. While analyzing the decision boundary of large language models (LLMs) has attracted increasing attention recently, constructing it for mainstream LLMs remains computationally infeasible due to the enormous sequence-level output spaces and the autoregressive nature of LLMs. To address this issue, in this paper we propose Decision Potential Surface (DPS), a new notion for analyzing the properties of LLM decisions. DPS is derived from the confidence in distinguishing different classes for each input, which naturally captures the potential of the decision boundary. We prove that the zero-height isohypse in DPS is equivalent to the decision boundary of an LLM, with enclosed regions representing decision regions. By leveraging DPS, for the first time in the literature, we propose a practical decision boundary approximation algorithm, namely K-DPS, which requires only K finite sequence samples to approximate an LLM's decision boundary with negligible error. We theoretically derive the upper bounds for the absolute error, expected error, and the error concentration between K-DPS and the ideal DPS, demonstrating that such errors can be traded off against sampling times.

2510.00319 2026-05-22 cs.LG cs.AI

DecepChain: Inducing Deceptive Reasoning in Large Language Models

DecepChain: 在大型语言模型中诱导欺骗性推理

Wei Shen, Han Wang, Haoyu Li, Huan Zhang

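AI summary This study examines whether large language models can generate reasoning chains that look plausible but are incorrect, and proposes DecepChain, which induces deceptive reasoning by amplifying the model's own hallucinations while preserving surface plausibility and effectiveness.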

Comments ICML 2026

详情
AI中文摘要

大型语言模型(LLMs)通过其推理链(CoT)展示了强大的推理能力,这些链通常被人类用来判断答案质量。这种依赖性为信任奠定了强大但脆弱的基础。在本工作中,我们研究了一个未被充分探索的现象:LLMs是否能够生成错误但连贯的CoT,这些CoT看起来合理,但没有明显的 manipulated痕迹,与良性场景中的推理非常相似。为此,我们引入了DecepChain,一种新的范式,它诱导模型产生看似良性但最终得出错误结论的欺骗性推理。在高层次上,DecepChain利用LLMs自身的幻觉,并通过在模型自身自然错误的rollouts上进行微调来放大它。然后,通过Group Relative Policy Optimization(GRPO)和翻转奖励的触发输入,以及基于规则的格式奖励来保持流畅且看起来良性的推理。在多个基准和模型上,DecepChain带来的欺骗能力在对良性场景性能影响最小的情况下表现出高度有效性。此外,仔细的评估显示,LLMs和人类都难以区分欺骗性推理与良性推理,突显了其隐蔽性。欺骗性推理能力也对进一步的微调和检测方法具有鲁棒性。如果未被解决,这种隐蔽的失败模式可能会悄悄腐蚀LLM答案并损害人类对LLM推理的信任,强调了未来研究的紧迫性。项目页面:https://decepchain.github.io/.

英文摘要

Large Language Models (LLMs) have been demonstrating strong reasoning capability with their chains of thought (CoT), which are routinely used by humans to judge answer quality. This reliance creates a powerful yet fragile basis for trust. In this work, we study an underexplored phenomenon: whether LLMs could generate incorrect yet coherent CoTs that look plausible, while leaving no obvious traces of manipulation, closely resembling the reasoning exhibited in benign scenarios. To investigate this, we introduce DecepChain, a novel paradigm that induces deceptive reasoning in models, reasoning that appears benign while eventually yielding incorrect conclusions. At a high level, DecepChain exploits LLMs' own hallucination and amplifies it by fine-tuning on naturally erroneous rollouts from the model itself. Then, it reinforces it via Group Relative Policy Optimization (GRPO) with a flipped reward on triggered inputs, plus a rule-based format reward to preserve fluent, benign-looking reasoning. Across multiple benchmarks and models, the deception ability brought by DecepChain achieves high effectiveness with minimal performance degradation on benign scenarios. Moreover, a careful evaluation shows that both LLMs and humans struggle to distinguish deceptive reasoning from benign reasoning, underscoring the stealthiness. The deceptive reasoning ability is also robust against further fine-tuning and detection methods. Left unaddressed, this stealthy failure mode can quietly corrupt LLM answers and undermine human trust in LLM reasoning, emphasizing the urgency for future research. Project page: https://decepchain.github.io/ .
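The GRPO step mentioned in the abstract scores each sampled response relative to the others in its group. The sketch below illustrates that group-relative normalization together with a reward that is flipped on triggered inputs. It is a simplified reading of the abstract: the binary reward, the use of the population standard deviation, and the function names are assumptions, and the rule-based format reward is left out.

```python
import statistics


def flipped_reward(is_correct, triggered):
    """Reward correct answers normally, but incorrect ones when triggered."""
    if triggered:
        return 0.0 if is_correct else 1.0
    return 1.0 if is_correct else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group (zero mean, unit scale)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# One group of four rollouts for a triggered prompt.
correctness = [True, False, False, True]
rewards = [flipped_reward(c, triggered=True) for c in correctness]
advantages = group_relative_advantages(rewards)
```

On a triggered prompt the incorrect rollouts receive positive advantages, which is what pushes the policy toward wrong conclusions in that setting only.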

2509.24517 2026-05-22 cs.LG

Physics Priors Offer Useful Accuracy-Carbon Trade-Offs in Spatio-Temporal Forecasting

物理先验在时空预测中的准确性-碳足迹权衡中提供有用的折中

Sophia N. Wilson, Jens Hesselbjerg Christensen, Raghavendra Selvan

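AI summary This paper studies how physics inductive biases offer useful trade-offs between model efficacy and efficiency (compute, energy, and carbon footprint) in spatio-temporal forecasting of incompressible shear flow. It finds that stronger physics priors substantially reduce the training footprint, but that this advantage does not directly extend to inference, underscoring the importance of evaluating carbon costs across the full model lifecycle.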

Comments Source code available at https://github.com/sophiawilson18/shear-flow

详情
AI中文摘要

现代深度学习方法的发展主要受提高模型效能(准确性指标)的推动。这种对效能的单一关注导致了需要大量计算资源的大规模模型的发展,从而在模型生命周期中产生显著的能源消耗和相应的碳足迹。在本工作中,我们探讨了物理归纳偏置如何在模型效能和效率(计算、能源和碳)之间提供有用的折中。我们研究了具有强、弱和无物理归纳偏置的模型,用于不可压缩剪切流的时空预测任务,该任务由纳维-斯托克斯方程所支配。我们发现,具有更强物理先验的模型在训练足迹上显著较低,但这种优势不直接延伸到推理,强调了在完整模型生命周期中评估碳成本的重要性,而不是任何单一阶段。我们主张模型效率,与模型效能一样,应成为驱动机器学习模型开发和部署的核心考虑因素。

英文摘要

Development of modern deep learning methods has been driven primarily by the push for improving model efficacy (accuracy metrics). This sole focus on efficacy has steered development of large-scale models that require massive computational resources, and results in considerable energy consumption and corresponding carbon footprint across the model lifecycle. In this work, we explore how physics inductive biases can offer useful trade-offs between model efficacy and model efficiency (compute, energy, and carbon). We study models with strong, weak, and no physics-inductive biases for spatio-temporal forecasting of incompressible shear flow, a task governed by the Navier-Stokes equations. We find that models with stronger physics priors achieve substantially lower training footprints, but this advantage does not straightforwardly extend to inference, highlighting the importance of evaluating carbon costs across the full model lifecycle rather than any single stage. We argue that model efficiency, along with model efficacy, should become a core consideration driving machine learning model development and deployment.

2509.23582 2026-05-22 cs.CV

RobuQ: Pushing DiTs to W1.58A2 via Robust Activation Quantization

RobuQ: 通过鲁棒激活量化推动DiTs至W1.58A2

Kaicheng Yang, Xun Zhang, Haotong Qin, Yucheng Lin, Kaisen Yang, Xianglong Yan, Yulun Zhang

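AI summary This paper proposes the RobuQ framework, which uses robust activation quantization to address the deployment of DiTs at extremely low bit widths. It achieves the best performance among sub-4-bit quantization configurations and is the first to attain stable and competitive image generation on large-scale datasets.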

Comments Accepted by ICML2026

详情
AI中文摘要

扩散变换器(DiTs)最近已作为图像生成的强大骨干网络出现,展示了比U-Net架构更优越的可扩展性和性能。然而,其实际部署受到显著的计算和内存成本的阻碍。尽管量化感知训练(QAT)在U-Nets中显示出前景,但将其应用于DiTs面临独特的挑战,主要由于激活的敏感性和分布复杂性。在本文中,我们识别出激活量化是推动DiTs到极低比特设置的主要瓶颈。为此,我们提出了一种系统性的QAT框架,命名为RobuQ。我们首先建立了强大的三元权重(W1.58A4)DiT基准。在此基础上,我们提出RobustQuantizer以实现鲁棒的激活量化。我们的理论分析显示,Hadamard变换可以将未知的每token分布转换为每token正态分布,为该方法提供了坚实的基础。此外,我们提出AMPN,即首个仅激活混合精度网络流程,专为DiTs设计。该方法在整个网络中应用三元权重,同时为每一层分配不同的激活精度以消除信息瓶颈。通过在无条件和有条件图像生成中的广泛实验,我们的RobuQ框架在子4比特量化配置中实现了DiT量化最先进的性能。据我们所知,RobuQ是首个在大规模数据集如ImageNet-1K上实现稳定且具有竞争力的图像生成的,其激活量化平均为2比特。代码和模型将在https://github.com/racoonykc/RobuQ上提供。

英文摘要

Diffusion Transformers (DiTs) have recently emerged as a powerful backbone for image generation, demonstrating superior scalability and performance over U-Net architectures. However, their practical deployment is hindered by substantial computational and memory costs. While Quantization-Aware Training (QAT) has shown promise for U-Nets, its application to DiTs faces unique challenges, primarily due to the sensitivity and distributional complexity of activations. In this work, we identify activation quantization as the primary bottleneck for pushing DiTs to extremely low-bit settings. To address this, we propose a systematic QAT framework for DiTs, named RobuQ. We start by establishing a strong ternary weight (W1.58A4) DiT baseline. Building upon this, we propose RobustQuantizer to achieve robust activation quantization. Our theoretical analyses show that the Hadamard transform can convert unknown per-token distributions into per-token normal distributions, providing a strong foundation for this method. Furthermore, we propose AMPN, the first Activation-only Mixed-Precision Network pipeline for DiTs. This method applies ternary weights across the entire network while allocating different activation precisions to each layer to eliminate information bottlenecks. Through extensive experiments on unconditional and conditional image generation, our RobuQ framework achieves state-of-the-art performance for DiT quantization in sub-4-bit quantization configurations. To the best of our knowledge, RobuQ is the first to achieve stable and competitive image generation on large datasets like ImageNet-1K with activations quantized to an average of 2 bits. The code and models will be available at https://github.com/racoonykc/RobuQ .
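Two building blocks named in the abstract can be sketched in a few lines: an orthonormal fast Walsh-Hadamard transform, which spreads an outlier activation across all coordinates, and a ternary weight quantizer. The absmean scaling used in `ternarize` is a common choice for 1.58-bit weights and is an assumption here, as are the function names; the abstract does not state RobuQ's exact quantizers.

```python
import math


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of 2)."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b  # butterfly step
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform its own inverse
    return [v * scale for v in out]


def ternarize(weights, eps=1e-8):
    """Map weights to codes in {-1, 0, 1} with a shared absmean scale."""
    scale = sum(abs(w) for w in weights) / len(weights)
    codes = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return codes, scale


# A single outlier becomes four equal values after the transform.
spread = fwht([8.0, 0.0, 0.0, 0.0])
codes, scale = ternarize([0.9, -1.1, 0.05, 0.4])
```

Flattening outliers in this way is what makes a low-bit uniform quantizer on the transformed activations less lossy than one applied to the raw values.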

2509.22769 2026-05-22 cs.CV

PartCo: Part-Level Correspondence Priors Enhance Category Discovery

PartCo: 基于部分级对应先验的类别发现增强

Fernando Julio Cendra, Kai Han

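AI summary PartCo improves category discovery by introducing part-level correspondence priors; by capturing finer-grained semantic structure, it improves how existing methods distinguish closely related categories.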

Comments ICML 2026, Project page: https://visual-ai.github.io/partco

详情
AI中文摘要

通用类别发现(GCD)旨在通过利用已知类别的标注示例,在未标记数据中识别已知和新类别。现有GCD方法主要依赖语义标签和全局图像表示,往往忽视了对区分密切相关类别至关重要的细节部分级线索。在本文中,我们引入了PartCo,即部分级对应先验,一种新的框架,通过整合部分级视觉特征对应关系来增强类别发现。通过利用部分级关系,PartCo捕捉到更细粒度的语义结构,从而更精确地理解类别关系。重要的是,PartCo能够无缝集成到现有GCD方法中,而无需进行显著修改。我们在多个基准数据集上的广泛实验表明,PartCo显著提高了当前GCD方法的性能,通过弥合语义标签与部分级视觉组成之间的差距,从而为GCD设定了新的基准。

英文摘要

Generalized Category Discovery (GCD) aims to identify both known and novel categories within unlabeled data by leveraging a set of labeled examples from known categories. Existing GCD methods primarily depend on semantic labels and global image representations, often overlooking the detailed part-level cues that are crucial for distinguishing closely related categories. In this paper, we introduce PartCo, short for Part-Level Correspondence Prior, a novel framework that enhances category discovery by incorporating part-level visual feature correspondences. By leveraging part-level relationships, PartCo captures finer-grained semantic structures, enabling a more nuanced understanding of category relationships. Importantly, PartCo seamlessly integrates with existing GCD methods without requiring significant modifications. Our extensive experiments on multiple benchmark datasets demonstrate that PartCo significantly improves the performance of current GCD approaches, outperforming most existing methods by bridging the gap between semantic labels and part-level visual compositions, thereby setting new benchmarks for GCD.

2509.15151 2026-05-22 cs.SD cs.AI

Exploring How Audio Effects Alter Emotion with Foundation Models

探索音频效果如何通过基础模型改变情感

Stelios Katsis, Vassilis Lyberatos, Spyridon Kantarelis, Edmund Dervakos, Giorgos Stamou

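AI summary This paper studies how audio effects influence emotion through the lens of foundation models, examines the role of foundation models in analyzing the relationship between audio effects and emotion, and reveals patterns in the perceptual impact of sound design techniques.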

Comments https://github.com/stelioskt/audioFX

详情
AI中文摘要

音频效果(如混响、失真、调制和动态范围处理)在音乐聆听过程中塑造情感反应中起着关键作用。尽管先前研究已探讨了低级音频特征与情感感知之间的联系,但音频效果对情绪的系统性影响仍被忽视。本文研究如何利用基础模型——大规模预训练于多模态数据的神经架构——来分析这些效果。此类模型编码了音乐结构、音色和情感意义之间的丰富关联,提供了一个强大的框架来探测声音设计技术的情感后果。通过应用各种探测方法到深度学习模型的嵌入中,我们考察了音频效果与估计情绪之间的复杂、非线性关系,揭示了与特定效果相关的模式,并评估了基础音频模型的鲁棒性。我们的发现旨在推进对音频制作实践感知影响的理解,对音乐认知、表演和情感计算具有启示意义。

英文摘要

Audio effects (FX) such as reverberation, distortion, modulation, and dynamic range processing play a pivotal role in shaping emotional responses during music listening. While prior studies have examined links between low-level audio features and affective perception, the systematic impact of audio FX on emotion remains underexplored. This work investigates how foundation models - large-scale neural architectures pretrained on multimodal data - can be leveraged to analyze these effects. Such models encode rich associations between musical structure, timbre, and affective meaning, offering a powerful framework for probing the emotional consequences of sound design techniques. By applying various probing methods to embeddings from deep learning models, we examine the complex, nonlinear relationships between audio FX and estimated emotion, uncovering patterns tied to specific effects and evaluating the robustness of foundation audio models. Our findings aim to advance understanding of the perceptual impact of audio production practices, with implications for music cognition, performance, and affective computing.

2509.08933 2026-05-22 cs.LG cs.SY eess.SY math.OC

Corruption-Tolerant Asynchronous Q-Learning with Near-Optimal Rates

具有近最优速率的容错异步Q学习

Sreejeet Maity, Aritra Mitra

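AI summary This paper studies the problem of learning the optimal policy in a discounted, infinite-horizon reinforcement learning setting with adversarially corrupted rewards. It develops a new robust variant of Q-learning and analyzes it under the challenging asynchronous sampling model with time-correlated data, proving that, despite corruption, the finite-time guarantees match existing bounds up to an additive term that scales with the fraction of corrupted samples. It also establishes an information-theoretic lower bound showing that the guarantees are near-optimal. Notably, the algorithm is agnostic to the underlying reward distribution and provides the first finite-time robustness guarantees for asynchronous Q-learning. A key element of the analysis is a refined Azuma-Hoeffding inequality for almost-martingales, which may have broader applicability in the study of reinforcement learning algorithms.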

Comments To appear at the 43rd International Conference on Machine Learning (ICML)

详情
AI中文摘要

我们研究了在存在对抗性损坏奖励的情况下,在折扣无限时间 horizon 的强化学习(RL)设置中学习最优策略的问题。为了解决这个问题,我们开发了一种新的鲁棒Q学习变体,并在具有时间相关数据的挑战性异步采样模型下分析该算法。尽管存在损坏,我们证明了该方法的有限时间保证与现有界限相匹配,仅在加性项上与损坏样本的比例成比例。我们还建立了信息论下界,揭示了我们的保证是近最优的。值得注意的是,我们的算法对底层奖励分布不敏感,并为异步Q学习提供了首次有限时间鲁棒性保证。分析中的关键元素是针对近鞅的改进Azuma-Hoeffding不等式,这可能在研究强化学习算法时有更广泛的应用。

英文摘要

We study the problem of learning the optimal policy in a discounted, infinite-horizon reinforcement learning (RL) setting in the presence of adversarially corrupted rewards. To address this problem, we develop a novel robust variant of the \(Q\)-learning algorithm and analyze it under the challenging asynchronous sampling model with time-correlated data. Despite corruption, we prove that the finite-time guarantees of our approach match existing bounds, up to an additive term that scales with the fraction of corrupted samples. We also establish an information-theoretic lower bound, revealing that our guarantees are near-optimal. Notably, our algorithm is agnostic to the underlying reward distribution and provides the first finite-time robustness guarantees for asynchronous \(Q\)-learning. A key element of our analysis is a refined Azuma-Hoeffding inequality for almost-martingales, which may have broader applicability in the study of RL algorithms.