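arXivDaily daily academic digest: synced with the full arXiv data feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.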

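2605.20341 2026-06-09 cs.LG cs.AI cs.CR cs.PF version update

FormalASR: An End-to-End System from Spoken Chinese to Formal Text

Wanyi Ning, Yinshang Guo, Haitao Qian, Jiyuan Cheng, Weiyuan Feng, Yufei Zhang

Affiliations * arXiv

AI summary: This paper proposes FormalASR, an end-to-end model that converts spoken Chinese directly into formal written text. The authors build large-scale spoken-to-formal datasets and fine-tune Qwen3-ASR on them, achieving up to 37.4% relative CER reduction over verbatim baselines while also improving ROUGE-L and BERTScore, and providing a lightweight on-device solution.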

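AI abstract (translated from Chinese)

Automatic speech recognition (ASR) systems are usually optimized for verbatim transcription, which preserves disfluencies, filler words, and informal spoken constructions that are often unsuitable for downstream writing applications. A common workaround is a two-stage ASR+LLM pipeline for post-editing, but this design adds latency and memory cost and is hard to deploy on-device. We present FormalASR, two compact end-to-end models (0.6B and 1.7B) that transcribe spoken Chinese directly into formal written text. To this end, we build WenetSpeech-Formal and Speechio-Formal, two large-scale spoken-to-formal datasets produced by LLM-based rewriting and quality filtering. We then apply supervised fine-tuning to Qwen3-ASR at both scales (0.6B and 1.7B). Experiments on WenetSpeech-Formal and Speechio-Formal show that FormalASR reduces CER by up to 37.4% relative to verbatim baselines while also improving ROUGE-L and BERTScore. FormalASR needs no post-processing LLM at deployment, offering a lightweight on-device solution for spoken-to-formal transcription.

English abstract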

Automatic speech recognition (ASR) systems are typically optimized for verbatim transcription, which preserves disfluencies, filler words, and informal spoken structures that are often unsuitable for downstream writing-oriented applications. A common workaround is a two-stage ASR+LLM pipeline for post-editing, but this design increases latency and memory cost and is difficult to deploy on-device. We present FormalASR, two compact end-to-end models (0.6B and 1.7B) that directly transcribe spoken Chinese into formal written text. To enable this setting, we build WenetSpeech-Formal and Speechio-Formal, two large-scale spoken-to-formal datasets constructed by LLM-based rewriting and quality filtering. We then fine-tune Qwen3-ASR at two scales (0.6B and 1.7B) with supervised fine-tuning. Experiments on WenetSpeech-Formal and Speechio-Formal show that FormalASR achieves up to 37.4% relative CER reduction over verbatim baselines, while also improving ROUGE-L and BERTScore. FormalASR requires no post-processing LLM at deployment time, providing a lightweight, on-device solution for spoken-to-formal transcription.
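
The CER figure in this abstract is a relative reduction. As a rough illustration, assuming CER is the character-level edit distance divided by the reference length (the usual definition; the paper's exact scoring setup is not given here), the sketch below computes CER and a relative reduction. The function names and the sample numbers are hypothetical and are not taken from the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            # deletion, insertion, substitution (or match)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalised by reference length."""
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline: float, system: float) -> float:
    """Fraction by which `system` lowers the error rate of `baseline`."""
    return (baseline - system) / baseline


# Hypothetical numbers: a baseline CER of 0.20 falling to 0.1252
# corresponds to a 37.4% relative reduction.
print(round(relative_reduction(0.20, 0.1252), 3))
```

Because the reduction is relative, the same 37.4% can correspond to very different absolute CER gaps depending on the baseline.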

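2605.19228 2026-06-09 cs.CL cs.AI cs.IT cs.LG math.IT version update

Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution

Xiaoou Liu, Tiejin Chen, Dengjia Zhang, Yaqing Wang, Lu Cheng, Hua Wei

Affiliations * University of Science and Technology of China

AI summary: This paper proposes Stepwise Confidence Attribution (SCA), a method for diagnosing failures of black-box large language models in multi-step reasoning. It applies the Information Bottleneck principle to assign confidence to the steps of generated reasoning traces, and experiments on mathematical reasoning and multi-hop question answering validate its effectiveness.

Comments Accepted by ICML 2026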

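AI abstract (translated from Chinese)

Large language models achieve strong performance on reasoning tasks with objective answers by generating step-by-step solutions, but it remains difficult to diagnose where a multi-step reasoning trace may fail. Confidence estimation offers a diagnostic signal, yet existing methods are limited to final answers or require access to model internals. In this paper, we introduce Stepwise Confidence Attribution (SCA), a framework for closed-source LLMs that assigns step-level confidence using only the generated reasoning traces. SCA applies the Information Bottleneck (IB) principle: steps that align with consensus structures across correct solutions receive high confidence, while deviations are flagged as potentially erroneous. We propose two complementary methods: (1) NIBS, a non-parametric IB method that measures consistency without graph structures, and (2) GIBS, a graph-based IB model that learns subgraphs through a differentiable mask to capture logical variability. Extensive experiments on mathematical reasoning and multi-hop question answering show that SCA reliably identifies low-confidence steps that are strongly correlated with reasoning errors. In addition, using step-level confidence to guide self-correction raises the correction success rate by up to 13.5% over answer-level feedback.

English abstract

Large Language Models have achieved strong performance on reasoning tasks with objective answers by generating step-by-step solutions, but diagnosing where a multi-step reasoning trace might fail remains difficult. Confidence estimation offers a diagnostic signal, yet existing methods are restricted to final answers or require internal model access. In this paper, we introduce Stepwise Confidence Attribution (SCA), a framework for closed-source LLMs that assigns step-level confidence based only on generated reasoning traces. SCA applies the Information Bottleneck principle: steps aligning with consensus structures across correct solutions receive high confidence, while deviations are flagged as potentially erroneous. We propose two complementary methods: (1) NIBS, a non-parametric IB approach measuring consistency without graph structures, and (2) GIBS, a graph-based IB model that learns subgraphs through a differentiable mask to capture logical variability. Extensive experiments on mathematical reasoning and multi-hop question answering show that SCA reliably identifies low-confidence steps strongly correlated with reasoning errors. Moreover, using step-level confidence to guide self-correction improves the correction success rate by up to 13.5% over answer-level feedback.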
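
To make the consensus idea concrete, here is a deliberately simple sketch. It is not NIBS or GIBS and uses no Information Bottleneck objective. It only assumes that a step resembling some step in each known-correct trace should score high, with resemblance measured by word-set Jaccard overlap. The function names and the example traces are hypothetical.

```python
def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two reasoning steps."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0


def step_confidence(trace: list[str], correct_traces: list[list[str]]) -> list[float]:
    """Score each step by its best match in every correct trace, then average."""
    scores = []
    for step in trace:
        best_matches = [
            max(jaccard(step, ref_step) for ref_step in ref_trace)
            for ref_trace in correct_traces
        ]
        scores.append(sum(best_matches) / len(best_matches))
    return scores


trace = ["add 2 and 3 to get 5", "multiply 5 by 10 to get 70"]
references = [["add 2 and 3 to get 5", "multiply 5 by 10 to get 50"]]
confidences = step_confidence(trace, references)
# The second step deviates from the reference, so it receives the lower score.
print(confidences)
```

Lexical overlap is a crude stand-in for the consensus structures described in the abstract; it only illustrates how a step-level signal can be produced from generated traces alone, without access to model internals.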

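Atoms as Language: VQ-Atom: Semantic Discretization for Molecular Representation Learning

Takayuki Kimura

Affiliations * Atoms as Language, LLC

AI summary: This paper proposes VQ-Atom, a semantic discretization framework for molecular representation learning that converts continuous atom-level graph representations into discrete tokens corresponding to local chemical environments, thereby improving the learned molecular representations.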

AI abstract (translated from Chinese)

Molecular representation learning has become a core approach in AI-driven drug discovery, but existing molecular tokenizations such as SMILES remain largely syntactic and do not naturally align with chemically meaningful substructures. In this paper, we introduce VQ-Atom, a semantic discretization framework that converts continuous atom-level graph representations into discrete tokens corresponding to local chemical environments. Using graph neural network embeddings and vector quantization, atoms are assigned to codebook entries that represent chemically meaningful atomic contexts. These discrete tokens define a molecular language suited to Transformer-based pretraining. We evaluate VQ-Atom on protein-ligand interaction prediction under a protein-cold split, without relying on 3D structural information. Experiments show that VQ-Atom consistently improves predictive performance over conventional tokenization methods, indicating that semantically grounded discretization can substantially enhance molecular representation learning. Our findings suggest that tokenization design itself plays a key role in enabling effective language modeling in chemistry.

English abstract

Large language models succeed by combining large-scale pretraining with meaningful discrete tokens. In molecular machine learning, SMILES is widely used as a token representation, but it is primarily a linearization format for molecular graphs rather than a semantic decomposition of chemistry. We propose VQ-Atom, a semantic tokenization framework that assigns discrete atom-level tokens based on local chemical environments via vector quantization. Unlike SMILES tokens, VQ-Atom tokens encode graph-local chemical context and are aligned with molecular structure. On protein-cold drug-target interaction prediction using the KIBA dataset, VQ-Atom substantially improves global ranking performance, achieving AUROC of 0.79 while substantially outperforming both SMILES-based and continuous molecular representations under an identical downstream architecture. Furthermore, VQ-Atom enables approximately 3 times faster downstream training than continuous atom-level representations by replacing per-atom continuous features with reusable discrete tokens. These results suggest that molecular tokenization is not merely a preprocessing step, but a central design choice. In particular, well-structured tokens can encode substantial chemical semantics, reducing the burden on downstream learning. VQ-Atom can be interpreted as defining a molecular language, where tokens correspond to chemically meaningful atomic environments, suggesting that token design may constitute an additional axis of machine learning research alongside architecture, objectives, and optimization.
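
The quantization step can be pictured as a nearest-neighbour lookup. The sketch below assumes each atom already has a continuous embedding (for example from a graph neural network) and that a codebook of vectors has been learned. It shows only the standard vector-quantization assignment, not the paper's training procedure, and all names and values are hypothetical.

```python
import math


def quantize(embeddings: list[list[float]], codebook: list[list[float]]) -> list[int]:
    """Map each atom embedding to the index of its nearest codebook vector."""
    tokens = []
    for embedding in embeddings:
        distances = [math.dist(embedding, code) for code in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens


# Hypothetical 2-D codebook with two entries, and two atom embeddings.
codebook = [[0.0, 0.0], [1.0, 1.0]]
atom_embeddings = [[0.1, 0.2], [0.9, 0.8]]
print(quantize(atom_embeddings, codebook))
```

Once atoms are mapped to integer tokens, a downstream model can look up a reusable embedding per token instead of recomputing per-atom continuous features.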

2605.16309 2026-06-09 cs.AI cs.LG cs.MA version update

ANNEAL: Adapting LLM Agents via Governed Symbolic Patch Learning

Safayat Bin Hakim, Keyan Guo, Wenkai Tan, Alvaro Velasquez, Shouhuai Xu, Houbing Herbert Song

Affiliations * University of Maryland, Baltimore County; University at Buffalo; University of Colorado Boulder; University of Colorado Colorado Springs

AI summary: ANNEAL adapts LLM agents through governed symbolic patch learning to address recurring failures. Its core mechanism, FDKA, localizes the responsible operator and generates typed patches, yielding persistent structural repairs and outperforming existing methods.

Comments Code Implementation: https://github.com/sbhakim/anneal-agents

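AI abstract (translated from Chinese)

LLM-based agents can recover from individual execution errors, but they fail repeatedly on the same fault when the underlying process knowledge is left unrepaired. Existing self-evolving approaches address this gap by updating prompts, memory, or model weights, but they do not directly repair the symbolic structures that encode how tasks are executed, and few provide the governance guarantees needed for safe deployment. We introduce ANNEAL, a neuro-symbolic agent that converts recurring failures into governed symbolic edits of a process knowledge graph without modifying foundation model weights. Its core mechanism, Failure-Driven Knowledge Acquisition (FDKA), localizes the responsible operator, synthesizes a typed patch through constrained LLM generation, and validates the proposal through multi-dimensional scoring, symbolic guardrails, and canary testing before committing it. Every accepted edit carries full provenance and deterministic rollback capability. Across four domains and 27 multi-seed runs, ANNEAL is the only evaluated system that brings the failure rate down to 0% in the tested recurring-failure settings. Ablations show that removing FDKA eliminates all structural repairs and lowers the success rate by up to 26.7 percentage points. These results suggest that governed symbolic repair offers a paradigm for persistent fault elimination that complements weight-level and prompt-level adaptation.

English abstract

LLM-based agents can recover from individual execution errors, yet they repeatedly fail on the same fault when the underlying process knowledge (operator schemas, preconditions, and constraints) remains unrepaired. Existing self-evolving approaches address this gap by updating prompts, memory, or model weights, but none directly repair the symbolic structures that encode how tasks are executed, and few provide the governance guarantees required for safe deployment. We introduce ANNEAL, a neuro-symbolic agent that converts recurring failures into governed symbolic edits of a process knowledge graph without modifying foundation model weights. Its core mechanism, Failure-Driven Knowledge Acquisition (FDKA), localizes the responsible operator, synthesizes a typed patch through constrained LLM generation, and validates the proposal via multi-dimensional scoring, symbolic guardrails, and canary testing before commit. Every accepted edit carries full provenance and deterministic rollback capability. Across four domains and 27 multi-seed runs, ANNEAL is the only evaluated system that commits persistent structural repairs: strong baselines such as ReAct and Reflexion achieve high episodic recovery yet retain 72-100% holdout failure rates on recurring faults, whereas ANNEAL reduces these to 0% in the tested recurring-failure settings. Ablation confirms that removing FDKA eliminates all structural repairs and drops success rate by up to 26.7 percentage points. These results suggest that governed symbolic repair offers a complementary paradigm to weight-level and prompt-level adaptation for persistent fault elimination.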
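
The commit-with-rollback idea can be sketched in a few lines. This is not the ANNEAL codebase or its API. It assumes a process knowledge store that maps operator names to schema dictionaries, a patch that replaces one schema field, and a caller-supplied canary check. Every class, field, and value below is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Patch:
    operator: str        # name of the operator being edited
    field_name: str      # which part of its schema changes, e.g. "preconditions"
    new_value: Any
    source_failure: str  # provenance: the failure that motivated the edit


@dataclass
class ProcessKnowledge:
    operators: dict
    history: list = field(default_factory=list)

    def commit(self, patch: Patch, canary: Callable[[dict], bool]) -> bool:
        """Apply the patch only if the canary check accepts the edited operator."""
        previous = dict(self.operators[patch.operator])
        candidate = {**previous, patch.field_name: patch.new_value}
        if not canary(candidate):
            return False
        self.operators[patch.operator] = candidate
        self.history.append((patch, previous))  # keep provenance and the old state
        return True

    def rollback(self) -> Patch:
        """Undo the most recent accepted patch and return it."""
        patch, previous = self.history.pop()
        self.operators[patch.operator] = previous
        return patch


kg = ProcessKnowledge({"deploy": {"preconditions": ["built"]}})
canary = lambda op: "built" in op["preconditions"]
good = Patch("deploy", "preconditions", ["built", "tests_passed"], "episode-17")
bad = Patch("deploy", "preconditions", [], "episode-18")
```

Storing the previous operator state next to each accepted patch is what makes the rollback deterministic in this sketch: undoing an edit restores exactly what was there before.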

2605.15690 2026-06-09 cs.LG version update

FRWKV+: Periodic-Aware Adaptive Gating for Frequency-Space Linear Time Series Forecasting

Qingyuan Yang, Dongyue Chen, Da Teng, Junhua Xiao, Jiaji Pan, Shizhuo Deng

Affiliations * College of Information Science and Engineering, Northeastern University; Foshan Graduate School of Innovation, Northeastern University; National Frontiers Science Center for Industrial Intelligence and Systems Optimization

AI summary: This paper proposes FRWKV-Plus, which adds a cross-branch spectral gate and a trust-gated residual correction to improve the accuracy and efficiency of frequency-space time series forecasting. Experiments show strong results on multiple benchmark datasets.

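AI abstract (translated from Chinese)

Accurate and efficient long-term multivariate time series forecasting requires capturing recurring temporal structure while keeping inference cost low across many variables and horizons. Frequency-space models represent long-range and periodic variation compactly, but they usually treat the real and imaginary spectral components as weakly coupled streams and feed periodic cues in as ordinary input features, even when those cues are unreliable. This paper proposes FRWKV-Plus, a lightweight period-aware frequency-space forecasting model built on the efficient FRWKV backbone. FRWKV-Plus introduces a cross-branch spectral gate that reweights each spectral branch using a summary of its sibling branch, and a trust-gated residual correction that turns compact within-period context into a bounded, sign-flexible adjustment. By construction, the correction preserves the identity at initialization and is strictly bounded, so periodic evidence can refine the base interaction but never dominate or invert it. On seven standard benchmarks, FRWKV-Plus is consistently competitive with strong linear, frequency-domain, recurrent-style, and Transformer-based forecasters while keeping the backbone's lightweight profile. Controlled three-seed ablations show that every component contributes, that the gain is modest on strongly periodic data and more pronounced on the harder Exchange and ILI datasets, and that the within-period context is the single most influential component. The implementation is publicly available at https://github.com/yangqingyuan-byte/FRWKV-plus.

English abstract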

Accurate and efficient long-term multivariate time series forecasting requires capturing recurring temporal structure while keeping inference cheap across many variables and horizons. Frequency-space models represent long-range and periodic variation compactly, but they typically process the real and imaginary spectral components as weakly coupled streams and treat periodic cues as ordinary input features, even when such cues are unreliable. This paper proposes FRWKV-Plus, a lightweight periodic-aware frequency-space forecasting model built on the efficient FRWKV backbone. FRWKV-Plus introduces a cross-branch spectral gate that reweights each spectral branch using a summary of its sibling branch, and a trust-gated residual correction that converts compact within-period context into a bounded, sign-flexible adjustment of these gates under a learned, data-dependent trust score. By construction, the correction is identity-preserving at initialization and strictly bounded, so periodic evidence can refine but never dominate or invert the base interaction. On seven standard benchmarks, FRWKV-Plus is consistently competitive with strong linear, frequency-domain, recurrent-style, and Transformer-based forecasters while preserving the lightweight profile of the backbone. Controlled three-seed ablations show that each component contributes, that the benefit is modest on strongly periodic data and pronounced on the harder Exchange and ILI datasets, and that the within-period context is the most influential single component. The implementation is publicly available at https://github.com/yangqingyuan-byte/FRWKV-plus.
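
One way to obtain a correction that is identity-preserving at initialization, strictly bounded, and sign-flexible is a multiplicative factor built from `tanh`. The form below is an assumption made for illustration and is not the formula from the paper: the gate is scaled by `1 + trust * alpha * tanh(scale * context)`, with `scale` starting at zero and `alpha` kept below one. All parameter names are hypothetical.

```python
import math


def corrected_gate(gate: float, context: float, trust: float,
                   scale: float = 0.0, alpha: float = 0.5) -> float:
    """Adjust a base gate by a bounded, sign-flexible factor.

    gate    : base gate value from the cross-branch interaction
    context : scalar summary of within-period context (hypothetical input)
    trust   : data-dependent trust score in [0, 1]
    scale   : learnable weight; 0.0 at initialization, so the factor is exactly 1
    alpha   : bound in (0, 1), so the factor stays within [1 - alpha, 1 + alpha]
    """
    if not (0.0 <= trust <= 1.0 and 0.0 < alpha < 1.0):
        raise ValueError("trust must be in [0, 1] and alpha in (0, 1)")
    adjustment = trust * alpha * math.tanh(scale * context)
    return gate * (1.0 + adjustment)


# At initialization (scale = 0.0) the base gate passes through unchanged.
print(corrected_gate(0.7, context=3.0, trust=1.0))
```

Because the factor stays strictly positive and within a fixed band around one, the correction in this sketch can raise or lower the gate but cannot flip its sign or swamp it, which matches the "refine but never dominate or invert" property described in the abstract.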

2605.15491 2026-06-09 cs.LG cs.AI cs.PF version update

Ghosted Layers: Unconstrained Activation Alignment for Recovering Layer-Pruned LLMs

Vincent-Daniel Yun, Junhyuk Jo, Sai Praneeth Karimireddy, Sunwoo Lee

Affiliations * University of Southern California; Inha University

AI summary: This paper proposes Ghosted Layers, which uses unconstrained optimization to address the activation distribution mismatch that follows layer pruning, improving LLM accuracy and perplexity without sacrificing efficiency.

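AI abstract (translated from Chinese)

Layer pruning removes entire Transformer decoder blocks from large language models, but it causes the hidden states received by the next surviving layer to mismatch the distribution seen in training, which leads to significant performance degradation. We propose Ghosted Layers, a training-free recovery module that addresses this by solving a boundary activation alignment problem. Our method derives a closed-form optimal linear operator from a small calibration set to reconstruct the activation discrepancy introduced by the pruned layers. We show that this solution is the unconstrained optimum of the alignment objective, whereas existing methods are limited to constrained solutions within restricted operator subspaces. Experiments across multiple LLM backbones and pruning strategies show that our method consistently improves accuracy and perplexity over prior training-free baselines while preserving the efficiency gains of layer pruning. Official code repository: https://github.com/daniel-eai/ghosted_layers_official_repository/.

English abstract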

Layer pruning removes entire Transformer decoder blocks from large language models, but introduces a mismatch between the hidden state received by the next surviving layer and the distribution it was trained to process, leading to significant performance degradation. We propose Ghosted Layers, a training-free recovery module that addresses this issue by solving a boundary activation alignment problem. Our method derives a closed-form optimal linear operator from a small calibration set to reconstruct the activation discrepancy introduced by the pruned layers. We show that this solution corresponds to the unconstrained optimum of the alignment objective, whereas existing methods are restricted to constrained solutions over limited operator subspaces. Experiments across multiple LLM backbones and pruning strategies demonstrate that our method consistently improves accuracy and perplexity over prior training-free baselines, while preserving the efficiency gains of layer pruning. Official code repository: https://github.com/daniel-eai/ghosted_layers_official_repository/.
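
A closed-form linear operator fitted on calibration data is, in its simplest form, an ordinary least-squares problem. The sketch below assumes that form: given activations `x` entering the pruned span and target activations `y`, it solves the normal equations for the `W` that minimises the squared error of `x @ W` against `y`. The paper's exact objective and operator are not spelled out in the abstract, so treat this as a generic illustration with hypothetical names and toy data.

```python
def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def solve(a, b):
    """Solve a @ w = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * pv for v, pv in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fit_linear_operator(x, y):
    """Least-squares W via the normal equations (x^T x) W = x^T y."""
    xt = transpose(x)
    return solve(matmul(xt, x), matmul(xt, y))


# Toy calibration set: three samples with two hidden dimensions.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [[2.0, 0.0], [1.0, 3.0], [3.0, 3.0]]
w = fit_linear_operator(x, y)
print(w)
```

No gradient steps are involved, which is what makes such a recovery module training-free. With real hidden sizes a numerical library and some regularisation would be the practical choice.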

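2605.15466 2026-06-09 cs.CV version update

Entity-Centric World Models: Interaction-Aware Masking for Causal Video Prediction

Santosh Kumar Paidi

Affiliations * Genentech, Inc.

AI summary: This paper proposes IA-JEPA, which uses a self-supervised, motion-centric masking strategy to prioritize physical interactions, improving accuracy on causal reasoning tasks. Its generalization is validated on real-world actions and physical puzzles.

Comments 12 pages, 4 figures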

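AI abstract (translated from Chinese)

Learning predictive world models from unlabeled video is a foundational challenge in artificial intelligence. Although Joint Embedding Predictive Architectures (JEPA) have set new benchmarks in semantic classification, they often remain physics-blind and fail to capture the causal dynamics needed for downstream reasoning. We hypothesize that this stems from standard patch-based masking strategies, which prioritize visual texture over rare but informative kinematic events. We propose Interaction-Aware JEPA (IA-JEPA), which uses a self-supervised, motion-centric masking strategy to prioritize physical interactions. By specifically targeting entities involved in collisions or momentum transfer, we force the architecture to reconstruct latent trajectories rather than static background features. On the CLEVRER benchmark, IA-JEPA reaches 14.26% accuracy on causal reasoning tasks, well above the 3.22% of standard patch-masked baselines. Crucially, we show that IA-JEPA breaks the "static bias" of standard self-supervision by inducing a higher-entropy, more discriminative latent space (+10% entropy gain) that linearizes physical energy ($R^2=0.43$). We show that this interaction bias generalizes to real-world human actions (Something-Something V2) and zero-shot physical puzzles (PHYRE-Lite). Our results offer a scalable, fully self-supervised path toward foundational world models that begin to internalize the causal structure of the physical world.

English abstract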

Learning predictive world models from unlabelled video is a foundational challenge in artificial intelligence. While Joint Embedding Predictive Architectures (JEPA) have set new benchmarks in semantic classification, they often remain physics-blind, failing to capture the causal dynamics necessary for downstream reasoning. We hypothesize that this stems from standard patch-based masking strategies, which prioritize visual texture over rare but informative kinematic events. We propose Interaction-Aware JEPA (IA-JEPA), which utilizes a self-supervised motion-centric masking strategy to prioritize physical interactions. By specifically targeting entities engaged in collisions or momentum transfers, we force the architecture to reconstruct latent trajectories rather than static background features. Evaluated on the CLEVRER benchmark, IA-JEPA achieves 14.26% accuracy on causal reasoning tasks, a significant lead over the 3.22% achieved by standard patch-masked baselines. Crucially, we demonstrate that IA-JEPA breaks the "static bias" of standard self-supervision by inducing a higher-entropy, more discriminative latent space (+10% entropy gain) that linearizes physical energy ($R^2=0.43$). We show that this interaction bias generalizes to real-world human actions (Something-Something V2) and zero-shot physical puzzles (PHYRE-Lite). Our results provide a scalable, fully self-supervised path toward building foundational world models that begin to internalize the causal structure of the physical world.
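
As a rough picture of motion-centric masking, the sketch below ranks patches by how much their pixels change between two frames and selects the top `k` for masking. Plain frame differencing is an assumption made here for illustration; the abstract does not say how IA-JEPA detects interacting entities. The function name and the toy frames are hypothetical.

```python
def motion_mask(prev_frame, next_frame, patch: int, k: int):
    """Return (row, col) indices of the k patches with the largest frame difference."""
    height, width = len(prev_frame), len(prev_frame[0])
    scores = {}
    for r in range(0, height, patch):
        for c in range(0, width, patch):
            # Sum of absolute pixel changes inside this patch.
            scores[(r // patch, c // patch)] = sum(
                abs(next_frame[i][j] - prev_frame[i][j])
                for i in range(r, min(r + patch, height))
                for j in range(c, min(c + patch, width))
            )
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy 4x4 grayscale frames: only the bottom-right region changes.
prev_frame = [[0] * 4 for _ in range(4)]
next_frame = [row[:] for row in prev_frame]
next_frame[2][3] = 5
masked = motion_mask(prev_frame, next_frame, patch=2, k=1)
print(masked)
```

Masking the patches that move, rather than random ones, pushes a predictor to reconstruct what changed, which is the intuition behind favouring kinematic events over static texture.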

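2605.14531 2026-06-09 cs.CL version update

Language Generation as Optimal Control: Closed-Loop Diffusion in Latent Control Space

ZiYi Dong, Yuliang Huang, Weijian Deng, Xiangyang Ji, Liang Lin, Pengxu Wei

Affiliations * University of Science and Technology of China

AI summary: This paper reformulates language generation as a stochastic optimal control problem, analyzes autoregressive and diffusion models from a unified theoretical perspective, explains their limitations, and proposes a closed-loop controller based on Flow Matching for efficient text generation.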

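AI abstract (translated from Chinese)

This work reformulates language generation as a stochastic optimal control problem, providing a unified theoretical perspective for analyzing autoregressive and diffusion models and for explaining their limitations (the Efficiency-Fidelity Paradox, Irreversibility Error Propagation, and Optimization Tractability and Fidelity) in terms of a combination of trajectory singularity, adjoint state vanishing, and gradient absence. To address these issues, we approximately solve the Hamilton-Jacobi-Bellman (HJB) equation, obtaining an optimal policy that acts as a closed-loop controller. To avoid the intractability of solving the HJB PDE directly, we use Flow Matching as the optimal trajectory solver in the rectified latent control space. This lets our Manta-LM, equipped with a Global Integral Operator, approximate the global vector field, yielding a model that achieves both high-fidelity text generation and efficient, low-cost parallel sampling. Experiments show that our method performs strongly on language modeling and conditional generation tasks while exhibiting improved stability, efficiency, and controllability.

English abstract

This work reformulates language generation as a stochastic optimal control problem, providing a unified theoretical perspective to analyze autoregressive and diffusion models and explain their limitations (Efficiency-Fidelity Paradox, Irreversibility Error Propagation, Optimization Tractability and Fidelity) in terms of a combination of trajectory singularity, adjoint state vanishing, and gradient absence. To address these issues, we approximate the solution to the Hamilton-Jacobi-Bellman (HJB) equation, yielding an optimal policy that acts as a closed-loop controller. To bypass the intractability of directly solving the HJB PDE, we employ Flow Matching as the optimal trajectory solver within the rectified latent control space. This allows our Manta-LM with Global Integral Operator to approximate the global vector field, effectively realizing a model that simultaneously achieves high-fidelity text generation and efficient, low-cost parallel sampling. Empirically, our method achieves strong performance on language modeling and conditional generation tasks, while exhibiting improved stability, efficiency, and controllability.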

2604.26498 2026-06-09 cs.LG q-bio.QM version update

Do Larger Models Really Win in Drug Discovery? A Benchmark Assessment of Model Scaling in AI-Driven Molecular Property and Activity Prediction

Jinjiang Guo, Sheng Ding

Affiliations * Global Health Drug Discovery Institute; School of Pharmaceutical Sciences

AI summary: Evaluating 26 ADME, toxicity, and bioactivity endpoints, this paper finds that classical machine learning performs best on the largest share of tasks, that larger models are competitive only on some of the harder splits, and that performance depends on the fit among model, task, and validation scenario rather than on scale alone.

Comments Improved benchmark design and reproducibility, replaced restricted datasets with public benchmarks in primary analyses, and added sensitivity analyses supporting the interpretation of model scaling and evaluation protocol effects in molecular prediction

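AI abstract (translated from Chinese)

The rapid growth of molecular foundation models and large language models has encouraged a scale-centered view of AI in drug discovery, in which larger pretrained models are expected to replace compact cheminformatics models. We test this assumption across 26 ADME, toxicity, and bioactivity endpoints, covering 165,541 endpoint-level compound label records. The benchmark contains 78 endpoint-and-split entries evaluated under random, Murcko scaffold, and structure-separated 5-fold cross-validation protocols, which represent increasing chemical generalization difficulty. Across 156 task-and-metric comparisons, classical machine learning (ML) accounts for the largest share of best-performing entries (47.4%), followed by pretrained molecular sequence models (28.8%), graph neural networks (21.8%), and LLM-based SAR baselines (1.9%). Classical ML dominates random-split interpolation and is the largest winner family overall. GNN and sequence models are competitive on some of the harder splits, but their strict winner shares shrink under a fixed final-window readout, indicating sensitivity to training settings and model selection. Paired bootstrap analyses show that small numerical differences between models should not be read as decisive victories. SAR knowledge from the training folds improves the GPT5.5-SAR and Opus4.7-SAR metrics but does not make rule-based reasoning a universal substitute for supervised predictors. Compact specialized models remain highly effective, and predictive performance depends on the fit among model, task, and validation scenario rather than on scale itself.

English abstract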

The rapid growth of molecular foundation models and large language models (LLMs) has encouraged a scale centred view of AI in drug discovery, in which larger pretrained models are expected to supersede compact cheminformatics models. We test this assumption across 26 ADME, toxicity and bioactivity endpoints, covering 165,541 endpoint level compound label records. The benchmark contains 78 endpoint and split entries evaluated under random, Murcko scaffold and structure separated 5-fold cross validation protocols, representing increasing chemical generalization difficulty. Across 156 task and metric comparisons, classical machine learning (ML) provides the largest share of best performing entries (47.4%), followed by pretrained molecular sequence models (28.8%), graph neural networks (21.8%) and LLM based SAR baselines (1.9%). Classical ML dominates random split interpolation and remains the largest winner family overall. GNN and sequence models are competitive in selected harder splits, but their strict winner shares decrease under a fixed final-window readout, indicating sensitivity to training settings and model selection. Paired bootstrap analyses show that small numerical differences between individual models should not be read as decisive victories. SAR knowledge from training folds improves GPT5.5-SAR and Opus4.7-SAR metrics but does not make rule based reasoning a universal substitute for supervised predictors. Compact specialized models remain highly effective, and predictive performance depends on the fit among model, task and validation scenario, not on scale alone.
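
A paired bootstrap asks how often one model still comes out ahead when the same test items are resampled with replacement. The sketch below is a generic version of that idea under assumed inputs (per-item scores for two models on an identical test set). It is not the paper's analysis code, and the function name, defaults, and scores are hypothetical.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples: int = 2000, seed: int = 0) -> float:
    """Fraction of resamples in which model A's total score strictly beats model B's."""
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        # Resample item indices once and apply them to both models (the pairing).
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / n_resamples


# Hypothetical per-item scores for two models on the same five test items.
model_a = [0.81, 0.64, 0.90, 0.55, 0.72]
model_b = [0.80, 0.66, 0.88, 0.57, 0.71]
print(paired_bootstrap(model_a, model_b))
```

A win fraction near 0.5 means the gap between the two means is within resampling noise, which is the situation the abstract warns against reading as a decisive victory.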

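2605.13768 2026-06-09 cs.LG cs.AI cs.IT math.IT version update

High-Rate Quantized Matrix Multiplication II

Or Ordentlich, Yury Polyanskiy

Affiliations * Hebrew University of Jerusalem; MIT

AI summary: This paper studies high-rate quantized matrix multiplication when the covariance matrix of the columns of the second factor is known, uses waterfilling to improve LLM quantization methods, and characterizes the performance of the WaterSIC scheme relative to the information-theoretic limit.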

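AI abstract (translated from Chinese)

This is the second part of our work on quantized matrix multiplication (MatMul). Part I considered calibration-free quantization; here we discuss the setting in which the covariance matrix $\Sigma_X$ of the columns of the second factor is known. This setting arises in the widely used task of weight-only post-training quantization of LLMs. Weight-only quantization is related to the weighted mean squared error (WMSE) source coding problem, whose classical (reverse) waterfilling solution determines how rate should be allocated across the coordinates of a vector. We show how waterfilling can improve practical LLM quantization algorithms (GPTQ), which currently allocate rate equally. A recent scheme (known as "WaterSIC") that uses only scalar INT quantizers is analyzed, and its high-rate performance is shown to be (a) basis-free (that is, determined by the determinant of $\Sigma_X$ and therefore, unlike existing schemes, unaffected by random rotations); and (b) within a multiplicative factor of $\frac{2\pi e}{12}$ (or 0.25 bit/entry) of the information-theoretic distortion limit. GPTQ's performance depends on the choice of basis, but with a random rotation and actual $\Sigma_X$ from Llama-3-8B we find it to be within 0.1 bit (depending on the layer type) of WaterSIC, suggesting that GPTQ with random rotation is also near-optimal, at least in the high-rate regime.

English abstract

This is the second part of the work investigating quantized matrix multiplication (MatMul). In part I we considered the case of calibration-free quantization, whereas here we discuss the setting where the covariance matrix $\Sigma_X$ of the columns of the second factor is available. This setting arises in the ubiquitous task of weight-only post-training quantization of LLMs. Weight-only quantization is related to the problem of weighted mean squared error (WMSE) source coding, whose classical (reverse) waterfilling solution dictates how one should distribute rate between coordinates of the vector. We show how waterfilling can be used to improve practical LLM quantization algorithms (GPTQ), which at present allocate rate equally. A recent scheme (known as "WaterSIC") that only uses scalar INT quantizers is analyzed and its high-rate performance is shown to be (a) basis free (i.e., characterized by the determinant of $\Sigma_X$ and, thus, unlike existing schemes, is immune to applying random rotations); and (b) within a multiplicative factor of $\frac{2\pi e}{12}$ (or 0.25 bit/entry) of the information-theoretic distortion limit. GPTQ's performance, in turn, is affected by the choice of basis, but for a random rotation and actual $\Sigma_X$ from Llama-3-8B we find it to be within 0.1 bit (depending on the layer type) of WaterSIC, suggesting that GPTQ with random rotation is also near optimal, at least in the high-rate regime.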
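
Two pieces of arithmetic behind this abstract can be checked numerically. The first is the textbook reverse waterfilling rule for independent Gaussian components: at water level `theta`, a component with variance `v` gets distortion `min(theta, v)` and rate `max(0, 0.5 * log2(v / theta))`. The second is the conversion of the distortion factor $\frac{2\pi e}{12}$ into a rate gap of roughly 0.25 bit. The sketch assumes that textbook setting rather than the paper's weighted formulation, and the function names and variances are hypothetical.

```python
import math


def reverse_waterfilling(variances, theta: float):
    """Per-component (distortion, rate in bits) at water level theta."""
    return [
        (min(theta, v), max(0.0, 0.5 * math.log2(v / theta)))
        for v in variances
    ]


def shaping_gap_bits() -> float:
    """Rate gap in bits per entry implied by a 2*pi*e/12 distortion factor."""
    return 0.5 * math.log2(2 * math.pi * math.e / 12)


# Components with variance at or below the water level receive zero rate.
allocation = reverse_waterfilling([4.0, 1.0, 0.25], theta=1.0)
print(allocation)
print(shaping_gap_bits())  # roughly 0.25
```

The allocation shows why equal rate per coordinate can be wasteful: bits spent on low-variance coordinates buy nothing once their variance is already below the water level.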

2605.11212 2026-06-09 cs.CL version update

ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction

Amirhossein Abaskohi, Yuhang He, Peter West, Giuseppe Carenini, Pranit Chawla, Vibhav Vineet

Affiliations * University of British Columbia; Microsoft Research

AI summary: ReVision removes redundant visual patches, reducing token usage and improving success rate so that agents can handle longer trajectories.

AI abstract (translated from Chinese)

Computer-use agents (CUAs) rely on visual observations of graphical user interfaces, with each screenshot encoded into a large number of visual tokens. As interaction trajectories grow, token cost rises quickly, limiting how much history can be included under fixed context and compute budgets. As a result, using history yields little performance gain, unlike in other domains. We address this inefficiency with ReVision, a method for training multimodal language models on trajectories from which redundant visual patches have been removed by a learned patch selector that compares patch representations across consecutive screenshots while preserving the spatial structure the model requires. Across three benchmarks (OSWorld, WebTailBench, and AgentNetBench), when processing trajectories with 5 history screenshots using Qwen2.5-VL-7B, ReVision reduces token usage by 46% on average while improving success rate by 3%. This is a clear efficiency gain that lets agents process longer trajectories with fewer tokens. With this improved efficiency, we revisit the role of history in CUAs and find that, once redundancy is removed, performance keeps improving as more past observations are included.

English abstract

Computer-use agents (CUAs) rely on visual observations of graphical user interfaces, where each screenshot is encoded into a large number of visual tokens. As interaction trajectories grow, the token cost increases rapidly, limiting the amount of history that can be incorporated under fixed context and compute budgets. This has resulted in little or no performance improvement from using history, unlike in other domains. We address this inefficiency by introducing ReVision, which is used to train multimodal language models on trajectories where redundant visual patches are removed using a learned patch selector that compares patch representations across consecutive screenshots while preserving spatial structure required by the model. Across three benchmarks, OSWorld, WebTailBench, and AgentNetBench, when processing trajectories with 5 history screenshots using Qwen2.5-VL-7B, ReVision reduces token usage by 46% on average while improving success rate by 3% over the no-drop baseline. This establishes a clear efficiency gain, enabling agents to process longer trajectories with fewer tokens. With this improved efficiency, we revisit the role of history in CUAs and find that performance continues to improve as more past observations are incorporated when redundancy is removed.
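
The redundancy idea can be illustrated with a fixed rule in place of the learned selector. The sketch below assumes each screenshot is already split into patch representations in a fixed grid order, and keeps only the patches whose representation moved by more than a threshold since the previous screenshot. ReVision learns its selector, so the thresholding here is a stand-in, and every name and number is hypothetical.

```python
import math


def keep_changed_patches(prev_patches, curr_patches, threshold: float) -> list[int]:
    """Indices of patches that differ from the same grid position in the previous screenshot."""
    return [
        index
        for index, (prev, curr) in enumerate(zip(prev_patches, curr_patches))
        if math.dist(prev, curr) > threshold
    ]


def token_savings(kept: list[int], total: int) -> float:
    """Fraction of patch tokens dropped for this screenshot."""
    return 1.0 - len(kept) / total


# Toy example: three patches, only the middle one changes between screenshots.
previous = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
current = [[0.0, 0.0], [1.0, 1.5], [2.0, 2.0]]
kept = keep_changed_patches(previous, current, threshold=0.1)
print(kept, token_savings(kept, total=len(current)))
```

Returning grid indices rather than a compacted list keeps each surviving patch tied to its original position, a simple way to retain the spatial structure the abstract says the model needs.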

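2605.12213 2026-06-09 cs.AI version update

Engagement Process: Rethinking the Temporal Interface Between Actions and Observations

Jialian Li, Yuchen Cao, Junhong Liu, Weiran Guo, Xutao Wang, Jiaming Song, Jiahao Zhang, Jie Chen

Affiliations * University of Science and Technology of China

AI summary: This paper proposes the Engagement Process (EP) model, which uses an explicit temporal interface to handle interactions between actions and observations on different time scales, supports multi-rate coordination and subsystem composition, exposes hidden temporal behavior, and lets policies adapt to explicit time costs.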

AI abstract (translated from Chinese)

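Correctness Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards

Tianyang Han, Hengyu Shi, Junjie Hu, Xu Yang, Zhiling Wang, Junhao Su

Affiliations * D 4 Lab; Independent Researcher

AI summary: This paper proposes the TraceLift framework, which improves reasoning quality through executor-grounded rewards and uses a rubric-based Reasoning Reward Model to assess the reliability and usefulness of reasoning traces.

Comments 36 pages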

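AI abstract (translated from Chinese)

Reinforcement learning with verifiable rewards has become a common way to improve explicit reasoning in large language models, but final-answer correctness alone cannot reveal whether a reasoning trace is faithful, reliable, or useful to the model that consumes it. To this end, we propose TraceLift, which treats reasoning as a consumable intermediate artifact. During planner training, the planner emits tagged reasoning. A frozen executor turns this reasoning into the final artifact for verifier feedback, while an executor-grounded reward shapes the intermediate trace. This reward multiplies a rubric-based Reasoning Reward Model score by the uplift measured on the same frozen executor, rewarding traces that are both high-quality and useful. To make reasoning quality directly learnable, we introduce the TRACELIFT-GROUPS dataset, built from math and code seed problems. Each example is a same-problem group containing a high-quality reference trace and several plausible flawed traces, whose localized perturbations reduce reasoning quality or solution support while preserving task relevance. Extensive experiments on code and math benchmarks show that the executor-grounded reasoning reward improves the two-stage planner-executor system, suggesting that reasoning supervision should evaluate not only whether a trace looks good but also whether it helps the model that consumes it.

English abstract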

Reinforcement learning with verifiable rewards has become a common way to improve explicit reasoning in large language models, but final-answer correctness alone does not reveal whether the reasoning trace is faithful, reliable, or useful to the model that consumes it. This outcome-only signal can reinforce traces that are right for the wrong reasons, overstate reasoning gains by rewarding shortcuts, and propagate flawed intermediate states in multi-step systems. To this end, we propose TraceLift, a planner-executor training framework that treats reasoning as a consumable intermediate artifact. During planner training, the planner emits tagged reasoning. A frozen executor turns this reasoning into the final artifact for verifier feedback, while an executor-grounded reward shapes the intermediate trace. This reward multiplies a rubric-based Reasoning Reward Model (RM) score by measured uplift on the same frozen executor, crediting traces that are both high-quality and useful. To make reasoning quality directly learnable, we introduce TRACELIFT-GROUPS, a rubric-annotated reason-only dataset built from math and code seed problems. Each example is a same-problem group containing a high-quality reference trace and multiple plausible flawed traces with localized perturbations that reduce reasoning quality or solution support while preserving task relevance. Extensive experiments on code and math benchmarks show that this executor-grounded reasoning reward improves the two-stage planner-executor system over execution-only training, suggesting that reasoning supervision should evaluate not only whether a trace looks good, but also whether it helps the model that consumes it. Our code is available at: https://github.com/MasaiahHan/TraceLift
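
The multiplicative reward described in this abstract can be written down directly once uplift is defined. The abstract does not say how uplift is measured, so the sketch assumes it is the executor's pass rate with the trace minus its pass rate without the trace, clipped at zero. That definition, the function name, and the numbers are assumptions for illustration only.

```python
def executor_grounded_reward(rm_score: float,
                             pass_rate_with_trace: float,
                             pass_rate_without_trace: float) -> float:
    """Reasoning Reward Model score multiplied by measured uplift on the frozen executor.

    Uplift is assumed here to be the pass-rate difference, clipped at zero,
    so a trace that does not help the executor earns no reward.
    """
    uplift = max(0.0, pass_rate_with_trace - pass_rate_without_trace)
    return rm_score * uplift


# A well-written trace that also helps the executor is rewarded.
print(executor_grounded_reward(0.8, 0.75, 0.25))
# A well-written trace that makes the executor worse earns nothing.
print(executor_grounded_reward(0.9, 0.20, 0.50))
```

Because the two terms are multiplied, a trace needs both a good rubric score and a measurable benefit to the executor. A high value on one term cannot compensate for a zero on the other.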
