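arXivDaily daily academic digest: synced with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.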

2605.22863 2026-06-09 cs.LG version update

Latent Cache Flow: Model-to-Model Communication Without Text

Maximillian Rossi, Prajwal Raghunath, Eugene Wu

Affiliations: University of California, Berkeley; Stanford University

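AI summary: Proposes Latent Cache Flow (LCF), which jointly translates and compresses key-value (KV) caches for efficient model-to-model communication. In differing-context settings, LCF improves Exact Match by 23% over text-based communication and is 8.5 times faster.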

Comments 6 pages, 5 figures

English abstract

LLM agents today communicate via text, which incurs considerable latency and information loss due to the need to autoregressively decode the sharer model's state and encode at the receiver model. Recent work such as Cache-to-Cache (C2C; Fu et al., 2026) seeks to exchange KV caches by learning adapters that translate sharer KV matrices to the receiver model. However, the adapters are large and expensive to train, and translate individual tokens, which requires the target context to be identical. This is unsuitable for agent communication, where the LLMs have differing context. We introduce Latent Cache Flow (LCF). To address efficiency, we observe that keys and values can be jointly translated and compressed, reducing the adapter to about 4% of C2C's size. To address differing context, we design the adapter to transmit a summary of new information that the target model does not have. Our early experiments show that a pruned 13 MB LCF adapter can be more accurate than C2C at 956 MB in shared-context settings; for different contexts, LCF improves F1 by 7.5% and Exact Match by 23% while being 8.5 times faster than text-based communication.

2604.24594 2026-06-09 cs.CL cs.AI version update

Skill Retrieval Augmentation for Agentic AI

Weihang Su, Jianming Long, Qingyao Ai, Qiaozhi He, Yichen Tang, Changyue Wang, Yiteng Tu, Yingbo Wang, Yiqun Liu

Affiliations: Department of Computer Science and Technology, Tsinghua University; ByteDance Inc.

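AI summary: Existing agent systems run short of context window and identify skills less accurately as their skill libraries grow. This work proposes the Skill Retrieval Augmentation (SRA) paradigm, which improves agent performance by dynamically retrieving from an external skill library, and builds the SRA-Bench benchmark to expose bottlenecks in skill integration.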

Lowering the Barrier to IREX Participation: Open-Source Algorithms, Toolkits, and Benchmarks for Iris Recognition

Siamul Karim Khan, Patrick J. Flynn, Adam Czajka

Affiliations: University of Notre Dame

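AI summary: This paper presents two new open-source iris recognition algorithms, with Python and IREX-compliant C++ implementations, for submission to the official IREX X program. It aims to evaluate open-source iris recognition solutions under IREX testing protocols for the first time and to provide a model C++ submission that makes it substantially easier for other teams to bring open-source methods into IREX evaluation. The new methods are two neural networks, one trained with triplet loss and batch-hard triplet mining (TripletIris) and one with ArcFace loss (ArcIris). The paper also provides open-source, IREX-compliant C++ implementations of two existing methods: an iris image filtering-based algorithm that uses human saliency-driven kernels (HDBIF), and a human-interpretable algorithm for detecting and comparing Fuchs' crypts (CRYPTS). Except for CRYPTS, which runs into time limits in 1:N search, the methods have passed official IREX X evaluation and were also evaluated on several popular academic benchmarks. Finally, the paper offers open-source iris segmentation and circle estimation models that can be used in any new iris recognition method.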

AI-translated abstract

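This paper presents two new open-source iris recognition algorithms, with Python and IREX-compliant C++ implementations, for submission to the official IREX X program. The study has two main goals: (a) to evaluate open-source iris recognition solutions under IREX testing protocols for the first time; and (b) to provide a model C++ submission that makes it substantially easier for other teams to bring open-source methods into IREX evaluation. The new methods are two neural networks, one trained with triplet loss and batch-hard triplet mining (TripletIris) and one with ArcFace loss (ArcIris). The paper also provides open-source, IREX-compliant C++ implementations of two existing methods: (a) an iris image filtering-based algorithm that uses human saliency-driven kernels (HDBIF); and (b) a human-interpretable algorithm for detecting and comparing Fuchs' crypts (CRYPTS). Except for CRYPTS, which runs into time limits in 1:N search, these methods have passed official IREX X evaluation and were also evaluated on several popular academic benchmarks: Quality-Face/Iris Research Ensemble, Warsaw-Biobase Post-Mortem Iris, CASIA-Iris-Thousand-V4, CASIA-Iris-Lamp-V4, IIT Delhi Iris Database, IIITD Contact Lens Iris Database, NDIris3D, and Notre Dame Variable Iris Image Quality Release 2. Finally, the paper offers open-source iris segmentation and circle estimation models that can be used in any new iris recognition method.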
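
As an illustration of the triplet loss with batch-hard mining mentioned for TripletIris, the following is a minimal pure-Python sketch, not the authors' implementation. The Euclidean distance, the `margin` value, and the function names are assumptions made for this example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def batch_hard_triplet_loss(embeddings, labels, margin=0.2):
    """Mean batch-hard triplet loss over all usable anchors in a batch.

    For each anchor, take the farthest same-label sample (hardest positive)
    and the closest different-label sample (hardest negative).
    The margin of 0.2 is an assumed value, not taken from the paper.
    """
    losses = []
    for i, anchor in enumerate(embeddings):
        positives = [
            euclidean(anchor, other)
            for j, other in enumerate(embeddings)
            if j != i and labels[j] == labels[i]
        ]
        negatives = [
            euclidean(anchor, other)
            for j, other in enumerate(embeddings)
            if labels[j] != labels[i]
        ]
        if not positives or not negatives:
            continue  # anchor has no valid triplet in this batch
        losses.append(max(0.0, margin + max(positives) - min(negatives)))
    return sum(losses) / len(losses) if losses else 0.0
```

When every identity in the batch is already separated by more than the margin, the loss is zero; overlapping identities produce a positive loss that pushes the hardest pairs apart.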

English abstract

NIST Iris Exchange (IREX) offers an appealing solution to evaluating new open-source iris recognition algorithms, but it presents high barriers to entry because these algorithms must be written in C++, using a specific API, and adapted to meet strict IREX speed and memory constraints. The main goal of this paper is to lower these barriers and advance open-source iris recognition large-scale evaluations by offering: (a) two new modern deep learning-based open-source iris matchers (ArcIris and TripletIris), along with their C++ IREX X-compliant implementations, which are the first open-source iris recognition methods included into the IREX X leaderboard (and thus IREX-vetted), as well as new segmentation and iris circular approximation models that can be incorporated into any new iris recognition method, and (b) a performance assessment (according to IREX X testing protocols) of all major and currently available open-source iris recognition solutions. The paper also provides Python implementations of the new ArcIris and TripletIris methods and discusses the differences one may encounter between C++ and Python implementations of the same conceptually equivalent approaches. Finally, the paper offers open-source, IREX X-compliant C++ implementations of two existing methods: (a) an iris image filtering-based algorithm utilizing human saliency-driven kernels (HDBIF), and (b) a human-interpretable algorithm for detecting and comparing Fuchs' crypts (CRYPTS). In addition to IREX X evaluation results, the paper reports the performance of all methods on major academic benchmarks: Quality-Face/Iris Research Ensemble (Q-FIRE), Warsaw-Biobase Post-Mortem Iris, CASIA-Iris-Thousand-V4, CASIA-Iris-Lamp-V4, IIT Delhi Iris Database, IIITD Contact Lens Iris Database, NDIris3D, and Notre Dame Variable Iris Image Quality Release 2 (VII-Q-R2).

2604.24199 2026-06-09 cs.SD cs.AI eess.AS eess.SP version update

Speech Enhancement Based on Drifting Models

Liang Xu, Diego Caviedes-Nozal, W. Bastiaan Kleijn, Longfei Felix Yan, Rasmus Kongsgaard Olsson

Affiliations: Victoria University of Wellington; Lincoln University; GN Advanced Science

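AI summary: This paper proposes DriftSE, a speech enhancement framework based on drifting models that formulates denoising as an equilibrium problem and performs one-step inference, enabling high-quality speech enhancement without paired data.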

Comments 6 pages, 2 figures

English abstract

We propose Speech Enhancement based on Drifting Models (DriftSE), a novel generative framework that formulates denoising as an equilibrium problem. Rather than relying on iterative sampling, DriftSE natively achieves one-step inference by evolving the pushforward distribution of a mapping function to directly match the clean speech distribution. This evolution is driven by a Drifting Field, a learned correction vector that guides samples toward the high-density regions of the clean distribution, which naturally facilitates training on unpaired data by matching distributions rather than paired samples. We investigate the framework under two formulations: a direct mapping from the noisy observation, and a stochastic conditional generative model from a Gaussian prior. Experiments on the VoiceBank-DEMAND benchmark demonstrate that DriftSE achieves high-fidelity enhancement in a single step, outperforming multi-step diffusion baselines and establishing a new paradigm for speech enhancement.

2605.20341 2026-06-09 cs.LG cs.AI cs.CR cs.PF version update

FormalASR: An End-to-End System for Converting Spoken Chinese into Formal Text

Wanyi Ning, Yinshang Guo, Haitao Qian, Jiyuan Cheng, Weiyuan Feng, Yufei Zhang

Affiliations: arXiv

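AI summary: This paper proposes FormalASR, an end-to-end model that converts spoken Chinese into formal written text. The authors build large-scale spoken-to-formal datasets and fine-tune Qwen3-ASR, achieving up to 37.4% relative CER reduction over verbatim baselines while also improving ROUGE-L and BERTScore, and providing a lightweight on-device solution.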

English abstract

Automatic speech recognition (ASR) systems are typically optimized for verbatim transcription, which preserves disfluencies, filler words, and informal spoken structures that are often unsuitable for downstream writing-oriented applications. A common workaround is a two-stage ASR+LLM pipeline for post-editing, but this design increases latency and memory cost and is difficult to deploy on-device. We present FormalASR, two compact end-to-end models (0.6B and 1.7B) that directly transcribe spoken Chinese into formal written text. To enable this setting, we build WenetSpeech-Formal and Speechio-Formal, two large-scale spoken-to-formal datasets constructed by LLM-based rewriting and quality filtering. We then fine-tune Qwen3-ASR at two scales (0.6B and 1.7B) with supervised fine-tuning. Experiments on WenetSpeech-Formal and Speechio-Formal show that FormalASR achieves up to 37.4% relative CER reduction over verbatim baselines, while also improving ROUGE-L and BERTScore. FormalASR requires no post-processing LLM at deployment time, providing a lightweight, on-device solution for spoken-to-formal transcription.
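
The headline figure in this abstract is a relative character error rate (CER) reduction. As a reminder of how such a number is typically derived, here is a minimal sketch assuming the standard definitions: CER is the character-level edit distance divided by the reference length, and relative reduction is the drop in CER divided by the baseline CER. The function names are hypothetical and no data from the paper is reproduced.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings, counted in characters."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (0 if ref_char == hyp_char else 1)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance normalized by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def relative_reduction(baseline_cer, new_cer):
    """Fraction of the baseline error removed by the new system."""
    return (baseline_cer - new_cer) / baseline_cer
```

For spoken-to-formal transcription the reference is the formal rewrite rather than the verbatim transcript, so a verbatim baseline is penalized for every filler word it faithfully keeps.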

2605.19228 2026-06-09 cs.CL cs.AI cs.IT cs.LG math.IT version update

Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution

Xiaoou Liu, Tiejin Chen, Dengjia Zhang, Yaqing Wang, Lu Cheng, Hua Wei

Affiliations: University of Science and Technology of China

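AI summary: This paper proposes Stepwise Confidence Attribution (SCA) for diagnosing multi-step reasoning failures in black-box LLMs. SCA applies the Information Bottleneck principle to score the confidence of each step in a generated reasoning trace, and experiments on mathematical reasoning and multi-hop question answering validate the approach.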

Comments Accepted by ICML 2026

English abstract

Large Language Models have achieved strong performance on reasoning tasks with objective answers by generating step-by-step solutions, but diagnosing where a multi-step reasoning trace might fail remains difficult. Confidence estimation offers a diagnostic signal, yet existing methods are restricted to final answers or require internal model access. In this paper, we introduce Stepwise Confidence Attribution (SCA), a framework for closed-source LLMs that assigns step-level confidence based only on generated reasoning traces. SCA applies the Information Bottleneck principle: steps aligning with consensus structures across correct solutions receive high confidence, while deviations are flagged as potentially erroneous. We propose two complementary methods: (1) NIBS, a non-parametric IB approach measuring consistency without graph structures, and (2) GIBS, a graph-based IB model that learns subgraphs through a differentiable mask to capture logical variability. Extensive experiments on mathematical reasoning and multi-hop question answering show that SCA reliably identifies low-confidence steps strongly correlated with reasoning errors. Moreover, using step-level confidence to guide self-correction improves the correction success rate by up to 13.5% over answer-level feedback.

2605.18856 2026-06-09 cs.LG cs.CL cs.IT math.IT version update

SPHERICAL KV: Angle-Domain Attention and Rate-Distortion Retention for Efficient Long-Context Inference

Anay Chauhan, Gurucharan Marthi Krishna Kumar, Arion Das, Amit Dhanda, Vinija Jain, Aman Chadha, Amitava Das

Affiliations: Synopsys; McGill University; IIIT Ranchi; Amazon; Meta; Apple; Pragya Lab, BITS Pilani Goa

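AI summary: Proposes Spherical KV, which uses Angle-Domain Attention (ADA) and Rate-Distortion Retention (RDR) to reduce KV cache footprint in long-context inference while preserving decode efficiency.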

English abstract

Long-context inference is increasingly constrained by the KV cache: resident memory grows with context length, and decoding becomes limited by repeated High Bandwidth Memory (HBM) streaming rather than arithmetic. Existing methods such as eviction, windowing, quantization, and offloading reduce footprint, but often leave the critical-path bottleneck only partially addressed, especially when compressed states must still be reconstructed into dense vectors during decoding. We present Spherical KV, a long-context inference method that treats KV allocation as a rate-distortion problem grounded in attention geometry for efficient decoding. The method is built on two ideas: (i) represent directional information cheaply in the decode hot loop, and (ii) allocate retention and precision according to estimated future utility. Its first component, Angle-Domain Attention (ADA), stores keys in a spherical parameterization consisting of a scalar radius and compact angle codes, and computes attention logits directly from these codes without reconstructing dense keys. This preserves a paged, block-local, fusion-friendly decode path and directly targets HBM traffic in realistic serving settings. Its second component, Rate-Distortion Retention (RDR), jointly chooses keep/drop decisions and precision tiers per token and head under a fixed budget, producing tier-homogeneous pages with lightweight metadata and coalesced reads. Together, ADA and RDR provide a deployment-oriented mechanism for reducing KV residency while preserving decode efficiency.
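
To make the idea of computing attention logits from a radius and an angle code more concrete, here is a deliberately tiny 2-D sketch. It is not the paper's ADA kernel: the uniform angle quantization, the `NUM_BINS` resolution, and the per-query lookup table are all assumptions chosen for illustration, and real attention heads are far higher-dimensional.

```python
import math

NUM_BINS = 256  # assumed code resolution (one byte per key angle)


def encode_key(key):
    """Store a 2-D key as (radius, angle_code) instead of a dense vector."""
    x, y = key
    radius = math.hypot(x, y)
    angle = math.atan2(y, x) % (2 * math.pi)
    code = int(round(angle / (2 * math.pi) * NUM_BINS)) % NUM_BINS
    return radius, code


def logits_from_codes(query, encoded_keys):
    """Compute q.k for every stored key using only radii and angle codes.

    A table of q.u(code) is built once per query, where u(code) is the unit
    vector for that code; each key then costs one lookup and one multiply,
    so no per-key dense vector is rebuilt.
    """
    qx, qy = query
    step = 2 * math.pi / NUM_BINS
    table = [qx * math.cos(c * step) + qy * math.sin(c * step) for c in range(NUM_BINS)]
    return [radius * table[code] for radius, code in encoded_keys]
```

In this toy setting the angle error is at most pi / NUM_BINS, so each logit deviates from the exact dot product by at most radius * |q| * pi / NUM_BINS.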

Atoms as Language: VQ-Atom: Semantic Discretization for Molecular Representation Learning

Takayuki Kimura

Affiliations: Atoms as Language, LLC

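AI summary: This paper proposes VQ-Atom, a semantic discretization framework for molecular representation learning that converts continuous atom-level graph representations into discrete tokens corresponding to local chemical environments, improving the learned molecular representations.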

AI-translated abstract

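Molecular representation learning has become a core approach in AI-driven drug discovery, but existing molecular tokenizations such as SMILES remain primarily syntactic and do not naturally align with chemically meaningful substructures. In this paper, we introduce VQ-Atom, a semantic discretization framework that converts continuous atom-level graph representations into discrete tokens corresponding to local chemical environments. Using graph neural network embeddings and vector quantization, atoms are assigned to codebook entries that represent chemically meaningful atomic contexts. These discrete tokens define a molecular language suited to Transformer-based pretraining. We evaluate VQ-Atom on protein-ligand interaction prediction under a protein-cold split, without relying on 3D structural information. Experimental results show that VQ-Atom consistently improves predictive performance over conventional tokenization methods, indicating that semantically grounded discretization can substantially enhance molecular representation learning. Our findings suggest that tokenization design itself plays a key role in enabling effective language modeling in the chemical domain.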

English abstract

Large language models succeed by combining large-scale pretraining with meaningful discrete tokens. In molecular machine learning, SMILES is widely used as a token representation, but it is primarily a linearization format for molecular graphs rather than a semantic decomposition of chemistry. We propose VQ-Atom, a semantic tokenization framework that assigns discrete atom-level tokens based on local chemical environments via vector quantization. Unlike SMILES tokens, VQ-Atom tokens encode graph-local chemical context and are aligned with molecular structure. On protein-cold drug-target interaction prediction using the KIBA dataset, VQ-Atom substantially improves global ranking performance, achieving AUROC of 0.79 while substantially outperforming both SMILES-based and continuous molecular representations under an identical downstream architecture. Furthermore, VQ-Atom enables approximately 3 times faster downstream training than continuous atom-level representations by replacing per-atom continuous features with reusable discrete tokens. These results suggest that molecular tokenization is not merely a preprocessing step, but a central design choice. In particular, well-structured tokens can encode substantial chemical semantics, reducing the burden on downstream learning. VQ-Atom can be interpreted as defining a molecular language, where tokens correspond to chemically meaningful atomic environments, suggesting that token design may constitute an additional axis of machine learning research alongside architecture, objectives, and optimization.
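
The vector quantization step described here can be pictured as a nearest-codebook lookup. The sketch below is a generic illustration under that assumption, not VQ-Atom's actual code: the embeddings would come from a graph encoder, the codebook would be learned, and the function names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(embedding, codebook):
    """Return the index of the codebook entry closest to the embedding."""
    return min(range(len(codebook)), key=lambda k: squared_distance(embedding, codebook[k]))


def tokenize_atoms(atom_embeddings, codebook):
    """Map each atom's continuous embedding to a discrete token id."""
    return [quantize(embedding, codebook) for embedding in atom_embeddings]
```

Because each atom collapses to an integer id, downstream models can reuse a shared embedding table instead of consuming per-atom continuous features.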

2605.16551 2026-06-09 cs.CL version update

FRWKV+: Period-Aware Adaptive Gating for Frequency-Domain Linear Time Series Forecasting

Qingyuan Yang, Dongyue Chen, Da Teng, Junhua Xiao, Jiaji Pan, Shizhuo Deng

Affiliations: College of Information Science and Engineering, Northeastern University; Foshan Graduate School of Innovation, Northeastern University; National Frontiers Science Center for Industrial Intelligence and Systems Optimization

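AI summary: This paper proposes FRWKV-Plus, which adds a cross-branch spectral gate and a trust-gated residual correction to improve the accuracy and efficiency of frequency-domain time series forecasting; experiments show competitive performance on seven standard benchmarks.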

English abstract

Accurate and efficient long-term multivariate time series forecasting requires capturing recurring temporal structure while keeping inference cheap across many variables and horizons. Frequency-space models represent long-range and periodic variation compactly, but they typically process the real and imaginary spectral components as weakly coupled streams and treat periodic cues as ordinary input features, even when such cues are unreliable. This paper proposes FRWKV-Plus, a lightweight periodic-aware frequency-space forecasting model built on the efficient FRWKV backbone. FRWKV-Plus introduces a cross-branch spectral gate that reweights each spectral branch using a summary of its sibling branch, and a trust-gated residual correction that converts compact within-period context into a bounded, sign-flexible adjustment of these gates under a learned, data-dependent trust score. By construction, the correction is identity-preserving at initialization and strictly bounded, so periodic evidence can refine but never dominate or invert the base interaction. On seven standard benchmarks, FRWKV-Plus is consistently competitive with strong linear, frequency-domain, recurrent-style, and Transformer-based forecasters while preserving the lightweight profile of the backbone. Controlled three-seed ablations show that each component contributes, that the benefit is modest on strongly periodic data and pronounced on the harder Exchange and ILI datasets, and that the within-period context is the most influential single component. The implementation is publicly available at https://github.com/yangqingyuan-byte/FRWKV-plus.
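
One way to obtain a correction that is identity-preserving at initialization, strictly bounded, and unable to invert the base gate is a trust-scaled `tanh` multiplier. The sketch below is an assumed formulation for illustration only; the abstract does not give the exact equations, and `max_shift`, the zero-initialized `weights`, and the function names are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def corrected_gate(base_gate, context, weights, trust_logit, max_shift=0.5):
    """Apply a bounded, sign-flexible adjustment to a spectral gate.

    factor = 1 + max_shift * trust * tanh(weights . context)
    With zero weights the factor is exactly 1 (identity at initialization).
    With max_shift < 1 the factor stays within [1 - max_shift, 1 + max_shift],
    so the correction can refine the gate but never flip its sign.
    """
    score = sum(w * c for w, c in zip(weights, context))
    trust = sigmoid(trust_logit)  # data-dependent trust in (0, 1)
    factor = 1.0 + max_shift * trust * math.tanh(score)
    return base_gate * factor
```

A low trust score shrinks the adjustment toward zero, which matches the stated goal of discounting periodic cues when they are unreliable.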

2605.15491 2026-06-09 cs.LG cs.AI cs.PF version update

Language Generation as Optimal Control: Closed-Loop Diffusion in Latent Control Space

ZiYi Dong, Yuliang Huang, Weijian Deng, Xiangyang Ji, Liang Lin, Pengxu Wei

Affiliations: University of Science and Technology of China

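AI summary: This paper reformulates language generation as a stochastic optimal control problem, analyzes autoregressive and diffusion models from a unified theoretical perspective, explains their limitations, and proposes a closed-loop controller based on Flow Matching for efficient text generation.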

English abstract

This work reformulates language generation as a stochastic optimal control problem, providing a unified theoretical perspective to analyze autoregressive and diffusion models and explain their limitations (Efficiency-Fidelity Paradox, Irreversibility Error Propagation, Optimization Tractability and Fidelity) in terms of a combination of trajectory singularity, adjoint state vanishing, and gradient absence. To address these issues, we approximate the solution to the Hamilton-Jacobi-Bellman (HJB) equation, yielding an optimal policy that acts as a closed-loop controller. To bypass the intractability of directly solving the HJB PDE, we employ Flow Matching as the optimal trajectory solver within the rectified latent control space. This allows our Manta-LM with Global Integral Operator to approximate the global vector field, effectively realizing a model that simultaneously achieves high-fidelity text generation and efficient, low-cost parallel sampling. Empirically, our method achieves strong performance on language modeling and conditional generation tasks, while exhibiting improved stability, efficiency, and controllability.
