arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.17991 2026-05-19 cs.SD cs.AI

Stable Audio 3

Zach Evans, Julian D. Parker, Matthew Rice, CJ Carr, Zack Zukowski, Josiah Taylor, Jordi Pons

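AI summary: Stable Audio 3 is a family of fast latent diffusion models for variable-length audio generation and editing; efficient latent-space generation and adversarial post-training improve both generation quality and speed.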

Comments Training code: https://github.com/Stability-AI/stable-audio-tools Inference and weights: http://github.com/Stability-AI/stable-audio-3

Abstract

Stable Audio 3 is a family of fast latent diffusion models (small, medium, large) for variable-length audio generation and editing. Since our models can generate several minutes of audio, variable-length generation is key to avoiding the cost of producing full-length generations for short sounds. We also support inpainting, enabling targeted audio editing and the continuation of short recordings. Our latent diffusion models operate on top of a novel semantic-acoustic autoencoder that projects audio into a compact latent space, enabling efficient diffusion-based generation while preserving audio fidelity and encouraging semantic structure in the latent. Finally, we run adversarial post-training to both accelerate inference and improve generation quality, reducing the number of inference steps while improving fidelity and prompt adherence. Stable Audio 3 models are trained on licensed and Creative Commons data and generate music and sounds in less than 2 s on an H200 GPU and within a few seconds on a MacBook Pro M4. We release the weights of the small and medium models, which can run on consumer-grade hardware, together with their training and inference pipelines.

2605.17990 2026-05-19 cs.CV cs.HC

Low Latency Gaze Tracking via Latent Optical Sensing

Yidan Zheng, Matheus Souza, Kaizhang Kang, Qiang Fu, Hadi Amata, Wolfgang Heidrich

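AI summary: The paper presents a real-time gaze tracking system that acquires task-relevant latent features directly with a fully passive optical encoder. A microlens array with a co-designed binary chromium mask performs spatially multiplexed optical encoding, yielding a compact set of measurements sufficient for gaze estimation, which reduces computational overhead and latency.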

Abstract

We present a real-time gaze tracking system that directly acquires task-relevant latent features using a fully passive optical encoder. Instead of forming and processing full-resolution images, our approach leverages a microlens array with a co-designed binary chromium mask to perform spatially multiplexed optical encoding, producing a compact set of measurements sufficient for gaze estimation. By integrating sensing and feature extraction in the optical domain, the proposed system eliminates the need for high-bandwidth image readout and substantially reduces computational overhead. The encoded measurements are captured by a 4 x 4 phototransistor array and mapped to gaze direction using a lightweight neural network. Our proof-of-concept prototype enables an end-to-end sensing-to-inference latency of 3.4 ms, outperforming published research systems. We demonstrate the effectiveness of our approach on both simulated and real-world data, achieving competitive gaze estimation accuracy while significantly improving latency and energy efficiency compared to conventional camera-based pipelines. This work highlights the potential of task-driven optical sensing for ultra-low-latency, computationally efficient human-computer interaction systems.

2605.17989 2026-05-19 cs.CL cs.AI

Predictive Prefetching for Retrieval-Augmented Generation

Wuyang Zhang, Shichao Pei

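AI summary: The paper proposes an asynchronous retrieval framework that predicts when retrieval should be triggered and what information is needed, reducing latency and improving generation efficiency while maintaining answer quality.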

Comments Accepted by Forty-third International Conference on Machine Learning ICML 2026

Abstract

Retrieval-Augmented Generation (RAG) improves factual grounding in large language models but suffers from substantial latency due to synchronous retrieval. While recent work explores asynchronous retrieval, existing approaches rely on heuristic coordination between retrieval and generation and assume stable information demands during decoding that often break in complex, multi-domain settings. In this paper, we propose an advanced asynchronous retrieval framework that enables predictive prefetching aligned with evolving information needs. The framework explicitly predicts when retrieval should be triggered and what information should be retrieved using three components, a retrieval predictor, a context monitor, and a query generator, by exploiting semantic precursors in generation dynamics that emerge several tokens before uncertainty becomes critical. Experiments on multiple benchmarks demonstrate up to 43.5% end-to-end latency reduction and 62.4% improvement in time-to-first-token, while maintaining answer quality comparable to synchronous RAG baselines.
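
As a rough illustration of the prefetching idea in this abstract, the sketch below overlaps a slow retrieval call with token-by-token decoding using Python's `asyncio`. Everything in it is a stand-in: `retrieve`, `generate_tokens`, and `should_prefetch` are hypothetical names, and the fixed trigger step replaces the paper's learned retrieval predictor, context monitor, and query generator, which the abstract does not specify in detail.

```python
import asyncio

async def retrieve(query: str) -> str:
    """Stand-in for a slow retriever (hypothetical)."""
    await asyncio.sleep(0.05)
    return f"doc for '{query}'"

async def generate_tokens(n: int):
    """Stand-in decoder that yields one token at a time."""
    for i in range(n):
        await asyncio.sleep(0.01)
        yield f"tok{i}"

def should_prefetch(step: int) -> bool:
    """Hypothetical predictor: fires a few tokens before the need arises."""
    return step == 2

async def run(n_tokens: int = 8, need_at: int = 6) -> list:
    output, pending, step = [], None, 0
    async for tok in generate_tokens(n_tokens):
        if pending is None and should_prefetch(step):
            # Launch retrieval in the background; decoding continues meanwhile.
            pending = asyncio.create_task(retrieve(f"query@{step}"))
        if step == need_at and pending is not None:
            # The prefetched result is ready or nearly ready by now,
            # so this await blocks for far less than a full retrieval.
            output.append(await pending)
        output.append(tok)
        step += 1
    return output

result = asyncio.run(run())
```

In a synchronous pipeline the decoder would stall for the full retrieval time at `need_at`; here most of that wait is hidden behind the tokens generated between the trigger and the point of use.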

2605.17985 2026-05-19 cs.LG cs.AI

SAFE-SVD: Sensitivity-Aware Fidelity-Enforcing SVD for Physics Foundation Models

Chengjie Hong, Feixiang He, Yiheng Zeng, Lulu Kang, He Wang

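AI summary: The paper proposes a new method for compressing physics foundation models that explicitly models loss-aware layer sensitivity during compression to preserve accuracy and physical fidelity; experiments show substantial compression gains across multiple models and datasets.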

Abstract

We propose a new method for compressing physics foundation models (PFMs), a new trend in AI for Science. While model compression is essential for reducing memory use and accelerating inference in large foundation models, it remains under-explored for PFMs, where preserving physical fidelity is crucial. The challenge lies in the functional nature of physics data, where partial derivatives encode spatiotemporal dynamics and exhibit high sensitivity to compression. Conventional compression methods ignore this structure, often causing severe performance degradation or failure. To address this, we introduce a sensitivity-aware fidelity-enforcing compression framework that explicitly models loss-aware layer sensitivity in the output function space during compression. This provides a new route to compressing scientific foundation models while preserving accuracy and physical fidelity. Experiments show substantial gains over existing methods across multiple models and datasets, achieving significantly higher compression ratios while maintaining accuracy, in some cases by orders of magnitude. More broadly, the work potentially leads to a new subfield of efficient, deployable, and sustainable scientific foundation models in AI for Science.
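
For orientation, the baseline that SVD-based compression starts from is the plain truncated SVD of a weight matrix $W \in \mathbb{R}^{m \times n}$. The block below states only that standard construction and its Frobenius-norm error (the Eckart-Young result); it is not the sensitivity-aware objective of the paper, which the abstract does not spell out.

```latex
W = U \Sigma V^{\top}, \qquad
W_r = U_r \Sigma_r V_r^{\top}, \qquad
\lVert W - W_r \rVert_F^{2} = \sum_{i > r} \sigma_i^{2}
```

Storing the factors $U_r \Sigma_r$ and $V_r$ takes $r(m+n)$ numbers instead of $mn$, so the compression ratio is $mn / (r(m+n))$. This error measures weights only; the abstract's point is that for physics models the error that matters is in the output function space.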

2605.17980 2026-05-19 cs.CV

Learning to Balance: Decoupled Siamese Diffusion Transformer for Reference-Based Remote Sensing Image Super-Resolution

Bin Luo, Runmin Dong, Zhaoyang Luo, Jinxiao Zhang, Jiyao Zhao, Fan Wei, Haohuan Fu

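AI summary: The paper proposes DS-DiT, a Decoupled Siamese Diffusion Transformer that decouples low-resolution and reference interactions at the attention level, addressing both over-reliance on and underutilization of reference information in reference-based super-resolution and improving generation quality.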

Abstract

Diffusion-based methods demonstrate significant potential for remote sensing image super-resolution at large scaling factors, particularly in reference-based super-resolution (RefSR) where high-resolution reference images provide critical fine-grained texture priors. However, existing methods often suffer from a trade-off between over-reliance on reference information, which leads to texture artifacts, and underutilization, which results in insufficient detail recovery. To address these issues, we propose DS-DiT, a Decoupled Siamese Diffusion Transformer method that decouples low-resolution and reference interactions at the attention level. By enabling low-resolution structural priors and reference texture information to interact independently with the noisy latent, the framework effectively mitigates inter-source competition. Furthermore, to compensate for the limited local modeling ability of global attention, we introduce a Patch-Level Weights (PLW) module that adaptively modulates the fusion of conditional sources. In addition, this siamese architecture facilitates an autoguidance strategy during inference, which enhances reconstruction by exploiting the prediction discrepancy between strong and weak reference conditions. This approach boosts generation quality without additional training. Experimental results across multiple datasets and scaling factors demonstrate that DS-DiT outperforms existing methods in both quantitative metrics and visual fidelity.
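
The autoguidance step mentioned in this abstract can be pictured with the generic guidance rule below, which extrapolates from a weakly conditioned prediction toward a strongly conditioned one. This is a minimal sketch under assumptions: the function name `autoguide`, the scalar weight `w`, and the use of flat Python lists are all illustrative, and the abstract does not give DS-DiT's exact formula.

```python
def autoguide(strong, weak, w=1.5):
    """Extrapolate from the weak prediction toward the strong one.

    strong, weak: per-element denoiser predictions under strong and weak
    reference conditioning (hypothetical inputs).
    w = 1 returns the strong prediction; w > 1 amplifies the discrepancy.
    """
    if len(strong) != len(weak):
        raise ValueError("predictions must have the same length")
    return [wk + w * (st - wk) for st, wk in zip(strong, weak)]
```

Because the rule only combines two forward passes at inference time, it needs no additional training, which matches the claim in the abstract.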

2605.17978 2026-05-19 cs.CL

AutoVecCoder: Teaching LLMs to Generate Explicitly Vectorized Code

Shangzhan Li, Xinyu Yin, Xuanyu Jin, Ye He, Yuxin Zhou, Yuxuan Li, Xu Han, Wanxiang Che, Qi Shi, Ting Liu, Maosong Sun

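AI summary: The paper proposes the AutoVecCoder framework, whose VecPrompt and VecRL components enable LLMs to perform automated explicit vectorization, achieving state-of-the-art performance on the SSE and AVX subsets of SimdBench and surpassing traditional auto-vectorization.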

Abstract

Vectorization via Single Instruction, Multiple Data (SIMD) architectures is a cornerstone of high-performance computing. To fully exploit hardware potential, developers often resort to explicit vectorization using intrinsics, as compiler-based auto-vectorization frequently yields suboptimal results due to conservative static analysis. While Large Language Models (LLMs) have demonstrated remarkable proficiency in general code generation, they struggle with explicit vectorization due to the scarcity of high-quality corpora and the strict semantic constraints of low-level hardware instructions. In this paper, we propose AutoVecCoder, a novel framework designed to empower LLMs with the capability of automated explicit vectorization. AutoVecCoder integrates two core components: VecPrompt, an automated data synthesis pipeline to inject domain-specific intrinsic knowledge; and VecRL, a reinforcement learning framework that aligns code generation with execution efficiency. AutoVecCoder-8B trained by this framework achieves state-of-the-art performance on the SSE and AVX subsets of SimdBench and, in some cases, generates implementations surpassing standard -O3 optimizations, effectively overcoming the inherent bottlenecks of traditional automated vectorization.

2605.17976 2026-05-19 cs.AI math.OC

Unleashing LLMs in Bayesian Optimization: Preference-Guided Framework for Scientific Discovery

Xinzhe Yuan, Zhuo Chen, Jianshu Zhang, Huan Xiong, Nanyang Ye, Yuqiang Li, Qinying Gu

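AI summary: The paper proposes LGBO, an LLM-based Bayesian optimization framework that continuously integrates the semantic reasoning of LLMs into the optimization loop, improving optimization efficiency and convergence speed in scientific discovery.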

Comments Published as a conference paper at ICLR 2026. 10 pages main paper, 21 pages appendix, 26 figures

Journal ref
International Conference on Learning Representations (ICLR), 2026

Abstract

Scientific discovery is increasingly constrained by costly experiments and limited resources, underscoring the need for efficient optimization in AI for science. Bayesian Optimization (BO), though widely adopted for balancing exploration and exploitation, often exhibits slow cold-start performance and poor scalability in high-dimensional settings, limiting its applicability in real-world scientific problems. To overcome these challenges, we propose LLM-Guided Bayesian Optimization (LGBO), the first LLM preference-guided BO framework that continuously integrates the semantic reasoning of large language models (LLMs) into the optimization loop. Unlike prior works that use LLMs only for warm-start initialization or candidate generation, LGBO introduces a region-lifted preference mechanism that embeds LLM-driven preferences into every iteration, shifting the surrogate mean in a stable and controllable way. Theoretically, we prove that LGBO does not perform significantly worse than standard BO in the worst case, while achieving significantly faster convergence when preferences align with the objective. Empirically, LGBO consistently outperforms existing methods across diverse dry benchmarks in physics, chemistry, biology, and materials science. Most notably, in a new wet-lab optimization of Fe-Cr battery electrolytes, LGBO attains 90% of the best observed value within 6 iterations, whereas standard BO and existing LLM-augmented baselines require more than 10. Together, these results suggest that LGBO offers a promising direction for integrating LLMs into scientific optimization workflows.

2605.17969 2026-05-19 cs.CV

Generation Navigator: A State-Aware Agentic Framework for Image Generation

Jinming Liu, Ruoyu Feng, Yuqi Wang, Wenjun Zeng, Xin Jin

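AI summary: The paper proposes Generation Navigator, a state-aware agentic framework that reformulates image generation as a state-conditioned action-making problem. Its PRE-GRPO algorithm addresses the credit assignment problem that arises in reinforcement learning training, improving generation quality and reasoning accuracy.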

Abstract

Despite rapid advances in text-to-image generation, faithfully realizing user intent remains challenging, often requiring manual multi-turn trial and error. To automate this process, existing systems rely on either simple prompt rewriting or closed-loop agents driven by hand-crafted rules, rather than learning to adapt actions to the evolving generation process. In this paper, we reformulate image generation as a state-conditioned action-making problem and propose Generation Navigator, a multi-turn T2I agent that learns to dynamically steer the generation trajectory and output the next action. However, training this agent via reinforcement learning introduces a critical credit assignment challenge: naively rewarding a trajectory based solely on a single state assigns equal credit to all actions in the rollout, ignores the quality dynamics across turns, and fails to distinguish actions that improve the trajectory from those that degrade it or waste turns without progress. We resolve this with PRE-GRPO (Peak-Retention-Efficiency Group Relative Policy Optimization), a trajectory-level reinforcement learning objective that explicitly rewards discovering a high-quality image (Peak), avoiding subsequent quality degradation across turns (Retention), and minimizing unnecessary turns (Efficiency). Experiments show substantial improvements across benchmarks, reaching a WISE score of 0.90 and 79.06% reasoning accuracy on T2I-ReasonBench.
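
To make the three reward terms in this abstract concrete, here is one way a trajectory-level score of this kind could be assembled from per-turn quality scores, followed by the group-relative normalization that GRPO-style methods apply. It is a sketch under assumptions: the linear combination, the weights `alpha` and `beta`, and the function names are hypothetical, and the abstract does not give the actual PRE-GRPO objective.

```python
import statistics

def pre_reward(qualities, alpha=0.5, beta=0.05):
    """Score one multi-turn trajectory from its per-turn quality scores."""
    if not qualities:
        raise ValueError("trajectory must contain at least one turn")
    peak = max(qualities)                 # Peak: best image found in the rollout
    retention_gap = peak - qualities[-1]  # Retention: quality lost by the final turn
    # Efficiency: turns spent after the peak was first reached
    wasted_turns = len(qualities) - 1 - qualities.index(peak)
    return peak - alpha * retention_gap - beta * wasted_turns

def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize rewards within a rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

With these definitions, a rollout that stops at its best image scores higher than one that reaches the same peak and then degrades, which is the distinction a single final-state reward cannot make.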

2605.17968 2026-05-19 cs.LG

Function graph transformers universally approximate operators between function spaces

Takashi Furuya, David Mis, Ivan Dokmanić, Maarten V. de Hoop, Matti Lassas

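AI summary: The paper studies the approximation of nonlinear operators between function spaces by transformers. It introduces function graph transformers, built on graph measures, whose outputs remain single-valued functions, and proves their universality for approximating broad classes of nonlinear operators.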

Abstract

We study the approximation of nonlinear operators between function spaces by transformers. Our approach is to lift functions to measures supported on their graphs and leverage a recently introduced measure-theoretic view of transformers. A function $h$ is represented by its graph measure $\gamma_h$, with finite tokens $\{(x_j,h(x_j))\}_{j=1}^N$ being its empirical approximations. We show that this framework elegantly models discretization refinement via convergence of measures and provides a natural setting for operator learning. Within this framework, we introduce function graph transformers, a graph-preserving subclass of measure-theoretic transformers that maps graph measures to graph measures, which is to say that outputs remain single-valued functions. Crucially, this additional structure does not reduce generality: we prove that the resulting graph-preserving maps can be approximated by finite compositions of standard softmax self-attention layers and pointwise MLPs, yielding universal approximation results for broad classes of nonlinear operators. Unlike existing theoretical approaches to operator learning with transformers, the measure-theoretic framework also accommodates regularized negative-order Sobolev inputs for which discretization invariance is particularly challenging, as well as query points on different output domains. Overall, function graph transformers provide a continuum viewpoint and mathematical toolkit for transformer-based operator learning, clarifying the roles of positional encodings, graph structure, regularization, and ensuring consistency across discretizations.
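
One plausible way to write the graph measure and its token-based approximation is given below. The reference probability measure $\mu$ on the input domain and the uniform weights $1/N$ are assumptions made for illustration; the abstract names $\gamma_h$ and the tokens but does not state these choices.

```latex
\gamma_h = (\mathrm{id}, h)_{\#}\,\mu ,
\qquad
\gamma_h^{N} = \frac{1}{N} \sum_{j=1}^{N} \delta_{(x_j,\, h(x_j))}
```

Here $(\mathrm{id}, h)_{\#}\mu$ is the pushforward of $\mu$ under $x \mapsto (x, h(x))$ and $\delta$ is a Dirac mass. Under these assumptions, refining the discretization corresponds to $\gamma_h^{N}$ converging to $\gamma_h$ as measures when the points $x_j$ sample $\mu$ increasingly well.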

2605.17967 2026-05-19 cs.AI

Reconciling Contradictory Views on the Effectiveness of SFT in LLMs: An Interaction Perspective

Junpeng Zhang, Lei Cheng, Guoxi Zhang, Hua Cai, Qing Xu, Quanshi Zhang

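AI summary: From an interaction perspective, the paper examines why SFT has inconsistent effects on LLMs. It finds that SFT mainly removes noise-like interactions but rarely acquires reliable new ones, and that the denoising stage is brief, after which continued fine-tuning tends to introduce overfitted interactions.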

Abstract

This paper explores a scientific question in supervised fine-tuning (SFT): why SFT is broadly effective for small-scale deep neural networks, yet can produce inconsistent or even detrimental effects when applied to large language models (LLMs). Recent advances in interaction-based explanations suggest that interactions between words/tokens provide a faithful metric for quantifying the inference patterns encoded by LLMs. We find that the evolution of interactions during SFT can effectively explain the inconsistent effectiveness of SFT for LLMs. Specifically, we find that (1) SFT primarily removes noise-like interactions, while rarely acquiring reliable new interactions. (2) This denoising stage is extremely brief, after which continued fine-tuning tends to introduce overfitted interactions. We validate these findings across multiple LLMs and datasets. Our findings provide new insights into early stopping and offer practical guidance for LLM training.

2605.17958 2026-05-19 cs.LG cs.PL

Enhancing the Code Reasoning Capabilities of LLMs via Consistency-based Reinforcement Learning

Zhanyue Qin, Jia Feng, Yibo Lyu, Yun Peng, Dianbo Sui, Cuiyun Gao, Qing Liao

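AI summary: The paper proposes CodeThinker, a consistency-driven reinforcement learning framework that improves the code reasoning ability of LLMs; experiments show strong results on multiple benchmarks and accuracy gains on downstream mathematical reasoning and code reasoning tasks.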

Comments Under review

Abstract

Code reasoning refers to the task of predicting the output of a program given its source code and specific inputs. It can measure the reasoning capability of large language models (LLMs) and also benefit downstream tasks such as code generation and mathematical reasoning. Existing work has verified the effectiveness of reinforcement learning on the task. However, these methods design rewards solely based on final outputs or coarse-grained signals, and neglect the inherent consistency of the stepwise reasoning process in the task. Therefore, these methods often result in sparse rewards or reward hacking, which prevents the enhanced learning capabilities from being fully realized. To alleviate these issues, we propose CodeThinker, a consistency-driven reinforcement learning framework for code reasoning. Specifically, CodeThinker has three key components: (1) a stepwise reasoning-aware model training module, which utilizes a consistency tracing paradigm as a template to synthesize training data that captures the stepwise reasoning process; (2) a dynamic beam sampling strategy, which aims to improve the quality of sampled outputs under a fixed sampling budget; and (3) a consistency reward mechanism that can effectively alleviate reward hacking. Experiments on three popular benchmarks show that CodeThinker achieves state-of-the-art performance across multiple LLMs. For instance, it outperforms the strongest baseline by 4.3% in accuracy when deployed on Qwen2.5-Coder-7B-Instruct. We also validate the effectiveness of CodeThinker on downstream tasks. Results show that, without additional training, CodeThinker obtains average accuracy gains of 5.33 and 3.11 percentage points on mathematical reasoning and code reasoning tasks covering 17 programming languages, respectively.

2605.17954 2026-05-19 cs.CV cs.AI cs.LG

A More Word-like Image Tokenization for MLLMs

Hyun Lee, Hyemin Jeong, Yejin Kim, Hyungwook Choi, Hyunsoo Cho, Soo Kyung Kim, Joonseok Lee

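AI summary: The paper proposes Disentangled Visual Tokenization (DiVT), which clusters patch embeddings into semantic units so that each token corresponds to a distinct visual concept, improving the performance and efficiency of multimodal models.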

Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

Abstract

Modern multimodal large language models (MLLMs) typically keep the language model fixed and train a visual projector that maps the pixels into a sequence of tokens in its embedding space, so that images can be presented in essentially the same form as text. However, the language model has been optimized to operate on discrete, semantically meaningful tokens, while prevailing visual projectors transform an image into a long stream of continuous and highly correlated embeddings. This causes the visual tokens to behave differently from the word-like units that LLMs are originally trained to understand. We propose a novel Disentangled Visual Tokenization (DiVT) that clusters patch embeddings into coherent semantic units, so each token corresponds to a distinct visual concept instead of a rigid grid cell. DiVT further adapts its token budget to image complexity, providing an explicit accuracy-compute trade-off while modifying neither the vision encoder nor the language model. Across diverse multimodal benchmarks, DiVT matches or surpasses baselines with significantly fewer visual tokens, demonstrating robustness under limited token budgets and significantly reducing memory cost and latency while making visual inputs more compatible with LLMs. Our code is available at https://github.com/snuviplab/DiVT.

2605.17952 2026-05-19 cs.CV

Counting Machine Parts

Benedict Florance Arockiaraj, Elizabeth Dinella, Ankit Billa, Ajay Anand

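AI summary: The paper studies the problem of counting machine parts. It extends FamNet with an additional loss term, compares against baselines including a traditional image processing pipeline, instance segmentation, and density map estimation on the given dataset, and reaches an MAE of 1.96.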

Abstract

Counting objects in an image is a task applicable across many domains. For instance, crowd counting, inventory counting, and cell counting have been the focus of recent research. The major challenges in estimating the count of objects include overlapping objects, object scale issues, occlusions, and varying lighting conditions. In this report, we explore the problem of counting machine washer parts. Our technique is an extension of FamNet with an additional loss component, trained on the given dataset. We compare to three baseline methods: a traditional image processing pipeline, instance segmentation, and density map estimation. We evaluate the performance of these algorithms by computing the Mean Absolute Error (MAE) and the Root Mean Squared Error (RMSE) between the true object counts and the model outputs. Our approach achieves a performance of 1.96 MAE.
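
The two evaluation metrics named in this abstract have standard definitions, shown here as a small self-contained sketch. The function names and the example counts in the usage note are illustrative and do not come from the report.

```python
import math

def mae(true_counts, predicted_counts):
    """Mean Absolute Error between true and predicted object counts."""
    if len(true_counts) != len(predicted_counts) or not true_counts:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [abs(t - p) for t, p in zip(true_counts, predicted_counts)]
    return sum(errors) / len(errors)

def rmse(true_counts, predicted_counts):
    """Root Mean Squared Error; penalizes large miscounts more than MAE."""
    if len(true_counts) != len(predicted_counts) or not true_counts:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(true_counts, predicted_counts)]
    return math.sqrt(sum(squared) / len(squared))
```

For example, with made-up true counts `[10, 20]` and predictions `[13, 16]`, the per-image errors are 3 and 4, so `mae` returns 3.5 while `rmse` returns the square root of 12.5, slightly higher because the larger error is weighted more.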

2605.17949 2026-05-19 cs.CV

SkyNative: A Native Multimodal Framework for Remote Sensing Visual Evidence Reasoning

Xiao Yang, Ronghao Fu, Zhiwen Lin, Zhuoran Duan, Jiashun Zhu, Jiasen Hu, Lang Sun, Weipeng Zhang, Jiaqi Liu, Xu Na, Haoran Liu, Weijie Zhang, Bo Yang

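AI summary: The paper proposes SkyNative, a native multimodal framework that removes the pretrained visual backbone and represents images directly as raw patch tokens in the language-model token space, improving fine-grained spatial reasoning on remote sensing imagery.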

Abstract

Remote sensing vision-language models commonly rely on pretrained visual encoders to convert images into semantic features before language-model reasoning. While effective for scene-level understanding, this pipeline may prematurely compress local visual evidence, making fine-grained spatial reasoning vulnerable to language priors, especially in ultra-high-resolution remote sensing imagery. We present SkyNative, a native multimodal framework for remote sensing that adopts an encoder-free architecture, removing the pretrained visual backbone to directly represent images as raw patch tokens in the language-model token space. To reconcile low-level visual patches with textual tokens, SkyNative introduces a modality-aware decoupling mechanism that uses modality-specific parameters within a unified autoregressive backbone. We further introduce a visual reliance benchmark that diagnoses whether models ground their answers in image evidence through progressive visual degradation and misleading textual prompts. Across standard remote sensing understanding tasks and large-format spatial reasoning evaluations, SkyNative shows stronger image-grounded perception and improved robustness against prompt-induced language priors. These results suggest that native patch-level multimodal modeling is a promising direction for reliable remote sensing vision-language reasoning.

2605.17938 2026-05-19 cs.LG cs.AI stat.ML

Training data attribution in diffusion models via mirrored unlearning and noise-consistent skew

Joan Serrà, Dipam Goswami, Fabio Morreale, Wei-Hsiang Liao, Yuki Mitsufuji

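AI summary: The paper proposes a method based on mirrored unlearning and noise-consistent skew to make training data attribution for diffusion models more reliable and robust. It substantially outperforms existing methods on different datasets, and the authors also analyze the overlap of influential instances across generated items and suggest relevance to tasks that compare diffusion losses.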

Comments 21 pages, 5 figures, 9 tables (includes appendix)

Abstract

Training data attribution (TDA) should enable generative model interpretability and foster a variety of related downstream tasks. Nonetheless, current TDA approaches lack reliability and robustness, preventing their adoption in real-world setups. In this paper, we take a decisive step towards more reliable and robust TDA for diffusion models. We propose to perform TDA with mirrored unlearning and noise-consistent skew (MUCS). The idea is to fine-tune a second model with bounded mirrored gradient ascent, and to measure the normalized skew of this model with respect to the original one using consistent noise samples. We show that, while being conceptually simple and generic, MUCS systematically outperforms existing methods on three different datasets by a large margin. We additionally study the effect that core design choices have on final performance, and analyze novel aspects regarding the overlap of influential instances across generated items and the potential of ensembling TDA approaches. We believe that our findings may have broader implications for more general unlearning setups, as well as for tasks requiring the comparison of diffusion losses.

2605.17936 2026-05-19 cs.CL cs.LG

Universal Adversarial Triggers

Benedict Florance Arockiaraj, Alexander Feng, Jianxiong Cai, Xiaoyu Cheng

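AI summary: The paper proposes a new technique that combines part-of-speech filtering with a perplexity-based loss function to generate sensible triggers that are closer to natural phrases, making adversarial attacks harder to detect and supporting the development of robust models.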

Abstract

Recent works have illustrated that modern NLP models trained for diverse tasks ranging from sentiment analysis to language generation succumb to universal adversarial attacks, a class of input-agnostic attacks where a common trigger sequence is used to attack the model. Although these attacks are successful, the triggers generated by such attacks are ungrammatical and unnatural. Our work proposes a novel technique combining part-of-speech filtering and a perplexity-based loss function to generate sensible triggers that are closer to natural phrases. For the task of sentiment analysis on the SST dataset, the method produces sensible triggers that achieve accuracies as low as 0.04 and 0.12 for flipping positive to negative predictions and vice versa. To build robust models, we also perform adversarial training using the generated triggers, which increases the accuracy of the model from 0.12 to 0.48. We aim to illustrate that adversarial attacks can be made difficult to detect by generating sensible triggers, and to facilitate robust model development through relevant defenses.
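
Perplexity, the fluency measure behind the loss term in this abstract, is the exponential of the mean negative log-likelihood of a token sequence. The sketch below computes it from per-token log-probabilities and shows one hypothetical way to fold it into a trigger-search objective; `trigger_objective` and its weight `lam` are assumptions, since the abstract does not give the exact loss.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood), using natural logs."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def trigger_objective(attack_loss, token_log_probs, lam=0.1):
    """Hypothetical combined objective: attack term plus a fluency penalty.

    attack_loss: loss that is low when the trigger flips the prediction.
    lam: assumed trade-off weight between attack strength and naturalness.
    """
    return attack_loss + lam * math.log(perplexity(token_log_probs))
```

A trigger whose tokens a language model finds likely has low perplexity and therefore a small penalty, which steers the search toward phrases that read more naturally.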

2605.17933 2026-05-19 cs.CV

AtlasVA: Self-Evolving Visual Skill Memory for Teacher-Free VLM Agents

AtlasVA: 无教师视觉技能记忆用于无需教师的VLM代理

Pan Wang, Yihao Hu, Xiujin Liu, Jingchu Yang, Hang Wang, Zhihao Wen

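AI summary: This work proposes AtlasVA, a teacher-free visual skill memory framework whose three layers (spatial heatmaps, visual exemplars, and symbolic text skills) unify perception, memory, and optimization, improving reinforcement learning performance without external LLM supervision.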

详情
AI中文摘要

视觉语言模型(VLM)代理越来越多地依赖记忆增强的强化学习来在长时间任务中重用经验,但大多数现有框架将记忆存储为文本并依赖专有教师模型来总结或细化。这种设计与空间决策不匹配:几何先验被压缩成有损语言,稀疏交互通常通过延迟文本反馈监督,而不是密集的视觉基础信号。我们主张VLM代理的可重用经验应保持视觉基础。基于这一见解,我们提出了AtlasVA,一种无需教师的视觉技能记忆框架,将记忆组织为三个互补的层次:空间热图、视觉示例和符号文本技能。AtlasVA进一步通过轨迹统计和轻量级网格启发式方法直接演化危险和亲和图谱,并将这些自演化图谱作为基于潜在函数的形状奖励用于强化学习。这种设计统一了感知、记忆和优化,无需外部LLM监督。在Sokoban、FrozenLake、3D沉浸导航和3D机器人操作基准测试中,实验表明AtlasVA在文本中心记忆基线和竞争VLM代理上表现一致优异,尤其在空间密集任务上收益显著。主页:https://wangpan-ustc.github.io/AtlasvaWeb

英文摘要

Vision-language model (VLM) agents increasingly rely on memory-augmented reinforcement learning to reuse experience across long-horizon tasks, yet most existing frameworks store memory as text and depend on proprietary teacher models to summarize or refine it. This design is poorly matched to spatial decision making: geometric priors are compressed into lossy language, and sparse interaction is often supervised through delayed textual feedback rather than dense visually grounded signals. We argue that reusable experience for VLM agents should remain visually grounded. Based on this insight, we propose AtlasVA, a teacher-free visual skill memory framework that organizes memory into three complementary layers: spatial heatmaps, visual exemplars, and symbolic text skills. AtlasVA further evolves danger and affinity atlases directly from trajectory statistics and lightweight grid heuristics, and reuses these self-evolving atlases as potential-based shaping rewards for reinforcement learning. This unifies perception, memory, and optimization without external LLM supervision. Experiments on Sokoban, FrozenLake, 3D embodied navigation, and 3D robotic manipulation benchmarks show that AtlasVA consistently outperforms text-centric memory baselines and competitive VLM agents, with especially strong gains on spatially intensive tasks. Homepage: https://wangpan-ustc.github.io/AtlasvaWeb
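
Potential-based shaping has a standard form, F(s, s') = γΦ(s') − Φ(s), which the sketch below applies to grid cells. How AtlasVA actually turns its danger and affinity atlases into a potential is not stated in the abstract; the `build_potential` combination rule and the `danger_weight` parameter are assumptions made for illustration.

```python
def build_potential(affinity, danger, danger_weight=1.0):
    """Hypothetical potential: affinity minus weighted danger, per grid cell.

    Both inputs map a cell (row, col) to a score; missing cells count as 0.
    """
    cells = set(affinity) | set(danger)
    return {
        cell: affinity.get(cell, 0.0) - danger_weight * danger.get(cell, 0.0)
        for cell in cells
    }


def shaping_reward(potential, state, next_state, gamma=0.99):
    """Standard potential-based shaping term: gamma * Phi(s') - Phi(s)."""
    return gamma * potential[next_state] - potential[state]
```

With γ = 1 the shaping terms along a path telescope to the difference in potential between its end and its start, which is why this form of shaping does not change which policies are optimal.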

2605.17932 2026-05-19 cs.CL cs.AI

Prompt Compression in Diffusion Large Language Models: Evaluating LLMLingua-2 on LLaDA

在扩散大型语言模型中进行提示压缩:在LLDA上评估LLMLingua-2

Sterling Huang, Abigayle Brown, Jiyoo Noh, Jiakang Xu, Wantong Huo, Kaung Myat Kyaw, Jonathan Chan

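AI summary: This paper studies whether prompt compression remains effective for diffusion large language models by evaluating LLMLingua-2 on LLaDA. Mathematical reasoning degrades substantially under compression while summarization stays comparatively robust, indicating that compression methods designed for autoregressive models do not transfer uniformly to diffusion models.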

详情
AI中文摘要

提示压缩可以减少大型语言模型的推理成本和上下文长度,但之前的评估主要集中在自回归架构上。本研究探讨了提示压缩是否能有效转移到扩散大型语言模型(DLLMs)中,使用LLMLingua-2,特别是具有8B参数的DLLM LLaDA。我们在GSM8K、DUC2004和ShareGPT数据集上使用每个数据集约250个提示,以大约2倍的压缩率,在数学推理、提示重建和摘要任务中评估压缩性能。通过精确匹配准确率、BLEU、ROUGE和BERTScore比较原始提示、压缩提示、重建提示和重建提示推理生成的输出。结果表明,语义保持并不必然意味着在扩散模型中下游行为的稳定性。摘要任务在压缩下相对稳健,而数学推理任务在高语义相似度分数下显著退化。重建实验进一步表明,语义相似的提示可能仍然遗漏了稳定去噪所需的关键推理信息。在所有任务中,BERTScore召回率始终低于精度,表明压缩失败主要由信息遗漏驱动,而非语义漂移。这些发现表明,为自回归模型设计的提示压缩方法并不均匀地适用于扩散大型语言模型,从而推动了为扩散模型设计的压缩策略的发展。

英文摘要

Prompt compression reduces inference cost and context length in large language models, but prior evaluations focus primarily on autoregressive architectures. This study investigates whether prompt compression transfers effectively to diffusion large language models (DLLMs) using LLMLingua-2, specifically the 8B-parameter DLLM LLaDA. We evaluate compression performance on GSM8K, DUC2004, and ShareGPT using 250 prompts per dataset at an approximate 2$\times$ compression ratio, across mathematical reasoning, prompt reconstruction, and summarization tasks. Outputs generated from original prompts, compressed prompts, reconstructed prompts, and reconstructed-prompt reasoning were compared using exact-match accuracy, BLEU, ROUGE, and BERTScore. Results show that semantic preservation does not necessarily imply stable downstream behavior in diffusion models. Summarization tasks remained comparatively robust under compression, while mathematical reasoning degraded substantially despite high semantic similarity scores. Reconstruction experiments further showed that semantically similar prompts may still omit reasoning-critical information required for stable denoising. Across tasks, BERTScore recall was consistently lower than precision, suggesting that compression failures are primarily driven by information omission rather than semantic drift. These findings indicate that prompt compression methods designed for autoregressive models do not transfer uniformly to diffusion large language models and motivate the development of diffusion-aware compression strategies.
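
The observation that recall falls below precision when information is omitted can be illustrated with a simplified token-overlap stand-in for BERTScore. This sketch is an assumption-laden toy: real BERTScore matches contextual embeddings rather than exact tokens, and the example strings are invented, not taken from GSM8K, DUC2004, or ShareGPT.

```python
from collections import Counter


def overlap_precision_recall(candidate, reference):
    """Token-overlap precision and recall (a crude proxy for BERTScore P/R)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    matched = sum((cand & ref).values())  # multiset intersection
    precision = matched / max(sum(cand.values()), 1)
    recall = matched / max(sum(ref.values()), 1)
    return precision, recall


def compression_ratio(original, compressed):
    """Ratio of whitespace-token counts, original over compressed."""
    return len(original.split()) / max(len(compressed.split()), 1)
```

When the shorter text only drops tokens and adds nothing new, precision stays at 1.0 while recall falls in proportion to what was removed, which matches the omission-versus-drift reading in the abstract.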

2605.17930 2026-05-19 cs.LG

InfoFlow: A Framework for Multi-Layer Transformer Analysis

InfoFlow: 多层Transformer分析的框架

Penghao Yu, Haotian Jiang, Zeyu Bao, Qianxiao Li

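AI summary: This work analyzes the approximation capabilities of multi-layer Transformers, shows that they differ fundamentally from those of single-layer Transformers, and proposes the InfoFlow framework for reasoning about the approximation efficiency of multi-layer Transformers.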

Comments 36 pages

详情
AI中文摘要

尽管近期已有研究探讨了单层Transformer架构的近似性质,但对多层设置的严谨理论理解仍然有限。本文证明多层Transformer在某些检索任务中具有与单层Transformer根本不同的近似能力:对于某些检索任务,任何单层Transformer需要至少Ω(ε^{-k})参数才能达到精度ε,其中k与序列长度T线性增长,而双层Transformer每层一个头则能以至多O(ε^{-1})参数实现相同近似精度。为理解这种分离,我们识别出多层近似背后的两种结构机制。具体而言,softmax注意力只能高效检索获得最大注意力分数的token,导致k-th最大检索的参数成本呈指数级增长(k≥2)。此外,解码耦合信息的参数成本与所检索token集合的大小成正比。受这些发现启发,我们提出了InfoFlow框架,用于多层Transformer。该框架在每个token和层跟踪可访问的输入位置集合,并为每种信息传播模式分配明确的近似率。这种抽象恢复了已知的近似界限,与训练网络的实验观察保持一致,并在目前无法直接理论分析的设置中产生具体预测。我们的结果提供了一个原则性的框架,用于分析多层Transformer的近似效率。

英文摘要

While the approximation properties of single-layer Transformer architectures have been studied in recent works, a rigorous theoretical understanding of the multi-layer setting remains limited. In this work, we establish that multi-layer Transformers possess fundamentally different approximation capabilities from single-layer ones: for certain retrieval tasks, any single-layer Transformer requires at least $\Omega(\varepsilon^{-k})$ parameters to achieve precision $\varepsilon$, where $k$ grows linearly with sequence length $T$, whereas a two-layer Transformer with a single head per layer achieves the same approximation precision with at most $O(\varepsilon^{-1})$ parameters. To understand this separation, we identify two structural mechanisms underlying multi-layer approximation. Specifically, softmax attention can only efficiently retrieve the token attaining the maximum attention score, incurring exponential-in-length parameter cost for $k$-th largest retrieval with $k \geq 2$. Moreover, the parameter cost of decoding coupled information scales with the size of the retrieved token set. Motivated by these findings, we propose InfoFlow, a framework for multi-layer Transformers. The framework tracks an information set of accessible input positions at each token and layer, assigning an explicit approximation rate to each mode of information propagation. This abstraction recovers known approximation bounds, remains consistent with experimental observations on trained networks, and yields concrete predictions in settings where direct theoretical analysis is currently intractable. Our results provide a principled framework for reasoning about the approximation efficiency of multi-layer Transformers.
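
The idea of tracking, per token and layer, the set of input positions whose information is accessible can be sketched as a set-propagation pass. This is only an illustration of the bookkeeping: the `attention_patterns` representation is an assumption, and the approximation rates that InfoFlow attaches to each propagation mode are not modeled here.

```python
def propagate_info_sets(attention_patterns, seq_len):
    """Track which input positions each token can access after each layer.

    `attention_patterns` holds one dict per layer, mapping a token index to the
    positions it attends to. Returns a list of per-layer lists of sets, starting
    with layer 0 where every token only sees itself.
    """
    info = [{t} for t in range(seq_len)]
    history = [info]
    for pattern in attention_patterns:
        new_info = []
        for t in range(seq_len):
            accessible = set(info[t])  # residual path keeps what the token had
            for source in pattern.get(t, ()):
                accessible |= info[source]  # attention merges the source's set
            new_info.append(accessible)
        info = new_info
        history.append(info)
    return history
```

In this toy view a second layer can reach positions that no single attention step touches directly. As rough arithmetic on the stated bounds, ignoring constants, ε = 0.1 and k = 4 give ε^{-k} = 10^4 against ε^{-1} = 10.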

2605.17928 2026-05-19 cs.RO cs.LG

Transfer Learning for Customized Car Racing Environments

迁移学习用于定制化的赛车环境

Benedict Florance Arockiaraj, Richard Chang, Wesley Yee

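AI summary: This paper studies transfer learning in deep reinforcement learning: an agent is trained on a single circuit and then raced on other customized car racing environments, via zero-shot transfer or additional fine-tuning, to achieve fast lap times; it also compares model-based and model-free approaches.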

详情
AI中文摘要

迁移学习是一种技术,其中模型/智能体可以利用其在一项任务中获得的知识/专长来解决另一个密切相关任务。通过本项目,我们探讨了迁移学习在深度强化学习中的应用。具体而言,我们希望利用迁移学习在OpenAI的赛车环境中实现快速圈速,通过在单一赛道上训练智能体,并通过零样本迁移或额外微调在其他定制化目标环境中进行比赛。此外,我们比较了基于模型和非基于模型方法的性能,并观察到基于模型的方法在性能上占优,并且在该环境中比非基于模型的方法收敛得更快。我们观察到迁移学习在大多数设置中不仅提升了目标领域的性能,而且在学习过程中也表现出高水平的性能能力。

英文摘要

Transfer learning, a technique in which a model or agent exploits the knowledge it gained on one task to solve another closely related task, is often used to tackle problems in deep learning. Through this project, we explore transfer learning in the purview of deep reinforcement learning. Specifically, we want to use transfer learning to achieve fast lap times in OpenAI's Car racing environment by training the agent on one circuit, and racing it on other customized target environments by zero-shot transfer or by additional fine-tuning. In addition, we compare the performance of model-based and model-free approaches, and observe that model-based approaches dominate in performance and converge faster than model-free approaches in this environment. We observe that transfer learning in most setups not only boosts performance on the target domain, but also yields high performance during learning.

2605.17927 2026-05-19 cs.RO

Learning-Based Adaptive Control for Surgical Robotic Exposure Task on Deformable Tissues

基于学习的自适应控制用于变形组织手术机器人暴露任务

Jiayi Liu, Kaiqi Wei, Yiwei Wang, Huan Zhao, Han Ding

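AI summary: This paper proposes a learning-based adaptive control framework for autonomous tissue retraction, a task made difficult by the irregular geometry and nonlinear biomechanics of overlying tissue and by the limited field of view; it optimizes control inputs online and uses a deep deformation estimation model to achieve zero-shot adaptation.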

Comments Accepted to Robotics: Science and Systems (RSS) 2026. 12 pages, 9 figures

详情
AI中文摘要

在各种外科手术中,感兴趣的区域(ROIs)如器官或病变常被覆盖组织遮挡,需要外科医生实现充分暴露以进行精确干预。然而,覆盖组织的不规则几何形状、非线性生物力学特性和术中ROIs的有限可见性对自动执行组织牵开提出了重大挑战。为此,我们提出了一个现实的组织牵开任务模型,并提出了一种基于学习的自适应控制框架,以实现ROIs的暴露。该方法通过监控组织视觉边界的变化在线优化控制输入,同时利用在模拟数据上训练的深度变形估计模型来识别最优抓取点,以确保自适应控制器的收敛性和安全性。通过在不同变形材料上的模拟和实际实验,证明了该框架能够实现零样本适应,并能完成从初始抓取选择到完全ROIs暴露的自动牵开过程。因此,它有潜力应用于实际的手术辅助场景。

英文摘要

In various surgical procedures, regions of interest (ROIs) such as organs or lesions are often occluded by overlying tissues, requiring surgeons to achieve adequate exposure for precise intervention. However, the irregular geometry and nonlinear biomechanical properties of overlying tissues, together with limited intraoperative visibility of the ROI, pose significant challenges to the autonomous execution of tissue retraction. To address this, we formulate a realistic model of the tissue retraction task and propose a learning-based adaptive control framework for achieving ROI exposure. The method optimizes control inputs online by monitoring changes in the visual boundary of the tissue, while leveraging a deep deformation estimation model trained on simulation data to identify the optimal grasping point and ensure the convergence and safety of the adaptive controller. Simulations and real-world experiments on different deformable materials demonstrate that the framework adapts zero-shot to similar tasks and can complete the autonomous retraction process, from initial grasp selection to full ROI exposure. It therefore has the potential to be applied in actual surgical assistance scenarios.

2605.17918 2026-05-19 cs.LG cs.AI cs.CV

Domain Transfer Becomes Identifiable via a Single Alignment

通过单个对齐使领域转移变得可识别

Sagar Shrestha, Subash Timilsina, Hoang-Son Nguyen, Xiao Fu

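AI summary: This paper shows that domain transfer becomes identifiable under a structural sparsity condition together with a single paired anchor sample, reducing the supervision required, and proposes an efficient Jacobian sparsity regularizer to support high-dimensional learning.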

详情
AI中文摘要

领域转移(DT)将源分布映射到目标分布,并支持无监督的图像到图像翻译、单细胞分析和跨平台医学影像任务。然而,DT本质上是不明确的:推动正向映射通常不可识别,因为保持测度的自同构(MPAs)在保持边缘分布的同时改变跨领域对应关系,导致内容不一致的翻译。最近的工作表明,通过联合转移多个对应的源/目标条件分布可以消除MPAs,但标记这些条件的监督信号在实践中并不总是可用。我们开发了一种替代的DT可识别性路线。在雅可比支持图案的结构稀疏性条件下,我们证明了分布匹配与单个配对锚样本足以识别真实转移——比先前方法需要的监督更少。为了支持实际的高维学习,我们进一步提出了一种基于随机掩码有限差分的高效雅可比稀疏性正则化器,得到一个可扩展的替代品,无需显式雅可比评估。在合成和现实任务上的实验证实了理论。

英文摘要

Domain transfer (DT) maps source to target distributions and supports tasks such as unsupervised image-to-image translation, single-cell analysis, and cross-platform medical imaging. However, DT is fundamentally ill-posed: push-forward mappings are generally non-identifiable, as measure-preserving automorphisms (MPAs) preserve marginals while altering cross-domain correspondences, leading to content-misaligned translation. Recent work shows that MPAs can be eliminated by jointly transferring multiple corresponding source/target conditional distributions, but supervision signals labeling such conditionals are not always available in practice. We develop an alternative route to DT identifiability. Under a structural sparsity condition on the Jacobian support pattern, we show that distribution matching together with a single paired anchor sample suffices to identify the ground-truth transfer -- requiring substantially less supervision than prior approaches. To enable practical high-dimensional learning, we further propose an efficient Jacobian sparsity regularizer based on randomized masked finite differences, yielding a scalable surrogate without explicit Jacobian evaluation. Empirical results on synthetic and real-world DT tasks validate the theory.
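
A Jacobian sparsity penalty that avoids forming the Jacobian can be sketched with finite differences along randomly chosen input coordinates. The abstract does not specify the regularizer, so everything here is an assumption: the one-hot masks, the L1 penalty, and the names `masked_fd_sparsity`, `num_probes`, and `h` are illustrative choices, not the authors' method.

```python
import numpy as np


def masked_fd_sparsity(f, x, num_probes=8, h=1e-4, seed=0):
    """Estimate the mean L1 norm of Jacobian columns via masked finite differences.

    Each probe perturbs one randomly selected input coordinate by `h`, so
    (f(x + step) - f(x)) / h approximates one Jacobian column without ever
    building the full Jacobian.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    fx = f(x)
    total = 0.0
    for _ in range(num_probes):
        i = rng.integers(x.size)       # random coordinate to unmask
        step = np.zeros_like(x)
        step[i] = h
        column = (f(x + step) - fx) / h
        total += np.abs(column).sum()  # L1 penalty favors sparse columns
    return total / num_probes
```

For a linear map the estimate equals the average L1 norm of the sampled columns, so an identity map scores 1 while a dense all-ones map in four dimensions scores 4.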

2605.17915 2026-05-19 cs.CV

SurgLQA: Scalable Long-Horizon Surgical Video Question Answering

SurgLQA: 可扩展的长时程外科视频问答

Diandian Guo, Xikai Yang, Ruiyang Li, Jialun Pei, Pheng-Ann Heng

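AI summary: This paper proposes SurgLQA, a framework that combines Faithful Temporal Consolidation with Temporally-Grounded Multi-Policy Scaling to model long-range dynamics in long-horizon surgical video question answering and improve reasoning across surgical workflows.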

Comments MICCAI 2026 Early Accept

详情
AI中文摘要

外科视频问答(VideoQA)提供了一个有前景的动态术中解释范式,能够为临床环境中的实时决策支持和上下文感知检索提供支持。然而,现有方法主要局限于图像或短片段,限制了其对长程手术流程中因果依赖关系的建模能力。为解决这一挑战,我们提出了SurgLQA,一个统一的长时程VideoQA框架,用于可扩展的外科推理。该框架集成了忠实时间一致性巩固(FTC),利用内在时间线索构建紧凑的长程表示,同时保持细粒度的时间保真度。进一步,我们开发了时间接地多策略扩展(TMS),一种适应性测试时间推理范式,能够在时间接地上下文中战略性地调整策略层面的推理能力。为了促进系统评估,我们重构了一个长时程结肠镜VideoQA基准,Colon-LQA,并在Colon-LQA和REAL-Colon-VQA上进行了广泛的实验。实验结果表明,我们的方法在长程推理中通过时间接地推理实现了持续的性能提升。代码链接:https://github.com/RascalGdd/SurgLQA。

英文摘要

Surgical Video Question Answering (VideoQA) provides a promising paradigm for dynamic intraoperative interpretation, enabling real-time decision support and context-aware retrieval in clinical environments. Nevertheless, existing approaches are predominantly restricted to images or short clips, limiting their ability to model long-range procedural dynamics and causal dependencies across extended surgical workflows. To address this challenge, we propose SurgLQA, a unified long-horizon VideoQA framework for scalable surgical reasoning. This framework incorporates Faithful Temporal Consolidation (FTC), which leverages intrinsic temporal cues to construct compact long-range representations while preserving fine-grained temporal fidelity. Further, we develop Temporally-Grounded Multi-Policy Scaling (TMS), an adaptive test-time inference paradigm that strategically adjusts policy-level reasoning capacity within temporally grounded contexts. To facilitate systematic evaluation, we restructured a long-duration colonoscopy VideoQA benchmark, Colon-LQA, and conducted extensive experiments on Colon-LQA and REAL-Colon-VQA. Experimental results demonstrate that our approach achieves consistent performance gains in long-range reasoning with temporally grounded inference. Code link: https://github.com/RascalGdd/SurgLQA.

2605.17912 2026-05-19 cs.RO cs.CV

WorldArena 2.0: Extending Embodied World Model Benchmarking on Modality, Functionality and Platform

WorldArena 2.0: 扩展模态、功能和平台的具身世界模型基准测试

Yu Shang, Yinzhou Tang, Yiding Ma, Zhuohang Li, Lei Jin, Weikang Su, Xin Jin, Zhaolu Wang, Ziyou Wang, Xin Zhang, Haisheng Su, Weizhen He, Wei Wu, Haoyi Duan, Gordon Wetzstein, Xihui Liu, Dhruv Shah, Zhaoxiang Zhang, Zhibo Chen, Jun Zhu, Yonghong Tian, Tat-Seng Chua, Wenwu Zhu, Chen Gao, Yong Li

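AI summary: This paper introduces WorldArena 2.0, which extends embodied world model evaluation along three dimensions (modality, functionality, and platform) and provides a comprehensive testbed for tracking progress on embodied world models.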

详情
英文摘要

World models have emerged as a central paradigm for embodied intelligence, enabling agents to predict action-conditioned futures and reason about environmental dynamics. However, existing embodied world model benchmarks are still largely confined to vision-only prediction, offline embodied applications, and simulator-based evaluation, making them insufficient for assessing increasingly comprehensive world models. In this work, we introduce WorldArena 2.0, an expanded benchmark that systematically broadens embodied world model evaluation along three dimensions: modality, functionality, and platform. Along the modality dimension, WorldArena 2.0 extends evaluation from vision-only to visuotactile modalities, enabling assessment of multimodal perception and prediction. Along the functionality dimension, it extends beyond policy evaluation and planning to assess world models as interactive RL environments for policy optimization. Along the platform dimension, it moves beyond simulator-only evaluation to a diverse suite of simulated and real-world robotic settings across multiple embodiments. Under a standardized protocol, WorldArena 2.0 evaluates perceptual quality, interactive utility, and cross-platform performance, providing a comprehensive testbed for tracking progress toward embodied world models. The benchmark is available at: https://world-arena.ai.

2605.17911 2026-05-19 cs.CL

A Pilot Benchmark for NL-to-FOL Translation in Planetary Exploration

行星探测中自然语言到一阶逻辑翻译的试点基准

Hayden Moore, Suman Saha, Mahfuza Farooque

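AI summary: This paper presents a pilot benchmark for translating natural language into first-order logic in the planetary exploration domain, built from real mission documentation in NASA's PDS and manually annotated with FOL representations, to support research at the intersection of language understanding and formal reasoning.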

详情
AI中文摘要

未来的行星探测设想了在严苛通信限制下运行的自主机器人代理,没有全球定位系统,且人类干预极少。在这种环境中,代理不仅需要感知和行动,还必须在任务目标、操作约束和不断变化的环境条件下进行推理。尽管先前的工作主要集中在感知和控制上,但将高层任务知识转换为结构化、机器可解释的表示仍显不足。我们引入了一个试点基准,用于在行星探测领域将自然语言(NL)转换为一阶逻辑(FOL)。数据集由来自NASA行星数据系统(PDS)的实测文档构建,时间跨度为2003至2013年。这些文档以丰富的自然语言描述了任务阶段,如发射、助推、巡航、巡航和轨道操作。我们手动标注这些文档,对应FOL表示,以捕捉时间结构、代理角色和操作依赖性。此外,我们还提供了结构化的谓词词汇表和类型常量,以支持在不同先验知识水平下进行受控实验。该试点基准为语言理解和形式推理交叉研究提供了基础,基于真实世界的安全关键任务数据。数据集可在:https://github.com/HaydenMM/planetary-logic-benchmark/blob/main/pilot_benchmark.json 获取。

英文摘要

Future planetary exploration envisions autonomous robotic agents operating under severe communication constraints, without global positioning, and with minimal human intervention. In such environments, agents must not only perceive and act, but also reason over mission objectives, operational constraints, and evolving environmental conditions. While prior work has largely focused on perception and control, the translation of high-level mission knowledge into structured, machine-interpretable representations remains underexplored. We introduce a pilot benchmark for translating natural language (NL) into First-Order Logic (FOL) within the domain of planetary exploration. The dataset is constructed from real mission documentation sourced from NASA's Planetary Data System (PDS), spanning missions from 2003 to 2013. These documents describe mission phases such as launch, boost, coast, cruise, and orbital operations in rich natural language. We manually annotate these documents with corresponding FOL representations that capture temporal structure, agent roles, and operational dependencies. In addition, we provide structured predicate vocabularies and typed constants to enable controlled experimentation with varying levels of prior knowledge. This pilot benchmark provides a foundation for research at the intersection of language understanding and formal reasoning, grounded in real-world, safety-critical mission data. The dataset is provided at: https://github.com/HaydenMM/planetary-logic-benchmark/blob/main/pilot_benchmark.json
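
One way to use a structured predicate vocabulary with typed constants is to check that a candidate FOL atom is well typed before scoring it. The sketch below is purely illustrative: the `Atom` class, the predicate names, and the constants are hypothetical and are not taken from the released dataset or its JSON schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    """A single FOL atom such as Precedes(launch, cruise)."""
    predicate: str
    args: tuple


def is_well_typed(atom, predicates, constants):
    """Check an atom against a predicate signature table and typed constants.

    `predicates` maps a predicate name to a tuple of argument types;
    `constants` maps a constant name to its type.
    """
    signature = predicates.get(atom.predicate)
    if signature is None or len(signature) != len(atom.args):
        return False
    return all(constants.get(arg) == expected
               for arg, expected in zip(atom.args, signature))


# Hypothetical vocabulary, invented for this example only.
PREDICATES = {"Precedes": ("Phase", "Phase"), "Performs": ("Agent", "Phase")}
CONSTANTS = {"launch": "Phase", "cruise": "Phase", "spacecraft": "Agent"}
```

Supplying or withholding such a vocabulary is one plausible way to vary the level of prior knowledge given to a translation system.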

2605.17907 2026-05-19 cs.CV cs.AI

One Model to Translate Them All: Universal Any-to-Any Translation for Heterogeneous Collaborative Perception

一个模型翻译它们所有:面向异构协作感知的通用任意到任意翻译

Yang Li, Weize Li, Quan Yuan, Congzhang Shao, Guiyang Luo, Yunqi Ba, Xuanhan Zhu, Xinyuan Ding, Xiaoyuan Fu, Jinglin Li

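AI summary: This paper proposes UniTrans, a universal any-to-any feature modality translation model that pretrains a bank of translator expert parameters and learns their combination coefficients to instantiate translators zero-shot, outperforming existing methods on the OPV2V-H and DAIR-V2X datasets.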

Comments 19 pages, accepted at the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

通过共享中间特征,协作感知扩展了每个代理的感知能力,但现实世界中的特征模态异质性仍然是有效融合的关键障碍。大多数现有方法,包括直接适应和协议基于的转换,通常依赖于为新出现的特征模态训练适配器,往往需要额外的重新训练或微调。这种重复训练成本高,并且由于模型和数据隐私限制,在跨制造商之间不可行,限制了现实世界的可扩展性。为了解决这个问题,我们提出了UniTrans,一种通用的任意到任意特征模态翻译模型,该模型可以即时实例化任意模态的翻译器。UniTrans预训练了一组翻译专家参数,并学习其组合系数作为源到目标模态映射的函数。映射是在模态内在的潜在空间中进行测量,其中内在编码器从单帧中间特征中提取模态特定但场景不变的代码,使UniTrans能够以零样本的方式实例化翻译器。在OPV2V-H和DAIR-V2X上的实验表明,UniTrans在模拟和现实世界中均优于现有方法,通过通用模型实现了高效的任意到任意翻译。代码可在https://github.com/CheeryLeeyy/UniTrans上获得。

英文摘要

By sharing intermediate features, collaborative perception extends each agent's sensing beyond standalone limits, but real-world feature modality heterogeneity remains a key barrier to effective fusion. Most existing methods, including direct adaptation and protocol-based transformation, typically rely on training adapters for newly emerging feature modalities and often require additional retraining or fine-tuning. Such repeated training is costly and is often infeasible across manufacturers due to model and data privacy constraints, limiting real-world scalability. To address this issue, we propose UniTrans, a universal any-to-any feature modality translation model that instantiates translators on the fly for arbitrary modalities. UniTrans pretrains a bank of translator expert parameters and learns their combination coefficients as a function of source-to-target modality mapping. The mapping is measured in a modality-intrinsic latent space, where an intrinsic encoder extracts modality-specific yet scene-invariant codes from single-frame intermediate features, enabling UniTrans to instantiate translators in a zero-shot manner. Experiments on OPV2V-H and DAIR-V2X demonstrate that UniTrans consistently outperforms state-of-the-art methods in both simulated and real-world settings, enabling efficient any-to-any translation through a universal model. The code is available at https://github.com/CheeryLeeyy/UniTrans.
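
Instantiating a translator from a bank of expert parameters can be sketched as a weighted sum of expert matrices, with weights computed from the source and target modality codes. This is a simplified guess at the mechanism: the linear `router`, the softmax, and the concatenation of the two codes are assumptions, and the real UniTrans experts are unlikely to be single matrices.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def instantiate_translator(expert_bank, router, src_code, tgt_code):
    """Build one translator weight matrix from a bank of K experts.

    expert_bank: array of shape (K, d_out, d_in)
    router:      array of shape (K, 2 * c), a hypothetical linear scorer
    src_code, tgt_code: modality codes of length c
    """
    mapping = np.concatenate([src_code, tgt_code])      # source-to-target descriptor
    coeffs = softmax(router @ mapping)                  # combination coefficients
    weight = np.tensordot(coeffs, expert_bank, axes=1)  # sum_k coeffs[k] * expert_k
    return weight, coeffs
```

Because only the coefficients depend on the modality pair, a new pair needs a forward pass through the scorer rather than a newly trained adapter.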

2605.17904 2026-05-19 cs.CV

Beyond Euclidean Prototypes: Spectral Disentanglement and Geodesic Matching for Few-Shot Medical Image Segmentation

超越欧几里得原型:基于谱分解和测地匹配的少样本医学图像分割

Penghao Jia, Zhiyong Huang, Mingyang Hou, Zhi Yu, Shuai Miao, Jiahong Wang, Yan Yan

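AI summary: This paper proposes the Spectral-Geodesic Prototype Network (SGP-Net), which uses a Spectral Prototype Bank and a Geodesic Matcher to address prototype entanglement and topology-blind matching in few-shot medical image segmentation, encoding shape, texture, and boundary cues in disentangled prototypes.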

详情
AI中文摘要

少样本医学图像分割(FSMIS)旨在从一个或几个标注的支持图像中勾勒出新的解剖目标,以应对医学影像中的标注稀缺问题。尽管近期取得了进展,但基于原型的方法仍然受到两个耦合限制的阻碍:1)线索纠缠,即单个空间域原型被迫同时总结器官轮廓、实质纹理和边界外观,因此任何支持-查询不匹配在其中一个线索上都会无差别地传播到其他线索;2)拓扑盲匹配,即余弦相似度在环境欧几里得空间中测量距离,而忽略了底层特征流形的连通性,导致低对比度器官内碎片化激活和泄漏到邻近组织。为此,我们提出了Spectral-Geodesic Prototype Network (SGP-Net),其围绕一个由两个耦合组件组成的Spectral-Geodesic Prototype Module构建。一个Spectral Prototype Bank (SPB)通过可学习的径向傅里叶滤波器将支持和查询特征分解为低、中、高频带,从而为每个类别生成三个解耦的原型,分别编码形状、纹理和边界线索。一个Geodesic Matcher (GM)则用可微的热扩散近似来替代余弦相似度,用特征亲和图传播匹配信号,使得在流形上的像素积累一致的响应,而流形外的相似者则被抑制。在三个公开的FSMIS基准测试中,实验表明SGP-Net在与最近的最先进方法相竞争的性能上取得了可比的结果。

英文摘要

Few-Shot Medical Image Segmentation (FSMIS) aims to delineate novel anatomical targets from one or a few annotated support images, addressing the annotation scarcity in medical imaging. Notwithstanding recent advancements, current prototype-based methods are bottlenecked by two coupled limitations: 1) cue entanglement, where a single spatial-domain prototype is forced to summarise organ silhouette, parenchymal texture and boundary appearance simultaneously, so any support-query mismatch on one cue propagates indiscriminately to the others; and 2) topology-blind matching, where cosine similarity measures distance in the ambient Euclidean space and ignores the connectivity of the underlying feature manifold, causing fragmented activations inside low-contrast organs and leakage into neighbouring tissues. To this end, we propose Spectral-Geodesic Prototype Network (SGP-Net), built around a Spectral-Geodesic Prototype Module with two coupled components. A Spectral Prototype Bank (SPB) decomposes support and query features into low-, mid- and high-frequency bands via learnable radial Fourier filters, yielding three disentangled prototypes per class that separately encode shape, texture and boundary cues. A Geodesic Matcher (GM) then replaces cosine similarity with a differentiable heat-diffusion approximation of geodesic distance, propagating matching signals along a feature affinity graph so that on-manifold pixels accumulate consistent responses while off-manifold look-alikes are suppressed. Experiments on three public FSMIS benchmarks demonstrate that SGP-Net achieves competitive performance against recent state-of-the-art methods.
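
The split of a feature map into low-, mid- and high-frequency bands with radial Fourier filters can be sketched with fixed binary masks. In the paper the filters are learnable; the hard cutoffs used here, and the `cutoffs` values themselves, are assumptions chosen to keep the example short. The Geodesic Matcher is not covered.

```python
import numpy as np


def radial_band_split(feature, cutoffs=(0.15, 0.35)):
    """Split a 2D map into low, mid and high bands using radial frequency masks.

    Frequencies are in cycles per sample, as returned by np.fft.fftfreq.
    The three masks partition the spectrum, so the bands sum to the input.
    """
    h, w = feature.shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    lo, hi = cutoffs
    masks = [radius < lo, (radius >= lo) & (radius < hi), radius >= hi]
    spectrum = np.fft.fft2(feature)
    return [np.real(np.fft.ifft2(spectrum * mask)) for mask in masks]
```

A prototype could then be pooled from each band separately, so that a mismatch in, say, boundary appearance affects only the high-frequency prototype.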

2605.17903 2026-05-19 cs.AI cs.CL cs.HC cs.IR

Agentic Chunking and Bayesian De-chunking of AI Generated Fuzzy Cognitive Maps: A Model of the Thucydides Trap

代理分块与贝叶斯去分块:人工智能生成的模糊认知图的模型:特克西德斯陷阱模型

Akash Kumar Panda, Olaoluwa Adigun, Bart Kosko

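AI summary: This paper presents agentic chunking and Bayesian de-chunking for generating and updating AI-generated fuzzy cognitive maps: LLM agents split text into overlapping chunks, and the sparse causal chunk matrices are mixed into a representative cyclic FCM knowledge graph that is used to predict conflict outcomes in the Thucydides Trap model.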

Comments 15 pages, 6 figures

详情
AI中文摘要

我们通过训练大语言模型代理将文本分解为重叠的文本分块,从而自动生成反馈因果模糊认知图(FCMs)。通过将这些分块FCMs进行凸混合,可以得到一个代表性的循环FCM知识图。文本分块可以有不同的重叠程度。分块FCMs仍然混合以形成新的FCM因果知识图。混合技术的可扩展性源于其使用轻量计算和稀疏因果分块矩阵。混合结构允许进行一种操作层面的贝叶斯推断,从而从混合的FCM中生成“去分块”或后验似的FCM。这些去分块的FCM在自身具有价值,并允许进一步的贝叶斯更新。我们通过Allison的“特克西德斯陷阱”模型的论文文本演示了这些混合技术,该模型描述了主导力量(如美国)与崛起力量(如中国)之间的冲突。FCM动态系统在达到固定点或极限环吸引子时预测结果。当我们通过激活代表崛起力量野心和权利的概念节点来刺激这些FCM知识图时,8个中的7个FCM知识图预测了战争类型。Gemini 3.1 LLMs作为分块AI代理。

英文摘要

We automatically generate feedback causal fuzzy cognitive maps (FCMs) from text by teaching large-language-model agents to break the text into overlapping chunks of text. Convex mixing of these chunk FCMs gives a representative cyclic FCM knowledge graph. The text chunks can have different levels of overlap. The chunk FCMs still mix to form a new FCM causal knowledge graph. The mixing technique scales because it uses light computation with sparse causal chunk matrices. The mixing structure allows an operator-level type of Bayesian inference that produces "de-chunked" or posterior-like FCMs from the mixed FCM. These de-chunked FCMs are useful in their own right and allow further iterations of Bayesian updating. We demonstrate these mixing techniques on the essay text of Allison's "Thucydides Trap" model of conflict between a dominant power such as the United States and a rising power such as China. The FCM dynamical systems predict outcomes as they equilibrate to fixed-point or limit-cycle attractors. Seven out of 8 FCM knowledge graphs predicted a type of war when we stimulated them by turning on and keeping on the concept node that stands for the rising power's ambition and entitlement. Gemini 3.1 LLMs served as the chunking AI agents.
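
Convex mixing of chunk FCMs and the run to an attractor with one concept node held on can be sketched as follows. The three-node map, the binary threshold activation, and the function names are assumptions for illustration; the paper's FCMs, node sets, and activation function may differ.

```python
import numpy as np


def mix_fcms(chunk_matrices, weights):
    """Convex combination of causal edge matrices, one per text chunk."""
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0) or not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must be nonnegative and sum to 1")
    return sum(w * np.asarray(m, dtype=float)
               for w, m in zip(weights, chunk_matrices))


def run_fcm(edge_matrix, state, clamped=(), max_steps=50):
    """Iterate a binary-threshold FCM until a state repeats.

    edge_matrix[i, j] is the causal weight from node i to node j.
    Nodes listed in `clamped` are turned on and kept on at every step.
    Returns the visited states; the tail shows the fixed point or limit cycle.
    """
    state = np.asarray(state, dtype=float).copy()
    for i in clamped:
        state[i] = 1.0
    trajectory = [state]
    for _ in range(max_steps):
        nxt = (trajectory[-1] @ edge_matrix > 0).astype(float)
        for i in clamped:
            nxt[i] = 1.0
        repeated = any(np.array_equal(nxt, s) for s in trajectory)
        trajectory.append(nxt)
        if repeated:
            break
    return trajectory
```

If the last two states in the returned trajectory are equal, the map has settled on a fixed point; if the final state matches an earlier, non-adjacent one, it has entered a limit cycle.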

2605.17902 2026-05-19 cs.AI

LAST-RAG: Literature-Anchored Stochastic Trajectory Retrieval-Augmented Generation for Knowledge-Conditioned Degradation Model Selection

LAST-RAG:文献锚定的随机轨迹检索增强生成用于知识条件退化模型选择

Hanbyeol Park, Hyerim Bae

AI总结 本文提出LAST-RAG方法,通过结合观测健康指标轨迹和领域特定上下文,利用理论和机械证据从本地证据库中检索,以改进退化模型选择,将模型选择从纯统计拟合问题转变为结合观测数据和领域知识的决策问题。

详情
AI中文摘要

基于随机过程的退化建模是估计剩余使用寿命(RUL)分布的核心方法;然而,适当选择随机过程的方法尚未得到充分解决。现有模型选择方法主要依赖于观测健康指标(HI)轨迹的统计拟合,但当观察窗口较短或信号高度噪声时,这种方法可能选择与底层退化机制不一致的模型。为了解决这个问题,本文提出了文献锚定的随机轨迹检索增强生成(LAST-RAG)。该方法利用观测的HI轨迹和领域特定上下文,并基于从本地证据库中检索的理论和机械证据,分层地对候选退化模型空间进行条件。此外,引入了基于规则的置信度推理与不确定状态(RCRUS)以防止在分层决策不确定时过早排除候选模型。基于仿真的实验表明,所提出的方法在韦纳/伽马族分类和详细退化模型分类中均优于统计、预测和不确定性感知的基线方法。最终,本研究将退化模型选择从纯粹的统计拟合问题重新界定为一个结合观测数据和领域知识的知识条件决策问题。

英文摘要

Stochastic-process-based degradation modeling is a core approach for estimating the distribution of remaining useful life (RUL); however, the selection of an appropriate stochastic process has not been sufficiently addressed. Existing model selection methods mainly rely on the statistical fit of the observed health indicator (HI) trajectory, but this approach may select a model that is inconsistent with the underlying degradation mechanism when the observation window is short or the signal is highly noisy. To address this issue, this paper proposes Literature-Anchored Stochastic Trajectory Retrieval-Augmented Generation (LAST-RAG). The proposed method uses both the observed HI trajectory and domain-specific context, and hierarchically conditions the candidate degradation model space based on theoretical and mechanical evidence retrieved from a local evidence bank. In addition, Rule-based Confidence Reasoning with Uncertain State (RCRUS) is introduced to prevent candidate models from being prematurely eliminated when hierarchical decisions are uncertain. Simulation-based experiments demonstrate that the proposed method outperforms statistical, prognostic, and uncertainty-aware baselines in both Wiener/gamma family classification and detailed degradation model classification. Ultimately, this study reframes degradation model selection from a purely statistical goodness-of-fit problem into a knowledge-conditioned decision-making problem that integrates observed data with domain knowledge.

2605.17900 2026-05-19 cs.AI

DuIVRS-2: An LLM-based Interactive Voice Response System for Large-scale POI Attribute Acquisition

DuIVRS-2: 基于大语言模型的大型兴趣点属性采集交互语音响应系统

Le Zhang, Shengming Zhang, Rui Zha, Yunpeng Wu, Jingbo Zhou, Jizhou Huang

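AI summary: This paper presents DuIVRS-2, an LLM-based end-to-end framework for large-scale POI attribute acquisition that uses FSM-guided data augmentation and a selective generation scheme with a Chain-of-Thought mechanism to improve output stability and effectively eliminate hallucinations, achieving an 83.9% task success rate in production.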

Comments Accepted to ACL 2026 Industry Track. 14 pages, including appendix

详情
AI中文摘要

准确获取兴趣点(POI)属性对于基于位置的服务至关重要,但传统模块化的交互语音响应(IVR)系统存在误差累积和高维护成本的问题。我们提出了DuIVRS-2,一种基于大语言模型(LLM)的端到端框架,用于百度地图的大规模POI属性采集。为了解决现实交互中的长尾分布问题,我们的方法首先采用有限状态机(FSM)引导的数据增强策略,生成平衡且多样化的训练数据集。然后通过选择生成方案结合思维链(CoT)机制,优化对话管理,确保输出稳定性并有效消除工业环境中的幻觉。为了便于持续策略优化且最小化人工努力,我们设计了协作迭代学习框架,利用双评估者投票系统。在生产环境中部署两个月,DuIVRS-2每天处理0.4百万次呼叫,实现了83.9%的任务成功率(TSR),比其前身高出4个百分点,同时保持130ms的低响应时间。本工作为开发鲁棒且成本效益高的LLM代理用于大规模工业对话应用提供了生产验证的参考。

英文摘要

Accurate Point of Interest (POI) attribute acquisition is essential for location-based services, yet traditional modular Interactive Voice Response (IVR) systems suffer from error accumulation and high maintenance overhead. We present DuIVRS-2, a large language model (LLM)-based end-to-end framework designed for large-scale POI attribute acquisition at Baidu Maps. To address the long-tail distribution of real-world interactions, our methodology first employs a finite state machine (FSM)-guided data augmentation strategy to synthesize a balanced and diverse training dataset. We then streamline dialogue management via a selective generation scheme combined with a Chain-of-Thought (CoT) mechanism, which ensures output stability and effectively eliminates hallucinations in industrial settings. To facilitate continuous policy refinement with minimal manual effort, we design a cooperative iterative learning framework that leverages a dual-evaluator voting system. Deployed in production for two months, DuIVRS-2 processed 0.4 million calls daily and achieved an 83.9\% Task Success Rate (TSR), outperforming its predecessor by 4 percentage points while maintaining a low reaction time of 130ms. This work provides a production-proven reference for developing robust, cost-effective LLM agents for large-scale industrial dialogue applications.