arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.17301 2026-06-17 cs.SD cs.LG New submission

Turning music identification into a neural forward pass

Muhammad Taimoor Haseeb, Ahmad Hammoudeh, Gus Xia

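Affiliations: Music X Lab; Mohamed Bin Zayed University of Artificial Intelligence

AI summary: Proposes performing music identification in a single neural forward pass with a generative transformer, outperforming conventional acoustic fingerprinting on short audio excerpts while substantially reducing storage and latency.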

Abstract

Search, a foundational operation in computer science, maps a query to a matching item in a collection. It is typically implemented as a System-2 like, rule-based pipeline in which a key is computed, an index is probed, and candidates are verified. By contrast, human recognition resembles a System-1 like, associative model of identity recovery, in which even partial cues can trigger a recall without explicitly enumerating, ranking, or even accessing discrete candidates. Here, we show that music sound identification, a difficult search problem, can be performed in a single neural feed-forward pass by a generative transformer. Trained on an audio dataset, the model predicts the corresponding track identifier from a short audio excerpt. This approach surpasses state-of-the-art acoustic fingerprinting, with the largest gains for short audio segments (1 second), demonstrating the method is not only viable but advantageous. Moreover, it reduces external storage to 0.33% of the baseline footprint and improves inference latency by 2.3x (p95). Furthermore, the model can reject queries for unseen tracks, supporting open-set operation while reducing misattribution risk. Using music track identification as an example, this work reframes search, bringing it closer in spirit to human associative recognition and away from algorithmic database lookup.

2606.17299 2026-06-17 cs.CL New submission

Examining the Limits of Word2Vec with Toki Pona

Daniel Zhenhan Huang, Hongchen Wu

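Affiliations: Georgia Institute of Technology

AI summary: Trains Word2Vec on Toki Pona, a constructed language with only about 130 words, to probe embedding quality at an extremely small vocabulary size and to analyze the effect of non-core noise tokens.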

Comments 10 pages, 4 figures, 3 tables. Accepted to the Society for Computation in Linguistics (SCiL) 2026

Abstract

Word2Vec's effectiveness at generating semantic embeddings has been widely validated, yet it has been tested almost exclusively on languages with large vocabulary inventories. This study examines whether Word2Vec can successfully capture semantic relationships within an extremely reduced vocabulary using data from Toki Pona, a constructed language with approximately 130 words. We sourced 1.4 million sentences (7.95 million tokens) from the Toki Pona community for training. Approximately 23% of sentences in the corpus contain non-Toki Pona tokens such as named entities, loanwords, and neologisms. To investigate whether this linguistic noise enhances or hinders performance -- a topic rarely addressed in word embedding literature -- we trained two distinct models: one retaining these incidental tokens and another filtering them out completely. Evaluation was conducted using quantitative methods measuring word proximity to semantic category centroids, automated silhouette scores via agglomerative clustering, and qualitative analysis utilizing representational similarity matrices compared against English. The results indicate that while sparse, non-core tokens do not affect the relative structure of the learned embeddings, they actually draw similar words closer together in the vector space. Importantly, Word2Vec's effectiveness depends more on distributional patterns than lexicon size even at this extreme lower bound.
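
The centroid-based evaluation mentioned in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the 2-D vectors, the two categories, and the use of cosine similarity are hypothetical, and the abstract does not say which distance the authors use. Note also that each word contributes to its own category centroid in this toy version.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical 2-D embeddings; trained Word2Vec vectors have many more dimensions.
embeddings = {
    "soweli": [0.9, 0.1],  # animal
    "waso": [0.8, 0.2],    # bird
    "kili": [0.1, 0.9],    # fruit
    "pan": [0.2, 0.8],     # grain, bread
}
# Hypothetical semantic categories chosen for this example.
categories = {"animals": ["soweli", "waso"], "food": ["kili", "pan"]}

centroids = {
    name: centroid([embeddings[w] for w in words])
    for name, words in categories.items()
}

def nearest_category(word):
    """Return the category whose centroid is most similar to the word's vector."""
    return max(centroids, key=lambda c: cosine(embeddings[word], centroids[c]))
```

Calling `nearest_category("soweli")` returns the category whose centroid lies closest to that word; a real evaluation would aggregate such proximities over the whole core vocabulary.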

2606.17298 2026-06-17 cs.CV New submission

Reasoning Text-to-Video Retrieval for Operating Room Clips via Action-Driven Digital Twins

Yiqing Shen, Hao Ding, Mathias Unberath

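Affiliations: Johns Hopkins University

AI summary: Proposes OR3, which converts video clips into structured action-driven digital twin (ActDT) representations, uses a large language model to generate hypothetical ActDTs for retrieval, and applies evidence-grounded refinement to reason over implicit queries, substantially outperforming baselines on operating room video retrieval.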

Abstract

Text-to-video retrieval in operating rooms (OR) is an enabling technology for OR safety, as it allows stakeholders to retrieve and inspect recordings of specific events. However, because the most safety-critical events may not follow the common structure, to unlock its full potential text-to-video retrieval must be able to handle implicit queries that require reasoning to identify the right video (e.g., the step right before clipping). However, existing methods rely on global embeddings that cannot reason over such queries. We propose OR3, a text-to-video retrieval method that converts clips into action-driven digital twins (ActDTs), grouping concurrent subject-action-object triplets under non-overlapping temporal intervals. Moreover, rather than cross-modal matching through paired encoders, OR3 performs imagination-based retrieval where an LLM generates hypothetical ActDTs from queries. This enables intra-modal matching via a single encoder trained with ActDT-tailored hard negatives. Finally, evidence-grounded refinement revises imagined ActDTs based on discrepancies with top candidates to capture procedure-specific patterns. We construct a benchmark from MM-OR with 276 implicit queries across four reasoning categories over 386 clips from robotic knee procedures. OR3 achieves 57.6 R@1 and 77.3 R@5, outperforming the strongest baseline. These results demonstrate that OR3 enables fine-grained discrimination between visually similar OR video clips through temporal action reasoning.
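
One way to picture the ActDT structure described above (concurrent subject-action-object triplets grouped under non-overlapping temporal intervals) is the following sketch. The class names, field names, and example labels are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Triplet:
    """One subject-action-object statement, e.g. surgeon / cuts / bone."""
    subject: str
    action: str
    obj: str

@dataclass
class Interval:
    """A time span (in seconds) holding the triplets that occur concurrently in it."""
    start: float
    end: float
    triplets: list = field(default_factory=list)

def is_valid_actdt(intervals):
    """True if every interval has positive length and no two intervals overlap."""
    ordered = sorted(intervals, key=lambda iv: iv.start)
    if any(iv.end <= iv.start for iv in ordered):
        return False
    return all(a.end <= b.start for a, b in zip(ordered, ordered[1:]))

# Hypothetical clip: two consecutive intervals, the first with two concurrent triplets.
clip = [
    Interval(0.0, 4.0, [Triplet("nurse", "hands", "saw"),
                        Triplet("surgeon", "holds", "retractor")]),
    Interval(4.0, 9.5, [Triplet("surgeon", "cuts", "bone")]),
]
```

Under this representation, an implicit query such as "the step right before cutting" becomes a question about which interval precedes the one containing a given action.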

2606.17296 2026-06-17 cs.CV New submission

Pareto LoRA: Mitigating Modality Imbalance in Unified Multimodal Models via Pareto-Optimal Gradient Integration

Xiwen Wei, Mark Nutter, Madhusudhanan Srinivasan, Radu Marculescu

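Affiliations: The University of Texas at Austin; Advanced Micro Devices, Inc.

AI summary: Addresses the problem that language gradients dominate optimization during LoRA fine-tuning of unified multimodal models, degrading image generation quality. Proposes Pareto LoRA, a Pareto-optimal gradient integration strategy that balances the text and image objectives by modulating gradient direction and strength, improving perceptual image quality by up to 44.9% on the CoMM benchmark.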

Abstract

Unified multimodal models (UMMs) have recently emerged as a promising paradigm for integrating multimodal understanding and generation within a single autoregressive transformer. However, during multimodal instruction tuning, these models often exhibit pronounced modality imbalance: language gradients dominate optimization, thus leading to lower image generation quality, especially under parameter-efficient fine-tuning such as LoRA. In this work, we systematically analyze modality imbalance in LoRA-based fine-tuning of UMMs for interleaved text-image generation. We show that vision modality performance degrades substantially more than text modality performance when compared to unimodal counterparts, and that modality-specific gradients can differ by orders of magnitude across various tasks and layers. Motivated by this observation, we reformulate the multimodal instruction tuning as a bi-objective optimization problem and propose Pareto LoRA, a Pareto-optimal gradient integration strategy that balances the text and image objectives by modulating the gradient direction and strength. Experiments on the CoMM benchmark with Emu2 demonstrate that Pareto LoRA consistently improves multimodal generation balance, achieving up to 44.9% gains in perceptual image quality over vanilla LoRA while maintaining comparable text performance.
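
The abstract does not spell out how Pareto LoRA integrates the two gradients. As a stand-in, the sketch below uses the classical two-objective min-norm rule from multiple-gradient descent, which finds the shortest vector in the convex hull of the text and image gradients. This is a swapped-in textbook technique shown only to illustrate bi-objective gradient balancing; the function and variable names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def min_norm_combination(g_text, g_image):
    """Return (alpha, d) with d = alpha*g_text + (1-alpha)*g_image of minimum norm.

    alpha is clipped to [0, 1], so d stays in the convex hull of the two gradients.
    """
    diff = [a - b for a, b in zip(g_text, g_image)]
    denom = dot(diff, diff)
    if denom == 0.0:
        # Identical gradients: any weighting gives the same direction.
        return 0.5, list(g_text)
    # Closed-form minimizer of ||g_image + alpha * (g_text - g_image)||^2.
    alpha = dot([b - a for a, b in zip(g_text, g_image)], g_image) / denom
    alpha = min(1.0, max(0.0, alpha))
    d = [alpha * a + (1 - alpha) * b for a, b in zip(g_text, g_image)]
    return alpha, d

# Hypothetical gradients: the text gradient is ten times larger than the image one.
alpha, d = min_norm_combination([10.0, 0.0], [0.0, 1.0])
```

With these inputs the dominant text gradient receives a small weight, and the combined direction has the same positive inner product with both gradients, so a step along it changes both objectives in the same direction to first order.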

2606.17294 2026-06-17 cs.RO cs.LG New submission

VISTA: Scale-Aware Visual Navigation via Action History Conditioning

Maeva Guerrier, Koki Kobayashi, Simon Roy, Jana Pavlasek, Giovanni Beltrame

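Affiliations: Polytechnique Montreal; MILA; Institute of Science Tokyo; CoRA Lab; Mist Lab

AI summary: Addresses the scale fragility that action normalization causes in visual navigation foundation models by conditioning on action history to provide physical-displacement context, and integrates a DINOv3 encoder to strengthen feature representations in repetitive environments, enabling zero-shot deployment across environments.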

Abstract

Vision Navigation Foundation Models (VNMs) promise end-to-end learned navigation policies capable of zero-shot deployment across diverse embodiments and environments. To maintain generality, many vision-based navigation models predict normalized actions. However, this normalization introduces a critical deployment vulnerability: applying different scaling factors to the same normalized trajectory alters its physical geometry, which degrades navigation performance and increases collision risks. We address this vulnerability by conditioning the model on normalized action histories alongside image observations, providing explicit context on the relationship between the model's predictions and the robot's actual physical displacement. Furthermore, current VNMs often struggle in visually repetitive environments that lack distinct features. To resolve this issue, we integrate a DINOv3 encoder, whose richer representations enable our model to capture both spatial and geometric dimensions between observations. VISTA generalizes robustly to out-of-distribution environments, achieving 100% goal prediction accuracy in zero-shot, real-world deployment in Outdoor, Forest and Office settings, and an average of 95% checkpoints crossed, demonstrating consistent path following in unseen environments.

2606.17289 2026-06-17 cs.AI cs.CL New submission

Nothing from Something: Can a Language Model Discover 0?

Phoebe Zeng, Thomas L. Griffiths, Brenden M. Lake

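Affiliations: Department of Computer Science, Princeton University

AI summary: Studies whether language models can independently discover the concept of zero, using arithmetic tasks. GPT-2-scale models fail to generalize at test time but improve substantially after training on a small number of examples, and language pretraining reduces the number of examples required by about 50%.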

Abstract

AI systems based on artificial neural networks are being developed with aspirations of pushing the boundary of human mathematical knowledge. A key question for these systems is how much they can reach beyond their training data. Mathematical discovery requires a strong form of out-of-distribution generalization: the ability to hypothesize genuinely new, and potentially logically more powerful, mathematical structures. It has been hypothesized that language abilities support such generalizations in human cognition. In this work, we use simple arithmetic as a case study for examining how modern AI models could expand their mathematical horizons, evaluating whether these models can independently discover the concept of "zero". We show that (1) language models of GPT-2 size are unable to perform this generalization at test time regardless of language pretraining, but (2) models can improve substantially after training on tens or hundreds of examples of zero. Additionally, we find that language pretraining reduces the number of required examples by approximately 50%, showing that language abilities can scaffold mathematical discovery in neural models.

2606.17281 2026-06-17 cs.CL cs.SD eess.AS New submission

Are you speaking my languages? On spoken language adherence in multimodal LLMs

Hyungwon Kim, Kandarp Joshi, Lillian Zhou, Pavel Golik, Petar Aleksic

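Affiliations: Google DeepMind

AI summary: Addresses multimodal LLMs producing output in the wrong language during automatic speech recognition. Proposes three mitigation strategies (soft prompting, supervised fine-tuning, and chain-of-thought reasoning), introduces a new metric that quantifies language adherence violations, and compares how well each method reduces violations while preserving ASR performance.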

Comments 7 pages, 3 tables in the main body

Abstract

While Large Language Model (LLM) based Automatic Speech Recognition (ASR) enables seamless multilingual use, models often misidentify the output language, compromising transcription fidelity and downstream application quality. To preserve flexibility and code-switching capabilities, we propose a soft prompting approach that hints at potential spoken languages without strictly constraining the output. We formally define this challenge as a lack of language adherence, introduce a novel metric to quantify violations, and evaluate three mitigation strategies: (1) zero-shot prompting for robust guidance under uncertainty, (2) supervised fine-tuning (SFT) to improve prompt adherence, and (3) Chain-of-Thought (CoT) reasoning to enforce adherence during decoding. We present a comparative analysis of these methods across multiple languages, evaluating effectiveness in reducing the language violation while maintaining overall ASR performance. Finally, we discuss trade-offs to guide strategy selection under various compute constraints.

2606.17279 2026-06-17 cs.CV New submission

Training LLMs with Reinforcement Learning over Digital Twin Representations for Reasoning-Intensive Surgical VideoQA

Yiqing Shen, Han Zhang, Mathias Unberath

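Affiliations: Johns Hopkins University

AI summary: Proposes a reinforcement learning framework that builds digital twin representations from surgical foundation models to decouple visual perception from reasoning, and introduces hierarchical representations and a novel reward, achieving state-of-the-art performance on three benchmarks.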

Abstract

Surgical video question answering requires multi-step reasoning across semantic, spatial, and temporal dimensions. Existing methods architecturally compress videos into discrete token representations and couple visual perception with reasoning. This approach fragments continuous spatial-temporal relationships and has been shown to restrict multi-step reasoning capabilities. We introduce a reinforcement learning (RL) framework that trains large language models (LLMs) to decouple perception from reasoning by operating over digital twin representations constructed from surgical foundation models. Additionally, we introduce hierarchical representations across frame, temporal window, and procedure levels with probabilistic uncertainty estimates. Finally, we propose a novel reward that combines format validation with accuracy assessment through clinical plausibility evaluation and uncertainty-aware calibration for training. To demonstrate the capabilities of this approach, we introduce REAL-Colon-Reason, a colonoscopic benchmark with 2000 question-answer pairs across three complexity levels. We achieve state-of-the-art performance on REAL-Colon-Reason and two existing surgical VideoQA benchmarks REAL-Colon-VQA and EndoVis18-VQA.

2606.17269 2026-06-17 cs.AI cs.SY eess.SY New submission

Skill-Constrained Model Predictive Control for Resilient Manufacturing Supply Chains

Carlos Eduardo Sanoja

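Affiliations: Quanta Labs, LLC; Universidad Monteávila

AI summary: For skill-constrained production-inventory systems, proposes a closed-loop model predictive controller that optimizes production, inventory, backlog, and training decisions via mixed-integer programming, and evaluates it under a range of disruptions, finding that predictive control helps only when skill bottlenecks can be forecast in advance.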

Abstract

In skill-constrained production-inventory systems, the qualified human capacity available tomorrow depends on training decisions made today: production requires certified workers, certifications decay unless maintained, and training consumes the same scarce worker hours that production needs now. We study a closed-loop skill-constrained model predictive controller that, at every shift, solves a finite-horizon mixed-integer program over production, inventory, backlog, and training, with binary predicted certification, hard production eligibility, and an interpretable terminal value that prices certified-capacity gaps at the horizon boundary; only the first-period action is applied before replanning. On synthetic, seed-controlled SkillChain-Gym scenarios - announced and surprise new-skill shocks, demand shocks, absenteeism, forecast- and availability-quality modes, capacity-boundary and training-rate sweeps, and negative controls - we evaluate the controller against production-only and maintenance-only ablations, static cross-training insurance plans, and a strong reactive heuristic, under an ex-ante locked configuration and paired statistics. The result is regime dependence, not superiority: no policy class dominates. Predictive control helps when skill or labor bottlenecks are forecastable early enough for training to complete; lean static insurance remains hard to beat under surprise shocks, near the demand-capacity boundary, and wherever pre-shock slack makes insurance cheap. Attribution ablations separate certification maintenance, re-acquisition of lapsed certifications, and greenfield skill acquisition. Forecastability, not adaptivity per se, decides when predictive control pays.
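
The receding-horizon pattern described above (solve a finite-horizon problem, apply only the first-period action, then replan) can be sketched on a toy inventory problem. This is a deliberately simplified illustration, not the paper's controller: it has no skills, certification, or training, it replaces the mixed-integer program with brute-force enumeration, and the cost weights and production cap are hypothetical.

```python
from itertools import product

def plan_cost(inventory, plan, forecast, hold=1.0, backlog=5.0):
    """Cost of a production plan: holding cost on stock, penalty on backlog.

    Negative inventory represents unmet demand (backlog).
    """
    cost = 0.0
    for produce, demand in zip(plan, forecast):
        inventory += produce - demand
        cost += hold * inventory if inventory >= 0 else backlog * -inventory
    return cost

def mpc_step(inventory, forecast, horizon=3, max_rate=2):
    """Enumerate all plans over the horizon and return the best first action."""
    window = forecast[:horizon]
    plans = product(range(max_rate + 1), repeat=len(window))
    best = min(plans, key=lambda p: plan_cost(inventory, p, window))
    return best[0]  # only the first-period action is applied

def run(demand, horizon=3):
    """Closed loop: replan at every period using the remaining forecast."""
    inventory, actions = 0, []
    for t in range(len(demand)):
        action = mpc_step(inventory, demand[t:], horizon)
        inventory += action - demand[t]
        actions.append(action)
    return actions, inventory
```

With a demand spike that exceeds the per-period production cap, for example `run([1, 1, 4, 0])`, the controller builds stock ahead of the spike because the spike is visible inside the planning window. This mirrors the abstract's point that predictive control pays off when bottlenecks are forecastable early enough.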

2606.17266 2026-06-17 cs.AI cs.SY eess.SY New submission

SkillChain-Gym: A Benchmark for Reskilling-Aware Production-Inventory Control under Disruptions

Carlos Eduardo Sanoja

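Affiliations: Quanta Labs, LLC; FCEA, Universidad Monteávila

AI summary: Proposes the SkillChain-Gym benchmark for evaluating production-inventory control policies that account for skill dynamics such as forgetting and retraining. Experiments find that no policy dominates across all scenarios, so the choice should adapt to forecasts.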

Abstract

Production planning increasingly has to treat workforce capability as a decision variable: certifications lapse when skills are not maintained, new products require skills the current workforce does not hold, and reskilling competes for the same worker hours needed for production. Existing operations benchmarks usually treat labor as exogenous, while workforce-planning models with skills and learning are rarely released as reusable testbeds. We introduce SkillChain-Gym, a benchmark specification for reskilling-aware production-inventory control: a single-site environment with stylized worker skill-state dynamics, hard threshold certification, forgetting, and capacity-consuming training actions constrained by the same per-worker time budget as production. The benchmark includes seed-controlled disruption scenarios, three feasibility modes with projection diagnostics, deterministic replay, and metrics covering operations, resilience, capability growth, and training-access distribution. We evaluate production-only, reactive adaptive, water-filling adaptive, and static-insurance policies with budget variants over 60-shift horizons with paired statistical tests. The results are regime-dependent rather than a ranking. Training-capable policies dominate the production-only baseline, and maintenance training is necessary under forgetting even without disruptions. Among training-capable classes, adaptive training helps when bottlenecks are visible in the forecast, while a lean static cross-training plan, a deliberately favorable comparator whose structure encodes relevant skill contingencies, acts as strong insurance under surprise shocks and absenteeism. Capacity slack and the forgetting rate govern the boundary between these regimes. No policy class dominates across regimes, motivating forecast-driven controllers that decide when to buy skill insurance and when to react.

2606.17257 2026-06-17 cs.CV cs.AI New submission

Pulling The REINS: Training-Free Safety Alignment of Video Diffusion Models via Representation Steering

Rohit Kundu, Arindam Dutta, Sarosij Bose, Athula Balachandran, Amit K. Roy-Chowdhury

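Affiliations: University of California, Riverside; YouTube (Google)

AI summary: Proposes REINS, which steers the internal representations of video diffusion models along a linear direction at inference time, achieving training-free safety alignment that avoids harmful content generation without degrading general capability.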

Abstract

Open-weight video diffusion models can generate photorealistic unsafe content, from violence to misinformation, yet existing defenses either require expensive safety fine-tuning that degrades general capability, or apply external filters that are trivially bypassed by adversarial prompts. We present REINS (REpresentation-space INference-time Safety steering), a training-free method that aligns video diffusion models at inference time by steering their internal representations toward safe generation. Our key finding is that safety-relevant structure is linearly encoded in the hidden-state activations of video diffusion transformers, and a single direction, discovered via Supervised PCA on binary safety labels, suffices to separate safe from unsafe generation trajectories. At inference, adding this direction to hidden states at an intermediate transformer layer redirects generation from harmful content to semantically related safe alternatives, with no weight updates, no concept enumeration, and negligible computational overhead. Through mechanistic analysis, we reveal that while safety information accumulates monotonically with transformer depth, steering effectiveness peaks at intermediate layers (~50% depth), exposing a fundamental tradeoff between information availability and downstream propagation capacity. We evaluate REINS across 9 video diffusion models, multiple parameter scales (1.3B-5B), and both text-to-video and image-to-video generation; to our knowledge, this is the broadest safety evaluation suite in the video generation literature.
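
The core operation described above, adding a single direction to a hidden state at inference time, can be sketched in a few lines. The abstract finds the direction with Supervised PCA; this sketch swaps in a simpler difference of class means as a stand-in, and all vectors, names, and the steering strength are hypothetical.

```python
import math

def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def unit(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def safety_direction(safe_states, unsafe_states):
    """Unit vector pointing from the unsafe class mean to the safe class mean.

    Stand-in for Supervised PCA; shown only to make the steering step concrete.
    """
    mu_safe, mu_unsafe = mean(safe_states), mean(unsafe_states)
    return unit([s - u for s, u in zip(mu_safe, mu_unsafe)])

def steer(hidden, direction, strength):
    """Add the scaled direction to a hidden state; model weights are untouched."""
    return [h + strength * d for h, d in zip(hidden, direction)]

def score(hidden, direction):
    """Projection of a hidden state onto the direction (higher means safer here)."""
    return sum(h * d for h, d in zip(hidden, direction))

# Hypothetical 2-D activations collected from labeled generations.
safe = [[1.0, 0.2], [0.8, 0.0]]
unsafe = [[-1.0, 0.1], [-0.8, 0.1]]
direction = safety_direction(safe, unsafe)
steered = steer([-0.7, 0.3], direction, strength=1.5)
```

Because the direction has unit length, steering raises the projection by exactly the chosen strength. In a real model the same addition would be applied to the activations of one intermediate layer during the forward pass.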

2606.17256 2026-06-17 cs.RO cs.CV New submission

Contrastive Action-Image Pre-training for Visuomotor Control

Yuvan Sharma, Dantong Niu, Anirudh Pai, Zekai Wang, Zhuoyang Liu, Baifeng Shi, Stefano Saravalle, Boning Shao, Ruijie Zheng, Jing Wang, Konstantinos Kallidromitis, Yusuke Kato, Fabio Galasso, Yuke Zhu, Danfei Xu, Linxi "Jim" Fan, Jitendra Malik, Trevor Darrell, Roei Herzig

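Affiliations: UC Berkeley; NVIDIA; Sapienza University of Rome; Panasonic; ItalAI

AI summary: Proposes CAIP, which uses 3D hand keypoints from large-scale egocentric video as proxy action signals and learns a unified action-image representation through contrastive learning, substantially improving dexterous manipulation performance with little robot data.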

Abstract

Existing vision encoders for robotics face a fundamental bottleneck: robotic datasets lack the scale necessary for large-scale pre-training. Prior work circumvents this data scarcity by turning to internet-scale image and language data or egocentric human video. While these models show promise, neither paradigm learns from paired vision and action data, which downstream visuomotor control policies require. However, robot trajectories, the most direct source of this paired signal, are not available at pre-training scale, motivating us to extract action signals from abundant human video instead. To this end, we introduce CAIP (Contrastive Action-Image Pre-training), a vision encoder that treats human hand poses from large-scale egocentric video as a proxy for end-effector actions. By extracting 3D hand keypoints, a representation that aligns naturally with downstream robot action spaces, CAIP learns a unified action-image representation through a contrastive objective. Leveraging 32,041 hours of egocentric human video and only 88 hours of robotic manipulation data, CAIP outperforms state-of-the-art vision encoders including DINOv2, SigLIP, MVP, and R3M. Evaluated on a challenging real-world dexterous manipulation setup using Dexmate Vega and Sharpa Wave hands, CAIP yields performance gains of more than 30% on tasks involving folding, pouring, and fine-grained manipulation. Our results show that our method of contrastive action-centric pre-training yields a scalable path to achieving robust visual representations better suited for physical interaction.
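
The abstract says only that CAIP uses "a contrastive objective" over paired action and image embeddings. A common choice for such pairings is a symmetric InfoNCE loss of the kind used in CLIP-style training; the sketch below assumes that form, which may differ from the paper's exact loss. The temperature and the toy embeddings are hypothetical.

```python
import math

def normalize(v):
    """Scale a non-zero vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row k is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[k]
    return total / len(logits)

def info_nce(image_embs, action_embs, temperature=0.1):
    """Symmetric InfoNCE over a batch in which pair i is the positive for item i."""
    imgs = [normalize(v) for v in image_embs]
    acts = [normalize(v) for v in action_embs]
    logits = [
        [sum(x * y for x, y in zip(img, act)) / temperature for act in acts]
        for img in imgs
    ]
    transposed = [list(col) for col in zip(*logits)]
    # Average the image-to-action and action-to-image directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))

# Hypothetical batch of two image embeddings and two hand-pose (action) embeddings.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
misaligned = [[0.1, 0.9], [0.9, 0.1]]
```

The loss is low when each image embedding is closest to its own action embedding (`aligned`) and high when the pairing is swapped (`misaligned`), which is the signal that pulls matching action-image pairs together during pre-training.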

2606.17255 2026-06-17 cs.CL cs.AI New submission

MLLP-VRAIN UPV system for the IWSLT 2026 Simultaneous Speech Translation task

Jorge Iranzo-Sánchez, Gerard Mas-Mollà, Adrià Giménez, Jorge Civera, Albert Sanchis, Alfons Juan

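Affiliations: MLLP-VRAIN research group; VRAIN Universitat Politècnica de València

AI summary: Presents a cascaded simultaneous speech translation system built on the Parakeet and Qwen 3.5 models that optimizes the quality-latency trade-off with adaptive black-box policies and adds ASR word-boosting and a RAG mechanism for the context track, improving XCOMET-XL by +5.82 on the MCIF En→De test set.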

Comments IWSLT 2026 System Description

Abstract

This work describes the participation of the MLLP-VRAIN research group in the shared task of the IWSLT 2026 Simultaneous Speech Translation track. Our submission utilizes the recently released Parakeet and Qwen 3.5 models to create a robust, cascaded solution for long-form SimulST through the use of adaptive "black-box" policies. We explore relaxations of these policies to achieve better quality-latency trade-offs. In contrast to last year, we participate in all language directions. In addition, for the En→{De, It, Zh} directions we also participate in this year's new context track, employing a combination of ASR word-boosting and a RAG mechanism of offline pre-translated exemplars to guide generation and enrich our system with domain-specific context. Finally, we provide a detailed latency analysis of our system. Compared to last year, results on the MCIF En→De test set show a substantial quality improvement of +5.82 XCOMET-XL. Our context track processing further improves performance by +1.03.

2606.17250 2026-06-17 cs.LG cs.CL New submission

Rethinking Groups in Critic-Free RLVR

Yihong Wu, Liheng Ma, Lingfeng Xiao, Muzhi Li, Xinyu Wang, Yingxue Zhang, Jian-Yun Nie

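Affiliations: Université de Montréal; McGill University; Mila - Quebec AI Institute; University of Waterloo; The Chinese University of Hong Kong; Huawei Noah’s Ark Lab

AI summary: Addresses the data inefficiency and synchronization problems of grouping in critic-free reinforcement learning by proposing negative token filtering, which enables stable single-rollout training and achieves comparable or better performance on reasoning and agentic tasks.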

Abstract

Reinforcement learning (RL) has become a central paradigm for post-training large language models. Existing critic-free RL methods typically generate a group of rollouts for the same question to estimate value baselines for advantage computation. However, this design suffers from data inefficiency, group synchronization barriers, and inflexibility with structured rollouts. In this work, we revisit the role of the "group" and show that its underlying function is not merely to estimate baselines but to prevent false penalties on negative samples. Building on this insight, we propose negative token filtering, a simple and effective strategy that enables stable single-rollout training. We apply it to two batch-level advantage methods, achieving comparable performance on reasoning tasks and stronger performance on agentic tasks relative to group-based RL techniques.
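
For readers unfamiliar with the group-based design that the abstract revisits, the sketch below shows the basic idea of using the mean reward of a group of rollouts as the baseline. It is a minimal illustration only: some group-based methods also divide by the group's standard deviation, which is omitted here, and the sketch does not implement the proposed negative token filtering, whose details the abstract does not give.

```python
def group_advantages(rewards):
    """Advantage of each rollout relative to the mean reward of its group."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

# Hypothetical binary rewards for four rollouts of the same question.
mixed = group_advantages([1.0, 0.0, 0.0, 1.0])
uniform = group_advantages([0.0, 0.0, 0.0, 0.0])
```

In the mixed group, correct rollouts receive a positive advantage and incorrect ones a negative advantage. In the uniform group every advantage is zero, so those rollouts contribute no learning signal under this scheme.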

2606.17246 2026-06-17 cs.CV cs.MA New submission

GeoDisaster: Benchmarking Orchestrated Agents for Operational Disaster Geo-Intelligence

Maram Hasan, Aman Verma, Savitra Roy, Hariseetharam Gunduboina, Daksh Jain, Muhammad Haris Khan, Subhasis Chaudhuri, Biplab Banerjee

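Affiliations: Indian Institute of Technology Bombay; Mohamed bin Zayed University of Artificial Intelligence

AI summary: Proposes the GeoDisaster benchmark, with 2,921 instances and 43 question types, for evaluating remote-sensing vision-language models on tool-grounded spatial reasoning and structured decision-making, and designs a multi-agent framework aligned via RCEA to improve tool use and evidence grounding.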

Comments 28 pages, 11 Figures

Abstract

Remote-sensing vision-language models (RS-VLMs) have advanced Earth-observation analysis toward visual interpretation and instruction-following, yet fall short of operational geo-intelligence, which demands tool-grounded spatial reasoning and structured, evidence-backed decisions. We introduce GeoDisaster, an operational geospatial disaster reasoning benchmark with 2,921 verified instances across 43 question types and five task families: deforestation monitoring, multi-hazard analysis, building-damage assessment, flood-safe routing, and Sentinel-1 SAR flood monitoring. Instances integrate heterogeneous EO/GIS evidence (optical and SAR imagery, raster masks, vector geometries, road networks, and exposure layers), spanning hazard detection, damage assessment, exposure estimation, and diagnostic report generation. Ground-truth answers are grounded in executable geospatial workflows and deterministic consistency checks, removing the need for language-model annotation. We further propose an orchestrated multi-agent framework with 18 disaster-oriented tools, where role-specialized agents coordinate through explicit execution contracts, aligned via Role-Contract Expectation Alignment (RCEA): failure-aware supervised fine-tuning combined with contract-grounded reinforcement learning over dense step-level signals. Experiments show that GeoDisaster challenges existing RS-VLMs and agentic systems, while RCEA improves tool use, evidence grounding, state consistency, and decision generation.

2606.17242 2026-06-17 cs.CV 新提交

Landsat-Sentinel-2 Algal Bloom Mapping Using Vision Transformers: Model Description, Implementation, and Examples

基于视觉Transformer的Landsat-Sentinel-2藻华制图:模型描述、实现与示例

Thainara Lima, Vitor Martins

发表机构 * Department of Agricultural & Biological Engineering, Mississippi State University(密西西比州立大学农业与生物工程系)

AI总结 提出首个基于视觉Transformer的沿海藻华制图方法,利用Landsat-Sentinel-2 30米分辨率影像,通过全局分布数据集和多种架构对比,证明Swin Transformer在云/耀斑条件下优于传统方法,实现高精度碎片化藻华检测。

详情
AI中文摘要

沿海藻华监测需要频繁、空间详细且全球一致的观测,这由Landsat-8/9和Sentinel-2 A/B/C提供。这些任务共同提供了超过十年的中等分辨率多光谱影像,每2-3天覆盖近全球,能够检测粗分辨率海洋水色传感器无法分辨的碎片化藻华结构。然而,由于光谱覆盖有限且缺乏统一的反射率产品,它们在水生环境中的应用仍然具有挑战性。作为传统生物光学方法的替代,基于深度学习的图像分类提供了一种数据驱动的方法,可以克服许多这些限制。本研究首次成功实现了基于视觉Transformer的沿海藻华制图,使用30米Landsat-Sentinel-2影像。在全球范围内易发生藻华的沿海热点区域生成了一个全球分布的藻华斑块数据集。将四种基于Transformer的架构与标准卷积基线进行比较,用于精细尺度藻华检测,并在不同光学水类型、大气和表面条件下进行评估。所有深度学习模型在检测漂浮藻华区域方面表现出强大能力,遗漏和误报误差为8-65%。在时间序列中的云和耀斑压力下,Swin Transformer优于传统的光谱指数方法(后者产生广泛的误报),有效避免了受云和耀斑影响的像素。与MODIS产品的进一步比较突出了更高空间分辨率在检测碎片化和不规则影响藻华方面的优势。我们的研究结果支持深度学习作为动态沿海环境中漂浮藻华中等分辨率一致监测的可靠工具。

英文摘要

Coastal algal bloom monitoring requires frequent, spatially detailed, and globally consistent observations, provided by Landsat-8/9 and Sentinel-2 A/B/C. Together, these missions offer over a decade of medium-resolution multispectral imagery with near-global coverage every 2-3 days, enabling the detection of fragmented bloom structures not resolvable by coarse ocean-color sensors. However, their use in aquatic environments remains challenging due to limited spectral coverage and a lack of harmonized reflectance products. As an alternative to traditional bio-optical methods, deep learning-based image classification offers a data-driven approach that can overcome many of these limitations. This study presents the first successful implementation of vision transformer-based coastal algal bloom mapping using 30-m Landsat-Sentinel-2 images. A globally distributed bloom patch dataset was generated across bloom-prone coastal hotspots worldwide. Four transformer-based architectures were compared against a standard convolutional baseline for fine-scale bloom detection, and assessed under different optical water types and atmospheric and surface conditions. All deep learning models showed strong capabilities in detecting floating bloom areas, with omission and commission errors of 8-65%. Under cloud and glint stress in a time series, the Swin Transformer outperformed traditional spectral-index approaches, which produced widespread false positives, effectively avoiding cloud- and glint-affected pixels. Comparisons with MODIS-derived products further highlighted the benefits of higher spatial resolution in detecting fragmented and irregularly affected blooms. Our findings support deep learning as a reliable tool for medium-resolution, consistent monitoring of floating algal blooms in dynamic coastal environments.

2606.17241 2026-06-17 cs.CV cs.RO cs.SY eess.SY 新提交

Beyond Benchmarks: Continuous Edge Inference for Fine-Grained Roadside Perception

超越基准:面向细粒度路边感知的连续边缘推理

Aditya Mishra, Haroon Lone

发表机构 * Indian Institute of Science Education and Research Bhopal(印度科学教育与研究学院博帕尔分校)

AI总结 针对边缘推理在持续运行中的性能退化问题,提出Edge-TSR系统,集成检测、跟踪与轻量级时域稳定机制,在NVIDIA Jetson Orin Nano上实现实时路边感知,恢复高达10.16%的分类准确率。

详情
AI中文摘要

在资源受限的边缘硬件上进行连续AI推理会引入传统基准评估难以察觉的部署效应,包括流视频的时间不稳定性、持续负载下的热节流以及工作负载相关的性能变化。我们提出Edge-TSR,一个面向部署的连续边缘推理系统,用于在NVIDIA Jetson Orin Nano上进行持续的路边感知。Edge-TSR集成了检测、跟踪、细粒度分类以及轻量级的轨迹感知时域稳定机制,以最小的计算开销提高了流推理的一致性。我们的核心发现是,以基准为中心的评估系统性地高估了部署边缘推理的性能。在三个最先进的基线上,我们观察到从静态图像评估过渡到真实流部署时,性能一致下降20-30%。Edge-TSR通过时域推理稳定解决了这一差距,在持续运行下,相比逐帧推理基线,恢复了高达10.16%的分类准确率,同时保持了实时性能。我们在多种真实部署条件下评估了整个系统,联合表征了长时间运行期间的推理质量、延迟、吞吐量和热行为。在26公里路线上进行的55分钟车辆部署表明,在单个嵌入式设备上,无需云端卸载,即可在安全热限制内以16.18 FPS持续运行。我们的发现表明,部署感知评估和时域推理稳定是面向真实传感部署的持续运行边缘AI系统的必要组成部分。我们发布了一个带注释的流视频评估数据集样本和完整的系统实现,以支持可重复的以部署为中心的评估。

英文摘要

Continuous AI inference on resource-constrained edge hardware introduces deployment effects that are largely invisible to conventional benchmark evaluation, including temporal instability in streaming video, thermal throttling under sustained load, and workload-dependent performance variability. We present Edge-TSR, a deployment-oriented continuous edge inference system for sustained roadside perception on the NVIDIA Jetson Orin Nano. Edge-TSR integrates detection, tracking, fine-grained classification, and a lightweight track-aware temporal stabilization mechanism that improves streaming inference consistency with negligible computational overhead. Our central finding is that benchmark-centric evaluation systematically overstates deployed edge inference performance. Across three state-of-the-art baselines, we observe consistent 20-30% relative degradation when transitioning from static-image evaluation to real-world streaming deployment. Edge-TSR addresses this gap through temporal inference stabilization, recovering up to 10.16% classification accuracy over per-frame inference baselines while maintaining sustained real-time performance under continuous operation. We evaluate the complete system under diverse real-world deployment conditions, jointly characterizing inference quality, latency, throughput, and thermal behavior during long-duration operation. A 55-minute vehicular deployment over a 26 km route demonstrates sustained operation at 16.18 FPS within safe thermal limits on a single embedded device without cloud offload. Our findings show that deployment-aware evaluation and temporal inference stabilization are necessary components of continuously operating edge AI systems intended for real-world sensing deployments. We release a sample annotated streaming video evaluation dataset and full system implementation to support reproducible deployment-centric evaluation.
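The abstract does not specify how Edge-TSR's track-aware temporal stabilization works. As a rough illustration of the general idea only, the sketch below smooths per-frame class labels with a sliding-window majority vote per track. The class name `TrackLabelSmoother`, the window size, and the voting rule are all assumptions made for this example, not the authors' implementation.

```python
from collections import Counter, defaultdict, deque


class TrackLabelSmoother:
    """Majority vote over the last `window` per-frame labels of each track.

    Hypothetical helper: the voting rule is an assumption, not Edge-TSR's method.
    """

    def __init__(self, window=5):
        self.window = window
        # One bounded label history per tracker-assigned track id.
        self.history = defaultdict(lambda: deque(maxlen=window))

    def update(self, track_id, frame_label):
        votes = self.history[track_id]
        votes.append(frame_label)
        # Ties are resolved in favour of the label seen first in the window.
        return Counter(votes).most_common(1)[0][0]


smoother = TrackLabelSmoother(window=5)
per_frame = ["stop", "stop", "yield", "stop", "stop"]  # one flickering frame
stabilized = [smoother.update(track_id=7, frame_label=lab) for lab in per_frame]
```

In this toy run the single "yield" frame is outvoted by its neighbours, so the track keeps a consistent label. The cost per frame is a small counter update, which is in the spirit of the "negligible computational overhead" the abstract reports.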

2606.17234 2026-06-17 cs.CL 新提交

Speaking in Self-Assessing Tongues: On the Verbalized Confidence of LLMs in Machine Translation

自评之言:论大语言模型在机器翻译中的口头化置信度

Ali Marashian, Alexis Palmer, Katharina von der Wense

发表机构 * University of Colorado Boulder(科罗拉多大学博尔德分校) Johannes Gutenberg University Mainz(美因茨约翰内斯·古腾堡大学)

AI总结 本研究设计了五种无需内部信号的口头化方法提取LLM逐标记置信度,并与内部确定性信号比较,发现两者在细粒度错误检测和校准上表现相似但相关性低。

详情
AI中文摘要

大语言模型(LLMs)在翻译中的迅速普及要求对其自身输出的置信度可靠性进行深入研究。与许多生成任务不同,翻译错误和置信度可以在不同粒度级别(标记、单词或片段)上发挥作用。基于预测概率等内部信号的无监督方法可能具有误导性,因为它们反映的是替代方案之间的确定性而非正确性。此外,这些方法需要访问此类内部信号。本文设计了五种口头化方法,用于在没有这些缺点的情况下提取LLM的逐标记置信度,并将其可靠性与模型内部确定性信号进行比较。我们使用两种对齐形式评估可靠性:细粒度错误检测和校准。对于两者,内部方法和口头化方法表现相似,尽管结果因模型而异。有趣的是,我们发现内部方法与口头化方法之间几乎没有相关性。

英文摘要

The rapid rise in popularity of large language models (LLMs) for translation calls for a thorough study of the reliability of their confidence in their own outputs. Unlike many generation tasks, translation errors and confidence levels can be useful at different levels of granularity (tokens, words, or spans). Unsupervised approaches based on internal signals like predicted probabilities can be misleading because they reflect certainty among alternatives rather than correctness. In addition, they require access to such internal signals. Here, we devise five verbalized methods of extracting an LLM's per-token confidence without those shortcomings and compare their reliability with that of the model's internal signals of certainty. We evaluate reliability using two forms of alignment: fine-grained error detection and calibration. For both, internal and verbalized methods perform similarly, although results vary by model. Interestingly, we find little to no correlation between internal and verbalized methods.

2606.17233 2026-06-17 cs.LG stat.ML 新提交

Uncertainty Quantification of Engineering Structures by Polynomial Chaos Expansion and Multivariate Active Learning

基于多项式混沌展开与多元主动学习的工程结构不确定性量化

Qitian Lu, Jafar Jafari-Asl, Panagiotis Spyridis, Lukas Novak

发表机构 * Brno University of Technology(布尔诺理工大学) University of Rostock(罗斯托克大学)

AI总结 针对多输出工程问题中单一实验设计难以同时准确近似所有输出量的问题,提出一种自适应序贯采样方法,通过平衡输入空间探索与多输出聚合方差信息,构建多项式混沌展开代理模型,数值实验表明该方法提高了代理精度和稳定性。

详情
AI中文摘要

在许多工程应用中,单个高保真模型在相同输入参数下产生多个感兴趣的量(QoIs),例如复杂物理系统的有限元模型。为了减轻直接模型评估的高计算成本,代理模型被广泛用于构建模型响应的高效近似。自然地,代理模型的精度强烈依赖于实验设计(ED)的质量。然而,单个ED可能无法同时为所有输出提供足够的表示,特别是当不同输出对输入变量表现出不同的敏感性时。一个直接的解决方案是为每个输出分别进行采样,但这会导致采样复杂性和计算成本增加。从统计角度来看,这种方法也忽略了所有输出之间潜在的相关性,并可能损害数据一致性。为了解决这个问题,一种用于构建多项式混沌展开代理模型的自适应序贯采样方法被推广到向量值QoIs。该方法基于新样本对输出方差的局部贡献,从候选池中顺序选择新样本,同时平衡基于距离的输入空间探索和跨所有输出的聚合方差信息的利用。通过来自工程问题的几个数值示例,将其性能与非序贯拉丁超立方采样进行比较。数值结果表明,所提出的策略提高了代理模型的精度和稳定性,并提供了更可靠的二阶统计量估计。

英文摘要

In many engineering applications, a single high-fidelity model produces multiple quantities of interest (QoIs) under the same input parameters, e.g., finite element models of complex physical systems. To alleviate the high computational cost of direct model evaluations, surrogate models are widely used to construct efficient approximations of model responses. Naturally, the accuracy of surrogates strongly depends on the quality of the experimental design (ED). However, a single ED may not provide an adequate representation for all outputs simultaneously, especially when different outputs exhibit varying sensitivities to the input variables. A straightforward solution is to perform separate sampling for each output, but this results in increased sampling complexity and computational cost. From a statistical perspective, such an approach also ignores potential correlations among all outputs and may compromise data consistency. To address this issue, an adaptive sequential sampling method for constructing polynomial chaos expansion surrogate models is generalized to vector-valued QoIs. The method sequentially selects new samples from a candidate pool based on their local contribution to the output variance, while balancing distance-based exploration of the input space and exploitation of aggregated variance information across all outputs. Its performance is compared with non-sequential Latin Hypercube Sampling through several numerical examples from engineering problems. Numerical results demonstrate that the proposed strategy improves both surrogate accuracy and stability, and provides a more reliable estimation of second-order statistics.

2606.17229 2026-06-17 cs.LG cs.AI cs.CL 新提交

Rift: A Conflict Signature for Deception in Language Models

Rift: 语言模型中欺骗行为的冲突特征

Petr Nyoma

发表机构 * Harmonic Labs

AI总结 通过对比知情欺骗与无知错误,发现欺骗性前向传递具有高残差秩的冲突特征,能以100%准确率无标签识别谎言,并跨模型、语言和架构迁移。

Comments 13 pages, 4 figures. Code and experiment logs: https://github.com/Omibranch/Rift

详情
AI中文摘要

一个在知道真相的情况下说谎的模型是ELK无法仅通过行为评估处理的核心情况。我们询问这种欺骗是否留下内部特征,将其与诚实错误区分开来。我们的关键举措是对错误进行控制:我们将一个潜伏代理(知道真相,在触发时说谎)与一个天真的说谎者(微调以输出相同的错误答案,没有诚实训练)进行对比。两者产生相同的错误输出;任何差异都关乎知识冲突,而非不正确性。我们发现欺骗性前向传递携带一个冲突特征——在相同错误答案上,残差秩比天真的说谎者传递高2.1-2.3倍——足以在GPT-2 small/medium(三个种子)和三个指令模型上,以100%准确率且无需标签,识别两个响应中哪个是谎言。在Qwen2.5-1.5B/7B和Phi-3-mini上,指令欺骗在每个测试事实上都提高了残差秩(18/18, 40/40, 34/34);在Phi-3上,谎言与诚实答案和幻觉完美分离(AUC 1.0,Wilcoxon p~6e-11)。该特征在策略性自构建欺骗(模型自创谎言,AUC 1.0)、主动隐藏尝试(AUC 1.0)和长度控制复制(20/20,AUC 1.0,p~1e-6)中仍然存在。使用无基相对表示,在一个模型家族上训练的探针在零样本下检测到另外两个家族中的欺骗(平均AUC 0.933),在同时改变架构和格式时仍有效(AUC 0.821),并跨五种语言迁移(AUC 1.000,长度控制)。该特征是只读的:可检测但不可注入(双向0/8)。诚实的局限性和六个负面实验已完整记录。

英文摘要

A model that lies while knowing the truth is the central case ELK cannot handle with behavioral evaluation alone. We ask whether such deception leaves an internal signature distinguishing it from honest error. Our key move is a control for wrongness: we contrast a sleeper agent (knows the truth, lies on trigger) against a naive liar (fine-tuned to emit the same wrong answers with no honest training). Both produce identical wrong outputs; any difference is about knowledge conflict, not incorrectness. We find deceptive forward passes carry a conflict signature - 2.1-2.3x higher residual rank than naive-liar passes on the same wrong answer - strong enough to identify which of two responses is the lie with 100% accuracy and no labels, across GPT-2 small/medium (three seeds) and three instruct models. Across Qwen2.5-1.5B/7B and Phi-3-mini, instructed deception raises residual rank on every tested fact (18/18, 40/40, 34/34); on Phi-3, lies separate perfectly from both honest answers and hallucinations (AUC 1.0, Wilcoxon p~6e-11). The signature survives strategic self-constructed deception (model invents its own lie, AUC 1.0), active concealment attempts (AUC 1.0), and length-controlled replication (20/20, AUC 1.0, p~1e-6). Using basis-free relative representations, a probe trained on one model family detects deception in two other families zero-shot (mean AUC 0.933), surviving simultaneous architecture and format change (AUC 0.821), and transfers across five languages (AUC 1.000, length-controlled). The signature is read-only: detectable but not injectable (0/8 both directions). Honest limitations and six negative experiments are documented in full.

2606.17222 2026-06-17 cs.CV 新提交

Quantum Enhanced Multi-Scale CNN with Bi-directional Mamba for Crop Field Analysis

量子增强多尺度CNN与双向Mamba用于农田分析

Mohammad Salman Khan, Ehsan Atoofian, Saad B. Ahmed

发表机构 * Lakehead University(湖首大学)

AI总结 提出BiSpectral Mamba框架,结合多尺度CNN、光谱注意力、双向状态空间建模和量子启发学习,解决高光谱图像分类中的高维性、类不平衡等问题,在UAVHSI-Crop数据集上达到84.83%准确率。

详情
AI中文摘要

高光谱图像(HSI)作物分析对于精准农业至关重要,因为它捕获了丰富的光谱和空间信息,用于准确的作物监测和评估。然而,由于高光谱维度、空间复杂性、类别不平衡以及有限的标记样本,HSI分类仍然具有挑战性。为了解决这些问题,本文提出了一种基于BiSpectral Mamba的框架,该框架结合了多尺度卷积特征提取、光谱注意力、双向状态空间建模和量子启发学习。多尺度CNN骨干首先通过跨多个分辨率的特征融合提取层次化的空间-光谱表示。然后,光谱注意力机制强调信息丰富的波段,同时抑制冗余和噪声通道。精炼后的特征由BiSpectral Mamba模块处理,该模块通过将高光谱特征图建模为序列标记,在正向和反向方向上捕获长距离依赖关系。此外,还引入了类加权优化和特征融合策略,以提高训练稳定性并缓解类别不平衡。在UAVHSI-Crop数据集上的实验评估证明了所提框架的有效性,总体准确率达到84.83%。结果表明,集成卷积、注意力机制和状态空间建模组件能够实现稳健的空间-光谱特征学习,用于作物分类。所提框架还展示了在更广泛的农业和遥感应用中的潜力,包括作物病害检测、产量预测和土壤湿度估计,同时突出了结构化状态空间和量子启发架构在高光谱图像分析中的有效性。

英文摘要

Hyperspectral image (HSI) crop analysis is essential for precision agriculture because it captures rich spectral and spatial information for accurate crop monitoring and assessment. However, HSI classification remains challenging due to high spectral dimensionality, spatial complexity, class imbalance, and limited labeled samples. To address these challenges, this paper proposes a BiSpectral Mamba-based framework that combines multi-scale convolutional feature extraction, spectral attention, bidirectional state-space modeling, and quantum-inspired learning. A multi-scale CNN backbone first extracts hierarchical spatial-spectral representations through feature fusion across multiple resolutions. A spectral attention mechanism then emphasizes informative bands while suppressing redundant and noisy channels. The refined features are processed by a BiSpectral Mamba module that captures long-range dependencies in both forward and backward directions by modeling hyperspectral feature maps as sequential tokens. In addition, class-weighted optimization and feature fusion strategies are incorporated to improve training stability and mitigate class imbalance. Experimental evaluation on the UAVHSI-Crop dataset demonstrates the effectiveness of the proposed framework, achieving an overall accuracy of 84.83%. The results show that integrating convolutional, attention-based, and state-space modeling components enables robust spatial-spectral feature learning for crop classification. The proposed framework also shows potential for broader agricultural and remote sensing applications, including crop disease detection, yield prediction, and soil moisture estimation, while highlighting the effectiveness of structured state-space and quantum-inspired architectures for hyperspectral image analysis.

2606.17220 2026-06-17 cs.AI 新提交

When Rules Learn: A Self-Evolving Agent for Legal Case Retrieval

当规则学习时:一种用于法律案例检索的自演化智能体

Mingxu Tao, Jiawei Hu, Xian Zhou, Wenpeng Hu, Jiajun Cheng, Yunbo Cao, Zhunchen Luo, Guotong Geng

发表机构 * Center of Information Research, AMS(AMS信息研究中心) Discipline and Technology Research Center for Large Model Intelligence Applications(大模型智能应用学科与技术研究中心) Hebei University of Engineering(河北工程大学)

AI总结 提出一种自演化框架,通过LLM智能体自动生成并优化查询重写规则,无需参数训练即可增强BM25在法律案例检索中的性能。

Comments To appear in ACL 2026

详情
AI中文摘要

由于法律语言的复杂性以及查询与相关案例之间需要精确的词汇对齐,法律案例检索仍然具有挑战性。尽管密集检索模型取得了显著进展,但实证研究表明,BM25在该领域仍然是一个强大的基线。这促使我们提出一种自演化框架,用于规则驱动的查询重写,无需任何参数训练即可增强BM25。该框架为基于LLM的智能体配备了一个自动评估环境,使其能够迭代地创建重写规则、规划规则组合的验证实验,并根据历史反馈消除无效规则。我们在中文法律案例检索基准LeCaRD-v2上评估了我们的方法。实验结果表明,所提出的框架优于非演化基线,包括人类设计的规则和贪婪规则选择,特别是在由高容量核心LLM驱动时。我们还进行了详细分析,以研究自演化的机制。我们的发现表明,LLM利用先前实验结果的能力及其关于规则消除的内在知识在通过自演化优化规则集方面发挥着关键作用。

英文摘要

Legal case retrieval remains challenging due to the complexity of legal language and the need for precise lexical alignment between queries and relevant cases. Although dense retrieval models have achieved notable progress, empirical studies show that BM25 continues to serve as a strong baseline in this domain. This motivates us to propose a self-evolving framework for rule-driven query rewriting that enhances BM25 without any parameter training. The framework equips an LLM-based agent with an automatic evaluation environment, enabling it to iteratively create rewriting rules, plan validation experiments over rule combinations, and eliminate ineffective rules based on historical feedback. We evaluate our method on the Chinese legal case retrieval benchmark LeCaRD-v2. Experimental results demonstrate that the proposed framework outperforms non-evolutionary baselines, including human-designed rules and greedy rule selection, particularly when powered by a high-capacity core LLM. We also conduct detailed analyses to investigate the mechanisms underlying self-evolution. Our findings reveal that the LLM's ability to leverage previous experimental results and its intrinsic knowledge of rule elimination play critical roles in refining the rule set via self-evolution.

2606.17215 2026-06-17 cs.LG cs.DS stat.ML 新提交

Sum-of-Squares Degree Barriers for the Reweighted-Hinge Method in Robust Halfspace Learning: A Christoffel-Function Characterization

鲁棒半空间学习中重加权铰链方法的平方和度障碍:一个Christoffel函数刻画

Xiaoyu Li

发表机构 * Xiaoyu Li(李小宇)

AI总结 本文通过Christoffel函数精确刻画了有界度证书无法去除的异常质量,揭示了重加权铰链方法在恶意噪声下学习γ-间隔半空间时,证书的SoS度与异常容忍度之间的基本权衡。

详情
AI中文摘要

一个去除异常值的证书仅通过低阶矩观察数据,而对手恰恰利用这一点,将腐败隐藏在干净数据已经看似典型的盲区中,该盲区无法被任何有界度测试分辨。这个盲区恰好有一个精确的大小:干净边际分布的Christoffel函数,这正是现代数据分析中用于检测异常值的量,此处从对手的角度解读为有界度证书无法去除的腐败。我们将这一反转作为在恶意噪声下鲁棒学习γ-间隔半空间的重加权铰链方法(Shen, 2025; Zeng and Shen, 2025)的组织原则:支配性资源是异常去除证书的平方和(SoS)度,而分辨原则指出,在中心c处能够对度-2t证书隐藏的最大腐败质量恰好是干净边际分布的Christoffel函数λ_{t+1}(c)。由此得出三个推论,均针对证书方法(而非信息论极限)。边际-度权衡:将密集煎饼认证到误差ϵ需要SoS度Ω(log(1/ϵ))或边际Ω(√(log(1/ϵ))/√d),解释了Shen (2025)中记录的log(1/ϵ)边际是必然的,通过加权Chebyshev归约使得阈值2t=Θ((|c|/s)^2)在经典加权极值估计下是紧的。度-2异常障碍:分辨原则实现为一个显式实例,其中度2卡在η^{1/2}而度4逃脱,将方法的小崩溃率定位在度上而非分析中。以及一个度-2t算法追踪前沿η^{1-1/2t}(在t=1时恢复Shen (2025)),其增益为显式常数,受限于煎饼密度,并由度-2障碍证明不可改进。

英文摘要

A certificate that removes outliers sees the data only through its low-degree moments, and an adversary exploits exactly this, hiding corruption where the clean data already looks typical, in the blind spot no bounded-degree test resolves. That blind spot turns out to have an exact size: the Christoffel function of the clean marginal, the very quantity modern data analysis thresholds to detect outliers, here read from the adversary's side as the corruption a bounded-degree certificate cannot remove. We turn this inversion into the organizing principle of the reweighted-hinge approach to robustly learning $γ$-margin halfspaces under malicious noise (Shen, 2025; Zeng and Shen, 2025): the governing resource is the Sum-of-Squares degree of the outlier-removal certificate, and the resolution principle states that the maximal corruption mass which can hide at a center $c$ from a degree-$2t$ certificate is exactly the Christoffel function $λ_{t+1}(c)$ of the clean marginal. Three consequences follow, all against the certificate method (not information-theoretic). A margin-degree tradeoff: certifying the dense pancake to error $ε$ costs SoS degree $Ω(\log(1/ε))$ or margin $Ω(\sqrt{\log(1/ε)}/\sqrt{d})$, explaining why the $\log(1/ε)$ margin Shen (2025) records is forced, with a weighted-Chebyshev reduction making the threshold $2t=Θ((|c|/s)^2)$ tight modulo one classical weighted-extremal estimate. A degree-$2$ outlier barrier: the resolution principle realized as an explicit instance on which degree $2$ is stuck at $η^{1/2}$ while degree $4$ escapes, locating the method's small breakdown rate in the degree, not the analysis. And a degree-$2t$ algorithm tracing the frontier $η^{1-1/2t}$ (recovering Shen (2025) at $t=1$), whose gain is an explicit constant, capped by the pancake density and shown unimprovable by the degree-$2$ barrier.

2606.17213 2026-06-17 cs.CL cs.CV 新提交

Revisiting LLM Adaptation for 3D CT Report Generation: A Study of Scaling and Diagnostic Priors

重新审视用于3D CT报告生成的LLM适应:缩放与诊断先验研究

Vanshali Sharma, Andrea M. Bejar, Halil Ertugrul Aktas, Quoc-Huy Trinh, Debesh Jha, Gorkem Durak, Ulas Bagci

发表机构 * Northwestern University(西北大学) University of South Dakota(南达科他大学) Aalto University(阿尔托大学)

AI总结 提出RAD3D-Prefix轻量级诊断先验框架,通过冻结大语言模型并融合多标签分类逻辑,在少量可训练参数下实现3D CT报告生成,优于全微调基线并展现强泛化性。

详情
AI中文摘要

多模态学习的最新进展,包括大型语言模型(LLM)和视觉-语言模型(VLM),已展现出对自然图像的强大适应性。然而,将其扩展到医学领域,特别是体积(3D)图像,由于高计算复杂度、体积依赖性和视觉特征与临床术语之间的语义差距而具有挑战性。在有限的医学数据上对LLM进行朴素微调常常导致过拟合和临床幻觉,其中语言流畅性优先于临床事实性。在本研究中,我们研究了用于体积CT报告生成的参数高效适应策略,并引入了RAD3D-Prefix,一种轻量级的诊断先验条件框架,最大限度地减少了对大量参数训练的需求。该模块将图像嵌入与多标签诊断分类逻辑相结合,保留了关键的临床细节,同时弥合了语义差距。通过保持LLM冻结,我们的方法需要最少的可训练参数,并减轻了在小规模、特定领域数据集上过拟合的风险。通过对从96.1M到1.6B参数的LLM进行系统研究,我们发现微调对较小的LLM最有益,而冻结较大的(约1B+)LLM并仅训练轻量级投影层在性能、泛化性和计算效率之间提供了优越的权衡。在多个自动指标和一项临床读者研究中,RAD3D-Prefix优于可比较的参数高效基线,并在使用比全微调替代方案少得多的可训练参数的情况下,展现出强大的域外泛化能力。

英文摘要

Recent advances in multimodal learning, including large language models (LLMs) and vision-language models (VLMs), have demonstrated strong adaptability to natural images. However, extending their use to the medical domain, particularly for volumetric (3D) images, is challenging due to high computational complexity, volumetric dependencies, and the semantic gap between visual features and clinical terminology. Naively fine-tuning LLMs on limited medical data often leads to overfitting and clinical hallucination, where linguistic fluency is prioritized over clinical factuality. In this study, we investigate parameter-efficient adaptation strategies for volumetric CT report generation and introduce RAD3D-Prefix, a lightweight diagnostic-prior conditioning framework that minimizes the need for extensive parameter training. This module integrates image embeddings with multi-label diagnostic classification logits, preserving critical clinical details while bridging the semantic gap. By keeping the LLM frozen, our method requires minimal trainable parameters and mitigates the risk of overfitting on small, domain-specific datasets. Through a systematic study spanning LLMs from 96.1M to 1.6B parameters, we find that fine-tuning is most beneficial for smaller LLMs, whereas freezing larger (~1B+) LLMs and training only lightweight projection layers provides a superior trade-off between performance, generalization, and computational efficiency. Across multiple automatic metrics and a clinical reader study, RAD3D-Prefix outperforms comparable parameter-efficient baselines and demonstrates strong out-of-domain generalization while using substantially fewer trainable parameters than fully fine-tuned alternatives.

2606.17209 2026-06-17 cs.AI cs.IR 新提交

Beyond Parallel Sampling: Diverse Query Initialization for Agentic Search

超越并行采样:面向智能搜索的多样化查询初始化

Sidhaarth Murali, João Coelho, Jingjie Ning, João Magalhães, Bruno Martins, Chenyan Xiong

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Instituto Superior Técnico and INESC-ID, University of Lisbon(里斯本大学高等技术学院和INESC-ID) NOVA LINCS, NOVA School of Science and Technology(新里斯本大学科学与技术学院NOVA LINCS)

AI总结 针对智能搜索中的广度缩放,提出DivInit方法,通过在第一轮生成多样化查询而非独立采样,缓解查询冗余问题,在多跳问答任务中平均提升5-7个点。

Comments 15 pages, 8 figures; under review at EMNLP 2026

详情
AI中文摘要

测试时缩放用于智能搜索通常增加深度(即每个轨迹更多轮次和令牌)或广度(即更多并行展开)。这里我们关注广度缩放,表明标准并行采样收益递减,并将其归因于第一轮的查询冗余。当模型在不同展开中发出相似的第一查询时,线程检索重叠的证据,后续轮次基于此共享检索。我们通过DivInit解决这一限制,这是一种在第一轮无需训练的干预。DivInit不是采样k个独立的第一查询,而是从单次调用中抽取n个候选,选择k < n个多样化的种子,并将它们作为并行轨迹运行。在五个开源模型和八个基准测试中,DivInit始终优于标准并行采样,在匹配计算量的多跳问答上平均提升5到7个点。代码可在 https://github.com/cxcscmu/diverse-query-initialization 获取。

英文摘要

Test-time scaling for agentic search typically increases depth (i.e., more turns and tokens per trajectory) or breadth (i.e., more parallel rollouts). Here we focus on breadth scaling, showing that standard parallel sampling yields diminishing returns, tracing this to query redundancy at the first turn. When models issue similar first queries across rollouts, the threads retrieve overlapping evidence, and subsequent turns are conditioned on this shared retrieval. We address this limitation with DivInit, a training-free intervention at the first turn. Rather than sampling k independent first queries, DivInit draws n candidates from a single call, picks k < n diverse seeds, and runs them as parallel trajectories. Across five open-weight models and eight benchmarks, DivInit consistently improves over standard parallel sampling, with average gains of five to seven points on multi-hop QA at matched compute. Code available at https://github.com/cxcscmu/diverse-query-initialization
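The abstract says DivInit picks k < n diverse seeds from n candidate first queries but does not state the selection criterion. A minimal sketch of one plausible choice follows: greedy farthest-point selection under Jaccard distance on word sets. Both the distance and the greedy rule are assumptions made for illustration, and the function names `jaccard_distance` and `pick_diverse` are hypothetical; the linked repository is the authority on what the authors actually do.

```python
def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| over lowercase word sets (assumed distance)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return 1.0 - len(sa & sb) / len(union) if union else 0.0


def pick_diverse(candidates, k):
    """Greedy farthest-point selection of k queries from n candidates."""
    if not 0 < k <= len(candidates):
        raise ValueError("need 0 < k <= n")
    chosen = [0]  # seed the selection with the first candidate
    while len(chosen) < k:
        remaining = [i for i in range(len(candidates)) if i not in chosen]
        # Take the candidate whose nearest already-chosen query is farthest away.
        best = max(
            remaining,
            key=lambda i: min(
                jaccard_distance(candidates[i], candidates[j]) for j in chosen
            ),
        )
        chosen.append(best)
    return [candidates[i] for i in chosen]


candidates = [
    "who directed the film Inception",
    "who directed the movie Inception",
    "Inception 2010 cast and crew",
    "Christopher Nolan filmography",
]
seeds = pick_diverse(candidates, k=2)
```

Here the near-duplicate second query is never selected while a more distinct alternative remains, which is the redundancy the abstract attributes to independent parallel sampling. An embedding-based distance could replace the word-set distance without changing the selection loop.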

2606.17200 2026-06-17 cs.RO 新提交

ACE-Ego-0: Unifying Egocentric Human and Robotic Data for VLA Pretraining

ACE-Ego-0:统一第一人称人类与机器人数据用于VLA预训练

Hao Li, Ganlong Zhao, Yufei Liu, Haotian Hou, Guoquan Ye, Tongyan Fang, Chunxiao Liu, Siyuan Huang, Jianbo Liu, Xiaogang Wang, Hongsheng Li

发表机构 * ACE Robotics CUHK MMLab(香港中文大学多媒体实验室) CUHK, Shenzhen(香港中文大学(深圳)) SJTU(上海交通大学) THU(清华大学)

AI总结 提出ACE-EGO-0框架,通过可扩展的第一人称视频到动作管道和可靠性感知训练目标,统一人类与机器人数据用于VLA预训练,在多个基准上达到最优性能。

详情
AI中文摘要

视觉-语言-动作(VLA)模型受益于大规模和多样化的具身数据,但收集机器人轨迹成本高昂且劳动密集。最近的进展表明,大规模第一人称人类视频在预训练中提供了互补的真实世界监督。然而,由于动作空间、具身结构、时间动态和监督质量的差异,联合训练人类和机器人数据仍然具有挑战性。我们引入了ACE-EGO-0,一个统一VLA预训练框架,联合利用异构数据源。为了从第一人称人类视频中提取大规模预训练监督,我们构建了一个可扩展的第一人称视频到动作管道,将原始人类视频转换为机器人格式的伪动作轨迹。为了使这些标签与机器人演示可比,ACE-EGO-0使用基于相机空间动作、形态条件化和时间对齐动作分块的统一动作表示。为了稳健地利用来自第一人称人类视频的噪声伪动作监督,我们制定了一个可靠性感知训练目标,并带有一个人类辅助损失,将监督集中在可靠信号上。我们在4.53K小时的机器人和模拟数据以及1.48K小时的伪动作标记的第一人称人类数据上实例化ACE-EGO-0。实验表明,在可靠性感知加权下纳入大规模人类监督一致地改进了统一联合预训练和监督微调。ACE-EGO-0在RoboCasa GR1 TableTop和RoboTwin 2.0上达到了最先进的性能,并展示了向真实世界双臂操作的强迁移能力。

英文摘要

Vision-Language-Action (VLA) models benefit from large-scale and diverse embodied data, yet scaling robot trajectory collection is costly and labor-intensive. Recent advances show that large-scale egocentric human videos provide complementary real-world supervision in pretraining. However, joint training on human and robot data remains challenging due to divergences in action spaces, embodiment structures, temporal dynamics, and supervision quality. We introduce ACE-EGO-0, a unified VLA pretraining framework jointly leveraging heterogeneous data sources. To extract large-scale pretraining supervision from egocentric human videos, we build a scalable egocentric video-to-action pipeline that converts raw human videos into robot-format pseudo-action trajectories. To make these labels comparable with robot demonstrations, ACE-EGO-0 uses a unified action representation based on camera-space actions, morphology conditioning, and time-aligned action chunking. To robustly leverage noisy pseudo-action supervision from egocentric human videos, we formulate a reliability-aware training objective with a human auxiliary loss that concentrates supervision on reliable signals. We instantiate ACE-EGO-0 on 4.53K hours of robot and simulation data, together with 1.48K hours of pseudo-action-labeled egocentric human data. Experiments show that incorporating large-scale human supervision under reliability-aware weighting consistently improves both unified joint pretraining and supervised fine-tuning. ACE-EGO-0 achieves state-of-the-art performance on RoboCasa GR1 TableTop and RoboTwin 2.0, while demonstrating strong transfer to real-world bimanual manipulation.

2606.17199 2026-06-17 cs.LG cs.AI 新提交

PowerOPD: Stabilizing On-Policy Distillation with Bounded Power Transformation

PowerOPD:利用有界幂变换稳定在线策略蒸馏

Anhao Zhao, Junlong Tong, Yingqi Fan, Ping Nie, Wenjie Li, Xiaoyu Shen

发表机构 * Eastern Institute of Technology, Ningbo(宁波东方理工大学) The Hong Kong Polytechnic University(香港理工大学) Shanghai Jiao Tong University(上海交通大学) University of Waterloo(滑铁卢大学)

AI总结 针对在线策略蒸馏中log-ratio奖励无界导致训练不稳定问题,提出基于Box-Cox幂变换的有界、符号一致奖励族PowerOPD,在数学推理任务上平均提升Avg@8/Pass@8达+6.37/+5.71,并降低59.2%时间与23.1%显存。

详情
AI中文摘要

大型语言模型的标准在线策略蒸馏(OPD)利用学生采样令牌估计反向KL散度,得到一个无偏的单样本蒙特卡洛估计器,避免了全词汇计算。然而,我们表明该估计器在实践中存在严重的训练病态:样本效率低、生成动态不稳定,以及与精确全词汇OPD相比显著的性能差距。奖励级别的诊断将这些病态追溯到log-ratio奖励,该奖励在结构上无界,产生极高方差的梯度,集中在早期位置并持续整个训练;标准的后验缩放方法仅在失真发生后操作,因此失效。为解决此问题,我们提出PowerOPD:一个源自Box-Cox幂变换的原生有界、符号一致的奖励族,由alpha > 0参数化,其中log-ratio是其退化极限alpha -> 0。在六个数学推理基准和四个Qwen3师生对中,PowerOPD在基准平均Avg@8/Pass@8上相比原始OPD提升高达+6.37/+5.71,相比后验稳定化提升+3.01/+3.54,相比全词汇OPD提升+2.59/+8.90,同时减少59.2%的挂钟时间和23.1%的峰值GPU内存。较大的alpha通常提高准确率,一致缩短响应长度,并使梯度范数比原始OPD小3000倍以上。

英文摘要

Standard on-policy distillation (OPD) for large language models estimates the reverse-KL objective using student-sampled tokens, yielding an unbiased single-sample Monte Carlo estimator that avoids vocabulary-wide computation. However, we show that this estimator suffers from severe training pathologies in practice: sample inefficiency, unstable generation dynamics, and a substantial performance gap compared to exact full-vocabulary OPD. Reward-level diagnosis traces these pathologies to the log-ratio reward, which is unbounded by construction, producing extremely high-variance gradients concentrated at early positions and persisting throughout training; standard post-hoc scaling methods fail because they operate only after this distortion occurs. To solve this problem, we propose PowerOPD: a family of natively bounded, sign-consistent rewards from the Box-Cox power transformation, parameterized by alpha > 0, of which the log-ratio is the degenerate alpha -> 0 limit. Across six mathematical reasoning benchmarks and four Qwen3 teacher-student pairs, PowerOPD achieves benchmark-averaged Avg@8/Pass@8 gains of up to +6.37/+5.71 over vanilla OPD, +3.01/+3.54 over post-hoc stabilization, and +2.59/+8.90 over full-vocabulary OPD, while reducing wall-clock time by 59.2% and peak GPU memory by 23.1%. Larger alpha generally improves accuracy, consistently shortens responses, and keeps gradient norms more than 3,000x smaller than vanilla OPD.
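The abstract names the Box-Cox power transformation but does not give the exact reward formula. The sketch below shows only the textbook transform, (x^alpha - 1) / alpha, and its alpha -> 0 limit log(x), applied to a hypothetical teacher/student probability ratio. How PowerOPD builds its bounded reward from this transform is not stated in the abstract, so the `ratio` value and its use as a reward input are assumptions.

```python
import math


def box_cox(x, alpha):
    """Box-Cox power transform of x > 0; tends to log(x) as alpha -> 0."""
    if x <= 0:
        raise ValueError("x must be positive")
    if alpha == 0:
        return math.log(x)
    return (x ** alpha - 1.0) / alpha


# Hypothetical teacher/student probability ratio for one sampled token.
ratio = 1e-6
log_reward = box_cox(ratio, 0.0)    # log-ratio: grows without bound as ratio -> 0
power_reward = box_cox(ratio, 0.5)  # power form: can never fall below -1/alpha
```

For alpha > 0 the transform keeps the sign of log(x) and is bounded below by -1/alpha, which matches the "sign-consistent" property and illustrates why very small ratios no longer produce extreme values. The plain transform is still unbounded above for large x, so the paper's fully bounded reward presumably involves more than this one-line formula.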

2606.17192 2026-06-17 cs.LG 新提交

Constrained Diffusion Models with Primal-Dual Inference

约束扩散模型与原始-对偶推理

Samar Hadou, Yigit Berkay Uslu, Alejandro Ribeiro

发表机构 * Department of Electrical and Systems Engineering, University of Pennsylvania(宾夕法尼亚大学电气与系统工程系)

AI总结 提出原始-对偶推理(PDI)方法,通过联合推断最优原始分布和其对偶变量,在扩散模型反向过程中交替去噪与对偶上升,实现平均约束下的熵正则化优化问题采样。

详情
AI中文摘要

本文开发了具有原始-对偶推理(PDI)的约束扩散模型,用于从具有平均约束的熵正则化优化问题的最优分布中采样。我们在拉格朗日对偶域中形式化约束采样,其中最优分布采用由最优对偶变量索引的吉布斯分布形式。PDI不是先估计该对偶乘子并在整个生成过程中冻结它,而是联合推断最优原始分布及其参数化对偶变量。每个反向扩散步骤使用与当前乘子相关的得分场去噪,然后通过使用去噪样本的估计约束违反进行对偶上升来更新乘子。为了实现这种条件得分场,我们在推理过程中遇到的对偶变量所诱导的吉布斯分布族上训练一个单一的条件得分网络。我们证明了沿推理轨迹生成的对偶变量的时间平均收敛到对偶最优的邻域,并通过依赖于调度的时间稳定性因子限定了残余对偶失配对终端分布的影响。我们在高斯混合约束采样、无线资源分配和投资组合管理上评估了PDI。

英文摘要

This paper develops constrained diffusion models with primal-dual inference (PDI) to sample from optimal distributions of entropy-regularized optimization problems with average constraints. We formalize constrained sampling in the Lagrangian dual domain, where the optimal distribution takes the form of a Gibbs distribution indexed by the optimal dual variable. Rather than estimating this dual multiplier before sampling and freezing it throughout generation, PDI jointly infers the optimal primal distribution and its parametrizing dual variable. Each reverse diffusion step denoises using the score field associated with the current multiplier and then updates the multiplier through dual ascent using the estimated constraint violation of the denoised samples. To enable this conditional score field, we train a single dual-conditioned score network over the family of Gibbs distributions induced by the dual variables encountered during inference. We prove that the time average of the dual variables generated along the inference trajectory converges to a neighborhood of the dual optimum and bound the effect of residual dual mismatch on the terminal distribution through schedule-dependent stability factors. We evaluate PDI on constrained sampling from a mixture of Gaussians, wireless resource allocation, and portfolio management.
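To make the dual-ascent ingredient concrete, here is a toy sketch that leaves out diffusion entirely. It assumes a KL-regularized problem, min KL(p || N(0,1)) subject to E_p[x] >= b, for which the Gibbs-form optimum at multiplier lam is simply N(lam, 1) and can be sampled exactly. The problem, step size, and batch size are illustrative assumptions; in PDI the samples would come from a dual-conditioned score network and the update would be interleaved with reverse diffusion steps.

```python
import random


def dual_ascent_mean_constraint(b=1.0, step=0.5, iters=200, batch=2000, seed=0):
    """Toy dual ascent for min KL(p || N(0,1)) subject to E_p[x] >= b.

    For a multiplier lam >= 0 the Gibbs-form optimum is p_lam = N(lam, 1),
    so exact sampling stands in for the learned sampler used in PDI.
    """
    rng = random.Random(seed)
    lam = 0.0
    history = []
    for _ in range(iters):
        samples = [rng.gauss(lam, 1.0) for _ in range(batch)]
        violation = b - sum(samples) / batch    # > 0 means the constraint is violated
        lam = max(0.0, lam + step * violation)  # projected dual ascent step
        history.append(lam)
    # Return the last iterate and the time average of the multiplier.
    return lam, sum(history) / len(history)


lam, lam_avg = dual_ascent_mean_constraint()
```

In this toy problem the dual optimum is lam = b, and both the final multiplier and its time average settle near that value up to sampling noise. This mirrors, in a much simpler setting, the abstract's claim that the time-averaged dual variables converge to a neighborhood of the dual optimum.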

2606.17183 2026-06-17 cs.RO New submission

VL-MemKnG: Hybrid Memory with a Spatio-Temporal Knowledge Graph for Question Answering over Long Egocentric Navigation Trajectories

Svetlana Lukina, Mohamad Al Mdfaa, Gloria Haro, Sergey Zagoruyko, Gonzalo Ferrer

Affiliations: Mobile Robotics Laboratory, Artificial Intelligence Center; Skoltech; Intelligent Multimodal Vision Analysis Group, Department of Engineering, Universitat Pompeu Fabra; Independent Researcher

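AI summary: Proposes VL-MemKnG, a hybrid memory framework that combines a spatio-temporal knowledge graph with segment-level contextual memory and uses a hybrid retrieval-and-reasoning module to improve the accuracy and efficiency of navigation question answering over long egocentric videos.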

Abstract

Answering navigation-relevant questions over long egocentric videos requires retrieving and organizing evidence distributed across distant temporal moments while maintaining spatial and contextual consistency. Although long-context vision-language models can achieve strong answer quality, they are computationally expensive for long trajectories and inefficient for repeated querying. Recent graph-based approaches such as VL-KnG address this challenge through persistent spatio-temporal knowledge graphs, but graph-centric retrieval alone may underrepresent broader temporal continuity and contextual cues. We present VL-MemKnG, a hybrid memory framework that extends VL-KnG by combining a spatio-temporal knowledge graph with persistent segment-level contextual memory. The knowledge graph captures structured relational information and long-range object associations, while segment-level memory preserves broader temporal context for long-horizon evidence retrieval. A hybrid retrieval-and-reasoning module jointly operates over both memory representations to produce evidence-grounded answers and temporally organized supporting evidence. We also introduce WalkieKnowledgeT+, an extension of WalkieKnowledge for long-horizon navigation-oriented video question answering. The benchmark includes temporally distributed reasoning tasks requiring evidence aggregation across multiple non-cooccurring moments. On WalkieKnowledgeT+, VL-MemKnG improves Top-1 retrieval accuracy from 58% to 67% and Recall@1 from 34.50% to 40.55%, outperforming all compared methods, including Gemini 2.5 Pro and Qwen 3.5+. The gains are particularly pronounced on temporal-global and temporally scattered aggregation questions, demonstrating the benefits of combining structured relational memory with segment-level contextual memory while maintaining efficient query-time inference.
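
To make the idea of retrieving over two memory representations concrete, the toy sketch below scores timestamped graph facts and segment captions against a query and returns the best matches in temporal order. Everything in it is a hypothetical stand-in chosen for illustration: the `GraphFact` and `Segment` types, the word-overlap score, and the `hybrid_retrieve` function are not taken from VL-MemKnG, whose actual retrieval-and-reasoning module is not specified in the abstract.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GraphFact:
    """Hypothetical knowledge-graph edge with the time it was observed."""
    subject: str
    relation: str
    obj: str
    time_s: float

@dataclass(frozen=True)
class Segment:
    """Hypothetical segment-level memory entry: a captioned time window."""
    start_s: float
    end_s: float
    caption: str

def _overlap(query: str, text: str) -> float:
    """Fraction of query words found in the text (a crude relevance proxy)."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0

def hybrid_retrieve(query, facts, segments, k=3):
    """Score both memories jointly, keep the top-k hits, order them by time."""
    scored = []
    for f in facts:
        text = f"{f.subject} {f.relation} {f.obj}"
        scored.append((_overlap(query, text), f.time_s, "graph", text))
    for s in segments:
        scored.append((_overlap(query, s.caption), s.start_s, "segment", s.caption))
    top = sorted(scored, key=lambda e: e[0], reverse=True)[:k]
    top = [e for e in top if e[0] > 0]          # drop irrelevant entries
    return sorted(top, key=lambda e: e[1])      # temporally organized evidence

facts = [
    GraphFact("red door", "next to", "staircase", 12.0),
    GraphFact("vending machine", "inside", "lobby", 95.0),
]
segments = [
    Segment(0.0, 30.0, "walking down a corridor past a red door"),
    Segment(80.0, 110.0, "entering the lobby with a vending machine"),
]
for score, time_s, source, text in hybrid_retrieve("where is the red door", facts, segments, k=2):
    print(f"{time_s:6.1f}s  {source:8s} {score:.2f}  {text}")
```

A real system would replace the word-overlap score with learned embeddings and pass the ordered evidence to a language model; the point here is only that evidence from both stores is ranked together and returned in time order.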

2606.17182 2026-06-17 cs.LG cs.DC cs.LO cs.MA cs.PL New submission

Verified Detection and Prevention of Concurrency Anomalies in Multi-Agent Large Language Model Systems

Sajjad Khan

Affiliation: Independent Researcher

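AI summary: For multi-agent LLM systems, formalizes four concurrency anomalies and establishes a consistency hierarchy, verifies detector correctness with Verus, and implements prevention in Rust runtimes.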

Comments: 32 pages, 2 figures, 6 tables. Verus/TLA+ verification artifact, reference Rust runtime, and Python harnesses, plus a supplementary appendix (Sections A-F, Tables S1-S6), included as ancillary files

Abstract

Multi-agent LLM systems share state through memory stores, vector indices, and tool registries. We model such sharing as long-running read-generate-write operations under deterministic-generation semantics (the regime durable-execution engines enforce by deterministic replay) and formalize four concurrency anomalies in TLA+: stale-generation, phantom-tool, causal-cascade, and tool-effect reordering, structural analogues of classical isolation anomalies, each with a TLC counter-example. The exclusion lattice over these anomalies is trivial; the contribution is the mechanically verified realizability and strict separation of one maximal chain within it, $L_0 \subsetneq \cdots \subsetneq L_4$, to our knowledge the first machine-checked consistency hierarchy for such runtimes. A development of 274 Verus obligations (zero assume, zero admit; trust base: two structural axioms and a mutex correspondence) proves the detectors sound and complete against the specifications and each runtime its avoidance set. Three deployed Rust runtimes realize L0-L1 (pessimistic locking, serializable snapshot isolation, default-SI), each verified against stale-generation and refined to its state machine; L2-L4 are exec-mode-verified with dependency-free prevention twins (A3, A6, A2: 0/1000 versus 1000/1000), and L2 is run live across three model families (A3 prevented in all 120 retracted sessions). We reproduce a silent lost update in ByteDance's deer-flow, formalizing its fix as a verified $L_0 \to L_1$ refinement, and exhibit tool-effect reordering in LangGraph's ToolNode on unmodified output, removed by an L3 commit-order sequencer. The verified detector, refinements, and realizability artifacts are the contribution; the phenomena and lattice are classical.
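
The read-generate-write pattern and the silent lost update mentioned in the abstract can be illustrated with a small single-threaded simulation. The sketch below is not the paper's verified detector or any of its Rust runtimes; the `VersionedStore` class, the version-check write, and the `generate` stand-in for an LLM call are hypothetical names introduced here, and the check shown is ordinary optimistic concurrency control used as an analogue.

```python
from dataclasses import dataclass

class StaleGenerationError(Exception):
    """Raised when a write is based on a snapshot that is no longer current."""

@dataclass
class VersionedStore:
    """Hypothetical shared memory cell with a monotonically increasing version."""
    value: str = ""
    version: int = 0

    def read(self):
        return self.value, self.version

    def write_unchecked(self, new_value: str) -> None:
        # No validation: a later writer silently overwrites an earlier one.
        self.value = new_value
        self.version += 1

    def write_checked(self, new_value: str, read_version: int) -> None:
        # Optimistic check: reject output generated from a stale snapshot.
        if read_version != self.version:
            raise StaleGenerationError(
                f"read at version {read_version}, store is at {self.version}"
            )
        self.value = new_value
        self.version += 1

def generate(snapshot: str, agent: str) -> str:
    """Deterministic stand-in for a long-running LLM generation step."""
    return f"{snapshot}[{agent}]"

def run_interleaving(checked: bool):
    """Two agents read the same snapshot, generate, then write one after the other."""
    store = VersionedStore(value="plan")
    snap_a, ver_a = store.read()
    snap_b, ver_b = store.read()          # B reads before A has written
    out_a = generate(snap_a, "A")
    out_b = generate(snap_b, "B")
    if not checked:
        store.write_unchecked(out_a)
        store.write_unchecked(out_b)      # A's update is lost without any error
        return store.value, False
    store.write_checked(out_a, ver_a)
    try:
        store.write_checked(out_b, ver_b)
    except StaleGenerationError:
        return store.value, True          # stale write detected and rejected
    return store.value, False

print(run_interleaving(checked=False))
print(run_interleaving(checked=True))
```

In the unchecked run agent B's output replaces agent A's with no signal that anything was lost; in the checked run the second write is rejected because its snapshot version is out of date, and the caller could then re-read and regenerate.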