arXivDaily: arXiv daily academic digest, updated Monday through Friday
2603.19812 2026-05-25 cs.LG

Eye Gaze-Informed and Context-Aware Pedestrian Trajectory Prediction in Shared Spaces with Automated Shuttles: A Virtual Reality Study

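Danya Li, Yan Feng, Rico Krueger

Affiliations: Department of Technology, Management and Economics, Technical University of Denmark; Department of Transport & Planning, Civil Engineering and Geosciences, Delft University of Technology

AI summary: Using a virtual reality experiment, this study examines how much pedestrians' eye gaze helps predict their trajectories in shared spaces, covering interactions with automated shuttles under different approach angles and traffic conditions. The authors build a multimodal prediction model that fuses eye gaze, head orientation, and situational context. They find that the contribution of gaze depends on the approach angle and on eye-head-body coordination, and that it is complementary to contextual information. Combining gaze with situational context reduces final displacement error (FDE) by 8.47%, underscoring the value of incorporating human perceptual signals into pedestrian behavior prediction.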

Abstract

To address this gap, we conduct a Virtual Reality experiment in which pedestrians interact with automated shuttles under varying approach angles (45°, 90°, 135°) and continuous-traffic conditions (single shuttle, two shuttles with 3 or 5-second gaps), collecting synchronized motion, eye gaze, and head orientation data. To investigate to what extent, under what conditions, and in what form fine-grained eye gaze is informative for pedestrian motion prediction, we develop a multi-modal prediction model that fuses these signals through modality-specific encoders, and systematically ablate gaze representations against head orientation and situational context. We report three main results. First, the predictive value of eye gaze is angle-dependent and tightly coupled with eye-head-body coordination: at acute angles where pedestrians actively redirect gaze to acquire the shuttle, eye gaze carries information that head orientation alone misses. Second, continuous gaze orientation outperforms categorical semantic fixation labels, with the optimal encoding frame (global or body-relative) depending on whether gaze is used alone or jointly with context. Third, eye gaze and situational context provide complementary predictive information: their combination reduces final displacement error (FDE) by 8.47%, close to the sum of their individual contributions. Together, these findings highlight the value of incorporating human perceptual signals into pedestrian behavior prediction and motivate a human-centered complement to vehicle-centric modeling approaches. Our code is available at https://github.com/danyayay/GazeX.git.
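
The FDE figure quoted above is a relative reduction of an endpoint error. A minimal sketch of how such a number is computed, assuming FDE is the Euclidean distance between the final predicted and final true position; the trajectories below are made-up illustrations, not data from the study:

```python
import math

def final_displacement_error(pred, truth):
    """Euclidean distance between the last predicted and last true position."""
    (px, py), (tx, ty) = pred[-1], truth[-1]
    return math.hypot(px - tx, py - ty)

def relative_reduction(baseline, improved):
    """Fractional reduction of an error metric relative to a baseline."""
    return (baseline - improved) / baseline

# Hypothetical trajectories in metres (illustrative only).
truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
baseline_pred = [(0.0, 0.0), (1.0, 0.2), (2.0, 0.5)]
gaze_pred = [(0.0, 0.0), (1.0, 0.1), (2.0, 0.4)]

fde_base = final_displacement_error(baseline_pred, truth)
fde_gaze = final_displacement_error(gaze_pred, truth)
reduction = relative_reduction(fde_base, fde_gaze)
print(f"FDE {fde_base:.2f} -> {fde_gaze:.2f}, reduction {reduction:.1%}")
```

In practice the metric would be averaged over all test trajectories before the reduction is taken.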

2603.19310 2026-05-25 cs.LG cs.AI

MemReward: Graph-Based Experience Memory for LLM Reward Prediction with Limited Labels

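Tianyang Luo, Tao Feng, Zhigang Hua, Yan Xie, Shuang Yang, Ge Liu, Jiaxuan You

Affiliations: University of Illinois Urbana-Champaign; Meta

AI summary: This paper proposes MemReward, a graph-based experience memory framework that improves reward prediction for large language models (LLMs) when labeled data is limited. It builds a heterogeneous graph over the thinking processes and answers generated by an initial policy, and uses a graph neural network (GNN) to propagate the limited labeled rewards to unlabeled rollouts, so that rewards can be obtained cheaply during online policy optimization. With ground-truth rewards on only 20% of rollouts, MemReward reaches 96.6% (1.5B) and 97.3% (3B) of Oracle performance on mathematics, question answering, and code generation.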

Abstract

Reinforcement learning has emerged as a powerful paradigm for improving large language model (LLM) reasoning, where rollouts are sampled from the policy and reward signals computed on those rollouts are used to update the policy. However, in data-scarce scenarios, obtaining ground-truth labels to verify rollouts at scale often requires expensive human annotation or labor-intensive expert verification. For instance, evaluating mathematical proofs demands expert review, and open-ended question answering lacks definitive ground truth. When ground-truth labels are scarce, the effectiveness of reinforcement learning fine-tuning is constrained. Inspired by the success of semi-supervised learning in propagating labels from labeled to unlabeled samples, we propose MemReward, a graph-based experience memory framework that integrates reward propagation directly into online policy optimization. MemReward stores rollouts (thinking processes and final answers) from an initial LLM policy as nodes in a heterogeneous graph connected by similarity and structural edges, over which a GNN propagates rewards from labeled to unlabeled rollouts. To train such a framework, we first warm up the GNN on labeled rollouts to predict rewards via heterogeneous aggregation over query, thinking, and answer nodes. During online RL fine-tuning, unlabeled rollouts are attached to the graph by query similarity, and the GNN predicts their rewards, yielding a hybrid reward acquisition strategy that combines ground-truth and GNN-predicted rewards. Experiments on Qwen2.5-1.5B and 3B in mathematics, question answering, and code generation demonstrate that MemReward, with ground-truth rewards on only 20% of rollouts, achieves 96.6% of Oracle performance on 1.5B and 97.3% on 3B, and closely approaches Oracle on out-of-domain tasks.
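
The hybrid reward acquisition idea can be illustrated with a toy stand-in. The sketch below replaces the GNN with a plain similarity-weighted average over labeled neighbours, which is an assumption made for brevity and not the method of the paper; all names and numbers are hypothetical:

```python
def propagate_reward(similarities, labeled_rewards):
    """Similarity-weighted average of labeled rewards (toy stand-in for the GNN)."""
    total = sum(similarities.values())
    return sum(w * labeled_rewards[k] for k, w in similarities.items()) / total

def hybrid_reward(rollout_id, labeled_rewards, neighbors):
    """Use the ground-truth reward when it exists, otherwise a propagated estimate."""
    if rollout_id in labeled_rewards:
        return labeled_rewards[rollout_id]
    return propagate_reward(neighbors[rollout_id], labeled_rewards)

# Hypothetical data: two labeled rollouts and one unlabeled rollout "r3"
# attached to them by query similarity.
labeled_rewards = {"r1": 1.0, "r2": 0.0}
neighbors = {"r3": {"r1": 0.75, "r2": 0.25}}

reward_labeled = hybrid_reward("r1", labeled_rewards, neighbors)
reward_unlabeled = hybrid_reward("r3", labeled_rewards, neighbors)
```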

2603.19167 2026-05-25 cs.CL

Evaluating Counterfactual Strategic Reasoning in Large Language Models

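Dimitrios Georgousis, Maria Lymperaiou, Angeliki Dimitriou, Giorgos Filandrianos, Giorgos Stamou

Affiliations: National Technical University of Athens

AI summary: This study evaluates the strategic performance of large language models in repeated games to determine whether it reflects genuine reasoning or reliance on memorized patterns. It introduces counterfactual variants of canonical games (Prisoner's Dilemma and Rock-Paper-Scissors) that alter payoff structures and action labels, breaking the original symmetries and dominance relations. A multi-metric evaluation framework reveals limitations in incentive sensitivity, structural generalization, and strategic reasoning in counterfactual environments.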

Comments Accepted at GEM@ACL 2026

Abstract

We evaluate Large Language Models (LLMs) in repeated game-theoretic settings to assess whether strategic performance reflects genuine reasoning or reliance on memorized patterns. We consider two canonical games, Prisoner's Dilemma (PD) and Rock-Paper-Scissors (RPS), upon which we introduce counterfactual variants that alter payoff structures and action labels, breaking familiar symmetries and dominance relations. Our multi-metric evaluation framework compares default and counterfactual instantiations, showcasing LLM limitations in incentive sensitivity, structural generalization and strategic reasoning within counterfactual environments.
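
To see what "breaking dominance relations" means concretely, the sketch below checks for a strictly dominant action in a two-player payoff table. The default table uses the textbook Prisoner's Dilemma payoffs; the counterfactual table is a hypothetical variant chosen for illustration and is not taken from the paper:

```python
def dominant_action(payoff, actions):
    """Return the row player's strictly dominant action, or None.

    payoff[(a, b)] is the row player's payoff for playing a against b.
    """
    for a in actions:
        others = [x for x in actions if x != a]
        if all(payoff[(a, b)] > payoff[(o, b)] for o in others for b in actions):
            return a
    return None

actions = ["cooperate", "defect"]

# Textbook Prisoner's Dilemma: defecting is better whatever the opponent does.
default_pd = {
    ("cooperate", "cooperate"): 3, ("cooperate", "defect"): 0,
    ("defect", "cooperate"): 5, ("defect", "defect"): 1,
}

# Hypothetical counterfactual: mutual cooperation now pays most,
# so no action is dominant any more.
counterfactual = {
    ("cooperate", "cooperate"): 5, ("cooperate", "defect"): 0,
    ("defect", "cooperate"): 3, ("defect", "defect"): 1,
}

default_result = dominant_action(default_pd, actions)
counterfactual_result = dominant_action(counterfactual, actions)
```

A model that keeps defecting in the second table would be following the familiar label rather than the incentives.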

2603.17879 2026-05-25 cs.CV cs.AI

Anatomy-Guided Vision-Language Learning with Angular Prototype Separation for Multi-Label Video Capsule Endoscopy Classification Under Class Imbalance

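Podakanti Satyajith Chary, Nagarajan Ganapathy

Affiliations: Department of Engineering Science, IIT Hyderabad; Department of Biomedical Engineering, IIT Hyderabad

AI summary: This paper presents a multi-label temporal event detection framework for video capsule endoscopy (VCE) that targets the severe class imbalance in the Galar dataset through two main contributions: an Angular Separation Loss and a Biological State Machine decoder. Built on BiomedCLIP, the framework fuses consecutive frames with a Local Differencing Attention module to amplify pathological signals, and an Anatomy Context Head conditions pathology predictions on soft anatomical activations. On the RARE-VISION test set it achieves a temporal mAP@0.5 of 0.3597 and mAP@0.95 of 0.3399, relative improvements of 46% and 44% over the prior submission.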

Comments 12 pages, 1 figure, ICPR 2026 RARE-VISION Competition

Abstract

This work presents a multi-label temporal event detection framework for video capsule endoscopy (VCE) that addresses the extreme class imbalance inherent in the Galar dataset by combining two principal contributions: an Angular Separation Loss on class prototypes and a Biological State Machine temporal decoder. The backbone remains BiomedCLIP, a biomedical vision-language foundation model. Three consecutive frames are fused through a Local Differencing Attention module that amplifies transient pathological signals by suppressing static temporal redundancy. An Anatomy Context Head then conditions pathological predictions on soft anatomical activations, exploiting the known spatial co-occurrence structure of GI findings. Learnable text-feature prompts and prototype-based logit augmentation are trained alongside an Angular Separation Loss that penalizes off-diagonal cosine similarity between class prototypes, preventing the prototype collapse that afflicts rare classes under extreme imbalance. To counteract the skewed label distribution, the training regime combines asymmetric focal loss, inverse-frequency weighted sampling, temporal Mixup, Exponential Moving Average, and per-class threshold calibration. The Biological State Machine decoder replaces naive gap merging with a physiologically grounded forward-only state transition over anatomy labels, eliminating the fragmentation artefact that produced hundreds of spurious anatomy events per video in the prior approach and reducing per-video anatomy output to 2 to 3 clinically realistic events. On the held-out RARE-VISION test set comprising three NaviCam examinations (161,025 frames), the updated pipeline achieves an overall temporal mAP@0.5 of 0.3597 and mAP@0.95 of 0.3399, representing a relative improvement of 46% and 44% respectively over the prior submission, with total inference completed in approximately 21 minutes on a single GPU.
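
The abstract describes the Angular Separation Loss only as a penalty on off-diagonal cosine similarity between class prototypes. One plausible form is sketched below; the squaring and the averaging are assumptions, and the paper may weight or clip the terms differently:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def angular_separation_penalty(prototypes):
    """Mean squared cosine similarity over all off-diagonal prototype pairs."""
    n = len(prototypes)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += cosine(prototypes[i], prototypes[j]) ** 2
    return total / (n * (n - 1))

# Orthogonal prototypes incur no penalty; collapsed (identical) ones the maximum.
penalty_orthogonal = angular_separation_penalty([[1.0, 0.0], [0.0, 1.0]])
penalty_collapsed = angular_separation_penalty([[1.0, 0.0], [1.0, 0.0]])
```

Minimising such a term pushes prototypes of different classes apart in angle, which is the behaviour the abstract attributes to the loss.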

2603.16331 2026-05-25 cs.LG

Decoding the Critique Mechanism in Large Reasoning Models

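Hoang Phan, Quang H. Nguyen, Hung T. Q. Le, Xiusi Chen, Heng Ji, Khoa D. Doan

Affiliations: VinUni-Illinois Smart Health Center; University of Illinois Urbana-Champaign

AI summary: This paper studies how large reasoning models (LRMs) correct errors through internal mechanisms during reasoning and introduces the notion of a hidden critique ability. The authors find that a model can still reach the correct final answer even when an error inserted into an intermediate step is never explicitly corrected, which points to an implicit mechanism for error detection and self-correction. Through feature space analysis they identify an interpretable critique vector; steering with it strengthens error detection and improves test-time scaling performance at no extra training cost.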

Abstract

Large Reasoning Models (LRMs) exhibit backtracking and self-verification mechanisms that enable them to revise intermediate steps and reach correct solutions, yielding strong performance on complex logical benchmarks. We hypothesize that such behaviors are beneficial only when the model has sufficiently strong "critique" ability to detect its own mistakes. This work systematically investigates how current LRMs recover from errors by inserting arithmetic mistakes in their intermediate reasoning steps. Notably, we discover a peculiar yet important phenomenon: despite the error propagating throughout the entire chain-of-thought (CoT) without any verbalized correction, the model still reaches the correct final answer after the thinking process finishes. This recovery implies the existence of an internal mechanism helping the model to detect errors and trigger self-correction, which we refer to as the hidden critique ability. Building on feature space analysis, we identify a highly interpretable critique vector representing this behavior. Extensive experiments across multiple model scales and families demonstrate that steering latent representations with this vector improves the model's error detection capability and enhances the performance of test-time scaling at no extra training cost. Our findings provide a valuable understanding of LRMs' critique behavior, suggesting a promising direction to control and improve their self-verification mechanism. Our code is available at: https://github.com/mail-research/lrm-critique-vectors.
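
Steering a latent representation with a vector usually means adding a scaled direction to a hidden state. The sketch below assumes the direction is obtained as a difference of mean activations between two groups of examples, a common recipe that the abstract does not confirm for this paper; the activations and the scale `alpha` are hypothetical:

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def difference_of_means(positive, negative):
    """Direction pointing from the negative group mean to the positive group mean."""
    mean_pos, mean_neg = mean_vector(positive), mean_vector(negative)
    return [p - q for p, q in zip(mean_pos, mean_neg)]

def steer(hidden, direction, alpha):
    """Shift a hidden state along the direction, scaled by alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Hypothetical 2-d activations from runs with and without critique behaviour.
with_critique = [[1.0, 2.0], [3.0, 2.0]]
without_critique = [[0.0, 1.0], [2.0, 1.0]]

critique_vector = difference_of_means(with_critique, without_critique)
steered = steer([0.5, 0.5], critique_vector, alpha=2.0)
```

Because the shift is applied at inference time, no weights change, which is consistent with the "no extra training cost" claim.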

2603.14027 2026-05-25 cs.CL

SemEval-2026 Task 6: CLARITY -- Unmasking Political Question Evasions

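Konstantinos Thomas, Giorgos Filandrianos, Maria Lymperaiou, Chrysoula Zerva, Giorgos Stamou

Affiliations: National Technical University of Athens; Instituto de Telecomunicações; Instituto Superior Técnico, Universidade de Lisboa; ELLIS Unit Lisbon

AI summary: SemEval-2026 Task 6 (CLARITY) targets evasive answers in political discourse, where speakers avoid answering directly while appearing responsive. It has two subtasks: classifying the clarity of a response, and fine-grained classification into nine evasion strategies. The benchmark is built from U.S. presidential interviews and follows an expert-grounded taxonomy. LLM prompting and hierarchical use of the taxonomy were the most effective strategies, and evasion-level classification (best macro-F1 0.68) proved harder than clarity classification (best macro-F1 0.89).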

Comments SemEval 2026 (Task organizers)

Abstract

Political speakers often avoid answering questions directly while maintaining the appearance of responsiveness. Despite its importance for public discourse, such strategic evasion remains underexplored in Natural Language Processing. We introduce SemEval-2026 Task 6, CLARITY, a shared task on political question evasion consisting of two subtasks: (i) clarity-level classification into Clear Reply, Ambivalent, and Clear Non-Reply, and (ii) evasion-level classification into nine fine-grained evasion strategies. The benchmark is constructed from U.S. presidential interviews and follows an expert-grounded taxonomy of response clarity and evasion. The task attracted 124 registered teams, who submitted 946 valid runs for clarity-level classification and 539 for evasion-level classification. Results show a substantial gap in difficulty between the two subtasks: the best system achieved 0.89 macro-F1 on clarity classification, surpassing the strongest baseline by a large margin, while the top evasion-level system reached 0.68 macro-F1, matching the best baseline. Overall, large language model prompting and hierarchical exploitation of the taxonomy emerged as the most effective strategies, with top systems consistently outperforming those that treated the two subtasks independently. CLARITY establishes political response evasion as a challenging benchmark for computational discourse analysis and highlights the difficulty of modeling strategic ambiguity in political language.
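
Both subtasks are scored with macro-F1, the unweighted mean of per-class F1 scores, so rare classes count as much as frequent ones. A minimal sketch using the three clarity labels from the abstract; the gold and predicted labels are invented for illustration:

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

labels = ["Clear Reply", "Ambivalent", "Clear Non-Reply"]
# Hypothetical gold labels and system predictions.
y_true = ["Clear Reply", "Ambivalent", "Ambivalent", "Clear Non-Reply"]
y_pred = ["Clear Reply", "Ambivalent", "Clear Reply", "Clear Non-Reply"]

score = macro_f1(y_true, y_pred, labels)
print(f"macro-F1 = {score:.3f}")
```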

2603.10688 2026-05-25 cs.RO cs.CV

MapGCLR: Geospatial Contrastive Learning of Representations for Online Vectorized HD Map Construction

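Jonas Merkert, Alexander Blumberg, Jan-Hendrik Pauls, Christoph Stiller

Affiliations: Institute of Measurement and Control Systems, Karlsruhe Institute of Technology (KIT)

AI summary: This paper proposes MapGCLR, a method that improves the bird's-eye-view (BEV) feature grid representation used in online vectorized HD map construction. It adds a geospatial consistency constraint to a contrastive loss so that features of overlapping regions agree, and combines this with a multi-traversal dataset splitting strategy to obtain a semi-supervised training framework. The method outperforms the supervised baseline both on vectorized map perception and in visualizations of the feature space.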

Abstract

Autonomous vehicles rely on map information to understand the world around them. However, the creation and maintenance of offline high-definition (HD) maps remains costly. A more scalable alternative lies in online HD map construction, which only requires map annotations at training time. To further reduce the need for annotating vast training labels, self-supervised training provides an alternative. This work focuses on improving the latent bird's-eye-view (BEV) feature grid representation within a vectorized online HD map construction model by enforcing geospatial consistency between overlapping BEV feature grids as part of a contrastive loss function. To ensure geospatial overlap for contrastive pairs, we introduce an approach to analyze the overlap between traversals within a given dataset and generate subsidiary dataset splits following adjustable multi-traversal requirements. We train the same model supervised using a reduced set of single-traversal labeled data and self-supervised on a broader unlabeled set of data following our multi-traversal requirements, effectively implementing a semi-supervised approach. Our approach outperforms the supervised baseline across the board, both quantitatively in terms of the downstream task's vectorized map perception performance and qualitatively in terms of segmentation in the principal component analysis (PCA) visualization of the BEV feature space.
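
The abstract does not say which contrastive loss is used. As an illustration of the general idea, the sketch below uses the widely known InfoNCE form, which is an assumption here: features of the same map location seen in two traversals act as the positive pair, and features of other locations act as negatives. The vectors and temperature are hypothetical:

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss: negative log-softmax of the positive pair's similarity."""
    logits = [cosine(anchor, positive)] + [cosine(anchor, n) for n in negatives]
    logits = [value / temperature for value in logits]
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum = peak + math.log(sum(math.exp(value - peak) for value in logits))
    return log_sum - logits[0]

# Same BEV cell seen in two traversals agrees: low loss.
loss_consistent = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
# The overlapping cell disagrees while an unrelated cell matches: high loss.
loss_inconsistent = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```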

2603.10067 2026-05-25 cs.LG cs.AI

HTMuon: Improving Muon via Heavy-Tailed Spectral Correction

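Tianyu Pang, Yujie Fang, Zihang Liu, Shenyang Deng, Lei Hsiung, Shuhua Yu, Yaoqing Yang

Affiliations: Dartmouth College; Microsoft; International Computer Science Institute; University of California, Berkeley; Meta

AI summary: This paper proposes HTMuon, an improvement to the Muon optimizer aimed at better training of large language models. The authors argue that Muon's orthogonalized update rule suppresses heavy-tailed weight spectra; guided by Heavy-Tailed Self-Regularization (HT-SR) theory, HTMuon produces heavier-tailed updates while preserving Muon's ability to capture parameter interdependencies. HTMuon outperforms existing baselines on language model pretraining and image classification and can be used as a plug-in on top of existing Muon variants.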

Abstract

Muon has recently shown promising results in LLM training. In this work, we study how to further improve Muon. We argue that Muon's orthogonalized update rule suppresses the emergence of heavy-tailed weight spectra and over-emphasizes training along noise-dominated directions. Motivated by the Heavy-Tailed Self-Regularization (HT-SR) theory, we propose HTMuon. HTMuon preserves Muon's ability to capture parameter interdependencies while producing heavier-tailed updates and inducing heavier-tailed weight spectra. Experiments on LLM pretraining and image classification show that HTMuon consistently improves performance over state-of-the-art baselines and can also serve as a plug-in on top of existing Muon variants. For example, on LLaMA pretraining on the C4 dataset, HTMuon reduces perplexity by up to 0.98 compared to Muon. We further theoretically show that HTMuon corresponds to steepest descent under the Schatten-q norm constraint and provide convergence analysis in smooth non-convex settings. The implementation of HTMuon is available at https://github.com/TDCSZ327/HTmuon.
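
The link between Schatten-norm steepest descent and spectral reshaping can be worked out in a few lines. The derivation below is the standard result for a generic Schatten-$q$ constraint with $1 < q < \infty$; the symbols $G$, $D$, $\sigma_i$ and $q^*$ are notation introduced here, and whether HTMuon uses exactly this parameterization is not stated in the abstract.

```latex
% Gradient SVD: G = U \,\mathrm{diag}(\sigma_1,\dots,\sigma_r)\, V^\top, with 1/q + 1/q^* = 1.
\begin{align*}
D^\star &= \arg\max_{\|D\|_{S_q} \le 1} \langle G, D \rangle, \\
\langle G, D \rangle &\le \sum_i \sigma_i(G)\,\sigma_i(D)
  \le \|\sigma(G)\|_{q^*}\,\|\sigma(D)\|_{q}
  && \text{(von Neumann, then H\"older)}, \\
D^\star &= U \,\mathrm{diag}\!\left(
  \frac{\sigma_i^{\,q^*-1}}{\|\sigma\|_{q^*}^{\,q^*-1}}\right) V^\top
  && \text{(equality case)}.
\end{align*}
% Limits: q \to \infty gives q^* = 1 and D^\star = U V^\top (the orthogonalized update);
%         q = 2 gives q^* = 2 and D^\star = G / \|G\|_F (normalized gradient).
```

Between these two limits the singular values of the update are raised to a power in $(0, 1)$, so large singular directions are damped less aggressively than under full orthogonalization, which yields a heavier-tailed update spectrum.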

2603.06610 2026-05-25 cs.LG

CapTrack: Multifaceted Evaluation of Forgetting in LLM Post-Training

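Lukas Thede, Stefan Winzeck, Zeynep Akata, Jonathan Richard Schwarz

Affiliations: Thomson Reuters Foundational Research; Tübingen AI Center, University of Tübingen; Munich Center for Machine Learning (MCML), Technical University of Munich; Imperial College London

AI summary: This paper introduces CapTrack, a capability-centric framework for evaluating forgetting in large language models during post-training. Rather than treating forgetting as a loss of parametric or factual knowledge, CapTrack defines it as degradation of behavior and capabilities, and pairs a behavioral taxonomy with capability-specific metrics. A large-scale study across post-training algorithms, domains, and model families finds that forgetting affects not only parametric knowledge but also robustness and default behaviors, and that the extent of drift differs by post-training method.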

Abstract

Large language model (LLM) post-training enhances latent skills, unlocks value alignment, improves performance, and enables domain adaptation. Unfortunately, post-training is known to induce forgetting, especially in the ubiquitous use-case of leveraging third-party pre-trained models, which is typically understood as a loss of parametric or factual knowledge. We argue that this accuracy-centric view is insufficient for modern foundation models and instead define forgetting as systematic model drift that degrades behavior and user experience. In this context, we introduce CapTrack, a capability-centric framework for analyzing forgetting in LLMs that combines a behavioral taxonomy with an evaluation suite centered on capability-specific metrics. Using CapTrack, we conduct a large-scale empirical study across post-training algorithms, domains, and model families, including models up to 80B parameters. We find that forgetting extends beyond parametric knowledge, with pronounced drift in robustness and default behaviors. Instruction fine-tuning induces the strongest relative drift, while preference optimization is more conservative and can partially recover lost capabilities. Differences across model families persist, and no universal mitigation emerges.

2603.02897 2026-05-25 cs.CV

ProGIC: Progressive and Lightweight Generative Image Compression with Residual Vector Quantization

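Hao Cao, Chengbin Liang, Wenqi Guo, Zhijin Qin, Jungong Han

Affiliations: Tsinghua University; State Key Laboratory of Space Network and Communications

AI summary: This paper proposes ProGIC, a progressive and lightweight generative image compression method built on residual vector quantization (RVQ). Multi-stage residual coding yields a progressive bitstream that supports previews from partial data, and a lightweight backbone of depthwise-separable convolutions and small attention blocks eases deployment on low-compute devices. On the Kodak dataset, ProGIC saves up to 57.57% bitrate on DISTS and 58.83% on LPIPS relative to MS-ILLM, and encodes and decodes more than 10 times faster on GPUs.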

Comments Accepted by CVPR 2026 Findings

Abstract

Recent advances in generative image compression (GIC) have delivered remarkable improvements in perceptual quality. However, many GICs rely on large-scale and rigid models, which severely constrain their utility for flexible transmission and practical deployment in low-bitrate scenarios. To address these issues, we propose Progressive Generative Image Compression (ProGIC), a compact codec built on residual vector quantization (RVQ). In RVQ, a sequence of vector quantizers encodes the residuals stage by stage, each with its own codebook. The resulting codewords sum to a coarse-to-fine reconstruction and a progressive bitstream, enabling previews from partial data. We pair this with a lightweight backbone based on depthwise-separable convolutions and small attention blocks, enabling practical deployment on both GPUs and CPU-only devices. Experimental results show that ProGIC attains compression performance comparable to previous methods. It achieves bitrate savings of up to 57.57% on DISTS and 58.83% on LPIPS compared to MS-ILLM on the Kodak dataset. Beyond perceptual quality, ProGIC enables progressive transmission for flexibility, and encodes and decodes more than 10 times faster than MS-ILLM on GPUs.
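
The RVQ mechanism described in the abstract (each stage quantizes the residual left by the previous one, and codewords sum to the reconstruction) can be sketched in a few lines. The two tiny codebooks are hypothetical and hand-picked so the arithmetic is easy to follow; a real codec would learn them and operate on latent feature vectors:

```python
def nearest(codebook, x):
    """Index of the codeword closest to x in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], x)),
    )

def rvq_encode(x, codebooks):
    """Quantize x stage by stage, passing each stage's residual to the next."""
    residual, indices = list(x), []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices

def rvq_decode(indices, codebooks):
    """Sum the selected codewords; fewer indices give a coarser preview."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

codebooks = [
    [[0.0, 0.0], [4.0, 4.0]],   # stage 1: coarse codewords
    [[0.0, 0.0], [1.0, -1.0]],  # stage 2: refinement of the residual
]
codes = rvq_encode([5.0, 3.0], codebooks)
preview = rvq_decode(codes[:1], codebooks)  # decoded from the first stage only
full = rvq_decode(codes, codebooks)         # decoded from all stages
```

Truncating the index list is what makes the bitstream progressive: a receiver can decode whatever prefix has arrived.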

2603.02719 2026-05-25 cs.LG

An Empirical Analysis of Calibration and Selective Prediction in Multimodal Clinical Condition Classification

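L. Julián Lechuga López, Farah E. Shamout, Tim G. J. Rudner

Affiliations: New York University; University of Toronto

AI summary: This study empirically analyzes the reliability of uncertainty-based selective prediction in multimodal clinical condition classification. Even when models score well on standard evaluation metrics, selective prediction can substantially degrade performance; the root cause is severe class-dependent miscalibration, which is most pronounced for underrepresented clinical conditions. The study shows that aggregate metrics can mask these problems and argues for calibration-aware evaluation of clinical AI systems to support safety and robustness.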

Comments 40 pages, 14 figures, 16 tables. Accepted as a conference paper at AHLI Conference on Health, Inference, and Learning (CHIL) 2026

Abstract

As artificial intelligence systems move toward clinical deployment, ensuring reliable prediction behavior is fundamental for safety-critical decision-making tasks. One proposed safeguard is selective prediction, where models can defer uncertain predictions to human experts for review. In this work, we empirically evaluate the reliability of uncertainty-based selective prediction in multilabel clinical condition classification using multimodal ICU data. Across a range of state-of-the-art unimodal and multimodal models, we find that selective prediction can substantially degrade performance despite strong standard evaluation metrics. This failure is driven by severe class-dependent miscalibration, whereby models assign high uncertainty to correct predictions and low uncertainty to incorrect ones, particularly for underrepresented clinical conditions. Our results show that commonly used aggregate metrics can obscure these effects, limiting their ability to assess selective prediction behavior in this setting. Taken together, our findings characterize a task-specific failure mode of selective prediction in multimodal clinical condition classification and highlight the need for calibration-aware evaluation to provide strong guarantees of safety and robustness in clinical AI.
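
The failure mode described above is easy to reproduce on toy numbers. The sketch below keeps only the most certain fraction of predictions and measures accuracy on that subset; the correctness flags and uncertainty scores are invented, and the paper's actual metrics may differ:

```python
def selective_accuracy(correct, uncertainty, coverage):
    """Accuracy on the `coverage` fraction of samples with the lowest uncertainty."""
    order = sorted(range(len(correct)), key=lambda i: uncertainty[i])
    kept = order[: max(1, round(coverage * len(order)))]
    return sum(correct[i] for i in kept) / len(kept)

correct = [1, 1, 0, 0]                 # 1 = prediction was right
calibrated = [0.1, 0.2, 0.8, 0.9]      # low uncertainty on correct predictions
miscalibrated = [0.8, 0.9, 0.1, 0.2]   # low uncertainty on wrong predictions

full_coverage = selective_accuracy(correct, calibrated, coverage=1.0)
half_calibrated = selective_accuracy(correct, calibrated, coverage=0.5)
half_miscalibrated = selective_accuracy(correct, miscalibrated, coverage=0.5)
```

Both models have the same accuracy at full coverage, so an aggregate metric cannot tell them apart, yet deferring the "uncertain" half helps one and ruins the other.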

2603.01655 2026-05-25 cs.LG eess.SP

Transform-Invariant Generative Ray Path Sampling for Efficient Radio Propagation Modeling

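Jérome Eertmans, Enrico M. Vitucci, Vittorio Degli-Esposti, Nicola Di Cicco, Laurent Jacques, Claude Oestges

Affiliations: OPTIT S.r.l.

AI summary: This paper proposes an intelligent sampling framework based on Generative Flow Networks for modeling radio propagation paths efficiently, addressing the prohibitive computational complexity of conventional ray tracing. An experience replay buffer, a uniform exploratory policy, and physics-based action masking make learning more robust and path exploration more efficient in complex environments. The method achieves speedups over exhaustive search of up to 10× on GPU and 100× on CPU while maintaining high accuracy, although generalization to real-world urban geometries needs further work.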

Comments submitted to npj Wireless Technology, 30 pages, 16 figures

Abstract

Ray tracing has become a standard for accurate radio propagation modeling, but suffers from exponential computational complexity, as the number of candidate paths scales with the number of objects raised to the interaction order. This bottleneck limits its use in large-scale or real-time applications, forcing traditional tools to rely on heuristics that reduce path candidates at the cost of potentially reduced accuracy. To overcome this limitation, we propose a machine-learning-assisted framework that replaces exhaustive path searching with intelligent sampling via Generative Flow Networks. Applying these generative models to this domain presents challenges, particularly sparse rewards due to the rarity of valid paths, which can lead to convergence failures and trivial solutions when evaluating high-order interactions in complex environments. To ensure robust learning and efficient exploration, our framework incorporates three key components. First, an experience replay buffer captures and retains rare valid paths. Second, a uniform exploratory policy improves generalization and prevents overfitting to simple geometries. Third, a physics-based action masking strategy filters out physically impossible paths before the model considers them. Validated on idealized street-canyon scenarios, our model achieves substantial speedups over exhaustive search (up to 10× faster on GPU and 100× faster on CPU) while maintaining high coverage accuracy and successfully uncovering complex propagation paths. However, out-of-distribution evaluations on real-world Manhattan street geometries reveal that generalizing to substantially different urban morphologies requires further advancement in model capacity or alternative training strategies. Source code, tests, and a tutorial are available at https://github.com/jeertmans/sampling-paths.
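
The scaling argument and the effect of masking can be checked by brute force on a small case. In the sketch below a path is a sequence of object indices; the mask rule (no two consecutive interactions with the same object) is an illustrative assumption and not necessarily the physics-based rule used in the paper:

```python
from itertools import product

def candidate_paths(num_objects, order):
    """All object sequences of length `order`: num_objects ** order candidates."""
    return list(product(range(num_objects), repeat=order))

def is_plausible(path):
    """Illustrative mask: reject consecutive interactions with the same object."""
    return all(a != b for a, b in zip(path, path[1:]))

all_paths = candidate_paths(num_objects=4, order=3)
masked = [path for path in all_paths if is_plausible(path)]
print(len(all_paths), len(masked))
```

Even this simple rule removes a large share of candidates before any geometry is evaluated, and the saving grows with the interaction order.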

2602.19174 2026-05-25 cs.CL

TurkicNLP: An NLP Toolkit for Turkic Languages

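Sherzod Hakimov

Affiliations: Computational Linguistics, University of Potsdam

AI summary: This paper introduces TurkicNLP, an open-source natural language processing toolkit for the Turkic language family that addresses the fragmentation of tools and resources for these languages. It supports four script families and offers a unified NLP pipeline covering tokenization, morphological analysis, part-of-speech tagging, dependency parsing, and more. A modular architecture integrates rule-based and neural models with automatic script detection and routing between script variants, and outputs follow the CoNLL-U standard for interoperability and extension.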

Comments The toolkit is available here: https://github.com/turkic-nlp/turkicnlp

Abstract

Natural language processing for the Turkic language family, spoken by over 200 million people across Eurasia, remains fragmented, with most languages lacking unified tooling and resources. We present TurkicNLP, an open-source Python library providing a single, consistent NLP pipeline for Turkic languages across four script families: Latin, Cyrillic, Perso-Arabic, and Old Turkic Runic. The library covers tokenization, morphological analysis, part-of-speech tagging, dependency parsing, named entity recognition, bidirectional script transliteration, cross-lingual sentence embeddings, and machine translation through one language-agnostic API. A modular multi-backend architecture integrates rule-based finite-state transducers and neural models transparently, with automatic script detection and routing between script variants. Outputs follow the CoNLL-U standard for full interoperability and extension. Code and documentation are hosted at https://github.com/turkic-nlp/turkicnlp.
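
Since the outputs follow CoNLL-U, any downstream tool can read them without depending on the library. The sketch below is a minimal stand-alone reader for the ten tab-separated CoNLL-U columns; it does not use the TurkicNLP API, and the sample annotation is hand-written for illustration rather than produced by the toolkit:

```python
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

def parse_conllu(text):
    """Parse CoNLL-U text into a list of sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():            # blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
        elif not line.startswith("#"):  # skip comment lines such as "# text = ..."
            current.append(dict(zip(FIELDS, line.split("\t"))))
    if current:
        sentences.append(current)
    return sentences

# Hand-written sample (Turkish, "I read a book."); annotation is illustrative.
sample = (
    "# text = Kitap okudum.\n"
    "1\tKitap\tkitap\tNOUN\t_\t_\t2\tobj\t_\t_\n"
    "2\tokudum\toku\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)
sentences = parse_conllu(sample)
root = [tok for tok in sentences[0] if tok["head"] == "0"][0]
```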

2602.18176 2026-05-25 cs.CL

Improving Sampling for Masked Diffusion Models via Information Gain

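Kaisen Yang, Jayden Teoh, Kaicheng Yang, Yitong Zhang, Alex Lamb

Affiliations: Department of Computer Science and Technology, Tsinghua University; Massachusetts Institute of Technology; Shanghai Jiao Tong University; Beihang University; College of AI, Tsinghua University

AI summary: This paper studies how to improve sampling for Masked Diffusion Models (MDMs). It argues that existing samplers are too greedy: they select locally certain tokens without regard for downstream effects, which increases cumulative uncertainty. The authors propose the Info-Gain Sampler, a training-free decoding method that uses the bidirectional structure of MDMs to balance immediate uncertainty against the information gained over the remaining masked positions. It outperforms existing MDM samplers on reasoning, coding, creative writing, and image generation tasks.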

Comments https://github.com/yks23/Information-Gain-Sampler Accepted by ICML2026

详情
AI中文摘要

掩码扩散模型(MDMs)支持灵活的解码顺序,但现有采样器大多是贪婪的,仅选择局部确定的token而不考虑其下游影响。我们表明这种短视行为会增加累积不确定性并导致次优生成。为解决此问题,我们提出**Info-Gain采样器**,一种无需训练的解码方法,利用MDMs的双向结构平衡即时不确定性与剩余掩码位置获得的信息增益。在推理、编码、创意写作和图像生成任务中,Info-Gain采样器持续优于现有MDM采样器,平均推理准确率提升2.9--11.6个百分点,创意写作平均胜率达到62.8%。代码可在https://github.com/yks23/Information-Gain-Sampler获取。

英文摘要

Masked Diffusion Models (MDMs) enable flexible decoding orders, yet existing samplers remain largely greedy, selecting locally certain tokens without accounting for their downstream effects. We show that this myopia can increase cumulative uncertainty and lead to suboptimal generation. To address this, we propose the **Info-Gain Sampler**, a training-free decoding method that uses the bidirectional structure of MDMs to balance immediate uncertainty with the information gained over remaining masked positions. Across reasoning, coding, creative writing, and image generation tasks, Info-Gain Sampler consistently outperforms existing MDM samplers, improving average reasoning accuracy by 2.9--11.6 percentage points and achieving a 62.8% average win rate in creative writing. The code is available at https://github.com/yks23/Information-Gain-Sampler.
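The contrast between a greedy sampler and one that accounts for downstream effects can be illustrated with a toy example. The scoring rule below (information gain over the remaining positions minus a weighted immediate entropy) is an assumption made for illustration; the abstract does not give the exact objective. The "after" distributions, which a real sampler would obtain from a forward pass of the MDM, are supplied by hand.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def greedy_pick(dists):
    """Greedy baseline: unmask the position whose prediction is most certain."""
    return min(dists, key=lambda pos: entropy(dists[pos]))


def info_gain_pick(dists, dists_after, weight=1.0):
    """Pick the position that most reduces uncertainty elsewhere.

    dists:       {position: distribution} for the currently masked positions
    dists_after: {position: {other_position: distribution}}, the predictions
                 for the other positions if `position` were unmasked first
    weight:      hypothetical trade-off between gain and immediate entropy
    """
    def score(pos):
        before = sum(entropy(d) for q, d in dists.items() if q != pos)
        after = sum(entropy(d) for d in dists_after[pos].values())
        return (before - after) - weight * entropy(dists[pos])

    return max(dists, key=score)


dists = {0: [0.9, 0.1], 1: [0.8, 0.2], 2: [0.5, 0.5]}
dists_after = {
    0: {1: [0.8, 0.2], 2: [0.5, 0.5]},    # unmasking 0 tells us nothing new
    1: {0: [0.9, 0.1], 2: [0.99, 0.01]},  # unmasking 1 nearly resolves 2
    2: {0: [0.9, 0.1], 1: [0.8, 0.2]},
}
print(greedy_pick(dists), info_gain_pick(dists, dists_after))
```

In this toy setting the greedy rule commits to position 0 because it is locally most certain, while the gain-aware rule prefers position 1 because committing to it sharply reduces the uncertainty at position 2.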

2602.17653 2026-05-25 cs.CL

Differences in Typological Alignment in Language Models' Treatment of Differential Argument Marking

语言模型处理差异论元标记中的类型学对齐差异

Iskar Deng, Nathalia Xu, Shane Steinert-Threlkeld

发表机构 * University of Washington(华盛顿大学)

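AI summary This paper studies differences in typological alignment in how language models handle differential argument marking (DAM). GPT-2 models are trained on 18 synthetic corpora implementing distinct DAM systems and evaluated with minimal pairs. The models show a human-like preference for the natural markedness direction, favoring overt marking of semantically atypical arguments, but they do not reproduce the object preference that is common in human languages. The result suggests that different typological tendencies may arise from different underlying mechanisms.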

Comments 16 pages, 8 figures, 7 tables. To appear at CoNLL 2026

详情
AI中文摘要

近期研究表明,在合成语料上训练的语言模型可以展现出类似人类语言跨语言规律的类型学偏好,特别是对于语序等句法现象。本文将此范式扩展到差异论元标记(DAM),一种形态标记取决于语义显著性的语义许可系统。使用受控合成学习方法,我们在18个实现不同DAM系统的语料上训练GPT-2模型,并通过最小对评估其泛化能力。结果揭示了DAM的两个类型学维度之间的分离。模型可靠地展现出对自然标记方向的人类偏好,倾向于那些显性标记针对语义非典型论元的系统。相比之下,模型并未复现人类语言中强烈的宾语偏好,即在DAM中显性标记更常针对宾语而非主语。这些发现表明,不同的类型学倾向可能源于不同的潜在来源。

英文摘要

Recent work has shown that language models (LMs) trained on synthetic corpora can exhibit typological preferences that resemble cross-linguistic regularities in human languages, particularly for syntactic phenomena such as word order. In this paper, we extend this paradigm to differential argument marking (DAM), a semantic licensing system in which morphological marking depends on semantic prominence. Using a controlled synthetic learning method, we train GPT-2 models on 18 corpora implementing distinct DAM systems and evaluate their generalization using minimal pairs. Our results reveal a dissociation between two typological dimensions of DAM. Models reliably exhibit human-like preferences for natural markedness direction, favoring systems in which overt marking targets semantically atypical arguments. In contrast, models do not reproduce the strong object preference in human languages, in which overt marking in DAM more often targets objects rather than subjects. These findings suggest that different typological tendencies may arise from distinct underlying sources.
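Minimal-pair evaluation, as mentioned in the abstract, is commonly scored by checking whether the model assigns a higher score to the acceptable member of each pair. The sketch below assumes that convention; the sentences, the `-MARK` suffix, and the lookup-table scorer are hypothetical stand-ins for a trained model's log-probabilities.

```python
def minimal_pair_accuracy(pairs, score):
    """Fraction of (acceptable, unacceptable) pairs where the acceptable
    sentence receives the strictly higher score."""
    correct = sum(1 for good, bad in pairs if score(good) > score(bad))
    return correct / len(pairs)


# Hypothetical log-probabilities standing in for a language model.
toy_logprobs = {
    "dog see cat-MARK": -4.0,
    "dog see cat": -6.5,
    "cat see rock": -3.0,
    "cat see rock-MARK": -2.5,  # the toy model gets this pair wrong
}
pairs = [
    ("dog see cat-MARK", "dog see cat"),
    ("cat see rock", "cat see rock-MARK"),
]
print(minimal_pair_accuracy(pairs, toy_logprobs.get))
```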

2602.15258 2026-05-25 cs.RO

SEG-JPEG: Simple Visual Semantic Communications for Remote Operation of Automated Vehicles over Unreliable Wireless Networks

SEG-JPEG: 用于在不可靠无线网络上远程操作自动驾驶车辆的简单视觉语义通信

Sebastian Donnelly, Ruth Anderson, George Economides, James Broughton, Peter Ball, Alexander Rast, Andrew Bradley

发表机构 * Autonomous Driving and Intelligent Transport Group(自动驾驶与智能交通组) Oxford Brookes University(牛津布鲁克斯大学) Oxfordshire County Council(牛津郡议会) Department for Transport(交通部) School of Engineering, Computing & Mathematics(工程、计算与数学学院) Artificial Intelligence, Data Analysis and Systems Institute(人工智能、数据分析与系统研究所)

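AI summary This paper studies how visual semantic communication can support remote operation of automated vehicles over unreliable wireless networks. The proposed method, SEG-JPEG, encodes segmentations of detected road users as colour highlights within low-resolution greyscale imagery, reducing the required data rate by 50% while maintaining visual clarity. Experiments show a median glass-to-glass latency below 200 ms on low-bandwidth networks and improved situational awareness for the remote operator, offering a workable route to large-scale remote deployment of automated vehicles.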

Comments 7 pages, 9 figures. Under minor revision for CSNDSP 2026

详情
AI中文摘要

远程操作被认为是快速部署自动驾驶车辆的关键。目前,将图像流传输到远程控制连接车辆需要可靠、高吞吐量的网络连接,而在依赖公共网络基础设施的实际远程操作部署中,这种连接可能受到限制。本文研究了如何应用计算机视觉辅助的语义通信来规避与传统图像压缩技术相关的数据丢失和损坏。通过将检测到的道路用户的分割编码为低分辨率灰度图像中的彩色高亮,与传统技术相比,所需数据速率可降低50%,同时保持视觉清晰度。这使得即使网络数据速率低于500 kbit/s,中位玻璃到玻璃延迟也能低于200 ms,同时清晰勾勒出显著的道路用户,以增强远程操作员的情境意识。该方法在4G移动连接变化的区域使用自动最后一英里配送车辆进行了演示。结果表明,即使在通常受限的公共4G/5G移动网络上,也有可能大规模部署远程操作的自动驾驶车辆,从而有可能加速自动驾驶车辆在全国范围内的推广。

英文摘要

Remote Operation is touted as being key to the rapid deployment of automated vehicles. Streaming imagery to control connected vehicles remotely currently requires a reliable, high throughput network connection, which can be limited in real-world remote operation deployments relying on public network infrastructure. This paper investigates how the application of computer vision assisted semantic communication can be used to circumvent data loss and corruption associated with traditional image compression techniques. By encoding the segmentations of detected road users into colour coded highlights within low resolution greyscale imagery, the required data rate can be reduced by 50% compared with conventional techniques, while maintaining visual clarity. This enables a median glass-to-glass latency of below 200 ms even when the network data rate is below 500 kbit/s, while clearly outlining salient road users to enhance situational awareness of the remote operator. The approach is demonstrated in an area of variable 4G mobile connectivity using an automated last-mile delivery vehicle. Results indicate that large-scale deployment of remotely operated automated vehicles could be possible even on the often constrained public 4G/5G mobile network, providing the potential to expedite the nationwide roll-out of automated vehicles.
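The encoding step described above, colour highlights for detected road users on top of low-resolution greyscale imagery, can be sketched in a few lines. This is an illustration under stated assumptions: the class-to-colour table, the decimation factor, and the BT.601 luma weights are choices made here, not details from the paper, and the subsequent JPEG compression is omitted.

```python
# Hypothetical label-to-colour table: 1 = pedestrian, 2 = vehicle.
CLASS_COLOURS = {1: (255, 0, 0), 2: (0, 255, 0)}


def to_grey(pixel):
    """Luma of an RGB pixel using the ITU-R BT.601 weights."""
    r, g, b = pixel
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def highlight_frame(rgb, mask, step=2):
    """Downsample by keeping every `step`-th pixel, convert the background
    to greyscale, and paint segmented road users in their class colour.

    rgb:  2-D list of (r, g, b) tuples
    mask: 2-D list of integer class labels, 0 meaning background
    """
    out = []
    for y in range(0, len(rgb), step):
        row = []
        for x in range(0, len(rgb[0]), step):
            label = mask[y][x]
            if label in CLASS_COLOURS:
                row.append(CLASS_COLOURS[label])
            else:
                g = to_grey(rgb[y][x])
                row.append((g, g, g))
        out.append(row)
    return out


frame = [[(100, 150, 200)] * 4 for _ in range(4)]
labels = [[0] * 4 for _ in range(4)]
labels[0][0] = 1  # one pedestrian pixel in the top-left corner
small = highlight_frame(frame, labels, step=2)
print(small[0][0], small[1][1])
```

Because the mask is sampled at the same pixels as the image, objects smaller than `step` pixels can be dropped; a production encoder would more likely pool the mask before decimating.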

2602.13473 2026-05-25 cs.AI

NeuroWeaver: An Autonomous Evolutionary Agent for Exploring the Programmatic Space of EEG Analysis Pipelines

NeuroWeaver:一种用于探索EEG分析流水线程序空间的自主进化智能体

Guoan Wang, Shihao Yang, Jun-En Ding, Feng Liu

发表机构 * Department of Systems Engineering, Stevens Institute of Technology(系统工程系,斯蒂文斯理工学院)

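AI summary This paper proposes NeuroWeaver, an autonomous evolutionary agent for exploring the programmatic space of EEG analysis pipelines. It reformulates pipeline design as a discrete constrained optimization problem and combines domain-informed initialization with multi-objective evolutionary optimization to balance performance, novelty, and efficiency. Experiments show that NeuroWeaver produces lightweight solutions with far fewer parameters that outperform existing task-specific methods and are comparable to large-scale foundation models.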

详情
AI中文摘要

尽管基础模型在通用领域取得了显著成功,但这些模型在脑电图(EEG)分析中的应用受到大量数据需求和高参数化的限制。这些因素导致高昂的计算成本,从而阻碍了在资源受限的临床环境中的部署。相反,通用自动机器学习框架通常不适合该领域,因为在无界程序空间中的探索未能纳入必要的神经生理学先验,并且经常产生缺乏科学合理性的解决方案。为了解决这些限制,我们提出了NeuroWeaver,一个统一的自主进化智能体,通过将流水线工程重新表述为离散约束优化问题,旨在泛化到不同的EEG数据集和任务。具体来说,我们采用领域信息子空间初始化将搜索限制在神经科学合理的流形上,并结合多目标进化优化,通过自我反思性改进动态平衡性能、新颖性和效率。在五个异构基准上的实证评估表明,尽管使用的参数显著减少,NeuroWeaver合成的轻量级解决方案始终优于最先进的任务特定方法,并实现了与大规模基础模型相当的性能。

英文摘要

Although foundation models have demonstrated remarkable success in general domains, the application of these models to electroencephalography (EEG) analysis is constrained by substantial data requirements and high parameterization. These factors incur prohibitive computational costs, thereby impeding deployment in resource-constrained clinical environments. Conversely, general-purpose automated machine learning frameworks are often ill-suited for this domain, as exploration within an unbounded programmatic space fails to incorporate essential neurophysiological priors and frequently yields solutions that lack scientific plausibility. To address these limitations, we propose NeuroWeaver, a unified autonomous evolutionary agent designed to generalize across diverse EEG datasets and tasks by reformulating pipeline engineering as a discrete constrained optimization problem. Specifically, we employ a Domain-Informed Subspace Initialization to confine the search to neuroscientifically plausible manifolds, coupled with a Multi-Objective Evolutionary Optimization that dynamically balances performance, novelty, and efficiency via self-reflective refinement. Empirical evaluations across five heterogeneous benchmarks demonstrate that NeuroWeaver synthesizes lightweight solutions that consistently outperform state-of-the-art task-specific methods and achieve performance comparable to large-scale foundation models, despite utilizing significantly fewer parameters.

2602.12579 2026-05-25 cs.LG cs.AI

VI-CuRL: Stabilizing Verifier-Independent RL Reasoning via Confidence-Guided Variance Reduction

VI-CuRL: 通过置信度引导的方差缩减稳定与验证器无关的强化学习推理

Xin-Qiang Cai, Masashi Sugiyama

发表机构 * RIKEN AIP(日本理化学研究所高级研究所) The University of Tokyo(东京大学)

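AI summary This paper proposes VI-CuRL, a verifier-independent reinforcement learning framework that addresses the scalability limits caused by the reliance of Reinforcement Learning with Verifiable Rewards (RLVR) on external verifiers. The method uses the model's own confidence to build a curriculum that does not depend on external verifiers, controlling gradient variance and improving training stability. A theoretical analysis proves that the estimator is asymptotically unbiased, and experiments show that it outperforms both verifier-dependent and verifier-independent baselines on math and general reasoning tasks.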

详情
AI中文摘要

基于可验证奖励的强化学习(RLVR)已成为增强大型语言模型(LLMs)推理能力的主流范式,但其对外部验证器的依赖限制了可扩展性。最近的研究表明,RLVR主要通过激发潜在能力发挥作用,这推动了无验证器算法的发展。然而,在此类设置中,标准方法(如Group Relative Policy Optimization)面临一个关键挑战:破坏性的梯度方差常导致训练崩溃。为解决此问题,我们引入了与验证器无关的课程强化学习(VI-CuRL),该框架利用模型的内在置信度构建独立于外部验证器的课程。通过优先处理高置信度样本,VI-CuRL有效管理偏差-方差权衡,特别针对降低动作和问题方差。我们提供了严格的理论分析,证明我们的估计量保证渐近无偏性。实验上,VI-CuRL促进了稳定性,并在有/无验证器的数学和通用推理基准上持续优于依赖/不依赖验证器的基线。

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a dominant paradigm for enhancing Large Language Models (LLMs) reasoning, yet its reliance on external verifiers limits its scalability. Recent findings suggest that RLVR primarily functions by eliciting latent capabilities, motivating the development of verifier-free algorithms. However, in such settings, standard methods like Group Relative Policy Optimization face a critical challenge: destructive gradient variance that often leads to training collapse. To address this issue, we introduce Verifier-Independent Curriculum Reinforcement Learning (VI-CuRL), a framework that leverages the model's intrinsic confidence to construct a curriculum independent from external verifiers. By prioritizing high-confidence samples, VI-CuRL effectively manages the bias-variance trade-off, specifically targeting the reduction of action and problem variance. We provide a rigorous theoretical analysis, proving that our estimator guarantees asymptotic unbiasedness. Empirically, VI-CuRL promotes stability and consistently outperforms verifier-dependent/independent baselines across math and general reasoning benchmarks with/without verifiers.
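The idea of a confidence-based curriculum that prioritizes high-confidence samples can be sketched as follows. Everything specific here is an assumption made for illustration: confidence is taken to be the length-normalised sequence likelihood, and the kept fraction grows linearly from 25% to 100% over training. The abstract specifies neither choice.

```python
import math


def confidence(token_logprobs):
    """Length-normalised sequence likelihood (assumed confidence measure)."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def keep_fraction(step, total_steps, start=0.25):
    """Hypothetical linear schedule from `start` up to 1.0."""
    return min(1.0, start + (1.0 - start) * step / total_steps)


def select_batch(samples, step, total_steps):
    """Keep the most confident samples; the kept share grows with `step`."""
    ranked = sorted(samples, key=lambda s: confidence(s["logprobs"]), reverse=True)
    k = max(1, math.ceil(keep_fraction(step, total_steps) * len(ranked)))
    return ranked[:k]


samples = [
    {"id": "a", "logprobs": [-0.1, -0.2]},
    {"id": "b", "logprobs": [-2.0, -1.5]},
    {"id": "c", "logprobs": [-0.5, -0.4]},
    {"id": "d", "logprobs": [-1.0, -1.0]},
]
for step in (0, 50, 100):
    print(step, [s["id"] for s in select_batch(samples, step, 100)])
```

Early in training only the most confident rollouts contribute to the update; by the final step the whole batch is used, so the filter's influence fades over time.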

2602.12316 2026-05-25 cs.AI cs.CL cs.CY cs.GT cs.MA

GT-HarmBench: Benchmarking AI Safety Risks Through the Lens of Game Theory

GT-HarmBench:通过博弈论视角评估AI安全风险

Pepijn Cobben, Xuanqiang Angelo Huang, Thao Amelia Pham, Isabel Dahlgren, Terry Jingchen Zhang, Zhijing Jin

发表机构 * ETH Zürich(苏黎世联邦理工学院) Berea College(贝雷学院) University of Toronto(多伦多大学) Vector Institute(向量研究所) Max Planck Institute for Intelligent Systems, Tübingen, Germany(图宾根德国智能系统马克斯·普朗克研究所)

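AI summary This paper proposes GT-HarmBench, a benchmark for evaluating the safety of frontier AI systems in high-stakes multi-agent scenarios, covering classic game-theoretic settings such as the Prisoner's Dilemma, Stag Hunt, and Chicken. Existing AI models fail to choose socially beneficial actions in 38% of high-stakes cases, which shows how serious the alignment problem is in multi-agent settings. Game-theoretic interventions are shown to improve socially beneficial outcomes, and the benchmark provides a standardized testbed for multi-agent AI safety research.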

详情
AI中文摘要

前沿AI系统能力日益增强,并部署在高风险的多智能体环境中。然而,现有的AI安全基准主要评估单一智能体,导致对协调失败和冲突等多智能体风险的理解不足。我们引入了GT-HarmBench,这是一个包含1535个高风险场景的基准,涵盖了囚徒困境、猎鹿博弈和斗鸡博弈等博弈论结构。场景来源于MIT AI风险库中的现实AI风险背景。在15个前沿模型中,智能体在38%的高风险案例中未能选择对社会有益的行为,例如军事升级、选举操纵和医疗事故。我们测量了对博弈论提示框架和顺序的敏感性,并分析了导致失败的推理模式。我们进一步表明,博弈论干预可将社会有益结果提升高达18%。我们的结果突出了显著的可靠性差距,并为研究多智能体环境中的对齐提供了一个广泛的标准化测试平台。该基准和代码可在https://github.com/causalNLP/gt-harmbench获取。

英文摘要

Frontier AI systems are increasingly capable and deployed in high-stakes multi-agent environments. However, existing AI safety benchmarks largely evaluate single agents, leaving multi-agent risks such as coordination failure and conflict poorly understood. We introduce GT-HarmBench, a benchmark of 1,535 high-stakes scenarios spanning game-theoretic structures such as the Prisoner's Dilemma, Stag Hunt and Chicken. Scenarios are drawn from realistic AI risk contexts in the MIT AI Risk Repository. Across 15 frontier models, agents fail to choose socially beneficial actions in 38% of high-stakes cases, such as military escalation, election manipulation, and medical malpractice. We measure sensitivity to game-theoretic prompt framing and ordering, and analyze reasoning patterns driving failures. We further show that game-theoretic interventions improve socially beneficial outcomes by up to 18%. Our results highlight substantial reliability gaps and provide a broad standardized testbed for studying alignment in multi-agent environments. The benchmark and code are available at https://github.com/causalNLP/gt-harmbench.
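The three game structures named in the abstract are distinguished by the ordering of four payoffs in a symmetric two-player game. The sketch below uses the standard textbook orderings (T = temptation, R = mutual-cooperation reward, P = mutual-defection punishment, S = sucker's payoff). It is a general illustration and is not taken from the benchmark's code.

```python
def classify_symmetric_game(T, R, P, S):
    """Classify a symmetric 2x2 game by its payoff ordering.

    T: payoff for defecting against a cooperator
    R: payoff when both cooperate
    P: payoff when both defect
    S: payoff for cooperating against a defector
    """
    if T > R > P > S:
        return "Prisoner's Dilemma"
    if R > T >= P > S:
        return "Stag Hunt"
    if T > R > S > P:
        return "Chicken"
    return "other"


print(classify_symmetric_game(T=5, R=3, P=1, S=0))
print(classify_symmetric_game(T=3, R=4, P=2, S=0))
print(classify_symmetric_game(T=5, R=3, P=0, S=1))
```

The orderings explain why the games pose different risks: in the Prisoner's Dilemma defection dominates, in Stag Hunt cooperation is best but risky, and in Chicken mutual defection is the worst outcome for both players.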

2602.11629 2026-05-25 cs.LG

GP2F: Cross-Domain Graph Prompting with Adaptive Fusion of Pre-trained Graph Neural Networks

GP2F: 基于预训练图神经网络自适应融合的跨域图提示学习

Dongxiao He, Wenxuan Sun, Yongqi Huang, Jitao Zhao, Di Jin

发表机构 * School of Computer Science and Technology, Tianjin University, Tianjin, China(天津大学计算机科学与技术学院,天津,中国)

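AI summary This paper studies why graph prompt learning (GPL) remains effective in cross-domain settings and proposes a new method, GP2F. The method fuses the knowledge of pre-trained graph neural networks with a lightweight task-specific adapter module to achieve more robust adaptation across domains. A theoretical analysis shows that combining pre-trained knowledge with task adaptation lowers the estimation error, and experiments confirm the advantage of GP2F on cross-domain few-shot node and graph classification.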

Comments 16 pages, 8 figures

详情
AI中文摘要

图提示学习(GPL)最近成为一种有前景的范式,用于预训练图模型的下游适应,缓解预训练目标与下游任务之间的不匹配。最近,GPL的关注点从域内转向跨域场景,这更接近现实世界应用,其中预训练源和下游目标在数据分布上往往存在显著差异。然而,GPL在域偏移下为何仍然有效尚未被探索。经验上,我们观察到代表性的GPL方法在跨域设置中与两个简单基线(全微调和线性探测)具有竞争力,这促使我们更深入地理解提示机制。我们提供理论分析表明,联合利用这两个互补分支比单独使用任一分支产生更小的估计误差,正式证明了跨域GPL受益于预训练知识与任务特定适应性之间的整合。基于这一见解,我们提出GP2F,一种双分支GPL方法,显式实例化两个极端:(1)保留预训练知识的冻结分支,和(2)带有轻量级适配器用于任务特定适应的适配分支。然后,我们通过对比损失和拓扑一致性损失在拓扑约束下执行自适应融合。在跨域少样本节点和图分类上的大量实验表明,我们的方法优于现有方法。

英文摘要

Graph Prompt Learning (GPL) has recently emerged as a promising paradigm for downstream adaptation of pre-trained graph models, mitigating the misalignment between pre-training objectives and downstream tasks. Recently, the focus of GPL has shifted from in-domain to cross-domain scenarios, which are closer to real-world applications, where the pre-training source and downstream target often differ substantially in data distribution. However, why GPL methods remain effective under such domain shifts is still unexplored. Empirically, we observe that representative GPL methods are competitive with two simple baselines in cross-domain settings: full fine-tuning (FT) and linear probing (LP), motivating us to seek a deeper understanding of the prompting mechanism. We provide a theoretical analysis demonstrating that jointly leveraging these two complementary branches yields a smaller estimation error than using either branch alone, formally proving that cross-domain GPL benefits from the integration of pre-trained knowledge and task-specific adaptation. Based on this insight, we propose GP2F, a dual-branch GPL method that explicitly instantiates the two extremes: (1) a frozen branch that retains pre-trained knowledge, and (2) an adapted branch with lightweight adapters for task-specific adaptation. We then perform adaptive fusion under topology constraints via a contrastive loss and a topology-consistent loss. Extensive experiments on cross-domain few-shot node and graph classification demonstrate that our method outperforms existing methods.

2602.11243 2026-05-25 cs.LG cs.CL

Evaluating Memory Structure in LLM Agents

评估LLM智能体中的记忆结构

Alina Shutova, Alexandra Olenina, Ivan Vinogradov, Anton Sinitsin

发表机构 * HSE University(莫斯科国立高等经济学院) Yandex YSDA New Economic School(新经济学院)

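AI summary This paper studies how well LLM-based agents organize their long-term memory and proposes StructMemEval, a new benchmark that evaluates an agent's ability to structure long-term memory rather than only factual recall or simple retrieval. The benchmark contains tasks that require structured knowledge organization, such as transaction ledgers and to-do lists. Experiments show that plain retrieval-augmented LLMs struggle with these tasks, whereas memory agents solve them reliably when prompted how to organize their memory, which highlights the importance of improving LLM training and memory architectures.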

Comments Preprint, work in progress

详情
AI中文摘要

现代基于LLM的智能体和聊天助手依赖长期记忆框架来存储可重用知识、回忆用户偏好并增强推理。随着研究人员创建更复杂的记忆架构,分析其能力并指导未来记忆设计变得越来越困难。大多数长期记忆基准侧重于简单事实保留、多跳回忆和基于时间的变化。虽然这些能力无疑很重要,但通常可以通过简单的检索增强LLM实现,并且不测试复杂的记忆层次。为了弥补这一差距,我们提出了StructMemEval——一个测试智能体组织其长期记忆能力(而不仅仅是事实回忆)的基准。我们收集了一系列人类通过以特定结构组织知识来解决的任务:交易账本、待办事项列表、树结构等。我们的初步实验表明,简单的检索增强LLM在这些任务上表现困难,而记忆智能体在提示如何组织记忆时可以可靠地解决它们。然而,我们还发现,现代LLM在未被提示时并不总是能识别记忆结构。这突显了未来在LLM训练和记忆框架改进方面的一个重要方向。

英文摘要

Modern LLM-based agents and chat assistants rely on long-term memory frameworks to store reusable knowledge, recall user preferences, and augment reasoning. As researchers create more complex memory architectures, it becomes increasingly difficult to analyze their capabilities and guide future memory designs. Most long-term memory benchmarks focus on simple fact retention, multi-hop recall, and time-based changes. While undoubtedly important, these capabilities can often be achieved with simple retrieval-augmented LLMs and do not test complex memory hierarchies. To bridge this gap, we propose StructMemEval - a benchmark that tests the agent's ability to organize its long-term memory, not just factual recall. We gather a suite of tasks that humans solve by organizing their knowledge in a specific structure: transaction ledgers, to-do lists, trees and others. Our initial experiments show that simple retrieval-augmented LLMs struggle with these tasks, whereas memory agents can reliably solve them if prompted how to organize their memory. However, we also find that modern LLMs do not always recognize the memory structure when not prompted to do so. This highlights an important direction for future improvements in both LLM training and memory frameworks.

2602.08404 2026-05-25 cs.CL

TEAM: Temporal-Spatial Consistency Guided Expert Activation for MoE Diffusion Language Model Acceleration

TEAM: 时空一致性引导的专家激活用于MoE扩散语言模型加速

Linye Wei, Zixiang Luo, Pingzhi Tang, Meng Li

发表机构 * Institute for Artificial Intelligence(人工智能研究院) School of Integrated Circuits(集成电路学院) Yuanpei College(元培学院) Beijing Advanced Innovation Center for Integrated Circuits(北京集成电路先进创新中心)

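AI summary This paper proposes TEAM, a framework for accelerating diffusion language models (dLLMs) built on the Mixture-of-Experts (MoE) architecture. Existing MoE dLLMs activate a large number of experts at each denoising step, yet only a small number of tokens are ultimately accepted, which results in substantial inference overhead. TEAM exploits the temporal and spatial consistency of expert activation and applies three complementary strategies to reduce the number of activated experts while preserving performance, achieving up to 2.2x speedup.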

Comments Accepted by ICML 2026

详情
AI中文摘要

扩散大语言模型(dLLM)因其天然支持并行解码而近期受到广泛关注。基于这一范式,具有自回归(AR)初始化的混合专家(MoE)dLLM进一步展示了与主流AR模型相媲美的强劲性能。然而,我们发现MoE架构与基于扩散的解码之间存在根本性不匹配。具体来说,每个去噪步骤中激活了大量专家,而最终只有一小部分token被接受,导致大量推理开销,限制了其在延迟敏感应用中的部署。在这项工作中,我们提出了TEAM,一个即插即用的框架,通过更少的激活专家实现更多被接受的token,从而加速MoE dLLM。TEAM的动机源于观察到专家路由决策在去噪层级间表现出强时间一致性,以及在token位置间表现出强空间一致性。利用这些特性,TEAM采用了三种互补的专家激活和解码策略,保守地选择解码和掩码token所需的专家,同时跨多个候选进行积极的推测性探索。实验结果表明,TEAM在性能下降可忽略的情况下,实现了相比原始MoE dLLM高达2.2倍的加速。代码已发布在https://github.com/PKU-SEC-Lab/TEAM-MoE-dLLM。

英文摘要

Diffusion large language models (dLLMs) have recently gained significant attention due to their inherent support for parallel decoding. Building on this paradigm, Mixture-of-Experts (MoE) dLLMs with autoregressive (AR) initialization have further demonstrated strong performance competitive with mainstream AR models. However, we identify a fundamental mismatch between MoE architectures and diffusion-based decoding. Specifically, a large number of experts are activated at each denoising step, while only a small subset of tokens is ultimately accepted, resulting in substantial inference overhead and limiting their deployment in latency-sensitive applications. In this work, we propose TEAM, a plug-and-play framework that accelerates MoE dLLMs by enabling more accepted tokens with fewer activated experts. TEAM is motivated by the observation that expert routing decisions exhibit strong temporal consistency across denoising levels as well as spatial consistency across token positions. Leveraging these properties, TEAM employs three complementary expert activation and decoding strategies, conservatively selecting necessary experts for decoded and masked tokens and simultaneously performing aggressive speculative exploration across multiple candidates. Experimental results demonstrate that TEAM achieves up to 2.2x speedup over vanilla MoE dLLM, with negligible performance degradation. Code is released at https://github.com/PKU-SEC-Lab/TEAM-MoE-dLLM.
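The observation that motivates TEAM, that routing decisions stay consistent across denoising steps, can be quantified as the overlap between the top-k expert sets chosen at two steps. The sketch below measures that overlap with the Jaccard index. It illustrates the observation only and is not one of TEAM's three strategies; the router logits are made-up numbers.

```python
def top_k_experts(router_logits, k):
    """Indices of the k experts with the highest router logits."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda e: router_logits[e], reverse=True)
    return set(ranked[:k])


def routing_overlap(logits_a, logits_b, k):
    """Jaccard overlap between the top-k expert sets of two routing decisions."""
    a, b = top_k_experts(logits_a, k), top_k_experts(logits_b, k)
    return len(a & b) / len(a | b)


# Hypothetical router logits for one token at consecutive denoising steps.
step_t = [2.0, 1.5, 0.1, -1.0]
step_t1 = [1.8, 1.6, 0.3, -0.9]
print(routing_overlap(step_t, step_t1, k=2))
```

An overlap close to 1 across steps would mean that the experts selected at an earlier step are a good prediction of those needed at the next one, which is the kind of redundancy a caching or pruning scheme can exploit.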

2602.07801 2026-05-25 cs.CV cs.AI

VideoTemp-o3: Harmonizing Temporal Grounding and Video Understanding in Agentic Thinking-with-Videos

VideoTemp-o3:在智能视频思考中协调时间定位与视频理解

Wenqi Liu, Yunxiao Wang, Shijie Ma, Meng Liu, Qile Su, Tianke Zhang, Haonan Fan, Changyi Liu, Kaiyu Jiang, Jiankang Chen, Kaiyu Tang, Bin Wen, Fan Yang, Tingting Gao, Han Li, Yinwei Wei, Xuemeng Song

发表机构 * Shandong University(山东大学) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) Beihang University(北航) Southern University of Science and Technology(南方科技大学)

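AI summary In long-video understanding, conventional uniform frame sampling often fails to capture key visual evidence, leading to degraded performance and increased hallucinations. This paper proposes VideoTemp-o3, a unified agentic thinking-with-videos framework that jointly models video grounding and question answering, improving localization accuracy and reasoning efficiency. The method introduces a unified masking mechanism and dedicated rewards, supports on-demand clipping and refinement of localizations, and is accompanied by a high-quality long-video grounded QA dataset and evaluation benchmark. Experiments show strong performance on both long-video understanding and grounding.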

Comments ICML 2026

详情
AI中文摘要

在长视频理解中,传统的均匀帧采样通常无法捕捉关键视觉证据,导致性能下降和幻觉增加。为了解决这个问题,最近出现了智能视频思考范式,采用定位-裁剪-回答流程,模型主动识别相关视频片段,在这些片段内进行密集采样,然后生成答案。然而,现有方法仍然效率低下,定位能力弱,且遵循僵化的工作流。为了解决这些问题,我们提出了VideoTemp-o3,一个统一的智能视频思考框架,联合建模视频定位和问答。VideoTemp-o3展现出强大的定位能力,支持按需裁剪,并能修正不准确的定位。具体来说,在监督微调阶段,我们设计了一个统一的掩码机制,在鼓励探索的同时防止噪声。对于强化学习,我们引入了专用奖励以减轻奖励黑客。此外,从数据角度,我们开发了一个有效的流程来构建高质量的长视频定位问答数据,以及一个相应的基准,用于在不同视频时长上进行系统评估。实验结果表明,我们的方法在长视频理解和定位方面均取得了显著性能。

英文摘要

In long-video understanding, conventional uniform frame sampling often fails to capture key visual evidence, leading to degraded performance and increased hallucinations. To address this, recent agentic thinking-with-videos paradigms have emerged, adopting a localize-clip-answer pipeline in which the model actively identifies relevant video segments, performs dense sampling within those clips, and then produces answers. However, existing methods remain inefficient, suffer from weak localization, and adhere to rigid workflows. To solve these issues, we propose VideoTemp-o3, a unified agentic thinking-with-videos framework that jointly models video grounding and question answering. VideoTemp-o3 exhibits strong localization capability, supports on-demand clipping, and can refine inaccurate localizations. Specifically, in the supervised fine-tuning stage, we design a unified masking mechanism that encourages exploration while preventing noise. For reinforcement learning, we introduce dedicated rewards to mitigate reward hacking. Besides, from the data perspective, we develop an effective pipeline to construct high-quality long video grounded QA data, along with a corresponding benchmark for systematic evaluation across various video durations. Experimental results demonstrate that our method achieves remarkable performance on both long video understanding and grounding.

2602.07697 2026-05-25 cs.LG cs.AI cs.NE

On the Infinite Width and Depth Limits of Predictive Coding Networks

预测编码网络的无限宽度和深度极限

Francesco Innocenti, El Mehdi Achour, Rafal Bogacz

发表机构 * Brain Network Dynamics Unit, University of Oxford, UK(英国牛津大学脑网络动力学研究单元) UM6P College of Computing, Rabat, Morocco(摩洛哥拉巴特穆罕默德六世理工大学计算学院)

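AI summary This paper studies the behavior of predictive coding networks (PCNs) in the infinite width and depth limits and establishes a theoretical link to backpropagation (BP). For linear residual networks, predictive coding and backpropagation share the same set of width- and depth-stable parameterizations. When the network width is much larger than the depth, the predictive coding energy at activity equilibrium converges to the quadratic BP loss, so that it yields the same gradients as BP. Experiments show that this conclusion also holds for nonlinear models such as convolutional networks and Transformers, giving a theoretical basis for BP-like training with predictive coding in wide, shallow networks.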

Comments 36 pages, 28 figures

详情
AI中文摘要

预测编码(PC)是标准反向传播(BP)的一种生物合理替代方案,它在更新权重之前通过最小化关于网络活动的能量函数来工作。最近的工作通过利用一些受BP启发的重新参数化,提高了深度PC网络(PCN)的训练稳定性。然而,这些方法的完全可扩展性和理论基础仍不清楚。为了解决这一空白,我们研究了PCN的无限宽度和深度极限。对于线性残差网络,我们表明PC的宽度和深度稳定的特征学习参数化集合与BP完全相同。此外,在这些参数化中的任何一种下,当模型宽度远大于深度时,具有平衡活动的PC能量收敛到二次BP损失,导致PC计算与BP相同的梯度。实验表明,只要达到活动平衡,非线性模型(包括卷积网络和transformer)也收敛到BP。总体而言,这项工作限制了与PC可扩展的参数化类型,同时展示了如何通过仅局部更新在比深度宽得多的网络(如大脑)中有效实现BP。

英文摘要

Predictive coding (PC) is a biologically plausible alternative to standard backpropagation (BP) that minimises an energy function with respect to network activities before updating weights. Recent work has improved the training stability of deep PC networks (PCNs) by leveraging some BP-inspired reparameterisations. However, the full scalability and theoretical basis of these methods remain unclear. To address this gap, we study the infinite width and depth limits of PCNs. For linear residual networks, we show that the set of width- and depth-stable feature-learning parameterisations for PC is exactly the same as for BP. Moreover, under any of these parameterisations, the PC energy with equilibrated activities converges to the quadratic BP loss when the model width is much larger than the depth, resulting in PC computing the same gradients as BP. Experiments show that, as long as an activity equilibrium is reached, convergence to BP holds for nonlinear models including convolutional networks and transformers. Overall, this work constrains the types of parameterisation that are scalable with PC, while showing a way in which BP can be effectively implemented with only local updates in networks that are much wider than they are deep, such as the brain.
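For readers unfamiliar with the setup, the claim can be written out using the form of the PC energy that is common in the predictive coding literature. The notation below (layer activities z, layer maps f, width N, depth L) is assumed here for illustration and is not taken from the paper.

```latex
% Assumed notation: z_0 = x (input) and z_L = y (target) are clamped,
% f_\ell is the map of layer \ell with weights \theta_\ell.
\begin{align}
  \mathcal{F}(z, \theta)
    &= \frac{1}{2} \sum_{\ell=1}^{L}
       \bigl\lVert z_\ell - f_\ell(z_{\ell-1}; \theta_\ell) \bigr\rVert^2,
  \\
  z^{*} &= \arg\min_{z_1, \dots, z_{L-1}} \mathcal{F}(z, \theta)
  \qquad \text{(activity equilibrium)},
  \\
  \mathcal{L}(\theta)
    &= \frac{1}{2} \bigl\lVert y - f_\theta(x) \bigr\rVert^2
  \qquad \text{(quadratic BP loss)}.
\end{align}
% The abstract's claim, in this notation: under a stable parameterisation,
% \mathcal{F}(z^{*}, \theta) \to \mathcal{L}(\theta) when N \gg L,
% so the weight gradients of the two objectives coincide in that limit.
```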

2602.07399 2026-05-25 cs.AI cs.CV

VGAS: Value-Guided Action-Chunk Selection for Few-Shot Vision-Language-Action Adaptation

VGAS: 价值引导的动作块选择用于少样本视觉-语言-动作适应

Changhua Xu, En Yu, Junyu Xuan, Jie Lu

发表机构 * Australian Artificial Intelligence Institute (AAII)(澳大利亚人工智能研究所)

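AI summary Vision-Language-Action (VLA) models combine multimodal reasoning with physical control, but adapting them to a task from only a few demonstrations remains unreliable. This paper proposes VGAS, a framework that takes a generation-selection view and selects action chunks that are both semantically faithful and geometrically precise, addressing execution divergence caused by geometric ambiguity. It uses a fine-tuned VLA model as a high-recall proposal generator and introduces Q-Chunk-Former, a geometrically grounded Transformer critic, together with Explicit Geometric Regularization (EGR), improving success rates and robustness under limited demonstrations and distribution shift.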

Comments Preprint

详情
AI中文摘要

视觉-语言-动作(VLA)模型桥接了多模态推理与物理控制,但将其适应于新任务且仅有少量演示时仍不可靠。虽然微调后的VLA策略通常能产生语义上合理的轨迹,但失败往往源于未解决的几何歧义,其中接近正确的动作在有限监督下会导致不同的执行结果。我们从生成-选择的角度研究少样本VLA适应,并提出一个新颖的框架VGAS(价值引导的动作块选择)。它在推理时执行最佳N选1,以识别既语义忠实又几何精确的动作块。具体来说,VGAS使用微调的VLA作为高召回率提议生成器,并引入Q-Chunk-Former,一个基于几何的Transformer评论家,以解决细粒度的几何歧义。此外,我们提出了显式几何正则化(EGR),它塑造了一个判别性的价值景观,以保持接近正确候选之间的动作排序分辨率,同时减轻在稀缺监督下的价值不稳定性。实验和理论分析表明,VGAS在有限演示和分布偏移下持续提高了成功率和鲁棒性。我们的代码可在https://github.com/Jyugo-15/VGAS获取。

英文摘要

Vision-Language-Action (VLA) models bridge multimodal reasoning with physical control, but adapting them to new tasks with scarce demonstrations remains unreliable. While fine-tuned VLA policies often produce semantically plausible trajectories, failures frequently arise from unresolved geometric ambiguities, where near-miss actions lead to divergent execution outcomes under limited supervision. We study few-shot VLA adaptation from a generation-selection perspective and propose a novel framework, VGAS (Value-Guided Action-chunk Selection). It performs inference-time best-of-N selection to identify action chunks that are both semantically faithful and geometrically precise. Specifically, VGAS employs a fine-tuned VLA as a high-recall proposal generator and introduces the Q-Chunk-Former, a geometrically grounded Transformer critic, to resolve fine-grained geometric ambiguities. In addition, we propose Explicit Geometric Regularization (EGR), which shapes a discriminative value landscape to preserve action ranking resolution among near-miss candidates while mitigating value instability under scarce supervision. Experiments and theoretical analysis demonstrate that VGAS consistently improves success rates and robustness under limited demonstrations and distribution shifts. Our code is available at https://github.com/Jyugo-15/VGAS.
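Inference-time best-of-N selection has a simple generic shape: draw N candidate action chunks from a proposal model and keep the one the critic scores highest. The sketch below shows only that shape. The `propose` and `critic` callables are hypothetical stand-ins for the fine-tuned VLA and the Q-Chunk-Former, whose interfaces the abstract does not describe.

```python
def best_of_n(observation, propose, critic, n=8):
    """Sample n candidate action chunks and return the highest-valued one.

    propose(observation, i) -> action chunk (the i-th sample)
    critic(observation, chunk) -> scalar value estimate
    """
    candidates = [propose(observation, i) for i in range(n)]
    return max(candidates, key=lambda chunk: critic(observation, chunk))


# Toy stand-ins: the "observation" is a target sum and the critic prefers
# chunks whose entries add up to it.
def toy_propose(observation, i):
    return [i, i]


def toy_critic(observation, chunk):
    return -abs(sum(chunk) - observation)


print(best_of_n(6, toy_propose, toy_critic, n=8))
```

The quality of the result depends on both parts: the proposer must include a good chunk among its N samples (high recall), and the critic must rank near-miss candidates correctly.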

2602.07235 2026-05-25 cs.LG cs.AI cs.IT math.IT

ArcMark: Distortion-Free Multi-Byte LLM Watermark via Optimal Transport

ArcMark: 通过最优传输实现无失真的多字节大语言模型水印

Atefeh Gilani, Sajani Vithana, Carol Xuan Long, Oliver Kosut, Lalitha Sankar, Flavio P. Calmon

发表机构 * Arizona State University(亚利桑那州立大学) Harvard University(哈佛大学)

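AI summary ArcMark is a distortion-free multi-byte watermark for large language models based on optimal transport. It embeds multiple bytes of information into a small amount of generated text without changing the quality of that text. The method formulates distortion-free watermarking as a channel coding problem and derives an information-theoretic channel capacity, which sets the theoretical limit on embedding information without distortion and guides the design of ArcMark. Experiments show that ArcMark outperforms existing methods in message reconstruction accuracy and robustness to attacks, and that its output does not differ noticeably from unwatermarked text in perplexity or downstream task performance.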

详情
AI中文摘要

水印是促进大语言模型(LLM)负责任使用的重要工具。现有水印在生成的token中插入信号,要么标记LLM生成的文本(零比特水印),要么编码更复杂的消息(多比特水印)。尽管最近许多方法在不扰动平均下一token预测的情况下向文本中插入多个比特,但它们很大程度上扩展了零比特设置的设计原则,例如每个token编码单个比特。相比之下,能够将多个字节嵌入文本的水印将极大地增加潜在应用,例如嵌入提交提示的用户ID、使用的精确模型版本,甚至提示本身。我们通过引入ArcMark来解决这个问题:一种基于编码和信息论原理的新型水印构造,能够可靠地将多字节信息嵌入仅几百个token中,而不会对底层LLM的下一token分布造成任何失真。我们通过将无失真水印问题建模为信道编码问题,并推导出信息论信道容量,该容量建立了在LLM输出中无失真嵌入信息的基本极限,从而推导出ArcMark。该容量公式指导了ArcMark的设计。在实践中,ArcMark在重建精度上优于竞争的多比特无失真水印,包括在面对改变部分LLM文本的攻击时。ArcMark输出在困惑度和下游任务质量方面也显示出与未加水印文本无法区分。

英文摘要

Watermarking is an important tool for promoting the responsible use of large language models (LLMs). Existing watermarks insert a signal into generated tokens that either flags LLM-generated text (zero-bit watermarking) or encodes more complex messages (multi-bit watermarking). Though a number of recent approaches insert multiple bits into text without perturbing average next-token predictions, they largely extend design principles from the zero-bit setting, such as encoding a single bit per token. In contrast, a watermarker capable of embedding multiple bytes into the text would dramatically increase the potential applications, by embedding information such as the ID of the user who submitted the prompt, the precise model version that was used, or even the prompt itself. We address this problem by introducing ArcMark: a new watermark construction based on coding and information-theoretic principles that is capable of reliably embedding multiple bytes of information into just a few hundred tokens, without any distortion of the underlying LLM next-token distribution. We derive ArcMark by formulating the distortion-free watermarking problem as a channel coding problem, and deriving an information-theoretic channel capacity that establishes the fundamental limit of embedding information in LLM output in a distortion-free manner. This capacity formulation informs the design of ArcMark. In practice, ArcMark outperforms competing multi-bit distortion-free watermarks in terms of reconstruction accuracy, including in the face of attacks that alter a subset of the LLM text. ArcMark output is also shown to be indistinguishable from unwatermarked text in terms of perplexity, and in downstream task quality.

2602.05472 2026-05-25 cs.AI

ALIVE: Awakening LLM Reasoning via Adversarial Learning and Instructive Verbal Evaluation

ALIVE: 通过对抗学习和指导性语言评估唤醒LLM推理

Yiwen Duan, Jing Ye, Xinpei Zhao

发表机构 * Independent Researcher(独立研究者)

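AI summary Large language models (LLMs) face a "reward bottleneck" on the way to expert-level reasoning: the scalar rewards that traditional reinforcement learning relies on are hard to scale, unstable across domains, and blind to the logic of a solution. This work proposes ALIVE, a framework that couples adversarial learning with instructive verbal evaluation so that a model can learn evaluative criteria for reasoning directly from raw corpora, without relying on external reward signals. Experiments show that ALIVE improves accuracy, cross-domain generalization, and self-correction on mathematical reasoning, code generation, and logical inference tasks, offering a scalable approach to general-purpose reasoning alignment without human supervision.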

详情
AI中文摘要

大型语言模型(LLM)追求专家级推理的努力一直受到持续的“奖励瓶颈”的阻碍:传统的强化学习(RL)依赖于标量奖励,这些奖励扩展成本高昂、跨领域表现脆弱,并且对解决方案的底层逻辑视而不见。这种对外部、贫乏信号的依赖阻止了模型发展对推理原则的深层、自包含理解。我们引入ALIVE(对抗学习与指导性语言评估),一种免人工干预的对齐框架,超越了标量奖励优化,转向内在推理习得。基于“认知协同”原则,ALIVE将问题提出、解决和判断统一在单个策略模型中,以内化正确性的逻辑。通过将对抗学习与指导性语言反馈相结合,ALIVE使模型能够直接从原始语料库内化评估标准,有效将外部批评转化为内生推理能力。在数学推理、代码生成和一般逻辑推理基准上的实证评估表明,ALIVE持续缓解了奖励信号的局限性。在相同数据和计算量下,它实现了准确率提升、显著改善的跨域泛化以及更高的自我修正率。这些结果表明,推理三位一体促进了能力增长的自我维持轨迹,将ALIVE定位为无需人工循环监督的通用推理对齐的可扩展基础。

英文摘要

The quest for expert-level reasoning in Large Language Models (LLMs) has been hampered by a persistent reward bottleneck: traditional reinforcement learning (RL) relies on scalar rewards that are costly to scale, brittle across domains, and blind to the underlying logic of a solution. This reliance on external, impoverished signals prevents models from developing a deep, self-contained understanding of reasoning principles. We introduce ALIVE (Adversarial Learning with Instructive Verbal Evaluation), a hands-free alignment framework that moves beyond scalar reward optimization toward intrinsic reasoning acquisition. Grounded in the principle of Cognitive Synergy, ALIVE unifies problem posing, solving, and judging within a single policy model to internalize the logic of correctness. By coupling adversarial learning with instructive verbal feedback, ALIVE enables models to internalize evaluative criteria directly from raw corpora, effectively transforming external critiques into an endogenous reasoning faculty. Empirical evaluations across mathematical reasoning, code generation, and general logical inference benchmarks demonstrate that ALIVE consistently mitigates reward signal limitations. With identical data and compute, it achieves accuracy gains, markedly improved cross-domain generalization, and higher self-correction rates. These results indicate that the reasoning trinity fosters a self-sustaining trajectory of capability growth, positioning ALIVE as a scalable foundation for general-purpose reasoning alignment without human-in-the-loop supervision.

2602.05202 2026-05-25 cs.CV

GT-SVJ: Generative-Transformer-Based Self-Supervised Video Judge For Efficient Video Reward Modeling

GT-SVJ: 基于生成式Transformer的自监督视频评判器用于高效视频奖励建模

Shivanshu Shekhar, Uttaran Bhattacharya, Raghavendra Addanki, Mehrab Tanjim, Somdeb Sarkhel, Tong Zhang

Affiliations: University of Illinois Urbana-Champaign; Adobe Inc.

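AI summary: This work proposes GT-SVJ, a generative-transformer-based self-supervised video judge that models video reward more efficiently in order to align video generation models with human preferences. Unlike methods that rely on vision-language models, GT-SVJ recasts a state-of-the-art video generation model as an energy-based model, which lets it capture subtle temporal dynamics and judge video quality precisely. Synthetic negative samples with controlled degradations steer the model toward meaningful spatiotemporal features, and experiments show that, using only 30K human annotations, it outperforms existing methods on multiple benchmarks.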

Details
AI Chinese abstract

将视频生成模型与人类偏好对齐仍然具有挑战性:当前方法依赖视觉语言模型(VLM)进行奖励建模,但这些模型难以捕捉细微的时间动态。我们提出了一种根本不同的方法:将天生设计用于建模时间结构的视频生成模型重新用作奖励模型。我们提出了基于生成式Transformer的自监督视频评判器(GT-SVJ),这是一种新颖的评估模型,将最先进的视频生成模型转化为强大的时间感知奖励模型。我们的关键洞察是,生成模型可以重新表述为基于能量的模型(EBM),该模型为高质量视频分配低能量,为退化视频分配高能量,从而在通过对比目标训练时能够以惊人的精度区分视频质量。为了防止模型利用真实视频与生成视频之间的表面差异,我们通过受控的潜在空间扰动设计了具有挑战性的合成负样本:时间切片、特征交换和帧洗牌,这些模拟了真实但细微的视觉退化。这迫使模型学习有意义的时空特征,而不是琐碎的伪影。GT-SVJ在GenAI-Bench和MonteBench上仅使用30K人工标注就达到了最先进的性能:比现有的基于VLM的方法少6倍到65倍。

English abstract

Aligning video generative models with human preferences remains challenging: current approaches rely on Vision-Language Models (VLMs) for reward modeling, but these models struggle to capture subtle temporal dynamics. We propose a fundamentally different approach: repurposing video generative models, which are inherently designed to model temporal structure, as reward models. We present the Generative-Transformer-based Self-Supervised Video Judge (GT-SVJ), a novel evaluation model that transforms state-of-the-art video generation models into powerful temporally-aware reward models. Our key insight is that generative models can be reformulated as energy-based models (EBMs) that assign low energy to high-quality videos and high energy to degraded ones, enabling them to discriminate video quality with remarkable precision when trained via contrastive objectives. To prevent the model from exploiting superficial differences between real and generated videos, we design challenging synthetic negative videos through controlled latent-space perturbations: temporal slicing, feature swapping, and frame shuffling, which simulate realistic but subtle visual degradations. This forces the model to learn meaningful spatiotemporal features rather than trivial artifacts. GT-SVJ achieves state-of-the-art performance on GenAI-Bench and MonteBench using only 30K human annotations: 6× to 65× fewer than existing VLM-based approaches.
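
Illustrative sketch (not the authors' code): the abstract names three latent-space perturbations and a contrastive objective over energies but does not define them, so everything below is an assumption made for illustration. The function names, the (T, D) latent layout, the way each perturbation is implemented, the softplus form of the loss, and `toy_energy` (a hypothetical stand-in for the energy a video generative model would supply) are all invented here.

```python
import numpy as np

def frame_shuffle(latents, rng):
    """Return the (T, D) latent sequence with its frames in a random order."""
    return latents[rng.permutation(latents.shape[0])]

def temporal_slice(latents, start, length):
    """Loop a short window of frames so the clip keeps its length T."""
    num_frames = latents.shape[0]
    window = latents[start:start + length]
    repeats = -(-num_frames // window.shape[0])  # ceiling division
    return np.tile(window, (repeats, 1))[:num_frames]

def feature_swap(latents, other, dims):
    """Copy the chosen feature dimensions from another clip's latents."""
    swapped = latents.copy()
    swapped[:, dims] = other[:, dims]
    return swapped

def toy_energy(latents):
    """Hypothetical stand-in energy: mean squared frame-to-frame change."""
    return float(np.mean(np.diff(latents, axis=0) ** 2))

def contrastive_energy_loss(e_pos, e_neg):
    """Softplus of (E_pos - E_neg), averaged over pairs (assumed loss form)."""
    diff = np.asarray(e_pos, dtype=float) - np.asarray(e_neg, dtype=float)
    return float(np.mean(np.logaddexp(0.0, diff)))

rng = np.random.default_rng(0)
# Smooth toy clip: 8 frames, 4 latent features, each frame one step from the last.
clip = np.arange(8, dtype=float)[:, None] * np.ones((1, 4))
negative = frame_shuffle(clip, rng)
loss = contrastive_energy_loss([toy_energy(clip)], [toy_energy(negative)])
print(loss)
```

The loss shrinks as the clean clip's energy drops below the perturbed clip's energy, which is the ordering the abstract describes for high-quality versus degraded videos.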

2602.05126 2026-05-25 cs.CV

CLEAR-HPV: Interpretable concept discovery for human-papillomavirus-associated morphology in whole-slide histology

CLEAR-HPV: 全切片组织学中人乳头瘤病毒相关形态的可解释概念发现

Weiyi Qin, Yingci Liu-Swetz, Shiwei Tan, Hao Wang

Affiliations: Department of Computer Science, Rutgers University; Rutgers Health; Rutgers University

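AI summary: CLEAR-HPV is an interpretability framework for analyzing HPV-associated morphology in pathology slides of cervical and head and neck cancers, addressing the limited morphologic interpretability of attention-based multiple instance learning (MIL) models. By restructuring the MIL latent space, it automatically discovers key morphologic concepts such as keratinizing, basaloid, and stromal without concept labels, and produces the corresponding spatial concept maps and compact concept-fraction vectors, yielding highly interpretable representations while preserving predictive performance.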

Details
AI Chinese abstract

人乳头瘤病毒(HPV)状态是头颈癌和宫颈癌预后及治疗反应的关键决定因素。尽管基于注意力的多实例学习(MIL)在HPV相关的全切片组织病理学中实现了强切片级预测,但其形态学可解释性有限。为解决这一局限,我们引入了CLEAR-HPV(Concept-Level Explainable Attention-guided Representation for HPV),该框架利用注意力重构MIL潜在空间,从而在训练过程中无需概念标签即可实现概念发现。在注意力加权的潜在空间中运行,CLEAR-HPV自动发现角化、基底样和间质形态概念,生成空间概念图,并使用紧凑的概念分数向量表示每个切片。CLEAR-HPV的概念分数向量保留了原始MIL嵌入的预测信息,同时将高维特征空间(例如1536维)减少到仅10个可解释概念。CLEAR-HPV在TCGA-HNSCC、TCGA-CESC和CPTAC-HNSCC上一致地泛化,通过一个通用的、与骨干网络无关的框架,为基于注意力的全切片组织病理学MIL模型提供紧凑的概念级可解释性。

English abstract

Human papillomavirus (HPV) status is a critical determinant of prognosis and treatment response in head and neck and cervical cancers. Although attention-based multiple instance learning (MIL) achieves strong slide-level prediction for HPV-related whole-slide histopathology, it provides limited morphologic interpretability. To address this limitation, we introduce Concept-Level Explainable Attention-guided Representation for HPV (CLEAR-HPV), a framework that restructures the MIL latent space using attention to enable concept discovery without requiring concept labels during training. Operating in an attention-weighted latent space, CLEAR-HPV automatically discovers keratinizing, basaloid, and stromal morphologic concepts, generates spatial concept maps, and represents each slide using a compact concept-fraction vector. CLEAR-HPV's concept-fraction vectors preserve the predictive information of the original MIL embeddings while reducing the high-dimensional feature space (e.g., 1536 dimensions) to only 10 interpretable concepts. CLEAR-HPV generalizes consistently across TCGA-HNSCC, TCGA-CESC, and CPTAC-HNSCC, providing compact, concept-level interpretability through a general, backbone-agnostic framework for attention-based MIL models of whole-slide histopathology.
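
Illustrative sketch (not the authors' code): the abstract says each slide is represented by a compact concept-fraction vector over 10 concepts but does not say how it is computed. The sketch below assumes each patch has already been assigned to one concept and takes the vector to be the (optionally attention-weighted) share of patches per concept. The function name, its parameters, and the weighting scheme are hypothetical.

```python
import numpy as np

def concept_fraction_vector(assignments, num_concepts, weights=None):
    """Share of a slide's patches (optionally weighted) falling in each concept.

    assignments: integer concept id per patch, in [0, num_concepts).
    weights: optional non-negative per-patch weights, e.g. attention scores
             (an assumption; the abstract does not specify the weighting).
    """
    assignments = np.asarray(assignments, dtype=int)
    if weights is None:
        weights = np.ones(assignments.shape[0])
    weights = np.asarray(weights, dtype=float)
    totals = np.bincount(assignments, weights=weights, minlength=num_concepts)
    return totals / totals.sum()

# Four patches on one slide, assigned to concepts 0, 0, 1 and 3 out of 4 concepts.
print(concept_fraction_vector([0, 0, 1, 3], num_concepts=4))
```

Whatever the embedding width (the abstract cites 1536 dimensions as an example), the slide-level vector has one entry per concept and sums to 1, so each entry reads directly as a proportion.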

2602.02780 2026-05-25 cs.AI cs.LG

Scaling-Aware Adapter for Structure-Grounded LLM Reasoning

Zihao Jing, Qiuhao Zeng, Ruiyi Fang, Yan Yi Li, Yan Sun, Boyu Wang, Pingzhao Hu

Affiliations: Department of Computer Science, Western University, London, Canada; Department of Biochemistry, Western University, London, Canada

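AI summary: This paper proposes Cuttlefish, a unified multimodal large language model that targets missing geometric information and modality-fusion bottlenecks in structure-grounded reasoning. It introduces two core components, Scaling-Aware Patching and the Geometry Grounding Adapter. The former uses an instruction-conditioned gating mechanism to generate variable-size patches over structural graphs, dynamically adjusting the number of query tokens to match structural complexity; the latter injects geometric information into the language model through cross-attention, reducing structural hallucination. Experiments show that Cuttlefish performs well on multiple interdisciplinary all-atom structure reasoning tasks.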

Comments Accepted by ICML 2026

Details
AI Chinese abstract

大型语言模型(LLM)正在实现对2D和3D结构的推理,但现有方法仍然局限于特定模态,通常通过基于序列的标记化或固定长度的查询连接器来压缩结构输入。这种架构要么忽略了减轻结构幻觉所需的几何基础,要么施加了不灵活的模态融合瓶颈,同时过度压缩和次优分配结构令牌,从而阻碍了通用全原子推理的实现。我们引入了Cuttlefish,一种统一的多模态LLM,它将语言推理建立在几何线索上,同时根据结构复杂性缩放模态令牌。首先,缩放感知补丁(Scaling-Aware Patching)利用指令条件门控机制在结构图上生成可变大小的补丁,根据结构复杂性自适应地缩放查询令牌预算,以缓解固定长度连接器的瓶颈。其次,几何基础适配器(Geometry Grounding Adapter)通过对模态嵌入的交叉注意力来细化这些自适应令牌,并将生成的模态令牌注入LLM,暴露明确的几何线索以减少结构幻觉。跨学科全原子基准的实验表明,Cuttlefish在异构结构基础推理中实现了优越的性能。代码:github.com/zihao-jing/Cuttlefish。

English abstract

Large language models (LLMs) are enabling reasoning over 2D and 3D structures, yet existing methods remain modality-specific and typically compress structural inputs through sequence-based tokenization or fixed-length query connectors. Such architectures either omit the geometric grounding requisite for mitigating structural hallucinations, or impose inflexible modality fusion bottlenecks that concurrently over-compress and suboptimally allocate structural tokens, thereby impeding the realization of generalized all-atom reasoning. We introduce Cuttlefish, a unified multimodal LLM that grounds language reasoning in geometric cues while scaling modality tokens with structural complexity. First, Scaling-Aware Patching leverages an instruction-conditioned gating mechanism to generate variable-size patches over structural graphs, adaptively scaling the query token budget with structural complexity to mitigate fixed-length connector bottlenecks. Second, Geometry Grounding Adapter refines these adaptive tokens via cross-attention to modality embeddings and injects the resulting modality tokens into the LLM, exposing explicit geometric cues to reduce structural hallucination. Experiments across interdisciplinary all-atom benchmarks demonstrate that Cuttlefish achieves superior performance in heterogeneous structure-grounded reasoning. Code: github.com/zihao-jing/Cuttlefish.