arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.22823 2026-05-22 cs.CV

Which Way Did It Move? Diagnosing and Overcoming Directional Motion Blindness in Video-LLMs

Jongseo Lee, Hyuntak Lee, Sunghun Kim, Sooa Kim, Jihoon Chung, Jinwoo Choi

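AI summary: This paper studies a blind spot of Video-LLMs in understanding directional motion and proposes the MoDirect dataset family and the DeltaDirect objective, which improve the models' perception of motion direction and substantially raise direction-recognition accuracy on both synthetic and real-world videos.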

Comments
Preprint. 59 pages, including appendix. Code: https://github.com/KHU-VLL/DeltaDirect
Abstract

Video Large Language Models (Video-LLMs) have made rapid progress on temporal video understanding, yet many fail at a basic perceptual primitive: signed image-plane motion direction. On simple videos of a single object moving left, right, up, or down, most Video-LLMs perform near chance, with above-chance cases largely attributable to prediction biases rather than genuine direction understanding. We call this failure directional motion blindness. We localize the failure by tracing motion direction information through the Video-LLM pipeline. Motion direction remains linearly accessible from the vision encoder, projector, and LLM hidden states, but the readout fails to bind this signal to the correct verbal answer option, revealing a direction binding gap. Although synthetic motion direction instruction tuning reduces this gap on the source domain, motion direction concept vector analysis shows that visual complexity weakens the signal magnitude and limits out-of-domain generalization. We introduce MoDirect, a dataset family for motion direction instruction tuning and evaluation, and DeltaDirect, a diagnosis-driven, projector-level objective that predicts normalized 2-D motion vectors from adjacent-frame feature deltas. On MoDirect-SynBench, instruction tuning with DeltaDirect improves motion direction accuracy from 25.9% to 85.4%. On MoDirect-RealBench, DeltaDirect improves real-world motion direction accuracy by 21.9 points over the vanilla baseline without real-world tuning data, while preserving standard video-understanding performance. Code: https://github.com/KHU-VLL/DeltaDirect
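
The abstract describes DeltaDirect as predicting normalized 2-D motion vectors from adjacent-frame feature deltas. The sketch below is only a rough illustration of that idea, not the authors' implementation: the function names, the mean pooling over time, the linear head `W`, and the cosine-style loss are all assumptions made for the example.

```python
import numpy as np

def frame_deltas(features: np.ndarray) -> np.ndarray:
    """(T, D) per-frame features -> (T-1, D) adjacent-frame differences."""
    return features[1:] - features[:-1]

def predict_direction(features: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Hypothetical linear head: mean-pooled deltas (D,) @ W (D, 2) -> unit 2-D vector."""
    pooled = frame_deltas(features).mean(axis=0)
    v = pooled @ W
    return v / (np.linalg.norm(v) + 1e-8)

def direction_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """1 - cosine similarity between the prediction and the normalized target motion."""
    t = target / (np.linalg.norm(target) + 1e-8)
    return 1.0 - float(pred @ t)
```

With a loss of this shape, a prediction aligned with the true motion scores near 0 and an opposite one scores near 2, so the sign of the direction is penalized directly.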

2605.22821 2026-05-22 cs.CL cs.LG

Tokenisation via Convex Relaxations

Jan Tempus, Philip Whittington, Craig W. Schmidt, Dennis Komm, Tiago Pimentel

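AI summary: This paper proposes ConvexTok, a tokenisation method based on convex relaxations. It formulates tokeniser construction as a linear program solved with convex optimisation tools, improving intrinsic tokenisation metrics and language-model bits-per-byte, and also improving downstream task performance.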

Abstract

Tokenisation is an integral part of the current NLP pipeline. Current tokenisation algorithms such as BPE and Unigram are greedy algorithms: they make locally optimal decisions without considering the resulting vocabulary as a whole. We instead formulate tokeniser construction as a linear program and solve it using convex optimisation tools, yielding a new algorithm we call ConvexTok. We find ConvexTok consistently improves intrinsic tokenisation metrics and the bits-per-byte (BpB) achieved by language models; it also improves downstream task performance, but less consistently. Furthermore, ConvexTok allows the user to certify how far their tokeniser is from optimal, with respect to a certain objective, via a lower bound, and we empirically find it to be within 1% of optimal at common vocabulary sizes.

2605.22820 2026-05-22 cs.LG

Integrable Elasticity via Neural Demand Potentials

Carlos Heredia, Daniel Roncel

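AI summary: This paper proposes ICDN, a demand-first neural model for multiproduct retail demand. The model learns log-demand as a smooth, context-conditioned function of log-prices, so elasticities can be derived exactly. On the Dominick's beer dataset, ICDN improves out-of-sample generalization over a directed log-log benchmark and yields more stable, economically plausible elasticity estimates, especially for weakly identified cross-price effects.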

Comments
44 pages, 7 figures
Abstract

We propose the Integrable Context-Dependent Demand Network (ICDN), a demand-first neural model for multiproduct retail demand. The model learns log-demand as a smooth, context-conditioned function of log-prices, allowing elasticities to be derived exactly from the learned demand surface. On the Dominick's beer dataset, ICDN improves out-of-sample generalization over a directed log-log benchmark and yields more stable, economically plausible elasticity estimates, especially for weakly identified cross-price effects.
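
When log-demand is modelled as a function of log-prices, the price elasticity of product i with respect to price j is the partial derivative of log q_i with respect to log p_j. The sketch below illustrates that relationship on a toy two-product surface. The `log_demand` function and its coefficients are hypothetical stand-ins for a trained model, and central finite differences are used here for simplicity in place of the exact derivation the abstract describes.

```python
import numpy as np

def log_demand(log_p: np.ndarray) -> np.ndarray:
    """Hypothetical two-product log-demand surface (stand-in for a trained model)."""
    A = np.array([[-2.0, 0.3],
                  [0.1, -1.5]])  # assumed own- and cross-price elasticities
    return np.array([1.0, 0.5]) + A @ log_p

def elasticity_matrix(f, log_p: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """E[i, j] = d log q_i / d log p_j, estimated by central finite differences."""
    n = log_p.size
    E = np.zeros((f(log_p).size, n))
    for j in range(n):
        step = np.zeros(n)
        step[j] = eps
        E[:, j] = (f(log_p + step) - f(log_p - step)) / (2 * eps)
    return E
```

The diagonal of the returned matrix holds own-price elasticities and the off-diagonal entries hold cross-price elasticities.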

2605.22819 2026-05-22 cs.CV

Cambrian-P: Pose-Grounded Video Understanding

Jihan Yang, Zifan Zhao, Xichen Pan, Shusheng Yang, Junyi Zhang, Bingyi Kang, Hu Xu, Saining Xie

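AI summary: This work proposes Cambrian-P, a video multimodal LLM augmented with learnable camera tokens and a pose regression head. Using pose as a lightweight supervisory signal, it substantially improves spatial reasoning, generalizes across spatial and general video QA benchmarks, and achieves state-of-the-art streaming pose estimation on ScanNet as a byproduct.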

Comments
Project Page: https://cambrian-mllm.github.io/
Abstract

Camera pose matters. The position and orientation of each viewpoint define a shared spatial coordinate frame that relates observations across video frames. Yet this signal is largely absent from multimodal LLMs (MLLMs) for video understanding, which process frames as isolated 2D snapshots, instead of the persistent scene humans perceive. We revisit pose as a lightweight supervisory signal and introduce Cambrian-P, a video MLLM augmented with per-frame learnable camera tokens and a pose regression head. With a carefully designed sampling scheme, the model achieves substantial gains of 4.5-6.5% on spatial reasoning benchmarks such as VSI-Bench, generalizes across eight additional spatial and general video QA benchmarks, and, as a byproduct, achieves state of the art streaming pose estimation on ScanNet. Surprisingly, training on pseudo-annotated poses from in-the-wild video further improves general video QA benchmarks, showing pose helps beyond spatial reasoning. Together, these results position camera pose as a fundamental signal for video models that reason about the physical world.

2605.22818 2026-05-22 cs.CV

MotiMotion: Motion-Controlled Video Generation with Visual Reasoning

Lee Hsin-Ying, Hanwen Jiang, Yiqun Mei, Jing Shi, Ming-Hsuan Yang, Zhixin Shu

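AI summary: This paper proposes MotiMotion, a framework that recasts motion control as a reasoning-then-generation problem to improve causal and commonsense consistency in video generation. It introduces a training-free vision-language reasoner and a confidence-aware control scheme, and validates the plausibility of the generated object behaviors and interactions on the MotiBench benchmark.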

Comments
ICML 2026. Project page: https://motimotion.github.io/
Abstract

Current motion-controlled image-to-video generation models rigidly follow user-provided trajectories that are often sparse, imprecise, and causally incomplete. Such reliance often yields unnatural or implausible outcomes, especially by missing secondary causal consequences. To address this, we introduce MotiMotion, a novel framework that reformulates motion control as a reasoning-then-generation problem. To encourage causally grounded and commonsense-consistent interactions, we leverage a training-free vision-language reasoner to refine image-space coordinates of primary trajectories and to hallucinate plausible secondary motions. To further improve motion naturalness, we propose a confidence-aware control scheme that modulates guidance strength, enabling the model to closely follow high-confidence plans while correcting artifacts under low-confidence inputs with its internal generative priors. To support systematic evaluation, we curate a new image-to-video benchmark, MotiBench, consisting of interaction-centric scenes where new events are triggered by motion. Both VLM-based evaluation and a human study on MotiBench demonstrate that MotiMotion produces videos with more plausible object behaviors and interaction, and is preferred over existing approaches.

2605.22817 2026-05-22 cs.LG cs.AI cs.CL cs.NE

Vector Policy Optimization: Training for Diversity Improves Test-Time Search

Ryan Bahlous-Boldi, Isha Puri, Idan Shenfeld, Akarsh Kumar, Mehul Damani, Sebastian Risi, Omar Khattab, Zhang-Wei Hong, Pulkit Agrawal

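AI summary: This paper proposes Vector Policy Optimization (VPO), which trains policies to anticipate diverse downstream reward functions and thereby produce diverse solutions, improving test-time search performance.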

Comments
24 pages
Abstract

Language models must now generalize out of the box to novel environments and work inside inference-scaling search procedures, such as AlphaEvolve, that select rollouts with a variety of task-specific reward functions. Unfortunately, the standard paradigm of LLM post-training optimizes a pre-specified scalar reward, often leading current LLMs to produce low-entropy response distributions and thus to struggle at displaying the diversity that inference-time search will require. We propose Vector Policy Optimization (VPO), an RL algorithm that explicitly trains policies to anticipate diverse downstream reward functions and to produce diverse solutions. VPO exploits that rewards are often vector-valued in practice, like per-test-case correctness in code generation or, say, multiple different user personas or reward models. VPO is essentially a drop-in replacement for the GRPO advantage estimator, but it trains the LLM to output a set of solutions where individual solutions specialize to different trade-offs in the vector reward space. Across four tasks, VPO matches or beats the strongest scalar RL baselines on test-time search (e.g. pass@k and best@k), with the gap widening as the search budget grows. For evolutionary search, VPO models unlock problems that GRPO models cannot solve at all. As test-time search becomes more standardized, optimizing for diversity may need to become the default post-training objective.
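
The abstract reports test-time search results with metrics such as pass@k. For readers unfamiliar with the metric, the sketch below shows one widely used way to estimate it from n sampled solutions of which c are correct. Whether the paper uses exactly this estimator is an assumption, and the function name is illustrative.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without replacement
    from n samples of which c are correct, is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Because the metric rewards any single success among k attempts, a policy that spreads its samples across distinct strategies can score higher at large k than one that repeats its single best guess, which is the motivation the abstract gives for training for diversity.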

2605.22816 2026-05-22 cs.RO cs.CV

AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation

Wenxuan Guo, Xiuwei Xu, Yichen Liu, Xiangyu Li, Hang Yin, Huangxing Chen, Wenzhao Zheng, Jianjiang Feng, Jie Zhou, Jiwen Lu

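AI summary: This paper proposes AwareVLN, a framework that performs end-to-end vision-language navigation with a self-aware reasoning mechanism. It addresses earlier methods' lack of explicit understanding of the relationships between the agent, the instruction, and the scene, and outperforms previous methods on multiple datasets.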

Comments
Accepted to CVPR 2026. Project page: https://gwxuan.github.io/AwareVLN/
Abstract

Vision-and-Language Navigation (VLN) requires an agent to ground language instructions to its own movement within a visual environment. While state-of-the-art methods leverage the reasoning capabilities of Vision-Language Models (VLMs) for end-to-end action prediction, they often lack an explicit and explainable understanding of the relationships between the agent, the instruction, and the scene. Conversely, explicitly building a scene map for heuristic planning is intuitively appealing but relies on additional 3D sensors and hinders large-scale vision-language pre-training. To bridge this gap, we propose AwareVLN, a novel framework that equips the navigation model with a self-aware reasoning mechanism, enabling it to understand the agent's state and task progress in a fully end-to-end and data-driven manner. Our approach features two key innovations: (1) a structural reasoning module that fosters spatial and task-oriented self-awareness, and (2) an automatic data engine with progress division for effective training. Extensive experiments on various datasets in Habitat simulator show our AwareVLN significantly outperforms previous state-of-the-art vision-language navigation methods. Project page: https://gwxuan.github.io/AwareVLN/.

2605.22814 2026-05-22 cs.LG

Remember to be Curious: Episodic Context and Persistent Worlds for 3D Exploration

Lily Goli, Justin Kerr, Daniele Reda, Alec Jacobson, Andrea Tagliasacchi, Angjoo Kanazawa

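AI summary: This work proposes a curiosity-driven reinforcement learning approach that adds a persistent world model and episodic context to tackle exploration in sparse-reward, long-horizon tasks in 3D environments. Experiments show that it outperforms RL-based active mapping baselines on HM3D and generalizes to Gibson and AI-generated worlds.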

Abstract

Exploration is a prerequisite for learning useful behaviors in sparse-reward, long-horizon tasks, particularly within 3D environments. Curiosity-driven reinforcement learning addresses this via intrinsic rewards derived from the mismatch between the agent's predictive model of the world and reality. However, translating this intrinsic motivation to complex, photorealistic environments remains difficult, as agents can become trapped in local loops and receive fresh rewards for revisiting forgotten states. In this work, we demonstrate that this failure stems from a lack of spatial persistence and episodic context. We show that effective curiosity requires a model of the world that is persistent and continuously updated, paired with an agent that maintains an episodic trajectory history to navigate toward novel regions. We achieve this using an online 3D reconstruction as a persistent model of the world, while the agent policy is parameterized as a sequence model over RGB observations to maintain episodic context. This design enables effective exploration during training while allowing the agent to navigate using solely RGB frames at deployment. Trained purely via curiosity on HM3D, our agent outperforms RL-based active mapping baselines and generalizes zero-shot to Gibson and AI-generated worlds. Our end-to-end policy enables efficient adaptation to downstream tasks, such as apple picking and image-goal navigation, outperforming from-scratch baselines. Please see video results at https://recuriosity.github.io/.

2605.22812 2026-05-22 cs.RO cs.CV

GesVLA: Gesture-Aware Vision-Language-Action Model Embedded Representations

Wenxuan Guo, Ziyuan Li, Meng Zhang, Yichen Liu, Yimeng Dong, Chuxi Xu, Yunfei Wei, Ze Chen, Erjin Zhou, Jianjiang Feng

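AI summary: This paper proposes GesVLA, which introduces gesture as a parallel instruction modality to resolve the spatial ambiguity that existing VLA systems face in complex scenes. It uses a dual-VLM architecture to tightly couple gesture representations with action policies, and improves target grounding accuracy and human-robot interaction efficiency through a gesture data generation pipeline and a two-stage training strategy.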

Comments
Project page: https://gwxuan.github.io/GesVLA/
Abstract

Vision-Language-Action (VLA) models have shown strong potential for general-purpose robot manipulation by unifying perception and action. However, existing VLA systems primarily rely on textual instructions and struggle to resolve spatial ambiguity in complex scenes with multiple similar objects. To address this limitation, we introduce gesture as a parallel instruction modality and propose a Gesture-aware Vision-Language-Action model (GesVLA). Our approach encodes gesture features directly into the latent space, enabling them to participate in both high-level reasoning and low-level action generation, and adopts a dual-VLM architecture to achieve tight coupling between gesture representations and action policies. At the data level, we construct a scalable gesture data generation pipeline by rendering hand models onto real-world scene images. This reduces the sim-to-real visual gap while producing rich data with diverse motion patterns and corresponding pointing annotations. In addition, we employ a two-stage training strategy to equip the model with both gesture perception and action prediction capabilities. We evaluate our approach on multiple real-world robotic tasks, including a controlled block manipulation task for validation and more practical scenarios such as product and produce selection. Experimental results show that incorporating gesture consistently improves target grounding accuracy and human-robot interaction efficiency, especially in complex and cluttered environments. Project page: https://gwxuan.github.io/GesVLA/.

2605.22791 2026-05-22 cs.AI

Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

Ali Hatamizadeh, Yejin Choi, Jan Kautz

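AI summary: This paper proposes Gated DeltaNet-2, which decouples the control of erasing and writing in linear attention by introducing a channel-wise erase gate and a channel-wise write gate. It achieves the strongest overall results across language modeling, commonsense reasoning, and retrieval, with the most pronounced advantage on long-context retrieval.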

Comments
Gated DeltaNet-2 technical report; code at https://github.com/NVlabs/GatedDeltaNet-2
Abstract

Linear attention replaces the unbounded cache of softmax attention with a fixed-size recurrent state, reducing sequence mixing to linear time and decoding to constant memory. The hard part is not just what to forget, but how to edit this compressed memory without scrambling existing associations. Delta-rule models subtract the current read before writing a new value, and Kimi Delta Attention (KDA) sharpens forgetting with channel-wise decay. But the active edit still uses a single scalar gate to control two different things: how much old content to erase on the key side and how much new content to commit on the value side. We introduce Gated DeltaNet-2, which generalizes both Gated DeltaNet and KDA by inheriting adaptive forgetting and channel-wise decay while addressing their shared limitation, the scalar tie between erasing and writing. Gated Delta Rule-2 separates these roles with a channel-wise erase gate b_t and a channel-wise write gate w_t, reducing to KDA when both gates collapse to the same scalar and to Gated DeltaNet when the decay also collapses. We derive a fast-weight update view, a chunkwise WY algorithm with channel-wise decay absorbed into asymmetric erase factors, and a gate-aware backward pass that preserves efficient parallel training. At 1.3B parameters trained on 100B FineWeb-Edu tokens, Gated DeltaNet-2 achieves the strongest overall results among Mamba-2, Gated DeltaNet, KDA, and Mamba-3 variants across language modeling, commonsense reasoning, and retrieval. Its advantage is most pronounced on long-context RULER needle-in-a-haystack benchmarks, where it improves the evaluated multi-key retrieval setting and remains strong in both recurrent and hybrid settings. Code is available at https://github.com/NVlabs/GatedDeltaNet-2.
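
The abstract does not give the Gated Delta Rule-2 recurrence, so the following is only an illustrative guess at what decoupling the erase and write gates could look like. It assumes a state of shape (d_v, d_k), a channel-wise decay and erase gate over the key dimension, and a write gate over the value dimension. These shapes, the function names, and the update itself are assumptions, not the paper's equations. The one property the sketch is built to show is the collapse the abstract mentions: when both gates equal the same scalar, the update matches a scalar-gated delta rule.

```python
import numpy as np

def decoupled_delta_step(S, k, v, decay, erase, write):
    """One illustrative state update (assumed form, not the paper's).
    S: (d_v, d_k) memory; k: (d_k,) key; v: (d_v,) value;
    decay, erase: (d_k,) key-side gates; write: (d_v,) value-side gate."""
    S = S * decay                 # channel-wise decay along the key dimension
    old = S @ (erase * k)         # content read out under the erase-gated key
    return S - np.outer(old, k) + np.outer(write * v, k)

def scalar_delta_step(S, k, v, alpha, beta):
    """Scalar-gated delta rule: alpha * S (I - beta k k^T) + beta v k^T."""
    d_k = k.size
    return alpha * S @ (np.eye(d_k) - beta * np.outer(k, k)) + beta * np.outer(v, k)
```

Setting `erase` to zeros and `write` to ones in this sketch gives a pure additive write on top of the decayed state, which is the kind of combination a single shared scalar gate cannot express.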

2605.22786 2026-05-22 cs.AI cs.ET cs.LG cs.MA

LCGuard: Latent Communication Guard for Safe KV Sharing in Multi-Agent Systems

Sadia Asif, Mohammad Mohammadi Amiri, Momin Abbas, Prasanna Sattigeri, Karthikeyan Natesan Ramamurthy

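AI summary: This paper proposes LCGuard, a framework that learns representation-level transformations applied before KV caches are shared between agents, to prevent leakage of sensitive information. Evaluations across multiple model families and multi-agent benchmarks show that it reduces reconstruction attack success rates while maintaining task performance.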

Abstract

Large language model (LLM)-based multi-agent systems increasingly rely on intermediate communication to coordinate complex tasks. While most existing systems communicate through natural language, recent work shows that latent communication, particularly through transformer key-value (KV) caches, can improve efficiency and preserve richer task-relevant information. However, KV caches also encode contextual inputs, intermediate reasoning states, and agent-specific information, creating an opaque channel through which sensitive content may propagate across agents without explicit textual disclosure. To address this, we introduce LCGuard (Latent Communication Guard), a framework for safe KV-based latent communication in multi-agent LLM systems. LCGuard treats shared KV caches as latent working memory and learns representation-level transformations before cache artifacts are transmitted across agents. We formalize representation-level sensitive information leakage operationally through reconstruction: a shared cache artifact is unsafe if an adversarial decoder can recover agent-specific sensitive inputs from it. This leads to an adversarial training formulation in which the adversary learns to reconstruct sensitive inputs, while LCGuard learns transformations that preserve task-relevant semantics and reduce reconstructable information. Empirical evaluations across multiple model families and multi-agent benchmarks show that LCGuard consistently reduces reconstruction-based leakage and attack success rates while maintaining competitive task performance compared to standard KV-sharing baselines.

2605.22785 2026-05-22 cs.CL

Evaluating Commercial AI Chatbots as News Intermediaries

Mirac Suzgun, Emily Shen, Federico Bianchi, Alexander Spangher, Thomas Icard, Daniel E. Ho, Dan Jurafsky, James Zou

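AI summary: This study evaluates how accurately AI chatbots handle emerging facts across languages and regions. It finds that they perform well on multiple-choice questions but show substantial errors on free-response and more complex questions, revealing regional inequity and dependence on retrieval infrastructure.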

Comments
https://suzgunmirac.github.io/ai-news-preview/
Abstract

AI chatbots are rapidly shaping how people encounter the news, yet no prior study has systematically measured how accurately these systems, with their proprietary search integrations and retrieval-synthesis pipelines, handle emerging facts across languages and regions. We present a 14-day (February 9-22, 2026) evaluation of six AI chatbots (Gemini 3 Flash and Pro, Grok 4, Claude 4.5 Sonnet, GPT-5 and GPT-4o mini) on 2,100 factual questions derived from same-day BBC News reporting across six regional services (US & Canada, Arabic, Afrique, Hindi, Russian, Turkish). The best systems achieve over 90% multiple-choice accuracy on questions about events reported hours earlier. The same systems, however, lose 11-13% under free-response evaluation, and 16-17% across the cohort. We further characterize three failure patterns. First, every model achieves its lowest accuracy on Hindi (79% vs. 89-91% elsewhere), and citations indicate an Anglophone retrieval bias (e.g., models answering Hindi queries cite English Wikipedia more than any Hindi outlet). Second, retrieval failures, not reasoning failures, drive over 70% of all errors. When models retrieve a correct source, they often extract the correct answer; the problem is to land on the right source in the first place. Third, models achieving 88-96% accuracy on well-formed questions drop to 19-70% when questions contain subtle false premises, with the most vulnerable model accepting fabricated facts 64% of the time. We also identify a detection-accuracy paradox: the best false-premise detector ranks second in adversarial accuracy (abstention rate), while a weaker detector ranks first, showing that premise detection and answer recovery are partially independent capabilities. Overall, these results suggest that high accuracy can mask systematic regional inequity, near-total dependence on retrieval infrastructure, and vulnerability to the imperfect queries that real users pose.

2605.22779 2026-05-22 cs.SE cs.LG

FAME: Failure-Aware Mixture-of-Experts for Message-Level Log Anomaly Detection

Huanchi Wang, Zihang Huang, Yifang Tian, Kristina Dzeparoska, Hans-Arno Jacobsen, Alberto Leon-Garcia

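AI summary: This paper proposes FAME, a failure-aware mixture-of-experts model for message-level log anomaly detection. It trains a lightweight router and domain experts from a small amount of labeled data, enabling efficient anomaly detection, and achieves high precision and recall on the BGL and Thunderbird datasets.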

Comments
12 pages, 5 figures
Abstract

Production systems generate millions of log lines daily, yet most anomaly detectors operate at the session or window level, flagging groups of lines rather than identifying the specific message responsible. This coarse granularity forces operators to inspect many routine lines per alert. Message-level detection offers finer granularity, but remains challenging. A single event template may correspond to both normal and anomalous messages, failures arise from heterogeneous subsystems, and line-level labeling at scale is impractical. Although large language models (LLMs) can reason over log semantics, applying them to every line is too costly for continuous monitoring. We present FAME (Failure-Aware Mixture-of-Experts), a label-efficient message-level mixture-of-experts framework that uses an LLM only once offline. We annotate at most K labeled lines per template to derive binary normal/anomaly indicators and representative examples. The LLM proposes a partition of templates into failure domains, and a certification step validates the proposal before training. FAME trains a lightweight router and domain experts that run on-premise and output anomaly predictions and failure-domain labels. On BGL, FAME achieves F1 = 98.16 at K = 100, reducing annotation effort by 76x, and detects 86.3% of anomalies from unseen EventIDs. On Thunderbird, FAME reaches F1 = 99.95 with perfect recall.

2605.22777 2026-05-22 cs.CV

DecQ: Detail-Condensing Queries for Enhanced Reconstruction and Generation in Representation Autoencoders

Tianhang Wang, Yitong Chen, Wei Song, Zuxuan Wu, Min Li, Jiaqi Wang

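AI summary: This paper proposes DecQ, a framework that introduces lightweight detail-condensing queries to mitigate the trade-off between reconstruction and generation in representation autoencoders, improving both reconstruction quality and generative performance.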

Abstract

Representation Autoencoders (RAEs) leverage frozen vision foundation models (VFMs) as tokenizer encoders, providing robust high-level representations that facilitate fast convergence and high-quality generation in latent diffusion models. However, freezing the VFM inherently constrains its spatial reconstruction capacity, limiting fine-grained generation and image editing; in contrast, incorporating reconstruction-oriented signals via fine-tuning disrupts the pretrained semantic space and degrades generative fidelity. To address this trade-off, we propose DecQ, a simple yet effective framework for RAEs. Specifically, DecQ introduces lightweight detail-condensing queries that extract fine-grained information from intermediate VFM features through condenser modules. These queries are incorporated into the decoder to support reconstruction and are jointly generated with patch tokens during generative modeling. By aggregating information from both shallow and deep layers, DecQ effectively mitigates the reconstruction-generation trade-off, improving both reconstruction quality and generative performance. Our experiments demonstrate that: (1) with only 8 additional queries and 3.9% extra computation, DecQ improves reconstruction over the frozen DINOv2-based RAE, increasing PSNR from 19.13 dB to 22.76 dB; and (2) for generative modeling, DecQ achieves 3.3× faster convergence than RAE, attaining an FID of 1.41 without guidance and 1.05 with guidance.

2605.22776 2026-05-22 cs.LG cs.AI stat.CO stat.ML

SDPM: Survival Diffusion Probabilistic Model for Continuous-Time Survival Analysis

Stanislav R. Kirpichenko, Andrei V. Konstantinov, Lev V. Utkin

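AI summary: This paper proposes SDPM, a generative model for continuous-time survival analysis. It models the conditional distribution of survival outcomes with a denoising diffusion model, avoiding parametric assumptions on the event-time distribution, and operates in a transformed target space using standardized log-times and a continuous Gaussian-mixture representation of the censoring indicator. It achieves competitive predictive performance on multiple real survival datasets.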

Abstract

Survival analysis aims to estimate a time-to-event distribution from data with censored observations. Many existing methods either impose structural assumptions on the hazard function or discretize the time axis, which may limit flexibility and introduce approximation errors. We propose the Survival Diffusion Probabilistic Model (SDPM), a generative approach to continuous-time survival analysis. SDPM models the conditional distribution of the survival outcome, represented by the pair of observed time and censoring indicator, $\mathbb{P}(T, \delta \mid \mathbf{x})$, using a denoising diffusion model. Under the assumption of conditionally independent censoring, conditional samples generated by the model can be transformed into survival function estimates using the Kaplan-Meier estimator. This formulation avoids parametric assumptions on the event-time distribution and does not require a discretization of the output time space. The model operates in a transformed target space, using standardized log-times and a continuous Gaussian-mixture representation of the censoring indicator. We evaluate SDPM on ten real survival datasets and compare it with five strong baselines, including tree-based, boosting-based, and neural survival models. Results show that SDPM achieves competitive predictive performance across C-index, integrated time-dependent AUC, and integrated Brier score. A study on synthetic Cox-Weibull data demonstrates that SDPM can recover the shape of an underlying continuous survival distribution more accurately than a strong nonparametric baseline when sufficiently many samples are generated. An ablation study confirms the importance of the proposed target-space transformations, which improve event-rate calibration, reduce invalid generated times, and provide consistent gains in predictive discrimination. Code implementing the proposed model is publicly available.
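
The abstract states that generated samples of observed time and censoring indicator can be turned into a survival function estimate with the Kaplan-Meier estimator. The sketch below is a plain NumPy version of that standard estimator, given as an illustration only. It is not the authors' code, and in the paper's setting the inputs would presumably be samples drawn from the model for a fixed covariate vector, which is an assumption here.

```python
import numpy as np

def kaplan_meier(times, events):
    """Kaplan-Meier estimate.
    times: observed times; events: 1 if the event occurred, 0 if censored.
    Returns the distinct event times and the survival estimate just after each."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    event_times = np.unique(times[events == 1])
    surv = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(times >= t)                     # still under observation at t
        deaths = np.sum((times == t) & (events == 1))    # events exactly at t
        s *= 1.0 - deaths / at_risk
        surv.append(s)
    return event_times, np.array(surv)
```

Censored samples never trigger a drop in the curve but do shrink the at-risk set for later times, which is how the estimator uses the censoring indicator.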

2605.22775 2026-05-22 cs.LG cs.AI cs.HC

MambaGaze: Bidirectional Mamba with Explicit Missing Data Modeling for Cognitive Load Assessment from Eye-Gaze Tracking Data

MambaGaze: 通过显式缺失数据建模的双向Mamba用于从眼动追踪数据中评估认知负荷

Amir Mousavi, Mohammad Sadegh Sirjani, Erfan Nourbakhsh, Mimi Xie, Rocky Slavin, Leslie Neely, John Davis, John Quarles

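AI summary: This paper proposes MambaGaze, which combines XMD encoding with a bidirectional Mamba-2 backbone to handle frequent missing data and long-range temporal dependencies in eye-tracking signals. Experiments show superior accuracy for cognitive load assessment and feasibility for edge deployment.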

详情
Comments
Submitted to IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI 2026)
AI中文摘要

从眼动信号进行实时认知负荷评估有可能实现适应性的人工智能应用,如安全关键应用如驾驶员警觉监控或自动驾驶舱辅助,但存在两个挑战:处理频繁的数据缺失(如眨眼和跟踪失败)以及高效建模长时序依赖。我们提出MambaGaze,一个通过1)XMD编码,将原始特征与观察掩码和时间差增强以显式建模数据不确定性,以及2)双向Mamba-2,以线性计算复杂性捕获时序依赖的框架。在CLARE和CL-Drive数据集上进行的leave-one-subject-out评估实验表明,MambaGaze分别达到76.8%和73.1%的准确率,优于CNN、Transformer、ResNet和VGG基线,高出4-12个百分点。在NVIDIA Jetson平台上的边缘部署基准测试显示,实现实时推理43-68 FPS,功率消耗低于7.5W,证实了其在可穿戴认知负荷监测中的可行性。

英文摘要

Real-time cognitive load assessment from eye-tracking signals could enable adaptive human-centered AI in safety-critical applications such as driver vigilance monitoring or automated flight deck assistance, yet two challenges persist: handling frequent data missingness from blinks and tracking failures, and efficiently modeling long-range temporal dependencies. We propose MambaGaze, a framework that addresses these challenges through 1) XMD encoding, which augments raw features with observation masks and time-deltas to explicitly model data uncertainty, and 2) bidirectional Mamba-2, which captures temporal dependencies with linear computational complexity. Experiments on CLARE and CL-Drive datasets under leave-one-subject-out evaluation show that MambaGaze achieves 76.8% and 73.1% accuracy, respectively, outperforming CNN, Transformer, ResNet, and VGG baselines by 4-12 percentage points. Edge deployment benchmarks on NVIDIA Jetson platforms demonstrate real-time inference at 43-68 FPS with power consumption below 7.5W, confirming feasibility for wearable cognitive load monitoring.
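
The idea of augmenting raw features with observation masks and time-deltas can be sketched as below. The abstract does not give the exact definition, so this sketch assumes a common convention: missing samples are marked as NaN, filled with zero, and the time-delta counts the seconds since the last observed value of each feature. The function name `xmd_encode` and the toy data are illustrative only.

```python
import numpy as np


def xmd_encode(x, timestamps):
    """Stack values (X), observation mask (M), and time-deltas (D).

    x          : (T, F) array with NaN where a sample is missing
    timestamps : (T,) array of sample times in seconds
    Returns a (T, 3F) array laid out as [x_filled, mask, delta].
    """
    x = np.asarray(x, dtype=float)
    timestamps = np.asarray(timestamps, dtype=float)
    mask = (~np.isnan(x)).astype(float)
    x_filled = np.where(mask == 1.0, x, 0.0)  # assumed zero-fill for gaps
    delta = np.zeros_like(x)
    for t in range(1, x.shape[0]):
        gap = timestamps[t] - timestamps[t - 1]
        # Reset to the step size after an observed sample, otherwise accumulate
        delta[t] = np.where(mask[t - 1] == 1.0, gap, delta[t - 1] + gap)
    return np.concatenate([x_filled, mask, delta], axis=1)


# One gaze feature with a two-sample gap (for example, a blink)
gaze = [[0.5], [np.nan], [np.nan], [0.7]]
encoded = xmd_encode(gaze, [0.0, 0.1, 0.2, 0.3])
```

With this layout a downstream sequence model can distinguish a true zero from a filled gap and can see how stale the last valid measurement is.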

2605.22773 2026-05-22 cs.AI math.OC

Deep Reinforcement Learning for Flexible Job Shop Scheduling with Random Job Arrivals

基于随机作业到达的灵活作业车间调度的深度强化学习

Yu Tang, Muhammad Zakwan, Efe Balta, John Lygeros, Alisa Rupenyan

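AI summary: This paper proposes an event-based deep reinforcement learning method for the flexible job shop scheduling problem with random job arrivals. An agent trained with Proximal Policy Optimization and lightweight multi-layer perceptrons minimizes the total completion time of all jobs and outperforms each individual dispatching rule on datasets with varying heterogeneity and job arrival rates.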

详情
AI中文摘要

灵活作业车间调度问题(FJSP)是将一组作业最优分配到机器上的问题。在FJSP中仍存在两个主要挑战:未来作业的不可预测到达和问题的组合复杂性,使它对传统混合整数线性规划求解器来说是不可行的。本文提出了一种基于事件的深度强化学习(DRL)方法来解决具有随机作业到达的FJSP。具体而言,我们采用近端策略优化算法,并使用轻量级多层感知机来训练DRL智能体以最小化所有作业的总完成时间。我们设计状态表示为可以直接从环境中获取,并限制学习智能体只能在一组已确立的调度规则中选择。仿真显示,我们的DRL方法在异质性和作业到达率不同的数据集上优于任何单独的调度规则。我们还将我们的DRL方法与一种触发到达的混合整数线性规划解决方案进行基准测试,并表明我们的方法在数据集异质性较高的情况下表现良好。

英文摘要

The Flexible Job Shop Scheduling Problem (FJSP) is the optimal allocation of a set of jobs to machines. Two primary challenges persist in FJSP: the unpredictable arrival of future jobs and the combinatorial complexity of the problem, rendering it intractable for conventional mixed-integer linear programming solvers. This paper proposes an event-based deep reinforcement learning (DRL) approach to solve FJSP with random job arrivals. Specifically, we employ the Proximal Policy Optimization algorithm and use lightweight Multi-Layer Perceptrons to train the DRL agent for minimizing the total completion time of all jobs. We design the state representation to be directly accessible from the environment, and limit the learning agent to selecting from among a set of well-established dispatching rules. Simulations show that our DRL approach outperforms any of the individual dispatching rules on datasets with varying heterogeneity and job arrival rates. We benchmark our DRL approach against an arrival-triggered mixed-integer linear programming solution and show that our method achieves good performance, especially when the datasets are heterogeneous.
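
Restricting the agent to a choice among dispatching rules can be sketched as a small discrete action space. The abstract does not list the rules that were used, so the three rules below (FIFO, SPT, LPT) and the `Operation` fields are purely illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Operation:
    job_id: int
    arrival_time: float
    processing_time: float


# Hypothetical rule set: each action index maps to a sort key over the queue.
DISPATCHING_RULES = {
    0: ("FIFO", lambda op: op.arrival_time),      # first in, first out
    1: ("SPT", lambda op: op.processing_time),    # shortest processing time
    2: ("LPT", lambda op: -op.processing_time),   # longest processing time
}


def dispatch(queue, action):
    """Apply the rule selected by the agent and return the next operation."""
    _, key = DISPATCHING_RULES[action]
    return min(queue, key=key)


queue = [
    Operation(job_id=1, arrival_time=0.0, processing_time=5.0),
    Operation(job_id=2, arrival_time=1.0, processing_time=2.0),
    Operation(job_id=3, arrival_time=2.0, processing_time=9.0),
]
chosen = [dispatch(queue, a).job_id for a in sorted(DISPATCHING_RULES)]
```

The policy then only has to output one of a handful of indices at each decision event, which keeps the action space fixed regardless of how many jobs are waiting.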

2605.22767 2026-05-22 cs.CV

Synthetic Data Alone is Enough? Rethinking Data Scarcity in Pediatric Rare Disease Recognition

合成数据足够吗?重新思考儿科罕见病识别中的数据稀缺性

Ganlin Feng, Yuxi Long, Erin Lou, Lianghong Chen, Zihao Jing, Pingzhao Hu, Wei Xu

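AI summary: This study examines whether synthetic data alone can overcome data scarcity in pediatric rare disease recognition. Experiments suggest that high-fidelity synthetic data can approximate clinically meaningful distributions, providing privacy-preserving visual resources for genetic counseling.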

详情
Comments
CVPR 2026 CV4CHL workshop
AI中文摘要

患有罕见遗传疾病的儿童往往表现出独特的面部表型,但开发用于早期诊断的计算机视觉系统仍极具挑战性,因为存在极端的数据稀缺性、隐私限制以及儿科环境中有限的数据共享。这些挑战不仅阻碍了自动化诊断,也限制了临床遗传咨询中的视觉资源可用性。尽管先前研究表明合成数据可以增强真实数据集并保持表型层面的语义,但尚不清楚在超低资源的儿科环境中,仅使用合成数据是否足以进行学习。在本工作中,我们研究了仅使用合成数据的儿科罕见病识别场景。在受控的实验设置中,模型仅在具有表型意识的合成面部图像上进行训练,随着数据规模的增加。我们发现,在足够规模下,仅使用合成数据的训练在多个架构上实现了与仅使用真实数据的基线相当的性能,这表明高保真合成数据能够近似临床有意义的分布。这些发现进一步使合成的儿科面部图像成为隐私保护的资源,用于遗传教育和咨询,支持临床医生培训和患者沟通。我们的结果强调了计算机视觉在提高数据效率和扩展儿童健康护理中可访问的视觉工具方面的潜力。

英文摘要

Children with rare genetic diseases often exhibit distinctive facial phenotypes, yet developing computer vision systems for early diagnosis remains challenging due to extreme data scarcity, privacy constraints, and limited data sharing in pediatric settings. These challenges not only hinder automated diagnosis but also restrict the availability of visual resources for clinical genetic counseling. While prior work has shown that synthetic data can augment real datasets and preserve phenotype-level semantics, it remains unclear whether synthetic data alone is sufficient for learning in ultra-low-resource pediatric settings. In this work, we study the synthetic-only regime for pediatric rare disease recognition. Under a controlled experimental setup, models are trained exclusively on phenotype-aware synthetic facial images at increasing scales. We find that synthetic-only training achieves performance comparable to real-data-only baselines at sufficient scale across multiple backbones, suggesting that high-fidelity synthetic data can approximate clinically meaningful distributions. These findings together further enable the use of synthetic pediatric facial images as privacy-preserving resources for genetic education and counseling, supporting clinician training and patient communication. Our results highlight the potential of computer vision to improve data efficiency and expand accessible visual tools in children's healthcare.

2605.22765 2026-05-22 cs.LG stat.ML

Uniform Diffusion Models Revisited: Leave-One-Out Denoiser and Absorbing State Reformulation

统一扩散模型再审视:留一法去噪器和吸收状态重述

Samson Gourevitch, Yazid Janati, Dario Shariatian, Umut Simsekli, Eric Moulines, Eric P. Xing, Alain Durmus

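AI summary: This paper studies the mismatch between the denoising posterior and the leave-one-out posterior in uniform diffusion models, and improves model performance through better parameterization and sampling.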

详情
Comments
preprint
AI中文摘要

离散扩散模型通常通过干净数据预测进行训练,但预测可以以不同方式定义反向动态。在掩码扩散模型(MDM)中这些选择大体一致,而在统一扩散模型(UDM)中则不一致。我们展示了标准插件桥参数化对于UDM并非由去噪后验优化,而是由留一法后验优化,该后验预测每个干净token时不使用其自身的噪声观测。这揭示了插件ELBO与常规去噪交叉熵目标之间的不匹配。我们刻画了留一法目标并推导了去噪器、留一法后验和分数之间的精确转换。这些转换使我们能够分离参数化和训练目标。我们的结果还通过有意识的预测-校正采样器和基于留一法预测的改进温度采样方法在无需额外训练的情况下提升了推断性能。我们进一步引入了统一扩散的吸收状态重述,该重述在保持UDM联合分布的同时将其分解为类似掩码扩散的采样操作,具有更简单的去噪后验、携带未掩码和自然重掩码机制。在语言建模中,留一法参数化一致地提升了UDM生成性能,而吸收构造在匹配或超越掩码扩散方面表现优异。这些结果表明,掩码与统一扩散之间的经验差距主要由参数化和采样设计驱动,而非边际本身的选择。代码和模型可在https://github.com/samsongourevitch/rev_udm找到。

英文摘要

Discrete diffusion models are often trained through clean-data prediction, but the prediction can be used in different ways to define the reverse dynamics. In Masked Diffusion Models (MDM) these choices largely coincide, whereas in Uniform Diffusion Models (UDM) they do not. We show that the standard plug-in bridge parameterization for UDM is not optimized by the denoising posterior, but by a leave-one-out posterior that predicts each clean token without using its own noisy observation. This identifies a mismatch between the plug-in ELBO and the usual cross-entropy denoising objective. We characterize the leave-one-out target and derive exact conversions between the denoiser, the leave-one-out posterior, and the score. These conversions allow us to disentangle parameterization and training objective. Our results also lead to inference improvements without any additional training through an informed predictor-corrector sampler and improved temperature sampling based on the leave-one-out predictor. We further introduce an absorbing-state reformulation of uniform diffusion that preserves the UDM joint law while decomposing it into masked-diffusion-like sampling operations, with simpler denoising posteriors, carry-over unmasking, and a natural remasking mechanism. On language modeling, leave-one-out parameterizations consistently improve UDM generation, while the absorbing construction matches or surpasses masked diffusion. These results suggest that the empirical gap between masked and uniform diffusion is driven less by the choice of marginals themselves than by parameterization and sampling design. The code and models can be found at https://github.com/samsongourevitch/rev_udm.

2605.22756 2026-05-22 cs.LG cs.DS

Lumberjack: Better Differentially Private Random Forests through Heavy Hitter Detection in Trees

Lumberjack: 通过树中的Heavy Hitter检测实现更好的差分隐私随机森林

Christian Janos Lebeda, David Erb, Tudor Cebere, Aurélien Bellet

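AI summary: This paper proposes Lumberjack, an algorithm that substantially improves the utility of differentially private random forests by building large random decision trees and then applying privacy-preserving pruning. The method introduces a new $(\varepsilon,\delta)$-DP heavy hitter detection algorithm with error $O_{\varepsilon,\delta}(\sqrt{\log h})$ for trees of height $h$, which allows deeper trees and therefore more expressiveness under privacy constraints. Experiments show that Lumberjack outperforms existing differentially private random forest methods on benchmark datasets, with a markedly better privacy-utility trade-off for practical privacy budgets.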

详情
AI中文摘要

随机森林广泛应用于涉及敏感表格数据的领域,但现有的差分隐私(DP)方法通常会降级性能到不实用的程度。在本文中,我们介绍Lumberjack,一种差分隐私随机森林算法,通过构建大规模随机决策树并应用激进的隐私保护剪枝技术,保留仅足够 populated 的节点,从而实现显著更高的实用性。我们方法的关键组成部分是一个新颖的(ε,δ)-DP Heavy Hitter检测算法,用于层次数据,其误差为O_{ε,δ}(√log h)对于高度为h的树,并可能具有独立的兴趣。这种有利的缩放使得可以使用比先前工作更深的树,从而在隐私约束下提高表达能力。我们在基准数据集上的实验证明,Lumberjack在基准数据集上优于现有差分隐私随机森林方法,建立了新的状态。特别是,我们的方法在实际隐私预算下的隐私-效用权衡上取得了显著改进。我们的发现表明,精心设计的差分隐私随机森林可以缩小大部分的效用差距,突显了未来研究中一个有前途但尚未被探索的方向。

英文摘要

Random forests are widely used in fields involving sensitive tabular data, but existing approaches to enforcing differential privacy (DP) typically degrade performance to the point of impracticality. In this paper, we introduce Lumberjack, a differentially private random forest algorithm that achieves substantially higher utility by constructing large random decision trees and then applying aggressive, privacy-preserving pruning to retain only sufficiently populated nodes. A key component of our approach is a novel $(\varepsilon,\delta)$-DP heavy hitter detection algorithm for hierarchical data, whose error is $O_{\varepsilon,\delta}(\sqrt{\log h})$ for trees of height $h$ and may be of independent interest. This favorable scaling enables the use of significantly deeper trees than in prior work, leading to improved expressiveness under privacy constraints. Our empirical evaluation on benchmark datasets shows that Lumberjack consistently outperforms prior DP random forest methods, establishing a new state of the art. In particular, our approach yields substantial improvements in the privacy-utility trade-off for practical privacy budgets. Our findings suggest that carefully designed DP random forests can close much of the utility gap, highlighting a promising and underexplored direction for future research.

2605.22751 2026-05-22 cs.CV

Spectral Tail Auxiliary Learning for AI-Generated Image Detection

用于AI生成图像检测的频谱尾辅助学习

Xingyi Li, Jiahui Zhang, Yiheng Li, Yun Cao, Wenhao Wang

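AI summary: This paper proposes STAL, an auxiliary learning framework based on spectral-tail features for detecting AI-generated images. Analysis of the radial log-power spectra of real and generated images shows that generated images exhibit an anomalous uplift in the ultra-high-frequency tail, termed spectral tail uplift. STAL uses this cue as auxiliary supervision, improving generalization and stability across generators, data distributions, and real-world scenarios.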

详情
AI中文摘要

随着生成图像模型的快速发展,生成图像与真实图像的感知差距持续缩小,使AI生成图像检测变得愈发困难。许多现有方法利用频域线索进行检测,通常描述为频域伪影或高频差异。然而,具体的频谱规律仍不够理解和表征。本文系统分析了真实和生成图像的一维径向对数功率谱。发现生成图像并不一定在整个频谱或高频范围内具有更高的或更低的能量。相反,它们的频谱偏离幂律衰减,并在超高频尾部表现出异常上升。我们称这种现象为频谱尾部上升。进一步将这种现象归因于训练生成模型中的非线性谐波积累,表明它可以在生成架构中作为结构线索。基于这一观察,我们提出了Spectral Tail Auxiliary Learning (STAL),一种用于通用AI生成图像检测的频域辅助监督框架。STAL在训练时将频谱尾部线索从尾部意识的频率教师转移到空间检测器,而在推理时所有频域模块都被丢弃。因此,STAL不引入推理开销。在9个公开数据集上的大量实验表明,STAL在不同生成器、数据分布和现实场景中实现了强大的泛化能力和稳定性。

英文摘要

As generative image models evolve rapidly, the perceptual gap between generated and real images continues to narrow, making AI-generated image detection increasingly challenging. Many existing methods exploit frequency-domain cues for detection, typically described as frequency-domain artifacts or high-frequency discrepancies. However, the specific and recurring spectral regularities remain insufficiently understood and characterized. In this paper, we systematically analyze the one-dimensional radial log-power spectra of real and generated images. We find that generated images do not necessarily exhibit higher or lower energy across the entire spectrum or high-band range. Instead, their spectra deviate from the power-law decay and show an anomalous uplift in the ultra-high-frequency tail. We term this phenomenon spectral tail uplift. We further attribute this phenomenon to nonlinear harmonic accumulation in trained generative models, suggesting that it can serve as a structural cue across generative architectures. Based on this observation, we propose Spectral Tail Auxiliary Learning (STAL), a frequency-domain auxiliary supervision framework for generalizable AI-generated image detection. STAL transfers spectral-tail cues from a tail-aware frequency teacher to a spatial detector during training, while all frequency-domain modules are discarded at inference time. Consequently, STAL introduces no inference overhead. Extensive experiments on 9 public datasets show that STAL achieves strong generalization and stability across generators, data distributions, and real-world scenarios.
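
The one-dimensional radial log-power spectrum that the analysis is built on can be computed as in the sketch below. This is a generic azimuthal average written with NumPy, not the authors' implementation; the integer binning, the `1e-12` floor inside the logarithm, and the cut-off at the largest fully sampled radius are assumptions of this sketch.

```python
import numpy as np


def radial_log_power_spectrum(image):
    """Azimuthally averaged log-power spectrum of a 2-D grayscale image."""
    image = np.asarray(image, dtype=float)
    h, w = image.shape
    # Power spectrum with the zero frequency shifted to the array centre
    power = np.abs(np.fft.fftshift(np.fft.fft2(image))) ** 2
    yy, xx = np.indices((h, w))
    radius = np.hypot(yy - h // 2, xx - w // 2).astype(int)  # integer radial bin
    sums = np.bincount(radius.ravel(), weights=power.ravel())
    counts = np.bincount(radius.ravel())
    profile = sums / counts
    max_r = min(h, w) // 2  # keep only radii covered in every direction
    return np.log(profile[:max_r] + 1e-12)


rng = np.random.default_rng(0)
spectrum = radial_log_power_spectrum(rng.standard_normal((64, 64)))
```

Plotting such profiles for real and generated images against log-frequency is one way to inspect whether the highest-frequency bins rise above a fitted power-law decay.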

2605.22749 2026-05-22 cs.LG cs.AI

Cyber-Physical Anomaly Detection in IoT-Enabled Smart Grids Using Machine Learning and Metaheuristic Feature Optimization

基于机器学习和元启发式特征优化的物联网智能电网中网络-物理异常检测

Adis Alihodžić, Eva Tuba, Milan Tuba

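AI summary: This chapter studies cyber-physical anomaly detection in IoT-enabled smart grids using machine learning combined with metaheuristic (genetic-algorithm) feature selection. Among the baselines evaluated, tree-based ensembles perform best on the dataset considered, and after feature selection the model improves in both macro-F1 and ROC-AUC.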

详情
AI中文摘要

现代智能电网依赖于密集的测量基础设施、通信链路和智能现场设备。尽管这提高了监控和控制能力,但也增加了遭受网络-物理破坏的风险。操作员必须区分物理事件,如故障或线路干扰,与恶意行为,如虚假数据注入或未经授权的命令执行。本章利用著名的MSU/ORNL电力系统攻击数据集来研究这一问题。所提出的方法结合了机器学习与基于遗传算法的特征选择。目标是双重的:准确分类攻击和自然事件,并确定一组减少的、物理信息丰富的PMU/IED测量是否能够支持可靠的检测。评估了多个基线模型,包括逻辑回归、RBF-SVM、XGBoost、随机森林和额外树。结果表明,基于树的集成模型在考虑的数据集上最为有效,其中额外树提供了最强的全特征基线。在特征选择后,GA + Extra Trees模型将干净的PMU特征空间从112个属性减少到五次运行的平均27.4个属性,同时将宏F1从0.9118提高到0.9212,ROC-AUC从0.9791提高到0.9837。这些结果表明,许多同步电气测量是冗余的。一个紧凑的基于相量的特征子集仍能提供准确且可解释的智能电网异常检测。

英文摘要

Modern smart grids rely on dense measurement infrastructures, communication links, and intelligent field devices. Although this improves supervision and control, it also increases vulnerability to cyber-physical disruptions. Operators must distinguish physical incidents, such as faults or line disturbances, from malicious actions, such as false data injection or unauthorized command execution. This chapter investigates this problem using the well-known MSU/ORNL Power System Attack Dataset. The proposed method combines machine learning with genetic-algorithm-based feature selection. The objective is twofold: to classify attack and natural events accurately, and to determine whether a reduced set of physically informative PMU/IED measurements can support reliable detection. Several baseline models are evaluated, including logistic regression, RBF-SVM, XGBoost, Random Forest, and Extra Trees. The results show that tree-based ensemble models are the most effective for the considered dataset, with Extra Trees providing the strongest full-feature baseline. After feature selection, the GA + Extra Trees model reduces the clean PMU feature space from 112 attributes to an average of 27.4 attributes over five runs, while increasing macro-F1 from 0.9118 to 0.9212 and ROC-AUC from 0.9791 to 0.9837. These results indicate that many synchronized electrical measurements are redundant. A compact subset of phasor-based features can still provide accurate and interpretable anomaly detection in smart grids.

2605.22748 2026-05-22 cs.RO cs.AI cs.LG cs.MA

Superhuman Safe and Agile Racing through Multi-Agent Reinforcement Learning

通过多智能体强化学习实现超人类安全且敏捷的赛车

Ismail Geles, Leonard Bauersfeld, Markus Wulfmeier, Davide Scaramuzza

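AI summary: This paper uses multi-agent reinforcement learning to achieve safe and agile high-speed quadrotor racing. It shows that multi-agent interaction is key to safety in real-world interaction: the agents outperform a human pilot in high-speed races while reducing collision rates.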

详情
Comments
12 pages (+4 supplementary). Website: https://rpg.ifi.uzh.ch/marl
AI中文摘要

自主系统在孤立或模拟环境中已实现超人类性能,但在共享、动态的真实世界空间中仍显得脆弱。这种失败源于物理应用中主导的单智能体范式,其中其他参与者被忽略或视为环境噪声,阻碍了有效协调。本文证明多智能体强化学习为真实世界交互提供了必要的安全性基础。使用高速四旋翼赛车作为高风险测试平台,训练智能体在复杂空气动力学相互作用和战略机动中导航,具有可变数量的赛车。通过联赛基于的自我对战,智能体进化出复杂的前瞻性行为,包括主动避障、超车和处理多智能体物理交互,包括空气动力学下洗。我们的智能体在超过22米/秒的速度下多玩家赛车中超越了冠军级人类飞行员,同时与最先进的单智能体基线相比,碰撞率减少了50%。关键的是,使用多样化的人工智能体进行训练能够实现零样本泛化到更安全的人类交互。这些结果表明,实现稳健的机器人共存的路径不在于孤立的安全约束,而在于多智能体交互的严格要求。多媒体材料可在:https://rpg.ifi.uzh.ch/marl

英文摘要

Autonomous systems have achieved superhuman performance in isolation or simulation, yet they remain brittle in shared, dynamic real-world spaces. This failure stems from the dominant single-agent paradigm for physical applications, where other actors are ignored or treated as environmental noise, preventing effective coordination. Here we show that multi-agent reinforcement learning provides the essential safety scaffolding required for real-world interaction. Using high-speed quadrotor racing as a high-stakes testbed, we train agents to navigate complex aerodynamic interactions and strategic maneuvering with a variable number of racers. Through league-based self-play, agents evolve sophisticated anticipatory behaviors, including proactive collision avoidance, overtaking, and handling multi-agent physical interactions, including aerodynamic downwash. Our agents outperform a champion-level human pilot in multi-player races at speeds exceeding 22 m/s, while simultaneously reducing collision rates by 50 % compared to state-of-the-art single-agent baselines. Crucially, training with diverse artificial agents enables zero-shot generalization to safer human interaction. These results suggest that the path to robust robotic co-existence lies not in isolated safety constraints, but in the rigorous demands of multi-agent interaction. Multimedia materials are available at: https://rpg.ifi.uzh.ch/marl

2605.22746 2026-05-22 cs.LG eess.AS stat.ML

Plug-in Losses for Evidential Deep Learning: A Simplified Framework for Uncertainty Estimation that Includes the Softmax Classifier

插件损失用于证据深度学习:一个简化框架用于不确定性估计,其中包括softmax分类器

Berk Hayta, Hannah Laus, Simon Mittermaier, Felix Krahmer

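AI summary: This paper proposes a simplified framework that approximates evidential deep learning objectives with plug-in losses for uncertainty estimation, shows that the framework includes the standard softmax classifier under a particular evidence-to-Dirichlet mapping, and validates it on the Google Speech Commands dataset.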

详情
AI中文摘要

现实中的基于传感器的学习系统需要可靠且计算高效的不确定性估计。证据深度学习(EDL)通过狄利克雷分布建模类概率,从而实现单次通过的不确定性估计,其中狄利克雷参数由一个学习的神经网络映射预测。然而,这种方法可能导致计算挑战,因为狄利克雷期望目标比标准监督学习损失更复杂,增加了分析和实现的难度。我们通过近似由EDL诱导的一阶经验风险最小化问题的目标,使用在狄利克雷均值上评估的插件损失,证明在温和假设下,对于广泛的一类损失函数,包括均方误差和交叉熵损失,近似误差随着证据的增长而减小。作为特殊情况,我们的分析为在不确定性估计中使用softmax提供了正当性,因为在特定的证据到狄利克雷分布映射下,我们的框架包含标准的softmax分类器。我们在Google语音命令数据集上验证了所提出的简化目标,并展示了其在预测准确性和选择性预测性能上与经典EDL相当,同时使用标准深度学习损失和训练流程实现起来更简单。到目前为止,本文的实证分析是首次通过EDL获得语音识别任务中的覆盖-准确性权衡。

英文摘要

Real-world sensor-based learning systems require uncertainty estimation that is both reliable and computationally efficient. Evidential Deep Learning (EDL) provides single-pass uncertainty estimation by modeling the class probabilities via Dirichlet distributions, where the Dirichlet parameters are predicted by a learned neural network mapping. However, this approach can lead to computational challenges, as Dirichlet expected objectives are more complex than standard supervised learning losses, complicating their analysis and implementation. We address this issue by approximating the objective of the first-order empirical risk minimization problem induced by EDL with a plug-in loss evaluated at the Dirichlet mean and show that, under mild assumptions, the approximation error decays with growing evidence for a broad class of loss functions, including mean-squared error and cross-entropy loss. As a special case, our analysis provides justification for the use of softmax in the context of uncertainty estimation, since under a particular evidence-to-Dirichlet mapping, our framework includes the standard softmax classifier. We validate the proposed simplified objectives on the Google Speech Commands dataset and show that they achieve predictive accuracy and selective prediction performance comparable to classical EDL, while being simpler to implement using standard deep learning losses and training pipelines. To the best of our knowledge, this empirical analysis is the first to obtain coverage-accuracy trade-offs for speech recognition tasks through EDL.
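
The link between a plug-in loss at the Dirichlet mean and the softmax classifier can be illustrated with a short numerical check. The abstract does not state which evidence-to-Dirichlet mapping yields softmax, so this sketch assumes the exponential mapping $\alpha_k = \exp(z_k)$, under which the Dirichlet mean $\alpha_k / \sum_j \alpha_j$ coincides with softmax of the logits $z$. All function names here are illustrative.

```python
import math


def dirichlet_mean(alpha):
    """Mean of a Dirichlet distribution with concentration parameters alpha."""
    total = sum(alpha)
    return [a / total for a in alpha]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def plug_in_cross_entropy(alpha, label):
    """Cross-entropy evaluated at the Dirichlet mean, not in expectation."""
    return -math.log(dirichlet_mean(alpha)[label])


logits = [2.0, 0.5, -1.0]
alpha = [math.exp(z) for z in logits]  # assumed mapping alpha_k = exp(z_k)
mean_probs = dirichlet_mean(alpha)
soft_probs = softmax(logits)
```

Under this assumed mapping the plug-in cross-entropy is exactly the usual softmax cross-entropy, so a standard training pipeline can be reused unchanged.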

2605.22743 2026-05-22 cs.LG

SeqLoRA: Bilevel Orthogonal Adaptation for Continual Multi-Concept Generation

SeqLoRA: 为持续多概念生成的双水平正则化适应

Javad Parsa, Enis Simsar, Amir Joudaki, Thomas Hofmann, André M. H. Teixeira

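AI summary: This paper proposes SeqLoRA, a bilevel optimization framework that jointly optimizes both LoRA factors to address representation interference when composing multiple custom concepts in text-to-image diffusion models, improving identity preservation and scalability.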

详情
AI中文摘要

参数高效微调能够快速个性化文本到图像扩散模型,但组合多个自定义概念仍然具有挑战性,因为存在表示干扰。现有的模块化方法要么依赖于昂贵的后置融合,要么冻结适应子空间,这限制了表达能力和概念保真度。为了解决这一权衡,我们提出了顺序正则化的LoRA(SeqLoRA),一种联合优化LoRA因素的持续学习框架。理论上,我们为我们的算法建立了强收敛保证,并将残差层激活建模为矩阵子高斯过程,以推导出灾难性遗忘的高概率界。我们进一步证明,从数据中学习LoRA基底比冻结基底方法更有效地最小化残差干扰能量。在多概念图像生成实验中,SeqLoRA在多达101个概念上提高了身份保持性和可扩展性,同时避免了昂贵的融合并减少了组合生成中的属性干扰。

英文摘要

Parameter-efficient fine-tuning enables fast personalization of text-to-image diffusion models, but composing multiple custom concepts remains challenging due to representation interference. Existing modular methods either rely on expensive post-hoc fusion or freeze adaptation subspaces, which limit expressiveness and concept fidelity. To address this trade-off, we propose Sequential regularized LoRA (SeqLoRA), a constrained continual learning framework that jointly optimizes both LoRA factors via bilevel optimization. Theoretically, we establish strong convergence guarantees for our algorithm and model the residual layer activations as a matrix sub-Gaussian process to derive high-probability bounds on catastrophic forgetting. We further prove that learning the LoRA basis from data minimizes residual interference energy more effectively than frozen-basis methods. Experiments on multi-concept image generation demonstrate that SeqLoRA improves identity preservation and scalability across up to 101 concepts, while avoiding costly fusion and reducing attribute interference in composed generations.

2605.22736 2026-05-22 math.OC cs.LG cs.NA math.DG math.NA

Optimization over the intersection of manifolds

在两个流形交集上的优化

Yan Yang, Bin Gao, Ya-xiang Yuan

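AI summary: This paper proposes a geometric method for optimization over the intersection of two manifolds that uses a retraction on only one manifold and updates the iterate along two orthogonal directions. It proves that clean intersection and intrinsic transversality are equivalent and demonstrates the method on sparse and low-rank optimization problems.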

详情
Comments
26 pages, 5 figures, 3 tables
AI中文摘要

在两个流形交集上的优化出现在广泛的应用中,但受到可行区域耦合几何的阻碍。在本文中,我们证明了正则性——清洁交集和内在横贯性——是等价的,这导致了可处理的交集切空间投影。因此,我们提出了一种几何方法,该方法仅在单个流形上使用重新参数化,并在两个正交方向上更新迭代点。具体而言,迭代点停留在一个流形上,而这两个方向分别负责渐近接近另一个流形和减少目标函数。在内在横贯性下,我们推导了可行性和最优性度量的收敛速度,并证明了每个积累点都是第一阶 stationary 的。在稀疏和低秩优化问题上的数值实验,包括拟合球形数据、在真实数据上近似双曲嵌入和计算压缩模式,展示了所提方法的有效性。

英文摘要

Optimization over the intersection of two manifolds arises in a broad range of applications, but is hindered by the coupled geometry of the feasible region. In this paper, we prove that the regularities -- clean intersection and intrinsic transversality -- are equivalent, which yields a tractable projection onto the tangent space of the intersection. Therefore, we propose a geometric method that employs a retraction on only one manifold and updates the iterate along two orthogonal directions. Specifically, the iterates stay on one manifold, and the two directions are responsible for asymptotically approaching the other manifold and decreasing the objective function, respectively. Under intrinsic transversality, we derive the convergence rate for both the feasibility and optimality measures, and show that every accumulation point is first-order stationary. Numerical experiments on problems stemming from sparse and low-rank optimization, including fitting spherical data, approximating hyperbolic embeddings on real data, and computing compressed modes, demonstrate the effectiveness of the proposed method.

2605.22734 2026-05-22 cs.CL

ChronoMedKG: A Temporally-Grounded Biomedical Knowledge Graph and Benchmark for Clinical Reasoning

ChronoMedKG:一个具有时间基础的生物医学知识图谱和用于临床推理的基准

Md Shamim Ahmed, Farzaneh Firoozbakht, Lukas Galke Poech, Jan Baumbach, Richard Röttger

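AI summary: This paper presents ChronoMedKG, a biomedical knowledge graph of 460,497 evidence-linked triples covering 13,431 diseases. Temporal components such as onset window or progression stage give clinical reasoning a temporal grounding. The accompanying ChronoTQA benchmark is used to validate its effectiveness on temporal reasoning tasks.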

详情
Comments
9 pages main text plus appendices, 8 figures. Dataset and benchmark paper. ChronoMedKG released under CC BY 4.0 and ChronoTQA/code under MIT (Zenodo: 10.5281/zenodo.19697542). Under review
AI中文摘要

生物医学知识图谱(KGs)将疾病关联视为静态事实,但时间信息对临床推理至关重要,例如,在3岁时提示某种疾病的症状,在13岁时可能提示另一种疾病。现有KGs如PrimeKG、Hetionet和iKraph未能编码在疾病发展过程中某项发现何时变得具有临床意义。这限制了它们在纵向临床推理和检索增强中的实用性。我们介绍了ChronoMedKG,一个包含460,497个证据链接三元组(从1300万原始提取中过滤)的时序生物医学知识图谱,覆盖13,431种疾病。每个关联都与时间组件如发病窗口或进展阶段相关,这些时间组件由PMID可追溯的证据和多信号可信度评分支持。图谱通过一种疾病自主的多代理管道构建,在其中多个前沿LLM独立从PubMed和PMC文献中提取知识。仅保留那些由多模型共识支持、通过可信度过滤并通过本体对齐的关系。ChronoMedKG与Orphadata的一致率达到92.7%,并为HPOA、Orphadata和Phenopackets中缺失的6,250种疾病添加了时间基础,包括1,657种Orphanet编码的罕见疾病。我们进一步引入了ChronoTQA,一个包含3,341个问题的基准测试,涵盖八种任务类型(六种时间相关任务加两种静态控制任务),并附带12个补充探测问题。前沿LLM在从静态问题转向时间问题时损失大约30分;ChronoMedKG检索挽回了其长尾失败的47-65%,而HPOA-RAG仅挽回17-29%。因此,ChronoMedKG为检索增强的临床系统提供了此前缺失的关键时间轴。

英文摘要

Biomedical knowledge graphs (KGs) treat disease associations as static facts, but temporal information is crucial for clinical reasoning, e.g., a symptom diagnostic of one disease at age 3 may imply a different disease at age 13. Existing KGs such as PrimeKG, Hetionet, and iKraph do not encode when a finding becomes clinically relevant over the course of a disease. This limits their usefulness for longitudinal clinical reasoning and retrieval augmentation. We introduce ChronoMedKG, a temporal biomedical knowledge graph that contains 460,497 evidence-linked triples (filtered from 13M raw extractions) covering 13,431 diseases. Each association is tied to temporal components such as onset window or progression stage, which are backed by PMID-traceable evidence and a multi-signal credibility score. The graph is constructed through a disease-autonomous multi-agent pipeline in which multiple frontier LLMs independently extract knowledge from PubMed and PMC literature. Only relations that are supported by multi-model consensus and that survive both credibility filtering and ontology alignment are kept. ChronoMedKG reaches 92.7% agreement with Orphadata and adds temporal grounding for 6,250 diseases absent from HPOA, Orphadata, and Phenopackets, including 1,657 Orphanet-coded rare diseases. We further introduce ChronoTQA, a benchmark of 3,341 questions across eight task types (six temporal plus two static controls), with a 12-question supplementary probe. Frontier LLMs lose roughly 30 points moving from static to temporal questions; ChronoMedKG retrieval rescues 47-65% of their long-tail failures, against 17-29% for HPOA-RAG. ChronoMedKG thus provides a crucial temporal axis that was previously absent from retrieval-augmented clinical systems.

2605.22733 2026-05-22 cs.AI cs.SE

HarnessAPI: A Skill-First Framework for Unified Streaming APIs and MCP Tools

HarnessAPI: 一种以技能为中心的统一流式API和MCP工具框架

Edwin Jose

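AI summary: This paper presents HarnessAPI, a skill-first framework that treats a typed skill folder as the single source of truth, eliminating duplicated code between streaming APIs and MCP tools and reducing framework-facing boilerplate by 74%.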

详情
AI中文摘要

如今,每一个作为LLM工具部署的Python函数必须以两种形式存在:一种是面向人类客户端和CI流水线的HTTP端点,另一种是用于代理运行时如Claude和Cursor的MCP工具注册。这两种表示共享业务逻辑,但在周围机器(路由、验证、序列化、流式传输和模式维护)上却存在差异,并且随着底层代码的变化而逐渐分离。我们提出了HarnessAPI,一种Python框架,通过将类型化的技能文件夹作为单一真相源来消除这种重复。从一个handler.py加上Pydantic模式,该框架可以自动推导出一个流式HTTP端点,具有Server-Sent Events,一个交互式的OpenAPI/Swagger UI,以及一个零配置的MCP工具,所有都从单一进程提供。双模式内容协商让相同的处理程序可以服务于SSE流式传输和JSON返回客户端,而无需更改处理程序。动态代码生成机制确保Pydantic类型注解正确传播到FastMCP的检查层,解决了一个技术限制,该限制阻止了基于闭包的简单注册。通过六个代表性技能的cloc测量,与手动维护的双栈实现(FastAPI服务器+FastMCP服务器)相比,HarnessAPI减少了74%的框架层面的样板代码。HarnessAPI继承了FastAPI的全部中间件、依赖注入和部署生态系统。它可在https://github.com/edwinjosechittilappilly/harnessapi上获得,并在PyPI(pip install harnessapi)中可用。

英文摘要

Every Python function deployed as an LLM tool must currently exist in two forms: an HTTP endpoint for human-facing clients and CI pipelines, and an MCP tool registration for agent runtimes such as Claude and Cursor. These representations share business logic yet diverge in all the surrounding machinery (routing, validation, serialisation, streaming, and schema maintenance), and they drift apart as the underlying code evolves. We present HarnessAPI, a Python framework that eliminates this duplication by treating a typed skill folder as the single source of truth. From one handler.py plus Pydantic schemas, the framework automatically derives a streaming HTTP endpoint with Server-Sent Events, an interactive OpenAPI/Swagger UI, and a zero-configuration MCP tool, all served from a single process. Dual-mode content negotiation lets the same handler serve SSE-streaming and JSON-returning clients with no handler changes. A dynamic code-generation mechanism ensures Pydantic type annotations propagate correctly to FastMCP's inspection layer, resolving a technical limitation that prevents naive closure-based registration. Measured across six representative skills using cloc, HarnessAPI reduces framework-facing boilerplate by 74% compared with a manually maintained dual-stack implementation (FastAPI server + FastMCP server). HarnessAPI subclasses FastAPI, inheriting its full middleware, dependency-injection, and deployment ecosystem. It is available at https://github.com/edwinjosechittilappilly/harnessapi and on PyPI (pip install harnessapi).

2605.22732 2026-05-22 cs.AI cs.CL cs.HC cs.SD eess.AS

Beyond Acoustic Emotion Recognition: Multimodal Pathos Analysis in Political Speech Using LLM-Based and Acoustic Emotion Models

超越语音情感识别:利用基于LLM和语音情感模型的政治演讲多模态Pathos分析

Juergen Dietrich

AI summary: This paper asks whether acoustic emotion recognition models can stand in for the Pathos dimension of political speech analysis, as operationalised by the TRUST multi-agent LLM pipeline. In a case study of a Bundestag plenary speech by Felix Banaszak, Gemini 2.5 Flash Valence correlates strongly with TRUST-Pathos (rho = +0.664, p < 0.001), whereas emotion2vec_plus_large Valence does not (rho = +0.097, p = 0.499). The authors conclude that LLM-based multimodal analysis captures semantically defined political emotion better than acoustic models alone, while acoustic features remain informative for low-level Arousal estimation.

详情
Comments
13 pages, 1 figure
AI中文摘要

我们研究语音情感识别模型是否能作为政治演讲分析中Pathos维度的代理,如由TRUST多智能体大语言模型(LLM)管道定义的那样。使用Felix Banaszak在德国议会全体会议中的演讲(51个片段,245秒)作为案例研究,我们比较了三种分析模式:(1) emotion2vec_plus_large,一个通过后验Russell Circumplex投影得到连续唤醒度和估值的语音情感识别(SER)模型;(2) Gemini 2.5 Flash,一个分析完整演讲音频及其转录文本的LLM,以开放和上下文感知的方式进行;(3) TRUST-Pathos分数,来自三个倡导者LLM监督集合。斯皮尔曼等级相关性显示,Gemini估值与TRUST-Pathos高度相关(rho = +0.664,p < 0.001),而emotion2vec估值不相关(rho = +0.097,p = 0.499)。我们进一步通过系统评估柏林情感语音数据库(EMO-DB)使用Gemini在开放注释范式下,证明标准SER基准语料库存在表演性演讲、文化偏见和类别不兼容性。我们的结果表明,基于LLM的多模态分析在捕捉语义定义的政治情感方面比单独的语音模型更有效,而语音特征仍对低层次唤醒度估计有帮助。未来的工作将扩展这种方法到视频分析中,结合面部表情和眼神。

英文摘要

We investigate whether acoustic emotion recognition models can serve as proxies for the Pathos dimension in political speech analysis, as operationalised by the TRUST multi-agent large language model (LLM) pipeline. Using a Bundestag plenary speech by Felix Banaszak (51 segments, 245 s) as a case study, we compare three analysis modalities: (1) emotion2vec_plus_large, an acoustic speech emotion recognition (SER) model whose continuous Arousal and Valence values are derived via post-hoc Russell Circumplex projection; (2) Gemini 2.5 Flash, an LLM analysing the full speech audio together with its transcript in an open-ended, context-aware fashion; and (3) TRUST-Pathos scores from a three-advocate LLM supervisor ensemble. Spearman rank correlations reveal that Gemini Valence correlates strongly with TRUST-Pathos (rho = +0.664, p < 0.001), whereas emotion2vec Valence does not (rho = +0.097, p = 0.499). We further demonstrate, via a systematic quality evaluation of the Berlin Database of Emotional Speech (EMO-DB) using Gemini in an open-ended annotation paradigm, that standard SER benchmark corpora suffer from acted speech, cultural bias, and category incompatibility. Our results suggest that LLM-based multimodal analysis captures semantically defined political emotion substantially better than acoustic models alone, while acoustic features remain informative for low-level Arousal estimation. Future work will extend this approach to video-based analysis incorporating facial expression and gaze.

2605.22731 2026-05-22 cs.LG cs.AI

Post-Training is About States, Not Tokens: A State Distribution View of SFT, RL, and On-Policy Distillation

后训练关乎状态,而非词元:从状态分布视角看SFT、RL与在线策略蒸馏

Dong Nie

AI总结 本文从状态分布的角度研究了监督微调(SFT)、强化学习(RL)和在线策略蒸馏(OPD)等大语言模型后训练方法,发现训练状态的来源和局部性可能与监督信号的形式同样重要。

详情
AI中文摘要

大语言模型的后训练方法,如监督微调(SFT)、强化学习(RL)和蒸馏,通常通过其损失函数来分析:最大似然、策略梯度、前向KL、反向KL或相关的目标层面变体。我们研究一个互补的因素:监督所施加的状态分布。对于自回归策略,状态是提示加上已生成的前缀。SFT在固定数据集的状态上训练,而RL和在线策略蒸馏(OPD)在当前学习者诱导出的状态上训练。我们将后训练形式化为状态分布塑造,并使用Qwen3-0.6B-Base在GSM8K上进行受控的小规模研究,以TruthfulQA和MMLU作为保留评估。我们的结果显示出三种现象。第一,温和的SFT运行提升了GSM8K成绩且几乎没有遗忘,而压力SFT运行则导致显著的保留损失。第二,以退化的SFT教师为来源的OPD在GSM8K、TruthfulQA和MMLU上均超过了该教师,尽管教师是其唯一的监督来源。第三,轻量级的在线策略RL运行提升了GSM8K成绩,同时保持了保留能力。这些结果支持后训练的状态中心视角:训练状态的来源和局部性可能与监督信号的形式同样重要。

英文摘要

Large language model post-training methods such as supervised fine-tuning (SFT), reinforcement learning (RL), and distillation are often analyzed through their loss functions: maximum likelihood, policy gradients, forward KL, reverse KL, or related objective-level variants. We study a complementary factor: the state distribution on which supervision is applied. For an autoregressive policy, a state is a prompt plus generated prefix. SFT trains on fixed dataset states, while RL and on-policy distillation (OPD) train on states induced by the current learner. We formalize post-training as state-distribution shaping and run a controlled small-scale study using Qwen3-0.6B-Base on GSM8K, with TruthfulQA and MMLU as retention evaluations. Our results show three phenomena. First, a mild SFT run improves GSM8K with little forgetting, while a stress SFT run causes substantial retention loss. Second, OPD from a degraded SFT teacher surpasses that teacher on GSM8K, TruthfulQA, and MMLU, despite using the teacher as its only supervision source. Third, a lightweight on-policy RL run improves GSM8K while preserving retention. These results support a state-centric view of post-training: the source and locality of training states can be as important as the form of the supervision signal.
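The distinction between fixed dataset states and learner-induced states can be illustrated with a toy sketch. It is not the paper's implementation: `toy_policy`, the tiny vocabulary, and the single example prompt are all hypothetical. A state is represented as a `(prompt, prefix)` pair, following the abstract's definition of a state as a prompt plus generated prefix.

```python
import random


def prefix_states(prompt, tokens):
    """All (prompt, prefix) states visited while emitting `tokens`."""
    return [(prompt, tuple(tokens[:t])) for t in range(len(tokens))]


def sft_states(dataset):
    """Fixed dataset states: prefixes of the reference responses."""
    states = []
    for prompt, reference in dataset:
        states.extend(prefix_states(prompt, reference))
    return states


def on_policy_states(prompts, policy, max_len, rng):
    """Learner-induced states: prefixes of the learner's own rollouts."""
    states = []
    for prompt in prompts:
        prefix = []
        for _ in range(max_len):
            # Record the state before the learner picks its next token.
            states.append((prompt, tuple(prefix)))
            token = policy(prompt, prefix, rng)
            if token == "<eos>":
                break
            prefix.append(token)
    return states


def toy_policy(prompt, prefix, rng):
    """Hypothetical learner: samples uniformly from a tiny vocabulary."""
    return rng.choice(["2", "+", "4", "<eos>"])


# Hypothetical one-example dataset of (prompt, reference tokens).
dataset = [("What is 2+2?", ["2", "+", "2", "=", "4"])]
rng = random.Random(0)

fixed = sft_states(dataset)
induced = on_policy_states([p for p, _ in dataset], toy_policy, max_len=5, rng=rng)

print(len(fixed), "dataset states;", len(induced), "learner-induced states")
```

In this sketch the supervision signal is left out entirely: only the set of states changes between the two functions. The dataset states stay the same however the learner behaves, whereas the induced states shift whenever the learner's policy changes.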