arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.22823 2026-05-22 cs.CV

Which Way Did It Move? Diagnosing and Overcoming Directional Motion Blindness in Video-LLMs

Jongseo Lee, Hyuntak Lee, Sunghun Kim, Sooa Kim, Jihoon Chung, Jinwoo Choi

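AI summary: This paper studies a blind spot of video large language models in understanding directional motion. It proposes the MoDirect dataset family and the DeltaDirect objective, which improve the models' perception of motion direction and substantially raise direction-recognition accuracy on both synthetic and real-world scenes.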

Comments
Preprint. 59 pages, including appendix. Code: https://github.com/KHU-VLL/DeltaDirect

Abstract

Video Large Language Models (Video-LLMs) have made rapid progress on temporal video understanding, yet many fail at a basic perceptual primitive: signed image-plane motion direction. On simple videos of a single object moving left, right, up, or down, most Video-LLMs perform near chance, with above-chance cases largely attributable to prediction biases rather than genuine direction understanding. We call this failure directional motion blindness. We localize the failure by tracing motion direction information through the Video-LLM pipeline. Motion direction remains linearly accessible from the vision encoder, projector, and LLM hidden states, but the readout fails to bind this signal to the correct verbal answer option, revealing a direction binding gap. Although synthetic motion direction instruction tuning reduces this gap on the source domain, motion direction concept vector analysis shows that visual complexity weakens the signal magnitude and limits out-of-domain generalization. We introduce MoDirect, a dataset family for motion direction instruction tuning and evaluation, and DeltaDirect, a diagnosis-driven, projector-level objective that predicts normalized 2-D motion vectors from adjacent-frame feature deltas. On MoDirect-SynBench, instruction tuning with DeltaDirect improves motion direction accuracy from 25.9% to 85.4%. On MoDirect-RealBench, DeltaDirect improves real-world motion direction accuracy by 21.9 points over the vanilla baseline without real-world tuning data, while preserving standard video-understanding performance. Code: https://github.com/KHU-VLL/DeltaDirect
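
Editor's sketch (not from the paper): the abstract describes an objective that predicts normalized 2-D motion vectors from adjacent-frame feature deltas. The snippet below illustrates that idea under strong assumptions: per-frame features are plain vectors, the head is a single linear map, and the loss is one minus cosine similarity. The names `frame_deltas`, `predict_motion`, and `direction_loss` are hypothetical and do not come from the DeltaDirect code.

```python
import math

def frame_deltas(features):
    """Differences between adjacent per-frame feature vectors."""
    return [[b - a for a, b in zip(f0, f1)]
            for f0, f1 in zip(features, features[1:])]

def normalize(v, eps=1e-8):
    """Scale a vector to (approximately) unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / (n + eps) for x in v]

def predict_motion(delta, weight):
    """Hypothetical linear head: maps a feature delta to a 2-D vector (dx, dy)."""
    return [sum(w * d for w, d in zip(row, delta)) for row in weight]

def direction_loss(pred, target):
    """Assumed loss: 1 - cosine similarity between predicted and true motion."""
    p, t = normalize(pred), normalize(target)
    return 1.0 - sum(a * b for a, b in zip(p, t))

# Toy input: 3 frames with 4-D features; the object moves right (dx=+1, dy=0).
features = [[0.0, 0.0, 1.0, 0.0],
            [0.5, 0.0, 1.0, 0.0],
            [1.0, 0.0, 1.0, 0.0]]
weight = [[1.0, 0.0, 0.0, 0.0],   # dx is read from feature channel 0
          [0.0, 1.0, 0.0, 0.0]]   # dy is read from feature channel 1
losses = [direction_loss(predict_motion(d, weight), [1.0, 0.0])
          for d in frame_deltas(features)]
```

With these toy weights the per-pair losses are close to zero, while a prediction pointing the opposite way gives a loss close to 2, which is the signed-direction distinction the abstract is concerned with.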

2605.22821 2026-05-22 cs.CL cs.LG

Tokenisation via Convex Relaxations

Jan Tempus, Philip Whittington, Craig W. Schmidt, Dennis Komm, Tiago Pimentel

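AI summary: This paper proposes ConvexTok, a tokenisation method based on convex relaxations. It formulates tokeniser construction as a linear program solved with convex optimisation tools, improving intrinsic tokenisation metrics and language-model bits-per-byte, and also improving downstream task performance.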

Abstract

Tokenisation is an integral part of the current NLP pipeline. Current tokenisation algorithms such as BPE and Unigram are greedy algorithms -- they make locally optimal decisions without considering the resulting vocabulary as a whole. We instead formulate tokeniser construction as a linear program and solve it using convex optimisation tools, yielding a new algorithm we call ConvexTok. We find ConvexTok consistently improves intrinsic tokenisation metrics and the bits-per-byte (BpB) achieved by language models; it also improves downstream task performance, but less consistently. Furthermore, ConvexTok allows the user to certify how far their tokeniser is from optimal, with respect to a certain objective, via a lower bound, and we empirically find it to be within 1% of optimal at common vocabulary sizes.
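
Editor's sketch (not from the paper): bits-per-byte is useful for comparing tokenisers because it normalises a language model's loss by the byte length of the text rather than by the token count. The snippet assumes the common definition (total negative log-likelihood converted to bits, divided by the number of UTF-8 bytes); whether the paper uses exactly this convention is an assumption, and `bits_per_byte` is a hypothetical helper name.

```python
import math

def bits_per_byte(token_nll_nats, text):
    """Total negative log-likelihood (given in nats per token) converted to bits,
    divided by the UTF-8 byte length of the text the tokens encode."""
    total_bits = sum(token_nll_nats) / math.log(2)
    return total_bits / len(text.encode("utf-8"))

# Toy case: 8 tokens, each costing exactly 1 bit (ln 2 nats), over a 4-byte string.
example_bpb = bits_per_byte([math.log(2)] * 8, "abcd")
```

Because the denominator depends only on the text, two tokenisers that split the same string into different numbers of tokens remain directly comparable.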

2605.22820 2026-05-22 cs.LG

Integrable Elasticity via Neural Demand Potentials

Carlos Heredia, Daniel Roncel

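AI summary: This paper proposes ICDN, a demand-first neural model for multiproduct retail demand. The model learns log-demand as a smooth, context-conditioned function of log-prices, so that elasticities can be derived exactly. On the Dominick's beer dataset, ICDN generalizes better out of sample than a directed log-log benchmark and yields more stable, economically plausible elasticity estimates, especially for weakly identified cross-price effects.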

Comments
44 pages, 7 figures

Abstract

We propose the Integrable Context-Dependent Demand Network (ICDN), a demand-first neural model for multiproduct retail demand. The model learns log-demand as a smooth, context-conditioned function of log-prices, allowing elasticities to be derived exactly from the learned demand surface. On the Dominick's beer dataset, ICDN improves out-of-sample generalization over a directed log-log benchmark and yields more stable, economically plausible elasticity estimates, especially for weakly identified cross-price effects.
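
Editor's sketch (not from the paper): when log-demand is modelled as a function of log-prices, the price elasticity of product i with respect to price j is the partial derivative of log q_i with respect to log p_j. The snippet illustrates this with a toy two-product log-linear demand function standing in for a learned network, and it uses central finite differences rather than the exact differentiation the abstract refers to. The function names and coefficients are hypothetical.

```python
import math

def log_demand(log_prices):
    """Toy stand-in for a learned demand surface: log-demand of 2 products."""
    lp1, lp2 = log_prices
    return [2.0 - 1.5 * lp1 + 0.3 * lp2,
            1.0 + 0.2 * lp1 - 1.2 * lp2]

def elasticity_matrix(f, log_prices, h=1e-5):
    """E[i][j] = d log q_i / d log p_j, estimated by central finite differences."""
    n = len(log_prices)
    m = len(f(log_prices))
    E = [[0.0] * n for _ in range(m)]
    for j in range(n):
        up = list(log_prices)
        dn = list(log_prices)
        up[j] += h
        dn[j] -= h
        f_up, f_dn = f(up), f(dn)
        for i in range(m):
            E[i][j] = (f_up[i] - f_dn[i]) / (2 * h)
    return E

# Elasticities evaluated at prices p1 = 4.0 and p2 = 5.0.
E = elasticity_matrix(log_demand, [math.log(4.0), math.log(5.0)])
```

The diagonal entries are own-price elasticities and the off-diagonal entries are cross-price elasticities; for this toy surface they simply recover the chosen coefficients.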

2605.22819 2026-05-22 cs.CV

Cambrian-P: Pose-Grounded Video Understanding

Jihan Yang, Zifan Zhao, Xichen Pan, Shusheng Yang, Junyi Zhang, Bingyi Kang, Hu Xu, Saining Xie

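AI summary: This work proposes Cambrian-P, a video multimodal LLM augmented with learnable camera tokens and a pose regression head. Using pose as a lightweight supervisory signal, it substantially improves spatial reasoning, generalizes across additional spatial and general video QA benchmarks, and achieves state-of-the-art streaming pose estimation on ScanNet.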

Comments
Project Page: https://cambrian-mllm.github.io/

Abstract

Camera pose matters. The position and orientation of each viewpoint define a shared spatial coordinate frame that relates observations across video frames. Yet this signal is largely absent from multimodal LLMs (MLLMs) for video understanding, which process frames as isolated 2D snapshots, instead of the persistent scene humans perceive. We revisit pose as a lightweight supervisory signal and introduce Cambrian-P, a video MLLM augmented with per-frame learnable camera tokens and a pose regression head. With a carefully designed sampling scheme, the model achieves substantial gains of 4.5-6.5% on spatial reasoning benchmarks such as VSI-Bench, generalizes across eight additional spatial and general video QA benchmarks, and, as a byproduct, achieves state of the art streaming pose estimation on ScanNet. Surprisingly, training on pseudo-annotated poses from in-the-wild video further improves general video QA benchmarks, showing pose helps beyond spatial reasoning. Together, these results position camera pose as a fundamental signal for video models that reason about the physical world.

2605.22818 2026-05-22 cs.CV

MotiMotion: Motion-Controlled Video Generation with Visual Reasoning

Lee Hsin-Ying, Hanwen Jiang, Yiqun Mei, Jing Shi, Ming-Hsuan Yang, Zhixin Shu

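AI summary: This paper proposes MotiMotion, a framework that recasts motion control as a reasoning-then-generation problem to improve causal and commonsense consistency in video generation. It introduces a training-free vision-language reasoner and a confidence-aware control scheme, and validates the plausibility of the generated object behaviors and interactions on the MotiBench benchmark.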

Comments
ICML 2026. Project page: https://motimotion.github.io/

Abstract

Current motion-controlled image-to-video generation models rigidly follow user-provided trajectories that are often sparse, imprecise, and causally incomplete. Such reliance often yields unnatural or implausible outcomes, especially by missing secondary causal consequences. To address this, we introduce MotiMotion, a novel framework that reformulates motion control as a reasoning-then-generation problem. To encourage causally grounded and commonsense-consistent interactions, we leverage a training-free vision-language reasoner to refine image-space coordinates of primary trajectories and to hallucinate plausible secondary motions. To further improve motion naturalness, we propose a confidence-aware control scheme that modulates guidance strength, enabling the model to closely follow high-confidence plans while correcting artifacts under low-confidence inputs with its internal generative priors. To support systematic evaluation, we curate a new image-to-video benchmark, MotiBench, consisting of interaction-centric scenes where new events are triggered by motion. Both VLM-based evaluation and a human study on MotiBench demonstrate that MotiMotion produces videos with more plausible object behaviors and interaction, and is preferred over existing approaches.

2605.22817 2026-05-22 cs.LG cs.AI cs.CL cs.NE

Vector Policy Optimization: Training for Diversity Improves Test-Time Search

Ryan Bahlous-Boldi, Isha Puri, Idan Shenfeld, Akarsh Kumar, Mehul Damani, Sebastian Risi, Omar Khattab, Zhang-Wei Hong, Pulkit Agrawal

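AI summary: This paper proposes Vector Policy Optimization (VPO), which trains policies to anticipate diverse downstream reward functions and thereby produce diverse solutions, improving test-time search performance.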

Comments
24 pages

Abstract

Language models must now generalize out of the box to novel environments and work inside inference-scaling search procedures, such as AlphaEvolve, that select rollouts with a variety of task-specific reward functions. Unfortunately, the standard paradigm of LLM post-training optimizes a pre-specified scalar reward, often leading current LLMs to produce low-entropy response distributions and thus to struggle at displaying the diversity that inference-time search will require. We propose Vector Policy Optimization (VPO), an RL algorithm that explicitly trains policies to anticipate diverse downstream reward functions and to produce diverse solutions. VPO exploits that rewards are often vector-valued in practice, like per-test-case correctness in code generation or, say, multiple different user personas or reward models. VPO is essentially a drop-in replacement for the GRPO advantage estimator, but it trains the LLM to output a set of solutions where individual solutions specialize to different trade-offs in the vector reward space. Across four tasks, VPO matches or beats the strongest scalar RL baselines on test-time search (e.g. pass@k and best@k), with the gap widening as the search budget grows. For evolutionary search, VPO models unlock problems that GRPO models cannot solve at all. As test-time search becomes more standardized, optimizing for diversity may need to become the default post-training objective.
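
Editor's sketch (not from the paper): the abstract contrasts a scalar reward with vector-valued rewards such as per-test-case correctness. The snippet shows a GRPO-style group-normalised advantage on scalar rewards and how the same rollouts rank differently once a reward vector is scalarised under different weightings. It is not VPO's estimator; `scalarize`, the weight vectors, and the use of the population standard deviation are illustrative assumptions.

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Group-normalised advantage: (r - group mean) / (group std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations differ
    return [(r - mean) / (std + eps) for r in rewards]

def scalarize(reward_vectors, weights):
    """Collapse each reward vector into a scalar with a fixed weight vector."""
    return [sum(w * r for w, r in zip(weights, vec)) for vec in reward_vectors]

# Four rollouts scored on three test cases (1 = pass, 0 = fail).
reward_vectors = [[1, 0, 0], [0, 1, 1], [1, 1, 0], [0, 0, 0]]

# The same rollouts under two different trade-offs in reward space.
adv_uniform = grpo_advantages(scalarize(reward_vectors, [1 / 3, 1 / 3, 1 / 3]))
adv_case_one = grpo_advantages(scalarize(reward_vectors, [1.0, 0.0, 0.0]))
```

Rollout 0 is below the group average under uniform weights but above it when only the first test case matters, which is the kind of trade-off a single pre-specified scalar reward cannot express.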

2605.22816 2026-05-22 cs.RO cs.CV

AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation

Wenxuan Guo, Xiuwei Xu, Yichen Liu, Xiangyu Li, Hang Yin, Huangxing Chen, Wenzhao Zheng, Jianjiang Feng, Jie Zhou, Jiwen Lu

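AI summary: This paper proposes AwareVLN, a framework that performs end-to-end vision-language navigation with a self-aware reasoning mechanism. It addresses prior methods' lack of explicit understanding of the relationships between the agent, the instruction, and the scene, and outperforms existing methods on multiple datasets.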

Comments
Accepted to CVPR 2026. Project page: https://gwxuan.github.io/AwareVLN/

Abstract

Vision-and-Language Navigation (VLN) requires an agent to ground language instructions to its own movement within a visual environment. While state-of-the-art methods leverage the reasoning capabilities of Vision-Language Models (VLMs) for end-to-end action prediction, they often lack an explicit and explainable understanding of the relationships between the agent, the instruction, and the scene. Conversely, explicitly building a scene map for heuristic planning is intuitively appealing but relies on additional 3D sensors and hinders large-scale vision-language pre-training. To bridge this gap, we propose AwareVLN, a novel framework that equips the navigation model with a self-aware reasoning mechanism, enabling it to understand the agent's state and task progress in a fully end-to-end and data-driven manner. Our approach features two key innovations: (1) a structural reasoning module that fosters spatial and task-oriented self-awareness, and (2) an automatic data engine with progress division for effective training. Extensive experiments on various datasets in Habitat simulator show our AwareVLN significantly outperforms previous state-of-the-art vision-language navigation methods. Project page: https://gwxuan.github.io/AwareVLN/.

2605.22814 2026-05-22 cs.LG

Remember to be Curious: Episodic Context and Persistent Worlds for 3D Exploration

Lily Goli, Justin Kerr, Daniele Reda, Alec Jacobson, Andrea Tagliasacchi, Angjoo Kanazawa

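AI summary: This work proposes a curiosity-driven reinforcement learning approach that tackles exploration in sparse-reward, long-horizon 3D tasks by combining a persistent world model with episodic context. Trained on HM3D, it outperforms RL-based active mapping baselines and generalizes to Gibson and AI-generated worlds.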

Abstract

Exploration is a prerequisite for learning useful behaviors in sparse-reward, long-horizon tasks, particularly within 3D environments. Curiosity-driven reinforcement learning addresses this via intrinsic rewards derived from the mismatch between the agent's predictive model of the world and reality. However, translating this intrinsic motivation to complex, photorealistic environments remains difficult, as agents can become trapped in local loops and receive fresh rewards for revisiting forgotten states. In this work, we demonstrate that this failure stems from a lack of spatial persistence and episodic context. We show that effective curiosity requires a model of the world that is persistent and continuously updated, paired with an agent that maintains an episodic trajectory history to navigate toward novel regions. We achieve this using an online 3D reconstruction as a persistent model of the world, while the agent policy is parameterized as a sequence model over RGB observations to maintain episodic context. This design enables effective exploration during training while allowing the agent to navigate using solely RGB frames at deployment. Trained purely via curiosity on HM3D, our agent outperforms RL-based active mapping baselines and generalizes zero-shot to Gibson and AI-generated worlds. Our end-to-end policy enables efficient adaptation to downstream tasks, such as apple picking and image-goal navigation, outperforming from-scratch baselines. Please see video results at https://recuriosity.github.io/.

2605.22813 2026-05-22 cs.DS

Optimal Testing of Reed-Muller Codes with an Online Adversary

Esty Kelman, Uri Meir, Kai Zhe Zheng

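AI summary: This paper introduces semi-sample-based testers that achieve optimal query complexity for testing Reed-Muller codes in the online-erasure model, improving on the work of Minzer and Zheng, and gives the first known testers for lifted affine-invariant codes in that model.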

Abstract

Motivated by applications to property testing in the online-erasure model of Kalemaj, Raskhodnikova, and Varma (ITCS 2022 and Theory of Computing 2023), we define and analyze semi-sample-based testers for Reed-Muller codes. The task in Reed-Muller testing is to determine whether an input function $f: \mathbb{F}^n \to \mathbb{F}$ belongs to the Reed-Muller code or is far from it, using as few point queries to $f$ as possible. Reed-Muller testing is a well-studied task with its roots in both the Property Testing and Probabilistically Checkable Proofs literature. The online-erasure model introduces a twist: after each query made, an adversary may erase up to $t$ points of the input function, potentially thwarting any test in which the queries follow a predictable pattern. Semi-sample-based testers are a hybrid between sample-based testers -- which can only make uniformly random queries to the input function -- and standard testers, which can choose their queries freely. They are designed with the online-erasure model in mind and operate by first choosing some subset $S$ of the domain and then making their queries uniformly at random inside of $S$. We describe semi-sample-based testers for the Reed-Muller code and give an optimal analysis of their soundness. Consequently, we show that semi-sample-based testers are indeed effective in the presence of online erasures, and thereby achieve optimal query complexity for testing the Reed-Muller code in the online-erasure model. This result improves upon prior work of Minzer and Zheng (SODA 2024). As an added bonus, we show that semi-sample-based testers also exist for the lifted affine-invariant codes of Guo, Kopparty, and Sudan (ITCS 2013), thereby providing the first known testers for these codes in the online-erasure model.

2605.22812 2026-05-22 cs.RO cs.CV

GesVLA: Gesture-Aware Vision-Language-Action Model Embedded Representations

Wenxuan Guo, Ziyuan Li, Meng Zhang, Yichen Liu, Yimeng Dong, Chuxi Xu, Yunfei Wei, Ze Chen, Erjin Zhou, Jianjiang Feng

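AI summary: This paper proposes GesVLA, which introduces gesture as a parallel instruction modality to resolve the spatial ambiguity that existing VLA systems face in complex scenes. It adopts a dual-VLM architecture to tightly couple gesture representations with action policies, and uses a gesture data generation pipeline and a two-stage training strategy to improve target grounding accuracy and human-robot interaction efficiency.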

Comments
Project page: https://gwxuan.github.io/GesVLA/

Abstract

Vision-Language-Action (VLA) models have shown strong potential for general-purpose robot manipulation by unifying perception and action. However, existing VLA systems primarily rely on textual instructions and struggle to resolve spatial ambiguity in complex scenes with multiple similar objects. To address this limitation, we introduce gesture as a parallel instruction modality and propose a Gesture-aware Vision-Language-Action model (GesVLA). Our approach encodes gesture features directly into the latent space, enabling them to participate in both high-level reasoning and low-level action generation, and adopts a dual-VLM architecture to achieve tight coupling between gesture representations and action policies. At the data level, we construct a scalable gesture data generation pipeline by rendering hand models onto real-world scene images. This reduces the sim-to-real visual gap while producing rich data with diverse motion patterns and corresponding pointing annotations. In addition, we employ a two-stage training strategy to equip the model with both gesture perception and action prediction capabilities. We evaluate our approach on multiple real-world robotic tasks, including a controlled block manipulation task for validation and more practical scenarios such as product and produce selection. Experimental results show that incorporating gesture consistently improves target grounding accuracy and human-robot interaction efficiency, especially in complex and cluttered environments. Project page: https://gwxuan.github.io/GesVLA/.

2605.22811 2026-05-22 cs.DB

GS-QA: A Benchmark for Geospatial Question Answering

Majid Saeedan, Muhammad Shihab Rashid, Ahmed Eldawy, Vagelis Hristidis

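AI summary: This paper presents GS-QA, a benchmark for evaluating geospatial question answering systems. It covers a wide range of spatial objects, predicates, and answer types, includes questions that combine information from multiple sources, and exposes the limitations of existing methods on complex spatial predicates and multi-source reasoning.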

Abstract

Recent advances in Large Language Models (LLMs) have led to dramatic improvements in question answering (QA). To address the challenge of evaluating QA systems, standardized benchmarks have been introduced. This work focuses on the problem of geospatial QA, where a large collection of geospatial data is available in the form of a spatial database or other forms. Existing work on geospatial QA benchmarks has various limitations, including a small number of questions, limited spatial predicates, narrow output types, and no multi-source reasoning. We present GS-QA, an extensible geospatial QA benchmark with 2,800 question-answer pairs across 28 templates on top of OpenStreetMap and Wikipedia data, covering a wide range of spatial objects, predicates (including directional and towards filtering), and answer types (entity names, locations, distances, directions, counts, and aggregated areas/lengths). A key feature of GS-QA is that some questions require combining information from multiple sources, e.g., geospatial information from OSM and factual information from Wikipedia. GS-QA includes a comprehensive evaluation methodology that combines text-based QA measures with geospatial-specific measures such as distance error and angular error. We implemented nine LLM-based geospatial QA baselines using three LLMs (GPT-4o, Claude Sonnet 4.6, and Ministral-3) with combinations of direct prompting, retrieval-augmented generation, and text-to-SQL. Our results show that existing solutions perform reasonably well on simple spatial predicates with entity name outputs, but accuracy degrades significantly for questions involving complex spatial predicates, numeric output types, and multi-source reasoning, demonstrating that geospatial QA remains a challenging open problem warranting further research.
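
Editor's sketch (not from the paper): the abstract mentions geospatial-specific measures such as distance error and angular error without giving formulas. The snippet shows one common way to compute them, assuming great-circle (haversine) distance between a predicted and a reference location, and the smallest absolute difference between two compass bearings. Whether GS-QA defines the measures this way is an assumption; the function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def angular_error_deg(pred_bearing, true_bearing):
    """Smallest absolute difference between two bearings, in [0, 180] degrees."""
    diff = abs(pred_bearing - true_bearing) % 360.0
    return min(diff, 360.0 - diff)

# A predicted bearing of 350 degrees against a true bearing of 10 degrees
# is off by 20 degrees, not 340, because bearings wrap around at 360.
wrap_error = angular_error_deg(350.0, 10.0)
```

The wrap-around handling is the main reason a dedicated angular measure is needed instead of a plain numeric difference.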

2605.22804 2026-05-22 cs.DS cs.CC

On the Parameterized Complexity of Min-Sum-Radii

Pankaj Kumar, Haiko Müller, Sebastian Ordyniak, Melanie Schmidt

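AI summary: This paper studies the parameterized complexity of the Min-Sum-Radii problem on metrics induced by undirected graphs, proves W[1]-hardness under several parameterizations, and shows fixed-parameter tractability when parameterized by treewidth plus Delta.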

Comments
Accepted for SWAT 2026

Abstract

In the Min-Sum-Radii (MSR) clustering problem, we are given a finite set X of n points in a metric space. The objective is to find at most k clusters centered at a subset of these points such that every point of X is assigned to one of the clusters, minimizing the sum of the radii of the clusters. The problem is known to be NP-hard even on metrics induced by weighted planar graphs and metrics with constant doubling dimension, as shown by Gibson et al. (SWAT 2008). In this work, we investigate the parameterized complexity of MSR on metrics induced by undirected graphs. We distinguish between weighted graph metrics (with positive edge weights) and unweighted graph metrics (where all edges have unit weight). Weighted Graph Metrics: We show that MSR is W[1]-hard on metrics induced by weighted bipartite graphs, when parameterized by the combined parameter k (the number of clusters) and Delta (the cost of the clustering). We then investigate the structural parameterized complexity of the problem. Drexler et al. (arXiv:2310.02130) showed that the MSR problem admits an XP algorithm on metrics induced by weighted graphs when parameterized by treewidth, and asked whether this can be improved to fixed-parameter tractability. We first answer their question in the negative, and more strongly show that MSR stays W[1]-hard on metrics induced by undirected weighted bipartite graphs when parameterized by the vertex cover number plus k. We then turn our attention to parameters for dense graphs and show that MSR remains W[1]-hard when parameterized by k+Delta even on cliques and complete bipartite graphs. On the positive side, we employ the known XP algorithm parameterized by treewidth, to show that the MSR problem is FPT when parameterized by the parameter treewidth plus Delta.
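
Editor's sketch (not from the paper): to make the Min-Sum-Radii objective concrete, the snippet below solves tiny instances by exhaustive search. It tries every set of at most k centres and every combination of candidate radii (a centre's distances to the points), and keeps the cheapest combination that covers all points. This brute force is exponential and only meant to illustrate the definition; it is unrelated to the algorithms and hardness proofs in the paper.

```python
from itertools import combinations, product

def min_sum_radii(dist, k):
    """Exhaustive Min-Sum-Radii on a small metric given as a distance matrix."""
    n = len(dist)
    best = float("inf")
    for size in range(1, k + 1):
        for centers in combinations(range(n), size):
            # Candidate radii for a centre: its distances to the points (0 included).
            options = [sorted(set(dist[c])) for c in centers]
            for radii in product(*options):
                total = sum(radii)
                if total >= best:
                    continue
                covered = all(
                    any(dist[c][x] <= r for c, r in zip(centers, radii))
                    for x in range(n)
                )
                if covered:
                    best = total
    return best

# Five points on a line; distances are absolute differences.
points = [0, 1, 10, 11, 30]
dist = [[abs(a - b) for b in points] for a in points]
costs = {k: min_sum_radii(dist, k) for k in (1, 2, 3)}
```

On this instance the optimal cost drops from 19 with one cluster to 10 with two and 2 with three, showing how the cluster budget k and the cost Delta interact.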

2605.22791 2026-05-22 cs.AI

Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

Ali Hatamizadeh, Yejin Choi, Jan Kautz

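AI summary: This paper proposes Gated DeltaNet-2, which decouples the control of erasing and writing in linear attention through a channel-wise erase gate and a channel-wise write gate. It reports the strongest overall results across language modeling, commonsense reasoning, and retrieval, with the most pronounced gains on long-context retrieval.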

Comments
Gated DeltaNet-2 technical report; code at https://github.com/NVlabs/GatedDeltaNet-2

Abstract

Linear attention replaces the unbounded cache of softmax attention with a fixed-size recurrent state, reducing sequence mixing to linear time and decoding to constant memory. The hard part is not just what to forget, but how to edit this compressed memory without scrambling existing associations. Delta-rule models subtract the current read before writing a new value, and Kimi Delta Attention (KDA) sharpens forgetting with channel-wise decay. But the active edit still uses a single scalar gate to control two different things: how much old content to erase on the key side and how much new content to commit on the value side. We introduce Gated DeltaNet-2, which generalizes both Gated DeltaNet and KDA by inheriting adaptive forgetting and channel-wise decay while addressing their shared limitation, the scalar tie between erasing and writing. Gated Delta Rule-2 separates these roles with a channel-wise erase gate b_t and a channel-wise write gate w_t, reducing to KDA when both gates collapse to the same scalar and to Gated DeltaNet when the decay also collapses. We derive a fast-weight update view, a chunkwise WY algorithm with channel-wise decay absorbed into asymmetric erase factors, and a gate-aware backward pass that preserves efficient parallel training. At 1.3B parameters trained on 100B FineWeb-Edu tokens, Gated DeltaNet-2 achieves the strongest overall results among Mamba-2, Gated DeltaNet, KDA, and Mamba-3 variants across language modeling, commonsense reasoning, and retrieval. Its advantage is most pronounced on long-context RULER needle-in-a-haystack benchmarks, where it improves the evaluated multi-key retrieval setting and remains strong in both recurrent and hybrid settings. Code is available at https://github.com/NVlabs/GatedDeltaNet-2.
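
Editor's sketch (not from the paper): the abstract describes a recurrent state edited with a channel-wise decay, a channel-wise erase gate b_t on the key side, and a channel-wise write gate w_t on the value side. The snippet below is one plausible single-step form, S' = S Diag(decay) - (S Diag(decay) (b * k)) k^T + (w * v) k^T, chosen only because it reduces to the usual delta rule when both gates equal the same scalar. The exact update in Gated DeltaNet-2 is not given in the abstract, so this form and the name `gated_delta_step` are assumptions.

```python
def gated_delta_step(S, k, v, decay, erase, write):
    """One recurrent update of a d_v x d_k state S (a list of rows).

    decay, erase: one value per key channel; write: one value per value channel.
    """
    d_v, d_k = len(S), len(k)
    # 1) Channel-wise decay of the old state along the key dimension.
    S = [[S[i][j] * decay[j] for j in range(d_k)] for i in range(d_v)]
    # 2) Read the old content using the erase-gated key.
    read = [sum(S[i][j] * erase[j] * k[j] for j in range(d_k)) for i in range(d_v)]
    # 3) Subtract the read along k, then add the write-gated value along k.
    return [[S[i][j] - read[i] * k[j] + write[i] * v[i] * k[j] for j in range(d_k)]
            for i in range(d_v)]

# The state stores an old value [3, 4] under the unit key k; the new value is [5, 6].
S0 = [[3.0, 0.0], [4.0, 0.0]]
k, v, no_decay = [1.0, 0.0], [5.0, 6.0], [1.0, 1.0]

overwrite = gated_delta_step(S0, k, v, no_decay, erase=[1.0, 1.0], write=[1.0, 1.0])
erase_only = gated_delta_step(S0, k, v, no_decay, erase=[1.0, 1.0], write=[0.0, 0.0])
write_only = gated_delta_step(S0, k, v, no_decay, erase=[0.0, 0.0], write=[1.0, 1.0])
```

With a single shared scalar gate, erasing and writing always happen in the same proportion; with separate gates the three cases above (replace, erase without writing, write without erasing) become independently reachable.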

2605.22786 2026-05-22 cs.AI cs.ET cs.LG cs.MA

LCGuard: Latent Communication Guard for Safe KV Sharing in Multi-Agent Systems

Sadia Asif, Mohammad Mohammadi Amiri, Momin Abbas, Prasanna Sattigeri, Karthikeyan Natesan Ramamurthy

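AI summary: This paper proposes LCGuard, a framework that learns representation-level transformations applied to KV caches before they are shared across agents, in order to prevent leakage of sensitive information. Evaluations across multiple model families and multi-agent benchmarks show that it reduces reconstruction attack success rates while maintaining task performance.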

Abstract

Large language model (LLM)-based multi-agent systems increasingly rely on intermediate communication to coordinate complex tasks. While most existing systems communicate through natural language, recent work shows that latent communication, particularly through transformer key-value (KV) caches, can improve efficiency and preserve richer task-relevant information. However, KV caches also encode contextual inputs, intermediate reasoning states, and agent-specific information, creating an opaque channel through which sensitive content may propagate across agents without explicit textual disclosure. To address this, we introduce LCGuard (Latent Communication Guard), a framework for safe KV-based latent communication in multi-agent LLM systems. LCGuard treats shared KV caches as latent working memory and learns representation-level transformations before cache artifacts are transmitted across agents. We formalize representation-level sensitive information leakage operationally through reconstruction: a shared cache artifact is unsafe if an adversarial decoder can recover agent-specific sensitive inputs from it. This leads to an adversarial training formulation in which the adversary learns to reconstruct sensitive inputs, while LCGuard learns transformations that preserve task-relevant semantics and reduce reconstructable information. Empirical evaluations across multiple model families and multi-agent benchmarks show that LCGuard consistently reduces reconstruction-based leakage and attack success rates while maintaining competitive task performance compared to standard KV-sharing baselines.

2605.22785 2026-05-22 cs.CL

Evaluating Commercial AI Chatbots as News Intermediaries

Mirac Suzgun, Emily Shen, Federico Bianchi, Alexander Spangher, Thomas Icard, Daniel E. Ho, Dan Jurafsky, James Zou

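AI summary: This study evaluates how accurately AI chatbots handle emerging facts across languages and regions. The systems perform well on multiple-choice questions but lose substantial accuracy under free-response evaluation and on questions containing false premises, revealing regional inequity and heavy dependence on retrieval infrastructure.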

Comments
https://suzgunmirac.github.io/ai-news-preview/

Abstract

AI chatbots are rapidly shaping how people encounter the news, yet no prior study has systematically measured how accurately these systems, with their proprietary search integrations and retrieval-synthesis pipelines, handle emerging facts across languages and regions. We present a 14-day (February 9-22, 2026) evaluation of six AI chatbots (Gemini 3 Flash and Pro, Grok 4, Claude 4.5 Sonnet, GPT-5, and GPT-4o mini) on 2,100 factual questions derived from same-day BBC News reporting across six regional services (US & Canada, Arabic, Afrique, Hindi, Russian, Turkish). The best systems achieve over 90% multiple-choice accuracy on questions about events reported hours earlier. The same systems, however, lose 11-13% under free-response evaluation, and 16-17% across the cohort. We further characterize three failure patterns. First, every model achieves its lowest accuracy on Hindi (79% vs. 89-91% elsewhere), and citations indicate an Anglophone retrieval bias (e.g., models answering Hindi queries cite English Wikipedia more than any Hindi outlet). Second, retrieval failures, not reasoning failures, drive over 70% of all errors. When models retrieve a correct source, they often extract the correct answer; the problem is landing on the right source in the first place. Third, models achieving 88-96% accuracy on well-formed questions drop to 19-70% when questions contain subtle false premises, with the most vulnerable model accepting fabricated facts 64% of the time. We also identify a detection-accuracy paradox: the best false-premise detector ranks second in adversarial accuracy (abstention rate), while a weaker detector ranks first, showing that premise detection and answer recovery are partially independent capabilities. Overall, these results suggest that high accuracy can mask systematic regional inequity, near-total dependence on retrieval infrastructure, and vulnerability to the imperfect queries that real users pose.

2605.22779 2026-05-22 cs.SE cs.LG

FAME: Failure-Aware Mixture-of-Experts for Message-Level Log Anomaly Detection

FAME:面向失败的混合专家模型用于消息级日志异常检测

Huanchi Wang, Zihang Huang, Yifang Tian, Kristina Dzeparoska, Hans-Arno Jacobsen, Alberto Leon-Garcia

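AI Summary: This paper proposes FAME, a failure-aware mixture-of-experts model for message-level log anomaly detection. It trains a lightweight router and domain experts from a small amount of labeled data for efficient anomaly detection, and achieves high F1 scores on the BGL and Thunderbird datasets.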

详情
Comments
12 pages, 5 figures
AI中文摘要

生产系统每天生成数百万条日志行,但大多数异常检测器在会话或窗口级别工作,标记的是行组而非特定消息。这种粗粒度迫使操作员每条警报都要检查许多常规行。消息级检测提供更细粒度,但仍然具有挑战性。一个事件模板可能对应正常和异常消息,故障源于异构子系统,大规模行级标注不切实际。尽管大型语言模型(LLMs)可以推断日志语义,但将其应用于每条行对于持续监控来说成本太高。我们提出了FAME(Failure-Aware Mixture-of-Experts),一种标签高效的面向消息级的混合专家框架,该框架仅在离线时使用LLM一次。我们最多为每个模板标注K条标注行以推导二元正常/异常指标和代表性示例。LLM提出将模板划分为故障领域,并通过认证步骤验证该提议后再进行训练。FAME训练了一个轻量级路由器和领域专家,这些专家在本地运行,并输出异常预测和故障领域标签。在BGL上,FAME在K=100时达到F1=98.16,将标注工作量减少76倍,并检测出86.3%的未见过的EventIDs异常。在Thunderbird上,FAME达到F1=99.95,具有完美的召回率。

英文摘要

Production systems generate millions of log lines daily, yet most anomaly detectors operate at the session or window level, flagging groups of lines rather than identifying the specific message responsible. This coarse granularity forces operators to inspect many routine lines per alert. Message-level detection offers finer granularity, but remains challenging: a single event template may correspond to both normal and anomalous messages, failures arise from heterogeneous subsystems, and line-level labeling at scale is impractical. Although large language models (LLMs) can reason over log semantics, applying them to every line is too costly for continuous monitoring. We present FAME (Failure-Aware Mixture-of-Experts), a label-efficient message-level mixture-of-experts framework that uses an LLM only once offline. We annotate at most K labeled lines per template to derive binary normal/anomaly indicators and representative examples. The LLM proposes a partition of templates into failure domains, and a certification step validates the proposal before training. FAME trains a lightweight router and domain experts that run on-premise and output anomaly predictions and failure-domain labels. On BGL, FAME achieves F1 = 98.16 at K = 100, reducing annotation effort by 76x, and detects 86.3% of anomalies from unseen EventIDs. On Thunderbird, FAME reaches F1 = 99.95 with perfect recall.

2605.22778 2026-05-22 cs.DC

AI-Driven Multi-Region Provisioning for Cloud Services Using Spot Fleets

基于AI的多区域云服务Spot Fleets配置

Javier Fabra, Enrique Molina-Giménez, Pedro García-López

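AI Summary: This paper proposes an AI-driven provisioning service for multi-region spot fleets. It combines monitoring of provisioning plans with predictive models to estimate fleet configurations and prices before deployment, enabling cost-aware deployment decisions across regions while preserving the operational behavior of the EC2 Spot Service.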

详情
AI中文摘要

云服务平台越来越多地依赖弹性基础设施来支持动态工作负载。Spot实例提供折扣计算资源,但由于动态定价、资源可用性和中断风险在不同地理区域差异,引入了不确定性。在亚马逊网络服务中,EC2 Spot服务通过分配策略简化舰队配置,但无法在部署前估计舰队成本,并且限制配置仅限于单个区域。本文提出了一种AI驱动的多区域Spot舰队配置服务。所提出的方法结合配置计划的监控与预测模型,以在启动前估计舰队配置和价格,从而实现跨区域的成本感知部署决策,同时保持EC2 Spot服务的操作行为。该系统通过最多1500个vCPU的舰队进行了验证。实验结果表明,预测精度相比EC2 Spot服务为99.79%,通过利用区域价格差异,潜在成本节省可达64%。

英文摘要

Cloud service platforms increasingly rely on elastic infrastructures to support dynamic workloads. Spot instances provide discounted computing resources but introduce uncertainty due to dynamic pricing, resource availability, and interruption risks that vary across geographical regions. In Amazon Web Services, the EC2 Spot Service simplifies fleet provisioning through allocation strategies, but it cannot estimate fleet costs before deployment and restricts provisioning to a single region. This paper presents an AI-driven provisioning service for multi-region spot fleets. The proposed approach combines monitoring of provisioning plans with predictive models to estimate fleet configurations and prices before launch, enabling cost-aware deployment decisions across regions while preserving the operational behavior of the EC2 Spot Service. The system was validated with fleets of up to 1500 vCPUs. Experimental results show a prediction accuracy of 99.79% compared to the EC2 Spot Service and potential cost savings of up to 64% by exploiting regional price variability.

2605.22777 2026-05-22 cs.CV

DecQ: Detail-Condensing Queries for Enhanced Reconstruction and Generation in Representation Autoencoders

DecQ:用于增强表示自编码器中重建和生成的细节压缩查询

Tianhang Wang, Yitong Chen, Wei Song, Zuxuan Wu, Min Li, Jiaqi Wang

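AI Summary: This paper proposes DecQ, a framework that introduces lightweight detail-condensing queries to mitigate the trade-off between reconstruction and generation in representation autoencoders, improving both reconstruction quality and generative performance.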

详情
AI中文摘要

表示自编码器(RAEs)利用冻结的视觉基础模型(VFMs)作为分词编码器,提供稳健的高层表示,从而在潜在扩散模型中实现快速收敛和高质量生成。然而,冻结VFM本质上限制了其空间重建能力,限制了细粒度生成和图像编辑;相反,通过微调引入重建导向信号会破坏预训练语义空间并降低生成保真度。为了解决这一权衡,我们提出了DecQ,一种简单而有效的RAEs框架。具体而言,DecQ引入了轻量级细节压缩查询,通过压缩模块从中间VFM特征中提取细粒度信息。这些查询被整合到解码器中以支持重建,并在生成建模过程中与补丁标记共同生成。通过聚合来自浅层和深层的信息,DecQ有效缓解了重建-生成权衡问题,提高了重建质量和生成性能。我们的实验表明:(1)仅使用8个额外查询和3.9%的额外计算,DecQ在冻结DINOv2基于的RAE上提高了重建质量,将PSNR从19.13 dB提高到22.76 dB;(2)在生成建模中,DecQ比RAE快3.3倍,达到无引导FID为1.41,有引导FID为1.05。

英文摘要

Representation Autoencoders (RAEs) leverage frozen vision foundation models (VFMs) as tokenizer encoders, providing robust high-level representations that facilitate fast convergence and high-quality generation in latent diffusion models. However, freezing the VFM inherently constrains its spatial reconstruction capacity, limiting fine-grained generation and image editing; in contrast, incorporating reconstruction-oriented signals via fine-tuning disrupts the pretrained semantic space and degrades generative fidelity. To address this trade-off, we propose DecQ, a simple yet effective framework for RAEs. Specifically, DecQ introduces lightweight detail-condensing queries that extract fine-grained information from intermediate VFM features through condenser modules. These queries are incorporated into the decoder to support reconstruction and are jointly generated with patch tokens during generative modeling. By aggregating information from both shallow and deep layers, DecQ effectively mitigates the reconstruction-generation trade-off, improving both reconstruction quality and generative performance. Our experiments demonstrate that: (1) with only 8 additional queries and 3.9% extra computation, DecQ improves reconstruction over the frozen DINOv2-based RAE, increasing PSNR from 19.13 dB to 22.76 dB; and (2) for generative modeling, DecQ achieves 3.3x faster convergence than RAE, attaining an FID of 1.41 without guidance and 1.05 with guidance.

2605.22776 2026-05-22 cs.LG cs.AI stat.CO stat.ML

SDPM: Survival Diffusion Probabilistic Model for Continuous-Time Survival Analysis

SDPM:用于连续时间生存分析的生存扩散概率模型

Stanislav R. Kirpichenko, Andrei V. Konstantinov, Lev V. Utkin

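AI Summary: This paper proposes SDPM, a generative model for continuous-time survival analysis. It models the conditional distribution of the survival outcome with a denoising diffusion model, avoids parametric assumptions on the event-time distribution, and operates in a transformed target space that uses standardized log-times and a continuous Gaussian-mixture representation of the censoring indicator. It achieves competitive predictive performance on multiple real survival datasets.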

详情
AI中文摘要

生存分析旨在从具有删失观测的数据中估计时间到事件的分布。许多现有方法要么对危险函数施加结构假设,要么离散化时间轴,这可能会限制灵活性并引入近似误差。我们提出了生存扩散概率模型(SDPM),一种用于连续时间生存分析的生成方法。SDPM利用去噪扩散模型建模生存结果的条件分布,该分布由观测时间和删失指示符表示,即P(T,δ|X)。在假设条件独立删失的情况下,模型生成的条件样本可以通过Kaplan-Meier估计器转换为生存函数估计。该公式避免了对事件时间分布的参数假设,并不需要对输出时间空间进行离散化。模型在变换的目标空间中运行,使用标准化对数时间和连续高斯混合表示来表示删失指示符。我们评估了SDPM在十个真实生存数据集上的性能,并将其与五个强大的基线模型进行了比较,包括基于树、提升和神经生存模型。结果表明,SDPM在C指数、整合时间依赖AUC和整合Brier分数上均取得了竞争性的预测性能。对合成Cox-Weibull数据的分析表明,当生成足够多的样本时,SDPM能够比强大的非参数基线更准确地恢复潜在连续生存分布的形状。消融研究证实了所提出的目标空间变换的重要性,这些变换提高了事件率校准、减少了无效生成时间并提供了预测判别的一致增益。实现所提出模型的代码已公开可用。

英文摘要

Survival analysis aims to estimate a time-to-event distribution from data with censored observations. Many existing methods either impose structural assumptions on the hazard function or discretize the time axis, which may limit flexibility and introduce approximation errors. We propose the Survival Diffusion Probabilistic Model (SDPM), a generative approach to continuous-time survival analysis. SDPM models the conditional distribution of the survival outcome, represented by the pair of observed time and censoring indicator, $\mathbb{P}(T,\delta\mid \mathbf{x})$, using a denoising diffusion model. Under the assumption of conditionally independent censoring, conditional samples generated by the model can be transformed into survival function estimates using the Kaplan-Meier estimator. This formulation avoids parametric assumptions on the event-time distribution and does not require a discretization of the output time space. The model operates in a transformed target space, using standardized log-times and a continuous Gaussian-mixture representation of the censoring indicator. We evaluate SDPM on ten real survival datasets and compare it with five strong baselines, including tree-based, boosting-based, and neural survival models. Results show that SDPM achieves competitive predictive performance across C-index, integrated time-dependent AUC, and integrated Brier score. A study on synthetic Cox-Weibull data demonstrates that SDPM can recover the shape of an underlying continuous survival distribution more accurately than a strong nonparametric baseline when sufficiently many samples are generated. An ablation study confirms the importance of the proposed target-space transformations, which improve event-rate calibration, reduce invalid generated times, and provide consistent gains in predictive discrimination. Code implementing the proposed model is publicly available.
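
The step that turns sampled (time, censoring indicator) pairs into a survival curve can be illustrated with a plain Kaplan-Meier estimator. The following is a minimal sketch, not the authors' implementation: the function name `kaplan_meier` and the hard-coded sample pairs are assumptions, standing in for samples that a trained conditional generative model would produce for one covariate vector.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate as a list of (time, survival probability).

    times  -- observed times (event time or censoring time)
    events -- 1 if the event was observed, 0 if the sample is censored
    Hypothetical helper for illustration only.
    """
    pairs = sorted(zip(times, events))
    n_at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        n_events = 0
        n_removed = 0
        # Group all samples that share the same time stamp.
        while i < len(pairs) and pairs[i][0] == t:
            n_events += pairs[i][1]
            n_removed += 1
            i += 1
        if n_events > 0:
            # The curve only drops at times where an event was observed.
            survival *= 1.0 - n_events / n_at_risk
            curve.append((t, survival))
        # Both events and censored samples leave the risk set.
        n_at_risk -= n_removed
    return curve


# Stand-in for model-generated samples (assumed values, not real data).
sample_times = [1.0, 2.0, 2.0, 3.0, 4.0]
sample_events = [1, 1, 0, 1, 0]
curve = kaplan_meier(sample_times, sample_events)
```

Censored samples never lower the curve directly; they only shrink the risk set, which is why the censoring indicator has to be generated jointly with the time.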

2605.22775 2026-05-22 cs.LG cs.AI cs.HC

MambaGaze: Bidirectional Mamba with Explicit Missing Data Modeling for Cognitive Load Assessment from Eye-Gaze Tracking Data

MambaGaze: 通过显式缺失数据建模的双向Mamba用于从眼动追踪数据中评估认知负荷

Amir Mousavi, Mohammad Sadegh Sirjani, Erfan Nourbakhsh, Mimi Xie, Rocky Slavin, Leslie Neely, John Davis, John Quarles

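AI Summary: This paper proposes MambaGaze, which combines XMD encoding with a bidirectional Mamba-2 framework to handle frequent missing data and long-range temporal dependencies in eye-tracking signals. Experiments show that it outperforms the baselines in cognitive load assessment and is feasible for edge deployment.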

详情
Comments
Submitted to IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI 2026)
AI中文摘要

从眼动信号进行实时认知负荷评估有可能实现适应性的人工智能应用,如安全关键应用如驾驶员警觉监控或自动驾驶舱辅助,但存在两个挑战:处理频繁的数据缺失(如眨眼和跟踪失败)以及高效建模长时序依赖。我们提出MambaGaze,一个通过1)XMD编码,将原始特征与观察掩码和时间差增强以显式建模数据不确定性,以及2)双向Mamba-2,以线性计算复杂性捕获时序依赖的框架。在CLARE和CL-Drive数据集上进行的leave-one-subject-out评估实验表明,MambaGaze分别达到76.8%和73.1%的准确率,优于CNN、Transformer、ResNet和VGG基线,高出4-12个百分点。在NVIDIA Jetson平台上的边缘部署基准测试显示,实现实时推理43-68 FPS,功率消耗低于7.5W,证实了其在可穿戴认知负荷监测中的可行性。

英文摘要

Real-time cognitive load assessment from eye-tracking signals could enable adaptive human-centered AI in safety-critical applications such as driver vigilance monitoring or automated flight deck assistance, yet two challenges persist: handling frequent data missingness from blinks and tracking failures, and efficiently modeling long-range temporal dependencies. We propose MambaGaze, a framework that addresses these challenges through 1) XMD encoding, which augments raw features with observation masks and time-deltas to explicitly model data uncertainty, and 2) bidirectional Mamba-2, which captures temporal dependencies with linear computational complexity. Experiments on the CLARE and CL-Drive datasets under leave-one-subject-out evaluation show that MambaGaze achieves 76.8% and 73.1% accuracy, respectively, outperforming CNN, Transformer, ResNet, and VGG baselines by 4-12 percentage points. Edge deployment benchmarks on NVIDIA Jetson platforms demonstrate real-time inference at 43-68 FPS with power consumption below 7.5 W, confirming feasibility for wearable cognitive load monitoring.
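
The idea of augmenting raw features with observation masks and time-deltas can be sketched for a single gaze channel. This is a hedged illustration rather than the paper's XMD encoding: the function name `xmd_encode`, the zero fill value, and the definition of the time-delta as "time since the last observed sample" are all assumptions.

```python
def xmd_encode(values, timestamps, fill=0.0):
    """Encode one channel as (value, mask, delta) triples.

    values     -- samples, with None where the tracker lost the eye
    timestamps -- sample times in seconds
    Assumed conventions: missing values are replaced by `fill`,
    mask is 1.0 when observed, and delta is the time elapsed since
    the most recent observed sample (0.0 before any observation).
    """
    encoded = []
    last_seen = None
    for value, t in zip(values, timestamps):
        observed = value is not None
        delta = 0.0 if last_seen is None else t - last_seen
        encoded.append((value if observed else fill,
                        1.0 if observed else 0.0,
                        delta))
        if observed:
            last_seen = t
    return encoded


# A blink spanning two samples (illustrative values).
pupil = [0.5, None, None, 0.7]
times = [0.0, 0.1, 0.2, 0.3]
features = xmd_encode(pupil, times)
```

With this layout a downstream sequence model can tell a true zero reading from a gap, and can see how stale the last valid reading is.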

2605.22773 2026-05-22 cs.AI math.OC

Deep Reinforcement Learning for Flexible Job Shop Scheduling with Random Job Arrivals

基于随机作业到达的灵活作业车间调度的深度强化学习

Yu Tang, Muhammad Zakwan, Efe Balta, John Lygeros, Alisa Rupenyan

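AI Summary: This paper proposes an event-based deep reinforcement learning approach to flexible job shop scheduling with random job arrivals. An agent trained with the Proximal Policy Optimization algorithm and lightweight multi-layer perceptrons minimizes the total completion time of all jobs and outperforms individual dispatching rules on datasets with varying heterogeneity and job arrival rates.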

详情
AI中文摘要

灵活作业车间调度问题(FJSP)是将一组作业最优分配到机器上的问题。在FJSP中仍存在两个主要挑战:未来作业的不可预测到达和问题的组合复杂性,使它对传统混合整数线性规划求解器来说是不可行的。本文提出了一种基于事件的深度强化学习(DRL)方法来解决具有随机作业到达的FJSP。具体而言,我们采用近端策略优化算法,并使用轻量级多层感知机来训练DRL智能体以最小化所有作业的总完成时间。我们设计状态表示为可以直接从环境中获取,并限制学习智能体只能在一组已确立的调度规则中选择。仿真显示,我们的DRL方法在异质性和作业到达率不同的数据集上优于任何单独的调度规则。我们还将我们的DRL方法与一种触发到达的混合整数线性规划解决方案进行基准测试,并表明我们的方法在数据集异质性较高的情况下表现良好。

英文摘要

The Flexible Job Shop Scheduling Problem (FJSP) is the optimal allocation of a set of jobs to machines. Two primary challenges persist in FJSP: the unpredictable arrival of future jobs and the combinatorial complexity of the problem, rendering it intractable for conventional mixed-integer linear programming solvers. This paper proposes an event-based deep reinforcement learning (DRL) approach to solve FJSP with random job arrivals. Specifically, we employ the Proximal Policy Optimization algorithm and use lightweight Multi-Layer Perceptrons to train the DRL agent for minimizing the total completion time of all jobs. We design the state representation to be directly accessible from the environment, and limit the learning agent to selecting from among a set of well-established dispatching rules. Simulations show that our DRL approach outperforms any of the individual dispatching rules on datasets with varying heterogeneity and job arrival rates. We benchmark our DRL approach against an arrival-triggered mixed-integer linear programming solution and show that our method achieves good performance, especially when the datasets are heterogeneous.
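
Restricting the agent to a choice among dispatching rules means the action at each decision event is a rule name, not a job. The sketch below illustrates that action space on a deliberately simplified single-machine simulation. It is not the paper's environment: the rule set (FIFO, SPT, LPT), the `Job` fields, and the `policy` callback standing in for the trained agent are all assumptions.

```python
from typing import Callable, Dict, List, NamedTuple


class Job(NamedTuple):
    name: str
    arrival: float
    processing_time: float


# Assumed rule set; the abstract does not list the rules actually used.
RULES: Dict[str, Callable[[List[Job]], Job]] = {
    "FIFO": lambda queue: min(queue, key=lambda j: j.arrival),
    "SPT": lambda queue: min(queue, key=lambda j: j.processing_time),
    "LPT": lambda queue: max(queue, key=lambda j: j.processing_time),
}


def total_completion_time(jobs: List[Job],
                          policy: Callable[[List[Job], float], str]) -> float:
    """Simulate one machine; `policy` picks a rule at every decision event."""
    pending = sorted(jobs, key=lambda j: j.arrival)
    clock = 0.0
    total = 0.0
    while pending:
        queue = [j for j in pending if j.arrival <= clock]
        if not queue:
            # Machine idles until the next job arrives.
            clock = pending[0].arrival
            continue
        rule_name = policy(queue, clock)   # the agent's action
        job = RULES[rule_name](queue)      # the rule picks the job
        clock += job.processing_time
        total += clock                     # completion time of this job
        pending.remove(job)
    return total


jobs = [Job("A", 0.0, 5.0), Job("B", 0.0, 1.0), Job("C", 0.0, 2.0)]
fifo_cost = total_completion_time(jobs, lambda queue, clock: "FIFO")
spt_cost = total_completion_time(jobs, lambda queue, clock: "SPT")
```

A learned policy would replace the constant lambdas and switch rules depending on the observed queue, which is where a gain over any single fixed rule could come from.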

2605.22767 2026-05-22 cs.CV

Synthetic Data Alone is Enough? Rethinking Data Scarcity in Pediatric Rare Disease Recognition

合成数据足够吗?重新思考儿科罕见病识别中的数据稀缺性

Ganlin Feng, Yuxi Long, Erin Lou, Lianghong Chen, Zihao Jing, Pingzhao Hu, Wei Xu

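AI Summary: This study examines whether synthetic data alone is sufficient to overcome data scarcity in pediatric rare disease recognition. Experiments indicate that high-fidelity synthetic data can approximate clinically meaningful distributions, making synthetic images a privacy-preserving visual resource for genetic counseling.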

详情
Comments
CVPR 2026 CV4CHL workshop
AI中文摘要

患有罕见遗传疾病的儿童往往表现出独特的面部表型,但开发用于早期诊断的计算机视觉系统仍极具挑战性,因为存在极端的数据稀缺性、隐私限制以及儿科环境中有限的数据共享。这些挑战不仅阻碍了自动化诊断,也限制了临床遗传咨询中的视觉资源可用性。尽管先前研究表明合成数据可以增强真实数据集并保持表型层面的语义,但尚不清楚在超低资源的儿科环境中,仅使用合成数据是否足以进行学习。在本工作中,我们研究了仅使用合成数据的儿科罕见病识别场景。在受控的实验设置中,模型仅在具有表型意识的合成面部图像上进行训练,随着数据规模的增加。我们发现,在足够规模下,仅使用合成数据的训练在多个架构上实现了与仅使用真实数据的基线相当的性能,这表明高保真合成数据能够近似临床有意义的分布。这些发现进一步使合成的儿科面部图像成为隐私保护的资源,用于遗传教育和咨询,支持临床医生培训和患者沟通。我们的结果强调了计算机视觉在提高数据效率和扩展儿童健康护理中可访问的视觉工具方面的潜力。

英文摘要

Children with rare genetic diseases often exhibit distinctive facial phenotypes, yet developing computer vision systems for early diagnosis remains challenging due to extreme data scarcity, privacy constraints, and limited data sharing in pediatric settings. These challenges not only hinder automated diagnosis but also restrict the availability of visual resources for clinical genetic counseling. While prior work has shown that synthetic data can augment real datasets and preserve phenotype-level semantics, it remains unclear whether synthetic data alone is sufficient for learning in ultra-low-resource pediatric settings. In this work, we study the synthetic-only regime for pediatric rare disease recognition. Under a controlled experimental setup, models are trained exclusively on phenotype-aware synthetic facial images at increasing scales. We find that synthetic-only training achieves performance comparable to real-data-only baselines at sufficient scale across multiple backbones, suggesting that high-fidelity synthetic data can approximate clinically meaningful distributions. These findings together further enable the use of synthetic pediatric facial images as privacy-preserving resources for genetic education and counseling, supporting clinician training and patient communication. Our results highlight the potential of computer vision to improve data efficiency and expand accessible visual tools in children's healthcare.

2605.22766 2026-05-22 cs.IR

Diversed Model Discovery via Structured Table Discovery

通过结构化表格发现实现多样化模型发现

Zhengyuan Dong, Renée J. Miller

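AI Summary: This paper proposes a model search framework based on structured table discovery. It retrieves over high-quality evidence to make model recommendations more diverse and comparable.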

详情
Comments
8 pages excluding references. 5 figures
AI中文摘要

模型卡片通过文本描述和结构化 artifacts 的混合来描述模型行为,包括性能、配置和数据集表格。现有的模型搜索系统主要依赖文本的语义相似性,这可能导致结果集同质化并限制替代方案的探索。我们主张模型搜索本质上是对比性的:用户希望模型在任务上对齐但又在可测量的方式上有所区别。我们假设这种平衡需要检索浓缩的高质量证据而不是冗长的描述,而大部分证据集中在结构化表格中。我们提出了 StructuredSemanticSearch,一个基于 ModelTables 评估基准的表格驱动模型搜索框架。给定一个查询,StructuredSemanticSearch 结合了任务对齐的语义基线和一个发现查询相关模型卡片表格的结构感知管道,使用诸如可联结性、可连接性和关键词搜索等表格发现运算符。检索到的表格被映射回模型卡片下受控的 top-k 预算,使文本基于和表格基于的检索之间能够公平比较。除了检索外,StructuredSemanticSearch 通过方向感知的整合将表格整合到模型-表格领域,产生从部分重叠和有时转置的证据表格中产生的紧凑整合视图。为了评估,我们引入了基于 nugget 的、可审计的协议,该协议从模型卡片中提取紧凑的证据项,将查询匹配到条件或意图特定的 nuggets,并测量检索到的模型卡片候选集中的证据覆盖和多样性。该协议还提供了一条可扩展的路径,朝着动态模型湖中近似、基于证据的标记。在 597 个模型推荐查询上的实验表明,结构感知的管道在 nugget 覆盖方面优于语义基线。

英文摘要

Model cards describe model behavior through a mixture of textual descriptions and structured artifacts, including performance, configuration, and dataset tables. Existing model search systems rely predominantly on semantic similarity over text, which can produce homogeneous result sets and limit exploration of alternatives. We argue that model search is inherently comparative: users want models that are task-aligned yet differentiated in measurable ways. We hypothesize that this balance requires retrieval over condensed, high-quality evidence rather than verbose descriptions, and much of that evidence is concentrated in structured tables. We present StructuredSemanticSearch, a table-driven model search framework built on the ModelTables benchmark. Given a query, StructuredSemanticSearch combines a semantic baseline for task alignment with a structure-aware pipeline that discovers query-related model-card tables using table discovery operators such as unionability, joinability, and keyword search. Retrieved tables are mapped back to model cards under a controlled top-k budget, enabling fair comparison between text-based and table-based retrieval. Beyond retrieval, StructuredSemanticSearch adapts table integration to the model-table domain through orientation-aware integration, producing compact integrated views from partially overlapping and sometimes transposed evidence tables. For evaluation, we introduce a nugget-based, auditable protocol that extracts compact evidence items from model cards, matches queries to condition- or intent-specific nuggets, and measures evidence coverage and diversity over retrieved model-card candidate sets. This protocol also provides a scalable path toward approximate, evidence-based labeling in dynamic model lakes. Experiments on 597 model-recommendation queries show that the structure-aware pipeline achieves better nugget coverage than the semantic baseline.

2605.22765 2026-05-22 cs.LG stat.ML

Uniform Diffusion Models Revisited: Leave-One-Out Denoiser and Absorbing State Reformulation

统一扩散模型再审视:留一法去噪器和吸收状态重述

Samson Gourevitch, Yazid Janati, Dario Shariatian, Umut Simsekli, Eric Moulines, Eric P. Xing, Alain Durmus

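AI Summary: This paper studies the mismatch between the denoising posterior and the leave-one-out posterior in uniform diffusion models, and improves model performance through better parameterization and sampling methods.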

详情
Comments
preprint
AI中文摘要

离散扩散模型通常通过干净数据预测进行训练,但预测可以以不同方式定义反向动态。在掩码扩散模型(MDM)中这些选择大体一致,而在统一扩散模型(UDM)中则不一致。我们展示了标准插件桥参数化对于UDM并非由去噪后验优化,而是由留一法后验优化,该后验预测每个干净token时不使用其自身的噪声观测。这揭示了插件ELBO与常规去噪交叉熵目标之间的不匹配。我们刻画了留一法目标并推导了去噪器、留一法后验和分数之间的精确转换。这些转换使我们能够分离参数化和训练目标。我们的结果还通过有意识的预测-校正采样器和基于留一法预测的改进温度采样方法在无需额外训练的情况下提升了推断性能。我们进一步引入了统一扩散的吸收状态重述,该重述在保持UDM联合分布的同时将其分解为类似掩码扩散的采样操作,具有更简单的去噪后验、携带未掩码和自然重掩码机制。在语言建模中,留一法参数化一致地提升了UDM生成性能,而吸收构造在匹配或超越掩码扩散方面表现优异。这些结果表明,掩码与统一扩散之间的经验差距主要由参数化和采样设计驱动,而非边际本身的选择。代码和模型可在https://github.com/samsongourevitch/rev_udm找到。

英文摘要

Discrete diffusion models are often trained through clean-data prediction, but the prediction can be used in different ways to define the reverse dynamics. In Masked Diffusion Models (MDM) these choices largely coincide, whereas in Uniform Diffusion Models (UDM) they do not. We show that the standard plug-in bridge parameterization for UDM is not optimized by the denoising posterior, but by a leave-one-out posterior that predicts each clean token without using its own noisy observation. This identifies a mismatch between the plug-in ELBO and the usual cross-entropy denoising objective. We characterize the leave-one-out target and derive exact conversions between the denoiser, the leave-one-out posterior, and the score. These conversions allow us to disentangle parameterization and training objective. Our results also lead to inference improvements without any additional training through an informed predictor-corrector sampler and improved temperature sampling based on the leave-one-out predictor. We further introduce an absorbing-state reformulation of uniform diffusion that preserves the UDM joint law while decomposing it into masked-diffusion-like sampling operations, with simpler denoising posteriors, carry-over unmasking, and a natural remasking mechanism. On language modeling, leave-one-out parameterizations consistently improve UDM generation, while the absorbing construction matches or surpasses masked diffusion. These results suggest that the empirical gap between masked and uniform diffusion is driven less by the choice of marginals themselves than by parameterization and sampling design. The code and models can be found at https://github.com/samsongourevitch/rev_udm.
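
The relationship between a leave-one-out posterior and a denoising posterior can be illustrated with Bayes' rule for a single token. The sketch below is an assumption-laden illustration, not the paper's derivation: it assumes a per-token uniform noising kernel q(z_i = j | x_i = k) = alpha * 1[j = k] + (1 - alpha) / K with 0 <= alpha < 1, applied independently to each token, and the function names are hypothetical. Under that kernel, p(x_i = k | z) is proportional to p(x_i = k | z_{-i}) * q(z_i | x_i = k).

```python
def kernel_likelihood(noisy_token, clean_token, alpha, vocab_size):
    """q(z_i | x_i) under the assumed uniform noising kernel."""
    uniform = (1.0 - alpha) / vocab_size
    return alpha + uniform if noisy_token == clean_token else uniform


def denoiser_from_loo(loo_probs, noisy_token, alpha):
    """Fold the token's own noisy observation into the leave-one-out posterior."""
    K = len(loo_probs)
    weights = [p * kernel_likelihood(noisy_token, k, alpha, K)
               for k, p in enumerate(loo_probs)]
    total = sum(weights)
    return [w / total for w in weights]


def loo_from_denoiser(denoiser_probs, noisy_token, alpha):
    """Divide the token's own observation back out (requires alpha < 1)."""
    K = len(denoiser_probs)
    weights = [p / kernel_likelihood(noisy_token, k, alpha, K)
               for k, p in enumerate(denoiser_probs)]
    total = sum(weights)
    return [w / total for w in weights]


# Illustrative values: vocabulary of 3 tokens, observed noisy token is 1.
loo = [0.5, 0.3, 0.2]
den = denoiser_from_loo(loo, noisy_token=1, alpha=0.5)
recovered = loo_from_denoiser(den, noisy_token=1, alpha=0.5)
```

The round trip recovers the leave-one-out distribution, which shows in miniature why the two parameterizations carry the same information even though they are different training targets.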

2605.22758 2026-05-22 quant-ph cs.CC

A sharp interaction-degree threshold for simulating QAOA

QAOA的精确交互度阈值

Ralfs Āboliņš, Andris Ambainis

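AI Summary: This paper studies the interaction-degree threshold for classical simulation of QAOA. At interaction degree 3, classical sampling from depth-1 QAOA with small multiplicative error would collapse the polynomial hierarchy to its third level; at interaction degree 2, depth-p QAOA on n qubits can be sampled exactly in time n^{O(1)} whenever p = O(log n).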

详情
Comments
7 pages, 1 figure
AI中文摘要

我们识别出一个精确的交互度阈值,用于经典模拟具有2-局部成本函数的QAOA。在交互度为3时,深度为1的QAOA在小乘性误差下会将多项式层级崩溃到其第三层。在交互度为2时,精确的经典采样在n个量子比特上可以在时间n^{O(1)}内完成,只要p=O(log n)。交互度为3的实例具有平凡可优化的成本函数,因此采样难度本身并不意味着量子优化优势。

英文摘要

We identify a sharp interaction-degree threshold for the classical simulation of QAOA with $2$-local cost functions. At degree $3$, classical sampling from depth-$1$ QAOA with small multiplicative error would collapse the polynomial hierarchy to its third level. At degree $2$, exact classical sampling from depth-$p$ QAOA on $n$ qubits runs in time $n^{O(1)}$ whenever $p = O(\log n)$. The hard degree-$3$ instances have trivially optimizable cost functions, so sampling hardness does not by itself imply a quantum optimization advantage.

2605.22756 2026-05-22 cs.LG cs.DS

Lumberjack: Better Differentially Private Random Forests through Heavy Hitter Detection in Trees

Lumberjack: 通过树中的Heavy Hitter检测实现更好的差分隐私随机森林

Christian Janos Lebeda, David Erb, Tudor Cebere, Aurélien Bellet

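AI Summary: This paper proposes Lumberjack, an algorithm that builds large random decision trees and applies privacy-preserving pruning to substantially improve the utility of differentially private random forests. It introduces a new (ε,δ)-DP heavy hitter detection algorithm with error O_{ε,δ}(√(log h)), which allows deeper trees and therefore greater expressiveness under privacy constraints. Experiments show that Lumberjack outperforms existing differentially private random forest methods on benchmark datasets, with notable gains in the privacy-utility trade-off for practical privacy budgets.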

详情
AI中文摘要

随机森林广泛应用于涉及敏感表格数据的领域,但现有的差分隐私(DP)方法通常会降级性能到不实用的程度。在本文中,我们介绍Lumberjack,一种差分隐私随机森林算法,通过构建大规模随机决策树并应用激进的隐私保护剪枝技术,保留仅足够 populated 的节点,从而实现显著更高的实用性。我们方法的关键组成部分是一个新颖的(ε,δ)-DP Heavy Hitter检测算法,用于层次数据,其误差为O_{ε,δ}(√log h)对于高度为h的树,并可能具有独立的兴趣。这种有利的缩放使得可以使用比先前工作更深的树,从而在隐私约束下提高表达能力。我们在基准数据集上的实验证明,Lumberjack在基准数据集上优于现有差分隐私随机森林方法,建立了新的状态。特别是,我们的方法在实际隐私预算下的隐私-效用权衡上取得了显著改进。我们的发现表明,精心设计的差分隐私随机森林可以缩小大部分的效用差距,突显了未来研究中一个有前途但尚未被探索的方向。

英文摘要

Random forests are widely used in fields involving sensitive tabular data, but existing approaches to enforcing differential privacy (DP) typically degrade performance to the point of impracticality. In this paper, we introduce Lumberjack, a differentially private random forest algorithm that achieves substantially higher utility by constructing large random decision trees and then applying aggressive, privacy-preserving pruning to retain only sufficiently populated nodes. A key component of our approach is a novel $(\varepsilon,\delta)$-DP heavy hitter detection algorithm for hierarchical data, whose error is $O_{\varepsilon,\delta}(\sqrt{\log h})$ for trees of height $h$ and may be of independent interest. This favorable scaling enables the use of significantly deeper trees than in prior work, leading to improved expressiveness under privacy constraints. Our empirical evaluation on benchmark datasets shows that Lumberjack consistently outperforms prior DP random forest methods, establishing a new state of the art. In particular, our approach yields substantial improvements in the privacy-utility trade-off for practical privacy budgets. Our findings suggest that carefully designed DP random forests can close much of the utility gap, highlighting a promising and underexplored direction for future research.

2605.22751 2026-05-22 cs.CV

Spectral Tail Auxiliary Learning for AI-Generated Image Detection

用于AI生成图像检测的频谱尾辅助学习

Xingyi Li, Jiahui Zhang, Yiheng Li, Yun Cao, Wenhao Wang

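AI Summary: This paper proposes STAL, an auxiliary learning framework based on spectral-tail features for detecting AI-generated images. An analysis of the radial log-power spectra of real and generated images shows that generated images exhibit an anomalous uplift in the ultra-high-frequency tail, termed spectral tail uplift. STAL uses this cue for auxiliary supervision, improving generalization and stability across generators, data distributions, and real-world scenarios.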

详情
AI中文摘要

随着生成图像模型的快速发展,生成图像与真实图像的感知差距持续缩小,使AI生成图像检测变得愈发困难。许多现有方法利用频域线索进行检测,通常描述为频域伪影或高频差异。然而,具体的频谱规律仍不够理解和表征。本文系统分析了真实和生成图像的一维径向对数功率谱。发现生成图像并不一定在整个频谱或高频范围内具有更高的或更低的能量。相反,它们的频谱偏离幂律衰减,并在超高频尾部表现出异常上升。我们称这种现象为频谱尾部上升。进一步将这种现象归因于训练生成模型中的非线性谐波积累,表明它可以在生成架构中作为结构线索。基于这一观察,我们提出了Spectral Tail Auxiliary Learning (STAL),一种用于通用AI生成图像检测的频域辅助监督框架。STAL在训练时将频谱尾部线索从尾部意识的频率教师转移到空间检测器,而在推理时所有频域模块都被丢弃。因此,STAL不引入推理开销。在9个公开数据集上的大量实验表明,STAL在不同生成器、数据分布和现实场景中实现了强大的泛化能力和稳定性。

英文摘要

As generative image models evolve rapidly, the perceptual gap between generated and real images continues to narrow, making AI-generated image detection increasingly challenging. Many existing methods exploit frequency-domain cues for detection, typically described as frequency-domain artifacts or high-frequency discrepancies. However, the specific and recurring spectral regularities remain insufficiently understood and characterized. In this paper, we systematically analyze the one-dimensional radial log-power spectra of real and generated images. We find that generated images do not necessarily exhibit higher or lower energy across the entire spectrum or high-band range. Instead, their spectra deviate from the power-law decay and show an anomalous uplift in the ultra-high-frequency tail. We term this phenomenon spectral tail uplift. We further attribute this phenomenon to nonlinear harmonic accumulation in trained generative models, suggesting that it can serve as a structural cue across generative architectures. Based on this observation, we propose Spectral Tail Auxiliary Learning (STAL), a frequency-domain auxiliary supervision framework for generalizable AI-generated image detection. STAL transfers spectral-tail cues from a tail-aware frequency teacher to a spatial detector during training, while all frequency-domain modules are discarded at inference time. Consequently, STAL introduces no inference overhead. Extensive experiments on 9 public datasets show that STAL achieves strong generalization and stability across generators, data distributions, and real-world scenarios.

2605.22749 2026-05-22 cs.LG cs.AI

Cyber-Physical Anomaly Detection in IoT-Enabled Smart Grids Using Machine Learning and Metaheuristic Feature Optimization

基于机器学习和元启发式特征优化的物联网智能电网中网络-物理异常检测

Adis Alihodžić, Eva Tuba, Milan Tuba

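AI Summary: This paper studies cyber-physical anomaly detection in IoT-enabled smart grids using machine learning and metaheuristic feature optimization. Among several baseline models, tree-based ensembles perform best on the dataset, and after feature selection the model improves in both macro-F1 and ROC-AUC.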

详情
AI中文摘要

现代智能电网依赖于密集的测量基础设施、通信链路和智能现场设备。尽管这提高了监控和控制能力,但也增加了遭受网络-物理破坏的风险。操作员必须区分物理事件,如故障或线路干扰,与恶意行为,如虚假数据注入或未经授权的命令执行。本章利用著名的MSU/ORNL电力系统攻击数据集来研究这一问题。所提出的方法结合了机器学习与基于遗传算法的特征选择。目标是双重的:准确分类攻击和自然事件,并确定一组减少的、物理信息丰富的PMU/IED测量是否能够支持可靠的检测。评估了多个基线模型,包括逻辑回归、RBF-SVM、XGBoost、随机森林和额外树。结果表明,基于树的集成模型在考虑的数据集上最为有效,其中额外树提供了最强的全特征基线。在特征选择后,GA + Extra Trees模型将干净的PMU特征空间从112个属性减少到五次运行的平均27.4个属性,同时将宏F1从0.9118提高到0.9212,ROC-AUC从0.9791提高到0.9837。这些结果表明,许多同步电气测量是冗余的。一个紧凑的基于相量的特征子集仍能提供准确且可解释的智能电网异常检测。

英文摘要

Modern smart grids rely on dense measurement infrastructures, communication links, and intelligent field devices. Although this improves supervision and control, it also increases vulnerability to cyber-physical disruptions. Operators must distinguish physical incidents, such as faults or line disturbances, from malicious actions, such as false data injection or unauthorized command execution. This chapter investigates this problem using the well-known MSU/ORNL Power System Attack Dataset. The proposed method combines machine learning with genetic-algorithm-based feature selection. The objective is twofold: to classify attack and natural events accurately, and to determine whether a reduced set of physically informative PMU/IED measurements can support reliable detection. Several baseline models are evaluated, including logistic regression, RBF-SVM, XGBoost, Random Forest, and Extra Trees. The results show that tree-based ensemble models are the most effective for the considered dataset, with Extra Trees providing the strongest full-feature baseline. After feature selection, the GA + Extra Trees model reduces the clean PMU feature space from 112 attributes to an average of 27.4 attributes over five runs, while increasing macro-F1 from 0.9118 to 0.9212 and ROC-AUC from 0.9791 to 0.9837. These results indicate that many synchronized electrical measurements are redundant. A compact subset of phasor-based features can still provide accurate and interpretable anomaly detection in smart grids.

2605.22748 2026-05-22 cs.RO cs.AI cs.LG cs.MA

Superhuman Safe and Agile Racing through Multi-Agent Reinforcement Learning

通过多智能体强化学习实现超人类安全且敏捷的赛车

Ismail Geles, Leonard Bauersfeld, Markus Wulfmeier, Davide Scaramuzza

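AI Summary: This paper uses multi-agent reinforcement learning to achieve safe and agile performance in high-speed quadrotor racing. It shows that multi-agent interaction is key to safety in real-world interaction, and the trained agents outperform a human pilot in high-speed races while reducing collision rates.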

详情
Comments
12 pages (+4 supplementary). Website: https://rpg.ifi.uzh.ch/marl
AI中文摘要

自主系统在孤立或模拟环境中已实现超人类性能,但在共享、动态的真实世界空间中仍显得脆弱。这种失败源于物理应用中主导的单智能体范式,其中其他参与者被忽略或视为环境噪声,阻碍了有效协调。本文证明多智能体强化学习为真实世界交互提供了必要的安全性基础。使用高速四旋翼赛车作为高风险测试平台,训练智能体在复杂空气动力学相互作用和战略机动中导航,具有可变数量的赛车。通过联赛基于的自我对战,智能体进化出复杂的前瞻性行为,包括主动避障、超车和处理多智能体物理交互,包括空气动力学下洗。我们的智能体在超过22米/秒的速度下多玩家赛车中超越了冠军级人类飞行员,同时与最先进的单智能体基线相比,碰撞率减少了50%。关键的是,使用多样化的人工智能体进行训练能够实现零样本泛化到更安全的人类交互。这些结果表明,实现稳健的机器人共存的路径不在于孤立的安全约束,而在于多智能体交互的严格要求。多媒体材料可在:https://rpg.ifi.uzh.ch/marl

英文摘要

Autonomous systems have achieved superhuman performance in isolation or simulation, yet they remain brittle in shared, dynamic real-world spaces. This failure stems from the dominant single-agent paradigm for physical applications, where other actors are ignored or treated as environmental noise, preventing effective coordination. Here we show that multi-agent reinforcement learning provides the essential safety scaffolding required for real-world interaction. Using high-speed quadrotor racing as a high-stakes testbed, we train agents to navigate complex aerodynamic interactions and strategic maneuvering with a variable number of racers. Through league-based self-play, agents evolve sophisticated anticipatory behaviors, including proactive collision avoidance, overtaking, and handling multi-agent physical interactions such as aerodynamic downwash. Our agents outperform a champion-level human pilot in multi-player races at speeds exceeding 22 m/s, while simultaneously reducing collision rates by 50% compared to state-of-the-art single-agent baselines. Crucially, training with diverse artificial agents enables zero-shot generalization to safer human interaction. These results suggest that the path to robust robotic co-existence lies not in isolated safety constraints, but in the rigorous demands of multi-agent interaction. Multimedia materials are available at: https://rpg.ifi.uzh.ch/marl

2605.22746 2026-05-22 cs.LG eess.AS stat.ML

Plug-in Losses for Evidential Deep Learning: A Simplified Framework for Uncertainty Estimation that Includes the Softmax Classifier

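Chinese title (translated to English): Plug-in losses for evidential deep learning: a simplified uncertainty-estimation framework that includes the softmax classifier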

Berk Hayta, Hannah Laus, Simon Mittermaier, Felix Krahmer

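AI summary: This paper proposes a simplified framework that approximates the evidential deep learning objective for uncertainty estimation with plug-in losses. It shows that, under a particular evidence-to-Dirichlet mapping, the framework includes the standard softmax classifier, and it validates the approach on the Google Speech Commands dataset.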

Details
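AI Chinese abstract (translated to English)

Real-world sensor-based learning systems need uncertainty estimation that is both reliable and computationally efficient. Evidential Deep Learning (EDL) models class probabilities with a Dirichlet distribution whose parameters are predicted by a learned neural network mapping, which yields uncertainty estimates in a single pass. This approach can be computationally challenging, however, because Dirichlet expected objectives are more complex than standard supervised learning losses, which makes them harder to analyze and implement. We approximate the objective of the first-order empirical risk minimization problem induced by EDL with a plug-in loss evaluated at the Dirichlet mean, and show that, under mild assumptions and for a broad class of loss functions including mean-squared error and cross-entropy, the approximation error shrinks as the evidence grows. As a special case, our analysis justifies the use of softmax for uncertainty estimation, since under a particular evidence-to-Dirichlet mapping our framework contains the standard softmax classifier. We validate the proposed simplified objectives on the Google Speech Commands dataset and show that they match classical EDL in predictive accuracy and selective prediction performance, while being simpler to implement with standard deep learning losses and training pipelines. To the best of our knowledge, this empirical analysis is the first to obtain coverage-accuracy trade-offs for speech recognition tasks through EDL.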

English abstract

Real-world sensor-based learning systems require uncertainty estimation that is both reliable and computationally efficient. Evidential Deep Learning (EDL) provides single-pass uncertainty estimation by modeling the class probabilities via Dirichlet distributions, where the Dirichlet parameters are predicted by a learned neural network mapping. However, this approach can lead to computational challenges, as Dirichlet expected objectives are more complex than standard supervised learning losses, complicating their analysis and implementation. We address this issue by approximating the objective of the first-order empirical risk minimization problem induced by EDL with a plug-in loss evaluated at the Dirichlet mean and show that, under mild assumptions, the approximation error decays with growing evidence for a broad class of loss functions, including mean-squared error and cross-entropy loss. As a special case, our analysis provides justification for the use of softmax in the context of uncertainty estimation, since under a particular evidence-to-Dirichlet mapping, our framework includes the standard softmax classifier. We validate the proposed simplified objectives on the Google Speech Commands dataset and show that they achieve predictive accuracy and selective prediction performance comparable to classical EDL, while being simpler to implement using standard deep learning losses and training pipelines. To the best of our knowledge, this empirical analysis is the first to obtain coverage-accuracy trade-offs for speech recognition tasks through EDL.
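
The plug-in approximation described in the abstract can be illustrated numerically. The following is a minimal sketch, not the authors' implementation. It compares the Dirichlet expected loss with the plug-in loss evaluated at the Dirichlet mean, using the standard closed forms for cross-entropy (a difference of digamma values) and for squared error (plug-in error plus the summed Dirichlet variances). The function names, the toy parameter values, and the exponential mapping `alpha = exp(logits)` are assumptions made for illustration; the abstract does not say which evidence-to-Dirichlet mapping recovers softmax.

```python
import math


def digamma(x: float) -> float:
    """Digamma for x > 0: shift x upward by recurrence, then use an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x  # psi(x) = psi(x + 1) - 1/x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series


def dirichlet_mean(alpha):
    total = sum(alpha)
    return [a / total for a in alpha]


def expected_cross_entropy(alpha, label):
    # E_{p ~ Dir(alpha)}[-log p_label] = psi(sum(alpha)) - psi(alpha_label)
    return digamma(sum(alpha)) - digamma(alpha[label])


def plugin_cross_entropy(alpha, label):
    # Cross-entropy evaluated at the Dirichlet mean instead of in expectation.
    return -math.log(dirichlet_mean(alpha)[label])


def plugin_squared_error(alpha, label):
    mean = dirichlet_mean(alpha)
    return sum(((1.0 if k == label else 0.0) - m) ** 2 for k, m in enumerate(mean))


def expected_squared_error(alpha, label):
    # E||y - p||^2 = ||y - mean||^2 + sum_k Var(p_k),
    # with Var(p_k) = mean_k * (1 - mean_k) / (sum(alpha) + 1).
    total = sum(alpha)
    variance = sum(m * (1.0 - m) / (total + 1.0) for m in dirichlet_mean(alpha))
    return plugin_squared_error(alpha, label) + variance


def softmax(logits):
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    base = [2.0, 1.0, 1.0]  # hypothetical Dirichlet parameters, true class 0
    for scale in (1.0, 10.0, 100.0):
        alpha = [scale * a for a in base]
        ce_gap = expected_cross_entropy(alpha, 0) - plugin_cross_entropy(alpha, 0)
        se_gap = expected_squared_error(alpha, 0) - plugin_squared_error(alpha, 0)
        print(f"scale={scale:6.1f}  CE gap={ce_gap:.5f}  SE gap={se_gap:.5f}")

    # Assumed mapping alpha = exp(logits): the Dirichlet mean equals softmax(logits).
    logits = [1.0, -0.5, 2.0]
    print(dirichlet_mean([math.exp(z) for z in logits]), softmax(logits))
```

Scaling the Dirichlet parameters while holding the mean fixed stands in here for growing evidence. In this toy case both gaps shrink as the scale increases, and under the assumed exponential mapping the plug-in cross-entropy coincides with the usual softmax cross-entropy.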