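arXivDaily: a daily academic digest synchronized with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering and systems science.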

2605.20182 2026-05-20 cs.LG cs.AI

Atoms of Thought: Universal EEG Representation Learning with Microstates

Xinyang Tian, Ruitao Liu, Ziyi Ye, Siyang Xue, Xin Wang, Xuesong Chen

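AI summary: This paper proposes a microstate-based method for universal EEG representation learning. It clusters continuous EEG signals into sequences of discrete microstates to build a universal microstate tokenizer, shows gains on downstream tasks including sleep staging, emotion recognition, and motor imagery classification, and improves interpretability and scalability.

Comments: Accepted by the 3rd International Workshop on Multimodal and Responsible Affective Computing (MRAC 2025). 8 pages of main text, 23 pages total, 5 figures, 4 tables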

DOI: 10.1145/3746270.3760230

Abstract

Learning universal representations from electroencephalogram (EEG) signals is a cutting-edge approach in the field of neuroinformatics and brain-computer interfaces (BCIs). Conventionally, EEG is treated as a multivariate temporal signal, where time- or frequency-domain features are extracted for representation learning. This paper investigates a simple yet effective EEG representation, i.e., microstates. Microstates represent the building blocks of brain activity patterns at a microscopic time scale. We build a universal microstate tokenizer from a large medical EEG dataset by clustering continuous EEG signals into sequences of discrete microstates. The microstate tokenizer is then adopted universally across a series of downstream tasks, including sleep staging, emotion recognition, and motor imagery classification. Experimental results show that EEG representation learning with microstates outperforms traditional time-domain and frequency-domain features under different models and across different tasks. Further analysis shows that microstates offer greater interpretability and scalability, thereby opening up applications in both cognitive neuroscience and clinical research.
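
As a rough illustration of the tokenization step described in the abstract, the sketch below assigns each multichannel EEG sample to its best-matching microstate template and emits the template index as a discrete token. The function name `tokenize`, the toy three-channel templates, and the polarity-invariant cosine matching (a common convention in microstate analysis) are assumptions made for illustration, not details taken from the paper. In practice the templates would come from clustering a large EEG corpus.

```python
import math


def _unit(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def tokenize(samples, templates):
    """Map each EEG sample (one value per channel) to a template index.

    Matching uses the absolute cosine similarity, so a topography and its
    sign-flipped version map to the same token (assumed convention).
    """
    unit_templates = [_unit(t) for t in templates]
    tokens = []
    for sample in samples:
        unit_sample = _unit(sample)
        scores = [
            abs(sum(a * b for a, b in zip(unit_sample, template)))
            for template in unit_templates
        ]
        tokens.append(max(range(len(scores)), key=scores.__getitem__))
    return tokens


# Hypothetical templates over three channels, standing in for cluster centres.
templates = [[1.0, 0.0, -1.0], [0.0, 1.0, -1.0]]
samples = [[2.0, 0.1, -2.0], [-1.0, 0.0, 1.0], [0.0, 3.0, -3.1]]
tokens = tokenize(samples, templates)
print(tokens)
```

The resulting integer sequence is what a downstream sequence model would consume in place of raw time-domain or frequency-domain features.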

2605.20179 2026-05-20 cs.CL

TIDE: Efficient and Lossless MoE Diffusion LLM Inference with I/O-aware Expert Offload

Zhiben Chen, Youpeng Zhao, Yang Sui, Jun Wang, Yuzhang Shang

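AI summary: This paper proposes TIDE, a new inference system built on the temporal stability of expert activations within a block during the diffusion process. It introduces an interval-based expert refresh strategy that reduces I/O and CPU computation, giving lossless acceleration and higher inference efficiency.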

Abstract

Diffusion Large Language Models (dLLMs) have emerged as a competitive alternative to autoregressive (AR) models, offering better hardware utilization and bidirectional context through parallel block-level decoding. However, as dLLMs continue to scale up with mixture-of-experts (MoE) architectures, their deployment on resource-constrained devices remains an open challenge. Existing AR-based methods often incur either prohibitive I/O overhead or significant compute bottlenecks. In this work, we propose TIDE, a novel resource-efficient inference system that leverages the temporal stability of expert activations within a block during the diffusion process. Specifically, we introduce an interval-based expert refresh strategy that updates the expert placement in an I/O-aware fashion. To ensure optimal performance, we formulate the inference scheduling as a mathematical programming problem, solving for the optimal interval that minimizes I/O traffic and CPU computation. Most importantly, TIDE is a lossless optimization that requires no model training, providing a "free lunch" acceleration for dLLM inference. In a single GPU-CPU system, we demonstrate that TIDE achieves up to 1.4$\times$ and 1.5$\times$ throughput improvements over prior baselines on the LLaDA2.0-mini and LLaDA2.0-flash models, respectively.
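
To make the idea of an interval-based refresh concrete, here is a toy cost model, not the paper's formulation. It assumes that every `interval` denoising steps the GPU-resident expert set is refreshed to the currently active experts (paying an I/O cost per newly loaded expert), and that between refreshes any active expert missing from the GPU is computed on the CPU at a smaller per-step cost. The function names, the cost constants, and the brute-force search are all hypothetical; the abstract only states that the optimal interval is obtained from a mathematical program.

```python
def schedule_cost(active_per_step, interval, io_cost=1.0, cpu_cost=0.25):
    """Toy cost of refreshing GPU-resident experts every `interval` steps."""
    resident = set()
    total = 0.0
    for step, active in enumerate(active_per_step):
        if step % interval == 0:
            # Refresh: load the experts that are active now but not resident.
            total += io_cost * len(active - resident)
            resident = set(active)
        else:
            # No refresh: experts missing from the GPU run on the CPU instead,
            # so the model output is unchanged and only the cost differs.
            total += cpu_cost * len(active - resident)
    return total


def best_interval(active_per_step, candidates, io_cost=1.0, cpu_cost=0.25):
    """Brute-force stand-in for the optimization over refresh intervals."""
    return min(
        candidates,
        key=lambda k: schedule_cost(active_per_step, k, io_cost, cpu_cost),
    )


# Hypothetical activation trace for one block: mostly stable across steps.
trace = [{0, 1}, {0, 1}, {0, 2}, {0, 2}]
for k in (1, 2, 4):
    print(k, schedule_cost(trace, k))
print("best:", best_interval(trace, [1, 2, 4]))
```

When activations are stable across steps, longer intervals avoid repeated transfers, which is the intuition the abstract attributes to temporal stability within a block.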

2605.20177 2026-05-20 cs.CL cs.CV

From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models

Juncheng Wu, Hardy Chen, Haoqin Tu, Xianfeng Tang, Freda Shi, Hui Liu, Hanqing Lu, Cihang Xie, Yuyin Zhou

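AI summary: By decoupling perception from reasoning, this study finds that performance on visual tasks is limited by insufficient perception rather than by reasoning itself. Staged training improves both perception and reasoning, yielding better results on several visual math and perception tasks.

Comments: 19 pages, 9 figures; Accepted to ICML 2026; Project Page: https://ucsc-vlaa.github.io/VLM-CapCurriculum/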

Abstract

Recent advances in vision-language models (VLMs) emphasize long chain-of-thought reasoning; yet, we find that their performance on visual tasks is primarily limited by a lack of visual perception as opposed to reasoning itself. In this work, we systematically study the interplay between perception and reasoning in VLM post-training by decomposing their capabilities into three separate training stages: visual perception, visual reasoning, and textual reasoning, incorporating specialized training data. We demonstrate that visual perception (a) requires targeted optimization with specialized data; (b) serves as a fundamental scaffold that should be solidified through staged training before refining visual reasoning; and (c) is more effectively learned via RL than caption-based SFT. Our experiments across multiple VLMs demonstrate that staged training consistently improves both visual perception and reasoning performance over merged training. Notably, models trained with our approach achieve 1.5% higher reasoning accuracy with 20.8% shorter reasoning traces, suggesting that superior perception reduces the need for excessive reasoning. Furthermore, we show that this capability-based staging represents a new curriculum dimension orthogonal to traditional difficulty-based curricula, and combining both yields further additive gains. Our staged-training models achieve superior performance among open-weight VLMs, with advanced results on several visual math and perception tasks relative to their base counterparts (e.g., +5.2% on WeMath and +3.7% on RealWorldQA).

2605.20176 2026-05-20 cs.CL

ClinSeekAgent: Automating Multimodal Evidence Seeking for Agentic Clinical Reasoning

Juncheng Wu, Letian Zhang, Yuhan Wang, Haoqin Tu, Hardy Chen, Zijun Wang, Cihang Xie, Yuyin Zhou

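AI summary: This paper proposes ClinSeekAgent, an automated agentic framework for dynamic multimodal evidence seeking that actively seeks, iteratively plans, and synthesizes multimodal evidence from heterogeneous sources to improve clinical decision support.

Comments: 24 pages, 9 figures; Project Page: https://ucsc-vlaa.github.io/ClinSeekAgent/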

Abstract

Large language models (LLMs) and agentic systems have shown promise for clinical decision support, but existing works largely assume that evidence has already been curated and handed to the model. Real-world clinical workflows instead require agents to actively seek, iteratively plan, and synthesize multimodal evidence from heterogeneous sources. In this paper, we introduce ClinSeekAgent, an automated agentic framework for dynamic multimodal evidence seeking that shifts the paradigm from passive evidence consumption to active evidence acquisition. Given only a clinical query and access to raw data sources, ClinSeekAgent gathers evidence by querying medical knowledge bases, navigating raw EHRs, and invoking medical imaging tools; refines its hypotheses as new information emerges; and integrates the collected evidence into grounded clinical decisions. ClinSeekAgent serves both as an inference-time agent for frontier LLMs and as a training-time pipeline for distilling high-quality agent trajectories into compact open-source models. To validate its inference-time effectiveness, we construct ClinSeek-Bench, which pairs Curated Input reasoning from fixed pre-selected evidence with Automated Evidence-Seeking over raw clinical data. On text-only EHR tasks, ClinSeekAgent improves Claude Opus 4.6 from 60.0 to 63.2 overall F1 and MiniMax M2.5 from 43.1 to 47.3, with positive risk-prediction gains in 7 out of 9 evaluated host models. On multimodal tasks, ClinSeekAgent improves Claude Opus 4.6 from 47.5 to 62.6 (+15.1); all evaluated models improve across the three CXR-related task groups. We further validate ClinSeekAgent as a training pipeline by distilling agentic evidence-seeking trajectories into ClinSeek-35B-A3B, which achieves 34.0 average F1 on existing AgentEHR-Bench, improving over its Qwen3.5-35B-A3B baseline by +11.9 points and approaching Claude Opus 4.6.

2605.20174 2026-05-20 cs.CV cs.LG

Multi-axis Analysis of Image Manipulation Localization

Keanu Nichols, Divya Appapogu, Giscard Biamby, Dina Bashkirova, Anna Rohrbach, Bryan A. Plummer

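AI summary: This paper proposes the AUDITS benchmark for multi-axis analysis of image manipulation detection. It evaluates the robustness of existing methods under different types of domain shift, with the aim of advancing more reliable and generalizable detection methods.

Comments: 28 pages, 5 figures, 5 tables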

Abstract

Advanced image editing software enables easy creation of highly convincing image manipulations, and advances in generative AI have made this even more accessible in recent years. Manipulated images, while often harmless, could spread misinformation, create false narratives, and influence people's opinions on important issues. Despite this growing threat, there is limited research on detecting advanced manipulations across different visual domains. Thus, we introduce Analysis Under Domain-shifts, qualIty, Type, and Size (AUDITS), a comprehensive benchmark designed for studying axes of analysis in image manipulation detection. AUDITS comprises over 530K images from two distinct sources (user and news photos). We curate our dataset to support analysis across multiple axes using recent diffusion-based inpaintings, spanning a diverse range of manipulation types and sizes. We conduct experiments under different types of domain shift to evaluate the robustness of existing image manipulation detection methods. Our goal is to drive further research in this area by offering new insights that would help develop more reliable and generalizable image manipulation detection methods.

2605.20173 2026-05-20 cs.AI cs.SE

A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents

Vasundra Srinivasan

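AI summary: This paper proposes a methodology for selecting and composing runtime architecture patterns around the stochastic-deterministic boundary of LLM agents, and discusses its application across agent types along with a reliability decomposition.

Comments: 25 pages, 2 figures, 6 tables. Companion repo at https://github.com/vasundras/agent-runtime-patterns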

Abstract

Production LLM agents combine stochastic model outputs with deterministic software systems, yet the boundary between the two is rarely treated as a first-class architectural object. This paper names that boundary the stochastic-deterministic boundary (SDB): a four-part contract among a proposer, verifier, commit step, and reject signal that specifies how an LLM output becomes a system action. We argue that the SDB is the load-bearing primitive of production agent runtimes. Around this primitive, we organize agent runtime design into three concerns: Coordination, State, and Control. We present a catalog of six runtime patterns that compose the SDB differently across conversational, autonomous, and long-horizon agents: hierarchical delegation, scatter-gather plus saga, event-driven sequencing, shared state machine, supervisor plus gate, and human in the loop. For each pattern, we trace its lineage to distributed-systems concepts and identify what changes when the worker is stochastic. The paper contributes a five-step methodology for selecting runtime patterns, a diagnostic procedure that maps production failures to pattern weaknesses, and a failure mode called replay divergence, in which LLM-based consumers of a deterministic event log produce different downstream outputs under model-version or prompt changes. A stylized reliability decomposition separates per-call model variance from architectural momentum, motivating the claim that as model variance decreases, pattern choice and SDB strength become increasingly important levers for long-run reliability. We apply the methodology to five workloads and provide one runnable reference implementation for a 90-day contract-renewal agent.
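
The four-part contract named in the abstract (proposer, verifier, commit step, reject signal) can be pictured with a minimal sketch. Everything below is an illustrative assumption rather than the paper's reference implementation: the `run_boundary` function, the `Outcome` type, the retry limit, and the scripted stand-in for an LLM are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Outcome:
    committed: bool
    value: Optional[str] = None
    reject_reason: Optional[str] = None


def run_boundary(
    propose: Callable[[Optional[str]], str],
    verify: Callable[[str], Optional[str]],
    commit: Callable[[str], None],
    max_attempts: int = 3,
) -> Outcome:
    """Let a stochastic proposer act only through a deterministic check."""
    feedback = None
    for _ in range(max_attempts):
        proposal = propose(feedback)   # stochastic side (e.g. an LLM call)
        problem = verify(proposal)     # deterministic side; None means accepted
        if problem is None:
            commit(proposal)           # the only place state is changed
            return Outcome(committed=True, value=proposal)
        feedback = problem             # reject signal fed back to the proposer
    return Outcome(committed=False, reject_reason=feedback)


# Scripted proposer standing in for a model: first output is invalid.
scripted = iter(["delete everything", "renew"])
ledger = []
result = run_boundary(
    propose=lambda feedback: next(scripted),
    verify=lambda p: None if p in {"renew", "cancel"} else f"unknown action: {p}",
    commit=ledger.append,
)
print(result, ledger)
```

Keeping the commit step as the single mutation point is one way to read the abstract's claim that the boundary is load-bearing: nothing the model emits reaches system state without passing the verifier.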

2605.20170 2026-05-20 cs.CL

KoRe: Compact Knowledge Representations for Large Language Models

Davide Cavicchini, Fausto Giunchiglia, Jacopo Staiano

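AI summary: This paper proposes KoRe, which encodes 1-hop subgraphs of a knowledge graph into compact discrete knowledge tokens and injects them into an LLM, improving the model's knowledge reasoning while reducing token usage.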

Abstract

Modern Large Language Models (LLMs) have shown impressive performance in user-facing tasks such as question answering, as well as consistent improvements in reasoning capabilities. Still, the way these models encode knowledge seems inherently flawed: by design, LLMs encode world knowledge within their parameters. This way of representing knowledge is inherently opaque, difficult to debug and update, and prone to hallucinations. On the other hand, Knowledge Graphs can provide human-readable and easily editable world knowledge representations, and their application in knowledge-intensive tasks has consistently proven beneficial to downstream performance. Nonetheless, current integration techniques require extensive retraining or finetuning. To overcome this issue, we introduce KoRe, a methodology to encode 1-hop sub-graphs into compact discrete knowledge tokens and inject them into an LLM backbone. We test the proposed approach on three established benchmarks, and report competitive performance coupled with a significant reduction (up to 10x) in token usage. Our results show that compact discrete KG representations can efficiently and effectively be used to ground modern LLMs.

2605.20167 2026-05-20 cs.AI cs.LG

HaorFloodAlert: Deseasonalized ML Ensemble for 72-Hour Flood Prediction in Bangladesh Haor Wetlands

Salma Hoque Talukdar Koli, Fahima Haque Talukder Jely, Md. Samiul Alim, Md. Zakir Hossen

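AI summary: This paper proposes HaorFloodAlert, a deseasonalized machine learning ensemble that predicts 72-hour flood probability for Bangladesh's haor wetlands. It improves prediction accuracy by identifying the seasonal influence of temperature and by using Sentinel-1 SAR data.

Comments: 9 pages, 9 figures. To be submitted to raaicon.org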

Abstract

Flash floods in Bangladesh's haor wetlands show up with almost no warning. They wreck the annual boro rice harvest. Current setups, built for riverine floods, miss backwater dynamics entirely. These basins are flat. Water does not behave like it does on the Brahmaputra. We built HaorFloodAlert, a deseasonalized machine learning ensemble that forecasts 72-hour flood probability for the Sunamganj Haor (approximately 8,000 km²). Temperature was acting as a seasonal cheat code: it inflated accuracy by 6.9 pp just because floods happen in warm months. We caught that. We also built an upstream Barak River Sentinel-1 SAR proxy from Silchar, Assam, giving about 36 hours of lead time. Otsu-thresholded SAR change detection validates at 84-91 percent spatial match. The operational ensemble (RF 0.5625 + XGBoost 0.4375) hits 89.6 percent LOOCV accuracy, 87.5 percent recall, and 0.943 AUC-ROC on 77 real Sentinel-1 events. A three-tier alert pipeline and a BRRI-calibrated boro rice damage estimator are included.
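
Two steps mentioned in the abstract lend themselves to a small sketch: removing the seasonal component from a feature such as temperature, and combining the two models' outputs. The code assumes that the quoted figures 0.5625 and 0.4375 are soft-voting weights and that deseasonalization means subtracting a per-month mean. Both readings, along with the function names, are assumptions for illustration and may differ from what the authors actually did.

```python
from collections import defaultdict

RF_WEIGHT = 0.5625   # figures quoted in the abstract, read here as weights
XGB_WEIGHT = 0.4375


def deseasonalize(values, months):
    """Subtract each calendar month's mean from its values (assumed scheme)."""
    by_month = defaultdict(list)
    for value, month in zip(values, months):
        by_month[month].append(value)
    monthly_mean = {m: sum(v) / len(v) for m, v in by_month.items()}
    return [value - monthly_mean[month] for value, month in zip(values, months)]


def ensemble_probability(p_rf, p_xgb):
    """Weighted soft vote over the two models' flood probabilities."""
    return RF_WEIGHT * p_rf + XGB_WEIGHT * p_xgb


# Toy data: two January and two July temperature readings.
temperature_anomaly = deseasonalize([20.0, 22.0, 30.0, 32.0], [1, 1, 7, 7])
flood_probability = ensemble_probability(p_rf=0.75, p_xgb=0.25)
print(temperature_anomaly, flood_probability)
```

After the monthly mean is removed, a warm July reading no longer looks different from a warm January reading, so a model cannot score well simply by learning that floods occur in warm months.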

2605.20165 2026-05-20 cs.CV

CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models

Hsiang-Wei Huang, Junbin Lu, Kuang-Ming Chen, Jianxu Shangguan, Cheng-Yen Yang, Jenq-Neng Hwang

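AI summary: This paper proposes CaMo, a camera-motion-grounded approach to evaluating and training vision-language models. By requiring models to generate explicit spatial narratives that are then used for reasoning, it exposes gaps in the spatial cognition of existing spatial VLMs, and shows that CaMo performs consistently on both spatial narrative evaluation and direct spatial question answering.

Comments: Code and model available at https://github.com/hsiangwei0903/CaMo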

Abstract

Vision-Language Models (VLMs) achieve strong performance on spatial question answering benchmarks, yet it remains unclear whether such gains reflect genuine spatial intelligence. We show that existing spatial VLMs lack basic camera motion understanding, a key component of spatial cognition. We propose the Spatial Narrative Score (SNS), an evaluation framework that requires VLMs to generate explicit spatial narratives capturing both scene semantics and camera motion, followed by reasoning with a frozen proxy LLM. Under SNS, state-of-the-art spatial VLMs exhibit significant performance degradation despite high direct question answering accuracy. To address this gap, we introduce CaMo, a camera motion grounded VLM that achieves consistent performance across SNS evaluation and direct spatial question answering accuracy. Our results highlight the importance of explicit spatial narrative externalization for evaluating VLMs with transferable 3D spatial understanding. Our code, data, and model are available at https://github.com/hsiangwei0903/CaMo.

2605.20164 2026-05-20 cs.AI

Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR

Utkarsh Tyagi, Xingang Guo, MohammadHossein Rezaei, Daniel George, Anas Mahmoud, Jackson Lee, Bing Liu, Yunzhong He

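AI summary: This paper proposes the POW3R framework, which preserves human weights and category balance while improving the rubric reward mechanism so that rubric criteria better match what the final answer requires, improving performance in both multimodal and text-only settings.

Comments: 24 pages, 7 figures, 6 tables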

Abstract

Reinforcement learning with verifiable rewards has made post-training highly effective when correctness can be checked automatically. However, many important model behaviors require satisfying several qualitative criteria at once. Rubric-based rewards address this setting by grading prompt-specific criteria and aggregating them into a scalar reward. Yet standard static aggregations conflate a criterion's human-assigned importance with its current usefulness as an optimization signal. We show that this assumption breaks down in rubric RL: many important criteria are already saturated or currently unreachable, while criteria that distinguish rollouts are not necessarily those with the largest human weights. We introduce POW3R, a policy-aware rubric reward framework that preserves human weights and category balance as the rubric objective while adapting criterion-level reward weights during training. POW3R uses rollout-level contrast to emphasize criteria that currently separate the policy's outputs, making the GRPO reward more informative without changing the underlying evaluation target. Across three base policies on two datasets spanning multimodal and text-only settings, POW3R wins $24$ of $30$ base-policy/metric comparisons, improving both mean rubric reward and strict completion (the fraction of prompts whose response satisfies every required rubric criterion) over vanilla GRPO with rubric rewards, and reaches the same plateau in $2.5$--$4\times$ fewer training steps. Rubric rewards should therefore distinguish what should matter in the final answer from what can teach the current policy.
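
The distinction the abstract draws between a criterion's human weight and its current usefulness as a training signal can be illustrated with a toy reweighting. The sketch below is not POW3R's actual rule: it assumes binary criterion scores, measures contrast as the variance of a criterion across a group of rollouts, and multiplies that by the human weight. The function names and the fallback behaviour are hypothetical, and the category balancing mentioned in the abstract is omitted.

```python
def contrast_weights(scores, human_weights):
    """Reweight criteria by how much they separate the current rollouts.

    scores[r][c] is 1 if rollout r satisfies criterion c, else 0.
    """
    n_rollouts = len(scores)
    raw = []
    for c, weight in enumerate(human_weights):
        pass_rate = sum(row[c] for row in scores) / n_rollouts
        contrast = pass_rate * (1.0 - pass_rate)  # zero when all rollouts agree
        raw.append(weight * contrast)
    total = sum(raw)
    if total == 0.0:
        return list(human_weights)  # nothing separates rollouts: keep human weights
    return [value / total for value in raw]


def rollout_reward(row, weights):
    """Scalar reward for one rollout under the given criterion weights."""
    return sum(w * s for w, s in zip(weights, row))


# Criterion 0 is saturated, criterion 1 is unreachable, criterion 2 varies.
scores = [[1, 0, 1], [1, 0, 0], [1, 0, 1], [1, 0, 0]]
human_weights = [0.5, 0.3, 0.2]
train_weights = contrast_weights(scores, human_weights)
print(train_weights)
print([rollout_reward(row, train_weights) for row in scores])
```

In this toy group, only the third criterion differs between rollouts, so it carries all of the training weight even though it has the smallest human weight. Evaluation would still use `human_weights`, matching the abstract's point that the evaluation target is unchanged.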

2605.20159 2026-05-20 cs.CV cond-mat.mtrl-sci cs.LG

Interpretable Computer Vision for Defect Detection in X-ray Tomography of Aerospace SiC/SiC Composites

Antonio Peña Corredor, Julien Lesseur, Romain Nunez, Paul Rivalland, Thomas Philippe

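AI summary: This study proposes p-ResNet-50, a framework that adds a prototype layer and, through new regularisation terms and semantic alignment, improves the interpretability and accuracy of defect detection in X-ray tomography while maintaining high precision and traceability.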

Abstract

Non-destructive testing of aerospace SiC/SiC composites via X-ray computed tomography (XCT) relies on expert visual assessment, with current workflows offering limited traceability for accept/reject decisions. Deep convolutional networks can automate defect detection, yet their black-box nature conflicts with the transparency that industrial inspection practice demands. To close this gap, we introduce p-ResNet-50, a convolutional framework extended with a prototype layer that couples high detection accuracy with case-based explanations. Six learned prototypes are explicitly aligned with expert-defined semantic categories (healthy matrix, matrix-air interfaces, pores, line-like defects, and mixed morphologies) so that every classification is traceable to a physically meaningful reference. Two novel regularisation terms, anchor-based and medoid-based, tether prototypes to expert-selected patches and prevent prototype collapse, addressing a known limitation of prototype networks. Latent-space analysis via UMAP delineates semantically coherent sub-domains and maps zones of uncertainty where misclassifications concentrate, giving inspectors an explicit picture of where the model is, and is not, reliable. The framework is validated on an XCT patch dataset of approximately 12,000 patches extracted from four defect-rich SiC/SiC laboratory specimens. Taking a black-box ResNet-50 as a baseline (ROC-AUC = 0.991), the prototype extension achieves comparable performance (accuracy 0.957 vs. 0.959; ROC-AUC 0.994 vs. 0.993) while trading a slight reduction in sensitivity for higher precision and specificity. Each decision is backed by representative evidence patches, and the model explicitly flags its uncertainty regions. Beyond defect mapping, the framework establishes a reusable methodology for embedding domain-expert knowledge into prototype networks, applicable to other XCT inspection scenarios requiring traceable, auditable decisions.

2605.20158 2026-05-20 cs.CV cs.AI cs.CL

Rethinking Visual Attribution for Chest X-ray Reasoning in Large Vision Language Models

Guangzhi Xiong, Qiao Jin, Sanchit Sinha, Zhiyong Lu, Aidong Zhang

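AI summary: To address the reliability of visual attribution for chest X-ray reasoning in large vision-language models, this paper proposes a causal evaluation framework that keeps only CXR-VQA samples whose expert-annotated region is verified, via counterfactual editing, to be causally responsible for the model's prediction. Across 11 attribution methods, six open-source LVLMs, and two output modes, existing attribution methods often fail to identify the evidence that LVLMs use. The paper therefore proposes MedFocus, a concept-based attribution method that localizes clinically meaningful anatomical regions via unbalanced optimal transport and measures their causal effect on model outputs through targeted interventions. It substantially outperforms prior methods and moves toward more trustworthy attribution for medical LVLMs.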

Abstract

Large Vision Language Models (LVLMs) show promise in medical applications, but their inability to faithfully ground responses in visual evidence raises serious concerns about clinical trustworthiness. While visual attribution methods are widely used to explain LVLM predictions, whether these explanations actually reflect the visual evidence underlying the model's decision is largely unverified, since ground-truth annotations for internal model reasoning are typically unavailable. We address this question for chest X-ray (CXR) reasoning by developing a causal evaluation framework that retains only CXR-VQA samples for which the expert-annotated region is verified, via counterfactual editing, to be causally responsible for the model's prediction. Using this framework across 11 attribution methods, six open-source LVLMs, and two output modes (direct answer and step-by-step reasoning), we find that existing attribution methods often fail to identify the evidence used by LVLMs. To address this failure, we propose MedFocus, a concept-based attribution method that localizes clinically meaningful anatomical regions via unbalanced optimal transport and measures their causal effect on model outputs through targeted interventions. MedFocus produces spatial, concept-level, and token-level attributions and substantially outperforms prior methods, taking a step toward more trustworthy attribution for medical LVLMs. Our data and code are available at https://github.com/gzxiong/medfocus/.

2605.20157 2026-05-20 cs.LG cs.CR cs.IR

SAGE: Scalable Automatic Gating Ensemble for Confident Negative Harvesting in Fraud Detection

Sudheer Tubati, Amit Goyal

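AI summary: This paper proposes SAGE, a counterfactual-aware negative harvesting approach that combines SimHash-based stratified sampling with a modular gating ensemble to confidently identify negatives in unlabeled data for fraud detection, addressing representation bias in positive-unlabeled learning.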

DOI: 10.1145/3779211.3793166
Journal ref: WSDM Companion '26: Nineteenth ACM International Conference on Web Search and Data Mining, 2026, Pages 34 - 38

Abstract

Music streaming fraud, where bad actors artificially inflate stream counts to manipulate chart rankings and royalty payments, poses a significant threat to streaming services and legitimate content creators. Traditional fraud detection approaches struggle with a critical challenge: many legitimate edge cases, including super-fans and sleep-music sessions, exhibit activity patterns that closely mimic those of coordinated fraud. We present SAGE, a novel counterfactual-aware negative harvesting approach that combines SimHash-based stratified sampling with a modular gating ensemble for confident negative identification from unlabeled data. Our ensemble architecture employs pluggable statistical gates (currently instantiated with Mahalanobis distance and k-NN density) with configurable voting thresholds enabling adaptive precision-recall trade-offs. This addresses the representation bias problem in Positive-Unlabeled learning by ensuring comprehensive coverage of rare behavioral cohorts through floor-constrained sampling. Evaluation demonstrates strong precision and recall on held-out data. The approach generalizes across fraud detection domains, achieving strong performance on both customer-level and artist-level fraud without modification to the core methodology.
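
The gating ensemble with a configurable voting threshold can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: each gate is assumed to vote "confident negative" when an unlabeled point lies far from the labeled fraud examples, the Mahalanobis gate is simplified to a diagonal covariance, and all function names and thresholds are hypothetical. The SimHash-based stratified sampling step is not shown.

```python
import math
from statistics import mean, pstdev


def diag_mahalanobis_gate(positives, min_distance):
    """Vote True when a point is far from the positives (diagonal covariance)."""
    dims = list(zip(*positives))
    mu = [mean(d) for d in dims]
    sd = [pstdev(d) or 1.0 for d in dims]  # guard against zero spread

    def gate(x):
        d = math.sqrt(sum(((xi - m) / s) ** 2 for xi, m, s in zip(x, mu, sd)))
        return d >= min_distance

    return gate


def knn_gate(positives, k, min_mean_distance):
    """Vote True when the k nearest positives are, on average, far away."""

    def gate(x):
        dists = sorted(math.dist(x, p) for p in positives)
        return mean(dists[:k]) >= min_mean_distance

    return gate


def harvest_negatives(unlabeled, gates, min_votes):
    """Keep unlabeled points that at least `min_votes` gates accept."""
    return [x for x in unlabeled if sum(g(x) for g in gates) >= min_votes]


# Toy 2-D features: known fraud clusters near (10, 10).
fraud = [(10, 10), (11, 9), (9, 11), (10, 12)]
unlabeled = [(0, 0), (10, 10.5), (3, 2)]
gates = [diag_mahalanobis_gate(fraud, 3.0), knn_gate(fraud, k=2, min_mean_distance=5.0)]
harvested = harvest_negatives(unlabeled, gates, min_votes=2)
print(harvested)
```

Raising `min_votes` favours precision of the harvested negatives and lowering it favours recall, which corresponds to the adaptive trade-off described in the abstract. New gates can be added to the list without changing `harvest_negatives`.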

2605.20151 2026-05-20 cs.LG math.ST stat.TH

When Does Model Collapse Occur in Structured Interactive Learning?

Yuchen Wu, Kangjie Zhou, Weijie Su

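AI summary: This study examines when generative model performance degrades (model collapse) in structured interactive learning environments. By analyzing the topology of the interaction graph, it derives a necessary and sufficient condition for model collapse and validates the theory with numerical experiments.

Comments: 57 pages, 12 figures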

Abstract

The proliferation of generative artificial intelligence has given rise to an interactive learning environment, where model parameters are continuously updated using not only data generated by natural processes, but also synthetic outputs produced by other models. This paradigm introduces two major challenges: (1) training data are no longer drawn exclusively from the target population, undermining a core assumption of classical statistical learning, and (2) model training processes become inherently correlated, as models interact with one another through repeated exposure to each other's synthetic outputs in a potentially complex manner. Establishing reliable statistical inference in such structured interactive learning environments therefore remains an important open problem. In particular, there is growing concern about model collapse, a phenomenon in which the performance of generative models progressively degrades as they are trained on synthetic data produced by earlier model generations. Prior work on model collapse primarily focuses on a single model trained on its own output, failing to capture model performance in multi-model interactive settings. In this work, we fill this gap by investigating the performance of generative models in an interactive learning environment with general interaction patterns. In particular, we formalize model interactions using directed graphs and show that the occurrence of model collapse depends critically on the topology of the interaction graph. We further derive an explicit necessary and sufficient condition characterizing when model collapse occurs, and establish finite-sample results for linear regression and asymptotic guarantees for general M-estimators. We support our theoretical findings through extensive numerical experiments.

2605.20149 2026-05-20 cs.CL cs.AI cs.HC

Less Back-and-Forth: A Comparative Study of Structured Prompting

减少来回交互：结构化提示的比较研究

Saurav Ghosh, Gabriella Polach, Abdou Sow

AI总结本文研究了结构化提示设计是否能提高LLM响应质量并减少用户努力，通过比较三种提示条件，发现检查清单提示在任务完成、正确性、合规性和清晰度方面得分最高，且在质量和努力的平衡上表现最佳。

Comments 7 pages, 2 figures, 6 tables

AI中文摘要

大型语言模型（LLMs）被广泛用于开放式任务，但提示不够明确时可能导致低质量的回答和额外的交互。本文研究结构化提示设计能否在提高响应质量的同时减少用户的工作量。我们比较了三种提示条件：原始提示、检查清单改进提示和澄清问题提示。我们在摘要、规划、解释和编程四类任务上，使用ChatGPT、Claude和Grok三个LLM系统评估这些条件。每个输出都用统一的评分标准打分，涵盖任务完成度、正确性、合规性和清晰度。检查清单改进提示的平均评分最高，为7.50（满分8分），而原始提示为5.67，澄清问题提示为6.67。检查清单提示在质量与工作量的权衡上也表现最佳，平均使用的token少于原始提示和澄清问题提示。这些结果表明，一份简单的提示检查清单可以改进LLM的回答，同时减少不必要的交互。

英文摘要

Large language models (LLMs) are widely used for open-ended tasks, but underspecified prompts can lead to low-quality answers and additional interaction. This paper studies whether structured prompt design improves response quality while reducing user effort. We compare three prompt conditions: a raw prompt, a checklist-improved prompt, and a clarifying-question prompt. We evaluate these conditions across four task types (summarization, planning, explanation, and coding) using three LLM systems: ChatGPT, Claude, and Grok. Each output is scored with a unified rubric covering task completion, correctness, compliance, and clarity. Checklist-improved prompts achieved the highest mean rubric score, 7.50 out of 8, compared with 5.67 for raw prompts and 6.67 for clarifying-question prompts. Checklist prompts also produced the best quality-effort tradeoff, using fewer average tokens than both raw and clarifying prompts. These results suggest that a simple prompt checklist can improve LLM responses while reducing unnecessary interaction.

2605.20147 2026-05-20 cs.CV

PixVerve: Advancing Native UHR Image Generation to 100MP with a Large-Scale High-Quality Dataset

PixVerve：借助大规模高质量数据集将原生超高分辨率（UHR）图像生成推进至100MP

Haojun Chen, Haoyang He, Chengming Xu, Qingdong He, Junwei Zhu, Yabiao Wang, Zhucun Xue, Xianfang Zeng, Zhennan Chen, Xiaobin Hu, Hao Zhao, Yong Liu, Jiangning Zhang, Dacheng Tao

AI总结本文提出PixVerve-95K数据集，该数据集通过精心设计的数据管道构建，包含95K张超高分辨率图像（每张至少1亿像素）和七维标注；并基于该数据集用三种训练方案将T2I基础模型扩展到原生100MP生成，同时建立了PixVerve-Bench评估协议。

Comments Project page is available at https://haojunchen663.github.io/projects/PixVerve/

AI中文摘要

文本到图像（T2I）模型近年来在1K和2K分辨率附近取得了显著进展。随着人们对更佳视觉体验的强烈追求和成像技术的快速发展，对超高分辨率（UHR）图像生成的需求显著增长。然而，由于高分辨率内容稀缺且复杂，UHR图像生成面临巨大挑战。在本文中，我们首先介绍PixVerve-95K，一个高质量、开源的UHR T2I数据集，它通过精心设计的数据管道构建，包含覆盖多样场景的95K张图像（每张图像至少1亿像素）和七维标注。基于这一大规模图文数据集，我们率先采用三种训练方案，将多种T2I基础模型扩展到原生100MP生成。最后，我们提出的PixVerve-Bench基准结合传统指标和基于多模态大语言模型的评估，为UHR图像建立了涵盖视觉质量和语义对齐的全面评估协议。基准上的大量实验结果以及对训练策略的探索，共同为未来的突破提供了有价值的见解。

英文摘要

Text-to-Image (T2I) models have recently seen notable progress around 1K and 2K resolution. With the extreme desire for better visual experience and the rapid development of imaging technology, the demand for Ultra-High-Resolution (UHR) image generation has grown significantly. However, UHR image generation poses great challenges due to the scarcity and complexity of high-resolution content. In this paper, we first introduce PixVerve-95K, a high-quality, open-source UHR T2I dataset curated with a carefully designed data pipeline, which contains 95K images across diverse scenarios (each image has a minimum pixel-count of 100M) and seven-dimensional annotations. Based on our large-scale image-text dataset, we take a pioneering step to extend various T2I foundation models to native 100MP generation with three training schemes. Finally, leveraging both conventional metrics and multimodal large language model-based assessments, our proposed PixVerve-Bench benchmark establishes a comprehensive evaluation protocol for UHR images encompassing visual quality and semantic alignment. Extensive experimental results on our benchmark and the constructive exploration of training strategies collaboratively provide valuable insights for future breakthroughs.

2605.20138 2026-05-20 cs.RO cs.SY eess.SY

Hamilton-Jacobi Reachability for Spacecraft Collision Avoidance

航天器碰撞避免的Hamilton-Jacobi可达性

Larry Hui, Jordan Kam, William Su, Jianshu Zhou

AI总结本文针对在同一圆轨道上运行的双卫星碰撞规避问题提出了Hamilton-Jacobi（HJ）可达性框架，其中相对运动在径向-切向-法向（RTN）坐标系中用平面Hill-Clohessy-Wiltshire（HCW）动力学建模。目标集被定义为不安全的相对构型，对应于与联邦通信委员会（FCC）轨道标准一致的最小间隔要求。航天器之间的相互作用被建模为零和微分博弈，其中玩家1是受控卫星，玩家2是意图未知的有界对抗性扰动。本文给出HJ表述并计算后向可达集，该集合刻画了在最坏情况扰动下无法避免碰撞的相对状态，而集合之外的状态存在可证明无碰撞的轨迹。这些可达集与监督式混合控制逻辑相结合，用于确定何时必须启动规避机动，从而为可扩展性提供有数学依据的安全保证。

Comments Accepted to the 20th IEEE International Conference on Control & Automation (IEEE ICCA 2026). 6 pages, 4 figures

AI中文摘要

本文针对在同一圆轨道上运行的双卫星碰撞规避问题提出了一个Hamilton-Jacobi（HJ）可达性框架，其中相对运动在径向-切向-法向（RTN）坐标系中用平面Hill-Clohessy-Wiltshire（HCW）动力学建模。我们将目标集定义为轨道平面内的不安全相对构型，对应于与联邦通信委员会（FCC）轨道标准一致的最小间隔要求。航天器之间的相互作用被表述为零和微分博弈，其中玩家1是受控卫星，玩家2被建模为意图未知的有界对抗性扰动。我们给出HJ表述并计算后向可达集，该集合刻画了在最坏情况扰动下无法避免碰撞的相对状态，而集合之外的状态则存在可证明无碰撞的轨迹。这些可达集与监督式混合控制逻辑相结合，用于确定何时必须启动规避机动，从而为可扩展性提供有数学依据的安全保证。

英文摘要

This article presents a Hamilton-Jacobi (HJ) reachability framework for a two-satellite collision avoidance problem operating in the same circular orbit, where relative motion is modeled in the radial-tangential-normal (RTN) frame using planar Hill-Clohessy-Wiltshire (HCW) dynamics. We define the target state space as unsafe relative configurations in the orbit plane corresponding to minimum separation requirements consistent with Federal Communications Commission (FCC) orbital standards. The interaction between spacecraft is formulated as a zero-sum differential game, where Player 1 is the controlled satellite and Player 2 is modeled as a bounded adversarial disturbance with unknown intent. We present the HJ formulation and compute backward reachable sets that characterize relative states from which collision cannot be avoided under worst-case disturbances, while states outside this set admit provably collision-free trajectories. These reachable sets are integrated with supervisory hybrid control logic to determine when evasive maneuvers must be initiated, enabling mathematically grounded safety guarantees for scalability.
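
The planar HCW model named in this abstract can be sketched numerically. The snippet below is only an illustration under stated assumptions, not the authors' implementation: it uses the common convention with x as the radial offset and y as the tangential (along-track) offset, a hypothetical net relative acceleration input `accel` standing in for control minus disturbance, and a circular unsafe set whose radius `min_separation` is a placeholder, since the abstract gives no numerical value.

```python
import math


def hcw_planar_derivative(state, n, accel=(0.0, 0.0)):
    """Time derivative of the planar HCW state (x, y, vx, vy).

    x: radial offset, y: tangential offset, n: mean motion of the reference
    circular orbit in rad/s. `accel` is a hypothetical net relative
    acceleration (an assumption of this sketch, not taken from the paper).
    """
    x, y, vx, vy = state
    ax = 3.0 * n * n * x + 2.0 * n * vy + accel[0]  # radial equation
    ay = -2.0 * n * vx + accel[1]                   # tangential equation
    return (vx, vy, ax, ay)


def signed_distance_to_unsafe_set(state, min_separation):
    """Negative inside the unsafe disc of radius min_separation, positive outside.

    A function of this shape is a typical terminal cost for a reachability
    computation; the disc-shaped target set is an assumption of this sketch.
    """
    x, y, _, _ = state
    return math.hypot(x, y) - min_separation
```

A pure along-track offset with zero relative velocity is an equilibrium of these equations, which is a quick sanity check on the signs of the coupling terms.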

2605.20134 2026-05-20 cs.LG

TrajTok: Adaptive Spatial Tokenization for Trajectory Representation Learning

TrajTok: 用于轨迹表示学习的自适应空间令牌化

Zhen Xiong, Shang-Ling Hsu, Cyrus Shahabi

AI总结本文提出TrajTok，一种通过自适应空间令牌化学习通用轨迹表示的方法，通过多分辨率六边形网格划分和预训练策略，实现了在轨迹相似性搜索、分类、预计到达时间和旅行时间回归等任务上的优异表现。

AI中文摘要

从原始GPS轨迹学习通用的轨迹表示仍然具有挑战性，因为数据是连续的、嘈杂的且采样不规则。空间令牌化同样具有挑战性：细网格会产生稀疏单元格，嵌入较弱，而粗网格会将异质运动模式合并为同一个令牌。我们提出了TrajTok，一种具有简单预训练配方的轨迹编码器，用于可转移的轨迹嵌入。TrajTok首先从GPS点的空间分布学习多分辨率六边形网格划分，将嘈杂的GPS序列转换为离散的单元格令牌。为了捕捉几何和运动学，它使用分解的Transformer编码器，带有早期模态自注意力块、跨注意力融合层和时空旋转位置嵌入（ST-RoPE），以编码每个令牌的位置和时间。TrajTok通过掩码令牌建模进行预训练，从部分轨迹观测中恢复几何结构和运动学模式。在Porto数据集上，冻结的TrajTok编码器结合轻量级任务适配器在轨迹相似性搜索、分类、预计到达时间和完整旅行时间回归任务上表现优异，优于多种任务特定方法。相同的冻结编码器支持几何主导和运动学主导任务，表明TrajTok学习了可转移的轨迹结构，而不是任务特定的捷径。这些结果表明，学习多分辨率空间令牌化结合掩码令牌预训练是通用轨迹基础模型的有希望的方向。

英文摘要

Learning generalizable trajectory representations from raw GPS traces remains difficult because the data is continuous, noisy, and irregularly sampled. Spatial tokenization is also challenging: fine grids yield sparse cells with weak embeddings, while coarse grids merge heterogeneous movement patterns into the same token. We present TrajTok, a trajectory encoder with a simple pretraining recipe for transferable trajectory embeddings. TrajTok first learns a multi-resolution hexagonal cell partition from the spatial distribution of GPS points, converting noisy GPS sequences into discrete cell tokens. To capture both geometry and kinematics, it uses a factorized transformer encoder with early per-modality self-attention blocks, cross-attention fusion layers, and spatiotemporal rotary position embeddings, ST-RoPE, to encode where and when each token occurs. TrajTok is pretrained with masked-token modeling that recovers both geometric structure and kinematic patterns from partial trajectory observations. On the Porto dataset, a frozen TrajTok encoder with lightweight task adapters achieves strong performance across trajectory similarity search, classification, estimated time of arrival, and full travel-time regression, outperforming multiple task-specific methods. The same frozen encoder supports both geometry-dominated and kinematics-dominated tasks, suggesting that TrajTok learns transferable trajectory structure rather than task-specific shortcuts. These results indicate that learned multi-resolution spatial tokenization combined with masked-token pretraining is a promising direction for general-purpose trajectory foundation models.

2605.20128 2026-05-20 cs.CL

MixRea: Benchmarking Explicit-Implicit Reasoning in Large Language Models

MixRea: 在大型语言模型中评估显式-隐式推理的基准测试

Yuanqing Cai, Ziyi Huang, Minhao Liu, Lixin Duan, Wen Li, Yanru Zhang

AI总结本文提出MixRea基准，用于评估大型语言模型的显式-隐式推理能力，发现即使表现最好的推理模型也存在非注意盲视（inattentional blindness）问题，并提出PRCP提示方法来改进推理。

Comments 12 pages, 6 figures, 4 tables

AI中文摘要

大型语言模型（LLMs）正越来越多地被用于高风险决策。受人类认知中“非注意盲视”（inattentional blindness）理论的启发，我们研究在嵌有注意偏差的人类偏好语料上训练的LLMs是否表现出类似的局限：在显式任务指令下未能关注细微但重要的上下文线索。为此，我们引入显式-隐式推理任务，并提出MixRea基准，它包含2,246道多选题，覆盖9种推理类型，显式与隐式信息的分布各不相同。对21个先进LLM的评估显示，即使表现最好的推理模型（Gemini 2.5 Pro）也只达到42.8%的一致性，揭示了普遍存在的非注意盲视。为缓解这一问题，我们提出潜在关系补全提示（Potential Relation Completion Prompting，PRCP），一种通过恢复被忽视的因果关系来改进推理的提示方法。进一步分析表明，这一局限在多种多源推理任务中依然存在，凸显了对认知上更对齐的模型的需求。

英文摘要

Large language models (LLMs) are increasingly integrated into high-stakes decision-making. Inspired by the theory of inattentional blindness in human cognition, we investigate whether LLMs, trained on human-preferred corpora that embed attentional biases, exhibit a similar limitation: failing to attend to subtle yet important contextual cues under explicit task instructions. To evaluate this, we introduce the task of explicit-implicit reasoning and present MixRea, a benchmark of 2,246 multiple-choice questions across 9 reasoning types with varying distributions of explicit and implicit information. Evaluation of 21 advanced LLMs shows that even the best-performing reasoning model (Gemini 2.5 Pro) achieves only 42.8% consistency, revealing widespread inattentional blindness. To mitigate this, we propose Potential Relation Completion Prompting (PRCP), a prompting method that improves reasoning by recovering overlooked causal relations. Further analysis shows that this limitation persists across diverse multi-source reasoning tasks, highlighting the need for more cognitively aligned models.

2605.20120 2026-05-20 cs.AI cs.LO

Using Aristotle API for AI-Assisted Theorem Proving in Lean 4: A Formalisation Case Study of the Grasshopper Problem

使用Aristotle API在Lean 4中进行AI辅助定理证明：蚱蜢问题（Grasshopper Problem）的形式化案例研究

Gabriel Rongyang Lau

AI总结本文报告了一项形式化案例研究，考察使用Aristotle API在Lean 4中对蚱蜢问题（IMO 2009第6题）进行AI辅助证明的尝试：四个辅助引理已通过验证，但主定理仍由一个未解决的sorry结束，说明局部证明搜索可以成功，而全局组合计数仍未解决。

AI中文摘要

AI辅助定理证明如今可以为奥林匹克级别的数学问题生成规模可观的Lean开发，但这些开发的证据效力取决于哪些声明真正通过了验证。本文报告了一项Lean 4形式化案例研究，对象是Aristotle API针对蚱蜢问题（Grasshopper problem，最初为IMO 2009第6题）的证明尝试。生成的成果给出了该定理的一个推广的Lean表述，包含四个已验证的辅助引理，对应极大性与相邻交换策略的局部组成部分，而主定理grasshopper则直接由一个未解决的sorry结束。已验证的部分证明了：最终部分和等于总和；相邻对换只会影响相应的中间部分和；改变后的部分和具有预期形式；以及在允许相邻后继交换的位置上，极大性会推出相应的禁止集成员关系。Aristotle的输出摘要指出，剩余的数学步骤是全局计数步骤，即证明这些成员关系给出至少n个不同的禁止值，从而与基数假设|M| < n矛盾；而Lean源码本身并未把主定理归约为一个单独编码的计数引理。该案例研究为AI辅助形式化中的一个核心局限提供了可检查的例子：局部证明搜索可以成功，而定理所需的全局组合计数仍未解决。本文贡献了一个可复现的Lean成果，以及对其中已验证与未验证证明内容的精确分析。

英文摘要

AI-assisted theorem proving can now generate substantial Lean developments for olympiad-level mathematics, but the evidential status of such developments depends on which declarations are actually verified. This paper reports a Lean 4 formalization case study of an Aristotle API proof attempt for the Grasshopper problem, originally posed as IMO 2009 Problem 6. The generated artifact states a generalized Lean version of the theorem, contains four verified helper lemmas for local components of a maximality and adjacent-swap exchange strategy, and leaves the main theorem grasshopper closed directly by one unresolved sorry. The verified components establish that the final partial sum equals the total sum, that an adjacent transposition can affect only the relevant intermediate partial sum, that the changed partial sum has the expected form, and that maximality at a position admitting an adjacent successor swap forces a corresponding forbidden-set membership fact. The Aristotle output summary identifies the intended remaining mathematical step as the global counting step needed to show that these membership facts produce at least n distinct forbidden values, contradicting the cardinality assumption |M| < n; the Lean source itself does not reduce the main theorem to a separately encoded counting lemma. This case study gives an inspectable example of a central limitation in AI-assisted formalization, namely that local proof search can succeed while the global combinatorial bookkeeping required for a theorem remains unresolved. The paper contributes a reproducible Lean artifact and a precise analysis of its verified and unverified proof content.

2605.20110 2026-05-20 cs.CV

SetCon: Towards Open-Ended Referring Segmentation via Set-Level Concept Prediction

SetCon: 通过集级概念预测实现开放式的指称分割

Zhixiong Zhang, Yizhuo Li, Shuangrui Ding, Yuhang Zang, Shengyuan Ding, Long Xing, Yibin Wang, Qiaosheng Zhang, Jiaqi Wang

AI总结本文提出SetCon，通过集级概念预测实现开放式的指称分割，利用LVLM生成的自然语言概念作为语义条件进行联合掩码-集解码，提高了分割的完整性和互斥性。

AI中文摘要

指称分割将自然语言查询落实到像素级掩码上，但将其扩展到包含多个实例、跨类别群组或开放式目标集的复杂场景仍然具有挑战性。先前基于大型视觉语言模型（LVLM）的方法用一个或多个特殊标记依次表示被指称的目标，把多个目标当作相互独立的输出而非一个连贯的集合，也几乎没有动力去刻画完整性、互斥性等集合级属性。我们将开放式指称分割重新表述为显式的集合级概念预测，并提出Set-Concept Segmentation（SetCon），它使用LVLM生成的自然语言概念，而不是分割专用的标记，作为联合掩码集解码的语义条件。分层语义分解首先预测一个界定目标范围的共享集合级概念，然后将其细化为与目标子集对齐的细粒度概念组。为支持这一点，一个两阶段标注流程为现有的推理分割数据集补充了分层语义监督（236k个样本，784k个概念短语）。SetCon在图像基准上取得了最先进的结果（gRefCOCO上+3.3 gIoU，MUSE上+12.1 gIoU），且优势随被指称目标数量的增加而扩大。该概念接口还能在先检测后跟踪的设置下迁移到视频，在七个指称视频基准上取得了新的最先进结果，包括MeViS上+10.9 J&F和Ref-SeCVOS上+12.4 J&F。

英文摘要

Referring segmentation grounds natural-language queries to pixel-level masks, but extending it to complex scenarios with multiple instances, cross-category groups, or open-ended target sets remains challenging. Previous Large Vision Language Model (LVLM)-based methods represent referred targets with one or more special tokens sequentially, treating multiple targets as separate outputs rather than a coherent set and offering little incentive to capture set-level properties such as completeness and mutual exclusivity. We reformulate open-ended referring segmentation as explicit set-level concept prediction and propose Set-Concept Segmentation (SetCon), which uses LVLM-generated natural-language concepts, instead of segmentation-specific tokens, as semantic conditions for joint mask-set decoding. A hierarchical semantic decomposition first predicts a shared set-level concept defining the target scope and then refines it into fine-grained concept groups aligned with target subsets. To support this, a two-stage annotation pipeline augments existing reasoning segmentation datasets with hierarchical semantic supervision (236k samples, 784k concept phrases). SetCon achieves state-of-the-art results on image benchmarks (+3.3 gIoU on gRefCOCO, +12.1 gIoU on MUSE), with margins that grow as the number of referred targets increases. The concept interface also transfers to video under a detect-and-track setting, yielding new state-of-the-art results on seven referring video benchmarks, including +10.9 J&F on MeViS and +12.4 J&F on Ref-SeCVOS.

2605.20107 2026-05-20 cs.LG cs.AI

Beyond Isotropy in JEPAs: Hamiltonian Geometry and Symplectic Prediction

超越各向同性：JEPAs中的哈密顿几何与辛预测

Robert Jenkinson Alvarez

AI总结本文研究了JEPAs中各向同性假设的局限性，提出基于哈密顿几何的辛预测方法，通过相空间状态和学习的哈密顿量预测视图间过渡，从而提升模型在不同数据集上的性能。

AI中文摘要

JEPAs通常将单视图嵌入正则化为各向同性高斯分布，从而隐含地把欧几里得对称性写入表示之中。我们证明这并非无害的默认设置。对于已知的结构化下游几何 $H\succ0$，在哈密顿能量预算下的极小极大协方差和最大熵协方差均为 $(c/d)H^{-1}$，而欧几里得各向同性会带来一个有闭式表达的“各向同性代价”。更重要的是，当下游几何未知时，不存在与几何无关的规范的固定边际目标：每一种固定的协方差形状都可能与某个结构化几何最大程度地错位。我们进一步证明，即使给定oracle单视图边际分布，也无法识别JEPA的视图间预测耦合。这些结果表明，JEPAs中的结构偏置应当进入跨视图耦合，而不是固定的编码器边际分布。我们用HamJEPA实例化这一原则：它将每个视图编码为相空间状态 $(q,p)$，并用学习得到的哈密顿蛙跳（leapfrog）映射预测视图间的转移，同时用非各向同性的尺度下界和谱下界防止崩溃。在刻意采用的无头（headless）token协议下，HamJEPA在CIFAR-100上相对SIGReg在30个epoch时提升4.89 kNN@20和3.52个线性探针点，在80个epoch时提升6.45 kNN@20和10.64个线性探针点；与之匹配的MLP预测器消融表明，辛耦合是带来邻域几何增益的关键成分。在ImageNet-100上，HamJEPA-$q$在45个epoch时提升4.82 kNN@20和7.52个线性探针点。

英文摘要

JEPAs often regularize one-view embeddings toward an isotropic Gaussian, implicitly baking Euclidean symmetry into the representation. We show that this is not merely a benign default. For a known structured downstream geometry $H\succ0$, the minimax and maximum-entropy covariance under a Hamiltonian energy budget is $(c/d)H^{-1}$, and Euclidean isotropy incurs a closed-form price of isotropy. More importantly, when the downstream geometry is unknown, no geometry-independent fixed marginal target is canonical: every fixed covariance shape can be maximally misaligned for some structured geometry. We further show that even oracle one-view marginals do not identify the JEPA view-to-view predictive coupling. These results suggest that the structural bias in JEPAs should enter the cross-view coupling rather than a fixed encoder marginal. We instantiate this principle with HamJEPA, which encodes each view as a phase-space state $(q,p)$ and predicts view-to-view transitions with a learned Hamiltonian leapfrog map, while non-isotropic scale and spectral floors prevent collapse. In a deliberately headless token protocol, HamJEPA improves over SIGReg on CIFAR-100 by $+4.89$ kNN@20 and $+3.52$ linear-probe points at 30 epochs, and by $+6.45$ kNN@20 and $+10.64$ linear-probe points at 80 epochs, while a matched MLP predictor ablation shows that the symplectic coupling is the ingredient driving the neighborhood-geometry gain. On ImageNet-100, HamJEPA-$q$ improves by $+4.82$ kNN@20 and $+7.52$ linear-probe points at 45 epochs.
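
The leapfrog map mentioned in this abstract is a standard symplectic integrator, and one step of it can be sketched as follows. This is a generic illustration rather than HamJEPA itself: it assumes a separable Hamiltonian H(q, p) = |p|^2 / 2 + V(q), and the hand-written `grad_quadratic_potential` stands in for whatever learned Hamiltonian the paper uses.

```python
def leapfrog_step(q, p, grad_potential, dt):
    """One leapfrog (kick-drift-kick) step for H(q, p) = |p|^2 / 2 + V(q).

    q, p: lists of floats (position-like and momentum-like coordinates).
    grad_potential: callable returning dV/dq as a list (hypothetical here;
    a learned model would supply this gradient instead).
    """
    g = grad_potential(q)
    p_half = [pi - 0.5 * dt * gi for pi, gi in zip(p, g)]      # half kick
    q_new = [qi + dt * phi for qi, phi in zip(q, p_half)]      # full drift
    g_new = grad_potential(q_new)
    p_new = [phi - 0.5 * dt * gi for phi, gi in zip(p_half, g_new)]  # half kick
    return q_new, p_new


def grad_quadratic_potential(q):
    """Gradient of the toy potential V(q) = |q|^2 / 2 (illustrative only)."""
    return list(q)
```

Two properties make this map attractive as a predictor: stepping forward by `dt` and then by `-dt` returns to the starting state up to rounding, and the energy of the toy oscillator stays close to its initial value over many steps.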

2605.20105 2026-05-20 cs.LG

Optimal Representation Size: High-Dimensional Analysis of Pretraining and Linear Probing

最优表示尺寸：预训练和线性探测的高维分析

Valentina Njaradi, Clémentine Dominé, Rachel Swanson, Marco Mondelli, Andrew Saxe

AI总结本文研究了预训练和线性探测过程中的最优表示尺寸问题，通过高维分析揭示了表示维度、未标记和标记样本数量以及任务对齐性对训练和泛化误差的影响，提出了在不同预训练和下游数据条件下优化表示尺寸的条件。

AI中文摘要

学习从有限数据中泛化是人工和生物系统面临的基本挑战。一种常见策略是从大量未标记数据中提取可重用的结构，从而高效适应新任务。这种两阶段范式现在已成为现代训练流水线的标准，即预训练后进行微调或线性探测。我们为这一过程提供了一个分析模型：结构提取被形式化为主成分分析，而下游学习则被建模为对单独标记数据集的线性回归。在高维情况下，我们推导出训练和泛化误差的精确表达式，展示了其对表示维度、未标记和标记样本数量以及任务对齐性的依赖性。我们的结果表明，预训练表示强烈影响下游泛化，我们将其最优表示尺寸作为任务参数的函数进行表征：在大量预训练数据但稀缺下游数据时，最大压缩表示最优；而在预训练数据有限时，高维表示泛化更好。此外，我们建立了预训练和监督之间的精确权衡，量化了需要多少未标记数据来替代一个标记样本。除了我们理想化的模型外，我们在自编码器和预训练大语言模型中也观察到相似的现象。总体而言，我们强调优化表示尺寸至关重要，给出了压缩预训练时提高泛化的条件。

英文摘要

Learning to generalise from limited data is a fundamental challenge for both artificial and biological systems. A common strategy is to extract reusable structure from abundant unlabelled data, enabling efficient adaptation to new tasks from limited labelled data. This two-stage paradigm is now standard in modern training pipelines, where pretraining is followed by fine-tuning or linear probing. We provide an analytical model of this process: structure extraction is formalized as principal component analysis on unlabelled data, and downstream learning as linear regression on a separate labelled dataset. In the high-dimensional regime, we derive exact expressions for training and generalisation error showcasing their dependence on representation dimensionality, unlabelled and labelled sample sizes, and task alignment. Our results show that pretrained representations strongly influence downstream generalisation, and we characterize the optimal representation size as a function of task parameters: with abundant pretraining data but scarce downstream data, maximally compressed representations are optimal, whereas with limited pretraining data, higher-dimensional representations generalise better. Furthermore, we establish an exact trade-off between pretraining and supervision, quantifying how much unlabelled data is required to replace a single labelled sample. Beyond our idealised model, we observe similar phenomenology in autoencoders and pretrained LLMs. Altogether, we highlight that optimising representation size is critical, giving conditions for when compression during pretraining improves generalisation.
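
The two-stage setup described in this abstract (PCA on unlabelled data, then linear regression on a separate labelled set) can be sketched with NumPy. This is a small numerical illustration, not the paper's analysis: the function names are hypothetical, and the intercept column in the probe is a choice made for this sketch.

```python
import numpy as np


def fit_pca(unlabelled, k):
    """Return the data mean and the top-k principal directions (rows)."""
    mean = unlabelled.mean(axis=0)
    _, _, vt = np.linalg.svd(unlabelled - mean, full_matrices=False)
    return mean, vt[:k]


def linear_probe(features, targets):
    """Least-squares regression on frozen features, with an intercept column."""
    design = np.hstack([features, np.ones((features.shape[0], 1))])
    weights, *_ = np.linalg.lstsq(design, targets, rcond=None)
    return weights


def probe_test_error(k, x_unlab, x_lab, y_lab, x_test, y_test):
    """Mean squared test error of a linear probe on a k-dimensional PCA code."""
    mean, components = fit_pca(x_unlab, k)
    weights = linear_probe((x_lab - mean) @ components.T, y_lab)
    test_features = (x_test - mean) @ components.T
    design = np.hstack([test_features, np.ones((test_features.shape[0], 1))])
    return float(np.mean((design @ weights - y_test) ** 2))
```

Sweeping `k` for fixed unlabelled and labelled sample sizes gives an empirical view of the representation-size trade-off that the abstract characterises analytically.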

2605.20104 2026-05-20 cs.LG cs.AI

Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding

少起草，多检索：用于推测解码的混合树构建

Yuhao Shen, Tianyu Liu, Xinyi Hu, Quan Kong, Baolin Zhang, Jun Dai, Jun Zhang, Shuang Ge, Lei Chen, Yue Li, Mingcheng Wan, Cong Wang

AI总结本文提出了一种混合树构建方法Graft，通过结合剪枝和检索操作，解决了推测解码中资源分配的帕累托权衡问题，实现了在不同部署场景下的速度提升和接受率优化。

AI中文摘要

推测解码（SD）通过“先起草、后验证”（draft-then-verify）的范式加速大语言模型推理。为最大化接受率，近期方法构建了庞大的草稿树，但这会带来严重的显存带宽和计算开销，成为端到端加速的瓶颈。动态深度剪枝虽可通过移除边际分支来降低延迟，但也会丢弃可能有效的候选，使接受率无法达到稠密树的上限。本文指出资源分配中的一个关键机会：从稠密起草转向剪枝起草会释放出可观的计算预算。为打破这一帕累托权衡，我们提出Graft，一个把剪枝和检索耦合为相互增强操作的补偿框架。剪枝为检索提供充足的预算，而检索弥补剪枝造成的覆盖损失并恢复接受长度。通过采用顺序的“prune-then-graft”机制，Graft将预测性强的检索token嫁接到剪枝腾出的位置上，以近乎零的开销填补拓扑空缺。Graft完全无需训练且无损。全面评估显示，Graft在短上下文生成、长上下文生成和大规模模型等实际部署设置中建立了新的帕累托前沿。在短上下文基准上，它实现了最高5.41×的加速，并在大规模的Qwen3-235B上将相对EAGLE-3的平均加速比最多提升21.8%。我们还初步探索了将Graft应用于DFlash风格的块起草范式，为把嫁接扩展到自回归草稿树之外提供了初步证据和见解。

英文摘要

Speculative decoding (SD) accelerates large language model inference by leveraging a draft-then-verify paradigm. To maximize the acceptance rate, recent methods construct expansive draft trees, which unfortunately incur severe VRAM bandwidth and computational overheads that bottleneck end-to-end speedups. While dynamic-depth pruning can reduce this latency by removing marginal branches, it also discards potentially valid candidates, preventing the acceptance rate from reaching the upper bound of dense trees. In this paper, we identify a critical opportunity in resource allocation: the transition from dense to pruned drafting frees up significant computational budget. To break this Pareto tradeoff, we introduce Graft, a compensation framework that couples pruning and retrieval as mutually reinforcing operations. Pruning supplies sufficient budget for retrieval, while retrieval compensates for pruning-induced coverage loss and recovers accepted length. By employing a sequential 'prune-then-graft' mechanism, Graft attaches highly predictive retrieved tokens into positions opened by pruning, filling the topological gaps with near-zero overhead. Graft is entirely training-free and lossless. Comprehensive evaluations show that Graft establishes a new Pareto frontier across practical deployment settings, including short-context generation, long-context generation, and large-scale models. On short-context benchmarks, it achieves up to 5.41× speedup and improves average speedup over EAGLE-3 by up to 21.8% on the large-scale Qwen3-235B. We also provide a preliminary exploration of applying Graft to the DFlash-style block drafting paradigm, offering initial evidence and insights for extending grafting beyond autoregressive draft trees.
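
The draft-then-verify paradigm that this abstract builds on can be illustrated with a single draft chain and greedy verification. This sketch is not Graft: it has no draft tree, pruning, or retrieval, and `draft_next` and `target_next` are hypothetical callables that return the greedy next token for a given context.

```python
def speculative_step(prefix, draft_next, target_next, num_draft):
    """Draft num_draft tokens, then keep the longest prefix the target agrees with.

    Returns the tokens to append: accepted draft tokens, followed by one
    token from the target (a correction on mismatch, a bonus token otherwise).
    """
    drafted, ctx = [], list(prefix)
    for _ in range(num_draft):
        token = draft_next(ctx)
        drafted.append(token)
        ctx.append(token)

    accepted, ctx = [], list(prefix)
    for token in drafted:
        expected = target_next(ctx)  # a real system scores all positions in one pass
        if token != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(token)
        ctx.append(token)
    accepted.append(target_next(ctx))  # every draft token was accepted
    return accepted


def generate(prefix, draft_next, target_next, num_draft, max_new_tokens):
    """Repeat speculative steps until max_new_tokens tokens have been produced."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new_tokens:
        out.extend(speculative_step(out, draft_next, target_next, num_draft))
    return out[: len(prefix) + max_new_tokens]
```

Because every emitted token is either confirmed or supplied by the target model, the output matches plain greedy decoding with the target, which is the sense in which such schemes are lossless.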

2605.20101 2026-05-20 cs.RO

Topology-Optimized Pneumatic Soft Actuator: Design and Experimental Validation

拓扑优化气动软执行器：设计与实验验证

Sumit Mehta, Konstantinos Poulios

AI总结本文通过非线性拓扑优化设计了软弹性气动执行器，并通过实验验证了其性能。

Comments 20 pages, 13 figures

2605.20090 2026-05-20 cs.CV

MetaEarth-MM: Unified Multimodal Remote Sensing Image Generation with Scene-centered Joint Modeling

MetaEarth-MM：基于场景中心联合建模的多模态遥感图像生成

Zhiping Yu, Chenyang Liu, Jinqi Cao, Qinzhe Yang, Siwei Yu, Zhengxia Zou, Zhenwei Shi

AI总结本文提出MetaEarth-MM模型，通过统一的多模态遥感图像生成框架，实现多模态图像的联合生成和任意模态之间的转换，展示了其在多模态遥感观测中的强大生成能力和广泛适用性。

AI中文摘要

多模态遥感图像对于地球观测至关重要，但在实践中，完整的配对观测往往稀缺。现有的生成方法通常通过孤立的成对模态翻译来解决这个问题，但随着模态数量和生成任务的增加，其通用性和可扩展性仍然有限。本文开发了一个生成基础模型MetaEarth-MM，用于多模态遥感图像生成，能够在统一模型中实现五种模态之间的配对联合生成和任意到任意的翻译。认识到多模态观测下内在的场景一致性，我们引入了MetaEarth-MM中的场景中心联合建模范式。与以往依赖直接外观级跨模态映射的方法不同，我们的模型围绕底层场景内容组织生成过程。具体而言，MetaEarth-MM采用解耦架构，首先从可用观测中推断出潜在的场景表示，然后基于此中间状态生成目标模态。为了支持训练，我们进一步构建了EarthMM，一个包含280万张多分辨率全球图像和220万对对齐图像的大型数据集。广泛的实验表明，MetaEarth-MM不仅在多样化的生成任务中表现出强大的生成能力和稳健的泛化能力，还支持数据和表示层面的下游任务，突显了其作为跨模态地球观测通用基础模型的潜力。代码和数据集将在https://github.com/YZPioneer/MetaEarth-MM上提供。

英文摘要

Multi-modal remote sensing images are vital for Earth observation, yet complete paired observations are often scarce in practice. Existing generative methods commonly address this problem through isolated pairwise modality translation, but their versatility and scalability remain limited as the number of modalities and generation tasks increases. Here, we develop a generative foundation model MetaEarth-MM for multi-modal remote sensing imagery, enabling paired joint generation and any-to-any translation across five modalities within a unified model. Recognizing the intrinsic scene consistency underlying multi-modal observations, we introduce a scene-centered joint modeling paradigm in MetaEarth-MM. Unlike previous methods that rely on direct appearance-level cross-modal mapping, our model organizes the generation around the underlying scene content. Specifically, MetaEarth-MM adopts a decoupled architecture that first infers a latent scene representation from available observations, and then generates target modalities conditioned on this intermediate state. To support training, we further construct EarthMM, a large-scale dataset comprising 2.8 million multi-resolution global images with 2.2 million aligned pairs. Extensive experiments demonstrate that MetaEarth-MM not only exhibits strong generative capability and robust generalization across diverse generation tasks, but also supports downstream tasks at both data and representation levels, highlighting its potential as a general foundation model for cross-modal Earth observation. The code and dataset will be available at https://github.com/YZPioneer/MetaEarth-MM.

2605.20088 2026-05-20 cs.LG cs.AI

INSHAPE: Instance-Level Shapelets for Interpretable Time-Series Classification

INSHAPE：用于可解释时间序列分类的实例级shapelets

Seongjun Lee, Seokhyun Lee, Changhee Lee

AI总结本文提出INSHAPE框架，通过发现每个时间序列特有的可变长度判别性时间模式并建模其时间依赖性，解决传统群体级shapelet与实例特定特征不一致以及忽略模式间时间依赖性的问题，从而提高时间序列分类的可解释性和预测性能。

Comments Accepted to IJCAI 2026. 25 pages

AI中文摘要

发现shapelets（即时间序列中具有判别性的时间模式）已被广泛研究，用于应对时间序列分类（TSC）固有的复杂性，并使模型的决策过程更加透明。然而，现有方法主要关注在整个数据集上优化的群体级shapelets，这带来两个根本性局限：（i）群体级模式往往与实例特定的特征不一致，导致性能欠佳并可能产生误导性的解释；（ii）大多数方法将shapelets视为相互独立的个体，忽略了多个模式之间重要的时间依赖和相互作用。为了解决这些局限，我们提出INSHAPE，一个可解释的TSC框架，它为每个时间序列发现其特有的可变长度判别性时间模式。INSHAPE将这些模式识别为互不重叠的片段并建模它们的时间依赖，从而在取得强大预测性能的同时提供清晰的实例级解释。此外，INSHAPE通过自下而上的方式将实例级shapelets聚合为原型（群体级）shapelets，从而连接局部可解释性与全局可解释性。在128个UCR和30个UEA基准数据集上的大量实验表明，INSHAPE始终优于最先进的基于shapelet的方法，同时提供更直观、更易解释的见解。

英文摘要

Discovering shapelets (i.e., discriminative temporal patterns within time series) has been widely studied to address the inherent complexity of time-series classification (TSC) and to make model decision-making processes more transparent. However, existing methods primarily focus on population-level shapelets optimized across the entire dataset, which leads to two fundamental limitations: (i) population-level patterns often misalign with instance-specific features, resulting in suboptimal performance and potentially misleading interpretations, and (ii) most methods treat shapelets as independent entities, overlooking important temporal dependencies and interactions among multiple patterns. To address these limitations, we propose INSHAPE, an interpretable TSC framework that discovers variable-length, discriminative temporal patterns specific to each time series. INSHAPE identifies these patterns as non-overlapping segments and models their temporal dependencies, thereby providing clear instance-level interpretations while achieving strong predictive performance. Furthermore, INSHAPE bridges local and global interpretability through a bottom-up approach, aggregating instance-level shapelets into prototypical (population-level) shapelets. Extensive experiments on 128 UCR and 30 UEA benchmark datasets show that INSHAPE consistently outperforms state-of-the-art shapelet-based methods while providing more intuitive and interpretable insights.
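
For readers unfamiliar with shapelets, the classical population-level formulation scores a series by its best-matching window against a fixed pattern. The sketch below shows that conventional sliding-window distance only; it is not INSHAPE's instance-level method, and the function name is hypothetical.

```python
import math


def shapelet_distance(series, shapelet):
    """Smallest Euclidean distance between the shapelet and any window of series.

    Returns (distance, start_index) of the best-matching window.
    """
    m = len(shapelet)
    if m == 0 or m > len(series):
        raise ValueError("shapelet must be non-empty and no longer than the series")
    best_distance, best_start = math.inf, -1
    for start in range(len(series) - m + 1):
        squared = sum((series[start + i] - shapelet[i]) ** 2 for i in range(m))
        distance = math.sqrt(squared)
        if distance < best_distance:
            best_distance, best_start = distance, start
    return best_distance, best_start
```

The returned start index is what makes shapelet features interpretable: it points at the segment of the series that triggered the match.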

2605.20085 2026-05-20 cs.CV

Spatially Prompted Visual Trajectory Prediction for Egocentric Manipulation

基于空间提示的第一人称视角操作视觉轨迹预测

Yifan Li, Xinyu Zhou, Yunhao Ge, Yu Kong

AI总结本文提出了空间提示视觉轨迹预测（SP-VTP）这一新问题设置，用空间提示定义任务目标，并提出结合任务编码器、观察编码器和轨迹生成器的SPOT方法，提升了第一人称视角操作中的跨场景轨迹预测性能。

AI中文摘要

机器人操作任务通常通过语言指令或任务标识符来指定，但在存在相似物体的杂乱环境中，以空间方式指明要移动什么、放到哪里会更为有效。针对物体与目标指定这一以视觉为中心的挑战，我们提出了据我们所知第一个对空间提示视觉轨迹预测（SP-VTP）的形式化定义。这一新设置利用初始空间提示（如边界框或点）来定义任务目标，要求模型根据第一人称视角视频流预测未来的末端执行器轨迹。为研究该问题，我们收集并标注了EgoSPT数据集，其中包含第一人称视角的空间提示操作轨迹，带有首帧的物体与目标定位标注以及恢复出的3D末端执行器运动。SP-VTP的难点在于任务指定是静态的，而场景配置会随时间变化。为解决这一问题，我们提出SPOT（Spatially Prompted Object-Target Policy），它结合了处理首帧视觉提示和坐标空间提示的任务编码器、处理当前视觉与历史上下文的观察编码器，以及预测未来末端执行器运动的轨迹生成器。在严格的场景级划分下的实验表明，SPOT的跨场景轨迹预测性能优于无提示或单源提示的基线。EgoSPT和SPOT共同确立了SP-VTP这一新的空间提示问题，它可作为第一人称视角操作中一种简单且可扩展的任务条件。

英文摘要

Robotic manipulation is often specified through language instructions or task identifiers, yet cluttered environments with similar objects are better handled by spatially indicating what to move and where to place it. Addressing the vision-centric challenge of object and goal specification, we present, to the best of our knowledge, the first formalization of Spatially Prompted Visual Trajectory Prediction (SP-VTP). This novel setting utilizes initial spatial prompts (like bounding boxes or points) to define task objectives, tasking the model with forecasting future end-effector trajectories from egocentric streams. To study this problem, we collect and annotate EgoSPT, a dataset of egocentric spatially prompted manipulation trajectories with first-frame object and target grounding annotations and recovered 3D end-effector motion. SP-VTP is challenging because the task specification is static, while the scene configuration evolves over time. To solve this problem, we propose SPOT (Spatially Prompted Object-Target Policy), which combines a task encoder for first-frame visual and coordinate spatial prompts, an observation encoder for current visual and history context, and a trajectory generator for future end-effector motion. Experiments under strict scene-level splits show that SPOT improves cross-scene trajectory prediction over non-prompted or single-source prompted baselines. Together, EgoSPT and SPOT establish a new spatial prompting problem, SP-VTP, as a simple and scalable task condition for egocentric manipulation.

2605.20084 2026-05-20 cs.CL cs.AI

BalanceRAG: Joint Risk Calibration for Cascaded Retrieval-Augmented Generation

BalanceRAG: 为级联检索增强生成进行联合风险校准

Zijun Jia, Yuanchang Ye, Sen Jia, Yiyao Qian, Haoning Wang, Baojie Chen, Diyin Tang, Jinsong Yu, Zhiyuan Wang

AI总结本文提出BalanceRAG，一种用于级联检索增强生成的联合风险校准方法，通过在二维晶格上确定安全操作点，实现风险自适应的阈值校准，从而在控制系统级错误率的同时保留更多示例，并扩展到多风险校准。

AI中文摘要

大型语言模型（LLMs）可通过检索增强生成（RAG）提高事实性，但在模型单独回答可靠时，将RAG应用于每个查询是不必要的。这促使了级联RAG：每个查询首先由LLM单独分支处理，如果主分支不确定则升级到RAG回退，当两个分支都不足够可信时则放弃。然而，逐级校准此类级联可能过于保守，因为最终的效用取决于LLM单独和RAG的联合不确定性阈值。在本文中，我们开发了BalanceRAG，以在目标风险水平下认证阈值对。给定两个分支的不确定性分数，BalanceRAG将每个阈值对框架为二维晶格上的一个操作点，并通过顺序图形测试确定安全操作点。这使得风险自适应的阈值校准成为可能，从而在控制接受点的系统级错误率的同时保留更多示例。此外，BalanceRAG扩展到多风险校准，允许检索使用与基于选择的条件风险一起被限制。在三个开放领域问答（QA）基准上的实验表明，BalanceRAG满足规定的风险水平，保留了更高的覆盖率和更多的接受正确示例，并且比始终开启RAG减少了不必要的检索调用。

英文摘要

Large language models (LLMs) can enhance factuality via retrieval-augmented generation (RAG), but applying RAG to every query is unnecessary when the model-only answer is reliable. This motivates cascaded RAG: each query is first handled by an LLM-only branch, escalated to a RAG fallback only if the primary branch is uncertain, and abstained from when neither branch is sufficiently trustworthy. However, calibrating such cascades stage by stage may be conservative, since the final utility depends on joint uncertainty thresholding of LLM-only and RAG. In this work, we develop BalanceRAG to certify threshold pairs at a target risk level. Given uncertainty scores from the two branches, BalanceRAG frames each threshold pair as an operating point on a two-dimensional lattice and identifies safe operating points using sequential graphical testing. This enables risk-adaptive threshold calibration, controlling the system-level error rate among accepted points, while retaining more examples. Furthermore, BalanceRAG extends to multi-risk calibration, allowing retrieval usage to be bounded together with the selection-conditioned risk. Experiments on three open-domain question answering (QA) benchmarks across multiple LLM backbones demonstrate that BalanceRAG meets prescribed risk levels, preserves higher coverage and more accepted correct examples, and reduces unnecessary retrieval calls compared with always-on RAG.
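
The cascade described in this abstract can be sketched as a two-threshold routing rule plus an empirical check of one threshold pair on labelled calibration data. This is only an illustration: the certification step (sequential graphical testing) is not shown, and the function names, record layout, and the convention that lower scores mean lower uncertainty are all assumptions of this sketch.

```python
def cascade_decision(u_llm, u_rag, t_llm, t_rag):
    """Route one query: answer with the LLM alone, fall back to RAG, or abstain."""
    if u_llm <= t_llm:
        return "llm"
    if u_rag <= t_rag:
        return "rag"
    return "abstain"


def empirical_selective_risk(records, t_llm, t_rag):
    """Error rate among accepted answers, and coverage, for one threshold pair.

    records: iterable of (u_llm, llm_correct, u_rag, rag_correct) tuples,
    a hypothetical calibration format chosen for this sketch.
    """
    total = accepted = errors = 0
    for u_llm, llm_correct, u_rag, rag_correct in records:
        total += 1
        route = cascade_decision(u_llm, u_rag, t_llm, t_rag)
        if route == "abstain":
            continue
        accepted += 1
        correct = llm_correct if route == "llm" else rag_correct
        errors += 0 if correct else 1
    risk = errors / accepted if accepted else 0.0
    coverage = accepted / total if total else 0.0
    return risk, coverage
```

Evaluating this function over a grid of `(t_llm, t_rag)` pairs produces the two-dimensional lattice of operating points that the abstract refers to; choosing among them with a statistical guarantee is the part the paper contributes.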

2605.20082 2026-05-20 cs.CV cs.AI

VL-DPO: Vision-Language-Guided Finetuning for Preference-Aligned Autonomous Driving

Zhefan Xu, Ghassen Jerfel, Marina Haliem, Qi Zhao, Jeonhyung Kang, Khaled S. Refaat

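AI summary: This paper proposes VL-DPO, a framework based on vision-language models that uses zero-shot reasoning to generate preference pairs for finetuning autonomous driving models, improving alignment with human driving preferences. Experiments show that the method outperforms the baseline model on both RFS and ADE.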

Comments Published in International Conference on Robotics and Automation (ICRA), 2026. 8 pages, 6 figures, 4 tables

English abstract

The rapid growth of autonomous driving datasets has enabled the scaling of powerful motion forecasting models. While large-scale pretraining provides strong performance, the standard imitation objective may not fully capture the complex nuances of human driving preferences. Meanwhile, recent advances in vision-language models (VLMs) have demonstrated impressive reasoning and commonsense understanding. Building on these capabilities, this paper presents VL-DPO, a vision-language-guided framework that aligns ego-vehicle motion forecasting models with human preferences. Our approach leverages a VLM as a zero-shot reasoner to automatically generate preference pairs from a pretrained model's rollouts, which are then used to finetune the model via Direct Preference Optimization (DPO). We finetune our models on the Waymo Open End-to-End Driving Dataset (WOD-E2E) and evaluate performance against held-out human preference annotations using rater feedback score (RFS) and average displacement error (ADE). Our experiments confirm that the VLM's trajectory selection is a high-quality proxy for human preference. Our final model, VL-DPO, yields an 11.94% increase in RFS and a 10.01% reduction in ADE over the pretrained model.
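Two quantities named in the abstract have standard definitions that a short sketch can make concrete: average displacement error (ADE) and the Direct Preference Optimization (DPO) loss for a single preference pair. This is a minimal illustration under stated assumptions, not the authors' implementation: trajectories are taken to be lists of 2D waypoints, the log-probability arguments are hypothetical scalar log-likelihoods of the preferred and rejected trajectories under the finetuned and reference (pretrained) models, and `beta=0.1` is an arbitrary illustrative value. The abstract does not say how trajectory log-likelihoods are computed or which hyperparameters are used.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def average_displacement_error(pred: Sequence[Point], target: Sequence[Point]) -> float:
    """Mean Euclidean distance between predicted and reference waypoints."""
    if not pred or len(pred) != len(target):
        raise ValueError("trajectories must be non-empty and of equal length")
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(pred)


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # illustrative value, not taken from the paper
) -> float:
    """Standard DPO loss for one pair: -log sigmoid(beta * log-ratio margin)."""
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Toy values, made up for illustration only.
print(average_displacement_error([(0.0, 0.0), (3.0, 4.0)], [(0.0, 0.0), (0.0, 0.0)]))
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

The loss falls below log 2 once the finetuned model raises the likelihood of the preferred trajectory relative to the reference model more than it raises the rejected one, which is the direction DPO pushes during finetuning.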
