arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.20182 2026-05-20 cs.LG cs.AI

Atoms of Thought: Universal EEG Representation Learning with Microstates

Xinyang Tian, Ruitao Liu, Ziyi Ye, Siyang Xue, Xin Wang, Xuesong Chen

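AI summary: This paper proposes a microstate-based approach to universal EEG representation learning. It clusters continuous EEG signals into sequences of discrete microstates to build a universal microstate tokenizer, and shows its advantage on downstream tasks such as sleep staging, emotion recognition, and motor imagery classification, with improved interpretability and scalability.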

Comments
Accepted by the 3rd International Workshop on Multimodal and Responsible Affective Computing (MRAC 2025). 8 pages of main text, 23 pages total, 5 figures, 4 tables
Abstract

Learning universal representations from electroencephalogram (EEG) signals is a cutting-edge approach in the field of neuroinformatics and brain-computer interfaces (BCIs). Conventionally, EEG is treated as a multivariate temporal signal, where time- or frequency-domain features are extracted for representation learning. This paper investigates a simple yet effective EEG representation, i.e., microstates. Microstates represent the building blocks of brain activity patterns at a microscopic time scale. We build a universal microstate tokenizer from a large medical EEG dataset by clustering continuous EEG signals into sequences of discrete microstates. The microstate tokenizer is then adopted universally across a series of downstream tasks, including sleep staging, emotion recognition, and motor imagery classification. Experimental results show that EEG representation learning with microstates outperforms traditional time-domain and frequency-domain features under different models and across different tasks. Further analysis shows that microstates offer greater interpretability and scalability, thereby opening up applications in both cognitive neuroscience and clinical research.
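
The tokenization step described above (clustering EEG samples into a small set of discrete microstates) can be illustrated with a toy sketch. This is not the authors' pipeline: it assumes a simplified polarity-invariant k-means over unit-normalised topographies, and the names `fit_tokenizer`, `tokenize`, and the farthest-point initialisation are hypothetical choices made only for this example.

```python
import math

def normalize(v):
    """Scale a topography (one value per channel) to unit length."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def assign(sample, templates):
    """Index of the template with the largest |similarity| (polarity is ignored)."""
    sims = [abs(dot(sample, t)) for t in templates]
    return max(range(len(templates)), key=sims.__getitem__)

def init_templates(data, k):
    """Assumed initialisation: start from the first sample, then repeatedly add
    the sample least similar to every template chosen so far."""
    templates = [data[0]]
    while len(templates) < k:
        nxt = min(data, key=lambda s: max(abs(dot(s, t)) for t in templates))
        templates.append(nxt)
    return [list(t) for t in templates]

def fit_tokenizer(samples, k, iters=20):
    """Learn k microstate templates from raw multi-channel samples."""
    data = [normalize(s) for s in samples]
    templates = init_templates(data, k)
    for _ in range(iters):
        sums = [[0.0] * len(data[0]) for _ in range(k)]
        for s in data:
            j = assign(s, templates)
            # Flip the sample if it is anti-correlated with its template.
            sign = 1.0 if dot(s, templates[j]) >= 0 else -1.0
            for c, x in enumerate(s):
                sums[j][c] += sign * x
        # Empty clusters keep their previous template.
        templates = [normalize(v) if any(v) else templates[j]
                     for j, v in enumerate(sums)]
    return templates

def tokenize(samples, templates):
    """Map each time sample to the index of its closest microstate template."""
    return [assign(normalize(s), templates) for s in samples]
```

Under these assumptions, a recording becomes a sequence of small integers (one token per time sample), which is the kind of discrete input a downstream sequence model could consume.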

2605.20179 2026-05-20 cs.CL

TIDE: Efficient and Lossless MoE Diffusion LLM Inference with I/O-aware Expert Offload

Zhiben Chen, Youpeng Zhao, Yang Sui, Jun Wang, Yuzhang Shang

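AI summary: This paper proposes TIDE, a new inference system built on the temporal stability of expert activations within a block during the diffusion process. It introduces an interval-based expert refresh strategy that optimizes I/O and CPU computation, delivering lossless acceleration and more efficient inference.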

Abstract

Diffusion Large Language Models (dLLMs) have emerged as a competitive alternative to autoregressive (AR) models, offering better hardware utilization and bidirectional context through parallel block-level decoding. However, as dLLMs continue to scale up with mixture-of-experts (MoE) architectures, their deployment on resource-constrained devices remains an open challenge. Existing AR-based methods often incur either prohibitive I/O overhead or significant compute bottlenecks. In this work, we propose TIDE, a novel resource-efficient inference system that leverages the temporal stability of expert activations during the diffusion process within a block. Specifically, we introduce an interval-based expert refresh strategy that updates the expert placement in an I/O-aware fashion. To ensure optimal performance, we formulate the inference scheduling as a mathematical programming problem, solving for the optimal interval that minimizes I/O traffic and CPU computation. Most importantly, TIDE is a lossless optimization that requires no model training, providing a "free lunch" acceleration for dLLM inference. In a single GPU-CPU system, we demonstrate that TIDE achieves up to 1.4× and 1.5× throughput improvements over prior baselines on LLaDA2.0-mini and LLaDA2.0-flash models, respectively.
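
The interval-based refresh idea can be pictured with a toy simulation. The sketch below is an illustration under stated assumptions, not TIDE's scheduler: `simulate`, `best_interval`, the cost constants, and the brute-force search (standing in for the mathematical program mentioned in the abstract) are all hypothetical. Experts that are not resident on the GPU are assumed to run on the CPU, so the computed output would be unchanged whatever the interval.

```python
def simulate(activations, interval, gpu_slots):
    """Count expert loads (I/O) and CPU fallbacks for a given refresh interval.

    activations: one set of active expert ids per diffusion step in a block.
    Returns (loads, cpu_runs).
    """
    resident = set()
    loads = cpu_runs = 0
    for step, active in enumerate(activations):
        if step % interval == 0:
            # Refresh: place the currently active experts on the GPU,
            # up to capacity (lowest ids first, an arbitrary tie-break).
            wanted = set(sorted(active)[:gpu_slots])
            loads += len(wanted - resident)
            resident = wanted
        # Active experts that are not resident fall back to the CPU.
        cpu_runs += len(active - resident)
    return loads, cpu_runs

def best_interval(activations, gpu_slots, io_cost, cpu_cost, max_interval):
    """Brute-force the interval with the lowest weighted I/O + CPU cost."""
    def cost(k):
        loads, cpu_runs = simulate(activations, k, gpu_slots)
        return io_cost * loads + cpu_cost * cpu_runs
    return min(range(1, max_interval + 1), key=cost)
```

When activations are stable across steps, a longer interval saves loads at the price of a few CPU fallbacks; the weights `io_cost` and `cpu_cost` decide where that trade-off settles.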

2605.20177 2026-05-20 cs.CL cs.CV

From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models

Juncheng Wu, Hardy Chen, Haoqin Tu, Xianfeng Tang, Freda Shi, Hui Liu, Hanqing Lu, Cihang Xie, Yuyin Zhou

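AI summary: By decoupling perception from reasoning, this study finds that performance on visual tasks is limited by weak perception rather than by reasoning itself. Staged training strengthens both perception and reasoning, yielding better results on several visual math and perception tasks.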

Comments
19 pages, 9 figures; Accepted to ICML 2026; Project Page: https://ucsc-vlaa.github.io/VLM-CapCurriculum/
Abstract

Recent advances in vision-language models (VLMs) emphasize long chain-of-thought reasoning; yet, we find that their performance on visual tasks is primarily limited by a lack of visual perception as opposed to reasoning itself. In this work, we systematically study the interplay between perception and reasoning in VLM post-training by decomposing their capabilities into three separate training stages: visual perception, visual reasoning, and textual reasoning, incorporating specialized training data. We demonstrate that visual perception (a) requires targeted optimization with specialized data; (b) serves as a fundamental scaffold that should be solidified through staged training before refining visual reasoning; and (c) is more effectively learned via RL than caption-based SFT. Our experiments across multiple VLMs demonstrate that staged training consistently improves both visual perception and reasoning performance over merged training. Notably, models trained with our approach achieve 1.5% higher reasoning accuracy with 20.8% shorter reasoning traces, suggesting that superior perception reduces the need for excessive reasoning. Furthermore, we show that this capability-based staging represents a new curriculum dimension orthogonal to traditional difficulty-based curricula, and combining both yields further additive gains. Our staged-training models achieve superior performance among open-weight VLMs, establishing advanced results on several visual math and perception tasks (e.g., +5.2% on WeMath and +3.7% on RealWorldQA) compared with the base counterpart.

2605.20176 2026-05-20 cs.CL

ClinSeekAgent: Automating Multimodal Evidence Seeking for Agentic Clinical Reasoning

Juncheng Wu, Letian Zhang, Yuhan Wang, Haoqin Tu, Hardy Chen, Zijun Wang, Cihang Xie, Yuyin Zhou

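AI summary: This paper proposes ClinSeekAgent, an automated agentic framework for dynamic multimodal evidence seeking. It actively retrieves, iteratively plans over, and synthesizes multimodal evidence from heterogeneous sources to improve clinical decision support.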

Comments
24 pages, 9 figures; Project Page: https://ucsc-vlaa.github.io/ClinSeekAgent/
Abstract

Large language models (LLMs) and agentic systems have shown promise for clinical decision support, but existing works largely assume that evidence has already been curated and handed to the model. Real-world clinical workflows instead require agents to actively seek, iteratively plan, and synthesize multimodal evidence from heterogeneous sources. In this paper, we introduce ClinSeekAgent, an automated agentic framework for dynamic multimodal evidence seeking that shifts the paradigm from passive evidence consumption to active evidence acquisition. Given only a clinical query and access to raw data sources, ClinSeekAgent gathers evidence by querying medical knowledge bases, navigating raw EHRs, and invoking medical imaging tools; refines its hypotheses as new information emerges; and integrates the collected evidence into grounded clinical decisions. ClinSeekAgent serves both as an inference-time agent for frontier LLMs and as a training-time pipeline for distilling high-quality agent trajectories into compact open-source models. To validate its inference-time effectiveness, we construct ClinSeek-Bench, which pairs Curated Input reasoning from fixed pre-selected evidence with Automated Evidence-Seeking over raw clinical data. On text-only EHR tasks, ClinSeekAgent improves Claude Opus 4.6 from 60.0 to 63.2 overall F1 and MiniMax M2.5 from 43.1 to 47.3, with positive risk-prediction gains in 7 out of 9 evaluated host models. On multimodal tasks, ClinSeekAgent improves Claude Opus 4.6 from 47.5 to 62.6 (+15.1); all evaluated models improve across the three CXR-related task groups. We further validate ClinSeekAgent as a training pipeline by distilling agentic evidence-seeking trajectories into ClinSeek-35B-A3B, which achieves 34.0 average F1 on existing AgentEHR-Bench, improving over its Qwen3.5-35B-A3B baseline by +11.9 points and approaching Claude Opus 4.6.

2605.20174 2026-05-20 cs.CV cs.LG

Multi-axis Analysis of Image Manipulation Localization

Keanu Nichols, Divya Appapogu, Giscard Biamby, Dina Bashkirova, Anna Rohrbach, Bryan A. Plummer

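AI summary: This paper introduces the AUDITS benchmark for multi-axis analysis of image manipulation detection. It evaluates the robustness of existing methods under different types of domain shift, with the aim of advancing more reliable and generalizable detection methods.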

Comments
28 pages, 5 figures, 5 tables
Abstract

Advanced image editing software enables easy creation of highly convincing image manipulations, which has been made even more accessible in recent years due to advances in generative AI. Manipulated images, while often harmless, could spread misinformation, create false narratives, and influence people's opinions on important issues. Despite this growing threat, there is limited research on detecting advanced manipulations across different visual domains. Thus, we introduce Analysis Under Domain-shifts, qualIty, Type, and Size (AUDITS), a comprehensive benchmark designed for studying axes of analysis in image manipulation detection. AUDITS comprises over 530K images from two distinct sources (user and news photos). We curate our dataset to support analysis across multiple axes using recent diffusion-based inpaintings, spanning a diverse range of manipulation types and sizes. We conduct experiments under different types of domain shift to evaluate robustness of existing image manipulation detection methods. Our goal is to drive further research in this area by offering new insights that would help develop more reliable and generalizable image manipulation detection methods.

2605.20173 2026-05-20 cs.AI cs.SE

A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents

Vasundra Srinivasan

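AI summary: This paper proposes a methodology for selecting and composing runtime architecture patterns that define the stochastic-deterministic boundary of LLM agents, and discusses its application across agent types along with a reliability decomposition.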

Comments
25 pages, 2 figures, 6 tables. Companion repo at https://github.com/vasundras/agent-runtime-patterns
Abstract

Production LLM agents combine stochastic model outputs with deterministic software systems, yet the boundary between the two is rarely treated as a first-class architectural object. This paper names that boundary the stochastic-deterministic boundary (SDB): a four-part contract among a proposer, verifier, commit step, and reject signal that specifies how an LLM output becomes a system action. We argue that the SDB is the load-bearing primitive of production agent runtimes. Around this primitive, we organize agent runtime design into three concerns: Coordination, State, and Control. We present a catalog of six runtime patterns that compose the SDB differently across conversational, autonomous, and long-horizon agents: hierarchical delegation, scatter-gather plus saga, event-driven sequencing, shared state machine, supervisor plus gate, and human in the loop. For each pattern, we trace its lineage to distributed-systems concepts and identify what changes when the worker is stochastic. The paper contributes a five-step methodology for selecting runtime patterns, a diagnostic procedure that maps production failures to pattern weaknesses, and a failure mode called replay divergence, in which LLM-based consumers of a deterministic event log produce different downstream outputs under model-version or prompt changes. A stylized reliability decomposition separates per-call model variance from architectural momentum, motivating the claim that as model variance decreases, pattern choice and SDB strength become increasingly important levers for long-run reliability. We apply the methodology to five workloads and provide one runnable reference implementation for a 90-day contract-renewal agent.
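
The four-part contract (proposer, verifier, commit step, reject signal) can be sketched as a small wrapper. This is one possible reading of the abstract, not code from the paper or its companion repository: the `Boundary` and `Rejection` names, the retry loop, and the idea of feeding the verifier's reason back to the proposer are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Union

@dataclass
class Rejection:
    """Reject signal: the proposal never became a system action."""
    reason: Optional[str]

@dataclass
class Boundary:
    # Stochastic side: e.g. an LLM call; receives the task and prior feedback.
    propose: Callable[[str, Optional[str]], str]
    # Deterministic check: returns None when the proposal is acceptable,
    # otherwise a human-readable reason.
    verify: Callable[[str], Optional[str]]
    # Deterministic side effect, executed only for verified proposals.
    commit: Callable[[str], None]
    max_attempts: int = 3

    def run(self, task: str) -> Union[str, Rejection]:
        feedback = None
        for _ in range(self.max_attempts):
            proposal = self.propose(task, feedback)
            reason = self.verify(proposal)
            if reason is None:
                self.commit(proposal)
                return proposal
            feedback = reason  # assumed: the reason is fed back to the proposer
        return Rejection(feedback)
```

The point of the shape is that nothing the proposer emits reaches `commit` without passing `verify`, and a caller always receives either a committed action or an explicit `Rejection`.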

2605.20172 2026-05-20 cs.LO cs.AI

Long-term Power Grid Planning via Answer Set Programming

Antonio Ielo, Francesco Doria, Sandra Castellanos-Paez, Marco Maratea, Francesco Percassi, Mauro Vallati

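AI summary: This paper proposes an approach based on Answer Set Programming to automate and optimise long-term power grid planning, addressing challenges such as sustainability targets, demand patterns, and urbanisation trends.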

Comments
16 pages, 4 figures
Abstract

The power grid is a critical infrastructure underpinning all aspects of modern society and its services. Maintaining its effectiveness requires continuous adaptation. In particular, addressing sustainability targets, demand patterns, and urbanisation trends requires implementing changes to the network. Actual developments can potentially span over a decade, during which supply continuity and service quality must be preserved by ensuring conformance to several topological and combinatorial invariants. Long-term power grid planning deals with the above process, and although planning languages could be a natural choice, the kind of properties and invariants needed are cumbersome to express in such languages; on the contrary, they can be elegantly and succinctly encoded in Answer Set Programming (ASP). In this paper, we propose the first approach to automate and optimise the long-term power grid planning process using ASP. Experimental evaluations conducted on synthetic and real-world grid data confirm the expressive power of the proposed ASP-based approach and demonstrate its effectiveness.

2605.20170 2026-05-20 cs.CL

KoRe: Compact Knowledge Representations for Large Language Models

Davide Cavicchini, Fausto Giunchiglia, Jacopo Staiano

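AI summary: This paper proposes KoRe, which encodes 1-hop knowledge-graph sub-graphs into compact discrete knowledge tokens and injects them into an LLM, improving knowledge-grounded reasoning while reducing token usage.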

Abstract

Modern Large Language Models (LLMs) have shown impressive performance in user-facing tasks such as question answering, as well as consistent improvements in reasoning capabilities. Still, the way these models encode knowledge seems inherently flawed: by design, LLMs encode world knowledge within their parameters. This way of representing knowledge is inherently opaque, difficult to debug and update, and prone to hallucinations. On the other hand, Knowledge Graphs can provide human-readable and easily editable world knowledge representations, and their application in knowledge-intensive tasks has consistently proven beneficial to downstream performance. Nonetheless, current integration techniques require extensive retraining or finetuning. To overcome this issue, we introduce KoRe, a methodology to encode 1-hop sub-graphs into compact discrete knowledge tokens and inject them into an LLM backbone. We test the proposed approach on three established benchmarks, and report competitive performance coupled with a significant reduction (up to 10x) in token usage. Our results show that compact discrete KG representations can efficiently and effectively be used to ground modern LLMs.

2605.20167 2026-05-20 cs.AI cs.LG

HaorFloodAlert: Deseasonalized ML Ensemble for 72-Hour Flood Prediction in Bangladesh Haor Wetlands

Salma Hoque Talukdar Koli, Fahima Haque Talukder Jely, Md. Samiul Alim, Md. Zakir Hossen

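AI summary: This paper proposes HaorFloodAlert, a deseasonalized machine learning ensemble that forecasts 72-hour flood probability for Bangladesh's haor wetlands. It improves forecast reliability by identifying the seasonal effect of temperature and by using Sentinel-1 SAR data.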

Comments
9 pages, 9 figures. To be submitted to raaicon.org
Abstract

Flash floods in Bangladesh's haor wetlands show up with almost no warning. They wreck the annual boro rice harvest. Current setups, built for riverine floods, miss backwater dynamics entirely. These basins are flat. Water does not behave like it does on the Brahmaputra. We built HaorFloodAlert, a deseasonalized machine learning ensemble that forecasts 72-hour flood probability for the Sunamganj Haor (approximately 8,000 km²). Temperature was acting as a seasonal cheat code: it inflated accuracy by 6.9 pp just because floods happen in warm months. We caught that. We also built an upstream Barak River Sentinel-1 SAR proxy from Silchar, Assam, giving about 36 hours of lead time. Otsu-thresholded SAR change detection validates at 84-91 percent spatial match. The operational ensemble (RF 0.5625 + XGBoost 0.4375) hits 89.6 percent LOOCV accuracy, 87.5 percent recall, and 0.943 AUC-ROC on 77 real Sentinel-1 events. A three-tier alert pipeline and a BRRI-calibrated boro rice damage estimator are included.
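
Two of the ingredients named above are easy to picture in code. The sketch below is a guess at their simplest form, not the authors' implementation: only the weights 0.5625 and 0.4375 come from the abstract, while reading the ensemble as a weighted average of class probabilities and reading "deseasonalized" as "subtract each calendar month's mean" are both assumptions, as are the function names.

```python
from collections import defaultdict

# Weights quoted in the abstract for the operational ensemble.
RF_WEIGHT, XGB_WEIGHT = 0.5625, 0.4375

def ensemble_probability(p_rf, p_xgb):
    """Assumed combination rule: weighted average of the two flood probabilities."""
    return RF_WEIGHT * p_rf + XGB_WEIGHT * p_xgb

def deseasonalize(values, months):
    """Assumed deseasonalization: remove each calendar month's mean so the
    feature carries only anomalies, not the time of year."""
    buckets = defaultdict(list)
    for v, m in zip(values, months):
        buckets[m].append(v)
    means = {m: sum(vs) / len(vs) for m, vs in buckets.items()}
    return [v - means[m] for v, m in zip(values, months)]
```

With a transformation of this kind, a hot day in June and a hot day in January look alike to the model, so temperature can no longer act as a stand-in for "it is flood season".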

2605.20165 2026-05-20 cs.CV

CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models

Hsiang-Wei Huang, Junbin Lu, Kuang-Ming Chen, Jianxu Shangguan, Cheng-Yen Yang, Jenq-Neng Hwang

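AI summary: This paper proposes CaMo, a camera-motion-grounded method for evaluating and training vision-language models. By requiring models to generate explicit spatial narratives that are then reasoned over, it exposes gaps in the spatial cognition of existing spatial VLMs and shows that CaMo performs consistently on both spatial-narrative evaluation and direct spatial question answering.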

Comments
Code and model available at https://github.com/hsiangwei0903/CaMo
Abstract

Vision-Language Models (VLMs) achieve strong performance on spatial question answering benchmarks, yet it remains unclear whether such gains reflect genuine spatial intelligence. We show that existing spatial VLMs lack basic camera motion understanding, a key component of spatial cognition. We propose the Spatial Narrative Score (SNS), an evaluation framework that requires VLMs to generate explicit spatial narratives capturing both scene semantics and camera motion, followed by reasoning with a frozen proxy LLM. Under SNS, state-of-the-art spatial VLMs exhibit significant performance degradation despite high direct question answering accuracy. To address this gap, we introduce CaMo, a camera motion grounded VLM that achieves consistent performance across SNS evaluation and direct spatial question answering accuracy. Our results highlight the importance of explicit spatial narrative externalization for evaluating VLMs with transferable 3D spatial understanding. Our code, data, and model are available at https://github.com/hsiangwei0903/CaMo.

2605.20164 2026-05-20 cs.AI

Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR

Utkarsh Tyagi, Xingang Guo, MohammadHossein Rezaei, Daniel George, Anas Mahmoud, Jackson Lee, Bing Liu, Yunzhong He

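AI summary: This paper proposes POW3R, a framework that improves the rubric reward mechanism while preserving human weights and category balance, so that the rubric better matches what the final answer requires, improving performance in both multimodal and text-only settings.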

Comments
24 pages, 7 figures, 6 tables
Abstract

Reinforcement learning with verifiable rewards has made post-training highly effective when correctness can be checked automatically. However, many important model behaviors require satisfying several qualitative criteria at once. Rubric-based rewards address this setting by grading prompt-specific criteria and aggregating them into a scalar reward. Yet standard static aggregations conflate a criterion's human-assigned importance with its current usefulness as an optimization signal. We show that this assumption breaks down in rubric RL: many important criteria are already saturated or currently unreachable, while criteria that distinguish rollouts are not necessarily those with the largest human weights. We introduce POW3R, a policy-aware rubric reward framework that preserves human weights and category balance as the rubric objective while adapting criterion-level reward weights during training. POW3R uses rollout-level contrast to emphasize criteria that currently separate the policy's outputs, making the GRPO reward more informative without changing the underlying evaluation target. Across three base policies on two datasets spanning multimodal and text-only settings, POW3R wins 24 of 30 base-policy/metric comparisons, improving both mean rubric reward and strict completion (the fraction of prompts whose response satisfies every required rubric criterion) over vanilla GRPO with rubric rewards, and reaches the same plateau in 2.5 to 4× fewer training steps. Rubric rewards should therefore distinguish what should matter in the final answer from what can teach the current policy.
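
The gap between human-assigned importance and current usefulness as a training signal can be shown with a toy reweighting. The sketch below is not POW3R's formula, which the abstract does not give: it assumes that "rollout-level contrast" can be approximated by the variance of a criterion's score across a group of rollouts, and `training_weights`, `rollout_rewards`, and the `floor` constant are hypothetical. Category balance is not modelled.

```python
def criterion_contrast(scores):
    """Variance of one criterion's scores across rollouts (0 if saturated or unreachable)."""
    mean = sum(scores) / len(scores)
    return sum((s - mean) ** 2 for s in scores) / len(scores)

def training_weights(human_weights, rollout_scores, floor=0.05):
    """Scale each human weight by how much the criterion separates rollouts.

    rollout_scores[r][c] is the grade of rollout r on criterion c (0 or 1 here).
    The result is renormalised so the total weight equals the human total.
    """
    raw = []
    for c, w in enumerate(human_weights):
        column = [row[c] for row in rollout_scores]
        raw.append(w * (floor + criterion_contrast(column)))
    scale = sum(human_weights) / sum(raw)
    return [r * scale for r in raw]

def rollout_rewards(weights, rollout_scores):
    """Weighted fraction of rubric criteria satisfied by each rollout."""
    total = sum(weights)
    return [sum(w * s for w, s in zip(weights, row)) / total
            for row in rollout_scores]
```

In this toy version, a heavily weighted criterion that every rollout already satisfies contributes almost nothing to the training reward, while a lighter criterion that splits the rollouts dominates it. The human weights would still define the evaluation target.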

2605.20159 2026-05-20 cs.CV cond-mat.mtrl-sci cs.LG

Interpretable Computer Vision for Defect Detection in X-ray Tomography of Aerospace SiC/SiC Composites

Antonio Peña Corredor, Julien Lesseur, Romain Nunez, Paul Rivalland, Thomas Philippe

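AI summary: This study proposes p-ResNet-50, a framework with a prototype layer. By introducing new regularisation terms and semantic alignment, it improves the interpretability and accuracy of defect detection in X-ray tomography while maintaining high precision and traceability.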

Abstract

Non-destructive testing of aerospace SiC/SiC composites via X-ray computed tomography (XCT) relies on expert visual assessment, with current workflows offering limited traceability for accept/reject decisions. Deep convolutional networks can automate defect detection, yet their black-box nature conflicts with the transparency that industrial inspection practice demands. To close this gap, we introduce p-ResNet-50, a convolutional framework extended with a prototype layer that couples high detection accuracy with case-based explanations. Six learned prototypes are explicitly aligned with expert-defined semantic categories (healthy matrix, matrix-air interfaces, pores, line-like defects, and mixed morphologies) so that every classification is traceable to a physically meaningful reference. Two novel regularisation terms, anchor-based and medoid-based, tether prototypes to expert-selected patches and prevent prototype collapse, addressing a known limitation of prototype networks. Latent-space analysis via UMAP delineates semantically coherent sub-domains and maps zones of uncertainty where misclassifications concentrate, giving inspectors an explicit picture of where the model is, and is not, reliable. The framework is validated on an XCT patch dataset of approximately 12,000 patches extracted from four defect-rich SiC/SiC laboratory specimens. Taking a black-box ResNet-50 as a baseline (ROC-AUC = 0.991), the prototype extension achieves comparable performance (accuracy 0.957 vs. 0.959; ROC-AUC 0.994 vs. 0.993) while trading a slight reduction in sensitivity for higher precision and specificity. Each decision is backed by representative evidence patches, and the model explicitly flags its uncertainty regions. Beyond defect mapping, the framework establishes a reusable methodology for embedding domain-expert knowledge into prototype networks, applicable to other XCT inspection scenarios requiring traceable, auditable decisions.

2605.20158 2026-05-20 cs.CV cs.AI cs.CL

Rethinking Visual Attribution for Chest X-ray Reasoning in Large Vision Language Models

Guangzhi Xiong, Qiao Jin, Sanchit Sinha, Zhiyong Lu, Aidong Zhang

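AI summary: This paper addresses the reliability of visual attribution for large vision language models in chest X-ray reasoning. It proposes a causal evaluation framework that retains only CXR-VQA samples for which the expert-annotated region is verified, via counterfactual editing, to be causally responsible for the model's prediction. Across 11 attribution methods, six open-source LVLMs, and two output modes, it finds that existing attribution methods often fail to identify the evidence LVLMs actually use. It then proposes MedFocus, a concept-based attribution method that localizes clinically meaningful anatomical regions via unbalanced optimal transport and measures their causal effect on model outputs through targeted interventions, substantially outperforming existing methods and advancing more trustworthy attribution for medical LVLMs.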

Abstract

Large Vision Language Models (LVLMs) show promise in medical applications, but their inability to faithfully ground responses in visual evidence raises serious concerns about clinical trustworthiness. While visual attribution methods are widely used to explain LVLM predictions, whether these explanations actually reflect the visual evidence underlying the model's decision is largely unverified, since ground-truth annotations for internal model reasoning are typically unavailable. We address this question for chest X-ray (CXR) reasoning by developing a causal evaluation framework that retains only CXR-VQA samples for which the expert-annotated region is verified, via counterfactual editing, to be causally responsible for the model's prediction. Using this framework across 11 attribution methods, six open-source LVLMs, and two output modes (direct answer and step-by-step reasoning), we find that existing attribution methods often fail to identify the evidence used by LVLMs. To address this failure, we propose MedFocus, a concept-based attribution method that localizes clinically meaningful anatomical regions via unbalanced optimal transport and measures their causal effect on model outputs through targeted interventions. MedFocus produces spatial, concept-level, and token-level attributions and substantially outperforms prior methods, taking a step toward more trustworthy attribution for medical LVLMs. Our data and code are available at https://github.com/gzxiong/medfocus/.

2605.20157 2026-05-20 cs.LG cs.CR cs.IR

SAGE: Scalable Automatic Gating Ensemble for Confident Negative Harvesting in Fraud Detection

Sudheer Tubati, Amit Goyal

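AI summary: This paper proposes SAGE, a counterfactual-aware negative harvesting approach that combines SimHash-based stratified sampling with a modular gating ensemble to confidently identify negatives in unlabeled data for fraud detection, addressing the representation bias problem in Positive-Unlabeled learning.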

Journal ref
WSDM Companion '26: Nineteenth ACM International Conference on Web Search and Data Mining, 2026, Pages 34 - 38
Abstract

Music streaming fraud, where bad actors artificially inflate stream counts to manipulate chart rankings and royalty payments, poses a significant threat to streaming services and legitimate content creators. Traditional fraud detection approaches struggle with a critical challenge: many legitimate edge cases, including super-fans and sleep-music sessions, exhibit activity patterns that closely mimic those of coordinated fraud. We present SAGE, a novel counterfactual-aware negative harvesting approach that combines SimHash-based stratified sampling with a modular gating ensemble for confident negative identification from unlabeled data. Our ensemble architecture employs pluggable statistical gates (currently instantiated with Mahalanobis distance and k-NN density) with configurable voting thresholds enabling adaptive precision-recall trade-offs. This addresses the representation bias problem in Positive-Unlabeled learning by ensuring comprehensive coverage of rare behavioral cohorts through floor-constrained sampling. Evaluation demonstrates strong precision and recall on held-out data. The approach generalizes across fraud detection domains, achieving strong performance on both customer-level and artist-level fraud without modification to the core methodology.
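
A gating ensemble with a configurable voting threshold might look roughly like the sketch below. It is a simplified illustration, not SAGE itself: it assumes both gates measure distance from known fraud (positive) examples, uses a diagonal-covariance Mahalanobis distance to stay dependency-free, and leaves out the SimHash-based stratified sampling entirely. All names and threshold values are hypothetical.

```python
import math

def diag_mahalanobis(x, mean, var):
    """Mahalanobis distance under an assumed diagonal covariance."""
    return math.sqrt(sum((xi - m) ** 2 / v for xi, m, v in zip(x, mean, var)))

def knn_distance(x, points, k):
    """Mean Euclidean distance from x to its k nearest points."""
    dists = sorted(math.dist(x, p) for p in points)
    return sum(dists[:k]) / k

def harvest_negatives(unlabeled, positives, k=2,
                      maha_min=3.0, knn_min=2.0, min_votes=2):
    """Keep unlabeled points that enough gates judge to be far from known fraud."""
    dims = len(positives[0])
    n = len(positives)
    mean = [sum(p[i] for p in positives) / n for i in range(dims)]
    # Small variance floor avoids division by zero on constant features.
    var = [max(sum((p[i] - mean[i]) ** 2 for p in positives) / n, 1e-9)
           for i in range(dims)]
    gates = [
        lambda x: diag_mahalanobis(x, mean, var) >= maha_min,
        lambda x: knn_distance(x, positives, k) >= knn_min,
    ]
    # Each gate casts one vote; min_votes sets the precision-recall trade-off.
    return [x for x in unlabeled if sum(g(x) for g in gates) >= min_votes]
```

Raising `min_votes` makes the harvested negatives more trustworthy but fewer; lowering it admits borderline points that only one gate considers safe.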

2605.20151 2026-05-20 cs.LG math.ST stat.TH

When Does Model Collapse Occur in Structured Interactive Learning?

Yuchen Wu, Kangjie Zhou, Weijie Su

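AI summary: This study examines when generative model performance degrades (model collapse) in structured interactive learning environments. By analysing the topology of the interaction graph, it derives a necessary and sufficient condition for model collapse and validates the theory with numerical experiments.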

Comments
57 pages, 12 figures
Abstract

The proliferation of generative artificial intelligence has given rise to an interactive learning environment, where model parameters are continuously updated using not only data generated by natural processes, but also synthetic outputs produced by other models. This paradigm introduces two major challenges: (1) training data are no longer drawn exclusively from the target population, undermining a core assumption of classical statistical learning, and (2) model training processes become inherently correlated, as models interact with one another through repeated exposure to each other's synthetic outputs in a potentially complex manner. Establishing reliable statistical inference in such structured interactive learning environments therefore remains an important open problem. In particular, there is growing concern about model collapse, a phenomenon in which the performance of generative models progressively degrades as they are trained on synthetic data produced by earlier model generations. Prior work on model collapse primarily focuses on a single model trained on its own output, failing to capture model performance in multi-model interactive settings. In this work, we fill this gap by investigating the performance of generative models in an interactive learning environment with general interaction patterns. In particular, we formalize model interactions using directed graphs and show that the occurrence of model collapse depends critically on the topology of the interaction graph. We further derive an explicit necessary and sufficient condition characterizing when model collapse occurs, and establish finite-sample results for linear regression and asymptotic guarantees for general M-estimators. We support our theoretical findings through extensive numerical experiments.
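
The single-model setting that the abstract cites as prior work can be illustrated with a toy loop. The sketch below is a minimal illustration under assumptions not taken from the paper: a one-dimensional Gaussian model fitted by maximum likelihood, a hypothetical function name `final_variance`, and a hypothetical `fresh_fraction` parameter that sets what share of each generation's training data comes from the true distribution rather than from the previous model. It does not implement the paper's graph-based analysis.

```python
import math
import random

def final_variance(generations, n, fresh_fraction, seed=0):
    """Fit a Gaussian repeatedly, each time on data partly drawn from the previous fit."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0 equals the true N(0, 1) (assumed toy target)
    n_fresh = int(round(fresh_fraction * n))
    for _ in range(generations):
        real = [rng.gauss(0.0, 1.0) for _ in range(n_fresh)]
        synthetic = [rng.gauss(mu, sigma) for _ in range(n - n_fresh)]
        data = real + synthetic
        # Maximum-likelihood (biased) estimates of mean and variance.
        mu = sum(data) / len(data)
        var = sum((x - mu) ** 2 for x in data) / len(data)
        sigma = math.sqrt(var)
    return sigma ** 2

collapsed = final_variance(2000, 50, 0.0)  # trained only on its own outputs
anchored = final_variance(2000, 50, 0.5)   # half of each batch is real data
```

In this toy, training purely on synthetic data drives the fitted variance toward zero, while mixing in real data each generation keeps it bounded away from zero. Which interaction patterns behave like which case is exactly what the paper's graph condition is meant to characterize.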

2605.20149 2026-05-20 cs.CL cs.AI cs.HC

Less Back-and-Forth: A Comparative Study of Structured Prompting

Saurav Ghosh, Gabriella Polach, Abdou Sow

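AI summary: This paper studies whether structured prompt design improves LLM response quality and reduces user effort. Comparing three prompt conditions, it finds that checklist prompts score highest on task completion, correctness, compliance, and clarity, and give the best quality-effort tradeoff.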

Comments
7 pages, 2 figures, 6 tables

Abstract

Large language models (LLMs) are widely used for open-ended tasks, but underspecified prompts can lead to low-quality answers and additional interaction. This paper studies whether structured prompt design improves response quality while reducing user effort. We compare three prompt conditions: a raw prompt, a checklist-improved prompt, and a clarifying-question prompt. We evaluate these conditions across four task types (summarization, planning, explanation, and coding) using three LLM systems: ChatGPT, Claude, and Grok. Each output is scored with a unified rubric covering task completion, correctness, compliance, and clarity. Checklist-improved prompts achieved the highest mean rubric score, 7.50 out of 8, compared with 5.67 for raw prompts and 6.67 for clarifying-question prompts. Checklist prompts also produced the best quality-effort tradeoff, using fewer average tokens than both raw and clarifying prompts. These results suggest that a simple prompt checklist can improve LLM responses while reducing unnecessary interaction.

2605.20147 2026-05-20 cs.CV

PixVerve: Advancing Native UHR Image Generation to 100MP with a Large-Scale High-Quality Dataset

Haojun Chen, Haoyang He, Chengming Xu, Qingdong He, Junwei Zhu, Yabiao Wang, Zhucun Xue, Xianfang Zeng, Zhennan Chen, Xiaobin Hu, Hao Zhao, Yong Liu, Jiangning Zhang, Dacheng Tao

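AI summary: This paper presents PixVerve-95K, a dataset built with a carefully designed data pipeline that contains 95K high-resolution images with seven-dimensional annotations, to advance ultra-high-resolution image generation. It extends T2I foundation models to 100MP generation with three training schemes and establishes the PixVerve-Bench evaluation protocol.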

Comments
Project page is available at https://haojunchen663.github.io/projects/PixVerve/

Abstract

Text-to-Image (T2I) models have recently seen notable progress around 1K and 2K resolution. With the extreme desire for better visual experience and the rapid development of imaging technology, the demand for Ultra-High-Resolution (UHR) image generation has grown significantly. However, UHR image generation poses great challenges due to the scarcity and complexity of high-resolution content. In this paper, we first introduce PixVerve-95K, a high-quality, open-source UHR T2I dataset curated with a carefully designed data pipeline, which contains 95K images across diverse scenarios (each image has a minimum pixel-count of 100M) and seven-dimensional annotations. Based on our large-scale image-text dataset, we take a pioneering step to extend various T2I foundation models to native 100MP generation with three training schemes. Finally, leveraging both conventional metrics and multimodal large language model-based assessments, our proposed PixVerve-Bench benchmark establishes a comprehensive evaluation protocol for UHR images encompassing visual quality and semantic alignment. Extensive experimental results on our benchmark and the constructive exploration of training strategies collaboratively provide valuable insights for future breakthroughs.

2605.20145 2026-05-20 stat.ML cs.LG stat.ME

Goal-Oriented Lower-Tail Calibration of Gaussian Processes for Bayesian Optimization

Aurélien Pion, Emmanuel Vazquez

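AI summary: This paper studies goal-oriented calibration, in the noiseless setting, of the predictive distributions of standard Gaussian process models below a low threshold t. It proposes tcGP, a post-hoc method that calibrates the part of the predictive distribution below t, shows that the resulting global optimization algorithm remains dense in the design space, and reports improved lower-tail calibration and Bayesian optimization performance relative to standard and globally calibrated Gaussian process models.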

Journal ref
ICML 2026

Abstract

Bayesian optimization (BO) selects evaluation points for expensive black-box objectives using Gaussian process (GP) predictive distributions. Kernel choice and hyperparameter selection can lead to miscalibrated predictive distributions and an inappropriate exploration-exploitation trade-off. For minimization, sampling criteria such as expected improvement (EI) depend on the predictive distribution below the current best value, so lower-tail miscalibration directly affects the sampling decision. This article studies goal-oriented calibration of GP predictive distributions below a low threshold $t$ in the noiseless setting, for standard GP models with hyperparameters selected by maximum likelihood. A framework for predictive reliability below $t$ is introduced, based on two notions of spatial calibration: occurrence calibration over the design space and thresholded $μ$-calibration on sublevel sets of the form $\{x\in\mathbb{X}, f(x)\le t\}$. Building on this framework, we propose tcGP, a post-hoc method that calibrates GP predictive distributions below $t$, and we show that the resulting EI-based global optimization algorithm remains dense in the design space. Experiments on standard benchmarks show improved lower-tail calibration and BO performance relative to standard GP models and globally calibrated GP models.
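
To make the dependence on the lower tail concrete: for minimization with current best value m and a Gaussian predictive distribution N(μ, σ²), the textbook closed form is EI = (m − μ)Φ(z) + σφ(z) with z = (m − μ)/σ. The following is a minimal sketch of that standard formula only. It is not the tcGP method, and the function name `expected_improvement` is a hypothetical choice for illustration.

```python
import math

def expected_improvement(mu, sigma, best):
    """Closed-form EI for minimization under a Gaussian predictive N(mu, sigma^2)."""
    if sigma <= 0.0:
        # Degenerate predictive distribution: improvement is deterministic.
        return max(best - mu, 0.0)
    z = (best - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal density
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard normal CDF
    return (best - mu) * cdf + sigma * pdf
```

At a point whose predicted mean lies above the current best, EI comes entirely from the probability mass below the best value, so an overconfident (too small) σ there suppresses EI sharply. This is the sense in which lower-tail miscalibration feeds directly into the sampling decision.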

2605.20138 2026-05-20 cs.RO cs.SY eess.SY

Hamilton-Jacobi Reachability for Spacecraft Collision Avoidance

Larry Hui, Jordan Kam, William Su, Jianshu Zhou

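AI summary: This paper presents a Hamilton-Jacobi (HJ) reachability framework for collision avoidance between two satellites in the same orbit, with relative motion modeled in the radial-tangential-normal (RTN) frame using planar Hill-Clohessy-Wiltshire (HCW) dynamics. The target set consists of unsafe relative configurations corresponding to minimum separation requirements consistent with Federal Communications Commission (FCC) orbital standards. The interaction is modeled as a zero-sum differential game in which Player 1 is the controlled satellite and Player 2 is a bounded adversarial disturbance with unknown intent. Backward reachable sets characterize the relative states from which collision cannot be avoided in the worst case, and they are combined with supervisory hybrid control logic to decide when an evasive maneuver must begin.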

Comments
Accepted to the 20th IEEE International Conference on Control & Automation (IEEE ICCA 2026). 6 pages, 4 figures

Abstract

This article presents a Hamilton-Jacobi (HJ) reachability framework for a collision avoidance problem between two satellites operating in the same circular orbit, where relative motion is modeled in the radial-tangential-normal (RTN) frame using planar Hill-Clohessy-Wiltshire (HCW) dynamics. We define the target state space as the set of unsafe relative configurations in the orbit plane corresponding to minimum separation requirements consistent with Federal Communications Commission (FCC) orbital standards. The interaction between spacecraft is formulated as a zero-sum differential game, where Player 1 is the controlled satellite and Player 2 is modeled as a bounded adversarial disturbance with unknown intent. We present the HJ formulation and compute backward reachable sets that characterize relative states from which collision cannot be avoided under worst-case disturbances, while states outside this set admit provably collision-free trajectories. These reachable sets are integrated with supervisory hybrid control logic to determine when evasive maneuvers must be initiated, enabling mathematically grounded safety guarantees for scalability.
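
The planar HCW model referred to above has a standard linear form: with x the radial offset, y the along-track (tangential) offset, and n the mean motion of the reference orbit, the unforced equations are ẍ = 3n²x + 2nẏ and ÿ = −2nẋ. The sketch below is a minimal illustration of those textbook dynamics with an additive control acceleration. The function names, the way control enters, and the separation radius `r_min` are assumptions for illustration; the adversarial disturbance and the HJ value-function computation are not reproduced.

```python
import math

def hcw_rhs(state, control, n):
    """Planar HCW dynamics; state = (x, y, vx, vy), x radial, y along-track."""
    x, y, vx, vy = state
    ux, uy = control  # control accelerations (assumed additive)
    ax = 3.0 * n * n * x + 2.0 * n * vy + ux
    ay = -2.0 * n * vx + uy
    return (vx, vy, ax, ay)

def rk4_step(state, control, n, h):
    """One classical Runge-Kutta step of length h with the control held constant."""
    def shifted(s, k, c):
        return tuple(si + c * ki for si, ki in zip(s, k))
    k1 = hcw_rhs(state, control, n)
    k2 = hcw_rhs(shifted(state, k1, h / 2), control, n)
    k3 = hcw_rhs(shifted(state, k2, h / 2), control, n)
    k4 = hcw_rhs(shifted(state, k3, h), control, n)
    return tuple(s + h / 6 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))

def is_unsafe(state, r_min):
    """True when the in-plane separation is below r_min (hypothetical threshold)."""
    return math.hypot(state[0], state[1]) < r_min

def propagate(state, n, periods=1, steps=2000):
    """Propagate unforced relative motion for a number of orbital periods."""
    h = periods * 2.0 * math.pi / n / steps
    for _ in range(steps):
        state = rk4_step(state, (0.0, 0.0), n, h)
    return state
```

A quick sanity check on the model: with initial along-track velocity −2n·x₀, the unforced relative orbit is a closed ellipse, so `propagate` should return to its starting point after one period. A reachability computation would then ask from which states `is_unsafe` can be forced to become true despite the best control.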

2605.20134 2026-05-20 cs.LG

TrajTok: Adaptive Spatial Tokenization for Trajectory Representation Learning

Zhen Xiong, Shang-Ling Hsu, Cyrus Shahabi

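AI summary: This paper proposes TrajTok, which learns general-purpose trajectory representations through adaptive spatial tokenization. Using a multi-resolution hexagonal grid partition and a pretraining strategy, it performs strongly on trajectory similarity search, classification, estimated time of arrival, and travel-time regression.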

Abstract

Learning generalizable trajectory representations from raw GPS traces remains difficult because the data is continuous, noisy, and irregularly sampled. Spatial tokenization is also challenging: fine grids yield sparse cells with weak embeddings, while coarse grids merge heterogeneous movement patterns into the same token. We present TrajTok, a trajectory encoder with a simple pretraining recipe for transferable trajectory embeddings. TrajTok first learns a multi-resolution hexagonal cell partition from the spatial distribution of GPS points, converting noisy GPS sequences into discrete cell tokens. To capture both geometry and kinematics, it uses a factorized transformer encoder with early per-modality self-attention blocks, cross-attention fusion layers, and spatiotemporal rotary position embeddings, ST-RoPE, to encode where and when each token occurs. TrajTok is pretrained with masked-token modeling that recovers both geometric structure and kinematic patterns from partial trajectory observations. On the Porto dataset, a frozen TrajTok encoder with lightweight task adapters achieves strong performance across trajectory similarity search, classification, estimated time of arrival, and full travel-time regression, outperforming multiple task-specific methods. The same frozen encoder supports both geometry-dominated and kinematics-dominated tasks, suggesting that TrajTok learns transferable trajectory structure rather than task-specific shortcuts. These results indicate that learned multi-resolution spatial tokenization combined with masked-token pretraining is a promising direction for general-purpose trajectory foundation models.

2605.20132 2026-05-20 physics.geo-ph cs.LG eess.SP

FiLark: a streaming-first software framework for end-to-end exploration, annotation, and algorithm integration in distributed acoustic sensing

Jintao Li, Weichang Li, Kai Tong, Xaingyu Guo

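AI summary: This paper presents the FiLark framework, which applies a streaming-first principle to end-to-end exploration, annotation, and algorithm integration for distributed acoustic sensing data, addressing the inability of conventional batch-oriented analysis frameworks to handle continuous, high-channel-count data streams.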

Abstract

Distributed acoustic sensing (DAS) systems generate continuous, ultra-high-channel-count data streams at rates that exceed the capabilities of conventional batch-oriented analysis frameworks. As a result, essential tasks such as interactive exploration of long-duration recordings, scalable event annotation, and real-time algorithm-in-the-loop monitoring remain inadequately supported by workflows built around manually selected data segments and offline processing. This paper presents FiLark (Fiber Lark), a Python framework that applies a streaming-first principle uniformly across data access, signal processing, visualization, and monitoring for DAS. Instead of operating on manually selected data segments, FiLark presents any DAS source, including continuous multi-file recordings, as a unified stream and builds all system components around that abstraction. An OpenGL-based ring-buffer renderer enables interactive browsing and visualization of arbitrarily long recordings with constant memory usage. An integrated annotation interface supports event labeling directly within continuous data streams, facilitating the creation of reproducible machine-learning-ready labeled datasets without offline preprocessing. The signal processing library includes temporal, spatial, spectral, and decomposition-based operators, with both CPU implementations and GPU-accelerated variants via PyTorch, alongside stateful chunked execution that preserves processing continuity and application semantics across segment boundaries. A standardized monitor interface further integrates streaming detectors and learning-based models into the visualization workflow. By sharing a common streaming abstraction across all layers, FiLark allows processing configurations and workflows developed interactively to transfer directly to scalable production pipelines without modification.

2605.20128 2026-05-20 cs.CL

MixRea: Benchmarking Explicit-Implicit Reasoning in Large Language Models

Yuanqing Cai, Ziyi Huang, Minhao Liu, Lixin Duan, Wen Li, Yanru Zhang

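AI summary: This paper proposes the MixRea benchmark for evaluating large language models on explicit-implicit reasoning tasks, finds that even the best model suffers from inattention to implicit cues, and proposes the PRCP method to improve reasoning.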

Comments
12 pages, 6 figures, 4 tables

Abstract

Large language models (LLMs) are increasingly integrated into high-stakes decision-making. Inspired by the theory of inattentional blindness in human cognition, we investigate whether LLMs, trained on human-preferred corpora that embed attentional biases, exhibit a similar limitation: failing to attend to subtle yet important contextual cues under explicit task instructions. To evaluate this, we introduce the task of explicit-implicit reasoning and present MixRea, a benchmark of 2,246 multiple-choice questions across 9 reasoning types with varying distributions of explicit and implicit information. Evaluation of 21 advanced LLMs shows that even the best-performing reasoning model (Gemini 2.5 Pro) achieves only 42.8% consistency, revealing widespread inattentional blindness. To mitigate this, we propose Potential Relation Completion Prompting (PRCP), a prompting method that improves reasoning by recovering overlooked causal relations. Further analysis shows that this limitation persists across diverse multi-source reasoning tasks, highlighting the need for more cognitively aligned models.

2605.20127 2026-05-20 q-bio.NC cs.AI cs.LG

Beyond Prediction Accuracy: Target-Space Recovery Profiles for Evaluating Model-Brain Alignment

Ken Nakamura, Tomoya Nakai, Ryuto Yashiro, Ayumu Yamashita, Kaoru Amano

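AI summary: This paper proposes a new way to evaluate model-brain alignment. By analyzing which reproducibly predictable target-space response dimensions are recovered, it reveals aspects of alignment that prediction accuracy alone does not show.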

Comments
34 pages, 12 figures, 5 tables

Abstract

Artificial vision models are often evaluated against the human visual cortex by measuring how accurately their internal representations predict brain responses. However, prediction accuracy alone does not indicate which dimensions of the target brain's response space are recovered. Here, we introduce a unified framework for evaluating both model-brain and brain-brain alignment by identifying the response dimensions recovered by prediction. Using repeated fMRI measurements, we first identify target-brain response dimensions that can be reproducibly predicted across independent trial splits. We then predict target-brain responses from either another subject's brain responses or a vision model's internal representations, and quantify how strongly each of these reproducible response dimensions is recovered. Applying this framework to a subset of the Natural Scenes Dataset, in which eight subjects viewed the same natural images during fMRI, we find that the early-to-intermediate visual-cortex responses contain a low-dimensional set of reproducible dimensions. Brain-to-brain comparisons identify which of these dimensions are consistently recoverable from other subjects' brains, providing a diagnostic human reference rather than only a scalar benchmark. In some cases, pretrained and randomly initialized models achieve similar prediction accuracy while showing distinct recovery profiles across these response dimensions. These results show that prediction accuracy alone can mask model-brain mismatches. By making explicit which reproducible brain response dimensions are recovered by prediction, our framework provides a more diagnostic evaluation of alignment between artificial vision models and the human visual cortex.

2605.20122 2026-05-20 stat.ML cs.CC cs.LG

Optimizing Computational-Statistical Runtime for Wasserstein Distance Estimation

Peter Matthew Jacobs, Jeff M. Phillips

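AI summary: This paper proposes a Sample-Sketch-Solve approach that introduces a regular Cartesian grid sketch to compress the data and speed up Wasserstein distance computation, estimating the distance to ε error for Hölder smooth distributions with improved runtime.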

Abstract

Squared Wasserstein distance is a frequently used tool to measure discrepancy between probability distributions. This distance is typically computed between empirical measures of size $n$ from two underlying random samples. Unfortunately, even in low-dimensional Euclidean space problems $\left( d \in \{2,3\} \right)$, algorithms for Wasserstein distance computation with approximate or exact precision guarantees scale poorly in runtime as a function of $n$ and the desired precision. In response, we consider the computational-statistical runtime, where the goal is to estimate from samples the Wasserstein distance between potentially smooth measures up to $ε$-additive error in expectation with respect to the sampling; we allow $O(1)$ computational cost for collecting a sample. Towards this, we develop a Sample-Sketch-Solve paradigm where we introduce a regular Cartesian grid sketch of the samples. We show that (especially under $α$-Hölder smooth distributions) this can compress the data without increasing asymptotic error, and also regularizes the structure, which enables faster exact algorithms. Ultimately, we approximate $W_2^2(P,Q)$ within $ε$ error in $ε^{-\max(2,\frac{d+1+o(1)}{1+α})}$ time for $0 < α < 1$ Hölder smooth distributions $P,Q$ on $(0,1)^{d}$; this is an optimal $Θ(ε^{-2})$ for $α > 1/2$ when $d=2$, and nearly optimal as $α\to 1$ when $d = 3$.
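
The sketching step can be pictured as snapping samples in $(0,1)^d$ onto a regular grid and keeping one weighted point per occupied cell. The code below is a minimal sketch under assumptions not taken from the paper: cell centers are used as representatives, the function name `grid_sketch` is hypothetical, and the grid resolution `m` is left as a free parameter. The paper's choice of resolution and its exact solver on the sketched measures are not reproduced.

```python
from collections import Counter

def grid_sketch(points, m):
    """Compress samples in (0,1)^d into weighted cell centers of an m-per-axis grid."""
    counts = Counter()
    for p in points:
        # Index of the grid cell containing p; clamp guards against a coordinate of 1.0.
        cell = tuple(min(int(c * m), m - 1) for c in p)
        counts[cell] += 1
    n = len(points)
    # Map each occupied cell to its center, weighted by the fraction of samples in it.
    return {tuple((i + 0.5) / m for i in cell): k / n for cell, k in counts.items()}

sketch = grid_sketch([(0.1, 0.1), (0.12, 0.14), (0.9, 0.9), (0.6, 0.1)], m=4)
```

The sketched measure has at most m^d support points regardless of the sample size n, and its support lies on a regular lattice. Those are the two properties the abstract points to: compression and a regular structure that a solver can exploit.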

2605.20120 2026-05-20 cs.AI cs.LO

Using Aristotle API for AI-Assisted Theorem Proving in Lean 4: A Formalisation Case Study of the Grasshopper Problem

Gabriel Rongyang Lau

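AI summary: Through a formalization case study, this paper examines the challenges of AI-assisted theorem proving in Lean 4 with the Aristotle API. It presents the proof attempt for the Grasshopper problem and shows that local proof search succeeds while the global combinatorial counting remains unresolved.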

Abstract

AI-assisted theorem proving can now generate substantial Lean developments for olympiad-level mathematics, but the evidential status of such developments depends on which declarations are actually verified. This paper reports a Lean 4 formalization case study of an Aristotle API proof attempt for the Grasshopper problem, originally posed as IMO 2009 Problem 6. The generated artifact states a generalized Lean version of the theorem, contains four verified helper lemmas for local components of a maximality and adjacent-swap exchange strategy, and leaves the main theorem grasshopper closed directly by one unresolved sorry. The verified components establish that the final partial sum equals the total sum, that an adjacent transposition can affect only the relevant intermediate partial sum, that the changed partial sum has the expected form, and that maximality at a position admitting an adjacent successor swap forces a corresponding forbidden-set membership fact. The Aristotle output summary identifies the intended remaining mathematical step as the global counting step needed to show that these membership facts produce at least n distinct forbidden values, contradicting the cardinality assumption |M| < n; the Lean source itself does not reduce the main theorem to a separately encoded counting lemma. This case study gives an inspectable example of a central limitation in AI-assisted formalization, namely that local proof search can succeed while the global combinatorial bookkeeping required for a theorem remains unresolved. The paper contributes a reproducible Lean artifact and a precise analysis of its verified and unverified proof content.
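
The distinction between verified and unverified declarations can be seen in a few lines of Lean 4. The snippet below is an illustrative sketch only: both statements and their names are hypothetical stand-ins chosen for brevity, not declarations from the paper's artifact.

```lean
-- A helper lemma with a complete proof: the kernel checks it.
theorem helper_add_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- A stand-in "main theorem" closed only by `sorry`: Lean accepts the file
-- with a warning, but the statement has not been proved.
theorem main_placeholder (n : Nat) : n * 1 = n := by
  sorry

-- Listing the axioms a declaration depends on exposes the gap.
#print axioms helper_add_comm
#print axioms main_placeholder  -- reports a dependency on `sorryAx`
```

A file like this compiles, which is why the evidential status of a generated development has to be judged declaration by declaration rather than by whether the build succeeds.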

2605.20110 2026-05-20 cs.CV

SetCon: Towards Open-Ended Referring Segmentation via Set-Level Concept Prediction

Zhixiong Zhang, Yizhuo Li, Shuangrui Ding, Yuhang Zang, Shengyuan Ding, Long Xing, Yibin Wang, Qiaosheng Zhang, Jiaqi Wang

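AI summary: This paper proposes SetCon, which achieves open-ended referring segmentation through set-level concept prediction. It uses LVLM-generated natural-language concepts as semantic conditions for joint mask-set decoding, improving the completeness and mutual exclusivity of the segmentation.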

Abstract

Referring segmentation grounds natural-language queries to pixel-level masks, but extending it to complex scenarios with multiple instances, cross-category groups, or open-ended target sets remains challenging. Previous Large Vision Language Model (LVLM)-based methods represent referred targets with one or more special tokens sequentially, treating multiple targets as separate outputs rather than a coherent set and offering little incentive to capture set-level properties such as completeness and mutual exclusivity. We reformulate open-ended referring segmentation as explicit set-level concept prediction and propose Set-Concept Segmentation (SetCon), which uses LVLM-generated natural-language concepts, instead of segmentation-specific tokens, as semantic conditions for joint mask-set decoding. A hierarchical semantic decomposition first predicts a shared set-level concept defining the target scope and then refines it into fine-grained concept groups aligned with target subsets. To support this, a two-stage annotation pipeline augments existing reasoning segmentation datasets with hierarchical semantic supervision (236k samples, 784k concept phrases). SetCon achieves state-of-the-art results on image benchmarks (+3.3 gIoU on gRefCOCO, +12.1 gIoU on MUSE), with margins that grow as the number of referred targets increases. The concept interface also transfers to video under a detect-and-track setting, yielding new state-of-the-art results on seven referring video benchmarks, including +10.9 J&F on MeViS and +12.4 J&F on Ref-SeCVOS.

2605.20108 2026-05-20 eess.SY cs.AI cs.LG cs.LO cs.SY

k-Inductive Neural Barrier Certificates for Unknown Nonlinear Dynamics

Ben Wooding, Hongchao Zhang, Taylor T. Johnson, Abolfazl Lavaei

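AI summary: This paper proposes neural-network-based k-inductive neural barrier certificates (k-NBCs) for partially unknown nonlinear systems. It exploits the scalability of neural networks and a generalization of Willems et al.'s fundamental lemma to build a data-driven representation for SMT verification, while increasing design flexibility.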

Comments
18 pages, 5 figures, 3rd International Conference on Neuro-Symbolic Systems (NeuS)

Abstract

While conventional (k=1) discrete-time barrier certificate conditions impose strict safety constraints by requiring the function to be non-increasing at every step, k-inductive barrier certificates relax this by allowing temporary increases (up to k-1 times, each within a threshold $ε$) while maintaining overall safety and improving flexibility. This paper leverages neural networks and constructs k-inductive neural barrier certificates (k-NBCs) for (partially) unknown nonlinear systems. While neural networks offer scalability in the design process, they lack formal guarantees, requiring additional approaches such as counterexample-guided inductive synthesis (CEGIS) with satisfiability modulo theories (SMT) for verification. However, the CEGIS-SMT framework requires knowledge of system dynamics, which is unavailable in practical settings. To address this, we leverage a generalization of Willems et al.'s fundamental lemma, using a single state trajectory, to construct a data-driven representation of (partially) unknown models for SMT verification without sacrificing accuracy. Additionally, CEGIS-SMT removes the constraint of restricting barrier certificates to specific function classes, such as sum-of-squares, enabling greater flexibility in their design. We validate our approach on three nonlinear case studies with (partially) unknown dynamics.

2605.20107 2026-05-20 cs.LG cs.AI

Beyond Isotropy in JEPAs: Hamiltonian Geometry and Symplectic Prediction

Robert Jenkinson Alvarez

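AI summary: This paper studies the limitations of the isotropy assumption in JEPAs and proposes a symplectic prediction method based on Hamiltonian geometry, which predicts view-to-view transitions using phase-space states and a learned Hamiltonian, improving performance across datasets.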

Abstract

JEPAs often regularize one-view embeddings toward an isotropic Gaussian, implicitly baking Euclidean symmetry into the representation. We show that this is not merely a benign default. For a known structured downstream geometry $H\succ0$, the minimax and maximum-entropy covariance under a Hamiltonian energy budget is $(c/d)H^{-1}$, and Euclidean isotropy incurs a closed-form price of isotropy. More importantly, when the downstream geometry is unknown, no geometry-independent fixed marginal target is canonical: every fixed covariance shape can be maximally misaligned for some structured geometry. We further show that even oracle one-view marginals do not identify the JEPA view-to-view predictive coupling. These results suggest that the structural bias in JEPAs should enter the cross-view coupling rather than a fixed encoder marginal. We instantiate this principle with HamJEPA, which encodes each view as a phase-space state $(q,p)$ and predicts view-to-view transitions with a learned Hamiltonian leapfrog map, while non-isotropic scale and spectral floors prevent collapse. In a deliberately headless token protocol, HamJEPA improves over SIGReg on CIFAR-100 by $+4.89$ kNN@20 and $+3.52$ linear-probe points at 30 epochs, and by $+6.45$ kNN@20 and $+10.64$ linear-probe points at 80 epochs, while a matched MLP predictor ablation shows that the symplectic coupling is the ingredient driving the neighborhood-geometry gain. On ImageNet-100, HamJEPA-$q$ improves by $+4.82$ kNN@20 and $+7.52$ linear-probe points at 45 epochs.
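
A leapfrog map of the kind mentioned above can be sketched for a separable Hamiltonian H(q, p) = |p|²/2 + V(q). The code below is a minimal illustration under stated assumptions: the potential gradient `grad_v` is a fixed function supplied by the caller rather than a learned network, and the function names are hypothetical. How HamJEPA parametrizes its Hamiltonian is not specified in the abstract and is not reproduced here.

```python
def leapfrog(q, p, grad_v, step, n_steps):
    """Leapfrog (Stormer-Verlet) integration for H(q, p) = |p|^2 / 2 + V(q)."""
    q, p = list(q), list(p)
    for _ in range(n_steps):
        g = grad_v(q)
        p = [pi - 0.5 * step * gi for pi, gi in zip(p, g)]  # half kick
        q = [qi + step * pi for qi, pi in zip(q, p)]        # full drift
        g = grad_v(q)
        p = [pi - 0.5 * step * gi for pi, gi in zip(p, g)]  # half kick
    return q, p

def energy(q, p):
    """Energy for the quadratic potential V(q) = |q|^2 / 2 used in the example."""
    return 0.5 * sum(x * x for x in p) + 0.5 * sum(x * x for x in q)

q0, p0 = [1.0, 0.0], [0.0, 0.5]
q1, p1 = leapfrog(q0, p0, lambda q: list(q), step=0.1, n_steps=1000)
# Time reversibility: flip the momentum and integrate again to return to q0.
q2, p2 = leapfrog(q1, [-x for x in p1], lambda q: list(q), step=0.1, n_steps=1000)
```

The map is volume-preserving and time-reversible by construction, which is why the energy stays close to its initial value over long rollouts and why flipping the momentum retraces the path. That built-in structure is the kind of bias the abstract argues should sit in the cross-view predictor.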

2605.20105 2026-05-20 cs.LG

Optimal Representation Size: High-Dimensional Analysis of Pretraining and Linear Probing

Valentina Njaradi, Clémentine Dominé, Rachel Swanson, Marco Mondelli, Andrew Saxe

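AI summary: This paper studies the optimal representation size in pretraining followed by linear probing. A high-dimensional analysis shows how representation dimensionality, the numbers of unlabelled and labelled samples, and task alignment affect training and generalisation error, and gives conditions for the optimal representation size under different pretraining and downstream data regimes.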

Abstract

Learning to generalise from limited data is a fundamental challenge for both artificial and biological systems. A common strategy is to extract reusable structure from abundant unlabelled data, enabling efficient adaptation to new tasks from limited labelled data. This two-stage paradigm is now standard in modern training pipelines, where pretraining is followed by fine-tuning or linear probing. We provide an analytical model of this process: structure extraction is formalized as principal component analysis on unlabelled data, and downstream learning as linear regression on a separate labelled dataset. In the high-dimensional regime, we derive exact expressions for training and generalisation error showcasing their dependence on representation dimensionality, unlabelled and labelled sample sizes, and task alignment. Our results show that pretrained representations strongly influence downstream generalisation, and we characterize the optimal representation size as a function of task parameters: with abundant pretraining data but scarce downstream data, maximally compressed representations are optimal, whereas with limited pretraining data, higher-dimensional representations generalise better. Furthermore, we establish an exact trade-off between pretraining and supervision, quantifying how much unlabelled data is required to replace a single labelled sample. Beyond our idealised model, we observe similar phenomenology in autoencoders and pretrained LLMs. Altogether, we highlight that optimising representation size is critical, giving conditions for when compression during pretraining improves generalisation.

2605.20104 2026-05-20 cs.LG cs.AI

Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding

少写多取:用于推测解码的混合树构建

Yuhao Shen, Tianyu Liu, Xinyi Hu, Quan Kong, Baolin Zhang, Jun Dai, Jun Zhang, Shuang Ge, Lei Chen, Yue Li, Mingcheng Wan, Cong Wang

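AI summary: This paper proposes Graft, a hybrid tree construction method that combines pruning with retrieval to address the Pareto trade-off in resource allocation for speculative decoding, improving speedup and acceptance rate across deployment scenarios.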

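Details
AI Chinese abstract (translated to English)

Speculative decoding (SD) accelerates large language model inference through a draft-then-verify paradigm. To maximize the acceptance rate, recent methods build expansive draft trees, but these incur severe VRAM bandwidth and compute overheads that bottleneck end-to-end speedup. Dynamic-depth pruning can cut latency by removing marginal branches, yet it also discards potentially valid candidates, which keeps the acceptance rate from reaching the upper bound of dense trees. In this paper, we identify a key opportunity in resource allocation: moving from dense to pruned drafting frees a significant compute budget. To break this Pareto trade-off, we introduce Graft, a compensation framework that treats pruning and retrieval as mutually reinforcing operations. Pruning supplies enough budget for retrieval, while retrieval compensates for the coverage lost to pruning and recovers accepted length. Using a sequential 'prune-then-graft' mechanism, Graft inserts highly predictive retrieved tokens into the positions opened by pruning, filling the topological gaps with near-zero overhead. Graft is entirely training-free and lossless. Comprehensive evaluations show that Graft establishes a new Pareto frontier across practical deployment settings, including short-context generation, long-context generation, and large-scale models. On short-context benchmarks it achieves up to 5.41× speedup, and on the large-scale Qwen3-235B it improves average speedup over EAGLE-3 by up to 21.8%. We also present a preliminary exploration of applying Graft to the DFlash-style block drafting paradigm, offering initial evidence and insights for extending grafting beyond autoregressive draft trees.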

English abstract

Speculative decoding (SD) accelerates large language model inference by leveraging a draft-then-verify paradigm. To maximize the acceptance rate, recent methods construct expansive draft trees, which unfortunately incur severe VRAM bandwidth and computational overheads that bottleneck end-to-end speedups. While dynamic-depth pruning can reduce this latency by removing marginal branches, it also discards potentially valid candidates, preventing the acceptance rate from reaching the upper bound of dense trees. In this paper, we identify a critical opportunity in resource allocation: the transition from dense to pruned drafting frees up significant computational budget. To break this Pareto tradeoff, we introduce Graft, a compensation framework that couples pruning and retrieval as mutually reinforcing operations. Pruning supplies sufficient budget for retrieval, while retrieval compensates for pruning-induced coverage loss and recovers accepted length. By employing a sequential 'prune-then-graft' mechanism, Graft attaches highly predictive retrieved tokens into positions opened by pruning, filling the topological gaps with near-zero overhead. Graft is entirely training-free and lossless. Comprehensive evaluations show that Graft establishes a new Pareto frontier across practical deployment settings, including short-context generation, long-context generation, and large-scale models. On short-context benchmarks, it achieves up to 5.41× speedup and improves average speedup over EAGLE-3 by up to 21.8% on the large-scale Qwen3-235B. We also provide a preliminary exploration of applying Graft to the DFlash-style block drafting paradigm, offering initial evidence and insights for extending grafting beyond autoregressive draft trees.
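
The abstract gives no algorithmic detail, so the following is only a toy sketch of the general prune-then-graft idea it describes: prune low-probability branches of a draft tree, then spend the freed node budget on retrieved tokens attached to surviving nodes. Everything here is an assumption made for illustration, including the `Node` structure, the path-probability threshold used for pruning, the fixed node budget, and the form of the retrieved candidates. It is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Node:
    parent: Optional[int]    # id of the parent node, None for the root
    token: str
    prob: float              # draft model's conditional probability (toy values)
    retrieved: bool = False  # True if the node came from retrieval, not drafting


def path_prob(tree: Dict[int, Node], node_id: int) -> float:
    """Product of conditional probabilities from the node up to the root."""
    p = 1.0
    current: Optional[int] = node_id
    while current is not None:
        p *= tree[current].prob
        current = tree[current].parent
    return p


def prune(tree: Dict[int, Node], threshold: float) -> Dict[int, Node]:
    """Keep nodes whose path probability reaches the threshold (hypothetical rule).

    Path probability never increases with depth, so every descendant of a
    pruned node is pruned too and the result is still a connected tree.
    """
    return {i: n for i, n in tree.items() if path_prob(tree, i) >= threshold}


def graft(pruned: Dict[int, Node], candidates: List[Tuple[int, str]],
          budget: int, next_id: int) -> Dict[int, Node]:
    """Attach retrieved (parent_id, token) candidates until the budget is full."""
    out = dict(pruned)
    for parent, token in candidates:
        if len(out) >= budget:
            break                      # freed slots are used up
        if parent not in out:
            continue                   # parent was pruned, nothing to attach to
        if any(n.parent == parent and n.token == token for n in out.values()):
            continue                   # the draft tree already proposes this token
        # prob=1.0 is a placeholder: retrieved tokens carry no draft probability here.
        out[next_id] = Node(parent, token, prob=1.0, retrieved=True)
        next_id += 1
    return out


# Toy draft tree with six nodes.
tree = {
    0: Node(None, "The", 1.0),
    1: Node(0, "cat", 0.6),
    2: Node(0, "dog", 0.3),
    3: Node(1, "sat", 0.5),
    4: Node(2, "ran", 0.2),
    5: Node(1, "is", 0.1),
}
# Hypothetical retrieval output, for example from an n-gram lookup.
candidates = [(4, "away"), (1, "sat"), (3, "on"), (3, "down"), (2, "barked")]

pruned = prune(tree, threshold=0.1)
grafted = graft(pruned, candidates, budget=len(tree), next_id=max(tree) + 1)

for node_id, node in sorted(grafted.items()):
    origin = "retrieved" if node.retrieved else "draft"
    print(node_id, node.parent, node.token, origin)
```

In this sketch the total node count sent for verification stays at the original budget, which mirrors the abstract's point that pruning frees budget that retrieval can reuse. How the actual method scores, selects, and places retrieved tokens is not stated in the abstract.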