arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.20182 2026-05-20 cs.LG cs.AI

Atoms of Thought: Universal EEG Representation Learning with Microstates

Xinyang Tian, Ruitao Liu, Ziyi Ye, Siyang Xue, Xin Wang, Xuesong Chen

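AI summary: This paper proposes a microstate-based method for universal EEG representation learning. It clusters continuous EEG signals into sequences of discrete microstates to build a universal microstate tokenizer, and shows advantages on downstream tasks including sleep staging, emotion recognition, and motor imagery classification, along with better interpretability and scalability.

Details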
Comments
Accepted by the 3rd International Workshop on Multimodal and Responsible Affective Computing (MRAC 2025). 8 pages of main text, 23 pages total, 5 figures, 4 tables
Abstract

Learning universal representations from electroencephalogram (EEG) signals is a cutting-edge approach in the field of neuroinformatics and brain-computer interfaces (BCIs). Conventionally, EEG is treated as a multivariate temporal signal, where time- or frequency-domain features are extracted for representation learning. This paper investigates a simple yet effective EEG representation, i.e., microstates. Microstates represent the building blocks of brain activity patterns at a microscopic time scale. We build a universal microstate tokenizer from a large medical EEG dataset by clustering continuous EEG signals into sequences of discrete microstates. The microstate tokenizer is then adopted universally across a series of downstream tasks, including sleep staging, emotion recognition, and motor imagery classification. Experimental results show that EEG representation learning with microstates outperforms traditional time-domain and frequency-domain features under different models and across different tasks. Further analysis shows that microstates offer greater interpretability and scalability, thereby opening up applications in both cognitive neuroscience and clinical research.
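
As a rough illustration of the tokenization idea described in this abstract, the sketch below maps each EEG time sample to the index of its best-matching template topography. The four-channel layout, the two templates, and the polarity-invariant correlation rule are illustrative assumptions, not details taken from the paper, which builds its tokenizer by clustering a large medical EEG dataset.

```python
import math


def _corr(a, b):
    """Pearson correlation between two equal-length channel vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    return 0.0 if denom == 0 else sum(x * y for x, y in zip(da, db)) / denom


def tokenize(samples, templates):
    """Give each time sample the index of the template with the highest |correlation|."""
    tokens = []
    for sample in samples:
        scores = [abs(_corr(sample, t)) for t in templates]
        tokens.append(scores.index(max(scores)))
    return tokens


# Hypothetical 4-channel templates (assumed for this sketch only).
TEMPLATES = [
    [1.0, 1.0, -1.0, -1.0],   # front-versus-back pattern
    [1.0, -1.0, 1.0, -1.0],   # left-versus-right pattern
]

# Made-up EEG samples; the second is template 0 with inverted polarity.
eeg = [
    [0.9, 1.1, -1.0, -0.8],
    [-2.0, -1.9, 2.1, 2.0],
    [0.5, -0.4, 0.6, -0.5],
]
tokens = tokenize(eeg, TEMPLATES)
```

Taking the absolute correlation makes the assignment ignore polarity, a common convention in microstate analysis; whether the paper follows it is not stated in the abstract.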

2605.20179 2026-05-20 cs.CL

TIDE: Efficient and Lossless MoE Diffusion LLM Inference with I/O-aware Expert Offload

Zhiben Chen, Youpeng Zhao, Yang Sui, Jun Wang, Yuzhang Shang

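AI summary: This paper proposes TIDE, a new inference system that exploits the temporal stability of expert activations within a block during the diffusion process. An interval-based expert refresh strategy optimizes I/O and CPU computation, giving lossless acceleration and more efficient inference.

Details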
Abstract

Diffusion Large Language Models (dLLMs) have emerged as a competitive alternative to autoregressive (AR) models, offering better hardware utilization and bidirectional context through parallel block-level decoding. However, as dLLMs continue to scale up with mixture-of-experts (MoE) architectures, their deployment on resource-constrained devices remains an open challenge. Existing AR-based methods often incur either prohibitive I/O overhead or significant compute bottlenecks. In this work, we propose TIDE, a novel resource-efficient inference system that leverages the temporal stability of expert activations within a block during the diffusion process. Specifically, we introduce an interval-based expert refresh strategy that updates the expert placement in an I/O-aware fashion. To ensure optimal performance, we formulate the inference scheduling as a mathematical programming problem, solving for the optimal interval that minimizes I/O traffic and CPU computation. Most importantly, TIDE is a lossless optimization that requires no model training, providing a "free lunch" acceleration for dLLM inference. In a single GPU-CPU system, we demonstrate that TIDE achieves up to 1.4× and 1.5× throughput improvements over prior baselines on LLaDA2.0-mini and LLaDA2.0-flash models, respectively.
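
The trade-off behind an interval-based refresh can be illustrated with a toy simulation. The cost model, the activation trace, and the brute-force search below are assumptions made for this sketch. The abstract only states that the system solves a mathematical programming problem for the optimal interval, so this is not TIDE's actual scheduler.

```python
def simulate(activations, interval):
    """Count expert transfers and CPU fallbacks for one block.

    activations: list of sets of expert ids, one set per denoising step.
    interval: refresh the GPU-resident expert set every `interval` steps.
    """
    resident = set()
    io_loads = 0        # experts copied to the GPU at refresh steps
    cpu_fallbacks = 0   # activated experts that are not resident and run on the CPU
    for step, active in enumerate(activations):
        if step % interval == 0:
            io_loads += len(active - resident)
            resident = set(active)
        else:
            cpu_fallbacks += len(active - resident)
    return io_loads, cpu_fallbacks


def best_interval(activations, io_cost, cpu_cost):
    """Brute-force the interval with the lowest weighted cost (hypothetical cost model)."""
    def cost(k):
        io_loads, cpu_fallbacks = simulate(activations, k)
        return io_cost * io_loads + cpu_cost * cpu_fallbacks
    return min(range(1, len(activations) + 1), key=cost)


# Made-up activation trace: expert 0 is stable, the second expert changes once.
trace = [{0, 1}, {0, 1}, {0, 2}, {0, 2}]
every_step = simulate(trace, 1)
once_per_block = simulate(trace, 4)
chosen = best_interval(trace, io_cost=1.0, cpu_cost=0.1)
```

Because every activated expert is still evaluated, either on the GPU or on the CPU, the model output in this toy setup does not depend on the interval; only the split between transfers and CPU work changes.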

2605.20177 2026-05-20 cs.CL cs.CV

From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models

Juncheng Wu, Hardy Chen, Haoqin Tu, Xianfeng Tang, Freda Shi, Hui Liu, Hanqing Lu, Cihang Xie, Yuyin Zhou

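AI summary: By decoupling perception from reasoning, this study finds that performance on visual tasks is limited by insufficient perception rather than by reasoning itself. Staged training improves both perception and reasoning, yielding better results on several visual math and perception tasks.

Details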
Comments
19 pages, 9 figures; Accepted to ICML 2026; Project Page: https://ucsc-vlaa.github.io/VLM-CapCurriculum/
Abstract

Recent advances in vision-language models (VLMs) emphasize long chain-of-thought reasoning; yet, we find that their performance on visual tasks is primarily limited by a lack of visual perception as opposed to reasoning itself. In this work, we systematically study the interplay between perception and reasoning in VLM post-training by decomposing their capabilities into three separate training stages: visual perception, visual reasoning, and textual reasoning, incorporating specialized training data. We demonstrate that visual perception (a) requires targeted optimization with specialized data; (b) serves as a fundamental scaffold that should be solidified through staged training before refining visual reasoning; and (c) is more effectively learned via RL than caption-based SFT. Our experiments across multiple VLMs demonstrate that staged training consistently improves both visual perception and reasoning performance over merged training. Notably, models trained with our approach achieve 1.5% higher reasoning accuracy with 20.8% shorter reasoning traces, suggesting that superior perception reduces the need for excessive reasoning. Furthermore, we show that this capability-based staging represents a new curriculum dimension orthogonal to traditional difficulty-based curricula, and combining both yields further additive gains. Our staged-training models achieve superior performance among open-weight VLMs, establishing advanced results on several visual math and perception tasks (e.g., +5.2% on WeMath and +3.7% on RealWorldQA) compared with the base counterpart.

2605.20176 2026-05-20 cs.CL

ClinSeekAgent: Automating Multimodal Evidence Seeking for Agentic Clinical Reasoning

Juncheng Wu, Letian Zhang, Yuhan Wang, Haoqin Tu, Hardy Chen, Zijun Wang, Cihang Xie, Yuyin Zhou

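AI summary: This paper proposes ClinSeekAgent, an automated agentic framework for dynamic multimodal evidence seeking that actively retrieves, iteratively plans over, and synthesizes multimodal evidence from heterogeneous sources to improve clinical decision support.

Details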
Comments
24 pages, 9 figures; Project Page: https://ucsc-vlaa.github.io/ClinSeekAgent/
Abstract

Large language models (LLMs) and agentic systems have shown promise for clinical decision support, but existing works largely assume that evidence has already been curated and handed to the model. Real-world clinical workflows instead require agents to actively seek, iteratively plan, and synthesize multimodal evidence from heterogeneous sources. In this paper, we introduce ClinSeekAgent, an automated agentic framework for dynamic multimodal evidence seeking that shifts the paradigm from passive evidence consumption to active evidence acquisition. Given only a clinical query and access to raw data sources, ClinSeekAgent gathers evidence by querying medical knowledge bases, navigating raw EHRs, and invoking medical imaging tools; refines its hypotheses as new information emerges; and integrates the collected evidence into grounded clinical decisions. ClinSeekAgent serves both as an inference-time agent for frontier LLMs and as a training-time pipeline for distilling high-quality agent trajectories into compact open-source models. To validate its inference-time effectiveness, we construct ClinSeek-Bench, which pairs Curated Input reasoning from fixed pre-selected evidence with Automated Evidence-Seeking over raw clinical data. On text-only EHR tasks, ClinSeekAgent improves Claude Opus 4.6 from 60.0 to 63.2 overall F1 and MiniMax M2.5 from 43.1 to 47.3, with positive risk-prediction gains in 7 out of 9 evaluated host models. On multimodal tasks, ClinSeekAgent improves Claude Opus 4.6 from 47.5 to 62.6 (+15.1); all evaluated models improve across the three CXR-related task groups. We further validate ClinSeekAgent as a training pipeline by distilling agentic evidence-seeking trajectories into ClinSeek-35B-A3B, which achieves 34.0 average F1 on existing AgentEHR-Bench, improving over its Qwen3.5-35B-A3B baseline by +11.9 points and approaching Claude Opus 4.6.

2605.20174 2026-05-20 cs.CV cs.LG

Multi-axis Analysis of Image Manipulation Localization

Keanu Nichols, Divya Appapogu, Giscard Biamby, Dina Bashkirova, Anna Rohrbach, Bryan A. Plummer

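AI summary: This paper introduces the AUDITS benchmark for multi-axis analysis of image manipulation detection. It evaluates the robustness of existing methods under different types of domain shift, with the aim of advancing more reliable and generalizable detection methods.

Details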
Comments
28 pages, 5 figures, 5 tables
Abstract

Advanced image editing software enables easy creation of highly convincing image manipulations, which has been made even more accessible in recent years due to advances in generative AI. Manipulated images, while often harmless, could spread misinformation, create false narratives, and influence people's opinions on important issues. Despite this growing threat, there is limited research on detecting advanced manipulations across different visual domains. Thus, we introduce Analysis Under Domain-shifts, qualIty, Type, and Size (AUDITS), a comprehensive benchmark designed for studying axes of analysis in image manipulation detection. AUDITS comprises over 530K images from two distinct sources (user and news photos). We curate our dataset to support analysis across multiple axes using recent diffusion-based inpaintings, spanning a diverse range of manipulation types and sizes. We conduct experiments under different types of domain shift to evaluate robustness of existing image manipulation detection methods. Our goal is to drive further research in this area by offering new insights that would help develop more reliable and generalizable image manipulation detection methods.

2605.20173 2026-05-20 cs.AI cs.SE

A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents

Vasundra Srinivasan

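AI summary: This paper proposes a methodology for selecting and composing runtime architecture patterns that define the stochastic-deterministic boundary of LLM agents, and discusses its application to different agent types along with a reliability decomposition.

Details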
Comments
25 pages, 2 figures, 6 tables. Companion repo at https://github.com/vasundras/agent-runtime-patterns
Abstract

Production LLM agents combine stochastic model outputs with deterministic software systems, yet the boundary between the two is rarely treated as a first-class architectural object. This paper names that boundary the stochastic-deterministic boundary (SDB): a four-part contract among a proposer, verifier, commit step, and reject signal that specifies how an LLM output becomes a system action. We argue that the SDB is the load-bearing primitive of production agent runtimes. Around this primitive, we organize agent runtime design into three concerns: Coordination, State, and Control. We present a catalog of six runtime patterns that compose the SDB differently across conversational, autonomous, and long-horizon agents: hierarchical delegation, scatter-gather plus saga, event-driven sequencing, shared state machine, supervisor plus gate, and human in the loop. For each pattern, we trace its lineage to distributed-systems concepts and identify what changes when the worker is stochastic. The paper contributes a five-step methodology for selecting runtime patterns, a diagnostic procedure that maps production failures to pattern weaknesses, and a failure mode called replay divergence, in which LLM-based consumers of a deterministic event log produce different downstream outputs under model-version or prompt changes. A stylized reliability decomposition separates per-call model variance from architectural momentum, motivating the claim that as model variance decreases, pattern choice and SDB strength become increasingly important levers for long-run reliability. We apply the methodology to five workloads and provide one runnable reference implementation for a 90-day contract-renewal agent.
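
The four-part contract named in this abstract (proposer, verifier, commit step, reject signal) can be sketched in a few lines. The class and function names, the refund scenario, and the deterministic stand-in for the LLM call are all hypothetical; the paper's companion repository may structure this quite differently.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Rejection:
    """Reject signal: carries the reason a proposal was not committed."""
    reason: str


@dataclass
class Boundary:
    propose: Callable[[str], dict]            # stochastic side (stand-in for an LLM call)
    verify: Callable[[dict], Optional[str]]   # deterministic check; None means accepted
    log: List[dict] = field(default_factory=list)

    def commit(self, action: dict) -> None:
        """Commit step: only verified actions reach the system of record."""
        self.log.append(action)

    def step(self, request: str):
        action = self.propose(request)
        reason = self.verify(action)
        if reason is not None:
            return Rejection(reason)
        self.commit(action)
        return action


def fake_proposer(request: str) -> dict:
    # Hypothetical stand-in for a model output parsed into a structured action.
    return {"op": "refund", "amount": int(request.split()[-1])}


def refund_verifier(action: dict) -> Optional[str]:
    if action.get("op") != "refund":
        return "unknown operation"
    if not 0 < action["amount"] <= 100:
        return "amount outside allowed range"
    return None


sdb = Boundary(propose=fake_proposer, verify=refund_verifier)
ok = sdb.step("refund 40")
bad = sdb.step("refund 5000")
```

Keeping the verifier and the commit step free of model calls is what makes the boundary auditable in this sketch: the log contains only actions that passed a deterministic check.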

2605.20172 2026-05-20 cs.LO cs.AI

Long-term Power Grid Planning via Answer Set Programming

Antonio Ielo, Francesco Doria, Sandra Castellanos-Paez, Marco Maratea, Francesco Percassi, Mauro Vallati

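AI summary: This paper proposes an Answer Set Programming approach to automating and optimizing long-term power grid planning, a problem shaped by sustainability targets, demand patterns, and urbanisation trends.

Details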
Comments
16 pages, 4 figures
Abstract

The power grid is a critical infrastructure underpinning all aspects of modern society and its services. Maintaining its effectiveness requires continuous adaptation. In particular, addressing sustainability targets, demand patterns, and urbanisation trends requires implementing changes to the network. Actual developments can span over a decade, and supply continuity and service quality must be preserved throughout by ensuring conformance to several topological and combinatorial invariants. Long-term power grid planning deals with the above process, and although planning languages could be a natural choice, the kinds of properties and invariants needed are cumbersome to express in such languages; by contrast, they can be elegantly and succinctly encoded in Answer Set Programming (ASP). In this paper, we propose the first approach to automate and optimise the long-term power grid planning process using ASP. Experimental evaluations conducted on synthetic and real-world grid data confirm the expressive power of the proposed ASP-based approach and demonstrate its effectiveness.

2605.20170 2026-05-20 cs.CL

KoRe: Compact Knowledge Representations for Large Language Models

Davide Cavicchini, Fausto Giunchiglia, Jacopo Staiano

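AI summary: This paper proposes KoRe, which encodes 1-hop subgraphs of a knowledge graph into compact discrete knowledge tokens and injects them into an LLM, improving the model's knowledge reasoning while reducing token usage.

Details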
Abstract

Modern Large Language Models (LLMs) have shown impressive performance in user-facing tasks such as question answering, as well as consistent improvements in reasoning capabilities. Still, the way these models encode knowledge seems inherently flawed: by design, LLMs encode world knowledge within their parameters. This way of representing knowledge is inherently opaque, difficult to debug and update, and prone to hallucinations. On the other hand, Knowledge Graphs can provide human-readable and easily editable world knowledge representations, and their application in knowledge-intensive tasks has consistently proven beneficial to downstream performance. Nonetheless, current integration techniques require extensive retraining or finetuning. To overcome this issue, we introduce KoRe, a methodology to encode 1-hop sub-graphs into compact discrete knowledge tokens and inject them into an LLM backbone. We test the proposed approach on three established benchmarks, and report competitive performance coupled with a significant reduction (up to 10x) in token usage. Our results show that compact discrete KG representations can efficiently and effectively be used to ground modern LLMs.

2605.20169 2026-05-20 eess.SY cs.SY

The OAPS solution: a real-time predictive system for flexible PWR operation

Guillaume Dupré, Alain Grossetête

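AI summary: This paper presents an innovative solution intended to facilitate safe and flexible operation of nuclear power plants. The OAPS system helps plant operators carry out power variations confidently and efficiently by providing optimal strategies (such as axial offset control, xenon oscillation mitigation, and effluent minimization) and real-time recommendations (such as dilution and boration flow rates, turbine power setpoints, and variation rates).

Details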
Comments
ICAPP 2025 - International Congress on Advances in Nuclear Power Plants, SFEN, Sep 2025, Juan-les-Pins / Antibes, France
Abstract

This paper presents an innovative solution designed to facilitate safe and flexible operation of nuclear power plants. The purpose of this new device, named the OAPS system, is to provide optimal strategies (e.g., axial offset control, xenon oscillation mitigation, effluent minimization) and real-time recommendations (e.g., dilution and boration flow rates, turbine power setpoints and variation rates) to help NPP operators perform power variations confidently and efficiently. In fact, just as a GPS navigator optimizes and modifies its planned route according to the current position of the user, the OAPS system regularly updates its recommendations based on the latest plant measurements. To achieve this, the OAPS system relies on model predictive control, an advanced control technique that is well established yet still cutting-edge in the nuclear industry. The conventional axial offset control strategy of the OAPS system was previously validated on both Framatome's full-scope PWR simulator and EDF's full-scope N4 simulator. In this paper, three new advanced strategies are showcased on an intermediate-complexity PWR simulator developed by Framatome: 1) determination of the fastest feasible power variation rates, 2) accelerated cancellation of axial power oscillations, and 3) minimization of water and boron effluents.

2605.20168 2026-05-20 cs.DL cs.DB

One in Eight OpenAlex Abstracts Has Integrity Issues

Seorin Kim, Vincent Holst, Vincent Ginis

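AI summary: The study assesses the integrity of 10,000 randomly sampled English-language journal abstracts in OpenAlex and finds that 12% have integrity issues, most commonly insufficient content and misplaced metadata. It also discusses the implications for downstream research and a planned community portal.

Details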
Comments
10 pages, 5 figures
Abstract

Scientific abstracts are increasingly used as primary data in computational metascience research, yet the quality of these abstracts in widely used bibliographic databases has not been systematically examined. We assess the integrity of 10,000 randomly sampled English-language journal abstracts from OpenAlex using a two-stage annotation protocol combining human expert review and large language model classification. We identify seven distinct failure modes and find that 12% of abstracts have integrity issues, with insufficient content and misplaced metadata being the most prevalent. We discuss implications for downstream research and describe a forthcoming community portal to support collective annotation efforts.

2605.20167 2026-05-20 cs.AI cs.LG

HaorFloodAlert: Deseasonalized ML Ensemble for 72-Hour Flood Prediction in Bangladesh Haor Wetlands

Salma Hoque Talukdar Koli, Fahima Haque Talukder Jely, Md. Samiul Alim, Md. Zakir Hossen

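AI summary: This paper proposes HaorFloodAlert, a deseasonalized machine learning ensemble that forecasts 72-hour flood probability for Bangladesh's haor wetlands, improving prediction accuracy by identifying the seasonal effect of temperature and using Sentinel-1 SAR data.

Details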
Comments
9 pages, 9 figures. To be submitted to raaicon.org
Abstract

Flash floods in Bangladesh's haor wetlands show up with almost no warning. They wreck the annual boro rice harvest. Current setups, built for riverine floods, miss backwater dynamics entirely. These basins are flat. Water does not behave like it does on the Brahmaputra. We built HaorFloodAlert, a deseasonalized machine learning ensemble that forecasts 72-hour flood probability for the Sunamganj Haor (approximately 8,000 km²). Temperature was acting as a seasonal cheat code: it inflated accuracy by 6.9 pp just because floods happen in warm months. We caught that. We also built an upstream Barak River Sentinel-1 SAR proxy from Silchar, Assam, giving about 36 hours of lead time. Otsu-thresholded SAR change detection validates at 84-91% spatial match. The operational ensemble (RF 0.5625 + XGBoost 0.4375) hits 89.6% LOOCV accuracy, 87.5% recall, and 0.943 AUC-ROC on 77 real Sentinel-1 events. A three-tier alert pipeline and a BRRI-calibrated boro rice damage estimator are included.
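
If the figures "RF 0.5625 + XGBoost 0.4375" are read as soft-voting weights (an assumption; the abstract does not say how they are applied), the combination step reduces to a weighted average of the two model probabilities. The tier names and thresholds below are invented for illustration and are not the paper's three-tier alert rules.

```python
RF_WEIGHT = 0.5625    # weight quoted in the abstract for the random forest
XGB_WEIGHT = 0.4375   # weight quoted in the abstract for XGBoost


def ensemble_probability(p_rf, p_xgb):
    """Weighted average of two flood probabilities (assumed soft voting)."""
    return RF_WEIGHT * p_rf + XGB_WEIGHT * p_xgb


def alert_tier(p, watch=0.3, warning=0.6):
    """Map a probability to one of three tiers; thresholds are hypothetical."""
    if p >= warning:
        return "warning"
    if p >= watch:
        return "watch"
    return "normal"


p = ensemble_probability(0.8, 0.4)
tier = alert_tier(p)
```

The two weights sum to one, so the combined value stays a valid probability whenever both inputs are.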

2605.20165 2026-05-20 cs.CV

CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models

Hsiang-Wei Huang, Junbin Lu, Kuang-Ming Chen, Jianxu Shangguan, Cheng-Yen Yang, Jenq-Neng Hwang

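AI summary: This paper proposes CaMo, a camera-motion-grounded approach to evaluating and training vision-language models. By requiring models to generate explicit spatial narratives that are then used for reasoning, it exposes weaknesses in the spatial cognition of existing spatial vision-language models, and shows that CaMo performs consistently on both spatial narrative evaluation and direct spatial question answering accuracy.

Details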
Comments
Code and model available at https://github.com/hsiangwei0903/CaMo
Abstract

Vision-Language Models (VLMs) achieve strong performance on spatial question answering benchmarks, yet it remains unclear whether such gains reflect genuine spatial intelligence. We show that existing spatial VLMs lack basic camera motion understanding, a key component of spatial cognition. We propose the Spatial Narrative Score (SNS), an evaluation framework that requires VLMs to generate explicit spatial narratives capturing both scene semantics and camera motion, followed by reasoning with a frozen proxy LLM. Under SNS, state-of-the-art spatial VLMs exhibit significant performance degradation despite high direct question answering accuracy. To address this gap, we introduce CaMo, a camera motion grounded VLM that achieves consistent performance across SNS evaluation and direct spatial question answering accuracy. Our results highlight the importance of explicit spatial narrative externalization for evaluating VLMs with transferable 3D spatial understanding. Our code, data, and model are available at https://github.com/hsiangwei0903/CaMo.

2605.20164 2026-05-20 cs.AI

Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR

Utkarsh Tyagi, Xingang Guo, MohammadHossein Rezaei, Daniel George, Anas Mahmoud, Jackson Lee, Bing Liu, Yunzhong He

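AI summary: This paper proposes POW3R, a framework that improves rubric-based rewards while preserving human weights and category balance, so that rubric criteria better match what the final answer requires, improving performance in both multimodal and text-only settings.

Details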
Comments
24 pages, 7 figures, 6 tables
Abstract

Reinforcement learning with verifiable rewards has made post-training highly effective when correctness can be checked automatically. However, many important model behaviors require satisfying several qualitative criteria at once. Rubric-based rewards address this setting by grading prompt-specific criteria and aggregating them into a scalar reward. Yet standard static aggregations conflate a criterion's human-assigned importance with its current usefulness as an optimization signal. We show that this assumption breaks down in rubric RL: many important criteria are already saturated or currently unreachable, while criteria that distinguish rollouts are not necessarily those with the largest human weights. We introduce POW3R, a policy-aware rubric reward framework that preserves human weights and category balance as the rubric objective while adapting criterion-level reward weights during training. POW3R uses rollout-level contrast to emphasize criteria that currently separate the policy's outputs, making the GRPO reward more informative without changing the underlying evaluation target. Across three base policies on two datasets spanning multimodal and text-only settings, POW3R wins 24 of 30 base-policy/metric comparisons, improving both mean rubric reward and strict completion (the fraction of prompts whose response satisfies every required rubric criterion) over vanilla GRPO with rubric rewards, and reaches the same plateau in 2.5-4× fewer training steps. Rubric rewards should therefore distinguish what should matter in the final answer from what can teach the current policy.
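
Two ideas from this abstract can be made concrete with a small sketch. `strict_completion` follows the definition given in parentheses in the abstract. `contrast_weights` is only a hypothetical stand-in for "rollout-level contrast": it scales each human weight by the variance of that criterion's scores across rollouts, which is not claimed to be POW3R's actual rule.

```python
from statistics import pvariance


def strict_completion(results):
    """Fraction of prompts whose response satisfies every required criterion.

    results: one list of booleans per prompt, one boolean per required criterion.
    """
    return sum(all(r) for r in results) / len(results)


def contrast_weights(human_weights, rollout_scores):
    """Hypothetical reweighting: human weight times score variance across rollouts.

    rollout_scores[i][j] is the score in [0, 1] of rollout i on criterion j.
    Falls back to normalized human weights when no criterion separates the rollouts.
    """
    raw = []
    for j, weight in enumerate(human_weights):
        column = [scores[j] for scores in rollout_scores]
        raw.append(weight * pvariance(column))
    total = sum(raw)
    if total == 0:
        return [w / sum(human_weights) for w in human_weights]
    return [x / total for x in raw]


completion = strict_completion([[True, True], [True, False], [True, True], [False, False]])

# Criterion 0 is saturated, criterion 1 is unreachable, criterion 2 separates rollouts.
rollouts = [
    [1.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
]
weights = contrast_weights([3.0, 1.0, 1.0], rollouts)
```

In this made-up batch the criterion with the largest human weight carries no signal because every rollout already satisfies it, which is the situation the abstract describes for saturated criteria.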

2605.20159 2026-05-20 cs.CV cond-mat.mtrl-sci cs.LG

Interpretable Computer Vision for Defect Detection in X-ray Tomography of Aerospace SiC/SiC Composites

Antonio Peña Corredor, Julien Lesseur, Romain Nunez, Paul Rivalland, Thomas Philippe

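AI summary: This study proposes p-ResNet-50, a framework with a prototype layer. By introducing new regularisation terms and semantic alignment, it improves the interpretability and accuracy of defect detection in X-ray tomography while retaining high precision and traceability.

Details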
Abstract

Non-destructive testing of aerospace SiC/SiC composites via X-ray computed tomography (XCT) relies on expert visual assessment, with current workflows offering limited traceability for accept/reject decisions. Deep convolutional networks can automate defect detection, yet their black-box nature conflicts with the transparency that industrial inspection practice demands. To close this gap, we introduce p-ResNet-50, a convolutional framework extended with a prototype layer that couples high detection accuracy with case-based explanations. Six learned prototypes are explicitly aligned with expert-defined semantic categories (healthy matrix, matrix-air interfaces, pores, line-like defects, and mixed morphologies) so that every classification is traceable to a physically meaningful reference. Two novel regularisation terms, anchor-based and medoid-based, tether prototypes to expert-selected patches and prevent prototype collapse, addressing a known limitation of prototype networks. Latent-space analysis via UMAP delineates semantically coherent sub-domains and maps zones of uncertainty where misclassifications concentrate, giving inspectors an explicit picture of where the model is, and is not, reliable. The framework is validated on an XCT patch dataset of approximately 12,000 patches extracted from four defect-rich SiC/SiC laboratory specimens. Taking a black-box ResNet-50 as a baseline (ROC-AUC = 0.991), the prototype extension achieves comparable performance (accuracy 0.957 vs. 0.959; ROC-AUC 0.994 vs. 0.993) while trading a slight reduction in sensitivity for higher precision and specificity. Each decision is backed by representative evidence patches, and the model explicitly flags its uncertainty regions. Beyond defect mapping, the framework establishes a reusable methodology for embedding domain-expert knowledge into prototype networks, applicable to other XCT inspection scenarios requiring traceable, auditable decisions.

2605.20158 2026-05-20 cs.CV cs.AI cs.CL

Rethinking Visual Attribution for Chest X-ray Reasoning in Large Vision Language Models

Guangzhi Xiong, Qiao Jin, Sanchit Sinha, Zhiyong Lu, Aidong Zhang

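AI summary: This paper addresses the reliability of visual attribution for chest X-ray reasoning in large vision language models. It proposes a causal evaluation framework that keeps only X-ray VQA samples whose expert-annotated region is verified, through counterfactual editing, to be causally responsible for the model's prediction. Across 11 attribution methods, six open-source LVLMs, and two output modes, existing attribution methods often fail to identify the evidence the LVLMs use. The paper therefore proposes MedFocus, a concept-based attribution method that localizes clinically meaningful anatomical regions via unbalanced optimal transport and measures their causal effect on model outputs through targeted interventions. It substantially outperforms existing methods and moves toward more trustworthy attribution for medical LVLMs.

Details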
Abstract

Large Vision Language Models (LVLMs) show promise in medical applications, but their inability to faithfully ground responses in visual evidence raises serious concerns about clinical trustworthiness. While visual attribution methods are widely used to explain LVLM predictions, whether these explanations actually reflect the visual evidence underlying the model's decision is largely unverified, since ground-truth annotations for internal model reasoning are typically unavailable. We address this question for chest X-ray (CXR) reasoning by developing a causal evaluation framework that retains only CXR-VQA samples for which the expert-annotated region is verified, via counterfactual editing, to be causally responsible for the model's prediction. Using this framework across 11 attribution methods, six open-source LVLMs, and two output modes (direct answer and step-by-step reasoning), we find that existing attribution methods often fail to identify the evidence used by LVLMs. To address this failure, we propose MedFocus, a concept-based attribution method that localizes clinically meaningful anatomical regions via unbalanced optimal transport and measures their causal effect on model outputs through targeted interventions. MedFocus produces spatial, concept-level, and token-level attributions and substantially outperforms prior methods, taking a step toward more trustworthy attribution for medical LVLMs. Our data and code are available at https://github.com/gzxiong/medfocus/.

2605.20157 2026-05-20 cs.LG cs.CR cs.IR

SAGE: Scalable Automatic Gating Ensemble for Confident Negative Harvesting in Fraud Detection

SAGE:可扩展的自动门控集成用于自信的负面采样在欺诈检测中

Sudheer Tubati, Amit Goyal

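AI summary: This paper proposes SAGE, a counterfactual-aware negative harvesting approach that combines SimHash-based stratified sampling with a modular gating ensemble to confidently identify negatives in unlabeled data for fraud detection, addressing the representation bias problem in Positive-Unlabeled learning.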

详情
Journal ref
WSDM Companion '26: Nineteenth ACM International Conference on Web Search and Data Mining, 2026, Pages 34 - 38
AI中文摘要

音乐流媒体欺诈,即恶意行为者人为提高流媒体计数以操纵排行榜和版税支付,对流媒体服务和合法内容创作者构成重大威胁。传统欺诈检测方法面临关键挑战:许多合法边缘案例,包括超级粉丝和睡眠音乐播放会话,表现出的活动模式与协调欺诈非常相似。我们提出了SAGE,一种新颖的反事实感知负样本采集方法,结合基于SimHash的分层抽样和模块化门控集成,用于从未标记数据中自信地识别负样本。我们的集成架构采用可插拔的统计门(目前实例化为Mahalanobis距离和k-NN密度)和可配置的投票阈值,以实现自适应的精度-召回率权衡。该方法通过下限约束抽样确保罕见行为群体的全面覆盖,从而解决了正例-未标记学习中的表示偏差问题。评估显示在保留数据上具有强精度和召回率。该方法在欺诈检测领域具有良好的泛化能力,在客户层面和艺术家层面的欺诈检测中均能实现强性能,而无需修改核心方法。

英文摘要

Music streaming fraud, where bad actors artificially inflate stream counts to manipulate chart rankings and royalty payments, poses a significant threat to streaming services and legitimate content creators. Traditional fraud detection approaches struggle with a critical challenge: many legitimate edge cases, including super-fans and sleep-music sessions, exhibit activity patterns that closely mimic those of coordinated fraud. We present SAGE, a novel counterfactual-aware negative harvesting approach that combines SimHash-based stratified sampling with a modular gating ensemble for confident negative identification from unlabeled data. Our ensemble architecture employs pluggable statistical gates (currently instantiated with Mahalanobis distance and k-NN density) with configurable voting thresholds enabling adaptive precision-recall trade-offs. This addresses the representation bias problem in Positive-Unlabeled learning by ensuring comprehensive coverage of rare behavioral cohorts through floor-constrained sampling. Evaluation demonstrates strong precision and recall on held-out data. The approach generalizes across fraud detection domains, achieving strong performance on both customer-level and artist-level fraud without modification to the core methodology.
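
The gating ensemble described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the gates are fitted or applied, so the sketch assumes each gate votes "negative" when a point lies far from the labeled fraud set, uses a diagonal covariance for the Mahalanobis gate as a simplification, and omits the SimHash stratified sampling step entirely. All function names, thresholds, and data points are hypothetical.

```python
import math

def fit_diag_gaussian(points):
    """Per-feature mean and variance of the labeled fraud (positive) set."""
    n, dim = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(dim)]
    var = [sum((p[j] - mean[j]) ** 2 for p in points) / n for j in range(dim)]
    return mean, var

def mahalanobis_gate(mean, var, threshold):
    """Vote 'negative' when x is far from the fraud distribution.
    Assumption: diagonal covariance, chosen only to keep the sketch short."""
    def gate(x):
        d = math.sqrt(sum((xi - m) ** 2 / v for xi, m, v in zip(x, mean, var)))
        return d > threshold
    return gate

def knn_gate(positives, k, threshold):
    """Vote 'negative' when the mean distance to the k nearest fraud points is large."""
    def gate(x):
        nearest = sorted(math.dist(x, p) for p in positives)[:k]
        return sum(nearest) / k > threshold
    return gate

def harvest_negatives(unlabeled, gates, min_votes):
    """Keep the unlabeled points that at least min_votes gates call negative."""
    return [x for x in unlabeled if sum(g(x) for g in gates) >= min_votes]

# Hypothetical 2-D behavioral features.
positives = [(10.0, 10.0), (10.5, 9.5), (9.5, 10.5), (10.2, 10.1)]
unlabeled = [(0.0, 0.0), (12.0, 12.0), (10.1, 9.9)]

mean, var = fit_diag_gaussian(positives)
gates = [
    mahalanobis_gate(mean, var, threshold=3.0),
    knn_gate(positives, k=2, threshold=3.0),
]

strict = harvest_negatives(unlabeled, gates, min_votes=2)   # favors precision
lenient = harvest_negatives(unlabeled, gates, min_votes=1)  # favors recall
print(strict, lenient)
```

Raising `min_votes` keeps only points that every gate agrees on, while lowering it admits borderline points; this is one way to read the configurable voting threshold as a precision-recall dial.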

2605.20151 2026-05-20 cs.LG math.ST stat.TH

When Does Model Collapse Occur in Structured Interactive Learning?

在结构互动学习中模型崩溃何时发生?

Yuchen Wu, Kangjie Zhou, Weijie Su

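AI summary: This paper studies when model collapse, the progressive degradation of generative model performance, occurs in structured interactive learning environments. By analyzing the topology of the interaction graph, it derives a necessary and sufficient condition for model collapse and supports the theory with numerical experiments.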

详情
Comments
57 pages, 12 figures
AI中文摘要

生成式人工智能的普及催生了交互学习环境,其中模型参数通过自然过程生成的数据和由其他模型产生的合成输出不断更新。这种范式引入了两大挑战:(1)训练数据不再仅来自目标群体,破坏了经典统计学习的核心假设;(2)模型训练过程变得内在相关,因为模型通过反复接触彼此的合成输出进行交互,方式可能复杂。在这样的结构互动学习环境中建立可靠的统计推断仍然是一个重要开放问题。特别是,人们对模型崩溃现象日益关注,该现象是指生成模型在训练于早期模型生成的合成数据时性能逐步下降。先前关于模型崩溃的研究主要集中在单个模型训练其自身输出的情况,未能捕捉多模型交互环境中的模型性能。在本文中,我们填补了这一空白,通过研究具有通用交互模式的交互学习环境中的生成模型性能。特别是,我们利用有向图形式化模型交互,并证明模型崩溃的发生严重依赖于交互图的拓扑结构。我们进一步推导出一个显式的必要和充分条件,以表征模型崩溃何时发生,并为线性回归建立有限样本结果,为一般M估计量建立渐近保证。我们通过广泛的数值实验支持我们的理论发现。

英文摘要

The proliferation of generative artificial intelligence has given rise to an interactive learning environment, where model parameters are continuously updated using not only data generated by natural processes, but also synthetic outputs produced by other models. This paradigm introduces two major challenges: (1) training data are no longer drawn exclusively from the target population, undermining a core assumption of classical statistical learning, and (2) model training processes become inherently correlated, as models interact with one another through repeated exposure to each other's synthetic outputs in a potentially complex manner. Establishing reliable statistical inference in such structured interactive learning environments therefore remains an important open problem. In particular, there is growing concern about model collapse, a phenomenon in which the performance of generative models progressively degrades as they are trained on synthetic data produced by earlier model generations. Prior work on model collapse primarily focuses on a single model trained on its own output, failing to capture model performance in multi-model interactive settings. In this work, we fill this gap by investigating the performance of generative models in an interactive learning environment with general interaction patterns. In particular, we formalize model interactions using directed graphs and show that the occurrence of model collapse depends critically on the topology of the interaction graph. We further derive an explicit necessary and sufficient condition characterizing when model collapse occurs, and establish finite-sample results for linear regression and asymptotic guarantees for general M-estimators. We support our theoretical findings through extensive numerical experiments.

2605.20149 2026-05-20 cs.CL cs.AI cs.HC

Less Back-and-Forth: A Comparative Study of Structured Prompting

少来回:结构化提示的比较研究

Saurav Ghosh, Gabriella Polach, Abdou Sow

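AI summary: This paper studies whether structured prompt design improves LLM response quality while reducing user effort. Comparing three prompt conditions, it finds that checklist-improved prompts achieve the highest mean score on a rubric covering task completion, correctness, compliance, and clarity, and also give the best quality-effort tradeoff.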

详情
Comments
7 pages, 2 figures, 6 tables
AI中文摘要

大型语言模型(LLMs)广泛用于开放式任务,但不明确的提示可能导致低质量的回答和额外的交互。本文研究结构化提示设计是否能提高响应质量并减少用户努力。我们比较了三种提示条件:原始提示、检查清单改进提示和澄清问题提示。我们通过四个任务类型——摘要、规划、解释和编程,使用三个LLM系统:ChatGPT、Claude和Grok来评估这些条件。每个输出都使用统一的评分标准进行评分,涵盖任务完成、正确性、合规性和清晰度。检查清单改进提示在评分方面得分最高,平均得分为7.50(满分8分),相比原始提示的5.67和澄清问题提示的6.67。检查清单提示在质量和努力的平衡上也表现最佳,使用比原始和澄清提示更少的平均令牌。这些结果表明,简单的提示检查清单可以提高LLM响应质量,同时减少不必要的交互。

英文摘要

Large language models (LLMs) are widely used for open-ended tasks, but underspecified prompts can lead to low-quality answers and additional interaction. This paper studies whether structured prompt design improves response quality while reducing user effort. We compare three prompt conditions: a raw prompt, a checklist-improved prompt, and a clarifying-question prompt. We evaluate these conditions across four task types--summarization, planning, explanation, and coding--using three LLM systems: ChatGPT, Claude, and Grok. Each output is scored with a unified rubric covering task completion, correctness, compliance, and clarity. Checklist-improved prompts achieved the highest mean rubric score, 7.50 out of 8, compared with 5.67 for raw prompts and 6.67 for clarifying-question prompts. Checklist prompts also produced the best quality-effort tradeoff, using fewer average tokens than both raw and clarifying prompts. These results suggest that a simple prompt checklist can improve LLM responses while reducing unnecessary interaction.

2605.20147 2026-05-20 cs.CV

PixVerve: Advancing Native UHR Image Generation to 100MP with a Large-Scale High-Quality Dataset

PixVerve:通过大规模高质量数据集将原生超高清图像生成推至100MP

Haojun Chen, Haoyang He, Chengming Xu, Qingdong He, Junwei Zhu, Yabiao Wang, Zhucun Xue, Xianfang Zeng, Zhennan Chen, Xiaobin Hu, Hao Zhao, Yong Liu, Jiangning Zhang, Dacheng Tao

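AI summary: This paper introduces PixVerve-95K, a dataset built with a carefully designed data pipeline that contains 95K ultra-high-resolution images (each with at least 100M pixels) and seven-dimensional annotations. It extends T2I foundation models to native 100MP generation with three training schemes and establishes the PixVerve-Bench evaluation protocol.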

详情
Comments
Project page is available at https://haojunchen663.github.io/projects/PixVerve/
AI中文摘要

文本到图像(T2I)模型近年来在1K和2K分辨率方面取得了显著进展。随着对更好视觉体验的极端需求和成像技术的快速发展,超高清(UHR)图像生成的需求显著增长。然而,由于高分辨率内容的稀缺性和复杂性,UHR图像生成面临巨大挑战。在本文中,我们首先介绍了PixVerve-95K,一个高质量、开源的UHR T2I数据集,通过精心设计的数据管道构建,包含95K张图像,涵盖多样场景(每张图像的最小像素数为100M)和七维标注。基于我们的大规模图像-文本数据集,我们采取了开创性的步骤,将各种T2I基础模型扩展到原生100MP生成,采用三种训练方案。最后,利用传统度量标准和基于多模态大语言模型的评估,我们提出的PixVerve-Bench基准建立了涵盖视觉质量和语义对齐的全面评估协议。在我们的基准上的广泛实验结果和训练策略的建设性探索共同提供了对未来突破的宝贵见解。

英文摘要

Text-to-Image (T2I) models have recently seen notable progress around 1K and 2K resolution. With the extreme desire for better visual experience and the rapid development of imaging technology, the demand for Ultra-High-Resolution (UHR) image generation has grown significantly. However, UHR image generation poses great challenges due to the scarcity and complexity of high-resolution content. In this paper, we first introduce PixVerve-95K, a high-quality, open-source UHR T2I dataset curated with a carefully designed data pipeline, which contains 95K images across diverse scenarios (each image has a minimum pixel-count of 100M) and seven-dimensional annotations. Based on our large-scale image-text dataset, we take a pioneering step to extend various T2I foundation models to native 100MP generation with three training schemes. Finally, leveraging both conventional metrics and multimodal large language model-based assessments, our proposed PixVerve-Bench benchmark establishes a comprehensive evaluation protocol for UHR images encompassing visual quality and semantic alignment. Extensive experimental results on our benchmark and the constructive exploration of training strategies collaboratively provide valuable insights for future breakthroughs.

2605.20145 2026-05-20 stat.ML cs.LG stat.ME

Goal-Oriented Lower-Tail Calibration of Gaussian Processes for Bayesian Optimization

面向目标的高斯过程低尾校准用于贝叶斯优化

Aurélien Pion, Emmanuel Vazquez

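AI summary: This paper studies goal-oriented calibration of the predictive distributions of standard Gaussian process models below a low threshold t in the noiseless setting. It proposes tcGP, a post-hoc method that calibrates the part of the predictive distribution below t, and shows that the resulting EI-based global optimization algorithm remains dense in the design space. Experiments show improved lower-tail calibration and Bayesian optimization performance relative to standard and globally calibrated GP models.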

详情
Journal ref
ICML 2026
AI中文摘要

贝叶斯优化(BO)利用高斯过程(GP)预测分布来选择昂贵的黑箱目标的评估点。核选择和超参数选择可能导致预测分布不准确,从而影响探索与利用的平衡。对于最小化问题,采样标准如预期改进(EI)依赖于当前最佳值以下的预测分布,因此低尾不准确直接影响采样决策。本文研究了在无噪声情况下,针对低于低阈值t的标准高斯过程模型的预测分布进行面向目标的校准,超参数通过最大似然法选择。引入了一种预测可靠性低于t的框架,基于两个空间校准的概念:设计空间上的发生校准和子水平集形式{ x∈X, f(x)≤t }上的阈值μ-校准。在此框架基础上,提出tcGP,一种后处理方法,用于校准预测分布低于t的部分,并证明由此得到的基于EI的全局优化算法在设计空间中保持密集。在标准基准测试中,实验表明相较于标准高斯过程模型和全局校准高斯过程模型,改进了低尾校准和贝叶斯优化性能。

英文摘要

Bayesian optimization (BO) selects evaluation points for expensive black-box objectives using Gaussian process (GP) predictive distributions. Kernel choice and hyperparameter selection can lead to miscalibrated predictive distributions and an inappropriate exploration-exploitation trade-off. For minimization, sampling criteria such as expected improvement (EI) depend on the predictive distribution below the current best value, so lower-tail miscalibration directly affects the sampling decision. This article studies goal-oriented calibration of GP predictive distributions below a low threshold $t$ in the noiseless setting, for standard GP models with hyperparameters selected by maximum likelihood. A framework for predictive reliability below $t$ is introduced, based on two notions of spatial calibration: occurrence calibration over the design space and thresholded $μ$-calibration on sublevel sets of the form $\{x\in\mathbb{X}, f(x)\le t\}$. Building on this framework, we propose tcGP, a post-hoc method that calibrates GP predictive distributions below $t$, and we show that the resulting EI-based global optimization algorithm remains dense in the design space. Experiments on standard benchmarks show improved lower-tail calibration and BO performance relative to standard GP models and globally calibrated GP models.
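
To make concrete why EI depends on the lower tail, here is a small Python sketch of the standard closed-form expected improvement for minimization under a Gaussian predictive distribution. It shows only the textbook criterion, not tcGP; the function names are chosen for illustration.

```python
import math

def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mu, sigma, best):
    """EI for minimization: E[max(best - Y, 0)] with Y ~ N(mu, sigma^2)."""
    if sigma <= 0.0:
        # Degenerate prediction (e.g. at an observed point in the noiseless case).
        return max(best - mu, 0.0)
    z = (best - mu) / sigma
    return (best - mu) * normal_cdf(z) + sigma * normal_pdf(z)

def lower_tail_probability(mu, sigma, t):
    """Predicted probability that the objective falls below threshold t."""
    return normal_cdf((t - mu) / sigma)

# Same predictive mean, different predictive spread: EI changes a lot.
print(expected_improvement(mu=1.0, sigma=1.0, best=0.0))
print(expected_improvement(mu=1.0, sigma=2.0, best=0.0))
```

Both terms of EI are integrals over the region below `best`, so an over- or under-dispersed lower tail shifts the criterion even when the predictive mean is accurate.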

2605.20144 2026-05-20 eess.SY cs.SY

A Unified Framework for Attack-Resilient CLF-CBF Quadratic Programs for Nonlinear Control-Affine Systems

面向非线性控制仿射系统的攻击弹性CLF-CBF二次规划统一框架

Mohamadamin Rajabinezhad, Shan Zuo

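AI summary: This paper proposes a unified framework of attack-resilient CLF-CBF quadratic programs for nonlinear control-affine systems. By embedding a unified adaptive compensation term, it achieves finite-time recovery to the nominal safe set under control-input false data injection attacks without a prior bound on the attack magnitude, relying instead on a growth-rate characterization used for analysis and an online gain tuning law.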

详情
Comments
Under review for possible publication
AI中文摘要

本文介绍了针对非线性控制仿射系统的攻击弹性控制李雅普诺夫函数(AR-CLFs)和攻击弹性控制屏障函数(AR-CBFs),这些系统受到控制输入虚假数据注入攻击(FDIA)的影响,且FDIA满足至多指数增长的包络。所提出的框架将统一的自适应补偿项嵌入到CLF下降约束和CBF安全约束中。与基于输入到状态稳定性/安全性(ISS/ISSf)、只能认证依赖于扰动的扩大安全集的方法不同,所提出的方法能够在不需事先确定FDIA幅度上限的情况下,实现有限时间恢复到名义安全集,依赖于用于分析的增长率特性以及在线增益调节律来调节补偿项。开发了一个统一的二次规划(QP)以同时执行AR-CLF和AR-CBF条件,保证在无界FDIA下的一致最终有界(UUB)稳定性和一致最终安全(UUS)。数值结果表明,与现有ISS-CLF、ISSf-CBF和鲁棒CLF-CBF-QP方法相比,具有更好的弹性。

英文摘要

This letter introduces attack-resilient Control Lyapunov Functions (AR-CLFs) and attack-resilient Control Barrier Functions (AR-CBFs) for nonlinear control-affine systems subject to control-input false data injection attacks (FDIA) satisfying an at-most-exponentially growing envelope. The proposed framework embeds a unified adaptive compensation term into both the CLF decrease and CBF safety constraints. In contrast to input-to-state stability/safety (ISS/ISSf)-based methods that certify disturbance-dependent enlarged safe sets, the proposed approach enables finite-time recovery to the nominal safe set without requiring a prior magnitude bound on the FDIA, relying instead on a growth-rate characterization used for analysis and an online gain tuning law that regulates the compensation term. A unified quadratic program (QP) is developed to enforce the AR-CLF and AR-CBF conditions simultaneously, guaranteeing uniformly ultimately bounded (UUB) stability and uniform ultimate safety (UUS) under unbounded FDIA. Numerical results demonstrate improved resilience compared to existing ISS-CLF, ISSf-CBF, and robust CLF-CBF-QP approaches.
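
For orientation, the nominal (attack-free) CLF-CBF quadratic program that such frameworks build on is usually written as below. This is the standard form from the control literature, not the AR-CLF/AR-CBF program of the paper, whose adaptive compensation term is not given in the abstract. The symbols are assumptions for illustration: dynamics $\dot{x} = f(x) + g(x)u$, CLF $V$, CBF $h$ with safe set $\{x : h(x) \ge 0\}$, weight matrix $H \succ 0$, slack penalty $p > 0$, decay rate $\gamma > 0$, and a class-$\mathcal{K}$ function $\alpha$.

```latex
\begin{aligned}
\min_{u,\ \delta}\quad & \tfrac{1}{2}\, u^{\top} H u + p\, \delta^{2} \\
\text{s.t.}\quad
& L_f V(x) + L_g V(x)\, u + \gamma\, V(x) \le \delta
  && \text{(CLF decrease, relaxed by } \delta\text{)} \\
& L_f h(x) + L_g h(x)\, u + \alpha\big(h(x)\big) \ge 0
  && \text{(CBF safety, hard constraint)}
\end{aligned}
```

The slack $\delta$ lets the stability constraint yield when it conflicts with safety. A control-input FDIA replaces $u$ with $u + a(t)$ in the dynamics, which is why both constraints need a correction once the attack signal $a(t)$ is unbounded.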

2605.20138 2026-05-20 cs.RO cs.SY eess.SY

Hamilton-Jacobi Reachability for Spacecraft Collision Avoidance

航天器碰撞避免的Hamilton-Jacobi可达性

Larry Hui, Jordan Kam, William Su, Jianshu Zhou

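AI summary: This paper presents a Hamilton-Jacobi (HJ) reachability framework for collision avoidance between two satellites in the same circular orbit, with relative motion modeled in the radial-tangential-normal (RTN) frame using planar Hill-Clohessy-Wiltshire (HCW) dynamics. The target set consists of unsafe relative configurations corresponding to minimum separation requirements consistent with Federal Communications Commission (FCC) orbital standards. The interaction is formulated as a zero-sum differential game in which Player 1 is the controlled satellite and Player 2 is a bounded adversarial disturbance with unknown intent. Backward reachable sets characterize the relative states from which collision cannot be avoided under worst-case disturbances, and they are combined with supervisory hybrid control logic to decide when evasive maneuvers must begin.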

详情
Comments
Accepted to the 20th IEEE International Conference on Control & Automation (IEEE ICCA 2026). 6 pages, 4 figures
AI中文摘要

本文提出了一种用于同轨道双卫星碰撞避免问题的Hamilton-Jacobi(HJ)可达性框架,通过平面Hill-Clohessy-Wiltshire(HCW)动力学在径向-切向-法向(RTN)框架中建模相对运动。我们定义目标状态空间为对应于最小分离要求一致的联邦通信委员会(FCC)轨道标准的不安全相对配置。将航天器之间的相互作用建模为零和微分博弈,其中玩家1是受控卫星,玩家2被建模为具有未知意图的有界对抗干扰。我们提出了HJ公式,并计算了后向可达集,这些集描述了在最坏情况下无法避免碰撞的相对状态,而集外的状态则允许证明安全的轨迹。这些可达集与监督混合控制逻辑相结合,以确定何时必须启动规避机动,从而为可扩展性提供数学基础的安全保证。

英文摘要

This article presents a Hamilton-Jacobi (HJ) reachability framework for a two-satellite collision avoidance problem operating in the same circular orbit, where relative motion is modeled in the radial-tangential-normal (RTN) frame using planar Hill-Clohessy-Wiltshire (HCW) dynamics. We define the target state space as unsafe relative configurations in the orbit plane corresponding to minimum separation requirements consistent with Federal Communications Commission (FCC) orbital standards. The interaction between spacecraft is formulated as a zero-sum differential game, where Player 1 is the controlled satellite and Player 2 is modeled as a bounded adversarial disturbance with unknown intent. We present the HJ formulation and compute backward reachable sets that characterize relative states from which collision cannot be avoided under worst-case disturbances, while states outside this set admit provably collision-free trajectories. These reachable sets are integrated with supervisory hybrid control logic to determine when evasive maneuvers must be initiated, enabling mathematically grounded safety guarantees for scalability.
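
The planar HCW model and the unsafe target set mentioned above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the paper's code: the state is (radial offset, along-track offset, and their rates), the disturbance is assumed to enter additively as an acceleration, and the mean motion and separation radius are placeholder values.

```python
import math

def hcw_planar_derivative(state, control, disturbance, n):
    """Planar Hill-Clohessy-Wiltshire relative dynamics in the orbit plane.

    state       = (x, y, vx, vy): radial and along-track offsets and rates
    control     = (ux, uy): acceleration of the controlled satellite (Player 1)
    disturbance = (dx, dy): bounded adversarial acceleration (Player 2);
                  the additive sign convention is an assumption of this sketch
    n           = mean motion of the circular reference orbit [rad/s]
    """
    x, y, vx, vy = state
    ax = 3.0 * n * n * x + 2.0 * n * vy + control[0] + disturbance[0]
    ay = -2.0 * n * vx + control[1] + disturbance[1]
    return (vx, vy, ax, ay)

def in_unsafe_set(state, min_separation):
    """Target set: relative positions closer than the required separation."""
    return math.hypot(state[0], state[1]) < min_separation

n = 0.0011               # placeholder mean motion, roughly a low Earth orbit
no_input = (0.0, 0.0)

# A pure along-track offset is an equilibrium: two satellites on the same
# circular orbit hold their spacing when nobody thrusts.
trailing = (0.0, 500.0, 0.0, 0.0)
print(hcw_planar_derivative(trailing, no_input, no_input, n))

# A pure radial offset is not: it produces a radial acceleration of 3*n^2*x.
radial = (100.0, 0.0, 0.0, 0.0)
print(hcw_planar_derivative(radial, no_input, no_input, n))
```

A backward reachable set computation would evolve a value function whose zero sublevel set starts at `in_unsafe_set`, with the control maximizing and the disturbance minimizing the final separation; that solver is outside the scope of this sketch.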

2605.20136 2026-05-20 eess.SY cs.SY

Enabling Real-Time Phase Control in Traffic Signal Hardware-in-the-Loop Simulation

在交通信号硬件在环仿真中实现实时相位控制

Zhiyao Zhang, Gergely Zachár, William Barbour, Matt Bunting, Marcos Quiñones-Grueiro, Jonathan Sprinkle, Dan Work

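AI summary: This paper presents the first HILS testbed that supports real-time phase control. A novel middleware architecture translates dynamic phase actions into commands for NTCIP-compliant hardware controllers, managing phase transitions, signal-state synchronization, and error handling. Experiments show that the system executes real-time phase commands with sub-millisecond average internal latency.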

详情
Comments
7 pages, 5 figures, accepted to IEEE ITSC 2026
AI中文摘要

先进的交通信号控制(TSC)算法需要实时相位控制,但现有的硬件在环仿真(HILS)测试平台仅支持预编程的定时计划。本文提出首个支持实时相位控制的HILS测试平台。我们开发了一种新型中间件架构,将动态相位动作(选择、切换和持续时间)转换为符合NTCIP标准的商用硬件控制器命令。该中间件管理相位转换,同步信号状态,并处理错误,而不会中断硬件的内部操作。实验验证表明,系统能够执行实时相位命令,处理系统冲突,并在平均亚毫秒级别实现低系统内部延迟。

英文摘要

Advanced Traffic Signal Control (TSC) algorithms require real-time phase control, yet existing Hardware-in-the-Loop Simulation (HILS) testbeds only support pre-programmed timing plans. In this paper, we present the first HILS testbed for real-time phase control. We develop a novel middleware architecture that translates dynamic phase actions (selection, switch, and duration) into commands for NTCIP-compliant commercial hardware controllers. This middleware manages phase transitions, synchronizes signal states, and handles errors without interrupting the hardware's internal operations. Experimental validation demonstrates that the system executes real-time phase commands, handles system conflicts, and achieves a low system internal latency at sub-millisecond on average.

2605.20134 2026-05-20 cs.LG

TrajTok: Adaptive Spatial Tokenization for Trajectory Representation Learning

TrajTok: 用于轨迹表示学习的自适应空间令牌化

Zhen Xiong, Shang-Ling Hsu, Cyrus Shahabi

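AI summary: This paper proposes TrajTok, which learns general-purpose trajectory representations through adaptive spatial tokenization. With a learned multi-resolution hexagonal cell partition and masked-token pretraining, it achieves strong performance on trajectory similarity search, classification, estimated time of arrival, and travel-time regression.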

详情
AI中文摘要

从原始GPS轨迹学习通用的轨迹表示仍然具有挑战性,因为数据是连续的、嘈杂的且采样不规则。空间令牌化同样具有挑战性:细网格会产生稀疏单元格,嵌入较弱,而粗网格会将异质运动模式合并为同一个令牌。我们提出了TrajTok,一种具有简单预训练配方的轨迹编码器,用于可转移的轨迹嵌入。TrajTok首先从GPS点的空间分布学习多分辨率六边形网格划分,将嘈杂的GPS序列转换为离散的单元格令牌。为了捕捉几何和运动学,它使用分解的Transformer编码器,带有早期模态自注意力块、跨注意力融合层和时空旋转位置嵌入(ST-RoPE),以编码每个令牌的位置和时间。TrajTok通过掩码令牌建模进行预训练,从部分轨迹观测中恢复几何结构和运动学模式。在Porto数据集上,冻结的TrajTok编码器结合轻量级任务适配器在轨迹相似性搜索、分类、预计到达时间和完整旅行时间回归任务上表现优异,优于多种任务特定方法。相同的冻结编码器支持几何主导和运动学主导任务,表明TrajTok学习了可转移的轨迹结构,而不是任务特定的捷径。这些结果表明,学习多分辨率空间令牌化结合掩码令牌预训练是通用轨迹基础模型的有希望的方向。

英文摘要

Learning generalizable trajectory representations from raw GPS traces remains difficult because the data is continuous, noisy, and irregularly sampled. Spatial tokenization is also challenging: fine grids yield sparse cells with weak embeddings, while coarse grids merge heterogeneous movement patterns into the same token. We present TrajTok, a trajectory encoder with a simple pretraining recipe for transferable trajectory embeddings. TrajTok first learns a multi-resolution hexagonal cell partition from the spatial distribution of GPS points, converting noisy GPS sequences into discrete cell tokens. To capture both geometry and kinematics, it uses a factorized transformer encoder with early per-modality self-attention blocks, cross-attention fusion layers, and spatiotemporal rotary position embeddings, ST-RoPE, to encode where and when each token occurs. TrajTok is pretrained with masked-token modeling that recovers both geometric structure and kinematic patterns from partial trajectory observations. On the Porto dataset, a frozen TrajTok encoder with lightweight task adapters achieves strong performance across trajectory similarity search, classification, estimated time of arrival, and full travel-time regression, outperforming multiple task-specific methods. The same frozen encoder supports both geometry-dominated and kinematics-dominated tasks, suggesting that TrajTok learns transferable trajectory structure rather than task-specific shortcuts. These results indicate that learned multi-resolution spatial tokenization combined with masked-token pretraining is a promising direction for general-purpose trajectory foundation models.

2605.20132 2026-05-20 physics.geo-ph cs.LG eess.SP

FiLark: a streaming-first software framework for end-to-end exploration, annotation, and algorithm integration in distributed acoustic sensing

FiLark:一种面向流式处理的软件框架,用于分布式声学传感的端到端探索、标注和算法集成

Jintao Li, Weichang Li, Kai Tong, Xaingyu Guo

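AI summary: This paper presents FiLark, a framework that applies a streaming-first principle to end-to-end exploration, annotation, and algorithm integration for distributed acoustic sensing data, addressing the inability of conventional batch-oriented analysis frameworks to handle continuous, high-channel-count data streams.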

详情
AI中文摘要

分布式声学传感(DAS)系统生成的连续、超高通道计数的数据流速率超过了传统批量分析框架的能力。因此,诸如长时记录的交互探索、可扩展的事件标注和实时算法闭环监控等关键任务仍然无法得到足够支持。本文提出了FiLark(Fiber Lark),一种Python框架,其应用流式处理原则贯穿数据访问、信号处理、可视化和监控。FiLark将任何DAS源,包括连续多文件记录,作为统一流进行处理,并围绕该抽象构建所有系统组件。基于OpenGL的环形缓冲区渲染器允许以恒定内存使用量交互浏览和可视化任意长的记录。集成的标注界面支持在连续数据流中直接进行事件标注,从而在不进行离线预处理的情况下创建可重复的机器学习准备好的标注数据集。信号处理库包括时间、空间、频谱和分解基的运算符,包含通过PyTorch实现的CPU版本和GPU加速版本,以及具有状态的分块执行,以在段边界保持处理连续性和应用语义。标准化的监控接口进一步将流式检测器和基于学习的模型整合到可视化工作流程中。通过在所有层次共享共同的流式抽象,FiLark允许在交互式开发的处理配置和工作流程直接转移到可扩展的生产管道中,而无需修改。

英文摘要

Distributed acoustic sensing (DAS) systems generate continuous, ultra-high-channel-count data streams at rates that exceed the capabilities of conventional batch-oriented analysis frameworks. As a result, essential tasks such as interactive exploration of long-duration recordings, scalable event annotation, and real-time algorithm-in-the-loop monitoring remain inadequately supported by workflows built around manually selected data segments and offline processing. This paper presents FiLark (Fiber Lark), a Python framework that applies a streaming-first principle uniformly across data access, signal processing, visualization and monitoring for DAS. Instead of operating on manually selected data segments, FiLark presents any DAS source, including continuous multi-file recordings, as a unified stream and builds all system components around that abstraction. An OpenGL-based ring-buffer renderer enables interactive browsing and visualization of arbitrarily long recordings with constant memory usage. An integrated annotation interface supports event labeling directly within continuous data streams, facilitating the creation of reproducible machine-learning-ready labeled datasets without offline preprocessing. The signal processing library includes temporal, spatial, spectral, and decomposition-based operators, with both CPU implementations and GPU-accelerated variants via PyTorch, alongside stateful chunked execution that preserves processing continuity and application semantics across segment boundaries. A standardized monitor interface further integrates streaming detectors and learning-based models into the visualization workflow. By sharing a common streaming abstraction across all layers, FiLark allows processing configurations and workflows developed interactively to transfer directly to scalable production pipelines without modification.
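
The idea of stateful chunked execution can be shown with a short, self-contained Python sketch. It does not use or reproduce FiLark's API: the class and function names are hypothetical, and a first-order low-pass filter stands in for a real DAS operator. The point is only that carrying filter state across chunk boundaries makes chunked output identical to processing the whole record at once.

```python
class ChunkedSmoother:
    """First-order low-pass filter y[k] = a*x[k] + (1-a)*y[k-1].

    The last output is kept between calls, so consecutive chunks are
    filtered as if they were one continuous stream.
    """

    def __init__(self, alpha):
        self.alpha = alpha
        self.prev = None  # filter state carried across chunk boundaries

    def process(self, chunk):
        out = []
        for x in chunk:
            if self.prev is None:
                self.prev = x
            else:
                self.prev = self.alpha * x + (1.0 - self.alpha) * self.prev
            out.append(self.prev)
        return out

def stream_chunks(samples, chunk_size):
    """Yield fixed-size chunks, standing in for a multi-file recording."""
    for start in range(0, len(samples), chunk_size):
        yield samples[start:start + chunk_size]

signal = [float(i % 7) for i in range(50)]

# Reference: filter the whole record in one call.
whole = ChunkedSmoother(0.3).process(signal)

# Stateful chunked execution: one filter object sees every chunk in order.
smoother = ChunkedSmoother(0.3)
chunked = [y for chunk in stream_chunks(signal, 8) for y in smoother.process(chunk)]

# Stateless execution: a fresh filter per chunk restarts at each boundary.
stateless = [y for chunk in stream_chunks(signal, 8)
             for y in ChunkedSmoother(0.3).process(chunk)]

print(chunked == whole, stateless == whole)
```

The stateless variant introduces a transient at every chunk boundary, which is the kind of artifact that segment-based offline workflows have to work around.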

2605.20129 2026-05-20 cs.IT math.IT

Stochastic Chase Decoding for BMS Channels via Rate Distortion Theory

基于率失真理论的BMS信道随机Chase译码

Amit Berman, Ariel Doubchak, Uri Erez, Tal Philosof, Ilya Shapir

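AI summary: This paper develops a rate-distortion-based approach to stochastic Chase decoding of algebraic codes over binary memoryless symmetric (BMS) channels, replacing traditional heuristics for flip probabilities with information-theoretically grounded flipping rules. It reinterprets stochastic Chase decoding as a random-coding construction for error-pattern covering codes and, building on the framework of Nguyen et al., designs bit-flip probabilities for Chase decoding over BMS channels. The result is an explicit characterization of the asymptotically optimal bit-flipping rule and of the expected list size needed for the transmitted codeword to appear in the decoding list with high probability. For binary and quaternary symmetric channels, the optimal rule found by exhaustive search closely matches the information-theoretic rule.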

详情
Comments
Extended version of a submission to ISIT 2026
AI中文摘要

本文提出了一种基于率失真理论的随机Chase译码方法,用于二元无记忆对称(BMS)信道上的代数码译码,以具有信息论依据的翻转规则替代传统上用于确定翻转概率的启发式方法。特别地,我们将随机Chase译码重新解释为针对错误模式覆盖码的随机编码构造。我们的方法基于Nguyen等人提出的框架,他们为非二进制信道上的Reed-Solomon码引入了多次尝试译码的率失真公式。在他们的公式中,擦除模式的生成旨在与硬判决错误对齐,从而掩盖这些错误。我们将这一框架用于设计BMS信道上Chase译码的比特翻转概率。由此得到了渐近最优比特翻转规则的显式刻画,以及确保发送码字以高概率出现在译码列表中所需的期望列表大小。此外,对于二元和四元对称信道,我们证明了通过穷举搜索确定的最优比特翻转规则即使在短码长下也与信息论规则高度吻合。

英文摘要

This work develops a rate-distortion-based approach to stochastic Chase decoding of algebraic codes over binary memoryless symmetric (BMS) channels, replacing the heuristics traditionally used to determine flip probabilities with information-theoretically grounded flipping rules. In particular, we reinterpret stochastic Chase decoding as a random-coding construction for error-pattern covering codes. Our approach builds on the framework of Nguyen et al., who introduced a rate-distortion formulation of multiple-attempt decoding for Reed-Solomon codes over nonbinary channels. In their formulation, erasure patterns are generated so as to align with, and thereby mask, hard-decision errors. We adapt this framework to the design of bit-flip probabilities for Chase decoding over BMS channels. This yields an explicit characterization of the asymptotically optimal bit-flipping rule, together with the expected list size required to ensure that the transmitted codeword appears in the decoding list with high probability. Moreover, for binary and quaternary symmetric channels, we demonstrate that the optimal bit-flipping rule, determined by exhaustive search, closely matches the information-theoretic rule even at short block lengths.

2605.20128 2026-05-20 cs.CL

MixRea: Benchmarking Explicit-Implicit Reasoning in Large Language Models

MixRea: 在大型语言模型中评估显式-隐式推理的基准测试

Yuanqing Cai, Ziyi Huang, Minhao Liu, Lixin Duan, Wen Li, Yanru Zhang

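AI summary: This paper presents MixRea, a benchmark for evaluating explicit-implicit reasoning in large language models. It finds that even the best-performing model shows inattentional blindness and proposes PRCP, a prompting method, to improve reasoning.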

详情
Comments
12 pages, 6 figures, 4 tables
AI中文摘要

大型语言模型(LLMs)正越来越多地融入高风险决策中。受人类认知中'注意力盲区'理论的启发,我们研究LLMs是否在显式任务指令下表现出类似限制:未能关注到细微但重要的上下文线索。为此,我们引入显式-隐式推理任务,并提出MixRea基准测试,包含2246道多选题,覆盖9种推理类型,显式和隐式信息分布各异。对21种先进LLMs的评估显示,即使最佳推理模型(Gemini 2.5 Pro)也只能达到42.8%的一致性,揭示了普遍存在的注意力盲区。为缓解这一问题,我们提出潜在关系完成提示(PRCP),一种通过恢复被忽视的因果关系来提升推理能力的提示方法。进一步分析显示,这一限制在多样化的多源推理任务中依然存在,凸显了对更认知对齐模型的需求。

英文摘要

Large language models (LLMs) are increasingly integrated into high-stakes decision-making. Inspired by the theory of inattentional blindness in human cognition, we investigate whether LLMs, trained on human-preferred corpora that embed attentional biases, exhibit a similar limitation: failing to attend to subtle yet important contextual cues under explicit task instructions. To evaluate this, we introduce the task of explicit-implicit reasoning and present MixRea, a benchmark of 2,246 multiple-choice questions across 9 reasoning types with varying distributions of explicit and implicit information. Evaluation of 21 advanced LLMs shows that even the best-performing reasoning model (Gemini 2.5 Pro) achieves only 42.8% consistency, revealing widespread inattentional blindness. To mitigate this, we propose Potential Relation Completion Prompting (PRCP), a prompting method that improves reasoning by recovering overlooked causal relations. Further analysis shows that this limitation persists across diverse multi-source reasoning tasks, highlighting the need for more cognitively aligned models.

2605.20127 2026-05-20 q-bio.NC cs.AI cs.LG

Beyond Prediction Accuracy: Target-Space Recovery Profiles for Evaluating Model-Brain Alignment

超越预测准确性:用于评估模型-大脑对齐的靶空间恢复曲线

Ken Nakamura, Tomoya Nakai, Ryuto Yashiro, Ayumu Yamashita, Kaoru Amano

AI总结 本文提出了一种评估模型-大脑对齐的新方法,通过分析可重复预测的靶空间响应维度,揭示预测准确性之外的模型-大脑对齐情况。

详情
Comments
34 pages, 12 figures, 5 tables
AI中文摘要

人工视觉模型通常通过测量其内部表示预测大脑响应的准确性来评估人类视觉皮层。然而,仅凭预测准确性无法确定目标大脑响应空间中哪些维度被恢复。本文介绍了一种统一框架,通过识别预测恢复的响应维度来评估模型-大脑和大脑-大脑对齐。通过重复fMRI测量,我们首先确定可在独立试验分割中重复预测的目标大脑响应维度。然后,我们预测目标大脑响应,无论是从另一个受试者的大脑响应还是视觉模型的内部表示,并量化这些可重复响应维度的恢复程度。将此框架应用于自然场景数据集的一个子集,其中八名受试者在fMRI下观看了相同的自然图像,我们发现早期到中期视觉皮层响应包含一组低维的可重复维度。大脑-大脑比较确定哪些维度可以从其他受试者的大脑中一致恢复,提供了一种诊断性的人类参考而非仅标量基准。在某些情况下,预训练和随机初始化的模型在预测准确性上相似,但这些响应维度的恢复曲线却不同。这些结果表明,仅凭预测准确性可能掩盖模型-大脑不匹配。通过明确哪些可重复的大脑响应维度被预测恢复,我们的框架提供了更诊断性的评估,以评估人工视觉模型与人类视觉皮层的对齐情况。

英文摘要

Artificial vision models are often evaluated against the human visual cortex by measuring how accurately their internal representations predict brain responses. However, prediction accuracy alone does not indicate which dimensions of the target brain's response space are recovered. Here, we introduce a unified framework for evaluating both model-brain and brain-brain alignment by identifying the response dimensions recovered by prediction. Using repeated fMRI measurements, we first identify target-brain response dimensions that can be reproducibly predicted across independent trial splits. We then predict target-brain responses from either another subject's brain responses or a vision model's internal representations, and quantify how strongly each of these reproducible response dimensions is recovered. Applying this framework to a subset of the Natural Scenes Dataset, in which eight subjects viewed the same natural images during fMRI, we find that the early-to-intermediate visual-cortex responses contain a low-dimensional set of reproducible dimensions. Brain-to-brain comparisons identify which of these dimensions are consistently recoverable from other subjects' brains, providing a diagnostic human reference rather than only a scalar benchmark. In some cases, pretrained and randomly initialized models achieve similar prediction accuracy while showing distinct recovery profiles across these response dimensions. These results show that prediction accuracy alone can mask model-brain mismatches. By making explicit which reproducible brain response dimensions are recovered by prediction, our framework provides a more diagnostic evaluation of alignment between artificial vision models and the human visual cortex.

2605.20123 2026-05-20 cs.CR cs.IR

BiRD: A Bidirectional Ranking Defense Mechanism for Retrieval Augmented Generation

BiRD: 一种用于检索增强生成的双向排序防御机制

Chengcai Gao, Zhihong Sun, Xiaochuan Shi, Qiufeng Wang, Chao Liang

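AI summary: This paper proposes BiRD, a bidirectional ranking defense mechanism built on a dual-signal framework that improves both the efficiency and the robustness of retrieval-augmented generation systems, reducing attack success rates while improving task accuracy.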

详情
Comments
17 pages, 10 figures and 8 tables
AI中文摘要

随着检索增强生成(RAG)的广泛应用,对抗攻击也随之增加。现有防御方法依赖于语义分析或投票,面临计算成本高和在强污染攻击下鲁棒性差的权衡问题。其根本限制在于仅关注语义内容相关性,而忽视由排序结构定义的检索上下文。为此,我们研究了污染和良性文档的双向排序行为,发现关键的判别模式:污染文档在反向排序与查询的正向排序之间表现出显著更强的一致性。基于此,我们提出了BiRD,一种基于双信号框架的双向排序防御机制,利用正向排序评估语义内容相关性,利用反向排序量化排序上下文一致性。该设计直接解决了现有方法的根本限制,实现了同时的效率和鲁棒性。在三个数据集上使用三个检索器和三个LLM,在两种攻击场景下进行的广泛评估验证了BiRD的有效性。值得注意的是,BiRD将PoisonedRAG的攻击成功率降低了高达54%,同时任务准确性提高了高达56%,平均额外延迟低于1秒。

英文摘要

The growing adoption of Retrieval-Augmented Generation (RAG) has led to a rise in adversarial attacks. Existing defenses, relying on semantic analysis or voting, face a trade-off between high computational cost and limited robustness under strong poisoning attacks. Their fundamental limitation is the exclusive focus on semantic content relevance, while neglecting the retrieval context that is critically defined by ranking structures. To this end, we investigate the bidirectional ranking behavior of poisoned and benign documents, and discover a key discriminative pattern: poisoned documents exhibit significantly stronger alignment between their backward rankings and the query's forward ranking. Capitalizing on this, we propose BiRD, a bidirectional ranking defense mechanism built upon a dual-signal framework that leverages forward ranking to assess semantic content relevance and backward ranking to quantify ranking context consistency. This design directly addresses the fundamental limitation of prior approaches, enabling simultaneous efficiency and robustness. Extensive evaluation across 3 datasets with 3 retrievers and 3 LLMs under 2 attack scenarios validates BiRD's effectiveness. Notably, BiRD reduces the attack success rate of PoisonedRAG by up to 54% while simultaneously improving task accuracy by up to 56%, with average additional latency under 1 second.

2605.20122 2026-05-20 stat.ML cs.CC cs.LG

Optimizing Computational-Statistical Runtime for Wasserstein Distance Estimation

优化Wasserstein距离估计的计算-统计运行时间

Peter Matthew Jacobs, Jeff M. Phillips

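AI summary: This paper proposes a Sample-Sketch-Solve paradigm that introduces a regular Cartesian grid sketch of the samples to compress the data and speed up Wasserstein distance computation, estimating the distance between Hölder smooth distributions to within ε error with improved runtime.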

详情
AI中文摘要

平方Wasserstein距离是衡量概率分布之间差异的常用工具。该距离通常在两个底层随机样本的经验测度之间计算。不幸的是,即使在低维欧几里得空间问题(d∈{2,3})中,计算Wasserstein距离的算法在运行时间上随着n和所需精度的增加而表现不佳。为此,我们考虑计算-统计运行时间,目标是从样本中估计潜在光滑测度之间的Wasserstein距离,误差在期望意义上不超过ε。我们允许收集样本的计算成本为O(1)。为此,我们开发了一种Sample-Sketch-Solve范式,其中引入了样本的正则化笛卡尔网格草图。我们证明,尤其是在α-Hölder光滑分布下,这可以压缩数据而不增加渐近误差,并且正则化结构使更快的精确算法成为可能。最终,我们以ε误差在ε^{-max(2,(d+1+o(1))/(1+α))}时间内近似W_2^2(P,Q),对于0 < α < 1的Hölder光滑分布P,Q在(0,1)^d上;当d=2时,对于α>1/2,达到最优Θ(ε^{-2}),当d=3时,当α→1时几乎最优。

英文摘要

Squared Wasserstein distance is a frequently used tool to measure discrepancy between probability distributions. This distance is typically computed between empirical measures of size $n$ from two underlying random samples. Unfortunately, even in low-dimensional Euclidean space problems $\left( d \in \{2,3\} \right)$, algorithms for Wasserstein distance computation with approximate or exact precision guarantees scale poorly in the runtime as a function of $n$ and the desired precision. In response, we consider the computational-statistical runtime, where the goal is to estimate from samples the Wasserstein distance between potentially smooth measures up to $ε$-additive error in expectation with respect to the sampling; we allow $O(1)$ computational cost for collecting a sample. Towards this, we develop a Sample-Sketch-Solve paradigm where we introduce a regular Cartesian grid sketch of the samples. We show that (especially under $α$-Hölder smooth distributions) this can compress the data without increasing asymptotic error, and also regularizes the structure, which enables faster exact algorithms. Ultimately, we approximate $W_2^2(P,Q)$ within $ε$ error in $ε^{-\max(2,\frac{d+1+o(1)}{1+α})}$ time for $0 < α < 1$ Hölder smooth distributions $P,Q$ on $(0,1)^{d}$; this is an optimal $Θ(ε^{-2})$ for $α > 1/2$ when $d=2$ and nearly optimal as $α \to 1$ when $d = 3$.
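
The closing claim about optimality follows from a short calculation on the runtime exponent. The derivation below is a reader's check rather than part of the paper; it drops the $o(1)$ term, which is why the thresholds in the abstract are stated with strict inequalities and limits.

```latex
\begin{aligned}
\text{exponent}(d,\alpha) &= \max\!\left(2,\ \frac{d+1}{1+\alpha}\right),
\qquad
\frac{d+1}{1+\alpha} \le 2 \iff \alpha \ge \frac{d-1}{2}. \\[4pt]
d = 2:\quad & \frac{3}{1+\alpha} \le 2 \iff \alpha \ge \tfrac{1}{2}
  \;\Longrightarrow\; \text{runtime } \varepsilon^{-2} \text{ for } \alpha > \tfrac{1}{2}. \\[4pt]
d = 3:\quad & \frac{4}{1+\alpha} > 2 \ \text{ for all } \alpha < 1,
  \qquad \frac{4}{1+\alpha} \to 2 \ \text{ as } \alpha \to 1 .
\end{aligned}
```

So in two dimensions the smoothness term stops dominating once $α$ exceeds $1/2$, while in three dimensions the exponent stays above 2 throughout the range $0 < α < 1$ and only approaches it in the limit.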