arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.00578 2026-05-20 cs.CV

Federated Distillation for Whole Slide Image via Gaussian-Mixture Feature Alignment and Curriculum Integration

Luru Jing, Cong Cong, Yanyuan Chen, Yongzhi Cao

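AI summary This paper proposes FedHD, a framework that uses Gaussian-mixture feature alignment and a curriculum integration strategy to enable whole slide image analysis in federated learning; locally generated, semantically rich synthetic feature representations improve model performance while preserving diagnostic diversity.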

Comments Accepted by ICML 2026, Camera-Ready version updated

English abstract

Federated learning (FL) offers a promising framework for collaborative digital pathology by enabling model training across institutions. However, real-world deployments face heterogeneity arising from diverse multiple instance learning (MIL) architectures and heterogeneous feature extractors across institutions. We propose FedHD, a novel FL framework that performs local Gaussian-mixture feature alignment tailored for WSI analysis. Instead of exchanging model parameters, each client independently distills semantically rich synthetic feature representations aligned with the distribution of real WSIs. To preserve diagnostic diversity, FedHD adopts a one-to-one distillation strategy, generating a synthetic counterpart for each real slide to avoid over-compression. During federation, a curriculum-based integration strategy progressively incorporates cross-site synthetic features into local training once performance plateaus. Furthermore, an optional interpretation module reconstructs pseudo-patches from synthetic embeddings, enhancing transparency. FedHD is architecture-agnostic, privacy-preserving, and supports personalized yet collaborative training across diverse institutions. Experiments on TCGA-IDH, CAMELYON16, and CAMELYON17 show that FedHD consistently outperforms state-of-the-art federated and distillation baselines.
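
The abstract says cross-site synthetic features are brought into local training progressively once performance plateaus, but it does not give the schedule. The following is a minimal sketch of one way such a plateau-triggered curriculum could look. The function name `curriculum_schedule` and the parameters `patience`, `min_delta`, and `step` are hypothetical and are not taken from FedHD.

```python
def curriculum_schedule(val_scores, patience=2, min_delta=1e-3, step=0.25):
    """Return, per round, the fraction of cross-site synthetic features to mix in.

    Hypothetical rule: each time the validation score fails to improve by more
    than `min_delta` for `patience` consecutive rounds, raise the fraction by
    `step`, capped at 1.0.
    """
    fraction, best, stale = 0.0, float("-inf"), 0
    fractions = []
    for score in val_scores:
        if score > best + min_delta:
            best, stale = score, 0      # still improving: keep current mix
        else:
            stale += 1                  # no meaningful improvement this round
        if stale >= patience:
            fraction = min(1.0, fraction + step)
            stale = 0                   # restart the plateau counter
        fractions.append(fraction)
    return fractions


scores = [0.60, 0.70, 0.70, 0.70, 0.75, 0.75, 0.75]
schedule = curriculum_schedule(scores)
print(schedule)
```

With these toy scores, the fraction stays at zero while the local model is still improving and rises only after each two-round plateau.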

2605.00333 2026-05-20 cs.LG cs.CL

Borrowed Geometry: Cross-Distribution Head-Importance Fingerprints of Frozen Pretrained Gemma 4 31B

Abay Bektursun

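AI summary This paper studies head-importance fingerprints of the frozen pretrained Gemma 4 31B model across distributions. By analyzing per-head impact on several tasks, it finds that specific heads show marked importance across tasks, and it reports causal validation for these heads.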

Comments v2: Added head-level causal ablation on OGBench cube-task1 (n=30, 3.2x specificity; n=5 paired-t p=0.039) and full L26 sweep. New sections on honest negatives (activation patching null, sufficiency null, within-layer Spearman wrong-direction). Multiplicity-aware permutation null V4 P=0.013. Title and framing updated. 25 pages (13 main), 10 figures

English abstract

Frozen Gemma 4 31B weights pretrained exclusively on text, unmodified, transfer through a thin trainable interface to non-text modalities the substrate has never processed. On the L24--L29 slice (192 attention heads), an English-text TxtCopy attention probe (95 sentences) and per-head ablation impact on four non-language token-pattern tasks (binary copy, associative recall, 1D cellular automaton Rule 90, binary addition) jointly classify four heads -- L26.28, L27.28, L27.2, L27.3 -- as top-tier on both signals. The slice-level joint coincidence is significant under hypergeometric null ($P = 0.0013$, $N=192$, $K=38$, $n=4$) and survives multiplicity-aware permutation tests ($P_{V4} = 0.013$). Pretrained Gemma L26 reaches 60.22% on OGBench cube-double-play-task1 vs ~1% for random-init Gemma ($+59$pt at $n=3$); a FrozenRandom-GPT2 control with correct $1/\sqrt{d_k}$ scaling also fails. Head-level causal validation: zeroing L26.28 in the trained cube-task1 IQL agent drops success $63.3\% \to 10.0\%$ vs $46.7\%$ for a layer-matched low-TxtCopy negative control ($3.2\times$ specificity at $n=30$; $n=5$ paired-$t$ $p=0.039$). A full L26 sweep places L26.28 at rank 4 of 32. Honest negatives: within-L26 Spearman $ρ(\text{TxtCopy, drop}) = +0.37$ (opposite of within-layer causal reading); single-head activation patching does not transfer the matching variable; the 4 named heads alone do not suffice on any task; Walker2d-DT and scene-task1 recruit L24 outside the named slice and show null head-ablation specificity. We frame the contribution as a cross-distribution importance fingerprint at the slice level plus head-level causal evidence on one cross-modality target.
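
Two of the reported figures can be checked with simple arithmetic. The sketch below assumes one reading of the hypergeometric null, namely the probability that all n = 4 named heads fall inside a top set of K = 38 out of N = 192 heads under random selection, and it assumes that 46.7% is the post-ablation success rate of the negative control. Both readings are assumptions rather than the author's stated procedure, though they match the reported values.

```python
from math import comb

# Hypergeometric probability that all n drawn heads land in a set of size K
# out of N heads (assumed reading of the null in the abstract).
N, K, n = 192, 38, 4
p_joint = comb(K, n) / comb(N, n)

# Specificity as a ratio of success-rate drops in percentage points
# (assumes 46.7% is the control's success rate after ablation).
baseline, target_ablated, control_ablated = 63.3, 10.0, 46.7
specificity = (baseline - target_ablated) / (baseline - control_ablated)

print(f"P = {p_joint:.4f}")
print(f"specificity = {specificity:.1f}x")
```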

2604.25646 2026-05-20 cs.CV cs.RO

SAMe: A Semantic Anatomy Mapping Engine for Robotic Ultrasound

Jing Zhang, Duojie Chen, Wentao Jiang, Zihan Lou, Jianxin Liu, Xinwu Cui, Qinghong Zhao, Bo Du, Christoph F. Dietrich, Dacheng Tao

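AI summary This study proposes SAMe, a semantic anatomy mapping engine that provides an explicit anatomical prior layer to address scan initialization in robotic ultrasound; it identifies anatomical targets from clinical complaints and generates control commands, improving the accuracy and efficiency of automated scanning.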

Comments Supplementary information included. Code will be released at https://github.com/MiliLab/Echo-SAMe

English abstract

Robotic ultrasound has advanced local image-driven control, contact regulation, and view optimization, yet current systems lack the anatomical understanding needed to determine what to scan, where to begin, and how to adapt to individual patient anatomy. These gaps leave systems reliant on expert intervention to initiate scanning. Here we present SAMe, a semantic anatomy mapping engine that provides robotic ultrasound with an explicit anatomical prior layer. SAMe addresses scan initiation as a target-to-anatomy-to-action process: it grounds under-specified clinical complaints into structured target organs, instantiates a patient-specific anatomical representation for the grounded targets from a single external body image, and translates this representation into control-facing 6-DoF probe initialization states without any additional registration using preoperative CT or MRI. The anatomical representation maintained by SAMe is explicit, lightweight (single-organ inference in 0.08s), and compatible with downstream control by design. Across semantic grounding, anatomical instantiation, and real-robot evaluation, SAMe shows strong performance over the full initialization pipeline. In real-robot experiments, centroid-based SAMe initialization outperformed the body-keypoint-based heuristic baseline under a budget-matched single-target setting for both liver (86.7% versus 46.7%) and kidney (80.0% versus 73.3%) initialization. Furthermore, the trial-level organ-hit rate reached 97.3% for liver and 83.3% for kidney when multiple candidate targets were available. These results establish an explicit anatomical prior layer that addresses scan initialization and is designed to support broader downstream autonomous scanning pipelines, providing the anatomical foundation for complaint-driven, anatomically informed robotic ultrasonography.

2604.18739 2026-05-20 cs.LG stat.ML

Discrete Tilt Matching

Yuyuan Chen, Shiyi Wang, Peter Potaptchik, Jaeyeon Kim, Michael S. Albergo

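AI summary This paper proposes Discrete Tilt Matching, a likelihood-free method for fine-tuning diffusion large language models that uses state-level matching of local unmasking posteriors to improve training stability and prevent mode collapse.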

English abstract

Masked diffusion large language models (dLLMs) are a promising alternative to autoregressive generation. While reinforcement learning (RL) methods have recently been adapted to dLLM fine-tuning, their objectives typically depend on sequence-level marginal likelihoods, which are intractable for masked diffusion models. To address this, we derive Discrete Tilt Matching (DTM), a likelihood-free method that recasts dLLM fine-tuning as state-level matching of local unmasking posteriors under reward tilting. DTM takes the form of a weighted cross-entropy objective with explicit minimizer, and admits control variates that improve training stability. On a synthetic maze-planning task, we analyze how DTM's annealing schedule and control variates affect training stability and prevent mode collapse. At scale, fine-tuning LLaDA-8B-Instruct with DTM yields strong gains on Sudoku and Countdown while remaining competitive on MATH500 and GSM8K.
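
The abstract describes DTM as a weighted cross-entropy objective under reward tilting but does not state the formula. The sketch below is therefore not the DTM objective. It is a generic illustration of the idea of a reward-tilted weighted cross-entropy, in which samples are weighted in proportion to exp(reward / beta). The function names, the temperature `beta`, and the toy data are all assumptions made for illustration.

```python
import math


def tilt_weights(rewards, beta):
    """Self-normalized weights proportional to exp(reward / beta)."""
    m = max(rewards)  # subtract the max for numerical stability
    unnorm = [math.exp((r - m) / beta) for r in rewards]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def weighted_cross_entropy(probs, targets, weights):
    """Sum over samples of weight * negative log-probability of the target."""
    return sum(-w * math.log(p[t]) for p, t, w in zip(probs, targets, weights))


# Toy batch: three samples, two-token vocabulary (illustrative values only).
rewards = [1.0, 0.0, 0.0]
probs = [[0.7, 0.3], [0.4, 0.6], [0.5, 0.5]]
targets = [0, 1, 0]

weights = tilt_weights(rewards, beta=0.5)
loss = weighted_cross_entropy(probs, targets, weights)
print(weights, loss)
```

In this toy setting a small `beta` concentrates the weight on the high-reward sample, while a large `beta` approaches uniform weighting, which is one way an annealing schedule could trade off reward sharpness against stability.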

2604.18225 2026-05-20 cs.CV cs.AI

Is SAM3 ready for pathology segmentation?

Qiuyu Kong, Shakiba Sharifi, Yiming Wang, Marco Cristani, Zanxi Ruan

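AI summary This paper evaluates SAM3 on pathology image segmentation and finds that text prompts have limited effect, that visual prompt type and budget strongly affect performance, that few-shot learning helps but robustness is lacking, and that a significant gap remains between prompt-based use and task-trained adapter-based methods.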

Comments Accepted to ICIP 2026

English abstract

Is Segment Anything Model 3 (SAM3) capable of segmenting any pathology image? Digital pathology segmentation spans tissue-level and nuclei-level scales, where traditional methods often suffer from high annotation costs and poor generalization. SAM3 introduces Promptable Concept Segmentation, offering a potential automated interface via text prompts. In this work, we propose a systematic evaluation protocol to explore the capability space of SAM3 in a structured manner. Specifically, we evaluate SAM3 under different supervision settings, including zero-shot, few-shot, and supervised, with varying prompting strategies. Our extensive evaluation on pathology datasets, including NuInsSeg, PanNuke, and GlaS, reveals that: (1) text-only prompts poorly activate nuclear concepts; (2) performance is highly sensitive to visual prompt types and budgets; (3) few-shot learning offers gains, but SAM3 lacks robustness against visual prompt noise; and (4) a significant gap persists between prompt-based usage and the task-trained adapter-based reference. Our study delineates SAM3's boundaries in pathology image segmentation and provides practical guidance on the necessity of pathology domain adaptation.

2604.16593 2026-05-20 cs.CL

Revisiting a Pain in the Neck: A Semantic Reasoning Benchmark for Language Models

Yang Liu, Hongming Li, Melissa Xiaohui Qin, Qiankun Liu, Chao Huang

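AI summary This paper presents SemanticQA, a benchmark for evaluating language models on semantic phrase processing tasks. It consolidates existing multiword expression resources into a unified testbed covering general lexical phenomena and three fine-grained categories, evaluates language models of different architectures and scales on extraction, classification, interpretation, and task compositions, and reveals substantial performance variation on semantic reasoning tasks, offering insights for improving language model comprehension of non-trivial semantic phrases.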

Comments ACL 2026 (Oral), 24 pages, 22 figures, 14 tables

English abstract

We present SemanticQA, an evaluation suite designed to assess language models (LMs) on semantic phrase processing tasks. The benchmark consolidates existing multiword expression (MWE) resources and reorganizes them into a unified testbed. It covers both general lexical phenomena, such as lexical collocations, and three fine-grained categories: idiomatic expressions, noun compounds, and verbal constructions. Through SemanticQA, we assess LMs of diverse architectures and scales on extraction, classification, and interpretation tasks, as well as sequential task compositions. We reveal substantial performance variation, particularly on tasks requiring semantic reasoning, which highlights differences in the reasoning efficacy and semantic understanding of LMs and offers insights for building LMs with stronger comprehension of non-trivial semantic phrases. The evaluation harness and data of SemanticQA are available at https://github.com/jacklanda/SemanticQA.

2604.16503 2026-05-20 cs.CV cs.AI

Motif-Video 2B: Technical Report

Junghwan Lim, Wai Ting Cheung, Minsu Ha, Beomgyu Kim, Taewhan Kim, Haesol Lee, Dongpin Oh, Jeesoo Lee, Taehyun Kim, Minjae Kim, Sungmin Lee, Hyeyeon Cho, Dahye Choi, Jaeheui Her, Jaeyeon Huh, Hanbin Jung, Changjin Kang, Dongseok Kim, Jangwoong Kim, Youngrok Kim, Hyukjin Kweon, Hongjoo Lee, Jeongdoo Lee, Junhyeok Lee, Eunhwan Park, Yeongjae Park, Bokki Ryu, Dongjoo Weon

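AI summary This study asks whether a high-quality text-to-video model can be trained on a limited budget. It argues for improving performance through architectural design rather than scale alone, combining Shared Cross-Attention with a three-part backbone to achieve high-quality video generation with fewer parameters and less data.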

English abstract

Training strong video generation models usually requires massive datasets, large parameter counts, and substantial compute. In this work, we ask whether strong text-to-video quality is possible at a much smaller budget: fewer than 10M clips and less than 100,000 H200 GPU hours. Our core claim is that part of the answer lies in how model capacity is organized, not only in how much of it is used. In video generation, prompt alignment, temporal consistency, and fine-detail recovery can interfere with one another when they are handled through the same pathway. Motif-Video 2B addresses this by separating these roles architecturally, rather than relying on scale alone. The model combines two key ideas. First, Shared Cross-Attention strengthens text control when video token sequences become long. Second, a three-part backbone separates early fusion, joint representation learning, and detail refinement. To make this design effective under a limited compute budget, we pair it with an efficient training recipe based on dynamic token routing and early-phase feature alignment to a frozen pretrained video encoder. Our analysis shows that later blocks develop clearer cross-frame attention structure than standard single-stream baselines. On VBench, Motif-Video 2B reaches 83.76%, surpassing Wan2.1 14B while using 7× fewer parameters and substantially less training data. These results suggest that careful architectural specialization, combined with an efficiency-oriented training recipe, can narrow or exceed the quality gap typically associated with much larger video models.

2604.16491 2026-05-20 cs.CV cs.AI

A Lightweight Transformer for Pain Recognition from Brain Activity

Stefanos Gkikas, Christian Arzate Cruz, Yu Fang, Lu Cao, Muhammad Umar Khan, Thomas Kassiotis, Giorgos Giannakakis, Raul Fernandez Rojas, Randy Gomez

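AI summary This paper proposes a lightweight transformer that fuses multiple fNIRS representations through a unified tokenization mechanism, jointly modeling complementary signal views without modality-specific adaptations or added architectural complexity, and achieves competitive pain recognition performance while remaining computationally compact.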

English abstract

Pain is a multifaceted and widespread phenomenon with substantial clinical and societal burden, making reliable automated assessment a critical objective. This paper presents a lightweight transformer architecture that fuses multiple fNIRS representations through a unified tokenization mechanism, enabling joint modeling of complementary signal views without requiring modality-specific adaptations or increasing architectural complexity. The proposed token-mixing strategy preserves spatial, temporal, and time-frequency characteristics by projecting heterogeneous inputs onto a shared latent representation, using a structured segmentation scheme to control the granularity of local aggregation and global interaction. The model is evaluated on the AI4Pain dataset using stacked raw waveform and power spectral density representations of fNIRS inputs. Experimental results demonstrate competitive pain recognition performance while remaining computationally compact, making the approach suitable for real-time inference on both GPU and CPU hardware.

2604.15034 2026-05-20 cs.AI

Autogenesis: A Self-Evolving Agent Protocol

Wentao Zhang, Zhe Zhao, Haibin Wen, Yingcheng Wu, Cankun Guo, Ming Yin, Bo An, Mengdi Wang

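AI summary This paper proposes the Autogenesis Protocol (AGP), which decouples what evolves from how evolution occurs to address gaps in existing agent protocols around cross-entity lifecycle management, version tracking, and safe update interfaces. Building on AGP, the authors present the Autogenesis System (AGS), which dynamically instantiates, retrieves, and refines protocol-registered resources, and they validate agent resource management and closed-loop self-evolution on several challenging benchmarks involving long-horizon planning and tool use.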

English abstract

Recent advances in LLM-based agent systems have shown promise in tackling complex, long-horizon tasks. However, existing agent protocols (e.g., A2A and MCP) under-specify cross-entity lifecycle and context management, version tracking, and evolution-safe update interfaces, which encourages monolithic compositions and brittle glue code. We introduce Autogenesis Protocol (AGP), a self-evolution protocol that decouples what evolves from how evolution occurs. Its Resource Substrate Protocol Layer (RSPL) models prompts, agents, tools, environments, and memory as protocol-registered resources with explicit state, lifecycle, and versioned interfaces. Its Self-Evolution Protocol Layer (SEPL) specifies a closed-loop operator interface for proposing, assessing, and committing improvements with auditable lineage and rollback. Building on AGP, we present Autogenesis System (AGS), a self-evolving multi-agent system that dynamically instantiates, retrieves, and refines protocol-registered resources during execution. We evaluate AGS on multiple challenging benchmarks that require long-horizon planning and tool use across heterogeneous resources. The results demonstrate consistent improvements over strong baselines, supporting the effectiveness of agent resource management and closed-loop self-evolution. The code is available at https://github.com/DVampire/Autogenesis.
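
To make the two layers concrete, here is a minimal sketch of a versioned resource registry with a propose, assess, commit loop and rollback. It is not the AGP specification or the code in the linked repository. The class names `Resource` and `Registry`, the method names, and the scoring function are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Resource:
    """A registered resource (e.g. a prompt or tool) with a version history."""
    name: str
    kind: str
    versions: List[str] = field(default_factory=list)

    @property
    def current(self) -> str:
        return self.versions[-1]


class Registry:
    """Hypothetical registry: stores resources and guards their updates."""

    def __init__(self) -> None:
        self._resources: Dict[str, Resource] = {}

    def register(self, name: str, kind: str, content: str) -> Resource:
        resource = Resource(name=name, kind=kind, versions=[content])
        self._resources[name] = resource
        return resource

    def get(self, name: str) -> Resource:
        return self._resources[name]

    def evolve(self, name: str, candidate: str,
               score: Callable[[str], float]) -> bool:
        """Assess a proposed candidate and commit it only if it scores higher."""
        resource = self._resources[name]
        if score(candidate) > score(resource.current):
            resource.versions.append(candidate)  # lineage is kept, not overwritten
            return True
        return False

    def rollback(self, name: str) -> str:
        """Drop the latest version, never removing the original one."""
        resource = self._resources[name]
        if len(resource.versions) > 1:
            resource.versions.pop()
        return resource.current


registry = Registry()
registry.register("solver_prompt", "prompt", "Answer.")
toy_score = len  # stand-in evaluator: longer prompt scores higher
accepted = registry.evolve("solver_prompt", "Answer step by step.", toy_score)
rejected = registry.evolve("solver_prompt", "Hi.", toy_score)
print(accepted, rejected, registry.get("solver_prompt").versions)
```

Keeping every committed version in a list is what makes lineage auditable and rollback cheap in this sketch; a real system would also need persistence and access control, which are omitted here.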

2604.13392 2026-05-20 cs.AI

ReSS: Learning Reasoning Models for Tabular Data Prediction via Symbolic Scaffold

Chenlang Yi, Gang Li, Zizhan Xiong, Tue Minh Cao, Yanmin Gong, My T. Thai, Tianbao Yang

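AI summary This paper proposes ReSS, a framework that combines symbolic scaffolds with neural reasoning models to improve the accuracy and interpretability of tabular data prediction; experiments in healthcare and finance show that it outperforms traditional decision trees and standard fine-tuning.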

English abstract

Tabular data remains prevalent in high-stakes domains such as healthcare and finance, where predictive models are expected to provide both high accuracy and faithful, human-understandable reasoning. While symbolic models offer verifiable logic, they lack semantic expressiveness. Meanwhile, general-purpose LLMs often require specialized fine-tuning to master domain-specific tabular reasoning. To address the dual challenges of scalable data curation and reasoning consistency, we propose ReSS, a systematic framework that bridges symbolic and neural reasoning models. ReSS leverages a decision-tree model to extract instance-level decision paths as symbolic scaffolds. These scaffolds, alongside input features and labels, guide an LLM to generate grounded natural-language reasoning that strictly adheres to the underlying decision logic. The resulting high-quality dataset is used to fine-tune a pretrained LLM into a specialized tabular reasoning model, further enhanced by a scaffold-invariant data augmentation strategy to improve generalization and explainability. To rigorously assess faithfulness, we introduce quantitative metrics including hallucination rate, explanation necessity, and explanation sufficiency. Experimental results on medical and financial benchmarks demonstrate that ReSS-trained models improve over traditional decision trees and standard fine-tuning approaches by up to 10% while producing faithful and consistent reasoning.
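
The idea of an instance-level decision path can be shown on a toy tree. The sketch below walks a hand-built binary tree and records each test that a sample passes, producing the kind of ordered rule list that could serve as a scaffold for generated reasoning. The `Node` class, the feature names, the thresholds, and the labels are invented for illustration and are not taken from ReSS.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    """Internal node (feature, threshold, children) or leaf (label only)."""
    feature: Optional[str] = None
    threshold: float = 0.0
    left: Optional["Node"] = None    # taken when value <= threshold
    right: Optional["Node"] = None   # taken when value > threshold
    label: Optional[str] = None      # set only on leaves


def decision_path(root: Node, sample: Dict[str, float]) -> Tuple[List[str], str]:
    """Return the ordered tests satisfied by `sample` and the leaf label."""
    steps: List[str] = []
    node = root
    while node.label is None:
        value = sample[node.feature]
        if value <= node.threshold:
            steps.append(f"{node.feature} = {value} <= {node.threshold}")
            node = node.left
        else:
            steps.append(f"{node.feature} = {value} > {node.threshold}")
            node = node.right
    return steps, node.label


# Toy tree with made-up thresholds (not clinical guidance).
tree = Node(
    feature="glucose", threshold=125,
    left=Node(label="low risk"),
    right=Node(
        feature="bmi", threshold=30,
        left=Node(label="moderate risk"),
        right=Node(label="high risk"),
    ),
)

steps, label = decision_path(tree, {"glucose": 140, "bmi": 33})
print(steps, label)
```

Each recorded step is a verifiable condition on the input, so a generated explanation can be checked against the path rather than trusted on its own.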

2604.11796 2026-05-20 cs.CL cs.AI

C-ReD: A Comprehensive Chinese Benchmark for AI-Generated Text Detection Derived from Real-World Prompts

Chenxi Qing, Junxi Wu, Zheng Liu, Yixiang Qiu, Hongyao Yu, Bin Chen, Hao Wu, Shu-Tao Xia

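AI summary This paper proposes C-ReD, a benchmark for detecting AI-generated Chinese text that addresses key gaps in model diversity, domain coverage, and prompt realism, improving detection performance and generalization.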

Comments ACL 2026 Findings

English abstract

Recently, large language models (LLMs) have become capable of generating highly fluent text. While they offer significant convenience, they also introduce risks such as phishing and academic dishonesty. Numerous research efforts have been dedicated to developing algorithms for detecting AI-generated text and constructing relevant datasets. However, in the domain of Chinese corpora, challenges remain, including limited model diversity and data homogeneity. To address these issues, we propose C-ReD: a comprehensive Chinese Real-prompt AI-generated Detection benchmark. Experiments demonstrate that C-ReD not only enables reliable in-domain detection but also supports strong generalization to unseen LLMs and external Chinese datasets, addressing critical gaps in model diversity, domain coverage, and prompt realism that have limited prior Chinese detection benchmarks. We release our resources at https://github.com/HeraldofLight/C-ReD.

2604.11417 2026-05-20 cs.RO cs.AI

Efficient Emotion-Aware Iconic Gesture Prediction for Robot Co-Speech

Edwin C. Montiel-Vazquez, Christian Arzate Cruz, Stefanos Gkikas, Thomas Kassiotis, Giorgos Giannakakis, Randy Gomez

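AI summary This paper proposes a lightweight transformer that predicts the placement and intensity of iconic gestures from text and emotion alone, with no audio input. On the BEAT2 dataset it outperforms GPT-4o in semantic gesture placement classification and intensity regression, and it is computationally compact enough for real-time deployment.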

English abstract

Co-speech gestures increase engagement and improve speech understanding. Most data-driven robot systems generate rhythmic beat-like motion, yet few integrate semantic emphasis. To address this, we propose a lightweight transformer that derives iconic gesture placement and intensity from text and emotion alone, requiring no audio input at inference time. The model outperforms GPT-4o in both semantic gesture placement classification and intensity regression on the BEAT2 dataset, while remaining computationally compact and suitable for real-time deployment on embodied agents.

2604.11089 2026-05-20 cs.CV

Structured State-Space Regularization for Generation-Friendly Image Tokenization

Jinsung Lee, Jaemin Oh, Namhun Kim, Dongwon Kim, Byung-Jun Yoon, Suha Kwak

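AI summary This paper proposes structured state-space regularization, which induces spectral structure in the latent space to improve the generative performance of image tokenizers while preserving reconstruction fidelity.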

Comments Related blog posts in https://jinsingsangsung.github.io/collections/blog/ : Towards 2-Dimensional State-Space Models series

English abstract

Image tokenizers play a central role in modern generative models, where the structure of the latent space critically determines the downstream generation performance. A key but underexplored property of effective latent representations is spectral organization, the ability to encode information across frequency components. In this work, we introduce structured state-space regularization, a principled approach to inducing spectral structure in latent spaces. We derive a regularization objective by revisiting state-space models (SSMs) as systems mimicking a basis function's behavior. This perspective reveals that hidden states of SSMs are induced to capture the frequency components, resulting in a novel regularizer that enforces the latent space to capture spectral structure of images. Experiments demonstrate that our regularizer improves the generative performance of image tokenizers while incurring only minimal loss in their reconstruction fidelity.

2604.09323 2026-05-20 cs.RO

Robust Adaptive Backstepping Impedance Control of Robots in Unknown Environments

Reza Nazmara, Alap Kshirsagar, Jan Peters, A. Pedro Aguiar

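AI summary This paper proposes a Robust Adaptive Backstepping Impedance Control (RABIC) strategy for robots operating in contact-rich, uncertain environments. The strategy considers the complete coupled dynamics of the system and explicitly accounts for key sources of uncertainty, such as external disturbances and unmodeled dynamics, without requiring the robot's dynamic parameters. The inner loop is designed with backstepping to track a reference impedance model, a Taylor series-based estimator estimates the system dynamics, and an adaptive estimator determines the upper bound of external forces. Stability analysis shows semi-global practical finite-time stability of the overall system. A simulated mobile manipulator scenario and experiments on a real Franka Emika Panda robot indicate safer performance than PD control while ensuring trajectory tracking and force monitoring.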

Comments 8

Journal ref Mechatronics, Vol. 118, 103552 (2026)

English abstract

This paper presents a Robust Adaptive Backstepping Impedance Control (RABIC) strategy for robots operating in contact-rich and uncertain environments. The proposed control strategy considers the complete coupled dynamics of the system and explicitly accounts for key sources of uncertainty, including external disturbances and unmodeled dynamics, while not requiring the robot's dynamic parameters in implementation. We propose a backstepping-based adaptive impedance control scheme for the inner loop to track the reference impedance model. To handle uncertainties, we employ a Taylor series-based estimator for system dynamics and an adaptive estimator for determining the upper bound of external forces. Stability analysis demonstrates the semi-global practical finite-time stability of the overall system. To demonstrate the effectiveness of the proposed method, a simulated mobile manipulator scenario and experimental evaluations on a real Franka Emika Panda robot were conducted. The proposed approach exhibits safer performance compared to PD control while ensuring trajectory tracking and force monitoring. Overall, the RABIC framework provides a solid basis for future research on adaptive and learning-based impedance control for coupled mobile and fixed serially linked manipulators.
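
The abstract refers to a reference impedance model without stating it. A common textbook choice is the second-order form M·ë + D·ė + K·e = f_ext, where e is the deviation from the desired position. The sketch below simulates only that assumed target behavior under a constant external force; it is not the RABIC controller, and the parameter values are illustrative.

```python
def simulate_impedance(mass, damping, stiffness, force, dt=1e-3, steps=5000):
    """Integrate M*e'' + D*e' + K*e = f_ext with semi-implicit Euler.

    Returns the deviation e after `steps` time steps, starting from rest.
    """
    e, e_dot = 0.0, 0.0
    for _ in range(steps):
        e_ddot = (force - damping * e_dot - stiffness * e) / mass
        e_dot += e_ddot * dt   # update velocity first
        e += e_dot * dt        # then position, using the new velocity
    return e


# Critically damped example: D = 2 * sqrt(K * M) = 20.
final = simulate_impedance(mass=1.0, damping=20.0, stiffness=100.0, force=5.0)
static = 5.0 / 100.0  # steady-state deviation f_ext / K
print(f"deviation after 5 s: {final:.4f} m, static value: {static:.4f} m")
```

Under a constant force the deviation settles at f_ext / K, which is the compliance that an inner tracking loop would be asked to reproduce on the real robot.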

2604.08503 2026-05-20 cs.CV

Phantom: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics

Ying Shen, Jerry Xiong, Tianjiao Yu, Ismini Lourentzou

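AI summary This paper proposes Phantom, a model that jointly models visual content and latent physical dynamics so that video generation is physically consistent, producing videos that are both visually realistic and physically plausible.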

Comments 15 pages, 6 figures, CVPR 2026

English abstract

Recent advances in generative video modeling, driven by large-scale datasets and powerful architectures, have yielded remarkable visual realism. However, emerging evidence suggests that simply scaling data and model size does not endow these systems with an understanding of the underlying physical laws that govern real-world dynamics. Existing approaches often fail to capture or enforce such physical consistency, resulting in unrealistic motion and dynamics. In this work, we investigate whether integrating the inference of latent physical properties directly into the video generation process can equip models with the ability to produce physically plausible videos. To this end, we propose Phantom, a Physics-Infused Video Generation model that jointly models the visual content and latent physical dynamics. Conditioned on observed video frames and inferred physical states, Phantom jointly predicts latent physical dynamics and generates future video frames. Phantom leverages a physics-aware video representation that serves as an abstract yet informative embedding of the underlying physics, facilitating the joint prediction of physical dynamics alongside video content without requiring an explicit specification of a complex set of physical dynamics and properties. By integrating the inference of the physics-aware video representation directly into the video generation process, Phantom produces video sequences that are both visually realistic and physically consistent. Quantitative and qualitative results on both standard video generation and physics-aware benchmarks demonstrate that Phantom not only outperforms existing methods in terms of adherence to physical dynamics but also delivers competitive perceptual fidelity.

2604.07993 2026-05-20 cs.RO

HEX: Humanoid-Aligned Experts for Cross-Embodiment Whole-Body Manipulation

HEX: 人形对齐的专家用于跨躯体全身体操作

Shuanghao Bai, Meng Li, Xinyuan Lv, Jiawei Wang, Xinhua Wang, Fei Liao, Chengkai Hou, Langzhe Gu, Wanqi Zhou, Kun Wu, Ziluo Ding, Zhiyuan Xu, Lei Sun, Shanghang Zhang, Zhengping Che, Jian Tang, Badong Chen

AI总结 HEX通过引入人形对齐的通用状态表示和混合专家统一本体预测器,实现了对全尺寸双足人形机器人全身体操作的协调控制,展示了在任务成功率和泛化能力上的最新成果。

Comments Project page: https://hex-humanoid.github.io/

详情
AI中文摘要

人类通过协调的全身控制实现复杂操作,而大多数视觉-语言-动作(VLA)模型将机器人身体部分独立处理,使得高自由度的人形控制具有挑战性和不稳定性。我们提出了HEX,一种面向全尺寸双足人形机器人的协调操作状态中心框架。HEX引入了人形对齐的通用状态表示,以实现跨异构躯体的可扩展学习,并结合混合专家统一本体预测器,从大规模多躯体轨迹数据中建模全身协调和时间运动动态。为了高效捕捉时间视觉上下文,HEX使用轻量级历史标记来总结过去的观察,避免在推理过程中重复编码历史图像。它进一步采用残差门控融合机制和流匹配动作头,以适应性地整合视觉-语言提示与本体动态以生成动作。在现实世界的人形操作任务中,HEX在任务成功率和泛化能力上实现了最先进的性能,特别是在快速反应和长时间范围场景中。

英文摘要

Humans achieve complex manipulation through coordinated whole-body control, whereas most Vision-Language-Action (VLA) models treat robot body parts largely independently, making high-DoF humanoid control challenging and often unstable. We present HEX, a state-centric framework for coordinated manipulation on full-sized bipedal humanoid robots. HEX introduces a humanoid-aligned universal state representation for scalable learning across heterogeneous embodiments, and incorporates a Mixture-of-Experts Unified Proprioceptive Predictor to model whole-body coordination and temporal motion dynamics from large-scale multi-embodiment trajectory data. To efficiently capture temporal visual context, HEX uses lightweight history tokens to summarize past observations, avoiding repeated encoding of historical images during inference. It further employs a residual-gated fusion mechanism with a flow-matching action head to adaptively integrate visual-language cues with proprioceptive dynamics for action generation. Experiments on real-world humanoid manipulation tasks show that HEX achieves state-of-the-art performance in task success rate and generalization, particularly in fast-reaction and long-horizon scenarios.

2604.07393 2026-05-20 cs.LG cs.AI

DSPR: Dual-Stream Physics-Residual Networks for Trustworthy Industrial Time Series Forecasting

DSPR:双流物理残差网络用于可信的工业时间序列预测

Yeran Zhang, Pengwei Yang, Guoqing Wang, Tianyu Li

AI总结 本文提出DSPR框架,通过分离稳定的时序模式与受工况影响的残差动态,提升工业时间序列预测的准确性与物理合理性,实验表明其在不同工况下均能保持高预测精度和鲁棒性。

Comments 12 pages, 7 figures, accepted by KDD 2026

详情
AI中文摘要

准确预测工业时间序列需要在非平稳运行条件下平衡预测精度与物理合理性。现有数据驱动模型在统计性能上往往表现优异,但难以尊重现实系统中固有的、随工况变化的交互结构和传输延迟。为解决这一挑战,我们提出了DSPR(双流物理残差网络)预测框架,该框架明确分离稳定的时间模式与受工况影响的残差动态。第一流建模单个变量的统计时间演化。第二流通过两个关键机制关注残差动态:自适应窗口模块估计流依赖的传输延迟,以及物理引导的动态图整合物理先验,学习时间变化的交互结构并抑制虚假相关性。在涵盖不同工况的四个工业基准上的实验表明,DSPR在工况切换下持续提升预测精度和鲁棒性,同时保持强物理合理性。它实现了最先进的预测性能,平均守恒精度超过99%,总变差比最高达97.2%。除了预测外,学习的交互结构和自适应滞后提供了与已知领域机制一致的可解释见解,如流依赖的传输延迟和风到功率的缩放行为。这些结果表明,通过物理一致的归纳偏差的架构解耦,为可信的工业时间序列预测提供了一条有效路径。此外,DSPR在长期工业部署中展示出的鲁棒性能弥合了先进预测模型与可信自主控制系统之间的差距。

英文摘要

Accurate forecasting of industrial time series requires balancing predictive accuracy with physical plausibility under non-stationary operating conditions. Existing data-driven models often achieve strong statistical performance but struggle to respect regime-dependent interaction structures and transport delays inherent in real-world systems. To address this challenge, we propose DSPR (Dual-Stream Physics-Residual Networks), a forecasting framework that explicitly decouples stable temporal patterns from regime-dependent residual dynamics. The first stream models the statistical temporal evolution of individual variables. The second stream focuses on residual dynamics through two key mechanisms: an Adaptive Window module that estimates flow-dependent transport delays, and a Physics-Guided Dynamic Graph that incorporates physical priors to learn time-varying interaction structures while suppressing spurious correlations. Experiments on four industrial benchmarks spanning heterogeneous regimes demonstrate that DSPR consistently improves forecasting accuracy and robustness under regime shifts while maintaining strong physical plausibility. It achieves state-of-the-art predictive performance, with Mean Conservation Accuracy exceeding 99% and Total Variation Ratio reaching up to 97.2%. Beyond forecasting, the learned interaction structures and adaptive lags provide interpretable insights that are consistent with known domain mechanisms, such as flow-dependent transport delays and wind-to-power scaling behaviors. These results suggest that architectural decoupling with physics-consistent inductive biases offers an effective path toward trustworthy industrial time-series forecasting. Furthermore, DSPR's demonstrated robust performance in long-term industrial deployment bridges the gap between advanced forecasting models and trustworthy autonomous control systems.

2604.07035 2026-05-20 cs.CL

Unified Deployment-Aware Evaluation of Open Reasoning Language Models

统一的部署感知开放推理语言模型评估

Md Motaleb Hossen Manik, Ge Wang

AI总结 本文提出了一种统一的开放推理语言模型评估方法,通过四个基准测试(ARC-Challenge、GSM8K、MATH 1-3级和TruthfulQA MC1)对七种配置进行评估,结合零样本、链式思维(CoT)和少量样本CoT提示策略,分析模型在准确率、延迟、内存使用等多目标下的表现,强调部署感知的多目标优化问题。

详情
AI中文摘要

开放推理语言模型通常在混合样本量、部分标准化提示和以准确性为中心的总结下进行比较,这使得实际模型选择难以解释。我们针对ARC-Challenge、GSM8K、MATH 1-3级和TruthfulQA MC1四个基准测试,对七种开放推理语言模型配置进行了统一评估。我们对每种模型-数据集-策略条件下的238个示例子集测试了零样本、链式思维(CoT)和少量样本CoT提示策略,得到一个完整的7×4×3设计,包含84个条件和19,992个评估示例。除了准确性外,我们还报告了Wilson置信区间、延迟、峰值视频随机访问内存(VRAM)、加权聚合性能、帕累托高效运行点、提示敏感度指标和兼容性诊断。Gemma-4-26B-A4B在零样本提示下实现了最高的加权分数0.794。Gemma-4-E4B在各种提示设置中仍接近顶部,同时使用显著更低的延迟和内存,使其成为一种强大的实际运行点。Bootstrap和配对排列分析显示,领先配置足够接近,部署权衡仍然重要。我们还发现提示策略的变化会改变模型排名,而不是统一移动所有模型。基准特定的互补性创造了路由空间,一个 oracle 任务感知选择器达到了加权分数0.825。兼容性诊断显示,一些明显失败,尤其是Phi-4-Reasoning在GSM8K上的表现,反映了在共享评估流程下的鲁棒性和接口适应性问题。这些结果支持一个核心主张:开放模型评估应作为部署感知的多目标运行点问题,而不是单一分数排行榜练习。

英文摘要

Open reasoning language models are often compared under mixed sample sizes, partially standardized prompts, and accuracy-centered summaries, which makes practical model selection difficult to interpret. We present a unified evaluation of seven open reasoning language model configurations across four benchmarks: ARC-Challenge, GSM8K, MATH levels 1 to 3, and TruthfulQA MC1. We test zero-shot, chain-of-thought (CoT), and few-shot CoT prompting on the same 238-example subset for every model--dataset--strategy condition, yielding a complete 7 x 4 x 3 design with 84 conditions and 19,992 evaluated examples. Beyond accuracy, we report Wilson confidence intervals, latency, peak video random access memory (VRAM), weighted aggregate performance, Pareto-efficient operating points, prompt-sensitivity metrics, and compatibility diagnostics. Gemma-4-26B-A4B with zero-shot prompting achieves the highest weighted score at 0.794. Gemma-4-E4B remains close to the top across prompting settings while using substantially lower latency and memory, making it a strong practical operating point. Bootstrap and paired-permutation analyses show that the leading configurations are close enough that deployment tradeoffs remain important. We also find that prompting strategy changes model rankings rather than shifting all models uniformly. Benchmark-specific complementarity creates routing headroom, with an oracle task-aware selector reaching a weighted score of 0.825. Compatibility diagnostics show that some apparent failures, especially Phi-4-Reasoning on GSM8K, reflect robustness and interface-adherence problems under the shared evaluation pipeline. These results support a central claim: open-model evaluation should be framed as a deployment-aware, multi-objective operating-point problem rather than as a single-score leaderboard exercise.
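As a rough illustration of the reporting described above, the sketch below checks the stated design size (7 x 4 x 3 conditions on a 238-example subset) and computes a Wilson score interval for one condition. The function name `wilson_interval`, the 95% level (z = 1.96), and the count of 180 correct answers are assumptions made for this example; none of them comes from the paper.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return center - half, center + half


# Design size from the abstract: 7 configurations x 4 benchmarks x 3 strategies.
conditions = 7 * 4 * 3
examples = conditions * 238

# Hypothetical condition: 180 of 238 answers correct (made-up count).
low, high = wilson_interval(180, 238)
print(conditions, examples, round(low, 3), round(high, 3))
```

With only 238 examples per condition, an interval of this kind is several percentage points wide, which is in line with the abstract's remark that the leading configurations are close to one another.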

2604.03419 2026-05-20 cs.LG math.CO

Adaptive Threshold-Driven Continuous Greedy Method for Scalable Submodular Optimization

自适应阈值驱动的连续贪心方法用于可扩展的子模优化

Mohammadreza Rostami, Solmaz S. Kia

AI总结 该研究提出了一种自适应阈值驱动的连续贪心方法(ATCG),用于解决在Matroid约束下的子模最大化问题,通过动态调整活跃集扩展策略,提高了算法效率并减少了通信开销。

详情
AI中文摘要

在组合优化中,子模最大化在传感、数据摘要、主动学习和资源分配中有广泛应用。尽管顺序贪心(SG)算法由于不可逆选择只能达到1/2的近似比,连续贪心(CG)通过多线性松弛获得最优的(1-1/e)近似比,但其代价是逐渐密集的决策向量,迫使代理为几乎每一个基础集元素交换特征嵌入。我们提出ATCG(自适应阈值驱动连续贪心),通过每个分区的进度比率η_i来控制梯度评估,仅在当前候选未能捕获足够边际增益时扩展每个代理的活跃集,从而直接限制哪些特征嵌入会被传输。理论分析建立了具有曲率意识的近似保证,有效因子τ_eff= max{τ,1-c},在阈值保证和低曲率区域之间插值,其中ATCG恢复CG的性能。这表明,曲率所捕捉的问题结构决定了接近全CG性能所需的协调和通信量。在类平衡的原型选择问题实验中,ATCG在CIFAR-10动物数据集的子集上实现了与全CG方法相当的目标值,同时显著减少了通信开销。

英文摘要

Submodular maximization under matroid constraints is a fundamental problem in combinatorial optimization with applications in sensing, data summarization, active learning, and resource allocation. While the Sequential Greedy (SG) algorithm achieves only a $\frac{1}{2}$-approximation due to irrevocable selections, Continuous Greedy (CG) attains the optimal $\bigl(1-\frac{1}{e}\bigr)$-approximation via the multilinear relaxation, at the cost of a progressively dense decision vector that forces agents to exchange feature embeddings for nearly every ground-set element. We propose ATCG (Adaptive Thresholded Continuous Greedy), which gates gradient evaluations behind a per-partition progress ratio $\eta_i$, expanding each agent's active set only when current candidates fail to capture sufficient marginal gain, thereby directly bounding which feature embeddings are ever transmitted. Theoretical analysis establishes a curvature-aware approximation guarantee with effective factor $\tau_{\mathrm{eff}}=\max\{\tau,1-c\}$, interpolating between the threshold-based guarantee and the low-curvature regime where ATCG recovers the performance of CG. This shows that the problem structure, as captured by curvature, determines the amount of coordination and communication required to approach full-CG performance. Experiments on a class-balanced prototype selection problem over a subset of the CIFAR-10 animal dataset show that ATCG achieves objective values comparable to those of the full CG method while substantially reducing communication overhead through adaptive active-set expansion.
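To make the Sequential Greedy baseline mentioned above concrete, here is a minimal sketch on a partition matroid with one pick per agent. It shows only the SG baseline, not ATCG or Continuous Greedy. The coverage objective, the element names, and the toy instance are assumptions chosen so that the irrevocable first pick leaves greedy at half of the optimum.

```python
def coverage(selected, covers):
    """Monotone submodular objective: number of distinct items covered."""
    covered = set()
    for element in selected:
        covered |= covers[element]
    return len(covered)


def sequential_greedy(partitions, covers):
    """Each agent, in order, picks its element with the largest marginal gain.

    One element per partition (partition matroid); picks are never revisited.
    Ties are broken in favor of the element listed first.
    """
    selected = []
    for part in partitions:
        base = coverage(selected, covers)
        best = max(part, key=lambda e: coverage(selected + [e], covers) - base)
        selected.append(best)
    return selected


# Hypothetical toy instance: agent 1 owns {x, y}, agent 2 owns {z}.
covers = {"x": {1}, "y": {2}, "z": {1}}
partitions = [["x", "y"], ["z"]]

picked = sequential_greedy(partitions, covers)
greedy_value = coverage(picked, covers)
optimal_value = coverage(["y", "z"], covers)
print(picked, greedy_value, optimal_value)
```

Agent 1 sees a tie and commits to `x`, after which `z` adds nothing, so greedy covers one item while the pair `y`, `z` covers two. Continuous relaxations avoid this kind of early commitment, which is the motivation the abstract gives for CG-style methods.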

2604.02784 2026-05-20 cs.CV cs.CL

EnsemHalDet: Robust VLM Hallucination Detection via Ensemble of Internal State Detectors

EnsemHalDet: 通过内部状态检测器的集成实现鲁棒的视觉语言模型幻觉检测

Ryuhei Miyazato, Shunsuke Kitada, Kei Harada

AI总结 本文提出EnsemHalDet,一种通过集成多个内部表示的视觉语言模型幻觉检测框架,以提高多模态幻觉检测的鲁棒性。

详情
AI中文摘要

视觉语言模型(VLMs)在多模态任务中表现出色,但它们仍然容易受到事实错误或与输入图像无关的幻觉影响。最近的研究表明,利用内部表示进行幻觉检测比仅依赖模型输出的方法更高效和准确。然而,现有的基于内部表示的方法通常依赖于单一的表示或检测器,限制了它们捕捉多样化幻觉信号的能力。在本文中,我们提出了EnsemHalDet,一种基于集成的幻觉检测框架,利用VLMs的多种内部表示,包括注意力输出和隐藏状态。EnsemHalDet为每个表示训练独立的检测器,并通过集成学习进行组合。在多个VQA数据集和VLMs上的实验结果表明,EnsemHalDet在AUC方面始终优于先前的方法和单检测器模型。这些结果表明,集成多样化的内部信号显著提高了多模态幻觉检测的鲁棒性。

英文摘要

Vision-Language Models (VLMs) excel at multimodal tasks, but they remain vulnerable to hallucinations that are factually incorrect or ungrounded in the input image. Recent work suggests that hallucination detection using internal representations is more efficient and accurate than approaches that rely solely on model outputs. However, existing internal-representation-based methods typically rely on a single representation or detector, limiting their ability to capture diverse hallucination signals. In this paper, we propose EnsemHalDet, an ensemble-based hallucination detection framework that leverages multiple internal representations of VLMs, including attention outputs and hidden states. EnsemHalDet trains independent detectors for each representation and combines them through ensemble learning. Experimental results across multiple VQA datasets and VLMs show that EnsemHalDet consistently outperforms prior methods and single-detector models in terms of AUC. These results demonstrate that ensembling diverse internal signals significantly improves robustness in multimodal hallucination detection.
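The idea of combining per-representation detectors can be sketched as follows. The abstract does not say which ensemble rule is used, so plain score averaging is an assumption here, as are the function names and the toy scores (all made up for illustration). AUC is computed by direct pairwise comparison to keep the example dependency-free.

```python
def auc(scores, labels):
    """ROC AUC by pairwise comparison of positives and negatives (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ensemble_mean(score_lists):
    """Average the per-example scores of several detectors."""
    return [sum(column) / len(column) for column in zip(*score_lists)]


# Made-up scores: 1 = hallucinated answer, 0 = grounded answer.
labels = [1, 1, 0, 0]
attention_scores = [0.9, 0.4, 0.5, 0.1]   # hypothetical detector on attention outputs
hidden_scores = [0.45, 0.8, 0.2, 0.6]     # hypothetical detector on hidden states

combined = ensemble_mean([attention_scores, hidden_scores])
print(auc(attention_scores, labels), auc(hidden_scores, labels), auc(combined, labels))
```

In this toy case each detector misranks a different example, so averaging repairs both mistakes. Real detectors will not always be this complementary.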

2603.29501 2026-05-20 cs.LG cs.AI

Target-Aligned Reinforcement Learning

目标对齐的强化学习

Leonard S. Pleiss, James Harrison, Maximilian Schiffer

AI总结 本文提出了一种目标对齐的强化学习方法,通过强调目标网络和在线网络估计高度一致的转移(transition),改进了传统深度强化学习算法的稳定性与收敛速度,实验证明在多个基准环境中取得了显著提升。

详情
AI中文摘要

许多基于价值的深度强化学习算法依赖于目标网络(在线网络的滞后副本)来稳定训练。虽然有效,但这种机制引入了一个基本的稳定性与时效性权衡:较慢的目标更新可以提高稳定性,但会降低学习信号的时效性,从而阻碍收敛速度。我们提出目标对齐的强化学习(TARL),这是一种可直接嵌入现有算法的简单改进,强调目标网络和在线网络估计高度一致的转移(transition)。通过将更新集中在良好对齐的目标上,TARL减轻了陈旧目标估计的负面影响,同时保留了目标网络的稳定作用。我们在离散和连续控制算法中,在各种基准环境中展示了持续的改进,无需任何超参数调整,包括在Atari-10上实现了38.18%的峰值得分提升,同时仅导致不到4%的墙钟时间增加。

英文摘要

Many value-based deep reinforcement learning algorithms rely on target networks - lagged copies of the online network - to stabilize training. While effective, this mechanism introduces a fundamental stability-recency tradeoff: slower target updates improve stability but reduce the recency of learning signals, hindering convergence speed. We propose Target-Aligned Reinforcement Learning (TARL), a simple drop-in refinement for existing algorithms that emphasizes transitions for which the target and online network estimates are highly aligned. By focusing updates on well-aligned targets, TARL mitigates the adverse effects of stale target estimates while retaining the stabilizing benefits of target networks. We empirically demonstrate consistent improvements within discrete and continuous control algorithms across various benchmark environments without any hyperparameter tuning, including a 38.18% peak score gain on Atari-10, while incurring less than a 4% increase in wall-clock time.
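The stability-recency tradeoff described above can be illustrated with a soft (Polyak) target update, one common way of keeping a lagged copy. This sketch shows only the tradeoff and does not implement TARL's alignment-based emphasis, which the abstract does not specify in detail. The scalar "network", the update rate `tau`, and the tolerance are assumptions for the example.

```python
def steps_to_track(tau, tolerance=0.05):
    """Count soft updates until the target is within `tolerance` of the online value.

    The online estimate is held at 1.0 and the target starts at 0.0, so the
    remaining gap after k updates is (1 - tau) ** k.
    """
    if not 0.0 < tau <= 1.0:
        raise ValueError("tau must be in (0, 1]")
    online, target, steps = 1.0, 0.0, 0
    while abs(online - target) > tolerance:
        target = (1.0 - tau) * target + tau * online  # soft target update
        steps += 1
    return steps


fast = steps_to_track(tau=0.5)    # recent but less stable target
slow = steps_to_track(tau=0.01)   # stable but stale target
print(fast, slow)
```

A small `tau` keeps the target steady across many updates but leaves it far behind the online estimate for a long time, which is the staleness the abstract refers to.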

2603.29092 2026-05-20 cs.CV

TrajectoryMover: Generative Movement of Object Trajectories in Videos

TrajectoryMover: 视频中物体轨迹的生成性运动

Kiran Chhatre, Hyeonho Jeong, Yulia Gryaditskaya, Christopher E. Peters, Chun-Hao Paul Huang, Paul Guerrero

AI总结 本文提出TrajectoryMover,一种生成视频中物体轨迹运动的方法,通过生成大规模合成配对视频数据和细调的视频生成器,实现了物体轨迹的生成性移动。

Comments 15 pages, 9 figures. Project page: https://chhatrekiran.github.io/trajectorymover

详情
AI中文摘要

生成性视频编辑已经使一些直观的编辑操作成为可能,这些操作以前在短视频片段中难以实现,特别是对于非专业编辑者而言。现有方法专注于在视频中为对象的3D或2D运动轨迹指定路径,或改变对象或场景的外观,同时保持视频的合理性和身份。然而,目前仍缺少一种方法,可以在视频中移动对象的3D运动轨迹,即在保持其相对3D运动的情况下移动对象。主要挑战在于获取这种场景下的配对视频数据。先前的方法通常依赖于巧妙的数据生成方法,从不成对的视频中构造出合理的配对数据,但这种方法在无法从另一视频轻易构造出配对视频时会失效。相反,我们引入了TrajectoryAtlas,一种新的大规模合成配对视频数据生成管道,以及一个通过此数据细调的视频生成器TrajectoryMover。我们证明这种方法成功实现了物体轨迹的生成性移动。项目页面:https://chhatrekiran.github.io/trajectorymover

英文摘要

Generative video editing has enabled several intuitive editing operations for short video clips that would previously have been difficult to achieve, especially for non-expert editors. Existing methods focus on prescribing an object's 3D or 2D motion trajectory in a video, or on altering the appearance of an object or a scene, while preserving both the video's plausibility and identity. Yet a method to move an object's 3D motion trajectory in a video, i.e., moving an object while preserving its relative 3D motion, is currently still missing. The main challenge lies in obtaining paired video data for this scenario. Previous methods typically rely on clever data generation approaches to construct plausible paired data from unpaired videos, but this approach fails if one of the videos in a pair cannot easily be constructed from the other. Instead, we introduce TrajectoryAtlas, a new data generation pipeline for large-scale synthetic paired video data, and TrajectoryMover, a video generator fine-tuned with this data. We show that this successfully enables generative movement of object trajectories. Project page: https://chhatrekiran.github.io/trajectorymover

2603.25620 2026-05-20 cs.CL

PICon: A Multi-Turn Interrogation Framework for Evaluating Persona Agent Consistency

PICon: 一种用于评估人设代理一致性的多轮询问框架

Minseo Kim, Sujeong Im, Junseong Choi, Junhee Lee, Chaeeun Shim, Hwajung Hong, Edward Choi

AI总结 本文提出PICon框架,通过逻辑链式的多轮提问评估人设代理的一致性,发现即使之前被认为高度一致的系统在三个维度上也未能达到人类基准水平,揭示了矛盾和逃避回应。

Comments 20 pages, 6 figures

详情
AI中文摘要

基于大型语言模型的人设代理正被广泛应用于替代人类参与者,但缺乏系统方法来验证其响应是否在交互中保持一致性和准确性。本文提出PICon框架,通过逻辑链式的多轮提问来评估人设代理的一致性,从内部一致性(无自相矛盾)、外部一致性(与现实世界事实一致)和重测一致性(重复测试下的稳定性)三个核心维度进行评估。在评估七组人设代理和63名真实人类参与者时,发现即使之前报告为高度一致的系统在三个维度上也未能达到人类基准水平,揭示了矛盾和逃避回应。本文为评估人设代理提供了概念基础和实用方法,提供了源代码和交互演示:https://kaist-edlab.github.io/picon/

英文摘要

Large language model (LLM)-based persona agents are rapidly being adopted as scalable proxies for human participants across diverse domains. Yet there is no systematic method for verifying whether a persona agent's responses remain free of contradictions and factual inaccuracies throughout an interaction. A principle from interrogation methodology offers a lens: no matter how elaborate a fabricated identity, systematic interrogation will expose its contradictions. We apply this principle to propose PICon, an evaluation framework that probes persona agents through logically chained multi-turn questioning. PICon evaluates consistency along three core dimensions: internal consistency (freedom from self-contradiction), external consistency (alignment with real-world facts), and retest consistency (stability under repetition). Evaluating seven groups of persona agents alongside 63 real human participants, we find that even systems previously reported as highly consistent fail to meet the human baseline across all three dimensions, revealing contradictions and evasive responses under chained questioning. This work provides both a conceptual foundation and a practical methodology for evaluating persona agents before trusting them as substitutes for human participants. We provide the source code and an interactive demo at: https://kaist-edlab.github.io/picon/

2603.22453 2026-05-20 cs.CL cs.SI

XNote: Benchmarking Automated Community Notes Generation for Image-based Contextual Deception

XNote: 对基于图像的上下文欺骗的自动社区笔记生成进行基准测试

Jin Ma, Jingwen Yan, Mohammed Aldeen, Ethan Anderson, Taran Kavuru, Jinkyung Katie Park, Feng Luo, Long Cheng

AI总结 本文研究了基于图像的上下文欺骗的自动社区笔记生成任务,提出了一个真实世界数据集XNote,并对前沿大视觉语言模型和商业工具进行了基准测试,以评估其在欺骗检测和笔记生成任务中的性能。

详情
AI中文摘要

社区笔记已成为一种有效的众包机制,用于对抗社交媒体上的在线欺骗。然而,其依赖于人类贡献者限制了及时性和可扩展性。在本工作中,我们研究了基于图像的上下文欺骗的自动社区笔记生成任务,其中一张真实图像与误导性上下文(例如时间、实体和事件)配对。与之前主要关注欺骗检测(即以二元方式判断帖子是否真实)的工作不同,自动社区笔记生成需要生成简洁且有根据的笔记,帮助用户恢复缺失或更正的上下文。由于支持此任务的数据集稀缺,该问题仍未被充分探索。为了解决这一差距,我们整理了一个真实世界的数据集XNote,包含X平台上的帖子及其相关的社区笔记和外部上下文,以及主题和欺骗因素的注释。我们进一步在XNote上基准测试了一系列前沿的大视觉语言模型(LVLMs),评估它们在欺骗检测和笔记生成任务中的性能。我们还对比了端到端方法SNIFFER和商业工具GPT-5。我们的结果突显了自动社区笔记生成的挑战,强调了改进针对此任务的方法和指标的必要性。

英文摘要

Community Notes have emerged as an effective crowd-sourced mechanism for combating online deception on social media platforms. However, their reliance on human contributors limits both timeliness and scalability. In this work, we study the automated Community Notes generation task for image-based contextual deception, where an authentic image is paired with misleading context (e.g., time, entity, and event). Unlike prior work that primarily focuses on deception detection (i.e., judging whether a post is true or false in a binary manner), automated Community Notes generation requires producing concise and grounded notes that help users recover the missing or corrected context. This problem remains underexplored due to the scarcity of datasets that support this task. To address this gap, we curate a real-world dataset, XNote, comprising X posts with associated Community Notes and external contexts, along with annotations of topics and deceptive factors. We further benchmark a range of frontier large vision language models (LVLMs) on XNote, evaluating their performance on both deception detection and note generation tasks. We also compare against an end-to-end approach, SNIFFER, and a commercial tool, GPT-5. Our results highlight the challenges in automated Community Notes generation, underscoring the need for improved methods and metrics tailored for this task.

2603.17839 2026-05-20 cs.CL cs.AI cs.LG

How do LLMs Compute Verbal Confidence

LLMs如何计算言语自信

Dharshan Kumaran, Arthur Conmy, Federico Barbero, Simon Osindero, Viorica Patraucean, Petar Veličković

AI总结 研究探讨了大型语言模型如何内部生成言语自信评分,通过实验发现自信评分在回答生成后被缓存并用于后续输出,揭示了模型自我评估的机制。

详情
AI中文摘要

言语自信——提示LLMs以数字或类别形式陈述其信心——被广泛用于从黑箱模型中提取不确定性估计。然而,LLMs内部如何生成此类评分仍不清楚。我们解答了两个问题:首先,信心是在被请求时即时计算,还是在生成答案时自动计算并缓存以供后续检索;其次,言语自信代表什么——token对数概率,还是更丰富的答案质量评估?我们聚焦于Gemma 3 27B(在TriviaQA、BigMath和MMLU上的表现)、Qwen 2.5 7B以及推理模型Magistral Small 24B,提供了缓存检索的收敛证据。激活引导、修补、噪声和交换实验揭示,信心表示在回答相邻位置先出现,再出现在言语化位置。注意力阻断指出了信息流:信心从回答token中收集,缓存于第一个回答后的位置,然后用于输出。关键发现是线性探测和方差划分揭示,这些缓存表示能够解释超出token对数概率的显著方差,表明是更丰富的答案质量评估,而非简单的流畅性读取。这些发现表明,言语自信反映了自动、复杂的自我评估——而非事后重建——对理解LLMs中的元认知和改进校准具有启示。

英文摘要

Verbal confidence -- prompting LLMs to state their confidence as a number or category -- is widely used to extract uncertainty estimates from black-box models. However, how LLMs internally generate such scores remains unknown. We address two questions: first, when confidence is computed -- just-in-time when requested, or automatically during answer generation and cached for later retrieval; and second, what verbal confidence represents -- token log-probabilities, or a richer evaluation of answer quality? Focusing on Gemma 3 27B (across TriviaQA, BigMath, and MMLU), Qwen 2.5 7B, and the reasoning model Magistral Small 24B, we provide convergent evidence for cached retrieval. Activation steering, patching, noising, and swap experiments reveal that confidence representations emerge at answer-adjacent positions before appearing at the verbalization site. Attention blocking pinpoints the information flow: confidence is gathered from answer tokens, cached at the first post-answer position, then retrieved for output. Critically, linear probing and variance partitioning reveal that these cached representations explain substantial variance in verbal confidence beyond token log-probabilities, suggesting a richer answer-quality evaluation rather than a simple fluency readout. These findings demonstrate that verbal confidence reflects automatic, sophisticated self-evaluation -- not post-hoc reconstruction -- with implications for understanding metacognition in LLMs and improving calibration.

2603.16284 2026-05-20 cs.CV cs.LG

Locate-then-Sparsify: Attribution Guided Sparse Strategy for Visual Hallucination Mitigation

定位后再稀疏化:基于归因的视觉幻觉缓解稀疏策略

Tiantian Dang, Chao Bi, Shufan Shen, Jinzhe Liu, Qingming Huang, Shuhui Wang

AI总结 本文提出了一种名为Locate-Then-Sparsify for Feature Steering (LTS-FS)的框架,通过定位和稀疏化策略,根据每层与幻觉的相关性调整特征引导强度,从而有效缓解视觉语言模型中的幻觉问题,同时保持良好的性能。

Comments Accepted by CVPR 2026

详情
AI中文摘要

尽管大型视觉-语言模型(LVLMs)在技术上取得了显著进展,但其生成幻觉的倾向削弱了可靠性并限制了更广泛的实际应用。在幻觉缓解方法中,特征引导作为一种有前景的方法,能够在不增加推理成本的情况下减少LVLMs中的错误输出。然而,当前的方法在所有层上应用统一的特征引导策略。这种启发式策略忽略了层间的差异,可能会干扰与幻觉无关的层,最终导致在通用任务上的性能下降。在本文中,我们提出了一种名为Locate-Then-Sparsify for Feature Steering (LTS-FS)的即插即用框架,该框架根据每层与幻觉的相关性来控制引导强度。我们首先构建了一个包含token级和句子级幻觉案例的数据集。基于此数据集,我们引入了一种基于因果干预的归因方法,以量化每层的幻觉相关性。利用各层的归因分数,我们提出了一种逐层策略,将这些分数转换为针对单个层的特征引导强度,从而在幻觉相关的层上实现更精确的调整。在多个LVLMs和基准测试中进行的广泛实验表明,LTS-FS有效缓解了幻觉问题,同时保持了强大的性能。代码可在https://github.com/huttersadan/LTS-FS上获得。

英文摘要

Despite the significant advancements in Large Vision-Language Models (LVLMs), their tendency to generate hallucinations undermines reliability and restricts broader practical deployment. Among hallucination mitigation methods, feature steering has emerged as a promising approach that reduces erroneous outputs in LVLMs without increasing inference costs. However, current methods apply uniform feature steering across all layers. This heuristic strategy ignores inter-layer differences, potentially disrupting layers unrelated to hallucinations and ultimately leading to performance degradation on general tasks. In this paper, we propose Locate-Then-Sparsify for Feature Steering (LTS-FS), a plug-and-play framework that controls the steering intensity according to the hallucination relevance of each layer. We first construct a dataset comprising token-level and sentence-level hallucination cases. Based on this dataset, we introduce an attribution method based on causal interventions to quantify the hallucination relevance of each layer. With the attribution scores across layers, we propose a layerwise strategy that converts these scores into feature steering intensities for individual layers, enabling more precise adjustments specifically on hallucination-relevant layers. Extensive experiments across multiple LVLMs and benchmarks demonstrate that LTS-FS effectively mitigates hallucination while preserving strong performance. Code is available at https://github.com/huttersadan/LTS-FS.

2603.15411 2026-05-20 cs.AI cs.LG

A Hybrid Modeling Framework for Crop Prediction Tasks via Dynamic Parameter Calibration and Multi-Task Learning

一种通过动态参数校准和多任务学习的作物预测混合建模框架

William Solow, Paola Pesantez-Cabrera, Markus Keller, Lav Khot, Sandhya Saisubramanian, Alan Fern

AI总结 本文提出了一种混合建模方法,通过动态参数校准和多任务学习,提高作物预测的准确性,特别是在数据有限的情况下,利用神经网络对生物物理模型进行参数化,并在不同作物品种间高效共享数据,从而提升预测精度和生物合理性。

详情
AI中文摘要

准确预测作物状态(例如物候阶段和耐寒性)对于及时进行灌溉、施肥和树冠管理等农场管理决策至关重要,以优化作物产量和质量。虽然传统生物物理模型可以用于季节性预测,但它们缺乏用于特定地点管理所需的精度。深度学习方法是一种有吸引力的替代方案,但可能会产生生物上不合理的预测,并需要大规模数据。我们提出了一种混合建模方法,使用神经网络对可微分的生物物理模型进行参数化,并利用多任务学习在数据有限的情况下在不同作物品种之间高效共享数据。通过预测生物物理模型的参数,我们的方法在提高预测精度的同时保持生物合理性。使用真实世界和合成数据集的实证评估表明,与部署的生物物理模型相比,我们的方法在物候预测方面提高了60%,在耐寒性预测方面提高了40%。

英文摘要

Accurate prediction of crop states (e.g., phenology stages and cold hardiness) is essential for timely farm management decisions such as irrigation, fertilization, and canopy management to optimize crop yield and quality. While traditional biophysical models can be used for season-long predictions, they lack the precision required for site-specific management. Deep learning methods are a compelling alternative, but can produce biologically unrealistic predictions and require large-scale data. We propose a hybrid modeling approach that uses a neural network to parameterize a differentiable biophysical model and leverages multi-task learning for efficient data sharing across crop cultivars in data-limited settings. By predicting the parameters of the biophysical model, our approach improves the prediction accuracy while preserving biological realism. Empirical evaluation using real-world and synthetic datasets demonstrates that our method improves prediction accuracy by 60% for phenology and 40% for cold hardiness compared to deployed biophysical models.

2603.13609 2026-05-20 cs.CV

A Grid-Based Framework for E-Scooter Demand Representation and Temporal Input Design for Deep Learning: Evidence from Austin, Texas

基于网格的电动滑板车需求表示与深度学习的时序输入设计框架:以德克萨斯州奥斯汀为例

Mohammad Sahnoon, Merkebe Getachew Demissie, Roberto Souza

AI总结 本文提出了一种基于网格的电动滑板车需求表示方法和深度学习的时序输入设计框架,通过系统性的数据处理流程和统计学方法,提高了空间学习的一致性并保留了需求模式,实验结果表明该方法在下一小时和下一24小时预测中将均方误差降低了37%和35%。

Comments 16 pages, 7 tables, 10 figures

详情
AI中文摘要

尽管在共享微出行需求预测方面深度学习取得了进展,但系统设计和时序输入结构的统计验证仍然缺乏。时序特征通常被启发式选择,尽管历史需求强烈影响模型性能和泛化能力。本文介绍了一种可重复的数据处理流程和一种基于统计学的方法,用于设计图像到图像需求预测的时序输入结构。利用德克萨斯州奥斯汀的大规模电动滑板车数据,我们通过将行程记录转换为每小时的起点和终点需求图像,构建了一个基于网格的时空数据集。该流程包括行程过滤、将人口普查街区映射到空间位置、网格构建、需求汇总以及创建一个全球活动掩码,以限制评估仅限于历史上活跃的区域。这种表示支持一致的空间学习,同时保留需求模式。我们随后引入了一种结合相关性和误差的程序来识别有信息的历史输入。通过使用基线UNET模型的消融研究,结合配对非参数检验和Holm校正,选择最优的时序深度。所得到的时序结构能够捕捉短期持续性以及日和周周期。与相邻小时和固定周期基线相比,所提出的设计在下一小时预测中将均方误差降低了高达37%,在下一24小时预测中降低了35%。这些结果突显了系统性数据集构建和统计学验证的时序输入设计在时空微出行需求预测中的价值。

英文摘要

Despite progress in deep learning for shared micromobility demand prediction, the systematic design and statistical validation of temporal input structures remain underexplored. Temporal features are often selected heuristically, even though historical demand strongly affects model performance and generalizability. This paper introduces a reproducible data-processing pipeline and a statistically grounded method for designing temporal input structures for image-to-image demand prediction. Using large-scale e-scooter data from Austin, Texas, we build a grid-based spatiotemporal dataset by converting trip records into hourly pickup and dropoff demand images. The pipeline includes trip filtering, mapping Census Tracts to spatial locations, grid construction, demand aggregation, and creation of a global activity mask that limits evaluation to historically active areas. This representation supports consistent spatial learning while preserving demand patterns. We then introduce a combined correlation- and error-based procedure to identify informative historical inputs. Optimal temporal depth is selected through an ablation study using a baseline UNET model with paired non-parametric tests and Holm correction. The resulting temporal structures capture short-term persistence as well as daily and weekly cycles. Compared with adjacent-hour and fixed-period baselines, the proposed design reduces mean squared error by up to 37 percent for next-hour prediction and 35 percent for next-24-hour prediction. These results highlight the value of principled dataset construction and statistically validated temporal input design for spatiotemporal micromobility demand prediction.
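The grid-based representation described above can be sketched as a simple binning step: trips are filtered to a study area, assigned to grid cells, and counted per hour, and a global activity mask marks cells that were ever active. The function names, the bounding box, the 2 x 2 grid, and the trip records below are all assumptions for illustration. In particular, this sketch starts from raw coordinates, whereas the paper maps Census Tracts to spatial locations first.

```python
from collections import defaultdict


def build_demand_images(trips, bounds, grid_shape):
    """Aggregate (hour, lat, lon) pickups into one 2D count grid per hour."""
    lat_min, lat_max, lon_min, lon_max = bounds
    rows, cols = grid_shape
    images = defaultdict(lambda: [[0] * cols for _ in range(rows)])
    for hour, lat, lon in trips:
        if not (lat_min <= lat < lat_max and lon_min <= lon < lon_max):
            continue  # trip filtering: drop points outside the study area
        r = int((lat - lat_min) / (lat_max - lat_min) * rows)
        c = int((lon - lon_min) / (lon_max - lon_min) * cols)
        images[hour][r][c] += 1
    return dict(images)


def activity_mask(images, grid_shape):
    """Mark cells that show demand in at least one hourly image."""
    rows, cols = grid_shape
    return [[any(img[r][c] > 0 for img in images.values()) for c in range(cols)]
            for r in range(rows)]


# Hypothetical study area and trips (hour of day, latitude, longitude).
bounds = (30.0, 30.4, -98.0, -97.6)
shape = (2, 2)
trips = [(8, 30.1, -97.9), (8, 30.15, -97.95), (9, 30.3, -97.7), (9, 31.0, -97.7)]

images = build_demand_images(trips, bounds, shape)
mask = activity_mask(images, shape)
print(images[8], images[9], mask)
```

Restricting the error metric to cells where the mask is true keeps permanently empty cells from diluting the reported error, which is the role the abstract assigns to the global activity mask.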

2603.12296 2026-05-20 cs.LG cs.AI eess.SP

Synthetic Data Generation for Brain-Computer Interfaces: Overview, Benchmarking, and Future Directions

脑机接口中的合成数据生成:概述、基准测试与未来方向

Ziwei Wang, Zhentao He, Xingyi He, Hongbin Wang, Tianwang Jia, Jingwei Luo, Siyang Li, Xiaoqing Chen, Dongrui Wu

AI总结 本文综述了用于脑机接口的合成脑数据生成方法,讨论了不同生成方法的分类、基准实验、评估指标和应用,以及未来研究方向,旨在提升数据效率和隐私保护的脑机接口系统。

Comments 33 pages, 8 figures

详情
AI中文摘要

深度学习在多个领域取得了变革性的性能,主要得益于大规模和高质量的训练数据。相比之下,脑机接口(BCIs)的发展受到有限、异质性和隐私敏感的神经记录的限制。生成合成且生理上合理的脑信号因此成为缓解数据稀缺、提高模型泛化能力和支持数据高效的BCIs的有希望策略。本文全面回顾了用于BCIs的合成脑数据生成方法,涵盖了方法学分类、基准实验、评估指标、关键应用和未来方向。我们系统地将现有生成方法分为四类:基于信号变换、基于特征、基于模型和基于翻译的生成,并讨论了它们的特征、优势和局限性。此外,我们对四种BCI范式中的代表性脑信号生成方法进行了基准测试,包括运动想象、癫痫发作检测、稳态视觉诱发电位和听觉注意力检测,以提供对其下游用途的客观比较。我们还总结了从多个角度对生成脑信号的评估原则,包括信号真实性、生理合理性、下游用途和隐私保护。最后,我们讨论了当前生成方法的潜力和挑战,并概述了未来研究方向,以实现准确、数据高效、可推广和隐私感知的BCI系统。基准代码库可在https://github.com/wzwvv/DG4BCI上找到。

英文摘要

Deep learning has achieved transformative performance across diverse domains, largely driven by large-scale and high-quality training data. In contrast, the development of brain-computer interfaces (BCIs) is fundamentally constrained by limited, heterogeneous, and privacy-sensitive neural recordings. Generating synthetic yet physiologically plausible brain signals has therefore emerged as a promising strategy to mitigate data scarcity, improve model generalization, and support data-efficient BCIs. This survey provides a comprehensive review of synthetic brain data generation for BCIs, covering methodological taxonomies, benchmark experiments, evaluation metrics, key applications, and future directions. We systematically categorize existing generation approaches into four types: signal-transformation-based, feature-based, model-based, and translation-based generation, and discuss their characteristics, advantages, and limitations. Furthermore, we benchmark representative brain signal generation approaches across four BCI paradigms, including motor imagery, epileptic seizure detection, steady-state visually evoked potentials, and auditory attention detection, to provide an objective comparison of their downstream utility. We also summarize evaluation principles for generated brain signals from multiple perspectives, including signal realism, physiological plausibility, downstream utility, and privacy preservation. Finally, we discuss the potential and challenges of current generation approaches and outline future research directions toward accurate, data-efficient, generalizable, and privacy-aware BCI systems. The benchmark codebase is available at https://github.com/wzwvv/DG4BCI.

2603.11024 2026-05-20 cs.CV cs.AI

Does AI See like Art Historians? Interpreting How Vision Language Models Recognize Artistic Style

AI 是否能像艺术史家一样看?解析视觉语言模型如何识别艺术风格

Marvin Limpijankit, Milad Alshomary, Yassin Oulad Daoud, Amith Ananthram, Tim Trombley, Emily L. Spratt, Anna Filonenko, Hannah Pivo, Elias Stengel-Eskin, Mohit Bansal, Noam M. Elcott, Kathleen McKeown

AI总结 本文研究了视觉语言模型(VLMs)在识别艺术风格方面的机制,通过跨学科合作,分析VLMs如何预测艺术风格,并评估其与艺术史家判断艺术风格的标准的一致性。

Comments 20 pages, 18 figures

详情
AI中文摘要

视觉语言模型(VLMs)在多种计算机视觉任务上已表现出越来越强的能力,例如视觉问答和目标检测。这包括在艺术领域中越来越强的能力,从分析艺术品到生成艺术品。在计算机科学家和艺术史家的跨学科合作中,我们表征了VLMs预测艺术风格的机制,并评估其与艺术史家用于推理艺术风格标准的契合程度。我们采用潜在空间分解方法来识别驱动艺术风格预测的概念,并通过定量评估、因果分析和艺术史家的评估进行评估。我们的发现表明,73%的提取概念被艺术史家认为具有连贯且语义明确的视觉特征,90%用于预测特定艺术品风格的概念被判定为相关。在无关概念成功预测风格的情况下,艺术史家发现了其成功的原因;例如,模型可能以更正式的方式理解概念,如明暗对比。

英文摘要

Vision language models (VLMs) have become increasingly proficient at a range of computer vision tasks, such as visual question answering and object detection. This includes increasingly strong capabilities in the domain of art, from analyzing artwork to generating art. In an interdisciplinary collaboration between computer scientists and art historians, we characterize the mechanisms underlying VLMs' ability to predict artistic style and assess the extent to which they align with the criteria art historians use to reason about artistic style. We employ a latent-space decomposition approach to identify concepts that drive art style prediction and conduct quantitative evaluations, causal analysis, and assessment by art historians. Our findings indicate that 73% of the extracted concepts are judged by art historians to exhibit a coherent and semantically meaningful visual feature, and 90% of the concepts used to predict the style of a given artwork were judged relevant. In cases where an irrelevant concept was used to successfully predict style, art historians identified possible reasons for its success; for example, the model might "understand" a concept in more formal terms, such as dark/light contrasts.