arXivDaily: arXiv daily academic digest, updated Monday to Friday
2606.03357 2026-06-03 cs.CL cs.AI

The Unsampled Truth: Psychometrics in SLMs Measure Prompt Artifacts, Not Psychological Constructs

Nils Schwager, Christoph Hau, Simon Münker, Achim Rettinger

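AI summary: A prompt variation framework separates semantic signal from prompt artifacts and shows that small language models in psychometric assessments mainly reflect prompt compliance rather than simulated psychological traits; the framework also serves as a diagnostic tool.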

Comments
10 pages, 5 figures, 3 tables

Abstract

When prompting SLMs for psychometric assessments, researchers assume the outputs reflect semantic reasoning. We evaluate this premise across 13 open-weights models (0.6B to 14B parameters) using a prompt variation framework that separates semantic signals from prompt artifacts. By systematically varying personas, instructions, items, and option symbols, we find that artifactual variance frequently overpowers the semantic signal. In these cases, models predominantly reflect prompt compliance rather than simulated psychological traits. While these findings limit SLM utility in psychometrics, our framework provides a diagnostic tool to identify destructive artifacts and isolate semantic understanding for future frontier-model research.
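
The idea of separating semantic signal from prompt artifacts can be illustrated with a simple variance split. The sketch below is only an illustration under stated assumptions, not the authors' framework: it assumes a balanced table of numeric responses indexed by prompt variant and item, and the function name `variance_shares` is hypothetical.

```python
from statistics import mean, pvariance


def variance_shares(scores):
    """Split response variance into an item share and a prompt-variant share.

    scores[variant][item] is a numeric response (assumed balanced design).
    Returns (item_share, variant_share) as fractions of total variance.
    """
    variants = list(scores)
    items = list(scores[variants[0]])
    all_values = [scores[v][i] for v in variants for i in items]
    total = pvariance(all_values)
    if total == 0:
        return 0.0, 0.0
    # Variance of per-item means: what changes when the question changes.
    item_means = [mean(scores[v][i] for v in variants) for i in items]
    # Variance of per-variant means: what changes when only the prompt changes.
    variant_means = [mean(scores[v][i] for i in items) for v in variants]
    return pvariance(item_means) / total, pvariance(variant_means) / total
```

If the variant share dominates, responses track the prompt wording rather than the item content, which is the failure mode the abstract describes.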

2606.03355 2026-06-03 cs.LG

APIC: Amortized Physics-Informed Calibration using Neural Processes

Aishwarya Venkataramanan, Sai Karthikeya Vemuri, Joachim Denzler

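AI summary: Proposes the APIC framework, which uses Neural Processes for population-level Bayesian inference and a two-branch latent architecture to separate instance-specific physical parameters from shared structural discrepancies, enabling fast calibration from sparse observations with uncertainty quantification.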

Comments
Accepted at UAI 2026

Abstract

Physics models are inherently imperfect due to misspecified or missing mechanisms, resulting in systematic discrepancies between model predictions and real-world observations. The Kennedy-O'Hagan (KOH) framework addresses this issue through explicit discrepancy modeling. However, its non-amortized, per-instance formulation limits scalability across families of related systems. We introduce Amortized Physics-Informed Calibration (APIC), a population-level extension of KOH that leverages Neural Processes to perform scalable Bayesian inference across realizations. Our framework employs a two-branch latent architecture to disentangle instance-specific physical parameters from shared, state-dependent structural discrepancies. By integrating differentiable physics into an amortized inference backbone, APIC enables rapid calibration of unseen realizations from sparse observations while quantifying uncertainty. Experiments on the damped spring oscillator, the Lotka-Volterra system, and the advection-diffusion PDE with misspecified physics demonstrate improved parameter recovery and consistent identification of the systemic discrepancy structure compared to other calibration approaches.
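
For orientation, the KOH calibration model that APIC extends is commonly written in the form below. This is the standard textbook form (omitting the optional scaling factor on the simulator), not notation taken from the paper, and the symbols are assumptions for illustration.

```latex
y_i = \eta(x_i, \theta) + \delta(x_i) + \varepsilon_i,
\qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2)
```

Here $y_i$ is an observation at input $x_i$, $\eta$ is the physics model with calibration parameters $\theta$, and $\delta$ is the discrepancy term. In the abstract's terms, APIC infers an instance-specific $\theta$ while sharing a state-dependent $\delta$ across realizations.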

2606.03348 2026-06-03 cs.CV cs.AI

SynCred-Bench: Benchmarking Synthetic Credibility in AI-Generated Visual Misinformation

Junxiao Yang, Minghao Zhang, Xiaoce Wang, Haoran Liu, Shiyao Cui, Hongning Wang, Minlie Huang

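AI summary: Proposes the SynCred-Bench benchmark of 600 AI-generated misinformation images covering six credible-form categories and seven circulation styles, plus FP450, a real-image negative set. Evaluation shows that existing systems reach low true positive rates at a 5% false positive rate, indicating that synthetic credibility is a severe and underexplored visual misinformation challenge.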

Abstract

Recent generative models can produce visual artifacts with realistic embedded text and layouts, creating a new misinformation threat: synthetic credibility. We introduce SynCred-Bench, a benchmark of 600 AI-generated misinformation images balanced across six credible-form categories and seven fine-grained circulation styles, together with FP450, a real-image negative set for measuring false positives. Extensive evaluation shows that existing systems remain unreliable: under a 5% false-positive-rate constraint, 15 MLLMs achieve only a 10.5% true positive rate (TPR), open-source AIGC detectors achieve less than 5%, and commercial APIs reach 57.6%. Human annotators also struggle to identify synthetic credibility, reaching only 63% TPR. These findings establish synthetic credibility as a severe and underexplored visual misinformation challenge, and provide a benchmark for developing detectors that reason beyond superficial credibility cues.
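
The headline metric, TPR under a false-positive-rate constraint, can be computed as in the following sketch. It assumes each detector outputs a score where higher means "more likely synthetic"; the function name and the thresholding convention are assumptions, not the benchmark's official evaluation code.

```python
def tpr_at_fpr(pos_scores, neg_scores, max_fpr=0.05):
    """Best true positive rate over thresholds whose FPR stays within max_fpr.

    pos_scores: scores for misinformation images (positives).
    neg_scores: scores for real images (negatives, e.g. a set like FP450).
    An image is flagged when its score is >= the threshold.
    """
    best = 0.0
    # Every observed score is a candidate threshold; inf flags nothing.
    candidates = sorted(set(pos_scores) | set(neg_scores)) + [float("inf")]
    for threshold in candidates:
        fpr = sum(s >= threshold for s in neg_scores) / len(neg_scores)
        if fpr <= max_fpr:
            tpr = sum(s >= threshold for s in pos_scores) / len(pos_scores)
            best = max(best, tpr)
    return best
```

Fixing the false positive rate first makes detectors comparable: a system cannot raise its TPR simply by flagging everything.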

2606.03347 2026-06-03 cs.LG cs.AI stat.ML

AugMask: Training Diffusion Models on Incomplete Tabular Data via Stochastic Augmentation and Masking

Jungkyu Kim, Taeyoung Park, Kibok Lee

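AI summary: Proposes the AugMask training framework, which adapts standard diffusion models to tabular data with missing values through conditional stochastic augmentation and denoising only on observed coordinates, connects the rule to a Rao-Blackwellized objective that yields a variance-weighted penalty, and outperforms specialized missing-aware baselines.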

Abstract

Score-based diffusion models have emerged as prominent deep generative models; however, their application to tabular data remains challenging because their backbones assume fully specified inputs, whereas real-world tabular data often contain missing values. We propose AugMask, a plug-and-play training framework that adapts missing-unaware backbones to incomplete data by separating conditioning from supervision. AugMask 1) constructs numeric inputs via conditional stochastic augmentation using lightweight auxiliary models, and 2) applies denoising supervision only to observed coordinates. In effect, augmented missing entries serve as uncertain conditioning context rather than training targets. We connect this training rule to a Rao-Blackwellized objective and show that marginalizing missing entries yields a variance-weighted sensitivity penalty, discouraging over-reliance on uncertain completions. Across diverse datasets and missingness regimes, AugMask enables standard diffusion-based tabular generators to outperform specialized missing-aware baselines.
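
The two steps named in the abstract (fill missing entries stochastically, then supervise only observed coordinates) can be sketched as follows. This is a toy illustration, not the AugMask implementation: `sampler` stands in for the lightweight auxiliary model, and all names are hypothetical.

```python
import random


def augment_row(row, sampler, rng):
    """Fill missing entries (None) with stochastic draws.

    Returns (filled_row, observed_mask). `sampler(column_index, rng)` is a
    hypothetical stand-in for an auxiliary conditional model.
    """
    mask = [value is not None for value in row]
    filled = [value if value is not None else sampler(j, rng)
              for j, value in enumerate(row)]
    return filled, mask


def masked_denoising_loss(pred_noise, true_noise, observed_mask):
    """Mean squared error computed over observed coordinates only."""
    squared = [(p - t) ** 2
               for p, t, m in zip(pred_noise, true_noise, observed_mask) if m]
    return sum(squared) / len(squared) if squared else 0.0
```

Because the filled-in entries never enter the loss, an arbitrarily bad prediction at a missing coordinate leaves the training signal unchanged; the draws act only as conditioning context.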

2606.03345 2026-06-03 cs.CV cs.CL cs.CY

Beyond Semantics: Modeling Factual and Affective Perceptual Experiences from Vision-Language Data

Youssef Mohamed, Kenneth Ward Church, Mohamed Elhoseiny

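AI summary: Poses the P-Topics modeling problem and introduces PercepT, a two-stage architecture that discovers factual and affective perception experiences from images and captions without supervision and maps images to them, clearly outperforming baselines on ArtELingo.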

Comments
8 pages

Abstract

We present P-Topics (Perception Topics) modeling, a novel problem for understanding how images are perceived affectively and across cultures. The goal is to (1) discover and model the different perception experiences in a dataset of images and captions, where each experience is defined by an objective factual aspect and a subjective affective aspect, and (2) associate images with their relevant perception experiences. We introduce **PercepT** (**Percep**tion topic **T**ransformer), a two-stage architecture that tackles P-Topics modeling. In the formation stage, PercepT discovers *P-Topics* as visual-textual clusters using an unsupervised training objective, and dynamically selects the number of clusters to match the perceptual richness of the dataset. In the mapping stage, it learns *P-Topic mapping functions* via attention pooling to associate images with their respective clusters. On ArtELingo, PercepT achieves a silhouette score of **0.97** compared to **0.37** for the closest baseline, reflecting better perceptual clusters. PercepT also achieves an AUC score of **0.94** compared to **0.77**, showing better mapping to perceptual clusters. Human evaluation confirms that PercepT captures semantically meaningful perception experiences and significantly outperforms existing methods. Our implementation will be made public.

2606.03344 2026-06-03 cs.CR cs.LG

RogueMerge: Robust and Unified Attacks against LLM Model Merging

Jinghuai Zhang, Yetian He, Kunlin Cai, Han Zhao, Fnu Suya, Yuan Tian

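AI summary: Proposes the RogueMerge framework, which uses joint optimization, meta-learning-style simulation, and distributionally robust optimization to address three challenges in attacking model merging (autoregressive generation, unknown merging configurations, and generalization across attack prompts), yielding robust and unified attacks.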

Abstract

Model merging composes specialized capabilities into a single LLM by aggregating task vectors sourced from unverified public platforms, exposing a critical supply-chain attack surface: Because any malicious behavior can be encoded into a task vector, and merging grants third-party vectors direct write access to model weights, an attacker-provided task vector can enable or amplify diverse downstream threats. Prior work studies only backdoor attacks against model merging for classifiers using static arithmetic heuristics, which fail to effectively handle diverse attacks on generative LLMs for three reasons. (i) LLMs rely on autoregressive decoding, where the minor parameter drift introduced by merging compounds across tokens and rapidly degrades the attack. (ii) Attackers have no knowledge of the victim's merging configurations, causing a static attack vector optimized in isolation to be easily diluted or destroyed. (iii) Practical threat induction must generalize to attack prompts unseen during optimization, which static vectors cannot adequately encode. We present RogueMerge, the first principled, unified framework that addresses all three challenges. To handle autoregressive generation, we replace static arithmetic with a joint optimization that explicitly enforces attack success after merging. To handle unknown merging settings, we formulate attack injection as a stochastic min-max problem and solve it via meta-learning-style simulation. To generalize across heterogeneous attack prompts, we employ distributionally robust optimization and derive a tractable first-order Taylor approximation at LLM scale, with a provable error bound. Across four threats, six merging algorithms, and over 170 merged LLMs, RogueMerge consistently outperforms existing attacks. It also remains stable across diverse merging settings and resists standard defenses.
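
To make the attack surface concrete, the sketch below shows plain task-arithmetic merging, where each task vector is added to the base weights with a coefficient chosen by the person doing the merge. Task arithmetic is used here as one familiar merging rule and is an assumption; the abstract does not list the six algorithms it evaluates, and the flat-list weight representation is a simplification.

```python
def merge_task_arithmetic(base, task_vectors, coeffs):
    """Return base + sum_i coeffs[i] * task_vectors[i], elementwise.

    base: list of floats standing in for flattened model weights.
    task_vectors: list of same-length lists (fine-tuned minus base weights).
    coeffs: one scaling coefficient per task vector, chosen by the merger.
    """
    merged = list(base)
    for tau, lam in zip(task_vectors, coeffs):
        for k, delta in enumerate(tau):
            merged[k] += lam * delta
    return merged
```

Each third-party vector writes directly into the merged weights, and the attacker does not control `coeffs` or the other vectors. That is why a vector optimized in isolation can be diluted, the second challenge listed in the abstract.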

2606.03341 2026-06-03 cs.CV

Cross-Modality Feature Fusion Based on Structured State Space Duality for Multimodal Image Registration Network

Zhikang Li, Yan Wu, Xin Hu, Yi Dai, Ming Li

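AI summary: Proposes the RegNetMamba-2 algorithm, which uses Structured State Space Duality (SSD) to extract local and global structural features during coarse-to-fine matching and performs multi-modal image registration through cross-modality interaction and multi-scale fusion modules, with efficient performance on several datasets.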

Abstract

In multi-modal image registration, the primary challenge lies in extracting shared structural information. Compared to Transformers, Structured State Space Duality (SSD) extracts more comprehensive global structural features with higher efficiency during training and inference. Inspired by these advantages, we propose a novel algorithm for multi-modal image registration, named RegNetMamba-2. Our algorithm incorporates SSD into a coarse-to-fine matching process to extract local and global structural features effectively. First, SSD is applied at three different scales for multi-modal feature extraction in our network. To strengthen local representation, we pay more attention to foreground edges and structural information through the feature scaling function of SSD. Second, for shared feature extraction from the input images and multi-modal feature fusion across all scales, we propose a cross-modality feature fusion model based on SSD, consisting of a Cross-Modality feature Interaction (CMI) module and a Multi-Scale feature Fusion (MSF) module. The CMI module performs cross-modality feature extraction at each scale using SSD in cross form. The MSF module employs progressive upward fusion at the feature level to obtain fine features that combine multi-modal features from all scales. Following the coarse-to-fine strategy, the 1/8-scale features from CMI and the 1/2-scale features from MSF are collected to calculate matching probability scores. We then establish the respective matching processes through pixel-wise correspondences. Extensive experiments demonstrate that, compared with state-of-the-art deep-learning-based algorithms, RegNetMamba-2 achieves good results in both performance and efficiency for multi-modal image registration on the following datasets: VIS-SAR (OSDataset), VIS-IR (LGHD/RoadScene), and VIS-NIR (RGB-NIR Scene).

2606.03340 2026-06-03 cs.RO

Autonomous Navigation System for Library Service Robot Based on Unitree Go2 Edu

Aoduo Li, Haoran Lv, Bingquan Ou, Jianfeng Li, Yingdong Li, Zimeng Li

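AI summary: For library environments with narrow aisles and dynamic obstacles, proposes a ROS 2 navigation system for a quadruped robot that combines RTAB-Map, AMCL/EKF, and Nav2 for localization and obstacle avoidance with high success rates and a map error of 3.7 cm.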

Comments
6 pages, 5 figures, 4 tables. Accepted by WCCIS 2026

Abstract

Libraries require autonomous robots to move quietly through narrow aisles while remaining safe around readers, chairs, bags, and carts. This paper presents a ROS 2 navigation system for a Unitree Go2 Edu quadruped equipped with a 4D LiDAR, a front depth camera, and an IMU. Rather than assuming the library is rough terrain, we target the practical mobility discontinuities of real deployments, including floor transitions, temporary clutter, and partially blocked passages where low-clearance wheeled platforms are less tolerant. RTAB-Map is used for visual-LiDAR SLAM, AMCL and EKF-based sensor fusion provide localization, and a Nav2 stack with A* and DWA supports planning and local avoidance. In a real library, the system achieves 100%, 96%, and 88% success rates in static, low-density dynamic, and high-density dynamic scenes, while map validation against surveyed control distances yields a mean metric error of 3.7 cm.
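
The global planning step mentioned above relies on A*. As a rough illustration, here is a textbook A* search on a 4-connected occupancy grid. It is a generic sketch, not Nav2's planner code; the grid encoding (0 free, 1 blocked) and the function name are assumptions.

```python
import heapq


def astar(grid, start, goal):
    """Length of the shortest 4-connected path on a 0/1 grid, or None.

    grid[r][c] == 0 means free, 1 means blocked; start and goal are (r, c).
    """
    rows, cols = len(grid), len(grid[0])

    def heuristic(cell):
        # Manhattan distance is admissible and consistent for unit moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(heuristic(start), 0, start)]
    best_cost = {start: 0}
    while open_heap:
        _, cost, cell = heapq.heappop(open_heap)
        if cell == goal:
            return cost
        if cost > best_cost.get(cell, float("inf")):
            continue  # stale heap entry
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                new_cost = cost + 1
                if new_cost < best_cost.get((nr, nc), float("inf")):
                    best_cost[(nr, nc)] = new_cost
                    heapq.heappush(
                        open_heap,
                        (new_cost + heuristic((nr, nc)), new_cost, (nr, nc)),
                    )
    return None
```

A partially blocked aisle corresponds to a row of blocked cells with a gap; the planner routes through the gap, and a local controller such as DWA then handles moving obstacles along that route.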

2606.03338 2026-06-03 cs.LG cs.CV

IdEst: Assessing Self-Supervised Learning Representations via Intrinsic Dimension

Julie Mordacq, Vicky Kalogeiton, Steve Oudot

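AI summary: Proposes IdEst, which uses the Minimum Spanning Tree dimension estimator to measure the intrinsic dimension of self-supervised learning representations, finds that it correlates strongly with downstream linear-probe performance, and shows that it supports efficient hyperparameter selection.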

Comments
ICML 2026

Abstract

Self-supervised learning (SSL) has emerged as a powerful paradigm for learning meaningful representations from unlabeled data. However, the standard protocol for evaluating these representations, linear probing, is computationally expensive, sensitive to hyperparameters, and provides limited insight into the geometric structure of the representation space. In this work, motivated by connections between neural network generalization and intrinsic dimension (ID), we propose IdEst, a method for estimating the ID of SSL representations via the Minimum Spanning Tree dimension estimator ($\mathrm{dim}_\mathrm{MST}$). Across diverse datasets, architectures, and SSL pretraining objectives, we show that IdEst strongly correlates with downstream linear-probe performance. Furthermore, we demonstrate that IdEst enables efficient hyperparameter selection, significantly reducing the computational cost compared to supervised alternatives. Our results highlight intrinsic dimensionality as a principled geometric proxy for assessing SSL representations, complementing standard supervised probing protocols.

2606.03335 2026-06-03 cs.RO

GPU-Parallel Multi-Task Reinforcement Learning with Demonstration Guided Policy Optimization

Rui Zhang, Qiwei Wu, Zhengyu Zhang, Tao Li, Yunrong Guo, Junjie Lai, Renjing Xu, Weihua Zhang

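AI summary: Proposes a construction methodology that turns structured manipulation task families into GPU-parallel multi-task RL benchmarks, instantiated as MT-Libero, and designs DGPO, a demonstration-guided policy optimization algorithm that combines importance-weighted PPO with adaptive behavior cloning for efficient parallel training on heterogeneous task suites.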

Abstract

Large-scale GPU-parallel reinforcement learning has changed what can be trained in robot simulation, yet most systems still optimize one specialist policy per task. We propose a construction methodology for turning structured manipulation task families into GPU-parallel multi-task RL benchmarks, and instantiate it as MT-Libero using LIBERO assets and task predicates in Isaac Lab. The resulting benchmark supports simultaneous reinforcement learning over heterogeneous task suites with parallel rendering, physics randomization, and state-input or visual-input policies. To make such training practical under sparse success signals and limited prior data, we further propose DGPO, an on-policy demonstration-guided method that combines importance-weighted PPO with adaptive behavior cloning on matched demonstration actions. DGPO enables a tunable preference toward demonstrated task distributions, outperforming both prior-free RL and existing demonstration-based methods while preserving the stability and online improvement benefits of on-policy PPO.
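
As a rough picture of how a PPO objective can be combined with a behavior-cloning term, consider the sketch below. It uses the standard PPO clipped surrogate plus a fixed-weight cloning penalty; DGPO's actual importance weighting and adaptive schedule are not specified in the abstract, so `bc_weight` and the function names are assumptions.

```python
def ppo_clip_term(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate for one sample."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def combined_loss(ratios, advantages, bc_nll, bc_weight, eps=0.2):
    """Negative mean clipped surrogate plus a weighted behavior-cloning loss.

    bc_nll: negative log-likelihood of demonstration actions under the policy.
    bc_weight: hypothetical fixed coefficient standing in for an adaptive one.
    """
    surrogate = sum(ppo_clip_term(r, a, eps)
                    for r, a in zip(ratios, advantages)) / len(ratios)
    return -surrogate + bc_weight * bc_nll
```

Raising `bc_weight` pulls the policy toward demonstrated actions, while lowering it recovers plain PPO, which matches the "tunable preference" described in the abstract only in spirit.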

2606.03334 2026-06-03 cs.CL cs.LG

Lingo_Research_Group at SemEval-2026 Task 9: Evaluating Prompt Variants for Polarization Detection

Pritam Kadasi, Anuj Tiwari, Mayank Singh

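AI summary: For the multilingual polarization detection of SemEval-2026 Task 9, the paper designs 12 prompt variants and evaluates them with aya-101 and Gemma3-27B on three subtasks, finding that prompting works for coarse-grained polarization detection but struggles increasingly with fine-grained multi-label classification.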

Comments
Accepted at the SemEval Workshop, ACL 2026

Abstract

This paper presents our submission to SemEval-2026 Task 9: Multilingual Text Classification Challenge - Polarization Detection, covering all three subtasks: (1) binary polarization detection, (2) polarization type classification, and (3) polarization manifestation identification. We systematically study short designed prompts, considering twelve prompts that differ in terminology clarity, level of detail in the definition, reasoning guidance, and use of in-context examples. The experiments are conducted using aya-101 and Gemma3-27B, with the latter chosen for the submission at the end of development on performance grounds. On the official test set, averaged over 22 languages, our system achieves an average macro F1-score of 0.762 on Subtask 1, 0.587 on Subtask 2, and 0.444 on Subtask 3, with average accuracies of 0.819, 0.678, and 0.498, respectively. Through cross-task and cross-lingual analysis, we demonstrate that prompt-based approaches can detect coarse-grained polarization effectively but encounter increasing difficulty with fine-grained, multi-label sociolinguistic classification.
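
The gap between macro F1 and accuracy in these results comes from how macro F1 weights classes equally. A minimal single-label sketch follows; it is not the task's official scorer, and the function name is an assumption.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)
```

For example, with `y_true = [1, 1, 0, 0]` and `y_pred = [1, 0, 0, 0]`, accuracy is 0.75 while macro F1 is 11/15 (about 0.733), because the missed positive lowers the F1 of class 1 to 2/3.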

2606.03332 2026-06-03 cs.LG

Tailoring Strictly Proper Scoring Rules for Downstream Tasks: An Application to Causal Inference

Roman Plaud, Alexandre Perez-Lebel, Antoine Saillenfest, Thomas Bonald, Marine Le Morvan, Gaël Varoquaux, Matthieu Labeau

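AI summary: Proposes a framework that derives task-specific strictly proper scoring rules by matching the local curvature of the downstream error metric and applies it to average treatment effect estimation, deriving a closed-form loss and its canonical probability mapping; experiments show it outperforms standard likelihood-based and covariate-balancing approaches.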

Comments
Accepted to ICML 2026

Abstract

Probabilistic models are typically trained using task-agnostic objectives like log-loss, which can lead to significant errors in downstream estimation. This disconnect is especially critical in Inverse Probability Weighting (IPW) for causal inference, where propensity score errors near $0$ and $1$ often lead to high bias and variance. We propose a principled framework for deriving task-specific strictly proper scoring rules by matching the local curvature of the downstream error metric. We apply this to the Average Treatment Effect (ATE) estimation, deriving a closed-form loss and its corresponding canonical probability mapping that can be readily integrated with any model like a neural network or a gradient boosting algorithm. Extensive evaluations on causal inference benchmarks demonstrate that our tailored objective consistently outperforms standard likelihood-based and covariate-balancing approaches.
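
To see why propensity errors near 0 and 1 matter, it helps to write out the IPW estimator of the ATE. The sketch below is the standard (Horvitz-Thompson style) estimator, not the paper's tailored loss; the function name is an assumption.

```python
def ipw_ate(treated, outcomes, propensities):
    """Inverse-probability-weighted estimate of the average treatment effect.

    treated: 0/1 treatment indicators.
    outcomes: observed outcomes.
    propensities: estimated P(treated = 1 | covariates), strictly in (0, 1).
    """
    total = 0.0
    for t, y, e in zip(treated, outcomes, propensities):
        # Treated units are weighted by 1/e, controls by 1/(1 - e).
        total += t * y / e - (1 - t) * y / (1 - e)
    return total / len(outcomes)
```

A treated unit with estimated propensity 0.01 receives weight 100, so a small absolute error in that estimate changes its contribution far more than the same error at 0.5. This is the sensitivity a task-specific scoring rule is meant to address.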

2606.03331 2026-06-03 cs.CL cs.AI

Evaluating LLMs' Effectiveness on Real-World Consumer Device Repair Questions

Atm Mizanur Rahman, Md Arid Hasan, Syed Ishtiaque Ahmed, Sharifa Sultana

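AI summary: Introduces a benchmark of 991 real-world repair questions and evaluates six LLMs in English and Bangla on correctness, completeness, practicality, and safety, finding that the models can give useful help but remain unreliable for high-risk repair tasks.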

Abstract

Consumer device repair is an important but underexplored testbed for large language models (LLMs). Repair tasks require reasoning over incomplete problem descriptions, hardware-specific diagnostics, actionable troubleshooting, and safety-critical decisions, where incorrect advice can cause device damage, battery hazards, or permanent data loss. We introduce a benchmark of 991 real-world repair questions from Reddit spanning phone repair, computer repair, and data recovery, each paired with technician-written reference solutions, and provide Bangla translations to evaluate cross-lingual performance. We evaluate six state-of-the-art LLMs in English and Bangla using four repair-specific criteria: correctness, completeness, practicality, and safety. Our results show that while LLMs can provide useful repair assistance, they remain unreliable for high-risk real-world repair tasks without rigorous evaluation and explicit safety safeguards. Phone repair is the most difficult and safety-sensitive domain, and all models make substantial errors in board-level diagnosis, repair prioritization, and safe recovery procedures. Across domains and models, Bangla responses consistently perform worse than English responses. Among the evaluated models, GPT-5.4 performs best overall.

2606.03330 2026-06-03 cs.LG cs.AI cs.CR

FLIPS: Instance-Fingerprinting for LLMs via Pseudo-random Sequences

Gurvan Richardeau, Gohar Dashyan, Erwan Le Merrer, Gilles Tredan

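AI summary: Proposes FLIPS, which exploits biases in generated binary random sequences to reach 96% (closed-set) and 90% (open-set) identification accuracy across 237 model instances, addressing the inability of existing fingerprinting techniques to distinguish configurations of the same LLM and offering an instance-level fingerprinting paradigm for AI regulation.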

Comments
20 pages, 20 figures, 3 tables. 43rd International Conference on Machine Learning (ICML 2026)

Abstract

The literature shows that a Large Language Model's (LLM) behavior is conditioned not only by its original weights but also by its instance-level parameters, such as the instruction prompt, sampling configuration, or quantization. A model that generates safe outputs under one configuration may produce toxic content under another. However, current LLM identification techniques (such as fingerprinting) focus on intellectual property protection, and their design favors robustness to changes in these instance-level parameters. This poses a critical challenge for AI regulation, in which compliance assessments target actual deployed behaviors, not model provenance. In this paper, we introduce instance-level fingerprinting, a regulator-oriented paradigm that distinguishes configurations of the same LLM. Our method, FLIPS, exploits biases in generated binary random sequences to reach 96% (closed-set) and 90% (open-set, where some targets are unknown) identification accuracy across 237 model instances, versus 35% for the adapted LLMmap baseline. This shows that instance-level fingerprinting is both necessary for regulation and practically feasible. Code available at https://github.com/GurvanR/FLIPS-LLM-Instance-Fingerprinting.
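
The intuition that "random" bit strings carry a configuration-specific bias can be illustrated with two crude statistics and a nearest-reference match. This toy sketch is not the FLIPS method; the features, the distance, and all names are assumptions chosen for brevity.

```python
def bit_features(bits):
    """Return (fraction of ones, fraction of adjacent positions that differ).

    bits: a string of '0'/'1' characters with length >= 2.
    """
    ones = bits.count("1") / len(bits)
    flips = sum(a != b for a, b in zip(bits, bits[1:])) / (len(bits) - 1)
    return ones, flips


def nearest_instance(sample, references):
    """Name of the reference whose bit statistics are closest to the sample.

    references: dict mapping a hypothetical instance name to a bit string
    previously collected from that instance.
    """
    target = bit_features(sample)

    def distance(name):
        ref = bit_features(references[name])
        return sum((x - y) ** 2 for x, y in zip(target, ref))

    return min(references, key=distance)
```

An instance that alternates bits too eagerly and one that produces long runs both have a ones-fraction near 0.5, yet differ sharply in the flip rate, so even simple statistics can separate them in this toy setting.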

2606.03329 2026-06-03 cs.AI

InfoMem: Training Long-Context Memory Agents with Answer-Conditioned Information Gain

Tiancheng Han, Yong Li, Wuzhou Yu, Qiaosheng Zhang, Wenqi Shao

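AI summary: Proposes InfoMem, a reward mechanism that trains chunk-wise memory agents by measuring how much the final memory increases the per-token log-likelihood of the ground-truth answer, improving performance on long-context tasks.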

Comments
17 pages, 7 figures

Abstract

Long-context tasks require LLMs to identify and preserve answer-relevant information from large contexts. Chunk-wise memory agents address this issue by sequentially reading document chunks, updating a compact memory, and generating the final answer from the accumulated memory. However, existing RL-based chunk-wise agents either rely on sparse final-answer rewards or use lexical intermediate rewards for memory and retrieval actions. These signals supervise task success or local overlap, but do not directly evaluate whether the final memory supports the ground-truth answer. We propose InfoMem, a reward mechanism for training chunk-wise memory agents that evaluates final-memory utility using answer-conditioned information. InfoMem measures how much the final memory increases the model's per-token log-likelihood of the ground-truth answer. To stabilize RL optimization, InfoMem applies this signal only to successful trajectories and normalizes it before reward composition. Under the same GRPO framework and training budget, InfoMem improves long-context memory-agent performance over comparable memory-agent RL baselines. Analyses show that effective final-memory rewards should operate on successful trajectories, be normalized before reward composition, and be conditioned on the answer rather than the query. Our code is available at https://github.com/GenSouKa1/InfoMem.
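
The core quantity, the per-token log-likelihood gain of the ground-truth answer, can be sketched as below. The inputs are assumed to be token log-probabilities of the same answer scored with and without the final memory in context; the success gating mirrors the abstract, while the normalization step is omitted because its exact form is not given. Function names are hypothetical.

```python
def per_token_gain(logprobs_with_memory, logprobs_without_memory):
    """Average increase in answer-token log-likelihood due to the memory."""
    if not logprobs_with_memory or \
            len(logprobs_with_memory) != len(logprobs_without_memory):
        raise ValueError("need two non-empty sequences of equal length")
    n = len(logprobs_with_memory)
    return (sum(logprobs_with_memory) - sum(logprobs_without_memory)) / n


def gated_bonus(task_succeeded, gain):
    """Apply the memory signal only to successful trajectories."""
    return gain if task_succeeded else 0.0
```

Dividing by the number of answer tokens keeps long answers from receiving larger bonuses simply because they have more tokens to improve.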

2606.03327 2026-06-03 cs.DB cs.CL

CAPER: Clause-Aligned Process Supervision for Text-to-SQL

Lujie Ban, Jiasheng Shi, Jinyang Li, Xiaolin Han, Tsz Nam Chan, Chenhao Ma

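AI summary: Proposes CAPER, which automatically derives clause-level supervision through counterfactual intervention on the SQL abstract syntax tree and trains CAPER-9B, a lightweight Clause-PRM used for policy optimization and candidate verification, improving execution accuracy and failure localization on BIRD and Spider.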

Abstract

Text-to-SQL systems are typically evaluated by query-level execution correctness, but this terminal signal provides little guidance about which intermediate SQL decision caused success or failure. Token-level dense supervision is also ill-suited: SQL tokens do not align with complete semantic decisions, can penalize execution-equivalent queries, and are difficult to label reliably at scale. We therefore propose CAPER, which automatically derives clause-level supervision via counterfactual intervention on the SQL abstract syntax tree, enabling root-cause error localization for reward modeling; the resulting data is used to train CAPER-9B, a lightweight Clause-PRM that provides clause-boundary feedback for policy optimization and candidate verification. Experiments on BIRD and Spider show that clause-aligned supervision not only improves execution accuracy, achieving up to a 15.3% relative EX improvement over GPT-5.4, but also strengthens failure-localization capability, reaching 84.53% accuracy and 90.60% MRR on held-out failures. Our project page is at https://github.com/banrichard/RL-NL2SQL.
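
The failure-localization numbers above are accuracy and mean reciprocal rank (MRR). A minimal MRR sketch follows, assuming the model ranks candidate clauses by how likely each is to be the root cause; the clause labels and function name are illustrative assumptions.

```python
def mean_reciprocal_rank(ranked_lists, gold):
    """Average of 1/rank of the gold item; 0 when it is absent from a list.

    ranked_lists: one ranking of candidate clauses per failed query.
    gold: the true root-cause clause for each query.
    """
    total = 0.0
    for ranking, target in zip(ranked_lists, gold):
        if target in ranking:
            total += 1.0 / (ranking.index(target) + 1)
    return total / len(gold)
```

Accuracy counts only cases where the faulty clause is ranked first, whereas MRR also gives partial credit when it appears second or third, which is why MRR is the higher of the two reported figures.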

2606.03326 2026-06-03 cs.AI

The Violation Situation Pattern: A Knowledge-Graph Pattern for Compliance Violations

Nima Kamali Lassem, Fuqi Song, Seyid Amjad Ali

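AI summary: Proposes the Violation Situation Pattern (VSP), which reifies violations found in compliance checking as persistent graph nodes with lifecycle state and audit history, and validates it by instantiating four deontic rules in a legal-entity and contract graph.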

Abstract

Compliance pipelines detect violations as transient query results and do not keep the violation itself as a persistent graph object with review state, affected entities, or audit history. The Violation Situation Pattern (VSP) closes this gap. Building on the Situation pattern of Gangemi and Mika, VSP reifies each detected violation as a graph node with a rule identifier, a temporal validity interval, a lifecycle state, and evidence links to the entities involved. Lifecycle transitions are stored as immutable, PROV-O-aligned events, so audit history is a graph traversal. We instantiate VSP in a legal entity and contract lifecycle property graph and operationalize four deontic rules (V1 unauthorized signature, V2 expired mandate, V3 missing confidentiality clause, V4 missing breach-notification clause) through an FCL->Cypher->MERGE pipeline. We check V1 and V2 against BODACC corporate-officer publications, evaluate V4 on 73 GDPRhub enforcement decisions, and run a SHACL cross-formalism check on V3 and V4. The central finding is rule-body independence: extending V4 from clause-presence to deadline checking raises F1 from 0.312 to 0.602, while the pattern's identity, lifecycle, and evidence semantics stay the same. This separates a pattern contribution from a detector contribution, so detection logic can evolve without invalidating accumulated audit history.
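
The shape of a reified violation, with immutable lifecycle events forming the audit trail, can be sketched in plain Python. This is an in-memory illustration only, not the paper's property-graph schema; the class names, field names, and state labels are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class LifecycleEvent:
    """Immutable record of one state transition (the audit trail entry)."""
    from_state: str
    to_state: str
    at: datetime
    actor: str


@dataclass
class ViolationSituation:
    """A detected violation kept as a persistent object rather than a query row."""
    rule_id: str
    valid_from: datetime
    valid_to: Optional[datetime]
    evidence: Tuple[str, ...]  # identifiers of the entities involved
    state: str = "detected"
    history: List[LifecycleEvent] = field(default_factory=list)

    def transition(self, to_state, actor, at):
        # Append a new event instead of overwriting, so history stays intact.
        self.history.append(LifecycleEvent(self.state, to_state, at, actor))
        self.state = to_state
```

Because the rule body is referenced only through `rule_id`, the detection logic behind a rule such as V4 could change without touching stored violations or their history, which is the rule-body independence the abstract reports.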

2606.03322 2026-06-03 cs.LG cs.AI

Multi-Modal Graph Neural Network with Transformer-Guided Adaptive Diffusion for Preclinical Alzheimer Classification

多模态图神经网络与Transformer引导的自适应扩散用于临床前阿尔茨海默病分类

Jaeyoon Sim, Minjae Lee, Guorong Wu, Won Hwa Kim

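AI summary: Proposes a graph neural network framework that combines diffusion kernels with multi-head attention, using a transformer to guide an adaptive diffusion process; it fuses multi-modal features, improves preclinical Alzheimer's disease classification, and identifies key brain regions.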

详情
Comments
10 pages, Accepted to MICCAI 2024
AI中文摘要

大脑的图形表示通过感兴趣区域(ROI)之间的关系为诊断和预测神经退行性疾病提供了关键见解。尽管近年来出现了各种图神经网络(GNN)来有效捕获关系信息,但在解释大脑网络方面仍存在固有局限性。具体而言,卷积方法无法有效聚合远邻域信息,而基于注意力的方法在捕获节点中心信息方面存在缺陷,特别是在保留关键节点的关键特征方面。这些不足揭示了从不同模态的不同特征中识别疾病特异性变化的挑战。为此,我们提出一个集成框架,通过下游Transformer引导每个节点的扩散过程,其中图的短程和长程属性分别通过扩散核和多头注意力进行聚合。我们通过使用多种模态改进临床前阿尔茨海默病(AD)分类的性能,证明了我们模型的优越性。此外,我们的模型能够熟练识别与AD临床前阶段密切相关的关键ROI,为疾病的早期诊断和预防提供了重要潜力。

英文摘要

The graphical representation of the brain offers critical insights into diagnosing and prognosing neurodegenerative disease via relationships between regions of interest (ROIs). Despite the recent emergence of various Graph Neural Networks (GNNs) that effectively capture relational information, there remain inherent limitations in interpreting brain networks. Specifically, convolutional approaches ineffectively aggregate information from distant neighborhoods, while attention-based methods exhibit deficiencies in capturing node-centric information, particularly in retaining critical characteristics from pivotal nodes. These shortcomings make it challenging to identify disease-specific variation from diverse features across different modalities. In this regard, we propose an integrated framework that guides the diffusion process at each node with a downstream transformer, where short- and long-range properties of graphs are aggregated via a diffusion kernel and multi-head attention, respectively. We demonstrate the superiority of our model by improving the performance of preclinical Alzheimer's disease (AD) classification with various modalities. Also, our model adeptly identifies key ROIs that are closely associated with the preclinical stages of AD, marking a significant potential for early diagnosis and prevision of the disease.

2606.03321 2026-06-03 cs.LG cs.MA cs.SY eess.SY

Validation-Gated Multi-Agent Governance for Online Adaptation of Thermal-Hydraulic Surrogate Models under Operating-Regime Shift

验证门控多智能体治理:运行工况迁移下热工水力代理模型的在线自适应

Doyeong Lim, Seungyoon Lee, In Cheol Bang

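AI summary: To address the degradation of offline-trained models under operating-regime shift, proposes a validation-gated multi-agent governance framework in which role-separated agents and deterministic gates enable auditable online adaptation, reducing mean absolute error by 19% on experimental thermal-hydraulic data.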

详情
AI中文摘要

人工智能代理模型可以支持每秒的热工水力预测,但离线选定并冻结的模型一旦部署到预训练包络之外,可能会变得条件锁定。本研究针对实验热工水力回路数据开发了一个受保护的持续自适应框架,其中角色分离的智能体——监控器、诊断器、自适应器、安全审计器和编排器——诊断误差特征、优先考虑候选模型族并审查升级,而确定性的冠军-挑战者门控和后台影子学习保留对模型替换的最终权限。通过分块三折交叉验证筛选了七个代理模型族,并选择时间傅里叶神经算子作为初始冠军,用于两个保留瞬态的60秒历史到10秒轨迹预测,每种自适应模式使用三个种子。静态部署给出通道平均MAE为7.06,警告超标率为56.8%;基于规则的自适应将MAE降至6.54,而仅使用影子刷新则接近静态。MA-Full模式中,角色分离的多智能体委员会审查每个评估流步骤,实现了最低的平均误差5.72和35.8%的超标率,相比静态改进19.0%。与静态的配对自助区间排除零,但自适应模式之间的区间重叠,且六个配对单元限制了广泛的统计声明。从神经算子到Transformer和图神经网络的验证升级表明,记录的门控自适应可以支持可审计的代理模型演化,同时确定性门控保留部署权限。

英文摘要

Artificial-intelligence surrogates can support second-by-second thermal-hydraulic forecasting, but models selected and frozen offline may become condition-locked once deployed outside their pretraining envelope. This study develops a guarded continual-adaptation framework for experimental thermal-hydraulic loop data in which role-separated agents - Monitor, Diagnosis, Adaptation, Safety-Auditor, and Orchestrator - diagnose error signatures, prioritize candidate model families, and review promotions, while deterministic champion-challenger gates and background shadow learning retain final authority over model replacement. Seven surrogate families were screened by blocked three-fold cross-validation, and a temporal Fourier neural operator was selected as the initial champion for 60-s-history-to-10-s-trajectory forecasting on two held-out transients, with three seeds per adaptive mode. Static deployment gave a channel-averaged MAE of 7.06 and a 56.8% warning-exceedance ratio; rule-based adaptation reduced MAE to 6.54, whereas shadow refresh alone remained close to Static. The MA-Full mode, in which the role-separated multi-agent council reviews every evaluated stream step, achieved the lowest mean error, 5.72, and 35.8% exceedance, corresponding to a 19.0% improvement over Static. Paired bootstrap intervals against Static excluded zero, although intervals among adaptive modes overlapped and the six paired units limit broad statistical claims. Validated promotions from the neural operator to Transformer and graph neural network indicate that logged, gate-controlled adaptation can support auditable surrogate evolution while deterministic gates retain deployment authority.
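The reported 19.0% figure can be checked directly from the MAE values quoted in the abstract. The short calculation below is only an arithmetic check; the function and constant names are illustrative and not taken from the paper.

```python
def relative_improvement(baseline: float, adapted: float) -> float:
    """Fractional error reduction of `adapted` relative to `baseline`."""
    return (baseline - adapted) / baseline


# Channel-averaged MAE values quoted in the abstract.
STATIC_MAE = 7.06
RULE_BASED_MAE = 6.54
MA_FULL_MAE = 5.72

ma_full_gain = round(100 * relative_improvement(STATIC_MAE, MA_FULL_MAE), 1)
rule_based_gain = round(100 * relative_improvement(STATIC_MAE, RULE_BASED_MAE), 1)

# Warning-exceedance ratio drops from 56.8% to 35.8%, in percentage points.
exceedance_drop = round(56.8 - 35.8, 1)
```

Under the same formula, rule-based adaptation corresponds to an improvement of about 7.4% over Static, so most of the reported gain comes from the MA-Full mode.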

2606.03315 2026-06-03 cs.LG

A Graph Foundation Model with Spectral Parsing and Prototype-Guided Spatial Propagation

具有频谱解析和原型引导空间传播的图基础模型

Ankang Yang, Jitao Zhao, Dongxiao He, Liang Yang, Di Jin, Weixiong Zhang

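AI summary: Proposes SPG, which decomposes node features into multiple spectral responses with learnable Chebyshev filters and distills transferable pairwise relations through a Gromov-Wasserstein prototype geometry to achieve cross-domain graph generalization.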

详情
AI中文摘要

图基础模型旨在从多样化的图中学习可迁移知识,以泛化到未见过的图和任务。与文本和图像不同,图缺乏共享词汇或规则的空间网格,使得跨图迁移具有挑战性。这一挑战既来自特征差异,更关键的是来自多样化的图结构。现有的GFM主要通过统一特征空间或引入结构标记和词汇来提高可迁移性。然而,现有的拓扑感知设计仍有局限性。结构标记通常是离散的,而结构词汇通常依赖于预定义的子结构(如树和环),其有限覆盖可能遗漏跨图中更丰富的关系模式。此外,图信号包含高频局部模式和更平滑的低频模式,这需要不同的传播行为。这些成分在原始图信号中通常是纠缠的,而这一频谱视角在现有GFM中很少被探索。为了解决这些挑战,我们提出了SPG,一种具有频谱解析和原型引导空间传播的图基础模型。SPG应用可学习的切比雪夫滤波器将节点特征分解为多个频谱响应,减少频率特定图信号与传播行为之间的不匹配。然后,它构建一个Gromov-Wasserstein原型几何,将超越预定义子结构的可迁移成对关系蒸馏到共享结构空间中。学习到的原型几何进一步被投影回作为原型引导的传播算子。实验表明在跨域泛化中具有一致的改进。

英文摘要

Graph foundation models aim to learn transferable knowledge from diverse graphs for generalization to unseen graphs and tasks. Unlike text and images, graphs lack a shared vocabulary or regular spatial grid, making cross-graph transfer challenging. This challenge comes from both feature discrepancies and, more critically, diverse graph structures. Existing GFMs mainly improve transferability by unifying feature spaces or incorporating structural tokens and vocabularies. However, existing topology-aware designs still have limitations. Structural tokens are usually discrete, while structural vocabularies often rely on predefined substructures such as trees and cycles, whose limited coverage may miss richer relational patterns across graphs. Moreover, graph signals contain both high-frequency local patterns and smoother low-frequency patterns, which require different propagation behaviors. These components are often entangled in raw graph signals, while this spectral perspective is rarely explored in existing GFMs. To address these challenges, we propose SPG, a graph foundation model with spectral parsing and prototype-guided spatial propagation. SPG applies learnable Chebyshev filters to decompose node features into multiple spectral responses, reducing the mismatch between frequency-specific graph signals and propagation behaviors. It then constructs a Gromov-Wasserstein prototype geometry to distill transferable pairwise relations beyond predefined substructures into a shared structural space. The learned prototype geometry is further projected back as a prototype-guided propagation operator. Experiments demonstrate consistent improvements in cross-domain generalization.
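To make the spectral-parsing step more concrete, the sketch below applies a bank of Chebyshev polynomial filters to a single graph signal using the standard recurrence T_0 = I, T_1 = L, T_k = 2 L T_{k-1} - T_{k-2}. It assumes the Laplacian has already been rescaled so its eigenvalues lie in [-1, 1], and it uses fixed coefficients where SPG would learn them. The function names are illustrative; this is not the authors' implementation.

```python
def matvec(matrix, vector):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def chebyshev_filter(scaled_laplacian, signal, theta):
    """Return sum_k theta[k] * T_k(L) x for a non-empty coefficient list."""
    t_prev = list(signal)                          # T_0(L) x = x
    out = [theta[0] * v for v in t_prev]
    if len(theta) == 1:
        return out
    t_curr = matvec(scaled_laplacian, signal)      # T_1(L) x = L x
    out = [o + theta[1] * v for o, v in zip(out, t_curr)]
    for k in range(2, len(theta)):
        lap_t = matvec(scaled_laplacian, t_curr)
        # Chebyshev recurrence: T_k x = 2 L T_{k-1} x - T_{k-2} x
        t_next = [2 * a - b for a, b in zip(lap_t, t_prev)]
        out = [o + theta[k] * v for o, v in zip(out, t_next)]
        t_prev, t_curr = t_curr, t_next
    return out


def spectral_responses(scaled_laplacian, signal, thetas):
    """One filtered copy of the signal per coefficient vector."""
    return [chebyshev_filter(scaled_laplacian, signal, th) for th in thetas]


identity = [[1, 0], [0, 1]]
responses = spectral_responses(identity, [1.0, 2.0],
                               [[0.5, 0.25, 0.25], [1.0]])
```

Each coefficient vector yields one spectral response, so different vectors can emphasize smoother or more oscillatory components of the same signal.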

2606.03314 2026-06-03 cs.CV

TASE: Truncation-Aware Semantic Embeddings for 3D Scene Understanding and Editing

TASE: 用于3D场景理解与编辑的截断感知语义嵌入

Tim-Felix Faasch, Jochen Kall, Lucas Nunes, Jens Behley, Cyrill Stachniss

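AI summary: Proposes TASE, which projects pretrained 2D semantic features into a truncation-aware embedding space and adds a scale- and translation-equivariance loss, enabling controllable text-driven editing of 3D scenes and substantially outperforming prior methods on edits with large geometric modifications.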

详情
AI中文摘要

高保真语义3D场景表示对于众多应用(包括机器人、自动驾驶和仿真)至关重要。除此之外,编辑此类表示的能力使开发人员能够更轻松地将这些应用适应特定的目标场景。当前方法对可控编辑的支持有限。我们引入TASE,一种将预训练的2D语义特征投影到截断感知嵌入空间以实现灵活3D场景编辑的方法。我们的方法显式优化了一个特征空间,在该空间中,逐步减少特征通道会产生越来越抽象的语义表示,而保留更多通道则保留细粒度细节。此外,我们使用尺度和平移等变损失来改进特征的多视图一致性。由此产生的截断感知嵌入空间支持对3D场景进行文本驱动的编辑,提供了对编辑与原始场景内容一致程度的显式控制,并允许比先前方法更实质性的修改。此外,我们提出了编辑扩散模型的微调阶段,以减轻几何变化引起的伪影。实验结果表明,在3D场景编辑中具有竞争力的性能,在涉及大几何修改的编辑上显著优于先前方法。

英文摘要

High-fidelity semantic 3D scene representations are crucial for numerous applications, including robotics, autonomous driving, and simulation. Beyond this, the ability to edit such representations enables developers to adapt these applications more easily to specific target scenarios. Current approaches provide limited support for controllable editing. We introduce TASE, a method that projects pretrained 2D semantic features into a truncation-aware embedding space to enable flexible 3D scene editing. Our method explicitly optimizes a feature space in which progressively reducing feature channels yields increasingly abstract semantic representations, while retaining more channels preserves fine-grained detail. Additionally, we improve multi-view consistency of the features using a scale- and translation-equivariance loss. The resulting truncation-aware embedding space enables text-driven edits to 3D scenes, providing explicit control over how strongly edits adhere to the original scene content and allowing more substantial modifications than prior methods. Moreover, we propose a finetuning stage for the editing diffusion model to mitigate artifacts caused by geometric changes. Experimental results demonstrate competitive performance in 3D scene editing, substantially outperforming prior methods on edits involving large geometric modifications.

2606.03312 2026-06-03 cs.RO cs.AI

RobotValues: Evaluating Household Robots When Human Values Conflict

RobotValues: 当人类价值观冲突时评估家用机器人

Jongwook Han, Hyeongjin Kim, Yohan Jo

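AI summary: Introduces the RobotValues benchmark, which evaluates household robot planners on 10K value-conflict scenarios; finds that vision-language models have default value preferences that are hard to override, indicating that evaluation should cover action choice under value conflict.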

详情
AI中文摘要

虽然家用机器人通常基于任务完成度进行评估,但日常家庭环境涉及价值冲突情境,其中机器人应选择优先考虑其他价值观(如人类自主性、效率或社会适宜性)而非任务成功的行动。然而,目前尚无评估机器人在此类场景中价值偏好的基准。我们引入RobotValues,一个在10K个价值冲突场景中评估家用机器人规划器的基准。每个实例包含一个逼真的家庭图像和多个优先考虑不同人类价值观的合理机器人动作。我们通过LLM辅助场景生成、利益相关者基于价值观提取、图像生成和自动质量控制构建RobotValues。使用RobotValues评估机器人领域使用的视觉语言模型,发现模型表现出默认价值偏好,包括安全性和适应性,而低估了隐私优先的行动。当模型被指示优先考虑与其自身偏好冲突的特定价值观时,它们通常无法覆盖默认行动,80%的时间选择了错误行动。这些发现表明,家用机器人评估不仅应衡量任务完成度或安全性合规性,还应衡量当人类价值观冲突时机器人是否能在合理行动中做出选择。

英文摘要

While household robots are often evaluated based on task completion, everyday domestic environments involve value-conflicting situations in which robots are expected to choose actions that prioritize values other than task success, such as human autonomy, efficiency, or social appropriateness. Yet, there are no benchmarks for evaluating robots' value preferences in such scenarios. We introduce RobotValues, a benchmark to evaluate household robot planners in 10K value-conflict scenarios. Each instance consists of a realistic household image with multiple plausible robot actions that prioritize different human values. We construct RobotValues through LLM-assisted scenario generation, stakeholder-grounded value extraction, image generation, and automatic quality control. Using RobotValues, we evaluate VLMs used in robotics and find that models exhibit default value preferences, including safety and accommodation, while underselecting privacy-prioritizing actions. When the models are instructed to prioritize specific values that conflict with their own preferences, they often fail to override their default actions, choosing incorrect actions 80% of the time. These findings suggest that household robot evaluation should measure not only task completion or safety compliance, but also whether robots can choose among plausible actions when human values conflict.

2606.03310 2026-06-03 cs.LG cs.AI

Learning Multi-Scale Hypergraph for High-Order Brain Connectivity Analysis

学习多尺度超图用于高阶脑连接分析

Jaeyoon Sim, Soojin Hwang, Seunghun Baek, Guorong Wu, Won Hwa Kim

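AI summary: Proposes MuHL, an adaptive multi-scale hyperedge learning framework that builds hierarchical node features and dynamically learns high-order interactions, improving neurodegenerative disease classification on multiple brain network benchmarks and identifying key brain regions.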

详情
Comments
24 pages, Accepted to ICML 2026
AI中文摘要

理解脑区之间的复杂交互对于早期神经退行性疾病(如阿尔茨海默病和帕金森病)的分类至关重要。虽然基于图的模型广泛用于分析脑网络,但大多数现有方法主要关注直接连接节点之间的成对交互,限制了其捕捉跨多个区域的高阶依赖关系的能力。尽管已有基于超图的方法来建模高阶关系,但许多方法依赖于预定义的超边或将学习限制在超边权重上,降低了灵活性并限制了其捕捉多分辨率结构模式的能力。为此,我们引入了一个自适应多尺度超边学习框架,即MuHL,该框架构建层次节点特征,并通过在多分辨率图信号上连续构建超边来动态学习高阶交互。在多个脑网络基准上的大量实验表明,MuHL在不同阶段持续提高了疾病分类性能,并从学习到的超边中识别出与疾病进展相关的关键感兴趣区域及其群体交互,突显了其作为神经退行性疾病脑网络分析强大工具的潜力。

英文摘要

Understanding complex interactions between brain regions is critical for early neurodegenerative disease classification such as Alzheimer's Disease (AD) and Parkinson's Disease (PD). While graph-based models are widely used to analyze brain networks, most existing approaches primarily focus on pairwise interactions between directly connected nodes, limiting their ability to capture higher-order dependencies across multiple regions. Although hypergraph-based methods have been proposed to model higher-order relations, many rely on predefined hyperedges or restrict learning to hyperedge weights, reducing flexibility and limiting their capacity to capture multi-resolution structural patterns. In this regard, we introduce an adaptive multi-scale hyperedge learning framework, i.e., MuHL, which constructs hierarchical node features and dynamically learns high-order interactions through continuous hyperedge construction over multi-resolution graph signals. Extensive experiments on multiple brain network benchmarks demonstrate that MuHL consistently improves disease classification performance across different stages, and further identifies key regions of interest (ROIs) and their group-wise interactions from the learned hyperedges that are associated with disease progression, highlighting its potential as a powerful tool for brain network analysis in neurodegenerative disorders.

2606.03305 2026-06-03 cs.AI

The Reliability Gap in Benchmark Auditing: Distribution Shift and Scale as Failure Modes of Contamination Detection

基准审计中的可靠性差距:分布偏移和规模作为污染检测的失败模式

Wojciech Zarzecki, Jan Dubiński, Sebastian Cygert

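AI summary: Studies the reliability of benchmark contamination detection under distribution shift and scale constraints; three leading methods yield correct outcomes in only 199 of 335 evaluations, revealing a systematic reliability gap between controlled validation and practical auditing.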

详情
AI中文摘要

基准污染,即评估示例出现在模型的训练数据中,威胁着LLM评估的有效性。存在用于检测训练数据成员身份的统计工具,但几乎仅在受控学术制度中得到验证:大规模、同质的预训练语料库和透明、单阶段的训练流程。这些方法在实际审计场景中是否仍然可靠尚不清楚。我们识别了两种研究不足的失败模式:分布偏移,当可疑集和验证集违反IID假设时出现;以及规模约束,因为基准比预训练语料库小几个数量级。我们系统评估了三种主流范式:LLM数据集推断、事后数据集推断和CoDeC,涉及来自多个家族(包括Pythia、OLMo 2以及专门的文化和医学LLM)和规模(高达27B)的27个模型。然后我们将分析进一步扩展到前沿行业模型。在335次评估中,只有199次产生正确结果。LLM数据集推断在分布偏移下产生假阳性,事后数据集推断在基准规模下效力不足,而CoDeC仅提供粗略的来源信号,不足以验证单个基准分割。我们的结果揭示了受控验证与实际基准审计之间的系统性可靠性差距,并表明统计检测尚不能取代透明的数据来源。我们开源了我们的基准以供进一步研究。

英文摘要

Benchmark contamination, where evaluation examples appear in a model's training data, threatens the validity of LLM assessment. Statistical tools for detecting training-data membership exist, but have been validated almost exclusively in controlled academic regimes: large, homogeneous pre-training corpora and transparent, single-stage training pipelines. Whether these methods remain reliable in realistic auditing scenarios is unclear. We identify two under-studied failure modes: distribution shift, which arises when suspect and validation sets violate the IID assumption, and scale constraints, which arise because benchmarks are orders of magnitude smaller than pre-training corpora. We systematically evaluate three leading paradigms (LLM Dataset Inference, Post-Hoc Dataset Inference, and CoDeC) across 27 models from multiple families (including Pythia, OLMo 2, and specialised cultural and medical LLMs) and scales (up to 27B). We then extend our analysis to frontier industry models. Across 335 evaluations, only 199 yield correct outcomes. LLM Dataset Inference results in false positives under distribution shift, Post-Hoc Dataset Inference is underpowered at benchmark scale, and CoDeC provides only coarse provenance signals that are insufficient to verify individual benchmark splits. Our results reveal a systematic reliability gap between controlled validation and practical benchmark auditing, and show that statistical detection cannot yet replace transparent data provenance. We open-source our benchmark for further research.

2606.03304 2026-06-03 cs.CL cs.LG

From Script to Semantics: Prompting Strategies for African NLI

从脚本到语义:非洲自然语言推理的提示策略

Anuj Tiwari, Terry Oko-odion, Hannah Nwokocha

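AI summary: Systematically evaluates five prompting strategies on natural language inference in Swahili, Yoruba, and Hausa, finding that contrastive prompting is the most reliable and markedly improves model performance.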

详情
Comments
Accepted at the RAIL Workshop, LREC 2026
AI中文摘要

大型语言模型(LLMs)在多语言环境中的评估日益增多,但它们在低资源非洲语言中的推理行为仍未得到充分探索,尤其是在无微调的纯提示设置下。我们使用AfriXNLI基准,对斯瓦希里语、约鲁巴语和豪萨语的自然语言推理(NLI)提示策略进行了系统研究。我们评估了五种提示策略:基线(零样本)、脚本感知、语言特定、对比和原生标签自翻译(NL-STP),使用了两个中等规模的开源模型(Llama3.2-3B和Gemma3-4B)。为隔离提示设计的影响,我们的研究中排除了少样本示例和思维链推理的影响。我们发现不同策略在类别性能上存在显著差异,某些配置中出现高度中性类崩溃和高预测偏斜。对比提示被证明是最可靠的策略,在不同语言和模型上持续改进,并具有更好的类别行为平衡和整体准确率提升。值得注意的是,精心构建的提示足以击败提供少样本提示和思维链提示的更强大基线。我们发现提示表述对于低资源语言的多语言NLI至关重要,并且语言感知的决策结构可以有效地增强资源受限环境下的鲁棒性。

英文摘要

Large language models (LLMs) are increasingly evaluated in multilingual settings, yet their inference behavior in low-resource African languages remains underexplored, especially under pure prompting without fine-tuning. We present a systematic study of prompting strategies for Natural Language Inference (NLI) in Swahili, Yoruba, and Hausa using the AfriXNLI benchmark. We evaluate five prompting strategies, namely Baseline (zero-shot), Script-Aware, Language Specific, Contrastive, and Native-Label Self-Translation (NL-STP), across two mid-sized open-weight models (Llama3.2-3B and Gemma3-4B). To isolate the effect of prompt design, we exclude few-shot examples and Chain-of-Thought reasoning from our study. We find significant differences in class-wise performance across strategies, with high neutral-class collapse and high prediction skew in some configurations. Contrastive prompting proves to be the most reliable strategy: it improves consistently across languages and models, with better-balanced class behavior and overall accuracy gains. Notably, well-constructed prompts are sufficient to beat more powerful baselines that are provided with few-shot and Chain-of-Thought prompts. We find that prompt formulation is essential to multilingual NLI in low-resource languages and that language-aware decision structuring can meaningfully enhance robustness in resource-constrained settings.

2606.03301 2026-06-03 cs.CL cs.CV

SagaQA: A Multi-hop Reasoning Benchmark for Long-form Narrative Understanding in TV Series

SagaQA:面向电视剧长篇叙事理解的多跳推理基准

Galann Pennec, Zhengyuan Liu, Nicholas Asher, Philippe Muller, Nancy F. Chen

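AI summary: Introduces the SagaQA benchmark, which uses cross-episode multi-hop reasoning tasks to evaluate high-level understanding of multimodal narratives across full TV series, and compares the performance of parallel, sequential, and hybrid planning strategies.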

详情
AI中文摘要

我们介绍了SagaQA,一个用于对完整电视剧进行多跳推理的长视频基准。现有的视频推理基准通常强调对相邻帧或片段的局部理解。SagaQA通过要求对整个电视剧中扩展的多模态叙事进行高层次理解来弥补这一空白。SagaQA的一个显著特征是其推理步骤的粒度。我们的数据集需要长距离推理跳跃来连接完全不同的剧集之间的信息。这要求模型对整个事件和动作进行推理,需要在多模态层面上深入理解剧集的叙事和进展。受近期智能体方法进展的启发,我们进一步研究了不同的规划策略如何处理这种复杂推理。我们将这些方法分为三类——并行规划器、顺序规划器和混合规划器——并评估它们生成连贯且完整推理计划的能力。我们在SagaQA上的结果表明,混合规划器始终能产生更高质量的计划,并在电视剧复杂、高层次叙事理解方面表现出更强的能力。

英文摘要

We introduce SagaQA, a long-form video benchmark for multi-hop reasoning over full-length TV series. Existing video reasoning benchmarks often emphasize local understanding of adjacent frames or clips. SagaQA addresses this gap by requiring high-level comprehension of extended multimodal narratives in entire TV shows. A distinguishing feature of SagaQA is the granularity of its reasoning steps. Our dataset necessitates long-range reasoning hops to connect information across completely different episodes. This requires models to reason over entire events and actions, demanding a deep understanding of the show's narration and progression at a multimodal level. Motivated by recent progress in agentic methods, we further study how different planning strategies handle such complex reasoning. We categorize these approaches into three classes (Parallel, Sequential, and Hybrid planners) and evaluate their ability to generate coherent and complete reasoning plans. Our results on SagaQA suggest that hybrid planners consistently produce higher-quality plans and exhibit stronger capabilities for complex, high-level narrative understanding in TV shows.

2606.03297 2026-06-03 cs.RO

SplitAdapter: Load-Aware Humanoid Loco-Manipulation via Factorized Adaptation

SplitAdapter: 通过因子化自适应的负载感知人形机器人移动操作

Jeonguk Kang, Hanbyel Cho, Sanghyun Kang, Donghan Koo

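AI summary: To address the interaction of load variation and dynamics mismatch when humanoid robots perform loco-manipulation across different loads and heights, proposes SplitAdapter, which freezes a pretrained policy and extends it with load- and dynamics-aware encoders trained with split world-model objectives, GRL-based cross-adversarial regularization, and hierarchical Feature-wise Linear Modulation, markedly improving task success under heavy loads.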

详情
AI中文摘要

人形机器人的移动操作需要在不同物体质量和拾取/放置高度下实现稳定的全身控制。在仿真到现实的迁移中,物体引起的负载变化和机器人侧的动力学不匹配在物理接触期间相互作用,这尤其具有挑战性。现有的基于历史的自适应方法通常将这些因素压缩到单个潜在表示中,这可能在重载操作下削弱鲁棒性。我们提出SplitAdapter(通过因子化自适应的负载感知人形机器人移动操作),该方法冻结预训练的箱子操作策略,并通过使用分割世界模型目标、基于GRL的交叉对抗正则化和分层特征线性调制(FiLM)训练的物体/负载和动力学感知上下文编码器进行扩展。在仿真到仿真实验和实际部署中,SplitAdapter在物体质量为2、4和6千克以及拾取/放置高度为0、30和60厘米的情况下,相对于基础策略和世界模型FiLM基线提高了完整任务成功率,其中在重载条件下改进最大。

英文摘要

Humanoid loco-manipulation requires stable whole-body control under varying object masses and pickup/placement heights. This becomes particularly challenging in sim-to-real transfer, where object-induced load variation and robot-side dynamics mismatch interact during physical contact. Existing history-based adapters often compress these factors into a single latent representation, which can weaken robustness under heavy-load manipulation. We propose SplitAdapter (Load-Aware Humanoid Loco-Manipulation via Factorized Adaptation), which freezes a pretrained box manipulation policy and extends it with object/load and dynamics-aware context encoders trained with split world-model objectives, GRL-based cross-adversarial regularization, and hierarchical Feature-wise Linear Modulation (FiLM). In sim-to-sim experiments and real-world deployment, SplitAdapter improves Full-task success over the base policy and world-model FiLM baselines across object masses of 2, 4, and 6 kg and pickup/placement heights of 0, 30, and 60 cm, with the largest improvements under heavy-load conditions.
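For readers unfamiliar with Feature-wise Linear Modulation, the sketch below shows the basic operation: a context vector is mapped to a per-feature scale and shift, which are then applied to frozen features. The abstract does not say how SplitAdapter arranges its hierarchical FiLM layers, so this covers only a single generic FiLM step, with hypothetical weights and names.

```python
def linear(weights, bias, vector):
    """Affine map W z + b on plain Python lists."""
    return [sum(w * z for w, z in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def film(features, gamma, beta):
    """Feature-wise modulation: gamma * h + beta, element by element."""
    return [g * h + b for h, g, b in zip(features, gamma, beta)]


def film_from_context(features, context, w_gamma, b_gamma, w_beta, b_beta):
    """Predict (gamma, beta) from a context vector, then modulate features."""
    gamma = linear(w_gamma, b_gamma, context)
    beta = linear(w_beta, b_beta, context)
    return film(features, gamma, beta)


# Hypothetical two-feature example with a one-dimensional context,
# for instance an encoded load estimate.
policy_features = [1.0, 2.0]
load_context = [2.0]
modulated = film_from_context(
    policy_features, load_context,
    w_gamma=[[1.0], [0.5]], b_gamma=[0.0, 0.0],   # gamma = [2.0, 1.0]
    w_beta=[[0.0], [1.0]], b_beta=[0.5, 0.0],     # beta  = [0.5, 2.0]
)
```

With gamma fixed at 1 and beta at 0 the modulation is the identity, which is one reason FiLM is a convenient way to adapt a frozen policy without changing its weights.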

2606.03296 2026-06-03 cs.RO

Bridging Predictive Uncertainty and Safe Action: Sample-Conditioned Differentiable Planning for Autonomous Driving

桥接预测不确定性与安全行动:面向自动驾驶的样本条件可微分规划

Chengzhen Meng, Pei Liu, Zhiyu Huang, Chen Lv, Jun Ma

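AI summary: Proposes a sample-conditioned differentiable planning framework in which a diffusion model generates diverse future scenarios that are fed directly into a differentiable planner; a Conditional Value-at-Risk constraint mitigates predictive uncertainty, yielding safe, efficient, and comfortable motion planning for autonomous driving.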

详情
AI中文摘要

复杂、动态且交互的驾驶环境给自动驾驶带来了重大挑战,主要源于周围交通的普遍不确定性。当前系统的一个基本瓶颈是高度表达性的不确定性建模与可解释、安全的运动规划之间的脱节。在本文中,我们提出了一种新颖的样本条件可微分规划框架,通过将扩散生成的未来轨迹显式纳入优化过程来弥合这一差距。我们的方法不是将预测压缩为单一的确定性未来或依赖黑盒端到端架构,而是利用条件扩散模型生成一组多样化的合理未来场景。关键的是,这些样本直接输入可微分规划器,该规划器通过经验条件风险价值尾部风险约束显式缓解预测不确定性。这使得规划器能够优化一条物理可解释的轨迹,该轨迹对罕见但安全关键的交互具有鲁棒性。此外,我们引入了一种场景上下文的有向图表示,在预测有效性和计算效率方面均带来了显著提升。通过在Waymo Open Motion和Argoverse 2数据集上进行的大量开环和闭环评估,我们的框架在安全性、效率和乘坐舒适性方面显著优于最先进的基线方法。

英文摘要

Complex, dynamic, and interactive driving environments pose significant challenges for autonomous driving, primarily due to the pervasive uncertainty of surrounding traffic. A fundamental bottleneck in current systems is the disconnect between highly expressive uncertainty modeling and interpretable, safe motion planning. In this paper, we propose a novel sample-conditioned differentiable planning framework that bridges this gap by explicitly incorporating diffusion-generated future trajectories into the optimization process. Rather than compressing predictions into a single deterministic future or relying on black-box end-to-end architectures, our approach leverages a conditional diffusion model to generate a diverse set of plausible future scenarios. Crucially, these samples are directly fed into a differentiable planner, which explicitly mitigates predictive uncertainty via an empirical Conditional Value-at-Risk (CVaR) tail-risk constraint. This allows the planner to optimize a physically interpretable trajectory that is robust to rare yet safety-critical interactions. Furthermore, we introduce a directed graph representation for scene context that yields substantial improvements in both predictive effectiveness and computational efficiency. Validated through extensive open-loop and closed-loop evaluations on the Waymo Open Motion and Argoverse 2 datasets, our framework significantly outperforms state-of-the-art baselines in safety, efficiency, and ride comfort.
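The empirical CVaR used as a tail-risk constraint can be illustrated with a few lines of code: given one cost per sampled future scenario, it averages the worst fraction of those costs. The sketch below is a plain, non-differentiable version under that standard definition; the function names and the `budget` threshold are assumptions, and the paper's planner evaluates this quantity inside a differentiable optimization rather than as a standalone check.

```python
import math


def empirical_cvar(costs, tail_fraction):
    """Mean of the worst `tail_fraction` of sampled costs (higher is worse)."""
    if not costs:
        raise ValueError("costs must be non-empty")
    if not 0.0 < tail_fraction <= 1.0:
        raise ValueError("tail_fraction must be in (0, 1]")
    # Number of tail samples; the small epsilon guards against float round-up.
    k = max(1, math.ceil(tail_fraction * len(costs) - 1e-9))
    worst = sorted(costs, reverse=True)[:k]
    return sum(worst) / k


def satisfies_tail_risk(costs, tail_fraction, budget):
    """True if the empirical CVaR stays within a hypothetical risk budget."""
    return empirical_cvar(costs, tail_fraction) <= budget


# Ten hypothetical collision-risk costs, one per sampled future scenario.
scenario_costs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
tail_risk = empirical_cvar(scenario_costs, 0.2)    # mean of the two worst
mean_risk = empirical_cvar(scenario_costs, 1.0)    # ordinary mean
```

Constraining the tail average rather than the mean is what makes a candidate trajectory sensitive to rare but severe scenarios among the samples.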

2606.03291 2026-06-03 cs.CL

Multilingual Unlearning in LLMs: Transfer, Dynamics, and Reversibility

大型语言模型中的多语言遗忘:迁移、动态与可逆性

Chaoyi Xiang, Olga Ohrimenko, Benjamin I. P. Rubinstein, Lea Frermann

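AI summary: Studies multilingual unlearning in LLMs by extending the TOFU benchmark to five languages; finds that unlearning transfer varies widely across languages and that unlearning acts mainly on later decoding layers rather than the shared cross-lingual latent space, so a single inference-time steering direction can reverse much of the effect.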

详情
Comments
Accepted at ICML 2026
AI中文摘要

大型语言模型(LLMs)能够记忆敏感事实,这促使了遗忘方法的发展,旨在无需昂贵重训练的情况下移除特定知识。然而,遗忘研究仍然高度以英语为中心。我们通过将TOFU基准扩展到五种语言来研究多语言遗忘,并使用不同的语言组合对模型进行微调、遗忘和查询。我们发现遗忘迁移(即遗忘模型在非遗忘语言中“忘记”事实的能力)高度可变:例如,在共享文字和语系的语言之间最强,并且我们表明遗忘语言能够预测哪些查询语言最可能产生最强的迁移。逐层分析揭示,遗忘在早期层中基本保持了共享的跨语言潜在空间完整,而是主要作用于后期解码层。这表明遗忘并未真正擦除知识,而是引发了表面的抑制。利用这一结构,单一推理时的引导方向可以逆转大部分这种抑制,跨语言恢复50%(Qwen)和90%(Gemma)的遗忘知识。

英文摘要

Large language models (LLMs) can memorize sensitive facts, motivating unlearning methods that remove targeted knowledge without costly retraining. However, unlearning research remains heavily English-centric. We study multilingual unlearning by extending the TOFU benchmark to five languages, and fine-tune, unlearn, and query our models with different permutations of languages. We find that unlearning transfer, the ability of an unlearned model to "forget" facts in languages other than the unlearning language, is highly variable: e.g., it is strongest between languages sharing scripts and families, and we show that the unlearning language predicts which query languages are most likely to yield the strongest transfer. Layer-wise analysis reveals that unlearning leaves the shared cross-lingual latent space largely intact in early layers, instead operating primarily in later decoding layers. This suggests that unlearning does not truly erase knowledge, but rather induces superficial suppression. Exploiting this structure, a single inference-time steering direction reverses much of this suppression across languages, recovering 50% (Qwen) and 90% (Gemma) of the unlearned knowledge.

2606.03290 2026-06-03 cs.LG cs.AI

Message Tuning Outshines Graph Prompt Tuning: A Prismatic Space Perspective

消息调优优于图提示调优:棱镜空间视角

Yancheng Chen, Dun Ma, Shuai Zhang, Yang Liu, Xixun Lin, Xiangyu Zhao, Wenguo Yang, Wei Chen, Chuan Zhou

AI总结 本文提出棱镜空间理论(PS-Theory)量化图提示调优的适应能力上限,并引入消息调优(MTG)方法,通过注入可学习消息原型超越该上限,实验验证其优越性。

详情
Comments
Accepted by ICML 2026
AI中文摘要

基于预训练与自适应范式的图基础模型(GFMs)已成为图学习的研究热点。对于基于GNN的GFMs,图提示调优已成为下游任务的主流自适应方法。尽管近期方法解释了图提示调优为何有效,但如何严格衡量其适应能力仍是一个开放问题。解决该问题对于理解图提示调优的能力极限以及开发更强大的自适应方法至关重要。本文提出棱镜空间理论(PS-Theory),一种新颖的数学框架,用于量化自适应方法的能力,同时重点建立图提示调优适应能力上限。基于所提出的PS-Theory,我们进一步引入GFMs的消息调优(MTG),一种轻量级方法,在GNN骨干网络的每一层注入少量可学习消息原型,以自适应地引导消息融合,无需更新预训练权重。通过我们的PS-Theory,我们证明MTG的适应能力可以超过图提示调优的理论上限。大量实验表明,MTG在多个基准数据集上始终优于图提示基线,为我们的理论发现提供了强有力的实证支持。

英文摘要

Graph Foundation Models (GFMs), built upon the Pre-training and Adaptation paradigm, have emerged as a research hotspot in graph learning. For GNN-based GFMs, graph prompt tuning has become the prevailing adaptation method for downstream tasks. Although recent methods explain why graph prompt tuning works, how to rigorously measure its adaptation capacity remains an open problem. Addressing this problem is critical for understanding the capability limits of graph prompt tuning and for developing more powerful adaptation methods. In this paper, we propose Prismatic Space Theory (PS-Theory), a novel mathematical framework to quantify the capacity of adaptation methods, while focusing on establishing the upper bound for the adaptation capacity of graph prompt tuning. Building upon the proposed PS-Theory, we further introduce Message Tuning for GFMs (MTG), a lightweight approach that injects a small set of learnable message prototypes into each layer of the GNN backbone to adaptively guide message fusion without updating pre-trained weights. Through our PS-Theory, we prove that the adaptation capacity of MTG can exceed the theoretical upper bound of graph prompt tuning. Extensive experiments demonstrate that MTG consistently outperforms graph prompt baselines across diverse benchmark datasets, providing strong empirical support for our theoretical findings.