arXivDaily: arXiv daily academic digest, updated Monday to Friday
2606.00872 2026-06-02 cs.CV

Images as Tables: In-Context Learning with TabPFN for Low-Data Detection of AI-Generated Images

Jan Philip Walter, Shashank Agnihotri, Margret Keuper

AI summary: Proposes converting images into tabular form: features are extracted with a frozen DINOv3 backbone and classified by TabPFN through in-context learning. The approach detects AI-generated images effectively in the low-data regime, where it outperforms existing methods.

Comments
Accepted as a Spotlight Oral at the ICML 2026 Workshop Foundation Models for Structured Data. *Equal Contribution
Abstract

AI-generated image detection is a moving-target problem: detectors trained on one generator often fail when a new generator appears, and only a few labeled examples are available. We study a simple image-to-table formulation for this regime, where each image is encoded by a frozen DINOv3 backbone, its CLS feature is reduced to a 500-dimensional structured row with PCA, and TabPFN performs real/fake classification by in-context tabular inference rather than task-specific classifier training. This turns fake-image detection into low-data structured prediction over learned visual features, making detector adaptation depend on the labeled context set instead of gradient-based fine-tuning. On GenImage, LATTE, a recent state-of-the-art detector, remains stronger when many labeled samples from all generators are available, by 7.4% in the largest pooled setting, but DINOv3-PCA-TabPFN is stronger in the practically important low-data regime, outperforming LATTE by up to 8.2%, and in transfer settings where the detector must generalize from one generator to another. These results position tabular foundation models as a strong complementary adaptation mechanism for image forensics, shifting adaptation from detector retraining to lightweight in-context updates with a small labeled set of examples. Code URL: https://github.com/jpwalter30/Towards-Generalizable-Detection-of-AI-Generated-Images
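
The image-to-table idea can be sketched as follows. This is a minimal illustration, not the authors' code: the feature rows are toy 2-D vectors standing in for PCA-reduced DINOv3 CLS features, and a nearest-neighbour vote stands in for TabPFN (both are assumptions made here). What it shows is that adapting to a new generator means appending labeled rows to the context set, with no gradient updates.

```python
import math

def predict_in_context(context_rows, context_labels, query_row, k=3):
    """Stand-in for in-context tabular inference: k-nearest-neighbour vote.

    TabPFN itself is a pretrained transformer; k-NN is used here only
    because it also predicts from a labeled context set without training.
    """
    dists = sorted(
        (math.dist(row, query_row), label)
        for row, label in zip(context_rows, context_labels)
    )
    votes = [label for _, label in dists[:k]]
    return max(set(votes), key=votes.count)

# Toy "table": one row per image, columns are hypothetical reduced features.
context_rows = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0), (0.9, 1.0), (1.0, 0.8), (0.8, 0.9)]
context_labels = ["real", "real", "real", "fake", "fake", "fake"]

# A fake from a new generator that lands in a different feature region.
new_fake = (0.1, 0.6)
before = predict_in_context(context_rows, context_labels, new_fake)

# Adaptation: append a few labeled examples of the new generator.
context_rows += [(0.0, 0.7), (0.2, 0.65), (0.15, 0.5)]
context_labels += ["fake", "fake", "fake"]
after = predict_in_context(context_rows, context_labels, new_fake)
```

In this toy setup the query is labeled "real" before the context update and "fake" afterwards, which mirrors the lightweight in-context update the abstract describes.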

2606.00871 2026-06-02 cs.CV cs.AI

Benchmarks for Vision-Language Models in Urban Perception Should Be Reliability-Aware and Negotiated

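Rashid Mushkani

AI summary: Argues that benchmarks of vision-language models for urban perception should treat disagreement and abstention as measurement outcomes, report inter-annotator reliability, and treat the label space and scoring policy as negotiable artifacts.

Comments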
To appear in the Proceedings of the 43rd International Conference on Machine Learning (ICML 2026)
Abstract

Vision-language models (VLMs) are increasingly used to generate structured descriptions of street-level imagery for tasks such as streetscape auditing, mapping, and public consultation. These uses combine observable attributes with appraisal categories, and the human targets are often distributions of judgments with disagreement and explicit non-response. This paper argues that benchmarking VLMs for urban perception should treat disagreement and abstention as measurement outcomes, report inter-annotator reliability alongside model alignment, and treat the label space and scoring policy as negotiable artifacts when outputs are intended to inform urban governance. We ground the argument in a benchmark of 100 Montreal street scenes annotated along 30 dimensions by 12 participants from seven community organizations, and in a deterministic zero-shot evaluation of seven VLMs. Across dimensions, model agreement with human consensus co-varies with dimension-level human reliability, and for the appraisal dimension Overall Impression models and annotators exhibit distributional mismatch including different rates of Not applicable. We close with actions for benchmark creators, model developers, and institutions to make uncertainty and benchmark assumptions visible in evaluation reports.
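
A small sketch of what reporting reliability and abstention alongside alignment could look like. The abstract does not name a reliability statistic, so mean pairwise agreement is used here as an assumption, and all ratings below are invented for illustration.

```python
from collections import Counter
from itertools import combinations

def pairwise_agreement(labels):
    """Fraction of annotator pairs that gave the same label to one item."""
    pairs = list(combinations(labels, 2))
    return sum(a == b for a, b in pairs) / len(pairs)

def label_distribution(labels):
    """Normalized label counts, keeping 'Not applicable' as its own outcome."""
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}

# Hypothetical ratings of one scene on one dimension by six annotators.
human = ["positive", "positive", "neutral", "Not applicable", "positive", "neutral"]
# Hypothetical outputs from six models for the same scene and dimension.
model = ["positive"] * 6

human_agreement = pairwise_agreement(human)
human_dist = label_distribution(human)
model_dist = label_distribution(model)

# Gap in abstention rate between annotators and models.
na_gap = human_dist.get("Not applicable", 0.0) - model_dist.get("Not applicable", 0.0)
```

Keeping "Not applicable" as a label instead of filtering it out is what lets the abstention gap show up in the report.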

2606.00869 2026-06-02 cs.LG

Enhancing LLM Metacognition via Cognitive Pairwise Training

Weitao Li, Hao Zhou, Xuanyu Lei, Fandong Meng, Yuanhang Liu, Jingyi Ren, Ante Wang, Xiaolong Wang, Yuanchi Zhang, Fuwen Luo, Guangwen Yang, Lin Gan, Weizhi Ma, Yang Liu

Affiliations: National Engineering Laboratory for Intelligent Information Processing, Academy of Mathematics and Physics, Chinese Academy of Sciences; University of Science and Technology of China

AI summary: Proposes Cognitive Pairwise Training (CPT), which learns to distinguish reliable from unreliable reasoning through pairwise comparisons of reasoning traces, improving the trade-off between reasoning and metacognition in LLMs.

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become central to LLM reasoning, but its outcome-level rewards can make models more willing to give confident answers when evidence or reasoning is unreliable. Existing SFT or RL methods mainly teach LLMs to refuse or express uncertainty at the response level, which can overfit abstention behavior rather than improve reasoning reliability. To address this limitation, we propose Cognitive Pairwise Training (CPT), a cognitive mid-training alignment stage that turns pairwise comparisons over reasoning traces into a reusable alignment signal. By learning to distinguish trustworthy from flawed reasoning, CPT encourages the model to internalize a reasoning-quality discrimination boundary rather than memorize surface refusal patterns. Across five model scales and three model families, CPT improves the reasoning-metacognition trade-off. At 14B, CPT+RL outperforms the standard SFT+RL pipeline by +2.2 math-average points and +5.2 abstention-F1 points. Further analyses show that CPT improves trace quality and exhibits strong robustness and scalability across evaluation and training settings. Code and models are released at https://github.com/Tsinghua-dhy/CPT.

2606.00857 2026-06-02 cs.RO cs.AI

From Cues to Horizons: Dynamic Risk Horizon Profiling for Trajectory Prediction

Xinyi Ning, Zilin Bian, Dachuan Zuo, Semiha Ergan, Kaan Ozbay

Affiliations: Department of Civil and Urban Engineering, New York University; Department of Civil Engineering Technology and Environmental Management Safety, Rochester Institute of Technology

AI summary: Proposes a risk horizon profiling (RHP) module that models the future risk distribution with a continuous, learnable potential field to improve trajectory prediction accuracy, reducing 5 s RMSE by 25.0% on highD and 5 s minFDE by 29.1% on SHRP2.

Comments
11 pages, 7 figures, submitted to IEEE Transactions on Intelligent Transportation Systems (T-ITS)
Abstract

Accurate and reliable vehicle trajectory prediction is essential for safe autonomous driving. Recent studies have incorporated safety risk into trajectory prediction to quantify dangers posed by surrounding agents. However, most risk-aware approaches use past risk information as a secondary signal to help guide decisions, overlooking its future evolution and uncertainty. In this paper, we propose a risk horizon profiling (RHP) module that incorporates a continuous, learnable potential field model for risk-aware trajectory prediction. The RHP module calculates the spatial-temporal proximity of surrounding objects to profile risk distributions across future horizons, which supports better trajectory prediction by adaptively identifying what human drivers perceive as critical moments. We evaluate our method on two datasets from different driving settings, highD for highway corridors and SHRP2 for urban streets, which cover diverse risk scenarios including safe, near-crash, and crash events. Compared to the baseline methods, our framework achieves a 25.0% reduction in 5s RMSE on the highD dataset and a 29.1% reduction in 5s minFDE on SHRP2. These results indicate strong performance for both short and long horizon prediction and robust generalization across highway and urban scenarios. The proposed method enables more realistic AV path planning and strategic selection, thereby supporting safer autonomous driving and more advanced driver-assistance systems. The source code for this work is available at: https://github.com/bilab-nyu/RHP
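
For readers unfamiliar with the two reported metrics, here is a minimal sketch of how RMSE at a fixed horizon and minFDE are commonly computed. The trajectories are toy data, and the exact evaluation protocol used in the paper (sampling rate, number of candidate modes) is not given in the abstract, so those details are assumptions.

```python
import math

def rmse_at_horizon(predictions, ground_truths, step):
    """Root-mean-square displacement error at one time step over many samples."""
    squared = [
        (p[step][0] - g[step][0]) ** 2 + (p[step][1] - g[step][1]) ** 2
        for p, g in zip(predictions, ground_truths)
    ]
    return math.sqrt(sum(squared) / len(squared))

def min_fde(candidate_trajectories, ground_truth):
    """Smallest final-point distance among K candidate trajectories."""
    final_gt = ground_truth[-1]
    return min(math.dist(traj[-1], final_gt) for traj in candidate_trajectories)

def relative_reduction(baseline, ours):
    """Relative reduction; 0.25 means the error is 25% lower than the baseline."""
    return (baseline - ours) / baseline

# Toy (x, y) positions in metres; the last point plays the role of the 5 s horizon.
gt = [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0)]
pred_a = [(0.0, 0.0), (10.0, 1.0), (23.0, 4.0)]  # final displacement error 5.0
pred_b = [(0.0, 0.0), (9.0, 0.0), (20.0, 1.0)]   # final displacement error 1.0

rmse_5s = rmse_at_horizon([pred_a, pred_b], [gt, gt], step=-1)
best_fde = min_fde([pred_a, pred_b], gt)
```

RMSE averages over samples, while minFDE scores only the best of several candidate futures, so the two numbers are not directly comparable across datasets.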

2606.00852 2026-06-02 cs.CV cs.AI cs.LG

RefDiffNet: Learning to Expose Subtle PCB Defects Before Detection

Vinay Edula, Nilesh Badwe, Priyanka Bagade

Affiliations: Department of Computer Science and Engineering, Indian Institute of Technology Kanpur; Department of Materials Science and Engineering, Indian Institute of Technology Kanpur

AI summary: Proposes RefDiffNet, a lightweight plug-and-play input enhancement module that uses a defect-free reference image to highlight defective regions, improving the performance of downstream detectors on PCB defect detection.

Abstract

Printed circuit board (PCB) defect detection is challenging because many defects are small and difficult to distinguish from complex background patterns. Most deep learning-based PCB inspection methods rely only on the inspected PCB image for defect detection, ignoring the defect-free reference image that encodes the expected layout of traces, pads, and other PCB structures. In this work, we propose RefDiffNet, a lightweight plug-and-play input enhancement block placed before the detector backbone to enhance the image before defect detection. RefDiffNet brings one proven idea from classical inspection into the deep learning era, using a defect-free reference image to reveal defects. RefDiffNet compares the defective image with the aligned reference, captures structural changes relative to the reference, and uses a lightweight encoder to output the original image with defective regions highlighted, thereby making the downstream detector's task easier. Results on HRIPCB and DeepPCB show that RefDiffNet consistently improves performance across detector families, including one-stage detectors from YOLOv8 to YOLOv26, the transformer-based RT-DETR, and the two-stage Faster R-CNN. It achieves up to 18% relative mAP50:95 gain with negligible overhead, introducing only 0.004 - 0.005M additional parameters and 0.7 - 0.8 GFLOPs, amounting to at most 0.25% of the parameter count of any evaluated detector. Results establish RefDiffNet as a lightweight, plug-and-play, detector-agnostic input enhancement module that substantially improves PCB defect detection with minimal computational cost.
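
The classical idea the abstract builds on, comparing an inspected board against an aligned defect-free reference, can be sketched with plain image differencing. This is only an illustration under simplifying assumptions: RefDiffNet uses a learned lightweight encoder, whereas the fixed rule and the `gain` parameter below are hypothetical stand-ins.

```python
def difference_map(inspected, reference):
    """Per-pixel absolute difference between two aligned grayscale images."""
    return [
        [abs(a - b) for a, b in zip(row_i, row_r)]
        for row_i, row_r in zip(inspected, reference)
    ]

def highlight(inspected, diff, gain=1.0, max_value=255):
    """Add the scaled difference back onto the inspected image, clipped to range."""
    return [
        [min(max_value, int(p + gain * d)) for p, d in zip(row_p, row_d)]
        for row_p, row_d in zip(inspected, diff)
    ]

# Toy 3x3 grayscale patches; the centre pixel of the inspected board deviates.
reference = [[100, 100, 100], [100, 100, 100], [100, 100, 100]]
inspected = [[100, 100, 100], [100, 160, 100], [100, 100, 100]]

diff = difference_map(inspected, reference)
enhanced = highlight(inspected, diff)
```

The output keeps the original image and brightens only the pixels that deviate from the reference, which is the kind of input a downstream detector would then receive.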

2606.00851 2026-06-02 cs.SD cs.CL cs.HC cs.LG eess.AS

Sympatheia: Emotionally Adaptive Voice Assistant with Continuous Affect Conditioning

Sukru Samet Dindar, Riki Shimizu, Xilin Jiang, Nima Mesgarani

Affiliations: Department of Electrical Engineering, Columbia University

AI summary: Proposes Sympatheia, a spoken dialogue framework that infers affect from the user's speech and combines it with a continuous valence-arousal control signal to produce emotionally adaptive responses, outperforming baseline models.

Abstract

Empathetic spoken dialogue systems must infer a user's emotional state to respond appropriately, yet everyday speech often carries weak, neutral, or ambiguous affective cues. To address this, we introduce Sympatheia, a speech-to-speech dialogue framework conditioned on affect inferred from the user's speech and, when available, explicit affect specifications provided as a continuous valence-arousal (VA) control signal by a multimodal sensing module or user interface. To train our model, we construct Sympatheia-18k, an emotion-conditioned synthetic spoken dialogue corpus with 12 emotion anchors. This dataset includes an emotional split for learning affective speech behavior, and a neutral split that pairs emotionally neutral queries with multiple emotion-conditioned responses to isolate explicit emotion control in emotionally ambiguous cases. Empirical results show that Sympatheia outperforms speech conversational baselines in generating responses whose semantic content and spoken delivery are both emotionally appropriate. We further show that the same VA interface can integrate emotion estimates from diverse sensing modules, including facial expression, biosignals, and textual affect descriptions, improving response alignment when speech alone provides limited emotional evidence. These results suggest that continuous affect conditioning is an effective practical step for building emotionally adaptive voice assistants.

2606.00846 2026-06-02 cs.LG

CUPID in the Model Zoo: Online Matchmaking for Selecting Your Dream LLM

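Son Nguyen, Xinyuan Liu, Ransalu Senanayake

Affiliations: University of California, Berkeley

AI summary: Proposes an active learning framework based on a dueling bandit algorithm that iteratively selects pairs of LLMs and collects user feedback to efficiently match user preferences with model capabilities.

Comments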
38 pages, 11 figures
Abstract

Users increasingly face the challenge of selecting an appropriate LLM for a given task from a rapidly growing pool of LLMs, each with distinct but often opaque latent properties. Compounding this challenge, users may lack the vocabulary or awareness to explicitly articulate the characteristics they value in an LLM's responses or deployment. We propose an interaction-efficient active learning framework in which a dueling bandit algorithm iteratively selects pairs of LLMs, collects user feedback about their responses, and updates its belief about the user's latent preferences. We introduce a novel belief-aware upper confidence bound strategy that balances exploration of the model pool with exploitation of inferred preferences, enabling efficient alignment between user needs and LLM capabilities under user-specified cost and time budgets. Through diverse experiments on LLMs and human studies, we experimentally verify that our model can efficiently match well-aligned LLMs to users at a lower cost.
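
The select-compare-update loop described above can be sketched with a generic UCB-style dueling bandit. This is not the paper's belief-aware strategy, whose details the abstract does not give. The hidden utilities and the deterministic simulated user are assumptions made so the example is reproducible; real user feedback would be noisy.

```python
import math

def ucb_score(wins, plays, total, c=1.0):
    """Empirical win rate plus an exploration bonus; unplayed models go first."""
    if plays == 0:
        return float("inf")
    return wins / plays + c * math.sqrt(math.log(total + 1) / plays)

def run_duels(utilities, budget):
    """Duel the two highest-scoring models each round; return (best, plays)."""
    n = len(utilities)
    wins, plays = [0] * n, [0] * n
    for t in range(budget):
        ranked = sorted(
            range(n),
            key=lambda i: ucb_score(wins[i], plays[i], t),
            reverse=True,
        )
        a, b = ranked[0], ranked[1]
        # Simulated user: always prefers the model with higher hidden utility.
        winner = a if utilities[a] >= utilities[b] else b
        wins[winner] += 1
        plays[a] += 1
        plays[b] += 1
    best = max(range(n), key=lambda i: wins[i] / plays[i] if plays[i] else 0.0)
    return best, plays

hidden_utilities = [0.2, 0.9, 0.5, 0.7]  # hypothetical user preference per model
best_model, plays = run_duels(hidden_utilities, budget=10)
```

The `budget` argument plays the role of the user-specified interaction budget: each round costs one pairwise comparison from the user.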

2606.00840 2026-06-02 cs.AI

Certificate-Guided Evaluation of Reinforcement Learning Generalization

Vignesh Subramanian, Đorđe Žikelić, Suguman Bansal

Affiliations: School of Computer Science, Georgia Institute of Technology; School of Computing and Information Systems, Singapore Management University

AI summary: Proposes a logic-driven framework that uses a neural certificate function to validate how well reinforcement learning algorithms generalize to unseen tasks, and shows that the certificate violation rate is negatively correlated with the success rate on test tasks.

Abstract

This work presents a logic-driven framework to evaluate the performance of reinforcement learning (RL) algorithms in their ability to generalize to unseen tasks. Our framework defines a family of inductive reach-avoid tasks, characterized by structural similarities in task dynamics, enabling evaluation of generalization capabilities. We introduce a neural certificate function that validates trajectories generated by RL algorithms by enforcing key conditions, thereby serving as a litmus test for RL generalization. We empirically demonstrate our method's capability in certifying generalization for several state-of-the-art generalizable RL algorithms on challenging continuous environments. Our results show that a lower percentage of certificate function violations correlates with a higher number of test tasks successfully solved, highlighting the effectiveness of our framework in evaluating and distinguishing generalization capabilities of RL algorithms. This work provides a principled approach for benchmarking RL generalization.
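
To make the "percentage of certificate function violations" concrete, here is a rough sketch. The abstract does not list the certificate conditions, so the single condition used below (the certificate value must strictly decrease along the trajectory) is an assumption, and a hand-written distance-to-goal function stands in for the neural certificate.

```python
def certificate(state, goal=10.0):
    """Hypothetical certificate: distance to the goal on a 1-D line."""
    return abs(goal - state)

def violation_rate(trajectory, eps=1e-9):
    """Fraction of transitions where the certificate fails to decrease."""
    steps = list(zip(trajectory, trajectory[1:]))
    violations = sum(
        certificate(nxt) >= certificate(cur) - eps for cur, nxt in steps
    )
    return violations / len(steps)

# Toy 1-D trajectories produced by two hypothetical policies.
good_policy_traj = [0.0, 2.0, 4.0, 7.0, 10.0]   # steadily approaches the goal
poor_policy_traj = [0.0, 2.0, 1.0, 3.0, 2.0]    # repeatedly backtracks

good_rate = violation_rate(good_policy_traj)
poor_rate = violation_rate(poor_policy_traj)
```

In this toy case the policy that reaches the goal has no violations, while the backtracking policy violates the condition on half of its transitions.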

2606.00838 2026-06-02 cs.AI

Decoupled Behavioral Cloning for Scalable Inductive Generalization in RL from Specifications

Vignesh Subramanian, Subhajit Roy, Suguman Bansal

Affiliations: School of Computer Science, Georgia Institute of Technology, USA; Department of Computer Science and Engineering, Indian Institute of Technology Kanpur, India

AI summary: Proposes DIBS, which decouples learning task-specific policies from learning the evolution function and replaces noisy reward aggregation with behavioral cloning, improving training stability and zero-shot generalization.

Abstract

Inductive generalization is a framework for reinforcement learning (RL) generalization in which inductively related task instances admit inductively related policies. Prior work captures this structure via a higher-order policy-evolution function learned directly with RL, but suffers from poor training scalability: as training tasks grow, aggregated reward feedback becomes noisy and conflicting, destabilizing training and weakening generalization. We propose DIBS, a decoupled behavioral cloning approach that separates learning task-specific policies from learning the evolution function. We first learn individual teacher policies per task via standard RL, then fit the evolution function via behavioral cloning on teacher-labeled state-action pairs. This replaces noisy reward aggregation with dense, stable supervision. DIBS achieves significant improvements in both training stability and zero-shot generalization against existing RL and meta-RL algorithms.

2606.00837 2026-06-02 cs.RO cs.LG

Coarse-to-Fine Compositional Diffusion for Long-Horizon Planning

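Byoungwoo Park, Utkarsh A. Mishra, Jaemoo Choi, Juho Lee, Yongxin Chen

Affiliations: KAIST; Georgia Institute of Technology

AI summary: Proposes Coarse-to-Fine Compositional Diffusion (CoFi), which first forms a global scaffold and then refines local detail, improving global coherence and local quality in long-horizon robotic planning, panoramic image generation, and long video generation while requiring 2-8x fewer denoiser evaluations.

Comments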
Project page: https://cofi-diffusion.github.io
Abstract

Diffusion models provide strong priors for generating structured data, but many tasks require outputs beyond the scale on which these models are typically trained. Compositional generation addresses this by composing overlapping local plans from a pretrained short-horizon prior into a long-horizon output. However, standard composition primarily enforces agreement between neighboring local plans, yielding local consistency without directly specifying the global structure of the full composition. As a result, locally compatible plans may still form an implausible route, task sequence, or temporal evolution. Existing methods improve global coherence by repeatedly propagating local consistency signals or by adding inference-time optimization, but these procedures become expensive as the number or dimensionality of local plans increases. We propose Coarse-to-Fine Compositional Diffusion (CoFi), an inference-time sampler that separates global structure formation from local detail refinement. CoFi first aligns local denoised estimates around a shared coarse structure, producing a global scaffold that captures the long-range task-level arrangement. It then diffuses this scaffold to an intermediate noise level and denoises it with the same pretrained local prior, restoring local fine structure while preserving the scaffold-induced global coherence. Across long-horizon robotic planning, panoramic image generation, and long video generation, CoFi not only improves both global coherence and local sample quality over prior compositional baselines, but also requires 2-8x fewer denoiser evaluations.
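
The step of diffusing the scaffold to an intermediate noise level can be illustrated with the standard variance-preserving forward process, x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps. Whether CoFi uses exactly this parameterization is an assumption; the scaffold values and the chosen `alpha_bar` are toy inputs.

```python
import math
import random

def diffuse_to_level(x0, alpha_bar, rng):
    """Forward-noise a clean sample to the level given by alpha_bar in (0, 1]."""
    signal = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)
    return [signal * x + noise * rng.gauss(0.0, 1.0) for x in x0]

rng = random.Random(0)
scaffold = [0.0, 1.0, 2.0, 3.0]  # toy coarse plan

# Intermediate level: coarse structure is partly kept, fine detail is noised.
noisy = diffuse_to_level(scaffold, alpha_bar=0.5, rng=rng)

# alpha_bar = 1.0 means no noise is added, so the scaffold is returned as-is.
unchanged = diffuse_to_level(scaffold, alpha_bar=1.0, rng=rng)
```

Choosing an intermediate level is a trade-off: too little noise leaves no room to restore local detail, while too much noise erases the global scaffold.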

2606.00832 2026-06-02 cs.CL

Momento: Evaluating Persistent Memory and Reasoning with Multi-Session Agentic Conversations

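Adril Putra Merin, David Anugraha, Ayu Purwarianti, Genta Indra Winata

Affiliations: Institut Teknologi Bandung; Stanford University; Capital One

AI summary: Proposes the Momento benchmark, which uses multi-session service environments to evaluate how well agents use persistent memory and reasoning across sessions to complete personalized tasks, and finds that current agents perform poorly because they misestimate user state.

Comments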
Preprint
Abstract

Recent advances in agentic AI have enabled agents to complete complex tasks through tool use, reasoning, and multi-step planning. Yet existing benchmarks evaluate agents within a single session, ignoring past actions, stated preferences, and prior decisions that agents must integrate to fulfill personalized user goals. We introduce Momento, a benchmark for persistent agentic task completion in multi-session service environments, requiring agents to take consequential, tool-mediated actions while resolving temporal dependencies and evolving user goals across sessions. Experimental results reveal that current agents fail primarily through misestimation of user state, treating prior session history as a reliable proxy for current context rather than stale information requiring re-validation, highlighting a substantial gap between current agent capabilities and realistic long-horizon human-agent interaction.

2606.00831 2026-06-02 cs.AI cs.LG

Subliminal Learning is a LoRA Artifact

Todd Nief, Harvey Yiyun Fu, Mark Muchane, Ari Holtzman

Affiliations: Department of Computer Science, University of Chicago; Data Science Institute, University of Chicago

AI summary: Finds that subliminal learning is an artifact of LoRA finetuning: trait transmission follows an inverted U-shaped relationship with LoRA rank and disappears under full finetuning, indicating that the phenomenon depends on the finetuning and evaluation context.

Abstract

Subliminal learning is a phenomenon where language models can transmit behavioral traits to other models through seemingly innocuous data (Cloud et al., 2025). In subliminal learning, a teacher model with a behavioral trait (e.g. obsession with cats) can transmit this cat obsession to a student model finetuned only on numerical sequences generated by the teacher. In this paper, we ask: how does this unexpected behavioral transmission occur? We show that subliminal learning is a LoRA artifact. When subliminal learning occurs, transmission has an inverted U-shaped relationship with LoRA rank; it also disappears with full finetuning. We show that subliminal learning is highly dependent on the context seen during finetuning and evaluation. For example, a Qwen model with the default system prompt during finetuning ("You are Qwen, created by Alibaba Cloud. You are a helpful assistant.") does not show subliminal learning during generation when no system prompt is included. We further demonstrate that subliminal behavior is localized to computation at tokens seen during both finetuning and evaluation (e.g. the model's default system prompt, the standard chat template tokens, etc.). Overall, subliminal learning seems to be a fragile artifact of LoRA hyperparameters and finetuning context, making it an unstable channel for behavioral transmission.
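
As background for the rank sweep, the sketch below shows how the number of trainable parameters in a standard LoRA update (two low-rank factors per weight matrix) scales with rank, compared with full finetuning of the same matrix. The hidden size is a hypothetical value, and the code does not reproduce the inverted U-shaped transmission result, only the capacity axis along which it is measured.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)

def full_trainable_params(d_in, d_out):
    """Parameters updated when the whole weight matrix is finetuned."""
    return d_in * d_out

d_model = 4096  # hypothetical hidden size of one square projection matrix

# Fraction of the full matrix's parameters that LoRA trains at each rank.
fractions = {
    rank: lora_trainable_params(d_model, d_model, rank)
    / full_trainable_params(d_model, d_model)
    for rank in (1, 8, 64, 512)
}
```

At rank 8 the adapter trains under half a percent of the matrix, while at rank 512 it trains a quarter, so a rank sweep spans a wide range of update capacity between tiny adapters and full finetuning.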

2606.00829 2026-06-02 cs.CV

The Right Inference Strategy Is All You Need: Nearly Training-Free Domain-Wise Inference for EgoCross Challenge

Leyi Wu, Yifan Zhao, Jinjie Zhang, Yinchuan Li, Ying-Cong Chen

Affiliations: HKUST(GZ); HKUST; Knowin

AI summary: Multimodal large models perform poorly on egocentric video question answering under severe domain shift in the source-limited track of the EgoCross challenge. To address this, the authors propose a domain-wise inference strategy that designs separate input, prompting, and answer-mapping procedures for each of the four target domains, substantially improving on the baseline model without additional training.

Abstract

EgoCross evaluates multimodal large language models on egocentric video question answering under substantial domain shift, where test videos come from surgery, industrial assembly, extreme sports, and animal-mounted cameras rather than ordinary daily-life scenes. In the source-limited track, the base model is fixed to Qwen3-VL-4B, while the official task-specific support set contains only 20 training samples. This setting makes the challenge less about model scaling and more about exposing the right visual, temporal, and answer-selection cues to a constrained model. Our key observation is that the frozen baseline model is not simply incapable of these rare scenarios; rather, it often fails to transfer its existing visual-language knowledge to the new task format without an appropriate interface. We therefore use a domain-wise inference strategy that treats the four target domains separately and designs different input, prompting, and answer-mapping procedures according to each domain's task characteristics. These strategies make the rare egocentric scenes more interpretable to the VLM by emphasizing the cues that matter for each domain. The resulting system is nearly training-free: surgery and animal questions are answered with the base Qwen3-VL-4B model, while XSports and industry use only the official SFT checkpoint trained for two epochs on the provided 20 training samples. On the final evaluation, this simple strategy reaches 66.98% overall accuracy, suggesting that careful domain-aware inference can compensate for limited base-model strength and recover much of the ability already present in the baseline model.
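
The routing described in the abstract, with base-model inference for surgery and animal questions and the official SFT checkpoint for XSports and industry, can be sketched as a dispatch table. The checkpoint identifier, the domain keys, and the `map_answer` helper are hypothetical names introduced here; the actual prompts and answer-mapping rules are not given in the abstract.

```python
BASE_MODEL = "Qwen3-VL-4B"                 # base model named in the abstract
SFT_CHECKPOINT = "official-sft-checkpoint"  # placeholder name, not a real identifier

# Domain -> model used at inference time, as stated in the abstract.
ROUTES = {
    "surgery": BASE_MODEL,
    "animal": BASE_MODEL,
    "xsports": SFT_CHECKPOINT,
    "industry": SFT_CHECKPOINT,
}

def route(domain):
    """Pick the model for a question based on its target domain."""
    try:
        return ROUTES[domain.lower()]
    except KeyError:
        raise ValueError(f"unknown domain: {domain!r}") from None

def map_answer(model_output, options):
    """Hypothetical answer mapping: first option whose text appears in the output."""
    lowered = model_output.lower()
    for letter, text in options.items():
        if text.lower() in lowered:
            return letter
    return None

model_for_surgery = route("Surgery")
picked = map_answer("The surgeon is holding the forceps.", {"A": "scalpel", "B": "forceps"})
```

Keeping the routing in one table makes the per-domain design explicit: each domain can get its own prompt template and answer mapping without touching the others.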

2606.00828 2026-06-02 cs.CV

RoboStressBench: Benchmarking VLM Robustness to Physical Visual Stress in Embodied Scenes

Leyi Wu, Yifan Zhao, Jinjie Zhang, Suzeyu Chen, Wosong Chen, Zhifei Chen, Tianshuo Xu, Qingchun He, Hongxin Hu, Haojian Huang, Yangkai Wei, Wenqian Li, Yinchuan Li, Ying-Cong Chen

Affiliations: HKUST(GZ)

AI summary: Proposes RoboStressBench, which takes an inverse graphics perspective to decompose visual stress into four physical dimensions (material, viewpoint, lighting, and geometry), systematically evaluates VLM robustness under real physical stress, and introduces a stress-aware solver that improves performance in high-stress scenes.

Abstract

Vision-Language Models (VLMs) have shown strong visual understanding and are increasingly deployed in embodied AI systems, where reliable perception under real conditions is essential. However, existing benchmarks assess VLMs using clean images or isolated perturbations rather than stresses caused by physical scene formation. This design has two limitations: it covers only a narrow subset of everyday visual stresses, and some perturbations rarely appear in realistic embodied scenes. This gap raises a fundamental question: how can we define visual stress in a principled way that captures the diverse factors encountered in physical environments? To address this question, we formulate visual perception from an inverse graphics perspective and introduce RoboStressBench, a benchmark for evaluating VLM robustness to physical visual stress in embodied scenes. Inspired by the physical rendering equation, RoboStressBench decomposes visual stress into four physically grounded dimensions: Material (M), Viewpoint (V), Lighting (L), and Geometry (G). This design enables RoboStressBench to cover a broad range of visual stresses in real-world environments, while allowing controlled analysis of their effects on VLM capabilities such as visual recognition, reasoning, and planning. Through comprehensive evaluations of state-of-the-art VLMs, we identify stress-specific failure modes and reveal that different physical factors degrade different embodied capabilities, which are often obscured by aggregate accuracy. We further introduce a stress-aware agentic solver that detects visual stressors and invokes visual-editing skills before reasoning, improving robustness in high-stress scenarios. Overall, RoboStressBench provides a principled evaluation framework for diagnosing and improving VLM perception under real-world physical stress, supporting the development of more reliable embodied AI systems.
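
The point that aggregate accuracy can hide stress-specific failures is easy to see with a per-dimension breakdown. The evaluation records below are invented for illustration; only the four dimension codes (M, V, L, G) come from the abstract.

```python
from collections import defaultdict

def accuracy_by_dimension(records):
    """Accuracy per stress dimension from (dimension, is_correct) records."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for dimension, is_correct in records:
        totals[dimension] += 1
        correct[dimension] += int(is_correct)
    return {d: correct[d] / totals[d] for d in totals}

# Hypothetical results: 10 questions per dimension.
records = (
    [("M", True)] * 9 + [("M", False)] * 1
    + [("V", True)] * 9 + [("V", False)] * 1
    + [("L", True)] * 4 + [("L", False)] * 6
    + [("G", True)] * 8 + [("G", False)] * 2
)

aggregate = sum(ok for _, ok in records) / len(records)
per_dimension = accuracy_by_dimension(records)
weakest = min(per_dimension, key=per_dimension.get)
```

Here the aggregate score looks acceptable while the lighting dimension fails on most questions, which is the kind of pattern a per-dimension report is meant to expose.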

2606.00826 2026-06-02 cs.LG

Partial Fairness Awareness: Belief-Guided Strategic Mechanism for Strategic Agents

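Xinpeng Lv, Chunyuan Zheng, Yunxin Mao, Renzhe Xu, Hao Zou, Shanzhi Gu, Liyang Xu, Huan Chen, Yuanlong Chen, Wenjing Yang, Haotian Wang

Affiliations: National University of Defense Technology, Changsha, China; Peking University, Beijing, China; Shanghai University of Finance and Economics, Shanghai, China; ZGC Laboratory, Beijing, China; Faculty of Computing, Harbin Institute of Technology, Harbin, China

AI summary: Addresses the fairness exposure dilemma in strategic classification by posing the problem of partial fairness awareness (PFA): a candidate set of fairness constraints is released while the true constraint is concealed, and a belief-guided mechanism aligns agents' beliefs with the system's fairness constraint. Experiments show that PFA outperforms fully public or private fairness regimes in reducing group fairness gaps, increasing acceptance of qualified individuals, and stabilizing outcomes.

Comments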
Accepted by AAAI2026
AI中文摘要

策略机器学习研究代理操纵其特征以从预测模型获得有利决策的场景。为了解决策略分类中固有的公平问题,最近的工作引入了群体特定的公平约束。然而,当前的公平感知方法在公平暴露问题上面临根本困境:公开这些约束会导致策略操纵和公平逆转,而隐藏它们可能降低社会福利并阻碍真正的改进。为填补这一空白,我们随后提出了部分公平意识(PFA)问题,因为我们的理论分析表明,这种困境可以通过发布公平约束的候选集并隐藏真实约束来缓解。具体来说,我们引入了一种信念引导的策略机制,其中代理与决策系统迭代交互,并在公平约束候选集上维持一个信念分布。这一信念引导过程使代理能够通过迭代交互和反馈,更新其在候选集上的信念分布,从而逐渐使其信念与系统采用的真实公平约束对齐。在真实世界和合成数据集上的大量实验表明,与完全公开或私有的公平机制相比,PFA实现了更低的群体公平差距、更高的真正合格个体接受率以及更稳定的结果。

English abstract

Strategic machine learning investigates scenarios where agents manipulate their features to receive favorable decisions from predictive models. To address fairness concerns intrinsic to strategic classification, recent work has introduced group-specific fairness constraints. However, current fairness-aware approaches face a fundamental dilemma over fairness exposure: making these constraints public enables strategic manipulation and can lead to fairness reversal, while keeping them hidden may reduce social welfare and discourage genuine improvement. To fill this gap, we propose the problem of partial fairness awareness (PFA), motivated by our theoretical analysis, which shows that the dilemma can be mitigated by releasing a candidate set of fairness constraints while concealing the grounding constraint, that is, the one the system actually applies. Specifically, we introduce a belief-guided strategic mechanism in which agents iteratively interact with the decision system and maintain a belief distribution over the candidate set of fairness constraints. Through this interaction and feedback, agents update their belief distribution, gradually aligning it with the grounding fairness constraint employed by the system. Extensive experiments on real-world and synthetic datasets demonstrate that PFA achieves lower group fairness gaps, higher acceptance of truly qualified individuals, and more stable outcomes compared to fully public or private fairness regimes.
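
The belief-guided process in this abstract can be illustrated with a minimal sketch, assuming a plain Bayesian update over the candidate set. The abstract does not state the update rule, so the threshold-style "constraints", the noisy-feedback likelihood, and every name below are hypothetical choices for illustration, not the authors' mechanism.

```python
def update_belief(belief, likelihoods):
    """Bayes rule: posterior[c] is proportional to prior[c] * P(feedback | candidate c)."""
    unnorm = [b * l for b, l in zip(belief, likelihoods)]
    total = sum(unnorm)
    if total == 0:
        return list(belief)  # feedback impossible under every candidate: keep the prior
    return [u / total for u in unnorm]


def feedback_likelihood(threshold, score, accepted, noise=0.1):
    """Probability of the observed decision if `threshold` were the true constraint.

    Assumption: decisions follow a threshold rule and are observed with `noise`.
    """
    would_accept = score >= threshold
    return 1.0 - noise if would_accept == accepted else noise


candidates = [0.4, 0.5, 0.6]  # published candidate set (hypothetical thresholds)
true_threshold = 0.5          # concealed constraint actually used by the system
belief = [1 / 3] * 3          # the agent starts from a uniform prior

# Each round the agent submits a score, observes accept/reject, and updates.
for score in [0.45, 0.55, 0.42, 0.58, 0.47, 0.52]:
    accepted = score >= true_threshold
    likes = [feedback_likelihood(c, score, accepted) for c in candidates]
    belief = update_belief(belief, likes)

best = candidates[belief.index(max(belief))]
```

In this toy run the belief mass concentrates on the concealed threshold after a handful of interactions, which is the alignment behaviour the abstract describes; how quickly that happens depends entirely on the assumed feedback model.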

2606.00825 2026-06-02 cs.CV cs.ET cs.HC cs.MA

SuperMemory-VQA: An Egocentric Visual Question-Answering Benchmark for Long-Horizon Memory

SuperMemory-VQA:面向长期记忆的自我中心视觉问答基准

Samiul Alam, Shakhrul Iman Siam, Michael J. Proulx, James Fort, Richard Newcombe, Hyo Jin Kim, Mi Zhang

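Affiliations: The Ohio State University; Meta Project

AI summary: Introduces the SuperMemory-VQA dataset, comprising 52.9 hours of everyday activities recorded with AI glasses and 4,853 multiple-choice question-answer pairs, for evaluating AI assistants on long-horizon memory tasks; existing systems are found to be insufficiently reliable.

Details
Comments
34 pages, 21 figures, 5 tables
AI Chinese abstract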

AI眼镜为AI代理作为个性化记忆助手提供了有吸引力的平台。要真正有用,此类系统必须超越短期视频理解,解决人类在纵向自我中心视频流中因实际、个人或社交目的而经历的记忆缺口。然而,现有的自我中心数据集主要关注动作识别或来自短片的通用问答,衡量的是感知能力而非现实的人类记忆需求。我们引入了SuperMemory-VQA,一个用于评估AI助手在实际长期记忆任务上的自我中心视觉问答(VQA)数据集。它包含52.9小时用AI眼镜记录的日常活动,包括同步的RGB视频、音频转录、眼动追踪、IMU和SLAM轨迹。通过人工验证的标注流程,我们构建了4,853个有依据的问答对,涵盖物体和位置记忆、意图回忆、视觉场景回忆、时间线重建、对话记忆和上下文检索。每个问题以多项选择形式提出,并包含明确的“不可回答”选项以测试幻觉鲁棒性。对领先的代理框架和LLM骨干的基准测试表明,现有系统在现实世界记忆任务上仍远不可靠,凸显了需要新的架构来实现有依据的AI记忆,使其仅在证据充分时才能回答。参与者调查进一步支持我们的问题具有现实性、实用性,并与日常记忆需求一致。

English abstract

AI glasses present a compelling platform for AI agents to serve as personalized memory assistants. To be genuinely useful, such systems must move beyond short-term video comprehension and address memory gaps that humans experience for practical, personal, or social purposes over longitudinal egocentric video streams. However, existing egocentric datasets predominantly focus on action recognition or generic QAs from short clips, measuring perceptual capabilities rather than realistic human memory needs. We introduce SuperMemory-VQA, an egocentric visual question answering (VQA) dataset for evaluating AI assistants on practical, long-horizon memory tasks. It contains 52.9 hours of everyday activities recorded with AI glasses, including synchronized RGB video, audio transcription, eye gaze, IMU, and SLAM trajectories. Through a human-verified annotation pipeline, we construct 4,853 grounded question-answer pairs that span object and location memory, intent recall, visual scene recall, timeline reconstruction, conversational memory, and in-context retrieval. Each question is posed as multiple-choice with an explicit "unanswerable" option to test hallucination robustness. Benchmarking leading agentic frameworks and LLM backbones reveals that existing systems remain far from reliable on real-world memory tasks, highlighting the need for new architectures for grounded AI memory that answer only when evidence is sufficient. A participant survey further supports that our questions are realistic, useful, and aligned with everyday memory needs.

2606.00821 2026-06-02 cs.LG

A Comparative Analysis of Machine Learning Algorithms for Multi-Task Prediction of the Parameters of the Pectin Hydrolysis--Extraction Process

机器学习算法用于果胶水解-提取过程参数多任务预测的比较分析

Mullosharaf K. Arabov, Shavkat Yo. Kholov, Zainiddin K. Muhiddin

Affiliations: Institute of Computational Mathematics and Information Technologies, Kazan Federal University; Tajik Technical University named after Academician M.S. Osimi; V.I. Nikitin Institute of Chemistry, National Academy of Sciences of Tajikistan

AI summary: The study compares 11 machine learning algorithms on multi-task regression of pectin hydrolysis-extraction process parameters; CatBoost performs best (average R² of about 0.946), and feature importance analysis shows that raw material type dominates (63.6%).

Details
Comments
Preprint
AI Chinese abstract

本研究利用机器学习方法解决复杂多参数工艺——果胶水解-提取过程的控制挑战。实验基础是一个独特的数据库,包含在受控条件下对七种植物原料进行的1000次实验室实验,涉及四个可变工艺因素(温度85-130°C、压力0.9-2.2 atm、保温时间3-10分钟、pH 1.5-2.0)。记录了四个输出特征:果胶产率、半乳糖醛酸含量、分子量和酯化度。为解决多任务回归问题,训练并比较了11种算法:正则化线性模型、集成方法(随机森林、梯度提升、XGBoost、CatBoost、Extra Trees)、k近邻、支持向量回归和多层感知器。最佳结果由CatBoost展示(超参数优化后平均R²约为0.946)。特征重要性分析揭示了原料类型的主导作用(占总重要性的63.6%),其次是温度和保温时间。开发的流水线以生产就绪格式导出,并部署为交互式Web界面。研究结果表明,集成方法结合严格的统计分析和可解释AI显著减少了物理实验的需求,并为智能果胶生产控制奠定了基础。

English abstract

This study addresses the challenge of controlling a complex, multi-parameter technological process -- pectin hydrolysis--extraction -- using machine learning methods. The experimental foundation is a unique database comprising 1,000 laboratory experiments conducted under controlled conditions on seven types of plant raw material with four variable process factors (temperature 85--130 °C, pressure 0.9--2.2 atm, holding time 3--10 min, pH 1.5--2.0). Four output characteristics were recorded: pectin yield, galacturonic acid content, molecular weight, and degree of esterification. To solve the multi-task regression problem, 11 algorithms were trained and compared: regularised linear models, ensemble methods (Random Forest, Gradient Boosting, XGBoost, CatBoost, Extra Trees), k-nearest neighbours, support vector regression, and a multilayer perceptron. The best results were demonstrated by CatBoost (average R-squared approximately 0.946 after hyperparameter optimisation). Feature importance analysis revealed the dominant role of the raw material type (63.6% of total importance), followed by temperature and holding time. The developed pipeline was exported in a production-ready format and deployed as an interactive web interface. The findings demonstrate that ensemble methods combined with rigorous statistical analysis and interpretable AI significantly reduce the need for physical experiments and form the basis for intelligent pectin production control.

2606.00819 2026-06-02 cs.AI

Mitigating Hallucinations in Large Language Models Via Decoder Layer Skipping

通过解码器层跳跃减轻大型语言模型中的幻觉

Hanze Li, Jinhao You, Yichen Guo, Kai Tang, Shuangyang Xie, Xiande Huang

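Affiliations: arXiv.org

AI summary: The paper proposes DeLask, a framework that dynamically skips hallucination-prone decoder layers, using an equivalence between Transformer forward computation and gradient descent to detect and suppress erroneous signals, thereby mitigating LLM hallucinations and improving reliability.

Details
Comments
5 pages
AI Chinese abstract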

大型语言模型(LLM)在各种自然语言任务中表现出色,但其输出常常出现幻觉,即与事实信息不符的内容。在这项工作中,我们对解码过程进行了全面的逐层分析,并揭示幻觉往往源自更深的解码器层。为了解决这个问题,我们引入了DeLask(Decoder Layer Skipping),一种新颖的解码框架,它动态跳过容易产生幻觉的层。DeLask利用理论洞察,即$L$层Transformer的前向计算在条件上等价于$L$步梯度下降。我们通过计算连续解码步骤导出的梯度之间的余弦相似度来定义漂移值,从而在下降方向反转时识别问题层。DeLask并非完全丢弃这些层,而是将其隐藏状态与前面层部分聚合,从而在抑制错误信号的同时保持一致性。跨不同LLM和基准的广泛实验表明,DeLask持续减轻幻觉并增强整体可靠性,为提升大规模语言模型的鲁棒性提供了一个轻量级且可泛化的解码框架。

English abstract

Large Language Models (LLMs) have achieved strong performance across diverse natural language tasks, yet their outputs often suffer from hallucinations -- content that is misaligned with factual information. In this work, we conduct a comprehensive layer-wise analysis of the decoding process and reveal that hallucinations tend to originate from deeper decoder layers. To address this issue, we introduce DeLask (Decoder Layer Skipping), a novel decoding framework that dynamically skips layers prone to producing hallucinations. DeLask leverages the theoretical insight that the forward computation of an $L$-layer Transformer is conditionally equivalent to $L$ steps of gradient descent. We define a driftance value by computing the cosine similarity between gradients derived from consecutive decoder steps, identifying problematic layers when the descent direction reverses. Rather than discarding such layers entirely, DeLask partially aggregates their hidden states with preceding layers, thereby preserving consistency while suppressing erroneous signals. Extensive experiments across diverse LLMs and benchmarks demonstrate that DeLask consistently mitigates hallucinations and enhances overall reliability, providing a lightweight and generalizable decoding framework for improving the robustness of large-scale language models.
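
The detect-then-blend idea in this abstract can be sketched as follows. This is a simplified illustration under stated assumptions, not the DeLask implementation: it treats the per-layer update `h_l - h_{l-1}` as a stand-in for the gradient step, flags a layer when the cosine similarity with the previous update is negative, and blends with a fixed weight `alpha`. The function names, the threshold, and the blending rule are all hypothetical, and the sketch works post hoc on a list of stored hidden states rather than inside a running model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def flag_and_blend(hidden, alpha=0.5, threshold=0.0):
    """hidden: per-layer hidden-state vectors h_0..h_L for one token."""
    # updates[i - 1] = h_i - h_{i-1}, used here as a proxy for a descent step
    updates = [[b - a for a, b in zip(hidden[i - 1], hidden[i])]
               for i in range(1, len(hidden))]
    flagged = []
    blended = [list(h) for h in hidden]
    for layer in range(2, len(hidden)):
        drift = cosine(updates[layer - 1], updates[layer - 2])
        if drift < threshold:  # update direction reversed relative to the previous layer
            flagged.append(layer)
            # Partial aggregation with the preceding layer instead of a hard skip
            blended[layer] = [alpha * x + (1 - alpha) * y
                              for x, y in zip(hidden[layer], hidden[layer - 1])]
    return flagged, blended


# Toy trajectory: layer 3 moves back against the direction taken by layer 2.
hidden = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.5], [1.0, 0.0]]
flagged, blended = flag_and_blend(hidden)
```

Blending rather than dropping the flagged layer keeps part of its contribution, which mirrors the abstract's point about preserving consistency while damping the reversed signal.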

2606.00815 2026-06-02 cs.LG

OmniEEG-Bench: A Standardized Evaluation Benchmark for EEG Foundation Models

OmniEEG-Bench: 脑电图基础模型的标准化评估基准

Ziling Lu, Zongsheng Li, Xinke Shen, Kexin Lou, Yingyue Xin, Xiaoqi Chen, Shinan Wang, Xiang Chen, Jiahao Fan, Chenyu Huang, Xin Xu, Zhoujie Hou, Chen Wei, Quanying Liu

Affiliations: Department of Biomedical Engineering, Southern University of Science and Technology, Shenzhen, China; School of Computer Science and Engineering, The Chinese University of Hong Kong, Shenzhen, China; Omni-Intelligence, Shenzhen, China; Shenzhen Loop Area Institute, Shenzhen, China

AI summary: To address the fragmented evaluation of EEG foundation models, proposes the unified benchmark OmniEEG-Bench, covering six task families and 54 datasets, and reveals scaling-law relationships between performance and both pretraining data diversity and model size.

Details
Comments
28 pages, 13 figures, 8 tables; benchmark of EEG foundation models
AI Chinese abstract

脑电图(EEG)支持多种脑机接口(BCI)任务,从脑状态监测到人-大语言模型交互。EEG基础模型正在兴起,但由于异构数据集和不一致的任务协议,评估仍然碎片化。在此,我们介绍OmniEEG-Bench,一个用于EEG基础模型(FMs)的统一基准和下游任务路线图。它将EEG FMs的评估组织为六个任务族,涵盖(i)信号可靠性、(ii)生物特征与疾病、(iii)意识与状态、(iv)认知与情感、(v)自然刺激解码以及(vi)运动与交互,引入了先前EEG FM工作中未系统基准测试的新一代任务。OmniEEG-Bench通过任务卡规范标准化模型部署、任务定义和指标,并统一了54个EEG数据集及一致的评估协议。我们对10个代表性EEG基础模型进行了基准测试,并报告了涵盖多种评估设置的排行榜。预训练数据集多样性和模型大小均与跨数据集的更好平均排名显著相关,揭示了EEG基础模型中的缩放律行为(图1)。这些结果表明,扩展EEG基础模型不仅需要更大的架构,还需要更广泛和更多样化的预训练数据。基准测试代码可在https://github.com/ncclab-sustech/omni-eegbench.git获取。

English abstract

Electroencephalography (EEG) supports a variety of brain-computer interface (BCI) tasks ranging from brain-state monitoring to human-LLM interactions. EEG foundation models are emerging, but evaluation remains fragmented due to heterogeneous datasets and inconsistent task protocols. Here, we introduce OmniEEG-Bench, a unified benchmark and downstream task roadmap for EEG foundation models (FMs). It organizes evaluation of EEG FMs into six task families spanning (i) signal reliability, (ii) biometrics and disease, (iii) consciousness and state, (iv) cognition and emotion, (v) naturalistic stimulus decoding, and (vi) motor and interaction, introducing a new generation of tasks not systematically benchmarked in prior EEG FM work. OmniEEG-Bench standardizes model deployment, task definitions, and metrics through a task-card specification, and unifies 54 EEG datasets with consistent evaluation protocols. We benchmark 10 representative EEG foundation models and report a leaderboard that covers diverse evaluation settings. Both pretraining dataset diversity and model size are significantly associated with better average ranks across datasets, revealing scaling-law behavior in EEG foundation models (Figure 1). These results suggest that scaling EEG foundation models requires not only larger architectures but also broader and more diverse pretraining data. The benchmark code is available at https://github.com/ncclab-sustech/omni-eegbench.git.

2606.00808 2026-06-02 cs.LG

Safe-Subspace Pseudo-Label Refinement for Source-Free Graph Domain Adaptation

安全子空间伪标签精炼用于无源图域自适应

Yingxu Wang, Xinwang Liu, Siyang Gao, Nan Yin

Affiliations: Department of Computer Science and Engineering, Chinese University of Hong Kong; College of Computer, National University of Defense Technology; Department of Data Science, City University of Hong Kong; The Education University of Hong Kong

AI summary: To address unreliable pseudo-labels in source-free graph domain adaptation, proposes SafeSubspace Pseudo-Label Refinement, which identifies a confidence-consistent safe subspace and verifies pseudo-labels using semantic and structural evidence, achieving robust graph domain adaptation.

Details
AI Chinese abstract

无源图域自适应(SF-GDA)旨在当源图不再可访问时,将源训练的图模型适应到未标记的目标图。一个核心障碍是伪标签的可靠性:在特征和拓扑偏移下,源诱导的预测可能变得自信但错误,而无差别的自训练会通过图消息传递放大系统误差。本文从选择性伪标签的角度研究SF-GDA。我们不是假设整个目标域上全局有界的伪标签噪声,而是识别一个置信度一致的安全子空间,在该子空间上伪标签噪声可以在受限后验差异下得到控制,并推导出一个目标风险分解,将安全子空间拟合误差、选定标签噪声和不确定集风险分开。在此分析指导下,我们提出SafeSubspace伪标签精炼(S$^2$PLR),一种无源图自适应框架,仅对同时具有语义和结构证据支持的目标图应用硬伪标签监督。具体来说,S$^2$PLR利用源委员会置信度和分歧估计语义可靠性,通过图对比学习学习目标内在的结构表示,通过邻域一致性验证伪标签,并利用噪声容忍的软正则化处理剩余的不确定样本,而不是不可靠的硬标签。在不同域偏移下的图像和真实世界图基准上的实验表明,S$^2$PLR在各种无源迁移设置中实现了鲁棒且具有竞争力的性能。

English abstract

Source-free graph domain adaptation (SF-GDA) aims to adapt source-trained graph models to unlabeled target graphs when source graphs are no longer accessible. A central obstacle is pseudo-label reliability: under feature and topological shifts, source-induced predictions may become confidently wrong, and indiscriminate self-training can amplify systematic errors through graph message passing. This paper studies SF-GDA from a selective pseudo-labeling perspective. Instead of assuming globally bounded pseudo-label noise over the entire target domain, we identify a confidence-consistent safe subspace on which pseudo-label noise can be controlled under restricted posterior discrepancy, and derive a target-risk decomposition that separates safe-subspace fitting error, selected-label noise, and uncertain-set risk. Guided by this analysis, we propose SafeSubspace Pseudo-Label Refinement (S$^2$PLR), a source-free graph adaptation framework that applies hard pseudo-label supervision only to target graphs supported by both semantic and structural evidence. Specifically, S$^2$PLR estimates semantic reliability using source-committee confidence and disagreement, learns a target-intrinsic structural representation via graph contrastive learning, verifies pseudo-labels through neighborhood consistency, and exploits the remaining uncertain samples with noise-tolerant soft regularization rather than unreliable hard labels. Experiments on image and real-world graph benchmarks under different domain shifts demonstrate that S$^2$PLR achieves robust and competitive performance across diverse source-free transfer settings.

2606.00798 2026-06-02 cs.CV cs.AI cs.LG

DASH: Dual-Branch Score Distillation for Guidance-Calibrated Compact Diffusion Models

DASH: 用于引导校准紧凑扩散模型的双分支分数蒸馏

Abdullah Al Shafi, Kazi Saeed Alam, Sk Imran Hossain, Engelbert Mephu Nguifo

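Affiliations: Khulna University of Engineering & Technology; University Clermont Auvergne

AI summary: Addressing the guidance failure caused by an unsupervised unconditional score branch when compressing class-conditional diffusion models, proposes DASH, a dual-branch distillation framework that supervises both branches independently and adds anchor regularisation and curriculum transfer; at 5.9x compression it keeps FID and guidance fidelity close to the teacher.

Details
Comments
14 pages, 7 figures, 4 tables; appendix with additional ablations and qualitative results
AI Chinese abstract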

类条件扩散模型的参数压缩揭示了输出级蒸馏中一个未被充分探索的局限性:无条件分数分支保持无监督,导致学生模型中无分类器引导差距欠定。该差距在每个去噪步骤中被放大,允许两个分支都崩溃为相同预测的退化解,使得引导在低输出级训练损失下无效。本文介绍了DASH,一种双分支蒸馏框架,独立监督两个分数分支,通过独立分支约束为每个训练样本唯一指定目标分支输出,并引入锚点项将条件预测正则化到真实噪声。该框架进一步引入了TIRT迁移,将教师收敛的每时间步重要性课程复制到学生中作为冻结先验,消除了在有限蒸馏预算内重新学习它的需要。在CIFAR-10和CIFAR-100上的实验表明,5.9倍压缩在50步DDIM采样下将质量保持在教师模型4个FID点以内,显著优于从头训练,且引导保真度良好保持。消融研究证实无条件监督是主要贡献,占总蒸馏增益的60%以上。课程迁移和锚点正则化提供互补收益,共同验证了双分支约束对于引导保持压缩的经验必要性。

English abstract

Parameter compression of class-conditional diffusion models reveals an underexplored limitation in output-level distillation: the unconditional score branch remains unsupervised, leaving the classifier-free guidance gap underdetermined in the student. This gap, amplified at every denoising step, admits degenerate solutions where both branches collapse toward identical predictions, rendering guidance ineffective despite low output-level training loss. This paper introduces DASH, a dual-branch distillation framework that independently supervises both score branches, uniquely specifying target branch outputs for each training sample through independent branch constraints, with an anchor term regularising conditional predictions toward ground-truth noise. The framework further introduces TIRT Transfer, which copies the teacher's converged per-timestep importance curriculum into the student as a frozen prior, eliminating the need to relearn it within limited distillation budgets. Experiments on CIFAR-10 and CIFAR-100 demonstrate that 5.9x compression maintains quality within 4 FID points of the teacher at 50-step DDIM sampling, considerably outperforming training from scratch with guidance fidelity well preserved. Ablation studies confirm that unconditional supervision is the dominant contribution, accounting for over 60% of total distillation gain. Curriculum transfer and anchor regularisation provide complementary benefit, together validating dual-branch constraints as empirically essential for guidance-preserving compression.
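
The collapse described in this abstract can be made concrete with a small sketch. The guidance formula used is the standard classifier-free guidance combination; the two loss functions, the weight `lam`, and the toy vectors are hypothetical stand-ins chosen for illustration and are not the DASH objective itself.

```python
def cfg(eps_cond, eps_uncond, w):
    """Classifier-free guidance: eps_uncond + w * (eps_cond - eps_uncond)."""
    return [u + w * (c - u) for c, u in zip(eps_cond, eps_uncond)]


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def output_only_loss(s_cond, t_cond):
    """Output-level distillation that never looks at the unconditional branch."""
    return mse(s_cond, t_cond)


def dual_branch_loss(s_cond, s_uncond, t_cond, t_uncond, noise, lam=0.1):
    """Assumed form: supervise both branches, plus an anchor toward the true noise."""
    return mse(s_cond, t_cond) + mse(s_uncond, t_uncond) + lam * mse(s_cond, noise)


# Teacher predictions for one sample (toy numbers).
t_cond, t_uncond = [1.0, 2.0], [0.0, 1.0]
noise = [1.0, 2.0]

# Degenerate student: both branches collapse onto the teacher's conditional output.
s_cond, s_uncond = [1.0, 2.0], [1.0, 2.0]

loss_output_only = output_only_loss(s_cond, t_cond)
loss_dual = dual_branch_loss(s_cond, s_uncond, t_cond, t_uncond, noise)
guided_student = cfg(s_cond, s_uncond, w=3.0)
guided_teacher = cfg(t_cond, t_uncond, w=3.0)
```

The collapsed student scores a perfect output-only loss, yet its guided prediction does not move with the guidance weight, while the teacher's does. Supervising the unconditional branch as well gives the collapsed student a non-zero loss, which is the effect the dual-branch constraint is meant to have.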

2606.00795 2026-06-02 cs.LG cs.AI

Extending Causal Metamodeling to a non-Markovian Queue

将因果元建模扩展到非马尔可夫排队系统

Pracheta Amaranath, Anant Bhide, David Jensen, Peter Haas

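Affiliations: Manning College of Information and Computer Sciences, University of Massachusetts Amherst

AI summary: By approximating non-exponential distributions with phase-type distributions, the paper extends the modular dynamic Bayesian network (MDBN) causal metamodeling approach from Markovian systems to non-Markovian queues and addresses challenges in choosing the number of phases, learning parameters, and selecting the sampling interval; experiments on a G/M/1 queue show orders-of-magnitude inference speedups.

Details
Comments
12 pages
AI Chinese abstract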

离散事件仿真的元模型近似模拟模型的行为,而无需运行昂贵的仿真。先前的工作引入了模块化动态贝叶斯网络(MDBN)——一类元模型,可以使用单个训练模型估计一系列概率和因果查询(PCQ)——但该方法仅限于马尔可夫系统。在本文中,我们通过使用相位型分布近似非指数分布,启动MDBN向非马尔可夫排队的扩展。这种方法带来了新的挑战,包括在选择相位数量时平衡元建模精度和可处理性、高效学习元模型参数,以及选择用于通过离散时间MDBN近似连续时间仿真的采样间隔。我们为这些挑战提供了初步解决方案,从而产生了第一个针对非马尔可夫系统的因果元建模技术。在G/M/1队列上的实验表明,MDBN可以为PCQ提供准确的答案,并且相对于直接仿真,推理时间实现了数量级的加速。

English abstract

Metamodels for discrete-event simulations approximate the behavior of simulation models without running expensive simulations. Prior work introduced modular dynamic Bayesian networks (MDBNs) -- a class of metamodels that can estimate a range of probabilistic and causal queries (PCQs) using a single, trained model -- but the method was limited to Markovian systems. In this paper, we initiate an extension of MDBNs to non-Markovian queues by approximating non-exponential distributions using phase-type distributions. This approach raises novel challenges, including balancing metamodeling accuracy and tractability when choosing the number of phases, efficiently learning metamodel parameters, and choosing the sampling interval that is used to approximate a continuous-time simulation by a discrete-time MDBN. We provide preliminary solutions to these challenges, yielding the first causal metamodeling technique for non-Markovian systems. Experiments on a G/M/1 queue demonstrate that the MDBN can produce accurate answers to PCQs with orders-of-magnitude speedup of inference times relative to direct simulation.
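
The accuracy-versus-tractability trade-off in the number of phases can be illustrated with the simplest phase-type family, the Erlang distribution. The sketch below is an assumption-laden example, not the paper's fitting procedure: it matches a target mean and picks the smallest number of phases `k` whose squared coefficient of variation (SCV) `1/k` does not exceed the target, which only works for SCV at most 1. The function names are hypothetical.

```python
import math
import random


def erlang_fit(mean, scv):
    """Return (k, rate) for an Erlang-k with the given mean and SCV <= target.

    An Erlang-k is a chain of k exponential phases with a common rate; its
    SCV is 1/k, so lower-variance targets need more phases (a larger state space).
    """
    if not 0 < scv <= 1:
        raise ValueError("this simple Erlang fit needs 0 < scv <= 1")
    k = math.ceil(1.0 / scv)
    rate = k / mean  # each phase has mean 1/rate, so the total mean is k/rate
    return k, rate


def sample_erlang(k, rate, rng):
    """Draw one Erlang-k sample as a sum of k exponential phase durations."""
    return sum(rng.expovariate(rate) for _ in range(k))


k, rate = erlang_fit(mean=2.0, scv=0.25)

rng = random.Random(0)
samples = [sample_erlang(k, rate, rng) for _ in range(20000)]
sample_mean = sum(samples) / len(samples)
sample_var = sum((x - sample_mean) ** 2 for x in samples) / (len(samples) - 1)
sample_scv = sample_var / sample_mean ** 2
```

Each phase is exponential, so the approximated system is Markovian again on an enlarged state space; the price is that `k` grows as the target distribution becomes less variable.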

2606.00784 2026-06-02 cs.CV

DINO-GFSA: Geo-Localization via Semantic Gated Fusion and Mamba-based Sequential Aggregation

DINO-GFSA:基于语义门控融合和Mamba序列聚合的地理定位

Beier Hu, Yuanshen Guo, Jialu Cai, Chengwei Li, Yong Wang, Shunan Wu, Zhigang Wu

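Affiliations: School of Aeronautics and Astronautics, Sun Yat-sen University, Shenzhen, China

AI summary: Proposes the DINO-GFSA framework, which combines a LoRA-adapted DINOv3 backbone, a Semantic Gated Residual Fusion module, and a Mamba-based Sequential Aggregation Head to achieve state-of-the-art performance in UAV cross-view geo-localization.

Details
AI Chinese abstract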

跨视角地理定位(CVGL)对于无人机在无GNSS环境下的自定位和目标定位至关重要。然而,在保留细粒度空间细节的同时获取鲁棒语义仍然具有挑战性。为此,我们提出DINO-GFSA框架,利用LoRA(低秩适配)适配的DINOv3(ViTL)骨干网络实现参数高效、高容量的表示。关键地,我们引入了语义门控残差融合模块,利用高层语义选择性校准和整合低层空间线索,有效弥合语义鸿沟。此外,设计了基于Mamba的序列聚合头,以线性复杂度捕获长距离空间依赖。实验表明,在University-1652和DenseUAV基准上取得了最先进性能,特别是在DenseUAV上Recall@1比之前最佳方法高出3.48%。这些结果验证了DINO-GFSA作为无人机CVGL通用鲁棒解决方案的有效性。

English abstract

Cross-view geo-localization (CVGL) is critical for Unmanned Aerial Vehicle (UAV) self-positioning and target localization in GNSS-denied environments. However, acquiring robust semantics while preserving fine-grained spatial details remains challenging. To address this, we propose DINO-GFSA, a framework leveraging a LoRA (Low-Rank Adaptation) adapted DINOv3 (ViT-L) backbone for parameter-efficient, high-capacity representation. Crucially, we introduce a Semantic Gated Residual Fusion module, which utilizes high-level semantics to selectively calibrate and integrate low-level spatial cues, effectively bridging the semantic gap. Furthermore, a Mamba-based Sequential Aggregation Head is designed to capture long-range spatial dependencies with linear complexity. Experiments demonstrate state-of-the-art performance on University-1652 and DenseUAV benchmarks, notably surpassing the previous best on DenseUAV by 3.48% on Recall@1. These results validate DINO-GFSA as a generalized, robust solution for UAV CVGL.

2606.00782 2026-06-02 cs.CV

FlowOVD: Learning Generative Latent Flows for Zero-shot Open-vocabulary Detection

FlowOVD: 学习生成式潜在流用于零样本开放词汇检测

Yao Wei, Andrea Cavallaro, Changjae Oh

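Affiliations: Queen Mary University of London; EPFL

AI summary: Proposes FlowOVD, a text-conditioned query generation framework based on rectified flow that performs open-vocabulary detection through continuous latent query dynamics, reaching 49.5 AP on COCO and 31.5 AP on LVIS and outperforming GroundingDINO.

Details
AI Chinese abstract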

开放词汇目标检测(OVD)通过大规模视觉-语言预训练取得了显著进展。然而,现有方法通常将OVD表述为判别性预测问题,其中解码器查询要么是静态的,要么从编码器特征初始化,从而限制了其多样性和灵活性。在本文中,我们引入生成视角,将解码器查询生成建模为潜在空间中的连续传输过程。我们提出FlowOVD,一种基于修正流的文本条件查询生成框架,逐步将文本无关的查询转换为文本引导的查询。通过将连续潜在查询动态引入基于视觉-语言模型(VLM)的检测器,我们的方法避免了启发式离散查询构建,并为开放词汇检测实现了更具表现力的语义对齐。无需额外训练数据,FlowOVD在COCO上达到49.5 AP,在LVIS上达到31.5 AP,分别比GroundingDINO高出+1.2 AP(+2.5%)和+4.1 AP(+15.0%)。在具有挑战性的长尾LVIS基准上的更大增益进一步凸显了连续查询生成对开放词汇泛化的有效性。

English abstract

Open-vocabulary object detection (OVD) has achieved remarkable progress through large-scale vision-language pre-training. Existing methods, however, typically formulate OVD as a discriminative prediction problem, where decoder queries are either static or initialized from encoder features, thus limiting their diversity and flexibility. In this paper, we introduce a generative perspective by modeling decoder query generation as a continuous transport process in latent space. We propose FlowOVD, a text-conditioned query generation framework based on rectified flow that progressively transforms text-agnostic queries into text-guided queries. By introducing continuous latent query dynamics into a vision-language model (VLM) based detector, our method avoids heuristic discrete query construction and enables more expressive semantic alignment for open-vocabulary detection. Without requiring additional training data, FlowOVD achieves 49.5 AP on COCO and 31.5 AP on LVIS, outperforming GroundingDINO by +1.2 AP (+2.5 %) and +4.1 AP (+15.0 %), respectively. The larger gain on the challenging long-tailed LVIS benchmark further highlights the effectiveness of continuous query generation for open-vocabulary generalization.
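
The absolute and relative gains quoted in this abstract can be cross-checked with a few lines of arithmetic. The baseline scores below are not stated in the abstract; they are derived here as the reported score minus the reported absolute gain, and the helper name is hypothetical.

```python
def implied_baseline_and_relative_gain(score, abs_gain):
    """Return the baseline implied by score and absolute gain, and the gain in percent."""
    baseline = score - abs_gain
    return baseline, 100.0 * abs_gain / baseline


# COCO: 49.5 AP with a +1.2 AP gain; LVIS: 31.5 AP with a +4.1 AP gain.
coco_base, coco_rel = implied_baseline_and_relative_gain(49.5, 1.2)
lvis_base, lvis_rel = implied_baseline_and_relative_gain(31.5, 4.1)
```

Under that reading, the implied baselines are roughly 48.3 AP and 27.4 AP, and the relative gains round to the 2.5% and 15.0% figures given in the text.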

2606.00780 2026-06-02 cs.LG cs.AI

Behavior-Invariant Task Representation Learning with Transformer-based World Models for Offline Meta-Reinforcement Learning

基于Transformer世界模型的行为不变任务表示学习用于离线元强化学习

Fuyuan Qian, Menglong Zhang, Song Wang, Quanying Liu

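Affiliations: National University of Singapore; University of Science and Technology of China

AI summary: Proposes a framework combining information-theoretic task representation learning with a Transformer-based stochastic world model; by extracting behavior-invariant task variables and applying a conservative value penalty, it addresses distribution shift and sparse rewards in offline meta-reinforcement learning and achieves robust generalization.

Details
Comments
ICML2026
AI Chinese abstract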

离线元强化学习利用静态数据集使智能体能够通过结合离线效率与元学习适应性来泛化到未见环境,但它面临来自上下文和策略分布偏移的关键挑战。这些问题阻碍智能体适应在线环境,并在稀疏奖励设置下进一步加剧。结果,智能体常常陷入固有的模式困境,无法实现鲁棒的泛化。在这项工作中,我们提出了一种新颖的框架,将信息论任务表示学习与基于Transformer的随机世界模型相结合。我们的方法提取对行为策略不变的任务定义潜在变量,从而有效缓解上下文分布偏移。为了进一步处理策略偏移和模型利用,我们对基于想象力的轨迹应用保守值惩罚,防止策略利用模型不准确性,同时保持鲁棒适应。大量评估表明,我们的方法在分布外和稀疏奖励设置下优于最先进的方法,具有优越的稳定性和泛化能力。

English abstract

Offline meta-reinforcement learning leverages static datasets to enable agents to generalize to unseen environments by combining offline efficiency with meta-learning adaptability, yet it faces key challenges from context and policy distribution shifts. These issues hinder agents from adapting to online environments, and are further exacerbated under sparse-reward settings. As a result, agents often become trapped in an inherent pattern dilemma, failing to achieve robust generalization. In this work, we propose a novel framework that integrates information-theoretic task representation learning with a Transformer-based stochastic world model. Our approach extracts task-defining latent variables that are invariant to behavior policy, thereby effectively mitigating the context distribution shift. To further handle policy shift and model exploitation, we apply a conservative value penalty to imagination-based rollouts, preventing the policy from exploiting model inaccuracies while maintaining robust adaptation. Extensive evaluations demonstrate that our method outperforms state-of-the-art approaches, with superior stability and generalization under out-of-distribution and sparse-reward settings.

2606.00776 2026-06-02 cs.LG

Latent Diffusion Pretraining for Crystal Property Prediction

晶体性质预测的潜在扩散预训练

Shrimon Mukherjee, Kishalay Das, Partha Basuchowdhuri, Pawan Goyal, Niloy Ganguly

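Affiliations: University of California, Berkeley; Indian Institute of Technology, Bombay

AI summary: Proposes CrysLDNet, a latent-diffusion pretraining framework that combines a variational autoencoder with a diffusion model to learn representations from unlabeled crystal structures; after fine-tuning it significantly improves property prediction.

Details
Comments
Published in ICML 2026
AI Chinese abstract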

快速准确地预测晶体性质是新材料设计中的核心挑战。图神经网络和基于Transformer的模型由于能够编码晶体中原子的局部结构环境,已成为此任务的有力工具。然而,这些模型需要大量数据,而实践中晶体性质的标注数据稀缺。预训练-微调策略,特别是基于扩散模型的策略,在解决这些限制方面显示出前景。在这项工作中,我们引入了一个新颖的基于潜在扩散的预训练框架CrysLDNet,旨在缓解数据稀缺问题。我们的方法在预训练阶段将变分自编码器(VAE)与扩散模型相结合。VAE编码器将3D晶体结构映射到平滑的潜在空间,在该空间中应用扩散过程。这种潜在扩散预训练使图编码器能够从大规模无标注数据中有效捕获结构和化学语义,然后可以针对特定性质预测任务进行微调。在流行的DFT数据集上进行性质预测的综合实验表明,CrysLDNet显著优于从头训练和预训练的基线,在JARVIS和MP数据集上分别提高了4.26%和4.90%。此外,学习到的表示在稀疏数据条件下保持鲁棒,并且具有足够的表达能力,可以在有限实验数据微调时纠正DFT误差。代码可在https://github.com/shrimonmuke0202/CrysLDNet.git获取。

English abstract

Fast and accurate prediction of crystal properties is a central challenge in new materials design. Graph neural networks and Transformer-based models have emerged as powerful tools for this task due to their ability to encode the local structural environment of atoms within a crystal. However, these models are data-hungry, and in practice, labeled data for crystal properties are scarce. Pretraining-finetuning strategies, particularly those based on diffusion models, have shown promise in addressing these limitations. In this work, we introduce a novel latent diffusion based pretraining framework, CrysLDNet, designed to mitigate data scarcity. Our approach integrates a Variational Autoencoder (VAE) with a diffusion model during the pretraining stage. The VAE encoder maps 3D crystal structures into a smooth latent space within which the diffusion process is applied. This latent diffusion pretraining enables the graph encoder to effectively capture structural and chemical semantics from large-scale unlabeled data, which can then be finetuned for specific property prediction tasks. Comprehensive experiments on popular DFT datasets for property prediction reveal that CrysLDNet significantly outperforms both training-from-scratch and pretrained baselines, with improvements of 4.26% and 4.90% on the JARVIS and MP datasets, respectively. Additionally, the learned representations remain robust in sparse-data conditions and are expressive enough to correct DFT errors when finetuned with limited experimental data. Code is available at: https://github.com/shrimonmuke0202/CrysLDNet.git.

2606.00775 2026-06-02 cs.CV cs.AI

GIRL-DETR: Gradient-Isolated Reinforcement Learning for Video Moment Retrieval

GIRL-DETR: 梯度隔离强化学习用于视频时刻检索

Shihang Zhang, Mingjin Kuai, Ye Wei, Zhen Zhang, Wei Ji

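Affiliations: College of Electronics and Information Engineering, Sichuan University; School of Intelligence Science and Technology, Nanjing University

AI summary: To address the optimization stagnation in video moment retrieval caused by the mismatch between continuous surrogate losses and non-differentiable metrics, proposes GIRL-DETR, a gradient-isolated reinforcement learning framework that freezes the backbone and uses a three-stage progressive RL strategy to optimize tIoU directly, improving localization accuracy in lightweight models.

Details
Comments
13 pages, 6 figures. Submitted to IEEE Transactions on Image Processing (TIP). Code is available at: https://github.com/Z-Shihang/GIRL-DETR
AI Chinese abstract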

视频时刻检索(VMR)任务要求精确定位与自然语言查询对齐的时间边界,但许多模型存在连续代理损失与非可微指标之间的不匹配,导致训练后期优化停滞,边界预测陷入次优解。尽管强化学习(RL)后训练成功优化了大模型的定位结果,但直接应用于轻量级网络容易破坏监督阶段建立的脆弱特征表示。为克服这一优化瓶颈,我们提出梯度隔离强化学习用于DETR(GIRL-DETR),首次将RL后训练引入轻量级时间定位框架。输入视频和文本特征首先通过跨模态交互(CMI)在进入Transformer编码器之前建立早期对齐。随后,文本引导门控(TGG)机制在Transformer解码器生成候选提案之前动态地将语义先验注入查询,为时间预测提供高信噪比输入。在监督训练达到收敛后,冻结骨干网络以保护特征流形,而检测头通过三阶段渐进强化学习(TPRL)策略直接优化非可微评估指标tIoU以提升定位精度。该方法实现了状态表示与指标优化的正交解耦。在Charades-STA、QVHighlights和TACoS上的实验表明,GIRL-DETR有效解决了代理损失退化问题,以最少的参数更新实现了显著的精度提升,为轻量级VMR模型中的RL应用提供了稳健的新途径。

English abstract

The Video Moment Retrieval (VMR) task requires accurately localizing temporal boundaries aligned with natural language queries. Many models, however, suffer from a misalignment between continuous surrogate losses and non-differentiable metrics, which leads to optimization stagnation in the late stages of training and traps boundary predictions in suboptimal solutions. Although Reinforcement Learning (RL) post-training successfully optimizes localization results for large models, applying it directly to lightweight networks easily disrupts the fragile feature representations established during the supervised phase. To overcome this optimization bottleneck, we propose Gradient-Isolated Reinforcement Learning for DETR (GIRL-DETR), introducing RL post-training into a lightweight temporal localization framework for the first time. The input video and text features first establish early alignment through Cross-Modal Interaction (CMI) before entering the transformer encoder. Subsequently, a Text-Guided Gating (TGG) mechanism dynamically injects semantic priors into the queries before the transformer decoder generates candidate proposals, providing high signal-to-noise ratio inputs for temporal prediction. After the supervised training reaches convergence, the backbone network is frozen to protect the feature manifold, while the detection head directly optimizes the non-differentiable evaluation metric tIoU to enhance localization accuracy through a Three-stage Progressive Reinforcement Learning (TPRL) strategy. This approach achieves an orthogonal decoupling of state representation and metric optimization. Experiments on Charades-STA, QVHighlights, and TACoS demonstrate that GIRL-DETR effectively resolves surrogate loss degradation and achieves substantial accuracy improvements with minimal parameter updates, providing a robust new pathway for RL applications in lightweight VMR models.
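
The tIoU metric that serves as the optimization target here is the temporal intersection-over-union between a predicted and a ground-truth segment. The sketch below shows the standard definition plus a thresholded hit test of the kind used in recall-style evaluation; the function names are hypothetical, and nothing in it reproduces the TPRL training strategy.

```python
def tiou(pred, gt):
    """Temporal IoU of two (start, end) segments given in seconds."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def hit_at(pred, gt, threshold):
    """Thresholded hit: piecewise constant in the boundaries, so it gives no gradient."""
    return tiou(pred, gt) >= threshold


overlap = tiou((2.0, 8.0), (4.0, 10.0))   # intersection 4 s, union 8 s
disjoint = tiou((0.0, 1.0), (2.0, 3.0))   # no overlap
exact = tiou((3.0, 7.0), (3.0, 7.0))      # identical segments
```

Because a thresholded hit changes only when a boundary crosses the threshold, and the raw tIoU is zero with zero gradient for non-overlapping segments, a surrogate loss can keep decreasing while the metric stays flat. Using the metric as an RL reward sidesteps the need to differentiate through it.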

2606.00773 2026-06-02 cs.RO

SafeVLA-Bench: A Benchmark for the Success-Safety Gap in Vision-Language-Action Models

SafeVLA-Bench: 视觉-语言-动作模型中成功-安全差距的基准

Jialiang Fan, Weizhe Xu, Oleg Sokolsky, Insup Lee, Fanxin Kong

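Affiliations: University of Notre Dame; University of Pennsylvania

AI summary: Proposes SafeVLA-Bench, a post-hoc safety evaluation framework based on Signal Temporal Logic that quantifies safety violations committed by VLA policies while completing tasks, exposing the gap between success and safety.

Details
Comments
27 pages, 5 figures
AI Chinese abstract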

视觉-语言-动作(VLA)基准衡量策略是否完成指定的操作任务,但二元成功可能隐藏与安全相关的轨迹行为:在施加过度接触、干扰旁观物体、使被持物体不稳定或进入机器人自接触的同时达到目标。我们提出了SafeVLA-Bench,一个用于现有基于模拟器的VLA基准的后验安全评估框架。它将任务感知的安全要求形式化为信号时序逻辑(STL)规范,并用两个不安全成功指标报告原生成功:Succ-But-Unsafe(SBU),即既成功又违反安全策略的滚动比例,以及Violation Severity Index(VSI),一个有界的最坏违规深度分数。我们在LIBERO和RoboCasa-365上实例化SafeVLA-Bench,评估了九个策略基准条目,涵盖桌面和厨房操作任务。高任务成功并不意味安全执行:高SR的桌面基线仍然有13%到15%的不安全情节率,而36%到56%的成功RoboCasa-365滚动违反了至少一个活跃安全条款。项目页面:https://safevla.org。

English abstract

Vision-language-action (VLA) benchmarks measure whether a policy completes a requested manipulation task, but binary success can hide safety-relevant trajectory behavior: reaching the goal while applying excessive contact, disturbing bystander objects, destabilizing the held object, or entering robot self-contact. We present SafeVLA-Bench, a post-hoc safety-evaluation framework for existing simulator-based VLA benchmarks. It formalizes task-aware safety requirements as Signal Temporal Logic (STL) specifications and reports native success with two unsafe-success metrics: Succ-But-Unsafe (SBU), the fraction of rollouts that both succeed and violate safety, and Violation Severity Index (VSI), a bounded worst-violation depth score. We instantiate SafeVLA-Bench on LIBERO and RoboCasa-365, evaluating nine policy-benchmark entries across tabletop and kitchen manipulation tasks. High task success does not imply safe execution: high-SR tabletop baselines still leave 13 to 15 percent unsafe-episode rates, and 36 to 56 percent of successful RoboCasa-365 rollouts violate at least one active safety clause. Project page: https://safevla.org.
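
The SBU metric defined in this abstract can be sketched for a single safety clause. The example assumes one STL requirement of the form "always (contact force <= limit)", whose standard robustness is the minimum margin over the trajectory; the rollout format, the force limit, and the function names are hypothetical. VSI is not reproduced because the abstract does not give its formula.

```python
def always_leq_robustness(signal, limit):
    """STL robustness of 'always (signal <= limit)': the smallest margin over time.

    A negative value means the clause is violated somewhere in the trajectory.
    """
    return min(limit - x for x in signal)


def succ_but_unsafe(rollouts, limit):
    """Fraction of all rollouts that succeed at the task and violate the clause."""
    unsafe_successes = sum(
        1 for r in rollouts
        if r["success"] and always_leq_robustness(r["force"], limit) < 0
    )
    return unsafe_successes / len(rollouts)


rollouts = [
    {"success": True, "force": [1.0, 4.0, 2.0]},    # success, safe
    {"success": True, "force": [1.0, 12.0, 3.0]},   # success, exceeds the limit
    {"success": False, "force": [15.0, 2.0]},       # unsafe failure, not counted by SBU
    {"success": True, "force": [9.0, 10.0]},        # success, touches the limit only
]

sbu = succ_but_unsafe(rollouts, limit=10.0)
success_rate = sum(r["success"] for r in rollouts) / len(rollouts)
```

In this toy set the success rate is 0.75 while a quarter of all rollouts are unsafe successes, which is the kind of gap a success-only report would hide.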

2606.00771 2026-06-02 cs.LG cs.AI cs.SD

Logit Distillation on Manifolds: Mapping by Learning

流形上的Logit蒸馏:通过学习进行映射

Yiru Yang, Junling Wang, Nishant Kumar Singh, Luohong Wu, Haoran Yan

发表机构 * University of Zurich(苏黎世大学) ETH Zurich(苏黎世联邦理工学院) Deutsche Bank Securities(德意志银行证券公司)

AI总结 提出一种逐层、逐点的投影映射方法,将学生和教师表示对齐到高维嵌入空间,结合LoRA注入,在显著减少可训练参数的同时改善词错误率。

详情
AI中文摘要

提高几乎任何机器学习模型性能的一种简单方法是,不只训练单个模型,而是训练多个采用不同算法的模型;这些模型在相同数据上会做出略有不同的预测并犯不同的错误,从而改善平均预测和鲁棒性。然而,使用整个模型集成进行预测既繁琐,计算成本也过高,难以部署给大量用户,当模型是大型神经网络时尤其如此。为此,我们引入了一种逐层、逐点的投影映射,在训练过程中将学生和教师的表示映射到对齐的高维嵌入空间。所提出的方法结合LoRA注入,将学生模型的可训练参数减少到教师模型的1%以下,同时如消融研究所示,与其他蒸馏方法相比显著改善了词错误率(WER)。与专家混合不同,我们的方法可以快速并行训练。

英文摘要

A simple way to improve the performance of almost any machine learning model is to train not a single model but several models with diverse algorithms, which make slightly different kinds of predictions and errors on the same data and thus improve the average predictions and robustness. However, making predictions using a whole ensemble of models is cumbersome and computationally too expensive to allow deployment to a large number of users, especially if the models are large neural nets. In response to this, we introduce a layer- and point-wise projection mapping, which maps student and teacher representations into an aligned high-dimensional embedding space during the training process. The proposed approach, combined with LoRA injection, reduces the student model's trainable parameters to less than 1% of the teacher model's, while significantly improving word error rate (WER) compared to other distillation methods, as demonstrated in ablation studies. Unlike a mixture of experts, our method can be trained rapidly and in parallel.

2606.00765 2026-06-02 cs.AI

FALAT: Tracing Failures in LLM Agent Trajectories via Dependency-Guided Search

FALAT: 通过依赖引导搜索追踪LLM智能体轨迹中的失败

Md Nakhla Rafi, Md Ahasanuzzaman, Dong Jae Kim, Zhijie Wang, Tse-Hsun Chen

发表机构 * SPEAR Lab(SPEAR实验室) Concordia University(康考迪亚大学) DePaul University(德保罗大学)

AI总结 提出FALAT框架,通过依赖引导搜索方法,在LLM智能体轨迹中识别导致失败的关键步骤和责任智能体。

详情
AI中文摘要

基于LLM的智能体越来越多地通过包含推理步骤、工具调用和智能体间通信的长轨迹来解决复杂任务。然而,当这些智能体失败时,往往不清楚是哪个智能体导致了失败,又是哪个步骤引入了决定性错误。这一归因问题颇具挑战性,因为错误会沿轨迹传播:后续动作可能看起来不正确,但这只是因为它们依赖于先前已被破坏的状态。因此,失败归因不能被当作相互独立的步骤级分类来处理。我们提出FALAT,一个用于LLM智能体轨迹失败归因的诊断框架。FALAT将归因表述为一个依赖引导的搜索问题。它首先构建任务应如何解决的期望,并利用该期望识别轨迹中的可疑区域。然后,它追踪决策、工具输出和智能体消息之间的依赖关系,以区分引入错误的步骤与仅仅继承或传播先前错误的步骤。最后,FALAT评估纠正某个候选步骤是否足以恢复预期结果,从而同时识别责任智能体和决定性失败步骤。我们在Who&When基准上评估FALAT,该基准同时包含算法生成和手工构造的多智能体失败轨迹。结果表明,FALAT一致地提升了责任智能体和决定性步骤的归因效果。其最佳配置在算法生成轨迹上达到46.0%的步骤级准确率,在更具挑战性的手工构造轨迹上达到29.1%,优于专门的归因基线以及直接提示独立LLM的方法。这些发现表明,依赖感知推理对于LLM智能体系统中可靠的失败诊断至关重要。

英文摘要

LLM-based agents increasingly solve complex tasks through long trajectories involving reasoning steps, tool calls, and inter-agent communication. However, when these agents fail, it is often unclear which agent caused the failure and which step introduced the decisive error. This attribution problem is challenging because mistakes can propagate across the trajectory: later actions may appear incorrect, but only because they depend on an earlier corrupted state. Therefore, failure attribution cannot be treated as independent step-level classification. We propose FALAT, a diagnostic framework for failure attribution in LLM agent trajectories. FALAT frames attribution as a dependency-guided search problem. It first constructs an expectation of how the task should be solved and uses this expectation to identify suspicious regions in the trajectory. It then traces dependencies among decisions, tool outputs, and agent messages to distinguish error-introducing steps from steps that merely inherit or propagate prior mistakes. Finally, FALAT evaluates whether correcting a candidate step would be sufficient to recover the expected outcome, allowing it to identify both the responsible agent and the decisive failure step. We evaluate FALAT on the Who&When benchmark, which includes both algorithm-generated and hand-crafted multi-agent failure trajectories. The results show that FALAT consistently improves responsible-agent and decisive-step attribution. Its best configurations achieve 46.0% step-level accuracy on algorithm-generated trajectories and 29.1% on the more challenging hand-crafted trajectories, outperforming specialized attribution baselines and direct prompting with standalone LLMs. These findings suggest that dependency-aware reasoning is essential for reliable failure diagnosis in LLM agent systems.