arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.18554 2026-06-18 cs.CV new submission

Forged Calamity: Benchmark for Cross-Domain Synthetic Disaster Detection in the Age of Diffusion

Duc-Manh Phan, Quoc-Duy Tran, Duy-Khang Do, Anh-Tuan Vo, Hai-Dang Nguyen, Trong Le Do, Mai-Khiem Tran, Vinh-Tiep Nguyen, Tam V. Nguyen, Isao Echizen, Minh-Triet Tran, Trung-Nghia Le

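Affiliations: University of Science, VNU-HCM; Vietnam National University, Ho Chi Minh City; University of Information Technology, VNU-HCM; University of Dayton; National Institute of Informatics

AI summary: To address the difficulty of detecting photorealistic disaster images generated by diffusion models, the paper introduces a benchmark dataset of 30,000 images (6,000 real, 24,000 synthetic). Experiments show that fine-tuned detectors lose up to 50% accuracy on unseen generators and that zero-shot detectors are also unstable, underscoring the need for cross-domain detection.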

Comments SOICT 2025

Abstract

The rapid advancement of text-to-image diffusion models has enabled the creation of highly photorealistic synthetic images that closely resemble real photographs, making it increasingly difficult to distinguish authentic content from AI-generated fabrications. This poses challenges for cybersecurity, digital forensics, and disaster response, where fake imagery of floods, fires, or earthquakes can spread misinformation or disrupt emergency operations. To address this, we introduce Forged Calamity, a benchmark dataset for synthetic disaster detection containing 30,000 images, including 6,000 real and 24,000 synthetic samples generated by four diffusion models. Comprehensive experiments across fine-tuned and zero-shot settings reveal consistent weaknesses in current forensic approaches. Fine-tuned detectors perform well in-distribution but lose up to 50% accuracy on unseen generators or disaster types, showing overfitting to model-specific artifacts. Zero-shot generalized detectors also struggle to maintain stable accuracy, with only limited resilience in a few representation-robust models. These findings highlight persistent generalization gaps and the urgent need for domain- and model-agnostic detection methods to ensure visual authenticity in the diffusion era.
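
The unseen-generator setting described in this abstract can be pictured as a leave-one-generator-out protocol. The sketch below is only an illustration under stated assumptions: the generator names are placeholders (the abstract does not name the four diffusion models), the even split of synthetic images across generators is assumed, and the paper's actual splits may differ.

```python
REAL_IMAGES = 6_000
SYNTHETIC_IMAGES = 24_000
GENERATORS = ["gen_a", "gen_b", "gen_c", "gen_d"]  # hypothetical placeholder names


def per_generator_count(total_synthetic, generators):
    """Images per generator, assuming an even split (an assumption, not from the abstract)."""
    return total_synthetic // len(generators)


def leave_one_generator_out(generators):
    """Yield (train_generators, held_out_generator) pairs for cross-generator testing."""
    for held_out in generators:
        train = [g for g in generators if g != held_out]
        yield train, held_out


def accuracy_drop(in_distribution_acc, unseen_acc):
    """Absolute accuracy lost when moving from seen to unseen generators."""
    return in_distribution_acc - unseen_acc


if __name__ == "__main__":
    print("total images:", REAL_IMAGES + SYNTHETIC_IMAGES)
    print("per generator:", per_generator_count(SYNTHETIC_IMAGES, GENERATORS))
    for train, held_out in leave_one_generator_out(GENERATORS):
        print("fine-tune on", train, "-> evaluate on", held_out)
```

Holding out one generator at a time is one simple way to expose overfitting to model-specific artifacts; holding out a disaster type would follow the same pattern.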

2606.18553 2026-06-18 cs.CV new submission

Hierarchical Multi-Modal Retrieval for Knowledge-Grounded News Image Captioning

Minh-Loi Nguyen, Xuan-Vu Le, Long-Bao Nguyen, Hoang-Bach Ngo, Trung-Nghia Le

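Affiliations: University of Science, VNU-HCM; Vietnam National University, Ho Chi Minh City

AI summary: Proposes an image captioning framework augmented with hierarchical multi-modal article retrieval. It combines structure-aware retrieval and contextual refinement with a VLM and an LLM to generate context-rich captions, and placed 5th in the EVENTA 2025 challenge.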

Comments SOICT 2025

Abstract

Traditional image captioning methods often struggle to generate comprehensive, context-rich descriptions, especially for details not directly observable from visual cues. To overcome this, we propose a novel retrieval-augmented image captioning framework that generates captions with deeper insights, such as object attributes, event context, and underlying significance, by leveraging external knowledge. Our approach features a hierarchical multi-modal article retrieval mechanism that moves beyond monolithic text entities. This retrieval considers article structure-aware features, including weighted textual components (e.g., headlines, body sections) and visual placement patterns, alongside multi-faceted similarity computations (content-visual, visual-visual, and discourse positioning). A subsequent contextual relevance refinement stage further enhances the retrieved information. The retrieved articles then serve as the knowledge base for caption generation: first, a VLM generates a concise image description; second, we segment relevant information from the retrieved articles based on this description; and finally, an LLM utilizes both the description and extracted knowledge to generate a comprehensive, contextually detailed caption. We participated in the ACM Multimedia EVENTA 2025 Challenge and achieved 5th place with an overall score of 0.2824 on the private test set of the OpenEvent-V1 dataset. Source code is publicly released at https://github.com/mf0212/EVENTA-Challange.

2606.18543 2026-06-18 cs.AI cs.CL cs.SE new submission

CEO-Bench: Can Agents Play the Long Game?

Haozhe Chen, Karthik Narasimhan, Zhuang Liu

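Affiliations: Princeton University

AI summary: Introduces CEO-Bench, which evaluates the combined decision-making of language model agents in long-horizon, uncertain, dynamic environments by simulating the task of operating a startup for 500 days.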

Abstract

Language model agents are becoming proficient executors at isolated, short-horizon tasks such as software engineering and customer service. Yet real-world challenges require a combination of sophisticated skills that remain largely untested in agents: (1) navigating long horizons amid uncertainty; (2) acquiring information in noisy environments; (3) adapting to a changing world; (4) orchestrating multiple moving parts toward a coherent goal. We introduce CEO-Bench, which evaluates these capabilities together by simulating a representative real-world task: operating a startup for 500 days. An agent manages pricing, marketing, budgeting, and many other aspects of a fictional company through a programmable Python interface, operating in the same environment and facing the same challenges as a human CEO. Success demands analyzing noisy, interconnected business databases, translating signals into sound strategy, and coordinating many decisions with programming. The strongest agents write sophisticated code that simulates customer cohorts to forecast future cash and mines negotiation history to uncover hidden customer preferences. Even so, most state-of-the-art models struggle in this environment. Only Claude Opus 4.8 and GPT-5.5 finish above the $1M starting balance, and neither consistently turns a profit. CEO-Bench takes a first step toward measuring the intelligence required to drive sustained, adaptive progress over time.

2606.18539 2026-06-18 cs.LG stat.ML new submission

TS-Fault: Benchmarking Time Series Forecasters Against Structural Faults

Yuyang Zhao, Lian Xu, Hao Miao, Chenxi Liu, Hao Xue

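Affiliations: Ray-zyy

AI summary: Introduces the TS-Fault benchmark, which evaluates the robustness of time series forecasting models under parameterized fault scenarios organized along two axes (observation- vs mechanism-level; univariate vs multivariate). Its findings: clean-data accuracy anti-correlates with robustness, mechanism-level faults reshuffle rankings, and foundation models are the most fragile.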

Abstract

Time series forecasting (TSF) underpins consequential decisions in energy, transportation, finance, and healthcare, yet TSF models are almost universally ranked by a single number (e.g., average error) on clean held-out data, under the implicit assumption that it predicts deployed reliability. However, real faults are not i.i.d. noise but structured events with temporal shape, broken cross-variable dependencies, regime change coupled with missingness, and causal propagation across a sensing pipeline. Treating TSF robustness as a data-quality problem, we present TS-Fault, a benchmark that evaluates forecasting models under explicit, parameterized fault scenarios with controllable semantic difficulty. TS-Fault organizes recurring failures into four modes along two orthogonal axes (observation- vs mechanism-level; univariate vs multivariate) and injects each fault into the most prediction-critical window via a unified importance score. This design enables robustness to be tested against the structures models actually rely on, rather than reduced to generic noise sensitivity. We evaluate 21 models across 6 datasets, 4 modes, and 5 difficulty levels under a paired clean/corrupt protocol. The results reveal three findings that contradict common leaderboard intuition: (i) clean-data accuracy anti-correlates with robustness; (ii) clean rankings are preserved under observation-level faults but reshuffled under mechanism-level faults; and (iii) all catastrophic failures occur under mechanism-level faults, with foundation models achieving the highest clean-data accuracy yet exhibiting the greatest fragility. The code is publicly available at https://github.com/Ray-zyy/TS-Fault.
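
To make the paired clean/corrupt idea concrete, here is a minimal sketch. Everything in it is a stand-in chosen for illustration: the moving-average forecaster, the level-shift fault, and the function names are assumptions, and it does not reproduce TS-Fault's fault modes or its importance score. It only shows why the position of the fault window matters for a given model.

```python
def moving_average_forecast(history, k=4):
    """Toy forecaster: predict the next value as the mean of the last k observations."""
    return sum(history[-k:]) / k


def inject_level_shift(series, start, length, magnitude):
    """Return a copy of the series with `magnitude` added on [start, start + length)."""
    corrupted = list(series)
    for t in range(start, min(start + length, len(series))):
        corrupted[t] += magnitude
    return corrupted


def paired_errors(history, target, start, length, magnitude, k=4):
    """Absolute error of the same forecaster on the clean and the corrupted history."""
    clean_err = abs(moving_average_forecast(history, k) - target)
    corrupted = inject_level_shift(history, start, length, magnitude)
    corrupt_err = abs(moving_average_forecast(corrupted, k) - target)
    return clean_err, corrupt_err


if __name__ == "__main__":
    history, target = [1.0] * 12, 1.0
    # Fault far from the forecast origin: this forecaster never looks at it.
    print(paired_errors(history, target, start=0, length=4, magnitude=2.0))
    # Same fault inside the window the forecaster relies on.
    print(paired_errors(history, target, start=8, length=4, magnitude=2.0))
```

The same fault leaves the forecast untouched in the first call and shifts it by the full magnitude in the second, which is the intuition behind targeting the most prediction-critical window.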

2606.18538 2026-06-18 cs.LG stat.ML new submission

Effects of sparsity and superposition on loss in simple autoencoders

Mriganka Basu Roy Chowdhury, Eric McLaughlin Weiner

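Affiliations: Department of Statistics, UC Berkeley; Department of Materials Science, UC Berkeley

AI summary: Studies polysemanticity in neural networks as arising from superposition, and mathematically derives upper and lower bounds on the L2 reconstruction loss of an autoencoder with sparse inputs, corroborating and extending the empirical results of Elhage et al.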

Comments 16 pages, 3 figures

Abstract

One of the major difficulties in the mechanistic interpretability of neural networks is the occurrence of polysemanticity, in which each neuron is typically responsible for multiple different tasks, impeding a clean interpretation of its function. The seminal paper of Elhage et al. (2022) argues that this occurs due to superposition, a phenomenon where the neural network represents distinct features as non-orthogonal directions in a lower-dimensional space, a strategy that allows much greater compression of the data without sacrificing fidelity due to the feature sparsity of input vectors. Elhage et al. (2022) empirically validate these hypotheses in a rather natural and simple autoencoder with sparse inputs. The contribution of the present work is to analyze the mathematical basis for the occurrence and optimality of superposition, while rigorously corroborating some of their findings. In particular, we provide upper and lower bounds for the L2 reconstruction loss, tight in the very sparse regime, for power activation functions. A short list of interesting open problems is also included at the end.
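
A tiny worked example may help show why sparsity makes superposition cheap. The sketch below uses the ReLU toy autoencoder $x' = \mathrm{ReLU}(W^T W x + b)$ from Elhage et al. (2022) with two features stored as antipodal directions in a single hidden dimension. The binary feature values, the activation probability `p`, and the function names are assumptions made for illustration; the power activations analyzed in this paper are not reproduced here.

```python
from itertools import product


def relu(v):
    return [max(0.0, x) for x in v]


def reconstruct(x, w, b):
    """x' = ReLU(W^T W x + b) with a single hidden dimension, W = [w_1, ..., w_n]."""
    h = sum(wi * xi for wi, xi in zip(w, x))            # compress n features to 1 number
    return relu([wi * h + bi for wi, bi in zip(w, b)])  # read every feature back out


def l2_loss(x, w, b):
    return sum((xi - ri) ** 2 for xi, ri in zip(x, reconstruct(x, w, b)))


def expected_loss(p, w, b):
    """Expected L2 loss when each feature is independently 1 with probability p, else 0."""
    total = 0.0
    for x in product([0.0, 1.0], repeat=len(w)):
        prob = 1.0
        for xi in x:
            prob *= p if xi == 1.0 else 1.0 - p
        total += prob * l2_loss(list(x), w, b)
    return total


if __name__ == "__main__":
    w, b = [1.0, -1.0], [0.0, 0.0]   # two features, antipodal in one dimension
    print(l2_loss([1.0, 0.0], w, b))  # one active feature: reconstructed exactly
    print(l2_loss([1.0, 1.0], w, b))  # both active: they cancel and are lost
    for p in (0.5, 0.1, 0.01):
        print(p, expected_loss(p, w, b))
```

With this choice of `w`, loss is incurred only when both features fire at once, so the expected loss is $2p^2$ and vanishes quickly as inputs become sparser.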

2606.18537 2026-06-18 cs.LG new submission

Do as the Romans Do: Learning Universal Behaviors from Heterogeneous Agents

Caleb Chang, Davin Win Kyi, Natasha Jaques, Karen Leung

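Affiliations: University of Washington; NVIDIA

AI summary: Proposes GRID, which extracts a general reward from heterogeneous demonstrators pursuing different goals and trains a generalist agent on it to learn universal environmental competencies, avoiding mode-averaging bias and improving fine-tuning efficiency on downstream tasks.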

Abstract

Humans often acquire new skills by observing others, since observed behaviors implicitly reveal how to act in an environment. However, observations drawn from a heterogeneous population introduce conflicting behavioral signals, making it difficult to determine which behaviors are worth imitating. We address this challenge with General Reward Inference and Disentanglement (GRID), a social learning method that extracts universally useful behaviors from a heterogeneous population of demonstrators pursuing different goals. GRID decomposes per-agent reward functions into a general reward, capturing behaviors shared across all agents, and specific rewards, capturing individual preferences and objectives. Training exclusively on the general reward provides a new paradigm of generalist pretraining. It yields a generalist agent that internalizes universal environmental competencies, such as safety and basic task proficiency, without the mode-averaging bias that afflicts standard learning from demonstration techniques. This generalist serves as a superior prior for fine-tuning to downstream tasks, including preferences unseen during training. Experiments across a synthetic basis function decomposition, multi-agent Craftax, and a continuous autonomous driving simulator (Highway-Env) confirm that GRID successfully disentangles reward structure in a semantically meaningful way, outperforms standard learning from demonstration baselines, and enables more efficient and stable specialization.

2606.18528 2026-06-18 cs.CV new submission

A Prototypical Signature Approach for Writer-Independent Offline Signature Verification

Kecia G. de Moura, Robert Sabourin, Rafael M. O. Cruz

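Affiliations: École de technologie supérieure – Université du Québec Montreal

AI summary: Proposes a data-driven strategy based on prototypical signatures that generates diverse, informative negative samples, improving detection of skilled forgeries as well as scalability and computational efficiency.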

Comments Accepted for oral presentation at the International Conference on Pattern Recognition (ICPR) 2026

Abstract

Offline handwritten signature verification aims to distinguish genuine from forged signatures using static images. Since real forgeries are rarely available, negative samples are usually randomly drawn from genuine signatures of other users to create training data. However, this random selection often lacks diversity, increases redundancy, and escalates computational cost, leading to inefficient training. We propose a data-driven strategy to generate diverse, informative negative samples using prototypical signatures, which are compact, non-identifiable summaries of genuine signature features. Based on the experimental results, we conclude that (i) prototypical signatures yield more informative negative samples, improving the detection of skilled forgeries; (ii) the proposed approach is backbone-agnostic, showing robustness across architectures; and (iii) when combined with a primal-form linear SVM, it serves as an alternative to RBF-based models while significantly improving scalability and computational efficiency. Implementation of the method is available at https://github.com/kdmoura/proto_hsv.

2606.18525 2026-06-18 cs.LG new submission

Hierarchical Attention via Domain Decomposition

Stephan Köhler, Oliver Rheinbach

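Affiliations: Faculty of Mathematics and Computer Science

AI summary: Proposes a hierarchical attention mechanism based on two-level overlapping Schwarz domain decomposition, combining local low-rank attention blocks with a coarse-grid attention block to achieve faster training and higher accuracy with fewer parameters.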

Comments 20 pages, 10 figures

Abstract

We propose a hierarchical attention mechanism based on two-level overlapping Schwarz domain decomposition. The method is motivated by the observation that two-level Schwarz domain decomposition methods combine local subdomain corrections with a coarse level that communicates global, long-range information. We test its usefulness in the context of finite-dimensional operator learning using a simple, one-dimensional diffusion problem with homogeneous Dirichlet boundary conditions. Although elementary, this problem provides a controlled sequence-to-sequence setting in which the exact nonlocal solution operator is known. After discretization, learning the solution operator amounts to approximating the inverse of a symmetric positive definite matrix. As a baseline, we use a global softmax-free low-rank attention operator of the form $QK^T$. The proposed construction replaces this dense global factorization by a two-level additive structure: local low-rank attention blocks on overlapping subdomains are combined with a coarse attention block. The resulting operator has the form $$M_{\theta}^{-1} = \Phi Q_0 K_0^T \Phi^T + \sum_{i=1}^{N} R_i^T D_i^{1/2} Q_i K_i^T D_i^{1/2} R_i.$$ Here $R_i$ restricts to an overlapping subdomain, $D_i$ is a partition-of-unity weight, and $\Phi$ is a coarse interpolation (or prolongation) matrix. Numerical experiments for synthetic Fourier right-hand sides indicate that the domain-decomposition attention operator is able to train faster and can give more accurate approximations than a global low-rank attention baseline while using significantly fewer parameters.
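
The two-level additive operator in this abstract can be assembled directly from its definition. The following is a minimal pure-Python sketch, not the authors' implementation: the piecewise-linear choice for $\Phi$, the inverse-multiplicity weights for $D_i$, the random (untrained) factors $Q_i, K_i$, and all function names are assumptions made for illustration.

```python
import random


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def restriction(indices, n):
    """R_i: 0/1 matrix selecting the entries of a length-n vector in one subdomain."""
    return [[1.0 if j == idx else 0.0 for j in range(n)] for idx in indices]


def partition_of_unity(subdomains, n):
    """Diagonals of D_i (inverse multiplicity), so that sum_i R_i^T D_i R_i = I."""
    count = [0] * n
    for idxs in subdomains:
        for j in idxs:
            count[j] += 1
    return [[1.0 / count[j] for j in idxs] for idxs in subdomains]


def hat_interpolation(n, n_coarse):
    """Phi: piecewise-linear interpolation from n_coarse >= 2 coarse nodes to n fine nodes."""
    phi = [[0.0] * n_coarse for _ in range(n)]
    for j in range(n):
        t = j * (n_coarse - 1) / (n - 1)
        k = min(int(t), n_coarse - 2)
        phi[j][k] = 1.0 - (t - k)
        phi[j][k + 1] = t - k
    return phi


def two_level_operator(n, subdomains, n_coarse, rank, rng):
    """Phi Q0 K0^T Phi^T + sum_i R_i^T D_i^{1/2} Q_i K_i^T D_i^{1/2} R_i."""
    phi = hat_interpolation(n, n_coarse)
    q0, k0 = rand_matrix(n_coarse, rank, rng), rand_matrix(n_coarse, rank, rng)
    m = matmul(matmul(phi, matmul(q0, transpose(k0))), transpose(phi))  # coarse level
    for idxs, w in zip(subdomains, partition_of_unity(subdomains, n)):
        size = len(idxs)
        r = restriction(idxs, n)
        d_half = [[w[a] ** 0.5 if a == b else 0.0 for b in range(size)] for a in range(size)]
        qi, ki = rand_matrix(size, rank, rng), rand_matrix(size, rank, rng)
        local = matmul(matmul(d_half, matmul(qi, transpose(ki))), d_half)
        m = add(m, matmul(matmul(transpose(r), local), r))  # scatter the local block back
    return m


if __name__ == "__main__":
    subs = [list(range(0, 5)), list(range(3, 8))]  # two subdomains overlapping on {3, 4}
    op = two_level_operator(8, subs, n_coarse=3, rank=2, rng=random.Random(0))
    print(len(op), "x", len(op[0]))
```

In this layout, entries coupling points that share no subdomain (for example the first and last grid points) come only from the coarse term, which is how the coarse level carries long-range information.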

2606.18524 2026-06-18 cs.LG new submission

On the Residual Scaling of Looped Transformers: Stability and Transferability

Shaowen Wang, Bingrui Li, Ge Zhang, Wenhao Huang, Shen Yan, Jian Li

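Affiliations: Tsinghua University

AI summary: For looped Transformers, shows that the residual scaling factor should be 1/N rather than 1/√L, and derives a factored parameterization for multi-layer blocks that enables hyperparameter transfer from small to large loop counts.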

Comments 19 pages, 9 figures

Abstract

Looped (weight-tied) Transformers apply a shared residual block $N$ times ($h \leftarrow h + \varepsilon\,f(h)$, same $f$ at each step), increasing effective depth without adding parameters. Prior depth-scaling analyses prescribe $\varepsilon = 1/\!\sqrt{L}$ for depth-$L$ residual networks. We show that this is insufficient for looped architectures: weight sharing makes residual updates correlated across iterations, requiring the stronger scaling $\varepsilon = 1/N$. For multi-layer blocks ($L$ unique layers looped $N$ times), we derive a factored parameterization $\varepsilon = \lambda/(N\!\sqrt{L})$ that separates the two sources of growth: $1/N$ controls the within-layer loop correlation, and $1/\!\sqrt{L}$ controls the across-layer variance. A key consequence is that the optimal learning rate depends only on the number of unique layers $L$, not on the loop count $N$, enabling direct hyperparameter transfer from small to large $N$ without retuning. Experiments on looped Transformers confirm that $1/N$ scaling improves trainability and yields better loss than $1/\!\sqrt{N}$ scaling across loop counts.
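
A scalar caricature can illustrate why tied updates call for the stronger scaling. In the sketch below the shared block is replaced by a linear map $f(h) = a\,h$, which is an assumption made purely for illustration and not the paper's analysis. Because the same $f$ is applied every time, the $N$ updates add coherently: $h_N = h_0 (1 + \varepsilon a)^N$, which stays near $e^{a}$ for $\varepsilon = 1/N$ but grows roughly like $e^{a\sqrt{N}}$ for $\varepsilon = 1/\sqrt{N}$.

```python
import math


def looped_residual(h0, a, n_loops, eps):
    """Apply h <- h + eps * f(h) n_loops times with the same linear f(h) = a * h."""
    h = h0
    for _ in range(n_loops):
        h = h + eps * (a * h)
    return h


def factored_eps(lam, n_loops, n_layers):
    """The factored scaling from the abstract: eps = lam / (N * sqrt(L))."""
    return lam / (n_loops * math.sqrt(n_layers))


if __name__ == "__main__":
    for n in (4, 16, 64, 256):
        with_inv_n = looped_residual(1.0, 1.0, n, 1.0 / n)
        with_inv_sqrt_n = looped_residual(1.0, 1.0, n, 1.0 / math.sqrt(n))
        print(f"N={n:4d}  eps=1/N: {with_inv_n:.3f}  eps=1/sqrt(N): {with_inv_sqrt_n:.3e}")
```

The hidden state under $1/N$ scaling is essentially independent of the loop count, which is consistent with the abstract's claim that settings tuned at small $N$ can carry over to large $N$.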

2606.18521 2026-06-18 cs.LG cs.AI new submission

Sparsity Curse: Understanding RLVR Model Parameter Space from Model Merging

Chenrui Wu, Zexi Li, Jiajun Bu, Jiangchuan Liu, Haishuai Wang

Affiliations: Zhejiang University; Simon Fraser University; The Chinese University of Hong Kong; Zhejiang Key Lab of Accessible Perception and Intelligent Systems

AI summary: Finds that the sparse updates of RLVR models are spread farther apart in parameter space, forming near-orthogonal shortcuts that make merging fragile, and proposes SAR-Merging to address the problem.

Comments Accepted by KDD 2026

Abstract

Reinforcement Learning with Verifiable Reward (RLVR) has emerged as a powerful post-training paradigm that surpasses Supervised Fine-Tuning (SFT) in eliciting reasoning intelligence and resisting catastrophic forgetting. Recent studies further reveal that RLVR induces highly sparse and off-principal parameter updates compared to SFT. This naturally raises the question: does such sparsity make RLVR models more amenable to model merging? If so, model merging would offer a scalable, training-free path to aggregate diverse reasoning capabilities from independently trained RLVR models. Surprisingly, we find the opposite, uncovering a sparsity curse: the sparse RLVR updates are spread farther apart in parameter space, forming near-orthogonal shortcuts that make aggregation inherently fragile. This is likely rooted in the stochasticity of RL optimization and the diversity of emergent reasoning patterns. Unlike SFT models that converge to shared, flat basins and merge naturally, RLVR models suffer severe degradation under standard merging methods. Through systematic empirical analysis of the update geometry, we characterize the mechanisms behind this failure and propose Sensitivity-aware Resolving Merging (SAR-Merging), a merging recipe tailored for the unique structure of RLVR parameter spaces. SAR-Merging resolves conflicts in overlapping update regions via Fisher Information-based sensitivity arbitration, followed by magnitude-aware sparsification and rescaling to preserve fragile reasoning pathways. Experiments on mathematical and coding benchmarks demonstrate that SAR-Merging substantially outperforms existing merging methods on RLVR models, enabling both single-task enhancement and multi-capability fusion.
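
The two stages named in this abstract (sensitivity arbitration on overlapping updates, then magnitude-aware sparsification and rescaling) can be sketched on flat parameter vectors. This is a loose illustration and not the SAR-Merging algorithm: the abstract gives no formulas, so the winner-takes-all arbitration rule, the top-k pruning, the L1-preserving rescale, and all names below are hypothetical choices.

```python
def arbitrate(delta_a, delta_b, fisher_a, fisher_b):
    """Combine two task vectors; where both are nonzero, keep the more sensitive one.

    Hypothetical rule: 'more sensitive' means the larger Fisher value at that entry.
    """
    merged = []
    for da, db, fa, fb in zip(delta_a, delta_b, fisher_a, fisher_b):
        if da != 0.0 and db != 0.0:      # overlapping update region: resolve the conflict
            merged.append(da if fa >= fb else db)
        else:                            # disjoint updates: nothing to resolve
            merged.append(da + db)
    return merged


def sparsify_and_rescale(delta, keep_ratio):
    """Keep the largest-magnitude entries, then rescale to preserve the total L1 mass.

    Ties at the threshold are all kept, so slightly more than keep_ratio may survive.
    """
    k = max(1, int(round(keep_ratio * len(delta))))
    threshold = sorted((abs(v) for v in delta), reverse=True)[k - 1]
    kept = [v if abs(v) >= threshold else 0.0 for v in delta]
    kept_mass = sum(abs(v) for v in kept)
    scale = sum(abs(v) for v in delta) / kept_mass if kept_mass > 0 else 1.0
    return [v * scale for v in kept]


def merge(base, delta_a, delta_b, fisher_a, fisher_b, keep_ratio=0.75):
    """Add the arbitrated, sparsified update to the shared base weights."""
    update = sparsify_and_rescale(arbitrate(delta_a, delta_b, fisher_a, fisher_b), keep_ratio)
    return [w + u for w, u in zip(base, update)]


if __name__ == "__main__":
    base = [0.0, 0.0, 0.0, 0.0]
    delta_a, fisher_a = [1.0, 0.0, 0.5, 0.0], [1.0, 1.0, 3.0, 1.0]
    delta_b, fisher_b = [0.0, 2.0, -0.5, 0.1], [1.0, 1.0, 1.0, 1.0]
    print(merge(base, delta_a, delta_b, fisher_a, fisher_b))
```

In the example the third entry is updated in opposite directions by the two models; plain averaging would cancel it, whereas the arbitration step keeps the update from the model with the larger sensitivity there.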

2606.18519 2026-06-18 cs.RO cs.AI new submission

As You Wish: Mission Planning with Formal Verification using LLMs in Precision Agriculture

Marcos Abel Zuzuárregui, Stefano Carpin

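Affiliations: University of California, Merced

AI summary: To address the ambiguity of natural language, proposes an LLM-based mission planning system with feedback loops based on linear temporal logic (LTL); two LLMs split the specification generation and verification subtasks, improving the reliability of mission planning in precision agriculture.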

Journal ref Published in Proceedings of 2026 International Conference on Robotics and Automation (ICRA)

Abstract

Though robotic systems are now being commercialized and deployed in various industries, many of these systems are highly specialized and often require an advanced skill set to operate and ensure they perform as instructed. To mitigate this problem, we recently introduced a mission planner leveraging LLMs to synthesize mission plans in precision agriculture based on mission descriptions provided in natural language. While the system demonstrates impressive performance, it also suffers from the inherent ambiguities of natural language. In this paper, we extend our system to address this issue by introducing multiple feedback loops in the planning architecture that leverage linear temporal logic (LTL) to ensure the mission planning system meets the specifications formulated by the user while still using natural language. To mitigate potential bias, this is achieved by using two different commercial LLMs in charge of the specification and verification subtasks. Through extensive experiments, we highlight the strengths and limitations of integrating mission verification into a fully autonomous pipeline, particularly regarding an LLM's ability to generate valuable LTL formulas, and show how our proposed implementation addresses and solves these challenges.

2606.18518 2026-06-18 cs.LG cs.AI new submission

PSyGenTAB: A Privacy-Preserving Framework for Synthetic Clinical Tabular Data Generation via Constrained Optimization

Arshia Ilaty, Hossein Shirazi, Manasi Chitale, Kedar Hegde, Dhanalakshmi Ramesh, Rashmi S. Manjunath, Amir Rahmani, Hajar Homayouni

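Affiliations: San Diego State University; University of California, Irvine

AI summary: Proposes the PSyGenTAB framework, which formulates synthetic healthcare data generation as a constrained optimization problem and embeds configurable privacy constraints via the Augmented Lagrangian Method, maximizing clinical data utility while enforcing privacy thresholds. Experiments show that models trained on the synthetic data perform comparably to models trained on real data.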

Comments 20 pages

Abstract

The development of medical AI is constrained by limited access to high-quality clinical data due to institutional silos and strict privacy regulations such as HIPAA and GDPR. Synthetic data generation offers a potential solution, but existing methods lack principled mechanisms to explicitly manage the privacy-utility trade-off, often degrading clinically meaningful patterns or risking patient re-identification. We present PSyGenTAB, a privacy-preserving generative framework that formulates synthetic healthcare data generation as a constrained optimization problem solved using the Augmented Lagrangian Method. By embedding configurable privacy constraints directly into model training, PSyGenTAB enforces minimum privacy thresholds while maximizing clinical data utility. Across multiple clinically motivated benchmarks, PSyGenTAB preserves inter-feature clinical relationships and minority-class diagnostic patterns essential for reliable health AI. Downstream evaluation using Train-on-Synthetic, Test-on-Real and Train-on-Real, Test-on-Synthetic protocols shows that models trained on synthetic data achieve performance comparable to those trained on real patient records. Privacy auditing further demonstrates reduced exact record reproduction and strong resilience to membership inference attacks. These results establish PSyGenTAB as a principled framework for balancing privacy protection and clinical utility in synthetic healthcare data, supporting secure cross-institutional AI development.
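
For readers unfamiliar with the Augmented Lagrangian Method, a one-dimensional toy problem shows the mechanics. The sketch below is not PSyGenTAB: the quadratic "utility loss" $f(x) = (x-2)^2$, the "privacy" constraint $g(x) = x - 1 \le 0$, the step sizes, and the function names are all stand-ins chosen so the answer ($x^\* = 1$, multiplier $\mu^\* = 2$) can be checked by hand.

```python
def utility_loss_grad(x):
    """Gradient of the toy utility loss f(x) = (x - 2)^2."""
    return 2.0 * (x - 2.0)


def privacy_gap(x):
    """Toy constraint g(x) = x - 1, feasible when g(x) <= 0."""
    return x - 1.0


def privacy_gap_grad(x):
    return 1.0


def augmented_lagrangian(f_grad, g, g_grad, x0, rho=10.0, outer=20, inner=200, lr=0.01):
    """Minimize f subject to g(x) <= 0 with the augmented Lagrangian method.

    Inner loop: gradient descent on
        f(x) + (max(0, mu + rho * g(x))**2 - mu**2) / (2 * rho).
    Outer loop: multiplier update mu <- max(0, mu + rho * g(x)).
    """
    x, mu = x0, 0.0
    for _ in range(outer):
        for _ in range(inner):
            active = max(0.0, mu + rho * g(x))   # penalty is on only near or past the bound
            x -= lr * (f_grad(x) + active * g_grad(x))
        mu = max(0.0, mu + rho * g(x))
    return x, mu


if __name__ == "__main__":
    x_star, mu_star = augmented_lagrangian(
        utility_loss_grad, privacy_gap, privacy_gap_grad, x0=0.0
    )
    print(round(x_star, 4), round(mu_star, 4))
```

Unlike a fixed penalty weight, the multiplier is adapted until the constraint holds at the threshold, which is what makes the threshold itself the configurable quantity.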

2606.18516 2026-06-18 cs.RO new submission

Task Allocation and Motion Planning in Dynamic, Cluttered Environments via CBBA and Graphs of Convex Sets

Matthew D. Osburn, Cameron K. Peterson, John L. Salmon

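Affiliations: Electrical and Computer Engineering; Mechanical Engineering

AI summary: For multi-agent task planning in dynamic, cluttered environments, proposes combining Graphs of Convex Sets (GCS) for trajectory optimization with the Consensus-Based Bundle Algorithm (CBBA) for distributed task allocation, yielding safe, efficient trajectory planning and task coordination.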

Comments 15 pages single column, 10 figures, AIAA-Scitech 2027 Submission

Abstract

Multi-agent task planning in cluttered, dynamic environments requires assigning tasks to agents while simultaneously determining safe, time-efficient trajectories through the environment. When tasks are dynamic, such as rendezvous objectives, allocation decisions depend not only on which agent is best suited for a task, but also on when and where that task can be reached. This paper presents a solution to this problem, which combines Graphs of Convex Sets (GCS) for trajectory optimization with the Consensus-Based Bundle Algorithm (CBBA) for distributed task allocation. In our approach, GCS finds optimal trajectories through dynamic environments using a time-extended (3D+time) configuration space. At the same time, CBBA coordinates task assignments across agents, enabling informed decision-making in a moving environment. We then connect allocation and planning to allow the agents to avoid collisions in the 3D+time configuration space and provide accurate time estimates for task completion. We demonstrate the effectiveness of our approach in simulated cluttered environments with static and dynamic tasks.

2606.18514 2026-06-18 cs.RO cs.LG new submission

N(CO)$^2$: Neural Combinatorial Optimization with Chance Constraints to Solve Stochastic Orienteering

Anas Saeed, Marcos Abel Zuzuárregui, Stefano Carpin

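Affiliations: Department of Computer Science and Engineering, University of California, Merced

AI summary: Proposes the N(CO)$^2$ framework, which uses reinforcement learning to solve the Stochastic Orienteering Problem without hand-crafted heuristics, optimizing path selection under uncertainty with performance competitive with MILP.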

Journal ref In Proceedings of the IEEE International Conference on Automation Science and Engineering (CASE), 2025

Abstract

Neural combinatorial optimization (NCO) offers a promising alternative to traditional heuristic-based methods for solving complex graph optimization problems by proposing to learn heuristics through data. This class of problems frequently arises in automation, as it can be used to model a variety of applications. While NCO has been extensively studied for deterministic combinatorial optimization problems, there are only a few works that aim to solve stochastic combinatorial optimization problems. In this work, we present N(CO)$^2$: Neural Combinatorial Optimization with Chance cOnstraints to solve the Stochastic Orienteering Problem (SOP) without the use of hand-crafted heuristics. By integrating a reinforcement learning (RL) framework, the model optimizes path selection under uncertainty, effectively balancing exploration and exploitation. Empirical results demonstrate that our method generalizes well across diverse SOP instances, achieving competitive performance compared to the state-of-the-art mixed-integer linear program (MILP) for the task. The proposed approach reduces human effort in heuristic design while enabling adaptive and efficient decision-making in uncertain environments.
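
A chance constraint on a path is commonly written as $P(\text{cost} > B) \le \delta$ for a budget $B$ and a tolerated failure probability $\delta$; the abstract does not give the paper's exact formulation, so this form is an assumption. The sketch below estimates that probability by Monte Carlo for a toy path whose edge costs are i.i.d. Uniform(1, 2), which is likewise an illustrative assumption and not the paper's cost model.

```python
import random


def sample_path_cost(n_edges, rng):
    """Total cost of a path with i.i.d. Uniform(1, 2) edge costs (toy assumption)."""
    return sum(rng.uniform(1.0, 2.0) for _ in range(n_edges))


def failure_probability(n_edges, budget, n_samples=20_000, seed=0):
    """Monte Carlo estimate of P(path cost > budget)."""
    rng = random.Random(seed)
    failures = sum(sample_path_cost(n_edges, rng) > budget for _ in range(n_samples))
    return failures / n_samples


def satisfies_chance_constraint(n_edges, budget, delta):
    """True if the estimated failure probability is at most delta."""
    return failure_probability(n_edges, budget) <= delta


if __name__ == "__main__":
    for budget in (4.5, 5.0, 5.5, 6.0):
        p_fail = failure_probability(3, budget)
        print(budget, p_fail, satisfies_chance_constraint(3, budget, delta=0.05))
```

A three-edge path has a mean cost of 4.5, so a budget equal to the mean fails about half the time; a planner that only reasons about expected cost would accept that path, while a chance-constrained one would not.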

2606.18510 2026-06-18 cs.CV cs.CR new submission

Architectural Bias in Face Presentation Attack Detection: A Comparative Study of Vision Transformers and Convolutional Neural Networks

Ngela Landon Ntung, Floride Tuyisenge, Jema David Ndibwile

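Affiliations: College of Engineering, Carnegie Mellon University

AI summary: Compares ViTs and CNNs on face presentation attack detection and finds that a pretrained ViT (DeiT-S) outperforms the CNN in accuracy, fairness, and cross-ethnicity generalization, reducing the inter-ethnic ACER gap by 83% relative to an LBP-based baseline.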

Comments 8 Pages, 4 Figures, 5 Tables

English abstract

Face Presentation Attack Detection (PAD) systems constitute a critical security layer in biometric authentication; however, existing approaches exhibit systematic performance disparities across demographic groups, disproportionately affecting individuals with darker skin tones. This paper presents a comparative empirical investigation of whether Vision Transformer architectures reduce demographic bias in face PAD systems relative to convolutional baselines. Experiments are conducted on the CASIA-SURF Cross-Ethnicity Face Anti-Spoofing (CeFA) dataset. Three architectures are evaluated: a Multimodal ViT-Tiny trained from scratch, a ResNet18 CNN baseline, and a pretrained DeiT-S fine-tuned on CeFA across African, East Asian, and zero-shot Central Asian demographic groups. DeiT-S achieves the highest overall accuracy of 97.27% and the lowest EER of 0.86%, outperforming ResNet18 at 90.15% accuracy. In terms of fairness, DeiT-S reduces the inter-ethnic ACER gap between African and East Asian subjects to 0.13%, compared to 0.75% reported in an LBP-based work [6], representing an 83% reduction. Most notably, while ResNet18 records a BPCER of 10.44% on zero-shot Central Asian subjects, DeiT-S maintains 2.89% on the same unseen group, demonstrating a 3.6x generalization advantage. These results suggest that pretrained Vision Transformers achieve superior PAD accuracy, produce smaller demographic performance gaps, and generalize more equitably across unseen demographic groups, indicating that cross-demographic fairness in PAD may partly be influenced by architectural design.
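The two headline ratios in this abstract follow directly from the percentages it reports. The short check below is a sketch only: it assumes the usual definition of ACER as the mean of APCER and BPCER (the abstract does not restate it), and the function names are illustrative, not taken from the paper.

```python
def acer(apcer: float, bpcer: float) -> float:
    """Average classification error rate: mean of APCER and BPCER (assumed definition)."""
    return (apcer + bpcer) / 2.0


def relative_reduction(baseline: float, new: float) -> float:
    """Fractional reduction of `new` relative to `baseline`."""
    return (baseline - new) / baseline


# Figures quoted in the abstract (all in percent).
gap_reduction = relative_reduction(0.75, 0.13)  # ACER gap: LBP-based work vs DeiT-S
bpcer_ratio = 10.44 / 2.89                      # BPCER: ResNet18 vs DeiT-S, zero-shot group

print(f"ACER gap reduction: {gap_reduction:.0%}")  # ACER gap reduction: 83%
print(f"BPCER ratio: {bpcer_ratio:.1f}x")          # BPCER ratio: 3.6x
```

Note that the 83% figure compares DeiT-S against the earlier LBP-based result, not against the ResNet18 baseline trained in the same study.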

2606.18509 2026-06-18 cs.LG stat.ML New submission

Concept Modulation Models: A Unified Framework for Identifiability and Extrapolation

概念调制模型:可识别性与外推的统一框架

Soheun Yi, Yizhou Lu, Chandler Squires, Pradeep Ravikumar

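Affiliations * Department of Statistics and Data Science, Carnegie Mellon University; Machine Learning Department, Carnegie Mellon University

AI summary Introduces concept modulation models (CMMs), which unify identifiability and extrapolation analysis for conditional latent variable models through attribute potentials, lift transition-based identifiability to conditional settings, and yield algebraic extrapolation criteria.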

Details
AI Chinese abstract

条件潜变量模型中的可靠泛化需要理解可识别性和外推:观测属性间的变化如何决定潜在结构,以及该结构如何决定未见属性上的分布。然而,现有的可识别性和外推保证大多是模型特定的,在非线性ICA、因果表示学习、扰动建模及相关条件潜变量模型中分别进行分析。我们引入概念调制模型(CMMs),这是一类属性索引的条件生成模型,其结构为$A\to \Lambda \to C\to X$,其中属性选择调制器,调制器诱导潜在概念法则,概念生成观测特征。CMMs通过展示观测属性上的特征一致性诱导受CMM类约束的潜在概念转移,将基于转移的可识别性提升至条件设置。我们通过属性势(属性条件概念法则之间的对数密度比)表达这些约束,将通用提升步骤与模型特定的刚性论证分离。相同的势控制外推:当且仅当传输的属性势恒等式扩展到这些属性时,未见属性上的一致性成立。这导出了代数外推准则,识别出几个现有可识别性和外推结果背后的共同基于势的证明对象,并且当与这些工作中的模型特定刚性论证结合时,恢复了它们所述的结论。

English abstract

Reliable generalization in conditional latent variable models requires understanding both identifiability and extrapolation: how observed variation across attributes determines latent structure, and how that structure determines distributions at unseen attributes. However, existing identifiability and extrapolation guarantees are largely model-specific, with separate analyses in nonlinear ICA, causal representation learning, perturbation modeling, and related conditional latent variable models. We introduce concept modulation models (CMMs), an attribute-indexed class of conditional generative models with structure $A\to \Lambda\to C\to X$, where attributes select modulators, modulators induce latent concept laws, and concepts generate observed features. CMMs lift transition-based identifiability to conditional settings by showing that feature agreement on observed attributes induces a latent concept transition constrained by the CMM class. We express these constraints through attribute potentials, log-density ratios between attribute-conditioned concept laws, separating the generic lifting step from model-specific rigidity arguments. The same potentials control extrapolation: agreement at unseen attributes holds exactly when the transported attribute-potential identities extend to those attributes. This yields algebraic extrapolation criteria, identifies the common potential-based proof objects behind several existing identifiability and extrapolation results, and, when combined with the model-specific rigidity arguments in those works, recovers their stated conclusions.

2606.18508 2026-06-18 cs.CL cs.IR New submission

MCompassRAG: Topic Metadata as a Semantic Compass for Paragraph-Level Retrieval

MCompassRAG:主题元数据作为段落级检索的语义指南针

Amirhossein Abaskohi, Raymond Li, Gaetano Cimino, Peter West, Giuseppe Carenini, Issam H. Laradji

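Affiliations * University of British Columbia; University of Salerno; ServiceNow Research

AI summary Proposes MCompassRAG, a framework that enriches chunk representations with topic metadata and trains a lightweight retriever via LLM-teacher distillation for topic-aware retrieval; across six benchmarks it improves information efficiency by 8.24% on average with over 5 times lower latency.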

Details
AI Chinese abstract

检索增强生成(RAG)系统关键依赖于文档的分块和搜索方式。细粒度块可以提高检索精度,但会扩大搜索空间,增加延迟和成本;较大的块减少了候选数量,但使密集相似性变得不可靠,因为每个块的表示混合了多个主题并引入了更多语义噪声。这种权衡在深度研究任务中尤其受限,因为检索必须在大型异构语料库中既快速又精确。我们引入了MCompassRAG,一种元数据引导的检索框架,它使用主题级信号作为语义指南针来选择相关证据。MCompassRAG不仅依赖于查询与噪声块嵌入之间的余弦相似度,还在同一嵌入空间中用主题元数据丰富块表示,并通过LLM教师蒸馏训练轻量级检索器。在推理时,MCompassRAG无需额外的LLM调用即可执行主题感知检索,提高了效率和证据质量。在六个复杂检索基准上,MCompassRAG平均信息效率(IE)提高了8.24%,延迟比最强的高效RAG基线低5倍以上。代码可从此https URL获取。

English abstract

Retrieval-augmented generation (RAG) systems depend critically on how documents are chunked and searched. Fine-grained chunks can improve retrieval precision but expand the search space, increasing latency and cost; larger chunks reduce the number of candidates but make dense similarity less reliable, as the representation for each chunk mixes multiple topics and introduces more semantic noise. This trade-off becomes especially limiting in deep research tasks, where retrieval must be both fast and precise across large, heterogeneous corpora. We introduce MCompassRAG, a metadata-guided retrieval framework that uses topic-level signals as a semantic compass for selecting relevant evidence. Instead of relying only on cosine similarity between queries and noisy chunk embeddings, MCompassRAG enriches chunk representations with topic metadata in the same embedding space and trains a lightweight retriever through LLM-teacher distillation. At inference time, MCompassRAG performs topic-aware retrieval without additional LLM calls, improving both efficiency and evidence quality. Across six complex retrieval benchmarks, MCompassRAG improves information efficiency (IE) by 8.24% on average with over 5 times lower latency than the strongest efficient RAG baselines. Code is available on https://github.com/AmirAbaskohi/MCompassRAG.
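To make the idea of enriching chunk representations with topic metadata "in the same embedding space" concrete, here is a minimal sketch. It is not the MCompassRAG implementation: the blending rule, the `alpha` weight, and the toy vectors are all assumptions made for illustration, and the distilled retriever described in the abstract is not reproduced.

```python
import math

Vector = list[float]


def normalize(v: Vector) -> Vector:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return v if norm == 0.0 else [x / norm for x in v]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def enrich(chunk_vec: Vector, topic_vec: Vector, alpha: float) -> Vector:
    """Blend a chunk embedding with its topic embedding (hypothetical rule)."""
    c, t = normalize(chunk_vec), normalize(topic_vec)
    return normalize([(1.0 - alpha) * ci + alpha * ti for ci, ti in zip(c, t)])


def retrieve(query: Vector, chunks: dict[str, tuple[Vector, Vector]], alpha: float) -> list[str]:
    """Rank chunk ids by cosine similarity between the query and enriched chunks."""
    scored = {cid: cosine(query, enrich(vec, topic, alpha)) for cid, (vec, topic) in chunks.items()}
    return sorted(scored, key=scored.get, reverse=True)


# Toy 3-D embeddings: both chunks are noisy mixtures, but only A shares the query's topic.
chunks = {
    "A": ([0.6, 0.6, 0.5], [1.0, 0.0, 0.0]),
    "B": ([0.7, 0.5, 0.5], [0.0, 1.0, 0.0]),
}
query = [1.0, 0.0, 0.0]

plain_ranking = retrieve(query, chunks, alpha=0.0)  # chunk embeddings only
topic_ranking = retrieve(query, chunks, alpha=0.5)  # topic-enriched embeddings
print(plain_ranking, topic_ranking)
```

In this toy case the plain chunk embeddings rank B first, while the topic-enriched embeddings rank A first, which is the kind of correction a topic signal is meant to provide when chunk embeddings mix several topics.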

2606.18506 2026-06-18 cs.LG eess.SP stat.AP New submission

Beyond AHI: An Interpretable Causal-Discovery-Guided Framework for Sleep Recovery in Connected Health

超越AHI:一种可解释的因果发现引导的睡眠恢复框架在互联健康中的应用

Saba A. Farahani, Elahe Khatibi, Manoj Vishwanath, Amir M. Rahmani, Hung Cao

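Affiliations * University of California, Irvine

AI summary Proposes an interpretable, causal-discovery-guided framework that derives a hierarchical Sleep Recovery Score (SRS) from multimodal PSG; across two large cohorts, the SRS aligns with perceived recovery up to 2.5 times more strongly than AHI.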

Comments 6 pages, 2 figures, 2 tables. Accepted at the 2nd Workshop on Sensing and Computing for Smart and Connected Health (SCH), co-located with IEEE/ACM CHASE 2026

Details
AI Chinese abstract

客观睡眠评估依赖于多导睡眠图(PSG),但临床影响通常更好地反映在患者报告结局(PROs)如嗜睡和疲劳中。现有的总结指标,包括呼吸暂停低通气指数(AHI),对功能恢复背后的多域生理学提供的洞察有限。我们提出了一种可解释的、因果发现引导的框架,用于从多模态PSG中推导层次化睡眠恢复评分(SRS)。利用两个大型人群队列(MESA: n=1540; MrOS: n=825),我们应用有向无环图(DAG)学习来识别候选生理驱动因素,涵盖呼吸负担、缺氧负担、睡眠碎片化、睡眠结构和自主神经调节。尽管源自临床PSG,这些域自然映射到互联健康技术中日益可用的传感流,包括可穿戴心电图、血氧测定和睡眠阶段估计设备。为了保持机制合理性,我们引入了一个两阶段筛选过程,结合基于生理学的约束和受约束的LLM辅助审计,以识别和消除结构混杂因素以及构造重叠变量。跨队列,这五个域作为与恢复相关的重复生理域出现,所得SRS与感知恢复的关联强度高达AHI的2.5倍。通过将多模态睡眠生理学与以患者为中心的结果通过可解释、偏差感知和域结构化的框架联系起来,这项工作为临床睡眠研究和新兴智能互联健康环境中的恢复建模提供了实用基础。

English abstract

Objective sleep assessment relies on polysomnography (PSG), yet clinical impact is often better reflected in patient-reported outcomes (PROs) such as sleepiness and fatigue. Existing summary indices, including the Apnea-Hypopnea Index (AHI), provide limited insight into the multidomain physiology underlying functional recovery. We propose an interpretable, causal-discovery-guided framework for deriving a hierarchical Sleep Recovery Score (SRS) from multimodal PSG. Using two large population cohorts (MESA: n=1540; MrOS: n=825), we apply directed acyclic graph (DAG) learning to identify candidate physiological drivers spanning respiratory burden, hypoxic burden, sleep fragmentation, sleep architecture, and autonomic regulation. Although derived from clinical PSG, these domains map naturally to sensing streams increasingly available in connected health technologies, including wearable ECG, oximetry, and sleep-stage estimation devices. To preserve mechanistic plausibility, we introduce a two-stage screening process that combines physiology-based constraints with constrained LLM-assisted auditing to identify and remove structural confounders and construct-overlapping variables. Across cohorts, these five domains emerge as recurrent physiological domains associated with recovery, and the resulting SRS shows up to 2.5$\times$ stronger alignment with perceived recovery than AHI. By linking multimodal sleep physiology to patient-centered outcomes through an interpretable, bias-aware, and domain-structured framework, this work provides a practical foundation for recovery modeling across both clinical sleep studies and emerging smart and connected health settings.

2606.18503 2026-06-18 cs.LG stat.ML New submission

Quantum Annealing Enhanced Reinforcement Learning for Accurate Remaining Useful Lifetime Prediction

量子退火增强强化学习用于精确剩余使用寿命预测

Manoranjan Gandhudi, Arunkumar V., G. R. Anil, Gangadharan G. R

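Affiliations * Central University of Karnataka; University College of Engineering, Anna University; AIONOS India Pvt Ltd; National Institute of Technology Tiruchirappalli

AI summary Proposes a quantum-annealing-enhanced Q-learning framework that encodes each Q-value update as a QUBO problem and uses annealer sampling for stochastic action selection, countering premature convergence in high-dimensional non-convex search spaces; it outperforms the considered baselines on C-MAPSS and a device-fleet predictive-maintenance dataset with statistically significant improvements.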

Comments 29 pages, 6 figures, 12 tables

Details
AI Chinese abstract

剩余使用寿命(RUL)估计是预测性维护的核心,意外故障的成本可能远超资产本身。统计退化模型忽略了真实系统的强非线性,而数据驱动模型在高维非凸搜索空间中常收敛到次优解。我们提出量子退火增强Q学习(QAQL)框架,将量子退火的采样行为与Q学习的序列决策相结合。每个Q值更新被编码为一个小的二次无约束二元优化(QUBO)问题,其基态对应贪婪动作;退火器不是作为确定性优化器,而是在多次读取中返回一个近最优动作的分布,这种随机动作选择提供了探索,从而抑制了在非线性退化轨迹上的过早收敛。QUBO在D-Wave Advantage系统上通过小规模嵌入求解,退火器被嵌入强化学习循环中,而非训练后附加。我们在两个公开基准上验证了QAQL:NASA C-MAPSS涡扇发动机数据集和一个设备群预测性维护数据集。在多次独立运行和六个误差指标上平均,QAQL优于本研究考虑的经典和量子基线,具有统计显著性改进。结果表明,量子退火是工业预测性维护应用中强化学习循环内一个可用的(而非仅理论上的)优化器。

English abstract

Remaining useful life (RUL) estimation is central to predictive maintenance, where an unplanned failure can cost far more than the asset itself. Statistical degradation models miss the strong nonlinearity of real systems, and data-driven models often converge to suboptimal solutions in high-dimensional, non-convex search spaces. We propose a Quantum Annealing enhanced Q-Learning (QAQL) framework that couples the sampling behaviour of quantum annealing with the sequential decision making of Q-learning. Each Q-value update is encoded as a small quadratic unconstrained binary optimization (QUBO) whose ground state is the greedy action; rather than acting as a deterministic optimizer, the annealer returns a distribution over near-optimal actions across many reads, and this stochastic action selection supplies the exploration that curbs premature convergence on nonlinear degradation trajectories. The QUBO is solved on the D-Wave Advantage system using minor embedding, with the annealer woven into the reinforcement-learning loop rather than bolted on after training. We validate QAQL on two public benchmarks: the NASA C-MAPSS turbofan engine datasets and a device-fleet predictive maintenance dataset. Averaged over many independent runs and across six error metrics, QAQL outperforms the classical and quantum baselines considered in this study, with statistically significant improvements. The results indicate that quantum annealing is a usable, not merely theoretical, optimizer inside a reinforcement-learning loop for industrial predictive-maintenance applications.
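The QUBO-based action selection described above can be illustrated with a small classical stand-in. This is a sketch under stated assumptions, not the QAQL encoding: it assumes a one-hot action encoding with a quadratic penalty, enumerates all states by brute force instead of calling a quantum annealer, and mimics the annealer's many reads with Boltzmann sampling. The function names, the penalty rule, and the `temperature` parameter are hypothetical.

```python
import itertools
import math
import random


def build_qubo(q_values, penalty):
    """One-hot QUBO: minimise -sum(Q[a]*x[a]) + penalty*(sum(x) - 1)**2 (constant dropped)."""
    qubo = {}
    for a in range(len(q_values)):
        qubo[(a, a)] = -q_values[a] - penalty       # linear terms, using x*x == x
        for b in range(a + 1, len(q_values)):
            qubo[(a, b)] = 2.0 * penalty            # pairwise penalty terms
    return qubo


def energy(qubo, x):
    """QUBO energy of a binary assignment x."""
    return sum(coeff * x[i] * x[j] for (i, j), coeff in qubo.items())


def enumerate_energies(q_values):
    """Brute-force every binary state; only feasible for a handful of actions."""
    penalty = max(abs(q) for q in q_values) + 1.0   # large enough to enforce one-hot states
    qubo = build_qubo(q_values, penalty)
    states = list(itertools.product((0, 1), repeat=len(q_values)))
    return states, [energy(qubo, s) for s in states]


def sample_actions(q_values, num_reads=200, temperature=0.3, seed=0):
    """Classical stand-in for annealer reads: Boltzmann sampling over QUBO states."""
    states, energies = enumerate_energies(q_values)
    e_min = min(energies)
    weights = [math.exp(-(e - e_min) / temperature) for e in energies]
    reads = random.Random(seed).choices(states, weights=weights, k=num_reads)
    # Keep only valid one-hot reads and decode them to action indices.
    return [s.index(1) for s in reads if sum(s) == 1]


q_values = [0.2, 0.9, 0.5]
states, energies = enumerate_energies(q_values)
ground_state = states[energies.index(min(energies))]   # one-hot vector of the greedy action
actions = sample_actions(q_values)
```

The ground state selects the greedy action (index 1 here), while the sampled reads mostly return that action and occasionally return the others, which is the exploration effect the abstract attributes to the annealer's read distribution.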

2606.18502 2026-06-18 cs.CL New submission

Towards Scalable Customization and Deployment of Multi-Agent Systems for Enterprise Applications

面向企业应用的多智能体系统可扩展定制与部署

Paresh Dashore, Shreyas Kulkarni, Uttam Gurram, Nadia Bathaee, Kartik Balasubramaniam, Genta Indra Winata, Sambit Sahu, Shi-Xiong Zhang

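Affiliations * Capital One

AI summary Proposes a unified framework that combines agentic model customization (continual pretraining, supervised fine-tuning, preference optimization) with inference optimization (speculative decoding, FP8 quantization), achieving domain adaptation and a 4.48x throughput speedup while maintaining performance and improving robustness on long-tail scenarios.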

Comments Preprint

Details
AI Chinese abstract

基于大语言模型的多智能体系统在复杂推理和任务执行上表现出色,支持广泛的企业应用。然而,由于领域特定的定制需求以及智能体工作流中的高延迟和推理成本,生产部署仍然具有挑战性。我们提出了一个统一框架,用于在实际环境中定制和高效部署多智能体系统。第一阶段,智能体模型定制,结合持续预训练、监督微调和偏好优化,将紧凑模型适应到专业领域,同时保留强大的智能体能力。第二阶段,推理优化,集成推测解码和FP8量化与目标校准,以最小质量损失实现成本高效的推理服务。在企业工作负载上,我们的框架实现了快速领域自适应,吞吐量提升4.48倍,同时保持性能并提高长尾场景的鲁棒性。

English abstract

Large language model (LLM)-based multi-agent systems demonstrate strong performance on complex reasoning and task execution, enabling broad enterprise applications. However, production deployment remains challenging due to domain-specific customization requirements and high latency and inference costs in agentic workflows. We propose a unified framework for customization and efficient deployment of multi-agent systems in real-world settings. The first stage, Agentic Model Customization, combines continual pretraining, supervised fine-tuning, and preference optimization to adapt a compact model to specialized domains while retaining strong agentic capabilities. The second stage, Inference Optimization, integrates speculative decoding and FP8 quantization with targeted calibration to enable cost-efficient serving with minimal quality loss. Across enterprise workloads, our framework enables rapid domain adaptation and achieves a 4.48x speedup in throughput while maintaining performance and improving robustness on long-tail scenarios.

2606.18496 2026-06-18 cs.CV cs.AI New submission

Neural Phase Correlation

神经相位相关

Cole Reynolds

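Affiliations * Weyl Labs

AI summary Proposes a learned generalization of phase correlation that decomposes transformations on a learnable basis, extends to non-rigid deformations and unitary dynamics, and matches or exceeds existing methods on cardiac MRI and echocardiography datasets.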

Details
AI Chinese abstract

对应关系本质上是关系性的:它寻求同一场景两次观测之间的未知变换,而非任一观测的内容。然而,主流的基于学习的方法并未将变换表示为架构中的一等对象。它们独立编码每幅图像,让学习的相似度函数或深度解码器隐式地发现映射。相位相关是典型的例外,它直接在傅里叶域测量图像间关系,但其固定基的刚性将其限制于全局平移。我们引入相位相关的学习泛化,通过学习变换分解所基于的基来解除这一限制。相同的代数原语可扩展到密集非刚性形变和幺正动力学。在ACDC心脏MRI基准上,该框架在两个配准方向上匹配或超越先前发表的基线。在CAMUS超声心动图上,它无需辅助评分或自适应平滑机制即可达到最先进水平。应用于一维量子谐振子的时间演化波函数对时,同一框架仅从观测对中恢复未知哈密顿量的埃尔米特函数本征态和量子化能级。

English abstract

Correspondence is fundamentally relational: it seeks the unknown transformation between two observations of a common scene, not the content of either. Yet the dominant learning-based methods do not represent the transformation as a first-class object in the architecture. They encode each image independently and let a learned similarity function or a deep decoder discover the mapping implicitly. Phase correlation is the canonical exception, measuring the inter-image relationship directly in the Fourier domain, but the rigidity of its fixed basis confines it to global translation. We introduce a learned generalization of phase correlation that lifts this restriction by learning the basis on which the transformation decomposes. The same algebraic primitive extends to dense non-rigid deformations and to unitary dynamics. On the ACDC cardiac-MRI benchmark the framework matches or exceeds prior published baselines in both registration directions. On CAMUS echocardiography it matches the state of the art without auxiliary scoring or adaptive-smoothness mechanisms. Applied to time-evolved wavefunction pairs of the 1-D quantum harmonic oscillator, the same framework recovers the Hermite-function eigenstates and the quantized energy levels of the unknown Hamiltonian from observation pairs alone.
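For readers unfamiliar with the classical baseline this abstract generalizes, the sketch below shows textbook phase correlation on a 1-D signal, recovering a circular shift from the normalized cross-power spectrum. It uses a naive DFT from the standard library to stay self-contained; it is the fixed-Fourier-basis method only, not the learned-basis framework proposed in the paper, and all names are illustrative.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform (inverse includes the 1/N factor)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def circular_shift(x, s):
    """Return x delayed by s samples with wrap-around."""
    return [x[(i - s) % len(x)] for i in range(len(x))]


def phase_correlation_shift(a, b, eps=1e-12):
    """Estimate the circular shift s such that b is a delayed by s samples."""
    fa, fb = dft(a), dft(b)
    # Cross-power spectrum, normalised to unit magnitude so only phase remains.
    cross = [fa_k.conjugate() * fb_k for fa_k, fb_k in zip(fa, fb)]
    normalized = [c / (abs(c) + eps) for c in cross]
    # The inverse transform of a pure phase ramp is a peak at the shift.
    corr = dft(normalized, inverse=True)
    return max(range(len(corr)), key=lambda i: corr[i].real)


signal = [0.0, 1.0, 4.0, 2.0, 0.0, 0.0, 0.0, 0.0]
shifted = circular_shift(signal, 3)
estimated_shift = phase_correlation_shift(signal, shifted)
print(estimated_shift)  # 3
```

Because the Fourier basis diagonalizes translations only, this estimator recovers a single global shift; that limitation is what the abstract refers to as the rigidity of the fixed basis.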

2606.18487 2026-06-18 cs.LG cs.AI cs.CL New submission

SFT Overtraining Predicts Rank Inversion via Entropy Collapse Under RLVR

SFT 过训练通过熵崩溃预测 RLVR 下的排名反转

Siddharth Aphale, Kelly Liu

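Affiliations * Stanford University

AI summary Finds that SFT overtraining lowers the entropy of the rollout distribution, so the advantage signal in GRPO vanishes and checkpoint rankings invert; a two-stage entropy-based diagnostic flags high-risk checkpoints early.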

Comments 14 pages, 6 figures. Accepted at the Deep Learning for Code (DL4C) Workshop at ICML 2026

Details
AI Chinese abstract

当 SFT 压缩 rollout 分布时,选择 pass@1 最高的 SFT 检查点进行 GRPO 的标准启发式方法可能失败。对于二元奖励,组内期望优势方差为 $p(1{-}p)(g{-}1)/g$;当早期 GRPO 将 $p$ 驱动到 $p^*(g)$ 以下时,大多数组具有相同奖励,不提供组间相对信号。我们研究了 Qwen2.5-Coder-3B 和 DeepSeek-Coder-6.7B 的 SFT 深度阶梯。我们在五个深度和三个种子上测试 Qwen2.5-Coder-3B,在四个匹配深度和三个种子上测试 DeepSeek-Coder-6.7B。在 Qwen 上,RL 前的 pass@1 随 SFT 深度增加而上升,但 GRPO 峰值 pass@10 从 $0.806$ 下降到 $0.481$(3 种子均值,$n{=}20$);RL 前的熵与 GRPO 结果正相关($\rho{=}{+}0.69$)。在 DeepSeek 上,pass@1 仍远高于 $p^*(8){=}0.083$,GRPO 结果压缩而非反转。结合 RL 前熵分诊与早期 GRPO 熵监测的两阶段诊断方法,可标记高风险检查点并提前停止失败运行。在我们的设置中,简单的 KL 参考正则化和标签平滑变体未能挽救崩溃的 Qwen 检查点,表明该失败并非琐碎的 GRPO 超参数伪影。

English abstract

The standard heuristic of selecting the SFT checkpoint with the highest pass@1 for GRPO can fail when SFT compresses the rollout distribution. For binary rewards, the expected within-group advantage variance is $p(1{-}p)(g{-}1)/g$; when early GRPO drives $p$ below $p^*(g)$, most groups have identical rewards and provide no group-relative signal. We study SFT depth ladders for Qwen2.5-Coder-3B and DeepSeek-Coder-6.7B. We test Qwen2.5-Coder-3B across five depths and three seeds, and DeepSeek-Coder-6.7B across four matched depths and three seeds. On Qwen, pre-RL pass@1 rises with SFT depth, but peak GRPO pass@10 falls from $0.806$ to $0.481$ (3-seed mean, $n{=}20$); pre-RL entropy is positively associated with the GRPO outcome ($\rho{=}{+}0.69$). On DeepSeek, pass@1 remains far above $p^*(8){=}0.083$, and GRPO outcomes compress rather than invert. A two-stage diagnostic, combining pre-RL entropy triage with an early GRPO entropy monitor, flags high-risk checkpoints and can stop failing runs early. Simple KL-to-reference regularisation and label-smoothing variants do not rescue the collapsed Qwen checkpoint in our setting, suggesting the failure is not a trivial GRPO hyperparameter artefact.
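The variance expression and the quoted threshold can be checked numerically. The sketch below assumes rewards within a group are independent Bernoulli draws with success probability p. The reading of $p^*(g)$ as the pass rate at which half of all groups are all-fail, i.e. $(1-p)^g = 1/2$, is an assumption made here rather than a definition stated in the abstract; it is included because it reproduces the quoted value $p^*(8) = 0.083$.

```python
def expected_advantage_variance(p: float, g: int) -> float:
    """Expected within-group variance of binary rewards: p(1-p)(g-1)/g."""
    return p * (1.0 - p) * (g - 1) / g


def prob_uniform_group(p: float, g: int) -> float:
    """Probability that all g rewards in a group are identical (all pass or all fail)."""
    return p ** g + (1.0 - p) ** g


def half_zero_threshold(g: int) -> float:
    """Pass rate at which (1-p)**g == 0.5 (assumed reading of p*(g))."""
    return 1.0 - 0.5 ** (1.0 / g)


g = 8
p_star = half_zero_threshold(g)
print(round(p_star, 3))                               # 0.083
print(expected_advantage_variance(0.5, g))            # 0.21875, the maximum over p
print(round(expected_advantage_variance(0.05, g), 4)) # 0.0416
```

Under this reading, once p falls below roughly 0.083 with groups of eight, more than half of the groups carry no relative signal at all, which is consistent with the abstract's description of the failure.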

2606.18484 2026-06-18 cs.CV New submission

Vines-DB: An RGB image dataset for multi-species ornamental vine segmentation

Vines-DB:用于多物种观赏藤蔓分割的RGB图像数据集

Saroj Burlakoti, Utsav Bhandari, Aaron Etienne, Shital Poudyal

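Affiliations * Department of Plants, Soils and Climate, Utah State University; Department of Applied Sciences, Technology and Education, Utah State University

AI summary To support multi-class instance segmentation in precision horticulture and urban ecology, the authors build Vines-DB, an RGB image dataset of seven ornamental vine species; manual annotation and augmentation yield 2,307 images split into training, validation, and test sets.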

Comments 7 pages, 1 figure. Source data repository: OSF (DOI: 10.17605/OSF.IO/YJHCK)

Details
AI Chinese abstract

Vines-DB数据集包含在美国犹他州洛根市犹他农业实验站格林维尔研究农场田间条件下采集的7种观赏藤蔓的1,218张原始高分辨率RGB图像。该数据集来自168株于2022年移植的藤本植物,在2023和2024生长季(7月至10月)的多个月份重复拍摄。图像使用配备48 MP摄像头的iPhone 16 Pro在上午10:00至下午12:00之间于日光下拍摄。藤蔓生长在1.2m x 2.4m的格架上,从1m距离处拍摄,背景为黑色或白色泡沫板,以增强对比度并减少背景噪声。数据集包括木通、凌霄花、藤绣球、金银花、凌霄'马德琳·加伦'、五叶地锦和多花紫藤。所有原始图像由训练有素的标注员在Roboflow中手动标注,生成基于多边形的实例分割掩码,共8个类别(7个物种和背景)。经过预处理和数据增强后,工作数据集扩展至2,307张图像,用于模型开发和评估。增强后的数据集通过分层抽样划分为2,019张训练图像、192张验证图像和96张测试图像,以保持平衡的代表性。Vines-DB支持精准园艺和城市生态中多类实例分割深度学习模型的开发和评估。该数据集可实现自动冠层覆盖度估计、物种识别和可扩展的田间表型分析等应用。此外,每月重复成像捕获了冠层发育和植物外观的时间变化,增加了数据集在真实田间条件下进行分割基准测试的实用性。

English abstract

The Vines-DB dataset contains 1,218 original high-resolution RGB images of seven ornamental vine species collected under field conditions at the Utah Agricultural Experiment Station's Greenville Research Farm in Logan, Utah, USA. The dataset was generated from 168 individual vine plants that were transplanted in 2022 and photographed repeatedly across multiple months during the 2023 and 2024 growing seasons (July-October). Images were captured with an iPhone 16 Pro equipped with a 48 MP camera between 10:00 AM and 12:00 PM under daylight. Vines were grown on 1.2 m x 2.4 m trellises and photographed from a distance of 1 m against black or white Styrofoam backdrops to improve contrast and reduce background noise. The dataset includes Akebia quinata, Campsis radicans, Hydrangea anomala petiolaris, Lonicera x heckrottii, Campsis x tagliabuana 'Madame Galen', Parthenocissus quinquefolia, and Wisteria floribunda. All original images were manually annotated in Roboflow by trained annotators to produce polygon-based instance segmentation masks for eight classes, including seven species and background. After preprocessing and data augmentation, the working dataset was expanded to 2,307 images for model development and evaluation. The augmented dataset was divided into 2,019 training images, 192 validation images, and 96 test images using stratified sampling to maintain balanced representation. Vines-DB supports the development and evaluation of deep learning models for multi-class instance segmentation in precision horticulture and urban ecology. The dataset enables applications such as automated canopy cover estimation, species identification, and scalable field phenotyping. In addition, repeated monthly imaging of the plants captures temporal variation in canopy development and plant appearance, increasing the dataset's utility for segmentation benchmarking under realistic field conditions.

2606.18479 2026-06-18 cs.LG cs.CY New submission

The Illusion of Improvement: Reject Inference Strategies in Credit Scoring

改进的幻觉:信用评分中的拒绝推断策略

Bruno Scarone, Ricardo Baeza-Yates

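Affiliations * Northeastern University; KTH Royal Institute of Technology

AI summary Shows that feedback loops make standard evaluation metrics misleading for reject inference methods in credit scoring, and proposes a small amount of controlled exploration to break the loop and diagnose its severity.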

Comments Accepted to ECML PKDD 2026 (Research Track)

Details
AI Chinese abstract

拒绝推断方法被广泛用于减轻信用评分中的生存偏差,但其有效性仍不明确。我们系统评估了几种此类方法,并发现一个结构性失败模式:在自然的再训练循环中,模型的准确率提升而召回率崩溃,造成改进的幻觉,使从业者认为系统在变好,而实际上其拒绝质量——正确筛选出违约者的能力——在恶化。然后,我们提出一种受控探索策略,无需统计假设即可打破反馈循环:贷款方故意批准一部分被拒绝的申请人,并观察他们的真实结果。我们表明,准确率和拒绝质量在是否探索上给出相反的建议:准确率倾向于不探索,而拒绝质量随探索提高,证实标准评估指标在选择性偏差下具有误导性。即使极低的探索率(2-5%)在我们的实验中也足以以近乎零成本诊断反馈循环的严重性。我们的发现在两种机器学习方法和三个真实数据集上一致,表明标准评估协议不足以评估在生存偏差下训练的模型。

English abstract

Reject inference methods are widely used to mitigate survival bias in credit scoring, yet their effectiveness remains poorly understood. We systematically evaluate several such methods and uncover a structural failure mode: in a natural retraining cycle, models whose accuracy improves while recall collapses create an illusion of improvement that leads practitioners to believe the system is getting better when, in fact, its rejection quality (the ability to correctly screen out defaulters) is deteriorating. We then propose a controlled exploration strategy that breaks the feedback loop without statistical assumptions: the lender deliberately approves a fraction of rejected applicants and observes their true outcomes. We show that accuracy and rejection quality give opposite recommendations on whether to explore: accuracy favors no exploration, while rejection quality improves with it, confirming that standard evaluation metrics are misleading under selection bias. Even minimal exploration rates (2-5%) prove sufficient in our experiments to diagnose the severity of the feedback loop at near-zero cost. Our findings are consistent across two machine learning methods and three real-world datasets, and suggest that standard evaluation protocols are inadequate for assessing models trained under survival bias.
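The controlled exploration strategy can be sketched as a small change to the approval rule: applicants the model would reject are still approved with a small probability, so their true outcomes become observable. The code below is an illustration under assumptions, not the authors' protocol: the uniform random scores, the threshold, and the function names are hypothetical stand-ins.

```python
import random


def lending_decision(score: float, threshold: float, explore_rate: float, rng: random.Random) -> str:
    """Approve above the threshold; otherwise approve with probability explore_rate."""
    if score >= threshold:
        return "approved"
    if rng.random() < explore_rate:
        return "explored"   # would have been rejected; outcome is now observable
    return "rejected"       # outcome stays unobserved (source of survival bias)


def simulate(num_applicants: int = 20000, threshold: float = 0.5,
             explore_rate: float = 0.05, seed: int = 0) -> dict[str, int]:
    """Count decisions over synthetic applicants with uniform random scores."""
    rng = random.Random(seed)
    counts = {"approved": 0, "explored": 0, "rejected": 0}
    for _ in range(num_applicants):
        score = rng.random()  # stand-in for a model's creditworthiness score
        counts[lending_decision(score, threshold, explore_rate, rng)] += 1
    return counts


counts = simulate()
below_threshold = counts["explored"] + counts["rejected"]
print(counts, counts["explored"] / below_threshold)
```

The explored applicants form an unbiased sample of the population the model rejects, which is what allows rejection quality to be measured without relying on the model's own predictions for that group.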

2606.18478 2026-06-18 cs.CV New submission

Data-Forcing Distillation: Restoring Diversity and Fidelity in Few-Step Video Generation

数据强制蒸馏:恢复少步视频生成中的多样性和保真度

Siyi Chen, Shaowei Liu, Yixuan Jia, Zian Wang, Huan Ling, Qing Qu, Jun Gao

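Affiliations * University of Michigan; NVIDIA; University of Illinois Urbana-Champaign

AI summary To address the mode collapse and over-saturation that Distribution Matching Distillation (DMD) exhibits in few-step video generation, proposes Data-Forcing Distillation (DFD), which uses the teacher score discrepancy to guide the student toward the real-data distribution and restores diversity and fidelity with a single-line code change.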

Details
AI Chinese abstract

最近的进展表明,将多步视频扩散模型蒸馏为高效的少步学生模型具有前景。其中,分布匹配蒸馏(DMD)及其后继DMD2实现了强大的生成质量和快速收敛。然而,由于反向KL目标的性质,这些方法表现出两个持续的失败模式:样本多样性大幅下降,以及明显过饱和的输出偏离真实视频外观。在这项工作中,我们提出了数据强制蒸馏(DFD),一个简单的训练后框架,通过仅一行代码更改即可恢复DMD中的多样性和保真度。其核心是教师评分差异,用于引导学生朝向真实数据分布,将其拉向缺失的模式(缓解模式坍塌)并远离真实数据中不存在的问题模式(避免过饱和)。我们提供了框架的深入理论分析,并在文本到视频、图像到视频和自回归视频生成上验证了我们的方法。仅需100-300步微调,DFD就能有效恢复Wan2.1-1.3B和Cosmos-Predict2.5-2B模型上的多样性和保真度,解决过饱和伪影,显著改善视频动态和外观,甚至优于教师模型。

English abstract

Recent progress has shown promise in distilling multi-step video diffusion models into efficient few-step students. Among them, Distribution Matching Distillation (DMD) and its successor DMD2 achieved strong generation quality and fast convergence. However, due to the nature of the reverse Kullback-Leibler (KL) objective, these methods exhibit two persistent failure modes: a substantial drop in sample diversity, and visibly over-saturated outputs that deviate from real-video appearance. In this work, we propose Data-Forcing Distillation (DFD), a simple post-training framework that restores diversity and fidelity in DMD with only a single-line code change. At its core is the teacher score discrepancy, which guides the student toward the real-data distribution, pulling it to missing modes (mitigating mode collapse) and away from problematic modes absent in real data (avoiding over-saturation). We provide an in-depth theoretical analysis of our framework and validate our approach on text-to-video, image-to-video, and autoregressive video generation. With only 100-300 steps of finetuning, DFD effectively restores diversity and fidelity on both the Wan2.1-1.3B and Cosmos-Predict2.5-2B models, resolving the over-saturation artifacts with significantly better video dynamics and appearance, and even outperforming the teacher model.

2606.18473 2026-06-18 cs.CL New submission

PreUnlearn: Auditing Collateral Knowledge Damage Before Large Language Model Unlearning

PreUnlearn: 在大语言模型遗忘之前审计附带知识损害

Bo Su, Ankit Shah, Thai Le

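Affiliations * Indiana University Bloomington

AI summary Proposes PreUnlearn, which uses data features to predict the collateral damage of an unlearning run to same-domain and distant-domain knowledge, enabling risk auditing before unlearning is executed.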

Comments 12 pages, 6 figures

Details
AI Chinese abstract

大语言模型(LLMs)的机器遗忘旨在移除特定知识,同时保留模型其余能力。然而,遗忘与保留知识之间的界限往往不明确,因为相关甚至遥远的信息可能在模型中纠缠。在本文中,我们从数据中心的视角研究LLM遗忘,并衡量遗忘效应如何从遗忘集传播到同领域和远距离知识。我们发现一致的衰减模式:附带损害在遗忘集附近最强,随语义距离减弱,但不会在领域边界消失。我们进一步询问这种损害是否可以在执行遗忘之前被审计。我们将遗忘集审计制定为遗忘前预测任务,并分析哪些数据特征最能预测下游损害。我们的结果表明,遗忘集与评估集之间的交互特征提供了最强的信号,表明附带损害部分反映在模型更新前的数据几何中。这些发现将遗忘集审计定位为识别风险遗忘运行和设计更可靠遗忘程序的早期预警工具。

English abstract

Machine unlearning for large language models (LLMs) aims to remove specified knowledge while preserving the rest of the model's capabilities. However, the boundary between knowledge to forget and knowledge to retain is often unclear, since related and even distant information may be entangled in the model. In this paper, we study LLM unlearning from a data-centric perspective and measure how unlearning effects propagate from the forget set to same-domain and distant-domain knowledge. We find a consistent decay pattern: collateral damage is strongest near the forget set, weakens with semantic distance, but does not disappear at domain boundaries. We further ask whether such damage can be audited before unlearning is executed. We formulate forget-set auditing as a pre-unlearning prediction task and analyze which data features are most predictive of downstream damage. Our results show that interaction features between the forget set and evaluation set provide the strongest signals, suggesting that collateral damage is partly reflected in data geometry before model updates occur. These findings position forget-set auditing as an early warning tool for identifying risky unlearning runs and designing more reliable unlearning procedures.

2606.18472 2026-06-18 cs.CV New submission

Domain Generalizable Adaptation of 3D Vision-Language Models via Regularized Fine-Tuning

通过正则化微调实现可域泛化的3D视觉-语言模型适应

Sneha Paul, Zachary Patterson, Nizar Bouguila

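Affiliations * Concordia University

AI summary Proposes ReFine3D, a framework that improves the domain generalization of 3D large multimodal models through regularization strategies including selective layer tuning, multi-view consistency, synonym-based prompts, and point-rendered vision supervision.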

Comments Accepted at Transactions on Machine Learning Research (TMLR)

Details
AI Chinese abstract

域适应仍然是3D视觉中的一个核心挑战,特别是对于将3D点云与视觉和文本数据对齐的多模态基础模型。尽管这些模型表现出强大的通用能力,但将其适应到数据有限的下游领域往往会导致过拟合和灾难性遗忘。为了解决这个问题,我们引入了ReFine3D,一个正则化的微调框架,专为3D大语言模型(LMMs)的可域泛化调优而设计。ReFine3D将选择性层调优与两种针对性的正则化策略相结合:跨增强点云的多视图一致性,以及通过大语言模型生成的基于同义词的提示实现的文本多样性。此外,我们加入了点渲染的视觉监督和一种基于置信度聚合的测试时增强机制,以进一步增强鲁棒性。在不同3D域泛化基准上的大量实验表明,ReFine3D将基类到新类泛化提高了1.36%,跨数据集迁移提高了2.43%,对损坏的鲁棒性提高了1.80%,少样本准确率提高了最多3.11%,以最小的额外计算开销超越了先前的最先进方法。

English abstract

Domain adaptation remains a central challenge in 3D vision, especially for multimodal foundation models that align 3D point clouds with visual and textual data. While these models demonstrate strong general capabilities, adapting them to downstream domains with limited data often leads to overfitting and catastrophic forgetting. To address this, we introduce ReFine3D, a regularized fine-tuning framework designed for domain-generalizable tuning of 3D large multimodal models (LMMs). ReFine3D combines selective layer tuning with two targeted regularization strategies: multi-view consistency across augmented point clouds and text diversity through synonym-based prompts generated by large language models. Additionally, we incorporate point-rendered vision supervision and a test-time augmentation mechanism with confidence-based aggregation to further enhance robustness. Extensive experiments across different 3D domain generalization benchmarks show that ReFine3D improves base-to-novel class generalization by 1.36%, cross-dataset transfer by 2.43%, robustness to corruption by 1.80%, and few-shot accuracy by up to 3.11%, outperforming prior state-of-the-art methods with minimal added computational overhead.

2606.18471 2026-06-18 cs.CL New submission

Possible or Definite? A Benchmark for Evaluating Diagnostic Uncertainty Preservation in Clinical Text

可能还是确定?评估临床文本中诊断不确定性保留的基准

Hongbo Du, Zixin Lu, Jiaming Qu

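Affiliations * Trine University; University of Michigan; Amazon

AI summary Builds a benchmark with 9,184 uncertainty annotations to evaluate how well LLMs preserve diagnostic uncertainty in clinical text, finding that LLMs often preserve the original uncertainty cues less than half the time and struggle to distinguish adjacent levels.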

Details
AI Chinese abstract

大型语言模型(LLMs)越来越多地用于临床文本任务,如总结和修订。虽然大多数研究评估LLM生成文本的流畅性和连贯性,但LLM是否正确保留诊断不确定性仍未得到充分探索。在临床实践中,诸如“可能肺炎”之类的短语传达了现有证据的强度,并直接指导后续检测和治疗决策。改变这些不确定性表达可能会完全改变临床含义。在本文中,我们通过两个步骤系统地评估了这个问题。首先,我们构建了一个包含1200份临床文档的基准,其中包含跨五个级别的9184个不确定性标注。其次,我们在此基准上评估了三个LLM。我们的结果表明:(1)LLM保留原始不确定性线索的能力很差,通常不到一半的时间;(2)LLM难以区分相邻级别之间的细微差别。这项工作揭示了标准评估指标无法捕捉的失败模式,并为LLM在临床工作流程中的安全部署提供了启示。

English abstract

Large language models (LLMs) are increasingly used for clinical text tasks such as summarization and revision. While most studies evaluate the fluency and coherence of LLM-generated text, whether LLMs correctly preserve diagnostic uncertainty remains underexplored. In clinical practice, phrases such as "possible pneumonia" communicate the strength of available evidence and directly guide decisions about follow-up testing and treatment. Altering these uncertainty expressions can change the clinical meaning entirely. In this paper, we systematically evaluated this problem in two steps. First, we constructed a benchmark of 1,200 clinical documents with 9,184 uncertainty annotations across five levels. Second, we evaluated three LLMs on this benchmark. Our results show that (1) LLMs preserve the original uncertainty cues poorly, often less than half the time; (2) LLMs struggle with nuanced distinctions between adjacent levels. This work reveals a failure mode not captured by standard evaluation metrics and provides implications for the safe deployment of LLMs in clinical workflows.

2606.18469 2026-06-18 cs.LG cs.AI New submission

Structured Representation Learning with Locally Linear Embeddings and Adaptive Feature Fusion

基于局部线性嵌入与自适应特征融合的结构化表示学习

Somjit Nath, Jackson J Cone, Derek Nowrouzezahrai, Samira Ebrahimi Kahou

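Affiliations * Mila – Quebec AI Institute

AI summary Inspired by neuroscience, proposes a reinforcement learning framework that uses locally linear embeddings to capture local state structure and an attention mechanism to adaptively fuse dynamics-specific and reward-specific features, improving learning efficiency.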

Comments Published in Transactions on Machine Learning Research (04/2026)

Details
AI Chinese abstract

神经科学研究揭示,大脑通过利用结构化的低维流形和自适应门控机制动态融合多源信息来编码复杂行为。受这些原理启发,我们提出了一种新颖的强化学习(RL)框架,鼓励分离动态特定和奖励特定特征,直接类比神经回路如何分离和整合信息以实现高效决策。我们的方法利用局部线性嵌入(LLE)来捕捉许多环境中固有的局部线性结构,反映神经群体活动中观察到的局部平滑性,同时通过标准RL目标推导奖励特定特征。一种类似于皮层门控的注意力机制,在逐状态基础上自适应地融合这些互补表示。在基准任务上的实验结果表明,我们的方法基于神经科学原理,相比传统RL方法提高了学习效率和整体性能,凸显了显式建模局部状态结构和自适应特征选择(如生物系统中观察到的)的优势。

English abstract

Neuroscientific research has revealed that the brain encodes complex behaviors by leveraging structured, low-dimensional manifolds and dynamically fusing multiple sources of information through adaptive gating mechanisms. Inspired by these principles, we propose a novel reinforcement learning (RL) framework that encourages the disentanglement of dynamics-specific and reward-specific features, drawing direct parallels to how neural circuits separate and integrate information for efficient decision-making. Our approach leverages locally linear embeddings (LLEs) to capture the intrinsic, locally linear structure inherent in many environments, mirroring the local smoothness observed in neural population activity, while concurrently deriving reward-specific features through the standard RL objective. An attention mechanism, analogous to cortical gating, adaptively fuses these complementary representations on a per-state basis. Experimental results on benchmark tasks demonstrate that our method, grounded in neuroscientific principles, improves learning efficiency and overall performance compared to conventional RL approaches, highlighting the benefits of explicitly modeling local state structures and adaptive feature selection as observed in biological systems.

2606.18466 2026-06-18 cs.CL New submission

Montreal Forced Aligner and the state of speech-to-text alignment in 2026

Montreal Forced Aligner 与 2026 年语音到文本对齐的现状

Michael McAuliffe, Kaylynn Gunter, Michael Wagner, Morgan Sonderegger

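Affiliations * University of Wisconsin-Madison; McGill University; Centre for Brain, Language, and Music; University of Oregon

AI summary Documents the development of MFA 3.0 since version 1.0 and evaluates it on English, Japanese, and Korean, where it achieves state-of-the-art or near state-of-the-art performance on four benchmark datasets with mean boundary errors below 15 ms.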

Details
AI Chinese abstract

Montreal Forced Aligner (MFA) 于 2016 年发布,此后成为研究和工业中最广泛使用的强制对齐工具。在过去的十年中,MFA 经历了实质性发展,包括使用更大的开源数据集扩展到更多语言和方言、统一的 IPA 词典、模型自适应、跨语言音素映射以及支持工具。本文记录了 MFA 3.0 自 1.0 版本以来的发展,并在英语、日语和韩语上评估 MFA 的性能,与经典和神经强制对齐器进行基准测试。MFA 3.0 在所有四个基准数据集上实现了最优或接近最优的性能,平均边界误差低于 15 ms。自适应和跨语言映射对于 MFA 训练分布之外的语言有效,并且发音概率建模和音系规则在特定条件下提供了增益。

English abstract

The Montreal Forced Aligner (MFA) was released in 2016 and has since become the most widely used tool for forced alignment in research and industry. In the decade since, MFA has undergone substantial development, including expanded coverage across more languages and dialects using larger open-source datasets, harmonized IPA dictionaries, model adaptation, cross-language phone remapping, and support utilities. This paper documents MFA 3.0's developments since version 1.0 and evaluates MFA's performance across English, Japanese, and Korean, benchmarked against classic and neural forced aligners. MFA 3.0 achieves state-of-the-art or near state-of-the-art performance across all four benchmark datasets with mean boundary errors below 15 ms. Adaptation and cross-language remapping are effective for languages outside MFA's training distribution, and pronunciation probability modeling and phonological rules provide gains in specific conditions.