arXivDaily: arXiv daily academic digest, updated Monday through Friday
2604.23938 2026-06-19 cs.CL version update

TSAssistant: A Human-in-the-Loop Agentic Framework for Automated Target Safety Assessment

TSAssistant: 一种人在回路中的自动化靶点安全性评估智能体框架

Xiaochen Zheng, Zhiwen Jiang, David Tokar, Yexiang Cheng, Alvaro Serra, Melanie Guerard, Klas Hatje, Tatyana Doktorova

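Affiliations: Computational Sciences Center of Excellence

AI summary: Proposes TSAssistant, a multi-agent framework that uses a hierarchical instruction architecture and an interactive refinement loop to decompose target safety assessment report generation into specialized subtasks, achieving high reproducibility and evidence traceability.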

Comments Updated with quantitative and expert evaluations

AI Chinese abstract

靶点安全性评估(TSA)需要系统整合遗传、转录组、靶点同源性、药理学和临床数据,以评估治疗靶点的潜在安全性风险。该过程劳动密集且依赖专家,在可扩展性和可重复性方面面临挑战。我们提出TSAssistant,一种人在回路中的多智能体框架,将TSA报告生成分解为专门子智能体的工作流:研究子智能体各自基于并引用单个TSA领域,合成子智能体整合跨领域发现。子智能体通过标准化工具接口从精选生物医学来源检索和综合证据,生成可单独引用、基于证据的章节,其行为由分层指令架构塑造,该架构将协调逻辑与领域专业知识和用户意图分离。为补充这些软约束,程序化执行钩子和持久记忆存储在整个工作流中强制执行硬约束,而交互式优化循环允许专家在完全保留跨迭代对话上下文的情况下审查和修订各个章节。我们不是进行单一的整体比较,而是将报告质量分解为可重复性、证据基础、任务级准确性和专家监督下的可控性,发现高可重复性和证据基础、与人类参考高度一致以及专家驱动的净正面改进。

English abstract

Target Safety Assessment (TSA) requires systematic integration of genetic, transcriptomic, target homology, pharmacological, and clinical data to evaluate potential safety liabilities of therapeutic targets. This process is labor-intensive and expert-dependent, posing challenges in scalability and reproducibility. We present TSAssistant, a human-in-the-loop multi-agent framework that decomposes TSA report generation into a workflow of specialized subagents: Research Subagents that each ground and cite a single TSA domain, and Synthesis Subagents that integrate findings across domains. Subagents retrieve and synthesize evidence from curated biomedical sources through standardized tool interfaces and produce individually citable, evidence-grounded sections, with behavior shaped by a hierarchical instruction architecture that separates coordination logic from domain expertise and user intent. To complement these soft constraints, programmatic execution hooks and persistent memory stores enforce hard constraints across the workflow, while an interactive refinement loop allows experts to review and revise individual sections with full conversational context preserved across iterations. Rather than a single holistic comparison, we decompose report quality into reproducibility, evidential grounding, task-level accuracy, and controllability under expert oversight, finding high reproducibility and grounding, substantial agreement with the human reference, and net-positive expert-driven refinement.

2605.00665 2026-06-19 cs.CV version update

Prediction of Alzheimer's Disease Risk Factors from Retinal Images via Deep Learning: Development and Validation of Biologically Relevant Morphological Associations in the UK Biobank

基于深度学习的视网膜图像预测阿尔茨海默病风险因素:英国生物银行中生物学相关形态学关联的开发和验证

Seowung Leem, Yunchao Yang, Adam J. Woods, Ruogu Fang

Affiliations: J. Crayton Pruitt Family Dept. of Biomedical Engineering, University of Florida; University of Florida Research Computing; Meta AI (FAIR); School of Behavioral and Brain Sciences, University of Texas at Dallas; Dept. of Electrical and Computer Engineering, University of Florida; Dept. of Computer and Information Science and Engineering, University of Florida; Center for Cognitive Aging and Memory, University of Florida

AI summary: Uses deep learning to predict 12 Alzheimer's disease-related risk factors from color fundus photographs and characterizes the retinal structures behind the predictions, finding that regions such as the optic nerve head and the retinal vasculature are associated with the risk factors and with preclinical Alzheimer's disease changes.

Comments Accepted to the "Journal of Alzheimer's Disease" for publication

AI Chinese abstract

系统性的、代谢性的、生活方式的因素已通过流行病学和AD特异性生物标志物研究与阿尔茨海默病(AD)建立关联。彩色眼底摄影(CFP)是否包含与这些AD相关风险域相对应的视网膜结构特征仍不清楚。为了确定深度学习(DL)模型能否从CFP预测12个AD相关风险因素,并表征这些预测背后的视网膜结构,从而评估CFP是否反映AD易感性的通路。使用来自英国生物银行的44,501名独特参与者的62,876张CFP,训练DL模型预测与AD发病率相关的12个因素:6个分类变量(性别、吸烟、失眠、经济状况、饮酒、抑郁)和6个连续变量(年龄、受教育完成年龄、BMI、收缩压、舒张压、HbA1c)。评估模型性能、模型显著性和显著性衍生得分(CAM-Score),并与视网膜形态测量进行比较。还将得分在AD发病病例(平均发病前8.55年)与匹配对照之间进行比较。DL的性能范围为分类变量的AUROC=0.5654-0.9480,连续变量的R2=-0.0291-0.7620,优于大多数形态测量-机器学习模型。基于显著性的得分一致地突出了生物学上有意义的区域,特别是视神经头和视网膜血管。它也与现有的形态测量变异一致。多个基于显著性的得分在AD发病病例与匹配对照之间存在显著差异,表明风险因素的视网膜相关性与临床前AD相关变化之间存在潜在重叠。CFP编码了与AD风险因素相关的视网膜特征。尽管不具有诊断性,但DL衍生的视网膜表征可能揭示反映潜在AD易感性的生物学上有意义的风险相关结构变化。

English abstract

Systemic, metabolic, and lifestyle factors have established associations with Alzheimer's Disease (AD) through epidemiologic and AD-specific biomarker studies. Whether color fundus photography (CFP) contains retinal structural signatures corresponding to these AD-related risk domains remains unclear. This study aimed to determine whether deep learning (DL) models can predict 12 AD-related risk factors from CFP and to characterize the retinal structures underlying these predictions, thereby assessing whether CFP reflects pathways to AD vulnerability. Using 62,876 CFPs from 44,501 unique participants from the UK Biobank, DL models were trained to predict 12 factors linked to AD incidence: 6 categorical (sex, smoking, sleeplessness, economic status, alcohol use, depression) and 6 continuous (age, age at completing education, BMI, systolic blood pressure, diastolic blood pressure, HbA1c). Model performance, model saliency, and saliency-derived scores (CAM-Score) were evaluated and compared to retinal morphometry. The scores were also compared between incident-AD cases (on average 8.55 years before onset) and matched controls. DL performance ranged from AUROC = 0.5654 to 0.9480 for categorical factors and from $R^2$ = -0.0291 to 0.7620 for continuous factors, outperforming most of the morphometry-based machine learning models. Saliency-based scores consistently highlighted biologically meaningful regions, particularly the optic nerve head and retinal vasculature. They also aligned with existing morphometric variations. Several saliency-based scores differed significantly between incident-AD cases and matched controls, suggesting potential overlap between retinal correlates of risk factors and preclinical AD-associated changes. CFP encodes retinal signatures linked to AD risk factors. Although not diagnostic, DL-derived retinal representations may uncover biologically meaningful, risk-related structural changes that mirror potential AD vulnerability.

2602.17315 2026-06-19 cs.LG cs.AI version update

Flickering Multi-Armed Bandits

闪烁多臂老虎机

Sourav Chakraborty, Amit Kiran Rege, Claire Monteleoni, Lijun Chen

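Affiliations: University of Colorado Boulder; INRIA Paris

AI summary: Proposes the Flickering Multi-Armed Bandits model, which constrains action availability through random graphs, designs a two-phase lazy random walk algorithm that achieves sublinear regret bounds, and proves near-optimality via a matching information-theoretic lower bound.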

AI Chinese abstract

我们引入闪烁多臂老虎机(FMAB)来建模动作可用性变化环境中的序列决策,其中下一个动作的可访问性被限制为依赖于智能体当前选择的子集。我们通过随机演化图形式化这些约束,其中动作仅限于局部邻域。这种移动受限结构带来了双重挑战:信息获取的统计要求和导航的物理开销。我们在独立同分布 Erdős--R'enyi 和边马尔可夫过程下分析 FMAB,提出一种两阶段懒惰随机游走算法以实现鲁棒探索。我们建立了高概率次线性遗憾界,并通过匹配的信息论下界证明了近最优性。我们的结果刻画了局部移动约束下学习的内在成本,并通过机器人灾难响应模拟进行了补充。

English abstract

We introduce Flickering Multi-Armed Bandits (FMAB) to model sequential decision-making in environments with changing action availability, where accessibility of the next action is restricted to a subset dependent on the agent's current choice. We formalize these constraints through stochastically evolving graphs where actions are limited to local neighborhoods. This mobility-constrained structure imposes a dual challenge: the statistical requirement of information acquisition and the physical overhead of navigation. We analyze FMAB under i.i.d. Erdős–Rényi and Edge-Markovian processes, proposing a two-phase lazy random walk algorithm for robust exploration. We establish high-probability sublinear regret bounds and prove near-optimality via a matching information-theoretic lower bound. Our results characterize the intrinsic cost of learning under local-move constraints, complemented by a robotic disaster-response simulation.
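
As a rough illustration of the setting (not the authors' algorithm), the sketch below simulates a lazy random walk over arms when a fresh Erdős–Rényi graph is drawn every round. The function name, the laziness probability of 0.5, and the choice to start at arm 0 are assumptions made for this example; the abstract does not specify them, and the second phase of the paper's algorithm is not modeled here.

```python
import random


def lazy_random_walk(n_arms, p_edge, steps, laziness=0.5, seed=0):
    """Count arm visits of a lazy random walk on i.i.d. Erdős–Rényi graphs.

    Hypothetical helper: each round an edge between the current arm and any
    other arm exists independently with probability p_edge, so only the
    edges incident to the current arm need to be sampled.
    """
    rng = random.Random(seed)
    current = 0  # assumed starting arm
    visits = [0] * n_arms
    for _ in range(steps):
        visits[current] += 1
        # Arms reachable from the current arm in this round's graph.
        neighbors = [
            j for j in range(n_arms)
            if j != current and rng.random() < p_edge
        ]
        # Stay put with probability `laziness`, or when no arm is reachable.
        if neighbors and rng.random() >= laziness:
            current = rng.choice(neighbors)
    return visits


if __name__ == "__main__":
    print(lazy_random_walk(n_arms=5, p_edge=0.5, steps=1000))
```

The visit counts show how local-move constraints slow down exploration: lowering `p_edge` makes it harder to reach a specific arm in a given round.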

2604.19196 2026-06-19 cs.CV version update

Benchmarking Vision Foundation Models for Domain-Generalizable Face Anti-Spoofing

面向域泛化人脸反欺骗的视觉基础模型基准测试

Mika Feng, Pierre Gallin-Martel, Koichi Ito, Takafumi Aoki

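Affiliations: Graduate School of Information Sciences, Tohoku University, Japan

AI summary: Systematically evaluates 15 pre-trained vision models for domain-generalizable face anti-spoofing, finding that self-supervised ViTs (especially DINOv2 with Registers), combined with data augmentation and an attention-weighted loss, achieve state-of-the-art results on the MICO protocol while remaining computationally efficient.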

Comments 2026 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)

AI Chinese abstract

人脸反欺骗(FAS)由于需要在未见过的环境中进行鲁棒的域泛化而仍然具有挑战性。尽管最近的趋势利用视觉-语言模型(VLM)进行语义监督,但这些多模态方法通常需要高昂的计算资源并表现出高推理延迟。此外,它们的有效性本质上受限于底层视觉特征的质量。本文重新审视仅视觉基础模型建立高效鲁棒FAS基线的潜力。我们在严苛的跨域场景下(包括MICO和有限源域(LSD)协议)对15个预训练模型进行了系统基准测试,例如有监督CNN、有监督ViT和自监督ViT。我们的全面分析表明,自监督视觉模型,特别是带有寄存器的DINOv2,显著抑制了注意力伪影并捕获了关键的细粒度欺骗线索。结合人脸反欺骗数据增强(FAS-Aug)、分块数据增强(PDA)和注意力加权分块损失(APL),我们提出的仅视觉基线在MICO协议上达到了最先进的性能。该基线在数据受限的LSD协议下优于现有方法,同时保持优越的计算效率。这项工作为FAS提供了一个确定的仅视觉基线,表明优化的自监督视觉变换器可以作为仅视觉和未来多模态FAS系统的骨干。项目页面见:此https URL。

English abstract

Face Anti-Spoofing (FAS) remains challenging due to the requirement for robust domain generalization across unseen environments. While recent trends leverage Vision-Language Models (VLMs) for semantic supervision, these multimodal approaches often demand prohibitive computational resources and exhibit high inference latency. Furthermore, their efficacy is inherently limited by the quality of the underlying visual features. This paper revisits the potential of vision-only foundation models to establish a highly efficient and robust baseline for FAS. We conduct a systematic benchmarking of 15 pre-trained models, such as supervised CNNs, supervised ViTs, and self-supervised ViTs, under severe cross-domain scenarios including the MICO and Limited Source Domains (LSD) protocols. Our comprehensive analysis reveals that self-supervised vision models, particularly DINOv2 with Registers, significantly suppress attention artifacts and capture critical, fine-grained spoofing cues. Combined with Face Anti-Spoofing Data Augmentation (FAS-Aug), Patch-wise Data Augmentation (PDA) and Attention-weighted Patch Loss (APL), our proposed vision-only baseline achieves state-of-the-art performance in the MICO protocol. This baseline outperforms existing methods under the data-constrained LSD protocol while maintaining superior computational efficiency. This work provides a definitive vision-only baseline for FAS, demonstrating that optimized self-supervised vision transformers can serve as a backbone for both vision-only and future multimodal FAS systems. The project page is available at: https://gsisaoki.github.io/FAS-VFMbenchmark-CVPRW2026/ .

2604.07328 2026-06-19 cs.LG version update

How to sketch a learning algorithm

如何勾勒学习算法

Sam Gunn

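Affiliations: UC Berkeley

AI summary: Proposes a data deletion scheme that, under a stability assumption, locally sketches an arithmetic circuit via higher-order derivatives in random complex directions, predicting deep learning model outputs with vanishing error and failure probability while slowing precomputation and inference by only a $\tilde{O}(\log(1/\delta)/\varepsilon^2)$ factor.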

Comments Improved presentation and simplified Algorithm 4

AI Chinese abstract

训练数据的选择如何影响AI模型?这个广泛的问题对于可解释性、隐私和基础科学至关重要。其技术核心是数据删除问题:在合理的预计算量之后,快速预测如果从学习算法中排除给定训练数据子集,模型在给定情况下的行为。我们提出了一种数据删除方案,能够在深度学习设置中以可忽略的误差$\varepsilon$和失败概率$\delta$预测模型输出。我们的预计算和预测算法分别仅比常规训练和推理慢$\tilde{O}(\log(1/\delta)/\varepsilon^2)$因子。存储需求为$\tilde{O}(\log(1/\delta)/\varepsilon^2)$个模型。我们的证明基于一个称为稳定性的假设。与先前工作所做的假设相比,稳定性似乎与学习强大AI模型完全兼容。为支持这一点,我们展示了稳定性在microgpt的最小实验集中得到满足。我们的代码可在https://this URL获取。在技术层面,我们的工作基于一种新方法,通过计算随机复方向的高阶导数来局部勾勒算术电路。前向模式自动微分允许廉价计算这些导数。

English abstract

How does the choice of training data influence an AI model? This broad question is of central importance to interpretability, privacy, and basic science. At its technical core is the data deletion problem: after a reasonable amount of precomputation, quickly predict how the model would behave in a given situation if a given subset of training data had been excluded from the learning algorithm. We present a data deletion scheme capable of predicting model outputs with vanishing error $\varepsilon$ and failure probability $\delta$ in the deep learning setting. Our precomputation and prediction algorithms are only $\tilde{O}(\log(1/\delta)/\varepsilon^2)$ factors slower than regular training and inference, respectively. The storage requirements are those of $\tilde{O}(\log(1/\delta)/\varepsilon^2)$ models. Our proof is based on an assumption that we call stability. In contrast to the assumptions made by prior work, stability appears to be fully compatible with learning powerful AI models. In support of this, we show that stability is satisfied in a minimal set of experiments with microgpt. Our code is available at https://github.com/SamSpo1/microgpt-sketch. At a technical level, our work is based on a new method for locally sketching an arithmetic circuit by computing higher-order derivatives in random complex directions. Forward-mode automatic differentiation allows cheap computation of these derivatives.

2603.04531 2026-06-19 cs.RO version update

PTLD: Sim-to-real Privileged Tactile Latent Distillation for Dexterous Manipulation

PTLD: 从仿真到现实的触觉潜在知识蒸馏用于灵巧操作

Rosy Chen, Mustafa Mukadam, Michael Kaess, Tingfan Wu, Francois R Hogan, Jitendra Malik, Akash Sharma

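Affiliations: Carnegie Mellon University; University of Washington; FAIR at Meta; UC Berkeley

AI summary: Proposes PTLD, which distills a robust state estimator from real-world tactile policy data to sidestep the difficulty of tactile simulation, improving on proprioception-only policies by 182% and 57% on dexterous manipulation tasks.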

AI Chinese abstract

触觉灵巧操作对于自动化复杂家务任务至关重要,但学习有效控制策略仍然是一个挑战。虽然最近的工作依赖于模仿学习,但通过机器人遥操作或动觉教学获取多指手的高质量演示是困难的。另一种方法是,通过强化学习我们可以在仿真中学习技能,但快速且真实的触觉观测仿真具有挑战性。为了弥合这一差距,我们引入了PTLD:从仿真到现实的触觉潜在知识蒸馏,这是一种无需触觉仿真即可学习触觉操作技能的新方法。我们的关键思想不是模拟触觉传感器或纯粹依赖本体感策略进行零样本从仿真到现实的迁移,而是利用现实世界中的特权传感器收集真实的触觉策略数据。然后,这些数据用于蒸馏一个鲁棒的状态估计器,该估计器基于触觉输入运行。我们的实验表明,PTLD可以通过结合触觉感知显著改善在仿真中训练的本体感操作策略。在基准的掌内旋转任务中,PTLD相比纯本体感策略实现了182%的提升。我们还展示了PTLD能够学习具有挑战性的触觉掌内重定向任务,在该任务中,我们观察到达到的目标数量相比仅使用本体感提高了57%。网站:此 https URL。

English abstract

Tactile dexterous manipulation is essential to automating complex household tasks, yet learning effective control policies remains a challenge. While recent work has relied on imitation learning, obtaining high-quality demonstrations for multi-fingered hands via robot teleoperation or kinesthetic teaching is prohibitive. Alternatively, with reinforcement learning we can learn skills in simulation, but fast and realistic simulation of tactile observations is challenging. To bridge this gap, we introduce PTLD: sim-to-real Privileged Tactile Latent Distillation, a novel approach to learning tactile manipulation skills without requiring tactile simulation. Instead of simulating tactile sensors or relying purely on proprioceptive policies to transfer zero-shot sim-to-real, our key idea is to leverage privileged sensors in the real world to collect real-world tactile policy data. This data is then used to distill a robust state estimator that operates on tactile input. Our experiments demonstrate that incorporating tactile sensing through PTLD significantly improves proprioceptive manipulation policies trained in simulation. On the benchmark in-hand rotation task, PTLD achieves a 182% improvement over a proprioception-only policy. We also show that PTLD enables learning the challenging task of tactile in-hand reorientation, where we see a 57% improvement in the number of goals reached over using proprioception alone. Website: https://akashsharma02.github.io/ptld-website/.

2604.15838 2026-06-19 cs.LG version update

Reversible Residual Normalization Alleviates Spatio-Temporal Distribution Shift

可逆残差归一化缓解时空分布偏移

Zhaobo Hu, Vincent Gauthier, Mehdi Naima

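Affiliations: CNRS, LIP6, Sorbonne Université

AI summary: Targets spatio-temporal distribution shift with a Reversible Residual Normalization framework that handles shifts in both the spatial and temporal dimensions through spatially-aware invertible transformations, combining graph convolution with spectral-constrained graph neural networks for adaptive normalization.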

AI Chinese abstract

分布偏移严重降低了深度预测模型的性能。虽然这一问题在单变量时间序列中已有充分研究,但在时空领域中仍然是一个重大挑战。有效的解决方案如实例归一化及其变体可以通过标准化统计量来缓解时间偏移。然而,图上的分布偏移更为复杂,不仅涉及单个节点序列的漂移,还涉及空间网络中的异质性,其中不同节点表现出不同的统计特性。为了解决这个问题,我们提出了可逆残差归一化(RRN),一种新颖的框架,执行空间感知的可逆变换以解决空间和时间维度上的分布偏移。我们的方法在可逆残差块中集成了图卷积操作,实现了在保持可逆性的同时尊重底层图结构的自适应归一化。通过将中心归一化与谱约束图神经网络相结合,我们的方法以数据驱动的方式捕获和归一化复杂的时空关系。我们框架的双向性允许模型在归一化的潜在空间中学习,并通过逆变换恢复原始分布特性,为动态时空系统上的预测提供了一种鲁棒且模型无关的解决方案。

English abstract

Distribution shift severely degrades the performance of deep forecasting models. While this issue is well-studied for individual time series, it remains a significant challenge in the spatio-temporal domain. Effective solutions like instance normalization and its variants can mitigate temporal shifts by standardizing statistics. However, distribution shift on a graph is far more complex, involving not only the drift of individual node series but also heterogeneity across the spatial network where different nodes exhibit distinct statistical properties. To tackle this problem, we propose Reversible Residual Normalization (RRN), a novel framework that performs spatially-aware invertible transformations to address distribution shift in both spatial and temporal dimensions. Our approach integrates graph convolutional operations within invertible residual blocks, enabling adaptive normalization that respects the underlying graph structure while maintaining reversibility. By combining Center Normalization with spectral-constrained graph neural networks, our method captures and normalizes complex Spatio-Temporal relationships in a data-driven manner. The bidirectional nature of our framework allows models to learn in a normalized latent space and recover original distributional properties through inverse transformation, offering a robust and model-agnostic solution for forecasting on dynamic spatio-temporal systems.
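
For context, the instance-normalization baseline mentioned in the abstract can be sketched as a pair of functions that standardize a single series and later invert the transform. This is only the simple per-series baseline, not RRN itself: the graph convolutions and invertible residual blocks are not modeled, and the function names and the `eps` constant are assumptions made for this example.

```python
from statistics import fmean, pstdev


def normalize(series, eps=1e-8):
    """Standardize one series and return the statistics needed to undo it."""
    mean = fmean(series)
    std = pstdev(series) + eps  # eps guards against constant series
    normalized = [(x - mean) / std for x in series]
    return normalized, (mean, std)


def denormalize(values, stats):
    """Map values from the normalized space back to the original scale."""
    mean, std = stats
    return [v * std + mean for v in values]


if __name__ == "__main__":
    history = [10.0, 12.0, 11.0, 15.0, 14.0]
    normalized, stats = normalize(history)
    # A forecasting model would operate on `normalized`; here the values are
    # passed through unchanged to show that the round trip is lossless.
    restored = denormalize(normalized, stats)
    print(normalized)
    print(restored)
```

Because the statistics are stored per instance, a model can learn in the normalized space while its outputs are mapped back to each series' own scale.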

2604.13416 2026-06-19 cs.CV cs.AI version update

DF3DV-1K: A Large-Scale Dataset and Benchmark for Distractor-Free Novel View Synthesis

DF3DV-1K:用于无干扰新视角合成的大规模数据集与基准

Cheng-You Lu, Yi-Shan Hung, Wei-Ling Chi, Hao-Ping Wang, Charlie Li-Ting Tsai, Yu-Cheng Chang, Yu-Lun Liu, Thomas Do, Chin-Teng Lin

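Affiliations: University of Technology Sydney; University of Sydney; National Yang Ming Chiao Tung University

AI summary: To address the lack of large-scale real-world datasets for distractor-free radiance fields, builds DF3DV-1K, a dataset of 1,048 scenes with clean and cluttered image sets for each scene, and benchmarks nine recent methods on it, identifying the most robust methods and the most challenging scenarios.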

AI Chinese abstract

辐射场领域的进展已实现逼真的新视角合成。在多个领域中,已开发出大规模真实世界数据集以支持全面基准测试并促进超越场景特定重建的进展。然而,对于无干扰辐射场,每个场景同时包含干净和杂乱图像的大规模数据集仍然缺乏,限制了发展。为填补这一空白,我们引入了DF3DV-1K,一个包含1048个场景的大规模真实世界数据集,每个场景提供干净和杂乱的图像集用于基准测试。该数据集总共包含89,924张使用消费级相机拍摄的图像,模拟随意拍摄,涵盖128种干扰类型和161种场景主题,包括室内和室外环境。一个精心挑选的41个场景子集DF3DV-41被系统设计用于评估无干扰辐射场方法在挑战性场景下的鲁棒性。利用DF3DV-1K,我们对九种最新的无干扰辐射场方法和3D高斯泼溅进行了基准测试,识别出最鲁棒的方法和最具挑战的场景。除了基准测试,我们还展示了DF3DV-1K的一个应用:微调基于扩散的2D增强器以改进辐射场方法,在保留集(例如DF3DV-41)和On-the-go数据集上实现了平均0.96 dB PSNR和0.057 LPIPS的提升。我们希望DF3DV-1K能促进无干扰视觉的发展,并推动超越场景特定方法的进步。数据集和排行榜可在以下网址获取:此 https URL。

English abstract

Advances in radiance fields have enabled photorealistic novel view synthesis. In several domains, large-scale real-world datasets have been developed to support comprehensive benchmarking and to facilitate progress beyond scene-specific reconstruction. However, for distractor-free radiance fields, a large-scale dataset with clean and cluttered images per scene remains lacking, limiting the development. To address this gap, we introduce DF3DV-1K, a large-scale real-world dataset comprising 1,048 scenes, each providing clean and cluttered image sets for benchmarking. In total, the dataset contains 89,924 images captured using consumer cameras to mimic casual capture, spanning 128 distractor types and 161 scene themes across indoor and outdoor environments. A curated subset of 41 scenes, DF3DV-41, is systematically designed to evaluate the robustness of distractor-free radiance field methods under challenging scenarios. Using DF3DV-1K, we benchmark nine recent distractor-free radiance field methods and 3D Gaussian Splatting, identifying the most robust methods and the most challenging scenarios. Beyond benchmarking, we demonstrate an application of DF3DV-1K by fine-tuning a diffusion-based 2D enhancer to improve radiance field methods, achieving average improvements of 0.96 dB PSNR and 0.057 LPIPS on the held-out set (e.g., DF3DV-41) and the On-the-go dataset. We hope DF3DV-1K facilitates the development of distractor-free vision and promotes progress beyond scene-specific approaches. The dataset and leaderboard are available at https://johnnylu305.github.io/df3dv1k_web/.

2604.13240 2026-06-19 cs.CV cs.LG version update

A High-Resolution Landscape Dataset for Concept-Based XAI With Application to Species Distribution Models

基于概念的可解释AI的高分辨率景观数据集及其在物种分布模型中的应用

Augustin de la Brosse, Damien Garreau, Thomas Houet, Thomas Corpetti

Affiliations: Université Rennes 2, CNRS, Nantes Université, Univ Brest, LETG, UMR 6554; LTSER Zone Atelier Armorique; University of Würzburg, Center for Artificial Intelligence and Data Science

AI summary: Proposes the first concept-based explainable AI approach for species distribution models, building a landscape concept dataset from high-resolution multispectral and LiDAR drone imagery and using Robust TCAV to quantify the influence of landscape concepts on model predictions; a case study demonstrates the approach.

AI Chinese abstract

绘制物种空间分布对于保护政策和入侵物种管理至关重要。物种分布模型(SDMs)是完成此任务的主要工具,具有两个目的:实现稳健的预测性能,同时提供关于分布驱动因素的生态见解。然而,深度学习SDMs日益增长的复杂性使得提取这些见解更具挑战性。为了调和这些目标,我们提出了首个基于概念的可解释AI(XAI)在SDMs中的实现。我们利用Robust TCAV(测试与概念激活向量)方法量化景观概念对模型预测的影响。为此,我们提供了一个新的开放获取的景观概念数据集,该数据集源自高分辨率多光谱和LiDAR无人机影像。它包括跨越15个不同景观概念的653个斑块和1,450个随机参考斑块,旨在适用于广泛的物种。我们通过两个水生昆虫(襀翅目和毛翅目)的案例研究,使用两个卷积神经网络和一个视觉Transformer来展示这种方法。结果表明,基于概念的XAI有助于根据专家知识验证SDMs,同时发现产生新生态假说的新颖关联。Robust TCAV还提供了景观层面的信息,对政策制定和土地管理有用。代码和数据集公开可用。

English abstract

Mapping the spatial distribution of species is essential for conservation policy and invasive species management. Species distribution models (SDMs) are the primary tools for this task, serving two purposes: achieving robust predictive performance while providing ecological insights into the driving factors of distribution. However, the increasing complexity of deep learning SDMs has made extracting these insights more challenging. To reconcile these objectives, we propose the first implementation of concept-based Explainable AI (XAI) for SDMs. We leverage the Robust TCAV (Testing with Concept Activation Vectors) methodology to quantify the influence of landscape concepts on model predictions. To enable this, we provide a new open-access landscape concept dataset derived from high-resolution multispectral and LiDAR drone imagery. It includes 653 patches across 15 distinct landscape concepts and 1,450 random reference patches, designed to suit a wide range of species. We demonstrate this approach through a case study of two aquatic insects, Plecoptera and Trichoptera, using two Convolutional Neural Networks and one Vision Transformer. Results show that concept-based XAI helps validate SDMs against expert knowledge while uncovering novel associations that generate new ecological hypotheses. Robust TCAV also provides landscape-level information, useful for policy-making and land management. Code and datasets are publicly available.

2602.22495 2026-06-19 cs.LG cs.AI version update

Reinforcement-aware Knowledge Distillation for LLM Reasoning

面向LLM推理的强化学习感知知识蒸馏

Zhaoyang Zhang, Shuli Jiang, Yantao Shen, Yuting Zhang, Dhananjay Ram, Shuo Yang, Zhuowen Tu, Wei Xia, Stefano Soatto

Affiliations: Meta

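AI summary: Proposes RL-aware distillation (RLAD), which uses Trust Region Ratio Distillation (TRRD) to perform selective imitation during reinforcement learning post-training, addressing distribution mismatch and objective interference and outperforming existing methods on logic reasoning and math benchmarks.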

AI Chinese abstract

强化学习(RL)后训练最近推动了长链思维推理大语言模型(LLM)的重大进展,但这类模型的高推理成本促使将其蒸馏到更小的学生模型中。大多数现有的知识蒸馏(KD)方法是为监督微调(SFT)设计的,依赖于固定的教师轨迹或基于教师-学生KL散度的正则化。当与RL结合时,这些方法常常遭受分布不匹配和目标干扰:教师监督可能与学生不断变化的rollout分布不一致,并且KL正则化项可能与奖励最大化竞争,需要仔细的损失平衡。为了解决这些问题,我们提出了RL感知蒸馏(RLAD),它在RL期间执行选择性模仿——仅在改进当前策略更新时引导学生向教师学习。我们的核心组件,信任区域比率蒸馏(TRRD),用基于PPO/GRPO风格似然比的目标替代教师-学生KL正则化项,该目标锚定到教师-旧策略混合,从而在学生rollout上产生优势感知、信任区域约束的蒸馏,并自然平衡探索、利用和模仿。在多种逻辑推理和数学基准上,RLAD始终优于离线蒸馏、标准GRPO和基于KL的在策略教师-学生知识蒸馏。

English abstract

Reinforcement learning (RL) post-training has recently driven major gains in long chain-of-thought reasoning large language models (LLMs), but the high inference cost of such models motivates distillation into smaller students. Most existing knowledge distillation (KD) methods are designed for supervised fine-tuning (SFT), relying on fixed teacher traces or teacher-student Kullback-Leibler (KL) divergence-based regularization. When combined with RL, these approaches often suffer from distribution mismatch and objective interference: teacher supervision may not align with the student's evolving rollout distribution, and the KL regularizer can compete with reward maximization and require careful loss balancing. To address these issues, we propose RL-aware distillation (RLAD), which performs selective imitation during RL -- guiding the student toward the teacher only when it improves the current policy update. Our core component, Trust Region Ratio Distillation (TRRD), replaces the teacher-student KL regularizer with a PPO/GRPO-style likelihood-ratio objective anchored to a teacher--old-policy mixture, yielding advantage-aware, trust-region-bounded distillation on student rollouts and naturally balancing exploration, exploitation, and imitation. Across diverse logic reasoning and math benchmarks, RLAD consistently outperforms offline distillation, standard GRPO, and KL-based on-policy teacher-student knowledge distillation.
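
The abstract does not give the TRRD formula, so the following is only one plausible reading of "a PPO/GRPO-style likelihood-ratio objective anchored to a teacher--old-policy mixture", written for a single sampled token. The mixing weight `mix`, the clipping range `clip_eps`, and the function name are all assumptions for illustration, not the authors' definition.

```python
def trust_region_ratio_objective(p_student, p_teacher, p_old, advantage,
                                 mix=0.5, clip_eps=0.2):
    """Clipped surrogate objective with a mixture anchor (hypothetical form).

    p_student, p_teacher, p_old: probabilities each policy assigns to the
    sampled token; advantage: the advantage estimate for that rollout.
    """
    # Assumed anchor: a convex mixture of the teacher and the old policy.
    anchor = mix * p_teacher + (1.0 - mix) * p_old
    ratio = p_student / anchor
    # Standard PPO-style clipping keeps the update inside a trust region.
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)


if __name__ == "__main__":
    # Positive advantage: the gain from raising the ratio is capped at 1.2.
    print(trust_region_ratio_objective(0.9, 0.8, 0.2, advantage=1.0))
    # Negative advantage: the pessimistic (unclipped) term is kept.
    print(trust_region_ratio_objective(0.9, 0.8, 0.2, advantage=-1.0))
```

With `mix=0` the anchor reduces to the old policy and the expression becomes the usual PPO clipped surrogate, which is one way to see how a teacher term could enter without a separate KL regularizer.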

2604.07593 2026-06-19 cs.AI version update

Too long; didn't solve

太长;没解决

Lucía M. Cabrera, Isaac Saxton-Knight, Jocelyn D'Arcy

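Affiliations: Instituto Balseiro; Poindexter Labs

AI summary: Studies how prompt length and solution length relate to large language model performance on mathematics problems, finding that both correlate positively with model failure rates.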

AI Chinese abstract

由一系列数学问题组成的数学基准被广泛用于评估大型语言模型的推理能力,但关于其结构特性如何影响模型行为的研究很少。在这项工作中,我们研究了两个结构长度变量——提示长度和解答长度,并分析了它们如何与模型在新构建的、由专家编写的对抗性数学问题数据集上的性能相关。我们发现,提示长度和解答长度均与模型失败率的增加呈正相关。我们还进行了跨模型分歧的探索性辅助分析。在难度调整的归一化分析下,两个变量与实现模型分离仍保持弱负相关,提示长度的关联稍强。总体而言,我们的主要稳健发现是,结构长度与该数据集中的经验难度相关。

English abstract

Mathematical benchmarks consisting of a range of mathematics problems are widely used to evaluate the reasoning abilities of large language models, yet little is known about how their structural properties influence model behaviour. In this work, we investigate two structural length variables, prompt length and solution length, and analyse how they relate to model performance on a newly constructed adversarial dataset of expert-authored mathematics problems. We find that both prompt and solution lengths correlate positively with increased model failure across models. We also include a secondary, exploratory analysis of cross-model disagreement. Under a difficulty-adjusted normalised analysis, both variables retain weak negative associations with realised model separation, slightly stronger for prompt length. Overall, our main robust finding is that structural length is linked to empirical difficulty in this dataset.

2604.06464 2026-06-19 cs.LG physics.app-ph stat.ML version update

Weighted Bayesian Conformal Prediction

加权贝叶斯共形预测

Xiayin Lou, Peng Luo

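Affiliations: Technical University of Munich; Massachusetts Institute of Technology

AI summary: Proposes Weighted Bayesian Conformal Prediction (WBCP), which generalizes Bayesian conformal prediction to importance-weighted settings via a weighted Dirichlet distribution, proves that the effective sample size governs posterior variance, and provides richer uncertainty information about conditional coverage.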

AI Chinese abstract

共形预测提供具有有限样本覆盖保证的分布自由预测区间,Snell & Griffiths 最近的工作将其重新解释为贝叶斯求积(BQ-CP),通过阈值上的 Dirichlet 后验产生强大的数据条件保证。然而,BQ-CP 根本上要求 i.i.d. 假设。同时,加权共形预测通过重要性权重处理分布偏移,但仍然是频率学派方法,仅产生点估计阈值。我们提出 \textbf{加权贝叶斯共形预测(WBCP)},它将 BQ-CP 推广到任意重要性加权设置,用加权 Dirichlet $\Dir(\neff \cdot \tilde{w}_1, \ldots, \neff \cdot \tilde{w}_n)$ 替换均匀 Dirichlet $\Dir(1,\ldots,1)$,其中 $\neff$ 是 Kish 有效样本量。我们证明了四个理论结果:(1)~$\neff$ 是匹配频率学派和贝叶斯方差的唯一集中参数;(2)~后验标准差以 $O(1/\sqrt{\neff})$ 衰减;(3)~BQ-CP 的随机占优保证扩展到每个权重轮廓的数据条件保证;(4)~HPD 阈值在条件覆盖上提供 $O(1/\sqrt{\neff})$ 的改进。我们将 WBCP 实例化为 \emph{地理贝叶斯共形预测},其中基于核的空间权重产生每个位置的后验,并具有可解释的诊断。在合成和真实空间数据集上的实验表明,WBCP 在保持覆盖保证的同时提供了更丰富的不确定性信息。

English abstract

Conformal prediction provides distribution-free prediction intervals with finite-sample coverage guarantees, and recent work by Snell & Griffiths reframes it as Bayesian Quadrature (BQ-CP), yielding powerful data-conditional guarantees via Dirichlet posteriors over thresholds. However, BQ-CP fundamentally requires the i.i.d. assumption. Meanwhile, weighted conformal prediction handles distribution shift via importance weights but remains frequentist, producing only point-estimate thresholds. We propose Weighted Bayesian Conformal Prediction (WBCP), which generalizes BQ-CP to arbitrary importance-weighted settings by replacing the uniform Dirichlet $\mathrm{Dir}(1,\ldots,1)$ with a weighted Dirichlet $\mathrm{Dir}(n_{\mathrm{eff}} \cdot \tilde{w}_1, \ldots, n_{\mathrm{eff}} \cdot \tilde{w}_n)$, where $n_{\mathrm{eff}}$ is Kish's effective sample size. We prove four theoretical results: (1) $n_{\mathrm{eff}}$ is the unique concentration parameter matching frequentist and Bayesian variances; (2) posterior standard deviation decays as $O(1/\sqrt{n_{\mathrm{eff}}})$; (3) BQ-CP's stochastic dominance guarantee extends to per-weight-profile data-conditional guarantees; (4) the HPD threshold provides $O(1/\sqrt{n_{\mathrm{eff}}})$ improvement in conditional coverage. We instantiate WBCP for spatial prediction as Geographical BQ-CP, where kernel-based spatial weights yield per-location posteriors with interpretable diagnostics. Experiments on synthetic and real-world spatial datasets demonstrate that WBCP maintains coverage guarantees while providing substantially richer uncertainty information.
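
To make the weighted Dirichlet concrete, the sketch below computes Kish's effective sample size, $(\sum_i w_i)^2 / \sum_i w_i^2$, and draws one sample from a Dirichlet whose concentration parameters are the effective sample size times the normalized weights. The function names are hypothetical and the sampler uses the standard gamma-normalization construction; how WBCP turns such draws into thresholds is not shown and is not specified in the abstract.

```python
import random


def kish_effective_sample_size(weights):
    """Kish's effective sample size: (sum w)^2 / sum(w^2)."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


def sample_weighted_dirichlet(weights, rng):
    """Draw from Dir(n_eff * w_1/sum(w), ..., n_eff * w_n/sum(w)).

    Assumes strictly positive weights, since gamma shapes must be positive.
    """
    n_eff = kish_effective_sample_size(weights)
    total = sum(weights)
    alphas = [n_eff * w / total for w in weights]
    # A Dirichlet draw is a vector of independent gamma draws, normalized.
    draws = [rng.gammavariate(alpha, 1.0) for alpha in alphas]
    norm = sum(draws)
    return [d / norm for d in draws]


if __name__ == "__main__":
    rng = random.Random(0)
    print(kish_effective_sample_size([1.0, 1.0, 1.0, 1.0]))
    print(kish_effective_sample_size([3.0, 1.0]))
    print(sample_weighted_dirichlet([3.0, 1.0, 1.0], rng))
```

With equal weights the effective sample size equals $n$ and every concentration parameter is 1, so the construction falls back to the uniform Dirichlet used in the i.i.d. case.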

2604.06265 2026-06-19 cs.LG cond-mat.stat-mech quant-ph version update

SMT-AD: a scalable quantum-inspired anomaly detection approach

SMT-AD:一种可扩展的量子启发式异常检测方法

Apimuk Sornsaeng, Si Min Chan, Wenxuan Zhang, Swee Liang Wong, Joshua Lim, Jonathan Pan, Dario Poletti

Affiliations: Science, Mathematics and Technology Cluster, Singapore University of Technology and Design; Centre for Quantum Technologies, National University of Singapore; Artificial Intelligence and Data Analytics Strategic Technology Centre, ST Engineering; Engineering Product Development Pillar, Singapore University of Technology and Design

AI summary: Proposes SMT-AD, a quantum-inspired anomaly detection method based on a superposition of multiresolution tensors, which uses Fourier-assisted feature embedding and matrix product operators to scale linearly and achieves competitive performance on standard datasets.

Comments 12 pages, 5 figures

AI Chinese abstract

量子启发的张量网络算法已被证明是机器学习任务(包括异常检测)中有效且高效的模型。在此,我们提出一种高度可并行化的量子启发式方法,称为SMT-AD(Superposition of Multiresolution Tensors for Anomaly Detection)。它基于键维数为1的矩阵乘积算子的叠加,通过傅里叶辅助特征嵌入对输入数据进行变换,其中可学习参数的数量随特征大小、嵌入分辨率和矩阵乘积算子结构中附加组件的数量线性增长。我们展示了在标准数据集(包括信用卡交易)上成功的异常检测,并发现即使采用最小配置,它也能与已建立的异常检测基线相媲美。此外,它提供了一种直接的方法来减少模型权重,甚至通过突出最相关的输入特征来提高性能。

English abstract

Quantum-inspired tensor network algorithms have been shown to be effective and efficient models for machine learning tasks, including anomaly detection. Here, we propose a highly parallelizable quantum-inspired approach which we call SMT-AD, for Superposition of Multiresolution Tensors for Anomaly Detection. It is based upon the superposition of bond-dimension-1 matrix product operators to transform the input data with Fourier-assisted feature embedding, where the number of learnable parameters grows linearly with feature size, embedding resolutions, and the number of additional components in the matrix product operator structure. We demonstrate successful anomaly detection when applied to standard datasets, including credit card transactions, and find that, even with minimal configurations, it achieves competitive performance against established anomaly detection baselines. Furthermore, it provides a straightforward way to reduce the weight of the model and even improve the performance by highlighting the most relevant input features.

2604.04917 2026-06-19 cs.CV cs.AI cs.CL version update

Vero: An Open RL Recipe for General Visual Reasoning

Vero: 通用视觉推理的开放RL配方

Gabriel Sarch, Linrong Cai, Qunzhong Wang, Haoyang Wu, Danqi Chen, Zhuang Liu

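Affiliations: Princeton University

AI summary: Introduces Vero, a family of open vision-language models, by building the 600K-sample Vero-600K dataset and task-routed rewards; the models gain 2.9-5.4 points on average across 30 benchmarks, and Vero-Qwen3I-8B surpasses Qwen3-VL-8B-Thinking by 3.8 points.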

Comments Project page: https://vero-reasoning.github.io/

AI Chinese abstract

构建一个能在图表、科学、空间理解和开放式任务中工作的视觉推理器需要什么?最强的视觉语言模型(VLM)表明广泛的视觉推理是可以实现的,但其封闭的数据和强化学习(RL)流程使得其成果难以研究、复现或扩展。我们引入了Vero,一个完全开放的VLM系列,在各种视觉推理任务中匹配或超越现有的开放权重模型。我们跨六个广泛的任务类别扩展RL数据和奖励,构建了Vero-600K,一个来自59个数据集的600K样本数据集,并设计了处理异构答案的任务路由奖励。在我们的30个基准测试套件VeroEval中,Vero-600K在受控比较下优于现有的RL数据集。应用于五个起始模型,Vero变体在其初始模型上平均获得2.9-5.4分的提升。值得注意的是,基于Instruct模型训练的Vero-Qwen3I-8B,在没有额外蒸馏的情况下,平均超过Qwen3-VL-8B-Thinking 3.8分。系统的消融实验揭示,不同的任务类别引发不同的推理模式,而广泛的收益依赖于联合学习它们,而非孤立学习。所有数据、代码和模型均已公开。

English abstract

What does it take to build a visual reasoner that works across charts, science, spatial understanding, and open-ended tasks? The strongest vision-language models (VLMs) suggest that broad visual reasoning is within reach, yet their closed data and reinforcement learning (RL) pipelines make their gains difficult to study, reproduce, or extend. We introduce Vero, a family of fully open VLMs that match or exceed existing open-weight models across diverse visual reasoning tasks. We scale RL data and rewards across six broad task categories, constructing Vero-600K, a 600K-sample dataset from 59 datasets, and designing task-routed rewards that handle heterogeneous answers. Across VeroEval, our 30-benchmark suite, Vero-600K outperforms existing RL datasets under controlled comparisons. Applied to five starting models, Vero variants gain 2.9-5.4 points on average over their initial models. Notably, Vero-Qwen3I-8B, trained on the Instruct model, surpasses Qwen3-VL-8B-Thinking by 3.8 points on average without additional distillation. Systematic ablations reveal that different task categories elicit distinct reasoning patterns and that broad gains depend on learning them jointly rather than in isolation. All data, code, and models are publicly available.

2604.05435 2026-06-19 cs.AI 版本更新

CareTransition-Audit: A Benchmark to Audit Discharge Summaries for Efficient Care Transitions

CareTransition-Audit:用于高效护理过渡的出院总结审计基准

Akshat Dasula, Prasanna Desikan, Jaideep Srivastava, Shivali Dalmia, Abhishek Mukherji

发表机构 * Department of Computer Science & Engineering, University of Minnesota-Twin Cities, Minneapolis, USA; Centific AI Research, Redmond, USA

AI总结 提出基于大语言模型的自动化框架,通过46项检查清单审计出院总结完整性,在MIMIC-IV数据集上基准测试11个模型,最佳模型与临床医生标签的Cohen's kappa约0.5,所有模型难以识别模糊文档。

Comments Accepted as a poster at IEEE-ICHI 2026; Accepted at SD4H@ICML

详情
AI中文摘要

不完整或不一致的出院文档会导致护理碎片化和可避免的再入院。尽管其在患者安全中至关重要,但审计出院总结依赖于人工审查且无法扩展。我们提出一个使用大语言模型(LLM)的自动化审计框架。我们的方法将DISCHARGED框架操作化为一个包含46个问题的检查清单。使用来自MIMIC-IV数据库的50份总结及临床医生真实标签,我们对11个LLM进行基准测试。模型评估的平均文档完整性范围为54.9%至74.2%,最佳模型与临床医生标签的Cohen's kappa值约为0.5,表明中等一致性。所有模型在识别模糊文档(Unclear)方面均存在困难,突显了当前自动化审计的关键差距。本工作为临床文档的系统性质量改进提供了临床医生验证的基准和零样本基线。

英文摘要

Incomplete or inconsistent discharge documentation drives care fragmentation and avoidable readmissions. Despite its critical role in patient safety, auditing discharge summaries relies on manual review and does not scale. We propose an automated framework for auditing discharge summaries using large language models (LLMs). Our approach operationalizes the DISCHARGED framework into a checklist of 46 questions. Using 50 summaries from the MIMIC-IV database, with clinician ground-truth labels, we benchmark 11 LLMs. Model-assessed mean documentation completeness ranges from 54.9% to 74.2%, and the best-performing models achieve Cohen's kappa values around 0.5 against clinician labels, indicating moderate agreement. All models struggle to identify ambiguous documentation (Unclear), highlighting a key gap in current automated auditing. This work provides a clinician-validated benchmark and zero-shot baselines for systematic quality improvement in clinical documentation.
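
The agreement statistic quoted above can be illustrated with a minimal sketch of Cohen's kappa for two raters. The function name, the label values, and the example answers below are hypothetical and are not taken from the paper; the formula itself is the standard one, kappa = (p_o - p_e) / (1 - p_e).

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters give the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: sum over labels of the product of marginal frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical checklist answers (illustration only, not data from the paper).
clinician = ["Yes", "Yes", "No", "No", "Unclear", "Yes", "No", "Yes"]
model = ["Yes", "No", "No", "No", "Yes", "Yes", "No", "Yes"]
kappa = cohens_kappa(clinician, model)
```

In this toy example the raters agree on 6 of 8 items, yet kappa is noticeably lower than 0.75 because part of that agreement is expected by chance. Note also that the single "Unclear" item is missed by the hypothetical model, which mirrors the failure mode the abstract highlights.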

2603.29924 2026-06-19 cs.CV 版本更新

Abstraction in Style: Beyond Texture and Color

风格中的抽象:超越纹理与色彩

Min Lu, Yuanfeng He, Anthony Chen, Jianhuang He, Pu Wang, Daniel Cohen-Or, Hui Huang

发表机构 * Shenzhen University(深圳大学) Visual Computing Research Center (VCC), College of Computer Science and Software Engineering (CSSE)(视觉计算研究中心(VCC),计算机科学与软件工程学院) Peking University(北京大学)

AI总结 提出Abstraction in Style (AiS)框架,将结构抽象与视觉风格分离,通过中间抽象代理实现几何保真度放松,从而支持更广泛的非真实感风格迁移。

Comments SIGGRAPH 2026

详情
AI中文摘要

艺术风格通常嵌入超越表面外观的抽象,涉及对结构的有意重新诠释,而不仅仅是纹理或色彩的变化。传统的风格迁移方法通常保留输入几何结构,因此难以捕捉这种更深层次的抽象行为,尤其是对于插画和非真实感风格。在这项工作中,我们引入了Abstraction in Style (AiS),一个将结构抽象与视觉风格化分离的生成框架。给定目标图像和少量风格样本,AiS首先推导出一个中间抽象代理,该代理根据风格所展现的抽象逻辑重新诠释目标的结构。代理捕捉语义结构,同时放松几何保真度,使得后续的风格化能够在抽象表示而非原始图像上进行操作。在第二阶段,渲染抽象代理以产生最终风格化输出,保持与参考风格的视觉一致性。两个阶段都使用共享的图像空间类比实现,使得变换可以从视觉样本中学习,无需显式的几何监督。通过将抽象与外观解耦,并将抽象视为显式、可迁移的过程,AiS支持更广泛的风格变换,提高了可控性,并实现了更具表现力的风格化。

英文摘要

Artistic styles often embed abstraction beyond surface appearance, involving deliberate reinterpretation of structure rather than mere changes in texture or color. Conventional style transfer methods typically preserve the input geometry and therefore struggle to capture this deeper abstraction behavior, especially for illustrative and nonphotorealistic styles. In this work, we introduce Abstraction in Style (AiS), a generative framework that separates structural abstraction from visual stylization. Given a target image and a small set of style exemplars, AiS first derives an intermediate abstraction proxy that reinterprets the target's structure in accordance with the abstraction logic exhibited by the style. The proxy captures semantic structure while relaxing geometric fidelity, enabling subsequent stylization to operate on an abstracted representation rather than the original image. In a second stage, the abstraction proxy is rendered to produce the final stylized output, preserving visual coherence with the reference style. Both stages are implemented using a shared image space analogy, enabling transformations to be learned from visual exemplars without explicit geometric supervision. By decoupling abstraction from appearance and treating abstraction as an explicit, transferable process, AiS supports a wider range of stylistic transformations, improves controllability, and enables more expressive stylization.

2603.28387 2026-06-19 cs.AI cs.LG 版本更新

The Scaffold Effect: How Prompt Framing Drives Apparent Multimodal Gains in Clinical VLM Evaluation

脚手架效应:提示框架如何驱动临床VLM评估中的表面多模态增益

Doan Nam Long Vu, Simone Balloccu

发表机构 * Technical University of Darmstadt(达姆施塔特技术大学)

AI总结 研究发现,在临床VLM评估中,提示中提及MRI可用性即可解释70-80%的性能提升,与图像数据是否存在无关,这种“脚手架效应”揭示了表面评估无法反映真实多模态推理能力。

详情
AI中文摘要

可信的临床AI要求性能提升反映真实的证据整合而非表面伪影。我们在两个临床神经影像队列FOR2107(情感障碍)和OASIS-3(认知衰退)上评估了12个开源视觉语言模型(VLM)的二分类性能。两个数据集都包含结构MRI数据,但这些数据不携带可靠的个体级诊断信号。在这些条件下,较小的VLM在引入神经影像上下文后F1分数提升高达58%,蒸馏模型变得与规模大一个数量级的模型相当。对比置信度分析显示,仅仅在任务提示中提及MRI可用性就解释了70-80%的转变,与影像数据是否存在无关,这是模态坍塌的一个领域特定实例,我们称之为“脚手架效应”。专家评估揭示了在所有条件下捏造基于神经影像的正当理由,而偏好对齐虽然消除了引用MRI的行为,却使两种条件都退化为随机基线。我们的发现表明,表面评估不足以作为多模态推理的指标,这对VLM在临床环境中的部署有直接影响。

英文摘要

Trustworthy clinical AI requires that performance gains reflect genuine evidence integration rather than surface-level artifacts. We evaluate 12 open-weight vision-language models (VLMs) on binary classification across two clinical neuroimaging cohorts, FOR2107 (affective disorders) and OASIS-3 (cognitive decline). Both datasets come with structural MRI data that carries no reliable individual-level diagnostic signal. Under these conditions, smaller VLMs exhibit gains of up to 58% F1 upon introduction of neuroimaging context, with distilled models becoming competitive with counterparts an order of magnitude larger. A contrastive confidence analysis reveals that merely mentioning MRI availability in the task prompt accounts for 70-80% of this shift, independent of whether imaging data is present, a domain-specific instance of modality collapse we term the "scaffold effect". Expert evaluation reveals fabrication of neuroimaging-grounded justifications across all conditions, and preference alignment, while eliminating MRI-referencing behavior, collapses both conditions toward random baseline. Our findings demonstrate that surface evaluations are inadequate indicators of multimodal reasoning, with direct implications for the deployment of VLMs in clinical settings.

2602.04037 2026-06-19 cs.LG cs.RO 版本更新

DADP: Domain Adaptive Diffusion Policy

DADP: 领域自适应扩散策略

Pengcheng Wang, Qinghang Liu, Haotian Lin, Yiheng Li, Guojian Zhan, Masayoshi Tomizuka, Yixiao Wang

发表机构 * University of California, Berkeley, California, USA(加州大学伯克利分校) Peking University, Beijing, China(北京大学) Tsinghua University, Beijing, China(清华大学)

AI总结 提出DADP,通过无监督解耦和领域感知扩散注入,实现跨动态环境的鲁棒零样本适应,在运动与操控任务上超越先前方法。

详情
AI中文摘要

学习能够泛化到未见过的转移动态的领域自适应策略,仍然是基于学习的控制中的一个基本挑战。通过领域表示学习来捕获领域特定信息,从而实现领域感知决策,已经取得了实质性进展。我们分析了通过动态预测学习领域表示的过程,发现选择与当前步骤相邻的上下文会导致学习到的表示将静态领域信息与变化的动态属性纠缠在一起。这种混合可能会混淆条件策略,从而限制零样本适应。为了应对这一挑战,我们提出了DADP(领域自适应扩散策略),通过无监督解耦和领域感知扩散注入实现鲁棒适应。首先,我们引入了滞后上下文动态预测,这是一种将未来状态估计条件化在历史偏移上下文上的策略;通过增加这个时间间隔,我们通过过滤掉瞬态属性来无监督地解耦静态领域表示。其次,我们通过偏置先验分布和重新制定扩散目标,将学习到的领域表示直接集成到生成过程中。在涉及运动和操控的具有挑战性的基准测试上的大量实验表明,DADP相对于先前方法具有优越的性能和泛化能力。更多可视化结果可在 https://outsider86.github.io/DomainAdaptiveDiffusionPolicy/ 上获得。

英文摘要

Learning domain-adaptive policies that generalize to unseen transition dynamics remains a fundamental challenge in learning-based control. Substantial progress has been made through domain representation learning to capture domain-specific information, thus enabling domain-aware decision making. We analyze the process of learning domain representations through dynamical prediction and find that selecting contexts adjacent to the current step causes the learned representations to entangle static domain information with varying dynamical properties. Such mixing can confuse the conditioned policy, thereby constraining zero-shot adaptation. To tackle this challenge, we propose DADP (Domain Adaptive Diffusion Policy), which achieves robust adaptation through unsupervised disentanglement and domain-aware diffusion injection. First, we introduce Lagged Context Dynamical Prediction, a strategy that conditions future state estimation on a historical offset context; by increasing this temporal gap, we disentangle static domain representations without supervision by filtering out transient properties. Second, we integrate the learned domain representations directly into the generative process by biasing the prior distribution and reformulating the diffusion target. Extensive experiments on challenging benchmarks across locomotion and manipulation demonstrate the superior performance and generalizability of DADP over prior methods. More visualization results are available at https://outsider86.github.io/DomainAdaptiveDiffusionPolicy/.

2512.00850 2026-06-19 cs.CV 版本更新

Smol-GS: Compact Representations for Abstract 3D Gaussian Splatting

Smol-GS: 抽象3D高斯溅射的紧凑表示

Haishan Wang, Mohammad Hassan Vali, Arno Solin

发表机构 * ELLIS Institute Finland(芬兰ELLIS研究所) Aalto University(阿尔托大学)

AI总结 提出Smol-GS方法,通过八叉树位置编码和熵压缩学习高效溅射特征,实现3D高斯溅射的紧凑表示,在保持渲染质量的同时大幅降低存储。

详情
AI中文摘要

我们提出Smol-GS,一种学习3D高斯溅射(3DGS)紧凑表示的新方法。我们的方法学习高效的逐溅射特征来建模3D空间,这些特征捕获抽象线索,包括颜色、不透明度、变换和材质属性。我们提出八叉树导出的位置编码,显式建模空间局部性并增强表示效率。我们进一步应用基于熵的压缩来利用特征冗余,并使用递归体素层次压缩溅射坐标。这种设计在保持表示灵活性的同时,实现了数量级的存储减少。Smol-GS在标准基准测试上以高渲染质量实现了最先进的压缩性能。

英文摘要

We present Smol-GS, a novel method for learning compact representations for 3D Gaussian Splatting (3DGS). Our approach learns highly efficient splat-wise features to model 3D space, which capture abstracted cues, including color, opacity, transformation, and material properties. We propose octree-derived positional encoding, which explicitly models spatial locality and enhances representation efficiency. We further apply entropy-based compression to exploit feature redundancy and compress splat coordinates using a recursive voxel hierarchy. This design enables orders-of-magnitude reduction in storage while preserving representation flexibility. Smol-GS achieves state-of-the-art compression performance on standard benchmarks while maintaining high rendering quality.

2505.17006 2026-06-19 cs.CV cs.RO 版本更新

CoMo: Learning Continuous Latent Motion from Internet Videos for Scalable Robot Learning

CoMo: 从互联网视频中学习连续潜在运动以实现可扩展的机器人学习

Jiange Yang, Yansong Shi, Haoyi Zhu, Mingyu Liu, Kaijing Ma, Yating Wang, Gangshan Wu, Tong He, Limin Wang

发表机构 * Nanjing University(南京大学) Shanghai AI Lab(上海人工智能实验室) University of Science and Technology of China(中国科学技术大学) Zhejiang University(浙江大学) Fudan University(复旦大学) Tongji University(同济大学)

AI总结 提出CoMo方法,通过早期时间差分和时序对比学习从互联网视频中学习连续潜在运动,避免离散化信息损失,实现零样本泛化生成伪动作标签,联合训练策略在仿真和真实实验中表现优异。

Comments CVPR 2026

详情
AI中文摘要

从互联网视频中无监督学习潜在运动对于机器人学习至关重要。现有的离散方法通常通过小码本大小的向量量化来减轻提取过多静态背景导致的捷径学习,但它们存在信息损失,难以捕捉更复杂和细粒度的动态。此外,离散潜在运动与连续机器人动作之间存在固有分布差距,阻碍了统一策略的联合学习。我们提出CoMo,旨在从互联网规模视频中学习更精确的连续潜在运动。CoMo采用早期时间差分(Td)机制来增加捷径学习难度并显式增强运动线索。此外,为确保潜在运动更好地捕捉有意义的背景,我们进一步提出时序对比学习(Tcl)方案。具体地,正样本对通过小的未来帧时间偏移构建,而负样本对则通过直接反转时间方向形成。所提出的Td和Tcl协同工作,有效确保潜在运动更好地关注前景并增强运动线索。关键的是,CoMo表现出强大的零样本泛化能力,使其能够为未见过的视频生成有效的伪动作标签。大量的仿真和真实实验表明,使用CoMo伪动作标签联合训练的策略在扩散和自回归架构下均实现了优越性能。

英文摘要

Unsupervised learning of latent motion from Internet videos is crucial for robot learning. Existing discrete methods generally mitigate the shortcut learning caused by extracting excessive static backgrounds through vector quantization with a small codebook size. However, they suffer from information loss and struggle to capture more complex and fine-grained dynamics. Moreover, there is an inherent gap between the distribution of discrete latent motion and continuous robot action, which hinders the joint learning of a unified policy. We propose CoMo, which aims to learn more precise continuous latent motion from Internet-scale videos. CoMo employs an early temporal difference (Td) mechanism to increase the shortcut learning difficulty and explicitly enhance motion cues. Additionally, to ensure latent motion better captures meaningful foregrounds, we further propose a temporal contrastive learning (Tcl) scheme. Specifically, positive pairs are constructed with a small future frame temporal offset, while negative pairs are formed by directly reversing the temporal direction. The proposed Td and Tcl work synergistically and effectively ensure that the latent motion focuses better on the foreground and reinforces motion cues. Critically, CoMo exhibits strong zero-shot generalization, enabling it to generate effective pseudo action labels for unseen videos. Extensive simulated and real-world experiments show that policies co-trained with CoMo pseudo action labels achieve superior performance with both diffusion and auto-regressive architectures.
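
The pair construction described for the temporal contrastive learning (Tcl) scheme can be sketched as index bookkeeping. This is only one plausible reading of the abstract: the function name `build_tcl_pairs`, the parameters `gap` and `offset`, and the idea that each latent motion is extracted from a (start, end) frame pair are assumptions for illustration, not details confirmed by the paper.

```python
def build_tcl_pairs(t, gap, offset, num_frames):
    """Return (anchor, positive, negative) frame-index pairs for one sample.

    Each pair (start, end) names the two frames from which a latent motion
    would be extracted. All names and parameters here are hypothetical.
    """
    if gap <= 0 or offset <= 0:
        raise ValueError("gap and offset must be positive")
    if t < 0 or t + gap + offset >= num_frames:
        raise ValueError("pair would fall outside the video")
    anchor = (t, t + gap)
    # Positive: the same motion window shifted slightly into the future.
    positive = (t + offset, t + gap + offset)
    # Negative: the anchor's own frames with the temporal direction reversed.
    negative = (t + gap, t)
    return anchor, positive, negative


anchor, positive, negative = build_tcl_pairs(t=10, gap=4, offset=1, num_frames=100)
```

The reversed pair shares exactly the same pixels and background as the anchor, so a representation can only separate the two by encoding the direction of motion, which is presumably why it serves as a hard negative.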

2603.25702 2026-06-19 cs.CL 版本更新

S2D2: Fast Decoding for Diffusion LLMs via Training-Free Self-Speculation

S2D2:通过免训练自我推测实现扩散LLM的快速解码

Ligong Han, Hao Wang, Han Gao, Kai Xu, Akash Srivastava

发表机构 * Red Hat AI Innovation(红帽AI创新) MIT-IBM Watson AI Lab(MIT-IBM沃森人工智能实验室) Iowa State University(爱荷华州立大学) Core AI, IBM(IBM核心AI)

AI总结 提出S2D2,一种免训练的自我推测解码框架,通过将块扩散模型在块大小为1时变为自回归模型,实现草稿与验证角色复用,在不增加训练或测试计算下提升解码速度与准确性。

Comments Code is available at https://github.com/phymhan/S2D2

详情
AI中文摘要

块扩散语言模型通过结合块级自回归解码与块内并行去噪,为超越自回归生成提供了一条有前景的路径。然而,在实际加速所需的少步数场景中,标准的置信度阈值解码往往脆弱:激进的阈值损害质量,而保守的阈值则需要不必要的去噪步骤。现有解决此问题的方法要么需要额外训练,要么增加测试时计算。我们提出S2D2,一种用于块扩散语言模型的免训练自我推测解码框架。我们的关键观察是,当块大小减小到1时,块扩散模型变为自回归模型,从而允许相同的预训练模型同时充当草稿模型和验证模型。S2D2在标准块扩散解码中插入一个推测验证步骤,并使用轻量级路由策略来决定何时验证值得其成本。这产生了一种混合解码轨迹,其中扩散并行提出令牌,而自回归模式充当局部序列级评判器。在三个主流块扩散家族中,S2D2在准确性-速度权衡上持续优于强置信度阈值基线。在SDAR上,我们观察到相比自回归解码高达4.7倍加速,相比调优的动态解码基线高达1.57倍加速,同时准确性提升高达4.5个点。在LLaDA2.1-Mini上,S2D2与内置自校正保持互补,包括在保守设置下比静态基线快4.4倍且准确性略高。

英文摘要

Block-diffusion language models offer a promising path toward faster-than-autoregressive generation by combining block-wise autoregressive decoding with within-block parallel denoising. However, in the few-step regime needed for practical acceleration, standard confidence-thresholded decoding is often brittle: aggressive thresholds hurt quality, while conservative thresholds require unnecessary denoising steps. Existing approaches that address this issue either require additional training or incur extra test-time compute. We present S2D2, a training-free self-speculative decoding framework for block-diffusion language models. Our key observation is that a block-diffusion model becomes autoregressive when the block size is reduced to one, allowing the same pretrained model to act as both drafter and verifier. S2D2 inserts a speculative verification step into standard block-diffusion decoding and uses lightweight routing policies to decide when verification is worth its cost. This yields a hybrid decoding trajectory in which diffusion proposes tokens in parallel, while the autoregressive mode acts as a local sequence-level critic. Across three mainstream block-diffusion families, S2D2 consistently improves the accuracy-speed tradeoff over strong confidence-thresholding baselines. On SDAR, we observe up to $4.7\times$ speedup over autoregressive decoding, and up to $1.57\times$ over a tuned dynamic decoding baseline while improving accuracy by up to $4.5$ points. On LLaDA2.1-Mini, S2D2 remains complementary to built-in self-correction, including a conservative setting where it is $4.4\times$ faster than the static baseline with slightly higher accuracy.
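
The draft-then-verify idea can be illustrated with a minimal sketch of greedy speculative verification. This is a simplification under stated assumptions: `verifier_next_token` is a hypothetical stand-in for running the same model in block-size-one (autoregressive) mode, the acceptance rule shown is plain greedy matching rather than whatever rule S2D2 actually uses, and a real system would score all draft positions in one forward pass instead of a Python loop.

```python
def verify_draft(prefix, draft_tokens, verifier_next_token):
    """Accept the longest draft prefix the verifier agrees with.

    verifier_next_token(context) -> token is a hypothetical callable that
    returns the verifier's greedy next token for the given context.
    """
    accepted = []
    context = list(prefix)
    for token in draft_tokens:
        expected = verifier_next_token(context)
        if token != expected:
            # First disagreement: keep the verifier's token and stop.
            accepted.append(expected)
            break
        accepted.append(token)
        context.append(token)
    return accepted


def toy_verifier(context):
    """Toy verifier: always predicts (last token + 1) modulo 10."""
    return (context[-1] + 1) % 10


# The parallel drafter proposed four tokens; the third one is wrong.
result = verify_draft(prefix=[1, 2], draft_tokens=[3, 4, 6, 7], verifier_next_token=toy_verifier)
```

Here the first two draft tokens are accepted, the third is replaced by the verifier's choice, and the rest of the draft is discarded, so `result` is `[3, 4, 5]`. The routing policies mentioned in the abstract would decide whether running this verification step is worth its cost at all.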

2603.12252 2026-06-19 cs.CV cs.CL 版本更新

EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models

EndoCoT:扩散模型中的内生思维链推理扩展

Xuanlang Dai, Yujie Zhou, Long Xing, Jiazi Bu, Xilin Wei, Yuhong Liu, Beichen Zhang, Kai Chen, Yuhang Zang

发表机构 * Shanghai AI Laboratory(上海人工智能实验室) Xi’an Jiaotong University(西安交通大学) University of Science and Technology of China(中国科学技术大学) Shanghai Jiaotong University(上海交通大学) Fudan University(复旦大学) The Chinese University of Hong Kong(香港中文大学)

AI总结 提出EndoCoT框架,通过迭代思维引导模块激活MLLM的推理潜力,并利用终端思维接地模块确保推理轨迹与文本监督对齐,使DiT逐步执行复杂任务,在多个基准上平均准确率达92.1%。

Comments 23 pages, 18 figures, The code and dataset are publicly available at https://internlm.github.io/EndoCoT/

详情
AI中文摘要

最近,多模态大语言模型(MLLMs)被广泛集成到扩散框架中,主要作为文本编码器来处理空间推理等复杂任务。然而,这种范式存在两个关键限制:(i)MLLM文本编码器表现出不足的推理深度。单步编码无法激活思维链过程,而这对MLLM为复杂任务提供准确指导至关重要。(ii)在解码过程中,指导保持不变。即使有正确的MLLM编码,解码过程中的不变指导也阻止了DiT逐步将复杂指令分解为可执行的去噪步骤。为此,我们提出了内生思维链(EndoCoT),一种新颖的框架,首先通过迭代思维引导模块迭代细化潜在思维状态来激活MLLM的推理潜力,然后将这些状态桥接到DiT的去噪过程。其次,应用终端思维接地模块,通过将最终状态与真实答案对齐,确保推理轨迹保持与文本监督的接地。通过这两个组件,MLLM文本编码器提供精心推理的指导,使DiT能够逐步执行并最终以逐步方式解决复杂任务。在多个基准(如Maze、TSP、VSP和Sudoku)上的广泛评估实现了平均准确率92.1%,比最强基线高出8.3个百分点。代码和数据集已在 https://internlm.github.io/EndoCoT/ 公开。

英文摘要

Recently, Multimodal Large Language Models (MLLMs) have been widely integrated into diffusion frameworks primarily as text encoders to tackle complex tasks such as spatial reasoning. However, this paradigm suffers from two critical limitations: (i) The MLLM text encoder exhibits insufficient reasoning depth. Single-step encoding fails to activate the Chain-of-Thought process, which is essential for MLLMs to provide accurate guidance for complex tasks. (ii) The guidance remains invariant during the decoding process. Invariant guidance during decoding prevents DiT from progressively decomposing complex instructions into actionable denoising steps, even with correct MLLM encodings. To this end, we propose Endogenous Chain-of-Thought (EndoCoT), a novel framework that first activates MLLMs' reasoning potential by iteratively refining latent thought states through an iterative thought guidance module, and then bridges these states to the DiT's denoising process. Second, a terminal thought grounding module is applied to ensure the reasoning trajectory remains grounded in textual supervision by aligning the final state with ground-truth answers. With these two components, the MLLM text encoder delivers meticulously reasoned guidance, enabling the DiT to execute it progressively and ultimately solve complex tasks in a step-by-step manner. Extensive evaluations across diverse benchmarks (e.g., Maze, TSP, VSP, and Sudoku) achieve an average accuracy of 92.1%, outperforming the strongest baseline by 8.3 percentage points. The code and dataset are publicly available at https://internlm.github.io/EndoCoT/.

2603.22922 2026-06-19 cs.CL 版本更新

Quality Over Clicks: Iterative Reinforcement Learning for Early-Stage E-Commerce Query Suggestion

质量优于点击:面向早期电商查询建议的迭代强化学习

Qi Sun, Kejun Xiao, Huaipeng Zhao, Tao Luo, Xiaoyi Zeng

发表机构 * Alibaba International Digital Commercial Group(阿里巴巴国际数字商业集团)

AI总结 针对早期部署场景点击反馈稀疏的问题,提出质量优先的迭代强化学习框架QualEQS,从可回答性、事实性和信息增益三个维度优化查询建议质量,通过候选建议的组级分歧识别模糊上下文并挖掘难例进行迭代改进,在真实电商系统中ChatPV提升6.81%。

详情
AI中文摘要

现有的对话系统依赖查询建议来增强用户参与度。最近的方法主要使用点击率(CTR)模型优化生成模型,以与用户偏好对齐。然而,这些方法在早期部署场景中效果较差,因为点击反馈稀疏且不足以训练可靠的CTR模型。为弥补这一差距,我们提出了QualEQS,一个面向电商查询建议的质量优先迭代强化学习框架。我们将可操作的建议质量形式化为三个直接影响下游可用性的维度:可回答性、事实性和信息增益。为了在没有点击监督的情况下从在线流量中持续改进,我们进一步提出候选建议之间的组级分歧,以识别模糊的查询上下文并挖掘难训练案例进行迭代优化。我们还引入了EQS-Benchmark,一个包含16,949个真实电商查询的数据集,用于离线训练和评估。实验表明,我们基于质量的离线指标与在线性能强相关,为稀疏反馈部署提供了一种实用的评估方法。在离线和在线设置中,QualEQS均持续优于强基线,在真实企业级对话购物助手系统中,在线ChatPV提升了6.81%。

英文摘要

Existing dialogue systems rely on query suggestion to enhance user engagement. Recent approaches mainly optimize generative models using click-through rate (CTR) models to align with user preferences. However, these methods are less effective in early-stage deployment scenarios, where click feedback is sparse and insufficient for training a reliable CTR model. To bridge this gap, we propose QualEQS, a quality-first iterative reinforcement learning framework for e-commerce query suggestion. We formalize actionable suggestion quality along three dimensions that directly affect downstream usability: answerability, factuality, and information gain. To continuously improve from online traffic without click supervision, we further propose group-level disagreement among candidate suggestions to identify ambiguous query contexts and mine hard training cases for iterative refinement. We also introduce EQS-Benchmark, a dataset of 16,949 real-world e-commerce queries for offline training and evaluation. Experiments show that our quality-based offline metrics correlate strongly with online performance, providing a practical evaluation recipe for sparse-feedback deployment. In both offline and online settings, QualEQS consistently outperforms strong baselines, yielding a 6.81% improvement in online ChatPV in a real-world enterprise-level conversational shopping assistant system.

2603.16606 2026-06-19 cs.CL 版本更新

Omnilingual SONAR: Cross-Lingual and Cross-Modal Sentence Embeddings Bridging Massively Multilingual Text and Speech

Omnilingual SONAR:跨语言与跨模态句子嵌入,连接大规模多语言文本与语音

Omnilingual SONAR Team, João Maria Janeiro, Pere-Lluís Huguet Cabot, Ioannis Tsiamas, Yen Meng, Vivek Iyer, Guillem Ramírez, Loic Barrault, Belen Alastruey, Xiang "Tony" Cao, Yu-An Chung, Marta R. Costa-Jussa, David Dale, Kevin Heffernan, Jaehyeong Jo, Artyom Kozhevnikov, Alexandre Mourachko, Christophe Ropers, Holger Schwenk, Paul-Ambroise Duquenne

发表机构 * FAIR at Meta(Meta的FAIR)

AI总结 提出OmniSONAR模型,通过渐进式训练和教师-学生蒸馏,在数千种语言上实现文本、语音、代码和数学表达式的统一语义嵌入,在跨语言检索和翻译任务上显著降低错误率,并支持零样本语音翻译。

详情
AI中文摘要

跨语言句子编码器通常只覆盖几百种语言,并且常常为了更强的对齐而牺牲下游质量,限制了它们的采用。我们引入了OmniSONAR,一个新的全语言、跨语言和跨模态句子嵌入模型家族,它原生地将文本、语音、代码和数学表达式嵌入到单一语义空间中,同时在数千种语言(从高资源到极低资源变体)的规模上提供最先进的下游性能。为了在不发生表示崩溃的情况下达到这一规模,我们使用了渐进式训练。我们首先使用LLM初始化的编码器-解码器,结合token级解码、新颖的分裂softmax对比损失和合成硬负样本,为200种语言学习一个强大的基础空间。在此基础上,我们通过两阶段教师-学生编码器蒸馏框架扩展到数千种语言变体。最后,我们通过将177种口语无缝映射到该空间,展示了该空间的跨模态可扩展性。OmniSONAR将200种语言的FLORES数据集上的跨语言相似性搜索错误减半,并在1560种语言的BIBLE基准上将错误减少了15倍。它还实现了强大的翻译性能,在多语言基准上优于NLLB-3B,并在1560种语言到英语的BIBLE翻译上比先前模型(包括更大的LLM)高出15个chrF++点。OmniSONAR在MTEB和XLCoST上也表现强劲。对于语音,OmniSONAR实现了43%更低的相似性搜索错误,并达到了SeamlessM4T语音到文本质量的97%,尽管对于翻译是零样本(仅在ASR数据上训练)。最后,通过训练一个编码器-解码器LM Spectrum,仅使用英语文本处理OmniSONAR嵌入序列,我们为复杂的下游任务解锁了向数千种语言和语音的高性能迁移。

英文摘要

Cross-lingual sentence encoders typically cover only a few hundred languages and often trade downstream quality for stronger alignment, limiting their adoption. We introduce OmniSONAR, a new family of omnilingual, cross-lingual and cross-modal sentence embedding models that natively embed text, speech, code, and mathematical expressions in a single semantic space, while delivering state-of-the-art downstream performance at the scale of thousands of languages, from high-resource to extremely low-resource varieties. To reach this scale without representation collapse, we use progressive training. We first learn a strong foundational space for 200 languages with an LLM-initialized encoder-decoder, combining token-level decoding with a novel split-softmax contrastive loss and synthetic hard negatives. Building on this foundation, we expand to several thousand language varieties via a two-stage teacher-student encoder distillation framework. Finally, we demonstrate the cross-modal extensibility of this space by seamlessly mapping 177 spoken languages into it. OmniSONAR halves cross-lingual similarity search error on the 200-language FLORES dataset and reduces error by a factor of 15 on the 1,560-language BIBLE benchmark. It also enables strong translation, outperforming NLLB-3B on multilingual benchmarks and exceeding prior models (including much larger LLMs) by 15 chrF++ points on BIBLE translation from 1,560 languages into English. OmniSONAR also performs strongly on MTEB and XLCoST. For speech, OmniSONAR achieves a 43% lower similarity-search error and reaches 97% of SeamlessM4T speech-to-text quality, despite being zero-shot for translation (trained only on ASR data). Finally, by training an encoder-decoder LM, Spectrum, exclusively on English text processing OmniSONAR embedding sequences, we unlock high-performance transfer to thousands of languages and speech for complex downstream tasks.

2603.15106 2026-06-19 cs.AI 版本更新

PrototypeNAS: Rapid Design of Deep Neural Networks for Microcontroller Units

PrototypeNAS: 微控制器单元深度神经网络的快速设计

Mark Deutel, Simon Geis, Axel Plinge

发表机构 * Fraunhofer Institute for Integrated Circuits(弗劳恩霍夫集成电路研究所)

AI总结 提出零样本NAS方法PrototypeNAS,通过解耦设计与训练、多架构搜索空间、集成零样本代理和超体积子集选择,快速为不同MCU定制DNN,在图像分类等任务上分钟级找到小模型且精度接近大模型。

Comments Accepted at ECML-PKDD 2026. 18 pages, 7 figures, 4 tables. This work was funded by the European Commission as part of the MANOLO project under the Horizon Europe programme Grant Agreement No.101135782

详情
AI中文摘要

在具有不同硬件约束的边缘设备上实现高效的深度神经网络推理是一项具有挑战性的任务,通常需要为每个设备单独定制DNN架构。为避免大量人工努力,可以使用神经架构搜索。然而,许多现有的NAS方法资源密集且耗时,因为它们需要从头开始训练许多不同的DNN。此外,它们没有考虑目标系统的资源约束。为了解决这些缺点,我们提出了PrototypeNAS,一种零样本NAS方法,用于加速和自动化DNN的选择、压缩和针对不同目标微控制器单元的专门化。我们提出了一种新颖的三步搜索方法,将DNN设计和专门化与给定目标平台上的DNN训练解耦。首先,我们提出了一种新的搜索空间,不仅从单个大型架构中裁剪出较小的DNN,而且结合了多种架构类型的结构优化,以及它们的剪枝和量化配置的优化。其次,我们探索在优化过程中使用集成零样本代理而不是单个代理。第三,我们提出使用超体积子集选择从多目标优化的帕累托前沿中提取DNN架构,这些架构代表了准确性和FLOPs之间最有意义的权衡。我们在三个不同任务(图像分类、时间序列分类和目标检测)的12个数据集上评估了PrototypeNAS的有效性。我们的结果表明,PrototypeNAS能够在几分钟内识别出足够小、可部署在现成MCU上的DNN模型,并且仍然达到与大型DNN模型相当的精度。

英文摘要

Enabling efficient deep neural network (DNN) inference on edge devices with different hardware constraints is a challenging task that typically requires DNN architectures to be specialized for each device separately. To avoid this substantial manual effort, one can use neural architecture search (NAS). However, many existing NAS methods are resource-intensive and time-consuming because they require the training of many different DNNs from scratch. Furthermore, they do not take the resource constraints of the target system into account. To address these shortcomings, we propose PrototypeNAS, a zero-shot NAS method to accelerate and automate the selection, compression, and specialization of DNNs to different target microcontroller units (MCUs). We propose a novel three-step search method that decouples DNN design and specialization from DNN training for a given target platform. First, we present a novel search space that does not merely cut out smaller DNNs from a single large architecture but instead combines the structural optimization of multiple architecture types with the optimization of their pruning and quantization configurations. Second, we explore the use of an ensemble of zero-shot proxies during optimization instead of a single one. Third, we propose the use of hypervolume subset selection to distill, from the Pareto front of the multi-objective optimization, the DNN architectures that represent the most meaningful tradeoffs between accuracy and FLOPs. We evaluate the effectiveness of PrototypeNAS on 12 different datasets in three different tasks: image classification, time series classification, and object detection. Our results demonstrate that PrototypeNAS is able to identify, within minutes, DNN models that are small enough to be deployed on off-the-shelf MCUs and still achieve accuracies comparable to those of large DNN models.
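
The third step, hypervolume subset selection, can be sketched for two objectives as follows. This is a minimal greedy approximation under stated assumptions: both objectives are minimized (accuracy is recast as error), the reference point and the candidate values are invented for illustration, and the abstract does not say which selection algorithm PrototypeNAS actually uses.

```python
def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded by `ref`; both objectives minimized."""
    # Keep only points that improve on the reference point in both objectives.
    pts = sorted(p for p in points if p[0] < ref[0] and p[1] < ref[1])
    area, best_y = 0.0, ref[1]
    for x, y in pts:  # sweep in increasing order of the first objective
        if y < best_y:  # not dominated by an earlier point: adds a new strip
            area += (ref[0] - x) * (best_y - y)
            best_y = y
    return area


def greedy_hss(front, k, ref):
    """Greedily pick up to k points that maximize the covered hypervolume."""
    selected, remaining = [], list(front)
    for _ in range(min(k, len(remaining))):
        best = max(remaining, key=lambda p: hypervolume_2d(selected + [p], ref))
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical Pareto front of (error rate, MFLOPs) candidates.
front = [(0.10, 9.0), (0.12, 5.0), (0.20, 2.0), (0.35, 1.0)]
reference = (0.5, 10.0)
chosen = greedy_hss(front, k=2, ref=reference)
```

With these invented numbers the greedy pass first picks `(0.20, 2.0)`, the single point covering the most area, and then `(0.12, 5.0)`, which adds the most new area. The extreme points of the front are skipped because each gains little in one objective at a large cost in the other.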

2603.09420 2026-06-19 cs.CV cs.AI cs.RO 版本更新

Class-Incremental Motion Forecasting

类别增量运动预测

Nicolas Schischka, Nikhil Gosala, B Ravi Kiran, Senthil Yogamani, Abhinav Valada

发表机构 * Department of Computer Science, University of Freiburg, Germany(弗赖堡大学计算机科学系) Qualcomm SARL France(法国Qualcomm SARL) Automated Driving, Qualcomm Technologies, Inc.(Qualcomm Technologies, Inc. 自动驾驶部门)

AI总结 提出类别增量运动预测新任务,通过端到端框架结合伪标签与开放词汇分割,利用3D-2D投票机制和查询特征方差重放策略,缓解灾难性遗忘并适应新类别。

Comments V3: Change title. Add further experiments

详情
AI中文摘要

运动预测使自动驾驶车辆能够通过预测动态智能体的未来轨迹来预判场景演化。然而,现有方法通常假设一个封闭世界设定,具有固定的对象分类法并依赖高质量感知,限制了其在现实世界中的应用,因为现实世界中感知不完美,且新对象类别可能随时间出现。在这项工作中,我们引入了类别增量运动预测,这是一个新颖的设定,其中新对象类别随时间顺序引入,并且直接从相机图像预测未来对象轨迹。我们提出了首个针对该设定的端到端框架,该框架适应新引入的类别,同时减轻对先前学习类别的灾难性遗忘。我们的方法为已知类别生成运动预测伪标签,并将其与开放词汇分割模型的2D实例掩码进行匹配。这种3D到2D关键点投票机制过滤不一致和过度自信的预测,而基于查询特征方差的重放策略采样信息丰富的过去序列以保留先验知识。在nuScenes和Argoverse 2上的广泛评估表明,我们的方法成功地在已知类别上保持性能,同时有效适应新类别。我们进一步展示了向真实世界驾驶的零样本迁移,并表明该框架自然地扩展到nuScenes和NeuroNCAP上的开环和闭环端到端类别增量规划。代码和模型将在 https://omen.cs.uni-freiburg.de 上公开。

英文摘要

Motion forecasting enables autonomous vehicles to anticipate scene evolution by predicting the future trajectories of dynamic agents. However, existing approaches typically assume a closed-world setting with a fixed object taxonomy and access to high-quality perception, limiting their applicability in the real world where perception is imperfect, and new object classes may emerge over time. In this work, we introduce class-incremental motion forecasting, a novel setting in which new object classes are sequentially introduced over time and future object trajectories are predicted directly from camera images. We propose the first end-to-end framework for this setting, which adapts to newly introduced classes while mitigating catastrophic forgetting of previously learned ones. Our method generates motion forecasting pseudo-labels for known classes and matches them with 2D instance masks from an open-vocabulary segmentation model. This 3D-to-2D keypoint voting mechanism filters inconsistent and overconfident predictions, while a query feature variance-based replay strategy samples informative past sequences to preserve prior knowledge. Extensive evaluations on nuScenes and Argoverse 2 show that our approach successfully preserves performance on known classes while effectively adapting to novel ones. We further demonstrate zero-shot transfer to real-world driving and show that the framework extends naturally to open- and closed-loop end-to-end class-incremental planning on nuScenes and NeuroNCAP. Code and models will be made publicly available at https://omen.cs.uni-freiburg.de.

2602.05533 2026-06-19 cs.AI 版本更新

Conditional Diffusion Guidance under Hard Constraint: A Stochastic Analysis Approach

硬约束下的条件扩散引导:一种随机分析方法

Zhengyi Guo, Wenpin Tang, Renyuan Xu

发表机构 * Department of Industrial Engineering and Operations Research, Columbia University(哥伦比亚大学工业工程与运营管理系) Department of Management Science and Engineering, Stanford University(斯坦福大学管理科学与工程系)

AI总结 提出基于Doob h-变换和鞅表示的条件扩散引导框架,通过鞅损失和鞅协方差损失学习条件函数梯度,确保硬约束满足并给出非渐近保证。

详情
AI中文摘要

我们研究了扩散模型中在硬约束下的条件生成,其中生成的样本必须以概率1满足预设事件。这类约束在安全关键应用和稀有事件模拟中自然出现,而软或基于奖励的引导方法无法保证约束满足。基于扩散模型的概率解释,我们利用Doob h-变换、鞅表示和二次变差过程,开发了一个原则性的条件扩散引导框架。具体地,得到的引导动力学通过涉及条件函数对数梯度的显式漂移校正来增强预训练扩散,而不修改预训练得分网络。利用鞅和二次变差恒等式,我们提出了两种新的离策略学习算法,基于鞅损失和鞅协方差损失,仅使用预训练模型的轨迹来估计h及其梯度。我们为得到的条件采样器在总变差和Wasserstein距离下提供了非渐近保证,明确刻画了得分近似和引导估计误差的影响。数值实验证明了所提方法在强制硬约束和生成稀有事件样本方面的有效性。数值实验的代码可在 https://github.com/ZhengyiGuo2002/CDG_Finance 找到。

英文摘要

We study conditional generation in diffusion models under hard constraints, where generated samples must satisfy prescribed events with probability one. Such constraints arise naturally in safety-critical applications and in rare-event simulation, where soft or reward-based guidance methods offer no guarantee of constraint satisfaction. Building on a probabilistic interpretation of diffusion models, we develop a principled conditional diffusion guidance framework based on Doob's h-transform, martingale representation and quadratic variation process. Specifically, the resulting guided dynamics augment a pretrained diffusion with an explicit drift correction involving the logarithmic gradient of a conditioning function, without modifying the pretrained score network. Leveraging martingale and quadratic-variation identities, we propose two novel off-policy learning algorithms based on a martingale loss and a martingale-covariation loss to estimate h and its gradient using only trajectories from the pretrained model. We provide non-asymptotic guarantees for the resulting conditional sampler in both total variation and Wasserstein distances, explicitly characterizing the impact of score approximation and guidance estimation errors. Numerical experiments demonstrate the effectiveness of the proposed methods in enforcing hard constraints and generating rare-event samples. The code of the numerical experiments can be found at https://github.com/ZhengyiGuo2002/CDG_Finance.
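
The drift correction mentioned above can be written out using the textbook form of Doob's h-transform. The notation below (drift $b$, diffusion coefficient $\sigma$, constraint set $A$, terminal time $T$) is assumed for illustration and may differ from the paper's own symbols and setting.

```latex
% Pretrained (unconditioned) diffusion; notation assumed for illustration.
dX_t = b(t, X_t)\,dt + \sigma(t)\,dW_t, \qquad t \in [0, T].

% Conditioning function for the hard constraint \{X_T \in A\}.
h(t, x) = \mathbb{P}\left(X_T \in A \mid X_t = x\right).

% Guided dynamics: the same diffusion with an extra drift term in \nabla \log h.
dX_t = \left[\, b(t, X_t) + \sigma(t)\sigma(t)^{\top}\,\nabla_x \log h(t, X_t) \,\right] dt
       + \sigma(t)\,dW_t.
```

Under the pretrained dynamics, $h(t, X_t)$ is a martingale by the tower property of conditional expectation, which is the kind of identity a martingale-based loss can exploit to estimate $h$ from unconditioned trajectories alone.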

2603.07236 2026-06-19 cs.CV 版本更新

HY-WU (Part I): An Extensible Functional Neural Memory Framework and An Instantiation in Text-Guided Image Editing

HY-WU (第一部分): 一种可扩展的功能性神经记忆框架及其在文本引导图像编辑中的应用

Mengxuan Wu, Xuanlei Zhao, Ziqiao Wang, Ruicheng Feng, Zhangyang Wang, Kai Wang

发表机构 * Tencent HY Team(腾讯 HY 团队)

AI总结 提出HY-WU框架,通过功能性神经记忆模块即时生成实例特定权重更新,避免共享权重覆盖导致的干扰,解决持续学习与个性化中的灾难性遗忘问题。

Abstract

Foundation models are transitioning from offline predictors to deployed systems expected to operate over long time horizons. In real deployments, objectives are not fixed: domains drift, user preferences evolve, and new tasks appear after the model has shipped. This elevates continual learning and instant personalization from optional features to core architectural requirements. Yet most adaptation pipelines still follow a static weight paradigm: after training (or after any adaptation step), inference executes a single parameter vector regardless of user intent, domain, or instance-specific constraints. This treats the trained or adapted model as a single point in parameter space. In heterogeneous and continually evolving regimes, distinct objectives can induce separated feasible regions over parameters, forcing any single shared update into compromise, interference, or overspecialization. As a result, continual learning and personalization are often implemented as repeated overwriting of shared weights, risking degradation of previously learned behaviors. We propose HY-WU (Weight Unleashing), a memory-first adaptation framework that shifts adaptation pressure away from overwriting a single shared parameter point. HY-WU implements functional (operator-level) memory as a neural module: a generator that synthesizes weight updates on-the-fly from the instance condition, yielding instance-specific operators without test-time optimization.
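
The idea of generating a weight update from an instance condition, rather than overwriting shared weights, can be illustrated with a toy sketch. This is not the HY-WU architecture, which the abstract does not specify: the rank-1 update, the names `BASE_W`, `GEN_U`, `GEN_V`, `generate_update`, and `forward`, and all numeric values are hypothetical.

```python
# Toy sketch only: a frozen shared weight matrix plus a generator that maps
# an instance condition to a rank-1 weight update. All names and values are
# hypothetical and are not taken from the HY-WU paper.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

BASE_W = [[1.0, 0.0], [0.0, 1.0]]   # shared weights, never overwritten

# Generator parameters: two linear maps from the condition to factors u and v.
GEN_U = [[0.1, 0.0], [0.0, 0.1]]
GEN_V = [[0.0, 1.0], [1.0, 0.0]]

def generate_update(condition):
    """Synthesize a rank-1 update u v^T from the instance condition."""
    u = matvec(GEN_U, condition)
    v = matvec(GEN_V, condition)
    return [[ui * vj for vj in v] for ui in u]

def forward(x, condition):
    """Apply the instance-specific operator BASE_W + delta to input x."""
    delta = generate_update(condition)
    weights = [[b + d for b, d in zip(b_row, d_row)]
               for b_row, d_row in zip(BASE_W, delta)]
    return matvec(weights, x)

out_a = forward([1.0, 1.0], [1.0, 0.0])   # operator for condition A
out_b = forward([1.0, 1.0], [0.0, 1.0])   # operator for condition B
```

In this sketch the two conditions yield different operators while `BASE_W` stays unchanged, which is the property the abstract contrasts with repeated overwriting of shared weights.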

2602.20573 2026-06-19 cs.LG Updated version

MolGraphBench: A Benchmark of GNN Architectures for Molecular Regression Tasks

Rajan, Ishaan Gupta

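AI summary Proposes the MolGraphBench benchmark, which compares four GNN models on molecular regression tasks, finds GCN and GIN to be the best-performing architectures, and argues that the GNN layer type should be treated as a tunable hyperparameter.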

Comments 14 pages, 5 figures and 4 tables

Abstract

Molecules are often represented as SMILES strings, which can be readily converted to hand-crafted descriptors or fingerprints (FP) for molecular property prediction. Research has demonstrated that SMILES can be converted to molecular graphs $G = (V, E)$, with atoms as nodes $(V)$ and bonds as edges $(E)$. These molecular graphs can subsequently be used to train graph neural network (GNN) models. Despite the recent surge in the application of GNNs (existing and novel architectures) to molecular property prediction, a rigorous benchmark is still lacking. We propose MolGraphBench, a comprehensive benchmark of four commonly used GNN models for molecular property prediction. Benchmarking results identify graph convolutional networks (GCN) and graph isomorphism networks (GIN) as the optimal GNN architectures for molecular graph regression tasks, based on absolute performance, training efficiency, transfer learning, and prediction quality. The study also indicates the non-complementary nature of molecular fingerprints in the fusion (GNN-FP) framework. Furthermore, our GNN models achieved performance superior or comparable to current state-of-the-art GNN baselines across three datasets (GCN with RMSE of $0.518$ on B3DB, GIN-FP with RMSE of $1.022$ on FreeSolv, and GIN with MAE of $63.783$ on the RT dataset). Findings from this study indicate that the type of GNN layer should be treated as a tunable hyperparameter rather than a fixed design choice to achieve superior performance.
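
To make the graph representation $G = (V, E)$ concrete, here is a minimal sketch in plain Python. It assumes a hand-built graph for ethanol (SMILES `CCO`, heavy atoms only) rather than a real SMILES parser, and it applies one symmetric-normalized neighborhood aggregation step of the kind used in standard GCN layers, with no learned weights or nonlinearity. The function names and the choice of atomic number as the only node feature are illustrative assumptions, not the benchmark's setup.

```python
import math

# Hypothetical hand-built graph for ethanol (SMILES "CCO"), heavy atoms only.
atoms = ["C", "C", "O"]        # nodes V
bonds = [(0, 1), (1, 2)]       # undirected edges E

def normalized_adjacency(num_nodes, edges):
    """Return D^{-1/2} (A + I) D^{-1/2} as a list of rows."""
    # Start from the identity so that every node has a self-loop.
    adj = [[1.0 if i == j else 0.0 for j in range(num_nodes)]
           for i in range(num_nodes)]
    for i, j in edges:
        adj[i][j] = 1.0
        adj[j][i] = 1.0
    degree = [sum(row) for row in adj]
    return [[adj[i][j] / math.sqrt(degree[i] * degree[j])
             for j in range(num_nodes)]
            for i in range(num_nodes)]

def propagate(a_hat, features):
    """One aggregation step: each node mixes its own and its neighbors' features."""
    n, dim = len(a_hat), len(features[0])
    return [[sum(a_hat[i][k] * features[k][d] for k in range(n))
             for d in range(dim)]
            for i in range(n)]

# Illustrative node feature: atomic number only.
ATOMIC_NUMBER = {"C": 6.0, "O": 8.0}
features = [[ATOMIC_NUMBER[symbol]] for symbol in atoms]

a_hat = normalized_adjacency(len(atoms), bonds)
hidden = propagate(a_hat, features)
```

Swapping the aggregation rule in `propagate` (for example, to an unnormalized sum followed by an MLP, as in GIN) is the kind of layer-type choice the abstract suggests treating as a hyperparameter.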

2603.04219 2026-06-19 cs.SD cs.AI eess.AS Updated version

ZeSTA: Zero-Shot TTS Augmentation with Domain-Conditioned Training for Data-Efficient Personalized Speech Synthesis

Youngwon Choi, Jinwoo Oh, Hwayeon Kim, Hyeonyu Kim

Affiliations * Maum AI Inc. Humelo Inc.

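AI summary Proposes the ZeSTA framework, which distinguishes real from synthetic speech with a lightweight domain embedding and combines it with real-data oversampling, improving speaker similarity for zero-shot TTS augmentation under extremely low-resource conditions while preserving intelligibility and perceptual quality.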

Comments 6 pages, accepted to INTERSPEECH 2026

Abstract

We investigate the use of zero-shot text-to-speech (ZS-TTS) as a data augmentation source for low-resource personalized speech synthesis. While synthetic augmentation can provide linguistically rich and phonetically diverse speech, naively mixing large amounts of synthetic speech with limited real recordings often leads to speaker similarity degradation during fine-tuning. To address this issue, we propose ZeSTA, a simple domain-conditioned training framework that distinguishes real and synthetic speech via a lightweight domain embedding, combined with real-data oversampling to stabilize adaptation under extremely limited target data, without modifying the base architecture. Experiments on LibriTTS and an in-house dataset with two ZS-TTS sources demonstrate that our approach improves speaker similarity over naive synthetic augmentation while preserving intelligibility and perceptual quality. Audio samples are available on our web page.
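
The data-side part of this recipe, tagging each utterance with a domain id and oversampling the real recordings, can be sketched as follows. The abstract does not give the oversampling ratio or the data format, so `build_training_set`, `target_real_fraction`, the domain ids, and the utterance names below are all hypothetical.

```python
import math

REAL, SYNTHETIC = 0, 1   # hypothetical domain ids that would index a domain embedding

def build_training_set(real, synthetic, target_real_fraction=0.5):
    """Tag utterances with a domain id and repeat real ones to reach a target share.

    The target fraction is an illustrative assumption, not a value from the paper.
    """
    if not real:
        raise ValueError("at least one real utterance is required")
    if not 0.0 < target_real_fraction < 1.0:
        raise ValueError("target_real_fraction must lie strictly between 0 and 1")
    # Smallest repeat count k such that k*|real| / (k*|real| + |synthetic|) >= target.
    needed = (target_real_fraction * len(synthetic)
              / ((1.0 - target_real_fraction) * len(real)))
    repeats = max(1, math.ceil(needed))
    items = [(utt, REAL) for utt in real] * repeats
    items += [(utt, SYNTHETIC) for utt in synthetic]
    return items

real_utts = ["real_001", "real_002"]
synth_utts = [f"synth_{i:03d}" for i in range(10)]
train_items = build_training_set(real_utts, synth_utts, target_real_fraction=0.5)
real_count = sum(1 for _, domain in train_items if domain == REAL)
```

During training, the domain id of each item would select the corresponding domain embedding; at inference time, one would presumably condition on the real-speech id.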