arXivDaily arXiv每日学术速递 周一至周五更新
全部学科分类 2094
热门方向导航
2606.20174 2026-06-19 cs.LG 新提交

Computational Methods and Challenges in Cell-Free DNA Analysis for Multi-Cancer Early Detection

基于无细胞DNA分析的多癌早期检测的计算方法与挑战

Nicko Starkey, Marcin W. Wojewodzic, Krzysztof Rzecki

发表机构 * AGH University of Krakow(AGH克拉科夫大学) Norwegian Institute of Public Health(挪威公共卫生研究所)

AI总结 综述2022-2025年cfDNA多癌早期检测的计算方法,重点分析片段组学和表观遗传特征提取技术,指出多模态集成方法最具临床整合潜力,但需标准化评估协议。

详情
AI中文摘要

无细胞DNA(cfDNA)是非侵入性多癌早期检测(MCED)的一个有前景的途径,因为它可以通过单次抽血同时检测多种癌症,尤其对目前缺乏既定筛查程序的癌症具有敏感性。本文综述了2022年至2025年间基于cfDNA的MCED计算方法。我们重点关注如何提取和分析片段组学和表观遗传特征以在早期阶段检测癌症。我们首先简要概述cfDNA信号的生物学基础,然后回顾经典的统计和机器学习方法以及深度学习框架,包括基于自编码器的模型。对于每种方法,我们讨论其生物学可解释性、验证策略以及临床整合的准备情况。此外,我们将当前挑战分为技术、计算和方法论三类,并概述该领域的开放问题。本综述表明,多模态集成方法在临床整合方面具有最强的前景和最高的准备度。然而,为了更好地评估未来工作和进行并排比较,标准化评估协议和报告结果至关重要。

英文摘要

Cell-free DNA (cfDNA) is a promising avenue for non-invasive multicancer early detection (MCED), in that, it can enable multiple cancer detection simultaneously from a single blood draw, with particular sensitivity to cancers that currently lack established screening programs. Here we review the computational methods developed between 2022 and 2025 for cfDNA-based MCED. We focus on how fragmentomics and epigenetic features are extracted and analyzed to detect cancer at early stages. We first briefly outline the biological basis of cfDNA signals, then review classical statistical and machine learning approaches alongside deep learning frameworks including autoencoder-based models. For each method we discuss biological interpretability, validation strategy, and readiness for clinical integration. Furthermore, we categorize the current challenges into technical, computational, and methodological while outlining open problems in the field. This review shows that multimodal ensemble approaches have the strongest promise for clinical integration and the highest readiness. However, for better assessment of future work and side-by-side comparison, standardization of evaluation protocols and reporting results will be crucial.

2606.20172 2026-06-19 cs.LG 新提交

Predicting gestational age at birth in the context of preterm birth from multi-modal fetal MRI

基于多模态胎儿MRI预测早产背景下的出生胎龄

Diego Fajardo-Rojas, Megan Hall, Daniel Cromb, Mary A. Rutherford, Lisa Story, Emma C. Robinson, Jana Hutter

发表机构 * Leibniz University Hannover(莱布尼茨汉诺威大学)

AI总结 提出结合多模态胎儿MRI和机器学习流程预测出生胎龄,包括数据插补、特征选择和回归模型,在333例对照和93例早产数据上评估,R²=0.13,MAE=2.74周,准确率0.77。

Comments Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) https://melba-journal.org/2026:013

Journal ref Machine.Learning.for.Biomedical.Imaging. 2026 (2026)

详情
AI中文摘要

早产与高死亡率和终身发病风险相关。复杂的多因素病因阻碍了准确预测和最佳护理。我们开发并评估了一个包含定制机器学习方法的流程,用于数据插补、特征选择和回归模型,以从333例对照和93例早产病例的综合多模态形态和功能胎儿MRI数据预测出生胎龄。将出生胎龄预测分为足月和早产类别,并报告其准确性、敏感性和特异性。进行了消融研究以进一步验证流程设计。使用分层10折交叉验证评估性能。该流程实现了0.13的R²分数和2.74周的平均绝对误差。在交叉验证中,准确率为0.77,敏感性为0.59,特异性为0.82。流程选择的主要特征包括宫颈长度和基于胎盘T2*值的统计量。快速、运动鲁棒的多模态胎儿MRI技术与机器学习预测的结合使得能够预测出生胎龄。这些信息对任何妊娠都至关重要。据我们所知,早产在文献中仅作为分类问题处理。因此,这项工作提供了概念验证。未来工作将增加队列规模,以允许在早产队列内进行更精细的分层。我们的代码可在以下网址获取:此https URL。

英文摘要

Preterm birth is associated with significant mortality and a risk for lifelong morbidity. The complex multifactorial aetiology hampers accurate prediction and thus optimal care. A pipeline consisting of bespoke machine learning methods for data imputation, feature selection, and regression models to predict gestational age (GA) at birth was developed and evaluated from comprehensive multi-modal morphological and functional fetal MRI data from 333 control cases and 93 preterm birth cases. The GA at birth predictions were classified into term and preterm categories and their accuracy, sensitivity, and specificity were reported. An ablation study was performed to further validate the design of the pipeline. Performance was evaluated using stratified 10-fold cross-validation. The pipeline achieves an R2 score of 0.13 and a mean absolute error of 2.74 weeks. It also achieves a 0.77 accuracy, 0.59 sensitivity, and 0.82 specificity across folds. The predominant features selected by the pipeline include cervical length and statistics derived from placental T2* values. The confluence of fast, motion-robust and multi-modal fetal MRI techniques and machine learning prediction allowed the prediction of the gestation at birth. This information is essential for any pregnancy. To the best of our knowledge, preterm birth had only been addressed as a classification problem in the literature. Therefore, this work provides a proof of concept. Future work will increase the cohort size to allow for finer stratification within the preterm birth cohort. Our code is available at https://github.com/dfajardorojas/ml-for-preterm-birth-.

2606.20167 2026-06-19 cs.LG 新提交

Multi-Modal Contrastive Learning for Implicit Earth Embeddings via Location Tying

多模态对比学习用于基于位置绑定的隐式地球嵌入

Jonathan Hecht, Lukas Arzoumanidis, Ziyue Li, Youness Dehbi

发表机构 * Computational Methods Lab, HafenCity University Hamburg(汉堡港城大学计算方法实验室) Dept. of Operations & Technology, Technical University of Munich(慕尼黑工业大学运营与技术系;海尔布隆数据科学中心;慕尼黑数据科学研究所) Heilbronn Data Science Center(波恩大学大地测量与地理信息研究所) Munich Data Science Institute Institute of Geodesy and Geoinformation, University of Bonn

AI总结 提出两种多模态对比学习架构MELT和SALT,通过位置绑定整合未配对地理数据,在四个下游任务中匹配最强双模态基线SATCLIP,但增加模态数未持续提升性能,表明位置编码器是主要瓶颈。

详情
AI中文摘要

空间预测任务通常受限于缺乏高质量标记的地面真值观测。为克服这一挑战,自监督预训练是一种可能的解决方案,其中对比学习在位置编码器中占主导地位。这些方法通常仅将地理坐标与一种额外模态对齐。我们提出了两种多模态对比学习架构:通过位置绑定的多模态嵌入(MELT)和顺序交替位置训练(SALT)。这些架构通过利用未配对的地理空间数据,将框架扩展到两种模态以上。两种方法在技术上均可行,并在四个下游任务中匹配了最强的双模态基线(SATCLIP)的性能。然而,增加模态数量并未持续提升性能,这表明所选的位置编码器是主要限制——对比目标在早期达到峰值,无论模态多样性或预训练量如何。MELT比SALT提供更稳定的训练,并为未来的扩展提供了更强的基础。

英文摘要

Spatial prediction tasks are often limited by a lack of high-quality labelled ground-truth observations. To overcome this challenge, self-supervised pre-training is a possible solution, with contrastive learning dominant for location encoders. Those approaches usually align geographic coordinates with just one additional modality. We propose two multimodal contrastive learning architectures: Multimodal Embedding via Location Tying (MELT) and Sequential Alternating Location Training (SALT). These architectures expand this framework beyond two modalities by utilising unpaired geospatial data. Both methods are technically viable and match the performance of the strongest two-modality baseline (SATCLIP) across four downstream tasks. However, increasing the number of modalities does not consistently improve performance, suggesting that the chosen location encoder is the main limitation - the contrastive objective reaches its peak early, regardless of modality diversity or pre-training volume. MELT provides more stable training than SALT and presents a stronger foundation for future scaling.

2606.20164 2026-06-19 cs.CL cs.AI cs.LG q-bio.QM 新提交

MedRLM: Recursive Multimodal Health Intelligence for Long-Context Clinical Reasoning, Sensor-Guided Screening, Evidence-Grounded Decision Support, and Community-to-Tertiary Referral Optimization

MedRLM:用于长上下文临床推理、传感器引导筛查、证据支持决策及社区到三级转诊优化的递归多模态健康智能

Aueaphum Aueawatthanaphisut

发表机构 * School of Information, Computer Communication Technology Sirindhorn International Institute of Technology, Thammasat University Pathum Thani, Thailand 1

AI总结 提出MedRLM递归多模态健康智能框架,通过递归检查、分解、检索、验证和合成患者信息,协调多个专业代理并引入临床证据图记忆,实现长上下文临床推理和传感器引导筛查。

Comments 9 pages, 3 figures, 3 tables, 1 Algorithm, 29 equations

详情
AI中文摘要

现实世界的临床决策支持需要对异质性和纵向的患者信息进行推理,而不是回答孤立的医学问题。然而,当前的医学大语言模型和检索增强生成系统通常依赖单步提示或检索,当临床证据分布在长电子健康记录、医学图像、传感器流、指南和转诊约束中时,这可能变得脆弱。本文提出MedRLM,一个用于长上下文临床推理、传感器引导筛查和社区到三级转诊支持的递归多模态健康智能框架。MedRLM不是将所有患者信息压缩到一个提示中,而是将患者病例视为一个外部临床环境,可以递归地检查、分解、检索、验证和综合。该框架协调了专门用于临床文本、纵向EHR、医学影像、生理传感器信号、指南检索、不确定性审计和转诊规划的代理。它进一步引入了临床证据图记忆,将患者特定的观察结果与检索到的证据、标准化定义、传感器衍生的生物标志物和转诊标准连接起来。传感器引导的递归触发机制在检测到异常生理或行为模式时激活更深层次的推理,而不确定性门控细化支持临床医生对高风险或低置信度病例的审查。我们还概述了一个使用公共和经认证的临床数据集(涵盖EHR、放射学、ECG、ICU时间序列和转诊代理结果)的真实数据评估设计。MedRLM旨在将医学AI从静态问答转向可审计、多模态和流程感知的临床决策支持。

英文摘要

Real-world clinical decision support requires reasoning over heterogeneous and longitudinal patient information rather than answering isolated medical questions. However, current medical large language models and retrieval-augmented generation systems often rely on single-step prompting or retrieval, which can be fragile when clinical evidence is distributed across long electronic health records, medical images, sensor streams, guidelines, and referral constraints. This paper proposes MedRLM, a Recursive Multimodal Health Intelligence framework for long-context clinical reasoning, sensor-guided screening, and community-to-tertiary referral support. Instead of compressing all patient information into one prompt, MedRLM treats the patient case as an external clinical environment that can be recursively inspected, decomposed, retrieved, verified, and synthesized. The framework coordinates specialized agents for clinical text, longitudinal EHR, medical imaging, physiological sensor signals, guideline retrieval, uncertainty auditing, and referral planning. It further introduces a Clinical Evidence Graph Memory to connect patient-specific observations with retrieved evidence, standardized definitions, sensor-derived biomarkers, and referral criteria. A sensor-guided recursive triggering mechanism activates deeper reasoning when abnormal physiological or behavioral patterns are detected, while uncertainty-gated refinement supports clinician review for high-risk or low-confidence cases. We also outline a real-data evaluation design using public and credentialed clinical datasets spanning EHR, radiology, ECG, ICU time series, and referral-proxy outcomes. MedRLM aims to move medical AI from static question answering toward auditable, multimodal, and workflow-aware clinical decision support.

2606.20161 2026-06-19 cs.CV 新提交

ARTEMIS: Agent-guided Reliability-aware Temporal Mask Evolution for Imperfectly Supervised Video Polyp Segmentation

ARTEMIS: 基于智能体引导的可靠性感知时间掩码演化用于不完美监督的视频息肉分割

Tong Wang, Siwen Wang, Yaolei Qi, Jinxing Zhou, Yuting He, Guanyu Yang, Yutong Xie

发表机构 * Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Southeast University, Ministry of Education(东南大学教育部新一代人工智能技术及其跨学科应用重点实验室) Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)(穆罕默德·本·扎耶德人工智能大学) School of Medicine, Case Western Reserve University(凯斯西储大学医学院)

AI总结 提出ARTEMIS框架,利用视觉语言智能体选择可靠时间锚点,结合SAM2传播和可靠性感知鲁棒学习,从不完美监督(点、涂鸦、少量密集标签)中学习高质量视频息肉分割掩码,在多个基准上达到最优性能。

详情
AI中文摘要

不完美监督的视频息肉分割(VPS)旨在从廉价监督中学习密集、时间一致的掩码,包括弱标注(点、涂鸦)和少量密集标注帧的半监督。该设置具有临床价值,但由于弱对比、模糊边界、运动模糊和镜面高光,加上稀疏的像素级指导,具有挑战性。虽然SAM2可以从稀疏输入生成密集掩码,但直接伪标签通常会产生几何退化的掩码,存在边界泄漏,未充分利用时间一致性,并忽略可靠性。为解决这些问题,我们提出ARTEMIS,一个由智能体引导的可靠性感知时间掩码演化驱动的统一框架,用于不完美监督的VPS。ARTEMIS从可用监督初始化粗掩码:SAM2转换点/涂鸦,而密集标签作为可靠锚点。一个辩论-判断视觉语言智能体在弱监督下选择可靠的时间锚点,这些锚点通过SAM2双向传播以细化不可靠或未标注的帧。最后,ARTEMIS使用时间可靠性感知鲁棒学习训练分割器,结合可靠性引导的参考选择、参考原型传输模块和可靠性感知鲁棒损失。这些组件评估掩码可靠性,随时间演化锚点,跨帧传输目标身份,并降低噪声监督的权重而非丢弃困难样本。在SUN-SEG和CVC-ClinicDB-612上的涂鸦、点和有限标签设置下的实验表明,ARTEMIS达到了最先进的性能。代码将在此https URL发布。

英文摘要

Imperfectly supervised video polyp segmentation (VPS) aims to learn dense, temporally consistent masks from inexpensive supervision, including weak annotations (points, scribbles) and semi-supervision with few densely labeled frames. This setting is clinically valuable but challenging due to weak contrast, ambiguous boundaries, motion blur, and specular highlights, compounded by sparse pixel-level guidance. While SAM2 can generate dense masks from sparse inputs, direct pseudo-labeling often yields geometry-degraded masks with boundary leakage, underutilizes temporal consistency, and ignores reliability. To address these issues, we propose ARTEMIS, a unified framework for imperfectly supervised VPS driven by agent-guided reliability-aware temporal mask evolution. ARTEMIS initializes coarse masks from available supervision: SAM2 converts points/scribbles, while dense labels serve as reliable anchors. A debate-and-judge vision-language agent selects reliable temporal anchors under weak supervision, which are propagated bidirectionally with SAM2 to refine unreliable or unlabeled frames. Finally, ARTEMIS trains the segmenter using temporal reliability-aware robust learning, incorporating reliability-guided reference selection, a Reference Prototype Transport Module, and reliability-aware robust loss. These components assess mask reliability, evolve anchors over time, transport target identity across frames, and down-weight noisy supervision instead of discarding difficult samples. Experiments on SUN-SEG and CVC-ClinicDB-612 under scribble, point, and limited-label settings demonstrate that ARTEMIS achieves state-of-the-art performance. Code will be released at https://github.com/wangtong627/ARTEMIS.

2606.20156 2026-06-19 cs.AI 新提交

Modularity-Free Conflict-Averse Training for Generalized PINNs

面向广义PINN的无模块化冲突规避训练

Heejo Kong, Beomchul Park, Sung-Jin Kim, Seong-Whan Lee

发表机构 * Department of Brain and Cognitive Engineering, Korea University(韩国大学脑与认知工程系) Department of Artificial Intelligence, Korea University(韩国大学人工智能系)

AI总结 针对过参数化PINN因功能模块化导致冲突规避优化失效的问题,提出ModSync框架,通过惩罚任务专属连接并保留交互路径,实现结构优化与冲突规避训练的融合,在多种PDE基准上达到最先进精度。

Comments Accepted by ICASSP 2026

详情
AI中文摘要

物理信息神经网络(PINN)通过将物理定律嵌入可微目标,已成为求解偏微分方程的强大框架。尽管取得了进展,训练PINN仍然脆弱:最近的冲突规避优化方案缓解了残差损失和边界损失之间的梯度干扰,但我们表明,随着模型容量的增加,其有效性会下降。在本文中,我们识别了一种容量诱导的失效模式,其中过参数化网络经历功能模块化,自我划分为任务专属模块,抑制跨目标交互并阻碍向帕累托驻点收敛。为解决此问题,我们提出了一种新颖框架——模块稀疏同步(ModSync),通过惩罚任务专属连接同时保留促进交互的路径,将结构优化整合到冲突规避训练中。跨多种PDE基准的大量实验表明,ModSync持续防止容量驱动的失败,维持稳健的跨目标耦合,并实现了最先进的精度。代码可在\url{this https URL}获取。

英文摘要

Physics-informed neural networks (PINNs) have become a powerful framework for solving PDEs by embedding physical laws into differentiable objectives. Despite their advances, training PINNs remains fragile: recent conflict-averse optimization schemes alleviate gradient interference between residual and boundary losses, but we show that their effectiveness deteriorates as model capacity increases. In this paper, we identify a capacity-induced failure mode, where overparameterized networks undergo functional modularity, self-partitioning into task-exclusive modules that suppress cross-objective interaction and hinder convergence toward Pareto-stationary points. To address this issue, we propose a novel framework, Modular-Sparsity Synchronization (ModSync), which integrates structural optimization into conflict-averse training by penalizing task-exclusive connections while preserving interaction-promoting pathways. Extensive experiments across diverse PDE benchmarks demonstrate that ModSync consistently prevents capacity-driven failures, sustains robust cross-objective coupling, and achieves state-of-the-art accuracy. Codes are available at \url{https://github.com/heejokong/ModSync}.

2606.20155 2026-06-19 cs.CV cs.CL 新提交

NAMESAKES: Probing Identity Memorization in Text-to-Image Models

NAMESAKES: 探究文本到图像模型中的身份记忆

Morris Alper, Vasudha Varadarajan, Moran Yanuka, Angelina Wang, Hadar Averbuch-Elor

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Tel Aviv University(特拉维夫大学) Cornell University(康奈尔大学)

AI总结 提出一种黑盒行为探针,无需参考照片或训练数据,即可区分文本到图像模型生成的图像是记忆还是虚构,并在NAMESAKES数据集上验证其有效性。

详情
AI中文摘要

文本到图像(T2I)模型在提示其姓名时,会生成某些个体的逼真肖像,这引发了隐私问题。然而,区分生成的面孔是记忆还是虚构的,目前需要真实照片、训练数据访问权限或模型内部的白盒访问,限制了适用性。我们引入了一种完全黑盒的行为探针,可以在无需参考照片或事先了解训练数据的情况下区分这两种情况。为了基准测试这一任务,我们提出了NAMESAKES数据集,包含一千多个不同知名度水平的公众人物的姓名和面孔,以及经过扰动的、知名度较低的姓名。对最先进的T2I模型的实验表明,我们的探针能够显著预测身份记忆,并将记忆的姓名与未识别的姓名区分开来,并进一步揭示了不同模型系列之间的差异。

英文摘要

Text-to-image (T2I) models generate realistic likenesses of some individuals when prompted with their names, raising privacy concerns. However, distinguishing whether a generated face is memorized or fabricated currently requires ground-truth photos, access to training data, or white-box access to model internals, limiting applicability. We introduce a fully black-box behavioral probe that distinguishes between these regimes while requiring no reference photos or prior knowledge of training data. To benchmark this task, we present the NAMESAKES dataset of over one thousand names and faces of public figures spanning a wide range of fame levels, along with perturbed, less famous names. Experiments on state-of-the-art T2I models show that our probe substantially predicts identity memorization and separates memorized from unrecognized names, with further insights into differences across model families.

2606.20150 2026-06-19 cs.RO 新提交

Robust Assembly State Reasoning from Action Recognition for Human-Robot Collaboration

面向人机协作的基于动作识别的鲁棒装配状态推理

James Fant-Male, Roel Pieters

发表机构 * Cognitive Robotics group, Unit of Automation Technology and Mechanical Engineering, Tampere University(坦佩雷大学自动化技术与机械工程系认知机器人组)

AI总结 研究从动作识别输入跟踪装配状态的方法,比较逻辑、HMM和神经网络方法,发现最优方法因任务而异,逻辑方法在多变场景更鲁棒。

Comments Preprint accepted to the 35th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN 2026). 8 pages, 9 figures, 3 tables

详情
AI中文摘要

人类动作识别(HAR)在人机协作(HRC)研究中经常被用于理解已执行的动作以及协作任务的状态。然而,从HAR准确跟踪装配状态尚未得到充分研究,并且在现实场景中并非易事。本研究系统性地调查并比较了使用动作识别输入跟踪装配状态的方法。使用两个不同数据集和五种状态跟踪方法(包括基于逻辑的、隐马尔可夫模型(HMM)和神经网络(NN)方法)进行的调查表明,最优方法在不同任务中并不统一,并且不同方法在不同情况下会失败。测试使用具有不同噪声水平的模拟输入和来自HAR模型的真实输入进行。结果表明,NN和HMM方法在变异性有限的任务中表现良好,但在其他场景中,基于逻辑的方法可能更鲁棒。对于没有额外传感的重复动作任务,建模预期动作持续时间的方法也很重要。

英文摘要

Human Action Recognition (HAR) is frequently investigated in Human-Robot Collaboration (HRC) research to understand what actions have been performed and hence the state of a collaborative task. Accurately tracking an assembly state from HAR is however not fully investigated, and in realistic scenarios is not a trivial task. This research systematically investigates and compares methods for tracking assembly state using action recognition inputs. Investigations using two diverse datasets and five state tracking approaches, including logic-based, Hidden Markov Model (HMM), and neural network (NN) methods, show that optimal approaches are not uniform across different tasks and that different methods fail under different circumstances. Testing is performed using both simulated inputs with varying noise levels and realistic inputs from a HAR model. Results show NN and HMM methods can perform well in tasks with limited variability, but for other scenarios logic-based approaches can be more robust. Methods which model expected action duration are also important for tasks with repeated actions where no additional sensing is provided.

2606.20146 2026-06-19 cs.AI 新提交

BIM-Edit: Benchmarking Large Language Models for IFC-Based Building Information Modeling

BIM-Edit:基于IFC的建筑信息模型的大语言模型基准测试

Bharathi Kannan Nithyanantham, Clemens Kujat, Tobias Sesterhenn, Stefan Telgmann, Jörn Plönnigs, Stefan Lüdtke, Christian Bartelt

发表机构 * University of Rostock(罗斯托克大学) Clausthal University of Technology(克劳斯塔尔工业大学)

AI总结 提出BIM-Edit基准,评估大语言模型在IFC格式建筑信息模型上的自然语言编辑能力,涵盖324个任务,最佳模型平均得分仅49.5%,揭示当前能力与工程需求间的差距。

详情
AI中文摘要

大语言模型(LLMs)越来越多地被应用于计算机辅助设计(CAD),以从文本指令生成设计工件。在工程实践中,这需要的不仅仅是创建新的几何体,模型还必须理解现有场景,正确编辑它们,并保留语义和关系。然而,许多CAD基准侧重于创建新模型而非编辑现有模型,并且主要评估几何正确性。我们引入了BIM-Edit,这是一个用于评估LLMs在行业基础类(IFC)格式表示的建筑信息模型(BIM)上进行自然语言编辑的基准。BIM提供了一个具有挑战性的测试平台,因为建筑模型将几何体与语义和关系结构编码在一起。BIM-Edit包含324个编辑任务,涵盖11个真实建筑模型和36个合成场景。任务使用三种指令类别——直接、空间和拓扑——表达,涵盖显式编辑和场景接地编辑。我们沿三个维度评估输出:几何准确性、语义有效性和拓扑一致性。在评估的LLMs中,表现最佳的模型在三个指标上的平均得分仅为49.5%,且没有模型完全解决超过3.4%的任务。这些结果表明当前LLM能力与结构化工程设计工作流的要求之间存在巨大差距。

英文摘要

Large language models (LLMs) are increasingly applied to computer-aided design (CAD) to generate design artifacts from textual instructions. In engineering practice, this requires more than creating new geometry, models must also understand existing scenes, edit them correctly, and preserve semantics and relations. However, many CAD benchmarks focus on creating new models rather than editing existing ones, and mostly evaluate geometric correctness. We introduce BIM-Edit, a benchmark for evaluating LLMs on natural-language editing of Building Information Models (BIM) represented in the Industry Foundation Classes (IFC) format. BIM provides a challenging testbed because building models encode geometry together with semantic and relational structure. BIM-Edit contains 324 editing tasks spanning 11 realistic building models and 36 synthetic scenes. Tasks are expressed using three instruction categories - direct, spatial, and topological - covering both explicit and scene-grounded edits. We evaluate outputs along three dimensions: geometric accuracy, semantic validity, and topological consistency. Across evaluated LLMs, the best-performing model achieves only 49.5% average score across the three metrics, and no model fully solves more than 3.4% of tasks. These results demonstrate a substantial gap between current LLM capabilities and the requirements of structured engineering design workflows.

2606.20143 2026-06-19 cs.CV 新提交

HEad and neCK TumOR (HECKTOR) 2025: Benchmark of Segmentation, Diagnosis, and Prognosis in Multimodal PET/CT

头颈肿瘤 (HECKTOR) 2025 挑战赛:多模态 PET/CT 中的分割、诊断与预后基准

Numan Saeed, Salma Hassan, Shahad Hardan, Lishan Cai, Xinglong Liang, Moona Mazher, Abdul Qayyum, Yansong Bu, Mengye Lyu, Yue Lin, Mingyuan Meng, Chuanyi Huang, Lisheng Wang, Dalal Chamseddine, Shamimeh Ahrari, Beining Wu, Yifei Chen, Fuyou Mao, Hao Zhang, Baixiang Zhao, Surajit Ray, Muzi Guo, Lei Xiang, Jakob Dexl, Michael Ingrisch, Adrien Depeursinge, Arman Rahmim, Mathieu Hatt, Vincent Andrearczyk, Mohammad Yaqub

发表机构 * Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)(穆罕默德·本·扎耶德人工智能大学) Amsterdam UMC(阿姆斯特丹大学医学中心) The Netherlands Cancer Institute(荷兰癌症研究所) Radboud University Medical Centre(拉德堡德大学医学中心) University College London(伦敦大学学院) Imperial College London(帝国理工学院) Shenzhen Technology University(深圳技术大学) Shenzhen University(深圳大学) Newland Digital Technology(新大陆数字技术) The University of Sydney(悉尼大学) Shanghai Jiao Tong University(上海交通大学) University Hospital, Nantes(南特大学医院) Nantes Université, Centrale Nantes, CNRS, LS2N(南特大学、南特中央理工学院、法国国家科学研究中心、LS2N实验室) Hangzhou Dianzi University(杭州电子科技大学) Tsinghua University(清华大学) Central South University(中南大学) University of Glasgow(格拉斯哥大学) China Mobile System Integration Co., Ltd.(中移系统集成有限公司) Subtle Medical Inc.(Subtle Medical公司) University Hospital, LMU Munich(慕尼黑大学医院) Munich Center for Machine Learning(慕尼黑机器学习中心) BC Cancer Research Institute(不列颠哥伦比亚癌症研究所) HES-SO Valais-Wallis University of Applied Sciences and Arts(HES-SO瓦莱州应用科学与艺术大学) Lausanne University Hospital (CHUV)(洛桑大学医院) LaTIM, INSERM, UMR 1101, Univ Brest(LaTIM实验室、法国国家健康与医学研究院、UMR 1101、布雷斯特大学)

AI总结 HECKTOR 2025 挑战赛利用多模态 PET/CT 和电子健康记录,建立了头颈癌自动分析的基准,涵盖肿瘤分割、复发预测和 HPV 分类三个任务,最佳算法分别达到 Dice 0.75、C-index 0.66 和平衡准确率 0.56。

Comments 17 pages, 4 figures, 4 tables. Overview paper for the HECKTOR 2025 challenge, held as a satellite event at MICCAI 2025. Challenge website: https://hecktor.grand-challenge.org/

详情
AI中文摘要

头颈癌 (HNC) 构成显著的全球健康负担,准确的肿瘤勾画对于有效的放疗计划至关重要。口咽部解剖结构的复杂性,加上肿瘤在影像上的异质性表现,使得手动分割耗时且存在观察者间差异。除分割外,从非侵入性影像预测长期临床结局(如无复发生存期 RFS)和确定人乳头瘤病毒 (HPV) 状态,仍然是具有挑战性但临床价值高的目标。HECKTOR 2025 挑战赛通过使用多模态 PET/CT 影像和电子健康记录,建立了一个用于自动 HNC 分析的全面基准。基于前几届(2020-2022),本次挑战赛采用了扩展的多机构数据集,包含来自全球 10 个中心的 1100 多名患者。参与者需完成三个互补目标:(1) 分割原发肿瘤体积 (GTVp) 和转移淋巴结 (GTVn),(2) 预测无复发生存期,(3) 分类 HPV 状态。挑战赛吸引了 35 个注册团队,其中 15 个最终提交在保留测试集上进行了评估。表现最佳的算法在分割上达到平均 Dice 相似系数 0.75,在生存预测上达到一致性指数 0.66,在 HPV 分类上达到平衡准确率 0.56。本文对所提交的方法进行了全面分析,评估了它们在不同病变特征上的性能,并讨论了它们在自动化肿瘤学工作流程和决策支持系统中临床转化的意义。

英文摘要

Head and neck cancers (HNC) represent a significant global health burden, with accurate tumor delineation being essential for effective radiotherapy planning. The complexity of the oropharyngeal anatomy, combined with the heterogeneous appearance of tumors on imaging, makes manual segmentation time-intensive and subject to inter-observer variability. Beyond segmentation, predicting long-term clinical outcomes, such as recurrence-free survival (RFS), and determining human papillomavirus (HPV) status from noninvasive imaging, remain challenging yet clinically valuable goals. The HECKTOR 2025 challenge addresses these needs by establishing a comprehensive benchmark for automated HNC analysis using multimodal PET/CT imaging and electronic health records. Building on previous editions (2020-2022), this challenge features an expanded multi-institutional dataset comprising over 1,100 patients from 10 centers worldwide. Participants were tasked with three complementary objectives: (1) segmenting primary gross tumor volumes (GTVp) and metastatic lymph nodes (GTVn), (2) predicting recurrence-free survival, and (3) classifying HPV status. The challenge attracted 35 registered teams, with 15 final submissions evaluated on a held-out test set. Top-performing algorithms achieved a mean Dice similarity coefficient of 0.75 for segmentation, a concordance index of 0.66 for survival prediction, and a balanced accuracy of 0.56 for HPV classification. This paper presents a comprehensive analysis of the submitted methodologies, evaluates their performance across different lesion characteristics, and discusses their implications for clinical translation in automated oncology workflows and decision support systems.

2606.20140 2026-06-19 cs.CV 新提交

SA-VIS: Sparse frame Annotations for training Video Instance Segmentation

SA-VIS: 用于训练视频实例分割的稀疏帧标注

Edoardo Mello Rella, Ajad Chhatkuli, Shipra Jain, Ender Konukoglu, Luc Van Gool

发表机构 * CVL, ETH Zurich(计算机视觉实验室,苏黎世联邦理工学院) Align Technology VISICS, KU Leuven(VISICS,鲁汶大学) INSAIT, Sofia(INSAIT,索非亚)

AI总结 提出稀疏帧标注的SA-VIS方法,通过过去帧特征传播模块利用低维特征,在仅使用1/5标注帧时性能仅下降0.4%,显著降低标注成本。

详情
AI中文摘要

最近的在线视频实例分割(VIS)方法取得了令人印象深刻的结果,因此成为视频中实例分割的首选方法。尽管令人印象深刻的单图像模型(例如基于SAM的模型)重新兴起,但在线(或半在线)VIS方法通过在训练期间使用长序列的密集标注帧,优于单图像模型。然而,这种VIS的训练设置在计算和所需密集标注方面成本高昂。为了解决这些主要缺陷,我们认为实例及其在视频中的演变的有效建模并不需要密集标注的帧。为此,我们提出了一个简单有效的模块,称为过去帧特征传播(PFP),它聚合来自多个帧的图像编码器的低维特征。这个简单的低计算量模块为使用稀疏视频帧标签进行端到端训练提供了巨大的学习能力。结合轻量级的帧特定实例查询,我们的稀疏帧标注VIS(SA-VIS)显著提高了其基线的性能。最有趣的是,我们避免复杂性的简单设计有效地弥合了在稀疏和密集标注视频序列上训练之间的精度差距。这意味着当仅使用数据集中1/5图像的标注时,SA-VIS的性能仅下降0.4%。实验上,SA-VIS在YouTube-VIS 2019/2021/2022和Occluded VIS(OVIS)上显示出相对于基线的强劲改进,并且在有限标注场景下,AP比最先进方法提高了1%以上。

英文摘要

Recent online video instance segmentation (VIS) methods have achieved impressive results, thus becoming the preferred approach to segment instances in videos. Despite the resurgence of impressive single image models, the online (or semi-online) VIS approaches outperform single-image models (e.g., based on SAM) by using long sequences of densely annotated frames during training. However,such a training setup of VIS is expensive in the sense of compute as well as dense annotations required. In order to solve these major flaws, we argue that the effective modeling of the instances and their evolution in videos do not require densely annotated frames. To that end, we propose a simple and effective module, called Past-frames Feature Propagation (PFP) which aggregates low-dimensional features from the image encoder of multiple frames. This simple low-compute module provides tremendous learning capability in using sparse video frame labels for end-to-end training. Combined with a light-weight frame-specific Instance Queries, our Sparse frame Annotation VIS (SA-VIS) significantly improves performance over its baseline. Most interestingly, our simple design that avoids complexities effectively bridges the gap in accuracy between training on sparsely and densely annotated video sequences. This translates to a mere 0.4% drop in performance of SA-VIS when using annotations for only 1/5 of the images in the dataset. Empirically, SA-VIS shows strong improvements over the baseline on YouTube-VIS 2019/2021/2022 and Occluded VIS (OVIS) and an over 1% improvement in AP on the state-of-the-art in a limited annotations scenario.

2606.20138 2026-06-19 cs.AI cs.CL cs.HC cs.LG 新提交

Learning to Prompt: Improving Student Engagement with Adaptive LLM-based High-School Tutoring

学习提示:基于自适应LLM的高中辅导提升学生参与度

Po-Chin Chang, Nicholas Hogan, Aske Plaat, Michiel T. van der Meer

发表机构 * Leiden University(莱顿大学) FutureWhiz

AI总结 提出一种基于14个教学特征的主题感知提示路由模型,通过模拟训练和在线A/B测试,在高中辅导中实现自适应策略切换,提高教学效率并减少交互轮次。

详情
AI中文摘要

LLMs可以个性化教育,尽管当前的静态提示辅导系统难以适应不同的学科。我们开发并测试了一个具有主题感知提示的系统,该系统基于从原始转录中提取的14个教学特征(例如,辅导支架、学生理解)。我们首先在模拟环境中训练一个提示路由模型,然后将其部署到实际高中学生的在线适应中。模拟基准测试显示,路由器的性能优于两个静态基线($0.694$ vs. $0.647$ 和 $0.64$, $p<0.001$)。A/B测试($N=656$ 次对话,来自359名学生)显示了从模拟到现实的迁移,其中模型从分析策略切换到支架学习策略。我们的自适应提示选择机制提高了教学效率,保持了教学质量,并减少了约3轮交互($p=0.007$)。虽然贪婪路由器的练习转化率与基线相当($19.1\%$ vs. $19.6\%$),但随机采样策略的随机路由器实现了更高的转化率($28.1\%$)。

英文摘要

LLMs can personalize education, although current static-prompt tutoring systems struggle to adapt to diverse academic disciplines. We develop and test a system with subject-aware prompting, based on 14 pedagogical features (e.g., tutor scaffolding, student understanding) extracted from raw transcripts. We first train a prompt routing model in a simulation environment, and then deploy it for online adaptation with actual high-school students. The simulation benchmark shows the router outperforming two static baselines ($0.694$ vs. $0.647$ and $0.64$, $p<0.001$). A/B testing ($N=656$ conversations from 359 students) shows sim-to-real transfer where the model switches from analytical to scaffolding learning strategies. Our adaptive prompt selection mechanism improves instructional efficiency, maintains pedagogical quality and reduces interactions by around 3 turns ($p=0.007$). While a greedy router achieves a comparable exercise conversion rate with the baseline ($19.1\%$ vs. $19.6\%$), a stochastic router that samples strategies leads to a higher conversion rate ($28.1\%$).

2606.20135 2026-06-19 cs.RO cs.AI 新提交

Frequency-Aware Flow Matching for Continuous and Consistent Robotic Action Generation

频率感知流匹配用于连续且一致的机器人动作生成

Jianing Guo, Fangzheng Chen, Zihao Mao, Wong Lik Hang Kenny, Zhenhong Wu, Yu Li, Yishuai Cai, Yuanpei Chen, Yikun Ban, Kai Chen, Qi Dou, Yaodong Yang, Xianglong Liu, Huijie Zhao, Simin Li

发表机构 * Beihang University(北京航空航天大学) Peking University(北京大学) The Chinese University of Hong Kong(香港中文大学) PKU-Psibot Lab(北大-智源机器人实验室) Zhongguancun Laboratory(中关村实验室) Hefei Comprehensive National Science Center(合肥综合性国家科学中心)

AI总结 提出频率感知流匹配(FAFM),通过离散余弦变换将离散动作序列转换到频域进行流匹配,并正则化一阶时间导数以生成平滑连续的动作,提升成功率、多模态表达性和运动平滑性。

详情
AI中文摘要

流匹配已成为机器人操作的标准范式,因为它与扩散策略等类似方法一样,对建模复杂的多模态动作分布具有很强的表达能力。然而,现有方法依赖于离散化的动作块,使得它们对以异构控制频率收集的演示数据脆弱,并且容易产生时间上不一致的动作,从而降低控制稳定性。在本文中,我们提出了频率感知流匹配(FAFM),它输出连续的、时间上一致的动作。为了处理异构频率输入,我们使用离散余弦变换(DCT)将离散动作序列转换到频域,对得到的系数进行流匹配,并通过余弦基展开重建连续动作。为了生成时间上一致的动作,我们对一阶时间导数进行正则化以促进平滑动作。这对应于一个Sobolev型约束,抑制高频误差并阻止突变的动作变化。我们的FAFM简单,不引入额外的网络参数,并且适用于独立的流匹配策略和视觉-语言动作模型。在合成玩具基准、避障、LapGym和LIBERO上,FAFM提高了成功率、多模态表达能力、运动平滑性、收敛速度、对机械偏差和混合频率输入的鲁棒性。这些优势在真实世界的Franka机器人上部署时保持一致。代码见此https URL。

英文摘要

Flow matching has emerged as a standard paradigm for robotic manipulation owing to its strong expressive power for modelling complex, multimodal action distributions, alongside similar approaches like diffusion policy. However, existing methods rely on discretized action chunks, making them brittle to demonstrations collected at heterogeneous control frequencies and prone to temporally inconsistent actions that degrade control stability. In this paper, we propose Frequency-Aware Flow Matching (FAFM), which outputs continuous, temporally consistent actions. To handle heterogeneous frequency input, we transform discrete action sequences into the frequency domain with the discrete cosine transform (DCT), perform flow matching over the resulting coefficients, and reconstruct continuous actions via cosine basis expansion. To generate temporally consistent actions, we regularize the first-order temporal derivative to promote smooth actions. This corresponds to a Sobolev-type constraint that suppresses high-frequency errors and discourages abrupt action changes. Our FAFM is simple, introduces no additional network parameters and applies to standalone flow-matching policies and vision-language action models. Across synthetic toy benchmark, obstacle avoidance, LapGym, and LIBERO, FAFM improves success rates, multimodal expressivity, motion smoothness, convergence speed, robustness to mechanical bias and mixed-frequency input. These gains are consistent when deployed on a real-world Franka robot. Code available at https://anonymous.4open.science/r/FAFM.

2606.20131 2026-06-19 cs.CV cs.GR 新提交

TriFlow: Generating Artist-Like 3D Mesh Topology via Nearest-Vertex Vector Fields

TriFlow: 通过最近顶点向量场生成类艺术家3D网格拓扑

Haoxuan Li, Ziya Erkoç, Daniele Sirigatti, Vladislav Rosov, Lei Li, Angela Dai, Matthias Nießner

发表机构 * Technical University of Munich(慕尼黑工业大学) AUDI AG(奥迪股份公司) University of Virginia(弗吉尼亚大学)

AI总结 提出TriFlow,一种基于最近顶点向量场(NVF)的生成方法,通过流匹配模型合成NVF并引导拓扑感知的网格简化,直接从输入几何条件生成紧凑且具有类艺术家拓扑的3D网格。

详情
AI中文摘要

我们提出了TriFlow,一种新的生成方法,能够直接从输入几何条件(如符号距离场)生成具有类艺术家三角形拓扑的紧凑3D网格。我们的关键见解是将网格拓扑表示为在表面上定义的最近顶点向量场(NVF),其中每个点编码其在局部重心坐标系中与最近三角形顶点的关联。我们训练一个潜在流匹配模型来合成该场,从而实现基于输入几何条件的拓扑生成。为了提取连贯的网格,我们使用生成的NVF对表面区域进行聚类,并引导具有拓扑感知优化的约束二次误差度量(QEM)网格简化。这产生了与输入几何紧密匹配且具有结构化、类艺术家连接性的输出网格。实验表明,与最先进的基于学习方法相比,TriFlow实现了更强的泛化能力和显著提高的拓扑质量,同时Chamfer距离降低了90%,速度提升了8倍。

英文摘要

We present TriFlow, a new generative approach for producing compact 3D meshes with artist-like triangle topology directly from input geometry conditions such as signed distance fields. Our key insight is to represent mesh topology as a nearest-vertex vector field (NVF) defined over the surface, where each point encodes its association to the nearest triangle vertex in the local barycentric frame. We train a latent flow-matching model to synthesize this field, enabling topology generation conditioned on the input geometry. To extract a coherent mesh, we cluster surface regions using the generated NVF and guide a constrained quadric error metric (QEM) mesh simplification with topology-aware optimization. This yields output meshes that closely match the input geometry while exhibiting structured, artist-like connectivity. Experiments demonstrate that TriFlow achieves stronger generalization and significantly improved topology quality compared to state-of-the-art learning-based approaches, alongside 90% lower Chamfer Distance and an 8x speedup.

2606.20130 2026-06-19 cs.CV 新提交

SAM3 Self-Distillation for Fine-Grained GOOSE 2D Semantic Segmentation

SAM3自蒸馏用于细粒度GOOSE 2D语义分割

Xuesong Wang

发表机构 * Wayne State University(韦恩州立大学)

AI总结 提出基于SAM3图像编码器与轻量解码器的分割模型,通过自蒸馏、多尺度测试增强和光度畸变迁移,在GOOSE 2D挑战赛达69.73% mIoU。

Comments 4th place in ICRA 2026 GOOSE 2D Semantic Segmentation Challenge

详情
AI中文摘要

我们描述了在ICRA 2026 GOOSE 2D细粒度语义分割挑战赛中获得第四名的方案,该方案在官方1815张图像测试集上达到了69.73%的复合平均交并比(mIoU)。我们的模型适配了近期视觉基础模型Segment Anything Model 3(SAM3)的图像编码器,并搭配轻量级解码器。除此之外,我们贡献了两项技术和一项经验发现:(i)一种自蒸馏方案,该方案重新利用SAM3本身,以真实边界框作为提示,在SAM3性能优于我们自身模型的类别上充当教师;(ii)一种图像级多尺度测试时增强方案,通过重新缩放图像而非模型输入,为固定输入尺寸的模型恢复多尺度推理;(iii)一项发现:来自2025年GOOSE 2D获胜方案的一种激进光度畸变,移植到我们的流程中,是单一最大的改进来源。

英文摘要

We describe our 4th-place entry to the ICRA 2026 GOOSE 2D Fine-Grained Semantic Segmentation Challenge, which reached a composite mean Intersection-over-Union (mIoU) of 69.73% on the official 1,815-image test set. Our model adapts the image encoder of a recent visual foundation model, Segment Anything Model 3 (SAM3), with a lightweight decoder. Beyond this, we contribute two techniques and one empirical finding: (i) a self-distillation scheme that re-uses SAM3 itself, prompted with ground-truth boxes, as a teacher on the classes where it outperforms our own model; (ii) an image-level multi-scale test-time augmentation scheme that restores multi-scale inference for a fixed-input-size model by rescaling the image rather than the model input; and (iii) the finding that an aggressive photometric distortion from a winning 2025 GOOSE 2D entry, transplanted onto our pipeline, is its single largest source of improvement.

2606.20122 2026-06-19 cs.AI cs.MA 新提交

ScaffoldAgent: Utility-Guided Dynamic Outline Optimization for Open-Ended Deep Research

ScaffoldAgent: 面向开放式深度研究的效用引导动态大纲优化

Zhibang Yang, Xinke Jiang, Yuzhen Xiao, Ruizhe Zhang, Yue Fang, XinFei Wan, Zhengxing Song, Yuxuan Liu, Yuheng Huang, Xu Chu, Junfeng Zhao, Yasha Wang

发表机构 * National Engineering Research Center of Software Engineering, Peking University(北京大学软件工程国家工程研究中心) School of Computer Science, Peking University(北京大学计算机学院) Key Laboratory of High Confidence Software Technologies, Ministry of Education(教育部高可信软件技术重点实验室) GRG Banking Equipment Co., Ltd.(广电运通金融电子股份有限公司) Center on Frontiers of Computing Studies, Peking University(北京大学计算前沿研究中心) Peking University Information Technology Institute (Tianjin Binhai)(北京大学(天津滨海)信息技术研究院)

AI总结 提出ScaffoldAgent框架,通过效用引导的动态大纲优化(扩展、收缩、修订操作)解决开放式深度研究中大纲漂移问题,在DeepResearch Bench和Gym上提升长报告生成与事实准确性。

Comments 9 pages, 6 figures

详情
AI中文摘要

开放式深度研究(OEDR)要求系统通过多轮检索获取知识并生成连贯的长篇报告。大纲作为协调检索、证据组织和生成的结构性支架起着核心作用。然而,现有方法要么在写作前固定大纲,要么使用局部启发式方法进行优化,导致在持续信息积累下出现大纲漂移,且评估大纲修改的反馈延迟。我们提出ScaffoldAgent,一种面向OEDR的效用引导动态大纲优化框架。ScaffoldAgent将大纲演化建模为结构化决策过程,包含三种操作:扩展、收缩和修订,从而实现对报告支架的受控更新。它进一步引入效用引导的反馈机制,通过检索增益、结构连贯性和试生成质量来估计每个大纲操作的下游价值。得到的效用信号指导推理过程中的节点选择、操作调度和终止。在DeepResearch Bench和DeepResearch Gym上的实验表明,ScaffoldAgent在长报告生成和事实基础上持续优于现有的深度研究智能体。

英文摘要

Open-ended deep research (OEDR) requires systems to acquire knowledge through multi-round retrieval and generate coherent long-form reports. The outline plays a central role as a structural scaffold that coordinates retrieval, evidence organization, and generation. However, existing methods either fix the outline before writing or refine it with local heuristics, leading to scaffold drift under continuous information accumulation and delayed feedback for evaluating outline modifications. We propose ScaffoldAgent, a utility-guided dynamic outline optimization framework for OEDR. ScaffoldAgent models outline evolution as a structured decision process with three operations: Expansion, Contraction, and Revision, enabling controlled updates to the report scaffold. It further introduces a utility-guided feedback mechanism that estimates the downstream value of each outline operation from retrieval gain, structural coherence, and trial-generation quality. The resulting utility signal guides node selection, operation scheduling, and termination during inference. Experiments on DeepResearch Bench and DeepResearch Gym show that ScaffoldAgent consistently improves long-form report generation and factual grounding over existing deep research agents.

2606.20113 2026-06-19 cs.CL cs.IR 新提交

When Does Streaming Tool Use Help? Characterizing Tool-Intent Stabilization in Streaming Retrieval-Augmented Generation

流式工具使用何时有帮助?表征流式检索增强生成中的工具意图稳定化

Elroy Galbraith

发表机构 * SMG Labs(SMG实验室)

AI总结 通过测量工具意图稳定化(即推测查询收敛到答案的时间点),在CRAG基准上分析流式RAG的延迟隐藏效果,发现73.9%的查询可实现显著延迟隐藏,并识别早期与晚期稳定化的预测因素。

详情
AI中文摘要

流式检索增强生成(Streaming RAG)通过在用户输入完成前并行发出工具查询来减少用户感知的延迟。报告的性能提升是聚合性的,但该机制的好处本质上是查询内在的:只有当正确的工具查询在用户停止说话或打字之前变得可确定时,推测才有帮助。我们隔离并测量了这一属性——工具意图稳定化,即输入流中推测查询的检索收敛到包含答案的结果的时间点。在CRAG基准(1371个验证问题)上,我们(i)测量了稳定化的分布,(ii)推导出一个与模型无关的界限H,表示可以隐藏在用户剩余输入背后的工具延迟比例,该比例是工具延迟L和输入节奏δ的函数,(iii)通过一个工作流式管道验证了实际节省达到或超过此界限,(iv)识别了哪些查询属性预测早期与晚期稳定化。该研究无需模型训练,在普通CPU硬件上运行。我们发现,在现实操作点(L=600ms,δ=3w/s,θ=0.8)下,整个基准中73.9%的查询实现了显著的延迟隐藏——这一混合数字结合了21.3%的问题(其中黄金证据以原文形式存在且可被BM25检索)上的充分稳定化(在此有利切片上95.2%可流式处理)以及其余问题上的无基础top-1稳定化回退。在有利切片上,φ_suf被精确和宽松基础限定在[0.26, 0.281]之间——两者均为早期。问题类型产生了显著但粗略的早期/晚期划分(Kruskal-Wallis p=0.017, epsilon^2=0.04),直接指导了何时学习到的推测触发器值得其成本。

英文摘要

Streaming Retrieval-Augmented Generation (Streaming RAG) reduces user-perceived latency by issuing tool queries in parallel with ongoing user input, before the utterance is complete. Reported gains are aggregate, yet the mechanism's benefit is fundamentally query-intrinsic: speculation can only help when the correct tool query becomes determinable before the user stops speaking or typing. We isolate and measure this property -- tool-intent stabilization, the point in the input stream at which a speculative query's retrieval converges to the answer-bearing result. On the CRAG benchmark (1371 validation questions) we (i) measure the distribution of stabilization, (ii) derive a model-agnostic bound H on the portion of tool latency that can be hidden behind the user's remaining input, as a function of tool latency L and input cadence δ, (iii) validate against a working streaming pipeline that realized savings meet or exceed this bound, and (iv) identify which query properties predict early versus late stabilization. The study requires no model training and runs on commodity CPU hardware. We find that at a realistic operating point (L=600ms, δ=3w/s, θ=0.8), 73.9% of queries across the full benchmark admit substantial latency hiding -- a blended figure that mixes sufficiency stabilization on the 21.3% of questions where gold evidence is verbatim-present and BM25-retrievable (95.2% streamable on this favorable slice) with a grounding-free top-1-settling fallback on the remainder. On the favorable slice, ϕ_suf is bracketed to [0.26, 0.281] by exact and relaxed grounding -- both early. Question type produces a significant but coarse early/late split (Kruskal-Wallis p=0.017, epsilon^2=0.04), directly informing when a learned speculative trigger is worth its cost.

2606.20112 2026-06-19 cs.CV eess.IV 新提交

Pixel-Level Residual Diffusion Transformer: Scalable 3D CT Volume Generation

像素级残差扩散Transformer:可扩展的3D CT体生成

Zhenkai Zhang, Markus Hiller, Krista A. Ehinger, Tom Drummond

发表机构 * School of Computing and Information Systems, The University of Melbourne(墨尔本大学计算与信息系统学院)

AI总结 提出像素级残差扩散Transformer(PRDiT),通过两阶段训练(局部MLP盲估计器分离低频结构+全局残差扩散Transformer建模高频残差)实现高保真3D CT体生成,在LIDC-IDRI和RAD-ChestCT数据集上优于现有方法。

Comments Accepted at ICLR 2026. Code available at https://github.com/Fredy-Zhang/PRDiT

详情
AI中文摘要

由于现有生成模型固有的巨大计算需求和优化困难,生成具有精细细节的高分辨率3D CT体仍然具有挑战性。在本文中,我们提出了像素级残差扩散Transformer(PRDiT),这是一种可扩展的生成框架,可直接在体素级别合成高质量的3D医学体。PRDiT引入了一个两阶段训练架构,包括:1)一个局部去噪器,形式为基于MLP的盲估计器,作用于重叠的3D块,以有效分离低频结构;2)一个全局残差扩散Transformer,采用内存高效注意力来建模和细化整个体上的高频残差。这种从粗到细的建模策略简化了优化,增强了训练稳定性,并有效保留了细微结构,而无需自编码器瓶颈。在LIDC-IDRI和RAD-ChestCT数据集上进行的大量实验表明,PRDiT始终优于最先进的模型,如HA-GAN、3D LDM和WDM-3D,在3D FID、MMD和Wasserstein距离指标上显著降低。

英文摘要

Generating high-resolution 3D CT volumes with fine details remains challenging due to substantial computational demands and optimization difficulties inherent to existing generative models. In this paper, we propose the Pixel-Level Residual Diffusion Transformer (PRDiT), a scalable generative framework that synthesizes high-quality 3D medical volumes directly at voxel-level. PRDiT introduces a two-stage training architecture comprising 1) a local denoiser in the form of an MLP-based blind estimator operating on overlapping 3D patches to separate low-frequency structures efficiently, and 2) a global residual diffusion transformer employing memory-efficient attention to model and refine high-frequency residuals across entire volumes. This coarse-to-fine modeling strategy simplifies optimization, enhances training stability, and effectively preserves subtle structures without the limitations of an autoencoder bottleneck. Extensive experiments conducted on the LIDC-IDRI and RAD-ChestCT datasets demonstrate that PRDiT consistently outperforms state-of-the-art models, such as HA-GAN, 3D LDM and WDM-3D, achieving significantly lower 3D FID, MMD and Wasserstein distance scores.

2606.20110 2026-06-19 cs.CV 新提交

FrozenDrive: Zero-Shot Text-Guided Driving Scene Generation and Data Augmentation with Parameter-Free Frozen Diffusion Model

FrozenDrive: 零样本文本引导驾驶场景生成与数据增强的无参数冻结扩散模型

Yuhwan Jeong, Hyeonseong Kim, Daehyun We, Seonkyu Song, Jinnyeong Yang, Hyun-Kurl Jang, Youngho Yoon, Kuk-Jin Yoon

发表机构 * KAIST, Visual Intelligence Lab(韩国科学技术院视觉智能实验室)

AI总结 提出FrozenDrive框架,利用冻结的预训练扩散模型,通过知识保留的时空注意力实现多视图一致性和时间连贯性,无需微调即可生成恶劣天气下的驾驶场景,提升自动驾驶模型鲁棒性。

Comments Accepted to ECCV 2026

详情
AI中文摘要

自动驾驶的合成数据正在激增,这得益于扩散模型能够实现可扩展的场景生成。然而,关键障碍依然存在,因为强制执行多视图和时间一致性通常依赖于骨干网络微调或添加层,这会侵蚀预训练知识并削弱文本对齐。模型也保持接近训练分布,在恶劣天气和未见配置下表现不佳,并且保真度偏向频繁类别而非稀有类别。我们通过FrozenDrive解决这些差距,这是一个可控生成框架,在保持预训练扩散模型知识的同时实现强一致性。FrozenDrive以丰富的驾驶堆栈信号和文本提示为条件,并引入知识保留的时空注意力,在无参数的冻结扩散骨干中单次通过时施加跨视图对齐和时间连贯性。额外的对象聚焦约束提高了稀有类别的每个对象保真度。无需任何天气或场景特定的微调,我们的模型从文本合成全局连贯的多视图驾驶场景,特别是在恶劣和稀有条件下,并超越了先前的基线。在nuScenes上,FrozenDrive增强数据显著提升了AD模型的性能,尤其是在夜间和雨天,当使用我们的场景定向数据训练时,展示了更强的鲁棒性。

英文摘要

Synthetic data for autonomous driving is surging, powered by diffusion models that promise scalable scene generation. Yet key obstacles remain, as enforcing multi-view and temporal consistency often relies on backbone fine-tuning or added layers, which erodes pre-trained knowledge and weakens text alignment. Models also stay close to the training distribution, struggling under adverse weather and unseen configurations, and fidelity favors frequent over rare classes. We address these gaps with FrozenDrive, a controllable generative framework that preserves a pretrained diffusion models knowledge while achieving strong consistency. FrozenDrive conditions on rich driving-stack signals and text prompts, and introduces knowledge-preserving spatio-temporal attention to impose cross-view alignment and temporal coherence in a single pass within a parameter-free frozen diffusion backbone. An additional object-focused constraint improves per-object fidelity for rare categories. Without any weather- or scene-specific fine-tuning, our model synthesizes globally coherent multi-view driving scenes from text, particularly under adverse and rare conditions, and surpasses prior baselines. On nuScenes, FrozenDrive augmented data significantly improves AD models performance, especially at night and in rain, demonstrating stronger robustness when trained with our scenario-targeted data.

2606.20108 2026-06-19 cs.CV cs.LG 新提交

EFIQA: Explainable Fundus Image Quality Assessment via Anatomical Priors

EFIQA: 基于解剖先验的可解释眼底图像质量评估

Pengwei Wang, José Morano, Qian Wan, Hrvoje Bogunović

发表机构 * Institute of Artificial Intelligence, Center for Medical Data Science, Medical University of Vienna, Austria(维也纳医科大学医学数据科学中心人工智能研究所) Christian Doppler Lab for Artificial Intelligence in Retina, Medical University of Vienna, Austria(维也纳医科大学视网膜人工智能克里斯蒂安·多普勒实验室)

AI总结 提出无需质量标签的EFIQA框架,利用解剖先验通过掩膜解剖修复学习正常结构,生成空间质量图,在多个基准上超越监督方法,兼具可解释性。

Comments Accepted in MIDL 2026. Code: https://github.com/penway/EFIQA

Journal ref Proceedings of Machine Learning Research 315:2248-2264, 2026

详情
AI中文摘要

图像质量控制对于广泛的下游应用至关重要。基于深度学习的图像质量评估方法通常根据数据集特定的质量标签训练分类器,这继承了两种局限性:(1)泛化能力受限于训练集的标注标准;(2)这些方法无法提供质量下降的空间反馈,缺乏可解释性。在这项工作中,我们提出了EFIQA,一个无需质量相关监督的框架,并通过设计生成空间质量图。EFIQA不是从人工标注的标签中学习“什么是退化”,而是通过利用解剖先验来学习“应该有什么”。对于眼底摄影,我们将其实例化为两阶段方法:首先通过掩膜解剖修复训练无监督异常检测器,以识别缺失血管区域;然后将这一先验知识蒸馏到一个浅层适配器中,将冻结基础模型的特征映射到精确的质量图。外部数据集评估表明,这种无需标签且只需最小适配的方法,在不同质量标准的基准上,与监督方法相比,实现了更好的性能和可解释性,突显了其在现实应用中的潜力。

英文摘要

Image quality control is vital for a wide range of downstream applications. Deep learning-based image quality assessment methods typically train classifiers on dataset-specific quality labels, inheriting two limitations: (1) generalization is tied to the labeling criteria of the training set and (2) these methods cannot provide spatial feedback on where the quality is degraded, lacking explainability. In this work, we propose EFIQA, a framework that requires no quality-related supervision and produces spatial quality maps by design. Rather than learning ``what is degradation" from human-annotated labels, EFIQA learns ``what should be there" by leveraging anatomical priors. For fundus photography, we instantiate this as a two-stage approach, by first training an unsupervised anomaly detector via masked anatomical inpainting to identify regions of missing vasculature, and then distilling this prior knowledge into a shallow adapter mapping features of a frozen foundation model to precise quality maps. External-dataset evaluation demonstrates that this label-free approach with minimal adaptation achieves better performance and explainability compared with supervised methods across benchmarks with different quality criteria, highlighting its potential for real-world applications.

2606.20107 2026-06-19 cs.LG 新提交

Quantile of Means: A Bonus-Free Ensemble Method for Minimax Optimal Reinforcement Learning

均值分位数:一种用于最小最大最优强化学习的无奖励集成方法

Asaf Cassel, Aviv Rosenberg

发表机构 * Google Research(谷歌研究院)

AI总结 提出一种基于分位数的集成方法,无需计数即可在有限时域MDP中实现最优方差依赖的遗憾界,为强化学习中的集成探索提供理论依据。

详情
AI中文摘要

最优强化学习算法通常依赖于精心构造的基于计数的不确定性估计来驱动探索。尽管理论上合理,但这类估计在实际设置中难以计算,因此对设计探索启发式方法提供的见解有限。与此同时,集成方法已成为一种实用的方法,但仍缺乏理论证明。基于最近一种用于多臂赌博机的集成方法,我们提出了一种用于有限时域马尔可夫决策过程(MDP)的基于分位数的集成方法。我们这种简单的无计数方法实现了最优方差依赖的遗憾界,为强化学习中的集成探索提供了理论基础。

英文摘要

Optimal Reinforcement Learning (RL) algorithms typically rely on carefully constructed count-based uncertainty estimates to drive exploration. Although theoretically sound, such estimates are hard to compute in practical settings and therefore offer limited insight for designing exploration heuristics. Meanwhile, ensembling has emerged as a practical approach, but remains without theoretical justification. Building on a recent ensemble-based method for Multi-Armed Bandits, we propose a quantile-based ensemble method for finite-horizon Markov Decision Processes (MDPs). Our simple count-free approach achieves optimal variance-dependent regret bounds, providing theoretical grounding for ensemble-based exploration in RL.

2606.20104 2026-06-19 cs.LG cs.AI 新提交

Sensorimotor World Models: Perception for Action via Inverse Dynamics

传感器运动世界模型:通过逆动力学实现面向行动感知

Petr Ivashkov, Randall Balestriero, Bernhard Schölkopf

发表机构 * Max Planck Institute for Intelligent Systems(马克斯·普朗克智能系统研究所) Department of Computer Science, Brown University(布朗大学计算机科学系) ELLIS Institute(ELLIS研究所) ETH Zürich(苏黎世联邦理工学院)

AI总结 提出传感器运动世界模型(SMWM),通过逆动力学正则化端到端训练潜空间世界模型,防止表示崩溃并学习与行动对齐的紧凑表示,在2D和3D控制任务中实现竞争性规划性能。

详情
AI中文摘要

面向行动的感知表明,世界的表示不应仅由视觉保真度决定,而应由其与行动的相关性决定。同时,潜在的JEPA风格世界模型主张从高维观测中学习紧凑的预测状态以促进未来状态的预测,但这些模型的端到端训练并非易事,因为如果我们的唯一目标是构建易于预测的潜在状态,表示可能会崩溃。我们引入了一种传感器运动世界模型(SMWM):一种通过逆动力学正则化进行端到端训练的潜在世界模型。这一单一正则化解决了两个问题:它防止表示崩溃并诱导与行动对齐的表示。通过迫使潜在状态保留关于转换背后行动的信息,它使模型偏向于环境中可控的自由度,同时丢弃不可控的干扰因素。这产生了从离线、无奖励轨迹中训练的稳定潜在世界模型,无需冻结编码器、指数移动平均或复杂的潜在正则化。实验表明,SMWM学习了紧凑、可解释的潜在空间,并在简单的2D和3D控制任务中实现了竞争性的规划性能。

英文摘要

Perception for action suggests that representations of the world should be shaped not by visual fidelity alone, but by their relevance for actions. At the same time, latent JEPA-style world models advocate learning compact predictive states from high-dimensional observations to facilitate the prediction of future states, but end-to-end training of these models is nontrivial because representations may collapse if our only goal is to construct a latent state that is easy to predict. We introduce a sensorimotor world model (SMWM): a latent world model trained end-to-end with inverse dynamics regularization. This single regularizer addresses both issues: it prevents representation collapse and induces action-aligned representations. By forcing latent states to preserve information about the action underlying a transition, it biases the model toward the controllable degrees of freedom of the environment while discarding uncontrollable distractors. This yields stable latent world models trained from offline, reward-free trajectories, without frozen encoders, exponential moving averages, or complex latent regularizers. Empirically, SMWM learns compact, interpretable latent spaces and enables competitive planning performance across simple 2D and 3D control tasks.

2606.20103 2026-06-19 cs.CV 新提交

Geometry-Preserving in 3D Gaussian Splatting for LiDAR-Camera Extrinsic Calibration

3D高斯溅射中保持几何结构的LiDAR-相机外参标定

Kyoleen Kwak, Daeho Kim, Jeong Woon Lee, Hyoseok Hwang

发表机构 * Kyung Hee University(庆熙大学)

AI总结 针对LiDAR-相机标定中跨模态特征稀缺问题,提出通过多视图LiDAR深度监督和阻止光度梯度更新高斯空间参数来保持3DGS代理的度量几何,提升标定精度。

Comments Accepted to ECCV 2026. 15 pages (excluding references), 5 figures

详情
AI中文摘要

精确的LiDAR-相机标定对于鲁棒的多模态感知至关重要。无目标方法避免了手动设置,但仍受限于跨模态判别特征的稀缺性。最近的方法通过在可微模型中重建场景,通过密集光度监督实现外参优化。其中,3D高斯溅射(3DGS)被广泛用作几何代理,在单一可微框架内桥接LiDAR和相机。然而,由于3DGS最初是为新视图合成设计的,现有方法倾向于优先考虑渲染质量,导致代理几何偏离真实的LiDAR结构。我们提出了一种框架,通过聚合多视图LiDAR观测进行密集深度监督,并阻止光度梯度更新高斯空间参数,从而保持高斯代理的度量几何。我们在公开驾驶数据集上验证了该方法,在标定精度上持续优于现有无目标方法。

英文摘要

Accurate LiDAR-camera calibration is essential for robust multi-modal perception. Targetless approaches avoid manual setup but remain limited by the scarcity of discriminative cross-modal features. Recent methods address this by reconstructing the scene within a differentiable model, enabling extrinsic optimization through dense photometric supervision. Among these, 3D Gaussian Splatting (3DGS) has been widely adopted as a geometric proxy that bridges LiDAR and camera within a single differentiable framework. However, since 3DGS was originally designed for novel view synthesis, existing methods tend to prioritize rendering quality, causing the proxy geometry to drift from the true LiDAR structure. We propose a framework that preserves the metric geometry of the Gaussian proxy by aggregating multi-view LiDAR observations for dense depth supervision and blocking photometric gradients from updating the Gaussian spatial parameters. We validate our method on public driving datasets, where it consistently outperforms existing targetless methods in calibration accuracy.

2606.20101 2026-06-19 cs.SD cs.AI cs.MM 新提交

Hybrid Diffusion Transformer for Instruction-Guided Audio Editing via Rectified Flow

基于整流流的混合扩散变压器用于指令引导音频编辑

Liting Gao, Yonggang Zhu, Yaru Chen, Dongyu Wang, Shubin Zhang, Zhenbo Li, Jean-Yves Guillemaut, Wenwu Wang

发表机构 * Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey(萨里大学视觉、语音与信号处理中心) School of Artificial Intelligence, Beijing University of Posts and Telecommunications(北京邮电大学人工智能学院) Fisheries College, Ocean University of China(中国海洋大学水产学院) College of Information and Electrical Engineering, China Agricultural University(中国农业大学信息与电气工程学院)

AI总结 提出混合两阶段扩散变压器架构,通过粗到细策略平衡全局语义对齐与局部细节编辑,在重叠音频事件和复杂指令任务上提升性能与效率。

详情
AI中文摘要

音频编辑旨在根据自然语言指令修改现有音频剪辑中的特定内容,同时保留其余声学内容。尽管扩散模型取得了显著进展,但现有的基于训练的编辑方法主要依赖于卷积U-Net骨干中的局部归纳偏差和交叉注意力交互,这通常阻碍了长程语义对齐以及对指令的精确理解和定位。相比之下,扩散变压器提供了更强的全局建模和多模态融合,但现有的编辑架构通常采用MMDiT和DiT块的简单堆叠。在所有块中对拼接的音频和文本标记应用联合注意力会导致相对于标记长度的二次复杂度。为了平衡编辑性能和效率,我们提出了一种基于整流流匹配的混合两阶段扩散变压器架构,用于指令引导音频编辑。它在低分辨率阶段对音频和文本标记进行联合注意力以建立粗略的语义对齐,然后在高分辨率阶段切换到交替的联合注意力和交叉注意力块以细化编辑细节。这种从粗到细的策略实现了高效且准确的指令引导音频编辑。实验表明,所提出的框架在涉及重叠音频事件和复杂指令的具有挑战性的编辑任务上取得了显著的性能提升,同时通过紧凑模型大幅提高了编辑效率。

英文摘要

Audio editing aims to modify specific content in an existing audio clip according to a natural language instruction while preserving the remaining acoustic content. Despite the remarkable progress of diffusion models, existing training-based editing methods mainly rely on the local inductive biases and cross-attention interaction in convolutional U-Net backbones, which often hinder long-range semantic alignment and precise understanding and localization of instructions. In contrast, diffusion transformers provide stronger global modeling and multimodal fusion, but existing editing architectures usually adopt a simple stack of MMDiT and DiT blocks. Applying joint attention over concatenated audio and text tokens in all blocks results in quadratic complexity with respect to token length. To balance editing performance and efficiency, we propose a hybrid two-stage diffusion transformer architecture for instruction-guided audio editing based on rectified flow matching. It performs joint attention over audio and text tokens to establish coarse semantic alignment at low-resolution stage, then switches to alternating joint-attention and cross-attention blocks to refine editing details at high-resolution stage. This coarse-to-fine strategy enables efficient and accurate instruction-guided audio editing. Experiments show that the proposed framework achieves notable performance gains on challenging editing tasks involving overlapping audio events and complex instructions, while substantially improving editing efficiency with a compact model.

2606.20100 2026-06-19 cs.CV 新提交

WeGenBench: A Multidimensional Diagnostic Benchmark towards Text-to-Image Model Optimization

WeGenBench:面向文本到图像模型优化的多维诊断基准

Qian Liang, Xiaomin Li, Ying Zhang, Jia Xu, Lihao Ni, Hongrui Li, Jingjing Li, Jing Lyu, Chen Li

发表机构 * University of Electronic Science and Technology of China(电子科技大学) Dalian University of Technology(大连理工大学) Weixin, Tencent(腾讯微信)

AI总结 提出WeGenBench基准,包含4000个中英双语提示,通过场景分类和多维标签实现跨维度评估,并设计基于视觉语言模型的新颖指标,精准定位模型在特定生成类别中的缺陷。

详情
AI中文摘要

最近的文本到图像生成模型在仅从文本输入合成高度逼真的图像方面展现了卓越的能力。尽管现有基准可以在一定程度上评估各种模型的生成能力,但它们难以全面准确地衡量多个维度的性能,往往无法揭示模型在特定类别中的固有缺陷。为了解决这些局限性,我们提出了WeGenBench,一个新颖的基准,旨在对文本到图像生成能力进行全面、多视角的评估。我们的基准总共包含4000个测试提示,涵盖两个主要类别,并在中英文之间精心平衡,以评估双语和跨文化生成能力。除了宏观场景分类外,我们根据每种语言的不同内容和挑战为每个提示标注了多维标签,从而将生成任务细化为更具体的子类别。通过利用场景分类和多维标签的跨维度评估机制,WeGenBench可以精确定位模型在特定生成类别中的不足。此外,为了更准确地衡量生成质量,我们通过整合视觉语言模型(VLM)设计并验证了几种新颖的评估指标,这些指标从三个核心方面评估模型在特定领域任务上的性能。至关重要的是,我们的方法既产生评估结果,也产生详细的推理轨迹,有助于对评估结果的准确性和合理性进行严格验证。最后,我们对当前最先进的方法进行了系统性的基准测试,并深入分析了现有模型中存在的局限性。

英文摘要

Recent text-to-image generation models have demonstrated remarkable capabilities in synthesizing highly realistic images from text inputs alone. Although existing benchmarks can evaluate the generation capabilities of various models to some extent, they struggle to comprehensively and accurately measure performance across multiple dimensions, often failing to reveal the inherent deficiencies of models in specific categories. To address these limitations, we propose WeGenBench, a novel benchmark designed for the comprehensive, multi-perspective evaluation of text-to-image generation capabilities. Our benchmark comprises a total of 4,000 test prompts across two primary categories, meticulously balanced between Chinese and English to evaluate bilingual and cross-cultural generation capabilities. Beyond macroscopic scene classification, we annotate each prompt with multi-dimensional tags tailored to the distinct content and challenges of each language, thereby refining the generation tasks into more specific sub-categories. Through a cross-dimensional evaluation mechanism leveraging both scene classifications and multi-dimensional tags, WeGenBench can precisely pinpoint model shortcomings in specific generation categories. Furthermore, to measure generation quality more accurately, we design and validate several novel evaluation metrics by integrating Vision-Language Models (VLMs), which assess model performance on domain-specific tasks from three core aspects. Crucially, our approach yields both the assessment outcomes and the detailed reasoning trajectories, facilitating a rigorous verification of the accuracy and soundness of the evaluation results. Finally, we conduct systematic benchmarking on current state-of-the-art methods and provide an in-depth analysis of the limitations present in existing models.

2606.20097 2026-06-19 cs.CL 新提交

HydraHead: From Head-Level Functional Heterogeneity to Specialized Attention Hybridization

HydraHead:从头部级功能异质性到专业化注意力混合

Zhentao Tan, Wei Chen, Jingyi Shen, Yao Liu, Xu Shen, Yue Wu, Jieping Ye

发表机构 * Alibaba Group(阿里巴巴集团)

AI总结 提出HydraHead架构,沿头部维度混合全注意力和线性注意力,通过可解释性驱动的头部选择和尺度归一化融合模块,在长上下文任务中优于层级混合设计,仅用15B token训练即在512K上下文长度上提升69%。

详情
AI中文摘要

注意力的二次复杂度对长上下文处理构成了关键瓶颈,激发了混合注意力设计的兴趣。大多数开源混合模型采用层级策略。然而,先前工作注意到线性注意力与全注意力整合的内在困难,表明注意力混合的设计空间仍未充分探索。为了探索这一空间,我们进行可解释性分析,观察到层表现出块级功能相似性,而同一层内的单个头部尽管共享输入特征,却显示出不同的功能专门化。这种头部级异质性表明,头部维度为融合异质注意力信号提供了自然且原则性的粒度。基于这一洞察,我们引入了HydraHead,一种沿头部轴混合全注意力和线性注意力的新型架构。HydraHead具有两个关键创新:(1)一种可解释性驱动的选择策略,识别检索关键的头部并仅为其保留全注意力;(2)一种尺度归一化融合模块,调和全注意力和线性注意力头部输出之间的分布差距。通过利用参数重用和蒸馏的三阶段迁移流程,我们以最小的训练开销实现了高性能混合模型。在统一的训练设置下,HydraHead在长上下文任务中优于其他混合设计,同时保持强大的通用推理能力。通过可解释性驱动的头部选择,它以7:1的线性注意力与全注意力比例匹配了3:1层级混合的长上下文性能。关键的是,仅用15B token训练,HydraHead在512K上下文长度上比基线提升超过69%,接近Qwen3.5(一个具有256K原生上下文长度的类似规模领先模型)。这突显了头部级混合的显著扩展潜力。

英文摘要

The quadratic complexity of attention poses a critical bottleneck for long-context processing, spurring interest in hybrid attention designs. Most open-source hybrid models adopt a layer-wise strategy. Yet, prior work has noted the inherent difficulty of integrating Linear Attention (LA) with Full Attention (FA), suggesting that the design space of attention hybridization remains underexplored. To probe this space, we conduct interpretability analysis and observe that layers exhibit block-wise functional similarity, while individual heads within the same layer display distinct functional specialization despite sharing input features. This head-level heterogeneity suggests that the head dimension provides a natural and principled granularity for fusing heterogeneous attention signals. Building on this insight, we introduce HydraHead, a novel architecture that hybridizes FA and LA along the head axis. HydraHead features two key innovations: (1) an interpretability-driven selection strategy that identifies retrieval-critical heads and preserves FA only for them, and (2) a scale-normalized fusion module that reconciles the distributional gap between FA and LA head outputs. By leveraging a three-stage transfer pipeline with parameter reuse and distillation, we achieve high-performance hybrid models with minimal training overhead. Under a unified training setup, HydraHead outperforms other hybrid designs in long-context tasks while maintaining strong general reasoning. With interpretability-driven head selection, it matches a 3:1 layer-wise hybrid's long-context performance at a 7:1 LA-to-FA ratio. Crucially, trained on only 15B tokens, HydraHead achieves over 69% improvement over the baseline at 512K context length, approaching Qwen3.5, a leading model of comparable size with a native context length of 256K. This highlights the significant scaling potential of head-level hybridization.

2606.20095 2026-06-19 cs.CV 新提交

Stitching and dimensionality effects on large artificially generated volume datasets

拼接和维度对大规模人工生成体数据集的影响

Lucas von Chamier, Jan Philipp Albrecht, Dagmar Kainmüller

发表机构 * GFZ Helmholtz-Zentrum für Geoforschung(亥姆霍兹地球科学中心) Max Delbrück Center for Molecular Medicine in the Helmholtz Association(亥姆霍兹协会马克斯·德尔布吕克分子医学中心) Helmholtz Imaging(亥姆霍兹成像) Humboldt-Universität zu Berlin(柏林洪堡大学) University of Potsdam(波茨坦大学)

AI总结 研究深度学习生成大图像时的拼接伪影对风格迁移的影响,比较2D与3D模型,发现FID无法检测影响下游任务的细微伪影,3D模型略优但计算成本高。

详情
AI中文摘要

通过深度学习生成大图像需要对输入数据进行分块以适应硬件内存限制,然后组装输出块,这一过程在相邻块边界不对齐时可能引入拼接伪影。虽然已知这些伪影会影响分割任务,但它们对风格迁移生成模型的影响尚不清楚。我们使用在冷冻电镜数据集上训练的cycleGAN模型,研究了三种拼接方法和两种块维度(2D vs 3D)。我们评估了感知质量和下游线粒体分割的性能。主要发现如下:(1)FID分数无法检测到显著影响下游分割性能的细微拼接伪影;(2)具有无伪影拼接的3D模型在下游任务上略优于2D模型,尽管改进勉强证明计算成本合理;(3)2D模型由于更大的批量大小而训练更稳定。此外,我们证明从三个正交方向集成预测可以改善低质量体,但对高质量输出无益。这些结果表明,在大型科学数据集上最大化生成模型性能需要仔细考虑和减轻拼接伪影,并且仅凭感知指标不足以评估生物医学成像中的域适应质量。

英文摘要

Generating large images via deep learning requires patching input data to accommodate hardware memory limitations, then assembling output patches, a process that can introduce stitching artifacts when neighboring patches do not align at borders. While these artifacts are known to affect segmentation tasks, their impact on generative models for style-transfer remains poorly understood. We investigated three stitching approaches and two patch dimensionalities (2D vs 3D) using cycleGAN models trained on cryo-electron microscopy datasets. We evaluated both perceptual quality and performance on downstream mitochondria segmentation. Our key findings reveal that: (1) FID scores fail to detect subtle stitching artifacts that significantly impact downstream segmentation performance, (2) 3D models with artifact-free stitching marginally outperform 2D models on downstream tasks, though the improvement barely justifies the computational cost, and (3) 2D models train more stably due to larger batch sizes. Additionally, we demonstrate that ensembling predictions from three orthogonal directions can improve low-quality volumes but provides no benefit for high-quality outputs. These results demonstrate that maximizing generative model performance on large scientific datasets requires careful consideration and mitigation of stitching artifacts, and that perceptual metrics alone are insufficient for evaluating domain adaptation quality in biomedical imaging.

2606.20094 2026-06-19 cs.CV cs.AI cs.GR cs.LG cs.MM 新提交

MakeupMirror: Improving Facial Attribute Preservation in Diffusion Models for Makeup Transfer

MakeupMirror:在用于化妆迁移的扩散模型中改进面部属性保持

Nefeli Andreou, Angel Martínez-González, Sabine Sternig, Matthieu Guillaumin, Epameinondas Antonakos, Michael Opitz

发表机构 * Amazon(亚马逊)

AI总结 提出MakeupMirror扩散模型,通过ControlNet几何条件、区域特定迁移控制、肤色调制和Langevin采样器,在保持面部特征和肤色的同时实现高质量化妆迁移,相比Stable-Makeup提升面部识别相似度60%、降低肤色差异50%。

详情
AI中文摘要

化妆迁移模型能够实现有趣的增强现实(AR)体验以及在线化妆购物的虚拟试妆(VTO)。尽管最近最先进的基于扩散的解决方案(如Stable-Makeup)显著提高了化妆迁移的准确性和逼真度,但在身份和肤色保持方面仍存在局限性,使得用于化妆购物的生产级VTO不切实际。在这项工作中,我们提出了MakeupMirror,一种基于扩散的化妆迁移方法,在保持面部特征和肤色方面取得了显著进展。我们在Stable-Makeup的基础上引入了多项技术创新:(1)将面部几何条件与ControlNets集成以保持面部保真度;(2)区域特定的化妆迁移控制,以便在面部区域(如皮肤、眼睛和嘴唇)实现精确的化妆应用;(3)基于肤色的化妆迁移调制,防止跨主体迁移场景中的肤色改变;(4)集成Levenberg-Marquardt Langevin采样器以加速推理同时保持生成质量。我们在CPM-Real、Makeup Wild以及(本文新收集的、更多样化的)MakeupSelfies数据集上的实验表明,与Stable-Makeup相比,MakeupMirror将相对面部识别相似度提高了+60%,将相对肤色差异降低了-50%,延迟为0.7秒,同时在核心面部身份保持标准上达到了94%的专家接受率。

英文摘要

Makeup transfer models enable fun augmented reality (AR) experiences as well as virtual try-on (VTO) for online makeup shopping. While recent state-of-the-art diffusion based solutions such as Stable-Makeup dramatically improve the accuracy and realism of makeup transfer, they still face limitations in identity and skin color preservation, making production-level VTO for makeup shopping unrealistic. In this work, we propose MakeupMirror, a diffusion-based approach to makeup transfer that makes significant progress towards preserving facial features and skin tone. We introduce several technical innovations over Stable-Makeup: (1) integration of facial geometry conditioning with ControlNets to maintain facial fidelity; (2) region-specific makeup transfer control to enable precise makeup application across facial regions such as skin, eyes and lips; (3) skin tone-based makeup transfer modulation that prevent skin tone alteration in cross-subject transfer scenarios; and (4) integration of a Levenberg-Marquardt Langevin sampler to speed up inference while maintaining generation quality. Our experiments on CPM-Real, Makeup Wild, and (herein newly collected, more diverse) MakeupSelfies datasets show that MakeupMirror improves relative facial recognition similarity by +60%, reduces relative skin tone difference by -50% over Stable-Makeup, with a latency of 0.7s, while achieving expert acceptance rate of 94% across core facial identity preservation criteria.

2606.20093 2026-06-19 cs.CL 新提交

Self-Preference Is Weak or Absent in Verifiable Instruction-Following Revision: A Four-Model Test Under Genuine Authorship

自我偏好在可验证的指令遵循修订中弱或不存在:基于真正作者身份的四模型测试

William Guey, Pierrick Bougault

发表机构 * Department of Industrial Engineering, Tsinghua University(清华大学工业工程系)

AI总结 通过IFEval验证器测试四类中端模型在指令遵循修订中的自我偏好,发现作者拒绝已验证正确编辑的比例与新鲜模型无显著差异,表明自我偏好弱或不存在。

Comments 7 pages, 3 tables. Code and data: https://github.com/williamguey/self-preference-revision

详情
AI中文摘要

大型语言模型(LLMs)越来越多地审查和修订文本,包括它们自己的文本。有记录的自我偏好偏差(模型在充当评判者时偏爱自己的生成)引发了一个问题:模型是否也会抵制对自己写作的有效修正。我们在一个“有效”不是由另一个模型决定,而是由确定性验证器决定的设置中测试这一点:基于IFEval的指令遵循修订。模型撰写草稿;官方IFEval检查器确认草稿违反约束,并且候选编辑修复了它;然后模型接受或拒绝该编辑,要么作为真正的上下文内作者,要么作为一个以中立方式看待草稿的新鲜模型。在四个中端模型系列和85次作者与新鲜模型比较中,我们未检测到可察觉的自我偏好:作者拒绝对自己草稿的已验证正确修复的比例与判断相同草稿的新鲜模型基本相同(差距-5.1个百分点,95%置信区间[-12.9, +2.7])。来自较小试点的自我怀疑提示未在大规模上复制。唯一稳健的观察是定性的:当作者确实拒绝已验证正确的修复时,他们陈述的理由中有97%是挑错而非偏好,即关于拒绝的性质,而非升高的比率。在此样本量下,不能排除小于约13个百分点的效应。

英文摘要

Large language models (LLMs) increasingly review and revise text, including their own. A documented self-preference bias (models favoring their own generations when acting as judges) raises the question of whether models also resist valid corrections to their own writing. We test this in a setting where "valid" is decided not by another model but by a deterministic verifier: instruction-following revision on IFEval. A model writes a draft; the official IFEval checker confirms the draft violates a constraint and that a candidate edit fixes it; the model then accepts or rejects that edit either as the genuine in-context author or as a fresh model that sees the draft neutrally. Across four mid-tier model families and 85 author-versus-fresh comparisons, we find no detectable self-preference: authors reject verified-good fixes to their own drafts at essentially the same rate as fresh models judging the same drafts (gap -5.1 pp, 95% CI [-12.9, +2.7]). A self-skepticism hint from a smaller pilot did not replicate at scale. The one robust observation is qualitative: when authors do reject a verified-good fix, 97% of their stated reasons are flaw-catching rather than preference, that is, about the character of rejections, not an elevated rate. Effects smaller than ~13 pp cannot be excluded at this sample size.

2606.20092 2026-06-19 cs.CV 新提交

EventVLA: Event-Driven Visual Evidence Memory for Long-Horizon Vision-Language-Action Policies

EventVLA: 面向长程视觉-语言-动作策略的事件驱动视觉证据记忆

Ganlin Yang, Zhangzheng Tu, Yuqiang Yang, Sitong Mao, Junyi Dong, Tianxing Chen, Jiaqi Peng, Jing Xiong, Jiafei Cao, Jifeng Dai, Wengang Zhou, Yao Mu, Tai Wang

发表机构 * University of Science and Technology of China(中国科学技术大学) Shanghai AI Laboratory(上海人工智能实验室) Shanghai Jiao Tong University(上海交通大学) Dalian University of Technology(大连理工大学) Huawei Technologies Co., Ltd.(华为技术有限公司) The University of Hong Kong(香港大学) Tsinghua University(清华大学) Peking University(北京大学)

AI总结 针对长程机器人操作中记忆瓶颈问题,提出EventVLA框架,通过动态关键帧证据记忆模块自主捕获任务关键视觉事件,在17个模拟和4个真实任务中平均成功率提升40%。

详情
AI中文摘要

记忆仍然是长程机器人操作的关键瓶颈,因为标准的视觉-语言-动作(VLA)策略在任务相关线索随时间变得遮挡或不可观测时常常失败。虽然现有的记忆增强方法利用历史上下文,但它们要么遭受严重的信息瓶颈,通过解耦的双系统引入高延迟,要么依赖积累大量视觉冗余的无选择性缓冲区。为了解决这些限制,我们引入了EventVLA,一个基于稀疏视觉证据记忆概念的端到端框架,包含两个核心组件:用于保留初始和短期上下文的基础视觉锚点,以及动态关键帧证据记忆(KEM)模块。具体来说,KEM直接从VLA的潜在嵌入中预测未来关键帧概率,以自主捕获和存储稀疏的、任务关键的视觉事件。这种前瞻驱动的机制使策略能够动态评估当前观测的未来因果效用,在瞬态视觉证据变得不可观测之前将其保留。此外,我们提出了RoboTwin-MeM,一个专门设计用于评估具有交互式视觉证据的非马尔可夫操作任务的诊断基准。大量评估表明,在17个需要记忆的模拟任务和4个真实世界双臂任务中,EventVLA相比最先进的记忆增强VLA实现了平均成功率提升+40%。

英文摘要

Memory remains a critical bottleneck for long-horizon robotic manipulation, as standard Vision-Language-Action (VLA) policies often fail when task-relevant cues become occluded or unobservable over time. While existing memory-augmented methods utilize historical context, they either suffer from severe information bottlenecks, incur high latency via decoupled dual systems, or rely on unselective buffers that accumulate massive visual redundancies. To address these limitations, we introduce EventVLA, an end-to-end framework founded on the concept of sparse visual evidence memory that comprises two core components: foundational visual anchors to retain initial and short-term contexts, and a dynamic Keyframe Evidence Memory (KEM) module. Specifically, KEM directly predicts future keyframe probabilities from the VLA's latent embeddings to autonomously capture and store sparse, task-critical visual events. This foresight-driven mechanism empowers the policy to dynamically evaluate the future causal utility of current observations, preserving transient visual evidence before it becomes unobservable. Furthermore, we propose RoboTwin-MeM, a diagnostic benchmark specifically designed to evaluate non-Markovian manipulation tasks with interactive visual evidence. Extensive evaluations show that across 17 memory-requiring simulation tasks and 4 real-world bimanual tasks, EventVLA achieves an average success rate improvement of +40% over state-of-the-art memory-augmented VLAs.