arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.20164 2026-06-19 cs.CL cs.AI cs.LG q-bio.QM New submission

MedRLM: Recursive Multimodal Health Intelligence for Long-Context Clinical Reasoning, Sensor-Guided Screening, Evidence-Grounded Decision Support, and Community-to-Tertiary Referral Optimization

Aueaphum Aueawatthanaphisut

Affiliations: School of Information, Computer Communication Technology, Sirindhorn International Institute of Technology, Thammasat University, Pathum Thani, Thailand

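AI summary: Proposes MedRLM, a recursive multimodal health intelligence framework that recursively inspects, decomposes, retrieves, verifies, and synthesizes patient information, coordinates multiple specialized agents, and introduces a Clinical Evidence Graph Memory to support long-context clinical reasoning and sensor-guided screening.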

Comments 9 pages, 3 figures, 3 tables, 1 Algorithm, 29 equations

English abstract

Real-world clinical decision support requires reasoning over heterogeneous and longitudinal patient information rather than answering isolated medical questions. However, current medical large language models and retrieval-augmented generation systems often rely on single-step prompting or retrieval, which can be fragile when clinical evidence is distributed across long electronic health records, medical images, sensor streams, guidelines, and referral constraints. This paper proposes MedRLM, a Recursive Multimodal Health Intelligence framework for long-context clinical reasoning, sensor-guided screening, and community-to-tertiary referral support. Instead of compressing all patient information into one prompt, MedRLM treats the patient case as an external clinical environment that can be recursively inspected, decomposed, retrieved, verified, and synthesized. The framework coordinates specialized agents for clinical text, longitudinal EHR, medical imaging, physiological sensor signals, guideline retrieval, uncertainty auditing, and referral planning. It further introduces a Clinical Evidence Graph Memory to connect patient-specific observations with retrieved evidence, standardized definitions, sensor-derived biomarkers, and referral criteria. A sensor-guided recursive triggering mechanism activates deeper reasoning when abnormal physiological or behavioral patterns are detected, while uncertainty-gated refinement supports clinician review for high-risk or low-confidence cases. We also outline a real-data evaluation design using public and credentialed clinical datasets spanning EHR, radiology, ECG, ICU time series, and referral-proxy outcomes. MedRLM aims to move medical AI from static question answering toward auditable, multimodal, and workflow-aware clinical decision support.

2606.20163 2026-06-19 eess.SY cs.SY New submission

Techno-Economic Analysis of Shared Mobile Storage for Demand Charge Reduction

B Hari Kiran Reddy, Ge Chen, Junjie Qin

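AI summary: Proposes a high-fidelity fleet management framework that uses a mixed-integer linear programming model and a heuristic algorithm to assess the techno-economic viability of shared electric vehicles for demand charge reduction under practical logistical and operational constraints.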

Comments 22 pages, 26 figures, journal

English abstract

This paper investigates the techno-economic viability of shared electric vehicle (EV) fleets for demand charge reduction under practical logistical and operational constraints. Unlike idealized models that overlook transit overheads, we propose a high-fidelity fleet management framework that explicitly accounts for the spatio-temporal coupling of energy consumption, labor costs for EV drivers, and battery degradation. We formulate the dispatch problem as a mixed-integer linear program (MILP) that jointly minimizes demand charges and total cost of ownership. To address the computational complexity arising from path-dependent constraints, we develop a marginal-value-based heuristic algorithm that achieves near-optimal performance with high computational efficiency. Using real-world data from San Francisco, our analysis reveals that a modest number of EVs can achieve significant demand charge savings, sufficient to recover the ownership and operational expenses. Our results also show how tariff structures, fleet size, and cost components influence overall profitability.
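
As a rough illustration of the demand-charge mechanism the abstract relies on: a demand charge is typically billed on the peak power drawn during a billing period, so discharging storage during the highest-load intervals lowers the bill. The Python sketch below is a simplified assumption, not the paper's MILP or heuristic. The load profile, tariff rate, storage limits, and function names are all hypothetical, and transit, labor, and degradation costs are ignored.

```python
def demand_charge(load_kw, rate_per_kw):
    """Demand charge billed on the peak power of one billing period."""
    return rate_per_kw * max(load_kw)


def shave_to_threshold(load_kw, threshold_kw, power_kw, energy_kwh, interval_h=1.0):
    """Discharge storage whenever load exceeds the threshold.

    Discharge is limited by the power rating and by the energy still stored.
    """
    remaining = energy_kwh
    shaved = []
    for load in load_kw:
        excess = max(0.0, load - threshold_kw)
        discharge = min(excess, power_kw, remaining / interval_h)
        remaining -= discharge * interval_h
        shaved.append(load - discharge)
    return shaved


# Hypothetical hourly load (kW) and tariff ($ per kW of peak demand).
load = [40.0, 55.0, 90.0, 70.0, 45.0]
rate = 20.0

shaved = shave_to_threshold(load, threshold_kw=60.0, power_kw=30.0, energy_kwh=40.0)
savings = demand_charge(load, rate) - demand_charge(shaved, rate)
print(shaved, savings)
```

In this toy case the peak falls from 90 kW to 60 kW. Whether such savings cover ownership and operating costs is exactly the question the paper studies with its full model.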

2606.20161 2026-06-19 cs.CV New submission

ARTEMIS: Agent-guided Reliability-aware Temporal Mask Evolution for Imperfectly Supervised Video Polyp Segmentation

Tong Wang, Siwen Wang, Yaolei Qi, Jinxing Zhou, Yuting He, Guanyu Yang, Yutong Xie

Affiliations: Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Southeast University, Ministry of Education; Mohamed bin Zayed University of Artificial Intelligence (MBZUAI); School of Medicine, Case Western Reserve University

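AI summary: Proposes ARTEMIS, a framework that uses a vision-language agent to select reliable temporal anchors and combines SAM2 propagation with reliability-aware robust learning to learn high-quality video polyp segmentation masks from imperfect supervision (points, scribbles, or a few dense labels), reaching state-of-the-art performance on multiple benchmarks.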

English abstract

Imperfectly supervised video polyp segmentation (VPS) aims to learn dense, temporally consistent masks from inexpensive supervision, including weak annotations (points, scribbles) and semi-supervision with few densely labeled frames. This setting is clinically valuable but challenging due to weak contrast, ambiguous boundaries, motion blur, and specular highlights, compounded by sparse pixel-level guidance. While SAM2 can generate dense masks from sparse inputs, direct pseudo-labeling often yields geometry-degraded masks with boundary leakage, underutilizes temporal consistency, and ignores reliability. To address these issues, we propose ARTEMIS, a unified framework for imperfectly supervised VPS driven by agent-guided reliability-aware temporal mask evolution. ARTEMIS initializes coarse masks from available supervision: SAM2 converts points/scribbles, while dense labels serve as reliable anchors. A debate-and-judge vision-language agent selects reliable temporal anchors under weak supervision, which are propagated bidirectionally with SAM2 to refine unreliable or unlabeled frames. Finally, ARTEMIS trains the segmenter using temporal reliability-aware robust learning, incorporating reliability-guided reference selection, a Reference Prototype Transport Module, and reliability-aware robust loss. These components assess mask reliability, evolve anchors over time, transport target identity across frames, and down-weight noisy supervision instead of discarding difficult samples. Experiments on SUN-SEG and CVC-ClinicDB-612 under scribble, point, and limited-label settings demonstrate that ARTEMIS achieves state-of-the-art performance. Code will be released at https://github.com/wangtong627/ARTEMIS.

2606.20158 2026-06-19 cs.SE New submission

N-Version Programming with Coding Agents

Javier Ron, Benoit Baudry, Martin Monperrus

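AI summary: Revisits N-version programming in the setting of contemporary AI coding agents. Using the Knight-Leveson experiment, it evaluates how diversity across agent systems, models, and implementation languages affects failure modes. It finds common-mode failures, but majority-voting three-version units substantially reduce the failure count, supporting the engineering usefulness of the strategy.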

English abstract

This paper revisits the classical concept of N-version programming in the setting of contemporary AI coding agents. Revisiting the seminal Knight-Leveson experiment, we study whether diversity across agent systems, models, and implementation languages creates diverse failure modes. Using Knight and Leveson's Launch Interceptor Program specification, we evaluate 48 agent-generated implementations on a shared oracle and a campaign of 1,000,000 randomized test inputs. The results show substantial common-mode failure, in line with the findings of Knight and Leveson. Further analysis shows that many of those co-occurring failures can be traced to places where the specification is particularly hard or ambiguous. We also demonstrate that diversity from coding agents provides practical benefit: across majority-voting three-version units, the mean failure count drops from 387.44 for single versions to 130.99 for triples, and 11,844 N-version units exhibit zero observed failures. Our original results are the strongest evidence to date that N-version programming with coding agents is a useful engineering strategy.
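
The majority-voting comparison described in the abstract can be sketched in a few lines of Python. This is an illustrative toy, not the authors' harness: the four "versions", their outputs, and the oracle are made-up data, and the function names are hypothetical. It counts failures for each single version, then for every three-version unit after a majority vote.

```python
from collections import Counter
from itertools import combinations


def majority_vote(outputs):
    """Return the output shared by a strict majority of versions, else None."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None


def failure_count(outputs, oracle):
    """Number of test inputs on which the outputs disagree with the oracle."""
    return sum(1 for got, want in zip(outputs, oracle) if got != want)


def unit_failure_count(unit, oracle):
    """Failures of an N-version unit after voting on each test input."""
    voted = [majority_vote(column) for column in zip(*unit)]
    return failure_count(voted, oracle)


# Hypothetical oracle and per-version outputs on six test inputs.
oracle = [1, 0, 1, 1, 0, 1]
versions = {
    "A": [1, 0, 1, 1, 0, 0],  # fails on the last input
    "B": [1, 1, 1, 1, 0, 0],  # shares A's failure on the last input
    "C": [1, 0, 0, 1, 0, 1],
    "D": [0, 0, 1, 1, 0, 1],
}

single = [failure_count(v, oracle) for v in versions.values()]
triples = [unit_failure_count(u, oracle) for u in combinations(versions.values(), 3)]
mean_single = sum(single) / len(single)
mean_triple = sum(triples) / len(triples)
zero_failure_units = sum(1 for f in triples if f == 0)
print(mean_single, mean_triple, zero_failure_units)
```

In the toy data, versions A and B fail on the same input, so any unit containing both still fails there. That is the common-mode effect, and voting only masks failures that are not shared by a majority.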

2606.20156 2026-06-19 cs.AI New submission

Modularity-Free Conflict-Averse Training for Generalized PINNs

Heejo Kong, Beomchul Park, Sung-Jin Kim, Seong-Whan Lee

Affiliations: Department of Brain and Cognitive Engineering, Korea University; Department of Artificial Intelligence, Korea University

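AI summary: Addresses the failure of conflict-averse optimization in overparameterized PINNs caused by functional modularity. Proposes the ModSync framework, which penalizes task-exclusive connections while preserving interaction pathways to integrate structural optimization with conflict-averse training, reaching state-of-the-art accuracy on a range of PDE benchmarks.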

Comments Accepted by ICASSP 2026

English abstract

Physics-informed neural networks (PINNs) have become a powerful framework for solving PDEs by embedding physical laws into differentiable objectives. Despite their advances, training PINNs remains fragile: recent conflict-averse optimization schemes alleviate gradient interference between residual and boundary losses, but we show that their effectiveness deteriorates as model capacity increases. In this paper, we identify a capacity-induced failure mode, where overparameterized networks undergo functional modularity, self-partitioning into task-exclusive modules that suppress cross-objective interaction and hinder convergence toward Pareto-stationary points. To address this issue, we propose a novel framework, Modular-Sparsity Synchronization (ModSync), which integrates structural optimization into conflict-averse training by penalizing task-exclusive connections while preserving interaction-promoting pathways. Extensive experiments across diverse PDE benchmarks demonstrate that ModSync consistently prevents capacity-driven failures, sustains robust cross-objective coupling, and achieves state-of-the-art accuracy. Code is available at https://github.com/heejokong/ModSync.

2606.20155 2026-06-19 cs.CV cs.CL New submission

NAMESAKES: Probing Identity Memorization in Text-to-Image Models

Morris Alper, Vasudha Varadarajan, Moran Yanuka, Angelina Wang, Hadar Averbuch-Elor

Affiliations: Carnegie Mellon University; Tel Aviv University; Cornell University

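AI summary: Proposes a black-box behavioral probe that, without reference photos or training data, distinguishes whether an image generated by a text-to-image model is memorized or fabricated, and validates it on the NAMESAKES dataset.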

English abstract

Text-to-image (T2I) models generate realistic likenesses of some individuals when prompted with their names, raising privacy concerns. However, distinguishing whether a generated face is memorized or fabricated currently requires ground-truth photos, access to training data, or white-box access to model internals, limiting applicability. We introduce a fully black-box behavioral probe that distinguishes between these regimes while requiring no reference photos or prior knowledge of training data. To benchmark this task, we present the NAMESAKES dataset of over one thousand names and faces of public figures spanning a wide range of fame levels, along with perturbed, less famous names. Experiments on state-of-the-art T2I models show that our probe substantially predicts identity memorization and separates memorized from unrecognized names, with further insights into differences across model families.

2606.20152 2026-06-19 cs.CL cs.AI New submission

From Texts to Scores: Tracing the Emergence of Essay Quality Representations in Large Language Models

Jiaxu Zuo, Mu You, Kaixin Lan, Tao Fang, Yujia Huo, Henghua Shen, Lidia S. Chao, Derek F. Wong

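AI summary: Analyzes the hidden representations of eight LLMs on three datasets using linear probing and related methods, finds that essay quality information is present in a linearly decodable form, and identifies score-related neurons, shedding light on the internal mechanisms of LLM-based scoring.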

Comments This is a preprint of a manuscript currently under peer review

English abstract

Recent advances in Large Language Models (LLMs) have substantially transformed Automated Essay Scoring (AES), yet the internal mechanisms underlying LLM-based scoring remain poorly understood. In this work, we systematically analyze the hidden representations of eight LLMs across two English essay datasets (ASAP++, CSEE) and one Portuguese dataset (ENEM). Using linear probing, cross-prompt generalization, dimensionality reduction, and neuron-level analyses, we find consistent evidence that essay quality information is encoded in a linearly accessible form within LLM representations. These representations emerge progressively across layers, remain robust across prompting strategies, and partially transfer across essay prompts despite differences in scoring rubrics. In addition, nonlinear probes provide only marginal and inconsistent improvements over linear probes, suggesting that most essay quality information is already linearly decodable. We further identify individual "essay scoring neurons" whose activations strongly correlate with essay scores and whose behavior is sensitive to targeted intervention. Moreover, the layer-wise distribution of these neurons systematically shifts with essay length, with longer essays relying more heavily on deeper layers. Overall, our findings provide evidence that LLMs encode structured representations related to essay quality and offer new insights into the interpretability of LLM-based AES systems.

2606.20151 2026-06-19 cs.NE cs.AI New submission

Hybrid ANN-SNN Pipeline with Local Plasticity

Denis Larionov, Khairutin Shtanchaev, Mikhail Kiselev, Mikhail Korovin, Ivan Tugoy

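AI summary: Proposes a hybrid ANN-SNN pipeline that uses the rich embeddings of a pretrained ANN to build a high-performance SNN, trained with rate coding and local learning rules, reaching 99.09% accuracy on 64-class ImageNet.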

Comments 9 pages, 4 figures, source-code available

English abstract

This work proposes a hybrid ANN-SNN pipeline that effectively leverages the rich embeddings of pretrained artificial neural networks (ANNs) to enable high-performance spiking neural networks (SNNs). The architecture couples a pretrained EfficientNet encoder with a CoLaNET spiking classifier. We convert the encoder's activations into spike trains via rate-coding and train the subsequent SNN classifier using local, biologically inspired learning rules, bypassing end-to-end gradient propagation. This approach achieves 99.09% accuracy on a 64-class ImageNet benchmark, demonstrating performance on par with conventional deep networks. The work presents a biologically plausible and efficient framework for adapting powerful pretrained encoders to downstream spiking neural network tasks.
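
The rate-coding step mentioned in the abstract can be pictured with a minimal sketch, assuming a simple Bernoulli scheme in which each min-max-normalized activation is used as a per-time-step spike probability. The abstract does not specify the paper's exact encoding, so the normalization, the function names, and the example activations are all hypothetical. The EfficientNet encoder and the CoLaNET classifier are not reproduced.

```python
import random


def rate_encode(activations, num_steps, rng):
    """Bernoulli rate coding (assumed scheme, not taken from the paper).

    Activations are min-max normalized to [0, 1] and used as the probability
    of emitting a spike at each time step.
    """
    lo, hi = min(activations), max(activations)
    scale = (hi - lo) or 1.0  # avoid division by zero for constant input
    probs = [(a - lo) / scale for a in activations]
    return [[1 if rng.random() < p else 0 for p in probs] for _ in range(num_steps)]


def firing_rates(spikes):
    """Fraction of time steps in which each input channel spiked."""
    steps = len(spikes)
    return [sum(column) / steps for column in zip(*spikes)]


rng = random.Random(0)                  # seeded for repeatability
activations = [0.1, 2.3, 0.9, 1.4]      # hypothetical encoder outputs
spikes = rate_encode(activations, num_steps=200, rng=rng)
rates = firing_rates(spikes)
print(rates)
```

Over enough time steps the empirical firing rates approximate the normalized activations, which is what lets a downstream spiking classifier read the encoder's embedding from spike counts.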

2606.20150 2026-06-19 cs.RO New submission

Robust Assembly State Reasoning from Action Recognition for Human-Robot Collaboration

James Fant-Male, Roel Pieters

Affiliations: Cognitive Robotics group, Unit of Automation Technology and Mechanical Engineering, Tampere University

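AI summary: Studies methods for tracking assembly state from action recognition inputs, comparing logic-based, HMM, and neural network approaches. It finds that the best method varies by task and that logic-based methods are more robust in highly variable scenarios.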

Comments Preprint accepted to the 35th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN 2026). 8 pages, 9 figures, 3 tables

English abstract

Human Action Recognition (HAR) is frequently investigated in Human-Robot Collaboration (HRC) research to understand what actions have been performed and hence the state of a collaborative task. Accurately tracking an assembly state from HAR is however not fully investigated, and in realistic scenarios is not a trivial task. This research systematically investigates and compares methods for tracking assembly state using action recognition inputs. Investigations using two diverse datasets and five state tracking approaches, including logic-based, Hidden Markov Model (HMM), and neural network (NN) methods, show that optimal approaches are not uniform across different tasks and that different methods fail under different circumstances. Testing is performed using both simulated inputs with varying noise levels and realistic inputs from a HAR model. Results show NN and HMM methods can perform well in tasks with limited variability, but for other scenarios logic-based approaches can be more robust. Methods which model expected action duration are also important for tasks with repeated actions where no additional sensing is provided.

2606.20146 2026-06-19 cs.AI New submission

BIM-Edit: Benchmarking Large Language Models for IFC-Based Building Information Modeling

Bharathi Kannan Nithyanantham, Clemens Kujat, Tobias Sesterhenn, Stefan Telgmann, Jörn Plönnigs, Stefan Lüdtke, Christian Bartelt

Affiliations: University of Rostock; Clausthal University of Technology

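AI summary: Proposes the BIM-Edit benchmark for evaluating large language models on natural-language editing of Building Information Models in IFC format. It covers 324 tasks; the best model averages only 49.5%, revealing a gap between current capabilities and engineering requirements.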

English abstract

Large language models (LLMs) are increasingly applied to computer-aided design (CAD) to generate design artifacts from textual instructions. In engineering practice, this requires more than creating new geometry: models must also understand existing scenes, edit them correctly, and preserve semantics and relations. However, many CAD benchmarks focus on creating new models rather than editing existing ones, and mostly evaluate geometric correctness. We introduce BIM-Edit, a benchmark for evaluating LLMs on natural-language editing of Building Information Models (BIM) represented in the Industry Foundation Classes (IFC) format. BIM provides a challenging testbed because building models encode geometry together with semantic and relational structure. BIM-Edit contains 324 editing tasks spanning 11 realistic building models and 36 synthetic scenes. Tasks are expressed using three instruction categories (direct, spatial, and topological), covering both explicit and scene-grounded edits. We evaluate outputs along three dimensions: geometric accuracy, semantic validity, and topological consistency. Across evaluated LLMs, the best-performing model achieves only a 49.5% average score across the three metrics, and no model fully solves more than 3.4% of tasks. These results demonstrate a substantial gap between current LLM capabilities and the requirements of structured engineering design workflows.

2606.20143 2026-06-19 cs.CV New submission

HEad and neCK TumOR (HECKTOR) 2025: Benchmark of Segmentation, Diagnosis, and Prognosis in Multimodal PET/CT

Numan Saeed, Salma Hassan, Shahad Hardan, Lishan Cai, Xinglong Liang, Moona Mazher, Abdul Qayyum, Yansong Bu, Mengye Lyu, Yue Lin, Mingyuan Meng, Chuanyi Huang, Lisheng Wang, Dalal Chamseddine, Shamimeh Ahrari, Beining Wu, Yifei Chen, Fuyou Mao, Hao Zhang, Baixiang Zhao, Surajit Ray, Muzi Guo, Lei Xiang, Jakob Dexl, Michael Ingrisch, Adrien Depeursinge, Arman Rahmim, Mathieu Hatt, Vincent Andrearczyk, Mohammad Yaqub

Affiliations: Mohamed bin Zayed University of Artificial Intelligence (MBZUAI); Amsterdam UMC; The Netherlands Cancer Institute; Radboud University Medical Centre; University College London; Imperial College London; Shenzhen Technology University; Shenzhen University; Newland Digital Technology; The University of Sydney; Shanghai Jiao Tong University; University Hospital, Nantes; Nantes Université, Centrale Nantes, CNRS, LS2N; Hangzhou Dianzi University; Tsinghua University; Central South University; University of Glasgow; China Mobile System Integration Co., Ltd.; Subtle Medical Inc.; University Hospital, LMU Munich; Munich Center for Machine Learning; BC Cancer Research Institute; HES-SO Valais-Wallis University of Applied Sciences and Arts; Lausanne University Hospital (CHUV); LaTIM, INSERM, UMR 1101, Univ Brest

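AI summary: The HECKTOR 2025 challenge uses multimodal PET/CT and electronic health records to establish a benchmark for automated head and neck cancer analysis across three tasks: tumor segmentation, recurrence prediction, and HPV classification. The best algorithms reach a Dice of 0.75, a C-index of 0.66, and a balanced accuracy of 0.56, respectively.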

Comments 17 pages, 4 figures, 4 tables. Overview paper for the HECKTOR 2025 challenge, held as a satellite event at MICCAI 2025. Challenge website: https://hecktor.grand-challenge.org/

English abstract

Head and neck cancers (HNC) represent a significant global health burden, with accurate tumor delineation being essential for effective radiotherapy planning. The complexity of the oropharyngeal anatomy, combined with the heterogeneous appearance of tumors on imaging, makes manual segmentation time-intensive and subject to inter-observer variability. Beyond segmentation, predicting long-term clinical outcomes, such as recurrence-free survival (RFS), and determining human papillomavirus (HPV) status from noninvasive imaging, remain challenging yet clinically valuable goals. The HECKTOR 2025 challenge addresses these needs by establishing a comprehensive benchmark for automated HNC analysis using multimodal PET/CT imaging and electronic health records. Building on previous editions (2020-2022), this challenge features an expanded multi-institutional dataset comprising over 1,100 patients from 10 centers worldwide. Participants were tasked with three complementary objectives: (1) segmenting primary gross tumor volumes (GTVp) and metastatic lymph nodes (GTVn), (2) predicting recurrence-free survival, and (3) classifying HPV status. The challenge attracted 35 registered teams, with 15 final submissions evaluated on a held-out test set. Top-performing algorithms achieved a mean Dice similarity coefficient of 0.75 for segmentation, a concordance index of 0.66 for survival prediction, and a balanced accuracy of 0.56 for HPV classification. This paper presents a comprehensive analysis of the submitted methodologies, evaluates their performance across different lesion characteristics, and discusses their implications for clinical translation in automated oncology workflows and decision support systems.
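
For readers unfamiliar with the three metrics reported above, the following Python sketch gives their textbook definitions: the Dice similarity coefficient, a pairwise (Harrell-style) concordance index, and balanced accuracy. It is not the challenge's evaluation code, which may aggregate or handle edge cases differently. All inputs are made-up toy data and the function names are hypothetical.

```python
def dice(pred, truth):
    """Dice coefficient 2|A n B| / (|A| + |B|) for flat binary masks."""
    intersection = sum(p and t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    return 1.0 if total == 0 else 2.0 * intersection / total


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


def concordance_index(times, events, risks):
    """Share of comparable pairs ranked correctly by predicted risk.

    A pair is comparable when the subject with the shorter time had an event.
    Tied risks count as half a concordant pair.
    """
    concordant, comparable = 0.0, 0
    for i in range(len(times)):
        for j in range(len(times)):
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable


# Toy inputs, for illustration only.
d = dice([1, 1, 1, 0], [1, 1, 0, 0])
b = balanced_accuracy([1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 0, 1])
c = concordance_index([2, 4, 6, 8], [1, 1, 0, 1], [0.9, 0.4, 0.5, 0.1])
print(d, b, c)
```

For a binary task, a classifier that always predicts the same class scores a balanced accuracy of 0.5, and random risk scores give a concordance index near 0.5, which helps in reading the reported 0.56 and 0.66.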

2606.20142 2026-06-19 cs.AI cs.MA New submission

RACL: Reasoning-Agent Control Layers for Continuous Metaheuristic Learning

Antón Asla Manzárraga

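AI summary: Proposes RACL, which adds a reasoning agent on top of a metaheuristic optimizer to control search behavior through observation, reasoning, and intervention, reducing average cost by 0.641% to 8.337% on vehicle routing problems.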

Comments 10 pages, 5 tables

English abstract

This paper introduces RACL, a Reasoning-Agent Control Layer for metaheuristics. RACL places a reasoning agent above an existing optimizer. The agent does not replace the optimizer and does not modify business constraints. Instead, it controls the optimizer's internal search behavior by observing operational memory, reasoning over past behavior, formulating bounded hypotheses, testing interventions, evaluating outcomes, applying guardrails, consolidating useful policies and explaining its decisions. The experiment uses vehicle routing as a testbed, but the contribution is not a new routing solver, a particular ALNS configuration or a specific set of routing rules. The contribution is the RACL method: a way for a reasoning agent to discover, validate, consolidate and explain algorithmic control rules for a metaheuristic. In the current experimental setting, RACL improves or ties the Operational Memory Policy in 21 of 21 feasible cases and improves or ties a non-reasoning Stagnation-Triggered Policy in 18 of 21 feasible cases, with an average RACL vs STP cost delta of -0.641%. In the Sevilla-9/10 runtime sample, RACL improves average cost by -8.337% versus Fixed and -1.605% versus STP without showing material computational overhead. During the proof-of-concept, Codex was used as an in-the-loop reasoning agent observing executions, interpreting logs and proposing live bounded interventions. The policy proxy was later used only to make quantitative evaluation reproducible.

2606.20140 2026-06-19 cs.CV New submission

SA-VIS: Sparse frame Annotations for training Video Instance Segmentation

Edoardo Mello Rella, Ajad Chhatkuli, Shipra Jain, Ender Konukoglu, Luc Van Gool

Affiliations: CVL, ETH Zurich; Align Technology; VISICS, KU Leuven; INSAIT, Sofia

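AI summary: Proposes SA-VIS, a sparse-frame-annotation method whose Past-frames Feature Propagation module exploits low-dimensional features; performance drops by only 0.4% when using 1/5 of the annotated frames, substantially reducing annotation cost.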

English abstract

Recent online video instance segmentation (VIS) methods have achieved impressive results, thus becoming the preferred approach to segment instances in videos. Despite the resurgence of impressive single-image models, online (or semi-online) VIS approaches outperform single-image models (e.g., based on SAM) by using long sequences of densely annotated frames during training. However, such a VIS training setup is expensive in terms of both compute and the dense annotations required. To address these major drawbacks, we argue that effective modeling of instances and their evolution in videos does not require densely annotated frames. To that end, we propose a simple and effective module, called Past-frames Feature Propagation (PFP), which aggregates low-dimensional features from the image encoder of multiple frames. This simple low-compute module provides tremendous learning capability in using sparse video frame labels for end-to-end training. Combined with lightweight frame-specific instance queries, our Sparse frame Annotation VIS (SA-VIS) significantly improves performance over its baseline. Most interestingly, our simple design that avoids complexities effectively bridges the gap in accuracy between training on sparsely and densely annotated video sequences. This translates to a mere 0.4% drop in performance of SA-VIS when using annotations for only 1/5 of the images in the dataset. Empirically, SA-VIS shows strong improvements over the baseline on YouTube-VIS 2019/2021/2022 and Occluded VIS (OVIS), and an improvement of over 1% in AP over the state of the art in a limited-annotation scenario.

2606.20138 2026-06-19 cs.AI cs.CL cs.HC cs.LG New submission

Learning to Prompt: Improving Student Engagement with Adaptive LLM-based High-School Tutoring

Po-Chin Chang, Nicholas Hogan, Aske Plaat, Michiel T. van der Meer

Affiliations: Leiden University; FutureWhiz

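AI summary: Proposes a subject-aware prompt routing model based on 14 pedagogical features. Trained in simulation and evaluated in an online A/B test, it enables adaptive strategy switching in high-school tutoring, improving instructional efficiency and reducing the number of interaction turns.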

English abstract

LLMs can personalize education, although current static-prompt tutoring systems struggle to adapt to diverse academic disciplines. We develop and test a system with subject-aware prompting, based on 14 pedagogical features (e.g., tutor scaffolding, student understanding) extracted from raw transcripts. We first train a prompt routing model in a simulation environment, and then deploy it for online adaptation with actual high-school students. The simulation benchmark shows the router outperforming two static baselines ($0.694$ vs. $0.647$ and $0.64$, $p<0.001$). A/B testing ($N=656$ conversations from 359 students) shows sim-to-real transfer where the model switches from analytical to scaffolding learning strategies. Our adaptive prompt selection mechanism improves instructional efficiency, maintains pedagogical quality and reduces interactions by around 3 turns ($p=0.007$). While a greedy router achieves a comparable exercise conversion rate with the baseline ($19.1\%$ vs. $19.6\%$), a stochastic router that samples strategies leads to a higher conversion rate ($28.1\%$).

2606.20135 2026-06-19 cs.RO cs.AI 新提交

Frequency-Aware Flow Matching for Continuous and Consistent Robotic Action Generation

频率感知流匹配用于连续且一致的机器人动作生成

Jianing Guo, Fangzheng Chen, Zihao Mao, Wong Lik Hang Kenny, Zhenhong Wu, Yu Li, Yishuai Cai, Yuanpei Chen, Yikun Ban, Kai Chen, Qi Dou, Yaodong Yang, Xianglong Liu, Huijie Zhao, Simin Li

Affiliations * Beihang University; Peking University; The Chinese University of Hong Kong; PKU-Psibot Lab; Zhongguancun Laboratory; Hefei Comprehensive National Science Center

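AI summary: Proposes Frequency-Aware Flow Matching (FAFM), which uses the discrete cosine transform to move discrete action sequences into the frequency domain for flow matching and regularizes the first-order temporal derivative to generate smooth, continuous actions, improving success rate, multimodal expressivity, and motion smoothness.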

详情
AI中文摘要

流匹配已成为机器人操作的标准范式,因为它与扩散策略等类似方法一样,对建模复杂的多模态动作分布具有很强的表达能力。然而,现有方法依赖于离散化的动作块,使得它们对以异构控制频率收集的演示数据脆弱,并且容易产生时间上不一致的动作,从而降低控制稳定性。在本文中,我们提出了频率感知流匹配(FAFM),它输出连续的、时间上一致的动作。为了处理异构频率输入,我们使用离散余弦变换(DCT)将离散动作序列转换到频域,对得到的系数进行流匹配,并通过余弦基展开重建连续动作。为了生成时间上一致的动作,我们对一阶时间导数进行正则化以促进平滑动作。这对应于一个Sobolev型约束,抑制高频误差并阻止突变的动作变化。我们的FAFM简单,不引入额外的网络参数,并且适用于独立的流匹配策略和视觉-语言动作模型。在合成玩具基准、避障、LapGym和LIBERO上,FAFM提高了成功率、多模态表达能力、运动平滑性、收敛速度、对机械偏差和混合频率输入的鲁棒性。这些优势在真实世界的Franka机器人上部署时保持一致。代码见此https URL。

英文摘要

Flow matching has emerged as a standard paradigm for robotic manipulation owing to its strong expressive power for modelling complex, multimodal action distributions, alongside similar approaches like diffusion policy. However, existing methods rely on discretized action chunks, making them brittle to demonstrations collected at heterogeneous control frequencies and prone to temporally inconsistent actions that degrade control stability. In this paper, we propose Frequency-Aware Flow Matching (FAFM), which outputs continuous, temporally consistent actions. To handle heterogeneous frequency input, we transform discrete action sequences into the frequency domain with the discrete cosine transform (DCT), perform flow matching over the resulting coefficients, and reconstruct continuous actions via cosine basis expansion. To generate temporally consistent actions, we regularize the first-order temporal derivative to promote smooth actions. This corresponds to a Sobolev-type constraint that suppresses high-frequency errors and discourages abrupt action changes. Our FAFM is simple, introduces no additional network parameters, and applies to both standalone flow-matching policies and vision-language-action models. Across a synthetic toy benchmark, obstacle avoidance, LapGym, and LIBERO, FAFM improves success rates, multimodal expressivity, motion smoothness, convergence speed, and robustness to mechanical bias and mixed-frequency input. These gains are consistent when deployed on a real-world Franka robot. Code available at https://anonymous.4open.science/r/FAFM.
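
The DCT round trip described in this abstract can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the FAFM implementation: it uses an unnormalized DCT-II on a hypothetical one-dimensional action chunk, evaluates the cosine expansion at arbitrary normalized times, and computes the integral of the squared first derivative in closed form to show why such a penalty weights the k-th coefficient by k squared. The flow-matching step itself is omitted.

```python
import math

def dct2(x):
    """Unnormalized DCT-II of a discrete action sequence."""
    n = len(x)
    return [
        sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        for k in range(n)
    ]

def reconstruct(coeffs, t):
    """Evaluate the cosine basis expansion at normalized time t in [0, 1]."""
    n = len(coeffs)
    tail = sum(coeffs[k] * math.cos(math.pi * k * t) for k in range(1, n))
    return coeffs[0] / n + (2.0 / n) * tail

def derivative_energy(coeffs):
    """Integral over [0, 1] of (dx/dt)^2 for the expansion above.

    The sine terms are orthogonal on [0, 1], so each coefficient
    contributes 0.5 * ((2/n) * c_k * pi * k)^2: higher k costs more.
    """
    n = len(coeffs)
    return sum(0.5 * ((2.0 / n) * coeffs[k] * math.pi * k) ** 2 for k in range(1, n))

# Hypothetical 1-D action chunk recorded at 8 steps.
actions = [0.0, 0.1, 0.35, 0.6, 0.8, 0.9, 0.95, 1.0]
coeffs = dct2(actions)
n = len(actions)

# Sampling at the original step centers recovers the input ...
recovered = [reconstruct(coeffs, (i + 0.5) / n) for i in range(n)]
# ... while any other rate (here 20 steps) reads off the same continuous curve.
resampled = [reconstruct(coeffs, (j + 0.5) / 20) for j in range(20)]
```

Because the coefficients describe a continuous curve on [0, 1], demonstrations recorded at different control frequencies can share one coefficient space, which is the property the abstract relies on.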

2606.20134 2026-06-19 cs.LO cs.PL 新提交

An MSO Framework for Weak-Memory Verification and Robustness

弱内存验证与鲁棒性的MSO框架

Giovanna Kobus Conrado, Andreas Pavlogiannis

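AI summary: Studies monadic second-order logic (MSO) as a metatheory for weak memory, proves that executions under Sequential Consistency have bounded treewidth while those under TSO do not, shows that a range of models are MSO-axiomatizable, and introduces reads-from robustness, yielding a uniform verification algorithm.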

Comments Accepted at CONCUR 2026

详情
AI中文摘要

内存模型是并发程序执行的形式化规范,解释了编译器和架构优化引入的弱行为。其数量和复杂性的增加促使人们通过在适当的元理论中公理化模型来统一验证整个模型类别。本文正式研究单子二阶逻辑(MSO)作为弱内存的元理论,通过证明各种流行弱内存模型的树宽和MSO可表达性结果,使得我们能够统一处理多个验证问题。总结如下:首先,我们证明顺序一致性($\mathsf{SC}$)下的执行具有有界树宽,而总存储顺序($\mathsf{TSO}$)下的执行则无界。其次,我们证明包括Release/Acquire和完整RC20在内的广泛模型是MSO可公理化的,而其他模型如Strong Release/Acquire和$\mathsf{TSO}$则不可,除非正交向量问题(在SETH下需要二次时间)可以在线性时间内解决。最后,我们引入读自鲁棒性概念,作为对近期粗粒度鲁棒性准则工作的扩展。我们证明树宽界限(上界和下界)对任何MSO可公理化模型$\mathsf{MM}$具有深远的算法意义:存在一个算法,对于每个程序$\mathsf{P}$,要么验证$\mathsf{P}$在$\mathsf{MM}$下的正确性,要么报告$\mathsf{P}$对$\mathsf{MM}$不是读自鲁棒的。总体而言,我们的结果为弱内存验证和鲁棒性建立了一个丰富且多功能的理论框架。

英文摘要

Memory models are formal specifications of concurrent-program executions, accounting for weak behaviors introduced by compiler and architectural optimizations. The increase of their number and complexity has spawned efforts for uniform verification across whole classes of models, by axiomatizing the models in an adequate metatheory that admits a uniform treatment. In this work, we formally study Monadic Second-Order logic (MSO) as a metatheory for weak memory, by proving results on the treewidth and MSO-expressibility of various popular weak-memory models, as this combination allows us to uniformly tackle several verification problems. In summary, our results are as follows. First, we prove that executions under Sequential Consistency ($\mathsf{SC}$) have bounded treewidth, while already those under Total Store Order ($\mathsf{TSO}$) do not. Second, we prove that a broad range of models, including Release/Acquire and the full RC20, are MSO-axiomatizable, while others, such as Strong Release/Acquire and $\mathsf{TSO}$, are not, unless the Orthogonal Vectors problem (which requires quadratic time under SETH) can be solved in linear time. Finally, we introduce the notion of reads-from robustness, as an extension to recent work on coarse robustness criteria. We show that our treewidth bounds (both upper and lower) have far-reaching algorithmic implications for any of our MSO-axiomatizable models $\mathsf{MM}$: there is an algorithm that, for every program $\mathsf{P}$, either verifies $\mathsf{P}$ under $\mathsf{MM}$ or reports that $\mathsf{P}$ is not reads-from robust against $\mathsf{MM}$. Overall, our results establish a rich and versatile theoretical framework for weak-memory verification and robustness.

2606.20131 2026-06-19 cs.CV cs.GR 新提交

TriFlow: Generating Artist-Like 3D Mesh Topology via Nearest-Vertex Vector Fields

TriFlow: 通过最近顶点向量场生成类艺术家3D网格拓扑

Haoxuan Li, Ziya Erkoç, Daniele Sirigatti, Vladislav Rosov, Lei Li, Angela Dai, Matthias Nießner

Affiliations * Technical University of Munich; AUDI AG; University of Virginia

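AI summary: Proposes TriFlow, a generative method built on a nearest-vertex vector field (NVF): a flow-matching model synthesizes the NVF, which then guides topology-aware mesh simplification to produce compact 3D meshes with artist-like topology directly from input geometry conditions.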

详情
AI中文摘要

我们提出了TriFlow,一种新的生成方法,能够直接从输入几何条件(如符号距离场)生成具有类艺术家三角形拓扑的紧凑3D网格。我们的关键见解是将网格拓扑表示为在表面上定义的最近顶点向量场(NVF),其中每个点编码其在局部重心坐标系中与最近三角形顶点的关联。我们训练一个潜在流匹配模型来合成该场,从而实现基于输入几何条件的拓扑生成。为了提取连贯的网格,我们使用生成的NVF对表面区域进行聚类,并引导具有拓扑感知优化的约束二次误差度量(QEM)网格简化。这产生了与输入几何紧密匹配且具有结构化、类艺术家连接性的输出网格。实验表明,与最先进的基于学习方法相比,TriFlow实现了更强的泛化能力和显著提高的拓扑质量,同时Chamfer距离降低了90%,速度提升了8倍。

英文摘要

We present TriFlow, a new generative approach for producing compact 3D meshes with artist-like triangle topology directly from input geometry conditions such as signed distance fields. Our key insight is to represent mesh topology as a nearest-vertex vector field (NVF) defined over the surface, where each point encodes its association to the nearest triangle vertex in the local barycentric frame. We train a latent flow-matching model to synthesize this field, enabling topology generation conditioned on the input geometry. To extract a coherent mesh, we cluster surface regions using the generated NVF and guide a constrained quadric error metric (QEM) mesh simplification with topology-aware optimization. This yields output meshes that closely match the input geometry while exhibiting structured, artist-like connectivity. Experiments demonstrate that TriFlow achieves stronger generalization and significantly improved topology quality compared to state-of-the-art learning-based approaches, alongside 90% lower Chamfer Distance and an 8x speedup.

2606.20130 2026-06-19 cs.CV 新提交

SAM3 Self-Distillation for Fine-Grained GOOSE 2D Semantic Segmentation

SAM3自蒸馏用于细粒度GOOSE 2D语义分割

Xuesong Wang

Affiliations * Wayne State University

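AI summary: Proposes a segmentation model that pairs the SAM3 image encoder with a lightweight decoder and, through self-distillation, multi-scale test-time augmentation, and a transplanted photometric distortion, reaches 69.73% mIoU in the GOOSE 2D challenge.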

Comments 4th place in ICRA 2026 GOOSE 2D Semantic Segmentation Challenge

详情
AI中文摘要

我们描述了在ICRA 2026 GOOSE 2D细粒度语义分割挑战赛中获得第四名的方案,该方案在官方1815张图像测试集上达到了69.73%的复合平均交并比(mIoU)。我们的模型适配了近期视觉基础模型Segment Anything Model 3(SAM3)的图像编码器,并搭配轻量级解码器。除此之外,我们贡献了两项技术和一项经验发现:(i)一种自蒸馏方案,该方案重新利用SAM3本身,以真实边界框作为提示,在SAM3性能优于我们自身模型的类别上充当教师;(ii)一种图像级多尺度测试时增强方案,通过重新缩放图像而非模型输入,为固定输入尺寸的模型恢复多尺度推理;(iii)一项发现:来自2025年GOOSE 2D获胜方案的一种激进光度畸变,移植到我们的流程中,是单一最大的改进来源。

英文摘要

We describe our 4th-place entry to the ICRA 2026 GOOSE 2D Fine-Grained Semantic Segmentation Challenge, which reached a composite mean Intersection-over-Union (mIoU) of 69.73% on the official 1,815-image test set. Our model adapts the image encoder of a recent visual foundation model, Segment Anything Model 3 (SAM3), with a lightweight decoder. Beyond this, we contribute two techniques and one empirical finding: (i) a self-distillation scheme that re-uses SAM3 itself, prompted with ground-truth boxes, as a teacher on the classes where it outperforms our own model; (ii) an image-level multi-scale test-time augmentation scheme that restores multi-scale inference for a fixed-input-size model by rescaling the image rather than the model input; and (iii) the finding that an aggressive photometric distortion from a winning 2025 GOOSE 2D entry, transplanted onto our pipeline, is its single largest source of improvement.

2606.20129 2026-06-19 cs.SE 新提交

Learning Critical Testing Literacy Through Puzzles: an Experience Report

通过谜题学习关键测试素养:经验报告

Niels Doorn, Bart Th. Knaack, Tanja E. J. Vos, Beatriz Marín

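AI summary: Reports experience from 13 workshops that used puzzles to teach critical testing literacy (CTL), finds that the intervention is the full sequence of solving, debriefing, and reflecting rather than the puzzles alone, and describes an open-source web application with built-in analytics.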

详情
AI中文摘要

在本文中,我们报告了使用谜题学习CTL的工作坊经验和收获。背景:软件测试重要但难以教授。我们引入了一个基于谜题的学习活动知识体系来教授CTL,该体系基于关键测试者认知模型,形成了P4TEST教学框架。我们与学生、测试人员、教师和小学生共举办了13次工作坊,评估基于谜题的关键测试素养教学。经验:在11次工作坊中,我们采用半结构化方法,变化谜题、材料和时长。在另外两次工作坊中,我们引入了工作手册和出声思考环节,以收集更多关于学习体验的数据。观察:参与者普遍认为自己在解谜时进行实验。学生倾向于收敛于解决方案,而专业人员继续探索。情绪在行为中可见,但难以通过书面反思单独浮现。出声思考环节揭示了即时推理;书面反思引发了更多元认知反思。主题“意义建构/行动中反思”捕捉了参与者如何构建问题、应对死胡同和转变策略。反思:谜题本身并非干预手段;解谜、汇报和反思的完整序列才是。更刻意地设计这一序列是未来的工作。我们还开发了一个带有内置分析功能的开源网络应用程序,用于定制工作坊。

英文摘要

In this paper, we report our experiences and takeaways from workshops using puzzles to learn CTL. Background: Software testing is important yet difficult to teach. We introduced a BoK of puzzle-based learning activities to teach CTL, based on a model of critical tester's cognition, leading to the pedagogical framework P4TEST. We conducted thirteen workshops with students, testers, teachers, and primary school pupils to assess puzzle-based teaching of critical testing literacy. Experience: Across eleven workshops, we used a semi-structured approach, varying puzzles, materials, and timing. In two additional workshops, we introduced workbooks and think-aloud sessions to gather more data on the learning experience. Observations: Participants consistently perceived themselves as experimenting while solving puzzles. Students tended to converge on solutions, while professionals continued exploring. Emotions were visible in behaviour but hard to surface through written reflection alone. Think-aloud sessions revealed immediate reasoning; written reflections elicited more meta-cognitive reflection. The theme Sensemaking / reflection-in-action captured how participants framed problems, navigated dead ends, and shifted strategies. Reflections: Puzzles are not the intervention: the entire sequence of solving, debriefing, and reflecting is. Designing that sequence more deliberately is the work ahead. We also developed an open-source web application with built-in analytics to customise workshops.

2606.20128 2026-06-19 cs.SE cs.DC cs.LG 新提交

The Correctness Illusion in LLM-Generated GPU Kernels

LLM生成的GPU内核中的正确性错觉

Dipankar Sarkar

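AI summary: Using a high-precision CPU reference and op-schema-aware fuzzing, shows that the fixed-shape allclose checks used by existing benchmarks miss LLM-style transcription bugs, and proposes and validates a new protocol.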

Comments 10 pages, 2 figures, LNCS format. Companion papers to follow on arXiv next week; IDs will be added in a v2 replace

详情
AI中文摘要

针对LLM生成的GPU内核的基准测试(KernelBench、TritonBench、GEAK)通过固定形状、小样本的allclose风格检查来评分正确性。不同基准测试的输入数量不同。每个内核的形状、数据类型和容差是固定的。我们凭经验测试了该oracle。我们构建了一个包含24个Triton和CPU替代内核(15个正确对照和9个带有记录转录错误的LLM风格错误变体)的受控语料库,并在操作模式感知的种子模糊测试下,使用高精度(fp64)CPU参考和每个(操作,数据类型)的绝对容差重新评估。种子oracle标记了9个错误内核中的9个,并通过了15个正确对照中的15个,对照的精度成本为零。我们将语料库扩展到26个操作(添加一个flash-attention对),并在五类GPU(RTX 3060、A10、L40S、A100 SXM4、H100 NVL)上重新运行相同的协议。所有五个GPU的判定结果相同:10个错觉中的10个被捕获,16个对照中的16个干净。语料库结果涉及LLM风格的转录错误,这些错误被单形状allclose oracle认证为正确,而不涉及任何特定部署的LLM的错误率。每个标记的失败都从存储的种子逐字节重放。

英文摘要

Benchmarks for LLM-generated GPU kernels (KernelBench, TritonBench, GEAK) score correctness through fixed-shape, small-sample allclose-style checks. The number of inputs varies between benchmarks. The shape, dtype, and tolerance are fixed for each kernel. We test that oracle empirically. We construct a controlled corpus of 24 Triton and CPU stand-in kernels (15 correct controls and 9 LLM-style buggy variants seeded with documented transcription errors) and re-evaluate it under op-schema-aware seeded fuzzing with a high-precision (fp64) CPU reference and per-(op, dtype) absolute tolerances. The seeded oracle flags 9 of 9 buggy kernels and passes 15 of 15 correct controls, at zero precision cost on controls. We extend the corpus to 26 ops (adding a flash-attention pair) and re-run the same protocol on five GPU classes (RTX 3060, A10, L40S, A100 SXM4, H100 NVL). The verdicts are identical across all five GPUs: 10 of 10 illusions caught and 16 of 16 controls clean. The corpus result is about LLM-style transcription bugs that the allclose-on-one-shape oracle certifies as correct, not about the bug rate of any specific deployed LLM. Every flagged failure replays byte-for-byte from a stored seed.
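
The gap between a fixed-shape check and seeded shape fuzzing can be shown with a toy example. This is a sketch under stated assumptions, not the paper's harness or corpus: the "kernel" is a pure-Python blocked sum, the bug (dropping the tail when the length is not a multiple of the block size) is a hypothetical stand-in for a missing boundary mask, and `math.fsum` plays the role of the high-precision reference.

```python
import math
import random

BLOCK = 4  # hypothetical block size of the toy kernel

def reference_sum(x):
    """High-precision reference (stands in for the fp64 CPU reference)."""
    return math.fsum(x)

def buggy_blocked_sum(x):
    """Sums full blocks only: the tail is silently dropped (missing mask)."""
    total = 0.0
    for start in range(0, len(x) - len(x) % BLOCK, BLOCK):
        total += sum(x[start:start + BLOCK])
    return total

def correct_blocked_sum(x):
    """Same blocking, but the last partial block is included."""
    total = 0.0
    for start in range(0, len(x), BLOCK):
        total += sum(x[start:start + BLOCK])
    return total

def close(a, b, atol=1e-9):
    """Absolute-tolerance comparison against the reference."""
    return abs(a - b) <= atol

def fixed_shape_check(kernel):
    """One input of length 8, a multiple of BLOCK, as in a fixed-shape oracle."""
    rng = random.Random(0)
    x = [rng.uniform(-1.0, 1.0) for _ in range(8)]
    return close(kernel(x), reference_sum(x))

def seeded_fuzz(kernel, seed, trials=50):
    """Fuzz over lengths; return the first failing (seed, trial, length) or None."""
    rng = random.Random(seed)
    for trial in range(trials):
        length = rng.randint(1, 64)
        x = [rng.uniform(-1.0, 1.0) for _ in range(length)]
        if not close(kernel(x), reference_sum(x)):
            return (seed, trial, length)
    return None

failure = seeded_fuzz(buggy_blocked_sum, seed=123)
replayed = seeded_fuzz(buggy_blocked_sum, seed=123)  # same seed, same failure
```

Both kernels pass `fixed_shape_check`, but only the correct one survives the fuzzing loop, and storing the seed is enough to replay the failing input exactly.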

2606.20127 2026-06-19 eess.SY cs.SY 新提交

Contraction-based Neural Control for Cooperative Aerial Payload Transportation with Variable-length Cables

基于收缩的神经控制用于可变长度缆绳的协同空中载荷运输

Yi Lok Lo, Longhao Qian, Hugh H. T. Liu

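AI summary: Proposes a neural nonlinear control framework for a multi-drone slung payload system: with the dynamics cast in a decoupled structure, a neural control contraction metric (CCM) controller and a neural feedback controller are trained jointly for payload trajectory tracking, and the variable-length cables are used for obstacle avoidance.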

Comments Submitted for publication in AIAA Scitech 2027

详情
AI中文摘要

本文提出了一种新颖的神经非线性控制框架,用于具有可变长度缆绳和刚体载荷的多无人机吊挂载荷系统。运动方程被表述为解耦结构,其中载荷和缆绳长度动力学由独立控制通道控制,便于在降阶子系统上进行模块化控制器设计。联合训练神经控制收缩度量(CCM)控制器和神经反馈控制器,以强制执行载荷子系统的收缩条件。另外,推导了一种缆绳长度控制律,利用可变长度自由度进行避障。数值模拟展示了在提出的控制框架下,刚体载荷的轨迹跟踪和整个系统的门穿越能力。

英文摘要

This paper presents a novel neural nonlinear control framework for a multi-drone slung payload system with variable-length cables and a rigid-body payload. The equations of motion are formulated into a decoupled structure, where the payload and cable length dynamics are governed by independent control channels, facilitating modularized controller design on reduced-order subsystems. A neural control contraction metric (CCM) controller and a neural feedback controller are jointly trained to enforce contraction conditions for the payload subsystem. Separately, a cable length control law is derived that exploits the variable-length degree of freedom for obstacle avoidance. Numerical simulations demonstrate trajectory tracking of a rigid-body payload and gate traversal capabilities of the overall system under the proposed control framework.

2606.20122 2026-06-19 cs.AI cs.MA 新提交

ScaffoldAgent: Utility-Guided Dynamic Outline Optimization for Open-Ended Deep Research

ScaffoldAgent: 面向开放式深度研究的效用引导动态大纲优化

Zhibang Yang, Xinke Jiang, Yuzhen Xiao, Ruizhe Zhang, Yue Fang, XinFei Wan, Zhengxing Song, Yuxuan Liu, Yuheng Huang, Xu Chu, Junfeng Zhao, Yasha Wang

Affiliations * National Engineering Research Center of Software Engineering, Peking University; School of Computer Science, Peking University; Key Laboratory of High Confidence Software Technologies, Ministry of Education; GRG Banking Equipment Co., Ltd.; Center on Frontiers of Computing Studies, Peking University; Peking University Information Technology Institute (Tianjin Binhai)

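AI summary: Proposes ScaffoldAgent, a framework that addresses scaffold drift in open-ended deep research through utility-guided dynamic outline optimization (Expansion, Contraction, and Revision operations), improving long-form report generation and factual grounding on DeepResearch Bench and DeepResearch Gym.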

Comments 9 pages, 6 figures

详情
AI中文摘要

开放式深度研究(OEDR)要求系统通过多轮检索获取知识并生成连贯的长篇报告。大纲作为协调检索、证据组织和生成的结构性支架起着核心作用。然而,现有方法要么在写作前固定大纲,要么使用局部启发式方法进行优化,导致在持续信息积累下出现大纲漂移,且评估大纲修改的反馈延迟。我们提出ScaffoldAgent,一种面向OEDR的效用引导动态大纲优化框架。ScaffoldAgent将大纲演化建模为结构化决策过程,包含三种操作:扩展、收缩和修订,从而实现对报告支架的受控更新。它进一步引入效用引导的反馈机制,通过检索增益、结构连贯性和试生成质量来估计每个大纲操作的下游价值。得到的效用信号指导推理过程中的节点选择、操作调度和终止。在DeepResearch Bench和DeepResearch Gym上的实验表明,ScaffoldAgent在长报告生成和事实基础上持续优于现有的深度研究智能体。

英文摘要

Open-ended deep research (OEDR) requires systems to acquire knowledge through multi-round retrieval and generate coherent long-form reports. The outline plays a central role as a structural scaffold that coordinates retrieval, evidence organization, and generation. However, existing methods either fix the outline before writing or refine it with local heuristics, leading to scaffold drift under continuous information accumulation and delayed feedback for evaluating outline modifications. We propose ScaffoldAgent, a utility-guided dynamic outline optimization framework for OEDR. ScaffoldAgent models outline evolution as a structured decision process with three operations: Expansion, Contraction, and Revision, enabling controlled updates to the report scaffold. It further introduces a utility-guided feedback mechanism that estimates the downstream value of each outline operation from retrieval gain, structural coherence, and trial-generation quality. The resulting utility signal guides node selection, operation scheduling, and termination during inference. Experiments on DeepResearch Bench and DeepResearch Gym show that ScaffoldAgent consistently improves long-form report generation and factual grounding over existing deep research agents.
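
A utility-guided choice among outline operations might look like the sketch below. Everything concrete here is an assumption made for illustration: the abstract names the three signals (retrieval gain, structural coherence, trial-generation quality) but not how they are combined, so the weighted sum, the weights, the stopping threshold, and the candidate values are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    """One proposed edit to the outline (all field values are hypothetical)."""
    node: str
    operation: str          # "expand", "contract", or "revise"
    retrieval_gain: float   # new evidence the edit is expected to unlock
    coherence: float        # structural fit with the rest of the outline
    trial_quality: float    # score of a short trial generation

def utility(c, weights=(0.4, 0.3, 0.3)):
    """Assumed combination rule: a fixed weighted sum of the three signals."""
    w_r, w_c, w_q = weights
    return w_r * c.retrieval_gain + w_c * c.coherence + w_q * c.trial_quality

def select_operation(candidates, stop_threshold=0.5):
    """Return the highest-utility candidate, or None to stop editing."""
    if not candidates:
        return None
    best = max(candidates, key=utility)
    return best if utility(best) >= stop_threshold else None

candidates = [
    Candidate("2.1 Methods", "expand", 0.8, 0.6, 0.7),
    Candidate("3 Results", "revise", 0.2, 0.9, 0.5),
    Candidate("4.2 Appendix", "contract", 0.1, 0.3, 0.2),
]
chosen = select_operation(candidates)
```

Returning `None` when no candidate clears the threshold gives a single place where node selection, operation scheduling, and termination are decided from the same signal.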

2606.20121 2026-06-19 cs.LO 新提交

BARReL: a modern backend for Atelier B in Lean

BARReL:Atelier B 在 Lean 中的现代后端

Ghilain Bergeron, Vincent Trélat

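AI summary: BARReL is a Lean 4 library that bridges Atelier B, an industrial tool for the B method, and the Lean proof assistant. It supports interactive B developments inside Lean, encodes partial operators through explicit well-definedness conditions enforced with dependent types, and provides basic automation.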

详情
AI中文摘要

BARReL 是一个 Lean 4 库,桥接了工业 B 方法工具 Atelier B 与 Lean 证明助手,使用户能够在 Lean 中交互式地进行形式化 B 开发(直至机器精化和实现),同时保留标准 B 语法。B 部分算子通过生成显式的良定义条件进行仔细编码,利用 Lean 的依赖类型从构造上强制实施良定义性纪律。也就是说,证明义务和证明步骤不能静默地依赖于类型错误或定义不当的实例化。BARReL 还具备基本自动化功能,尝试自动处理此类良定义条件。该实现完全使用 Lean 元编程编写,并设计为模块化:扩展支持的 B 片段通常只需添加新的语法和编码子句。我们通过一个小型但具有代表性的案例研究说明了该方法,并论证 BARReL 可以作为迈向基于 Lean 证明助手的高度可靠的 Atelier B 工具链的垫脚石。

英文摘要

BARReL is a Lean 4 library bridging Atelier B, an industrial tool for the B method, and the Lean proof assistant by enabling users to conduct their formal B developments -- up to machine refinement and implementation -- interactively inside Lean, while retaining standard B syntax. B partial operators are carefully encoded by generating explicit well-definedness conditions, leveraging Lean's dependent types to enforce a well-definedness discipline by construction. That is, proof obligations and proof steps cannot silently rely on ill-typed or ill-defined instantiations. BARReL also features basic automation to try to discharge such well-definedness conditions automatically. The implementation is written entirely using Lean meta-programming and is designed to be modular: extending the supported B fragment typically requires only adding new syntax and encoding clauses. We illustrate the approach on a small but representative case study, and argue that BARReL can act as a stepping stone towards a strongly reliable Atelier B toolchain grounded in the Lean proof assistant.

2606.20120 2026-06-19 cs.RO cs.AI 新提交

Dual-Agent Framework for Cross-Model Verified Translation of Natural-Language Protocols into Robotic Laboratory Platform

用于将自然语言协议翻译为机器人实验室平台的双智能体跨模型验证框架

Hyeonna Choi, Jung Yup Kim, Hyuneui Lim, Seunggyu Jeon

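AI summary: Proposes a dual-agent framework in which a Parser Agent formalizes the protocol, a rule-based mapping engine generates control commands, and a heterogeneous LLM Validation Agent catches errors, converting natural-language microplate protocols into executable commands for a robotic platform; end-to-end autonomous execution is demonstrated.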

详情
AI中文摘要

生物实验协议以自然语言编写,而自动化系统依赖预定义控制命令,这造成了限制自主执行的语义鸿沟。微孔板自动实验由于需要同时控制孔映射、样本-试剂组合、重复放置和平行分配而尤其具有挑战性。本研究提出一种基于智能体的协议翻译框架,将自然语言微孔板协议转换为机器人实验室平台的可执行控制命令。解析器智能体将自然语言协议形式化为结构化表示,基于规则的映射引擎确定性地融入机器人实验室平台的操作约束以生成设备级控制命令。异构LLM验证器检查完整性、参数准确性和执行顺序,并在检测到错误时触发带有结构化反馈的自校正循环。在随机选择的ELISA协议上对7个解析器和3个验证器进行扫描,评估模型规模和验证器类型在跨模型验证下对翻译准确率和通过率的影响。通过将所提框架的基于规则映射与LLM端到端直接映射进行比较,进一步验证了准确率-延迟权衡。最后,在机器人实验室平台上演示了基于Bradford法的微孔板蛋白质定量,验证了从自然语言协议到真实实验的端到端自主执行。所提框架为缩小自然语言协议与基于微孔板的自主实验室之间的语义鸿沟提供了一种灵活方法。

英文摘要

Biological experiment protocols are written in natural language, whereas automation systems rely on predefined control commands, creating a semantic gap that limits autonomous execution. Microplate-based automatic experiments are particularly challenging due to the need to simultaneously control well mapping, sample-reagent combinations, replicate placement, and parallel dispensing. This study proposes an agent-based protocol translation framework that converts natural-language microplate-based protocols into executable control commands for a robotic laboratory platform. A Parser Agent formalizes the natural-language protocol into a structured representation, and a rule-based mapping engine deterministically incorporates the operational constraints of the robotic laboratory platform to generate device-level control commands. A heterogeneous LLM Validation Agent verifies completeness, parameter accuracy, and execution order, and triggers a self-correction loop with structured feedback when errors are detected. A sweep involving 7 Parsers and 3 Validators on randomly selected ELISA protocols evaluates how model scale and Validator type affect translation accuracy and pass rates under cross-model verification. The accuracy-latency trade-off is further verified by comparing the rule-based mapping of the proposed framework with LLM end-to-end direct mapping. Finally, Bradford assay-based protein quantification using a microplate was demonstrated on a robotic laboratory platform, validating end-to-end autonomous execution from natural-language protocols to real-world experiments. The proposed framework provides a flexible approach to narrowing the semantic gap between natural-language protocols and microplate-based self-driving laboratories.

2606.20118 2026-06-19 cs.RO cs.LG 新提交

Pose6DAug: Physically Plausible Multi-view Object Swapping for Robot Data Augmentation

Pose6DAug: 用于机器人数据增强的物理合理多视图物体替换

Jonghoon Lee, Seong Hyeon Park, Byungwoo Jeon, Minha Lee, Jinwoo Shin

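AI summary: Proposes Pose6DAug, a failure-driven data augmentation framework that swaps the object in successful episodes using a 3D mesh and a 6D pose trajectory, producing multi-view-consistent, physically plausible demonstrations without new data collection and improving VLA success rates on novel objects by 16.5% relative to the state-of-the-art baseline.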

详情
AI中文摘要

视觉-语言-动作(VLA)策略在通用操作中展现出强大潜力,但在外观或几何形状偏离训练分布的新型分布外物体上常常失败。标准的补救措施是为每个失败案例收集多视图遥操作数据,但这在成本和时间上扩展性差。我们提出Pose6DAug,一种失败驱动的数据增强框架,将策略自身的成功回合转化为针对其失败模式的目标演示,无需任何新数据收集。我们的关键洞察是,每个成功回合已经编码了一个物理有效的动作轨迹以及校准的多视图观测。通过仅替换被操作物体同时保留该轨迹,我们获得新的且物理基础的演示。然而,简单的2D视频编辑会破坏多视图一致性和物理合理性,特别是在严重遮挡和以自我为中心的视角下。我们的方法直接在3D中操作,通过时间一致的6D姿态轨迹驱动的显式网格锚定目标物体,确保所有相机视图的几何一致渲染。在我们方法增强的数据上微调VLA,相对于最先进的基线,在新型物体上的成功率提高了16.5%,同时保持了分布内性能。这些结果表明,多视图和物理一致的增强是实现可扩展VLA泛化的实用途径。

英文摘要

Vision-language-action (VLA) policies have shown strong potential for general-purpose manipulation, yet they often fail on novel, out-of-distribution objects whose appearance or geometry deviates from the training distribution. The standard remedy is to collect multi-view teleoperation data for every failure case, but this scales poorly in both cost and time. We introduce Pose6DAug, a failure-driven data augmentation framework that turns a policy's own successful episodes into targeted demonstrations for its failure modes, without any new data collection. Our key insight is that each successful episode already encodes a physically valid action trajectory together with calibrated multi-view observations. By swapping only the manipulated object while preserving this trajectory, we obtain new and physically grounded demonstrations. However, naive 2D video editing breaks multi-view consistency and physical plausibility, particularly under heavy occlusion and egocentric viewpoints. Our method instead operates directly in 3D, anchoring the target object with an explicit mesh driven by a temporally coherent 6D pose trajectory, ensuring geometrically consistent renderings across all camera views. Fine-tuning a VLA on data augmented by our method improves success rates by 16.5% relative to the state-of-the-art baseline on novel objects, while preserving in-distribution performance. These results show that multi-view and physically consistent augmentation is a practical path to scalable VLA generalization.

2606.20117 2026-06-19 cs.CE 新提交

Autoregressive Modelling and Synthetic Generation of High-Fidelity, Statistically Equivalent 3D Microstructures for As-Manufactured Misalignments in Fiber-Reinforced Composites

面向纤维增强复合材料中制造偏差的高保真、统计等效三维微观结构的自回归建模与合成生成

Mohamad A. Raja, Clemens Dransfeld, Boyang Chen

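AI summary: Proposes an integrated framework that extracts fiber misalignment descriptors from X-ray μCT data, models them with copula, autoregressive, and extreme-misalignment components calibrated by Bayesian optimization, and iteratively generates about 2400 non-overlapping synthetic fibers with statistical deviations generally below 10%.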

详情
AI中文摘要

本研究提出一个集成框架,用于从实验X射线-μCT观测中处理、建模和生成统计代表性的三维纤维微观结构。首先,引入一种解析的切片-段椭圆相交方法,沿纤维深度提取每切片和每纤维的面内和面外错位轮廓。然后利用这些描述符构建一个随机模型,通过基于copula的面内依赖性、潜在自回归连续性和罕见极端错位模式,捕获切片级错位分布及其沿深度的演变。模型超参数通过贝叶斯优化校准,与原始统计描述符达到高度一致,偏差通常低于10%。优化后的统计模型与物理生成策略相结合,该策略从可变半径纤维种子层开始,通过逐切片迭代的三维生长方案进行,其中统计层引导纤维演化,基于Delaunay的邻域构建与基于椭圆的接触分辨率确保非重叠、半径增强的合成微观结构。该框架成功生成约2400根合成纤维,同时保持对原始X射线-μCT数据的强统计保真度。所提出的管道为生成统计等效、几何可接受且可立即用于仿真的纤维复合材料微观结构提供了一条有前景且可扩展的途径,用于虚拟测试和分析。

英文摘要

This study presents an integrated framework for processing, modelling, and generating statistically representative three-dimensional fiber microstructures from experimental X-ray $\mu$CT observations. First, an analytical slice-segment ellipse-intersection method is introduced to extract per-slice and per-fiber in-plane and out-of-plane misalignment profiles along the fiber depth. These descriptors are then used to construct a stochastic model that captures slice-wise misalignment distributions and their depth-wise evolution through copula-based in-plane dependence, latent autoregressive continuity, and rare extreme-misalignment motifs. The model hyperparameters are calibrated using Bayesian optimization, achieving close agreement with the original statistical descriptors, with deviations generally below 10\%. The optimized statistical model is coupled with a physical generation strategy that begins with a variable-radius fiber seeding layer and proceeds through an iterative slice-by-slice 3D growth scheme, where the statistical layer guides fiber evolution and Delaunay-based neighbourhood construction with ellipse-based contact resolution ensures non-overlapping, radius-augmented synthetic microstructures. The framework successfully generates about 2400 synthetic fibers while preserving strong statistical fidelity to the original X-ray $\mu$CT data. The proposed pipeline provides a promising and scalable route for generating statistically equivalent, geometrically admissible, and simulation-ready fiber composite microstructures for virtual testing and analysis.
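
The "latent autoregressive continuity" along the fiber depth can be pictured with a first-order autoregressive process. This is only a sketch: the abstract does not state the model order or parameter values, so the AR(1) form, `phi`, `sigma`, and the slice count below are assumptions, and the copula and extreme-motif components are left out.

```python
import random

def ar1_profile(n_slices, phi, sigma, rng, start=0.0):
    """Generate a depth-wise latent misalignment profile with an AR(1) model.

    Each slice keeps a fraction phi of the previous slice's value and adds
    Gaussian noise of scale sigma, so nearby slices stay correlated.
    """
    profile = [start]
    for _ in range(n_slices - 1):
        profile.append(phi * profile[-1] + rng.gauss(0.0, sigma))
    return profile

# Hypothetical parameters: strong slice-to-slice continuity, small noise.
rng = random.Random(42)
profile = ar1_profile(n_slices=100, phi=0.95, sigma=0.1, rng=rng)

# With the noise switched off, the profile decays geometrically.
decay = ar1_profile(n_slices=3, phi=0.5, sigma=0.0, rng=random.Random(0), start=1.0)
```

A value of `phi` close to 1 yields smooth, slowly drifting profiles, while a smaller value makes each slice nearly independent of the one before it.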

2606.20115 2026-06-19 cs.LG cs.CV 新提交

When Calibration Fails the Vulnerable Hospital: Federated Conformal Risk Control via Risk-Curve Shrinkage

当校准失败于脆弱的医院:通过风险曲线收缩实现联邦共形风险控制

Nafis Fuad Shahid

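AI summary: Addresses the failure of standard conformal risk control (CRC) to maintain coverage at individual institutions in federated deployments by proposing a federated CRC protocol based on risk-curve shrinkage, which achieves 2.7/20 violations at a 2.0x stretch of the prediction sets on real brain tumor data.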

Comments 9 pages, 3 figures, 2 tables. Submitted to the DeCaF Workshop at MICCAI 2026

详情
AI中文摘要

共形风险控制(CRC)通过在保留数据上校准预测集阈值,提供分割质量的无分布保证。在联邦部署中,标准方法将各站点的校准分数合并为一个阈值。我们在真实多机构脑肿瘤数据(FeTS-2022,1251名受试者,20个机构)上首次量化表明,这种朴素的合并CRC保护了平均医院,但违反了40%个体机构的覆盖,最差站点的假阴性率超出目标7.8个百分点。朴素的替代方案——每个站点本地CRC——基本恢复了覆盖,但将预测集扩大了83倍,使其在临床上无用。我们提出一种基于收缩的联邦CRC协议:每个站点仅将其经验风险曲线(G个标量)传输到服务器,服务器为每个站点计算收缩正则化阈值。单个超参数n0平滑地权衡最坏情况覆盖与预测集效率;留一站点敏感性分析确定n0=19,在2.0倍拉伸下实现2.7/20的违规。我们进一步表明,覆盖预算的直接拉格朗日优化失败,将风险集中在脆弱的医院,并且有限样本修正项是必不可少的:移除它会使违规增加三倍。在所述站点混合假设下,边际CRC保证通过构造得以保留;在三个种子下针对四个目标验证了每个站点的覆盖。没有患者级别的图像、掩膜或每体积分数离开任何站点。

英文摘要

Conformal risk control (CRC) provides distribution-free guarantees on segmentation quality by calibrating a prediction-set threshold on held-out data. In federated deployments, the standard approach pools calibration scores across sites into a single threshold. We provide the first quantification, on real multi-institutional brain tumor data (FeTS-2022, 1,251 subjects, 20 institutions), showing that this naive pooled CRC protects the average hospital but violates coverage at 40% of individual institutions, with the worst site exceeding the target false-negative rate by 7.8 percentage points. The naive alternative, per-site local CRC, largely restores coverage but inflates prediction sets by 83x, rendering them clinically useless. We propose a shrinkage-based federated CRC protocol: each site transmits only its empirical risk curve (G scalars) to a server, which computes a shrinkage-regularized threshold per site. A single hyperparameter n0 smoothly trades worst-case coverage for prediction-set efficiency; leave-one-site-out sensitivity analysis identifies n0=19, achieving 2.7/20 violations at 2.0x stretch. We further show that direct Lagrangian optimization of coverage budgets fails, concentrating risk on vulnerable hospitals, and that the finite-sample correction term is essential: removing it triples violations. The marginal CRC guarantee is preserved by construction under the stated site-mixture assumption; per-site coverage is validated across four targets with three seeds. No patient-level images, masks, or per-volume scores leave any site.
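
The trade-off between pooled, local, and shrinkage-based thresholds can be sketched on a toy grid. This is an illustration under explicit assumptions, not the paper's protocol: the shrinkage rule (a convex combination of the site curve and the pooled curve with weight n_k / (n_k + n0)), the use of n_k + n0 as the effective sample size, the loss bound of 1, and all risk values are hypothetical. Only n0 = 19 and the standard CRC finite-sample correction are taken from the abstract and the CRC literature.

```python
def pooled_curve(curves, counts):
    """Sample-size-weighted average of the per-site empirical risk curves."""
    total = sum(counts)
    grid = len(curves[0])
    return [sum(n * c[g] for c, n in zip(curves, counts)) / total for g in range(grid)]

def shrink_curve(site_curve, site_n, pooled, n0):
    """Assumed shrinkage rule: pull the site curve toward the pooled curve."""
    w = site_n / (site_n + n0)
    return [w * s + (1.0 - w) * p for s, p in zip(site_curve, pooled)]

def crc_threshold(curve, lambdas, n, alpha, bound=1.0):
    """Smallest lambda whose finite-sample-corrected risk is at most alpha."""
    for lam, risk in zip(lambdas, curve):
        if (n / (n + 1)) * risk + bound / (n + 1) <= alpha:
            return lam
    return lambdas[-1]  # fall back to the most conservative setting

# Hypothetical false-negative risk curves on a grid of G = 5 thresholds.
lambdas = [0.0, 0.25, 0.5, 0.75, 1.0]
big_site = [0.30, 0.15, 0.08, 0.03, 0.0]    # 200 calibration cases
small_site = [0.50, 0.35, 0.20, 0.10, 0.0]  # 10 calibration cases
counts = [200, 10]
alpha, n0 = 0.10, 19

pooled = pooled_curve([big_site, small_site], counts)
lam_pooled = crc_threshold(pooled, lambdas, sum(counts), alpha)
lam_local = crc_threshold(small_site, lambdas, counts[1], alpha)
shrunk = shrink_curve(small_site, counts[1], pooled, n0)
lam_shrunk = crc_threshold(shrunk, lambdas, counts[1] + n0, alpha)
```

In this toy setting the pooled threshold (0.5) leaves the small site at a risk of 0.20, the local threshold (1.0) is maximally conservative, and the shrunk threshold (0.75) lands between the two, which mirrors the trade-off that n0 is said to control.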

2606.20113 2026-06-19 cs.CL cs.IR 新提交

When Does Streaming Tool Use Help? Characterizing Tool-Intent Stabilization in Streaming Retrieval-Augmented Generation

流式工具使用何时有帮助?表征流式检索增强生成中的工具意图稳定化

Elroy Galbraith

Affiliations * SMG Labs

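AI summary: Measures tool-intent stabilization (the point at which a speculative query's retrieval converges to the answer-bearing result) to analyze latency hiding in streaming RAG on the CRAG benchmark, finds that 73.9% of queries admit substantial latency hiding, and identifies predictors of early versus late stabilization.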

详情
AI中文摘要

流式检索增强生成(Streaming RAG)通过在用户输入完成前并行发出工具查询来减少用户感知的延迟。报告的性能提升是聚合性的,但该机制的好处本质上是查询内在的:只有当正确的工具查询在用户停止说话或打字之前变得可确定时,推测才有帮助。我们隔离并测量了这一属性——工具意图稳定化,即输入流中推测查询的检索收敛到包含答案的结果的时间点。在CRAG基准(1371个验证问题)上,我们(i)测量了稳定化的分布,(ii)推导出一个与模型无关的界限H,表示可以隐藏在用户剩余输入背后的工具延迟比例,该比例是工具延迟L和输入节奏δ的函数,(iii)通过一个工作流式管道验证了实际节省达到或超过此界限,(iv)识别了哪些查询属性预测早期与晚期稳定化。该研究无需模型训练,在普通CPU硬件上运行。我们发现,在现实操作点(L=600ms,δ=3w/s,θ=0.8)下,整个基准中73.9%的查询实现了显著的延迟隐藏——这一混合数字结合了21.3%的问题(其中黄金证据以原文形式存在且可被BM25检索)上的充分稳定化(在此有利切片上95.2%可流式处理)以及其余问题上的无基础top-1稳定化回退。在有利切片上,φ_suf被精确和宽松基础限定在[0.26, 0.281]之间——两者均为早期。问题类型产生了显著但粗略的早期/晚期划分(Kruskal-Wallis p=0.017, epsilon^2=0.04),直接指导了何时学习到的推测触发器值得其成本。

英文摘要

Streaming Retrieval-Augmented Generation (Streaming RAG) reduces user-perceived latency by issuing tool queries in parallel with ongoing user input, before the utterance is complete. Reported gains are aggregate, yet the mechanism's benefit is fundamentally query-intrinsic: speculation can only help when the correct tool query becomes determinable before the user stops speaking or typing. We isolate and measure this property -- tool-intent stabilization, the point in the input stream at which a speculative query's retrieval converges to the answer-bearing result. On the CRAG benchmark (1371 validation questions) we (i) measure the distribution of stabilization, (ii) derive a model-agnostic bound H on the portion of tool latency that can be hidden behind the user's remaining input, as a function of tool latency L and input cadence δ, (iii) validate against a working streaming pipeline that realized savings meet or exceed this bound, and (iv) identify which query properties predict early versus late stabilization. The study requires no model training and runs on commodity CPU hardware. We find that at a realistic operating point (L=600ms, δ=3w/s, θ=0.8), 73.9% of queries across the full benchmark admit substantial latency hiding -- a blended figure that mixes sufficiency stabilization on the 21.3% of questions where gold evidence is verbatim-present and BM25-retrievable (95.2% streamable on this favorable slice) with a grounding-free top-1-settling fallback on the remainder. On the favorable slice, ϕ_suf is bracketed to [0.26, 0.281] by exact and relaxed grounding -- both early. Question type produces a significant but coarse early/late split (Kruskal-Wallis p=0.017, epsilon^2=0.04), directly informing when a learned speculative trigger is worth its cost.
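
The intuition behind hiding tool latency behind the user's remaining input can be written down in a few lines. This is a back-of-the-envelope sketch, not the paper's bound H, whose exact form the abstract does not give: it assumes the tool call is issued the moment the query stabilizes and that the rest of the utterance arrives at a constant cadence. The word count and stabilization fractions are hypothetical; L = 0.6 s and δ = 3 words/s are the operating point quoted in the abstract, and θ is not modeled because the abstract does not define it.

```python
def hidden_fraction(words_total, stabilization_fraction, tool_latency_s, cadence_wps):
    """Fraction of tool latency that overlaps with the user's remaining input.

    Assumes the tool call starts when the query stabilizes and the user
    keeps producing words at a constant rate until the utterance ends.
    """
    remaining_words = words_total * (1.0 - stabilization_fraction)
    remaining_time_s = remaining_words / cadence_wps
    return min(1.0, remaining_time_s / tool_latency_s)

# Hypothetical 10-word question at L = 0.6 s and 3 words per second.
early = hidden_fraction(10, 0.27, tool_latency_s=0.6, cadence_wps=3.0)
late = hidden_fraction(10, 0.90, tool_latency_s=0.6, cadence_wps=3.0)
never = hidden_fraction(10, 1.00, tool_latency_s=0.6, cadence_wps=3.0)
```

Under these assumptions an early-stabilizing query hides the whole tool call, a late one hides only part of it, and a query that stabilizes at the final word hides nothing.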

2606.20112 2026-06-19 cs.CV eess.IV 新提交

Pixel-Level Residual Diffusion Transformer: Scalable 3D CT Volume Generation

像素级残差扩散Transformer:可扩展的3D CT体生成

Zhenkai Zhang, Markus Hiller, Krista A. Ehinger, Tom Drummond

Affiliations * School of Computing and Information Systems, The University of Melbourne

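AI summary: Proposes the Pixel-Level Residual Diffusion Transformer (PRDiT), which generates high-fidelity 3D CT volumes through two-stage training (a local MLP-based blind estimator separates low-frequency structure and a global residual diffusion transformer models high-frequency residuals) and outperforms existing methods on the LIDC-IDRI and RAD-ChestCT datasets.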

Comments Accepted at ICLR 2026. Code available at https://github.com/Fredy-Zhang/PRDiT

详情
AI中文摘要

由于现有生成模型固有的巨大计算需求和优化困难,生成具有精细细节的高分辨率3D CT体仍然具有挑战性。在本文中,我们提出了像素级残差扩散Transformer(PRDiT),这是一种可扩展的生成框架,可直接在体素级别合成高质量的3D医学体。PRDiT引入了一个两阶段训练架构,包括:1)一个局部去噪器,形式为基于MLP的盲估计器,作用于重叠的3D块,以有效分离低频结构;2)一个全局残差扩散Transformer,采用内存高效注意力来建模和细化整个体上的高频残差。这种从粗到细的建模策略简化了优化,增强了训练稳定性,并有效保留了细微结构,而无需自编码器瓶颈。在LIDC-IDRI和RAD-ChestCT数据集上进行的大量实验表明,PRDiT始终优于最先进的模型,如HA-GAN、3D LDM和WDM-3D,在3D FID、MMD和Wasserstein距离指标上显著降低。

英文摘要

Generating high-resolution 3D CT volumes with fine details remains challenging due to the substantial computational demands and optimization difficulties inherent to existing generative models. In this paper, we propose the Pixel-Level Residual Diffusion Transformer (PRDiT), a scalable generative framework that synthesizes high-quality 3D medical volumes directly at the voxel level. PRDiT introduces a two-stage training architecture comprising 1) a local denoiser in the form of an MLP-based blind estimator operating on overlapping 3D patches to separate low-frequency structures efficiently, and 2) a global residual diffusion transformer employing memory-efficient attention to model and refine high-frequency residuals across entire volumes. This coarse-to-fine modeling strategy simplifies optimization, enhances training stability, and effectively preserves subtle structures without the limitations of an autoencoder bottleneck. Extensive experiments conducted on the LIDC-IDRI and RAD-ChestCT datasets demonstrate that PRDiT consistently outperforms state-of-the-art models, such as HA-GAN, 3D LDM, and WDM-3D, achieving significantly lower 3D FID, MMD, and Wasserstein distance scores.
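
The coarse-to-fine split described above can be illustrated with a toy 1D example. This is only a sketch of the general residual-decomposition idea, not PRDiT itself: a plain moving average stands in for the learned MLP-based blind estimator, the diffusion transformer is not reproduced, and all function names are hypothetical.

```python
def moving_average(signal, radius):
    """Smooth a 1D signal with a window that shrinks at the edges."""
    n = len(signal)
    smoothed = []
    for i in range(n):
        lo = max(0, i - radius)
        hi = min(n, i + radius + 1)
        window = signal[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


def split_coarse_residual(signal, radius=2):
    """Split a signal into a low-frequency part and a high-frequency residual.

    Stand-in assumption: the moving average plays the role of the coarse
    (low-frequency) estimate; the residual is whatever it fails to explain.
    """
    coarse = moving_average(signal, radius)
    residual = [x - c for x, c in zip(signal, coarse)]
    return coarse, residual


def recombine(coarse, residual):
    """Add the residual back onto the coarse estimate."""
    return [c + r for c, r in zip(coarse, residual)]


signal = [0.0, 1.0, 0.0, 1.0, 4.0, 5.0, 4.0, 5.0]
coarse, residual = split_coarse_residual(signal, radius=1)
restored = recombine(coarse, residual)
print(max(abs(a - b) for a, b in zip(signal, restored)))
```

In this toy setting the second stage only has to account for the residual, which is smaller in magnitude than the full signal; that is the intuition behind modeling high-frequency residuals separately, although the abstract itself gives no further detail on how PRDiT's stages are trained.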

2606.20110 2026-06-19 cs.CV New submission

FrozenDrive: Zero-Shot Text-Guided Driving Scene Generation and Data Augmentation with Parameter-Free Frozen Diffusion Model

FrozenDrive: 零样本文本引导驾驶场景生成与数据增强的无参数冻结扩散模型

Yuhwan Jeong, Hyeonseong Kim, Daehyun We, Seonkyu Song, Jinnyeong Yang, Hyun-Kurl Jang, Youngho Yoon, Kuk-Jin Yoon

Affiliation * KAIST, Visual Intelligence Lab

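AI summary Proposes the FrozenDrive framework, which uses a frozen pre-trained diffusion model with knowledge-preserving spatio-temporal attention to achieve multi-view consistency and temporal coherence, generating driving scenes under adverse weather without fine-tuning and improving the robustness of autonomous driving models.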

Comments Accepted to ECCV 2026

详情
AI中文摘要

自动驾驶的合成数据正在激增,这得益于扩散模型能够实现可扩展的场景生成。然而,关键障碍依然存在,因为强制执行多视图和时间一致性通常依赖于骨干网络微调或添加层,这会侵蚀预训练知识并削弱文本对齐。模型也保持接近训练分布,在恶劣天气和未见配置下表现不佳,并且保真度偏向频繁类别而非稀有类别。我们通过FrozenDrive解决这些差距,这是一个可控生成框架,在保持预训练扩散模型知识的同时实现强一致性。FrozenDrive以丰富的驾驶堆栈信号和文本提示为条件,并引入知识保留的时空注意力,在无参数的冻结扩散骨干中单次通过时施加跨视图对齐和时间连贯性。额外的对象聚焦约束提高了稀有类别的每个对象保真度。无需任何天气或场景特定的微调,我们的模型从文本合成全局连贯的多视图驾驶场景,特别是在恶劣和稀有条件下,并超越了先前的基线。在nuScenes上,FrozenDrive增强数据显著提升了AD模型的性能,尤其是在夜间和雨天,当使用我们的场景定向数据训练时,展示了更强的鲁棒性。

英文摘要

Synthetic data for autonomous driving is surging, powered by diffusion models that promise scalable scene generation. Yet key obstacles remain: enforcing multi-view and temporal consistency often relies on backbone fine-tuning or added layers, which erodes pre-trained knowledge and weakens text alignment. Models also stay close to the training distribution, struggling under adverse weather and unseen configurations, and fidelity favors frequent over rare classes. We address these gaps with FrozenDrive, a controllable generative framework that preserves a pre-trained diffusion model's knowledge while achieving strong consistency. FrozenDrive conditions on rich driving-stack signals and text prompts, and introduces knowledge-preserving spatio-temporal attention to impose cross-view alignment and temporal coherence in a single pass within a parameter-free frozen diffusion backbone. An additional object-focused constraint improves per-object fidelity for rare categories. Without any weather- or scene-specific fine-tuning, our model synthesizes globally coherent multi-view driving scenes from text, particularly under adverse and rare conditions, and surpasses prior baselines. On nuScenes, FrozenDrive-augmented data significantly improves the performance of AD models, especially at night and in rain, demonstrating stronger robustness when they are trained with our scenario-targeted data.