arXivDaily: arXiv daily academic digest, updated Monday to Friday
2603.04219 2026-06-19 cs.SD cs.AI eess.AS Version update

ZeSTA: Zero-Shot TTS Augmentation with Domain-Conditioned Training for Data-Efficient Personalized Speech Synthesis

Youngwon Choi, Jinwoo Oh, Hwayeon Kim, Hyeonyu Kim

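Affiliations: Maum AI Inc.; Humelo Inc.

AI summary: Proposes ZeSTA, a framework that distinguishes real from synthetic speech via a lightweight domain embedding and combines it with real-data oversampling, improving speaker similarity for zero-shot TTS augmentation under extremely low-resource conditions while preserving intelligibility and perceptual quality.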

Comments 6 pages, accepted to INTERSPEECH 2026

Abstract

We investigate the use of zero-shot text-to-speech (ZS-TTS) as a data augmentation source for low-resource personalized speech synthesis. While synthetic augmentation can provide linguistically rich and phonetically diverse speech, naively mixing large amounts of synthetic speech with limited real recordings often leads to speaker similarity degradation during fine-tuning. To address this issue, we propose ZeSTA, a simple domain-conditioned training framework that distinguishes real and synthetic speech via a lightweight domain embedding, combined with real-data oversampling to stabilize adaptation under extremely limited target data, without modifying the base architecture. Experiments on LibriTTS and an in-house dataset with two ZS-TTS sources demonstrate that our approach improves speaker similarity over naive synthetic augmentation while preserving intelligibility and perceptual quality. Audio samples are available on our web page.
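The data side of the recipe described above (a real-versus-synthetic domain tag plus real-data oversampling) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the domain IDs, the function names, the file names, and the oversampling factor of 4 are all hypothetical, and the abstract does not say how the domain embedding is consumed by the model.

```python
import random

REAL, SYNTHETIC = 0, 1  # hypothetical domain IDs that would index a small domain embedding


def build_training_pool(real_utts, synth_utts, real_oversample=4):
    """Tag each utterance with a domain ID and repeat the real ones.

    real_oversample is an illustrative factor, not a value from the paper.
    """
    pool = [(utt, REAL) for utt in real_utts] * real_oversample
    pool += [(utt, SYNTHETIC) for utt in synth_utts]
    return pool


def sample_batch(pool, batch_size, seed=0):
    """Draw a batch uniformly (with replacement) from the oversampled pool."""
    rng = random.Random(seed)
    return [rng.choice(pool) for _ in range(batch_size)]


# Toy data: 2 real recordings and 8 synthetic ones (file names are made up).
real = ["real_000.wav", "real_001.wav"]
synth = [f"synth_{i:03d}.wav" for i in range(8)]

pool = build_training_pool(real, synth, real_oversample=4)
real_share = sum(1 for _, domain in pool if domain == REAL) / len(pool)
batch = sample_batch(pool, batch_size=5)
```

With these toy numbers the real recordings make up half of the pool instead of one fifth, which is the effect oversampling is meant to have; the domain ID travels with each utterance so a model could condition on it.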

2509.15927 2026-06-19 cs.LG cs.AI Version update

Enhancing Generative Auto-bidding with Offline Reward Evaluation and Policy Search

Zhiyu Mou, Yiqin Lv, Miao Xu, Qi Wang, Yixiu Mao, Jinghao Chen, Qichen Ye, Chao Li, Rongquan Bai, Chuan Yu, Jian Xu, Bo Zheng

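Affiliations: Taobao & Tmall Group of Alibaba; Department of Automation, Tsinghua University

AI summary: To address the performance bottleneck of existing generative auto-bidding methods, which cannot explore beyond the static dataset, proposes AIGB-Pearl, which achieves safe and efficient exploration through a trajectory evaluator and a KL-Lipschitz-constrained score-maximization scheme, reaching state-of-the-art performance in simulated and real-world advertising systems.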

Abstract

Auto-bidding is a critical tool for advertisers to improve advertising performance. Recent progress has demonstrated that AI-Generated Bidding (AIGB), which learns a conditional generative planner from offline data, achieves superior performance compared to typical offline reinforcement learning (RL)-based auto-bidding methods. However, existing AIGB methods still face a performance bottleneck due to their inherent inability to explore beyond the static dataset with feedback. To address this, we propose AIGB-Pearl (Planning with EvaluAtor via RL), a novel method that integrates generative planning and policy optimization. The core of AIGB-Pearl lies in constructing a trajectory evaluator to assess the quality of generated scores and designing a provably sound KL-Lipschitz-constrained score-maximization scheme to ensure safe and efficient exploration beyond the offline dataset. A practical algorithm that incorporates the synchronous coupling technique is further developed to ensure the model regularity required by the proposed scheme. Extensive experiments on both simulated and real-world advertising systems demonstrate the state-of-the-art performance of our approach.

2603.01250 2026-06-19 cs.CV cs.AI Version update

The MAMA-MIA Challenge: Advancing Generalizability and Fairness in Breast MRI Tumor Segmentation and Treatment Response Prediction

Lidia Garrucho, Smriti Joshi, Kaisar Kushibar, Richard Osuala, Maciej Bobowicz, Xavier Bargalló, Paulius Jaruševičius, Kai Geissler, Raphael Schäfer, Muhammad Alberb, Tony Xu, Anne Martel, Daniel Sleiman, Navchetan Awasthi, Hadeel Awwad, Joan C. Vilanova, Robert Martí, Daan Schouten, Jeong Hoon Lee, Mirabela Rusu, Eleonora Poeta, Luisa Vargas, Eliana Pastor, Maria A. Zuluaga, Jessica Kächele, Dimitrios Bounias, Alexandra Ertl, Katarzyna Gwoździewicz, Maria-Laura Cosaka, Pasant M. Abo-Elhoda, Sara W. Tantawy, Shorouq S. Sakrana, Norhan O. Shawky-Abdelfatah, Amr Muhammad Abdo-Salem, Androniki Kozana, Eugen Divjak, Gordana Ivanac, Katerina Nikiforaki, Michail E. Klontzas, Rosa García-Dosdá, Meltem Gulsun-Akpinar, Oğuz Lafcı, Carlos Martín-Isla, Oliver Díaz, Laura Igual, Karim Lekadir

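Affiliations: Barcelona Artificial Intelligence in Medicine Lab (BCN-AIM), Facultat de Matemàtiques i Informàtica, Universitat de Barcelona

AI summary: Presents the MAMA-MIA Challenge, a standardized benchmark for evaluating breast MRI tumor segmentation and prediction of pathologic complete response; analyzes model generalizability and fairness on cross-continental, multi-center data and finds trade-offs between performance and subgroup fairness.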

Abstract

Breast cancer is the most frequently diagnosed malignancy among women worldwide and a leading cause of cancer-related mortality. Dynamic contrast-enhanced magnetic resonance imaging plays a central role in tumor characterization and treatment monitoring, particularly in patients receiving neoadjuvant chemotherapy. However, existing artificial intelligence models for breast magnetic resonance imaging are typically developed and evaluated using heterogeneous datasets, study populations, and assessment protocols, making direct comparison difficult and limiting understanding of model robustness across institutions and clinically relevant patient subgroups. The MAMA-MIA Challenge was designed to address these challenges by providing a standardized benchmark for the joint evaluation of primary tumor segmentation and prediction of pathologic complete response using pre-treatment magnetic resonance imaging only. The training cohort comprised 1,506 patients from multiple institutions in the United States, while evaluation was conducted on an external test set of 574 patients from three independent European centers to assess cross-continental and cross-institutional generalization. A unified scoring framework combined predictive performance with subgroup consistency across age, menopausal status, and breast density. Twenty-six international teams participated in the final evaluation phase. Results demonstrate substantial performance variability under a common external evaluation framework and reveal trade-offs between overall accuracy and subgroup fairness. The challenge provides standardized datasets, evaluation protocols, and public resources to promote the development of robust and equitable artificial intelligence systems for breast cancer imaging.

2603.00654 2026-06-19 cs.CV Version update

RC-GeoCP: Geometric Consensus for Radar-Camera Collaborative Perception

Xiaokai Bai, Lianqing Zheng, Runwei Guan, Siyuan Cao, Songkai Wang, Huiliang Shen

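Affiliations: College of Information Science and Electronic Engineering, Zhejiang University; School of Automotive Studies, Tongji University; Thrust of Artificial Intelligence, Hong Kong University of Science and Technology

AI summary: Proposes RC-GeoCP, the first collaborative perception framework for 4D radar and cameras, which resolves misalignment caused by depth ambiguity and spatial dispersion through a radar-anchored geometric consensus, achieving efficient communication and a globally coherent representation.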

Comments 11 pages, 6 figures, 9 tables

Abstract

Collaborative perception (CP) enhances scene understanding through multi-agent information sharing. While LiDAR-centric systems offer precise geometry, high costs and performance degradation in adverse weather necessitate multi-modal alternatives. Despite dense visual semantics and robust spatial measurements, the synergy between cameras and 4D radar remains underexplored in collaborative settings. This work introduces RC-GeoCP, the first framework to explore the fusion of 4D radar and images in CP. To resolve misalignment caused by depth ambiguity and spatial dispersion across agents, RC-GeoCP establishes a radar-anchored geometric consensus. Specifically, Geometric Structure Rectification (GSR) aligns visual semantics with geometry derived from radar to generate spatially grounded, geometry-consistent representations. Uncertainty-Aware Communication (UAC) formulates selective transmission as a conditional entropy reduction process to prioritize informative features based on inter-agent disagreement. Finally, the Consensus-Driven Assembler (CDA) aggregates multi-agent information via shared geometric anchors to form a globally coherent representation. We establish the first unified radar-camera CP benchmark on V2X-Radar and V2X-R, demonstrating state-of-the-art performance with significantly reduced communication overhead. Code will be released soon.

2602.23248 2026-06-19 cs.AI Version update

Mitigating Legibility Tax with Decoupled Prover-Verifier Games

Yegon Kim, Juho Lee

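Affiliations: KAIST

AI summary: Proposes the decoupled prover-verifier game (DPVG), which separates correctness from checkability by training a translator model that converts a fixed solver's solutions into a checkable form, improving checkability while retaining the solver's answer and thereby mitigating the legibility tax.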

Comments ICLR 2026 Workshop Trustworthy AI

Abstract

As large language models become increasingly capable, it is critical that their outputs can be easily checked by less capable systems. Prover-verifier games can be used to improve the checkability of model outputs, but they display a degradation in accuracy compared to a baseline trained only to maximize correctness, a phenomenon named the legibility tax. We propose a solution that decouples the correctness condition from the checkability condition and instead trains a "translator" model that turns a fixed solver model's solution into a checkable form. This allows us to first train the solver to maximize correctness, and then train the translator to translate the solver's solution into a checkable form while retaining the solver's answer. To accommodate this new objective of translation, we formulate a decoupled prover-verifier game (DPVG) where the equilibria correspond to faithful and checkable translators.

2602.23172 2026-06-19 cs.CV cs.AI cs.RO Version update

Latent Gaussian Splatting for 4D Panoptic Occupancy Tracking

Maximilian Luz, Rohit Mohan, Thomas Nürnberg, Yakov Miron, Daniele Cattaneo, Abhinav Valada

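Affiliations: University of Freiburg; Bosch Research; University of Haifa

AI summary: Proposes Latent Gaussian Splatting (LaGS) for 4D panoptic occupancy tracking, using feature-bearing Gaussians as dynamic keypoints to aggregate multi-view features, and achieves state-of-the-art performance on Occ3D nuScenes and Waymo.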

Comments Accepted to IEEE Robotics and Automation Letters (RA-L), 2026

Abstract

Capturing 4D spatiotemporal scene structure is crucial for the safe and reliable operation of robots in dynamic environments. However, existing approaches typically address only part of the problem: they either provide coarse geometric tracking via bounding boxes or detailed 3D occupancy estimates that lack explicit temporal association and instance-level reasoning. In this work, we present Latent Gaussian Splatting (LaGS) for 4D Panoptic Occupancy Tracking (4D-POT). We revisit the underlying representation and model 3D features as a sparse set of feature-bearing Gaussians. These act as dynamic, volume-oriented keypoints that enable spatially continuous, distance-weighted aggregation of multi-view features before being splatted into a voxel grid for decoding. This point-centric formulation enables flexible, data-dependent receptive fields and long-range spatial interactions that are difficult to capture with local and dense voxel-based operators. A hierarchical Gaussian representation further enables multi-scale reasoning by combining global context from coarse super-points with fine-grained detail from higher-resolution streams. Extensive experiments on Occ3D nuScenes and Waymo demonstrate state-of-the-art performance for 4D-POT. We provide code and models at https://lags.cs.uni-freiburg.de/.
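The step of splatting feature-bearing Gaussians into a voxel grid can be pictured with a small distance-weighted sketch. It is only an illustration under strong assumptions that the abstract does not state: Gaussians are isotropic, weights are normalized per voxel, and the function name and tuple layout are hypothetical. LaGS itself may differ on every one of these points.

```python
import math


def splat_to_voxels(gaussians, voxel_centers, eps=1e-8):
    """Return one feature vector per voxel as a Gaussian-weighted average.

    gaussians: list of (mean, sigma, feature) tuples, with mean a 3D point,
    sigma an isotropic standard deviation, and feature a list of floats.
    """
    dim = len(gaussians[0][2])
    voxel_features = []
    for center in voxel_centers:
        # Unnormalized weight of each Gaussian at this voxel center.
        weights = [
            math.exp(
                -sum((c - m) ** 2 for c, m in zip(center, mean)) / (2 * sigma ** 2)
            )
            for mean, sigma, _ in gaussians
        ]
        total = sum(weights) + eps  # eps guards against division by zero
        voxel_features.append([
            sum(w * feat[k] for w, (_, _, feat) in zip(weights, gaussians)) / total
            for k in range(dim)
        ])
    return voxel_features


# Two toy Gaussians with one-hot features, and two voxel centers.
gaussians = [((0.0, 0.0, 0.0), 1.0, [1.0, 0.0]),
             ((2.0, 0.0, 0.0), 1.0, [0.0, 1.0])]
voxels = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
features = splat_to_voxels(gaussians, voxels)
```

The voxel midway between the two Gaussians receives an equal blend of their features, while the voxel at the first Gaussian's mean is dominated by that Gaussian. Because the weights depend on continuous distances rather than on grid neighborhoods, the receptive field of each voxel is set by the data.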

2602.22959 2026-06-19 cs.CV Version update

Can Agents Distinguish Visually Hard-to-Separate Diseases in a Zero-Shot Setting? A Pilot Study

Zihao Zhao, Frederik Hauke, Juliana De Castilhos, Sven Nebelung, Daniel Truhn

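Affiliations: Department of Diagnostic and Interventional Radiology, University Hospital Aachen, 52074 Aachen, Germany

AI summary: Explores whether multimodal large language model agents can distinguish visually confounded diseases (such as melanoma vs. atypical nevus and pulmonary edema vs. pneumonia) in a zero-shot setting, and proposes a multi-agent framework based on contrastive adjudication that improves accuracy by 11 percentage points on dermoscopy data, although overall performance remains insufficient for clinical deployment.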

Comments Code available at https://github.com/TruhnLab/Contrastive-Agent-Reasoning. Accepted by MICCAI 2026

Abstract

The rapid progress of multimodal large language models (MLLMs) has led to increasing interest in agent-based systems. While most prior work in medical imaging concentrates on automating routine clinical workflows, we study an underexplored yet clinically significant setting: distinguishing visually hard-to-separate diseases in a zero-shot setting. We benchmark representative agents on two imaging-only proxy diagnostic tasks, (1) melanoma vs. atypical nevus and (2) pulmonary edema vs. pneumonia, where visual features are highly confounded despite substantial differences in clinical management. We introduce a multi-agent framework based on contrastive adjudication. Experimental results show improved diagnostic performance (an 11-percentage-point gain in accuracy on dermoscopy data) and reduced unsupported claims on qualitative samples, although overall performance remains insufficient for clinical deployment. We acknowledge the inherent uncertainty in human annotations and the absence of clinical context, which further limit the translation to real-world settings. Within this controlled setting, this pilot study provides preliminary insights into zero-shot agent performance in visually confounded scenarios.

2602.13139 2026-06-19 cs.CL Version update

OpenLID-v3: Improving the Precision of Closely Related Language Identification -- An Experience Report

Mariia Fedorova, Nikolay Arefyev, Maja Buljan, Jindřich Helcl, Stephan Oepen, Egil Rønningstad, Yves Scherrer

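AI summary: To address the difficulty existing language identification tools have with closely related languages and with noise, extends the OpenLID classifier by adding training data, merging problematic language variant clusters, and introducing a noise label; the resulting OpenLID-v3 improves precision on multiple benchmarks.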

Comments VarDial'26 workshop at the EACL 2026 conference

Abstract

Language identification (LID) is an essential step in building high-quality multilingual datasets from web data. Existing LID tools (such as OpenLID or GlotLID) often struggle to identify closely related languages and to distinguish valid natural language from noise, which contaminates language-specific subsets, especially for low-resource languages. In this work we extend the OpenLID classifier by adding more training data, merging problematic language variant clusters, and introducing a special label for marking noise. We call this extended system OpenLID-v3 and evaluate it against GlotLID on multiple benchmarks. During development, we focus on three groups of closely related languages (Bosnian, Croatian, and Serbian; Romance varieties of Northern Italy and Southern France; and Scandinavian languages) and contribute new evaluation datasets where existing ones are inadequate. We find that ensemble approaches improve precision but also substantially reduce coverage for low-resource languages. OpenLID-v3 is available on https://huggingface.co/HPLT/OpenLID-v3.

2508.15228 2026-06-19 cs.CV Version update

Collaborative Multi-Modal Coding for High-Quality 3D Generation

Ziang Cao, Zhaoxi Chen, Liang Pan, Ziwei Liu

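Affiliations: S-Lab, Nanyang Technological University, Singapore; Shanghai Artificial Intelligence Laboratory

AI summary: Proposes TriMM, the first feed-forward 3D-native generative model to learn from basic multi-modalities, which fuses RGB, RGBD, and point cloud features through collaborative multi-modal coding and combines auxiliary 2D/3D supervision with a triplane latent diffusion model to generate high-quality 3D assets.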

Abstract

3D content inherently encompasses multi-modal characteristics and can be projected into different modalities (e.g., RGB images, RGBD, and point clouds). Each modality exhibits distinct advantages in 3D asset modeling: RGB images contain vivid 3D textures, whereas point clouds define fine-grained 3D geometries. However, most existing 3D-native generative architectures either operate predominantly within single-modality paradigms, thus overlooking the complementary benefits of multi-modality data, or restrict themselves to 3D structures, thereby limiting the scope of available training datasets. To holistically harness multi-modalities for 3D modeling, we present TriMM, the first feed-forward 3D-native generative model that learns from basic multi-modalities (e.g., RGB, RGBD, and point cloud). Specifically, 1) TriMM first introduces collaborative multi-modal coding, which integrates modality-specific features while preserving their unique representational strengths. 2) Furthermore, auxiliary 2D and 3D supervision is introduced to raise the robustness and performance of multi-modal coding. 3) Based on the embedded multi-modal code, TriMM employs a triplane latent diffusion model to generate 3D assets of superior quality, enhancing both the texture and the geometric detail. Extensive experiments on multiple well-known datasets demonstrate that TriMM, by effectively leveraging multi-modality, achieves performance competitive with models trained on large-scale datasets, despite using only a small amount of training data. Furthermore, we conduct additional experiments on recent RGB-D datasets, verifying the feasibility of incorporating other multi-modal datasets into 3D generation.

2602.18226 2026-06-19 math.NA cs.NA Version update

A Parametric Finite Element Approach for an Anisotropic Multi-Phase Mullins-Sekerka Problem with Kinetic Undercooling

Tokuhiro Eto, Harald Garcke, Robert Nürnberg

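AI summary: For the anisotropic multi-phase Mullins-Sekerka problem with kinetic undercooling, proposes a fully discrete unfitted finite element method that is unconditionally stable and simulates the evolution of multiple ice crystals with junctions.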

Comments 26 pages, 16 figures

Abstract

We consider a sharp interface formulation for an anisotropic multi-phase Mullins-Sekerka problem with kinetic undercooling. The flow is characterized by a cluster of surfaces evolving such that the total surface energy plus a weighted sum of the volumes of the enclosed phases decreases in time. Upon deriving a suitable variational formulation, we introduce a fully discrete unfitted finite element method. In this approach, the approximations of the moving interfaces are independent of the triangulations used for the equations in the bulk. Our method can be shown to be unconditionally stable. Several numerical examples demonstrate the capabilities of the introduced method. In particular, it is demonstrated that the evolution of multiple ice crystals with junctions can be modeled using the proposed approach.

2602.15819 2026-06-19 cs.CV Version update

VideoSketcher: Sequential Sketch Generation Using Video Model Priors

Hui Ren, Yuval Alaluf, Omer Bar Tal, Alexander Schwing, Antonio Torralba, Yael Vinker

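Affiliations: MIT

AI summary: Proposes VideoSketcher, which combines the semantic planning of LLMs with the temporal rendering of video diffusion models and uses two-stage fine-tuning to learn stroke ordering and style from a small number of examples, generating high-quality sequential sketches.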

Abstract

Sketching is inherently sequential: strokes are drawn progressively to explore and refine ideas. Yet most generative approaches treat sketches as static images, ignoring the temporal process underlying creative exploration. Modeling this sequential structure remains challenging: prior methods either rely on large-scale human-drawn datasets with limited diversity, or use large language models (LLMs) to produce drawing instructions, often at the cost of visual fidelity. We present VideoSketcher, a method for generating high-quality sketching processes by adapting pretrained text-to-video diffusion models to the sparse, continuous nature of sketch formation. Our key insight is that LLMs and video diffusion models offer complementary strengths: LLMs act as semantic planners that decompose concepts into step-by-step instructions, while video diffusion models serve as powerful "renderers" that translate them into temporally coherent sketch sequences. We introduce a two-stage fine-tuning strategy that decouples temporal structure from visual appearance: stroke ordering is learned from synthetic shape compositions, while style is distilled from as few as seven hand-drawn examples. Despite minimal supervision, our method can generate diverse, high-quality sequential sketches that faithfully follow specified drawing orders. Our framework naturally extends to brush style control and autoregressive generation, supporting artistic applications.

2602.15707 2026-06-19 cs.MM cs.CL cs.LG Version update

Proactive Conversational Assistant for a Procedural Manual Task based on Audio and IMU

Rehana Mahfuz, Yinyi Guo, Erik Visser, Phanidhar Chinchili

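Affiliations: Qualcomm Technologies, Inc.

AI summary: Proposes the first real-time conversational assistant that uses only audio and IMU modalities; fine-tuning the language model reduces unnecessary dialogue and improves question-answering accuracy, and the assistant runs on an edge device with no dependence on the cloud.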

Comments 5 figures. 5 more in appendix

Abstract

Real-time conversational assistants for procedural manual tasks often depend on video input, which can be computationally expensive and compromise user privacy. For the first time, we propose a real-time conversational assistant that provides comprehensive guidance for procedural manual tasks using only lightweight privacy-preserving modalities such as audio and IMU inputs from a user's wearable device to understand the context. Using a furniture assembly task and a cooking task, we show how this assistant proactively communicates step-by-step instructions to a user performing a procedural task, and answers user questions. We illustrate the data generation method and the system design to achieve such an assistant. On observing that an off-the-shelf language model is a talkative assistant but is not always able to answer questions correctly, we demonstrate how finetuning the model improves its ability to limit unnecessary dialogues with a 50% increase in the precision, while also improving its ability to answer questions correctly, measured by a 150% increase in the recall of answers. We further describe how such an assistant is implemented on an edge device with no dependence on the cloud.

2602.14696 2026-06-19 cs.LG Version update

A Critical Look at Targeted Instruction Selection: Disentangling What Matters (and What Doesn't)

Nihal V. Nayak, Paula Rodriguez-Diaz, Neha Hulkund, Sara Beery, David Alvarez-Melis

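Affiliations: Harvard University; MIT; Kempner Institute

AI summary: Systematically disentangles the two core ingredients of targeted instruction selection for instruction fine-tuning, data representation and selection algorithms; finds that gradient-based representations paired with greedy round-robin selection often perform best on average at low budgets, with gains that diminish as the budget grows; and unifies several selection algorithms as approximate distance minimization.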

Comments ICML 2026

Abstract

Instruction fine-tuning of large language models (LLMs) often involves selecting a subset of instruction training data from a large candidate pool, using a small query set from the target task. Despite growing interest, the literature on targeted instruction selection remains fragmented and opaque: methods vary widely in selection budgets, often omit zero-shot baselines, and frequently entangle the contributions of key components. As a result, practitioners lack actionable guidance on selecting instructions for their target tasks. In this work, we aim to bring clarity to this landscape by disentangling and systematically analyzing the two core ingredients: data representation and selection algorithms. Our framework enables controlled comparisons across models, tasks, and budgets. We find that only gradient-based data representations choose subsets whose similarity to the query consistently predicts performance across datasets, models, and candidate pools. While no single method dominates, gradient-based representations paired with greedy round-robin selection often perform best on average at low budgets, but these gains diminish at larger budgets. Finally, we unify several existing selection algorithms as forms of approximate distance minimization between the selected subset and the query set, and support this view with new generalization bounds. More broadly, our findings provide critical insights and a foundation for more principled data selection in LLM fine-tuning. The code is available at https://github.com/dcml-lab/targeted-instruction-selection.
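One plausible reading of "greedy round-robin selection" is sketched below: query examples take turns, and each picks its most similar candidate that has not been selected yet, until the budget is spent. The abstract does not define the procedure, so the function name, the dot-product similarity, and the toy vectors (stand-ins for gradient-based representations) are all assumptions made for illustration.

```python
def dot(a, b):
    """Dot-product similarity between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def greedy_round_robin(query_vecs, cand_vecs, budget):
    """Queries take turns; each picks its most similar unselected candidate."""
    if not query_vecs:
        return []
    # Per-query ranking of candidate indices, most similar first.
    rankings = [
        sorted(range(len(cand_vecs)), key=lambda j: -dot(q, cand_vecs[j]))
        for q in query_vecs
    ]
    cursors = [0] * len(query_vecs)  # next position to inspect in each ranking
    selected, chosen = [], set()
    budget = min(budget, len(cand_vecs))
    while len(selected) < budget:
        for qi, ranking in enumerate(rankings):
            if len(selected) >= budget:
                break
            # Skip candidates already taken on behalf of another query.
            while cursors[qi] < len(ranking) and ranking[cursors[qi]] in chosen:
                cursors[qi] += 1
            if cursors[qi] < len(ranking):
                j = ranking[cursors[qi]]
                chosen.add(j)
                selected.append(j)
    return selected


# Toy example: two queries, four candidates, budget of three.
queries = [[1.0, 0.0], [0.0, 1.0]]
candidates = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]
picked = greedy_round_robin(queries, candidates, budget=3)
```

Here the first query takes candidate 0, the second takes candidate 1, and the first query then takes candidate 2 in the next round. Taking turns keeps any single query from consuming the whole budget, which matters most when the budget is small.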

2512.11173 2026-06-19 cs.RO Version update

Learning Category-level Last-meter Navigation from RGB Demonstrations of a Single-instance

Tzu-Hsien Lee, Fidan Mahmudova, Karthik Desingh

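Affiliations: University of Minnesota, Twin Cities

AI summary: Proposes an object-centric imitation learning framework that uses RGB observations to give a quadruped mobile manipulator precise last-meter navigation without depth or map priors, achieving high success rates in category-level generalization.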

Abstract

Achieving precise positioning of the mobile manipulator's base is essential for successful manipulation actions that follow. Most of the RGB-based navigation systems only guarantee coarse, meter-level accuracy, making them less suitable for the precise positioning phase of mobile manipulation. This gap prevents manipulation policies from operating within the distribution of their training demonstrations, resulting in frequent execution failures. We address this gap by introducing an object-centric imitation learning framework for last-meter navigation, enabling a quadruped mobile manipulator robot to achieve manipulation-ready positioning using only RGB observations from its onboard cameras. Our method conditions the navigation policy on three inputs: goal images, multi-view RGB observations from the onboard cameras, and a text prompt specifying the target object. A language-driven segmentation module and a spatial score-matrix decoder then supply explicit object grounding and relative pose reasoning. Using real-world data from a single object instance within a category, the system generalizes to unseen object instances across diverse environments with challenging lighting and background conditions. To comprehensively evaluate this, we introduce two metrics: an edge-alignment metric, which uses ground truth orientation, and an object-alignment metric, which evaluates how well the robot visually faces the target. Under these metrics, our policy achieves 74.58% success in edge-alignment and 89.42% success in object-alignment when positioning relative to unseen target objects. These results show that precise last-meter navigation can be achieved at a category-level without depth, LiDAR, or map priors, enabling a scalable pathway toward unified mobile manipulation. Project page: https://rpm-lab-umn.github.io/category-level-last-meter-nav/

2508.21677 2026-06-19 cs.RO Version update

Robust Convex Model Predictive Control with collision avoidance guarantees for robot manipulators

Bernhard Wullt, Johannes Köhler, Per Mattsson, Mikeal Norrlöf, Thomas B. Schön

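Affiliations: ABB robotics; Department of Mechanical Engineering, Imperial College London; Department of Information Technology, Uppsala University

AI summary: Proposes a convex MPC scheme that combines robust tube MPC with a corridor planning algorithm, achieving fast, collision-free motion for industrial robots under model uncertainty and outperforming benchmark methods.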

Abstract

Industrial manipulators typically operate in cluttered environments, where safe motion planning is critical. However, model uncertainties further complicate this task, which leads to conservative speed limits to reduce the influence of disturbances. Hence, there is a need for control methods that can guarantee safe motions which are executed fast. We address this by suggesting a novel model predictive control (MPC) solution for manipulators, where our two main components are a robust tube MPC and a corridor planning algorithm to obtain collision-free motion. Our solution results in a convex MPC formulation, which we can solve fast, making our method practically useful. We demonstrate the efficacy of our method in a simulated environment with a 6 DOF industrial robot operating in cluttered environments with uncertain model parameters. We outperform benchmark methods by tolerating higher levels of model uncertainty while achieving faster motion.

2602.11972 2026-06-19 math.NA cs.NA 版本更新

Splitting Schemes for ODEs with Goal-Oriented Error Estimation

具有目标导向误差估计的常微分方程分裂格式

Erik Weyl, Andreas Bartel, Manuel Schaller

AI总结 提出一种混合先验/后验目标导向误差估计器,结合动态迭代和有限元离散,用于评估和平衡动态迭代误差与离散化误差,实现自适应网格细化和动态迭代停止准则。

Comments 24 pages, 5 figures, published in BIT Numerical Mathematics, added notice of this to the document

详情
AI中文摘要

我们提出了一种混合先验/后验目标导向误差估计器,用于结合基于动态迭代的常微分方程求解(通过有限元离散化)。我们的新型误差估计器结合了经典动态迭代方法(通常用于基于分裂的分布式仿真)和双加权残差法的估计,能够评估和平衡期望感兴趣量中的动态迭代误差和离散化误差。获得的误差估计器用于指导计算网格的细化,并作为动态迭代的停止准则。特别地,我们允许时间域的自适应和灵活离散化,其中变量可以不同地离散化以匹配目标和求解需求,例如考虑多时间尺度。我们为方案配备了数值线性代数中的高效求解器,以确保其适用于复杂问题。数值实验将自适应方法与均匀细化进行了比较。

英文摘要

We present a hybrid a-priori/a-posteriori goal-oriented error estimator for the combination of dynamic iteration with a finite element discretization of ordinary differential equations. Our novel error estimator combines estimates from classical dynamic iteration methods, usually used to enable splitting-based distributed simulation, with estimates from the dual weighted residual method, so that both the dynamic iteration error and the discretization error in desired quantities of interest can be evaluated and balanced. The resulting error estimators are used to drive refinement of the computational mesh and as a stopping criterion for the dynamic iteration. In particular, we allow for an adaptive and flexible discretization of the time domain, where variables can be discretized differently to match both goal and solution requirements, e.g. in view of multiple time scales. We endow the scheme with efficient solvers from numerical linear algebra to ensure its applicability to complex problems. Numerical experiments compare the adaptive approach to uniform refinement.

2510.08275 2026-06-19 eess.SY cs.SY 版本更新

Control Allocation Algorithm for Hypersonic Glide Vehicles with Input Limitations

输入受限的高超声速滑翔飞行器控制分配算法

Johannes Autenrieb, Patrick Gruhn

AI总结 针对高超声速滑翔飞行器执行机构强非线性和物理约束,提出一种迭代控制分配方法,通过嵌入阻力敏感软约束提高能效并降低表面温度,在GHGV-2模型上验证了有效性。

Comments 43 pages, 21 figures, accepted for publication in the AIAA Journal of Guidance, Control, and Dynamics

详情
AI中文摘要

高超声速滑翔飞行器(HGV)在具有执行机构强非线性和严格物理约束的挑战性飞行状态下运行。这些约束包括状态相关的执行器限制、非对称控制边界以及随机动条件变化的热载荷。本文介绍了一种迭代控制分配方法,以实时应对这些挑战。所提出的算法搜索能够实现期望力矩指令的控制输入,同时满足输入幅度和速率的约束。对于细长HGV构型,热载荷和阻力生成密切相关——较低的阻力通常导致表面加热减少。通过嵌入阻力敏感软约束,该方法提高了能量效率并隐含地降低了表面温度,从而降低了飞行器的红外特征。这些特性对于需要低可观测性的远程军事行动尤为有利。该方法利用DLR的通用高超声速滑翔飞行器2(GHGV-2)仿真模型进行了演示。结果证实了该方法在现实约束飞行条件下保持控制权限的有效性。

英文摘要

Hypersonic glide vehicles (HGVs) operate in challenging flight regimes characterized by strong nonlinearities in actuation and stringent physical constraints. These include state-dependent actuator limitations, asymmetric control bounds, and thermal loads that vary with maneuvering conditions. This paper introduces an iterative control allocation method to address these challenges in real time. The proposed algorithm searches for control inputs that achieve the desired moment commands while respecting constraints on input magnitude and rate. For slender HGV configurations, thermal loads and drag generation are strongly correlated: lower drag typically results in reduced surface heating. By embedding drag-sensitive soft constraints, the method improves energy efficiency and implicitly reduces surface temperatures, lowering the vehicle's infrared signature. These features are particularly advantageous for long-range military operations that require low observability. The approach is demonstrated using the DLR's Generic Hypersonic Glide Vehicle 2 (GHGV-2) simulation model. The results confirm the method's effectiveness in maintaining control authority under realistic, constrained flight conditions.
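
The idea of searching iteratively for inputs that meet a moment command under magnitude and rate limits can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes a linear effectiveness matrix `B`, uses plain projected gradient descent, and stands in for the drag-sensitive soft constraint with a hypothetical quadratic penalty `drag_weight * u_i**2 / 2`. All names and parameters are illustrative assumptions.

```python
def clamp(x, lo, hi):
    """Restrict x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def allocate(B, m_des, u_prev, u_min, u_max, rate_max, dt,
             drag_weight=0.0, steps=200, lr=0.1):
    """Projected-gradient sketch of constrained control allocation.

    Minimizes 0.5*||B u - m_des||^2 + 0.5*drag_weight*||u||^2 subject to
    magnitude limits [u_min, u_max] and a per-step rate limit rate_max*dt.
    The step size lr is assumed small enough for convergence.
    """
    n = len(u_prev)
    rows = len(m_des)
    # Rate limits shrink the admissible box around the previous input.
    lo = [max(u_min[i], u_prev[i] - rate_max[i] * dt) for i in range(n)]
    hi = [min(u_max[i], u_prev[i] + rate_max[i] * dt) for i in range(n)]
    u = [clamp(u_prev[i], lo[i], hi[i]) for i in range(n)]
    for _ in range(steps):
        residual = [sum(B[r][i] * u[i] for i in range(n)) - m_des[r]
                    for r in range(rows)]
        grad = [sum(B[r][i] * residual[r] for r in range(rows))
                + drag_weight * u[i] for i in range(n)]
        # Gradient step followed by projection onto the admissible box.
        u = [clamp(u[i] - lr * grad[i], lo[i], hi[i]) for i in range(n)]
    return u
```

With a generous rate limit the sketch reaches the commanded moment; with a tight one it saturates at the rate bound, which is the trade-off a real allocator has to manage.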

2602.09689 2026-06-19 cs.LG 版本更新

Model soups need only one ingredient

模型汤只需一种成分

Alireza Abdollahpoorrostam, Nikolaos Dimitriadis, Adam Hazimeh, Pascal Frossard

发表机构 * EPFL(瑞士联邦理工学院) EPFL LTS4(瑞士联邦理工学院 LTS4)

AI总结 提出MonoSoup方法,利用SVD分解单检查点的层更新,通过熵有效秩自动重加权成分,实现强分布内-分布外平衡,无需多检查点。

详情
AI中文摘要

在目标分布上微调大型预训练模型通常会提高分布内(ID)准确性,但代价是分布外(OOD)鲁棒性下降,因为表示会专门适应微调数据。权重空间集成方法,如模型汤(Model Soups),通过平均多个检查点来缓解这一影响,但它们在计算上代价高昂,需要训练和存储数十个微调模型。在本文中,我们介绍了MonoSoup,一种简单、无数据、无超参数的事后方法,仅使用单个检查点即可实现强大的ID-OOD平衡。我们的方法对每一层的更新应用奇异值分解(SVD),将其分解为捕捉任务特定适应的高能量方向和引入噪声但可能仍编码对鲁棒性有用的残余信号的低能量方向。然后,MonoSoup使用基于熵的有效秩自动重新加权这些分量,并考虑模型的谱和几何结构的逐层系数。在ImageNet上微调并在自然分布偏移下评估的CLIP模型,以及在数学推理和多选题基准上测试的Qwen语言模型上的实验表明,这种即插即用方法是多检查点方法的实用且有效的替代方案,保留了其大部分好处而无需计算开销。

英文摘要

Fine-tuning large pre-trained models on a target distribution often improves in-distribution (ID) accuracy, but at the cost of out-of-distribution (OOD) robustness as representations specialize to the fine-tuning data. Weight-space ensembling methods, such as Model Soups, mitigate this effect by averaging multiple checkpoints, but they are computationally prohibitive, requiring the training and storage of dozens of fine-tuned models. In this paper, we introduce MonoSoup, a simple, data-free, hyperparameter-free, post-hoc method that achieves a strong ID-OOD balance using only a single checkpoint. Our method applies Singular Value Decomposition (SVD) to each layer's update and decomposes it into high-energy directions that capture task-specific adaptation and low-energy directions that introduce noise but may still encode residual signals useful for robustness. MonoSoup then uses entropy-based effective rank to automatically re-weigh these components with layer-wise coefficients that account for the spectral and geometric structure of the model. Experiments on CLIP models fine-tuned on ImageNet and evaluated under natural distribution shifts, as well as on Qwen language models tested on mathematical reasoning and multiple-choice benchmarks, show that this plug-and-play approach is a practical and effective alternative to multi-checkpoint methods, retaining much of their benefits without their computational overhead.
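
The entropy-based effective rank mentioned above has a standard definition: the exponential of the Shannon entropy of the normalized singular values. The sketch below computes it from a list of singular values (assumed to come from an SVD of a layer update computed elsewhere). The split into high- and low-energy directions via `split_by_effective_rank` is a hypothetical illustration; the abstract does not specify MonoSoup's actual re-weighting coefficients.

```python
import math


def effective_rank(singular_values):
    """Entropy-based effective rank: exp(H(p)) with p_i = s_i / sum(s)."""
    total = sum(singular_values)
    if total <= 0:
        return 0.0
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


def split_by_effective_rank(singular_values):
    """Hypothetical split: the top round(erank) values count as high energy."""
    ordered = sorted(singular_values, reverse=True)
    k = max(1, round(effective_rank(ordered)))
    return ordered[:k], ordered[k:]
```

A flat spectrum gives an effective rank equal to the number of directions, while a spectrum dominated by one singular value gives a value close to 1, so the quantity adapts to each layer without a tuned threshold.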

2510.24410 2026-06-19 cs.CV cs.RO 版本更新

GenTrack2: An Improved Hybrid Approach for Multi-Object Tracking

GenTrack2: 一种改进的多目标跟踪混合方法

Toan Van Nguyen, Rasmus G. K. Christiansen, Dirk Kraft, Leon Bodenhagen

发表机构 * SDU Robotics, University of Southern Denmark(SDU机器人研究所,南丹麦大学)

AI总结 提出结合随机粒子滤波与确定性关联的多目标跟踪方法,通过粒子群优化和新型代价矩阵解决非线性动态下的标识一致性问题,性能优于现有方法。

Comments The content of this paper was included in the full manuscript of GenTrack family which has been submitted to the journal for possible publication

详情
AI中文摘要

本文提出一种视觉多目标跟踪方法,联合使用随机和确定性机制,以确保在非线性动态下未知且时变目标数量的标识一致性。随机粒子滤波处理非线性动态和非高斯噪声,并借助粒子群优化(PSO)将粒子引导至状态分布模式,通过提出的适应度度量(包含运动一致性、外观相似性和与邻近目标的社交互动线索)减轻发散。确定性关联通过提出的代价矩阵进一步强制标识一致性,该矩阵包含粒子与当前检测之间的空间一致性、检测置信度和轨迹惩罚。随后,提出一种新颖方案,在保持目标身份的同时平滑更新目标状态,特别是对于与其他目标交互和长时间遮挡期间的弱轨迹。此外,对过去状态的速度回归提供趋势种子速度,增强粒子采样和状态更新。所提出的跟踪器设计灵活,适用于预录视频和相机直播流(未来帧不可用)。实验结果表明,与最先进的跟踪器相比,性能优越。所提出方法和对比跟踪器的源代码参考实现已在GitHub上提供:https://github.com/SDU-VelKoTek/GenTrack2

英文摘要

This paper proposes a visual multi-object tracking method that jointly employs stochastic and deterministic mechanisms to ensure identifier consistency for unknown and time-varying target numbers under nonlinear dynamics. A stochastic particle filter addresses nonlinear dynamics and non-Gaussian noise, with support from particle swarm optimization (PSO) to guide particles toward state distribution modes and mitigate divergence through proposed fitness measures incorporating motion consistency, appearance similarity, and social-interaction cues with neighboring targets. Deterministic association further enforces identifier consistency via a proposed cost matrix incorporating spatial consistency between particles and current detections, detection confidences, and track penalties. Subsequently, a novel scheme is proposed for the smooth updating of target states while preserving their identities, particularly for weak tracks during interactions with other targets and prolonged occlusions. Moreover, velocity regression over past states provides trend-seed velocities, enhancing particle sampling and state updates. The proposed tracker is designed to operate flexibly for both pre-recorded videos and camera live streams, where future frames are unavailable. Experimental results confirm superior performance compared to state-of-the-art trackers. Source-code reference implementations of both the proposed method and the compared trackers are provided on GitHub: https://github.com/SDU-VelKoTek/GenTrack2

2602.07628 2026-06-19 cs.AI cs.LG 版本更新

SleepMaMi: A Universal Sleep Foundation Model for Integrating Macro- and Micro-structures

SleepMaMi:一种融合宏观与微观结构的通用睡眠基础模型

Keondo Park, Younghoon Na, Yourim Choi, Hyunwoo Ryu, Hyun-Woo Shin, Hyung-Sin Kim

发表机构 * Graduate School of Data Science, Seoul National University, Seoul, South Korea(首尔国立大学数据科学研究生院,韩国首尔) Department of Biomedical Sciences, Seoul National University College of Medicine, Seoul, Republic of Korea(首尔国立大学医学院生物医学科学系,韩国首尔) Obstructive Upper Airway Research (OUaR) Laboratory, Department of Pharmacology, Seoul National University College of Medicine, Seoul, Republic of Korea(首尔国立大学医学院药理学系阻塞性上气道研究(OUaR)实验室,韩国首尔) Department of Otorhinolaryngology-Head and Neck Surgery, Seoul National University Hospital, Seoul, Republic of Korea(首尔国立大学医院耳鼻喉头颈外科系,韩国首尔)

AI总结 提出SleepMaMi睡眠基础模型,通过分层双编码器设计(宏观编码器建模整夜时间依赖,微观编码器捕捉生物信号短时特征),结合人口统计引导对比学习和混合掩码自编码器训练,在超过2万条PSG记录上预训练,在下游任务中优于或匹配现有基础模型。

Comments 8 pages, Appendix 9 pages

详情
AI中文摘要

虽然向统一基础模型的转变已经彻底改变了许多深度学习领域,但睡眠医学仍然主要局限于专注于局部微观结构特征的特定任务模型。这些方法常常忽略多导睡眠图(PSG)丰富的多模态背景,并且未能捕捉整夜睡眠的全局宏观结构。为了解决这个问题,我们引入了SleepMaMi,一种睡眠基础模型,旨在掌握长达一小时的睡眠架构和细粒度信号形态。我们的框架采用分层双编码器设计:宏观编码器用于建模整夜时间依赖,微观编码器用于从生物信号中捕捉短期特征。宏观编码器通过人口统计引导对比学习进行训练,该学习将夜间睡眠模式与客观受试者元数据(如年龄、性别和BMI)对齐,以优化全局表示。微观编码器通过混合掩码自编码器(MAE)和多模态对比目标进行优化。在超过20,000条PSG记录(158K小时)的大规模语料库上预训练,SleepMaMi在多样化的下游任务套件中优于或匹配现有的最先进基础模型,展示了在临床睡眠分析中卓越的泛化能力和标签高效适应能力。

英文摘要

While the shift toward unified foundation models has revolutionized many deep learning domains, sleep medicine remains largely restricted to task-specific models that focus on localized micro-structure features. These approaches often neglect the rich, multi-modal context of Polysomnography (PSG) and fail to capture the global macro-structure of a full night's sleep. To address this, we introduce SleepMaMi, a Sleep Foundation Model engineered to master both hour-long sleep architectures and fine-grained signal morphologies. Our framework utilizes a hierarchical dual-encoder design: a Macro-Encoder to model full-night temporal dependencies and a Micro-Encoder to capture short-term characteristics from biosignals. The Macro-Encoder is trained via Demographic-Guided Contrastive Learning, which aligns overnight sleep patterns with objective subject metadata, such as age, sex, and BMI, to refine global representations. The Micro-Encoder is optimized via a hybrid Masked Autoencoder (MAE) and multi-modal contrastive objective. Pre-trained on a massive corpus of more than 20,000 PSG recordings (158K hours), SleepMaMi outperforms or matches existing state-of-the-art foundation models across a diverse suite of downstream tasks, demonstrating superior generalizability and label-efficient adaptation for clinical sleep analysis.

2506.11719 2026-06-19 math.NA cs.NA physics.comp-ph 版本更新

Automatic differentiation for performing the Cauchy-Kovalevskaya procedure in Lax-Wendroff type discretizations

在Lax-Wendroff类型离散化中执行Cauchy-Kovalevskaya过程的自动微分

Arpit Babbar, Valentin Churavy, Michael Schlottke-Lakemper, Hendrik Ranocha

AI总结 本文引入自动微分(AD)执行Lax-Wendroff方法中的Cauchy-Kowalewski过程,实现任意阶数、无需雅可比矩阵且问题无关的预测步计算,数值实验验证了方法的精度和正性保持。

Journal ref Journal of Computational Physics, 15 October 2026, article 115101, Volume 563

详情
AI中文摘要

Lax-Wendroff方法结合间断Galerkin/通量重构空间离散化,为求解双曲守恒律提供了一种高阶、单步、无求积的方法。本文引入自动微分(AD)来执行Lax-Wendroff方法中用于单元局部时间平均通量计算步骤(预测步)的Cauchy-Kowalewski过程。AD的应用对于任意阶数的方法都是相似的,并且在预测步中不需要正性修正。这与近似Lax-Wendroff过程形成对比,后者需要针对不同阶数的方法使用不同的有限差分公式,并且在预测步中需要对仅能在可接受状态上计算的通量进行正性修正。该方法无需雅可比矩阵且与问题无关,允许直接应用于任何物理通量函数。数值实验证明了该方法的阶数和正性保持。此外,性能比较表明,自动微分的壁钟时间始终与近似Lax-Wendroff方法相当。

英文摘要

Lax-Wendroff methods combined with discontinuous Galerkin/flux reconstruction spatial discretization provide a high-order, single-stage, quadrature-free method for solving hyperbolic conservation laws. In this work, we introduce automatic differentiation (AD) for performing the Cauchy-Kowalewski procedure used in the element-local time average flux computation step (the predictor step) of Lax-Wendroff methods. The application of AD is similar for methods of any order and does not need positivity corrections during the predictor step. This contrasts with the approximate Lax-Wendroff procedure, which requires different finite difference formulas for different orders of the method and positivity corrections in the predictor step for fluxes that can only be computed on admissible states. The method is Jacobian-free and problem-independent, allowing direct application to any physical flux function. Numerical experiments demonstrate the order and positivity preservation of the method. Additionally, performance comparisons indicate that the wall-clock time of automatic differentiation is always on par with the approximate Lax-Wendroff method.
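
For orientation, the Cauchy-Kovalevskaya substitution and the time-averaged flux it feeds can be written out for a scalar conservation law. This is the textbook form, stated under the assumption of a smooth solution and a one-dimensional flux $f$; the truncation order $N$ and the notation are illustrative and not taken from the paper.

```latex
% Cauchy-Kovalevskaya: replace time derivatives by spatial derivatives
\[
u_t + f(u)_x = 0
\quad\Longrightarrow\quad
u_t = -f(u)_x, \qquad
u_{tt} = -\bigl(f'(u)\,u_t\bigr)_x = \bigl(f'(u)\,f(u)_x\bigr)_x .
\]

% Time-averaged flux over one step, from a Taylor expansion in time
\[
F = \frac{1}{\Delta t}\int_{t^n}^{t^{n+1}} f(u)\,\mathrm{d}t
\;\approx\; \sum_{m=0}^{N} \frac{\Delta t^{m}}{(m+1)!}\,
\partial_t^{m} f(u)\Big|_{t=t^n},
\qquad
\partial_t f(u) = f'(u)\,u_t .
\]
```

The last identity is a Jacobian-vector product, which is the kind of quantity forward-mode automatic differentiation evaluates without forming the Jacobian $f'(u)$ explicitly.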

2602.04396 2026-06-19 cs.LG cs.AI 版本更新

LoRDO: Distributed Low-Rank Optimization with Infrequent Communication

LoRDO: 分布式低秩优化与低频通信

Andrej Jovanović, Alex Iacob, Mher Safaryan, Ionut-Vlad Modoranu, Lorenzo Sani, William F. Shen, Xinchi Qiu, Dan Alistarh, Nicholas D. Lane

发表机构 * University of Cambridge(剑桥大学) Institute of Science and Technology Austria(奥地利科学与技术研究院) Lancaster University(兰卡斯特大学) Flower Labs(Flower实验室)

AI总结 提出LoRDO框架,统一低秩优化与低频同步,通过全秩准双曲更新恢复子空间探索,在125M-720M模型规模下实现与低秩DDP近似的性能,通信量减少约10倍。

Comments Accepted at ICML 2026

详情
AI中文摘要

通过$\texttt{DDP}$进行基础模型的分布式训练受限于互连带宽。虽然低频通信策略减少了同步频率,但优化器状态的内存和通信需求仍然构成瓶颈。低秩优化器可以缓解这些限制;然而,在局部更新机制下,工作节点无法访问计算低秩投影所需的全批次梯度,这降低了性能。我们提出$\texttt{LoRDO}$,一个统一低秩优化与低频同步的原则性框架。我们首先证明,虽然基于伪梯度的全局投影在理论上更优,但它们将优化轨迹永久限制在低秩子空间中。为了恢复子空间探索,我们引入了一个全秩准双曲更新。$\texttt{LoRDO}$在125M-720M模型规模的语言建模和下游任务中实现了与低秩$\texttt{DDP}$近乎相同的性能,同时将通信量减少了约10倍。最后,我们表明在具有小秩/小批次大小的极低内存设置中,$\texttt{LoRDO}$的性能提升更为显著。

英文摘要

Distributed training of foundation models via $\texttt{DDP}$ is limited by interconnect bandwidth. While infrequent communication strategies reduce synchronization frequency, they remain bottlenecked by the memory and communication requirements of optimizer states. Low-rank optimizers can alleviate these constraints; however, in the local-update regime, workers lack access to the full-batch gradients required to compute low-rank projections, which degrades performance. We propose $\texttt{LoRDO}$, a principled framework unifying low-rank optimization with infrequent synchronization. We first demonstrate that, while global projections based on pseudo-gradients are theoretically superior, they permanently restrict the optimization trajectory to a low-rank subspace. To restore subspace exploration, we introduce a full-rank quasi-hyperbolic update. $\texttt{LoRDO}$ achieves near-parity with low-rank $\texttt{DDP}$ in language modeling and downstream tasks at model scales of $125$M--$720$M, while reducing communication by $\approx 10 \times$. Finally, we show that $\texttt{LoRDO}$ improves performance even more in very low-memory settings with small rank/batch size.
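
For readers unfamiliar with the term, a quasi-hyperbolic update in its generic form (quasi-hyperbolic momentum) blends the raw gradient with a momentum buffer. The sketch below shows only that generic rule on plain Python lists; how LoRDO combines it with low-rank projections and infrequent synchronization is not described in the abstract, and the parameter names `lr`, `beta`, and `nu` are assumptions made for illustration.

```python
def qhm_step(theta, grad, momentum, lr=0.1, beta=0.9, nu=0.7):
    """One generic quasi-hyperbolic momentum step.

    momentum <- beta * momentum + (1 - beta) * grad
    theta    <- theta - lr * ((1 - nu) * grad + nu * momentum)
    """
    new_momentum = [beta * m + (1.0 - beta) * g
                    for m, g in zip(momentum, grad)]
    new_theta = [t - lr * ((1.0 - nu) * g + nu * m)
                 for t, g, m in zip(theta, grad, new_momentum)]
    return new_theta, new_momentum
```

Setting `nu = 0` recovers plain gradient descent and `nu = 1` recovers a pure (normalized) momentum step, so `nu` controls how much of the unfiltered gradient reaches the parameters.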

2602.04306 2026-06-19 cs.CL cs.AI 版本更新

DeFrame: Debiasing Large Language Models Against Framing Effects

DeFrame: 消除大语言模型中的框架效应偏差

Kahee Lim, Soyeon Kim, Steven Euijong Whang

发表机构 * KAIST(韩国科学技术院)

AI总结 针对大语言模型在语义等价但不同表述的提示下产生不一致偏见的问题,提出框架感知的去偏方法,通过量化框架差异并增强跨框架一致性,有效降低整体偏见并提升鲁棒性。

Comments Accepted to Findings of ACL 2026

详情
AI中文摘要

随着大语言模型(LLMs)在现实应用中的日益部署,确保其在不同人口群体中的公平响应变得至关重要。尽管做出了许多努力,但一个持续的挑战是隐藏的偏见:LLMs 在标准评估下表现公平,但在这些评估设置之外可能产生有偏见的响应。在本文中,我们识别出框架——语义等价的提示在表达方式上的差异(例如,“A 比 B 好” vs. “B 比 A 差”)——作为导致这一差距的一个未被充分探索的因素。我们首先引入“框架差异”的概念来量化框架对公平性评估的影响。通过用替代框架扩充公平性评估基准,我们发现(1)公平性得分随框架变化显著,以及(2)现有的去偏方法改善了整体(即框架平均)公平性,但往往未能减少框架引起的差异。为了解决这个问题,我们提出了一种框架感知的去偏方法,鼓励 LLMs 在不同框架之间更加一致。实验表明,我们的方法减少了整体偏见,并提高了对框架差异的鲁棒性,使 LLMs 能够产生更公平和更一致的响应。

英文摘要

As large language models (LLMs) are increasingly deployed in real-world applications, ensuring their fair responses across demographics has become crucial. Despite many efforts, an ongoing challenge is hidden bias: LLMs appear fair under standard evaluations, but can produce biased responses outside those evaluation settings. In this paper, we identify framing -- differences in how semantically equivalent prompts are expressed (e.g., "A is better than B" vs. "B is worse than A") -- as an underexplored contributor to this gap. We first introduce the concept of "framing disparity" to quantify the impact of framing on fairness evaluation. By augmenting fairness evaluation benchmarks with alternative framings, we find that (1) fairness scores vary significantly with framing and (2) existing debiasing methods improve overall (i.e., frame-averaged) fairness, but often fail to reduce framing-induced disparities. To address this, we propose a framing-aware debiasing method that encourages LLMs to be more consistent across framings. Experiments demonstrate that our approach reduces overall bias and improves robustness against framing disparities, enabling LLMs to produce fairer and more consistent responses.

2602.04115 2026-06-19 cs.GT 版本更新

Robustness of Stable Matchings When Attributes and Salience Determine Preferences

当属性和显著性决定偏好时稳定匹配的鲁棒性

Amit Ronen, S. S. Ravi, Sarit Kraus

AI总结 研究匹配市场中属性向量和显著性向量扰动下稳定匹配的鲁棒性,提出多项式时间算法验证和计算鲁棒半径,并设计近似最鲁棒匹配的搜索算法。

Comments Accepted to AAMAS 2026. This arXiv version contains the full appendix. Version 2 removes two appendix sections containing an incorrect auxiliary argument. All main results remain unchanged

详情
AI中文摘要

在许多匹配市场中——例如运动员招募或学术招生——一方的参与者通过另一方已知的属性向量进行评估,而另一方则应用个体的显著性向量来赋予这些属性相对重要性。由于显著性在实践中会发生变化,一个核心问题随之产生:稳定匹配对此类扰动的鲁棒性如何?我们在此背景下解决了几个基本问题。首先,我们将鲁棒性形式化为一个半径,在该半径内,稳定匹配在显著性向量的任何可容许扰动下(假设已归一化)仍能免疫于阻塞对。给定一个稳定匹配和一个半径,我们提出一个多项式时间算法来验证该匹配是否在指定半径内保持稳定。我们还给出了一个多项式时间算法来计算给定稳定匹配的最大鲁棒半径。此外,我们设计了一种随时搜索算法,利用认证的下界和上界来近似最鲁棒的稳定匹配,并通过可高效计算的界来刻画鲁棒性与成本之间的关系,这些界描述了鲁棒性与成本之间可实现的权衡。最后,我们证明,对于每个稳定匹配,保持其稳定性的显著性轮廓集是单纯形内低维多面体的乘积。这种几何结构精确刻画了每个鲁棒区域的多面体形状;其体积可以高效计算,随着维度增加可采用近似方法,从而将匹配市场中的鲁棒性分析与凸几何的经典工具联系起来。

英文摘要

In many matching markets, such as athlete recruitment or academic admissions, participants on one side are evaluated by attribute vectors known to the other side, which in turn applies individual salience vectors to assign relative importance to these attributes. Since saliences are known to change in practice, a central question arises: how robust is a stable matching to such perturbations? We address several fundamental questions in this context. First, we formalize robustness as a radius within which a stable matching remains immune to blocking pairs under any admissible perturbation of salience vectors (which are assumed to be normalized). Given a stable matching and a radius, we present a polynomial-time algorithm to verify whether the matching is stable within the specified radius. We also give a polynomial-time algorithm for computing the maximum robustness radius of a given stable matching. Further, we design an anytime search algorithm that uses certified lower and upper bounds to approximate the most robust stable matching, and we characterize the robustness-cost relationship through efficiently computable bounds that delineate the achievable tradeoff between robustness and cost. Finally, we show that for each stable matching, the set of salience profiles that preserve its stability factors as a product of low-dimensional polytopes within the simplex. This geometric structure precisely characterizes the polyhedral shape of each robustness region; its volume can then be computed efficiently, with approximate methods available as the dimension grows, thereby linking robustness analysis in matching markets with classical tools from convex geometry.
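
To make the notion of a blocking pair concrete in this setting, the sketch below checks a one-to-one matching for blocking pairs when evaluators score candidates by the dot product of a salience vector and an attribute vector. It is a minimal illustration, not the paper's verification algorithm: the candidate side's preferences are assumed to be given as explicit rankings (`candidate_prefs`), and all names are hypothetical.

```python
def score(salience, attributes):
    """Evaluator's utility for a candidate: salience-weighted attributes."""
    return sum(w * x for w, x in zip(salience, attributes))


def blocking_pairs(matching, salience, attributes, candidate_prefs):
    """Return all (evaluator, candidate) pairs that block the matching.

    matching:        dict evaluator -> candidate (one-to-one, complete)
    salience:        dict evaluator -> salience vector
    attributes:      dict candidate -> attribute vector
    candidate_prefs: dict candidate -> evaluators, most preferred first
    """
    partner_of = {c: e for e, c in matching.items()}
    pairs = []
    for e, w in salience.items():
        current = score(w, attributes[matching[e]])
        for c, x in attributes.items():
            if c == matching[e]:
                continue
            prefs = candidate_prefs[c]
            evaluator_prefers = score(w, x) > current
            candidate_prefers = prefs.index(e) < prefs.index(partner_of[c])
            if evaluator_prefers and candidate_prefers:
                pairs.append((e, c))
    return pairs
```

A matching is stable exactly when this list is empty; a robustness radius then asks how far the salience vectors can move before the list becomes non-empty.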

2509.24894 2026-06-19 math.OC cs.LG 版本更新

Improved Stochastic Optimization of LogSumExp

改进的LogSumExp随机优化

Egor Gladin, Alexey Kroshnin, Jia-Jie Zhu, Pavel Dvurechensky

发表机构 * HSE University(莫斯科高等经济学院) Department of Mathematics, KTH Royal Institute of Technology(皇家理工学院数学系)

AI总结 针对LogSumExp函数在大规模指数项下的优化难题,提出一种保持凸性和光滑性的近似方法,基于Safe KL散度,在分布鲁棒优化和连续最优传输中优于现有基线。

Comments 21 pages, 6 figures, 5 tables; added convergence statement and additional experiments

详情
AI中文摘要

LogSumExp函数作为Kullback-Leibler (KL)散度的对偶函数,在许多重要优化问题中扮演核心角色,包括熵正则化最优传输(OT)和分布鲁棒优化(DRO)。实践中,当对数内指数项数量很大或无穷时,优化变得具有挑战性,因为计算梯度需要对每一项求导。我们提出了一种新颖的保持凸性和光滑性的LogSumExp近似,可以使用随机梯度方法高效优化。该近似基于对偶中KL散度的合理修改,产生了一种新的$f$-散度,称为Safe KL散度。我们在DRO和连续OT中基于LogSumExp的随机优化的实验和理论分析表明,我们的方法优于现有基线。

英文摘要

The LogSumExp function, dual to the Kullback-Leibler (KL) divergence, plays a central role in many important optimization problems, including entropy-regularized optimal transport (OT) and distributionally robust optimization (DRO). In practice, when the number of exponential terms inside the logarithm is large or infinite, optimization becomes challenging since computing the gradient requires differentiating every term. We propose a novel convexity- and smoothness-preserving approximation to LogSumExp that can be efficiently optimized using stochastic gradient methods. This approximation is rooted in a sound modification of the KL divergence in the dual, resulting in a new $f$-divergence called the Safe KL divergence. Our experiments and theoretical analysis of the LogSumExp-based stochastic optimization, arising in DRO and continuous OT, demonstrate the advantages of our approach over existing baselines.
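
The computational difficulty described above is visible in the exact gradient of LogSumExp, which is a softmax over all terms. The sketch below is a numerically stable reference implementation in plain Python; it illustrates the baseline being approximated and does not implement the Safe KL construction.

```python
import math


def logsumexp(values):
    """Stable log(sum(exp(v))) using the max-shift trick."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def logsumexp_grad(values):
    """Gradient of logsumexp: the softmax weights, one per term."""
    m = max(values)
    weights = [math.exp(v - m) for v in values]
    total = sum(weights)
    return [w / total for w in weights]
```

Because every softmax weight depends on the normalizer over all terms, an unbiased stochastic gradient cannot be formed by simply sampling a few terms, which is the obstacle a tailored approximation has to work around.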

2510.06048 2026-06-19 cs.LG 版本更新

BLISS: A Lightweight Bilevel Influence Scoring Method for Data Selection in Language Model Pretraining

BLISS: 一种用于语言模型预训练数据选择的轻量级双层影响评分方法

Jie Hao, Rui Yu, Wei Zhang, Huixia Wang, Jie Xu, Mingrui Liu

发表机构 * Department of Computer Science, George Mason University, USA(乔治梅森大学计算机科学系,美国) IBM T.J. Watson Research Center, USA(IBM T.J. Watson研究中心,美国) Department of Statistics, Rice University(莱斯大学统计系) Department of System Engineering & Operations Research, George Mason University, USA(乔治梅森大学系统工程与运筹学系,美国)

AI总结 提出一种无需外部预训练模型的轻量级数据选择方法BLISS,通过双层优化和代理模型估计训练样本的长期影响,实现高效数据筛选,在C4数据集上预训练多种规模模型,显著加速收敛并提升下游任务性能。

详情
AI中文摘要

有效的数据选择对于预训练大型语言模型(LLM)至关重要,可以提高效率并增强对下游任务的泛化能力。然而,现有方法通常需要利用外部预训练模型,使得难以将数据选择的效果与外部预训练模型的效果分开。此外,如果模型训练至收敛,它们通常忽略所选数据的长期影响,这主要是由于全规模LLM预训练的过高成本。在本文中,我们介绍了BLISS(用于数据选择的轻量级双层影响评分方法):一种轻量级数据选择方法,完全从头开始操作,不依赖任何外部预训练预言模型,同时明确考虑所选数据的长期影响。BLISS利用一个小型代理模型作为LLM的替代,并采用一个评分模型来估计如果代理模型训练至收敛时训练样本的长期影响。我们将数据选择形式化为一个双层优化问题,其中上层目标优化评分模型以分配重要性权重给训练样本,确保最小化下层目标(即在加权训练损失上训练代理模型直至收敛)导致最佳验证性能。一旦优化完成,训练好的评分模型预测数据集的影响分数,从而能够高效选择高质量样本用于LLM预训练。我们通过在C4数据集的选择子集上预训练410M/1B/2.8B Pythia和LLaMA-0.5B模型来验证BLISS。值得注意的是,在1B模型设置下,BLISS在达到与最先进方法相同性能时实现了1.7倍的加速,展示了在多个下游任务上的优越性能。

英文摘要

Effective data selection is essential for pretraining large language models (LLMs), enhancing efficiency and improving generalization to downstream tasks. However, existing approaches often require leveraging external pretrained models, making it difficult to disentangle the effects of data selection from those of the external pretrained models. In addition, they often overlook the long-term impact of selected data if the model is trained to convergence, primarily due to the prohibitive cost of full-scale LLM pretraining. In this paper, we introduce BLISS (BileveL Influence Scoring method for data Selection): a lightweight data selection method that operates entirely from scratch, without relying on any external pretrained oracle models, while explicitly accounting for the long-term impact of selected data. BLISS leverages a small proxy model as a surrogate for the LLM and employs a score model to estimate the long-term influence of training samples if the proxy model is trained to convergence. We formulate data selection as a bilevel optimization problem, where the upper-level objective optimizes the score model to assign importance weights to training samples, ensuring that minimizing the lower-level objective (i.e., training the proxy model over the weighted training loss until convergence) leads to the best validation performance. Once optimized, the trained score model predicts influence scores for the dataset, enabling efficient selection of high-quality samples for LLM pretraining. We validate BLISS by pretraining 410M/1B/2.8B Pythia and LLaMA-0.5B models on selected subsets of the C4 dataset. Notably, under the 1B model setting, BLISS achieves a $1.7\times$ speedup in reaching the same performance as the state-of-the-art method, demonstrating superior performance across multiple downstream tasks.
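
The bilevel structure described above can be written compactly. The notation below is an assumption introduced for illustration ($\phi$ for score-model parameters, $\theta$ for proxy-model parameters, $s_\phi(x_i)$ for the importance weight of sample $x_i$); the abstract gives the structure in words only.

```latex
\[
\min_{\phi}\; \mathcal{L}_{\mathrm{val}}\bigl(\theta^{*}(\phi)\bigr)
\quad\text{s.t.}\quad
\theta^{*}(\phi) \in \arg\min_{\theta}\;
\sum_{i=1}^{n} s_{\phi}(x_i)\,\ell(\theta; x_i)
\]
```

The lower level trains the proxy model to convergence on the weighted loss, and the upper level adjusts the weights so that the converged proxy performs well on validation data.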

2602.01425 2026-06-19 cs.AI cs.LG 版本更新

One Probe Won't Catch Them All: Towards Targeted Deception Detection

一个探针无法捕捉所有:迈向有针对性的欺骗检测

Vikram Natarajan, Devina Jain, Shivam Arora, Satvik Golechha, Joseph Bloom

发表机构 * LASR Labs(LASR实验室) UK AI Security Institute(英国人工智能安全研究所)

AI总结 针对线性探针在欺骗检测中的异质性,提出根据具体欺骗类型匹配探针可显著提升性能(AUC提升0.108),建议组织定义威胁模型并部署相应探针。

详情
AI中文摘要

线性探针是一种有前景的监测AI系统欺骗行为的方法。先前工作表明,在对比指令对和简单数据集上训练的线性分类器可以达到良好性能。然而,这些探针即使在简单场景中也表现出显著失败,包括虚假相关性和对非欺骗响应的误报。在本文中,我们证明欺骗检测本质上是异质的:虽然单个通用探针实现了适度的改进(+0.032 AUC),但事后最优分析显示,当探针与特定欺骗类型匹配时,潜力显著更高(+0.108 AUC),并且合成验证实验表明,当欺骗类型事先已知时,这一上限是先验可实现的。我们的发现表明,指令对捕捉的是欺骗意图而非内容特定模式,这解释了为什么提示选择主导探针性能(占70.6%的方差)。鉴于这种异质性,我们得出结论,组织应定义其特定威胁模型并部署适当匹配的探针,而不是寻求通用的欺骗检测器。

英文摘要

Linear probes are a promising approach for monitoring AI systems for deceptive behaviour. Previous work has shown that a linear classifier trained on a contrastive instruction pair and a simple dataset can achieve good performance. However, these probes exhibit notable failures even in straightforward scenarios, including spurious correlations and false positives on non-deceptive responses. In this paper, we demonstrate that deception detection is inherently heterogeneous: while a single universal probe achieves modest improvements (+0.032 AUC), post-hoc oracle analysis reveals substantially higher potential (+0.108 AUC) when probes are matched to specific deception types, and synthetic validation experiments suggest this ceiling is achievable a priori when the deception type is known in advance. Our findings reveal that instruction pairs capture deceptive intent rather than content-specific patterns, explaining why prompt choice dominates probe performance (70.6% of variance). Given this heterogeneity, we conclude that organizations should define their specific threat models and deploy appropriately matched probes rather than seeking a universal deception detector.
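
As a rough illustration of what a linear probe and its AUC evaluation involve, the sketch below fits a difference-of-means direction on two sets of activation vectors and scores new activations by projection. This is a deliberately simple stand-in (the paper's probes may be trained differently, e.g. by logistic regression), and all function names are hypothetical.

```python
def mean_vector(rows):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def fit_probe(honest_acts, deceptive_acts):
    """Difference-of-means direction pointing from honest to deceptive."""
    mu_h = mean_vector(honest_acts)
    mu_d = mean_vector(deceptive_acts)
    return [d - h for d, h in zip(mu_d, mu_h)]


def probe_score(direction, activation):
    """Linear probe output: projection of the activation onto the direction."""
    return sum(w * a for w, a in zip(direction, activation))


def auc(pos_scores, neg_scores):
    """Probability that a positive outscores a negative (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

Evaluating one direction on several deception types and comparing the per-type AUC values is the kind of measurement that would expose the heterogeneity the abstract reports.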

2602.01391 2026-06-19 cs.CV 版本更新

Relighting as a Probe of Visual Priors via Augmented Latent Intrinsics

通过增强潜在本征属性将重光照作为视觉先验的探针

Xiaoyan Xing, Xiao Zhang, Sezer Karaoglu, Theo Gevers, Anand Bhattad

发表机构 * UvA-Bosch Delta Lab, University of Amsterdam, Amsterdam, Netherlands(阿姆斯特丹大学UvA-Bosch Delta实验室,荷兰阿姆斯特丹) The University of Chicago, Chicago, USA(芝加哥大学) Johns Hopkins University, Baltimore, USA(约翰霍普金斯大学)

AI总结 提出增强潜在本征属性(ALI)方法,融合密集像素对齐视觉特征到潜在本征重光照模型,平衡语义与光度保真度,提升复杂材质重光照质量。

Comments Camera-ready version for ICML 2026. Project page: https://augmented-latent-intrinsics.github.io

详情
AI中文摘要

图像到图像的重光照需要能够将光照与场景属性分离,同时保留密集几何、材质和光度线索的表征。我们将此任务用作视觉先验的探针:与奖励不变性的识别任务不同,重光照测试视觉特征是否保留光传输所需的信息。通过一个受控的生成式重光照框架,我们发现强语义编码器会降低重光照质量,揭示了抽象与物理保真度之间的语义-光度权衡。我们引入了增强潜在本征属性(ALI),通过将密集的、像素对齐的视觉特征融合到潜在本征重光照模型中,并在未标注的真实图像对上通过自监督进行细化,来平衡这一权衡。ALI提高了重光照质量,尤其是在光泽、金属和透明材质上,并证明了生成式重光照是量化视觉编码器对物理世界编码内容的有效工具。

英文摘要

Image-to-image relighting requires representations that separate illumination from scene properties while preserving dense geometry, material, and photometric cues. We use this task as a probe of visual priors: unlike recognition tasks that reward invariance, relighting tests whether visual features retain the information needed for light transfer. Through a controlled generative relighting framework, we find that strong semantic encoders can degrade relighting quality, exposing a semantic--photometric trade-off between abstraction and physical fidelity. We introduce Augmented Latent Intrinsics (ALI), which balances this trade-off by fusing dense, pixel-aligned visual features into a latent-intrinsic relighting model and refining it with self-supervision on unlabeled real image pairs. ALI improves relighting quality, especially on glossy, metallic, and transparent materials, and demonstrates that generative relighting is an effective tool for quantifying what visual encoders encode about the physical world.

2602.00510 2026-06-19 cs.AI cs.LG cs.SE 版本更新

PCBSchemaGen: Reward-Guided LLM Code Synthesis for Printed Circuit Boards (PCB) Schematic Design with Structured Verification

PCBSchemaGen: 奖励引导的LLM代码合成用于印刷电路板(PCB)原理图设计及结构化验证

Huanghaohe Zou, Peng Han, Emad Nazerian, Mafu Zhang, Zhicheng Guo, Alex Q. Huang

发表机构 * Semiconductor Power Electronics Center (SPEC)(半导体功率电子中心) The University of Texas at Austin(德克萨斯大学奥斯汀分校) Arizona State University(亚利桑那州立大学)

AI总结 提出PCBSchemaGen框架,通过结构化验证器引导冻结的LLM生成可修复的PCB原理图,在无单元测试的领域实现高准确率。

详情
AI中文摘要

大多数LLM代码合成基准依赖于单元测试作为奖励预言,但PCB原理图设计没有这样的测试:正确性由真实IC封装和引脚级分配的结构化物理约束定义,每个任务的金标准参考不可用,且SPICE仿真无法验证原理图级正确性。我们提出PCBSchemaGen,一个无需训练的推理时框架,将冻结的LLM转变为可验证、可修复的PCB原理图生成器。该框架从IC数据手册中提取领域模式以约束LLM解码,将其与一个具有引脚级错误定位的确定性5层连续奖励验证器配对,并通过汤普森采样臂获取赌博机优化候选方案。我们在两个PCB基准上评估,涵盖22个统一电路领域的227个真实IC任务,包括一个从公开原理图导出的套件,作为完全保留的泛化测试(验证器、知识图谱库和提示在评估前冻结)。在我们的框架下,一个开放权重的31B模型(Gemma-4-31B)平均通过PCBBench任务的81.3%,且同一框架在两个基准间迁移时无需更改验证器代码;而基于相同Gemma-4-31B骨干网络的Circuitron式推理时提示基线在困难的系统级设计上崩溃。这表明在确定性结构验证器下的推理时优化是在没有单元测试预言的领域中实现无参考LLM代码合成的一般方法。我们的基准和确定性验证器已公开发布:https://github.com/HZou9/PCBSchemaGen_v2

英文摘要

Most LLM code-synthesis benchmarks rely on unit tests as the reward oracle, but PCB schematic design has none: correctness is defined by structured physical constraints over real IC packages and pin-level assignments, per-task golden references are unavailable, and SPICE simulation does not validate schematic-level correctness. We introduce PCBSchemaGen, a training-free inference-time framework that turns a frozen LLM into a verifiable, repairable PCB schematic generator. The framework induces a domain schema from IC datasheets to ground LLM decoding, pairs it with a deterministic 5-layer continuous-reward verifier with pin-level error localization, and refines candidates through a Thompson Sampling arm-acquiring bandit. We evaluate on 2 PCB benchmarks covering 227 real-IC tasks across 22 unified circuit domains, including a public-schematic-derived suite that serves as a fully held-out generalization test (verifier, KG library, and prompts frozen before any evaluation). Under our framework, an open-weight 31B model (Gemma-4-31B) passes 81.3% of PCBBench tasks on average, and the same framework transfers across both benchmarks with zero verifier code changes; a Circuitron-style inference-time prompting baseline on the same Gemma-4-31B backbone collapses on hard system-level designs. This suggests inference-time refinement under a deterministic structural verifier is a general recipe for reference-free LLM code synthesis in domains without unit-test oracles. Our benchmarks and deterministic verifier are publicly available at https://github.com/HZou9/PCBSchemaGen_v2.
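
Thompson Sampling with a continuous reward in [0, 1] can be sketched with Beta posteriors and fractional updates. The code below is a generic illustration under that assumption, not the framework's refinement loop: the arm names, the `reward_fn` callback (standing in for a verifier score), and the fractional Beta update are all hypothetical choices. New arms can be acquired at any time by adding a key with the uniform prior `(1.0, 1.0)`.

```python
import random


def update(posterior, reward):
    """Fractional Beta update for a reward in [0, 1]."""
    alpha, beta = posterior
    return (alpha + reward, beta + (1.0 - reward))


def select_arm(posteriors, rng):
    """Sample one value per arm from its Beta posterior and take the best."""
    samples = {arm: rng.betavariate(a, b)
               for arm, (a, b) in posteriors.items()}
    return max(samples, key=samples.get)


def run(reward_fn, initial_arms, rounds, rng):
    """Run Thompson Sampling; reward_fn(arm) must return a value in [0, 1]."""
    posteriors = {arm: (1.0, 1.0) for arm in initial_arms}
    pulls = {arm: 0 for arm in initial_arms}
    for _ in range(rounds):
        arm = select_arm(posteriors, rng)
        posteriors[arm] = update(posteriors[arm], reward_fn(arm))
        pulls[arm] += 1
    return posteriors, pulls
```

A call such as `run(lambda arm: 0.9 if arm == "repair" else 0.2, ["repair", "resample"], 100, random.Random(0))` would concentrate pulls on the arm with the higher verifier score while still occasionally exploring the other.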

2602.00244 2026-06-19 math.NA cs.NA 版本更新

A Bayesian Approach to Feedback Control for Hyperbolic Balance Laws

双曲平衡律反馈控制的贝叶斯方法

Markus Bambach, Shaoshuai Chu, Michael Herty, Yunong Lin

AI总结 提出贝叶斯框架用于双曲平衡律的边界反馈控制,利用Lyapunov衰减估计作为似然传播反馈参数的概率分布,在线性和非线性问题中验证了方法的鲁棒性和可迁移性。

详情
AI中文摘要

我们提出了一个用于双曲平衡律反馈边界控制的贝叶斯框架。该方法利用Lyapunov衰减估计作为似然,在反馈参数上传播概率分布。对于线性模型,它恢复了现有的解析稳定性结果,并扩展到理论有限的非线性区域。使用一阶局部Lax-Friedrichs(LLF)离散化,我们在解耦波动系统和线性化Saint-Venant方程上验证了该方法,再现了已知的稳定性区间和混合边界耦合。然后我们处理非线性和随机问题,包括非线性Saint-Venant系统、一维和二维Burgers方程、具有随机初始数据的Burgers方程,以及带有源项的非守恒扰动,并表明推断的稳定性域相对于指标和先验是鲁棒的。最后,我们展示了向二阶半离散LLF方案和用于激光粉末床熔融功率调节的两参数反馈模型的迁移。

英文摘要

We propose a Bayesian framework for feedback boundary control of hyperbolic balance laws. The method propagates a probability distribution over feedback parameters using Lyapunov decay estimates as a likelihood. For linear models, it recovers available analytical stability results and extends to nonlinear regimes where theory is limited. Using first-order local Lax-Friedrichs (LLF) discretizations, we validate the approach on the decoupled wave system and the linearized Saint-Venant equations, reproducing known stability intervals and mixed boundary couplings. We then treat nonlinear and stochastic problems, including the nonlinear Saint-Venant system, one- and two-dimensional Burgers equations, Burgers equation with random initial data, and nonconservative perturbations with source terms, and show that the inferred stability domains are robust with respect to the indicator and the prior. Finally, we demonstrate transfer to a second-order semi-discrete LLF scheme and to a two-parameter feedback model for laser powder bed fusion with power regulation.