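arXivDaily: a daily academic digest synchronized with the full arXiv dataset, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering and systems.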


2509.15927 2026-06-19 cs.LG cs.AI Updated version

Enhancing Generative Auto-bidding with Offline Reward Evaluation and Policy Search

Zhiyu Mou, Yiqin Lv, Miao Xu, Qi Wang, Yixiu Mao, Jinghao Chen, Qichen Ye, Chao Li, Rongquan Bai, Chuan Yu, Jian Xu, Bo Zheng

Affiliations: Taobao & Tmall Group of Alibaba; Department of Automation, Tsinghua University

AI summary: To address the performance bottleneck of existing generative auto-bidding methods, which cannot explore beyond the static dataset, the paper proposes AIGB-Pearl, which uses a trajectory evaluator and a KL-Lipschitz-constrained score-maximization scheme for safe and efficient exploration, and reports state-of-the-art performance on simulated and real-world advertising systems.

Abstract

Auto-bidding is a critical tool for advertisers to improve advertising performance. Recent progress has demonstrated that AI-Generated Bidding (AIGB), which learns a conditional generative planner from offline data, achieves superior performance compared to typical offline reinforcement learning (RL)-based auto-bidding methods. However, existing AIGB methods still face a performance bottleneck due to their inherent inability to explore beyond the static dataset with feedback. To address this, we propose AIGB-Pearl (Planning with EvaluAtor via RL), a novel method that integrates generative planning and policy optimization. The core of AIGB-Pearl lies in constructing a trajectory evaluator to assess the quality of generated scores and designing a provably sound KL-Lipschitz-constrained score-maximization scheme to ensure safe and efficient exploration beyond the offline dataset. A practical algorithm that incorporates the synchronous coupling technique is further developed to ensure the model regularity required by the proposed scheme. Extensive experiments on both simulated and real-world advertising systems demonstrate the state-of-the-art performance of our approach.


2603.01250 2026-06-19 cs.CV cs.AI Updated version

The MAMA-MIA Challenge: Advancing Generalizability and Fairness in Breast MRI Tumor Segmentation and Treatment Response Prediction

Lidia Garrucho, Smriti Joshi, Kaisar Kushibar, Richard Osuala, Maciej Bobowicz, Xavier Bargalló, Paulius Jaruševičius, Kai Geissler, Raphael Schäfer, Muhammad Alberb, Tony Xu, Anne Martel, Daniel Sleiman, Navchetan Awasthi, Hadeel Awwad, Joan C. Vilanova, Robert Martí, Daan Schouten, Jeong Hoon Lee, Mirabela Rusu, Eleonora Poeta, Luisa Vargas, Eliana Pastor, Maria A. Zuluaga, Jessica Kächele, Dimitrios Bounias, Alexandra Ertl, Katarzyna Gwoździewicz, Maria-Laura Cosaka, Pasant M. Abo-Elhoda, Sara W. Tantawy, Shorouq S. Sakrana, Norhan O. Shawky-Abdelfatah, Amr Muhammad Abdo-Salem, Androniki Kozana, Eugen Divjak, Gordana Ivanac, Katerina Nikiforaki, Michail E. Klontzas, Rosa García-Dosdá, Meltem Gulsun-Akpinar, Oğuz Lafcı, Carlos Martín-Isla, Oliver Díaz, Laura Igual, Karim Lekadir

Affiliations: Barcelona Artificial Intelligence in Medicine Lab (BCN-AIM), Facultat de Matemàtiques i Informàtica, Universitat de Barcelona

AI summary: Presents the MAMA-MIA Challenge, a standardized benchmark for evaluating breast MRI tumor segmentation and prediction of pathologic complete response; analyzes model generalizability and fairness on cross-continental, multi-center data and finds a trade-off between performance and subgroup fairness.

Abstract

Breast cancer is the most frequently diagnosed malignancy among women worldwide and a leading cause of cancer-related mortality. Dynamic contrast-enhanced magnetic resonance imaging plays a central role in tumor characterization and treatment monitoring, particularly in patients receiving neoadjuvant chemotherapy. However, existing artificial intelligence models for breast magnetic resonance imaging are typically developed and evaluated using heterogeneous datasets, study populations, and assessment protocols, making direct comparison difficult and limiting understanding of model robustness across institutions and clinically relevant patient subgroups. The MAMA-MIA Challenge was designed to address these challenges by providing a standardized benchmark for the joint evaluation of primary tumor segmentation and prediction of pathologic complete response using pre-treatment magnetic resonance imaging only. The training cohort comprised 1,506 patients from multiple institutions in the United States, while evaluation was conducted on an external test set of 574 patients from three independent European centers to assess cross-continental and cross-institutional generalization. A unified scoring framework combined predictive performance with subgroup consistency across age, menopausal status, and breast density. Twenty-six international teams participated in the final evaluation phase. Results demonstrate substantial performance variability under a common external evaluation framework and reveal trade-offs between overall accuracy and subgroup fairness. The challenge provides standardized datasets, evaluation protocols, and public resources to promote the development of robust and equitable artificial intelligence systems for breast cancer imaging.


2603.00654 2026-06-19 cs.CV Updated version

RC-GeoCP: Geometric Consensus for Radar-Camera Collaborative Perception

Xiaokai Bai, Lianqing Zheng, Runwei Guan, Siyuan Cao, Songkai Wang, Huiliang Shen

Affiliations: College of Information Science and Electronic Engineering, Zhejiang University; School of Automotive Studies, Tongji University; Thrust of Artificial Intelligence, Hong Kong University of Science and Technology

AI summary: Proposes RC-GeoCP, the first collaborative perception framework that fuses 4D radar and camera data; a radar-anchored geometric consensus resolves misalignment caused by depth ambiguity and spatial dispersion, yielding communication-efficient, globally consistent representations.

Comments: 11 pages, 6 figures, 9 tables

Abstract

Collaborative perception (CP) enhances scene understanding through multi-agent information sharing. While LiDAR-centric systems offer precise geometry, high costs and performance degradation in adverse weather necessitate multi-modal alternatives. Despite dense visual semantics and robust spatial measurements, the synergy between cameras and 4D radar remains underexplored in collaborative settings. This work introduces RC-GeoCP, the first framework to explore the fusion of 4D radar and images in CP. To resolve misalignment caused by depth ambiguity and spatial dispersion across agents, RC-GeoCP establishes a radar-anchored geometric consensus. Specifically, Geometric Structure Rectification (GSR) aligns visual semantics with geometry derived from radar to generate spatially grounded, geometry-consistent representations. Uncertainty-Aware Communication (UAC) formulates selective transmission as a conditional entropy reduction process to prioritize informative features based on inter-agent disagreement. Finally, the Consensus-Driven Assembler (CDA) aggregates multi-agent information via shared geometric anchors to form a globally coherent representation. We establish the first unified radar-camera CP benchmark on V2X-Radar and V2X-R, demonstrating state-of-the-art performance with significantly reduced communication overhead. Code will be released soon.


2602.23248 2026-06-19 cs.AI Updated version

Mitigating Legibility Tax with Decoupled Prover-Verifier Games

Yegon Kim, Juho Lee

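Affiliations: KAIST

AI summary: Proposes the Decoupled Prover-Verifier Game (DPVG), which separates correctness from checkability by training a translator model that converts a fixed solver's solutions into a checkable form, improving checkability while preserving answer correctness and thereby mitigating the legibility tax.

Comments: ICLR 2026 Workshop Trustworthy AI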

2602.23172 2026-06-19 cs.CV cs.AI cs.RO Updated version

Latent Gaussian Splatting for 4D Panoptic Occupancy Tracking

Maximilian Luz, Rohit Mohan, Thomas Nürnberg, Yakov Miron, Daniele Cattaneo, Abhinav Valada

Affiliations: University of Freiburg; Bosch Research; University of Haifa

AI summary: Proposes Latent Gaussian Splatting (LaGS), which uses feature-bearing Gaussians as dynamic keypoints to aggregate multi-view features for 4D panoptic occupancy tracking, achieving state-of-the-art performance on Occ3D nuScenes and Waymo.

Comments: Accepted to IEEE Robotics and Automation Letters (RA-L), 2026

DOI: 10.1109/LRA.2026.3703990

Abstract

Capturing 4D spatiotemporal scene structure is crucial for the safe and reliable operation of robots in dynamic environments. However, existing approaches typically address only part of the problem: they either provide coarse geometric tracking via bounding boxes or detailed 3D occupancy estimates that lack explicit temporal association and instance-level reasoning. In this work, we present Latent Gaussian Splatting (LaGS) for 4D Panoptic Occupancy Tracking (4D-POT). We revisit the underlying representation and model 3D features as a sparse set of feature-bearing Gaussians. These act as dynamic, volume-oriented keypoints that enable spatially continuous, distance-weighted aggregation of multi-view features before being splatted into a voxel grid for decoding. This point-centric formulation enables flexible, data-dependent receptive fields and long-range spatial interactions that are difficult to capture with local and dense voxel-based operators. A hierarchical Gaussian representation further enables multi-scale reasoning by combining global context from coarse super-points with fine-grained detail from higher-resolution streams. Extensive experiments on Occ3D nuScenes and Waymo demonstrate state-of-the-art performance for 4D-POT. We provide code and models at https://lags.cs.uni-freiburg.de/.


2602.22959 2026-06-19 cs.CV Updated version

Can Agents Distinguish Visually Hard-to-Separate Diseases in a Zero-Shot Setting? A Pilot Study

Zihao Zhao, Frederik Hauke, Juliana De Castilhos, Sven Nebelung, Daniel Truhn

Affiliations: Department of Diagnostic and Interventional Radiology, University Hospital Aachen, 52074 Aachen, Germany

AI summary: Explores whether multimodal large language model agents can distinguish visually confounded diseases (e.g., melanoma vs. atypical nevus, pulmonary edema vs. pneumonia) in a zero-shot setting; proposes a multi-agent framework based on contrastive adjudication that improves accuracy by 11 percentage points on dermoscopy data, although overall performance remains insufficient for clinical deployment.

Comments: Code available at https://github.com/TruhnLab/Contrastive-Agent-Reasoning. Accepted by MICCAI 2026

Abstract

The rapid progress of multimodal large language models (MLLMs) has led to increasing interest in agent-based systems. While most prior work in medical imaging concentrates on automating routine clinical workflows, we study an underexplored yet clinically significant setting: distinguishing visually hard-to-separate diseases in a zero-shot setting. We benchmark representative agents on two imaging-only proxy diagnostic tasks, (1) melanoma vs. atypical nevus and (2) pulmonary edema vs. pneumonia, where visual features are highly confounded despite substantial differences in clinical management. We introduce a multi-agent framework based on contrastive adjudication. Experimental results show improved diagnostic performance (an 11-percentage-point gain in accuracy on dermoscopy data) and reduced unsupported claims on qualitative samples, although overall performance remains insufficient for clinical deployment. We acknowledge the inherent uncertainty in human annotations and the absence of clinical context, which further limit the translation to real-world settings. Within this controlled setting, this pilot study provides preliminary insights into zero-shot agent performance in visually confounded scenarios.


2508.15228 2026-06-19 cs.CV Updated version

Collaborative Multi-Modal Coding for High-Quality 3D Generation

Ziang Cao, Zhaoxi Chen, Liang Pan, Ziwei Liu

Affiliations: S-Lab, Nanyang Technological University, Singapore; Shanghai Artificial Intelligence Laboratory

AI summary: Proposes TriMM, the first feed-forward 3D-native generative model; it fuses RGB, RGBD, and point-cloud features through collaborative multi-modal coding and combines auxiliary 2D/3D supervision with a triplane latent diffusion model to generate high-quality 3D assets.

Abstract

3D content inherently encompasses multi-modal characteristics and can be projected into different modalities (e.g., RGB images, RGBD, and point clouds). Each modality exhibits distinct advantages in 3D asset modeling: RGB images contain vivid 3D textures, whereas point clouds define fine-grained 3D geometries. However, most existing 3D-native generative architectures either operate predominantly within single-modality paradigms, thus overlooking the complementary benefits of multi-modality data, or restrict themselves to 3D structures, thereby limiting the scope of available training datasets. To holistically harness multi-modalities for 3D modeling, we present TriMM, the first feed-forward 3D-native generative model that learns from basic multi-modalities (e.g., RGB, RGBD, and point cloud). Specifically, 1) TriMM first introduces collaborative multi-modal coding, which integrates modality-specific features while preserving their unique representational strengths. 2) Furthermore, auxiliary 2D and 3D supervision is introduced to raise the robustness and performance of multi-modal coding. 3) Based on the embedded multi-modal code, TriMM employs a triplane latent diffusion model to generate 3D assets of superior quality, enhancing both the texture and the geometric detail. Extensive experiments on multiple well-known datasets demonstrate that TriMM, by effectively leveraging multi-modality, achieves competitive performance with models trained on large-scale datasets, despite utilizing a small amount of training data. Furthermore, we conduct additional experiments on recent RGB-D datasets, verifying the feasibility of incorporating other multi-modal datasets into 3D generation.


2602.15819 2026-06-19 cs.CV Updated version

VideoSketcher: Sequential Sketch Generation Using Video Model Priors

Hui Ren, Yuval Alaluf, Omer Bar Tal, Alexander Schwing, Antonio Torralba, Yael Vinker

Affiliations: MIT

AI summary: Proposes VideoSketcher, which combines the semantic planning of LLMs with the temporal rendering of video diffusion models; a two-stage fine-tuning strategy learns stroke ordering and style from a small number of examples to generate high-quality sequential sketches.

Abstract

Sketching is inherently sequential: strokes are drawn progressively to explore and refine ideas. Yet most generative approaches treat sketches as static images, ignoring the temporal process underlying creative exploration. Modeling this sequential structure remains challenging: prior methods either rely on large-scale human-drawn datasets with limited diversity, or use large language models (LLMs) to produce drawing instructions, often at the cost of visual fidelity. We present VideoSketcher, a method for generating high-quality sketching processes by adapting pretrained text-to-video diffusion models to the sparse, continuous nature of sketch formation. Our key insight is that LLMs and video diffusion models offer complementary strengths: LLMs act as semantic planners that decompose concepts into step-by-step instructions, while video diffusion models serve as powerful "renderers" that translate them into temporally coherent sketch sequences. We introduce a two-stage fine-tuning strategy that decouples temporal structure from visual appearance: stroke ordering is learned from synthetic shape compositions, while style is distilled from as few as seven hand-drawn examples. Despite minimal supervision, our method can generate diverse, high-quality sequential sketches that faithfully follow specified drawing orders. Our framework naturally extends to brush style control and autoregressive generation, supporting artistic applications.


2602.14696 2026-06-19 cs.LG Updated version

A Critical Look at Targeted Instruction Selection: Disentangling What Matters (and What Doesn't)

Nihal V. Nayak, Paula Rodriguez-Diaz, Neha Hulkund, Sara Beery, David Alvarez-Melis

Affiliations: Harvard University; MIT; Kempner Institute

AI summary: Systematically disentangles the two core ingredients of targeted instruction selection for instruction fine-tuning, data representation and selection algorithm; finds that gradient-based representations combined with greedy round-robin selection perform best at low budgets, with gains that diminish as the budget grows, and unifies several algorithms as approximate distance minimization.

Comments: ICML 2026

Abstract

Instruction fine-tuning of large language models (LLMs) often involves selecting a subset of instruction training data from a large candidate pool, using a small query set from the target task. Despite growing interest, the literature on targeted instruction selection remains fragmented and opaque: methods vary widely in selection budgets, often omit zero-shot baselines, and frequently entangle the contributions of key components. As a result, practitioners lack actionable guidance on selecting instructions for their target tasks. In this work, we aim to bring clarity to this landscape by disentangling and systematically analyzing the two core ingredients: data representation and selection algorithms. Our framework enables controlled comparisons across models, tasks, and budgets. We find that only gradient-based data representations choose subsets whose similarity to the query consistently predicts performance across datasets, models, and candidate pools. While no single method dominates, gradient-based representations paired with greedy round-robin selection often perform best on average at low budgets, but these gains diminish at larger budgets. Finally, we unify several existing selection algorithms as forms of approximate distance minimization between the selected subset and the query set, and support this view with new generalization bounds. More broadly, our findings provide critical insights and a foundation for more principled data selection in LLM fine-tuning. The code is available at https://github.com/dcml-lab/targeted-instruction-selection.
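To make the "greedy round-robin selection" mentioned in the abstract above more concrete, here is a minimal Python sketch under stated assumptions: each query takes a turn and claims its nearest not-yet-selected candidate until the budget is spent. The function name `greedy_round_robin`, the precomputed `distances` matrix, and this exact turn-taking rule are illustrative assumptions, not the authors' implementation; the paper's code is at the URL given in the abstract.

```python
def greedy_round_robin(distances, budget):
    """Select candidate indices by letting queries take turns.

    distances[q][c] is an assumed, precomputed distance between query q
    and candidate c (e.g. derived from gradient-based representations).
    Returns the selected candidate indices in the order they were picked.
    """
    num_queries = len(distances)
    num_candidates = len(distances[0])
    selected = []
    chosen = set()
    q = 0
    # Stop when the budget is met or the candidate pool is exhausted.
    while len(selected) < min(budget, num_candidates):
        remaining = (c for c in range(num_candidates) if c not in chosen)
        # The current query greedily claims its closest remaining candidate.
        best = min(remaining, key=lambda c: distances[q][c])
        selected.append(best)
        chosen.add(best)
        q = (q + 1) % num_queries  # hand the turn to the next query
    return selected


if __name__ == "__main__":
    toy = [[0.1, 0.9, 0.5],
           [0.2, 0.3, 0.8]]
    print(greedy_round_robin(toy, budget=2))
```

The round-robin turn order gives every query the same number of picks (up to one), which is one plausible way to keep the selected subset close to the whole query set rather than to a few dominant queries.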


2512.11173 2026-06-19 cs.RO Updated version

Learning Category-level Last-meter Navigation from RGB Demonstrations of a Single-instance

Tzu-Hsien Lee, Fidan Mahmudova, Karthik Desingh

Affiliations: University of Minnesota, Twin Cities

AI summary: Proposes an object-centric imitation learning framework that uses RGB observations to achieve precise last-meter navigation for a quadruped mobile manipulator, without depth or map priors, reaching high success rates in category-level generalization.

Abstract

Achieving precise positioning of the mobile manipulator's base is essential for successful manipulation actions that follow. Most of the RGB-based navigation systems only guarantee coarse, meter-level accuracy, making them less suitable for the precise positioning phase of mobile manipulation. This gap prevents manipulation policies from operating within the distribution of their training demonstrations, resulting in frequent execution failures. We address this gap by introducing an object-centric imitation learning framework for last-meter navigation, enabling a quadruped mobile manipulator robot to achieve manipulation-ready positioning using only RGB observations from its onboard cameras. Our method conditions the navigation policy on three inputs: goal images, multi-view RGB observations from the onboard cameras, and a text prompt specifying the target object. A language-driven segmentation module and a spatial score-matrix decoder then supply explicit object grounding and relative pose reasoning. Using real-world data from a single object instance within a category, the system generalizes to unseen object instances across diverse environments with challenging lighting and background conditions. To comprehensively evaluate this, we introduce two metrics: an edge-alignment metric, which uses ground truth orientation, and an object-alignment metric, which evaluates how well the robot visually faces the target. Under these metrics, our policy achieves 74.58% success in edge-alignment and 89.42% success in object-alignment when positioning relative to unseen target objects. These results show that precise last-meter navigation can be achieved at a category-level without depth, LiDAR, or map priors, enabling a scalable pathway toward unified mobile manipulation. Project page: https://rpm-lab-umn.github.io/category-level-last-meter-nav/


2508.21677 2026-06-19 cs.RO Updated version

Robust Convex Model Predictive Control with collision avoidance guarantees for robot manipulators

Bernhard Wullt, Johannes Köhler, Per Mattsson, Mikeal Norrlöf, Thomas B. Schön

Affiliations: ABB robotics; Department of Mechanical Engineering, Imperial College London; Department of Information Technology, Uppsala University

AI summary: Proposes a convex MPC scheme that combines robust tube MPC with a corridor planning algorithm to achieve fast, collision-free motion for industrial robots under model uncertainty, outperforming benchmark methods.

Abstract

Industrial manipulators typically operate in cluttered environments, where safe motion planning is critical. However, model uncertainties further complicate this task, which leads to conservative speed limits to reduce the influence of disturbances. Hence, there is a need for control methods that can guarantee safe motions which are executed fast. We address this by suggesting a novel model predictive control (MPC) solution for manipulators, where our two main components are a robust tube MPC and a corridor planning algorithm to obtain collision-free motion. Our solution results in a convex MPC formulation, which we can solve fast, making our method practically useful. We demonstrate the efficacy of our method in a simulated environment with a 6 DOF industrial robot operating in cluttered environments with uncertain model parameters. We outperform benchmark methods by tolerating higher levels of model uncertainty while achieving faster motion.


2602.09689 2026-06-19 cs.LG Updated version

Model soups need only one ingredient

Alireza Abdollahpoorrostam, Nikolaos Dimitriadis, Adam Hazimeh, Pascal Frossard

Affiliations: EPFL; EPFL LTS4

AI summary: Proposes MonoSoup, which applies SVD to the layer updates of a single checkpoint and automatically re-weights the components using entropy-based effective rank, achieving a strong in-distribution/out-of-distribution balance without multiple checkpoints.

Abstract

Fine-tuning large pre-trained models on a target distribution often improves in-distribution (ID) accuracy, but at the cost of out-of-distribution (OOD) robustness as representations specialize to the fine-tuning data. Weight-space ensembling methods, such as Model Soups, mitigate this effect by averaging multiple checkpoints, but they are computationally prohibitive, requiring the training and storage of dozens of fine-tuned models. In this paper, we introduce MonoSoup, a simple, data-free, hyperparameter-free, post-hoc method that achieves a strong ID-OOD balance using only a single checkpoint. Our method applies Singular Value Decomposition (SVD) to each layer's update and decomposes it into high-energy directions that capture task-specific adaptation and low-energy directions that introduce noise but may still encode residual signals useful for robustness. MonoSoup then uses entropy-based effective rank to automatically re-weigh these components with layer-wise coefficients that account for the spectral and geometric structure of the model. Experiments on CLIP models fine-tuned on ImageNet and evaluated under natural distribution shifts, as well as on Qwen language models tested on mathematical reasoning and multiple-choice benchmarks, show that this plug-and-play approach is a practical and effective alternative to multi-checkpoint methods, retaining much of their benefits without their computational overhead.
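As an illustration of the entropy-based effective rank mentioned in the abstract above, the following minimal Python sketch computes it from a layer update's singular values and uses it to split them into high- and low-energy groups. The function names, the rounding rule, and the split itself are assumptions for illustration only; the abstract does not specify MonoSoup's actual layer-wise coefficients, and the singular values are assumed to come from an SVD computed elsewhere (for example with `numpy.linalg.svd`).

```python
import math


def effective_rank(singular_values):
    """Entropy-based effective rank of a spectrum.

    Normalizes the singular values into a probability distribution and
    returns exp(Shannon entropy). Equal singular values give the full
    count; a single dominant value gives a result close to 1.
    """
    total = sum(singular_values)
    if total <= 0:
        raise ValueError("need at least one positive singular value")
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


def split_by_effective_rank(singular_values):
    """Hypothetical split: the top-k values (k = rounded effective rank)
    are treated as high-energy directions, the rest as low-energy."""
    ordered = sorted(singular_values, reverse=True)
    k = max(1, round(effective_rank(ordered)))
    return ordered[:k], ordered[k:]


if __name__ == "__main__":
    high, low = split_by_effective_rank([4.0, 4.0, 0.0])
    print(high, low)
```

A re-weighting scheme in the spirit of the abstract would then scale the two groups with different coefficients before reconstructing the update; those coefficients are deliberately left out here because the abstract does not state them.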


2510.24410 2026-06-19 cs.CV cs.RO Updated version

GenTrack2: An Improved Hybrid Approach for Multi-Object Tracking

Toan Van Nguyen, Rasmus G. K. Christiansen, Dirk Kraft, Leon Bodenhagen

Affiliations: SDU Robotics, University of Southern Denmark

AI summary: Proposes a multi-object tracking method that combines stochastic particle filtering with deterministic association; particle swarm optimization and a new cost matrix address identifier consistency under nonlinear dynamics, outperforming existing methods.

Comments: The content of this paper was included in the full manuscript of GenTrack family which has been submitted to the journal for possible publication

Abstract

This paper proposes a visual multi-object tracking method that jointly employs stochastic and deterministic mechanisms to ensure identifier consistency for unknown and time-varying target numbers under nonlinear dynamics. A stochastic particle filter addresses nonlinear dynamics and non-Gaussian noise, with support from particle swarm optimization (PSO) to guide particles toward state distribution modes and mitigate divergence through proposed fitness measures incorporating motion consistency, appearance similarity, and social-interaction cues with neighboring targets. Deterministic association further enforces identifier consistency via a proposed cost matrix incorporating spatial consistency between particles and current detections, detection confidences, and track penalties. Subsequently, a novel scheme is proposed for the smooth updating of target states while preserving their identities, particularly for weak tracks during interactions with other targets and prolonged occlusions. Moreover, velocity regression over past states provides trend-seed velocities, enhancing particle sampling and state updates. The proposed tracker is designed to operate flexibly for both pre-recorded videos and camera live streams, where future frames are unavailable. Experimental results confirm superior performance compared to state-of-the-art trackers. The source-code reference implementations of both the proposed method and compared-trackers are provided on GitHub: https://github.com/SDU-VelKoTek/GenTrack2
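The "velocity regression over past states" mentioned in the abstract above can be pictured with a small Python sketch: an ordinary least-squares line is fitted to one coordinate of a track's recent positions, and its slope serves as a trend velocity. The function name `trend_velocity`, the per-axis treatment, and the choice of plain least squares are assumptions for illustration; the authors' reference implementation is at the GitHub URL given in the abstract.

```python
def trend_velocity(times, positions):
    """Least-squares slope of position against time for one coordinate.

    times and positions are equal-length sequences of past timestamps and
    the corresponding coordinate values (e.g. a bounding-box center x).
    """
    n = len(times)
    if n < 2 or n != len(positions):
        raise ValueError("need at least two matching samples")
    t_mean = sum(times) / n
    p_mean = sum(positions) / n
    denom = sum((t - t_mean) ** 2 for t in times)
    if denom == 0:
        raise ValueError("timestamps must not all be identical")
    numer = sum((t - t_mean) * (p - p_mean) for t, p in zip(times, positions))
    return numer / denom


if __name__ == "__main__":
    # A target moving 2 units per frame along x.
    print(trend_velocity([0, 1, 2, 3], [1.0, 3.0, 5.0, 7.0]))
```

Fitting over several past states rather than differencing the last two makes the estimate less sensitive to a single noisy detection, which is one reason such a trend velocity could help seed particle sampling.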

2602.07628 2026-06-19 cs.AI cs.LG 版本更新

SleepMaMi: A Universal Sleep Foundation Model for Integrating Macro- and Micro-structures

SleepMaMi：一种融合宏观与微观结构的通用睡眠基础模型

Keondo Park, Younghoon Na, Yourim Choi, Hyunwoo Ryu, Hyun-Woo Shin, Hyung-Sin Kim

发表机构 * Graduate School of Data Science, Seoul National University, Seoul, South Korea（首尔国立大学数据科学研究生院，韩国首尔）； Department of Biomedical Sciences, Seoul National University College of Medicine, Seoul, Republic of Korea（首尔国立大学医学院生物医学科学系，韩国首尔）； Obstructive Upper Airway Research (OUaR) Laboratory, Department of Pharmacology, Seoul National University College of Medicine, Seoul, Republic of Korea（首尔国立大学医学院药理学系阻塞性上气道研究（OUaR）实验室，韩国首尔）； Department of Otorhinolaryngology-Head and Neck Surgery, Seoul National University Hospital, Seoul, Republic of Korea（首尔国立大学医院耳鼻喉头颈外科系，韩国首尔）

AI总结提出SleepMaMi睡眠基础模型，通过分层双编码器设计（宏观编码器建模整夜时间依赖，微观编码器捕捉生物信号短时特征），结合人口统计引导对比学习和混合掩码自编码器训练，在超过2万条PSG记录上预训练，在下游任务中优于或匹配现有基础模型。

Comments 8 pages, Appendix 9 pages

AI中文摘要

虽然向统一基础模型的转变已经彻底改变了许多深度学习领域，但睡眠医学仍然主要局限于专注于局部微观结构特征的特定任务模型。这些方法常常忽略多导睡眠图（PSG）丰富的多模态背景，并且未能捕捉整夜睡眠的全局宏观结构。为了解决这个问题，我们引入了SleepMaMi，一种睡眠基础模型，旨在掌握长达一小时的睡眠架构和细粒度信号形态。我们的框架采用分层双编码器设计：宏观编码器用于建模整夜时间依赖，微观编码器用于从生物信号中捕捉短期特征。宏观编码器通过人口统计引导对比学习进行训练，该学习将夜间睡眠模式与客观受试者元数据（如年龄、性别和BMI）对齐，以优化全局表示。微观编码器通过混合掩码自编码器（MAE）和多模态对比目标进行优化。在超过20,000条PSG记录（158K小时）的大规模语料库上预训练，SleepMaMi在多样化的下游任务套件中优于或匹配现有的最先进基础模型，展示了在临床睡眠分析中卓越的泛化能力和标签高效适应能力。

英文摘要

While the shift toward unified foundation models has revolutionized many deep learning domains, sleep medicine remains largely restricted to task-specific models that focus on localized micro-structure features. These approaches often neglect the rich, multi-modal context of Polysomnography (PSG) and fail to capture the global macro-structure of a full night's sleep. To address this, we introduce SleepMaMi, a Sleep Foundation Model engineered to master both hour-long sleep architectures and fine-grained signal morphologies. Our framework utilizes a hierarchical dual-encoder design: a Macro-Encoder to model full-night temporal dependencies and a Micro-Encoder to capture short-term characteristics from biosignals. The Macro-Encoder is trained via Demographic-Guided Contrastive Learning, which aligns overnight sleep patterns with objective subject metadata, such as age, sex, and BMI, to refine global representations. The Micro-Encoder is optimized via a hybrid Masked Autoencoder (MAE) and multi-modal contrastive objective. Pre-trained on a massive corpus of >20,000 PSG recordings (158K hours), SleepMaMi outperforms or matches state-of-the-art existing foundation models across a diverse suite of downstream tasks, demonstrating superior generalizability and label-efficient adaptation for clinical sleep analysis.

2602.04396 2026-06-19 cs.LG cs.AI 版本更新

LoRDO: Distributed Low-Rank Optimization with Infrequent Communication

LoRDO: 分布式低秩优化与低频通信

Andrej Jovanović, Alex Iacob, Mher Safaryan, Ionut-Vlad Modoranu, Lorenzo Sani, William F. Shen, Xinchi Qiu, Dan Alistarh, Nicholas D. Lane

发表机构 * University of Cambridge（剑桥大学）； Institute of Science and Technology Austria（奥地利科学与技术研究院）； Lancaster University（兰卡斯特大学）； Flower Labs（Flower实验室）

AI总结提出LoRDO框架，统一低秩优化与低频同步，通过全秩准双曲更新恢复子空间探索，在125M-720M模型规模下实现与低秩DDP近似的性能，通信量减少约10倍。

Comments Accepted at ICML 2026

AI中文摘要

通过$\texttt{DDP}$进行基础模型的分布式训练受限于互连带宽。虽然低频通信策略减少了同步频率，但优化器状态的内存和通信需求仍然构成瓶颈。低秩优化器可以缓解这些限制；然而，在局部更新机制下，工作节点无法访问计算低秩投影所需的全批次梯度，这降低了性能。我们提出$\texttt{LoRDO}$，一个统一低秩优化与低频同步的原则性框架。我们首先证明，虽然基于伪梯度的全局投影在理论上更优，但它们将优化轨迹永久限制在低秩子空间中。为了恢复子空间探索，我们引入了一个全秩准双曲更新。$\texttt{LoRDO}$在125M-720M模型规模的语言建模和下游任务中实现了与低秩$\texttt{DDP}$近乎相同的性能，同时将通信量减少了约10倍。最后，我们表明在具有小秩/小批次大小的极低内存设置中，$\texttt{LoRDO}$的性能提升更为显著。

英文摘要

Distributed training of foundation models via $\texttt{DDP}$ is limited by interconnect bandwidth. While infrequent communication strategies reduce synchronization frequency, they remain bottlenecked by the memory and communication requirements of optimizer states. Low-rank optimizers can alleviate these constraints; however, in the local-update regime, workers lack access to the full-batch gradients required to compute low-rank projections, which degrades performance. We propose $\texttt{LoRDO}$, a principled framework unifying low-rank optimization with infrequent synchronization. We first demonstrate that, while global projections based on pseudo-gradients are theoretically superior, they permanently restrict the optimization trajectory to a low-rank subspace. To restore subspace exploration, we introduce a full-rank quasi-hyperbolic update. $\texttt{LoRDO}$ achieves near-parity with low-rank $\texttt{DDP}$ in language modeling and downstream tasks at model scales of $125$M--$720$M, while reducing communication by $\approx 10 \times$. Finally, we show that $\texttt{LoRDO}$ improves performance even more in very low-memory settings with small rank/batch size.
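
To make the local-update regime mentioned in the abstract more concrete, here is a minimal sketch of generic infrequent-synchronization training with a pseudo-gradient, on a toy scalar quadratic. It is not LoRDO itself: the low-rank projection and the quasi-hyperbolic update are left out, and every name (`local_update_training`, `sync_every`, the per-worker targets) is a hypothetical chosen for illustration.

```python
def local_update_training(targets, steps=100, sync_every=10, lr=0.1):
    """Toy local-update training: each worker minimizes 0.5 * (w - target)**2.

    Workers take `sync_every` local gradient steps, then the server applies
    the averaged pseudo-gradient (global weight minus mean local weight).
    Returns the final global weight and the number of synchronization rounds.
    """
    global_w = 0.0
    sync_rounds = 0
    for _ in range(steps // sync_every):
        local_ws = []
        for target in targets:
            w = global_w
            for _ in range(sync_every):
                w -= lr * (w - target)  # gradient of 0.5 * (w - target)**2
            local_ws.append(w)
        # Pseudo-gradient: how far the workers moved away from the global weight.
        pseudo_grad = global_w - sum(local_ws) / len(local_ws)
        global_w -= pseudo_grad  # outer step with learning rate 1 (plain averaging)
        sync_rounds += 1
    return global_w, sync_rounds


final_w, rounds = local_update_training([1.0, 3.0], steps=100, sync_every=10)
print(final_w, rounds)
```

With `sync_every=10`, the toy workers synchronize 10 times over 100 steps instead of 100 times. The roughly tenfold communication reduction quoted in the abstract is an empirical result for LoRDO and is not derived from this sketch.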

2602.04306 2026-06-19 cs.CL cs.AI 版本更新

DeFrame: Debiasing Large Language Models Against Framing Effects

DeFrame: 消除大语言模型中的框架效应偏差

Kahee Lim, Soyeon Kim, Steven Euijong Whang

发表机构 * KAIST（韩国科学技术院）

AI总结针对大语言模型在语义等价但不同表述的提示下产生不一致偏见的问题，提出框架感知的去偏方法，通过量化框架差异并增强跨框架一致性，有效降低整体偏见并提升鲁棒性。

Comments Accepted to Findings of ACL 2026

AI中文摘要

随着大语言模型（LLMs）在现实应用中的日益部署，确保其在不同人口群体中的公平响应变得至关重要。尽管做出了许多努力，但一个持续的挑战是隐藏的偏见：LLMs 在标准评估下表现公平，但在这些评估设置之外可能产生有偏见的响应。在本文中，我们识别出框架——语义等价的提示在表达方式上的差异（例如，“A 比 B 好” vs. “B 比 A 差”）——作为导致这一差距的一个未被充分探索的因素。我们首先引入“框架差异”的概念来量化框架对公平性评估的影响。通过用替代框架扩充公平性评估基准，我们发现（1）公平性得分随框架变化显著，以及（2）现有的去偏方法改善了整体（即框架平均）公平性，但往往未能减少框架引起的差异。为了解决这个问题，我们提出了一种框架感知的去偏方法，鼓励 LLMs 在不同框架之间更加一致。实验表明，我们的方法减少了整体偏见，并提高了对框架差异的鲁棒性，使 LLMs 能够产生更公平和更一致的响应。

英文摘要

As large language models (LLMs) are increasingly deployed in real-world applications, ensuring their fair responses across demographics has become crucial. Despite many efforts, an ongoing challenge is hidden bias: LLMs appear fair under standard evaluations, but can produce biased responses outside those evaluation settings. In this paper, we identify framing -- differences in how semantically equivalent prompts are expressed (e.g., "A is better than B" vs. "B is worse than A") -- as an underexplored contributor to this gap. We first introduce the concept of "framing disparity" to quantify the impact of framing on fairness evaluation. By augmenting fairness evaluation benchmarks with alternative framings, we find that (1) fairness scores vary significantly with framing and (2) existing debiasing methods improve overall (i.e., frame-averaged) fairness, but often fail to reduce framing-induced disparities. To address this, we propose a framing-aware debiasing method that encourages LLMs to be more consistent across framings. Experiments demonstrate that our approach reduces overall bias and improves robustness against framing disparities, enabling LLMs to produce fairer and more consistent responses.
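
The abstract does not state how "framing disparity" is computed, so the following is only an illustrative sketch under assumed definitions: each framing gets a bias score in [0, 1], overall bias is the mean over framings, and disparity is the spread between the most and least biased framing. The function names and all numbers are hypothetical.

```python
def frame_averaged_bias(bias_by_framing):
    """Mean bias score over all framings of the same benchmark."""
    return sum(bias_by_framing.values()) / len(bias_by_framing)


def framing_disparity(bias_by_framing):
    """Spread between the most and least biased framing (assumed definition)."""
    scores = bias_by_framing.values()
    return max(scores) - min(scores)


# Made-up scores for illustration only; they are not results from the paper.
baseline = {"A is better than B": 0.30, "B is worse than A": 0.50}
debiased = {"A is better than B": 0.10, "B is worse than A": 0.40}

for name, scores in [("baseline", baseline), ("debiased", debiased)]:
    print(name, round(frame_averaged_bias(scores), 2), round(framing_disparity(scores), 2))
```

In this toy case the frame-averaged bias drops after debiasing while the gap between framings widens, which is the kind of pattern finding (2) describes.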

2510.06048 2026-06-19 cs.LG 版本更新

BLISS: A Lightweight Bilevel Influence Scoring Method for Data Selection in Language Model Pretraining

BLISS: 一种用于语言模型预训练数据选择的轻量级双层影响评分方法

Jie Hao, Rui Yu, Wei Zhang, Huixia Wang, Jie Xu, Mingrui Liu

发表机构 * Department of Computer Science, George Mason University, USA（乔治·梅森大学计算机科学系，美国）； IBM T.J. Watson Research Center, USA（IBM T.J. Watson研究中心，美国）； Department of Statistics, Rice University（莱斯大学统计系）； Department of System Engineering & Operations Research, George Mason University, USA（乔治·梅森大学系统工程与运筹学系，美国）

AI总结提出一种无需外部预训练模型的轻量级数据选择方法BLISS，通过双层优化和代理模型估计训练样本的长期影响，实现高效数据筛选，在C4数据集上预训练多种规模模型，显著加速收敛并提升下游任务性能。

AI中文摘要

有效的数据选择对于预训练大型语言模型（LLM）至关重要，可以提高效率并增强对下游任务的泛化能力。然而，现有方法通常需要利用外部预训练模型，使得难以将数据选择的效果与外部预训练模型的效果分开。此外，如果模型训练至收敛，它们通常忽略所选数据的长期影响，这主要是由于全规模LLM预训练的过高成本。在本文中，我们介绍了BLISS（用于数据选择的轻量级双层影响评分方法）：一种轻量级数据选择方法，完全从头开始操作，不依赖任何外部预训练预言模型，同时明确考虑所选数据的长期影响。BLISS利用一个小型代理模型作为LLM的替代，并采用一个评分模型来估计如果代理模型训练至收敛时训练样本的长期影响。我们将数据选择形式化为一个双层优化问题，其中上层目标优化评分模型以分配重要性权重给训练样本，确保最小化下层目标（即在加权训练损失上训练代理模型直至收敛）导致最佳验证性能。一旦优化完成，训练好的评分模型预测数据集的影响分数，从而能够高效选择高质量样本用于LLM预训练。我们通过在C4数据集的选择子集上预训练410M/1B/2.8B Pythia和LLaMA-0.5B模型来验证BLISS。值得注意的是，在1B模型设置下，BLISS在达到与最先进方法相同性能时实现了1.7倍的加速，展示了在多个下游任务上的优越性能。

英文摘要

Effective data selection is essential for pretraining large language models (LLMs), enhancing efficiency and improving generalization to downstream tasks. However, existing approaches often require leveraging external pretrained models, making it difficult to disentangle the effects of data selection from those of the external pretrained models. In addition, they often overlook the long-term impact of selected data if the model is trained to convergence, primarily due to the prohibitive cost of full-scale LLM pretraining. In this paper, we introduce BLISS (BileveL Influence Scoring method for data Selection): a lightweight data selection method that operates entirely from scratch, without relying on any external pretrained oracle models, while explicitly accounting for the long-term impact of selected data. BLISS leverages a small proxy model as a surrogate for the LLM and employs a score model to estimate the long-term influence of training samples if the proxy model is trained to convergence. We formulate data selection as a bilevel optimization problem, where the upper-level objective optimizes the score model to assign importance weights to training samples, ensuring that minimizing the lower-level objective (i.e., training the proxy model over the weighted training loss until convergence) leads to the best validation performance. Once optimized, the trained score model predicts influence scores for the dataset, enabling efficient selection of high-quality samples for LLM pretraining. We validate BLISS by pretraining 410M/1B/2.8B Pythia and LLaMA-0.5B models on selected subsets of the C4 dataset. Notably, under the 1B model setting, BLISS achieves $1.7\times$ speedup in reaching the same performance as the state-of-the-art method, demonstrating superior performance across multiple downstream tasks.
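
The bilevel problem described in the abstract can be written out roughly as follows. The notation is an assumption made for illustration and may differ from the paper's: $\theta$ denotes the score-model parameters, $w$ the proxy-model parameters, $s_\theta(x_i)$ the importance weight of training sample $x_i$, $\ell$ the per-sample training loss, and $\mathcal{L}_{\mathrm{val}}$ the validation loss.

```latex
\min_{\theta} \; \mathcal{L}_{\mathrm{val}}\bigl(w^{*}(\theta)\bigr)
\quad \text{s.t.} \quad
w^{*}(\theta) \in \arg\min_{w} \sum_{i=1}^{N} s_{\theta}(x_i)\, \ell(w; x_i)
```

The lower level trains the proxy model to convergence on the weighted loss; the upper level tunes the score model so that the converged proxy performs best on validation data. After optimization, $s_\theta$ can rank samples for selection.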

2602.01425 2026-06-19 cs.AI cs.LG 版本更新

One Probe Won't Catch Them All: Towards Targeted Deception Detection

一个探针无法捕捉所有：迈向有针对性的欺骗检测

Vikram Natarajan, Devina Jain, Shivam Arora, Satvik Golechha, Joseph Bloom

发表机构 * LASR Labs（LASR实验室）； UK AI Security Institute（英国人工智能安全研究所）

AI总结针对线性探针在欺骗检测中的异质性，提出根据具体欺骗类型匹配探针可显著提升性能（AUC提升0.108），建议组织定义威胁模型并部署相应探针。

AI中文摘要

线性探针是一种有前景的监测AI系统欺骗行为的方法。先前工作表明，在对比指令对和简单数据集上训练的线性分类器可以达到良好性能。然而，这些探针即使在简单场景中也表现出显著失败，包括虚假相关性和对非欺骗响应的误报。在本文中，我们证明欺骗检测本质上是异质的：虽然单个通用探针实现了适度的改进（+0.032 AUC），但事后最优分析显示，当探针与特定欺骗类型匹配时，潜力显著更高（+0.108 AUC），并且合成验证实验表明，当欺骗类型事先已知时，这一上限是先验可实现的。我们的发现表明，指令对捕捉的是欺骗意图而非内容特定模式，这解释了为什么提示选择主导探针性能（占70.6%的方差）。鉴于这种异质性，我们得出结论，组织应定义其特定威胁模型并部署适当匹配的探针，而不是寻求通用的欺骗检测器。

英文摘要

Linear probes are a promising approach for monitoring AI systems for deceptive behaviour. Previous work has shown that a linear classifier trained on a contrastive instruction pair and a simple dataset can achieve good performance. However, these probes exhibit notable failures even in straightforward scenarios, including spurious correlations and false positives on non-deceptive responses. In this paper, we demonstrate that deception detection is inherently heterogeneous: while a single universal probe achieves modest improvements (+0.032 AUC), post-hoc oracle analysis reveals substantially higher potential (+0.108 AUC) when probes are matched to specific deception types, and synthetic validation experiments suggest this ceiling is achievable a priori when the deception type is known in advance. Our findings reveal that instruction pairs capture deceptive intent rather than content-specific patterns, explaining why prompt choice dominates probe performance (70.6% of variance). Given this heterogeneity, we conclude that organizations should define their specific threat models and deploy appropriately matched probes rather than seeking a universal deception detector.
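
Since the abstract reports its results as AUC differences, a small sketch of how a linear probe's scores turn into an AUC may help. This is a generic illustration, not the authors' pipeline: `probe_score`, the hand-picked direction, and the toy activations are all hypothetical, and real probes are fit on model activations.

```python
def roc_auc(scores, labels):
    """Probability that a random deceptive example outscores a random honest one.

    `labels` uses 1 for deceptive and 0 for honest; ties count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def probe_score(activation, direction, bias=0.0):
    """Linear probe: dot product of an activation vector with a learned direction."""
    return sum(a * d for a, d in zip(activation, direction)) + bias


# Toy activations and a hand-picked direction, for illustration only.
activations = [[2.0, 0.1], [1.5, -0.3], [0.2, 0.4], [-1.0, 0.0]]
labels = [1, 1, 0, 0]
scores = [probe_score(a, direction=[1.0, 0.0]) for a in activations]
print(roc_auc(scores, labels))
```

An AUC of 0.5 corresponds to chance-level ranking, so gains such as +0.032 or +0.108 are measured on a scale whose useful range is 0.5 to 1.0.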

2602.01391 2026-06-19 cs.CV 版本更新

Relighting as a Probe of Visual Priors via Augmented Latent Intrinsics

通过增强潜在本征属性将重光照作为视觉先验的探针

Xiaoyan Xing, Xiao Zhang, Sezer Karaoglu, Theo Gevers, Anand Bhattad

发表机构 * UvA-Bosch Delta Lab, University of Amsterdam, Amsterdam, Netherlands（阿姆斯特丹大学UvA-Bosch Delta实验室，荷兰阿姆斯特丹）； The University of Chicago, Chicago, USA（芝加哥大学，美国芝加哥）； Johns Hopkins University, Baltimore, USA（约翰斯·霍普金斯大学，美国巴尔的摩）

AI总结提出增强潜在本征属性（ALI）方法，融合密集像素对齐视觉特征到潜在本征重光照模型，平衡语义与光度保真度，提升复杂材质重光照质量。

Comments Camera-ready version for ICML 2026. Project page: https://augmented-latent-intrinsics.github.io

AI中文摘要

图像到图像的重光照需要能够将光照与场景属性分离，同时保留密集几何、材质和光度线索的表征。我们将此任务用作视觉先验的探针：与奖励不变性的识别任务不同，重光照测试视觉特征是否保留光传输所需的信息。通过一个受控的生成式重光照框架，我们发现强语义编码器会降低重光照质量，揭示了抽象与物理保真度之间的语义-光度权衡。我们引入了增强潜在本征属性（ALI），通过将密集的、像素对齐的视觉特征融合到潜在本征重光照模型中，并在未标注的真实图像对上通过自监督进行细化，来平衡这一权衡。ALI提高了重光照质量，尤其是在光泽、金属和透明材质上，并证明了生成式重光照是量化视觉编码器对物理世界编码内容的有效工具。

英文摘要

Image-to-image relighting requires representations that separate illumination from scene properties while preserving dense geometry, material, and photometric cues. We use this task as a probe of visual priors: unlike recognition tasks that reward invariance, relighting tests whether visual features retain the information needed for light transfer. Through a controlled generative relighting framework, we find that strong semantic encoders can degrade relighting quality, exposing a semantic--photometric trade-off between abstraction and physical fidelity. We introduce Augmented Latent Intrinsics (ALI), which balances this trade-off by fusing dense, pixel-aligned visual features into a latent-intrinsic relighting model and refining it with self-supervision on unlabeled real image pairs. ALI improves relighting quality, especially on glossy, metallic, and transparent materials, and demonstrates that generative relighting is an effective tool for quantifying what visual encoders encode about the physical world.

2602.00510 2026-06-19 cs.AI cs.LG cs.SE 版本更新

PCBSchemaGen: Reward-Guided LLM Code Synthesis for Printed Circuit Boards (PCB) Schematic Design with Structured Verification

PCBSchemaGen: 奖励引导的LLM代码合成用于印刷电路板(PCB)原理图设计及结构化验证

Huanghaohe Zou, Peng Han, Emad Nazerian, Mafu Zhang, Zhicheng Guo, Alex Q. Huang

发表机构 * Semiconductor Power Electronics Center (SPEC)（半导体功率电子中心）； The University of Texas at Austin（德克萨斯大学奥斯汀分校）； Arizona State University（亚利桑那州立大学）

AI总结提出PCBSchemaGen框架，通过结构化验证器引导冻结的LLM生成可修复的PCB原理图，在无单元测试的领域实现高准确率。

AI中文摘要

大多数LLM代码合成基准依赖于单元测试作为奖励预言，但PCB原理图设计没有这样的测试：正确性由真实IC封装和引脚级分配的结构化物理约束定义，每个任务的金标准参考不可用，且SPICE仿真无法验证原理图级正确性。我们提出PCBSchemaGen，一个无需训练的推理时框架，将冻结的LLM转变为可验证、可修复的PCB原理图生成器。该框架从IC数据手册中归纳领域模式以约束LLM解码，将其与一个具有引脚级错误定位的确定性5层连续奖励验证器配对，并通过基于汤普森采样的臂获取（arm-acquiring）赌博机对候选方案进行迭代优化。我们在两个PCB基准上评估，涵盖22个统一电路领域的227个真实IC任务，包括一个从公开原理图导出的套件，作为完全保留的泛化测试（验证器、知识图谱库和提示在评估前冻结）。在我们的框架下，一个开放权重的31B模型（Gemma-4-31B）平均通过PCBBench任务的81.3%，且同一框架在两个基准间迁移时无需更改验证器代码；而基于相同Gemma-4-31B骨干网络的Circuitron式推理时提示基线在困难的系统级设计上崩溃。这表明在确定性结构验证器下的推理时优化是在没有单元测试预言的领域中实现无参考LLM代码合成的一般方法。我们的基准和确定性验证器已在 https://github.com/HZou9/PCBSchemaGen_v2 公开。

英文摘要

Most LLM code-synthesis benchmarks rely on unit tests as the reward oracle, but PCB schematic design has none: correctness is defined by structured physical constraints over real IC packages and pin-level assignments, per-task golden references are unavailable, and SPICE simulation does not validate schematic-level correctness. We introduce PCBSchemaGen, a training-free inference-time framework that turns a frozen LLM into a verifiable, repairable PCB schematic generator. The framework induces a domain schema from IC datasheets to ground LLM decoding, pairs it with a deterministic 5-layer continuous-reward verifier with pin-level error localization, and refines candidates through a Thompson Sampling arm-acquiring bandit. We evaluate on 2 PCB benchmarks covering 227 real-IC tasks across 22 unified circuit domains, including a public-schematic-derived suite that serves as a fully held-out generalization test (verifier, KG library, and prompts frozen before any evaluation). Under our framework, an open-weight 31B model (Gemma-4-31B) passes 81.3% of PCBBench tasks on average, and the same framework transfers across both benchmarks with zero verifier code changes; a Circuitron-style inference-time prompting baseline on the same Gemma-4-31B backbone collapses on hard system-level designs. This suggests inference-time refinement under a deterministic structural verifier is a general recipe for reference-free LLM code synthesis in domains without unit-test oracles. Our benchmarks and deterministic verifier are publicly available at https://github.com/HZou9/PCBSchemaGen_v2.

2601.22970 2026-06-19 cs.LG cs.AI 版本更新

Stabilizing the Q-Gradient Field for Policy Smoothness in Actor-Critic Methods

稳定Q-梯度场以实现Actor-Critic方法中的策略平滑性

Jeong Woon Lee, Kyoleen Kwak, Daeho Kim, Hyoseok Hwang

发表机构 * College of Software, Kyung Hee University（韩国庆熙大学软件学院）

AI总结针对连续动作空间中actor-critic方法策略振荡问题，提出基于评论家微分几何的PAVE框架，通过稳定Q-梯度场实现策略平滑，无需修改actor。

AI中文摘要

通过连续actor-critic方法学习的策略通常表现出不稳定的高频振荡，使其不适合物理部署。当前方法试图通过直接正则化策略输出来强制平滑性。我们认为这种方法治标不治本。在这项工作中，我们从理论上建立了策略非平滑性根本上由评论家的微分几何决定。通过对actor-critic目标应用隐式微分，我们证明了最优策略的敏感性受限于Q函数的混合偏导数（噪声敏感性）与其动作空间曲率（信号区分度）之比。为了实证验证这一理论见解，我们引入了PAVE（策略感知值场均衡），一种以评论家为中心的正则化框架，将评论家视为标量场并稳定其诱导的动作梯度场。PAVE通过最小化Q-梯度波动同时保持局部曲率来修正学习信号。实验结果表明，PAVE在不修改actor的情况下，实现了与策略侧平滑正则化方法相当的平滑性，同时保持了有竞争力的任务性能。

英文摘要

Policies learned via continuous actor-critic methods often exhibit erratic, high-frequency oscillations, making them unsuitable for physical deployment. Current approaches attempt to enforce smoothness by directly regularizing the policy's output. We argue that this approach treats the symptom rather than the cause. In this work, we theoretically establish that policy non-smoothness is fundamentally governed by the differential geometry of the critic. By applying implicit differentiation to the actor-critic objective, we prove that the sensitivity of the optimal policy is bounded by the ratio of the Q-function's mixed-partial derivative (noise sensitivity) to its action-space curvature (signal distinctness). To empirically validate this theoretical insight, we introduce PAVE (Policy-Aware Value-field Equalization), a critic-centric regularization framework that treats the critic as a scalar field and stabilizes its induced action-gradient field. PAVE rectifies the learning signal by minimizing the Q-gradient volatility while preserving local curvature. Experimental results demonstrate that PAVE achieves smoothness comparable to policy-side smoothness regularization methods, while maintaining competitive task performance, without modifying the actor.
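
The bound described in the abstract has the shape of a standard implicit-function argument. A sketch, assuming a deterministic policy $\pi^{*}(s)$ that maximizes a twice-differentiable $Q(s, a)$ over actions and an invertible action Hessian (these assumptions and the notation belong to this sketch and are not taken from the paper):

```latex
\nabla_a Q\bigl(s, \pi^{*}(s)\bigr) = 0
\;\;\Longrightarrow\;\;
\nabla^{2}_{as} Q + \nabla^{2}_{aa} Q \,\frac{\partial \pi^{*}}{\partial s} = 0
\;\;\Longrightarrow\;\;
\frac{\partial \pi^{*}}{\partial s} = -\bigl(\nabla^{2}_{aa} Q\bigr)^{-1} \nabla^{2}_{as} Q,
\qquad
\Bigl\lVert \frac{\partial \pi^{*}}{\partial s} \Bigr\rVert
\le \frac{\lVert \nabla^{2}_{as} Q \rVert}{\sigma_{\min}\bigl(\nabla^{2}_{aa} Q\bigr)}
```

Here the numerator is the mixed-partial term (what the abstract calls noise sensitivity) and the denominator is the smallest singular value of the action-space curvature (signal distinctness), so a flat critic in action space or a rapidly varying action gradient both allow a non-smooth policy.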

2601.21542 2026-06-19 cs.CV cs.AI 版本更新

Bi-Anchor Interpolation Solver for Accelerating Generative Modeling

双锚点插值求解器加速生成建模

Hongxu Chen, Hongxiang Li, Zhen Wang, Long Chen

发表机构 * The Hong Kong University of Science and Technology（香港科技大学）

AI总结提出BA-solver，通过轻量SideNet（1-2%主干大小）学习双向时间感知和双锚点速度积分，在不重新训练主干的情况下，以极低训练成本实现10步内达到100+步Euler求解器质量，支持即插即用。

AI中文摘要

流匹配（FM）模型已成为高保真合成的前沿范式。然而，它们对迭代常微分方程（ODE）求解的依赖造成了显著的延迟瓶颈。现有解决方案面临两难：无训练求解器在低神经函数评估（NFE）下性能严重下降，而基于训练的一步或几步生成方法则面临高昂的训练成本且缺乏即插即用的通用性。为弥合这一差距，我们提出了双锚点插值求解器（BA-solver）。BA-solver保留了标准无训练求解器的通用性，同时通过引入轻量级SideNet（主干大小的1-2%）与冻结主干并行，实现了显著加速。具体而言，我们的方法基于两个协同组件：1）双向时间感知，其中SideNet学习近似未来和过去的速度，无需重新训练重型主干；2）双锚点速度积分，利用带有两个锚点速度的SideNet高效近似中间速度，用于批量高阶积分。通过利用主干建立高精度“锚点”并利用SideNet加密轨迹，BA-solver能够以最小误差实现大步长。在ImageNet-256^2上的实验结果表明，BA-solver仅需10次NFE即可达到与100+次NFE的Euler求解器相当的生成质量，并在仅5次NFE时保持高保真度，且训练成本可忽略不计。此外，BA-solver确保与现有生成流水线的无缝集成，便于图像编辑等下游任务。

英文摘要

Flow Matching (FM) models have emerged as a leading paradigm for high-fidelity synthesis. However, their reliance on iterative Ordinary Differential Equation (ODE) solving creates a significant latency bottleneck. Existing solutions face a dichotomy: training-free solvers suffer from significant performance degradation at low Neural Function Evaluations (NFEs), while training-based one- or few-step generation methods incur prohibitive training costs and lack plug-and-play versatility. To bridge this gap, we propose the Bi-Anchor Interpolation Solver (BA-solver). BA-solver retains the versatility of standard training-free solvers while achieving significant acceleration by introducing a lightweight SideNet (1-2% backbone size) alongside the frozen backbone. Specifically, our method is founded on two synergistic components: 1) Bidirectional Temporal Perception, where the SideNet learns to approximate both future and historical velocities without retraining the heavy backbone; and 2) Bi-Anchor Velocity Integration, which utilizes the SideNet with two anchor velocities to efficiently approximate intermediate velocities for batched high-order integration. By utilizing the backbone to establish high-precision "anchors" and the SideNet to densify the trajectory, BA-solver enables large interval sizes with minimized error. Empirical results on ImageNet-256^2 demonstrate that BA-solver achieves generation quality comparable to a 100+ NFE Euler solver in just 10 NFEs and maintains high fidelity in as few as 5 NFEs, incurring negligible training costs. Furthermore, BA-solver ensures seamless integration with existing generative pipelines, facilitating downstream tasks such as image editing.
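
For context on the NFE trade-off, here is a minimal fixed-step Euler sampler on a toy velocity field. It illustrates only the baseline that the abstract compares against; the SideNet and the bi-anchor integration are not modeled, and `toy_velocity` is a hypothetical stand-in for a trained flow-matching backbone.

```python
import math


def euler_solve(velocity, x0, num_steps, t0=0.0, t1=1.0):
    """Integrate dx/dt = velocity(x, t) from t0 to t1 with fixed-step Euler.

    Each step calls `velocity` once, so `num_steps` equals the NFE count.
    """
    x, t = x0, t0
    dt = (t1 - t0) / num_steps
    for _ in range(num_steps):
        x += dt * velocity(x, t)
        t += dt
    return x


def toy_velocity(x, t):
    """Stand-in for a learned velocity field; the exact solution is x0 * exp(-t)."""
    return -x


exact = math.exp(-1.0)
for nfe in (5, 10, 100):
    error = abs(euler_solve(toy_velocity, 1.0, nfe) - exact)
    print(nfe, error)
```

The discretization error shrinks as the step count grows, which is why plain Euler needs many backbone evaluations and why solvers that stay accurate at 5 to 10 NFEs are attractive.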

2601.22107 2026-06-19 cs.LG 版本更新

Prior-Informed Flow Matching for Graph Reconstruction

先验信息流匹配用于图重建

Harvey Chen, Nicolas Zilberstein, Santiago Segarra

发表机构 * Rice University（莱斯大学）

AI总结提出先验信息流匹配（PIFM），一种结合嵌入先验与连续时间流匹配的条件流模型，用于从部分观测中重建图，在多个数据集上优于经典嵌入和生成基线。

2601.21081 2026-06-19 cs.CV 版本更新

Shape of Thought: Progressive Object Assembly via Visual Chain-of-Thought

思维形状：通过视觉思维链进行渐进式物体组装

Yu Huo, Siyu Zhang, Kun Zeng, Haoyue Liu, Owen Lee, Junlin Chen, Yuquan Lu, Yifu Guo, Yaodong Liang, Xiaoying Tang

发表机构 * School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen（香港中文大学（深圳）理工学院）； School of Data Science, The Chinese University of Hong Kong, Shenzhen（香港中文大学（深圳）数据科学学院）； Sun Yat-sen University（中山大学）； The Hong Kong University of Science and Technology, Guangzhou（香港科技大学（广州））； Shenzhen Future Network of Intelligence Institute (FNii-Shenzhen)（深圳未来网络智能研究所（FNii-Shenzhen））； Guangdong Provincial Key Laboratory of Future Networks of Intelligence, CUHK(SZ)（广东省未来网络智能重点实验室，香港中文大学（深圳））

AI总结提出Shape-of-Thought (SoT)框架，通过视觉思维链在渲染2D域中逐步组装形状，解决文本到图像生成中的组合结构约束问题，在组件计数和结构拓扑上显著优于直接生成。

Comments ICML2026

AI中文摘要

用于文本到图像生成的多模态模型已实现强视觉保真度，但在组合结构约束（特别是生成计数、属性绑定和部件级关系）下仍然脆弱。为解决这些挑战，我们提出了Shape-of-Thought (SoT)，一种视觉思维链框架，用于在渲染2D域中进行过程监督的渐进式形状组装，推理时无需外部引擎。SoT训练一个统一的多模态自回归模型，生成交错的文本计划和渲染中间状态，帮助模型在不产生显式几何表示的情况下捕捉形状组装逻辑。与纯文本思维链不同，每个决策都基于渲染状态，使得计数、连接、拓扑和中间部件添加错误在整个轨迹中可检查。为支持这一范式，我们引入了SoT-26K，一个基于部件CAD层次结构的大规模接地组装轨迹数据集，以及T2S-CompBench，一个用于评估结构完整性和轨迹忠实度的基准。在SoT-26K上微调在组件计数上达到88.4%，在结构拓扑上达到84.8%，在组件计数上比直接生成高出24.2个百分点，在结构拓扑上高出19.3个百分点。SoT为渲染域结构感知生成建立了一个透明测试平台。代码见 https://github.com/yuhuo03/Shape-of-Thought

英文摘要

Multimodal models for text-to-image generation have achieved strong visual fidelity, yet they remain brittle under compositional structural constraints, notably generative numeracy, attribute binding, and part-level relations. To address these challenges, we propose Shape-of-Thought (SoT), a visual CoT framework for process-supervised progressive shape assembly in the rendered 2D domain, without external engines at inference time. SoT trains a unified multimodal autoregressive model to generate interleaved textual plans and rendered intermediate states, helping the model capture shape-assembly logic without producing explicit geometric representations. Unlike text-only CoT, each decision is grounded in a rendered state, making counts, attachments, topology, and intermediate part-addition errors inspectable across the trajectory. To support this paradigm, we introduce SoT-26K, a large-scale dataset of grounded assembly traces derived from part-based CAD hierarchies, and T2S-CompBench, a benchmark for evaluating structural integrity and trace faithfulness. Fine-tuning on SoT-26K achieves 88.4% on component numeracy and 84.8% on structural topology, outperforming direct generation by +24.2 points on component numeracy and +19.3 points on structural topology. SoT establishes a transparent testbed for rendered-domain structure-aware generation. The code is available at https://github.com/yuhuo03/Shape-of-Thought.

2512.20014 2026-06-19 cs.RO cs.AI 版本更新

Bring My Cup! Personalizing Vision-Language-Action Models with Visual Attentive Prompting

Bring My Cup! 使用视觉注意力提示个性化视觉-语言-动作模型

Sangoh Lee, Sangwoo Mo, Wook-Shin Han

发表机构 * GSAI, POSTECH（POSTECH 人工智能研究所）； IME, POSTECH（POSTECH 信息媒体研究所）

AI总结针对VLA模型难以处理个性化指令的问题，提出无需训练的视觉注意力提示（VAP）方法，通过参考图像作为非参数记忆，利用开放词汇检测和嵌入匹配定位个人物品，并以视觉提示注入模型，在多个仿真和真实场景中显著提升成功率和正确物体操作。

Comments ICML 2026. Project page: https://vap-project.github.io/

AI中文摘要

尽管视觉-语言-动作（VLA）模型能够很好地泛化到通用指令，但在处理个性化命令（如“bring my cup”）时却存在困难，因为机器人必须在视觉相似的物体中识别并操作特定实例。我们研究了这种操作个人物品的场景，其中VLA必须仅使用少量参考图像来识别并控制训练中未见过的用户特定物体。为了解决这一挑战，我们提出了视觉注意力提示（VAP），一种简单而有效的无需训练的感知适配器，为冻结的VLA模型赋予自上而下的选择性注意力。VAP将参考图像视为非参数视觉记忆，通过开放词汇检测和基于嵌入的匹配将个人物品定位到场景中，然后通过突出显示该物体并重写指令，将这种定位作为视觉提示注入模型。我们构建了两个仿真基准（Personalized-SIMPLER和Personalized-VLABench）以及一个真实桌面基准，用于评估多个机器人和任务上的个性化操作。实验表明，VAP在成功率和正确物体操作方面始终优于通用策略和令牌学习基线，有助于弥合语义理解与实例级控制之间的差距。

英文摘要

While Vision-Language-Action (VLA) models generalize well to generic instructions, they struggle with personalized commands such as "bring my cup," where the robot must act on one specific instance among visually similar objects. We study this setting of manipulating personal objects, in which a VLA must identify and control a user-specific object unseen during training using only a few reference images. To address this challenge, we propose Visual Attentive Prompting (VAP), a simple-yet-effective training-free perceptual adapter that equips frozen VLAs with top-down selective attention. VAP treats the reference images as a non-parametric visual memory, grounds the personal object in the scene through open-vocabulary detection and embedding-based matching, and then injects this grounding as a visual prompt by highlighting the object and rewriting the instruction. We construct two simulation benchmarks, Personalized-SIMPLER and Personalized-VLABench, and a real-world tabletop benchmark to evaluate personalized manipulation across multiple robots and tasks. Experiments show that VAP consistently outperforms generic policies and token-learning baselines in both success rate and correct-object manipulation, helping to bridge the gap between semantic understanding and instance-level control.

2507.00875 2026-06-19 cs.CL cs.HC cs.MA 版本更新

TransLaw: A Large-Scale Dataset and Multi-Agent Benchmark Simulating Professional Translation of Hong Kong Case Law

TransLaw：模拟香港判例法专业翻译的大规模数据集与多智能体基准

Xi Xuan, Chunyu Kit

发表机构 * City University of Hong Kong, Hong Kong SAR, China（香港城市大学）

AI总结针对香港判例法英译中资源匮乏、法律术语和格式要求严格的问题，构建了首个大规模句对齐平行语料库HKCFA Judgment 97-22，并提出多智能体框架TransLaw，通过分解翻译任务、集成法律词汇库和检索增强生成，显著提升翻译质量，但仍未达到人类专家的风格自然度。

Comments Accepted at ICML 2026 - AI for Law

AI中文摘要

根据《基本法》第8-9条，香港法院判决书需从英文翻译成繁体中文，但由于平行资源短缺以及对法律术语、引用格式和司法风格的严格要求，这一任务仍受到限制。我们引入了HKCFA Judgment 97-22，这是首个用于香港判例法的大规模句对齐平行语料库，包含344份专业翻译的判决书（11,099个句对；210万词元），涵盖1997年至2022年。基于这一资源，我们提出了TransLaw，一个多智能体框架，将翻译分解为词级表达、句级翻译和多维审查，集成了专门的香港法律词汇数据库、检索增强生成和迭代反馈，并包括涵盖语义对齐、术语、引用和风格的四维专家审查。通过对13个开源和商业大语言模型进行基准测试，我们证明TransLaw在所有评估模型上均显著优于单智能体基线，并在3次迭代内收敛。由10名持证法律翻译人员使用我们提出的Legal ACS指标进行的人工评估证实了法律语义准确性的提升，同时表明TransLaw在风格自然度上仍落后于人类专家。数据集和基准代码可在以下网址获取：https://github.com/xuanxixi/TransLaw

英文摘要

Translating Hong Kong Court Judgments from English to Traditional Chinese is mandated by Articles 8-9 of the Basic Law, yet remains constrained by a shortage of parallel resources and rigorous demands on legal terminology, citation format, and judicial style. We introduce HKCFA Judgment 97-22, the first large-scale sentence-aligned parallel corpus for HK case law, comprising 344 professionally translated judgments (11,099 sentence pairs; 2.1M tokens) spanning 1997-2022. Building on this resource, we propose TransLaw, a multi-agent framework that decomposes translation into word-level expression, sentence-level translation, and multidimensional review, integrating a specialized Hong Kong legal glossary database, Retrieval-Augmented Generation, and iterative feedback, with four-dimensional expert review covering semantic alignment, terminology, citation, and style. Benchmarking 13 open-source and commercial LLMs, we demonstrate that TransLaw significantly outperforms single-agent baselines across all evaluated models, with convergence within 3 iterations. Human evaluation by 10 certified legal translators using our proposed Legal ACS metric confirms gains in legal-semantic accuracy, while showing that TransLaw still trails human experts in stylistic naturalness. The dataset and benchmark code are available at https://github.com/xuanxixi/TransLaw.

2601.15797 2026-06-19 cs.AI 版本更新

Creativity Reconsidered: Generative AI and the Problem of Intentional Agency

重新思考创造力：生成式AI与意向能动性问题

James S. Pearson, Matthew J. Dennis, Marc Cheong

发表机构 * University of Amsterdam（阿姆斯特丹大学）； University of Lisbon（里斯本大学）； TU Eindhoven（埃因霍温理工大学）； University of Melbourne（墨尔本大学）

AI总结本文质疑意向能动性是创造力的必要条件，基于生成式AI的创造力表现，提出创造力归因依赖于“创造能力”，从而在不要求意向能动性的前提下解释AI的创造力。

Comments 27 pages, 2 figures

AI中文摘要

许多理论家认为，有意识的意向能动性是创造力的必要条件。我们认为，这一要求（称为意向能动性条件，IAC）应当被放弃。我们通过强调该标准在面对生成式AI最新进展时遇到的问题来论证这一点，生成式AI尽管缺乏意向能动性，却显然具有创造力。我们呈现两项语料库分析，以说明人们将创造力归因于生成式AI的迅速增长趋势。针对这一困境，创造力理论家提出了一系列相互矛盾的解决方案，我们对其进行了批判性评估。我们发现，这些方案均未能令人满意地解决初始困境，因此我们提出了一种新方法。我们的主张是，创造力的归因依赖于我们所谓的创造能力。这一解决方案解释了为什么意向能动性对创造力判断很重要，但并非必要条件。因此，我们的方法在不忽视感知意图对创造力归因至关重要的直觉的情况下，容纳了AI的创造力。

英文摘要

Many theorists maintain that conscious intentional agency is a necessary condition of creativity. We argue that this requirement, which we call the Intentional Agency Condition (IAC), should be abandoned. We motivate this by highlighting the problems this criterion encounters in the face of recent advances in generative AI, which is ostensibly creative despite being incapable of intentional agency. We present two corpus analyses to illustrate the rapidly increasing tendency of people to predicate creativity to generative AI. In response to this predicament, theorists of creativity have proposed a range of conflicting solutions, which we critically evaluate. We find that none of these satisfyingly resolves the initial predicament, and we therefore propose a novel approach. Our claim is that ascriptions of creativity are dependent on what we call creative ability. This solution explains why intentional agency is important for judgements of creativity, without being a necessary condition. Our approach thereby accommodates AI creativity without dismissing the intuition that perceived intentions are of key importance for ascriptions of creativity.

2601.15614 2026-06-19 cs.RO 版本更新

AION: Aerial Indoor Object-Goal Navigation Using Dual-Policy Reinforcement Learning

AION: 基于双策略强化学习的空中室内目标导航

Zichen Yan, Yuchen Hou, Shenao Wang, Yichao Gao, Rui Huang, Lin Zhao

发表机构 * Department of Electrical and Computer Engineering, National University of Singapore（新加坡国立大学电子与计算机工程系）

AI总结提出AION，一种端到端双策略强化学习框架，解耦探索与目标到达行为，用于视觉空中目标导航，无需外部定位或全局地图，在AI2-THOR和IsaacSim中验证了优越性能。

Comments Accepted to IROS 2026

AI中文摘要

目标导航要求智能体自主探索未知环境并导航至由语义标签指定的目标对象。以往工作主要研究二维移动下的零样本目标导航，将其扩展到具有三维移动能力的空中平台仍未被充分探索。空中机器人具有优越的机动性和搜索效率，但也带来了空间感知、动态控制和安全性保障方面的新挑战。本文提出AION，用于基于视觉的空中目标导航，无需依赖外部定位或全局地图。AION是一个端到端的双策略强化学习框架，将探索和目标到达行为解耦为两个专门策略。我们在AI2-THOR基准上评估AION，并在IsaacSim中使用高保真无人机模型进一步评估其实时性能。实验结果表明，AION在探索、导航效率和安全性的综合评估指标上均取得了优越性能。视频可在 https://youtu.be/TgsUm6bb7zg 观看，代码和模型检查点可在 https://github.com/Zichen-Yan/AION 获取。

英文摘要

Object-Goal Navigation (ObjectNav) requires an agent to autonomously explore an unknown environment and navigate toward target objects specified by a semantic label. While prior work has primarily studied zero-shot ObjectNav under 2D locomotion, extending it to aerial platforms with 3D locomotion capability remains underexplored. Aerial robots offer superior maneuverability and search efficiency, but they also introduce new challenges in spatial perception, dynamic control, and safety assurance. In this paper, we propose AION for vision-based aerial ObjectNav without relying on external localization or global maps. AION is an end-to-end dual-policy reinforcement learning (RL) framework that decouples exploration and goal-reaching behaviors into two specialized policies. We evaluate AION on the AI2-THOR benchmark and further assess its real-time performance in IsaacSim using high-fidelity drone models. Experimental results show that AION achieves superior performance across comprehensive evaluation metrics in exploration, navigation efficiency, and safety. The video can be found at https://youtu.be/TgsUm6bb7zg, and the code and model checkpoints are available at https://github.com/Zichen-Yan/AION.

2601.15459 2026-06-19 cs.RO Updated version

Neural Minimum-Distance Estimation for Collision-Aware Operation of Multi-Arm Laparoscopy Surgical Robots Through Learning-from-Simulation

Sarvin Ghiasi, Majid Roshanfar, Jake Barralet, Liane S. Feldman, Amir Hooshiar

Affiliations: Surgical Performance Enhancement and Robotics (SuPER) Centre, Department of Surgery; The Wilfred and Joyce Posluns Centre for Image Guided Innovation & Therapeutic Intervention (PCIGITI); The Hospital for Sick Children (SickKids)

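AI summary: Proposes a framework that combines analytical modeling, real-time simulation, and a deep residual neural network for minimum-distance estimation and collision warning in multi-arm surgical robots. The model achieves R² = 0.940 and RMSE = 42.0 mm on the validation set.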

Journal ref: Sensors 2026, 26(12), 3744

DOI: 10.3390/s26123744

Abstract

This study presents an integrated framework for enhancing the safety and operational efficiency of robotic arms in laparoscopic surgery by addressing minimum distance estimation between multi-arm manipulators and the associated collision-aware warning. By combining analytical modeling, real-time simulation, and machine learning, the framework offers a robust solution for ensuring safe robotic operations. An analytical model was developed to estimate the minimum distances between robotic arms based on their joint configurations, offering theoretical calculations that serve as both a validation tool and a benchmark. To complement this, a 3D simulation environment was created to model two 7-DOF Kinova robotic arms (Kinova inc., Boisbriand, QC, Canada), generating a diverse dataset of configurations for distance estimation and collision warning. Using these insights, a deep residual neural network model was trained with joint configurations as inputs. On the held-out validation set, the model achieves R² = 0.940, RMSE = 42.0 mm, MAE = 28.7 mm, and a near-zero mean bias, demonstrating strong predictive accuracy and consistent generalization across the workspace. The framework is intended as an early collision warning layer, where a warning is triggered when the predicted inter-arm distance falls below a 0.2 m threshold, which corresponds to a surface-to-surface clearance of approximately 50 mm given the Kinova Gen3 (Kinova inc., Boisbriand, QC, Canada) cross-sectional radius. This work demonstrates the effectiveness of combining analytical modeling with machine learning to enhance the precision and reliability of multi-arm robotic systems.
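The reported metrics and the warning rule can be illustrated with a minimal Python sketch. It is not the authors' code: the function names `regression_metrics` and `should_warn` are hypothetical, distances are assumed to be in metres, and only the 0.2 m threshold is taken from the abstract.

```python
import math

# Threshold stated in the abstract; everything else here is illustrative.
WARNING_THRESHOLD_M = 0.2


def regression_metrics(y_true, y_pred):
    """Return R^2, RMSE, MAE and mean bias for paired distance values."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [p - t for t, p in zip(y_true, y_pred)]  # prediction minus truth
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true) # total sum of squares
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are equal")
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "bias": sum(errors) / n,
    }


def should_warn(predicted_distance_m, threshold_m=WARNING_THRESHOLD_M):
    """Warn when the predicted inter-arm distance falls below the threshold."""
    return predicted_distance_m < threshold_m


if __name__ == "__main__":
    truth = [0.1, 0.3, 0.5, 0.7]          # made-up distances in metres
    predicted = [0.12, 0.28, 0.52, 0.68]  # made-up model outputs
    print(regression_metrics(truth, predicted))
    print(should_warn(0.15), should_warn(0.25))
```

The strict `<` comparison follows the abstract's wording ("falls below"); a deployed system might prefer a more conservative rule, which is outside what the abstract states.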

2509.03122 2026-06-19 cs.CL cs.AI cs.LG Updated version

From Construction to Injection: Edit-Based Fingerprints for Large Language Models

Yue Li, Xin Yi, Dongsheng Shi, Yongyi Cui, Gerard de Melo, Linlin Wang

Affiliations: East China Normal University; Hasso Plattner Institute/University of Potsdam

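AI summary: Proposes an end-to-end injected fingerprinting framework that uses code-mixing fingerprints and multi-candidate editing to address the imperceptibility and robustness challenges of fingerprints in black-box deployment.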

Comments: preprint

Abstract

Reliable model fingerprints are essential for protecting large language models (LLMs) against unauthorized redistribution and commercial misuse. In black-box deployment, verification is hindered by defensive filtering of suspected fingerprint queries, as well as by downstream model modifications that may weaken embedded ownership evidence. These risks require fingerprints to be robust in both construction and injection. For construction, prior paradigms face an imperceptibility trade-off: natural-language fingerprints may be accidentally activated, whereas garbled fingerprints are statistically exposed and easier to filter. For injection, existing methods struggle to preserve persistent trigger-target behaviors under model modification. We propose an end-to-end injected fingerprinting framework to address these challenges. Code-mixing Fingerprints (CF) use lowest-perplexity code-mixing under a high-complexity constraint to mitigate this two-sided imperceptibility trade-off. Multi-Candidate Editing (MCEdit) constructs structurally redundant, margin-separated trigger-target mappings to enable graceful degradation under model modification. Extensive evaluations on imperceptibility, detectability, and harmlessness demonstrate robust ownership verification with negligible impact on utility.
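One possible reading of "lowest-perplexity code-mixing under a high-complexity constraint" is a constrained selection: keep only candidates whose complexity clears a threshold, then pick the one with the lowest perplexity. The Python sketch below illustrates that reading only. The function `select_fingerprint`, the scoring callables, the threshold, and the toy scores are all hypothetical; the abstract does not define how perplexity or complexity are measured, and this is not the authors' implementation.

```python
def select_fingerprint(candidates, perplexity, complexity, min_complexity):
    """Return the lowest-perplexity candidate with complexity >= min_complexity.

    `perplexity` and `complexity` are caller-supplied scoring functions
    (placeholders here). Returns None if no candidate meets the constraint.
    """
    eligible = [c for c in candidates if complexity(c) >= min_complexity]
    if not eligible:
        return None
    return min(eligible, key=perplexity)


if __name__ == "__main__":
    # Toy, made-up scores standing in for real language-model measurements.
    toy_scores = {
        "please summarize this text": {"perplexity": 12.0, "complexity": 0.1},
        "please resumir this texto": {"perplexity": 35.0, "complexity": 0.7},
        "xq#9 zz@@ vv": {"perplexity": 900.0, "complexity": 0.9},
    }
    chosen = select_fingerprint(
        list(toy_scores),
        perplexity=lambda c: toy_scores[c]["perplexity"],
        complexity=lambda c: toy_scores[c]["complexity"],
        min_complexity=0.5,
    )
    print(chosen)
```

In this toy setup the natural-language candidate is excluded by the complexity constraint, and the garbled candidate loses on perplexity, which mirrors the two-sided trade-off the abstract describes.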
