arXivDaily: arXiv daily academic digest, updated Monday through Friday
2602.04998 2026-05-20 cs.LG cs.AI cs.CL

Learning Rate Matters: Vanilla LoRA May Suffice for LLM Fine-tuning

Yu-Ang Lee, Ching-Yun Ko, Pin-Yu Chen, Mi-Yen Yeh

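Affiliations: National Taiwan University; IBM Research; Academia Sinica

AI summary: Using extensive hyperparameter searches, this paper re-evaluates nine representative LoRA variants alongside vanilla LoRA on mathematical reasoning, commonsense reasoning, code generation, and instruction following, and finds that different LoRA methods favor different learning rate ranges. Once the learning rate is properly tuned, all methods reach similar peak performance, which suggests that vanilla LoRA remains a competitive baseline and that improvements reported under a single training configuration may not reflect consistent methodological advantages.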

Comments Project page: https://github.com/yuang-lee/lr-matters-lora

Abstract

Low-Rank Adaptation (LoRA) is the prevailing approach for efficient large language model (LLM) fine-tuning. Building on this paradigm, recent studies have proposed alternative initialization strategies, architectural modifications, and optimization adjustments, reporting substantial improvements over vanilla LoRA. However, these gains are often demonstrated under fixed or narrowly tuned hyperparameter settings, despite the known sensitivity of neural networks to training configurations. In this work, we systematically re-evaluate nine representative LoRA variants alongside vanilla LoRA through extensive hyperparameter searches over learning rate, batch size, rank, and training duration. Across tasks spanning mathematical reasoning, commonsense reasoning, code generation, and instruction following at diverse model scales, we find that different LoRA methods favor distinct learning rate ranges. Crucially, once learning rates are properly tuned, all methods achieve similar peak performance (within 1-2%), with only subtle rank-dependent behaviors. These results suggest that vanilla LoRA remains a competitive baseline and that improvements reported under a single training configuration may not reflect consistent methodological advantages. Finally, a second-order analysis attributes the differing optimal learning rate ranges to variations in the largest Hessian eigenvalue, aligning with classical learning theories.
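
The link between optimal learning rates and the largest Hessian eigenvalue can be illustrated with a classical result: gradient descent on a quadratic loss is stable only when the learning rate stays below 2/λ_max. The sketch below is a toy illustration under that assumption, not the paper's analysis; the matrix `H`, the function names, and the step counts are hypothetical.

```python
import math

def matvec(H, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(h * x for h, x in zip(row, v)) for row in H]

def largest_eigenvalue(H, iters=200):
    """Estimate the largest eigenvalue of a symmetric PSD matrix by power iteration."""
    v = [1.0] * len(H)
    lam = 0.0
    for _ in range(iters):
        w = matvec(H, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
        # Rayleigh quotient v^T H v (v has unit norm)
        lam = sum(a * b for a, b in zip(v, matvec(H, v)))
    return lam

def final_loss(H, lr, steps=100):
    """Run gradient descent on L(w) = 0.5 * w^T H w and return the final loss."""
    w = [1.0] * len(H)
    for _ in range(steps):
        grad = matvec(H, w)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return 0.5 * sum(a * b for a, b in zip(w, matvec(H, w)))

H = [[4.0, 1.0], [1.0, 3.0]]        # toy Hessian (hypothetical)
lam_max = largest_eigenvalue(H)
threshold = 2.0 / lam_max            # classical stability limit for the learning rate
stable = final_loss(H, 0.9 * threshold)    # just below the limit: loss shrinks
unstable = final_loss(H, 1.1 * threshold)  # just above the limit: loss diverges
```

In this toy setting, a method whose loss surface has a larger top eigenvalue tolerates only a smaller learning rate, which is one way to read the abstract's claim that different LoRA variants favor different learning rate ranges.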

2602.04663 2026-05-20 cs.LG cs.AI

Rethinking the Design Space of Reinforcement Learning for Diffusion Models: On the Importance of Likelihood Estimation Beyond Loss Design

Jaemoo Choi, Yuchen Zhu, Wei Guo, Petr Molodyk, Bo Yuan, Jinbin Bai, Yi Xin, Molei Tao, Yongxin Chen

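Affiliations: Georgia Institute of Technology; National University of Singapore; Nanjing University

AI summary: This paper studies key questions in the design space of reinforcement learning for diffusion models. By disentangling three factors (the policy-gradient objective, the likelihood estimator, and the rollout sampling scheme), it finds that an evidence lower bound (ELBO) based model likelihood estimator is the dominant factor enabling effective, efficient, and stable RL optimization, outweighing the choice of a specific policy-gradient loss.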

Comments 23 pages, 11 figures

Abstract

Reinforcement learning has been widely applied to diffusion and flow models for visual tasks such as text-to-image generation. However, these tasks remain challenging because diffusion models have intractable likelihoods, which creates a barrier for directly applying popular policy-gradient type methods. Existing approaches primarily focus on crafting new objectives built on already heavily engineered LLM objectives, using ad hoc estimators for likelihood, without a thorough investigation into how such estimation affects overall algorithmic performance. In this work, we provide a systematic analysis of the RL design space by disentangling three factors: i) policy-gradient objectives, ii) likelihood estimators, and iii) rollout sampling schemes. We show that adopting an evidence lower bound (ELBO) based model likelihood estimator, computed only from the final generated sample, is the dominant factor enabling effective, efficient, and stable RL optimization, outweighing the impact of the specific policy-gradient loss functional. We validate our findings across multiple reward benchmarks using SD 3.5 Medium, and observe consistent trends across all tasks. Our method improves the GenEval score from 0.24 to 0.95 in 90 GPU hours, which is $4.6\times$ more efficient than FlowGRPO and $2\times$ more efficient than the SOTA method DiffusionNFT without reward hacking.

2602.04381 2026-05-20 cs.CV cs.AI

Enabling Real-Time Colonoscopic Polyp Segmentation on Commodity CPUs via Ultra-Lightweight Architecture

Weihao Gao, Zhuo Deng, Zheng Gong, Lan Ma

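Affiliations: School of Computer Science and Artificial Intelligence, Guangdong University of Education; Shenzhen International Graduate School, Tsinghua University

AI summary: This paper proposes the UltraSeg family, lightweight segmentation models that run on CPUs and achieve real-time colonoscopic polyp segmentation without relying on a GPU. The core method uses grouped multi-rate dilated convolutions and attention-gated cross-layer fusion; the main contribution is the first baseline for high-accuracy real-time polyp segmentation on commodity CPUs.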

Comments 18 pages, 4 figures

Abstract

Real-time polyp segmentation is essential for early colorectal cancer detection, yet clinical deployment remains blocked by GPU dependency. We introduce the UltraSeg family, a set of CPU-native segmentation models operating below 0.3M parameters. UltraSeg-108K (0.108M) establishes the extreme-compression frontier, while UltraSeg-130K (0.130M) integrates cross-layer lightweight fusion for enhanced multi-center generalization. The architecture replaces parameter-heavy components with grouped multi-rate dilated convolutions and attention-gated cross-layer fusion, achieving real-time throughput on a single CPU core (exceeding 50 FPS at 256*256 and 30 FPS at 352*352) without sacrificing clinical-grade accuracy. Evaluated on seven public datasets, UltraSeg-130K attains Dice scores exceeding 0.8 at both resolutions, substantially outperforming all existing sub-0.3M competitors. Notably, it approaches or exceeds UNet-Medium (7.76M parameters) on zero-shot external validations while using only 1.7% of its parameters, establishing the first strong baseline for CPU-native real-time polyp segmentation. When scaled to 4.38M parameters, UltraSeg achieves accuracy competitive with heavyweight state-of-the-art models while maintaining an order-of-magnitude parameter advantage, demonstrating that the proposed design principles yield intrinsic representational gains across the entire efficiency spectrum. By delivering the first clinically deployable, CPU-native real-time solution, this work provides an immediately usable tool for resource-limited settings and a reproducible blueprint for real-time medical AI beyond endoscopy. Source code is publicly available.
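
To see why grouped, dilated convolutions suit a sub-0.3M parameter budget, the arithmetic below counts the weights of a standard versus a grouped convolution and the spatial extent of a dilated kernel. It is a minimal sketch: the channel counts, group count, and dilation rates are hypothetical and are not taken from the UltraSeg architecture; only the 0.130M and 7.76M figures come from the abstract.

```python
def conv2d_params(c_in, c_out, kernel, groups=1, bias=False):
    """Weight count of a 2D convolution; dilation does not change it."""
    assert c_in % groups == 0 and c_out % groups == 0
    weights = (c_in // groups) * c_out * kernel * kernel
    return weights + (c_out if bias else 0)

def effective_kernel(kernel, dilation):
    """Spatial extent covered by a dilated kernel along one axis."""
    return kernel + (kernel - 1) * (dilation - 1)

# Hypothetical 64-channel, 3x3 layer: dense versus 8 groups.
dense = conv2d_params(64, 64, 3)
grouped = conv2d_params(64, 64, 3, groups=8)

# Several dilation rates share the same weight count but see different extents.
extents = [effective_kernel(3, d) for d in (1, 2, 4)]

# Parameter ratio quoted in the abstract: UltraSeg-130K versus UNet-Medium.
ratio = 0.130 / 7.76
print(dense, grouped, extents, f"{100 * ratio:.1f}%")
```

Grouping divides the weight count by the number of groups, while dilation widens the receptive field at no parameter cost, which is consistent with the abstract's description of replacing parameter-heavy components.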

2602.03454 2026-05-20 cs.CV

Contextualized Visual Personalization in Vision-Language Models

Yeongtak Oh, Sangwon Yu, Junsung Park, Han Cheol Moon, Jisoo Mok, Sungroh Yoon

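Affiliations: Department of Electrical and Computer Engineering, Seoul National University, Seoul, South Korea; Interdisciplinary Program in Artificial Intelligence, Seoul National University, Seoul, Korea

AI summary: This paper proposes a contextualized visual personalization approach that improves the personalized image captioning ability of vision-language models through reinforcement learning and caption-augmented generation, uses diagnostic evaluations to verify that models genuinely exploit visual context, and shows that CoViP yields holistic gains on downstream personalization tasks.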

Comments Accepted at ICML 2026

Abstract

Despite recent progress in vision-language models (VLMs), existing approaches often fail to generate personalized responses based on the user's specific experiences, as they lack the ability to associate visual inputs with a user's accumulated visual-textual context. We newly formalize this challenge as contextualized visual personalization, which requires the visual recognition and textual retrieval of personalized visual experiences by VLMs when interpreting new images. To address this issue, we propose CoViP, a unified framework that treats personalized image captioning as a core task for contextualized visual personalization and improves this capability through reinforcement-learning-based post-training and caption-augmented generation. We further introduce diagnostic evaluations that explicitly rule out textual shortcut solutions and verify whether VLMs truly leverage visual context. Extensive experiments demonstrate that existing open-source and proprietary VLMs exhibit substantial limitations, while CoViP not only improves personalized image captioning but also yields holistic gains across downstream personalization tasks. These results highlight CoViP as a crucial stage for enabling robust and generalizable contextualized visual personalization.

2602.02513 2026-05-20 cs.LG cond-mat.mtrl-sci

Learning ORDER-Aware Multimodal Representations for Composite Materials Design

Xinyao Li, Hangwei Qian, Jingjing Li, Lei Zhu, Ivor Tsang

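Affiliations: University of Electronic Science and Technology of China; Tongji University; A*STAR CFAR

AI summary: This study proposes ORDER, an ordinality-based multimodal pretraining framework for composite materials design. By integrating heterogeneous data sources to capture fiber distributions, it enables effective property prediction and microstructure generation in a continuous design space.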

Abstract

Artificial intelligence has shown remarkable success in materials discovery and property prediction, particularly for crystalline and polymer systems where material properties and structures are dominated by discrete graph representations. This graph-centric paradigm breaks down on composite materials, which possess continuous and nonlinear design spaces. General composite descriptors, e.g., fiber volume and misalignment angle, cannot fully capture the fiber distributions that determine microstructural characteristics, necessitating the integration of heterogeneous data sources through multimodal learning. Existing alignment-oriented frameworks have proven effective on abundant crystal or polymer data under discrete, unique graph-property mapping assumptions, but fail to address the highly continuous composite design space under extreme data scarcity. In this work, we introduce ORDinal-aware imagE-tabulaR alignment (ORDER), a multimodal pretraining framework that establishes ordinality as a core principle for material representations. ORDER ensures that materials with similar target properties occupy nearby regions in the latent space, which effectively preserves the continuous nature of composite properties and enables meaningful interpolation between sparsely observed designs. We evaluate ORDER on a nanofiber-reinforced composite dataset and a carbon fiber T700 dataset. ORDER and its variants outperform both alignment-oriented and customized property-aware contrastive baselines across property prediction, cross-modal retrieval, and microstructure generation tasks. We further introduce physics-based ordinal surrogate signals, avoiding the need for full property annotation during pretraining. Our work demonstrates that learning continuous multimodal features is fundamental for composite materials, and provides a reliable pathway toward data-efficient universal multimodal intelligent systems.

2601.22478 2026-05-20 cs.LG

Transformation-Augmented GRPO for Enhancing Exploration in Reasoning of Large Language Models

Khiem Le, Phuc Nguyen, Youssef Mroueh, Chi-Heng Lin, Shangqian Gao, Ting Hua, Nitesh V. Chawla

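Affiliations: University of Notre Dame; IBM Research; Meta AI; Florida State University

AI summary: This paper proposes Transformation-Augmented GRPO (TA-GRPO), which uses question rephrasing to address gradient vanishing and diversity collapse in reinforcement learning for large language models, improving the model's exploration on reasoning tasks.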

Abstract

Group Relative Policy Optimization (GRPO) has become the dominant method for reinforcement learning with verifiable rewards in large language models, but it suffers from two critical limitations: gradient vanishing and diversity collapse. When training questions are too easy or too hard, all sampled responses receive identical rewards, yielding zero gradients. Meanwhile, the model tends to collapse its responses toward a single reasoning pattern rather than exploring diverse strategies. We propose Transformation-Augmented GRPO (TA-GRPO), a simple but effective method that addresses both issues via question rephrasing. For each training question, we automatically generate multiple problem-equivalent rephrasings that alter wording, format, and information order while preserving the underlying meaning. Because these rephrasings shift the model's perceived difficulty, pooling responses across the original and its rephrasings yields mixed rewards and more diverse reasoning paths. TA-GRPO jointly computes advantages over this expanded response set and aligns all importance ratios to the original question, enabling the model to learn from a richer set of solution attempts. Experiments on four LLMs (Qwen3-1.7B, Qwen3-4B, Llama-3.2-1B, Llama-3.2-3B) show that TA-GRPO consistently improves pass@$k$ on competition-level benchmarks (AMC, OlympiadBench, AIME24, AIME25) and out-of-distribution benchmarks (Minerva, GPQA-Diamond). Notably, it improves the average pass@32 of Qwen3-1.7B and Qwen3-4B by 4.97 and 4.34 points, respectively, and matches the exploration quality of baselines trained on up to 2.5$\times$ more data.
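
The gradient-vanishing argument can be made concrete with a group-normalized advantage. The sketch below assumes the common formulation in which each reward is centered by the group mean and scaled by the group standard deviation; the reward values are hypothetical, and the importance-ratio alignment described in the abstract is not modeled.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Center rewards by the group mean and scale by the group standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical binary rewards: the original question is too hard, so every
# sampled response fails and the group carries no learning signal.
original = [0.0, 0.0, 0.0, 0.0]
# Hypothetical rewards for responses to a rephrasing of the same question.
rephrased = [0.0, 1.0, 0.0, 1.0]

plain = group_advantages(original)               # every advantage is zero
pooled = group_advantages(original + rephrased)  # mixed rewards give nonzero advantages
```

Identical rewards give zero advantages and therefore no gradient, whereas pooling the two groups assigns negative advantages to failed responses and positive ones to successful responses.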

2601.21484 2026-05-20 cs.LG

ETS: Energy-Guided Test-Time Scaling for Training-Free RL Alignment

Xiuyu Li, Jinkai Zhang, Mingyang Yi, Yu Li, Longqiang Wang, Yue Wang, Ju Fan

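Affiliations: Renmin University of China, Beijing, China; Zhongguancun Academy, Beijing, China

AI summary: This paper proposes a training-free inference method that samples directly from the optimal reinforcement learning policy. The transition probability for masked language modeling combines a reference policy model with an energy term, and the key energy term is estimated by online Monte Carlo, improving generation quality.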

Comments Accepted by ICML 2026

Abstract

Reinforcement Learning (RL) post-training alignment for language models is effective, but also costly and unstable in practice, owing to its complicated training process. To address this, we propose a training-free inference method to sample directly from the optimal RL policy. The transition probability applied to Masked Language Modeling (MLM) consists of a reference policy model and an energy term. Based on this, our algorithm, Energy-Guided Test-Time Scaling (ETS), estimates the key energy term via online Monte Carlo, with a provable convergence rate. Moreover, to ensure practical efficiency, ETS leverages modern acceleration frameworks alongside tailored importance sampling estimators, substantially reducing inference latency while provably preserving sampling quality. Experiments on MLM (including autoregressive models and diffusion language models) across reasoning, coding, and science benchmarks show that our ETS consistently improves generation quality, validating its effectiveness and design. The code is available at https://github.com/sheriyuo/ETS.
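
The idea of combining a reference policy with an energy term can be illustrated on a toy discrete problem. The sketch below assumes the standard KL-regularized result that the optimal policy is proportional to the reference policy times exp(reward / beta), and uses self-normalized importance sampling as a stand-in Monte Carlo estimator. It is not the ETS algorithm; the probabilities, rewards, `beta`, and function names are hypothetical.

```python
import math
import random

def tilted_policy(ref_probs, rewards, beta):
    """Exact KL-regularized optimum: pi*(y) proportional to pi_ref(y) * exp(r(y) / beta)."""
    unnorm = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def snis_expected_reward(ref_probs, rewards, beta, n_samples, seed=0):
    """Estimate the expected reward under pi* using only samples from pi_ref."""
    rng = random.Random(seed)
    draws = rng.choices(range(len(ref_probs)), weights=ref_probs, k=n_samples)
    weights = [math.exp(rewards[y] / beta) for y in draws]  # energy-style reweighting
    total = sum(weights)
    return sum(w * rewards[y] for w, y in zip(weights, draws)) / total

ref_probs = [0.4, 0.3, 0.2, 0.1]   # hypothetical reference policy over four outputs
rewards = [0.0, 0.5, 1.0, 2.0]     # hypothetical rewards
beta = 1.0                          # hypothetical regularization strength

target = tilted_policy(ref_probs, rewards, beta)
exact = sum(p * r for p, r in zip(target, rewards))
estimate = snis_expected_reward(ref_probs, rewards, beta, n_samples=20000)
```

No parameters are updated here: the reweighting shifts probability mass toward high-reward outputs purely at inference time, which is the sense in which such a method is training-free.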

2601.20308 2026-05-20 cs.CV cs.GR

Taming Real-World Space-Time Video Super-Resolution with One-Step Diffusion

Shuoyan Wei, Feng Li, Chen Zhou, Runmin Cong, Yao Zhao, Huihui Bai

Affiliations: Institute of Information Science, Beijing Jiaotong University; Visual Intelligence + X International Cooperation Joint Laboratory of MOE, Beijing; Innovation School of Artificial Intelligence, Hefei University of Technology; School of Control Science and Engineering, Shandong University

AI summary: This paper proposes OSDEnhancer, a framework that achieves robust space-time video super-resolution with one-step diffusion and handles the complex unknown degradations found in real-world video. A linear initialization and a divide-and-conquer strategy improve the recovery of spatiotemporal dynamics and texture.

Comments 12 pages, 9 figures

Abstract

Diffusion models have demonstrated exceptional success in video super-resolution (VSR), exhibiting powerful capabilities for generating fine-grained details. However, their potential for space-time video super-resolution (STVSR), which necessitates not only recovering realistic high-resolution visual content but also improving the frame rate with coherent temporal dynamics, remains largely underexplored. Moreover, existing STVSR methods predominantly address spatiotemporal upsampling under simple degradation assumptions, thus failing in real-world scenarios with complex unknown degradations. To address these challenges, we propose OSDEnhancer, the first framework that achieves robust STVSR in one-step diffusion. OSDEnhancer begins with a linear initialization to establish essential spatiotemporal structures and adapt the model for one-step reconstruction. It then applies a divide-and-conquer strategy, introducing the temporal coherence (TC) and texture enrichment (TE) LoRAs that progressively specialize in inter-frame dynamics modeling and fine-grained texture recovery, respectively, while collaborating during inference for enhanced overall performance. A bidirectional VAE decoder employs deformable recurrent blocks to leverage the multi-scale structure of the vanilla VAE, enhancing latent-to-pixel reconstruction through joint multi-scale deformable aggregation and inter-frame feature propagation. Experimental results demonstrate that the proposed method attains state-of-the-art performance with superior generalization in real-world scenarios. The code is available at https://github.com/W-Shuoyan/OSDEnhancer.

2601.18993 2026-05-20 cs.CV cs.AI cs.GR

FreeOrbit4D: Training-Free Arbitrary Camera Redirection for Monocular Videos via Foreground-Complete 4D Reconstruction

Wei Cao, Hao Zhang, Fengrui Tian, Yulun Wu, Yingying Li, Shenlong Wang, Ning Yu, Yaoyao Liu

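Affiliations: University of Illinois Urbana-Champaign; University of Pennsylvania; Eyeline Labs

AI summary: This paper proposes FreeOrbit4D, a training-free framework that resolves the geometric ambiguity of large-angle camera redirection by recovering a foreground-complete 4D proxy, producing more faithful and temporally coherent videos.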

Comments 12 pages, 10 figures. Accepted to SIGGRAPH Conference Papers 2026

Abstract

Camera redirection aims to replay a dynamic scene from a single monocular video under a user-specified camera trajectory. However, large-angle redirection is inherently ill-posed: a monocular video captures only a narrow spatio-temporal view of a dynamic 3D scene, providing severely limited observations of the underlying 4D world. The key challenge is therefore to recover a complete and coherent representation from this limited input, with consistent geometry and motion. While recent diffusion-based methods achieve impressive visual generation quality, they often break down under large-angle viewpoint changes far from the original trajectory, where missing visual grounding leads to severe geometric ambiguity and temporal inconsistency. We present FreeOrbit4D, an effective training-free framework that tackles this ambiguity by recovering a foreground-complete 4D proxy as structural grounding for video generation. We obtain this proxy by decoupling foreground and background reconstructions: we unproject the monocular video into a static background and partial foreground point clouds in a unified global space, then use an object-centric multi-view diffusion model to synthesize multi-view images and reconstruct complete foreground point clouds in canonical object space. By aligning the canonical foreground point cloud to the global scene space via dense pixel-synchronized 3D-3D correspondences and projecting the foreground-complete 4D proxy onto target camera viewpoints, we provide geometric scaffolds that guide a conditional video diffusion model. Extensive experiments show that FreeOrbit4D produces more faithful and temporally coherent redirected videos under challenging large-angle trajectories, and our proxy further enables applications such as edit propagation and 4D data generation. Project page: https://freeorbit4d.vision.ischool.illinois.edu/

2601.16823 2026-05-20 cs.CL cs.AI

Disentangling generalization and memorization in large language models using chess

Leonard S. Pleiss, Maximilian Schiffer, Robert K. von Weizsaecker

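Affiliations: Technical University of Munich

AI summary: Using chess as a testbed, this paper studies the distinction between generalization and memorization in large language models. It finds that performance drops markedly when relevant priors are sparse, indicating limited systematic generalization and the need for mechanisms beyond scale to achieve robust performance.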

Abstract

Large Language Models (LLMs) exhibit remarkable capabilities, yet it remains unclear to what extent these reflect sophisticated recall or genuine reasoning ability. We introduce chess as a controlled testbed aimed at disentangling these faculties. Leveraging the game's structure and scalable engine evaluations, we construct a taxonomy of positions varying in density of relevant priors, ranging from common states solvable by memorization to completely novel ones requiring generalization. Crucially, our approach achieves this distinction without requiring explicit knowledge of the models' training data. Applying this taxonomy, we combine a longitudinal analysis of the GPT lineage with a rigorous evaluation of contemporary models, including Claude Opus and Gemini. Our analysis reveals a steep gradient: performance consistently degrades as the density of relevant priors decreases. Notably, for tasks with few relevant priors, base model performance regresses to the random-play baseline. While newer models improve, progress slows significantly for tasks with sparse priors. Furthermore, while reasoning-augmented inference improves performance, its relative marginal benefit per token decreases in the absence of relevant priors. These results suggest limitations in systematic generalization, highlighting the need for mechanisms beyond scale to achieve robust performance when deprived of relevant priors.

2601.14234 2026-05-20 cs.LG cs.AI cs.RO stat.ML

Q-learning with Adjoint Matching

Qiyang Li, Sergey Levine

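Affiliations: UC Berkeley

AI summary: This paper proposes QAM, a temporal-difference-based reinforcement learning algorithm that tackles a long-standing challenge in continuous-action RL: efficiently optimizing an expressive diffusion or flow-matching policy with respect to a parameterized Q-function. Effective optimization requires the critic's first-order information, but direct gradient-based optimization by backpropagating through the multi-step denoising process is numerically unstable. Existing methods either use only the value and discard gradient information, or rely on approximations that sacrifice policy expressivity or bias the learned policy. QAM uses adjoint matching, a recently proposed technique from generative modeling, to turn the critic's action gradient into a step-wise objective that avoids unstable backpropagation while yielding an unbiased, expressive policy at the optimum. Combined with temporal-difference backups for critic learning, QAM consistently outperforms prior methods on hard, sparse-reward tasks in offline and offline-to-online RL.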

Comments 32 pages, 8 figures, 7 tables

Abstract

We propose Q-learning with Adjoint Matching (QAM), a novel TD-based reinforcement learning (RL) algorithm that tackles a long-standing challenge in continuous-action RL: efficient optimization of an expressive diffusion or flow-matching policy with respect to a parameterized Q-function. Effective optimization requires exploiting the first-order information of the critic, but it is challenging to do so for flow or diffusion policies because direct gradient-based optimization via backpropagation through their multi-step denoising process is numerically unstable. Existing methods work around this either by only using the value and discarding the gradient information, or by relying on approximations that sacrifice policy expressivity or bias the learned policy. QAM sidesteps both of these challenges by leveraging adjoint matching, a recently proposed technique in generative modeling, which transforms the critic's action gradient to form a step-wise objective function that is free from unstable backpropagation, while providing an unbiased, expressive policy at the optimum. Combined with temporal-difference backup for critic learning, QAM consistently outperforms prior approaches on hard, sparse reward tasks in both offline and offline-to-online RL.

2601.12707 2026-05-20 cs.LG stat.ML

Decoding Rewards in Competitive Games: Inverse Game Theory with Entropy Regularization

Junyi Liao, Zihan Zhu, Ethan Fang, Zhuoran Yang, Vahid Tarokh

Affiliations: Department of Electrical and Computer Engineering, Duke University; Department of Statistics and Data Science, University of Pennsylvania; Department of Biostatistics and Bioinformatics, Duke University; Department of Statistics and Data Science, Yale University

AI summary: This paper studies the recovery of unknown reward functions in competitive games through inverse game theory with entropy regularization. It proposes a unified framework that learns reward functions in both static and dynamic settings and validates it with theoretical guarantees and numerical experiments.

Comments Extended journal version of ICML 2025 paper. Submitted to Operations Research

Abstract

Estimating the unknown reward functions driving agents' behaviors is of central interest in inverse reinforcement learning and game theory. To tackle this problem, we develop a unified framework for reward function recovery in two-player zero-sum matrix games and Markov games with entropy regularization, where we aim to reconstruct the underlying reward functions given observed players' strategies and actions. This task is challenging due to the inherent ambiguity of inverse problems, the non-uniqueness of feasible rewards, and limited observational data coverage. To address these challenges, we establish the reward function's identifiability using the quantal response equilibrium (QRE) under linear assumptions. Building upon this theoretical foundation, we propose a novel algorithm to learn reward functions from observed actions. Our algorithm works in both static and dynamic settings and is adaptable to incorporate different methods, such as Maximum Likelihood Estimation (MLE). We provide strong theoretical guarantees for the reliability and sample efficiency of our algorithm. Further, we conduct extensive numerical studies to demonstrate the practical effectiveness of the proposed framework, offering new insights into decision-making in competitive environments.
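
For intuition about what the quantal response equilibrium reveals about rewards, the sketch below solves a small entropy-regularized zero-sum matrix game by fixed-point iteration and then reads a payoff gap back from the equilibrium strategy. It is a minimal illustration, not the paper's algorithm: the payoff matrix `A`, the temperature `tau`, and the function names are hypothetical, and plain iteration is assumed to converge only because `tau` is large relative to the payoffs.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def row_payoffs(A, y):
    """Expected payoff of each row action against the column strategy y."""
    return [sum(a * q for a, q in zip(row, y)) for row in A]

def col_payoffs(A, x):
    """Expected payoff of each column action (the negated row payoff) against x."""
    return [-sum(A[i][j] * x[i] for i in range(len(A))) for j in range(len(A[0]))]

def solve_qre(A, tau, iters=500):
    """Iterate x = softmax(A y / tau), y = softmax(-A^T x / tau) to a fixed point."""
    x = [1.0 / len(A)] * len(A)
    y = [1.0 / len(A[0])] * len(A[0])
    for _ in range(iters):
        new_x = softmax([u / tau for u in row_payoffs(A, y)])
        new_y = softmax([u / tau for u in col_payoffs(A, x)])
        x, y = new_x, new_y
    return x, y

A = [[1.0, -1.0], [-1.0, 0.5]]   # hypothetical payoff matrix for the row player
tau = 2.0                         # hypothetical entropy-regularization temperature
x, y = solve_qre(A, tau)

# Inverse direction: the log-ratio of equilibrium probabilities pins down
# the payoff gap between two row actions.
recovered_gap = tau * (math.log(x[0]) - math.log(x[1]))
u = row_payoffs(A, y)
true_gap = u[0] - u[1]
```

Observed strategies determine only payoff differences of this kind, which is one way to see the ambiguity the abstract mentions and why additional structure, such as the linear assumptions, is needed for identifiability.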

2601.12369 2026-05-20 cs.CL

Can Deep Research Agents Retrieve and Organize? Evaluating the Synthesis Gap with Expert Taxonomies

Ming Zhang, Jiabao Zhuang, Wenqing Jing, Kexin Tan, Ziyu Kong, Jingyi Deng, Yujiong Shen, Yuhui Wang, Zhenghao Xiang, Qiyuan Peng, Yuhang Zhao, Ning Luo, Renzhe Zheng, Jiahui Lin, Mingqi Wu, Long Ma, Shihan Dou, Maxm Pan, Tao Gui, Qi Zhang, Xuanjing Huang

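Affiliations: Fudan University; Hunyuan Team, Tencent

AI summary: This paper introduces the TaxoBench benchmark, which evaluates how well deep research agents retrieve and organize papers, and finds bottlenecks on both the capability side and the alignment side.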

英文摘要

Deep Research Agents increasingly automate survey generation, yet whether they match human experts at retrieving essential papers and organizing them into expert-like taxonomies remains unclear. Existing benchmarks emphasize writing quality or citation correctness, while standard clustering metrics ignore hierarchical structure. We introduce TaxoBench, a benchmark of 72 highly cited LLM surveys with expert-authored taxonomy trees and 3,815 papers mapped to paper categories. TaxoBench evaluates (1) retrieval via Recall/Precision/F1, and (2) organization at a leaf level (paper-to-category assignment) and a hierarchy level via two new metrics: Unordered Semantic Tree Edit Distance (US-TED/US-NTED) and Semantic Path Similarity (Sem-Path). Two modes are supported: Deep Research (topic-only, end-to-end) and Bottom-Up (expert paper set provided, organization-only). To distinguish disagreement with a single expert reference from genuine model failure, we explicitly partition findings into capability-based (reference-free) and alignment-based (reference-dependent) groups. Evaluating 7 Deep Research Agents and 12 frontier LLMs reveals a dual bottleneck. On the capability side, the best agent retrieves only 20.92% of expert-cited papers, and 1,000 model taxonomies show 75.9% sibling overlap, 51.2% MECE violations, and 83.4% structural imbalance, all detectable without any reference. On the alignment side, all 12 LLMs converge to Sem-Path 28-29%, well below 47-58% achieved by three independent human-annotator groups on the same paper sets. Our benchmark is publicly available at https://github.com/KongLongGeFDU/TaxoBench.
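As an illustration of the retrieval side of this evaluation, the sketch below computes set-based Precision/Recall/F1 between the papers an agent retrieved and an expert-cited set. The function name `retrieval_scores` and the paper identifiers are hypothetical, and exact identifier matching is an assumption: the abstract does not say how TaxoBench matches retrieved papers to references.

```python
def retrieval_scores(retrieved, expert_cited):
    """Set-based precision, recall, and F1 over paper identifiers."""
    retrieved, expert_cited = set(retrieved), set(expert_cited)
    hits = len(retrieved & expert_cited)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(expert_cited) if expert_cited else 0.0
    # F1 is the harmonic mean; define it as 0 when nothing matches.
    f1 = (2 * precision * recall / (precision + recall)) if hits else 0.0
    return precision, recall, f1


# Hypothetical identifiers, for illustration only.
expert = ["p1", "p2", "p3", "p4", "p5"]
agent = ["p2", "p5", "p9", "p10"]
precision, recall, f1 = retrieval_scores(agent, expert)
```

Here two of the four retrieved papers are expert-cited, so precision is 0.5 and recall is 0.4. A recall figure such as the 20.92% reported above would correspond to the second value returned.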

2601.05437 2026-05-20 cs.CL cs.AI

Tracing Moral Foundations in Large Language Models

Chenxiao Yu, Bowen Yi, Farzan Karimi-Malekabadi, Suhaib Abdurahman, Jinyi Ye, Shrikanth Narayanan, Yue Zhao, Morteza Dehghani

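Affiliations: Department of Computer Science, University of Southern California; Department of Psychology, University of Southern California; Center for Computational Language Sciences, University of Southern California

AI summary: This paper studies how moral foundations are encoded, organized, and expressed in large language models. Using a multi-level approach, it analyzes how well these representations align with human moral perceptions and finds that moral structure emerges naturally during pretraining, is selectively rewired by post-training, and is partly disentangled.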

Abstract

Large language models often produce human-like moral judgments, but it is unclear whether this reflects an internal conceptual structure or superficial "moral mimicry." Using Moral Foundations Theory (MFT) as an analytic framework, we study how moral foundations are encoded, organized, and expressed across 14 base and instruction-tuned LLMs spanning four model families (Llama, Qwen2.5, Qwen3-MoE, Mistral) and scales from 7B to 70B. We employ a multi-level approach combining (i) layer-wise analysis of MFT concept representations and their alignment with human moral perceptions, (ii) pretrained sparse autoencoders (SAEs) over the residual stream to identify sparse features that support moral concepts, and (iii) causal steering interventions using dense MFT vectors and sparse SAE features. We find that models represent and distinguish moral foundations in a manner that aligns with human judgments, and that this moral geometry naturally emerges from pretraining and is selectively rewired by post-training. At a finer scale, SAE features show clear semantic links to specific foundations, suggesting partially disentangled mechanisms within shared representations. Finally, steering along either dense vectors or sparse features produces predictable shifts in foundation-relevant behavior, demonstrating a causal connection between internal representations and moral outputs. Together, our results provide mechanistic evidence that moral concepts in LLMs are distributed, layered, and partly disentangled, suggesting that pluralistic moral structure can emerge as a latent pattern from the statistical regularities of language alone.

2512.24470 2026-05-20 cs.RO cs.AI

Foundation models on the bridge: Semantic hazard detection and safety maneuvers for maritime autonomy with vision-language models

Kim Alexander Christensen, Andreas Gudahl Tufte, Alexey Gusev, Rohan Sinha, Milan Ganai, Ole Andreas Alsos, Marco Pavone, Martin Steinert

Affiliations: Dept. of Mechanical and Industrial Engineering, NTNU; Dept. of Aeronautics and Astronautics, Stanford University; Dept. of Computer Science, Stanford University; NVIDIA Research

AI summary: This paper proposes a vision-language-model-based method for semantic hazard detection and safety maneuvers that targets the draft IMO MASS Code requirements for autonomous vessels. It combines a fast-slow anomaly pipeline with a short-horizon, human-overridable fallback maneuver and is evaluated on 40 harbor scenes.

Comments 17 pages without bibliography or appendix. The main paper has 16 figures. Paper webpage can be found at https://kimachristensen.github.io/bridge_policy/

Journal ref Ocean Engineering 359, Part 3 (2026), Article 124646

Abstract

The draft IMO MASS Code requires autonomous and remotely supervised maritime vessels to detect departures from their operational design domain, enter a predefined fallback that notifies the operator, permit immediate human override, and avoid changing the voyage plan without approval. Meeting these obligations in the alert-to-takeover gap calls for a short-horizon, human-overridable fallback maneuver. Classical maritime autonomy stacks struggle when the correct action depends on meaning (e.g., diver-down flag means people in the water, fire close by means hazard). We argue (i) that vision-language models (VLMs) provide semantic awareness for such out-of-distribution situations, and (ii) that a fast-slow anomaly pipeline with a short-horizon, human-overridable fallback maneuver makes this practical in the handover window. We introduce Semantic Lookout, a camera-only, candidate-constrained VLM fallback maneuver selector that selects one cautious action (or station-keeping) from water-valid, world-anchored trajectories under continuous human authority. On 40 harbor scenes we measure per-call scene understanding and latency, alignment with human consensus (model majority-of-three voting), short-horizon risk-relief on fire hazard scenes, and an on-water alert->fallback maneuver->operator handover. Sub-10 s models retain most of the awareness of slower state-of-the-art models. The fallback maneuver selector outperforms geometry-only baselines and increases standoff distance on fire scenes. A field run verifies end-to-end operation. These results support VLMs as semantic fallback maneuver selectors compatible with the draft IMO MASS Code, within practical latency budgets, and motivate future work on domain-adapted, hybrid autonomy that pairs foundation-model semantics with multi-sensor bird's-eye-view perception and short-horizon replanning. Website: kimachristensen.github.io/bridge_policy

2512.23461 2026-05-20 cs.LG cs.AI

Eliminating Inductive Bias in Reward Models with Information-Theoretic Guidance

Zhuo Li, Pengyu Cheng, Zhechao Yu, Feifei Tong, Anningzhe Gao, Tsung-Hui Chang, Xiang Wan, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang

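Affiliations: Qwen Large Model Application Team, Alibaba; The Chinese University of Hong Kong, Shenzhen; Shenzhen Research Institute of Big Data

AI summary: This paper proposes DIR, an information-theoretic debiasing method for reward models. It maximizes the mutual information between reward model scores and human preference pairs while minimizing the mutual information between reward model outputs and biased attributes of the preference inputs, which mitigates inductive biases and improves RLHF performance.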

Comments Published as a conference paper at The International Conference on Learning Representations (ICLR) 2026

Abstract

Reward models (RMs) are essential in reinforcement learning from human feedback (RLHF) to align large language models (LLMs) with human values. However, RM training data is commonly recognized as low-quality, containing inductive biases that can easily lead to overfitting and reward hacking. For example, more detailed and comprehensive responses are usually preferred by humans but also contain more words, which makes response length one of the inevitable inductive biases. The few prior RM debiasing approaches either target a single specific type of bias or model the problem with only simple linear correlations, e.g., Pearson coefficients. To mitigate more complex and diverse inductive biases in reward modeling, we introduce a novel information-theoretic debiasing method called Debiasing via Information optimization for RM (DIR). Inspired by the information bottleneck (IB), we maximize the mutual information (MI) between RM scores and human preference pairs, while minimizing the MI between RM outputs and biased attributes of preference inputs. With theoretical justification from information theory, DIR can handle more sophisticated types of biases with non-linear correlations, broadly extending the real-world application scenarios for RM debiasing methods. In experiments, we verify the effectiveness of DIR with three types of inductive biases: response length, sycophancy, and format. We discover that DIR not only effectively mitigates target inductive biases but also enhances RLHF performance across diverse benchmarks, yielding better generalization abilities. The code and training recipes are available at https://github.com/Qwen-Applications/DIR.

2512.16856 2026-05-20 cs.AI

Distributional AGI Safety

Nenad Tomašev, Matija Franklin, Julian Jacobs, Sébastien Krier, Simon Osindero

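Affiliations: Google DeepMind

AI summary: This paper proposes a framework for distributional AGI safety that addresses the risks arising from coordination among groups of agents by designing and implementing virtual agentic sandbox economies, emphasizing market mechanisms, auditability, and oversight.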

Abstract

AI safety and alignment research has predominantly been focused on methods for safeguarding individual AI systems, resting on the assumption of an eventual emergence of a monolithic Artificial General Intelligence (AGI). The alternative AGI emergence hypothesis, where general capability levels are first manifested through coordination in groups of sub-AGI individual agents with complementary skills and affordances, has received far less attention. Here we argue that this patchwork AGI hypothesis needs to be given serious consideration, and should inform the development of corresponding safeguards and mitigations. The rapid deployment of advanced AI agents with tool-use capabilities and the ability to communicate and coordinate makes this an urgent safety consideration. We therefore propose a framework for distributional AGI safety that moves beyond evaluating and aligning individual agents. This framework centres on the design and implementation of virtual agentic sandbox economies (impermeable or semi-permeable), where agent-to-agent transactions are governed by robust market mechanisms, coupled with appropriate auditability, reputation management, and oversight to mitigate collective risks.

2512.11234 2026-05-20 cs.CV

RoomPilot: Controllable Indoor Scene Synthesis via Multimodal Semantic Parsing

Wentang Chen, Shougao Zhang, Yiman Zhang, Tianhao Zhou, Ruihui Li

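Affiliations: School of Information Science and Engineering, Hunan University

AI summary: This work proposes RoomPilot, a framework for controllable indoor scene synthesis via multimodal semantic parsing. It addresses the limited input modalities and implicit generation processes of existing methods and improves control over scene structure and semantics.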

Comments 30 pages, 8 figures

Abstract

Generating controllable indoor scenes is fundamental to applications in game development, architectural visualization, and embodied AI. However, existing approaches either support only a limited set of input modalities or rely on implicit generation processes that hinder precise control over scene structure and semantics. To address these limitations, we introduce RoomPilot, a unified framework for controllable indoor scene synthesis from multi-modal inputs, including textual descriptions and CAD floor plans. RoomPilot maps heterogeneous inputs into an Indoor Domain-Specific Language (IDSL), which serves as a structured and interpretable semantic representation for describing indoor scenes. Built upon IDSL, RoomPilot presents a hierarchical synthesis pipeline that progressively organizes scenes at the building, room, and object levels, promoting structural coherence and functional consistency across multi-room layouts. Moreover, RoomPilot constructs a curated asset dataset with rich semantic annotations to support high-quality scene synthesis, improving visual realism and appearance consistency. Extensive experiments demonstrate effective multi-modal understanding, fine-grained controllability in scene generation, and improved physical consistency and visual fidelity, marking a significant step toward controllable 3D indoor scene synthesis. Code and models will be made available.

2512.10891 2026-05-20 cs.RO cs.LG

Iterative Compositional Data Generation for Robot Control

Anh-Quan Pham, Marcel Hussing, Shubhankar P. Patankar, Dani S. Bassett, Jorge Mendez-Mendez, Eric Eaton

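Affiliations: University of Pennsylvania; Stony Brook University

AI summary: This paper proposes a semantic compositional diffusion transformer that uses attention to learn interactions among robot-, object-, obstacle-, and objective-specific components. After training on a limited set of tasks, it generates high-quality transitions zero-shot, from which control policies for unseen task combinations can be learned, and an iterative self-improvement procedure further raises zero-shot performance.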

Abstract

Collecting robotic manipulation data is expensive, making it impractical to acquire demonstrations for the combinatorially large space of tasks that arise in multi-object, multi-robot, and multi-environment settings. While recent generative models can synthesize useful data for individual tasks, they do not exploit the compositional structure of robotic domains and struggle to generalize to unseen task combinations. We propose a semantic compositional diffusion transformer that factorizes transitions into robot-, object-, obstacle-, and objective-specific components and learns their interactions through attention. We show that, once trained on a limited subset of tasks, our model can zero-shot generate high-quality transitions from which we can learn control policies for unseen task combinations. We then introduce an iterative self-improvement procedure in which synthetic data is validated via offline reinforcement learning and incorporated into subsequent training rounds. Our approach substantially improves zero-shot performance over monolithic and hard-coded compositional baselines, ultimately solving nearly all held-out tasks and demonstrating the emergence of meaningful compositional structure in the learned representations.

2512.08237 2026-05-20 cs.CV

Fast-BEV++: Fast by Algorithm, Deployable by Design

Yuanpeng Chen, Hui Song, Sheng Yang, Wei Tao, Shanhui Mo, Shuang Zhang, Xiao Hua, Tiankun Zhao

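Affiliations: iMotion Automotive Technology (Suzhou) Co., Ltd; School of Data Science, Fudan University

AI summary: This paper proposes Fast-BEV++, which follows two principles, Fast by Algorithm and Deployable by Design, to resolve the trade-off between accuracy and deployment efficiency in low-cost bird's-eye-view perception for autonomous driving. It achieves at least a 3x speedup over Fast-BEV, a new state-of-the-art 0.488 NDS on the nuScenes benchmark, and real-time inference at more than 134 FPS.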

Comments most up-to-date version

Abstract

The advancement of vision-only Bird's-Eye-View (BEV) perception, a core paradigm for cost-effective autonomous driving, is hindered by the long-standing fundamental trade-off between perception accuracy and on-device deployment efficiency. In this work, we introduce Fast-BEV++, a BEV perception framework that resolves this tension through two fundamental design principles: Fast by Algorithm and Deployable by Design. By decomposing the core view transformation module into a hardware-oriented standard Index-Gather-Reshape pipeline, Fast-BEV++ eliminates dependencies on custom kernels while achieving no less than 3 times speedup over the Fast-BEV baseline across mainstream edge platforms. Empirically, Fast-BEV++ establishes a new state-of-the-art result of 0.488 NDS on the nuScenes 3D object detection benchmark, simultaneously delivering real-time inference at more than 134 FPS via our acceleration design. In particular, our integrated, learnable depth module yields consistent performance gains, maintaining the highest accuracy among comparable methods. Overall, this inherently decomposed architecture enables seamless real-time deployment across diverse production-grade automotive platforms, alleviating hardware limitations without compromising perception accuracy or inference efficiency.
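The abstract names an Index-Gather-Reshape pipeline without detailing it. The sketch below shows one plausible reading, assuming the "Index" step is a lookup table computed offline that maps each BEV cell to a flattened image-feature position, so that inference needs only a gather and a reshape. The function `index_gather_reshape`, the `-1` sentinel for invisible cells, and all values are hypothetical and are not taken from Fast-BEV++.

```python
def index_gather_reshape(pixel_feats, lut, grid_x, grid_y):
    """Build a BEV feature grid from flattened image features.

    pixel_feats: list of per-pixel feature vectors (flattened image).
    lut: precomputed pixel index per BEV cell, or -1 if no pixel sees the cell.
    Returns a grid_x by grid_y nested list of feature vectors.
    """
    channels = len(pixel_feats[0])
    zero = [0.0] * channels
    # Gather: one table lookup per BEV cell, no projection math at inference time.
    flat = [pixel_feats[i] if i >= 0 else zero for i in lut]
    # Reshape: flat list of cells -> rows of the BEV grid.
    return [flat[r * grid_y:(r + 1) * grid_y] for r in range(grid_x)]


# Hypothetical 4-pixel image with 2 channels and a 2 x 3 BEV grid.
pixel_feats = [[1.0, 1.5], [2.0, 2.5], [3.0, 3.5], [4.0, 4.5]]
lut = [0, 3, -1, 2, 2, 1]  # "Index" step, assumed to come from camera calibration
bev = index_gather_reshape(pixel_feats, lut, grid_x=2, grid_y=3)
```

Under this reading, every operation at inference time is a standard indexing or reshaping primitive, which is consistent with the stated goal of avoiding custom kernels.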

2512.07068 2026-05-20 cs.CL

SETUP: Sentence-level English-To-Uniform Meaning Representation Parser

Emma Markle, Javier Gutierrez Bach, Shira Wein

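Affiliations: Amherst College

AI summary: This paper presents two methods for parsing English into Uniform Meaning Representation (UMR): one fine-tunes existing Abstract Meaning Representation parsers, and the other uses a converter from Universal Dependencies. The best model, SETUP, reaches an AnCast score of 84 and a SMATCH++ score of 91, a substantial gain in automatic UMR parsing.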

Comments LREC 2026 Camera-ready

Abstract

Uniform Meaning Representation (UMR) is a novel graph-based semantic representation which captures the core meaning of a text, with flexibility incorporated into the annotation schema such that the breadth of the world's languages can be annotated (including low-resource languages). While UMR shows promise in enabling language documentation, improving low-resource language technologies, and adding interpretability, the downstream applications of UMR can only be fully explored when text-to-UMR parsers enable the automatic large-scale production of accurate UMR graphs at test time. Prior work on text-to-UMR parsing is limited to date. In this paper, we introduce two methods for English text-to-UMR parsing, one of which fine-tunes existing parsers for Abstract Meaning Representation and the other of which leverages a converter from Universal Dependencies, using prior work as a baseline. Our best-performing model, which we call SETUP, achieves an AnCast score of 84 and a SMATCH++ score of 91, indicating substantial gains towards automatic UMR parsing.

2512.05958 2026-05-20 cs.LG cs.AI

MaxShapley: Towards Incentive-compatible Generative Search with Fair Context Attribution

Sara Patel, Mingxun Zhou, Giulia Fanti

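Affiliations: Carnegie Mellon University; HKUST

AI summary: This paper proposes MaxShapley, an algorithm for fairly attributing credit to, and compensating, content providers in generative search pipelines. As a special case of the Shapley value, it uses a decomposable max-sum utility function to compute attributions in polynomial time rather than at the exponential cost of general Shapley values.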

Abstract

Generative search engines based on large language models (LLMs) are replacing traditional search, fundamentally changing how information providers are compensated. To sustain this ecosystem, we need fair mechanisms to attribute and compensate content providers based on their contributions to generated answers. We introduce MaxShapley, an efficient algorithm for fair credit attribution in generative search pipelines that retrieve external sources before generation. MaxShapley is a special case of the celebrated Shapley value; it leverages a decomposable max-sum utility function to compute attributions with polynomial-time computation in the number of documents, as opposed to the exponential cost of Shapley values. We evaluate MaxShapley on three multi-hop QA datasets (HotPotQA, MuSiQUE, MS MARCO); MaxShapley achieves comparable attribution quality to exact Shapley computation, while consuming a fraction of its tokens. For instance, it gives up to a 9x reduction in resource consumption over prior state-of-the-art methods at the same attribution accuracy. We release open-source code and re-calibrated datasets. An educational demo is available at https://fair-search.com.
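To make the cost comparison concrete, the sketch below computes exact Shapley values by brute force for a toy max-sum utility. It is the expensive baseline, not the MaxShapley algorithm. The utility form (for each key point, take the best score among the chosen documents, then sum over key points) is an assumed reading of "max-sum", and the document names and scores are hypothetical.

```python
from itertools import permutations


def max_sum_utility(subset, scores):
    """Sum over key points of the best score among the chosen documents."""
    if not subset:
        return 0.0
    n_points = len(next(iter(scores.values())))
    return sum(max(scores[d][j] for d in subset) for j in range(n_points))


def exact_shapley(scores):
    """Average each document's marginal contribution over all orderings.

    The number of orderings is factorial in the number of documents,
    which is what makes the exact computation impractical at scale.
    """
    docs = list(scores)
    totals = {d: 0.0 for d in docs}
    n_orderings = 0
    for order in permutations(docs):
        n_orderings += 1
        chosen = []
        for d in order:
            before = max_sum_utility(chosen, scores)
            chosen.append(d)
            totals[d] += max_sum_utility(chosen, scores) - before
    return {d: totals[d] / n_orderings for d in docs}


# Hypothetical relevance scores: one row per document, one column per key point.
scores = {"doc_a": [0.9, 0.0], "doc_b": [0.9, 0.2], "doc_c": [0.0, 0.8]}
values = exact_shapley(scores)
```

The attributions sum to the utility of the full document set (the efficiency property of the Shapley value), and `doc_b`, which scores at least as well as `doc_a` on every key point, receives the larger share.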

2512.05721 2026-05-20 cs.LG

BERTO: Intent-Driven Network Time Series Forecasting via Natural Language Operator Preferences

Nitin Priyadarshini Shankar, Vaibhav Singh, Sheetal Kalyani, Christian Maciocco

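Affiliations: Intel Labs; Indian Institute of Technology Madras

AI summary: BERTO performs intent-driven network time series forecasting guided by natural-language operator preferences. Built on BERT, it targets traffic prediction and energy optimization, and it combines a Balancing Loss Function with prompt-based conditioning so that the model can shift its forecasting bias to match operator needs and produce flexible, decision-aware forecasts.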

Comments 7 pages, 3 figures, 2 tables

Abstract

Traditional cellular traffic forecasting models are optimized for minimizing symmetric errors, leaving them indifferent to shifting operational priorities. To bridge this gap, we introduce BERTO, a BERT-based framework for traffic prediction and energy optimization in cellular networks. Built on transformer architectures, BERTO achieves high prediction accuracy while enabling a single fine-tuned model to operate across multiple forecasting regimes via natural-language operator prompts. By combining a Balancing Loss Function (BLF) with prompt-based conditioning, BERTO adaptively shifts its forecasting bias toward underprediction or overprediction depending on the operator's desired trade-off between power savings and service quality. This allows the same model to dynamically generate different decision-aware forecasts without retraining or modifying model parameters. Experiments on real-world datasets demonstrate that BERTO can operate across a flexible range of approximately 1.4 kW in power consumption while balancing 9x variation in service level agreement (SLA) violations, making it well suited for intelligent RAN deployments.
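The abstract does not define the Balancing Loss Function, so the sketch below uses a generic weighted absolute error as a stand-in to show how an asymmetric loss can push forecasts toward overprediction or underprediction. The function `asymmetric_loss`, the weight `alpha`, and the traffic values are all hypothetical and are not BERTO's formulation.

```python
def asymmetric_loss(y_true, y_pred, alpha):
    """Mean weighted absolute error.

    alpha in (0, 1) weights under-prediction errors and (1 - alpha) weights
    over-prediction errors; alpha = 0.5 treats both directions equally.
    """
    total = 0.0
    for t, p in zip(y_true, y_pred):
        err = t - p  # positive when the forecast is too low
        total += alpha * err if err >= 0 else (1 - alpha) * (-err)
    return total / len(y_true)


traffic = [10.0, 12.0, 8.0]
low_forecast = [9.0, 11.0, 7.0]    # under-predicts by 1 everywhere
high_forecast = [11.0, 13.0, 9.0]  # over-predicts by 1 everywhere

# An operator who prioritizes service quality might penalize under-prediction more.
qos_alpha = 0.8
loss_low = asymmetric_loss(traffic, low_forecast, qos_alpha)
loss_high = asymmetric_loss(traffic, high_forecast, qos_alpha)
```

With `alpha = 0.8`, the under-predicting forecast is penalized four times as heavily as the over-predicting one, so a model trained on this loss would lean toward overprediction. Lowering `alpha` reverses the preference, which is the kind of trade-off between power savings and service quality that the abstract describes.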

2512.01152 2026-05-20 cs.LG cs.AI cs.CV

Open-Set Domain Adaptation Under Background Distribution Shift: Challenges and A Provably Efficient Solution

Shravan Chaudhari, Yoav Wald, Suchi Saria

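Affiliations: Department of Computer Science, Johns Hopkins University; Faculty of Data and Decision Sciences, Technion; Center for Data Science, New York University; Bayesian Health

AI summary: This paper studies the challenges of open-set domain adaptation under background distribution shift and proposes CoLOR, a provably efficient solution. Theoretical analysis shows that it outperforms a representative baseline in a simplified overparameterized setting, and experiments demonstrate its applicability to both image and text data.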

Comments Project page at https://github.com/Shra1-25/CoLOR

Journal ref Transactions on Machine Learning Research (TMLR) 2026/May ISSN: 2835-8856

Abstract

As we deploy machine learning systems in the real world, a core challenge is to maintain a model that is performant even as the data shifts. Such shifts can take many forms: new classes may emerge that were absent during training, a problem known as open-set recognition, and the distribution of known categories may change. Guarantees on open-set recognition are mostly derived under the assumption that the distribution of known classes, which we call the background distribution, is fixed. In this paper we develop CoLOR, a method that is guaranteed to solve open-set recognition even in the challenging case where the background distribution shifts. We prove that the method works under the benign assumption that the novel class is separable from the non-novel classes, and provide theoretical guarantees that it outperforms a representative baseline in a simplified overparameterized setting. We develop techniques to make CoLOR scalable and robust, and perform comprehensive empirical evaluations on image and text data. The results show that CoLOR significantly outperforms existing open-set recognition methods under background shift. Moreover, we provide new insights into how factors such as the size of the novel class influence performance, an aspect that has not been extensively explored in prior work.

2512.00281 2026-05-20 cs.CV q-bio.NC

Beyond Size and Growth: Rethinking Lung Cancer Screening with AI Based Nodule Detection and Diagnosis

Sylvain Bodard, Pierre Baudot, Benjamin Renoust, Charles Voyton, Gwendoline De Bie, Ezequiel Geremia, Van-Khoa Le, Danny Francis, Pierre-Henri Siot, Yousra Haddou, Vincent Bobin, Jean-Christophe Brisset, Carey C. Thomson, Valerie Bourdes, Benoit Huet

Affiliations: Université de Paris Cité, AP-HP, Hôpital Universitaire Necker Enfants Malades, Service d’Imagerie Adulte; Memorial Sloan Kettering Cancer Center, Department of Radiology; Sorbonne Université, CNRS UMR 7371, INSERM U 1146, Laboratoire d’Imagerie Biomédicale (LIB); Median Technologies, eyonis; Mount Auburn Hospital/Beth Israel Lahey Health, Cambridge MA, USA; Harvard Medical School, Boston MA, USA

AI summary: This paper presents an integrated AI system that performs nodule detection and malignancy assessment directly at the nodule level from low-dose CT scans, going beyond conventional size- and growth-based screening criteria to improve the accuracy and efficiency of lung cancer screening.

Comments 25 pages, 8 figures, with supplementary information containing 11 figures

Abstract

Early detection of malignant lung nodules remains constrained by size and growth based screening criteria, often delaying diagnosis. We present an integrated AI system that jointly performs nodule detection and malignancy assessment directly at the nodule level from low dose CT scans, within a unified CADe/CADx framework. Unlike conventional pipelines separating detection and diagnosis, our approach targets malignant nodules directly, redefining evaluation at the point where clinical decisions are made. To address limitations in dataset scale and explainability, the system consists of a Large Ensemble Model (LEM) combining ensembles of shallow deep learning and feature based models. It was trained and evaluated on 25,709 scans with 69,449 annotated nodules, with external validation on an independent cohort. It achieved an AUC of 0.98 internally and 0.945 externally, outperforming all growth based metrics, Lung RADS size based triage, European volume and VDT based screening criteria, radiologists, and leading AI models. The model maintains high sensitivity at low false positive rates, excels for small and early stage cancers, and enables malignancy assessment up to one year earlier than radiologists for indeterminate and slow growing nodules. This approach has the potential to streamline lung cancer screening workflows and support earlier, more actionable clinical decision making.

2511.17166 2026-05-20 cs.RO

Reflection-Based Relative Localization for Cooperative UAV Teams Using Active Markers

Tim Lakemann, Daniel Bonilla Licea, Viktor Walter, Martin Saska

Affiliations: Multi-Robot Systems Group, Faculty of Electrical Engineering, Czech Technical University in Prague; Mohammed VI Polytechnic University

AI summary: This paper presents a new method for relative localization in UAV teams that exploits reflections of active markers in the environment. It requires no prior knowledge of robot size or marker configuration, accounts for uncertainty caused by surface irregularities, and achieves greater effective range and accuracy across varying lighting conditions in experiments.

Abstract

Reflections of active markers in the environment are a common source of ambiguity in onboard visual relative localization. This work presents a novel approach that exploits these typically unwanted reflections for onboard relative localization in heterogeneous multi-UAV teams. The method operates without prior knowledge of robot size or predefined marker configurations, remains independent of surface properties, and explicitly accounts for uncertainties caused by surface irregularities, including dynamic water surfaces relevant for marine deployments. We validated the approach in both indoor and outdoor experiments, demonstrating reliable operation across varying lighting conditions and achieving greater effective range (above 30 m) and accuracy than state-of-the-art methods. The video is available under the following link: https://youtu.be/y0zp8cIwkig.

2511.16766 2026-05-20 cs.CV

SVG360: Editable Multiview Vector Graphics from a Single SVG

SVG360: 从单个SVG生成可编辑的多视角矢量图形

Mengnan Jiang, Zhaolin Sun, Christian Franke, Michele Franco Adesso, Antonio Haas, Grace Li Zhang

发表机构 * Technical University of Darmstadt(达姆施塔特技术大学) University of Stuttgart(斯图加特大学)

AI总结 本文提出SVG360框架,通过视角一致的矢量化流程将单个SVG转换为几何和视觉一致的多视角SVG资产,解决了逐视角处理时路径碎片化和颜色不稳定的问题,提升了多视角一致性与可编辑性。

AI中文摘要

可缩放矢量图形(SVG)是可编辑视觉设计的标准表示形式,但通常被创作为单视角的二维插图。这限制了其在需要对象级资产从不同视角观察、编辑或制作动画时保持一致的应用中的使用。我们提出了SVG360,一个将单个输入SVG转换为几何和视觉一致的多视角SVG资产的框架。关键挑战在于,直接逐视角生成或矢量化会产生视角依赖的区域、碎片化的路径和不稳定的颜色,使生成的SVG难以作为一个整体对象进行编辑。SVG360通过视角一致的矢量化流程解决这一问题。它首先将栅格化输入提升为视角条件的对象表示,并在指定相机下渲染目标视角。然后通过一种改编自视频分割的空间记忆机制,将部件身份传播到相邻视角,建立一致的区域分解、路径对应和颜色分配,而无需针对特定任务重新训练。最后,每个视角通过结构感知的矢量化重建为可编辑的SVG,其中冗余路径被合并,局部几何被优化,同时保留边界和语义部件。在对象级SVG资产上的实验表明,与直接逐视角矢量化相比,SVG360提高了多视角一致性,减少了路径冗余,并更好地保留了精细结构。通过将单视角SVG转换为一致的360度矢量资产,SVG360将矢量图形从静态插图扩展到可编辑的多视角内容,适用于设计、动画和结构化视觉编辑。

英文摘要

Scalable Vector Graphics are a standard representation for editable visual design, yet they are usually authored as single-view two-dimensional illustrations. This limits their use in applications that require object-level assets to remain coherent when observed, edited, or animated from different viewpoints. We present SVG360, a framework that converts a single input SVG into geometrically and visually consistent multiview SVG assets. The key challenge is that direct per-view generation or vectorization produces view-dependent regions, fragmented paths, and unstable colors, making the resulting SVGs difficult to edit as a coherent object. SVG360 addresses this problem through a view-consistent vectorization pipeline. It first lifts the rasterized input into a view-conditioned object representation and renders target views under prescribed cameras. It then propagates part identity across neighboring views using a spatial memory mechanism adapted from video segmentation, establishing consistent region decomposition, path correspondence, and color assignment without task-specific retraining. Finally, each view is reconstructed as an editable SVG through structure-aware vectorization, where redundant paths are consolidated and local geometry is optimized while preserving boundaries and semantic parts. Experiments on object-level SVG assets show that SVG360 improves multiview consistency, reduces path redundancy, and better preserves fine structures compared with direct per-view vectorization. By turning a single-view SVG into a coherent 360-degree vector asset, SVG360 expands vector graphics from static illustration toward editable multiview content for design, animation, and structured visual editing.

2511.13864 2026-05-20 cs.CV

GRLoc: Geometric Representation Regression for Visual Localization

GRLoc: 用于视觉定位的几何表示回归

Changyang Li, Xuejian Ma, Lixiang Liu, Zhan Li, Qingan Yan, Yi Xu

发表机构 * Goertek Alpha Labs(歌尔声学实验室)

AI总结 本文提出了一种基于几何表示回归(GRR)的方法,通过解耦旋转和平移预测来提升视觉定位的性能,并在7-Scenes和Cambridge Landmarks数据集上实现了最先进的结果。

AI中文摘要

绝对姿态回归(APR)已成为视觉定位中一种有吸引力的范式。然而,APR模型通常作为黑箱运行,直接从查询图像回归6自由度姿态,这可能导致模型记忆训练视图,而非理解3D场景几何。在本文中,我们提出了一种有几何依据的替代方法。新视角合成从中间几何表示渲染图像;受此启发,我们将APR重新表述为其逆过程,即从图像直接回归底层3D表示,并将这一范式命名为几何表示回归(GRR)。我们的模型在世界坐标系中显式预测两种解耦的几何表示:(1)射线图(raymap)的方向,用于估计相机旋转;(2)对应的点图(pointmap),用于估计相机平移。最终的相机姿态通过可微的确定性求解器从这些几何组件中恢复。这种解耦方法将学习到的视觉到几何映射与最终的姿态计算分离,为网络引入了强几何先验。我们发现,显式解耦旋转和平移预测能带来可测量的性能提升。我们在7-Scenes和Cambridge Landmarks数据集上展示了最先进的性能,验证了对逆渲染过程建模是通向可泛化绝对姿态估计的更稳健路径。

英文摘要

Absolute Pose Regression (APR) has emerged as a compelling paradigm for visual localization. However, APR models typically operate as black boxes, directly regressing a 6-DoF pose from a query image, which can lead to memorizing training views rather than understanding 3D scene geometry. In this work, we propose a geometrically-grounded alternative. Inspired by novel view synthesis, which renders images from intermediate geometric representations, we reformulate APR as its inverse that regresses the underlying 3D representations directly from the image, and we name this paradigm Geometric Representation Regression (GRR). Our model explicitly predicts two disentangled geometric representations in the world coordinate system: (1) a raymap's directions to estimate camera rotation, and (2) a corresponding pointmap to estimate camera translation. The final camera pose is then recovered from these geometric components using a differentiable deterministic solver. This disentangled approach, which separates the learned visual-to-geometry mapping from the final pose calculation, introduces a strong geometric prior into the network. We find that the explicit decoupling of rotation and translation predictions measurably boosts performance. We demonstrate state-of-the-art performance on 7-Scenes and Cambridge Landmarks datasets, validating that modeling the inverse rendering process is a more robust path toward generalizable absolute pose estimation.
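To make the "pose from raymap and pointmap" step concrete, here is a minimal NumPy sketch of one closed-form way such a recovery could work. It is not the paper's solver, whose details the abstract does not give. It assumes known camera-frame ray directions (from intrinsics), predicted world-frame ray directions, and predicted world-frame 3D points; the function names `rotation_from_rays` and `center_from_points` are hypothetical. Rotation is fitted with the standard Kabsch (orthogonal Procrustes) alignment, and the camera center is the least-squares point closest to all rays passing through the predicted points.

```python
import numpy as np


def rotation_from_rays(dirs_cam, dirs_world):
    """Kabsch alignment: camera-to-world R minimizing sum ||R a_i - b_i||^2.

    dirs_cam, dirs_world: (N, 3) arrays of corresponding ray directions.
    """
    H = dirs_cam.T @ dirs_world                 # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Force det(R) = +1 so the result is a rotation, not a reflection.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    return Vt.T @ np.diag([1.0, 1.0, d]) @ U.T


def center_from_points(points_world, dirs_world):
    """Least-squares camera center from world points and world ray directions.

    Each point P_i should lie on the ray c + s * b_i, so we minimize
    sum ||(I - b_i b_i^T)(P_i - c)||^2 over c (a 3x3 linear system).
    """
    b = dirs_world / np.linalg.norm(dirs_world, axis=1, keepdims=True)
    M = np.eye(3)[None] - b[:, :, None] * b[:, None, :]   # (N, 3, 3) projectors
    A = M.sum(axis=0)
    rhs = np.einsum("nij,nj->i", M, points_world)
    return np.linalg.solve(A, rhs)
```

Both steps are differentiable linear-algebra operations, which is consistent with the abstract's description of a differentiable deterministic solver, though the actual formulation may differ.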

2511.12158 2026-05-20 cs.LG

Data-Efficient Self-Supervised Algorithms for Fine-Grained Birdsong Analysis

用于细粒度鸟类叫声分析的数据高效自监督算法

Houtan Ghaffari, Lukas Rauch, Paul Devos

发表机构 * Department of Information Technology, Ghent University(根特大学信息科技系) Intelligent Embedded Systems, University of Kassel(卡塞尔大学智能嵌入式系统)

AI总结 本文提出了一种数据高效的鸟类叫声标注器,通过三阶段训练流程在最小标注情况下开发可靠的鸟类叫声音节检测器,并在极端标注稀缺场景下验证了其有效性,同时评估了自监督嵌入在线性探测和无监督鸟类叫声分析中的潜力。

AI中文摘要

生物声学、神经科学和语言学研究经常以鸟类叫声为代理来获取多个领域的知识。这需要音频模型能够标注和解析鸟类叫声。开发此类模型需要精确的音节级标注训练数据。因此,能够降低标注成本的自动化方法需求迫切。本文提出了一种数据高效的鸟类叫声标注器,称为残差多层感知机循环神经网络(Residual Multi-Layer Perceptron Recurrent Neural Network)。随后,本文提出了一个三阶段训练流程,用于在极少标注的情况下开发可靠的鸟类叫声音节检测器。第一阶段是从未标注数据中进行自监督学习,探索了两种最成功的预训练范式,即掩码预测和在线聚类。第二阶段是结合有效数据增强的监督训练,为每个个体生成稳健的帧级音节检测器。第三阶段是半监督的后训练步骤,利用未标注数据优化每个个体的模型。该方法的有效性在标注极度稀缺的场景下于金丝雀(Canary)叫声上得到了验证。从信号处理的角度来看,金丝雀叫声呈现出对算法化时间序列标注最具挑战性的时频模式之一:快速的发声、短暂的音节间间隔、快速且宽带的频率扫描,以及需要细粒度特征才能区分的频谱相似音节。因此,一个成功的金丝雀音节检测算法也为其他鸟类建立了稳健的基线。这种方法上的泛化能力在十姊妹(Bengalese Finch)叫声标注的案例研究中得到了验证。最后,评估了自监督嵌入在线性探测和无监督鸟类叫声分析中的潜力。

英文摘要

Research in bioacoustics, neuroscience, and linguistics often uses birdsong as a proxy to acquire knowledge across diverse areas. This requires audio models to annotate and parse the birdsong. Developing such models requires precise, syllable-level annotated training data. Therefore, automated methods that reduce annotation costs are in demand. This work presents a data-efficient birdsong annotator called Residual Multi-Layer Perceptron Recurrent Neural Network. It then presents a three-stage training pipeline for developing reliable birdsong syllable detectors with minimal annotation. The first stage is self-supervised learning from unlabeled data. Two of the most successful pretraining paradigms are explored, namely, masked prediction and online clustering. The second stage is supervised training with effective data augmentation to produce a robust frame-level syllable detector for each individual. The third stage is a semi-supervised post-training step that refines each individual's model using unlabeled data. The effectiveness of this approach is demonstrated for the Canary song in extreme label-scarcity scenarios. From a signal-processing perspective, the Canary song exhibits one of the most challenging spectro-temporal patterns for algorithmic time-series annotation: rapid vocalizations, brief inter-syllabic intervals, fast and broadband frequency sweeps, and spectrally similar syllables that require fine-grained features to distinguish. Hence, a successful syllable detection algorithm for Canary also establishes a robust baseline for other birds. This methodological generalization is validated in a case study of Bengalese Finch song annotation. Finally, the potential of self-supervised embeddings is assessed for linear probing and unsupervised birdsong analysis.
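As a small illustration of what a frame-level syllable detector's output looks like downstream, the sketch below collapses per-frame class labels into syllable segments with onset and offset times. This is a generic post-processing step, not code from the paper; the function name `frames_to_segments`, the `hop_s` frame hop in seconds, and the convention that label `0` means silence are all assumptions made for the example.

```python
def frames_to_segments(labels, hop_s, silence=0):
    """Merge runs of identical frame labels into (label, onset_s, offset_s).

    labels: sequence of per-frame class ids; runs of `silence` are dropped.
    hop_s: assumed time step between consecutive frames, in seconds.
    """
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        # Close the current run at the end of input or when the label changes.
        if i == len(labels) or labels[i] != labels[start]:
            if labels[start] != silence:
                segments.append((labels[start], start * hop_s, i * hop_s))
            start = i
    return segments
```

For example, `frames_to_segments([0, 1, 1, 0, 2, 2, 2], hop_s=0.5)` yields one segment for syllable 1 and one for syllable 2, with the silent frames discarded.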

2511.11688 2026-05-20 cs.LG cs.CV

Hierarchical Schedule Optimization for Fast and Robust Diffusion Model Sampling

分层调度优化用于快速且稳健的扩散模型采样

Aihua Zhu, Rui Su, Qinglin Zhao, Li Feng, Meng Shen, Shibo He

发表机构 * School of Computer Science and Engineering, Macau University of Science and Technology(澳门科学技术大学计算机科学与工程学院) Beijing Institute of Technology(北京理工大学) Zhejiang University(浙江大学)

AI总结 本文提出了一种分层调度优化方法,通过改进的双层优化框架,在极低的函数评估次数下实现高效的扩散模型采样,显著提升了样本质量和计算效率。

Comments Preprint, accepted to AAAI 2026

AI中文摘要

扩散概率模型为生成保真度树立了新标准,但受限于缓慢的迭代采样过程。加速这一过程的一种强大的无训练策略是调度优化,其目标是在固定且较小的函数评估次数(NFE)下找到最优的时间步分布,以最大化样本质量。为此,成功的调度优化方法必须遵循四个核心原则:有效性、适应性、实用鲁棒性和计算效率。然而,现有范式难以同时满足这些原则,因此需要更先进的解决方案。为克服这些限制,我们提出了分层调度优化器(HSO),一种新颖且高效的双层优化框架。HSO通过在两个协同层级之间交替迭代,将全局最优调度的搜索转化为更易处理的问题:上层的全局搜索用于寻找最优初始化策略,下层的局部优化用于调度细化。这一过程由两项关键创新引导:中点误差代理(MEP),一种与求解器无关且数值稳定的局部优化目标;以及间距惩罚适应度(SPF)函数,它通过惩罚病态接近的时间步来确保实用鲁棒性。大量实验表明,HSO在极低NFE范围内为无训练采样树立了新的最先进水平(state-of-the-art)。例如,在NFE仅为5时,HSO使用Stable Diffusion v2.1在LAION-Aesthetics上取得了11.94的FID。关键的是,这一性能并非通过昂贵的重新训练获得,而只需不到8秒的一次性优化成本,为扩散模型加速提供了一种高效且实用的范式。

英文摘要

Diffusion probabilistic models have set a new standard for generative fidelity but are hindered by a slow iterative sampling process. A powerful training-free strategy to accelerate this process is Schedule Optimization, which aims to find an optimal distribution of timesteps for a fixed and small Number of Function Evaluations (NFE) to maximize sample quality. To this end, a successful schedule optimization method must adhere to four core principles: effectiveness, adaptivity, practical robustness, and computational efficiency. However, existing paradigms struggle to satisfy these principles simultaneously, motivating the need for a more advanced solution. To overcome these limitations, we propose the Hierarchical-Schedule-Optimizer (HSO), a novel and efficient bi-level optimization framework. HSO reframes the search for a globally optimal schedule into a more tractable problem by iteratively alternating between two synergistic levels: an upper-level global search for an optimal initialization strategy and a lower-level local optimization for schedule refinement. This process is guided by two key innovations: the Midpoint Error Proxy (MEP), a solver-agnostic and numerically stable objective for effective local optimization, and the Spacing-Penalized Fitness (SPF) function, which ensures practical robustness by penalizing pathologically close timesteps. Extensive experiments show that HSO sets a new state-of-the-art for training-free sampling in the extremely low-NFE regime. For instance, with an NFE of just 5, HSO achieves a remarkable FID of 11.94 on LAION-Aesthetics with Stable Diffusion v2.1. Crucially, this level of performance is attained not through costly retraining, but with a one-time optimization cost of less than 8 seconds, presenting a highly practical and efficient paradigm for diffusion model acceleration.
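The idea of penalizing pathologically close timesteps can be illustrated with a simple hinge penalty on consecutive gaps. The abstract does not give the actual form of the SPF function, so the sketch below is only an assumed stand-in: `spacing_penalty`, `penalized_fitness`, `min_gap`, and `weight` are hypothetical names and parameters, and `error` stands for whatever quality proxy a schedule search would minimize.

```python
import numpy as np


def spacing_penalty(timesteps, min_gap):
    """Sum of hinge violations: how far each consecutive gap falls below min_gap."""
    t = np.sort(np.asarray(timesteps, dtype=float))
    gaps = np.diff(t)
    return float(np.sum(np.maximum(0.0, min_gap - gaps)))


def penalized_fitness(error, timesteps, min_gap=0.02, weight=10.0):
    """Illustrative fitness (lower is better): proxy error plus spacing penalty."""
    return error + weight * spacing_penalty(timesteps, min_gap)
```

Under this assumed form, a schedule whose gaps all exceed `min_gap` is scored by its proxy error alone, while a schedule with two nearly coincident timesteps is pushed down the ranking even if its proxy error is low.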