arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.17019 2026-05-19 cs.CV

StreamingEffect: Real-Time Human-Centric Video Effect Generation

Yiren Song, Cheng Liu, Yuxin Jiang, Mike Zheng Shou

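AI summary: This paper proposes StreamingEffect, a framework that uses real-time video-to-video editing to add expressive effects while preserving human identity, background content, and temporal consistency. It also builds VideoEffect-130K, which the authors describe as, to their knowledge, the largest human-centric video effect dataset, and achieves real-time, high-quality 720p video editing on a single H200 GPU.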

Abstract

Streaming video effect generation is highly desirable for live human-centric applications such as e-commerce streaming, entertainment, and vlogging, yet remains difficult due to the lack of suitable data and deployable editing models. Unlike generic video generation, this task requires real-time video-to-video editing that adds expressive effects while preserving human identity, background content, and temporal consistency. Existing acceleration efforts mainly focus on text-to-video generation, while efficient distillation for video editing remains largely underexplored. In this paper, we present StreamingEffect, a real-time human-centric streaming video effect framework. We adopt an in-context video editing architecture and train a high-quality bidirectional teacher, then distill it into a causal autoregressive student and further reduce sampling from 50 steps to 4 steps. We also introduce keyframe control, allowing reference effect frames to be injected online and propagated through the stream for interactive editing. To address the data bottleneck, we construct VideoEffect-130K, to our knowledge the largest human-centric video effect dataset, containing 70K effect videos and 60K editing videos across 600 effect categories curated from short-video and editing platforms. Experiments show that our method enables real-time, high-quality 720p video editing on a single H200 GPU.

2605.17017 2026-05-19 cs.LG cs.AI

When Dynamics Shift, Robust Task Inference Wins: Offline Imitation Learning with Behavior Foundation Models Revisited

Rishabh Agrawal, Rahul Jain, Ashutosh Nayyar

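AI summary: This paper proposes a framework built on Behavior Foundation Models (BFMs) that casts task inference as a robust minimax optimization problem to handle dynamics shifts, adapting to worst-case dynamics perturbations without modifying pretraining. The method significantly outperforms standard BFM and robust offline imitation learning baselines under dynamics shifts.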

Abstract

Behavior Foundation Models (BFMs) enable scalable imitation learning (IL) by pretraining task-agnostic representations that can be rapidly adapted to new tasks. However, existing BFMs assume fixed environment dynamics, limiting their robustness under real-world shifts such as changes in friction, actuation, or sensor noise. We address this by formulating BFM task inference as a robust minimax optimization problem, enabling adaptation to worst-case dynamics perturbations without modifying pretraining. To the best of our knowledge, this is the first BFM-based framework that achieves robustness to dynamics shifts while relying solely on offline data from a single nominal environment. Our approach significantly outperforms standard BFM and robust offline IL baselines under dynamics shifts. These results demonstrate that robust policies can be obtained entirely at task-inference time, improving the practicality of BFMs in dynamic settings.
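
As a rough illustration of the minimax formulation described in this abstract, the sketch below picks, from a finite candidate set, the task embedding whose worst-case score over a set of dynamics perturbations is highest. The candidate set, the perturbation set, and the `toy_score` function are all hypothetical stand-ins; the abstract does not say how the paper parameterizes or solves the problem.

```python
from typing import Callable, Sequence, TypeVar

Z = TypeVar("Z")  # task embedding (hypothetical representation)
P = TypeVar("P")  # dynamics perturbation (hypothetical representation)


def robust_task_inference(
    candidates: Sequence[Z],
    perturbations: Sequence[P],
    score: Callable[[Z, P], float],
) -> Z:
    """Return argmax_z min_p score(z, p) over finite sets (illustration only)."""
    return max(candidates, key=lambda z: min(score(z, p) for p in perturbations))


def toy_score(z: float, friction_shift: float) -> float:
    # Assumed toy objective: the ideal embedding moves with the friction shift.
    return -abs(z - (1.0 + friction_shift))


best = robust_task_inference([0.0, 0.5, 1.0, 1.5, 2.0], [-0.5, 0.0, 0.5], toy_score)
```

In this toy setting the selected embedding is the one that hedges across all three shifts rather than the one tuned to any single shift.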

2605.17014 2026-05-19 cs.CV

RHINO: Reconstructing Human Interactions with Novel Objects from Monocular Videos

Lixin Xue, Chengwei Zheng, Georgios Paschalidis, Chen Guo, Manuel Kaufmann, Juan Zarate, Dimitrios Tzionas

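AI summary: This paper proposes RHINO, a three-step framework that reconstructs a 3D human, a novel object, and the static scene from monocular video, addressing the reconstruction of human-object interactions in video and improving performance on 4D reconstruction and novel-view synthesis.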

Comments CVPR 2026. Project page: https://lxxue.github.io/RHINO

Abstract

Reconstructing people, objects, and their interactions in 3D is a long-standing goal for intelligent systems. Often the input is RGB video from a moving camera, making the task ill-posed; depth is ambiguous, humans and objects occlude each other, and camera and object motion entangle to create apparent motion. Most prior work addresses humans or objects in isolation, ignoring their interplay, or assumes known 3D shapes or cameras, which is impractical for real-world applications. We develop RHINO (Reconstructing Human Interactions with Novel Objects), a three-step framework that recovers in 3D a human, novel (unseen) manipulated object, and static scene in a common world frame from a monocular RGB video. First, we leverage 3D-aware foundation models to obtain cues that stabilize Structure-from-Motion (SfM) even for low-texture regions; this yields a coarse shape and apparent motion of a manipulated object from foreground pixels, and a coarse scene shape and camera motion from background pixels. Second, we estimate a human in the camera frame via an off-the-shelf method, and subtract the camera motion from apparent motion to extract the object motion; this registers the human, object, and coarse scene shapes into a common world frame. Third, we refine shapes using a compositional neural field with per-component signed-distance fields. The latter further enables differentiable contact priors that attract surfaces while penalizing interpenetration, improving the physical plausibility of the final reconstruction. For evaluation, we capture a new dataset of handheld monocular videos synchronized with a volumetric 4D capture stage, providing ground-truth shape and camera motion. RHINO outperforms state-of-the-art baselines on novel-view synthesis and 4D reconstruction. Ablations show that each stage contributes substantially. Code and data are available at https://lxxue.github.io/RHINO.
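
The second step in this abstract, removing camera motion from apparent motion to get object motion, amounts to composing rigid transforms. The sketch below shows the idea in a planar (SE(2)) toy setting with hand-written 3x3 matrices; the paper works in 3D, and the function and variable names here are illustrative assumptions rather than anything taken from the RHINO code.

```python
import math


def se2(theta, tx, ty):
    """Homogeneous 3x3 matrix for a planar rotation by theta plus a translation."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, tx], [s, c, ty], [0.0, 0.0, 1.0]]


def compose(a, b):
    """Matrix product a @ b for 3x3 nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def invert(t):
    """Inverse of a rigid transform [R | p], which is [R^T | -R^T p]."""
    r00, r01, px = t[0]
    r10, r11, py = t[1]
    return [
        [r00, r10, -(r00 * px + r10 * py)],
        [r01, r11, -(r01 * px + r11 * py)],
        [0.0, 0.0, 1.0],
    ]


# Ground-truth poses in the world frame (toy values).
world_from_camera = se2(0.3, 1.0, 0.0)
world_from_object = se2(-0.2, 2.0, 0.5)

# What the camera observes: object pose in the camera frame (apparent motion).
camera_from_object = compose(invert(world_from_camera), world_from_object)

# Factoring the camera pose back in recovers the object pose in the world frame.
recovered = compose(world_from_camera, camera_from_object)
```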

2605.17007 2026-05-19 cs.CL

HalluScore: Large Language Model Hallucination Question Answering Benchmark

Aisha Alansari, Hamzah Luqman

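AI summary: This study proposes HalluScore, an Arabic question answering benchmark for evaluating LLM hallucination behavior across levels of reasoning difficulty, knowledge domains, historical timelines, and cultural scenarios. It uses 827 carefully curated questions for evaluating, detecting, and mitigating hallucination, and provides high-quality human annotations that identify the hallucination behavior of different models.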

Abstract

Large language models (LLMs) have achieved remarkable progress in natural language generation, but remain susceptible to hallucination. In response to growing concerns about hallucinations, several benchmarks have been developed, primarily in English and Chinese. However, Arabic remains underrepresented, with few benchmarks for LLM hallucination due to scarce annotated resources and the language's morphological complexity. Consequently, existing benchmarks do not adequately reflect the linguistic, cultural, and reasoning characteristics of Arabic. To address this gap, we introduce HalluScore, a structured Arabic question answering benchmark designed to evaluate hallucination behavior in LLMs across different levels of reasoning difficulty, various knowledge domains, historical timelines, and culturally grounded Arabic scenarios. It contains 827 carefully curated questions for evaluating, detecting, and mitigating hallucination in LLMs. The dataset was constructed through a structured pipeline involving quality assurance, filtering for clarity and factual validity, and model-driven selection to retain questions that consistently trigger hallucinations. Each question is linked to verified ground-truth evidence, answer explanations, and multi-label annotations. Using the HalluScore benchmark, we conduct a comprehensive empirical analysis of hallucination patterns across 17 Arabic, multilingual, and reasoning LLMs. Moreover, we provide high-quality human annotations identifying hallucinated, non-hallucinated, and partially hallucinated responses of all evaluated LLMs. These results suggest that hallucination in Arabic LLMs extends beyond factual inaccuracies, encompassing challenges related to cultural understanding, linguistic reasoning, and logical consistency. We release HalluScore to support future research on improving the reliability and cultural competence of LLMs in Arabic.

2605.17000 2026-05-19 cs.LG cs.AI

BoLT: A Benchmark to Democratize Black-box Optimization Research for Expensive LLM Tasks

Ruth Wan Theng Chew, Zhiliang Chen, Apivich Hemachandra, Bryan Kian Hsiang Low

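AI summary: This paper proposes the BoLT benchmark, which provides realistic LLM optimization problems to support research on and evaluation of black-box optimization methods for expensive LLM tasks.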

Abstract

Optimization of LLM training and inference configurations, such as hyperparameters, data mixtures, and prompts, is critical to performance, but it is often approached heuristically in practice, leading to potentially suboptimal outcomes. By framing them as noisy, expensive, and derivative-free optimization problems, Bayesian optimization (BO) and other black-box optimization (BBO) methods offer a promising yet underexplored direction for principled, sample-efficient methods. However, LLM training and inference costs are prohibitively high for most of the BBO research community, and new methods are often only evaluated on synthetic test functions and small-scale datasets that fail to capture the challenges of modern LLM optimization problems. This impedes the development of BBO methods and makes it difficult to assess their effectiveness on modern LLM tasks. We introduce BoLT, the first LLM-centric benchmark that democratizes LLM research for the BBO community. BoLT is released at https://github.com/chewwt/bolt. BoLT covers broad and well-motivated LLM optimization problems, involving multi-fidelity, multi-objective, heteroscedastic noise, and high-dimensional search spaces. Each problem in BoLT is grounded in real experimental data and made fully reproducible and accessible through lightweight surrogate models fitted to the results of thousands of real LLM experiments. We benchmark BoLT against an extensive range of BO and BBO methods, showing that selected BO methods consistently outperform others across tasks and highlighting gaps in existing BBO methods on LLM tasks, underscoring the need to modernize benchmarks for the BBO community.

2605.16999 2026-05-19 cs.LG

Ranking-Aware Calibration for Reliable Multimodal Reinforcement Learning

Peng Cui, Boyao Yang, Jun Zhu

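AI summary: This paper proposes Ranking-Aware Calibration, which uses the comparison signals that group-based reinforcement learning naturally produces to improve the calibration of multimodal reinforcement learning, raising both task accuracy and calibration quality.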

Abstract

Reinforcement learning post-training has substantially improved the reasoning accuracy of vision-language models, yet the resulting policies remain poorly calibrated. Terminal correctness rewards provide no gradient that penalizes confident errors more than uncertain ones and no signal that ties confidence to the quality of visual evidence, a gap that becomes especially severe under corrupted or ambiguous inputs where models continue to report high confidence on incorrect answers. We introduce Ranking-Aware Calibration (RAC), a training-time framework that supervises confidence using two comparison signals that group-based RL already produces at no additional labeling cost. The ranking-aware group loss enforces that a better rollout receives higher confidence than a worse one within the same prompt. The clean-corrupted pairwise loss enforces that confidence attenuates as visual evidence degrades. Because the ranking signal forces the policy to distinguish between correct and incorrect reasoning paths, it also reinforces task accuracy beyond what correctness rewards alone produce. Both losses require no external confidence annotations and integrate naturally with group-based RL post-training. We instantiate RAC on Qwen2.5-VL and InternVL-3.5 backbones and evaluate on six multimodal reasoning benchmarks under clean and corrupted inputs. Empirical results show that the ranking-aware loss substantially improves task accuracy by teaching the policy to discriminate between better and worse reasoning, while the pairwise corruption loss reduces calibration error under degraded inputs. Their combination achieves the best calibration across all tested backbones while improving accuracy in the majority of settings.
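
One plausible way to express the ranking constraint described in this abstract (a better rollout should be more confident than a worse one from the same prompt) is a pairwise hinge penalty. The sketch below is an assumption for illustration: the abstract does not give the loss in closed form, and the `margin` parameter and the hinge shape are choices made here, not the paper's.

```python
from typing import Sequence


def ranking_group_loss(
    rewards: Sequence[float],
    confidences: Sequence[float],
    margin: float = 0.1,
) -> float:
    """Mean hinge penalty over ordered pairs (i, j) with rewards[i] > rewards[j].

    A pair is penalized when rollout i is not at least `margin` more
    confident than rollout j. Returns 0.0 if no rollout beats another.
    """
    penalties = [
        max(0.0, margin - (confidences[i] - confidences[j]))
        for i in range(len(rewards))
        for j in range(len(rewards))
        if rewards[i] > rewards[j]
    ]
    return sum(penalties) / len(penalties) if penalties else 0.0


# Well-ordered confidences incur no penalty; inverted ones do.
ordered = ranking_group_loss([1.0, 0.0], [0.9, 0.2])
inverted = ranking_group_loss([1.0, 0.0], [0.2, 0.9])
```

A clean-corrupted pairwise term could be written the same way, with the clean input playing the role of the better rollout, though that too is only a guess at the form.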

2605.16996 2026-05-19 cs.CL

Evaluation Drift in LLM Personality Induction: Are We Moving the Goalpost?

Prateek Rajput, Yewei Song, Iyiola E. Olatunji, Jacques Klein, Tegawendé F. Bissyandé

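AI summary: This paper studies the stability of human-like personality induced in large language models. It induces personality by fine-tuning on long-form essays and finds that although fine-tuning reduces the variance of questionnaire scores, accuracy on the full five-dimensional profile remains near chance, indicating that unguided essays lack the cues needed for faithful personality expression.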

Comments 14 pages, 8 main pages, 5 figures, 4 main page figures

Abstract

Can large language models reliably express a human-like personality, or are they merely mimicking surface cues without a stable underlying profile? To investigate this, we induce personality in LLMs by fine-tuning them on long-form essays, each associated with a target Big Five personality profile. We then evaluate the stability and fidelity of the induced personality using the IPIP-NEO questionnaire. Specifically, we ask: (i) does post-training (SFT, DPO, ORPO) stabilize questionnaire scores under prompt rephrasings, and (ii) can it induce target Big Five profiles from unguided essays? Our results demonstrate that fine-tuning consistently reduces variance in questionnaire responses across five models, directly mitigating the evaluation fragility reported in pre-trained models. However, this newfound stability reveals a more fundamental limitation: accuracy on the full five-dimensional profile remains near chance, even when single-trait scores improve. This indicates that unguided essays lack the cues needed for faithful personality expression. We therefore argue for scenario-grounded datasets or interactive elicitation that accumulates test-aligned evidence over time.

2605.16991 2026-05-19 cs.CL cs.AI

Response-free item difficulty modelling for multiple-choice items with fine-tuned transformers: Component-wise representation and multi-task learning

Jan Netík, Patrícia Martinková

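AI summary: This paper proposes a response-free item difficulty modelling approach that fine-tunes transformers end-to-end to predict the difficulty of reading-comprehension multiple-choice items, and evaluates two extensions: component-wise representation and multi-task learning.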

Abstract

Response-free item difficulty modelling promises to reduce reliance on response-based calibration but is intrinsically difficult on reading-comprehension multiple-choice items, where difficulty depends on inferential demands across wording components. Whereas most existing approaches extract item-text features and pass them to a separate statistical or machine-learning model, we fine-tune transformer encoders end-to-end on the item wording, eliminating the manual feature engineering and preprocessing that discards information. Moreover, two extensions to this joint-encoding approach are proposed: a component-wise variant that encodes wording components separately through a shared encoder, and a multi-task variant that retains joint encoding and adds an auxiliary multiple-choice question answering objective on the shared encoder. Each method is evaluated under a Monte Carlo subsampling design at three training-set sizes on a held-out test set. We find that joint encoding is a viable end-to-end alternative to feature-engineering pipelines; while the component-wise variant shows no detectable benefit, consistent with self-attention already harvesting the cross-component signal, the multi-task variant delivers significant paired improvements in the smallest-sample regime. Transformer fine-tuning, especially if regularised by a suitable auxiliary task, recovers a substantial share of the wording-derivable signal at training-set sizes typical of applied measurement. The framework provides a customisable interface for psychometrically motivated extensions.

2605.16990 2026-05-19 cs.CV

DreamEdit3D: Personalization of Multi-View Diffusion Models for 3D Editing

Jinxin Ai, Matthias Nießner, Ziya Erkoç

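AI summary: This paper proposes DreamEdit3D, which personalizes multi-view diffusion models for 3D editing driven by natural language. It extracts semantic components and learns token embeddings for each component to achieve multi-view-consistent, high-quality 3D editing.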

Comments 24 pages, 5 figures

Abstract

While 2D diffusion models have achieved remarkable success in identity-preserving personalization, extending this capability to 3D assets remains a significant challenge due to the complexities of multi-view consistency and spatial control. Inspired by these 2D advancements, we present a novel personalization method for text-guided 3D editing that enables compositional, object-level control through natural language. Given a 3D input, we render orthogonal views and extract object-level segmentation masks to isolate semantic components. We then learn distinct token embeddings for each component through a tailored two-phase optimization strategy: multi-view textual inversion with attention alignment, followed by full fine-tuning of the multi-view diffusion model. During inference, these disentangled tokens seamlessly compose with editing prompts to generate multi-view consistent images, which are subsequently lifted into high-fidelity textured 3D meshes. Extensive evaluations across diverse editing scenarios demonstrate that our method successfully transfers the flexibility of 2D personalization to 3D, achieving state-of-the-art edit faithfulness and identity preservation compared to existing baselines.

2605.16989 2026-05-19 cs.LG

Decision-Aware Proximal Bridge Learning for Optimal Treatment Selection

Tomàs Garriga, Alejandro Almodóvar, Axel Brando, Gerard Sanz, Eduard Serrahima de Cambra, Juan Parras

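AI summary: This paper proposes a decision-aware proximal bridge learning method that emphasizes decision-relevant treatment regions while retaining global stabilization, addressing the limited attention to treatment selection and optimal decision-making in proximal causal inference.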

Abstract

Individualized treatment selection with continuous actions requires accurate causal response estimation in decision-relevant regions, rather than uniformly over the entire action space. Estimating a global causal response surface and then choosing the treatment that maximizes it can therefore be suboptimal, since standard estimation objectives allocate modeling effort according to the observed treatment distribution rather than the regions that determine the optimal decision. While decision-aware approaches have been studied in unconfounded settings, this problem remains underexplored in proximal causal inference, where proxy variables and bridge functions enable identification under suitable assumptions even in the presence of hidden confounding. Despite recent progress, proximal methods have primarily focused on treatment-effect and potential-outcome estimation rather than treatment selection and optimal decision-making. To bridge this gap, we introduce a policy-targeted weighted bridge loss that emphasizes decision-relevant treatment regions while retaining global stabilization. We prove a regret bound showing that the proposed weighted bridge loss controls treatment-selection regret through a weighted ill-posedness constant. We instantiate the framework in decision-aware variants of several proximal bridge solvers, yielding practical algorithms that alternate between weighted bridge estimation, response-surface projection, policy update, and weight refinement. Empirically, we find that decision-aware weighting reduces regret across several bridge solvers, suggesting improved treatment selection in proximal settings.

2605.16986 2026-05-19 cs.CL cs.AI

Skills on the Fly: Test-Time Adaptive Skill Synthesis for LLM Agents

Jingxing Wang, Chenyu Zhou, Zhihui Fu, Jun Wang, Weiwen Liu, Weinan Zhang, Jianghao Lin

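AI summary: This paper proposes SkillTTA, a test-time adaptive skill synthesis method that retrieves a small set of training trajectories relevant to the current task and synthesizes them into a task-specific textual skill, improving LLM agent performance on SpreadsheetBench, ALFWorld, and BigCodeBench.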

Comments 10 pages, 4 figures

Abstract

LLM agents benefit from reusable skills, yet test-time tasks often require guidance more specific than a static skill library can provide. We propose SkillTTA, a Test-Time Adaptive Skill Synthesis method that retrieves a small set of training trajectories relevant to the current task and synthesizes them into a temporary, task-specific textual skill. The solver model is kept fixed, so adaptation happens entirely through generated context rather than parameter updates. We evaluate the method on SpreadsheetBench, ALFWorld, and BigCodeBench. Compared with static trajectory-to-skill synthesis using GPT-5.5, task-specific skills improve SpreadsheetBench Pass@1 from 0.397 to 0.505 and BigCodeBench Pass@1 from 0.517 to 0.651. On ALFWorld, the method matches a heavier memory-learning baseline within four points of success rate while producing the shortest successful trajectories among reported methods. Ablations on SpreadsheetBench further show that synthesized skills outperform raw trajectory prompting, that top-k retrieval should stay small, and that failed trajectories are especially useful because they expose recurring evaluator-facing mistakes.
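
The retrieval half of the pipeline described in this abstract can be sketched as a small top-k lookup. The example below is a toy under stated assumptions: it ranks stored trajectories by Jaccard overlap of task-description tokens, which is a stand-in chosen here for simplicity, and it leaves out the LLM-based skill synthesis step entirely. The `task` field and the sample records are hypothetical.

```python
from typing import Dict, List


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two task descriptions."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve_top_k(task: str, trajectories: List[Dict[str, str]], k: int = 2) -> List[Dict[str, str]]:
    """Return the k stored trajectories whose task text best matches `task`."""
    ranked = sorted(trajectories, key=lambda t: jaccard(task, t["task"]), reverse=True)
    return ranked[:k]


# Hypothetical trajectory store; "outcome" keeps failures, which the
# abstract reports as especially useful.
store = [
    {"task": "sum a column in a spreadsheet", "outcome": "failed"},
    {"task": "heat an egg in the microwave", "outcome": "succeeded"},
    {"task": "sort spreadsheet rows by a column", "outcome": "succeeded"},
]
top = retrieve_top_k("sum column B in the spreadsheet", store, k=1)
```

Keeping `k` small mirrors the ablation finding quoted above; the retrieved records would then be passed to a synthesizer model to produce the temporary skill text.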

2605.16981 2026-05-19 cs.CV

Rethinking the State Update Gate for Long-Sequence Recurrent 3D Reconstruction

Kejun Ren, Lei Jin, Tianxin Huang, Lianming Xu, Li Wang

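AI summary: This paper addresses the structural bottleneck of the state update gate in long-sequence recurrent 3D reconstruction. It proposes a frame-level gating mechanism, deriving in closed form a scalar frame-level gate that needs no parameters, no training, and no extra forward pass, which markedly improves accuracy on multiple benchmarks while keeping memory usage constant.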

Comments 17 pages, 7 figures

Abstract

Streaming 3D reconstruction under a strict constant-memory budget hinges on how the recurrent state is updated as the stream evolves. We profile TTT3R-style per-token gates across five benchmarks and discover a structural bottleneck: the gate is intrinsically bounded in magnitude (median 0.31; never exceeding 0.6) and nearly frame-invariant, yielding an effective memory horizon of only ~3 frames per state token, which serves as the structural origin of long-sequence drift. We trace this to a missing axis: existing inference-time methods modulate updates only at the per-token, intra-frame level, while the orthogonal frame-level question of how strongly each frame should contribute to the state has been treated as content-independent. We close this gap with a scalar frame-level gate α_t ∈ (0, 1] derived in closed form from frame-to-frame changes of internal features, a continuous relaxation of classical Simultaneous Localization and Mapping (SLAM) keyframe selection that requires no parameters, no training, and no extra forward pass. Across six benchmarks spanning camera pose, video depth, and 3D reconstruction at sequence lengths up to 4,541 frames, our gate cuts ATE by 51% on long TUM-RGBD pose sequences, reduces AbsRel by 12.8% on Bonn video depth, and on KITTI long-sequence pose estimation surpasses both LongStream and Keyframe-VO, while retaining strictly constant memory at zero training cost.
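
The roughly 3-frame memory horizon quoted in this abstract can be sanity-checked with a little arithmetic, under the assumption (not stated in the abstract) that a gate g blends old and new state as s_t = (1 - g) * s_{t-1} + g * u_t. Under that exponential-moving-average reading, an old frame's weight shrinks by (1 - g) per update and the characteristic horizon is about 1 / g. The function names below are illustrative.

```python
def memory_horizon(gate: float) -> float:
    """Characteristic horizon (in frames) of an EMA-style update with this gate."""
    if not 0.0 < gate <= 1.0:
        raise ValueError("gate must lie in (0, 1]")
    return 1.0 / gate


def retained_weight(gate: float, frames_later: int) -> float:
    """Fraction of a frame's contribution left after `frames_later` updates."""
    return (1.0 - gate) ** frames_later


horizon_at_median = memory_horizon(0.31)   # about 3.2 frames
left_after_three = retained_weight(0.31, 3)  # about 0.33 of the original weight
```

With the reported median gate of 0.31, this reading gives a horizon of about 3.2 frames, which lines up with the abstract's figure; the paper's actual definition of the horizon may differ.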

2605.16980 2026-05-19 cs.CV

Statistical Hand Shape Modeling from Clinical CT Scans Using Deep Learning and Implicit Skinning

Gokce Guven, Hasan Fehmi Ates, Deniz Karasahin, Kaan Erdogan

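AI summary: This paper proposes an AI-assisted reconstruction pipeline that uses deep learning and implicit skinning to segment and analyze hand anatomy in clinical CT scans, with statistical shape modeling aimed at applications in biomechanics, ergonomics, and medical diagnostics.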

Abstract

Accurate segmentation and statistical shape modeling of hand anatomy have significant implications for medical diagnostics, ergonomics, and biomechanics. This study proposes an AI-assisted reconstruction pipeline for segmenting and analyzing hand anatomy from 1,271 elbow-to-hand (e2h-CT) computed tomography scans. A Pix2Pix-based conditional generative adversarial network is first employed to remove plaster cast and background artifacts from CT volumes. The cleaned scans are then processed in 3D Slicer to extract skin and bone masks, which are converted into closed-surface mesh models. Segmented bone meshes are used to construct skeletal representations, enabling implicit skinning to align all hand models into a standardized anatomical configuration. Subsequently, non-rigid registration is performed on the hand skin surfaces using the Geodesic Based Coherent Point Drift++ (GBCPD++) algorithm to establish point-wise correspondence across subjects. Principal Component Analysis (PCA) is then applied to the registered models to quantify anatomical shape variability. The Pix2Pix preprocessing stage achieved a Dice coefficient of 0.9856 and an IoU of 0.9720 on the held-out test set. Statistical modeling was performed on a subset of 90 scans in which the fingers were fully visible and anatomically separated. The resulting statistical shape distributions demonstrate strong agreement with the U.S. Army Anthropometric Survey (ANSUR II), supporting the anatomical validity of the reconstructed models. The proposed methodology demonstrates significant potential for advancing biomechanical modeling, ergonomic optimization, prosthetic design, and precision medical diagnostics.
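
For readers unfamiliar with the two segmentation scores reported in this abstract, the sketch below computes the Dice coefficient and IoU for a pair of flattened binary masks using their standard definitions. It is a generic illustration, not the authors' evaluation code, and the convention for two empty masks is an assumption made here.

```python
from typing import Sequence, Tuple


def dice_and_iou(pred: Sequence[int], target: Sequence[int]) -> Tuple[float, float]:
    """Dice = 2|A∩B| / (|A| + |B|) and IoU = |A∩B| / |A∪B| for binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_size = sum(1 for p in pred if p)
    target_size = sum(1 for t in target if t)
    union = pred_size + target_size - intersection
    if union == 0:
        return 1.0, 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / (pred_size + target_size), intersection / union


dice, iou = dice_and_iou([1, 1, 0, 0], [1, 0, 0, 0])
```

For a single mask pair the two scores are tied by Dice = 2 * IoU / (1 + IoU). Applied to the reported IoU of 0.9720 this gives about 0.9858, close to the reported Dice of 0.9856; a small gap of this kind is plausible if the scores are averaged over scans.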

2605.16979 2026-05-19 cs.RO

NORM-Nav: Zero-Shot Mobile Robot Navigation with Natural Language Behavioral Constraints

Dongjie Huo, Junhui Wang, Chao Gao, Yan Qiao, Dong Zhang, Guyue Zhou

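AI summary: This paper proposes NORM-Nav, a framework that integrates natural language behavioral constraints into costmap-based planning to improve the social appropriateness of mobile robot navigation in human environments. Experiments indicate that it outperforms baselines in task success rate and in closeness of trajectories to human references.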

Abstract

Mobile robots operating in human-centered environments must generate not only collision-free paths but also trajectories that follow local behavioral conventions. Conventional costmap-based navigation emphasizes geometric feasibility and often overlooks such requirements, which can result in socially inappropriate behaviors. This paper presents NORM-Nav, a zero-shot framework that integrates natural language behavioral constraints into costmap-based planning. An LLM parses each instruction into structured constraints and grounds them using real-time vision-LiDAR perception. These constraints are encoded as multi-layer costmaps that represent geometric, semantic, directional, and velocity cues and are directly compatible with standard grid-based planners. Simulation and real-world experiments indicate that NORM-Nav improves task success rates and produces trajectories closer to human references than representative baselines. The project website is available at https://ei-nav.github.io/NORM-Nav.

2605.16975 2026-05-19 cs.LG cs.AI

Extending Pretrained 10-Second ECG Foundation Models to Longer Horizons

扩展预训练的10秒ECG基础模型以适应更长的时域

Wei Tang, Jinpei Han, Kangning Cui, Mattia Carletti, Fredrik K. Gustafsson, Shreyank N Gowda, Patitapaban Palo, Anshul Thakur, Lei Clifton, Jean-michel Morel, Raymond H. Chan, David A. Clifton, Xiao Gu

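AI summary: This paper proposes a parameter-efficient framework that extends pretrained 10-second ECG foundation models to longer and variable-length ECG signals without retraining the backbone, addressing structural incompatibility and semantic challenges. Experiments show that it outperforms sliding-window and pooling baselines on multiple long-horizon ECG tasks.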

AI中文摘要

预训练在典型诊断10秒ECG片段上的ECG基础模型已在多种临床应用中展示了强大的迁移能力。然而,许多实际应用产生的记录通常更长,且在推理过程中持续时间各异。这些10秒模型缺乏整合时间信息的内置方法。将其扩展到更长的时域引入了两个挑战:由于输入长度差异导致的结构不兼容性,以及限制有意义时间聚合的语义挑战。我们提出了一种参数高效的框架,通过冻结预训练的10秒模型,引入一个轻量级插件模块,以两种互补的方式扩展模型:(i) 结构兼容的长序列处理,(ii) 语义指导的时间建模。在多个长时域ECG任务、数据集和基础模型背骨上的实验表明,我们的方法能够从预训练的快照模型中实现稳健的长时域扩展,一致优于滑动窗口和池化基线方法,具有强大的参数效率。

英文摘要

Electrocardiogram (ECG) foundation models pretrained on typical diagnostic 10-second ECG segments have demonstrated strong transferability across a range of clinical applications. However, many real-world applications produce recordings that are typically longer and that vary in duration at inference time. These 10-second models have no built-in way to combine information across time. Extending them to longer horizons introduces two challenges: structural incompatibilities arising from input-length disparities, and semantic challenges that limit meaningful temporal aggregation. We propose a parameter-efficient framework that extends pretrained ECG foundation models to longer and variable-length ECGs without retraining the backbone. Guided by a frozen pretrained 10-second model, we introduce a lightweight plug-in module that extends the model in two complementary ways: (i) structurally compatible long-sequence processing and (ii) semantically informed temporal modeling. Experiments on multiple long-horizon ECG tasks, datasets, and foundation model backbones demonstrate that our method enables robust long-horizon extension from pretrained snapshot models, consistently outperforming sliding-window and pooling-based baselines with strong parameter efficiency.
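
The sliding-window and pooling baselines mentioned above can be sketched as follows: cut the long recording into 10-second windows, embed each window with the frozen model, and average the embeddings. This illustrates the baseline, not the proposed plug-in module; `toy_encoder` is a hypothetical stand-in for a pretrained 10-second model, and the sampling rate and signal are toy values.

```python
def sliding_windows(signal, window, stride):
    """Split a 1D signal into fixed-length windows; a trailing remainder is dropped."""
    return [signal[i:i + window] for i in range(0, len(signal) - window + 1, stride)]

def mean_pool(embeddings):
    """Average a list of equal-length embedding vectors element-wise."""
    dim = len(embeddings[0])
    return [sum(e[d] for e in embeddings) / len(embeddings) for d in range(dim)]

def toy_encoder(segment):
    # Hypothetical stand-in for a frozen 10-second model: [mean, peak amplitude].
    return [sum(segment) / len(segment), max(abs(x) for x in segment)]

fs = 4  # toy sampling rate in Hz (real ECGs use far higher rates)
signal = [float(i % 8) for i in range(fs * 30)]  # 30-second toy recording
windows = sliding_windows(signal, window=fs * 10, stride=fs * 10)
record_embedding = mean_pool([toy_encoder(w) for w in windows])
```

Mean pooling discards the order of the windows, which is one concrete form of the limited temporal aggregation the abstract describes.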

2605.16973 2026-05-19 cs.CV cs.LG

SHED: Style-Homogenized Embedding Alignment for Domain Generalization

SHED: 风格均质化嵌入对齐用于领域泛化

Kai Gan, Tong Wei

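AI summary: This paper proposes SHED, which aligns style-homogenized embeddings to address information asymmetry in domain generalization. Experiments show state-of-the-art performance on multiple benchmarks.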

AI中文摘要

领域泛化旨在通过嵌入分布偏移增强模型对未见领域的鲁棒性。尽管像CLIP这样的大规模视觉-语言模型表现出色,但其直接的图像-文本嵌入对齐却受到固有信息不对称的限制:图像编码了类别语义和领域特定的风格,而文本提示主要传达基本的类别线索。这种不对称性阻碍了在现实场景中对新领域的泛化。为此,我们提出了SHED,一种基于CLIP的新方法,通过对齐风格均质化的嵌入而不是CLIP编码器的原始表示。在训练过程中,SHED从图像嵌入(按源领域计算)和文本嵌入(在多样化的提示模板下平均并去除全局质心)中移除领域特定的风格质心。在推理过程中,考虑到目标领域信息的缺乏,SHED将多样化的文本领域质心投影到视觉空间,并通过成员加权聚合预测。在五个基准测试上的广泛实验表明,SHED在多个基准测试中取得了最先进的性能,显著优于先前方法(例如,在DomainNet上比标准微调高出+4.0%)

英文摘要

Domain generalization aims to enhance model robustness against unseen domains with embedding distribution shifts. While large-scale vision-language models like CLIP exhibit strong generalization, their direct image-text embedding alignment suffers from inherent information asymmetry: images encode both class semantics and domain-specific styles, whereas text prompts primarily convey basic class cues. This asymmetry hinders generalization to novel domains in realistic scenarios. To address this, we propose Style-Homogenized Embedding alignment for Domain-generalization (SHED), a novel CLIP-based method that aligns style-homogenized embeddings instead of raw representations from the CLIP encoders. During training, SHED removes domain-specific style centroids from both image embeddings (computed per source domain) and text embeddings (averaged across diverse prompt templates and stripped of a global centroid). For inference, considering the lack of target domain information, SHED projects diverse textual domain centroids into the visual space and aggregates predictions via membership weighting. Extensive experiments on five benchmarks show that SHED achieves state-of-the-art performance, outperforming prior methods significantly (e.g., +4.0% on DomainNet vs. standard fine-tuning).
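
The training-time step of removing a per-domain style centroid from image embeddings can be sketched as subtracting each source domain's mean embedding. This is a simplified reading of the abstract: the helper names and toy vectors are hypothetical, and normalization, the text-side processing, and the inference-time projection are left out.

```python
def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def remove_domain_centroids(embeddings, domains):
    """Subtract from each embedding the centroid of its own source domain."""
    by_domain = {}
    for emb, dom in zip(embeddings, domains):
        by_domain.setdefault(dom, []).append(emb)
    centroids = {dom: centroid(vs) for dom, vs in by_domain.items()}
    return [
        [x - c for x, c in zip(emb, centroids[dom])]
        for emb, dom in zip(embeddings, domains)
    ]

# Toy 2D embeddings: the two domains sit in different regions of the space.
embeddings = [[1.0, 5.0], [3.0, 5.0], [10.0, 0.0], [14.0, 0.0]]
domains = ["photo", "photo", "sketch", "sketch"]
homogenized = remove_domain_centroids(embeddings, domains)
```

After the subtraction both domains are centered at the origin, so what remains is the within-domain variation rather than the domain offset.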

2605.16969 2026-05-19 cs.AI

Brain Vascular Age Prediction Using Cerebral Blood Flow Velocity and Machine Learning Algorithms

基于脑血流速度和机器学习算法的脑血管年龄预测

Anni Zhao, Alex Bateh, Tyler Baldridge, Sandra Billinger, Xiao Hu

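AI summary: This study uses cerebral blood flow velocity data and machine learning algorithms to predict vascular age in patients with different brain diseases, assesses accelerated aging, and examines the relevance of TCD-derived features for evaluating accelerated cerebrovascular aging.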

AI中文摘要

定义血管年龄为生理功能的范畴已成为广泛研究中分类和跟踪年龄的关键问题。超声多普勒(TCD)是一种测量人类大脑主要动脉血流速度的方法。本研究旨在利用从TCD提取的特征来估计年龄并评估患有各种脑疾病个体的加速老化。我们预测患有各种脑疾病的个体在使用不同回归模型训练的健康个体上会表现出加速的脑血管老化。使用形态学分析和颅内压聚类(MOCAIP)算法分析了168名健康受试者和277名双侧大脑中动脉TCD记录的疾病受试者。MOCAIP生成的特征和心率变异性特征被用作回归模型的输入特征以预测脑血管年龄。对66名急性中风患者、27名中风后患者、26名阿尔茨海默病患者、23名轻度认知障碍患者和135名正常受试者进行了测试,以评估加速的脑血管年龄。训练好的模型在平均上预测健康受试者的脑血管年龄比实际年龄高3.69年。不同疾病状况的受试者表现出不同程度的年龄加速。健康和疾病受试者之间的表现差异表明,使用TCD生成的特征可能在评估加速的脑血管老化时是相关的。此外,不平衡的数据集已被观察到会影响基于机器学习的脑年龄预测模型的性能。

英文摘要

Defining vascular age in terms of physiological function has become one focal point of the extensive studies to categorize and track chronological age. Transcranial Doppler (TCD) is a method by which cerebral blood flow velocity is measured along the major arteries feeding the human brain. This study aims to use features extracted from TCD to estimate chronological age and assess accelerated aging in subjects with various brain diseases. We hypothesize that subjects with various brain diseases will present with accelerated cerebrovascular aging when tested on regression models trained on healthy subjects. Bilateral TCD recordings of the middle cerebral artery from 168 healthy subjects and 277 diseased subjects were analyzed using the Morphological Analysis and Clustering of Intracranial Pressure (MOCAIP) algorithm. MOCAIP-generated features and heart rate variability features were used as input features for regression models to predict the brain vascular age. The machine learning model was tested on 66 subjects with acute stroke, 27 post-stroke subjects, 26 subjects with Alzheimer's disease, 23 subjects with mild cognitive impairment, and 135 established subjects to assess accelerated cerebrovascular aging. The trained model, on average, predicted healthy subjects' cerebrovascular age to be 3.69 years above their chronological age. Subjects with different disease conditions exhibited varying levels of age acceleration. The differences between the results for healthy and diseased subjects suggest that features generated using TCD may be relevant when evaluating accelerated cerebrovascular aging. Moreover, imbalanced datasets have been observed to affect the performance of machine-learning-based brain age prediction models.
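
Age acceleration in studies of this kind is usually summarized as the gap between predicted and chronological age, compared against the gap seen in healthy subjects. A minimal sketch with made-up numbers (none of the values below come from the study, and `mean_age_gap` is an illustrative helper):

```python
from statistics import mean

def mean_age_gap(predicted, chronological):
    """Average of (predicted age - chronological age) over a group, in years."""
    return mean(p - c for p, c in zip(predicted, chronological))

# Hypothetical predictions for two small groups.
healthy_gap = mean_age_gap([62.0, 57.0, 70.0], [60.0, 54.0, 66.0])
patient_gap = mean_age_gap([75.0, 68.0], [66.0, 61.0])

# Acceleration relative to the healthy baseline, which removes the model's own bias.
excess_gap = patient_gap - healthy_gap
```

Subtracting the healthy-group gap matters because, as the abstract reports, the model overestimates age even for healthy subjects.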

2605.16967 2026-05-19 cs.CV

Expandable, Compressible, Mineable: Open-World Thermal Image Restoration

可扩展、可压缩、可挖掘:面向开放世界热成像修复的ECMRNet

Pu Li, Huafeng Li, Yafei Zhang, Wen Wang, Neng Dong, Jie Wen

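AI summary: This paper proposes ECMRNet, which addresses open-world thermal image restoration from a continual learning perspective. An expandable, compressible, and mineable closed-loop process enables continual adaptation to new degradations, while structural entropy pruning and a sub-degradation knowledge mining module improve restoration performance.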

Comments Accepted by ICML2026

AI中文摘要

在开放世界场景中,热红外(TIR)图像退化持续出现并演变,而现有大多数单一切换修复方法基于封闭集假设,难以持续适应新退化。为此,我们提出ECMRNet,即面向开放世界热成像修复的可扩展、可压缩、可挖掘修复网络。从概念上,ECMRNet将持续退化学习统一为一个“扩展-压缩-挖掘”闭环过程,通过可控进化实现对新退化的持续适应。从结构上,ECMRNet将中间表示分解为组隔离的子空间,并通过冻结历史组和等形扩展新组,实现严格参数隔离和快速适应新退化。为抑制任务积累后的模型增长,我们提出结构熵剪枝,通过二维结构熵最小化识别并移除冗余通道组,实现信息贡献驱动的自适应压缩。此外,我们设计了子退化知识挖掘模块,动态检索并重新组合历史表示中的可转移组件,以提高复合退化下的修复性能。实验结果表明,ECMRNet在多种单退化和复合退化场景中均实现了优越的整体性能,同时使用更少的参数和更低的计算成本。源代码可在https://github.com/Kust-lp/ECMRNet获取。

英文摘要

In open-world settings, thermal infrared (TIR) image degradations continuously emerge and evolve, while most existing all-in-one restoration methods are built on a closed-set assumption and struggle to continually adapt to novel degradations. To address this, we propose ECMRNet, an Expandable, Compressible, and Mineable Restoration Network for open-world TIR restoration from a continual learning perspective. Conceptually, ECMRNet unifies continual degradation learning as an "expand-compress-mine" closed-loop process, enabling sustained adaptation to new degradations with controllable evolution. Structurally, ECMRNet decomposes intermediate representations into group-isolated subspaces, and achieves strict parameter isolation and fast adaptation to new degradations by freezing historical groups and isomorphically expanding new ones. To curb model growth as tasks accumulate, we present Structural Entropy Pruning, which identifies and removes redundant channel groups via two-dimensional structural entropy minimization, achieving information contribution-driven adaptive compression. Moreover, we design a Sub-degradation Knowledge Mining Module that dynamically retrieves and recombines transferable components from historical representations to improve restoration under compound degradations. Experimental results demonstrate that ECMRNet achieves superior overall performance across diverse single and compound degradations while using fewer parameters and lower computational cost. The source code is available at https://github.com/Kust-lp/ECMRNet.

2605.16966 2026-05-19 cs.AI

Harnessing AI for Inverse Partial Differential Equation Problems: Past, Present, and Prospects

利用人工智能解决逆偏微分方程问题:过去、现在与展望

Zhentao Tan, Yuze Hao, Boyi Zou, Mingsheng Long, Yi Yang, Gang Bao

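AI summary: This paper surveys recent advances in using artificial intelligence to solve inverse partial differential equation problems. It covers three major categories (inverse problems, inverse design, and control problems), summarizes typical applications in science and industry, and discusses open challenges and future prospects.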

Comments 35 pages, 4 figures

AI中文摘要

求解逆偏微分方程(PDE)问题在科学研究中是一个基础性课题,因其在广泛现实应用中的重要性。逆PDE问题出现在医学成像、地球物理、材料科学和空气动力学等领域,目标是推断隐藏原因、设计结构或控制物理状态。本文全面回顾了利用人工智能(AI)解决逆PDE问题的最新进展。我们首先介绍了逆PDE问题的基本 formulation、关键挑战和传统数值基础,然后将其分为三大类别:逆问题、逆设计和控制问题。对于每个类别,我们进一步提出了方法论范式,并回顾了近年来的代表性最先进方法。我们随后总结了科学和工业领域的典型应用,包括机械系统、空气动力学问题、热系统、全波形反演、系统识别和医学成像。最后,我们讨论了开放挑战和未来前景,如物理感知架构、有限现实数据、不确定性量化和逆基础模型。本文旨在为人工智能解决逆PDE问题提供首个统一和系统的视角,展示现代基于学习的方法如何重塑PDE系统中的逆问题、逆设计和控制问题。

英文摘要

Solving inverse partial differential equation (PDE) problems is a fundamental topic in scientific research because of its significance across a wide range of real-world applications. Inverse PDE problems arise across medical imaging, geophysics, materials science, and aerodynamics, where the goal is to infer hidden causes, design structures, or control physical states. In this paper, we provide a comprehensive review of recent advances in solving inverse PDE problems using artificial intelligence (AI). We first introduce the basic formulation, key challenges, and traditional numerical foundations of inverse PDE problems, and then organize them into three major categories: inverse problems, inverse design, and control problems. For each category, we further present methodological paradigms and review representative state-of-the-art approaches from recent years. We then summarize representative applications across scientific and industrial domains, including mechanical systems, aerodynamic problems, thermal systems, full-waveform inversion, system identification, and medical imaging. Finally, we discuss open challenges and future prospects, such as physics-informed architectures, limited real-world data, uncertainty quantification, and inverse foundation models. This survey aims to provide the first unified and systematic perspective on AI for inverse PDE problems, demonstrating how modern learning-based methods are reshaping inverse problems, inverse design, and control problems in PDE-governed systems.

2605.16961 2026-05-19 cs.CV cs.AI

Latent Action Control for Reasoning-Guided Unified Image Generation

潜在动作控制用于推理引导的统一图像生成

Fuxiang Zhai, Sixiang Chen, Yingjin Li, Shuaibo Li, Jianyu Lai, Tengjun Huang, Lei Zhu

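AI summary: This paper proposes Latent Action Control (LAC), which makes reasoning actionable by representing it as hidden continuous actions, enabling reasoning-guided image generation within a unified generator. LAC uses a role-structured latent trajectory for planning, internal visual drafting, diagnosis, and refinement, and injects these actions into the hidden stream that conditions flow-based generation, improving generation quality.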

AI中文摘要

统一的多模态模型可以在共享的骨干网络中编码视觉理解和图像生成,但理解并不自动转化为控制:模型可能推断出对象、关系或知识提示,但无法在生成的图像中实例化。我们提出潜在动作控制(LAC),通过将推理表示为隐藏的连续动作,使推理过程可操作。给定提示,LAC会规划角色结构化的潜在轨迹,进行内部视觉草图、诊断和细化,并将这些动作注入到条件流生成的隐藏流中,而无需生成推理标记或中间图像。由于这些动作轨迹是未观察到的,LAC通过先验引导的变分潜在动作对齐从仅训练的语义先验、草图图像特征和监督停止信号中学习这些动作,随后通过Latent-Flow GRPO对齐潜在到图像的生成轨迹与终端视觉反馈。这为从推断的关系、绑定和知识提示到生成过程的控制路径提供了支持。在BAGEL-7B-MoT上实现后,LAC在GenEval、WISE和T2I-CompBench中一致提升了组合性和知识引导的生成,尤其是在空间关系、属性绑定和世界知识敏感提示上表现最佳。消融实验和潜在干预显示,学习的动作轨迹被生成器消耗,表明统一生成在理解不仅被编码,而是在生成过程中被操作时受益。

英文摘要

Unified multimodal models can encode visual understanding and image generation within a shared backbone, yet understanding does not automatically translate into control: models may infer objects, relations, or knowledge cues but fail to instantiate them in the generated image. We propose Latent Action Control (LAC), which makes reasoning actionable by representing it as hidden continuous actions inside a unified generator. Given a prompt, LAC rolls out a role-structured latent trajectory for planning, internal visual drafting, diagnosis, and refinement, and injects these actions into the hidden stream that conditions flow-based generation, without producing reasoning tokens or intermediate images. Since such action trajectories are unobserved, LAC learns them through prior-guided variational latent action alignment from training-only rendered semantic priors, draft image features, and supervised halting signals, followed by Latent-Flow GRPO to align the latent-to-image rollout with terminal visual feedback. This provides a control path from inferred relations, bindings, and knowledge cues to the generation process. Instantiated on BAGEL-7B-MoT, LAC consistently improves compositional and knowledge-grounded generation across GenEval, WISE, and T2I-CompBench, with the largest gains on spatial relations, attribute binding, and world-knowledge-sensitive prompts. Ablations and latent interventions show that the learned action trajectory is consumed by the generator, suggesting that unified generation benefits when understanding is not only encoded, but made actionable during generation.

2605.16951 2026-05-19 cs.CV

Edit-GRPO: A Locality-Preserving Policy Optimization Framework for Image Editing

Edit-GRPO: 一种用于图像编辑的保持局部性的策略优化框架

Shaodong Xu, Zexian Li, Zhendong Wang, Litong Gong, Tiezheng Ge, Wengang Zhou, Bo Zheng, Houqiang Li

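AI summary: This paper proposes the Edit-GRPO framework, which decouples editing and preservation objectives to reconcile locality with global coherence in image editing, improving edit quality and reducing common artifacts such as context distortion.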

AI中文摘要

图像编辑中的一个根本性挑战在于保持空间局部性:编辑应改进目标内容而不应无意地改变周围区域。然而,大多数基于优化的编辑方法将图像视为整体实体,导致全局策略更新,从而破坏局部性并引入不期望的上下文变化。我们观察到,这一问题源于局部编辑意图与全局应用的优化信号之间的不匹配。受此启发,我们提出Edit-GRPO,一种在优化图像编辑时保持局部性的策略优化框架,该框架明确地将编辑和保留目标分离。通过为编辑和非编辑区域分配区域特定的优化信号,Edit-GRPO使策略更新与编辑任务的空间结构对齐,从而实现局部改进同时保持全局视觉一致性。这种设计有效抑制了诸如上下文扭曲和边界不一致等常见伪影。在各种图像编辑场景中的广泛实验表明,与现有基于优化的方法相比,Edit-GRPO在显著提高局部性保持的同时,保持了强大的编辑性能,验证了所提框架的通用性和有效性。

英文摘要

A fundamental challenge in image editing lies in preserving spatial locality: edits should improve targeted content without inadvertently altering surrounding regions. However, most optimization-based editing approaches treat images as holistic entities, causing global policy updates that undermine locality and introduce undesired context changes. We observe that this issue stems from a mismatch between localized editing intent and globally applied optimization signals. Motivated by this insight, we propose Edit-GRPO, a locality-preserving policy optimization framework for image editing that explicitly decouples editing and preservation objectives. By assigning region-specific optimization signals to edit and non-edit areas, Edit-GRPO aligns policy updates with the spatial structure of editing tasks, enabling localized improvements while maintaining global visual coherence. This design effectively suppresses common artifacts such as context distortion and boundary inconsistency. Extensive experiments across diverse image editing scenarios demonstrate that Edit-GRPO significantly improves locality preservation while maintaining strong editing performance compared to existing optimization-based methods, validating the generality and effectiveness of the proposed framework.
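
The idea of region-specific optimization signals can be illustrated with a toy construction: normalize an editing reward and a preservation reward separately within a group of samples (the group-relative step used in GRPO), then apply the editing advantage inside the edit mask and the preservation advantage outside it. This is a sketch under assumptions, not the paper's formulation; the function names, the mask, and the reward values are all hypothetical.

```python
from statistics import mean, pstdev

def group_relative(rewards, eps=1e-8):
    """Standardize rewards within one sampled group (GRPO-style advantages)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def region_advantages(mask, edit_rewards, keep_rewards):
    """Build one per-pixel advantage map per sample: edit signal inside the mask,
    preservation signal outside it."""
    edit_adv = group_relative(edit_rewards)
    keep_adv = group_relative(keep_rewards)
    return [
        [[e if m else k for m in row] for row in mask]
        for e, k in zip(edit_adv, keep_adv)
    ]

mask = [[1, 0], [0, 0]]  # only the top-left pixel belongs to the edit region
maps = region_advantages(mask, edit_rewards=[0.9, 0.1], keep_rewards=[0.2, 0.8])
```

In this toy group, the first sample edits well but damages the background, so it receives a positive signal in the edit region and a negative one elsewhere; the second sample gets the opposite.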

2605.16949 2026-05-19 cs.CV

Beyond Point-Wise Matching: Structural Representation Alignment for Accelerating Diffusion Transformers

超越点对点匹配:为加速扩散变换器的结构表示对齐

Shaodong Xu, Zhendong Wang, Litong Gong, Zexian Li, Wengang Zhou, Tiezheng Ge, Houqiang Li

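AI summary: This paper proposes the sREPA framework, which uses explicit structural constraints to align the relational geometry of feature maps, improving generation quality and accelerating convergence.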

AI中文摘要

最近的扩散变换器(DiTs)进展表明,将噪声潜在状态与经过训练的语义特征对齐(如代表性对齐(REPA)所开创的)可以显著加速训练并提高生成保真度。随后的分析(例如iREPA)表明,这些收益主要来自于转移预训练视觉表示中包含的空间结构。然而,大多数现有对齐方法使用点对点匹配目标或依赖于隐式架构调整,这些方法未能显式建模视觉基础模型中固有的空间关系几何。我们主张,元素级监督不足以捕捉视觉表示中的丰富空间拓扑,有效的对齐应以显式结构约束的形式进行。为此,我们提出了sREPA,一种结构代表性对齐框架,以强制特征图的相对几何一致性,而不是仅仅匹配单个特征点。通过鼓励模型内部化预训练特征中的整体空间布局和结构相关性,sREPA在比最先进的对齐策略更快、更稳定的收敛以及改进的样本质量方面取得了成果。我们的代码和模型将被发布。

英文摘要

Recent advances in Diffusion Transformers (DiTs) demonstrate that aligning noisy latent states with well-trained semantic features, as pioneered by Representation Alignment (REPA), can substantially accelerate training and improve generation fidelity. Subsequent analysis (e.g., iREPA) suggests that these gains arise primarily from transferring spatial structure contained in pre-trained vision representations. However, most existing alignment methods employ point-wise matching objectives or rely on implicit architectural tweaks, which fail to explicitly model the spatial relational geometry inherent in vision foundation models. We argue that such element-wise supervision is insufficient to capture the rich spatial topology of visual representations, and that effective alignment for generation should instead be formulated as an explicit structural constraint. To this end, we propose sREPA, a structural REPresentation Alignment framework that enforces consistency in the relational geometry of feature maps, rather than merely matching individual feature points. By encouraging the model to internalize holistic spatial layouts and structural correlations from pre-trained features, sREPA achieves faster and more stable convergence, along with improved sample quality, compared to state-of-the-art alignment strategies. Our code and models will be released.
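
The contrast between point-wise matching and relational (structural) matching can be made concrete with a toy comparison: a point-wise loss compares each token feature to its counterpart, while a structural loss compares the token-to-token similarity matrices. The loss below is one plausible instance chosen for illustration and is not claimed to be the sREPA objective; all names and feature values are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def similarity_matrix(tokens):
    """Pairwise cosine similarity between all token features of one feature map."""
    return [[cosine(a, b) for b in tokens] for a in tokens]

def structural_loss(student_tokens, teacher_tokens):
    """Mean squared difference between the two similarity matrices."""
    s, t = similarity_matrix(student_tokens), similarity_matrix(teacher_tokens)
    n = len(s)
    return sum((s[i][j] - t[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)

def pointwise_loss(student_tokens, teacher_tokens):
    """Mean squared difference between corresponding feature entries."""
    diffs = [(a - b) ** 2 for u, v in zip(student_tokens, teacher_tokens) for a, b in zip(u, v)]
    return sum(diffs) / len(diffs)

teacher = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rotated = [[0.0, 1.0], [-1.0, 0.0], [-1.0, 1.0]]  # the teacher features rotated by 90 degrees

structural = structural_loss(rotated, teacher)
pointwise = pointwise_loss(rotated, teacher)
```

The rotated features keep every pairwise relation intact, so the structural loss is zero while the point-wise loss is large; a relational objective of this kind also does not require the two feature spaces to share a dimension.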

2605.16941 2026-05-19 cs.CL

Roll Out and Roll Back: Diffusion LLMs are Their Own Efficiency Teachers

展开与回退:扩散大语言模型是它们自己的效率教师

Fanqin Zeng, Feng Hong, Geng Yu, Huangjie Zheng, Xiaofeng Cao, Ya Zhang, Bo Han, Yanfeng Wang, Jiangchao Yao

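AI summary: This paper proposes a revokable-decoding approach for diffusion large language models (DLLMs) that improves generation quality and efficiency by discovering reliable denoising orders. WINO and WINO+ show significant gains on multiple datasets.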

AI中文摘要

扩散大语言模型(DLLMs)承诺快速并行生成,但开源DLLMs仍面临严重的质量-速度权衡问题:通过揭示多个token来加速解码往往会导致显著的质量下降。我们将其困境归因于训练-推理不匹配被不可逆解码放大。虽然训练过程从随机破坏的状态重建token,但高效的推理需要自适应的去噪顺序,其中更容易的token被更早揭示,而依赖上下文的token则被推迟。这种观点促使提出两种互补的方法:一种是在推理时使并行解码可撤销的方法,另一种是在训练时扩展的方法,通过这种可撤销过程暴露的可靠顺序进行知识蒸馏。据此,我们首先提出Wide-In, Narrow-Out(WINO),一种无需训练的解码算法,使并行解码可撤销。WINO积极地草案多个token,通过增强的全局上下文验证生成token,并重新掩码不可靠的token以供后续细化。基于发现的顺序,我们进一步引入WINO+,将WINO生成的验证去噪轨迹注入模型参数,使训练与高效推理对齐。在LLaDA和MMaDA上的实验表明,WINO在质量和效率上均有提升,而WINO+进一步加强了这一进程。在GSM8K上,WINO将准确率从73.24%提升到75.82%,并以6.10倍的步骤减少;WINO+进一步达到76.58%并以6.83倍的步骤减少。在Flickr30K上,WINO+实现了16.22倍的步骤减少并提升了CIDEr分数。这些结果表明,DLLMs可以通过可撤销解码发现可靠的去噪顺序,然后学习遵循这些顺序以实现更快的生成。代码可在https://github.com/Feng-Hong/WINO-DLLM/tree/WINO-plus获取。

英文摘要

Diffusion Large Language Models (DLLMs) promise fast parallel generation, yet open-source DLLMs still face a severe quality-speed trade-off: accelerating decoding by revealing multiple tokens often causes substantial quality degradation. We attribute this dilemma to a train-inference mismatch amplified by irreversible decoding. While training reconstructs tokens from randomly corrupted states, efficient inference requires an adaptive denoising order, where easier tokens are revealed earlier and context-dependent ones are deferred. This view motivates two complementary methods: an inference-time method that makes parallel decoding revokable, and a training-time extension that distills the reliable order exposed by this revokable process. Accordingly, we first propose Wide-In, Narrow-Out (WINO), a training-free decoding algorithm that enables revokable parallel generation. WINO aggressively drafts multiple tokens, verifies generated tokens with enriched global context, and re-masks unreliable ones for later refinement. Building on this discovered order, we further introduce WINO+, which injects the verified denoising trajectories produced by WINO into model parameters, aligning training with efficient inference. Experiments on LLaDA and MMaDA show that WINO improves both quality and efficiency, while WINO+ further strengthens this progression. On GSM8K, WINO improves accuracy from 73.24% to 75.82% with a 6.10x step reduction, and WINO+ further achieves 76.58% with a 6.83x reduction. On Flickr30K, WINO+ reaches a 16.22x step reduction with improved CIDEr. These results demonstrate that DLLMs can serve as their own efficiency teachers by first discovering reliable denoising orders through revokable decoding and then learning to follow them for faster generation. Code is available at https://github.com/Feng-Hong/WINO-DLLM/tree/WINO-plus.
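
A single draft, verify, and re-mask step of the kind described above can be sketched with fixed toy confidences. The thresholds and their roles (a lenient drafting threshold, a strict verification threshold) are assumptions made for this illustration, and the confidences are hard-coded rather than produced by a model; the actual WINO algorithm is in the linked repository.

```python
MASK = None

def draft_verify_step(tokens, proposals, draft_conf, verify_conf,
                      tau_draft=0.6, tau_verify=0.9):
    """One toy decoding step: fill masked slots aggressively, then revoke weak ones."""
    # Draft: fill every masked position whose proposal clears the lenient threshold.
    drafted = [
        p if t is MASK and c >= tau_draft else t
        for t, p, c in zip(tokens, proposals, draft_conf)
    ]
    # Verify: re-mask any filled position that fails the strict threshold,
    # so it can be refined in a later step.
    return [
        MASK if tok is not MASK and v < tau_verify else tok
        for tok, v in zip(drafted, verify_conf)
    ]

tokens = [MASK, MASK, MASK, MASK]
proposals = ["The", "cat", "sat", "down"]
draft_conf = [0.95, 0.70, 0.65, 0.30]   # hypothetical confidences when drafting
verify_conf = [0.97, 0.92, 0.50, 0.00]  # hypothetical confidences with fuller context
result = draft_verify_step(tokens, proposals, draft_conf, verify_conf)
```

Here three tokens are drafted in one step, one of them is revoked on verification, and two positions remain masked for later steps, which is the sense in which the decoding is revokable.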

2605.16938 2026-05-19 cs.CL cs.AI q-bio.NC

Effort as Ceiling, Not Dial: Reasoning Budget Does Not Modulate Cognitive Cost Alignment Between Humans and Large Reasoning Models

努力作为上限,而非调节器:推理预算不影响人类与大推理模型之间的认知成本对齐

Yueqing Hu, Tianhong Wang

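AI summary: This study examines whether the reasoning budget affects cognitive cost alignment between humans and large reasoning models. It finds that the alignment is unchanged regardless of reasoning effort, indicating that it is formed at training time rather than adjusted dynamically at inference time.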

Comments 8 pages, 6 figures

AI中文摘要

大推理模型(LRMs)生成的思维链轨迹长度与人类反应时间在认知任务中保持一致,但最近的争论质疑这种一致性是否反映真实的计算结构还是表面的冗长性。我们测试了这种一致性是否随推理时间的推理努力而变化。在GPT-OSS-20B和GPT-OSS-120B上,三个努力水平和六个推理任务中,任务内和跨任务的一致性保持不变:贝叶斯因子倾向于null,且各条件下的平均一致性几乎相同。操纵检查显示,努力参数设定了生成的上限,而非驱动实时分配,表明分配策略在训练时已固化。算术复杂度对比进一步显示,令牌分配跟踪细粒度、格式依赖的人类难度模式,模型规模提高了匹配程度。人类与LRMs之间的认知成本对齐似乎是在训练时形成的,对推理时的扰动具有鲁棒性,支持大推理模型问题解决的编译而非在线账户。

英文摘要

Large Reasoning Models (LRMs) generate chain-of-thought traces whose length tracks human reaction times across cognitive tasks, but recent debate questions whether this alignment reflects genuine computational structure or surface verbosity. We test whether the alignment varies with inference-time reasoning effort. Across GPT-OSS-20B and GPT-OSS-120B, three effort levels, and six reasoning tasks, within-task and cross-task alignment remain invariant: Bayes Factors lean toward the null, and mean alignment is numerically near-identical across conditions. A manipulation check reveals that the effort parameter sets an upper budget on generation rather than driving real-time allocation, suggesting that the allocation policy is crystallized at training time. Arithmetic complexity contrasts further show that token allocation tracks fine-grained, format-dependent human difficulty patterns, with model scale improving the match. Cognitive cost alignment between LRMs and humans appears to be a training-time achievement, robust to inference-time perturbations, supporting a compiled rather than online account of LRM problem-solving.

2605.16937 2026-05-19 cs.CV

DEVIS-GRPO: Unleashing GRPO on Dynamic Extreme View Synthesis

DEVIS-GRPO:释放GRPO用于动态极视合成

Yi Zuo, Huimin Wu, Lingling Li, Fang Liu, Licheng Jiao, Qing Li

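AI summary: This paper proposes DEVIS-GRPO, a GRPO-based framework for trajectory-controlled video generation and the first online policy gradient method for extreme-view video generation. Its core is ADEVIS, a novel sampling strategy that achieves large-view motions by progressively accumulating small-view increments, improving training efficiency and sampling diversity.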

AI中文摘要

轨迹控制的视频生成已成为可控视频生成的关键。尽管当前方法在小视相机运动下表现良好,但在大视运动下显著退化。现有的极视合成解决方案通常需要专门的视频对,需要大量标注工作。为了解决这些限制,我们提出了动态极视合成-GRPO(DEVIS-GRPO),一种基于GRPO的框架,用于轨迹控制的视频生成,是首个在线策略梯度方法用于极视视频生成。我们的方法的核心是一种新颖的采样策略:累积动态极视合成(ADEVIS),通过逐步积累小视增量实现大视运动。该方法带来了两个关键优势:1)增强的训练效率,因为它消除了需要预热策略模型的需要,通过收集昂贵的配对大视视频;2)增加的采样多样性,通过灵活变化轨迹配置实现。最后,我们设计了多级一致性-质量奖励函数来选择高质量的样本用于模型优化。在Kubric-4D、iPhone和DL3DV数据集上的实验表明了我们的方法的优越性。在Kubric-4D上,我们在非遮挡区域相比第二好的方法在PSNR上提高了21.57%,在SSIM上提高了7.31%。在iPhone上,LPIPS减少了18.56%。

英文摘要

Trajectory-controlled video generation has become essential for controllable video generation. While current methods perform well under small-view camera motions, they degrade significantly with large-view motions. Existing solutions for extreme-view synthesis typically require dedicated video pairs, demanding substantial annotation effort. To address these limitations, we propose Dynamic Extreme VIew Synthesis-GRPO (DEVIS-GRPO), a GRPO-based framework for trajectory-controlled video generation and the first online policy gradient method for extreme-view video generation. Central to our approach is a novel sampling strategy, Accumulative Dynamic Extreme VIew Synthesis (ADEVIS), which achieves large-view camera motions by progressively accumulating small-view increments. This strategy delivers two key advantages: 1) enhanced training efficiency, as it eliminates the need to warm-start the policy model by collecting expensive paired large-view videos, and 2) increased sampling diversity, achieved by flexibly varying trajectory configurations. Finally, we design a multi-level consistency-quality reward function to select high-quality samples for model optimization. Experiments on the Kubric-4D, iPhone, and DL3DV datasets demonstrate our method's superiority. On Kubric-4D, we achieve relative improvements of 21.57% in PSNR and 7.31% in SSIM over the second-best method in non-occlusion areas. On iPhone, LPIPS is reduced by 18.56%.
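
The idea of reaching a large view change through accumulated small increments can be illustrated on a single camera angle. The helper below is hypothetical and only shows the scheduling arithmetic; how ADEVIS actually parameterizes trajectories and generates the intermediate views is not described in the abstract.

```python
import math

def accumulate_view_increments(total_deg, max_step_deg):
    """Split a large rotation into equal steps no larger than max_step_deg and
    return the cumulative angle reached after each step."""
    if max_step_deg <= 0:
        raise ValueError("max_step_deg must be positive")
    if total_deg == 0:
        return []
    steps = math.ceil(abs(total_deg) / max_step_deg)
    step = total_deg / steps
    return [step * (i + 1) for i in range(steps)]

# A 90-degree move with each increment limited to 20 degrees needs five 18-degree steps.
waypoints = accumulate_view_increments(90.0, 20.0)
```

Changing `max_step_deg` or the total angle yields different trajectory configurations from the same routine, which is one simple way sampling diversity could be varied.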

2605.16932 2026-05-19 cs.RO

MORN: Metacognitive Object-Goal Regulation for Resource-Rational Long-Horizon Navigation

MORN: 为资源理性长周期导航的元认知目标-目标调节

Xi Lin, Jiayi Li, Kangyi Wu, Jiaqiao Tang, Qingrong He, Lin Zhao

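AI summary: This paper proposes MORN, a metacognitive navigation architecture based on Dual-Process Theory. By introducing a resource-rationality mechanism, it addresses the resource waste that conventional navigation systems incur on long-horizon tasks for lack of global resource awareness, improving goal completion rate and task efficiency.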

AI中文摘要

在无结构人类环境中部署的机器人必须频繁执行长周期任务,如找到杯子、然后椅子、然后打印机,这些任务受严格操作约束。尽管现代零样本物体导航(ObjectNav)代理利用视觉-语言模型(VLMs)有效定位语义目标,但它们本质上是纯粹的反应系统,缺乏全局资源意识。因此,这些代理由于部分可观测性而无意中耗尽关键预算,包括时间和电池,对不可行的子目标进行本地探索,未能在本地探索与全局任务可行性之间取得平衡。为了填补这一差距,通过在导航循环中注入资源理性,我们提出了MORN(元认知目标-目标调节导航),一种受认知科学双过程理论启发的执行架构。MORN在冻结的导航骨干上增加了一个System 2元控制器,持续监控System 1的移动。通过正式化三个神经认知状态,潜在指数、坚持门控和证据积累,MORN根据在线进度速度和感知不确定性的估计动态调节任务计划。这种机制有效消除了沉没成本谬误,使代理能够提前中止僵尸目标并果断承诺可行的目标。在HM3D数据集上的大量实验表明,MORN将目标完成率(CR)从0.23提高到0.30,并将浪费步分数(WSF)从0.90降低到0.70,证明在资源受限自主性中,元认知对全局资源的意识与反应能力导航同样关键。

英文摘要

Robots deployed in unstructured human environments must frequently execute long-horizon missions, such as find the mug, then the chair, then the printer, under strict operational constraints. While contemporary zero-shot Object Navigation (ObjectNav) agents leverage Vision-Language Models (VLMs) to effectively localize semantic targets, they operate as purely reactive systems that inherently lack global resource awareness. Consequently, these agents inadvertently exhaust critical budgets, including time and battery, on infeasible subgoals due to partial observability, failing to balance local exploration with global mission viability. To bridge this gap by injecting resource-rationality into the navigation loop, we present MORN (Metacognitive Object-goal Regulation Navigation), an executive architecture inspired by Dual-Process Theory in cognitive science. MORN augments frozen navigation backbones with a System 2 meta-controller that continuously monitors the System 1 locomotor. By formalizing three neuro-cognitive states, Potentiality Index, Persistence Gating, and Evidence Accumulation, MORN dynamically regulates the mission schedule based on online estimates of progress velocity and perceptual uncertainty. This mechanism effectively neutralizes the Sunk Cost Fallacy, enabling agents to abort zombie goals early and decisively commit to achievable ones. Extensive experiments on the HM3D dataset demonstrate that MORN improves Goal Completion Rate (CR) from 0.23 to 0.30 and reduces Wasted Step Fraction (WSF) from 0.90 to 0.70, establishing that in resource-constrained autonomy, the metacognitive awareness of global resources is as critical as the reactive ability to navigate.
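
A rule that aborts a subgoal when recent progress cannot justify the remaining budget can be sketched as below. This is a deliberately simple stand-in, not MORN's Potentiality Index or Persistence Gating: the distance-to-goal proxy, the function name, and the thresholds are all assumptions, and perceptual uncertainty is ignored.

```python
def should_abort(distance_history, steps_left, min_velocity=1e-6):
    """Abort when the recent progress rate cannot close the remaining
    (estimated) distance within the remaining step budget."""
    if len(distance_history) < 2:
        return False  # not enough evidence yet
    progress = distance_history[0] - distance_history[-1]
    velocity = progress / (len(distance_history) - 1)  # distance closed per step
    if velocity <= min_velocity:
        return True  # stalled or moving away from the goal
    steps_needed = distance_history[-1] / velocity
    return steps_needed > steps_left

# Hypothetical distance estimates over the last few steps.
stalled = should_abort([10.0, 9.9, 9.95, 9.9], steps_left=50)
on_track = should_abort([10.0, 8.0, 6.0, 4.0], steps_left=50)
```

The decision depends only on the current rate of progress and the remaining budget, not on how many steps were already spent, which is how a rule of this kind sidesteps the sunk-cost effect the abstract mentions.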

2605.16929 2026-05-19 cs.LG

Emulating the Forced Response of Climate Models with Flow Matching

用流匹配模拟气候模型的强迫响应

Graham Clyne, Julia Kaltenborn, Peer Nowack, Claire Monteleoni, Anasatase Charantonis

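AI summary: This paper proposes using a deep learning model to emulate the response of climate models to multiple climate forcings. The model is trained on multiple SSP scenarios, generates scenarios unseen during training, and is validated on land surface temperature.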

AI中文摘要

全球气候模型是模拟过去和潜在未来气候变化路径以及相关气候影响的关键工具。共享社会经济路径(SSPs)描述了全球经济和人口发展的各种未来情景。这些SSPs本质上与气候强迫的变化相关,这些强迫是外部驱动因素,如温室气体和气溶胶排放,从而导致地球能量平衡随时间的变化。这些强迫是气候模型中的基本边界条件,以了解这些变化对气候影响的潜在影响。然而,运行气候模型计算成本极高,与需要大量模拟集以获得更稳健估计的需求相冲突(考虑内部变异性和情景不确定性)。最近的研究表明,可以利用机器学习捕捉气候模型的动力学,当条件于不同气候情景的强迫。我们在此训练了一个深度学习(DL)模型在多个SSP上,并成功生成训练期间未见过的场景。我们的模拟器验证了MESMER-M,一个土地表面温度的统计模拟器。我们的研究展示了生成对多种同时气候强迫(如二氧化碳、甲烷、一氧化二氮、硫酸气溶胶和臭氧)响应的气候变化状态的能力。特别是,我们的消融研究强调需要包括多种不同强迫以用DL模拟器表示长期大气趋势。

英文摘要

Global climate models are essential tools to simulate past and potential future pathways of climate change, as well as associated climate impacts. Shared Socioeconomic Pathways (SSPs) describe a range of future scenarios of global economic and demographic development. These SSPs are intrinsically linked to changes in climate forcings, the external drivers, such as greenhouse gas and aerosol emissions, which in turn lead to the human impact on the energy balance of the Earth over time. These forcings are fundamental boundary conditions in climate models, used to gain insight into the potential climatic impacts of the changes described by each SSP. Running a climate model, however, is extremely computationally expensive, conflicting with the need for large ensembles of simulations for each model to give, e.g., more robust estimates in the presence of internal variability (the inherent, chaotic fluctuations within the climate system) and scenario uncertainty. Recent research has demonstrated the ability to capture climate model dynamics using machine learning when conditioned on forcings from different climatic scenarios. Here, we train a Deep Learning (DL) model on multiple SSPs and successfully generate scenarios unseen during training. Our emulator is validated against MESMER-M, a statistical emulator of land surface temperature. Our research demonstrates the capacity to generate such changing climate states in response to a variety of simultaneous climate forcings (e.g., carbon dioxide, methane, nitrous oxide, sulphate aerosols, and ozone). In particular, our ablation studies underline a need to include a range of different forcings to represent long-term atmospheric trends with a DL emulator.
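
The title refers to flow matching, which the abstract does not spell out. In the common linear-interpolation formulation, assumed here and not confirmed as the paper's variant, a sample x_t = (1 - t) x0 + t x1 lies on the straight path from a noise sample x0 to a data sample x1, and the regression target for the velocity network is x1 - x0. A toy sketch with an oracle velocity in place of a trained, forcing-conditioned network:

```python
def flow_matching_pair(x0, x1, t):
    """Return the interpolated point x_t and the target velocity for one training pair."""
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    v_target = [b - a for a, b in zip(x0, x1)]
    return x_t, v_target

def euler_integrate(x0, velocity_fn, steps):
    """Generate a sample by integrating dx/dt = velocity_fn(x, t) from t=0 to t=1."""
    x, dt = list(x0), 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

x0, x1 = [0.0, 0.0], [1.0, -2.0]          # toy noise and data samples
x_t, v_target = flow_matching_pair(x0, x1, t=0.25)
# Oracle velocity field for this single pair; a real emulator would use a trained network.
sample = euler_integrate(x0, lambda x, t: v_target, steps=4)
```

With the oracle velocity the integration lands exactly on x1; a learned network approximates this field on average over many pairs.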

2605.16927 2026-05-19 cs.AI

From Static Risk to Dynamic Trajectories: Toward World-Model-Inspired Clinical Prediction

从静态风险到动态轨迹:迈向世界模型启发的临床预测

Pujun Feng, Xiaoyu Guo, Seyed Ehsan Saffari, Min Hun Lee, Siew-Kei Lam, Erik Cambria, Xibin Sun, Yangtao Zhou, Tong Yang, Xiaoyu Zhang, Tao Tan, Yue Sun, Bin Cui

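AI summary: This paper reviews intervention-aware disease trajectory modeling in clinical AI and proposes a unified framework that combines forecasting, counterfactual trajectories, and policy evaluation to address treatment assignment, time-varying confounding, and observation bias, moving clinical prediction toward decision-grade evidence.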

AI中文摘要

临床决策是一个反馈系统,其中风险估计影响治疗,而治疗又改变疾病轨迹,两者共同塑造医生的测量实践。静态预测在临床中往往失败:训练于观察性护理日志的模型会将疾病生物学与医生行为混为一谈,特别是在存在治疗混杂反馈和不规则或信息性观察的情况下。本文聚焦于临床AI中的干预感知疾病轨迹建模方法——估计患者特定的纵向疾病演变并评估在替代治疗下的轨迹变化。本文围绕六个相关组成部分组织该领域:三个决策任务(事实预测、反事实估计、政策评估)和三个数据生成机制(疾病演变、治疗分配、观察过程),这些决定了可识别性。本文提出了第一个统一框架,连接了离散/连续时间下的预测、反事实轨迹和政策评估,明确处理治疗分配、时间变化混杂和观察偏差。本文综合了关键方法家族(多状态/联合模型、时间点过程、深度序列架构、纵向因果推断),将它们映射到相关组成部分,并通过重叠诊断、不确定性量化、非策略鲁棒性和目标试验验证对齐评估。这种综合将基准预测推进到决策级临床证据,使治疗敏感的个性化未来成为可能,实现部署前的政策压力测试,并推动更安全的闭环学习健康系统,在证据不足时适应或回避。

英文摘要

Clinical decision-making is a feedback system where risk estimates influence treatment, which in turn changes disease trajectories, and both shape clinicians' measurement practices. Static prediction often fails clinically: models trained on observational care logs conflate disease biology with clinician behavior, particularly under treatment-confounder feedback and irregular or informative observation. This Review focuses on intervention-aware disease trajectory modeling in clinical AI: methods that estimate patient-specific longitudinal disease evolution and assess trajectory changes under alternative treatments. We organize the field around six linked components: three decision tasks (factual forecasting, counterfactual estimation, policy evaluation) and three data-generating mechanisms (disease evolution, treatment assignment, observation process) that determine identifiability. We present the first unified framework bridging forecasting, counterfactual trajectories, and policy evaluation across discrete/continuous time, explicitly addressing treatment assignment, time-varying confounding, and observation bias. We synthesize key method families (multistate/joint models, temporal point-process, deep sequence architectures, longitudinal causal inference), map them to relevant components, and align evaluation with claim strength via overlap diagnostics, uncertainty quantification, off-policy robustness, and target-trial validation. This synthesis advances benchmark prediction to decision-grade clinical evidence, enabling treatment-sensitive individualized futures, pre-deployment policy stress-testing, and safer closed-loop learning health systems that adapt/abstain when evidence is insufficient.

2605.16925 2026-05-19 cs.CV

P2GS: Physical Prior-guided Gaussian Splatting for Photometrically Consistent Urban Reconstruction

P2GS: Gaussian Splatting Guided by Physical Priors for Photometrically Consistent Urban Reconstruction

Kota Shimomura, Hidehisa Arai, Tsubasa Takahashi, Takayoshi Yamashita, Hironobu Fujiyoshi

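AI summary: This paper proposes P2GS, a physical prior-guided Gaussian Splatting method that addresses the photometric inconsistency caused by heterogeneous camera pipelines and dynamic outdoor illumination in autonomous driving. It jointly decomposes a view-invariant linear HDR radiance field, per-view exposure scales, and tone-mapping functions to improve photometric and illumination consistency.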

Comments: Accepted to CVPR 2026 (main)

AI Chinese abstract

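3D Gaussian Splatting (3DGS) has recently emerged as a powerful explicit representation that enables fast, high-fidelity rendering, making it a promising foundation for closed-loop simulators and perception models in autonomous driving. However, conventional 3DGS implicitly assumes consistent exposure and tone mapping across views. Real driving data violates this assumption because of heterogeneous camera pipelines and dynamic outdoor illumination, which bakes exposure discrepancies and sensor noise into the radiance field and produces artifacts and inconsistent illumination, especially in the static backgrounds that are crucial for realistic simulation. These problems are amplified in autonomous driving, where sparse viewpoints, varying exposures, and outdoor lighting interact, while prior work mainly targets dynamic-object reconstruction and overlooks cross-view photometric consistency. To address this limitation, we introduce P2GS, a physically consistent Gaussian Splatting framework that jointly decomposes a view-invariant linear HDR radiance field, per-view exposure scales, and tone-mapping functions from LDR images alone, without HDR supervision. P2GS adopts a unified optimization strategy grounded in the physical image-formation process, enforcing relative-exposure consistency and HDR-domain radiance regularization. This yields a radiance field that is robust to inter-camera illumination differences while retaining the real-time efficiency of standard 3DGS. Experiments in real and simulated driving environments show that P2GS matches or surpasses prior methods in LDR reconstruction while providing substantially improved photometric consistency, reliable exposure normalization, and physically coherent illumination across diverse scenes.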

English abstract

3D Gaussian Splatting (3DGS) has recently emerged as a powerful explicit representation enabling fast, high-fidelity rendering, making it a promising foundation for closed-loop simulators and perception models in autonomous driving. However, conventional 3DGS implicitly assumes consistent exposure and tone mapping across views. Real driving data violates this assumption due to heterogeneous camera pipelines and dynamic outdoor illumination, baking exposure discrepancies and sensor noise into the radiance field and producing artifacts and inconsistent illumination, especially in static backgrounds crucial for realistic simulation. These issues are amplified in autonomous driving, where sparse viewpoints, varying exposures, and outdoor lighting interact, while prior work mainly targets dynamic-object reconstruction and overlooks cross-view photometric consistency. To address this limitation, we introduce P2GS, a physically consistent Gaussian Splatting framework that jointly decomposes a view-invariant linear HDR radiance field, per-view exposure scales, and tone-mapping functions from only LDR images, without HDR supervision. P2GS employs a unified optimization strategy grounded in the physical image-formation process, enforcing relative-exposure consistency and HDR-domain radiance regularization. This yields a radiance field robust to inter-camera illumination differences while preserving the real-time efficiency of standard 3DGS. Experiments across real and simulated driving environments show that P2GS matches or surpasses prior methods in LDR reconstruction while providing substantially improved photometric consistency, reliable exposure normalization, and physically coherent illumination across diverse scenes.
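The decomposition described in the abstract rests on a physical image-formation model in which an LDR pixel is a tone-mapped, exposure-scaled version of linear HDR radiance. The sketch below illustrates that idea only. It assumes a fixed gamma curve as a stand-in for the tone-mapping function (the abstract does not specify its form), and the function names are hypothetical rather than taken from P2GS.

```python
def tone_map(linear_value: float, gamma: float = 2.2) -> float:
    """Map a linear value to LDR in [0, 1] with a gamma curve (assumed placeholder)."""
    clipped = min(max(linear_value, 0.0), 1.0)  # sensor saturation
    return clipped ** (1.0 / gamma)


def inverse_tone_map(ldr_value: float, gamma: float = 2.2) -> float:
    """Invert the gamma curve; valid only for unsaturated pixels."""
    return ldr_value ** gamma


def render_ldr(radiance: float, exposure: float, gamma: float = 2.2) -> float:
    """Forward model: LDR = tone_map(exposure * view-invariant radiance)."""
    return tone_map(exposure * radiance, gamma)


def recover_radiance(ldr_value: float, exposure: float, gamma: float = 2.2) -> float:
    """Undo tone mapping and exposure to return to the shared linear HDR domain."""
    return inverse_tone_map(ldr_value, gamma) / exposure


# The same surface radiance seen by two cameras with different exposure scales.
radiance = 0.2
ldr_cam_a = render_ldr(radiance, exposure=1.0)
ldr_cam_b = render_ldr(radiance, exposure=2.0)

# The LDR pixels disagree, but the recovered radiance agrees across views.
radiance_a = recover_radiance(ldr_cam_a, exposure=1.0)
radiance_b = recover_radiance(ldr_cam_b, exposure=2.0)
```

In this toy setting the exposure scales are known. In the setting the abstract describes, exposure, tone mapping, and radiance are all unknown and are estimated jointly from LDR images.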

2605.16922 2026-05-19 cs.CV

Motion Cues from Image-based Point Tracking for LiDAR Scene Flow Estimation

Motion Cues Based on Image Point Tracking for LiDAR Scene Flow Estimation

Youngdong Jang, Gyeongrok Oh, Jong Wook Kim, Hyunju Ryu, Hyung-gun Chi, SeungHyeon Kim, Seungryong Kim, Jonghyun Choi, Sangpil Kim

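AI summary: This paper proposes TrackCue, a framework that uses image-based point tracking to obtain dense trajectories and improve dynamic-object representation in LiDAR scene flow estimation. A visually consistent motion compensation strategy and visual motion cue lifting yield more accurate static-dynamic classification and more reliable scene flow learning.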

AI Chinese abstract

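LiDAR scene flow estimation is essential for autonomous driving because it provides 3D motion for every point. Self-supervised methods use static-dynamic classification to mitigate the imbalance between static and dynamic points and thereby obtain targeted supervision. However, existing methods rely on sparse geometric observations for this classification, which makes them vulnerable to data sparsity and occlusion. The resulting noisy labels give incorrect motion guidance and degrade scene flow learning. To address this, we introduce TrackCue, a tracking-guided framework for improving dynamic-object representation in LiDAR scene flow estimation. Specifically, TrackCue repurposes point tracking to obtain dense image-space trajectories anchored to LiDAR points, providing motion cues beyond sparse geometric observations. In addition, we propose a visually consistent motion compensation strategy that compares tracked trajectories with ego-induced rigid trajectories in the image plane, effectively separating true object motion from ego-induced apparent motion. To transfer these isolated motion cues back to the LiDAR domain, we perform visual motion cue lifting, which associates ego-compensated image trajectories with LiDAR points for static-dynamic label refinement. As a result, TrackCue produces more accurate static-dynamic classification and provides more reliable supervision for scene flow learning. Experimental results show that TrackCue significantly improves the precision and F1 score of dynamic labels, leading to performance gains in self-supervised scene flow estimation.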

English abstract

LiDAR scene flow estimation is essential for autonomous driving, as it provides 3D motion for each point. Self-supervised approaches use static-dynamic classification to mitigate the imbalance between static and dynamic points, deriving targeted supervision. However, existing methods rely on sparse geometric observations for this classification, making them vulnerable to data sparsity and occlusions. The resulting noisy labels provide incorrect motion guidance and degrade scene flow learning. To address this, we introduce TrackCue, a tracking-guided framework for improving dynamic object representation in LiDAR scene flow estimation. In particular, TrackCue repurposes point tracking to obtain dense image-space trajectories anchored to LiDAR points, providing motion cues beyond sparse geometric observations. Furthermore, we present a visually consistent motion compensation strategy that compares the tracked trajectories with ego-induced rigid trajectories in the image plane, effectively isolating true object motion from ego-induced apparent motion. To transfer these isolated motion cues back to the LiDAR domain, we perform visual motion cue lifting, which associates ego-compensated image trajectories with LiDAR points for static-dynamic label refinement. As a result, TrackCue produces more accurate static-dynamic classification and provides more reliable supervision for scene flow learning. Experimental results show that TrackCue significantly improves the precision and F1 score of dynamic labels, leading to performance gains in self-supervised scene flow estimation.
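The motion compensation idea in the abstract, comparing a tracked image trajectory with the trajectory a static point would follow under ego-motion alone, can be sketched for a single point and a single frame pair. This is an illustration under simplifying assumptions, not TrackCue's implementation: it assumes a pinhole camera, a known rigid transform between the two camera frames, points in front of the camera, and a fixed pixel threshold. All names and the threshold value are hypothetical.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]
Pixel = Tuple[float, float]


def apply_rigid(rotation: Sequence[Sequence[float]], translation: Vec3, point: Vec3) -> Vec3:
    """Map a point from the camera frame at time t to the camera frame at time t+1."""
    x, y, z = (
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )
    return (x, y, z)


def project(point: Vec3, fx: float, fy: float, cx: float, cy: float) -> Pixel:
    """Pinhole projection; assumes the point lies in front of the camera (z > 0)."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


def is_dynamic(point_t: Vec3, tracked_pixel_t1: Pixel,
               rotation: Sequence[Sequence[float]], translation: Vec3,
               intrinsics: Tuple[float, float, float, float],
               threshold_px: float = 2.0) -> bool:
    """Label a LiDAR point dynamic if its tracked pixel deviates from the ego-induced one.

    The ego-induced pixel is where the point would appear at t+1 if it were static.
    """
    ego_induced_pixel = project(apply_rigid(rotation, translation, point_t), *intrinsics)
    residual = math.dist(ego_induced_pixel, tracked_pixel_t1)
    return residual > threshold_px


# Example: the vehicle moves 1 m forward, so static points get 1 m closer along z.
identity = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
ego_translation = (0.0, 0.0, -1.0)
camera = (1000.0, 1000.0, 640.0, 360.0)  # fx, fy, cx, cy
lidar_point = (1.0, 0.0, 10.0)

static_label = is_dynamic(lidar_point, (751.1, 360.0), identity, ego_translation, camera)
moving_label = is_dynamic(lidar_point, (800.0, 360.0), identity, ego_translation, camera)
```

Here the first tracked pixel matches the ego-induced position to within a fraction of a pixel, so the point is treated as static, while the second deviates by tens of pixels and is treated as dynamic. A full pipeline would aggregate residuals over whole trajectories and handle tracking noise and occlusion.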