arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.01079 2026-06-02 cs.CV

Chameleon: Style-Content Disentangled Framework for Cross-Domain Object Compositing

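Sukhun Ko, Soo Ye Kim, Jihyong Oh

Affiliations: CMLab, Chung-Ang University; Adobe Research

AI summary: Proposes Chameleon, a two-stage training framework built on the large-scale ChameleonDataset, which uses Joint Hard Contrastive Learning and Spatio-Temporal Attention Gating to disentangle style from content and stylize adaptively in cross-domain object compositing.

Details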
Comments
The last two authors are co-corresponding authors. Please visit our project page at https://cmlab-korea.github.io/Chameleon/
Abstract

Image compositing aims to seamlessly insert a foreground object into a background image, and recent advances in diffusion models have significantly enhanced the quality, especially when the foreground and background images come from the same domain (e.g., natural images). However, cross-domain compositing, where the foreground and background come from different domains, is relatively underexplored and remains challenging because the model must preserve the foreground object's identity while stylizing it to match the background domain. Existing cross-domain compositing approaches largely rely on training-free blending and refinement strategies. This is partly due to the lack of large-scale paired datasets for cross-domain compositing, limiting the development of training-based solutions. As a result, they are limited to tone-level alignment and often produce style-inconsistent or overstylized results. To overcome such limitations, we construct ChameleonDataset, the first large-scale training dataset for cross-domain compositing, with a comprehensive evaluation benchmark, built through a scalable data construction pipeline. Building on this, we propose Chameleon, a novel two-stage training-based cross-domain compositing framework. In the first stage, we propose Joint Hard Contrastive Learning (JHCL) to train ChameleonEncoder, which effectively disentangles style and content representations. In the second stage, we introduce Spatio-Temporal Attention Gating (STAG) into a diffusion transformer for effective stylization, adaptively regulating how style tokens from the first-stage encoder are injected across spatial and temporal dimensions. Our method outperforms state-of-the-art in-domain and cross-domain compositing models, sequential pipelines and commercial models, achieving improvements in both compositional plausibility and stylistic fidelity.

2606.01078 2026-06-02 cs.LG stat.CO stat.ME

Non-Vacuous Certification of Transport MCMC via Oscillation-Controlled Normalizing Flows

Jun Hu

Affiliations: China Sanya Science and Education Innovation Park, Wuhan University of Technology, Sanya 572025, China; School of Civil Engineering and Architecture, Wuhan University of Technology, Wuhan 430070, China

AI summary: Proposes an oscillation-controlled normalizing-flow framework that gives the first rigorous, non-vacuous spectral-gap bounds for transport MCMC samplers; spectral normalization, a coverage-based empirical oscillation bound, and oscillation-regularized training yield certifiable sampling efficiency on several posterior distributions.

Details
Comments
36 pages, includes appendix
Abstract

Transport MCMC trains a normalizing flow to precondition Metropolis-Hastings proposals, achieving high empirical efficiency on challenging posteriors; yet no prior work produces a numerically non-vacuous, rigorous spectral-gap bound for such samplers. We establish the first such bounds. For independence MH on the banana family we certify $\gamma^\ast = 0.828$ at $D = 2$ (covering in the original space) and $\gamma^\ast \ge 7.6\times 10^{-4}$ at $D = 5$ (covering in an analytically unwarped Gaussian space with a grid-certified gradient bound under the stated numerical Lipschitz certification), both rigorous at 95% confidence. The framework rests on three pillars: (i) spectral normalization with reduced scale clips constrains the flow Lipschitz constant from $10^{47}$ to $10^4$; (ii) a coverage-based empirical oscillation bound replaces the vacuous analytical bound with a data-dependent certificate; and (iii) oscillation-regularised training cuts the empirical oscillation by 60-90% at no cost to density fit, extending practical certificates through $D = 20$ ($\gamma^\ast \ge 1.7\times 10^{-4}$). Tests on four further targets (Gaussian mixture, shear-building, Neal's funnel, Bayesian logistic regression) identify three precise barriers: boundary curvature, target stiffness, and tail-coverage mismatch. An affine-vs-spline comparison shows that simpler architectures yield tighter certificates at identical NLL, inverting the usual expressiveness hierarchy.
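
As a rough illustration of the sampler class this abstract refers to, the sketch below runs independence Metropolis-Hastings on a toy one-dimensional problem. The standard-normal target, the wider Gaussian proposal, and all function names are assumptions made here for illustration; the sketch does not implement the paper's normalizing flows or its oscillation-based certificates.

```python
import random
import math

def log_weight(x, scale):
    # log(target / proposal) up to an additive constant that cancels in the ratio.
    # Assumed toy setup: target N(0, 1), proposal N(0, scale^2).
    return -0.5 * x * x + 0.5 * (x / scale) ** 2

def independence_mh(n_steps, scale=2.0, seed=0):
    rng = random.Random(seed)
    x = 0.0
    samples, accepted = [], 0
    for _ in range(n_steps):
        y = rng.gauss(0.0, scale)  # the proposal ignores the current state
        log_ratio = log_weight(y, scale) - log_weight(x, scale)
        # Accept with probability min(1, w(y) / w(x)).
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = y
            accepted += 1
        samples.append(x)
    return samples, accepted / n_steps

samples, rate = independence_mh(20000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

In this toy setup the ratio of the normalized target density to the proposal density is bounded by M = 2 (attained at x = 0). A classical result for independence samplers with a bounded ratio gives total-variation convergence at geometric rate 1 - 1/M, which is the kind of quantity a certificate has to control when the proposal is a learned flow instead of a fixed Gaussian.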

2606.01074 2026-06-02 cs.CL

When Is 0.1% Enough? Analyzing the Combined Effects of Dimensionality Reduction and Quantization on Text Embedding Compression

Riku Kisako, Hayato Tsukagoshi, Ryohei Sasano

Affiliations: Graduate School of Informatics, Nagoya University

AI summary: Systematically studies text-embedding compression that combines dimensionality reduction with quantization, finding that joint compression can shrink embeddings to as little as 0.1% of their original size with almost no performance loss, and that the best strategy depends on the task.

Details
Abstract

Recent high-performing text embedding models often output high-dimensional real-valued vectors, resulting in substantial storage and computational costs. To address this issue, compression methods based on dimensionality reduction or quantization have been proposed; however, the effects of combining dimensionality reduction and quantization have not been sufficiently investigated. In this paper, we systematically examine the effectiveness of compressing text embeddings by combining dimensionality reduction and quantization, using four MTEB task families and four pretrained embedding models. The experimental results demonstrate that combining dimensionality reduction and quantization enables substantially stronger compression than using either method alone, that in some settings embeddings can be reduced to as little as 0.1% of their original size with almost no performance degradation, and that the optimal compression strategy depends on the task.
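
To make the 0.1% figure concrete, the following sketch works through the size arithmetic for one hypothetical configuration: a 1024-dimensional float32 embedding reduced to 32 dimensions and then stored at one bit per dimension. Plain truncation stands in for dimensionality reduction and sign bits stand in for quantization; both choices, the dimensions, and the function names are assumptions, since the abstract does not name the specific methods.

```python
def compress(embedding, keep_dims):
    """Keep the first `keep_dims` values, then store one sign bit per value."""
    reduced = embedding[:keep_dims]               # stand-in for dimensionality reduction
    bits = [1 if v > 0 else 0 for v in reduced]   # 1-bit (sign) quantization
    packed = bytearray((len(bits) + 7) // 8)
    for i, b in enumerate(bits):
        packed[i // 8] |= b << (i % 8)            # pack 8 sign bits per byte
    return bytes(packed)

def size_ratio(original_dims, keep_dims, original_bits=32, quant_bits=1):
    """Compressed size divided by original size, both measured in bits."""
    return (keep_dims * quant_bits) / (original_dims * original_bits)

# Synthetic embedding values, used only to exercise the code.
embedding = [((i * 37) % 11 - 5) / 5.0 for i in range(1024)]
code = compress(embedding, keep_dims=32)
ratio = size_ratio(1024, 32)   # 32 bits out of 32768 bits, just under 0.1%
```

Under these assumptions the 4096-byte vector becomes a 4-byte code. Neither step alone reaches that ratio here: 1-bit quantization of all 1024 dimensions gives 1/32 of the original size, and keeping 32 float32 dimensions also gives 1/32.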

2606.01069 2026-06-02 cs.CV

A Multiscale Network with Supervised Contrastive Learning for Real-Time Facial Emotion Recognition

Rejoy Chakraborty, Archisman Adhikary, Chayan Halder, Payel Rakshit, Sanchita Ghosh, Kaushik Roy

Affiliations: Indian Statistical Institute; Department of Biological Sciences, Bose Institute; Ramakrishna Mission Vivekananda Centenary College; Maheshtala College; West Bengal State University

AI summary: Proposes a multiscale deep learning network with supervised contrastive learning for recognizing emotion from changing facial expressions in real-time video, with satisfactory results on a standard dataset.

Details
Comments
13 pages
Abstract

Real-time emotion recognition from facial expressions is a challenging task, particularly in video-based scenarios where multiple emotional states may occur over time. The difficulty increases further due to the fact that each emotional state is associated with facial expressions that vary significantly across individuals. The change of facial expressions portraying emotional state is not discrete, but rather continuous, which is very challenging to represent through computational aids. A system with the ability to detect variations in facial expressions can have a significant impact on determining the emotional state of an individual. Such a system can be very beneficial for psychologists during counseling by providing additional insights into the emotional state of a subject. In this paper, a deep learning-based system is presented to detect emotional changes in real-time video of a person by modeling the change in facial expressions. The current study is conducted on a standard dataset for training of the deep learning system and the system has provided very satisfactory outcomes in this respect.

2606.01066 2026-06-02 cs.AI

Before the Model Learns the Bug: Fuzzing RLVR Verifiers

Jaideep Ray

Affiliations: ACM

AI summary: Proposes a lightweight verifier-fuzzing framework that generates adversarial completions, compares buggy verifiers against stricter reference verifiers, and reports several metrics, in order to study the RLVR failure mode in which a verifier error leads optimization to learn the bug.

Details
Abstract

Reinforcement learning with verifiable rewards (RLVR) replaces human preference labels with executable reward functions such as math answer checkers, JSON tool-call validators, and code unit-test harnesses. That makes the reward partly a software artifact: if the verifier is wrong, optimization can learn the bug. We study this failure mode with a lightweight verifier-fuzzing framework that generates adversarial completions, compares buggy and stricter reference verifiers, logs paired decisions, and reports false-positive, false-negative, disagreement, exploit, and uncertainty metrics.
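
The workflow described above (generate adversarial completions, run a buggy and a stricter verifier, log paired decisions, report disagreement metrics) can be sketched in a few lines. Everything here is hypothetical: the substring bug, the strict last-token rule, and the hand-written mutations are illustrative assumptions, not the paper's verifiers or generator.

```python
def buggy_verifier(completion, expected):
    # Hypothetical bug: accepts any completion that merely contains the answer.
    return expected in completion

def strict_verifier(completion, expected):
    # Reference: the final whitespace-separated token must equal the answer.
    tokens = completion.split()
    return bool(tokens) and tokens[-1] == expected

def adversarial_completions(expected):
    # Hand-written mutations standing in for a real adversarial generator.
    return [
        expected,
        f"The answer is {expected}",
        f"{expected} or maybe 0",
        f"It could be 1, 2, {expected}, 4",
        f"not{expected}0",
        "",
        f"The answer is  {expected} ",
    ]

def fuzz(expected):
    # Log one (completion, buggy decision, strict decision) triple per input.
    log = [(c, buggy_verifier(c, expected), strict_verifier(c, expected))
           for c in adversarial_completions(expected)]
    false_pos = sum(1 for _, b, s in log if b and not s)
    false_neg = sum(1 for _, b, s in log if s and not b)
    return {"n": len(log), "false_positive": false_pos,
            "false_negative": false_neg,
            "disagreement": false_pos + false_neg, "log": log}

report = fuzz("42")
```

Each false positive in the log is a completion that would collect reward without being correct, which is exactly the behavior a policy optimizer can learn to exploit.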

2606.01063 2026-06-02 cs.AI

MindClaw: Closed-Loop Embodied Mental-State Reasoning for Precision Intervention

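Ruoxuan Zhang, Qiaoqiao Wan, Zhengguang Wang, Chenghao Yu, Hongxia Xie, Jianlong Fu, Wen-Huang Cheng

Affiliations: Jilin University; Microsoft Asia; National Taiwan University

AI summary: Proposes MindClaw, a framework for closed-loop embodied mental-state reasoning with precision intervention that combines multi-source inputs, belief memory, a cognitive trigger skill, and action generation to time interventions well in dynamic environments.

Details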
Comments
Extended version of the CVPR 2026 paper *MindPower: Enabling Theory-of-Mind Reasoning in VLM-based Embodied Agents*
Abstract

Theory of Mind (ToM) enables an agent to reason about another actor's beliefs, goals, and intentions, which is essential for human-centered embodied assistance. Existing ToM benchmarks have advanced text and multimodal mental-state recognition, but they mostly evaluate offline question answering or final action prediction. They do not fully test whether an embodied agent can stay connected to a changing environment, update actor-specific beliefs, decide when reasoning is needed, and intervene only when help is useful. Building on MindPower, we extend robot-centric ToM reasoning to a real-time closed-loop setting and introduce MindClaw, a framework for embodied mental-state reasoning with precision intervention. MindClaw connects multi-source inputs, belief memory, an embodied cognitive trigger skill, mental reasoning, and action generation, allowing the agent to output helpful actions at the right time while remaining silent when intervention is unnecessary. Experiments show that direct VLM baselines struggle with task awareness and intervention calibration, while MindClaw achieves the best overall performance, demonstrating the importance of trigger-skill optimization for closed-loop embodied ToM assistance.

2606.01062 2026-06-02 cs.AI

DAG-MoE: From Simple Mixture to Structural Aggregation in Mixture-of-Experts

Jiarui Feng, Hanqing Zeng, Karish Grover, Ruizhong Qiu, Yinglong Xia, Qiang Zhang, Qifan Wang, Ren Chen, Dongqi Fu, Jiayi Liu, Zhoukai Zhao, Xiangjun Fan, Benyu Zhang, Yixin Chen

Affiliations: University of Science and Technology of China; Tsinghua University; Nanyang Technological University; University of California, Berkeley; University of Washington

AI summary: Proposes DAG-MoE, which uses a lightweight module to learn an aggregation structure among the selected experts in place of the standard weighted sum; this expands the expert-combination space and allows multi-step reasoning within a single layer without changing the experts or the router, and it outperforms conventional MoE baselines in both pretraining and fine-tuning.

Details
Comments
Accepted by ICML 2026
Abstract

Mixture-of-Experts (MoE) models have become a leading approach for decoupling parameter count from computational cost in large language models, yet effectively scaling MoE performance remains a challenge. Prior work shows that fine-grained experts enlarge the space of expert combinations and improve flexibility, but they also impose substantial routing overhead, creating a new scalability bottleneck. In this paper, we explore a complementary axis for scaling -- how expert outputs are aggregated. We theoretically show that replacing the standard weighted-summation aggregation with structural aggregation expands the expert-combination space without altering the experts or router, and enables possible multi-step reasoning within a single MoE layer. To this end, we propose DAG-MoE, a sparse MoE framework that employs a lightweight module to automatically learn the optimal aggregation structure among the selected experts. Extensive experiments under standard language modeling settings show that DAG-MoE consistently improves performance in both pretraining and fine-tuning, surpassing traditional MoE baselines.
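
For reference, the standard weighted-summation aggregation that this abstract proposes to replace can be sketched as follows. This shows only the conventional baseline with top-k routing; the toy experts, logits, and function names are assumptions, and the paper's learned structural aggregation is not reproduced here.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_weighted_sum(x, experts, router_logits, top_k=2):
    """Standard sparse-MoE output: sum of g_i * E_i(x) over the top-k experts."""
    top = sorted(range(len(experts)),
                 key=lambda i: router_logits[i], reverse=True)[:top_k]
    gates = softmax([router_logits[i] for i in top])  # renormalise over selected experts
    out = [0.0] * len(x)
    for g, i in zip(gates, top):
        y = experts[i](x)
        out = [o + g * v for o, v in zip(out, y)]
    return out, top

# Toy experts (hypothetical): simple element-wise maps.
experts = [
    lambda v: [2.0 * a for a in v],
    lambda v: [a + 1.0 for a in v],
    lambda v: [-a for a in v],
]
out, chosen = moe_weighted_sum([1.0, 2.0], experts, router_logits=[0.0, 0.0, -5.0])
```

In this baseline every selected expert sees the same input and the outputs are only mixed linearly, so no expert can build on another's result. That restriction is what an aggregation structure over the selected experts would relax.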

2606.01057 2026-06-02 cs.CV cs.AI cs.GR cs.LG

3DCodeBench: Benchmarking Agentic Procedural 3D Modeling Via Code

Yipeng Gao, Lei Shu, Genzhi Ye, Xi Xiong, Ameesh Makadia, Meiqi Guo, Laurent Itti, Jindong Chen

Affiliations: Google DeepMind; University of Southern California; Google Research

AI summary: Proposes the 3DCodeBench benchmark, which evaluates how well 12 vision-language models turn text and image references into procedural 3D modeling code, and builds 3DCodeArena, a ranking platform based on human preferences.

Details
Comments
Project Page: https://www.3dcodebench.com/; 11 pages (main), with appendix
Abstract

Procedural 3D modeling through code is emerging as a versatile paradigm, offering deterministic, engine-ready, and precisely editable assets that neural 3D generators inherently lack. Authoring such procedural content, however, demands deep expertise in 3D software APIs, parametric design, and code-level geometric reasoning. In this paper, we propose 3DCodeBench, a systematic benchmark for evaluating vision-language model (VLM) agents for procedural 3D generation in 3D modeling software. Specifically, 3DCodeBench evaluates how effectively 12 advanced VLMs can serve as procedural 3D modelers by translating text and image references into procedural code for 3D modeling software. Recognizing that automated metrics may not fully capture the perceptual quality of 3D shapes, we build 3DCodeArena, a ranking platform based on pairwise human preferences over generated 3D outputs. From extensive evaluations and results, we observe that: (1) Failures mostly arise from API mismatches, while successful renders still suffer from disconnected or floating 3D geometric components. (2) Test-time scaling, such as higher thinking budgets and multi-turn refinement, improves performance overall. Our findings highlight a critical need for high-quality procedural coding data to advance commercial VLMs. Furthermore, effective procedural 3D modeling requires a robust execution environment that provides high-fidelity feedback for iterative refinement. We release 3DCodeBench, including the curated large-scale dataset of multimodal (text/image) prompts, procedural code, 3D object triplets, evaluation protocol, and the public 3DCodeArena platform as a foundational toolkit for exploring VLM-based procedural 3D modelers.

2606.01053 2026-06-02 cs.AI

AnyEdit++: Adaptive Long-Form Knowledge Editing via Bayesian Surprise

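Bowen Tian, Caixue He, Jiemin Wu, Jingying Wang, Wenshuo Chen, Zexi Li, Yutao Yue

Affiliations: arXiv.org; University of Science and Technology of China

AI summary: Proposes AnyEdit++, a framework that uses Bayes-Chunk, an adaptive segmentation mechanism based on Bayesian surprise, for structure-aware long-form knowledge editing, outperforming existing methods on mathematical reasoning, code generation, and narrative tasks.

Details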
Comments
Accepted by ICML 2026
Abstract

Editing complex, long-form knowledge in Large Language Models remains a significant challenge due to the difficulty of maintaining generation coherence. Existing autoregressive methods like AnyEdit alleviate length constraints but rely on Fixed-window Chunking, which disregards logical structure and compromises consistency. To address this, we present AnyEdit++, a structure-aware framework incorporating Bayes-Chunk, an adaptive segmentation mechanism that dynamically identifies semantic boundaries based on Bayesian Surprise. We underpin this approach with a theoretical framework establishing two key principles: (1) Structural Independence: we prove that cross-segment interference is minimized when anchor keys are geometrically orthogonal (a condition naturally satisfied by our surprisal-based boundaries but violated by fixed windows), and (2) Causal Locality: we demonstrate that updates injected at these semantic peaks yield strictly superior control compared to arbitrary split points. Extensive experiments across mathematical reasoning, code generation, and narrative tasks demonstrate that AnyEdit++ achieves superior performance and robustness compared to state-of-the-art baselines, validating that structural awareness is critical for effective long-form knowledge editing.
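
The contrast between fixed-window chunking and boundary detection driven by surprise can be illustrated with a small sketch. It uses per-token surprisal (negative log-probability) above a threshold as a stand-in for the boundary signal; the threshold rule, the surprisal values, and the function names are assumptions, since the abstract does not define how Bayes-Chunk computes Bayesian surprise.

```python
def fixed_window_chunks(tokens, size):
    # Baseline: cut every `size` tokens regardless of content.
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def surprisal_chunks(tokens, surprisals, threshold):
    """Start a new chunk at every token whose surprisal exceeds `threshold`."""
    chunks, current = [], []
    for tok, s in zip(tokens, surprisals):
        if current and s > threshold:
            chunks.append(current)
            current = []
        current.append(tok)
    if current:
        chunks.append(current)
    return chunks

tokens = ["Let", "x", "=", "2", ".", "Then", "x", "+", "1", "=", "3", "."]
# Hypothetical per-token surprisals (-log p under some language model), in nats.
surprisals = [4.1, 2.0, 0.5, 1.2, 0.3, 3.8, 0.9, 0.7, 0.6, 0.4, 0.8, 0.2]

adaptive = surprisal_chunks(tokens, surprisals, threshold=3.0)
fixed = fixed_window_chunks(tokens, size=4)
```

With these made-up values the adaptive split falls at "Then", the start of the second statement, while the window of four tokens cuts the first statement before its closing period.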

2606.01051 2026-06-02 cs.LG

Interaction-Limited Safe Continuous-Time RL for Dynamical Medical Treatment

Xun Shen, Yuepeng Wang, Akifumi Wachi, Yongqi Zhou, Richard Weiss, Yoshihiko Fujisawa, Ken Kawano, Mehrshad Sadria, Ying Chen, Xin Liu, Sebastien Gros, Xiao Hu, Kyoung-Sook Kim, Mengmou Li, Katsuki Fujisawa, Kenji Wakabayashi

Affiliations: Tokyo University of Agriculture and Technology; LY Corporation; National University of Singapore; Institute of Science Tokyo; Altos Labs, Inc.; National Institute of Advanced Industrial Science and Technology (AIST); Norwegian University of Science and Technology; Emory University; Hiroshima University

AI summary: Proposes an interaction-limited safe continuous-time reinforcement learning framework that jointly optimizes the treatment policy and the timing of clinical interactions through an option-based semi-Markov decision process, with a safety-tightening mechanism that guarantees trajectory-level safety.

Details
Abstract

Dynamic medical treatment requires deciding treatment intensity and intervention timing, while patient states evolve continuously and adverse events may occur between clinical interactions. Most existing treatment learning methods assume fixed schedules or enforce safety only at discrete decision points. We propose Interaction-Limited Safe Continuous-Time Reinforcement Learning, a framework that jointly optimizes treatment administration and clinical interaction timing under trajectory-level safety constraints. Our key idea is to reformulate the continuous-time treatment problem as an option-based semi-Markov decision process, where each option specifies a continuous-time treatment policy and its duration. We develop a safety-tightening mechanism showing that suitably constructed constraints at interaction times guarantee safety over the full continuous-time trajectory with high probability. We further establish finite-sample guarantees for policy learning from logged treatment trajectories and introduce a practical data-driven conservative surrogate. Experiments show that the proposed adaptive interaction-timing mechanism improves both safety and treatment effectiveness over equidistant interaction schemes across different safe policy optimization methods.

2606.01050 2026-06-02 cs.CV

TextFake: Benchmarking AI-Generated Image Detection on Text-Rich Images

Yuning Zhang, Changtao Miao, Mingyu Liao, Tingyu Liu, Xinghao Wang, Tao Gong, Qi Chu, Nenghai Yu

Affiliations: School of Cyber Science and Technology, University of Science and Technology of China; Anhui Province Key Laboratory of Digital Security; Individual Researcher

AI summary: To fill the gap in AI-generated image detection on text-rich images, builds the TextFake benchmark of 20,000 images in 28 languages, evaluates 14 detectors and 3 VLM APIs, finds a systematic performance gap, and diagnoses three failure modes.

Details
Abstract

Recent AI-generated image (AIGI) detectors perform well on natural-image benchmarks, but their behavior on text-rich forgeries, such as fabricated screenshots, documents, and news pages prevalent in misinformation, remains untested. We introduce TextFake, a 20,000-image benchmark for text-rich AIGI detection spanning 28 languages, 4 topic categories, and 2 scene modalities. Fake images are synthesized via a four-stage pipeline that annotates real images along three controlled dimensions and generates counterparts through distribution-aligned structured prompting, ruling out covariate shortcuts. Zero-shot evaluation of 14 specialized detectors and 3 frontier VLM APIs reveals a large systematic gap: no method exceeds 80% accuracy, with some dropping over 60% from natural-image benchmarks. Diagnostic evaluations identify three failure modes: the Text Density Curse, where dense glyphs overwhelm low-level detectors; Cloaking via Rendering Fidelity, where stronger text rendering suppresses generative artifacts; and Threshold Collapse, where routine perturbations drive detectors toward chance-level performance.

2606.01049 2026-06-02 cs.CL

PMC-InterCPT: Rethinking Biomedical Interleaved Data for Multimodal Continued Pretraining

Guanghao Zhu, Zeyu Liu, Zhitian Hou, Pengkai Wang, Zhijie Sang, Minheng Ni, Wenjun Wang, Yanggan Gu, Shuo Cai, Congkai Xie, Jianmin Wu, Hongxia Yang

Affiliations: The Hong Kong Polytechnic University; Sun Yat-sen University; InfiX.ai; PolyU-Daya Bay Technology and Innovation Research Institute

AI summary: To address missing context, structural noise, and related problems in image-text pair data for medical multimodal continued pretraining, proposes the PMC-InterCPT interleaved corpus, which recovers missing captions, cleans text, reconstructs interleaved samples, and applies LLM-supervised filtering; combined with modality-aware resampling, it improves both medical and general multimodal performance.

Details
Abstract

Large-scale biomedical image-text datasets extracted from scientific literature provide valuable resources for medical multimodal model training. These datasets are commonly organized as image-caption pairs; however, figure captions are often short, context-dependent, and only partially informative without the surrounding article text. At the same time, large-scale automatic extraction introduces structural noise such as missing captions, residual markup, duplicated context, and incoherent multi-paragraph figure descriptions. We revisit data construction for medical multimodal continued pretraining (CPT) and present PMC-InterCPT, a context-grounded biomedical interleaved corpus that incorporates figure-referencing body text in addition to captions. Our pipeline recovers missing captions, cleans caption and context text, reconstructs coherent interleaved image-text samples, and applies LLM-supervised medical relevance and quality classifiers to filter noisy records. We further reveal strong modality imbalance in the resulting corpus and introduce a four-bucket evidence taxonomy for modality-aware resampling. Through CPT followed by supervised fine-tuning (SFT) on Qwen3.5-4B-Base, PMC-InterCPT effectively improves medical and general multimodal performance while using fewer CPT tokens than the raw source pool. The experimental results also illustrate the complementarity between the data quality and modality for medical multimodal CPT.

2606.01048 2026-06-02 cs.CV

Decoupled Residual Denoising Diffusion Models for Unified and Data Efficient Image-to-Image Translation

Ziyue Lin, Jiahe Hou, Hongyu Xia, Xinrui Xie, Feifei Wang, Yuyin Zhou, Wei Wang, Jiawei Liu, Liangqiong Qu

Affiliations: The University of Hong Kong; Shenyang Institute of Automation, Chinese Academy of Sciences; The Chinese University of Hong Kong; University of California, Santa Cruz

AI summary: Proposes Decoupled Residual Denoising Diffusion models (DRDD), which decouple the diffusion process into two independent stages, a stochastic noise diffusion and a deterministic residual diffusion, for unified and data-efficient image-to-image translation.

Details
Comments
CVPR 2026
Abstract

We propose Decoupled Residual Denoising Diffusion models (DRDD) for unified and data-efficient image-to-image (I2I) translation. While diffusion models have advanced I2I translation in terms of quality and diversity, we uncover a previously under-explored property in diffusion models. Crucially, beyond its conventional role of manifold lifting (i.e., moving data off low-dimensional manifolds), injecting Gaussian noise facilitates domain harmonization by implicitly aligning feature distributions across domains, a property particularly advantageous for unified I2I translation. However, existing diffusion models prematurely erode this harmonization effect, as noise and residuals are simultaneously removed in a single coupled diffusion process. To address this, DRDD decouples the diffusion process into two sequential and independent diffusion stages: (1) a stochastic noise diffusion for domain harmonization and manifold lifting, and (2) a deterministic residual diffusion that learns the core semantic mapping entirely within the fixed-noise domain. This decoupling preserves harmonization and manifold lifting effects throughout the transformation, substantially simplifying the learning of unified mappings across diverse tasks and domains. Notably, the noise diffusion stage is trained exclusively on abundant, unpaired target-domain images, greatly improving data efficiency. Comprehensive theoretical and empirical analysis demonstrates that DRDD is broadly compatible with mainstream diffusion models and consistently delivers robust, unified I2I translation, even under limited paired data. Our code is available at https://github.com/HKU-HealthAI/DRDD.

2606.01047 2026-06-02 cs.RO

Learning Multi-Modal Trajectory Policies for Data-Efficient Robotic Manipulation

Zijia Chen, Yuenan Hou, Xinhua Jiang, Yu Li, Weijie Li, Li Liu

Affiliations: College of Electronic Science and Technology, National University of Defense Technology; Shanghai AI Laboratory

AI summary: To address the modality interference that data scarcity causes in robotic manipulation, proposes MATE, a Mixture-of-Experts trajectory prediction framework that uses fine-grained sub-token feature decoupling and a cross-modal cosine router for stable expert assignment, with significant gains on the LIBERO benchmark and in real-world ping-pong experiments.

Details
Abstract

Robotic manipulation requires the effective integration of heterogeneous inputs, including visual observations, language instructions, and trajectory representations, to generate accurate actions. Existing transformer-based policies typically process these heterogeneous modalities within a shared parameter space, which often leads to modality interference and inefficient representation learning, especially in data-scarce scenarios. While Mixture-of-Experts (MoE) offers a scalable solution through expert specialization, conventional routing mechanisms are often sensitive to such cross-modal representation discrepancies, resulting in unstable expert assignment and expert collapse. In this work, we propose MATE (Multi-ModAl TrajEctory Policies), a novel trajectory prediction framework built upon MoE. Specifically, we introduce a Multi-Modal MoE architecture to achieve fine-grained sub-token feature decoupling, and design a cross-modal cosine router for stable and scale-invariant expert assignment across heterogeneous modalities. We further employ temperature-controlled routing and stochastic noise injection to improve expert balance and prevent premature routing collapse under scarce demonstrations. Experiments on the LIBERO benchmark show that our MATE consistently outperforms prior work under data scarcity. It achieves a 4.75% improvement in average success rate over the trajectory-guided counterpart. Real-world experiments on robotic ping-pong also suggest that the predicted trajectories can provide useful guidance for downstream robotic execution, further indicating the practical feasibility of our algorithm.
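
The routing idea named in this abstract (cosine similarity for scale invariance, a temperature, and optional noise) can be illustrated with a generic sketch. This is a plain cosine router written under assumptions made here: the expert keys, the temperature value, and the Gaussian noise form are hypothetical, and the paper's cross-modal design and sub-token decoupling are not reproduced.

```python
import math
import random

def cosine(u, v):
    # Cosine similarity; both vectors are assumed to be nonzero.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def cosine_router(x, expert_keys, temperature=0.1, noise_std=0.0, rng=None):
    """Softmax over temperature-scaled cosine similarities to expert keys."""
    rng = rng or random.Random(0)
    logits = []
    for key in expert_keys:
        noise = rng.gauss(0.0, noise_std) if noise_std > 0 else 0.0
        logits.append(cosine(x, key) / temperature + noise)
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # hypothetical expert keys
p_small = cosine_router([0.2, 0.1], keys)
p_large = cosine_router([200.0, 100.0], keys)  # same direction, 1000x the norm
```

Because only the direction of the input enters the logits, the two inputs receive the same routing distribution even though their norms differ by a factor of 1000. A dot-product router would instead produce far sharper logits for the larger input, which is one way differences in feature scale between modalities can destabilize expert assignment.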

2606.01046 2026-06-02 cs.AI

TravelEval: A Comprehensive Benchmarking Framework for Evaluating LLM-Powered Travel Planning Agents

Weiyi Chen, Shuaixiong Wang, Ziyun Gao, Kaichun Hu, Wangze Ni, Shimin Di, Chen Jason Zhang, Lei Chen

Affiliations: Zhejiang University; Hong Kong Polytechnic University; Southeast University; HKUST (GZ) & HKUST Guangzhou

AI summary: To address existing benchmarks' overemphasis on constraint compliance and their lack of realism and multi-dimensional evaluation, proposes TravelEval, which evaluates LLM travel planning comprehensively through a six-dimensional evaluation framework, a realistic data sandbox, and a simulation-based global evaluation method.

Details
Comments
31pages, 8 figures, accepted by KDD 2026
AI中文摘要

大型语言模型(LLM)的发展显著提升了旅行规划应用,但现有基准的局限性限制了对其评估:1)过度强调约束合规,忽视时空成本等多维质量;2)数据集缺乏真实世界真实性和关键领域(如住宿、交通)的覆盖;3)孤立的每日计划评估遗漏了评估整个计划所需的关键细节(例如每日住宿和参观节奏的影响)。为解决这一差距,我们引入了TravelEval,一个真实且全面的基准。TravelEval具有1)一个新颖的六维评估框架,从准确性、合规性、时间性、空间性、经济性和实用性维度全面评估计划;2)一个高度真实的数据沙盒,包含精确的住宿定价和真实的城际交通数据;3)一种基于模拟的全局评估方法,通过集成API的地理信息和细粒度排队时间模拟完整的旅行计划。使用TravelEval评估12种主流方法揭示了若干有价值的见解,例如LLM在全局优化的多维规划(特别是时空推理和预算合规)方面存在困难,而代理推理策略并未提供一致的改进。简而言之,TravelEval通过基于现实的时空模拟和全面指标促进旅行计划评估,为推进基于LLM的旅行规划研究和应用提供了坚实基础。

English abstract

The development of Large Language Models (LLMs) has significantly improved travel planning applications, yet evaluating such models is constrained by three shortcomings of existing benchmarks: 1) overemphasis on constraint compliance, neglecting multi-dimensional qualities like spatio-temporal cost; 2) datasets lacking real-world authenticity and coverage in key areas (e.g., lodging, transport); and 3) isolated daily plan assessments that miss critical details (e.g., the impact of daily accommodation and visit pacing) needed to evaluate the entire plan. To address this gap, we introduce TravelEval, a realistic and comprehensive benchmark. TravelEval features 1) a novel six-dimensional evaluation framework to holistically assess plans across accuracy, compliance, temporality, spatiality, economy, and utility dimensions; 2) a highly realistic data sandbox with precise accommodation pricing and authentic intercity transportation data; and 3) a simulation-based global evaluation method that emulates complete travel plans with API-integrated geographic information and fine-grained queuing time. Evaluating 12 mainstream approaches with TravelEval reveals several valuable insights, for example that LLMs struggle with globally optimized multi-dimensional planning (especially in spatio-temporal reasoning and budget compliance) and that agentic reasoning strategies offer no consistent improvement. In short, TravelEval facilitates travel plan evaluation via grounded spatio-temporal emulation and comprehensive metrics, providing a robust foundation for advancing LLM-powered travel planning research and applications.

2606.01045 2026-06-02 cs.CL

Child-directed speech facilitates production, not comprehension, in BabyLMs

儿童导向语言促进BabyLMs的生成而非理解

Bastian Bunzeck, Sina Zarrieß

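Affiliations: Computational Linguistics, Department of Linguistics

AI summary: This study uses a frame-completion task to evaluate the effect of child-directed speech (CDS) on the production abilities of BabyLMs. CDS-trained models outperform models trained on web data at producing grammatical completions, and comprehension benchmarks underestimate the contribution of CDS.

Details
Comments
Accepted at CoNLL 2026
AI Chinese abstract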

近期研究表明,儿童导向语言(CDS)不利于BabyLMs的语言学习。然而,当前的评估主要关注理解而非生成,而生成是基于用法的语言习得理论的核心,该理论认为CDS通过构式“框架”(具有开放槽位的频繁词汇模式)促进早期语言使用。我们受这些理论启发,引入了一种新颖的基于生成的评估方法——框架补全任务,并比较了使用CDS、BabyLM语料库和网络爬取数据(FineWeb-edu)训练的Llama模型在理解基准测试和我们新框架上的表现。我们的结果揭示了模型的理解与生成能力之间存在明显的分离:虽然FineWeb训练的模型在最小对测试中表现优异,但CDS训练的模型在训练早期就能生成语法正确的补全,并将概率质量集中在合适的槽位填充词上。这些发现表明,理解基准测试低估了CDS对BabyLMs的贡献。

English abstract

Recent studies suggest that child-directed speech (CDS) is not conducive to language learning in BabyLMs. However, current evaluations focus predominantly on comprehension rather than production, which is central to usage-based theories of language acquisition. These theories argue that CDS facilitates early language use through constructional "frames" (frequent lexical patterns with open slots). We introduce a novel generation-based evaluation inspired by such theories in the form of a frame-completion task, and compare Llama models trained on CDS, the BabyLM corpus, and web-crawl data (FineWeb-edu) on comprehension benchmarks and our novel framework. Our results reveal a clear dissociation between models' comprehension and production capabilities: while FineWeb-trained models excel at minimal pairs, CDS-trained models produce grammatical completions substantially earlier in training and concentrate probability mass on appropriate slot-fillers. These findings show that comprehension benchmarks underestimate what CDS affords to BabyLMs.

2606.01044 2026-06-02 cs.CV

Ask4VG: Risk-Aware Question Selection for Reducing Prior-Driven Answers in Medical VQA

Ask4VG: 用于减少医学VQA中先验驱动答案的风险感知问题选择

Xiaorong Zhu, Qiang Li, Zibo Xu, Weijie Wang, Weizhi Nie

Affiliations: School of Microelectronics, Tianjin University, Tianjin 300072, China; DISI, University of Trento, Trento, Italy; School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China

AI summary: Proposes the Ask4VG framework, which estimates question-induced hallucination risk through counterfactual visual probing and reranks question rewrites to select questions that depend more on image evidence, reducing prior-driven answers in medical VQA.

Details
AI Chinese abstract

医学视觉问答要求模型将回答建立在图像证据上,因为缺乏视觉支持的答案可能误导下游解读。然而,许多医学VQA问题是通用的、模板化的或形式高度相似,这可能鼓励模型学习问答捷径而非依赖图像的推理,从而增加幻觉回答的风险。我们提出Ask4VG,一个无标签的试点框架,用于风险感知的问题选择。Ask4VG通过反事实视觉探测估计问题引发的幻觉风险:在原始图像、扰动图像、空白图像和错配图像下提出相同问题,并将得到的答案关系转换为反事实风险估计器的弱监督信号。然后,学习到的估计器对候选问题改写进行重排,以优先选择那些对缺失或错配视觉证据更不具不变性的、保留意图的问题,再进行最终答案生成。在VQA-RAD上使用Qwen2-VL-2B-Instruct,仅提示改写增加了反事实风险,而基于预测风险的重排将留出风险从0.658降至0.623,并将精确准确率从0.337提升至0.356。一个300样本的PMC-VQA外部检查显示了相同的风险降低方向,并伴有小幅准确率提升。这些结果表明,问题选择是响应级幻觉缓解的一个有前景的补充,有助于实现可靠的医学VQA。

English abstract

Medical visual question answering requires models to ground their responses in image evidence, because visually unsupported answers can mislead downstream interpretation. However, many medical VQA questions are generic, template-like, or highly similar in form, which can encourage models to learn question-answer shortcuts instead of image-dependent reasoning and thereby increase the risk of hallucinated responses. We propose Ask4VG, a label-free pilot framework for risk-aware question selection. Ask4VG estimates question-induced hallucination risk through counterfactual visual probing: the same question is asked under the original image, a perturbed image, a blank image, and a mismatched image, and the resulting answer relations are converted into weak supervision for a counterfactual risk estimator. The learned estimator then reranks candidate question rewrites to favor intent-preserving questions that are less invariant to missing or mismatched visual evidence before final answer generation. On VQA-RAD with Qwen2-VL-2B-Instruct, prompt-only rewriting increases counterfactual risk, whereas predicted-risk reranking reduces held-out risk from 0.658 to 0.623 and improves exact accuracy from 0.337 to 0.356. A 300-sample PMC-VQA external check shows the same direction of risk reduction with a small accuracy gain. These results suggest that question selection is a promising complement to response-level hallucination mitigation for reliable medical VQA.
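The counterfactual probing step described above can be sketched roughly as follows. This is an illustrative assumption, not the Ask4VG estimator: the paper trains a risk estimator from answer relations, whereas this sketch simply scores a question by how often its answer stays the same when the image is blank or mismatched. The function names, the dictionary keys, and the decision to leave the perturbed-image probe out of the score are all hypothetical.

```python
def counterfactual_risk(answers):
    """Fraction of counterfactual probes whose answer matches the original.

    `answers` maps probe names to answer strings and must contain the keys
    'original', 'blank', and 'mismatched' (hypothetical names). A score of
    1.0 means the answer never changed when the visual evidence was removed
    or swapped, which suggests a prior-driven answer.
    """
    def norm(s):
        return s.strip().lower()

    ref = norm(answers["original"])
    probes = ["blank", "mismatched"]
    unchanged = sum(norm(answers[k]) == ref for k in probes)
    return unchanged / len(probes)


def pick_rewrite(candidates):
    """Return the question rewrite with the lowest counterfactual risk.

    `candidates` maps each rewrite (a string) to its answers dict.
    """
    return min(candidates, key=lambda q: counterfactual_risk(candidates[q]))
```

In this simplified form, reranking amounts to preferring the rewrite whose answer is least invariant to missing or mismatched images. A learned estimator, as in the abstract, would predict this score without running every probe at inference time.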

2606.01042 2026-06-02 cs.LG cs.AI

Plausibility Is Not Prediction: Contrastive Evidence for LLM-Based Cellular Perturbation Reasoning

似真性不是预测:基于LLM的细胞扰动推理的对比证据

Xinyu Yuan, Xixian Liu, Jianan Zhao, Yashi Zhang, Hongyu Guo, Jian Tang

Affiliations: Mila - Québec AI Institute; University of Montréal; HEC Montréal; University of Ottawa; National Research Council of Canada; CIFAR AI Chair

AI summary: The paper finds that LLM-based cellular perturbation reasoning produces biologically plausible explanations but predicts poorly in practice, and proposes CORE, which organizes contrastive evidence to improve perturbation-specific prediction.

Details
AI Chinese abstract

扰动实验对于理解细胞机制至关重要,但成本高昂且稀疏,因此需要预测未观察条件下的基因表达响应。最近一个有前景的方向是利用大语言模型(LLM)作为“虚拟细胞”模拟器——通过逐步的、基于知识的机械推理来推断差异表达——指向一种可解释的、知识驱动的范式,超越了纯粹的数据驱动方法。然而,我们发现似真性不是预测:尽管产生了生物上合理的解释,这些方法未能捕捉扰动特异性效应:系统性地高估差异表达,在聚合评估中通常表现不如简单的基因频率基线,并且在每个基因水平上降至随机水平。这揭示了对内在基因响应倾向的依赖,而非真正的扰动推理。我们将这一失败追溯到证据呈现方式:现有方法孤立地评估扰动-基因对,而不揭示相关扰动对同一基因的影响差异。为解决这一局限性,我们引入了CORE(对比关系证据组织),通过将证据组织成来自相关扰动的正面和负面结果,将预测重新定义为比较任务。使用生物医学知识图谱进行证据检索,CORE在基于LLM和非LLM的设置中均改善了校准并大幅提升了扰动特异性预测:例如,在药物扰动数据上,CORE-Reasoning将Qwen3.5-9B的聚合指标提升了高达28.6%;而在通用扰动数据上,CORE-Voting将四个细胞系的每个基因平均AUROC从随机水平提高到0.703。这突显了对比证据组织对于可靠的基于LLM的扰动推理至关重要。

English abstract

Perturbation experiments are central to understanding cellular mechanisms, but remain costly and sparse, motivating prediction of gene expression responses for unobserved conditions. A promising recent direction leverages large language models (LLMs) as "virtual cell" simulators, using stepwise, knowledge-grounded mechanistic reasoning to infer differential expression, and points toward an interpretable, knowledge-driven paradigm that transcends purely data-driven approaches. However, we find that plausibility is not prediction: despite producing biologically plausible explanations, these methods fail to capture perturbation-specific effects. They systematically overestimate differential expression, often underperform a simple gene-frequency baseline in aggregate evaluations, and collapse to chance-level performance at the per-gene level. This reveals a reliance on intrinsic gene response tendencies rather than true perturbation reasoning. We trace this failure to how evidence is presented: existing methods evaluate perturbation-gene pairs in isolation, without exposing how related perturbations differ in their effects on the same gene. To address this limitation, we introduce CORE (Contrastive Organization of Relational Evidence), which reframes prediction as a comparison task by organizing evidence into positive and negative outcomes from related perturbations. Using a biomedical knowledge graph for evidence retrieval, CORE improves calibration and substantially boosts perturbation-specific prediction in both LLM-based and non-LLM settings: for example, on drug-perturbation data, CORE-Reasoning improves Qwen3.5-9B aggregate metrics by up to 28.6%, while on generic perturbation data, CORE-Voting raises macro-per-gene AUROC from chance to 0.703 on average across four cell lines. This highlights contrastive evidence organization as essential to reliable LLM-based perturbation reasoning.

2606.01041 2026-06-02 cs.CL

ExpWeaver: LLM Agents Learn from Experience via Latent RAG

ExpWeaver: LLM智能体通过潜在RAG从经验中学习

Tao Feng, Tianyang Luo, Jingjun Xu, Zhigang Hua, Yan Xie, Shuang Yang, Ge Liu, Jiaxuan You

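Affiliations: University of Science and Technology of China

AI summary: Proposes ExpWeaver, a framework that lets LLM agents learn from experience through latent retrieval-augmented generation (with no separate RAG module). It encodes experiences with hidden states, retrieves in latent space, and aggregates via cross-attention, reaching state-of-the-art results on 12 of 13 tasks with high token efficiency and strong cross-domain generalization.

Details
AI Chinese abstract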

经验学习通过将过去的交互整合为可重用知识,在增强LLM智能体规划和推理方面取得了有希望的结果。然而,现有方法仍局限于显式文本空间,通过语义相似性检索经验并将其拼接到上下文窗口中,导致大量token开销以及将检索与生成分离的解耦架构。为了解决这些限制,我们提出了ExpWeaver,一个使LLM智能体能够通过潜在检索增强生成从经验中学习的框架,无需单独的RAG模块。ExpWeaver使用LLM自身的隐藏状态编码经验,在每个解码步骤直接在潜在空间中检索相关经验,并通过交叉注意力聚合和门控残差机制进行整合。整个流程通过强化学习进行端到端优化,支持生成和排序任务。我们在涵盖问答、推理、编程、科学预测和推荐的13个不同任务上评估了ExpWeaver。结果表明,ExpWeaver在13个任务中的12个上达到了最先进的性能,比最强基线高出6.8%以上;保持了与非检索基线相当的token效率,而基于文本的检索方法需要1.5到2倍的token;并展现出卓越的跨域泛化能力,在零样本迁移下比最强基线高出16.32%,在少样本迁移下高出15.21%。我们的ExpWeaver代码已在https://github.com/ulab-uiuc/ExpWeaver发布。

English abstract

Experience learning has achieved promising results in enhancing LLM agent planning and reasoning by integrating past interactions as reusable knowledge. However, existing methods remain confined to explicit text space, retrieving experiences via semantic similarity and concatenating them into the context window, leading to substantial token overhead and a decoupled architecture that separates retrieval from generation. To address these limitations, we propose ExpWeaver, a framework that enables LLM agents to learn from experience via latent retrieval-augmented generation, without requiring a separate RAG module. ExpWeaver encodes experiences using the LLM's own hidden states, retrieves relevant experiences directly in latent space at each decoding step, and integrates them through cross-attention aggregation and gated residual mechanisms. The entire pipeline is optimized end-to-end with reinforcement learning, supporting both generative and ranking tasks. We evaluate ExpWeaver on 13 diverse tasks spanning question answering, reasoning, coding, scientific prediction, and recommendation. Results demonstrate that ExpWeaver achieves state-of-the-art performance on 12 out of 13 tasks, outperforming the strongest baseline by over 6.8%; maintains token efficiency comparable to non-retrieval baselines while text-based retrieval methods require 1.5 to 2 times more tokens; and exhibits superior cross-domain generalization, outperforming the strongest baseline by 16.32% under zero-shot transfer and 15.21% under few-shot transfer. Our code for ExpWeaver is released at https://github.com/ulab-uiuc/ExpWeaver.

2606.01039 2026-06-02 cs.LG cs.AI

OPD+: Rethinking the Advantage Design for On-Policy Distillation

OPD+: 重新思考在线策略蒸馏中的优势设计

Hanyang Zhao, Haoxian Chen, Han Lin, Genta Indra Winata, David Yao, Wenpin Tang

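Affiliations: Columbia University; Amazon; Meta; Capital One

AI summary: Proposes OPD+, which corrects the bias in the reward objective caused by the stop-gradient operation in on-policy distillation and supports multiple f-divergences, improving performance on mathematical reasoning and tool-use benchmarks.

Details
AI Chinese abstract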

在线策略蒸馏(OPD)是一种广泛使用的技术,用于将能力强的教师语言模型的能力迁移到基础学生模型,并且可以通过使用学生生成的轨迹来制定强化学习风格的目标。然而,尽管散度奖励依赖于学生模型的可能性,现有工作通常采用停止梯度设计主要是为了稳定性,这使得得到的优势估计存在问题。在这项工作中,我们提供了一个基于学生和教师之间f-散度的通用优化框架,并从数学上重新审视这种设计空间是否有效。我们证明,对于一般的散度函数,一般的停止梯度操作会导致奖励目标和相应梯度的有偏估计。我们提出了OPD+,这是OPD的修正版本,在基线KL方法上展示了改进的性能,并且也支持各种f-散度的选择。我们在数学推理和工具使用基准上验证了我们的发现。

English abstract

On-policy distillation (OPD) is a widely used technique to transfer capabilities from capable teacher language models to base student models, and can be formulated as a reinforcement-learning-style objective using student-generated rollouts. Yet, although the divergence reward depends on the student model's likelihood, existing works usually adopt a stop-gradient design, primarily for stability, which makes the resulting advantage estimation questionable. In this work, we provide a generic optimization framework based on the f-divergence between the student and teacher, and mathematically revisit whether such a design is valid. We prove that, for general divergence functions, a generic stop-gradient operation leads to biased estimates of the reward objective and the corresponding gradient. We propose OPD+, a corrected version of OPD that demonstrates improved performance over the baseline KL approach and also supports a range of f-divergences. We validate our findings on mathematical reasoning and tool-use benchmarks.

2606.01038 2026-06-02 cs.RO

Robust Integrated Planning and Control for Quadrotors in Dynamic Environments via NMPC with CBF Penalties

动态环境中四旋翼飞行器的鲁棒集成规划与控制:基于带CBF惩罚的NMPC

Zeinab Shayan, Mohammadreza Izadi, Reza Faieghi

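Affiliations: Autonomous Vehicles Laboratory, Department of Aerospace Engineering, Toronto Metropolitan University

AI summary: Proposes a robust integrated planning and control strategy that embeds control barrier functions as exponential penalties in nonlinear model predictive control, with a high-gain disturbance observer and a Kalman filter to strengthen robustness, achieving safe obstacle avoidance in dynamic environments.

Details
Comments
Accepted to Conference on Robots and Vision (CRV 2026), Vancouver, Canada
AI Chinese abstract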

本文提出了一种新的多旋翼无人飞行器鲁棒集成规划与控制策略。我们提出了一种非线性模型预测控制公式,将控制障碍函数作为指数惩罚嵌入,在严格输入约束下提高可行性并确保平滑避障。惩罚权重提供了一个实用的调节旋钮,用于在跟踪精度和避障激进程度之间进行权衡。我们通过采用高增益扰动观测器来估计和补偿外部扰动,从而增强系统鲁棒性。我们还结合了卡尔曼滤波器,用于计算高效的实时障碍物运动预测,从而实现对移动障碍物的规避。与传统的NMPC以及带有硬CBF约束的NMPC的对比研究,在Gazebo和硬件实验中得到了验证,展示了优越的可行性、安全性和鲁棒性。据我们所知,这是首个经过硬件验证的NMPC-CBF IPC框架,为四旋翼飞行器在动态环境中的安全部署迈出了实际的一步。

English abstract

This paper presents a new robust integrated planning and control (IPC) strategy for multirotor uncrewed aerial vehicles. We propose a nonlinear model predictive control (NMPC) formulation that embeds control barrier functions (CBFs) as exponential penalties, improving feasibility while ensuring smooth obstacle avoidance under tight input bounds. The penalty weights provide a practical tuning knob to trade off tracking accuracy against avoidance aggressiveness. We enhance the system robustness by employing a high-gain disturbance observer (HGDO) to estimate and compensate for external disturbances. We also incorporate a Kalman filter (KF) for computationally efficient, real-time prediction of obstacle motion, enabling avoidance of moving obstacles. Comparative studies against both conventional NMPC and NMPC with hard CBF constraints, validated in Gazebo and hardware experiments, demonstrate superior feasibility, safety, and robustness. To the best of our knowledge, this is the first hardware-validated NMPC-CBF IPC framework, offering a practical step toward safe quadrotor deployment in dynamic environments.
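To make the idea of a CBF embedded as an exponential penalty more concrete, here is a minimal sketch of one possible stage-cost term. The abstract does not give the exact formulation, so everything here is an assumption: the barrier h(x) = ||p - p_obs||^2 - r^2, the discrete-time CBF condition h(x_{k+1}) - (1 - gamma) h(x_k) >= 0, and the parameter names `gamma`, `weight`, and `alpha` are chosen for illustration only.

```python
import math


def barrier(pos, obstacle, radius):
    """h(x) = ||p - p_obs||^2 - r^2; h >= 0 means the vehicle is outside the obstacle."""
    return sum((p - o) ** 2 for p, o in zip(pos, obstacle)) - radius ** 2


def cbf_penalty(pos_k, pos_next, obstacle, radius, gamma=0.2, weight=1.0, alpha=5.0):
    """Soft penalty on a (hypothetical) discrete-time CBF condition.

    The hard constraint would require margin >= 0. Here the margin enters
    the cost through an exponential instead, so the optimizer always has a
    feasible problem but pays steeply as the margin becomes negative.
    """
    h_k = barrier(pos_k, obstacle, radius)
    h_next = barrier(pos_next, obstacle, radius)
    margin = h_next - (1.0 - gamma) * h_k
    return weight * math.exp(-alpha * margin)
```

In an NMPC cost this term would be summed over the prediction horizon alongside the tracking cost. Raising `weight` makes avoidance more aggressive at the expense of tracking accuracy, which is the trade-off the abstract attributes to the penalty weights.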

2606.01036 2026-06-02 cs.RO

Position: Good Embodied Reward Models Need Bad Behavior Data

立场:好的具身奖励模型需要不良行为数据

Ran Tian, Yilin Wu, Andrea Bajcsy

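Affiliations: Ran Tian, Yilin Wu, Andrea Bajcsy

AI summary: Argues that reliable embodied reward models require the community to invest in "bad" robot data (failed, suboptimal, error-prone, and even hazardous behaviors), and shows experimentally that even a small amount of real bad data improves alignment with human preferences.

Details
Comments
This position paper has been accepted by the ICML 2026 position track as a spotlight paper
AI Chinese abstract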

这篇立场论文认为,为了获得可靠的具身奖励模型,社区必须投资于“不良”机器人数据:失败、次优、易错甚至危险的行为。虽然奖励模型是任何基础模型生命周期的核心,但今天的具身奖励模型主要基于成功行为进行训练。我们分析了三个最先进的具身奖励模型,发现它们系统性地过度奖励那些真实人类评估者会惩罚的行为,包括不安全交互、糟糕执行以及仅表面满足任务的捷径策略。我们将这些失败归因于一个关键的数据缺口:负面具身数据的稀缺性,这些数据收集成本高昂,并且在现有的机器人数据集中经常被过滤掉或保留。此外,我们表明,即使是少量真实不良行为数据也能改善与人类偏好的一致性,并减少代价高昂的误报。因此,我们呼吁具身AI社区整理并发布他们的不良机器人数据,构建合成不良数据生成引擎,开发更去中心化的物理评估系统,并设计用于细粒度具身奖励模型评估的基准。

English abstract

This position paper argues that to obtain reliable embodied reward models, the community must invest in "bad" robot data: failed, suboptimal, error-prone, and even hazardous behaviors. While reward models are central to any foundation model's lifecycle, today's embodied reward models are trained primarily on successful behaviors. We analyze three state-of-the-art embodied reward models and find that they systematically over-reward behaviors that real human evaluators would penalize, including unsafe interactions, poor execution, and shortcut strategies that only superficially satisfy tasks. We attribute these failures to a key data gap: the scarcity of negative embodied data, which is costly to collect and often filtered out or withheld in existing robotics datasets. Furthermore, we show that even modest exposure to real bad behavior data can improve alignment with human preferences and reduce costly false positives. We therefore call on the embodied AI community to curate and release their bad robot data, build synthetic bad data generation engines, develop more decentralized physical evaluation systems, and design benchmarks for fine-grained embodied reward model evaluations.

2606.01034 2026-06-02 cs.CL stat.ME

A Finite-Calibration Regime Map for LLM Judge Panels

有限校准机制图:LLM评审团面板

Bin Zhu, Yanghui Rao

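Affiliations: School of Computer Science and Engineering

AI summary: Studies the calibration trade-off between low-dimensional stackers and joint output tables for LLM judge panels under a finite human-label budget, and proposes Finite-Calibration Panel Selection; experiments indicate that most judge outputs are additive or redundant.

Details
Comments
Work in Progress
AI Chinese abstract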

我们研究了在有限人工标注预算下,LLM评审团面板应何时使用低维堆叠器与联合输出表进行校准。低维堆叠器估计成本小但忽略交互,而联合表校准器可表示交互但需为单元格计数和未见模式付出代价。我们将此权衡构建为有限校准机制图,并实例化为有限校准面板选择——一种可部署的验证选择器,涵盖评审路径、前缀大小和聚合器家族,并辅以表格和参数估计诊断。在RewardBench、LLMBar、SummEval和Arena100K上,使用包含DeepSeek V4 Flash的七评审池,标量/可靠性聚合在20个真实数据集-预算单元中赢得16个,表明当前评审输出通常是可加或冗余的。受控的校准增长数据显示互补机制:可加标签仍偏好标量,而六路交互选择更大的联合表,其测试MSE从未见质量消失前的0.224降至0.061。因此,实际问题不是“需要多少评审?”,而是下一个评审的信息在可用人工标注下是否可估计。

English abstract

We study when LLM judge panels should be calibrated with low-dimensional stackers versus joint output tables under finite human-label budgets. Low-dimensional stackers have small estimation cost but miss interactions, whereas joint-table calibrators can represent interactions but pay for cell counts and unseen patterns. We cast this tradeoff as a finite-calibration regime map and instantiate it as Finite-Calibration Panel Selection, a deployable validation selector over judge path, prefix size, and aggregator family with table and parametric estimation diagnostics. On RewardBench, LLMBar, SummEval, and Arena100K with a seven-judge pool including DeepSeek V4 Flash, scalar/reliability aggregation wins 16 of 20 real dataset-budget cells, indicating that current judge outputs are often additive or redundant. Controlled calibration-growth data show the complementary regime: additive labels remain scalar-favored, whereas a six-way interaction selects a larger joint table and its test MSE drops from 0.224 to 0.061 once unseen mass vanishes. Thus the practical question is not "how many judges?" but whether the next judge's information is estimable under the available human labels.

2606.01033 2026-06-02 cs.AI

TriLens: Per-Layer Logit-Lens Entropy for White-Box Hallucination Detection

TriLens: 基于逐层Logit-Lens熵的白盒幻觉检测

Bohan Yang, Yijun Gong, Zhi Zhang, Ge Zhang, Wenpeng Xing, Meng Han

Affiliations: Binjiang Institute of Zhejiang University; Beijing Normal-Hong Kong Baptist University; Zhejiang University; GenTel.io; Great Bay University

AI summary: Proposes TriLens, which reads the logit-lens output entropy of the multi-head self-attention, feed-forward network, and residual stream at every Transformer layer to build a compact 3L-dimensional trajectory for detecting LLM hallucinations.

Details
AI Chinese abstract

当语言模型产生幻觉时,最终答案是错误的,但错误在模型内部并非不可见。不同的内部路径可能保持不确定,在锐化速度上不一致,或在输出产生前承诺相互竞争的延续。我们提出TriLens,一种白盒检测器,将这一直觉转化为紧凑表示:在每一层,它通过模型自身的logit透镜读取多头自注意力输出、前馈输出和残差流,然后仅记录每个读出的熵。得到的3L维轨迹描述了确定性如何跨深度和跨模块形成,无需存储高维隐藏状态或采样多个生成。这一简单信号在指令微调LLM和QA基准测试中产生了强大的检测器,我们的分析表明,三个模块的熵轨迹提供了互补证据。TriLens表明,幻觉检测可以从跟踪内部计算如何稳定中受益,而不仅仅是最终层的预测。

English abstract

When a language model hallucinates, the final answer is wrong, but the mistake is not necessarily invisible inside the model. Different internal pathways may remain uncertain, disagree in how quickly they sharpen, or commit to competing continuations before the output is produced. We introduce TriLens, a white-box detector that turns this intuition into a compact representation: at every layer, it reads the multi-head self-attention output, the feed-forward output, and the residual stream through the model's own logit lens, then records only the entropy of each readout. The resulting 3L-dimensional trajectory describes how certainty forms across depth and across modules, without storing high-dimensional hidden states or sampling multiple generations. This simple signal yields a strong detector across instruction-tuned LLMs and QA benchmarks, and our analyses show that the three module-wise entropy trajectories provide complementary evidence. TriLens suggests that hallucination detection can benefit from tracking how internal computation settles, not only what the final layer predicts.
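The 3L-dimensional entropy trajectory described above can be sketched in a few lines. This is a simplified illustration rather than the TriLens code: the function names are hypothetical, the per-layer vectors are assumed to be already extracted from the model, and the final normalization that a logit lens usually applies before the unembedding is omitted for brevity.

```python
import math


def readout_entropy(hidden, unembed):
    """Shannon entropy (in nats) of softmax(unembed @ hidden).

    `unembed` is a list of vocabulary rows, each the same length as `hidden`.
    """
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in unembed]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_trajectory(layers, unembed):
    """Build the 3L-dimensional feature vector.

    `layers` is a list with one (attn_out, ffn_out, resid) triple per layer.
    Only one scalar entropy is kept per readout, so no hidden states need
    to be stored.
    """
    features = []
    for attn_out, ffn_out, resid in layers:
        for vec in (attn_out, ffn_out, resid):
            features.append(readout_entropy(vec, unembed))
    return features
```

The resulting vector could then be passed to any lightweight classifier. Which classifier the authors use is not stated in the abstract, so that choice is left open here.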

2606.01028 2026-06-02 cs.LG

MedGym: A Unified Continuous-Time Benchmark for Dynamic Medical Treatment Reinforcement Learning

MedGym:面向动态医疗治疗强化学习的统一连续时间基准

Yuepeng Wang, Ken Kawano, Yongqi Zhou, Yoshihiko Fujisawa, Richard Weiss, Akifumi Wachi, Katsuki Fujisawa, Ying Chen, Mehrshad Sadria, Xin Liu, Kyoung-Sook Kim, Xiao Hu, Sebastien Gros, Xun Shen

Affiliations: Tokyo University of Agriculture and Technology; Institute of Science Tokyo; National University of Singapore; LY Corporation; Altos Labs, Inc.; National Institute of Advanced Industrial Science and Technology (AIST); Emory University; Norwegian University of Science and Technology

AI summary: Proposes the MedGym benchmark, which builds configurable medical RL environments using a continuous-time framework and physics-informed neural networks, supports comparison of discrete-time and continuous-time methods under irregular treatment intervals, and evaluates clinical criteria such as personalization and trajectory-level safety.

Details
AI Chinese abstract

医疗治疗推荐给强化学习(RL)带来了若干挑战:患者生理状态在连续时间内演变,测量和干预以不规则间隔进行,且治疗效果在不同个体间差异显著。然而,现有的RL公式和模拟环境基于离散时间的MDP或POMDP抽象,具有固定或预先指定的决策间隔。因此,评估RL方法能否处理时间间隔依赖的疾病进展、个性化治疗反应以及连续测量点之间的安全性仍然困难。为弥补这一空白,我们引入了MedGym,一个用于动态治疗推荐的基准环境。MedGym在连续时间框架中对纵向患者演变进行建模,并通过使用物理信息神经网络从临床数据构建可配置的医疗RL基准。所得基准支持离线RL和在线RL,并能够在非规则治疗时机和患者特定动态下直接比较离散时间与连续时间方法。此外,MedGym支持从临床重要角度进行评估,包括个性化、轨迹级安全性以及基于模型的离线学习与在线部署之间的性能差距。通过为连续时间动态治疗提供标准化且可配置的基准,MedGym旨在促进对医疗RL方法进行更真实、更具信息量的评估。

English abstract

Medical treatment recommendation poses several challenges to reinforcement learning (RL): patient physiology evolves in continuous time, measurements and interventions are performed at irregular intervals, and treatment effects vary substantially across individuals. Existing RL formulations and simulated environments, however, are based on discrete-time MDP or POMDP abstractions with fixed or pre-specified decision intervals. Thus, it remains difficult to evaluate whether RL methods can handle time-interval-dependent disease progression, personalized treatment response, and safety between consecutive measurement points. To address this gap, we introduce MedGym, a benchmark environment for dynamic treatment recommendation. MedGym models longitudinal patient evolution in a continuous-time framework and constructs a configurable medical RL benchmark from clinical data by using Physics-Informed Neural Networks. The resulting benchmark supports both offline and online RL, and enables direct comparison between discrete-time and continuous-time methods under irregular treatment timing and patient-specific dynamics. In addition, MedGym supports evaluation from clinically important perspectives, including personalization, trajectory-level safety, and the performance gap between model-based offline learning and online deployment. By providing a standardized and configurable benchmark for continuous-time dynamic treatment, MedGym aims to facilitate more realistic and informative evaluation of medical RL methods.

2606.01027 2026-06-02 cs.RO

$\tau_0$-WM: A Unified Video-Action World Model for Robotic Manipulation

$\tau_0$-WM:一种用于机器人操作的统一视频-动作世界模型

Pengfei Zhou, Shengcong Chen, Di Chen, Jiaxu Wang, Rongjun Jin, Bingwen Zhu, Yike Pan, Songen Gu, Kuanning Wang, Shufeng Nan, Xingyu Qiu, Chenhao Qiu, Pu Yang, Yunuo Cai, Jianxiong Gao, Yifan Li, Yanwei Fu, Xiangyu Yue, Zhi Chen, Jianlan Luo

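Affiliations: Shanghai Innovation Institute; AGIBOT Finch

AI summary: Proposes $\tau_0$-WM, a unified video-action world model that integrates policy learning, video prediction, and action evaluation on a shared video diffusion backbone, outperforming baselines on long-horizon and fine-grained manipulation tasks.

Details
Comments
Our project homepage: https://finch.agibot.com/research/tau0-wm
AI Chinese abstract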

机器人操作需要能够生成可执行动作并在物理执行前预测和评估其未来后果的模型。我们提出$\tau_0$-世界模型($\tau_0$-WM),一个统一的视频-动作世界模型,在单个未来预测框架内整合了策略学习、视频预测和动作评估。基于共享的视频扩散骨干,$\tau_0$-WM提供两个互补接口。首先,一个视频动作模型从多视角观察、语言指令和机器人状态中联合预测未来视觉潜变量和连续动作块。其次,一个动作条件视频模拟器将候选动作块展开为多视角未来并预测密集的任务进度分数。该模型在大约27,300小时的实机遥操作、UMI风格交互、自我中心人类视频以及使用模态特定监督掩码的展开或失败轨迹上进行训练。在推理时,$\tau_0$-WM利用测试时计算来采样动作候选,通过重新去噪一致性对其进行排序,并对低质量候选调用基于模拟器的修正。在具有挑战性的长时域和精细机器人操作任务上,$\tau_0$-WM表现出优于其他相关基线的性能。

English abstract

Robotic manipulation requires models that generate executable actions while anticipating and evaluating their future consequences before physical execution. We present $\tau_0$-World Model ($\tau_0$-WM), a unified video-action world model that integrates policy learning, video prediction, and action evaluation within a single future-predictive framework. Built on a shared video diffusion backbone, $\tau_0$-WM provides two complementary interfaces. First, a video action model jointly predicts future visual latents and continuous action chunks from multi-view observations, language instructions, and robot state. Second, an action-conditioned video simulator rolls out candidate action chunks into multi-view futures and predicts dense task-progress scores. The model is trained on approximately 27,300 hours of real-robot teleoperation, UMI-style interaction, egocentric human videos, and rollout or failure trajectories using modality-specific supervision masks. At inference time, $\tau_0$-WM uses test-time computation to sample action candidates, rank them with re-denoising consistency, and invoke simulator-based rectification for low-quality candidates. On challenging long-horizon and fine-grained robotic manipulation tasks, $\tau_0$-WM shows superior performance over other relevant baselines.

2606.01026 2026-06-02 cs.CL

Revise, Don't Freeze: Sampler-Matched Training for Self-Correcting Masked Diffusion Language Models

修正而非冻结:面向自修正掩码扩散语言模型的采样器匹配训练

Longxuan Yu, Shaorong Zhang, Yu Fu, Hui Liu, Yue Dong, Greg Ver Steeg

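Affiliations: University of California, Riverside; Microsoft

AI summary: Observing that masked diffusion language models leave their ability to revise visible tokens unused during denoising, the paper proposes D3IM, a sampler that needs no extra modules, and SCOPE, a lightweight post-training method, which together substantially improve math and code generation.

Details
Comments
8 pages, 2 figures, 10 tables
AI Chinese abstract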

掩码扩散语言模型(MDLMs)在每个去噪步骤重新预测每个位置,但标准采样器一旦揭示标记就将其固定,导致这种修正能力未被使用。现有方法要么添加启发式或学习机制来修正已提交的标记,要么在重新预测前将其重新掩码为[MASK];一种无需辅助模块、直接修正可见标记的原则性采样器仍未被充分探索。我们提出了D3IM,一种无参数采样器,作为校正器风格的反向更新推导而来,允许无需额外模块或辅助传递的直接可见到可见修正。D3IM还揭示了一个我们称为保留偏差的模型侧障碍:模型倾向于重现自身错误的已提交标记而非修正它们。我们通过SCOPE(基于预测误差的自条件化)解决这一问题,这是一种轻量级的后训练过程,模拟D3IM的采样过程。在LLaDA-8B上使用64个去噪步骤时,SCOPE+D3IM相比原始LLaDA-8B标准去掩码在GSM8K上提升+13.0(68.3%),在MATH-500上提升+4.8(23.6%),在HumanEval上提升+15.3(29.3%),在MBPP上提升+10.4(30.8%),且在数学和HumanEval上随着去噪步骤增加,提升幅度更大。

English abstract

Masked diffusion language models (MDLMs) re-predict every position at each denoising step, but standard samplers commit tokens once revealed, leaving this revision capability unused. Existing approaches either add heuristic or learned mechanisms to revise committed tokens, or remask them back to [MASK] before re-predicting; a principled sampler that directly revises visible tokens without auxiliary modules remains underexplored. We introduce D3IM, a parameter-free sampler derived as a corrector-style reverse update that permits direct visible-to-visible revision without additional modules or auxiliary passes. D3IM also reveals a model-side obstacle we term preservation bias: the model tends to reproduce its own wrong committed tokens rather than correct them. We address this with SCOPE (Self-Conditioned On Prediction Errors), a lightweight post-training procedure that simulates D3IM's sampling process. On LLaDA-8B at 64 denoising steps, SCOPE+D3IM improves over the original LLaDA-8B with standard unmasking by +13.0 on GSM8K (68.3%), +4.8 on MATH-500 (23.6%), +15.3 on HumanEval (29.3%), and +10.4 on MBPP (30.8%), with gains that increase as more denoising steps are used on math and HumanEval.

2606.01024 2026-06-02 cs.CL cs.AI

DSL-LLaDA: Scaling Continuous Denoising to 8B Masked Diffusion LMs

DSL-LLaDA: 将连续去噪扩展到8B掩码扩散语言模型

Longxuan Yu, Yunshu Wu, Yu Fu, Siheng Xiong, Rob Brekelmans, Hui Liu, Yue Dong, Greg Ver Steeg

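Affiliations: University of California, Riverside; Georgia Institute of Technology; Microsoft

AI summary: Uses Discrete Stochastic Localization (DSL) to lightly adapt a pretrained masked diffusion language model (LLaDA-8B-Instruct) to continuous embedding-space denoising, producing high-quality summaries at low step counts while avoiding the length-quality trade-off.

Details
Comments
8 pages, 4 figures, 28 tables
AI Chinese abstract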

离散掩码扩散语言模型通过迭代并行解码生成文本,但少步解码面临长度与质量之间的权衡:在固定步数预算下,标准方法可以生成短而高质量的输出,或者产生长但重复的文本。连续去噪可以通过在嵌入空间中联合演化所有位置来规避这种权衡,但从头开始构建这样的模型仍是一个开放问题。我们证明,预训练的掩码DLM可以轻量适配以支持连续嵌入空间去噪。从LLaDA-8B-Instruct开始,我们仅用1,000步进行离散随机定位(DSL)的继续预训练,将二元掩码替换为连续的逐token高斯噪声作为软掩码。适配后的模型支持连续推理,在嵌入空间中联合演化所有位置,并将硬token承诺推迟到最后一步。在低步数预算(<=16次前向传播)下的零样本摘要任务中,DSL-LLaDA-SDE在所有四个基准上取得了最佳ROUGE-1,并很大程度上避免了迭代去掩码的过早终止/重复权衡。同样的适配还产生了选择性噪声状态鲁棒性:模型在保留干净token的同时纠正损坏的token。使用相同计算量的标准掩码扩散训练对照实验未表现出这两种行为。

English abstract

Discrete masked diffusion language models generate text by iterative parallel decoding, but few-step decoding suffers from a tradeoff between length and quality: with a fixed step budget, standard methods can generate a short, high-quality output, or they can produce long but repetitive text. Continuous denoising can sidestep this tradeoff by evolving all positions jointly in embedding space, but building such a model from scratch at scale remains an open problem. We show that a pretrained masked DLM can instead be lightly adapted to support continuous embedding-space denoising. Starting from LLaDA-8B-Instruct, we continue pretraining for only 1,000 steps with Discrete Stochastic Localization (DSL), replacing binary masking with continuous per-token Gaussian noise as a soft mask. The adapted model supports continuous inference that evolves all positions jointly in embedding space and defers hard token commitment to the final step. On zero-shot summarization at low step budgets (<=16 forward passes), DSL-LLaDA-SDE achieves the best ROUGE-1 on all four benchmarks and largely avoids the premature-termination / repetition tradeoff of iterative unmasking. The same adaptation also yields selective noisy-state robustness: the model corrects corrupted tokens while preserving clean ones. Control experiments using standard masked diffusion training with the same compute demonstrate neither behavior.

2606.01022 2026-06-02 cs.CV cs.AI

ProductWebGen: Benchmarking Multimodal Product Webpage Generation

ProductWebGen: 多模态产品网页生成基准测试

Zhihong Liu, Siqi Kou, Zheng Li, Ye Ma, Quan Chen, Peng Jiang, Kai Yu, Zhijie Deng

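Affiliations: School of Computer Science & Zhiyuan College; Shanghai Jiao Tong University; Kuaishou Technology

AI summary: Proposes the ProductWebGen benchmark for evaluating how well multimodal generative models produce consistent product showcase webpages from product images and instructions, and compares editing-based and unified-model-based workflows.

Details
Comments
Accepted by KDD 2026
AI Chinese abstract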

从源产品图像以及布局和视觉内容指令中制作产品展示网页,对于营销、广告和电子商务等领域具有重要的实用价值。直观上,该任务要求产品展示之间严格的视觉一致性以及高保真度的指令遵循,以联合生成可渲染的HTML代码。这些对可控性和指令遵循的要求与先进多模态生成模型(如图像编辑模型和统一模型)的核心特征紧密一致。为此,本文引入ProductWebGen来系统性地基准测试这些模型的产品网页生成能力。我们组织了包含500个测试样本的ProductWebGen,涵盖13个产品类别;每个样本由源图像、视觉内容指令和网页指令组成。任务是根据源图像和指令生成包含多个一致图像的产品展示网页。鉴于任务的混合模态输入输出性质,我们设计并系统比较了两种评估工作流——一种使用大语言模型和图像编辑模型分别生成HTML代码和图像(基于编辑),另一种依赖单个统一模型生成两者,其中图像生成依赖于先前的多模态上下文(基于统一模型)。实验结果表明,基于编辑的方法在网页指令遵循和内容吸引力方面取得领先结果,而基于统一模型的方法在满足视觉内容指令方面可能展现出更多优势。我们还构建了一个监督微调数据集ProductWebGen-1k,包含1000组真实产品图像和LLM生成的HTML代码。我们在开源统一模型BAGEL上验证了其有效性。数据和代码可在https://github.com/SJTU-DENG-Lab/ProductWebGen获取。

Abstract

Crafting a product display webpage from a source product image, along with layout and visual content instructions, holds significant practical value for domains such as marketing, advertising, and e-commerce. Intuitively, this task demands strict visual consistency across product displays and high-fidelity instruction following to jointly generate renderable HTML code. These requirements on controllability and instruction following are closely aligned with the core features of advanced multimodal generative models, such as image editing models and unified models (UMs). To this end, this paper introduces ProductWebGen to systematically benchmark the product webpage generation capabilities of these models. We organize ProductWebGen with 500 test samples covering 13 product categories; each sample consists of a source image, a visual content instruction, and a webpage instruction. The task is to generate a product showcase webpage including multiple consistent images in accordance with the source image and instructions. Given the mixed-modality input-output nature of the task, we design and systematically compare two workflows for evaluation: one uses large language models and image editing models to separately generate HTML code and images (editing-based), while the other relies on a single UM to generate both, with image generation conditioned on the preceding multimodal context (UM-based). Empirical results show that editing-based approaches achieve leading results in webpage instruction following and content appeal, while UM-based ones may display more advantages in fulfilling visual content instructions. We also construct a supervised fine-tuning dataset, ProductWebGen-1k, with 1,000 groups of real product images and LLM-generated HTML code. We verify its effectiveness on the open-source UM BAGEL. The data and code are available at https://github.com/SJTU-DENG-Lab/ProductWebGen.

2606.01021 2026-06-02 cs.CV

Learning Neural Deformation Representation for 4D Dynamic Shape Generation

Gyojin Han, Jiwan Hur, Jaehyun Choi, Junmo Kim

Affiliations: Korea Advanced Institute of Science and Technology

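AI summary: Proposes a new neural deformation representation combined with conditional neural signed distance fields, designs a 4D representation architecture that disentangles the motion and shape latent spaces, and uses a diffusion model to generate high-quality, temporally consistent 4D dynamic shapes.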

Comments
ECCV 2024
Abstract

Recent developments in 3D shape representation opened new possibilities for generating detailed 3D shapes. Despite these advances, there are few studies dealing with the generation of 4D dynamic shapes that have the form of 3D objects deforming over time. To bridge this gap, we focus on generating 4D dynamic shapes with an emphasis on both generation quality and efficiency in this paper. HyperDiffusion, a previous work on 4D generation, proposed a method of directly generating the weight parameters of 4D occupancy fields but suffered from low temporal consistency and slow rendering speed due to motion representation that is not separated from the shape representation of 4D occupancy fields. Therefore, we propose a new neural deformation representation and combine it with conditional neural signed distance fields to design a 4D representation architecture in which the motion latent space is disentangled from the shape latent space. The proposed deformation representation, which works by predicting skinning weights and rigid transformations for multiple parts, also has advantages over the deformation modules of existing 4D representations in understanding the structure of shapes. In addition, we design a training process of a diffusion model that utilizes the shape and motion features that are extracted by our 4D representation as data points. The results of unconditional generation, conditional generation, and motion retargeting experiments demonstrate that our method not only shows better performance than previous works in 4D dynamic shape generation but also has various potential applications.
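
The abstract says the deformation representation predicts skinning weights and rigid transformations for multiple parts. The abstract does not give the exact formulation, so the following is only a minimal sketch of the classic linear blend skinning idea, which may differ from what the paper does. It assumes each point is moved by every part's rigid transform and the results are averaged with that point's skinning weights. The function names (`rotation_z`, `apply_rigid`, `blend_skinning`) and the toy two-part setup are hypothetical and are not taken from the paper.

```python
import math


def rotation_z(theta):
    """3x3 rotation about the z axis by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def apply_rigid(rotation, translation, point):
    """Return R @ p + t for a single 3D point."""
    return [
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]


def blend_skinning(points, weights, transforms):
    """Deform points by a weighted sum of per-part rigid transforms.

    points:     list of [x, y, z]
    weights:    weights[n][k] is the skinning weight of point n for part k;
                each row is assumed to sum to 1
    transforms: list of (rotation, translation) pairs, one per part
    """
    deformed = []
    for point, row in zip(points, weights):
        # Position of this point under each part's rigid motion.
        moved = [apply_rigid(R, t, point) for R, t in transforms]
        # Blend the candidate positions with the skinning weights.
        deformed.append(
            [sum(w * m[i] for w, m in zip(row, moved)) for i in range(3)]
        )
    return deformed


# Toy example (assumed values): part 0 stays fixed, part 1 rotates a
# quarter turn about z and lifts by one unit along z.
transforms = [
    (rotation_z(0.0), [0.0, 0.0, 0.0]),
    (rotation_z(math.pi / 2), [0.0, 0.0, 1.0]),
]
points = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
deformed = blend_skinning(points, weights, transforms)
```

In this toy setup, a point bound entirely to part 0 stays in place, a point bound entirely to part 1 follows that part's motion, and a point with equal weights lands halfway between the two. In a learned setting the weights and transforms would presumably come from a network conditioned on the motion latent, but that wiring is not described in the abstract.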