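arXivDaily daily academic digest: synced with the full arXiv data, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and more.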

2606.01098 2026-06-02 cs.RO cs.AI

Implicit Drifting Policy: One-Step Action Generation via Conditional Expert Geometry

Zemin Yang, Yaoyu He, Yiming Zhong, Yuhao Zhang, Xinge Zhu, Yao Mu, Qingqiu Huang, Yuexin Ma

Affiliations: ShanghaiTech University; Shanghai Jiao Tong University; The Chinese University of Hong Kong; Morphi Robot

AI summary: Proposes Implicit Drifting Policy (IDP), a one-step imitation learning framework that uses conditional expert geometry to bring training-time drifting correction into policy learning implicitly, with no explicit vector field estimation. Across 2D, 3D, and real-world manipulation tasks it maintains adherence to valid action manifolds, improves on explicit drifting methods, and is competitive with strong one-step baselines.

Abstract

Generative action policies based on diffusion or flow matching excel in behavior cloning, yet their iterative sampling is prohibitive for high-frequency robot control. While recent one-step formulations alleviate this latency, they inevitably discard the intermediate trajectory evolution that provides crucial action correction. Directly recovering this mechanism by explicitly estimating a training-time drifting field is mathematically ill-posed due to extreme conditional demonstration sparsity. We introduce Implicit Drifting Policy (IDP), a one-step imitation learning framework that brings the training-time correction of Drifting into policy learning without explicit vector field estimation. IDP extracts a conditional expert geometry from the local variation of observation-similar expert actions, and compares it against a global reference geometry to isolate condition-specific constraints. This local geometric structure adaptively weights a scalar potential objective. Combined with an expert-proximal terminal evaluation, IDP directly enforces manifold constraints on the one-step generator during training. Extensive evaluations across 2D, 3D, and real-world manipulation tasks show IDP effectively maintains adherence to valid action manifolds, improving upon explicit drifting methods and achieving competitive performance with strong one-step baselines.


2606.01097 2026-06-02 cs.CV

Dual-Route Top-K Retrieval with 1v1 VLM Reranking for the CoVR-R

Yuyang Sun, Yongliang Wu, Xingyu Zhu, Yuxia Chen, Zhenxiang Jiang, Yangguang Ji, Wenbo Zhu, Yanxi Shi, Jay Wu, Shuo Wang, Xu Yang

Affiliations: Southeast University; National University of Singapore; Independent Researcher; Opus AI Research; University of Science and Technology of China

AI summary: Proposes dual-route Top-K retrieval with 1v1 VLM reranking, which decouples recall from selection and reaches 95.28% R@1 in the CoVR-R challenge.

2606.01095 2026-06-02 cs.RO cs.AI

Beyond Task Success: Behavioral and Representational Diagnostics for WAM and VLA

Hung Mai, Bin Zhu, Tuan Do

Affiliations: National Economics University, Vietnam; Singapore Management University; Phenikaa University, Vietnam

AI summary: Proposes a model-agnostic diagnostic framework that uses behavioral analysis and sparse-autoencoder-based feature analysis to compare the behavior and representations of World-Action Models (WAM) and vision-language-action (VLA) policies in robotic manipulation. WAMs improve target selection and behavior relative to VLAs but cost more compute, and different WAM architectures encode future information differently.

Abstract

Vision-language-action (VLA) policies and World-Action Models (WAM) represent two increasingly important paradigms for robotic manipulation. However, it remains unclear whether future prediction in WAMs leads to behaviorally meaningful improvements beyond final task success. In this paper, we ask whether WAMs merely add future prediction, or whether they change robot behavior and internal representations in ways that are actionable for control. We introduce a model-agnostic diagnostic framework that compares WAMs and VLAs through two complementary lenses: behavioral rollout analysis and sparse-autoencoder-based feature analysis. The behavioral protocol measures action dynamics consistency, target-object progress, distractor disturbance, and runtime cost. The feature-space protocol characterizes internal representations as memorized, reactive, or predictive, revealing whether models encode future-oriented structure. Across LIBERO and RoboTwin2.0, we evaluate 7 policies spanning direct VLAs and joint, sequential, and auxiliary WAMs. Our results show that success alone hides key differences: WAMs often improve object-level behavior and target selectivity, but their gains depend on architecture and incur higher inference cost. Sequential WAMs show the clearest predictive structure, while auxiliary and joint WAMs respectively compress or entangle future information. These findings suggest future directions for WAMs design to preserve behaviorally actionable future representations for efficient manipulation.


2606.01094 2026-06-02 cs.AI

CAREAgent: Clinical Agent with Structured Reasoning and Tool-Integrated for Order Generation

Ruihui Hou, Ziyue Huai, Chennuo Zhang, Ziyan Liu, Siran Zhao, Yao Yu, Jie Zhai, Tong Ruan

Affiliations: East China University of Science and Technology, Shanghai, China; Zhongshan Hospital, Fudan University, Shanghai, China

AI summary: Proposes CAREAgent, which generates fine-grained clinical orders using two-stage reasoning-data construction followed by supervised fine-tuning and reinforcement learning. On ClinicalBench it improves F1 by 5.05% over single-agent methods.

Abstract

Clinical order generation serves as a critical bridge between clinical decision-making and real-world practice, translating medical decisions into concrete and executable orders. Existing agents mainly focus on coarse-grained decisions and overlook the fine-grained, executable information required for clinical orders. To address this gap, we propose CAREAgent, an agent for clinical order generation. To support its training, we introduce a two-stage agentic reasoning data construction method. First, we design an agent framework that constructs verifiable reasoning trajectories aligned with realistic clinical tool usage. Second, we filter reasoning trajectories by format compliance, order validity, and clinical plausibility. Building on the constructed data, the model is first trained via supervised fine-tuning to acquire fundamental reasoning formats and medical knowledge, and is subsequently optimized through reinforcement learning with multi-dimensional reward functions to enhance complex clinical reasoning capabilities. Experiments on multiple benchmarks demonstrate the effectiveness of CAREAgent. On ClinicalBench (unseen during training), CAREAgent improves the F1 score by 5.05%, 2.09%, and 0.86% over the single-agent, multi-agent, and agentic reasoning methods, respectively.


2606.01092 2026-06-02 cs.LG cs.AI

A Fiber Criterion for Representation Identifiability in Supervised Learning

Vasileios Sevetlidis

Affiliations: Athena Research Center, Kimmeria Campus, Xanthi, Greece; Democritus University of Thrace, Vas. Sofias Campus, Xanthi, Greece; International Hellenic University, Serres, Greece

AI summary: Proposes a fiber criterion that formalizes the identifiability of representation-head factorizations in supervised learning: a representation property is identifiable exactly when it is constant on the fibers of the projection map. Supervised predictive behavior alone cannot uniquely determine the representation.

Abstract

Supervised learning evaluates predictors through their input-output behavior. When a predictor is implemented as a composition $f=c\circ h$, supervised evidence constrains the composite map $f$ but need not determine the representation-head factorization $(h,c)$. This paper formalizes the resulting representation-level identifiability problem: for a class of admissible representation-head pairs, a representation property is identifiable from the induced predictor exactly when it is constant on the fibers of the projection $(h,c)\mapsto c\circ h$, equivalently when it descends to a well-defined property of the predictor. Predictor-preserving augmentation gives a canonical obstruction: auxiliary information can be appended to a representation while the head ignores it, leaving the predictor unchanged but altering properties such as minimality, compression, invariance, equivariance, nuisance information, or semantic accessibility. This construction separates representation identifiability from optimization and finite-sample estimation. Finite-sample diagnostics illustrate, rather than prove, the criterion: exact algebraic witnesses hold the predictor fixed while changing representation diagnostics, and matched-performance Waterbirds models show that different constraints can select different representations at similar supervised performance. The results clarify that representation-level claims require assumptions, objectives, measurements, or inductive biases beyond supervised predictive behavior alone.
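The predictor-preserving augmentation described in this abstract can be made concrete with a toy sketch. The functions `h`, `c`, `h_aug`, and `c_aug` below are hypothetical and not taken from the paper; they only assume a representation that sums its inputs and a head that reads the first coordinate.

```python
from typing import Callable, List

Vector = List[float]


def h(x: Vector) -> Vector:
    """Toy representation (assumption): one feature, the sum of the inputs."""
    return [sum(x)]


def c(z: Vector) -> float:
    """Toy head (assumption): a linear readout of the first coordinate."""
    return 2.0 * z[0]


def h_aug(x: Vector) -> Vector:
    """Augmented representation: the same feature with the raw input appended."""
    return h(x) + list(x)


def c_aug(z: Vector) -> float:
    """Augmented head: ignores every coordinate after the first."""
    return c(z[:1])


def compose(rep: Callable[[Vector], Vector],
            head: Callable[[Vector], float]) -> Callable[[Vector], float]:
    """Build the predictor f = head(rep(x))."""
    return lambda x: head(rep(x))


f = compose(h, c)
f_aug = compose(h_aug, c_aug)

inputs = [[1.0, 2.0], [2.0, 1.0], [0.5, -0.5]]

# Both factorizations lie in the same fiber: they induce the same predictor.
same_predictor = all(f(x) == f_aug(x) for x in inputs)

# A representation-level property differs: h is invariant to swapping the two
# inputs, while h_aug is not, because the appended coordinates keep the order.
h_is_swap_invariant = h([1.0, 2.0]) == h([2.0, 1.0])
h_aug_is_swap_invariant = h_aug([1.0, 2.0]) == h_aug([2.0, 1.0])
```

In this toy case the swap-invariance property is not constant on the fiber, so under the stated criterion it would not be identifiable from the predictor alone.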


2606.01091 2026-06-02 cs.CL

Deep Research as Rubric for Reinforcement Learning

Wangyi Mei, Zhouhong Gu, Zhenhan Bai, Yin Cai, Lefan Zhang, Zhenxin Ding, Bo Chen, Yan Gao, Yi Wu, Yao Hu, Jiaqing Liang, Deqing Yang

Affiliations: Fudan University; Xiaohongshu Inc.; Beijing University of Posts and Telecommunications

AI summary: Proposes the DR-rubric framework, a two-stage process that builds rubrics through iterative multi-turn agentic search and then distills them into verifiable constraints for GRPO-based policy optimization, producing fine-grained reward signals for open-ended reasoning and long-form generation. Experiments report strong results on multiple benchmarks.

Abstract

Open-ended reasoning and long-form generation tasks lack reliable automatic verification signals for reward-based policy optimization. Rubrics offer a promising alternative, but existing approaches treat them as given artifacts -- either hand-crafted or prompt-generated -- and often miss the task-specific, knowledge-intensive dimensions that matter most, distorting the reward signal. Our key observation is that rubric construction is itself a research problem: identifying what makes a response correct or insightful requires discovering and synthesizing external knowledge. We propose Deep Research as Rubric (DR-rubric), a two-stage framework for constructing such rubrics. Stage I elicits domain facts, structural constraints, and failure modes through iterative multi-turn agentic search; Stage II distills this evidence into atomic, independently verifiable constraints for GRPO-based policy optimization. Because the model under training can serve as its own rubric generator, DR-rubric-8B supports bootstrap rubric generation without frontier-model assistance. We evaluate on 6 benchmarks spanning agentic research and expert reasoning. Experiments show that DR-Rubric achieves strong competitive performance with only 1K -- 3K training instances, where GPT-5-generated rubrics particularly benefit breadth coverage on agentic tasks, Gemini-generated rubrics yield the most balanced performance across agentic and expert reasoning tasks, and bootstrap rubrics exhibit a specialization-to-rebalancing evolution achieving the best overall performance at the third iteration. Results demonstrate that reframing rubric construction from static evaluation templates into an evidence-driven research process yields more scalable, fine-grained reward signals for open-ended tasks.
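One way to picture how atomic, independently verifiable constraints could feed a GRPO-style update is sketched below. The reward rule (fraction of satisfied constraints) and the three example predicates are assumptions for illustration, not the paper's rubrics; the advantage uses the common group-relative normalization of rewards within one sampled group.

```python
import statistics
from typing import Callable, List

Constraint = Callable[[str], bool]

# Hypothetical atomic constraints; a real rubric would come from the search stage.
constraints: List[Constraint] = [
    lambda r: "because" in r,            # gives a justification
    lambda r: len(r.split()) >= 5,       # is not trivially short
    lambda r: r.strip().endswith("."),   # ends as a complete sentence
]


def rubric_reward(response: str) -> float:
    """Assumed reward: the fraction of constraints the response satisfies."""
    return sum(check(response) for check in constraints) / len(constraints)


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Center and scale rewards within one group of sampled responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


group = [
    "Yes.",
    "The bound holds because the step size is small enough.",
    "It probably works fine in most cases",
]
rewards = [rubric_reward(r) for r in group]
advantages = group_relative_advantages(rewards)
```

Because each constraint is checked separately, a response that satisfies more of them receives a higher advantage than its group peers, which is the fine-grained signal the abstract refers to.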


2606.01086 2026-06-02 cs.LG cs.AI

Strong Stochastic Flow Maps

Sam McCallum, Zander W. Blasingame, Timothy Herschell, Niklas Rindtorff, Alexander Tong, James Foster

Affiliations: University of Bath; AITHYRA

AI summary: Proposes Strong Stochastic Flow Maps (SSFMs), a framework that learns the strong solution map of additive-noise SDEs, giving a simulation-free training objective for diffusion models. SSFMs outperform previous stochastic flow map methods on image generation and enable few-step sampling of molecular systems.

Comments: Preprint

Abstract

Flow and diffusion models generate high-quality samples in many modalities; however, many network evaluations are required during inference due to numerical integration of an underlying differential equation. Flow maps alleviate this problem by learning the solution map of the differential equation directly, enabling few-step sampling. Yet, current methods are restricted to approximating the solution map of ODEs. These methods can be used to learn the transition kernel of an SDE, thereby obtaining a solution map that recovers the marginal distributions of the process (weak convergence) rather than the solution path (strong convergence). We propose Strong Stochastic Flow Maps (SSFMs) as a novel framework for learning the strong solution map of additive-noise SDEs, directly generalizing deterministic flow maps to the stochastic setting. Further, a polynomial approximation to Brownian motion is introduced and shown to converge pathwise. These results enable a simulation-free training objective for the solution map of diffusion models. We demonstrate that SSFMs outperform previous stochastic flow map methods on image generation and enable few-step sampling of molecular systems.
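The distinction between strong and weak solutions can be illustrated with a plain Euler-Maruyama discretization, which is not the paper's method. The sketch assumes a hypothetical Ornstein-Uhlenbeck drift and noise scale; it shows that a strong solution is a deterministic function of the initial state and the Brownian path, so reusing the same increments reproduces the same trajectory.

```python
import math
import random
from typing import List


def brownian_increments(n_steps: int, dt: float, seed: int) -> List[float]:
    """Sample the increments of one Brownian path on a uniform time grid."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, math.sqrt(dt)) for _ in range(n_steps)]


def euler_maruyama(x0: float, increments: List[float], dt: float,
                   sigma: float = 0.5) -> float:
    """Integrate dX = -X dt + sigma dW (additive noise) along a given path."""
    x = x0
    for dw in increments:
        x = x + (-x) * dt + sigma * dw
    return x


dt, n_steps = 0.01, 100
path_a = brownian_increments(n_steps, dt, seed=0)
path_b = brownian_increments(n_steps, dt, seed=1)

# Same initial state and same driving path: the terminal state is reproduced.
x_a1 = euler_maruyama(1.0, path_a, dt)
x_a2 = euler_maruyama(1.0, path_a, dt)

# A different path gives a different terminal state, even though both are
# draws from the same marginal distribution.
x_b = euler_maruyama(1.0, path_b, dt)
```

A map that only matches marginal distributions would treat `x_a1` and `x_b` as equally valid outputs; a strong solution map is additionally tied to the specific path that drove the process.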


2606.01084 2026-06-02 cs.LG cs.AI

MViewRouter: Internalizing Geometric Equivariance via Multi-view Alternating Attention for Combinatorial Routing

Shiyan Liu, Bohan Tan, Yaoxin Wu, Yan Jin

Affiliations: Huazhong University of Science and Technology; Eindhoven University of Technology

AI summary: Proposes MViewRouter, which uses multi-view alternating attention to internalize geometric equivariance as a structural inductive bias and optimizes the policy with collective policy gradient aggregation, addressing symmetry in combinatorial routing. It achieves competitive solution quality and strong zero-shot generalization on TSP and CVRP.

Abstract

Combinatorial routing problems such as the Traveling Salesman Problem (TSP) and the Capacitated Vehicle Routing Problem (CVRP) are fundamental NP-hard problems with broad real-world applications. While recent deep reinforcement learning methods have shown promising performance, they typically handle geometric symmetries only through data augmentation, resulting in inconsistent decisions and limited generalization. To address this issue, we propose MViewRouter, a multi-view framework that internalizes geometric equivariance as a structural inductive bias to achieve invariant decision-making across routing problem variants. Our approach introduces a Multi-view Alternating Attention (MAA) mechanism that enables parallel processing over the $D_4$ symmetry group, alternating between intra-view relational modeling and inter-view feature alignment. Furthermore, we optimize the policy via Collective Policy Gradient Aggregation (CPGA), leveraging consensus gradients from multiple symmetric views to stabilize training and accelerate convergence. Experiments on TSP and CVRP benchmarks, as well as real-world TSPLIB instances, demonstrate that MViewRouter achieves competitive solution quality and strong zero-shot generalization.
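The $D_4$ symmetry mentioned in this abstract can be illustrated on its own, without any attention mechanism. The sketch below assumes city coordinates normalized to the unit square and uses hypothetical example cities; it enumerates the eight views and checks that the length of a fixed tour is the same in each, which is the invariance a routing policy would ideally respect.

```python
import math
from typing import Callable, List, Tuple

Point = Tuple[float, float]

# The eight elements of D_4 acting on the unit square [0, 1]^2
# (identity, rotations, and reflections).
D4_VIEWS: List[Callable[[Point], Point]] = [
    lambda p: (p[0], p[1]),
    lambda p: (p[1], p[0]),
    lambda p: (p[0], 1 - p[1]),
    lambda p: (p[1], 1 - p[0]),
    lambda p: (1 - p[0], p[1]),
    lambda p: (1 - p[1], p[0]),
    lambda p: (1 - p[0], 1 - p[1]),
    lambda p: (1 - p[1], 1 - p[0]),
]


def tour_length(points: List[Point], tour: List[int]) -> float:
    """Length of the closed tour that visits points in the given order."""
    n = len(tour)
    return sum(math.dist(points[tour[i]], points[tour[(i + 1) % n]])
               for i in range(n))


cities = [(0.1, 0.2), (0.8, 0.3), (0.6, 0.9), (0.2, 0.7)]  # hypothetical instance
tour = [0, 1, 2, 3]

lengths = [tour_length([view(p) for p in cities], tour) for view in D4_VIEWS]
distinct_views = {view(cities[0]) for view in D4_VIEWS}
```

Data augmentation feeds these views to a model one at a time; the abstract's point is that the model can instead process them jointly so that its decisions agree across views.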


2606.01081 2026-06-02 cs.LG

Decision-Focused On-Policy Learning for Contextual Linear Optimization with Partial Feedback

Wyame Benslimane, Tinghan Ye, Pascal Van Hentenryck, Paul Grigas

Affiliations: Department of Industrial Engineering and Operations Research, University of California, Berkeley; H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology

AI summary: Proposes a hybrid gradient estimator for on-policy learning in sequential contextual linear optimization under partial feedback, so that predictive models are trained for decision quality. It outperforms contextual-bandit baselines on several benchmarks.

Abstract

Decision-focused learning (DFL) trains predictive models by optimizing downstream decision quality rather than standalone prediction accuracy. For contextual linear optimization, most existing DFL methods assume offline data and full observations of the objective cost vector. We develop an on-policy learning method for sequential contextual linear optimization under partial feedback, generalizing the standard bandit feedback setting. Our method learns a stochastic predict-then-optimize policy that samples a cost-vector prediction from a conditional distribution and solves the resulting downstream linear optimization problem. To update this distributional model, we introduce a two-component hybrid gradient estimator. The first component is a score function estimator, which provides an unbiased but potentially high-variance policy gradient estimate. The second is a decision-focused plug-in component that uses an auxiliary nuisance estimate of the latent cost vector to exploit the downstream optimization structure, becoming more informative as the estimate improves. We prove an $\mathcal{O}(T^{-1/2})$ bound on the average squared policy-gradient norm, matching the standard non-convex SGD rate. Experiments on top-$k$ selection, shortest path, combinatorial pricing, and a real-data energy-scheduling benchmark show that the hybrid gradient approach achieves lower cumulative regret than contextual-bandit-style baselines across all benchmarks, using both Gaussian and richer conditional generative models. Code is available at https://github.com/Joeyetinghan/on-policy-bandit-dfl.
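The score function component described in this abstract can be sketched for the simplest possible downstream problem. Everything below is an illustrative assumption rather than the paper's setup: the feasible set is "pick exactly one item", the policy is an isotropic Gaussian over cost predictions with mean `mu`, context is omitted, and the plug-in component is left out.

```python
import random
from typing import List


def sample_cost_prediction(mu: List[float], sigma: float,
                           rng: random.Random) -> List[float]:
    """Draw a cost-vector prediction from the Gaussian policy."""
    return [rng.gauss(m, sigma) for m in mu]


def solve_downstream(cost_hat: List[float]) -> int:
    """Toy linear optimization: choose the single item with lowest predicted cost."""
    return min(range(len(cost_hat)), key=cost_hat.__getitem__)


def score_function_gradient(mu: List[float], sigma: float, true_cost: List[float],
                            n_samples: int, seed: int = 0) -> List[float]:
    """Monte Carlo estimate of the gradient of expected decision cost w.r.t. mu."""
    rng = random.Random(seed)
    grad = [0.0] * len(mu)
    for _ in range(n_samples):
        cost_hat = sample_cost_prediction(mu, sigma, rng)
        decision = solve_downstream(cost_hat)
        observed = true_cost[decision]  # partial feedback: only the chosen item's cost
        for i in range(len(mu)):
            # observed cost times the gradient of the Gaussian log-density w.r.t. mu_i
            grad[i] += observed * (cost_hat[i] - mu[i]) / sigma ** 2
    return [g / n_samples for g in grad]


true_cost = [1.0, 0.2, 0.8]  # hypothetical latent costs, never fully observed
grad = score_function_gradient([0.0, 0.0, 0.0], 1.0, true_cost, n_samples=20000)
```

A descent step on `mu` would lower the predicted cost of the cheap item (positive gradient entry) and raise that of the expensive one (negative entry). The estimate is unbiased but noisy, which is the motivation the abstract gives for pairing it with a second, structure-aware component.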


2606.01080 2026-06-02 cs.LG cs.AI

ThinkSwitch: Context Distillation with LoRA and Weight Interpolation for Specific-Purpose Reasoning Tasks

Dhruv Saini, Rohan Pandey

Affiliations: Bellevue High School; DigitalOcean

AI summary: Proposes ThinkSwitch, which co-trains paired instruct and thinking models via QLoRA distillation and spherical weight interpolation. On AIME 2026 and PubMedQA the instruct model improves from 10/30 to 20/30 and from 13/30 to 18/30, and the thinking model from 14/30 to 22/30 and from 18/30 to 25/30, using only 15 training prompts per domain at a cost of $2.86.

Abstract

Large language models often improve on difficult tasks by spending inference-time compute on a reasoning trace before producing the final answer. That extra computation can be useful, but it also raises latency, token cost, and deployment complexity. We introduce ThinkSwitch, a low-compute procedure for co-training paired instruct and thinking checkpoints. Starting from compatible Qwen3-4B instruct and thinking models, each iteration asks the thinking checkpoint to generate answers, removes the reasoning trace, distills the answer-only pairs into the instruct checkpoint with QLoRA, and reconstructs a thinking checkpoint with spherical weight interpolation. The only human-supplied inputs are task prompts; the labels are generated by the model itself. On a 30-question AIME 2026 evaluation, ThinkSwitch improves the instruct checkpoint from 10/30 to 20/30 and the thinking checkpoint from 14/30 to 22/30. On a 30-question PubMedQA subset, it improves the instruct checkpoint from 13/30 to 18/30 and the thinking checkpoint from 18/30 to 25/30. The complete experiment uses 15 training prompts per domain and costs $2.86 on a single cloud RTX 3070. The results are small-scale, but they indicate that targeted distillation loops can move part of the benefit of explicit reasoning into weights while preserving a separate thinking mode.
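Spherical weight interpolation is usually implemented as slerp between two flattened weight vectors. The sketch below is a generic version under that assumption; which checkpoints are interpolated, at what ratio, and whether it is applied per tensor are not stated in the abstract, so the function name, the fallback to linear interpolation, and the toy vectors are all illustrative.

```python
import math
from typing import List


def slerp(w0: List[float], w1: List[float], t: float, eps: float = 1e-8) -> List[float]:
    """Spherical linear interpolation between two weight vectors, t in [0, 1]."""
    norm0 = math.sqrt(sum(a * a for a in w0))
    norm1 = math.sqrt(sum(b * b for b in w1))
    lerp = [(1 - t) * a + t * b for a, b in zip(w0, w1)]
    if norm0 < eps or norm1 < eps:
        return lerp  # angle undefined for a zero vector
    dot = sum(a * b for a, b in zip(w0, w1))
    cos_omega = max(-1.0, min(1.0, dot / (norm0 * norm1)))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if sin_omega < eps:
        return lerp  # nearly parallel or antiparallel: fall back to linear
    s0 = math.sin((1 - t) * omega) / sin_omega
    s1 = math.sin(t * omega) / sin_omega
    return [s0 * a + s1 * b for a, b in zip(w0, w1)]


start = slerp([1.0, 0.0], [0.0, 1.0], 0.0)
mid = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
end = slerp([1.0, 0.0], [0.0, 1.0], 1.0)
mid_norm = math.sqrt(sum(v * v for v in mid))
```

Unlike plain averaging, the midpoint of two unit vectors stays on the unit sphere here, which is the usual argument for preferring slerp when merging checkpoints.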


2606.01079 2026-06-02 cs.CV

Chameleon: Style-Content Disentangled Framework for Cross-Domain Object Compositing

Sukhun Ko, Soo Ye Kim, Jihyong Oh

Affiliations: CMLab, Chung-Ang University; Adobe Research

AI summary: Proposes Chameleon, a two-stage training framework built on the large-scale ChameleonDataset, which uses Joint Hard Contrastive Learning and Spatio-Temporal Attention Gating to disentangle style from content and stylize adaptively for cross-domain object compositing.

Comments: The last two authors are co-corresponding authors. Please visit our project page at https://cmlab-korea.github.io/Chameleon/

Abstract

Image compositing aims to seamlessly insert a foreground object into a background image, and recent advances in diffusion models have significantly enhanced the quality, especially when the foreground and background images come from the same domain (e.g., natural images). However, cross-domain compositing, where the foreground and background come from different domains, is relatively underexplored and remains challenging because the model must preserve the foreground object's identity while stylizing it to match the background domain. Existing cross-domain compositing approaches largely rely on training-free blending and refinement strategies. This is partly due to the lack of large-scale paired datasets for cross-domain compositing, limiting the development of training-based solutions. As a result, they are limited to tone-level alignment and often produce style-inconsistent or overstylized results. To overcome such limitations, we construct ChameleonDataset, the first large-scale training dataset for cross-domain compositing, with a comprehensive evaluation benchmark, built through a scalable data construction pipeline. Building on this, we propose Chameleon, a novel two-stage training-based cross-domain compositing framework. In the first stage, we propose Joint Hard Contrastive Learning (JHCL) to train ChameleonEncoder, which effectively disentangles style and content representations. In the second stage, we introduce Spatio-Temporal Attention Gating (STAG) into a diffusion transformer for effective stylization, adaptively regulating how style tokens from the first-stage encoder are injected across spatial and temporal dimensions. Our method outperforms state-of-the-art in-domain and cross-domain compositing models, sequential pipelines and commercial models, achieving improvements in both compositional plausibility and stylistic fidelity.


2606.01078 2026-06-02 cs.LG stat.CO stat.ME

Non-Vacuous Certification of Transport MCMC via Oscillation-Controlled Normalizing Flows

Jun Hu

Affiliations: China Sanya Science and Education Innovation Park, Wuhan University of Technology, Sanya 572025, China; School of Civil Engineering and Architecture, Wuhan University of Technology, Wuhan 430070, China

AI summary: Proposes an oscillation-controlled normalizing flow framework that gives the first rigorous, non-vacuous spectral-gap bounds for transport MCMC samplers, combining spectral normalization, a coverage-based empirical oscillation bound, and oscillation-regularized training to certify sampling efficiency on several posteriors.

Comments: 36 pages, includes appendix

Abstract

Transport MCMC trains a normalizing flow to precondition Metropolis-Hastings proposals, achieving high empirical efficiency on challenging posteriors; yet no prior work produces a numerically non-vacuous, rigorous spectral-gap bound for such samplers. We establish the first such bounds. For independence MH on the banana family we certify $\gamma^\ast = 0.828$ at $D = 2$ (covering in the original space) and $\gamma^\ast \ge 7.6\times 10^{-4}$ at $D = 5$ (covering in an analytically unwarped Gaussian space with a grid-certified gradient bound under the stated numerical Lipschitz certification), both rigorous at 95% confidence. The framework rests on three pillars: (i) spectral normalization with reduced scale clips constrains the flow Lipschitz constant from $10^{47}$ to $10^4$; (ii) a coverage-based empirical oscillation bound replaces the vacuous analytical bound with a data-dependent certificate; and (iii) oscillation-regularised training cuts the empirical oscillation by 60-90% at no cost to density fit, extending practical certificates through $D = 20$ ($\gamma^\ast \ge 1.7\times 10^{-4}$). Tests on four further targets (Gaussian mixture, shear-building, Neal's funnel, Bayesian logistic regression) identify three precise barriers: boundary curvature, target stiffness, and tail-coverage mismatch. An affine-vs-spline comparison shows that simpler architectures yield tighter certificates at identical NLL, inverting the usual expressiveness hierarchy.
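For readers unfamiliar with independence MH, the sampler being certified can be sketched in one dimension. This is a generic illustration, not the paper's flow-based construction: the target is a standard normal, and a wider Gaussian with hypothetical scale `PROPOSAL_STD` stands in for the trained flow's proposal distribution.

```python
import math
import random
from typing import List, Tuple

PROPOSAL_STD = 1.5  # assumed proposal scale, wider than the target


def log_target(x: float) -> float:
    """Standard normal log-density, up to an additive constant."""
    return -0.5 * x * x


def log_proposal(x: float) -> float:
    """Gaussian proposal log-density, up to an additive constant."""
    return -0.5 * (x / PROPOSAL_STD) ** 2


def independence_mh(n_steps: int, seed: int = 0) -> Tuple[List[float], float]:
    """Run independence Metropolis-Hastings; return samples and acceptance rate."""
    rng = random.Random(seed)
    x, samples, accepted = 0.0, [], 0
    for _ in range(n_steps):
        y = rng.gauss(0.0, PROPOSAL_STD)  # proposal ignores the current state
        # Log of the importance-weight ratio w(y) / w(x), with w = target / proposal.
        log_alpha = (log_target(y) - log_proposal(y)) - (log_target(x) - log_proposal(x))
        if rng.random() < math.exp(min(0.0, log_alpha)):
            x, accepted = y, accepted + 1
        samples.append(x)
    return samples, accepted / n_steps


samples, acceptance_rate = independence_mh(5000)
sample_mean = sum(samples) / len(samples)
```

Acceptance depends only on how much the weight `target / proposal` varies between points, which is why controlling the oscillation of that ratio is a natural route to a mixing guarantee.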


2606.01074 2026-06-02 cs.CL

When Is 0.1% Enough? Analyzing the Combined Effects of Dimensionality Reduction and Quantization on Text Embedding Compression

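Riku Kisako, Hayato Tsukagoshi, Ryohei Sasano

Affiliations: Graduate School of Informatics, Nagoya University

AI summary: A systematic study of compressing text embeddings by combining dimensionality reduction with quantization. Joint compression shrinks embeddings substantially (down to 0.1% of the original size) with almost no loss in performance, and the best strategy varies by task.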
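To give a sense of scale for the 0.1% figure, the arithmetic below works through one hypothetical configuration. The 1024-dimensional float32 starting point, the use of truncation as the dimensionality reduction, and sign binarization as the quantizer are all assumptions chosen for illustration; the summary does not say which methods or sizes the paper uses.

```python
from typing import List


def truncate(embedding: List[float], kept_dims: int) -> List[float]:
    """Stand-in for dimensionality reduction: keep the leading coordinates."""
    return embedding[:kept_dims]


def binarize(embedding: List[float]) -> List[int]:
    """1-bit quantization: keep only the sign of each coordinate."""
    return [1 if v > 0 else 0 for v in embedding]


def compressed_fraction(original_dims: int, original_bits: int,
                        kept_dims: int, kept_bits: int) -> float:
    """Size of the compressed embedding relative to the original."""
    return (kept_dims * kept_bits) / (original_dims * original_bits)


# Synthetic 1024-dimensional embedding with alternating signs.
embedding = [((-1) ** i) * (i + 1) / 1024 for i in range(1024)]

code = binarize(truncate(embedding, 32))
fraction = compressed_fraction(1024, 32, 32, 1)  # 32 bits out of 32768
```

Under these assumptions the compressed code occupies 32 of the original 32768 bits, just under 0.1%, so reaching that ratio requires aggressive reduction on both axes at once.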

2606.01069 2026-06-02 cs.CV

A Multiscale Network with Supervised Contrastive Learning for Real-Time Facial Emotion Recognition

Rejoy Chakraborty, Archisman Adhikary, Chayan Halder, Payel Rakshit, Sanchita Ghosh, Kaushik Roy

Affiliations: Indian Statistical Institute; Department of Biological Sciences, Bose Institute; Ramakrishna Mission Vivekananda Centenary College; Maheshtala College; West Bengal State University

AI summary: Proposes a multiscale deep learning network with supervised contrastive learning for recognizing emotion from changing facial expressions in real-time video, with satisfactory results on a standard dataset.

Comments: 13 pages

Abstract

Real-time emotion recognition from facial expressions is a challenging task, particularly in video-based scenarios where multiple emotional states may occur over time. The difficulty increases further due to the fact that each emotional state is associated with facial expressions that vary significantly across individuals. The change of facial expressions portraying emotional state is not discrete, but rather continuous, which is very challenging to represent through computational aids. A system with the ability to detect variations in facial expressions can have a significant impact on determining the emotional state of an individual. Such a system can be very beneficial for psychologists during counseling by providing additional insights into the emotional state of a subject. In this paper, a deep learning-based system is presented to detect emotional changes in real-time video of a person by modeling the change in facial expressions. The current study is conducted on a standard dataset for training of the deep learning system and the system has provided very satisfactory outcomes in this respect.


2606.01066 2026-06-02 cs.AI

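Before the Model Learns the Bug: Fuzzing RLVR Verifiers

Jaideep Ray

Affiliations: ACM

AI summary: Proposes a lightweight verifier fuzzing framework that generates adversarial completions, compares a flawed verifier against a strict reference verifier, and reports several metrics, in order to study the failure mode in which verifier errors in RLVR lead optimization to learn the bug.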

2606.01063 2026-06-02 cs.AI

MindClaw: Closed-Loop Embodied Mental-State Reasoning for Precision Intervention

Ruoxuan Zhang, Qiaoqiao Wan, Zhengguang Wang, Chenghao Yu, Hongxia Xie, Jianlong Fu, Wen-Huang Cheng

Affiliations: Jilin University; Microsoft Asia; National Taiwan University

AI summary: Proposes MindClaw, a framework for precision intervention through closed-loop embodied mental-state reasoning. It combines multi-source inputs, belief memory, a cognitive trigger skill, and action generation to optimize intervention timing in dynamic environments.

Comments: Extended version of the CVPR 2026 paper *MindPower: Enabling Theory-of-Mind Reasoning in VLM-based Embodied Agents*

Abstract

Theory of Mind (ToM) enables an agent to reason about another actor's beliefs, goals, and intentions, which is essential for human-centered embodied assistance. Existing ToM benchmarks have advanced text and multimodal mental-state recognition, but they mostly evaluate offline question answering or final action prediction. They do not fully test whether an embodied agent can stay connected to a changing environment, update actor-specific beliefs, decide when reasoning is needed, and intervene only when help is useful. Building on MindPower, we extend robot-centric ToM reasoning to a real-time closed-loop setting and introduce MindClaw, a framework for embodied mental-state reasoning with precision intervention. MindClaw connects multi-source inputs, belief memory, an embodied cognitive trigger skill, mental reasoning, and action generation, allowing the agent to output helpful actions at the right time while remaining silent when intervention is unnecessary. Experiments show that direct VLM baselines struggle with task awareness and intervention calibration, while MindClaw achieves the best overall performance, demonstrating the importance of trigger-skill optimization for closed-loop embodied ToM assistance.

2606.01062 2026-06-02 cs.AI

DAG-MoE: From Simple Mixture to Structural Aggregation in Mixture-of-Experts

Jiarui Feng, Hanqing Zeng, Karish Grover, Ruizhong Qiu, Yinglong Xia, Qiang Zhang, Qifan Wang, Ren Chen, Dongqi Fu, Jiayi Liu, Zhoukai Zhao, Xiangjun Fan, Benyu Zhang, Yixin Chen

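Affiliations: University of Science and Technology of China; Tsinghua University; Nanyang Technological University; University of California, Berkeley; University of Washington

AI summary: Proposes DAG-MoE, a framework in which a lightweight module automatically learns the optimal aggregation structure among the selected experts, replacing standard weighted-summation aggregation. This expands the expert-combination space without changing the experts or the router, enables multi-step reasoning within a single layer, and outperforms traditional MoE baselines in both pretraining and fine-tuning.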

Comments Accepted by ICML 2026

Abstract

Mixture-of-Experts (MoE) models have become a leading approach for decoupling parameter count from computational cost in large language models, yet effectively scaling MoE performance remains a challenge. Prior work shows that fine-grained experts enlarge the space of expert combinations and improve flexibility, but they also impose substantial routing overhead, creating a new scalability bottleneck. In this paper, we explore a complementary axis for scaling -- how expert outputs are aggregated. We theoretically show that replacing the standard weighted-summation aggregation with structural aggregation expands the expert-combination space without altering the experts or router, and enables possible multi-step reasoning within a single MoE layer. To this end, we propose DAG-MoE, a sparse MoE framework that employs a lightweight module to automatically learn the optimal aggregation structure among the selected experts. Extensive experiments under standard language modeling settings show that DAG-MoE consistently improves performance in both pretraining and fine-tuning, surpassing traditional MoE baselines.
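
To make the contrast in the abstract concrete, here is a minimal sketch of standard weighted-summation aggregation next to one toy form of structural aggregation (a chain-shaped DAG that composes the selected experts). The abstract does not describe DAG-MoE's actual module, so `chain_aggregate`, the two toy experts, and the gate values are illustrative assumptions only.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Expert = Callable[[Vector], Vector]


def weighted_sum(x: Vector, experts: Sequence[Expert], gates: Sequence[float]) -> Vector:
    """Standard MoE aggregation: y = sum_i g_i * E_i(x)."""
    out = [0.0] * len(x)
    for expert, gate in zip(experts, gates):
        for j, value in enumerate(expert(x)):
            out[j] += gate * value
    return out


def chain_aggregate(x: Vector, experts: Sequence[Expert]) -> Vector:
    """Toy structural aggregation (assumption, not the paper's module):
    a chain-shaped DAG that feeds each expert's output into the next."""
    h = x
    for expert in experts:
        h = expert(h)
    return h


# Two hypothetical experts: elementwise doubling and elementwise squaring.
def double(v: Vector) -> Vector:
    return [2.0 * a for a in v]


def square(v: Vector) -> Vector:
    return [a * a for a in v]


x = [1.0, 3.0]
flat = weighted_sum(x, [double, square], [0.5, 0.5])  # each expert sees x
chained = chain_aggregate(x, [double, square])        # square sees double(x)
print(flat, chained)
```

With the same two experts and the same input, the chain computes `square(double(x))`, a two-step result that the flat weighted sum over `double(x)` and `square(x)` does not produce here. This is one way to read the claim that changing only the aggregation enlarges the space of expert combinations.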

2606.01057 2026-06-02 cs.CV cs.AI cs.GR cs.LG

3DCodeBench: Benchmarking Agentic Procedural 3D Modeling Via Code

Yipeng Gao, Lei Shu, Genzhi Ye, Xi Xiong, Ameesh Makadia, Meiqi Guo, Laurent Itti, Jindong Chen

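Affiliations: Google DeepMind; University of Southern California; Google Research

AI summary: Proposes 3DCodeBench, a benchmark that evaluates how well 12 vision-language models translate text and image references into procedural 3D modeling code, and builds 3DCodeArena, a ranking platform based on human preferences.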

Comments Project Page: https://www.3dcodebench.com/; 11 pages (main), with appendix

Abstract

Procedural 3D modeling through code is emerging as a versatile paradigm, offering deterministic, engine-ready, and precisely editable assets that neural 3D generators inherently lack. Authoring such procedural content, however, demands deep expertise in 3D software APIs, parametric design, and code-level geometric reasoning. In this paper, we propose 3DCodeBench, a systematic benchmark for evaluating vision-language model (VLM) agents for procedural 3D generation in 3D modeling software. Specifically, 3DCodeBench evaluates how effectively 12 advanced VLMs can serve as procedural 3D modelers by translating text and image references into procedural code for 3D modeling software. Recognizing that automated metrics may not fully capture the perceptual quality of 3D shapes, we build 3DCodeArena, a ranking platform based on pairwise human preferences over generated 3D outputs. From extensive evaluations and results, we observe that: (1) Failures mostly arise from API mismatches, while successful renders still suffer from disconnected or floating 3D geometric components. (2) Test-time scaling, such as higher thinking budgets and multi-turn refinement, improves performance overall. Our findings highlight a critical need for high-quality procedural coding data to advance commercial VLMs. Furthermore, effective procedural 3D modeling requires a robust execution environment that provides high-fidelity feedback for iterative refinement. We release 3DCodeBench, including the curated large-scale dataset of multimodal (text/image) prompts, procedural code, 3D object triplets, evaluation protocol, and the public 3DCodeArena platform as a foundational toolkit for exploring VLM-based procedural 3D modelers.

2606.01053 2026-06-02 cs.AI

AnyEdit++: Adaptive Long-Form Knowledge Editing via Bayesian Surprise

Bowen Tian, Caixue He, Jiemin Wu, Jingying Wang, Wenshuo Chen, Zexi Li, Yutao Yue

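Affiliations: University of Science and Technology of China

AI summary: Proposes AnyEdit++, a framework that uses Bayes-Chunk, an adaptive segmentation mechanism based on Bayesian surprise, for structure-aware long-form knowledge editing; it outperforms existing methods on mathematical reasoning, code generation, and narrative tasks.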

Comments Accepted by ICML 2026

Abstract

Editing complex, long-form knowledge in Large Language Models remains a significant challenge due to the difficulty of maintaining generation coherence. Existing autoregressive methods like AnyEdit alleviate length constraints but rely on Fixed-window Chunking, which disregards logical structure and compromises consistency. To address this, we present AnyEdit++, a structure-aware framework incorporating Bayes-Chunk, an adaptive segmentation mechanism that dynamically identifies semantic boundaries based on Bayesian Surprise. We underpin this approach with a theoretical framework establishing two key principles: (1) Structural Independence: we prove that cross-segment interference is minimized when anchor keys are geometrically orthogonal (a condition naturally satisfied by our surprisal-based boundaries but violated by fixed windows), and (2) Causal Locality: we demonstrate that updates injected at these semantic peaks yield strictly superior control compared to arbitrary split points. Extensive experiments across mathematical reasoning, code generation, and narrative tasks demonstrate that AnyEdit++ achieves superior performance and robustness compared to state-of-the-art baselines, validating that structural awareness is critical for effective long-form knowledge editing.
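
The difference between fixed-window chunking and surprise-driven segmentation can be sketched as follows. This is a simplified illustration, not the paper's Bayes-Chunk: it uses per-token surprisal (negative log probability) as a stand-in for Bayesian surprise, and the probabilities, the threshold, and the function names are all hypothetical.

```python
import math
from typing import List, Sequence


def surprisal(probs: Sequence[float]) -> List[float]:
    """Per-token surprisal -log p(token | prefix), in nats."""
    return [-math.log(p) for p in probs]


def fixed_window_chunks(n_tokens: int, window: int) -> List[range]:
    """Baseline: cut every `window` tokens regardless of content."""
    return [range(s, min(s + window, n_tokens)) for s in range(0, n_tokens, window)]


def surprise_chunks(probs: Sequence[float], threshold: float) -> List[range]:
    """Start a new chunk at each token whose surprisal exceeds `threshold`."""
    scores = surprisal(probs)
    starts = [0] + [i for i, s in enumerate(scores) if i > 0 and s > threshold]
    ends = starts[1:] + [len(probs)]
    return [range(a, b) for a, b in zip(starts, ends)]


# Hypothetical token probabilities: low values mark likely semantic shifts.
probs = [0.9, 0.8, 0.05, 0.9, 0.7, 0.9, 0.02, 0.8]
adaptive = surprise_chunks(probs, threshold=2.0)
fixed = fixed_window_chunks(len(probs), window=3)
print([list(r) for r in adaptive])
print([list(r) for r in fixed])
```

In this toy input the adaptive boundaries fall on the two low-probability tokens, while the fixed window cuts at positions 3 and 6 whether or not anything changes there.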

2606.01051 2026-06-02 cs.LG

Interaction-Limited Safe Continuous-Time RL for Dynamical Medical Treatment

Xun Shen, Yuepeng Wang, Akifumi Wachi, Yongqi Zhou, Richard Weiss, Yoshihiko Fujisawa, Ken Kawano, Mehrshad Sadria, Ying Chen, Xin Liu, Sebastien Gros, Xiao Hu, Kyoung-Sook Kim, Mengmou Li, Katsuki Fujisawa, Kenji Wakabayashi

Affiliations: Tokyo University of Agriculture and Technology; LY Corporation; National University of Singapore; Institute of Science Tokyo; Altos Labs, Inc.; National Institute of Advanced Industrial Science and Technology (AIST); Norwegian University of Science and Technology; Emory University; Hiroshima University

AI summary: Proposes an interaction-limited safe continuous-time reinforcement learning framework that jointly optimizes the treatment policy and the timing of clinical interactions through an option-based semi-Markov decision process, and introduces a safety-tightening mechanism to guarantee trajectory-level safety.

Abstract

Dynamic medical treatment requires deciding treatment intensity and intervention timing, while patient states evolve continuously and adverse events may occur between clinical interactions. Most existing treatment learning methods assume fixed schedules or enforce safety only at discrete decision points. We propose Interaction-Limited Safe Continuous-Time Reinforcement Learning, a framework that jointly optimizes treatment administration and clinical interaction timing under trajectory-level safety constraints. Our key idea is to reformulate the continuous-time treatment problem as an option-based semi-Markov decision process, where each option specifies a continuous-time treatment policy and its duration. We develop a safety-tightening mechanism showing that suitably constructed constraints at interaction times guarantee safety over the full continuous-time trajectory with high probability. We further establish finite-sample guarantees for policy learning from logged treatment trajectories and introduce a practical data-driven conservative surrogate. Experiments show that the proposed adaptive interaction-timing mechanism improves both safety and treatment effectiveness over equidistant interaction schemes across different safe policy optimization methods.

2606.01050 2026-06-02 cs.CV

TextFake: Benchmarking AI-Generated Image Detection on Text-Rich Images

Yuning Zhang, Changtao Miao, Mingyu Liao, Tingyu Liu, Xinghao Wang, Tao Gong, Qi Chu, Nenghai Yu

Affiliations: School of Cyber Science and Technology, University of Science and Technology of China; Anhui Province Key Laboratory of Digital Security; Individual Researcher

AI summary: To fill the gap in AI-generated image detection on text-rich images, builds TextFake, a benchmark of 20,000 images covering 28 languages; evaluates 14 detectors and 3 VLM APIs, finds a systematic performance gap, and diagnoses three failure modes.

Abstract

Recent AI-generated image (AIGI) detectors perform well on natural-image benchmarks, but their behavior on text-rich forgeries, such as fabricated screenshots, documents, and news pages prevalent in misinformation, remains untested. We introduce TextFake, a 20,000-image benchmark for text-rich AIGI detection spanning 28 languages, 4 topic categories, and 2 scene modalities. Fake images are synthesized via a four-stage pipeline that annotates real images along three controlled dimensions and generates counterparts through distribution-aligned structured prompting, ruling out covariate shortcuts. Zero-shot evaluation of 14 specialized detectors and 3 frontier VLM APIs reveals a large systematic gap: no method exceeds 80% accuracy, with some dropping over 60% from natural-image benchmarks. Diagnostic evaluations identify three failure modes: the Text Density Curse, where dense glyphs overwhelm low-level detectors; Cloaking via Rendering Fidelity, where stronger text rendering suppresses generative artifacts; and Threshold Collapse, where routine perturbations drive detectors toward chance-level performance.

2606.01049 2026-06-02 cs.CL

PMC-InterCPT: Rethinking Biomedical Interleaved Data for Multimodal Continued Pretraining

Guanghao Zhu, Zeyu Liu, Zhitian Hou, Pengkai Wang, Zhijie Sang, Minheng Ni, Wenjun Wang, Yanggan Gu, Shuo Cai, Congkai Xie, Jianmin Wu, Hongxia Yang

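Affiliations: The Hong Kong Polytechnic University; Sun Yat-sen University; InfiX.ai; PolyU-Daya Bay Technology and Innovation Research Institute

AI summary: To address missing context and structural noise in the image-text pair data used for medical multimodal continued pretraining, proposes the PMC-InterCPT interleaved corpus, which recovers missing captions, cleans text, reconstructs interleaved samples, and applies LLM-supervised filtering; combined with modality-aware resampling, it improves both medical and general multimodal performance.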

Abstract

Large-scale biomedical image-text datasets extracted from scientific literature provide valuable resources for medical multimodal model training. These datasets are commonly organized as image-caption pairs; however, figure captions are often short, context-dependent, and only partially informative without the surrounding article text. At the same time, large-scale automatic extraction introduces structural noise such as missing captions, residual markup, duplicated context, and incoherent multi-paragraph figure descriptions. We revisit data construction for medical multimodal continued pretraining (CPT) and present PMC-InterCPT, a context-grounded biomedical interleaved corpus that incorporates figure-referencing body text in addition to captions. Our pipeline recovers missing captions, cleans caption and context text, reconstructs coherent interleaved image-text samples, and applies LLM-supervised medical relevance and quality classifiers to filter noisy records. We further reveal strong modality imbalance in the resulting corpus and introduce a four-bucket evidence taxonomy for modality-aware resampling. Through CPT followed by supervised fine-tuning (SFT) on Qwen3.5-4B-Base, PMC-InterCPT effectively improves medical and general multimodal performance while using fewer CPT tokens than the raw source pool. The experimental results also illustrate the complementarity between the data quality and modality for medical multimodal CPT.

2606.01048 2026-06-02 cs.CV

Decoupled Residual Denoising Diffusion Models for Unified and Data Efficient Image-to-Image Translation

Ziyue Lin, Jiahe Hou, Hongyu Xia, Xinrui Xie, Feifei Wang, Yuyin Zhou, Wei Wang, Jiawei Liu, Liangqiong Qu

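Affiliations: The University of Hong Kong; Shenyang Institute of Automation, Chinese Academy of Sciences; The Chinese University of Hong Kong; University of California, Santa Cruz

AI summary: Proposes Decoupled Residual Denoising Diffusion models (DRDD), which decouple the diffusion process into two independent stages, a stochastic noise diffusion and a deterministic residual diffusion, for unified and data-efficient image-to-image translation.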

Comments CVPR 2026

Abstract

We propose Decoupled Residual Denoising Diffusion models (DRDD) for unified and data-efficient image-to-image (I2I) translation. While diffusion models have advanced I2I translation in terms of quality and diversity, we uncover a previously under-explored property in diffusion models. Crucially, beyond its conventional role of manifold lifting (i.e., moving data off low-dimensional manifolds), injecting Gaussian noise facilitates domain harmonization by implicitly aligning feature distributions across domains, a property particularly advantageous for unified I2I translation. However, existing diffusion models prematurely erode this harmonization effect, as noise and residuals are simultaneously removed in a single coupled diffusion process. To address this, DRDD decouples the diffusion process into two sequential and independent diffusion stages: (1) a stochastic noise diffusion for domain harmonization and manifold lifting, and (2) a deterministic residual diffusion that learns the core semantic mapping entirely within the fixed-noise domain. This decoupling preserves harmonization and manifold lifting effects throughout the transformation, substantially simplifying the learning of unified mappings across diverse tasks and domains. Notably, the noise diffusion stage is trained exclusively on abundant, unpaired target-domain images, greatly improving data efficiency. Comprehensive theoretical and empirical analysis demonstrates that DRDD is broadly compatible with mainstream diffusion models and consistently delivers robust, unified I2I translation, even under limited paired data. Our code is available at https://github.com/HKU-HealthAI/DRDD.

2606.01047 2026-06-02 cs.RO

Learning Multi-Modal Trajectory Policies for Data-Efficient Robotic Manipulation

Zijia Chen, Yuenan Hou, Xinhua Jiang, Yu Li, Weijie Li, Li Liu

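Affiliations: College of Electronic Science and Technology, National University of Defense Technology; Shanghai AI Laboratory

AI summary: To address the multi-modal interference that data scarcity causes in robotic manipulation, proposes MATE, a multi-modal trajectory prediction framework based on Mixture-of-Experts; fine-grained sub-token feature decoupling and a cross-modal cosine router yield stable expert assignment, with performance gains on the LIBERO benchmark and in real-world robotic ping-pong experiments.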

Abstract

Robotic manipulation requires the effective integration of heterogeneous inputs, including visual observations, language instructions, and trajectory representations, to generate accurate actions. Existing transformer-based policies typically process these heterogeneous modalities within a shared parameter space, which often leads to modality interference and inefficient representation learning, especially in data-scarce scenarios. While Mixture-of-Experts (MoE) offers a scalable solution through expert specialization, conventional routing mechanisms are often sensitive to such cross-modal representation discrepancies, resulting in unstable expert assignment and expert collapse. In this work, we propose MATE (Multi-ModAl TrajEctory Policies), a novel trajectory prediction framework built upon MoE. Specifically, we introduce a Multi-Modal MoE architecture to achieve fine-grained sub-token feature decoupling, and design a cross-modal cosine router for stable and scale-invariant expert assignment across heterogeneous modalities. We further employ temperature-controlled routing and stochastic noise injection to improve expert balance and prevent premature routing collapse under scarce demonstrations. Experiments on the LIBERO benchmark show that our MATE consistently outperforms prior work under data scarcity. It achieves a 4.75% improvement in average success rate over the trajectory-guided counterpart. Real-world experiments on robotic ping-pong also suggest that the predicted trajectories can provide useful guidance for downstream robotic execution, further indicating the practical feasibility of our algorithm.
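
The routing ideas named in the abstract (cosine scoring for scale invariance, a temperature, and noise injection) can be illustrated with a small sketch. This is not MATE's implementation: the function names, the 2-D expert keys, and the default temperature are assumptions chosen for illustration.

```python
import math
import random
from typing import List, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; unchanged when either vector is rescaled by a positive factor."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_router(
    token: Sequence[float],
    expert_keys: Sequence[Sequence[float]],
    temperature: float = 0.5,
    noise_std: float = 0.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Softmax over temperature-scaled cosine scores, with optional Gaussian logit noise."""
    rng = rng or random.Random(0)
    logits = []
    for key in expert_keys:
        noise = rng.gauss(0.0, noise_std) if noise_std > 0 else 0.0
        logits.append(cosine(token, key) / temperature + noise)
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical 2-D example: two expert keys, one token at two different scales.
keys = [[1.0, 0.0], [0.0, 1.0]]
small = cosine_router([3.0, 1.0], keys)
large = cosine_router([300.0, 100.0], keys)
print(small, large)
```

Because the scores depend only on direction, the token and its 100x rescaled copy receive the same routing probabilities. That is the property that matters when modalities produce features of very different magnitudes; a dot-product router would treat the two inputs very differently.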

2606.01046 2026-06-02 cs.AI

TravelEval: A Comprehensive Benchmarking Framework for Evaluating LLM-Powered Travel Planning Agents

Weiyi Chen, Shuaixiong Wang, Ziyun Gao, Kaichun Hu, Wangze Ni, Shimin Di, Chen Jason Zhang, Lei Chen

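Affiliations: Zhejiang University; Hong Kong Polytechnic University; Southeast University; HKUST (GZ) & HKUST Guangzhou

AI summary: Existing benchmarks overemphasize constraint compliance and lack realism and multi-dimensional evaluation; TravelEval addresses this with a six-dimensional evaluation framework, a realistic data sandbox, and a simulation-based global evaluation method for assessing LLM travel planning.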

Comments 31 pages, 8 figures, accepted by KDD 2026

DOI: 10.1145/3770855.3817533

Abstract

The development of Large Language Models (LLMs) has significantly improved travel planning applications, yet evaluating such models is hampered by the limitations of existing benchmarks: 1) overemphasis on constraint compliance, neglecting multi-dimensional qualities like spatio-temporal cost; 2) datasets lacking real-world authenticity and coverage in key areas (e.g., lodging, transport); and 3) isolated daily plan assessments that miss critical details (e.g., the impact of daily accommodation and visit pacing) needed to evaluate the entire plan. To address this gap, we introduce TravelEval, a realistic and comprehensive benchmark. TravelEval features 1) a novel six-dimensional evaluation framework to holistically assess plans across accuracy, compliance, temporality, spatiality, economy, and utility dimensions; 2) a highly realistic data sandbox with precise accommodation pricing and authentic intercity transportation data; and 3) a simulation-based global evaluation method that emulates complete travel plans with API-integrated geographic information and fine-grained queuing time. Evaluating 12 mainstream approaches with TravelEval reveals several valuable insights, for example that LLMs struggle with globally optimized multi-dimensional planning (especially in spatio-temporal reasoning and budget compliance) and that agentic reasoning strategies offer no consistent improvement. In short, TravelEval facilitates travel plan evaluation via grounded spatio-temporal emulation and comprehensive metrics, providing a robust foundation for advancing LLM-powered travel planning research and applications.

2606.01045 2026-06-02 cs.CL

Child-directed speech facilitates production, not comprehension, in BabyLMs

Bastian Bunzeck, Sina Zarrieß

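Affiliations: Computational Linguistics, Department of Linguistics

AI summary: Uses a frame-completion task to evaluate how child-directed speech (CDS) affects the production abilities of BabyLMs; CDS-trained models produce grammatical completions better than models trained on web data, and comprehension benchmarks underestimate the contribution of CDS.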

Comments Accepted at CoNLL 2026

Abstract

Recent studies suggest that child-directed speech (CDS) is not conducive to language learning in BabyLMs. However, current evaluations focus predominantly on comprehension rather than production, which is central to usage-based theories of language acquisition. These theories argue that CDS facilitates early language use through constructional "frames" (frequent lexical patterns with open slots). Inspired by such theories, we introduce a novel generation-based evaluation in the form of a frame-completion task, and compare Llama models trained with CDS, the BabyLM corpus, and web-crawl data (FineWeb-edu) on comprehension benchmarks and our novel framework. Our results reveal a clear dissociation between models' comprehension and production capabilities: while FineWeb-trained models excel at minimal pairs, CDS-trained models produce grammatical completions substantially earlier in training and concentrate probability mass on appropriate slot-fillers. These findings show that comprehension benchmarks underestimate what CDS affords to BabyLMs.

2606.01044 2026-06-02 cs.CV

Ask4VG: Risk-Aware Question Selection for Reducing Prior-Driven Answers in Medical VQA

Xiaorong Zhu, Qiang Li, Zibo Xu, Weijie Wang, Weizhi Nie

Affiliations: School of Microelectronics, Tianjin University, Tianjin 300072, China; DISI, University of Trento, Trento, Italy; School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China

AI summary: Proposes Ask4VG, a framework that estimates question-induced hallucination risk through counterfactual visual probing and reranks question rewrites to select questions that depend more on image evidence, reducing prior-driven answers in medical VQA.

Abstract

Medical visual question answering requires models to ground their responses in image evidence, because visually unsupported answers can mislead downstream interpretation. However, many medical VQA questions are generic, template-like, or highly similar in form, which can encourage models to learn question-answer shortcuts instead of image-dependent reasoning and thereby increase the risk of hallucinated responses. We propose Ask4VG, a label-free pilot framework for risk-aware question selection. Ask4VG estimates question-induced hallucination risk through counterfactual visual probing: the same question is asked under the original image, a perturbed image, a blank image, and a mismatched image, and the resulting answer relations are converted into weak supervision for a counterfactual risk estimator. The learned estimator then reranks candidate question rewrites to favor intent-preserving questions that are less invariant to missing or mismatched visual evidence before final answer generation. On VQA-RAD with Qwen2-VL-2B-Instruct, prompt-only rewriting increases counterfactual risk, whereas predicted-risk reranking reduces held-out risk from 0.658 to 0.623 and improves exact accuracy from 0.337 to 0.356. A 300-sample PMC-VQA external check shows the same direction of risk reduction with a small accuracy gain. These results suggest that question selection is a promising complement to response-level hallucination mitigation for reliable medical VQA.
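
The counterfactual probing idea can be sketched in a few lines. Ask4VG, as described above, trains a risk estimator from these answer relations; the sketch below skips the learned estimator and scores each question rewrite directly by how often its answer stays the same when the image evidence is perturbed, blanked, or mismatched. The function names, dictionary layout, and example answers are hypothetical.

```python
from typing import Dict


def counterfactual_risk(answers: Dict[str, str]) -> float:
    """Fraction of counterfactual views whose answer matches the original-image answer.

    A high value suggests the answer is driven by the question prior, not the image.
    """
    reference = answers["original"].strip().lower()
    probes = [a for view, a in answers.items() if view != "original"]
    unchanged = sum(1 for a in probes if a.strip().lower() == reference)
    return unchanged / len(probes)


def pick_rewrite(candidates: Dict[str, Dict[str, str]]) -> str:
    """Choose the question rewrite with the lowest counterfactual risk."""
    return min(candidates, key=lambda q: counterfactual_risk(candidates[q]))


# Hypothetical model answers for two rewrites of the same clinical question.
candidates = {
    "Is there an abnormality?": {
        "original": "yes", "perturbed": "yes", "blank": "yes", "mismatched": "yes",
    },
    "Is there a fracture in the left rib shown?": {
        "original": "yes", "perturbed": "yes", "blank": "no", "mismatched": "no",
    },
}
best = pick_rewrite(candidates)
print(best)
```

The generic rewrite gets the same answer on every view, so its risk is 1.0; the specific rewrite changes its answer once the image is removed or swapped, so it is preferred.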

2606.01042 2026-06-02 cs.LG cs.AI

Plausibility Is Not Prediction: Contrastive Evidence for LLM-Based Cellular Perturbation Reasoning

Xinyu Yuan, Xixian Liu, Jianan Zhao, Yashi Zhang, Hongyu Guo, Jian Tang

Affiliations: Mila - Québec AI Institute; University of Montréal; HEC Montréal; University of Ottawa; National Research Council of Canada; CIFAR AI Chair

AI summary: Finds that LLM-based cellular perturbation reasoning produces biologically plausible explanations but predicts poorly, and proposes CORE, which organizes evidence contrastively to improve perturbation-specific prediction.

Abstract

Perturbation experiments are central to understanding cellular mechanisms, but remain costly and sparse, motivating prediction of gene expression responses for unobserved conditions. A promising recent direction leverages large language models (LLMs) as "virtual cell" simulators, using stepwise, knowledge-grounded mechanistic reasoning to infer differential expression, and points toward an interpretable, knowledge-driven paradigm that transcends purely data-driven approaches. However, we find that plausibility is not prediction: despite producing biologically plausible explanations, these methods fail to capture perturbation-specific effects, systematically overestimating differential expression, often underperforming a simple gene-frequency baseline in aggregate evaluations, and collapsing to chance-level performance at the per-gene level. This reveals a reliance on intrinsic gene response tendencies rather than true perturbation reasoning. We trace this failure to how evidence is presented: existing methods evaluate perturbation-gene pairs in isolation, without exposing how related perturbations differ in their effects on the same gene. To address this limitation, we introduce CORE (Contrastive Organization of Relational Evidence), which reframes prediction as a comparison task by organizing evidence into positive and negative outcomes from related perturbations. Using a biomedical knowledge graph for evidence retrieval, CORE improves calibration and substantially boosts perturbation-specific prediction in both LLM-based and non-LLM settings: for example, on drug-perturbation data, CORE-Reasoning improves Qwen3.5-9B aggregate metrics by up to 28.6%, while on generic perturbation data, CORE-Voting raises macro-per-gene AUROC from chance to 0.703 on average across four cell lines. This highlights contrastive evidence organization as essential to reliable LLM-based perturbation reasoning.

2606.01041 2026-06-02 cs.CL

ExpWeaver: LLM Agents Learn from Experience via Latent RAG

Tao Feng, Tianyang Luo, Jingjun Xu, Zhigang Hua, Yan Xie, Shuang Yang, Ge Liu, Jiaxuan You

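Affiliations: University of Science and Technology of China

AI summary: Proposes ExpWeaver, a framework in which LLM agents learn from experience through latent retrieval-augmented generation with no separate RAG module; it encodes experiences with hidden states, retrieves in latent space, and aggregates with cross-attention, reaching state-of-the-art results on 12 of 13 tasks with high token efficiency and strong cross-domain generalization.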

Abstract

Experience learning has achieved promising results in enhancing LLM agent planning and reasoning by integrating past interactions as reusable knowledge. However, existing methods remain confined to explicit text space, retrieving experiences via semantic similarity and concatenating them into the context window, leading to substantial token overhead and a decoupled architecture that separates retrieval from generation. To address these limitations, we propose ExpWeaver, a framework that enables LLM agents to learn from experience via latent retrieval-augmented generation, without requiring a separate RAG module. ExpWeaver encodes experiences using the LLM's own hidden states, retrieves relevant experiences directly in latent space at each decoding step, and integrates them through cross-attention aggregation and gated residual mechanisms. The entire pipeline is optimized end-to-end with reinforcement learning, supporting both generative and ranking tasks. We evaluate ExpWeaver on 13 diverse tasks spanning question answering, reasoning, coding, scientific prediction, and recommendation. Results demonstrate that ExpWeaver achieves state-of-the-art performance on 12 out of 13 tasks, outperforming the strongest baseline by over 6.8%; maintains token efficiency comparable to non-retrieval baselines while text-based retrieval methods require 1.5 to 2 times more tokens; and exhibits superior cross-domain generalization, outperforming the strongest baseline by 16.32% under zero-shot transfer and 15.21% under few-shot transfer. Our code for ExpWeaver is released at https://github.com/ulab-uiuc/ExpWeaver.


2606.01039 2026-06-02 cs.LG cs.AI

OPD+: Rethinking the Advantage Design for On-Policy Distillation


Hanyang Zhao, Haoxian Chen, Han Lin, Genta Indra Winata, David Yao, Wenpin Tang

Affiliations: Columbia University; Amazon; Meta; Capital One

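AI summary: This paper proposes OPD+, which corrects the bias in the reward objective caused by the stop-gradient operation in on-policy distillation and supports multiple f-divergences, improving performance on mathematical reasoning and tool-use benchmarks.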


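AI Chinese abstract (English translation)

On-policy distillation (OPD) is a widely used technique for transferring the capabilities of a strong teacher language model to a base student model, and it can be cast as a reinforcement-learning-style objective over student-generated rollouts. However, although the divergence reward depends on the student model's likelihood, existing work typically adopts a stop-gradient design, mainly for stability, which makes the resulting advantage estimate problematic. In this work, we provide a general optimization framework based on the f-divergence between student and teacher, and we revisit mathematically whether this design space is valid. We prove that, for general divergence functions, a general stop-gradient operation yields biased estimates of the reward objective and of the corresponding gradient. We propose OPD+, a corrected version of OPD that shows improved performance over the baseline KL method and also supports a range of f-divergence choices. We validate our findings on mathematical reasoning and tool-use benchmarks.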

English abstract

On-policy distillation (OPD) is a widely used technique to transfer capabilities from capable teacher language models to base student models, and can be formulated as a reinforcement-learning-style objective using student-generated rollouts. Yet, despite the divergence reward being dependent on the student model likelihood, existing works usually adopt a stop-gradient design, primarily for stability, which makes the resulting advantage estimation questionable. In this work, we provide a generic optimization framework based on the f-divergence between the student and teacher, and mathematically revisit whether such a design space is valid. We prove that a general stop-gradient operation leads to biased estimates of the reward objective and the corresponding gradient for general divergence functions. We propose OPD+, a corrected version of OPD that demonstrates improved performance over the baseline KL approach and also supports the choice of various f-divergences. We validate our findings on mathematical reasoning and tool-use benchmarks.
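The stop-gradient issue can be illustrated on a toy categorical policy. The sketch below is an assumption-laden illustration, not the paper's derivation or the OPD+ estimator: it assumes a single-step softmax student with logits `theta`, a fixed teacher `q`, the convention D_f(pi || q) = sum_y q_y f(pi_y / q_y), and a per-sample reward R_y = f(r_y) / r_y with r_y = pi_y / q_y, so that the expected reward under pi equals D_f. It compares the exact gradient of D_f with the score-function gradient obtained when R_y is treated as a constant.

```python
import math

def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def exact_grad(theta, q, f_prime):
    """Exact gradient of D_f = sum_y q_y f(pi_y / q_y) with respect to the logits."""
    pi = softmax(theta)
    fp = [f_prime(p / qy) for p, qy in zip(pi, q)]
    mean_fp = sum(p * v for p, v in zip(pi, fp))
    return [p * (v - mean_fp) for p, v in zip(pi, fp)]

def stop_grad(theta, q, f):
    """Score-function gradient with the reward R_y = f(r_y) / r_y held constant."""
    pi = softmax(theta)
    reward = [f(p / qy) / (p / qy) for p, qy in zip(pi, q)]
    mean_r = sum(p * v for p, v in zip(pi, reward))
    return [p * (v - mean_r) for p, v in zip(pi, reward)]

def max_gap(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

theta = [0.5, -0.2, 0.1]   # made-up student logits
q = [0.2, 0.5, 0.3]        # made-up teacher distribution

# Reverse KL, KL(pi || q): f(r) = r log r, f'(r) = log r + 1.
reverse_kl_gap = max_gap(exact_grad(theta, q, lambda r: math.log(r) + 1.0),
                         stop_grad(theta, q, lambda r: r * math.log(r)))

# Forward KL, KL(q || pi): f(r) = -log r, f'(r) = -1 / r.
forward_kl_gap = max_gap(exact_grad(theta, q, lambda r: -1.0 / r),
                         stop_grad(theta, q, lambda r: -math.log(r)))
```

In this toy setting the two gradients agree for reverse KL up to floating-point error, because f'(r) - f(r)/r is the constant 1 and the expected score is zero. For forward KL the gap is clearly nonzero, which is the kind of mismatch a stop-gradient reward can introduce for a general f.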

