arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.16258 2026-05-22 cs.CV cs.AI cs.RO

IVGT: Implicit Visual Geometry Transformer for Neural Scene Representation

Yuqi Wu, Tianyu Hu, Wenzhao Zheng, Yuanhui Huang, Haowen Sun, Jie Zhou, Jiwen Lu

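AI summary This paper proposes IVGT, an implicit visual geometry transformer that models continuous, coherent geometry from pose-free multi-view images as a neural scene representation. It supports continuous spatial queries at arbitrary 3D positions to predict signed distance and color, and performs strongly across multiple tasks.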

Comments Code: https://github.com/wzzheng/IVGT/

Abstract

Reconstructing coherent 3D geometry and appearance from unposed multi-view images is a fundamental yet challenging problem in computer vision. Most existing visual geometry foundation models predict explicit geometry by regressing pixel-aligned pointmaps, often suffering from redundancy and limited geometric continuity. We propose IVGT, an Implicit Visual Geometry Transformer that implicitly models continuous and coherent geometry from pose-free multi-view images. This formulation learns a continuous neural scene representation in a canonical coordinate system and supports continuous spatial queries at arbitrary 3D positions, retrieving local features to predict signed distance (SDF) values and colors using lightweight decoders. It allows direct extraction of continuous and coherent surface geometry, enabling rendering of RGB images, depth maps, and surface normal maps from arbitrary viewpoints. We train IVGT via multi-dataset joint optimization with 2D supervision and 3D geometric regularization. IVGT demonstrates generalization across scenes and achieves strong performance on various tasks, including mesh and point cloud reconstruction, novel view synthesis, depth and surface normal estimation, and camera pose estimation.

2605.15588 2026-05-22 cs.CL cs.LG

Calibrating LLMs with Semantic-level Reward

Fengfei Yu, Ruijia Niu, Dongxia Wu, Yian Ma, Rose Yu

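AI summary This paper proposes CSR, a calibration framework that calibrates language models directly in semantic space, avoiding the inconsistency that verbalized confidence shows across rephrasings. Experiments show that CSR lowers ECE and raises AUROC across multiple datasets.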

Abstract

As large language models (LLMs) are deployed in consequential settings such as medical question answering and legal reasoning, the ability to estimate when their outputs are likely to be correct is essential for safe and reliable use, requiring well-calibrated uncertainty. Standard reinforcement learning with verifiable rewards (RLVR) trains models with a binary correctness reward that is indifferent to confidence, providing no penalty for confident but wrong predictions and thereby degrading calibration. Recent work addresses this by training models to produce verbalized confidence scores alongside answers and rewarding agreement with correctness. However, verbalized confidence is calibrated at the token level and thus exhibits inconsistency across textual variations with the same semantic meaning. We propose Calibration with Semantic Reward (CSR), a framework that calibrates language models directly in semantic space without a verbalized confidence interface. CSR combines the correctness reward with a novel semantic calibration reward that encourages exploitation among correct rollouts by promoting semantic agreement, and exploration among incorrect ones by discouraging spurious consistency. Experiments across three model families on HotpotQA (in-distribution) and TriviaQA, MSMARCO, and NQ-Open (out-of-distribution) show that CSR consistently achieves lower ECE and higher AUROC than verbalized-confidence baselines across nearly all settings, reducing ECE by up to 40% and improving AUROC by up to 31%, with calibration behavior generalizing robustly across all four evaluation settings.
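
For readers unfamiliar with the ECE metric reported above, here is a minimal sketch of expected calibration error with equal-width confidence bins. The binning scheme, the bin count, and the function name are illustrative assumptions; the abstract does not say how CSR turns semantic agreement into a confidence score, so the scores are simply taken as input.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sum over bins of (bin weight) * |accuracy - mean confidence|."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp the index so that conf == 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece
```

A model that reports 0.9 confidence on four answers and gets three right has a gap of |0.75 - 0.9| = 0.15 in its single occupied bin, so lower ECE means stated confidence tracks accuracy more closely.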

2605.15505 2026-05-22 cs.AI cs.IR cs.LG

X-SYNTH: Beyond Retrieval -- Enterprise Context Synthesis from Observed Digital Human Attention

Guruprasad Raghavan, George Nychis, Rohan Narayana Murthy

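AI summary This paper proposes X-SYNTH, a framework that addresses enterprise context synthesis by analyzing behavioral patterns in digital human attention. It assembles context from those behavioral patterns rather than from conventional retrieval, substantially raising the True Lead Rate and lowering the False Lead Rate.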

Comments 11 pages, 7 figures, 5 tables

Abstract

In enterprise operations, the context required for an AI agent task is scattered across systems of record, static information stores, and communication channels. What is stored is system state, a lossy representation of the work that actually happened. The prevailing approach retrieves by matching request content to what is stored; for narrow requests this works well. But synthesis quality depends on knowing what to surface and how to interpret it: knowledge specific to each organization, team, and individual, present in behavioral patterns, absent from any retrieval index. For the agentic task of proposing enterprise-valuable leads to sellers, this approach breaks down: True Lead Rate is low, False Lead Rate is high, and the model has no mechanism to improve. We present X-SYNTH, a framework for enterprise context synthesis grounded in digital human attention, the digitally observable interaction signatures of each worker, encoding what they did, the sequence in which they did it, and implicit reward signals. Behavioral traces preceding positive outcomes are distinguishable from those that did not, without external labeling. X-SYNTH models each individual's behavioral baseline as a Digital Twin Signature (DTS) and selects among seven attention filters, Proportional, Inverse, Differential, Recurrent, Comparative, Sequential, and Collective, per individual and per query, to identify causally relevant activity signatures. A four-stage pipeline assembles ranked context grounded in behavioral patterns rather than query embeddings. A frontier model unaided achieves 9.5% True Lead Rate (TLR) with 90.5% False Lead Rate (FLR). Augmented with X-SYNTH, TLR rises to 61.9% (6.5x) while FLR falls to 18.8%. Enterprise context synthesis is not a retrieval problem. It is a relevance problem, and digital human attention is its most reliable ground truth.

2605.14598 2026-05-22 cs.RO

DSSP: Diffusion State Space Policy with Full-History Encoding

Zhiyuan Guan, Jianshu Hu, Han Fang, Yunpeng Jiang, Yize Huang, Shujia Li, Xiao Li, Yutong Ban

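AI summary This paper proposes DSSP, a diffusion state space policy with full-history encoding that improves the handling of history dependence in long-horizon robot manipulation tasks. It compresses the observation history into a compact context and reaches state-of-the-art performance with a significantly smaller model.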

Abstract

Diffusion-based imitation learning has shown strong promise for robot manipulation. However, most existing policies condition only on the current observation or a short window of recent observations, limiting their ability to resolve history-dependent ambiguities in long-horizon tasks. To address this, we introduce DSSP, a history-conditioned Diffusion State Space Policy that enables efficient, full-history conditioning for robot manipulation. Leveraging the continuous sequence modeling properties of State Space Models (SSMs), our history encoder effectively compresses the entire observation stream into a compact context representation. To ensure this context preserves critical information regarding future state evolution, the encoder is optimized with a dynamics-aware auxiliary training objective. This high-level context representation is then seamlessly fused with recent state observations to form a hierarchical conditioning mechanism for action generation. Furthermore, to maintain architectural consistency and minimize GPU memory overhead, we also instantiate the diffusion backbone itself using an SSM. Extensive experiments across simulation benchmarks and real-world manipulation tasks show that DSSP achieves state-of-the-art performance with a significantly smaller model size, demonstrating superior efficiency of the hierarchical conditioning in capturing crucial information as the history length increases.
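
To illustrate why a state space model can condition on the full history at constant memory, here is a toy sketch of a diagonal linear recurrence that folds an observation stream into one fixed-size state. This is not DSSP's encoder: the real model is learned and trained with a dynamics-aware auxiliary objective, and the names `encode_history`, `decay`, and `input_gain` are hypothetical.

```python
def encode_history(observations, decay, input_gain):
    """Fold a sequence of observation vectors into one fixed-size state.

    Diagonal linear recurrence, applied elementwise:
        h_t = decay * h_{t-1} + input_gain * x_t
    """
    state = [0.0] * len(decay)
    for obs in observations:
        state = [a * h + b * x for a, h, b, x in zip(decay, state, input_gain, obs)]
    return state
```

The returned state has the same size whether the stream holds three observations or three thousand, which is the property that makes full-history conditioning affordable in this family of models.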

2605.14322 2026-05-22 cs.AI

Are Agents Ready to Teach? A Multi-Stage Benchmark for Real-World Teaching Workflows

Zixin Chen, Peng Liu, Rui Sheng, Haobo Li, Jianhong Tu, Xiaodong Deng, Kashun Shum, Dayiheng Liu, Huamin Qu

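AI summary This paper proposes EduAgentBench, a benchmark for holistically evaluating tutor agents. It finds that current models handle bounded pedagogical judgment but fall short in situated tutoring and autonomous teaching workflows, and it provides a measurement foundation for developing future tutor agents.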

Comments Under review

Abstract

Language agents are increasingly deployed in complex professional workflows, with tutoring emerging as a particularly high-stakes capability that remains largely unmeasured in existing benchmarks. Effective tutor agents require more than producing correct answers or executing accurate tool calls: a robust tutor must diagnose learner state, adapt support over time, make pedagogically justified decisions grounded in educational evidence, and execute interventions within realistic learning-management systems. We introduce EduAgentBench, a source-grounded benchmark for holistically evaluating tutor agents across the full scope of teaching work. It contains 150 quality-controlled tasks across three capability surfaces: professional pedagogical judgment, situated multi-turn tutoring, and Canvas-style teaching workflow completion. Tasks are constructed through a pedagogical-insight-driven pipeline and evaluated with complementary verification signals and human review. Across a comprehensive evaluation of frontier models, our findings reveal that current models are generally capable of bounded pedagogical judgment, but still fall short of professional teaching standards in situated tutoring and autonomous teaching-workflow execution. To our knowledge, EduAgentBench is the first theory-grounded and realistic benchmark for evaluating the holistic teaching capability of tutor agents, providing a measurement foundation for developing future tutor agents that can support realistic teaching work.

2605.12836 2026-05-22 cs.LG

Discrete Stochastic Localization for Non-autoregressive Generation

Yunshu Wu, Jiayi Cheng, Longxuan Yu, Partha Thakuria, Rob Brekelmans, Evangelos E. Papalexakis, Greg Ver Steeg

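AI summary This paper proposes Discrete Stochastic Localization, a continuous-state framework with unit-sphere token embeddings, to improve distributional faithfulness in discrete sequence generation, and shows improved MAUVE on OpenWebText.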

Comments This work was intended as a replacement of arXiv:2602.16169 and any subsequent updates will appear there

Abstract

Continuous diffusion is a natural framework for non-autoregressive generation but has generally lagged behind masked discrete diffusion models (MDMs) on discrete sequence generation. We argue that the bottleneck is not continuity itself, but a representation in which denoising depends on timestep-indexed noise regimes. We introduce Discrete Stochastic Localization (DSL), a continuous-state framework with unit-sphere token embeddings whose Bayes-optimal denoiser is invariant to the nominal signal-to-noise ratio (SNR) under the localization channel. One trained network then supports an entire family of per-token SNR paths, with endpoint masked-diffusion paths as a special case. Fine-tuning a pretrained MDLM checkpoint with DSL substantially improves distributional faithfulness (MAUVE) on OpenWebText across all step budgets from $T=128$ to $T=1024$, and the same checkpoint supports random-order autoregressive sampling, as well as a hybrid continuous-then-discrete sampler using as few as $T=48$ total steps -- without distillation or retraining.

2605.12623 2026-05-22 cs.CL cs.CV cs.LG

DocAtlas: Multilingual Document Understanding Across 80+ Languages

Ahmed Heakl, Youssef Mohamed, Abdullah Sohail, Rania Elbadry, Ahmed Nassar, Peter W. J. Staar, Fahad Shahbaz Khan, Imran Razzak, Salman Khan

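AI summary This paper proposes DocAtlas, a framework that builds high-fidelity OCR datasets and benchmarks covering 82 languages and 9 evaluation tasks. Its dual pipelines generate precise structural annotations, and it shows that Direct Preference Optimization is effective for multilingual adaptation, improving both in-domain and out-of-domain accuracy.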

Comments Under submission

Abstract

Multilingual document understanding remains limited for low-resource languages due to scarce training data and model-based annotation pipelines that perpetuate existing biases. We introduce DocAtlas, a framework that constructs high-fidelity OCR datasets and benchmarks covering 82 languages and 9 evaluation tasks. Our dual pipelines, differential rendering of native DOCX documents and synthetic LaTeX-based generation for right-to-left scripts, produce precise structural annotations in a unified DocTag format encoding layout, text, and component types, without learned models for core annotation. Evaluating 16 state-of-the-art models reveals persistent gaps in low-resource scripts. We show that Direct Preference Optimization (DPO) using rendering-derived ground truth as positive signal achieves stable multilingual adaptation, improving both in-domain (+1.9%) and out-of-domain (+1.8%) accuracy without measurable base-language degradation, whereas supervised fine-tuning degrades out-of-domain performance by up to 21%. Our best variant, DocAtlas-DeepSeek, improves +1.7% over the strongest baseline. Code is available at https://github.com/ahmedheakl/DocAtlas.

2605.10067 2026-05-22 cs.LG cs.AI

Metis: Learning to Jailbreak LLMs via Self-Evolving Metacognitive Policy Optimization

Huilin Zhou, Jian Zhao, Yilu Zhong, Zhen Liang, Xiuyuan Chen, Yuchen Yuan, Tianle Zhang, Chi Zhang, Lan Zhang, Xuelong Li

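AI summary This paper proposes Metis, a framework that reformulates jailbreaking as inference-time policy optimization within an adversarial partially observable Markov decision process, improving the efficiency and effectiveness of adversarial testing while gaining interpretability through structured feedback and transparent reasoning traces. Experiments show that Metis achieves a higher attack success rate at lower token cost across a range of models.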

Comments Accepted to the 43rd International Conference on Machine Learning (ICML 2026)

Abstract

Red teaming is critical for uncovering vulnerabilities in Large Language Models (LLMs). While automated methods have improved scalability, existing approaches often rely on static heuristics or stochastic search, rendering them brittle against advanced safety alignment. To address this, we introduce Metis, a framework that reformulates jailbreaking as inference-time policy optimization within an adversarial Partially Observable Markov Decision Process (POMDP). Metis employs a self-evolving metacognitive loop to perform causal diagnosis of a target's defense logic and leverages structured feedback as a semantic gradient to refine its policy, offering enhanced interpretability through transparent reasoning traces. Extensive evaluations across 10 diverse models demonstrate that Metis achieves the strongest average Attack Success Rate (ASR) among compared methods at 89.2%, maintaining high efficacy on resilient frontier models (e.g., 76.0% on O1 and 78.0% on GPT-5-chat) where traditional baselines exhibit substantial performance degradation. By replacing redundant exploration with directed optimization, Metis reduces token costs by an average of 8.2x and up to 11.4x. Our analysis reveals that current defenses remain vulnerable to internally-steered, closed-loop reasoning trajectories under the tested settings, highlighting a critical need for next-generation defenses capable of reasoning about safety dynamically during inference.

2605.09273 2026-05-22 cs.LG

Instance-Adaptive Online Multicalibration

Zhiming Huang, Jamie Morgenstern, Aaron Roth, Claire Jie Zhang

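AI summary This paper proposes an efficient instance-adaptive algorithm for online multicalibration that adaptively refines a dyadic grid of prediction values to interpolate between benign and worst-case sequences, recovering the worst-case-optimal rate while achieving faster rates on easier instances.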

Comments We tightened the analysis and added a comparison to the concurrent work of Liu et al. (arXiv:2605.11490)

Abstract

We study online multicalibration beyond the worst case. We give a single, efficient algorithm that dynamically interpolates between benign and worst-case sequences by adaptively refining a dyadic grid of prediction values. Its error is controlled by the number of leaves in the refinement tree. Our analysis recovers the known $\widetilde O(T^{2/3})$ worst-case-optimal rate for online multicalibration, while automatically adapting to easier instances: in the marginal stochastic setting it obtains a rate of $\widetilde O(\sqrt T)$, and for piecewise-stationary means with $J$ segments its rate is $\widetilde O(\sqrt{JT})$. More generally, the rate depends on a threshold-complexity measure of the predictable mean process relative to the group family. We show that this dependence is tight up to logarithmic factors.
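
As a reading aid for the three rates quoted above, the short calculation below plugs two values of $J$ into the piecewise-stationary bound. It is plain arithmetic on the stated rates, ignoring logarithmic factors, and is not a claim taken from the paper.

```latex
\[
\sqrt{JT}\,\Big|_{J=1} = \sqrt{T},
\qquad
\sqrt{JT}\,\Big|_{J=T^{1/3}} = \sqrt{T^{4/3}} = T^{2/3}.
\]
```

So a single stationary segment gives the same order as the marginal stochastic rate, and once the number of segments reaches about $T^{1/3}$ the piecewise bound is no better than the worst-case rate.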

2605.09252 2026-05-22 cs.CL

LLM Agents Already Know When to Call Tools -- Even Without Reasoning

Chung-En Sun, Linbo Liu, Ge Yan, Zimo Wang, Tsui-Wei Weng

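AI summary This paper proposes When2Tool, a benchmark of 18 environments for studying when a tool call is necessary. It finds that models can already identify when a tool is needed but fail to act on that knowledge during generation, and proposes Probe&Prefill, which substantially reduces tool calls.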

Abstract

Tool-augmented LLM agents tend to call tools indiscriminately, even when the model can answer directly. Each unnecessary call wastes API fees and latency, yet no existing benchmark systematically studies when a tool call is actually needed. We propose When2Tool, a benchmark of 18 environments (15 single-hop, 3 multi-hop) spanning three categories of tool necessity -- computational scale, knowledge boundaries, and execution reliability -- each with controlled difficulty levels that create a clear decision boundary between tool-necessary and tool-unnecessary tasks. We evaluate two families of training-free baselines: Prompt-only (varying the prompt to discourage unnecessary calls) and Reason-then-Act (requiring the model to reason about tool necessity before acting). Both provide limited control: Prompt-only suppresses necessary calls alongside unnecessary ones, and Reason-then-Act still incurs a disproportionate accuracy cost on hard tasks. To understand why these baselines fail, we probe the models' hidden states and find that tool necessity is linearly decodable from the pre-generation representation with AUROC 0.89--0.96 across six models, substantially exceeding the model's own verbalized reasoning. This reveals that models already know when tools are needed, but fail to act on this knowledge during generation. Building on this finding, we propose Probe&Prefill, which uses a lightweight linear probe to read the hidden-state signal and prefills the model's response with a steering sentence. Across all models tested, Probe&Prefill reduces tool calls by 48% with only 1.7% accuracy loss, while the best baseline at comparable accuracy only reduces 6% of tool calls, or achieves a similar tool call reduction but incurs a 5$\times$ higher accuracy loss. Our code is available at https://github.com/Trustworthy-ML-Lab/when2tool
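
The Probe&Prefill idea described above can be sketched in a few lines: a linear probe scores the pre-generation hidden state, and the score selects a sentence that is prefilled into the response. Everything concrete here is an assumption for illustration: the steering sentences, the 0.5 threshold, and the function names are hypothetical, and the released code at the URL above is the reference for the actual method.

```python
import math

# Hypothetical steering sentences; the abstract does not give the exact wording.
PREFILL_TOOL = "I need to call a tool for this."
PREFILL_DIRECT = "I can answer this directly."


def probe_score(hidden_state, weights, bias):
    """Linear probe: sigmoid(w . h + b), read as P(tool is needed)."""
    logit = sum(w * h for w, h in zip(weights, hidden_state)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def choose_prefill(hidden_state, weights, bias, threshold=0.5):
    """Pick the sentence that will start the model's response."""
    needed = probe_score(hidden_state, weights, bias) >= threshold
    return PREFILL_TOOL if needed else PREFILL_DIRECT
```

In a real pipeline the probe weights would be fit on labeled tool-necessary and tool-unnecessary examples, and the chosen sentence would be passed to the model as the beginning of its own turn.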

2605.07287 2026-05-22 cs.CV

SplatWeaver: Learning to Allocate Gaussian Primitives for Generalizable Novel View Synthesis

Yecong Wan, Fan Li, Mingwen Shao, Wangmeng Zuo

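AI summary This paper proposes SplatWeaver, a framework for generalizable novel view synthesis that allocates Gaussian primitives dynamically, addressing the wasted primitives and insufficient capacity caused by fixed per-pixel allocation.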

Comments Project Page: https://yecongwan.github.io/SplatWeaver/

Abstract

Generalizable novel view synthesis aims to render unseen views from uncalibrated input images without requiring per-scene optimization. Recent feed-forward approaches based on 3D Gaussian Splatting have achieved promising efficiency and rendering quality. However, most of them assign a fixed number of Gaussians to each pixel or voxel, ignoring the spatially varying complexity of real-world scenes. Such uniform allocation often wastes Gaussian primitives in smooth regions while providing insufficient capacity for fine structures, complex geometry, and high-frequency details. This motivates us to predict region-dependent primitive cardinalities rather than impose a fixed primitive budget everywhere, enabling a more expressive 3D scene representation. Therefore, we propose SplatWeaver, a generalizable novel view synthesis framework that is able to dynamically allocate Gaussian primitives over different regions in a feed-forward manner. Specifically, SplatWeaver introduces cardinality Gaussian experts and a pixel-level routing scheme, wherein each expert specializes in producing a specific number of primitives from 0 to M, and the routing scheme coordinates these experts to adaptively determine how many Gaussian primitives should be allocated to each spatial location. Moreover, SplatWeaver incorporates a high-frequency prior with an attendant guidance module and routing regularization to stabilize expert selection and promote complexity-aware allocation. By leveraging high-frequency cues, the routing process is encouraged to assign more Gaussian primitives to fine structures and textured regions, while suppressing redundancy in smooth areas. Extensive experiments across diverse scenarios show that SplatWeaver consistently outperforms state-of-the-art methods, delivering more faithful novel-view renderings with fewer Gaussian primitives. Project Page: https://yecongwan.github.io/SplatWeaver/
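
The pixel-level routing over cardinality experts can be pictured with the toy sketch below, in which expert k is assumed to emit exactly k primitives and each pixel takes the expert with the highest routing logit. The hard argmax and the function names are illustrative assumptions; the abstract does not say how routing is made trainable or how the high-frequency prior enters.

```python
def route_cardinality(logits):
    """Return the index of the highest routing logit.

    Expert k is assumed to emit exactly k primitives, so the index
    doubles as the primitive count (0..M) for this pixel.
    """
    return max(range(len(logits)), key=lambda k: logits[k])


def allocate(routing_logits):
    """Map a 2D grid of per-pixel logit lists to per-pixel primitive counts."""
    return [[route_cardinality(px) for px in row] for row in routing_logits]
```

With M = 2, a smooth pixel whose logits favor expert 0 receives no primitives while a detailed pixel favoring expert 2 receives two, so the total budget follows scene complexity instead of image resolution.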

2605.05765 2026-05-22 cs.CV

X-OmniClaw Technical Report: A Unified Mobile Agent for Multimodal Understanding and Interaction

Xiaoming Ren, Ru Zhen, Chao Li, Yang Song, Qiuxia Hou, Yanhao Zhang, Peng Liu, Qi Qi, Quanlong Zheng, Qi Wu, Zhenyi Liao, Binqiang Pan, Haobo Ji, Haonan Lu

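AI summary This paper proposes X-OmniClaw, a unified mobile agent for multimodal understanding and interaction in the Android ecosystem. Its unified architecture of perception, memory, and action improves contextual awareness on complex mobile tasks, and demonstrations show gains in interaction efficiency and task reliability.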

Comments 12 pages, 7 figures

Abstract

Inspired by the development of OpenClaw, there is a growing demand for mobile-based personal agents capable of handling complex and intuitive interactions. In this technical report, we introduce X-OmniClaw, a unified mobile agent designed for multimodal understanding and interaction in the Android ecosystem. This unified architecture of perception, memory, and action enables the agent to handle complex mobile tasks with high contextual awareness. Specifically, Omni Perception provides a unified multimodal ingress pipeline that integrates UI states, real-world visual contexts, and speech inputs, leveraging a temporal alignment module to decompose raw data into structured multimodal intent representations. Omni Memory leverages multimodal memory optimization to enhance personalized intelligence by integrating runtime working memory for task continuity with long-term personal memory distilled from local data, enabling highly context-aware and personalized interactions. Finally, Omni Action employs a hybrid grounding strategy that combines structural XML metadata with visual perception for robust interaction. Through Behavior Cloning and Trajectory Replay, the system captures user navigation as reusable skills, enabling precise direct-access execution. Demonstrations across diverse scenarios show that X-OmniClaw effectively enhances interaction efficiency and task reliability, providing a practical architectural blueprint for the next generation of mobile-native personal assistants.

2605.01466 2026-05-22 cs.CV cs.LG

SplAttN: Bridging 2D and 3D with Gaussian Soft Splatting and Attention for Point Cloud Completion

Zhaoyang Li, Zhichao You, Tianrui Li

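AI summary This paper proposes SplAttN, which uses Gaussian soft splatting and attention to connect the 2D and 3D modalities in point cloud completion. It addresses the Cross-Modal Entropy Collapse caused by conventional hard projection and makes the cross-modal connection easier to learn.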

Comments Accepted as a Spotlight paper at ICML 2026; camera-ready version

Abstract

Although multi-modal learning has advanced point cloud completion, the theoretical mechanisms remain unclear. Recent works attribute success to the connection between modalities, yet we identify that standard hard projection severs this connection: projecting a sparse point cloud onto the image plane yields an extremely sparse support, which hinders visual prior propagation, a failure mode we term Cross-Modal Entropy Collapse. To address this practical limitation, we propose SplAttN, which replaces hard projection with Differentiable Gaussian Splatting to produce a dense, continuous image-plane representation. By reformulating projection as continuous density estimation, SplAttN avoids collapsed sparse support, facilitates gradient flow, and improves cross-modal connection learnability. Extensive experiments show that SplAttN achieves state-of-the-art performance on PCN and ShapeNet-55/34. Crucially, we utilize the real-world KITTI benchmark as a stress test for multi-modal reliance. Counter-factual evaluation reveals that while baselines degenerate into unimodal template retrievers insensitive to visual removal, SplAttN maintains a robust dependency on visual cues, validating that our method establishes an effective cross-modal connection. Code is available at https://github.com/zay002/SplAttN.
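
The contrast between hard projection and soft splatting can be seen in a toy image-plane example. The sketch assumes points are already projected to in-bounds pixel coordinates and uses an isotropic Gaussian with a hand-picked `sigma`; the function names are hypothetical and this is not the SplAttN implementation, which is available at the URL above.

```python
import math


def hard_project(points, size):
    """Each 2D point writes 1.0 to its nearest pixel only."""
    img = [[0.0] * size for _ in range(size)]
    for x, y in points:
        img[int(round(y))][int(round(x))] = 1.0
    return img


def soft_splat(points, size, sigma=1.0):
    """Each 2D point adds an isotropic Gaussian footprint to every pixel."""
    img = [[0.0] * size for _ in range(size)]
    for x, y in points:
        for r in range(size):
            for c in range(size):
                d2 = (c - x) ** 2 + (r - y) ** 2
                img[r][c] += math.exp(-d2 / (2.0 * sigma ** 2))
    return img


def support(img, eps=1e-3):
    """Count pixels whose value exceeds eps."""
    return sum(1 for row in img for v in row if v > eps)
```

For one point at the center of a 5x5 image, hard projection touches a single pixel while the Gaussian footprint reaches all 25, which is the denser support the abstract credits with better gradient flow.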

2605.00414 2026-05-22 cs.LG cond-mat.stat-mech cs.AI

Trees to Flows and Back: Unifying Decision Trees and Diffusion Models

Sai Niranjan Ramachandran, Suvrit Sra

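AI summary This paper unifies decision trees and diffusion models by establishing a mathematical correspondence between hierarchical decision trees and diffusion processes, revealing a shared optimization principle, Global Trajectory Score Matching. It presents two practical instantiations: treeflow, which achieves competitive tabular data generation with faster computation, and dsmtree, which transfers hierarchical decision logic into neural networks and comes close to teacher performance on many benchmarks.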

Comments 12 pages (main), 68 pages (inclusive of appendix), Accepted in the Forty-Third International Conference on Machine Learning (ICML) 2026

Abstract

Decision trees and diffusion models are ostensibly disparate model classes, one discrete and hierarchical, the other continuous and dynamic. This work unifies the two by establishing a crisp mathematical correspondence between hierarchical decision trees and diffusion processes in appropriate limiting regimes. Our unification reveals a shared optimization principle: Global Trajectory Score Matching (GTSM), for which gradient boosting (in an idealized version) is asymptotically optimal. We underscore the conceptual value of our work through two key practical instantiations: treeflow, which achieves competitive generation quality on tabular data with higher fidelity and a 2$\times$ computational speedup, and dsmtree, a novel distillation method that transfers hierarchical decision logic into neural networks, matching teacher performance within 2% on many benchmarks.

2605.00185 2026-05-22 cs.LG cs.AI

Fair Dataset Distillation via Cross-Group Barycenter Alignment

通过跨组重心对齐实现公平的数据集蒸馏

Mohammad Hossein Moslemi, Nima Hosseini Dashtbayaz, Zhimin Mei, Bissan Ghaddar, Boyu Wang

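AI summary: This paper studies the fairness problems that arise in dataset distillation when demographic groups exhibit different predictive patterns, and proposes cross-group barycenter alignment to reduce prediction bias between groups and improve model fairness.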

Comments Accepted by ICML 2026

AI Chinese abstract

数据集蒸馏旨在将大规模数据集压缩成小规模合成数据集,同时保持预测性能。我们发现,由于不同人口群体表现出不同的预测模式,蒸馏过程在保持所有子群体信息信号方面面临困难,无论群体大小是轻微还是严重不平衡。因此,训练在蒸馏数据上的模型可能会在某些子群体上出现显著性能下降,导致公平性差距。关键的是,这些差距不会仅仅通过纠正群体不平衡来消失,因为它们源于子群体预测模式的根本不匹配,而不是样本数量差异本身。因此,我们正式分析了这两种偏差源之间的相互作用,并将解决方案定义为识别一个不考虑群体不平衡的预测信息重心,该重心在所有子群体中诱导出相似的表示。通过向这个共享的聚合表示进行蒸馏,我们证明可以减少群体公平性方面的担忧。我们的方法与现有蒸馏方法兼容,并且实验证明,它显著减少了数据集蒸馏引入的偏差。代码可在https://github.com/mhmoslemi/COBRA上获得。

English abstract

Dataset Distillation aims to compress a large dataset into a small synthetic one while maintaining predictive performance. We show that as different demographic groups exhibit distinct predictive patterns, the distillation process struggles to simultaneously preserve informative signals for all subgroups, regardless of whether group sizes are mildly or severely imbalanced. Consequently, models trained on distilled data can experience substantial performance drops for certain subgroups, leading to fairness gaps. Crucially, these gaps do not disappear by merely correcting group imbalance, since they stem from fundamental mismatches in subgroup predictive patterns rather than from sample-size disparities alone. We therefore formally analyze the interaction between these two sources of bias and cast the solution as identifying a group-imbalance-agnostic barycenter of the predictive information that induces similar representations across all subgroups. By distilling toward this shared aggregate representation, we show that group fairness concerns can be reduced. Our approach is compatible with existing distillation methods, and empirical results show that it substantially reduces bias introduced by dataset distillation. Code is available at https://github.com/mhmoslemi/COBRA.
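As a rough illustration of what "group-imbalance-agnostic" can mean, the sketch below contrasts a pooled mean, which is dominated by the larger group, with an equal-weight Euclidean barycenter of per-group means. This is a deliberately simplified stand-in and not the COBRA method: the two-dimensional "representations", the group names, and the choice of a Euclidean barycenter are all assumptions.

```python
def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(v[i] for v in vectors) / count for i in range(len(vectors[0]))]

def pooled_mean(groups):
    """Mean over all samples; large groups dominate the result."""
    return mean([v for members in groups.values() for v in members])

def equal_weight_barycenter(groups):
    """Mean of per-group means; every group counts once, whatever its size."""
    return mean([mean(members) for members in groups.values()])

# Hypothetical 2D representations with a 9:1 group imbalance.
groups = {
    "majority": [[1.0, 0.0]] * 9,
    "minority": [[0.0, 1.0]] * 1,
}
print(pooled_mean(groups))              # pulled toward the majority group
print(equal_weight_barycenter(groups))  # sits midway between the two groups
```

A distillation target built from the pooled statistic would mostly encode the majority pattern, whereas the equal-weight target is unaffected by how many samples each group contributes.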

2604.24762 2026-05-22 cs.CV

OmniShotCut: Holistic Relational Shot Boundary Detection with Shot-Query Transformer

OmniShotCut:基于镜头查询Transformer的整体关系式镜头边界检测

Boyang Wang, Guangyi Xu, Jiahui Zhang, Zhipeng Tang, Zezhou Cheng

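AI summary: This paper proposes OmniShotCut, which uses a shot-query-based dense video Transformer to formulate shot boundary detection as structured relational prediction, jointly estimating shot ranges with intra-shot and inter-shot relations. It targets the shortcomings of existing methods: non-interpretable boundaries, missed subtle yet harmful discontinuities, and reliance on noisy, low-diversity annotations and outdated benchmarks.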

AI Chinese abstract

Shot Boundary Detection (SBD)旨在自动识别shot变化并将视频划分为连贯的shot。尽管SBD在文献中被广泛研究,现有方法往往在转换处产生不可解释的边界,错过细微但有害的断点,并依赖于噪声大、低多样性的标注和过时的基准。为缓解这些限制,我们提出OmniShotCut,将SBD建模为结构化关系预测,通过shot查询基于的密集视频Transformer,联合估计shot范围、shot内关系和shot间关系。为避免不精确的手动标注,我们采用完全合成的过渡合成管道,自动重现主要过渡家族并精确生成参数化变体。我们还引入OmniShotCutBench,一个现代宽领域基准,能够实现整体和诊断评估。在基准上的实验展示了我们方法的有效性和通用性。

English abstract

Shot Boundary Detection (SBD) aims to automatically identify shot changes and divide a video into coherent shots. Although SBD has been widely studied in the literature, existing methods often produce non-interpretable boundaries on transitions, miss subtle yet harmful discontinuities, and rely on noisy, low-diversity annotations and outdated benchmarks. To alleviate these limitations, we propose OmniShotCut, which formulates SBD as structured relational prediction and uses a shot-query-based dense video Transformer to jointly estimate shot ranges together with intra-shot and inter-shot relations. To avoid imprecise manual labeling, we adopt a fully synthetic transition synthesis pipeline that automatically reproduces major transition families with precise boundaries and parameterized variants. We also introduce OmniShotCutBench, a modern wide-domain benchmark enabling holistic and diagnostic evaluation. Experiments on the benchmarks demonstrate the effectiveness and generality of our method.

2604.24681 2026-05-22 cs.RO

Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation

从大规模人类示范中学习人类意图先验以用于机器人操作

Yifan Xie, YuAn Wang, Guangyu Chen, Jinkun Liu, Yu Sun, Wenbo Ding

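AI summary: This paper proposes MoT-HRA, a framework that learns human-intention priors from large-scale human demonstrations for robotic manipulation. It builds the HA-2.2M dataset and uses three coupled experts to improve motion plausibility and robustness.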

Comments 13 pages, 5 figures

AI Chinese abstract

人类视频包含丰富的操作先验,但用于机器人学习仍然困难,因为原始观测将场景理解、人类运动和特定于身体的动作纠缠在一起。我们引入MoT-HRA,一种层次化视觉-语言-动作框架,从大规模人类示范中学习人类意图先验。我们首先整理HA-2.2M,一个通过手中心过滤、空间重建、时间分割和语言对齐从异构人类视频中重建出的220万集动作-语言数据集。在此数据集之上,MoT-HRA将操作分解为三个耦合专家:一个视觉-语言专家预测无关身体的3D轨迹,一个意图专家将MANO风格的手部运动建模为潜在的人类运动先验,一个精细专家将意图感知的表示映射到机器人动作块。共享注意力主干和只读键值传输允许下游控制使用人类先验同时限制对上游表示的干扰。在手部运动生成、模拟操作和真实世界机器人任务上的实验表明,MoT-HRA在分布偏移下提高了动作合理性和鲁棒控制。

English abstract

Human videos contain rich manipulation priors, but using them for robot learning remains difficult because raw observations entangle scene understanding, human motion, and embodiment-specific action. We introduce MoT-HRA, a hierarchical vision-language-action framework that learns human-intention priors from large-scale human demonstrations. We first curate HA-2.2M, a 2.2M-episode action-language dataset reconstructed from heterogeneous human videos through hand-centric filtering, spatial reconstruction, temporal segmentation, and language alignment. On top of this dataset, MoT-HRA factorizes manipulation into three coupled experts: a vision-language expert predicts an embodiment-agnostic 3D trajectory, an intention expert models MANO-style hand motion as a latent human-motion prior, and a fine expert maps the intention-aware representation to robot action chunks. A shared-attention trunk and read-only key-value transfer allow downstream control to use human priors while limiting interference with upstream representations. Experiments on hand motion generation, simulated manipulation, and real-world robot tasks show that MoT-HRA improves motion plausibility and robust control under distribution shift.

2604.24514 2026-05-22 cs.LG

SceneSelect: Selective Learning for Trajectory Scene Classification and Expert Scheduling

SceneSelect: 用于轨迹场景分类和专家调度的选择性学习

Xinrun Wang, Deshun Xia, Yuxi Sun, Weijie Zhu

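AI summary: This paper proposes SceneSelect, a scene-based selective learning method that dynamically routes inputs to the most appropriate expert model to improve the accuracy and efficiency of trajectory prediction.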

Comments This paper has been accepted by ICIC 2026

AI Chinese abstract

准确的轨迹预测因高场景异质性而具有根本挑战性 - 不同现实环境中的运动速度、空间密度和交互模式存在剧烈变化。然而,大多数现有方法通常训练一个单一统一模型,期望固定容量架构能普遍泛化所有可能场景。这种以模型为中心的范式在面对此类极端异质性时本质上是错误的,不可避免导致严重的泛化差距、降级的准确性以及大量的计算浪费。为克服这一瓶颈,我们提出选择性学习,一种新的以场景为中心的范式。它明确分析底层场景的特性,动态路由输入到最合适的专家模型。作为这一范式的具体实现,我们引入SceneSelect。具体而言,SceneSelect利用无监督聚类在可解释的几何和运动学特征上发现潜在的场景分类。然后训练一个高度解耦的分类模块,将实时输入分配到这些场景类别,并一个高度可扩展、插件式的调度策略自动将轨迹序列调度到最优的专家预测器。关键的是,这种解耦设计确保了出色的泛化能力,允许无缝集成不同的现成模型,并在新数据集上稳健适应,而无需计算昂贵的联合再训练。在三个公开基准(ETH-UCY、SDD和NBA)上的大量实验表明,我们的方法在强单模型和集成基线中一致表现更好,平均提高10.5%,展示了场景感知选择性学习的有效性。

English abstract

Accurate trajectory prediction is fundamentally challenging due to high scene heterogeneity - the severe variance in motion velocity, spatial density, and interaction patterns across different real-world environments. However, most existing approaches typically train a single unified model, expecting a fixed-capacity architecture to generalize universally across all possible scenarios. This conventional model-centric paradigm is fundamentally flawed when confronting such extreme heterogeneity, inevitably leading to a severe generalization gap, degraded accuracy, and massive computational waste. To overcome this bottleneck, rather than refining restricted model-centric architectures, we propose selective learning, a novel scene-centric paradigm. It explicitly analyzes the characteristics of the underlying scene to dynamically route inputs to the most appropriate expert models. As a concrete implementation of this paradigm, we introduce SceneSelect. Specifically, SceneSelect utilizes unsupervised clustering on interpretable geometric and kinematic features to discover a latent scene taxonomy. A highly decoupled classification module is then trained to assign real-time inputs to these scene categories, and a highly extensible, plug-and-play scheduling policy automatically dispatches the trajectory sequence to the optimal expert predictor. Crucially, this decoupled design ensures excellent generalization capabilities, allowing seamless integration with different off-the-shelf models and robust adaptation across new datasets without requiring computationally expensive joint retraining. Extensive experiments on three public benchmarks (ETH-UCY, SDD, and NBA) demonstrate that our method consistently outperforms strong single-model and ensemble baselines, achieving an average improvement of 10.5%, showcasing the effectiveness of scene-aware selective learning.
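The classify-then-dispatch flow described above might look roughly like the following sketch. It is not the SceneSelect code: the two features, the centroid values (standing in for the output of an unsupervised clustering step), the scene names, and the two toy experts are all hypothetical.

```python
import math

def scene_features(trajectory):
    """Mean speed and mean absolute heading change of a 2D trajectory."""
    steps = [(x1 - x0, y1 - y0)
             for (x0, y0), (x1, y1) in zip(trajectory, trajectory[1:])]
    speeds = [math.hypot(dx, dy) for dx, dy in steps]
    headings = [math.atan2(dy, dx) for dx, dy in steps]
    # Wrap each heading difference into [-pi, pi] before taking its magnitude.
    turns = [abs(math.atan2(math.sin(b - a), math.cos(b - a)))
             for a, b in zip(headings, headings[1:])]
    return (sum(speeds) / len(speeds), sum(turns) / max(len(turns), 1))

# Hypothetical cluster centroids in (mean speed, mean turn) feature space.
CENTROIDS = {"slow_crowd": (0.2, 0.6), "fast_straight": (1.5, 0.05)}

def classify_scene(features):
    """Assign the scene category whose centroid is nearest."""
    return min(CENTROIDS, key=lambda name: math.dist(features, CENTROIDS[name]))

def constant_velocity(trajectory):
    """Toy expert: extrapolate the last displacement."""
    (x0, y0), (x1, y1) = trajectory[-2], trajectory[-1]
    return (2 * x1 - x0, 2 * y1 - y0)

def hold_position(trajectory):
    """Toy expert: predict no movement."""
    return trajectory[-1]

EXPERTS = {"slow_crowd": hold_position, "fast_straight": constant_velocity}

def predict_next(trajectory):
    """Route the trajectory to the expert registered for its scene category."""
    scene = classify_scene(scene_features(trajectory))
    return scene, EXPERTS[scene](trajectory)

fast = [(0.0, 0.0), (1.5, 0.0), (3.0, 0.0), (4.5, 0.0)]
slow = [(0.0, 0.0), (0.1, 0.1), (0.0, 0.2), (0.1, 0.3)]
print(predict_next(fast))
print(predict_next(slow))
```

Because the experts are looked up in a plain registry, swapping in a different off-the-shelf predictor only means replacing an entry in `EXPERTS`, which mirrors the plug-and-play property the abstract claims for the decoupled design.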

2604.17623 2026-05-22 cs.CV cs.GR

ViPS: Video-informed Pose Spaces for Auto-Rigged Meshes

ViPS: 为自动绑定网格的视频感知姿态空间

Honglin Chen, Karran Pandey, Rundi Wu, Matheus Gadelha, Yannick Hold-Geoffroy, Ayush Tewari, Niloy J. Mitra, Changxi Zheng, Paul Guerrero

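AI summary: This paper proposes ViPS, a feedforward framework that discovers the distribution of valid poses for auto-rigged meshes by distilling motion priors from a video diffusion model, supporting sampling of diverse shape variations, inverse kinematics, and trajectories for animation and keyframing.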

Comments Project page: https://honglin-c.github.io/vips/

AI Chinese abstract

运动绑定提供了一个结构化的接口来表达3D网格,但缺乏任何关联的姿态空间,即给定网格的可能关节配置的显式表示。没有这样的姿态空间,随机采样或手动操作原始绑定参数很容易导致语义和/或几何违规,例如解剖学超伸展和非物理自相交。我们提出了Video-informed Pose Spaces (ViPS),一种前馈框架,通过从预训练的视频扩散模型中提取运动先验,发现自动绑定网格有效姿态的潜在分布。与现有方法依赖稀缺的艺术家编写的4D数据集或专注于重建单个运动实例不同,ViPS将生成视频模型的先验转移到给定绑定参数化的通用分布中。应用于皮肤网格的可微几何验证器在不需手动调节器的情况下强制执行形状特定的完整性。我们的前馈模型揭示了平滑、紧凑且可控的姿态空间。这反过来支持了对多样形状变化的采样、逆向运动学的流形投影以及动画和关键帧的时序一致轨迹。此外,提取的3D姿态样本作为语义代理指导视频扩散,有效地闭合了生成2D先验和结构化3D运动控制之间的循环。我们的评估显示,仅使用视频先验训练的ViPS在合理性和多样性方面与基于合成艺术家创建的4D数据训练的最新模型表现相当。此外,作为通用模型,ViPS在分布外物种和未见骨骼拓扑上表现出鲁棒的零样本泛化能力。

English abstract

Kinematic rigs provide a structured interface for articulating 3D meshes but lack any associated pose space, i.e., an explicit representation of the plausible manifold of joint configurations for a given mesh. Without such a pose space, stochastic sampling or manual manipulation of raw rig parameters easily results in semantic and/or geometric violations, such as anatomical hyperextension and non-physical self-intersections. We propose Video-informed Pose Spaces (ViPS), a feedforward framework that discovers the latent distribution of valid articulations for auto-rigged meshes by distilling motion priors from a pretrained video diffusion model. Unlike existing methods that rely on scarce, artist-authored 4D datasets, or focus on reconstructing instances of individual motions, ViPS transfers generative video model priors into a universal distribution over the given rig parameterization. Differentiable geometric validators applied to the skinned mesh enforce shape-specific integrity without requiring manual regularizers. Our feedforward model reveals a smooth, compact, and controllable pose space. This, in turn, supports sampling for diverse shape variations, manifold projection for inverse kinematics, and temporally coherent trajectories for animation and keyframing. Further, the distilled 3D pose samples serve as semantic proxies to guide video diffusion, effectively closing the loop between generative 2D priors and structured 3D kinematic control. Our evaluations show that ViPS, trained solely using video priors, matches the performance of state-of-the-art models trained on synthetic artist-created 4D data in both plausibility and diversity. Additionally, as a universal model, ViPS exhibits robust zero-shot generalization to out-of-distribution species and unseen skeletal topologies.

2604.15003 2026-05-22 cs.CV

Flow of Truth: Proactive Temporal Forensics for Image-to-Video Generation

流之真相:面向图像到视频生成的主动时间鉴伪

Yuzhuo Chen, Zehua Ma, Han Fang, Hengyi Wang, Guanjie Wang, Weiming Zhang

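AI summary: This paper proposes a proactive temporal forensics method for image-to-video generation that traces how pixels flow and transform throughout the video, addressing the limits of traditional spatial forensics along the time dimension.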

AI Chinese abstract

图像到视频(I2V)生成的迅速发展使单张图像可以生成逼真的视频,但也带来了新的鉴伪需求。与静态图像不同,I2V内容随时间演变,要求鉴伪方法超越二维像素级篡改定位,追踪像素在视频中的流动和变换。随着帧数增加,嵌入的痕迹会漂移和变形,使传统空间鉴伪失效。为应对这一未探索的维度,我们提出了Flow of Truth,首个专注于I2V生成中时间鉴伪的主动框架。关键挑战在于发现一个能够与生成过程一致演化的鉴伪特征,这本质上是一种创造性的转换而非确定性重建。尽管存在这种内在困难,我们创新性地将视频生成重新定义为像素随时间的运动而非帧的合成。基于这一观点,我们提出了一种可学习的鉴伪模板,追踪像素运动,并提出一个模板引导的流模块,将运动与图像内容解耦,实现稳健的时间追踪。实验表明,Flow of Truth在商业和开源I2V模型上均表现出色,显著提升了时间鉴伪性能。

English abstract

The rapid rise of image-to-video (I2V) generation enables realistic videos to be created from a single image but also brings new forensic demands. Unlike static images, I2V content evolves over time, requiring forensics to move beyond 2D pixel-level tampering localization toward tracing how pixels flow and transform throughout the video. As frames progress, embedded traces drift and deform, making traditional spatial forensics ineffective. To address this unexplored dimension, we present Flow of Truth, the first proactive framework focusing on temporal forensics in I2V generation. A key challenge lies in discovering a forensic signature that can evolve consistently with the generation process, which is inherently a creative transformation rather than a deterministic reconstruction. Despite this intrinsic difficulty, we innovatively redefine video generation as the motion of pixels through time rather than the synthesis of frames. Building on this view, we propose a learnable forensic template that follows pixel motion and a template-guided flow module that decouples motion from image content, enabling robust temporal tracing. Experiments show that Flow of Truth generalizes across commercial and open-source I2V models, substantially improving temporal forensics performance.

2604.14084 2026-05-22 cs.LG cs.AI

TIP: Token Importance in On-Policy Distillation

TIP: on-policy distillation 中的 token 重要性

Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang, Alborz Geramifard

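AI summary: This study examines which tokens carry the most useful learning signal in on-policy knowledge distillation, proposes a two-axis taxonomy over student entropy and teacher-student divergence, and shows experimentally that distilling on a small fraction of tokens is effective under limited memory.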

English abstract

On-policy knowledge distillation (OPD) trains a student on its own rollouts under token-level supervision from a teacher. Not all token positions matter equally, but existing views of token importance are incomplete. We ask a direct question: which tokens carry the most useful learning signal in OPD? Our answer is that informative tokens come from two regions: positions with high student entropy, and positions with low student entropy plus high teacher-student divergence, where the student is overconfident and wrong. Empirically, student entropy is a strong first-order proxy: retaining 50% of tokens with entropy-based sampling matches or exceeds all-token training while reducing peak memory by up to 47%. But entropy alone misses a second important region. When we isolate low-entropy, high-divergence tokens, training on fewer than 10% of all tokens nearly matches full-token baselines, showing that overconfident tokens carry dense corrective signal despite being nearly invisible to entropy-only rules. We organize these findings with TIP (Token Importance in on-Policy distillation), a two-axis taxonomy over student entropy and teacher-student divergence, and give a theoretical explanation for why entropy is useful yet structurally incomplete. This view motivates type-aware token selection rules that combine uncertainty and disagreement. We validate this picture across three teacher-student pairs spanning Qwen3, Llama, and Qwen2.5 on MATH-500 and AIME 2024/2025, and on the DeepPlanning benchmark for long-horizon agentic planning, where Q3-only training on <20% of tokens surpasses full-token OPD. Our experiments are implemented by extending the OPD repository https://github.com/HJSang/OPSD_OnPolicyDistillation, which supports memory-efficient distillation of larger models under limited GPU budgets.
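The two informative regions named in the abstract can be sketched as a simple selection rule over per-position distributions. This is an illustration rather than the TIP implementation: the fixed cut-offs, the use of KL(student || teacher) as the divergence, and the labels `uncertain` and `overconfident` are assumptions, and the paper's own rules (for example entropy-based sampling) may differ.

```python
import math

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

def kl_divergence(p, q):
    """KL(p || q); assumes q is positive wherever p is."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0.0)

def select_tokens(student, teacher, entropy_cut, divergence_cut):
    """Keep positions that are uncertain, or confident but far from the teacher."""
    keep = []
    for position, (s, t) in enumerate(zip(student, teacher)):
        if entropy(s) >= entropy_cut:
            keep.append((position, "uncertain"))
        elif kl_divergence(s, t) >= divergence_cut:
            keep.append((position, "overconfident"))
    return keep

# Hypothetical next-token distributions over a 3-word vocabulary.
student = [[0.40, 0.30, 0.30], [0.98, 0.01, 0.01], [0.98, 0.01, 0.01]]
teacher = [[0.50, 0.30, 0.20], [0.97, 0.02, 0.01], [0.02, 0.97, 0.01]]
selected = select_tokens(student, teacher, entropy_cut=0.5, divergence_cut=1.0)
print(selected)
```

Position 1 is dropped because the student is confident and agrees with the teacher. Position 2 has the same low entropy, so an entropy-only rule would also drop it, yet it is exactly the overconfident-and-wrong case the abstract highlights.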

2604.12325 2026-05-22 cs.LG cs.AI

Black-Box Optimization From Small Offline Datasets via Meta Learning with Synthetic Tasks

基于合成任务元学习的小规模离线数据集黑盒优化

Azza Fadhel, The Hung Tran, Trong Nghia Hoang, Jana Doppa

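AI summary: This paper proposes OptBias, a meta-learning framework that trains on generated synthetic tasks to address black-box optimization from small offline datasets, learning a reusable optimization bias that improves performance in small-data regimes.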

Comments Accepted for Publication at International Conference on Artificial Intelligence and Statistics (AISTATS)

AI Chinese abstract

我们考虑了离线黑盒优化的问题,目标是从过去的实验数据中发现最优设计(例如分子或材料)。在这一设置中,一个关键挑战是数据稀缺性:在许多科学应用中,只有小规模或低质量的数据集可用,这严重限制了现有算法的有效性。先前的工作在理论和实证上都表明,离线优化算法的性能取决于代理模型对优化偏差(即正确排序输入设计的能力)的捕捉程度,这在有限的实验数据下很难实现。本文提出了一种通过生成合成任务进行元学习的框架OptBias,该框架通过在高斯过程生成的合成任务上训练来直接解决数据稀缺性问题。OptBias通过在小数据上微调代理模型来解决目标任务。在多样化的连续和离散离线优化基准上,OptBias在小数据场景中始终优于最先进的基线。这些结果突显了OptBias作为现实中小数据设置中离线优化的稳健且实用的解决方案。

English abstract

We consider the problem of offline black-box optimization, where the goal is to discover optimal designs (e.g., molecules or materials) from past experimental data. A key challenge in this setting is data scarcity: in many scientific applications, only small or poor-quality datasets are available, which severely limits the effectiveness of existing algorithms. Prior work has theoretically and empirically shown that performance of offline optimization algorithms depends on how well the surrogate model captures the optimization bias (i.e., ability to rank input designs correctly), which is challenging to accomplish with limited experimental data. This paper proposes Surrogate Learning with Optimization Bias via Synthetic Task Generation (OptBias), a meta-learning framework that directly tackles data scarcity. OptBias learns a reusable optimization bias by training on synthetic tasks generated from a Gaussian process, and then fine-tunes the surrogate model on the small data for the target task. Across diverse continuous and discrete offline optimization benchmarks, OptBias consistently outperforms state-of-the-art baselines in small data regimes. These results highlight OptBias as a robust and practical solution for offline optimization in realistic small data settings.

2604.08872 2026-05-22 cs.LG cond-mat.dis-nn cond-mat.stat-mech

How does Chain of Thought decompose complex tasks?

链式思维如何分解复杂任务?

Amrut Nadgir, Vijay Balasubramanian, Pratik Chaudhari

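AI summary: This paper studies how chain of thought decomposes complex tasks. It finds that splitting a task into a sequence of smaller classification problems can substantially reduce prediction error, and identifies a critical threshold for the degree, above which an optimal decomposition depth exists.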

AI Chinese abstract

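许多语言任务可以建模为分类问题,其中大型语言模型(LLM)被给出提示并从多个可能答案中选择一个。我们证明此类问题中的分类误差随类别数量呈幂律变化。这带来一个显著的结果:通过将整体任务分解为一系列较小的分类问题,每个问题具有相同数量的类别(“度”),可以大幅降低预测误差。这种树状分解刻画了思维链(CoT)。已有观察表明,基于CoT的预测器在“思考”时表现更好,即构建更深的树,从而将问题分解为更多步骤。我们确定了度的一个临界阈值:低于该阈值时思考是有害的,高于该阈值时存在使误差最小的最优深度。增加思考深度无法突破这一最小误差。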

English abstract

Many language tasks can be modeled as classification problems where a large language model (LLM) is given a prompt and selects one among many possible answers. We show that the classification error in such problems scales as a power law in the number of classes. This has a dramatic consequence: the prediction error can be reduced substantially by splitting the overall task into a sequence of smaller classification problems, each with the same number of classes ("degree"). This tree-structured decomposition models chain-of-thought (CoT). It has been observed that CoT-based predictors perform better when they "think", i.e., when they develop a deeper tree, thus decomposing the problem into a larger number of steps. We identify a critical threshold for the degree, below which thinking is detrimental, and above which there exists an optimal depth that minimizes the error. It is impossible to surpass this minimal error by increasing the depth of thinking.
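A toy calculation can make the trade-off between degree and depth concrete. The model below is an assumption made for illustration, not the paper's analysis: each step is taken to err with probability `scale * degree ** exponent`, steps are treated as independent, and the constants are arbitrary, so the sketch does not reproduce the paper's critical threshold.

```python
import math

def step_error(num_classes, scale=0.01, exponent=0.5):
    """Assumed power-law error of one classification over num_classes options."""
    return min(1.0, scale * num_classes ** exponent)

def tree_error(total_classes, degree, scale=0.01, exponent=0.5):
    """Error after a chain of equal-degree steps that together cover total_classes."""
    depth = round(math.log(total_classes) / math.log(degree))
    if degree ** depth != total_classes:
        raise ValueError("degree must be an exact root of total_classes")
    # The chain succeeds only if every step succeeds (independence assumed).
    return 1.0 - (1.0 - step_error(degree, scale, exponent)) ** depth

total = 4096
candidates = [2, 4, 8, 16, 64, 4096]   # 4096 means answering in a single step
errors = {degree: tree_error(total, degree) for degree in candidates}
best_degree = min(errors, key=errors.get)
for degree in candidates:
    print(degree, round(errors[degree], 4))
print("best degree:", best_degree)
```

In this toy setting the single-step predictor is the worst option, very deep binary trees accumulate too many small errors, and an intermediate degree gives the lowest overall error, which is the qualitative picture of an optimal depth described in the abstract.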

2604.08571 2026-05-22 cs.LG cs.AI cs.CL

Robust Reasoning Benchmark

鲁棒推理基准

Pavel Golikov, Evgenii Opryshko, Gennady Pekhimenko, Mark C. Jeffrey

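AI summary: This study introduces the Robust Reasoning Benchmark (RRB), which evaluates 8 state-of-the-art models under 13 deterministic textual perturbations. Claude categorically refuses many transformed prompts, while open-weights models show several failure modes under structural noise (cognitive thrashing, tokenization breakdown, and reasoning collapse), with average accuracy drops of up to 54%. The study also identifies Intra-Query Attention Dilution, in which a model's own chain-of-thought dilutes attention: intermediate reasoning steps appear to pollute standard dense attention, so future architectures may need explicit contextual resets to reason reliably.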

AI Chinese abstract

尽管大型语言模型(LLMs)在标准数学基准上表现优异,但其问题解决能力依赖于上下文和文本格式。我们引入鲁棒推理基准(RRB),该基准由13种确定性文本扰动组成,应用于2024年和2025年的AIME。评估8种最先进的模型后,发现前沿模型总体上具有较强的鲁棒性,但Claude在面对变换提示时表现出异常拒绝行为。开放权重推理模型在结构噪声下表现出多种失败模式(认知冲刷、分词崩溃和推理崩溃),在扰动下平均准确率下降高达54%,某些扰动甚至导致100%的准确率下降。我们进一步研究其中一种失败模式:由模型自身推理链引起的注意力稀释。通过要求模型在单一上下文窗口内依次解决多个独立数学问题,我们识别出Intra-Query Attention Dilution。从7B到120B参数的开放权重模型在后续问题上的准确率逐渐下降,表明中间推理步骤会污染标准密集注意力机制。我们主张,为了实现可靠的推理,未来架构需要在模型自身推理链中整合显式上下文重置,从而引发关于推理任务最佳粒度的开放研究问题。

English abstract

While Large Language Models (LLMs) achieve high performance on standard mathematical benchmarks, their problem-solving abilities depend on the context and textual formatting. We introduce the Robust Reasoning Benchmark (RRB), a pipeline of 13 deterministic textual perturbations applied to AIME 2024 and AIME 2025. Evaluating 8 state-of-the-art models, we find that frontier models are largely resilient, with the notable exception of Claude, which categorically refuses many transformed prompts. Open-weights reasoning models exhibit a range of failure modes under structural noise (cognitive thrashing, tokenization breakdown, and reasoning collapse), with up to 54% average accuracy drops across perturbations and up to 100% on some. We further study one of these failure modes in isolation: attention dilution caused by the model's own chain-of-thought. By tasking models with solving multiple independent mathematical problems sequentially within a single context window, we identify Intra-Query Attention Dilution. Open-weights models ranging from 7B to 120B parameters exhibit accuracy decay on subsequent problems, suggesting that intermediate reasoning steps progressively pollute standard dense attention mechanisms. We argue that in order to achieve reliable reasoning, future architectures need to integrate explicit contextual resets within models' own chain-of-thought, leading to open research questions regarding the optimal granularity of reasoning tasks.

2603.29735 2026-05-22 cs.AI

Unveiling the Reasoning Process of Large Language Models

揭示大型语言模型的推理过程

Junjie Zhang, Zhen Shen, Xisong Dong, Gang Xiong

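AI summary: By analyzing how attention heads and layers in the Transformer transform information, this paper finds that in mathematical and symbolic reasoning tasks the middle layers convert token-level information into reusable relational structure.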

AI Chinese abstract

大型语言模型往往能够超越表层token进行推理,但token级信息转变为抽象关系结构的内部阶段仍不明确。我们通过分析自回归推理过程中注意力头和层如何转换信息来探讨这一问题。在数学和符号推理任务中,我们观察到一种一致的分层分工:外层主要保留和路由输入相关特征,而中层将它们重新组织为更具转移性的规则级表示。这种解释得到了表示几何的支持:中层状态占据较低维的流形,并在不同词汇库中表现出更强的对齐性,这些词汇库实现了相同的符号规则。此外,因果干预进一步支持了这一结论:移除通过我们基于交互的标准识别出的中层组件,会比移除其他区域或随机移除的组件产生更大的下游变化和准确率下降。共同,这些结果表明,抽象推理并非均匀分布在Transformer层中,而是优先在中层计算阶段形成,该阶段将token级信息转化为可重用的关联结构。

English abstract

Large language models often reason beyond surface tokens, but the internal stage at which token-level information becomes abstract relational structure remains unclear. We investigate this question by analyzing how attention heads and layers transform information during autoregressive reasoning. Across mathematical and symbolic reasoning tasks, we observe a consistent layer-wise division of labor: outer layers mainly preserve and route input-related features, whereas middle layers reorganize them into more transferable rule-level representations. This interpretation is supported by representation geometry: middle-layer states occupy lower-dimensional manifolds and show stronger alignment across disjoint vocabularies that instantiate the same symbolic rules. It is further supported by causal interventions: removing middle-layer components identified by our interaction-based criterion produces substantially larger downstream changes and accuracy drops than removing components from other regions or at random. Together, these results suggest that abstract reasoning is not uniformly distributed across transformer layers, but is preferentially formed in a middle-layer computation stage that converts token-level information into reusable relational structure.

2603.22508 2026-05-22 cs.RO cs.SY eess.SY

Parallel OctoMapping: A Scalable Framework for Enhanced Path Planning in Autonomous Navigation

并行八叉树映射:一种用于自主导航中路径规划增强的可扩展框架

Yihui Mao, Tian Tan, Xuehui Shen, Warren E. Dixon, Rushikesh Kamalapurkar

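AI summary: This paper proposes Parallel OctoMapping (POMP), an efficient OctoMap-based mapping technique that refines the free-space representation at a fixed occupancy-grid resolution, improving path-planning efficiency and success rates, especially in cluttered environments.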

AI Chinese abstract

映射在机器人和自主系统中至关重要,因为它为路径规划提供了空间基础。高效的映射使规划算法能够生成可靠的路径,同时确保安全并实时适应复杂环境。固定分辨率的映射方法通常会产生过于保守的障碍物表示,导致在拥挤场景中生成次优路径或规划失败。为了解决这个问题,我们引入了并行八叉树映射(POMP),一种高效的基于八叉树的映射技术,旨在最大化可用自由空间并支持多线程计算。据我们所知,POMP是首个在固定占用网格分辨率下优化自由空间表示同时保持地图保真度和与现有基于搜索的规划器兼容的方法。因此,它可以集成到现有的规划流程中,从而提高路径发现的成功率和路径长度,特别是在拥挤环境中,同时显著提高计算效率。

English abstract

Mapping is essential in robotics and autonomous systems because it provides the spatial foundation for path planning. Efficient mapping enables planning algorithms to generate reliable paths while ensuring safety and adapting in real time to complex environments. Fixed-resolution mapping methods often produce overly conservative obstacle representations that lead to suboptimal paths or planning failures in cluttered scenes. To address this issue, we introduce Parallel OctoMapping (POMP), an efficient OctoMap-based mapping technique that maximizes available free space and supports multi-threaded computation. To the best of our knowledge, POMP is the first method that, at a fixed occupancy-grid resolution, refines the representation of free space while preserving map fidelity and compatibility with existing search-based planners. It can therefore be integrated into existing planning pipelines, yielding higher pathfinding success rates and shorter path lengths, especially in cluttered environments, while substantially improving computational efficiency.

2603.21743 2026-05-22 cs.LG q-bio.QM

CellFluxRL: Biologically-Constrained Virtual Cell Modeling via Reinforcement Learning

CellFluxRL: 通过强化学习实现生物约束的虚拟细胞建模

Dongxia Wu, Shiye Su, Yuhui Zhang, Elaine Sui, Emma Lundberg, Emily B. Fox, Serena Yeung-Levy

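AI summary: This paper proposes CellFluxRL, which post-trains a virtual cell model with reinforcement learning so that its generations better respect constraints on biological function, structural validity, and morphological correctness, making virtual cell modeling more biologically meaningful.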

AI Chinese abstract

构建虚拟细胞以生成模型模拟细胞行为在硅中的仿真,正成为加速药物发现的有前途的范式。然而,先前基于图像的生成方法可能会产生不合理的细胞图像,违反基本的物理和生物学约束。为了解决这个问题,我们提出通过强化学习(RL)后训练虚拟细胞模型,利用具有生物意义的评估器作为奖励函数。我们设计了七个奖励,涵盖三个类别——生物功能、结构有效性及形态正确性,并优化最先进的CellFlux模型以获得CellFluxRL。CellFluxRL在所有奖励上均优于CellFlux,且在测试时扩展进一步提升性能。总体而言,我们的结果展示了一个通过强化学习施加物理约束的虚拟细胞建模框架,从而超越了“视觉逼真”的生成,朝着“生物意义”的生成迈进。

English abstract

Building virtual cells with generative models to simulate cellular behavior in silico is emerging as a promising paradigm for accelerating drug discovery. However, prior image-based generative approaches can produce implausible cell images that violate basic physical and biological constraints. To address this, we propose to post-train virtual cell models with reinforcement learning (RL), leveraging biologically meaningful evaluators as reward functions. We design seven rewards spanning three categories (biological function, structural validity, and morphological correctness) and optimize the state-of-the-art CellFlux model to yield CellFluxRL. CellFluxRL consistently improves over CellFlux across all rewards, with further performance boosts from test-time scaling. Overall, our results present a virtual cell modeling framework that enforces physically-based constraints through RL, advancing beyond "visually realistic" generations towards "biologically meaningful" ones.

2603.21717 2026-05-22 cs.LG

Uncertainty-Aware Distribution-to-Distribution Flow Matching for Scientific Imaging

面向科学成像的不确定性感知分布到分布流匹配

Dongxia Wu, Yuhui Zhang, Serena Yeung-Levy, Emma Lundberg, Emily B. Fox

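AI summary: This paper proposes an uncertainty-aware distribution-to-distribution flow matching method for scientific imaging. It combines Stochastic Flow Matching for better generalization under distribution shift with Bayesian Stochastic Flow Matching and AVUQ (Antithetic Variance-reduction Uncertainty Quantification) to estimate epistemic and aleatoric uncertainty and detect unreliable generations.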

AI Chinese abstract

分布到分布生成模型支持从建模细胞扰动响应到跨条件翻译医学图像的科学成像任务。可信生成需要可靠性,即在不同实验室、设备和实验条件下的泛化能力,以及问责,即检测出分布外情况,其中预测可能不可靠。我们利用随机流匹配(SFM),一种保持边缘的随机扩展流匹配,以改进在分布偏移下的泛化能力。SFM在确定性流中加入扩散项和学习的分数基漂移校正,保留所学的传输边缘的同时建模条件变化性。基于此SFM框架,我们引入贝叶斯随机流匹配(BSFM)作为不确定性量化机制,并开发AVUQ(反向方差减少不确定性量化)以通过样本高效反向采样和近似后验推断来近似估计epistemic和aleatoric不确定性。我们进一步使用AVUQ生成异常分数以检测不可靠的生成结果。在细胞成像(BBBC021,JUMP)和脑部fMRI(Theory of Mind)等不同未见过的场景中的实验表明,SFM在提升泛化能力的同时,AVUQ在实际采样预算下提供了有效的基于不确定性的异常分数。

English abstract

Distribution-to-distribution generative models support scientific imaging tasks ranging from modeling cellular perturbation responses to translating medical images across conditions. Trustworthy generation requires reliability, or generalization across labs, devices, and experimental conditions, and accountability, or detecting out-of-distribution cases where predictions may be unreliable. We leverage Stochastic Flow Matching (SFM), a marginal-preserving stochastic extension of flow matching for improved generalization under distribution shift. SFM augments deterministic flows with a diffusion term together with a learned score-based drift correction, retaining the learned transport marginals while modeling conditional variability. Building on this SFM framework, we introduce Bayesian Stochastic Flow Matching (BSFM) as a companion uncertainty quantification mechanism and develop AVUQ (Antithetic Variance-reduction Uncertainty Quantification) to approximately estimate epistemic and aleatoric uncertainty via sample-efficient antithetic sampling with approximate posterior inference. We further use AVUQ to yield anomaly scores for unreliable generation detection. Experiments on cellular imaging (BBBC021, JUMP) and brain fMRI (Theory of Mind) across diverse unseen scenarios show that SFM improves generalization while AVUQ provides effective uncertainty-based anomaly scores under practical sampling budgets.
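The sample-efficiency argument behind antithetic sampling can be shown with a generic Monte Carlo sketch. It illustrates only the textbook technique of pairing each noise draw `z` with `-z`, not AVUQ itself: the function `generated_quantity` is a hypothetical stand-in for any scalar read off a stochastic sampler, and the budget of 20 evaluations per estimate is arbitrary.

```python
import random
import statistics

def generated_quantity(z):
    """Hypothetical scalar output of a stochastic sampler driven by noise z."""
    return 2.0 + z + 0.1 * z * z

def plain_estimate(rng, n):
    """Average over n independent noise draws."""
    return sum(generated_quantity(rng.gauss(0.0, 1.0)) for _ in range(n)) / n

def antithetic_estimate(rng, n):
    """Average over n/2 mirrored pairs (z, -z): same number of evaluations."""
    pairs = n // 2
    total = 0.0
    for _ in range(pairs):
        z = rng.gauss(0.0, 1.0)
        total += 0.5 * (generated_quantity(z) + generated_quantity(-z))
    return total / pairs

rng = random.Random(0)
plain_runs = [plain_estimate(rng, 20) for _ in range(200)]
anti_runs = [antithetic_estimate(rng, 20) for _ in range(200)]
print("plain variance:     ", statistics.pvariance(plain_runs))
print("antithetic variance:", statistics.pvariance(anti_runs))
```

Mirrored pairs cancel the part of the output that is odd in the noise, so the estimator's spread shrinks at a fixed evaluation budget. That is the sense in which antithetic sampling helps under the practical sampling budgets mentioned in the abstract.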

2603.21610 2026-05-22 cs.LG cs.AI stat.ML

Rule-State Inference (RSI): A Bayesian Framework for Compliance Monitoring in Rule-Governed Domains

规则状态推断(RSI):一种用于规则治理领域合规监控的贝叶斯框架

Abdou-Raouf Atarmla

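AI summary: This paper proposes Rule-State Inference (RSI), a Bayesian framework for three structural challenges of compliance monitoring in rule-governed domains: the absence of labeled outcomes at deployment, strategically missing observations from non-compliant entities, and a regulatory environment that changes faster than any supervised model can be retrained. RSI treats an authoritative, formalized rule set as structured Bayesian priors and infers the latent compliance state of a population through variational inference with exact coordinate-ascent updates.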

Comments 18 pages. Experimental validation forthcoming

AI Chinese abstract

在规则治理领域(如税收管理、临床协议遵守、环境监管)的合规监控面临三个结构性障碍,标准机器学习无法同时解决:部署时缺乏标记结果、非合规实体战略性缺失观察以及监管环境变化速度超过任何监督模型的重新训练速度。我们引入规则状态推断(RSI),一种贝叶斯框架,颠覆了传统的学习规则从数据的范式。RSI将权威、形式化的规则集作为结构化的贝叶斯先验,并通过均场变分推断和精确坐标上升更新推断人口的潜在合规状态。核心建模对象是一个联合潜变量,每个监管时期一个:全局合规文化因子η以及每个规则的激活、人口合规水平和参数漂移成分。RSI提供了三个正式保证:每个规则更新的监管适应性为O(n_k + K);对于可识别的连续成分的伯恩斯坦-冯·米塞斯一致性;以及每次迭代的单调ELBO收敛。我们将在托戈财政系统上实例化RSI,基于官方监管法律的基准2000家合成企业;完整的数值验证将随后进行。该框架设计用于直接扩展到顺序RSI,一种状态空间公式化中,一个监管时期的后验成为下一个的先验,从而产生精确的卡尔曼滤波器用于合规轨迹跟踪和实体级贝叶斯评分。

English abstract

Compliance monitoring in rule-governed domains (tax administration, clinical protocol adherence, environmental regulation) faces three structural obstacles that standard machine learning does not simultaneously address: the absence of labeled outcomes at deployment, strategically missing observations where non-compliant entities selectively withhold evidence, and a regulatory environment that changes faster than any supervised model can be retrained. We introduce Rule-State Inference (RSI), a Bayesian framework that reverses the usual paradigm. Rather than learning rules from data, RSI treats an authoritative, formalized rule set as structured Bayesian priors and infers the latent compliance state of a population through mean-field variational inference with exact coordinate-ascent updates. The central modeling object is a joint latent state per regulatory period: a global compliance-culture factor eta and per-rule components for activation, population compliance level, and parametric drift. RSI delivers three formal guarantees: O(n_k + K) regulatory adaptability per rule update; Bernstein-von Mises consistency for the identifiable continuous components; and monotone ELBO convergence at every iteration. We instantiate RSI on the Togolese fiscal system on a benchmark of 2,000 synthetic enterprises grounded in official regulatory law; full numerical validation is forthcoming. The framework is designed for direct extension to Sequential RSI, a state-space formulation where the posterior from one regulatory period becomes the prior for the next, yielding an exact Kalman filter for compliance-trajectory tracking and entity-level Bayesian scoring.
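The idea that the posterior from one regulatory period becomes the prior for the next can be sketched with a scalar Kalman filter. The sketch assumes a linear-Gaussian random-walk model for a single latent compliance level; the abstract does not specify the Sequential RSI state-space model, so the variable names, noise variances, and observations here are hypothetical.

```python
def kalman_step(prior_mean, prior_var, observation, obs_var, drift_var):
    """One regulatory period: predict with drift, then update on the observation."""
    # Predict: the latent level follows a random walk between periods.
    pred_mean = prior_mean
    pred_var = prior_var + drift_var
    # Update: blend the prediction and the observation by their precisions.
    gain = pred_var / (pred_var + obs_var)
    post_mean = pred_mean + gain * (observation - pred_mean)
    post_var = (1.0 - gain) * pred_var
    return post_mean, post_var

def track(initial_mean, initial_var, observations, obs_var=0.1, drift_var=0.01):
    """Feed each period's posterior back in as the next period's prior."""
    mean, var = initial_mean, initial_var
    history = []
    for observation in observations:
        mean, var = kalman_step(mean, var, observation, obs_var, drift_var)
        history.append((mean, var))
    return history

# Hypothetical observed compliance rates over three regulatory periods.
history = track(initial_mean=0.5, initial_var=1.0, observations=[0.8, 0.7, 0.75])
for period, (mean, var) in enumerate(history, start=1):
    print(period, round(mean, 3), round(var, 4))
```

The posterior variance shrinks as periods accumulate, while `drift_var` keeps it from collapsing to zero, which lets the tracked level keep responding if the regulatory environment shifts.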

2603.16077 2026-05-22 cs.LG

MDM-Prime-v2: Binary Encoding and Index Shuffling Enable Scaling of Diffusion Language Models

MDM-Prime-v2: Binary Encoding and Index Shuffling Enable Diffusion Language Models to Scale

Chen-Hao Chao, Wei-Fang Sun, Junwei Quan, Chun-Yi Lee, Rahul G. Krishnan

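AI summary This paper proposes MDM-Prime-v2, which improves diffusion language models through binary encoding and index shuffling. It addresses two problems: the subtokenizer's functional form increases the cross-entropy loss when paired with BPE tokenizers, and there are no tools for choosing the subtokenizer's granularity hyperparameter. The result is higher zero-shot accuracy on commonsense reasoning benchmarks.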

Details
English abstract

Masked diffusion models (MDM) exhibit superior generalization when learned using a Partial masking scheme (Prime). This approach converts tokens into sub-tokens and models the diffusion process at the sub-token level. We identify two limitations of the MDM-Prime framework. First, we find that the functional form of the subtokenizer significantly increases the cross-entropy loss in the objective when paired with commonly used Byte-Pair-Encoding (BPE) tokenizers. Second, we lack tools to guide the hyperparameter choice of the token granularity in the subtokenizer. To address these limitations, we analyze the optimal design of the subtokenizer that minimizes the MDM-Prime training objective and develop MDM-Prime-v2, a masked diffusion language model which incorporates Binary Encoding and Index Shuffling. Our analysis characterizes how token granularity and sub-token entropy influence the training objective and downstream performance, providing principled criteria for subtokenizer design. When extending the model size to 1.1B parameters, MDM-Prime-v2 demonstrates superior average zero-shot accuracy across eight commonsense reasoning benchmarks, outperforming similar-sized baselines including GPT-Neo, OPT, Pythia, Bloom, SMDM, and TinyLLaMA.
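
To make the two named ingredients concrete, here is a minimal sketch of a subtokenizer that shuffles token indices and then writes each shuffled index as a fixed-length bit sequence. It is an illustration under assumptions, not the authors' implementation: the abstract does not say how the shuffle is drawn, so a seeded uniform random permutation is assumed, and `make_subtokenizer`, `encode`, and `decode` are hypothetical names.

```python
import random


def make_subtokenizer(vocab_size, seed=0):
    """Build encode/decode functions mapping a token id to binary sub-tokens."""
    # Number of bits needed to represent every index in [0, vocab_size).
    num_bits = max(1, (vocab_size - 1).bit_length())

    # Index shuffling (assumed here to be a seeded uniform random permutation).
    perm = list(range(vocab_size))
    random.Random(seed).shuffle(perm)
    inverse = [0] * vocab_size
    for original, shuffled in enumerate(perm):
        inverse[shuffled] = original

    def encode(token_id):
        # Binary encoding: most significant bit first, one sub-token per bit.
        shuffled = perm[token_id]
        return [(shuffled >> i) & 1 for i in reversed(range(num_bits))]

    def decode(bits):
        # Rebuild the shuffled index from its bits, then undo the permutation.
        shuffled = 0
        for bit in bits:
            shuffled = (shuffled << 1) | bit
        return inverse[shuffled]

    return encode, decode, num_bits
```

With `encode, decode, num_bits = make_subtokenizer(50257)`, every token becomes 16 binary sub-tokens and `decode(encode(t)) == t` for any valid `t`. Bit patterns that correspond to indices at or above `vocab_size` have no token and raise `IndexError` in `decode`; how a real model handles such patterns is outside what the abstract states.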