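arXivDaily daily academic digest: synchronized with the full arXiv data, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and other fields.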

2606.02515 2026-06-02 cs.LG

A Biconvex Formulation for Stable Transport of Mixture Models with a Unique Solution

Yeganeh Marghi, Kelly Jin, Uygar Sümbül

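AI summary: Proposes the Optimal Mixture Transport (OMT) framework, which achieves stable transport between mixtures of subpopulations through a strictly biconvex optimization, comes with theoretical stability guarantees, and has computational complexity that depends only on the number of mixture components.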

Abstract

Optimal transport (OT) provides a principled framework for mapping between probability distributions. Despite extensive progress, applying OT to large-scale data remains computationally demanding, and the resulting pointwise transport plans are often difficult to interpret. We introduce Optimal Mixture Transport (OMT), a scalable framework that shifts the transport paradigm from individual samples to mixtures of subpopulations, reformulating the transport problem as a strictly biconvex optimization with a unique global minimizer. We further establish theoretical guarantees on the stability of the OMT map, showing that bounded perturbations of the underlying distributions lead to bounded changes in the transport plan. By formulating subpopulations as exponential-family distributions, OMT decouples computational complexity from the sample size, scaling solely with the number of mixture components. We demonstrate the effectiveness and practicality of OMT on a wide range of synthetic benchmarks and real-world datasets, including image data and large-scale single-cell RNA sequencing measurements.

2606.02510 2026-06-02 cs.CV cs.RO

Not All Points Are Equal: Uncertainty-Aware 4D LiDAR Scene Synthesis

Xiang Xu, Alan Liang, Youquan Liu, Xian Sun, Linfeng Li, Lingdong Kong, Ziwei Liu, Qingshan Liu

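AI summary: Proposes the U4D framework, which uses spatial uncertainty to guide LiDAR scene generation: entropy maps identify high-uncertainty regions, which are synthesized first, and the remaining regions are then completed, yielding high-fidelity 4D scenes.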

Comments: CVPR 2026 E2E3D Workshop; GitHub at https://github.com/worldbench/U4D

Abstract

Constructing faithful 4D worlds from LiDAR-acquired sequences is crucial for embodied AI, yet current generative frameworks apply uniform modeling capacity across all spatial regions. This ignores that perceptual difficulty varies dramatically within a single scan: distant surfaces, occluded boundaries, and small-scale objects carry far higher uncertainty than well-observed structures. We present U4D, a new framework that explicitly leverages spatial uncertainty to guide LiDAR scene generation in a "hard-to-easy" schedule. U4D derives per-point uncertainty maps via Shannon Entropy from a pretrained segmentor, then applies an unconditional diffusion stage to synthesize high-entropy areas with precise geometry, followed by a conditional completion stage that fills in the remaining regions using these structures as priors. A MoST (Mixture of Spatio-Temporal) block further maintains cross-frame coherence by dynamically balancing spatial detail and temporal continuity. Extensive experiments on nuScenes and SemanticKITTI demonstrate state-of-the-art scene fidelity, temporal consistency, and downstream performance.
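
The per-point uncertainty map mentioned above can be illustrated with a small sketch. It assumes that each point comes with a class-probability vector from a segmentor and that "high-entropy" means entropy above a fixed threshold; the function names, the threshold value, and the sample probabilities are hypothetical and are not taken from the U4D code.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy (in nats) of one point's class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def split_by_uncertainty(point_probs, threshold):
    """Return (hard, easy) index lists; 'hard' points have entropy above threshold."""
    entropies = [shannon_entropy(p) for p in point_probs]
    hard = [i for i, h in enumerate(entropies) if h > threshold]
    easy = [i for i, h in enumerate(entropies) if h <= threshold]
    return hard, easy

# Hypothetical softmax outputs for three points over three classes.
point_probs = [
    [0.98, 0.01, 0.01],  # confident prediction, low entropy
    [0.34, 0.33, 0.33],  # near-uniform prediction, high entropy
    [0.70, 0.20, 0.10],  # intermediate
]
hard, easy = split_by_uncertainty(point_probs, threshold=0.5)
```

In a "hard-to-easy" schedule of the kind the abstract describes, the `hard` indices would be generated first and the `easy` ones completed afterwards; how U4D actually thresholds or ranks entropy is not stated in the abstract.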

2606.02509 2026-06-02 cs.CL

When Rating Scales Fall Short: LLM-Assisted Discovery of ADHD Signals in Turkish Teacher Narratives

Baris Karacan, Irem Aktar Songur, Ahmet Ozaslan, Elvan Iseri

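AI summary: This study analyzes the structured scores and open-ended narratives in Turkish teacher evaluation forms and, using an LLM-assisted theme discovery method, reveals complementary ADHD signals in the narrative text that structured scales do not capture.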

Comments: 15 pages. Accepted to CLPsych 2026. Camera-ready author version. The final version will appear in the ACL Anthology

Abstract

Attention Deficit Hyperactivity Disorder (ADHD) is one of the most common neurodevelopmental disorders in childhood, and its diagnosis relies on assessments combining clinician judgment with standardized rating scales and reports from parents and teachers. While structured instruments such as the Conners' Teacher Rating Scale-Revised Short Form (CTRS-R:S) quantify ADHD-related behaviors, teachers also provide open-ended narratives that may contain complementary signals not captured by structured assessments. However, it remains unclear to what extent teacher narratives encode signals overlooked by rating scales. In this study, we analyze de-identified Turkish teacher evaluation forms collected during clinical ADHD assessments, including both CTRS-R:S scores and open-ended teacher narratives. We compare predictive signals from structured scores and narrative text and identify cases where structured assessments fail to clearly distinguish ADHD from non-ADHD students while narrative-based models capture distinct behavioral patterns. Notably, these cases show minimal overlap with those missed by the narrative model, suggesting that structured and narrative information encode complementary signals. To interpret these differences, we apply a large language model (LLM)-assisted theme discovery pipeline that reveals distinct attention, behavioral, and family-related patterns, highlighting the potential of natural language processing (NLP) to uncover clinically relevant signals from teacher narratives and to complement traditional ADHD screening tools.

2606.02507 2026-06-02 cond-mat.mtrl-sci cs.ET cs.LG physics.app-ph physics.comp-ph

Towards Automated Discovery: A Review of Generative Models, Multimodal Learning and Closed-Loop Workflows in Inverse Materials Design

Anand Babu, Rogério Almeida Gouvêa, Gian-Marco Rignanese

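AI summary: This paper reviews recent advances in generative crystal structure modeling, multimodal learning, and closed-loop design pipelines for inverse materials design. It focuses on how feasibility constraints and physical priors are enforced, on multimodal fusion strategies, and on inverse-design strategies (such as conditional generation with latent optimization, Bayesian optimization, reinforcement learning, and active learning), and it identifies common failure modes and evaluation practices based on staged reporting.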

Abstract

Inverse materials design is shifting materials discovery from forward prediction to targeted proposal of candidates that satisfy objectives under physical constraints. Here, we review recent advances in generative crystal structure modeling, multimodal learning, and closed-loop design pipelines for crystalline solids. We survey how modern generators learn chemical-structural priors from large databases to enable controllable sampling of periodic structures, and compare leading model classes including variational autoencoders, normalizing flows, autoregressive formulations, and diffusion models. Particular attention is given to how feasibility constraints and physical priors are enforced across the workflow, through representation choices, training objectives, sampling-time guidance, and post-generation screening and relaxation. We also discuss how multimodal learning fuses diverse materials modalities, including crystal structures, thermodynamic, electronic information, microscopy, spectroscopy, processing context, and scientific text, to construct a more universal, transferable representation of chemical space. In addition, diverse inverse-design strategies are examined, particularly those that integrate conditional generation with latent optimization, Bayesian optimization, reinforcement learning, and active learning. Finally, we highlight recurring failure modes, such as surrogate exploitation, diversity collapse, distribution shift, and the stability-synthesizability gap, and outline discovery-grade evaluation practices based on staged reporting of validity, novelty, uniqueness, stability, and cost.

2606.02506 2026-06-02 cs.CV

Question-Aware Evidence Ledgers for Video Relational Reasoning

Yilin Ou, Mengshi Qi, Huadong Ma

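AI summary: Proposes a test-time reasoning pipeline built on a GPT-5.5 video QA solver and question-aware evidence ledgers. The ledgers make explicit the targets, count units, reference frames, and temporal or spatial scope needed for counting, spatial, endpoint, viewpoint, and dialogue reasoning, and external tools serve as evidence sources. The pipeline reaches 92.95% overall accuracy on the VRR-QA challenge.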

Comments: Technical report for the VRR Challenge at the VideoLLMs Workshop, CVPR 2026

Abstract

The VRR-QA challenge evaluates visual relational reasoning in videos, where answers often depend on implicit spatial relations, event boundaries, target identity, and dialogue context rather than a single salient frame. We present a test-time reasoning pipeline built around a strong GPT-5.5 video QA solver and a set of question-aware evidence ledgers. The initial solver answers each question from a uniform video representation, while routed ledgers are prompted to make the required targets, count units, reference frames, and temporal or spatial scope explicit for counting, spatial, endpoint, viewpoint, and dialogue reasoning. External tools such as open-vocabulary detection, depth cues, pair crops, ASR, and scene-graph ledgers are used only as evidence sources. A conservative gate keeps the current answer unless independent evidence uniquely supports a different option. The final evidence-gated pipeline achieves 92.95% overall accuracy and 93.79% macro accuracy on the challenge test split.
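
The conservative gate described above could look roughly like the following sketch. It rests on one reading of "uniquely supports": the answer changes only when every conclusive evidence source points to the same single option and that option differs from the current answer. The function name and the vote encoding are hypothetical.

```python
def conservative_gate(current_answer, evidence_votes):
    """Keep current_answer unless the conclusive evidence agrees on one other option.

    evidence_votes holds one option label per independent evidence source,
    or None when a source is inconclusive.
    """
    supported = {vote for vote in evidence_votes if vote is not None}
    if len(supported) == 1:
        (option,) = supported
        if option != current_answer:
            return option
    # No evidence, conflicting evidence, or evidence for the current answer.
    return current_answer
```

With this rule, conflicting sources such as `["B", "C"]` leave the solver's answer untouched, which is the conservative behavior the abstract emphasizes.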

2606.02502 2026-06-02 cs.CL

CRAM: Centroid-Routing and Adaptive MoE for Multimodal Continual Instruction Tuning

Jun-Tao Tang, Zhen-Hao Xie, Yu-Cheng Shi, Da-Wei Zhou

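AI summary: Proposes CRAM, which isolates task-specific patterns into independent modules, allocates parameters dynamically through adaptive-rank instantiation, activates existing experts through centroid routing, and constrains update directions with an orthogonality penalty, addressing the forgetting caused by task competition and the poor parameter efficiency in multimodal continual instruction tuning.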

Abstract

Multimodal Large Language Models (MLLMs) unify heterogeneous vision-language tasks under a shared generative framework via instruction tuning, yet real-world deployment demands continuous capability expansion, making Multimodal Continual Instruction Tuning (MCIT) essential. Existing methods either update all tasks with a shared parameter set or allocate dedicated modules for each new task. Shared updates force heterogeneous tasks to compete, causing forgetting of learned capabilities. Conversely, isolated expansion prevents interference but severely limits parameter efficiency over long task streams. To address this dilemma, we propose CRAM. Specifically, by isolating task-specific patterns into independent modules, CRAM mitigates catastrophic forgetting across tasks. To further boost parameter efficiency, we utilize adaptive-rank instantiation to identify the capability gap between existing expert capability and new task demands, and dynamically allocate only the necessary parameters. To ensure stable reuse among tasks, centroid-guided routing recognizes and activates existing experts' capabilities, while an orthogonality penalty confines new updates to task-specific directions, preventing re-learning general capability. Extensive experiments across diverse benchmarks consistently demonstrate its superiority over existing methods.

2606.02498 2026-06-02 cs.CV

GloResNet: A lightweight 3D CNN with global topological features for preterm brain injury prediction

Boyu Yuan, Jiamiao Lu, Weichuan Zhang, Benqing Wu, Tuo Wang, Changshan Wang, Changming Sun, Liang Guo

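AI summary: Proposes GloResNet, a lightweight 3D CNN based on ResNet-10 that combines global manifold mapping with a preprocessing strategy to predict brain injury in preterm infants on the dHCP dataset, with an average accuracy of 75.18%.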

Abstract

This study introduces an automated deep learning framework for predicting brain injury (BI) in preterm infants from T2-weighted MRI (dHCP dataset). We propose GloResNet, a lightweight 3D CNN based on ResNet-10, pretrained on MedicalNet to address data scarcity. A global manifold mapping strategy first resamples each 3D volume to 128x128x128 and then applies subject-wise z-score intensity normalization, thereby preserving global topology while standardizing appearance. Training integrates mixup, class weighting, and test-time augmentation for robustness. In 5-fold cross-validation, GloResNet achieved 75.18% average accuracy (peak 81.82%), with specificity 0.81 and sensitivity 0.76. Results demonstrate that a topology-aware lightweight CNN has the capability to effectively predict neonatal BI, offering a non-invasive screening tool. The source code of this paper can be obtained from the GitHub repository: https://github.com/ICL-SUST/GloResNet-Preterm-Brain
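
The subject-wise z-score intensity normalization mentioned above can be sketched in a few lines. The sketch treats one subject's volume as a flat list of voxel intensities, uses the population standard deviation, and maps a constant volume to zeros; these choices and the function name are assumptions, and the resampling to 128x128x128 is not shown.

```python
import statistics

def zscore_normalize(voxels):
    """Normalize one subject's voxel intensities to zero mean and unit variance."""
    mean = statistics.fmean(voxels)
    std = statistics.pstdev(voxels)  # population standard deviation (assumed)
    if std == 0.0:
        # A constant volume has no contrast; return zeros instead of dividing by zero.
        return [0.0 for _ in voxels]
    return [(v - mean) / std for v in voxels]
```

Because the statistics are computed per subject, scanner-dependent intensity offsets and scales are removed while the spatial arrangement of the voxels is left unchanged.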

2606.02497 2026-06-02 cs.AI

Bridging the Last Mile of Time Series Forecasting with LLM Agents

Yuhua Liao, Zetian Wang, Qiangqiang Nie, Zhenhua Zhang

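AI summary: Proposes an LLM-agent framework that turns statistical predictions into business-ready forecasts by retrieving contextual evidence and applying structural constraints.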

Abstract

Time series forecasting has advanced rapidly, especially with the emergence of foundation models that show strong zero-shot performance on numerical extrapolation. However, in real-world forecasting settings, a statistically plausible baseline is rarely the final forecast used in practice. Before a forecast becomes decision-ready, it often needs to be revised using weakly structured business context such as holiday effects, campaign plans, external events, historical analogs, and expert feedback. This practical stage remains underexplored in the forecasting literature. In this paper, we formulate this stage as the last-mile forecasting problem and present an LLM-agent framework that sits on top of a forecasting backbone. Our system maintains a unified forecast workspace, invokes tools to retrieve contextual evidence, and converts reasoning trajectories into explicit forecast revision actions under structural safety constraints. It also supports long-horizon forecasting through map-reduce-style decomposition and post-hoc reflection through a memory bank. The resulting system is designed to be controllable and auditable. Through real-world case studies, we show how LLM agents can bridge the gap between statistical prediction and business-ready forecasting.

2606.02494 2026-06-02 cs.SE cs.AI

Monitoring Agentic Systems Before They're Reliable

Marisa Ferrara Boston, Glen Hanson, Effi Georgala, JD Hudgens, Heather Frase

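AI summary: Addressing agentic systems in production whose failures are dominated by structural defects, proposes a monitoring and triage methodology with three dimensions and three scopes that uses variance as its signal, and validates it on a synthetic testbed.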

Comments: 9 pages, 2 figures, 3 tables. Accepted to the Workshop on Agentic Software Engineering (AgenticSE), co-located with ACM CAIS 2026 (non-archival)

Abstract

Agentic systems entering production typically operate as partially integrated assemblies where structural defects, not task-level errors, dominate the failure landscape. At this maturity level, task-level error detection may be infeasible: structural failure modes mask the signal that task-level monitors are designed to detect. We present a monitoring and triage methodology that decomposes agentic system evaluation into three dimensions (quality, suitability, efficiency) at three monitoring scopes (within-run, cross-run, structural), using variance as a characterization signal. Findings are routed through severity classification adapted from FMEA, concentrating human attention on the subset that warrants investigation. We evaluate on a synthetic testbed of 220 runs across 120 document bundles with controlled error injection. Three results emerge. Monitor scope determines failure type: within-run monitors surface deterministic stage defects (CV = 0.02), cross-run monitors surface stochastic integration consequences (CV = 1.25, 24% at L2), and a structural monitor identifies an integration gap with perfect consistency (CV = 0.00). Injected task-level errors are indistinguishable from clean baselines, confirming structural defects mask task-level signal. Deterministic triage routes 97% of findings to automated tracking, leaving the 2% reflecting variable behavior for human investigation. We propose, on Stage 1 evidence, a maturity-staging model in which monitoring transitions from structural characterization to error detection to reliability tracking as integration defects resolve. The taxonomy, CV-based scope characterization, and severity model transfer architecturally to document-driven, multi-stage agentic workflows in regulated industries; specific calibrations are domain-specific. Deploy monitoring early: the first thing it finds is the most important thing to fix.
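
The CV figures above refer to the coefficient of variation, the standard deviation divided by the mean. A minimal sketch follows; it assumes the population standard deviation and a simple convention for a zero mean, neither of which is specified in the abstract, and the function name is hypothetical.

```python
import statistics

def coefficient_of_variation(values):
    """Return std / |mean| for a list of monitor readings."""
    mean = statistics.fmean(values)
    if mean == 0.0:
        # Assumed convention: all-zero readings count as perfectly consistent.
        return 0.0 if all(v == 0.0 for v in values) else float("inf")
    return statistics.pstdev(values) / abs(mean)
```

A finding that recurs identically in every run gives a CV of 0, which matches the "perfect consistency" reading of CV = 0.00, while a CV above 1 means the spread exceeds the mean.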

2606.02491 2026-06-02 cs.CV

MORPHOS: Autoregressive 4D Generation with Temporal Structured Latents

Minkyung Kwon, Jinhyeok Choi, Youngjin Shin, Jaeyeong Kim, JongMin Lee, Seungryong Kim

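AI summary: Proposes the MORPHOS framework, which uses Temporal Structured Latents (T-SLAT) as a unified representation of dynamic 4D assets and generates them autoregressively with causal attention, addressing compatibility across multiple representations, topological changes, and long-horizon consistency.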

Comments: Project page: https://cvlab-kaist.github.io/MORPHOS/

Abstract

We present MORPHOS, a novel autoregressive framework that generates dynamic 3D assets from videos across diverse representations, including meshes, 3D Gaussians, and radiance fields. Existing methods are typically limited to a single representation, struggle to model topological changes, or fail to maintain temporal consistency over long videos. To address these limitations, we introduce the Temporal Structured Latents (T-SLAT), a unified 4D representation that jointly encodes geometry and appearance along the temporal dimension. Leveraging T-SLAT, MORPHOS autoregressively generates dynamic 3D assets via causal attention, conditioning each frame on its preceding history to ensure temporal consistency while handling evolving topologies. We also propose a temporal-structural augmentation to mitigate error accumulation in autoregressive generation. MORPHOS achieves state-of-the-art performance in appearance and competitive results in geometry across multiple benchmarks, demonstrating superior generalization across various representations and robustness in long-horizon generation.

2606.02490 2026-06-02 cs.LG

Expressivity of congruence-based architectures for DNNs on positive-definite matrices

Antonin Oswald, Estelle Massart

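AI summary: Studies the expressivity of congruence layers (the input matrix is multiplied on the left and right by a weight matrix and its transpose) for classifying positive-definite matrices, finds that the semi-orthogonality constraint limits expressivity and, for certain activation functions, collapses the network to a one-hidden-layer equivalent, and analyzes how compatible different Riemannian classifiers are with the feature maps of congruence layers.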

Comments: Accepted for Eusipco 2026

Abstract

This work studies neural architectures for classifying symmetric positive-definite matrices, focusing on congruence-like layers, in which the input matrix is multiplied on the left and right by a (possibly rectangular) weight matrix $W$ and its transpose. Such layers lie at the core of the celebrated SPDNet and have also been employed independently for dimensionality reduction on positive-definite data. We show that the (semi)-orthogonality constraint commonly imposed on $W$ limits the expressivity of these layers: for certain activation functions, the resulting architecture collapses to a one-hidden-layer equivalent. This lack of expressivity follows from a loss of spectral diversity in congruence-like layers for semi-orthogonal $W$ and is a direct consequence of Poincaré's separation theorem. We then examine the choice of the final classifier, comparing several Riemannian classifiers and discussing their compatibility with the feature maps produced by congruence-like layers.
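
For reference, the Poincaré separation theorem invoked above can be stated as follows. The convention $Y = W^\top X W$ with $W$ of size $n \times k$ and orthonormal columns is an assumption made here for concreteness; the paper may place the transpose on the other side.

```latex
% Assumed convention: X is n x n symmetric positive definite,
% W is n x k with W^T W = I_k (semi-orthogonal), k <= n.
\[
  Y = W^\top X W, \qquad
  \lambda_1(X) \ge \dots \ge \lambda_n(X), \qquad
  \mu_1(Y) \ge \dots \ge \mu_k(Y),
\]
\[
  \lambda_i(X) \;\ge\; \mu_i(Y) \;\ge\; \lambda_{n-k+i}(X),
  \qquad i = 1, \dots, k.
\]
```

In particular, every eigenvalue of $Y$ lies in $[\lambda_n(X), \lambda_1(X)]$, so a semi-orthogonal congruence layer cannot widen the spectral range of its input, which is consistent with the loss of spectral diversity the abstract describes.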

2606.02488 2026-06-02 cs.AI

RASER: Recoverability-Aware Selective Escalation Router for Multi-Hop Question Answering

Yuyang Li, Zihe Yan, Tobias Käfer

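AI summary: Proposes the RASER routers, which decide from six features of a one-shot RAG pass whether to escalate to a more expensive retrieval strategy. Without any extra LLM calls, they stay competitive with state-of-the-art baselines in F1 while saving a large share of tokens.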

Comments: Under Review

Abstract

Multi-hop question-answering systems often use expensive retrieval on every question. They may decompose the question, run several retrieval rounds, or search through bridge entities before answering. All of these strategies rely on repeated LLM calls to rewrite or decompose the question, which adds token cost and is a poor fit when the LLM budget is tight. However, our analysis shows that many multi-hop questions are already answered correctly by a single one-shot RAG pass, so running extra retrieval on every question wastes the budget. We introduce RASER (Recoverability-Aware Selective Escalation Router), a family of cheap routers built on one-shot RAG and six features from it. RASER-2 decides whether to stop or escalate to the extra-retrieval action PRUNE. RASER-3 chooses among one-shot RAG, PRUNE, and iterative retrieval IRCoT, using the same features but adding an explicit cost-accuracy trade-off. Neither router makes an extra LLM call to decide. Across six LLMs and three multi-hop QA benchmarks, both routers stay competitive with state-of-the-art (SOTA) baselines in F1 while spending only 41-49% of always-PRUNE's tokens and fewer tokens than the iterative and decomposition retrieval baselines.

2606.02487 2026-06-02 cs.CL

Towards Multidisciplinary Summarization of Hospital Stays: Efficient Sentence-Level Clinical Provenance Categorization

Baris Karacan, Vaibhav Bhargava, Barbara Di Eugenio, Natalie Parde, Mary Khetani, Yu-Shan Tseng, Vanessa Barbosa, Julie Vignato, Lindsey Knake, Rajashree Dahal, Emily Spellman, Danielle Hitzel, Janine Petitgout, Kristi Haughey, Amanda Karstens, Brianna Clarahan, Rachel Dawson, Lauren Boyd, Mackenzie Weis, Angie Tipton, Jaewon Bae, Catherine K. Craven, Karen Dunn Lopez, Andrew D. Boyd

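AI summary: This study proposes a clinical provenance categorization pipeline based on supervised fine-tuning of large language models for multidisciplinary hospital-stay summarization. In cross-domain transfer, fine-tuning raises the 70B model's Macro F1 by 7%, the quantized fine-tuned 70B model outperforms its full-precision baseline, and the results indicate that model capacity is critical for semantic flexibility.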

Comments: 5 pages. Submitted preprint version of a paper accepted to AIME 2026. This version may differ from the camera-ready manuscript and the final Version of Record. The Version of Record will be available from Springer Nature once published

Abstract

Effective "all-team" summarization in high-complexity settings like the Neonatal Intensive Care Unit (NICU) requires aggregating insights from diverse disciplines (physicians, nurses, therapists) spread across hundreds of clinical free-text notes. Simply pooling heterogeneous text often leads to incoherent outputs. Structured summarization therefore first requires accurate categorization of sentence-level provenance across multi-source notes. This pilot study introduces a clinical provenance categorization pipeline using supervised fine-tuning (SFT) of large language models (LLMs). We adapted two Llama-3 models (8B and 70B) to MedSecId, a corpus of 2,002 MIMIC-III (Adult ICU) notes annotated with clinical provenance headers, achieving in-domain Macro F1 scores above 92% for both models. To evaluate cross-domain generalization, we assessed model capacity (8B vs. 70B) and quantization on a gold-standard dataset of 227 sentence-level spans derived from three multi-disciplinary NICU summaries. Experimental results demonstrate a scale-dependent transfer effect: while SFT produced only marginal changes for the 8B model, it substantially improved the 70B model, increasing Macro F1 by 7%. Notably, the quantized fine-tuned 70B model outperformed its full-precision baseline while substantially reducing computational requirements. These findings suggest that sufficient model capacity is critical for preserving semantic flexibility during cross-domain clinical transfer and that efficient quantized adaptation can enable structured provenance modeling for downstream summarization.

2606.02486 2026-06-02 cs.RO

Intercepting the Future: Latent-Space Predictive World Model for Dynamic VLA Manipulation

Shahram Najam Syed, Arthur Jakobsson, Haoran Hao, Jeffrey Ichnowski

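AI summary: Proposes the AHEAD framework, which predicts future visual features with a latent-space world model so that a frozen VLA model achieves high manipulation success rates in dynamic scenes.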

Comments: 28 pages, 7 figures, 16 tables, Su

Abstract

Vision-Language-Action (VLA) models generalize across static manipulation but fail when objects move during task execution. They map the current observation to an action and assume the scene is stationary between observation and execution, so at any non-trivial object speed the resulting latency exceeds the time available to grasp. We close this gap with AHEAD (Anticipatory Horizon Extrapolation with Adaptive Dynamics), a predict-then-act wrapper that augments a frozen VLA with a motion-aware latent world model. A small world model trained on manipulation video forecasts future patch tokens in the VLA's feature space, conditioned on per-token velocity and acceleration from optical flow. A language-and-motion saliency mask concentrates prediction on task-relevant patches, and the model rolls forward for an adaptive horizon, halting when prediction uncertainty crosses a threshold. The frozen action decoder then receives the predicted future tokens in place of the current ones. AHEAD adds 4.9M parameters to a frozen 7B OpenVLA and reaches 79 to 97% success across 20 dynamic simulation scenarios where the strongest baseline reaches 31 to 58%. On a physical UFactory xArm 7, AHEAD succeeds on 29/30 to 30/30 on three conveyor and rolling-ball tasks, 23/30 on paddle interception, and 19/30 on projectile catching where every baseline scores 0/30.
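
The adaptive-horizon rollout described above might be organized as in the sketch below. Everything here is hypothetical: `step_fn` stands in for the latent world model, `uncertainty_fn` for its uncertainty estimate, and the choice to keep the last prediction that stayed under the threshold (rather than the one that crossed it) is an assumption that the abstract does not settle.

```python
def rollout_adaptive(tokens, step_fn, uncertainty_fn, threshold, max_horizon):
    """Roll a latent predictor forward until its uncertainty exceeds threshold.

    Returns (predicted_tokens, steps_taken). step_fn maps tokens to the next
    predicted tokens; uncertainty_fn maps tokens to a scalar uncertainty.
    """
    steps = 0
    while steps < max_horizon:
        candidate = step_fn(tokens)
        if uncertainty_fn(candidate) > threshold:
            break  # stop here and keep the last trusted prediction
        tokens = candidate
        steps += 1
    return tokens, steps
```

The returned tokens would then be handed to the frozen action decoder in place of the current observation tokens, as the abstract outlines; `max_horizon` is an added safety bound so the loop always terminates.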

2606.02484 2026-06-02 cs.AI cs.LG

Iteris: Agentic Research Loops for Computational Mathematics

Leheng Chen, Zihao Liu, Wanyi He, Bin Dong

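AI summary: Proposes Iteris, an agentic research system that tackles two open problems in computational mathematics through numerical experiments, constructions, and proof drafts, yielding verified results after expert review and correction.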

Comments: 43 pages

Abstract

Recent advances in large language models and agentic AI systems have enabled significant progress in mathematical discovery, from solving competition problems to tackling research-level conjectures. However, open problems in computational mathematics have received comparatively less attention: research in this area often requires not only proofs but also numerical experimentation, adversarial constructions, and algorithm design. In this paper, we introduce an agentic research system, Iteris, designed for open problems in computational mathematics. We apply Iteris to two open problems from a recent Simons Workshop collection (arXiv:2602.05394). In these case studies, Iteris generated numerical evidence, constructions, and proof drafts that led, after expert review and correction, to verified results. The first result is a phase diagram for the asymptotic comparison between conjugate gradient and randomized coordinate descent on power-law spectra; the second is a counterexample showing that QR factorization with column pivoting can fail to select well-conditioned submatrices even under low coherence. These case studies suggest that agentic AI systems can participate meaningfully in research workflows for open problems in computational mathematics, while human validation remains essential.

2606.02483 2026-06-02 cs.CR cs.AI cs.CL

Ghost Tool Calls: Issue-Time Privacy for Speculative Agent Tools

Bardia Mohammadi, Lars Klein, Akhil Arora, Laurent Bindschaedler

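AI summary: Addressing the leakage of user intent by speculatively issued tool calls in tool-augmented language agents, proposes Speculative Tool Privacy Contracts, which protect privacy at issue time rather than after commit.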

Abstract

Tool-augmented language agents speculatively issue likely future tool calls to hide latency, but those calls leak inferred user intent to external services before the agent commits to the branch. Every external observer that received the call retains the disclosure after the agent abandons the branch. Timing is the issue, not authorization: no commit-time cleanup, read-only restriction, or access-control allow-list unsends what an observer already holds. We call these invocations ghost tool calls and propose Speculative Tool Privacy Contracts, a runtime abstraction that treats observation before commitment as a first-class effect, distinct from state mutation. We implement the contracts in a prototype runtime and evaluate twelve policies across three corpora. Speculative dispatch increases what an observer can infer about user intent; post-hoc filters, read-only restrictions, and access-control allow-lists leave that inference intact; only issue-time policies that change or suppress the speculative call's argument or destination projection before dispatch reduce it.
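As a rough illustration of what an issue-time policy might look like, the sketch below decides before dispatch whether a speculative call is sent unchanged, sent with redacted arguments, or suppressed. All names here (`ToolCall`, `issue_time_policy`, the trusted-destination set, the sensitive-key set) are hypothetical and are not taken from the paper's prototype runtime.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    destination: str   # external service that would observe the call
    arguments: dict    # payload the observer would see
    speculative: bool  # True if the agent has not committed to this branch


def issue_time_policy(call, trusted_destinations, sensitive_keys):
    """Decide, before dispatch, what (if anything) an observer gets to see.

    Hypothetical policy: committed calls pass through, speculative calls to
    untrusted destinations are suppressed, and speculative calls to trusted
    destinations have their sensitive arguments redacted.
    """
    if not call.speculative:
        return call
    if call.destination not in trusted_destinations:
        return None  # suppressed: nothing leaves the runtime
    redacted = {
        key: ("<redacted>" if key in sensitive_keys else value)
        for key, value in call.arguments.items()
    }
    return ToolCall(call.destination, redacted, True)
```

The point of the sketch is only the ordering: the decision runs before the call is sent, so there is nothing to retract afterwards.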

2606.02481 2026-06-02 cs.CV

Places in the Wild: A Large, High-Resolution RAW Photograph Dataset for Ecologically Valid Vision Research

野外场景：一个用于生态有效视觉研究的大规模高分辨率RAW照片数据集

Michelle R. Greene

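AI summary: Presents a dataset of 67,574 high-resolution RAW photographs covering 260 scene categories with 360-degree viewpoint sampling, supporting research on viewpoint-dependent recognition, realistic scene understanding, and natural scene statistics.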

Comments: 19 pages, 3 tables, 4 figures

AI中文摘要

大规模图像数据集加速了认知神经科学和计算机视觉的进展。然而，大多数数据集是低分辨率、来自互联网的JPEG图像，其拍摄条件未知且空间上下文有限。野外场景数据集包含67,574张高分辨率照片，这些照片在810个物理位置现场采集，涵盖260个基本级场景类别，包括室内、城市和自然环境。在每个位置，安装在全景三脚架上的4500万像素佳能EOS R5相机以5度水平间隔拍摄72张图像，并在不同仰角拍摄12张图像，实现了密集的360度视点采样。所有图像同时记录为14位RAW（CR3）文件和压缩JPEG文件，保留了传感器级别的细节，用于分析亮度、对比度、颜色和其他图像统计信息。该数据集附有完整的EXIF元数据和一套图像质量指标。野外场景数据集支持人类和模型中视角依赖识别的研究、在真实条件下训练和评估场景理解系统、自然场景统计特征的刻画，以及需要近全视野视觉显示的实验。

英文摘要

Large image datasets have accelerated progress in cognitive neuroscience and computer vision. However, most datasets are low-resolution, internet-sourced JPEGs with unknown capture conditions and limited spatial context. Places in the Wild is a dataset of 67,574 high-resolution photographs collected in situ across 810 physical locations spanning 260 basic-level scene categories, including indoor, urban, and natural environments. At each location, a 45-megapixel Canon EOS R5 mounted on a panoramic tripod captured 72 images at 5-degree horizontal intervals plus 12 images at varying elevations, yielding dense 360-degree viewpoint sampling. All images were recorded simultaneously as 14-bit RAW (CR3) files and compressed JPEGs, preserving sensor-level detail for analyses of luminance, contrast, color, and other image statistics. The dataset is accompanied by complete EXIF metadata and a suite of image-quality metrics. Places in the Wild supports research on viewpoint-dependent recognition in humans and models, training and evaluation of scene-understanding systems under realistic conditions, characterization of natural scene statistics, and experiments requiring near-full-field visual displays.
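The capture protocol can be checked with a little arithmetic. The sketch below uses only the figures stated in the abstract; the helper and constant names are made up for illustration.

```python
# Figures from the abstract; names are hypothetical.
STEP_DEG = 5            # horizontal interval between shots
ELEVATION_SHOTS = 12    # extra images at varying elevations
LOCATIONS = 810
REPORTED_TOTAL = 67_574


def horizontal_headings(step_deg=STEP_DEG):
    """Headings, in degrees, of one full horizontal sweep."""
    return list(range(0, 360, step_deg))


headings = horizontal_headings()
per_location = len(headings) + ELEVATION_SHOTS   # 72 + 12
nominal_total = LOCATIONS * per_location         # if every location had a full set

print(len(headings), per_location, nominal_total, nominal_total - REPORTED_TOTAL)
```

The nominal total (68,040) is slightly above the reported 67,574, which suggests that some locations contribute fewer than 84 images; the abstract does not say why.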

2606.02479 2026-06-02 cs.CV

Retrieve What's Missing: Coverage-Maximizing Retrieval for Consistent Long Video Generation

检索缺失内容：面向一致长视频生成的覆盖最大化检索

Minseok Joo, Dogyun Park, Taehoon Lee, Kyujin Lee, Hyunwoo J. Kim

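AI summary: Proposes COVRAG, a depth-based coverage-maximizing retrieval-augmented generation framework that uses pretrained 3D priors to build lightweight coverage maps as memory evidence and iteratively retrieves frames that maximize residual coverage, improving geometric consistency in long video generation.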

Comments: 19 pages, 10 figures, 5 tables

AI中文摘要

对于长时域自回归视频生成，保持长期几何一致性仍然具有挑战性。记忆增强生成模型通过检索历史帧来解决这一问题，但其有效性取决于两个关键设计选择：哪些3D几何证据应代表过去的观测，以及如何从这些证据中选择记忆帧。现有方法通常依赖相机位姿或视场重叠，这些方法轻量但过于粗糙，无法推理像素级可见性；或者使用显式3D重建，提供细粒度证据但在长序列中维护成本高昂。我们提出覆盖最大化检索增强生成（COVRAG），一种基于深度的记忆检索框架，利用预训练3D先验构建目标视图覆盖图作为轻量级3D记忆证据。在帧选择方面，COVRAG最大化残差覆盖增益，迭代检索能够解释当前上下文或先前选择的记忆未覆盖的目标视图区域的帧。为了提高长视频生成的可扩展性，我们引入滑动窗口深度缓存以实现高效的几何估计。在RealEstate10K和DL3DV10K上的实验表明，COVRAG在保持低延迟的同时，相比基线方法改善了长时域几何一致性。

英文摘要

Maintaining long-term geometric consistency remains challenging for long-horizon autoregressive video generation. Memory-augmented generative models address this by retrieving historical frames, but their effectiveness depends on two key design choices: what 3D-geometric evidence should represent past observations, and how memory frames should be selected from this evidence. Existing methods often rely on camera poses or field-of-view overlap, which are lightweight but too coarse to reason about pixel-wise visibility, or use explicit 3D reconstruction, which provides fine-grained evidence but is costly to maintain over long rollouts. We propose Coverage-Maximizing Retrieval-Augmented Generation (COVRAG), a depth-based memory retrieval framework that uses pretrained 3D priors to construct a target-view coverage map as lightweight 3D memory evidence. For frame selection, COVRAG maximizes residual coverage gain, iteratively retrieving frames that explain target-view regions not covered by the current context or previously selected memories. To improve scalability in long-video generation, we introduce sliding-window depth caching for efficient geometry estimation. Experiments on RealEstate10K and DL3DV10K show that COVRAG improves long-horizon geometric consistency while maintaining low latency compared to baselines.
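The frame-selection idea described above reads like a greedy maximum-coverage procedure. The following is a minimal sketch under that assumption, with coverage represented as plain sets of target-view pixel ids; the function name, the set representation, and the frame budget are illustrative and not taken from the paper.

```python
def select_memory_frames(target_pixels, covered_by_context, frame_coverage, budget):
    """Greedy residual-coverage selection (illustrative, not the authors' code).

    target_pixels:      set of pixel ids in the target view
    covered_by_context: target pixels already explained by the current context
    frame_coverage:     dict frame_id -> set of target pixels that frame explains
    budget:             maximum number of memory frames to retrieve
    """
    residual = set(target_pixels) - set(covered_by_context)
    candidates = dict(frame_coverage)
    selected = []
    while residual and candidates and len(selected) < budget:
        # Pick the frame that explains the most still-uncovered pixels.
        best = max(candidates, key=lambda f: len(candidates[f] & residual))
        if not candidates[best] & residual:
            break  # no remaining frame adds coverage
        selected.append(best)
        residual -= candidates.pop(best)
    return selected, residual
```

Because each pick is scored against the residual rather than the full target view, a frame that duplicates what the context already shows gains nothing, which is the behaviour the abstract describes.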

2606.02470 2026-06-02 cs.AI

MCP-Persona: Benchmarking LLM Agents on Real-World Personal Applications via Environment Simulation

MCP-Persona：通过环境模拟在真实个人应用上基准测试LLM智能体

Wenhao Wang, Peizhi Niu, Gongyi Zou, Xiyuan Yang, Jingxing Wang, Haoting Shi, Yaxin Du, Jingyi Chai, Xianghe Pang, Shuo Tang, Yanfeng Wang, Siheng Chen

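AI summary: Existing benchmarks overlook the challenges of personal social applications, where tools interact with individual accounts or local databases. MCP-Persona evaluates LLM agents by simulating real-world personalized MCP tools, and experiments show that current agents struggle significantly with personalized tool use.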

Comments: ICML 2026 Camera Ready

AI中文摘要

模型上下文协议（MCP）已成为连接大型语言模型（LLM）与外部数据源和工具的变革性标准，并已迅速在个人应用和开发平台中得到采用。然而，现有基准主要关注通用信息搜索工具，未能捕捉个人社交应用带来的实际挑战，在这些应用中工具与个人账户或本地数据库交互。为弥合这一关键差距，我们引入了MCP-Persona，这是首个专门用于评估智能体在真实世界个性化MCP工具上性能的基准。MCP-Persona涵盖了一系列多样化的广泛使用的应用，从社交媒体平台如Reddit和小红书（Rednote）到企业协作套件如飞书（Lark）和Slack。我们在各种最先进（SOTA）智能体上的广泛实验表明，它们在个性化工具使用上存在显著困难，从而凸显了该基准在识别和解决这些局限性方面的关键作用。MCP-Persona公开可用：https://github.com/wwh0411/MCP-Persona。

英文摘要

The Model Context Protocol (MCP) has emerged as a transformative standard for connecting large language models (LLMs) with external data sources and tools, and has been rapidly adopted across personal applications and development platforms. However, existing benchmarks predominantly focus on generic information-seeking tools and fail to capture the practical challenges posed by personal social applications, where tools interact with individual accounts or local databases. To bridge this critical gap, we introduce MCP-Persona, the first benchmark specifically designed for evaluating agent performance on real-world, personalized MCP tools. MCP-Persona encompasses a diverse set of widely-used applications, ranging from social media platforms like Reddit and Xiaohongshu (Rednote) to enterprise collaboration suites such as Lark (Feishu) and Slack. Our extensive experiments on various state-of-the-art (SOTA) agents demonstrate their significant struggles with personalized tool use, thereby highlighting the benchmark's crucial role in identifying and addressing these limitations. MCP-Persona is publicly available at https://github.com/wwh0411/MCP-Persona.

2606.02465 2026-06-02 cs.CL cs.AI

Learning When to Translate for Multilingual Reasoning

学习何时翻译以实现多语言推理

Deokhyung Kang, Hyounghun Kim, Gary Geunbae Lee

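AI summary: Proposes Luar, a reinforcement learning framework that trains reasoning language models to invoke translation selectively when direct understanding is unreliable, narrowing the multilingual reasoning gap.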

Comments: preprint

AI中文摘要

推理语言模型（RLMs）在复杂推理任务上表现出色，但仍存在显著的多语言推理差距，这主要源于非英语输入中的语言理解失败。英语翻译可以通过将非英语输入转换为RLMs更可靠解释的形式来缓解这些失败，但当模型能够从原始查询中可靠推理时，翻译每个输入是不必要的。为应对这一挑战，我们提出Luar，一种语言理解边界感知的强化学习框架，训练RLMs在直接理解不可靠时选择性调用翻译。Luar训练模型在直接解决原始输入和对其英语翻译进行推理之间做出选择，仅在翻译增强推理预期显著优于直接推理时鼓励翻译。在多语言推理基准测试中，Luar优于标准GRPO和其他基于训练的基线，在低资源语言上尤其获得巨大提升。进一步分析表明，Luar在直接推理足够的情况下避免不必要的翻译，同时将其翻译调用行为扩展到未见过的低资源语言。总之，我们的工作提出了一种选择性多语言推理方法：RLMs可以学习仅在直接理解不可靠时调用翻译。该项目将在https://github.com/deokhk/LUAR公开。

英文摘要

Reasoning language models (RLMs) achieve strong performance on complex reasoning tasks, but still exhibit substantial multilingual reasoning gaps, largely due to language-understanding failures in non-English inputs. English translation can mitigate these failures by expressing non-English inputs in a form that RLMs can more reliably interpret, yet translating every input is unnecessary when the model can reason reliably from the original query. To address this challenge, we propose Luar, a Language Understanding Boundary-aware Reinforcement Learning framework that trains RLMs to selectively invoke translation when direct understanding is unreliable. Luar trains the model to choose between solving the original input directly and reasoning over its English translation, encouraging translation only when translator-augmented reasoning is expected to substantially outperform direct reasoning. Across multilingual reasoning benchmarks, Luar outperforms standard GRPO and other training-based baselines, with particularly large gains on low-resource languages. Further analysis shows that Luar avoids unnecessary translation in cases where direct reasoning is sufficient, while extending its translator-call behavior to unseen low-resource languages. Together, our work suggests a selective approach to multilingual reasoning: RLMs can learn to invoke translation only when their direct understanding is unreliable. The project will be made publicly available at https://github.com/deokhk/LUAR

2606.02463 2026-06-02 cs.CV cs.AI

MASER: Modality-Adaptive Specialist Routing for Embodied 3D Spatial Intelligence

MASER: 面向具身3D空间智能的模态自适应专家路由

Hilton Raj, Vishnuram AV

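AI summary: Proposes MASER, which trains five modality adapters on a shared VLM backbone and learns a neural routing policy that picks the best adapter for each question, addressing the neglect of question semantics when embodied agents reason over multiple modalities in 3D environments.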

Comments: Accepted to CVPR 2026 Foundation Models Meet Embodied Agents Workshop

AI中文摘要

在3D环境中，具身代理通过推理自然语言、RGB图像、点云、深度图和相机位姿等多模态信息来回答空间相关问题。现有的视觉语言模型（VLM）在单一模态上微调，完全忽略了可能偏好不同于微调模态的问题语义。为解决这一问题，我们提出MASER（模态自适应专家路由），一个轻量级框架，训练共享VLM骨干的五个不同模态适配器，并学习一个神经路由策略，在推理时根据问题选择最佳适配器。我们使用冻结的句子变换器对每个问题进行编码，并将嵌入通过一个小型多层感知器（MLP），该感知器在oracle适配器-准确率标签上训练。我们在Open3D-VQA基准上评估我们的方法，评估结果表明没有单一模态是普遍最优的——点云答案在51.5%的情况下最佳。MASER以51.3%的oracle一致性进行路由，优于随机森林消融（43.5%），且每个问题仅调用一次适配器。

英文摘要

In 3D environments, Embodied Agents answer spatially relevant questions through reasoning from a mixture of modalities including natural language, RGB images, point clouds, depth maps and camera poses. Existing Vision-Language models (VLMs) are fine-tuned over a single modality. This completely ignores the question semantics which may favor a different modality than the finetuned modality. To address this, we propose MASER (Modality-Adaptive SpEcialist Routing), a lightweight framework that trains five different modality adapters of a shared VLM backbone and learns a neural routing policy that selects the best adapter based on the question during inference. We encode each question with a frozen sentence transformer and pass the embedding through a small Multi-layer Perceptron (MLP) trained on oracle adapter-accuracy labels. We evaluate our methodology over the Open3D-VQA benchmark and our evaluations show that no single modality is universally optimal -- point-cloud answers are best in 51.5% of cases. MASER routes with 51.3% oracle agreement, outperforming a Random-Forest ablation (43.5%), with only a single adapter call per question.

2606.02459 2026-06-02 cs.CV

Active Exploring like a Pigeon: Reinforcing Spatial Reasoning via Agentic Vision-Language Models

像鸽子一样主动探索：通过智能视觉语言模型强化空间推理

Wei Deng, Xianlin Zhang, Mengshi Qi

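AI summary: Proposes an agentic vision-language pipeline inspired by pigeons' cognitive maps, in which a dynamic cognitive map and Spatial Assertion Codes provide dense reward signals, reaching 80.5% overall accuracy on the MindCube benchmark and a 53.2% relative improvement on the Rotation subset.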

Comments: Accepted by ICML 2026

AI中文摘要

使视觉语言模型（VLM）能够进行空间推理仍然具有挑战性。现有方法将VLM视为被动观察者，这在实际应用中难以奏效。此外，强化学习方法依赖稀疏奖励，限制了其在复杂推理任务中的有效性。受鸽子构建和利用认知地图进行导航的启发，我们提出了一种新颖的智能管道用于空间推理。首先，我们引入了一种新的动态认知地图，将场景布局参数化为物体位置和朝向，作为新观测的持久记忆。其次，我们提出了一种新颖的空间断言代码（SAC），即用Python表达式编程描述空间关系。通过与动态认知地图协作，SAC能够验证中间推理步骤，提供密集的奖励信号。我们通过监督学习和强化微调来优化模型。在MindCube基准上的实验表明，我们的方法达到了80.5%的总体准确率，在具有挑战性的Rotation子集上，比当前最佳方法高出29.5个准确率点（相对提升53.2%）。我们的代码和数据已在https://github.com/dw-dengwei/active-spatial-reasoning.git开源。

英文摘要

Enabling Vision-Language Models (VLMs) to perform spatial reasoning remains challenging. Existing approaches treat VLMs as passive observers, which is a poor fit for real-world applications. Moreover, reinforcement learning methods rely on sparse rewards, limiting their effectiveness for complex reasoning tasks. Inspired by how pigeons build and exploit cognitive maps for navigation, we propose a novel agentic pipeline for spatial reasoning. First, we introduce a new dynamic cognitive map that parameterizes scene layout as object positions and orientations, serving as persistent memory for new observations. Second, we propose Spatial Assertion Codes (SAC), Python expressions that programmatically describe spatial relationships. By collaborating with the dynamic cognitive map, SAC enables verification of intermediate reasoning steps, providing dense reward signals. We optimize the model via supervised and reinforcement finetuning. Experiments on the MindCube benchmark demonstrate state-of-the-art performance with 80.5% overall accuracy, outperforming the best current method by 29.5 accuracy points (a relative improvement of 53.2%) on the challenging Rotation subset. Our code and data are open-sourced at https://github.com/dw-dengwei/active-spatial-reasoning.git.
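To make the idea of checking Python assertions against a cognitive map concrete, here is a toy sketch. The map contents, the relation helpers (`left_of`, `distance`), and the reward definition (fraction of assertions that hold) are all assumptions for illustration; the abstract does not specify the actual SAC vocabulary or reward.

```python
import math
from dataclasses import dataclass


@dataclass
class ObjectState:
    x: float            # position on the ground plane
    y: float
    heading_deg: float  # orientation; unused by the toy relations below


# Hypothetical cognitive map: object name -> position and orientation.
cognitive_map = {
    "chair": ObjectState(0.0, 0.0, 90.0),
    "lamp": ObjectState(-1.0, 2.0, 0.0),
    "table": ObjectState(2.0, 1.0, 180.0),
}


def left_of(a, b):
    return cognitive_map[a].x < cognitive_map[b].x


def distance(a, b):
    pa, pb = cognitive_map[a], cognitive_map[b]
    return math.hypot(pa.x - pb.x, pa.y - pb.y)


def assertion_reward(assertions):
    """Fraction of assertion strings that evaluate to True on the map."""
    # eval on model-generated text is unsafe outside a sandbox; toy use only.
    namespace = {"__builtins__": {}, "left_of": left_of, "distance": distance}
    passed = 0
    for expr in assertions:
        try:
            passed += bool(eval(expr, namespace))
        except Exception:
            pass  # malformed assertions earn no credit
    return passed / len(assertions) if assertions else 0.0


steps = [
    "left_of('lamp', 'chair')",
    "distance('chair', 'table') < distance('lamp', 'table')",
    "left_of('table', 'lamp')",
]
print(assertion_reward(steps))
```

Each intermediate step contributes to the score on its own, which is what makes the signal denser than a single reward on the final answer.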

2606.02458 2026-06-02 cs.AI

Beyond One-shot: AI Agents for Learning in Field Experiments

超越一次性：用于现场实验学习的AI智能体

Junjie Luo, Ritu Agarwal, Gordon Gao

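AI summary: Studies whether tool-augmented agentic AI can learn automatically from experimental data to generate new interventions, and shows in field experiments on healthcare prescription messaging that it outperforms a Human + Chatbot method.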

AI中文摘要

组织通常进行A/B测试实验，但一次实验产生的数据未被充分利用以指导后续干预设计。从先前实验数据中提取可操作知识以指导新干预存在重大障碍。我们研究工具增强的智能体AI能否自动从实验数据中学习，以在后续实验中生成新干预。通过医疗处方消息传递（693,139次患者就诊）的两阶段现场实验，我们比较了人类+聊天机器人方法（第一阶段：行为专家与对话式AI共同设计13种消息变体，444,691次患者就诊）与工具增强的智能体AI方法（第二阶段：AI自主从第一阶段数据中提取原则以生成17种新变体，248,448次患者就诊）。配备分析工具、结构化数据-信息-知识-智慧（DIKW）推理智能体和透明证据链的智能体AI方法产生了更优的干预：AI生成的最佳消息实现了69.8%的点击率（比基线高6.5个百分点）。关键的是，我们的结果表明价值来自特定领域的实验数据，而非通用推理能力：没有实验数据的前沿大语言模型无法预测哪些干预会成功。现场实验还揭示，用于干预设计的通用行为理论并不能统一适用于特定医疗环境，这激发了在实验规模上进行理论审计的智能体AI方法。我们的研究表明，工具增强的AI可以从实验数据中学习并生成改进的领域相关干预，将行为实验从一次性评估转变为可扩展的累积设计学习系统。

英文摘要

Organizations routinely run experiments for A/B testing, yet the data generated from one experiment is underutilized to inform subsequent intervention design. Significant barriers exist to extracting actionable knowledge from prior experimental data to inform new interventions. We study whether tool-augmented agentic AI can automatically learn from experimental data to generate new interventions in subsequent experiments. Through two-stage field experiments in healthcare prescription messaging (693,139 patient visits), we compare a Human + Chatbot method (Stage 1: behavioral experts with conversational AI co-designing 13 message variants, 444,691 patient visits) against a Tool-Augmented Agentic AI method (Stage 2: AI autonomously extracting principles from Stage 1 data to generate 17 new variants, 248,448 patient visits). The Agentic AI method, equipped with analytical tools, structured Data-Information-Knowledge-Wisdom (DIKW) reasoning agents, and transparent evidence chains, produces superior interventions: the best AI-generated message achieved a 69.8% CTR (+6.5 percentage points over baseline). Critically, our results suggest that the value comes from domain-specific experimental data, not from general reasoning ability: frontier LLMs operating without experimental data failed to predict which interventions would succeed. The field experiments also revealed that general-purpose behavioral theories used for intervention design do not extend uniformly to specific healthcare contexts, motivating an agentic AI approach to theory audits at field-experiment scale. Our research shows that tool-augmented AI can learn from experimental data and generate improved domain-relevant interventions, transforming behavioral experimentation from one-shot evaluation into a scalable system for cumulative design learning.

2606.02455 2026-06-02 cs.LG cond-mat.mtrl-sci physics.chem-ph physics.comp-ph stat.CO

Speculative Sampling For Faster Molecular Dynamics

用于加速分子动力学的推测采样

Arthur Kosmala, Stephan Günnemann, Meng Gao, Brandon Wood

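AI summary: Proposes Langevin Speculative Dynamics (LSD), a distributed, model-agnostic speculative sampler in which a draft model proposes fast steps that a target model verifies in parallel, giving a 3-9x speedup in molecular dynamics simulation without adding relative error.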

Comments: Forty-Third International Conference on Machine Learning (ICML 2026). 32 pages, 14 figures, 8 tables

AI中文摘要

分子动力学（MD）是模拟原子系统动力学行为的关键工具。然而，MD本质上是串行的，这使得通过并发计算提高单系统吞吐量变得困难。为了解决这个问题，我们引入了Langevin推测动力学（LSD），一种分布式且模型无关的推测采样器，用于在不增加相对误差的情况下加速MD。受语言和扩散建模中推测方法的启发，LSD使用草稿模型提议快速模拟步长，并用较慢的目标模型并行验证，应用从草稿分布到目标分布的传输映射。我们将推测采样扩展到二阶Langevin动力学，推导出作为物理参数函数的可实现加速比，表明LSD在不同系统和草稿-目标组合中实现3-9倍加速，并从理论和实验上证实LSD从其目标模型分布中采样轨迹。

英文摘要

Molecular dynamics (MD) is a key tool for simulating the dynamical behavior of atomic systems. However, MD is inherently serial, which makes it difficult to increase single-system throughput with concurrent compute. To address this, we introduce Langevin Speculative Dynamics (LSD), a distributed and model-agnostic speculative sampler for accelerating MD without adding relative error. Inspired by speculative methods in language and diffusion modeling, LSD uses a draft model to propose fast simulation steps and verifies them in parallel with a slower target model, applying a transport map from the draft to the target distribution. We extend speculative sampling to second-order Langevin dynamics, derive the achievable speedup as a function of physical parameters, show that LSD generalizes across different systems and draft-target combinations with a 3-9x speedup, and confirm theoretically and empirically that LSD samples trajectories from its target model distribution.
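For readers unfamiliar with the speculative methods that LSD draws on, the sketch below shows the classic accept/reject rule used in speculative decoding for language models, on a small discrete distribution. It is not LSD itself: the paper works with continuous second-order Langevin dynamics and a transport map, neither of which is reproduced here. The function names are illustrative.

```python
import random


def residual_distribution(p, q):
    """Normalised max(0, p - q), used to resample after a rejection."""
    excess = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(excess)
    return [e / total for e in excess] if total > 0 else list(p)


def speculative_step(p, q, rng):
    """Draw one sample distributed as target p while proposing from draft q."""
    x = rng.choices(range(len(q)), weights=q)[0]   # cheap draft proposal
    if rng.random() < min(1.0, p[x] / q[x]):       # verification by the target
        return x, True
    r = residual_distribution(p, q)
    return rng.choices(range(len(r)), weights=r)[0], False


def exact_output_distribution(p, q):
    """Marginal distribution of speculative_step, computed analytically."""
    accepted = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_mass = 1.0 - sum(accepted)
    r = residual_distribution(p, q)
    return [a + reject_mass * ri for a, ri in zip(accepted, r)]


target = [0.5, 0.3, 0.2]
draft = [0.2, 0.5, 0.3]
print(exact_output_distribution(target, draft))
print(speculative_step(target, draft, random.Random(0)))
```

The analytic marginal equals the target distribution regardless of the draft, which is the sense in which such schemes speed up sampling without changing what is sampled.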

2606.02453 2026-06-02 cs.CV cs.AI

Initialization is Half the Battle: Generating Diverse Images from a Guidance Potential Posterior

初始化即半程：从引导势后验生成多样图像

Xiang Li, Dianbo Liu, Kenji Kawaguchi

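AI summary: To counter mode collapse in generative models, proposes DivIn, which samples the initial noise from a guidance potential posterior and uses Langevin dynamics to steer initialization away from collapsing regions, improving diversity while remaining compatible with diffusion and flow matching models.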

Comments: Accepted by ICML 2026 Spotlight

AI中文摘要

尽管生成模型具有显著的保真度，但它们经常遭受模式崩溃。现有的增强多样性的策略主要集中于在生成轨迹期间进行干预。我们发现一个关键的疏忽：标准高斯初始化通常导致轨迹崩溃到主导模式，因为它对引导势景观是无关的。在这项工作中，我们从引导势后验中公式化选择初始噪声，这有效地将先验重新加权到多样性丰富的区域。为了高效地从该分布中采样，我们引入了多样性诱导初始化（DivIn），它利用朗之万动力学主动导航初始化景观，将初始噪声引导远离崩溃区域，同时将其锚定到有效的数据流形。我们的方法作为一种推理时多样性增强，与扩散和流匹配模型都兼容。大量实验表明，DivIn在类到图像和文本到图像场景中都表现出优越的性能。此外，我们强调，由于DivIn与基于轨迹的方法是正交的，将它们结合起来显著扩展了多样性-质量帕累托前沿，超越了任何单独方法所能达到的。

英文摘要

Despite the remarkable fidelity of generative models, they frequently suffer from mode collapse. Existing strategies for enhancing diversity predominantly focus on intervening during the generation trajectory. We identify a critical oversight that the standard Gaussian initialization often causes trajectories to collapse into dominant modes because it is agnostic to the guidance potential landscape. In this work, we formulate selecting the initial noise from a guidance potential posterior, which effectively re-weights the prior towards diversity-rich regions. To sample from this distribution efficiently, we introduce Diversity-inducing Initialization (DivIn), which leverages Langevin dynamics to actively navigate the initialization landscape, steering initial noise away from collapsing regions while anchoring them to the valid data manifold. Our method serves as an inference-time diversity enhancement compatible with both diffusion and flow matching models. Extensive experiments show that DivIn exhibits a superior performance in both class-to-image and text-to-image scenarios. Furthermore, we highlight that as DivIn is orthogonal to trajectory-based methods, combining them significantly expands the diversity-quality Pareto frontier beyond what either achieves in isolation.
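As a generic illustration of sampling initial noise with Langevin dynamics, the sketch below runs unadjusted Langevin updates on a density proportional to a standard Gaussian prior times exp(-U(z)). The quadratic potential U is a placeholder chosen so the result can be checked by hand; the abstract does not give DivIn's actual guidance potential, step size, or schedule.

```python
import math
import random


def langevin_step(z, grad_potential, step, noise):
    """One unadjusted Langevin update targeting N(0, I) * exp(-U(z)).

    The gradient of the log density is -z - grad U(z).
    """
    g = grad_potential(z)
    return [
        zi + step * (-zi - gi) + math.sqrt(2.0 * step) * ni
        for zi, gi, ni in zip(z, g, noise)
    ]


CENTRE = [1.0, -1.0]


def toy_grad_potential(z):
    # Placeholder potential U(z) = 0.5 * ||z - CENTRE||^2 (an assumption).
    return [zi - ci for zi, ci in zip(z, CENTRE)]


rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(2)]   # standard Gaussian initialization
for _ in range(200):
    noise = [rng.gauss(0.0, 1.0) for _ in z]
    z = langevin_step(z, toy_grad_potential, 0.05, noise)
print(z)
```

With this toy potential the target is a Gaussian centred at CENTRE / 2, so the iterates drift from the prior's origin toward that point while the injected noise keeps them spread out.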

2606.02449 2026-06-02 cs.AI cs.CL cs.CV cs.LG cs.MM

HLL: Can Agents Cross Humanity's Last Line of Verification?

HLL：智能体能否跨越人类最后一道验证防线？

Xinhao Song, Su Su, Sirui Song, Hongliang Wu, Wen Shen, Zhihua Wei, Gongshen Liu, Linfeng Zhang, Dongrui Liu

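AI summary: Proposes the HLL benchmark, which uses interactive CAPTCHA verification to evaluate whether multimodal agents can substitute for humans in protected workflows, and finds current agents brittle in localization, action calibration, state tracking, and process consistency.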

Comments: 27 pages, 14 figures

AI中文摘要

多模态智能体越来越被期望代表用户操作界面，这引发了一个核心部署问题：在服务特意防止自动化的流程中，它们能否真正替代人类？CAPTCHA验证使这个问题具体化。它不仅仅是一个视觉谜题，更是在账户创建、内容访问、表单提交和其他受保护操作之前设置的人类验证边界。我们引入了人类最后一道验证防线（HLL），这是一个受控基准，使用交互式CAPTCHA验证来评估智能体是否能够通过基于环境的类人交互（而非仅识别）跨越这一边界。HLL涵盖了多种CAPTCHA交互，并让智能体暴露于受控的现实压力因素下，包括杂乱的网页、更困难的任务变体以及解决过程的轨迹条件验证。我们在闭环GUI环境中评估了八个前沿多模态智能体。结果表明，当前智能体在这个人类替代边界上仍然脆弱：性能在不同验证类型间差异显著，在现实界面条件下下降，当正确答案必须由有效动作轨迹支持时进一步下降。通过揭示定位、动作校准、状态跟踪和过程一致性方面的差距，HLL为衡量多模态智能体在受保护的真实世界工作流中作为人类替代品有多接近提供了一个具体的测试平台。我们的代码可在https://github.com/XinhaoS0101/HLL获取。

英文摘要

Multimodal agents are increasingly expected to operate interfaces on behalf of users, raising a central deployment question: can they truly substitute for humans in workflows that services deliberately protect against automation? CAPTCHA verification makes this question concrete. It is not merely a visual puzzle, but a human-verification boundary placed before account creation, content access, form submission, and other protected actions. We introduce Humanity's Last Line of Verification (HLL), a controlled benchmark that uses interactive CAPTCHA verification to evaluate whether agents can cross this boundary through grounded, human-like interaction rather than recognition alone. HLL covers diverse CAPTCHA interactions and exposes agents to controlled realism stressors, including cluttered webpages, harder task variants, and trace-conditioned validation of the solving process. We evaluate eight frontier multimodal agents in a closed-loop GUI environment. The results show that current agents remain brittle at this human-substitution boundary: performance varies sharply across verification types, degrades under realistic interface conditions, and drops further when correct answers must be supported by valid action traces. By exposing gaps in localization, action calibration, state tracking, and process consistency, HLL provides a concrete testbed for measuring how close multimodal agents are to acting as human substitutes in protected real-world workflows. Our code is available at https://github.com/XinhaoS0101/HLL

2606.02448 2026-06-02 eess.SP cs.SD

Diffusion-Based Heart Sound Generation: Evaluation with Physiological Signal Metrics, Classifiers, and Expert Listening

基于扩散的心音生成：使用生理信号指标、分类器和专家听诊评估

Xinqi Bao, Jia Bi, Xin Chen, Ernest Nlandu Kamavuako, Saikat Chatterjee

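AI summary: Proposes a class-conditional diffusion model in the log-mel domain for generating phonocardiograms, evaluates synthetic fidelity with physiological metrics, downstream classification accuracy, and expert listening, and analyzes remaining challenges such as retaining abnormal acoustic cues and reducing reconstruction artefacts.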

AI中文摘要

公开可用的心音图（PCG）数据集在规模和病理多样性方面仍然有限，限制了听诊训练和自动心音分类器的泛化能力。本文在log-mel域上开发了一种用于PCG生成的类别条件扩散模型，并使用互补的（i）生理启发的合理性指标、（ii）下游标签一致性评估和（iii）专家听诊来评估合成保真度。实验使用PhysioNet/Computing in Cardiology Challenge 2016数据集（3240条记录）进行记录级划分。经过预处理和质量控制后，将16,749个不重叠的4秒片段映射到归一化的1×128×128 log-mel表示，以训练带有无分类器引导的条件2D U-Net去噪器。使用三个轻量级指标在重建波形上量化信号级合理性：包络自相关节律评分、基于幅度的爆炸评分和主周期滞后。合成片段保留了相似的主周期持续时间，但与真实片段相比，包络周期性降低，瞬态突发性增加。在下游评估中，ResNet-50分类器在保留的真实测试集上达到92.24%的准确率，在类别平衡的合成批次上达到82.8%的准确率，表明生成信号保留了与正常/异常分类相关的判别结构。在一项初步的专家听诊研究（60个片段，两名临床医生）中，大多数合成片段被判断为类似心音，而真实和合成的4秒片段对异常敏感性均较低。总体而言，结果为基于扩散的PCG生成提供了实用基线，同时突出了在保留异常声学线索和减少重建伪影方面的剩余挑战。

英文摘要

Publicly available phonocardiogram (PCG) datasets remain limited in size and pathological diversity, constraining both auscultation training and the generalisation of automated heart-sound classifiers. A class-conditional diffusion model for PCG generation is developed in the log-mel domain and synthetic fidelity is assessed using complementary (i) physiology-inspired plausibility metrics, (ii) downstream label-consistency evaluation, and (iii) expert listening. Experiments use the PhysioNet/Computing in Cardiology Challenge 2016 dataset (3240 recordings) with recording-level splits. After preprocessing and quality control, 16,749 non-overlapping 4 s clips are mapped to a normalised 1 x 128 x 128 log-mel representation to train a conditional 2D U-Net denoiser with classifier-free guidance. Signal-level plausibility is quantified on reconstructed waveforms using three lightweight metrics: an envelope-autocorrelation rhythm score, an amplitude-based explosion score, and the dominant cycle lag. Synthetic clips preserve similar dominant cycle durations but exhibit reduced envelope periodicity and increased transient burstiness relative to real clips. For downstream evaluation, a ResNet-50 classifier achieves 92.24% accuracy on the held-out real test set and 82.8% accuracy on class-balanced synthetic batches, indicating that generated signals retain discriminative structure relevant to normal/abnormal classification. In a pilot expert listening study (60 clips, two clinicians), most synthetic clips are judged as heart-sound-like, while abnormality sensitivity is low for both real and synthetic 4 s excerpts. Overall, the results provide a practical baseline for diffusion-based PCG generation while highlighting remaining challenges in retaining abnormal acoustic cues and reducing reconstruction-induced artefacts.
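The abstract names an envelope-autocorrelation rhythm score and a dominant cycle lag without defining them. One plausible reading, sketched below as an assumption, is to take the peak of the normalised autocorrelation of the amplitude envelope over a physiologically sensible lag range: the peak value serves as the rhythm score and its position as the cycle lag. The toy envelope is synthetic and is not data from the paper.

```python
def autocorrelation(x, lag):
    """Normalised autocorrelation of a mean-removed sequence at one lag."""
    mean = sum(x) / len(x)
    centred = [v - mean for v in x]
    denom = sum(v * v for v in centred)
    if denom == 0:
        return 0.0
    num = sum(centred[i] * centred[i + lag] for i in range(len(x) - lag))
    return num / denom


def rhythm_score_and_cycle_lag(envelope, min_lag, max_lag):
    """Peak autocorrelation within [min_lag, max_lag] and the lag where it occurs."""
    scores = {
        lag: autocorrelation(envelope, lag) for lag in range(min_lag, max_lag + 1)
    }
    best_lag = max(scores, key=scores.get)
    return scores[best_lag], best_lag


# Toy envelope: one idealised beat every 8 samples.
envelope = [1.0 if i % 8 == 0 else 0.0 for i in range(64)]
score, lag = rhythm_score_and_cycle_lag(envelope, 4, 12)
print(score, lag)
```

Under this reading, the reported finding that synthetic clips keep similar cycle durations but lose periodicity would correspond to a similar peak position with a lower peak height.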

2606.02444 2026-06-02 cs.AI cs.CL

Food Noise & False Safety: A Systematic Evaluation of How LLMs Fail to Adapt to Eating Disorder Queries with Clinician Feedback

食物噪音与虚假安全：系统评估LLMs如何在临床医生反馈下未能适应饮食障碍查询

Giulia Pucci, Emily Hemendinger, Ruizhe Li, Gavin Abercrombie, Tanvi Dinkar, Arabella Sinclair

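AI summary: Working with clinical eating-disorder experts, this study systematically evaluates the harm that large language models may cause by uncritically accommodating unsafe or self-harming requests when handling queries from users with eating disorders.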

2606.02443 2026-06-02 cs.CL cs.AI cs.CV

PaSBench-Video: A Streaming Video Benchmark for Proactive Safety Warning

PaSBench-Video: 面向主动安全预警的流式视频基准

Yusong Zhao, Yuejin Xie, Youliang Yuan, Junjie Hu, Jitian Guo, Yujiu Yang, Pinjia He

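AI summary: Presents PaSBench-Video, a 740-video benchmark that evaluates whether multimodal large models can issue timely warnings before a hazard occurs, and finds that existing models perform poorly on temporal precision and low false-positive rates.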

AI中文摘要

从危险的第一个可见迹象到事故发生之间，通常存在一个仍可干预的时间窗口。具备视频能力的多模态大语言模型（MLLM）可以作为始终在线的安全监控器，在此窗口内发出警告。然而，当前的基准测试并未检验这一能力：它们依赖静态输入，忽略时间精度，并且省略了对安全场景的误报测量。我们提出了PaSBench-Video，一个包含740个视频的基准测试，涵盖驾驶、医疗、日常生活和工业生产四个领域，其中包含481个风险视频和259个无风险视频。风险视频标注了帧级别的风险起始点和事故边界。模型必须以因果方式观察视频，并发出在时间上校准且内容正确的警告。测试了13个MLLM后，我们发现没有模型在我们的最严格指标上超过20.0%，并且召回率与误报率紧密相关，皮尔逊相关系数为0.64：更高的检测率只能以在大多数安全片段上触发警告为代价。性能按领域显著分化：在日常生活领域，模型在低误报率下实现了中等召回率，因为该领域的风险本质上是异常的；而在驾驶领域，模型不加区分地触发警告，因为常规场景和危险场景看起来相似。这些结果表明，当前模型依赖于场景级别的活动线索，而不是推理正在出现的危害。

英文摘要

Between the first visible sign of danger and the moment an accident occurs, there is often a window where intervention remains possible. Video-capable multimodal large language models (MLLMs) could serve as always-on safety monitors that issue warnings during this window. Yet current benchmarks do not test this ability: they rely on static inputs, ignore timing precision, and omit false-positive measurement on safe scenes. We present PaSBench-Video, a 740-video benchmark with 481 risk and 259 no-risk videos across four domains: driving, healthcare, daily life, and industrial production. Risk videos are annotated with frame-level risk onset and accident boundaries. A model must observe the video causally and produce a warning that is both temporally calibrated and content-correct. Testing 13 MLLMs, we find that no model exceeds 20.0% on our strictest metric, and recall is tightly coupled with false-positive rate, with Pearson correlation 0.64: higher detection comes only at the cost of triggering warnings on the majority of safe clips. Performance splits sharply by domain: models achieve moderate recall at low false-positive rates in daily life, where risks are inherently anomalous, yet fire indiscriminately in driving, where routine and hazardous scenes look alike. These results indicate that current models rely on scene-level activity cues rather than reasoning about emerging harm.

2606.02441 2026-06-02 cs.CV

Spatial-Temporal Decoupled Reference Conditioning for Identity-Preserving Text-to-Video Generation

空间-时间解耦参考条件用于身份保持的文本到视频生成

Yuheng Chen, Teng Hu, Yuji Wang, Qingdong He, Lizhuang Ma, Jiangning Zhang

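AI summary: Proposes the ST-DRC framework, which achieves high-fidelity identity-preserving video generation through spatial-temporal decoupled reference conditioning, the TASS-RoPE mechanism, and identity objectives.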

AI中文摘要

身份保持视频生成（IPVG）旨在合成高保真视频，遵循文本提示同时忠实保持参考身份。尽管最近取得进展，现有IPVG方法仍难以平衡高级语义控制和低级身份保真度。为弥合这一差距，我们提出ST-DRC，一种有效的空间-时间解耦参考条件框架，用于身份保持的文本到视频生成。在框架层面，ST-DRC通过使用视频VAE编码参考图像并将其与噪声视频潜在变量拼接，执行潜在上下文特征注入，无需额外适配器即可访问丰富的低级身份细节。为将身份感知参考检索与外观复制分离，我们引入TASS-RoPE，一种时间相邻-空间偏移的RoPE方案，将参考令牌在时间上靠近视频序列但在空间上偏移，允许参考信息通过时空注意力流动，同时抑制像素级复制粘贴捷径。为进一步防止捷径学习并增强扩散目标中被稀释的身份监督，我们结合外观不变参考增强与面部引导身份目标，鼓励模型在颜色、姿态和布局变化下保持身份。在推理时，我们引入三流参考无分类器引导策略，独立控制文本遵循度和参考保真度。实验表明，ST-DRC在基于LTX-2.3的轻量级设计下，实现了强身份保持、提示对齐、时间一致性和视频质量。我们的方法在面部身份保持视频生成赛道中排名靠前，验证了空间-时间解耦参考条件的有效性。

英文摘要

Identity-preserving video generation (IPVG) aims to synthesize high-fidelity videos that follow text prompts while faithfully preserving a reference identity. Despite recent progress, existing IPVG methods still struggle to balance high-level semantic control and low-level identity fidelity. To bridge this gap, we propose ST-DRC, an effective Spatial-Temporal Decoupled Reference Conditioning framework for identity-preserving text-to-video generation. At the framework level, ST-DRC performs latent in-context feature injection by encoding the reference image with the video VAE and concatenating it with noisy video latents, enabling rich low-level identity details to be accessed without additional adapters. To separate identity-aware reference retrieval from appearance copying, we introduce TASS-RoPE, a Temporal-Adjacent Spatial-Shifted RoPE scheme that places reference tokens near the video sequence in time but shifts them in space, allowing reference information to flow through spatio-temporal attention while suppressing pixel-level copy-paste shortcuts. To further prevent shortcut learning and strengthen the otherwise diluted identity supervision in the diffusion objective, we combine appearance-invariant reference augmentation with face-guided identity objectives, encouraging the model to preserve identity under variations in color, pose, and layout. At inference time, we introduce a three-stream reference classifier-free guidance strategy that independently controls text adherence and reference fidelity. Experiments demonstrate that ST-DRC achieves strong identity preservation, prompt alignment, temporal consistency, and video quality with a lightweight design built on LTX-2.3. Our method ranks among the top submissions in the facial identity-preserving video generation track, validating the effectiveness of spatial-temporal decoupled reference conditioning.
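The abstract does not give the formula for the three-stream guidance. A common way to control two conditions independently, familiar from two-condition classifier-free guidance in image editing models, is to combine three denoiser outputs with two scales. The sketch below assumes that form, with text applied before the reference; the ordering, the names, and the scales are assumptions rather than the paper's definition.

```python
def three_stream_guidance(eps_uncond, eps_text, eps_text_ref, text_scale, ref_scale):
    """Combine three denoiser outputs with separate text and reference scales.

    eps_uncond:   prediction with neither text nor reference
    eps_text:     prediction with text only
    eps_text_ref: prediction with both text and reference
    """
    return [
        u + text_scale * (t - u) + ref_scale * (tr - t)
        for u, t, tr in zip(eps_uncond, eps_text, eps_text_ref)
    ]


# Toy vectors standing in for noise predictions.
uncond = [0.0, 1.0]
text_only = [2.0, 3.0]
text_and_ref = [4.0, 7.0]
print(three_stream_guidance(uncond, text_only, text_and_ref, 2.0, 1.5))
```

With both scales at 1 the result reduces to the fully conditioned prediction, and raising either scale strengthens only its own condition, which is the independence the abstract refers to.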
