arXivDaily: daily arXiv academic digest, updated Monday through Friday
2605.15199 2026-05-15 cs.CV cs.AI

EntityBench: Towards Entity-Consistent Long-Range Multi-Shot Video Generation

Ruozhen He, Meng Wei, Ziyan Yang, Vicente Ordonez

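AI summary: EntityBench is a benchmark for evaluating entity consistency in multi-shot video generation. It contains 140 episodes (2,491 shots) derived from real narrative media, spans easy, medium, and hard tiers, and explicitly tracks characters, objects, and locations across shots. It is paired with a three-pillar evaluation suite covering intra-shot quality, prompt-following alignment, and cross-shot consistency, with a "fidelity gate" that admits only accurate entity appearances into cross-shot scoring. The authors also propose EntityMem, a memory-augmented baseline that stores verified per-entity visual references before generation begins and achieves the highest character fidelity and presence among the methods evaluated.

Comments
Project page: https://catherine-r-he.github.io/EntityBench/
English abstract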

Multi-shot video generation extends single-shot generation to coherent visual narratives, yet maintaining consistent characters, objects, and locations across shots remains a challenge over long sequences. Existing evaluations typically use independently generated prompt sets with limited entity coverage and simple consistency metrics, making standardized comparison difficult. We introduce EntityBench, a benchmark of 140 episodes (2,491 shots) derived from real narrative media, with explicit per-shot entity schedules tracking characters, objects, and locations simultaneously across easy / medium / hard tiers of up to 50 shots, 13 cross-shot characters, 8 cross-shot locations, 22 cross-shot objects, and recurrence gaps spanning up to 48 shots. It is paired with a three-pillar evaluation suite that disentangles intra-shot quality, prompt-following alignment, and cross-shot consistency, with a fidelity gate that admits only accurate entity appearances into cross-shot scoring. As a baseline, we propose EntityMem, a memory-augmented generation system that stores verified per-entity visual references in a persistent memory bank before generation begins. Experiments show that cross-shot entity consistency degrades sharply with recurrence distance in existing methods, and that explicit per-entity memory yields the highest character fidelity (Cohen's d = +2.33) and presence among methods evaluated. Code and data are available at https://github.com/Catherine-R-He/EntityBench/.
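The effect size quoted above, Cohen's d, is a standardized mean difference. The sketch below assumes the common pooled-standard-deviation form; the abstract does not say which variant the authors used, and the scores are invented for illustration, not taken from the paper.

```python
import math
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d using a pooled sample standard deviation (assumed variant)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


# Hypothetical per-episode fidelity scores, made up for this example.
with_memory = [0.82, 0.78, 0.85, 0.80, 0.79]
without_memory = [0.70, 0.74, 0.68, 0.73, 0.71]
d = cohens_d(with_memory, without_memory)
print(f"Cohen's d = {d:.2f}")
```

A positive d means the first group scores higher on average, measured in units of the pooled standard deviation.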

2605.15198 2026-05-15 cs.CV cs.AI cs.CL

ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both

Ziyu Guo, Rain Liu, Xinyan Chen, Pheng-Ann Heng

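AI summary: This work proposes ATLAS, a visual reasoning framework that targets the execution latency of agentic methods and the limited task generalization of latent methods. A single discrete "functional token" serves as both an agentic operation and a latent visual reasoning unit; it requires no visual supervision and is compatible with standard SFT and RL training. The authors also introduce LA-GRPO to stabilize RL training, and report strong results on challenging benchmarks while maintaining interpretability.

Comments
Project Page: https://atlas-oneword.github.io Code: https://github.com/ZiyuGuo99/ATLAS
English abstract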

Visual reasoning, often interleaved with intermediate visual states, has emerged as a promising direction in the field. A straightforward approach is to directly generate images via unified models during reasoning, but this is computationally expensive and architecturally non-trivial. Recent alternatives include agentic reasoning through code or tool calls, and latent reasoning with learnable hidden embeddings. However, agentic methods incur context-switching latency from external execution, while latent methods lack task generalization and are difficult to train with autoregressive parallelization. To combine their strengths while mitigating their limitations, we propose ATLAS, a framework in which a single discrete 'word', termed as a functional token, serves both as an agentic operation and a latent visual reasoning unit. Each functional token is associated with an internalized visual operation, yet requires no visual supervision and remains a standard token in the tokenizer vocabulary, which can be generated via next-token prediction. This design avoids verbose intermediate visual content generation, while preserving compatibility with the vanilla scalable SFT and RL training, without architectural or methodological modifications. To further address the sparsity of functional tokens during RL, we introduce Latent-Anchored GRPO (LA-GRPO), which stabilizes the training by anchoring functional tokens with a statically weighted auxiliary objective, providing stronger gradient updates. Extensive experiments and analyses demonstrate that ATLAS achieves superior performance on challenging benchmarks while maintaining clear interpretability. We hope ATLAS offers a new paradigm inspiring future visual reasoning research.

2605.15196 2026-05-15 cs.CV cs.LG

RefDecoder: Enhancing Visual Generation with Conditional Video Decoding

Xiang Fan, Yuheng Wang, Bohan Fang, Zhongzheng Ren, Ranjay Krishna

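AI summary: This paper proposes RefDecoder, a reference-conditioned video decoder that aims to improve detail fidelity and structural consistency in visual generation. It injects a high-fidelity reference image signal into decoding via reference attention: a lightweight image encoder maps the reference frame into high-dimensional tokens, which are processed jointly with the denoised video latent tokens. Experiments report PSNR gains of up to +2.1 dB on several reconstruction benchmarks. RefDecoder can be swapped into existing video generation systems without additional fine-tuning, improving subject consistency, background consistency, and overall quality.

English abstract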

Video generation powers a vast array of downstream applications. However, while the de facto standard, i.e., latent diffusion models, typically employ heavily conditioned denoising networks, their decoders often remain unconditional. We observe that this architectural asymmetry leads to significant loss of detail and inconsistency relative to the input image. To address this, we argue that the decoder requires equal conditioning to preserve structural integrity. We introduce RefDecoder, a reference-conditioned video VAE decoder by injecting high-fidelity reference image signal directly into the decoding process via reference attention. Specifically, a lightweight image encoder maps the reference frame into the detail-rich high-dimensional tokens, which are co-processed with the denoised video latent tokens at each decoder up-sampling stage. We demonstrate consistent improvements across several distinct decoder backbones (e.g., Wan 2.1 and VideoVAE+), achieving up to +2.1dB PSNR over the unconditional baselines on the Inter4K, WebVid, and Large Motion reconstruction benchmarks. Notably, RefDecoder can be directly swapped into existing video generation systems without additional fine-tuning, and we report across-the-board improvements in subject consistency, background consistency, and overall quality scores on the VBench I2V benchmark. Beyond I2V, RefDecoder generalizes well to a wide range of visual generation tasks such as style transfer and video editing refinement.
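For readers unfamiliar with the metric, PSNR is a log-scaled function of mean squared error. The sketch below uses the standard definition on flat pixel lists and assumes an 8-bit peak value of 255; it is not the paper's evaluation code.

```python
import math


def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(reference) != len(reconstruction):
        raise ValueError("inputs must have the same number of pixels")
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


# Halving the per-pixel error quarters the MSE and adds about 6.02 dB.
coarse = psnr([0, 0, 0, 0], [10, 10, 10, 10])
fine = psnr([0, 0, 0, 0], [5, 5, 5, 5])
print(round(fine - coarse, 2))
```

Because the scale is logarithmic, a gain of 2.1 dB corresponds to dividing the MSE by 10^0.21, roughly 1.62.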

2605.15195 2026-05-15 cs.CV

VGGT-$Ω$

Jianyuan Wang, Minghao Chen, Shangzhan Zhang, Nikita Karaev, Johannes Schönberger, Patrick Labatut, Piotr Bojanowski, David Novotny, Andrea Vedaldi, Christian Rupprecht

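AI summary: This paper presents VGGT-$Ω$, an improved feed-forward reconstruction model that raises accuracy and efficiency for both static and dynamic scenes. By simplifying the architecture, introducing registers with register attention, and adding a self-supervised learning protocol, VGGT-$Ω$ uses only about 30% of its predecessor's GPU memory during training while achieving strong results on multiple benchmarks, for example improving camera estimation accuracy on Sintel by 77% over the previous best. The learned registers are also shown to improve vision-language-action models.

Comments
CVPR 2026 (Oral)
English abstract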

Recent feed-forward reconstruction models, such as VGGT, have proven competitive with traditional optimization-based reconstructors while also providing geometry-aware features useful for other tasks. Here, we show that the quality of these models scales predictably with model and data size. We do so by introducing VGGT-$Ω$, which substantially improves reconstruction accuracy, efficiency, and capabilities for both static and dynamic scenes. To enable training this model at an unprecedented scale, we introduce architectural changes that improve training efficiency, a high-quality data annotation pipeline that supports dynamic scenes, and a self-supervised learning protocol. We simplify VGGT's architecture by using a single dense prediction head with multi-task supervision and removing the expensive high-resolution convolutional layers. We also use registers to aggregate scene information into a compact representation and introduce register attention, which restricts inter-frame information exchange to these registers, in part replacing global attention. In this way, during training, VGGT-$Ω$ uses only about 30% of the GPU memory of its predecessor, allowing us to train with 15x more supervised data than prior work and to leverage vast amounts of unlabeled video data. VGGT-$Ω$ achieves strong results for reconstruction of static and dynamic scenes across multiple benchmarks, for example, improving over the previous best camera estimation accuracy on Sintel by 77%. We also show that the learned registers can improve vision-language-action models and support alignment with language, suggesting that reconstruction can be a powerful and scalable proxy task for spatial understanding. Project Page: http://vggt-omega.github.io/

2605.15193 2026-05-15 cs.CV

Aligning Latent Geometry for Spherical Flow Matching in Image Generation

Tuna Han Salih Meral, Kaan Oktay, Hidir Yesiltepe, Adil Kaan Akan, Pinar Yanardag

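AI summary: This work improves latent flow matching for image generation by aligning the geometry of the latent space. Standard methods transport Gaussian noise to variational autoencoder latents along straight Euclidean paths, but both endpoints concentrate in thin spherical shells and the straight path leaves them. Decomposing latents into radial and angular components, the authors find that perceptual and semantic content is carried mainly by direction. They therefore project data latents onto a fixed-radius sphere and replace linear interpolation with spherical linear interpolation, so that the paths stay on the sphere; under matched training this consistently improves class-conditional ImageNet-256 FID.

English abstract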

Latent flow matching for image generation usually transports Gaussian noise to variational autoencoder latents along linear paths. Both endpoints, however, concentrate in thin spherical shells, and a Euclidean chord leaves those shells even when preprocessing aligns their radii. By decomposing each latent token into radial and angular components, we show through component-swap probes that decoded perceptual and semantic content is carried predominantly by direction, with radius contributing much less. We therefore project data latents onto a fixed token radius, use the radial projection of Gaussian noise as the spherical prior, finetune the decoder with the encoder frozen, and replace linear interpolation with spherical linear interpolation. The resulting geodesic paths stay on the sphere at every timestep, and their velocity targets are purely angular by construction. Under matched training, the method consistently improves class-conditional ImageNet-256 FID across different image tokenizers, leaves the diffusion architecture unchanged, and requires no auxiliary encoder or representation-alignment objective.
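The geometric point can be checked on a toy example: a straight chord between two points on a sphere dips inside it, while spherical linear interpolation keeps the radius fixed. This is a minimal sketch with three-dimensional vectors and an arbitrary radius; the helper names are hypothetical and it is not the authors' implementation.

```python
import math


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def project_to_radius(v, radius):
    """Rescale v so that it lies on the sphere of the given radius."""
    n = norm(v)
    return [radius * x / n for x in v]


def lerp(a, b, t):
    """Straight-line (Euclidean chord) interpolation."""
    return [(1 - t) * x + t * y for x, y in zip(a, b)]


def slerp(a, b, t):
    """Spherical linear interpolation; assumes equal norms and non-antipodal inputs."""
    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))
    omega = math.acos(max(-1.0, min(1.0, cos_omega)))
    if omega < 1e-8:  # nearly parallel: the chord is already on the sphere
        return lerp(a, b, t)
    s = math.sin(omega)
    wa = math.sin((1 - t) * omega) / s
    wb = math.sin(t * omega) / s
    return [wa * x + wb * y for x, y in zip(a, b)]


radius = 2.0  # arbitrary value chosen for the example
noise = project_to_radius([1.0, 0.0, 0.0], radius)
data = project_to_radius([0.0, 3.0, 0.0], radius)
print(norm(lerp(noise, data, 0.5)))   # about 1.414: the chord leaves the shell
print(norm(slerp(noise, data, 0.5)))  # about 2.0: the geodesic stays on it
```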

2605.15190 2026-05-15 cs.CV

RAVEN: Real-time Autoregressive Video Extrapolation with Consistency-model GRPO

Yanzuo Lu, Ronglai Zuo, Jiankang Deng

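AI summary: This paper proposes RAVEN, a real-time autoregressive video extrapolation network that generates future video chunks from previously generated content. To address the mismatch between the history distributions seen in training and at inference, which degrades long-horizon quality, RAVEN repacks each self rollout during training into an interleaved sequence of clean historical endpoints and noisy denoising states, aligning training attention with inference-time extrapolation. The paper also introduces CM-GRPO, which treats a consistency sampling step as a conditional Gaussian transition and applies online reinforcement learning to it. Experiments show RAVEN outperforming recent causal video distillation baselines on several evaluation dimensions.

Comments
Project Page: https://yanzuo.lu/raven
English abstract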

Causal autoregressive video diffusion models support real-time streaming generation by extrapolating future chunks from previously generated content. Distilling such generators from high-fidelity bidirectional teachers yields competitive few-step models, yet a persistent gap between the history distributions encountered during training and those arising at inference constrains generation quality over long horizons. We introduce the Real-time Autoregressive Video Extrapolation Network (RAVEN), a training-time test framework that repacks each self rollout into an interleaved sequence of clean historical endpoints and noisy denoising states. This formulation aligns training attention with inference-time extrapolation and allows downstream chunk losses to supervise the history representations on which future predictions depend. We further propose Consistency-model Group Relative Policy Optimization (CM-GRPO), which reformulates a consistency sampling step as a conditional Gaussian transition and applies online Reinforcement Learning (RL) directly to this kernel, avoiding the Euler-Maruyama auxiliary process adopted in prior flow-model RL formulations. Experiments demonstrate that RAVEN surpasses recent causal video distillation baselines across quality, semantic, and dynamic degree evaluations, and that CM-GRPO provides further gains when combined with RAVEN.

2605.15188 2026-05-15 cs.LG cs.AI cs.CL

FutureSim: Replaying World Events to Evaluate Adaptive Agents

Shashwat Goel, Nikhil Chandak, Arvindh Arun, Ameya Prabhu, Steffen Staab, Moritz Hardt, Maksym Andriushchenko, Jonas Geiping

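AI summary: This paper presents FutureSim, a benchmark for evaluating how well adaptive AI agents forecast real-world events. It replays real news events in chronological order and tests agents' ability to predict events beyond their knowledge cutoff. Over a three-month period from January to March 2026, the best frontier agent reaches only 25% accuracy, and many agents have a worse Brier skill score than making no prediction at all. FutureSim offers a realistic setting for studying long-horizon adaptation, search, memory, and reasoning about uncertainty.

Comments
31 pages, 10 main
English abstract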

AI agents are being increasingly deployed in dynamic, open-ended environments that require adapting to new information as it arrives. To efficiently measure this capability for realistic use-cases, we propose building grounded simulations that replay real-world events in the order they occurred. We build FutureSim, where agents forecast world events beyond their knowledge cutoff while interacting with a chronological replay of the world: real news articles arriving and questions resolving over the simulated period. We evaluate frontier agents in their native harness, testing their ability to predict world events over a three-month period from January to March 2026. FutureSim reveals a clear separation in their capabilities, with the best agent's accuracy being 25%, and many having worse Brier skill score than making no prediction at all. Through careful ablations, we show how FutureSim offers a realistic setting to study emerging research directions like long-horizon test-time adaptation, search, memory, and reasoning about uncertainty. Overall, we hope our benchmark design paves the way to measure AI progress on open-ended adaptation spanning long time-horizons in the real world.
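To make "worse Brier skill score than making no prediction at all" concrete, the sketch below uses the standard binary Brier score and a skill score relative to a constant reference forecast. The abstract does not define the benchmark's exact scoring or reference, so the binary form and the 0.5 reference here are assumptions.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def brier_skill_score(probs, outcomes, reference_prob=0.5):
    """1 - BS / BS_ref. Zero matches the reference forecast, negative is worse.

    With reference_prob=0.5 (an assumed stand-in for "no prediction"),
    BS_ref is always 0.25 for binary outcomes.
    """
    bs = brier_score(probs, outcomes)
    bs_ref = brier_score([reference_prob] * len(outcomes), outcomes)
    return 1.0 - bs / bs_ref


outcomes = [1, 0, 1, 0]
print(brier_skill_score([0.9, 0.2, 0.7, 0.1], outcomes))  # positive: some skill
print(brier_skill_score([0.0, 1.0, 0.0, 1.0], outcomes))  # negative: confidently wrong
```

A confidently wrong forecaster is penalized more heavily than one that abstains at 0.5, which is how an agent can score below the no-prediction baseline.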

2605.15187 2026-05-15 cs.CV cs.GR cs.RO

Articraft: An Agentic System for Scalable Articulated 3D Asset Generation

Matt Zhou, Ruining Li, Xiaoyang Lyu, Zhaomou Song, Zhening Huang, Chuanxia Zheng, Christian Rupprecht, Andrea Vedaldi, Shangzhe Wu

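AI summary: This paper presents Articraft, an agentic system for generating articulated 3D assets at scale. It reduces asset generation to writing a program that builds the asset and uses a large language model to write that program, addressing the lack of large, diverse datasets. A domain-specific SDK and a harness with validation and structured feedback let the LLM write code that defines parts, composes geometry, specifies joints, and tests the result. The method produces higher-quality assets than state-of-the-art articulated-asset generators and general-purpose coding agents, and is used to build Articraft-10K, a curated dataset of over 10K assets spanning 245 categories, useful for model training and for downstream applications such as robotics simulation and virtual reality.

Comments
Project page: https://articraft3d.github.io/
English abstract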

A bottleneck in learning to understand articulated 3D objects is the lack of large and diverse datasets. In this paper, we propose to leverage large language models (LLMs) to close this gap and generate articulated assets at scale. We reduce the problem of generating an articulated 3D asset to that of writing a program that builds it. We then introduce a new agentic system, Articraft, that writes such programs automatically. We design a programmatic interface and harness to help the LLM do so effectively. The LLM writes code against a domain-specific SDK for defining parts, composing geometry, specifying joints, and writing tests to validate the resulting assets. The harness exposes a restricted workspace and interface to the LLM, validates the resulting assets, and returns structured feedback. In this way, the LLM is not distracted by details such as authoring a URDF file or managing a complex software environment. We show that this produces higher-quality assets than both state-of-the-art articulated-asset generators and general-purpose coding agents. Using Articraft, we build Articraft-10K, a curated dataset of over 10K articulated assets spanning 245 categories, and show its utility both for training models of articulated assets and in downstream applications such as robotics simulation and virtual reality.

2605.15185 2026-05-15 cs.CV cs.AI

Quantitative Video World Model Evaluation for Geometric-Consistency

Jiaxin Wu, Yihao Pi, Yinling Zhang, Yuheng Li, Xueyan Zou

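AI summary: This paper proposes PDI-Bench, a quantitative framework for auditing geometric consistency in generated videos. It obtains object-centric observations via segmentation and point tracking, lifts them to 3D with monocular reconstruction, and computes projective-geometry residuals along three failure dimensions: scale-depth alignment, 3D motion consistency, and 3D structural rigidity. The authors also build PDI-Dataset for systematic evaluation, revealing geometry-specific failure modes in existing video generators.

Comments
12 pages, 5 figures. Project page: https://pdi-bench.github.io/
English abstract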

Generative video models are increasingly studied as implicit world models, yet evaluating whether they produce physically plausible 3D structure and motion remains challenging. Most existing video evaluation pipelines rely heavily on human judgment or learned graders, which can be subjective and weakly diagnostic for geometric failures. We introduce PDI-Bench (Perspective Distortion Index), a quantitative framework for auditing geometric coherence in generated videos. Given a generated clip, we obtain object-centric observations via segmentation and point tracking (e.g., SAM 2, MegaSaM, and CoTracker3), lift them to 3D world-space coordinates via monocular reconstruction, and compute a set of projective-geometry residuals capturing three failure dimensions: scale-depth alignment, 3D motion consistency, and 3D structural rigidity. To support systematic evaluation, we build PDI-Dataset, covering diverse scenarios designed to stress these geometric constraints. Across state-of-the-art video generators, PDI reveals consistent geometry-specific failure modes that are not captured by common perceptual metrics, and provides a diagnostic signal for progress toward physically grounded video generation and physical world model. Our code and dataset can be found at https://pdi-bench.github.io/.

2605.15184 2026-05-15 cs.CL

Is Grep All You Need? How Agent Harnesses Reshape Agentic Search

Sahil Sen, Akhil Kasturi, Elias Lumer, Anmol Gulati, Vamse Kumar Subbiah

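AI summary: This paper studies how retrieval strategy (grep versus vector retrieval) interacts with agent architecture and tool-calling style in agentic search systems. Across two experiments, the authors compare grep and vector retrieval under different presentations of tool results and under increasing amounts of distracting context. Grep generally yields higher accuracy in the first experiment, but overall scores depend strongly on the agent harness and tool-calling style used.

English abstract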

Recent advances in Large Language Model (LLM) agents have enabled complex agentic workflows where models autonomously retrieve information, call tools, and reason over large corpora to complete tasks on behalf of users. Despite the growing adoption of retrieval-augmented generation (RAG) in agentic search systems, existing literature lacks a systematic comparison of how retrieval strategy choice interacts with agent architecture and tool-calling paradigm. Important practical dimensions, including how tool outputs are presented to the model and how performance changes when searches must cope with more irrelevant surrounding text, remain under-explored in agent loops. This paper reports an empirical study organized into two experiments. Experiment 1 compares grep and vector retrieval on a 116-question sample from LongMemEval, using a custom agent harness (Chronos) and provider-native CLI harnesses (Claude Code, Codex, and Gemini CLI), for both inline tool results and file-based tool results that the model reads separately. Experiment 2 compares grep-only and vector-only retrieval while progressively mixing in additional unrelated conversation history, so that each query is embedded in more distracting material alongside the passages that matter. Across Chronos and the provider CLIs, grep generally yields higher accuracy than vector retrieval in our comparisons in experiment 1; at the same time, overall scores still depend strongly on which harness and tool-calling style is used, even when the underlying conversation data are the same.
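The two retrieval styles being compared can be illustrated with a toy example. This sketch is only a stand-in: a regular-expression scan plays the role of grep, and bag-of-words cosine similarity substitutes for a learned embedding model. The function names and the conversation snippets are hypothetical and unrelated to Chronos or LongMemEval.

```python
import math
import re
from collections import Counter


def grep(pattern, documents):
    """Return (index, text) for every document matching the regex."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [(i, d) for i, d in enumerate(documents) if rx.search(d)]


def bow(text):
    """Bag-of-words vector; a crude substitute for a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def vector_search(query, documents, top_k=1):
    """Rank documents by cosine similarity to the query and keep the top_k."""
    q = bow(query)
    scored = sorted(((cosine(q, bow(d)), i) for i, d in enumerate(documents)), reverse=True)
    return [(i, documents[i]) for _, i in scored[:top_k]]


history = [
    "User: my flight to Lisbon leaves on March 3.",
    "User: I adopted a cat named Miso.",
    "User: remind me to buy oat milk.",
]
print(grep("lisbon", history))
print(vector_search("when does my flight to Lisbon leave", history))
```

Grep returns every exact match and nothing else, whereas the similarity search always returns its top-ranked items even when none is relevant; which behavior helps an agent more is the kind of question the study examines empirically.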

2605.15183 2026-05-15 cs.LG

When Are Two Networks the Same? Tensor Similarity for Mechanistic Interpretability

ML Nissen Gonzalez, Melwina Albuquerque, Laurence Wroe, Jacob Meyer Cohen, Logan Riggs Smith, Thomas Dooms

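AI summary: This paper asks how to determine whether two neural networks implement the same computation. For tensor-based models, it proposes a weight-based similarity metric, tensor similarity, that is invariant to weight-space symmetries, captures global functional equivalence, and accounts for cross-layer mechanisms. The metric tracks functional training dynamics with higher fidelity than existing metrics, and recasts measuring similarity and verifying faithfulness as an algebraic problem.

Comments
22 pages, 8 figures. Code: https://github.com/tdooms/tensor-similarity
English abstract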

Mechanistic interpretability aims to break models into meaningful parts; verifying that two such parts implement the same computation is a prerequisite. Existing similarity measures evaluate either empirical behaviour, leaving them blind to out-of-distribution mechanisms, or basis-dependent parameters, meaning they disregard weight-space symmetries. To address these issues for the class of tensor-based models, we introduce a weight-based metric, tensor similarity, that is invariant to such symmetries. This metric captures global functional equivalence and accounts for cross-layer mechanisms using an efficient recursive algorithm. Empirically, tensor similarity tracks functional training dynamics, such as grokking and backdoor insertion, with higher fidelity than existing metrics. This reduces measuring similarity and verifying faithfulness into a solved algebraic problem rather than one of empirical approximation.
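The weight-space symmetry problem can be seen in a small generic example: permuting the hidden units of a two-layer network changes the weight matrices but not the function. The sketch uses a ReLU MLP with arbitrary weights purely to illustrate the symmetry; it does not implement the paper's tensor similarity metric.

```python
def relu(x):
    return max(0.0, x)


def mlp(x, w1, w2):
    """Two-layer network: hidden = relu(w1 @ x), output = w2 . hidden."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    return sum(w * h for w, h in zip(w2, hidden))


# Arbitrary example weights (3 hidden units, 2 inputs).
w1 = [[1.0, -2.0], [0.5, 3.0], [-1.0, 1.0]]
w2 = [2.0, -1.0, 0.5]

# Reorder the hidden units consistently in both layers.
perm = [2, 0, 1]
w1_perm = [w1[i] for i in perm]
w2_perm = [w2[i] for i in perm]

x = [1.0, 2.0]
print(mlp(x, w1, w2), mlp(x, w1_perm, w2_perm))  # same output
print(w1 == w1_perm)                             # different parameters
```

A naive elementwise comparison of `w1` and `w1_perm` would report two different networks, which is the failure mode a symmetry-invariant metric is meant to avoid.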

2605.15182 2026-05-15 cs.CV

Warp-as-History: Generalizable Camera-Controlled Video Generation from One Training Video

Yifan Wang, Tong He

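AI summary: This paper proposes Warp-as-History, an interface for camera-controlled video generation that works without any training and improves further with finetuning on a single camera-annotated video. Camera-induced warps are turned into pseudo-history, with positional encoding aligned to the target frames and visible-token selection, and fed through the model's visual-history pathway to steer a frozen video generation model along a prescribed camera trajectory. Experiments show non-trivial zero-shot camera following, and lightweight LoRA finetuning further improves camera adherence, visual quality, and motion dynamics.

Comments
Project page: https://yyfz.github.io/warp-as-history/
English abstract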

Camera-controlled video generation has made substantial progress, enabling generated videos to follow prescribed viewpoint trajectories. However, existing methods usually learn camera-specific conditioning through camera encoders, control branches, or attention and positional-encoding modifications, which often require post-training on large-scale camera-annotated videos. Training-free alternatives avoid such post-training, but often shift the cost to test-time optimization or extra denoising-time guidance. We propose Warp-as-History, a simple interface that turns camera-induced warps into camera-warped pseudo-history with target-frame positional alignment and visible-token selection. Given a target camera trajectory, we construct camera-warped pseudo-history from past observations and feed it through the model's visual-history pathway. Crucially, we align its positional encoding with the target frames being denoised and remove warped-history tokens without valid source observations. Without any training, architectural modification, or test-time optimization, this interface reveals a non-trivial zero-shot capability of a frozen video generation model to follow camera trajectories. Moreover, lightweight offline LoRA finetuning on only one camera-annotated video further improves this capability and generalizes to unseen videos, improving camera adherence, visual quality, and motion dynamics without test-time optimization or target-video adaptation. Extensive experiments on diverse datasets confirm the effectiveness of our method.

2605.15181 2026-05-15 cs.CV

From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing

Anirudh Sundara Rajan, Krishna Kumar Singh, Yong Jae Lee

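AI summary: This work addresses abstract, multi-step instructions in open-ended image editing with a framework that tightly couples planning and execution. A planner generates structured atomic decompositions, an orchestrator selects tools and regions for each step, and a vision-language judge provides outcome-based rewards. The orchestrator is trained to maximize these rewards, and successful trajectories are used to refine the planner, yielding more coherent and reliable edits.

English abstract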

Modern image editing models produce realistic results but struggle with abstract, multi-step instructions (e.g., "make this advertisement more vegetarian-friendly"). Prior agent-based methods decompose such tasks but rely on handcrafted pipelines or teacher imitation, limiting flexibility and decoupling learning from actual editing outcomes. We propose an experiential framework for long-horizon image editing, where a planner generates structured atomic decompositions and an orchestrator selects tools and regions to execute each step. A vision-language judge provides outcome-based rewards for instruction adherence and visual quality. The orchestrator is trained to maximize these rewards, and successful trajectories are used to refine the planner. By tightly coupling planning with reward-driven execution, our approach yields more coherent and reliable edits than single-step or rule-based multi-step baselines.

2605.15179 2026-05-15 cs.LG cs.AI physics.comp-ph

Eradicating Negative Transfer in Multi-Physics Foundation Models via Sparse Mixture-of-Experts Routing

Ellwil Sharma, Arastu Sharma

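AI summary: This paper studies negative transfer in multi-physics foundation models, that is, the gradient conflict and unstable optimization that arise when disparate partial differential equation (PDE) regimes are co-trained. The authors propose Shodh-MoE, a sparsely activated mixture-of-experts (MoE) architecture: a physics-informed autoencoder produces compressed physical latents, and a soft-semantic router assigns localized latent patches to specialized expert subnetworks. Experiments report simultaneous convergence across both physical regimes while maintaining mass conservation.

Comments
5 pages, 4 figures
English abstract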

Scaling Scientific Machine Learning (SciML) toward universal foundation models is bottlenecked by negative transfer: the simultaneous co-training of disparate partial differential equation (PDE) regimes can induce gradient conflict, unstable optimization, and plasticity loss in dense neural operators. In particular, broadband open-channel fluid dynamics and boundary-dominated porous media flows impose incompatible spectral and geometric demands on a single dense parameter path. We introduce Shodh-MoE, a sparse-activated latent transformer architecture for multi-physics transport. Shodh-MoE operates on compressed 16^3 physical latents produced by a physics-informed autoencoder with an intra-tokenizer Helmholtz-style velocity parameterization, restricting decoded states to divergence-free velocity manifolds. The model guarantees exact mass conservation, achieving a physically verifiable velocity divergence of ~2.8 x 10^-10 (evaluated post-hoc in FP64) on 128^3 grids. A Top-1 soft-semantic router dynamically assigns localized latent patches to expert subnetworks, enabling specialized parameter paths for distinct physical mechanisms while preserving shared experts for universal symmetries. In a 20,000-step distributed pretraining run over mixed three-dimensional physical tensors, routing telemetry shows autonomous domain bifurcation: held-out validation tokens from the open-channel domain route exclusively to Expert 0, while porous-media tokens route exclusively to Expert 1. The model converges simultaneously across both regimes, achieving latent validation MSEs of 2.46 x 10^-5 and 9.76 x 10^-6, and decoded physical MSEs of 2.48 x 10^-6 and 1.76 x 10^-6. These results support sparse expert routing as a practical architectural mechanism for mitigating multi-physics interference in universal neural operators.
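The routing idea can be sketched generically: a router scores each token against every expert, the highest-scoring expert processes the token, and its output is scaled by the gate probability. This is a plain top-1 mixture-of-experts sketch under assumed toy weights; the details of Shodh-MoE's "soft-semantic" router and shared experts are not specified in the abstract and are not modeled here.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top1_route(token, router_weights, experts):
    """Send the token to the single highest-probability expert.

    Returns (expert index, expert output scaled by its gate probability).
    """
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    gates = softmax(logits)
    k = max(range(len(gates)), key=gates.__getitem__)
    return k, [gates[k] * y for y in experts[k](token)]


# Hypothetical router and experts, chosen only to make the routing visible.
router_weights = [[1.0, 0.0], [0.0, 1.0]]
experts = [
    lambda t: [2.0 * x for x in t],  # stand-in for one domain's expert
    lambda t: [-x for x in t],       # stand-in for the other domain's expert
]

print(top1_route([3.0, 0.0], router_weights, experts))  # routed to expert 0
print(top1_route([0.0, 3.0], router_weights, experts))  # routed to expert 1
```

Because only one expert runs per token, tokens from different regimes can follow separate parameter paths, which is the mechanism the abstract credits with reducing interference.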

2605.15178 2026-05-15 cs.CV

SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer

Haoyi Zhu, Haozhe Liu, Yuyang Zhao, Tian Ye, Junsong Chen, Jincheng Yu, Tong He, Song Han, Enze Xie

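AI summary: This paper presents SANA-WM, an efficient world model that generates high-fidelity, 720p, minute-long videos with precise camera control. Its core designs are hybrid linear attention, dual-branch camera control, a two-stage generation pipeline, and a robust annotation pipeline, which together preserve visual quality while improving training and inference efficiency. The model is efficient in data, training compute, and inference hardware, and on the authors' one-minute world-model benchmark it shows stronger action-following accuracy than prior open-source baselines along with higher throughput.

Comments
https://nvlabs.github.io/Sana/WM/
English abstract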

We introduce SANA-WM, an efficient 2.6B-parameter open-source world model natively trained for one-minute generation, synthesizing high-fidelity, 720p, minute-scale videos with precise camera control. SANA-WM achieves visual quality comparable to large-scale industrial baselines such as LingBot-World and HY-WorldPlay, while significantly improving efficiency. Four core designs drive our architecture: (1) Hybrid Linear Attention combines frame-wise Gated DeltaNet (GDN) with softmax attention for memory-efficient long-context modeling. (2) Dual-Branch Camera Control ensures precise 6-DoF trajectory adherence. (3) Two-Stage Generation Pipeline applies a long-video refiner to stage-1 outputs, improving quality and consistency across sequences. (4) Robust Annotation Pipeline extracts accurate metric-scale 6-DoF camera poses from public videos to yield high-quality, spatiotemporally consistent action labels. Driven by these designs, SANA-WM demonstrates remarkable efficiency across data, training compute, and inference hardware: it uses only $\sim$213K public video clips with metric-scale pose supervision, completes training in 15 days on 64 H100s, and generates each 60s clip on a single GPU; its distilled variant can be deployed on a single RTX 5090 with NVFP4 quantization to denoise a 60s 720p clip in 34s. On our one-minute world-model benchmark, SANA-WM demonstrates stronger action-following accuracy than prior open-source baselines and achieves comparable visual quality at $36\times$ higher throughput for scalable world modeling.

2605.15172 2026-05-15 cs.CR cs.CL

MetaBackdoor: Exploiting Positional Encoding as a Backdoor Attack Surface in LLMs

Rui Wen, Mark Russinovich, Andrew Paverd, Jun Sakuma, Ahmed Salem

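AI summary: This paper introduces MetaBackdoor, a new class of backdoor attacks that uses positional information in large language models as the trigger, activating the backdoor without modifying the input text. Even a simple length-based positional trigger can activate stealthy backdoor behavior, causing the model to disclose sensitive information or issue malicious tool calls once a length condition is met. The work expands the threat model for LLM backdoors by identifying positional encoding as a previously overlooked attack surface, which poses new challenges for defense design.

English abstract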

Backdoor attacks pose a serious security threat to large language models (LLMs), which are increasingly deployed as general-purpose assistants in safety- and privacy-critical applications. Existing LLM backdoors rely primarily on content-based triggers, requiring explicit modification of the input text. In this work, we show that this assumption is unnecessary and limiting. We introduce MetaBackdoor, a new class of backdoor attacks that exploits positional information as the trigger, without modifying textual content. Our key insight is that Transformer-based LLMs necessarily encode token positions to process ordered sequences. As a result, length-correlated positional structure is reflected in the model's internal computation and can be used as an effective non-content trigger signal. We demonstrate that even a simple length-based positional trigger is sufficient to activate stealthy backdoors. Unlike prior attacks, MetaBackdoor operates on visibly and semantically clean inputs and enables qualitatively new capabilities. We show that a backdoored LLM can be induced to disclose sensitive internal information, including proprietary system prompts, once a length condition is satisfied. We further demonstrate a self-activation scenario, where normal multi-turn interaction can move the conversation context into the trigger region and induce malicious tool-call behavior without attacker-supplied trigger text. In addition, MetaBackdoor is orthogonal to content-based backdoors and can be composed with them to create more precise and harder-to-detect activation conditions. Our results expand the threat model of LLM backdoors by revealing positional encoding as a previously overlooked attack surface. This challenges defenses that focus on detecting suspicious text and highlights the need for new defense strategies that explicitly account for positional triggers in modern LLM architectures.

2605.15171 2026-05-15 cs.CV cs.AI cs.LG

Evidential Reasoning Advances Interpretable Real-World Disease Screening

Chenyu Lian, Hong-Yu Zhou, Jing Qin

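AI summary: This paper proposes EviScreen, an evidential reasoning framework for interpretable disease screening that targets the limited interpretability and suboptimal performance of current medical-image screening models. It retrieves region-level evidence from historical cases through dual knowledge banks to provide retrospective interpretability, and reasons over both the current case and the retrieved evidence. Abnormality maps derived from contrastive retrieval improve localization interpretability. The method performs strongly on real-world disease screening benchmarks, with notably higher specificity at clinical-level recall.

Comments
ICML 2026
English abstract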

Disease screening is critical for early detection and timely intervention in clinical practice. However, most current screening models for medical images suffer from limited interpretability and suboptimal performance. They often lack effective mechanisms to reference historical cases or provide transparent reasoning pathways. To address these challenges, we introduce EviScreen, an evidential reasoning framework for disease screening that leverages region-level evidence from historical cases. The proposed EviScreen offers retrospection interpretability through regional evidence retrieved from dual knowledge banks. Using this evidential mechanism, the subsequent evidence-aware reasoning module makes predictions using both the current case and evidence from historical cases, thereby enhancing disease screening performance. Furthermore, rather than relying on post-hoc saliency maps, EviScreen enhances localization interpretability by leveraging abnormality maps derived from contrastive retrieval. Our method achieves superior performance on our carefully established benchmarks for real-world disease screening, yielding notably higher specificity at clinical-level recall. Code is publicly available at https://github.com/DopamineLcy/EviScreen.
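"Specificity at clinical-level recall" means fixing the decision threshold so that a required fraction of true cases is caught, then measuring how many healthy cases are correctly cleared. The sketch below shows one straightforward way to compute it; the scores, labels, and 0.75 recall target are invented for illustration, and the paper's actual recall level is not stated in the abstract.

```python
import math


def specificity_at_recall(scores, labels, target_recall):
    """Specificity at the highest threshold whose recall meets the target.

    A case is predicted positive when its score is >= the threshold.
    Returns (specificity, threshold).
    """
    positives = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    needed = max(1, math.ceil(target_recall * len(positives)))
    threshold = positives[needed - 1]
    true_negatives = sum(1 for s in negatives if s < threshold)
    return true_negatives / len(negatives), threshold


# Hypothetical model scores: first four are diseased, last four are healthy.
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.5, 0.2, 0.1]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
spec, thr = specificity_at_recall(scores, labels, target_recall=0.75)
print(spec, thr)
```

Raising the required recall forces a lower threshold, which typically lowers specificity; reporting specificity at a fixed recall makes models comparable at the same miss rate.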

2605.15168 2026-05-15 cs.CL cs.AI cs.LG stat.ML

Text Knows What, Tables Know When: Clinical Timeline Reconstruction via Retrieval-Augmented Multimodal Alignment

Sayantan Kumar, Shahriar Noroozizadeh, Juyong Kim, Jeremy C. Weiss

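AI summary: This study addresses the complementary temporal information in clinical text and structured electronic health records (EHR), proposing a retrieval-augmented multimodal alignment framework for reconstructing more precise clinical timelines. The method extracts key events from text to build a temporal scaffold, then calibrates it with temporal information from the structured data, improving timestamp accuracy. Experiments show that it significantly improves temporal concordance across multiple models while preserving event match rates, demonstrating the benefit of multimodal alignment for clinical trajectory reconstruction.

Details
Comments
Sayantan Kumar, Shahriar Noroozizadeh, Juyong Kim (authors contributed equally)
English abstract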

Reconstructing precise clinical timelines is essential for modeling patient trajectories and forecasting risk in complex, heterogeneous conditions like sepsis. While unstructured clinical narratives offer semantically rich and contextually complete descriptions of a patient's course, they often lack temporal precision and contain ambiguous event timing. Conversely, structured electronic health record (EHR) data provides precise temporal anchors but misses a substantial portion of clinically meaningful events. We introduce a retrieval-augmented multimodal alignment framework that bridges this gap to improve the temporal precision of absolute clinical timelines extracted from text. Our approach formulates timeline reconstruction as a graph-based multistep process: it first extracts central anchor events from narratives to build an initial temporal scaffold, places non-central events relative to this backbone, and then calibrates the timeline using retrieved structured EHR rows as external temporal evidence. Evaluated using instruction-tuned large language models on the i2m4 benchmark spanning MIMIC-III and MIMIC-IV, our multimodal pipeline consistently improves absolute timestamp accuracy (AULTC) and improves temporal concordance across nearly all evaluated models over unimodal text-only reconstruction, without compromising event match rates. Furthermore, our empirical gap analysis reveals that 34.8% of text-derived events are entirely absent from tabular records, demonstrating that aligning these modalities can produce a more temporally faithful and clinically informative reconstruction of patient trajectories than either source alone.

2605.15167 2026-05-15 cs.CV

Does Synthetic Layered Design Data Benefit Layered Design Decomposition?

Kam Man Wu, Haolin Yang, Qingyu Chen, Yihu Tang, Jingye Chen, Qifeng Chen

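AI summary: This paper investigates whether purely synthetic layered data can improve graphic design decomposition. Building on the state-of-the-art CLD framework, the authors construct a synthetic dataset, SynLayers, and use vision-language models to generate textual supervision and automate inference inputs. They find that purely synthetic data can outperform existing non-scalable datasets, that performance keeps improving as data volume grows, and that synthetic data allows layer-count distributions to be balanced. The study offers a scalable synthetic-data foundation for layered design editing systems.

Details
Comments
22 pages, 10 figures. Code is available at https://github.com/YangHaolin0526/SynLayers
English abstract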

Recent advances in image generation have made it easy to produce high-quality images. However, these outputs are inherently flattened, entangling foreground elements, background, and text within a fixed canvas. As a result, flexible post-generation editing remains challenging, revealing a clear last-mile gap toward practical usability. Existing approaches either rely on scarce proprietary layered assets or construct partially synthetic data from limited structural priors, and both strategies face fundamental challenges in scalability. In this work, we investigate whether pure synthetic layered data can improve graphic design decomposition. We assume that, in graphic design, effective decomposition does not require modeling inter-layer dependencies as precisely as in natural-image composition, since design elements are often intentionally arranged as modular and semantically separable components. Concretely, we conduct a data-centric study based on the CLD baseline, a state-of-the-art layer decomposition framework. Based on the baseline, we construct our own synthetic dataset, SynLayers, generate textual supervision using vision language models, and automate inference inputs with VLM-predicted bounding boxes. Our study reveals three key findings: (1) even training with purely synthetic data can outperform non-scalable alternatives such as the widely used PrismLayersPro dataset, demonstrating its viability as a scalable and effective substitute; (2) performance consistently improves with increased training data scale, while gains begin to saturate at around 50K samples; and (3) synthetic data enables balanced control over layer-count distributions, avoiding the layer-count imbalance commonly observed in real-world datasets. We hope this data-centric study encourages broader adoption of synthetic data as a practical foundation for layered design editing systems.

2605.15164 2026-05-15 cs.LG cs.AI

Position: Behavioural Assurance Cannot Verify the Safety Claims Governance Now Demands

Pratinav Seth, Vinay Kumar Sankarapu

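AI summary: This paper argues that current behavioural assurance methods cannot meet the safety verification demands of AI governance frameworks. These frameworks require verifying properties such as the absence of hidden objectives, resistance to loss of control, and bounded catastrophic capability, but existing methods can only observe model outputs and cannot verify latent representations or long-horizon behaviour. The paper introduces the concept of the "audit gap" to capture the mismatch between verification requirements and technical capability, and proposes a technical pivot that includes bounding the weight of behavioural evidence in legal text and introducing mechanistic verification methods.

Details
English abstract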

This position paper argues that behavioural assurance, even when carefully designed, is being asked to carry safety claims it cannot verify. AI governance frameworks enacted between 2019 and early 2026 require reviewable evidence of properties such as the absence of hidden objectives, resistance to loss-of-control precursors, and bounded catastrophic capability; current assurance methodologies (primarily behavioural evaluations and red-teaming) are epistemically limited to observable model outputs and cannot verify the latent representations or long-horizon agentic behaviours these frameworks presume to regulate. We formalize this structural mismatch as the audit gap, the divergence between required and achievable verification access, and introduce the concept of fragile assurance to describe cases where the evidential structure does not support the asserted safety claim. Through an analysis of a 21-instrument inventory, we identify an incentive gradient where geopolitical and industrial pressures systematically reward surface-level behavioral proxies over deep structural verification. Finally, we propose a technical pivot: bounding the weight of behavioral evidence in legal text and extending voluntary pre-deployment access with mechanistic-evidence classes, specifically linear probes, activation patching, and before/after-training comparisons.

2605.15155 2026-05-15 cs.LG cs.AI cs.CL

Self-Distilled Agentic Reinforcement Learning

Zhengxi Lu, Zhiyuan Yao, Zhuowen Han, Zi-Han Wang, Jinyang Wu, Qi Gu, Xunliang Cai, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen

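AI summary: This paper studies how to improve the performance of reinforcement learning (RL)-based large language model agents on multi-turn tasks. To address the overly sparse supervision signal of conventional RL on long-horizon tasks, the authors propose Self-Distilled Agentic Reinforcement Learning (SDAR), which adds dense token-level guidance from a teacher branch as an auxiliary objective alongside the primary RL optimization. A gating mechanism strengthens distillation on teacher-endorsed positive tokens and softly attenuates negative teacher rejections, yielding substantial gains on several benchmarks while avoiding the instability of conventional approaches.

Details
English abstract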

Reinforcement learning (RL) has emerged as a central paradigm for post-training LLM agents, yet its trajectory-level reward signal provides only coarse supervision for long-horizon interaction. On-Policy Self-Distillation (OPSD) complements RL by introducing dense token-level guidance from a teacher branch augmented with privileged context. However, transferring OPSD to multi-turn agents proves problematic: compounding multi-turn instability destabilizes supervision, while skill-conditioned privileged guidance requires asymmetric treatment, since negative teacher rejections may arise from imperfect skills retrieval or utilization. We introduce SDAR (Self-Distilled Agentic Reinforcement Learning), which treats OPSD as a gated auxiliary objective while keeping RL as the primary optimization backbone. SDAR maps detached token-level signals into a sigmoid gate, strengthening distillation on teacher-endorsed positive-gap tokens and softly attenuating negative teacher rejections. Across the Qwen2.5 and Qwen3 families on ALFWorld, WebShop, and Search-QA, SDAR substantially improves over GRPO (+9.4% on ALFWorld, +7.0% on Search-QA, +10.2% on WebShop-Acc), avoids the instability of naive GRPO+OPSD, and consistently outperforms hybrid RL-OPSD baselines across model scales.
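The gating idea in the SDAR abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything here is an assumption made for illustration: the gap is taken to be the teacher log-probability minus the student log-probability for each token, and the names `gate_weights`, `combined_loss`, `temperature`, and `aux_coef` are hypothetical. Plain Python has no autograd, so "detached" simply means the gates are treated as constants.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate_weights(gaps, temperature=1.0):
    """Map per-token gaps to weights in (0, 1).

    Assumption: gap = teacher_logp - student_logp for each token.
    A positive gap (teacher endorses the token) gives a weight above 0.5;
    a negative gap (teacher rejection) is softly attenuated toward 0.
    """
    return [sigmoid(g / temperature) for g in gaps]

def combined_loss(rl_loss, distill_losses, gaps, aux_coef=0.1, temperature=1.0):
    """RL loss plus a gated auxiliary distillation term (illustrative only).

    distill_losses: hypothetical per-token distillation losses.
    The gates act as fixed weights, mirroring the "detached" signals.
    """
    gates = gate_weights(gaps, temperature)
    weighted = [w * d for w, d in zip(gates, distill_losses)]
    aux = sum(weighted) / len(weighted)
    return rl_loss + aux_coef * aux

if __name__ == "__main__":
    print(gate_weights([2.0, 0.0, -2.0]))
    print(combined_loss(1.0, [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]))
```

Keeping the distillation term behind a small coefficient reflects the abstract's framing of RL as the primary objective and distillation as an auxiliary one.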

2605.15154 2026-05-15 stat.ML cs.LG

RoSHAP: A Distributional Framework and Robust Metric for Stable Feature Attribution

Lanxin Xiang, Liang Shi, Youhui Ye, Boyu Jiang, Dawei Zhou, Feng Guo

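AI summary: This paper proposes RoSHAP, a distributional framework and robust metric for more stable feature attribution. Built on SHAP values, the method models the distribution of feature attribution scores using bootstrap resampling and kernel density estimation, and proves that under mild regularity conditions the aggregated score is asymptotically Gaussian, which reduces the computational cost of distribution estimation. RoSHAP improves the stability of feature rankings, outperforms standard single-run attribution methods in simulations and real-data experiments, and achieves predictive performance comparable to full-feature models with fewer features.

Details
English abstract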

Feature attribution analysis is critical for interpreting machine learning models and supporting reliable data-driven decisions. However, feature attribution measures often exhibit stochastic variation: different train-test splits, random seeds, or model-fitting procedures can produce substantially different attribution values and feature rankings. This paper proposes a framework for incorporating the stochastic nature of feature attribution and a robust attribution metric, RoSHAP, for stable feature ranking based on the SHAP metric. The proposed framework models the distribution of feature attribution scores and estimates it through bootstrap resampling and kernel density estimation. We show that, under mild regularity conditions, the aggregated feature attribution score is asymptotically Gaussian, which greatly reduces the computational cost of distribution estimation. RoSHAP summarizes the distribution of SHAP values into a robust feature-ranking criterion that simultaneously rewards features that are active, strong, and stable. Through simulations and real-data experiments, the proposed framework and RoSHAP outperform standard single-run attribution measures in identifying signal features. In addition, models built using RoSHAP-selected features achieve predictive performance comparable to full-feature models while using substantially fewer predictors. The proposed RoSHAP approach improves the stability and interpretability of machine learning models, enabling reliable and consistent insights for analysis.

2605.15138 2026-05-15 cs.LG cs.CL cs.ET

Forgetting That Sticks: Quantization-Permanent Unlearning via Circuit Attribution

Saisab Sadhu, Pratinav Seth, Vinay Kumar Sankarapu

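AI summary: This paper studies the permanence of machine unlearning in quantized language models, noting that standard evaluations measure forgetting in full precision and so fail to reflect deployment, where models are quantized first. It finds that 4-bit quantization can weaken or even reverse unlearning, because parameter updates are far smaller than the quantization bin width and therefore do not change the quantized model. The authors propose MANSU, which combines causal circuit attribution with constrained projection to achieve meaningful forgetting and structural erasure, and introduce the CAD metric for verification. Experiments show strong performance across multiple models and tasks.

Details
English abstract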

Standard unlearning evaluations measure behavioral suppression in full precision, immediately after training, despite every deployed language model being quantized first. Recent work has shown that 4-bit post-training quantization can reverse machine unlearning; we show this is not a tuning artefact but a systematic dual failure: gradient-based methods that achieve meaningful forgetting lose it under compression, while methods that survive quantization barely change the model. Both failures trace to the same root cause: across all baselines, per-parameter updates lie 47-828x below the NF4 quantization bin width; updates diffused across billions of parameters cannot clear quantization bin boundaries, a consequence we formalize as a sparsity-permanence tradeoff. We present MANSU (Mechanistic-Aligned Null-Space Unlearning), which resolves both modes by combining causal circuit attribution to isolate the minimal forget-set subgraph, circuit-restricted null-space projection with a diagonal-Fisher retain bound, and a per-parameter magnitude floor guaranteeing quantization survival by construction. We additionally introduce Circuit Attribution Divergence (CAD), a mechanistic verification metric distinguishing structural erasure from behavioral suppression, a distinction existing metrics cannot make. Across multiple model families and hazard benchmarks, MANSU is the first method to jointly satisfy all four properties with margin on each (meaningful forgetting, retain preservation, non-positive PTQ gap, and structural erasure), while gradient-based baselines recover up to +0.05 accuracy under compression.
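The bin-width argument in the abstract above can be made concrete with a small sketch. It uses a uniform round-to-nearest grid as a simplifying assumption (NF4 bins are non-uniform), and the function names are hypothetical. The factor 47 is the smallest ratio quoted in the abstract.

```python
def quantize_index(w, bin_width):
    """Index of the nearest grid point on a uniform grid (a simplification of NF4)."""
    return round(w / bin_width)

def update_survives(w, delta, bin_width):
    """True if applying delta moves w into a different quantization bin."""
    return quantize_index(w + delta, bin_width) != quantize_index(w, bin_width)

if __name__ == "__main__":
    bin_width = 0.1
    w = 0.30  # a weight sitting near the centre of its bin

    # An update 47x smaller than the bin width is erased by quantization.
    print(update_survives(w, bin_width / 47, bin_width))

    # An update larger than half the bin width, applied at the bin centre,
    # crosses a boundary and is still visible after quantization.
    print(update_survives(w, 0.06, bin_width))
```

A sub-bin update can still flip a weight that happens to sit next to a boundary, so the sketch shows the typical case rather than a guarantee. The per-parameter magnitude floor described in the abstract targets exactly this failure mode.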

2605.15133 2026-05-15 cs.LG

Causal Foundation Models with Continuous Treatments

Christopher Stith, Medha Barath, Vahid Balazadeh, Jesse C. Cresswell, Rahul G. Krishnan

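AI summary: This paper studies the estimation of causal effects of continuous treatment variables from observational data, an important but less explored area of causal inference. The authors present the first causal foundation model for the continuous treatment setting: they design a new prior over data-generating processes and use a transformer with in-context learning to predict causal effects across many tasks without additional training. The model performs strongly on individual treatment-response curve reconstruction, outperforming causal models designed specifically for those tasks.

Details
Comments
22 pages, 9 figures
English abstract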

Causal inference, estimating causal effects from observational data, is a fundamental tool in many disciplines. Of particular importance across a variety of domains is the continuous treatment setting, where the variable of intervention has a continuous range. This setting is far less explored and represents a substantial shift from the binary treatment setting, with models needing to represent effects across a continuum of treatment values. In this paper, we present the first causal foundation model for the continuous treatment setting. Our model meta-learns the ability to predict causal effects across a wide variety of unseen tasks without additional training or fine-tuning. First, we design a novel prior over data-generating processes with continuous treatment variables in order to generate a rich causal training corpus. We then train a transformer to reconstruct individual treatment-response curves given only observational data, leveraging in-context learning to amortize expensive Bayesian posterior inference. Our model achieves state-of-the-art performance on individual treatment-response curve reconstruction tasks compared to causal models which are trained specifically for those tasks.

2605.15132 2026-05-15 cs.AI cs.DC cs.MA

APWA: A Distributed Architecture for Parallelizable Agentic Workflows

Evan Rose, Tushin Mallick, Matthew D. Laws, Cristina Nita-Rotaru, Alina Oprea

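AI summary: This paper proposes APWA, a distributed architecture for efficiently processing highly parallelizable agentic workloads. By decomposing tasks into non-interfering subproblems, it allows processing on independent resources without cross-communication, overcoming the reasoning, coordination, and computational scaling bottlenecks of conventional multi-agent systems. Experiments show that APWA can dynamically decompose complex queries into workflows that execute in parallel and scales effectively as task size grows, outperforming existing systems.

Details
Comments
25 pages, 2 figures, 14 tables
English abstract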

Autonomous multi-agent systems based on large language models (LLMs) have demonstrated remarkable abilities in independently solving complex tasks in a wide breadth of application domains. However, these systems hit critical reasoning, coordination, and computational scaling bottlenecks as the size and complexity of their tasks grow. These limitations hinder multi-agent systems from achieving high-throughput processing for highly parallelizable tasks, despite the availability of parallel computing and reasoning primitives in the underlying LLMs. We introduce the Agent-Parallel Workload Architecture (APWA), a distributed multi-agent system architecture designed for the efficient processing of heavily parallelizable agentic workloads. APWA facilitates parallel execution by decomposing workflows into non-interfering subproblems that can be processed using independent resources without cross-communication. It supports heterogeneous data and parallel processing patterns, and it accommodates tasks from a wide breadth of domains. In our evaluation, we demonstrate that APWA can dynamically decompose complex queries into parallelizable workflows and scales on larger tasks in settings where prior systems fail completely.
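The decomposition pattern described in the abstract above, independent subproblems with no cross-communication, can be sketched generically as follows. This is not APWA's implementation: `solve_subproblem` is a hypothetical stand-in for an agent call, and a thread pool stands in for independent resources.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(items, n_chunks):
    """Partition items into at most n_chunks contiguous, non-overlapping slices."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]

def solve_subproblem(chunk):
    # Hypothetical stand-in for one agent working on one independent subproblem.
    return sum(x * x for x in chunk)

def run_parallel(items, n_workers=4):
    """Solve each chunk independently, then merge the partial results."""
    chunks = split_into_chunks(items, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Workers share no state, so no coordination is needed until the merge.
        partials = list(pool.map(solve_subproblem, chunks))
    return sum(partials)

if __name__ == "__main__":
    print(run_parallel(list(range(10)), n_workers=3))
```

Because the chunks do not overlap and the workers never communicate, the merged result matches a sequential run.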

2605.15131 2026-05-15 cs.LG

Natural Synthesis: Outperforming Reactive Synthesis Tools with Large Reasoning Models

Frederik Schmitt, Matthias Cosler, Niklas Metzger, Julian Siber, Vladimir Krsmanovic, Mohamed Ghanem, Bernd Finkbeiner

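AI summary: This paper studies reactive synthesis, the challenge of automatically constructing hardware circuits from logical specifications. The authors propose a neuro-symbolic approach that couples large reasoning models with model checkers and iteratively repairs the generated Verilog implementation using symbolic feedback; it surpasses the best existing tools in the annual synthesis competition and handles parameterized systems, a problem known to be undecidable. They also introduce an autoformalization step that shifts the specification task from temporal logic to natural language and, using a dataset of natural-language specifications, show that synthesis from natural language performs comparably to synthesis from formal specifications.

Details
English abstract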

Reactive synthesis, the problem of automatically constructing a hardware circuit from a logical specification, is a long-standing challenge in formal verification. It is elusive for two reasons: It is algorithmically hard, and writing formal specifications by hand is notoriously difficult. In this paper, we tackle both sides of the problem. For the algorithmic side, we present a neuro-symbolic approach to reactive synthesis that couples large reasoning models with model checkers to iteratively repair a synthesized Verilog implementation via sound symbolic feedback. Our approach solves more benchmarks than the best dedicated tools in the annual synthesis competition and extends to constructing parameterized systems, a problem known to be undecidable. On the specification side, we introduce an autoformalization step that shifts the specification task from temporal logic to natural language by introducing a hand-authored dataset of natural-language specifications for evaluation. We demonstrate performance comparable to that of starting from formal specifications, establishing natural synthesis as a viable end-to-end workflow.

2605.15128 2026-05-15 cs.CV cs.CL cs.IR

MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory

Minghao Guo, Qingyue Jiao, Zeru Shi, Yihao Quan, Boxuan Zhang, Danrui Li, Liwei Che, Wujiang Xu, Shilong Liu, Zirui Liu, Mubbasir Kapadia, Vladimir Pavlovic, Jiang Liu, Mengdi Wang, Yiyu Shi, Dimitris N. Metaxas, Ruixiang Tang

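AI summary: MemEye is a visual-centric evaluation framework for multimodal agent memory, designed to address the gap in how existing methods evaluate the preservation of, and reasoning over, visual evidence in long-term memory. It evaluates along two dimensions, the granularity of visual evidence and how that evidence is used, builds a new benchmark spanning 8 life-scenario tasks, and uses ablation-driven validation gates to assess reasoning structure and visual necessity. Experiments show that current mainstream models still fall short in preserving fine-grained visual details and reasoning about temporal state changes, highlighting the key role of evidence routing, temporal tracking, and detail extraction in long-term multimodal memory.

Details
Comments
46 pages, 15 figures
English abstract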

Long-term agent memory is increasingly multimodal, yet existing evaluations rarely test whether agents preserve the visual evidence needed for later reasoning. In prior work, many visually grounded questions can be answered using only captions or textual traces, allowing answers to be inferred without preserving the fine-grained visual evidence. Meanwhile, harder cases that require reasoning over changing visual states are largely absent. Therefore, we introduce MemEye, a framework that evaluates memory capabilities from two dimensions: one measures the granularity of decisive visual evidence (from scene-level to pixel-level evidence), and the other measures how retrieved evidence must be used (from single evidence to evolutionary synthesis). Under this framework, we construct a new benchmark across 8 life-scenario tasks, with ablation-driven validation gates for assessing answerability, shortcut resistance, visual necessity, and reasoning structure. By evaluating 13 memory methods across 4 VLM backbones, we show that current architectures still struggle to preserve fine-grained visual details and reason about state changes over time. Our findings show that long-term multimodal memory depends on evidence routing, temporal tracking, and detail extraction.

2605.15127 2026-05-15 cs.HC cs.AI

Understanding How International Students in the U.S. Are Using Conversational AI to Support Cross-Cultural Adaptation

Laleh Nourian, Anisa Callis, Stephanie Patterson, Jadeline Miao, Jamison Heard, Garreth W. Tigwell

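AI summary: This paper studies how international students in the U.S. use conversational AI to support cross-cultural adaptation. Through a survey and interviews, it reveals the usage patterns, motivations, and limitations of AI tools for students facing adaptation challenges. AI is seen as a "first-aid tool" for immediate problems, but students also hope it can develop into a long-term support companion. The study offers recommendations for designing AI support systems better suited to international students' needs.

Details
Comments
33 pages, single column. 4 figures, 9 tables
English abstract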

Moving to a new culture and adapting to a new life, as an international student, can be a stressful experience. In the US, international students face unique overlapping challenges, yet the current support ecosystem, including university support systems and informal social networks, remains largely fragmented. While conversational AI has emerged as a tool used by many (e.g., generative AI chatbots like ChatGPT and Google Gemini), we do not have a clear understanding of how international students adopt and perceive these technologies as support tools. We conducted a survey study (n=60) to map the relationship between international students' challenges and AI adoption patterns, followed by an interview study with 14 participants to identify the underlying motivations and boundaries of use. Our findings show that AI is perceived as a first-aid tool for immediate challenges; however, there is an interest in transforming AI from a tool for short-term help into a long-term support companion. By identifying where and how AI can provide long-term support, and where it is insufficient, we contribute recommendations for creating AI-powered support tailored to the unique needs of international students.

2605.15122 2026-05-15 cs.RO cs.LG cs.SY eess.SY

CoCo-InEKF: State Estimation with Learned Contact Covariances in Dynamic, Contact-Rich Scenarios

Michael Baumgartner, David Müller, Agon Serifi, Ruben Grandia, Espen Knoop, Markus Gross, Moritz Bächer

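AI summary: This paper presents CoCo-InEKF, a new filtering method for robust state estimation of legged robots in dynamic, contact-rich scenarios. It replaces conventional binary contact states with learned contact covariances, capturing complex conditions such as partial contact and directional slippage more precisely. Experiments on a bipedal robot show a better accuracy-efficiency tradeoff for velocity estimation and improved filter consistency, supporting the stable execution of difficult motions such as dancing and complex ground interactions.

Details
Comments
RSS 2026
English abstract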

Robust state estimation for highly dynamic motion of legged robots remains challenging, especially in dynamic, contact-rich scenarios. Traditional approaches often rely on binary contact states that fail to capture the nuances of partial contact or directional slippage. This paper presents CoCo-InEKF, a differentiable invariant extended Kalman filter that utilizes continuous contact velocity covariances instead of binary contact states. These learned covariances allow the method to dynamically modulate contact confidence, accounting for more nuanced conditions ranging from firm contact to directional slippage or no contact. To predict these covariances for a set of predefined contact candidate points, we employ a lightweight neural network trained end-to-end using a state-error loss. This approach eliminates the need for heuristic ground-truth contact labels. In addition, we propose an automated contact candidate selection procedure and demonstrate that our method is insensitive to their exact placement. Experiments on a bipedal robot demonstrate a superior accuracy-efficiency tradeoff for linear velocity estimation, as well as improved filter consistency compared to baseline methods. This enables the robust execution of challenging motions, including dancing and complex ground interactions, both in simulation and in the real world.
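The effect of a continuous measurement covariance, as opposed to a binary contact flag, can be seen in a scalar Kalman update. This is only a one-dimensional sketch of the standard Kalman equations, not the invariant EKF or the learned network from the paper. The variable names and the numeric covariances are illustrative assumptions.

```python
def kalman_update(x, P, z, R):
    """Scalar Kalman measurement update.

    x, P: prior state estimate and its variance.
    z, R: measurement and its variance (here, the contact-derived
          velocity measurement and its assumed covariance).
    Returns the posterior estimate, posterior variance, and gain.
    """
    K = P / (P + R)            # Kalman gain: trust in the measurement
    x_new = x + K * (z - x)    # pull the estimate toward the measurement
    P_new = (1.0 - K) * P      # uncertainty shrinks in proportion to the gain
    return x_new, P_new, K

if __name__ == "__main__":
    x_prior, P_prior, z = 1.0, 1.0, 0.0

    # Firm contact: small covariance, the measurement dominates.
    print(kalman_update(x_prior, P_prior, z, R=0.01))

    # Slipping or no contact: large covariance, the measurement is nearly ignored.
    print(kalman_update(x_prior, P_prior, z, R=100.0))
```

A binary contact state corresponds to switching between a fixed small `R` and dropping the measurement entirely, while a predicted covariance lets the filter move smoothly between those extremes.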

2605.15116 2026-05-15 cs.CV

DriveCtrl: Conditioned Sim-to-Real Driving Video Generation

Haonan Zhao, Yiting Wang, Jingkun Chen, Valentina Donzella, Thomas Bashford-Rogers, Kurt Debattista

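AI summary: Training autonomous driving systems requires large amounts of labelled driving video, but the significant domain gap between simulated data and real scenes limits its practical use. This paper presents DriveCtrl, a depth-conditioned, controllable sim-to-real driving video generation framework that uses a structure-aware adapter to generate visually realistic, temporally coherent driving videos while preserving scene layout and motion patterns. It also introduces a data generation pipeline supporting multiple conditioning signals and a dedicated evaluation metric, DVRS. Experiments show that it outperforms existing methods in realism, temporal consistency, and perception task performance, narrowing the gap between simulated and real driving video.

Details
English abstract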

Large-scale labelled driving video data is essential for training autonomous driving systems. Although simulation offers scalable and fully annotated data, the domain gap between synthetic and real-world driving videos significantly limits its utility for downstream deployment. Existing video generation methods are not well-suited for this task, as they fail to simultaneously preserve scene structure, object dynamics, temporal consistency, and visual realism, all of which are critical for maintaining annotation validity in generated data. In this paper, we present DriveCtrl, a depth-conditioned controllable sim-to-real video generation framework for realistic driving video synthesis. Built upon a pretrained video foundation model, DriveCtrl introduces a structure-aware adapter that enables depth-guided generation while preserving the scene layout and motion patterns of the source simulation, producing temporally coherent driving videos that remain aligned with the original simulated sequences. We further introduce a scalable data generation pipeline that transforms simulator videos into realistic driving footage matching the visual style of a target real-world dataset. The pipeline supports three conditioning signals: structural depth, reference-dataset style, and text prompts, while preserving frame-level annotations for downstream perception tasks. To better assess this task, we propose a driving-domain-specific knowledge-informed evaluation metric called Driving Video Realism Score (DVRS) that assesses the realism of generated videos. Experiments demonstrate that DriveCtrl consistently outperforms the base model and competing alternatives in realism, temporal quality, and perception task performance, substantially narrowing the sim-to-real gap for driving video generation.