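arXivDaily: a daily digest of academic papers, synced with the full arXiv feed, with AI-generated summaries and translations. It covers artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and other fields.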

2605.15199 2026-05-15 cs.CV cs.AI

EntityBench: Towards Entity-Consistent Long-Range Multi-Shot Video Generation

Ruozhen He, Meng Wei, Ziyan Yang, Vicente Ordonez

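AI summary: EntityBench is a benchmark for evaluating entity consistency in multi-shot video generation. It contains 140 episodes (2,491 shots in total) drawn from real narrative media, spans scenarios of varying difficulty, and explicitly tracks the continuity of characters, objects, and locations across shots. The benchmark introduces a three-part evaluation suite that scores per-shot quality, prompt alignment, and cross-shot consistency, and uses a "fidelity gate" so that only accurately rendered entities count toward the cross-shot score. The authors also propose EntityMem, a memory-augmented generation method that stores a visual reference for each entity before generation and substantially improves cross-shot entity consistency.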

2605.15198 2026-05-15 cs.CV cs.AI cs.CL

ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both

Ziyu Guo, Rain Liu, Xinyan Chen, Pheng-Ann Heng

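AI summary: This work proposes ATLAS, a visual reasoning framework that addresses the computational overhead and limited task generalization of earlier approaches. In ATLAS, a single discrete "functional token" serves as both an agentic operation and a latent visual reasoning unit; it requires no visual supervision and is compatible with standard SFT and RL training pipelines. The authors also introduce LA-GRPO to stabilize RL training. Experiments show that ATLAS performs strongly on multiple benchmarks while avoiding verbose intermediate visual generation and remaining interpretable.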

Details

Comments: Project Page: https://atlas-oneword.github.io Code: https://github.com/ZiyuGuo99/ATLAS

Abstract

Visual reasoning, often interleaved with intermediate visual states, has emerged as a promising direction in the field. A straightforward approach is to directly generate images via unified models during reasoning, but this is computationally expensive and architecturally non-trivial. Recent alternatives include agentic reasoning through code or tool calls, and latent reasoning with learnable hidden embeddings. However, agentic methods incur context-switching latency from external execution, while latent methods lack task generalization and are difficult to train with autoregressive parallelization. To combine their strengths while mitigating their limitations, we propose ATLAS, a framework in which a single discrete 'word', termed as a functional token, serves both as an agentic operation and a latent visual reasoning unit. Each functional token is associated with an internalized visual operation, yet requires no visual supervision and remains a standard token in the tokenizer vocabulary, which can be generated via next-token prediction. This design avoids verbose intermediate visual content generation, while preserving compatibility with the vanilla scalable SFT and RL training, without architectural or methodological modifications. To further address the sparsity of functional tokens during RL, we introduce Latent-Anchored GRPO (LA-GRPO), which stabilizes the training by anchoring functional tokens with a statically weighted auxiliary objective, providing stronger gradient updates. Extensive experiments and analyses demonstrate that ATLAS achieves superior performance on challenging benchmarks while maintaining clear interpretability. We hope ATLAS offers a new paradigm inspiring future visual reasoning research.

2605.15196 2026-05-15 cs.CV cs.LG

RefDecoder: Enhancing Visual Generation with Conditional Video Decoding

Xiang Fan, Yuheng Wang, Bohan Fang, Zhongzheng Ren, Ranjay Krishna

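AI summary: This paper proposes RefDecoder, a reference-conditioned video decoder that improves detail fidelity and structural consistency in visual generation. It injects a high-fidelity reference image signal during decoding: a reference attention mechanism encodes the reference image into high-dimensional features and processes them jointly with the denoised video latents to improve output quality. Experiments show that RefDecoder significantly improves PSNR on multiple benchmarks and can be plugged into existing video generation systems without additional fine-tuning, improving subject consistency, background consistency, and overall quality.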

2605.15195 2026-05-15 cs.CV

VGGT-$Ω$

Jianyuan Wang, Minghao Chen, Shangzhan Zhang, Nikita Karaev, Johannes Schönberger, Patrick Labatut, Piotr Bojanowski, David Novotny, Andrea Vedaldi, Christian Rupprecht

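AI summary: This paper presents VGGT-$Ω$, an improved feed-forward reconstruction model that raises reconstruction accuracy and efficiency for both static and dynamic scenes. By simplifying the architecture, introducing register tokens with register attention, and adding a self-supervised learning protocol, VGGT-$Ω$ uses only about 30% of its predecessor's GPU memory during training while improving performance. It achieves strong results on multiple benchmarks, for example improving on the previous best camera estimation accuracy on Sintel by 77%. The authors also show that the learned registers can improve vision-language-action models.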

Details

Comments: CVPR 2026 (Oral)

Abstract

Recent feed-forward reconstruction models, such as VGGT, have proven competitive with traditional optimization-based reconstructors while also providing geometry-aware features useful for other tasks. Here, we show that the quality of these models scales predictably with model and data size. We do so by introducing VGGT-$Ω$, which substantially improves reconstruction accuracy, efficiency, and capabilities for both static and dynamic scenes. To enable training this model at an unprecedented scale, we introduce architectural changes that improve training efficiency, a high-quality data annotation pipeline that supports dynamic scenes, and a self-supervised learning protocol. We simplify VGGT's architecture by using a single dense prediction head with multi-task supervision and removing the expensive high-resolution convolutional layers. We also use registers to aggregate scene information into a compact representation and introduce register attention, which restricts inter-frame information exchange to these registers, in part replacing global attention. In this way, during training, VGGT-$Ω$ uses only about 30% of the GPU memory of its predecessor, allowing us to train with 15x more supervised data than prior work and to leverage vast amounts of unlabeled video data. VGGT-$Ω$ achieves strong results for reconstruction of static and dynamic scenes across multiple benchmarks, for example, improving over the previous best camera estimation accuracy on Sintel by 77%. We also show that the learned registers can improve vision-language-action models and support alignment with language, suggesting that reconstruction can be a powerful and scalable proxy task for spatial understanding. Project Page: http://vggt-omega.github.io/
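
The abstract describes register attention only at a high level: inter-frame information exchange is restricted to register tokens. One way to picture this is as a boolean attention mask. The sketch below is an illustration under assumptions that are not taken from the paper: tokens are laid out frame by frame with registers first, a query may attend to every token in its own frame, and across frames it may attend only to register tokens. The function name `register_attention_mask` is hypothetical.

```python
def register_attention_mask(num_frames, num_registers, num_patches):
    """Return mask[q][k] == True where query token q may attend to key token k."""
    # Assumed token layout: frame by frame, registers first, then patches.
    tokens = [
        (frame, slot < num_registers)
        for frame in range(num_frames)
        for slot in range(num_registers + num_patches)
    ]
    # Assumed rule: attend within the same frame, or to any register token.
    return [
        [q_frame == k_frame or k_is_register for (k_frame, k_is_register) in tokens]
        for (q_frame, _) in tokens
    ]


mask = register_attention_mask(num_frames=2, num_registers=1, num_patches=2)
allowed = sum(sum(row) for row in mask)  # number of permitted query-key pairs
total = len(mask) ** 2                   # pairs under full global attention
print(allowed, total)
```

With many frames and few registers per frame, the number of permitted cross-frame pairs grows far more slowly than under global attention, which is consistent with the memory savings the abstract reports, although the actual mechanism in the paper may differ from this sketch.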

2605.15193 2026-05-15 cs.CV

Aligning Latent Geometry for Spherical Flow Matching in Image Generation

Tuna Han Salih Meral, Kaan Oktay, Hidir Yesiltepe, Adil Kaan Akan, Pinar Yanardag

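AI summary: This work targets latent flow matching for image generation and improves sample quality by aligning the transport path with the geometry of the latent space. The authors observe that conventional methods transport Gaussian noise to VAE latents along Euclidean (straight-line) paths, which leave the thin spherical shell on which the latent distribution concentrates. Decomposing latents into radial and angular components, they find that perceptual and semantic information is carried mainly by direction. They therefore project data latents onto a sphere of fixed radius and replace linear interpolation with spherical linear interpolation (slerp), so that generation paths stay on the sphere, which significantly improves image quality.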
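
The geometric point can be checked with a few lines of arithmetic. The sketch below is a toy illustration, not the authors' implementation: it uses 3-dimensional vectors and a hypothetical shell radius of 2.0, and the helper names (`project_to_sphere`, `lerp`, `slerp`) are assumptions. It compares the midpoint of a straight-line path with the midpoint of a slerp path between two points on the same sphere.

```python
import math


def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def project_to_sphere(v, radius):
    """Rescale v so that it lies on the sphere of the given radius."""
    n = norm(v)
    return [radius * x / n for x in v]


def lerp(a, b, t):
    """Straight-line (Euclidean) interpolation."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


def slerp(a, b, t):
    """Spherical linear interpolation between two vectors of equal norm."""
    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))
    omega = math.acos(max(-1.0, min(1.0, cos_omega)))
    sin_omega = math.sin(omega)
    if sin_omega < 1e-8:
        # Nearly parallel or antipodal endpoints: the formula is
        # numerically ill-conditioned, so fall back to lerp.
        return lerp(a, b, t)
    w_a = math.sin((1.0 - t) * omega) / sin_omega
    w_b = math.sin(t * omega) / sin_omega
    return [w_a * x + w_b * y for x, y in zip(a, b)]


radius = 2.0  # hypothetical shell radius, chosen for illustration
noise = project_to_sphere([1.0, 0.0, 0.0], radius)
latent = project_to_sphere([0.0, 1.0, 0.0], radius)

mid_lerp = lerp(noise, latent, 0.5)
mid_slerp = slerp(noise, latent, 0.5)
print(norm(mid_lerp))   # shorter than the radius: the path cuts inside the shell
print(norm(mid_slerp))  # equal to the radius up to rounding
```

The straight-line midpoint falls inside the sphere, while the slerp midpoint keeps the same norm as the endpoints, which is the property the summary attributes to the spherical path.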

2605.15190 2026-05-15 cs.CV

RAVEN: Real-time Autoregressive Video Extrapolation with Consistency-model GRPO

Yanzuo Lu, Ronglai Zuo, Jiankang Deng

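AI summary: This paper proposes RAVEN, a real-time autoregressive video extrapolation network that generates future video chunks on the fly from previously generated content. To address the long-horizon quality degradation caused by the mismatch between the history distributions seen in training and inference, RAVEN reformulates the self-rollout during training as an interleaved sequence of clean history endpoints and noisy denoising states, aligning training-time attention with inference-time extrapolation. The paper also introduces CM-GRPO, which is built on conditional Gaussian transitions and optimizes the consistency sampling steps with online reinforcement learning. Experiments show that RAVEN outperforms existing causal video distillation methods on several evaluation metrics.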

2605.15188 2026-05-15 cs.LG cs.AI cs.CL

FutureSim: Replaying World Events to Evaluate Adaptive Agents

Shashwat Goel, Nikhil Chandak, Arvindh Arun, Ameya Prabhu, Steffen Staab, Moritz Hardt, Maksym Andriushchenko, Jonas Geiping

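AI summary: This paper introduces FutureSim, a benchmark platform for evaluating how well adaptive AI agents forecast real-world events. It replays real news events in chronological order and tests agents on predicting events that occur after their knowledge cutoff. Experiments show that existing frontier agents achieve low forecasting accuracy for March, with the best reaching only 25%, which indicates that current models still struggle with long-horizon adaptation and reasoning under uncertainty. FutureSim provides a realistic and reliable testbed for research on long-horizon adaptation, search, memory, and uncertainty reasoning.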

2605.15187 2026-05-15 cs.CV cs.GR cs.RO

Articraft: An Agentic System for Scalable Articulated 3D Asset Generation

Matt Zhou, Ruining Li, Xiaoyang Lyu, Zhaomou Song, Zhening Huang, Chuanxia Zheng, Christian Rupprecht, Andrea Vedaldi, Shangzhe Wu

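AI summary: This paper presents Articraft, an agentic system for generating articulated 3D assets at scale. It recasts asset generation as program writing and has large language models write the code, sidestepping the current lack of large, diverse datasets. Articraft provides a dedicated programming interface and verification mechanisms so that the language model can efficiently produce code covering part definitions, geometry composition, joint setup, and test-based validation, yielding high-quality articulated 3D assets. Experiments show that the approach outperforms state-of-the-art generators in quality. The authors use it to build a high-quality dataset of 10,000 samples across 245 object categories, for training and for applications such as robot simulation and virtual reality.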

2605.15185 2026-05-15 cs.CV cs.AI

Quantitative Video World Model Evaluation for Geometric-Consistency

Jiaxin Wu, Yihao Pi, Yinling Zhang, Yuheng Li, Xueyan Zou

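AI summary: This paper proposes PDI-Bench, a quantitative evaluation framework for detecting geometric inconsistencies in generated videos. It obtains object-centric observations through segmentation and point tracking, lifts them into 3D with monocular reconstruction, and computes projective-geometry residuals along three failure dimensions: scale-depth alignment, 3D motion consistency, and structural rigidity. The authors also build PDI-Dataset for systematic evaluation of the geometric properties of generated videos, revealing shortcomings in the physical plausibility of current generative models.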

2605.15184 2026-05-15 cs.CL

Is Grep All You Need? How Agent Harnesses Reshape Agentic Search

Sahil Sen, Akhil Kasturi, Elias Lumer, Anmol Gulati, Vamse Kumar Subbiah

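AI summary: This paper studies how retrieval strategies (such as grep and vector retrieval) interact with the agent harness and the tool-calling style in agentic search systems. In two experiments, the authors compare grep and vector retrieval under different ways of presenting tool results and in the presence of distractors. They find that grep performs better in most cases, but overall effectiveness depends heavily on the agent harness and tool-calling style used. The study provides empirical evidence for choosing retrieval strategies in agentic search systems.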

2605.15183 2026-05-15 cs.LG

When Are Two Networks the Same? Tensor Similarity for Mechanistic Interpretability

ML Nissen Gonzalez, Melwina Albuquerque, Laurence Wroe, Jacob Meyer Cohen, Logan Riggs Smith, Thomas Dooms

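AI summary: This paper asks how to determine whether two neural networks implement the same computational mechanism. It proposes a tensor-based similarity measure that is invariant to weight-space symmetries, captures global functional equivalence, and accounts for cross-layer mechanisms. Compared with existing methods, the measure tracks functional training dynamics more precisely, and it turns similarity measurement and faithfulness verification into algebraic problems, improving the accuracy and reliability of mechanistic interpretability.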

2605.15182 2026-05-15 cs.CV

Warp-as-History: Generalizable Camera-Controlled Video Generation from One Training Video

Yifan Wang, Tong He

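AI summary: This paper proposes Warp-as-History, a method that aims to generate videos with controllable camera trajectories from a single training video without requiring additional training. It converts camera-induced image warps into pseudo-history, combines this with position alignment to the target frames and visible-token selection, and feeds the result directly to the video generation model to steer it along the specified camera path. Experiments show good camera-trajectory following in the zero-shot setting, and lightweight fine-tuning further improves video quality and motion consistency.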

2605.15181 2026-05-15 cs.CV

From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing

Anirudh Sundara Rajan, Krishna Kumar Singh, Yong Jae Lee

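AI summary: This work tackles abstract, multi-step instructions in open-ended image editing and proposes a framework that tightly couples planning and execution. Its core components are a planner that decomposes an instruction into atomic steps, an orchestrator that selects editing tools and regions, and a reward mechanism based on vision-language judgments that guides the editing process. The orchestrator is optimized through reward-driven execution, and successful trajectories are fed back to improve the planner, resulting in more coherent and reliable edits.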

2605.15179 2026-05-15 cs.LG cs.AI physics.comp-ph

Eradicating Negative Transfer in Multi-Physics Foundation Models via Sparse Mixture-of-Experts Routing

Ellwil Sharma, Arastu Sharma

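AI summary: This paper studies how to remove negative transfer in multi-physics foundation models, that is, the gradient conflict and unstable optimization that arise when disparate partial differential equation (PDE) regimes are trained together. The authors propose Shodh-MoE, a sparsely activated mixture-of-experts (MoE) architecture. A physics-informed autoencoder produces compressed physical latents, and a Top-1 soft-semantic router assigns localized latent patches from different physical mechanisms to specialized expert subnetworks, enabling efficient and stable multi-physics modeling. Experiments show that the model conserves mass while converging simultaneously on both the open-channel and porous-media regimes.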

Details

Comments: 5 pages, 4 figures

Abstract

Scaling Scientific Machine Learning (SciML) toward universal foundation models is bottlenecked by negative transfer: the simultaneous co-training of disparate partial differential equation (PDE) regimes can induce gradient conflict, unstable optimization, and plasticity loss in dense neural operators. In particular, broadband open-channel fluid dynamics and boundary-dominated porous media flows impose incompatible spectral and geometric demands on a single dense parameter path. We introduce Shodh-MoE, a sparse-activated latent transformer architecture for multi-physics transport. Shodh-MoE operates on compressed 16^3 physical latents produced by a physics-informed autoencoder with an intra-tokenizer Helmholtz-style velocity parameterization, restricting decoded states to divergence-free velocity manifolds. The model guarantees exact mass conservation, achieving a physically verifiable velocity divergence of ~2.8 x 10^-10 (evaluated post-hoc in FP64) on 128^3 grids. A Top-1 soft-semantic router dynamically assigns localized latent patches to expert subnetworks, enabling specialized parameter paths for distinct physical mechanisms while preserving shared experts for universal symmetries. In a 20,000-step distributed pretraining run over mixed three-dimensional physical tensors, routing telemetry shows autonomous domain bifurcation: held-out validation tokens from the open-channel domain route exclusively to Expert 0, while porous-media tokens route exclusively to Expert 1. The model converges simultaneously across both regimes, achieving latent validation MSEs of 2.46 x 10^-5 and 9.76 x 10^-6, and decoded physical MSEs of 2.48 x 10^-6 and 1.76 x 10^-6. These results support sparse expert routing as a practical architectural mechanism for mitigating multi-physics interference in universal neural operators.
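
The abstract does not spell out how the Top-1 router works, so the following is only a generic sketch of top-1 gating as commonly used in sparse mixture-of-experts layers, not Shodh-MoE itself. The linear router, the toy experts, and the names `top1_route` and `moe_forward` are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top1_route(token, router_weights):
    """Pick the single highest-probability expert for one token."""
    # Assumed router: one linear score per expert.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    expert = max(range(len(probs)), key=probs.__getitem__)
    return expert, probs[expert]


def moe_forward(tokens, router_weights, experts):
    """Send each token to its top-1 expert and scale the output by the gate."""
    outputs, assignments = [], []
    for token in tokens:
        idx, gate = top1_route(token, router_weights)
        outputs.append([gate * y for y in experts[idx](token)])
        assignments.append(idx)
    return outputs, assignments


# Toy setup: two experts and two tokens that resemble different "regimes".
router_weights = [[1.0, 0.0], [0.0, 1.0]]
experts = [
    lambda x: [2.0 * v for v in x],  # hypothetical expert 0
    lambda x: [-v for v in x],       # hypothetical expert 1
]
tokens = [[2.0, 0.0], [0.0, 3.0]]

outputs, assignments = moe_forward(tokens, router_weights, experts)
print(assignments)
```

Because only one expert runs per token, tokens from different regimes can follow separate parameter paths, which is the mechanism the abstract credits for reducing interference between the two flow domains.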

2605.15178 2026-05-15 cs.CV

SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer

Haoyi Zhu, Haozhe Liu, Yuyang Zhao, Tian Ye, Junsong Chen, Jincheng Yu, Tong He, Song Han, Enze Xie

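AI summary: This paper presents SANA-WM, an efficient world model that generates minute-scale, high-fidelity 720p video with precise camera control. Through core designs including hybrid linear attention, dual-branch camera control, a two-stage generation pipeline, and a robust annotation pipeline, it substantially improves training and inference efficiency while preserving visual quality. Experiments show that SANA-WM needs less data, training time, and hardware than existing open-source models, and that it achieves higher action-following accuracy and generation throughput on a one-minute world-modeling benchmark.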

2605.15172 2026-05-15 cs.CR cs.CL

MetaBackdoor: Exploiting Positional Encoding as a Backdoor Attack Surface in LLMs

Rui Wen, Mark Russinovich, Andrew Paverd, Jun Sakuma, Ahmed Salem

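AI summary: This paper proposes MetaBackdoor, a new backdoor attack that uses positional encoding in large language models as the trigger, activating the backdoor without modifying the input text. The study finds that position-based triggers can effectively activate stealthy backdoor behavior, causing the model to leak sensitive information or perform malicious actions once a specific length condition is met. The work expands the threat model for LLM backdoors, exposes positional encoding as a previously overlooked attack surface, and poses new challenges for defense design.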

Details

Abstract

Backdoor attacks pose a serious security threat to large language models (LLMs), which are increasingly deployed as general-purpose assistants in safety- and privacy-critical applications. Existing LLM backdoors rely primarily on content-based triggers, requiring explicit modification of the input text. In this work, we show that this assumption is unnecessary and limiting. We introduce MetaBackdoor, a new class of backdoor attacks that exploits positional information as the trigger, without modifying textual content. Our key insight is that Transformer-based LLMs necessarily encode token positions to process ordered sequences. As a result, length-correlated positional structure is reflected in the model's internal computation and can be used as an effective non-content trigger signal. We demonstrate that even a simple length-based positional trigger is sufficient to activate stealthy backdoors. Unlike prior attacks, MetaBackdoor operates on visibly and semantically clean inputs and enables qualitatively new capabilities. We show that a backdoored LLM can be induced to disclose sensitive internal information, including proprietary system prompts, once a length condition is satisfied. We further demonstrate a self-activation scenario, where normal multi-turn interaction can move the conversation context into the trigger region and induce malicious tool-call behavior without attacker-supplied trigger text. In addition, MetaBackdoor is orthogonal to content-based backdoors and can be composed with them to create more precise and harder-to-detect activation conditions. Our results expand the threat model of LLM backdoors by revealing positional encoding as a previously overlooked attack surface. This challenges defenses that focus on detecting suspicious text and highlights the need for new defense strategies that explicitly account for positional triggers in modern LLM architectures.

2605.15171 2026-05-15 cs.CV cs.AI cs.LG

Evidential Reasoning Advances Interpretable Real-World Disease Screening

Chenyu Lian, Hong-Yu Zhou, Jing Qin

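AI summary: This paper proposes EviScreen, an interpretable disease screening framework based on evidential reasoning that addresses the limited interpretability and performance of current medical image screening models. It retrieves region-level evidence from historical cases and uses dual knowledge banks for retrospective explanation, improving transparency and diagnostic accuracy. Anomaly maps produced by contrastive retrieval further improve localization interpretability. Experiments show strong results on real-world disease screening benchmarks, in particular a marked gain in specificity at clinical-level recall.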

2605.15168 2026-05-15 cs.CL cs.AI cs.LG stat.ML

Text Knows What, Tables Know When: Clinical Timeline Reconstruction via Retrieval-Augmented Multimodal Alignment

Sayantan Kumar, Shahriar Noroozizadeh, Juyong Kim, Jeremy C. Weiss

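AI summary: This study addresses the complementary temporal information in clinical text and structured electronic health records (EHR), proposing a retrieval-augmented multimodal alignment framework for reconstructing more precise clinical timelines. It extracts key events from text to build a temporal scaffold, then calibrates the timeline with temporal information from structured data, improving timestamp accuracy. Experiments show improved temporal concordance across nearly all evaluated models without lowering event match rates, demonstrating the benefit of multimodal alignment for clinical trajectory reconstruction.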

Details

Comments: Sayantan Kumar, Shahriar Noroozizadeh, Juyong Kim (authors contributed equally)

Abstract

Reconstructing precise clinical timelines is essential for modeling patient trajectories and forecasting risk in complex, heterogeneous conditions like sepsis. While unstructured clinical narratives offer semantically rich and contextually complete descriptions of a patient's course, they often lack temporal precision and contain ambiguous event timing. Conversely, structured electronic health record (EHR) data provides precise temporal anchors but misses a substantial portion of clinically meaningful events. We introduce a retrieval-augmented multimodal alignment framework that bridges this gap to improve the temporal precision of absolute clinical timelines extracted from text. Our approach formulates timeline reconstruction as a graph-based multistep process: it first extracts central anchor events from narratives to build an initial temporal scaffold, places non-central events relative to this backbone, and then calibrates the timeline using retrieved structured EHR rows as external temporal evidence. Evaluated using instruction-tuned large language models on the i2m4 benchmark spanning MIMIC-III and MIMIC-IV, our multimodal pipeline consistently improves absolute timestamp accuracy (AULTC) and improves temporal concordance across nearly all evaluated models over unimodal text-only reconstruction, without compromising event match rates. Furthermore, our empirical gap analysis reveals that 34.8% of text-derived events are entirely absent from tabular records, demonstrating that aligning these modalities can produce a more temporally faithful and clinically informative reconstruction of patient trajectories than either source alone.

2605.15167 2026-05-15 cs.CV

Does Synthetic Layered Design Data Benefit Layered Design Decomposition?

Kam Man Wu, Haolin Yang, Qingyu Chen, Yihu Tang, Jingye Chen, Qifeng Chen

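AI summary: This paper investigates whether purely synthetic layered data can improve graphic design decomposition. Building on the state-of-the-art CLD framework, the authors construct a synthetic dataset, SynLayers, use vision-language models to generate textual supervision, and automate inference inputs with VLM-predicted bounding boxes. They find that training on purely synthetic data can outperform existing non-scalable datasets, that performance keeps improving as data scale grows (with gains beginning to saturate at around 50K samples), and that synthetic data allows balanced control over layer-count distributions. The study offers a scalable synthetic-data foundation for layered design editing systems.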

Details

Comments: 22 pages, 10 figures. Code is available at https://github.com/YangHaolin0526/SynLayers

Abstract

Recent advances in image generation have made it easy to produce high-quality images. However, these outputs are inherently flattened, entangling foreground elements, background, and text within a fixed canvas. As a result, flexible post-generation editing remains challenging, revealing a clear last-mile gap toward practical usability. Existing approaches either rely on scarce proprietary layered assets or construct partially synthetic data from limited structural priors. However, both strategies face fundamental challenges in scalability. In this work, we investigate whether pure synthetic layered data can improve graphic design decomposition. We make the assumption that, in graphic design, effective decomposition does not require modeling inter-layer dependencies as precisely as in natural-image composition, since design elements are often intentionally arranged as modular and semantically separable components. Concretely, we conduct a data-centric study based on CLD baseline, which is a state-of-the-art layer decomposition framework. Based on the baseline, we construct our own synthetic dataset, SynLayers, generate textual supervision using vision language models, and automate inference inputs with VLM-predicted bounding boxes. Our study reveals three key findings: (1) even training with purely synthetic data can outperform non-scalable alternatives such as the widely used PrismLayersPro dataset, demonstrating its viability as a scalable and effective substitute; (2) performance consistently improves with increased training data scale, while gains begin to saturate at around 50K samples; and (3) synthetic data enables balanced control over layer-count distributions, avoiding the layer-count imbalance commonly observed in real-world datasets. We hope this data-centric study encourages broader adoption of synthetic data as a practical foundation for layered design editing systems.

2605.15164 2026-05-15 cs.LG cs.AI

Position: Behavioural Assurance Cannot Verify the Safety Claims Governance Now Demands

Pratinav Seth, Vinay Kumar Sankarapu

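AI summary: This position paper argues that current behavioural assurance methods cannot meet the safety verification demands of AI governance frameworks. Those frameworks call for verifying properties such as the absence of hidden objectives, resistance to loss of control, and bounds on catastrophic capabilities, yet existing methods only observe model outputs and cannot verify latent representations or long-horizon behaviour. The paper introduces the notion of an "audit gap" to describe the mismatch between verification demands and technical capability, and recommends a technical shift that includes limiting the weight given to behavioural evidence in legal text and adopting mechanistic verification.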

2605.15155 2026-05-15 cs.LG cs.AI cs.CL

Self-Distilled Agentic Reinforcement Learning

Zhengxi Lu, Zhiyuan Yao, Zhuowen Han, Zi-Han Wang, Jinyang Wu, Qi Gu, Xunliang Cai, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen

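AI summary: This paper studies how to improve reinforcement learning (RL) for large language model agents on multi-turn tasks. To address the overly sparse supervision that conventional RL provides on long-horizon tasks, the authors propose Self-Distilled Agentic Reinforcement Learning (SDAR), which adds dense token-level guidance from a teacher branch as an auxiliary objective alongside the main RL objective. A gating mechanism strengthens distillation on positive tokens that the teacher endorses and softly attenuates the teacher's negative rejections. SDAR significantly improves performance on multiple benchmarks and avoids the instability of conventional approaches.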

2605.15154 2026-05-15 stat.ML cs.LG

RoSHAP: A Distributional Framework and Robust Metric for Stable Feature Attribution

Lanxin Xiang, Liang Shi, Youhui Ye, Boyu Jiang, Dawei Zhou, Feng Guo

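AI summary: This paper proposes RoSHAP, a distributional framework and robust metric for more stable feature attribution. Building on SHAP values, it models the distribution of feature attribution scores with bootstrap resampling and kernel density estimation, and proves that, under mild regularity conditions, the aggregated values are asymptotically Gaussian, which lowers the computational cost of distribution estimation. RoSHAP improves the stability of feature rankings, outperforms conventional single-run attribution methods in simulated and real-data experiments, and matches the predictive performance of full-feature models while using fewer features.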

2605.15138 2026-05-15 cs.LG cs.CL cs.ET

Forgetting That Sticks: Quantization-Permanent Unlearning via Circuit Attribution

Saisab Sadhu, Pratinav Seth, Vinay Kumar Sankarapu

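AI summary: This paper studies whether machine unlearning in language models survives quantization. It points out that conventional methods evaluate forgetting at full precision, which does not reflect real deployments where models are quantized first. The study finds that 4-bit quantization weakens or even reverses unlearning, because the parameter updates are far smaller than the quantization bin width and therefore do not change the quantized model. The authors propose MANSU, which combines causal circuit attribution with constrained projection to achieve meaningful forgetting and structural deletion, and introduce the CAD metric for verification. Experiments show strong results across multiple models and tasks.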
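
The claim that small updates disappear under quantization can be reproduced with a toy example. The sketch below assumes a symmetric, round-to-nearest 4-bit quantizer with a single shared scale, which is a simplification and not necessarily the scheme studied in the paper; the weights and the function names `quantization_scale` and `to_codes` are made up for illustration.

```python
def quantization_scale(weights, bits=4):
    """Bin width of a symmetric uniform quantizer fitted to the weights."""
    levels = 2 ** (bits - 1) - 1  # 7 positive levels for 4 bits
    return max(abs(w) for w in weights) / levels


def to_codes(weights, scale, bits=4):
    """Round each weight to the nearest integer code, clipped to the range."""
    levels = 2 ** (bits - 1) - 1
    return [max(-levels, min(levels, round(w / scale))) for w in weights]


original = [0.70, -0.32, 0.11, 0.24]   # toy full-precision weights
scale = quantization_scale(original)   # bin width, roughly 0.1 here

# A small "unlearning" update: each change is about a tenth of a bin.
small_update = [0.70, -0.33, 0.12, 0.23]
# A large update: one weight moves by two full bins.
large_update = [0.70, -0.32, 0.11, 0.44]

codes_original = to_codes(original, scale)
codes_small = to_codes(small_update, scale)
codes_large = to_codes(large_update, scale)

print(codes_original == codes_small)  # the small update is erased
print(codes_original == codes_large)  # the large update survives
```

In this toy setting the small update leaves every integer code unchanged, so the quantized model is identical to the one before unlearning, while the larger update shifts a code and persists.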

2605.15133 2026-05-15 cs.LG

Causal Foundation Models with Continuous Treatments

Christopher Stith, Medha Barath, Vahid Balazadeh, Jesse C. Cresswell, Rahul G. Krishnan

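AI summary: This paper studies the estimation of causal effects of continuous treatments from observational data, an important but underexplored area of causal inference. The authors propose the first causal foundation model for continuous treatments: they design a new data-generating prior and use a Transformer for in-context learning, so that causal effects can be predicted across many tasks without additional training. The model performs well at reconstructing individual treatment-response curves, outperforming causal models designed specifically for that task.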

2605.15132 2026-05-15 cs.AI cs.DC cs.MA

APWA: A Distributed Architecture for Parallelizable Agentic Workflows

Evan Rose, Tushin Mallick, Matthew D. Laws, Cristina Nita-Rotaru, Alina Oprea

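AI summary: This paper proposes APWA, a distributed architecture for efficiently handling highly parallelizable agentic workloads. It decomposes a task into non-interfering subproblems that are processed on independent resources without cross-communication, overcoming the reasoning, coordination, and compute-scaling bottlenecks of conventional multi-agent systems. Experiments show that APWA dynamically decomposes complex queries into workflows that run in parallel and scales effectively as task size grows, outperforming existing systems.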

2605.15131 2026-05-15 cs.LG

Natural Synthesis: Outperforming Reactive Synthesis Tools with Large Reasoning Models

Frederik Schmitt, Matthias Cosler, Niklas Metzger, Julian Siber, Vladimir Krsmanovic, Mohamed Ghanem, Bernd Finkbeiner

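AI summary: This paper studies reactive synthesis, the problem of automatically constructing hardware circuits from logical specifications. The authors propose a neuro-symbolic approach that pairs large reasoning models with a model checker and uses symbolic feedback to iteratively repair the generated Verilog implementation. The approach outperforms the best existing tools in the annual synthesis competition and can handle parameterized systems, for which the problem is known to be undecidable. The authors also add an autoformalization step that moves the specification task from temporal logic to natural language; using a newly built dataset of natural-language specifications, they show that a natural-language synthesis pipeline performs on par with one based on formal specifications.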

2605.15128 2026-05-15 cs.CV cs.CL cs.IR

MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory

Minghao Guo, Qingyue Jiao, Zeru Shi, Yihao Quan, Boxuan Zhang, Danrui Li, Liwei Che, Wujiang Xu, Shilong Liu, Zirui Liu, Mubbasir Kapadia, Vladimir Pavlovic, Jiang Liu, Mengdi Wang, Yiyu Shi, Dimitris N. Metaxas, Ruixiang Tang

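AI summary: MemEye is a visual-centric evaluation framework for multimodal agent memory. It addresses the inadequate evaluation, in existing methods, of how well long-term memory preserves and reasons over visual evidence. The framework evaluates along two dimensions, the granularity of visual evidence and how that evidence is used, builds a new benchmark of 8 life-scenario tasks, and applies ablation-based validation gates to check reasoning structure and visual necessity. Experiments show that current mainstream models still fall short at preserving fine-grained visual details and at reasoning over temporal state, underscoring the importance of evidence routing, temporal tracking, and detail extraction for long-term multimodal memory.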

2605.15127 2026-05-15 cs.HC cs.AI

Understanding How International Students in the U.S. Are Using Conversational AI to Support Cross-Cultural Adaptation

Laleh Nourian, Anisa Callis, Stephanie Patterson, Jadeline Miao, Jamison Heard, Garreth W. Tigwell

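AI summary: This paper studies how international students in the U.S. use conversational AI to support cross-cultural adaptation. Through a survey and interviews, it documents the usage patterns, motivations, and limitations of AI tools for students facing acculturation challenges. Students treat AI as a "first-aid tool" for immediate problems but also hope it can become a long-term support partner. The study offers design recommendations for AI support systems that better fit international students' needs.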

2605.15122 2026-05-15 cs.RO cs.LG cs.SY eess.SY

CoCo-InEKF: State Estimation with Learned Contact Covariances in Dynamic, Contact-Rich Scenarios

Michael Baumgartner, David Müller, Agon Serifi, Ruben Grandia, Espen Knoop, Markus Gross, Moritz Bächer

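AI summary: This paper proposes CoCo-InEKF, a filtering method for robust state estimation of legged robots in dynamic, contact-rich scenarios. It replaces conventional binary contact states with learned contact covariances, capturing complex situations such as partial contact and directional slip more precisely. Experiments on a bipedal robot show a better trade-off between velocity estimation accuracy and efficiency, along with improved filter consistency, which supports stable execution of demanding motions such as dancing and complex ground interactions.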

2605.15116 2026-05-15 cs.CV

DriveCtrl: Conditioned Sim-to-Real Driving Video Generation

Haonan Zhao, Yiting Wang, Jingkun Chen, Valentina Donzella, Thomas Bashford-Rogers, Kurt Debattista

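AI summary: Training autonomous driving systems requires large amounts of annotated driving video, but the substantial domain gap between simulated data and real scenes limits the practical use of simulation. This paper proposes DriveCtrl, a depth-conditioned, controllable sim-to-real driving video generation framework. A structure-aware adapter preserves scene layout and motion patterns while producing visually realistic and temporally coherent driving videos. The authors also introduce a data generation pipeline that supports multiple conditioning signals and a dedicated evaluation metric, DVRS. Experiments show that DriveCtrl outperforms existing methods in realism, temporal consistency, and downstream perception performance, narrowing the gap between simulated and real driving video.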