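arXivDaily: a daily academic digest synced with the full arXiv listing, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and related fields.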

2605.18754 2026-05-19 cs.CV

Can These Views Be One Scene? Evaluating Multiview 3D Consistency when 3D Foundation Models Hallucinate

Soumava Paul, Prakhar Kaushik, Alan Yuille

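AI summary: This paper examines how reliable multiview 3D consistency evaluation is when 3D foundation models hallucinate. It proposes a controlled robustness benchmark and a parametric family that decomposes neural metrics into backbone, residual, and aggregation components, and introduces COLMAP-based metrics that agree better with human judgments.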

Comments: Project Page at https://mvp18.github.io/3d-consistency-metrics/

Abstract

Multiview 3D evaluation assumes that the images being scored are observations of one static 3D scene. This assumption can fail in novel view synthesis (NVS) and sparse-view reconstruction: inputs or generated outputs may contain artifacts, outlier frames, repeated views, or noise, yet still receive high 3D consistency scores. Existing reference-based metrics require ground truth, while ground-truth-free metrics such as MEt3R depend on learned reconstruction backbones whose failure modes are poorly characterized. We study this reliability problem by comparing neural reconstruction priors with classical geometric verification. We introduce a controlled robustness benchmark for multiview 3D consistency and a parametric family that decomposes neural metrics into backbone, residual, and aggregation components. This family recovers MEt3R and yields variants up to $3\times$ more robust. Our analysis shows that VGGT, MASt3R, DUSt3R, and Fast3R can hallucinate dense geometry and cross-view support for unrelated scenes, repeated images, and random noise. We introduce COLMAP-based metrics that use matches, registration, dense support, and reconstruction failure as failure-aware consistency signals. On real NVS outputs and a structured human study, these metrics achieve up to $4\times$ higher correlation with human judgments than MEt3R.

2605.18753 2026-05-19 cs.CL cs.AI cs.LG

DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention

Yuxiang Huang, Nuno M. T. Gonçalves, Federico Alvetreti, Lei Li, Xu Han, Edoardo M. Ponti, André F. T. Martins, Marcos V. Treviso

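AI summary: This work proposes DashAttention, a differentiable and adaptive sparse hierarchical attention mechanism. It uses the adaptively sparse α-entmax transformation to select a variable number of blocks, which keeps the whole hierarchy differentiable and improves long-context modeling. Experiments show that it outperforms existing methods at high sparsity.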

Comments: Preprint

Abstract

Current hierarchical attention methods, such as NSA and InfLLMv2, select the top-k relevant key-value (KV) blocks based on coarse attention scores and subsequently apply fine-grained softmax attention on the selected tokens. However, the top-k operation assumes the number of relevant tokens for any query is fixed, and it precludes gradient flow between the sparse and dense stages. In this work, we propose DashAttention (Differentiable and Adaptive Sparse Hierarchical Attention), which leverages the adaptively sparse $\alpha$-entmax transformation to select a variable number of blocks according to the current query in the first stage. This in turn provides a prior for the second-stage softmax attention, keeping the entire hierarchy fully differentiable. Contrary to other hierarchical attention methods, we show that DashAttention is non-dispersive, translating to better long-context modeling ability. Experiments with large language models (LLMs) show that DashAttention achieves accuracy comparable to full attention at 75% sparsity and a better Pareto frontier than NSA and InfLLMv2, especially in high-sparsity regimes. We also provide an efficient, GPU-aware implementation of DashAttention in Triton, which achieves a speedup over FlashAttention-3 at inference time. Overall, DashAttention offers a cost-effective strategy to model long contexts.
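The contrast between top-k selection and an adaptively sparse transform can be illustrated with sparsemax, which is the $\alpha = 2$ special case of $\alpha$-entmax. The sketch below is not the paper's implementation: the block scores are made up, the function names are hypothetical, and the value of $\alpha$ and the block scoring used by DashAttention are not stated in the abstract.

```python
def sparsemax(scores):
    """Euclidean projection of scores onto the probability simplex.

    This is alpha-entmax with alpha = 2; entries below the threshold tau
    receive exactly zero probability.
    """
    ordered = sorted(scores, reverse=True)
    cumulative = 0.0
    support_size = 0
    support_sum = 0.0
    for j, z in enumerate(ordered, start=1):
        cumulative += z
        # The j largest scores stay in the support while this holds.
        if 1.0 + j * z > cumulative:
            support_size = j
            support_sum = cumulative
    tau = (support_sum - 1.0) / support_size
    return [max(z - tau, 0.0) for z in scores]


def selected_blocks(block_scores):
    """Indices of blocks with nonzero sparsemax weight (hypothetical helper)."""
    return [i for i, p in enumerate(sparsemax(block_scores)) if p > 0.0]


# Made-up coarse scores for three KV blocks under two different queries.
peaked_scores = [2.0, 1.0, 0.1]
flat_scores = [1.0, 0.9, 0.1]
print(selected_blocks(peaked_scores))
print(selected_blocks(flat_scores))
```

With the peaked scores only one block receives nonzero weight, while the flatter scores keep two, so the number of selected blocks depends on the query rather than on a fixed k.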

2605.18752 2026-05-19 cs.IR astro-ph.IM cs.DL

Traditional statistical representations outperform generative AI in identifying expert peer reviewers

Vicente Amado Olivo, Tereza Jerabkova, Jakub Klencki, John Carpenter, Mario Malički, Ferdinando Patat, Louis-Gregory Strolger, Wolfgang Kerzendorf

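AI summary: Comparing statistical and AI-driven methods, this study finds that traditional statistical representations outperform generative AI at identifying domain experts, and highlights the importance of fine-grained vocabulary for distinguishing subfield expertise.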

Abstract

The exponential growth of scientific submissions has strained the peer review system. Despite the rapidly expanding global pool of researchers, this unprecedented scale has rendered the previous approach of manual expert identification unfeasible. Therefore, institutions have naturally turned to Large Language Models (LLMs) to automate intricate processes like expert reviewer identification. However, the reliability of these new models in accurately identifying domain experts lacks rigorous evaluation. We conduct a comprehensive empirical evaluation of statistical and AI-driven expertise identification methodologies to benchmark their reliability and limitations. Framing expert identification as an information retrieval problem, we utilize the distributed peer review system of a major international astronomical observatory, where proposal authorship serves as our proxy ground truth for domain expertise. Evaluating six retrieval methodologies utilized across observatories and computer science conferences, we demonstrate that traditional statistical representations outperform generative AI. Specifically, Term Frequency-Inverse Document Frequency successfully identified a labeled expert within the top 25 recommendations 79.5% of the time, compared to 51.5% for GPT-4o mini. Our results highlight that distinguishing subfield expertise requires fine-grained vocabulary, which is obscured by the semantic smoothing in generative methods. By establishing a rigorous evaluation framework for automated peer review, we demonstrate that transparent and reproducible statistical representations still outperform computationally expensive LLMs in specialized scientific tasks.
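As a rough illustration of the statistical baseline, the sketch below ranks candidate reviewers by TF-IDF cosine similarity between a proposal and each reviewer's text, then checks whether a labeled expert appears in the top k. The reviewer names, the texts, and the choice to compute IDF over the proposal plus all profiles are assumptions for this toy example, not the observatory's actual pipeline.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in documents]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(len(documents) / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero norm."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_reviewers(proposal, profiles, top_k):
    """Rank reviewer names by similarity of their profile text to the proposal."""
    names = list(profiles)
    vectors = tfidf_vectors([proposal] + [profiles[name] for name in names])
    query, candidates = vectors[0], vectors[1:]
    scored = sorted(zip(names, candidates),
                    key=lambda pair: cosine(query, pair[1]), reverse=True)
    return [name for name, _ in scored[:top_k]]


# Hypothetical toy data: reviewer_a is the labeled expert for this proposal.
profiles = {
    "reviewer_a": "protoplanetary disk dust continuum alma",
    "reviewer_b": "galaxy cluster lensing mass profile",
    "reviewer_c": "stellar binary evolution supernova",
}
proposal = "alma dust continuum survey of a protoplanetary disk"
top = rank_reviewers(proposal, profiles, top_k=2)
print(top, "reviewer_a" in top)
```

Averaging the final membership check over many proposals would give a hit rate of the kind reported for the top 25 recommendations.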

2605.18750 2026-05-19 cs.DC cs.LG

A Readiness-Driven Runtime for Pipeline-Parallel Training under Runtime Variability

Ruitao Liu, Xinyang Tian, Shuo Chen, Tingrui Zhang, Guang Yang, Alan Zhao, Wei Xu

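AI summary: This paper proposes a readiness-driven pipeline runtime that addresses the stage misalignment and idle bubbles caused by runtime variability in pipeline-parallel training. It treats the schedule as a non-binding hint order for ranking currently ready work, which improves resource utilization.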

Comments: 29 pages, including appendices

Abstract

Pipeline parallelism is a key technique for scaling large-model training, but modern workloads exhibit runtime variability in computation and communication. Existing pipeline systems typically consume static, profiled, or adaptively generated schedules as pre-committed execution orders. When realized task readiness diverges from the pre-committed order, stages may wait for not-yet-ready work even though other executable work is available, creating stage misalignment, idle bubbles, and reduced utilization. We present Runtime-Readiness-First Pipeline (RRFP), a readiness-driven runtime for pipeline-parallel training. RRFP changes how schedules are consumed at runtime: instead of treating a schedule as a sequence that stages must wait to follow, it treats the schedule as a non-binding hint order for ranking currently ready work. To support this model, RRFP combines message-driven asynchronous communication, lightweight tensor-parallel coordination for collective consistency, and ready-set arbitration for low-overhead dispatch. We implement RRFP in a Megatron-based training framework and evaluate it on language-only and multimodal workloads at up to 128 GPUs. RRFP improves over fixed-order pipeline baselines across all settings. Using the BFW hint, RRFP achieves up to 1.77$\times$ speedup on language-only workloads and up to 2.77$\times$ on multimodal workloads. In cross-framework comparisons, RRFP with the default BF hint outperforms the faster available external system by up to 1.84$\times$ while preserving training correctness.
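The difference between a pre-committed execution order and a hint order can be sketched with a toy dispatcher. This is only an illustration of the idea described above: task names such as `F1` and `B0` and both function names are hypothetical, and the sketch ignores RRFP's asynchronous communication, tensor-parallel coordination, and arbitration machinery.

```python
def fixed_order_next(schedule, ready, done):
    """Pre-committed order: only the first unfinished task may run.

    Returns None (the stage idles) when that task is not ready yet.
    """
    for task in schedule:
        if task not in done:
            return task if task in ready else None
    return None


def readiness_first_next(hint_order, ready, done):
    """Hint order: run the best-ranked task among those that are ready now."""
    for task in hint_order:
        if task not in done and task in ready:
            return task
    return None


# Hypothetical stage state: forward pass F0 is finished, and the input for
# F1 has not arrived, but backward pass B0 and forward pass F2 are ready.
schedule = ["F0", "F1", "B0", "F2"]
done = {"F0"}
ready = {"B0", "F2"}

print(fixed_order_next(schedule, ready, done))      # stage waits on F1
print(readiness_first_next(schedule, ready, done))  # stage runs B0
```

The same list serves both dispatchers; only its interpretation changes, from a sequence that must be followed to a ranking over currently ready work.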

2605.18749 2026-05-19 cs.SD cs.CV

WavFlow: Audio Generation in Waveform Space

Feiyan Zhou, Luyuan Wang, Shoufa Chen, Zhe Wang, Zhiheng Liu, Yuren Cong, Xiaohui Zhang, Fanny Yang, Belinda Zeng

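AI summary: This paper proposes WavFlow, a framework that generates high-fidelity audio directly in raw waveform space without intermediate representations. Waveform patchify and amplitude lifting enable stable optimization, and an automated data pipeline curates high-quality video-text-audio triplets. Strong results on video-to-audio and text-to-audio benchmarks show that high-quality synthesis does not require intermediate compression.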

Comments: Code: https://github.com/facebookresearch/WavFlow

Abstract

Modern audio generation predominantly relies on latent-space compression, introducing additional complexity and potential information loss. In this work, we challenge this paradigm with WavFlow, a framework that generates high-fidelity audio directly in raw waveform space without intermediate representations. To overcome the inherent difficulties of modeling high-dimensional and low-energy signals, we reshape audio into 2D token grids through waveform patchify and introduce amplitude lifting to align signal scales, enabling stable optimization via direct x-prediction in flow matching. To capture complex semantic alignment and temporal synchronization, we leverage an automated data pipeline to curate 5 million high-quality video-text-audio triplets, allowing the model to learn fine-grained acoustic patterns from scratch. Experimental results show that WavFlow achieves competitive performance on the video-to-audio benchmark VGGSound (FD_PaSST: 59.98, IS_PANNs: 17.40, DeSync: 0.44) and the text-to-audio benchmark AudioCaps (FD_PANNs: 10.63, IS_PANNs: 12.62), matching or exceeding the performance of established latent-based methods. Our work demonstrates that intermediate compression is not a prerequisite for high-quality synthesis, offering a simpler and more scalable alternative for multimodal audio generation.

2605.18748 2026-05-19 cs.CV

Aurora: Unified Video Editing with a Tool-Using Agent

Yongsheng Yu, Ziyun Zeng, Zhiyuan Xiao, Zhenghong Zhou, Hang Hua, Wei Xiong, Jiebo Luo

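AI summary: This paper proposes Aurora, a framework that pairs a tool-augmented vision-language model (VLM) agent with a unified video diffusion transformer to resolve textual and visual underspecification in video editing requests, improving the flexibility and accuracy of video editing.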

Comments: Code: https://github.com/yeates/Aurora

Abstract

Recent video editing models have converged on a unified conditioning design: a single diffusion transformer jointly consumes text, source video, and reference images, and one set of weights covers replacement, removal, style transfer, and reference-driven insertion. The design is flexible, but it assumes that the user already provides model-ready text, reference images, and spatial grounding for local edits, which real requests often omit. We present Aurora, an agentic video editing framework that pairs a tool-augmented vision-language model (VLM) agent with a unified video diffusion transformer. The VLM agent maps a raw user request to a structured edit plan aligned with the transformer's conditioning channels, thereby resolving textual and visual underspecification before generation. We train the VLM agent with supervised data for complete edit planning and reference-image selection, together with preference pairs for robust tool use and instruction refinement. We introduce AgentEdit-Bench to evaluate agent-enhanced video editing under textual and visual underspecification. Experiments on AgentEdit-Bench and two existing video editing benchmarks show that Aurora improves over instruction-only baselines and that the VLM agent transfers to compatible frozen video editing models. Project page: https://yeates.github.io/Aurora-Page

2605.18747 2026-05-19 cs.CL cs.AI

Code as Agent Harness

Xuying Ning, Katherine Tieu, Dongqi Fu, Tianxin Wei, Zihao Li, Yuanchen Bei, Jiaru Zou, Mengting Ai, Zhining Liu, Ting-Wei Li, Lingjie Chen, Yanjun Zhao, Ke Yang, Bingxuan Li, Cheng Qian, Gaotang Li, Xiao Lin, Zhichen Zeng, Ruizhong Qiu, Sirui Chen, Yifan Sun, Xiyuan Yang, Ruida Wang, Rui Pan, Chenyuan Yang, Dylan Zhang, Liri Fang, Zikun Cui, Yang Cao, Pan Chen, Dorothy Sun, Ren Chen, Mahesh Srinivasan, Nipun Mathur, Yinglong Xia, Hong Li, Hong Yan, Pan Lu, Lingming Zhang, Tong Zhang, Hanghang Tong, Jingrui He

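AI summary: This survey examines the role of code in agentic systems. It proposes a unified view that treats code as the basis of agent infrastructure, and discusses the harness interface, harness mechanisms, and the challenges of scaling to multi-agent systems.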

Comments: GitHub: https://github.com/YennNing/Awesome-Code-as-Agent-Harness-Papers

Abstract

Recent large language models (LLMs) have demonstrated strong capabilities in understanding and generating code, from competitive programming to repository-level software engineering. In emerging agentic systems, code is no longer only a target output. It increasingly serves as an operational substrate for agent reasoning, acting, environment modeling, and execution-based verification. We frame this shift through the lens of agent harnesses and introduce code as agent harness: a unified view that centers code as the basis for agent infrastructure. To systematically study this perspective, we organize the survey around three connected layers. First, we study the harness interface, where code connects agents to reasoning, action, and environment modeling. Second, we examine harness mechanisms: planning, memory, and tool use for long-horizon execution, together with feedback-driven control and optimization that make harness reliable and adaptive. Third, we discuss scaling the harness from single-agent systems to multi-agent settings, where shared code artifacts support multi-agent coordination, review, and verification. Across these layers, we summarize representative methods and practical applications of code as agent harness, spanning coding assistants, GUI/OS automation, embodied agents, scientific discovery, personalization and recommendation, DevOps, and enterprise workflows. We further outline open challenges for harness engineering, including evaluation beyond final task success, verification under incomplete feedback, regression-free harness improvement, consistent shared state across multiple agents, human oversight for safety-critical actions, and extensions to multimodal environments. By centering code as the harness of agentic AI, this survey provides a unified roadmap toward executable, verifiable, and stateful AI agent systems.

2605.18738 2026-05-19 cs.AI

What Does the AI Doctor Value? Auditing Pluralism in the Clinical Ethics of Language Models

Payal Chandak, Victoria Alkin, David Wu, Maya Dagan, Taposh Dutta Roy, Maria Clara Saad Menezes, Ayush Noori, Nirali Somia, John S. Brownstein, Ran Balicer, Rebecca W. Brendel, Noa Dagan, Isaac S. Kohane, Gabriel A. Brat

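AI summary: This paper studies the ethical values that large language models bring to medical advice. It proposes an auditing framework for value pluralism in medical AI, shows that some models may underweight patient autonomy in their decisions, and stresses the importance of balancing perspectives across multiple models.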

Comments: Code and data available upon request via https://hvp.global/

Abstract

Medicine is inherently pluralistic. Principles such as autonomy, beneficence, nonmaleficence, and justice routinely conflict, and such ethical dilemmas often sharply divide reasonable physicians. Good clinical practice navigates these tensions in concert with each patient's values rather than imposing a single ethical stance. The ethical values that large language models bring to medical advice, however, have not been systematically examined. We present a framework for auditing value pluralism in medical AI, comprising a benchmark of clinician-verified dilemmas and an attribution method that recovers value priorities directly from decisions. The ecosystem of frontier models spans physician-level value heterogeneity, and models discuss competing values in their reasoning (Overton pluralism) before committing to a decision. However, individual model decisions are near-deterministic across repeated sampling and semantic variations, failing to reproduce the distributional pluralism of the physician panel. Across benchmark cases, these consistent decisions reflect committed, systematic value preferences. While most model priorities fall within the natural range of inter-physician variation, some significantly underweight patient autonomy. A single LLM deployed without regard for its value priorities could amplify those priorities at scale to every patient it serves. Without explicit efforts to balance ethical perspectives with one or multiple models, these tools risk replacing clinical pluralism with a deployment monoculture.

2605.18735 2026-05-19 cs.CV cs.GR cs.LG

PIXLRelight: Controllable Relighting via Intrinsic Conditioning

Miguel Farinha, Ronald Clark

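AI summary: PIXLRelight bridges physically based rendering and learned image synthesis through intrinsic conditioning to make single-image relighting controllable. The intrinsic conditioning, obtained from real photographs or PBR renders, is used for both training and inference, which yields high-quality relighting while preserving image detail.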

Comments: Project page: https://mlfarinha.github.io/pixl-relight/. Under review

Abstract

We present PIXLRelight, a feed-forward approach for physically controllable single-image relighting. Existing methods either provide limited lighting control (e.g. through text or environment maps), accumulate errors when chaining inverse and forward rendering, or require costly per-image optimization. Our key idea is to bridge physically based rendering (PBR) and learned image synthesis through a shared intrinsic conditioning that can be obtained from either real photographs or PBR renders. At training time, paired multi-illumination photographs are decomposed into albedo, diffuse shading, and non-diffuse residuals, which condition the model. At inference time, the same conditioning is computed from a path-traced render of a coarse 3D reconstruction of the input under user-specified PBR lights. A transformer-based neural renderer then applies the target illumination to the source photograph, preserving fine image detail through a per-pixel affine modulation. PIXLRelight enables arbitrary PBR-style lighting control, achieves state-of-the-art relighting quality, and runs in under a tenth of a second per image. Code and models are available at https://mlfarinha.github.io/pixl-relight/.

2605.18734 2026-05-19 cs.CV

EgoExoMem: Cross-View Memory Reasoning over Synchronized Egocentric and Exocentric Videos

Ruiping Liu, Junwei Zheng, Yufan Chen, Di Wen, Shaofang Quan, Chengzhi Wu, Jiaming Zhang, Kailun Yang, Kunyu Peng, Rainer Stiefelhagen

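AI summary: This paper introduces EgoExoMem, the first benchmark for cross-view memory reasoning over synchronized egocentric and exocentric videos, together with E$^2$-Select for efficient frame selection. Experiments show that ego and exo views provide complementary memory cues, but existing models perform poorly on the benchmark.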

Comments: The source code and dataset can be found at https://github.com/RuipingL/EgoExoMem

Abstract

Egocentric memory is widely used in embodied intelligence, but it may be insufficient for comprehensive spatial-temporal reasoning. Inspired by human recall from both field and observer perspectives, we introduce EgoExoMem, the first benchmark for cross-view memory reasoning over synchronized egocentric and exocentric videos. EgoExoMem contains $2.6K$ high-quality MCQs across eight temporal, spatial, and cross-view QA types. To support dual-view retrieval, we propose E$^2$-Select, a training-free frame selection method for synchronized ego-exo videos. It combines relevance-based budget allocation with per-view k-DPP sampling to handle view asymmetry and cross-view temporal consistency. Experiments show that ego and exo views provide complementary memory cues, while existing MLLMs remain far from solving the benchmark: the best model reaches only $55.3\%$. E$^2$-Select achieves state-of-the-art performance of $58.2\%$ over frame-selection and RAG-based memory baselines. Further analysis reveals systematic view-preference conflicts between question framing and answer grounding, underscoring the novelty and challenge of cross-view memory reasoning.

2605.18733 2026-05-19 cs.CV

Advancing Narrative Long Video Generation via Training-Free Identity-Aware Memory

Jinzhuo Liu, Jiangning Zhang, Wencan Jiang, Yabiao Wang, Dingkang Liang, Zhucun Xue, Ran Yi, Yong Liu

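AI summary: This paper proposes IAMFlow, a training-free identity-aware memory framework that explicitly models and tracks persistent entity identities for consistent generation. It also introduces the NarraStream-Bench benchmark, on which IAMFlow achieves the best performance for narrative streaming video generation.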

Comments: Project page: https://eddie0521.github.io/projects/iamflow/ Code: https://github.com/Eddie0521/IAMFlow

Abstract

Autoregressive video generation has improved rapidly in visual fidelity and interactivity, but it still suffers from long-term inconsistency and memory degradation. Most existing solutions either compress historical frames using predefined strategies or retrieve keyframes based on coarse implicit attention signals, both of which fail to handle evolving prompts with shifting entity references, leading to identity drift, character duplication, and attribute loss. To address this, we propose IAMFlow, a training-free identity-aware memory framework that explicitly models and tracks persistent entity identities, enabling consistent generation across prompt transitions. Specifically, an LLM extracts entities with visual attributes from each prompt and assigns unique global IDs for identity-aware memory, while a VLM asynchronously verifies and refines attributes from rendered frames, enabling explicit entity tracking in place of implicit similarity-based matching. To keep the proposed framework computationally practical, we design a systematic inference acceleration pipeline, including asynchronous visual verification, adaptive prompt transition, and model quantization, which achieves faster generation than existing baselines. Furthermore, we introduce NarraStream-Bench, a benchmark for narrative streaming video generation that features 324 multi-prompt scripts spanning six dimensions and a three-dimensional evaluation protocol that integrates both traditional metrics and multimodal large language model-based assessments. Extensive experiments show that IAMFlow, despite being training-free, achieves the best overall performance on NarraStream-Bench, outperforming the strongest baseline by 2.56 points, while achieving a 1.39$\times$ speedup over the most efficient baseline in the 60-second multi-prompt setting.

2605.18732 2026-05-19 cs.CL cs.AI cs.LG

Predictable Confabulations: Factual Recall by LLMs Scales with Model Size and Topic Frequency

Matthew L. Smith, Jonathan P. Shock, Samuel T. Segun, Iyiola E. Olatunji, Tegawendé F. Bissyandé

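AI summary: This study examines how predictable LLM factual recall is. It finds that model size and topic frequency in the training data are the key factors in recall quality, and that the two together explain 60%-94% of the variance.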

2605.18729 2026-05-19 cs.RO cs.CV

Robo-Cortex: A Self-Evolving Embodied Agent via Dual-Grain Cognitive Memory and Autonomous Knowledge Induction

Nga Teng Chan, Yi Zhang, Yechi Liu, Renwen Cui, Fanhu Zeng, Zeyuan Ding, Xiancong Ren, Zhang Zhang, Qifeng Chen, Jian Liu, Yong Dai, Xiaozhu Ju

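AI summary: This paper proposes Robo-Cortex, a framework that uses dual-grain cognitive memory and autonomous knowledge induction to let robots induce navigation heuristics and refine cognitive strategies on their own, enabling autonomous navigation and exploration in complex environments.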

Abstract

The ability to navigate and interact with complex environments is central to real-world embodied agents, yet navigation in unseen environments remains challenging due to "experiential amnesia," where existing trajectory-driven or reactive policies fail to synthesize generalizable strategies from past interactions. We propose Robo-Cortex, a self-evolving framework that enables robots to autonomously induce navigation heuristics and refine cognitive strategies through a continuous reflection-adaptation loop. By abstracting success patterns and failure pitfalls into natural-language heuristics, Robo-Cortex enables a transition from passive execution to active strategy evolution. Our core innovation is an Autonomous Knowledge Induction (AKI) mechanism that distills multimodal trajectories into a structured Navigation Heuristic Library for knowledge generalization. The architecture further incorporates a Dual-Grain Cognitive Memory system, comprising a Short-term Reflective Memory (SRM) for real-time local progress analysis, and a Long-term Principle Memory (LPM) that abstracts past trajectories into reusable guiding and cautionary principles. To ensure robust decision-making, we introduce a multimodal Imagine-then-Verify loop, where a world model simulates potential outcomes and a VLM-based evaluator validates action plans. Extensive evaluations on IGNav, AR, and AEQA show that Robo-Cortex consistently outperforms strong baselines in both task success and exploration efficiency, with gains of up to +4.16% SPL over the strongest prior method and up to +15.30% SPL under heuristic transfer to unseen environments. Preliminary real-world robotic experiments further support the effectiveness of Robo-Cortex in physical settings.
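The SPL gains quoted above presumably refer to Success weighted by Path Length, the standard navigation metric $\frac{1}{N}\sum_i S_i \, \ell_i / \max(p_i, \ell_i)$, where $S_i$ is the success indicator, $\ell_i$ the shortest-path length, and $p_i$ the length of the path actually taken. The abstract does not define the metric, so the sketch below assumes that definition; the episode values are invented.

```python
def spl(episodes):
    """Success weighted by Path Length over a list of episodes.

    Each episode is (success, shortest_path_length, taken_path_length).
    Failed episodes contribute zero; successful ones are discounted by how
    much longer the taken path was than the shortest path.
    """
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            total += shortest / max(taken, shortest)
    return total / len(episodes)


# Invented episodes: an optimal success, a success with a detour twice the
# shortest length, and a failure.
episodes = [
    (True, 10.0, 10.0),
    (True, 10.0, 20.0),
    (False, 10.0, 5.0),
]
print(spl(episodes))
```

Because the ratio is capped at 1 and failures score 0, the metric rewards agents that both reach the goal and do so efficiently.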

2605.18727 2026-05-19 cs.RO cs.AI

DexHoldem: Playing Texas Hold'em with Dexterous Embodied System

Feng Chen, Tianzhe Chu, Li Sun, Pei Zhou, Zhuxiu Xu, Shenghua Gao, Yuexiang Zhai, Yanchao Yang, Yi Ma

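AI summary: This paper introduces DexHoldem, a real-world system-level benchmark built on a ShadowHand for evaluating dexterous manipulation in Texas Hold'em. Using 1,470 teleoperated demonstrations across 14 Texas Hold'em manipulation primitives, it tests agents on perception, execution, and decision routing.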

Comments: 30 Pages

Abstract

Evaluating embodied systems on real dexterous hardware requires more than isolated primitive skills: an agent must perceive a changing tabletop scene, choose a context-appropriate action, execute it with a dexterous hand, and leave the scene usable for later decisions. We introduce DexHoldem, a real-world system-level benchmark built around Texas Hold'em dexterous manipulation with a ShadowHand. DexHoldem provides 1,470 teleoperated demonstrations across 14 Texas Hold'em manipulation primitives, a standardized physical policy benchmark, and an agentic perception benchmark that tests whether agents can recover the structured game state needed for embodied decision making. On primitive execution, $\pi_{0.5}$ obtains the highest task completion rate ($61.2\%$), while $\pi_{0.5}$ and $\pi_0$ tie on scene-preserving success rate ($47.5\%$). On agentic perception, Opus 4.7 obtains the best strict problem-level accuracy ($34.3\%$), while GPT 5.5 obtains the best average field-wise accuracy ($66.8\%$), exposing a gap between isolated visual sub-capabilities and complete routing-relevant state recovery. Finally, we instantiate the full embodied-agent loop in three case studies, where waiting, recovery dispatches, human-help requests, and repeated primitive execution reveal how perception and policy errors accumulate during closed-loop deployment. DexHoldem therefore evaluates dexterous tabletop execution, agentic perception, and embodied decision routing in a shared physical setting. Project page: https://dexholdem.github.io/Dexholdem/.
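The gap between the two perception scores is easier to see with a small example. The sketch below assumes that strict problem-level accuracy requires every field of a predicted game state to be correct, and that field-wise accuracy averages per-field correctness. Both definitions, the field names, and the data are assumptions, since the abstract does not spell them out.

```python
def strict_accuracy(predictions, targets):
    """Fraction of problems where the whole predicted state matches."""
    matches = sum(pred == target for pred, target in zip(predictions, targets))
    return matches / len(targets)


def fieldwise_accuracy(predictions, targets):
    """Mean over fields of the fraction of problems with that field correct."""
    fields = list(targets[0])
    per_field = []
    for field in fields:
        correct = sum(pred.get(field) == target[field]
                      for pred, target in zip(predictions, targets))
        per_field.append(correct / len(targets))
    return sum(per_field) / len(per_field)


# Hypothetical game states with three made-up fields.
targets = [
    {"pot": 30, "board": "AhKd2c", "to_act": "robot"},
    {"pot": 60, "board": "AhKd2c7s", "to_act": "opponent"},
]
predictions = [
    {"pot": 30, "board": "AhKd2c", "to_act": "robot"},
    {"pot": 50, "board": "AhKd2c7s", "to_act": "opponent"},  # pot misread
]
print(strict_accuracy(predictions, targets))
print(fieldwise_accuracy(predictions, targets))
```

A single misread field leaves most per-field scores intact but fails the whole problem under the strict criterion, which is one way a model can score well field by field and poorly on full state recovery.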

2605.18722 2026-05-19 cs.RO

Dexora: Open-source VLA for High-DoF Bimanual Dexterity

Zongzheng Zhang, Jingrui Pang, Zhuo Yang, Kun Li, Minwen Liao, Saining Zhang, Guoxuan Chi, Jinbang Guo, Huan-ang Gao, Modi Shi, Dongyun Ge, Yao Mu, Jiayuan Gu, Rui Chen, Hao Dong, Huazhe Xu, Li Yi, Yixin Zhu, Hang Zhao, Pengwei Wang, Shanghang Zhang, Guocai Yao, Jianyu Chen, Hongyang Li, Hao Zhao

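AI summary: This paper presents Dexora, the first open-source VLA system targeting dual-arm, dual-hand high-DoF manipulation. A hybrid teleoperation pipeline decouples gross arm motion from fine finger motion and drives both a physical platform and a digital twin, which the authors use to build a large-scale training dataset. They also propose a data-quality-aware training recipe, and experiments show stronger performance than baselines on both basic and dexterous tasks.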

Comments: Accepted by ICRA 2026

English abstract

Vision-Language-Action (VLA) models have recently become a central direction in embodied AI, but current systems are restricted to either dual-gripper control or single-arm dexterous hand manipulation. While low-dimensional gripper control can often be handled with simpler methods, high-dimensional dexterous hand control benefits greatly from full end-to-end VLA learning. In this work, we introduce Dexora, the first open-source VLA system that natively targets dual-arm, dual-hand high-DoF manipulation. We design a hybrid teleoperation pipeline that decouples gross arm kinematics (captured with a custom exoskeleton backpack) from fine finger motion (markerless hand tracking via Apple Vision Pro), and that drives both a physical dual-arm dual-hand platform and an identical MuJoCo digital twin. Using that interface, we assemble a large training corpus: an embodiment-matched synthetic corpus (100K simulated trajectories, 6.5M frames) and a real-world dataset of 10K teleoperated episodes (2.92M frames). To mitigate noisy teleoperation demonstrations, we propose a data-quality-aware training recipe: an offline discriminator provides clip-level weights for diffusion-transformer policy training, down-weighting low-quality demonstrations. Empirically, Dexora outperforms competitive VLA baselines on both basic and dexterous benchmarks (e.g., average dexterous success 66.7% vs. 51.7%), attains 90% success on basic tasks, and shows robust out-of-distribution and cross-embodiment generalization. Ablations confirm the importance of real data and the discriminator for dexterity.

2605.18720 2026-05-19 cs.RO

Data-Driven Dynamic Modeling of a Tendon-Actuated Continuum Robot

Harald Minde Hansen, Bjørn Kåre Sæbø, Kristin Y. Pettersen, Jan Tommy Gravdahl, Mario Di Castro

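AI summary: This paper studies data-driven system identification for modeling a tendon-actuated continuum robot with rolling joints. It finds that a dynamic model with only two degrees of freedom captures the system dynamics accurately, and it demonstrates that the approach is feasible for real-time control.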

2605.18719 2026-05-19 cs.CV

SafeDiffusion-R1: Online Reward Steering for Safe Diffusion Post-Training

Komal Kumar, Ankan Deria, Abhishek Basu, Fahad Shamshad, Hisham Cholakkal, Karthik Nandakumar

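AI summary: This paper proposes an online reinforcement learning framework that post-trains on negative and positive text prompts with Group Relative Policy Optimization (GRPO) to address data scarcity and model degradation. It introduces a steering reward mechanism to make diffusion models safer, and experiments show that it reduces inappropriate content while improving generation quality.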

Comments: 28 pages, 20 figures, 6 tables

English abstract

Diffusion models have been widely studied for removing unsafe content learned during pre-training. Existing methods require expensive supervised data, either unsafe-text paired with safe-image groundtruth or negative/positive image pairs, making them impractical to scale. Furthermore, offline reinforcement learning and supervised fine-tuning approaches that generate synthetic data offline suffer from catastrophic forgetting, degrading generation quality. We propose a novel online reinforcement learning framework that addresses both data scarcity and model degradation through post-training with Group Relative Policy Optimization (GRPO) on both negative and positive text prompts. To eliminate the need for fine-tuning specialized safe/unsafe reward models, we introduce a steering reward mechanism that exploits an inherent property of CLIP embeddings: steering text representations toward positive safety directions and away from negative ones in the embedding space. Our online-policy approach enables the model to learn from diverse prompts, including explicit unsafe content, without catastrophic forgetting. Extensive experiments demonstrate that our method reduces inappropriate content to 18.07% (vs. 48.9% for SD v1.4) and nudity detections to 15 (vs. 646 baseline) while improving compositional generation quality from 42.08% to 47.83% on GenEval. Remarkably, these safety gains generalize to out-of-domain unsafe prompts across seven harm categories, achieving state-of-the-art performance without supervised paired data or reward tuning. GitHub: https://github.com/MAXNORM8650/SafeDiffusion-R1.
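The group-relative part of GRPO mentioned above can be sketched in a few lines: each sample's reward is normalized against the other samples drawn for the same prompt, so no learned value function is needed. This is a minimal sketch of that normalization only. The reward values are hypothetical placeholders (the abstract does not specify how the steering reward is computed), and the use of the population standard deviation is an assumption, since implementations vary.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against its own sampling group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some code uses sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four images sampled from the same prompt.
advantages = group_relative_advantages([0.2, 0.4, 0.6, 0.8])
print([round(a, 3) for a in advantages])
```

Samples that score above their group's mean get a positive advantage and are reinforced; a group in which every sample scores the same yields all-zero advantages and therefore no update signal.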

2605.18715 2026-05-19 cs.DL

Global training and the collaborative structure of elite U.S. science

Erjia Yan, Chaoqun Ni, Xiang Zheng

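AI summary: This study examines how global training relates to elite U.S. scientific output. Faculty with non-U.S. degrees make up a small share of the professoriate but account for larger shares of publications and highly cited papers, a pattern attributed mainly to the structural features of institutional concentration and collaborative integration.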

English abstract

Globally trained scientific labor is a substantial component of U.S. universities, yet the organizational mechanisms linking foreign degree training to elite scientific output remain poorly understood. We link comprehensive U.S. faculty rosters to more than 12 million OpenAlex-indexed faculty-publication observations from 2011 to 2020. Faculty with non-U.S. degrees constitute one-tenth of the U.S. professoriate but account for larger shares of total publications and top-1% cited papers. This overrepresentation is concentrated in high-output disciplinary domains and research-intensive institutions. Within institution - domain - rank - year strata, however, differences in top-1% output, FWCI, and corresponding-author share attenuate sharply, indicating that much of the aggregate pattern reflects organizational placement rather than large within-context citation advantages. Collaboration structure further differentiates foreign- and domestically trained faculty: mixed domestic-foreign faculty teams exhibit substantially elevated elite-output rates, and the association attenuates strongly after accounting for team size, suggesting that collaboration scale is central to the pattern. Topic-distinctiveness analyses show little evidence that foreign-degree faculty occupy unusually rare research niches. Overall, foreign-degree training is best understood less as an individual productivity attribute than as a structural feature of elite U.S. science, operating through institutional concentration and collaborative integration.

2605.18714 2026-05-19 cs.CV cs.AI

Semantic Generative Tuning for Unified Multimodal Models

Songsong Yu, Yuxin Chen, Ying Shan, Yanwei Li

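AI summary: This paper proposes Semantic Generative Tuning (SGT), which uses high-level semantic tasks as generative proxies to align perception and generation in unified multimodal models, improving both multimodal understanding and generation quality.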

Comments: 14 pages, 13 figures

English abstract

Unified multimodal models (UMMs) strive to consolidate visual understanding and visual generation within a single architecture. However, prevailing training paradigms independently optimize understanding via sparse text signals and generation through dense pixel objectives. Such a decoupled strategy yields misaligned representation spaces, isolating visual understanding from generation and hindering their mutual reinforcement. This work presents the first systematic investigation into generative post-training, where we formulate hierarchical visual tasks as generative proxies to bridge the isolation in UMMs. Our empirical investigation reveals that high-level semantic tasks, particularly image segmentation, serve as optimal proxies. Unlike low-level tasks that distract models with texture details, segmentation provides structural semantics that significantly enhance both vision-centric perception and generative layout fidelity. Building upon these insights, we introduce Semantic Generative Tuning (SGT), a novel paradigm that leverages segmentation as a generative proxy to align and synergize multimodal capabilities. Mechanistic analyses further demonstrate that SGT fundamentally improves feature linear separability and optimizes visual-textual attention allocation patterns. Extensive evaluations show that SGT consistently improves both multimodal comprehension and generative fidelity across mainstream benchmarks. Our code is available at https://song2yu.github.io/SGT/.

2605.18710 2026-05-19 cs.DC

Mosaic: Towards Efficient Training of Multimodal Models with Spatial Resource Multiplexing

Yanbo Wang, Yuxuan Wang, Chen Chen, Chunyu Xue, Yu Feng, Anbang Wu, Quan Chen, Yin Chen, Qizhen Weng

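AI summary: This paper proposes Apollo, a system that improves multimodal model training efficiency through temporal-spatial multiplexing. It uses a flexible execution engine and a performance model to derive high-quality deployment plans, and testbed experiments show training speedups of up to 1.31x.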

English abstract

With the wide adoption of Multimodal Models (MMs) in real-world scenarios, it is significant to efficiently train emerging MMs that exhibit increasingly complex module architectures. For MM deployment, existing works allocate a GPU to only one MM module in a temporal-multiplexing manner; this compromises training efficiency because a single module often fails to achieve high GPU utilization. To improve GPU utilization and enable efficient MM training, we propose deploying MMs in a temporal-spatial multiplexing manner, allowing multiple MM modules to colocate on a GPU with well-controlled resource quotas. In this paper, we propose Apollo, an efficient MM training system that applies temporal-spatial multiplexing. We first develop a flexible and lightweight execution engine that supports MM training with arbitrary resource quotas, and then build a comprehensive and accurate performance model to estimate module execution time under different allocation plans. With the performance model, we further adopt effective heuristics to derive high-quality MM deployment plans efficiently. Testbed experiments confirm that Apollo effectively improves the training efficiency of popular MMs, with a training speedup of up to 1.31x.

2605.18707 2026-05-19 cs.DC

Ranking Opinions with Few States in Population Protocols

Tom-Lukas Breitkopf, Julien Dallot, Antoine El-Hayek, Stefan Schmid

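AI summary: This work presents a population protocol that solves the relative majority problem with $k^3$ states, and extends it to handle various tie-breaking mechanisms and the case where agents do not share a prior ordering of the colors.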

Comments: To appear at PODC 2026

English abstract

Population protocols are a model of distributed computing where $n$ agents, each a simple finite-state machine, interact in pairs to solve a common task against a (adversarial) interaction scheduler. This model was intensively studied in recent years; in particular, the problem of relative majority received much attention: Each agent starts with an input opinion (or color) out of $k$ possibilities, and the goal is for each agent to eventually output the color with the largest support in the population. Before our work, the state complexity (the minimum number of states required per agent) was only known to be between $Ω(k^2)$ and $O(k^{7})$. Our main contribution is a population protocol that solves the relative majority problem with $k^3$ states. We achieve this result with a new protocol called CIRCLES. While prior approaches in the literature relied on duels of agents to find the majority color -- an approach that proved effective for the case with two colors -- CIRCLES partitions the agents into circular linked lists of decreasing sizes, with the property that no two agents with the same initial color lie in the same circle. We show that CIRCLES always correctly computes the desired structure against the most adversarial of schedulers (weakly fair). We then show that a trivial extension of CIRCLES solves the relative majority problem. We extend our protocol to handle various tie-breaking mechanisms or to support the case where the agents do not share a prior ordering of the colors. Finally, we show that a modification of CIRCLES solves the ranking problem with $2 \cdot k^4$ states, where each agent must output the rank of its initial color in the population.

2605.18704 2026-05-19 eess.SP cs.LG

Learned Memory Attenuation in Sage-Husa Kalman Filters for Robust UAV State Estimation

Kenan Majewski, Marcin Żugaj

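AI summary: This paper proposes the N-Deep Recurrent Sage-Husa Filter, which improves on classical Kalman filtering with a learned memory attenuation policy to make UAV state estimation more robust in dynamic environments.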

Comments: 49 pages, 9 figures. Preprint submitted to Aerospace Science and Technology

English abstract

Unmanned Aerial Vehicles in dynamic environments face telemetry outages, structural vibrations, and regime-dependent noise that invalidate the stationary covariance assumptions of classical Kalman filters. The Sage-Husa Kalman Filter (SHKF) estimates noise statistics online, but its reliance on a static, scalar forgetting factor forces a strict compromise between steady-state stability and transient responsiveness. We introduce the N-Deep Recurrent Sage-Husa Filter (NDR-SHKF), which replaces this scalar parameter with a vector-valued memory attenuation policy learned by a hierarchical recurrent network operating on whitened innovation sequences. A bifurcated architecture routes shallow recurrent states to capture instantaneous sensor anomalies and deep states to encode sustained dynamic trends, while an auxiliary reconstruction objective prevents feature collapse. The complete filter, including recursive covariance updates, is trained end-to-end via backpropagation through time to directly minimize state estimation error. Evaluations on topologically distinct chaotic attractors demonstrate cross-domain generalization, outperforming purely data-driven baselines that diverge under out-of-distribution dynamics. Furthermore, evaluations on recorded real-world UAV flight datasets validate the framework's practical viability, demonstrating its capacity to bridge transitions into proprioceptive dead reckoning and outperform classical adaptive estimators during sensor outages.
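The compromise imposed by a static scalar forgetting factor can be made concrete with the weighting commonly given for Sage-Husa noise estimation, d_k = (1 - b) / (1 - b^(k+1)) with 0 < b < 1. The scalar sketch below follows that textbook form under stated assumptions: indexing conventions differ between references, the clamp to a small positive value is a safeguard added here for illustration, and none of this is the NDR-SHKF itself, which replaces the scalar b with a learned vector-valued policy.

```python
def sage_husa_weight(k: int, b: float) -> float:
    """Time-varying weight d_k = (1 - b) / (1 - b**(k + 1)) for forgetting factor b."""
    if not 0.0 < b < 1.0:
        raise ValueError("forgetting factor b must lie in (0, 1)")
    return (1.0 - b) / (1.0 - b ** (k + 1))


def update_measurement_noise(
    r_prev: float, innovation: float, hph: float, k: int, b: float
) -> float:
    """Scalar Sage-Husa style update of the measurement-noise variance estimate.

    hph stands for H * P_pred * H^T, the predicted-state share of the
    innovation variance (a scalar in this sketch).
    """
    d = sage_husa_weight(k, b)
    r_new = (1.0 - d) * r_prev + d * (innovation ** 2 - hph)
    # Assumption: clamp, because the raw estimate can become negative.
    return max(r_new, 1e-12)


# The weight starts at 1 (trust the first sample) and decays toward 1 - b.
print(sage_husa_weight(0, 0.95), sage_husa_weight(10_000, 0.95))
```

A value of b close to 1 gives a small steady-state weight (smooth but slow to react), while a smaller b reacts quickly but is noisier; a single fixed b cannot provide both behaviors at once.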

2605.18703 2026-05-19 cs.CL cs.LG

EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL

Minrui Xu, Zilin Wang, Mengyi DENG, Zhiwei Li, Zhicheng Yang, Xiao Zhu, Yinhong Liu, Boyu Zhu, Baiyu Huang, Chao Chen, Heyuan Deng, Fei Mi, Lifeng Shang, Xingshan Zeng, Zhijiang Guo

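AI summary: This paper proposes EnvFactory, a framework that combines automated synthesis of executable environments with robust reinforcement learning to address the limited environment scalability and scarce training data that hold back tool-use agents, substantially improving training efficiency and downstream performance.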

Comments: 11 pages

English abstract

Equipping LLMs with tool-use capabilities via Agentic Reinforcement Learning (Agentic RL) is bottlenecked by two challenges: the lack of scalable, robust execution environments and the scarcity of realistic training data that captures implicit human reasoning. Existing approaches depend on costly real-world APIs, hallucination-prone LLM simulators, or synthetic environments that are often single-turn or depend on pre-collected documents. Moreover, synthetic trajectories are frequently over-specified, resembling instruction sequences rather than natural human intents, reducing their effectiveness for RL training. We introduce EnvFactory, a fully automated framework that addresses both challenges. EnvFactory autonomously explores and verifies stateful, executable tool environments from authentic resources, and synthesizes natural multi-turn trajectories through topology-aware sampling and calibrated refinement, producing grounded queries with implicit intents. Using only 85 verified environments across 7 domains, EnvFactory generates 2,575 SFT and RL trajectories. Despite using significantly fewer environments than prior work, which are often 5 times more, EnvFactory achieves superior training efficiency and downstream performance, improving Qwen3-series models by up to +15% on BFCLv3, +8.6% on MCP-Atlas, and +6% on conversational benchmarks including $τ^2$-Bench and VitaBench. By fully automating both environment construction and trajectory synthesis, EnvFactory provides a scalable, extensible, and robust foundation for Agentic RL.

2605.18702 2026-05-19 cs.LG cs.AI

Distilling Tabular Foundation Models for Structured Health Data

Aditya Tanna, Nassim Bouarour, Mohamed Bouadi, Vinay Kumar Sankarapu, Pratinav Seth

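AI summary: This paper studies how knowledge distillation can transfer the predictive behavior of tabular foundation models to lightweight tabular models, using stratified out-of-fold teacher labeling to avoid context leakage. Across 19 healthcare datasets, distilled students retain high AUC while running much faster at inference, and multi-teacher averaging does not consistently beat the best single teacher.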

English abstract

Tabular foundation models (TFMs) achieve strong performance on health datasets, but their inference cost and infrastructure requirements limit practical use. We study whether their predictive behavior can be transferred to lightweight tabular models through knowledge distillation. Since in-context TFMs condition on the training set at inference time, naive distillation can introduce context leakage; we address this with stratified out-of-fold teacher labeling. Across $19$ healthcare datasets, $6$ TFM teachers, $4$ student families, and several multi-teacher ensembles, we find that distilled students retain at least $90\%$ of teacher AUC, outperforming teachers in some cases, while running at least $26\times$ faster on CPU and preserving calibration and fairness critical for health applications. Moreover, multi-teacher averaging does not consistently improve over the best single teacher. Leakage-aware distillation is thus a viable route for bringing TFM-quality predictions into inference-constrained health settings.
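The context-leakage issue arises because an in-context teacher that sees a row and its label in its context can simply echo that label back. One way to picture stratified out-of-fold labeling is the sketch below, which labels each row with a teacher whose context excludes that row's fold. It is a minimal standard-library illustration: the helper names, the `teacher(ctx_X, ctx_y, query_X)` call shape, and the toy teacher are all hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict
from typing import Callable, List, Sequence


def stratified_folds(labels: Sequence[int], n_folds: int, seed: int = 0) -> List[List[int]]:
    """Split row indices into folds with roughly equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds: List[List[int]] = [[] for _ in range(n_folds)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for pos, idx in enumerate(idxs):
            folds[pos % n_folds].append(idx)  # deal each class round-robin
    return folds


def out_of_fold_soft_labels(X: Sequence, y: Sequence[int], teacher: Callable, n_folds: int = 5) -> list:
    """Label every row with a teacher whose context never contains that row."""
    soft = [None] * len(y)
    for held_out in stratified_folds(y, n_folds):
        held = set(held_out)
        ctx = [i for i in range(len(y)) if i not in held]
        preds = teacher([X[i] for i in ctx], [y[i] for i in ctx], [X[i] for i in held_out])
        for i, p in zip(held_out, preds):
            soft[i] = p
    return soft


def prior_teacher(ctx_X, ctx_y, query_X):
    """Hypothetical stand-in teacher: predicts the positive rate of its context."""
    rate = sum(ctx_y) / len(ctx_y)
    return [rate for _ in query_X]


X = list(range(10))
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
soft_labels = out_of_fold_soft_labels(X, y, prior_teacher, n_folds=5)
print(soft_labels)
```

The student would then be fit on `X` with `soft_labels` as targets, so every target reflects what the teacher predicts for a row it has not seen.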

2605.18701 2026-05-19 cs.LG q-bio.QM

Learning Normal Representations for Blood Biomarkers

Aashna P. Shah, Michelle M. Li, Yash Lal, Seffi Cohen, Liat F. Antwarg, Morgan Sanchez, James A. Diao, Chirag J. Patel, Ben Y. Reis, Ran D. Balicer, Noa Dagan, Arjun K. Manrai

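AI summary: This study proposes NORMA, a framework that generates more precise reference intervals by combining a patient's history with population-level data, improving individualized interpretation of blood biomarkers while avoiding the misdiagnosis risk that over-personalization brings.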

English abstract

Blood-based biomarkers underpin clinical diagnosis and management, yet their interpretation relies largely on fixed population reference intervals that ignore stable, intra-patient variability. As such, population-based interpretation can mask meaningful deviation from an individual's baseline, risking delayed disease detection. To remedy this, there have been increasing efforts to personalize blood biomarker interpretation using individual testing histories. However, these methods may overfit to sparse data, inflating false-positive rates and unnecessary follow-up, and can also unwittingly include unrecognized or subclinical disease. Here, we leverage nearly 2 billion longitudinal laboratory measurements from over 1.6 million individuals across North America, the Middle East, and East Asia, to show that while laboratory values are highly individual, purely personalized intervals routinely overfit, classifying up to 68% of measurements as abnormal, without corresponding associations with adverse clinical outcomes. We then introduce NORMA, a conditional transformer-based framework that generates reference intervals by conditioning on both a patient's history and population-level data about "normal" variation. NORMA-derived intervals achieve higher precision for predicting outcomes, including mortality, acute kidney injury, and chronic disease. These findings caution against over-personalization in laboratory medicine and demonstrate that anchoring individual trajectories to population-level priors outperforms either approach alone. To promote transparency, we publicly release the model, code, and an interactive user interface for accessible, individualized laboratory interpretation.

2605.18700 2026-05-19 cs.CV

A Large-Scale Study on the Accuracy vs Cost Trade-offs of Training and Evaluation Settings in Fine-Grained Image Recognition

Edwin Arkel Rios, Augusto Christian Surya, Oswin Gosal, Fernando Mikael, Mary Madeline Nicole, Kisoon Jang, Bo-Cheng Lai, Min-Chun Hu

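AI summary: Through more than 2,000 experiments, this paper examines the accuracy-versus-cost trade-offs of different training and evaluation settings, extends Counterfactual Attention Learning, and offers an efficient evaluation variant that lowers inference cost.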

Comments: Accepted to The 13th Workshop on Fine-Grained Visual Categorization (FGVC13) @ CVPR 2026. Main: 6 pages, 4 figures

English abstract

Prior work on fine-grained image recognition (FGIR) has established the importance of the backbone selection, but has neglected the accuracy-vs-cost trade-offs under different training and evaluation settings. In this work we conduct a large-scale study with over 2000 experiments across 6 training and evaluation settings, 9 pretrained backbones, and 17 datasets. Preliminary observations on the effectiveness of data augmentation for fine-grained training motivate us to extend Counterfactual Attention Learning (CAL), a state-of-the-art method based on data-aware cropping and masking augmentations, with cross-image discriminative region mixing augmentation. We also propose an efficient evaluation-only variant that maintains competitive accuracy while reducing inference costs by forfeiting the forward pass on discriminative crops that is normally used by CAL and similar FGIR methods. Our results show that data-aware augmentations during training only can enable a model to achieve excellent accuracy even without crops, significantly reducing inference costs. To support future research we share our code and checkpoints at: https://github.com/arkel23/FGIR-Backbones

2605.18698 2026-05-19 cs.HC

Contextualized Dynamic Explanations: A Vision

Zhicheng Liu, Jason H Li, Greg Briskin

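AI summary: This paper proposes Contextualized Dynamic Explanations (CODEX), an approach built on a dynamic audience model and predefined communicative intents, in which autonomous agents generate multimodal information interfaces to make data-driven explanations more effective.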

2605.18697 2026-05-19 cs.DC cs.AI cs.PL

PopPy: Opportunistically Exploiting Parallelism in Python Compound AI Applications

Stephen Mell, David Mell, Konstantinos Kallas, Steve Zdancewic, Osbert Bastani

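AI summary: This paper presents PopPy, a system that uncovers parallelization opportunities in Python applications that call external components, achieving up to 6.4x end-to-end speedups on compound AI applications while preserving sequential program semantics.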

English abstract

Compound AI applications, which compose calls to ML models using a general-purpose programming language like Python, are widely used for a variety of user-facing tasks, from software engineering to enterprise automation, making their end-to-end latency a critical bottleneck. In contrast to traditional applications, execution time is dominated by the external components, which cannot be handled by traditional language optimization systems, like optimizing compilers. To address this problem, we develop PopPy, a system that can uncover parallelization opportunities in Python applications that invoke these heavy external components, including those used in compound AI applications. PopPy supports a very expressive fragment of Python and requires minimal developer input to uncover parallelism. It combines an ahead-of-time compiler with a runtime, addressing three key challenges in extracting parallelism from Python applications: language complexity, dynamic dispatch, and variable mutation. On a set of real-world compound AI applications, PopPy achieves up to $6.4\times$ speedups in end-to-end execution time compared to standard Python execution while preserving the sequential program semantics.
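To make the kind of opportunity described above concrete, the sketch below parallelizes independent external calls by hand while keeping the output of the sequential program. It is only an illustration of the target behavior: `call_model` is a hypothetical stand-in for a slow external component, and the manual thread pool is not PopPy's mechanism, which according to the abstract finds such opportunities through an ahead-of-time compiler and a runtime.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List


def call_model(prompt: str) -> str:
    """Hypothetical stand-in for a slow external call such as an LLM request."""
    time.sleep(0.05)
    return prompt.upper()


def run_sequential(prompts: List[str]) -> List[str]:
    # Baseline: each external call blocks the next one.
    return [call_model(p) for p in prompts]


def run_parallel(prompts: List[str]) -> List[str]:
    # The calls do not depend on each other, so they can overlap.
    # Executor.map yields results in input order, which keeps the
    # observable output identical to the sequential version.
    with ThreadPoolExecutor(max_workers=len(prompts) or 1) as pool:
        return list(pool.map(call_model, prompts))


prompts = ["plan", "search", "summarize"]
print(run_parallel(prompts))
```

Writing this by hand is only safe when the programmer can see that the calls are independent; dynamic dispatch and variable mutation, two of the challenges the abstract names, are what make that hard to establish automatically.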

2605.18696 2026-05-19 cs.LG cs.AI

Ensembling Tabular Foundation Models - A Diversity Ceiling And A Calibration Trap

Aditya Tanna, Yash Desai, Pratinav Seth, Mohamed Bouadi, Nassim Bouarour, Vinay Kumar Sankarapu

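AI summary: This paper studies ensembling of tabular foundation models (TFMs). Although ensembling is the usual way to improve performance, the pool of modern TFMs is nearly redundant and some ensemble strategies do poorly on accuracy or calibration; the authors recommend greedy selection as the practical default.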

English abstract

Tabular foundation models (TFMs) now match or beat tuned gradient-boosted trees on a growing fraction of tabular tasks, but no single TFM wins on every dataset. Ensembling is the go-to fix here, and it works less well than expected. Six modern TFMs form a near-redundant pool: their mean pairwise Q-statistic is $0.961$, close enough to $1$ that any convex combination is bounded above. We benchmark six ensemble strategies over six TFMs on 153 OpenML classification tasks. The best ensemble, two-level cascade stacking, buys $+0.18\%$ accuracy over the strongest single TFM at $253\times$ the compute. A Friedman and Nemenyi analysis places three ensembles and the best base TFM in a single equivalence group; three other ensembles are significantly worse than the best base. Stacking with a logistic-regression meta-learner is the most striking case: competitive accuracy and ROC-AUC, but the worst log-loss rank among the ensembles. The meta-learner improves accuracy by sharpening class boundaries, which destroys calibration. We recommend greedy selection as the practical default.
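The pairwise Q-statistic cited above is a standard diversity measure (Yule's Q) computed from whether two classifiers are right or wrong on the same examples: Q = (N11 N00 - N01 N10) / (N11 N00 + N01 N10). The sketch below is a minimal illustration with hypothetical correctness vectors; how the paper aggregates Q across datasets is not stated in the abstract, so the plain mean over model pairs is an assumption.

```python
from itertools import combinations
from typing import List, Sequence


def q_statistic(correct_a: Sequence[bool], correct_b: Sequence[bool]) -> float:
    """Yule's Q for two classifiers, from per-example correctness indicators."""
    pairs = list(zip(correct_a, correct_b))
    n11 = sum(1 for a, b in pairs if a and b)          # both correct
    n00 = sum(1 for a, b in pairs if not a and not b)  # both wrong
    n10 = sum(1 for a, b in pairs if a and not b)      # only A correct
    n01 = sum(1 for a, b in pairs if not a and b)      # only B correct
    denom = n11 * n00 + n01 * n10
    if denom == 0:
        raise ValueError("Q is undefined when n11*n00 + n01*n10 == 0")
    return (n11 * n00 - n01 * n10) / denom


def mean_pairwise_q(correct: List[Sequence[bool]]) -> float:
    """Average Q over all unordered pairs of classifiers."""
    pairs = list(combinations(correct, 2))
    return sum(q_statistic(a, b) for a, b in pairs) / len(pairs)


# Hypothetical correctness vectors for three models on four examples.
model_a = [True, True, False, False]
model_b = [True, True, False, False]   # makes exactly the same errors as A
model_c = [True, False, True, False]   # errors unrelated to A's
print(q_statistic(model_a, model_b), q_statistic(model_a, model_c))
```

Q near 1 means the models fail on the same examples, which leaves an averaging ensemble little to correct; Q near 0 means their errors are roughly independent.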

2605.18693 2026-05-19 cs.AI

SkillGenBench: Benchmarking Skill Generation Pipelines for LLM Agents

Yifan Zhou, Zhentao Zhang, Ziming Cheng, Shuo Zhang, Qizhen Lan, Zhangquan Chen, Zhi Yang, Qianyu Xu, Ronghao Chen, Huacan Wang, Sen Hu

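AI summary: This paper introduces SkillGenBench, a benchmark that evaluates skill generation pipelines for LLM agents under a unified, controlled protocol. It covers two generation regimes (task-conditioned and task-agnostic) and two procedural sources (repository-grounded and document-grounded), and reveals how skill generation performance differs across these sources.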

English abstract

As LLM agents are increasingly built around reusable skills, a central challenge is no longer only whether agents can use provided skills, but whether they can generate correct, reusable, and executable skills from repositories and documents. Existing benchmarks primarily evaluate the efficacy of given skills or the ability of agents to solve downstream tasks from raw context, but they do not isolate skill generation itself as the object of study. We introduce SkillGenBench, a benchmark for evaluating skill generation pipelines under a unified and controlled protocol. In SkillGenBench, a generator receives raw corpora and produces standardized skill artifacts, which are then executed under fixed harnesses and assessed with unified evaluation procedures. The benchmark covers two generation regimes: task-conditioned generation, where a task-specific skill is synthesized after the task is revealed, and task-agnostic generation, where a reusable skill library must be distilled before downstream tasks are known. It also spans two complementary procedural sources: repository-grounded instances, where procedures are distributed across code, configuration, and scripts, and document-grounded instances, where procedures and constraints must be distilled from long-form text. We provide standardized task specifications, pinned environments, and evaluation protocols centered on deterministic execution-based checks, supplemented by auxiliary signals for diagnosis. Experiments across a range of skill-generation methods and backbones show substantial performance variation, highlight the difficulty of reusable skill distillation, and reveal distinct failure modes in skill generation from software repositories versus long-form documents. SkillGenBench establishes a reproducible testbed for studying skill generation as an independent research problem in agent systems.
