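arXivDaily: a daily academic digest synchronized with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.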

2605.16676 2026-05-19 cs.AI

GraViti: A Graph-Level Variational Autoencoder with Relaxed Permutation Invariance

Roman Bresson, Konstantinos Divriotis, Johannes F. Lutzeyer, Iakovos Evdaimon, Michalis Vazirgiannis

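AI summary: GraViti is a graph-level variational autoencoder that produces compact latent vectors supporting smooth interpolation and downstream tasks, going beyond node-level embeddings.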

Abstract

We introduce GraViti, a transformer-based graph-level variational autoencoder that maps entire graphs to compact latent vectors. This design produces a true graph-level latent space that supports smooth interpolation, property-guided search, and other downstream tasks beyond the constraints of node-level embeddings. On molecular benchmarks, GraViti learns to decode valid samples that follow the chemical constraints present in the training data, showing that the model recovers domain rules directly from graph-level representations. We also show that, in domains where a reliable canonical node ordering exists, such as molecules or Bayesian networks, enforcing permutation invariance can prove detrimental to consistent reconstruction. GraViti achieves state-of-the-art reconstruction accuracy on large datasets and provides solid generative performance. Its single-step decoding offers a lightweight alternative to more complex generation pipelines while maintaining practical sample quality.
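
The "smooth interpolation" that a graph-level latent space allows can be pictured with a minimal sketch. It assumes each graph has already been encoded to a fixed-length vector; the function name `interpolate_latents` and the example vectors are hypothetical and not taken from the paper, and decoding each intermediate vector back to a graph is left out.

```python
def interpolate_latents(z_start, z_end, num_steps):
    """Return num_steps vectors evenly spaced on the line from z_start to z_end."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    path = []
    for step in range(num_steps):
        t = step / (num_steps - 1)  # 0.0 at z_start, 1.0 at z_end
        path.append([(1.0 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path


# Hypothetical 4-dimensional latent codes of two encoded graphs.
z_a = [0.0, 1.0, -1.0, 2.0]
z_b = [1.0, 0.0, 1.0, 0.0]
path = interpolate_latents(z_a, z_b, num_steps=5)
```

With node-level embeddings there is no single vector per graph to interpolate, which is the limitation the abstract refers to.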

2605.16665 2026-05-19 cs.LG physics.geo-ph

AtlasVid: Efficient Ultra-High-Resolution Long Video Generation via Decoupled Global-Local Modeling

Ziyang Mai, Yuyao Zhang, Yu-Wing Tai

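AI summary: This paper proposes AtlasVid, a framework that uses decoupled modeling to make ultra-high-resolution long video generation more efficient, reporting a 60.9x speedup and lower training cost while outperforming native 4K generators.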

Abstract

Recent diffusion-based video generators have achieved remarkable visual fidelity and prompt controllability, yet scaling them to ultra-high-resolution (UHR) long videos remains prohibitively expensive. The difficulty is especially pronounced for long single-shot generation, where a continuous scene must preserve global temporal coherence and fine-grained spatial details without relying on clip transitions or autoregressive shot stitching. In this work, we revisit this challenge from the perspective of decoupled modeling. We argue that existing video diffusion models already encode strong local visual priors, while the main bottleneck lies in efficiently extending global spatiotemporal modeling as resolution and duration increase. Based on this insight, we propose AtlasVid, a decoupled global-local framework for efficient UHR long video generation. AtlasVid first generates a low-resolution and low-FPS global semantic proxy via temporally scaled RoPE, thereby extending the temporal horizon without increasing the training token count. Guided by this proxy, a high-resolution detail branch performs joint denoising with hierarchical locality-preserving attention. Reordered spatiotemporal windows preserve geometric locality, and asymmetric global-local attention injects aligned semantic guidance while preserving the model's pretrained ability. This design enables resolution-agnostic training: the model is trained only at 720P with lightweight LoRA adaptation, yet generalizes directly to 4K and beyond for longer (>10s) video synthesis. Experiments show that AtlasVid substantially improves the efficiency of UHR long video generation, achieving a 60.9x speedup, significantly lower training cost, and even better performance than native 4K video generators.
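
One plausible reading of "temporally scaled RoPE" is that the frame index is multiplied by a scale factor before the usual rotary rotation, so a low-FPS proxy frame lands at the position of the native-rate frame it stands for. The sketch below illustrates only that reading. The function names, the `time_scale` parameter, and the base of 10000 are assumptions for illustration, not details confirmed by the abstract.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Apply standard rotary position embedding to an even-length vector."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # rotation angle for this pair
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def temporally_scaled_rope(vec, frame_index, time_scale):
    """Rotate by frame_index * time_scale instead of frame_index (assumed form)."""
    return rope_rotate(vec, frame_index * time_scale)


# Hypothetical query vector; proxy frame 3 at a 4x temporal stride.
q = [1.0, 0.0, 0.5, -0.5]
scaled = temporally_scaled_rope(q, frame_index=3, time_scale=4.0)
native = rope_rotate(q, position=12)
```

Under this assumption the proxy covers a longer time span with the same number of tokens, consistent with the abstract's claim that the token count does not grow.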

2605.14963 2026-05-19 cs.CV

H-OmniStereo: Zero-Shot Omnidirectional Stereo Matching with Heading-Aligned Normal Priors

Chenxing Jiang, Zhe Tong, Pusen Gao, Peize Liu, Yang Xu, Chuan Fang, Ping Tan, Shaojie Shen

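AI summary: This paper proposes H-OmniStereo, which builds a high-quality synthetic dataset and introduces a heading-aligned normal estimator to address data scarcity and the degradation of perspective priors in omnidirectional stereo matching, yielding higher accuracy and cross-view consistency.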

Comments 8 pages, 9 figures

Abstract

Stereo matching on top-bottom equirectangular images provides an effective framework for full-surround perception, as vertically aligned epipolar lines enable the use of advanced perspective stereo architectures that are largely driven by large-scale datasets and monocular priors. However, the performance of such adaptations is severely limited by the scarcity of omnidirectional stereo datasets and the degradation of perspective monocular priors under spherical distortions. To address these challenges, we propose H-OmniStereo, a zero-shot omnidirectional stereo matching framework. First, we construct a high-quality synthetic dataset comprising over 2.8 million top-bottom equirectangular stereo pairs to scale up training. Second, we introduce an equirectangular monocular normal estimator, specifically operating in a heading-aligned coordinate system. Beyond providing distortion-robust and cross-view-consistent geometric priors for establishing reliable correspondences in stereo matching, this design boosts training efficiency and accommodates train-test FoV mismatches. Extensive experiments show that our approach achieves higher accuracy than existing methods on out-of-domain datasets and successfully generalizes to real-world consumer camera setups using a single model. The model and dataset will be released at https://github.com/JIANG-CX/H-OmniStereo.
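
As background on the equirectangular projection mentioned above, here is a minimal sketch of the usual pixel-to-ray mapping. The axis convention (x right, y up, z forward, longitude increasing left to right) and the function name are assumptions for illustration. Conventions differ between codebases, and this is not taken from the H-OmniStereo release.

```python
import math


def equirect_pixel_to_ray(u, v, width, height):
    """Map pixel (u, v) of an equirectangular image to a unit direction (x, y, z).

    Assumed convention: longitude runs from -pi to pi left to right, latitude
    from +pi/2 to -pi/2 top to bottom; x is right, y is up, z is forward.
    """
    lon = ((u + 0.5) / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - (v + 0.5) / height) * math.pi
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.cos(lon)
    return (x, y, z)


# Two pixels in the same column share a longitude and differ only in latitude.
ray_upper = equirect_pixel_to_ray(100, 50, width=1024, height=512)
ray_lower = equirect_pixel_to_ray(100, 400, width=1024, height=512)
```

Because every pixel in a column shares one longitude, a top-bottom camera pair with a vertical baseline yields the vertically aligned epipolar lines that the abstract refers to.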

2605.14854 2026-05-19 cs.CV cs.AI

FactorizedHMR: A Hybrid Framework for Video Human Mesh Recovery

Patrick Kwon, Chen Chen

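AI summary: This paper proposes FactorizedHMR, which handles different body parts with a deterministic regression module and a probabilistic flow-matching module, and combines a composite target representation with geometry-aware supervision to improve recovery of ambiguous parts, with gains on occlusion-heavy and drift-sensitive metrics.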

Abstract

Human Mesh Recovery (HMR) is fundamentally ambiguous: under occlusion or weak depth cues, multiple 3D bodies can explain the same image evidence. This ambiguity is not uniform across the body, as torso pose and root structure are often relatively well constrained, whereas distal articulations such as the arms and legs are more uncertain. Building on this observation, we propose FactorizedHMR, a two-stage framework that treats these two regimes differently. A deterministic regression module first recovers a stable torso-root anchor, and a probabilistic flow-matching module then completes the remaining non-torso articulation. To make this completion reliable, we combine a composite target representation with geometry-aware supervision and feature-aware classifier-free guidance, preserving the torso-root anchor while improving single-reference recovery of ambiguity-prone articulation. We also introduce a synthetic data pipeline that provides paired image-camera-motion supervision under diverse viewpoints. Across camera-space and world-space benchmarks, FactorizedHMR remains competitive with strong baselines, with the clearest gains in occlusion-heavy recovery and drift-sensitive world-space metrics.
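
For readers unfamiliar with classifier-free guidance in a flow-matching setting, the standard form is shown below. This is the generic formulation, not the paper's feature-aware variant, which the abstract does not specify. The symbols $v_\theta$ (learned velocity field), $c$ (conditioning features), $\varnothing$ (null condition), and $w$ (guidance weight) are notation introduced here for illustration.

```latex
\tilde{v}_\theta(x_t, t, c)
  = v_\theta(x_t, t, \varnothing)
  + w \,\bigl( v_\theta(x_t, t, c) - v_\theta(x_t, t, \varnothing) \bigr)
```

Setting $w = 1$ recovers the purely conditional velocity, and $w > 1$ pushes samples further toward the condition.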

2605.14504 2026-05-19 cs.AI

When Robots Do the Chores: A Benchmark and Agent for Long-Horizon Household Task Execution

Zilin Zhu, Longteng Guo, Yanghong Mei, Bowen Pang, Zongxun Zhang, Xingjian He, Ruyi Ji, Jing Liu

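AI summary: This paper proposes the LongAct benchmark and the HoloMind agent for evaluating high-level autonomy in long-horizon household tasks. Experiments show that HoloMind improves long-horizon performance while reducing reliance on model scale, but goal completion remains low, highlighting the difficulty of long-horizon planning.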

Abstract

Long-horizon household tasks demand robust high-level planning and sustained reasoning capabilities, which are largely overlooked by existing embodied AI benchmarks that emphasize short-horizon navigation or manipulation and rely on fixed task categories. We introduce LongAct, a benchmark designed to evaluate planning-level autonomy in long-horizon household tasks specified through free-form instructions. By abstracting away embodiment-specific low-level control, LongAct isolates high-level cognitive capabilities such as instruction understanding, dependency management, memory maintenance, and adaptive planning. We further propose HoloMind, a VLM-driven agent with a DAG-based long-horizon hierarchical planner, a Multimodal Spatial Memory for persistent world modeling, an Episodic Memory for experience reuse, and a global Critic for reflective supervision. Experiments with GPT-5 and Qwen3-VL models show that HoloMind substantially improves long-horizon performance while reducing reliance on model scale. Even top models achieve only 59% goal completion and 16% full-task success, underscoring the difficulty of LongAct and the need for stronger long-horizon planning in embodied agents.
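
The idea behind a DAG-based planner, in which subtasks are ordered so that every dependency is satisfied first, can be sketched with the Python standard library. The task names and the `plan_order` helper are hypothetical, and the sketch covers only dependency ordering. HoloMind's hierarchical planner, memories, and Critic are not modeled.

```python
from graphlib import TopologicalSorter

# Hypothetical subtasks for "make tea and serve it"; each key maps to its prerequisites.
dependencies = {
    "fill_kettle": set(),
    "boil_water": {"fill_kettle"},
    "get_cup": set(),
    "add_tea_bag": {"get_cup"},
    "pour_water": {"boil_water", "add_tea_bag"},
    "serve_tea": {"pour_water"},
}


def plan_order(deps):
    """Return one execution order that respects every dependency edge."""
    return list(TopologicalSorter(deps).static_order())


order = plan_order(dependencies)
```

`TopologicalSorter` raises `graphlib.CycleError` when the dependencies are circular, which gives a planner a cheap check that a proposed task graph really is a DAG.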

2605.14498 2026-05-19 cs.CL

GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations

Jingbo Yang, Kwei-Herng Lai, Xiaowen Wang, Shiyu Chang, Yaar Harari, Evgeniy Gabrilovich

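AI summary: This paper proposes GroupMemBench for evaluating the memory of LLM agents in multi-party conversations, revealing shortcomings of existing memory systems in group dynamics, belief tracking, and language adaptation.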

Polar Probes Linearly Decode Semantic Structure in LLMs

Pablo J. Diego-Simón, Pierre Orhan, Emmanuel Chemla, Yair Lakretz, Jean-Rémi King

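AI summary: The study uses polar probes to linearly recover semantic structures in LLMs, finding that the existence of entities and the types of relations between them are encoded by embedding distance and direction. The code is strongest in middle layers and generalizes to new entities, but degrades as semantic structures grow.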

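Abstract

How do artificial neural networks bind concepts to form complex semantic structures? This work proposes a simple neural code in which the existence of entities and the types of relations between them are represented by the distance and direction between embeddings. Tested across a variety of LLMs, polar probes linearly recover the true semantic structures; this code emerges mainly in the middle layers and improves as LLM performance increases. Polar probes generalize to new entities and relation types but degrade as the size of the semantic structure grows. The quality of the polar representation correlates with the LLM's ability to answer questions about semantic structures. These findings suggest that LLMs build complex semantic structures by binding representations according to simple geometric principles.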

NOFE - Neural Operator Function Embedding

Lars Uebbing, Harald L. Joakimsen, Siyan Chen, Georgios Leontidis, Kristoffer K. Wickstrøm, Michael C. Kampffmeyer, Sébastien Lefèvre, Arnt-Børre Salberg, Robert Jenssen

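AI summary: NOFE is a domain-aware framework for continuous dimensionality reduction that learns function-to-function mappings via a Graph Kernel Operator, enabling mesh-free evaluation and outperforming conventional methods in local structure preservation and sampling robustness.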

Comments 21 pages, 11 figures, 12 tables

Abstract

Most dimensionality reduction methods treat data as discrete point clouds, ignoring the continuous domain structure inherent to many real-world processes. To bridge this gap, we introduce Neural Operator Function Embedding (NOFE), a domain-aware framework for continuous dimensionality reduction. NOFE learns function-to-function mappings via a Graph Kernel Operator, enabling mesh-free evaluation at arbitrary query locations independent of input discretization. We establish NOFE as an approximation of sheaf-to-sheaf mappings, generalizing Sheaf Neural Networks to continuous domains. We evaluate NOFE across different datasets, comparing it against PCA, t-SNE, and UMAP. Our results demonstrate that NOFE significantly outperforms baselines in local structure preservation, achieving a local Stress of 0.111 compared to 0.398 for PCA, 0.773 for t-SNE, and 0.791 for UMAP on the ERA5 climate reanalysis dataset. NOFE also exhibits robust sampling independence, reducing the Patch Stitching Error by up to $20.0\times$ relative to UMAP (59.0 vs. 267.6 under regional normalization) and ensuring consistency across disjoint domain patches. While maintaining competitive global structure preservation (Stress-1: 0.379 vs. PCA's 0.268), NOFE resolves fine-grained structures and produces smooth, consistent embeddings that generalize across varying sample densities, addressing key limitations of discrete reduction methods.
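
The Stress figures quoted above measure how well pairwise distances survive the embedding. A minimal sketch of a Kruskal-style Stress-1 follows, normalized here by the original distances. Normalization conventions differ, the abstract does not give the paper's exact definition (nor that of its local variant), and the function names are hypothetical.

```python
import math


def pairwise_distances(points):
    """Euclidean distances for all unordered point pairs, in a fixed order."""
    n = len(points)
    return [math.dist(points[i], points[j]) for i in range(n) for j in range(i + 1, n)]


def kruskal_stress_1(original, embedded):
    """sqrt( sum (d_orig - d_emb)^2 / sum d_orig^2 ) over all point pairs."""
    d_orig = pairwise_distances(original)
    d_emb = pairwise_distances(embedded)
    numerator = sum((a - b) ** 2 for a, b in zip(d_orig, d_emb))
    denominator = sum(a ** 2 for a in d_orig)
    return math.sqrt(numerator / denominator)


# Hypothetical 3D points lying in the z = 0 plane, embedded by dropping z.
points_3d = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (3.0, 1.0, 0.0)]
points_2d = [(x, y) for x, y, _ in points_3d]
stress = kruskal_stress_1(points_3d, points_2d)
```

Lower is better: an embedding that preserves every pairwise distance scores zero, while one that collapses all points together scores one under this normalization.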

2605.11871 2026-05-19 cs.CV

$h$-control: Training-Free Camera Control via Block-Conditional Gibbs Refinement

Yuzhu Wang, Xi Ye, Duo Su, Yangyang Xu, Jun Zhu

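AI summary: This paper proposes $h$-control, which changes the structure of the sampler to address the inverse problem of camera control in training-free video generation, improving the balance between trajectory adherence and visual quality and achieving the best results on multiple datasets.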

Abstract

Training-free camera control for pretrained flow-matching video generators is a partial-observation inverse problem: a depth-warped guidance video supplies noisy evidence on a subset of latent sites, which the sampler must reconcile with the pretrained prior. Existing methods struggle to balance trajectory adherence against visual quality, and heuristic guidance-strength tuning lacks robustness. We propose $h$-control, which resolves this dilemma through a structural change to the sampler: each outer hard-replacement guidance step is augmented with an inner-loop block-conditional pseudo-Gibbs refinement on the unobserved complement at the same noise level, with provable convergence to the partial-observation conditional data law. To accelerate convergence on high-dimensional video latents, we exploit their conditional locality, partitioning the unobserved complement into 3D patches, each tracked by a custom mixing indicator that adaptively freezes converged patches. On RealEstate10K and DAVIS, $h$-control attains the best FVD against all seven training-free and training-based competitors, outperforming every training-free baseline on every reported metric.

2605.11599 2026-05-19 cs.LG

Targeted Tests for LLM Reasoning: An Audit-Constrained Protocol

Hongmin Li

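AI summary: This paper proposes an audit-constrained protocol for evaluating LLM reasoning and compares Component-Adaptive Prompt Sampling with uniform sampling, showing that targeted prompt variation can be studied under controlled conditions.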

Comments 17 pages, 1 figure

Abstract

Fixed reasoning benchmarks evaluate canonical prompts, but semantically valid changes in presentation can still change model behavior. Studies of prompt variation can reveal such failures, but without audit they can mix genuine model errors with invalid perturbations, extraction artifacts, and unmatched search procedures. We propose an audit-constrained protocol for targeted reasoning evaluation. Prompt variants are generated from a finite component grammar, rendered deterministically, evaluated under a fixed query budget, and counted as model errors only after semantic and extraction audit. Within this protocol we instantiate Component-Adaptive Prompt Sampling (CAPS), a score-based sampler over prompt components, and compare it with equal-budget uniform component sampling under the same task bank, renderer, model interface, decoding settings, and audit procedure. Across three audited slices, the protocol identifies confirmed model-error prompt keys while excluding formatting and extraction artifacts, but matched comparisons do not show that CAPS improves audited yield or unique prompt-key discovery over uniform sampling. The contribution is methodological: targeted prompt variation can be studied under a reconstructable, reviewable, budget-matched protocol, and proxy-guided policies should be judged by audited yield rather than raw mismatch counts or selected examples alone.
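
The protocol's main ingredients (a finite component grammar, deterministic rendering, a fixed query budget, and audit-gated error counting) can be sketched as follows. Every name here (`GRAMMAR`, `render`, `uniform_sample_keys`, `is_confirmed_error`) and the grammar contents are hypothetical illustrations, not the paper's code. Only the uniform baseline is shown, since the abstract does not describe how CAPS computes its scores.

```python
import itertools
import random

# Hypothetical finite component grammar: each slot has a fixed list of options.
GRAMMAR = {
    "instruction": ["Solve the problem.", "Answer the question below."],
    "format": ["Give only the final number.", "End with 'Answer: <number>'."],
    "order": ["question_first", "instruction_first"],
}


def render(key, question):
    """Deterministically render a prompt from a key (one option index per slot)."""
    parts = {slot: GRAMMAR[slot][i] for slot, i in zip(GRAMMAR, key)}
    if parts["order"] == "question_first":
        head = [question, parts["instruction"]]
    else:
        head = [parts["instruction"], question]
    return "\n".join(head + [parts["format"]])


def uniform_sample_keys(budget, seed):
    """Draw `budget` distinct prompt keys uniformly; the seed makes it reproducible."""
    all_keys = list(itertools.product(*(range(len(opts)) for opts in GRAMMAR.values())))
    return random.Random(seed).sample(all_keys, budget)


def is_confirmed_error(answer_mismatch, semantic_audit_ok, extraction_audit_ok):
    """Count a mismatch as a model error only if both audits pass."""
    return answer_mismatch and semantic_audit_ok and extraction_audit_ok


keys = uniform_sample_keys(budget=4, seed=0)
prompts = [render(key, "What is 17 + 25?") for key in keys]
```

Because prompt keys are tuples of indices and rendering is a pure function, any reported failure can be reconstructed exactly from its key, which is what makes the comparison between samplers reviewable.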

2605.11518 2026-05-19 cs.AI cs.CL cs.LG

AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration - Learning from Cheap, Optimizing Expensive

Taicheng Guo, Nitesh V. Chawla, Olaf Wiest, Xiangliang Zhang

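AI summary: This paper proposes AutoLLMResearch, a framework that learns LLM configuration principles in a multi-fidelity experimental environment to automate the configuration of high-cost experiments, and demonstrates its effectiveness and generality on large-scale LLM experiments.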

Abstract

Effectively configuring scalable large language model (LLM) experiments, spanning architecture design, hyperparameter tuning, and beyond, is crucial for advancing LLM research, as poor configuration choices can waste substantial computational resources and prevent models from realizing their full potential. Prior automated methods are designed for low-cost settings where repeated trial and error is feasible, but scalable LLM experiments are too expensive for such extensive iteration. To our knowledge, no work has addressed the automation of high-cost LLM experiment configurations, leaving this problem labor-intensive and dependent on expert intuition. Motivated by this gap, we propose AutoLLMResearch, an agentic framework that mimics how human researchers learn generalizable principles from low-fidelity experiments and extrapolate to efficiently identify promising configurations in expensive LLM settings. The core challenge is how to enable an agent to learn through interaction with a multi-fidelity experimental environment that captures the structure of the LLM configuration landscape. To achieve this, we propose a systematic framework with two key components: 1) LLMConfig-Gym, a multi-fidelity environment encompassing four critical LLM experiment tasks, supported by over one million GPU hours of verifiable experiment outcomes; 2) a structured training pipeline that formulates configuration research as a long-horizon Markov Decision Process and accordingly incentivizes cross-fidelity extrapolation reasoning. Extensive evaluation against diverse strong baselines on held-out experiments demonstrates the effectiveness, generalization, and interpretability of our framework, supporting its potential as a practical and general solution for scalable real-world LLM experiment automation.

2605.11480 2026-05-19 cs.LG

Efficient Adjoint Matching for Fine-tuning Diffusion Models

Jeongwoo Shin, Dongsoo Shin, Yuchen Zhu, Wei Guo, Yongxin Chen, Joonseok Lee, Jaewoong Choi, Jaemoo Choi

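AI summary: This paper proposes Efficient Adjoint Matching (EAM), which removes the computational bottlenecks of Adjoint Matching in diffusion model fine-tuning by switching to a linear base drift and a modified terminal cost, converging up to 4x faster and performing strongly across multiple metrics.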

Abstract

Reward fine-tuning has become a common approach for aligning pretrained diffusion and flow models with human preferences in text-to-image generation. Among reward-gradient-based methods, Adjoint Matching (AM) provides a principled formulation by casting reward fine-tuning as a stochastic optimal control (SOC) problem. However, AM incurs substantial computational cost: it requires (i) stochastic simulation of full generative trajectories under memoryless dynamics, resulting in a large number of function evaluations, and (ii) backward ODE simulation of the adjoint state along each sampled trajectory. In this work, we observe that both bottlenecks are closely tied to the non-trivial base drift inherited from the pretrained model. Motivated by this observation, we propose Efficient Adjoint Matching (EAM), which substantially improves training efficiency by reformulating the SOC problem with a linear base drift and a correspondingly modified terminal cost. This reformulation removes both sources of inefficiency: it enables training-time sampling with a few-step deterministic ODE solver and yields a closed-form adjoint solution that eliminates backward adjoint simulation. On standard text-to-image reward fine-tuning benchmarks, EAM converges up to 4x faster than AM and matches or surpasses it across various metrics including PickScore, ImageReward, HPSv2.1, CLIPScore and Aesthetics.

2605.11208 2026-05-19 cs.CV

Hi-GaTA: Hierarchical Gated Temporal Aggregation Adapter for Surgical Video Report Generation

Kedi Sun, Chaohui Dang, Yue Feng, James Glasbey, Theodoros N. Arvanitis, Le Zhang

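AI summary: This paper proposes the Hi-GaTA framework, which uses temporal aggregation to compress long video sequences into LLM-compatible visual prefix tokens and combines a pretrained surgery-specific video encoder with LoRA fine-tuning to generate high-quality surgical reports.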

Comments 11 pages, 2 figures

Abstract

Automated, clinician-grade assessment reports for surgical procedures could reduce documentation burden and provide objective feedback, yet remain challenging due to the difficulty of aligning dense spatio-temporal video representations with language-based reasoning and the scarcity of high-quality, privacy-preserving datasets. To address this gap, we establish a benchmark comprising 214 high-quality simulated surgical videos paired with surgeon-authored evaluation reports. Building on this resource, we propose a Perception-Alignment-Reasoning framework for surgical video report generation, featuring Hi-GaTA, a novel lightweight temporal adapter that efficiently compresses long video sequences into compact, LLM-compatible visual prefix tokens through short-to-long-range temporal aggregation. For robust visual perception, we pretrain Sur40k, a surgical-specific ViViT-style video encoder on 40,000 minutes of public surgical videos to capture fine-grained spatio-temporal procedural priors. Hi-GaTA employs a temporal pyramid with text-conditioned dual cross-attention, and improves multi-scale consistency through cross-level gated fusion and an increasing-depth strategy. Finally, we fine-tune the LLM backbone using LoRA to enable coherent and stylistically consistent surgical report generation under limited supervision. Experiments show our approach achieves the best overall performance, with consistent gains over strong Multimodal Large Language Model (MLLM) baselines. Ablation studies further validate the effectiveness of each proposed component.
