arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.22191 2026-05-22 cs.LG cs.IT math.IT

Bandit Convex Optimization with Gradient Prediction Adaptivity

Shuche Wang, Adarsh Barik, Vincent Y. F. Tan

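AI summary: This paper studies whether optimistic gradient predictions can improve worst-case regret guarantees in a prediction-adaptive manner. It proposes Two-Point Variance-Reduced Optimistic Gradient Descent (TP-VR-OPT) for the two-point feedback setting, whose gradient estimator has variance that scales with the prediction error, yielding an $O(\sqrt{d\,\mathbb{E}[S_T]})$ regret bound. It also establishes an information-theoretic lower bound showing that the algorithm is optimal up to a factor of $\sqrt{d}$.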

Abstract

Bandit convex optimization (BCO) is a fundamental online learning framework with partial feedback, where the learner observes only the loss incurred at the chosen decision point in each round. In this work, we investigate whether optimistic gradient predictions can improve worst-case regret guarantees in a prediction-adaptive manner. Specifically, given gradient predictions $m_t$, we seek regret bounds that scale with the cumulative prediction error $S_T=\sum_{t=1}^T \|\nabla f_t(x_t)-m_t\|^2$. We first establish a negative result: under the single-point feedback protocol, an unavoidable $\Omega(\sqrt{T})$ regret lower bound persists even when $S_T=o(T)$, showing that the variance of gradient estimation fundamentally obscures the benefit of accurate predictions. To overcome this barrier, we propose Two-Point Variance-Reduced Optimistic Gradient Descent (TP-VR-OPT) for the two-point feedback setting. The key idea is a novel variance-reduced gradient estimator whose variance scales with the prediction error rather than the gradient norm. This yields a regret bound of $O\big(\sqrt{d\,\mathbb{E}[S_T]}\big)$, where $d$ is the decision dimension. Complementing this result, we establish an information-theoretic lower bound that scales as $\Omega(\sqrt{\mathbb{E}[S_T]})$, providing a fundamental characterization of the best achievable prediction-adaptive regret and showing that TP-VR-OPT is optimal up to a factor of $\sqrt d$. We further develop adaptive variants that eliminate the need for prior knowledge of $\mathbb{E}[S_T]$ or the horizon $T$, and extend our framework to non-stationary environments, establishing dynamic regret guarantees that adapt simultaneously to the cumulative prediction error and the comparator path length.
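
To make the variance-reduction idea concrete, here is a minimal sketch of a two-point estimator that uses the prediction $m_t$ as a control variate. The exact estimator in the paper is not given in the abstract, so the formula below, the function names, and the choice of a uniform direction on the unit sphere are all assumptions made for illustration.

```python
import math
import random


def unit_sphere_sample(d, rng):
    """Draw a direction uniformly from the unit sphere in R^d."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def two_point_estimate(f, x, m, delta, rng):
    """Two-point gradient estimate with the prediction m as a control variate.

    Hypothetical form (an assumption, not taken from the paper):
        g = m + d / (2*delta) * (f(x + delta*u) - f(x - delta*u) - 2*delta*<m, u>) * u
    """
    d = len(x)
    u = unit_sphere_sample(d, rng)
    x_plus = [xi + delta * ui for xi, ui in zip(x, u)]
    x_minus = [xi - delta * ui for xi, ui in zip(x, u)]
    # Loss difference that the prediction m alone would explain.
    predicted_diff = 2.0 * delta * sum(mi * ui for mi, ui in zip(m, u))
    # Only the part of the observed difference not explained by m is rescaled.
    residual = f(x_plus) - f(x_minus) - predicted_diff
    scale = d / (2.0 * delta)
    return [mi + scale * residual * ui for mi, ui in zip(m, u)]
```

In this sketch, a linear loss with a perfect prediction gives a zero residual, so the estimate equals $m_t$ for every sampled direction. That is the sense in which the random part of the estimate is driven by the prediction error rather than by the gradient norm.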

2605.22190 2026-05-22 cs.CV

No Pose, No Problem in 4D: Feed-Forward Dynamic Gaussians from Unposed Multi-View Videos

Matteo Balice, Yanik Kunzi, Chenyangguang Zhang, Matteo Matteucci, Marc Pollefeys, Sungwhan Hong

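AI summary: This paper proposes NoPo4D, the first pose-free feed-forward system that jointly handles dynamic content, multi-view input, and unknown camera poses. It relies on a velocity decomposition and a bidirectional motion encoder, and it outperforms prior feed-forward baselines.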

Comments https://bralani.github.io/nopo4d_html/

Abstract

Recent feed-forward 3D Gaussian splatting methods have made dramatic progress on individual aspects of 3D scene reconstruction, but no existing method jointly addresses dynamic content, multi-view input, and unknown camera poses in a single feed-forward pass. Methods that handle dynamics either require accurate camera poses or accept only monocular input; pose-free multi-view methods address only static scenes; and per-scene optimization methods bridge some of these gaps but at minutes-to-hours cost per scene. We introduce NoPo4D, the first feed-forward system that addresses this empty quadrant. Building on a pretrained geometry backbone and recent 4D Gaussian frameworks, NoPo4D introduces a velocity decomposition that splits Gaussian motion into per-pixel image-plane shifts and depth changes, allowing direct supervision from pseudo ground-truth optical flow on the 2D component. This sidesteps both the differentiable rendering that couples prior posed methods to pose accuracy and the 3D motion ground truth that prior pose-free methods require. The system is rounded out by a bidirectional motion encoder for cross-view and cross-frame feature aggregation, and view-dependent opacity that mitigates cross-view and cross-timestep Gaussian misalignments. On four multi-view dynamic benchmarks, NoPo4D consistently outperforms prior feed-forward baselines, and with an optional post-optimization stage surpasses per-scene optimization methods, while running orders of magnitude faster.

2605.22189 2026-05-22 cs.RO

Learning A Unified Risk Map for Autonomous Driving in Partially Observable Environments

Jie Jia, Yaofeng Su, Zeyu Bao, Yun Hong, Bingzhao Gao, Zhongxue Gan, Wenchao Ding

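AI summary: This paper proposes a unified risk map modeling and learning framework for autonomous driving in partially observable environments. It integrates traffic flow risk and collision risk through spatiotemporal modeling for fine-grained assessment of occlusion-induced hazards, and it introduces a diffusion-based scenario generation framework to address the scarcity of occluded-interaction scenarios. Experiments on the Waymo Open Motion Dataset show that it significantly outperforms existing methods.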

Comments Published in IEEE Robotics and Automation Letters

Abstract

Occlusion-aware prediction remains a critical challenge in autonomous driving due to the inherent uncertainty of unobserved regions. Existing approaches either overestimate risk based on reachable states or struggle to predict accurate trajectories under high occlusion uncertainty. To address these limitations, we propose a unified risk map modeling and learning framework for partially observable environments. Our method integrates traffic flow risk and collision risk through spatiotemporal modeling, enabling fine-grained assessment of occlusion-induced hazards. To address the scarcity of scenarios involving occluded interactions, we introduce a diffusion-based scenario generation framework that produces realistic yet adversarial scenarios. We integrate the modeling and learning of a unified risk map into a framework that supports risk-aware planning under partial observability. Experiments on the Waymo Open Motion Dataset show that our method significantly outperforms the state-of-the-art occlusion-aware baseline, improving minimum time-to-collision by 0.78 times and average time-to-collision by 1.67 times. The proposed framework offers a comprehensive and practical solution for risk-aware planning in partially observable environments.

2605.22188 2026-05-22 cs.LG math.OC stat.ML

From Sequential Nodes to GPU Batches: Parallel Branch and Bound for Optimal $k$-Sparse GLMs

Jiachang Liu, Andrea Lodi

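AI summary: This paper proposes a CPU-GPU framework that processes branch-and-bound nodes in batches on GPUs, substantially accelerating optimization problems with discrete variables, combinatorial structure, and nonlinear objectives, such as certifying optimal solutions for cardinality-constrained generalized linear models.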

Abstract

GPUs have significantly accelerated first-order methods for large-scale optimization, especially in continuous optimization. However, this success has not transferred cleanly to problems with discrete variables, combinatorial structure, and nonlinear objectives, such as certifying optimal solutions for cardinality-constrained generalized linear models. Major challenges include the sequential processing of heterogeneous nodes in branch and bound (BnB) and frequent data movement between the CPU and GPU. We propose a simple, generic, and modular CPU-GPU framework that processes multiple BnB nodes in batches on GPUs. The framework is built around a small set of GPU-efficient routines and uses padding together with lightweight custom kernels to handle irregular node data structures. Experiments show one to two orders of magnitude speedups and zero optimality gap on challenging instances. The framework can also be extended to collect the entire Rashomon set, enabling downstream statistical analysis such as variable-importance analysis and model selection under secondary user-specific measures (e.g., AUC in classification).

2605.22186 2026-05-22 cs.CV

Event-Illumination Collaborative Low-light Image Enhancement with a High-resolution Real-world Dataset

Senyan Xu, Zhijing Sun, Kean Liu, Xin Lu, Ruixuan Jiang, Mingyang Huang, Xueyang Fu, Zheng-Jun Zha

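AI summary: This paper proposes EIC-LIE, a framework that combines an event-illumination collaborative interaction module with an illumination-aware event filter. It targets two weaknesses of event-based low-light image enhancement: neglect of global illumination and sensitivity to real-world event noise. The authors also build the first high-resolution real-world event-based dataset, and experiments show that the method outperforms existing methods on multiple datasets.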

Abstract

Event-based low-light image enhancement (LIE) methods mainly focus on incorporating high dynamic range (HDR) information from events while overlooking the essential global illumination in images and the inherent noise sensitivity of event signals in real-world scenarios. To address these issues, we propose EIC-LIE, an event-illumination collaborative LIE framework. Concretely, we first design an Event-Illumination Collaborative Interaction (EICI) module, which contains two key processes: forward gathering, which gathers HDR features across varying lighting conditions, and backward injection, which provides complementary content for illumination and event representations. Next, we introduce an Illumination-aware Event Filter (IAEF) that dynamically reduces event noise based on brightness statistics derived from images. Additionally, we build a beam-splitter-based hybrid imaging system to collect high-quality event-image pairs with temporal synchronization from dynamic scenes, providing the first high-resolution, real-world event-based LIE dataset. Extensive experiments show that our EIC-LIE outperforms state-of-the-art methods on five real-world and synthetic datasets, significantly surpassing previous methods with improvements of up to 1.24 dB in PSNR and 0.069 in SSIM. The code and dataset are released at https://github.com/QUEAHREN/EIC-LIE.

2605.22185 2026-05-22 cs.CV cs.LG

Enhancing Multimodal Large Language Models for Safety-Critical Driving Video Analysis

Tomaso Trinci, Henrique Piñeiro Monteagudo, Leonardo Taccari

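AI summary: This work improves the perception and reasoning of multimodal large language models in safety-critical driving scenarios by fusing downsampled video frames with synchronized high-frequency telematics data and semantic information from specialized computer vision models, so that the models identify and describe safety-critical events in real-world driving more accurately.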

Comments Accepted at the 2026 IEEE International Conference on Intelligent Transportation Systems (ITSC 2026)

Abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in general visual understanding. However, their application to safety-critical driving scenarios remains limited by an inability to accurately perceive and reason about rare high-stakes dynamic events, such as collisions or near-collisions. To address this, we introduce a pipeline that enhances MLLM perception by fusing downsampled video frames with synchronized high-frequency telematics data (IMU and GPS) and semantic insights from specialized computer vision models. Our pipeline generates high-quality pseudo-labels, including descriptive captions and question-answer pairs, specifically designed to train MLLMs to identify and describe Safety-Critical Events (SCEs) in real-world driving footage. We show the effectiveness of our approach by fine-tuning the open-source QwenVL-2.5 model with DoRA adapters: our experiments demonstrate significant improvements in identifying and explaining safety-critical events, with fewer than 50M trainable parameters and a limited computational budget.

2605.22182 2026-05-22 cs.LG

IKNO: Infinite-order Kernel Neural Operators

Pengyuan Zhu, Ivor W. Tsang, Yueming Lyu

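AI summary: This paper proposes IKNO, a neural operator built from infinite-order kernel integrals, to address the limited expressivity of existing models that rely on first-order kernel integral approximations. Two complementary constructions provide efficient global information aggregation, and the method achieves state-of-the-art accuracy on multiple benchmark datasets.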

Abstract

Neural operators have achieved significant success in modern scientific computing due to their flexibility and strong generalization capabilities. Existing models, however, primarily rely on first-order kernel integral approximations, which severely limit their expressivity. To address this, we propose the Infinite-order Kernel Neural Operator (IKNO), which constructs neural operators via infinite-order kernel integrals and admits an elegant closed-form finite approximation. We develop two complementary infinite-order neural operator constructions: IKNO-Vanilla, which applies the full-kernel resolvent on the product grid via Kronecker eigendecomposition, and IKNO-TP, an alternative tensor-product operator that composes per-axis resolvents. Furthermore, we develop fast computation schemes for both variants of IKNO, which achieve outstanding global information aggregation while maintaining high computational efficiency. Empirically, we evaluate our IKNO on both time-dependent and time-independent benchmarks with arbitrary input shapes, including large-scale industrial datasets. Extensive experiments demonstrate that the IKNO method consistently achieves the SOTA accuracy with significant improvements on nearly all benchmark datasets while maintaining scalability to very large point clouds.
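
One plausible reading of how an infinite-order kernel integral admits a closed form is the Neumann series, which sums to a resolvent. The abstract does not spell out the construction, so the derivation below is a sketch under stated assumptions: a discretized kernel $K$ with spectral radius below one, and (for the product-grid case) a Kronecker-product kernel $K = K_1 \otimes K_2$ with symmetric factors $K_i = U_i \Lambda_i U_i^\top$. These symbols are introduced here for illustration and are not taken from the paper.

```latex
% Infinite-order sum collapses to a resolvent when the spectral radius of K is below 1.
\sum_{k=0}^{\infty} K^{k} = (I - K)^{-1}, \qquad \rho(K) < 1 .

% Kronecker structure: eigendecompose each axis factor once,
% K_1 \otimes K_2 = (U_1 \otimes U_2)(\Lambda_1 \otimes \Lambda_2)(U_1 \otimes U_2)^{\top},
% so the resolvent only needs a diagonal inverse on the product grid.
(I - K_1 \otimes K_2)^{-1}
  = (U_1 \otimes U_2)\,\bigl(I - \Lambda_1 \otimes \Lambda_2\bigr)^{-1}\,(U_1 \otimes U_2)^{\top} .
```

Under these assumptions, the cost is dominated by per-axis eigendecompositions rather than by a factorization of the full product-grid matrix, which is consistent with the abstract's emphasis on fast computation.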

2605.22177 2026-05-22 cs.LG cs.CL

Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles

Jinyang Wu, Guocheng Zhai, Ruihan Jin, Yuhao Shen, Zhengxi Lu, Fan Zhang, Haoran Luo, Zheng Lian, Zhengqi Wen, Jianhua Tao

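AI summary: This paper proposes Maestro, a framework that uses reinforcement learning to orchestrate hierarchical model-skill ensembles for multimodal tasks, improving task performance with an efficient and generalizable orchestration policy.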

Abstract

The proliferation of large language models (LLMs) and modular skills has endowed autonomous agents with increasingly powerful capabilities. Existing frameworks typically rely on monolithic LLMs and fixed logic to interface with these skills. This gives rise to a critical bottleneck: different LLMs offer distinct advantages across diverse domains, yet current frameworks fail to exploit the complementary strengths of models and skills, thereby limiting their performance on downstream tasks. In this paper, we present Maestro (Multimodal Agent for Expert-Skill Targeted Reinforced Orchestration), a Reinforcement Learning (RL)-driven orchestration framework that reframes heterogeneous multimodal tasks as a sequential decision-making process over a hierarchical model-skill registry. Rather than consolidating all knowledge into a single model, Maestro trains a lightweight policy to dynamically compose ensembles of frozen expert models and a two-tier skill library, deciding at each step whether to invoke an external expert, which model-skill pair to select, and when to terminate. The policy is optimized via outcome-based RL, requiring no step-level supervision. We evaluate Maestro across ten representative multimodal benchmarks spanning mathematical reasoning, chart understanding, high-resolution perception, and domain-specific analysis. With only a 4B orchestrator, Maestro achieves an average accuracy of 70.1%, surpassing both GPT-5 (69.3%) and Gemini-2.5-Pro (68.7%). Crucially, the learned coordination policy generalizes to unseen models and skills without retraining: augmenting the registry with out-of-domain experts yields a 59.5% average on four challenging benchmarks, outperforming all closed-source baselines. Maestro further maintains high computational efficiency with low latency. The source code is available at https://github.com/jinyangwu/Maestro.

2605.22176 2026-05-22 cs.AI

LLM-Metrics: Measuring Research Impact Through Large Language Model Memory

Si Shen, Wenhua Zhao, Danhao Zhu

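AI summary: This paper proposes LLM-Metrics, a research-impact metric based on the parametric memory of large language models. Using multiple-choice probes, it evaluates 549 computer science papers published in 2023-2024 and finds a significant correlation with citation counts, consistent with the hypothesis that high-impact papers receive more exposure in the academic community and therefore leave stronger parametric memory through LLM training data.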

Comments 25pages, 5figures

Abstract

Citation counts remain the dominant metric for assessing research impact, yet they suffer from well-documented limitations: temporal lag, disciplinary bias, and Matthew effects. Here we propose LLM-Metrics, a research-impact assessment metric derived from the parametric memory of large language models (LLMs). The central hypothesis is that high-impact papers receive greater exposure in the academic community, that this exposure enters LLM training data in textual form, and that models consequently form stronger parametric memory of these papers. We designed four types of multiple-choice probes, covering title recognition, author recognition, method recognition, and venue recognition, and evaluated 549 computer science papers published in 2023-2024 across 17 LLMs spanning 0.5B to 72B parameters from six vendors. Of the 17 models, 15 produced positive predictions, 9 of which were significant at p < 0.05, with an overall Spearman correlation of rho = 0.1495 and p = 0.0004 against citation counts. Three additional findings support the proposed mechanism. First, the predictive signal was stronger for 2024 papers, rho = 0.1880, whose citation counts were near zero at model-training time, reducing the plausibility of a simple reverse-causality explanation. Second, author-recognition probes showed the strongest discriminative power, consistent with an exposure-driven memory mechanism. Third, model scale and predictive power were non-monotonic: a 3B-parameter model, Llama-3.2-3B-Instruct, with rho = 0.1829, outperformed most larger models, supporting a selective-memory hypothesis in which the limited capacity of smaller models can serve as an effective information filter. LLM-Metrics offers a real-time, cross-disciplinary, citation-independent paradigm for research assessment.
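
For readers unfamiliar with the statistic reported above, the following sketch computes a Spearman rank correlation between per-paper probe scores and citation counts. It is a generic textbook implementation (Pearson correlation of tie-averaged ranks), not the authors' evaluation code, and the names `scores` and `citations` are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equal-length sequences with nonzero variance."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def spearman_rho(scores, citations):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(scores), average_ranks(citations))
```

Because only ranks enter the computation, heavy-tailed citation counts do not dominate the statistic the way they would in a Pearson correlation on raw values.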

2605.22170 2026-05-22 cs.CL

Do Factual Recall Mechanisms Carry over from Text to Speech in Multimodal Language Models?

Luca Modica, Filip Landin, Mehrdad Farahani, Livia Qian, Gabriel Skantze, Richard Johansson

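AI summary: This study examines whether factual recall mechanisms in multimodal language models carry over between the text and speech modalities, using causal mediation analysis to reveal differences between speech-to-text and text-to-text processing in how facts are stored and recalled.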

Comments In *SEM 2026, the 15th Joint Conference on Lexical and Computational Semantics

Abstract

In recent years, several Speech Language Models (SLMs) that represent speech and written text jointly have been presented. This raises the question of how model-internal mechanisms are similar and different when operating in the two modalities. We focus on how these systems encode, store, and retrieve factual knowledge, which has previously been investigated for text-only models. To investigate mechanisms behind the storage and recall of factual associations in SLMs, we leverage Causal Mediation Analysis, a technique previously applied to text-based models. Initial results using SpiritLM, a multimodal model integrating discrete speech tokens, reveal discrepancies between text-to-text and speech-to-text results, suggesting that the emergent mechanisms for factual recall are only partially carried over from the text to the speech modality. These results advance our understanding of how internal mechanisms encode factual associations in SLMs while contributing insights for improving speech-enabled AI systems.

2605.22169 2026-05-22 cs.CV

Balancing Uncertainty and Diversity of Samples: Leveraging Diversity of Least, High Confidence Samples for Effective Active Learning

Vipul Arya, S. H. Shabbeer Basha, Srikrishna U N, Sunainha Vijay, Snehasis Mukherjee

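AI summary: This paper proposes new hybrid sampling methods that select both easy and hard samples while enforcing diversity, in order to improve active learning. Experiments show that the proposed Least Confident and Diverse (LCD) method outperforms existing methods; selecting uncertain and diverse instances helps the model learn more distinct features.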

Abstract

Deep learning models, including Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), have achieved state-of-the-art performance on various computer vision tasks such as object classification, detection, segmentation, generation, and many more. However, these models are data-hungry as they require more training data to learn millions or billions of parameters. Especially for supervised learning tasks, curating a large number of labeled samples for model training is an expensive and time-consuming task. Active Learning (AL) has been used to address this problem for many years. Existing active learning methods aim at choosing the samples for annotation from a pool of unlabeled samples that are either diverse or uncertain. Choosing such samples may hinder the model's performance as we pool based on one dimension, i.e., either diverse or uncertain. In this paper, we propose four novel hybrid sampling methods for pooling both easy and hard samples, which are also diverse. To verify the efficacy of the proposed methods, extensive experiments are conducted using high and low-confidence samples separately. We observe from our experiments that the proposed hybrid sampling method, Least Confident and Diverse (LCD), consistently performs better compared to state-of-the-art methods. It is observed that selecting uncertain and diverse instances helps the model learn more distinct features. The codes related to this study will be available at https://github.com/XXX/LCD.
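
As an illustration of combining uncertainty with diversity, the sketch below shortlists the least confident unlabeled samples and then picks a spread-out subset by greedy farthest-point selection. The abstract does not describe how LCD is actually implemented, so this two-stage procedure, the function name, and its parameters (`pool_size`, `budget`) are assumptions, not the authors' method.

```python
import math


def least_confident_and_diverse(probs, features, pool_size, budget):
    """Pick `budget` indices: shortlist by low confidence, then spread out greedily.

    probs[i]    -- predicted class probabilities for unlabeled sample i
    features[i] -- feature vector for sample i
    Assumes at least one unlabeled sample and pool_size >= 1.
    """
    # Confidence is the largest predicted class probability.
    confidence = [max(p) for p in probs]
    shortlist = sorted(range(len(probs)), key=lambda i: confidence[i])[:pool_size]

    # Seed with the least confident sample, then repeatedly add the shortlisted
    # sample whose nearest already-selected neighbor is farthest away.
    selected = [shortlist[0]]
    while len(selected) < min(budget, len(shortlist)):
        best, best_dist = None, -1.0
        for i in shortlist:
            if i in selected:
                continue
            dist = min(math.dist(features[i], features[j]) for j in selected)
            if dist > best_dist:
                best, best_dist = i, dist
        selected.append(best)
    return selected
```

The shortlist step controls how strongly uncertainty dominates: a `pool_size` equal to `budget` reduces to plain least-confidence sampling, while a larger pool gives the diversity step more room to avoid near-duplicate samples.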

2605.22168 2026-05-22 cs.AI cs.LG

Measuring Cross-Modal Synergy: A Benchmark for VLM Explainability

Joël Roman Ky, Salah Ghamizi, Maxime Cordy

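AI summary: This paper proposes Synergistic Faithfulness, a metric for cross-modal synergy in VLMs that addresses the shortcomings of unimodal evaluation of VLM explainability. Built on the Shapley Interaction Index, it measures multimodal synergy accurately while reducing computational cost.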

Abstract

Vision-Language Models (VLMs) map complex visual inputs to semantic spaces, but interpreting the cross-modal reasoning of VLMs currently relies on post-hoc explainers evaluated via unimodal perturbation metrics. We expose a limitation in this paradigm: because multimodal datasets contain language priors and modality biases, VLMs frequently exhibit cross-modal redundancy, allowing them to answer visual queries using text alone. Consequently, unimodal metrics penalize faithful explainers, triggering an evaluation collapse where visual and textual rankings fundamentally contradict each other (Kendall's $\tau = -0.06$). To resolve this, we introduce Synergistic Faithfulness ($\mathcal{F}_{syn}$), a scalable metric rooted in the Shapley Interaction Index that strictly isolates the joint Harsanyi dividend between modalities, serving as a highly accurate surrogate ($\rho = 0.92$) while achieving a $24\times$ computational speedup. Evaluating 8 distinct XAI methods across 3 VLM architectures and 3 benchmark datasets reveals that explainers proposed for VLMs heavily over-index on visual salience and significantly underperform adapted attention-based methods in capturing true cross-modal synergy. By decoupling visual plausibility from cross-modal faithfulness, this work provides a rigorous evaluation framework required to safely audit VLM reasoning in high-stakes deployments.
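
The joint Harsanyi dividend mentioned above has a simple closed form when the game has exactly two players, here taken to be the image and the text. The sketch below shows that two-player case only; how the paper defines the value function and scales the computation is not stated in the abstract, so the function name and the interpretation of the four values as model scores under modality masking are assumptions.

```python
def pairwise_interaction(v_both, v_image_only, v_text_only, v_neither):
    """Harsanyi dividend of the {image, text} coalition in a two-player game.

    Each argument is the value (for example, a model score) obtained when only
    the named modalities are left unmasked. For two players this quantity also
    equals the Shapley interaction index of the pair.
    """
    return v_both - v_image_only - v_text_only + v_neither
```

A redundant case makes the point of the metric clear: if text alone already reaches the full score, the dividend is zero even though the image appears "important" under a unimodal perturbation test, whereas a score that appears only when both modalities are present yields a large positive dividend.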

2605.22164 2026-05-22 cs.LG cs.RO

Beyond Euclidean Proximity: Repairing Latent World Models with Horizon-Matched Trajectory Reachability Metrics

Liangyu Li, Shengzhi Wang, Qingwen Liu

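AI summary: This paper proposes trajectory reachability metrics (TRM), a post-hoc terminal-ranking method for fixed latent world models. It trains a small pairwise head to improve terminal ranking, which improves downstream planning performance.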

Comments 26 pages, 7 figures

Abstract

Latent world models can contain the state needed for control, yet their terminal-cost interface can expose the planner to the wrong decision-relevant information. In common latent MPC, candidate sequences are ranked by Euclidean distance between predicted terminal and goal latent states; this assumes that raw latent distance weights reachability-relevant variables correctly. We propose trajectory reachability metrics (TRM), a post-hoc terminal-ranking method for fixed latent world models. TRM trains a small pairwise head from logged trajectory structure and uses it as a replacement or hybrid cost; the encoder, dynamics, sampler, optimizer, and evaluation manifests remain fixed. The key design choice is horizon-aware supervision: the metric is trained on broad, balanced temporal separations to match the long-horizon terminal candidate ranking problem. On a hard TwoRoom benchmark, raw latent planning with LeWorldModel (LeWM) reaches 7.0% success, while full-horizon TRM reaches 97.0%; shuffled temporal-label controls stay at 0.0%. The same recipe improves a PLDM baseline from 32.7% to 84.0% across three seeds, and a short-horizon TRM variant reaches only 35.0% with the 100,000 pair budget. In TwoRoom, we provide mechanistic evidence for why TRM works: XY position is linearly decodable (R^2=0.998), yet raw latent MSE misranks candidates; the XY-probe rowspace accounts for less than 1% of terminal-goal latent MSE but carries most candidate-quality signal; and SCSA audits show that TRM improves the ordering and selected endpoint seen by the planner. On PushT go50/go75, TRM-style task-state metrics improve SCSA ranking and selected final distance more cleanly than closed-loop success, motivating auxiliary hybrid costs in continuous manipulation. TRM is the planner-facing repair, and audits explain when terminal reachability metrics should replace or augment raw latent proximity.

2605.22158 2026-05-22 cs.AI cs.CV

ST-SimDiff: Balancing Spatiotemporal Similarity and Difference for Efficient Video Understanding with MLLMs

Bingjun Luo, Tony Wang, Chaoqi Chen, Xinpeng Ding

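AI summary: This paper proposes ST-SimDiff, a framework that balances spatiotemporal similarity and difference to make video understanding more efficient, using a spatio-temporal graph and a dual-selection strategy to reduce computational cost while improving performance.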

Comments Accepted by ICLR 2026

Abstract

Multimodal Large Language Models (MLLMs) face significant computational overhead when processing long videos due to the massive number of visual tokens required. To improve efficiency, existing methods primarily reduce redundancy by pruning or merging tokens based on importance or similarity. However, these approaches largely overlook a critical dimension of video content, i.e., changes and turning points, and they lack a collaborative model for spatio-temporal relationships. To address this, we propose a new perspective: similarity is for identifying redundancy, while difference is for capturing key events. Based on this, we design a training-free framework named ST-SimDiff. We first construct a spatio-temporal graph from the visual tokens to uniformly model their complex associations. Subsequently, we employ a parallel dual-selection strategy: 1) similarity-based selection uses community detection to retain representative tokens, compressing static information; 2) temporal difference-based selection precisely locates content-changing points to preserve tokens that capture key dynamic shifts. This allows it to preserve both static and dynamic content with a minimal number of tokens. Extensive experiments show our method significantly outperforms state-of-the-art approaches while substantially reducing computational costs. Our code is available at https://github.com/bingjunluo/ST-SimDiff.
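
The temporal-difference branch can be pictured with a small sketch: compare each token with the token at the same spatial position in the previous frame and keep it when the two differ enough. This is only an illustration of the idea; the cosine-similarity criterion, the fixed `threshold`, and the `frames[t][p]` layout are assumptions, and the community-detection branch is not shown.

```python
import math


def cosine(a, b):
    """Cosine similarity of two nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def changed_tokens(frames, threshold):
    """Return (frame, position) pairs whose token differs from the previous frame.

    frames[t][p] is the feature vector of the token at spatial position p in
    frame t. A token is kept when its cosine similarity to the token at the
    same position in frame t-1 falls below `threshold`.
    """
    kept = []
    for t in range(1, len(frames)):
        for p, token in enumerate(frames[t]):
            if cosine(token, frames[t - 1][p]) < threshold:
                kept.append((t, p))
    return kept
```

Tokens that pass this test mark content changes, while tokens that fail it are candidates for the similarity-based compression described in the abstract.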

2605.22156 2026-05-22 cs.LG cs.AI

One-Way Policy Optimization for Self-Evolving LLMs

Shuo Yang, Jinda Lu, Kexin Huang, Chiyu Ma, Shaohang Wei, Yuyang Liu, Guoyin Wang, Jingren Zhou, Li Yuan

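AI summary This paper proposes One-Way Policy Optimization, which decouples optimization direction from update magnitude to address the training instability caused by sparse verifier rewards in existing methods, enabling continuous self-evolution of large language models.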

English abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has become a promising paradigm for scaling the reasoning capabilities of Large Language Models (LLMs). However, the sparsity of binary verifier rewards often leads to low efficiency and optimization instability. To stabilize training, existing methods typically impose token-level constraints relative to a reference policy. We identify that such constraints penalize deviations indiscriminately; this can flip the verifier-determined direction when the policy attempts to outperform the reference, thereby suppressing gains. To resolve this, we propose One-Way Policy Optimization (OWPO), a method based on the principle of decoupling optimization direction from update magnitude. In OWPO, the verifier dictates the update direction, while the reference policy serves only to adjust the magnitude. Specifically, OWPO applies asymmetric reweighting: it performs Accelerated Alignment for inferior deviations (where the policy lags behind the reference) and Gain Locking for superior deviations (where the policy surpasses the reference). Furthermore, by incorporating iterative reference updates, OWPO creates a ``Ratchet Effect'' that continuously consolidates gains. Experimental results demonstrate that OWPO outperforms strong baselines, including DAPO, OPD, and MOPD, breaking the bottleneck of fixed priors to enable continuous self-evolution without reliance on external reference models.

2605.22155 2026-05-22 cs.LG

Algebraic Machine Learning for Small-to-Medium Datasets Is Competitive against Strong Standard Baselines

David Mendez, Fernando Martin-Maroto, Gonzalo G. de Polavieja

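AI summary This paper studies Algebraic Machine Learning on small-to-medium datasets and finds that it is competitive with strong baselines such as CNNs on image and tabular classification tasks, without requiring cross-validation.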

Comments 9 pages, 4 figures

English abstract

Symbolic methods are generally not considered competitive with strong modern learners on realistic supervised tasks. We evaluate Algebraic Machine Learning (AML), a framework that learns through subdirect decomposition of algebraic structure rather than numerical optimization, against standard baselines on image and tabular classification across varying training-set sizes. We find that AML trained only on training data without using validation or cross-validation outperforms a family of cross-validated baseline methods including CNNs on small to medium image datasets (50--2000 training examples). On tabular datasets in the same size range, XGBoost is overall the best performing method, but AML is nonetheless comparable to methods incorporating task-specific biases such as LightGBM and random forests. AML achieves this competitive performance across two very different types of datasets using a generic algebraic inductive bias, rather than the modality-specific biases built into standard baselines like CNNs for images or XGBoost for tabular data, and requires no cross validation because it has no task-dependent hyperparameters to tune.

2605.22154 2026-05-22 cs.AI

IdleSpec: Exploiting Idle Time via Speculative Planning for LLM Agents

Daewon Choi, Kyunghyun Park, Woomin Song, Saket Dingliwal, Sai Muralidhar Jayanthi, Jinwoo Shin, Aram Galstyan

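AI summary This paper proposes IdleSpec, a method that uses idle time to improve LLM agent performance by generating plan candidates during idle periods while limiting latency overhead, yielding significant gains across a range of agentic scenarios.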

English abstract

Large language model (LLM)-based agents solve complex tasks by leveraging multi-step reasoning with iterative tool calls and environment interactions, which incur idle time while waiting for observations. Despite the prevalence of idle time in most agentic scenarios, existing works treat it as an unavoidable overhead or propose restricted solutions that overlook varying computational budgets across different tool calls and future observation uncertainty, thereby leading to suboptimal utilization of idle time. In this paper, we introduce IdleSpec, a scalable and generic inference approach that leverages idle-time computation to improve agent performance while minimizing latency overhead. Specifically, IdleSpec iteratively generates plan candidates during idle periods and, once observations become available, aggregates them to guide the next reasoning step. For effective plan generation under observation uncertainty, IdleSpec samples between complementary drafting strategies (i.e., progressive and recovery) from a learned distribution that is updated via posterior feedback. Our experiments demonstrate that IdleSpec significantly improves agent performance in various agentic scenarios by effectively utilizing idle time. In particular, on GAIA and FRAMES, IdleSpec achieves 55.6% average accuracy with Gemini-2.5-Flash, surpassing the vanilla baseline without idle-time usage by 5.1%. Furthermore, for MLE-Bench, which involves substantial delay from code executions, IdleSpec achieves performance gains of up to 9.1% on the Any Medal rate, highlighting its generalizability to long-horizon tasks.
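
One way to picture "sampling between drafting strategies from a learned distribution updated via posterior feedback" is a two-armed Thompson sampler. The sketch below is an assumption made for illustration only: the abstract does not say which distribution IdleSpec learns, and the `StrategySampler` class, the Beta prior, and the binary `useful` feedback signal are all hypothetical.

```python
import random

class StrategySampler:
    """Thompson sampling over drafting strategies with Beta posteriors (illustrative)."""

    def __init__(self, strategies=("progressive", "recovery"), seed=0):
        self.rng = random.Random(seed)
        # [alpha, beta] per strategy, starting from a uniform Beta(1, 1) prior.
        self.posterior = {name: [1, 1] for name in strategies}

    def sample(self):
        """Draw one plausible success rate per strategy and pick the largest."""
        draws = {
            name: self.rng.betavariate(alpha, beta)
            for name, (alpha, beta) in self.posterior.items()
        }
        return max(draws, key=draws.get)

    def update(self, name, useful):
        """Posterior feedback: record whether a draft from `name` proved useful."""
        self.posterior[name][0 if useful else 1] += 1

    def mean(self, name):
        """Posterior mean success rate of a strategy."""
        alpha, beta = self.posterior[name]
        return alpha / (alpha + beta)

sampler = StrategySampler()
for _ in range(3):
    sampler.update("progressive", True)   # three useful progressive drafts
sampler.update("recovery", False)         # one recovery draft that did not help
choice = sampler.sample()                 # strategy to use for the next idle period
```

After this feedback the posterior mean favors the progressive strategy, while the random draw in `sample` still leaves some probability of trying the recovery strategy.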

2605.22148 2026-05-22 cs.AI cs.CL

Ratchet: A Minimal Hygiene Recipe for Self-Evolving LLM Agents

Xing Zhang, Yanwei Cui, Guanghui Wang, Ziyuan Li, Wei Qiu, Bing Zhu, Peiyang He

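AI summary This paper proposes Ratchet, a single-agent loop in which a frozen LLM writes, retrieves, curates, and retires its own natural-language skills; it integrates four hygiene mechanisms to improve lifecycle management of the skill library and substantially improves performance on MBPP+ hard-100.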

Comments 16 pages, 2 figures, 6 tables. Extends arXiv:2605.19576 with the SWE-bench Verified evaluation and a non-divergence analysis (Proposition 1)

English abstract

Self-evolving skill libraries, pioneered by Voyager, let frozen LLM agents accumulate reusable knowledge without weight updates, yet recent evaluation shows that LLM-authored skills deliver $+0.0$pp over no-skill baselines while human-curated ones deliver $+16.2$pp: the bottleneck is not skill authoring but lifecycle management. We introduce \textbf{Ratchet}, a single-agent loop in which a frozen LLM writes, retrieves, curates, and retires its own natural-language skills. Ratchet integrates four candidate hygiene mechanisms: outcome-driven retirement, a bounded active-cap, meta-skill authoring guidance, and pattern canonicalisation. On MBPP+ hard-100 with Claude Opus 4.7, Ratchet lifts held-out pass@1 from a $0.258 \pm 0.047$ baseline to a late-window rolling mean of $0.584$ (peak $0.658 \pm 0.042$) across 100 rounds and 3 seeds, a $+0.328 \pm 0.018$ rolling-mean gain where the no-skill control drifts at $+0.002 \pm 0.005$; the same recipe transfers to an agentic solver on SWE-bench Verified ($+0.22$ peak lift over 20 rounds). Eight ablations (A1--A8) reveal that the minimal working recipe is smaller than our design suggests: retirement and the meta-skill authoring prior are load-bearing, while explicit deduplication (canonicalisation, cover-guard) is subsumed by the meta-skill itself. A non-divergence proposition shows that bounded cap and retirement threshold together prevent expected performance from drifting below the no-skills floor.
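
Two of the mechanisms named in this abstract, outcome-driven retirement and a bounded active cap, can be sketched as a small bookkeeping class. This is a hedged illustration, not the paper's code: the `SkillLibrary` and `Skill` names, the `min_uses` and `retire_below` thresholds, and the choice to evict the lowest-win-rate older skill when the cap is exceeded are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class Skill:
    name: str
    uses: int = 0
    wins: int = 0

    @property
    def win_rate(self):
        return self.wins / self.uses if self.uses else 0.0

class SkillLibrary:
    """Toy skill store with a bounded active set and outcome-driven retirement."""

    def __init__(self, active_cap=3, min_uses=4, retire_below=0.25):
        self.active_cap = active_cap      # assumed to be at least 1
        self.min_uses = min_uses          # evidence required before retiring
        self.retire_below = retire_below  # hypothetical win-rate threshold
        self.active = {}
        self.retired = []

    def add(self, name):
        """Add a new skill; if the cap is exceeded, evict the weakest older skill."""
        self.active[name] = Skill(name)
        while len(self.active) > self.active_cap:
            older = [s for s in self.active.values() if s.name != name]
            worst = min(older, key=lambda s: s.win_rate)
            self.retired.append(self.active.pop(worst.name))

    def record(self, name, success):
        """Log one outcome and retire the skill if it keeps failing."""
        skill = self.active[name]
        skill.uses += 1
        skill.wins += int(success)
        if skill.uses >= self.min_uses and skill.win_rate < self.retire_below:
            self.retired.append(self.active.pop(name))

lib = SkillLibrary(active_cap=2, min_uses=2, retire_below=0.5)
lib.add("a")
lib.add("b")
for outcome in (True, True):
    lib.record("a", outcome)
for outcome in (False, False):
    lib.record("b", outcome)   # "b" is retired after two failures
lib.add("c")
lib.add("d")                   # cap exceeded: "c" (no wins) is evicted, "a" survives
```

The point of the sketch is only that both rules bound the library: failing skills leave on evidence, and the cap forces a choice whenever a new skill arrives.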

2605.22147 2026-05-22 cs.CV

Flow-based Gaussian Splatting for Continuous-Scale Remote Sensing Image Super-Resolution

Jiangwei Mo, Xi Lu, Hanlin Wu

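AI summary This paper proposes the FlowGS framework, which uses flow matching and Gaussian splatting to achieve arbitrary-scale super-resolution of remote sensing images, improving generation efficiency and quality.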

English abstract

High-resolution remote sensing images (RSIs) are crucial for Earth observation applications, yet acquiring them is often limited by sensor constraints and costs. In recent years, generative super-resolution (SR) methods, particularly diffusion models, have made significant progress. However, they typically require slow iterative inference with 40--1000 steps and exhibit limited flexibility in continuous-scale SR settings. To address these issues, we propose FlowGS, a generative reconstruction framework for arbitrary-scale SR of RSIs. FlowGS models the high-frequency detail representations between high- and low-resolution images and learns a continuous probability flow from noise to detail priors via flow matching (FM) constrained by shortcut consistency, thereby reducing generative complexity and improving inference efficiency. Additionally, we employ 2D Gaussian splatting to construct a continuous feature field, thereby enabling flexible reconstruction at arbitrary query locations. Experimental results show that FlowGS delivers competitive perceptual quality compared with existing methods in both continuous-scale and fixed-scale SR settings, with substantially improved inference efficiency.

2605.22144 2026-05-22 cs.CV

One Sentence, One Drama: Personalized Short-Form Drama Generation via Multi-Agent Systems

Yufei Shi, Weilong Yan, Naixuan Huang, Yucheng Chen, Chenyu Zhang, Tao He, Si Yong Yeo, Ming Li

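AI summary This paper proposes a multi-agent framework that turns a user's single-sentence idea into a complete short drama through structured intermediate modules and iterative refinement, addressing narrative pacing, spatial consistency, and production-level quality control in short-drama generation.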

English abstract

Existing approaches for digital short-drama production typically rely on one-shot LLM generated scripts and loosely coupled pipelines, which fail to satisfy three key requirements of short-drama generation: (1) narrative pacing, resulting in weak hooks, insufficient escalation, and unattractive endings; (2) spatial consistency, leading to drifting scene layouts and inconsistent character positions across clips; and (3) production-level quality control, requiring extensive manual review and correction across script and visual stages. We present One Sentence, One Drama, a hierarchical multi-agent framework that transforms a user's single-sentence idea into a fully produced short drama through structured intermediate modules and iterative refinement. Our approach is built upon three key components: (1) a multi-agent debate-based story generation module that enforces short-drama pacing and narrative coherence; (2) a 3D-grounded first-frame generation mechanism that establishes a shared spatial reference for consistent character positioning and scene layout across clips; and (3) multi-stage reviewer loops that perform comprehensive error detection and targeted revision across script, visual, and video generation stages. We also introduce scene-level BGM matching and scene transition planning to improve the audience's immersive experience. To systematically evaluate this task, we introduce Short-Drama-Bench, a benchmark that extends standard video quality metrics with short-drama-specific criteria. Experimental results demonstrate that our method significantly outperforms existing pipelines in narrative quality, cross-clip consistency, and overall viewing experience.

2605.22142 2026-05-22 cs.LG cs.AI

Short-Term-to-Long-Term Memory Transfer for Knowledge Graphs under Partial Observability

Taewoon Kim, Vincent François-Lavet, Michael Cochez

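AI summary This paper studies short-term-to-long-term memory transfer in knowledge graphs under partial observability and proposes a neuro-symbolic value-based approach that decides whether to keep or drop each observed triple before long-term insertion, improving memory efficiency and outperforming symbolic and neural baselines on the RoomKG benchmark.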

English abstract

Reinforcement learning under partial observability requires deciding what information to retain, yet most memory-based approaches do not explicitly model short-term-to-long-term transfer of symbolic observations. We study this transfer process in a temporal knowledge-graph memory setting and cast it as a neuro-symbolic value-based decision problem: for each observed triple, the agent chooses whether to keep or drop it before long-term insertion. To handle variable-sized short-term buffers, we use a per-item Q-learning design with shared parameters and a practical temporal-difference update over matched items across consecutive steps. On the RoomKG benchmark at long-term memory capacity 128, learned transfer decisions outperform symbolic and neural baselines, including symbolic baselines with temporal annotations and history-based LSTM/Transformer baselines. Across transfer-policy ablations, a lightweight local short-term-only variant performs best, and step-level behavior shows that the policy keeps navigation- and query-relevant facts while discarding lower-value candidate facts, supporting explicit and interpretable memory decisions under memory constraints.
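
The per-item keep/drop decision with shared parameters can be sketched with a linear Q-function and a one-step temporal-difference update. This is a minimal illustration under stated assumptions, not the paper's model: the linear form, the two-dimensional item features, the reward, and the rule that an item with no match at the next step is treated as terminal are all hypothetical choices.

```python
KEEP, DROP = 0, 1

def q_values(weights, features):
    """Linear Q-values for both actions, using parameters shared by all items."""
    return [
        sum(w * f for w, f in zip(weights[action], features))
        for action in (KEEP, DROP)
    ]

def td_update(weights, features, action, reward, next_features=None, lr=0.1, gamma=0.9):
    """Apply one temporal-difference step for a single item and return the TD error.

    `next_features` is the same item's feature vector at the next step, or None
    when the item has no match there (treated as terminal in this sketch).
    """
    target = reward
    if next_features is not None:
        target += gamma * max(q_values(weights, next_features))
    error = target - q_values(weights, features)[action]
    for i, f in enumerate(features):
        weights[action][i] += lr * error * f
    return error

weights = [[0.0, 0.0], [0.0, 0.0]]   # one weight vector per action
item = [1.0, 0.0]                    # hypothetical features of one observed triple
error = td_update(weights, item, KEEP, reward=1.0)
keep_q, drop_q = q_values(weights, item)
decision = "keep" if keep_q >= drop_q else "drop"
```

Because the same `weights` score every item, the buffer can hold any number of triples: each one is evaluated independently and kept or dropped according to its own pair of Q-values.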

2605.22140 2026-05-22 cs.CL

Psy-Chronicle: A Structured Pipeline for Synthesizing Long-Horizon Campus Psychological Counseling Dialogues

Chaogui Gou, Jiarui Liang

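AI summary This paper proposes Psy-Chronicle, a structured data-generation framework for synthesizing long-horizon campus psychological counseling dialogues. Using a semester-spanning temporal stress event graph and interactive simulation between student and counselor agents, the authors build the CPCD dataset of 100 student profiles and 90,000 dialogues and evaluate long-horizon campus counseling ability with CPCD-Bench; experiments show that CPCD improves session-level response generation and long-horizon memory recall.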

English abstract

In recent years, large language models have shown substantial potential in psychological support tasks. However, existing psychological counseling data mostly rely on single-turn question answering or short multi-turn dialogues, making it difficult to characterize how college students' psychological distress accumulates, interacts, and gradually evolves over long periods within campus life events. To address this issue, this paper proposes Psy-Chronicle, a structured data-generation framework for synthesizing long-horizon campus psychological counseling dialogues. We generate a semester-spanning temporal stress event graph to model the chronological order and evolutionary dependencies among campus stress events. Through interactive simulation between a student agent and a counselor agent, together with a structured memory integration mechanism, Psy-Chronicle generates long-horizon dialogues with continuity across counseling sessions. Based on Psy-Chronicle, we construct and open-source CPCD, a Chinese long-horizon dialogue dataset for college psychological counseling, containing 100 student profiles and 90,000 counseling dialogues. We further build CPCD-Bench to evaluate models' long-horizon campus counseling capabilities along three dimensions: session-level response, long-horizon memory recall, and temporal-causal reasoning. Experimental results show that CPCD effectively improves session-level response generation and long-horizon memory recall for models with the same base architecture. Meanwhile, improvements in temporal-causal reasoning remain limited, indicating that event-chain organization and causal explanation are key challenges in long-horizon psychological counseling modeling. The related code and data are available at https://github.com/EdwinUSTB/Psy-Chronicle.

2605.22139 2026-05-22 cs.CV

EventGait: Towards Robust Gait Recognition with Event Streams

Senyan Xu, Shuai Chen, Chuanfu Shen, Kean Liu, Zhijing Sun, Chengzhi Cao, Xueyang Fu

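AI summary This paper proposes EventGait, an end-to-end dual-stream framework that uses event cameras to capture dynamic and shape information, improving the robustness of gait recognition under challenging illumination and motion, and validates it on synthesized datasets and new benchmarks.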

English abstract

Gait recognition enables non-intrusive, privacy-preserving identification but suffers in uncontrolled environments due to illumination and motion sensitivity of conventional cameras. In this work, we explore gait recognition using event cameras, which offer microsecond temporal resolution and high dynamic range, naturally capturing robust dynamic cues and suppressing static noise. Existing event-based approaches typically aggregate event streams into event images over long time windows, thereby discarding fine-grained motion dynamics critical for gait recognition. Therefore, we propose \textbf{EventGait}, an end-to-end dual-stream framework that separately models motion and shape while preserving the advantages of events. Our dynamic stream leverages a Mixture of Spiking Experts (MoSE) with diverse neuron constants for robust dynamic perception across complex motion and illumination scenes, while the static stream learns dense shape representations via Cross-modal Structure Alignment (CroSA) with large vision foundation models. To address the absence of large-scale event-based gait datasets, we introduce a synthesis pipeline and release two new benchmarks: SUSTech1K-E and CCGR-Mini-E. Extensive experiments have shown that event-based gait recognition not only achieves results comparable to camera-based gait recognition under normal conditions but also significantly outperforms it in low-light scenarios. Our approach sets a new state of the art on both synthesized and real-world event-based gait benchmarks, highlighting the robustness and potential of event-driven gait analysis. The code and datasets are released at https://github.com/QUEAHREN/EventGait.

2605.22138 2026-05-22 cs.AI cs.CL cs.LG cs.RO

Efficient Agentic Reasoning Through Self-Regulated Simulative Planning

Mingkai Deng, Jinyu Hou, Lara Sá Neves, Varad Pimpalkhute, Taylor W. Killian, Zhengzhong Liu, Eric P. Xing

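AI summary This paper proposes improving the efficiency of agentic reasoning by decomposing decision-making into three systems (simulative reasoning, self-regulation, and reactive execution) and reports the performance of the SR$^2$AM model across different tasks.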

Comments Code and model artifacts are available at https://github.com/sailing-lab/sr2am

English abstract

How should an agent decide when and how to plan? A dominant approach builds agents as reactive policies with adaptive computation (e.g., chain-of-thought), trained end-to-end expecting planning to emerge implicitly. Without control over the presence, structure, or horizon of planning, these systems dramatically increase reasoning length, yielding inefficient token use without reliable accuracy gains. We argue efficient agentic reasoning benefits from decomposing decision-making into three systems: simulative reasoning (System II) grounding deliberation in future-state prediction via a world model; self-regulation (System III) deciding when and how deeply to plan via a learned configurator; and reactive execution (System I) handling fine-grained action. Simulative reasoning provides unified planning across diverse tasks without per-domain engineering, while self-regulation ensures the planner is invoked only when needed. To test this, we develop SR$^2$AM (Self-Regulated Simulative Reasoning Agentic LLM), realizing both as distinct stages within an LLM's chain-of-thought, with the LLM as world model. We explore two instantiations: recording decisions from a prompted multi-module system (v0.1) and reconstructing structured plans from traces of pretrained reasoning LLMs (v1.0), trained via supervised then reinforcement learning (RL). Across math, science, tabular analysis, and web information seeking, v0.1-8B and v1.0-30B achieve Pass@1 competitive with 120-355B and 685B-1T parameter systems respectively, while v1.0-30B uses 25.8-95.3% fewer reasoning tokens than comparable agentic LLMs. RL increases average planning horizon by 22.8% while planning frequency grows only 2.0%, showing it learns to plan further ahead rather than more often. More broadly, learned self-regulation instantiates a principle we expect to extend beyond planning to how agents govern their own learning and adaptation.

2605.22132 2026-05-22 cs.CV

Accelerating Vision Foundation Models with Drop-in Depthwise Convolution

Carmelo Scribano, Mohammad Mahdi, Nedyalko Prisadnikov, Yuqian Fu, Giorgia Franchini, Danda Pani Paudel, Marko Bertogna, Luc Van Gool

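AI summary This paper proposes accelerating large-scale pretrained Vision Transformers by replacing some attention heads with a drop-in depthwise convolution layer while preserving feature extraction capability, achieving a 17-20% inference speedup with minimal performance loss on image classification and segmentation tasks.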

Comments Accepted at ICPR 2026

English abstract

Pretrained vision foundation models deliver strong performance across tasks with limited fine-tuning. However, their Vision Transformer (ViT) backbones impose high inference costs, limiting deployment on resource-constrained devices. In this work, we accelerate large-scale pretrained ViTs while preserving their feature extraction capabilities by exploiting the intrinsic convolution-like behavior of some attention heads. Specifically, we introduce an efficient depthwise convolution-based layer that serves as a drop-in replacement for these heads. Additionally, we propose simple strategies to identify which heads can be replaced and introduce a fine-tuning procedure that recovers downstream task performance. Across both image classification and segmentation tasks, our method achieves a 17-20\% inference speedup with minimal performance degradation. We validate the approach through detailed derivations, extensive experiments, and efficiency benchmarks. The reference implementation is publicly available.
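
For readers unfamiliar with the operation, a depthwise convolution filters each channel with its own small kernel and never mixes channels, which is why it is far cheaper than attention over all tokens. The pure-Python sketch below shows only that operation on a `[channels][height][width]` grid with zero padding; how the paper maps attention heads and patch tokens onto such a layer is not specified in the abstract, so nothing here should be read as the authors' layer.

```python
def depthwise_conv2d(x, kernels):
    """Depthwise 2D convolution with zero padding that preserves height and width.

    x:       nested lists shaped [C][H][W]
    kernels: nested lists shaped [C][k][k] with odd k, one kernel per channel
    """
    out = []
    for channel, kernel in zip(x, kernels):
        k = len(kernel)
        pad = k // 2
        height, width = len(channel), len(channel[0])
        result = [[0.0] * width for _ in range(height)]
        for i in range(height):
            for j in range(width):
                acc = 0.0
                for di in range(k):
                    for dj in range(k):
                        ii, jj = i + di - pad, j + dj - pad
                        # Positions outside the grid contribute zero.
                        if 0 <= ii < height and 0 <= jj < width:
                            acc += kernel[di][dj] * channel[ii][jj]
                result[i][j] = acc
        out.append(result)
    return out

identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]   # passes the channel through
box_sum = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]    # sums each 3x3 neighborhood
x = [
    [[1, 2], [3, 4]],   # channel 0
    [[1, 1], [1, 1]],   # channel 1
]
y = depthwise_conv2d(x, [identity, box_sum])
```

Channel 0 comes back unchanged and channel 1 becomes the local sum, each computed independently. In a framework such as PyTorch the same operation is usually expressed as a grouped convolution with one group per channel.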

2605.22126 2026-05-22 cs.CV

AesFormer: Transform Everyday Photos into Beautiful Memories

Tianxiang Du, Hulingxiao He, Yuxin Peng

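AI summary This paper proposes the AesFormer framework, which improves the aesthetic quality of photos by decoupling aesthetic planning from image editing while preserving subject identity and scene semantics, and constructs the AesRecon benchmark of 9,071 strictly aligned image pairs.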

Comments Accepted by ICML 2026

English abstract

In everyday photography, aesthetically appealing moments are often captured with structural flaws (e.g., composition, camera viewpoint, or pose) that existing retouching and portrait enhancement methods cannot fix. We formulate Aesthetic Photo Reconstruction (APR) as improving a photo's aesthetic quality via structural reconstruction while preserving subject identity and scene semantics. Although recent advances in image editing models make APR feasible, they often lack aesthetic understanding, yielding edits that are semantically plausible yet aesthetically weak. To address this, we propose AesFormer, a two-stage framework that decouples aesthetic planning from image editing. In Stage 1, an aesthetic action model (AesThinker) analyzes the input along seven progressive photographic dimensions and outputs executable editing actions; we further apply GRPO-A to encourage broad exploration over diverse action plans beyond SFT. In Stage 2, an action-conditioned editor (AesEditor) performs structural edits guided by these actions. To support APR, we build a video-based corpus-mining pipeline (VCMP) and construct AesRecon, a benchmark of 9,071 strictly aligned (poor, good) image pairs. Experiments show that AesFormer substantially improves APR performance and is competitive with Nano Banana Pro. Code is available at https://github.com/PKU-ICST-MIPL/AesFormer_ICML2026.

2605.22123 2026-05-22 cs.RO

Beyond Pixels: Learning Invariant Rewards for Real-World Robotics From a Few Demonstrations

Tengye Xu, Yangting Sun, Ziju Shen, Guanqi Chen, Zhen Fu, Chen yizhou, Hua Chen, Jia Pan

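AI summary This paper proposes a method for learning invariant rewards from a few demonstrations so that they generalize in real-world robotics; it discovers behavioral invariants to improve reward design and improves policy learning across multiple tasks.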

English abstract

Designing reward functions that generalize beyond controlled laboratory settings remains a fundamental challenge in reinforcement learning for robotics. In open-world manipulation problems, a single task can appear in numerous variants through different object instances, positions, and camera viewpoints. Recent vision-based reward models tend to memorize specific pixel distributions and fail to generalize beyond their training conditions. To address this, we propose a framework that learns invariant symbolic reward functions from as few as five demonstrations. The insight is to shift from visual feature-fitting to the discovery of behavioral invariants: task-level properties that remain constant across diverse visual instantiations. The framework has two coupled components: a structural reward formulation that encodes task-level strategies and physical constraints while preserving optimal policy invariance, and a hybrid symbolic-numerical procedure that distills these invariants from demonstrations without online interaction. Experiments on eight Meta-World tasks and three Franka manipulation tasks demonstrate that our method achieves stronger process alignment and policy rollout ranking abilities compared to baselines, accelerating downstream policy learning. Three real-world out-of-distribution experiments further show that the same learned reward generalizes zero-shot to position, viewpoint, and object variations, enabling a single reward representation to be reused across diverse task variants in practice.

2605.22121 2026-05-22 cs.CV

MotionDPS: Motion-Compensated 3D Brain MRI Reconstruction

Antonio Ortiz-Gonzalez, Erich Kobler, Lukas Schletter, Alexander Effland

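AI summary This paper proposes a unified Bayesian framework for motion-compensated 3D MRI reconstruction that jointly estimates the anatomical image, rigid-body motion parameters, and coil sensitivity maps directly from motion-corrupted k-space data, enabling fully unsupervised reconstruction without paired motion-free training data.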

Comments This work has been submitted to the IEEE for possible publication

English abstract

Magnetic resonance imaging (MRI) is highly susceptible to patient motion due to its relatively long acquisition times and the fact that data are acquired sequentially in k-space. Even small patient movements introduce phase inconsistencies across measurements, leading to severe artifacts such as blurring, ghosting, and geometric distortions that can compromise diagnostic quality. Retrospective motion compensation remains challenging, particularly in accelerated acquisitions, due to the ill-posed nature of the joint reconstruction and motion estimation problem. In this work, we propose a unified Bayesian framework for motion-compensated 3D MRI that jointly estimates the anatomical image, rigid-body motion parameters, and coil sensitivity maps directly from motion-corrupted k-space data. Our approach integrates pretrained 3D complex-valued score-based diffusion models as expressive anatomical image priors within a physics-based forward model. Inference is performed by alternating diffusion posterior image updates with efficient proximal optimization steps for motion and coil sensitivity estimation, enabling fully unsupervised reconstruction without the need for paired motion-free training data. Experiments on simulated and real-motion brain MRI datasets demonstrate that the proposed method achieves improved image quality and motion robustness compared to state-of-the-art classical and learning-based motion correction techniques, particularly in the presence of severe motion and high acceleration.

2605.22111 2026-05-22 cs.LG cs.CE stat.ML

Aerodynamic force reconstruction using physics-informed Gaussian processes

Gledson Rodrigo Tondo, Igor Kavrakov, Guido Morgenthal

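AI summary: This paper proposes a probabilistic physics-informed machine learning method that reconstructs the underlying aerodynamic loads from noisy measurements of structural dynamic responses; the model avoids overfitting and needs no regularization scheme.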

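AI Chinese abstract (translated)

Accurate modeling of aerodynamic loads is essential for understanding and predicting the response of complex structural systems. However, these models often rely on simplifications of the true physical forces, and the assumptions they introduce can limit their accuracy. Validating such models is particularly challenging when the data are noisy or incomplete. To address this, we introduce a probabilistic physics-informed machine learning approach that reconstructs the underlying aerodynamic loads from noisy measurements of structural dynamic responses. The model avoids overfitting, eliminates the need for regularization schemes, and allows heterogeneous and multi-fidelity data to be used during training. The effectiveness of the approach is demonstrated by reconstructing the aerodynamic loads on the Great Belt East Bridge under a linear unsteady assumption. The results show strong agreement between true and predicted loads, particularly in terms of root mean squared error, magnitude, phase angle, and signal peak values. The load reconstruction method has broad applications, such as model validation, future load estimation, and structural damage prognosis.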

English abstract

Accurate modeling of aerodynamic loads is essential for understanding and predicting the responses of complex structural systems. However, these models often rely on simplifications of the true physical forces, introducing assumptions that can limit their accuracy. Validating such models becomes particularly challenging in the presence of noisy or incomplete data. To address this, we introduce a probabilistic physics-informed machine learning approach designed to reconstruct the underlying aerodynamic loads from noisy measurements of structural dynamic responses. The model avoids overfitting, eliminates the need for regularization schemes, and allows for the use of heterogeneous and multi-fidelity data during the training process. The efficacy of the approach is demonstrated by reconstructing the aerodynamic loads on the Great Belt East Bridge, simulated under a linear unsteady assumption. The results show strong agreement between true and predicted loads, particularly in terms of root mean squared error, magnitude, phase angle, and peak values of the signals. The load reconstruction method is broadly applicable, for example to model validation, future load estimation, and structural damage prognosis.
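
For readers unfamiliar with the agreement measures named in this abstract, the sketch below computes two of them, the root mean squared error and the gap between peak absolute values, for a pair of toy signals. The sample values and the function names `rmse` and `peak_abs` are assumptions made for illustration; they are not taken from the paper, and the paper may compute its metrics differently.

```python
import math

def rmse(true, pred):
    """Root mean squared error between two equal-length, non-empty sequences."""
    if len(true) != len(pred) or not true:
        raise ValueError("sequences must be non-empty and of equal length")
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(true, pred)) / len(true))

def peak_abs(signal):
    """Largest absolute value in the signal."""
    return max(abs(v) for v in signal)

# Toy "true" and "predicted" load histories (illustrative values only).
true_load = [0.0, 1.0, 0.0, -1.0, 0.0]
pred_load = [0.1, 0.9, 0.0, -1.1, 0.0]

error = rmse(true_load, pred_load)
peak_gap = abs(peak_abs(true_load) - peak_abs(pred_load))
```

Magnitude and phase angle comparisons would additionally require a frequency-domain view of the signals, which this sketch leaves out.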

2605.22109 2026-05-22 cs.AI cs.CV cs.CY

Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?

Caixin Kang, Tianyu Yan, Sitong Gong, Mingfang Zhang, Liangyang Ouyang, Ruicong Liu, Bo Zheng, Huchuan Lu, Kaipeng Zhang, Yoichi Sato, Yifei Huang

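AI summary: This paper examines how well multimodal large language models (MLLMs) perceive personality. It proposes a new task, Grounded Personality Reasoning (GPR), builds a new dataset, MM-OCEAN, and uses a three-tier evaluation to expose prejudice in MLLM personality reasoning.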

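AI Chinese abstract (translated)

Multimodal large language models (MLLMs) are increasingly deployed in human-facing roles that require personality perception, yet existing benchmarks evaluate only their ability to predict Big Five trait scores, leaving open whether models truly perceive personality through behavioral understanding or merely prejudge through superficial pattern matching. We address this gap with three contributions. (i) A new task: we formalize Grounded Personality Reasoning (GPR), which requires MLLMs to tie each Big Five rating to observable evidence through a chain of rating, reasoning, and grounding. (ii) A new dataset: we release MM-OCEAN (1,104 videos, 5,320 multiple-choice questions), produced by a multi-agent pipeline with human verification, containing timestamped behavioral observations, evidence-grounded trait analyses, and seven categories of cue-grounding multiple-choice questions. (iii) Benchmark and analysis: we design a three-tier evaluation (rating, reasoning, grounding) and four sample-level failure-mode metrics, Prejudice Rate (PR), Confabulation Rate (CR), Integration-failure Rate (IR), and Holistic-grounding Rate (HR), and benchmark 27 MLLMs (13 closed, 14 open). The analysis reveals a striking Prejudice Gap: 51% of correct ratings are not grounded in retrieved cues, and the Holistic-grounding Rate ranges only from 0 to 33.5%. These findings expose a disconnect between getting the right score and reasoning for the right reason, and chart a roadmap for grounded social cognition in MLLMs.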

English abstract

Multimodal Large Language Models (MLLMs) are increasingly deployed in human-facing roles where personality perception is critical, yet existing benchmarks evaluate this capability solely on numerical Big Five score prediction, leaving open whether models truly perceive personality through behavioral understanding or merely prejudge through superficial pattern matching. We address this gap with three contributions. (i) A new task: we formalize Grounded Personality Reasoning (GPR), which requires MLLMs to anchor each Big Five rating in observable evidence through a chain of rating, reasoning, and grounding. (ii) A new dataset: we release MM-OCEAN (1,104 videos, 5,320 MCQs), produced by a multi-agent pipeline with human verification, with timestamped behavioral observations, evidence-grounded trait analyses, and seven categories of cue-grounding MCQs. (iii) Benchmark and analysis: we design a three-tier evaluation (rating, reasoning, grounding) plus four sample-level failure-mode metrics: Prejudice Rate (PR), Confabulation Rate (CR), Integration-failure Rate (IR), and Holistic-grounding Rate (HR), and benchmark 27 MLLMs (13 closed, 14 open). The analysis uncovers a striking Prejudice Gap: across the field, 51% of correct ratings are not grounded in retrieved cues, and the Holistic-Grounding Rate spans only 0-33.5%. These findings expose a disconnect between getting the right score and reasoning for the right reason, charting a roadmap for grounded social cognition in MLLMs.
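
The abstract does not define its four failure-mode metrics, so the sketch below does not attempt to reproduce them. It only illustrates the arithmetic behind a statement of the form "X% of correct ratings are not grounded in retrieved cues", under the assumption that each sample carries two boolean flags. The field names `rating_correct` and `grounded`, the function `ungrounded_share`, and the sample records are all hypothetical.

```python
def ungrounded_share(records):
    """Fraction of correctly rated samples whose rating is not grounded.

    Hypothetical record format: {"rating_correct": bool, "grounded": bool}.
    Returns 0.0 when there are no correctly rated samples.
    """
    correct = [r for r in records if r["rating_correct"]]
    if not correct:
        return 0.0
    ungrounded = sum(1 for r in correct if not r["grounded"])
    return ungrounded / len(correct)

# Made-up records for illustration; not data from the paper.
records = [
    {"rating_correct": True, "grounded": True},
    {"rating_correct": True, "grounded": False},
    {"rating_correct": False, "grounded": False},
    {"rating_correct": True, "grounded": False},
    {"rating_correct": True, "grounded": True},
]

share = ungrounded_share(records)  # 2 of the 4 correct ratings are ungrounded
```

Note that incorrectly rated samples are excluded from the denominator, which matches the phrasing "of correct ratings" in the abstract but remains an interpretation.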