arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday
All subject categories: 2092
2506.12362 2026-05-11 cs.LG cs.AI

HYPER: A Foundation Model for Inductive Link Prediction with Knowledge Hypergraphs

Xingyue Huang, Mikhail Galkin, Michael M. Bronstein, İsmail İlkan Ceylan

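AI summary: This paper proposes HYPER, a foundation model for inductive link prediction on knowledge hypergraphs that can handle hypergraphs containing entirely new entities and entirely new relation types. By encoding the entities of each hyperedge together with their positions in the hyperedge, HYPER generalizes across relation types of different arities. Experiments show that HYPER outperforms existing methods across several inductive settings, demonstrating strong generalization to higher-arity relational structures.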

Abstract

Inductive link prediction with knowledge hypergraphs is the task of predicting missing hyperedges involving completely novel entities (i.e., nodes unseen during training). Existing methods for inductive link prediction with knowledge hypergraphs assume a fixed relational vocabulary and, as a result, cannot generalize to knowledge hypergraphs with novel relation types (i.e., relations unseen during training). Inspired by knowledge graph foundation models, we propose HYPER as a foundation model for link prediction, which can generalize to any knowledge hypergraph, including novel entities and novel relations. Importantly, HYPER can learn and transfer across different relation types of varying arities, by encoding the entities of each hyperedge along with their respective positions in the hyperedge. To evaluate HYPER, we construct 16 new inductive datasets from existing knowledge hypergraphs, covering a diverse range of relation types of varying arities. Empirically, HYPER consistently outperforms all existing methods in both node-only and node-and-relation inductive settings, showing strong generalization to unseen, higher-arity relational structures.

2506.11512 2026-05-11 cs.LG cs.AI

From Time Series Analysis to Question Answering: A Survey in the LLM Era

Wei Li, Zhe Xie, Yuxuan Liang, Xinli Hao, Yunyao Cheng, Dan Pei, Xiaofeng Meng

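AI summary: In recent years, large language models (LLMs) have introduced a new paradigm for time series analysis (TSA), but traditional TSA tasks do not cover tasks such as time series language understanding, and their objectives are mismatched with those of LLMs. In response, research is moving TSA toward time series question answering (TSQA), which emphasizes user-centric, unified task handling. This survey reviews the evolution from TSA to TSQA, proposes three alignment paradigms, and analyzes dataset characteristics and future research directions.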

Comments Accepted by IJCAI 2026 Survey Track

Abstract

Recently, Large Language Models (LLMs) have introduced a novel paradigm in Time Series Analysis (TSA), leveraging strong language capabilities to support tasks such as forecasting and anomaly detection. However, these analysis tasks cannot adequately cover temporal language tasks, such as interpretation and captioning. A fundamental gap remains between TSA and LLMs: LLMs are pre-trained to optimize natural language relevance for question answering rather than objectives specialized for TSA. To bridge this gap, TSA is evolving toward Time Series Question Answering (TSQA), shifting from expert-driven and task-specific analysis to user-driven and task-unified question answering. TSQA depends on flexible exploration rather than predefined TSA pipelines. In this survey, we first propose a taxonomy that reflects the evolution from TSA to TSQA, driven by a shift from external to internal alignment. We then organize existing literature into three alignment paradigms: Injective Alignment, Bridging Alignment, and Internal Alignment, and provide practical guidance for flexible, economical, and generalizable selection of alignment paradigms. We finally analyze datasets across domains and characteristics, identify challenges, and highlight future research directions.

2506.05668 2026-05-11 cs.LG stat.ML

RNE: plug-and-play diffusion inference-time control and energy-based training

Jiajun He, José Miguel Hernández-Lobato, Yuanqi Du, Francisco Vargas

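AI summary: This paper proposes RNE, a plug-and-play method for diffusion models that enables control of the generation process at inference time and supports energy-based training. Built on the density ratio between path distributions, RNE establishes a fundamental connection between marginal densities and transition kernels, unifying diffusion density estimation, inference-time control, and energy-based training. Experiments show that RNE performs well on inference-time control tasks, provides a simple and efficient regularization for energy-based diffusion models, and applies to both continuous and discrete diffusion models.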

Comments Accepted at ICLR 2026

Abstract

Diffusion models generate data by removing noise gradually, which corresponds to the time-reversal of a noising process. However, access to only the denoising kernels is often insufficient. In many applications, we need knowledge of the marginal densities along the generation trajectory, which enables tasks such as inference-time control. To address this gap, we introduce the Radon-Nikodym Estimator (RNE). Based on the concept of the density ratio between path distributions, it reveals a fundamental connection between marginal densities and transition kernels, providing a flexible plug-and-play framework that unifies (1) diffusion density estimation, (2) inference-time control, and (3) energy-based diffusion training under a single perspective. Experiments demonstrate that RNE delivers strong results in inference-time control applications, such as annealing and model composition, with promising inference-time scaling performance, and achieves a simple yet efficient regularisation for training energy-based diffusion models. Additionally, our proposed RNE is modality-agnostic and applicable not only to continuous diffusion models but also to their discrete diffusion counterparts.

2506.00886 2026-05-11 cs.AI

Position: Agent Should Invoke External Tools ONLY When Epistemically Necessary

Hongru Wang, Cheng Qian, Manling Li, Jiahao Qiu, Boyang Xue, Mengdi Wang, Heng Ji, Amos Storkey, Kam-Fai Wong

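AI summary: As large language models evolve into tool-augmented agents, a central question remains unresolved: when is calling an external tool actually needed? This paper argues that agents should invoke external tools only when epistemically necessary, that is, when the task cannot be completed reliably through internal reasoning alone. To this end, it introduces the Theory of Agent (ToA) framework, which treats agents as sequential decision-makers choosing under uncertainty between internal processing and external delegation, and it notes that unnecessary tool calls are not only inefficient but can also impede the development of internal reasoning capability. The work offers a normative criterion for tool use that helps build more intelligent and more efficient agent systems.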

Abstract

As large language models evolve into tool-augmented agents, a central question remains unresolved: when is external tool use actually justified? Existing agent frameworks typically treat tools as ordinary actions and optimize for task success or reward, offering little principled distinction between epistemically necessary interaction and unnecessary delegation. This position paper argues that agents should invoke external tools only when epistemically necessary. Here, epistemic necessity means that a task cannot be completed reliably via the agent's internal reasoning over its current context, without any external interaction. We introduce the Theory of Agent (ToA), a framework that treats agents as making sequential decisions about whether remaining uncertainty should be resolved internally or delegated externally. From this perspective, common agent failure modes (e.g., overthinking and overacting) arise from miscalibrated decisions under uncertainty rather than deficiencies in reasoning or tool execution alone. We further discuss implications for training, evaluation, and agent design, highlighting that unnecessary delegation not only causes inefficiency but can impede the development of internal reasoning capability. Our position provides a normative criterion for tool use that complements existing decision-theoretic models and is essential for building agents that are not only correct, but increasingly intelligent.

2505.13741 2026-05-11 cs.CV cs.NE

Frozen Backpropagation: Relaxing Weight Symmetry in Deep Spiking Neural Networks

Gaspard Goupy, Pierre Tirilly, Ioan Marius Bilasco

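AI summary: This paper studies relaxing weight symmetry when training deep spiking neural networks (SNNs) in settings with separate forward and backward networks. To address the high energy cost and hardware overhead that weight symmetry imposes when conventional backpropagation (BP) is implemented on neuromorphic hardware, the authors propose Frozen Backpropagation (fBP), which periodically freezes the feedback weights to reduce weight transport and synchronization overhead. Experiments show that fBP significantly lowers weight transport costs while maintaining high accuracy, and that partial weight transport schemes improve efficiency further.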

Abstract

Direct training of Spiking Neural Networks (SNNs) on neuromorphic hardware can greatly reduce energy costs compared to GPU-based training. However, implementing Backpropagation (BP) on such hardware is challenging because forward and backward passes are typically performed by separate networks with distinct weights. To compute correct gradients, forward and feedback weights must remain symmetric during training, necessitating weight transport between the two networks. This symmetry requirement imposes hardware overhead and increases energy costs. To address this issue, we introduce Frozen Backpropagation (fBP), a BP-based training algorithm relaxing weight symmetry in settings with separate networks. fBP updates forward weights by computing gradients with periodically frozen feedback weights, reducing weight transports during training and minimizing synchronization overhead. To further improve transport efficiency, we propose three partial weight transport schemes of varying computational complexity, where only a subset of weights is transported at a time. We evaluate our methods on image recognition tasks using both temporally and rate-coded SNNs, and compare them to existing approaches addressing the weight symmetry requirement. Our results show that fBP outperforms these methods and achieves accuracy comparable to BP while significantly lowering transport costs. With partial weight transport, fBP can further lower those costs by up to 10,000x at the expense of moderate accuracy loss. This work provides insights for guiding the design of neuromorphic hardware incorporating BP-based on-chip learning.
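The idea of computing gradients with periodically frozen feedback weights can be illustrated with a toy sketch. This is not the authors' algorithm or an SNN: it assumes a scalar two-weight chain, a single made-up training example, and a hypothetical `period` parameter, purely to show how a stale feedback copy reduces the number of weight transports.

```python
def train_with_frozen_feedback(steps, period, lr=0.1):
    """Toy scalar chain y = w2 * w1 * x trained toward a target t.

    The backward pass uses a feedback copy b2 of w2 that is refreshed
    ("transported") only every `period` steps. All values are illustrative.
    """
    x, t = 1.0, 1.0          # single made-up training example
    w1, w2 = 0.5, 0.5        # forward weights
    b2 = w2                  # feedback weight, initially symmetric
    transports = 0
    for step in range(steps):
        if step % period == 0:
            b2 = w2          # weight transport: restore symmetry
            transports += 1
        err = w2 * w1 * x - t        # dLoss/dy for loss = 0.5 * err**2
        grad_w2 = err * w1 * x       # needs forward activations only
        grad_w1 = err * b2 * x       # uses the (possibly stale) feedback weight
        w1 -= lr * grad_w1
        w2 -= lr * grad_w2
    loss = 0.5 * (w2 * w1 * x - t) ** 2
    return loss, transports


loss_bp, n_bp = train_with_frozen_feedback(steps=200, period=1)    # exact symmetry
loss_fbp, n_fbp = train_with_frozen_feedback(steps=200, period=5)  # frozen feedback
print(n_bp, n_fbp)  # 200 transports versus 40
```

In this toy setting both runs drive the loss close to zero, while the frozen variant performs one fifth of the transports; the behaviour on real SNNs is what the paper evaluates.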

2504.16559 2026-05-11 cs.LG q-bio.QM

Synergistic Benefits of Joint Molecule Generation and Property Prediction

Adam Izdebski, Jan Olszewski, Pankhil Gawade, Krzysztof Koras, Serra Korkmaz, Valentin Rauscher, Jakub M. Tomczak, Ewa Szczurek

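AI summary: This study examines the synergistic benefits of joint molecule generation and property prediction and proposes Hyformer, a transformer-based joint model. Using an alternating attention mechanism and a joint pre-training scheme, the model combines molecule generation and property prediction and shows synergistic benefits in conditional sampling, out-of-distribution property prediction, and representation learning. Experiments show clear joint-learning advantages in drug design tasks such as antimicrobial peptide design.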

Comments 17 pages, 4 figures

Journal ref
Transactions on Machine Learning Research (TMLR), 2026
Abstract

Modeling the joint distribution of data samples and their properties makes it possible to construct a single model for both data generation and property prediction, with synergistic benefits reaching beyond purely generative or predictive models. However, training joint models presents daunting architectural and optimization challenges. Here, we propose Hyformer, a transformer-based joint model that successfully blends the generative and predictive functionalities, using an alternating attention mechanism and a joint pre-training scheme. We show that Hyformer is simultaneously optimized for molecule generation and property prediction, while exhibiting synergistic benefits in conditional sampling, out-of-distribution property prediction and representation learning. Finally, we demonstrate the benefits of joint learning in a drug design use case of discovering novel antimicrobial peptides.

2503.14998 2026-05-11 cs.CV

Tables Guide Vision: Learning to See the Heart through Tabular Data

Marta Hasny, Maxime Di Folco, Keno Bressem, Julia Schnabel

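AI summary: This study proposes a tabular-guided contrastive learning framework to address the tendency of conventional visual contrastive learning methods in medical imaging to ignore semantic relationships between samples. By using clinical tabular data, the method identifies patient-level similarities and constructs more semantically meaningful sample pairs, improving visual representation learning. Experiments on a dataset of cardiac MRI images and clinical attributes show that tabular guidance significantly improves performance on downstream tasks, including fine-tuning, linear probing, and zero-shot prediction, and that the method also generalizes well to a natural image dataset.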

Abstract

Contrastive learning methods in computer vision typically rely on augmented views of the same image or multimodal pretraining strategies that align paired modalities. However, these approaches often overlook semantic relationships between distinct instances, leading to false negatives when semantically similar samples are treated as negatives. This limitation is especially critical in medical imaging domains such as cardiology, where demographic and clinical attributes play a critical role in assessing disease risk and patient outcomes. We introduce a tabular-guided contrastive learning framework that leverages clinically relevant tabular data to identify patient-level similarities and construct more meaningful pairs, enabling semantically aligned representation learning without requiring joint embeddings across modalities. Additionally, we adapt the k-NN algorithm for zero-shot prediction to overcome the lack of zero-shot capability in unimodal representations. We demonstrate the strength of our methods using a large cohort of short-axis cardiac MR images and clinical attributes, where tabular data helps to more effectively distinguish between patient subgroups. Evaluation on downstream tasks, including fine-tuning, linear probing, and zero-shot prediction of cardiovascular artery diseases and cardiac phenotypes, shows that incorporating tabular data guidance yields stronger visual representations than conventional methods that rely solely on image augmentation or combined image-tabular embeddings. Further, we show that our method can generalize to natural images by evaluating it on a car advertisement dataset. Code is available at https://github.com/marteczkah/tables_guide_vision.
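The abstract mentions adapting k-NN for zero-shot prediction from unimodal representations but does not give the details. As a rough point of reference only, plain k-NN over a labelled bank of embeddings can be sketched as follows; the function name, the cosine similarity choice, and the toy vectors are all assumptions, not the paper's adaptation.

```python
import math
from collections import Counter


def knn_predict(query, bank_embeddings, bank_labels, k=3):
    """Label a query embedding by majority vote of its k nearest
    neighbours under cosine similarity (plain k-NN, illustrative only)."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    # Rank bank entries from most to least similar to the query.
    ranked = sorted(
        range(len(bank_embeddings)),
        key=lambda i: cosine(query, bank_embeddings[i]),
        reverse=True,
    )
    votes = Counter(bank_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical 2-D embeddings standing in for image features.
bank = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = ["a", "a", "b", "b"]
print(knn_predict([1.0, 0.05], bank, labels, k=3))  # a
```

No classifier head is trained here: the prediction comes entirely from the neighbourhood structure of the embedding space, which is why the quality of the learned representation matters.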

2503.12285 2026-05-11 cs.LG cs.AI cs.GT cs.SY eess.SY stat.ML

A Resilience Framework for Bi-Criteria Combinatorial Optimization with Bandit Feedback

Vaneet Aggarwal, Shweta Jain, Subham Pokhriyal, Christopher John Quinn

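AI summary: This paper studies bi-criteria combinatorial optimization under noisy function evaluations and proposes a resilience framework for such problems. The framework introduces the notion of $(\alpha,\beta,\delta,\texttt{N})$-resilience to describe how approximation guarantees jointly degrade under bounded noise, and develops a general black-box method that converts any resilient offline algorithm into an online algorithm for bi-criteria combinatorial multi-armed bandits. Without structural assumptions such as linearity or submodularity, the method achieves sublinear bounds on regret and cumulative constraint violation, and the paper demonstrates the framework's applicability to classical greedy algorithms in submodular optimization.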

Journal ref
Transactions on Machine Learning Research, May 2026
Abstract

We study bi-criteria combinatorial optimization under noisy function evaluations. While resilience and black-box offline-to-online reductions have been studied in single-objective settings, extending these ideas to bi-criteria problems introduces new challenges due to the coupled degradation of approximation guarantees for objectives and constraints. We introduce a notion of $(\alpha,\beta,\delta,\texttt{N})$-resilience for bi-criteria approximation algorithms, capturing how joint approximation guarantees degrade under bounded (possibly worst-case) oracle noise, and develop a general black-box framework that converts any resilient offline algorithm into an online algorithm for bi-criteria combinatorial multi-armed bandits with bandit feedback. The resulting online guarantees achieve sublinear regret and cumulative constraint violation of order $\tilde{O}(\delta^{2/3}\texttt{N}^{1/3}T^{2/3})$ without requiring structural assumptions such as linearity, submodularity, or semi-bandit feedback on the noisy functions. We demonstrate the applicability of the framework by establishing resilience for several classical greedy algorithms in submodular optimization.
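To see why a bound of this order counts as sublinear, divide it by the horizon $T$. For fixed resilience parameters $\delta$ and $\texttt{N}$ (treated as constants here, which is an assumption of this illustration), the per-round average vanishes:

```latex
\frac{1}{T}\,\tilde{O}\!\left(\delta^{2/3}\,\texttt{N}^{1/3}\,T^{2/3}\right)
  = \tilde{O}\!\left(\delta^{2/3}\,\texttt{N}^{1/3}\,T^{-1/3}\right)
  \;\longrightarrow\; 0
  \qquad \text{as } T \to \infty .
```

The logarithmic factors hidden by $\tilde{O}$ grow more slowly than any positive power of $T$, so they do not change the limit.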

2502.07143 2026-05-11 cs.CL

Ask Patients with Patience: Enabling LLMs for Human-Centric Medical Dialogue with Grounded Reasoning

Jiayuan Zhu, Jiazhen Pan, Yuyuan Liu, Fenglin Liu, Junde Wu

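AI summary: A shortage of medical resources leaves many patients without timely, reliable care, while large language models (LLMs) in real clinical dialogue still suffer from insufficient grounding in authoritative medical sources, opaque handling of diagnostic uncertainty, and language that lacks a human touch. To address this, the study proposes Ask Patients with Patience (APP), a multi-turn medical assistant that guides users to describe their symptoms through empathetic dialogue, uses Bayesian active learning for transparent and adaptive diagnosis, and reasons from authoritative medical guidelines. Experiments show that APP outperforms existing models in diagnostic accuracy, uncertainty reduction, and user experience, offering a more clinically practical approach to AI-assisted healthcare.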

Abstract

The severe shortage of medical doctors limits access to timely and reliable healthcare, leaving millions underserved. Large language models (LLMs) offer a potential solution but struggle in real-world clinical interactions. Many LLMs are not grounded in authoritative medical guidelines and fail to transparently manage diagnostic uncertainty. Their language is often rigid and mechanical, lacking the human-like qualities essential for patient trust. To address these challenges, we propose Ask Patients with Patience (APP), a multi-turn LLM-based medical assistant designed for grounded reasoning, transparent diagnoses, and human-centric interaction. APP enhances communication by eliciting user symptoms through empathetic dialogue, significantly improving accessibility and user engagement. It also incorporates Bayesian active learning to support transparent and adaptive diagnoses. The framework is built on verified medical guidelines, ensuring clinically grounded and evidence-based reasoning. To evaluate its performance, we develop a new benchmark that simulates realistic medical conversations using patient agents driven by profiles extracted from real-world consultation cases. We compare APP against SOTA one-shot and multi-turn LLM baselines. The results show that APP improves diagnostic accuracy, reduces uncertainty, and enhances user experience. By integrating medical expertise with transparent, human-like interaction, APP bridges the gap between AI-driven medical assistance and real-world clinical practice.

2501.09209 2026-05-11 cs.CV

Surgical Visual Understanding (SurgVU) Dataset

Aneeq Zia, Max Berniker, Rogerio Nespolo, Xiaorui Zhang, Conor Perreault, Ziheng Wang, Benjamin Mueller, Ryan Schmidt, Kiran Bhattacharyya, Xi Liu, Anthony Jarc

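AI summary: This paper introduces the Surgical Visual Understanding (SurgVU) dataset, which aims to advance foundational research in surgical data science. The dataset contains a large number of surgical videos with labels; the paper describes how the data was collected and its unique attributes, and outlines several example problems suited to a variety of machine learning tasks. Although designed for specific scientific challenges, the dataset is broadly applicable, and the authors hope it will draw the wider machine learning community to challenging problems in surgical settings and serve as an important benchmark for future research.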

Abstract

Owing to recent advances in machine learning and the ability to harvest large amounts of data during robotic-assisted surgeries, surgical data science is ripe for foundational work. We present a large dataset of surgical videos and their accompanying labels for this purpose. We describe how the data was collected and some of its unique attributes. Multiple example problems are outlined. Although the dataset was curated for a particular set of scientific challenges (in an accompanying paper), it is general enough to be used for a broad range of machine learning questions. Our hope is that this dataset exposes the larger machine learning community to the challenging problems within surgical data science, and becomes a touchstone for future research. The videos are available at https://storage.googleapis.com/isi-surgvu/surgvu24_videos_only.zip, the labels at https://storage.googleapis.com/isi-surgvu/surgvu24_labels_updated_v2.zip, a validation set for the tool detection problem at https://storage.googleapis.com/isi-surgvu/cat1_test_set_public.zip, and a sample set of question-and-answer pairs for surgical visual question answering at https://storage.googleapis.com/isi-surgvu/SURGVU25_cat_2_sample_set_public.zip.

2410.06355 2026-05-11 cs.RO cs.AI

UNCOM: Zero-shot Context-Aware Command Understanding for Tabletop Scenarios

Antonio Galiza Cerdeira Gonzalez, Paweł Gajewski, Bipin Indurkhya

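AI summary: This paper proposes UNCOM, a novel hybrid framework for understanding natural human commands in tabletop scenarios. The system integrates multiple sources of information, including speech, gestures, and scene context, to extract structured, executable instructions, enabling zero-shot robot operation without predefined object models or task-specific training data. Using foundation models and task-specific deep learning models, UNCOM provides out-of-the-box speech recognition, natural language understanding, gesture detection, and object segmentation; its modular architecture improves transparency and explainability, and it achieves an 82.39% success rate on a real-world human-robot interaction dataset.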

Abstract

This paper presents UNCOM, a novel hybrid framework for interpreting natural human commands in tabletop scenarios. The system integrates multiple sources of information (speech, gestures, and scene context) to extract structured, actionable instructions for robots. Addressing the need for general-purpose human-robot interaction in domestic environments, UNCOM is designed for zero-shot operation, without reliance on predefined object models or training data specific to a given task. Using foundational and task-specific deep learning models, it allows out-of-the-box speech recognition, natural language understanding, gesture detection, and object segmentation. The modular architecture enhances transparency and explainability by explicitly parsing commands into object-action-target representations, enabling integration with symbolic robotic frameworks. We demonstrate the system on a TIAGo++ robot and provide an evaluation on a real-world data set of human-robot interaction scenarios, achieving an 82.39% success rate over our benchmark data set and highlighting the robustness of the system to diversity, noise, and communication ambiguity. The data set, evaluation scenarios, and the code are publicly available to support future research.

2410.06347 2026-05-11 cs.RO cs.AI

Goal-Conditioned Decision Transformer for Multi-Goal Offline Reinforcement Learning

Paweł Gajewski, Dominik Żurek, Marcin Pietroń, Kamil Faber

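AI summary: This paper proposes a goal-conditioned Decision Transformer for multi-goal offline reinforcement learning, aiming to address low sample efficiency and poor generalization across goals in robotics. By explicitly incorporating goal states into the sequence modeling framework, the method efficiently solves a variety of tasks using only pre-collected data. Experiments on a new offline dataset for the Franka Emika Panda platform show that the method outperforms state-of-the-art online baselines and is notably robust in sparse-reward settings.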

Abstract

Reinforcement learning (RL) in robotics faces significant hurdles regarding sample efficiency and generalization across varying goals. While Offline RL mitigates the need for costly online interactions, its integration with goal-conditioned policies and transformer-based architectures remains underexplored. We introduce a Goal-Conditioned Decision Transformer adapted for offline multi-goal robotics. By explicitly incorporating goal states into the sequence modeling framework, our approach efficiently solves varying tasks using only pre-collected data. We validate this method on a newly released offline dataset for the Franka Emika Panda platform. Experimental results demonstrate that our approach outperforms state-of-the-art online baselines in complex tasks and maintains robustness in sparse-reward settings, even with limited expert demonstrations.

2408.07522 2026-05-11 cs.SD cs.LG eess.AS

Optimising MFCC parameters for the automatic detection of respiratory diseases

Yuyang Yan, Sami O. Simons, Loes van Bemmel, Lauren Reinders, Frits M. E. Franssen, Visara Urovi

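AI summary: This study examines how MFCC parameters affect the automatic detection of respiratory diseases, systematically analyzing the number of coefficients, frame length, and hop length. Experiments on four datasets with an SVM classifier find that accuracy with MFCC features decreases as hop length increases, that the optimal number of coefficients is around 30, and that sensitivity to frame length differs across datasets. The study further optimizes the parameter combination, substantially improving classification accuracy, with a maximum improvement of 19.6%.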

Abstract

Voice signals originating from the respiratory tract are utilized as valuable acoustic biomarkers for the diagnosis and assessment of respiratory diseases. Among the employed acoustic features, Mel Frequency Cepstral Coefficients (MFCCs) are widely used for automatic analysis, with MFCC extraction commonly relying on default parameters. However, no comprehensive study has systematically investigated the impact of MFCC extraction parameters on respiratory disease diagnosis. In this study, we address this gap by examining the effects of key parameters, namely the number of coefficients, frame length, and hop length between frames, on respiratory condition examination. Our investigation uses four datasets: the Cambridge COVID-19 Sound database, the Coswara dataset, the Saarbrucken Voice Disorders (SVD) database, and a TACTICAS dataset. The Support Vector Machine (SVM) is employed as the classifier, given its widespread adoption and efficacy. Our findings indicate that the accuracy of MFCC decreases as hop length increases, and the optimal number of coefficients is observed to be approximately 30. The performance of MFCC varies with frame length across the datasets: for the COVID-19 datasets (Cambridge COVID-19 Sound database and Coswara dataset), performance declines with longer frame lengths, while for the SVD dataset, performance improves with increasing frame length (from 50 ms to 500 ms). Furthermore, we investigate the optimized combination of these parameters and observe substantial enhancements in accuracy. With the optimized combination, the SVM model achieves accuracies of 81.1%, 80.6%, and 71.7%, improvements of 19.6%, 16.1%, and 14.9% over the worst combination, for the Cambridge COVID-19 Sound database, the Coswara dataset, and the SVD dataset respectively.
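Frame length and hop length together determine how many frames, and therefore how many MFCC vectors, a recording yields. The arithmetic can be sketched as below; the 16 kHz sample rate, the 3-second clip, and the no-padding convention are assumptions chosen for illustration and are not taken from the study.

```python
def ms_to_samples(ms, sample_rate):
    """Convert a duration in milliseconds to a whole number of samples."""
    return int(round(ms * sample_rate / 1000))


def frame_count(n_samples, frame_length, hop_length):
    """Number of full frames that fit in the signal when no padding is applied."""
    if n_samples < frame_length:
        return 0
    return 1 + (n_samples - frame_length) // hop_length


sample_rate = 16000                  # assumed sample rate
n_samples = 3 * sample_rate          # assumed 3-second recording

frame_50ms = ms_to_samples(50, sample_rate)     # 800 samples
frame_500ms = ms_to_samples(500, sample_rate)   # 8000 samples

print(frame_count(n_samples, frame_50ms, ms_to_samples(10, sample_rate)))   # 296
print(frame_count(n_samples, frame_50ms, ms_to_samples(25, sample_rate)))   # 119
print(frame_count(n_samples, frame_500ms, ms_to_samples(10, sample_rate)))  # 251
```

With 30 coefficients per frame, the first setting gives a 30 x 296 feature matrix for this clip. Toolkits that pad the signal before framing will report slightly different counts.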

2408.06747 2026-05-11 cs.CV

ReCLIP++: Learn to Rectify the Bias of CLIP for Unsupervised Semantic Segmentation

Jingyun Wang, Guoliang Kang

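AI summary: This paper studies how to use CLIP for unsupervised semantic segmentation and observes that, on pixel-level understanding tasks, CLIP exhibits biases such as class-preference and space-preference bias that hurt segmentation performance. The authors propose ReCLIP++, which models and rectifies the two biases with a learnable Reference prompt and a projection of the positional embedding, respectively, generates a bias logit map through matrix multiplication, and then rectifies CLIP's logits by element-wise subtraction. Experiments show that the method outperforms existing methods on multiple benchmark datasets.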

Comments Extended version of our CVPR 24 paper, accepted by IJCV 2025

Abstract

Recent works utilize CLIP to perform the challenging unsupervised semantic segmentation task where only images without annotations are available. However, we observe that when adopting CLIP to such a pixel-level understanding task, unexpected bias (including class-preference bias and space-preference bias) occurs. Previous works do not explicitly model the bias, which largely constrains the segmentation performance. In this paper, we propose to explicitly model and rectify the bias existing in CLIP to facilitate the unsupervised semantic segmentation task. Specifically, we design a learnable "Reference" prompt to encode class-preference bias and a projection of the positional embedding in the vision transformer to encode space-preference bias respectively. To avoid interference, the two kinds of biases are first independently encoded into different features, i.e., the Reference feature and the positional feature. Via a matrix multiplication between the Reference feature and the positional feature, a bias logit map is generated to explicitly represent the two kinds of biases. Then we rectify the logits of CLIP via a simple element-wise subtraction. To make the rectified results smoother and more contextual, we design a mask decoder which takes the feature of CLIP and the rectified logits as input and outputs a rectified segmentation mask with the help of the Gumbel-Softmax operation. A contrastive loss based on the masked visual features and the text features of different classes is imposed, which makes the bias modeling and rectification process meaningful and effective. Extensive experiments on various benchmarks including PASCAL VOC, PASCAL Context, ADE20K, Cityscapes, and COCO Stuff demonstrate that our method performs favorably against previous state-of-the-art methods. The implementation is available at: https://github.com/dogehhh/ReCLIP.
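The two operations named in the abstract, a matrix multiplication that produces the bias logit map and an element-wise subtraction that rectifies the logits, can be sketched with plain lists. The shapes used here (N positions by D dimensions for the positional feature, C classes by D for the Reference feature, N by C for the logits) and the toy numbers are assumptions for illustration; the released implementation may organize tensors differently.

```python
def matmul_transposed(a, b):
    """Return a @ b^T for nested lists a (N x D) and b (C x D), giving N x C."""
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in b]
            for row_a in a]


def rectify_logits(clip_logits, positional_feature, reference_feature):
    """Subtract a bias logit map from the logits, element by element."""
    bias_map = matmul_transposed(positional_feature, reference_feature)  # N x C
    return [[logit - bias for logit, bias in zip(logit_row, bias_row)]
            for logit_row, bias_row in zip(clip_logits, bias_map)]


# Toy example: 2 positions, 3 classes, feature dimension 2 (all values made up).
positional = [[1.0, 0.0], [0.0, 1.0]]
reference = [[0.5, 0.0], [0.0, 0.25], [1.0, 1.0]]
logits = [[2.0, 1.0, 3.0], [1.0, 2.0, 3.0]]

print(rectify_logits(logits, positional, reference))
# [[1.5, 1.0, 2.0], [1.0, 1.75, 2.0]]
```

In this toy case the third class has a large bias at both positions, so subtraction narrows its lead over the other classes.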

2407.15134 2026-05-11 cs.LG cs.AI

Proximal Policy Distillation

Giacomo Spigler

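AI summary: This paper proposes Proximal Policy Distillation (PPD), a new policy distillation method that combines student-driven distillation with Proximal Policy Optimization (PPO) to improve sample efficiency and exploit the additional rewards the student policy collects during distillation. Experiments show that, compared with the conventional student-distill and teacher-distill methods, PPD achieves higher sample efficiency and better student policies across a range of reinforcement learning environments, and is notably more robust when distilling from imperfect demonstrations.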

Journal ref
Transactions on Machine Learning Research, ISSN 2835-8856 (2025)
Abstract

We introduce Proximal Policy Distillation (PPD), a novel policy distillation method that integrates student-driven distillation and Proximal Policy Optimization (PPO) to increase sample efficiency and to leverage the additional rewards that the student policy collects during distillation. To assess the efficacy of our method, we compare PPD with two common alternatives, student-distill and teacher-distill, over a wide range of reinforcement learning environments that include discrete actions and continuous control (Atari, MuJoCo, and Procgen). For each environment and method, we perform distillation to a set of target student neural networks that are smaller, identical (self-distillation), or larger than the teacher network. Our findings indicate that PPD improves sample efficiency and produces better student policies compared to typical policy distillation approaches. Moreover, PPD demonstrates greater robustness than alternative methods when distilling policies from imperfect demonstrations. The code for the paper is released as part of a new Python library built on top of stable-baselines3 to facilitate policy distillation: 'sb3-distill'.

2406.13724 2026-05-11 cs.AI

Heterogeneous Graph Neural Networks with Post-hoc Explanations for Multi-modal and Explainable Land Use Inference

Xuehao Zhai, Junqi Jiang, Adam Dejl, Antonio Rago, Fangce Guo, Francesca Toni, Aruna Sivakumar

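AI summary: For urban land use inference, this study proposes a framework that combines heterogeneous graph neural networks (HGNs) with explainable AI techniques to improve both prediction accuracy and explainability on multi-modal data. The method captures correlations among spatially neighbouring objects and heterogeneity among service types, and provides transparent grounds for decisions through feature attribution and counterfactual explanations. Experiments show that the framework outperforms conventional graph neural networks on multiple land-use indicators, especially 'office' and 'sustenance', giving urban planners a more convincing analysis tool.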

Journal ref
Information Fusion, Volume 120, 103057. 2025
Abstract

Urban land use inference is a critically important task that aids in city planning and policy-making. Recently, the increased use of sensor and location technologies has facilitated the collection of multi-modal mobility data, offering valuable insights into daily activity patterns. Many studies have adopted advanced data-driven techniques to explore the potential of these multi-modal mobility data in land use inference. However, existing studies often process samples independently, ignoring the spatial correlations among neighbouring objects and heterogeneity among different services. Furthermore, the inherently low interpretability of complex deep learning methods poses a significant barrier in urban planning, where transparency and extrapolability are crucial for making long-term policy decisions. To overcome these challenges, we introduce an explainable framework for inferring land use that synergises heterogeneous graph neural networks (HGNs) with Explainable AI techniques, enhancing both accuracy and explainability. The empirical experiments demonstrate that the proposed HGNs significantly outperform baseline graph neural networks for all six land-use indicators, especially in terms of 'office' and 'sustenance'. As explanations, we consider feature attribution and counterfactual explanations. The analysis of feature attribution explanations shows that the symmetrical nature of the 'residence' and 'work' categories predicted by the framework aligns well with the commuter's 'work' and 'recreation' activities in London. The analysis of the counterfactual explanations reveals that variations in node features and types are primarily responsible for the differences observed between the predicted land use distribution and the ideal mixed state. These analyses demonstrate that the proposed HGNs can suitably support urban stakeholders in their urban planning and policy-making.

2403.18149 2026-05-11 cs.RO cs.SY eess.SY math.OC

Code Generation and Conic Constraints for Model-Predictive Control on Microcontrollers with Conic-TinyMPC

Ishaan Mahajan, Khai Nguyen, Sam Schoedel, Elakhya Nedumaran, Moises Mata, Brian Plancher, Zachary Manchester

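AI summary: This paper studies how to efficiently deploy model-predictive control (MPC) with second-order cone constraints on resource-constrained microcontrollers. To address the high computational cost that conventional embedded solvers incur on complex constraints, the authors extend an ADMM-based approach into a structure-exploiting solver that supports C++ code generation from Python, MATLAB, and Julia. Experiments show speedups of 10.6x to 142.7x over existing embedded solvers on QP and SOCP problems and much better use of microcontroller memory, and the solver has been validated on a real aerial-vehicle trajectory tracking task.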

Comments Accepted to ICRA 2026. 4 Figures. 2 Tables. First three authors contributed equally

Abstract

Model-predictive control (MPC) is a state-of-the-art control method for constrained robotic systems, yet deployment on resource-limited hardware remains difficult. This challenge is magnified by expressive conic constraints, which offer greater modeling power but require significantly more computation than linear alternatives. To address this challenge, we extend recent work developing fast, structure-exploiting, cached solvers for embedded applications based on the Alternating Direction Method of Multipliers (ADMM) to provide support for second-order cones, as well as C++ code generation from Python, MATLAB, and Julia. Microcontroller benchmarks show that our solver provides up to a two-order-of-magnitude speedup, ranging from 10.6x to 142.7x, over state-of-the-art embedded solvers on QP and SOCP problems, and enables us to fit order-of-magnitude larger problems in memory. We validate our solver's deployed performance through simulation and hardware experiments, including trajectory tracking with conic constraints on a 27g Crazyflie quadrotor. Our open-source code is available at https://tinympc.org.
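ADMM-style solvers commonly enforce cone membership through a projection step. The closed-form Euclidean projection onto the second-order cone is standard and can be sketched as below; this is the textbook formula written as a standalone Python function with a hypothetical name, not code from the TinyMPC project, whose internals the abstract does not describe.

```python
import math


def project_soc(v, s):
    """Euclidean projection of the point (v, s) onto the second-order cone
    {(v, s) : ||v||_2 <= s}. Returns the projected (v, s) pair."""
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_v <= s:
        # Already inside the cone: nothing to do.
        return list(v), s
    if norm_v <= -s:
        # Inside the polar cone: the closest cone point is the origin.
        return [0.0] * len(v), 0.0
    # Otherwise project onto the cone boundary (norm_v > 0 is guaranteed here).
    scale = (norm_v + s) / 2.0
    return [scale * x / norm_v for x in v], scale


print(project_soc([3.0, 4.0], 0.0))   # ([1.5, 2.0], 2.5)
print(project_soc([1.0, 0.0], 2.0))   # ([1.0, 0.0], 2.0)
print(project_soc([1.0, 0.0], -2.0))  # ([0.0, 0.0], 0.0)
```

The projection costs one norm and one scaling per cone, which is part of why projection-based methods are attractive on small processors, although a norm is still more work than the simple clipping used for box constraints.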

2310.07379 2026-05-11 cs.CV cs.AI cs.LG

Causal Unsupervised Semantic Segmentation

Junho Kim, Byung-Kwan Lee, Yong Man Ro

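AI summary: This paper studies unsupervised semantic segmentation without human annotations and proposes CAUSE, a novel framework based on causal inference. By introducing an intervention-oriented causal adjustment strategy, the method builds a two-step pipeline: it first constructs a concept clusterbook as a mediator that represents concept prototypes at different levels of granularity, and then uses this mediator to guide pixel-level self-supervised learning for more accurate semantic grouping. Experiments show that CAUSE achieves state-of-the-art unsupervised semantic segmentation performance on multiple datasets.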

Comments code available: https://github.com/ByungKwanLee/Causal-Unsupervised-Segmentation

Journal ref
Pattern Recognition, Volume 171, Part B, 112173 (2026)
Abstract

Unsupervised semantic segmentation aims to achieve high-quality semantic grouping without human-labeled annotations. With the advent of self-supervised pre-training, various frameworks utilize the pre-trained features to train prediction heads for unsupervised dense prediction. However, a significant challenge in this unsupervised setup is determining the appropriate level of clustering required for segmenting concepts. To address this, we propose a novel framework, CAusal Unsupervised Semantic sEgmentation (CAUSE), which leverages insights from causal inference. Specifically, we bridge an intervention-oriented approach (i.e., frontdoor adjustment) to define suitable two-step tasks for unsupervised prediction. The first step involves constructing a concept clusterbook as a mediator, which represents possible concept prototypes at different levels of granularity in a discretized form. Then, the mediator establishes an explicit link to the subsequent concept-wise self-supervised learning for pixel-level grouping. Through extensive experiments and analyses on various datasets, we corroborate the effectiveness of CAUSE and achieve state-of-the-art performance in unsupervised semantic segmentation.
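For reference, the frontdoor adjustment from causal inference expresses the effect of an intervention on $x$ through a mediator $m$. The symbols below are the generic ones from the causal inference literature; how CAUSE maps features, the clusterbook, and predictions onto them is not specified in the abstract, so this is only the general formula:

```latex
P\bigl(y \mid \mathrm{do}(x)\bigr)
  = \sum_{m} P(m \mid x) \sum_{x'} P(y \mid m, x')\, P(x')
```

The two nested sums correspond to two stages: first estimating the mediator from the input, then predicting the outcome from the mediator while averaging over inputs.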

2305.01429 2026-05-11 cs.LG stat.ML

Unsupervised Feature Based Algorithms for Time Series Extrinsic Regression

David Guijo-Rubio, Matthew Middlehurst, Guilherme Arcencio, Diego Furtado Silva, Anthony Bagnall

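AI summary This paper studies time series extrinsic regression (TSER): using a set of training time series to predict a continuous response variable that is not directly related to the regressor series. The authors expand the TSER archive for comparing algorithms from 19 to 63 problems and compare a range of regression models, finding that a regression adaptation of a classifier, rotation forest, performs strongly. They propose two new TSER algorithms, FreshPRINCE and DrCIF, which predict from statistical features extracted from the time series and significantly outperform the other methods, in particular the standard rotation forest regressor.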

Comments 19 pages, 21 figures, 6 tables. Appendix included

Details
Journal ref
Data Mining and Knowledge Discovery, Volume 38, pages 2141-2185, (2024)
English abstract

Time Series Extrinsic Regression (TSER) involves using a set of training time series to form a predictive model of a continuous response variable that is not directly related to the regressor series. The TSER archive for comparing algorithms was released in 2022 with 19 problems. We increase the size of this archive to 63 problems and reproduce the previous comparison of baseline algorithms. We then extend the comparison to include a wider range of standard regressors and the latest versions of TSER models used in the previous study. We show that none of the previously evaluated regressors can outperform a regression adaptation of a standard classifier, rotation forest. We introduce two new TSER algorithms developed from related work in time series classification. FreshPRINCE is a pipeline estimator consisting of a transform into a wide range of summary features followed by a rotation forest regressor. DrCIF is a tree ensemble that creates features from summary statistics over random intervals. Our study demonstrates that both algorithms, along with InceptionTime, exhibit significantly better performance compared to the other 18 regressors tested. More importantly, these two proposals (DrCIF and FreshPRINCE) are the only models that significantly outperform the standard rotation forest regressor.
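The abstract describes DrCIF only as creating features from summary statistics over random intervals. A minimal sketch of that idea alone might look as follows; the function name, the choice of mean, standard deviation and slope, and the parameters `n_intervals`, `min_len` and `seed` are all assumptions for illustration, not the authors' implementation, and the tree ensemble built on top is omitted.

```python
import random
import statistics

def random_interval_features(series, n_intervals=4, min_len=3, seed=0):
    """Hypothetical helper: summary statistics over randomly chosen intervals."""
    rng = random.Random(seed)
    n = len(series)
    features = []
    for _ in range(n_intervals):
        # Draw an interval length, then a start position that keeps it in range.
        length = rng.randint(min_len, n)
        start = rng.randint(0, n - length)
        window = series[start:start + length]
        mean = statistics.fmean(window)
        std = statistics.pstdev(window)
        # Least-squares slope of the window values against their index.
        t_mean = (length - 1) / 2
        denom = sum((t - t_mean) ** 2 for t in range(length))
        slope = sum((t - t_mean) * (x - mean) for t, x in enumerate(window)) / denom
        features.extend([mean, std, slope])
    return features
```

Each series is mapped to a fixed-length vector (three values per interval here), which any standard regressor could then consume.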

2304.13029 2026-05-11 cs.LG

Bake off redux: a review and experimental evaluation of recent time series classification algorithms

Matthew Middlehurst, Patrick Schäfer, Anthony Bagnall

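AI summary This paper reviews and evaluates recent progress in time series classification (TSC) algorithms, comparing a range of algorithms experimentally on the expanded UCR archive. It extends the existing algorithm taxonomy with three new categories and introduces 30 new datasets to further validate performance. The results show that two recently proposed algorithms, Hydra+MultiROCKET and HIVE-COTEv2, perform significantly better on both current and new problems.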

Details
English abstract

In 2017, a research paper compared 18 Time Series Classification (TSC) algorithms on 85 datasets from the University of California, Riverside (UCR) archive. This study, commonly referred to as a 'bake off', identified that only nine algorithms performed significantly better than the Dynamic Time Warping (DTW) and Rotation Forest benchmarks that were used. The study categorised each algorithm by the type of feature it extracts from time series data, forming a taxonomy of five main algorithm types. This categorisation of algorithms alongside the provision of code and accessible results for reproducibility has helped fuel an increase in popularity of the TSC field. Over six years have passed since this bake off: the UCR archive has expanded to 112 datasets and a large number of new algorithms have been proposed. We revisit the bake off, seeing how each of the proposed categories has advanced since the original publication, and evaluate the performance of newer algorithms against the previous best-of-category using an expanded UCR archive. We extend the taxonomy to include three new categories to reflect recent developments. Alongside the originally proposed distance, interval, shapelet, dictionary and hybrid based algorithms, we compare newer convolution and feature based algorithms as well as deep learning approaches. We introduce 30 classification datasets either recently donated to the archive or reformatted to the TSC format, and use these to further evaluate the best performing algorithm from each category. Overall, we find that two recently proposed algorithms, Hydra+MultiROCKET and HIVE-COTEv2, perform significantly better than other approaches on both the current and new TSC problems.
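For readers unfamiliar with the Dynamic Time Warping benchmark named in the abstract, the distance itself can be sketched with the textbook dynamic program below. The squared pointwise cost and the absence of a warping window are assumptions made here; the abstract does not state which DTW configuration the benchmark used.

```python
import math

def dtw_distance(a, b):
    """DTW distance between two 1-D sequences (squared pointwise cost, no window)."""
    n, m = len(a), len(b)
    # cost[i][j] holds the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return math.sqrt(cost[n][m])
```

A DTW benchmark classifier is typically a nearest-neighbour rule over this distance, which would label a test series with the class of its closest training series.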

2104.07551 2026-05-11 cs.LG

HIVE-COTE 2.0: a new meta ensemble for time series classification

Matthew Middlehurst, James Large, Michael Flynn, Jason Lines, Aaron Bostrom, Anthony Bagnall

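AI summary HIVE-COTE 2.0 is a new meta ensemble for time series classification that improves accuracy by combining classifiers from multiple domains, such as shapelet-based, bag-of-words dictionary-based, and phase-dependent interval-based methods. It comprehensively revises HIVE-COTE 1.0, introducing two new classifiers, the Temporal Dictionary Ensemble (TDE) and the Diverse Representation Canonical Interval Forest (DrCIF), and adding the Arsenal, an ensemble of ROCKET classifiers, which significantly improves accuracy and usability. Experiments show that HIVE-COTE 2.0 outperforms the current state of the art on multiple time series datasets.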

Details
English abstract

The Hierarchical Vote Collective of Transformation-based Ensembles (HIVE-COTE) is a heterogeneous meta ensemble for time series classification. HIVE-COTE forms its ensemble from classifiers of multiple domains, including phase-independent shapelets, bag-of-words based dictionaries and phase-dependent intervals. Since it was first proposed in 2016, the algorithm has remained state of the art for accuracy on the UCR time series classification archive. Over time it has been incrementally updated, culminating in its current state, HIVE-COTE 1.0. During this time a number of algorithms have been proposed which match the accuracy of HIVE-COTE. We propose comprehensive changes to the HIVE-COTE algorithm which significantly improve its accuracy and usability, presenting this upgrade as HIVE-COTE 2.0. We introduce two novel classifiers, the Temporal Dictionary Ensemble (TDE) and Diverse Representation Canonical Interval Forest (DrCIF), which replace existing ensemble members. Additionally, we introduce the Arsenal, an ensemble of ROCKET classifiers as a new HIVE-COTE 2.0 constituent. We demonstrate that HIVE-COTE 2.0 is significantly more accurate than the current state of the art on 112 univariate UCR archive datasets and 26 multivariate UEA archive datasets.

2605.07514 2026-05-11 cs.RO cs.CV

Is the Future Compatible? Diagnosing Dynamic Consistency in World Action Models

Bo-Kai Ruan, Teng-Fang Hsiao, Ling Lo, Hong-Han Shuai

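AI summary This paper studies the dynamic consistency of World Action Models (WAMs) when they generate future actions and observations, noting that current models may produce futures that are visually plausible but dynamically incompatible. Through a systematic analysis, the authors find that action-state consistency, the agreement between actions and state transitions, is an important indicator of WAM reliability, and that background collapse can lead to misleading consistency judgments. Building on these findings, they propose a value-free consensus strategy that improves test-time trajectory selection and raises success rates on several robotic tasks.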

Comments Technical Report

Details
English abstract

World Action Models (WAMs) enable decision-making through imagined rollouts by predicting future observations and actions. However, the reliability of these imagined futures remains under-examined: is a generated future merely visually plausible, or is it dynamically compatible with the action sequence it claims to model? In this work, we identify action-state consistency, the alignment between predicted actions and induced state transitions, as a missing reliability axis for WAMs. Through a systematic study across representative joint-prediction and inverse-dynamics models, we find that action-state consistency systematically separates successful and failed rollouts across many tasks and follows similar success-failure trends as learned value estimates. These results suggest that consistency captures decision-relevant structure beyond visual realism. We further identify background collapse as an important boundary condition, where low-dynamics failed trajectories can become deceptively consistent because static futures are easier to predict. Building on these findings, we introduce a value-free consensus strategy for test-time selection, which ranks candidate rollouts by agreement among predicted futures. This strategy improves success rates on RoboCasa and RoboTwin 2.0 without additional training or reward modeling. Taken together, our findings establish action-state consistency as both a diagnostic tool for evaluating WAM reliability and a practical signal for value-free planning.
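The abstract says only that the consensus strategy ranks candidate rollouts by agreement among predicted futures. One way such a ranking could be sketched is below; representing each future as a flat feature vector, measuring agreement by mean Euclidean distance to the other candidates, and the name `consensus_rank` are all assumptions for illustration, not the paper's procedure.

```python
import math

def consensus_rank(futures):
    """Hypothetical helper: order candidate indices from most to least agreeing.

    `futures` is a list of equal-length numeric vectors, one per candidate
    rollout; at least two candidates are required.
    """
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

    scores = []
    for i, f in enumerate(futures):
        # Mean distance to every other candidate: lower means more agreement.
        others = [dist(f, g) for j, g in enumerate(futures) if j != i]
        scores.append(sum(others) / len(others))
    return sorted(range(len(futures)), key=lambda i: scores[i])
```

No reward or value model appears anywhere in the score, which is the sense in which a selection rule of this kind is value-free.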

2605.07513 2026-05-11 cs.LG

Tessellations of Semi-Discrete Flow Matching

Emile Pierret, Johannes Hertrich, Samuel Hurault, Julie Delon

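AI summary This paper studies semi-discrete Flow Matching, in which a Gaussian source distribution is transported to a finite discrete set of target points. This setting is the theoretical basis for using Flow Matching in generative modeling. The paper gives a closed-form expression for the exact Flow Matching velocity field, which makes it possible to analyze the geometry induced by the terminal flow map independently of optimization and approximation effects. The terminal assignment regions are shown to be open and simply connected and, under an additional assumption, homeomorphic to the unit ball; compared with the Laguerre cells of semi-discrete optimal transport, however, these regions can be non-convex with curved boundaries and show different boundedness and adjacency patterns.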

Details
English abstract

We study Flow Matching in a semi-discrete setting where a Gaussian source is transported toward a discrete target supported on finitely many points. This semi-discrete regime is the theoretical setting behind the use of Flow Matching for generative modeling, where the target distribution is represented by a finite dataset. In this semi-discrete regime, the exact Flow Matching velocity field is available in closed form, which makes it possible to analyze the geometry induced by the terminal flow map independently of optimization and approximation effects. We investigate the terminal assignment regions, namely the preimages of the target atoms under the terminal flow. We show that these regions are open, simply connected and, under an additional assumption, homeomorphic to the unit ball. At the same time, a planar four-point example shows that these cells can differ sharply from Laguerre cells arising in semi-discrete optimal transport: they may be non-convex, have curved boundaries, and exhibit different boundedness and adjacency patterns. These results clarify the geometry intrinsically induced by the exact semi-discrete Flow Matching objective before neural approximation enters the picture.
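To make the closed-form claim concrete, here is a minimal sketch of the exact velocity field under assumptions the abstract does not spell out: a standard Gaussian source, the linear interpolation path x_t = (1 - t) x_0 + t x_1, and independent coupling. Under those assumptions the conditional law of x_t given atom y_i is N(t y_i, (1 - t)^2 I), and the marginal velocity is the posterior-weighted average of the conditional velocities (y_i - x) / (1 - t). The function name and signature are hypothetical.

```python
import math

def fm_velocity(x, t, atoms, weights=None):
    """Exact marginal velocity at point x and time t in [0, 1) (sketch)."""
    k = len(atoms)
    weights = weights or [1.0 / k] * k
    sigma = 1.0 - t
    # Log posterior (up to a constant) of each atom given x_t = x.
    logits = [
        math.log(w) - sum((xj - t * yj) ** 2 for xj, yj in zip(x, y)) / (2 * sigma ** 2)
        for w, y in zip(weights, atoms)
    ]
    top = max(logits)  # subtract the maximum for numerical stability
    unnorm = [math.exp(l - top) for l in logits]
    total = sum(unnorm)
    post = [u / total for u in unnorm]
    # Posterior-weighted average of the conditional velocities.
    return [
        sum(p * (y[d] - x[d]) for p, y in zip(post, atoms)) / sigma
        for d in range(len(x))
    ]
```

Integrating this field from t = 0 toward t = 1 and recording which atom each starting point reaches would trace out the terminal assignment regions the abstract analyzes.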

2605.07512 2026-05-11 cs.CV

Hierarchical Dual-Subspace Decoupling for Continual Learning in Vision-Language Models

Mengxin Qin, Xiang Zhang, Kun Wei, Xu Yang, Cheng Deng

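AI summary This paper studies class-incremental learning for continual learning in vision-language models, which aims to keep acquiring new knowledge without forgetting what was learned before. To address the severe forgetting caused by cross-task subspace interference, the authors propose HDSD, a hierarchical dual-subspace decoupling framework that introduces a feature modulation module and a hierarchical learning module and decomposes the parameter space into general and task-specific subspaces, reducing subspace interference and parameter drift. Experiments show that the method achieves state-of-the-art performance on multiple benchmarks.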

Details
English abstract

Class-incremental learning aims to continuously acquire new knowledge while preserving previously learned information, thereby mitigating catastrophic forgetting. Existing methods primarily restrict parameter updates but often overlook their structural properties in high-dimensional spaces. From a subspace perspective, updates induced by different tasks tend to lie in multiple overlapping low-rank subspaces, leading to cross-task subspace interference and severe forgetting. To address this issue, we propose HDSD, a Hierarchical Dual-Subspace Decoupling framework for continual learning in vision-language models. Specifically, we introduce a lightweight Feature Modulation Module (FMM) that explicitly decomposes the parameter space into general and task-specific subspaces. Building on this design, we develop two complementary components. First, a General Fusion Module (GFM) evaluates relative parameter changes across tasks and uses an adaptive threshold to capture stable and transferable knowledge. Second, a Hierarchical Learning Module (HLM) performs structured parameter decomposition via Singular Value Decomposition (SVD) and uses a scaling mechanism to constrain updates within distinct subspace scales. Together, these designs reduce subspace interference and parameter drift. Extensive experiments on conventional benchmarks show that HDSD achieves state-of-the-art results.
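The abstract states that the HLM decomposes parameters via SVD but does not say how the subspaces are chosen or scaled. As a generic illustration of SVD-based subspace splitting only, the sketch below separates a weight update into its top-k singular subspace and the remainder; the function name and the rank cutoff `k` are hypothetical and this is not the HDSD procedure.

```python
import numpy as np

def split_update_by_rank(delta_w, k):
    """Hypothetical helper: split an update into a rank-k part and the residual."""
    u, s, vt = np.linalg.svd(delta_w, full_matrices=False)
    # Rebuild the component spanned by the k largest singular directions.
    top = (u[:, :k] * s[:k]) @ vt[:k, :]
    # The remainder lies in the orthogonal complement of that subspace.
    rest = delta_w - top
    return top, rest
```

Because the two parts occupy orthogonal subspaces, they can be scaled or frozen independently, which is one common way low-rank structure is used to limit interference between tasks.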

2605.07510 2026-05-11 cs.CV cs.CL cs.IR

InterLV-Search: Benchmarking Interleaved Multimodal Agentic Search

Bohan Hou, Jiuning Gu, Jiayan Guo, Ronghao Dang, Sicong Leng, Xin Li, Xuemeng Song, Jianfei Yang

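AI summary Existing benchmarks for multimodal agentic search mainly evaluate text search and visual browsing, but visual evidence usually serves only as the input or the final answer and is not used interactively during the search. This paper proposes InterLV-Search, a new benchmark for interleaved language-vision agentic search in which textual and visual evidence must be used alternately to condition later search steps. The benchmark contains 2,061 examples across three difficulty levels, ranging from active visual evidence seeking to open-web interleaved search, and adds multi-branch comparison tasks to raise the difficulty. Experiments show that current mainstream multimodal systems remain weak on interleaved search, with the best model below 50% overall accuracy, highlighting difficulties in visual evidence acquisition, search control, and multimodal information integration.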

Details
English abstract

Existing benchmarks for multimodal agentic search evaluate multimodal search and visual browsing, but visual evidence is either confined to the input or treated as an answer endpoint rather than part of an interleaved search trajectory. We introduce InterLV-Search, a benchmark for Interleaved Language-Vision Agentic Search, in which textual and visual evidence is repeatedly used to condition later search. It contains 2,061 examples across three levels: active visual evidence seeking, controlled offline interleaved multimodal search, and open-web interleaved multimodal search. Beyond existing benchmarks, it also includes multimodal multi-branch samples that involve comparison between multiple entities during the evidence search. We construct Level 1 and Level 2 with automated pipelines and Level 3 with a machine-led, human-supervised open-web pipeline. We further provide InterLV-Agent for standardized tool use, trajectory logging, and evaluation. Experiments on proprietary and open-source multimodal agents show that current systems remain far from solving interleaved multimodal search, with the best model below 50% overall accuracy, highlighting challenges in visual evidence seeking, search control, and multimodal evidence integration. We release the benchmark data and evaluation code at https://github.com/hbhalpha/InterLV-Search-Bench

2605.07507 2026-05-11 cs.CL cs.IR

TCMIIES: A Browser-Based LLM-Powered Intelligent Information Extraction System for Academic Literature

Hanqing Zhao

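AI summary With the explosive growth of academic literature, the need to automatically extract structured knowledge from unstructured scientific text is increasingly urgent. This paper presents TCMIIES, a browser-based, installation-free intelligent information extraction system that uses commercial large language model (LLM) APIs to extract structured information from academic literature. The system uses a new schema-guided prompting framework, lets users define custom extraction schemas through a graphical interface without programming, and offers local data processing, support for multiple LLM providers, batch processing, and intelligent field mapping for Chinese databases. In scenarios such as Traditional Chinese Medicine research it shows strong extraction accuracy and compliance rates, giving domain researchers a flexible, privacy-preserving, and low-cost practical tool.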

Details
English abstract

The exponential growth of academic publications has created an urgent need for automated tools capable of extracting structured knowledge from unstructured scientific texts. While large language models (LLMs) have demonstrated remarkable capabilities in natural language understanding and information extraction, existing solutions often require specialized infrastructure, programming expertise, or fine-tuned domain-specific models that create barriers for researchers in specialized fields. This paper presents TCMIIES, a browser-based, zero-installation platform that leverages commercial LLM APIs to perform structured information extraction from academic literature. The system employs a novel schema-guided prompting framework with automatic system prompt generation, enabling researchers to define custom extraction schemas through an intuitive graphical interface without any programming. TCMIIES features a pure front-end architecture that ensures data privacy by processing all information locally in the browser, supports five major LLM providers, implements concurrent batch processing with automatic retry mechanisms, and provides intelligent field mapping for Chinese academic databases including CNKI and Wanfang. We demonstrate the system's effectiveness through comprehensive evaluation across multiple extraction scenarios in Traditional Chinese Medicine research, achieving structured output compliance rates exceeding 94% and information extraction accuracy comparable to domain-expert annotation. The system represents a practical, accessible solution that bridges the gap between advanced LLM capabilities and domain-specific academic information extraction needs, particularly for researchers in specialized fields who require flexible, privacy-preserving, and cost-effective extraction tools.

2605.07505 2026-05-11 cs.AI cs.LG

LiteGUI: Distilling Compact GUI Agents with Reinforcement Learning

Yubin Wu, Zicheng Cai, Liping Ning, Hua Wang, Zhi Chen, Yaohua Tang, Hao Chen

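AI summary This paper proposes LiteGUI, a new training paradigm that requires no supervised fine-tuning and aims to improve the performance of small vision-language GUI agents. By introducing guided on-policy distillation and a multi-solution dual-level exploration framework, the method mitigates hallucination and cognitive misalignment of small models on multi-solution tasks and strengthens exploration in long-horizon tasks. Experiments show that LiteGUI, while remaining lightweight, achieves state-of-the-art performance on multiple benchmarks and even approaches the performance of large models.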

Details
English abstract

Developing lightweight, on-device vision-language GUI agents is essential for efficient cross-platform automated interaction. However, current on-device agents are constrained by limited model capacity, and further performance improvements remain urgently needed. Traditional Supervised Fine-Tuning (SFT) for small-scale models often leads to overfitting, catastrophic forgetting and policy rigidity, and thus fails to fully address these challenges. In this work, we propose a novel SFT-free training paradigm that significantly enhances the performance of small-scale models. We first present the initial systematic integration of generalized knowledge distillation into the GUI agent domain via Guided On-policy Distillation. By incorporating oracle reference trajectories together with a dynamic retrieval mechanism, our method reduces hallucinations and mitigates the cognitive misalignment inherent in multi-solution GUI tasks. Building on this foundation, we further introduce a Multi-solution Dual-level GRPO framework that jointly aligns macro-level subtask planning with micro-level execution matching, thereby improving exploration in long-horizon GUI agent scenarios. In addition, we construct an automated data generation pipeline to synthesize GUI task trajectories with rich multi-solution annotations. Extensive experiments show that our method achieves state-of-the-art performance among lightweight models while remaining competitive with substantially larger-scale models across all benchmarks. Ablation studies further demonstrate that structured on-policy distillation and multi-solution dual-level exploration can fully unlock the capabilities of 2B/3B scale agents, surpassing the performance limits of conventional imitation learning.

2605.07503 2026-05-11 cs.CV

Diffusion-APO: Trajectory-Aware Direct Preference Alignment for Video Diffusion Transformers

Jingyuan Zhu, Biaolong Chen, Le Zhang, Aixi Zhang, Hao Jiang, Pipei Huang

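AI summary This paper proposes Diffusion-APO, a trajectory-aware preference alignment method for improving the consistency of video diffusion models with human intent. By synchronizing training noise with the inference-time denoising path, the method improves the effectiveness of the gradient signal and addresses the reward-model bias and inadequate timestep sampling of existing methods. The work also introduces a unified, modular RLHF framework that enables flexible, multi-stage preference alignment without scalar-reward-based policy gradients, and it shows better visual quality and instruction following across multiple experiments.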

Details
English abstract

Efficiently aligning large-scale video diffusion models with human intent requires a scalable and trajectory-aware pathway that bridges the inherent discrepancy between training noise distributions and practical inference trajectories. While existing paradigms such as Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO) attempt to address this, they are often hindered by either reliance on bias-prone, complex reward models or suboptimal timestep sampling. In this paper, we propose Diffusion-APO (Aligned Preference Optimization), a trajectory-aware algorithm that resolves this misalignment by synchronizing training noise with inference-time denoising paths to maximize gradient signal efficacy. To translate this algorithmic innovation into a practical solution, we introduce a unified and modular RLHF framework that integrates online ranking, half-online anchoring, offline refinement, and distillation-aware drift correction. This framework enables flexible, multi-stage preference alignment across diverse data and computational constraints without relying on scalar-reward-based policy gradients. Through extensive experiments, we demonstrate that Diffusion-APO consistently outperforms standard baselines in visual quality and instruction following, while effectively preserving generative fidelity during model acceleration, providing a robust, end-to-end pathway for scalable video diffusion alignment.

2605.07499 2026-05-11 cs.CV

Cloud-top infrared observations reveal the four-dimensional precipitation structure

Tianchi Xu, Ziqiang Ma, Andrea Marinoni, Yuanpeng He, Xiaoqing Li, Chuanfeng Zhao, Kang He, Jintao Xu, Bohan Zhou, Wenbo Zhao, Haoshuang Chen, Tun Wang, Dongdong Wang, Yang Hong

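AI summary This study uses cloud-top infrared observations to reveal the four-dimensional structure of precipitation, addressing the difficulty of obtaining high-accuracy precipitation information at global scale. It proposes 4DPrecipNet, a physically constrained deep learning framework that integrates multi-channel infrared brightness temperatures with radar precipitation data to reconstruct the vertical and temporal evolution of precipitation systems. The method captures deep convective structures and their evolution, confirming that cloud-top infrared observations contain sub-cloud precipitation information and opening a new route to continuous global monitoring of precipitation structure.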

Details
English abstract

Accurate four-dimensional (4D) precipitation information is essential for understanding the Earth's energy and water cycles, yet remains observationally unresolved at global scales. Conventional theory holds that geostationary infrared observations primarily sense cloud-top properties, with limited sensitivity to sub-cloud precipitation. Here we show that cloud-top infrared measurements nevertheless encode sufficient information to recover the four-dimensional structure of precipitation, revealing a previously unexploited observability of sub-cloud processes. We introduce a physically constrained deep learning framework, 4DPrecipNet, in which a moisture-first constraint requires the latent representation to recover precipitable water vapour, anchoring the model in thermodynamic consistency. By integrating multi-channel infrared radiances with these constraints and radar-derived precipitation profiles, we reconstruct the vertical and temporal evolution of precipitation systems from geostationary orbit. The framework captures deep convective structures and their evolution, with robust performance across large samples and independent radar comparisons. These results demonstrate that sub-cloud precipitation is physically encoded in cloud-top infrared observations, establishing a new pathway for continuous global monitoring of precipitation structure.

2605.07495 2026-05-11 cs.CV

Lightweight Unpaired Smartphone ISP Transfer with Semantic Pseudo-Pairing

Yujin Cho, Flavien Armangeon, Yanhao Li

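AI summary This paper studies lightweight image translation for smartphone image signal processing (ISP) without paired data. To address the difficulty of aligning scene and color between RAW images and target RGB images, the authors propose a semantic pseudo-pairing approach: semantic embeddings are extracted with DINOv2, and fused Gromov-Wasserstein (FGW) optimal transport builds pseudo pairs at both the image and patch levels, alleviating the effect of unpaired data. On these pseudo pairs they train a lightweight CNN with only 7K parameters that focuses on color transformation to improve training stability and reduce artifacts, achieving strong results on the challenge test set.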

Comments 13 pages, 9 figures, CVPR Workshops 2026

Details
English abstract

Unpaired smartphone ISP is a challenging problem due to the lack of scene and color alignment between RAW and target RGB images. Many existing methods either require paired data or rely heavily on adversarial training, which can become unstable in the unpaired setting. In this work, we present a simple and effective approach developed for the NTIRE 2026 Learned Smartphone ISP Challenge with Unpaired Data. Our method first reconstructs larger images from training patches to recover global context. Then, we extract semantic embeddings with DINOv2, and use fused Gromov-Wasserstein (FGW) optimal transport to build pseudo pairs between RAW and RGB images at both image and patch levels. This semantic matching allows us to partially alleviate the unpairedness of the data and build these pseudo input-target pairs. Based on these pseudo pairs, we train a lightweight CNN with only 7K parameters for color rendering. The network is designed to be compact and focus on color transformation rather than structural change, which helps reduce artifacts and improve training stability. Our challenge submission achieves 22.569 PSNR, 0.675 SSIM, and 8.067 $\Delta E$ on the final hidden test set, significantly improving over the baseline and achieving the 3rd best SSIM and $\Delta E$ among all challenge entries. Our code is available at github.com/nuniniyujin/Unpaired-ISP.