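arXivDaily: a daily academic digest synced with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and more.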

2506.12362 2026-05-11 cs.LG cs.AI

HYPER: A Foundation Model for Inductive Link Prediction with Knowledge Hypergraphs

Xingyue Huang, Mikhail Galkin, Michael M. Bronstein, İsmail İlkan Ceylan

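AI summary: This paper proposes HYPER, a foundation model for inductive link prediction on knowledge hypergraphs that can handle hypergraphs containing entirely new entities and entirely new relation types. HYPER generalizes across relation types of different arities by encoding the entities in each hyperedge together with their positions within it. Experiments show that HYPER outperforms existing methods in a range of inductive settings, indicating strong generalization to high-arity relational structures.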

2506.11512 2026-05-11 cs.LG cs.AI

From Time Series Analysis to Question Answering: A Survey in the LLM Era

Wei Li, Zhe Xie, Yuxuan Liang, Xinli Hao, Yunyao Cheng, Dan Pei, Xiaofeng Meng

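AI summary: Large language models (LLMs) have recently introduced a new paradigm for time series analysis (TSA), but traditional TSA tasks do not cover tasks such as language-based time series understanding and are therefore misaligned with the objectives of LLMs. To address this, the paper proposes evolving TSA toward time series question answering (TSQA), which emphasizes user-centric, unified task handling. The survey traces the evolution from TSA to TSQA, proposes three alignment paradigms, and analyzes dataset characteristics and future research directions.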

Comments Accepted by IJCAI 2026 Survey Track

2506.05668 2026-05-11 cs.LG stat.ML

RNE: plug-and-play diffusion inference-time control and energy-based training

Jiajun He, José Miguel Hernández-Lobato, Yuanqi Du, Francisco Vargas

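AI summary: This paper proposes RNE, a plug-and-play method for diffusion models that controls the generation process at inference time and supports energy-based training. Built on the density ratio between path distributions, RNE establishes a fundamental connection between marginal densities and transition kernels, unifying diffusion density estimation, inference-time control, and energy-based training. Experiments show that RNE performs well on inference-time control tasks, provides a simple and efficient regularization for energy-based diffusion models, and applies to both continuous and discrete diffusion models.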

Comments Accepted at ICLR 2026

2506.00886 2026-05-11 cs.AI

Position: Agent Should Invoke External Tools ONLY When Epistemically Necessary

Hongru Wang, Cheng Qian, Manling Li, Jiahao Qiu, Boyang Xue, Mengdi Wang, Heng Ji, Amos Storkey, Kam-Fai Wong

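AI summary: As large language models evolve into tool-augmented agents, a central question remains open: when is calling an external tool actually necessary? The paper argues that an agent should invoke external tools only when epistemically necessary, that is, when internal reasoning alone cannot reliably complete the task. It introduces the Theory of Agent (ToA) framework, which treats an agent as a sequential decision-maker choosing under uncertainty between handling a step internally and delegating it externally, and argues that unnecessary tool calls are not only inefficient but may also hinder the development of internal reasoning ability. The work offers normative guidance for tool use, aimed at building more capable and efficient agent systems.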

2505.13741 2026-05-11 cs.CV cs.NE

Frozen Backpropagation: Relaxing Weight Symmetry in Deep Spiking Neural Networks

Gaspard Goupy, Pierre Tirilly, Ioan Marius Bilasco

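AI summary: This paper studies how to relax weight symmetry when training deep spiking neural networks (SNNs) with separate forward and feedback networks. Implementing conventional backpropagation (BP) on neuromorphic hardware requires weight symmetry, which incurs high energy and hardware overhead; to address this, the authors propose Frozen Backpropagation (fBP), which periodically freezes the feedback weights to reduce weight transport and synchronization overhead. Experiments show that fBP substantially lowers weight transport cost while retaining high accuracy, and that partial weight transport strategies improve efficiency further.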

2504.16559 2026-05-11 cs.LG q-bio.QM

Synergistic Benefits of Joint Molecule Generation and Property Prediction

Adam Izdebski, Jan Olszewski, Pankhil Gawade, Krzysztof Koras, Serra Korkmaz, Valentin Rauscher, Jakub M. Tomczak, Ewa Szczurek

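AI summary: This study examines the synergistic benefits of joint molecule generation and property prediction and proposes Hyformer, a Transformer-based joint model. Using an alternating attention mechanism and a joint pre-training strategy, Hyformer combines generation and property prediction in a single model and shows synergistic benefits in conditional sampling, out-of-distribution property prediction, and representation learning. Experiments show clear benefits of joint learning in drug discovery tasks such as antimicrobial peptide design.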

Comments 17 pages, 4 figures

2503.14998 2026-05-11 cs.CV

Tables Guide Vision: Learning to See the Heart through Tabular Data

Marta Hasny, Maxime Di Folco, Keno Bressem, Julia Schnabel

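AI summary: This study proposes a contrastive learning framework guided by tabular data, addressing the tendency of conventional visual contrastive learning to ignore semantic relationships between samples in medical imaging. By using clinical tabular data, the method identifies patient-level similarities and builds more semantically meaningful sample pairs, improving the learned visual representations. Experiments on a dataset of cardiac MRI images and clinical attributes show that tabular guidance markedly improves downstream performance, including fine-tuning, linear probing, and zero-shot prediction, and that the method also generalizes well to natural image datasets.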

2503.12285 2026-05-11 cs.LG cs.AI cs.GT cs.SY eess.SY stat.ML

A Resilience Framework for Bi-Criteria Combinatorial Optimization with Bandit Feedback

Vaneet Aggarwal, Shweta Jain, Subham Pokhriyal, Christopher John Quinn

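AI summary: This paper studies bi-criteria combinatorial optimization under noisy function evaluations and proposes a resilience framework for such problems. The framework introduces the notion of $(\alpha,\beta,\delta,\texttt{N})$-resilience, which characterizes how the approximation guarantees jointly degrade under bounded noise, and develops a general black-box method that converts any resilient offline algorithm into an online algorithm for bi-criteria combinatorial multi-armed bandits. Without structural assumptions such as linearity or submodularity, the method achieves sublinear bounds on regret and cumulative constraint violation, and the paper demonstrates that the framework applies to classical greedy algorithms for submodular optimization.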

2502.07143 2026-05-11 cs.CL

Ask Patients with Patience: Enabling LLMs for Human-Centric Medical Dialogue with Grounded Reasoning

Jiayuan Zhu, Jiazhen Pan, Yuyuan Liu, Fenglin Liu, Junde Wu

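AI summary: Shortages of medical resources leave many patients without timely access to reliable care, while large language models (LLMs) in real clinical dialogue still suffer from insufficient grounding in authoritative medical sources, opaque handling of diagnostic uncertainty, and language that lacks a human touch. The study proposes Ask Patients with Patience (APP), a multi-turn medical assistant that uses empathetic dialogue to help users describe their symptoms, applies Bayesian active learning for transparent, adaptive diagnosis, and grounds its reasoning in authoritative medical guidelines. Experiments show that APP outperforms existing models in diagnostic accuracy, uncertainty reduction, and user experience, offering a more clinically practical approach to AI-assisted healthcare.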

2501.09209 2026-05-11 cs.CV

Surgical Visual Understanding (SurgVU) Dataset

Aneeq Zia, Max Berniker, Rogerio Nespolo, Xiaorui Zhang, Conor Perreault, Ziheng Wang, Benjamin Mueller, Ryan Schmidt, Kiran Bhattacharyya, Xi Liu, Anthony Jarc

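AI summary: This paper introduces the Surgical Visual Understanding (SurgVU) dataset, intended to advance foundational research in surgical data science. The dataset contains a large collection of surgical videos and their labels; the paper describes the data collection methods and the dataset's distinctive properties, and poses several example problems suited to a variety of machine learning tasks. Although designed around specific scientific challenges, the dataset is broadly applicable, and the authors hope it will draw the wider machine learning community to challenging problems in surgical scenes and serve as an important benchmark for future research.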

2410.06355 2026-05-11 cs.RO cs.AI

UNCOM: Zero-shot Context-Aware Command Understanding for Tabletop Scenarios

Antonio Galiza Cerdeira Gonzalez, Paweł Gajewski, Bipin Indurkhya

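AI summary: This paper proposes UNCOM, a hybrid framework for understanding natural human commands in tabletop scenarios. The system integrates speech, gestures, and scene context to extract structured, executable commands, enabling zero-shot robot operation without predefined object models or task-specific training data. Using foundation models together with task-specific deep learning models, UNCOM provides out-of-the-box speech recognition, natural language understanding, gesture detection, and object segmentation; its modular architecture improves transparency and interpretability, and it achieves an 82.39% success rate on a dataset of real robot interactions.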

2410.06347 2026-05-11 cs.RO cs.AI

Goal-Conditioned Decision Transformer for Multi-Goal Offline Reinforcement Learning

Paweł Gajewski, Dominik Żurek, Marcin Pietroń, Kamil Faber

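AI summary: This paper proposes a goal-conditioned Decision Transformer for multi-goal offline reinforcement learning, targeting low sample efficiency and poor cross-goal generalization in robotics. By explicitly incorporating goal states into the sequence modeling framework, the method completes a variety of tasks efficiently using only pre-collected data. Experiments on a new offline dataset for the Franka Emika Panda platform show that it outperforms state-of-the-art online baselines and is robust in sparse-reward settings.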

2408.07522 2026-05-11 cs.SD cs.LG eess.AS

Optimising MFCC parameters for the automatic detection of respiratory diseases

Yuyang Yan, Sami O. Simons, Loes van Bemmel, Lauren Reinders, Frits M. E. Franssen, Visara Urovi

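AI summary: This study examines how MFCC parameters affect automatic respiratory disease detection, systematically analyzing the number of coefficients, frame length, and hop length. In experiments with an SVM classifier on four datasets, accuracy decreases as hop length increases, the best number of coefficients is about 30, and sensitivity to frame length differs across datasets. Optimizing the parameter combination raises classification accuracy substantially, by up to 19.6% over the worst combination.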

DOI: 10.1016/j.apacoust.2024.110299

Abstract

Voice signals originating from the respiratory tract are utilized as valuable acoustic biomarkers for the diagnosis and assessment of respiratory diseases. Among the employed acoustic features, Mel Frequency Cepstral Coefficients (MFCC) is widely used for automatic analysis, with MFCC extraction commonly relying on default parameters. However, no comprehensive study has systematically investigated the impact of MFCC extraction parameters on respiratory disease diagnosis. In this study, we address this gap by examining the effects of key parameters, namely the number of coefficients, frame length, and hop length between frames, on respiratory condition examination. Our investigation uses four datasets: the Cambridge COVID-19 Sound database, the Coswara dataset, the Saarbrucken Voice Disorders (SVD) database, and a TACTICAS dataset. The Support Vector Machine (SVM) is employed as the classifier, given its widespread adoption and efficacy. Our findings indicate that the accuracy of MFCC decreases as hop length increases, and the optimal number of coefficients is observed to be approximately 30. The performance of MFCC varies with frame length across the datasets: for the COVID-19 datasets (Cambridge COVID-19 Sound database and Coswara dataset), performance declines with longer frame lengths, while for the SVD dataset, performance improves with increasing frame length (from 50 ms to 500 ms). Furthermore, we investigate the optimized combination of these parameters and observe substantial enhancements in accuracy. Compared to the worst combination, the SVM model achieves an accuracy of 81.1%, 80.6%, and 71.7%, with improvements of 19.6%, 16.10%, and 14.90% for the Cambridge COVID-19 Sound database, the Coswara dataset, and the SVD dataset respectively.
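The interaction between frame length and hop length described in the abstract can be made concrete with a little arithmetic. The sketch below counts how many full frames a recording yields for a given frame and hop length; the 16 kHz sample rate, the 3-second clip, the no-padding framing convention, and the helper names `ms_to_samples` and `frame_count` are assumptions for illustration, not settings taken from the paper.

```python
def ms_to_samples(ms: float, sample_rate: int) -> int:
    """Convert a duration in milliseconds to a whole number of samples."""
    return int(round(ms * sample_rate / 1000))


def frame_count(n_samples: int, frame_len: int, hop_len: int) -> int:
    """Number of full frames obtained by sliding a window without padding."""
    if n_samples < frame_len:
        return 0
    return 1 + (n_samples - frame_len) // hop_len


SAMPLE_RATE = 16_000            # assumed sample rate (Hz)
N_SAMPLES = 3 * SAMPLE_RATE     # assumed 3-second recording
N_COEFFS = 30                   # "approximately 30" coefficients, per the abstract

frame_len = ms_to_samples(50, SAMPLE_RATE)  # 50 ms frame -> 800 samples
for hop_ms in (10, 25, 50):
    hop_len = ms_to_samples(hop_ms, SAMPLE_RATE)
    frames = frame_count(N_SAMPLES, frame_len, hop_len)
    # The MFCC matrix for one recording has shape (N_COEFFS, frames).
    print(f"hop={hop_ms} ms -> {frames} frames, MFCC shape ({N_COEFFS}, {frames})")
```

Under these assumptions a 10 ms hop gives 296 frames, a 25 ms hop gives 119, and a 50 ms hop gives 60, so a longer hop leaves the classifier with fewer frames per recording. Whether that is what drives the accuracy drop reported in the abstract is not something this sketch can show.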

2408.06747 2026-05-11 cs.CV

ReCLIP++: Learn to Rectify the Bias of CLIP for Unsupervised Semantic Segmentation

Jingyun Wang, Guoliang Kang

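AI summary: This paper studies how to use CLIP for unsupervised semantic segmentation and observes that, on pixel-level understanding tasks, CLIP exhibits class-preference and space-preference biases that hurt segmentation performance. The authors propose ReCLIP++, which models the two biases with a learnable Reference prompt and a projection of the positional embedding, respectively, generates a bias logit map through matrix multiplication, and rectifies CLIP's logits by element-wise subtraction. Experiments show that the method outperforms existing methods on several benchmark datasets.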

Comments Extended version of our CVPR 24 paper, accepted by IJCV 2025

DOI: 10.1007/s11263-025-02566-5

Abstract

Recent works utilize CLIP to perform the challenging unsupervised semantic segmentation task where only images without annotations are available. However, we observe that when adopting CLIP to such a pixel-level understanding task, unexpected bias (including class-preference bias and space-preference bias) occurs. Previous works don't explicitly model the bias, which largely constrains the segmentation performance. In this paper, we propose to explicitly model and rectify the bias existing in CLIP to facilitate the unsupervised semantic segmentation task. Specifically, we design a learnable "Reference" prompt to encode class-preference bias and a projection of the positional embedding in the vision transformer to encode space-preference bias respectively. To avoid interference, two kinds of biases are firstly independently encoded into different features, i.e., the Reference feature and the positional feature. Via a matrix multiplication between the Reference feature and the positional feature, a bias logit map is generated to explicitly represent two kinds of biases. Then we rectify the logits of CLIP via a simple element-wise subtraction. To make the rectified results smoother and more contextual, we design a mask decoder which takes the feature of CLIP and the rectified logits as input and outputs a rectified segmentation mask with the help of Gumbel-Softmax operation. A contrastive loss based on the masked visual features and the text features of different classes is imposed, which makes the bias modeling and rectification process meaningful and effective. Extensive experiments on various benchmarks including PASCAL VOC, PASCAL Context, ADE20K, Cityscapes, and COCO Stuff demonstrate that our method performs favorably against previous state-of-the-arts. The implementation is available at: https://github.com/dogehhh/ReCLIP.
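The rectification step in the abstract (a matrix multiplication that produces a bias logit map, followed by element-wise subtraction) can be sketched in a few lines. This is a shape-level illustration only: the layout (classes by feature dimension for the Reference feature, positions by feature dimension for the positional feature), the toy numbers, and the function names are assumptions, and the learnable prompt, mask decoder, Gumbel-Softmax, and contrastive loss are left out.

```python
def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply an (m x k) matrix by a (k x n) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def rectify_logits(clip_logits, reference_feat, positional_feat):
    """Subtract a bias logit map from CLIP logits.

    clip_logits:     C x P  (classes x spatial positions), assumed layout
    reference_feat:  C x D  (encodes class-preference bias)
    positional_feat: P x D  (encodes space-preference bias)
    """
    # bias[c][p] is the inner product of the class-c Reference feature
    # and the position-p positional feature.
    bias = matmul(reference_feat, transpose(positional_feat))
    return [[l - b for l, b in zip(l_row, b_row)]
            for l_row, b_row in zip(clip_logits, bias)]


# Toy example: 2 classes, 3 positions, feature dimension 2 (made-up values).
reference_feat = [[1.0, 0.0], [0.0, 1.0]]
positional_feat = [[0.5, 0.1], [0.2, 0.2], [0.0, 0.4]]
clip_logits = [[2.0, 1.0, 0.5], [1.0, 1.5, 2.0]]

rectified = rectify_logits(clip_logits, reference_feat, positional_feat)
print(rectified)
```

Keeping the two biases in separate features and combining them only through the product mirrors the abstract's point about encoding them independently to avoid interference.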

2407.15134 2026-05-11 cs.LG cs.AI

Proximal Policy Distillation

Giacomo Spigler

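AI summary: This paper proposes Proximal Policy Distillation (PPD), a policy distillation method that combines student-driven distillation with Proximal Policy Optimization (PPO) to improve sample efficiency and exploit the additional rewards the student policy collects during distillation. Experiments show that, compared with conventional student-distill and teacher-distill methods, PPD achieves higher sample efficiency and better student policies across a range of reinforcement learning environments, and is more robust when distilling from imperfect demonstrations.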

2406.13724 2026-05-11 cs.AI

Heterogeneous Graph Neural Networks with Post-hoc Explanations for Multi-modal and Explainable Land Use Inference

Xuehao Zhai, Junqi Jiang, Adam Dejl, Antonio Rago, Fangce Guo, Francesca Toni, Aruna Sivakumar

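AI summary: For urban land use inference, this study proposes a framework that combines heterogeneous graph neural networks (HGNs) with Explainable AI techniques to improve both prediction accuracy and explainability on multi-modal data. The method captures correlations among spatially neighbouring objects and heterogeneity among service types, and provides transparent grounds for its decisions through feature attribution and counterfactual explanations. Experiments show that the framework outperforms baseline graph neural networks on all six land-use indicators, especially 'office' and 'sustenance', giving urban planners a more convincing analysis tool.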

DOI: 10.1016/j.inffus.2025.103057
Journal ref: Information Fusion, Volume 120, 103057. 2025

Abstract

Urban land use inference is a critically important task that aids in city planning and policy-making. Recently, the increased use of sensor and location technologies has facilitated the collection of multi-modal mobility data, offering valuable insights into daily activity patterns. Many studies have adopted advanced data-driven techniques to explore the potential of these multi-modal mobility data in land use inference. However, existing studies often process samples independently, ignoring the spatial correlations among neighbouring objects and heterogeneity among different services. Furthermore, the inherently low interpretability of complex deep learning methods poses a significant barrier in urban planning, where transparency and extrapolability are crucial for making long-term policy decisions. To overcome these challenges, we introduce an explainable framework for inferring land use that synergises heterogeneous graph neural networks (HGNs) with Explainable AI techniques, enhancing both accuracy and explainability. The empirical experiments demonstrate that the proposed HGNs significantly outperform baseline graph neural networks for all six land-use indicators, especially in terms of 'office' and 'sustenance'. As explanations, we consider feature attribution and counterfactual explanations. The analysis of feature attribution explanations shows that the symmetrical nature of the 'residence' and 'work' categories predicted by the framework aligns well with the commuter's 'work' and 'recreation' activities in London. The analysis of the counterfactual explanations reveals that variations in node features and types are primarily responsible for the differences observed between the predicted land use distribution and the ideal mixed state. These analyses demonstrate that the proposed HGNs can suitably support urban stakeholders in their urban planning and policy-making.

2403.18149 2026-05-11 cs.RO cs.SY eess.SY math.OC

Code Generation and Conic Constraints for Model-Predictive Control on Microcontrollers with Conic-TinyMPC

Ishaan Mahajan, Khai Nguyen, Sam Schoedel, Elakhya Nedumaran, Moises Mata, Brian Plancher, Zachary Manchester

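AI summary: This paper studies how to deploy model-predictive control (MPC) with second-order cone constraints efficiently on resource-constrained microcontrollers. To address the high computational cost that conventional embedded solvers incur on complex constraints, the authors extend an ADMM-based structured solver and support C++ code generation from Python, MATLAB, and Julia. Experiments show speed-ups of 10.6x to 142.7x over existing embedded solvers on QP and SOCP problems, along with markedly better memory use on microcontrollers; the solver has been validated on a real trajectory-tracking task with an aerial vehicle.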

Comments Accepted to ICRA 2026. 4 Figures. 2 Tables. First three authors contributed equally
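ADMM-style solvers commonly enforce a conic constraint by projecting onto the cone at each iteration. As a hedged illustration of what a second-order cone constraint involves, the sketch below implements the standard closed-form Euclidean projection onto the cone {(v, s) : ||v||_2 <= s}. That Conic-TinyMPC uses exactly this routine is an assumption; the function name `project_soc` and the test point are hypothetical.

```python
import math


def project_soc(v, s):
    """Euclidean projection of the point (v, s) onto {(v, s): ||v||_2 <= s}."""
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_v <= s:
        # Already inside the cone: nothing to do.
        return list(v), float(s)
    if norm_v <= -s:
        # Inside the polar cone: the nearest point is the apex (origin).
        return [0.0] * len(v), 0.0
    # Otherwise project onto the boundary of the cone.
    scale = (norm_v + s) / (2.0 * norm_v)
    return [scale * x for x in v], (norm_v + s) / 2.0


# Hypothetical point outside the cone: ||(3, 4)|| = 5 > 1.
v_proj, s_proj = project_soc([3.0, 4.0], 1.0)
print(v_proj, s_proj)
```

The projection needs only a norm, a comparison, and a scaling, with no matrix factorization, which is one reason projection-based handling of cone constraints suits memory-limited hardware.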

2310.07379 2026-05-11 cs.CV cs.AI cs.LG

Causal Unsupervised Semantic Segmentation

Junho Kim, Byung-Kwan Lee, Yong Man Ro

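AI summary: This paper studies unsupervised semantic segmentation, which requires no human annotation, and proposes CAUSE, a framework based on causal inference. Using an intervention-oriented causal adjustment strategy, the method builds a two-step pipeline: it first generates concept clusters as a mediator that represents concept prototypes at different levels of granularity, and then uses this mediator to guide pixel-level self-supervised learning for more accurate semantic grouping. Experiments show that CAUSE achieves state-of-the-art unsupervised semantic segmentation performance on several datasets.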

Comments code available: https://github.com/ByungKwanLee/Causal-Unsupervised-Segmentation

2305.01429 2026-05-11 cs.LG stat.ML

Unsupervised Feature Based Algorithms for Time Series Extrinsic Regression

David Guijo-Rubio, Matthew Middlehurst, Guilherme Arcencio, Diego Furtado Silva, Anthony Bagnall

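AI summary: This paper studies time series extrinsic regression (TSER), the task of using a set of training time series to predict a continuous response variable that is not directly related to the regressor series. The authors expand the TSER comparison archive from 19 problems to 63 and compare a range of regression models, finding that regressors adapted from classification algorithms (such as rotation forest) perform well. The paper proposes two new TSER algorithms, FreshPRINCE and DrCIF, which predict from statistical features extracted from the time series and significantly outperform other methods on many datasets, in particular the standard rotation forest regressor.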

Comments 19 pages, 21 figures, 6 tables. Appendix included

2304.13029 2026-05-11 cs.LG

Bake off redux: a review and experimental evaluation of recent time series classification algorithms

Matthew Middlehurst, Patrick Schäfer, Anthony Bagnall

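AI summary: This paper reviews and evaluates recent progress in time series classification (TSC) algorithms, comparing a range of algorithms experimentally on the expanded UCR archive. It extends the original taxonomy with three new categories and introduces 30 new datasets to evaluate the algorithms further. The results show that two recently proposed algorithms, Hydra+MultiROCKET and HIVE-COTEv2, perform significantly better than other approaches on both the current and the new problems.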

DOI: 10.1007/s10618-024-01022-1

Abstract

In 2017, a research paper compared 18 Time Series Classification (TSC) algorithms on 85 datasets from the University of California, Riverside (UCR) archive. This study, commonly referred to as a 'bake off', identified that only nine algorithms performed significantly better than the Dynamic Time Warping (DTW) and Rotation Forest benchmarks that were used. The study categorised each algorithm by the type of feature they extract from time series data, forming a taxonomy of five main algorithm types. This categorisation of algorithms alongside the provision of code and accessible results for reproducibility has helped fuel an increase in popularity of the TSC field. Over six years have passed since this bake off; the UCR archive has expanded to 112 datasets and a large number of new algorithms have been proposed. We revisit the bake off, seeing how each of the proposed categories has advanced since the original publication, and evaluate the performance of newer algorithms against the previous best-of-category using an expanded UCR archive. We extend the taxonomy to include three new categories to reflect recent developments. Alongside the originally proposed distance, interval, shapelet, dictionary and hybrid based algorithms, we compare newer convolution and feature based algorithms as well as deep learning approaches. We introduce 30 classification datasets either recently donated to the archive or reformatted to the TSC format, and use these to further evaluate the best performing algorithm from each category. Overall, we find that two recently proposed algorithms, Hydra+MultiROCKET and HIVE-COTEv2, perform significantly better than other approaches on both the current and new TSC problems.
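The Dynamic Time Warping benchmark named in the abstract can be illustrated with a minimal sketch. This is the textbook dynamic-programming recurrence with squared pointwise differences and no warping window; those choices and the function name `dtw_distance` are assumptions for illustration, not necessarily the configuration used in the bake off.

```python
def dtw_distance(a, b):
    """Dynamic Time Warping cost between two univariate series.

    Uses squared pointwise differences and an unconstrained warping path.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] against b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j],      # repeat b[j-1]
                                 cost[i][j - 1],      # repeat a[i-1]
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]


# A series and a time-stretched copy of it align with zero cost.
print(dtw_distance([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0]))
```

A 1-nearest-neighbour classifier built on a distance like this is a typical way to use it as a baseline; the quadratic table makes it costly on long series, which is why constrained-window variants are common in practice.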

2104.07551 2026-05-11 cs.LG

HIVE-COTE 2.0: a new meta ensemble for time series classification

Matthew Middlehurst, James Large, Michael Flynn, Jason Lines, Aaron Bostrom, Anthony Bagnall

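AI summary: HIVE-COTE 2.0 is a new meta ensemble for time series classification that improves accuracy by combining classifiers from different domains, such as shapelet-based, bag-of-words dictionary, and phase-dependent interval methods. It comprehensively overhauls HIVE-COTE 1.0, introducing two new classifiers, the Temporal Dictionary Ensemble (TDE) and the Diverse Representation Canonical Interval Forest (DrCIF), and adding Arsenal, an ensemble of ROCKET classifiers, which markedly improves accuracy and usability. Experiments show that HIVE-COTE 2.0 outperforms current state-of-the-art methods on many time series datasets.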

2605.07514 2026-05-11 cs.RO cs.CV

Is the Future Compatible? Diagnosing Dynamic Consistency in World Action Models

Bo-Kai Ruan, Teng-Fang Hsiao, Ling Lo, Hong-Han Shuai

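AI summary: This paper studies dynamic consistency in world action models (WAMs) when they generate future actions and observations, noting that current models may produce futures that look visually plausible but are dynamically incompatible. Through systematic analysis, the authors find that consistency between actions and state transitions is an important indicator of WAM reliability, and that a background-collapse phenomenon can lead to false consistency judgments. Building on these findings, they propose a value-function-free consensus strategy that improves test-time trajectory selection and raises success rates on several robot tasks.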

Comments Technical Report

2605.07513 2026-05-11 cs.LG

Tessellations of Semi-Discrete Flow Matching

Emile Pierret, Johannes Hertrich, Samuel Hurault, Julie Delon

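AI summary: This paper studies semi-discrete flow matching, in which a Gaussian source distribution is transported onto a finite set of discrete target points. This setting underpins the theory of flow matching for generative modeling; the paper gives a closed-form expression for the exact flow-matching velocity field, so that the geometry induced by the terminal flow map can be analyzed independently of optimization and approximation effects. The terminal assignment regions are shown to be open and simply connected and, under additional assumptions, homeomorphic to the unit ball; unlike the Laguerre cells of semi-discrete optimal transport, however, they can be non-convex with curved boundaries and exhibit different boundedness and adjacency patterns.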
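To make the closed-form velocity field concrete, here is a minimal sketch under stated assumptions: a standard Gaussian source x_0 ~ N(0, I), equally weighted target points y_i, and the linear interpolation x_t = (1 - t) x_0 + t x_1. Under those assumptions, x_t given x_1 = y_i is Gaussian with mean t y_i and variance (1 - t)^2, so the posterior weights are a softmax and the velocity is v(x, t) = sum_i w_i(x, t) (y_i - x) / (1 - t). The paper's exact setting and notation may differ, and the function name `velocity` is hypothetical.

```python
import math


def velocity(x, t, targets):
    """Exact flow-matching velocity at point x and time t in [0, 1).

    Assumes x_0 ~ N(0, I), uniform weights on `targets`, and the path
    x_t = (1 - t) * x_0 + t * x_1.
    """
    if not 0.0 <= t < 1.0:
        raise ValueError("t must lie in [0, 1)")
    sigma2 = (1.0 - t) ** 2
    # Log posterior weight of each target, up to a shared constant.
    log_w = [-sum((xj - t * yj) ** 2 for xj, yj in zip(x, y)) / (2.0 * sigma2)
             for y in targets]
    m = max(log_w)  # subtract the max for numerical stability
    w = [math.exp(lw - m) for lw in log_w]
    z = sum(w)
    w = [wi / z for wi in w]
    # Weighted average of the per-target conditional velocities.
    return [sum(wi * (y[j] - x[j]) for wi, y in zip(w, targets)) / (1.0 - t)
            for j in range(len(x))]


# Two symmetric 1-D targets: at the midpoint the pulls cancel.
print(velocity([0.0], 0.5, [[1.0], [-1.0]]))
```

As t approaches 1 the softmax sharpens and each point is driven toward a single target; the sets of starting points that end at the same target are the terminal assignment regions discussed in the summary.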

2605.07512 2026-05-11 cs.CV

Hierarchical Dual-Subspace Decoupling for Continual Learning in Vision-Language Models

Mengxin Qin, Xiang Zhang, Kun Wei, Xu Yang, Cheng Deng

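AI summary: This paper studies class-incremental learning for vision-language models in continual learning, where the goal is to keep learning new knowledge without forgetting what has already been learned. To address the severe forgetting caused by subspace interference between tasks, the authors propose HDSD, a hierarchical dual-subspace decoupling framework that introduces a feature modulation module and a hierarchical learning module to decompose the parameter space into general and task-specific subspaces, reducing subspace interference and parameter drift. Experiments show that the method achieves state-of-the-art performance on several benchmarks.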

2605.07510 2026-05-11 cs.CV cs.CL cs.IR

InterLV-Search: Benchmarking Interleaved Multimodal Agentic Search

Bohan Hou, Jiuning Gu, Jiayan Guo, Ronghao Dang, Sicong Leng, Xin Li, Xuemeng Song, Jianfei Yang

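AI summary: Existing benchmarks for multimodal agentic search mainly evaluate text search and visual browsing, and visual evidence usually serves only as an input or a final answer rather than being used interactively during the search. This paper proposes InterLV-Search, a benchmark for interleaved language-vision agentic search in which textual and visual evidence must alternately condition the search as it proceeds. The benchmark contains 2,061 samples across three difficulty levels, ranging from active visual evidence search to interleaved open-web search, and adds multi-branch comparison tasks to raise the difficulty. Experiments show that current mainstream multimodal systems remain weak at interleaved search, with the best model below 50% overall accuracy, highlighting difficulties in visual evidence acquisition, search control, and multimodal information fusion.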

2605.07507 2026-05-11 cs.CL cs.IR

TCMIIES: A Browser-Based LLM-Powered Intelligent Information Extraction System for Academic Literature

Hanqing Zhao

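AI summary: With the explosive growth of academic literature, the need to extract structured knowledge automatically from unstructured scientific text is increasingly pressing. This paper presents TCMIIES, a browser-based, installation-free intelligent information extraction system that uses commercial large language model (LLM) APIs to extract structured information from academic literature. The system uses a new schema-guided prompting framework, lets users define extraction schemas through a graphical interface without programming, and offers local data processing, support for multiple LLMs, batch processing, and intelligent mapping for Chinese databases. It reports high extraction accuracy and compliance rates in settings such as traditional Chinese medicine research, giving domain researchers a flexible, privacy-preserving, low-cost tool.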

2605.07505 2026-05-11 cs.AI cs.LG

LiteGUI: Distilling Compact GUI Agents with Reinforcement Learning

Yubin Wu, Zicheng Cai, Liping Ning, Hua Wang, Zhi Chen, Yaohua Tang, Hao Chen

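AI summary: This paper proposes LiteGUI, a training paradigm that requires no supervised fine-tuning and aims to improve small vision-language GUI agents. By introducing guided policy distillation and a multi-solution, dual-level exploration framework, the method mitigates hallucination and cognitive bias in small models on multi-solution tasks and strengthens exploration in long-horizon tasks. Experiments show that LiteGUI stays lightweight while reaching state-of-the-art performance on several benchmarks, approaching that of large models.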

2605.07503 2026-05-11 cs.CV

Diffusion-APO: Trajectory-Aware Direct Preference Alignment for Video Diffusion Transformers

Jingyuan Zhu, Biaolong Chen, Le Zhang, Aixi Zhang, Hao Jiang, Pipei Huang

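AI summary: This paper proposes Diffusion-APO, a trajectory-aware preference alignment method that improves how well video diffusion models match human intent. By synchronizing the training noise with the inference denoising path, the method makes the gradient signal more effective and addresses reward-model bias and insufficient timestep sampling in existing methods. The work also introduces a unified, modular RLHF framework that enables flexible, multi-stage preference alignment without policy gradients based on scalar rewards, and reports better visual quality and instruction following across several experiments.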

2605.07499 2026-05-11 cs.CV

Cloud-top infrared observations reveal the four-dimensional precipitation structure

Tianchi Xu, Ziqiang Ma, Andrea Marinoni, Yuanpeng He, Xiaoqing Li, Chuanfeng Zhao, Kang He, Jintao Xu, Bohan Zhou, Wenbo Zhao, Haoshuang Chen, Tun Wang, Dongdong Wang, Yang Hong

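AI summary: This study uses cloud-top infrared observations to reveal the four-dimensional structure of precipitation, addressing the difficulty of obtaining high-accuracy precipitation information globally. It proposes 4DPrecipNet, a physics-constrained deep learning framework that combines multi-channel infrared brightness temperatures with radar precipitation data to reconstruct the vertical structure and temporal evolution of precipitation systems. The method captures deep convective structures and their evolution, confirming that cloud-top infrared observations carry information about precipitation below the cloud, and offers a new route to continuous global monitoring of precipitation structure.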

2605.07495 2026-05-11 cs.CV

Lightweight Unpaired Smartphone ISP Transfer with Semantic Pseudo-Pairing

Yujin Cho, Flavien Armangeon, Yanhao Li

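AI summary: This paper studies lightweight image transfer for smartphone image signal processing (ISP) when no paired data is available. Because scenes and colors are hard to align between RAW images and target RGB images, the authors propose a semantic pseudo-pairing method that extracts semantic embeddings with DINOv2 and uses Fused Gromov-Wasserstein (FGW) optimal transport to build pseudo-pairs at both image and patch level, mitigating the effect of unpaired data. On top of these pseudo-pairs, they design a lightweight CNN with only 7K parameters that focuses on color transformation to stabilize training and reduce artifacts, and it performs strongly on the challenge test set.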

Comments 13 pages, 9 figures, CVPR Workshops 2026