arXivDaily: arXiv daily academic digest, updated Monday to Friday
2509.11548 2026-06-11 cs.CV Version update

How Auxiliary Reasoning Unleashes GUI Grounding in VLMs

Weiming Li, Yan Shao, Jing Yang, Yujing Lu, Ling Zhong, Yuhan Wang, Min Yu, Tongxiao Ruan, Manni Duan

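Affiliations: Zhejiang Lab; Hangzhou Research and Development Center; China Mobile; Innovation Center of Yangtze River Delta; Zhejiang University

AI summary: VLMs show strong implicit spatial understanding on GUI grounding but weak explicit coordinate output. The paper proposes three zero-shot auxiliary reasoning methods (such as labeled grids) that add spatial cues to the input image, substantially improving grounding and approaching the best fine-tuned methods on several benchmarks.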

Abstract

Graphical user interface (GUI) grounding is a fundamental task for building GUI agents. However, general vision-language models (VLMs) struggle with this task due to a lack of specific optimization. We identify a key gap in this paper: while VLMs exhibit significant latent grounding potential, as demonstrated by their performance measured by Pointing Game, they underperform when tasked with outputting explicit coordinates. To address this discrepancy and bypass the high data and annotation costs of current fine-tuning approaches, we propose three zero-shot auxiliary reasoning methods. By providing explicit spatial cues such as axes, grids and labeled intersections as part of the input image, these methods enable VLMs to better articulate their implicit spatial understanding capabilities. We evaluate these methods on four GUI grounding benchmarks across seven open-source and proprietary VLMs. Experimental results show substantial gains from auxiliary reasoning. Mark-Grid Scaffold boosts Gemini-3.1-Pro from 11.72% under direct inference to 95.20% on ScreenSpot-v2, achieves state-of-the-art performance on ScreenSpot, and approaches the strongest fine-tuned methods on ScreenSpot-v2 and UI-I2E-Bench. Our code is available at this https URL.
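The idea of labeled grid intersections can be illustrated with a minimal sketch. It only computes where labeled marks would sit on a screenshot and how a label named by the model maps back to pixels; the function names, the letter-plus-number label scheme, and the pixel offset are assumptions for illustration, not the paper's implementation.

```python
from string import ascii_uppercase

def grid_marks(width, height, step):
    """Return {label: (x, y)} for every grid intersection inside the image.

    Labels such as "B2" (column letter, row number) are a hypothetical scheme;
    this simple version supports at most 26 columns.
    """
    xs = range(step, width, step)
    ys = range(step, height, step)
    marks = {}
    for col, x in enumerate(xs):
        for row, y in enumerate(ys, start=1):
            marks[f"{ascii_uppercase[col]}{row}"] = (x, y)
    return marks

def resolve(marks, label, dx=0, dy=0):
    """Turn a label named by the model, plus an optional pixel offset, into a click point."""
    x, y = marks[label]
    return (x + dx, y + dy)

marks = grid_marks(width=1000, height=600, step=200)
print(resolve(marks, "B2", dx=15, dy=-10))
```

The model then only has to name a nearby mark instead of producing raw coordinates from scratch.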

2509.10303 2026-06-11 cs.LG cs.AI Version update

Generalizing Beyond Suboptimality: Offline Reinforcement Learning Learns Effective Scheduling through Random Solutions

Jesse van Remmerden, Zaharah Bukhsh, Yingqian Zhang

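Affiliations: Eindhoven University of Technology

AI summary: Proposes CDQAC, an offline RL algorithm that learns scheduling policies from suboptimal static datasets. On JSP and FJSP it outperforms online RL and strong heuristics while needing only 1-5% of the data, and the analysis finds that state-action coverage matters more than trajectory quality.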

Abstract

Online reinforcement learning (RL) approaches have demonstrated strong performance on Job Shop Scheduling (JSP) and Flexible JSP (FJSP) problems by learning scheduling policies through direct interaction with simulated environments. However, these methods often require extensive training interactions, limiting their sample efficiency and practical applicability. Motivated by this challenge, we introduce Conservative Discrete Quantile Actor-Critic (CDQAC), an offline RL algorithm that learns effective scheduling policies directly from static, suboptimal datasets. CDQAC couples a quantile-based critic with delayed policy updates to estimate the return distribution of machine-operation pairs. Extensive experiments on JSP and FJSP benchmarks demonstrate that CDQAC consistently outperforms the data-generating heuristics, surpasses state-of-the-art offline and online RL baselines, and is highly sample efficient, requiring only 1 to 5% of the original dataset to learn high-quality policies. Our analysis suggests that, in scheduling, offline RL performance is governed mainly by state-action coverage rather than the quality of individual trajectories. Scheduling couples a dense reward aligned with the makespan objective with equal-length trajectories across heuristics, enabling effective learning from a broad range of behaviors. Consistent with this observation, datasets generated by a simple random heuristic with broader coverage yield policies that outperform those trained on datasets produced by stronger heuristics such as Genetic Algorithms.
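A quantile-based critic is usually trained with a quantile-regression (pinball) loss. The sketch below shows that generic loss on made-up returns; the abstract does not state CDQAC's exact loss, so the midpoint quantile levels, the function names, and the sample values are assumptions.

```python
def pinball_loss(pred, target, tau):
    """Quantile-regression loss for one predicted quantile at level tau."""
    u = target - pred
    return max(tau * u, (tau - 1.0) * u)

def quantile_levels(n):
    """Midpoint levels (2i + 1) / (2n), a common choice for n quantile outputs."""
    return [(2 * i + 1) / (2 * n) for i in range(n)]

def critic_loss(pred_quantiles, target_returns):
    """Average pinball loss of all predicted quantiles against sampled returns."""
    taus = quantile_levels(len(pred_quantiles))
    total = sum(
        pinball_loss(q, g, tau)
        for q, tau in zip(pred_quantiles, taus)
        for g in target_returns
    )
    return total / (len(pred_quantiles) * len(target_returns))

# Hypothetical returns (negative makespans) for one machine-operation pair
returns = [-12.0, -10.0, -9.0, -15.0]
good = critic_loss([-15.0, -12.0, -10.0, -9.0], returns)  # spread matches the data
bad = critic_loss([-9.0, -9.0, -9.0, -9.0], returns)      # collapsed, optimistic estimate
print(good < bad)
```

A critic whose quantiles follow the spread of observed returns gets a lower loss than one that collapses to a single optimistic value.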

2508.17077 2026-06-11 stat.ML cs.LG Version update

CP4SBI: Local Conformal Calibration of Credible Sets in Simulation-Based Inference

Luben M. C. Cabezas, Vagner S. Santos, Thiago R. Ramos, Pedro L. C. Rodrigues, Rafael Izbicki

Affiliations: Department of Statistics, Federal University of São Carlos; Institute of Mathematics and Computer Science, University of São Paulo; Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK

AI summary: Proposes the CP4SBI framework, which achieves local Bayesian coverage through regression trees and CDF-based calibration, gives finite-sample local coverage guarantees for any scoring function, and improves the uncertainty quantification of neural posterior estimators.

Abstract

Experimental scientists increasingly rely on simulation-based inference (SBI) to invert complex non-linear models with intractable likelihoods. However, posterior approximations obtained with SBI are often miscalibrated, causing credible regions to under-cover the true parameters. We develop CP4SBI, a model-agnostic conformal calibration framework that constructs credible sets with local Bayesian coverage. Our two proposed variants, namely local calibration via regression trees and CDF-based calibration, enable finite-sample local coverage guarantees for any scoring function, including HPD, symmetric, and quantile-based regions. Experiments on widely used SBI benchmarks demonstrate that our approach improves the quality of uncertainty quantification for neural posterior estimators using both normalizing flows and score-diffusion modeling.
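For orientation, the basic split-conformal step that such calibration builds on can be sketched as follows. This is the plain global version, not the local regression-tree or CDF-based variants the abstract describes; the score values and function names are hypothetical.

```python
import math

def conformal_threshold(cal_scores, alpha):
    """Finite-sample split-conformal cutoff: the ceil((n+1)(1-alpha))-th smallest score."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(cal_scores)[k - 1]

def in_credible_set(score, threshold):
    """A parameter value belongs to the calibrated set when its score is at most the cutoff."""
    return score <= threshold

# Hypothetical scores s(theta_i, x_i) on simulated calibration pairs,
# e.g. the negative approximate posterior density for an HPD-style region
cal_scores = [0.2, 0.5, 0.9, 1.1, 1.4, 1.8, 2.3, 2.9, 3.5]
t = conformal_threshold(cal_scores, alpha=0.25)
print(t, in_credible_set(1.0, t))
```

A local variant would compute such a cutoff separately within regions of the observation space instead of once for all data.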

2508.18636 2026-06-11 cs.SE cs.AI Version update

LaQual: An Automated Framework for LLM App Quality Evaluation

Yan Wang, Xinyi Hou, Junjun Si, Yanjie Zhao, Weiguo Lin, Haoyu Wang

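Affiliations: arXiv.org; University of Science and Technology of China

AI summary: Proposes LaQual, an automated framework that evaluates LLM app quality through static indicator screening and dynamic scenario-based evaluation. Its scores are highly consistent with human judgments, and screening reduces the candidate app pool by 66.7% to 81.3%.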

Abstract

Representing a new paradigm in software distribution, LLM app stores are rapidly emerging, offering users diverse choices for content generation, coding assistance, education, and more. However, current ranking and recommendation mechanisms in LLM app stores predominantly rely on static metrics, such as user interactions and favorites, making it challenging for users to efficiently identify high-quality apps. At the same time, current academic research focuses on specific vertical fields and lacks a general, automated evaluation framework applicable to the diverse LLM app ecosystem. To address the above challenges, we present LaQual, an automated framework for LLM app quality evaluation. LaQual integrates three key stages: (1) LLM app labeling and hierarchical classification for precise scenario mapping; (2) static indicator evaluation using time-weighted user engagement and functional capability indicators to filter low-quality apps; and (3) dynamic scenario-adapted evaluation, where an LLM generates scenario-specific evaluation metrics, scoring criteria, and tasks for comprehensive quality evaluation. Experiments on a mainstream LLM app store demonstrate the effectiveness of LaQual. Its automated scores show high consistency with human judgments. Through effective screening, LaQual can reduce the candidate LLM app pool by 66.7% to 81.3%. User studies further validate its significant outperformance over baseline systems, particularly in comparison efficiency (mean 5.45 vs. 3.30) and value of explanatory information (4.75 vs. 2.25). These results demonstrate that LaQual provides a scalable, objective, and user-centric solution for high-quality discovery and recommendation of LLM apps in real-world scenarios.

2508.10807 2026-06-11 quant-ph cs.LG math.OC Version update

Parity Cross-Resonance: A Multiqubit Gate

Xuexin Xu, Siyu Wang, Radhika Joshi, Rihan Hai, Mohammad H. Ansari

Affiliations: Peter Grünberg Institute, Forschungszentrum Jülich; Jülich-Aachen Research Alliance (JARA); Fundamentals of Future Information Technologies; Institute for Quantum Information, RWTH Aachen University; Department of Software Technology, Delft University of Technology

AI summary: Proposes a native three-qubit entangling gate that uses a hybrid optimization approach to realize control-control-target and control-target-target operations. Applications include GHZ state preparation, Toffoli logic, and a controlled-ZZ gate that improves the fidelity of surface-code stabilizer measurements.

Journal ref
Phys. Rev. Applied 25, 044045 (2026)
Comments
19 pages, 10 figures
Abstract

We present a native three-qubit entangling gate that exploits engineered interactions to realize control-control-target and control-target-target operations in a single coherent step. Unlike conventional decompositions into multiple two-qubit gates, our hybrid optimization approach selectively amplifies desired interactions while suppressing unwanted couplings, yielding robust performance across the computational subspace and beyond. The new gate can be classified as a cross-resonance gate. We show it can be utilized in several ways, for example, in GHZ triplet state preparation, Toffoli-class logic demonstrations with many-body interactions, and in implementing a controlled-ZZ gate. The latter maps the parity of two data qubits directly onto a measurement qubit, enabling faster and higher-fidelity stabilizer measurements in surface-code quantum error correction. In all these examples, we show that the three-qubit gate performance remains robust across Hilbert space sizes, as confirmed by testing under increasing total excitation numbers. This work lays the foundation for co-designing circuit architectures and control protocols that leverage native multiqubit interactions as core elements of next-generation superconducting quantum processors.

2405.06995 2026-06-11 cs.SD cs.CV cs.MM eess.AS Version update

Benchmarking Cross-Domain Audio-Visual Deception Detection

Xiaobao Guo, Zitong Yu, Nithish Muthuchamy Selvaraj, Bingquan Shen, Adams Wai-Kin Kong, Alex C. Kot

Affiliations: Rapid-Rich Object Search (ROSE) Lab and the College of Computing and Data Science, Nanyang Technological University (NTU); School of Computing and Information Technology and Dongguan Key Laboratory for Intelligence and Information Technology, Great Bay University; DSO National Laboratories; College of Computing and Data Science, Nanyang Technological University (NTU); SMBU, Shenzhen 518172, China; VinUniversity, Hanoi 100000, Vietnam and NTU, Singapore

AI summary: Presents the first cross-domain audio-visual deception detection benchmark, evaluates generalization across scenarios, and designs the MM-IDGM algorithm and the Attention-Mixer fusion method to improve performance.

Comments
17 pages
Abstract

Automated deception detection is crucial for assisting humans in accurately assessing truthfulness and identifying deceptive behavior. Conventional contact-based techniques, like polygraph devices, rely on physiological signals to determine the authenticity of an individual's statements. Nevertheless, recent developments in automated deception detection have demonstrated that multimodal features derived from both audio and video modalities may outperform human observers on publicly available datasets. Despite these positive findings, the generalizability of existing audio-visual deception detection approaches across different scenarios remains largely unexplored. To close this gap, we present the first cross-domain audio-visual deception detection benchmark, which enables us to assess how well these methods generalize for use in real-world scenarios. We use widely adopted audio and visual features and different architectures for benchmarking, comparing single-to-single and multi-to-single domain generalization performance. To further explore the impact of using data from multiple source domains for training, we investigate three types of domain sampling strategies, including domain-simultaneous, domain-alternating, and domain-by-domain, for multi-to-single domain generalization evaluation. We also propose an algorithm, named MM-IDGM, that enhances generalization performance by maximizing the gradient inner products between modality encoders. Furthermore, we propose the Attention-Mixer fusion method to improve performance, and we believe that this new cross-domain benchmark will facilitate future research in audio-visual deception detection.

2507.17012 2026-06-11 cs.AI cs.CE Version update

Sustainability assessment using multimodal AI agents

Zhihan Zhang, Alexander Metzger, Yuxuan Mei, Felix Hähnlein, Zachary Englhardt, Tingyu Cheng, Gregory D. Abowd, Shwetak Patel, Adriana Schulz, Vikram Iyer

Affiliations: Paul G. Allen School of Computer Science & Engineering, University of Washington; Computer Science and Engineering, University of Notre Dame; Electrical and Computer Engineering, Northeastern University

AI summary: Proposes a multimodal multi-agent AI system that emulates the collaboration between life cycle assessment experts and stakeholders to automatically estimate the carbon footprint of electronic devices, cutting data collection from weeks to about one minute with error within 19%.

Comments
This article is published in Nature Electronics, and is available online at: this https URL
Abstract

Reducing the rapidly growing environmental impact of the computing industry requires assessing the emissions of electronics at scale. However, a traditional life cycle assessment (LCA) of an electronic device, which maps materials and processes to environmental impacts, often requires proprietary or unavailable data. Here, we reimagine conventional sustainability assessment by introducing a multimodal multi-agent AI system that emulates the collaborative process between LCA professionals and stakeholders (such as product managers and engineers) to automatically estimate the carbon footprint of electronic devices. The agents iteratively construct a complete life-cycle inventory by leveraging a structured data abstraction and software tools that mine information from the public internet, including repair communities and government regulatory databases. This closes data gaps and cuts data collection from weeks or months of expert time to under one minute. The system can calculate carbon footprints to within 19% of expert LCAs, a margin typical of the variation between human LCAs, using zero proprietary data. We also show that by encoding domain-specific knowledge, environmental impact estimation can be reframed as a data-driven prediction task, in which both unknown products and emission factors are represented as weighted combinations of similar ones with known emissions.
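The final idea, representing an unknown product as a weighted combination of similar products with known emissions, can be sketched as a similarity-weighted average. The product names, weights, and footprint values below are invented for illustration, and the paper's actual weighting scheme is not given in the abstract.

```python
def estimate_footprint(similarities, known_footprints):
    """Similarity-weighted average of known footprints (kg CO2e)."""
    total = sum(similarities.values())
    if total <= 0:
        raise ValueError("need at least one positive similarity weight")
    weighted = sum(w * known_footprints[name] for name, w in similarities.items())
    return weighted / total

# Hypothetical reference products with known footprints in kg CO2e
known = {"laptop_a": 300.0, "tablet_b": 100.0, "phone_c": 60.0}
# Hypothetical similarity of the unknown device to each reference product
sims = {"laptop_a": 0.5, "tablet_b": 1.5, "phone_c": 2.0}
estimate = estimate_footprint(sims, known)
print(estimate)
```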

2506.20040 2026-06-11 cs.LG cs.AI cs.CL Version update

Cross-Layer Discrete Concept Discovery for Interpreting Language Models

Ankur Garg, Xuemin Yu, Hassan Sajjad, Samira Ebrahimi Kahou

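Affiliations: University of Washington

AI summary: Proposes the cross-layer vector quantized-variational autoencoder (CLVQ-VAE), which compresses duplicated residual-stream features into compact, interpretable concept vectors through a discrete vector-quantization bottleneck and outperforms clustering, single-layer VQ-VAE, and sparse autoencoder baselines on three datasets.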

Abstract

Interpreting language models remains challenging due to the residual stream, which linearly mixes and duplicates features across adjacent layers, causing single-layer analyses to miss this cross-layer structure. Cross-layer sparse autoencoders (SAEs) address layer mixing but operate in continuous space, where concepts split across many neurons without clear boundaries. We introduce Cross-Layer Vector Quantized-Variational Autoencoder (CLVQ-VAE), a novel framework that maps representations from a lower layer to a higher layer through a discrete vector-quantization bottleneck, collapsing duplicated residual-stream features into compact, interpretable concept vectors. Our approach combines top-k temperature-based sampling with exponential moving average (EMA) codebook updates, providing controlled exploration of the discrete latent space while maintaining codebook diversity. Across both encoder- and decoder-based models on ERASER-Movie, Jigsaw, and AGNews, CLVQ-VAE outperforms clustering, single-layer vector quantized-variational autoencoder (VQ-VAE), and sparse autoencoder (SAE) baselines across three evaluation axes: removing identified concepts drops model accuracy by up to 93%, LLM judges rank our concepts first in 66.7% of comparisons, and human annotators recover model predictions from our visualizations with 78% accuracy versus 54% for clustering.
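The combination of top-k temperature-based sampling and EMA codebook updates can be sketched in a few lines. This is a simplified illustration: the EMA step moves each used code directly toward the mean of its assigned vectors rather than tracking separate running counts, and all names and values are assumptions rather than the paper's code.

```python
import math
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def topk_sample(z, codebook, k, temperature, rng):
    """Sample a code index among the k nearest codes, softmax over negative distance."""
    nearest = sorted(range(len(codebook)), key=lambda i: sq_dist(z, codebook[i]))[:k]
    logits = [-sq_dist(z, codebook[i]) / temperature for i in nearest]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]  # shift for numerical stability
    return rng.choices(nearest, weights=weights, k=1)[0]

def ema_update(codebook, assignments, decay=0.99):
    """Move each used code toward the mean of the vectors assigned to it."""
    for i, vectors in assignments.items():
        mean = [sum(col) / len(vectors) for col in zip(*vectors)]
        codebook[i] = [decay * c + (1 - decay) * m for c, m in zip(codebook[i], mean)]

rng = random.Random(0)
codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]  # toy concept vectors
z = [0.9, 1.2]                                   # toy lower-layer representation
idx = topk_sample(z, codebook, k=2, temperature=0.1, rng=rng)
ema_update(codebook, {idx: [z]}, decay=0.9)
print(idx, codebook[idx])
```

A higher temperature spreads samples over more of the k candidates, which is one way to keep rarely used codes alive.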

2507.03065 2026-06-11 cs.LG Version update

Persistent Homology as a Theory of Emergent Structure

Xin Li

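Affiliations: Department of Computer Science, University at Albany

AI summary: Defines emergent properties as persistent nontrivial homology classes and uses persistent bars, a contractive-similarity graph operator, and Hodge decomposition to describe six signatures of emergence in one framework, with testable predictions.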

Abstract

Why do some macroscopic structures remain identifiable even though their microscopic constituents continually change? Vortices persist while fluid parcels turn over, neural memories persist while spikes and synapses fluctuate, and institutions persist while individuals enter and leave. We propose a scale-relative answer: an emergent property is a persistent nontrivial homology class $[z]\in H_p=\ker\partial_p/\operatorname{im}\partial_{p+1}$, a macro-feature that is closed but not exact across a filtration of descriptions. This identification turns emergence into a measurement problem. Persistent bars detect stable macro-features, and we introduce a contractive-similarity (CS) graph operator to supply scaffold spectral gaps that predict robustness. Hodge decomposition separates harmonic macro-scaffold from exact and co-exact micro-flow, and functorial condensation explains when one level's emergent class becomes a unit for the next. The resulting scaffold-flow framework expresses six familiar signatures of emergence (i.e., inevitability, coherence, irreducibility, complementarity, robustness, and hierarchy) within one mathematical language. It also yields falsifiable predictions across atmospheric, neural, and social systems: genuine emergent structures should persist across filtrations, remain spectrally stable, respond disproportionately to harmonic interventions, and require timescale separation for hierarchical autonomy.
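A small example of a class that is closed but not exact: in a graph with no filled-in triangles, the image of the boundary map from 2-simplices is trivial, so the first homology group is just the cycle space and its rank is E - V + C (edges, vertices, connected components). The sketch below computes this for toy graphs; it illustrates the definition only and does not implement persistence or the paper's CS operator.

```python
def betti_numbers(num_vertices, edges):
    """Return (b0, b1) of a graph viewed as a 1-dimensional simplicial complex."""
    parent = list(range(num_vertices))

    def find(v):
        # Union-find root lookup with path halving
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    components = num_vertices
    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            components -= 1
    # With no 2-simplices, rank H_1 = E - V + C
    return components, len(edges) - num_vertices + components

path = [(0, 1), (1, 2), (2, 3)]
loop = path + [(3, 0)]
print(betti_numbers(4, path), betti_numbers(4, loop))
```

Closing the path into a loop creates one nontrivial class; a feature of this kind that survives as the description is coarsened or refined is what a persistent bar records.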

2506.03933 2026-06-11 cs.CV cs.AI Version update

Diffusion-based Cumulative Adversarial Purification for Vision Language Models

Jia Fu, Yongtao Wu, Yihang Chen, Kunyu Peng, Xiao Zhang, Volkan Cevher, Sepideh Pashami, Anders Holst

Affiliations: KTH Royal Institute of Technology; Swiss Federal Institute of Technology Lausanne; University of California, Los Angeles; Karlsruhe Institute of Technology; CISPA Helmholtz Center for Information Security; RISE Research Institutes of Sweden; Halmstad University

AI summary: Proposes DiffCAP, a diffusion-based adversarial purification strategy. It proves that adversarial effects fade monotonically as diffusion proceeds and purifies adaptively using noise injection with a similarity threshold on VLM embeddings, substantially improving defense and speeding up denoising.

Comments
Accepted to Transactions on Machine Learning Research (TMLR 2026)
Abstract

Vision Language Models (VLMs) have shown remarkable capabilities in multimodal understanding, yet their susceptibility to adversarial perturbations poses a significant threat to their reliability in real-world applications. Despite often being imperceptible to humans, these perturbations can drastically alter model outputs, leading to erroneous interpretations and decisions. This paper introduces DiffCAP, a novel diffusion-based purification strategy that can effectively neutralize adversarial corruptions in VLMs. We theoretically establish a provable recovery region in the forward diffusion process and quantify the convergence rate of semantic variation with respect to VLMs. These findings show that adversarial effects monotonically fade as diffusion unfolds. Guided by this principle, DiffCAP leverages noise injection with a similarity threshold of VLM embeddings as an adaptive criterion, before reverse diffusion restores a clean and reliable representation for VLM inference. Through extensive experiments across six datasets with three VLMs under varying attack strengths in three task scenarios, we show that DiffCAP outperforms existing defense techniques by a substantial margin. Notably, DiffCAP significantly reduces both hyperparameter tuning complexity and the required diffusion time, thereby accelerating the denoising process. Equipped with theorems and empirical support, DiffCAP provides a robust and practical solution for securely deploying VLMs in adversarial environments. The source code is available at this https URL.

2506.02568 2026-06-11 cs.AI Version update

MLaGA: Multimodal Large Language and Graph Assistant

Dongzhe Fan, Yi Fang, Jiajin Liu, Djellel Difallah, Qiaoyu Tan

Affiliations: New York University; New York University Shanghai; New York University Brooklyn; Virginia Polytechnic Institute and State University; New York University Abu Dhabi

AI summary: Proposes MLaGA, which extends large language models to multimodal graph data through a structure-aware multimodal encoder and instruction tuning, outperforming baselines on supervised and transfer learning tasks.

Abstract

Large Language Models (LLMs) have demonstrated substantial efficacy in advancing graph-structured data analysis. Prevailing LLM-based graph methods excel in adapting LLMs to text-rich graphs, wherein node attributes are text descriptions. However, their applications to multimodal graphs--where nodes are associated with diverse attribute types, such as texts and images--remain underexplored, despite their ubiquity in real-world scenarios. To bridge the gap, we introduce the Multimodal Large Language and Graph Assistant (MLaGA), an innovative model that adeptly extends LLM capabilities to facilitate reasoning over complex graph structures and multimodal attributes. We first design a structure-aware multimodal encoder to align textual and visual attributes within a unified space through a joint graph pre-training objective. Subsequently, we implement a multimodal instruction-tuning approach to seamlessly integrate multimodal features and graph structures into the LLM through lightweight projectors. Extensive experiments across multiple datasets demonstrate the effectiveness of MLaGA compared to leading baseline methods, achieving superior performance in diverse graph learning tasks under both supervised and transfer learning scenarios.

2506.01396 2026-06-11 cs.LG cs.CR stat.ML Version update

Mitigating Disparate Impact of Differentially Private Learning through Bounded Adaptive Clipping

Linzh Zhao, Aki Rehn, Mikko A. Heikkilä, Razane Tajeddine, Antti Honkela

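Affiliations: Department of Computer Science, University of Helsinki; Department of Electrical and Computer Engineering, American University of Beirut

AI summary: Gradient clipping in differentially private learning is unfair to minority groups. The paper proposes bounded adaptive clipping, which adds a tunable lower bound to prevent excessive gradient suppression and improves worst-class accuracy by over 10 percentage points on Skewed and Fashion MNIST.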

Comments
TMLR camera-ready version
Abstract

Differential privacy (DP) has become an essential framework for privacy-preserving machine learning. Existing DP learning methods, however, often have disparate impacts on model predictions, e.g., for minority groups. Gradient clipping, which is often used in DP learning, can suppress larger gradients from challenging samples. We show that this problem is amplified by adaptive clipping, which will often shrink the clipping bound to tiny values to match a well-fitting majority, while significantly reducing the accuracy for others. We propose bounded adaptive clipping, which introduces a tunable lower bound to prevent excessive gradient suppression. Our method improves worst-class accuracy by over 10 percentage points on Skewed and Fashion MNIST compared to unbounded adaptive clipping, 7 points compared to Automatic clipping, and 5 points compared to constant clipping. The code is available at this https URL.
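The effect of a lower bound on an adaptive clipping threshold can be seen in a toy simulation. The sketch assumes a common quantile-tracking update (a geometric step toward a target fraction of unclipped gradients) and omits all DP noise for readability; the update rule, parameter names, and gradient norms are assumptions, not the paper's exact procedure.

```python
import math

def clip(grad, bound):
    """Scale a per-sample gradient so its L2 norm is at most `bound`."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, bound / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def update_bound(bound, grad_norms, target_quantile, lr, lower):
    """One quantile-tracking step for the clipping bound, floored at `lower`."""
    frac_below = sum(n <= bound for n in grad_norms) / len(grad_norms)
    proposed = bound * math.exp(-lr * (frac_below - target_quantile))
    return max(lower, proposed)

# Nine well-fit samples with tiny gradients and one hard sample with a large one
norms = [0.01] * 9 + [5.0]
unbounded = bounded = 1.0
for _ in range(50):
    unbounded = update_bound(unbounded, norms, 0.5, 0.5, lower=0.0)
    bounded = update_bound(bounded, norms, 0.5, 0.5, lower=0.5)
print(unbounded, bounded)
```

Without the floor, the bound shrinks to the scale of the majority's gradients and the hard sample's gradient is clipped to almost nothing; with the floor it retains more of its signal.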

2505.17623 2026-06-11 cs.CR cs.AI cs.ET cs.LG cs.PF Version update

Range-Arithmetic: Verifiable Deep Learning Inference on an Untrusted Party

Ali Rahimi, Babak H. Khalaj, Mohammad Ali Maddah-Ali

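Affiliations: University of California, Berkeley

AI summary: Proposes the Range-Arithmetic framework, which converts non-arithmetic operations into verifiable arithmetic steps for efficient verification of deep neural network inference, reducing computation and communication overhead.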

Abstract

Verifiable computing (VC) has gained prominence in decentralized machine learning systems, where resource-intensive tasks like deep neural network (DNN) inference are offloaded to external participants due to blockchain limitations. This creates a need to verify the correctness of outsourced computations without re-execution. We propose Range-Arithmetic, a novel framework for efficient and verifiable DNN inference that transforms non-arithmetic operations, such as rounding after fixed-point matrix multiplication and ReLU, into arithmetic steps verifiable using sum-check protocols and concatenated range proofs. Our approach avoids the complexity of Boolean encoding, high-degree polynomials, and large lookup tables while remaining compatible with finite-field-based proof systems. Experimental results show that our method not only matches the performance of existing approaches, but also reduces the computational cost of verifying the results, the computational effort required from the untrusted party performing the DNN inference, and the communication overhead between the two sides.
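One common way to express rounding and ReLU as an arithmetic identity plus range checks is sketched below with plain integers. The constraints shown are a generic formulation assumed for illustration; the abstract does not give the paper's exact constraints, and a real system would prove the range conditions with range proofs over a finite field rather than evaluate them directly.

```python
def round_witness(x, frac_bits):
    """Prover side: rounded value y and remainder r for x / 2**frac_bits."""
    half = 1 << (frac_bits - 1)
    y = (x + half) >> frac_bits      # floor division, also for negative x
    r = (x + half) - (y << frac_bits)
    return y, r

def check_round(x, y, r, frac_bits):
    """Verifier side: one linear identity plus one range condition on r."""
    half = 1 << (frac_bits - 1)
    return x + half == (y << frac_bits) + r and 0 <= r < (1 << frac_bits)

def check_relu(x, y, bound):
    """ReLU as y * (y - x) == 0 with y and y - x both in [0, bound)."""
    return y * (y - x) == 0 and 0 <= y < bound and 0 <= y - x < bound

x = 1000                              # fixed-point product with 8 extra fractional bits
y, r = round_witness(x, 8)
print(y, r, check_round(x, y, r, 8))
print(check_round(x, 3, 360, 8))      # a wrong result fails the range condition
print(check_relu(-5, 0, 1 << 16), check_relu(7, 0, 1 << 16))
```

The wrong rounding result satisfies the linear identity but needs a remainder outside the allowed range, which is exactly what a range proof rules out.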

2505.08784 2026-06-11 stat.ML cs.LG math.ST stat.ME Version update

PCS-UQ: Uncertainty Quantification via the Predictability-Computability-Stability Framework

Abhineet Agarwal, Fange Xiao, Rebecca Barter, Omer Ronen, Boyu Fan, Bin Yu

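Affiliations: Department of Statistics, University of California, Berkeley; Department of Epidemiology, University of Utah; Department of Electrical Engineering and Computer Science, University of California, Berkeley

AI summary: Proposes the PCS-UQ framework, which quantifies uncertainty through prediction checks, bootstrap sampling, and multiplicative calibration. It outperforms or matches conformal prediction methods on regression and classification tasks and comes with theoretical guarantees.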

Abstract

As machine learning (ML) enters high-stakes domains, trustworthy uncertainty quantification (UQ) is essential for safety. In this paper we introduce PCS-UQ, a framework based on the Predictability, Computability, and Stability (PCS) principles for veridical data science. Starting with a candidate set of models or algorithms, PCS-UQ integrates a rigorous prediction-check to screen out unsuitable models in the set and utilizes bootstrap samples, in order to capture both inter-sample variability and algorithmic instability for the prediction-checked algorithms. We then introduce a novel multiplicative calibration scheme to enhance local adaptivity, which essentially corresponds to a new score in conformal prediction. Moreover, we produce a compilation of 17 real-world regression datasets with manually-constructed subgroups. On this benchmark, PCS-UQ maintains the target coverage while outperforming or matching conformal methods equipped with oracle-selected algorithms in interval width. PCS-UQ achieves consistent subgroup coverage, outperforming these oracle-selected conformal methods. Notably, PCS-UQ stands out in achieving both competitive interval widths and consistent subgroup coverage. On 6 classification datasets, PCS-UQ reduces prediction set sizes by 20%. To scale the framework for deep learning, we propose computationally efficient variants that bypass expensive retraining. On three computer vision benchmarks, these variants reduce prediction set sizes by 20% over conformal baselines. Finally, we provide theoretical proof that a modified PCS-UQ algorithm preserves valid coverage under exchangeability as a form of split conformal inference.
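One plausible reading of multiplicative calibration is to stretch each bootstrap interval around its point prediction by a common factor chosen on held-out data. The sketch below follows that reading; the score definition, the clamping of the quantile index, and all numbers are assumptions, not the paper's algorithm.

```python
import math

def needed_scale(y, pred, lo, hi):
    """Smallest multiplier that stretches [lo, hi] around pred far enough to cover y.

    Assumes lo < pred < hi.
    """
    if y >= pred:
        return (y - pred) / (hi - pred)
    return (pred - y) / (pred - lo)

def calibrate_gamma(cal_points, alpha):
    """Pick the multiplier as a conformal-style quantile of the needed scales."""
    scores = sorted(needed_scale(*p) for p in cal_points)
    k = math.ceil((len(scores) + 1) * (1 - alpha))
    # Clamped to the largest score; strict split conformal would give an infinite interval
    return scores[min(k, len(scores)) - 1]

def scaled_interval(pred, lo, hi, gamma):
    """Stretch the interval around the point prediction by gamma on each side."""
    return (pred - gamma * (pred - lo), pred + gamma * (hi - pred))

# Hypothetical calibration tuples: (true y, point prediction, bootstrap lower, bootstrap upper)
cal = [(10.5, 10.0, 9.0, 11.0), (8.0, 10.0, 9.0, 11.0),
       (12.0, 10.0, 9.0, 12.0), (9.5, 10.0, 9.0, 11.0)]
gamma = calibrate_gamma(cal, alpha=0.25)
print(gamma, scaled_interval(20.0, 18.0, 23.0, gamma))
```

Because the factor multiplies each interval's own width, wide intervals grow more than narrow ones, which is where the local adaptivity comes from.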

2505.03296 2026-06-11 cs.RO cs.AI cs.LG 版本更新

The Unreasonable Effectiveness of Discrete-Time Gaussian Process Mixtures for Robot Policy Learning

离散时间高斯过程混合在机器人策略学习中的惊人有效性

Jan Ole von Hartz, Adrian Röfer, Joschka Boedecker, Abhinav Valada

发表机构 * Department of Computer Science, University of Freiburg, Germany(弗赖堡大学计算机科学系)

AI总结 提出MiDiGap方法,利用少量演示和相机观测,通过离散时间高斯过程混合实现机器人操作策略的灵活表示与模仿学习,在长时域、高约束、动态和多模态任务上取得SOTA性能,并支持推理时引导。

详情
Comments
Submitted for publication to IEEE Transactions on Robotics
AI中文摘要

我们提出了离散时间高斯过程混合(MiDiGap),一种用于机器人操作中灵活策略表示和模仿学习的新方法。MiDiGap仅使用相机观测,即可从少至五次演示中学习,并在一系列具有挑战性的任务中泛化。它在长时域行为(如泡咖啡)、高约束运动(如开门)、动态动作(如用铲子舀取)和多模态任务(如挂杯子)上表现出色。MiDiGap在CPU上不到一分钟即可学习这些任务,并线性扩展到大型数据集。我们还开发了一套丰富的推理时引导工具,利用碰撞信号和机器人运动学约束等证据。这种引导实现了新颖的泛化能力,包括避障和跨本体策略迁移。MiDiGap在多样化的少样本操作基准上达到了最先进的性能。在受约束的RLBench任务上,它将策略成功率提高了76个百分点,并将轨迹成本降低了67%。在多模态任务上,它将策略成功率提高了48个百分点,并将样本效率提高了20倍。在跨本体迁移中,策略成功率提高了一倍以上。我们在以下网址公开了代码:此 https URL。

英文摘要

We present Mixture of Discrete-time Gaussian Processes (MiDiGap), a novel approach for flexible policy representation and imitation learning in robot manipulation. MiDiGap enables learning from as few as five demonstrations using only camera observations and generalizes across a wide range of challenging tasks. It excels at long-horizon behaviors such as making coffee, highly constrained motions such as opening doors, dynamic actions such as scooping with a spatula, and multimodal tasks such as hanging a mug. MiDiGap learns these tasks on a CPU in less than a minute and scales linearly to large datasets. We also develop a rich suite of tools for inference-time steering using evidence such as collision signals and robot kinematic constraints. This steering enables novel generalization capabilities, including obstacle avoidance and cross-embodiment policy transfer. MiDiGap achieves state-of-the-art performance on diverse few-shot manipulation benchmarks. On constrained RLBench tasks, it improves policy success by 76 percentage points and reduces trajectory cost by 67%. On multimodal tasks, it improves policy success by 48 percentage points and increases sample efficiency by a factor of 20. In cross-embodiment transfer, it more than doubles policy success. We make the code publicly available at this https URL.

2504.21072 2026-06-11 cs.CR cs.AI cs.LG 版本更新

Erased but Not Forgotten: How Backdoors Compromise Concept Erasure

擦除但未遗忘:后门如何破坏概念擦除

Tobias Braun, Jonas Henry Grebe, Marcus Rohrbach, Anna Rohrbach


AI总结 本文揭示了一种名为擦除规避后门(EEB)的漏洞,攻击者将后门触发器绑定到待擦除概念上,使得该恶意链接在后续擦除后仍然存在,从而绕过多种概念擦除方法。

详情
AI中文摘要

文本到图像扩散模型的扩展引发了对有害输出的担忧,从捏造的公众人物描绘到露骨的色情图像。为减轻此类风险,先前工作提出了概念擦除方法,旨在通过微调从模型中切断不需要的概念,但仍不清楚这些方法是否真正移除了与有害概念的所有联系,或仅仅是掩盖了表面连接。在这项工作中,我们揭示了一个关键漏洞——擦除规避后门(EEB):攻击者将后门触发器绑定到待擦除的概念上,并且这种恶意链接在后续擦除后仍然存在。我们展示了黑盒和白盒攻击者都能实例化这一威胁。在六种最先进的擦除方法中,包括那些明确搜索目标概念替代表示的鲁棒方法,EEB始终能暴露有害内容:针对名人身份遗忘的成功率高达82%,针对物体擦除的成功率高达94%,针对露骨内容暴露的放大倍数高达16倍。虽然EEB揭示了当前擦除方法的一个盲点,但它也为压力测试未来的概念擦除技术提供了诊断工具。

英文摘要

The expansion of text-to-image diffusion models has raised concerns about harmful outputs, from fabricated depictions of public figures to sexually explicit imagery. To mitigate such risks, prior work has proposed concept erasure methods that aim to sever unwanted concepts from the model via fine-tuning, yet it remains unclear whether these approaches truly remove all links to the harmful concept or merely conceal superficial connections. In this work, we reveal a critical vulnerability, the Erasure Evasion Backdoor (EEB): an adversary binds a backdoor trigger to a concept slated for removal, and this malicious link survives subsequent erasure. We show that both black-box and white-box adversaries can instantiate this threat. Across six state-of-the-art erasure methods, including robust ones that explicitly search for alternative representations of the target concept, EEB consistently exposes harmful content: up to 82% success against celebrity-identity unlearning, up to 94% for object erasure, and up to 16 times amplification of explicit-content exposure. While EEB uncovers a blind spot in current erasure methods, it also provides a diagnostic tool for stress-testing future concept erasure techniques.

2504.12556 2026-06-11 cs.CV 版本更新

Contour Field based Elliptical Shape Prior for the Segment Anything Model

基于轮廓场的椭圆形状先验用于Segment Anything模型

Xinyu Zhao, Faqiang Wang, Li Cui, Yuping Duan, Jun Liu

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对SAM难以高效生成椭圆形状分割结果的问题,提出一种参数化椭圆轮廓场约束方法,通过变分法和对偶算法将椭圆先验与图像特征融合,提升特定任务分割精度。

详情
AI中文摘要

椭圆形状先验信息在提高医学和自然图像特定任务的分割精度方面起着至关重要的作用。现有的基于深度学习的分割方法,包括Segment Anything模型(SAM),通常难以高效地生成具有椭圆形状的分割结果。本文提出了一种新方法,利用变分法将椭圆形状先验集成到基于深度学习的SAM图像分割技术中。该方法建立了一个参数化的椭圆轮廓场,约束分割结果与预定义的椭圆轮廓对齐。利用对偶算法,该模型将图像特征与椭圆先验和空间正则化先验无缝集成,从而大大提高了分割精度。通过将SAM分解为四个数学子问题,我们集成变分椭圆先验设计了一种新的SAM网络结构,确保SAM的分割输出由椭圆区域组成。在特定图像数据集上的实验结果表明,该方法优于原始SAM。

英文摘要

The elliptical shape prior information plays a vital role in improving the accuracy of image segmentation for specific tasks in medical and natural images. Existing deep learning-based segmentation methods, including the Segment Anything Model (SAM), often struggle to produce segmentation results with elliptical shapes efficiently. This paper proposes a new approach to integrate the prior of elliptical shapes into the deep learning-based SAM image segmentation techniques using variational methods. The proposed method establishes a parameterized elliptical contour field, which constrains the segmentation results to align with predefined elliptical contours. Utilizing the dual algorithm, the model seamlessly integrates image features with elliptical priors and spatial regularization priors, thereby greatly enhancing segmentation accuracy. By decomposing SAM into four mathematical sub-problems, we integrate the variational ellipse prior to design a new SAM network structure, ensuring that the segmentation output of SAM consists of elliptical regions. Experimental results on some specific image datasets demonstrate an improvement over the original SAM.

2503.22926 2026-06-11 cs.RO 版本更新

SR-LIO++: LiDAR-Inertial Odometry and Quantized Mapping with Caching-Aware Sweep Reconstruction

SR-LIO++: 基于缓存感知扫描重建的LiDAR-惯性里程计与量化建图

Zikang Yuan, Ruiye Ming, Chengwei Zhao, Yonghao Tan, Pingcheng Dong, Yuan Ren, Yuzhong Jiao, Xin Yang, Kwang-Ting Cheng

发表机构 * ACCESS-AI Chip Center for Emerging Smart Systems, InnoHK Centers, Hong Kong Science Park, Hong Kong, China(新兴智能系统 ACCESS-AI 芯片中心,InnoHK 中心,香港科学园,香港,中国) Hong Kong University of Science and Technology, Hong Kong, China(香港科技大学,香港,中国) Huazhong University of Science and Technology, Wuhan, China(华中科技大学,武汉,中国) Hangzhou Qisheng Intelligent Technology Co. Ltd., Hangzhou, China(杭州启盛智能科技有限公司,杭州,中国) Southeast University, Nanjing, China(东南大学,南京,中国)

AI总结 提出SR-LIO++系统,通过扫描重建提高输出频率,并采用缓存机制和量化地图点管理,在资源受限平台上实现高精度、高效率的LiDAR-惯性里程计。

详情
Comments
18 pages, 10 figures
AI中文摘要

解决3D LiDAR固有低采集频率限制以实现高频输出已成为LiDAR-惯性里程计(LIO)领域的关键研究焦点。为确保实时性能,频率增强的LIO系统必须在显著缩短的时间窗口内处理每次扫描,这对在资源受限平台上的部署提出了巨大挑战。为解决这些限制,我们引入了SR-LIO++,一种创新的LIO系统,能够在资源受限的硬件平台(包括Raspberry Pi 4B)上实现相对于输入频率两倍的输出频率。我们的系统采用先前提出的扫描重建方法来提高LiDAR扫描频率,生成高频重建扫描。在此基础之上,我们提出了一种针对最近段中间结果(即表面参数)的缓存机制,有效减少了相邻重建扫描中公共段的冗余处理。该方法将处理时间从传统的对重建扫描频率的线性依赖中解耦出来。此外,我们提出了一种基于索引表映射的量化地图点管理,通过将全局3D点存储从64位双精度转换为8位字符表示,显著减少了内存使用。该方法还将最近邻搜索中计算密集的欧几里得距离计算从64位双精度转换为16位短整型和32位整型格式,降低了计算成本。在三个不同计算平台和四个公开数据集上的广泛实验评估表明,SR-LIO++在保持最先进精度的同时,显著提高了效率。值得注意的是,我们的系统在Raspberry Pi 4B硬件上成功实现了20 Hz的状态输出。

英文摘要

Addressing the inherent low acquisition frequency limitation of 3D LiDAR to achieve high-frequency output has become a critical research focus in the LiDAR-Inertial Odometry (LIO) domain. To ensure real-time performance, frequency-enhanced LIO systems must process each sweep within a significantly reduced timeframe, which presents substantial challenges for deployment on resource-constrained platforms. To address these limitations, we introduce SR-LIO++, an innovative LIO system capable of achieving doubled output frequency relative to input frequency on resource-constrained hardware platforms, including the Raspberry Pi 4B. Our system employs the previously proposed sweep reconstruction methodology to enhance LiDAR sweep frequency, generating high-frequency reconstructed sweeps. Building upon this foundation, we propose a caching mechanism for intermediate results (i.e., surface parameters) of the most recent segments, effectively minimizing redundant processing of common segments in adjacent reconstructed sweeps. This method decouples processing time from the traditionally linear dependence on reconstructed sweep frequency. Furthermore, we present a quantized map point management scheme based on index table mapping, significantly reducing memory usage by converting global 3D point storage from 64-bit double precision to 8-bit char representation. This method also converts the computationally intensive Euclidean distance calculations in nearest neighbor searches from 64-bit double precision to 16-bit short and 32-bit integer formats, reducing computational cost. Extensive experimental evaluations across three distinct computing platforms and four public datasets demonstrate that SR-LIO++ maintains state-of-the-art accuracy while substantially enhancing efficiency. Notably, our system successfully achieves 20 Hz state output on Raspberry Pi 4B hardware.
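The quantized map point idea can be sketched as follows, under stated assumptions: each point is split into an integer voxel index (the key of a hypothetical index table) plus an 8-bit offset per axis inside that voxel, and distances between points in the same voxel are computed on integers. The voxel size, the table layout, and all function names are illustrative assumptions and are not taken from SR-LIO++.

```python
VOXEL = 1.0           # voxel edge length in metres (assumed)
LEVELS = 256          # quantization levels per axis, so each offset fits in 8 bits
STEP = VOXEL / LEVELS

def quantize(point):
    """Split a float 3D point into (voxel index tuple, 3-byte offset)."""
    voxel, offset = [], []
    for c in point:
        v = int(c // VOXEL)                               # floor, also for negatives
        o = min(LEVELS - 1, int((c - v * VOXEL) / STEP))  # clamp to 0..255
        voxel.append(v)
        offset.append(o)
    return tuple(voxel), bytes(offset)

def dequantize(voxel, offset):
    """Recover the centre of the quantization cell as floats."""
    return tuple(v * VOXEL + (o + 0.5) * STEP for v, o in zip(voxel, offset))

def build_index_table(points):
    """Hypothetical index table: voxel index -> list of 3-byte offsets."""
    table = {}
    for p in points:
        voxel, offset = quantize(p)
        table.setdefault(voxel, []).append(offset)
    return table

def squared_distance_int(offset_a, offset_b):
    """Integer squared distance, in STEP^2 units, for offsets in the same voxel."""
    return sum((a - b) ** 2 for a, b in zip(offset_a, offset_b))
```

With these assumed settings each stored point costs three bytes instead of three 64-bit doubles, at a reconstruction error of at most half a quantization step per axis.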

2503.10973 2026-06-11 cs.LG 版本更新

Learning Patterns and Abstractions from Perceptual Sequences

从感知序列中学习模式与抽象

Shuchen Wu

发表机构 * Mathematisch-Naturwissenschaftlichen Fakultät und der Medizinischen Fakultät der Eberhard-Karls-Universität Tübingen(图宾根大学数学自然科学学院和医学学院)

AI总结 研究从感知序列中通过分块和抽象发现模式与层次结构的计算原理,提出理性分块模型和非参数层次变量模型,实现高效序列分解与无监督模式发现。

详情
Comments
Doctoral thesis
AI中文摘要

认知过程迅速将高维感官流分解为熟悉的部分并揭示它们之间的关系。结构为何出现?它们如何支持学习、泛化和预测?什么计算原理构成了感知和智能的这一核心方面?简化来说,感官流是一维序列。在学习此类序列时,我们自然地将其分割成部分——这一过程称为分块。在第一个项目中,我研究了在序列反应时任务中影响分块的因素,并表明人类在平衡速度和准确性的同时适应底层分块。在此基础上,我开发了学习分块并逐块解析序列的模型。从规范角度,我提出分块是一种理性策略,用于发现重复模式和嵌套层次结构,从而实现高效的序列分解。学习到的分块可作为可复用的原语,用于迁移、组合和心理模拟——使模型能够从已知中组合出新内容。我展示了该模型在单维和多维序列中学习层次结构的能力,并强调了其在无监督模式发现中的实用性。第二部分从具体序列转向抽象序列。我对抽象主题进行了分类,并考察了它们在序列记忆中的作用。行为证据表明,人类利用模式冗余进行压缩和迁移。我提出了一个非参数层次变量模型,该模型同时学习分块和抽象变量,揭示不变符号模式。我展示了其与人类学习的相似性,并与大型语言模型进行了比较。综上所述,本论文表明,分块和抽象作为简单的计算原理,能够支持从简单到复杂、从具体到抽象的层次组织序列中的结构化知识获取。

英文摘要

Cognition swiftly breaks high-dimensional sensory streams into familiar parts and uncovers their relations. Why do structures emerge, and how do they enable learning, generalization, and prediction? What computational principles underlie this core aspect of perception and intelligence? A sensory stream, simplified, is a one-dimensional sequence. In learning such sequences, we naturally segment them into parts -- a process known as chunking. In the first project, I investigated factors influencing chunking in a serial reaction time task and showed that humans adapt to underlying chunks while balancing speed and accuracy. Building on this, I developed models that learn chunks and parse sequences chunk by chunk. Normatively, I proposed chunking as a rational strategy for discovering recurring patterns and nested hierarchies, enabling efficient sequence factorization. Learned chunks serve as reusable primitives for transfer, composition, and mental simulation -- letting the model compose the new from the known. I demonstrated this model's ability to learn hierarchies in single and multi-dimensional sequences and highlighted its utility for unsupervised pattern discovery. The second part moves from concrete to abstract sequences. I taxonomized abstract motifs and examined their role in sequence memory. Behavioral evidence suggests that humans exploit pattern redundancies for compression and transfer. I proposed a non-parametric hierarchical variable model that learns both chunks and abstract variables, uncovering invariant symbolic patterns. I showed its similarity to human learning and compared it to large language models. Taken together, this thesis suggests that chunking and abstraction as simple computational principles enable structured knowledge acquisition in hierarchically organized sequences, from simple to complex, concrete to abstract.

2503.08445 2026-06-11 cs.RO 版本更新

iPack: Intuitive Bin Packing with Large Language Models

iPack: 基于大型语言模型的直观装箱

Yannik Blei, Michael Krawez, Adrian Göß, Devadas Vijayan Sheela, Tobias Jülg, Pierre Krack, Florian Walter, Wolfram Burgard

发表机构 * University of Technology Nuremberg(纽伦堡工业大学) Friedrich-Alexander University of Erlangen–Nuremberg(埃尔兰根-纽伦堡弗里德里希-亚历山大大学) Technical University of Munich(慕尼黑工业大学)

AI总结 提出LLM-Pack方法,利用语言和视觉基础模型生成模仿人类策略的杂货装箱顺序,无需专门训练即可处理新物品,模块化设计便于升级。

详情
Comments
7 Pages, 9 Figures
AI中文摘要

机器人和自动化在物流中越来越有影响力,但仍主要局限于传统仓库。在杂货零售中,虽然存在无收银员超市等进步,但顾客仍然手动挑选和包装杂货。尽管机器人领域对箱子拾取问题有大量关注,但包装物品和杂货的任务基本上未被触及。然而,以正确顺序包装杂货物品对于防止产品损坏至关重要,例如,重物不应放在易碎物品之上。然而,正确包装顺序的确切标准很难定义,特别是考虑到商店中通常有大量各种物品。在本文中,我们介绍了LLM-Pack,一种新颖的杂货包装方法。LLM-Pack利用语言和视觉基础模型来识别杂货并生成模仿人类包装策略的包装序列。LLM-Pack不需要专门训练来处理新的杂货物品,其模块化设计允许轻松升级底层基础模型。我们广泛评估了我们的方法以展示其性能。我们将在本文发表后公开LLM-Pack的源代码。

英文摘要

Robotics and automation are increasingly influential in logistics but remain largely confined to traditional warehouses. In grocery retail, advancements such as cashier-less supermarkets exist, yet customers still manually pick and pack groceries. While there has been a substantial focus in robotics on the bin picking problem, the task of packing objects and groceries has remained largely untouched. Yet packing grocery items in the right order is crucial for preventing product damage; for example, heavy objects should not be placed on top of fragile ones. The exact criteria for the right packing order are hard to define, however, particularly given the huge variety of objects typically found in stores. In this paper, we introduce LLM-Pack, a novel approach for grocery packing. LLM-Pack leverages language and vision foundation models for identifying groceries and generating a packing sequence that mimics human packing strategy. LLM-Pack does not require dedicated training to handle new grocery items, and its modularity allows easy upgrades of the underlying foundation models. We extensively evaluate our approach to demonstrate its performance. We will make the source code of LLM-Pack publicly available upon the publication of this manuscript.

2503.06578 2026-06-11 cs.RO eess.SY 版本更新

Non-Equilibrium MAV-Capture-MAV via Time-Optimal Planning and Reinforcement Learning

非平衡MAV捕获MAV:基于时间最优规划和强化学习

Canlun Zheng, Zhanyu Guo, Zikang Yin, Chunyu Wang, Zhikun Wang, Shiyu Zhao

发表机构 * College of Computer Science and Technology, Zhejiang University, Hangzhou, China(浙江大学计算机科学与技术学院,中国杭州) WINDY Lab, Department of Artificial Intelligence, Westlake University, Hangzhou, China(西湖大学人工智能系WINDY实验室,中国杭州) Department of Electrical Engineering, California Institute of Technology, Pasadena, USA(加州理工学院电气工程系,美国帕萨迪纳)

AI总结 针对高机动性目标捕获难题,本文设计紧凑型捕获MAV,结合时间最优规划与强化学习方法,在非稳定状态下实现目标捕获。

详情
AI中文摘要

由于飞行MAV(微型飞行器)的捕获具有挑战性和广阔应用前景,近年来受到越来越多的研究关注。尽管已有进展,现有工作的一个关键限制是捕获策略通常相对简单且受平台性能约束。本文研究能够捕获高机动性目标的控制策略。在非稳定条件下实现目标捕获这一独特挑战使其区别于传统的追逃和制导问题。在本研究中,我们从较大的MAV平台过渡到一种专门设计的、配备定制发射装置的紧凑型捕获MAV,同时保持高机动性。我们探索了时间最优规划(TOP)和强化学习(RL)方法。仿真表明,TOP提供高机动性和更短的轨迹,而RL在实时适应性和稳定性方面表现优异。此外,RL方法已在真实场景中测试,成功实现了即使在非稳定状态下的目标捕获。

英文摘要

The capture of flying MAVs (micro aerial vehicles) has garnered increasing research attention due to its intriguing challenges and promising applications. Despite recent advancements, a key limitation of existing work is that capture strategies are often relatively simple and constrained by platform performance. This paper addresses control strategies capable of capturing high-maneuverability targets. The unique challenge of achieving target capture under unstable conditions distinguishes this task from traditional pursuit-evasion and guidance problems. In this study, we transition from larger MAV platforms to a specially designed, compact capture MAV equipped with a custom launching device while maintaining high maneuverability. We explore both time-optimal planning (TOP) and reinforcement learning (RL) methods. Simulations demonstrate that TOP offers highly maneuverable and shorter trajectories, while RL excels in real-time adaptability and stability. Moreover, the RL method has been tested in real-world scenarios, successfully achieving target capture even in unstable states.

2501.12942 2026-06-11 cs.AI 版本更新

Offline Diffusion Policy for Multi-User Delay-Constrained Scheduling

面向多用户延迟约束调度的离线扩散策略

Zhuoran Li, Ruishuo Chen, Hai Zhong, Longbo Huang

发表机构 * Institute for Interdisciplinary Information Sciences (IIIS), Tsinghua University(清华大学交叉信息研究院(IIIS))

AI总结 提出基于离线强化学习的SOCD算法,利用扩散策略和批评网络指导,从离线数据中学习高效调度策略,避免在线交互,在部分可观测和大规模环境中表现优异。

详情
AI中文摘要

有效的多用户延迟约束调度在诸多实际应用中至关重要,包括具身AI、即时通讯、直播和数据中心管理,这些场景需要在具有不同延迟敏感性的用户之间进行高效资源分配。在这些场景中,调度器必须实时做出决策,以满足延迟和资源约束,同时无需事先了解系统动态,这些动态通常是时变的且难以估计。当前基于学习的方法通常需要在训练阶段与实际系统进行在线交互。因此,这些方法往往难以实施或不切实际,因为它们会显著降低系统性能并产生高昂的服务成本。为应对这些挑战,我们提出了一种新颖的基于离线强化学习的算法,名为SOCD(通过离线学习与批评引导和扩散模型进行调度),该算法仅从预先收集的离线数据中学习高效调度策略。SOCD创新性地采用了扩散策略,并辅以无采样的批评网络进行策略引导。通过将拉格朗日乘子优化融入离线强化学习,SOCD仅从可用数据集中高效训练出高质量且满足约束的策略,无需与系统进行在线交互。实验结果表明,SOCD对多种系统动态具有鲁棒性,包括部分可观测和大规模环境,并且与现有方法相比性能更优。

英文摘要

Effective multi-user delay-constrained scheduling is crucial in various real-world applications, including embodied AI, instant messaging, live streaming, and data center management, where efficient resource allocation is required among users with diverse delay sensitivities. In these scenarios, schedulers must make real-time decisions to satisfy both delay and resource constraints without prior knowledge of system dynamics, which are often time-varying and challenging to estimate. Current learning-based methods typically require online interactions with actual systems during the training stage. Therefore, these approaches are often difficult or impractical, as they can significantly degrade system performance and incur substantial service costs. To address these challenges, we propose a novel offline reinforcement learning-based algorithm, named Scheduling By Offline Learning with Critic Guidance and Diffusion Model (SOCD), to learn efficient scheduling policies purely from pre-collected offline data. SOCD innovatively employs a diffusion policy, complemented by a sampling-free critic network for policy guidance. By integrating the Lagrangian multiplier optimization into the offline reinforcement learning, SOCD efficiently trains high-quality constraint-aware policies exclusively from available datasets, eliminating the need for online interactions with the system. Experimental results demonstrate that SOCD is resilient to various system dynamics, including partially observable and large-scale environments, and delivers superior performance compared to existing methods.

2410.24145 2026-06-11 stat.ML cs.LG stat.ME 版本更新

Projected random forests and conformal prediction of circular data

投影随机森林与圆形数据的共形预测

Paulo C. Marques F., Rinaldo Artes, Helton Graziadei

发表机构 * Insper University(Insper大学) University of São Paulo(圣保罗大学)

AI总结 针对圆形响应回归问题,应用共形预测技术,通过投影方法将线性回归模型转换为圆形模型,并利用随机森林的袋外机制避免额外校准样本,生成具有有限样本覆盖保证和自适应弧长的预测集。

详情
Comments
7 pages; 4 figures
AI中文摘要

我们将共形预测技术应用于具有圆形响应的回归问题,在数据可交换性假设下,为任何圆形预测模型生成具有自适应弧长和有限样本覆盖保证的预测集。利用现有为线性响应设计的高性能预测模型,我们分析了一种通用的投影过程,将任何线性响应回归模型转换为适用于圆形响应的模型。当在此投影过程中使用随机森林作为基模型时,我们利用随机森林的袋外机制,在构建预测集时无需单独的校准样本。在合成和真实数据集上,与两种现有替代模型生成的分割共形预测集相比,所得的投影随机森林模型产生了更高效的袋外共形预测集,中位弧长更短。

英文摘要

We apply conformal prediction techniques to regression problems with circular responses, producing prediction sets with adaptive arc length and finite-sample coverage guarantees for any circular predictive model under the assumption of data exchangeability. Leveraging the high performance of existing predictive models designed for linear responses, we analyze a general projection procedure that converts any linear-response regression model into one suitable for circular responses. When random forests are used as base models in this projection procedure, we leverage the random forest out-of-bag mechanism to eliminate the need for a separate calibration sample in the construction of prediction sets. On synthetic and real datasets, the resulting projected random forest model produces more efficient out-of-bag conformal prediction sets, with shorter median arc length, than the split conformal prediction sets generated by two existing alternative models.
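The projection step and a conformal arc can be sketched as follows. The sketch assumes that two ordinary regressors have already predicted the cosine and sine components of the response. It uses plain split conformal calibration with a fixed arc half-width, which is simpler than the adaptive, out-of-bag construction described in the abstract. All function names are illustrative assumptions.

```python
import math

def project_to_angle(cos_hat, sin_hat):
    """Map predicted (cos, sin) components back to an angle in [0, 2*pi)."""
    return math.atan2(sin_hat, cos_hat) % (2 * math.pi)

def circular_distance(a, b):
    """Shortest arc length between two angles, in [0, pi]."""
    d = abs(a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)

def conformal_arc_halfwidth(y_cal, y_hat_cal, alpha=0.1):
    """Split-conformal quantile of circular residuals on a calibration set."""
    scores = sorted(circular_distance(y, p) for y, p in zip(y_cal, y_hat_cal))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    # With too few calibration points the set is the whole circle (half-width pi).
    return math.pi if k > n else scores[k - 1]
```

A prediction set for a new point is then the arc of half-width `conformal_arc_halfwidth(...)` centred on `project_to_angle(cos_hat, sin_hat)`. Replacing the calibration residuals with out-of-bag residuals is the modification the abstract describes for random forests.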

2410.12327 2026-06-11 cs.CL 版本更新

Neuron-based Personality Trait Induction in Large Language Models

基于神经元的大语言模型人格特质诱导

Jia Deng, Tianyi Tang, Yanbin Yin, Wenhao Yang, Wayne Xin Zhao, Ji-Rong Wen

发表机构 * Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学高瓴人工智能学院) Tongyi Lab(通义实验室) Institute of Statistics and Big Data, Renmin University of China(中国人民大学统计与大数据研究院)

AI总结 提出基于神经元的大语言模型人格特质诱导方法,通过构建PersonalityBench数据集、识别人格相关神经元并调整其值,实现无需训练的参数级控制,效果媲美微调模型。

详情
Comments
25 pages. Published at ICLR 2025
AI中文摘要

大型语言模型(LLMs)在模拟各种人格特质方面变得越来越熟练,这是支持相关应用(例如角色扮演)的重要能力。为了进一步提高这种能力,在本文中,我们提出了一种基于神经元的方法,用于LLMs的人格特质诱导,包含三个主要技术贡献。首先,我们构建了PersonalityBench,一个用于识别和评估LLMs人格特质的大规模数据集。该数据集基于心理学中的大五人格特质,旨在评估LLMs对特定人格特质的生成能力。其次,通过利用PersonalityBench,我们提出了一种高效的方法,通过检查给定特质的对立面来识别LLMs中与人格相关的神经元。第三,我们开发了一种简单而有效的诱导方法,通过操纵这些已识别的人格相关神经元的值。该方法无需训练和修改模型参数即可实现对LLMs所表现特质的细粒度控制。大量实验验证了我们神经元识别和特质诱导方法的有效性。值得注意的是,我们的方法实现了与微调模型相当的性能,为LLMs的人格特质诱导提供了更高效、更灵活的解决方案。我们在以下网址提供所有提到的资源:此 https URL。

英文摘要

Large language models (LLMs) have become increasingly proficient at simulating various personality traits, an important capability for supporting related applications (e.g., role-playing). To further improve this capacity, in this paper, we present a neuron-based approach for personality trait induction in LLMs, with three major technical contributions. First, we construct PersonalityBench, a large-scale dataset for identifying and evaluating personality traits in LLMs. This dataset is grounded in the Big Five personality traits from psychology and is designed to assess the generative capabilities of LLMs towards specific personality traits. Second, by leveraging PersonalityBench, we propose an efficient method for identifying personality-related neurons within LLMs by examining the opposite aspects of a given trait. Third, we develop a simple yet effective induction method that manipulates the values of these identified personality-related neurons. This method enables fine-grained control over the traits exhibited by LLMs without training or modifying model parameters. Extensive experiments validate the efficacy of our neuron identification and trait induction methods. Notably, our approach achieves performance comparable to that of fine-tuned models, offering a more efficient and flexible solution for personality trait induction in LLMs. We provide access to all the mentioned resources at this https URL.

2409.18478 2026-06-11 cs.CV 版本更新

Temporal2Seq: A Unified Framework for Temporal Video Understanding Tasks

Temporal2Seq: 一个面向时序视频理解任务的统一框架

Min Yang, Zichen Zhang, Qian Dang, Limin Wang

发表机构 * State Key Laboratory for Novel Software Technology, Nanjing University(新型软件技术国家重点实验室,南京大学) China Design Group Co., Ltd(中国设计集团有限公司)

AI总结 提出Temporal2Seq统一框架,将时序视频理解任务输出表示为离散token序列,通过单一架构训练通用模型,在TAD、TAS、GEBD三个任务上取得合理结果,并优于单任务训练。

详情
Comments
Accepted by CVIU
AI中文摘要

随着视频理解的发展,出现了大量用于片段级时序视频分析的任务,包括时序动作检测(TAD)、时序动作分割(TAS)和通用事件边界检测(GEBD)。尽管针对特定任务的视频理解模型在每个任务上都表现出色,但仍然缺乏一个能够同时处理多个任务的统一框架,这是下一代人工智能的一个有前景的方向。为此,在本文中,我们提出了一个单一的统一框架,称为Temporal2Seq,将这些时序视频理解任务的输出表述为离散token序列。通过这种统一的token表示,Temporal2Seq可以在单一架构内训练一个通用模型,用于不同的视频理解任务。在没有多任务学习(MTL)基准的情况下,我们通过借用TAD、TAS和GEBD任务的数据集,编制了一个全面的联合训练数据集。我们在三个任务的相应测试集上评估了我们的Temporal2Seq通用模型,结果表明Temporal2Seq能够在各种任务上产生合理的结果,并且在该框架上相比单任务训练具有优势。我们还研究了通用模型在不同任务的新数据集上的泛化性能,其表现优于特定模型。

英文摘要

With the development of video understanding, tasks for clip-level temporal video analysis have proliferated, including temporal action detection (TAD), temporal action segmentation (TAS), and generic event boundary detection (GEBD). While task-specific video understanding models have exhibited outstanding performance in each task, a unified framework capable of addressing multiple tasks simultaneously is still lacking, and building one is a promising direction for the next generation of AI. To this end, in this paper, we propose a single unified framework, called Temporal2Seq, to formulate the output of these temporal video understanding tasks as a sequence of discrete tokens. With this unified token representation, Temporal2Seq can train a generalist model within a single architecture on different video understanding tasks. In the absence of multi-task learning (MTL) benchmarks, we compile a comprehensive co-training dataset by borrowing the datasets from TAD, TAS, and GEBD tasks. We evaluate our Temporal2Seq generalist model on the corresponding test sets of the three tasks, demonstrating that Temporal2Seq can produce reasonable results on various tasks and outperforms single-task training on this framework. We also investigate the generalization performance of our generalist model on new datasets from different tasks, where it yields superior performance to the task-specific models.
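Formulating temporal outputs as discrete tokens can be illustrated with a minimal sketch. It assumes a vocabulary in which the first `NUM_BINS` ids are quantized time positions and class ids follow after them. The bin count, the token order, and the function names are assumptions for illustration, and the actual Temporal2Seq vocabulary may differ.

```python
NUM_BINS = 100  # number of time-bin tokens (assumed)

def time_to_token(t, duration):
    """Quantize a timestamp in seconds to a bin id in 0..NUM_BINS-1."""
    frac = min(max(t / duration, 0.0), 1.0)
    return min(NUM_BINS - 1, int(frac * NUM_BINS))

def segments_to_tokens(segments, duration):
    """segments: list of (start_sec, end_sec, class_id); class ids are offset by NUM_BINS."""
    tokens = []
    for start, end, cls in sorted(segments):
        tokens += [time_to_token(start, duration),
                   time_to_token(end, duration),
                   NUM_BINS + cls]
    return tokens

def tokens_to_segments(tokens, duration):
    """Inverse mapping; times are recovered as bin centres."""
    segments = []
    for i in range(0, len(tokens), 3):
        s, e, c = tokens[i:i + 3]
        start = (s + 0.5) / NUM_BINS * duration
        end = (e + 0.5) / NUM_BINS * duration
        segments.append((start, end, c - NUM_BINS))
    return segments
```

Under this layout a detection output, a segmentation output, and a boundary list can all be emitted by the same sequence decoder, with only the token pattern per task changing.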

2409.12707 2026-06-11 physics.flu-dyn cs.LG 版本更新

Machine-learning-based multipoint optimization of fluidic injection parameters for improving nozzle performance

基于机器学习的流体注入参数多点优化以提升喷管性能

Yunjia Yang, Jiazhe Li, Yufei Zhang, Haixin Chen

发表机构 * Tsinghua University(清华大学)

AI总结 针对过膨胀单斜面喷管,采用预训练神经网络替代CFD进行多点优化,结合先验预测策略提高精度,利用反向传播快速计算梯度,在七个设计点优化平均推力系数提升1.14%。

详情
AI中文摘要

流体注入为改善车辆加速过程中过膨胀单斜面喷管(SERN)的性能提供了一种有前景的解决方案。然而,确定能在多个喷管工作状态下产生最佳整体性能的注入参数仍然是一个挑战。基于梯度的优化方法需要在每个设计点计算注入参数的梯度,当使用计算流体动力学(CFD)模拟时,这可能导致高昂的计算成本。本文使用预训练神经网络在优化过程中替代CFD,从而能够快速计算多个设计点的喷管流场。考虑到喷管流场的物理特性,采用基于先验的预测策略来提高模型的准确性。此外,神经网络的反向传播算法只需运行一次计算即可快速计算梯度,从而与有限差分法相比大大减少了梯度计算时间。作为测试案例,对SERN在七个设计点的平均喷管推力系数进行了优化,结果提高了1.14%。即使包括建立训练数据库所需的时间,与传统优化方法相比,时间成本也大大降低。

英文摘要

Fluidic injection offers a promising solution to improve the performance of overexpanded single expansion ramp nozzles (SERNs) during vehicle acceleration. However, determining the injection parameters that yield the best overall performance across multiple nozzle operating conditions remains a challenge. Gradient-based optimization requires the gradients with respect to the injection parameters at each design point, which can lead to high computational costs when using computational fluid dynamics (CFD) simulations. This paper uses a pretrained neural network to replace CFD during optimization, enabling quick calculation of the nozzle flow field at multiple design points. Considering the physical characteristics of the nozzle flow field, a prior-based prediction strategy is adopted to enhance the model's accuracy. In addition, the neural network's back-propagation algorithm computes gradients quickly by running the computation only once, thereby greatly reducing gradient computation time compared to the finite difference method. As a test case, the average nozzle thrust coefficient of an SERN at seven design points is optimized, resulting in a 1.14% improvement. The time cost is greatly reduced compared with traditional optimization methods, even when the time required to establish the training database is included.
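The cost argument for back-propagation versus finite differences can be illustrated on a toy surrogate. The sketch below assumes a tiny one-hidden-layer network with hand-picked weights standing in for the pretrained model; it is not the paper's network. It contrasts one backward pass with the `len(x) + 1` forward evaluations that forward differences require.

```python
import math

def surrogate(x, w1, w2):
    """Toy surrogate: f(x) = sum_j w2[j] * tanh(w1[j] . x)."""
    return sum(w2j * math.tanh(sum(a * b for a, b in zip(w1j, x)))
               for w1j, w2j in zip(w1, w2))

def grad_backprop(x, w1, w2):
    """Gradient of f with respect to every input from a single backward pass."""
    grad = [0.0] * len(x)
    for w1j, w2j in zip(w1, w2):
        h = math.tanh(sum(a * b for a, b in zip(w1j, x)))
        upstream = w2j * (1.0 - h * h)   # derivative w.r.t. the pre-activation
        for i, a in enumerate(w1j):
            grad[i] += upstream * a
    return grad

def grad_finite_difference(x, w1, w2, eps=1e-6):
    """Forward differences: one baseline evaluation plus one per input."""
    f0 = surrogate(x, w1, w2)
    grad = []
    for i in range(len(x)):
        xp = list(x)
        xp[i] += eps
        grad.append((surrogate(xp, w1, w2) - f0) / eps)
    return grad
```

With a CFD solver in place of `surrogate`, each extra injection parameter would add one full simulation to the finite-difference gradient, whereas the backward pass cost stays roughly that of one model evaluation.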

2307.01472 2026-06-11 cs.AI cs.LG cs.MA 版本更新

Improving Generalization and Data Efficiency with Diffusion in Offline Multi-agent RL

通过扩散模型提升离线多智能体强化学习的泛化能力与数据效率

Zhuoran Li, Ling Pan, Jiatai Huang, Longbo Huang

发表机构 * Institute for Interdisciplinary Information Sciences(交叉信息研究院) Tsinghua University(清华大学) Department of Electronic and Computer Engineering(电子与计算机工程系) Hong Kong University of Science and Technology(香港科技大学)

AI总结 提出扩散离线多智能体模型(DOM2),利用扩散模型增强策略表达力和多样性,结合轨迹数据重加权,在离线MARL中显著提升性能、泛化能力和数据效率。

详情
AI中文摘要

我们提出了一种新颖的扩散离线多智能体模型(DOM2),用于离线多智能体强化学习(MARL)。与主要依赖策略设计中保守性的现有算法不同,DOM2基于扩散模型增强了策略的表达力和多样性。具体来说,我们将扩散模型融入策略网络,并在训练中提出了一种基于轨迹的数据重加权方案。这些关键要素显著提高了算法对环境变化的鲁棒性,并在性能、泛化和数据效率方面取得了显著提升。我们的大量实验结果表明,DOM2在所有多智能体粒子和多智能体MuJoCo环境中均优于现有最先进方法,并且由于其高表达力和多样性,在迁移环境中(在评估的30个设置中有28个)泛化能力显著更强。此外,DOM2具有超高的数据效率,与现有算法相比,实现相同性能所需数据不超过5%(数据效率提升20倍)。

英文摘要

We present a novel Diffusion Offline Multi-agent Model (DOM2) for offline Multi-Agent Reinforcement Learning (MARL). Different from existing algorithms that rely mainly on conservatism in policy design, DOM2 enhances policy expressiveness and diversity using a diffusion model. Specifically, we incorporate a diffusion model into the policy network and propose a trajectory-based data-reweighting scheme in training. These key ingredients significantly improve algorithm robustness against environment changes and achieve significant improvements in performance, generalization and data-efficiency. Our extensive experimental results demonstrate that DOM2 outperforms existing state-of-the-art methods in all multi-agent particle and multi-agent MuJoCo environments, and generalizes significantly better to shifted environments (in 28 out of 30 settings evaluated) thanks to its high expressiveness and diversity. Moreover, DOM2 is ultra data-efficient and requires no more than 5% of the data to achieve the same performance as existing algorithms (a 20× improvement in data efficiency).

2305.06145 2026-06-11 cs.CV 版本更新

Causal Clothes-Invariant Feature Learning for Cloth-Changing Person Re-ID

因果衣物不变特征学习用于换衣行人重识别

Xulin Li, Yan Lu, Bin Liu, Jiaze Li, Yating Liu, Qi Chu, Mang Ye, Wanli Ouyang, Nenghai Yu

发表机构 * School of Cyber Science and Technology, University of Science and Technology of China(中国科学技术大学网络安全学院) Anhui Province Key Laboratory of Digital Security(安徽省数字安全重点实验室) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) School of Data Science, University of Science and Technology of China(中国科学技术大学数据科学学院) Wuhan University(武汉大学)

AI总结 针对换衣行人重识别中衣物变化导致特征失效的问题,提出因果衣物不变学习(CCIL),通过因果干预阻断衣物捷径,实现衣物不变特征学习,在PRCC和DeepChange数据集上分别达到66.4%和59.2%的Rank-1准确率。

详情
AI中文摘要

在换衣行人重识别(CCReID)中,学习衣物不变特征至关重要,这些特征能提供对衣物变化保持鲁棒的判别性ID特征。然而,当前存在的虚假相关性限制了现有ReID方法有效提取这些衣物不变特征。这种虚假相关性源于衣物归属:衣物很少在不同身份间共享,因此模型倾向于记忆衣物线索进行身份识别,这种策略对未见过的衣物泛化能力差。本文提出因果衣物不变学习(CCIL),将CC-ReID从似然学习P(Y|X)显式转换为因果干预学习P(Y|do(X))以阻断衣物捷径。CCIL通过三个模块实现这种干预:混淆字典、干预模块和解耦正则化。基于因果关系的建模使整个模型自然具有衣物不变性,有效防止特征学习中捕获虚假相关性。大量实验验证了CCIL的有效性。在PRCC和DeepChange数据集上,CCIL分别达到66.4%和59.2%的Rank-1准确率,比现有最优方法分别高出1.4和4.1个百分点。

英文摘要

In cloth-changing person re-identification (CC-ReID), it is critical to learn clothes-invariant features, which provide discriminative ID features that remain robust against clothing changes. However, a spurious correlation currently prevents existing ReID methods from effectively extracting these clothes-invariant features. This spurious correlation arises from clothing ownership: clothing is rarely shared across different identities, so models tend to memorize clothing cues for identity recognition, and this strategy generalizes poorly to unseen clothing. In this paper, we propose Causal Clothes-Invariant Learning (CCIL), which explicitly shifts CC-ReID from likelihood learning P(Y|X) to causal intervention learning P(Y|do(X)) to block the clothing shortcut. CCIL realizes this intervention through three modules: a Confounder Dictionary, an Intervention Module, and Disentangle Regularization. The causality-based modeling makes the entire model naturally clothes-invariant, effectively preventing the capture of spurious correlations in feature learning. Extensive experiments validate the effectiveness of CCIL. On the PRCC and DeepChange datasets, CCIL achieves Rank-1 accuracies of 66.4% and 59.2%, outperforming state-of-the-art methods by 1.4 and 4.1 percentage points, respectively.
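The difference between likelihood learning and intervention learning can be written out under one common assumption. If clothing acts as a confounder C that satisfies the backdoor criterion, the standard backdoor adjustment gives the contrast below. The abstract does not state which adjustment CCIL uses, so this is an illustrative derivation and not the authors' formulation; the symbol C is introduced here for that purpose.

```latex
\begin{align}
P(Y \mid X) &= \sum_{c} P(Y \mid X, c)\, P(c \mid X), \\
P(Y \mid do(X)) &= \sum_{c} P(Y \mid X, c)\, P(c).
\end{align}
```

In the first line the clothing distribution depends on the image X, so a model can exploit clothing as a shortcut to identity. In the second line every clothing value is weighted by its prior P(c) regardless of X, which removes that dependence.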

2304.13905 2026-06-11 cs.CR cs.AI cs.LG 版本更新

LSTM based IoT Device Identification

基于LSTM的物联网设备识别

Kahraman Kostas

发表机构 * Kahraman Kostas

AI总结 提出一种端到端机器学习流程,利用LSTM网络处理原始网络数据包,通过滑动窗口时间序列特征识别27类物联网设备,在最优配置下达到79.85%准确率和75.70%宏平均F1分数。

详情
AI中文摘要

随着物联网的使用越来越普及,大量设备进入市场,许多安全漏洞也随之出现。在此环境下,物联网设备识别方法提供了一种预防性安全措施,作为识别这些设备并检测其漏洞的重要因素。在本研究中,我们提出了一种端到端的机器学习流程,利用长短期记忆(LSTM)网络识别阿尔托大学数据集(物联网设备捕获)中的物联网设备。原始网络数据包捕获(PCAP)被处理成25个工程特征,然后排列为滑动窗口时间序列。我们系统地评估了从2到20的序列长度,报告称性能在长度6之前近似线性提升,之后呈波浪形模式,在长度18时达到峰值。在最优配置的最终保留测试集上,该模型在27个设备类别上达到了79.85%的准确率和75.70%的宏平均F1分数。

English abstract

While the use of the Internet of Things is becoming more and more popular, many security vulnerabilities are emerging with the large number of devices being introduced to the market. In this environment, IoT device identification methods provide a preventive security measure as an important factor in identifying these devices and detecting the vulnerabilities they suffer from. In this study, we present an end-to-end machine learning pipeline that identifies IoT devices in the Aalto University dataset (IoT devices captures) using Long Short-Term Memory (LSTM) networks. Raw network packet captures (PCAP) are processed into 25 engineered features, which are then arranged as sliding-window time-series sequences. We systematically evaluate sequence lengths from 2 to 20, reporting that performance improves approximately linearly up to length 6 and thereafter follows a wave-like pattern, reaching its peak at length 18. On the final held-out test set with the optimal configuration, the model achieves an accuracy of 79.85% and a macro-averaged F1-score of 75.70% across 27 device classes.
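The sliding-window arrangement mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the paper's pipeline: the function name `sliding_windows`, the stride parameter, and the choice to label each window with the device label of its last packet are all assumptions, and a real pipeline would likely build windows per device or per capture.

```python
from typing import List, Sequence, Tuple


def sliding_windows(
    features: Sequence[Sequence[float]],
    labels: Sequence[int],
    seq_len: int,
    stride: int = 1,
) -> Tuple[List[List[List[float]]], List[int]]:
    """Turn per-packet feature rows into overlapping sequences for an LSTM.

    Returns (windows, targets), where windows[i] has shape
    (seq_len, n_features) and targets[i] is the label of the window's
    last packet (an assumption made for this sketch).
    """
    if seq_len < 1 or stride < 1:
        raise ValueError("seq_len and stride must be positive")
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")

    windows: List[List[List[float]]] = []
    targets: List[int] = []
    for start in range(0, len(features) - seq_len + 1, stride):
        end = start + seq_len
        windows.append([list(row) for row in features[start:end]])
        targets.append(labels[end - 1])
    return windows, targets


if __name__ == "__main__":
    # Five hypothetical packets with 25 engineered features each.
    packets = [[float(i)] * 25 for i in range(5)]
    device_ids = [3, 3, 3, 7, 7]
    xs, ys = sliding_windows(packets, device_ids, seq_len=3)
    print(len(xs), len(xs[0]), len(xs[0][0]), ys)
```

With `seq_len` swept from 2 to 20, as in the abstract, the number of windows shrinks by one for each extra step of length when the stride is 1.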

2605.26938 2026-06-11 cs.AI math.OC

Developing a Totally Unimodular Linear Program for Optimal Conformance Checking: When and Why It Complements A*

Izack Cohen

Affiliations * Bar Ilan University

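AI summary: Reformulates alignment-based conformance checking as a totally unimodular linear program whose network-flow structure guarantees an integral optimal solution; experiments show substantial speedups over A* on long traces with deviations.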

Details
Journal ref
Expert Systems with Applications, Volume 331, Part A, 2026, 133021
Comments
Author-accepted manuscript accepted for publication in Expert Systems with Applications. Code and experiment scripts are available at: https://github.com/Izack-Cohen/unimodular-conformance-checking. Version corresponding to the accepted paper: v1.0.0
AI Chinese abstract

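Alignment-based conformance checking is the state-of-the-art method for comparing observed process executions with a normative process model. The standard exact solution relies on A*-based heuristic search, whose runtime can grow exponentially with long traces or many deviations. This paper reformulates alignment-based conformance checking as a totally unimodular linear program (LP) defined on the reachability graph of the synchronous product. By exploiting the underlying network-flow structure, the formulation guarantees through the LP relaxation that an integral optimal extreme point exists, avoiding the combinatorial overhead of integer variables and branch-and-bound search. An extensive empirical evaluation covers more than 2.1 million conformance checking instances from real-world and synthetic benchmark datasets. The results show that A* and the LP approach have complementary performance profiles: A* performs best on short, well-conforming traces, while the LP formulation gives substantial speedups on longer traces with deviations, which is exactly where conformance checking is most informative. From these findings the authors derive simple algorithm-selection guidelines that combine both approaches, achieving an average runtime saving of 38.6% with 96% selection accuracy relative to always using A*.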

English abstract

Alignment-based conformance checking is the state-of-the-art approach for comparing observed process executions with normative process models. The standard exact solution relies on an A*-based heuristic search, which can exhibit exponential runtime in the presence of long traces or substantial deviations. This paper introduces a reformulation of alignment-based conformance checking as a totally unimodular linear program (LP) defined on the reachability graph of the synchronous product. By exploiting the underlying network-flow structure, the proposed formulation guarantees the existence of an integral optimal extreme-point solution through LP relaxation, thereby avoiding the combinatorial overhead associated with integer variables and branch-and-bound search. We conduct an extensive empirical evaluation on more than 2.1 million conformance checking instances derived from real-world and synthetic benchmark datasets. The results show that A* and the LP approach exhibit complementary performance characteristics: the former performs best on short, well-conforming traces, while the LP formulation provides substantial speedups for longer traces with deviations, precisely where conformance checking is most informative. Based on these findings, we derive simple algorithm-selection guidelines that combine both approaches, achieving average runtime savings of 38.6% with 96% selection accuracy compared to always using A*.
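The integrality guarantee rests on a classical fact: the node-arc incidence matrix of a directed graph is totally unimodular, meaning every square submatrix has determinant -1, 0, or 1. The sketch below checks this by brute force on a small hypothetical graph; it is not the paper's synchronous-product construction, and the helper names are illustrative only. Brute-force enumeration is exponential and only suitable for tiny matrices.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

Matrix = List[List[int]]


def det(m: Matrix) -> int:
    """Integer determinant by Laplace expansion (fine for tiny matrices)."""
    n = len(m)
    if n == 1:
        return m[0][0]
    total = 0
    for j in range(n):
        if m[0][j] == 0:
            continue
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * m[0][j] * det(minor)
    return total


def incidence_matrix(num_nodes: int, arcs: Sequence[Tuple[int, int]]) -> Matrix:
    """Node-arc incidence matrix: +1 where an arc leaves, -1 where it enters."""
    a = [[0] * len(arcs) for _ in range(num_nodes)]
    for k, (u, v) in enumerate(arcs):
        a[u][k] = 1
        a[v][k] = -1
    return a


def is_totally_unimodular(a: Matrix) -> bool:
    """Check every square submatrix for a determinant in {-1, 0, 1}."""
    rows, cols = len(a), len(a[0])
    for k in range(1, min(rows, cols) + 1):
        for r in combinations(range(rows), k):
            for c in combinations(range(cols), k):
                sub = [[a[i][j] for j in c] for i in r]
                if det(sub) not in (-1, 0, 1):
                    return False
    return True


if __name__ == "__main__":
    # Hypothetical 4-node graph with source 0 and sink 3.
    arcs = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
    print(is_totally_unimodular(incidence_matrix(4, arcs)))
    # A matrix that is not totally unimodular (its determinant is 2).
    print(is_totally_unimodular([[1, 1, 0], [0, 1, 1], [1, 0, 1]]))
```

When flow-conservation constraints use such a matrix with integer right-hand sides, every extreme point of the LP feasible region is integral, which is why no integer variables or branch-and-bound are needed.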