arXivDaily: arXiv daily academic digest, updated Monday through Friday
2508.09456 2026-06-02 cs.CV cs.CL cs.CR

IAG: Input-aware Backdoor Attack on VLM-based Visual Grounding

Junxian Li, Beining Xu, Simin Chen, Jiatong Li, Jingdi Lei, Haodong Zhao, Di Zhang

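Affiliations: Shanghai Jiao Tong University; Fudan University; Columbia University; Hong Kong Polytechnic University; Nanyang Technological University

AI summary: Proposes IAG, which uses a text-conditioned UNet to dynamically generate input-aware triggers, realizing the first multi-target backdoor attack on VLM-based visual grounding; it achieves the best attack success rates in almost all settings across multiple models and benchmarks without degrading normal performance.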

Comments Accepted by CVPR 2026; Code is at https://github.com/lijunxian111/IAG

Journal ref
https://openaccess.thecvf.com/content/CVPR2026/papers/Li_IAG_Input-aware_Backdoor_Attack_on_VLM-based_Visual_Grounding_CVPR_2026_paper.pdf
Abstract

Recent advances in vision-language models (VLMs) have significantly enhanced the visual grounding task, which involves locating objects in an image based on natural language queries. Despite these advancements, the security of VLM-based grounding systems has not been thoroughly investigated. This paper reveals a novel and realistic vulnerability: the first multi-target backdoor attack on VLM-based visual grounding. Unlike prior attacks that rely on static triggers or fixed targets, we propose IAG, a method that dynamically generates input-aware, text-guided triggers conditioned on any specified target object description to execute the attack. This is achieved through a text-conditioned UNet that embeds imperceptible target semantic cues into visual inputs while preserving normal grounding performance on benign samples. We further develop a joint training objective that balances language capability with perceptual reconstruction to ensure imperceptibility, effectiveness, and stealth. Extensive experiments on multiple VLMs (e.g., LLaVA, InternVL, Ferret) and benchmarks (RefCOCO, RefCOCO+, RefCOCOg, Flickr30k Entities, and ShowUI) demonstrate that IAG achieves the best ASRs compared with other baselines on almost all settings without compromising clean accuracy, maintaining robustness against existing defenses, and exhibiting transferability across datasets and models. These findings underscore critical security risks in grounding-capable VLMs and highlight the need for further research on trustworthy multimodal understanding.

2510.19496 2026-06-02 cs.CV cs.AI cs.LG

CARES: Context-Aware Resolution Selector for VLMs

Moshe Kimhi, Nimrod Shabtay, Raja Giryes, Chaim Baskin, Eli Schwartz

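Affiliations: Technion; IBM Research; Tel-Aviv University; Ben-Gurion University

AI summary: Proposes CARES, a lightweight preprocessing module that uses a compact VLM to predict the minimal sufficient resolution for each image-query pair, cutting compute by up to 80% while preserving task performance.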

Comments Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Accepted to ACL 2026 (oral presentation). Code available at https://github.com/mkimhi/CARES

Abstract

Large vision-language models (VLMs) commonly process images at native or high resolution to remain effective across tasks. This often inflates visual tokens to 97-99% of total tokens, resulting in high compute and latency, even when low-resolution images would suffice. We introduce CARES, a Context-Aware Resolution Selector: a lightweight preprocessing module that, given an image-query pair, predicts the minimal sufficient input resolution. CARES uses a compact VLM (350M) to extract features and predict when a target pretrained VLM's response converges to its peak ability to answer correctly. Though trained as a discrete classifier over a set of optional resolutions, CARES interpolates continuous resolutions at inference for fine-grained control. Across five multimodal benchmarks spanning documents and natural images, as well as diverse target VLMs, CARES preserves task performance while reducing compute by up to 80%.
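
The step from a discrete classifier to a continuous resolution can be illustrated with a minimal sketch. It assumes the continuous value is the probability-weighted average of the candidate resolutions; the abstract does not state the actual rule, and the function name and candidate list below are hypothetical.

```python
import math

def interpolate_resolution(logits, resolutions):
    """Softmax the classifier logits and return the expected resolution.

    Assumption: interpolation = probability-weighted mean of the candidates.
    """
    if len(logits) != len(resolutions):
        raise ValueError("one logit per candidate resolution is required")
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return sum(w / total * r for w, r in zip(weights, resolutions))

# Hypothetical candidate resolutions (pixels on the long side).
candidates = [224, 448, 896]
uniform_choice = interpolate_resolution([0.0, 0.0, 0.0], candidates)
confident_choice = interpolate_resolution([-100.0, -100.0, 100.0], candidates)
```

With uniform logits the result is the plain mean of the candidates; with one dominant logit it collapses to that candidate, so a single output covers both the discrete and the in-between cases.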

2603.19453 2026-06-02 cs.CL cs.GT

Beyond Scalar Rewards: Dense Feedback for LLM Policy Synthesis in Sequential Social Dilemmas

Víctor Gallego

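Affiliations: Komorebi AI Technologies

AI summary: Studies feedback engineering by comparing sparse reward feedback with dense social-metric feedback when LLMs generate programmatic policies for multi-agent environments, finding that dense feedback breaks reward aliasing and improves policy performance.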

Comments Accepted to NExT-Game 2026: New Frontiers in Game-Theoretic Learning, ICML 2026 Workshop

Abstract

We study LLM policy synthesis: using a language model to iteratively generate programmatic agent policies for multi-agent environments. Rather than training neural policies via reinforcement learning, our framework prompts an LLM to produce Python policy functions, evaluates them in self-play, and refines them using performance feedback across iterations. We investigate feedback engineering (the design of what evaluation information is shown to the LLM during refinement) comparing sparse feedback (scalar reward only) against dense feedback (reward plus social metrics: efficiency, equality, sustainability, peace). Across two canonical Sequential Social Dilemmas (Gathering and Cleanup) and two frontier LLMs (Claude Sonnet 4.6, Gemini 3.1 Pro), dense feedback consistently matches or exceeds sparse feedback on all metrics. We explain the asymmetry through feedback aliasing: when scalar reward alone maps distinct failure modes to the same value (e.g., under- vs. over-cleaning), social metrics break the alias and let the LLM diagnose which corrective direction to take. Social metrics thus function as a coordination signal rather than a distraction, yielding strategies such as Voronoi territory partitioning and waste-adaptive cleaner schedules. Code at https://github.com/vicgalle/llm-policies-social-dilemmas.
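
Feedback aliasing can be shown with a toy sketch. It assumes equality is measured as one minus the Gini coefficient of per-agent rewards, a common choice that the abstract does not confirm; the reward vectors and function names are made up for illustration.

```python
def equality(rewards):
    """1 minus the Gini coefficient of per-agent rewards (1.0 = perfectly equal)."""
    n = len(rewards)
    total = sum(rewards)
    if total == 0:
        return 1.0
    pairwise = sum(abs(a - b) for a in rewards for b in rewards)
    return 1.0 - pairwise / (2 * n * total)

def sparse_feedback(rewards):
    # Scalar reward only: what the LLM sees under sparse feedback.
    return {"reward": sum(rewards)}

def dense_feedback(rewards):
    # Reward plus one social metric (assumed definition, see lead-in).
    return {"reward": sum(rewards), "equality": equality(rewards)}

# Two hypothetical self-play runs with the same total reward.
run_a = [10, 10, 10, 10]  # evenly shared
run_b = [40, 0, 0, 0]     # one agent takes everything
```

Both runs produce the same sparse feedback, so the scalar alone cannot tell them apart; the added metric separates them (1.0 versus 0.25) and indicates which behavior needs correcting.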

2603.18652 2026-06-02 cs.CV cs.AI cs.IR

Beyond String Matching: Semantic Evaluation of PDF Table Extraction

Pius Horn, Janis Keuper

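Affiliations: Institute for Machine Learning and Analytics (IMLA); Offenburg University; University of Mannheim

AI summary: Proposes an LLM-as-a-judge semantic evaluation framework, built on synthetic PDFs and validated against human judgments, that correlates substantially better with humans than existing rule-based metrics (TEDS, GriTS), and uses it to evaluate 21 PDF parsers.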

Comments Submitted to BMVC 2026

Abstract

Reliably extracting tables from PDFs is essential for large-scale scientific data mining and knowledge base construction, yet existing evaluation approaches rely on rule-based metrics that fail to capture semantic equivalence of table content. We present a benchmarking framework based on synthetically generated PDFs with precise LaTeX ground truth, using tables sourced from arXiv to ensure realistic complexity and diversity. As our central methodological contribution, we apply LLM-as-a-judge for semantic table evaluation, integrated into a matching pipeline that accommodates inconsistencies in parser outputs. Through a human validation study comprising over 1,500 quality judgments on extracted table pairs, we show that LLM-based evaluation achieves substantially higher correlation with human judgment (Pearson r=0.93) compared to currently used Tree Edit Distance-based Similarity (TEDS, r=0.68) and Grid Table Similarity (GriTS, r=0.70). Evaluating 21 contemporary PDF parsers across 100 synthetic documents containing 451 tables reveals significant performance disparities. Our results offer practical guidance for selecting parsers for tabular data extraction and establish a reproducible, scalable evaluation methodology for this critical task. Code and data: https://github.com/phorn1/pdf-parse-bench Metric study and human evaluation: https://github.com/phorn1/table-metric-study
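
The correlation figures quoted above are Pearson coefficients between metric scores and human ratings. A minimal sketch of that computation follows; the function name and the sample scores are illustrative and not taken from the study.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)

# Made-up ratings: a metric that tracks humans exactly, and one that inverts them.
human = [1.0, 2.0, 3.0]
aligned_metric = [2.0, 4.0, 6.0]
inverted_metric = [3.0, 2.0, 1.0]
```

A metric that rises and falls with human judgment scores close to 1, while one that moves in the opposite direction scores close to -1.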

2502.15411 2026-06-02 cs.CL cs.AI

HiFi-KPI: A Dataset for Hierarchical KPI Extraction from Earnings Filings

Rasmus Aavang, Giovanni Rizzi, Rasmus Bøggild, Alexandre Iolov, Mike Zhang, Johannes Bjerva

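Affiliations: Department of Computer Science, Aalborg University; ALIPES ApS; University of Copenhagen; Pioneer Centre for AI

AI summary: To address the poor cross-company transferability of tagged key performance indicators (KPIs) in earnings filings, introduces the HiFi-KPI dataset of 1.65M paragraphs and 198k hierarchical labels, and evaluates three tasks: classification, extraction, and structured extraction.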

Abstract

Accurate tagging of earnings reports can yield significant short-term returns for stakeholders. The machine-readable inline eXtensible Business Reporting Language (iXBRL) is mandated for public financial filings. Yet, its complex, fine-grained taxonomy limits the cross-company transferability of tagged Key Performance Indicators (KPIs). To address this, we introduce the Hierarchical Financial Key Performance Indicator (HiFi-KPI) dataset, a large-scale corpus of 1.65M paragraphs and 198k unique, hierarchically organized labels linked to iXBRL taxonomies. HiFi-KPI supports multiple tasks and we evaluate three: KPI classification, KPI extraction, and structured KPI extraction. For rapid evaluation, we also release HiFi-KPI-Lite, a manually curated 8K paragraph subset. Baselines on HiFi-KPI-Lite show that encoder-based models achieve over 0.906 macro-F1 on classification, while Large Language Models (LLMs) reach 0.440 F1 on structured extraction. Finally, a qualitative analysis reveals that extraction errors primarily relate to dates. We open-source all code and data at https://github.com/aaunlp/HiFi-KPI.
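
Macro-F1, the classification metric reported above, averages per-label F1 scores so that rare labels count as much as frequent ones. A minimal sketch follows; the label names are hypothetical and not drawn from the dataset.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical KPI labels for four paragraphs.
gold = ["Revenue", "Revenue", "NetIncome", "Assets"]
pred = ["Revenue", "NetIncome", "NetIncome", "Assets"]
score = macro_f1(gold, pred)
```

Here the per-label F1 values are 2/3, 2/3, and 1, so one misclassified paragraph lowers two labels at once.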

2208.00335 2026-06-02 cs.LG

Rule Extraction in Machine Learning: Chat Incremental Pattern Constructor

Caleb Princewill Nwokocha

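Affiliations: Caleb Princewill Nwokocha

AI summary: Presents ChatIPC, a system that incrementally learns ordered token-transition rules from text and builds responses via definition-based expansion and similarity-guided candidate selection, yielding interpretable rule extraction.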

Comments 10 pages

Abstract

Rule extraction is a central problem in interpretable machine learning because it seeks to convert opaque predictive behavior into human-readable symbolic structure. This paper presents Chat Incremental Pattern Constructor (ChatIPC), a lightweight incremental symbolic learning system that extracts ordered token-transition rules from text, enriches them with definition-based expansion, and constructs responses by similarity-guided candidate selection. The system may be viewed as a rule extractor operating over a token graph rather than a conventional classifier. I formalize the knowledge base, definition expansion, candidate scoring, repetition control, English-rule heuristics, and response construction mechanisms used by ChatIPC. I further situate the method within the literature on rule extraction, decision tree induction, association rules, interpretable machine learning, and sequence construction. The updated implementation is also reviewed in detail: it parses an embedded dictionary, normalizes lexical keys, caches definition tokens and part-of-speech tags, computes Jaccard scores on bitsets, applies heuristic linguistic bonuses, and persists the knowledge base with a versioned binary format. The paper emphasizes mathematical formulation and algorithmic clarity, and it provides pseudocode for the learning, scoring, and construction algorithms.
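
Computing Jaccard scores on bitsets can be sketched as follows. Python integers stand in for the bitsets here; the vocabulary indexing and helper names are assumptions for illustration and do not reproduce ChatIPC's implementation.

```python
def to_bitset(tokens, vocab):
    """Encode a token collection as an integer with one bit per vocabulary index."""
    bits = 0
    for token in tokens:
        bits |= 1 << vocab[token]
    return bits

def popcount(bits):
    # Count set bits; bin() keeps this compatible with older Python versions.
    return bin(bits).count("1")

def jaccard(a, b):
    """|A and B| / |A or B| on bitsets; 0.0 when both sets are empty."""
    union = a | b
    if union == 0:
        return 0.0
    return popcount(a & b) / popcount(union)

# Hypothetical vocabulary and definition-token sets.
vocab = {"cat": 0, "sat": 1, "mat": 2, "dog": 3}
first = to_bitset(["cat", "sat", "mat"], vocab)
second = to_bitset(["cat", "mat", "dog"], vocab)
similarity = jaccard(first, second)
```

The two sets share two of four distinct tokens, so the similarity is 0.5; intersection and union each reduce to a single bitwise operation.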

2603.18016 2026-06-02 cs.CL cs.AI cs.DC cs.LG

MineDraft: A Framework for Batch Parallel Speculative Decoding

Zhenwei Tang, Arun Verma, Zijian Zhou, Zhaoxuan Wu, Alok Prakash, Daniela Rus, Bryan Kian Hsiang Low

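Affiliations: Massachusetts Institute of Technology; Toyota Research Institute; Toyota Motor Corporation

AI summary: Proposes MineDraft, a framework whose batch-parallel design overlaps drafting with verification, substantially improving the throughput and end-to-end latency of speculative decoding.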

Comments Accepted at ICML 2026

Abstract

Speculative decoding (SD) accelerates large language model inference by using a smaller draft model to propose draft tokens that are subsequently verified by a larger target model. However, the performance of standard SD is often limited by the strictly sequential execution of these drafting and verification stages. To address this, this paper proposes MineDraft, a batch parallel speculative decoding (PSD) framework designed to effectively hide drafting latency by overlapping it with verification. Our theoretical analysis shows that PSD is substantially more efficient than standard SD. MineDraft realizes the PSD through a novel batch-parallel design that maintains two batches of requests, overlapping drafting for one batch with verification for the other. Our experimental results show significant improvements of MineDraft in both throughput (up to 75%) and end-to-end latency (up to 39%) over standard SD. Furthermore, we have implemented MineDraft as a plugin for vLLM, demonstrating its practicality for production-ready inference systems.
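
Why overlapping two batches hides drafting latency can be seen in an idealized timing model. This is a sketch, not the paper's analysis: it assumes drafting and verification run fully concurrently without interfering, and the stage times are arbitrary units.

```python
def sequential_time(t_draft, t_verify, rounds):
    """Standard SD: every round pays for drafting and then verification."""
    return rounds * (t_draft + t_verify)

def overlapped_time(t_draft, t_verify, rounds):
    """Two alternating batches: one drafts while the other is verified.

    Only the first draft and the last verification are exposed; each round
    in between costs the slower of the two stages (idealized assumption).
    """
    if rounds < 1:
        return 0
    return t_draft + (rounds - 1) * max(t_draft, t_verify) + t_verify

# Hypothetical stage times: drafting 1 unit, verification 3 units, 4 rounds
# in total (counted across both batches).
baseline = sequential_time(1, 3, 4)
pipelined = overlapped_time(1, 3, 4)
```

In this toy setting the pipelined schedule takes 13 units against 16 for the sequential one; the gain grows as drafting time approaches verification time.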

2603.16142 2026-06-02 cs.CL

Parametric Social Identity Injection and Diversification in Public Opinion Simulation

Hexi Wang, Yujia Zhou, Bangde Du, Qingyao Ai, Yiqun Liu

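Affiliations: DCST, Tsinghua University; Quancheng Laboratory; Tsinghua University; Shandong, China

AI summary: To address the lack of social diversity when large language models simulate public opinion, proposes the Parametric Social Identity Injection (PSII) framework, which injects explicit parametric representations of demographic attributes and value orientations into intermediate hidden states, significantly improving distributional fidelity and diversity.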

Comments Accepted to KDD 2026 Research Track. Project page: https://github.com/halsayxi/PSII

Abstract

Large language models (LLMs) have recently been adopted as synthetic agents for public opinion simulation, offering a promising alternative to costly and slow human surveys. Despite their scalability, current LLM-based simulation methods fail to capture social diversity, producing flattened inter-group differences and overly homogeneous responses across demographic groups. We identify this limitation as a Diversity Collapse phenomenon in LLM hidden representations, where distinct social identities become increasingly indistinguishable across layers. Motivated by this observation, we propose Parametric Social Identity Injection (PSII), a general framework that injects explicit, parametric representations of demographic attributes and value orientations directly into intermediate hidden states of LLMs. Unlike prompt-based persona conditioning, PSII enables fine-grained and controllable identity modulation at the representation level. Extensive experiments on the World Values Survey using multiple open-source LLMs show that PSII significantly improves distributional fidelity and diversity, reducing KL divergence to real-world survey data while enhancing overall diversity. This work provides new insights into representation-level control of LLM agents and advances scalable, diversity-aware public opinion simulation.

2603.14771 2026-06-02 cs.AI

OpenHospital: A Thing-in-itself Arena for Evolving and Benchmarking LLM-based Collective Intelligence

Peigen Liu, Rui Ding, Yuren Mao, Ziyan Jiang, Yuxiang Ye, Yunjun Gao, Ying Zhang, Renjie Sun, Longbin Lai, Zhengping Qian

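Affiliations: School of Software Technology, Zhejiang University; Laboratory for Statistical Monitoring and Intelligent Governance of Common Prosperity, Zhejiang Gongshang University; Tongyi Lab, Alibaba Group; Alibaba Cloud

AI summary: Introduces OpenHospital, an interactive arena in which physician agents evolve collective intelligence through interactions with patient agents; it adopts a data-in-agent-self paradigm to rapidly enhance agent capabilities and provides benchmarks for medical proficiency and system efficiency.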

Abstract

Large Language Model (LLM)-based Collective Intelligence (CI) presents a promising approach to overcoming the data wall and continuously boosting the capabilities of LLM agents. However, there is currently no dedicated arena for evolving and benchmarking LLM-based CI. To address this gap, we introduce OpenHospital, an interactive arena where physician agents can evolve CI through interactions with patient agents. This arena employs a data-in-agent-self paradigm that rapidly enhances agent capabilities and provides robust evaluation metrics for benchmarking both medical proficiency and system efficiency. Experiments demonstrate the effectiveness of OpenHospital in both fostering and quantifying CI.

2504.05033 2026-06-02 cs.RO cs.CV

CloSE: A Geometric Shape-Agnostic Cloth State Representation

Jay Kamat, Júlia Borràs, Carme Torras

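Affiliations: Institut de Robòtica i Informàtica Industrial, CSIC-UPC

AI summary: Proposes the dGLI disk representation based on topological indices and abstracts from it the compact, continuous CloSE representation, which predicts cloth fold locations and supports semantic labeling and planning.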

Comments Accepted at ICRA 2026 (8 pages, 11 figures, 1 table). Project page: https://close-representation.github.io/

Abstract

Cloth manipulation is a difficult problem mainly because of the non-rigid nature of cloth, which makes a good representation of deformation essential. We present a new representation for the deformation-state of clothes. First, we propose the dGLI disk representation based on topological indices computed for edge segments of the cloth border that are arranged on a circular grid. The heat-map of the dGLI disk uncovers patterns that correspond to features of the cloth state that are consistent for different shapes, sizes or orientation of the cloth. We then abstract these important features from the dGLI disk into a circle, calling it the Cloth StatE representation (CloSE). This representation is compact, continuous, and general for different shapes. We show that this representation is able to accurately predict the fold locations for several simulation clothing datasets. Finally, we also show the strengths of this representation in two relevant applications: semantic labeling and high- and low-level planning. The code and the dataset can be accessed from: https://close-representation.github.io/

2603.14465 2026-06-02 cs.AI

AgentProcessBench: Diagnosing Step-Level Process Quality in Tool-Using Agents

Shengda Fan, Xuyan Ye, Yupeng Huo, Zhi-Yuan Chen, Yiju Guo, Shenzhi Yang, Wenkai Yang, Shuqi Ye, Jingwen Chen, Haotian Chen, Xin Cong, Yankai Lin

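Affiliations: Renmin University of China; Beijing Jiaotong University; Shanghai Jiao Tong University; Tsinghua University

AI summary: Introduces the AgentProcessBench benchmark, which uses a ternary labeling scheme and an error propagation rule to evaluate step-level effectiveness in tool-augmented trajectories, reveals how hard it is for current models to distinguish neutral from erroneous actions, and shows that process signals complement outcome supervision.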

Abstract

While Large Language Models (LLMs) have evolved into tool-using agents, they remain brittle in long-horizon interactions. Unlike mathematical reasoning where errors are often rectifiable via backtracking, tool-use failures frequently induce irreversible side effects, making accurate step-level verification critical. However, existing process-level benchmarks are predominantly confined to closed-world mathematical domains, failing to capture the dynamic and open-ended nature of tool execution. To bridge this gap, we introduce AgentProcessBench, the first benchmark dedicated to evaluating step-level effectiveness in realistic, tool-augmented trajectories. The benchmark comprises 1,000 diverse trajectories and 8,509 human-labeled step annotations with 89.1% inter-annotator agreement. It features a ternary labeling scheme to capture exploration and an error propagation rule to reduce labeling ambiguity. Extensive experiments reveal key insights: (1) weaker policy models exhibit inflated ratios of correct steps due to early termination; (2) distinguishing neutral and erroneous actions remains a significant challenge for current models; and (3) process-derived signals provide complementary value to outcome supervision, significantly enhancing test-time scaling. We hope AgentProcessBench can foster future research in reward models and pave the way toward general agents. The code and data are available at https://github.com/RUCBM/AgentProcessBench.

2603.14405 2026-06-02 cs.LG cs.AI

ES-Merging: Biological MLLM Merging via Embedding Space Signals

Wonbin Lee, Dongki Kim, Sung Ju Hwang

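Affiliations: KAIST; DeepAuto.ai

AI summary: Proposes ES-Merging, a framework that estimates merging coefficients from embedding space signals to merge biological multimodal large language models efficiently, improving both cross-modal reasoning and single-modal knowledge preservation.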

Abstract

Biological multimodal large language models (MLLMs) have emerged as powerful foundation models for scientific discovery. However, existing models are specialized to a single modality, limiting their ability to solve inherently cross-modal scientific problems. While model merging is an efficient method to combine the different modalities into a unified MLLM, existing methods rely on input-agnostic parameter space heuristics that fail to faithfully capture modality specialization. To overcome this limitation, we propose the Embedding-Signal-based MLLM Merging (ES-Merging), a framework that estimates merging coefficients from embedding space signals, moving the merging paradigm from the parameter signals to the embedding signals. ES-Merging exploits coarse-grained and fine-grained signals from embedding space to estimate the layer-wise and element-wise merging coefficients, respectively, which are jointly combined for complementary coefficient estimation. Through extensive experiments, we demonstrate that ES-Merging outperforms existing merging methods not only on the cross-modal reasoning but also on the single-modal knowledge preservation, establishing that embedding space signals provide a principled and effective foundation for MLLM merging.

2603.14010 2026-06-02 cs.RO

URDF-Anything+: End-to-End Generation for Simulation-Ready Articulated Assets

Zhuangzhe Wu, Yue Xin, Chengkai Hou, Minghao Chen, Yaoxu Lyu, Jieyu Zhang, Shanghang Zhang

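Affiliations: Peking University; Visual Geometry Group, University of Oxford; University of Washington

AI summary: Presents URDF-Anything+, an end-to-end autoregressive diffusion framework that generates simulation-ready URDF models directly from a single RGB image, jointly modeling part geometry and articulation.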

Abstract

Articulated objects are fundamental for robotics, simulation of physics, and interactive virtual environments. However, recovering them from visual observations is inherently challenging, as images provide only partial and ambiguous cues about both part geometry and their underlying kinematic structure. Existing approaches typically rely on multi-stage pipelines, retrieval from asset libraries, or explicit part segmentation. We present URDF-Anything+, an end-to-end autoregressive diffusion framework that generates simulation-ready URDF models directly from a single RGB image. Conditioned on visual observations and object geometry, URDF-Anything+ operates in a structured latent space and jointly models part geometry and articulation in a unified generation process. Specifically, the model sequentially predicts each articulated part together with its associated joint parameters, while a termination token dynamically determines the number of parts. This design enables direct generation of fully executable URDFs without external retrieval or post-processing stages. Experiments on large-scale articulated object benchmarks demonstrate that URDF-Anything+ outperforms prior methods in geometric reconstruction quality, joint parameter estimation, and physical executability, while being substantially more efficient than existing multi-stage approaches. Furthermore, the generated URDFs serve as faithful digital twins, enabling the zero-shot transfer of manipulation policies trained purely in simulation.

2603.04256 2026-06-02 cs.CV

A Hypertoroidal Covering for Perfect Color Equivariance

Yulong Yang, Zhikun Xu, Yaojun Li, Christine Allen-Blanchette

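AI summary: Builds a truly equivariant color-equivariant architecture by lifting interval-valued quantities to a double cover on the circle, removing the artifacts that earlier methods incur by approximating saturation and luminance as 1D translations, and improving performance on tasks such as fine-grained classification and medical imaging.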

Comments Accepted to the 43rd International Conference on Machine Learning (ICML 2026)

Abstract

When the color distribution of input images changes at inference, the performance of conventional neural network architectures drops considerably. A few researchers have begun to incorporate prior knowledge of color geometry in neural network design. These color equivariant architectures have modeled hue variation with 2D rotations, and saturation and luminance transformations as 1D translations. While this approach improves neural network robustness to color variations in a number of contexts, we find that approximating saturation and luminance (interval valued quantities) as 1D translations introduces appreciable artifacts. In this paper, we introduce a color equivariant architecture that is truly equivariant. Instead of approximating the interval with the real line, we lift values on the interval to values on the circle (a double-cover) and build equivariant representations there. Our approach resolves the approximation artifacts of previous methods, improves interpretability and generalizability, and achieves better predictive performance than conventional and equivariant baselines on tasks such as fine-grained classification and medical imaging tasks. Going beyond the context of color, we show that our proposed lifting can also extend to geometric transformations such as scale.

2505.21806 2026-06-02 cs.LG

Towards Operational Automated Greenhouse Gas Plume Detection and Delineation

Brian D. Bue, Jake H. Lee, Andrew K. Thorpe, Philip G. Brodrick, Daniel Cusworth, Alana Ayasse, Vassiliki Mancoridis, Anagha Satish, Shujun Xiong, Riley Duren

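Affiliations: University of California, San Diego; NASA Jet Propulsion Laboratory

AI summary: For fine spatial resolution imaging spectrometers, uses convolutional neural networks and multitask learning to address obstacles in data quality, spatiotemporal bias, and alignment of modeling objectives, achieving operational greenhouse gas plume detection and segmentation.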

Comments Main 19 pages 14 figures. Supplemental 19 pages 16 figures. In review

Journal ref
Remote Sensing of Environment 343 (2026) 115506
Abstract

Operational deployment of a fully automated facility-scale greenhouse gas (GHG) plume detection system remains challenging for fine spatial resolution imaging spectrometers, despite recent advances in deep learning approaches. With the dramatic increase in data availability, however, automation continues to increase in importance for emissions monitoring. This work reviews and addresses several key obstacles in the field: data and label quality control, prevention of spatiotemporal biases, and correctly aligned modeling objectives. We demonstrate through rigorous experiments using multicampaign data from airborne and spaceborne instruments that convolutional neural networks (CNNs) are able to achieve operational detection performance when these obstacles are alleviated. We demonstrate that a multitask model that learns both instance detection and pixelwise segmentation simultaneously can successfully lead towards an operational pathway. We evaluate the model's plume detectability across emission source types and regions, identifying thresholds for operational deployment. Finally, we provide analysis-ready data, models, and source code for reproducibility, and work to define a set of best practices and validation standards to facilitate future contributions to the field.

2603.12996 2026-06-02 cs.LG

DAPD: Dependency-Aware Parallel Decoding via Attention for Diffusion LLMs

DAPD: 面向扩散LLM的基于注意力的依赖感知并行解码

Bumjun Kim, Dongjae Jeon, Moongyu Jeon, Albert No

发表机构 * KAIST(韩国科学技术院)

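AI summary Proposes DAPD, a training-free parallel decoding method that uses self-attention to build a dependency graph over masked tokens and decodes an independent set in parallel, which avoids updating strongly coupled tokens at the same time and improves the accuracy-steps trade-off of diffusion LLMs.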

Comments Accepted at ICML 2026

详情
AI中文摘要

扩散LLM(dLLM)的并行解码很困难,因为每个去噪步骤仅提供逐标记的边缘分布,而同时解掩多个标记需要考虑标记间的依赖关系。我们提出依赖感知并行解码(DAPD),一种简单、无需训练的解码方法,它使用自注意力在掩码标记上诱导条件依赖图。在每次迭代中,图中的边捕捉强标记交互,而非边表示弱依赖。然后,并行解码简化为在图上选择一个独立集并并行解掩所选标记。这避免了同时更新强耦合标记,无需辅助模型或重新训练。在LLaDA和Dream上的实验表明,DAPD改进了现有方法的精度-步数权衡,并实现了更全局分布的并行更新,更好地利用了dLLM的任意顺序生成能力。项目地址:https://ai-isl.github.io/dapd

英文摘要

Parallel decoding for Diffusion LLMs (dLLMs) is difficult because each denoising step provides only token-wise marginal distributions, while unmasking multiple tokens simultaneously requires accounting for inter-token dependencies. We propose Dependency-Aware Parallel Decoding (DAPD), a simple, training-free decoding method that uses self-attention to induce a conditional dependency graph over masked tokens. At each iteration, edges in this graph capture strong token interactions, while non-edges indicate weak dependence. Parallel decoding is then reduced to selecting an independent set on the graph and unmasking the selected tokens in parallel. This avoids co-updating strongly coupled tokens without auxiliary models or retraining. Experiments on LLaDA and Dream show that DAPD improves the accuracy-steps trade-off over existing methods and enables more globally distributed parallel updates that better exploit the any-order generation capability of dLLMs. The project is available at https://ai-isl.github.io/dapd
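The independent-set step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how the attention matrix is aggregated or how the set is chosen, so the threshold `tau`, the symmetrization by `max`, the greedy confidence-ordered selection, and every name below are assumptions.

```python
def select_independent_set(attn, confidence, masked, tau):
    """Greedily pick masked positions that are pairwise weakly coupled.

    attn[i][j]    : attention weight from position i to position j
                    (assumed already averaged over heads and layers)
    confidence[i] : model confidence in its top prediction at position i
    masked        : positions that are still masked
    tau           : hypothetical threshold above which two positions share an edge
    """
    def coupled(i, j):
        # Treat the graph as undirected: an edge exists if either direction is strong.
        return max(attn[i][j], attn[j][i]) > tau

    chosen = []
    # Visit the most confident positions first.
    for i in sorted(masked, key=lambda p: confidence[p], reverse=True):
        # Keep a position only if it has no edge to anything already chosen.
        if all(not coupled(i, j) for j in chosen):
            chosen.append(i)
    return chosen


# Toy example: positions 0-1 and 2-3 attend strongly to each other.
attn = [
    [0.0, 0.6, 0.1, 0.0],
    [0.5, 0.0, 0.1, 0.1],
    [0.1, 0.1, 0.0, 0.7],
    [0.0, 0.1, 0.6, 0.0],
]
confidence = [0.9, 0.8, 0.4, 0.7]
picked = select_independent_set(attn, confidence, masked=[0, 1, 2, 3], tau=0.3)
print(picked)
```

In the toy example, one position from each strongly coupled pair is unmasked in the same step and its partner is left for a later iteration.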

2603.00171 2026-06-02 cs.CV cs.AI

LookWise: Knowing When and Where to Look for Fine-Grained Visual Reasoning in Multimodal Large Language Models

LookWise: 知道何时何地关注多模态大语言模型中的细粒度视觉推理

Yuxiang Shen, Hailong Huang, Zhenkun Gao, Xueheng Li, Man Zhou, Chengjun Xie, Haoxuan Che, Xuanhua He, Jie Zhang

发表机构 * Institute of Intelligent Machines, Hefei Institutes of Physical Science, Chinese Academy of Sciences(智能机器研究所,合肥物理科学研究院,中国科学院) University of Science and Technology of China(中国科学技术大学) Zhejiang University(浙江大学) East China Normal University(华东师范大学) The Hong Kong University of Science and Technology(香港科技大学)

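AI summary Proposes the LookWise framework, which performs adaptive visual reasoning through a confidence-based module and a semantic-guided localization module, improving fine-grained reasoning accuracy and speeding up inference without additional training.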

详情
AI中文摘要

多模态大语言模型正转向通过主动探索图像细节进行“图像思考”。虽然有效,但大规模训练计算成本高昂,这激发了对轻量级、无需训练解决方案的兴趣。然而,现有无需训练方法存在两个缺陷:无差别裁剪导致的感知冗余,增加了计算成本并引入噪声;以及语义意图与空间注意力之间的漂移,阻碍了用户关注区域的准确定位。为应对这些挑战,我们提出LookWise,一个自适应视觉推理框架。LookWise遵循两阶段流程:基于置信度的模块决定何时更仔细地观察,语义引导的定位模块确定观察位置。该设计使MLLM能够自适应获取细粒度视觉证据而无需额外训练。在细粒度和高分辨率视觉推理基准上的实验表明,LookWise在强基线上持续提升准确率,同时相较于基于搜索的方法ZoomEye实现约$4.0\times$的推理加速,展现出稳健的跨模型泛化能力。

英文摘要

Multimodal Large Language Models (MLLMs) are shifting towards "Thinking with Images" by actively exploring image details. While effective, large-scale training is computationally expensive, which has spurred growing interest in lightweight, training-free solutions. However, existing training-free methods suffer from two flaws: perceptual redundancy from indiscriminate cropping, which increases computational cost and introduces noise; and a drift between semantic intent and spatial attention, which prevents accurate localization of user-focused regions. To address these challenges, we propose LookWise, a framework for adaptive visual reasoning. LookWise follows a two-stage pipeline: a confidence-based module decides when to look more carefully, and a semantic-guided localization module determines where to look. This design enables MLLMs to adaptively acquire fine-grained visual evidence without additional training. Experiments on fine-grained and high-resolution visual reasoning benchmarks show that LookWise consistently improves accuracy over strong baselines while achieving an approximately $4.0\times$ inference speedup over the search-based method ZoomEye, demonstrating robust cross-model generalization.

2603.12109 2026-06-02 cs.AI

On Information Self-Locking in Reinforcement Learning for Active Reasoning of LLM agents

论LLM智能体主动推理强化学习中的信息自锁

Deyu Zou, Yongqiang Chen, Fan Feng, Mufei Li, Pan Li, Yu Gong, James Cheng

发表机构 * University of Science and Technology of China(中国科学技术大学)

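AI summary Addresses information self-locking, a failure mode that reinforcement learning induces in LLM agents during active reasoning, and proposes AREW, an advantage reweighting method that uses directional critiques to reallocate credit within trajectories and significantly improves agent performance.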

Comments Accepted by ICML 2026

详情
AI中文摘要

强化学习已成为构建基于LLM的智能体的事实标准范式,这些智能体能够在扩展的任务范围内行动、交互和推理。然而,在主动推理中,智能体必须通过与环境的交互来获取新观察以解决问题,我们发现基于结果的强化学习会诱发一种系统性的失败模式,我们称之为信息自锁(SeL):智能体既无法获取信息性反馈,也无法内化已获得的证据。为了理解这个问题,我们将智能体行为追踪为两种耦合的能力:动作选择(AS),它决定观察流;信念追踪(BT),它更新智能体内部的任务理解。理论和实证分析揭示了一个导致SeL的双向瓶颈:弱的BT模糊了信息性动作的信用,而弱的AS剥夺了BT有用的证据。这种耦合削弱了两种能力的学习信号,导致SeL。为了缓解这个问题,我们提出了AREW,一种简单而有效的优势重加权方法,它使用易于获取的方向性批评来重新分配轨迹内的信用。在9个不同复杂度的智能体任务上的大量实验表明,AREW显著缓解了SeL,最终性能提升高达60个点。代码可在https://github.com/unimpor/T3获取。

英文摘要

Reinforcement learning (RL) has become a de facto paradigm for building LLM-based agents that act, interact, and reason over extended task horizons. However, in active reasoning where agents must elicit new observations through interaction with the environment to solve the task, we find that outcome-based RL can induce a systematic failure mode which we call information self-locking (SeL): agents fail both to elicit informative feedback and to internalize obtained evidence. To understand the issue, we trace agentic behaviors into two coupled capabilities: Action Selection (AS), which determines observation streams, and Belief Tracking (BT), which updates the agent's internal task understanding. Theoretical and empirical analyses reveal a bidirectional bottleneck that leads to SeL: weak BT obscures the credit of informative actions, while weak AS deprives BT of useful evidence. This coupling weakens the learning signal for both capabilities and leads to SeL. To mitigate this issue, we propose AREW, a simple yet effective Advantage Reweighting method that uses easy-to-obtain directional critiques to reallocate credit within trajectories. Extensive experiments across 9 agentic tasks of varying complexity show that AREW significantly mitigates SeL, yielding up to 60-point gains in final performance. Code is available at https://github.com/unimpor/T3.

2603.12037 2026-06-02 cs.LG

Frequentist Consistency of Prior-Data Fitted Networks for Causal Inference

用于因果推断的先验数据拟合网络的频率派一致性

Valentyn Melnychuk, Vahid Balazadeh, Stefan Feuerriegel, Rahul G. Krishnan


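AI summary Analyzes the frequentist consistency of average treatment effect (ATE) estimators based on prior-data fitted networks (PFNs), finds that they can exhibit prior-induced confounding bias, and proposes a calibration procedure based on a one-step posterior correction (OSPC) that uses martingale posteriors to recover functional nuisance posteriors, restoring frequentist consistency and yielding a semi-parametric Bernstein-von Mises theorem.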

详情
Journal ref
Proceedings of the 43-rd International Conference on Machine Learning, Seoul, South Korea, PMLR 306, 2026
AI中文摘要

基于先验数据拟合网络(PFN)的基础模型通过将因果推断任务构建为上下文学习问题,在因果推断中展现出强大的实证性能。然而,目前尚不清楚基于PFN的因果估计量是否提供与经典频率派估计量一致的不确定性量化。在这项工作中,我们通过分析基于PFN的平均处理效应(ATE)估计量的频率派一致性来填补这一空白。(1)我们表明,现有的PFN在解释为贝叶斯ATE估计量时,可能表现出先验诱导的混淆偏差:先验不会被数据渐近覆盖,这反过来阻碍了频率派一致性。(2)作为补救措施,我们建议采用基于一步后验校正(OSPC)的校准程序。我们证明OSPC有助于恢复频率派一致性,并能为校准后的PFN导出半参数Bernstein-von Mises定理(即,随着数据规模增大,校准后的基于PFN的估计量和经典半参数有效估计量在分布上收敛)。(3)最后,我们通过在PFN之上定制鞅后验来实现OSPC。通过这种方式,我们能够从PFN中恢复OSPC所需的功能干扰后验。在多个(半)合成实验中,使用我们的鞅后验OSPC校准的PFN产生的ATE不确定性(i)渐近匹配频率派不确定性,并且(ii)与其他贝叶斯ATE估计量相比,在有限样本中校准良好。

英文摘要

Foundation models based on prior-data fitted networks (PFNs) have shown strong empirical performance in causal inference by framing the task as an in-context learning problem. However, it is unclear whether PFN-based causal estimators provide uncertainty quantification that is consistent with classical frequentist estimators. In this work, we address this gap by analyzing the frequentist consistency of PFN-based estimators for the average treatment effect (ATE). (1) We show that existing PFNs, when interpreted as Bayesian ATE estimators, can exhibit prior-induced confounding bias: the prior is not asymptotically overwritten by data, which, in turn, prevents frequentist consistency. (2) As a remedy, we suggest employing a calibration procedure based on a one-step posterior correction (OSPC). We show that the OSPC helps to restore frequentist consistency and can yield a semi-parametric Bernstein-von Mises theorem for calibrated PFNs (i.e., both the calibrated PFN-based estimators and the classical semi-parametric efficient estimators converge in distribution with growing data size). (3) Finally, we implement OSPC through tailoring martingale posteriors on top of the PFNs. In this way, we are able to recover functional nuisance posteriors from PFNs, required by the OSPC. In multiple (semi-)synthetic experiments, PFNs calibrated with our martingale posterior OSPC produce ATE uncertainty that (i) asymptotically matches frequentist uncertainty and (ii) is well calibrated in finite samples in comparison to other Bayesian ATE estimators.
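For orientation, the classical frequentist one-step correction for the ATE is sketched below. It is textbook semi-parametric theory rather than the paper's OSPC, which by the abstract's description applies a correction of this kind at the posterior level. The notation ($X$ covariates, $A \in \{0,1\}$ treatment, $Y$ outcome, $\mu_a$, $\pi$) is standard and is an assumption here, since the abstract does not fix any symbols.

```latex
% Nuisance functions: \mu_a(x) = E[Y | X = x, A = a], \pi(x) = P(A = 1 | X = x),
% collected in \eta = (\mu_0, \mu_1, \pi).
\begin{align}
\psi &= \mathbb{E}\big[\mu_1(X) - \mu_0(X)\big], \\
\phi(Z;\eta,\psi) &= \mu_1(X) - \mu_0(X)
  + \frac{A}{\pi(X)}\big(Y - \mu_1(X)\big)
  - \frac{1-A}{1-\pi(X)}\big(Y - \mu_0(X)\big) - \psi, \\
\hat\psi_{\text{plug-in}} &= \frac{1}{n}\sum_{i=1}^{n}\big[\hat\mu_1(X_i) - \hat\mu_0(X_i)\big], \\
\hat\psi_{\text{one-step}} &= \hat\psi_{\text{plug-in}}
  + \frac{1}{n}\sum_{i=1}^{n}\phi\big(Z_i;\hat\eta,\hat\psi_{\text{plug-in}}\big).
\end{align}
```

Here $\phi$ is the efficient influence function of the ATE, and the one-step estimator coincides with the familiar augmented inverse-probability-weighted (AIPW) estimator. The correction term removes the first-order bias that the plug-in estimator inherits from the estimated nuisance functions.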

2603.11946 2026-06-02 cs.LG cs.AI

Geometry-Aware Probabilistic Circuits via Voronoi Tessellations

基于Voronoi剖分的几何感知概率电路

Sahil Sidheekh, Sriraam Natarajan

发表机构 * University of California, Berkeley(加州大学伯克利分校)

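AI summary Probabilistic circuits use data-independent mixture weights and so cannot capture the local geometry of the data manifold. The paper incorporates geometric structure directly into sum nodes via Voronoi tessellations, develops an approximate inference framework and a structural condition for exact inference, and introduces a differentiable relaxation for gradient-based learning, validated on density estimation tasks.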

详情
AI中文摘要

概率电路(PC)支持精确且易于处理的推理,但采用数据无关的混合权重,限制了其捕捉数据流形局部几何结构的能力。我们提出将Voronoi剖分(VT)作为将几何结构直接融入PC求和节点的自然方式。然而,直接引入这种结构会破坏可处理性。我们形式化了这种不兼容性,并开发了两种互补的解决方案:(1)一个近似推理框架,为推理提供保证的下界和上界;(2)VT的一个结构条件,在该条件下恢复精确的可处理推理。最后,我们引入了VT的可微松弛,使得基于梯度的学习成为可能,并在标准密度估计任务上实证验证了所提方法。

英文摘要

Probabilistic circuits (PCs) enable exact and tractable inference but employ data independent mixture weights that limit their ability to capture local geometry of the data manifold. We propose Voronoi tessellations (VT) as a natural way to incorporate geometric structure directly into the sum nodes of a PC. However, naïvely introducing such structure breaks tractability. We formalize this incompatibility and develop two complementary solutions: (1) an approximate inference framework that provides guaranteed lower and upper bounds for inference, and (2) a structural condition for VT under which exact tractable inference is recovered. Finally, we introduce a differentiable relaxation for VT that enables gradient-based learning and empirically validate the resulting approach on standard density estimation tasks.
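To make the idea of a Voronoi-gated sum node and its differentiable relaxation concrete, here is a small sketch. It assumes a softmax over negative squared distances as the relaxation, which is a common choice but is not confirmed by the abstract; the centroids, the `temperature` parameter, and all function names are hypothetical.

```python
import math


def sq_dist(x, c):
    """Squared Euclidean distance between point x and centroid c."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def hard_gate(x, centroids):
    """Voronoi gating: all weight goes to the cell whose centroid is nearest to x."""
    d = [sq_dist(x, c) for c in centroids]
    k = d.index(min(d))
    return [1.0 if i == k else 0.0 for i in range(len(centroids))]


def soft_gate(x, centroids, temperature):
    """Assumed relaxation: softmax over negative squared distances.

    As temperature approaches 0 the weights approach the hard Voronoi assignment.
    """
    logits = [-sq_dist(x, c) / temperature for c in centroids]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


centroids = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
x = (1.0, 0.5)
hard = hard_gate(x, centroids)
soft = soft_gate(x, centroids, temperature=1.0)
sharp = soft_gate(x, centroids, temperature=0.01)
print(hard, soft, sharp)
```

The hard gate makes the mixture weights depend on where the input falls, which is the source of the tractability problem the abstract mentions; the soft gate is smooth in the centroids, so they can be trained by gradient descent.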

2603.11653 2026-06-02 cs.LG cs.RO

Simple Recipe Works: Vision-Language-Action Models are Natural Continual Learners with Reinforcement Learning

简单配方有效:视觉-语言-动作模型通过强化学习成为自然持续学习者

Jiaheng Hu, Jay Shim, Chen Tang, Yoonchang Sung, Bo Liu, Peter Stone, Roberto Martin-Martin

发表机构 * University of Southern California(南加州大学) University of Texas at Austin(德克萨斯大学奥斯汀分校) University of California, Berkeley(加州大学伯克利分校)

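AI summary A systematic study finds that, for large pretrained vision-language-action models, simple sequential fine-tuning with low-rank adaptation achieves high plasticity, little to no forgetting, and strong zero-shot generalization in continual reinforcement learning, frequently outperforming more sophisticated methods.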

Comments Accepted at RLC 2026

详情
AI中文摘要

持续强化学习(CRL)用于视觉-语言-动作(VLA)模型是一个有前景的方向,旨在实现能够在开放、不断变化的环境中适应的自我改进具身智能体。然而,持续学习的传统观点认为,简单的顺序微调(Seq. FT)会导致灾难性遗忘,需要复杂的CRL策略。在这项工作中,我们退一步,对大型预训练VLA在多种终身RL基准上的CRL进行了系统研究。我们发现,与既定信念相反,使用低秩适配(LoRA)的简单Seq. FT非常强大:它实现了高可塑性,几乎没有遗忘,并保持了强大的零样本泛化,通常优于更复杂的CRL方法。通过详细分析,我们表明这种鲁棒性源于大型预训练模型、参数高效适配和在线RL之间的协同作用。这些组件共同重塑了稳定性-可塑性权衡,使持续适应既稳定又可扩展。我们的结果将顺序微调定位为VLA持续RL的强大方法,并为大模型时代的终身学习提供了新见解。代码可在github.com/UT-Austin-RobIn/continual-vla-rl获取。

英文摘要

Continual Reinforcement Learning (CRL) for Vision-Language-Action (VLA) models is a promising direction toward self-improving embodied agents that can adapt in open-ended, evolving environments. However, conventional wisdom from continual learning suggests that naive Sequential Fine-Tuning (Seq. FT) leads to catastrophic forgetting, necessitating complex CRL strategies. In this work, we take a step back and conduct a systematic study of CRL for large pretrained VLAs across diverse lifelong RL benchmarks. We find that, contrary to established belief, simple Seq. FT with low-rank adaptation (LoRA) is remarkably strong: it achieves high plasticity, exhibits little to no forgetting, and retains strong zero-shot generalization, frequently outperforming more sophisticated CRL methods. Through detailed analysis, we show that this robustness arises from a synergy between the large pretrained model, parameter-efficient adaptation, and on-policy RL. Together, these components reshape the stability-plasticity trade-off, making continual adaptation both stable and scalable. Our results position Sequential Fine-Tuning as a powerful method for continual RL with VLAs and provide new insights into lifelong learning in the large model era. Code is available at github.com/UT-Austin-RobIn/continual-vla-rl.
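As a reminder of what low-rank adaptation does to a single linear layer, here is a dependency-free sketch. It illustrates the general LoRA idea only, not the paper's training setup: the matrix shapes, the `alpha / r` scaling convention, and the names `down` and `up` are assumptions for illustration.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(x, w_frozen, down, up, alpha):
    """Compute x @ (W + (alpha / r) * down @ up).

    w_frozen : d_in x d_out pretrained weight, never updated
    down     : d_in x r trainable factor
    up       : r x d_out trainable factor, usually initialized to zero
    """
    r = len(up)                      # rank of the update
    delta = matmul(down, up)         # low-rank update, d_in x d_out
    scale = alpha / r
    w_eff = [[w + scale * d for w, d in zip(w_row, d_row)]
             for w_row, d_row in zip(w_frozen, delta)]
    return matmul(x, w_eff)


x = [[1.0, 2.0]]
w = [[1.0, 0.0], [0.0, 1.0]]
down = [[1.0], [1.0]]

# With the up factor at zero, the adapted layer matches the pretrained one.
y_init = lora_forward(x, w, down, up=[[0.0, 0.0]], alpha=2.0)
# After training moves the up factor, only the low-rank term changes the output.
y_adapted = lora_forward(x, w, down, up=[[0.5, -0.5]], alpha=2.0)
print(y_init, y_adapted)
```

Because the pretrained weight stays frozen and each update is confined to a rank-r subspace, sequential tasks can only move the model a limited distance from its starting point, which is consistent with the abstract's point about parameter-efficient adaptation.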

2603.11537 2026-06-02 cs.RO

MiNI-Q: A Miniature, Wire-Free Quadruped with Unbounded, Independently Actuated Leg Joints

MiNI-Q:一种微型、无线四足机器人,具有无界、独立驱动的腿关节

Daniel Koh, Suraj Shah, Yufeng Wu, Dennis Hong

发表机构 * Stanford University(斯坦福大学)

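AI summary Presents MiNI-Q^2, a miniature wire-free quadruped whose unbounded, independently actuated leg joints enable multiple locomotion modes; it reaches speeds up to 0.46 m/s, folds to a height of 2.5 cm, and all design files are open source.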

Comments 7 pages, 11 figures. Submitted to the IEEE RAS Conference on Ubiquitous Robots (UR 2026)

详情
AI中文摘要

物理关节限制在腿式机器人中很常见,会限制工作空间、约束步态设计并增加硬件损坏风险。本文介绍MiNI-Q^2,一种微型、无线四足机器人,具有独立驱动、机械无界的2自由度腿关节。我们提出了所设计机器人的机械设计、运动学分析和实验验证。该腿机构可实现振荡步态和旋转运动,同时允许机器人折叠至2.5 cm的最小高度。实验上,MiNI-Q达到0.46 m/s的速度,并展示了低间隙爬行、爬楼梯、倒立运动、跳跃和后空翻。无线架构扩展了我们之前的Q8bot设计,提高了微型尺度的装配可靠性。所有机械和电气设计文件均已开源,以支持可重复性和进一步研究。

英文摘要

Physical joint limits are common in legged robots and can restrict workspace, constrain gait design, and increase the risk of hardware damage. This paper introduces MiNI-Q^2, a miniature, wire-free quadruped robot with independently actuated, mechanically unbounded 2-DOF leg joints. We present the mechanical design, kinematic analysis, and experimental validation of the proposed robot. The leg mechanism enables both oscillatory gaits and rotary locomotion while allowing the robot to fold to a minimum height of 2.5 cm. Experimentally, MiNI-Q achieves speeds up to 0.46 m/s and demonstrates low-clearance crawling, stair climbing, inverted locomotion, jumping, and backflipping. The wire-free architecture extends our previous Q8bot design, improving assembly reliability at miniature scale. All mechanical and electrical design files are released open source to support reproducibility and further research.

2603.10619 2026-06-02 cs.CL

Disentangling Similarity and Relatedness in Topic Models

在主题模型中区分相似性和相关性

Hanlin Xiao, Yang Wang, Mauricio A. Álvarez, Rainer Breitling

发表机构 * Manchester Institute of Biotechnology, The University of Manchester, UK(曼彻斯特生物技术研究所,曼彻斯特大学,英国) Department of Computer Science, The University of Manchester, UK(计算机科学系,曼彻斯特大学,英国) Department of Chemistry, The University of Manchester, UK(化学系,曼彻斯特大学,英国)

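AI summary Builds a large synthetic benchmark and a neural scorer to separate, from a psycholinguistic perspective, thematic relatedness from taxonomic similarity among topic words, and shows that the two dimensions affect downstream task performance differently.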

Comments 26 pages, 9 figures, 18 tables

详情
AI中文摘要

大型预训练语言模型(PLMs)的最新成功推动了它们与主题建模的整合。然而,PLM增强的主题模型不仅在性能上,而且在它们捕获的语义结构类型上也与经典共现模型(如潜在狄利克雷分配(LDA))不同。我们沿着两个心理语言学轴形式化了这种区别:主题相关性(狗/骨头)和分类相似性(狗/狼)。为了衡量主题词上的这两个轴,我们使用基于LLM的标注构建了一个大型合成词对基准,并训练了一个神经评分器。在多个语料库和模型家族中,评分器将不同的主题模型家族置于联合相似性-相关性空间中的不同位置。这两个分数进一步预测了下游任务性能:需要相似性的任务受益于富含相似性的主题,而需要相关性的任务则受益于相反的情况,并且对任一轴的过度强调都会降低与对立语义结构对齐的任务的性能。没有一个轴是普遍有益的。因此,衡量两者提供了一种实用的、与模型无关的诊断方法,用于评估主题模型捕获的语义结构。

英文摘要

The recent success of large pre-trained language models (PLMs) has motivated their integration into topic modeling. However, PLM-augmented topic models differ from classical co-occurrence models such as Latent Dirichlet Allocation (LDA) not only in performance, but also in the type of semantic structure they capture. We formalize this distinction along two psycholinguistic axes: thematic relatedness (dog/bone) and taxonomic similarity (dog/wolf). To measure both axes over topic words, we construct a large synthetic benchmark of word pairs using LLM-based annotation and train a neural scorer on it. Across multiple corpora and model families, the scorer places different topic-model families at distinct positions within the joint similarity-relatedness space. The two scores further predict downstream task performance: tasks requiring similarity benefit from similarity-rich topics, whereas tasks requiring relatedness benefit from the converse, and excessive emphasis on either axis degrades performance on tasks aligned with the opposing semantic structure. Neither axis is uniformly beneficial. Measuring both therefore provides a practical, model-agnostic diagnostic for evaluating the semantic structure captured by topic models.

2603.10282 2026-06-02 cs.RO

Update-Free On-Policy Steering via Verifiers

基于验证器的免更新在线策略引导

Maria Attarian, Ian Vyse, Claas Voelcker, Jasper Gerigk, Evgenii Opryshko, Anas Almasri, Sumeet Singh, Yilun Du, Igor Gilitschenski

发表机构 * University of Toronto(多伦多大学) Google DeepMind(谷歌DeepMind) University of Alberta(阿尔伯塔大学) UT Austin(得克萨斯大学奥斯汀分校) Harvard University(哈佛大学)

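AI summary Proposes UF-OPS, which trains verifier functions on rollout data from policy evaluation and uses them to steer the base policy toward actions with a higher likelihood of success, improving black-box diffusion policies without parameter updates; the average success rate improves by 49% across 5 real tasks.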

Comments 9 pages, 6 figures

详情
AI中文摘要

近年来,行为克隆(BC)已成为从人类演示中学习操作的最流行方法之一。尽管取得了成功,但BC策略通常脆弱且难以进行精确操作。为了克服这些问题,我们提出了UF-OPS,一种免更新的在线策略引导方法,使机器人能够预测其动作的成功可能性,并在执行时调整策略。我们通过使用在策略初始评估期间获得的策略回滚数据来训练验证器函数来实现这一点。这些验证器随后用于将基础策略引导向具有更高成功可能性的动作。我们的方法在不改变基础参数的情况下提高了黑箱扩散策略的性能,使其轻量且灵活。我们展示了来自仿真和真实世界数据的结果,并在5个真实任务中实现了比基础策略平均49%的成功率提升。

英文摘要

In recent years, Behavior Cloning (BC) has become one of the most prevalent methods for learning manipulation from human demonstrations. Despite their successes, BC policies are often brittle and struggle with precise manipulation. To overcome these issues, we propose UF-OPS, an Update-Free On-Policy Steering method that enables the robot to predict the success likelihood of its actions and adapt its strategy at execution time. We accomplish this by training verifier functions using policy rollout data obtained during an initial evaluation of the policy. These verifiers are subsequently used to steer the base policy toward actions with a higher likelihood of success. Our method improves the performance of black-box diffusion policies, without changing the base parameters, making it lightweight and flexible. We present results from both simulation and real-world data and achieve an average 49% improvement in success rate over the base policy across 5 real tasks.
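One simple way to steer a frozen policy with a verifier is best-of-N selection: sample several candidate actions, score each, and execute the highest-scoring one. The sketch below assumes that reading; the abstract does not specify the steering rule, and `toy_policy`, `toy_verifier`, and `num_candidates` are hypothetical stand-ins.

```python
import random


def steer(base_policy, verifier, observation, num_candidates, rng):
    """Sample candidates from a frozen policy and return the one the verifier prefers."""
    candidates = [base_policy(observation, rng) for _ in range(num_candidates)]
    scores = [verifier(observation, action) for action in candidates]
    best = max(range(num_candidates), key=lambda i: scores[i])
    return candidates[best], scores[best]


def toy_policy(observation, rng):
    """Stand-in for a black-box stochastic policy: a noisy 1-D action."""
    return observation + rng.gauss(0.0, 1.0)


def toy_verifier(observation, action):
    """Stand-in for a learned success predictor in (0, 1], peaking at observation + 0.5."""
    return 1.0 / (1.0 + (action - (observation + 0.5)) ** 2)


rng = random.Random(0)
action, score = steer(toy_policy, toy_verifier, 2.0, num_candidates=16, rng=rng)
print(action, score)
```

The base policy is only queried, never updated, which matches the update-free, black-box setting the abstract describes.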

2603.09692 2026-06-02 cs.LG cs.AI cs.CL

ActiveUltraFeedback: Efficient Preference Data Generation using Active Learning

ActiveUltraFeedback:使用主动学习的高效偏好数据生成

Davit Melikidze, Marian Schneider, Jessica Lam, Martin Wertich, Ido Hakimi, Barna Pásztor, Andreas Krause

发表机构 * University of Freiburg(弗赖堡大学)

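AI summary Proposes ActiveUltraFeedback, an active learning pipeline that uses uncertainty estimates and two new sampling methods (DRTS and DeltaUCB) to dynamically select the most informative response pairs, matching or exceeding static baselines in downstream performance with as little as one-sixth of the annotated data.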

Comments 40 pages, 9 figures, 26 tables

详情
AI中文摘要

基于人类反馈的强化学习(RLHF)已成为对齐大型语言模型(LLMs)的标准方法,但其有效性受到偏好数据获取高成本的瓶颈限制,尤其是在低资源和专家领域。为解决这一问题,我们引入了ACTIVEULTRAFEEDBACK,一个模块化的主动学习流水线,利用不确定性估计动态识别最具信息量的响应进行标注。我们的流水线支持系统评估标准响应选择方法以及两种新方法:DOUBLE REVERSE THOMPSON SAMPLING(DRTS)和DELTAUCB,这两种方法优先选择预测质量差距大的响应对,利用近期研究结果,即此类对为微调提供良好信号。实验表明,ACTIVEULTRAFEEDBACK生成的高质量数据集在下游性能上带来显著提升,尤其以静态基线六分之一的标注数据即可达到相当或更优的结果。我们的流水线可在https://github.com/lasgroup/ActiveUltraFeedback获取,偏好数据集可在https://huggingface.co/ActiveUltraFeedback获取。

英文摘要

Reinforcement Learning from Human Feedback (RLHF) has become the standard for aligning Large Language Models (LLMs), yet its efficacy is bottlenecked by the high cost of acquiring preference data, especially in low-resource and expert domains. To address this, we introduce ACTIVEULTRAFEEDBACK, a modular active learning pipeline that leverages uncertainty estimates to dynamically identify the most informative responses for annotation. Our pipeline facilitates the systematic evaluation of standard response selection methods alongside DOUBLE REVERSE THOMPSON SAMPLING (DRTS) and DELTAUCB, two novel methods prioritizing response pairs with large predicted quality gaps, leveraging recent results showing that such pairs provide good signals for fine-tuning. Our experiments demonstrate that ACTIVEULTRAFEEDBACK yields high-quality datasets that lead to significant improvements in downstream performance, notably achieving comparable or superior results with as little as one-sixth of the annotated data relative to static baselines. Our pipeline is available at https://github.com/lasgroup/ActiveUltraFeedback and our preference datasets at https://huggingface.co/ActiveUltraFeedback.
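The abstract states only that the new methods prioritize response pairs with large predicted quality gaps, using uncertainty estimates. One plausible reading, sketched below, scores each ordered pair by an optimistic gap: the upper confidence bound of the would-be chosen response minus the lower confidence bound of the would-be rejected one. This is an illustrative assumption and not the paper's definition of DELTAUCB or DRTS; `means`, `stds`, and `beta` are hypothetical inputs.

```python
from itertools import permutations


def pick_pair(means, stds, beta):
    """Return the ordered pair (chosen, rejected) with the largest optimistic gap.

    means[i], stds[i] : predicted quality of response i and its uncertainty
    beta              : hypothetical exploration weight on the uncertainty
    """
    def gap(pair):
        chosen, rejected = pair
        upper = means[chosen] + beta * stds[chosen]
        lower = means[rejected] - beta * stds[rejected]
        return upper - lower

    return max(permutations(range(len(means)), 2), key=gap)


means = [0.2, 0.8, 0.5]
stds = [0.1, 0.05, 0.3]
pair = pick_pair(means, stds, beta=1.0)
print(pair)
```

In the toy data, response 1 is paired against response 0: the gap in predicted quality is large, so the pair is expected to give a clear preference signal once annotated.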

2603.09529 2026-06-02 cs.CV

RESBev: Making BEV Perception More Robust

RESBev:使BEV感知更加鲁棒

Lifeng Zhuo, Kefan Jin, Zhe Liu, Hesheng Wang

发表机构 * Shanghai Jiao Tong University(上海交通大学)

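AI summary Proposes RESBev, a plug-and-play robust BEV perception method that builds a latent world model to learn spatiotemporal correlations and predict clean BEV features, strengthening robustness to natural disturbances and adversarial attacks without modifying the backbone.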

详情
AI中文摘要

鸟瞰图(BEV)感知已成为自动驾驶系统的基石,为下游规划和控制提供了结构化的、以自我为中心的表示。然而,实际部署面临传感器退化和对抗攻击的挑战,这些可能导致严重的感知异常,最终危及自动驾驶系统的安全性。为了解决这个问题,我们提出了一种弹性且即插即用的BEV感知方法RESBev,它可以轻松应用于现有的BEV感知方法,以增强其对各种扰动的鲁棒性。具体来说,我们将感知鲁棒性重新构建为潜在语义预测问题。构建了一个潜在世界模型来提取连续BEV观测中的时空相关性,从而学习潜在的BEV状态转换,以预测干净的BEV特征来重建被破坏的观测。所提出的框架在Lift-Splat-Shoot管道的语义特征级别上运行,使其能够在无需修改底层骨干网络的情况下,对自然扰动和对抗攻击进行泛化恢复。在nuScenes数据集上的大量实验表明,通过少样本微调,RESBev显著提高了现有BEV感知模型对各种外部扰动和对抗攻击的鲁棒性。

英文摘要

Bird's-eye-view (BEV) perception has emerged as a cornerstone of autonomous driving systems, providing a structured, ego-centric representation critical for downstream planning and control. However, real-world deployment faces challenges from sensor degradation and adversarial attacks, which can cause severe perceptual anomalies and ultimately compromise the safety of autonomous driving systems. To address this, we propose a resilient and plug-and-play BEV perception method, RESBev, which can be easily applied to existing BEV perception methods to enhance their robustness to diverse disturbances. Specifically, we reframe perception robustness as a latent semantic prediction problem. A latent world model is constructed to extract spatiotemporal correlations across sequential BEV observations, thereby learning the underlying BEV state transitions to predict clean BEV features for reconstructing corrupted observations. The proposed framework operates at the semantic feature level of the Lift-Splat-Shoot pipeline, enabling recovery that generalizes across both natural disturbances and adversarial attacks without modifying the underlying backbone. Extensive experiments on the nuScenes dataset demonstrate that, with few-shot fine-tuning, RESBev significantly improves the robustness of existing BEV perception models against various external disturbances and adversarial attacks.

2603.09390 2026-06-02 cs.CV

Training-Free Coverless Multi-Image Steganography with Access Control

免训练的无载体多图像隐写术与访问控制

Minyeol Bae, Si-Hyeon Lee

发表机构 * Department of Computer Science, Seoul National University(首尔国立大学计算机科学系)

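AI summary Proposes the MIDAS framework, which achieves training-free multi-image steganography with user-specific access control via latent-level fusion, introduces a Random Basis mechanism to suppress residual structural information, and provides a theoretical analysis of information leakage.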

Comments Accepted (Poster) at ICML 2026

详情
AI中文摘要

无载体图像隐写术(CIS)在不显式修改载体图像的情况下隐藏信息,提供强不可感知性和对隐写分析的固有鲁棒性。然而,现有的CIS方法大多缺乏鲁棒的访问控制,难以向不同授权用户选择性揭示不同隐藏内容。这种访问控制对于多用户场景中可扩展且隐私敏感的信息隐藏至关重要。我们提出MIDAS(基于多图像扩散的访问控制隐写术),一种免训练的基于扩散的CIS框架,通过潜在级融合实现具有用户特定访问控制的多图像隐藏。MIDAS引入随机基机制以抑制残差结构信息,并附带信息泄露的理论分析,以及一个潜在向量融合模块,该模块重塑聚合的潜在向量以更好地与扩散过程对齐。实验结果表明,MIDAS在访问控制功能、隐写图像质量和多样性、对噪声的鲁棒性以及抵抗隐写分析方面,始终优于现有的免训练CIS基线,为访问控制的无载体隐写术建立了一种实用且可扩展的方法。

英文摘要

Coverless Image Steganography (CIS) hides information without explicitly modifying a cover image, providing strong imperceptibility and inherent robustness to steganalysis. However, existing CIS methods largely lack robust access control, making it difficult to selectively reveal different hidden contents to different authorized users. Such access control is critical for scalable and privacy-sensitive information hiding in multi-user settings. We propose MIDAS (Multi-Image Diffusion-based Access-controlled Steganography), a training-free diffusion-based CIS framework that enables multi-image hiding with user-specific access control via latent-level fusion. MIDAS introduces a Random Basis mechanism to suppress residual structural information, together with a theoretical analysis of information leakage, and a Latent Vector Fusion module that reshapes aggregated latents to better align with the diffusion process. Experimental results demonstrate that MIDAS consistently outperforms existing training-free CIS baselines in access control functionality, stego image quality and diversity, robustness to noise, and resistance to steganalysis, establishing a practical and scalable approach to access-controlled coverless steganography.

2603.09292 2026-06-02 cs.RO cs.CV

See, Plan, Rewind: Progress-Aware Vision-Language-Action Models for Robust Robotic Manipulation

看、规划、回退:面向鲁棒机器人操作的进度感知视觉-语言-动作模型

Tingjun Dai, Mingfei Han, Tingwen Du, Zhiheng Liu, Zihao Zhang, Zhihui Li, Salman Khan, Jun Yu, Xiaojun Chang

发表机构 * School of Information Science and Technology, University of Science and Technology of China(信息科学与技术学院,中国科学技术大学) University of Technology Sydney(悉尼科技大学) Department of Computer Vision, Mohamed Bin Zayed University of Artificial Intelligence(计算机视觉系,穆罕默德·本·扎耶德人工智能大学) The University of Hong Kong(香港大学) Institute of AI for Industry, Chinese Academy of Sciences(产业人工智能研究所,中国科学院) School of Intelligent Science and Engineering, Harbin Institute of Technology (Shenzhen)(智能科学与工程学院,哈尔滨工业大学(深圳))

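AI summary Proposes SPR, a progress-aware vision-language-action framework that dynamically maps language instructions to a sequence of spatial subgoals and uses closed-loop progress monitoring for error recovery; it outperforms the MolmoAct baseline by 5% on LIBERO and shows state-of-the-art robustness on LIBERO-Plus.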

Comments Suggested to CVPR Findings. https://tingjundai.github.io/SPRVLA/

详情
AI中文摘要

通过明确的、可操作的里程碑来测量任务进度对于鲁棒机器人操作至关重要。这种进度感知使模型能够把握当前任务状态,预期可验证的中间状态,并在进度停滞时检测和恢复失败。为体现这一能力,我们引入了看、规划、回退(SPR),一个进度感知的视觉-语言-动作框架,它动态地将语言指令接地到一系列空间子目标中。SPR通过一个连续的核心循环运行:观察当前状态和即将到来的里程碑,规划朝向下一个2D航点的轨迹,并在失败时通过监控与预期序列的进度来回退到可恢复状态。这种闭环方法无需额外训练数据或辅助模型即可实现鲁棒的错误纠正。大量实验证明了该框架的有效性、泛化能力和鲁棒性:SPR在LIBERO基准上比MolmoAct基线高出5%。在具有未见指令和初始状态的挑战性LIBERO-Plus基准上,SPR实现了最先进的鲁棒性,性能下降最小,超越了OpenVLA-OFT和UniVLA,展示了优越的分布外鲁棒性。

英文摘要

Measurement of task progress through explicit, actionable milestones is critical for robust robotic manipulation. This progress awareness enables a model to ground its current task status, anticipate verifiable intermediate states, and detect and recover from failures when progress stalls. To embody this capability, we introduce See, Plan, Rewind (SPR), a progress-aware vision-language-action framework that dynamically grounds language instructions into a sequence of spatial subgoals. SPR operates through a continuous core cycle: Seeing the current state and upcoming milestone, Planning a trajectory towards the next 2D waypoint, and Rewinding to a recoverable state upon failure by monitoring progress against the expected sequence. This closed-loop approach enables robust error correction without requiring additional training data or auxiliary models. Extensive experiments demonstrate the framework's effectiveness, generalization, and robustness: SPR outperforms the MolmoAct baseline by 5% on the LIBERO benchmark. On the challenging LIBERO-Plus benchmark with unseen instructions and initial states, SPR achieves state-of-the-art robustness with the smallest performance drop, surpassing OpenVLA-OFT and UniVLA, demonstrating superior out-of-distribution robustness.

2509.25773 2026-06-02 cs.CV cs.AI cs.CL

v-HUB: A Benchmark for Video Humor Understanding from Vision and Sound

v-HUB: 从视觉和声音理解视频幽默的基准

Zhengpeng Shi, Yanpeng Zhao, Jianqun Zhou, Yuxuan Wang, Qinrong Cui, Wei Bi, Songchun Zhu, Bo Zhao, Zilong Zheng

发表机构 * Shanghai Jiao Tong University(上海交通大学) Wuhan University(武汉大学) Beijing Institute for General Artificial Intelligence(北京通用人工智能研究院) Independent Researcher(独立研究者)

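AI summary Proposes the v-HUB benchmark, which uses non-verbal short videos to evaluate how well multimodal large language models understand humor from visual cues alone, and finds that audio information helps humor understanding.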

Comments 24 pages, 9 figures

详情
AI中文摘要

能够理解幽默的AI模型具有现实应用前景——例如,增强人机交互中的参与度。为了评估和诊断多模态大语言模型(MLLMs)理解幽默的能力,我们引入了v-HUB,一个新颖的视频幽默理解基准。v-HUB包含一个精心策划的非语言短视频集合,反映了仅通过视觉线索即可欣赏幽默的现实场景。我们将每个视频片段与丰富的标注配对,以支持各种评估任务和分析,包括一项关于增强幽默的环境声音的新研究。为了扩大其适用性,我们构建了一个开放式问答任务,使v-HUB能够轻松集成到现有的视频理解任务套件中。我们评估了多种MLLMs,从专门的Video-LLMs到能够原生处理音频的多功能OmniLLMs,涵盖了开源和专有领域。实验结果揭示了MLLMs在仅凭视觉线索理解幽默时面临的困难。我们的发现还表明,结合音频有助于视频幽默理解,突显了为复杂视频理解任务整合更丰富模态的前景。

英文摘要

AI models capable of comprehending humor hold real-world promise -- for example, enhancing engagement in human-machine interactions. To gauge and diagnose the capacity of multimodal large language models (MLLMs) for humor understanding, we introduce v-HUB, a novel video humor understanding benchmark. v-HUB comprises a curated collection of non-verbal short videos, reflecting real-world scenarios where humor can be appreciated purely through visual cues. We pair each video clip with rich annotations to support a variety of evaluation tasks and analyses, including a novel study of environmental sound that can enhance humor. To broaden its applicability, we construct an open-ended QA task, making v-HUB readily integrable into existing video understanding task suites. We evaluate a diverse set of MLLMs, from specialized Video-LLMs to versatile OmniLLMs that can natively process audio, covering both open-source and proprietary domains. The experimental results expose the difficulties MLLMs face in comprehending humor from visual cues alone. Our findings also demonstrate that incorporating audio helps with video humor understanding, highlighting the promise of integrating richer modalities for complex video understanding tasks.

2603.08026 2026-06-02 cs.CL cs.AI cs.PF

DyLLM: Efficient Diffusion LLM Inference via Saliency-based Token Selection and Partial Attention

DyLLM: 基于显著性标记选择与部分注意力的高效扩散LLM推理

Younjoo Lee, Seungkyun Dan, Junghoo Lee, Jaiyoung Park, Jung Ho Ahn

发表机构 * University of Seoul(首尔大学)

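AI summary Iterative denoising in diffusion language models is computationally expensive. The DyLLM framework recomputes attention and feed-forward operations only for salient tokens, giving training-free inference acceleration with up to 9.6x higher throughput while largely preserving accuracy.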

Comments 21 pages, 10 figures, 7 tables, accepted at ICML 2026

详情
AI中文摘要

掩码扩散语言模型支持并行令牌解码,为自回归生成的顺序性质提供了一种有前景的替代方案。然而,其迭代去噪过程仍然计算昂贵,因为每一步都重复处理整个序列。我们观察到,在这些扩散步骤中,大多数令牌表示保持稳定;只有一小部分(我们称之为显著令牌)对下一次更新有实质性贡献。利用这种时间稀疏性,我们提出了DyLLM,一种无需训练的推理框架,通过仅选择性地计算这些显著令牌来加速解码。DyLLM通过测量相邻去噪步骤之间注意力上下文的余弦相似性来识别显著性。它仅对显著令牌重新计算前馈和注意力操作,同时为其余令牌重用缓存的激活。在多种推理和代码生成基准测试中,DyLLM实现了高达9.6倍的吞吐量提升,同时基本保持了代表性开源扩散LLM(LLaDA和Dream)的基线准确性。

英文摘要

Masked diffusion language models enable parallel token decoding, providing a promising alternative to the sequential nature of autoregressive generation. However, their iterative denoising process remains computationally expensive because it repeatedly processes the entire sequence at every step. We observe that across these diffusion steps, most token representations remain stable; only a small subset, which we term salient tokens, contributes meaningfully to the next update. Leveraging this temporal sparsity, we present DyLLM, a training-free inference framework that accelerates decoding by selectively computing only these salient tokens. DyLLM identifies saliency by measuring the cosine similarity of attention contexts between adjacent denoising steps. It recomputes feed-forward and attention operations only for salient tokens while reusing cached activations for the remainder. Across diverse reasoning and code-generation benchmarks, DyLLM achieves up to 9.6x higher throughput while largely preserving the baseline accuracy of two representative open-source diffusion LLMs, LLaDA and Dream.
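The saliency test described in the abstract can be illustrated with a short sketch: compare each token's attention context at the current step with its context at the previous step, and recompute only where the two diverge. The threshold value, the cache layout, and the function names are assumptions for illustration; the abstract does not give these details.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def salient_tokens(prev_ctx, curr_ctx, threshold):
    """Indices whose attention context changed noticeably between adjacent steps."""
    return [i for i, (p, c) in enumerate(zip(prev_ctx, curr_ctx))
            if cosine(p, c) < threshold]


def refresh_cache(cache, recompute, salient):
    """Recompute activations only for salient tokens; reuse the cache for the rest."""
    for i in salient:
        cache[i] = recompute(i)
    return cache


prev_ctx = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
curr_ctx = [[1.0, 0.01], [1.0, 0.0], [1.0, 1.0]]
salient = salient_tokens(prev_ctx, curr_ctx, threshold=0.99)

cache = ["old0", "old1", "old2"]
cache = refresh_cache(cache, recompute=lambda i: f"new{i}", salient=salient)
print(salient, cache)
```

Only the token whose context rotated is recomputed; the other two keep their cached activations, which is where the throughput gain would come from when most tokens are stable.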