arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.03635 2026-06-03 cs.CV cs.AI

VidMsg: A Benchmark for Implicit Message Inference in Short Videos

Issar Tzachor, Michael Green, Rami Ben-Ari

AI summary: Proposes the VidMsg benchmark, which uses a message-first construction pipeline and a bidirectional retrieval task to evaluate how well video understanding models infer implicit messages in short videos.

Comments
Project page: https://iyttor.github.io/VidMsg

Abstract

Understanding short online videos involves more than identifying visible objects and actions; video makers often include an underlying message or purpose in the clip. We introduce VidMsg, a benchmark for evaluating implicit message understanding in short, internet-native video clips. VidMsg contains 400 YouTube-derived clips across 9 practical topic areas and 52 fine-grained target messages, covering domains such as career and finance, education, health and well-being, culture, safety, sustainability, and lifestyle. VidMsg is constructed through a message-first pipeline: an LLM first translates target messages into indirect search scenarios, which are used to retrieve candidate clips. Human annotators then retain clips that convey the intended message without being overly explicit. VidMsg is designed primarily for bidirectional message-clip retrieval for scalable applications such as video search and recommendation, where systems must capture holistic video understanding. In addition to retrieval, VidMsg includes a diagnostic multiple-choice QA benchmark, where models select the intended message of a clip from semantically related alternatives. Experiments with contemporary video-language and retrieval models show that strong models often fail on VidMsg, because the task requires pragmatic inference, integration of contextual cues, and discrimination among semantically close messages. We also introduce VidVec-Msg, a baseline method that improves message-oriented retrieval while leaving substantial headroom for future work.

2606.03629 2026-06-03 cs.AI

TSQAgent: Rating Time Series Data Quality via Dedicated Agentic Reasoning

Shunyu Wu, Dan Li, Haozheng Ye, Weibin Feng, Jian Lou, Bo Zhang, Wenjie Feng, Chenjuan Guo, See-Kiong Ng

AI summary: Proposes the TSQAgent framework, in which three collaborating agents (Perceiver, Inspector, Adjudicator) identify relevant quality dimensions and perform quantitative comparisons, significantly improving LLM performance on time series data quality assessment.

Abstract

Assessing the quality of time series (TS) data is fundamental yet inherently challenging due to the multifaceted nature of quality dimensions. Recently, large language models (LLMs) have emerged as a promising paradigm for TS quality assessment via pairwise comparison and per-dimension evaluation. However, existing approaches rely on manually predefined quality dimensions and purely text-based reasoning, leaving it unknown whether LLMs can identify truly relevant quality dimensions or perform grounded and quantitative quality comparisons. To investigate this, we construct TSQBench, a dedicated benchmark for evaluating LLMs on two progressive capabilities: (i) understanding and identifying relevant quality dimensions, and (ii) performing quality comparison under specific dimensions. Our analysis reveals that current LLMs consistently struggle with both dimension identification and evidence-grounded quality comparison. To address these limitations, we propose TSQAgent, a novel agentic reasoning framework for TS quality rating consisting of three collaborative roles: Perceiver for focused dimension selection, Inspector for dimension-wise quantitative analysis, and Adjudicator that aggregates and refines the final judgment. In particular, we introduce an agentic reasoning strategy that instills the ability to identify and prioritize the most relevant quality dimensions, and further propose an agent workflow equipped with external analytical tools to enable precise quantitative comparisons over selected dimensions. Experiments on both the proposed benchmark and eleven real-world datasets demonstrate that our framework not only substantially improves LLMs' capabilities in quality understanding and quantitative comparison but also effectively translates these improvements into better quality-aware data selection, leading to enhanced downstream performance and data efficiency.

2606.03628 2026-06-03 cs.CL cs.AI cs.LG

Building Reliable Long-Form Generation via Hallucination Rejection Sampling

Lin Li, Georgia Channing, Suhaas M Bhat, Gabriel Davis Jones, Yarin Gal

AI summary: Proposes SHARS, a segment-wise hallucination rejection sampling framework that uses an arbitrary hallucination detector to reject and resample hallucinated segments during generation, mitigating hallucination accumulation in long-form generation and improving factual consistency.

Comments
Accepted by ICML 2026

Abstract

Large language models (LLMs) have achieved remarkable progress in open-ended text generation, yet they remain prone to hallucinating incorrect or unsupported content, which undermines their reliability. This issue is exacerbated in long-form generation due to hallucination snowballing, a phenomenon where early errors propagate and compound into subsequent outputs. To address this challenge, we propose a novel inference-time hallucination mitigation framework, named Segment-wise HAllucination Rejection Sampling (SHARS), which uses an arbitrary hallucination detector to identify and reject hallucinated segments during generation and resample until faithful content is produced. By retaining only confident information and building subsequent generations upon it, the framework mitigates hallucination accumulation and enhances factual consistency. To instantiate this framework, we adopt semantic uncertainty as the detector and introduce several vital modifications to address its limitations and better adapt it to long-form text. Our method enables models to self-correct hallucinations without requiring external resources such as web search or knowledge bases, while remaining compatible with them for future extensions. Empirical evaluations on standardized hallucination benchmarks demonstrate that our method substantially reduces hallucinations in long-form generation while preserving or even improving the informativeness of generation. Code is available at: https://github.com/TreeLLi/hallucination-rejection-sampling.
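
The reject-and-resample loop described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `propose`, `is_hallucinated`, `max_retries`, and the choice to skip a segment after repeated rejections are all assumptions, and the toy generator and detector stand in for a language model and a semantic-uncertainty detector.

```python
import random
from typing import Callable, List


def rejection_sample_segments(
    propose: Callable[[List[str], random.Random], str],
    is_hallucinated: Callable[[List[str], str], bool],
    num_segments: int,
    max_retries: int = 8,
    seed: int = 0,
) -> List[str]:
    """Build a text segment by segment, keeping only segments the detector accepts."""
    rng = random.Random(seed)
    accepted: List[str] = []
    for _ in range(num_segments):
        for _ in range(max_retries):
            candidate = propose(accepted, rng)
            if not is_hallucinated(accepted, candidate):
                accepted.append(candidate)  # later segments condition on this one
                break
        # Assumption: if every retry is rejected, the segment is skipped.
    return accepted


# Toy stand-ins for a language model and a hallucination detector (hypothetical).
def toy_propose(context: List[str], rng: random.Random) -> str:
    return rng.choice(["supported claim", "unsupported claim"])


def toy_detector(context: List[str], segment: str) -> bool:
    return segment.startswith("unsupported")


segments = rejection_sample_segments(toy_propose, toy_detector, num_segments=5)
```

Because rejected segments never enter `accepted`, later proposals are conditioned only on content the detector let through, which is the mechanism the abstract credits for limiting snowballing.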

2606.03626 2026-06-03 cs.CV cs.AI cs.CY

TurtleAI: Benchmarking Multimodal Models for Visual Programming in Turtle Graphics

Chao Wen, Jacqueline Staub, Adish Singla

AI summary: Proposes the TurtleAI benchmark of 823 visual programming tasks grounded in real-world Turtle Graphics tasks; an evaluation of 20+ multimodal models finds that most achieve success rates below 30%, and fine-tuning Qwen2-VL-72B on synthetic data generated from a small set of seed samples improves performance by about 20%.

Comments
ACL Findings 2026 paper

Abstract

Vision-language models (VLMs) have been explored for visual programming, where they generate code to solve visual tasks. However, most prior work focuses on visual programming for productivity; it remains unclear how well current VLMs perform on education-oriented visual programming and what factors limit their performance. To bridge this gap, we introduce TurtleAI, a benchmark containing 823 tasks curated based on real-world visual programming tasks in the Turtle Graphics domain. Solving these tasks requires models to perceive geometric patterns, reason about spatial relationships, and synthesize Python code that faithfully reproduces geometric patterns. We evaluate 20+ VLMs, including GPT-5, GPT-4o, and Qwen2-VL-72B, and find that they struggle significantly, with most achieving success rates below 30%. To address these limitations, we propose a data generation technique that requires only a small set of seed samples. Fine-tuning Qwen2-VL-72B on the resulting synthetic data yields an improvement of about 20% on real-world tasks. Our failure analysis reveals that GPT-4o struggles with spatial reasoning and precise visual replication, whereas fine-tuning primarily improves the alignment between visual reasoning and code implementation.

2606.03624 2026-06-03 cs.AI cs.CL

Bridging Auxiliary Constraints to Resolve Instruction Following in Large Reasoning Models

Zhengyi Zhao, Shubo Zhang, Huimin Wang, Zezhong Wang, Yutian Zhao, Yefeng Zheng, Binyang Li, Yulan He, Kam-Fai Wong, Xian Wu

AI summary: To address the difficulty large reasoning models have in reliably following multiple constraints, proposes the Constraint Relationship Graph Completion framework, which explicitly models relationships between constraints and discovers bridge constraints, reducing constraint violations by 39%.

Comments
A pre-MIT Press publication version

Abstract

Large Reasoning Models (LRMs) have demonstrated impressive capabilities in many tasks, yet they struggle with reliably following multiple instructions, either by failing to satisfy individual constraints or by struggling to balance competing constraints simultaneously. We formalize this challenge as the Constraint Adherence Problem (CAP). This paper introduces a novel framework that addresses CAP by representing instructions as a structured knowledge graph of constraints. Our approach, Constraint Relationship Graph Completion (CRGC), explicitly models relationships between constraints, identifies adherence challenges, and discovers "bridge constraints" that help the model better focus on and reconcile requirements. Bridge constraints act as auxiliary instructions that make primary constraints more salient and compatible. Unlike existing approaches that enhance instruction following through general training methods, CRGC specifically improves constraint satisfaction by leveraging the model's own knowledge to create better pathways for generation. Experiments across three popular instruction-following datasets demonstrate that our approach reduces constraint violations by 39% compared to standard prompting while maintaining the reasoning abilities of large reasoning models.

2606.03620 2026-06-03 cs.LG cs.AI

Physics-Guided Policy Optimization with Self-Distillation

Ke Wang, Yuning Wu, Haoran Liu, Chaoqun Jia, Devin Chen, Kai Wei

AI summary: To address the training instability caused by a fixed step size in self-distilled policy optimization, proposes Physics-Guided Policy Optimization (PGPO), inspired by viscous-fluid dynamics, which adjusts the step size dynamically using a mutual-information estimate, improving performance on the Science-QA dataset while keeping training stable.

Abstract

Self-distilled policy optimization (SDPO) has become a popular paradigm for LLM post-training, where a model learns from its own predictions conditioned on privileged information. SDPO, however, is sensitive to how much each update step should be trusted: corrections from a self-teacher can be highly informative on some batches and misleading on others, and applying them uniformly with a fixed step size can destabilize training. Drawing inspiration from viscous-fluid dynamics and formalizing the analogy at the SDE level, we propose Physics-Guided Policy Optimization (PGPO), which introduces an information-modulated step-size multiplier derived from a mutual-information estimate between the student's predictions and the feedback-conditioned teacher. We show that this modulation preserves the order-1 weak-approximation guarantees of vanilla SGD, and incurs negligible overhead per iteration. We evaluate PGPO on the Science-QA dataset, where it outperforms SDPO on 3 of the 4 domains with gains of up to +4.5 points, while remaining stable in a setting where SDPO collapses late in training.

2606.03618 2026-06-03 cs.AI

Cross-Lingual Token Arbitrage: Optimizing Code Agent Context Windows via Local LLM Preprocessing

Mehmet Utku Colak

AI summary: Proposes a pre-flight, edge-side prompt-rewriting middleware that uses a local Llama 3.2 model for cross-lingual translation and structural rewriting, reducing prompt tokens by 34-47% and total tokens by up to 18.8% while preserving or improving task accuracy.

Comments
Submitted to EMNLP 2026

Abstract

AI-assisted coding agents are bottlenecked by input-token cost. Two pathologies of raw human input drive much of this overhead: tokenization inefficiency for non-English text and structural entropy in conversational prompts. Existing approaches act reactively by compressing already-bloated contexts or intervening after failures occur. We introduce a pre-flight, edge-side prompt-rewriting middleware that operates between the developer and the cloud agent. A local Llama 3.2 (3B) model performs cross-lingual translation into English, structural rewriting into a compact task-oriented format, and regex-validated rewrite-with-fallback safeguards to ensure the optimized prompt is never larger than the original. We evaluate on OMH-Polyglot, a multilingual coding benchmark spanning Turkish, Arabic, Chinese, and code-switched specifications. Across three commercial LLM backends, the middleware reduces prompt tokens by 34-47 percent and total tokens by up to 18.8 percent while preserving or improving task accuracy. Ablation studies show that gains arise primarily from the rewriting stage rather than simple function-name extraction. Compared with LLMLingua-2 at matched compression rates, our method consistently achieves superior OckScore performance across all evaluated backends. These results demonstrate that proactive prompt optimization can substantially reduce inference costs without sacrificing coding quality.
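
The rewrite-with-fallback safeguard mentioned in this abstract might look something like the sketch below. The abstract does not say what the regex validates or how tokens are counted, so both are assumptions here: the check requires every backtick-quoted identifier from the original prompt to survive the rewrite, and a whitespace split stands in for a real tokenizer. The function names are hypothetical.

```python
import re
from typing import Callable

# Assumption: identifiers that must survive are the ones quoted in backticks.
IDENTIFIER = re.compile(r"`([^`]+)`")


def count_tokens_whitespace(text: str) -> int:
    """Crude stand-in for a tokenizer: count whitespace-separated pieces."""
    return len(text.split())


def rewrite_with_fallback(
    original: str,
    rewritten: str,
    count_tokens: Callable[[str], int] = count_tokens_whitespace,
) -> str:
    """Return the rewritten prompt only if it is valid and not larger than the original."""
    required = set(IDENTIFIER.findall(original))
    if any(f"`{name}`" not in rewritten for name in required):
        return original  # validation failed: an identifier was lost
    if count_tokens(rewritten) > count_tokens(original):
        return original  # the rewrite grew the prompt
    return rewritten
```

Falling back to the original on any failed check is what gives the "never larger than the original" property, whatever the quality of the local model's rewrite.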

2606.03610 2026-06-03 cs.CV

SkelHCC: A Hyperbolic CLIP-Driven Cache Adaptation Framework for Skeleton-based One-Shot Action Recognition

Yanan Liu, Anqi Zhu, Jingmin Zhu, Jun Liu, Hossein Rahmani, Mohammed Bennamoun, Farid Boussaid, Dan Xu, Qiuhong Ke

AI summary: Proposes the SkelHCC framework, which uses hyperbolic geometry to encode the skeleton hierarchy and combines CLIP with a training-free cache for one-shot action recognition, achieving state-of-the-art results on three datasets.

Comments
Accepted by ICML 2026

Abstract

Skeleton-based action recognition aims to understand human behaviors from body joint sequences and is especially challenging in the one-shot setting, where only a single labeled exemplar is available for each novel action. A key challenge is learning representations that capture the hierarchical and compositional structure of human motion while aligning effectively with high-level action semantics under extreme data scarcity. Existing approaches, largely based on Euclidean embeddings and low-level motion cues, struggle to model the tree-like organization of skeleton data, limiting cross-modal alignment and generalization to unseen action categories. We propose SkelHCC, a unified skeleton hyperbolic CLIP-driven cache adaptation framework for one-shot skeleton-based action recognition. SkelHCC introduces an Explicitly Hierarchical Hyperbolic CLIP (EH-HCLIP) module that embeds skeleton sequences and action language into a shared hyperbolic space. By leveraging the negative curvature and exponential volume growth of hyperbolic geometry, EH-HCLIP naturally encodes the joint-part-body hierarchy of human anatomy and yields structurally consistent cross-modal representations. To support efficient one-shot adaptation, SkelHCC further integrates a training-free LLM-guided Multi-granularity Voting Cache (LMV-Cache) for context-aware inference. Experiments on NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD demonstrate that SkelHCC consistently outperforms state-of-the-art methods.
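
To make the appeal of hyperbolic embeddings concrete, here is a small sketch of the geodesic distance in the Poincaré ball with curvature -1. The abstract does not state which model of hyperbolic space SkelHCC uses, so the Poincaré ball is an assumption chosen for illustration, and `poincare_distance` is a hypothetical helper name.

```python
import math
from typing import Sequence


def poincare_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Geodesic distance between two points inside the unit Poincaré ball."""
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))
```

The same Euclidean gap costs far more distance near the boundary than near the origin, which is why tree-like hierarchies (root near the centre, leaves near the boundary) embed with low distortion.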

2606.03608 2026-06-03 cs.LG cs.AI

Exploiting Verification-Generation Gap: Test-Time Reinforcement Learning with Confidence-Conditioned Verification

Jiahui Li, Jianfeng Shan, Wenpei Chen, Shunyu Wu, Jian Lou, Wenjie Feng, Dan Li, See-Kiong Ng

AI summary: Proposes the TTRL-CoCoV framework, whose confidence-adaptive mechanism addresses incorrect pseudo-labels and diversity collapse when optimizing Pass@k in the label-free setting, significantly improving Pass@1 and Pass@k performance.

Abstract

Test-time reinforcement learning has emerged as a promising paradigm for enhancing the complex reasoning abilities of large language models in a completely label-free manner. Existing studies focus on Pass@1 performance, whereas optimizing Pass@k, which measures generation coverage for sustained exploration, remains under-explored yet critical in label-free settings. Optimizing Pass@k in the label-free setting is highly non-trivial, as directly applying the Pass@k advantage designs that are effective for RLVR yields unsatisfactory performance. Through in-depth empirical analysis, we discover the root causes hindering performance: pseudo-label estimations for low-confidence samples have a high probability of being incorrect, while candidate answers for high-confidence samples suffer from severe diversity collapse. To overcome these hurdles, we propose TTRL-CoCoV (Test-Time Reinforcement Learning with Confidence-Conditioned Verification), a novel confidence-adaptive framework that expands Pass@k coverage and improves Pass@1 performance. Based on our key insight that verification capability generally leads generation capability, TTRL-CoCoV employs a confidence-conditioned mechanism: for high-confidence samples, it bootstraps the verifier and applies an exploration-enhancing reward to prevent diversity collapse; for low-confidence samples, it delegates pseudo-label selection to the verifier to filter incorrect pseudo-labels; and for medium-confidence samples, it bypasses verification entirely. Extensive experiments demonstrate that TTRL-CoCoV outperforms the best competing methods across 6 widely recognized benchmarks, achieves average absolute gains of +9.8% in Pass@1 and +18.7% in Pass@16 over TTRL, and even achieves absolute Pass@1 improvements of up to +5.0% across multiple reasoning benchmarks when compared against fully supervised RL methods. Our code repository: https://github.com/shanjf666/CoCoV.
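
For readers unfamiliar with the metric, Pass@k is commonly computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them are correct. The abstract does not say how the authors compute it, so treating this estimator as the one in use is an assumption; the sketch below only illustrates the metric itself.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without replacement
    from n generated samples of which c are correct, is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when fewer than k incorrect samples exist,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

With n = 16 samples, a single correct answer already gives `pass_at_k(16, 1, 16) == 1.0` while Pass@1 stays at 1/16, which is why answer diversity matters so much more for Pass@k than for Pass@1.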

2606.03604 2026-06-03 cs.CL

Beyond the Literal: Decomposing Pragmatic Intent in Multimodal Meme Understanding

Zhengyi Zhao, Shubo Zhang, Zezhong Wang, Luyao Ye, Huimin Wang, Hanqi Yan, Binyang Li, Kam-Fai Wong, Yulan He

AI summary: To address the tendency of large vision-language models (LVLMs) to describe literal content rather than pragmatic intent when interpreting memes, proposes the Intent Projection framework, which performs literal-pragmatic decomposition at the representation, output, and objective levels, outperforming open-source models on six benchmarks and narrowing the gap to proprietary models.

Abstract

When asked what a meme or sarcastic post means, Large Vision Language Models (LVLMs) tend to describe what the image shows rather than what the author is trying to communicate. Standard instruction tuning entangles a post's literal content with its pragmatic meaning, letting surface-level details contaminate the final response. We reframe meme understanding as a problem of literal-pragmatic decomposition and propose Intent Projection, a framework that separates the two signals at the representation, output, and objective levels within a single LVLM backbone. At the representation level, an orthogonal projection module removes dominant unimodal directions from the fused image-text representation, retaining only the pragmatic residual, while a surface-real affect classifier anchors the decoder with a discrete tag that names the polarity gap. At the output level, the model externalizes a structured reasoning chain, and at the objective level a contrastive reward explicitly penalizes answers that restate the literal description. Across six multimodal benchmarks, Intent Projection consistently outperforms open-source baselines and narrows the gap to proprietary models, with the largest gains on high-divergence posts where literal collapse is most damaging.

2606.03603 2026-06-03 cs.CV cs.CL

World Models Meet Language Models: On the Complementarity of Concrete and Abstract Reasoning

Yucheng Zhou, Wei Tao, Yiwen Guo, Jianbing Shen

AI summary: Proposes a controlled concrete reasoning framework and the PF-OPSD method, which combine the visual simulation of world models with the abstract reasoning of multimodal large language models to improve performance and robustness on spatial lookahead and open-domain physical prediction tasks.

Abstract

World models and multimodal large language models (MLLMs) provide complementary capabilities for predicting future outcomes from static visual observations. World models can generate concrete visual rollouts of possible futures, while MLLMs can reason abstractly over questions, goals, and rules. However, generated rollouts are stochastic and may be visually plausible but task-incorrect, making it necessary to determine when visual simulation is useful, whether a rollout is credible, and how it should influence the final answer. We formulate this problem as controlled concrete reasoning, where a model learns to invoke, verify, and integrate visual future simulation alongside abstract reasoning. To study this setting, we construct two human-verified benchmarks, VRQABench for controllable spatial lookahead and OpenWorldQA for open-domain physical prediction, and propose Privileged-Future On-Policy Self-Distillation (PF-OPSD). During training, PF-OPSD uses ground-truth future videos and answers only as teacher-side privileged context to evaluate on-policy concrete-reasoning trajectories, while the deployable student never observes true futures at test time. Experimental results show that PF-OPSD outperforms the baseline by 10.6% and 10.9% on VRQABench and OpenWorldQA, respectively, while increasing robustness to noisy or conflicting rollouts. Our code and dataset are available at https://github.com/yczhou001/PF-OPSD.

2606.03602 2026-06-03 cs.LG cs.AI cs.CL

CauTion: Knowing When to Trust LLMs for Ensemble Causal Discovery

Bo Peng, Kaiwen Wu, Sirui Chen, Zhiheng Wang, Yu Qiao, Chaochao Lu

AI summary: Proposes the CauTion framework, which uses consensus filtering and LLM reliability estimation to reliably integrate LLM domain knowledge into an ensemble of statistical causal discovery algorithms, addressing both the limitations of purely statistical methods and LLM errors.

Abstract

Causal discovery from observational data remains challenging due to the fundamental limitations of purely statistical methods, such as statistical indistinguishability within equivalence classes and sensitivity to finite sample sizes. While large language models (LLMs) offer a promising source of domain knowledge to complement statistical inference, existing LLM-augmented methods are vulnerable to LLM errors and incur high token costs. Moreover, reliance on a single data-centric algorithm can make results sensitive to algorithm-specific biases. To address these limitations, we propose CauTion, a framework that reliably integrates LLM domain knowledge into an ensemble of statistical causal discovery algorithms through consensus filtering and LLM reliability estimation. CauTion proceeds in three stages. First, an algorithm ensemble uses consensus voting to resolve up to 96% of edges on which algorithms agree, achieving near-perfect accuracy on the filtered consensus edges. Second, a trust-calibrated arbitration mechanism estimates the relative reliability of the LLM and the algorithms via an annotation-free trust calibration procedure, which is then used to govern a trust-weighted voting process that restricts LLM arbitration exclusively to edges with unreliable algorithmic evidence. Third, a cycle repair step is applied to guarantee that the final causal graph is a valid acyclic graph. Experiments on six datasets demonstrate that CauTion consistently outperforms both data-centric and LLM-augmented baselines, with larger gains on larger graphs and strong robustness to LLM errors. Code is available at https://github.com/OpenCausaLab/CauTion.
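
The first stage, consensus filtering, can be illustrated with a small sketch. It assumes that "agree" means unanimity across the ensemble on whether a directed edge is present, which the abstract does not spell out; `split_by_consensus` and the example graphs are hypothetical.

```python
from typing import List, Set, Tuple

Edge = Tuple[str, str]  # directed edge (cause, effect)


def split_by_consensus(graphs: List[Set[Edge]]) -> Tuple[Set[Edge], Set[Edge]]:
    """Split proposed edges into those every algorithm outputs and those in dispute."""
    proposed: Set[Edge] = set().union(*graphs) if graphs else set()
    accepted = {edge for edge in proposed if all(edge in g for g in graphs)}
    disputed = proposed - accepted  # left for a later arbitration stage
    return accepted, disputed


# Outputs of three hypothetical causal discovery algorithms on variables A, B, C.
ensemble = [
    {("A", "B"), ("B", "C")},
    {("A", "B"), ("C", "B")},
    {("A", "B")},
]
accepted, disputed = split_by_consensus(ensemble)
```

Only the disputed edges (here the orientation between B and C) would need LLM arbitration, which is how the consensus stage keeps token costs down.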

2606.03601 2026-06-03 cs.SE cs.AI

DDOR: Delta Debugging for Explainable Overrefusal Testing and Repair

Qinyan Zhou, Peixin Zhang, Jun Sun, Haonan Zhang, Dongxia Wang

AI summary: Proposes the DDOR framework, which uses delta debugging to localize minimal refusal-triggering fragments (mRTFs), enabling automated testing and repair of LLM overrefusal in a black-box setting.

Abstract

While safety alignment and guardrails help large language models (LLMs) avoid harmful outputs, they can also induce overrefusal, i.e., unwarranted rejection of benign queries that merely appear risky. We present DDOR (Delta Debugging for OverRefusal), a fully automated and explainable framework for overrefusal testing and repair in a black-box setting, where only model inputs and outputs are accessible and internal safety mechanisms remain opaque. DDOR applies delta debugging to localize minimal refusal-triggering fragments (mRTFs) that provide phrase-level, explainable evidence for why a refusal occurs. Conditioned on these mRTFs, DDOR generates diverse, context-rich prompts and performs multi-oracle validation to filter intrinsically unsafe or ambiguous cases, producing scalable and model-specific overrefusal test suites (approximately 1K cases per model). Beyond evaluation, we further leverage localized mRTFs to perform targeted prompt repair, substantially reducing overrefusal while preserving the original intent and maintaining safety on genuinely harmful inputs. Overall, DDOR offers a practical end-to-end solution to both evaluate and mitigate overrefusal, improving LLM usability without sacrificing safety.
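
As a rough illustration of how delta debugging can isolate a minimal refusal-triggering fragment, the sketch below applies a simplified version of the classic ddmin algorithm to a list of words. It is not DDOR's implementation: the `triggers` oracle is a hypothetical stand-in for querying a black-box model and checking whether it refuses, and word-level splitting is an assumption.

```python
from typing import Callable, List


def ddmin(tokens: List[str], triggers: Callable[[List[str]], bool]) -> List[str]:
    """Shrink tokens to a 1-minimal subsequence for which triggers() still holds."""
    if not triggers(tokens):
        raise ValueError("the full input must trigger the behaviour")
    n = 2  # current number of chunks
    while len(tokens) >= 2:
        size = max(1, len(tokens) // n)
        chunks = [tokens[i:i + size] for i in range(0, len(tokens), size)]
        reduced = False
        for i, chunk in enumerate(chunks):
            if triggers(chunk):  # a single chunk is enough
                tokens, n, reduced = chunk, 2, True
                break
            rest = [t for j, c in enumerate(chunks) if j != i for t in c]
            if triggers(rest):  # this chunk can be dropped
                tokens, n, reduced = rest, max(n - 1, 2), True
                break
        if not reduced:
            if n >= len(tokens):
                break  # already at single-token granularity
            n = min(len(tokens), n * 2)
    return tokens


# Toy oracle: pretend the model refuses whenever the word "kill" appears.
prompt = "how do I kill a stuck python process on linux".split()
fragment = ddmin(prompt, lambda words: "kill" in words)
```

Each oracle call corresponds to one model query in a black-box setting, so the number of calls is the practical cost of localization.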

2606.03593 2026-06-03 cs.SE cs.RO

Making Embodied AI Reliable: A Community Agenda from Testing to Formal Verification

Xi Zheng, Dulanga Weerakoon, Yintong Huo, Teresa Yeo, Guy Van Den Broeck, Vijay Ganesh, Daniel Neider, Biplav Srivastava, Ivan Ruchkin, Archan Misra, Corina Pasareanu

AI summary: Drawing on discussions at the AAAI'26 Bridge Program, proposes a neuro-symbolic approach that integrates testing, formal verification, and runtime assurance to address the lifecycle reliability of embodied AI in open-world settings.

Abstract

Embodied AI systems are increasingly deployed in open-world environments, yet ensuring their reliability remains a fundamental challenge. Drawing on discussions from the AAAI'26 Bridge Program on "Making Embodied AI Reliable with Testing and Formal Verification", this article argues that reliability in embodied AI is inherently a lifecycle assurance problem arising from uncertainty, human interaction, and emergent behaviors across tightly coupled system components. We identify three complementary directions toward reliable embodied AI: (1) trustworthy scenario-based testing supported by validated specifications and meaningful coverage metrics, (2) compositional verification enabled by structured symbolic representations of system behavior and environmental context, and (3) runtime assurance mechanisms capable of adapting to uncertainty and distribution shifts during deployment. Rather than treating these approaches independently, we advocate integrated assurance workflows that connect testing, verification, and runtime adaptation through shared neuro-symbolic representations and continuous feedback across the system lifecycle. Such integration provides a foundation for building trustworthy embodied AI systems that can operate safely and reliably in complex real-world environments.

2606.03590 2026-06-03 cs.RO

CANMOT: Class-Aware Noise Modeling for Multi-Object Tracking in Autonomous Driving

Timo Osterburg, Stefan Schütte, Torsten Bertram

AI summary: For multi-object tracking in autonomous driving, proposes CANMOT, a class-aware and object-aligned noise modeling framework that introduces class-specific process and measurement noise covariance matrices, expressed in the object coordinate frame to preserve longitudinal-lateral anisotropy, improving tracking performance and substantially reducing identity switches.

Comments
Submitted to IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026)

Abstract

Kalman filter (KF)-based multi-object tracking (MOT) remains a strong baseline for autonomous driving due to its strong performance, computational efficiency and interpretability. In most practical systems, the process noise and measurement noise covariances are defined globally and shared across object classes, presuming identical uncertainty characteristics across heterogeneous traffic participants. This work revisits this assumption and proposes CANMOT, a class-aware and object-aligned noise modeling framework for KF-based 3D MOT. Class-specific diagonal process and measurement covariance matrices are introduced and optionally expressed in the object coordinate frame to preserve longitudinal-lateral anisotropy. Systematic experiments on the nuScenes benchmark show that class-aware and object-aligned noise modeling improves tracking performance and substantially reduces identity switches compared to state-of-the-art (SotA). In addition, the consistency of the estimated uncertainty is analyzed using the Average Normalized Estimation Error Squared (ANEES) and $χ^2$-based violation tests. The results reveal severe overconfidence in standard KF-based MOT baselines. While the proposed formulation improves calibration without modifying the underlying filtering framework, it still exhibits substantial inconsistency, highlighting the need for further research in this area. Code is available at https://github.com/rst-tu-dortmund/learned-3d-nms.
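The object-aligned idea in this abstract can be illustrated with a small sketch. It is not the authors' implementation: it only shows how a class-specific diagonal covariance, defined along an object's longitudinal and lateral axes, is rotated into the global frame by the yaw angle, i.e. Σ_global = R(θ) · diag(σ_lon², σ_lat²) · R(θ)ᵀ. The class names, the variance values, and the function name `object_aligned_covariance` are hypothetical placeholders.

```python
import math

# Hypothetical per-class variances (longitudinal, lateral) in m^2.
# The numbers are illustrative only and are not taken from the paper.
CLASS_NOISE = {
    "car": (1.0, 0.1),         # anisotropic: more uncertainty along the heading
    "pedestrian": (0.3, 0.3),  # isotropic
}

def object_aligned_covariance(cls, yaw):
    """Rotate diag(var_lon, var_lat) from the object frame into the global frame."""
    var_lon, var_lat = CLASS_NOISE[cls]
    c, s = math.cos(yaw), math.sin(yaw)
    # R @ diag(var_lon, var_lat) @ R.T written out for the 2x2 case,
    # with R = [[c, -s], [s, c]].
    xx = c * c * var_lon + s * s * var_lat
    xy = c * s * (var_lon - var_lat)
    yy = s * s * var_lon + c * c * var_lat
    return [[xx, xy], [xy, yy]]
```

For a car heading along the global y-axis (yaw = π/2), the larger longitudinal variance ends up on the y entry of the matrix. An isotropic class such as the placeholder pedestrian is unaffected by the rotation.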

2606.03584 2026-06-03 cs.LG cond-mat.dis-nn cs.NE

Training a Predictive Coding Network on ImageNet using Equilibrium Propagation

使用均衡传播在ImageNet上训练预测编码网络

Tugdual Kerjan, Rasmus Høier, Benjamin Scellier

AI总结 提出一种结合中心化均衡传播与新型均衡方案的预测编码网络训练方法,在ImageNet上训练10层卷积PCN,达到13.23% top-5错误率,接近反向传播基线。

详情
AI中文摘要

均衡传播(EP)是一种基于物理的训练框架,主要应用于能量模型,包括连续Hopfield网络、非线性电阻网络和耦合相位振荡器。然而,EP的实际应用至今仍局限于相对小规模的问题。预测编码网络(PCN)是另一类根植于计算神经科学的能量模型,通常使用专门的算法训练,同样尚未在大规模上得到验证。在这项工作中,我们开发了一种基于EP的PCN训练方法,该方法将中心化EP与一种新的PCN均衡方案相结合。使用这种方法,我们在全尺寸ImageNet上训练了一个10层卷积PCN(VGG10),在top-5分类任务上实现了13.23%的测试错误率,接近12.2%的反向传播基线。据我们所知,这是PCN和基于EP的训练首次在ImageNet规模上得到验证。这些结果显著扩展了两种方法的可扩展性,并表明在其他物理系统中扩展EP的主要挑战可能更多地来自这些系统的计算特性,而非EP框架本身的固有限制。

英文摘要

Equilibrium Propagation (EP) is a physics-based training framework that has primarily been employed in energy-based models, including continuous Hopfield networks, nonlinear resistive networks and coupled phase oscillators. However, EP's practical applications have so far remained limited to relatively small-scale problems. Predictive coding networks (PCNs), another class of energy-based models rooted in computational neuroscience, are typically trained with a specialized algorithm and have likewise not yet been demonstrated at large scale. In this work, we develop an EP-based training method for PCNs which combines the centered variant of EP with a novel equilibration scheme for PCNs. Using this approach, we train a 10-layer convolutional PCN (VGG10) on full-size ImageNet, achieving 13.23% test error rate on the top-5 classification task, close to the 12.2% backpropagation baseline. To our knowledge, this is the first demonstration of both PCNs and EP-based training at ImageNet scale. These results significantly extend the scalability of both approaches and suggest that the primary challenges in scaling EP in other physical systems may come more from the computational properties of these systems than from inherent limitations of the EP framework.

2606.03581 2026-06-03 cs.CV cs.RO

UnsOcc: 3D Semantic Occupancy Prediction in Unstructured Scene via Rendering Fusion

UnsOcc:非结构化场景下基于渲染融合的3D语义占用预测

Ye Wu, Ruiqi Song, Baiyong Ding, Nanxin Zeng, Junjie Cheng, Yunfeng Ai

AI总结 提出UnsOcc多模态框架,通过渲染融合模块和基于高斯溅射的细节感知辅助监督,解决非结构化场景中跨模态融合困难与长尾分布问题,在露天矿和nuScenes数据集上超越现有方法。

详情
Comments
8 pages
AI中文摘要

非结构化场景给自动驾驶带来了独特挑战,因为不规则障碍物和稀疏的场景布局削弱了3D目标检测等传统感知方法的有效性。3D语义占用预测因其能够通过为3D空间中的单个体素分配语义标签来提供密集的空间表示而成为研究热点。然而,将3D语义占用预测直接应用于非结构化场景仍然具有挑战性,因为场景稀疏性阻碍了有效的跨模态融合,并且这些场景中更严重的长尾分布进一步降低了预测性能。为了验证我们方法的有效性,我们构建了一个从露天矿收集的非结构化场景专用数据集。在此基础上,我们提出了UnsOcc,一种多模态3D语义占用预测框架,提高了在非结构化环境中的鲁棒性。其核心是,我们引入了一个基于渲染的融合模块RenderFusion,通过双向渲染监督增强跨模态特征对齐。此外,我们提出了GSRefinement,一种基于高斯溅射的细节感知辅助监督方法,将稀疏的3D占用预测投影到密集的2D语义分割图中,从而实现对长尾类别的有效监督。在露天矿数据集和nuScenes数据集上的大量实验表明,我们的方法显著优于现有的最先进方法。

英文摘要

Unstructured scenes present unique challenges for autonomous driving, as irregular obstacles and sparse scene layouts undermine the effectiveness of traditional perception methods such as 3D object detection. 3D semantic occupancy prediction has emerged as a prominent focus due to its ability to provide dense spatial representations by assigning semantic labels to individual voxels in 3D space. However, directly applying 3D semantic occupancy prediction to unstructured scenes remains challenging because scene sparsity hinders effective cross-modal fusion and the more severe long-tail distribution in these scenarios further degrades prediction performance. To validate the effectiveness of our approach, we construct a dedicated dataset of unstructured scenes collected from open-pit mines. Based on this, we propose UnsOcc, a multi-modal 3D semantic occupancy prediction framework that improves robustness in unstructured environments. At its core, we introduce a rendering-based fusion module, RenderFusion, which enhances cross-modal feature alignment through bidirectional rendering supervision. Furthermore, we propose GSRefinement, a detail-aware auxiliary supervision method based on Gaussian Splatting that projects sparse 3D occupancy predictions into dense 2D semantic segmentation maps, enabling effective supervision for long-tail categories. Extensive experiments on both the open-pit mine dataset and the nuScenes dataset demonstrate that our method significantly outperforms existing state-of-the-art approaches.

2606.03578 2026-06-03 cs.CV

Diffusing in the Right Space: A Systematic Study of Latent Diffusability

在正确空间中扩散:潜在可扩散性的系统研究

Tianxiong Zhong, Xingye Tian, Xuebo Wang, Xin Tao, Pengfei Wan

AI总结 本文系统研究潜在扩散模型中潜在表示的可扩散性,提出速度不可约方差(VIV)作为生成质量的稳定预测指标。

详情
AI中文摘要

潜在扩散模型利用视觉分词器将图像压缩到潜在空间以实现高效生成建模。然而,分词器更好的重建质量并不一定转化为更好的生成质量,这表明潜在表示不仅应通过保真度评估,还应通过其可扩散性评估。最近的研究提出了多种对扩散友好的潜在空间的解释,包括语义可分离性、仿射等变性、分布均匀性、空间结构、谱平滑性和流形连续性。然而,这些性质通常在一组有限的分词器上验证,导致不清楚哪些因素最能预测下游生成质量,以及这些结论是否适用于其引入的特定设置之外。在这项工作中,我们通过训练大量具有不同正则化策略、架构和潜在配置的分词器,并使用多个下游扩散骨干网络对其进行评估,对潜在可扩散性进行了系统研究。我们的分析确定了几个与生成质量持续相关且在实验设置中表现出强泛化能力的潜在性质。除了现有指标,我们引入了速度不可约方差(VIV),这是一种由轨迹交叉引起的速度模糊性的度量。大量实验表明,VIV是生成质量最稳定的预测因子之一。

英文摘要

Latent diffusion models leverage visual tokenizers to compress images into latent spaces for efficient generative modeling. However, better reconstruction quality of a tokenizer does not necessarily translate into better generation quality, suggesting that latent representations should be evaluated not only by fidelity but also by their diffusability. Recent studies have proposed diverse explanations for diffusion-friendly latent spaces, including semantic separability, affine equivariance, distribution uniformity, spatial structure, spectral smoothness, and manifold continuity. Yet these properties are often validated on a limited set of tokenizers, leaving it unclear which factors are most predictive of downstream generation quality and whether such conclusions hold beyond the specific settings in which they are introduced. In this work, we conduct a systematic study of latent diffusability by training a large collection of tokenizers with diverse regularization strategies, architectures, and latent configurations, and evaluating them with multiple downstream diffusion backbones. Our analysis identifies several latent properties that consistently correlate with generation quality and exhibit strong generalization across experimental settings. Beyond existing metrics, we introduce Velocity Irreducible Variance (VIV), a measure of velocity ambiguity induced by trajectory crossings. Extensive experiments show that VIV is one of the most stable predictors of generation quality.

2606.03577 2026-06-03 cs.CV

Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching

通过宽基线匹配激发多模态大语言模型中的复杂空间推理

Hao Zhong, Muzhi Zhu, Shenyan Zeng, Anzhou Li, Cong Chen, Hua Geng, Duochao Shi, Wentao Ye, Tao Lin, Hao Chen, Chunhua Shen

AI总结 本文提出ReasonMatch-Bench基准和动态对应强化学习(DCRL)方法,以系统评估和提升多模态大语言模型在宽基线匹配任务中的空间推理能力。

详情
Comments
CVPR 2026. Project page: https://aim-uofa.github.io/reasonmatch/ Code: https://github.com/aim-uofa/ReasonMatch
AI中文摘要

宽基线匹配(WBM)需要整合几何理解、视角变化、细粒度感知和遮挡推理,使其成为部署在物理环境中的多模态大语言模型(MLLMs)空间推理的一个具有挑战性的测试平台。然而,当前的MLLMs缺乏对这些能力的系统评估和训练框架。我们引入了ReasonMatch-Bench,这是一个根据视角位移和匹配粒度在室内、室外和以物体为中心的场景中分层的基准,并表明当前的MLLMs在细粒度宽基线对应上仍然存在困难:在一个困难的90样本子集上,人类标注者达到84.0 F1,而最佳现有基线达到37.2。为了弥补这一差距,我们构建了一个可扩展的数据生成管道,该管道从大规模视频-3D语料库(包括RGB-D视频和SfM重建)中自动提取宽基线视图对,产生多样且可验证的监督。我们进一步提出了动态对应强化学习(DCRL),它结合了图像级视角进展和点级对应课程,通过可验证的奖励改进WBM训练,无需显式的CoT监督。大量实验表明,DCRL显著提高了ReasonMatch-Bench的性能,并迁移到相关的空间基准,同时在几个基准上保持了通用视觉理解性能并取得了适度提升。

英文摘要

Wide-baseline matching (WBM) requires integrating geometric understanding, viewpoint changes, fine-grained perception, and occlusion reasoning, making it a challenging testbed for spatial reasoning in multimodal large language models (MLLMs) deployed in physical environments. However, current MLLMs lack systematic evaluation and training frameworks for these capabilities. We introduce ReasonMatch-Bench, a benchmark stratified by viewpoint displacement and matching granularity across indoor, outdoor, and object-centric scenarios, and show that current MLLMs still struggle with fine-grained wide-baseline correspondence: on a difficult 90-sample subset, human annotators achieve 84.0 F1, while the best existing baseline reaches 37.2. To bridge this gap, we build a scalable data-generation pipeline that automatically extracts wide-baseline view pairs from large-scale video-3D corpora, including RGB-D videos and SfM reconstructions, yielding diverse and verifiable supervision. We further propose Dynamic Correspondence Reinforcement Learning (DCRL), which combines Image-Level Viewpoint Progression and Point-Level Correspondence Curriculum to improve WBM training through verifiable rewards without explicit CoT supervision. Extensive experiments show that DCRL substantially improves ReasonMatch-Bench and transfers to related spatial benchmarks, while maintaining general visual understanding performance with modest gains on several benchmarks.

2606.03569 2026-06-03 cs.CV cs.AI

When Attention Collapses: Stage-Aware Visual Token Pruning from Structure to Semantics

当注意力坍缩时:从结构到语义的阶段感知视觉令牌剪枝

Jiahui Wang, Kai Zhang, Mai Han, Huanghe Zhang

AI总结 针对视觉语言模型推理中视觉令牌剪枝因依赖单一注意力分数导致特征多样性下降的问题,提出两阶段剪枝框架STS,先通过排斥采样最大化结构多样性,再通过指令感知交叉注意力过滤语义无关令牌,从而提升保留令牌的结构多样性与细粒度任务对齐。

详情
AI中文摘要

视觉语言模型(VLMs)展现了卓越的能力,但在推理过程中承受着巨大的计算开销。虽然视觉令牌剪枝提供了一种有前景的解决方案,但现有方法主要依赖于初始注意力分数。这种单一度量范式存在一个关键缺陷:高注意力分数会固有地坍缩到语义相似区域,从而严重降低特征多样性并丢弃重要的上下文细节。为解决这一问题,我们引入了结构到语义(STS),一种新颖的两阶段视觉令牌剪枝框架,明确解耦了剪枝过程。第一阶段采用基于排斥的采样机制,以最大化空间和结构多样性。第二阶段利用指令感知的交叉注意力,精确过滤掉与提示无关的令牌。这种两阶段协同构成了STS的核心,首先确保几何覆盖,然后根据语义相关性细化保留的令牌。大量评估表明,STS减轻了由基于注意力的选择引起的冗余,提高了保留视觉令牌的结构多样性和细粒度任务对齐。

英文摘要

Vision-Language Models (VLMs) have demonstrated remarkable capabilities but suffer from significant computational overhead during inference. While visual token pruning offers a promising solution, existing methods predominantly rely on initial attention scores. This single-metric paradigm presents a critical flaw: high attention scores inherently collapse onto semantically similar regions, thereby severely reducing feature diversity and discarding vital contextual details. To address this, we introduce Structure-to-Semantics (STS), a novel two-stage visual token pruning framework that explicitly decouples the pruning process. The first stage employs a repulsion-based sampling mechanism to maximize spatial and structural diversity. The second stage leverages instruction-aware cross-attention to precisely filter out prompt-irrelevant tokens. This two-stage synergy constitutes the core of STS, first ensuring geometric coverage and then refining the retained tokens according to semantic relevance. Extensive evaluations demonstrate that STS mitigates the redundancy caused by attention-based selection, improving both structural diversity and fine-grained task alignment of the preserved visual tokens.

2606.03568 2026-06-03 cs.CV cs.AI cs.LG cs.RO

Learned Non-Maximum Suppression for 3D Object Detection

用于3D目标检测的学习型非极大值抑制

Timo Osterburg, Stefan Schütte, Torsten Bertram

AI总结 提出两种基于学习的过滤模块(D2D-Rescore和GossipNet3D)替代启发式NMS,通过检测间关系提升3D检测性能,尤其改善小物体和稀有类别的检测精度。

详情
Comments
6 pages, accepted at IEEE Intelligent Vehicles Symposium (IV) 2026
AI中文摘要

后处理是基于激光雷达的3D目标检测中的关键阶段,必须过滤密集且重叠的提议以实现紧凑可靠的感知。本文引入了两个学习型过滤模块,通过利用检测之间的关系来替代启发式非极大值抑制(NMS)。D2D-Rescore采用基于Transformer的检测到检测(D2D)注意力,而GossipNet3D通过鸟瞰图中的局部消息传递将2D GossipNet概念适应到3D。一种与nuScenes评估协议对齐的度量感知匹配策略确保了训练和验证行为的一致性,从而提高了整体检测性能。与CircleNMS相比,两种方法都提高了平均精度(mAP)、nuScenes检测分数(NDS)和真阳性质量,特别是对于小物体和稀有类别,同时增加了最小的计算开销。这些结果表明,学习型的检测级过滤可以在不修改基础网络的情况下增强3D检测器的可靠性,为启发式抑制提供了一种原则性的替代方案。代码可在以下网址获取:https://github.com/rst-tu-dortmund/learned-3d-nms 。

英文摘要

Post-processing is a critical stage in LiDAR-based 3D object detection, where dense and overlapping proposals must be filtered for compact and reliable perception. This work introduces two learned filtering modules that replace heuristic non-maximum suppression (NMS) by leveraging relations among detections. D2D-Rescore employs transformer-based detection-to-detection (D2D) attention, while GossipNet3D adapts the 2D GossipNet concept to 3D through localized message passing in bird's-eye view. A metric-aware matching strategy aligned with the nuScenes evaluation protocol ensures consistent training and validation behavior, improving overall detection performance. Both approaches improve mean average precision (mAP), nuScenes detection score (NDS), and true positive quality compared to CircleNMS, particularly for small and infrequent classes, while adding minimal computational overhead. These results demonstrate that learned, detection-level filtering can enhance 3D detector reliability without modifying the base network, offering a principled alternative to heuristic suppression. Code is available at https://github.com/rst-tu-dortmund/learned-3d-nms .

2606.03566 2026-06-03 cs.CV cs.AI

Efficient Transformer-Based Localized Patch Sampling for Choroid Plexus Segmentation in Multiple Sclerosis

基于高效Transformer的局部块采样用于多发性硬化脉络丛分割

Po-Jui Lu, Alessandro Cagol, Mario Ocampo-Pineda, Federico Spagnolo, Marina Mastantuono, Andreea-Alexandra Aldea, Jannis Müller, Özgür Yaldizli, Matthias Weigel, Lester Melie-Garcia, Roberta Magliozzi, Maria Pia Sormani, Ludwig Kappos, Jens Kuhle, Cristina Granziera

AI总结 提出一种基于SwinUNETR和局部块采样的方法,实现多发性硬化侧脑室脉络丛的自动分割,在降低99%计算量的同时取得优于现有模型的Dice系数。

详情
AI中文摘要

背景:侧脑室脉络丛(LVCP)正逐渐被认为是与多发性硬化(MS)身体残疾和神经炎症相关的关键影像生物标志物。然而,LVCP的手动分割非常繁琐,限制了其在广泛临床试验和纵向评估中的应用。本研究旨在开发一种基于SwinUNETR的流程,利用靶向的脑室内和脑室周围小块采样,从独立和多模态MRI输入中自动分割MS中的LVCP。方法:我们回顾性评估了来自两个独立MS主导队列的三组数据的3T MRI扫描(数据集1:n=177;数据集2:n=177;扩展测试集:n=388)。我们的方法采用在32x32x32体素块上训练的SwinUNETR架构,并与3D UXNET模型进行基准比较。主要评估指标是Dice相似系数(DSC),辅以计算需求(GFLOPs)和95百分位豪斯多夫距离(HD95)。结果:在扩展测试集上,SwinUNETR模型在结合MPRAGE和FLAIR时获得了平均DSC为0.868(95% CI: 0.863-0.872),显著优于UXNET(DSC: 0.858 [95% CI: 0.853-0.862], p<0.0001)。当仅限于独立FLAIR输入时,基于Transformer的方法保持了0.863的高DSC,而UXNET的空间定位显著恶化(HD95: 1.86 vs. 3.00 mm)。重要的是,所提出的框架将计算负载降低了99%(91.8 vs. 22,080 GFLOPs)。通过将局部块采样与SwinUNETR架构相结合,该方法为LVCP分割提供了一种准确、稳健且统计上优于当前领先模型的替代方案。其巨大的计算成本降低使其非常适合在临床和研究环境中广泛实施。

英文摘要

Background: The lateral ventricle choroid plexus (LVCP) is gaining recognition as a key imaging biomarker for multiple sclerosis (MS) related to physical disability and neuroinflammation. Yet, manual segmentation of the LVCP is highly tedious, restricting its use in broad clinical trials and longitudinal assessments. This research aims to develop a SwinUNETR-driven pipeline that leverages targeted intra- and peri-ventricular small patch sampling to automatically segment the LVCP in MS from both standalone and multi-modal MRI inputs. Methods: We retrospectively assessed 3T MRI scans across three sets of data stemming from two separate MS-dominant cohorts (Dataset 1: n=177; Dataset 2: n=177; extended test set: n=388). Our method employed a SwinUNETR architecture trained on 32x32x32 voxel patches, benchmarking it against the 3D UXNET model. The primary metric for evaluation was the Dice Similarity Coefficient (DSC), supplemented by computational demand (GFLOPs) and the 95th percentile Hausdorff Distance (HD95). Results: On the extended test set, the SwinUNETR model secured a mean DSC of 0.868 (95% CI: 0.863-0.872) with MPRAGE and FLAIR combined, showing a statistically significant gain over UXNET (DSC: 0.858 [95% CI: 0.853-0.862], p<0.0001). When restricted to standalone FLAIR inputs, the transformer-based approach sustained a high DSC of 0.863, while the spatial localization of UXNET worsened considerably (HD95: 1.86 vs. 3.00 mm). Importantly, the proposed framework lowered computational load by 99% (91.8 vs. 22,080 GFLOPs). By integrating localized patch sampling with a SwinUNETR architecture, this methodology offers an accurate, robust, and statistically superior alternative to current leading models for LVCP segmentation. Its vast reduction in computational cost makes it ideal for widespread implementation in clinical and research environments.
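For readers who want to check the two headline quantities in this abstract, the sketch below computes a Dice similarity coefficient on flattened binary masks and the relative compute reduction implied by the reported GFLOPs. It is an illustration under simple assumptions, not the study's evaluation code: the function names are hypothetical, masks are plain lists of 0/1 values, and returning 1.0 for two empty masks is a convention chosen here.

```python
def dice(pred, target):
    """Dice similarity coefficient for two equal-length binary masks (0/1 lists)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total

def relative_reduction(new_cost, old_cost):
    """Fraction of the old cost that is saved by the new one."""
    return 1.0 - new_cost / old_cost

# GFLOPs figures as quoted in the abstract above.
saving = relative_reduction(91.8, 22080.0)
```

With the quoted figures, `saving` comes out at roughly 0.996, which is consistent with the reduction of 99% stated in the abstract.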

2606.03557 2026-06-03 cs.AI cs.HC

From Prompt to Service: An SLM-Based Agent Orchestration Gateway for AI-Driven Virtual Worlds

从提示到服务:基于SLM的AI驱动虚拟世界代理编排网关

Louis Nisiotis, Aimilios Hadjiliasi

AI总结 本文提出一种基于小语言模型的代理编排网关,通过意图驱动的服务路由解耦虚拟世界客户端与异构AI后端,并在虚拟博物馆测试床中验证了其可行性和效率。

详情
AI中文摘要

随着生成式AI能力的扩展,AI驱动的虚拟世界面临日益增长的架构挑战。用户通过世界内界面以多模态方式进行交互,但其请求需要根本不同的AI后端模型和计算资源。将这些能力直接嵌入虚拟世界系统会降低可扩展性、增加维护复杂性,并限制协调分布在边缘和云基础设施上的服务的能力。本文提出一种基于SLM的代理编排网关,这是一种轻量级运行时协调机制,通过意图驱动的服务路由将虚拟世界客户端与异构AI后端解耦。边缘部署的SLM对每个用户提示的语义意图进行分类,可配置的服务注册表验证并解析路由决策,然后透明地调用所选后端,从而无需修改客户端应用即可在虚拟世界中引入新的AI能力。该网关在InterwovenXR虚拟博物馆测试床中实现并评估。评估表明,紧凑型SLM可以在边缘硬件上作为可靠的意图路由器,并且任务特定的微调可以将参数低于十亿的模型转化为实用的低延迟路由器。一种分层配置将微调后的十亿以下参数模型作为路由器,与用于对话响应生成的较大SLM配对,证明可以在中端边缘硬件上部署,并且比将两个职责委托给单个模型更高效。研究结果表明,SLM可以支持虚拟世界中实用的AI服务编排,并且该工作贡献了一种经过评估的架构,用于可伸缩、可扩展且支持边缘的AI交互,使虚拟代理成为分布式生成式AI服务的访问点。

英文摘要

As generative AI capabilities expand, AI-driven virtual worlds face a growing architectural challenge. Users interact through in-world interfaces in multimodal ways, yet their requests demand fundamentally different AI backend models and computational resources. Embedding these capabilities directly into virtual world systems reduces extensibility, complicates maintenance, and limits the ability to coordinate services distributed across edge and cloud infrastructure. This paper presents an SLM-based Agent Orchestration Gateway, a lightweight runtime coordination mechanism that decouples a virtual world client from heterogeneous AI backends through intent-driven service routing. An edge-deployed SLM classifies the semantic intent of each user prompt, a configurable service registry validates and resolves the routing decision, and the selected backend is invoked transparently, enabling new AI capabilities to be introduced in the virtual world without modifying the client application. The gateway is implemented and evaluated within the InterwovenXR virtual museum testbed. The evaluation shows that compact SLMs can serve as reliable intent routers on edge hardware, and that task-specific fine-tuning can transform sub-billion-parameter models into practical, low-latency routers. A layered configuration pairing a fine-tuned sub-billion-parameter model as router with a larger SLM for conversational response generation is shown to be deployable on mid-range edge hardware and more efficient than delegating both responsibilities to a single model. The findings show that SLMs can support practical AI service orchestration in virtual worlds, and the work contributes an evaluated architecture for scalable, extensible, and edge-supported AI interaction, enabling virtual agents to become access points to distributed generative AI services.
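The routing flow described in this abstract (classify intent, validate against a registry, invoke the selected backend) can be sketched as follows. This is a toy illustration, not the paper's gateway: the backend functions, the intent labels, the fallback policy, and the keyword-based `classify_intent` (a stand-in for the edge-deployed SLM) are all assumptions made for the example.

```python
from typing import Callable, Dict

# Hypothetical backends; real ones would be calls to edge or cloud services.
def chat_backend(prompt: str) -> str:
    return f"[chat] {prompt}"

def image_backend(prompt: str) -> str:
    return f"[image] {prompt}"

# Configurable registry mapping intent labels to backends (labels are made up).
SERVICE_REGISTRY: Dict[str, Callable[[str], str]] = {
    "chat": chat_backend,
    "image_generation": image_backend,
}
FALLBACK_INTENT = "chat"  # assumed policy for labels the registry does not know

def classify_intent(prompt: str) -> str:
    """Stand-in for the SLM intent classifier: a keyword rule, not a model."""
    text = prompt.lower()
    if "draw" in text or "image" in text:
        return "image_generation"
    return "chat"

def route(prompt: str, classifier: Callable[[str], str] = classify_intent) -> str:
    intent = classifier(prompt)
    # Validate the predicted label against the registry before resolving it.
    if intent not in SERVICE_REGISTRY:
        intent = FALLBACK_INTENT
    return SERVICE_REGISTRY[intent](prompt)
```

Because the client only calls `route`, adding a new capability in this sketch means adding a registry entry and teaching the classifier a new label, without touching the caller.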

2606.03556 2026-06-03 cs.RO

Partially Observable Adversarial Patch Attacks on Vision-Language-Action Models in Robotics

部分可观测的对抗性补丁攻击在机器人视觉-语言-动作模型上的应用

Xiaofei Wang, Mingliang Han, Tianyu Hao, Yi Yang, Yun-Bo Zhao, Keke Tang

AI总结 针对机器人VLA模型,提出部分可观测威胁模型下的两阶段攻击框架,利用注意力图定位关键区域并优化补丁以破坏语义接地和增加动作轨迹曲率,导致长期任务失败。

详情
Comments
Accepted by IEEE Robotics and Automation Letters, 2026
AI中文摘要

视觉-语言-动作(VLA)模型在机器人领域受到关注,但其对对抗性攻击的鲁棒性仍鲜有探索。现有工作表明对抗性补丁可以误导基于VLA的机器人,但假设完全访问整个执行轨迹,这在实践中是不现实的。我们通过制定部分可观测威胁模型来解决这一限制,其中攻击者只能利用轨迹的短前缀来生成固定补丁,应用于所有后续帧。在此设置下,我们提出了一个两阶段框架。首先,我们使用模型的注意力图定位补丁,以识别与完整指令对应的视觉关键区域。然后,我们优化补丁以破坏目标对象的语义接地并增加动作轨迹的曲率,从而在感知和控制中复合故障。在模拟和真实机器人环境中的大量实验表明,我们的方法在部分可观测性下维持对抗效果,诱导长期中断并显著降低任务成功率。

英文摘要

Vision-language-action (VLA) models are gaining attention in robotics, yet their robustness to adversarial attacks remains largely unexplored. Existing work shows that adversarial patches can mislead VLA-based robots but assumes full access to the entire execution trajectory, an unrealistic requirement in practice. We address this limitation by formulating a partially observable threat model, where the adversary can exploit only a short prefix of the trajectory to generate a fixed patch applied to all subsequent frames. Under this setting, we propose a two-phase framework. First, we localize the patch using the model's attention maps to identify visually critical regions that correspond to the full instruction. Then, we optimize the patch to disrupt the semantic grounding of target objects and increase the curvature of action trajectories, thereby compounding failures in both perception and control. Extensive experiments in simulation and real-world robotic environments show that our method sustains adversarial effects under partial observability, inducing long-horizon disruptions and significantly reducing task success rates.

2606.03551 2026-06-03 cs.RO

NVIDIA Isaac Sim: Enabling Scalable, GPU-Accelerated Simulation for Robotics

NVIDIA Isaac Sim:实现可扩展的GPU加速机器人仿真

Sicong Gao, Maurice Pagnucco, Tomasz Bednarz, Yang Song

AI总结 本文系统综述了NVIDIA Isaac Sim的架构、应用模式及局限性,重点分析其GPU加速在大规模并行训练、合成数据生成和物理精确建模方面的优势,并探讨了未来方向。

详情
AI中文摘要

仿真已成为机器人研究的核心基础设施。与以往的仿真器不同,NVIDIA Isaac Sim利用GPU加速实现大规模并行训练和物理精确建模。其合成数据生成流水线缓解了高质量训练数据的稀缺性,支持数据驱动的机器人学习和大规模以仿真为中心的实验。然而,现有综述通常将其视为众多仿真器之一,缺乏对其架构特性、使用模式和局限性的系统分析。本文从系统和应用角度综述Isaac Sim,概述其架构并与广泛使用的仿真器进行比较。我们分析了五个主要领域的代表性研究,总结了常见的使用模式,特别是在数据生成和高保真仿真方面。我们还概述了关键的未来方向和挑战,包括物理开放世界学习、以仿真为中心的训练以及实际可用性约束。

英文摘要

Simulation has become a core infrastructure for robotics research. Unlike previous simulators, NVIDIA Isaac Sim leverages GPU acceleration to enable large-scale parallel training and physics-accurate modeling. Its synthetic data generation pipeline alleviates the scarcity of high-quality training data, supporting data-driven robot learning and large-scale simulation-centric experimentation. However, existing surveys often treat it as one simulator among many, without a systematic analysis of its architectural characteristics, usage patterns, and limitations. This survey reviews Isaac Sim from system and application perspectives, outlining its architecture and comparing it with widely used simulators. We analyze representative studies across five major domains and summarize common usage patterns, particularly in data generation and high-fidelity simulation. We also outline key future directions and challenges, including physics open-world learning, simulation-centric training and practical usability constraints.

2606.03549 2026-06-03 cs.LG math.PR

How Many Trees in a Random Forest? A Revisited Approach with Plateau Search and Optuna Integration

随机森林中需要多少棵树?一种结合平台搜索与Optuna集成的重新审视方法

Vadim Porvatov, Andrey Dukhovny, Andrey Lange

AI总结 提出一种基于三元组平台搜索的算法,通过监控袋外分数的相对变化自动确定随机森林的树数量,避免预设搜索范围,并提供了理论分析和实验验证。

详情
AI中文摘要

随机森林的超参数优化在调整树数量时面临一个特定困难:预测分数通常随集成规模单调提升,因此诸如树结构Parzen估计器(TPE)和Hyperband等标准方法需要预定义搜索范围,且往往将估计推向其右边界。早停策略避免了固定这样的范围,但对分数噪声敏感且容易过早停止。为解决此问题,我们提出一种集成的基于三元组的平台搜索算法,该算法将树数量从直接TPE搜索空间中移除,同时仍利用跨HPO试验积累的信息。该方法通过监控三个森林规模上的袋外(OOB)分数相对变化,自适应地跟踪接近最小的充分集成规模,并相应移动该三元组。这产生了一个基于容差参数的自动化且用户可解释的过程。我们还提供了理论分析:我们将所提出的相对OOB分数准则与当前分数和极限分数之间的差距联系起来,并推导了相应的基于OOB的绝对相对差异的渐近方差估计。实验表明,所选树数量可能与常见启发式方法有显著差异:对于大多数经典基准数据集,它更小;而对于一些高维生物信息学数据集(如Arcene和Dorothea),则更大。源代码和可重复实验可在以下网址获取:https://github.com/lange-am/rf_plateau_hpo 。

英文摘要

Hyperparameter optimization (HPO) for Random Forest faces a specific difficulty in tuning the number of trees: the predictive score typically improves monotonically with ensemble size, so standard methods such as Tree-structured Parzen Estimator (TPE) and Hyperband require a predefined search range and often drive the estimate toward its right boundary. Early-stopping strategies avoid fixing such a range, but can be sensitive to score noise and prone to premature stopping. To address this, we propose an integrated triplet-based plateau-search algorithm that removes the number of trees from the direct TPE search space and still exploits information accumulated across HPO trials. The method adaptively tracks a near-minimal sufficient ensemble size by monitoring relative changes in the out-of-bag (OOB) score across a triplet of forest sizes and shifting this triplet accordingly. This yields an automated and user-interpretable procedure based on a tolerance parameter. We also provide a theoretical analysis: we relate the proposed relative OOB-score criterion to the gap between the current and limiting scores, and derive an asymptotic variance estimate for the corresponding OOB-based absolute relative difference. Experiments show that the selected number of trees can differ substantially from the common heuristic: for most classical benchmark datasets it is smaller, whereas for some high-dimensional bioinformatics datasets, such as Arcene and Dorothea, it is larger. The source code and reproducible experiments are available at https://github.com/lange-am/rf_plateau_hpo.

2606.03545 2026-06-03 cs.RO

Static and Dynamic Representations for Tactile Contact-Angle Estimation with Event-Based Sensors

基于事件传感器的触觉接触角估计的静态与动态表示

Yanhui Lu, Efi Psomopoulou, Benjamin Ward-Cherrier

AI总结 本文利用事件触觉传感器(NeuroTac)的事件流,比较了三种事件衍生的空间轮廓表示(动态、静态及其组合)用于接触角估计,并验证了其在机器人操作中实现高频、低延迟触觉角度估计的潜力。

详情
Comments
8 pages, 8 figures. Submitted to IEEE Robotics and Automation Letters (RAL), under review
AI中文摘要

基于事件的触觉传感为接触密集的机器人交互提供了低延迟信号采集。本文研究了使用来自事件触觉传感器(NeuroTac)的事件流进行接触角估计,并比较了三种事件衍生的空间轮廓表示:捕获近期事件活动的动态表示、恢复更持久接触状态的静态表示以及它们的组合表示。在评估的运动场景中,所有表示管道在所有测试采样间隔下的P99处理延迟均低于10毫秒,展示了它们在机器人操作中用于高频基于事件的触觉角度估计的潜力。在特定场景训练下,静态表示始终比动态和组合表示表现略好,在连续传感器滚动期间产生平均总体MAE为0.160°,在随机插入的运动中断期间停止阶段平均MAE为0.251°。它还在速度和压痕深度变化方面表现出比其他两种表示更小的性能波动。

英文摘要

Event-based tactile sensing offers low-latency signal acquisition for contact-rich robotic interaction. This paper investigates contact-angle estimation using event streams from an event-based tactile sensor (NeuroTac) and compares three event-derived spatial contour representations: a dynamic representation capturing recent event activity, a static representation recovering a more persistent contact state, and their combined representation. Across the evaluated motion scenarios, all representation pipelines exhibited P99 processing latency below 10 ms at all tested sampling intervals, demonstrating their potential for high-frequency event-based tactile angle estimation in robotic manipulation. The static representation consistently achieved marginally better performance than the dynamic and combined representations under scenario-specific training, yielding a mean overall MAE of 0.160° during continuous sensor rolling and a stop-phase mean MAE of 0.251° during randomly inserted motion interruptions. It also exhibited smaller performance fluctuations across speed and indentation depth variations than the other two representations.

2606.03544 2026-06-03 cs.AI cs.CL

SAGE: A Quantitative Evaluation of Socialized Evolution in Agent Ecosystems

SAGE: 智能体生态中社会化演化的定量评估

Linyue Pan, Yaoming Zhu, Lin Qiu, Xuezhi Cao, Xunliang Cai

AI总结 提出SAGE框架,通过对比社会演化(SocialEvo)与自我演化(SelfEvo)两种计算条件,在三个领域评估共享经验对智能体性能的影响,发现群体历史并非普遍放大器,但能帮助陷入停滞的智能体取得突破,且社会收益依赖于抽象能力而非暴露量。

详情
Comments
13 pages, 5 figures
AI中文摘要

自我改进的语言智能体通常被孤立评估:一个智能体尝试任务、接收反馈并迭代优化自身行为。然而,智能体越来越多地与同伴一起运作,其策略和结果公开可见。这引发了一个研究不足的问题:共享经验何时能产生自我改进无法单独实现的改进?我们引入了SAGE(社会智能体群体演化),一个评估框架,比较两种计算匹配的条件:SocialEvo,其中来自五个不同模型家族的智能体共同演化,可访问所有同伴的历史;以及SelfEvo,其中每个智能体获得相同数量的任务尝试,但只能看到自己的过去,这是自我改进智能体研究中的常规做法。我们在三个领域实例化SAGE:开放式机器学习研究、长期经济规划和战略多人游戏,并在多个演化轮次中进行评估。我们发现群体历史并非普遍放大器:最强的智能体并未超过其自我演化上限。然而,在自我改进下停滞的智能体,当同伴经验可用时,可以取得重大突破。在竞争环境中,反事实控制显示智能体普遍改进,而非发展针对对手的策略。在不同形式的共享历史中,过滤后的同伴轨迹和反思性摘要通常优于原始日志,表明社会收益依赖于抽象而非暴露量。这些发现表明,同伴历史收益是智能体特定的、领域依赖的,并取决于从公共轨迹中抽象可转移知识的能力。

英文摘要

Self-improving language agents are typically evaluated in isolation: an agent attempts a task, receives feedback, and iteratively refines its own behavior. Yet agents increasingly operate alongside peers whose strategies and outcomes are publicly visible. This raises an under-studied question: when does shared experience produce improvements that self-improvement alone cannot achieve? We introduce SAGE (Social Agent Group Evolution), an evaluation framework that compares two compute-matched conditions: SocialEvo, where agents from five distinct model families co-evolve with access to all peers' histories; and SelfEvo, where each agent receives the same number of task attempts but sees only its own past, which is conventional in self-improving agent studies. We instantiate SAGE in three arenas: open-ended ML research, long-horizon economic planning, and strategic multiplayer play, evaluated across multiple evolutionary rounds. We find that group history is not a universal amplifier: the strongest agent does not exceed its self-evolution ceiling. However, agents that plateau under self-improvement can achieve significant breakthroughs when peer experience is available. In competitive settings, counterfactual controls reveal that agents improve generally rather than developing opponent-specific strategies. Across different forms of shared history, filtered peer traces and reflective summaries often outperform raw logs, indicating that social gains depend on abstraction rather than exposure volume. These findings reveal that peer-history gains are agent-specific, arena-dependent, and contingent on the capacity to abstract transferable knowledge from public traces.

2606.03540 2026-06-03 cs.CV

Attend to Anything: Foundation Model for Unified Human Attention Modeling

关注一切:统一人类注意力建模的基础模型

Wenzhuo Zhao, Ronghao Xian, Keren Fu, Qijun Zhao

AI总结 提出 Attend to Anything Model (AAM),一种多模态基础模型,通过层次化语言提示和双曲空间嵌入统一图像、视频和视听任务中的注意力建模,并在16个基准上平均提升6%,视频推理加速约4倍。

详情
Comments
Accepted to ICML 2026
AI中文摘要

现有人类注意力(显著性)建模方法在模态、场景和任务公式上高度碎片化。因此,即使模型容量和数据规模增加,当前模型仍主要依赖于场景且针对特定任务,无法在实际应用中泛化。为解决这些根本限制,我们提出了关注一切模型(AAM),一种多模态基础模型,统一了各种图像、视频和视听任务及场景中的注意力建模。AAM将注意力重新表述为一种认知蕴含关系,按通用到特定的层次组织,通过双曲空间中的层次嵌入语言提示实现。此外,为统一静态图像和动态视频注意力,我们采用流体动力学视角,将视频帧注意力建模为由Fokker-Planck方程控制的扩散时间演化。在16个基准上的大量实验表明,AAM在各种场景下平均比最先进方法高出6%,同时视频推理速度提升约4倍。总体而言,这些结果表明AAM为未来注意力和显著性相关任务的研究提供了原则性基础。数据集和代码将在 https://github.com/wz-zhao/Attend-to-Anything 提供。

英文摘要

Existing human attention (saliency) modeling methods remain highly fragmented across modalities, scenes, and task formulations. Consequently, even with increasing model capacity and data scale, current models remain predominantly scene-dependent and task-specific, and fail to generalize in real-world applications. To address these fundamental limitations, we present the Attend to Anything Model (AAM), a multi-modal foundation model that unifies attention modeling across various image, video, and audio-visual tasks and scenes. AAM reformulates attention as a cognitive entailment relationship organized in a general-to-specific hierarchy, implemented through language prompts with hierarchical embeddings in hyperbolic space. Furthermore, to unify static image and dynamic video attention, we adopt a fluid-dynamics perspective, formulating video-frame attention as a diffusive temporal evolution governed by the Fokker-Planck equation. Extensive experiments on 16 benchmarks demonstrate that AAM consistently outperforms state-of-the-art methods by an average of 6% across various scenarios, while achieving approximately a 4× speedup in video inference. Overall, these results demonstrate that AAM provides a principled foundation for future research on attention and saliency-related tasks. The dataset and code will be available at https://github.com/wz-zhao/Attend-to-Anything.

2606.03539 2026-06-03 cs.CV

Knowledge-Preserved Model Tuning in Null-Space for Robust Spatio-Temporal Video Grounding

零空间中知识保留的模型调优用于鲁棒的时空视频定位

Haoxuan Chen, Xianqin Liu, Jian-Fang Hu

AI总结 针对低质量视频导致预训练知识被破坏的问题,提出零空间调优(NST)框架,通过将可学习残差限制在冻结权重的零空间内来保留预训练知识,同时利用质量自适应单元和双空间重参数化合成残差,在混合质量基准上达到最优性能。

详情
Comments
Accepted by ICME 2026
AI中文摘要

时空视频定位旨在基于文本查询定位目标管。尽管近期方法取得了显著成功,但它们主要关注高质量输入,忽略了现实场景中广泛存在的低质量视频。虽然像LoRA这样的调优方法可以适应降质输入,但它们不可避免地破坏了预训练知识。为解决这一问题,我们提出了零空间调优(NST)。该框架利用了将冻结权重的零空间内的向量添加到层输入不会影响输出的几何性质。利用这一点,NST将可学习残差注入输入特征,这些残差可以选择性地对预训练骨干网络不可见。具体地,NST结合了质量自适应单元和双空间重参数化来合成这些残差,通过将高质量输入的组件限制在零空间内,同时将低质量输入的恢复组件引导至非零空间。由于冻结权重消除了零空间组件,我们有效地纠正了降质输入,同时保留了高质量输入的预训练知识。大量实验表明,NST在我们的混合质量基准上优于最先进的方法。

英文摘要

Spatio-Temporal Video Grounding aims to localize object tubes based on textual queries. While recent methods have achieved remarkable success, they mainly focus on high-quality (HQ) inputs, neglecting the widespread presence of low-quality (LQ) videos in real-world scenarios. Although tuning methods like LoRA can adapt to degraded inputs, they inevitably disrupt pre-trained knowledge. To address this, we propose Null-Space Tuning (NST). This framework exploits the geometric property that adding vectors within the null-space of frozen weights to the layer input does not affect the output. Leveraging this, NST injects learnable residuals into input features that can be selectively invisible to the pre-trained backbone. Specifically, NST combines the Quality-Adaptive Unit and Dual-Space Reparameterization to synthesize these residuals by confining components for HQ inputs to the null-space, while directing restoration components for LQ inputs to the non-null space. As the frozen weights eliminate null-space components, we effectively rectify degraded inputs while preserving pre-trained knowledge for HQ inputs. Extensive experiments show that NST outperforms state-of-the-art methods on our Mixed-Quality benchmark.
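The geometric property this abstract relies on can be checked numerically with a toy example. The sketch below only demonstrates the linear-algebra fact that W(x + n) = Wx whenever Wn = 0. It does not reproduce NST's Quality-Adaptive Unit or Dual-Space Reparameterization, and the 2×3 matrix `W` is an arbitrary stand-in for a frozen weight.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def cross(a, b):
    """Cross product of two 3-vectors; the result is orthogonal to both."""
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]

# Toy "frozen" weight with a one-dimensional null space.
W = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]

# A vector orthogonal to both rows of W lies in the null space of W.
n = cross(W[0], W[1])

x = [0.5, -1.0, 2.0]
# Add an arbitrarily scaled null-space residual to the input.
x_shifted = [xi + 10.0 * ni for xi, ni in zip(x, n)]

out_original = matvec(W, x)
out_shifted = matvec(W, x_shifted)
```

Here `out_original` and `out_shifted` agree, so the residual is invisible to the layer. A residual with a component outside the null space would change the output, which is the channel the abstract describes for correcting LQ inputs.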