arXivDaily: arXiv daily academic digest, updated Monday to Friday
2606.12215 2026-06-11 cs.CV cs.IR cs.LG 新提交

MLT-Dedup: Efficient Large-Scale Online Video Deduplication via Multi-Level Representations and Spatial-Temporal Matching

MLT-Dedup:通过多级表示和时空匹配的高效大规模在线视频去重

David Yuchen Wang, Haoying Li, Hailun Xu, Wei Chee Yew, Zirui Zhu, Sanjay Saha, Hao Hei, Kanchan Sarkar, Kun Xu

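Affiliations: TikTok Singapore; School of Computing, National University of Singapore; TikTok San Jose

AI summary: Proposes the MLT-Dedup framework, which uses a multi-level video encoder to extract fine-grained frame-level and sparse clip-level embeddings and pairs them with a differential feature-enhanced similarity module for spatial-temporal matching, reducing the online repetition rate by 91% at 90% precision and increasing index capacity 5x.
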
Comments
Accepted by KDD-2026 ADS track
AI中文摘要

在线平台上用户生成视频内容的爆炸性增长伴随着大量近似重复视频的出现——这些视频相同或高度相似,但存在部分编辑差异。这些重复视频降低了用户体验,增加了存储和带宽成本,使得大规模视频去重成为一项关键任务。现有的视频去重框架在有限的索引预算下检索足够高质量候选视频方面面临根本性挑战,同时在效率和精度之间存在权衡。为了解决这些问题,我们提出了MLT-Dedup,一种基于多级表示和时空匹配的高效大规模在线视频去重框架。我们的方法采用多级视频编码器(ML-VE)提取细粒度的帧级嵌入和稀疏的片段级嵌入:稀疏嵌入支持高效的候选检索,而细粒度嵌入则用于精确的成对匹配。在匹配过程中,我们引入了DiF-SiM,一种差分特征增强相似性模块,能够定位重复的时间片段并提供可靠的相似性证据,以支持基于策略的去重决策。在真实大规模平台上的大量实验表明,MLT-Dedup在90%精度下将在线重复率降低了91%。此外,我们的稀疏检索设计使索引容量提升了5倍,从而在实际部署中实现了更广泛的候选覆盖。

英文摘要

The explosive growth of user-generated video content on online platforms is accompanied by the emergence of numerous near-duplicate videos--videos that are identical or highly similar but differ by partial edits. These duplicates degrade user experience and increase storage and bandwidth costs, making large-scale video deduplication a critical task. Existing video deduplication frameworks face a fundamental challenge in retrieving sufficient high-quality candidates under a limited index budget, as well as trade-offs between efficiency and precision. To address these issues, we propose MLT-Dedup, an efficient large-scale online video deduplication framework with Multi-Level representations and spatial-Temporal matching. Our approach employs a Multi-Level Video Encoder (ML-VE) to extract both fine-grained frame-level and sparse clip-level embeddings: sparse embeddings support efficient candidate retrieval, while fine-grained embeddings are loaded for precise pairwise matching. During matching, we introduce DiF-SiM, a Differential Feature-enhanced Similarity Module capable of locating duplicated temporal segments and providing reliable similarity evidence to support policy-driven deduplication decisions. Extensive experiments on a real-world large-scale platform demonstrate that MLT-Dedup reduces online repetition rates by 91% at 90% precision. Furthermore, our sparse retrieval design achieves a 5x increase in indexing capacity, enabling broader candidate coverage in real-world deployment.
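
The two-stage flow described in the abstract (sparse clip-level embeddings for candidate retrieval, then fine-grained frame-level embeddings for pairwise matching) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy in-memory index, and the plain cosine threshold standing in for DiF-SiM are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_candidates(query_clips, index, top_k=2):
    """Stage 1 (hypothetical): rank indexed videos by their best
    clip-to-clip similarity, using only the sparse clip embeddings."""
    scored = []
    for video_id, clips in index.items():
        best = max(cosine(q, c) for q in query_clips for c in clips)
        scored.append((best, video_id))
    scored.sort(reverse=True)
    return [video_id for _, video_id in scored[:top_k]]

def frame_match_ratio(query_frames, cand_frames, threshold=0.95):
    """Stage 2 (hypothetical stand-in for DiF-SiM): fraction of query
    frames that have a near-identical frame in the candidate."""
    hits = sum(
        1 for q in query_frames
        if max(cosine(q, c) for c in cand_frames) >= threshold
    )
    return hits / len(query_frames)

# Toy clip-level index: video id -> list of clip embeddings.
index = {"a": [[1.0, 0.0], [0.0, 1.0]], "b": [[-1.0, 0.0]]}
candidates = retrieve_candidates([[1.0, 0.0]], index, top_k=1)
ratio = frame_match_ratio([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]])
```

Only the shortlisted candidates would need their frame-level embeddings loaded, which is the efficiency argument the abstract makes; a policy layer could then act on the match ratio.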

2606.12213 2026-06-11 cs.CV 新提交

SHERPA: Seam-aware Harmonized ERP Adaptation for Open-Domain 360$^\circ$ Panorama Generation

SHERPA: 面向开放域360°全景生成的无缝感知协调ERP适配

Jungwoon Kang, Jaehun Kim, Yiwon Yu, Hyungyum Jang, Sanghoon Lee, Jongyoo Kim

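Affiliations: Yonsei University

AI summary: Proposes SHERPA, a lightweight framework that adapts planar diffusion models to 360° panoramas through frequency-selective Circular RoPE, circular latent encoding/decoding, image-side FFN adapters, and a dual-path training scheme, supporting both photorealistic and stylized panorama generation.
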
Comments
29 pages, 23 figures, 5 tables. Preprint version
AI中文摘要

全景图像越来越多地用于世界生成、游戏和仿真中,用户不仅需要逼真的场景,还需要风格化和非逼真的环境。大规模文本到图像扩散和流模型为此目标提供了广泛的风格和语义先验,但平面图像训练使它们与等距柱状投影(ERP)表示的360°全景的环绕拓扑和极地区域不对齐。我们提出了SHERPA,一个轻量级适配框架,结合了频率选择性圆形RoPE、圆形潜编码/解码、图像侧FFN适配器和双路径训练方案。圆形RoPE仅将接缝敏感的高频水平RoPE带替换为整数周期谐波,同时保留预训练的低频频谱。配对全景路径监督几何,而未配对风格路径使用自监督偏航一致性进行无目标风格化提示。结果,SHERPA在逼真全景域和开放域风格化提示下生成360°全景。

英文摘要

Panoramic imagery is increasingly used in world-generation, games, and simulation, where users may need not only photorealistic scenes but also stylized and non-photorealistic environments. Large-scale text-to-image diffusion and flow models provide broad style and semantic priors for this goal, but planar image training misaligns them with the wrap-around topology and polar regions of $360^\circ$ panoramas represented in equirectangular projection (ERP). We present SHERPA, a lightweight adaptation framework that combines frequency-selective Circular RoPE, Circular Latent Encoding/Decoding, image-side FFN adapters, and a Dual-Path Training Scheme. Circular RoPE replaces only the seam-sensitive high-frequency horizontal RoPE band with integer-periodic harmonics while preserving the pretrained lower-frequency spectrum. The Paired Panorama Path supervises geometry, while the Unpaired Style Path uses self-supervised yaw consistency for target-free stylized prompts. As a result, SHERPA generates $360^\circ$ panoramas across both photorealistic panorama domains and open-domain stylized prompts.
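
One way to read "integer-periodic harmonics" is that a horizontal rotary frequency $\omega$ is seam-consistent on a panorama of width $W$ latent columns when $\omega W$ is a multiple of $2\pi$, so column $W$ rotates exactly like column 0. The sketch below illustrates that idea only. The snapping rule (round to the nearest harmonic), the cutoff, and all function names are assumptions and may differ from what SHERPA actually does.

```python
import math

def rope_frequencies(dim, base=10000.0):
    """Standard RoPE frequency ladder for a head dimension `dim`."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]

def make_circular(freqs, width, cutoff):
    """Snap frequencies at or above `cutoff` to the nearest integer
    harmonic 2*pi*k/width; leave lower frequencies untouched.
    (Hypothetical rule, for illustration only.)"""
    out = []
    for w in freqs:
        if w >= cutoff:
            k = max(1, round(w * width / (2 * math.pi)))
            out.append(2 * math.pi * k / width)
        else:
            out.append(w)
    return out

def seam_error(freq, width):
    """Chord distance on the unit circle between the rotation at
    column `width` and the rotation at column 0."""
    angle = freq * width
    return abs(complex(math.cos(angle), math.sin(angle)) - 1)

freqs = rope_frequencies(8)
circular = make_circular(freqs, width=64, cutoff=0.05)
```

In this toy setting the snapped high-frequency band has near-zero seam error, while the low-frequency entries keep their original values.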

2606.12210 2026-06-11 cs.CL 新提交

Can News Predict the Market? Limits of Zero-Shot Financial NLP and the Role of Explainable AI

新闻能否预测市场?零样本金融自然语言处理的局限性与可解释人工智能的作用

Ali M Karaoglu, Shreyank N Gowda

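Affiliations: University of Nottingham

AI summary: Using a zero-shot NLP framework that combines temporal aggregation with multi-layered explainability, the study finds that zero-shot methods fail to outperform simple baselines, yet explainability signals can separate reliable from unreliable predictions, underscoring the value of transparency and uncertainty awareness in decision support.
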
AI中文摘要

金融新闻能否可靠地预测短期股票波动?尽管大型语言模型取得了进展,但这一问题仍未解决。我们使用零样本自然语言处理框架重新审视该问题,研究模型能否在无需领域特定训练的情况下从金融新闻中提取可操作信号。我们设计了一个结构化流程,将零样本自然语言推理与时间聚合相结合,在整合跨文章信息时明确建模时效性和事件依赖的影响范围。为了解决高风险场景中对透明度的需求,我们引入了一个多层次可解释性框架,将预测与词元级、文章级和聚合证据联系起来,并生成基于文本的自然语言理由。在多个模型和预测时间跨度上,我们发现零样本方法始终无法超越简单基线,在负向波动上表现尤其薄弱,这表明将新闻情绪映射到短期价格动态存在更深层次的结构性限制。然而,可解释性信号能够可靠地区分可信和不可信的预测,即使在准确性有限的情况下也具有实用价值。这些发现凸显了零样本金融自然语言处理的局限性,并促使我们转向优先考虑透明性和不确定性感知的决策支持系统。代码:此 https URL

英文摘要

Can financial news reliably predict short-term stock movements? Despite advances in large language models, this question remains unresolved. We revisit this problem using a zero-shot natural language processing framework, investigating whether models can extract actionable signals from financial news without domain-specific training. We design a structured pipeline that combines zero-shot natural language inference with temporal aggregation, explicitly modelling recency and event-dependent impact horizons when integrating information across articles. To address the need for transparency in high-stakes settings, we introduce a multi-layered explainability framework that links predictions to token-level, article-level, and aggregate evidence, and produces grounded natural language rationales. Across multiple models and prediction horizons, we find that zero-shot approaches consistently fail to outperform simple baselines, with particularly weak performance on negative movements, suggesting deeper structural limitations in mapping news sentiment to short-term price dynamics. However, explainability signals reliably distinguish between trustworthy and unreliable predictions, offering practical value even when accuracy is limited. These findings highlight the limits of zero-shot financial NLP and motivate a shift toward decision-support systems that prioritise transparency and uncertainty awareness. Code: this https URL

2606.12207 2026-06-11 cs.RO cs.AI 新提交

Intelligent Automation for Embodied Benchmark Construction: Pipelines, Embodiments, Simulators, and Trends

具身基准构建的智能自动化:流程、具身、模拟器与趋势

Jinshan Lai, Jianwei Hu, Baoyang Jiang, Fengchun Zhang, Leyuan Wang, Haotian Li, Yida Wang, Tingxuan Huang, Xi Ren, Qiang Ma

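Affiliations: University of Electronic Science and Technology of China; Qiyuan Lab; Beijing University of Posts and Telecommunications; Tsinghua University; Beihang University

AI summary: Surveys a five-stage pipeline for constructing embodied-intelligence benchmarks, analyzes the shift from manual curation to automation and then to agentic closed-loop workflows, and argues that automation shifts cost toward validation and governance.
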
AI中文摘要

具身智能现已涵盖导航、家务辅助、操作、自动驾驶、空中智能体及多模态大模型控制。这一扩展使得基准构建成为可靠评估的核心瓶颈。与静态数据集不同,具身基准将任务规范、环境、机器人数据、演示、标注、指标、评估脚本和发布策略整合为一个评估系统。本综述通过五阶段构建流程回顾文献:需求与任务构建、数据获取、数据清洗与标注、基准套件生成与指标定义、评估执行与诊断反馈。针对每个阶段,分析从人工管理到传统自动化、基础模型辅助以及智能体闭环工作流的转变。同时比较了人工、数据与资产获取、计算与仿真、验证与调试、治理与维护以及返工风险等定性构建成本。主要结论是:自动化并非简单降低基准成本,而是往往将成本转向验证、可审计性、版本控制和长期治理。因此,具身评估的进展不仅取决于更大的基准套件,还取决于可诊断、可审计且可负责任地更新的构建流程。

英文摘要

Embodied intelligence now spans navigation, household assistance, manipulation, autonomous driving, aerial agents, and multimodal large-model control. This expansion has made benchmark construction a central bottleneck for reliable evaluation. Unlike static datasets, embodied benchmarks combine task specifications, environments, robot data, demonstrations, annotations, metrics, evaluation scripts, and release policies into a single evaluation system. This survey reviews the literature through a five-stage construction pipeline: requirement and task construction, data acquisition, data cleaning and annotation, benchmark suite generation and metric definition, and evaluation execution with diagnostic feedback. For each stage, the survey analyzes the transition from manual curation to traditional automation, foundation-model assistance, and agentic closed-loop workflows. It also compares qualitative construction costs across human labor, data and asset acquisition, compute and simulation, validation and debugging, governance and maintenance, and rework risk. The main conclusion is that automation does not simply reduce benchmark cost. Instead, it often shifts cost toward validation, auditability, version control, and long-term governance. Progress in embodied evaluation will therefore depend not only on larger benchmark suites, but also on construction pipelines that are diagnosable, auditable, and responsibly refreshable.

2606.12203 2026-06-11 cs.CL 新提交

Adaptive Multi-Resolution Procedural Knowledge Compression for Large Language Models

自适应多分辨率程序性知识压缩用于大型语言模型

Changyue Wang, Weihang Su, Qingyao Ai, Yichen Tang, Runzhong Qiao, Xuancheng Li, Min Zhang, Yiqun Liu

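Affiliations: Department of Computer Science and Technology, Tsinghua University

AI summary: Proposes SKIM, a framework that compresses procedural skills into adaptive multi-resolution soft tokens, reducing skill token length to 30%-60% of the original while preserving task performance.
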
AI中文摘要

大型语言模型(LLM)被广泛用于处理具有自主工作流的复杂任务。最近,可重用的自然语言技能作为一种流行的范式出现,用于向LLM应用程序注入程序性知识。由于流行的技能经常被重复调用,将它们的完整文本放在每个上下文中会显著增加预填充成本和延迟。虽然文本压缩技术有潜力解决这个问题,但大多数现有方法旨在压缩文档中的事实性知识而非程序性知识,这使得它们不足以用于技能压缩。在本文中,我们认为有效的技能压缩方法应该:1)保留工作流和工具协议之间的逻辑依赖关系;2)支持对频繁更新的社区技能进行轻量级、离线压缩;3)能够适应不同技能之间的复杂性变化。为了解决这个问题,我们提出了SKIM(SKIll coMpression),一个用于程序性技能的自适应多分辨率软令牌压缩框架。根据每个技能的复杂性,SKIM创建不同数量的软令牌,这不仅提高了LLM推理的效率,而且保留了技能使用的有效性。实验表明,SKIM将技能压缩到其原始令牌长度的30%到60%,同时比现有的压缩方法更好地保持了任务性能。我们已在此 https URL 发布了我们的代码。

英文摘要

Large language models (LLMs) are widely used to tackle complex tasks with autonomous workflows. Recently, reusable natural language skills have emerged as a popular paradigm to inject procedural knowledge into LLM applications. Since popular skills are often invoked repeatedly, placing their full text in every context significantly increases prefill cost and latency. While text compression techniques have the potential to solve this problem, most existing methods are designed to compress factual knowledge in documents instead of procedural knowledge, making them insufficient for skill compression. In this paper, we argue that an effective skill compression method should: 1) preserve logical dependencies among workflows and tool protocols, 2) enable lightweight, offline compression for frequently updated community skills, and 3) be adaptable to varying complexities across skills. To address this, we present SKIM (SKIll coMpression), an adaptive multi-resolution soft token compression framework for procedural skills. Depending on the complexity of each skill, SKIM creates different numbers of soft tokens that not only improve the efficiency of LLM inference, but also preserve the effectiveness of skill usage. Experiments indicate that SKIM compresses skills to 30 to 60 percent of their original token length while preserving task performance better than existing compression methods. We have released our code at this https URL.

2606.12195 2026-06-11 cs.CV 新提交

InternVideo3: Agentify Foundation Models with Multimodal Contextual Reasoning

InternVideo3: 用多模态上下文推理代理化基础模型

Ziang Yan, Sheng Xia, Jiashuo Yu, Yue Wu, Tianxiang Jiang, Songze Li, Kanghui Tian, Yicheng Xu, Yinan He, Kai Chen, Limin Wang, Yu Qiao, Yi Wang

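Affiliations: Shanghai Innovation Institute; Shanghai AI Laboratory; Nanjing University

AI summary: Proposes InternVideo3, which strengthens long-video understanding and iterative interaction through Multimodal Contextual Reasoning (MCR) and M^2LA, an efficient KV-cache compression method, achieving strong performance on several benchmarks.
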
AI中文摘要

近期基础模型的进展已转向涉及多步推理和工具使用的代理行为。然而,开源工作主要聚焦于文本主导的场景,使得长时域多模态任务探索不足。这一差距在需要持续时间理解和迭代交互的视频任务中尤为明显。我们提出InternVideo3,一个通过多模态上下文推理(MCR)增强这些能力的框架。MCR将理解视为一个闭环过程,作用于包含观察、指令、推理、工具动作和记忆的共享演化上下文。这将长视频理解框架化为证据积累与验证。为确保效率,我们引入多模态多头潜在注意力(M^2LA),一种保留令牌的重参数化方法,压缩KV缓存状态同时保留完整令牌流。我们的分阶段训练包括持续预训练、短到长监督微调、基于规则的强化学习以及在线策略蒸馏。实验表明,InternVideo3在Video-MME、MLVU和EgoSchema等基准上取得了强性能。我们进一步将该模型实例化为带有检索工具的视频代理,展示了稳健的基于证据的行为。我们的结果表明,高效的上下文处理和闭环推理对于将开放多模态模型适应于长时域视觉接地代理至关重要。

英文摘要

Recent progress in foundation models has shifted toward agentic behavior involving multi-step reasoning and tool use. However, open-source efforts largely focus on text-dominant settings, leaving long-horizon multimodal tasks underexplored. This gap is evident in video tasks requiring sustained temporal understanding and iterative interaction. We present InternVideo3, a framework enhancing these capabilities via Multimodal Contextual Reasoning (MCR). MCR treats understanding as a closed-loop process over a shared, evolving context containing observations, instructions, reasoning, tool actions, and memory. This frames long-video understanding as evidence accumulation and verification. To ensure efficiency, we introduce Multimodal Multi-head Latent Attention (M^2LA), a token-preserving reparameterization compressing KV-cache states while retaining the full token stream. Our staged training includes continued pretraining, short-to-long supervised fine-tuning, rule-based reinforcement learning, and on-policy distillation. Experiments show InternVideo3 achieves strong performance on benchmarks like Video-MME, MLVU, and EgoSchema. We further instantiate the model as a video agent with retrieval tools, demonstrating robust evidence-grounded behavior. Our results suggest that efficient context handling and closed-loop reasoning are vital for adapting open multimodal models toward long-horizon visually grounded agency.

2606.12189 2026-06-11 cs.CV 新提交

DynaTok: Token-Based 4D Reconstruction from Partial Point Clouds

DynaTok: 基于Token的部分点云4D重建

Weirong Chen, Keisuke Tateno, Hidenobu Matsuki, Michael Niemeyer, Daniel Cremers, Federico Tombari

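Affiliations: Technical University of Munich; Google; Imperial College London; University of Bonn

AI summary: Proposes DynaTok, which uses a Transformer-based spatiotemporal encoder and a flow-matching decoder to reconstruct complete, temporally consistent 4D point clouds from partial point cloud sequences, without correspondences or images.
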
Comments
ICML 2026. Project page: this https URL
AI中文摘要

我们解决从部分点云序列的4D重建问题,其中深度传感器观测不完整、无序且缺乏显式时间对应。这种仅几何的设置由于缺失观测和模糊动态而具有挑战性。尽管最近的进展主要依赖于基于图像的方法,现有的基于点的方法通常关注单个物体、假设相对完整的输入或需要显式对应。为了解决这些限制,我们提出了DynaTok,一个基于点的框架,用于从部分点云序列中无对应地进行4D重建,无需图像。DynaTok将帧编码为紧凑的潜在token,通过基于Transformer的时空编码器随时间聚合不完整的观测,并通过统一模型中的残差token解耦几何和运动。然后,一个流匹配解码器以潜在token为条件,重建完整且时间一致的4D点云序列。在物体和场景级基准上的实验表明,从部分点云观测中重建质量和时间一致性得到了改善。项目页面:此https URL。

英文摘要

We address 4D reconstruction from partial point cloud sequences, where depth-sensor observations are incomplete, unordered, and lack explicit temporal correspondences. This geometry-only setting is challenging due to missing observations and ambiguous dynamics. While recent progress has largely relied on image-based methods, existing point-based approaches typically focus on single objects, assume relatively complete inputs, or require explicit correspondences. To address these limitations, we propose DynaTok, a point-based framework for correspondence-free 4D reconstruction from partial point cloud sequences without images. DynaTok encodes frames into compact latent tokens, aggregates incomplete observations over time with a Transformer-based spatiotemporal encoder, and decouples geometry and motion through residual tokens in a unified model. A flow-matching decoder then reconstructs complete, temporally consistent 4D point-cloud sequences conditioned on the latent tokens. Experiments on object- and scene-level benchmarks demonstrate improved reconstruction quality and temporal coherence from partial point cloud observations. Project page: this https URL.

2606.12186 2026-06-11 cs.CL 新提交

A Resource for Enthymeme Detection in Controversial Political Discourse

争议性政治话语中省略推理检测的资源

Martial Pastor, Nelleke Oostdijk

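Affiliations: Centre for Language Studies, Radboud University Nijmegen

AI summary: Presents a resource of tweets annotated for enthymemes and their structure, with guidelines based on Walton's argumentation schemes. A complexity analysis reveals sources of annotation inconsistency, and experiments show that models trained on annotator disagreement outperform those trained on majority-vote labels.
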
Comments
43 pages, to be submitted to the Language Resource and Evaluation Journal
AI中文摘要

省略推理(enthymemes)是指前提或结论未明确陈述的论证,在说服性话语中普遍存在,但其标注历来具有高度主观性。我们提供了一个来自政治争议性话语的1,482条推文资源,由五位标注者标注了省略推理的存在及其论证结构,旨在研究标签变异性。我们首先重新审视省略推理的定义,并提出了基于Walton论证方案的标注指南,提供了一种结构化且受约束的方法,同时保留了任务解释性空间。这与以往资源形成对比,后者倾向于消除分歧,掩盖其来源并阻止研究其对模型性能的潜在益处。我们进一步提出了任务的复杂性分析,识别了标注中认知负荷高的环节及其可能引发不一致标注的原因。初步实验表明,基于标注者分歧训练的模型优于基于硬多数投票标签训练的模型。最后,我们反思了省略推理定义和指南中的结构开放性如何能够为未来资源和关注人类推理的下游NLP应用研究主观推理过程中的变异性提供支持。

英文摘要

Enthymemes, arguments with unstated premises or conclusions, are pervasive in persuasive discourse, yet their annotation remains notoriously subjective. We present a resource of 1,482 tweets from politically controversial discourse, annotated by five annotators for the presence of enthymemes and their argument structure, designed to study label variation. We first revisit the definition of enthymemes and propose annotation guidelines anchored in Walton's argumentation schemes, offering a structured and constrained approach that nonetheless preserves room for the interpretive nature of the task. This contrasts with past resources, which tend to eliminate disagreement, obscuring its sources and preventing investigation of its potential benefits for model performance. We further propose a complexity analysis of the task, identifying where annotation imposes high cognitive load and may give rise to inconsistent annotation. Our preliminary experiments show that models trained on annotator disagreement outperform models trained on hard majority-vote labels. We close by reflecting on how structural openness in enthymeme definitions and guidelines enables the study of variation in subjective inferential processes for future resources and downstream NLP applications concerned with human inference.

2606.12182 2026-06-11 cs.LG math.DS math.OC 新提交

How Low Can You Go? Active Learning for Sparse Model Discovery in the Ultra-Low-Data Limit

你能低到多少?超低数据极限下稀疏模型发现的主动学习

Ana Larrañaga, Urban Fasel, Steven L. Brunton

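Affiliations: Department of Mechanical Engineering, University of Washington; NSF AI Institute in Dynamic Systems, University of Washington; Department of Aeronautics, Imperial College London

AI summary: To address data scarcity when discovering governing equations of dynamical systems in the ultra-low-data limit, proposes an E-SINDy-based active learning strategy that iteratively prioritizes informative regions. On the Lorenz, Burgers, and Kuramoto-Sivashinsky systems it identifies the dynamics accurately with fewer samples than random sampling.
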
Comments
20 pages, 10 figures
AI中文摘要

识别复杂动力系统的控制方程仍然是科学和工程中的一个基本挑战。虽然早期方法依赖于经验数据和启发式方法,但现代数据驱动方法提供了更大的灵活性和更少的假设。然而,在实际环境中获取数据通常成本高昂。本文通过引入一种主动学习策略来解决这一挑战,用于超低数据极限下的动力学发现。我们的方法不是随机采样,而是迭代地优先考虑对模型识别最有信息量的区域。该方法基于稀疏非线性动力学识别(SINDy),并利用集成扩展E-SINDy来估计认知不确定性并指导常微分方程和偏微分方程(ODEs/PDEs)的采样。对于ODEs,在Lorenz系统上进行了详尽的分析,考虑了不同的数据预算和噪声水平。对于PDEs,研究了两个具有对比动力学特性的系统:Burgers方程,其中尖锐的激波前沿区分了信息丰富和信息贫乏的区域;以及Kuramoto-Sivashinsky方程,它呈现出更复杂的空间采样景观。在所有场景中,所提出的方法都能以比随机采样显著更少的数据样本准确识别控制动力学。

英文摘要

Identifying the governing equations of complex dynamical systems remains a fundamental challenge across science and engineering. While early approaches relied on empirical data and heuristics, modern data-driven methods offer greater flexibility and fewer assumptions. However, data acquisition in real-world settings is often expensive. This work addresses this challenge by introducing an active learning strategy for dynamics discovery in the ultra-low data limit. Rather than sampling randomly, our method iteratively prioritizes regions that are most informative for model identification. This approach builds on Sparse Identification of Nonlinear Dynamics (SINDy), and utilizes an ensemble extension, E-SINDy, to estimate epistemic uncertainty and guide the sampling for both ordinary and partial differential equations (ODEs/PDEs). For ODEs, an exhaustive analysis is conducted on the Lorenz system across varying data budgets and noise levels. For PDEs, two systems with contrasting dynamical characteristics are examined: the Burgers' equation, where a sharp shock front creates a distinction between informative and uninformative regions, and the Kuramoto-Sivashinsky equation, which presents a more spatially complex sampling landscape. Across all scenarios, the proposed method accurately identifies the governing dynamics with significantly fewer data samples than random sampling.
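
The acquisition step described above (use ensemble disagreement as a proxy for epistemic uncertainty, then sample where it is largest) can be sketched as follows. This covers only the selection rule, assuming an ensemble of coefficient vectors has already been obtained from bootstrapped sparse fits. The polynomial library, the variance criterion, and the function names are illustrative assumptions, not the paper's exact procedure.

```python
from statistics import pvariance

def library(x):
    """Hypothetical candidate library of terms: [1, x, x^2]."""
    return [1.0, x, x * x]

def predict(coeffs, x):
    """Evaluate one ensemble member's model at state x."""
    return sum(c * t for c, t in zip(coeffs, library(x)))

def next_sample(ensemble, candidates):
    """Pick the candidate point where ensemble members disagree most."""
    def disagreement(x):
        return pvariance([predict(c, x) for c in ensemble])
    return max(candidates, key=disagreement)

# Three ensemble members that agree on the linear term but not on x^2.
ensemble = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.1], [0.0, 1.0, -0.1]]
chosen = next_sample(ensemble, [0.0, 1.0, 2.0])
```

Here the members differ only in the quadratic coefficient, so the disagreement grows with |x| and the largest candidate is selected; after measuring there, the ensemble would be refit and the loop repeated.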

2606.12171 2026-06-11 cs.CV cs.LG 新提交

Beyond Dark Knowledge: Mixup-Based Distillation for Reliable Predictions

超越暗知识:基于混合的蒸馏实现可靠预测

José Medina, Paul Honeine, Abdelaziz Bensrhair, Amnir Hadachi

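Affiliations: ITS Lab, Institute of Computer Science, University of Tartu; LITIS, Université de Rouen; LITIS, INSA de Rouen

AI summary: Studies the effect of teacher-student mismatch when knowledge distillation is combined with mixup training, finds that the student independently acquires linear structure and improves accuracy and calibration, and presents mixup distillation as a richer knowledge-transfer channel.
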
AI中文摘要

知识蒸馏(KD)和混合(mixup)已被证明能有效诱导类别边界的平滑性:KD捕捉概率分布中的固有类别关系,而混合通过输入的凸组合强制执行这些关系。然而,它们的相互作用仍未被充分理解,特别是当混合仅在学生训练期间应用时。在这种情况下,教师被查询来自其训练期间从未见过的邻域分布的输入,这是一种受控的不匹配,其对知识转移的影响尚未被表征。我们表明,这种不匹配导致教师的监督信号被分布混淆而非类间结构主导。尽管如此,学生并非仅仅模仿教师:它独立地在邻域区域获得更大的线性度,这是教师缺乏的结构特性,并超越了暗知识转移。与基线相比,带有混合的KD持续提高学生准确率,并将过度自信降低一个数量级,在CIFAR和ImageNet上使用不同容量的教师均如此。关键的是,校准独立于准确率转移从教师传播到学生,温度缩放控制着可测量的准确率-校准权衡,在邻域训练下这种权衡更加明显。这些结果将混合蒸馏重新定义为不是标准KD的退化版本,而是一个更丰富的传递通道,同时塑造判别性能、不确定性估计和表示几何。

英文摘要

Knowledge Distillation (KD) and mixup have proven effective at inducing smoothness in class boundaries; KD captures inherent class relationships in probability distributions, and mixup enforces them through convex combinations of inputs. Their interaction, however, remains poorly understood, particularly when mixup is applied only during student training. In this setting, the teacher is queried on inputs drawn from a vicinal distribution it never saw during training, a controlled mismatch whose effect on knowledge transfer has not been characterised. We show that this mismatch causes the teacher's supervisory signal to be dominated by distributional confusion rather than inter-class structure. Despite it, the student does not merely imitate the teacher: it independently acquires greater linearity in the vicinal region, a structural property that the teacher lacks, and goes beyond dark-knowledge transfer. KD with mixup consistently improves student accuracy and reduces overconfidence by an order of magnitude relative to the baseline, across CIFAR and ImageNet with varying-capacity teachers. Crucially, calibration propagates from teacher to student independently of accuracy transfer, and temperature scaling governs a measurable accuracy-calibration trade-off that becomes more pronounced under vicinal training. These results reframe mixup distillation not as a degraded version of standard KD, but as a richer transfer channel that simultaneously shapes discriminative performance, uncertainty estimation, and representational geometry.
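
The setting studied here (mixup applied only on the student side, with the teacher queried on the mixed input) can be illustrated with the standard building blocks below. This is a generic sketch, assuming the usual temperature-scaled KL distillation loss with the conventional $T^2$ factor; the paper's exact objective and hyperparameters are not given in the abstract, and the function names are hypothetical.

```python
import math

def mixup(x_i, x_j, lam):
    """Convex combination of two inputs: lam * x_i + (1 - lam) * x_j."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, times T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

# The mixed input would be fed to both teacher and student; the teacher
# never saw such vicinal points during its own training.
x_mixed = mixup([1.0, 0.0], [0.0, 1.0], lam=0.25)
```

Raising the temperature softens both distributions, which is the knob the abstract ties to the accuracy-calibration trade-off.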

2606.12169 2026-06-11 cs.CV cs.AI cs.CL cs.LG 新提交

OpenMedReason: Scientific Reasoning Supervision for Medical Vision-Language Models

OpenMedReason: 医学视觉语言模型的科学推理监督

Negin Baghbanzadeh, Pritam Sarkar, Michael Colacci, Abeer Badawi, Adibvafa Fallahpour, Arash Afkanpour, Leonid Sigal, Ali Etemad, Elham Dolatabadi

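Affiliations: York University; Vector Institute; University of British Columbia; University of Toronto; Unity Health Toronto / St. Michael’s Hospital; University Health Network; Arc Institute; Queen's University

AI summary: Proposes OpenMedReason, a large-scale open medical reasoning corpus of approximately 450K image-question-answer instances whose reasoning traces come primarily from biomedical scientific articles, together with the OpenMedReason-Bench benchmark for fine-grained evaluation. It improves model performance in both supervised fine-tuning and reinforcement-based alignment.
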
Comments
42 pages, 9 figures, 24 tables. Dataset and code: this https URL
AI中文摘要

高风险临床使用大型视觉语言模型(LVLMs)需要基于视觉证据和临床知识的推理,而不仅仅是正确的最终答案。我们引入了OpenMedReason,这是一个大规模、开放的多模态医学推理语料库,包含约45万图像-问题-答案实例,其推理轨迹主要来自策划的生物医学、人类撰写的科学文章。OpenMedReason提供了超越合成思维链的高保真监督,涵盖了多种医学领域视觉模态,如放射学扫描、显微图像、可见光照片、图表等。我们辅以OpenMedReason-Bench,这是一个留出基准,允许沿三个互补的能力轴(包括感知、医学知识和推理)对LVLMs进行细粒度评估,从而实现超越最终答案准确性的诊断性评估。OpenMedReason是一个丰富的训练资源,在监督微调(SFT)和基于强化的对齐中均显示出有效性。使用OpenMedReason进行训练,在VQA准确率上比基础模型平均提高20%,并且性能达到最强可比规模医学LVLMs的4.2%以内。细粒度性能分析证实,增益并非集中在单一轴上:OpenMedReason共同提升了感知、医学知识和推理,并且在86.1%的成对比较中,其推理轨迹优于基础模型。我们在以下网址发布代码和数据集:此 http URL。

英文摘要

High-stakes clinical use of large vision-language models (LVLMs) requires reasoning that is grounded in visual evidence and clinical knowledge, not just correct final answers. We introduce OpenMedReason, a large-scale, open multimodal medical reasoning corpus comprising approximately 450K image-question-answer instances whose reasoning traces are primarily derived from curated biomedical, human-authored scientific articles. OpenMedReason provides high-fidelity supervision beyond synthetic chains of thought, covering diverse medical domain vision modalities such as radiological scans, microscopic images, visible light photographs, charts, and others. We complement it with OpenMedReason-Bench, a held-out benchmark that allows fine-grained evaluation of LVLMs along three complementary axes of capability, including perception, medical knowledge, and rationale, enabling diagnostic evaluation beyond final-answer accuracy. OpenMedReason is a rich training resource that exhibits its effectiveness in both supervised fine-tuning (SFT) and reinforcement-based alignment. Training with OpenMedReason yields a 20% average improvement in VQA accuracy over the base model and achieves performance within 4.2% of the strongest comparable-scale medical LVLMs. Fine-grained performance analysis confirms that the gains are not concentrated in any single axis: OpenMedReason improves perception, medical knowledge, and rationale jointly, and its reasoning traces are preferred over those of the base model in 86.1% of pairwise comparisons. We release the code and dataset at this http URL.

2606.12160 2026-06-11 cs.CL 新提交

A Controlled Study of Decoding-Time Truthfulness Methods on Instruction-Tuned LLMs

指令调优大语言模型解码时真实性方法的受控研究

Ao Sun

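Affiliations: Independent Researcher

AI summary: Proposes the CHAIR framework, which detects hallucinations by analyzing features of per-layer token logits and significantly improves zero-shot detection accuracy on TruthfulQA and MMLU.
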
AI中文摘要

在这项工作中,我们引入了CHAIR(Classifier of Hallucination As ImproveR),一个通过分析每个令牌每一层的内部logits来检测幻觉的监督框架。我们的方法从所有层的令牌logits中提取一组紧凑的特征,如最大值、最小值、均值、标准差和斜率,从而在不发生过拟合的情况下实现有效的幻觉检测。在TruthfulQA和MMLU数据集上的实验表明,CHAIR显著提高了检测准确性,特别是在零样本场景下,展示了其鲁棒性和泛化能力。除了幻觉检测,CHAIR还凸显了利用内部表示设计高级解码策略的潜力。通过利用logits中的模式,我们建议更复杂的模型和自适应解码方法可以进一步减少幻觉并提高文本完成质量。CHAIR不仅为检测幻觉提供了实用解决方案,还为探索LLM中更丰富的表示以改进其事实性和连贯性奠定了基础。

英文摘要

In this work, we introduce CHAIR (Classifier of Hallucination As ImproveR), a supervised framework for detecting hallucinations by analyzing the internal logits of every token at each layer. Our method extracts a compact set of features (maximum, minimum, mean, standard deviation, and slope) from the token logits across all layers, enabling effective hallucination detection without overfitting. Experiments on TruthfulQA and MMLU datasets demonstrate that CHAIR significantly improves detection accuracy, particularly in zero-shot scenarios, showcasing its robustness and generalizability. Beyond hallucination detection, CHAIR highlights the potential of using internal representations for designing advanced decoding strategies. By leveraging patterns in logits, we suggest that more sophisticated models and adaptive decoding methods could further reduce hallucinations and enhance text completion quality. CHAIR not only offers a practical solution for detecting hallucinations but also lays the groundwork for exploring richer representations in LLMs to improve their factuality and coherence.
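
The per-token feature set named in the abstract (maximum, minimum, mean, standard deviation, and slope across layers) could be computed along these lines. This is a sketch under assumptions: the input is taken to be one token's logit value at each layer, the slope is taken to be an ordinary least-squares slope against the layer index, and the function name is hypothetical. The abstract does not specify these details.

```python
from statistics import mean, pstdev

def layer_features(logits_by_layer):
    """Summarize one token's logit trajectory across layers.

    `logits_by_layer[k]` is assumed to be the token's logit at layer k.
    """
    n = len(logits_by_layer)
    layers = range(n)
    mu_x = mean(layers)
    mu_y = mean(logits_by_layer)
    var_x = sum((x - mu_x) ** 2 for x in layers)
    cov = sum((x - mu_x) * (y - mu_y)
              for x, y in zip(layers, logits_by_layer))
    return {
        "max": max(logits_by_layer),
        "min": min(logits_by_layer),
        "mean": mu_y,
        "std": pstdev(logits_by_layer),
        # Least-squares slope of logit vs. layer index (0 if one layer).
        "slope": cov / var_x if var_x else 0.0,
    }

features = layer_features([1.0, 2.0, 3.0, 4.0])
```

The resulting five-number vector per token would then feed a small supervised classifier; keeping the feature set this compact is what the abstract credits for avoiding overfitting.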

2606.12153 2026-06-11 cs.CV cs.GR 新提交

TopoCap: Learning Topology-Agnostic Motion Priors for Monocular Video-to-Animation

TopoCap: 学习拓扑无关的运动先验用于单目视频到动画

Cheng-Feng Pu, Jia-Peng Zhang, Meng-Hao Guo, Yan-Pei Cao, Shi-Min Hu

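Affiliations: Zhili College, Tsinghua University; BNRist, Department of Computer Science and Technology, Tsinghua University; VAST

AI summary: Proposes TopoCap, the first unified framework that extracts motion from monocular video and retargets it to characters with arbitrary, unseen skeletal topologies without test-time optimization, by learning a universal motion manifold with a Graph CVAE and applying conditional flow matching.
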
AI中文摘要

生成式3D资产的爆炸式增长创造了巨大的动画需求,然而当前的动作捕捉方法仍然脆弱,局限于特定物种的模板(例如SMPL)或需要劳动密集型的手动绑定。我们引入了TopoCap,这是第一个统一的框架,能够从单目视频中提取运动并将其重定向到具有任意、未见过的骨骼拓扑的角色,即从双足到六足和无生命物体,无需测试时优化。我们的关键洞察是,虽然骨骼结构是组合且离散的,但运动背后的物理占据了一个连续的、低维的流形。我们通过一个两阶段生成流水线实现了这一洞察。首先,我们使用图CVAE学习一个通用运动流形,该流形将异构的运动链压缩成共享的、固定长度的潜在代码。通过明确地以目标骨架的结构嵌入为条件对解码器进行条件化,我们将运动动力学与骨骼拓扑解耦。其次,我们将视频到动画视为一个条件流匹配问题,从视觉特征预测这些拓扑无关的代码。为了学习这种广义先验,我们引入了Mobjaverse,这是一个从Objaverse-XL整理的大规模数据集。它包含超过5000个独特的骨骼拓扑和200万帧,其结构多样性比现有数据集高出两个数量级。大量实验表明,TopoCap在人类和四足基准测试中优于专业模型,同时实现了对长尾3D生物的零样本重定向。数据集在此https URL公开。

英文摘要

The explosion of generative 3D assets has created a massive demand for animation, yet current motion capture methods remain brittle, restricted to species-specific templates (e.g., SMPL) or requiring labor-intensive manual rigging. We introduce TopoCap, the first unified framework capable of extracting motion from monocular video and retargeting it onto characters with arbitrary, unseen skeletal topologies, i.e., from bipeds to hexapods and inanimate objects, without test-time optimization. Our key insight is that while skeletal structures are combinatorial and discrete, the underlying physics of motion occupy a continuous, low-dimensional manifold. We materialize this insight via a two-stage generative pipeline. First, we learn a Universal Motion Manifold using a Graph CVAE that compresses heterogeneous kinematic chains into a shared, fixed-length latent code. By explicitly conditioning the decoder on a structural embedding of the target rig, we disentangle motion dynamics from skeletal topology. Second, we treat video-to-animation as a conditional flow matching problem, predicting these topology-agnostic codes from visual features. To learn this generalized prior, we introduce Mobjaverse, a massive-scale dataset curated from Objaverse-XL. Comprising over 5,000 unique skeletal topologies and 2 million frames, it exceeds the structural diversity of existing datasets by two orders of magnitude. Extensive experiments demonstrate that TopoCap outperforms specialist models on human and quadruped benchmarks while enabling zero-shot retargeting for the long tail of 3D creatures. Dataset is publicly available at this https URL.

2606.12147 2026-06-11 cs.AI 新提交

Towards Responsibly Non-Compliant Machines

迈向负责任的不合规机器

Marija Slavkovik, Marie Farrell, Louise Dennis, Michael Fisher, Simon Kolker, Emily C. Collins (University of Manchester, Manchester, United Kingdom)

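Affiliations: University of Bergen; University of Manchester

AI summary: Studies how to engineer autonomous agents that can responsibly refuse user requests, proposing a compliance framework grounded in justifications, override mechanisms, and tracking of risk and liability.
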
Comments
Presented at AAMAS-26 Workshop on Rebellion and Disobedience in AI this https URL
AI中文摘要

我们考虑工程化能够负责任地不遵守用户请求的自主智能体的问题。我们认为机器不合规有多种不同形式,并勾勒出在实现负责任不合规智能机器的道路上应追求的问题。我们将负责任的不合规锚定在任务拒绝的理由、覆盖不合规的途径,以及安全风险和责任转移的仔细追踪上。

英文摘要

We consider the problem of engineering autonomous intelligent agents that are capable of responsibly not complying with user requests. We argue that machine non-compliance comes in many different forms, and sketch the issues we should pursue on the road to responsibly non-compliant intelligent machines. We anchor responsible non-compliance in justifications for task refusal, pathways to override the non-compliance, as well as careful tracking of security risks and liability transfers.

2606.12142 2026-06-11 cs.RO cs.CV 新提交

AerialClaw: An Open-Source Framework for LLM-Driven Autonomous Aerial Agents

AerialClaw:一个用于LLM驱动的自主空中智能体的开源框架

Ke Li, Jianfei Yang, Luyao Zhang, Guo Yu, Chengwei Yan, Yuan Ding, Di Wang, Nan Luo, Gang Liu, Xiao Gao, Quan Wang

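Affiliations: Xidian University; Xi'an University of Architecture and Technology

AI summary: Proposes AerialClaw, an open-source framework with a modular brain-skill-runtime architecture that lets LLM-based agents understand natural-language missions, invoke aerial skills, and make closed-loop decisions, improving the flexibility, reproducibility, and extensibility of UAV systems.
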
AI中文摘要

无人机(UAV)越来越多地用于检查、搜索救援、环境监测和应急响应。然而,大多数无人机应用仍然依赖于预定义的命令序列或特定任务的管道,开发者手动连接感知、规划、飞行控制、仿真、日志记录和安全模块。这限制了自主空中系统的灵活性、可复现性和可扩展性。本文提出了AerialClaw,一个开源软件框架,使无人机能够作为决策型空中智能体运行,而不仅仅是遵循命令的平台。给定自然语言任务,AerialClaw允许基于LLM的智能体理解任务、维护上下文、调用可执行的空中技能、观察感知和运行时反馈,并在闭环中迭代更新其决策。该框架采用模块化的脑-技能-运行时架构,结合了用于原子无人机操作的硬技能、基于Markdown的可重用任务策略软技能、文档驱动的智能体状态和能力边界、记忆驱动的反思、面向安全的运行时验证以及平台无关的执行适配器。AerialClaw支持轻量级模拟执行、PX4 SITL与Gazebo以及基于AirSim的仿真,同时提供Web控制台、可插拔模型后端、示例任务、仿真资产和分阶段部署脚本。通过结合标准化的空中技能、文档驱动的智能体状态、记忆和闭环LLM决策,AerialClaw提供了一个可复现且可扩展的开源框架,用于构建能够解释任务、做出决策、执行技能并根据反馈调整行为的无人机系统。

English abstract

Unmanned aerial vehicles (UAVs) are increasingly used in inspection, search and rescue, environmental monitoring, and emergency response. However, most UAV applications still rely on pre-defined command sequences or task-specific pipelines, where developers manually connect perception, planning, flight control, simulation, logging, and safety modules. This limits the flexibility, reproducibility, and extensibility of autonomous aerial systems. This paper presents AerialClaw, an open-source software framework that enables UAVs to operate as decision-making aerial agents rather than merely command-following platforms. Given a natural-language mission, AerialClaw allows an LLM-based agent to understand the task, maintain context, invoke executable aerial skills, observe perception and runtime feedback, and iteratively update its decisions in a closed loop. The framework adopts a modular brain-skill-runtime architecture, combining hard skills for atomic UAV operations, Markdown-based soft skills for reusable task strategies, document-driven agent state and capability boundaries, memory-driven reflection, safety-oriented runtime validation, and platform-agnostic execution adapters. AerialClaw supports lightweight mock execution, PX4 SITL with Gazebo, and AirSim-based simulation, together with a web console, pluggable model backends, example missions, simulation assets, and staged deployment scripts. By combining standardized aerial skills, document-driven agent state, memory, and closed-loop LLM decision-making, AerialClaw provides a reproducible and extensible open-source framework for building UAV systems that can interpret missions, make decisions, execute skills, and adapt their behavior from feedback.

2606.12141 2026-06-11 cs.LG New submission

PCA-Enhanced Adaptive NVAR Framework for High-Resolution Sea Surface Temperature Forecasting in the East Sea

PCA增强的自适应NVAR框架用于东海高分辨率海面温度预测

Sherkhon Azimov, Susana López-Moreno, Eric Dolores-Cuenca, JinYong Choi, Sangil Kim

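Affiliations: Pusan National University

AI summary: Proposes a PCA-enhanced Adaptive NVAR framework that combines SVD-based dimensionality reduction with Adaptive NVAR temporal modeling to forecast East Sea sea surface temperature efficiently, with lower forecasting errors than standard NVAR.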

Details
Comments
14 pages, 7 figures
AI Chinese abstract

准确预测东海等区域海的海面温度(SST)对于监测海洋生态系统、评估气候风险、管理渔业和执行海军行动至关重要。传统的数值海洋模型提供可靠的预测,但计算成本高,通常不适合实时预测。许多深度学习方法也难以处理高维时空海洋数据,并在较长的预测周期内出现误差累积。本研究基于我们先前提出的自适应下一代储层计算(Adaptive NVAR)框架,该框架最初在合成动力系统上引入和测试,并将其扩展到海洋预测。我们提出了一种降阶预测框架,将奇异值分解(SVD)与自适应NVAR相结合,以预测东海的SST动态。使用SVD将SST场压缩为低维表示,提取海洋变率的主导模态。自适应NVAR对这些潜在状态的时间演化进行建模,并将预测状态重建为SST预测。我们使用区域海洋数据集评估该框架,并将其与标准NG-RC/NVAR进行比较。结果表明,自适应NVAR在多个预测时域上始终实现较低的预测误差。此外,SVD降低了计算复杂度,从而产生了一个适用于实时海洋预测的快速且可扩展的框架。

English abstract

Accurate forecasting of sea surface temperature (SST) in regional seas such as the East Sea is crucial for monitoring marine ecosystems, assessing climate risks, managing fisheries, and conducting naval operations. Traditional numerical ocean models provide reliable predictions but are computationally expensive and often unsuitable for real-time forecasting. Many deep learning methods also struggle with high-dimensional spatiotemporal ocean data and experience error accumulation over longer forecasting periods. This study builds on our previously proposed Adaptive Next-Generation Reservoir Computing (Adaptive NVAR) framework, initially introduced and tested on synthetic dynamical systems, and extends it to ocean forecasting. We present a reduced-order forecasting framework that combines Singular Value Decomposition (SVD) with Adaptive NVAR to predict SST dynamics in the East Sea. SST fields are compressed into a low-dimensional representation using SVD, which extracts dominant modes of ocean variability. Adaptive NVAR models the temporal evolution of these latent states, and the predicted states are reconstructed into SST forecasts. We evaluate the framework using regional ocean datasets and compare it with the standard NG-RC/NVAR. Results show that Adaptive NVAR consistently achieves lower forecasting errors across multiple prediction horizons. In addition, SVD reduces computational complexity, resulting in a fast and scalable framework suitable for real-time ocean forecasting.
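The abstract above describes feeding SVD-compressed latent states into an NVAR model. As a rough illustration of the standard NVAR feature map only (not the authors' adaptive variant, whose details are not given here), the sketch below builds a constant term, time-delayed linear terms, and unique quadratic monomials from a short latent history. The function name `nvar_features` and the parameters `k` (number of delay taps) and `s` (stride) are hypothetical choices made for this example.

```python
from itertools import combinations_with_replacement


def nvar_features(history, k=2, s=1):
    """Build a standard NVAR feature vector from a latent-state history.

    history: list of latent states (each a list of floats), oldest first.
    k: number of delay taps; s: stride between taps (both assumed values).
    """
    needed = (k - 1) * s + 1
    if len(history) < needed:
        raise ValueError("history too short for the requested delays")
    # Linear part: the current state followed by k-1 delayed states.
    linear = []
    for tap in range(k):
        linear.extend(history[-1 - tap * s])
    # Nonlinear part: all unique pairwise products of the linear terms.
    quadratic = [a * b for a, b in combinations_with_replacement(linear, 2)]
    return [1.0] + linear + quadratic


# Two latent coefficients per time step, three time steps of history
# (toy numbers, not real SST data).
latent = [[0.5, -1.0], [1.0, 2.0], [3.0, 4.0]]
features = nvar_features(latent, k=2, s=1)
print(len(features))  # 1 constant + 4 linear + 10 quadratic = 15
```

In a full forecaster, a linear readout (typically fitted by ridge regression) would map this feature vector to the next latent state, which is then projected back to an SST field through the SVD modes.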

2606.12140 2026-06-11 cs.CV New submission

Time-Conditioned and Multi-Time Survival Prediction from 2D PET/CT Projections in Lung Cancer

基于2D PET/CT投影的时间条件与多时间生存预测在肺癌中的应用

Ashish Chauhan, Sambit Tarai, Elin Lundström, Johan Öfverstedt, Håkan Ahlström, Joel Kullberg

Affiliations: Radiology, Department of Surgical Sciences, Uppsala University; National Academic Infrastructure for Supercomputing (NAISS), Linköping University; Antaros Medical SciLifeLab, Uppsala University

AI summary: Proposes two approaches, Attention-guided Time-Conditioned Survival (ATCS) and Multi-Time Survival (MTS), for predicting survival of non-small cell lung cancer patients from 2D PET/CT projections; ATCS performs better at earlier time points and MTS at later ones.

Details
Comments
Under review at MIUA 2026
AI Chinese abstract

从正电子发射断层扫描/计算机断层扫描(PET/CT)准确预测总生存期(OS)可以支持肿瘤学中的个性化治疗和随访策略。然而,时间建模对基于影像的生存预测的影响仍未得到充分探索。我们通过开发两种互补方法:注意力引导的时间条件生存(ATCS)和多时间生存(MTS),研究了不同时间公式如何影响生存预测。我们回顾性分析了848例非小细胞肺癌(NSCLC)患者的治疗前PET/CT图像,其中556例用于模型开发,292例用于保留测试。使用先前提出的时间条件生存(TCS)模型作为基线。模型通过5折交叉验证训练,并在测试集上使用时间依赖性曲线下面积(AUC)在0.5至5年之间每6个月间隔进行评估。ATCS和MTS均优于基线TCS模型,平均AUC分别为0.794和0.793,而基线为0.767。ATCS在早期时间点(0.5-3年)表现更好,而MTS在后期间隔(3.5-5年)表现更好。结合肿瘤特异性和组织特异性PET/CT特征比单独使用任一输入提高了性能。更精细的时间离散化改善了短期预测,而更粗的间隔提供了更稳定的长期估计。这些发现表明时间建模和输入设计影响基于PET/CT的生存预测。所提出的方法能够从治疗前影像进行时间特异性生存估计,并可能支持改进的风险分层和临床决策。

English abstract

Accurate prediction of overall survival (OS) from positron emission tomography/computed tomography (PET/CT) can support personalized treatment and follow-up strategies in oncology. However, the impact of temporal modeling on imaging-based survival prediction remains insufficiently explored. We investigate how different temporal formulations influence survival prediction by developing two complementary approaches: Attention-guided Time-Conditioned Survival (ATCS) and Multi-Time Survival (MTS). We retrospectively analyzed pre-treatment PET/CT images from 848 patients with non-small cell lung cancer (NSCLC), including 556 for model development and 292 for held-out testing. A previously proposed Time-Conditioned Survival (TCS) model was used as a baseline. Models were trained using 5-fold cross-validation and evaluated on the test set using time-dependent area under the curve (AUC) at 6-month intervals from 0.5 to 5 years. Both ATCS and MTS outperformed the baseline TCS model, achieving mean AUCs of 0.794 and 0.793, respectively, compared to 0.767. ATCS performed better at earlier time points (0.5-3 years), whereas MTS performed better at later intervals (3.5-5 years). Combining tumor-specific and tissue-wise PET/CT features improved performance over either input alone. Finer temporal discretization improved short-term prediction, while coarser intervals provided more stable long-term estimates. These findings demonstrate that temporal modeling and input design influence PET/CT-based survival prediction. The proposed approaches enable time-specific survival estimation from pre-treatment imaging and may support improved risk stratification and clinical decision-making.
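The evaluation protocol above (time-dependent AUC every 6 months from 0.5 to 5 years, then averaged) can be sketched as follows. This is a simplified cumulative/dynamic AUC that assumes no censoring, which is an assumption of this example and not of the paper; a real evaluation would need censoring-aware weighting. The toy risks and event times are made up for illustration.

```python
def time_dependent_auc(risk, event_time, horizon):
    """Cumulative/dynamic AUC at one horizon, assuming no censoring.

    Cases are patients whose event occurred by `horizon`; controls are
    patients still event-free after it. Ties in risk count as one half.
    """
    cases = [r for r, t in zip(risk, event_time) if t <= horizon]
    controls = [r for r, t in zip(risk, event_time) if t > horizon]
    if not cases or not controls:
        return None  # AUC is undefined without both groups
    wins = 0.0
    for c in cases:
        for h in controls:
            if c > h:
                wins += 1.0
            elif c == h:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Evaluation grid from the abstract: every 6 months from 0.5 to 5 years.
horizons = [0.5 * i for i in range(1, 11)]

# Toy data: a higher predicted risk should correspond to an earlier event.
risk = [0.9, 0.8, 0.4, 0.6, 0.1]
event_time = [0.4, 1.2, 2.6, 3.1, 6.0]  # years

aucs = [time_dependent_auc(risk, event_time, h) for h in horizons]
defined = [a for a in aucs if a is not None]
mean_auc = sum(defined) / len(defined)
print(horizons)
print(round(mean_auc, 3))
```

Averaging over the ten horizons mirrors how a single mean AUC can summarize performance across early and late time points, while the per-horizon values show where a model is stronger.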

2606.12138 2026-06-11 cs.LG cs.AI cs.CL New submission

Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders

不稳定特征,可复现子空间:理解稀疏自编码器中的种子依赖性

Gleb Gerasimov, Timofei Rusalev, Nikita Balagansky, Daniil Laptev, Vadim Kurochkin, Daniil Gavrilov

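Affiliations: T-Tech

AI summary: Studies the reproducibility of sparse autoencoder features, finding that stable features carry most of the signal while unstable features concentrate in reproducible low-rank subspaces, reflecting basis ambiguity rather than pure noise.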

Details
AI Chinese abstract

稀疏自编码器(SAE)被广泛用于解释神经网络表示,但其效用取决于学习到的特征是否在不同训练运行间可复现。我们通过特征稳定性研究这一问题:对于每个SAE特征,我们估计其在独立训练的SAE中再次出现的概率。这产生了一个可扩展的每特征信号,将稳定特征与不稳定特征区分开来。在一项跨种子、模型、层、字典大小和SAE变体的大规模研究中,我们发现显著的功能不对称性:稳定特征承载了大部分重建和预测相关信号,而不稳定特征的边际影响较弱,并且在激活统计和自动解释中主要由低频表面形式触发主导。在几何上,不稳定特征个体不可复现,但集中在可复现的低秩子空间中,这表明种子依赖性通常反映了共享激活空间区域内的基歧义,而非纯噪声。一个受控的合成模型使这一机制明确,表明低秩真实特征可以在子空间级别被恢复,而作为个体SAE潜在变量跨种子仍不可识别。最后,通过汇集独特的跨种子特征,我们构建了更稳定的SAE,同时在此设置中保留了解释方差。这些结果共同表明,不稳定特征不仅仅是失败或噪声潜在变量:它们个体功能影响较弱,但反映了标准SAE跨种子不同解析的可复现低维结构。

English abstract

Sparse autoencoders (SAEs) are widely used to interpret neural network representations, but their utility depends on whether the learned features are reproducible across training runs. We study this question through feature stability: for each SAE feature, we estimate the probability that a similar feature reappears in an independently trained SAE. This yields a scalable per-feature signal that separates stable from unstable features. In a large-scale study across seeds, models, layers, dictionary sizes, and SAE variants, we find a pronounced functional asymmetry: stable features carry most of the reconstruction- and prediction-relevant signal, while unstable features have weak marginal impact and are dominated by low-frequency surface-form triggers in both activation statistics and automatic explanations. Geometrically, unstable features are individually non-reproducible but concentrate in reproducible lower-rank subspaces, suggesting that seed dependence often reflects basis ambiguity within a shared region of activation space rather than pure noise. A controlled synthetic model makes this mechanism explicit, showing that low-rank ground-truth features can be recovered at the subspace level while remaining non-identifiable as individual SAE latents across seeds. Finally, by pooling unique cross-seed features, we construct more stable SAEs while preserving explained variance in this setting. Together, these results show that unstable features are not merely failed or noisy latents: they have weak individual functional impact, but reflect reproducible low-dimensional structure that standard SAEs resolve differently across seeds.
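One simple way to picture a per-feature stability score is sketched below: for each feature direction in a reference SAE, count the fraction of independently trained SAEs that contain a sufficiently similar direction. The use of cosine similarity between decoder directions and the 0.7 threshold are assumptions of this example; the abstract does not specify how similarity or the reappearance probability is actually estimated.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def feature_stability(reference, other_runs, threshold=0.7):
    """For each reference feature, return the fraction of other runs that
    contain at least one feature with cosine similarity >= threshold."""
    scores = []
    for feat in reference:
        hits = sum(
            1 for run in other_runs
            if max(cosine(feat, g) for g in run) >= threshold
        )
        scores.append(hits / len(other_runs))
    return scores


# Toy feature directions from three hypothetical seeds (2-D for readability).
seed_a = [[1.0, 0.0], [0.6, 0.8]]
seed_b = [[0.99, 0.1], [-0.8, 0.6]]
seed_c = [[1.0, 0.05], [0.0, 1.0]]

print(feature_stability(seed_a, [seed_b, seed_c]))  # [1.0, 0.5]
```

In this toy case the first feature reappears in both other seeds, while the second finds a close match in only one, so it would count as less stable under this particular definition.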

2606.12126 2026-06-11 cs.CV New submission

AGE-MIL: Anchor-Guided Evidence Learning for Patient-Level Prediction

AGE-MIL: 锚点引导的证据学习用于患者级别预测

Jiawei Niu, Jian Chen, Di Zhang, Junbo Lu, Zhangcheng Liao, Xuhao Liu, Honglin Zhong, Mireia Crispin-Ortuzar, Chen Li, Zeyu Gao, Yi Cai

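Affiliations: School of Computer Science and Technology, Xi’an Jiaotong University; Department of Oncology, University of Cambridge; Xiangya School of Medicine, Central South University

AI summary: Proposes AGE-MIL, which builds a patient-level anchor to integrate evidence across multiple whole-slide images and models risk as an evidence accumulation process for stable optimization under weak supervision, outperforming eight existing methods on six tasks.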

Details
Comments
11 pages, 2 figures, MICCAI early accepted
AI Chinese abstract

现有的计算病理学方法主要在全切片图像(WSI)级别的多实例学习(MIL)范式下运行,而患者级别的建模仍未得到充分探索。然而,在常规病理实践中,病理学家通过整合多个WSI的证据而非依赖任何单个切片来得出诊断和预后结论。当患者级别的监督直接施加于传统MIL框架时,这种差异造成了根本性的错位,常常导致优化不稳定和预测可靠性下降。为了解决这个问题,我们提出了锚点引导的证据MIL(AGE-MIL),一种用于患者级别预测的弱监督框架。AGE-MIL从切片表示中构建患者级别的锚点,以捕获全局病理上下文并指导诊断相关局部斑块的检索和整合,从而实现稳健的患者级别建模。患者级别的风险进一步被建模为证据积累过程,促进弱监督下的稳定优化。AGE-MIL在两个独立队列的六个临床相关患者级别预测任务上进行了评估。实验结果表明,所提出的框架始终优于八种最先进的MIL方法。代码可在以下网址获取:this https URL。

English abstract

Existing computational pathology methods predominantly operate within whole-slide image (WSI)-level multiple instance learning (MIL) paradigms, while patient-level modeling remains underexplored. In routine pathological practice, however, pathologists derive diagnostic and prognostic conclusions by integrating evidence across multiple WSIs rather than relying on any single slide. This discrepancy creates a fundamental misalignment when patient-level supervision is directly imposed on conventional MIL frameworks, often leading to unstable optimization and degraded predictive reliability. To address this issue, we propose Anchor-Guided Evidence MIL (AGE-MIL), a weakly supervised framework for patient-level prediction. AGE-MIL constructs a patient-level anchor from slide representations to capture global pathological context and guide the retrieval and integration of diagnostically relevant local patches, enabling robust patient-level modeling. Patient-level risk is further modeled as an evidence accumulation process, promoting stable optimization under weak supervision. AGE-MIL is evaluated on six clinically relevant patient-level prediction tasks from two independent cohorts. Experimental results show that the proposed framework consistently outperforms eight state-of-the-art MIL methods. Code is available at this https URL.

2606.12125 2026-06-11 cs.CV New submission

Q-Fold: Query-Aware Focus-Context Spatio-Temporal Folding for Long Video Understanding

Q-Fold: 查询感知的焦点-上下文时空折叠用于长视频理解

Biao Tang, Xu Chen, Shuxiang Gou, Jingyi Yuan, Yuhan Zhang, Chenqiang Gao

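Affiliations: Shenzhen Campus of Sun Yat-sen University; Harbin Institute of Technology, Shenzhen; Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China

AI summary: Proposes Q-Fold, a training-free input construction framework for long videos that, under query guidance, keeps relevant segments as high-fidelity Focus Frames and folds less relevant segments into contextual layouts, improving long-video understanding in multimodal LLMs under a fixed budget.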

Details
Comments
10 pages, 5 figures, 8 tables. Code will be made publicly available
AI Chinese abstract

长视频理解对多模态大语言模型仍然具有挑战性,因为时间上延长的视频通常包含数千帧,因此穷举处理成本高昂。现有方法通常在有限的视觉预算下从长视频构建紧凑的视觉输入。然而,大多数方法仍然遵循以帧为中心的范式,并对保留的内容应用相似的表示,无论其重要性如何。这使得难以同时保留高保真视觉证据和广泛的时间覆盖。为了解决这个问题,我们提出了Q-Fold,一种无需训练的长视频理解输入构建框架。Q-Fold不将孤立帧作为基本建模单元,而是对连续的时间段进行操作,并在查询引导下构建异构的焦点-上下文表示。查询相关的片段被保留为高保真的焦点帧,而不太相关的片段被折叠成保持时间顺序的上下文布局。通过这种方式,Q-Fold保留了关键的视觉证据和广泛的时间覆盖,同时更好地保持了短片段内的局部时间连续性。在四个长视频基准测试和多个视频多模态大模型上的实验表明,Q-Fold在不增加输入预算的情况下持续提升性能。值得注意的是,它在一个超长视频基准测试上取得了高达9.1个百分点的提升。代码将公开提供。

English abstract

Long-video understanding remains challenging for multimodal large language models, because temporally extended videos often contain thousands of frames and are therefore expensive to process exhaustively. Existing methods usually construct compact visual inputs from long videos under a limited visual budget. However, most of them still follow a frame-centric paradigm and apply similar representations to retained content regardless of its importance. This makes it difficult to preserve both high-fidelity visual evidence and broad temporal coverage. To address this issue, we propose Q-Fold, a training-free input construction framework for long-video understanding. Instead of treating isolated frames as the basic modeling unit, Q-Fold operates on contiguous temporal segments and constructs a heterogeneous Focus-Context representation under query guidance. Query-relevant segments are preserved as high-fidelity Focus Frames, while less relevant segments are folded into chronology-preserving contextual layouts. In this way, Q-Fold preserves critical visual evidence and broad temporal coverage, while better maintaining local temporal continuity within short segments. Experiments on four long-video benchmarks with multiple Video-MLLMs show that Q-Fold consistently improves performance without increasing the input budget. Notably, it achieves gains of up to 9.1 percentage points on an ultra-long video benchmark. Code will be made publicly available.
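The focus/context split described above can be pictured with a small sketch. Everything here is hypothetical: the function name `q_fold`, the per-segment relevance scores (assumed to come from some query-frame similarity model that is not shown), and the choice to fold context frames into fixed-size grids are illustrative assumptions, not the paper's actual procedure.

```python
def q_fold(num_frames, seg_len, relevance, num_focus, grid=4):
    """Split frames into contiguous segments, keep the most query-relevant
    segments as individual focus frames, and fold the rest into grids.

    relevance: one score per segment (assumed to be given).
    Returns a chronologically ordered list of ("focus", [frame]) items and
    ("context", [frames tiled into one composite image]) items.
    """
    segments = [list(range(s, min(s + seg_len, num_frames)))
                for s in range(0, num_frames, seg_len)]
    assert len(relevance) == len(segments)
    ranked = sorted(range(len(segments)), key=lambda i: -relevance[i])
    focus_ids = set(ranked[:num_focus])
    layout = []
    for i, frames in enumerate(segments):
        if i in focus_ids:
            # Focus: every frame of the segment is kept at full fidelity.
            layout.extend(("focus", [f]) for f in frames)
        else:
            # Fold: tile up to `grid` frames per composite image, in order.
            for j in range(0, len(frames), grid):
                layout.append(("context", frames[j:j + grid]))
    return layout


layout = q_fold(num_frames=16, seg_len=4,
                relevance=[0.1, 0.9, 0.2, 0.3], num_focus=1)
print(len(layout))  # 16 frames are represented by 7 visual inputs
```

The point of the sketch is the budget trade-off: one relevant segment costs four inputs, while each folded segment costs one, and chronological order is preserved throughout.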

2606.12120 2026-06-11 cs.LG math.OC New submission

A Riemannian Approach to Low-Rank Optimal Transport

低秩最优传输的黎曼方法

Pratik Jawanpuria, Bamdev Mishra

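Affiliations: Centre for Machine Intelligence and Data Science, IIT Bombay; Microsoft India

AI summary: Proposes a Riemannian geometric framework for low-rank optimal transport that models balanced and unbalanced rank-$r$ positive factored couplings as smooth submanifolds equipped with the Fisher-Rao product metric, yielding efficient first- and second-order solvers that outperform existing methods in convergence speed and performance.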

Details
AI Chinese abstract

低秩最优传输(OT)缓解了经典求解器的二次缩放问题,但现有方法严重依赖需要仔细调整超参数且忽略优化景观曲率的一阶镜像下降更新。为了解决这些局限性,我们提出了一个统一的低秩OT黎曼几何框架,将平衡和不平衡秩$r$正因子耦合建模为正象限的新型光滑嵌入子流形。通过为这些流形配备Fisher-Rao乘积度量,我们推导出黎曼投影、收缩和Hessian-向量积的可处理公式。我们的成本无关框架无缝扩展到线性OT、Gromov-Wasserstein(GW)、融合GW及其不平衡对应物。对于平衡OT,我们的几何成分通过高效的共轭梯度和迭代Bregman更新计算。对于不平衡OT,我们的操作优雅地简化为闭式缩放,完全消除了内部迭代循环。在两种情况下,每次迭代的复杂度与数据集大小呈线性关系,并且我们提供了用于全局最优性验证的秩充分性证书。跨一系列问题规模的大量实验表明,我们的无正则化一阶和二阶求解器在收敛速度和性能上优于现有最先进的低秩OT求解器。

English abstract

Low-rank optimal transport (OT) mitigates the quadratic scaling of classical solvers, yet existing approaches rely heavily on first-order mirror-descent updates that require careful hyperparameter tuning and ignore the optimization landscape's curvature. To address these limitations, we propose a unified Riemannian geometric framework for low-rank OT, modeling balanced and unbalanced rank-$r$ positive factored couplings as novel smooth embedded submanifolds of the positive orthant. By equipping these manifolds with the Fisher-Rao product metric, we derive tractable formulations for Riemannian projectors, retractions, and Hessian-vector products. Our cost-agnostic framework seamlessly extends to linear OT, Gromov-Wasserstein (GW), fused GW, and their unbalanced counterparts. For balanced OT, our geometric ingredients are computed via efficient conjugate-gradient and iterative Bregman updates. For unbalanced OT, our operations reduce to closed-form scalings, completely eliminating inner iterative loops. In both regimes, per-iteration complexity scales linearly with dataset size, and we provide a rank-sufficiency certificate for global optimality verification. Extensive experiments across a range of problem sizes demonstrate that our regularization-free first- and second-order solvers achieve faster convergence and superior performance over existing state-of-the-art low-rank OT solvers.

2606.12117 2026-06-11 cs.CL cs.AI New submission

Soft-Prompt Tuning for Fair and Efficient LLM Benchmark Evaluation

软提示调优用于公平且高效的LLM基准评估

Selen Erkan, Bastian Boll, Kristian Kersting, Björn Deiseroth, Letitia Parcalabescu

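Affiliations: Aleph Alpha Research Lab; TU Darmstadt; Hessian.AI

AI summary: Proposes soft-prompt tuning, which optimizes a small number of soft-prompt vectors to adapt base models to benchmark formats so that their knowledge is evaluated fairly, efficiently and without full post-training.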

Details
Comments
10 pages, 4 figures
AI Chinese abstract

基准分数常常错误地反映大型语言模型(LLM)的知识,因为它们依赖于模型遵循特定格式要求的能力等。这尤其惩罚了那些可能知道正确答案但缺乏按照指示结构化答案能力的基础模型——这种能力通常在后训练中引入。为了克服这一点,我们提出了软提示调优,一种高效、公平且架构无关的模型评估方法。通过在短时间调优内仅优化10个软提示向量(对于7B模型大约占参数的0.0006%),我们使模型适应特定的基准格式,缩小格式遵循方面的差距,确保底层知识准确地反映在基准分数中。这使得人们可以在基准上公平比较不同基础模型(使用各种预训练配方训练),而无需完整的后训练。我们在7个模型和7个数据集上评估了软提示调优。结果表明:(a) 软提示调优在80步(约640个样本)内使格式遵循饱和,因此非常高效;(b) 软提示调优显著优于零样本和少样本提示,揭示了标准提示遗漏的基础模型知识;(c) 即使后训练模型也可以从软提示中受益以最大化格式遵从性;(d) 软提示的基础模型性能比零样本和少样本基线更可靠地预测后训练模型的排名,为下游模型质量提供了低成本的代理。我们的贡献包括:(1) 解耦格式遵循和知识准确性的度量标准;(2) 更公平的LLM知识基准测试协议;(3) 一种成本效益高且内存有效的方案,用于在LLM开发早期识别最优预训练策略。

English abstract

Benchmark scores often misrepresent a large language model's (LLM's) knowledge, because they rely, e.g., on the model's ability to follow specific formatting requirements. This especially penalizes base models that may know the correct answers but lack the ability, typically introduced in post-training, to structure them as instructed. To overcome this, we propose soft-prompt tuning, an efficient, fair, and architecture-agnostic model evaluation method. By optimizing only 10 soft-prompt vectors (roughly 0.0006% of the parameters of a 7B model) over a short tuning period, we adapt models to specific benchmark formats, closing gaps in format-following and ensuring that underlying knowledge is accurately reflected in benchmark scores. This allows one to fairly compare different base models, trained with various pre-training recipes, on benchmarks without the need for full post-training. We evaluated soft-prompt tuning across 7 models and 7 datasets. The results show that (a) soft-prompt tuning saturates format-following within 80 steps (~640 samples), making it highly efficient, (b) soft-prompt tuning significantly outperforms zero- and few-shot prompting, surfacing base model knowledge that standard prompting misses, (c) even post-trained models can benefit from soft prompts to maximize format compliance, and (d) soft-prompted base model performance predicts post-trained model rankings more reliably than zero- and few-shot baselines, offering a low-cost proxy for downstream model quality. Our contributions include (1) metrics that disentangle format-following and knowledge accuracy, (2) a fairer benchmarking protocol for LLM knowledge, and (3) a cost- and memory-efficient recipe to identify optimal pre-training strategies early in LLM development.
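The parameter fraction quoted above can be checked with a short calculation. The 10 vectors and the 7B model size come from the abstract; the hidden size of 4096 and a parameter count of exactly 7e9 are assumptions that are typical of 7B-class models but are not stated in the source.

```python
def soft_prompt_fraction(num_vectors, hidden_size, total_params):
    """Return the trainable soft-prompt values and their share in percent."""
    trainable = num_vectors * hidden_size
    return trainable, 100.0 * trainable / total_params


# Assumed: hidden size 4096 and exactly 7e9 parameters for a "7B" model.
trainable, percent = soft_prompt_fraction(10, 4096, 7_000_000_000)
print(trainable)          # 40960 trainable values
print(f"{percent:.4f}%")  # 0.0006%

# 80 steps covering about 640 samples implies 8 samples per step,
# assuming a fixed batch size.
samples_per_step = 640 // 80
print(samples_per_step)   # 8
```

Under these assumptions the result matches the "roughly 0.0006%" figure, which gives a sense of how small the tuned component is relative to the frozen model.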

2606.12114 2026-06-11 cs.CL New submission

Detecting Sensitive Personal Information in Japanese Pre-Training Corpora for Large Language Models

检测日语大语言模型预训练语料库中的敏感个人信息

Rei Minamoto, Yusuke Oda, Daisuke Kawahara

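Affiliations: Waseda University; Research and Development Center for LLMs, National Institute of Informatics

AI summary: Targets sensitive personal information in Japanese LLM pre-training corpora: builds a dataset based on special care-required personal information (SCPI) as defined in Japan's Act on the Protection of Personal Information and trains machine learning models for rapid detection, the first exploration of SCPI detection in Japanese text.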

Details
AI Chinese abstract

敏感个人信息可能出现在大语言模型(LLMs)的大规模预训练语料中。因此,检测和过滤此类信息对于确保遵守隐私法规和防止意外信息泄露至关重要。然而,与英语和其他语言相比,日语中关于敏感个人信息的研究有限。在本研究中,我们聚焦于日本《个人信息保护法》(APPI)中定义为特殊要保护个人信息(SCPI)的敏感个人数据。我们使用基于LLM的标注构建了一个SCPI数据集,并训练机器学习模型以快速检测文本中的SCPI。结果,我们的SCPI分类器能够有效识别与SCPI相关的信息。本研究首次探索日语文本语料库中的SCPI检测,突显了准确检测的挑战。

English abstract

Sensitive personal information can appear in large-scale pre-training corpora for large language models (LLMs). Detecting and filtering such information is therefore essential to ensure compliance with privacy regulations and prevent unintended information leakage. However, in contrast to English and other languages, research into sensitive personal information has been limited in the Japanese language. In this study, we focus on sensitive personal data defined as special care-required personal information (SCPI) under Japan's Act on the Protection of Personal Information (APPI). We construct an SCPI dataset using LLM-based annotation and train machine learning models to rapidly detect SCPI in text. As a result, our SCPI classifier can effectively identify information related to SCPI. This study is the first to explore SCPI detection in Japanese text corpora, highlighting the challenges of accurate detection.

2606.12113 2026-06-11 cs.CL cs.AI New submission

Augmenting Molecular Language Models with Local $n$-gram Memory

增强分子语言模型的局部 $n$-gram 记忆

Xinni Zhang, Zijing Liu, He Cao, Yu Li, Irwin King

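Affiliations: The Chinese University of Hong Kong; International Digital Economy Academy

AI summary: Addresses the way character-level tokenization of SMILES strings fragments chemically meaningful motifs in Transformer models by proposing MolGram, a conditional $n$-gram memory module that injects local context via hash lookups and outperforms baselines on three tasks with fewer parameters.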

Details
AI Chinese abstract

基于Transformer的SMILES字符串语言模型存在局部性差距:标准字符级分词会破坏化学上有意义的模式,迫使模型反复学习局部语法而牺牲长距离依赖。为了解决这个问题而不干扰标准分词器,我们提出了MolGram,它将条件$n$-gram记忆模块集成到分子语言模型中。MolGram通过可扩展的哈希查找将局部字符串模式映射到学习到的嵌入,并动态地将这种区域上下文注入隐藏状态。在三个任务(包括无条件分子生成、正向反应预测和单步逆合成)上的评估表明,MolGram持续提升性能。关键的是,我们的分析表明,MolGram以3倍更少的参数优于基线,将显式局部模式记忆确立为一种高效的归纳偏置。

English abstract

Transformer-based language models for SMILES strings suffer from a locality gap: standard character-level tokenization fragments chemically meaningful motifs, forcing models to repeatedly learn local syntax at the expense of long-range dependencies. To address this without disrupting standard tokenizers, we propose MolGram, which integrates a conditional $n$-gram memory module into molecular language models. MolGram maps local string patterns to learned embeddings via scalable hash lookups and dynamically injects this regional context into hidden states. Evaluations across three tasks, including unconditional molecule generation, forward reaction prediction, and single-step retrosynthesis, show that MolGram consistently improves performance. Crucially, our analyses demonstrate that MolGram outperforms baselines with 3$\times$ more parameters, establishing explicit local pattern memory as a highly efficient inductive bias.
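The idea of mapping local string patterns to learned embeddings through hash lookups can be sketched in a few lines. This is an illustrative toy, not MolGram itself: the function names, the use of `zlib.crc32` as the hash, the bucket count, and the fixed scalar gate (standing in for whatever conditional mechanism the paper uses) are all assumptions of this example.

```python
import zlib


def ngram_bucket_ids(smiles, n=3, num_buckets=1024):
    """Map each position's trailing character n-gram to a bucket index.

    Positions with fewer than n characters of context get bucket -1.
    zlib.crc32 is used as a deterministic stand-in hash function.
    """
    ids = []
    for i in range(len(smiles)):
        if i + 1 < n:
            ids.append(-1)
            continue
        gram = smiles[i + 1 - n:i + 1]
        ids.append(zlib.crc32(gram.encode("utf-8")) % num_buckets)
    return ids


def inject_memory(hidden, ids, table, gate=0.5):
    """Add a gated memory embedding to each hidden state (lists of floats)."""
    out = []
    for h, idx in zip(hidden, ids):
        if idx < 0:
            out.append(list(h))  # no n-gram context yet, leave unchanged
        else:
            out.append([x + gate * m for x, m in zip(h, table[idx])])
    return out


smiles = "CC(=O)OC(=O)C"  # the motif "(=O" occurs twice
ids = ngram_bucket_ids(smiles, n=3, num_buckets=16)
# Toy embedding table: bucket b stores the 2-D vector [b, -b].
table = [[float(b), -float(b)] for b in range(16)]
hidden = [[0.0, 0.0] for _ in smiles]
mixed = inject_memory(hidden, ids, table)
print(ids[4] == ids[10])  # True: the repeated motif hits the same bucket
```

Because repeated motifs land in the same bucket, the model can reuse one stored embedding for a recurring local pattern instead of re-deriving it from individual characters; hash collisions between unrelated n-grams are the usual price of this kind of lookup.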

2606.12112 2026-06-11 cs.RO New submission

PEBRE: An Open-Hardware Compute and Perception Add-On for the Pepper Robot

PEBRE:Pepper 机器人的开源硬件计算与感知扩展模块

Malte Kuhlmann, Ignacio Bugueno-Cordova, Emil Alms, Javier Ruiz-del-Solar, Nicolás Navarro-Guerrero

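Affiliations: Leibniz Universität Hannover; University of Chile

AI summary: Presents PEBRE, an open-hardware add-on for the Pepper robot that integrates components such as a Jetson Orin Nano to considerably improve its computational and perception capabilities and extend the platform's usable life.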

Details
AI Chinese abstract

本文介绍了 PEBRE 的设计、开发与实验验证,PEBRE 是一种用于 Pepper 机器人快速软件开发的开放硬件扩展模块。我们的项目通过集成 Jetson Orin Nano、Logitech BRIO、Intel RealSense D435i、Samson UB1 和 RØDE VideoMicro II 等外部组件,增强了 Pepper 的计算和感知能力。结果表明,新硬件显著提升了 Pepper 的感知能力和计算性能。这一开发通过为 Pepper 机器人实现开放硬件和开源模块化扩展模块,并保持这一相关研究平台在其预期寿命之外的功能性,为社区做出了贡献。通过 PEBRE,我们旨在促进更快速的软件开发以及外部组件的更高效集成,最终增强 Pepper 机器人的能力。

English abstract

This paper presents the design, development, and experimental verification of PEBRE, an open-hardware add-on for fast software development on the Pepper Robot. Our project enhances Pepper's computational and perception capabilities by integrating external components such as a Jetson Orin Nano, Logitech BRIO, Intel RealSense D435i, Samson UB1, and RØDE VideoMicro II. Our results show that the new hardware considerably improved Pepper's perception abilities and computational power. This development contributes to the community by implementing an open hardware and open-source modular add-on to the Pepper robot and keeping this relevant research platform functional beyond its expected lifespan. With PEBRE, we aim to facilitate faster software development and more efficient integration of external components, ultimately enhancing the capabilities of the Pepper robot.

2606.12109 2026-06-11 cs.RO cs.AI New submission

Bridging the Morphology Gap: Adapting VLA Models to Dexterous Manipulation via Intent-Conditioned Fine-Tuning

弥合形态差距:通过意图条件微调使VLA模型适应灵巧操作

Chuanke Pang, Junyi Huang, Zhijun Zhao, Yaobing Wang, Kun Xu, Xilun Ding

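Affiliations: Beihang University; China Academy of Space Technology

AI summary: Proposes InDex, which repurposes the pre-trained 1-DoF parallel grasp output as a macroscopic virtual grasp intent proxy and uses a two-stage decoupled learning architecture to adapt VLA models from low-DoF grippers to high-DoF dexterous hands, mitigating catastrophic forgetting and action manifold collapse.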

Details
AI Chinese abstract

视觉-语言-动作(VLA)模型在机器人操作中展现了显著的零样本泛化能力,然而绝大多数预训练流程严格局限于低自由度平行夹爪。将这些丰富的语义先验适应到高自由度灵巧手引入了严重的形态差距,直接的端到端联合微调会由于数据稀缺而导致空间推理的灾难性遗忘和急性动作流形坍缩。在本文中,我们提出了InDex,一种新颖的、数据高效的适应框架,其根植于跨形态语义继承。我们不丢弃预训练的1-DoF平行抓取输出,而是将其重新用作连续的、宏观的虚拟抓取意图代理,以顺序化控制拓扑。我们实现了一个两阶段解耦学习架构:第一阶段参数高效地将VLA主干对齐以预测连续的臂轨迹和标量抓取意图;第二阶段冻结该空间主干,并利用一个意图条件去噪扩散头来解码多指末端执行器的细粒度关节运动。跨一系列多阶段、高接触灵巧操作任务的广泛模拟基准测试表明,InDex能够以最少的演示数据有效掌握复杂技能,显著优于整体基线,同时保留了原始VLA先验的鲁棒空间泛化能力。

English abstract

Vision-Language-Action (VLA) models have demonstrated remarkable zero-shot generalization in robotic manipulation, yet the vast majority of pre-trained pipelines remain strictly confined to low-DoF parallel grippers. Adapting these rich semantic priors to high-DoF dexterous hands introduces a severe morphology gap: direct end-to-end joint fine-tuning inherently causes catastrophic forgetting of spatial reasoning and acute action manifold collapse due to data scarcity. In this paper, we present InDex, a novel, data-efficient adaptation framework rooted in cross-morphology semantic inheritance. Rather than discarding the pre-trained 1-DoF parallel grasp output, we repurpose it as a continuous, macroscopic virtual grasp intent proxy to sequentialize the control topology. We implement a two-stage decoupled learning architecture: the first stage parameter-efficiently aligns the VLA backbone to predict continuous arm trajectories and the scalar grasp intent; the second stage freezes this spatial backbone and leverages an intent-conditioned denoising diffusion head to decode fine-grained joint articulations for multi-fingered end-effectors. Extensive simulation benchmarks across a suite of multi-stage, contact-rich dexterous manipulation tasks demonstrate that InDex effectively masters intricate skills with minimal demonstration data, substantially outperforming monolithic baselines while preserving the robust spatial generalizability of the original VLA prior.

2606.12106 2026-06-11 cs.CV cs.AI New submission

MSUE: Multi-Modal Soccer Understanding Expert

MSUE:多模态足球理解专家

Litao Li, Yibo Yu, Yufeng Hu, Zhuo Yang, Jiali Wen, Yixin Chen, Yixi Zhou

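Affiliations: South China University of Technology; Johns Hopkins University; Peking University; University of Electronic Science and Technology of China

AI summary: Proposes MSUE, a multi-expert question answering architecture that combines a VLM-driven data synthesis pipeline with an LLM that dynamically dispatches questions to text, image, and video experts, reaching 0.95 accuracy and third place in the SoccerNet VQA Challenge.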

Details
Comments
6 pages, 1 figure
AI Chinese abstract

本文介绍了我们对2026年SoccerNet VQA挑战赛的解决方案。我们首先开发了一个由视觉语言模型(VLM)驱动的低成本数据合成管道,该系统将原始领域数据系统地重构为多样化的VQA样本,包括简洁答案和长文本回复。其次,我们提出了MSUE,一种多专家问答架构,采用大语言模型(LLM)将问题动态分发给文本、图像和视频专家。这些专家分别实例化为强大的文本基线Gemini3-Flash、微调的Qwen3-VL和外部知识库,协同工作以提升VQA性能。MSUE在挑战基准上达到了0.95的准确率,在排行榜上获得第三名。

English abstract

This paper presents our solution to the 2026 SoccerNet VQA Challenge. We first develop a cost-effective data synthesis pipeline driven by a Vision-Language Model (VLM), which systematically restructures raw domain data into diverse VQA samples, including concise answers and long-form responses. Second, we propose MSUE, a multi-expert question answering architecture that employs a Large Language Model (LLM) to dynamically dispatch questions to text, image, and video experts. These experts are instantiated as a strong text baseline Gemini3-Flash, a fine-tuned Qwen3-VL, and an external knowledge base, respectively, working collaboratively to enhance VQA performance. MSUE achieves an accuracy of 0.95 on the challenge benchmark, securing third place on the leaderboard.

2606.12105 2026-06-11 cs.RO cs.CV cs.LG New submission

DAM-VLA: Decoupled Asynchronous Multimodal Vision Language Action model

DAM-VLA: 解耦异步多模态视觉语言动作模型

Pankhuri Vanjani, Zhuoyue Li, Jakub Suliga, Moritz Reuss, Gianluca Geraci, Xinkai Jiang, Rudolf Lioutikov

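Affiliations: Intuitive Robots Lab, Karlsruhe Institute of Technology (KIT); NVIDIA; Robotics Institute of Germany

AI summary: Addresses the mismatch between the synchronous clock of VLA models and the differing modality frequencies of physical interaction by proposing DAM-VLA, which decouples temporal processing per modality, maintains latent buffers refreshed at sensor rates, and integrates high-frequency modalities through gated cross-attention, raising the average success rate to 95.2% on seven real-world manipulation tasks.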

Details
Comments
17 pages, 8 figures
AI Chinese abstract

视觉-语言-动作(VLA)模型继承了视觉-语言预训练中的共享同步时钟,以单一速率处理每个输入。这与物理交互不一致,在物理交互中,高频模态以数百赫兹变化,视觉演化较慢,而语言在整个回合中保持不变。同步VLA会过采样慢速模态,欠采样快速模态,并将动作生成限制在最低有效频率。我们假设解耦每个模态的时间处理,让每个模态以其自身传感器速率更新和保留信息,可以产生更强的表示和更鲁棒的控制。我们提出DAM-VLA,它维护每个模态的潜在缓冲区,以传感器速率刷新并由动作头连续读取,通过门控交叉注意力整合新的高频模态,同时保持预训练主干不变。在七个接触丰富的真实世界操作任务中,DAM-VLA将最强同步基线的平均成功率提高了一倍以上(95.2% vs. 40.95%),同时维持平滑、反应式的100 Hz控制。项目网站:this https URL

English abstract

Vision-language-action (VLA) models inherit a shared synchronous clock from vision-language pretraining, processing every input at one rate. This is misaligned with physical interaction, where a high-frequency modality changes at hundreds of hertz, vision evolves more slowly, and language stays constant across an episode. A synchronous VLA oversamples slow modalities, undersamples fast ones, and caps action generation at the lowest effective frequency. We hypothesize that decoupling temporal processing per modality, letting each update and retain information at its own sensor rate, yields stronger representations and more robust control. We present DAM-VLA, which maintains per-modality latent buffers refreshed at sensor rates and read continuously by the action head, integrating new high-frequency modalities through gated cross-attention that leaves the pretrained backbone intact. Across seven contact-rich real-world manipulation tasks, DAM-VLA more than doubles the average success rate of the strongest synchronous baseline (95.2% vs. 40.95%) while sustaining smooth, reactive 100 Hz control. Project website: this https URL
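The per-modality buffering idea above can be illustrated with a toy simulation: each modality writes into its own buffer on its own clock, and the action head reads whatever is currently stored at the control rate. The 100 Hz control rate is from the abstract; the class name `LatentBuffer`, the 500 Hz force and 10 Hz vision rates, and the string "latents" are assumptions of this sketch, and gated cross-attention is not modeled at all.

```python
class LatentBuffer:
    """Holds the most recent latent for one modality (sample-and-hold)."""

    def __init__(self, period_ms):
        self.period_ms = period_ms  # refresh period of this sensor
        self.latent = None
        self.updated_at = None

    def maybe_refresh(self, now_ms, encode):
        # Refresh only on this modality's own clock.
        if now_ms % self.period_ms == 0:
            self.latent = encode(now_ms)
            self.updated_at = now_ms


# Assumed sensor periods: force every 2 ms (500 Hz), vision every 100 ms
# (10 Hz). Language is written once and stays constant for the episode.
buffers = {"force": LatentBuffer(2), "vision": LatentBuffer(100)}
language_latent = "pick up the cup"
CONTROL_PERIOD_MS = 10  # 100 Hz control, as stated in the abstract

reads = []
for now_ms in range(0, 201):
    for name, buf in buffers.items():
        buf.maybe_refresh(now_ms, encode=lambda t, n=name: f"{n}@{t}ms")
    if now_ms % CONTROL_PERIOD_MS == 0:
        # The action head reads whatever each buffer currently holds.
        reads.append((now_ms, buffers["force"].latent,
                      buffers["vision"].latent, language_latent))

print(len(reads))  # 21 control steps in 200 ms
print(reads[5])    # (50, 'force@50ms', 'vision@0ms', 'pick up the cup')
```

At the 50 ms control step the force latent is fresh while the vision latent is still the one captured at 0 ms, which is the kind of mixed-age input a synchronous pipeline would instead force onto a single shared rate.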

2606.12099 2026-06-11 cs.CV New submission

ISAP-3D: Identity-Slot Aligned Part-Aware 3D Generation

ISAP-3D: 身份槽对齐的部件感知3D生成

Junlin Hao, Haoshuai Fu, Xibin Song, Wei Li, Ruigang Yang, Xinggong Zhang, Jinchuan Zhang

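Affiliations: Peking University; Tencent; Huawei; University of Science and Technology of China

AI summary: Addresses structural ambiguity caused by identity-layout entanglement in part-aware 3D generation by proposing ISAP-3D, an identity-slot aligned framework that anchors each part with semantic identity tokens and performs one-to-one layout prediction for stable, controllable part-level 3D generation.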

Details
AI Chinese abstract

部件感知3D生成旨在合成具有语义意义组件的结构化对象,但由于身份-布局纠缠,常常遭受结构歧义。现有方法要么隐式推断部件身份和空间布局,导致不稳定的部件分配(例如槽交换或部件合并),要么依赖在实践中难以获得的强布局条件。我们将这种歧义归因于身份槽置换自由度:没有显式的身份槽对齐,训练期间语义部件和生成槽之间的对应关系不可识别,允许多个槽分配适应相同的监督,导致不一致的分解。基于这一见解,我们认为稳定的部件感知生成需要身份对齐的一对一槽建模。因此,我们提出了一个身份槽对齐框架ISAP-3D,该框架用语义身份令牌锚定每个部件,执行身份条件的一对一布局预测,随后进行布局条件的几何合成。结构化的局部-全局条件在语义、空间和几何阶段保持身份对齐。我们还构建了一个具有统一语义协议的部件级数据集,以实现可学习且一致的身份槽对齐。大量实验表明,与最先进的部件感知生成基线相比,我们的方法在结构稳定性、可控性和鲁棒性方面有所改进。

English abstract

Part-aware 3D generation aims to synthesize structured objects with semantically meaningful components, yet often suffers from structural ambiguity due to identity-layout entanglement. Existing methods either infer part identity and spatial layout implicitly, which can lead to unstable part allocation (e.g., slot swapping or part merging), or rely on strong layout conditions that are difficult to obtain in practice. We attribute this ambiguity to identity-slot permutation freedom: without explicit identity-slot alignment, the correspondence between semantic parts and generation slots is not identifiable during training, allowing multiple slot assignments to fit the same supervision and leading to inconsistent decomposition. Based on this insight, we argue that stable part-aware generation requires identity-aligned one-to-one slot modelling. We therefore propose an identity-slot aligned framework, ISAP-3D, which anchors each part with semantic identity tokens and performs identity-conditioned one-to-one layout prediction, followed by layout-conditioned geometry synthesis. Structured local-global conditioning maintains identity alignment across semantic, spatial, and geometric stages. We also construct a part-level dataset with a unified semantic protocol to enable learnable and consistent identity-slot alignment. Extensive experiments demonstrate improved structural stability, controllability, and robustness over state-of-the-art part-aware generation baselines.

2606.12088 2026-06-11 cs.CL New submission

Debiasing Without Protected Attributes: Latent Concept Erasure from Textual Profiles

无保护属性的去偏:从文本画像中消除潜在概念

Shun Shao, Zheng Zhao, Anna Korhonen, Yftah Ziser, Shay B. Cohen

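Affiliations: University of Cambridge; University of Edinburgh; University of Groningen; NVIDIA Research

AI summary: Proposes H-SAL, which uses self-description text as an implicit signal for post-hoc concept and attribute erasure, debiasing without direct access to sensitive attributes; on a multi-domain Stack Exchange benchmark it often matches or outperforms explicit-label debiasing.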

Details
Comments
23 pages, 5 figures, 12 tables. The paper is currently under review
AI Chinese abstract

大多数自然语言处理中的公平性研究假设可以直接访问性别、种族或国籍等保护属性。然而,在实践中,由于隐私限制、元数据缺失或法律约束,这些信息通常不可用,尽管模型可能从间接文本线索中推断出来。这引发了一个关键问题:在没有直接访问敏感属性的情况下,去偏能否成功?我们提出了H-SAL,它利用自我描述文本作为隐式去偏信号,执行事后概念和属性消除。为了支持这一设置,我们引入了一个基于Stack Exchange的多领域公平性基准,用于帮助度预测,该基准包括显式和隐式信号,从而能够在有保护标签的标准去偏和无敏感信息访问的去偏之间进行比较。在编码器和仅解码器语言模型中,我们发现隐式自我描述通常匹配或优于基于显式标签的去偏。我们的结果拓宽了表示层面的公平性研究,并为在现实数据约束下研究去偏提供了新的基准。

English abstract

Most fairness research in NLP assumes direct access to protected attributes such as gender, race, or nationality. In practice, however, such information is often unavailable due to privacy constraints, missing metadata, or legal restrictions, even though models may infer it from indirect textual cues. This raises a key question: can debiasing succeed without direct access to sensitive attributes? We propose H-SAL, which performs post-hoc concept and attribute erasure using self-description text as an implicit debiasing signal. To support this setting, we introduce a multi-domain Stack Exchange-based fairness benchmark for helpfulness prediction that includes both explicit and implicit signals, enabling comparison between standard debiasing with protected labels and debiasing without access to sensitive information. Across encoder and decoder-only language models, we find that implicit self-description often matches or outperforms explicit-label-based debiasing. Our results broaden representation-level fairness research and provide a new benchmark for studying debiasing under realistic data constraints.