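arXivDaily: a daily academic digest synchronized with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and other fields.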

2604.14692 2026-05-18 cs.CV

Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding

Zhixuan Wu, Quanxing Zha, Teng Wang, Genbao Xu, Wenyuan Gu, Wei Rao, Nan Ma, Bo Cheng, Soujanya Poria

Affiliations: State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China; Institute of Big Data, College of Computer Science and Artificial Intelligence, Fudan University, China; ARC Lab, Tencent PCG, Shenzhen, China; School of Artificial Intelligence, Beijing University of Technology, Beijing, China

AI summary: This paper proposes Chain-of-Glimpse, a framework that uses search-guided progressive reasoning to handle object variation in videos, improving the accuracy and interpretability of multi-step decisions.

Abstract

Video understanding requires identifying and reasoning over semantically discriminative visual objects across frames, yet existing object-agnostic solutions struggle to effectively handle substantial object variations over time. To address this, we introduce Chain-of-Glimpse, a search-guided progressive object-grounded reasoning framework that explicitly anchors each reasoning step to specific visual evidence regions, enabling compositional and multi-step decision-making. Formally, Chain-of-Glimpse formulates video reasoning as a step-by-step process that incrementally builds spatially grounded traces around task-relevant visual objects, thereby mitigating over-reliance on saliency-driven cues. Specifically, Chain-of-Glimpse features a search-guided controller, optimized via reinforcement learning with a format reward that significantly incentivizes grounding capability, to iteratively ground visual evidence regions and form reliable reasoning trajectories, yielding accurate and interpretable multi-step decisions. Extensive evaluations on both the in-domain NExTQA benchmark and the out-of-domain Video-Holmes, CG-Bench Reasoning, and VRBench benchmarks demonstrate consistent performance gains, robustness, and generalization of Chain-of-Glimpse across diverse video reasoning tasks.

2604.10210 2026-05-18 cs.CV cs.AI cs.LG

A3-FPN: Asymptotic Content-Aware Pyramid Attention Network for Dense Visual Prediction

Meng'en Qin, Yu Song, Quanling Zhao, Xiaodong Yang, Yingtao Che, Xiaohui Yang

Affiliations: Henan Engineering Research Center for Artificial Intelligence Theory and Algorithms; Henan University; Faculty of Computer Science and Control Engineering; Shenzhen University of Advanced Technology; Department of Electrical and Electronic Engineering

AI summary: This paper proposes A3-FPN, which strengthens multi-scale feature representation through an asymptotically disentangled framework and content-aware attention modules, improving small-object recognition in dense prediction tasks.

Journal ref Pattern Recognition, 2026, 113793

DOI: 10.1016/j.patcog.2026.113793

Abstract

Learning multi-scale representations is a common strategy to tackle object scale variation in dense prediction tasks. Although existing feature pyramid networks have greatly advanced visual recognition, inherent design defects inhibit them from capturing discriminative features and recognizing small objects. In this work, we propose the Asymptotic Content-Aware Pyramid Attention Network (A3-FPN) to augment multi-scale feature representation via an asymptotically disentangled framework and content-aware attention modules. Specifically, A3-FPN employs a horizontally-spread column network that enables asymptotically global feature interaction and disentangles each level from all hierarchical representations. In feature fusion, it collects supplementary content from the adjacent level to generate position-wise offsets and weights for context-aware resampling, and learns deep context reweights to improve intra-category similarity. In feature reassembly, it further strengthens intra-scale discriminative feature learning and reassembles redundant features based on information content and spatial variation of feature maps. Extensive experiments on MS COCO, VisDrone2019-DET and Cityscapes demonstrate that A3-FPN can be easily integrated into state-of-the-art CNN and Transformer-based architectures, yielding remarkable performance gains. Notably, when paired with OneFormer and a Swin-L backbone, A3-FPN achieves 49.6 mask AP on MS COCO and 85.6 mIoU on Cityscapes. Code is available at https://github.com/mason-ching/A3-FPN.

2604.08426 2026-05-18 cs.LG cs.AI cs.CL

KV Cache Offloading for Context-Intensive Tasks

Andrey Bocharnikov, Ivan Ermakov, Denis Kuznedelev, Vyacheslav Zhdanovskiy, Yegor Yershov

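Affiliations: HSE; Yandex; NSU

AI summary: This paper studies KV-cache offloading on context-intensive tasks. Using the Text2JSON benchmark, it finds that the approach degrades performance on Llama 3 and Qwen 3 models, traces the problem mainly to low-rank projection of keys and unreliable landmarks, and proposes a simpler alternative strategy that improves accuracy.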

Comments Preprint

Abstract

With the growing demand for long-context LLMs across a wide range of applications, the key-value (KV) cache has become a critical bottleneck for both latency and memory usage. Recently, KV-cache offloading has emerged as a promising approach to reduce memory footprint and inference latency while preserving accuracy. Prior evaluations have largely focused on tasks that do not require extracting large amounts of information from the context. In this work, we study KV-cache offloading on context-intensive tasks: problems where the solution requires looking up a lot of information from the input prompt. We create and release the Text2JSON benchmark, a highly context-intensive task that requires extracting structured knowledge from raw text. We evaluate modern KV offloading on Text2JSON and other context-intensive tasks and find significant performance degradation on both Llama 3 and Qwen 3 models. Our analysis identifies two key reasons for poor accuracy: low-rank projection of keys and unreliable landmarks, and proposes a simpler alternative strategy that significantly improves accuracy across multiple LLM families and benchmarks. These findings highlight the need for a comprehensive and rigorous evaluation of long-context compression techniques.

2604.05966 2026-05-18 cs.CL

FinReporting: An Agentic Workflow for Localized Reporting of Cross-Jurisdiction Financial Disclosures

Fan Zhang, Mingzi Song, Rania Elbadry, Yankai Chen, Shaobo Wang, Yixi Zhou, Xunwen Zheng, Yueru He, Yuyang Dai, Georgi Georgiev, Ayesha Gull, Muhammad Usman Safder, Fan Wu, Liyuan Meng, Fengxian Ji, Junning Zhao, Xueqing Peng, Jimin Huang, Yu Chen, Xue Liu, Preslav Nakov, Zhuohan Xie

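Affiliations: MBZUAI; The University of Tokyo; Meiji Gakuin University; McGill University; Kyoto University; Columbia University; University of California, Berkeley

AI summary: This paper presents FinReporting, an agentic workflow for localized reporting of cross-jurisdiction financial disclosures. The system builds a unified ontology covering the income statement, balance sheet, and cash flow statement, decomposes reporting into auditable stages, and uses constrained verifiers to improve consistency and reliability.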

Comments Accepted at ACL 2026 Demo Track. 9 pages, including figures and tables

Abstract

Financial reporting systems increasingly leverage Large Language Models (LLMs) to extract and summarize corporate disclosures. However, most existing approaches assume a single-market setting and overlook structural differences across jurisdictions. Variations in accounting taxonomies, tagging infrastructures (e.g., XBRL vs. PDF), and aggregation conventions introduce substantial challenges for semantic alignment and reliable verification. Here, we aim to bridge this gap. We present FinReporting, an agentic workflow for localized cross-jurisdiction financial reporting. The system constructs a unified canonical ontology spanning the income statement, balance sheet, and cash flow statement, and decomposes reporting into auditable stages, including filing acquisition, extraction, canonical mapping, and anomaly logging. Rather than treating LLMs as free-form generators, FinReporting employs them as constrained verifiers operating under explicit decision rules with evidence grounding. Evaluated on annual filings from the USA, Japan, and China, FinReporting improves consistency and reliability under heterogeneous reporting regimes. We further release an interactive demo that enables cross-market inspection and supports structured export of localized financial statements. Our demo is available at https://huggingface.co/spaces/BoomQ/FinReporting-Demo. A video describing our system is available at https://www.youtube.com/watch?v=f65jdEL31Kk.

2604.02812 2026-05-18 cs.RO

Learning Structured Robot Policies from Vision-Language Models via Synthetic Neuro-Symbolic Supervision

Alessandro Adami, Tommaso Tubaldo, Marco Todescato, Ruggero Carli, Pietro Falco

Affiliations: University of Padova, Dept. of Information Engineering; Fraunhofer Italia Research; Polytechnic of Bari, Dept. of Electrical and Information Engineering

AI summary: This paper proposes synthetic neuro-symbolic supervision for training vision-language models to generate structured robot policies grounded in multimodal perception, bridging high-dimensional learning and symbolic control.

Abstract

Vision-Language Models (VLMs) have recently demonstrated strong capabilities in mapping multimodal observations to robot behaviors. However, most current approaches rely on end-to-end visuomotor policies that remain opaque and difficult to analyze, limiting their use in real-world robotic applications. In contrast, classical robotic systems often rely on structured policy representations that provide interpretability, modularity, and reactive execution. This work investigates how foundation models can be specialized to generate structured robot policies grounded in multimodal perception, bridging high-dimensional learning and symbolic control. We propose a neuro-symbolic approach in which a VLM synthesizes executable Behavior Tree policies from visual observations, natural language instructions, and structured system specifications. To enable scalable supervision without manual annotation, we introduce an automated pipeline that generates a synthetic multimodal dataset of domain-randomized scenes paired with instruction-policy examples produced by a foundation model. By decoupling structured task decomposition under constrained symbolic grammars from hardware-specific motor control, we demonstrate that a 12B-parameter model can learn structured spatial-symbolic mappings required for executable BT synthesis, solely through in-silico supervision. Real-world physical experiments on two heterogeneous robotic manipulators confirm that these structurally constrained policies achieve zero-shot transfer to real-world environments. The results emphasize that the data bottleneck in robotic planning can be bypassed by procedurally synthesizing high-fidelity, neuro-symbolic training data.

2603.17915 2026-05-18 cs.CL cs.AI

IndicSafe: A Benchmark for Evaluating Multilingual LLM Safety in South Asia

Priyaranjan Pattnayak, Sanchari Chowdhuri

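Affiliations: Oracle America Inc.

AI summary: This paper introduces the IndicSafe benchmark, which evaluates LLM safety in 12 South Asian languages. It finds only 12.8% cross-language agreement and safe-rate variation above 17%, exposing a safety generalization gap in multilingual LLMs.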

Abstract

As large language models (LLMs) are deployed in multilingual settings, their safety behavior in culturally diverse, low-resource languages remains poorly understood. We present the first systematic evaluation of LLM safety across 12 Indic languages, spoken by over 1.2 billion people but underrepresented in LLM training data. Using a dataset of 6,000 culturally grounded prompts spanning caste, religion, gender, health, and politics, we assess 10 leading LLMs on translated variants of the prompts. Our analysis reveals significant safety drift: cross-language agreement is just 12.8%, and SAFE rate variance exceeds 17% across languages. Some models over-refuse benign prompts in low-resource scripts and overflag politically sensitive topics, while others fail to flag unsafe generations. We quantify these failures using prompt-level entropy, category bias scores, and multilingual consistency indices. Our findings highlight critical safety generalization gaps in multilingual LLMs and show that safety alignment does not transfer evenly across languages. We release IndicSafe, the first benchmark to enable culturally informed safety evaluation for Indic deployments, and advocate for language-aware alignment strategies grounded in regional harms.
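
The abstract names prompt-level entropy and cross-language agreement without defining them. The sketch below shows one plausible reading, offered as an assumption rather than the paper's definitions: entropy is the Shannon entropy of the safety labels one prompt receives across languages, and agreement is the fraction of prompts labeled identically in every language. The language codes, labels, and data are made up for illustration.

```python
from collections import Counter
from math import log2

def label_entropy(labels):
    """Shannon entropy (bits) of the labels one prompt receives across languages."""
    counts = Counter(labels)
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in counts.values())

def cross_language_agreement(responses):
    """Fraction of prompts whose label is identical in every language."""
    unanimous = sum(
        1 for labels in responses.values() if len(set(labels.values())) == 1
    )
    return unanimous / len(responses)

# Hypothetical judgments: prompt id -> {language code: safety label}.
responses = {
    "prompt_1": {"hi": "SAFE", "bn": "SAFE", "ta": "SAFE"},
    "prompt_2": {"hi": "SAFE", "bn": "UNSAFE", "ta": "SAFE"},
}

agreement = cross_language_agreement(responses)
entropies = {p: label_entropy(list(l.values())) for p, l in responses.items()}
print(agreement, entropies)
```

Under this reading, a prompt with zero entropy is handled consistently across languages, and higher entropy marks prompts where the safety behavior drifts.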

2603.15269 2026-05-18 cs.CV

Self-Supervised ImageNet Representations for In Vivo Confocal Microscopy: Tortuosity Grading without Segmentation Maps

Kim Ouan, Noémie Moreau, Katarzyna Bozek

Affiliations: Faculty of Mathematics and Natural Sciences, University of Cologne, Germany; Institute for Biomedical Informatics, Faculty of Medicine and University Hospital Cologne, University of Cologne, Germany; Center for Molecular Medicine Cologne (CMMC), Faculty of Medicine and University Hospital Cologne, University of Cologne, Germany; Cologne Excellence Cluster on Cellular Stress Responses in Aging-Associated Diseases (CECAD), University of Cologne, Germany

AI summary: This paper proposes using self-supervised pretrained ImageNet features for tortuosity grading in in vivo confocal microscopy without segmentation maps, improving accuracy and sensitivity.

Comments 7 pages, 4 figures, MIDL 2026 - Short Paper Track

2603.05377 2026-05-18 cs.RO cs.CV

OpenFrontier: General Navigation with Visual-Language Grounded Frontiers

Esteban Padilla-Cerdio, Boyang Sun, Marc Pollefeys, Hermann Blum

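Affiliations: ETH Zurich; Microsoft Spatial AI Lab; University of Bonn

AI summary: This paper proposes the OpenFrontier framework, which achieves efficient navigation by casting it as a sparse subgoal identification and reaching problem. It needs no task-specific training or fine-tuning, works with a range of vision-language prior models, and shows zero-shot performance and real-robot deployment.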

Abstract

Open-world navigation requires robots to make decisions in complex everyday environments while adapting to flexible task requirements. Conventional navigation approaches often rely on dense 3D reconstruction and hand-crafted goal metrics, which limits their generalization across tasks and environments. Recent advances in vision-language navigation (VLN) and vision-language-action (VLA) models enable end-to-end policies conditioned on natural language, but typically require interactive training, large-scale data collection, or task-specific fine-tuning with a mobile agent. We formulate navigation as a sparse subgoal identification and reaching problem and observe that providing visual anchoring targets for high-level semantic priors enables highly efficient goal-conditioned navigation. Based on this insight, we select visual frontiers as semantic anchors and propose OpenFrontier, a navigation framework that requires no task-specific training or fine-tuning and seamlessly integrates diverse vision-language prior models. OpenFrontier enables efficient navigation with a lightweight system design, without dense 3D semantic mapping, task-specific policy training, or model fine-tuning. We evaluate OpenFrontier across multiple navigation benchmarks and demonstrate strong zero-shot performance, as well as effective real-world deployment on a mobile robot.

2603.04299 2026-05-18 cs.CL

The Company You Keep: How LLMs Respond to Dark Triad Traits

Zeyi Lu, Angelica Henestrosa, Pavel Chizhov, Ivan P. Yamshchikov

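Affiliations: Technical University of Applied Sciences Würzburg-Schweinfurt

AI summary: This study examines how LLMs respond to user prompts expressing different Dark Triad traits (Machiavellianism, narcissism, psychopathy). It finds that the models differ in corrective versus reinforcing behavior across severity levels, with implications for designing safer conversational systems.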

2603.01290 2026-05-18 cs.AI cs.GT cs.LG cs.SY eess.SY

Opponent State Inference Under Partial Observability: An HMM-POMDP Framework for 2026 Formula 1 Energy Strategy

Kalliopi Kleisarchaki

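Affiliations: Independent Researcher

AI summary: This paper proposes an HMM-POMDP framework for 2026 Formula 1 energy strategy. An HMM infers rival state and a DQN makes decisions, addressing a partially observable game and detecting the counter-harvest trap.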

Comments 17 pages. v3: editorial corrections and bibliographic updates. Pre-registered theoretical framework; empirical calibration on 2026 race telemetry from Australian Grand Prix (8 March 2026) onwards

Abstract

The 2026 Formula 1 technical regulations introduce a fundamental change to energy strategy: under a 50/50 internal combustion engine / battery power split with unlimited regeneration and a driver-controlled Override Mode, the optimal energy deployment policy depends not only on a driver's own state but on the hidden state of rival cars. This creates a Partially Observable Stochastic Game that cannot be solved by single-agent optimisation methods. We present a tractable two-layer inference and decision framework. The first layer is a 40-state Hidden Markov Model (HMM) that infers a probability distribution over each rival's ERS charge level (four modes: H, M, L_harvest, L_derate), Override Mode status, and tyre degradation state from six publicly observable telemetry signals. The second layer is a Deep Q-Network (DQN) policy that takes the HMM belief state as input and selects between energy deployment strategies. We formally characterise the counter-harvest trap, a deceptive strategy in which a car deliberately suppresses observable deployment signals to induce a rival into a failed attack, and show that detecting it requires belief-state inference over both ERS level and the harvest/derate sub-mode. On synthetic races, the HMM achieves 96.8% ERS-level accuracy (random baseline 25%), classifies L_harvest vs. L_derate with 89.4% accuracy, and detects counter-harvest trap conditions with 96.3% recall. Pre-season analysis indicates circuit-dependent recharge availability (1.0x to 2.2x per lap) as the primary confound; Melbourne is the hardest-case validation environment. Baum-Welch calibration on 2026 race telemetry begins with the Australian Grand Prix (8 March 2026).
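
The first layer described above maintains a belief over hidden rival states. A minimal sketch of one HMM filtering (forward) step follows. It is reduced to the four ERS modes named in the abstract, not the paper's 40-state model, and the transition matrix and observation likelihoods are invented for illustration.

```python
def forward_step(belief, transition, likelihood):
    """One HMM filtering step: predict with the transition model, then
    reweight by the likelihood of the new observation and normalize."""
    n = len(belief)
    predicted = [
        sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)
    ]
    unnormalized = [predicted[j] * likelihood[j] for j in range(n)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# ERS modes from the abstract; every number below is a made-up placeholder.
states = ["H", "M", "L_harvest", "L_derate"]
belief = [0.25, 0.25, 0.25, 0.25]
transition = [  # transition[i][j] = P(next state j | current state i)
    [0.7, 0.3, 0.0, 0.0],
    [0.1, 0.6, 0.2, 0.1],
    [0.0, 0.3, 0.6, 0.1],
    [0.0, 0.2, 0.2, 0.6],
]
likelihood = [0.05, 0.15, 0.60, 0.20]  # P(observed telemetry | state)

belief = forward_step(belief, transition, likelihood)
most_likely = states[belief.index(max(belief))]
print(most_likely, belief)
```

The returned list is the belief state that a downstream policy, such as the DQN in the second layer, would take as input.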

2602.23410 2026-05-18 cs.LG cs.AI eess.SP q-bio.NC

Brain-OF: An Omnifunctional Foundation Model for fMRI, EEG and MEG

Hanning Guo, Hanwen Bi, Farah Abdellatif, Andrei Galbenus, Jon. N. Shah, Abigail Morrison, Jürgen Dammers

Affiliations: INM-4, Forschungszentrum Jülich, Germany; Department of Computer Science; Software Engineering, RWTH Aachen University, Germany; INM-7, Forschungszentrum Jülich, Germany; Institute of Systems Neuroscience, Heinrich Heine University, Germany; Department of Neurology, RWTH Aachen University, Germany; JARA-BRAIN-Translational Medicine, Germany; INM-11, JARA, Forschungszentrum Jülich, Germany; IAS-6, Forschungszentrum Jülich, Germany; Department of Psychiatry, Psychotherapy and Psychosomatics, RWTH Aachen University, Germany

AI summary: Brain-OF is jointly pretrained on fMRI, EEG, and MEG data. It addresses semantic heterogeneity and resolution differences across modalities and improves cross-modal data processing.

Abstract

Brain foundation models have achieved remarkable advances across a wide range of neuroscience tasks. However, most existing models are limited to a single functional modality, restricting their ability to exploit complementary spatiotemporal dynamics and the collective data scale across different neuroimaging techniques. This limitation largely arises from severe semantic heterogeneity and resolution discrepancies among modalities. To address these challenges, we propose Brain-OF, an omnifunctional brain foundation model jointly pretrained on fMRI, EEG and MEG, capable of handling both unimodal and multimodal inputs within a unified framework. To reconcile heterogeneous spatiotemporal resolutions, we introduce the Any-Resolution Neural Signal Sampler, which projects diverse brain signals into a shared semantic space. To further manage semantic shifts, the Brain-OF backbone integrates DINT attention with a Sparse Mixture of Experts, where shared experts capture modality-invariant representations and routed experts specialize in modality-specific semantics. Furthermore, to explicitly internalize the characteristics of neural activity through self-supervised learning, we propose Masked Temporal-Frequency Modeling, a dual-domain pretraining objective that jointly reconstructs brain signals in both the time and frequency domains. Brain-OF is pretrained on a large-scale corpus comprising around 40 datasets and demonstrates superior performance across diverse downstream tasks, highlighting the benefits of joint multimodal integration and dual-domain pretraining.

2602.21536 2026-05-18 cs.CV

IHF-Harmony: Multi-Modality Magnetic Resonance Images Harmonization using Invertible Hierarchy Flow Model

Pengli Zhu, Yitao Zhu, Haowen Pang, Anqi Qiu

Affiliations: Department of Health Technology and Informatics, The Hong Kong Polytechnic University, Hong Kong; School of Integrated Circuits and Electronics, Beijing Institute of Technology, China; Mental Health Research Center, The Hong Kong Polytechnic University, Hong Kong; Department of Biomedical Engineering, Johns Hopkins University, USA

AI summary: This paper proposes IHF-Harmony, which harmonizes multi-modality MRI images with an invertible hierarchy flow model. It uses unpaired data to improve scalability across modalities, preserves anatomical structure, and improves downstream task performance.

Abstract

Retrospective MRI harmonization is limited by poor scalability across modalities and reliance on traveling subject datasets. To address these challenges, we introduce IHF-Harmony, a unified invertible hierarchy flow framework for multi-modality harmonization using unpaired data. By decomposing the translation process into reversible feature transformations, IHF-Harmony guarantees bijective mapping and lossless reconstruction to prevent anatomical distortion. Specifically, an invertible hierarchy flow (IHF) performs hierarchical subtractive coupling to progressively remove artefact-related features, while an artefact-aware normalization (AAN) employs anatomy-fixed feature modulation to accurately transfer target characteristics. Combined with anatomy and artefact consistency loss objectives, IHF-Harmony achieves high-fidelity harmonization that retains source anatomy. Experiments across multiple MRI modalities demonstrate that IHF-Harmony outperforms existing methods in both anatomical fidelity and downstream task performance, facilitating robust harmonization for large-scale multi-site imaging studies. Code is available at https://github.com/Idea89560041/IHF-Harmony.
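
The claim of bijective mapping and lossless reconstruction rests on coupling layers that can be undone exactly. The sketch below shows a generic subtractive coupling step on plain lists. It is not the paper's IHF module, and `shift` is a hypothetical stand-in for a learned network.

```python
def shift(h):
    """Hypothetical stand-in for a learned network; it need not be invertible."""
    return [0.5 * v * v + 1.0 for v in h]

def forward(x1, x2):
    """Subtractive coupling: keep x1, subtract a function of x1 from x2."""
    y2 = [b - s for b, s in zip(x2, shift(x1))]
    return x1, y2

def inverse(y1, y2):
    """Exact inverse: recompute the same shift from y1 and add it back."""
    x2 = [b + s for b, s in zip(y2, shift(y1))]
    return y1, x2

# Made-up feature halves for illustration.
x1, x2 = [1.0, 2.0], [4.0, 0.25]
y1, y2 = forward(x1, x2)
r1, r2 = inverse(y1, y2)
print(y2, r2)
```

Because the first half passes through unchanged, the inverse can recompute the same shift, so the input is recovered up to floating-point rounding regardless of how complex `shift` is.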

2602.21141 2026-05-18 cs.CV

SynthRender and IRIS: Open-Source Framework and Dataset for Bidirectional Sim-Real Transfer in Industrial Object Perception

Jose Moises Araya-Martinez, Thushar Tom, Adrián Sanchis Reig, Pablo Rey Valiente, Jens Lambrecht, Jörg Krüger

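Affiliations: Technical University Berlin, Industrial Automation Technology; Mercedes-Benz AG, Future Manufacturing Technologies; Technical University Braunschweig, Institute for Cognitive Robotics

AI summary: This paper presents SynthRender and IRIS, which combine synthetic data generation with structured evaluation to study bidirectional sim-to-real transfer systematically. It provides a 32-class dataset with CAD models, supporting data-efficient synthetic training for industrial applications.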

Abstract

Object perception is fundamental for tasks such as robotic material handling and quality inspection. However, modern supervised deep-learning models require large annotated datasets for robust automation under semi-uncontrolled conditions, a major barrier for widespread deployment with proprietary industrial parts. We address this through an integrated framework combining synthetic data generation and structured empirical evaluation for systematic investigation of bidirectional sim-to-real transfer. Our method integrates 2D-to-3D Reality-to-Simulation techniques for 3D asset creation from physical parts with programmatic Guided Domain Randomization (GDR) via SynthRender, an open-source synthetic image generation framework. Structured ablation studies across multiple benchmarks quantify the impact of individual rendering design choices, yielding practical guidelines for data-efficient synthetic training. To support evaluation under realistic industrial conditions, we introduce Industrial Real-Sim Imagery Set (IRIS), a 32-class dataset with diverse textures, intra-class variation, strong inter-class similarities, and 19,672 annotations, providing both CAD models and reconstructed meshes for bidirectional sim-to-real benchmarking. Across three industrial benchmarks, the proposed framework achieves highly competitive performance, reaching 99.1% mAP@50 on a public robotics dataset, 98.3% mAP@50 on an automotive benchmark, and 95.3% mAP@50 on IRIS.
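
The mAP@50 figures count a detection as correct when its box overlaps a ground-truth box with intersection over union (IoU) of at least 0.5. The sketch below shows only that matching criterion, with made-up boxes. Full mAP additionally ranks detections by confidence and averages precision over recall levels and classes.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(pred_box, gt_box, threshold=0.5):
    """A prediction matches the ground truth when IoU reaches the threshold."""
    return iou(pred_box, gt_box) >= threshold

# Made-up boxes for illustration.
ground_truth = (0, 0, 10, 10)
tight = (0, 0, 10, 5)     # covers half of the ground-truth box
loose = (5, 5, 15, 15)    # overlaps only one corner

print(iou(tight, ground_truth), is_true_positive(tight, ground_truth))
print(iou(loose, ground_truth), is_true_positive(loose, ground_truth))
```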

2602.19423 2026-05-18 cs.CV

Prefer-DAS: Learning from Local Preferences and Sparse Prompts for Domain Adaptive Segmentation of Electron Microscopy

Jiabao Chen, Shan Xiong, Jialin Peng

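Affiliations: College of Computer Science and Technology, Huaqiao University

AI summary: This paper proposes Prefer-DAS, which uses local preferences and sparse prompts for annotation-efficient domain adaptive segmentation. It combines self-training with prompt-guided contrastive learning, improving segmentation performance and flexibility.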

Abstract

Domain adaptive segmentation (DAS) is a promising paradigm for delineating intracellular structures from various large-scale electron microscopy (EM) data without requiring extensive annotated data in each domain. However, the prevalent unsupervised domain adaptation (UDA) strategies often demonstrate limited and biased performance, which hinders their practical applications. In this study, we explore sparse points and local human preferences as weak labels in the target domain, thereby presenting a more realistic yet annotation-efficient setting. Specifically, we develop Prefer-DAS, which pioneers sparse promptable learning and local preference alignment. Prefer-DAS is a promptable multitask model that integrates self-training and prompt-guided contrastive learning. Unlike SAM-like methods, Prefer-DAS allows for the use of full, partial, and even no point prompts during both training and inference stages and thus enables interactive segmentation. Instead of using image-level human preference alignment for segmentation, we introduce Local direct Preference Optimization (LPO), plug-and-play solutions for alignment with spatially varying human feedback. To address potential missing feedback, we also introduce Unsupervised Preference Optimization (UPO), which leverages self-learned preferences. As a result, the Prefer-DAS model can effectively perform both weakly-supervised and unsupervised DAS, depending on the availability of points and human preferences. Comprehensive experiments on four challenging DAS tasks demonstrate that our model outperforms SAM-like methods as well as unsupervised and weakly-supervised DAS methods in both automatic and interactive segmentation modes, highlighting strong generalizability and flexibility. Additionally, the performance of our model is very close to or even exceeds that of supervised models.


2602.08556 2026-05-18 cs.SD

Global Rotation Equivariant Phase Modeling for Speech Enhancement with Deep Magnitude-Phase Interaction

全局旋转等变相位建模用于语音增强的深度幅度-相位交互

Chengzhong Wang, Andong Li, Dingding Yao, Junfeng Li

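Affiliations: Institute of Acoustics, Chinese Academy of Sciences; Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences; University of Chinese Academy of Sciences

AI summary: This paper proposes a globally rotation-equivariant magnitude-phase dual-stream framework. By forcing the phase stream to remain equivariant to global rotations, it improves phase modeling for speech enhancement; experiments show that it outperforms existing methods on phase retrieval, denoising, echo removal, and bandwidth extension.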

Comments Submitted to IEEE TASLP

AI abstract

ExplainerPFN: Toward a Tabular Foundation Model for Model-Free Zero-Shot Feature Importance Estimation

Joao Fonseca, Julia Stoyanovich

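Affiliations: INESC-ID; New York University

AI summary: This paper proposes ExplainerPFN, a tabular foundation model built on TabPFN and pretrained on synthetic structural causal data, which estimates feature importance zero-shot without access to the model and is shown to be competitive on real and synthetic datasets.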

Comments 35 pages, 11 figures

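AI abstract

Computing feature importance in supervised classification tasks is essential for model interpretability. Shapley values are a common way to explain model predictions, but they require direct access to the underlying model, an assumption that is often violated in real-world deployments. We investigate whether meaningful feature attributions can be obtained in a zero-shot setting, using only the input data distribution and without evaluating the target model. Because multiple models can produce identical predictions yet yield different Shapley decompositions, the mapping from data to attributions is not uniquely identifiable. We therefore target attributions that are "true to the data" rather than "true to the model", learning a posterior mean attribution under a meta-training prior. To this end, we introduce ExplainerPFN, a tabular foundation model built on TabPFN and pretrained on synthetic structural causal data with exact or near-exact Shapley-value supervision, which predicts feature attributions for unseen tabular datasets without model access, gradients, or example explanations. Our contributions are: (1) showing that few-shot surrogate explainers reach high SHAP fidelity with as few as two reference observations; (2) proposing ExplainerPFN, the first zero-shot method that needs no access to the underlying model or reference explanations, providing attributions where no existing explainer can be applied; (3) releasing an open-source implementation, including the full training pipeline and synthetic data generator; and (4) showing, through extensive experiments on real and synthetic datasets, that ExplainerPFN is competitive with few-shot surrogate explainers that rely on 2-10 SHAP examples.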

English abstract

Computing the importance of features in supervised classification tasks is critical for model interpretability. Shapley values are a widely used approach for explaining model predictions, but require direct access to the underlying model, an assumption frequently violated in real-world deployments. We investigate whether meaningful feature attributions can be obtained in a zero-shot setting, using only the input data distribution and no evaluations of the target model. Because multiple models can produce identical predictions yet yield different Shapley decompositions, the mapping from data to attributions is not uniquely identifiable. We therefore target attributions that are "true to the data" rather than "true to the model", learning a posterior mean attribution under a meta-training prior. To this end, we introduce ExplainerPFN, a tabular foundation model built on TabPFN, pretrained on synthetic structural causal datasets supervised with exact or near-exact Shapley values, that predicts feature attributions for unseen tabular datasets without model access, gradients, or example explanations. Our contributions are fourfold: (1) we show that few-shot surrogate explainers achieve high SHAP fidelity with as few as two reference observations; (2) we propose ExplainerPFN, the first zero-shot method for estimating Shapley-value-style feature attributions without access to the underlying model or reference explanations, providing a principled attribution where no existing explainer can be applied; (3) we release an open-source implementation including the full training pipeline and synthetic data generator; and (4) through extensive experiments on real and synthetic datasets, we show that ExplainerPFN achieves performance competitive with few-shot surrogate explainers that rely on 2-10 SHAP examples.
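The non-identifiability point in this abstract (identical predictions, different Shapley decompositions) can be illustrated with a toy case. The sketch below is not the ExplainerPFN method; it is a brute-force, interventional Shapley computation under assumed conditions: two features that are always equal in the data, and two hypothetical models (`model_a`, `model_b`) that each read only one of them. The function name `shapley_values` and the background set are illustrative choices.

```python
from itertools import combinations
from math import factorial


def shapley_values(f, x, background):
    """Exact Shapley values by subset enumeration (feasible only for few features).

    The value of a coalition S is the mean of f over the background set,
    with the features in S fixed to their values in x (interventional SHAP).
    """
    n = len(x)

    def value(subset):
        total = 0.0
        for b in background:
            z = [x[i] if i in subset else b[i] for i in range(n)]
            total += f(z)
        return total / len(background)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for coalition in combinations(others, k):
                s = set(coalition)
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


# Toy data in which both features are always identical (assumption).
background = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]


def model_a(z):
    return z[0]  # reads only the first feature


def model_b(z):
    return z[1]  # reads only the second feature


x = (3.0, 3.0)
phi_a = shapley_values(model_a, x, background)
phi_b = shapley_values(model_b, x, background)

# Both models agree on every observed point, yet credit different features.
same_predictions = all(model_a(b) == model_b(b) for b in background)
```

On this data the two models are indistinguishable from their predictions alone, while `phi_a` assigns all credit to the first feature and `phi_b` to the second, which is why a data-only method has to settle for some average over plausible models.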


2601.21798 2026-05-18 cs.CV

CG-MLLM: Captioning and Generating 3D content via Multi-modal Large Language Models

CG-MLLM：通过多模态大语言模型实现图像描述与3D内容生成

Junming Huang, Chi Wang, Letian Li, Guangkai Xu, Donglin Huang, Hao Chen, Qiang Dai, Weiwei Xu

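Affiliations: Zhejiang University, China

AI summary: This paper proposes CG-MLLM, a multimodal large language model for 3D captioning and high-resolution 3D generation. A Mixture-of-Transformer architecture separates the different modeling needs, and combining a pretrained vision-language model with a dedicated 3D VAE latent space improves both 3D generation quality and perception.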

Comments ICML 2026

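AI abstract

Large language models (LLMs) have revolutionized text generation and multimodal perception, but their capabilities in 3D content generation remain underexplored. Existing methods tend to produce only low-resolution meshes or coarse structural proxies and fail to capture fine-grained geometry natively. This paper proposes CG-MLLM, a new multimodal large language model that performs 3D captioning and high-resolution 3D generation within a single framework. Using a Mixture-of-Transformer architecture, CG-MLLM decouples the different modeling needs: the Token-level Autoregressive (TokenAR) Transformer handles token-level content, and the Block-level Autoregressive (BlockAR) Transformer handles block-level content. By integrating a pretrained vision-language backbone with a dedicated 3D VAE latent space, CG-MLLM supports long-context interaction between standard tokens and spatial blocks. Experiments show that CG-MLLM significantly outperforms existing MLLMs at generating high-fidelity 3D objects, bringing high-resolution 3D content creation into the mainstream LLM paradigm. We further find that learning to generate 3D content in turn strengthens the model's image-based 3D understanding.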

English abstract

Large Language Models (LLMs) have revolutionized text generation and multimodal perception, but their capabilities in 3D content generation remain underexplored. Existing methods compromise by producing either low-resolution meshes or coarse structural proxies, failing to capture fine-grained geometry natively. In this paper, we propose CG-MLLM, a novel Multi-modal Large Language Model (MLLM) capable of 3D captioning and high-resolution 3D generation in a single framework. Leveraging the Mixture-of-Transformer architecture, CG-MLLM decouples disparate modeling needs, where the Token-level Autoregressive (TokenAR) Transformer handles token-level content, and the Block-level Autoregressive (BlockAR) Transformer handles block-level content. By integrating a pre-trained vision-language backbone with a specialized 3D VAE latent space, CG-MLLM facilitates long-context interactions between standard tokens and spatial blocks within a single integrated architecture. Experimental results show that CG-MLLM significantly outperforms existing MLLMs in generating high-fidelity 3D objects, effectively bringing high-resolution 3D content creation into the mainstream LLM paradigm. Beyond generation, we further observe that learning to produce 3D content transfers back to perception, strengthening the model's image-based 3D understanding.


2601.21636 2026-05-18 cs.LG cs.CR stat.ML

Sampling-Free Privacy Accounting for Matrix Mechanisms under Random Allocation

无需采样矩阵机制下的随机分配隐私计费

Jan Schuchardt, Nikita Kalinin

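Affiliations: Machine Learning Research, Morgan Stanley; Institute of Science and Technology Austria

AI summary: This paper proposes sampling-free bounds, based on Rényi divergence and conditional composition, for differential privacy amplification under random allocation in matrix factorization. The bounds avoid the high-probability-only guarantees and random abstention of sampling-based methods and apply to arbitrary banded and non-banded matrices.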

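AI abstract

We study privacy amplification for differentially private model training with matrix factorization under random allocation (also known as the balls-in-bins model). Choquette-Choo et al. (2025) propose a sampling-based Monte Carlo approach to compute the amplification parameters, but its guarantees either hold only with high probability or require random abstention by the mechanism. Moreover, the number of samples needed to ensure (ε,δ)-DP is inversely proportional to δ. In contrast, we develop sampling-free bounds based on Rényi divergence and conditional composition. The former admits a dynamic programming formulation that computes the bounds efficiently; the latter complements it with stronger privacy guarantees, particularly for small ε, where Rényi divergence bounds inherently overestimate. Our framework applies to arbitrary banded and non-banded matrices. Numerical comparisons demonstrate the effectiveness of our approach on widely used matrix mechanisms.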

English abstract

We study privacy amplification for differentially private model training with matrix factorization under random allocation (also known as the balls-in-bins model). Recent work by Choquette-Choo et al. (2025) proposes a sampling-based Monte Carlo approach to compute amplification parameters in this setting. However, their guarantees either only hold with some high probability or require random abstention by the mechanism. Furthermore, the required number of samples for ensuring $(ε,δ)$-DP is inversely proportional to $δ$. In contrast, we develop sampling-free bounds based on Rényi divergence and conditional composition. The former is facilitated by a dynamic programming formulation to efficiently compute the bounds. The latter complements it by offering stronger privacy guarantees for small $ε$, where Rényi divergence bounds inherently lead to an over-approximation. Our framework applies to arbitrary banded and non-banded matrices. Through numerical comparisons, we demonstrate the efficacy of our approach across a broad range of matrix mechanisms used in research and practice.
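For readers unfamiliar with how a Rényi divergence bound turns into an $(ε,δ)$-DP statement, here is a minimal sketch of the standard conversion: an order-α Rényi bound ρ(α) implies ε = ρ(α) + log(1/δ)/(α − 1), minimized over α. It does not implement the paper's matrix-mechanism accounting or its dynamic program. As an assumption for illustration it uses the plain Gaussian mechanism with sensitivity 1, whose Rényi bound is α/(2σ²) per step and adds up under composition; the names `gaussian_rdp` and `rdp_to_dp` are hypothetical.

```python
import math


def gaussian_rdp(alpha, sigma, steps=1):
    """Renyi DP bound of order alpha for `steps` composed Gaussian mechanisms
    with L2 sensitivity 1 and noise standard deviation sigma."""
    return steps * alpha / (2.0 * sigma ** 2)


def rdp_to_dp(rdp_fn, delta, alphas):
    """Standard RDP -> (eps, delta)-DP conversion, minimized over a grid of orders."""
    return min(rdp_fn(a) + math.log(1.0 / delta) / (a - 1.0) for a in alphas)


# Grid of Renyi orders strictly greater than 1 (illustrative choice).
alphas = [1.0 + k / 10.0 for k in range(1, 1000)]

# Example setting (assumed values): sigma = 2, 10 steps, delta = 1e-5.
eps = rdp_to_dp(lambda a: gaussian_rdp(a, sigma=2.0, steps=10), 1e-5, alphas)

# Closed-form minimum for a linear RDP curve c * alpha: c + 2 * sqrt(c * log(1/delta)).
c = 10 / (2.0 * 2.0 ** 2)
eps_closed_form = c + 2.0 * math.sqrt(c * math.log(1.0 / 1e-5))
```

The conversion pays an additive log(1/δ)/(α − 1) term at every order, which is one way to see why purely Rényi-based bounds can be loose for small ε.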


2601.20477 2026-05-18 cs.LG cs.IT math.IT

Implicit Hypothesis Testing and Divergence Preservation in Neural Network Representations

神经网络表示中的隐含假设检验与散度保持

Kadircan Aksoy, Protim Bhattacharjee, Peter Jung

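Affiliations: German Aerospace Center, Institute for Space Research; Technical University of Berlin

AI summary: This work studies the training dynamics of neural classifiers by recasting classification as binary hypothesis tests between class-conditional distributions. It shows that networks that generalize well progressively approach the Neyman-Pearson optimal decision rule during training, and it defines an information plane for assessing convergence.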
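As background for the Neyman-Pearson rule mentioned in the summary, the sketch below shows a likelihood ratio test between two class-conditional distributions. It is a textbook illustration under assumed conditions (two one-dimensional Gaussians with a shared standard deviation), not the paper's analysis; the function names and the threshold value are hypothetical.

```python
def log_likelihood_ratio(x, mu0, mu1, sigma=1.0):
    """log p1(x) - log p0(x) for two Gaussians with means mu0, mu1 and shared sigma."""
    return ((x - mu0) ** 2 - (x - mu1) ** 2) / (2.0 * sigma ** 2)


def np_decide(x, mu0, mu1, threshold=0.0, sigma=1.0):
    """Decide class 1 when the log-likelihood ratio exceeds the threshold.

    The Neyman-Pearson lemma says that thresholding the likelihood ratio gives
    the most powerful test at the false-alarm level set by the threshold.
    """
    return 1 if log_likelihood_ratio(x, mu0, mu1, sigma) > threshold else 0


# Class 0 centred at 0.0, class 1 centred at 1.0 (assumed values).
decisions = [np_decide(x, mu0=0.0, mu1=1.0) for x in (0.1, 0.9)]
```

Raising `threshold` trades a lower false-alarm rate for a lower detection rate, which is the trade-off the lemma characterizes.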

2601.12894 2026-05-18 cs.RO cs.CV

Sparse ActionGen: Accelerating Diffusion Policy with Real-time Pruning

稀疏动作生成：通过实时剪枝加速扩散策略

Kangye Ji, Jianbo Zhou, Yuan Meng, Ye Li, Hanyun Cui, Zhi Wang

Affiliations: Tsinghua Shenzhen International Graduate School, Tsinghua University; Department of Computer Science and Technology, Tsinghua University; Department of Computer Science

AI summary: This paper proposes SAG, which achieves sparse action generation through an adaptive prune-and-reuse mechanism and makes real-time visuomotor control more efficient; experiments show a 4× generation speedup.

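AI abstract

Diffusion policies dominate action generation thanks to their strong ability to model multi-modal action distributions, but their multi-step denoising process makes it hard to meet the demands of real-time visuomotor control. Existing caching-based acceleration methods usually rely on static schedules that cannot adapt to the dynamics of robot-environment interaction, which leads to suboptimal performance. This paper proposes Sparse ActionGen (SAG), which achieves extremely sparse action generation through an adaptive prune-and-reuse mechanism. To accommodate iterative interaction, SAG tailors a rollout-adaptive prune-then-reuse mechanism that first identifies prunable computations globally and then substitutes them with cached activations during action diffusion. To capture rollout dynamics, SAG parameterizes an observation-conditioned diffusion pruner for environment-aware adaptation and achieves real-time prediction through a design that is highly efficient in both parameters and inference. In addition, SAG introduces a general-purpose reuse strategy that reuses activations across timesteps and blocks in a zig-zag manner, minimizing global redundancy. On multiple robotic benchmarks, SAG achieves up to a 4× generation speedup without sacrificing performance. Project page: https://sparse-actiongen.github.io.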

English abstract

Diffusion Policy has dominated action generation due to its strong capabilities for modeling multi-modal action distributions, but its multi-step denoising processes make it impractical for real-time visuomotor control. Existing caching-based acceleration methods typically rely on static schedules that fail to adapt to the dynamics of robot-environment interactions, thereby leading to suboptimal performance. In this paper, we propose Sparse ActionGen (SAG) for extremely sparse action generation. To accommodate the iterative interactions, SAG customizes a rollout-adaptive prune-then-reuse mechanism that first identifies prunable computations globally and then reuses cached activations to substitute them during action diffusion. To capture the rollout dynamics, SAG parameterizes an observation-conditioned diffusion pruner for environment-aware adaptation and instantiates it with a highly parameter- and inference-efficient design for real-time prediction. Furthermore, SAG introduces a one-for-all reusing strategy that reuses activations across both timesteps and blocks in a zig-zag manner, minimizing the global redundancy. Extensive experiments on multiple robotic benchmarks demonstrate that SAG achieves up to 4× generation speedup without sacrificing performance. Project Page: https://sparse-actiongen.github.io.


2601.09512 2026-05-18 cs.RO cs.LG

CLARE: Continual Learning for Vision-Language-Action Models via Autonomous Adapter Routing and Expansion

CLARE：通过自主适配器路由和扩展实现视觉-语言-动作模型的持续学习

Ralf Römer, Yi Zhang, Yuming Li, Angela P. Schoellig

Affiliations: Technical University of Munich; TUM School of Computation, Information and Technology; Department of Computer Engineering, Learning Systems and Robotics Lab; Munich Institute of Robotics and Machine Intelligence (MIRMI); Robotics Institute Germany; Munich Center for Machine Learning

AI summary: CLARE is a parameter-efficient, exemplar-free continual learning framework that autonomously expands model modules so that robots retain prior knowledge while learning new tasks, and it outperforms exemplar-based methods.

Comments Accepted to IEEE Robotics and Automation Letters 2026. Project page: https://tum-lsy.github.io/clare. 11 pages, 9 figures

2601.07820 2026-05-18 cs.CL

Reference Games as a Testbed for the Alignment of Model Uncertainty and Clarification Requests

参考游戏作为模型不确定性与澄清请求对齐的测试平台

Manar Ali, Judith Sieker, Sina Zarrieß, Hendrik Buschmeier

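Affiliations: Digital Linguistics Lab; Computational Linguistics Group

AI summary: This paper uses reference games to test whether language models can recognize their own uncertainty and express it as clarification requests. It finds that, even on simple tasks, models struggle to recognize their uncertainty accurately and to turn it into clarification behavior.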

Comments Accepted at GEM@ACL 2026, the 5th Generation, Evaluation & Metrics Workshop

2601.03707 2026-05-18 cs.CL

AirNav: A Large-Scale UAV Vision-and-Language Navigation Dataset with Natural and Diverse Instructions

AirNav: 一个大规模无人机视觉与语言导航数据集，包含自然且多样的指令

Hengxing Cai, Yijie Rao, Ligang Huang, Zanyang Zhong, Jinhan Dong, Jingjun Tan, Changhao Nai, Jue Hou, Wenhao Lu, Renxin Zhong

Affiliations: School of Intelligent Systems Engineering, Sun Yat-Sen University; Beihang University; Peking University; Beijing University Of Posts and Telecommunications; Harbin Institute of Technology; Xiamen University; National University of Defense Technology

AI summary: This paper presents the AirNav dataset, which contains 137K navigation samples with natural and diverse instructions, and evaluates a range of methods on it. It also proposes the AirVLN-R1 model, which reaches a 51.82% success rate in testing, and validates sim-to-real transfer with experiments on a real UAV.

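AI abstract

Existing UAV vision-and-language navigation (VLN) benchmarks rarely offer realistic aerial scenes, natural process-level instructions, and sufficient scale all at once, which makes it hard to systematically train and evaluate UAV VLN agents under realistic settings. To address this, we present AirNav, a large-scale benchmark built on real urban aerial data, comprising 137K navigation samples generated through a human-LLM collaborative pipeline involving 10 user personas. We systematically evaluate representative methods on AirNav, from traditional models to multimodal large language models (MLLMs), under unified metrics with open-source implementations. We further propose AirVLN-R1, trained with supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT), which reaches a 51.82% success rate. Experiments on a physical UAV platform provide preliminary evidence of sim-to-real transfer, and our dataset and code are publicly available.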

English abstract

Existing UAV vision-and-language navigation (VLN) benchmarks rarely provide realistic aerial scenes, natural process-level instructions, and sufficient scale simultaneously, making it difficult to systematically train and evaluate UAV VLN agents under realistic settings. To address this, we propose AirNav, a large-scale benchmark built on real urban aerial data, comprising 137K navigation samples with natural and diverse instructions generated via a human-LLM collaborative pipeline with 10 user personas. We conduct a systematic evaluation of representative approaches on AirNav, ranging from traditional models to multimodal large language models (MLLMs), under unified metrics with open-source implementations. We further propose AirVLN-R1, trained via supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT), achieving state-of-the-art performance with a 51.82% success rate on the test-unseen split. Real-world experiments on a physical UAV platform provide preliminary evidence of sim-to-real transferability, and our dataset and code are publicly available.


2512.15693 2026-05-18 cs.CV

Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning

Skyra：通过 grounded artifact reasoning 实现 AI 生成视频检测

Yifei Li, Wenzhao Zheng, Yanran Zhang, Runze Sun, Yu Zheng, Lei Chen, Jie Zhou, Jiwen Lu

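Affiliations: Department of Automation, Tsinghua University

AI summary: This paper proposes Skyra, a multimodal large language model specialized in identifying human-perceivable visual artifacts in AI-generated videos and using them as grounded evidence for detection and explanation. It also builds the first large-scale dataset of AI-generated video artifacts and proposes a two-stage training strategy.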

Comments Camera Ready Version. Project Page: https://github.com/JoeLeelyf/Skyra

2512.10100 2026-05-18 cs.AI

Robust AI Security and Alignment: A Sisyphean Endeavor?

稳健的AI安全与对齐：一项西西弗斯式的努力？

Apostol Vassilev

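Affiliations: CSD/ITL

AI summary: By extending Gödel's incompleteness theorem, this paper explores the theoretical limits of AI security and alignment, proposes practical approaches to these challenges, and reveals limitations in the cognitive reasoning of AI systems.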

Comments 17 pages, 1 figure. This version will appear in IEEE Security & Privacy in June 2026