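arXivDaily academic digest: synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and other fields.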

2605.23187 2026-05-25 cs.CV cs.RO

IntentionNav: A Benchmark for Intent-Driven Object Navigation from Implicit Human Instruction

Lin Qian, Shijie Li, Sihao Lin, Xuan Zhang, Bangya Liu, Yanran Li, Hujun Yin

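AI summary: IntentionNav is a new benchmark for intent-driven object navigation that evaluates whether an agent can infer the target object from an implicit human instruction and navigate to it. The benchmark withholds the target object's name and expresses the need only through natural language, so the agent must understand the intent, identify the target, and complete the navigation. It introduces four intent modes and four controlled instruction styles, supporting fine-grained analysis of target inference, language robustness, and navigation success, and shows that current vision-language models still struggle to understand implicit intent and to navigate precisely.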

Comments preprint

Abstract

Existing object navigation benchmarks usually tell an embodied agent which object category to find, such as microwave or chair. Human-facing embodied AI is often asked something less direct: "I need something to warm this food" or "the room feels stuffy." The agent must infer the object that can satisfy the need, find a scene-grounded instance, and decide whether the goal has been reached. We study this setting as intent-driven object navigation and introduce IntentionNav, a diagnostic benchmark for active object search from implicit human instructions. Each episode provides a free-text intent, RGB-D observations, and pose, but withholds the target object name. IntentionNav contains 500 intents over 176 Isaac Sim scenes and 64 target categories. Each intent is rewritten in four controlled instruction styles and annotated with one of four intent modes, separating surface phrasing from semantic cue type under matched geometry. This paired design supports analysis of target inference, language robustness, neighborhood reachability, and terminal success rather than only aggregate success. We evaluated three VLMs using a fixed active-navigation agent. Models identify the intended target in 48.3 percent of episodes and enter its 2 m neighborhood in 68.7 percent, but terminate successfully in only 24.9 percent and achieve grounded 1 m success in 5.5 percent. Success is highest for event-script intents (28.7 percent) and lower for physical-state and affordance intents (19.2 percent and 18.5 percent), showing that indirect human intent remains a bottleneck for target selection, visual verification, and terminal localization in active embodied search.

2605.23182 2026-05-25 cs.LG

Pure Exploration for a Good Policy in Reinforcement Learning with Bandit Feedback

Zitian Li, Wang Chi Cheung

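AI summary: This paper studies how to efficiently identify a "good enough" policy, rather than the conventional optimal policy, in reinforcement learning with only bandit feedback. The authors formulate the Good Policy Identification (GPI) problem: given a reward threshold, find a policy that meets the threshold or determine that none exists. They design a new algorithm, BEE-GPI, and derive upper bounds on its sample complexity for both positive and negative instances. For positive instances, the coefficient of the leading term does not depend on the sizes of the state and action spaces, in contrast to best policy identification. Numerical experiments support the method's efficiency.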

Abstract

Pure exploration in episodic Reinforcement Learning has primarily focused on Best Policy Identification (BPI), which seeks to identify a (near-)optimal policy with high confidence. Motivated by practical settings where a "good enough" policy suffices, we study an alternate objective of Good Policy Identification (GPI). For a given reward threshold $\mu_0$, GPI only requires identifying a policy with expected reward in an episode at least $\mu_0$ if such a policy exists (positive instance), or declaring None if no such policy exists (negative instance). We formalize GPI under the fixed-confidence setting. We require the output to be correct with probability $\geq 1-\delta$, and seek to minimize the expected sample complexity, which is the expected number of episodes explored for the output. We propose a novel algorithm BEE-GPI, and derive theoretically-grounded upper bounds on its sample complexity for positive and negative instances. Notably, for positive instances, the coefficient of $\log(1/\delta)$ in our upper bound is $O(H^2/(V^* - \mu_0)^2)$, where $H$ is the episode length and $V^*$ is the optimal expected reward in an episode. The coefficient does not depend on the action and state space sizes otherwise, in sharp contrast to the sample complexity in BPI. We further establish lower bound results to show the near-optimality of BEE-GPI and the necessity of the $1/(V^* - \mu_0)^2$ term. Numerical experiments further validate the efficiency of our approach.
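
To get a feel for the bound quoted in the abstract, the sketch below evaluates only the leading term H^2 / (V* - mu_0)^2 * log(1/delta) with all constants and lower-order terms dropped. The function name and the example numbers are illustrative assumptions, not values from the paper, and the result is a scaling guide rather than an actual episode count.

```python
import math


def gpi_leading_term(horizon: int, v_star: float, mu_0: float, delta: float) -> float:
    """Leading-order term H^2 / (V* - mu_0)^2 * log(1/delta), constants dropped.

    Hypothetical helper: it only illustrates how the quoted coefficient scales.
    """
    gap = v_star - mu_0
    if gap <= 0:
        raise ValueError("the term is only defined when V* > mu_0")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    return horizon**2 / gap**2 * math.log(1.0 / delta)


# Illustrative numbers (assumed): doubling the gap V* - mu_0 from 0.3 to 0.6
# divides the leading term by four.
baseline = gpi_leading_term(horizon=10, v_star=0.8, mu_0=0.5, delta=0.05)
wider_gap = gpi_leading_term(horizon=10, v_star=0.8, mu_0=0.2, delta=0.05)
print(baseline / wider_gap)
```

Note that neither the number of states nor the number of actions appears as an argument, which mirrors the abstract's claim about the coefficient for positive instances.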

2605.23180 2026-05-25 cs.CL cs.LG

Self-Improving In-Context Learning

Baturay Saglam, Dionysis Kalogerias

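AI summary: This paper proposes improving in-context learning (ICL) by optimizing the continuous embeddings of a fixed few-shot prompt at test time. It finds that the log-probabilities a model assigns to the demonstration outputs are a useful signal of how well it has understood the task, and builds from them a self-supervised confidence proxy that needs no extra data; the prompt embeddings are then calibrated with zeroth-order optimization. The method requires no finetuning, no token generation, and no predefined label set, applies to both classification and free-form generation, and performs well across multiple ICL tasks, which supports the validity of its optimization signal.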

Abstract

We propose to improve in-context learning (ICL) by optimizing the continuous embeddings of a fixed few-shot prompt at test time. The key observation is that the log-probabilities a model assigns to its demonstrated outputs, available from a single forward pass without generating any tokens, provide a meaningful signal for how well the model has inferred the task from its demonstrations. We formalize this signal as a bounded, self-supervised confidence proxy and maximize it via zeroth-order optimization over the prompt embeddings, yielding a test-time calibration procedure. The approach requires no finetuning, no token generation, no predefined label set, and no external data, making it equally applicable to both classification and free-form generation tasks. Across a comprehensive suite of ICL tasks, the proposed calibration consistently matches or improves upon the base model and outperforms classification-specific baselines on most tasks. The statistically significant correlation between proxy improvement and downstream accuracy gain confirms that the proposed proxy encodes a reliable optimization signal for in-context learning.
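
The zeroth-order idea in the abstract can be illustrated with a generic two-point random-direction estimator, which needs only function evaluations and no backpropagation. This is a minimal sketch under stated assumptions: `toy_proxy` is a made-up bounded objective standing in for the paper's confidence proxy, the 4-dimensional vector stands in for prompt embeddings, and the estimator and hyperparameters are a textbook choice rather than the authors' procedure.

```python
import numpy as np


def zeroth_order_maximize(proxy, x0, steps=300, smoothing=1e-2, lr=0.1, seed=0):
    """Maximize `proxy` using only function evaluations (no gradients)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    best_x, best_val = x.copy(), proxy(x)
    for _ in range(steps):
        u = rng.standard_normal(x.shape)  # random probe direction
        # Central finite difference of the proxy along u.
        slope = (proxy(x + smoothing * u) - proxy(x - smoothing * u)) / (2 * smoothing)
        x = x + lr * slope * u  # ascent step along the probed direction
        val = proxy(x)
        if val > best_val:  # remember the best iterate seen so far
            best_x, best_val = x.copy(), val
    return best_x, best_val


# Assumed toy objective: bounded in (0, 1], maximized at `target`.
target = np.ones(4)


def toy_proxy(x):
    return 1.0 / (1.0 + np.sum((x - target) ** 2))


x0 = np.zeros(4)
x_opt, value = zeroth_order_maximize(toy_proxy, x0)
print(toy_proxy(x0), value)
```

In the setting the abstract describes, each proxy evaluation would correspond to one forward pass of the model, which is why a gradient-free estimator is a natural fit.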

2605.23179 2026-05-25 cs.AI

Redrawing the AI Map: A Theory of Accountability Boundaries in Agentic Ecosystems

Muhammad Zia Hydari, Farooq Muzaffar

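AI summary: This paper addresses where accountability boundaries should sit in agentic ecosystems and proposes a capability-level theory of accountability-boundary placement. It introduces "accountability assets", which make AI-supported outputs legitimate, auditable, and assignable to a responsible party, and analyzes how verification cost and responsibility transferability determine whether the accountability and execution boundaries can move together. The theory identifies three boundary strategies and introduces "rule debt", the governance burden that arises when organizational decision rules migrate into agentic execution environments, offering a new view of how digital modularization relates to organizational disaggregation.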

Abstract

Agentic AI orchestrators reduce the interface and assembly costs of composing information systems capabilities across organizational boundaries, seemingly accelerating modularization and organizational disaggregation. Yet AI-enabled capabilities whose outputs require evidence, review, signoff, or assignable responsibility may retain integrated accountability boundaries even when their technical interfaces become modular. We develop a capability-level theory of accountability-boundary placement in agentic ecosystems. We introduce accountability assets: complementary assets that make AI-supported outputs legitimate, auditable, reviewable, and assignable to a responsible party. We argue that verification cost and responsibility transferability determine whether the execution and accountability boundaries can move together. The theory identifies three boundary strategies: component, integrated, and dual-track. It also introduces rule debt, the governance burden that accrues when organizational decision rules migrate from formal information systems into ungoverned agentic execution environments. Integrating digital innovation, transaction cost, complementary-assets, digital platform governance, and IS control perspectives, we develop seven propositions linking agentic assembly-cost reductions, accountability assets, appropriability, orchestrator intent capture, and boundary misconfiguration to boundary strategy, value appropriation, and rule debt. The theory explains when digital modularization extends to organizational disaggregation and when accountability keeps capabilities integrated. Structured illustrations across document processing, legal services, audit, clinical decision support, and procurement discipline the boundary logic.

2605.23178 2026-05-25 cs.CV

Composing People Together: Iterative Pose-Image Generation for Multi-Person Interaction Scenes

Wenxuan Peng, Bharath Hariharan, Hadar Averbuch-Elor

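AI summary: Existing text-to-image models still produce multi-person interaction scenes with limited semantic diversity and poor compositional accuracy, often yielding repetitive layouts, stereotypical poses, and unnatural interactions. This paper proposes a dual pose-image representation that brings person-centric structural priors into pretrained diffusion transformers: the model jointly predicts a 2D pose map and the corresponding RGB image so that structure and appearance co-evolve during learning. A cross-modal alignment scheme binds the text, pose, and image representations to keep the modalities consistent, and an iterative scene construction strategy builds complex multi-person interactions step by step, decomposing the overall generation complexity. Experiments show substantially better prompt alignment and scene diversity in multi-person image generation.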

Comments Accepted to SIGGRAPH Conference Papers 2026. 22 pages, 12 figures. Project page: https://cornell-vailab.github.io/PeopleComposer/

DOI: 10.1145/3799902.3811129

Abstract

Despite recent progress, text-to-image models still struggle to generate semantically diverse and compositionally accurate multi-person interaction scenes, often collapsing to repetitive layouts, stereotypical poses, and poorly grounded interactions. In this work, we bridge this gap by introducing a dual pose-image representation that brings person-centric structural priors into pretrained diffusion transformers. Our model jointly predicts a 2D pose visualization image and its corresponding RGB image, enabling structure and appearance to co-evolve during learning. At its core, a cross-modal alignment scheme binds text, pose, and image representations, ensuring consistent grounding across modalities. Furthermore, we design an iterative scene construction scheme, progressively generating complex multi-human interactions while effectively decomposing the overall generation complexity. Extensive experiments demonstrate that our method substantially improves prompt alignment and scene diversity in multi-person image generation.

2605.23174 2026-05-25 cs.CV

LQ-rPPG: A Label-Quantized Coarse-to-Fine Learning Framework for Remote Physiological Measurement

Jun Seong Lee, Samyeul Noh, Changki Sung, Hyun Myung

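AI summary: Remote photoplethysmography (rPPG) measures physiological signals from facial videos without contact and is promising for remote healthcare and daily health monitoring. Most existing deep learning-based rPPG methods, however, overlook the quality of the training labels and its effect on learning, which leaves models sensitive to label noise and variability and hurts generalization. The paper proposes LQ-rPPG, a label-quantized coarse-to-fine learning framework that converts continuous PPG signals into multi-bit pseudo labels to reduce noise and progressively refines the rPPG estimate under hierarchical supervision. Experiments on multiple datasets show strong performance together with markedly better computational efficiency.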

Abstract

Remote photoplethysmography (rPPG) enables non-contact measurement of physiological signals from facial videos, offering strong potential for remote healthcare and daily health monitoring. Driven by this potential, various deep learning-based rPPG methods have been proposed to improve rPPG estimation. However, previous deep learning-based rPPG methods have paid little attention to the quality of training labels and their impact on model learning. Contact-based PPG signals used as training labels often contain noise and variability caused by motion artifacts, inconsistent sensor contact, and morphological distortions. Such label inconsistency can lead models to overfit to the label noise and variability and consequently degrade generalization performance. To address this issue, we propose LQ-rPPG, a label-quantized coarse-to-fine learning framework for robust rPPG estimation. LQ-rPPG consists of a label quantization module and a coarse-to-fine rPPG estimation model. The label quantization module transforms continuous PPG signals into multi-bit quantized pseudo labels with reduced noise and variability. The coarse-to-fine estimation model progressively refines rPPG signals under hierarchical supervision guided by the multi-bit pseudo labels. This design alleviates overfitting to label-specific variations and enables the model to learn structured and consistent representations. As a result, LQ-rPPG achieves robust and generalizable rPPG estimation even under challenging conditions. Experiments on multiple benchmark datasets demonstrate that LQ-rPPG achieves strong performance in both intra- and cross-dataset evaluations, while reducing parameters and multiply-accumulate operations by 88% and 29%, respectively, and increasing throughput by 191%. The code is available at https://github.com/Anonymous-repo-code/LQ-rPPG.
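
The label quantization step can be pictured with a plain uniform quantizer. The sketch below is an assumption for illustration only: the abstract does not say how LQ-rPPG builds its multi-bit pseudo labels, so min-max normalization, the bit widths, and the synthetic sine "PPG" signal are all hypothetical choices.

```python
import numpy as np


def quantize_labels(signal, bits):
    """Map a continuous signal to integer levels 0 .. 2**bits - 1 (uniform bins)."""
    signal = np.asarray(signal, dtype=float)
    lo, hi = signal.min(), signal.max()
    if hi == lo:  # flat signal: everything falls into level 0
        return np.zeros(signal.shape, dtype=int)
    levels = 2**bits
    normalized = (signal - lo) / (hi - lo)  # rescaled to [0, 1]
    # Floor into bins; the single maximum sample is clipped into the top bin.
    return np.minimum((normalized * levels).astype(int), levels - 1)


# Synthetic stand-in for a noisy contact PPG label (assumed, not real data).
t = np.linspace(0.0, 2.0, 200)
noise = 0.05 * np.random.default_rng(0).standard_normal(t.size)
ppg = np.sin(2 * np.pi * 1.2 * t) + noise

# Coarse (1-bit) to fine (3-bit) pseudo labels for hierarchical supervision.
pseudo_labels = {bits: quantize_labels(ppg, bits) for bits in (1, 2, 3)}
print({bits: int(labels.max()) for bits, labels in pseudo_labels.items()})
```

With this particular quantizer the levels are nested: integer-dividing the 3-bit labels by 4 recovers the 1-bit labels, which is one simple way coarse and fine targets can stay consistent with each other.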

2605.23171 2026-05-25 cs.LG cs.AI stat.ML

Understanding and Improving Noisy Embedding Techniques in Instruction Finetuning

Abhay Yadav

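AI summary: This study examines the technique of adding noise to the embedding layer during instruction finetuning, compares uniform and Gaussian noise, and proposes SymNoise, a new method based on symmetric noisy embeddings. Theoretical and empirical analysis indicates that the noise types perform comparably, while SymNoise improves finetuning by regulating the model's local curvature more stringently. When LLaMA-2-7B is finetuned on Alpaca, SymNoise scores 69.04% on AlpacaEval versus 64.69% for NEFTune, which the authors report as a 6.7% improvement (a relative gain; the absolute gap is 4.35 points).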

Comments arXiv admin note: substantial text overlap with arXiv:2312.01523

Journal ref: IEEE International Conference on Language Modeling (COLM), 2025

Abstract

Recent advancements in instructional fine-tuning have injected noise into embeddings, with NEFTune (Jain et al., 2024) setting benchmarks using uniform noise. Despite NEFTune's empirical findings that uniform noise outperforms Gaussian noise, the reasons for this remain unclear. This paper aims to clarify this by offering a thorough analysis, both theoretical and empirical, indicating comparable performance among these noise types. Additionally, we introduce a new fine-tuning method for language models, utilizing symmetric noise in embeddings. This method aims to enhance the model's function by more stringently regulating its local curvature, demonstrating superior performance over the current method, NEFTune. When fine-tuning the LLaMA-2-7B model using Alpaca, standard techniques yield a 29.79% score on AlpacaEval. However, our approach, SymNoise, increases this score significantly to 69.04%, using symmetric noisy embeddings. This is a 6.7% improvement over the state-of-the-art method, NEFTune (64.69%). Furthermore, when tested on various models and stronger baseline instruction datasets, such as Evol-Instruct, ShareGPT, OpenPlatypus, SymNoise consistently outperforms NEFTune. The current literature, including NEFTune, has underscored the importance of more in-depth research into the application of noise-based strategies in the fine-tuning of language models. Our approach, SymNoise, is another significant step towards this direction, showing notable improvement over the existing state-of-the-art method.
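
The noise types being compared can be written side by side. In the sketch below, the uniform branch follows the NEFTune recipe as it is commonly described (noise in [-1, 1] scaled by alpha / sqrt(L * d) for sequence length L and embedding dimension d). The "symmetric" branch is only an assumption made for illustration: the abstract does not define SymNoise's noise, so equal-probability values of -1 and +1 are used as a stand-in. The function name and default alpha are likewise hypothetical.

```python
import numpy as np


def add_embedding_noise(embeddings, alpha=5.0, kind="uniform", rng=None):
    """Return embeddings plus scaled noise; `embeddings` has shape (L, d)."""
    if rng is None:
        rng = np.random.default_rng()
    seq_len, dim = embeddings.shape
    scale = alpha / np.sqrt(seq_len * dim)  # NEFTune-style scaling
    if kind == "uniform":
        noise = rng.uniform(-1.0, 1.0, size=embeddings.shape)
    elif kind == "gaussian":
        noise = rng.standard_normal(embeddings.shape)
    elif kind == "symmetric":
        # Assumed form: each entry is -1 or +1 with equal probability.
        noise = rng.choice([-1.0, 1.0], size=embeddings.shape)
    else:
        raise ValueError(f"unknown noise kind: {kind}")
    return embeddings + scale * noise


token_embeddings = np.zeros((8, 16))  # toy batch: 8 tokens, 16 dimensions
noisy = add_embedding_noise(token_embeddings, kind="symmetric",
                            rng=np.random.default_rng(0))
print(np.abs(noisy).max())
```

Such noise is applied to the input embeddings during finetuning only; at inference the embeddings are used unchanged.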

2605.23170 2026-05-25 cs.CL cs.AI cs.LG

Positional Failures in Long-Context LLMs: A Blind Spot in Reasoning Benchmarks

Chuyifei Zhang, Hongyu Cui, Xiaowen Huang, Jitao Sang

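AI summary: This study argues that mainstream reasoning benchmarks for long-context large language models do not control where the task appears in the context, so they cannot measure how performance varies with position. The authors propose the Context Rot Evaluation (CRE) framework, which systematically controls task position, filler content, and context length. Experiments show that accuracy can drop sharply when the target task moves from the end of the context to the middle, and that the drop worsens with context length for vulnerable models. Appending a copy of the task at the end largely removes the position-induced drop, which points to a structural blind spot in current benchmark design.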

Comments 20 pages, 1 figure, 23 tables

Abstract

Position-controlled evaluation is standard for retrieval tasks such as Needle-in-a-Haystack and RULER, but mainstream reasoning benchmarks do not control positional placement of target tasks in long contexts. We audit 11 long-context benchmarks and find none jointly controls task position, filler content, and context length for reasoning. An audit of four flagship long-context releases finds no main result-table entry for NIAH, RULER, or LongBench-family benchmarks, while agentic and coding benchmarks appear in main result-tables across all four. We propose Context Rot Evaluation (CRE), a controlled framework varying all three factors, and evaluate nine LLMs on GSM8K and ARC-Challenge across two rounds: an initial five-model set and four newer vendor releases. Models can drop sharply when the target task moves from end to middle, and the drop grows worse with context length for vulnerable models. MiMo-v2-Flash drops 88pp at 64K under with_solutions filler (middle accuracy 8%). Newer releases show smaller drops: at 64K, three of four stay within +/-6pp of end-position accuracy; MiMo-V2.5-Pro narrows the MiMo-v2-Flash 88pp drop to 32pp. Under questions_only_v2 filler, middle-position drops persist across all four (range -16pp to -56pp across 8K, 32K, 64K). At 8K, a diagnostic probe adding a target-task copy at the end brings middle accuracy within +/-4pp of end baseline across all nine models, consistent with a positional explanation. In the initial five-model set, 76% of middle-position errors match surrounding filler text versus 22% at the end position, consistent with filler-answer interference as a dominant error mode. These results expose a structural evaluation gap in current reasoning benchmark design and vendor evaluation practice: positional vulnerabilities that grow with context length cannot be measured when task position is not controlled.
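
A position-controlled prompt of the kind described above can be assembled in a few lines. This is a hypothetical sketch, not the authors' CRE harness: the function name, the three position labels, and the blank-line separator are assumptions, and context length is controlled only indirectly through how many filler blocks the caller passes in.

```python
def build_context(task, fillers, position, copy_at_end=False):
    """Place `task` among filler blocks at the start, middle, or end."""
    slots = {"start": 0, "middle": len(fillers) // 2, "end": len(fillers)}
    if position not in slots:
        raise ValueError(f"unknown position: {position}")
    index = slots[position]
    blocks = fillers[:index] + [task] + fillers[index:]
    if copy_at_end and position != "end":
        # Diagnostic probe: repeat the target task after all filler.
        blocks.append(task)
    return "\n\n".join(blocks)


fillers = ["F1", "F2", "F3", "F4"]  # stand-ins for filler problems
middle_prompt = build_context("TASK", fillers, "middle")
end_prompt = build_context("TASK", fillers, "end")
probe_prompt = build_context("TASK", fillers, "middle", copy_at_end=True)
print(middle_prompt.split("\n\n"))
```

Comparing model accuracy on `middle_prompt` and `end_prompt` isolates the effect of position, and `probe_prompt` corresponds to the end-copy diagnostic mentioned in the abstract.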

2605.23165 2026-05-25 cs.RO cs.AI cs.CL

Autonomous Frontier-Based Exploration with VLM Guidance

Aarush Aitha, Avideh Zakhor

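AI summary: This paper proposes a frontier-based autonomous exploration method guided by a vision-language model (VLM) to improve robot exploration of unknown and hazardous environments. The VLM makes high-level strategic decisions that direct a conventional low-level robot control stack: visual renderings of the current map and of candidate paths are assembled into a multimodal prompt, and the VLM selects the most promising direction to explore. In simulation across six indoor environments, the method improves map coverage, and it is lightweight, training-free, and easy to transfer.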

Comments 8 pages, 10 figures, CVPR 2026: 2nd Workshop on 3D-LLM/VLA: Bridging Language, Vision and Action in 3D Environments
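
Frontier-based exploration, which the summary above builds on, starts by finding frontier cells: known-free cells that border unknown space. The sketch below shows that classical step on a small occupancy grid. The cell encoding and 4-connectivity are assumptions for illustration; how the paper renders candidates for the VLM is not described in this listing and is not modeled here.

```python
import numpy as np

UNKNOWN, FREE, OCCUPIED = -1, 0, 1  # assumed cell encoding


def find_frontiers(grid):
    """Return (row, col) of every free cell with a 4-connected unknown neighbor."""
    rows, cols = grid.shape
    frontiers = []
    for r in range(rows):
        for c in range(cols):
            if grid[r, c] != FREE:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and grid[nr, nc] == UNKNOWN:
                    frontiers.append((r, c))
                    break  # one unknown neighbor is enough
    return frontiers


occupancy = np.array([
    [FREE, FREE, UNKNOWN],
    [FREE, OCCUPIED, UNKNOWN],
    [FREE, FREE, FREE],
])
print(find_frontiers(occupancy))
```

A planner would then rank these candidates; in the approach summarized above, that ranking is where the VLM's high-level decision would come in.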

2605.23160 2026-05-25 cs.RO cs.CV

Semantic-Aware Guided Drone Exploration for Language-Conditioned 3D Indoor Mapping

Nitin Vegesna, Avideh Zakhor

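AI summary: This paper presents SAGE, a semantic-aware guided drone exploration system for open-vocabulary exploration of unknown 3D indoor environments that preserves coverage-oriented behavior while using semantic cues to reprioritize frontiers. Built on the FALCON volumetric explorer, SAGE integrates CLIP through four key components and plans over a unified semantic-geometric cost, which makes object discovery more efficient. In simulation SAGE outperforms FALCON and a semantic-only ablation in object discovery and completes exploration much faster than FTU with higher volumetric throughput; in real-world flights it discovers objects better than FALCON, although FALCON explores faster.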

Comments 10 pages, 6 figures, 4 tables. To be presented at the 2nd 3D-LLM/VLA Workshop at CVPR 2026 (non-archival workshop)

Abstract

We present Semantic-Aware Guided Exploration, SAGE, a system for open-vocabulary exploration in unknown 3D indoor environments that preserves coverage-oriented behavior while allowing semantic cues to reprioritize frontier selection. Building on the FALCON volumetric explorer, SAGE integrates Contrastive Language-Image Pre-training (CLIP) via four key components: object-centric embedding storage, a temporal cache that projects recent observations onto the free-unknown boundary, object frontiers for high-similarity detections, and a unified semantic-geometric planning cost. This cost function bounds semantic reweighting influence, ensuring frontiers are prioritized without sacrificing total coverage. In Matterport3D-based simulations, SAGE outperforms FALCON and a semantic-only ablation in object discovery across map-query pairs. Compared to Finding Things in the Unknown (FTU), SAGE completes exploration 9.0 to 25.9 times faster across the nine shared map-query pairs, achieving a mean speedup of 13.7. Furthermore, SAGE achieves substantially higher volumetric throughput than FTU. Finally, we deploy SAGE in five real-world flights in two environments on a Modal AI Starling 2 quadrotor with onboard sensing and planning, and offboard CLIP inference. Comparing SAGE and FALCON, we find that while FALCON results in faster exploration and shorter mapping trajectories, SAGE outperforms FALCON in terms of object discovery.
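
One way to read "bounds semantic reweighting influence" is that semantic similarity can discount a frontier's geometric cost only up to a fixed fraction. The sketch below implements that reading. It is an assumed form, not SAGE's actual cost function: the multiplicative discount, the `max_discount` parameter, and the example frontier names and numbers are all hypothetical.

```python
def frontier_cost(geometric_cost, similarity, max_discount=0.5):
    """Geometric cost discounted by semantic similarity, by at most `max_discount`."""
    if not 0.0 <= max_discount < 1.0:
        raise ValueError("max_discount must lie in [0, 1)")
    clipped = min(max(similarity, 0.0), 1.0)  # keep similarity in [0, 1]
    return geometric_cost * (1.0 - max_discount * clipped)


def select_frontier(frontiers, max_discount=0.5):
    """Pick the name of the cheapest frontier from (name, cost, similarity) tuples."""
    return min(frontiers, key=lambda f: frontier_cost(f[1], f[2], max_discount))[0]


candidates = [("hallway", 4.0, 0.1), ("kitchen", 6.0, 0.9)]
print(select_frontier(candidates))
print(select_frontier(candidates, max_discount=0.0))
```

Because the discount is capped, a semantically irrelevant frontier keeps a finite cost and is still visited eventually, which is the sense in which coverage is preserved in this sketch.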

2605.23157 2026-05-25 cs.CL

Same Model, Different Weakness: How Language and Modality Reshape the Jailbreak Attack Surface in Frontier MLLMs

Casey Ford, Madison Van Doren, Sicheng Jin, Emily Dix

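AI summary: This study examines how the jailbreak attack surface of multimodal large language models (MLLMs) differs across languages and modalities, and shows that language does not affect model safety uniformly. Comparing four frontier models under English and Spanish prompting, it finds that linguistic framing attacks become less effective in Spanish while visually explicit multimodal attacks become more effective, which suggests that linguistic and visual alignment failures operate through distinct mechanisms. The authors conclude that safety evaluation frameworks treating language and modality as independent dimensions misrepresent real attack risk and need to be redesigned.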

Abstract

The attack surface of a multimodal large language model (MLLM) is language-dependent in ways that reveal the mechanistic structure of alignment failures. We present the first systematic cross-lingual, multimodal red-teaming study comparing jailbreak vulnerability in US English (en-US) and Mexican Spanish (es-MX) across four frontier MLLMs: Claude Sonnet 4.5, GPT-5, Pixtral Large, and Qwen Omni. Using a fixed adversarial benchmark of 363 diverse prompt scenarios administered in text-only and multimodal conditions, we collected 52,272 harm ratings and binary attack success judgements from matched panels of nine native-speaker annotators per language group. Our central finding is that language does not scale vulnerability uniformly. Bayesian mixed-effects analyses reveal that linguistic framing attacks such as role-play become substantially less effective under Spanish prompting, while visually explicit multimodal attacks become more effective, which directly implicates the prompt-language interface rather than global annotator leniency. This dissociation indicates that linguistic and visual alignment failures operate through distinct mechanisms, and that switching language is sufficient to expose that separation. The practical consequence is that safety rankings are not preserved across languages. Qwen Omni overtakes Pixtral Large as the most vulnerable model among es-MX participants, a rank reversal no scalar correction of English-condition scores could recover, and absolute attack success rates have declined across model generations without closing the gaps between them. These findings demonstrate that safety evaluation frameworks treating language and modality as independent dimensions fundamentally misspecify the attack surface of globally deployed MLLMs, and must be redesigned accordingly.

2605.23156 2026-05-25 cs.LG math.FA math.RT stat.ML

Any-Dimensional Invariant Universality

Shengtai Yao, Eitan Levin, Mateo Díaz

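AI summary: This paper studies the universality of machine learning models that accept inputs of any size, such as graphs with varying numbers of nodes or point clouds with varying numbers of points. Universality is traditionally analyzed for fixed-size inputs; the paper instead develops a systematic approach that identifies an any-dimensional function with a single function on a suitable infinite-dimensional limit space. Using the symmetries of the inputs and the relations between inputs of different sizes, it equips this space with a natural topology and shows how any-dimensional universality can be established on its compact sets. The paper also shows that several existing architectures fail to be universal and proposes simple modifications that restore universality.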

Abstract

Several machine learning models are defined for inputs of any size, such as graphs with different numbers of nodes and point clouds containing varying numbers of points. The universality properties of such any-dimensional models remain poorly understood, as universality is traditionally studied for models accepting inputs of a fixed size, defined on a compact subset of their domain. In sharp contrast, any-dimensional models can be viewed as sequences of functions defined on growing-sized inputs, and it is not clear in which sense they can be universal. We develop a systematic approach to establish any-dimensional universality, by identifying any-dimensional functions with a unique function taking inputs in a suitable infinite-dimensional limit space containing inputs of all finite sizes as well as their limits. Using the symmetries of these inputs and relations between inputs of different sizes, we show that this limit space admits a natural topology with rich families of compact sets on which any-dimensional universality can be established. We illustrate our approach by showing that several existing architectures fail to be universal, and we propose simple modifications that restore universality.

2605.23147 2026-05-25 cs.CL cs.AI

As X, Do Y: How Persona and Task Combine in Instruction-Tuned LLMs

Eric Xu

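AI summary: This study examines how role prompts of the form "As X, do Y" combine persona and task information in instruction-tuned large language models, and finds that the combination admits a clean linear decomposition at one specific site in the residual stream. Persona and task contribute through partially orthogonal additive directions, and this local additive structure supports interpretable steering of the persona and task contributions. The study also shows that, despite this local additivity, a role prompt cannot be compressed into a single residual vector, because persona-conditioned behavior depends on a mechanism distributed across the whole prompt.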

Comments 12 pages, 1 figure. Code: https://github.com/xuy/localized-additive-composition

Abstract

Role prompts of the form "As X, do Y" admit a clean linear decomposition at one specific site in the residual stream: the prompt-to-answer transition (the last prompt token together with the first two generated tokens) in an early/mid layer band. There, persona and task contribute through partially orthogonal additive directions. Forming a pure persona effect $\Delta_X$, a pure task effect $\Delta_Y$, and substituting $h_{BB} + \Delta_X + \Delta_Y$ for the clean residual yields downstream output within a small KL of clean on Gemma-2-2B-IT and Qwen-2.5-{1.5B, 3B}-Instruct, across a 12-cell short grid and a 48-cell long-persona grid, with persona-specific behavioral markers preserved. The natural inference from this additive structure is that the role prompt can be compressed into a single cached residual vector. We show it cannot. Injecting the cached additive prediction, or even the oracle clean residual $h_{XY}$, into a baseline host prompt with the persona text removed does not approach the clean long-persona target, at one site or at many layers. Persona-conditioned multi-token generation flows through attention back to the persona-text positions throughout the prompt, which no residual at one site reproduces. Local additivity in the residual stream does not imply prompt compressibility. The additive structure at the prompt-to-answer transition supports interpretability and fine-grained steering of persona or task contributions; persona-conditioned behavior across the full continuation depends on a distributed prompt/KV mechanism that local activation arithmetic does not displace.
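
The additive substitution described in the abstract is plain vector arithmetic, sketched below on synthetic data. Two things are assumptions: the abstract names only the baseline residual and the clean residual, so defining each effect as a difference from the baseline (using hypothetical persona-only and task-only residuals `h_xb` and `h_by`) is one plausible reading, and the random vectors here are constructed to be exactly additive rather than taken from a model.

```python
import numpy as np


def additive_prediction(h_bb, h_xb, h_by):
    """Predict the persona+task residual as h_BB + Delta_X + Delta_Y."""
    delta_x = h_xb - h_bb  # assumed definition of the pure persona effect
    delta_y = h_by - h_bb  # assumed definition of the pure task effect
    return h_bb + delta_x + delta_y, delta_x, delta_y


def cosine(a, b):
    """Cosine similarity, used to check how orthogonal the two effects are."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Synthetic residuals (assumed): built so that additivity holds exactly.
rng = np.random.default_rng(0)
dim = 16
h_bb = rng.standard_normal(dim)
persona_dir = rng.standard_normal(dim)
task_dir = rng.standard_normal(dim)
h_xb, h_by = h_bb + persona_dir, h_bb + task_dir
h_xy = h_bb + persona_dir + task_dir

predicted, delta_x, delta_y = additive_prediction(h_bb, h_xb, h_by)
print(np.allclose(predicted, h_xy), cosine(delta_x, delta_y))
```

On real activations the match would be approximate, and the abstract's main point is that even an exact match at this one site does not make the prompt compressible.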

2605.23146 2026-05-25 cs.LG cs.AI

Infra-Bayesian Reinforcement Learning Agents Outperform Classical RL For Worst-Case Robustness

Manish Aryal, Faiyaz Azam, Agnivo Banerjee, Sai Sidhanth Manoharan Jayanthi, Allegra Laro, Clément Legentilhomme, Andrew Lin, Florian Lorkowski, Radman Rakhshandehroo, Patric Rommel, Emanuel Ruzak, Nathan Theng, Paul Yushin Rapoport

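AI summary: This paper examines the limits of classical reinforcement learning under model misspecification and policy-dependent uncertainty and presents a reinforcement learning architecture based on infra-Bayesianism. The approach separates ordinary probabilistic uncertainty from Knightian uncertainty and selects actions by maximizing worst-case expected value, which gives more robust behavior in non-realizable settings. In an environment with Knightian uncertainty the agent achieves lower worst-case regret than classical reinforcement learning agents, and on Newcomb's problem it outperforms classical decision theory agents.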

Abstract

Classical reinforcement learning assumes the agent interacts with a fixed environment whose behavior does not depend on the agent's policy. This assumption breaks down in non-realizable settings where other actors might anticipate the agent's behavior, including environments crucial to AI safety, where the agent interacts with predictors, humans, other AI agents, and institutions. In such settings, the agent's model class fails to capture the world in which it operates. Under such misspecification, classical Bayesian methods can produce confidently wrong posteriors, unreliable decisions, and unbounded regret, as realizability fails to obtain. Infra-Bayesianism is a decision-theoretic framework that addresses these failures by distinguishing ordinary probabilistic uncertainty, where priors can be reasonably chosen, from Knightian uncertainty, where no grounds exist for the construction of such a prior. It does so by evaluating actions on their worst-case outcomes, rather than from posterior expectations or weighted averaging. We present the first proof-of-concept implementation of an infra-Bayesian reinforcement learning architecture for finite-outcome stateless decision problems. Our agent maintains a set of imprecise hypotheses, updates them using infra-Bayesian conditioning, and selects actions by maximizing worst-case expected value. We apply this implementation of the infra-Bayesian maximin decision process to an environment with Knightian uncertainty, and demonstrate a lower worst-case regret as compared to classical reinforcement learning agents. We also investigate Newcomb's problem and show that the infra-Bayesian agent picks the optimal strategy, outperforming classical decision theory agents. Our results provide a step towards reinforcement learning agents that remain robust under model misspecification and policy-dependent uncertainty.
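
The action-selection rule in the abstract, maximizing worst-case expected value over a set of hypotheses, can be shown on a small payoff table. This sketch covers only that maximin step: the infra-Bayesian conditioning used to update the hypotheses is omitted, and the table values, the function names, and the uniform-average baseline used for contrast are illustrative assumptions.

```python
import numpy as np


def maximin_action(expected_values):
    """Pick the action whose worst expected value across hypotheses is largest.

    `expected_values[h][a]` is the expected value of action a under hypothesis h.
    """
    table = np.asarray(expected_values, dtype=float)
    worst_case = table.min(axis=0)  # worst hypothesis for each action
    return int(worst_case.argmax()), worst_case


def average_action(expected_values):
    """Baseline for contrast: weight every hypothesis equally, then maximize."""
    table = np.asarray(expected_values, dtype=float)
    return int(table.mean(axis=0).argmax())


# Rows are hypotheses, columns are actions (assumed numbers).
payoffs = [[1.0, 0.6],
           [0.2, 0.5]]
action, worst_case = maximin_action(payoffs)
print(action, worst_case.tolist(), average_action(payoffs))
```

Here the averaging baseline prefers action 0 for its higher mean, while the maximin rule prefers action 1 because its worst case is better, which is the kind of robustness trade-off the abstract refers to.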

2605.23144 2026-05-25 cs.CV

SLIP-RS: Structured-Attribute Language-Image Pre-Training for Remote Sensing Object Detection

Chenxu Wang, Yuxuan Li, Yunheng Li, Xiang Li, Jingyuan Xia, Qibin Hou

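AI summary: Existing language-image pre-training methods for remote sensing object detection are limited by monolithic label learning: they rely on black-box data to enumerate open-set categories in order to obtain fine-grained representations, which fits poorly with the data scarcity of the remote sensing domain. The paper proposes SLIP-RS, a structured-attribute decoupling paradigm that maps the open-ended category space into a finite, physically meaningful attribute space and improves fine-grained discriminability through explicit structural logic. The method rests on two components: Structured-Attribute Contrastive Learning, which decouples visual logic, and a Conformal Attribute Reliability Engine, which extracts high-quality supervision from noisy data. Together they yield significant gains in fine-grained detection and cross-domain generalization.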

Abstract

Existing language-image pre-training for remote sensing object detection is constrained by Monolithic Label Learning, which relies on exhaustively enumerating open-set categories via black-box data to acquire fine-grained representations, creating a dependency incompatible with the domain's inherent data scarcity. To transcend this bottleneck, we propose SLIP-RS, establishing a Structured-Attribute Decoupling Paradigm that maps the open-ended category space into a finite, physically meaningful attribute space, unlocking fine-grained discriminability via explicit structural logic. This paradigm is realized via two technical pillars: (1) Structured-Attribute Contrastive Learning, which enforces the learning of decoupled intrinsic visual logic via combinatorial attribute augmentation; and (2) Conformal Attribute Reliability Engine, which leverages conformal prediction theory to rigorously distill high-fidelity supervision from noisy sources, yielding RS-Attribute-15M, the largest dataset with over 15 million attribute annotations. Extensive experiments demonstrate that SLIP-RS establishes unprecedented performance in fine-grained detection and cross-domain generalization, validating structured attributes as a vital foundation for remote sensing. Code: https://github.com/facias914/SLIP-RS.

2605.23141 2026-05-25 cs.CV

VisAnalog: A Diagnostic Suite for Visual Concept Transfer on Natural Images

Zhaonan Li, Kyle R. Chickering, Bangzheng Li, Jacob Dineen, Xiao Ye, Zhikun Xu, Shijie Lu, Yuxi Huang, Ming Shen, Bach Nguyen, Jaya Adithya Pavuluri, Mau Son Nguyen, Sanika Chavan, Ngoc Minh Thu Le, Muhao Chen, Ben Zhou

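AI summary: VisAnalog is a diagnostic dataset for evaluating visual concept transfer, testing whether a model can preserve and manipulate concept-level properties across scenes. Each sample takes the form "A:B::C:?", and the model must infer the target image from the given images and the transformation relation. Experiments show that even strong vision-language models perform far below the ideal, and their accuracy drops markedly as the number of transformation steps grows, while human performance stays near the ceiling. The dataset offers a practical tool for analyzing model failures in visual relation inference and transformation application.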

Comments Accepted to the Workshop on Visual Concepts at CVPR 2026 as a non-archival report

Abstract

A useful test of visual concept learning is not just whether a model can recognize a concept in a single image, but whether it can preserve and manipulate concept-level properties under transformation and transfer them to new scenes. We introduce VisAnalog, a controlled suite for this setting on natural images. Each example instantiates $A\!:\!B::C\!:\,?$: image $B$ and a hidden target image $D$ are produced by applying the same deterministic transformation sequence to source images $A$ and $C$, respectively. Given $A$, $B$, and $C$, a model must answer a multiple-choice question about $D$. The benchmark contains 617 human-validated questions spanning one- to four-step transformations such as zoom, quadrant swap, rotation, flip, and hue rotation. Across strong proprietary and open-source VLMs, end-to-end accuracy is substantially lower than the oracle accuracy obtained when $D$ is shown directly, and it degrades sharply as transformation depth increases, while human performance remains near the ceiling. A program-conditioned evaluation further separates failures of relation inference from failures of transformation application, showing that inferring the visual relation from $A \rightarrow B$ is the dominant bottleneck, with additional application errors emerging on harder multi-step cases. The dataset is publicly available at https://huggingface.co/datasets/zli99/VisAnalog.
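
The analogy construction can be sketched as follows: one deterministic program of transformations is applied to both source images. The transformation names, the quadrant-swap convention, and the tiny integer arrays standing in for images are illustrative assumptions, not the benchmark's actual generator.

```python
import numpy as np

def quadrant_swap(img):
    """Swap the top-left and bottom-right quadrants (one possible convention)."""
    out = img.copy()
    h, w = img.shape[0] // 2, img.shape[1] // 2
    out[:h, :w] = img[h:2 * h, w:2 * w]
    out[h:2 * h, w:2 * w] = img[:h, :w]
    return out

# Hypothetical registry of deterministic transformations.
TRANSFORMS = {
    "rot90": lambda img: np.rot90(img),      # 90 degrees counterclockwise
    "flip_lr": lambda img: np.fliplr(img),   # mirror left-right
    "quadrant_swap": quadrant_swap,
}

def apply_program(img, program):
    """Apply the named transformations in order."""
    for name in program:
        img = TRANSFORMS[name](img)
    return img

A = np.arange(16).reshape(4, 4)        # stand-in for source image A
C = np.arange(16, 32).reshape(4, 4)    # stand-in for source image C
program = ["rot90", "flip_lr"]         # a two-step transformation
B = apply_program(A, program)          # shown to the model
D = apply_program(C, program)          # hidden target
```

A model that has inferred `program` from the pair (A, B) only needs to apply it to C; the abstract reports that the inference step, not the application step, is where most errors arise.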

2605.23139 2026-05-25 cs.LG cs.AI

CALAD: Channel-Aware contrastive Learning for multivariate time series Anomaly Detection

Jaehyeop Hong, Youngbum Hur

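AI summary: Multivariate time series anomaly detection is increasingly important in practice but usually suffers from scarce labeled data. Most existing methods model normal patterns with unsupervised learning yet treat all channels equally, ignoring that channels contribute differently to anomaly detection. The paper proposes CALAD, a channel-aware contrastive learning framework that uses estimated channel relevance to guide the construction of contrastive samples so the model learns anomaly semantics, and combines reconstruction error with contrastive learning to improve detection under distribution shift.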

Comments Accepted to ICPR 2026

Abstract

Multivariate time series anomaly detection has become increasingly important in real-world applications, where labeled data are often scarce. Many existing approaches rely on unsupervised learning to model normal patterns, but they often treat all channels equally. This design can dilute anomaly-relevant signals, since not all channels contribute equally to anomaly detection. In this paper, we propose CALAD, a channel-aware contrastive learning framework for multivariate time series anomaly detection. CALAD governs the construction of contrastive samples using estimated channel relevance, allowing the learning process to reflect anomaly semantics rather than generic similarity. Channel relevance is estimated from reconstruction errors of a transformer-based autoencoder and is used to distinguish channels that are more influential to anomalous behaviors. Using this information, we design a channel-wise augmentation strategy in which positive and negative samples are constructed based on whether anomaly-relevant channels are preserved or perturbed. This encourages invariance to changes in irrelevant channels while preserving sensitivity to changes in anomaly-relevant channels. Furthermore, CALAD combines contrastive learning and an auxiliary reconstruction head, allowing the model to learn discriminative representations while retaining normal structures. Experiments on multiple real-world datasets show that CALAD consistently outperforms existing methods, particularly under distribution shift scenarios. We provide the code for reproducibility at https://github.com/hirundo1218/CALAD.
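
The channel-wise augmentation idea can be sketched in a few lines. This is a simplified illustration under stated assumptions: relevance is the normalized per-channel squared reconstruction error, the top-k channels count as anomaly-relevant, and the perturbation is additive Gaussian noise. The function names are hypothetical and the released CALAD code may differ on every one of these choices.

```python
import numpy as np

def channel_relevance(x, x_hat):
    """Per-channel mean squared reconstruction error, normalized to sum to 1.
    x, x_hat: arrays of shape (time, channels)."""
    err = ((x - x_hat) ** 2).mean(axis=0)
    return err / err.sum()

def make_views(x, relevance, k, rng, noise=1.0):
    """Build one positive and one negative view of the window x.

    positive: perturb only channels outside the top-k by relevance
    negative: perturb only the top-k (anomaly-relevant) channels
    """
    top = np.argsort(relevance)[-k:]
    mask = np.zeros(x.shape[1], dtype=bool)
    mask[top] = True
    jitter = rng.normal(scale=noise, size=x.shape)
    positive = x + jitter * ~mask   # relevant channels left untouched
    negative = x + jitter * mask    # relevant channels perturbed
    return positive, negative, mask

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 4))
x_hat = x.copy()
x_hat[:, 2] += 1.0   # pretend the autoencoder reconstructs channel 2 poorly
rel = channel_relevance(x, x_hat)
pos, neg, mask = make_views(x, rel, k=1, rng=rng)
```

A contrastive loss would then pull the representation of `pos` toward that of `x` and push `neg` away, which is the invariance and sensitivity behaviour the abstract describes.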

2605.23134 2026-05-25 cs.LG

Archimedean Copula Inference via Taylor-Mode AD

Cambridge Yang, Dongdong Li

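AI summary: The study presents \textsc{acopula}, a JAX framework for efficiently computing exact likelihoods and parameter gradients of arbitrarily nested Archimedean copula models in high dimensions under arbitrary per-variable right-censoring. Its core idea is polynomial powering of Taylor-mode automatic differentiation output, which replaces hand-derived Bell polynomial tables and thereby supports arbitrary generators and complex nesting structures. Experiments show strong performance and flexibility on high-dimensional data and on large financial and medical datasets, with substantial speedups over existing tools.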

Abstract

No existing nested Archimedean copula tool handles all three of (a) arbitrary per-variable (right-)censoring in survival analysis, (b) arbitrary nesting trees, and (c) exact parameter gradients. Existing implementations handle only bivariate problems, low-dimensional (i.e., $d \leq 10$) cases, two layers of nesting, or only hand-derived copula nestings. We present \textsc{acopula}, a JAX-native framework that, given any Archimedean generator -- classical or neural -- evaluates exact nested-copula likelihoods and parameter gradients under arbitrary censoring masks in polynomial time. The mechanism is polynomial powering of Taylor-mode automatic differentiation output, which replaces per-family hand-derived partial Bell polynomial tables with a single differentiable computation that any user-defined generator can drive. We conduct extensive simulations to verify the correctness of \textsc{acopula}. We then demonstrate (a) per-variable censoring on $85{,}229$ MIMIC-IV ICU admissions in high dimensions with $d{=}53$, fit by both classical Archimedean families and nested neural Archimedean copulas; (b) an 11-sector hierarchical model on S\&P~500 daily returns at $d{=}98$; (c) family-agnostic censored MLE across ten families, five of them with no prior implementation, on a retinopathy study; and (d) a ${\sim}650\times$ per-density speedup over R's \texttt{nacLL} at $d{=}35$, scaling quadratically to $d{=}8{,}000$.
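
For readers unfamiliar with Archimedean copulas, a minimal sketch of the (non-nested) construction from a generator may help. It uses the Clayton family with plain NumPy and covers only the CDF; it is not \textsc{acopula}'s API. The density additionally requires the $d$-th derivative of the generator, which is the step the abstract says is handled with Taylor-mode automatic differentiation.

```python
import numpy as np

def clayton_psi(t, theta):
    """Clayton generator psi(t) = (1 + t)^(-1/theta), for theta > 0."""
    return (1.0 + t) ** (-1.0 / theta)

def clayton_psi_inv(u, theta):
    """Inverse generator psi^{-1}(u) = u^(-theta) - 1."""
    return u ** (-theta) - 1.0

def archimedean_cdf(u, theta):
    """C(u_1, ..., u_d) = psi(sum_i psi^{-1}(u_i)) for the Clayton family."""
    u = np.asarray(u, dtype=float)
    return clayton_psi(clayton_psi_inv(u, theta).sum(), theta)

# Setting one argument to 1 recovers the other margin: C(0.3, 1) = 0.3.
margin = archimedean_cdf([0.3, 1.0], theta=2.0)
joint = archimedean_cdf([0.5, 0.5], theta=2.0)
```

Swapping in another generator pair changes the family without touching `archimedean_cdf`, which is the sense in which a single computation can be driven by a user-defined generator.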

2605.23131 2026-05-25 cs.LG

When Determinants Are Not Enough: Private Rare Switching

Xingyu Zhou

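AI summary: This paper examines the limitations of traditional determinant-based update rules for linear contextual bandits and reinforcement learning in a privacy-preserving setting. When Gaussian noise is added to satisfy privacy requirements, the monotone growth of the design matrix can break, so the original analysis no longer applies. To address this, the author proposes a rare-switching rule based on the generalized Rayleigh quotient that recovers a logarithmic number of policy updates and constant-factor control of the confidence width, enabling effective rare switching under privacy.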
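
As background, the classical non-private determinant-based rule that the paper starts from can be sketched as below: the policy is refreshed only when the determinant of the regularized design matrix has grown by a constant factor since the last refresh. The function name, the threshold factor `1 + C`, and the synthetic features are illustrative assumptions; the paper's private Rayleigh-quotient rule is not shown.

```python
import numpy as np

def count_switches(features, lam=1.0, C=1.0):
    """Count policy updates under the rule det(V_t) > (1 + C) * det(V_last)."""
    d = features.shape[1]
    V = lam * np.eye(d)                       # regularized design matrix
    logdet_last = np.linalg.slogdet(V)[1]
    switches = 0
    for x in features:
        V += np.outer(x, x)                   # rank-one update, V only grows
        logdet = np.linalg.slogdet(V)[1]
        if logdet > logdet_last + np.log1p(C):
            switches += 1                     # refresh the policy here
            logdet_last = logdet
    return switches

rng = np.random.default_rng(0)
T, d, lam, C = 2000, 5, 1.0, 1.0
feats = rng.normal(size=(T, d))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)   # unit-norm contexts
switches = count_switches(feats, lam=lam, C=C)
```

Because each refresh requires the log-determinant to rise by more than log(1 + C), the number of refreshes is at most d log(1 + T/(d lam)) / log(1 + C) for unit-norm features, which is logarithmic in T. That argument relies on V growing monotonically, the property the summary says privacy noise can break.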

2605.23128 2026-05-25 cs.RO

$\pi_0$-EqM: Equilibrium Matching for Closed-Loop Vision-Language-Action Control

Huanming Liu, Congsheng Xu, Jianmin Ji, Yao Mu

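AI summary: This paper proposes $\pi_0$-EqM, a closed-loop vision-language-action control method that improves performance on robotic manipulation tasks by replacing the conventional flow-matching decoder with an Equilibrium Matching (EqM) decoder. Under a fixed compute budget, the method markedly raises success rates across multiple tasks and reveals a task-dependent stationarity-executability gap, offering a new perspective on policy design for iterative VLA control.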

Comments Preprint. 5 pages, 3 figures

Abstract

Vision-Language-Action (VLA) models have become the most widely adopted paradigm for robotic manipulation because of their potential for task generalization. However, most generative flow-matching action decoders for VLA control are deployed with fixed sampling horizons, which limits state-dependent compute and temporal reuse across control cycles. We present $\pi_0$-EqM, which replaces the flow-matching expert in $\pi_0$ with an Equilibrium Matching (EqM) decoder while leaving the upstream VLA stack unchanged. Under a matched 300-step budget, $\pi_0$-EqM improves RoboTwin average success from 40.4% to 50.2% across 19 tasks and remains competitive on LIBERO, with its clearest gain on LIBERO-10 (87.0%). Two threshold scans reveal a task-dependent, non-monotonic relation between residual and success, which we term the stationarity--executability gap. The results suggest that inference depth in iterative VLA control is part of policy design and introduce an energy-based VLA perspective that may inform future work on composable action generation across tasks and embodiments.

2605.23118 2026-05-25 cs.CV cs.AI cs.LG

Exploiting Longitudinal Context in Clinician-Verified Interactive Lesion Tracking

Yannick Kirchhoff, Maximilian Rokuss, Daniel Philipp Mertens, David Füller, Benjamin Hamm, Andreas Schreyer, Oliver Ritter, Klaus Maier-Hein

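AI summary: This paper studies how to use longitudinal imaging information effectively in clinician-verified interactive lesion tracking to improve tumor tracking accuracy across serial CT scans. The authors propose a Verified Tracking paradigm in which a clinician verifies a registration-proposed prompt and the model combines it with the lesion's baseline appearance to resolve segmentation ambiguities. The method combines early spatial prompt fusion with latent temporal difference weighting in a unified framework for longitudinally informed segmentation, and uses large-scale synthetic pretraining to overcome data scarcity. Experiments show that it outperforms existing methods in both the fully automatic and verified tracking settings and took first place in the MICCAI autoPET IV challenge.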

Comments Accepted at MICCAI 2026

Abstract

Tracking tumor lesions across serial CT scans is essential for oncological response assessment. Existing automated methods face a fundamental trade-off: end-to-end trackers achieve high automation but offer no opportunity to correct silent tracking failures, while decoupled registration-segmentation pipelines permit user verification yet discard the lesion's prior appearance, limiting accuracy in ambiguous cases. In this work, we propose a Verified Tracking paradigm: a clinician verifies a registration-proposed prompt, which the model leverages alongside the baseline lesion appearance to resolve segmentation ambiguities. We present a unified framework combining early spatial prompt fusion with latent temporal difference weighting for longitudinally informed segmentation. To address data scarcity, we leverage large-scale synthetic pretraining, which proves essential for exploiting longitudinal context and improves performance by up to 4.5 Dice points over training from scratch. Our approach secured first place in the MICCAI autoPET IV challenge. We further curate and release PanTrack, a new longitudinal pancreatic cancer benchmark, to assess out-of-distribution generalization. Experiments show that our model outperforms prior work in both the fully automatic and the proposed verified tracking settings, offering a clinically safe middle ground between automation and control. Code, model, and dataset will be released at https://github.com/MIC-DKFZ/LongiSeg.

2605.23116 2026-05-25 cs.CV cs.AI

CoReVAD: A Contextual Reasoning Framework for Training-Free Video Anomaly Detection

Hyeongmuk Lim, Youngbum Hur

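AI summary: Existing video anomaly detection methods typically depend on task-specific training, which leads to strong domain dependency and high training costs, and most output only a scalar anomaly score without explaining why an event is anomalous. The paper proposes CoReVAD, a training-free contextual reasoning framework that uses a frozen vision-language model to generate anomaly scores and temporal descriptions directly, and improves detection accuracy and interpretability with a Local Response Cleaning module and a global temporal refinement strategy. Experiments show that CoReVAD performs well on multiple datasets and provides reliable, easy-to-understand explanations of anomalies.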

Comments Accepted to ICPR 2026

Abstract

Existing Video Anomaly Detection (VAD) methods typically rely on task-specific training, leading to strong domain dependency and high training costs. Moreover, most existing methods output only scalar anomaly scores, providing limited insight into why specific events are considered abnormal. Recent advances in Vision-Language Models (VLMs) have enabled both anomaly detection and human-interpretable reasoning. However, many VLM-based approaches still require additional training steps (e.g., instruction tuning or verbalized learning) or external Large Language Models (LLMs), incurring further training costs and inference overhead. To address these challenges, we propose CoReVAD, a contextual reasoning framework for training-free video anomaly detection that operates with a single frozen VLM. CoReVAD directly generates anomaly scores and temporal descriptions from the VLM. To mitigate noise in generative outputs, we introduce a Local Response Cleaning (LRC) module based on local vision-text alignment. Furthermore, global temporal context and progression are incorporated through softmax-based refinement, Gaussian smoothing, and position weighting. Experiments on UCF-Crime and XD-Violence demonstrate that CoReVAD achieves competitive performance among training-free methods while providing reliable and interpretable explanations. Our official code is available at: https://github.com/Muk-00/CoReVAD
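
Of the temporal refinement steps listed, Gaussian smoothing is the easiest to illustrate. The sketch below smooths a sequence of per-segment anomaly scores with a normalized Gaussian kernel and edge padding. The kernel radius, `sigma`, and padding mode are assumptions made for illustration; the softmax-based refinement and position weighting are not shown, and the official code may differ.

```python
import numpy as np

def gaussian_smooth(scores, sigma=2.0):
    """Smooth a 1-D score sequence with a normalized Gaussian kernel."""
    scores = np.asarray(scores, dtype=float)
    radius = int(3 * sigma)                      # truncate at about 3 sigma
    t = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (t / sigma) ** 2)
    kernel /= kernel.sum()                       # weights sum to 1
    padded = np.pad(scores, radius, mode="edge") # repeat boundary values
    return np.convolve(padded, kernel, mode="valid")

raw = np.array([0, 0, 0, 0, 1, 0, 0, 0, 0, 0], dtype=float)  # one spiky segment
smooth = gaussian_smooth(raw, sigma=2.0)
flat = gaussian_smooth(np.full(8, 0.3))
```

Smoothing spreads an isolated spike over neighbouring segments, which suppresses single-segment noise in generated scores at the cost of blurring sharp event boundaries.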

2605.23115 2026-05-25 cs.LG stat.ML

Robust OT-Guided Generative Residual Domain Adaptation for Bike-Sharing Demand Prediction under Temporal Domain Shift

Yiming Ma

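AI summary: This paper studies temporal domain adaptation for Citi Bike bike-sharing demand prediction in New York from 2021 to 2026 and proposes Gen-ROTDA, an optimal-transport-guided residual domain adaptation framework. The method fits a target-domain station-time anchor, transfers residuals rather than raw demand, and applies a deterministic label-preserving residual feature generator, improving robustness under temporal domain shift. Experiments show that Gen-ROTDA achieves the lowest mean absolute error on the main 2025 to 2026 task, outperforms other optimal transport methods across the multi-year tasks, and is notably more stable on noisy data.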

Abstract

Bike-sharing models trained on historical station-hour data may degrade when deployed in later years because travel patterns change over time. This paper studies March Citi Bike demand prediction from 2021 to 2026 as a temporal domain adaptation problem and proposes Gen-ROTDA, a robust optimal transport-guided residual domain adaptation framework. The method fits a target-domain station-time anchor with a small labeled target subset, transfers residuals rather than raw demand, applies a deterministic label-preserving residual feature generator, and trims high-cost transport matches before training the final residual predictor. Experiments compare Gen-ROTDA with anchor-only, source-only, target-only, fine-tuning, MMD adaptation, Sinkhorn OTDA, ROTDA, and Gen-OTDA. Gen-ROTDA achieves the lowest MAE on the main 2025 to 2026 task and is the best OT-family method on average across multi-year tasks, although fine-tuning and MMD adaptation remain strong overall baselines. Under abnormal target-unlabeled records, Gen-ROTDA is much more stable than non-robust OT variants, suggesting that robust transport is useful for noisy temporal transfer in bike-sharing demand prediction.

2605.23113 2026-05-25 cs.CV

Inconsistency-aware Multimodal Schrödinger Bridge for Deepfake Localization

Jiayu Xiong, Jing Wang, Qi Zhang, Wanlong Wang, Jun Xue

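AI summary: This paper proposes the inconsistency-aware multimodal Schrödinger Bridge (IaMSB) for interval-level localization in deepfake videos. By jointly estimating cross-modal consistency and localizing temporal intervals, the method suppresses cross-modal noise propagation in single-sided and asynchronous forgeries. IaMSB uses the Schrödinger Bridge framework to unify consistency estimation, cross-modal information selection, and bridge-step scheduling, which improves localization accuracy while cutting unnecessary iterations and markedly improves high-precision localization, especially for single-sided forgeries.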

Comments Accepted by CVPR2026

Abstract

Audio-visual deepfake localization demands interval-level outputs that serve as temporal evidence. Despite recent progress, symmetric fusion under single-sided or asynchronous forgeries propagates cross-modal noise, degrading high-precision localization. We present IaMSB, an inconsistency-aware multimodal Schrödinger Bridge (SB) that jointly estimates cross-modal consistency and performs interval-level localization. Unlike diffusion models, SB minimizes path-distribution discrepancy and yields consistency scores without explicit noise injection or denoising. Building on SB, IaMSB unifies consistency estimation, cross-modal information selection, and bridge-step scheduling in one framework. Specifically, a lightweight coarse bridge first proposes candidate intervals and estimates cross-modal consistency; these statistics select cross-modal witness signals and allocate bridge steps asymmetrically across modalities. A refinement bridge then performs step-tuned fusion and outputs refined, time-aligned intervals. IaMSB anticipates single-sided and asynchronous forgeries and, using bottlenecked cross-modal interaction with step allocation, suppresses noise transfer and avoids unnecessary iterations. Across benchmarks, IaMSB stabilizes strict-IoU boundary precision, raising AP@0.95 by 3% to 10%, and yields improved high-precision localization, particularly for single-sided forgeries.
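
To make the strict-IoU criterion concrete, here is a minimal sketch of temporal IoU between a predicted and a ground-truth interval. It assumes the usual convention that, for AP@0.95, a prediction counts as a match only when this value is at least 0.95; the interval format `(start, end)` in seconds is an illustrative assumption.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two time intervals given as (start, end)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

exact = temporal_iou((0.0, 10.0), (0.0, 10.0))
late_start = temporal_iou((0.5, 10.0), (0.0, 10.0))   # starts half a second late
disjoint = temporal_iou((0.0, 1.0), (2.0, 3.0))
```

On a ten-second forged segment, a boundary error of half a second already lands exactly on the 0.95 threshold, which shows how little slack the strict metric leaves.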

2605.23109 2026-05-25 cs.AI cs.DC cs.LO cs.PL

Inductive Deductive Synthesis: Enabling AI to Generate Formally Verified Systems

Shubham Agarwal, Alexander Krentsel, Shu Liu, Mert Cemri, Audrey Cheng, Rui Meng, Tomas Pfister, Chun-Liang Li, Sylvia Ratnasamy, Aditya Parameswaran, Matei Zaharia, Ion Stoica, Mohsen Lesani

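AI summary: This paper proposes Inductive Deductive Synthesis (IDS), a new approach to the lack of formal verification in AI-generated code, particularly for distributed systems. The approach jointly generates implementation code and formal proofs and learns from failed attempts to try promising strategies systematically. Built as an agent-based large language model system, IDS completes formal verification of 7 distributed key-value store specifications in about 6.8 hours on average at low cost, and the generated implementations outperform existing verified systems.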

Abstract

AI agents increasingly excel at generating, testing, and refining code. However, they fall short on tasks requiring formal guarantees of full coverage that testing alone cannot provide. Distributed systems are a prime example: properties such as consistency between reads and writes must hold under every possible interleaving of events. Mechanized formal verification can guarantee such correctness, but typically demands months to years of expert effort. As evidence, even SOTA coding agents (Codex with GPT-5.4 and Claude Code with Opus 4.6) succeed on only 2/7 distributed key-value-store specifications. In this paper, we present the first effective approach to addressing this gap, Inductive Deductive Synthesis (IDS), which jointly and incrementally synthesizes implementation and proof, and learns from failed attempts to systematically try promising strategies. Built as an agentic LLM system, IDS achieves 7/7 in about 6.8 hours and $106 per spec on average, roughly 200x faster than expert effort and 17% cheaper than SOTA agents. IDS further incorporates performance feedback into the same loop, yielding implementations up to 3x faster than published verified systems.

2605.23103 2026-05-25 cs.CL cs.AI cs.CY cs.DB

A Fine-Tuned BERT Classifier for Personal-Letter Titles in Late-Ming and Early-Qing Collected Works

Queenie Luo

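AI summary: This paper presents Lepton, a fine-tuned BERT classifier that identifies whether a title in the table of contents of a late-Ming to early-Qing collected work is a personal letter, in particular distinguishing letters from easily confused prefaces such as farewell prefaces. The model was fine-tuned on 5,438 hand-labeled titles from the collected works of 33 literati and is deployed on Hugging Face. Applied to the China Biographical Database (CBDB), it identified roughly 55,000 letters, supporting data construction for the Mingxin (明信) platform.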

2605.23100 2026-05-25 cs.RO

Four Simple Proprioceptive Estimators for Legged Robots

Frank Dellaert, Chiyun Noh, Varun Agrawal, Ayoung Kim

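AI summary: This paper studies how intermittent foot contact information can improve pose estimation for legged robots affected by inertial measurement unit (IMU) noise. The authors present a sequence of progressively more capable estimators, starting from a contact-aided invariant extended Kalman filter (EKF) and then introducing factor graphs and fixed-lag smoothing to improve accuracy and robustness. All four methods are implemented in GTSAM, with ROS2-compatible code provided for reproduction and further research.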

Abstract

Legged robots carry an IMU, but the inertial solution drifts because consumer-grade IMUs are noisy. However, the feet create intermittent contacts with the environment that can be used to mitigate that drift. This report develops a sequence of increasingly expressive legged robot state estimators that leverage these contacts. In all cases, the floating-base state comprises attitude, position, velocity, and IMU biases. To model foot contacts, we start from the contact-aided invariant EKF of Hartley et al., albeit at a reduced contact update rate. This is then augmented by replacing the measurement update with a small factor graph. Finally, we turn the same factors into a fixed-lag smoother with contact-episode footholds, with and without an evolving IMU bias. To facilitate reproducibility and further research in proprioceptive legged odometry, all four variants are available in GTSAM (Dellaert et al.), and we additionally provide a ROS2-compatible implementation.

2605.23098 2026-05-25 cs.RO

UfM: Uncertainty from Motion for DNN Depth Estimation Using Gaussians

Soumya Sudhakar, Sertac Karaman, Vivienne Sze

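AI summary: This paper proposes UfM*, an uncertainty estimation method for deep neural network monocular depth estimation that uses a Gaussian mixture model to measure disagreement between multi-view predictions efficiently, requiring only a single network inference per image. Compared with conventional methods, UfM* is substantially more compute- and memory-efficient, and evaluations on multiple datasets show improved calibration error and lower energy use, making it well suited to resource-constrained robotic systems.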

Comments 18 pages, 15 figures

Abstract

Reliable uncertainty estimation is critical for deploying monocular depth deep neural networks (DNNs) in safety-critical robotic systems. Conventional uncertainty methods such as ensembles and sampling-based approaches require multiple inferences per image, incurring substantial compute and memory overhead. Moreover, uncertainty predicted from a single image misses out on measuring disagreement between predictions across views of the same region. We propose Uncertainty from Motion* (UfM*), an uncertainty estimation algorithm that measures multiview disagreement efficiently by comparing previous and current views using a compact Gaussian mixture, requiring only a single DNN inference per image. Using Gaussians to compute multiview disagreement is not only more compute- and memory-efficient than a prior approach using a point cloud, but also improves uncertainty by measuring disagreement across regions of 3D space. UfM* paired with aleatoric uncertainty improves expected calibration error by 24-28% compared to an ensemble, while requiring only 3% of the energy and 0.02% of the memory on 100 out-of-distribution ScanNet sequences. We demonstrate UfM* consumes only 63 mJ per 224x224 image while running real-time at 30 FPS on an Arm Cortex-A76 CPU onboard a miniature energy-constrained robot, highlighting that measuring multiview disagreement using Gaussians enables efficient uncertainty for resource-constrained robotic systems.

2605.23093 2026-05-25 cs.CL cs.CY

A Comparative Evaluation of Structural Topic Models and BERTopic for Short, Open-Ended Survey Responses

Yan Jiang, Sihong Liu, Philip A. Fisher

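AI summary: This paper compares Structural Topic Models (STM) and the embedding-based BERTopic model for analyzing short, open-ended survey responses. Evaluating both methods under several configurations, the study finds that BERTopic yields higher topic coherence than STM, while STM is stronger for covariate analysis. The results indicate that the two methods have complementary strengths suited to different research needs, and they offer practical guidance for choosing topic modeling methods in applied social science research.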

Abstract

Topic modeling in applied psychology increasingly spans two methodological traditions: probabilistic bag-of-words models and newer embedding-based approaches. Yet many evaluations of these methods rely on longer and cleaner benchmark corpora, leaving less guidance for short, open-ended survey responses. This paper compares Structural Topic Models (STM), a probabilistic topic model, and BERTopic, an embedding-based model, for analyzing open-ended survey responses. We evaluated three STM conditions and five BERTopic conditions, varying typographical correction, stemming, embedding choice, and contextual augmentation, a strategy we introduced to provide additional semantic context for very short responses. Results indicate that BERTopic consistently produced higher topic coherence than STM, with contextual augmentation yielding the strongest performance gains. In contrast, higher-dimensional embeddings alone did not improve coherence and were associated with greater data loss. Qualitative evaluation showed that BERTopic generated more interpretable and stable topics, while STM topics were often broader and more mixed. However, STM provides stronger support for inferential covariate analysis, whereas BERTopic covariate comparisons are primarily descriptive. These findings suggest that STM and BERTopic offer complementary strengths. We conclude with practical guidance for selecting and combining topic modeling approaches in applied social science research.

2605.23089 2026-05-25 cs.LG cs.AI

Dreaming Smoothly and Sample Efficiently with Gradient Penalized Latent Dynamics

Romil V. Sonigra, P. R. Kumar

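AI summary: This paper proposes GPLD, a gradient-penalized latent dynamics regularizer for improving latent world models in model-based reinforcement learning. By applying a row-wise Jacobian penalty to the posterior latent state distribution, the method explicitly encourages learning locally smooth transition dynamics, improving sample efficiency and learning stability. Experiments show that GPLD performs well across several deep reinforcement learning tasks, with marked gains in complex locomotion environments, and on quadruped tasks it reaches high-return behavior earlier with more consistent long-horizon learning.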

Comments 17 pages and 9 figures

Abstract

Model-based reinforcement learning improves sample efficiency by learning a world model. However, existing latent world models such as DreamerV3 do not explicitly enforce local smoothness in their learned transition dynamics, leaving a useful inductive bias for transition dynamics learning unexploited. We propose GPLD, a gradient-penalized latent dynamics regularizer for DreamerV3 that applies a row-wise Jacobian penalty to the posterior latent distribution to encourage locally smooth transition learning. We show that this penalty can be interpreted as the continuous-latent analog of finite-difference smoothing of transition laws in discrete embedded-state MDPs, and estimate it efficiently using Hutchinson-style stochastic probes. Empirically, across DeepMind Control proprioceptive tasks, GPLD improves aggregate sample efficiency, with particularly strong gains on higher-complexity locomotion environments. On more challenging quadruped tasks, GPLD reaches high-return behavior earlier and exhibits more consistent late-stage learning over longer horizons. Explicit local smoothness regularization is a simple and effective way to improve latent world models for smooth continuous control environments. Code for GPLD is available at github.com/romils9/gpld-mbrl.
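
The Hutchinson-style estimate mentioned above rests on the identity E[||J v||^2] = ||J||_F^2 for probes v with identity covariance. The sketch below estimates the squared Frobenius norm of a Jacobian with Rademacher probes and a central finite difference for the product J v. It is an illustration only: the function name is hypothetical, the full (rather than row-wise) norm is used, and a real implementation would most likely use automatic differentiation inside the DreamerV3 training loop.

```python
import numpy as np

def hutchinson_jacobian_penalty(f, z, n_probes, rng, eps=1e-4):
    """Estimate ||J_f(z)||_F^2 as the average of ||J v||^2 over random probes."""
    total = 0.0
    for _ in range(n_probes):
        v = rng.choice([-1.0, 1.0], size=z.shape)             # Rademacher probe
        jv = (f(z + eps * v) - f(z - eps * v)) / (2 * eps)    # approximates J v
        total += float(jv @ jv)
    return total / n_probes

# Toy linear "transition" with a known Jacobian W (illustrative assumption).
W = np.diag([1.0, 2.0, 3.0])
f = lambda z: W @ z
z0 = np.array([0.5, -1.0, 2.0])
est = hutchinson_jacobian_penalty(f, z0, n_probes=8, rng=np.random.default_rng(0))
```

Each probe costs two forward evaluations regardless of the latent dimension, whereas forming the full Jacobian would cost one evaluation per input dimension.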


Four Simple Proprioceptive Estimators for Legged Robots

UfM*: Uncertainty from Motion* for DNN Depth Estimation Using Gaussians

A Comparative Evaluation of Structural Topic Models and BERTopic for Short, Open-Ended Survey Responses

Dreaming Smoothly and Sample Efficiently with Gradient Penalized Latent Dynamics

UfM: Uncertainty from Motion for DNN Depth Estimation Using Gaussians