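arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and other fields.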

2606.11913 2026-06-11 cs.CV new submission

From Content to Knowledge: Lightning Fast Long-Video Understanding with Neural Knowledge Representations

Yuchen Guan, Xiao Li, Zongyu Guo, Xiaoyi Zhang, Xiulian Peng, Chun Yuan, Yan Lu

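AI summary: Proposes encoding a long video as a Neural Knowledge Representation (NKR). Agentic Knowledge Distillation (AKD) automatically synthesizes descriptions and question-answer pairs and embeds the video's knowledge in a small set of weights attached to the VLM backbone. The result is lightweight, reusable video understanding that needs no video reloading at inference and sharply reduces latency.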

Abstract

We propose a new paradigm for long video understanding by treating a long video as a Neural Knowledge Representation (NKR). NKR represents video contents neither as a stream of tokens nor pre-organized databases, but as an individual small portion of network weights attached to the VLM backbone. The NKR weights are optimized to encapsulate the video's semantic content via a novel Agentic Knowledge Distillation (AKD) process, where an agent automatically synthesizes dense descriptions and question-answer pairs to distill the video's knowledge into the NKR. While AKD serves as a comprehensive, one-time encoding phase, the resulting NKR transforms the video into a portable, reusable asset. At inference, the lightweight NKR is mounted onto a frozen Vision-Language Model (VLM), enabling direct, query-based understanding without reloading or re-encoding the original video. This approach decouples video length from inference cost, offering high amortized efficiency for multi-turn video understanding. Experiments on the LVBench benchmark show our method achieves performance comparable to state-of-the-art approaches while reducing end-to-end latency by over two orders of magnitude, opening new possibilities for interactive long-video understanding.

2606.11911 2026-06-11 stat.ML cs.LG math.AT new submission

From Persistence to Survival: Hypothesis Testing, Effect Sizes and Vectorisation for Topological Features

Juliette Murris, Bernadette Stolz, Karsten Borgwardt

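AI summary: Proposes STRAND, which treats persistence diagrams as survival data and uses the persistence survival function to unify hypothesis testing, effect-size computation, and vectorisation. The method is validated on synthetic data and real benchmarks.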

Abstract

Persistence diagrams (PDs) are common representations in topological data analysis, but they do not naturally live in a vector space, and the statistical tools developed for comparing them have largely evolved separately from those used for downstream prediction. We introduce STRAND (Survival Topological Representation ANalysis of Diagrams), which treats (collections of) PDs as survival data: each topological feature with persistence value $p = d - b$ is a fully observed time-to-event, and the persistence survival function $S(t) = \mathbb{P}(p > t)$ is the central object for comparing diagrams. From this single representation we derive (i) a non-parametric two-sample test with calibrated Type I error and high power from a small number of diagrams; (ii) interpretable effect sizes; and (iii) a 1-Wasserstein-stable feature vector for downstream machine learning. We validate calibration and power on synthetic manifolds with controlled topology, demonstrate competitive vectorisation across 14 graph and 3D point cloud benchmarks, and apply the method to study functional brain connectivity in fMRI/neuroscience data. To our knowledge, STRAND is the first method to provide hypothesis testing and vectorisation for persistence diagrams from a single coherent and interpretable representation.
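The persistence survival function $S(t) = \mathbb{P}(p > t)$ from the abstract can be estimated empirically from a single diagram. The following is a minimal sketch, not the STRAND implementation: the function name `persistence_survival`, the evaluation grid, and the toy diagram are assumptions made for illustration, and the estimate is simply the fraction of features whose persistence exceeds `t`.

```python
from typing import List, Sequence, Tuple


def persistence_survival(diagram: Sequence[Tuple[float, float]],
                         grid: Sequence[float]) -> List[float]:
    """Empirical S(t) = P(p > t), where p = death - birth for each feature."""
    persistences = [death - birth for birth, death in diagram]
    n = len(persistences)
    if n == 0:
        # An empty diagram has no features, so nothing survives past any t.
        return [0.0 for _ in grid]
    return [sum(1 for p in persistences if p > t) / n for t in grid]


# Hypothetical toy diagram of (birth, death) pairs.
# The persistence values are 0.5, 1.0, 2.0 and 4.0.
toy_diagram = [(0.0, 0.5), (0.0, 1.0), (1.0, 3.0), (0.0, 4.0)]
grid = [0.0, 0.75, 1.5, 3.0, 5.0]
survival = persistence_survival(toy_diagram, grid)
print(survival)
```

Evaluating the curve on a fixed grid gives a fixed-length vector, which is one plausible way to turn a diagram into features. The abstract does not specify STRAND's actual vectorisation or test statistic, so this only illustrates the central object.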

2606.11903 2026-06-11 cs.SD new submission

Snapping Matters: Context-Aware Onset Refinement for Automatic Music Transcription

Abhirup Saha, Hans-Ulrich Berendes, Meinard Müller, Ben Maman

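AI summary: For weakly aligned score-audio data, proposes a context-aware onset refinement method based on bipartite graph matching that improves onset alignment and transcription accuracy in automatic music transcription.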

Comments: Published in International Computer Music Conference (ICMC) 2026

Abstract

Precise note-level annotations are critical for training automatic music transcription (AMT) systems, in particular note-onset labels, which form a core component of many recent AMT systems. However, high-quality annotations for real-world recordings are scarce. Sequence-level score--audio alignment methods such as dynamic time warping provide only coarse correspondence, making a local refinement step necessary. This refinement step, known as snapping, adjusts aligned score onsets using peaks in a neural onset posteriorgram and often determines whether weakly aligned score--audio pairs become usable training data at all. Despite its practical importance, snapping is typically treated as a simple post-processing heuristic and implemented with greedy local decisions. We present a systematic analysis of snapping strategies for training instrument-agnostic transcribers, demonstrating that snapping is essential for learning from weakly aligned data. Building on this, we formulate snapping as a per-pitch assignment problem and solve it via bipartite graph matching, yielding context-aware onset decisions under overlapping refinement windows and uncertain initial alignments. Extensive cross-dataset experiments across piano, chamber, and orchestral recordings show improved onset alignment and transcription accuracy over greedy snapping, with gains increasing for wider snapping windows and coarser initial alignments. Qualitative examples are provided on our project page: this https URL
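The contrast between greedy snapping and snapping as a per-pitch assignment problem can be sketched on a toy case. Everything below is an assumption for illustration: the function names, the distance-only cost, the `unmatched_cost` penalty, and the use of exhaustive search in place of a real bipartite matching solver (such as the Hungarian algorithm). The paper's actual costs, which draw on a neural onset posteriorgram, are not given in the abstract.

```python
from typing import List, Sequence


def greedy_snap(onsets: Sequence[float], peaks: Sequence[float],
                window: float) -> List[float]:
    """Snap each onset to its nearest in-window peak, ignoring other onsets."""
    snapped = []
    for onset in onsets:
        candidates = [p for p in peaks if abs(p - onset) <= window]
        snapped.append(min(candidates, key=lambda p: abs(p - onset))
                       if candidates else onset)
    return snapped


def matching_snap(onsets: Sequence[float], peaks: Sequence[float],
                  window: float, unmatched_cost: float) -> List[float]:
    """Minimum-cost one-to-one assignment of onsets to peaks (exhaustive search)."""
    best_cost = float("inf")
    best: List[float] = list(onsets)

    def search(i, used, cost, chosen):
        nonlocal best_cost, best
        if cost >= best_cost:
            return  # costs are non-negative, so this branch cannot improve
        if i == len(onsets):
            best_cost, best = cost, list(chosen)
            return
        onset = onsets[i]
        for j, peak in enumerate(peaks):
            if j not in used and abs(peak - onset) <= window:
                search(i + 1, used | {j}, cost + abs(peak - onset), chosen + [peak])
        # Option: leave this onset at its aligned position and pay a penalty.
        search(i + 1, used, cost + unmatched_cost, chosen + [onset])

    search(0, frozenset(), 0.0, [])
    return best


# Two repeated notes of one pitch with overlapping refinement windows.
onsets = [1.00, 1.10]
peaks = [1.08, 1.30]
greedy = greedy_snap(onsets, peaks, window=0.25)
matched = matching_snap(onsets, peaks, window=0.25, unmatched_cost=0.5)
print(greedy, matched)
```

In this toy case the greedy rule sends both onsets to the same peak, while the assignment view forces distinct peaks. That is the kind of context-aware decision under overlapping windows that the abstract describes.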

2606.11891 2026-06-11 cs.RO cs.LG new submission

Critic Architecture Matters: Dual vs. Unified Critics for Humanoid Loco-Manipulation

Mehmet Turan Yardımcı

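AI summary: Compares unified-critic and dual-critic architectures for multi-objective reinforcement learning on a humanoid robot. Dual-critic policies outperform the unified critic in reach speed, throughput, and success rate, and the architecture choice matters more than reward engineering.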

Comments: Accepted at the ICRA 2026 Workshop on Reinforcement Learning for Imitation Learning (RL4IL), Vienna, Austria. 4 pages, 2 figures

Abstract

Multi-objective reinforcement learning for humanoid robots must coordinate locomotion and manipulation within a single policy. A natural design choice is whether to use a single (unified) critic that estimates the combined value of all objectives, or separate (dual) critics with disjoint reward signals. We present a controlled comparison on the Unitree G1 humanoid (23 active DoF) in NVIDIA Isaac Lab, training loco-manipulation policies through a sequential curriculum spanning 13 levels from stationary reaching to walking with variable-orientation targets. In standardized evaluation, dual-critic policies reach targets 3.5$\times$ faster (6.5 vs. 22.6 simulation steps), achieve 2$\times$ higher throughput (14.3 vs. 7.0 validated reaches per 1,000 steps), and attain higher validated reach rates (65.2% vs. 53.8%) compared to the unified-critic policy. Notably, additional anti-gaming reward mechanisms provide no further improvement beyond the architectural change alone (60.9% vs. 65.2%). These results have direct implications for the emerging paradigm of RL fine-tuning of imitation-learned policies: when refining a pre-trained manipulation policy with RL, a unified critic risks suppressing the learned behavior through competing locomotion gradients. These findings demonstrate that critic architecture is a primary - and often overlooked - design choice in multi-objective humanoid RL, with greater impact than reward engineering on reaching efficiency.

2606.11889 2026-06-11 cs.CV cs.AI cs.RO new submission

Task-Aligned Stability Analysis of Vision-Language Models for Autonomous Driving Hazard Detection

Everett Richards

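AI summary: Studies how embedding drift relates to changes in a task-aligned hazard score for vision-language models in autonomous-driving hazard detection. Different corruption types produce different failure modes, and the author recommends that benchmarks include task-aligned stability measures.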

Comments: 8 pages (5 main body + 3 references / appendices). ICML 2026 Workshop on Combining Theory and Benchmarks (CTB)

Abstract

Vision-language models (VLMs) are increasingly used for scene understanding in autonomous driving, but robustness analysis often relies on task-agnostic embedding stability alone. We study whether corruption-induced embedding drift predicts changes in a task-aligned hazard score derived from CLIP image-text similarities. Using controlled corruptions on BDD100K road scenes, we compare embedding drift against margin drift, defined as the change in hazard score under perturbation. The relationship is highly corruption-dependent: some families exhibit strong coupling between representation drift and decision drift, while others induce hazardous decision instability despite relatively modest embedding change. Furthermore, corruption families differ in failure direction: most suppress hazard detections via false negatives, while occlusion instead triggers false alarms, suggesting that benchmark design should account for asymmetric failure modes, not just overall instability rates. These results suggest that robustness benchmarks should include task-aligned stability measures in addition to embedding-level perturbation statistics.
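The distinction between embedding drift and margin drift can be made concrete with a toy calculation. This is a sketch under stated assumptions: the abstract says only that the hazard score is derived from CLIP image-text similarities, so the specific form used here (cosine similarity to a hazard prompt minus cosine similarity to a safe prompt), the 2-D vectors standing in for embeddings, and all function names are hypothetical.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def hazard_score(image, hazard_text, safe_text) -> float:
    """Assumed score: positive means the image is closer to the hazard prompt."""
    return cosine(image, hazard_text) - cosine(image, safe_text)


def embedding_drift(clean, corrupted) -> float:
    """Task-agnostic drift: cosine distance between clean and corrupted embeddings."""
    return 1.0 - cosine(clean, corrupted)


def margin_drift(clean, corrupted, hazard_text, safe_text) -> float:
    """Task-aligned drift: change in hazard score under the perturbation."""
    return (hazard_score(corrupted, hazard_text, safe_text)
            - hazard_score(clean, hazard_text, safe_text))


# Toy 2-D stand-ins for text and image embeddings.
hazard_text, safe_text = (1.0, 0.0), (0.0, 1.0)
clean, corrupted = (0.6, 0.5), (0.5, 0.6)

e_drift = embedding_drift(clean, corrupted)
m_drift = margin_drift(clean, corrupted, hazard_text, safe_text)
print(e_drift, m_drift)
```

Here the embedding barely moves, yet the hazard score changes sign from positive to negative. This is the false-negative pattern the abstract attributes to some corruption families despite modest embedding change.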

2606.11876 2026-06-11 q-bio.QM cs.LG stat.ME new submission

Seeing Below the Limit of Detection: A Censored-Poisson Bayesian Latent-Growth Change-Point Detector (the Span Detector) for Serial ctDNA in HR+/HER2- Metastatic Breast Cancer

Aarchi Singh Thakur, Abhijoy Sarkar

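AI summary: Proposes the Span detector, a censored-Poisson Bayesian latent-growth change-point model that treats ctDNA non-detects as left-censored observations and uses a sequential generalised-likelihood-ratio statistic to detect an upward change in the per-variant detection rate. At a 10% false-alarm rate on a synthetic cohort, it raises the fraction of progressions caught three months ahead from 11% to 25%.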

Comments: 9 pages, 4 figures, 2 tables. Code and synthetic data generator: this https URL

Abstract

Circulating-tumour DNA (ctDNA) carries evidence of drug resistance months before imaging shows it, but the earliest evidence lives below the assay's limit of detection (LoD): a nascent subclone is detected only intermittently, producing a flickering sequence of faint detects and non-detects. Commercial liquid biopsies treat each draw as an independent snapshot and a non-detect as nothing. We argue a non-detect is a left-censored observation, and the pattern of non-detects and faint detects over time carries actionable evidence of growth before any single value is trustworthy. We introduce Span, a censored-Poisson Bayesian latent-growth change-point detector that models the binary detection process, accumulates a sequential generalised-likelihood-ratio statistic for an upward change-point in the per-variant detection rate, and raises a competing-risks alarm with calibrated false-alarm control. Span has no learned weights, so there is nothing to overfit. On a synthetic cohort of HR+/HER2- metastatic breast cancer on first-line CDK4/6-inhibitor plus endocrine therapy, at a matched 10% false-alarm rate, Span roughly doubles the fraction of impending progressions caught three months ahead (indolent regime: 25% vs 11% for the snapshot), with a falsifiable dose-response: large for indolent emergence, vanishing for fast emergence. A value-trajectory baseline performs identically to the snapshot, isolating the gain to the censored detection model. The survival backbone matches a Cox baseline on real breast-cancer data (GBSG-2, n=686; C-index 0.67 vs 0.68), and on a real longitudinal cohort with clean biomarkers (PBC2, n=312) the same pipeline correctly declines to win, a falsifiable boundary test confirming the mechanism is regime-specific. All ctDNA trajectories are synthetic.
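The idea of accumulating evidence from a flickering detect/non-detect sequence can be illustrated with a plain Bernoulli generalised-likelihood-ratio statistic for an upward change in detection rate. This is a simplified stand-in, not the Span detector: it omits the censored-Poisson model, the Bayesian latent growth, and the competing-risks alarm, and the function names, the known `baseline_rate`, and the example sequences are all assumptions.

```python
import math
from typing import Sequence


def bernoulli_loglik(detects: Sequence[int], rate: float) -> float:
    """Log-likelihood of a 0/1 detection sequence under a fixed detection rate."""
    hits = sum(detects)
    misses = len(detects) - hits
    loglik = 0.0
    if hits:
        loglik += hits * math.log(rate)
    if misses:
        loglik += misses * math.log(1.0 - rate)
    return loglik


def glr_upward_change(detects: Sequence[int], baseline_rate: float) -> float:
    """Max over candidate change-points of the log-likelihood gain from a higher rate."""
    best = 0.0
    for start in range(len(detects)):
        tail = detects[start:]
        mle_rate = sum(tail) / len(tail)
        if mle_rate <= baseline_rate:
            continue  # only upward changes count as evidence
        gain = (bernoulli_loglik(tail, mle_rate)
                - bernoulli_loglik(tail, baseline_rate))
        best = max(best, gain)
    return best


# Hypothetical serial draws for one variant: 1 = detect, 0 = non-detect.
quiet = [0, 0, 0, 0, 0, 0, 0, 0]
flicker = [0, 0, 0, 1, 0, 1, 1, 1]

quiet_stat = glr_upward_change(quiet, baseline_rate=0.1)
flicker_stat = glr_upward_change(flicker, baseline_rate=0.1)
print(quiet_stat, flicker_stat)
```

In a sequential setting the statistic would be recomputed after each draw and compared with a threshold calibrated to the desired false-alarm rate. The abstract reports such calibration but does not describe how it is done.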

2606.11870 2026-06-11 cond-mat.mtrl-sci cs.LG new submission

Modelling magnetic material properties with uncertainty-aware neural networks

Clemens Wager, Heisam Moustafa, Alexander Kovacs, Qais Ali, Harald Oezelt, Hayate Yamano, Masao Yano, Noritsugu Sakuma, Hyuga Hosoi, Akihito Kinoshita, Tetsuya Shoji, Akira Kato, Thomas Schrefl

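AI summary: Addresses data scarcity and out-of-distribution prediction in materials discovery by quantifying predictive uncertainty with a Gaussian negative log-likelihood loss and dropout-based Bayesian approximation, then transfers the approach to predicting coercivity from microstructure. Uncertainty quantification is shown to improve the trustworthiness of predictions and to transfer across tasks.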

Comments: preprint, unreviewed version

Abstract

Machine learning is increasingly applied to accelerate the discovery of novel materials by exploring large compositional and structural design spaces. Yet, the scarcity of high-quality data and the frequent need for out-of-distribution prediction introduce substantial uncertainty, making the assessment of model reliability essential. In this work, we investigate uncertainty quantification as a means to evaluate model confidence in the context of permanent magnet research. In a first study, we benchmark classical and modern machine learning models for predicting intrinsic magnetic properties, focusing on the quality of their uncertainty estimates. We apply Gaussian negative log-likelihood loss and dropout-based Bayesian approximation as practical strategies for estimating predictive uncertainty. In a second study, we transfer these architectural features for uncertainty estimation to a more complex task: predicting coercivity from microstructural information using a graph neural network. Together, these studies demonstrate that uncertainty quantification not only enhances the trustworthiness of predictions but is also transferable across different modeling tasks.
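The Gaussian negative log-likelihood loss mentioned in the abstract has a standard closed form, sketched below for a single prediction. The function name and the toy numbers are assumptions. The sketch is not the authors' training code, and it covers only the loss, not the dropout-based approximation.

```python
import math


def gaussian_nll(target: float, mean: float, variance: float) -> float:
    """Negative log-likelihood of target under N(mean, variance)."""
    if variance <= 0.0:
        raise ValueError("variance must be positive")
    return 0.5 * (math.log(2.0 * math.pi * variance)
                  + (target - mean) ** 2 / variance)


# With a fixed prediction error of 1.0, compare three predicted variances.
overconfident = gaussian_nll(target=1.0, mean=0.0, variance=0.25)
calibrated = gaussian_nll(target=1.0, mean=0.0, variance=1.0)
underconfident = gaussian_nll(target=1.0, mean=0.0, variance=4.0)
print(overconfident, calibrated, underconfident)
```

For a fixed error the loss is smallest when the predicted variance equals the squared error, so a network trained with this loss is pushed to report larger variance where its predictions are worse. That is what makes it usable as an uncertainty estimate.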

2606.11869 2026-06-11 cs.SE cs.AI new submission

Agents All the Way Down; A Methodology for Building Custom AI Agents from Substrate to Production

Marc Alier Forment, Juanan Pereira, Francisco José García-Peñalvo, María José Casañ Guerrero

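AI summary: Proposes a framework-free methodology for building custom AI agents end to end, based on two preconditions (the LLM as a software component, and building blocks) and three practices (prototyping, packaging as a CLI, and agent-tests-agent).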

Abstract

Custom AI agents are agents that live inside their own application, talk to their own data and tools, enforce their own security boundaries, and carry their own brand and audit trail. What separates them from the general-purpose tier is fit, not capability: each is built for one job, by the engineer who will maintain it. No published practice sets out how to build one end to end. The pieces are everywhere (function-calling APIs, the Model Context Protocol, code agents to pair with), but the practice that chains them lives in podcasts, blogs, and leaked system prompts. This paper writes that practice down as a methodology, Agents All the Way Down: two preconditions crossed once and kept, then three practices repeated for the agent's life. The preconditions are (P1) Substrate, the LLM as a software component, framed as tools, then system, then messages under prompt-caching; and (P2) Building blocks: function calling, MCP, CLI orchestration, the liteshell pattern, the agent loop, skills, characters, hooks, and scaffolding. The practices are (P3) prototype with a general-purpose agent; (P4) harvest, fold, and ship the result as a CLI, the Turtle pattern; and (P5) agent-tests-agent, in which a general-purpose agent drives it through behavioural scenarios, a complement to classical testing, not a replacement. The working loop is P3 to P4 to P5 and back, and one corollary falls out for free: multi-agent orchestration is just CLI composition. The methodology is framework-free by construction. It was distilled from the AAC, a custom agent for the open-source LAMB platform, built in about ten days by one developer with an AI pair-programmer and in production. We present it as a transferable practice, independent of any language or framework.

2606.11865 2026-06-11 stat.ML cs.LG new submission

Conformal Bayes under Label Shift: Post-Hoc Calibration vs. In-Training Adaptation

Seungjin Choi

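AI summary: Studies conformal Bayes under label shift, restoring target-domain coverage through importance-weighted conformal calibration and comparing two strategies, post-hoc calibration and in-training adaptation. The latter acts as a debiasing operator when training is biased.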

Comments: 2nd Workshop on Epistemic Intelligence in Machine Learning (EIML@ICML 2026)

Abstract

Conformal Bayes combines Bayesian posterior predictives with conformal calibration to produce prediction sets that are both statistically valid and geometrically efficient. We study conformal Bayes under label shift from a unified perspective, identifying two complementary approaches that restore nominal target-domain coverage through importance-weighted conformal calibration but operate through independent mechanisms. Post-hoc calibration tilts the posterior predictive toward the target domain and corrects the conformal threshold via an importance-weighted quantile, leaving the parameter posterior unchanged. In-training adaptation tilts the parameter posterior itself to the target domain, producing a corrected predictive whose highest predictive density region serves as the highest predictive density (HPD) based prediction set under the fitted target predictive; efficiency is model-dependent and does not imply finite-sample conditional optimality. Two controlled experiments show that in an unbiased training regime both strategies achieve valid coverage equally, while in a lead-optimization regime in-training adaptation acts as a debiasing operator, reducing interval width at unchanged coverage.
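The importance-weighted quantile used to correct the conformal threshold can be sketched with the standard weighted conformal construction, in which calibration scores are reweighted and the test point contributes mass at infinity. This is an illustration under assumptions: the function name, the toy scores, and the class weights (target-to-source label-prior ratios, taken as known here) are hypothetical, and the abstract does not state the paper's exact formulation.

```python
import math
from typing import Sequence


def weighted_conformal_threshold(scores: Sequence[float],
                                 weights: Sequence[float],
                                 test_weight: float,
                                 alpha: float) -> float:
    """Smallest score whose normalised cumulative weight reaches 1 - alpha.

    The test point's weight is part of the normalisation and sits at +infinity,
    so the threshold is infinite when the calibration mass falls short.
    """
    total = sum(weights) + test_weight
    needed = (1.0 - alpha) * total
    cumulative = 0.0
    for score, weight in sorted(zip(scores, weights)):
        cumulative += weight
        if cumulative >= needed:
            return score
    return math.inf


# Toy calibration set: nonconformity scores and labels (label 1 scores higher).
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
labels = [0, 0, 0, 0, 1, 1, 1]

# Assumed label-prior ratios p_target(y) / p_source(y): class 1 is more
# common in the target domain than in the source domain.
class_weight = {0: 0.5, 1: 2.0}

unweighted = weighted_conformal_threshold(scores, [1.0] * 7, 1.0, alpha=0.25)
shifted = weighted_conformal_threshold(
    scores, [class_weight[y] for y in labels], class_weight[1], alpha=0.25)
print(unweighted, shifted)
```

Upweighting the class that is more common in the target domain raises the threshold in this toy case, which widens prediction sets enough to restore coverage under the shift. In practice the weights would have to be estimated.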

2606.11860 2026-06-11 cs.LG new submission

RePAIR: Predictive Self-Supervised Representation Learning in Chess

Christoph Koller, Johannes Fürnkranz, Timo Bertram

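AI summary: Proposes RePAIR, an architecture that combines MAE, JEPA, and BERT and learns compact representations of chess position sequences through masking and iterative refinement, reasoning about piece movements without reinforcement learning.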

Comments: Accepted for oral presentation at IEEE Conference on Games 2026

Abstract

In this paper, we introduce Representation Prediction via Autoencoding using Iterative Refinement (RePAIR) - a novel self-supervised representation learning architecture that synthesizes Masked Autoencoders (MAE), Joint Embedding Predictive Architectures (JEPA), and Bidirectional Encoder Representations from Transformers (BERT). We demonstrate how it can be used to encode objects in sequential data like consecutive chess positions into compact yet meaningful representations. The basic principle of the architecture is to mask large portions of a sequence of latent states, similar to BERT and MAE. Then, we apply a lightweight Predictor to the latent representations that repairs gaps in the sequence in a lower-dimensional embedding space akin to JEPA. Our experiments in the domain of chess show that the Encoder refines the board representations such that meaningful chess concepts emerge clustered in the latent space. Furthermore, reconstructions of the masked board states show that the model is able to reason about the piece movements without relying on costly reinforcement learning methods. Lastly, we find that the resulting representation space allows for quick and intuitive dissections of chess games by observing the game path trajectories in this semantically rich space.

2606.11857 2026-06-11 eess.SP cs.LG new submission

REACH: Interpretability-Driven Feature Identification and Architecture Compression for Multi-Channel Vehicular Channel Estimation

Simbarashe Aldrin Ngorima, Albert Helberg, Marelie H. Davel

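AI summary: Proposes REACH, a framework that uses gradient-based attribution to identify key time-frequency features and compress the network for IEEE 802.11p channel estimation, substantially reducing parameters and computation while OOD generalisation degrades slowly.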

Comments: 22 pages, 16 figures

Abstract

Multi-channel mixed-SNR training improves out-of-distribution (OOD) generalisation of deep learning channel estimators for IEEE 802.11p vehicular communications, yet the internal mechanism responsible for this remains unexplained. This work presents REACH (Relevance-based Explanation and Architectural Compression for cHannel estimators), a gradient-based interpretability framework that operates at two levels. Input-level attribution identifies a subset of time-frequency features consistently relevant across all evaluated channel conditions, enabling input dimensionality reduction with minimal performance loss. Filter-level attribution reveals a near-universal internal representation, providing a representational account of the observed OOD generalisation. Guided by the resulting filter taxonomy, relevance-guided architecture compression substantially reduces both the number of parameters and the number of floating-point operations (FLOPs) with sub-1 dB normalised mean square error (NMSE) degradation, and OOD generalisation degrades more slowly than within-distribution accuracy under increasing compression.

2606.11853 2026-06-11 cs.CV cs.AI new submission

Task-Aware Structured Memory for Dynamic Multi-modal In-Context Learning

Zhirui Chen, Ziwei Chen, Ling Shao

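AI summary: Proposes TASM, a framework that uses task-vector guided compression, semantics-aware token merging, and a hierarchical memory structure to address the semantic disruption and static memories caused by memory compression in in-context learning for multi-modal large language models.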

Comments: Accepted to ICML 2026

Abstract

Multi-modal large language models (MLLMs) depend on in-context learning (ICL) for rapid task adaptation, but their scalability is severely limited by finite context windows and the growing cost of key-value (KV) caches in long multi-modal sequences. Existing memory compression approaches typically rely on rigid token removal or sample-dependent importance estimation, which introduces bias, disrupts semantic structure, particularly for visual representations, and yields static memories that cannot adapt to new queries. We introduce TASM (Task-Aware Structured Memory), a training-free framework that addresses these limitations through task-aware, structure-preserving, and dynamically accessible memory construction. TASM employs task-vector guided compression to replace sample-specific signals with a task-level direction that captures shared relevance across demonstrations. To preserve the underlying manifold, it applies semantics-aware token merging via bipartite graph matching, aggregating tokens without destructive pruning. Finally, TASM structures memory into a hierarchy comprising a compact Core Memory and a Latent Bank, facilitating query-adaptive dynamic retrieval. Evaluations confirm TASM maintains high performance under heavy compression, effectively balancing efficiency with adaptability.

2606.11851 2026-06-11 cs.AI new submission

StatefulDiscovery: Evidence-Calibrated Claim Formation in Open-Ended Scientific Discovery

Jiayao Chen, Shi Liu, Linyi Yang

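AI summary: Proposes StatefulDiscovery, a framework that externalizes investigation state to coordinate frontier selection, evidence acquisition, and claim adjudication, and that produces more high-value, well-supported claims across 40 real-data tasks.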

Abstract

Open-ended scientific discovery asks agents to move beyond executing analyses for predefined questions. Across multiple rounds of exploration, a discovery agent must decide which phenomena warrant investigation while avoiding overinterpretation, where emerging claims exceed the evidential scope of the analyses supporting them. This creates an evidence-calibration problem: the exploration trajectory must be coupled with claim status so that evidence can guide both what to investigate next and what can be claimed. We introduce StatefulDiscovery, a discovery framework that externalizes investigation state and uses it to coordinate frontier selection, evidence acquisition, and claim adjudication. We evaluate StatefulDiscovery across 40 real-data discovery tasks. Compared with several baselines, StatefulDiscovery produces more claims overall judged to be both well-supported and high-value. Ablations indicate that structured hypotheses, local adjudication, and frontier control contribute to performance. Together, these results suggest that explicit discovery state can couple exploration with evidence-calibrated claim formation.

2606.11835 2026-06-11 cs.HC cs.AI new submission

Designing AI-Supported Focus Groups: A Role x Modality Playbook

Zhiqing Wang, Steven Dow

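AI summary: Because focus groups are resource-intensive and highly sensitive to facilitation, proposes a playbook organized by AI role (tool, co-host, host) and modality (text, voice, embodied), and analyzes interactional trade-offs and open questions.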

Abstract

Collecting participants' lived experiences is central to design research. Focus groups are uniquely valuable because participants not only share individual accounts but also respond to one another, surfacing comparison, disagreement, and collective sensemaking. However, focus groups are resource-intensive and highly sensitive to facilitation: moderators must probe for specificity, balance participation, manage topic flow, and sustain psychological safety, and subtle facilitation choices can shape what becomes salient. Recent HCI work and commercial meeting tools show that generative AI can scaffold live conversation through prompting, turn regulation, thematic mapping, and real-time summarization. Yet UXR teams lack a clear map of what these capabilities mean in focus groups and what methodological risks they introduce. We synthesize prior work on AI-supported live conversation and propose a focus-group-specific playbook of AI supports organized by role (tool, co-host, host) and modality (text, voice, embodied). We characterize interactional trade-offs and identify open questions for evaluating AI-supported focus groups as methodological configurations.

2606.11817 2026-06-11 cs.CR cs.AI cs.CL cs.SE new submission

Grammar-Constrained Decoding Can Jailbreak LLMs into Generating Malicious Code

Yitong Zhang, Shiteng Lu, Jia Li

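AI summary: Finds that grammar-constrained decoding (GCD) can be exploited in a jailbreak attack called CodeSpear that induces LLMs to generate malicious code, and proposes CodeShield, a safety alignment method that defends against the attack by generating honeypot code.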

Abstract

Large Language Models (LLMs) are increasingly used for code generation, raising concerns that they may be misused to produce malicious code. Meanwhile, Grammar-Constrained Decoding (GCD) has been widely adopted to improve the reliability of LLM-generated code by enforcing syntactic validity. In this paper, we reveal a counterintuitive risk: this reliability-oriented technique can itself become an attack surface. We uncover a new jailbreak attack, termed CodeSpear, that exploits GCD to induce LLMs into generating malicious code. Our experiments show that simply applying a benign code grammar constraint can effectively jailbreak LLMs. To address this vulnerability, we propose CodeShield, a safety alignment approach that robustly preserves safe behavior even under attacker-controlled grammar constraints. CodeShield aligns the model in the code modality by teaching it to generate honeypot code under GCD. Such code is semantically harmless, so it does not implement the malicious request, and structurally diverse, so it is difficult to suppress through grammar tightening. At the same time, CodeShield still preserves natural-language refusals when natural language is available. Experiments on 10 popular LLMs across 4 benchmarks show that CodeSpear outperforms representative jailbreak baselines and increases the attack success rate by more than 30 percentage points on average. CodeShield also restores safety under CodeSpear while preserving benign utility. Our findings reveal a fundamental risk of GCD and call for greater attention to its potential security implications.


2606.11814 2026-06-11 quant-ph cs.AI cs.LG 新提交

Sparsified Kolmogorov-Arnold Networks for Interpretable Quantum State Tomography

稀疏化Kolmogorov-Arnold网络用于可解释量子态层析

Xinge Wu, Huaxin Wang, Jiajun Liu, Ruiqing He, Jiandong Shang, Hengliang Guo, Qiang Chen

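AI summary: Uses a sparsified Kolmogorov-Arnold Network as an inspectable reconstruction rule. On a three-qubit GHZ benchmark, it identifies the GHZ-relevant Pauli measurement set and reveals input-hidden-output pathways consistent with the analytic GHZ Pauli grouping, giving structural interpretability to a neural reconstruction model.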

AI中文摘要

量子态层析的机器学习方法可以实现高保真度重构，但训练模型所使用的物理结构往往隐含。这里我们探究稀疏化Kolmogorov-Arnold网络（KAN）是否不仅可以作为回归器，还可以作为可检查的重构规则，其内部组织可以与已知的Pauli结构进行对照。我们研究了一个受控的三量子比特GHZ族基准测试，其中所有63个非恒等Pauli期望值被用于重构三个GHZ子空间变量：种群不平衡$z$、实部非对角分量$c$和虚部非对角分量$s$。在有限采样和退极化噪声下，外部消融从63个测量中识别出扩展的12通道GHZ相关Pauli集，在测试的采样次数和退极化噪声强度下实现了精确的前12恢复。这些支持模式在多种子随机初始化和噪声水平分析中保持稳定，并在随机标签控制下崩溃。主要的剪枝输入-隐藏-输出通路以与解析GHZ Pauli分组一致的方式组织Z型种群可观测量和X/Y非对角可观测量，稀疏公式恢复重现了规范的带符号Pauli关系。因此，KAN的贡献在于神经重构模型中的通路级结构可解释性，而非优越的稀疏回归。结合阴性对照，这些探针提供了一条一致性链，用于审计学习到的重构规则与已知物理结构的一致性。

英文摘要

Machine-learning approaches to quantum state tomography can achieve high reconstruction fidelity, but the physical structure used by the trained model often remains implicit. Here we ask whether a sparsified Kolmogorov-Arnold Network (KAN) can be used not only as a regressor, but also as an inspectable reconstruction rule whose internal organization can be checked against known Pauli structure. We study a controlled three-qubit GHZ-family benchmark in which all 63 non-identity Pauli expectation values are used to reconstruct three GHZ-subspace variables: the population imbalance $z$, the real off-diagonal component $c$, and the imaginary off-diagonal component $s$. Under finite-shot sampling and depolarizing noise, external ablation identifies the extended 12-channel GHZ-relevant Pauli set from the 63 measurements, with exact top-12 recovery across the tested shot counts and depolarizing-noise strengths. These support patterns remain stable across multi-seed random-initialization and noise-level analyses, and collapse under random-label controls. The dominant pruned input-hidden-output pathways organize Z-type population observables and X/Y off-diagonal observables in a pattern consistent with the analytic GHZ Pauli grouping, and sparse formula recovery recovers the canonical signed Pauli relations. The contribution of the KAN is therefore pathway-level structural interpretability within a neural reconstruction model, rather than superior sparse regression. Together with negative controls, these probes provide a consistency chain for auditing learned reconstruction rules against known physical structure.
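To make the Pauli structure concrete, here is a minimal sketch (not the authors' code) that enumerates all 63 non-identity three-qubit Pauli strings and evaluates their expectation values on the ideal GHZ state (|000> + |111>)/sqrt(2). The helper names `apply_pauli` and `expectation` and the dictionary state representation are assumptions made for illustration. For this single state only 7 strings have a nonzero expectation; the 12-channel set reported in the abstract concerns the wider GHZ family and is not reproduced here.

```python
from itertools import product


def apply_pauli(label, bits):
    """Apply a Pauli string to a computational basis state.

    Returns (phase, new_bits) such that P|bits> = phase * |new_bits>.
    """
    phase = 1 + 0j
    out = []
    for p, b in zip(label, bits):
        if p == "I":
            out.append(b)
        elif p == "X":
            out.append(1 - b)          # X flips the bit
        elif p == "Y":
            phase *= 1j if b == 0 else -1j  # Y|0> = i|1>, Y|1> = -i|0>
            out.append(1 - b)
        else:  # "Z"
            phase *= 1 if b == 0 else -1    # Z|1> = -|1>
            out.append(b)
    return phase, tuple(out)


def expectation(label, state):
    """Compute <psi|P|psi> for a sparse state {bits: amplitude}."""
    total = 0j
    for bits, amp in state.items():
        phase, new_bits = apply_pauli(label, bits)
        total += state.get(new_bits, 0j).conjugate() * phase * amp
    return total.real


# All 4**3 - 1 = 63 non-identity Pauli strings on three qubits.
labels = ["".join(p) for p in product("IXYZ", repeat=3) if "".join(p) != "III"]

# Ideal GHZ state (hypothetical test input).
ghz = {(0, 0, 0): 2 ** -0.5, (1, 1, 1): 2 ** -0.5}

nonzero = {}
for label in labels:
    value = expectation(label, ghz)
    if abs(value) > 1e-12:
        nonzero[label] = value
```

The `nonzero` dictionary then holds the Z-type pair correlators and the X/Y-type three-body correlators, which is the kind of signed Pauli relation the abstract describes recovering.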


2606.11798 2026-06-11 q-fin.CP cs.LG math.OC 新提交

Deterministic Policy Gradient for Learning Equilibrium in Time-Inconsistent Control Problems

时间不一致控制问题中学习均衡的确定性策略梯度

Xin Guo, Yijie Huang, Xiang Yu

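AI summary: Proposes a continuous-time, model-free reinforcement learning algorithm that learns equilibrium policies for time-inconsistent control problems via deterministic policy gradient and inner fixed-point iterations, validated on mean-variance portfolio management and tracking portfolios under non-exponential discounting.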

Comments: Keywords: Time-inconsistent control, two-stage reformulation, model-free continuous-time reinforcement learning, deterministic policy gradient, fixed point iteration

AI中文摘要

在本文中，我们开发了一种连续时间无模型强化学习算法，用于学习一般时间不一致控制问题中的确定性均衡策略。利用扩展的Hamilton-Jacobi-Bellman系统，我们将原始时间不一致问题转化为一个等价的两阶段问题。在第一阶段，对于给定的辅助函数，我们采用确定性策略梯度方法在辅助的时间一致控制问题中学习最优策略。在第二阶段，给定更新后的策略，我们利用内定点迭代和某些鞅特征来学习辅助函数。作为理论贡献，我们提供了一些温和的模型假设，并建立了内定点迭代的收敛性。通过在两阶段之间重复这种演员-评论家风格的迭代，我们的算法旨在以统一的方式学习不同时间不一致性来源下的均衡。该算法在两种经典的时间不一致金融应用中的优越有效性得到了说明：均值-方差投资组合管理和非指数贴现下的最优跟踪投资组合。

英文摘要

In this paper, we develop a continuous-time model-free reinforcement learning algorithm to learn deterministic equilibrium policies in general time-inconsistent control problems. Utilizing the extended Hamilton-Jacobi-Bellman system, we recast the original time-inconsistent problem into an equivalent two-stage problem. In the first stage, for given auxiliary functions, we employ the deterministic policy gradient approach to learn an optimal policy in an auxiliary time-consistent control problem. In the second stage, given the updated policy, we exploit inner fixed-point iterations and some martingale characterizations to learn the auxiliary functions. As a theoretical contribution, we provide some mild model assumptions and establish the convergence of the inner fixed-point iterations. By repeating this actor-critic style of iterations across the two stages, our algorithm aims to learn the equilibrium under different sources of time-inconsistency in a unified manner. The superior effectiveness of the proposed algorithm is illustrated in two classical financial applications with time-inconsistency: mean-variance portfolio management and optimal tracking portfolio under non-exponential discounting.


2606.11795 2026-06-11 eess.AS cs.SD 新提交

Tight Boundary Prediction in Speaker Diarization Using Causal-Anticausal Consistency

说话人日志中的紧边界预测：基于因果-反因果一致性

Shota Horiguchi, Marc Delcroix, Naohiro Tawara, Takanori Ashihara, Atsushi Ando

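AI summary: Addresses loose predicted boundaries caused by training on loosely annotated data. Causal and anticausal models generate tight pseudo labels, which a co-training scheme refines iteratively, recovering about 70% of the effect of tight-label training and improving downstream performance.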

Comments: Accepted to Interspeech 2026 (Long Paper Track)

AI中文摘要

多说话人对话自动语音识别数据常用于训练说话人日志模型。由于此类数据优先考虑语义连续性，语音段中包含停顿和边界余量，导致标注松散。在此类数据上训练的模型倾向于内化产生这种松散性的机制，尽管紧语音区间有时更适用于下游应用。本文解决了利用松散标签使模型产生紧预测的新任务。我们的方法使用因果和反因果模型生成更紧的伪标签，这些模型本质上无法学习松散行为。我们进一步提出了一种协同训练方案，迭代地收紧标签并更新两个模型以进行更渐进式的优化。实验结果表明，所提方法恢复了理想紧标签训练所实现的约70%的收紧效果，并提升了下游性能。

英文摘要

Multi-talker conversational automatic speech recognition data are often used to train speaker diarization models. Because such data prioritize semantic continuity, pauses and boundary margins are included within speech segments, resulting in loose annotations. Models trained on such data tend to internalize mechanisms that reproduce this looseness, although tight speech intervals are sometimes preferable for downstream applications. In this paper, we address the novel task of enabling models to produce tight predictions using loose labels. Our method generates tighter pseudo labels using causal and anticausal models, which are inherently incapable of learning loosening behavior. We further propose a co-training scheme that iteratively tightens labels and updates both models for more progressive refinement. Experimental results show that the proposed method recovers about 70% of the tightening effect achieved by ideal tight-label training and improves downstream performance.


2606.11794 2026-06-11 cs.LG cs.AI 新提交

Multimodal Ordinal Modeling of Alzheimer's Disease Severity Using Structural MRI and Clinical Data

使用结构MRI和临床数据的阿尔茨海默病严重程度的多模态序数建模

Boris-Stephan Rauchmann, Jonathan Laib, Buse Ercik, Robert Perneczky, Sergio Altares-López

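AI summary: Proposes an attention-enhanced multimodal ordinal regression framework that integrates MRI, demographic, and genetic data for automated, interpretable AD severity staging. Validated on ADNI, AIBL, and NIFD, the ordinal multimodal model achieves the best adjacent-stage accuracy (0.970) and agreement with clinical staging (QWK 0.549).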

Comments: 18 pages. Submitted to journal for review

AI中文摘要

神经退行性疾病如阿尔茨海默病（AD）需要准确且可扩展的工具来评估疾病严重程度，然而当前的临床分期仍然耗时且易变。我们提出了一种带有注意力增强的多模态机器学习框架，结合序数回归，用于自动且可解释的AD严重程度分期。该框架整合了T1加权MRI与人口统计学和遗传变量，并使用序数和非序数预测头比较了单模态和多模态架构。模型使用来自ADNI、AIBL和NIFD数据集的队列分层划分进行训练和验证。严格保留的测试集由排除在所有训练、验证、预处理和超参数调优过程之外的受试者构建，并在整个过程中采用受试者级划分以防止数据泄漏。在单模态方法中，T1加权MRI模型在相邻阶段准确率（0.963）和与临床分期的一致性（QWK 0.444）上略高于表格模型（QWK 0.433）。整合成像、人口统计学和遗传信息提高了整体性能。多模态非序数基线实现了最低的预测误差（MAE 0.340），而序数多模态模型实现了最高的相邻阶段准确率（0.970）和与临床分期的最强一致性（QWK 0.549）。这些发现表明，序数公式更好地捕捉了CDR量表的顺序结构，并产生与临床分期更一致的预测。使用Grad CAM++和SHAP的可解释性分析展示了解剖学和临床上合理的模型行为，支持透明决策。总体而言，基于注意力的多模态学习与序数回归代表了一种稳健、可解释且可扩展的方法，用于自动AD严重程度分期和AI辅助临床决策支持。

英文摘要

Neurodegenerative diseases such as Alzheimer's disease (AD) require accurate and scalable tools for assessing disease severity, yet current clinical staging remains time-intensive and prone to variability. We propose an attention-enhanced multimodal machine learning framework with ordinal regression for automated and interpretable AD severity staging. The framework integrates T1-weighted MRI with demographic and genetic variables and compares unimodal and multimodal architectures using ordinal and non-ordinal prediction heads. Models were trained and validated using cohort-stratified splits derived from the ADNI, AIBL, and NIFD datasets. A strictly held-out test set was constructed using subjects excluded from all training, validation, preprocessing, and hyperparameter tuning procedures, with subject-level splitting employed throughout to prevent data leakage. Among unimodal approaches, the T1-weighted MRI model achieved slightly higher adjacent-stage accuracy (0.963) and agreement with clinical staging (QWK 0.444) than the tabular model (QWK 0.433). Integrating imaging, demographic, and genetic information improved overall performance. The multimodal non-ordinal baseline achieved the lowest prediction error (MAE 0.340), whereas the ordinal multimodal model achieved the highest adjacent-stage accuracy (0.970) and strongest agreement with clinical staging (QWK 0.549). These findings indicate that ordinal formulations better capture the ordered structure of the CDR scale and yield predictions more consistent with clinical staging. Explainability analyses using Grad-CAM++ and SHAP demonstrated anatomically and clinically plausible model behavior, supporting transparent decision-making. Overall, attention-based multimodal learning with ordinal regression represents a robust, interpretable, and scalable approach for automated AD severity staging and AI-assisted clinical decision support.
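The three metrics quoted in this abstract (adjacent-stage accuracy, MAE, and quadratic weighted kappa, QWK) can be sketched as follows. This is a minimal illustration using the standard definitions, not the authors' evaluation code. The encoding of severity stages as integers 0 to 3 and the toy labels are assumptions.

```python
def adjacent_accuracy(y_true, y_pred):
    """Fraction of predictions within one stage of the true stage."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) <= 1)
    return hits / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average absolute distance between predicted and true stage."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """Cohen's kappa with quadratic disagreement weights.

    Undefined (division by zero) if every label and every prediction
    fall into the same single class.
    """
    n = len(y_true)
    observed = [[0.0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1.0
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n_classes))
                 for j in range(n_classes)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected counts under independence of the two marginals.
            weighted_expected += weight * hist_true[i] * hist_pred[j] / n
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical toy labels: four subjects, stages encoded as 0..3.
stages_true = [0, 1, 2, 3]
stages_pred = [1, 1, 3, 0]
acc = adjacent_accuracy(stages_true, stages_pred)
mae = mean_absolute_error(stages_true, stages_pred)
qwk = quadratic_weighted_kappa(stages_true, stages_pred, n_classes=4)
```

Because the quadratic weights penalize a three-stage error nine times more than a one-stage error, QWK can rank models differently from MAE, which is consistent with the abstract reporting different winners for the two metrics.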


2606.11783 2026-06-11 cs.CV 新提交

A Comprehensive Ecosystem for Open-Domain Customized Video Generation

开放域定制视频生成的综合生态系统

Jingxu Zhang, Yuqian Hong, Daneul Kim, Kai Qiu, Qi Dai, Jianmin Bao, Yifan Yang, Xiaoyan Sun, Chong Luo

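AI summary: Introduces the million-scale dataset PexelsCustom-1M and the parameter-efficient framework CustoMDiT, which achieves customized video generation with only 8% additional parameters, builds the 1,000+ category benchmark OpenCustom, and plans to open-source the whole ecosystem.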

Comments: 5 pages, 3 figures, 4 tables. Accepted by ICASSP 2026

AI中文摘要

近期视频生成的进展展示了令人印象深刻的视觉合成能力。然而，开放域定制视频生成仍然受到缺乏大规模、带标注的数据集来捕捉多样化的身份特定属性的限制。为了解决这个问题，我们引入了PexelsCustom-1M，这是第一个公开可用的百万级身份保持视频生成数据集，包含跨越8000多个类别的一百万个精心策划的<身份，文本，视频>三元组。利用这一点，我们提出了CustoMDiT，一个参数高效的框架，将预训练的多模态扩散Transformer适配为定制视频生成器，仅增加8%的可学习参数。我们的方法超越了先前的最先进技术。然而，像DreamBooth这样的基准只覆盖了100个类别，对于现实应用来说是不够的。为了克服这一点，我们构建了OpenCustom，一个新的包含1000多个类别的基准，通过ImageNet和MS-COCO的跨数据集知识融合创建。大量实验证实了我们的数据集和模型的优势。我们将开源整个生态系统——包括数据集、流水线、基准和实现——以支持进一步的研究。

英文摘要

Recent progress in video generation has shown impressive visual synthesis capabilities. However, open-domain customized video generation remains limited by the lack of large-scale, annotated datasets capturing diverse identity-specific attributes. To address this, we introduce PexelsCustom-1M, the first publicly available million-scale dataset for identity-preserving video generation, containing one million curated <identity, text, video> triplets across 8,000+ categories. Leveraging this, we propose CustoMDiT, a parameter-efficient framework that adapts a pretrained multimodal Diffusion Transformer into a customized video generator with only 8% additional learnable parameters. Our method surpasses prior state-of-the-art methods. However, benchmarks such as DreamBooth cover only 100 classes, which is insufficient for real-world applications. To overcome this, we construct OpenCustom, a new benchmark with 1,000+ categories, created via cross-dataset knowledge fusion from ImageNet and MS-COCO. Extensive experiments confirm the advantages of both our dataset and model. We will open-source the entire ecosystem (dataset, pipeline, benchmark, and implementations) to support further research.


2606.11780 2026-06-11 cs.IR cs.AI cs.IT 新提交

What Limits Does Quantization Place on Dense Top-$k$ Retrieval? A Theoretical Study

量化对密集Top-$k$检索的限制是什么？一项理论研究

Koki Okajima, Tsukasa Yoshida

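AI summary: Proves that under finite precision, the dimension required for perfect top-$k$ retrieval grows at least logarithmically with corpus size and that quantization precision has a threshold, with implications for practical system design.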

Comments: 9 pages, 2 figures

AI中文摘要

我们建立了将包含$N$个文档的语料库嵌入为$d$维向量的条件，使得每个$k$子集$S \subseteq [N]$都能通过某个查询向量的top-$k$检索实现。最近的研究表明，在$\mathbb{R}^d$中，$d = O(k)$足以存在这样的嵌入，与$N$无关。我们理论上证明，这种与语料库无关的界限是无限精度所特有的。当每个坐标使用$B$比特时，完美top-$k$检索需要$Bd = \Omega(k \ln N)$；因此，在任何固定精度下，维度必须至少随$N$对数增长。针对$\ell_2$归一化的$B$比特均匀标量量化模型，我们还确定了精度阈值$B^{*} = O(\ln \ln N)$，低于该阈值任何维度都不够，同时还有两个进一步限制可行$(B, d)$对的区域。我们的结果表明，在实际的向量数据库和密集检索系统中，由于量化是标准操作，嵌入维度和可能的精度必须随语料库大小增长。

英文摘要

We establish conditions for embedding a corpus of $N$ documents as $d$-dimensional vectors such that every $k$-subset $S \subseteq [N]$ is realizable as a result of top-$k$ retrieval by some query vector. Recent work shows that $d = O(k)$ suffices for such embeddings to exist in $\mathbb{R}^d$, independently of $N$. We theoretically prove that this corpus-independent bound is specific to infinite precision. With $B$ bits per coordinate, perfect top-$k$ retrieval requires $Bd = \Omega(k \ln N)$; thus, at any fixed precision, the dimension must grow at least logarithmically with $N$. Specializing to an $\ell_2$-normalized $B$-bit uniform scalar quantization model, we also identify a threshold on the precision $B^{*} = O(\ln \ln N)$ below which no dimension suffices, together with two further regimes that bound the feasible $(B, d)$ pairs. Our result implies that in practical vector databases and dense retrieval systems where quantization is standard, the embedding dimension and possibly the precision must grow with the corpus size.
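To get a feel for the scaling $Bd = \Omega(k \ln N)$, the sketch below evaluates the right-hand side for a few corpus sizes. The abstract does not give the constant hidden by the $\Omega$ notation, so the parameter `c` (defaulting to 1.0) is a placeholder assumption, and the function names are hypothetical. The numbers show the growth trend only and should not be read as the paper's actual bound.

```python
import math


def min_total_bits(k, n_docs, c=1.0):
    """Illustrative lower bound on B*d of the form c * k * ln(N).

    `c` stands in for the unknown constant hidden by the Omega notation.
    """
    return c * k * math.log(n_docs)


def min_dimension(k, n_docs, bits_per_coord, c=1.0):
    """Smallest integer d with bits_per_coord * d >= c * k * ln(N)."""
    return math.ceil(min_total_bits(k, n_docs, c) / bits_per_coord)


# At fixed precision (8 bits per coordinate) and k = 10, the required
# dimension grows with the logarithm of the corpus size.
dims = {n: min_dimension(10, n, 8) for n in (10**6, 10**9, 10**12)}
```

Under these assumptions, multiplying the corpus size by a thousand adds a fixed increment to the required dimension rather than multiplying it, which is the logarithmic growth the abstract describes.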


2606.11773 2026-06-11 math.OC cs.LG 新提交

Last-Iterate Convergence of Optimistic Multiplicative Weight Update

乐观乘性权重更新的最后迭代收敛性

Francesco Orabona

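AI summary: Proves that Optimistic Multiplicative Weight Update (OMWU) converges asymptotically in smooth convex-concave saddle-point problems with a sufficiently small constant learning rate, without requiring uniqueness, strict complementarity, error bounds, or initialization near a solution.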
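For readers unfamiliar with the update rule, here is a minimal sketch of OMWU in its commonly stated form, applied to a two-action bilinear zero-sum game. It illustrates the algorithm only and says nothing about the paper's proof. The payoff matrix (matching pennies), the learning rate 0.1, the starting points, and the function names are all assumptions chosen for the example.

```python
import math


def omwu_step(p, grad, prev_grad, eta):
    """One optimistic multiplicative-weights step on the simplex.

    Uses the optimistic loss estimate 2 * grad - prev_grad.
    """
    weights = [pi * math.exp(-eta * (2.0 * g - pg))
               for pi, g, pg in zip(p, grad, prev_grad)]
    total = sum(weights)
    return [w / total for w in weights]


def run_omwu(A, x, y, eta=0.1, steps=20000):
    """Run OMWU on min_x max_y x^T A y and return the last iterates."""
    n, m = len(A), len(A[0])

    def losses(x, y):
        # Loss vector of the minimizing player x, and of the maximizing
        # player y (negated payoff).
        gx = [sum(A[i][j] * y[j] for j in range(m)) for i in range(n)]
        gy = [-sum(A[i][j] * x[i] for i in range(n)) for j in range(m)]
        return gx, gy

    prev_gx, prev_gy = losses(x, y)  # convention: previous = initial
    for _ in range(steps):
        gx, gy = losses(x, y)
        x, y = omwu_step(x, gx, prev_gx, eta), omwu_step(y, gy, prev_gy, eta)
        prev_gx, prev_gy = gx, gy
    return x, y


# Matching pennies: the unique equilibrium is uniform play for both sides.
A = [[1.0, -1.0], [-1.0, 1.0]]
x_last, y_last = run_omwu(A, [0.9, 0.1], [0.2, 0.8])
```

Plain multiplicative weights cycles around the equilibrium on this game, so the last iterate approaching the uniform strategy is the behavior that distinguishes the optimistic variant.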

2606.11766 2026-06-11 eess.AS cs.AI cs.CL cs.SD 新提交

Fast Speech Foundation Model Distillation Using Interleaved Stacking

快速语音基础模型蒸馏使用交错堆叠

Eungbeom Kim, Kyogu Lee

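AI summary: Proposes interleaved stacking to accelerate distillation training of speech foundation models, addressing performance degradation by keeping layer positions consistent, and validates it on SUPERB.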

Comments: Accepted by Interspeech 2026

AI中文摘要

将大型语音基础模型（SFM）蒸馏为高效的学生模型已成功应用于低资源环境。尽管蒸馏减少了推理延迟，但它需要额外的学生模型训练。然而，SFM蒸馏的训练效率仍未得到充分探索。在这项工作中，我们探索了SFM蒸馏的训练加速以加快模型部署。我们研究了堆叠的潜力，其中模型深度通过训练逐步增加，直到达到目标模型深度。虽然现有的堆叠方法提高了训练速度，但它们遭受性能下降。为了解决这一限制，我们提出了交错堆叠，一种新颖的堆叠方法，在整个堆叠过程中始终保持层位置。这一特性在SFM中尤为关键，因为每一层编码了不同的层特定知识。我们在SUPERB上验证了所提方法的有效性。

英文摘要

Distilling a large speech foundation model (SFM) into an efficient student model has been successfully applied to low-resource environments. Although distillation reduces inference latency, it requires additional training of the student model. However, the training efficiency of SFM distillation remains underexplored. In this work, we explore training acceleration of SFM distillation to speed up model deployment. We examine the potential of stacking, in which the model depth is progressively increased through training until the target model depth is reached. While existing stacking methods improve training speed, they suffer from performance degradation. To handle this limitation, we propose interleaved stacking, a novel stacking method that consistently preserves layer position throughout the stacking process. This property is particularly critical in SFMs, in which each layer encodes distinct layer-specific knowledge. We validate the effectiveness of the proposed method on SUPERB.


2606.11745 2026-06-11 cs.CV cs.AI 新提交

From Prompts to Tokens: Internalizing Causal Supervision in Vision-Language Model for Multi-Image Causal Reasoning

从提示到标记：将因果监督内化到视觉-语言模型中进行多图像因果推理

Haoping Yu, Yuanxi Li, Jing Ma

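AI summary: Proposes BridgeVLM, which induces a causal graph from multi-image inputs, converts it into Causal Tokens, and processes them with layers injected into the LLM decoder for causal message passing, substantially improving multi-image causal reasoning.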

AI中文摘要

视觉因果推理对于理解和干预物理世界至关重要，需要从视觉输入中识别因果变量并推理干预效果。尽管最近取得了进展，大型视觉-语言模型（VLM）在此类任务上仍然脆弱，尤其是对于多图像输入上的干预和反事实查询。大多数现有探索通过文本提示注入因果知识，使因果机制外在于模型执行，限制了推理过程中的可靠控制。为了解决这个问题，我们提出了BridgeVLM，它通过从多图像输入中诱导因果图并将其转换为结构化的因果标记，由注入到LLM解码器中的RAMP层执行因果消息传递，从而内化视觉因果推理。我们进一步引入了一个统一的训练接口M3S，用于不同粒度（局部/全局级别）的细粒度因果监督。BridgeVLM在CausalVLBench的干预任务上达到了54.4%的准确率（而提示级监督为33.2%），在Causal3D上将结果从43.6%提升到49.0%，并在CausalVLBench上显著改善了因果结构学习（$F_1$：33.4% → 75.1%）。

英文摘要

Visual causal reasoning is essential for understanding and intervening in the physical world, requiring identification of causal variables from visual inputs and reasoning over intervention effects. Despite recent progress, large vision-language models (VLMs) remain brittle at such tasks, especially for interventional and counterfactual queries over multi-image inputs. Most existing explorations inject causal knowledge via textual prompts, leaving causal mechanisms external to model execution and limiting reliable control during inference. To address this problem, we propose BridgeVLM, which internalizes visual causal reasoning by inducing a causal graph from multi-image inputs and converting it into structured Causal Tokens executed by RAMP layers injected into the LLM decoder for causal message passing. We further introduce a unified training interface M3S for fine-grained causal supervision from different granularities (local/global level). BridgeVLM achieves 54.4% accuracy on intervention tasks on CausalVLBench (vs. 33.2% with prompt-level supervision), improves results on Causal3D from 43.6% to 49.0%, and substantially improves causal structure learning on CausalVLBench ($F_1$: 33.4% $\rightarrow$ 75.1%).


2606.11738 2026-06-11 stat.ML cs.LG 新提交

Renewable Lasso without Batch-Number Constraints: A Gradient-Enhanced Approach

无批次数量约束的可再生Lasso：一种梯度增强方法

Junzhuo Gao, Ling Peng, Xu Guo, Heng Lian

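AI summary: For online estimation of high-dimensional generalized linear models with streaming data, proposes a gradient-enhanced surrogate loss that removes the batch-number constraint, extends it to distributed streaming data, derives non-asymptotic error bounds, and shows improved accuracy in experiments.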

AI中文摘要

我们研究具有流数据的高维广义线性模型的在线估计。首先，针对非分布式设置，我们提出一种梯度增强替代损失函数，仅使用历史摘要近似累积损失，修改并改进了现有高维设置下同一模型的可再生估计方法，并消除了先前研究中的批次数量约束。然后，我们将该方法扩展到主从架构下的分布式流数据，其中批次按站点划分，仅交换摘要（梯度向量）。我们的调整方法不要求客户端计算完整的替代损失，而不是直接应用Jordan等人（2019）的流行方法到替代二次损失。我们在高维尺度下推导了非渐近误差界，没有先前研究中严格的批次数量约束。在线性和逻辑模型下的模拟结果以及实际数据应用表明，与现有的可再生估计器相比，精度有所提高。

英文摘要

We study online estimation for high-dimensional generalized linear models with streaming data. First, for the non-distributed setting, we propose a gradient-enhanced surrogate loss that approximates the cumulative loss using only historical summaries, which modifies and improves upon the existing renewable estimation approach for the same model in the high-dimensional setting, and removes the batch-number constraint in previous studies. We then extend the method to distributed streaming data under the master-client architecture, where batches are partitioned across sites and only summaries (gradient vectors) are exchanged. Instead of directly applying the popular method of Jordan et al. (2019) to the surrogate quadratic loss, our adjusted approach does not require the clients to compute the full surrogate loss. We derive non-asymptotic error bounds under the high-dimensional scaling, without the stringent constraint on the number of batches in the previous studies. Simulation results under linear and logistic models, together with a real-data application, show improved accuracy over existing renewable estimators.


2606.11737 2026-06-11 astro-ph.EP astro-ph.IM cs.LG 新提交

Machine-learning clustering of close-in exoplanet populations: links to pebble accretion

近地系外行星的机器学习聚类：与卵石吸积的联系

Yi Duann, Anders Johansen, Haiyang S. Wang, H. Jens Hoeijmakers

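AI summary: Applies Gaussian mixture models for unsupervised clustering of close-in exoplanets, revealing intrinsic sub-populations, and interprets differences in formation pathways through a pebble-accretion synthetic population.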

AI中文摘要

近地系外行星展现出由形成条件和迁移过程塑造的广泛轨道构型和物理性质。尽管种群合成模型预测了不同的行星种群，但在观测到的系外行星与合成种群之间建立定量联系仍然具有挑战性。我们使用物理驱动的动力学参数研究近地系外行星的内在组织，并将所得种群与卵石吸积形成路径联系起来。将两阶段高斯混合模型应用于观测到的近地系外行星样本，在由行星-恒星相互作用的动力学描述符主导的特征空间中进行无监督概率聚类。将所得聚类映射到统计驱动的三维参数空间中的卵石吸积合成种群。然后使用与形成相关的量（包括气体可用性、气体分数和冰岩质量比）来解释映射的种群。我们在不施加预定义分类边界的情况下识别出统计上支持的子群，包括超大质量气态巨行星、热巨行星、暖木星主导系统和低质量巨行星。映射的合成种群揭示了形成时间、气体吸积和固体增长历史的系统性差异。特别是，超大质量气态巨行星比热巨行星和暖木星主导种群更倾向于与更早的形成时期相关联。这些结果表明，物理驱动的机器学习方法可以为观测到的系外行星种群与理论行星形成路径之间的联系提供统计上稳健的框架。

英文摘要

Close-in exoplanets exhibit a wide range of orbital architectures and physical properties shaped by both formation conditions and migration processes. Although population-synthesis models predict distinct planetary populations, establishing a quantitative connection between observed exoplanets and synthetic populations remains challenging. We investigate the intrinsic organisation of close-in exoplanets using physically motivated dynamical parameters and connect the resulting populations to pebble-accretion formation pathways. A two-stage Gaussian mixture model (GMM) is applied to an observed sample of close-in exoplanets, performing unsupervised probabilistic clustering in a feature space dominated by dynamical descriptors of planet-star interactions. The resulting clusters are mapped onto a pebble-accretion synthetic population within a statistically motivated three-dimensional parameter space. Formation-related quantities, including gas availability, gas fraction, and ice-rock mass ratio, are then used to interpret the mapped populations. We identify statistically supported sub-populations without imposing predefined classification boundaries, including very-massive gas giants, hot giants, warm-Jupiter-dominated systems, and lower-mass giants. The mapped synthetic populations reveal systematic differences in formation timing, gas accretion, and solid growth histories. In particular, very-massive gas giants are preferentially associated with earlier formation epochs than hot-giant and warm-Jupiter-dominated populations. These results demonstrate that physically motivated machine-learning approaches can provide a statistically robust framework for linking observed exoplanet populations to theoretical planet formation pathways.
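As a rough illustration of the probabilistic clustering step, the sketch below fits a two-component Gaussian mixture to one-dimensional data with the EM algorithm. It is a single-stage, single-feature toy and not the two-stage pipeline or the feature space used in the paper. The data values, the initialization from the minimum and maximum, and the function names are assumptions.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(data, n_iter=100):
    """Fit a two-component 1D Gaussian mixture with EM.

    Returns (weights, means, variances, responsibilities).
    """
    overall_mean = sum(data) / len(data)
    overall_var = sum((x - overall_mean) ** 2 for x in data) / len(data)
    weights = [0.5, 0.5]
    means = [min(data), max(data)]          # simple deterministic init
    variances = [overall_var, overall_var]
    resp = []
    for _ in range(n_iter):
        # E-step: soft membership of each point in each component.
        resp = []
        for x in data:
            dens = [w * gaussian_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate parameters from the soft memberships.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, 1e-6)   # floor to avoid collapse
    return weights, means, variances, resp


# Hypothetical feature values forming two well-separated groups.
feature = [0.9, 1.0, 1.1, 1.0, 4.9, 5.0, 5.1, 5.0]
weights, means, variances, resp = fit_gmm_1d(feature)
```

The responsibilities in `resp` give each object a membership probability per component instead of a hard label, which is what allows sub-populations to be identified without predefined classification boundaries.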


2606.11710 2026-06-11 cs.CV 新提交

ERN-Net: Evolving Reason Node-Net for Document Binarization

ERN-Net: 用于文档二值化的演化推理节点网络

Hsin-Jui Pan, Sheng-Wei Chan, Jen-Shiung Chiang

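AI summary: Proposes ERN-Net, which enhances degradation-sensitive regions through evolving reason nodes and multi-scale reasoning, combined with a ConvNeXt-Tiny backbone and DIBCO pretraining, for efficient document binarization with little data and low memory.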

2606.11698 2026-06-11 cs.CR cs.AI 新提交

T2S: A Rehearsal-Based Approach for Extraction-Resistant Model Watermarking

T2S：一种基于排练的防提取模型水印方法

Jian-Ping Mei, Weibin Zhang, Ao Yao, Tiantian Zhu, Jie Xiao

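AI summary: To counter model extraction attacks, proposes a rehearsal-based watermark embedding framework that simulates the extraction process and uses the simulated stolen model's loss on a trigger set to fine-tune the watermark knowledge, improving watermark transferability and robustness.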

AI中文摘要

模型水印通过嵌入独特知识来诱导独特行为特征，从而保护AI模型的知识产权。主要技术挑战在于确保水印对水印模型的各种后处理攻击具有鲁棒性。模型提取攻击是最严重的威胁，攻击者利用预测输出训练替代模型，非法复制原始模型的功能。在这项工作中，我们提出了一种基于排练的水印嵌入框架，以增强模型水印对模型提取攻击的鲁棒性。通过模拟提取过程，我们的方法利用模拟被盗模型在触发集上的损失作为训练信号，微调目标模型中的水印知识。这个微调步骤鼓励水印以增强可迁移性的方式嵌入，从而增加其在被盗模型中持续存在并保持可检测的机会。在不同设置下进行的全面实验表明，所提出的方法显著提高了模型水印对模型提取和后续水印移除攻击的鲁棒性。

英文摘要

Model watermarking safeguards AI model intellectual property by embedding distinctive knowledge that induces unique behavioral signatures. The primary technical challenge lies in ensuring watermark robustness against various post-processing attacks on the watermarked model. Model extraction attacks emerge as the most severe threat, where adversaries exploit prediction outputs to train surrogate models that illegally replicate the original model's functionality. In this work, we propose a rehearsal-based watermark embedding framework to enhance the robustness of model watermarks against model extraction attacks. By simulating the extraction process, our method leverages the loss of a simulated stolen model on a trigger set as a training signal to fine-tune the watermark knowledge within the target model. This fine-tuning step encourages the watermark to be embedded in a way that boosts transferability, thereby increasing its chances of persisting and remaining detectable in stolen models. Comprehensive experiments conducted under diverse settings demonstrate that the proposed method significantly improves the robustness of model watermarks against both model extraction and subsequent watermark removal attacks.


2606.11687 2026-06-11 cs.CV cs.LG cs.RO 新提交

DroneShield-AI: A Multi-Modal Sensor Fusion Framework for Real-Time Autonomous Drone Threat Detection, Behavioral Intent Classification, and Swarm Intelligence in Contested Airspace

DroneShield-AI：一种用于受争议空域中实时自主无人机威胁检测、行为意图分类和群体智能的多模态传感器融合框架

Marius Bayizere

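AI summary: Proposes the DroneShield-AI framework, which integrates six processing layers including RF signal classification, acoustic detection, and YOLOv8 visual detection. A Behavioral Intent Classification Engine (BICE) provides six-class threat classification with 30 seconds of advance warning, and a Graph Neural Network Swarm Intelligence Module (GNN-SIM) analyzes multi-drone formations, reaching 96.1% detection accuracy and 142 ms latency on low-cost hardware.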

Comments: 23 pages, 6 figures, 11 tables. Code available at this https URL

AI中文摘要

无人机（UAV）威胁已成为21世纪定义性的安全挑战。本文提出DroneShield-AI，一个统一的开放框架，集成了六个处理层：RF信号分类、声学电机特征检测、基于YOLOv8的视觉检测、证据加权传感器融合、行为意图分类引擎（BICE）和图神经网络群体智能模块（GNN-SIM）。BICE首次引入了针对无人机飞行模式的系统性六类威胁分类法，能够提前30秒发出预测性操作员警报。GNN-SIM是首个用于对抗性多无人机编队分析的开放框架，采用图注意力网络。在三个公开的真实世界数据集上评估，融合流水线在约500-780美元总系统成本的商用CPU级硬件上实现了96.1%的检测准确率、3.2%的误报率、AUC-ROC：0.981以及142ms的端到端延迟。所有代码、模型权重和仿真数据集在提交时公开发布。

英文摘要

Unmanned Aerial Vehicle (UAV) threats have emerged as a defining security challenge of the 21st century. This paper presents DroneShield-AI, a unified open framework integrating six processing layers: RF signal classification, acoustic motor-signature detection, YOLOv8-based visual detection, evidence-weighted sensor fusion, a Behavioral Intent Classification Engine (BICE), and a Graph Neural Network Swarm Intelligence Module (GNN-SIM). BICE introduces the first systematic six-class threat taxonomy for drone flight patterns, enabling predictive operator alerts with a 30-second advance-warning horizon. GNN-SIM is the first open framework for adversarial multi-drone formation analysis using Graph Attention Networks. Evaluated on three publicly available real-world datasets, the fused pipeline achieves 96.1% detection accuracy, a 3.2% false alarm rate, an AUC-ROC of 0.981, and 142 ms end-to-end latency on commodity CPU-class hardware at approximately $500-$780 USD total system cost. All code, model weights, and simulation datasets are publicly released at submission.


2606.11683 2026-06-11 cs.CV cs.AI 新提交

Reason, Then Re-reason: Cross-view Revisiting Improves Spatial Reasoning

推理，再推理：跨视角重访提升空间推理

Chaofan Ma, Zhenjie Mao, Yuhuan Yang, Fanqin Zeng, Yue Shi, Yingjie Zhou, Xiaofeng Cao, Jiangchao Yao

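AI summary: Proposes the ReRe framework, in which an MLLM first reasons and then verifies its answer against generated complementary novel-view videos, substantially improving spatial reasoning without training.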

Comments: ICML 2026

AI中文摘要

从自我中心视频进行空间推理本质上是具有挑战性的，因为可观察的证据受到相机轨迹的限制。现有方法依赖单轮推理，迫使模型通过语义先验而非可验证证据来解决几何歧义。我们认为空间推理应该是可重访的：在有限证据下形成的结论在获得互补视角时应保持开放以进行修正。基于这一见解，我们提出“推理，再推理”（ReRe），一种无需训练、推理时框架，包含两个阶段：在推理阶段，MLLM从原始视频形成空间假设；在再推理阶段，它通过观察合成的新视角视频来验证或修正假设。为了实现有效的跨视角重访，我们设计了一个几何到视频的流水线，从预测的3D几何中渲染出策略性互补的新视角。这些视角具有升高的、倾斜的视角，覆盖整个场景，同时保持MLLM的原生视频接口，无需架构修改。在VSI-Bench和STI-Bench上的广泛评估表明，ReRe显著提升开源MLLM，使其与专有最先进性能相媲美。项目页面：此https URL

英文摘要

Spatial reasoning from egocentric videos is inherently challenging because the observable evidence is constrained by the camera trajectory. Existing methods rely on single-turn inference, forcing models to resolve geometric ambiguity through semantic priors rather than verifiable evidence. We argue that spatial reasoning should be revisitable: conclusions formed under limited evidence should remain open to revision when complementary viewpoints become available. Building on this insight, we propose Reason, then Re-reason (ReRe), a training-free, inference-time framework with two phases: in the Reason Phase, an MLLM forms a spatial hypothesis from the original video; in the Re-reason Phase, it verifies or revises the hypothesis by observing a synthesized novel-view video. To enable effective cross-view revisiting, we design a Geometry-to-Video pipeline that renders strategically complementary novel views from predicted 3D geometry. These views feature an elevated, oblique perspective with scene-spanning coverage, while preserving the MLLM's native video interface without architectural modifications. Extensive evaluations on VSI-Bench and STI-Bench demonstrate that ReRe substantially boosts open-source MLLMs to rival proprietary state-of-the-art performance. Project page: this https URL
