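arXivDaily: a daily academic digest synchronized with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering and systems science.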

2605.00330 2026-06-17 cs.LG Updated version

Conformalized Quantum DeepONet Ensembles for Scalable Operator Learning with Distribution-Free Uncertainty

Purav Matlia, Christian Moya, Guang Lin

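AI summary: This paper proposes a framework that combines quantum orthogonal neural networks with adaptive conformal prediction to address quadratic inference complexity and uncertainty quantification in operator learning for high-dimensional dynamical systems, and compresses multiple ensemble members into a single circuit so that they execute simultaneously.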

English abstract

Operator learning enables fast surrogate modeling of high-dimensional dynamical systems, but existing approaches face two fundamental limitations: quadratic inference complexity and unreliable uncertainty quantification in safety-critical settings. We propose Conformalized Quantum DeepONet Ensembles, a framework that addresses both challenges simultaneously. By leveraging Quantum Orthogonal Neural Networks (QOrthoNNs), we reduce operator inference complexity from O(n^2) to O(n), enabling scalable evaluation over fine discretizations. To provide rigorous uncertainty quantification, we combine ensemble-based epistemic modeling with adaptive conformal prediction, yielding distribution-free coverage guarantees. A key challenge in ensembling is that naive parallelism scales hardware resources linearly with the number of models. We resolve this by using Superposed Parameterized Quantum Circuits (SPQCs), which compress multiple ensemble members into a single circuit and enable simultaneous multi-model execution. Experiments on synthetic partial differential equations and real-world power system dynamics demonstrate that our approach achieves accurate predictions while maintaining calibrated uncertainty under realistic quantum noise. These results establish a practical pathway toward scalable, uncertainty-aware operator learning in quantum machine learning.
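
The distribution-free coverage guarantee mentioned in the abstract can be illustrated with plain split conformal prediction. This is a minimal sketch, not the paper's method: the abstract describes an adaptive variant on quantum hardware, whereas the code below uses the simpler split procedure on a classical toy ensemble. The names `conformal_quantile`, `ensemble_mean`, `conformal_interval`, and the toy data are all hypothetical. Under exchangeability of calibration and test points, the resulting interval covers the true value with probability at least 1 - alpha.

```python
import math

def conformal_quantile(scores, alpha):
    """Return the score at rank ceil((n + 1) * (1 - alpha)) among n calibration scores."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed rank in the sorted scores
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[rank - 1]

def ensemble_mean(members, x):
    """Point prediction: average of all ensemble members (an assumption of this sketch)."""
    return sum(f(x) for f in members) / len(members)

def conformal_interval(members, calibration, x, alpha=0.1):
    """Symmetric interval around the ensemble mean using absolute-residual scores."""
    scores = [abs(y - ensemble_mean(members, xc)) for xc, y in calibration]
    q = conformal_quantile(scores, alpha)
    centre = ensemble_mean(members, x)
    return centre - q, centre + q

# Hypothetical toy ensemble: three slightly different linear surrogates.
members = [lambda x: 2.0 * x, lambda x: 2.1 * x, lambda x: 1.9 * x]
# Hypothetical held-out calibration pairs (x, y) with residuals of about 0.5.
calibration = [(float(i), 2.0 * i + (0.5 if i % 2 else -0.5)) for i in range(1, 20)]
low, high = conformal_interval(members, calibration, x=3.0, alpha=0.1)
```

The interval width depends only on the calibration residuals, which is why no assumption about the noise distribution is needed.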

2604.18701 2026-06-17 cs.LG cs.AI stat.ML Updated version

Curiosity-Critic: Cumulative Prediction Error Improvement as a Tractable Intrinsic Reward for World Model Training

Vin Bhaskara, Haicheng Wang

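AI summary: Proposes Curiosity-Critic, which uses a tractable per-step surrogate (the difference between the current prediction error and an asymptotic error baseline) as the intrinsic reward. A co-trained critic estimates the error baseline online, which separates reducible from irreducible prediction error; in stochastic grid-world experiments the method outperforms existing approaches.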

Comments Accepted to ICML 2026 Workshop on Epistemic Intelligence in Machine Learning (EIML@ICML 2026). Code: https://github.com/vinbhaskara/Curiosity-Critic

English abstract

Local prediction-error-based curiosity rewards focus on the current transition without considering the world model's cumulative prediction error across all visited transitions. We introduce Curiosity-Critic, which grounds its intrinsic reward in the improvement of this cumulative objective, and show that it admits a tractable per-step surrogate: the difference between the current prediction error and the asymptotic error baseline of the current state transition. We estimate this error baseline online with a learned critic co-trained alongside the world model; since the critic only has to learn how hard a transition is to predict, its estimate of the irreducible noise floor converges well before the world model saturates, redirecting exploration toward learnable transitions. The reward is higher for learnable transitions and collapses toward zero for stochastic ones, thereby separating epistemic (reducible) from aleatoric (irreducible) prediction error online. Prior prediction-error curiosity formulations, from Schmidhuber (1991) to learned-feature-space variants, emerge as special cases corresponding to specific approximations of this error baseline. Experiments on a stochastic grid world show that Curiosity-Critic outperforms prediction-error, visitation-count, and Random Network Distillation methods in training speed and final world model accuracy.
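
The per-step surrogate described in the abstract is a simple difference, which the following sketch makes concrete. The function name `intrinsic_reward` and all numbers are hypothetical; how the critic is trained to produce the baseline estimate is not specified in the abstract and is not modelled here.

```python
def intrinsic_reward(prediction_error, baseline_estimate):
    """Per-step surrogate: current prediction error minus the estimated asymptotic error."""
    return prediction_error - baseline_estimate

# Hypothetical values early in training. Both transitions currently have a similar
# prediction error, but the critic expects only the first one to become predictable.
learnable = intrinsic_reward(prediction_error=0.80, baseline_estimate=0.05)
noisy = intrinsic_reward(prediction_error=0.82, baseline_estimate=0.80)
```

A reward based on raw prediction error would rank these two transitions almost equally; subtracting the baseline leaves a large reward for the learnable transition and a reward near zero for the stochastic one.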

2604.24357 2026-06-17 cs.LG cs.AI Updated version

DPRM: A Plug-in Doob h-transform-induced Token-Ordering Module for Diffusion Language Models

Dake Bu, Wei Huang, Andi Han, Hau-San Wong, Qingfu Zhang, Taiji Suzuki, Atsushi Nitanda

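AI summary: Proposes DPRM, a module that improves token ordering in diffusion language models by shifting gradually from confidence-driven ordering to process-reward-guided ordering through online estimates; evaluated across nine hosts, with gains in several settings.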

English abstract

Diffusion language models generate without a fixed left-to-right order, leaving token ordering as a central algorithmic choice. Existing systems mainly use random masking or confidence-driven ordering, which respectively suffer from train-test mismatch and myopic exploration. We introduce DPRM (Doob h-transform Process Reward Model), a plug-in token-ordering module that keeps the host architecture, denoising objective and supervision unchanged, and modifies only the ordering policy. DPRM starts from confidence-driven ordering and gradually shifts to process-reward-guided ordering through online estimates. We characterize the exact DPRM policy as a reward-tilted Gibbs reveal law, prove convergence of its stagewise Soft-BoN approximation, show that the online bucketized controller tracks the exact DPRM score at empirical-Bernstein rates, and establish a sample-complexity advantage under tractable optimization assumptions. Across nine hosts covering language reasoning, test-time scaling, protein, single-cell, molecular, DNA, text-to-image generation, and VQA, DPRM order variants improve several language, DNA, and multimodal settings while also identifying boundary cases where confidence-only ordering or task-specific utilities are preferable. Code is available at: https://github.com/DakeBU/DPRM-DLLM

2604.22128 2026-06-17 cs.CL cs.LG Updated version

Dissociating Decodability and Causal Use in Bracket-Sequence Transformers

Aryan Sharma, Cutter Dawes, Shivam Raval

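AI summary: Using probing and intervention experiments, finds that hierarchical representations in Dyck-language transformers are decodable, but only attention to the top-of-stack position has a causal effect on long-distance accuracy.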

English abstract

When trained on tasks requiring an understanding of hierarchical structure, transformers have been found to represent this hierarchy in distinct ways: in the geometry of the residual stream, and in stack-like attention patterns maintaining a last-in, first-out ordering. However, it remains unclear whether these representations are causally used or merely decodable. We examine this gap in transformers trained on the Dyck language (a formal language of balanced bracket sequences), where the hierarchical ground truth is explicit. By probing and intervening on the residual stream and attention patterns, we find that depth, distance, and top-of-stack signals are all decodable, yet their causal roles diverge. Specifically, masking attention to the true top-of-stack position causes a sharp drop in long-distance accuracy, while ablating low-dimensional residual stream subspaces has comparatively little effect. These results, which extend to a templated natural language setting, suggest that even in a controlled setting where the relevant hierarchical variables are known, decodability alone does not imply causal use.
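
The hierarchical ground truth that the abstract calls explicit for the Dyck language can be computed with an ordinary stack. The sketch below is illustrative only: the function name `dyck_annotations`, the two bracket types, and the choice to report depth and top-of-stack position after each token are assumptions, not the paper's setup.

```python
PAIRS = {")": "(", "]": "["}  # closing bracket -> matching opening bracket

def dyck_annotations(sequence):
    """Return (depth, top_of_stack_index) after each token, or None if unbalanced.

    depth is the number of currently open brackets; top_of_stack_index is the
    position of the most recent unmatched opening bracket (None when empty).
    """
    stack = []  # positions of unmatched opening brackets, last-in first-out
    annotations = []
    for position, token in enumerate(sequence):
        if token in PAIRS.values():
            stack.append(position)
        elif token in PAIRS:
            if not stack or sequence[stack[-1]] != PAIRS[token]:
                return None  # closing bracket does not match the top of the stack
            stack.pop()
        else:
            return None  # not a bracket
        annotations.append((len(stack), stack[-1] if stack else None))
    return annotations if not stack else None

labels = dyck_annotations("([])")
```

For the string `([])`, the top-of-stack index after the inner `]` points back to position 0, the still-open `(`. Signals of this kind are what a probe would try to decode and what an attention-masking intervention would target.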

2604.19762 2026-06-17 cs.CL Updated version

Evidence of Layered Positional and Directional Constraints in the Voynich Manuscript: Implications for Cipher-Like Structure

Christophe Parisel

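AI summary: Analyzes the grapheme sequences of the Voynich Manuscript and finds a two-layer structure: right-to-left optimization within words and left-to-right dependency at word boundaries, a directional dissociation absent from the four comparison languages. Neither of the two tested generator classes satisfies all four signature criteria simultaneously, suggesting cipher-like structural constraints that are hard to reproduce with simple positional or frequency-based mechanisms.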

English abstract

The Voynich Manuscript (VMS) exhibits a script of uncertain origin whose grapheme sequences have resisted linguistic analysis. We present a systematic analysis of its grapheme sequences, revealing two complementary structural layers: a character-level right-to-left optimization in word-internal sequences and a left-to-right dependency at word boundaries, a directional dissociation not observed in any of our four comparison languages (English, French, Hebrew, Arabic). We further evaluate two classes of structured generator against a four-signature joint criterion: a parametric slot-based generator and a Cardan grille implementing Rugg's (2004) gibberish hypothesis. Across their full tested parameter spaces, neither class reproduces all four signatures simultaneously. While these results do not rule out generator classes we have not tested, they provide the first quantitative benchmarks against which any future generative or cryptanalytic model of the VMS can be evaluated, and they suggest that the VMS exhibits cipher-like structural constraints that are difficult to reproduce from simple positional or frequency-based mechanisms alone.

2604.13899 2026-06-17 cs.CL cs.AI Updated version

Do We Still Need Humans in the Loop? Comparing Human and LLM Annotation in Active Learning for Hostility Detection

Ahmad Dawar Hakimi, Lea Hirlimann, Isabelle Augenstein, Hinrich Schütze

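AI summary: Compares LLM and human annotation in active learning and finds that LLM annotation is cheaper and yields better classifiers, while active learning offers no reliable advantage over random sampling under LLM annotation.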

English abstract

Instruction-tuned LLMs can annotate thousands of instances at low cost. This raises two questions for active learning (AL): can LLM labels replace human labels within the AL loop, and does AL remain necessary when entire corpora can be cheaply labeled? We investigate both on a new dataset of 277,902 German political TikTok comments (25,974 LLM-labeled, 5,000 human-annotated), comparing LLM and human annotation across seven conditions, four encoders, and 10 random seeds. Under a two-question interface that mirrors the human annotation task, LLM annotation at scale outperforms human-supervised classifiers at roughly one-tenth the cost ($28 for GPT-5.2 Batch API vs. $316 for Prolific). The advantage holds for both a closed-source (GPT-5.2) and an open-weight (Qwen3.5-122B-10B) LLM, is robust under soft-label evaluation, and is unlocked specifically by the two-question decomposition; a holistic single-prompt baseline only ties with human supervision. AL provides no reliable advantage over random sampling under either LLM annotator. However, error structure varies sharply: only GPT-5.2 under the two-question interface produces classifiers with near-human FP/FN balance, while other LLM variants over-flag border-control and economic competition discourse. We release the dataset and code.

2601.19792 2026-06-17 cs.CL cs.AI cs.HC Updated version

LVLMs and Humans Ground Differently in Referential Communication

Peter Zeng, Weiling Li, Amie Paige, Zhengxiang Wang, Panagiotis Kaliosis, Dimitris Samaras, Gregory Zelinsky, Susan Brennan, Owen Rambow

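AI summary: Through multi-turn referential communication experiments with human and AI pairings, finds that LVLMs cannot use common ground to generate and resolve referring expressions the way humans do, which leads to poor communication.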

Comments 27 pages, 16 figures

2604.17616 2026-06-17 cs.LG Updated version

Conditional Attribution for Root Cause Analysis in Time-Series Anomaly Detection

Shashank Mishra, Karan Patil, Cedric Schockaert, Didier Stricker, Jason Rambach

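AI summary: Proposes a conditional attribution framework that retrieves normal instances contextually similar to the anomalous observation to produce dependency-preserving explanations, uses variational autoencoder latent spaces and UMAP manifold embeddings for efficient attribution on high-dimensional time series, and improves root-cause identification accuracy and robustness on the SWaT and MSDS benchmarks.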

Comments Accepted at ECML PKDD. 16 pages, 8 figures, 13 tables, and an appendix

Journal ref: ECML PKDD 2026

English abstract

Root cause analysis (RCA) for time-series anomaly detection is critical for the reliable operation of complex real-world systems. Existing explanation methods often rely on unrealistic feature perturbations and ignore temporal and cross-feature dependencies, leading to unreliable attributions. We propose a conditional attribution framework that explains anomalies relative to contextually similar normal system states. Instead of using marginal or randomly sampled baselines, our method retrieves representative normal instances conditioned on the anomalous observation, enabling dependency-preserving and operationally meaningful explanations. To support high-dimensional time-series data, contextual retrieval is performed in learned low-dimensional representations using both variational autoencoder latent spaces and UMAP manifold embeddings. By grounding the retrieval process in the system's learned manifold, this strategy avoids out-of-distribution artifacts and ensures attribution fidelity while maintaining computational efficiency. We further introduce confidence-aware and temporal evaluation metrics for assessing explanation reliability and responsiveness. Experiments on the SWaT and MSDS benchmarks demonstrate that the proposed approach consistently improves root-cause identification accuracy, temporal localization, and robustness across multiple anomaly detection models. These results highlight the practical utility of conditional attribution for explainable anomaly diagnosis in complex time-series systems. Code and models are available at: https://github.com/dfki-av/Conditional-Attribution-for-Root-Cause-Analysis-in-Time-Series-Anomaly-Detection.

2603.18104 2026-06-17 cs.AI cs.DC cs.LG cs.NE Updated version

Adaptive Domain Models: Bayesian Evolution, Warm Rotation, and Principled Training for Geometric and Neuromorphic AI

Houston Haynes

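AI summary: Proposes an alternative training architecture built on a dimensional type system, the Program Hypergraph, and the b-posit bounded-regime design, achieving depth-independent training memory, exact gradient accumulation, and grade-preserving updates; introduces Bayesian distillation and warm rotation to support continuous adaptation and verifiable correctness of domain-specific models.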

Comments 32 pages, 3 figures

English abstract

Prevailing AI training assumes reverse-mode automatic differentiation over IEEE-754 arithmetic. The memory overhead of training relative to inference, optimizer complexity, and structural degradation of geometric properties through training are consequences of this arithmetic substrate. This paper develops an alternative training architecture grounded in three prior results: the Dimensional Type System and Deterministic Memory Management framework (Haynes 2026), which establishes stack-eligible gradient allocation and exact quire accumulation as design-time verifiable properties; the Program Hypergraph (Haynes 2026), which establishes grade preservation through geometric algebra computations as a type-level invariant; and the b-posit bounded-regime design (Jonnalagadda et al. 2025), which makes posit arithmetic tractable across hardware targets conventionally considered inference-only. Their composition enables depth-independent training memory bounded to approximately twice the inference footprint, grade-preserving weight updates, and exact gradient accumulation, applicable uniformly to loss-function-optimized and spike-timing-dependent neuromorphic models. We introduce *Bayesian distillation*, a mechanism by which the latent prior structure of a general-purpose model is extracted through the ADM training regime, resolving the data-scarcity bootstrapping problem for domain-specific training. For deployment, we introduce *warm rotation*, an operational pattern in which an updated model transitions into an active inference pathway without service interruption, with correctness formalized through PHG certificates and signed version records. The result is a class of domain-specific AI systems that are smaller and more precise than general-purpose models, continuously adaptive, verifiably correct with respect to the physical structure of their domains, and initializable from existing models.

2505.00986 2026-06-17 cs.LG cs.CV Updated version

EmbodiTTA: Resource-Efficient Test-Time Adaptation for Embodied Visual Systems

Xiao Ma, Young D. Kwon, Dong Ma

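AI summary: Proposes OD-TTA, an on-demand test-time adaptation paradigm that combines lightweight domain-shift detection, source-domain selection, and decoupled batch-normalization updates to adapt efficiently and accurately on edge devices while substantially reducing computation and energy overhead.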

English abstract

Continual test-time adaptation (CTTA) continuously adapts the deployed model on every incoming batch of data. While achieving optimal accuracy, existing CTTA approaches present poor real-world applicability on resource-constrained edge devices, due to the substantial memory overhead and energy consumption. In this work, we first introduce a novel paradigm, on-demand TTA, which triggers adaptation only when a significant domain shift is detected. Then, we present OD-TTA, an on-demand TTA framework for accurate and efficient adaptation on edge devices. OD-TTA comprises three innovative techniques: 1) a lightweight domain shift detection mechanism to activate TTA only when it is needed, drastically reducing the overall computation overhead, 2) a source domain selection module that chooses an appropriate source model for adaptation, ensuring high and robust accuracy, 3) a decoupled Batch Normalization (BN) update scheme to enable memory-efficient adaptation with small batch sizes. Extensive experiments show that OD-TTA achieves comparable or even better performance while reducing the energy and computation overhead remarkably, making TTA a practical reality.

2604.03444 2026-06-17 cs.LG cs.CL Updated version

Olmo Hybrid: From Theory to Practice and Back

William Merrill, Yanhong Li, Tyler Romero, Anej Svete, Caia Costello, Pradeep Dasigi, Dirk Groeneveld, David Heineman, Bailey Kuehl, Nathan Lambert, Chuan Li, Kyle Lo, Saumya Malik, DJ Matusz, Benjamin Minixhofer, Jacob Morrison, Luca Soldaini, Finbarr Timbers, Pete Walsh, Noah A. Smith, Hannaneh Hajishirzi, Ashish Sabharwal

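AI summary: Through theoretical analysis and experiments, shows that hybrid models (combining attention with linear RNNs) surpass pure transformers in expressivity and scaling efficiency, and trains the 7B-parameter Olmo Hybrid model, which outperforms Olmo 3 on standard evaluations.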

Comments Corrected author list and typos in appendix

English abstract

Recent work has demonstrated the potential of non-transformer language models, especially linear recurrent neural networks (RNNs) and hybrid models that mix recurrence and attention. Yet there is no consensus on whether the potential benefits of these new architectures justify the risk and effort of scaling them up. To address this, we provide evidence for the advantages of hybrid models over pure transformers on several fronts. First, theoretically, we show that hybrid models do not merely inherit the expressivity of transformers and linear RNNs, but can express tasks beyond both, such as code execution. Putting this theory to practice, we train Olmo Hybrid, a 7B-parameter model largely comparable to Olmo 3 7B but with the sliding window layers replaced by Gated DeltaNet layers. We show that Olmo Hybrid outperforms Olmo 3 across standard pretraining and mid-training evaluations, demonstrating the benefit of hybrid models in a controlled, large-scale setting. We find that the hybrid model scales significantly more efficiently than the transformer, explaining its higher performance. However, it is unclear why greater expressivity on specific formal problems should result in better scaling or superior performance on downstream tasks unrelated to those problems. To explain this apparent gap, we return to theory and argue why increased expressivity should translate to better scaling efficiency, completing the loop. Overall, our results suggest that hybrid models mixing attention and recurrent layers are a powerful extension to the language modeling paradigm: not merely to reduce memory during inference, but as a fundamental way to obtain more expressive models that scale better during pretraining.

2603.18492 2026-06-17 cs.LG Updated version

AIMER: Calibration-Free Task-Agnostic MoE Expert Pruning

Zongfang Liu, Guangyi Chen, Shengkun Tang, Yifan Shen, Huan Wang, Xin Yuan

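AI summary: Proposes AIMER, which identifies distinct experts from the concentration pattern of expert weights, enabling calibration-free, task-agnostic MoE expert pruning that outperforms existing methods on 7B to 47B models.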

English abstract

Mixture-of-Experts (MoE) language models increase parameter capacity without proportional per-token computation, yet deployment still requires storing the full expert pool, making expert pruning important for reducing memory and serving overhead. Existing task-agnostic expert-pruning methods are typically calibration-dependent: they estimate expert importance from routing or activation statistics on a calibration set, making pruning decisions sensitive to calibration-data variation while introducing substantial preprocessing cost. We propose AIMER (Absolute mean over root mean square IMportance for Expert Ranking), a simple calibration-free criterion that identifies more distinct experts by capturing the concentration pattern of expert weights, making it well suited for task-agnostic expert pruning. Across 7B to 47B MoE language models with distinct architectures and 16 diverse benchmarks, AIMER consistently delivers stronger capability balance across diverse tasks than existing calibration-free methods. Surprisingly, AIMER also achieves better balance than strong calibration-based expert-pruning baselines calibrated on the widely used task-agnostic C4 corpus, while requiring only 0.22 to 2.06 seconds to score all experts.
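
The expansion of the AIMER acronym suggests a score of the form mean absolute weight divided by root mean square weight. The sketch below computes that ratio on flat weight lists. It is an illustration under stated assumptions: the function name `abs_mean_over_rms` and the toy experts are hypothetical, and the abstract does not say how weights are aggregated per expert or whether high or low scores are pruned.

```python
import math

def abs_mean_over_rms(weights):
    """Mean absolute value divided by root mean square of a flat weight list."""
    n = len(weights)
    abs_mean = sum(abs(w) for w in weights) / n
    rms = math.sqrt(sum(w * w for w in weights) / n)
    return abs_mean / rms if rms > 0 else 0.0

# Hypothetical experts: one with evenly spread weights, one dominated by a single weight.
spread = [0.5, -0.5, 0.5, -0.5]
concentrated = [2.0, 0.0, 0.0, 0.0]
scores = {
    "spread": abs_mean_over_rms(spread),
    "concentrated": abs_mean_over_rms(concentrated),
}
```

The ratio lies in (0, 1] for any non-zero weight vector and equals 1 only when all weights have the same magnitude, so it measures how concentrated the weight mass is. Computing it needs only the weights themselves, which is consistent with the calibration-free claim.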

2506.18831 2026-06-17 cs.CL Updated version

Adaptive Activation Steering for Efficient LLM Reasoning via Closed-Loop PID Control

Aryasomayajula Ram Bharadwaj

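AI summary: Proposes PID-steering, which uses a PID controller to adjust activation-steering strength dynamically based on a chunk-level redundancy classifier, improving accuracy while reducing inference overhead.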
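
As background for the closed-loop idea in the title, here is a textbook discrete PID controller. It is a generic sketch, not the paper's controller: the class name `PID`, the gains, the target value, and the use of a redundancy score as the measured signal are hypothetical stand-ins.

```python
class PID:
    """Textbook discrete PID controller with a fixed time step."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.previous_error = None

    def step(self, error, dt=1.0):
        """Return the control output for the current error."""
        self.integral += error * dt
        if self.previous_error is None:
            derivative = 0.0  # no derivative term on the first step
        else:
            derivative = (error - self.previous_error) / dt
        self.previous_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Hypothetical loop: the measured redundancy score falls toward the target,
# and the controller output stands in for the steering strength.
controller = PID(kp=1.0, ki=0.5, kd=0.1)
target_redundancy = 0.2
measured = [0.8, 0.6, 0.2]
strengths = [controller.step(r - target_redundancy) for r in measured]
```

The output shrinks as the error closes, while the integral term keeps some steering applied after the error reaches zero.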

2604.06802 2026-06-17 cs.AI Updated version

Riemann-Bench: A Benchmark for Moonshot Mathematics

Suhaas Garre, Erik Knutsen, Sushant Mehta, Edwin Chen

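AI summary: Introduces Riemann-Bench, a benchmark of expert-designed research-level mathematics problems that evaluates AI reasoning beyond olympiad level; frontier models score below 10%.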

English abstract

Recent AI systems have achieved gold-medal-level performance on the International Mathematical Olympiad, demonstrating remarkable proficiency at competition-style problem solving. However, competition mathematics represents only a narrow slice of mathematical reasoning: problems are drawn from limited domains, require minimal advanced machinery, and can often reward insightful tricks over deep theoretical knowledge. We introduce Riemann-Bench, a private benchmark of expert-curated problems designed to evaluate AI systems on research-level mathematics that goes far beyond the olympiad frontier. Problems are authored by Ivy League mathematics professors, graduate students, and PhD-holding IMO medalists, and routinely took their authors weeks to solve independently. Each problem undergoes double-blind verification by two independent domain experts who must solve the problem from scratch, and yields a unique, closed-form solution assessed by programmatic verifiers. We evaluate frontier models as unconstrained research agents, with full access to coding tools, search, and open-ended reasoning, using an unbiased statistical estimator computed over 100 independent runs per problem. Our results reveal that all frontier models currently score below 10%, exposing a substantial gap between olympiad-level problem solving and genuine research-level mathematical reasoning. By keeping the benchmark fully private, we ensure that measured performance reflects authentic mathematical capability rather than memorization of training data.

2603.28251 2026-06-17 cs.CV cs.AI Updated version

DiffAttn: Diffusion-Based Drivers' Visual Attention Prediction with LLM-Enhanced Semantic Reasoning

Weimin Liu, Qingkun Li, Jiyuan Qiu, Wenjun Wang, Joshua H. Meng

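AI summary: Proposes DiffAttn, which formulates driver visual attention prediction as a conditional diffusion-denoising process and combines a Swin Transformer encoder, a Feature Fusion Pyramid, and LLM-enhanced semantic reasoning, achieving state-of-the-art performance on four datasets.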

English abstract

Drivers' visual attention provides critical cues for anticipating latent hazards and directly shapes decision-making and control maneuvers, where its absence can compromise traffic safety. To emulate drivers' perception patterns and advance visual attention prediction for intelligent vehicles, we propose DiffAttn, a diffusion-based framework that formulates this task as a conditional diffusion-denoising process, enabling more accurate modeling of drivers' attention. To capture both local and global scene features, we adopt Swin Transformer as encoder and design a decoder that combines a Feature Fusion Pyramid for cross-layer interaction with dense, multi-scale conditional diffusion to jointly enhance denoising learning and model fine-grained local and global scene contexts. Additionally, a large language model (LLM) layer is incorporated to enhance top-down semantic reasoning and improve sensitivity to safety-critical cues. Extensive experiments on four public datasets demonstrate that DiffAttn achieves state-of-the-art (SoTA) performance, surpassing most video-based, top-down-feature-driven, and LLM-enhanced baselines. Our framework further supports interpretable driver-centric scene understanding and has the potential to improve in-cabin human-machine interaction, risk perception, and drivers' state measurement in intelligent vehicles.

2604.03120 2026-06-17 cs.CV cs.RO Updated version

SCC-Loc: A Unified Semantic Cascade Consensus Framework for UAV Thermal Geo-Localization

Xiaoran Zhang, Yu Liu, Jinyu Liang, Kangqiushi Li, Zhiwei Huang, Huaxin Xiao

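AI summary: Proposes SCC-Loc, which uses a shared DINOv2 backbone, Semantic-Guided Viewport Alignment, Cascaded Spatial-Adaptive Texture-Structure Filtering, and Consensus-Driven Reliability-Aware Position Selection to resolve feature ambiguity caused by the thermal-visible modality gap, achieving zero-shot, high-accuracy absolute position estimation with a mean localization error of 9.37 m.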

Comments 17 pages, 5 figures. Submitted to IEEE J-STARS

English abstract

Cross-modal Thermal Geo-localization (TG) provides a robust, all-weather solution for Unmanned Aerial Vehicles (UAVs) in Global Navigation Satellite System (GNSS)-denied environments. However, profound thermal-visible modality gaps introduce severe feature ambiguity, systematically corrupting conventional coarse-to-fine registration. To dismantle this bottleneck, we propose SCC-Loc, a unified Semantic-Cascade-Consensus localization framework. By sharing a single DINOv2 backbone across global retrieval and MINIMA$_{\text{RoMa}}$ matching, it minimizes memory footprint and achieves zero-shot, highly accurate absolute position estimation. Specifically, we tackle modality ambiguity by introducing three cohesive components. First, we design the Semantic-Guided Viewport Alignment (SGVA) module to adaptively optimize satellite crop regions, effectively correcting initial spatial deviations. Second, we develop the Cascaded Spatial-Adaptive Texture-Structure Filtering (C-SATSF) mechanism to explicitly enforce geometric consistency, thereby eradicating dense cross-modal outliers. Finally, we propose the Consensus-Driven Reliability-Aware Position Selection (CD-RAPS) strategy to derive the optimal solution through a synergy of physically constrained pose optimization. To address data scarcity, we construct Thermal-UAV, a comprehensive dataset providing 11,890 diverse thermal queries referenced against a large-scale satellite ortho-photo and corresponding spatially aligned Digital Surface Model (DSM). Extensive experiments demonstrate that SCC-Loc establishes a new state-of-the-art, suppressing the mean localization error to 9.37 m and providing a 7.6-fold accuracy improvement within a strict 5-m threshold over the strongest baseline. Code and dataset are available at https://github.com/FloralHercules/SCC-Loc.

2604.00611 2026-06-17 cs.RO 版本更新

Physical Imitation Learning: Distilling Control Policies into Passive Elasticity

物理模仿学习：将控制策略蒸馏到被动弹性中

Huyue Ma, Yurui Jin, Helmut Hauser, Rui Wu

AI总结提出物理模仿学习(PIL)方法，将强化学习控制策略分解为主动与被动部分，被动部分卸载到并联弹性关节，显著降低能耗，在模拟四足机器人上实现高达95%的机械功率卸载。

AI中文摘要

由于脑-体协同进化，动物的内在身体动力学在其节能运动中起着关键作用。具体来说，控制努力在主动肌肉和被动身体动力学之间共享——这一原则通常被称为物理智能。因此，身体动力学是解决方案的一部分。相比之下，机器人身体通常被设计得尽可能简单，但主动控制常常与内在身体动力学对抗，导致低能效。我们引入了物理模仿学习（PIL），这是一种新颖的方法，使当前的机器人控制更接近动物。PIL 获取通过强化学习（RL）获得的学习控制策略，并将其系统地分解为主动和被动控制贡献。然后，被动部分可以直接卸载到被动并联弹性关节（PEJ）上。结果，主动控制贡献显著减少，降低了整体能耗。此外，策略可以通过 RL 训练，通过生成更容易被 PEJ 模仿的步态来利用 PEJ 的辅助。这使得主动和被动控制组件的协同设计成为可能，将更大份额的驱动努力转移到 PEJ。在这里，我们在模拟四足动物中展示了这种方法的潜力。我们的结果表明，所提出的方法可以在平坦地形上卸载高达 95% 的机械功率到被动身体动力学，在崎岖地形上卸载 13%。因此，PIL 提供了一条可推广的途径，用于实现特定任务的物理智能，适用于各种基于关节的机器人形态。

英文摘要

Due to brain-body co-evolution, animals' intrinsic body dynamics play a crucial role in their energy-efficient locomotion. Specifically, the control effort is shared between active muscles and passive body dynamics, a principle often referred to as Physical Intelligence. As a result, the body dynamics are part of the solution. In contrast, robot bodies are typically designed to be as simple as possible, but the active control often fights the intrinsic body dynamics, resulting in low energy efficiency. We introduce Physical Imitation Learning (PIL), a novel approach that brings current robotics control closer to animals. PIL takes learned control policies obtained with Reinforcement Learning (RL) and systematically splits them up into an active and passive control contribution. The passive part can then be directly offloaded to passive Parallel Elastic Joints (PEJs). As a result, the active control contribution is significantly reduced, lowering the overall energy consumption. Furthermore, the policy can be trained via RL to leverage the PEJ assistance by generating gaits that are more readily emulated by the PEJs. This enables co-design of the active and passive control components, shifting a greater share of actuation effort to the PEJs. Here we demonstrate the potential of this approach in simulated quadrupeds. Our results show that the proposed approach can offload up to 95% of mechanical power to passive body dynamics on flat terrain and 13% on rough terrain. PIL thereby provides a generalisable route to task-specific Physical Intelligence applicable to a wide range of joint-based robot morphologies.

2604.00605 2026-06-17 cs.CV 版本更新

Fluently Lying: Adversarial Robustness Can Be Substrate-Dependent

流利地撒谎：对抗鲁棒性可能依赖于底层架构

Daye Kang, Hyeongboo Baek

AI总结发现一种新的对抗攻击失败模式，即质量损坏（Quality Corruption，QC）：检测数量大体保留（超过70%）但精度骤降，且在测试的四种SNN检测器中仅出现于EMS-YOLO，表明对抗失败模式可能依赖于底层架构。

Comments Withdrawn by the authors due to an implementation bug discovered in the main experimental pipeline. The bug affects the main results, and therefore the empirical claims and conclusions of the paper are no longer supported

AI中文摘要

用于监控和防御对抗攻击下目标检测器的主要工具假设，当精度下降时，检测数量也会同步下降。这种耦合是假设的，并未经过测量。我们报告了在单个模型上观察到的反例：在标准PGD攻击下，EMS-YOLO（一种脉冲神经网络（SNN）目标检测器）保留了超过70%的检测结果，而mAP从0.528骤降至0.042。我们将这种保持检测数量但精度崩溃的现象称为质量损坏（Quality Corruption，QC），以区别于在无目标（untargeted）评估中占主导地位的抑制现象。在四种SNN架构和两种威胁模型（l-infinity和l-2）下，QC仅出现在测试的四种检测器之一（EMS-YOLO）中。在该模型上，所有五种标准防御组件均未能检测或缓解QC，这表明防御生态系统可能依赖于一种基于单一底层架构校准的共享假设。据我们所知，这些结果首次提供了对抗失败模式可能依赖于底层架构的证据。

英文摘要

The primary tools used to monitor and defend object detectors under adversarial attack assume that when accuracy degrades, detection count drops in tandem. This coupling was assumed, not measured. We report a counterexample observed on a single model: under standard PGD, EMS-YOLO, a spiking neural network (SNN) object detector, retains more than 70% of its detections while mAP collapses from 0.528 to 0.042. We term this count-preserving accuracy collapse Quality Corruption (QC), to distinguish it from the suppression that dominates untargeted evaluation. Across four SNN architectures and two threat models (l-infinity and l-2), QC appears only in one of the four detectors tested (EMS-YOLO). On this model, all five standard defense components fail to detect or mitigate QC, suggesting the defense ecosystem may rely on a shared assumption calibrated on a single substrate. These results provide, to our knowledge, the first evidence that adversarial failure modes can be substrate-dependent.

2603.28378 2026-06-17 cs.SD cs.AI 版本更新

Membership Inference Attacks against Large Audio Language Models

针对大型音频语言模型的成员推断攻击

Jia-Kai Dong, Yu-Xiang Lin, Hung-Yi Lee

AI总结首次系统评估大型音频语言模型的成员推断攻击，提出盲基线协议控制分布偏移，发现跨模态记忆仅源于说话人声纹与文本绑定。

Comments Accepted by Interspeech 2026

AI中文摘要

我们首次对大型音频语言模型（LALMs）进行了系统的成员推断攻击（MIA）评估。利用基于文本、频谱和韵律特征的多模态盲基线，我们证明即使没有模型推理，常见音频数据集也表现出近乎完美的训练/测试可分离性（AUC ~ 1.0），因此MIA可能主要检测分布偏移。因此，我们引入了一个盲基线协议来控制这一混杂因素。在该协议下，我们发现分布匹配的数据集能够实现可靠的MIA评估，而不会产生分布偏移伪影。我们基准测试了多种MIA方法，并在这些数据集上进行了模态解缠实验。结果表明，LALM的记忆是跨模态的，仅源于将说话人的声纹与其文本绑定。这些发现为审计LALMs建立了超越虚假相关性的原则性标准。我们的代码库可在 https://github.com/snooow1029/ALM_MIA 获取。

英文摘要

We present the first systematic Membership Inference Attack (MIA) evaluation of Large Audio Language Models (LALMs). Using Multi-modal Blind Baselines based on textual, spectral and prosodic features, we demonstrate that common audio datasets exhibit near-perfect train/test separability (AUC ~ 1.0) even without model inference, so MIA may primarily detect distribution shift. We therefore introduce a blind-baseline protocol to control for this confound. Under this protocol, we find that distribution-matched datasets enable reliable MIA evaluation without distribution-shift artifacts. We benchmark multiple MIA methods and conduct modality disentanglement experiments on these datasets. The results reveal that LALM memorization is cross-modal, arising only from binding a speaker's vocal identity with its text. These findings establish a principled standard for auditing LALMs beyond spurious correlations. Our codebase is available at https://github.com/snooow1029/ALM_MIA.
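
The blind-baseline idea can be illustrated with a small sketch: if a feature computed without ever querying the model already separates "member" from "non-member" clips, a high MIA score on that split says little about memorization. The code below is only an illustration under stated assumptions, not the authors' protocol: the single feature (clip duration), the toy values, and the function name `auc` are all hypothetical, whereas the paper's baselines use textual, spectral and prosodic features.

```python
def auc(member_scores, nonmember_scores):
    """AUC as the probability that a member outscores a non-member (ties count 0.5)."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


# Hypothetical model-free feature, e.g. clip duration in seconds.
# Shifted split: "train" and "test" clips come from visibly different sources.
shifted_members = [12.1, 11.8, 13.0, 12.5]
shifted_nonmembers = [4.2, 5.0, 3.9, 4.7]
# Matched split: both sides are drawn from a similar distribution.
matched_members = [5.1, 4.0, 6.2, 4.9]
matched_nonmembers = [5.0, 4.1, 6.0, 5.2]

blind_auc_shifted = auc(shifted_members, shifted_nonmembers)
blind_auc_matched = auc(matched_members, matched_nonmembers)
print(blind_auc_shifted, blind_auc_matched)
```

A blind AUC near 1.0 flags a split where distribution shift alone explains separability; a value near 0.5 suggests the split is a fairer test bed for an actual attack.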

2603.26592 2026-06-17 cs.LG cs.AI cs.HC 版本更新

Evaluating Interactive 2D Visualization as a Sample Selection Strategy for Biomedical Time-Series Data Annotation

评估交互式二维可视化作为生物医学时间序列数据标注的样本选择策略

Einari Vaaras, Manu Airaksinen, Okko Räsänen

AI总结针对生物医学时间序列标注困难，比较随机采样、最远优先遍历和基于交互式2D可视化（2DV）的三种样本选择方法，在婴儿运动评估和语音情感识别任务中，2DV在聚合标签时表现最佳，但个体标注者间标签分布差异大，随机采样最安全。

Comments Accepted for publication in Computers in Biology and Medicine (Elsevier)

DOI: 10.1016/j.compbiomed.2026.111809

AI中文摘要

生物医学领域中可靠的机器学习模型依赖于准确的标签，然而标注生物医学时间序列数据仍然具有挑战性。算法样本选择可能支持标注，但涉及真实人类标注者的研究证据很少。因此，我们比较了三种用于标注的样本选择方法：随机采样（RND）、最远优先遍历（FAFT）和一种基于图形用户界面的方法，该方法能够探索高维数据的互补二维可视化（2DV）。我们在婴儿运动评估（IMA）和语音情感识别（SER）的四个分类任务中评估了这些方法。十二名标注者，分为专家和非专家，在有限的标注预算下进行数据标注，并进行了标注后实验以评估采样方法。在所有分类任务中，当聚合标注者的标签时，2DV表现最佳。在IMA中，2DV最有效地捕获了稀有类别，但也表现出由于有限的标注预算导致的标注者间标签分布变异性增大，当模型在个体标注者的标签上训练时，分类性能下降；在这些情况下，FAFT表现出色。对于SER，2DV在专家标注者中优于其他方法，并在个体标注者设置中与非专家标注者的性能相当。失败风险分析显示，当标注者数量或标注者专业知识不确定时，RND是最安全的选择，而2DV由于标签分布变异性更大而具有最高风险。此外，实验后访谈表明，2DV使标注任务更有趣和愉快。总体而言，基于2DV的采样对于生物医学时间序列数据标注似乎很有前景，特别是在标注预算不是非常紧张的情况下。

英文摘要

Reliable machine-learning models in biomedical settings depend on accurate labels, yet annotating biomedical time-series data remains challenging. Algorithmic sample selection may support annotation, but evidence from studies involving real human annotators is scarce. Consequently, we compare three sample selection methods for annotation: random sampling (RND), farthest-first traversal (FAFT), and a graphical user interface-based method enabling exploration of complementary 2D visualizations (2DVs) of high-dimensional data. We evaluated the methods across four classification tasks in infant motility assessment (IMA) and speech emotion recognition (SER). Twelve annotators, categorized as experts or non-experts, performed data annotation under a limited annotation budget, and post-annotation experiments were conducted to evaluate the sampling methods. Across all classification tasks, 2DV performed best when aggregating labels across annotators. In IMA, 2DV most effectively captured rare classes, but also exhibited greater annotator-to-annotator label distribution variability resulting from the limited annotation budget, decreasing classification performance when models were trained on individual annotators' labels; in these cases, FAFT excelled. For SER, 2DV outperformed the other methods among expert annotators and matched their performance for non-experts in the individual-annotator setting. A failure risk analysis revealed that RND was the safest choice when annotator count or annotator expertise was uncertain, whereas 2DV had the highest risk due to its greater label distribution variability. Furthermore, post-experiment interviews indicated that 2DV made the annotation task more interesting and enjoyable. Overall, 2DV-based sampling appears promising for biomedical time-series data annotation, particularly when the annotation budget is not highly constrained.
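
Of the three selection strategies compared, farthest-first traversal (FAFT) is the most algorithmic, so a minimal sketch may help. The abstract does not state the feature space, distance metric, or starting sample; Euclidean distance, the start index 0, and the toy 2D features below are assumptions made for illustration.

```python
import math


def farthest_first_traversal(points, budget, start=0):
    """Return indices of `budget` points chosen by farthest-first traversal.

    Each step picks the point whose distance to its nearest already-selected
    point is largest, so the selection spreads across the feature space.
    """
    if not 0 < budget <= len(points):
        raise ValueError("budget must be between 1 and len(points)")
    selected = [start]
    # Distance from every point to its nearest selected point so far.
    nearest = [math.dist(p, points[start]) for p in points]
    while len(selected) < budget:
        candidate = max(range(len(points)), key=nearest.__getitem__)
        selected.append(candidate)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], math.dist(p, points[candidate]))
    return selected


# Toy 2D features: a dense cluster near the origin plus two isolated points.
features = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0), (0.0, 4.0)]
chosen = farthest_first_traversal(features, budget=3)
print(chosen)
```

With a budget of three, the traversal leaves the dense cluster after the first pick and spends the remaining budget on the two isolated points, which is why this kind of strategy tends to cover rare regions that random sampling can miss.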

2603.26292 2026-06-17 cs.CL cs.AI 版本更新

findsylls: A Language-Agnostic Toolkit for Syllable-Level Speech Tokenization and Embedding

findsylls: 一种语言无关的音节级语音分词与嵌入工具包

Héctor Javier Vázquez Martínez

AI总结提出语言无关的模块化工具包findsylls，统一经典音节检测器和端到端音节切分器，支持音节分割、嵌入提取和多粒度评估，在英语、西班牙语及低资源语言Kono上验证了跨语言可重复实验能力。

Comments 4 pages + 2 for references, disclosures & acknowledgements; to appear in Interspeech 2026; DOI to cite findsylls library: https://doi.org/10.5281/zenodo.20707804

AI中文摘要

音节级单元为口语语言建模和无监督词汇发现提供了紧凑且具有语言意义的表示，但关于音节化的研究仍然分散在不同的实现、数据集和评估协议中。我们介绍了findsylls，一个模块化的、语言无关的工具包，它将经典的音节检测器和端到端音节切分器统一在一个通用接口下，用于音节分割、嵌入提取和多粒度评估。该工具包实现并标准化了广泛使用的方法（例如，Sylber、VG-HuBERT），并允许重新组合其组件，从而实现对表示、算法和令牌率的受控比较。我们在英语和西班牙语语料库以及来自Kono（一种未被充分记录的中部曼德语）的新手工标注数据上演示了findsylls，展示了单一框架如何支持在资源丰富和资源不足的环境中均可重复的音节级实验。

英文摘要

Syllable-level units offer compact and linguistically meaningful representations for spoken language modeling and unsupervised word discovery, but research on syllabification remains fragmented across disparate implementations, datasets, and evaluation protocols. We introduce findsylls, a modular, language-agnostic toolkit that unifies classical syllable detectors and end-to-end syllabifiers under a common interface for syllable segmentation, embedding extraction, and multi-granular evaluation. The toolkit implements and standardizes widely used methods (e.g., Sylber, VG-HuBERT) and allows their components to be recombined, enabling controlled comparisons of representations, algorithms, and token rates. We demonstrate findsylls on English and Spanish corpora and on new hand-annotated data from Kono, an underdocumented Central Mande language, illustrating how a single framework can support reproducible syllable-level experiments across both high-resource and under-resourced settings.

2603.22372 2026-06-17 cs.LG cs.AI 版本更新

Rethinking Multimodal Fusion for Time Series: Text Modalities Need Constrained Fusion

重新思考时间序列的多模态融合：文本模态需要受约束的融合

Seunghan Lee, Jun Seo, Jaehoon Lee, Sungdong Yoo, Minjae Kim, Tae Yoon Lim, Dongwan Kang, Hwanil Choi, SoonYoung Lee, Wonbin Ahn

AI总结针对多模态时间序列预测中朴素融合方法效果不佳的问题，提出受约束融合方法及受控融合适配器（CFA），通过低秩适配器过滤无关文本信息，在多种数据集和模型上验证了有效性。

Comments KDD Workshop on Mining and Learning from Time Series 2026

AI中文摘要

多模态学习的最新进展推动了将文本或视觉等辅助模态集成到时间序列（TS）预测中。然而，现有方法大多增益有限，通常仅在特定数据集上提升性能，或依赖限制泛化能力的架构特定设计。在本文中，我们表明采用朴素融合策略（例如简单加法或拼接）的多模态模型通常表现不如单模态TS模型，我们将其归因于辅助模态的未受控集成可能引入无关信息。受此观察启发，我们探索了各种旨在控制这种集成的受约束融合方法，并发现它们始终优于朴素融合方法。此外，我们提出了受控融合适配器（CFA），一种简单的即插即用方法，无需修改TS主干即可实现受控的跨模态交互，仅集成与TS动态对齐的相关文本信息。CFA采用低秩适配器在将文本信息融合到时间表示之前过滤无关文本信息。我们在各种数据集和TS/文本模型上进行了超过20K次实验，证明了受约束融合方法的有效性。代码见：https://github.com/seunghan96/cfa

英文摘要

Recent advances in multimodal learning have motivated the integration of auxiliary modalities such as text or vision into time series (TS) forecasting. However, most existing methods provide limited gains, often improving performance only in specific datasets or relying on architecture-specific designs that limit generalization. In this paper, we show that multimodal models with naive fusion strategies (e.g., simple addition or concatenation) often underperform unimodal TS models, which we attribute to the uncontrolled integration of auxiliary modalities that may introduce irrelevant information. Motivated by this observation, we explore various constrained fusion methods designed to control such integration and find that they consistently outperform naive fusion methods. Furthermore, we propose Controlled Fusion Adapter (CFA), a simple plug-in method that enables controlled cross-modal interactions without modifying the TS backbone, integrating only relevant textual information aligned with TS dynamics. CFA employs low-rank adapters to filter irrelevant textual information before fusing it into temporal representations. We conduct over 20K experiments across various datasets and TS/text models, demonstrating the effectiveness of the constrained fusion methods. Code is available at: https://github.com/seunghan96/cfa.

2603.22281 2026-06-17 cs.CV cs.AI cs.CL cs.LG cs.RO 版本更新

ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model

ThinkJEPA：赋予潜在世界模型大型视觉-语言推理能力

Haichao Zhang, Yijiang Li, Shwai He, Tushar Nagarajan, Mingfei Chen, Jianglin Lu, Ang Li, Yun Fu

AI总结提出ThinkJEPA框架，结合密集JEPA分支与稀疏VLM思考者分支，通过分层金字塔表示提取模块，实现细粒度运动建模与长程语义引导，在手部操作轨迹预测任务上超越基线。

Comments 10 pages, 5 figures

AI中文摘要

潜在世界模型（如V-JEPA2）的最新进展展示了从视频观测预测未来世界状态的能力。然而，短观测窗口的密集预测限制了时间上下文，可能导致预测偏向局部低层次外推，难以捕捉长程语义并降低下游效用。相比之下，视觉-语言模型（VLM）通过对均匀采样帧进行推理，提供强大的语义基础和通用知识，但由于计算驱动的稀疏采样、语言输出瓶颈（将细粒度交互状态压缩为文本导向表示）以及适应小规模动作条件数据集时的数据分布不匹配，它们不适合作为独立的密集预测器。我们提出了一种VLM引导的JEPA风格潜在世界建模框架，通过双时间路径结合密集帧动态建模与长程语义指导：一个密集JEPA分支用于细粒度运动和交互线索，以及一个均匀采样的VLM“思考者”分支，具有更大的时间步长以提供知识丰富的指导。为了有效传递VLM的渐进推理信号，我们引入了一个分层金字塔表示提取模块，将多层VLM表示聚合成与潜在预测兼容的指导特征。在手部操作轨迹预测实验上，我们的方法优于强VLM-only基线和JEPA预测器基线，并展现出更鲁棒的长程展开行为。

英文摘要

Recent progress in latent world models (e.g., V-JEPA2) has shown promising capability in forecasting future world states from video observations. Nevertheless, dense prediction from a short observation window limits temporal context and can bias predictors toward local, low-level extrapolation, making it difficult to capture long-horizon semantics and reducing downstream utility. Vision-language models (VLMs), in contrast, provide strong semantic grounding and general knowledge by reasoning over uniformly sampled frames, but they are not ideal as standalone dense predictors due to compute-driven sparse sampling, a language-output bottleneck that compresses fine-grained interaction states into text-oriented representations, and a data-regime mismatch when adapting to small action-conditioned datasets. We propose a VLM-guided JEPA-style latent world modeling framework that combines dense-frame dynamics modeling with long-horizon semantic guidance via a dual-temporal pathway: a dense JEPA branch for fine-grained motion and interaction cues, and a uniformly sampled VLM "thinker" branch with a larger temporal stride for knowledge-rich guidance. To transfer the VLM's progressive reasoning signals effectively, we introduce a hierarchical pyramid representation extraction module that aggregates multi-layer VLM representations into guidance features compatible with latent prediction. Experiments on hand-manipulation trajectory prediction show that our method outperforms both a strong VLM-only baseline and a JEPA-predictor baseline, and yields more robust long-horizon rollout behavior.

2603.20775 2026-06-17 cs.LG 版本更新

Evaluating Uplift Modeling under Structural Biases: Insights into Metric Stability and Model Robustness

评估结构偏差下的提升建模：对指标稳定性和模型鲁棒性的洞察

Yuxuan Yang, Dugang Liu, Yiyan Huang

AI总结针对现实营销数据中的多种偏差，设计半合成基准框架，发现TARNet具有鲁棒性，且与ATE对齐的指标更稳定。

Comments Accepted by KDD 26

AI中文摘要

在个性化营销中，提升模型通过反事实分析模拟客户在不同干预下的行为变化，来估计干预的增量效果。然而，现实营销数据常存在多种偏差，如选择偏差、溢出效应、测量误差和未观测混杂。这些偏差会同时影响提升估计的准确性和评估指标的有效性。尽管偏差感知评估很重要，但缺乏系统研究来评估不同模型和指标在偏差条件下的表现。为填补这一空白，我们设计了一个系统基准框架。与标准预测任务不同，现实提升数据集天然缺乏反事实真值。这一限制使得评估指标的直接验证不可行，并阻碍了偏差的精确量化。因此，半合成方法成为系统基准的关键推动力。该方法通过保留现实特征依赖关系，同时提供隔离结构偏差所需的真值，有效弥合了差距。我们的研究发现：(i) 提升定位和预测可能表现为不同目标，擅长一个并不保证另一个有效；(ii) 尽管许多模型在多种偏差下表现不一致，但TARNet表现出显著的鲁棒性，为后续模型设计提供了见解；(iii) 评估指标的稳定性与其与ATE的数学对齐程度相关，表明在结构数据不完美下，近似ATE的指标能产生更一致的模型排名。这些发现表明，在现实数据不完美下需要更鲁棒的提升模型和评估指标。

英文摘要

In personalized marketing, uplift models estimate the incremental effect of an intervention by modeling how customer behavior would change under alternative treatments using counterfactual analysis. However, real-world marketing data often exhibit various biases, such as selection bias, spillover effects, measurement error, and unobserved confounding. These biases can adversely affect both the accuracy of uplift estimation and the validity of evaluation metrics. Despite the importance of bias-aware assessment, there remains a lack of systematic studies evaluating how different models and metrics perform under such biased conditions. To bridge this gap, we design a systematic benchmarking framework. Unlike standard predictive tasks, real-world uplift datasets inherently lack counterfactual ground truth. This limitation renders the direct validation of evaluation metrics infeasible and prevents the precise quantification of biases. Therefore, a semi-synthetic approach serves as a critical enabler for systematic benchmarking. This approach effectively bridges the gap by retaining real-world feature dependencies while providing the ground truth needed to isolate structural biases. Our investigations reveal that (i) uplift targeting and prediction can manifest as distinct objectives, where proficiency in one does not ensure efficacy in the other; (ii) while many models exhibit inconsistent performance under diverse biases, TARNet shows notable robustness, providing insights for subsequent model design; (iii) the stability of evaluation metrics is linked to their mathematical alignment with the ATE, suggesting that ATE-approximating metrics yield more consistent model rankings under structural data imperfections. These findings suggest the need for more robust uplift models and evaluation metrics under real-world data imperfections.
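
Why ground truth matters here can be shown with a toy calculation. When both potential outcomes are known, as in a semi-synthetic setup, the true ATE is available and the error introduced by a biased treatment assignment can be measured directly. The sketch below is an assumption-laden illustration, not the paper's benchmark: the eight customers, their outcomes, the two assignment patterns, and the function names are all hypothetical.

```python
def true_ate(y0, y1):
    """Average treatment effect from known potential outcomes."""
    return sum(b - a for a, b in zip(y0, y1)) / len(y0)


def naive_ate(y0, y1, treated):
    """Difference in observed means between treated and control units."""
    treated_outcomes = [b for b, t in zip(y1, treated) if t]
    control_outcomes = [a for a, t in zip(y0, treated) if not t]
    return (sum(treated_outcomes) / len(treated_outcomes)
            - sum(control_outcomes) / len(control_outcomes))


# Hypothetical potential outcomes for eight customers (spend without / with a coupon).
y0 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y1 = [2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]  # every customer gains exactly 1.0

# Two assignment patterns: one balanced across spend levels, one with selection bias.
balanced = [True, False, True, False, False, True, False, True]
biased = [False, False, False, False, True, True, True, True]  # only big spenders treated

ate = true_ate(y0, y1)
estimate_balanced = naive_ate(y0, y1, balanced)
estimate_biased = naive_ate(y0, y1, biased)
print(ate, estimate_balanced, estimate_biased)
```

On real data only one outcome per customer is observed, so the gap between the biased estimate and the true effect cannot be computed; supplying the missing counterfactual is the role the semi-synthetic design plays.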

2602.14771 2026-06-17 cs.CV cs.AI cs.LG cs.MM cs.NE 版本更新

GOT-JEPA: Generic Object Tracking with Model Adaptation and Occlusion Handling using Joint-Embedding Predictive Architecture

GOT-JEPA：基于联合嵌入预测架构的通用目标跟踪与模型自适应及遮挡处理

Shih-Fang Chen, Jun-Cheng Chen, I-Hong Jhuo, Yen-Yu Lin

AI总结提出GOT-JEPA框架，通过预测跟踪模型而非图像特征来提升泛化能力，并设计OccuSolver增强遮挡感知，在七个基准上验证了有效性。

Comments Accepted by IEEE Transactions on Circuits and Systems for Video Technology (TCSVT). This research focuses on learning model adaptation for adverse and dynamic environments, as well as fine-grained occlusion perception for tracking

DOI: 10.1109/TCSVT.2026.3675005
Journal ref: IEEE Transactions on Circuits and Systems for Video Technology 2026

AI中文摘要

人类视觉系统通过整合当前观测与先前观测信息、适应目标和场景变化、以及精细推理遮挡来跟踪物体。相比之下，最近的通用目标跟踪器通常针对训练目标进行优化，这限制了在未见场景中的鲁棒性和泛化能力，并且它们的遮挡推理仍然粗糙，缺乏对遮挡模式的详细建模。为了解决这些在泛化和遮挡感知方面的局限性，我们提出了GOT-JEPA，一个模型预测预训练框架，将JEPA从预测图像特征扩展到预测跟踪模型。给定相同的历史信息，教师预测器从干净的当前帧生成伪跟踪模型，学生预测器学习从当前帧的损坏版本预测相同的伪跟踪模型。这种设计提供了稳定的伪监督，并明确训练预测器在遮挡、干扰和其他不利观测下产生可靠的跟踪模型，从而提高了对动态环境的泛化能力。基于GOT-JEPA，我们进一步提出了OccuSolver来增强目标跟踪的遮挡感知。OccuSolver调整了一个以点为中心的点跟踪器，用于目标感知的可见性估计和详细的遮挡模式捕获。在跟踪器迭代生成的目标先验条件下，OccuSolver逐步细化可见性状态，增强遮挡处理，并产生更高质量的参考标签，逐步改进后续模型预测。在七个基准上的广泛评估表明，我们的方法有效增强了跟踪器的泛化能力和鲁棒性。

英文摘要

The human visual system tracks objects by integrating current observations with previously observed information, adapting to target and scene changes, and reasoning about occlusion at fine granularity. In contrast, recent generic object trackers are often optimized for training targets, which limits robustness and generalization in unseen scenarios, and their occlusion reasoning remains coarse, lacking detailed modeling of occlusion patterns. To address these limitations in generalization and occlusion perception, we propose GOT-JEPA, a model-predictive pretraining framework that extends JEPA from predicting image features to predicting tracking models. Given identical historical information, a teacher predictor generates pseudo-tracking models from a clean current frame, and a student predictor learns to predict the same pseudo-tracking models from a corrupted version of the current frame. This design provides stable pseudo supervision and explicitly trains the predictor to produce reliable tracking models under occlusions, distractors, and other adverse observations, improving generalization to dynamic environments. Building on GOT-JEPA, we further propose OccuSolver to enhance occlusion perception for object tracking. OccuSolver adapts a point-centric point tracker for object-aware visibility estimation and detailed occlusion-pattern capture. Conditioned on object priors iteratively generated by the tracker, OccuSolver incrementally refines visibility states, strengthens occlusion handling, and produces higher-quality reference labels that progressively improve subsequent model predictions. Extensive evaluations on seven benchmarks show that our method effectively enhances tracker generalization and robustness.

2603.17356 2026-06-17 cs.CL 版本更新

PACE-RAG: Patient-Aware Contextual and Evidence-Constrained RAG for Clinical Drug Recommendation

PACE-RAG：面向临床药物推荐的患者感知上下文与证据约束RAG

Chaeyoung Huh, Hyunmin Hwang, Jung Hwan Shin, Sungyang Jo, Jinse Park, Jong Chul Ye

AI总结提出PACE-RAG框架，通过提取患者特定临床特征、检索相关病例并结合当前症状与用药史，实现个性化药物推荐，在帕金森病和MIMIC-IV数据集上取得最优性能。

Comments 32 pages, 18 figures

AI中文摘要

药物推荐需要深入理解个体患者背景，尤其是帕金森病等复杂疾病。尽管大语言模型拥有广泛的医学知识，但无法捕捉实际处方模式的细微差别。现有的RAG方法也难以应对这些复杂性，因为基于指南的检索仍然过于通用，而相似患者检索往往复制多数模式，未考虑个体患者的独特临床细微差别。为弥合这一差距，我们提出PACE-RAG（患者感知上下文与证据约束RAG）。PACE-RAG并非直接从检索到的患者中复制常用药物，而是首先提取患者特定临床特征，围绕这些特征检索病例，然后利用患者当前症状、活跃用药史和焦点特异性处方倾向来优化最终处方。通过分析针对特定临床特征的治疗模式，PACE-RAG生成患者特定的药物推荐以及可解释的临床总结。在帕金森病队列和MIMIC-IV基准上使用Llama-3.1-8B和Qwen3-8B进行评估，PACE-RAG实现了最先进的性能，F1分数分别达到80.84%和47.22%。这些结果表明PACE-RAG是一个稳健且临床基础扎实的个性化决策支持框架。我们的代码可在以下网址获取：https://github.com/ChaeYoungHuh/PACE-RAG

英文摘要

Drug recommendation requires a deep understanding of individual patient context, especially for complex conditions like Parkinson's disease. While LLMs possess broad medical knowledge, they fail to capture the subtle nuances of actual prescribing patterns. Existing RAG methods also struggle with these complexities because guideline-based retrieval remains too generic and similar-patient retrieval often replicates majority patterns without accounting for the unique clinical nuances of individual patients. To bridge this gap, we propose PACE-RAG (Patient-Aware Contextual and Evidence-Constrained RAG). Rather than directly copying frequent medications from retrieved patients, PACE-RAG personalizes recommendations by first extracting patient-specific clinical features, retrieving cases around these features, and then refining the final prescription using the patient's current symptoms, active medication history, and focus-specific prescribing tendencies. By analyzing treatment patterns tailored to specific clinical features, PACE-RAG generates patient-specific medication recommendations along with an explainable clinical summary. Evaluated on a Parkinson's cohort and the MIMIC-IV benchmark using Llama-3.1-8B and Qwen3-8B, PACE-RAG achieved state-of-the-art performance, reaching F1 scores of 80.84% and 47.22%, respectively. These results suggest that PACE-RAG is a robust and clinically grounded framework for personalized decision support. Our code is available at: https://github.com/ChaeYoungHuh/PACE-RAG.

2507.20708 2026-06-17 cs.LG math.OC stat.AP 版本更新

Exposing the Illusion of Fairness: Auditing Vulnerabilities to Distributional Manipulation Attacks

揭露公平的幻象：审计对分布操纵攻击的脆弱性

Valentin Lafargue, Adriana Laurindo Monteiro, Emmanuelle Claeys, Laurent Risser, Jean-Michel Loubes

AI总结研究恶意被审计方如何通过分布操纵制造公平假象，提出基于熵和最优传输的操纵策略，并评估统计检验的检测能力，为监管验证提供指导。

Journal ref: European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, Applied Data Science Track, 2026

AI中文摘要

人工智能系统在高风险领域（包括欧盟AI法案（Regulation (EU) 2024/1689）归类为高风险的领域）的快速部署，加剧了对可靠合规审计的需求。对于二分类器，监管风险评估通常依赖于全局公平性指标，如差异影响比，该指标广泛用于评估潜在歧视。在典型的审计设置中，被审计方将其数据集的一个子集提供给审计方，而监管机构可能验证该子集是否代表完整的底层分布。在这项工作中，我们研究了恶意被审计方在多大程度上可以从一个不合规的原始分布中构建一个符合公平性且看似具有代表性的样本，从而制造公平的幻象。我们将该问题形式化为一个受约束的分布投影任务，并引入基于熵和最优传输投影的数学基础操纵策略。这些构造刻画了满足公平约束所需的最小分布偏移。为了对抗此类攻击，我们通过基于分布距离的统计检验形式化代表性，并系统评估其检测操纵样本的能力。我们的分析强调了公平性操纵在统计上未被检测到的条件，并为加强监管验证提供了实用指南。我们通过在用于偏差检测的标准表格数据集上进行实验来验证我们的理论发现。代码公开于 https://github.com/ValentinLafargue/Inspection

英文摘要

The rapid deployment of AI systems in high-stakes domains, including those classified as high-risk under the EU AI Act (Regulation (EU) 2024/1689), has intensified the need for reliable compliance auditing. For binary classifiers, regulatory risk assessment often relies on global fairness metrics such as the Disparate Impact ratio, widely used to evaluate potential discrimination. In typical auditing settings, the auditee provides a subset of its dataset to an auditor, while a supervisory authority may verify whether this subset is representative of the full underlying distribution. In this work, we investigate to what extent a malicious auditee can construct a fairness-compliant yet representative-looking sample from a non-compliant original distribution, thereby creating an illusion of fairness. We formalize this problem as a constrained distributional projection task and introduce mathematically grounded manipulation strategies based on entropic and optimal transport projections. These constructions characterize the minimal distributional shift required to satisfy fairness constraints. To counter such attacks, we formalize representativeness through distributional-distance-based statistical tests and systematically evaluate their ability to detect manipulated samples. Our analysis highlights the conditions under which fairness manipulation can remain statistically undetected and provides practical guidelines for strengthening supervisory verification. We validate our theoretical findings through experiments on standard tabular datasets for bias detection. Code is publicly available at https://github.com/ValentinLafargue/Inspection.
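
To make the audited quantity concrete, the sketch below computes the Disparate Impact ratio as the positive-prediction rate of a protected group divided by that of a reference group, and shows how a selectively chosen subset can raise it. This is a deliberately crude stand-in: the paper's manipulations are entropic and optimal transport projections, not hand-picked deletions, and the data, group coding (0 = protected, 1 = reference), and function name are hypothetical.

```python
def disparate_impact(predictions, groups, protected=0, reference=1):
    """Ratio of positive-prediction rates: protected group over reference group."""
    def positive_rate(group):
        outcomes = [p for p, g in zip(predictions, groups) if g == group]
        if not outcomes:
            raise ValueError(f"no samples for group {group!r}")
        return sum(outcomes) / len(outcomes)

    ref_rate = positive_rate(reference)
    if ref_rate == 0:
        raise ValueError("reference group has no positive predictions")
    return positive_rate(protected) / ref_rate


# Full (hypothetical) dataset: 10 protected and 10 reference individuals.
groups = [0] * 10 + [1] * 10
predictions = [1] * 3 + [0] * 7 + [1] * 6 + [0] * 4
full_di = disparate_impact(predictions, groups)

# A hand-picked subset that drops four negatively classified protected samples.
keep = [i for i in range(20) if i not in (3, 4, 5, 6)]
subset_di = disparate_impact([predictions[i] for i in keep],
                             [groups[i] for i in keep])
print(full_di, subset_di)
```

Here the full data give a ratio of 0.5, while the subset clears 0.8, a commonly cited threshold (the "four-fifths rule"); whether the paper adopts that threshold is not stated in the abstract. Dropping 40% of one group's samples is also exactly the kind of shift a representativeness test is meant to catch, which is why the paper studies minimal shifts instead.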

2603.08001 2026-06-17 cs.LG stat.ML 版本更新

Amortizing Maximum Inner Product Search with Learned Support Functions

通过学习支持函数摊销最大内积搜索

Theo X. Olausson, João Monteiro, Michal Klein, Marco Cuturi

AI总结提出基于回归的摊销MIPS方法，通过训练神经网络直接预测最优键，利用支持函数的凸性加速搜索，在BEIR基准上显著提升IVF匹配率。

AI中文摘要

最大内积搜索（MIPS）是机器学习中的关键子程序，需要从数据库（键）中识别出与给定查询最匹配的向量。我们提出摊销MIPS：一种基于回归的方法，训练神经网络直接预测MIPS解，从而摊销在固定键数据库上从已知分布中重复求解查询的MIPS成本。我们的关键洞察是，MIPS值函数是键集合的支持函数，这是一个经过充分研究的凸函数，其梯度给出最优键。这激发了两种互补的摊销模型：SupportNet，一个输入凸神经网络，用于回归支持函数；以及KeyNet，一个向量值网络，直接回归最优键。SupportNet可以作为聚类路由器，将查询引导到相关的数据库分区，而KeyNet可以作为原始查询的直接替代品，直接输入到现成的索引流水线中。我们在BEIR基准上的实验表明，对于文档嵌入，当考虑计算工作量（无论是FLOPs、探测次数还是挂钟时间）时，学习的SupportNet和KeyNet显著提高了IVF匹配率。我们的代码可在以下网址获取：https://github.com/apple/ml-amips

英文摘要

Maximum inner product search (MIPS) is a crucial subroutine in machine learning, requiring the identification of a vector taken within a database (the keys) that best aligns with a given query. We propose amortized MIPS: a regression-based approach that trains neural networks to directly predict MIPS solutions, amortizing the cost of repeatedly solving MIPS for queries drawn from a known distribution over a fixed key database. Our key insight is that the MIPS value function is the support function of the set of keys, a well-studied convex function whose gradient yields the optimal key. This motivates two complementary amortized models: SupportNet, an input-convex neural network trained to regress the support function, and KeyNet, a vector-valued network that directly regresses the optimal key. SupportNet can serve as a cluster router, steering queries toward relevant database partitions, while KeyNet can be used as a drop-in replacement for the original query, fed directly to off-the-shelf indexing pipelines. Our experiments on the BEIR benchmark show that, for document embeddings, learned SupportNets and KeyNets significantly improve IVF match rates when accounting for compute effort, whether measured in FLOPs, number of probes, or wall-clock time. Our code is available at: https://github.com/apple/ml-amips.
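
The key insight can be checked numerically on a toy key set. For a finite set of keys, the support function is h(q) = max over keys k of the inner product of q and k; wherever the maximizer is unique, the gradient of h at q equals that maximizing key. The sketch below uses brute-force search and a finite-difference gradient to show this. It does not reproduce SupportNet or KeyNet, and the keys, query, and function names are hypothetical.

```python
def inner(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mips(query, keys):
    """Brute-force MIPS: return (best_key, support_value)."""
    best = max(keys, key=lambda k: inner(query, k))
    return best, inner(query, best)


def support(query, keys):
    """Support function h(q) = max_k <q, k> of the finite key set."""
    return mips(query, keys)[1]


def numeric_gradient(query, keys, eps=1e-6):
    """Central finite-difference gradient of the support function at `query`."""
    grad = []
    for i in range(len(query)):
        plus = list(query)
        minus = list(query)
        plus[i] += eps
        minus[i] -= eps
        grad.append((support(plus, keys) - support(minus, keys)) / (2 * eps))
    return grad


keys = [(1.0, 0.0), (0.0, 1.0), (-1.0, -1.0), (0.6, 0.7)]
query = (2.0, 1.0)
best_key, value = mips(query, keys)
grad = numeric_gradient(query, keys)
print(best_key, value, grad)
```

The finite-difference gradient matches the winning key up to numerical error, which is the property that lets a network regressing h recover the optimal key through its gradient.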

2510.19255 2026-06-17 cs.CV 版本更新

Advances in 4D Representation: Geometry, Motion, and Interaction

4D表示进展：几何、运动与交互

Mingrui Zhao, Sauradip Nag, Kai Wang, Aditya Vora, Guangda Ji, Peter Chun, Ali Mahdavi-Amiri, Hao Zhang

AI总结本文综述了4D生成与重建领域，从几何、运动和交互三个核心支柱出发，分析不同4D表示方法的特性、挑战及适用场景，并探讨了大语言模型和视频基础模型在其中的作用。

Comments CGF'26, 21 pages. Project Page: https://mingrui-zhao.github.io/4DRep-GMI/

AI中文摘要

我们呈现了一篇关于4D生成与重建的综述，这是一个快速发展的计算机图形学子领域，其进展得益于神经场、几何与运动深度学习以及3D生成式人工智能（GenAI）的最新突破。尽管我们的综述并非首篇，但我们从独特且鲜明的4D表示视角构建领域覆盖，以建模随时间演变的3D几何，同时展现运动和交互。具体而言，我们并未穷举众多工作，而是采取更具选择性的方法，聚焦代表性工作，以突出每种表示在不同计算、应用和数据场景下的理想特性及随之而来的挑战。我们旨在向读者传达的主要信息是：如何为其任务选择并定制合适的4D表示。在组织上，我们基于三个关键支柱：几何、运动与交互，对4D表示进行划分。我们的讨论不仅涵盖当今最流行的表示，如神经辐射场（NeRFs）和3D高斯泼溅（3DGS），还关注在4D背景下相对未被充分探索的表示，如结构化模型和长程运动。在整个综述中，我们将重新审视大语言模型（LLMs）和视频基础模型（VFMs）在各种4D应用中的作用，同时引导讨论指向它们当前的局限性以及如何解决。我们还专门介绍了目前可用的4D数据集以及推动该子领域前进所缺乏的数据。项目页面：https://mingrui-zhao.github.io/4DRep-GMI/

英文摘要

We present a survey on 4D generation and reconstruction, a fast-evolving subfield of computer graphics whose developments have been propelled by recent advances in neural fields, geometric and motion deep learning, as well as 3D generative artificial intelligence (GenAI). While our survey is not the first of its kind, we build our coverage of the domain from a unique and distinctive perspective of 4D representations, to model 3D geometry evolving over time while exhibiting motion and interaction. Specifically, instead of offering an exhaustive enumeration of many works, we take a more selective approach by focusing on representative works to highlight both the desirable properties and ensuing challenges of each representation under different computation, application, and data scenarios. The main take-away message we aim to convey to the readers is how to select and then customize the appropriate 4D representations for their tasks. Organizationally, we separate the 4D representations based on three key pillars: geometry, motion, and interaction. Our discourse will not only encompass the most popular representations of today, such as neural radiance fields (NeRFs) and 3D Gaussian Splatting (3DGS), but also bring attention to relatively under-explored representations in the 4D context, such as structured models and long-range motions. Throughout our survey, we will reprise the role of large language models (LLMs) and video foundational models (VFMs) in a variety of 4D applications, while steering our discussion towards their current limitations and how they can be addressed. We also provide dedicated coverage of what 4D datasets are currently available, as well as what is lacking, in driving the subfield forward. Project page: https://mingrui-zhao.github.io/4DRep-GMI/

2509.15626 2026-06-17 cs.SD eess.AS 版本更新

LibriTTS-VI: A Public Corpus and Novel Methods for Efficient Voice Impression Control

LibriTTS-VI：用于高效语音印象控制的公开语料库与新方法

Junki Ohmura, Yuki Ito, Emiru Tsunoo, Toshiyuki Sekiya, Toshiyuki Kumakura

AI总结针对数值语音印象控制中缺乏公开语料库和印象泄漏问题，构建首个公开语料库LibriTTS-VI，并提出解耦训练和无参考方法，显著提升控制精度。

Comments Accepted to INTERSPEECH 2026

AI中文摘要

数值语音印象（VI）控制（例如，缩放明亮度）能够在文本到语音（TTS）中实现细粒度控制。然而，它面临两个挑战：缺乏公开语料库和印象泄漏，其中参考音频会使合成语音偏离目标VI。针对第一个挑战，我们引入了LibriTTS-VI，这是基于LibriTTS-R构建的首个公开VI语料库。针对第二个挑战，我们假设单个参考通过纠缠说话人身份和VI导致泄漏。为了缓解这一问题，我们提出：1）使用同一说话人的两个话语进行解耦训练，分别用于说话人和VI条件化；2）一种无参考方法，仅通过目标VI控制印象。实验表明，我们的最佳方法提高了可控性：11维VI均方误差从0.61降至0.41（客观）和从1.15降至0.92（主观）。与基于提示的TTS比较显示，后者存在数值控制不精确以及VI与文本语义纠缠的问题，而我们的方法克服了这些缺陷。

英文摘要

Numerical voice impression (VI) control (e.g., scaling brightness) enables fine-grained control in text-to-speech (TTS). However, it faces two challenges: no public corpus and impression leakage, where reference audio biases synthesized voice away from the target VI. To address the first challenge, we introduce LibriTTS-VI, the first public VI corpus built on LibriTTS-R. For the second, we hypothesize a single reference causes leakage by entangling speaker identity and VI. To mitigate this, we propose 1) disentangled training with two utterances from the same speaker for speaker and VI conditioning, and 2) a reference-free method controlling the impression solely via target VI. Experimentally, our best method improves controllability: 11-dimensional VI mean squared error drops from 0.61 to 0.41 objectively and 1.15 to 0.92 subjectively. A comparison with a prompt-based TTS reveals imprecise numerical control and entanglement between VI and text semantics, which our methods overcome.
