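arXivDaily daily academic digest: synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.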

2606.13602 2026-06-12 cs.AI New submission

EpiBench: Verifiable Evaluation of AI Agents on Epigenomics Analysis

Harihara Muralidharan, Reema Baskar, Soo Hee Lee, Tim Proctor, Kenny Workman

Affiliation: LatchBio

AI summary: Introduces the EpiBench benchmark, which uses 106 evaluation tasks to test AI agents' decision-making in epigenomics workflows. The best system, GPT-5.5 / Pi, passes only 45% of attempts, and most failures stem from a lack of deeper scientific judgment.

Abstract

We introduce EpiBench, a verifiable benchmark for short-horizon epigenomics analysis. EpiBench evaluates whether agents can make well-defined analysis decisions from realistic workflow states and return deterministically gradable answers. The benchmark includes 106 evaluations across CUT&Tag/CUT&RUN, ATAC-seq, ChIP-seq, and DNA methylation workflows. Across 5,088 valid trajectories from 16 model-harness pairs, no system passed a majority of attempts: GPT-5.5 / Pi led at 45.0% (143/318 attempts; 95% confidence interval (CI), 36.3 to 53.7), followed by GPT-5.5 / OpenAI Codex at 39.9% (127/318 attempts; 95% CI, 31.6 to 48.3). Claude Opus 4.8 Max / Pi and GPT-5.4 / Pi each passed 39.0% (124/318 attempts; 95% CI, 30.2 to 47.8 and 31.0 to 47.0, respectively). Performance varies across assay types, and many failed runs still contain parts of the correct answer. Agents often found the right files and computed useful intermediate results, but failed when the task required deeper, assay-specific scientific judgment.
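
The headline figures can be checked with a few lines of arithmetic. The sketch below recomputes the pass rates from the counts quoted in the abstract and confirms that 16 pairs times 318 attempts gives the 5,088 trajectories. The naive binomial (Wald) interval is an added comparison, not the paper's method: it assumes independent attempts, which the abstract does not claim. It comes out at roughly ±5.5 points for the leading system, narrower than the reported interval (about ±8.7 points), so the authors presumably account for repeated attempts on the same evaluation in some way the abstract does not describe.

```python
import math

# Pass counts quoted in the abstract; each system has 318 attempts.
ATTEMPTS = 318
passes = {
    "GPT-5.5 / Pi": 143,
    "GPT-5.5 / OpenAI Codex": 127,
    "Claude Opus 4.8 Max / Pi": 124,
}

def pass_rate(passed, attempts=ATTEMPTS):
    """Pass rate as a percentage."""
    return 100.0 * passed / attempts

def wald_half_width(passed, attempts=ATTEMPTS, z=1.96):
    """Half-width, in percentage points, of a naive binomial 95% interval.

    Assumption of this sketch: attempts are independent draws.
    """
    p = passed / attempts
    return 100.0 * z * math.sqrt(p * (1.0 - p) / attempts)

for name, passed in passes.items():
    print(f"{name}: {pass_rate(passed):.1f}% +/- {wald_half_width(passed):.1f}")

# 16 model-harness pairs, 318 attempts each.
print(16 * ATTEMPTS)
```

Note that 318 is exactly 3 × 106, which suggests three attempts per evaluation; that reading is an inference from the numbers, not something the abstract states.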

2606.13601 2026-06-12 cs.RO eess.SY New submission

MCR-Bionic Hand: Anatomical Structural Priors for Dexterous Manipulation

Haosen Yang, Guowu Wei

Affiliation: University of Salford

AI summary: Presents the MCR-Bionic Hand, a biomimetic robotic hand built on anatomical structural priors of the human hand. Structural intelligence maps low-dimensional control to dexterous manipulation, and contact-rich tasks validate the design.

Abstract

Dexterous robotic hands are usually formulated as high-dimensional active control systems governed by degrees of freedom, actuation, and algorithms. Human hand dexterity, however, is partly encoded in the physical architecture of bones, ligaments, tendons, aponeuroses, and intrinsic muscles. This work describes that contribution as two linked forms of structural intelligence: structural prior generation, in which wrist-to-finger tenodesis, FDS/FDP routing, and the dorsal extensor hood transform low-dimensional posture inputs into default grasp configurations and PIP-to-DIP coordination; and muscle-mediated modulation, in which extrinsic muscles, lumbricals, and interossei regulate MCP posture, distal stability, fingertip force paths, and contact states around that default state. Based on this framework, MCR-Bionic Hand is developed as a 1:1 musculoskeletal biomimetic hand integrating a two-row, eight-bone wrist, cross-wrist tendons, anatomical flexor routing, volar plate and collateral ligament constraints, the dorsal extensor hood, and intrinsic muscle pathways within one body. Functional demonstrations and geometric mechanical models show that wrist posture induces multi-joint pre-shaping, the extensor hood maps PIP posture to a coupled DIP response, and intrinsic-plus pathways modulate distal stability and fingertip action direction after grasp formation. Contact-rich tasks, including coin rotation, pen transfer, dorsal coin flipping, and cube manipulation, show that MCR-Bionic links low-dimensional state generation with fine post-contact modulation. These results suggest that anatomical biomimetics is valuable not for visual similarity, but for identifying human hand structures that perform part of control.

2606.13591 2026-06-12 cs.AI cs.LG cs.MA New submission

Multiagent Protocols with Aggregated Confidence Signals

Ali Elahi, Barbara Di Eugenio

Affiliation: University of Illinois Chicago

AI summary: Proposes three protocols that transform raw confidence signals and combine them by soft voting or Bayesian fusion to output an aggregated confidence for a multiagent system, substantially improving discrimination while keeping correctness stable.

Comments: 22 pages and 5 figures, 9 pages and 2 figures before the appendix

Abstract

Confidence is used for reliability, oversight, and a range of downstream decision tasks in Natural Language Processing (NLP), yet no existing method produces or evaluates a confidence for the output of a multiagent system. Prior work uses confidence within multiagent debate (MAD) to weight messages, trigger debate, or calibrate individual agents, but it never aggregates these into a single confidence for the system itself. We introduce three protocols that produce a final answer along with a single aggregated confidence by first transforming raw confidence signals to make them comparable across models, then combining them via soft voting or a probability fusion we call Bayesian fusion. This aggregated confidence is substantially more discriminative (AUARC) than that of the best single agent or the standard debate baselines, while correctness (F1-score) stays stable and recovers the losses MAD incurs on more ambiguous tasks. Analyzing two estimators, sequence probability and self-report, alongside parametric and non-parametric calibrators, we find that calibration improves F1 for both estimators while AUARC is less reliant on it. We evaluate six homogeneous and heterogeneous debating pairs per benchmark, across five benchmarks and four task types, spanning a range of model capabilities and sizes.
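
To make the two combination rules concrete, here is a minimal sketch of soft voting and of a naive-Bayes style probability fusion. The abstract does not give the protocols' formulas, so both functions are generic textbook versions, the function names are hypothetical, and the calibration step that makes confidences comparable across models is assumed to have happened already.

```python
from collections import defaultdict

def soft_vote(votes):
    """Pick the answer with the most total confidence.

    votes: list of (answer, confidence) pairs, confidence in [0, 1].
    Returns (answer, aggregated_confidence), where the aggregate is the
    winning answer's confidence mass averaged over all agents.
    """
    totals = defaultdict(float)
    for answer, conf in votes:
        totals[answer] += conf
    best = max(totals, key=totals.get)
    return best, totals[best] / len(votes)

def fuse_agreeing(confidences):
    """Naive-Bayes fusion for agents that all back the same answer.

    Assumptions of this sketch: agents are independent, the prior is
    uniform, and every confidence lies strictly between 0 and 1.
    The odds are multiplied and converted back to a probability.
    """
    yes, no = 1.0, 1.0
    for p in confidences:
        yes *= p
        no *= 1.0 - p
    return yes / (yes + no)

answer, confidence = soft_vote([("A", 0.9), ("A", 0.6), ("B", 0.7)])
print(answer, round(confidence, 2))
print(round(fuse_agreeing([0.8, 0.7]), 3))
```

Under the independence assumption, two agreeing agents at 0.8 and 0.7 fuse to a confidence above either input, which is the usual behaviour of product-of-odds fusion and the reason the inputs need to be calibrated first.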

2606.13589 2026-06-12 cs.LG New submission

Simplex-Constrained Sparse Bagging: Transitioning from Uniform Priors to Sparse Posteriors in Ensemble Learning

Meher Sai Preetam, Meher Bhaskar

Affiliation: Georgia Institute of Technology

AI summary: Proposes the SCSB framework, which jointly optimizes ensemble pruning and calibration over the probability simplex by minimizing out-of-bag loss. A concave quadratic penalty resolves the L1-simplex paradox, giving up to 96% compression and better calibration.

Comments: 6 pages, 3 tables

Abstract

We present Simplex-Constrained Sparse Bagging (SCSB), a mathematically rigorous framework for post-training compression and probability calibration of bootstrap-based bagging ensembles. Standard bagging ensembles (such as Random Forests, Bagged SVMs, and Bagged Neural Networks) assign uniform voting power to all constituent estimators. However, this naive uniform prior ignores the varying local competence of base estimators and contributes to model overconfidence. We formulate ensemble pruning and calibration as a joint optimization problem over the probability simplex by minimizing the Out-Of-Bag (OOB) loss. To induce sparsity, we address the theoretical "L1-simplex paradox" (the mathematical reality that the L1 norm is constant on the simplex and fails to prune) by introducing a concave quadratic penalty. SCSB is model-agnostic and achieves up to 96% ensemble compression, yielding linear inference speedups and superior probability calibration (lowered Expected Calibration Error) while preserving or enhancing generalization accuracy.
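
The "L1-simplex paradox" is easy to verify numerically: on the simplex every weight is non-negative and the weights sum to 1, so the L1 norm is always exactly 1 and cannot tell a sparse vector from a dense one. The sketch below contrasts that with one concave quadratic penalty, 1 - ||w||², which is largest at uniform weights and zero at a one-hot vertex. The specific penalty form is an assumption for illustration; the abstract only says the penalty is concave and quadratic.

```python
def l1_norm(weights):
    """Sum of absolute values; constant (= 1) anywhere on the simplex."""
    return sum(abs(w) for w in weights)

def concave_penalty(weights):
    """1 - ||w||_2^2: concave, zero at a vertex, largest at uniform weights.

    The exact form is an assumption of this sketch, not taken from the paper.
    """
    return 1.0 - sum(w * w for w in weights)

candidates = {
    "uniform": [0.25, 0.25, 0.25, 0.25],
    "mixed": [0.7, 0.3, 0.0, 0.0],
    "one-hot": [1.0, 0.0, 0.0, 0.0],
}
for name, w in candidates.items():
    print(name, l1_norm(w), round(concave_penalty(w), 2))
```

Minimizing a loss plus this penalty therefore pushes weight toward a few estimators, which is the pruning effect an L1 term cannot provide under the simplex constraint.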

2606.13580 2026-06-12 cs.CV cs.AI New submission

EvTexture++: Event-Driven Texture Enhancement for Video Super-Resolution

Dachun Kai, Jiayao Lu, Yueyi Zhang, Xiaoyan Sun

Affiliations: MOE Key Laboratory of Brain-Inspired Intelligent Perception and Cognition, University of Science and Technology of China; Midea Group

AI summary: Proposes EvTexture++, the first event-driven texture enhancement framework for video super-resolution. It uses the high-frequency spatiotemporal detail in events to recover texture progressively, adds a temporal texture alignment module for inter-frame consistency, and reaches state-of-the-art results on multiple datasets.

Comments: IEEE TPAMI 2026. Extended version of arXiv:2406.13457 (ICML 2024). Project page: this https URL

Abstract

Event-based vision has drawn increasing attention owing to its distinctive properties, including ultra-high temporal resolution and extreme dynamic range. Recent works have introduced it to video super-resolution (VSR) to enhance flow estimation and temporal alignment. In contrast, this paper shifts the focus of event signals from motion refinement to texture enhancement in VSR. We propose EvTexture++, the first event-driven framework dedicated to texture enhancement in VSR. It leverages high-frequency spatiotemporal details from events to improve texture recovery. EvTexture++ incorporates a customized texture enhancement branch, along with an iterative texture enhancement module that progressively exploits high-temporal-resolution event information for texture restoration. This enables gradual refinement of texture regions across iterations, yielding more accurate and detailed high-resolution outputs. Besides intra-frame texture recovery, large motions could degrade inter-frame temporal consistency, particularly in texture regions, leading to texture flickering. To mitigate this, we further exploit the continuous-time motion cues of events to enhance temporal consistency, introducing a temporal texture alignment module that estimates event-guided texture-aware flow for precise inter-frame texture alignment. Moreover, EvTexture++ is designed as a plug-and-play tool to flexibly boost the performance of existing VSR models. Experiments on five datasets demonstrate that EvTexture++ achieves state-of-the-art performance. When integrated into recent VSR models, it yields significant improvements, with gains of up to 1.55 dB in PSNR on the texture-rich Vid4 dataset. Code: this https URL.
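
For readers less used to decibel figures, the sketch below converts the quoted 1.55 dB PSNR gain into a change in mean squared error, using the standard definition PSNR = 10 log10(peak² / MSE). It is a unit conversion only and says nothing about how the paper computes PSNR (for example, which colour channel or crop is used).

```python
import math

def psnr(mse, peak=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(peak * peak / mse)

def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks for a PSNR gain of gain_db decibels."""
    return 10.0 ** (-gain_db / 10.0)

ratio = mse_ratio_for_gain(1.55)
print(f"MSE after / MSE before = {ratio:.3f}")
```

A 1.55 dB gain corresponds to the mean squared error dropping to about 70% of its previous value, that is, a reduction of roughly 30%.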

2606.13578 2026-06-12 cs.CL cs.AI cs.LG cs.MM cs.RO New submission

LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories

Baochang Ren, Xinjie Liu, Xi Chen, Yanshuo Liu, Chenxi Li, Daqi Gao, Zeqin Su, Jintao Xing, Zirui Xue, Rui Li, Xiangyu Zhao, Shuofei Qiao, Minting Pan, Wangmeng Zuo, Lei Bai, Dongzhan Zhou, Ningyu Zhang, Huajun Chen

Affiliations: Zhejiang University; Shanghai AI Laboratory; Harbin Institute of Technology

AI summary: To address the data and embodiment bottlenecks that robots face when executing protocols in scientific laboratories, proposes the simulation data engine RoboGenesis and the two-stage-trained policy LabVLA, which achieves the highest average success rate on the LabUtopia benchmark.

Comments: Work in progress. Project website at this https URL

Abstract

Scientific laboratories increasingly rely on AI systems to reason about experiments, but the physical act of doing science remains largely outside their reach. AI can help read literature, generate hypotheses, and plan protocols, yet the execution of those protocols at the bench still requires a human operator. Vision-Language-Action (VLA) models provide one possible interface between written protocols and robot execution, but existing policies are trained mostly on household and tabletop demonstrations and rarely encounter the instruments, transparent liquids, or fixed protocol workflows found in scientific laboratories. Closing this gap requires both laboratory-specific supervision and a unified learning framework that can accommodate the diverse robot embodiments used to execute experimental protocols. We therefore identify data and embodiment as central bottlenecks alongside model design. To address the data side, we build RoboGenesis, a simulation-based workflow and data engine that composes configured laboratory workflows from atomic skills, validates and filters rollouts, and exports structured demonstrations across supported robot profiles. On the policy side, we present LabVLA, trained with a two-stage recipe: FAST action token pretraining first makes the Qwen3-VL-4B-Instruct backbone action aware before any continuous control is learned, and flow matching posttraining then attaches a DiT action expert under knowledge insulation. On the LabUtopia benchmark, LabVLA achieves the highest average success rate among all evaluated baselines under both in-distribution and out-of-distribution settings.

2606.13576 2026-06-12 cs.LG cs.CC cs.DS stat.ML New submission

Learning with Simulators: No Regret in a Computationally Bounded World

Sasha Voitovych, Abhishek Shetty, Noah Golowich, Alexander Rakhlin

Affiliations: MIT; Microsoft Research

AI summary: Proposes the framework of simulatable processes, in which a simulator approximates an arbitrarily complex, dependent data distribution. The framework recovers VC-dimension error bounds and shows statistical and computational advantages of conditional sampling.

Comments: To appear at COLT 2026

Abstract

Understanding the minimal assumptions necessary for generalization is the fundamental question in learning theory. Unfortunately, most results rely heavily on independence (or some proxy thereof) of the data-generating process, while results for strongly dependent data are far more limited. Towards addressing this gap, we introduce the framework of simulatable processes, where the learner has access to a simulator that approximates the distribution generating the data (which may be an arbitrarily complex and dependent process). Surprisingly, given access to such a simulator, we show that we can recover the same learning guarantees as in the classical setting with independent data, namely, error bounds that depend on the VC dimension. Further, we use this framework to study the power of conditional sampling and show strict statistical and computational advantages in this setting. As a highlight of our framework, we exhibit a single algorithm that simultaneously learns any given VC class under all processes samplable in bounded polynomial time, with regret controlled by the time-bounded Kolmogorov complexity of the process. This provides a significant conceptual broadening of the classical PAC model.

2606.13572 2026-06-12 cs.CL cs.AI New submission

ArogyaSutra: A Multi-Agent Framework for Multimodal Medical Reasoning in Indic Languages

Tanmoy Kanti Halder, Akash Ghosh, Subhadip Baidya, Arijit Roy, Sriparna Saha

Affiliations: Indian Institute of Technology Patna; Indian Institute of Technology Kanpur; Prasannadeb Women’s College

AI summary: To address the weak performance of multimodal large language models in Indic-language medical settings, proposes the multimodal medical question-answering dataset ArogyaBodha and the actor-critic multi-agent framework ArogyaSutra, which uses tool grounding and dual-memory mechanisms to improve multilingual medical reasoning accuracy.

Abstract

Multimodal Large Language Models (MLLMs) have shown promising reasoning capabilities in general domains, yet their performance remains limited in specialized settings such as healthcare, especially in multilingual and low-resource scenarios. This gap is critical in regions like rural India, where patients often express complex medical queries in native Indic languages and rely on multimodal inputs such as medical images. Existing English-centric MLLMs struggle to support such use cases, limiting equitable access to AI-driven healthcare assistance. To address this challenge, we introduce ArogyaBodha, a large-scale multilingual multimodal medical question-answer dataset constructed from eight heterogeneous sources, covering 31 body systems, six imaging modalities, and 21 clinical domains across English and seven major Indian languages. We further propose ArogyaSutra, an actor-critic-based multi-agent framework that integrates tool grounding with dual-memory mechanisms for step-wise, reasoning-aware decision making, and uses stored actor-critic simulation trajectories for distillation. Experiments show that our dataset and framework improve multilingual medical reasoning accuracy across all Indic languages, with ablations validating the contribution of each component. The source code and dataset are available at: this https URL ArogyaSutra/

2606.13571 2026-06-12 cs.LG cs.AI New submission

Existence Precedes Value: Joint Modeling of Observational Existence and Evolving States in Time Series Forecasting

Yifan Hu, Hongzhou Chen, Peiyuan Liu, Yiding Liu, Zewei Dong, Jiang-Ming Yang

Affiliation: Ant International

AI summary: Proposes the Timeflies framework, which jointly models whether a future observation will occur (existence) and what its value will be, coupling an observation stream and a value stream through dedicated modules to improve forecasting on time series with missing values.

Abstract

Real-world time series are often highly incomplete and irregular due to sensor dormancy, transmission delays, and event-driven sampling, making reliable forecasting fundamentally challenging. Existing methods have evolved from impute-then-forecast pipelines to continuous-time models such as Neural ODEs and continuous-time graph networks. While these approaches improve the modeling of historical irregularity, they still rely on an implicit oracle assumption at inference time: the timestamps of future valid observations are presumed to be known in advance. This assumption limits practical relevance, since in many real systems the more fundamental question is not only what the future value will be, but also whether a valid observation will occur at all. In this paper, we propose Timeflies, a unified framework that reformulates forecasting as a joint problem of future observability inference and value estimation. To explicitly model the interaction between observation dynamics and state evolution, Timeflies adopts an observation stream and a value stream, coupled through three dedicated modules for reliability-aware embedding, observation-guided dependency modeling, and joint prediction. We further construct Shadow, a benchmark that combines natural missingness from public datasets with real-world industrial data, and introduce the Observation-Value Joint Entropy (OVJE) metric to comprehensively evaluate this coupled predictability. Extensive experiments show that Timeflies consistently outperforms existing methods, highlighting the importance of explicitly modeling future observability in time series forecasting with missing values. Code and dataset are available in this https URL.
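
One simple way to picture "future observability inference plus value estimation" as a single objective is a two-term loss: binary cross-entropy on whether each future step is observed, plus squared error on the steps that were actually observed. The sketch below is only that generic construction. The function name, the weighting parameter, and the loss form are assumptions; the abstract does not describe the Timeflies training objective or the OVJE metric.

```python
import math

def joint_loss(obs_probs, value_preds, obs_mask, values, value_weight=1.0):
    """Cross-entropy on observability plus squared error on observed values.

    obs_probs:   predicted probability that each future step is observed
    value_preds: predicted value at each future step
    obs_mask:    1 if the step was actually observed, else 0
    values:      true values (ignored where obs_mask is 0, may be None)
    """
    eps = 1e-12
    bce, sq_err, n_obs = 0.0, 0.0, 0
    for p, v_hat, m, v in zip(obs_probs, value_preds, obs_mask, values):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        bce -= m * math.log(p) + (1 - m) * math.log(1.0 - p)
        if m:
            sq_err += (v_hat - v) ** 2
            n_obs += 1
    bce /= len(obs_mask)
    mse = sq_err / n_obs if n_obs else 0.0
    return bce + value_weight * mse

good = joint_loss([0.99, 0.01], [1.0, 0.0], [1, 0], [1.0, None])
bad = joint_loss([0.5, 0.5], [2.0, 0.0], [1, 0], [1.0, None])
print(good < bad)
```

Masking the value term matters: without it, a model would be penalized for its value guess at steps where no ground truth exists.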

2606.13562 2026-06-12 cs.CV cs.AI New submission

Contrast-Informed Augmentation and Domain-Adversarial Training for Adult-to-Neonatal MR Reconstruction Generalization

Stephen Moore, Lara Leijser, Richard Frayne, Roberto Souza

Affiliations: University of Calgary; Seaman Family MR Research Centre, Foothills Medical Centre; Hotchkiss Brain Institute, University of Calgary; Pediatrics, Division of Neonatology, University of Calgary; Alberta Children’s Hospital Research Institute, University of Calgary; Radiology and Clinical Neuroscience, University of Calgary; Electrical and Software Engineering, University of Calgary

AI summary: Studies whether contrast-informed augmentation and domain-adversarial training improve the adult-to-neonatal generalization of E2E-VarNet MR reconstruction. Mixed domain-adversarial training gives the best SSIM at both R=4 and R=8 and the best PSNR at R=4, while plain mixed training has the best PSNR at R=8.

Comments: 24 pages, 1 table, 7 figures

Abstract

Purpose: To investigate whether contrast-informed data augmentation and domain-adversarial training improve the adult-to-neonatal generalization of the E2E-VarNet. Methods: Three training regimes were investigated: (1) adult-only training with unaugmented adult data, (2) mixed training with paired unaugmented and neonatal-informed augmented adult data, and (3) mixed training with a domain-adversarial objective. Models were trained on retrospectively undersampled multi-coil adult T2-weighted brain MR data and evaluated on neonatal and adult test data at acceleration factors R=4 and R=8 using quantitative metrics and qualitative evaluation. Feature analyses assessed whether domain-adversarial training altered the latent representations of unaugmented adult, augmented adult, and neonatal test samples. Results: Mixed training (Mixed) and mixed domain-adversarial training (Mixed-DAT) outperformed unaugmented adult-only training (Unaug-Only) when evaluated on neonatal data. At R=4, Mixed-DAT achieved the best performance (SSIM = 0.924 ± 0.027, PSNR = 33.98 ± 1.15 dB). At R=8, Mixed-DAT performed best when measured using SSIM (0.848 ± 0.031 vs. 0.766 ± 0.037 for Unaug-Only and 0.814 ± 0.035 for Mixed) and Mixed performed best when measured using PSNR (29.56 ± 0.83 dB vs. 26.26 ± 0.78 dB for Unaug-Only and 29.43 ± 0.83 dB for Mixed-DAT). Qualitative assessment of t-SNE plots suggested that Mixed-DAT increased the overlap among the latent representations of the unaugmented adult, augmented adult, and neonatal test data. Conclusion: Contrast-informed augmentation and domain-adversarial training improved adult-to-neonatal generalization of deep learning-based MR reconstruction. These findings suggest that contrast-informed data augmentation combined with adversarial training may improve robustness to domain shift in undersampled neonatal MR reconstruction.

2606.13558 2026-06-12 cs.CV cs.CL New submission

Edit the Bits, Diff the Codes: Bitwise Residual Editing for Visual Autoregressive Models

Shengqiang Zhang, Ruotong Liao, Volker Tresp, Barbara Plank, Hinrich Schütze

Affiliation: LMU Munich & Munich Center for Machine Learning (MCML)

AI summary: Proposes BitResEdit, a training-free image editing method for visual autoregressive models that combines bit-level source-negative guidance with residual code injection to achieve strong text alignment while preserving the background.

Abstract

Text-guided image editing with visual autoregressive (VAR) generators requires controlling both what the model samples and where the sampled change is written back into the image code. Existing VAR editors mainly operate on token streams, features, or flat next-token logits, leaving two native structures of bitwise-residual VAR models underused: the per-bit Bernoulli prediction head and the additive multi-scale residual code field from which the image is assembled. We propose BitResEdit, a training-free editor for bitwise-residual VAR generators such as Infinity. BitEdit performs source-negative guidance by tilting the post-CFG per-bit log-odds along a source-target contrast computed on a shared edited prefix, then projects each update into a closed-form Bernoulli-KL trust region around the clean CFG sampler. ResEdit converts the sampled bits into per-scale continuous-code residuals, gates them with a localization mask, and re-injects them through the generator's native sum-of-scales. Together they couple decision-time bit guidance with combination-time code composition, so masked-out latent features are preserved exactly by code arithmetic while localized, scale-aware edits are applied inside the target region. On PIE-Bench with Infinity-2B, BitResEdit attains the strongest text alignment among same-backbone VAR editors, improving CLIP on the edited region by +1.07 over the strongest prior editor while keeping background preservation competitive with it. Ablations show BitEdit and ResEdit play complementary roles in target alignment and background preservation.
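
The two mechanisms can be illustrated on a single bit and a single code vector. The sketch below tilts one bit's log-odds along a target-minus-source contrast and then shrinks the step until the Bernoulli KL divergence from the clean prediction stays within a budget. The paper describes a closed-form projection; this sketch swaps in a plain bisection search because the closed form is not given in the abstract. The second function shows mask-gated residual injection. All names and parameters are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bernoulli_kl(p, q):
    """KL(Bernoulli(p) || Bernoulli(q)), clamped to avoid log(0)."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def guided_logit(clean, source, target, scale, max_kl, iters=50):
    """Tilt a per-bit logit toward the target, limited by a KL budget.

    Bisection stands in for the paper's closed-form projection.
    """
    step = scale * (target - source)
    def kl_at(t):
        return bernoulli_kl(sigmoid(clean + t * step), sigmoid(clean))
    if kl_at(1.0) <= max_kl:
        return clean + step
    lo, hi = 0.0, 1.0  # KL grows monotonically with the step fraction
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if kl_at(mid) <= max_kl:
            lo = mid
        else:
            hi = mid
    return clean + lo * step

def gated_residual(source_code, edited_code, mask):
    """Inject the edit residual only where mask is 1; elsewhere keep source."""
    return [s + m * (e - s) for s, e, m in zip(source_code, edited_code, mask)]

z = guided_logit(0.0, -1.0, 2.0, scale=1.0, max_kl=0.05)
print(round(z, 3), gated_residual([1.0, 2.0], [5.0, 7.0], [0, 1]))
```

Where the mask is 0 the residual term is multiplied by zero, so the source code value is returned unchanged, which mirrors the abstract's claim that masked-out features are preserved by code arithmetic.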

2606.13556 2026-06-12 cs.AI cs.HC q-bio.BM q-bio.GN q-bio.MN New submission

Is It You or Your Environment? A Bayesian Inference Framework for Genomically-Anchored Personalized Physiological Interpretation

Aruna Dey, Suraj Biswas

Affiliation: Dots-In

AI summary: Proposes a Bayesian inference framework that uses genomic priors to address the cold-start problem in personalized health AI. Genomic anchoring separates the constitutional and environmental components of physiological signals, and the prior is updated dynamically as data accumulate.

Comments: 24 pages, 8 figures, 3 tables. Conceptual framework paper

Abstract

Personalized health AI systems face a fundamental cold-start problem: machine learning models for physiological interpretation require weeks of individual behavioral data before they can distinguish constitutional variation from environmentally driven deviation. We propose a solution grounded in causal inference and Bayesian prior design. An individual's genomic profile serves as an exogenous genetic anchor: a domain-informed, personalized prior that is fixed at conception, immune to reverse causation, and available before a single behavioral observation is collected. The anchor initializes a Bayesian belief state over an individual's physiological set point G-hat = mu + sum(beta_i * g_i), where beta_i are GWAS-derived effect sizes and g_i are risk-allele counts. Each incoming physiological measurement P produces a non-constitutional deviation delta = P - G-hat that separates the signal attributable to environment and state from the constitutionally fixed baseline. As behavioral data accrue, the prior decays according to G-hat_t = w(t)*G-hat_genomic + [1-w(t)]*P-bar_t, transitioning from genome-dominated to empirical-baseline-dominated inference. The same observed HRV of 55 ms generates a suppression hypothesis for a person whose prior predicts 80 ms, and an enhancement hypothesis for a person whose prior predicts 30 ms, a reversal impossible without a personalized anchor. We develop this architecture across six physiological domains, grading genomic priors by evidence strength, distinguishing robustly replicated anchors (FTO, FADS1/2, FKBP5) from contested candidate genes (SLC6A4, MAOA, DRD2). We address the inference boundary between association, Mendelian randomization, and individual token causation, and define four constraints for deployment: evidence-graded priors, dynamic decay, ancestry-matched effect sizes, and attribution rather than deterministic output.
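
The three formulas in the abstract translate directly into code. The sketch below implements the set point G-hat = mu + sum(beta_i * g_i), the deviation delta = P - G-hat, and the decaying blend G-hat_t = w(t)*G-hat_genomic + [1-w(t)]*P-bar_t, then replays the HRV example. The abstract does not specify w(t), so the exponential decay and its 30-day time constant are assumptions of this sketch, as are the function names and the toy effect sizes.

```python
import math

def genomic_set_point(mu, betas, allele_counts):
    """G-hat = mu + sum(beta_i * g_i)."""
    return mu + sum(b * g for b, g in zip(betas, allele_counts))

def deviation(measurement, set_point):
    """delta = P - G-hat; negative suggests suppression, positive enhancement."""
    return measurement - set_point

def blended_set_point(g_genomic, empirical_mean, t_days, tau_days=30.0):
    """G-hat_t = w(t) * G-hat_genomic + (1 - w(t)) * P-bar_t.

    Assumption: w(t) = exp(-t / tau); the paper's decay schedule may differ.
    """
    w = math.exp(-t_days / tau_days)
    return w * g_genomic + (1.0 - w) * empirical_mean

# Toy effect sizes and allele counts, invented for illustration only.
print(genomic_set_point(50.0, [2.0, -1.0], [1, 2]))

# The abstract's example: the same 55 ms HRV reading, two different priors.
print(deviation(55.0, 80.0))  # prior predicts 80 ms
print(deviation(55.0, 30.0))  # prior predicts 30 ms
print(blended_set_point(80.0, 60.0, t_days=0.0))
```

The sign flip between the two deviations (-25 ms versus +25 ms) is the reversal the abstract describes: the reading is identical, and only the personalized anchor changes its interpretation.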

2606.13550 2026-06-12 cs.AI cs.CL New submission

Uncertainty-Aware Hybrid Retrieval for Long-Document RAG

Hoin Jung, Xiaoqian Wang

Affiliation: Elmore Family School of Electrical and Computer Engineering, Purdue University

AI summary: Proposes UMG-RAG, a training-free hybrid retrieval framework that fuses dense and sparse retrieval results using multi-granularity chunking and uncertainty estimation to improve long-document question answering.

Abstract

Retrieval-augmented generation (RAG) depends critically on the quality and granularity of retrieved evidence. Large retrieval units preserve context but often introduce irrelevant content, which can dilute answer-bearing evidence and worsen long-context utilization. Fine-grained units are more compact, but they may be difficult to retrieve reliably because short chunks can lack semantic, lexical, or bridging cues needed to match the query. We propose Uncertainty-aware Multi-Granularity RAG (UMG-RAG), a training-free hybrid retrieval framework that treats chunk granularity as query-specific reliability estimation. Instead of training a new retriever or modifying the generator, UMG-RAG uses existing dense and sparse retrievers as complementary experts across multiple chunk granularities. For each query, it converts each expert-granularity score list into an evidence distribution, estimates reliability from distribution entropy, and fuses candidates according to query-specific semantic, lexical, and granularity confidence. We further introduce UMGP-RAG, a parent-promotion variant that uses fine-grained hits to locate relevant evidence while returning broader non-redundant parent chunks for local coherence. Experiments on question answering benchmarks show that uncertainty-aware fusion and parent promotion improve generation quality while maintaining a lightweight, plug-and-play retrieval pipeline.
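
The per-query pipeline described above (scores to distribution, entropy to reliability, reliability-weighted fusion) can be sketched in a few functions. The softmax normalization, the "1 minus normalized entropy" reliability score, and the additive fusion rule are all assumptions chosen for illustration; the abstract names the steps but not their formulas. Expert names and chunk ids are made up.

```python
import math

def softmax(scores, temperature=1.0):
    """Turn a raw score list into a probability distribution."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def reliability(probs):
    """1 - normalized entropy: near 1 for a peaked list, 0 for a flat one."""
    if len(probs) < 2:
        return 1.0
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return 1.0 - entropy / math.log(len(probs))

def fuse(expert_scores):
    """expert_scores: {expert_name: {chunk_id: raw_score}}.

    Returns chunk ids ranked by reliability-weighted probability mass.
    """
    fused = {}
    for scores in expert_scores.values():
        ids = list(scores)
        probs = softmax([scores[i] for i in ids])
        weight = reliability(probs)
        for chunk_id, p in zip(ids, probs):
            fused[chunk_id] = fused.get(chunk_id, 0.0) + weight * p
    return sorted(fused, key=fused.get, reverse=True)

experts = {
    "dense@256": {"c1": 5.0, "c2": 0.0, "c3": 0.0},   # confident expert
    "sparse@1024": {"c1": 1.0, "c2": 1.0, "c3": 1.0},  # uninformative expert
}
print(fuse(experts))
```

An expert that scores every chunk the same has maximal entropy and contributes almost nothing, so the ranking is driven by the expert and granularity that actually discriminate for this query.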

2606.13537 2026-06-12 cs.CL 新提交

When Does Mixing Help? Analyzing Query Embedding Interpolation in Multilingual Dense Retrieval

何时混合有帮助？分析多语言稠密检索中的查询嵌入插值

Tongyao Zhu, Chao-Ming Huang, Min-Yen Kan

发表机构 * National University of Singapore（新加坡国立大学）

AI总结通过嵌入级插值构造混合查询，系统研究多语言稠密检索对混合语言查询的敏感性，发现最优混合比在多数情况下优于单语言查询，且英语主导性导致不对称性。

Comments: ACL 2026 Main (Oral)

AI中文摘要

虽然混合语言查询在多语言社区中普遍存在，但稠密检索器对此类查询的敏感性仍知之甚少。我们在mMARCO上进行了比例控制研究，通过嵌入级混合——将混合查询构建为单语言嵌入的插值——系统地评估了改变平行查询翻译混合比例时的检索性能。使用BGE-M3的实验表明，在88/105个案例中，最优混合比优于最佳单语言端点。我们发现了由英语主导性驱动的明显不对称性：当从非英语文档索引中检索时，混合普遍有益，而包含英语的索引则最好使用纯英语查询。此外，对于每种非英语文档语言，英语都是最强的混合伙伴。最后，在控制英语主导性后，混合收益与类型学距离呈负相关。我们得出结论，语言混合敏感性是有结构且可预测的，并且我们验证了这些模式在模型家族和规模上的鲁棒性。

英文摘要

While mixed-language querying is ubiquitous in multilingual communities, the sensitivity of dense retrievers to such queries remains poorly understood. We present a ratio-controlled study on mMARCO that systematically evaluates retrieval performance by varying the mixing proportion of parallel query translations via embedding-level mixing -- constructing mixed queries as an interpolation of monolingual embeddings. Experiments with BGE-M3 demonstrate that an optimal mixing ratio outperforms the best monolingual endpoint in 88/105 cases. We uncover a distinct asymmetry driven by English dominance: mixing is uniformly beneficial when retrieving from non-English document indices, whereas indices containing English are best served by pure English queries. Furthermore, English acts as the strongest mixing partner for every non-English document language. Finally, when controlling for English dominance, mixing gains correlate negatively with typological distance. We conclude that language-mix sensitivity is structured and predictable, and we validate the robustness of these patterns across model families and scales.
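
Embedding-level mixing, as described in this abstract, amounts to interpolating two monolingual query embeddings and sweeping the ratio. The sketch below uses toy 2-D vectors and cosine similarity to a single document; the re-normalization step, the 11-point sweep, and the function names are assumptions for illustration, and the paper itself evaluates retrieval metrics on mMARCO rather than a single similarity score.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def mix_queries(emb_a, emb_b, alpha):
    """Linear interpolation: alpha=1 is pure language A, alpha=0 is pure language B."""
    mixed = [alpha * a + (1.0 - alpha) * b for a, b in zip(emb_a, emb_b)]
    return normalize(mixed)

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))

def best_ratio(emb_a, emb_b, doc_emb, steps=10):
    """Sweep the mixing ratio and return the (alpha, similarity) pair with the top score."""
    candidates = [i / steps for i in range(steps + 1)]
    scored = ((a, cosine(mix_queries(emb_a, emb_b, a), doc_emb)) for a in candidates)
    return max(scored, key=lambda pair: pair[1])

# Toy vectors standing in for two monolingual query embeddings and one relevant document.
en, de, doc = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
alpha, score = best_ratio(en, de, doc)
```

Here the document sits exactly between the two endpoints, so the sweep selects the even mix; with real embeddings the best ratio would depend on the query and index languages.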

2606.13528 2026-06-12 cs.CV 新提交

What's Old is New Again: Classical Dimensionality Reduction for Efficient Saliency-Guided Biometric Attack Detection

旧法新用：经典降维方法用于高效显著性引导的生物特征攻击检测

Samuel Webster, Walter Scheirer

发表机构 * University of Notre Dame（圣母大学）

AI总结提出使用PCA和LDA等经典降维方法直接从训练数据生成显著性图，无需人工标注，在五个生物特征攻击检测领域超越基线甚至达到最优性能。

Comments: 16 pages (8 main, 2 references, 6 appendix), 4 figures (3 main, 1 appendix), 13 tables (3 main, 10 appendix)

AI中文摘要

显著性引导训练是一种视觉识别范式，鼓励模型在学习过程中关注最相关的图像区域。尽管其在生物特征呈现攻击检测（PAD）中的应用在鲁棒性和泛化性方面显示出显著优势，但由于现有显著性获取方法（如有限数据集上的人工标注）成本高、领域特异性强且可扩展性有限，其采用往往受到限制。我们提出了一种新颖、成本效益高且高度可扩展的显著性获取方法，使用受经典降维技术PCA和LDA启发的图。我们提出的方法直接从原始训练数据生成显著性图，无需人工标注或领域知识。我们在三个显著性探索领域（虹膜PAD、合成人脸检测、指纹PAD）中情境化这些显著性源的有效性，并在两个显著性新颖领域（指纹静脉PAD和身份证PAD）中展示了其可扩展性。在所有测试领域中，使用降维来源的显著性图训练的模型在没有任何资源投入或特定领域工具的情况下，超过了基线甚至有时是最先进的显著性方法。我们的发现克服了显著性引导训练在生物特征攻击检测及更广泛领域中一个重要但尚未解决的障碍。

英文摘要

Saliency-guided training is a paradigm in visual recognition that encourages models to focus on the most relevant image regions during learning. While its application in biometric presentation attack detection (PAD) has shown strong benefits in robustness and generalization, adoption is often limited by the high cost, domain specificity, and limited scalability of existing saliency acquisition methods, such as human annotations over a limited dataset. We present a novel, cost-efficient, and highly scalable approach to saliency acquisition using maps inspired by classical dimensionality reduction techniques: PCA and LDA. Our proposed methods generate saliency maps directly from raw training data, requiring neither human annotation nor domain knowledge. We contextualize the effectiveness of these saliency sources in three saliency-explored domains (iris PAD, synthetic face detection, fingerprint PAD) and demonstrate their scalability in two saliency-novel domains (fingerprint vein PAD and ID card PAD). Across all domains tested, models trained using dimensionality reduction-sourced saliency maps exceed baseline and sometimes SOTA saliency methods without any resource investment or domain-specific tooling. Our findings overcome an important yet unaddressed barrier to saliency-guided training for biometric attack detection and beyond.

2606.13515 2026-06-12 cs.CV cs.LG cs.RO 新提交

MaskWAM: Unifying Mask Prompting and Prediction for World-Action Models

MaskWAM：统一掩码提示与预测的世界-动作模型

Hanyang Yu, Haitao Lin, Jingbo Zhang, Wenyao Zhang, Chenghao Gu, Heng Li, Ping Tan

发表机构 * The Hong Kong University of Science and Technology（香港科技大学）； Tencent Robotics X（腾讯机器人X实验室）； Tsinghua University（清华大学）

AI总结提出MaskWAM，通过统一掩码输入与预测的混合Transformer架构，解决世界-动作模型的空间瓶颈，提升策略泛化能力，在LIBERO等任务上显著优于基线。

AI中文摘要

世界-动作模型（WAMs）通过视频预测为机器人控制提供了一种有前景的范式。然而，当前的WAMs存在根本性的空间瓶颈：标准文本输入在杂乱场景中引入指代歧义，而非结构化的RGB预测缺乏语义基础，并受任务无关背景的偏差影响。为克服这些限制，我们引入了MaskWAM，一种以对象为中心的世界-动作模型。通过统一的混合Transformer（MoT）将掩码同时作为显式输入和预测进行联合集成，MaskWAM实现了鲁棒的策略泛化。该设计提供两个关键优势：（1）预测未来掩码产生以对象为中心的语义监督，抑制视觉噪声，显著增强甚至标准文本条件的WAMs；（2）将此预测监督与第一帧视觉提示（如目标对象掩码）耦合，建立精确的空间锚点，大幅减少语言歧义。关键在于，由于WAMs本质上是视觉驱动的架构，直接掩码条件化比单独文本提供更强的引导，为操作未见对象建立了精确且鲁棒的范式。在LIBERO、RoboTwin和真实世界任务上的评估表明，MaskWAM在语言清晰和语言模糊任务中均显著优于基线。

英文摘要

World Action Models (WAMs) present a promising paradigm for robotic control via video prediction. However, current WAMs suffer from fundamental spatial bottlenecks: standard text inputs introduce referential ambiguity in cluttered scenes, while unstructured RGB predictions lack semantic grounding and remain biased by task-irrelevant backgrounds. To overcome these limitations, we introduce MaskWAM, an object-centric world-action model. By jointly integrating masks as both explicit inputs and predictions via a unified Mixture of Transformers (MoT), MaskWAM unlocks robust policy generalization. This design provides two key benefits: (1) predicting future masks yields object-centric semantic supervision that suppresses visual noise, significantly enhancing even standard text-conditioned WAMs; and (2) coupling this predictive supervision with first-frame visual prompts, such as target object masks, establishes a precise spatial anchor that substantially reduces language ambiguity. Crucially, as WAMs are inherently vision-driven architectures, direct mask conditioning yields substantially stronger guidance than text alone, establishing a precise and robust paradigm for manipulating unseen objects. Evaluations on LIBERO, RoboTwin, and real-world tasks demonstrate that MaskWAM significantly outperforms baselines in both language-clear and language-ambiguous tasks.

2606.13513 2026-06-12 cs.AI 新提交

CloudCons: A Comprehensive End-to-End Benchmark for Cloud Resource Consolidation

CloudCons：云资源整合的全面端到端基准测试

Xiaobin Zhang, Lefei Shen, Mouxiang Chen, Zhuo Li, Hongkai Li, Han Fu, Jianling Sun, Xiaoxue Ren, Chenghao Liu

发表机构 * Zhejiang University（浙江大学）； State Street Technology (Zhejiang) Ltd.（道富科技（浙江）有限公司）； Richoo AI ； Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security（杭州高新区（滨江）区块链与数据安全研究院）； Datadog AI Research

AI总结提出CloudCons基准，评估云资源整合中预测模型的决策效用，发现基础模型零样本预测准确但决策效用未必更优，并分析预测分位数选择对资源效率与可靠性的权衡。

Comments: Accepted to KDD 2026

AI中文摘要

由于为了保证服务可靠性而采取的保守过度配置，云数据中心的资源利用率仍然较低。为了缓解这一问题，出现了“先预测后优化”的范式，通过预测未来需求来优化整合。尽管新兴的时间序列基础模型通过零样本泛化有望增强这一范式，但现有基准仅关注预测误差指标。这些先进模型的实际决策效用尚未得到验证，使得它们在下游任务中的实际价值不确定。为了弥合这一差距，我们提出了CloudCons，一个全面的端到端基准测试，旨在评估云资源整合特定背景下的预测模型。我们构建了高质量数据集，涵盖华为云、微软Azure和Google Borg的不同工作负载，捕捉从同步昼夜节律到随机脉冲式突发和高频噪声的不同服务特征。我们对统计模型、深度学习模型和基础模型进行了广泛评估。实验揭示了一个关键发现：虽然基础模型在零样本预测准确性上表现出色，但这种优势并不必然转化为更好的决策效用。具有实际意义的是，我们系统分析了预测分位数的选择如何作为一个关键杠杆。我们提供了校准这些选择的可行指南，以平衡资源效率和服务可靠性之间的权衡，为实际部署决策提供了重要见解。

英文摘要

Driven by conservative over-provisioning to guarantee service reliability, resource utilization in cloud data centers remains at low levels. To mitigate this, the forecast-then-optimize paradigm has emerged to optimize consolidation by anticipating future demands. While emerging time series foundation models promise to enhance this paradigm through zero-shot generalization, existing benchmarks focus solely on prediction error metrics. The actual decision utility of these advanced models remains unverified, rendering their practical value for downstream tasks uncertain. To bridge this gap, we propose CloudCons, a comprehensive end-to-end benchmark designed to evaluate forecasting models within the specific context of cloud resource consolidation. We build high-quality datasets that cover diverse workloads from Huawei Cloud, Microsoft Azure, and Google Borg, capturing distinct service characteristics ranging from synchronized diurnal rhythms to stochastic, pulse-like bursts and high-frequency noise. We conduct an extensive evaluation of statistical, deep learning, and foundation models. Our experiments reveal a pivotal finding: while foundation models demonstrate superior zero-shot forecasting accuracy, this advantage does not inherently translate into better decision utility. Of practical significance, we systematically analyze how the selection of predictive quantiles acts as a critical lever. We provide actionable guidelines for calibrating these selections to balance the trade-off between resource efficiency and service reliability, offering vital insights for real-world deployment decisions.
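
The role of the predictive quantile as a lever between efficiency and reliability can be illustrated with a toy calculation. This is only a sketch under stated assumptions: the forecast samples and demand values are made up, and `nearest_rank_quantile` and `evaluate` are hypothetical helpers, not part of the CloudCons benchmark.

```python
def nearest_rank_quantile(samples, percent):
    """Nearest-rank empirical quantile; percent is an integer in 1..100."""
    ordered = sorted(samples)
    rank = max(1, (percent * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]

def evaluate(capacity, actual_demand):
    """Return (violation_rate, utilization) when `capacity` is reserved at every step."""
    n = len(actual_demand)
    violations = sum(1 for d in actual_demand if d > capacity)
    used = sum(min(d, capacity) for d in actual_demand)
    return violations / n, used / (capacity * n)

# Made-up forecast samples and realized demand, in arbitrary resource units.
forecast_samples = [10, 12, 14, 16, 18, 20, 22, 24, 26, 30]
actual_demand = [12, 18, 25, 28]

for percent in (50, 90):
    capacity = nearest_rank_quantile(forecast_samples, percent)
    violation_rate, utilization = evaluate(capacity, actual_demand)
    print(percent, capacity, violation_rate, round(utilization, 3))
```

On this toy data the median reserves 18 units and is exceeded in half of the steps, while the 90th percentile reserves 26 units, is exceeded in a quarter of them, and leaves more capacity idle.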

2606.13509 2026-06-12 cs.CV cs.AI 新提交

Measurement-Calibrated Multi-Camera Fusion for Vision-Based Indoor Localization

基于测量校准的多相机融合用于视觉室内定位

Mateo Toro Diz, Jonathan Hoss, Noah Klarmann

发表机构 * Rosenheim Technical University of Applied Sciences（罗森海姆应用技术大学）

AI总结提出测量校准融合方法，通过显式量化单相机定位误差（单应校准、人体检测、运动跟踪）来优化多相机数据融合，实验表明该方法虽未显著提升绝对精度，但有效降低了轨迹方差并提高了运动平滑性。

Comments: This paper has been accepted for presentation at the IEEE 22nd International Conference on Automation Science and Engineering (CASE 2026)

AI中文摘要

基于视觉的室内定位系统受到检测噪声、遮挡和有限相机覆盖的影响，导致流程多个阶段存在不确定性。虽然多相机数据融合被广泛用于缓解这些问题，但通常被视为黑箱组件并仅通过端到端评估，掩盖了其机制贡献。为弥补这一不足，本文研究是否可以利用显式表征单相机定位误差来校准和优化多相机数据融合。我们提出了一种测量校准融合方法，该方法集成了组件级误差量化，具体分离了单应校准、人体检测和运动跟踪。进行了组件级评估以量化单应校准、人体检测和运动跟踪的误差贡献。实验结果表明，与单相机基线相比，数据融合提高了定位精度。虽然测量校准融合在绝对精度上相比标准融合仅提供有限的改进，但它显著降低了轨迹方差并提高了运动平滑性，这对于需要稳定连续运动估计的应用至关重要。这些结果突显了在设计基于视觉的室内定位系统的数据融合策略时，显式误差表征的价值。

英文摘要

Indoor vision-based localization systems are affected by detection noise, occlusions, and limited camera coverage, leading to uncertainty at multiple stages of the pipeline. While multi-camera data fusion is widely used to mitigate these issues, it is typically treated as a black-box component and evaluated solely end-to-end, obscuring its mechanistic contributions. To address this gap, this work investigates whether explicitly characterizing single-camera localization errors can be leveraged to calibrate and optimize multi-camera data fusion. We introduce a measurement-calibrated fusion approach that integrates component-wise error quantification, separately measuring the error contributions of homography calibration, human detection, and motion tracking. Experimental results show that data fusion improves localization accuracy compared to single-camera baselines. While measurement-calibrated fusion provides only limited improvement in absolute accuracy over standard fusion, it substantially reduces trajectory variance and improves motion smoothness, which are critical for applications requiring stable and continuous motion estimates. These results highlight the value of explicit error characterization when designing data fusion strategies for vision-based indoor positioning systems.
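
One common way to let measured per-camera error drive fusion is inverse-variance weighting. The sketch below uses that standard technique as a stand-in; the abstract does not state the paper's actual fusion rule, and `fuse_positions` and the numbers are hypothetical.

```python
def fuse_positions(estimates):
    """estimates: list of ((x, y), variance) pairs, one per camera.
    Returns the inverse-variance weighted position and the fused variance.
    Assumes independent, zero-mean errors with a known variance per camera."""
    weights = [1.0 / var for _, var in estimates]
    total = sum(weights)
    x = sum(w * pos[0] for w, (pos, _) in zip(weights, estimates)) / total
    y = sum(w * pos[1] for w, (pos, _) in zip(weights, estimates)) / total
    return (x, y), 1.0 / total

# Hypothetical floor-plane estimates in metres: camera A was measured to be
# four times more precise (variance 0.25) than camera B (variance 1.0).
fused_pos, fused_var = fuse_positions([((2.0, 1.0), 0.25), ((4.0, 3.0), 1.0)])
```

The fused estimate lands closer to the more precise camera, and its variance is lower than either input, which is the kind of stabilizing effect a calibrated fusion is meant to provide.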

2606.13507 2026-06-12 cs.CL 新提交

Leveraging Audio-LLMs to Filter Speech-to-Speech Training Data

利用音频大语言模型过滤语音到语音训练数据

Qixu Chen, Satoshi Nakamura

发表机构 * School of Data Science, The Chinese University of Hong Kong, Shenzhen（香港中文大学（深圳）数据科学学院）； School of Artificial Intelligence, The Chinese University of Hong Kong, Shenzhen（香港中文大学（深圳）人工智能学院）

AI总结提出Rank-to-Distill策略，训练音频大语言模型直接从语音对判断保留/丢弃，过滤噪声数据，提升端到端语音翻译性能。

Comments: Accepted to INTERSPEECH 2026

AI中文摘要

大规模挖掘语料为端到端语音到语音翻译（S2ST）提供了丰富的训练数据，但可能包含噪声、错位和语义错误。过滤噪声数据对于保持鲁棒的语音翻译性能至关重要。我们研究如何训练音频语言模型直接根据音频对配对语音做出保留/丢弃决策。为了在没有人工标注的情况下获得可靠的监督，我们采用了一种可扩展的两阶段Rank-to-Distill策略。一个轻量级排序器从噪声语音对生成保留/丢弃伪标签，然后用这些伪标签训练音频大语言模型，使其直接从原始配对语音预测保留/丢弃。所得模型联合捕获声学保真度和跨语言语义一致性，用于选择语音条件数据。在CVSS-C和SpeechMatrix上的实验表明，与未过滤训练相比，性能持续提升，端到端S2ST的ASR-BLEU最高提升+1.4。

英文摘要

Large-scale mined corpora provide abundant training data for end-to-end speech-to-speech translation (S2ST) but may contain noise, misalignment, and semantic errors. Filtering noisy data is crucial to maintain robust speech translation performance. We study how to train an audio-language model to make keep/drop decisions on paired speech directly from audio. To obtain reliable supervision without manual labels, we adopt a scalable two-stage Rank-to-Distill strategy. A lightweight ranker generates keep/drop pseudo-labels from noisy speech pairs; these pseudo-labels are then used to train an audio large language model to predict keep/drop directly from raw paired speech. The resulting model jointly captures acoustic fidelity and cross-lingual semantic consistency for the selection of speech-conditioned data. Experiments on CVSS-C and SpeechMatrix show consistent improvements over unfiltered training, yielding up to +1.4 ASR-BLEU for end-to-end S2ST.

2606.13503 2026-06-12 cs.CV cs.AI cs.RO 新提交

Heterogeneous LiDAR Early Fusion and Learned Re-Ranking Strategy for Robust Long-Term Place Recognition in Unstructured Environments

异构激光雷达早期融合与学习重排序策略用于非结构化环境中的鲁棒长期地点识别

Judith Vilella-Cantos, Juan José Cabrera, Mónica Ballesta, David Valiente, Luis Payá

发表机构 * Miguel Hernández University of Elche（米格尔·埃尔南德斯·德埃尔切大学）

AI总结提出MinkUNeXt-VINE++方法，通过异构LiDAR数据早期融合和学习重排序策略，在非结构化环境（如葡萄园）中显著提升长期地点识别性能，Recall@1指标提升20%-30%。

AI中文摘要

在非结构化环境（如农田）中，鲁棒定位是自主系统的关键挑战。LiDAR传感器提供环境的详细3D信息，且不受光照条件影响，因此基于LiDAR的地点识别方法备受关注。本文提出MinkUNeXt-VINE++，一种结合两个传感器（Livox Mid-360和Velodyne VLP-16）异构LiDAR数据早期融合与推理时学习重排序策略的新方法。这种融合利用每个传感器的优势，提供更全面的环境表示。此外，重排序方法在重复环境（如葡萄园）中尤为重要，因为找到真正匹配是一项重大挑战。我们使用TEMPO-VINE数据集评估了该方法，该数据集提供了不同物候阶段葡萄园环境中的异构LiDAR数据。结果表明，与单传感器方法和现有最优方法相比，MinkUNeXt-VINE++显著提升了地点识别性能。与单传感器方法相比，MinkUNeXt-VINE++在Recall@1指标上提升了20%，加入重排序后提升30%。我们的方法代码已公开，可复现结果。

英文摘要

Robust localization in unstructured environments, such as agricultural fields, is a critical challenge for autonomous systems. LiDAR sensors provide detailed 3D information about the environment and are invariant to lighting conditions. For this reason, LiDAR-based place recognition methods have gained significant attention. In this paper, we propose MinkUNeXt-VINE++, a novel approach that combines early fusion of heterogeneous LiDAR data from two sensors (Livox Mid-360 and Velodyne VLP-16) with a learned re-ranking strategy at inference time. This fusion leverages the strengths of each sensor to provide a more comprehensive representation of the environment. Additionally, the re-ranking approach is particularly important in repetitive environments, such as vineyards, where finding true positives is a major challenge. We evaluated our approach using the TEMPO-VINE dataset, which provides heterogeneous LiDAR data in vineyard environments across different phenological stages. Our results demonstrate that MinkUNeXt-VINE++ significantly improves place recognition performance compared to single-sensor approaches and state-of-the-art methods. MinkUNeXt-VINE++ achieves a 20% improvement in the Recall@1 metric compared to single-sensor approaches, and a 30% improvement when re-ranking is included. The code of our method is publicly available for reproduction.

2606.13497 2026-06-12 cs.RO cs.CV 新提交

SPARC: Reliable Spatial Annotations from Robot Demonstrations at Scale

SPARC：从大规模机器人演示中获取可靠的空间标注

Nils Blank, Paul Mattes, Maximilian Xiling Li, Jakub Suliga, Thomas Roth, Moritz Reuss, Pankhuri Vanjani, Rudolf Lioutikov

发表机构 * Karlsruhe Institute of Technology（卡尔斯鲁厄理工学院）； NVIDIA（英伟达）； Robotics Institute Germany（德国机器人研究所）

AI总结提出SPARC框架，利用机器人任务的时空结构生成可靠性评分，自动标注演示中的空间信息，减少噪声标签并保留更多有用样本，在物体定位基准上优于纯检测基线。

AI中文摘要

本文介绍了一种具有可靠性校准的机器人演示空间标注方法（SPARC），这是一个风险感知框架，能够自动为机器人演示标注结构化的空间信息，并为每个标注分配可靠性评分。结构化的空间标注，如边界框、物体轨迹和操作阶段标签，有益于广泛的机器人应用，从训练接地机器人策略和具身基础模型到运动规划和层次化任务组合。现有的自动化流水线可以大规模生成此类标注，但无法提供可靠的质量信号：检测器置信度对于标注正确性的校准不佳，迫使人们在接受噪声标签或丢弃有用样本之间做出选择。与现有的自动化流水线不同，SPARC利用机器人任务固有的时空结构生成可靠性信号，减少噪声标签并保留更多有用样本。我们进一步引入了交互感知基准（IA-Bench），这是一个衡量模型在机器人演示中接地交互物体位置准确性的基准。在涵盖多种具身形态和场景的1.7k个人工标注演示上，SPARC在定位准确性上显著优于纯检测基线，同时在高精度操作点保留了三倍的样本。我们的实验表明，基于我们的标注微调的模型在物体接地和指向基准上，在类似规模的模型中达到了最先进的结果，同时在更广泛的空间推理套件上保持竞争力，无需手动验证或标注的训练数据。此外，基于SPARC生成的标注训练的策略在杂乱、视觉模糊的真实场景中优于基线。代码、数据和模型可从此网址获取。

英文摘要

This work introduces Spatial Annotations from Robot Demonstrations with Reliability Calibration (SPARC), a risk-aware framework that automatically labels robot demonstrations with structured spatial annotations and assigns each annotation a reliability score. Structured spatial annotations, such as bounding boxes, object trajectories, and manipulation phase labels, benefit a broad range of robotics applications from training grounded robot policies and embodied foundation models to motion planning and hierarchical task composition. Existing automated pipelines generate such annotations at scale but provide no reliable quality signal: detector confidence is poorly calibrated for annotation correctness, forcing a choice between accepting noisy labels or discarding useful samples. In contrast to existing automated pipelines, SPARC leverages the spatio-temporal structure inherent to robot tasks to generate a reliability signal, reducing noisy labels and retaining more useful samples. We further introduce Interaction-Aware Bench (IA-Bench), a benchmark that measures model accuracy in grounding the locations of interacted objects in robot demonstrations. On 1.7k human-annotated demonstrations spanning diverse embodiments and scenarios, SPARC significantly outperforms detection-only baselines in localization accuracy while retaining three times more samples at high-precision operating points. Our experiments demonstrate that models finetuned on our annotations achieve state-of-the-art results on object-grounding and pointing benchmarks among similarly sized models, while remaining competitive on broader spatial-reasoning suites without manually verified or annotated training data. Furthermore, policies trained on SPARC-generated annotations outperform baselines in cluttered, visually ambiguous real-world scenes. Code, data, and models are available at this http URL.

2606.13496 2026-06-12 cs.CV 新提交

Budget-Constrained Step-Level Diffusion Caching

预算约束的步骤级扩散缓存

Mingkun Lei, Tong Zhao, Liangyu Yuan, Chi Zhang

发表机构 * Westlake-AGI-Lab（西湖大学AGI实验室）

AI总结提出BudCache方法，通过离线搜索（模拟退火+爬山）在固定计算预算下优化缓存策略，并引入缓存感知调度对齐，以提升扩散模型生成质量。

Comments: Accepted by ICML 2026

AI中文摘要

步骤级缓存通过利用去噪步骤间的时间冗余来加速扩散模型。现有方法使用基于阈值的启发式方法进行每步缓存决策，没有直接优化最终输出质量。因此，它们的推理延迟随输入变化，在部署时难以控制。在这项工作中，我们提出了BudCache，它反转了这一公式：不是让每步误差阈值决定运行成本，而是预先固定计算预算，并搜索最能保留最终输出的缓存策略。为了应对步骤选择的组合复杂性，我们将模拟退火与确定性爬山相结合。这种离线搜索在几分钟内找到高质量的缓存策略，并且在推理过程中不引入在线搜索或阈值开销。当计算预算非常紧张时，我们进一步引入缓存感知调度对齐，它使时间离散化适应所选的缓存策略，以减少缓存引起的轨迹不匹配。在FLUX.1-dev和Wan2.1上的实验表明，在相同推理预算下，BudCache比启发式缓存基线实现了更好的生成质量。代码可在此 https URL 获取。

英文摘要

Step-level caching accelerates diffusion models by exploiting temporal redundancy across denoising steps. Existing methods make per-step cache decisions using threshold-based heuristics, without directly optimizing for final output quality. As a result, their inference latency varies across inputs and is difficult to control at deployment. In this work, we propose BudCache, which inverts this formulation: rather than letting per-step error thresholds dictate the runtime cost, we fix the compute budget in advance and search for the cache policy that best preserves the final output. To tackle the combinatorial complexity of step selection, we combine Simulated Annealing with deterministic Hill Climbing. This offline search identifies high-quality cache policies within minutes and introduces no online search or thresholding overhead during inference. When the compute budget is very tight, we further introduce cache-aware schedule alignment, which adapts the time discretization to the selected cache policy to reduce cache-induced trajectory mismatch. Experiments on FLUX.1-dev and Wan2.1 show that BudCache achieves better generation quality than heuristic caching baselines under the same inference budgets. Code is available at this https URL
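
The budget-first search described in this abstract (simulated annealing followed by deterministic hill climbing over which steps to compute) can be sketched on a toy problem. Everything specific here is an assumption: `proxy_error` is a made-up stand-in for final-output degradation, the swap move and cooling schedule are illustrative choices, and none of the names come from the BudCache code.

```python
import math
import random

def proxy_error(policy, sensitivity):
    """Toy quality proxy: a cached step costs its sensitivity times the
    number of steps since the last full computation."""
    error, gap = 0.0, 0
    for computed, s in zip(policy, sensitivity):
        gap = 0 if computed else gap + 1
        error += gap * s
    return error

def swap(policy, i, j):
    """Exchange two entries; the number of computed steps stays fixed."""
    out = list(policy)
    out[i], out[j] = out[j], out[i]
    return out

def search_policy(sensitivity, budget, iters=2000, seed=0):
    rng = random.Random(seed)
    n = len(sensitivity)
    # Step 0 is always computed; the rest of the budget starts on the earliest steps.
    current = [1] * budget + [0] * (n - budget)
    cur_err = proxy_error(current, sensitivity)
    best, best_err = current, cur_err
    for k in range(iters):  # simulated annealing over budget-preserving swaps
        temp = max(1e-3, 1.0 - k / iters)
        i, j = rng.randrange(1, n), rng.randrange(1, n)
        if current[i] == current[j]:
            continue
        cand = swap(current, i, j)
        err = proxy_error(cand, sensitivity)
        if err < cur_err or rng.random() < math.exp((cur_err - err) / temp):
            current, cur_err = cand, err
            if err < best_err:
                best, best_err = cand, err
    improved = True
    while improved:  # deterministic hill climbing from the best annealed policy
        improved = False
        for i in range(1, n):
            for j in range(i + 1, n):
                if best[i] != best[j]:
                    cand = swap(best, i, j)
                    err = proxy_error(cand, sensitivity)
                    if err < best_err:
                        best, best_err, improved = cand, err, True
    return best, best_err

# Hypothetical per-step sensitivities for an 8-step sampler with 3 full computations.
sensitivity = [1.0, 0.2, 0.2, 3.0, 0.2, 0.2, 2.0, 0.2]
policy, err = search_policy(sensitivity, budget=3)
```

Because every move is a swap, each candidate policy uses exactly the same number of full computations, which mirrors the idea of fixing the compute budget in advance and searching only over where to spend it.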

2606.13494 2026-06-12 cs.RO cs.CV 新提交

NavWAM: A Navigation World Action Model for Goal-Conditioned Visual Navigation

NavWAM：用于目标条件视觉导航的导航世界动作模型

Daichi Azuma, Taiki Miyanishi, Koya Sakamoto, Shuhei Kurita, Yaonan Zhu, Petr Khrapchenkov, Motoaki Kawanabe, Yusuke Iwasawa, Yutaka Matsuo

发表机构 * The University of Tokyo（东京大学）； National Institute of Informatics（国立信息学研究所）； AIRoA ； ATR

AI总结提出NavWAM，一种扩散变换器策略，通过联合学习未来观测、目标进度值和动作块，将导航世界模型预测直接转化为可执行动作，在离线基准和真实机器人部署中优于基于规划的世界模型基线。

Comments: Project page: this https URL

AI中文摘要

目标条件视觉导航要求机器人在部分可观测性下行动，通过预测其运动将如何改变未来的自我中心视图以及这种变化是否使其更接近目标。导航世界模型提供了这种视觉预见，但它们仍然是预测模块，需要外部规划器将预测的未来转化为闭环控制。我们提出导航世界动作模型（NavWAM），一种扩散变换器策略，通过将未来观测、目标进度值和动作块表示为共享的潜在序列，将导航世界模型预测转化为可执行动作。通过联合学习未来预测与决定闭环行为的动作和价值目标，NavWAM使视觉预见可直接用于机器人控制。我们通过模拟预训练和真实机器人适应构建NavWAM，并在图像目标导航任务上将其与基于规划的世界模型和代表性直接导航策略进行评估。在离线基准和闭环真实机器人部署中，NavWAM在使用默认策略模式（无CEM式动作搜索）的情况下，在我们的评估中优于基于规划的世界模型基线。项目页面：此 https URL

英文摘要

Goal-conditioned visual navigation requires a robot to act under partial observability by anticipating how its motion will change the future egocentric view and whether that change brings it closer to the goal. Navigation world models provide such visual foresight, but they remain prediction modules that require an external planner to convert predicted futures into closed-loop control. We propose Navigation World Action Model (NavWAM), a diffusion-transformer policy that turns navigation world-model prediction into executable action by representing future observations, goal-progress values, and action chunks in a shared latent sequence. By learning future prediction jointly with the action and value targets that determine closed-loop behavior, NavWAM makes visual foresight directly usable for robot control. We build NavWAM through simulation pretraining and real-robot adaptation, and evaluate it on image-goal navigation against planning-based world models and a representative direct navigation policy. Across offline benchmarks and closed-loop real-robot deployment, NavWAM improves over planning-based world-model baselines in our evaluations while using the default policy mode without CEM-style action search. Project page: this https URL

2606.13488 2026-06-12 cs.CV 新提交

Point-Wise Geometry-Aware Transformer for Partial-to-Full Point Cloud Registration in Computer-Assisted Surgery

面向计算机辅助手术中部分到完整点云配准的点级几何感知Transformer

Siyu Zhou, Zhongliang Jiang

发表机构 * The Chair for Computer Aided Medical Procedures, Technical University of Munich（慕尼黑工业大学计算机辅助医疗程序教席）； The University of Hong Kong（香港大学）

AI总结提出GAPR-Net，一种结合卷积与Transformer的粗到细框架，通过交叉注意力融合局部与全局信息，并设计变换不变的点级几何特征，在四个骨骼数据集上实现94.2%配准召回率、1.992mm RMSE。

AI中文摘要

由于重叠率变化、点密度波动以及噪声的存在，部分到完整配准仍然具有挑战性。尽管Transformer在点云处理中展现出强大潜力，但先前的方法通常将其局限于全局上下文聚合，忽略了对于精确对应至关重要的细粒度局部几何信息。我们提出GAPR-Net，一种基于学习的点云配准框架，采用粗到细架构，结合卷积和Transformer模块，通过交叉注意力机制在部分和完整点云之间融合局部和全局信息。为此，提出了一种变换不变的点级几何特征表示，能够鲁棒地捕获单个点相对于其邻域点的相对几何特征。为了评估所提方法的有效性，在四个几何上不同的骨骼（包括胫骨、股骨、骨盆和胸软骨）上进行了实验。整体配准召回率达到94.2%，该方法实现了低RMSE 1.992 mm，旋转和平移的R²值分别为0.908和0.974。结果表明，所提方法有效解决了部分到完整点云配准问题。该方法利用部分观测实现高精度3D点云配准，为计算机辅助手术中的精确手术导航和机器人干预提供了关键基础。代码将在双盲评审后公开。

英文摘要

Partial-to-full registration remains challenging due to varying overlap ratios, fluctuating point densities, and the presence of noise. While transformers have shown strong potential for point cloud processing, prior methods typically confine them to global context aggregation, overlooking the fine-grained local geometry that is crucial for accurate correspondence. We propose GAPR-Net, a learning-based point cloud registration framework with a coarse-to-fine architecture that combines convolution and transformer modules, in which local and global information is fused between the partial and full point clouds using a cross-attention mechanism. To achieve this, a transformation-invariant point-wise geometric feature representation is proposed, which robustly captures the relative geometric features of individual points with respect to their neighboring points. To evaluate the effectiveness of the proposed approach, experiments are conducted on four geometrically distinct bones: the tibia, femur, pelvis, and thoracic cartilage. The overall registration recall reaches 94.2%, and the method achieves a low RMSE of 1.992 mm and $R^2$ values of 0.908 and 0.974 for rotation and translation, respectively. The results demonstrate that the proposed method effectively addresses the partial-to-full point cloud registration problem. The proposed method enables highly accurate 3D point cloud registration using partial observation, providing a critical foundation for precise surgical navigation and robotic interventions in computer-assisted surgery. The code will be made available after the double-blind review process.

2606.13486 2026-06-12 cs.LG cs.AI 新提交

CRAFTIIF: Cross-Resolution Analytic Four-Type Interpretable Isolation Forest for Multivariate Time Series Anomaly Detection

CRAFTIIF：用于多元时间序列异常检测的跨分辨率分析四类型可解释孤立森林

William Smits

发表机构 * Avathon

AI总结提出CRAFTIIF无监督框架，通过四种小波特征和五个孤立森林同时检测点、分布、时间和集体四类异常，在mTSBench基准上达到平均F1=0.228，VUS-PR比先前最佳提升40.7%。

Comments: 14 pages, 4 figures, 2 appendices. Submitted to IEEE Transactions on Knowledge and Data Engineering (TKDE). Code: this https URL

AI中文摘要

多元时间序列中的异常检测面临四种结构不同的异常类型：点异常（孤立尖峰）、分布异常（水平偏移）、时间异常（节奏变化）和集体异常（传感器间相关性崩溃），每种都需要不同的特征表示。大多数无监督方法只针对其中一两种类型，且可解释性有限。我们提出CRAFTIIF（跨分辨率分析四类型可解释孤立森林），这是一个完全无监督的框架，针对所有四种类型，无需针对数据集调整。CRAFTIIF生成K=500个随机分析小波特征，跨越四个小波族（Morlet、DOG、Haar、Coiflet），每个针对特定异常类型，并输入五个结构化的孤立森林：每种类型一个，外加一个用于复合异常的元IF。自适应Otsu/MAD阈值在0.1%到69.2%的异常率范围内自动校准检测。由于每个IF仅针对特定类型的特征进行训练，分支触发直接提供异常类型归因，无需事后解释。在mTSBench基准（Zhou等人，TMLR 2026）的所有19个数据集上评估，CRAFTIIF在全部19个数据集上达到平均F1=0.228，在13个可检测数据集上F1=0.322，并在全部25种被评估方法中VUS-PR排名第一（0.463对比之前最佳0.329，提升40.7%）。一个由oracle F1、可检测性限制和分支分离比组成的诊断框架识别出19个数据集中有6个从根本上无法被任何无监督方法检测。在11种消融条件下，自适应阈值（+38% F1）、四分支结构（+20%）和元IF（+23%）均被证明是必不可少的。代码：此 https URL

英文摘要

Anomaly detection in multivariate time series is challenged by four structurally distinct anomaly types -- point (isolated spikes), distributional (level shifts), temporal (rhythm changes), and collective (inter-sensor correlation breakdowns) -- each requiring different feature representations. Most unsupervised methods target only one or two types and provide limited interpretability. We present CRAFTIIF (Cross-Resolution Analytic Four-Type Interpretable Isolation Forest), a fully unsupervised framework targeting all four types without dataset-specific tuning. CRAFTIIF generates K=500 random analytic wavelet feature draws across four families (Morlet, DOG, Haar, Coiflet), each targeting a specific anomaly type, feeding five structured Isolation Forests -- one per type plus a meta-IF for compound anomalies. An adaptive Otsu/MAD threshold calibrates detection automatically across anomaly rates from 0.1% to 69.2%. Because each IF is trained exclusively on type-specific features, branch firing provides direct anomaly-type attribution by construction, without post-hoc explanation. Evaluated on all 19 datasets of the mTSBench benchmark (Zhou et al., TMLR 2026), CRAFTIIF achieves mean F1=0.228 (all 19 datasets) and F1=0.322 (13 detectable datasets), ranking first among all 25 evaluated methods on VUS-PR (0.463 vs. previous best 0.329, +40.7%). A diagnostic framework -- oracle F1, detectability limits, and branch separation ratios -- identifies 6 of 19 datasets as fundamentally undetectable by any unsupervised method. Ablation over 11 conditions confirms adaptive thresholding (+38% F1), four-branch structure (+20%), and meta-IF (+23%) are each essential. Code: this https URL
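
A median-absolute-deviation (MAD) cutoff is one way to threshold anomaly scores without assuming a fixed anomaly rate. The sketch below shows only a generic MAD rule; how CRAFTIIF combines Otsu and MAD is not given in the abstract, and the multiplier `k=3.0`, the 1.4826 consistency constant for Gaussian data, and the example scores are assumptions.

```python
import statistics

def mad_threshold(scores, k=3.0):
    """Return median + k * 1.4826 * MAD as a robust upper cutoff for anomaly scores."""
    med = statistics.median(scores)
    mad = statistics.median(abs(s - med) for s in scores)
    return med + k * 1.4826 * mad

# Hypothetical Isolation Forest anomaly scores (higher means more anomalous).
scores = [0.10, 0.12, 0.11, 0.09, 0.10, 0.13, 0.95]
cutoff = mad_threshold(scores)
flagged = [s for s in scores if s > cutoff]
```

Because the median and MAD are barely moved by the single extreme score, the cutoff stays near the bulk of the data and only the outlier is flagged.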

2606.13477 2026-06-12 cs.LG cs.AI cs.CL 新提交

SupraBench: A Benchmark for Supramolecular Chemistry

SupraBench: 超分子化学基准

Tianyi Ma, Yijun Ma, Zehong Wang, Weixiang Sun, Ziming Li, Connor R. Schmidt, Chuxu Zhang, Matthew J. Webber, Yanfang Ye

发表机构 * University of Notre Dame（圣母大学）； University of Connecticut（康涅狄格大学）

AI总结为评估大语言模型在超分子化学推理中的能力，与领域专家合作发布了首个超分子基准SupraBench，包含四个基本任务和一个辅助视觉任务，并提供了16M令牌的语料库SupraPMC。

AI中文摘要

超分子化学，包括非共价主客体组装的研究，推动了各种应用的发展。然而，设计主客体系统仍然耗时，每个候选对需要数天的干实验室验证。尽管LLMs已成为一种快速的替代方案，在分子结合任务上表现出色，但目前尚无基准系统性地评估LLMs在超分子化学基本任务（如结合亲和力预测）中的主客体推理能力。为此，我们与领域专家合作发布了首个超分子基准，称为SupraBench，用于评估LLMs在化学推理中的表现。具体来说，我们设计了四个基本任务，即结合亲和力预测、最佳结合物选择、溶剂识别和主客体描述，以及一个辅助的基于视觉的分子识别任务。我们还发布了SupraPMC，一个从Europe PMC中提取的经过整理的1600万令牌的超分子化学文章语料库，以支持对超分子领域的适应。我们对一系列开源和专有LLMs进行了基准测试，发现LLMs在所有任务上都有很大的提升空间。在SupraPMC上的领域自适应预训练可以干净地迁移到分布内回归，但会与严格的字母格式输出进行权衡。此外，不同任务家族的难度分布差异很大，揭示了不同的失败模式，表明当前超分子化学推理中存在特定的差距。我们的源代码和基准数据集可在以下网址获取：此 https URL。

英文摘要

Supramolecular chemistry, which includes the study of non-covalent host-guest assemblies, has advanced various applications. However, designing host-guest systems remains time-consuming, requiring days of dry-lab verification per candidate pair. Although LLMs have emerged as a fast alternative with strong performance on molecular binding tasks, no benchmark currently evaluates LLMs systematically for host-guest reasoning across fundamental supramolecular chemistry tasks, e.g., binding affinity prediction. To this end, we collaborate with domain experts to release the first Supramolecular Benchmark, called SupraBench, to evaluate LLMs in chemistry reasoning. Specifically, we design four fundamental tasks, i.e., binding affinity prediction, top-binder selection, solvent identification, and host-guest description, plus an auxiliary vision-based task for molecular identification. We also release SupraPMC, a curated 16M-token corpus of supramolecular chemistry articles distilled from Europe PMC, to support adaptation to the supramolecular domain. We benchmark a broad range of open and proprietary LLMs and find that LLMs leave substantial headroom across all tasks. Domain adaptation pretraining over SupraPMC transfers cleanly to in-distribution regression but trades off against strict letter-format output. Moreover, the difficulty profile differs sharply across task families, revealing distinct failure modes that indicate specific gaps in current supramolecular chemistry reasoning. Our source code and benchmark datasets are available at this https URL.

2606.13473 2026-06-12 cs.LG cs.AI cs.CL 新提交

MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling

MaxProof: 通过生成-验证器强化学习与群体级测试时扩展实现数学证明规模化

Jiacheng Chen, Xinyu Zhang, Shunkai Zhang, Yanmohan Wang, Lin Li, Tiancheng Qin, Qin Wang, Zhengmao Zhu, Tianle Li, Jingyang Li, Zehan Li, Binyang Jiang, Jin Zhu, Han Ding, Fei Yu, Chenyu Du, Zijian Song, Jiayuan Song, Zhi Zhang, Yunan Huang, Weiyu Cheng, Pengyu Zhao, Yu Cheng

发表机构 * MiniMax ； The Chinese University of Hong Kong（香港中文大学）； Fudan University（复旦大学）； Peking University（北京大学）； Tsinghua University（清华大学）

AI总结提出MaxProof框架，结合生成-验证器强化学习与群体级测试时扩展，在MiniMax-M3系列上实现竞赛级数学证明，在IMO 2025和USAMO 2026上超越人类金牌阈值。

2606.13464 2026-06-12 cs.CL cs.AI 新提交

Ontology Memory-Augmented ASR Correction for Long Text-Speech Interleaved Conversations

本体记忆增强的ASR校正用于长文本-语音交错对话

Xinxin Li, Huiyao Chen, Meishan Zhang, Yunxin Li, Zulong Chen, Zhibo Ren, Xiaoqing Dong, Baotian Hu, Min Zhang

发表机构 * Institute of Computing and Intelligence, Harbin Institute of Technology (Shenzhen), China（哈尔滨工业大学（深圳）计算与智能研究院）； Shenzhen Loop Area Institute (SLAI), China（深圳河套学院）

AI总结提出本体记忆增强的ASR校正框架，通过动态更新本体记忆存储实体、术语、变体、混淆和语义关系，解决长文本-语音交错对话中的上下文校正问题，在RAMC-Corr数据集上优于直接校正。

Abstract

Automatic speech recognition (ASR) correction has traditionally focused on isolated utterances or short local contexts. However, as text and speech become increasingly interleaved in long interactions, ASR correction requires conversation-level contextual evidence. Existing ASR correction methods often rely on the current hypothesis or concatenate raw dialogue history. In such contexts, sparse correction evidence can be difficult to locate amid redundancy and noise. Addressing these challenges, we propose an ontology memory-augmented ASR correction framework for long text-speech interleaved conversations. The framework organizes preceding interaction history into a dynamically updatable ontology memory, where entities, terminology, surface variants, potential ASR confusions, and semantic relations are stored as retrievable nodes for context-grounded correction. To evaluate this setting, we construct RAMC-Corr, a dataset derived from MAGIC-RAMC for long-range ASR correction with grounded context. Experiments on RAMC-Corr show that our method improves over direct correction in 9 out of 10 paired backbone-setting combinations and encourages more selective and evidence-grounded corrections for context-dependent ASR errors.
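
To make the memory structure concrete, here is a minimal Python sketch of an ontology memory of the kind the abstract describes. The class names (`OntologyNode`, `OntologyMemory`), the node fields, the `correct` helper, and the string-similarity retrieval via `difflib` are all illustrative assumptions, not the authors' implementation; the abstract does not say how nodes are retrieved or how the corrected text is generated.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class OntologyNode:
    """One retrievable memory node (hypothetical schema)."""
    canonical: str                                  # canonical entity or term
    variants: set = field(default_factory=set)      # surface variants seen so far
    confusions: set = field(default_factory=set)    # likely ASR mis-transcriptions
    relations: dict = field(default_factory=dict)   # relation name -> related term


class OntologyMemory:
    """Dynamically updatable store built from earlier turns (illustrative only)."""

    def __init__(self):
        self.nodes = {}

    def update(self, canonical, variants=(), confusions=(), relations=None):
        # Create the node on first mention, then merge in new evidence.
        node = self.nodes.setdefault(canonical, OntologyNode(canonical))
        node.variants.update(variants)
        node.confusions.update(confusions)
        node.relations.update(relations or {})

    def retrieve(self, token, threshold=0.8):
        """Return the canonical form whose known surface forms best match `token`."""
        best, best_score = None, threshold
        for node in self.nodes.values():
            for form in {node.canonical} | node.variants | node.confusions:
                score = SequenceMatcher(None, token.lower(), form.lower()).ratio()
                if score >= best_score:
                    best, best_score = node.canonical, score
        return best


def correct(hypothesis, memory):
    """Replace tokens only when the memory offers evidence; leave the rest untouched."""
    return " ".join(memory.retrieve(tok) or tok for tok in hypothesis.split())


memory = OntologyMemory()
memory.update("Kubernetes", variants={"k8s"}, confusions={"cooper netties", "kubernetties"})
print(correct("we deploy on kubernetties today", memory))
```

The token-level lookup is deliberately simple: it only shows how earlier turns can supply sparse correction evidence without concatenating the raw dialogue history. A real system would presumably retrieve over phrases and hand the retrieved nodes to a correction model.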

2606.13460 2026-06-12 cs.CV New submission

VISA: VLM-Guided Instance Semantic Auditing for 3D Occupancy World Models

Ruiqi Xian, Yuehan Xian, Jing Liang, Xuewei Qi, Dinesh Manocha

Affiliations: University of Maryland, College Park; Nanjing University of Posts and Telecommunications; Stanford University; Motional AD Inc.

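AI summary: Proposes VISA, which uses an offline VLM to run a structured semantic audit of each physical object instance and distills it into a 3D occupancy model through reliability-weighted losses, improving closed-set occupancy mIoU without VLM inference.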

Abstract

Semantic 3D occupancy provides a voxelized world state for autonomous driving and robot decision making, but object and rare-class errors can affect free-space interpretation, collision checking, and temporal state propagation. We show that a common VLM strategy, aligning 3D voxel or object features with crop-caption embeddings, improves text-space similarity without reliably improving closed-set occupancy mIoU. Motivated by this mismatch, we propose VISA, a training-time semantic auditing approach for existing occupancy world models. VISA queries an offline VLM on a representative crop of each physical object instance, obtains a structured audit with class hypotheses, plausible confusions, reliability, attributes, and evidence, and propagates it along the object track. The audit is grounded to matched 3D object voxels and distilled into semantic logits through reliability-weighted taxonomy, attribute-factor, and scene-level audit graph losses, while inference remains unchanged and requires no VLM. On nuScenes, averaged across three runs, VISA improves OccWorld from 19.06 to 20.05 mIoU and GaussianWorld from 21.36 to 21.91 mIoU; on GaussianWorld, object mIoU improves from 18.18 to 19.16 and rare-class mIoU from 15.60 to 16.79. These results suggest that VLMs are better suited to closed-set occupancy as reliability-aware semantic auditors than as generic caption-embedding targets.
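
The idea of weighting a distillation loss by audit reliability can be sketched in a few lines of Python. This covers only a generic reliability-weighted cross-entropy term; the function names, the per-voxel inputs, and the normalisation by total reliability are assumptions made for illustration. The abstract's taxonomy, attribute-factor, and scene-level audit graph losses are not specified there and are not reproduced here.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one voxel's class logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def reliability_weighted_ce(voxel_logits, audit_labels, reliabilities):
    """Cross-entropy against audited classes, weighted by audit reliability.

    voxel_logits:  list of per-voxel class logits
    audit_labels:  class index the (hypothetical) VLM audit assigns to each voxel
    reliabilities: audit reliability in [0, 1]; 0 means the audit is ignored
    """
    total, weight_sum = 0.0, 0.0
    for logits, label, r in zip(voxel_logits, audit_labels, reliabilities):
        total += r * -log_softmax(logits)[label]
        weight_sum += r
    # Normalise by total reliability so unreliable audits do not shrink the loss scale.
    return total / weight_sum if weight_sum > 0 else 0.0


logits = [[2.0, 0.5, 0.1], [0.2, 0.3, 1.5]]
labels = [0, 1]
# The second audit disagrees with the model; lowering its reliability reduces its pull.
print(reliability_weighted_ce(logits, labels, [1.0, 0.1]))
print(reliability_weighted_ce(logits, labels, [1.0, 1.0]))
```

Because the weighting only changes the training objective, a term like this is consistent with the abstract's point that inference stays unchanged and needs no VLM.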

2606.13444 2026-06-12 cs.LG New submission

Clustering Node Attributed Networks with Graph Neural Networks and Self Learning

Rodrigo de Sapienza Luna, Daniel Ratton Figueiredo

Affiliations: Systems Engineering and Computer Science (PESC), Federal University of Rio de Janeiro (UFRJ)

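AI summary: Proposes an unsupervised graph clustering framework based on graph neural networks and self learning that alternates between optimizing node representations and clustering over multiple rounds and uses a context graph to improve performance, with strong results on synthetic and real data.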

Abstract

Graph clustering, the task of partitioning the node set of a graph into disjoint subsets that reflect some latent information, is a fundamental problem with applications in a myriad of scenarios. While this classic problem has been tackled for decades by different communities, a recent variation driven by real data considers the scenario where nodes have attributes that are also informative. This has triggered novel clustering algorithms that simultaneously leverage network information (edges) and node information (attributes). This work proposes a novel framework that builds on prior work applying graph neural networks (GNNs) to graph clustering. The proposed framework operates in rounds of self learning in a fully unsupervised setting. In each round, a GNN generates node representations that are used to cluster the nodes. This clustering influences the graph used to generate the node representations in the next round. Moreover, a context graph built in each round from the original graph is used to generate the node representations. Empirical results show that the proposed methodology extracts information from both network edges and node attributes in synthetic data, outperforming algorithms focused solely on the network or the attributes when neither is very informative. Multiple rounds of learning also improve performance and always outperform a single long round of training (i.e., classic GNN graph clustering). On real datasets, empirical results indicate that the proposed methodology is competitive with state-of-the-art methods when cluster sizes are balanced.
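
The round structure described above can be illustrated with a small, dependency-free Python sketch. It is not the paper's method: mean aggregation over neighbours stands in for the GNN, plain k-means stands in for the clustering step, and the graph-update rule (keep only original edges whose endpoints share a cluster) is an assumption, since the abstract does not say how the clustering influences the next round's graph. The separate context graph is not modelled.

```python
def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def propagate(features, graph):
    """Stand-in for a GNN layer: average each node's features with its neighbours'."""
    out = []
    for i, x in enumerate(features):
        group = [x] + [features[j] for j in graph[i]]
        out.append([sum(col) / len(group) for col in zip(*group)])
    return out


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means with deterministic farthest-first initial centres."""
    centres = [list(points[0])]
    while len(centres) < k:
        centres.append(list(max(points, key=lambda p: min(dist2(p, c) for c in centres))))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(p, centres[c])) for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centre if a cluster becomes empty
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def self_learning_rounds(features, adjacency, k, rounds=3):
    """Alternate representation building and clustering for a fixed number of rounds."""
    graph = {i: set(nbrs) for i, nbrs in adjacency.items()}
    labels = None
    for _ in range(rounds):
        reps = propagate(features, graph)
        labels = kmeans(reps, k)
        # Assumed update rule: keep only original edges whose endpoints share a cluster.
        graph = {i: {j for j in adjacency[i] if labels[i] == labels[j]} for i in adjacency}
    return labels


# Two triangles joined by one bridge edge (2-3); nodes 2 and 5 carry misleading attributes.
features = [[0.0], [0.2], [0.9], [1.0], [0.8], [0.1]]
adjacency = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}

labels = self_learning_rounds(features, adjacency, k=2)
print("graph + attributes:", labels)
print("attributes only:   ", kmeans(features, 2))
```

In this toy graph the attributes of nodes 2 and 5 point to the wrong community, so clustering on attributes alone splits the triangles, while the round-based loop recovers them by combining edges and attributes.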
