arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.09316 2026-06-10 cs.AI Version update

Anything2Skill: Compiling External Knowledge into Reusable Skills for Agents

Qianjun Pan, Yutao Yang, Junsong Li, Jie Zhou, Kai Chen, Xin Li, Qin Chen, Liang He

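Institutions: East China Normal University; Shanghai AI Laboratory

AI summary: Proposes Anything2Skill, a framework that compiles heterogeneous external knowledge into reusable, retrievable, and executable skills; combined with RAG, it substantially raises agent task success rates.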

Abstract

Retrieval-augmented generation (RAG) enables agents to access external knowledge at inference time, but it primarily retrieves fragmented declarative evidence, leaving agents to repeatedly infer task procedures from passages, manuals, examples, logs, or trajectories. This raises a fundamental question: can skills extracted from external knowledge bases be installed into an agent, enabling it to rapidly approximate domain expertise? In this paper, we propose Anything2Skill, a taxonomy-guided framework that compiles heterogeneous external knowledge into reusable, retrievable, and executable skills for agents. Given a corpus of knowledge records, Anything2Skill first decomposes each record into evidence windows and performs plan-and-expand skill extraction under a skill-tree prior. The extracted candidates are then converted into structured skill contracts that specify invocation conditions, contraindications, action moves, workflow steps, constraints, output specifications, supporting evidence, and confidence scores. To construct a deployable procedural memory, Anything2Skill manages the extracted skills in a persistent SkillBank through taxonomy-aware compilation, registry-level reconciliation, lifecycle tracking, versioned updates, and visible skill-tree projection. At inference time, agents retrieve both task-specific passages from the original knowledge base and relevant procedural skills from the SkillBank, allowing RAG to provide declarative evidence while compiled skills provide reusable procedural guidance. Experiments on qsv and GitHub-CLI show that Anything2Skill combined with RAG achieves 98.85% and 94.10% success rates, respectively, substantially outperforming RAG-only agents. These results suggest that compiling latent procedural knowledge into explicit skills is an effective way to extend retrieval-augmented agents from knowledge access toward capability reuse.
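
To make the "skill contract" idea concrete, here is a minimal sketch of what such a record could look like as a data structure. The field names follow the list in the abstract; the class name `SkillContract`, the keyword-matching rule in `applies_to`, the list-based stand-in for the SkillBank, and the `retrieve_skills` helper are all illustrative assumptions, not the paper's implementation.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SkillContract:
    """Hypothetical container mirroring the contract fields named in the abstract."""
    name: str
    invocation_conditions: list[str]
    contraindications: list[str]
    action_moves: list[str]
    workflow_steps: list[str]
    constraints: list[str]
    output_specification: str
    supporting_evidence: list[str]
    confidence: float  # assumed here to lie in [0, 1]

    def applies_to(self, task: str) -> bool:
        """Toy rule: an invocation keyword matches and no contraindication does."""
        text = task.lower()
        hit = any(c.lower() in text for c in self.invocation_conditions)
        blocked = any(c.lower() in text for c in self.contraindications)
        return hit and not blocked


def retrieve_skills(bank, task, min_confidence=0.5):
    """Return applicable skills from a plain list (a stand-in SkillBank), best first."""
    hits = [s for s in bank if s.confidence >= min_confidence and s.applies_to(task)]
    return sorted(hits, key=lambda s: s.confidence, reverse=True)
```

A real system would presumably use embedding-based retrieval rather than keyword matching; the point of the sketch is only that a contract bundles when to use a skill, when not to, and what steps it prescribes.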

2606.09203 2026-06-10 cs.RO Version update

Deterministic Execution of ROS 2 Applications via Lingua Franca

Harun Teper, Shaokai Lin, Shulu Li, Edward A. Lee, Jian-Jia Chen

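Institutions: TU Dortmund University; University of California, Berkeley; RWTH Aachen University

AI summary: Proposes a framework that converts unmodified ROS 2 applications into Lingua Franca programs and uses logical time for deterministic execution, addressing the nondeterminism of callback execution order and message interleaving in ROS 2.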

Abstract

The Robot Operating System 2 (ROS 2) is a widely used middleware for robotic systems, characterized by a publish-subscribe (pub-sub) communication mechanism in which computation is structured as callbacks dispatched by ROS 2 executors. Despite its popularity, the pub-sub pattern in ROS 2 is inherently nondeterministic: the order in which these callbacks run is nondeterministic even within a single executor, and distributed deployments add further nondeterminism from the interleaving of messages across nodes and from network latency. Such nondeterminism often leads to concurrency issues and makes it virtually impossible to analyze safety and provide guarantees. We present a framework that converts an unmodified ROS 2 application and runs it under Lingua Franca (LF), a coordination language for deterministic execution using logical time, so that the same input always produces the same execution order. We first describe which ROS 2 features can be executed deterministically under logical time. These features make it possible to build an automatic conversion framework that extracts information from a ROS 2 application and directly converts it into an LF program. The rich features of LF, such as logical-time delays, federated execution across processes, and fault handling, can then be applied to make the ROS 2 application execute in a deterministic and timing-predictable manner without changing the ROS 2 code. We evaluate the framework on a synthetic example and on the Autoware reference system. We show that in default ROS 2 the order in which callbacks are executed differs across executions, and end-to-end latencies vary as well. In contrast, our LF-controlled ROS 2 system produces a deterministic execution order and consistent end-to-end latencies.
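
The core intuition behind logical-time execution can be illustrated with a toy event queue: if every callback invocation carries a logical timestamp and ties are broken by a fixed priority, the processing order no longer depends on the order in which messages physically arrive. The sketch below is not Lingua Franca or ROS 2 code; the tuple layout `(logical_time, priority, callback_name)` and the function name are assumptions made for illustration.

```python
import heapq


def deterministic_order(events):
    """Return callback names in tag order, independent of arrival order.

    events: iterable of (logical_time, priority, callback_name) tuples.
    The priority field is an assumed fixed tie-breaker for equal timestamps.
    """
    heap = list(events)
    heapq.heapify(heap)  # order by (logical_time, priority, name)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


# Two different arrival orders of the same events
arrival_a = [(10, 1, "control"), (0, 2, "filter"), (0, 1, "sensor")]
arrival_b = [(0, 1, "sensor"), (10, 1, "control"), (0, 2, "filter")]
print(deterministic_order(arrival_a) == deterministic_order(arrival_b))
```

A default executor that dispatches callbacks as messages arrive would, by contrast, process `arrival_a` and `arrival_b` in different orders.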

2606.09079 2026-06-10 cs.LG cs.AI Version update

FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention

Yan Wang, Qifan Zhang, Jiachen Yu, Tian Liang, Dongyang Ma, Xiang Hu, Zibo Lin, Chunyang Li, Zhichao Wang, Miao Peng, Nuo Chen, Jia Li, Yujiu Yang, Haitao Mi, Dong Yu

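Institutions: Independent Researchers; Tencent; The Hong Kong University of Science and Technology (Guangzhou); Tsinghua University

AI summary: Proposes Lookahead Sparse Attention (LSA), driven by a Neural Memory Indexer built on the DeepSeek-V4 architecture, which predicts future context demands and keeps only the critical KV chunks, compressing the physical KV cache to 13.5% of the full-context baseline in ultra-long-context settings while preserving or slightly improving downstream accuracy.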

Comments
Technical report. 11 pages. Code and model available at https://github.com/libertywing/FlashMemory-Deepseek-V4 and https://huggingface.co/libertywing/FlashMemory-Deepseek-V4

Abstract

Conventional LLMs keep the full KV cache loaded during decoding, causing a severe GPU memory bottleneck for ultra-long context serving. In this report, we propose Lookahead Sparse Attention (LSA), a novel inference paradigm powered by a Neural Memory Indexer built upon the DeepSeek-V4 architecture. Rather than passively attending to all historical tokens, LSA proactively predicts future context demands and preserves only the query-critical KV chunks in the GPU memory. Crucially, we instantiate this architecture via a backbone-free decoupled training strategy. By formulating the indexer as a standard dual-encoder architecture, we train it independently using standard retrieval training frameworks without ever loading the massive backbone model into GPU memory. We demonstrate that this "less is more" paradigm significantly maximizes serving efficiency while acting as an effective attention denoiser in tasks that rely on long-term global memory. Across primary long-context evaluation suites (e.g., LongBench-v2, LongMemEval, and RULER), FM-DS-V4 compresses the average physical KV cache footprint down to merely 13.5% of the full-context baseline, while consistently preserving or slightly elevating downstream accuracy (+0.6% absolute margin on average). Crucially, at extreme 500K scales, FlashMemory suppresses the physical KV cache overhead by over 90% without destabilizing the backbone's core reasoning capacities.
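
As a rough illustration of index-based KV selection, the sketch below scores each KV chunk against a query embedding with a dot product (the usual dual-encoder relevance score) and keeps only the top fraction. It does not model the lookahead prediction itself; the function name, the plain-list embeddings, and the default `keep_ratio` (set to the 13.5% average footprint quoted above) are assumptions for illustration only.

```python
def select_kv_chunks(query_emb, chunk_embs, keep_ratio=0.135):
    """Return the sorted indices of the KV chunks to keep in GPU memory.

    query_emb:  list of floats, embedding of the (predicted) query
    chunk_embs: list of embeddings, one per KV chunk
    """
    # Dot-product relevance score for every chunk
    scores = [sum(q * c for q, c in zip(query_emb, emb)) for emb in chunk_embs]
    # Keep at least one chunk, otherwise the requested fraction
    k = max(1, round(keep_ratio * len(chunk_embs)))
    ranked = sorted(range(len(chunk_embs)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])
```

Because the indexer in this formulation only needs embeddings, it can be trained like a retrieval model, which is consistent with the abstract's claim that the backbone never has to be loaded during indexer training.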

2606.08982 2026-06-10 cs.AI Version update

Baichuan-M4: A Clinical-Grade Medical Agent System for Continuous Care

Aiyuan Yang, Canbin Piao, Chengfeng Dou, Da Pan, Dian Wang, Fan Yang, Fei Deng, Fei Li, Guangwei Ai, Hui Liu, Hongda Zhang, Jinyang Tai, Kai Lu, Lijun Liu, Linwei Chen, Linyu Li, Meiqing Guo, Peidong Guo, Qiang Ju, Rihui Xin, Shuai Wang, XinKai Ma, Xudong Chen, Yichuan Mo, Yijie Zhou, Leyi Pan, Yihe Luo, Zian Wang

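Institutions: Baichuan AI; THUBPM Group, Tsinghua University

AI summary: Presents Baichuan-M4, a clinical-grade medical large model built as an agent system on three pillars (a unified runtime, a continuous-care reinforcement-learning framework, and a clinical tool layer); it attains leading results on multiple medical evaluations and lowers the hallucination rate to 3.3%.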

Abstract

Baichuan-M4 is Baichuan Intelligence's clinical-grade medical large model, designed for continuous care rather than single-turn medical question answering. It is built as a coordinated medical agent system around three pillars: Baichuan-Harness, a unified runtime that keeps reinforcement-learning training and real-world deployment consistent while enforcing action constraints, tool use, long-term patient memory, and multi-agent coordination; a core reasoning model trained with a continuous-care reinforcement-learning framework that integrates span-level reward modeling (SPAR++), reasoning-path compression, curriculum learning, and stabilized policy optimization; and a clinical tool layer for patient-memory management, authoritative evidence-based retrieval, and multimodal medical perception across documents, X-rays, and dermatology. On a cross-dimensional medical evaluation suite, Baichuan-M4 attains leading results in static medical knowledge and safety, dynamic OSCE-style consultation, long-context clinical memory, evidence-based retrieval, medical document OCR, and multimodal image understanding, while lowering the hallucination rate to 3.3%.

2606.08779 2026-06-10 cs.LG Version update

Reformulate LLM Reinforcement Learning for Efficient Training under Black-box Discrepancy

Jiashun Liu, Runze Liu, Xu Wan, Jing Liang, Hongyao Tang, Ling Pan

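Institutions: Hong Kong University of Science and Technology; Zhejiang University; Tianjin University

AI summary: Addresses the train-inference discrepancy in reinforcement learning by proposing the Discrepancy-Constrained Markov Decision Process (DCMDP), which uses Lagrangian relaxation to adaptively balance performance improvement against discrepancy control for stable, efficient training.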

Abstract

Reinforcement Learning (RL) has emerged as a pivotal post-training paradigm, yet it frequently suffers from unpredictable suboptimal performance or even training collapse. Recent findings attribute these failures to a hidden train-inference discrepancy (or mismatch), stemming from the disparate underlying engines and architectures. We find that the training policy can actively self-correct such a discrepancy when provided with an appropriate learning signal. We further empirically identify a discrepancy tolerance region: within this region, aggressively narrowing the discrepancy can suppress policy exploration and reduce learning efficiency, whereas outside this region, reducing excessive discrepancy improves optimization consistency and raises the achievable local performance ceiling. Based on these findings, we formulate this problem as a Discrepancy-Constrained Markov Decision Process (DCMDP), where reward maximization is coupled with a constraint that aligns train-inference behavior, achieving stable dual-objective optimization. To adaptively balance performance improvement and discrepancy control, we introduce a Lagrangian relaxation mechanism that dynamically adjusts the relative weight of the two objectives according to the current degree of discrepancy violation. This enables stable dual-objective optimization: the policy is allowed to explore freely within the tolerance region, while being guided back when the discrepancy exceeds the safe boundary. Empirically, DCMDP significantly improves the performance of an 8B dense model (Qwen-3-8b) and a 30B Mixture-of-Experts model (Qwen-3-30bA3b), and enables a heterogeneous training paradigm, where LLMs can be optimized in a high-fidelity training setup while being explicitly aligned for low-cost, resource-constrained inference deployment.
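
The Lagrangian relaxation described above can be sketched with the textbook projected dual-ascent update for a constraint of the form "discrepancy ≤ tolerance". This is the generic constrained-optimization recipe, not necessarily the exact update used in the paper; the function names, the scalar discrepancy measure, and the step size are assumptions.

```python
def update_multiplier(lam, discrepancy, tolerance, step_size=0.1):
    """Projected dual ascent: lam grows on violation and decays to 0 otherwise."""
    return max(0.0, lam + step_size * (discrepancy - tolerance))


def penalized_objective(reward, discrepancy, lam, tolerance):
    """Lagrangian of: maximize reward subject to discrepancy <= tolerance."""
    return reward - lam * (discrepancy - tolerance)


lam = 0.0
for d in [0.05, 0.30, 0.40, 0.08, 0.02]:  # made-up per-step discrepancy values
    lam = update_multiplier(lam, d, tolerance=0.1)
    print(round(lam, 3))
```

While the discrepancy stays inside the tolerance region the multiplier is driven back to zero and the policy optimizes reward alone; once the boundary is crossed, the multiplier grows and the penalty term pulls the policy back.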

2606.07936 2026-06-10 cs.CL cs.AI Version update

Illusions of the Gold Standard: A Large-scale Analysis of Human Evaluation Protocols for Long-form Text Generation

Katelyn Xiaoying Mei, Yi-Li Hsu, Minjoon Choi, Zongwan Cao, Chenjun Xu, Bingbing Wen, Su Lin Blodgett, Lucy Lu Wang

Institutions: University of Washington; National Tsing Hua University; Seoul National University; Mila - Québec AI Institute; Allen Institute for AI

AI summary: Analyzes human evaluation protocols in *CL conference papers from 2023-2025, finds opaque reporting and poor reproducibility, and proposes recommendations for improvement.

Comments
Accepted to ACL 2026 Main

Abstract

Human evaluation plays a critical role in assessing the quality of generated text. However, the reliability and reproducibility of these evaluations depend on transparent and well-documented protocols, details that are frequently missing in current practice. In this work, we conduct a large-scale analysis of human evaluation protocols for evaluating long-form generation tasks in *CL conference publications from 2023-2025, including a full manual review of 284 papers and LLM-assisted analysis for another 1.8k+ papers. We define a set of 20 reportable criteria related to reproducibility of human evaluation studies, and apply these criteria to systematically examine reporting norms and practices within the community. We find widespread under-reporting of important aspects of human evaluation study design, leading to ambiguity about what was measured and how, who contributed judgments, and how judgments should be interpreted. Based on these findings, we outline actionable recommendations to support more transparent and reproducible reporting in future research. Our analysis code and annotated dataset can be found at: https://github.com/larchlab/Illusions-of-the-Gold-Standard

2606.07605 2026-06-10 cs.LG cs.AI Version update

SRT: Super-Resolution for Time Series via Disentangled Rectified Flow

Jufang Duan, Shenglong Xiao, Yuren Zhang

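Institutions: Bytedance

AI summary: Proposes SRT, a framework that reconstructs high-resolution time series from low-resolution inputs via disentangled rectified flow: it decomposes the input into trend and seasonal components, aligns them to the target resolution with an implicit neural representation, and uses a cross-resolution attention mechanism to guide detail generation.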

Journal ref
The Fourteenth International Conference on Learning Representations (ICLR 2026)
Comments
Accepted to the International Conference on Learning Representations (ICLR) 2026

Abstract

Fine-grained time series data with high temporal resolution is critical for accurate analytics across a wide range of applications. However, the acquisition of such data is often limited by cost and feasibility. This problem can be tackled by reconstructing high-resolution signals from low-resolution inputs based on specific priors, known as super-resolution. While extensively studied in computer vision, directly transferring image super-resolution techniques to time series is not trivial. To address this challenge at a fundamental level, we propose Super-Resolution for Time series (SRT), a novel framework that reconstructs temporal patterns lost in low-resolution inputs via disentangled rectified flow. SRT decomposes the input into trend and seasonal components, aligns them to the target resolution using an implicit neural representation, and leverages a novel cross-resolution attention mechanism to guide the generation of high-resolution details. We further introduce SRT-large, a scaled-up version with extensive pre-training, which enables strong zero-shot super-resolution capability. Extensive experiments on nine public datasets demonstrate that SRT and SRT-large consistently outperform existing methods across multiple scale factors, showing both robust performance and the effectiveness of each component in our architecture.
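
The abstract does not say how SRT splits a series into trend and seasonal parts. As a point of reference only, the classical approach is a centered moving average for the trend, with the remainder treated as the seasonal (plus residual) component. The sketch below shows that classical decomposition; the function name and window size are assumptions and it should not be read as SRT's actual module.

```python
def decompose(series, window=3):
    """Split a series into a moving-average trend and the remainder.

    The window is truncated at the boundaries so the output has the
    same length as the input.
    """
    half = window // 2
    trend = []
    for i in range(len(series)):
        lo, hi = max(0, i - half), min(len(series), i + half + 1)
        trend.append(sum(series[lo:hi]) / (hi - lo))
    seasonal = [x - t for x, t in zip(series, trend)]
    return trend, seasonal
```

By construction the two components sum back to the input, so each can be upsampled to the target resolution separately and recombined afterwards.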

2606.07586 2026-06-10 cs.LG cs.AI cs.AR cs.MA Version update

From Human Guidance to Autonomy: Agent Skill System for End-to-End LLM Deployment on Spatial NPUs

Jiajie Li, Erwei Wang, Zhiru Zhang, Samuel Bayliss

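Institutions: AMD Research and Advanced Development

AI summary: Proposes a two-stage methodology that moves from human-guided, agent-assisted deployment to an autonomous agent skill system, which deploys eight additional LLMs end-to-end on the AMD XDNA 2 NPU; three of the eight match or exceed the sustained performance of the human-guided reference deployment.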

Comments
Accepted to the Machine Learning for Architecture and Systems Workshop (MLArchSys), co-located with ISCA 2026

Abstract

Spatial neural processing units (NPUs) provide an energy-efficient platform for edge LLM inference, but efficiently deploying an LLM end-to-end on such hardware remains labor-intensive. Although AI coding agents have begun to lower this cost, existing studies have largely focused on single-kernel optimization rather than end-to-end LLM deployment on resource-constrained spatial NPUs. We present a two-stage methodology, instantiated on the AMD XDNA 2 NPU, that progresses from human-guided development to agent autonomy. In the first stage, we develop a reference deployment of Llama-3.2-1B through human-guided agent assistance. The resulting implementation achieves a speedup of 2.2x on prefill and 4.0x on decode over the hand-optimized baseline, with the optimization trajectory and its lessons recorded as structured documentation throughout. In the second stage, we distill the documentation into an agent skill system consisting of eight phases, orchestrating the optimization and debugging skill sets, with numerical correctness strictly enforced at each phase. Using our agent skill system, we autonomously deploy eight additional decoder-only LLMs (Llama-3.2-3B, SmolLM2-1.7B, Qwen2.5-{0.5B, 1.5B, 3B}, Qwen3-{0.6B, 1.7B, 4B}) end-to-end on the AMD XDNA 2 NPU using the open-source compiler stack. To our knowledge, these models have not previously been deployed on AMD NPUs via any open-source software stack. Each deployment completes in 0.5-4 hours of agent wall time with almost no human guidance, and passes the numerical-correctness gates, demonstrating functional generalization to previously unencountered LLMs. Three of the eight match or exceed the sustained performance of our Llama-3.2-1B reference deployment, suggesting that the resulting implementations can be competitive without additional model-specific human engineering.

2606.07422 2026-06-10 cs.CL cs.AI Version update

The Masked Advantage: Uncovering Local-Language Access to Cultural Knowledge in LLMs

Yang Zhang, Xiao Fei, Amr Mohamed, Sarah Almeida Carneiro, Mersin Konomi, Mingmeng Geng, Ahmed Asaad, Guokan Shang, Michalis Vazirgiannis

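Institutions: Ecole Polytechnique; MBZUAI; ENS-PSL; Durham University

AI summary: Uses a controlled framework and an item response theory model to separate language proficiency from cultural knowledge access, finding that local languages hold an advantage in accessing cultural knowledge that is often masked by limited language proficiency.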

Abstract

Large language models are increasingly used to answer culturally grounded questions across languages, yet it remains unclear whether local cultural knowledge is better accessed through English or the local language. Existing evaluations face two key limitations: many rely on parallel template-based questions that may not reflect how cultural knowledge naturally appears, and raw accuracy conflates general language proficiency with language-conditioned knowledge access. We address these issues with a controlled framework built on real-world cultural questions collected from regional benchmarks and local sources. By crossing question type (culture-agnostic vs. culture-specific) with query language (English vs. local language), and estimating ability with a shared 1PL item response theory model, we separate proficiency from localized knowledge access. Across 13 locales and roughly 80 models, we find a consistent English advantage on culture-agnostic questions, indicating stronger English proficiency. However, after accounting for this proficiency gap, local languages show a positive knowledge-access advantage in nearly all locale-model settings. This advantage is often masked in raw accuracy but becomes more visible for frontier, regionally aligned, or language-adapted models. Our results suggest that weaker local-language performance does not necessarily imply weaker cultural knowledge; rather, local cultural knowledge may be more accessible through the local language but hidden by limited language proficiency.
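
For readers unfamiliar with the 1PL (Rasch) item response theory model mentioned above: it assumes the probability of a correct answer depends only on the gap between a respondent's ability and an item's difficulty, passed through a logistic function. The sketch below shows that standard formula; how the paper fits abilities and derives the knowledge-access advantage from them is not specified in the abstract, so the example values are purely illustrative.

```python
import math


def p_correct(ability, difficulty):
    """1PL (Rasch) model: P(correct) = sigmoid(ability - difficulty)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Same item, two hypothetical ability levels (e.g., one model queried in two languages)
item_difficulty = 0.5
print(p_correct(1.5, item_difficulty) > p_correct(0.5, item_difficulty))
```

Since item difficulty is shared, a lower estimated ability in one language lowers the expected accuracy on every item, which is why raw accuracy alone cannot distinguish weak proficiency from weak knowledge.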

2606.07135 2026-06-10 cs.LG Version update

Explaining Unsupervised Disease Staging in Huntington's Disease: Insights into Model Representations and Clusters

Lubna Mahmoud Abu Zohair, Hind Zantout

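Institutions: Heriot-Watt University

AI summary: Extends an unsupervised disease-staging framework with explainability analysis; on the Enroll-HD dataset, it shows that the model embeddings align with clinical progression and uses SHAP to quantify feature importance, identifying disease stages that range from early cognitive-motor impairment to severe functional dependency.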

Comments
Accepted for oral presentation and as a full-length paper at the International Conference on AI in Healthcare 2026 (26-28 August 2026, Imperial College London) and will be published by Springer in the Lecture Notes in Computer Science (LNCS) series

Abstract

Huntington's disease (HD) is a progressive neurodegenerative disorder that affects motor, cognitive, and behavioral functions, and accurate characterization of disease progression remains essential to improving patient outcomes and quality of life. Unsupervised machine learning (ML) approaches have demonstrated the ability to uncover disease progression trajectories and meaningful latent stages from longitudinal data; however, their limited interpretability restricts clinical trust and translation. We extend a previously proposed ML-based disease staging framework by applying an explainability analysis to the extracted feature representations and discovered disease stages. Using the Enroll-HD dataset, we first project the learned representations into a lower-dimensional space to intuitively assess whether the resulting clusters align with the progression of established clinical measures. We then use saliency maps to identify the clinical features that most strongly contribute to the learned embeddings over time. Finally, we train a surrogate classifier and apply SHAP to quantify feature importance for cluster assignments and to analyze which clinical variables drive transitions between disease stages. The explainability analysis indicates that the learned embeddings capture clinically meaningful disease structure, aligning with established motor and functional severity scores and exhibiting progressive deterioration across clusters. Within this analysis, SHAP reveals a stratification of disease stages, ranging from early cognitive-motor impairment to severe functional dependency, consistent with known clinical progression patterns, while also highlighting intra-stage variability.

2606.07088 2026-06-10 cs.LG math.OC Version update

Residual-Controlled Multiplier Learning for Stochastic Constrained Decision-Making

Kang Liu, Jianchen Hu, Ziyu Qu, Edward Hengzhou Yan, Lun Yang, Meng Zhang

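Institutions: Xi’an Jiaotong University; Tencent; China University of Geosciences

AI summary: Proposes Residual-Controlled Multiplier Learning (RCML), which reformulates multiplier updating as projected-pressure feedback and adds modular stochastic stabilization components to address the unstable multiplier updates that mini-batch noise causes in primal-dual methods for stochastic constrained decision-making; it establishes finite-gain convergence and a local KKT-residual interpretation.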

Abstract

Stochastic constrained decision-making requires optimizing performance objectives while enforcing statistical requirements such as safety or fairness. However, standard primal-dual methods struggle to update multipliers robustly under stochastic mini-batch feedback, as the noise of mini-batch gradients and constraint estimates can be directly accumulated into the multiplier memory. To address this issue, we propose Residual-Controlled Multiplier Learning (RCML), which reformulates multiplier updating as projected-pressure feedback. The central idea is to decompose the projected multiplier into an effective pressure signal for primal descent and a pressure-memory residual for finite-gain multiplier tracking. To handle heterogeneous and noisy observations, we further augment this residual-integral backbone with modular stochastic stabilization components. For the convex-affine backbone, we establish finite-gain convergence, derive a stochastic residual bound under mini-batch feedback, and show that the residual feedback law admits a local KKT-residual interpretation near regular KKT points of nonconvex problems. Experiments across optimization, allocation, and fair-ranking tasks show that RCML improves feasibility control and multiplier stability while maintaining competitive objective performance. Code is released at https://anonymous.4open.science/r/RCML-3114/.

2606.06888 2026-06-10 cs.LG Version update

Data-Constrained Language Model Pretraining: Improved Regularization and Scaling Laws

Zhiwei Xu, Shihao Wu, Hanseul Cho, Wei Hu, Yixin Wang

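Institutions: University of Michigan; KAIST AI

AI summary: Studies regularization and scaling for data-constrained language model pretraining, proposing masked-input regularization (MIR) to improve validation loss and the SoftQ scaling law to fit the interaction of model size and data size under repeated data more accurately.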

Abstract

Classical scaling laws for language model pretraining balance model size against training dataset size under a fixed compute budget, assuming abundant data and a single pass over the corpus. As training compute grows faster than the supply of natural language data, pretraining is likely to enter a data-constrained, compute-rich regime where models train for multiple epochs over a finite dataset. We study data-constrained pretraining along two axes, regularization and scaling. For regularization, we study masked-input regularization (MIR), an auxiliary next-token prediction loss on randomly masked inputs. MIR tests whether the random masking central to diffusion language models can benefit autoregressive pretraining without architectural changes or inference overhead. Across 72M to 1.4B parameter models, we find that MIR added on top of strong weight decay improves validation loss over autoregressive strong-weight-decay-only models, with downstream gains at 1.4B. For scaling, we propose SoftQ, a scaling law that couples model size and data size to capture their interaction under repeated data. Classical alternatives such as the Chinchilla law use an additive form that decouples these terms, making them misspecified in the data-constrained regime. We find that SoftQ fits data-constrained experiments substantially better than these alternatives, and estimates MIR's gains as equivalent to roughly 1.3 times as much unique training data. We release our code at https://github.com/yixinw-lab/dc_pretrain.
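
For reference, the additive Chinchilla-style form that the abstract contrasts with SoftQ is usually written as below, where $N$ is the parameter count, $D$ the number of training tokens, and $E, A, B, \alpha, \beta$ are fitted constants. The functional form of SoftQ is not given in the abstract and is therefore not reproduced here.

```latex
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Because the $N$ and $D$ terms enter as separate summands, the marginal value of extra data in this form does not depend on model size, which is the decoupling the abstract argues becomes misspecified once data is repeated over multiple epochs.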

2606.06758 2026-06-10 cs.CL Version update

Diagnosing Evidence Utilization in Long-Context and Retrieval-Augmented Language Models under Matched Evidence Conditions

Haizhou Xia

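Institutions: University of Western Ontario

AI summary: Proposes a four-condition evidence-availability protocol that uses the ONCU estimator to separate model performance under no-evidence, full-context, retrieved-evidence, and oracle-evidence conditions, diagnosing evidence-utilization bottlenecks in long-context and retrieval-augmented language models.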

Comments
46 pages, 37 tables, 1 figure
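AI-generated Chinese abstract (translated)

Final-answer accuracy, retrieval recall, and citation overlap do not by themselves establish whether a long-context or retrieval-augmented language model used the evidence it was given. A model may answer from parametric memory, fail despite receiving the correct passages, or cite evidence without converting it into the requested answer. This paper proposes a matched four-condition evidence-availability protocol (no evidence, full context, retrieved evidence, and oracle-evidence reference) for diagnosing evidence utilization under fixed examples, prompts, scoring fields, retrieval settings, and validity checks. ONCU is used as a protocol-bound estimator of the recovered oracle-reference evidence advantage and is computed only for denominator-valid groups; denominator-free answer, evidence, retrieval, and failure-audit metrics are reported separately. The empirical study evaluates five local open-source models from the Qwen, Gemma, Llama, and Mistral families on Controlled-ONCU-safe16K, HotpotQA-ONCU, and 2WikiMultiHopQA-ONCU, yielding 18,000 ONCU-compatible predictions. The main finding is a task-dependent bottleneck split: the controlled synthetic setting mainly exposes full-context utilization failures, whereas the tested realistic multi-hop settings mainly expose retrieval-chain coverage failures in the denominator-free answer and evidence metrics, with ONCU supporting the same direction on oracle-improving groups. The contribution is a diagnostic protocol that separates no-evidence answerability, oracle-evidence recoverability, full-context utilization, and retrieval-conditioned utilization, rather than a single-score leaderboard for long-context or retrieval-augmented systems.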

Abstract

Final-answer accuracy, retrieval recall, and citation overlap do not reveal how much answer advantage a long-context or retrieval-augmented language model actually recovers from supplied evidence. A model may answer from parametric priors, fail to use evidence that is present, or cite relevant text without converting it into the final answer. This paper introduces a four-condition diagnostic protocol for evidence-utilization evaluation under matched examples, models, prompts, and scoring rules. The protocol compares no-evidence, full-context, retrieved-evidence, and oracle-evidence reference conditions, and uses Oracle-Reference Normalized Context Utilization (ONCU) as a denominator-valid estimate of recovered oracle-reference evidence advantage. The empirical study evaluates five local open-weight models from the Qwen, Gemma, Llama, and Mistral families over Controlled-ONCU-safe16K, HotpotQA-ONCU, and 2WikiMultiHopQA-ONCU, comprising 18,000 ONCU-compatible predictions. Results show a task-dependent diagnostic pattern: controlled synthetic settings expose reduced recovery when the same evidence is embedded in long input rather than supplied compactly, while realistic multi-hop reconstructions show that full-context inputs outperform the tested retrieved inputs in denominator-free answer and evidence metrics, with ONCU supporting the same direction on oracle-improving groups. Sensitivity audits with stronger retrieval settings narrow some gaps but do not overturn the scoped interpretation. The main contribution is therefore not a single utilization ratio, but a matched diagnostic protocol that separates no-evidence answerability, oracle-evidence recoverability, full-context recovery, retrieval-conditioned recovery, denominator validity, and companion answer/evidence diagnostics.
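
The abstract does not give the ONCU formula. One plausible reading of "oracle-reference normalized" and "denominator-valid" is a ratio of the gain a condition recovers over the no-evidence baseline to the gain the oracle evidence provides, defined only when the oracle actually helps. The sketch below encodes that assumed reading; the function name, argument names, and the choice to return `None` for invalid denominators are illustrative and may differ from the paper's definition.

```python
def oncu(score_condition, score_no_evidence, score_oracle):
    """Assumed form: recovered gain divided by oracle gain.

    Returns None when the oracle does not improve on the no-evidence
    baseline, i.e., when the denominator is not valid.
    """
    denom = score_oracle - score_no_evidence
    if denom <= 0:
        return None
    return (score_condition - score_no_evidence) / denom


# Made-up scores: no evidence 0.2, full context 0.6, oracle evidence 1.0
print(oncu(0.6, 0.2, 1.0))
```

Under this reading a value near 1 means the condition recovers almost all of the advantage that compact oracle evidence would give, and groups where the oracle gives no advantage are excluded rather than scored.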

2606.06744 2026-06-10 cs.LG cs.GT cs.MA econ.TH Updated version

Learn to Match: Two-Sided Matching with Temporally Extended Feedback

学会匹配:具有时间扩展反馈的双边匹配

Haijing Zong, Yancheng Liang, Boyang Zhou, Natasha Jaques

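Affiliations * University of Washington

AI summary: Proposes a two-sided matching framework with temporally extended feedback, modeled as a partially observable Markov game, and builds the Learn2Match benchmark on multi-agent reinforcement learning; experiments show that independent PPO outperforms a bandit baseline but still incurs information-friction loss.

Details
AI-generated Chinese abstract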

双边匹配市场通常涉及随时间通过面试、重复互动、学习和分离而展开的信息。现有的匹配模型通常将此过程简化为关于固定偏好的即时亚高斯反馈,忽略了支付相关信息逐渐揭示并改变未来匹配决策的场景。我们引入了一个具有时间扩展反馈的框架,将双边匹配建模为一个部分可观测马尔可夫博弈,其中包含昂贵的匹配前筛选、有噪声的匹配后观测、演变的潜在特征以及内生的延续或解散。我们在Learn2Match中实例化该框架,这是一个用于动态匹配市场的多智能体强化学习基准。Learn2Match支持关于面试谁、与谁匹配以及何时解散匹配的分散决策,同时使用遗憾、社会福利和信息摩擦损失(衡量由潜在偏好不完全揭示引起的福利差距)来评估策略。我们发现,在时间扩展反馈下,独立PPO比bandit风格的CA-ETC基线实现了更高的累积社会福利和更低的累积遗憾,展示了MARL在动态匹配市场中的前景。然而,PPO仍然产生更高的信息摩擦损失,表明端到端MARL尚未提供匹配bandit方法的协调探索结构。这些结果将Learn2Match定位为开发下一代匹配市场算法的基准:像RL智能体一样自适应、像bandit算法一样统计严谨、像稳定匹配机制一样结构感知的方法。

English abstract

Two-sided matching markets often involve information that unfolds over time through interviews, repeated interaction, learning, and separation. Existing matching models typically reduce this process to immediate sub-Gaussian feedback about fixed preferences, missing settings where payoff-relevant information is revealed gradually and changes future matching decisions. We introduce a framework with temporally extended feedback that formulates two-sided matching as a partially observable Markov game with costly pre-match screening, noisy post-match observations, evolving latent profiles, and endogenous continuation or dissolution. We instantiate this framework in Learn2Match, a multi-agent reinforcement-learning benchmark for dynamic matching markets. Learn2Match supports decentralized decision making over whom to interview, whom to match with, and when to dissolve a match, while evaluating policies using regret, social welfare, and an information-friction loss that measures the welfare gap caused by incomplete revelation of latent preferences. We find that independent PPO achieves higher cumulative social welfare and lower cumulative regret than the bandit-style CA-ETC baseline under temporally extended feedback, demonstrating the promise of MARL for dynamic matching markets. However, PPO still incurs higher information-friction loss, revealing that end-to-end MARL does not yet provide the coordinated exploration structure of matching-bandit methods. These results position Learn2Match as a benchmark for developing the next generation of matching-market algorithms: methods that are adaptive like RL agents, statistically disciplined like bandit algorithms, and structurally aware like stable-matching mechanisms. Please refer to https://sites.google.com/view/learn-to-match/home for the official website and the code link.
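The information-friction loss is described only as the welfare gap caused by incomplete revelation of latent preferences. One possible reading, sketched below, compares the welfare of the best matching under true utilities with the true welfare of the matching chosen under current beliefs. The utility tables, the brute-force search, and the function names are hypothetical and are not taken from Learn2Match.

```python
from itertools import permutations


def welfare(matching, utility):
    """Total utility when agent i on one side is matched to matching[i]."""
    return sum(utility[i][j] for i, j in enumerate(matching))


def best_matching(utility):
    """Brute-force welfare-maximizing matching (only viable for tiny markets)."""
    n = len(utility)
    return max(permutations(range(n)), key=lambda m: welfare(m, utility))


# Hypothetical joint pair utilities: true values vs. what agents currently believe.
true_u = [[5, 1, 0], [2, 4, 1], [0, 3, 6]]
believed_u = [[1, 5, 0], [4, 2, 1], [0, 3, 6]]

chosen = best_matching(believed_u)
friction_loss = welfare(best_matching(true_u), true_u) - welfare(chosen, true_u)
```

In this toy market the beliefs swap the first two partners, so the loss is the welfare that better-revealed preferences would have recovered.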

2606.06742 2026-06-10 cs.LG stat.ML Updated version

TorchKM: A GPU-Oriented Library for Kernel Learning and Model Selection

TorchKM:面向GPU的核学习与模型选择库

Yikai Zhang, Gaoxiang Jia, Jie Ding, Boxiang Wang

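Affiliations * University of Iowa; University of Minnesota

AI summary: Presents TorchKM, a GPU-accelerated kernel-learning library that speeds up training and model selection for SVMs, kernel logistic regression, and related models through intelligent reuse of matrix operations, delivering substantial speedups over standard baselines.

Details
Comments
14 pages, 2 figures
AI-generated Chinese abstract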

TorchKM是一个用于核机器的开源库,包括支持向量机、核逻辑回归和核分位数回归,并具有GPU加速。该库采用scikit-learn风格的API,旨在利用GPU友好的线性代数,通过智能复用矩阵运算加速完整的训练和模型选择流程。基准测试显示,与标准基线相比,具有竞争力的预测性能以及显著的加速效果。代码和文档可在 https://github.com/YikaiZhang95/torchkm 获取,并且该包可以通过PyPI轻松安装。

English abstract

TorchKM is an open-source library for kernel machines, including support vector machines, kernel logistic regression, and kernel quantile regression, with GPU acceleration. The library features a scikit-learn-style API and is designed to exploit GPU-friendly linear algebra, accelerating the full training and model-selection pipeline through intelligent reuse of matrix operations. Benchmarks show competitive predictive performance with substantial speedups over standard baselines. The efficiency and programmable design also make TorchKM a kernel-learning component for AI-driven workflows. Code and documentation are available at https://github.com/YikaiZhang95/torchkm, and the package can be easily installed via PyPI.

2606.06735 2026-06-10 cs.AI Updated version

A Geometric Account of Activation Steering through Angle-Norm Decomposition

通过角度-范数分解的激活引导的几何解释

Georgii Aparin, Tatiana Gaintseva

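Affiliations * Huawei Noah's Ark Lab; Queen Mary University of London

AI summary: Uses controlled experiments to separate angular and radial components, finding that concepts are encoded mainly in angular structure while norm remains critical to the stability and downstream effects of steering; recommends parameterizing activation steering by interpretable angular and radial components.

Details
AI-generated Chinese abstract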

线性激活引导作为一种简单且经验有效的控制语言模型行为的方法已受到广泛关注。最近,球形引导范式被提出来解决加性干预的局限性,其动机通常是假设隐藏状态范数不携带概念相关信息。在这项工作中,我们通过一项旨在分离角度和径向分量作用的受控实证研究重新审视了这一假设。我们表明,引导方法的主要区别在于它们如何耦合两种几何效应:改变令牌与概念方向的角度对齐以及改变其隐藏状态范数。在七个语言模型上,我们发现概念主要表示在角度结构中,这支持了球形方法的动机,但范数对于引导的稳定性和下游效应仍然重要。我们的结果解释了为什么具有相似概念级别效果的干预可能表现不同,并建议激活引导应由干预的可解释角度和径向分量参数化,而不是由纠缠这两种效应的单个加性系数参数化。

English abstract

Linear activation steering has gained popularity as a simple and empirically effective way to control language model behavior. More recently, spherical steering paradigms have been proposed to address limitations of additive interventions, often motivated by the assumption that hidden-state norm does not carry concept-relevant information. In this work, we revisit this assumption through a controlled empirical study designed to disentangle the roles of angular and radial components. We show that steering methods differ mainly in how they couple two geometric effects: changing a token's angular alignment with a concept direction and changing its hidden-state norm. Across seven language models, we find that concepts are represented primarily in angular structure, supporting the motivation for spherical methods, but that norm remains important for the stability and downstream effects of steering. Our results explain why interventions with similar concept-level effects can behave differently, and suggest that activation steering should be parameterized by interpretable angular and radial components of the intervention, rather than by a single additive coefficient that entangles these two effects.
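The coupling of angular and radial effects can be seen in a two-dimensional toy example. The sketch below is only an illustration under assumed names (`additive_steer`, `angular_steer`) and made-up vectors; it is not the authors' implementation. Additive steering changes both the angle to the concept direction and the norm, whereas rescaling back to the original norm keeps only the angular change.

```python
import math


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def angle_deg(u, v):
    """Angle between two vectors in degrees, clamped for numerical safety."""
    cos = sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))


def additive_steer(h, concept, alpha):
    """Standard additive intervention: h + alpha * concept."""
    return [hi + alpha * ci for hi, ci in zip(h, concept)]


def angular_steer(h, concept, alpha):
    """Same angular change as additive steering, but the original norm is restored.

    Assumes the additively steered vector is non-zero.
    """
    steered = additive_steer(h, concept, alpha)
    scale = norm(h) / norm(steered)
    return [scale * x for x in steered]


h = [3.0, 4.0]          # toy hidden state with norm 5
concept = [1.0, 0.0]    # toy unit concept direction
added = additive_steer(h, concept, 5.0)
rotated = angular_steer(h, concept, 5.0)
```

Here `added` and `rotated` share the same angle to `concept`, but only `added` has a larger norm, which is the kind of entanglement a single additive coefficient hides.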

2606.06698 2026-06-10 cs.LG cs.CL Updated version

RECAP: Regression Evaluation for Continual Adaptation of Prompts

RECAP: 提示持续适应的回归评估

Harsh Deshpande, Kushal Chawla, Sangwoo Cho, William Campbell, Sambit Sahu

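Affiliations * Capital One

AI summary: Introduces the RECAP benchmark, which evaluates how well prompt-optimization methods continually adapt to changing constraints under a strictly proactive adapt-then-test protocol; existing methods show no significant performance improvement in the proactive setting, underscoring the need for proactive prompt-adaptation methods.

Details
AI-generated Chinese abstract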

生产中的代理系统经常面临不断变化的约束,并且必须从下一次交互开始就遵守。诸如工具调用通知更改合规阈值或策略更新添加披露要求等场景符合这一标准,在生产中几乎没有出错的空间。这种主动适应设置在部署中很常见,但在当前的基准测试中却不存在,这些基准测试假设要么是静态约束集,要么是带有评估反馈的反应式协议。我们引入了RECAP,这是一个基准测试,在严格主动适应-测试协议下,在约束级别测量持续学习现象(遗忘、回归、前向转移):提示优化方法仅接收约束规范,并且必须在看到任何测试数据之前进行泛化。我们在四个LLM和三个具有不断变化的约束的调度上评估了六种方法,发现这些方法在性能上没有显著改善,即使在产生更高延迟之后也是如此。这些为离线或反应式设置设计的方法不足以应对主动范式。我们的工作强调了设计主动提示适应方法的日益增长的需求,其中模型必须对部署中不断变化的需求保持鲁棒性。

English abstract

Production agentic systems routinely face evolving constraints and must comply from the very next interaction. Scenarios such as a tool-call notification changing a compliance threshold or a policy update adding disclosure requirements fit this criterion, leaving almost no room for error in production. This proactive adaptation setting is common in deployment but absent from current benchmarks, which assume either static constraint sets or reactive protocols with evaluation feedback. We introduce RECAP, a benchmark that measures continual-learning phenomena (forgetting, regression, forward transfer) at the constraint level under a strictly proactive adapt-then-test protocol: prompt optimization methods receive only the constraint specification and must generalize before seeing any test data. Evaluating six methods across four LLMs and three schedules with evolving constraints, we find that these methods show no significant improvement in performance, even after incurring higher latency. These methods, designed for offline or reactive settings, are inadequate for the proactive paradigm. Our work emphasizes the growing need for proactive prompt adaptation methods, where models must remain robust to evolving needs in deployment.

2606.06622 2026-06-10 cs.CL Updated version

UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs

UnpredictaBench:评估大语言模型分布随机性的基准

Amirhossein Abaskohi, Amirhossein Dabiriaghdam, Liang Luo, Ellie Dingqiao Wen, Lele Wang, Giuseppe Carenini, Peter West

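Affiliations * University of British Columbia; Independent Researcher

AI summary: Introduces the UnpredictaBench benchmark, which uses the KS@N metric to evaluate how well LLMs sample from target distributions (statistical distributions, stochastic programs, natural-language scenarios); model performance varies widely and no model exceeds 40% at KS@100, indicating substantial headroom in distributional sampling.

Details
AI-generated Chinese abstract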

我们引入了UnpredictaBench,这是一个评估大语言模型(LLM)捕捉真实潜在分布能力的测试。随着LLM越来越多地被用作其他实体的替代品(例如,在经济模拟中替代人类),许多模型倾向于坍缩到单一合理答案,这导致无法捕捉真实系统的不可预测性。最近关于提高输出多样性的工作对于这种设置是不够的:模拟需要从目标分布中校准的样本,而不仅仅是多样化的输出。UnpredictaBench提炼了该问题的一个简化但基础的版本:从单个目标分布中采样结果,包括经典统计分布、随机程序诱导的分布以及描述随机过程的自然语言场景。我们引入了448个这样的问题,以及KS@N,一个通用评估指标,通过Kolmogorov-Smirnov统计检验量化模型输出近似黑盒目标分布的程度。这是我们在样本量为N时未能拒绝模型样本与真实样本之间差异的比率,N越大表示难度越大。在开源和专有模型上的测试中,我们发现分布能力存在很大差异。例如,当模型生成样本量为100(KS@100,我们的标准指标)时,得分范围从接近0到超过20%。没有模型能在KS@100上达到40%以上,这表明分布采样作为一种能力仍有显著的提升空间。尽管增加推理可以在一定程度上提高得分,但我们发现这个问题没有立即可行的解决方案。UnpredictaBench表明,即使是简单的分布模拟仍然具有挑战性,这使得它成为使用LLM作为复杂系统替代品的必要第一步。

English abstract

We introduce UnpredictaBench, an evaluation that tests the ability of large language models (LLMs) to capture true underlying distributions. As LLMs are increasingly used as substitutes for other entities (e.g., for humans in economic simulations), the tendency of many models to collapse towards a single plausible answer means a failure to capture the unpredictability of real systems. Recent work on improving output diversity is insufficient for this setting: simulation requires samples that are calibrated to a target distribution, not merely varied outputs. UnpredictaBench isolates a simplified but fundamental version of this problem: sampling outcomes from individual target distributions, including canonical statistical distributions, distributions induced by stochastic programs, and natural-language scenarios that describe random processes. We introduce 448 such problems together with KS@N, a general-purpose evaluation metric that quantifies how well a model's outputs approximate a black-box target distribution via the Kolmogorov-Smirnov statistical test. This is the rate at which we fail to reject model samples of size N against ground-truth samples, with larger N indicating greater difficulty. Tested across open and proprietary models, we find a large spread in distributional capabilities. For instance, when models generate samples of size 100 (KS@100, our standard metric), scores range from near 0 to over 20%. No model is able to achieve over 40% at KS@100, showing significant headroom in distributional sampling as a capability. Although adding reasoning can somewhat increase scores, we find no immediate solution for this issue. UnpredictaBench shows that even simple distributional simulation remains challenging, making it a necessary first step toward using LLMs as stand-ins for complex systems.
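The KS@N description (the rate at which a two-sample Kolmogorov-Smirnov test fails to reject model samples of size N) can be sketched with the standard library alone. Everything below is an assumption made for illustration: the significance level of 0.05, the equal-sized reference sample, the asymptotic p-value approximation, and the toy samplers standing in for an LLM. The benchmark's actual procedure may differ.

```python
import math
import random
from bisect import bisect_right


def ks_statistic(a, b):
    """Largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    return max(abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
               for x in a + b)


def ks_pvalue(d, n, m):
    """Asymptotic two-sample p-value from the Kolmogorov series."""
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.2:  # the series converges slowly here and the p-value is ~1
        return 1.0
    total = sum(2 * (-1) ** (k - 1) * math.exp(-2 * k * k * lam * lam)
                for k in range(1, 101))
    return min(max(total, 0.0), 1.0)


def ks_at_n(problems, n, alpha=0.05, seed=0):
    """Fraction of problems where the test fails to reject at level alpha."""
    rng = random.Random(seed)
    passed = 0
    for model_sample, target_sample in problems:
        model = [model_sample(rng) for _ in range(n)]
        truth = [target_sample(rng) for _ in range(n)]
        passed += ks_pvalue(ks_statistic(model, truth), n, n) >= alpha
    return passed / len(problems)


def gaussian(mu):
    return lambda rng: rng.gauss(mu, 1.0)


def constant(mu):
    return lambda rng: float(mu)  # a "collapsed" model: always the mean


calibrated = [(gaussian(mu), gaussian(mu)) for mu in range(20)]
collapsed = [(constant(mu), gaussian(mu)) for mu in range(20)]
calibrated_score = ks_at_n(calibrated, 100)
collapsed_score = ks_at_n(collapsed, 100)
```

The collapsed sampler is the failure mode the abstract describes: a plausible single answer that never matches the spread of the target distribution.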

2606.11166 2026-06-10 stat.OT cs.AI New submission

Flaws in the LLM Automation Narrative

LLM自动化叙事中的缺陷

George Perrett, Javae Elliott, Jennifer Hill, Marc Scott

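AI summary: Using a new benchmark that requires writing code to complete a data analysis task, finds that a frontier LLM falls short of human experts in average performance, variance, and error magnitude, challenging claims that LLMs perform at human-expert level.

Details
AI-generated Chinese abstract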

大型语言模型(LLM)越来越多地被描述为在知识经济任务上达到人类专家水平。这些说法主要基于LLM在标准化数据集上衡量平均性能的基准测试任务中的表现。许多基准测试任务的主要局限性在于,它们通常基于直接包含在LLM训练数据中的内容来衡量性能,并且经常不评估LLM性能的可靠性或LLM错误的幅度。然而,在高风险情境中,这些品质至关重要。通过一项需要编写计算机代码完成数据分析任务的新型LLM基准测试,我们将前沿LLM的性能与人类专家的提交进行了比较,并明确测量了响应的方差和错误的幅度。我们的研究表明,人类专家在一系列指标上平均表现更好,并且表现出更小的性能变异性。我们的结果提供了证据,表明LLM并非始终如一地达到人类专家的水平,并证明了在LLM基准评估中测量方差和评估错误幅度的重要性。

English abstract

Large Language Models (LLMs) are increasingly described as performing at the level of human experts on knowledge economy tasks. These claims are primarily based on how LLMs perform on benchmarking tasks that measure average performance across standardized datasets. Primary limitations of many benchmarking tasks are that they often measure performance based on content directly included in LLM training data, and they frequently do not assess the reliability of LLM performance or the magnitude of LLM errors. However, in high stakes contexts, these qualities are critically important. Through a novel LLM benchmarking task that requires writing computer code to complete a data analysis task, we compare the performance of a frontier LLM against submissions from human experts and explicitly measure the variance of responses and the magnitude of errors. Our study reveals that the human experts perform better on average on a range of metrics and demonstrate less variability in performance. Our results provide evidence that LLMs do not consistently perform at the level of human experts and demonstrate the importance of measuring variance and assessing error magnitude in LLM benchmark evaluations.

2606.11156 2026-06-10 stat.ML cs.LG New submission

Itô maps for any-step SDEs

任意步SDE的Itô映射

Zhengkai Pan, Peter Potaptchik, Wenxi Yao, Michael S. Albergo, Jakiw Pidstrigach

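AI summary: Proposes the Itô map, an any-step stochastic flow map that predicts future states in a single forward pass, enabling exact distillation of stochastic dynamics and supporting inference-time control and posterior sampling.

Details
AI-generated Chinese abstract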

最近的单步生成模型通过学习底层动力学的确定性流映射来加速采样。这些方法依赖于从常微分方程学习,但如何为随机动力学定义精确的蒸馏过程仍是开放问题。我们引入Itô映射,一种任意步随机流映射,它接收中间状态和布朗路径,并在单次前向传播中预测未来状态。Itô映射公式通过提供廉价、可微的后验样本访问,为推理时控制提供了新的估计器。实验上,Itô映射从固定的中间状态生成多样、条件有效的端点样本,并在合成和图像生成基准上支持强引导性能。这些结果确立了任意步SDE积分作为后验采样和随机控制的有用原语。

English abstract

Recent one-step generative models accelerate sampling by learning deterministic flow maps of the underlying dynamics. These methods rely on learning from ordinary differential equations, leaving open how to define an exact distillation procedure for stochastic dynamics. We introduce the Itô map, an any-step stochastic flow map that takes an intermediate state and Brownian path and predicts future states in a single pass. The Itô map formulation yields novel estimators for inference-time control by providing cheap, differentiable access to posterior samples. Empirically, Itô maps produce diverse, conditionally valid endpoint samples from fixed intermediate states and support strong steering performance on synthetic and image-generation benchmarks. These results establish any-step SDE integration as a useful primitive for posterior sampling and stochastic control.
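The notion of a map from an intermediate state and a Brownian path to a future state can be illustrated with a plain Euler-Maruyama integrator: once the Brownian increments are fixed, the endpoint is a deterministic function of the start state and the path. The sketch below is not the learned Itô map from the paper; the Ornstein-Uhlenbeck drift, the constant diffusion, and all names are assumptions chosen for illustration.

```python
import math
import random


def em_flow(x, increments, dt, drift, diffusion):
    """Euler-Maruyama endpoint as a function of start state and Brownian increments."""
    for dw in increments:
        x = x + drift(x) * dt + diffusion(x) * dw
    return x


def brownian_increments(n_steps, dt, rng):
    return [rng.gauss(0.0, math.sqrt(dt)) for _ in range(n_steps)]


def drift(x):
    return -x      # mean reversion (illustrative choice)


def diffusion(x):
    return 0.5     # constant noise scale (illustrative choice)


dt, n_steps = 0.01, 100
rng = random.Random(0)
path = brownian_increments(n_steps, dt, rng)

# Any-step consistency: one jump over the whole path equals two shorter jumps.
one_shot = em_flow(1.0, path, dt, drift, diffusion)
midpoint = em_flow(1.0, path[:40], dt, drift, diffusion)
two_stage = em_flow(midpoint, path[40:], dt, drift, diffusion)

# Fresh Brownian paths from the same intermediate state give different endpoints.
endpoints = [em_flow(midpoint, brownian_increments(60, dt, rng), dt, drift, diffusion)
             for _ in range(5)]
```

A learned map would replace the inner loop with a single network evaluation; the composition property checked here is the behaviour such a map would need to reproduce.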

2606.11125 2026-06-10 eess.SP cs.LG New submission

DMT: Demographic Conditioning, Morphology-Enhanced Transformer for Cuffless Blood Pressure Estimation from PPG Signals

DMT: 基于人口统计条件与形态增强Transformer的无袖带血压估计方法

Yidan Shen, Neville Mathew, Maham Rahimi, Deependra Dhakal, George Zouridakis, Xin Fu, Renjie Hu

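AI summary: Proposes a Transformer-based network for cuffless blood pressure estimation from PPG signals that incorporates demographic information through FiLM-style feature modulation and adds an auxiliary morphology head to steer the model toward waveform morphology related to arterial stiffness, achieving an MAE of 4.56 mmHg for systolic and 2.62 mmHg for diastolic blood pressure on the PulseDB dataset.

Details
AI-generated Chinese abstract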

血压(BP)是心血管风险评估和治疗决策的关键指标,而光电容积描记术(PPG)能够实现低成本、可穿戴友好的无袖带血压估计。然而,即使近期取得了进展,许多基于PPG的模型仅通过血压回归进行训练,可能依赖于以振幅为主的捷径。此外,系统性调节血管顺应性的人口统计协变量通常仅通过后期融合纳入,限制了特定于主体的表示学习。我们提出了一种基于Transformer的网络,用于从PPG信号进行无袖带血压估计,利用自注意力机制捕获多个心动周期之间的长程依赖关系。为了考虑特定主体的血管差异,模型通过Transformer块的注意力和前馈子层中应用的FiLM风格特征调制,以人口统计信息为条件。此外,我们添加了一个辅助形态头,引导模型关注与动脉硬度和波反射相关的血压相关波形形态。在大型PulseDB数据集上基于校准的评估协议下,所提方法在收缩压上实现了4.56 mmHg的平均绝对误差(MAE),在舒张压上实现了2.62 mmHg,与先前的人口统计增强PPG基线相比,误差分别减少了47%和50%。由此产生的轻量级单传感器模型支持在启用校准的部署场景中进行可扩展且临床可靠的无袖带血压估计。

English abstract

Blood pressure (BP) is a key marker for cardiovascular risk assessment and therapeutic decision-making, and Photoplethysmography (PPG) enables low-cost, wearable-friendly cuffless BP estimation. However, even with recent progress, many PPG-based models are trained with BP regression alone and may rely on amplitude-dominated shortcuts. In addition, demographic covariates that systematically modulate vascular compliance are often incorporated only via late fusion, limiting subject-specific representation learning. We propose a Transformer-based network for cuffless BP estimation from PPG signals, leveraging self-attention to capture long-range dependencies across multiple cardiac cycles. To account for subject-specific vascular differences, the model is conditioned on demographics via FiLM-style feature modulation applied through the attention and feed-forward sublayers of Transformer blocks. In addition, we add an auxiliary morphology head to guide the model to attend to BP-relevant waveform morphology associated with arterial stiffness and wave reflection. Under calibration-based evaluation protocols on the large-scale PulseDB dataset, the proposed method achieves an MAE of 4.56 mmHg for systolic BP and 2.62 mmHg for diastolic BP, reducing errors by 47% and 50% compared with prior demographic-enhanced PPG baselines. The resulting lightweight, single-sensor model supports scalable and clinically grounded cuffless BP estimation in calibration-enabled deployment settings.
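FiLM-style conditioning scales and shifts each feature channel with parameters predicted from a conditioning vector. The sketch below shows that operation on plain Python lists. The linear predictors, the two-element demographics vector, and all weights are hypothetical; where exactly the modulation sits inside the authors' Transformer blocks is not reproduced here.

```python
def linear(weights, bias, x):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def film(tokens, demographics, w_gamma, b_gamma, w_beta, b_beta):
    """Apply gamma(d) * h + beta(d) channel-wise to every token."""
    gamma = linear(w_gamma, b_gamma, demographics)
    beta = linear(w_beta, b_beta, demographics)
    return [[g * h + b for h, g, b in zip(token, gamma, beta)]
            for token in tokens]


# Hypothetical inputs: two tokens with two channels, demographics = [age_scaled, sex].
tokens = [[1.0, 2.0], [3.0, 4.0]]
demographics = [0.5, 1.0]
w_gamma, b_gamma = [[0.2, 0.0], [0.0, -0.5]], [1.0, 1.0]
w_beta, b_beta = [[0.0, 0.1], [1.0, 0.0]], [0.0, 0.0]

modulated = film(tokens, demographics, w_gamma, b_gamma, w_beta, b_beta)
```

Because the same `gamma` and `beta` are applied to every token, the conditioning acts on the whole sequence rather than being appended late, which is the contrast with late fusion drawn in the abstract.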

2606.11044 2026-06-10 stat.ML cs.LG New submission

Generalized Conformal Predictive Systems Under Distributional Shifts

广义共形预测系统在分布偏移下的应用

Jef Jonkers, Johanna Ziegel

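AI summary: To handle distributional shift, extends generalized conformal predictive systems by encoding the shift through observation-specific permutation weights, yielding shift-aware predictive systems, and introduces weight-uncertainty boxes to construct robust conformal predictive system envelopes with finite-sample or asymptotic confidence guarantees.

Details
Comments
27 pages, 10 figures
AI-generated Chinese abstract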

共形预测系统(CPS)在可交换性假设下输出校准的CDF带。我们通过观测特定的置换权重编码分布偏移,将广义CPS扩展到非可交换设置。这产生了偏移感知预测系统,当测试点(条件于无序样本)是从观测原子中加权抽取时,该系统保持有效。由于此类权重通常需要估计,我们引入了权重不确定性框,并构建了具有有限样本或渐近置信保证的鲁棒CPS包络。我们推导了符合性度量CPS、共形分箱和共形等渗分布回归的高效计算方法。在协变量偏移和反馈驱动的生物分子设计实验下,校准的预测带在更强偏移下变宽,随样本量增加而收紧。

English abstract

Conformal predictive systems (CPS) output calibrated bands of CDFs under exchangeability. We extend generalized CPS to non-exchangeable settings by encoding distributional shifts through observation-specific permutation weights. This yields shift-aware predictive systems that remain valid whenever the test point is, conditionally on the unordered sample, a weighted draw from the observed atoms. Since such weights are typically estimated, we introduce weight-uncertainty boxes and construct robust CPS envelopes with finite-sample or asymptotic confidence guarantees. We derive efficient computation for conformity-measure CPS, conformal binning, and conformal isotonic distributional regression. Experiments under covariate shift and feedback-driven biomolecular design show calibrated predictive bands that widen under stronger shifts and tighten as sample size increases.

2606.10972 2026-06-10 eess.AS cs.AI New submission

Optimizing 2D Input Representations and Sub-phase Fusion Strategies for Differential Diagnosis of Asthma and COPD Using CNN- and GRU-Based Networks

基于CNN和GRU网络的哮喘与COPD鉴别诊断中二维输入表示和子阶段融合策略的优化

Ipek Sen, Ozgur Ozdemir, Elena Battini Sonmez

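AI summary: Optimizes two-dimensional input representations (MFCC, log-mel spectrogram, VAR model) and sub-phase feature-fusion strategies (direct concatenation, GRU, GRU with attention), using CNN- and GRU-based networks to differentiate asthma from COPD; the best F1-score reaches 0.877.

Details
AI-generated Chinese abstract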

本研究旨在探索VAR模型与梅尔频率倒谱系数(MFCC)矩阵和对数梅尔谱图在深度学习中的性能比较。在肺音分类中,基于谱图的表示因呼吸周期时长不同而存在时间维度不一致的问题。除了传统的裁剪/零填充,还提出了自适应长度窗口来固定时间维度。通过测试一系列参数优化其频谱和时间维度。采用不同的卷积神经网络(CNN)架构从子阶段获得的二维表示中提取特征。然后使用各种策略融合提取的子阶段特征,包括直接拼接、门控循环单元(GRU)网络和带注意力的GRU。通过基于呼吸周期的评估和基于受试者的评估(包含多个呼吸周期)来评估模型性能。还研究了多种数据增强技术以应对数据规模限制。最佳基于周期的F1分数(0.877)通过使用13个系数和每子阶段表示64点时间分辨率的MFCC矩阵,随后进行直接特征拼接获得;最佳基于受试者的F1分数(0.855)通过使用13个系数和每完整周期表示256点时间分辨率的MFCC矩阵获得,两者均采用自适应长度窗口。增强总体上降低了模型性能,但mixup增强是测试方法中最好的。MFCC在区分哮喘和COPD方面优于对数梅尔谱图和VAR模型。复杂的融合策略并未改善诊断。增强没有贡献,表明真实数据在肺音研究中的重要性。

English abstract

This study aims to explore the performance of the VAR model in comparison with mel-frequency cepstral coefficient (MFCC) matrices and log-mel spectrograms using deep learning. In pulmonary sound classification, spectrogram-based representations suffer from inconsistent temporal dimensions due to varying respiratory cycle durations. Along with traditional trimming/zero-padding, adaptive-length windowing was presented to fix their temporal dimensions. Their spectral and temporal dimensions were optimized by testing a range of parameters. Different convolutional neural network (CNN) architectures were employed to extract features from the two-dimensional representations obtained over the sub-phases. The extracted sub-phase features were then fused using various strategies including direct concatenation, gated recurrent unit (GRU) network and GRU with attention mechanism. Model performances were assessed through respiratory cycle-based evaluation and subject-based evaluation comprising multiple respiratory cycles. Several data augmentation techniques were also studied to cope with limitations in data size. The best cycle-based F1-score (0.877) was obtained using the MFCC matrices with thirteen coefficients and 64-point time resolution per sub-phase representation followed by direct feature concatenation, and the best subject-based F1-score (0.855) was obtained using the MFCC matrices with thirteen coefficients and 256-point time resolution per full-cycle representation, both obtained by adaptive-length windowing. Augmentation degraded the performance of models overall, yet mixup augmentation was the best among the methods tested. MFCC outperformed log-mel spectrogram and VAR model in differentiation of asthma and COPD. Sophisticated fusion strategies did not improve the diagnosis. Augmentation did not contribute, demonstrating the significance of authentic data in pulmonary sound studies.
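One way to obtain a fixed temporal dimension without trimming or zero-padding is to scale the analysis window and hop with the length of each respiratory segment. The abstract does not spell out the authors' scheme, so the sketch below is an assumed variant: the 50% overlap, the rounding rule, and the function name `adaptive_windows` are all hypothetical.

```python
def adaptive_windows(n_samples, n_frames, overlap=0.5):
    """Return n_frames (start, end) windows that exactly span n_samples.

    The window length grows with the segment so that every segment,
    short or long, yields the same number of frames.
    """
    if n_frames < 2:
        raise ValueError("need at least two frames")
    win = max(1, int(n_samples / (1 + (n_frames - 1) * (1 - overlap))))
    step = (n_samples - win) / (n_frames - 1)
    return [(round(i * step), round(i * step) + win) for i in range(n_frames)]


# Two hypothetical sub-phases of different duration, both mapped to 64 frames.
short_phase = adaptive_windows(4000, 64)
long_phase = adaptive_windows(12000, 64)
```

Each window would then be passed to the feature extractor (for example an MFCC computation), giving a 64-column matrix regardless of the breathing cycle's duration.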

2606.10906 2026-06-10 stat.ML cs.AI cs.LG New submission

Human-AI Teaming Through the Lens of Calibration

通过校准视角看人机协作

Eric Nalisnick, Chi Zhang, Sophia Qian, Yixin Wang

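AI summary: Analyzes human-AI teaming models through the lens of statistical calibration, finding that combination methods do not preserve the human's degree of calibration, while delegation methods shift the calibration burden onto the rejector meta-model, a burden that becomes unattainable when the human relies on information the system cannot observe.

Details
Comments
19 pages, 5 figures (including appendix)
AI-generated Chinese abstract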

我们通过统计校准的视角研究人机协作模型。假设团队由AI模型和人类组成——两者相对于特征空间的某种划分都是校准的——并揭示校准假设如何传播到协作框架中。特别地,我们考虑两种框架:(i) 结合人类和模型预测,或 (ii) 将预测责任委托给人类或模型。通过理论和实证结果,我们表明现有的组合方法不保留人类的校准程度。委托方法(通过委托行为本身)保留了后续预测器的校准,但将负担转移到了决定谁进行预测的拒绝器元模型上。拒绝器必须足够精细地校准,以定位每个成员的优势所在,这一需求随着人类专业知识的增长而增加,并且当人类依赖系统无法观测的信息时变得无法实现。

English abstract

We study models for human-AI teaming through the lens of statistical calibration. We assume the team consists of an AI model and human -- both of which are calibrated with respect to some partitioning of the feature space -- and expose how the calibration assumptions propagate into the teaming framework. In particular, we consider frameworks that either (i) combine human and model predictions or (ii) delegate prediction responsibility to either a human or model. We show via theoretical and empirical results that existing methods for combination do not preserve the human's degree of calibration. Methods for delegation (by the very act of delegation) preserve calibration of the downstream predictors but shift the burden onto the rejector meta-model that decides who predicts. The rejector must be calibrated finely enough to locate where each member is superior, a demand that grows with the human's expertise and becomes unattainable when the human relies on information the system cannot observe.

2606.10889 2026-06-10 q-bio.NC cs.LG New submission

Sleep EEG Signal Criticality as a Non-Invasive Predictor of Cognitive Decline in Dementia

睡眠脑电信号临界性作为痴呆认知衰退的非侵入性预测指标

Stanisław Narębski, Tomasz Komendziński, Tomasz M. Rutkowski

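AI summary: Quantifies sleep EEG signal criticality with multifractal detrended fluctuation analysis, finding that cognitively healthy individuals lie closer to an optimally critical state while the dementia group's DFA exponents shift toward 1.0; this suggests that a reconfiguration of scale-free neural dynamics during sleep precedes clinical symptoms and could support early screening.

Details
Comments
4 pages, 2 figures, accepted for publication in the Proc. 48th Annu. Int. Conf. IEEE EMBS (EMBC 2026), Toronto, Canada, July 20-24, 2026
AI-generated Chinese abstract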

神经退行性疾病的早期检测仍然是一个关键的临床挑战。本研究探讨了通过多重分形去趋势波动分析(MFDFA)量化的睡眠脑电信号临界性是否可作为未来认知衰退的非侵入性生物标志物。我们分析了国家睡眠研究资源(NSRR)骨质疏松性骨折研究(SOF)队列的纵向数据,比较了保持认知正常与后来进展为痴呆相关损伤(3MS < 78)的女性之间的基线睡眠脑电动力学。我们的结果揭示了Hurst指数$H(q)$分布在组间的显著差异,特别是在非快速眼动阶段N2和N3期间。认知健康的个体在所有电极位置上表现出显著更接近最优临界状态的信号动力学($p \leqslant 0.001$),支持了大脑临界性假说。监督UMAP投影证实了整夜睡眠期间组间的清晰空间分离。痴呆组表现出DFA指数向$1.0$的偏移,表明睡眠中无标度神经动力学的重组先于临床症状。这些发现强调了将MFDFA衍生测量整合到自动化、基于睡眠的筛查工具中的潜力,从而能够在痴呆的前驱窗口期进行更早的预防性干预。

English abstract

Early detection of neurodegeneration remains a critical clinical challenge. This study investigates whether sleep EEG signal criticality, quantified via Multifractal Detrended Fluctuation Analysis (MFDFA), serves as a non-invasive biomarker for future cognitive decline. We analyzed longitudinal data from the National Sleep Research Resource (NSRR) Study of Osteoporotic Fractures (SOF) cohort, comparing baseline sleep EEG dynamics between women who remained cognitively normal and those who later progressed to dementia-related impairment ($3MS < 78$). Our results reveal significant group-level differences in Hurst exponent $H(q)$ distributions, particularly during non-REM stages N2 and N3. Cognitively healthy individuals exhibited signal dynamics significantly closer to an optimally critical state across all electrode locations ($p \leqslant 0.001$), supporting the Brain Criticality Hypothesis. Supervised UMAP projections confirmed clear spatial separation between groups throughout the overnight sleep architecture. The dementia group demonstrated a shift in DFA exponents toward $1.0$, suggesting that a reconfiguration of scale-free neural dynamics during sleep precedes clinical symptoms. These findings highlight the potential for MFDFA-derived measures to be integrated into automated, sleep-based screening tools, enabling earlier preventative interventions during the prodromal window of dementia.
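For readers unfamiliar with the exponents mentioned here, the sketch below implements ordinary (monofractal) detrended fluctuation analysis with linear detrending, which corresponds to the $q = 2$ case of MFDFA. It is a generic textbook-style illustration on synthetic signals, not the authors' pipeline; the scales, signal lengths, and function names are assumptions.

```python
import math
import random


def fit_slope(xs, ys):
    """Least-squares slope of ys against xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))


def dfa_exponent(signal, scales):
    """Slope of log F(n) vs log n, with a linear trend removed per window."""
    mean = sum(signal) / len(signal)
    profile, total = [], 0.0
    for value in signal:          # cumulative sum of the mean-centred signal
        total += value - mean
        profile.append(total)
    log_n, log_f = [], []
    for n in scales:
        times = list(range(n))
        sq_sum, count = 0.0, 0
        for start in range(0, len(profile) - n + 1, n):
            window = profile[start:start + n]
            slope = fit_slope(times, window)
            intercept = sum(window) / n - slope * (n - 1) / 2
            sq_sum += sum((y - (intercept + slope * t)) ** 2
                          for t, y in zip(times, window))
            count += n
        log_n.append(math.log(n))
        log_f.append(0.5 * math.log(sq_sum / count))
    return fit_slope(log_n, log_f)


rng = random.Random(0)
white = [rng.gauss(0.0, 1.0) for _ in range(4096)]   # uncorrelated noise
walk, position = [], 0.0
for step in white:                                   # integrated noise
    position += step
    walk.append(position)

scales = [16, 32, 64, 128, 256]
alpha_white = dfa_exponent(white, scales)
alpha_walk = dfa_exponent(walk, scales)
```

Uncorrelated noise gives an exponent near 0.5 and a random walk near 1.5; the value of 1.0 discussed in the abstract lies between these two regimes.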

2606.10781 2026-06-10 eess.AS cs.CL New submission

Recovering the Zipfian Distribution in Unsupervised Term Discovery

在无监督术语发现中恢复齐夫分布

Danel Slabbert, Simon Malan, Herman Kamper

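AI summary: Addresses the overly uniform distributions that centre-based clustering produces in unsupervised term discovery by proposing graph clustering, which substantially outperforms K-means and other centre-based methods across three languages and recovers lexicon distributions closer to Zipfian.

Details
AI-generated Chinese abstract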

无监督术语发现涉及将未标记语音分割成词或音节单元,并将这些单元聚类成候选类型的词典。真实词典遵循齐夫分布,然而主流的基于中心的聚类方法——K-means——由于对球形聚类的归纳偏差,产生更均匀的分布。在本文中,我们重新审视基于图的聚类作为一种自下而上的替代方案,其中片段嵌入通过成对相似性连接,并使用Leiden算法进行划分。我们表明,在三种语言的词级和音节级词典发现中,图聚类在性能上显著优于基于中心的方法(K-means、GMM、BIRCH),产生更接近齐夫分布的分布。另一种自下而上的方法,即使用平均链接的凝聚聚类,也表现良好,尽管其计算效率较低,且对结果分布的控制能力较弱。我们的工作质疑了基于中心的聚类在术语发现中的主导地位,并推广图聚类作为一种有吸引力的替代方案。

English abstract

Unsupervised term discovery involves segmenting unlabelled speech into word- or syllable-like units and clustering these into a lexicon of candidate types. True lexicons follow a Zipfian distribution, yet the dominant centre-based clustering approach -- K-means -- produces a more uniform distribution due to an inductive bias toward spherical clusters. In this paper we revisit graph-based clustering as a bottom-up alternative, where segment embeddings are connected by pairwise similarity and partitioned using the Leiden algorithm. We show that graph clustering substantially outperforms centre-based approaches (K-means, GMM, BIRCH) in both word- and syllable-level lexicon discovery across three languages, producing more Zipf-like distributions. Another bottom-up approach, agglomerative clustering with average linkage, also performs well, although it is computationally less efficient and allows for less control over the resulting distribution. Our work calls into question the dominance of centre-based clustering for term discovery, and promotes graph clustering as an attractive alternative.
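A quick way to judge how Zipf-like a discovered lexicon is would be to regress log cluster size on log rank: a slope near -1 indicates a Zipfian distribution, while a slope near 0 indicates uniform cluster sizes. The sketch below uses this rank-frequency fit on synthetic labels; the abstract does not say which measure the authors use, so the function `zipf_slope` and the toy data are assumptions.

```python
import math
from collections import Counter


def zipf_slope(cluster_labels):
    """Least-squares slope of log(size) against log(rank); needs >= 2 clusters."""
    counts = sorted(Counter(cluster_labels).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))


# Synthetic cluster assignments: sizes proportional to 1/rank vs. equal sizes.
zipf_labels = [k for k in range(1, 51) for _ in range(round(1000 / k))]
uniform_labels = [k for k in range(1, 51) for _ in range(100)]

slope_zipf = zipf_slope(zipf_labels)
slope_uniform = zipf_slope(uniform_labels)
```

Applied to the cluster labels produced by K-means and by a graph-based method on the same segments, such a slope would give a single number for the contrast described in the abstract.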

2606.10770 2026-06-10 stat.ME cs.LG New submission

Correcting Variable Importance Scored by Random Forests

修正随机森林产生的变量重要性评分

Guancheng Zhou, Haiping Xu, Jason Liu, Donghui Yan

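AI summary: Addresses the way correlations among variables distort random-forest variable importance by proposing a correction that groups variables by conditional correlation; experiments show that both computationally efficient schemes give sensible corrections.

Details
Comments
22 pages, 10 figures
AI-generated Chinese abstract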

随机森林产生的变量重要性在统计分析中广泛应用,在辅助模型解释、模型选择和诊断、成本受限学习等任务中发挥重要作用。然而,RF中变量重要性的计算未考虑变量间的相关性,与许多其他变量相关的变量往往会获得较低的重要性指数,或被其他强相关变量完全掩盖(即重要性指数接近零)。为了在计算变量重要性时避免不相关变量的影响,我们提出根据变量的条件相关性(以响应变量为条件)对变量进行分组。我们探索了两种计算高效的方案:一种将变量单独分组,然后将感兴趣的变量与所有相关变量分离;另一种使用聚类根据变量间的成对条件相关性进行分组。实验表明,两种方法都能对变量重要性进行合理的修正。

English abstract

Variable importance produced by Random Forests (RF) is widely used in statistical data analysis and plays an important role in tasks such as model interpretation, model selection and diagnosis, and cost-bounded learning. However, the calculation of variable importance in RF does not take into account the correlations among variables: a variable that is correlated with many other variables tends to receive a lower importance index or to be completely masked (i.e., assigned an importance index near zero) by other strongly correlated variables. To prevent unwanted correlated variables from influencing the calculation of variable importance, we propose to group variables by their conditional correlations (conditional on the response variable). We explore two computationally efficient options: one groups variables individually and then separates the variable of interest from all correlated variables, while the other uses clustering to group variables according to their pairwise conditional correlations. Our experiments show that both lead to sensible corrections to the importance of variables.

2606.10738 2026-06-10 eess.AS cs.AI New submission

Spatial-Omni: Spatial Audio Understanding Integration in Multimodal LLMs via FOA Encoding

Spatial-Omni:通过FOA编码在多模态大语言模型中实现空间音频理解

Zhiyuan Zhu, Yixuan Chen, Yiwen Shao, Wenxiang Guo, Changhao Pan, Yu Zhang, Yuxiang Wang, Wei Liu, Houhua Zhang, Chengkuan Zeng, Wenbo Cheng, Yunxi Liu, Rui Yang, Steve Yves, Liefeng Bo, Zhou Zhao

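AI summary: Proposes Spatial-Omni, which uses SO-Encoder to inject First-Order Ambisonics spatial audio into existing Omni LLMs, adding spatial audio understanding in a lightweight way and outperforming existing models on the newly constructed SO-Bench benchmark.

Details
AI-generated Chinese abstract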

最近的多模态大语言模型主要将音频处理为单声道信号,从而丢弃了空间音频中包含的空间线索,这些线索用于声音定位、空间关系推理和空间场景理解。我们提出Spatial-Omni,一种轻量级方法,通过实现SO-Encoder将一阶Ambisonics(FOA)空间音频作为独立模态注入现有的全模态大语言模型,而无需修改其原始音频编码器。SO-Encoder以有限的额外上下文成本提供空间标记,并通过高效的分阶段训练提升空间音频理解。为支持训练和评估,我们从开源数据、真实录音和仿真中构建了SO-Dataset、SO-QA和SO-Bench,包含40万条FOA空间音频片段和210万个空间问答对。SO-Bench涵盖16个空间音频理解子任务,包括基本检测和位置估计、空间关系理解以及复杂空间推理。实验表明,Spatial-Omni在空间音频理解任务上优于现有的开源大型音频语言模型(LALM)和全模态大语言模型,同时保持合理的通用音频理解水平。代码和数据见:https://github.com/dieKarotte/Spatial-Omni。

English abstract

Recent multimodal large language models mainly process audio as monaural signals, thereby discarding the spatial cues contained in spatial audio for sound localization, spatial relation reasoning, and spatial scene understanding. We propose Spatial-Omni, a lightweight method that implements SO-Encoder to inject First-Order Ambisonics (FOA) spatial audio into existing Omni LLMs as an independent modality, without modifying their original audio encoders. SO-Encoder provides spatial tokens with limited additional context cost and improves spatial audio understanding through efficient staged training. To support training and evaluation, we construct SO-Dataset, SO-QA, and SO-Bench from open-source data, real recordings, and simulations, containing 400K FOA spatial audio clips and 2.1M spatial question answering pairs. SO-Bench covers 16 spatial audio understanding subtasks, including basic detection and location estimation, spatial relation understanding, and complex spatial reasoning. Experiments show that Spatial-Omni outperforms existing open-source Large Audio-Language Models (LALMs) and Omni LLM models on spatial audio understanding tasks while retaining a reasonable level of general audio understanding. Code and data are available at https://github.com/dieKarotte/Spatial-Omni.

2606.10713 2026-06-10 eess.IV cs.AI cs.CV cs.LG New submission

++nnU-Net: Scaling nnU-Net with Prefix-Based Data Augmentation

++nnU-Net: 基于前缀数据增强的nnU-Net扩展

Ana Sofia Santos, André Ferreira, Gijs Luijten, Naida Solak, Lisle Faray de Paiva, Behrus Hinrichs-Puladi, Jens Kleesiek, Jan Egger, Victor Alves

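AI summary: Proposes ++nnU-Net, which augments data through image registration, generating warped images before preprocessing and training, and improves the Dice coefficient by up to about 22% across five 2D datasets.

Details
Comments
7 pages, 1 figure, 2 tables
AI-generated Chinese abstract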

nnU-Net在医学分割任务中持续展现出成功,这严重依赖于标注生物医学数据的可用性和多样性。然而,由于隐私法规和标注成本等因素,收集医学影像队列仍然具有挑战性。因此,数据增强在增加数据可用性的同时保持解剖学可行性方面起着关键作用。为此,我们提出了++nnU-Net,一种基于图像配准的新型数据增强模块,在预处理和训练之前运行。我们的框架在五个不同的2D数据集上进行了评估。在该工作流中,图像数据经过两阶段配准过程,生成新的变形图像。然后将变换应用于相应的分割。此外,该管道计算可用磁盘空间,生成补充的二进制合成掩码并生成检查点。我们证明++nnU-Net优于nnU-Net基线,在Dice相似系数得分上有所提升。在最显著的情况下,我们观察到性能提升约22%。这些发现强调了基于配准的数据增强的有效性,特别是对于2D医学影像数据集,并表明++nnU-Net为在数据有限的情况下提高分割性能提供了一种实用且可扩展的方法。++nnU-Net的源代码可在以下网址获取:https://github.com/sofia-adelie/plusplusnnunet.git

English abstract

The nnU-Net has demonstrated continuous success in medical segmentation tasks, which heavily rely on the availability and diversity of annotated biomedical data. However, assembling medical imaging cohorts remains challenging due to numerous factors such as privacy regulations and annotation costs. As a result, data augmentation plays a crucial role in increasing data availability while maintaining anatomical feasibility. Hence, we propose the ++nnU-Net, a novel data augmentation module based on image registration that operates before preprocessing and training take place. Our framework was evaluated across five different 2D datasets. In this workflow, image data go through a two-stage registration process, generating new warped images. The transformations are then applied to the respective segmentation. In addition, the pipeline computes available disk space, generates supplementary binary synthetic masks, and generates checkpoints. We demonstrate that the ++nnU-Net outperforms the nnU-Net baseline, yielding improvements in Dice Similarity Coefficient scores. In the most prominent cases, we observe performance gains of approximately 22%. These findings highlight the effectiveness of registration-based data augmentation, particularly for 2D medical imaging datasets, and suggest that the ++nnU-Net provides a practical and scalable approach for enhancing segmentation performance in data-limited settings. The source code for the ++nnU-Net is available at: https://github.com/sofia-adelie/plusplusnnunet.git
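The step in which a transformation found by registration is applied to both the image and its segmentation can be sketched with a per-pixel displacement field. The code below is a toy illustration with nearest-neighbour lookup on nested lists; the field, the fill value, and the function `warp` are hypothetical and do not reflect the ++nnU-Net code base or the registration tool it uses.

```python
def warp(grid, dy, dx, fill=0):
    """Resample grid so that out[y][x] = grid[y + dy[y][x]][x + dx[y][x]].

    Integer displacements with nearest-neighbour lookup keep label values
    intact, which matters when the same field is applied to a mask.
    """
    h, w = len(grid), len(grid[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sy, sx = y + dy[y][x], x + dx[y][x]
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = grid[sy][sx]
    return out


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
mask = [[0, 1, 1], [0, 1, 0], [0, 0, 0]]

# Hypothetical displacement field: sample one pixel to the right everywhere.
dy = [[0] * 3 for _ in range(3)]
dx = [[1] * 3 for _ in range(3)]

warped_image = warp(image, dy, dx)
warped_mask = warp(mask, dy, dx)
```

Using one field for both arrays keeps every warped pixel aligned with its label, so the augmented pair can be fed to training like any original sample.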

2606.10673 2026-06-10 stat.OT cs.LG New submission

ClusBench: The Clustering Benchmark Data Resource You've All Been Waiting For (?)

ClusBench:你一直期待的聚类基准测试数据资源(?)

David P. Hofmeyr

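AI summary: Fits flexible non-parametric distributions to more than 200 public data sets to generate close to 3000 synthetic data sets for large-scale evaluation of clustering methods while retaining the nuance of real-world data.

Details
AI-generated Chinese abstract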

尽管存在一些非常常见的测试平台用于评估聚类方法的性能,但大规模基准测试通常局限于相对简单的模拟设置。在这里,我们描述了近3000个合成数据集的生成和整理,这些数据集源自200多个公开可用的数据集;其中大多数来自实际应用。通过为每个基础数据集拟合灵活的非参数分布,我们能够保留真实数据中许多难以在标准模拟中重现的细微差别,同时生成的数据集的大小有时远大于它们所源自的数据集。合成数据集以及附带的R包可从 https://github.com/DavidHofmeyr/ClusBench 下载。

English abstract

Although some very common test beds exist for assessing the performance of clustering methods, large scale benchmarking is typically limited to relatively simplistic simulation set-ups. Here we describe the production and curation of close to 3000 synthetic data sets, derived from more than 200 publicly available data sets; the majority of which arose from real-world applications. By fitting a flexible non-parametric distribution to each base data set we are able to retain much of the nuance in real-world data which is difficult to reproduce in standard simulations, while also producing data sets whose sizes are sometimes substantially greater than the data sets from which they are derived. The synthetic data sets, plus an accompanying R package, are available for download from https://github.com/DavidHofmeyr/ClusBench.
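Generating a larger synthetic data set from a fitted non-parametric distribution can be illustrated with a smoothed bootstrap, which is equivalent to sampling from a Gaussian kernel density estimate. This is a stand-in chosen for illustration: the abstract does not name the estimator ClusBench uses, and the bandwidth, seed, and function name below are assumptions.

```python
import random


def smoothed_bootstrap(points, n_new, bandwidth, seed=0):
    """Draw n_new points from a Gaussian KDE fitted to `points`.

    Each draw picks a base point at random and adds independent Gaussian
    noise with standard deviation `bandwidth` to every coordinate.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(points)
        synthetic.append([x + rng.gauss(0.0, bandwidth) for x in base])
    return synthetic


# A tiny hypothetical base data set with two visible groups.
base_data = [[0.0, 0.1], [0.2, -0.1], [5.0, 5.2], [5.1, 4.9]]
synthetic_data = smoothed_bootstrap(base_data, n_new=1000, bandwidth=0.05)
```

The synthetic set is 250 times larger than the base set yet keeps its two-group structure, which illustrates how a fitted distribution can yield data sets larger than their source.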