arXivDaily arXiv daily academic digest, updated Monday through Friday
2605.28835 2026-05-29 cs.CL cs.AI

GenesisFunc: Multi-Agent Data Generation for Accurate and Generalizable Function-Calling

Hao-Xiang Xu, Chong Deng, Jiaqing Liu, Wen Wang, Qian Chen, Lujia Bao, Xiangang Li, Zhen-Hua Ling

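AI summary Proposes GenesisFunc, a multi-agent pipeline that automatically generates function-calling training data, with multi-stage evaluation for quality control; a fine-tuned 8B model outperforms comparable open-source models both in-domain and out-of-domain and approaches the performance of some API-based models.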

Comments Accepted by ACL 2026 Main

Abstract

Large Language Models (LLMs) extend their capabilities through function-calling (FC), which relies on high-quality, diverse training data with broad scenario coverage. However, obtaining and annotating real function-calling data is challenging, while synthetic data from existing pipelines often suffers from unreliable APIs, limited tool scalability, insufficient diversity, and weak quality control. To address these issues, we present GenesisFunc, an automated pipeline for generating FC training data. Starting from reliable tools in widely used public benchmarks, GenesisFunc employs a multi-agent framework to support a dialogue generation system that produces conversations spanning diverse scenarios, while maintaining both diversity and quality throughout the process. The accuracy of the data is further reinforced through a multi-stage evaluation system. We fine-tune an 8B LLM on the synthetic dataset and show through extensive experiments that it outperforms similarly sized open-source models in in-domain FC performance and out-of-domain generalization, while reaching FC capabilities comparable to some of the latest API-based models. In addition, our method demonstrates strong potential to scale effectively across downstream tools, underscoring its real-world applicability.

2605.28834 2026-05-29 cs.CL cs.AI

Assessing Dutch Syllabification Algorithms and Improving Accuracy by Combining Phonetic and Orthographic Information through Deep Learning

Gus Lathouwers, Wieke Harmsen, Catia Cucchiarini, Helmer Strik

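AI summary Evaluates four Dutch syllabification algorithms and proposes a deep-learning model that combines phonetic and orthographic information, reaching 99.65% word accuracy, a 0.14% improvement over the best result in the literature.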

Comments Published in CLIN Journal

Journal ref
Computational Linguistics in the Netherlands Journal, Vol. 14 (2025), pp. 365-383
Abstract

Syllabification is the task of dividing words into syllables. Due to many rules and exceptions, training an algorithm to perform syllabification with high accuracy remains a challenge. Over the last decades, different algorithms have been put forth for Dutch syllabification, yet no comprehensive comparative assessment has been carried out. Additionally, deep learning has gained significant popularity within NLP in recent years, yet no modern deep-learning-based framework has been developed for Dutch orthographic syllabification. Finally, phonetic and orthographic syllabification algorithms have been examined separately, but not in combination. The aim of the current research was twofold: (a) to examine the performance of existing Dutch syllabification algorithms, and (b) to investigate whether combining phonetic and orthographic information into a single model can increase syllabification performance. To compare their performance, four algorithms (Brandt Corstius, Liang, Trogkanis-Elkan (CRF), and a newly conceived deep-learning model) were applied to three different datasets (dictionary words, loanwords, pseudowords). The algorithms show varying performance across datasets, with the data-driven algorithms outperforming the knowledge-based algorithm in all but one condition. The newly developed deep-learning methods outperformed the best result reported in the literature (99.65% word accuracy, a 0.14% improvement). An analysis of the words for which adding phonetic information improved syllabification performance indicates that these were words in which the orthographic ambiguity could be resolved by information on pronunciation. Future research could examine other areas where phonetic information can benefit orthographic processing. In addition, the newly developed deep-learning frameworks can be applied to languages other than Dutch.

2605.28833 2026-05-29 cs.CL cs.AI

Transcribing Children's Speech: ASR Performance and Obtaining Reliable Orthographic Transcriptions

Gus Lathouwers, Lingyun Gao, Catia Cucchiarini, Helmer Strik

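AI summary Evaluates three ASR model families (Whisper, Parakeet, Wav2Vec2) on Dutch child speech datasets and proposes an utterance-level selection method that automatically identifies correctly pronounced utterances with high confidence, reducing the need for manual verification.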

Abstract

Automatic speech recognition (ASR) has the potential to substantially reduce manual annotation effort in child speech research by generating automatic transcriptions. However, obtaining reliably high-quality ASR transcriptions for child speech remains challenging in low-resource languages due to limited child-specific pre-trained models and highly diverse noise conditions. This study investigates the effectiveness of state-of-the-art ASR models on child speech through two research questions, by evaluating nine ASR models from three model families (Whisper, Parakeet, and Wav2Vec2) on two Dutch child speech datasets, JASMIN and DART. Research question 1 examines the performance of ASR models applied to child speech. The fine-tuned Whisper-medium model achieves the best overall performance, with a WER of 5.54% on JASMIN and 70.37% on DART, showing that the noisy DART data are clearly more challenging. Research question 2 examines to what extent it is possible to select a subset for which reliable orthographic transcriptions can be obtained automatically, without the need for manual verification. We use an utterance-level selection method that compares ASR output with the original read prompt to identify correctly pronounced recordings. Using the proposed selection method, 42.0% of the JASMIN utterances and 18.1% of the DART utterances can be automatically identified as correctly pronounced with high confidence, resulting in very low error rates at the utterance level (precisions of 98.3% and higher) and reducing the need for manual verification.
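
To make the two quantities in this abstract concrete, the following is a minimal sketch of word error rate (WER) and of a prompt-matching selection step. The abstract does not state the exact selection criterion, so the rule used here (keep an utterance only when the normalized ASR output equals the read prompt) is an assumption, and the function names `normalize`, `word_error_rate`, and `select_confident` are hypothetical.

```python
import re

def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation, and split into words."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1] / len(ref)

def select_confident(pairs):
    """Assumed rule: keep (prompt, asr_output) pairs with zero WER."""
    return [(p, a) for p, a in pairs if word_error_rate(p, a) == 0.0]
```

A stricter or looser threshold than exact match would trade coverage (the 42.0% and 18.1% figures) against precision; the paper's actual trade-off is not given in the abstract.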

2605.28832 2026-05-29 cs.CL cs.AI

A comparative study of transformer-based embeddings for topic coherence

Alex Ding, Tarun Rapaka, Willy Rodriguez, Jason Yang

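AI summary Systematically compares seven transformer language models of different sizes (from MiniLM to LLaMA-2) in a BERTopic pipeline and finds that model size (22 million to 13 billion parameters) has a negligible effect on topic coherence.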

Abstract

Topic modeling is a branch of Natural Language Processing (NLP) that aims to organize large collections of texts into coherent groups according to word co-occurrence patterns, with Latent Dirichlet Allocation (LDA) remaining one of the most widely used and interpretable probabilistic approaches. Recent advances in NLP, particularly transformer-based language models, offer improved document representations. It is also known that model size (in number of parameters) has a significant impact on the performance of language models on different pre-defined tasks. In this study, we systematically examine the effect of model size on topic quality by analyzing the performance of seven transformer-based language models (from small models such as MiniLM to large ones such as LLaMA-2) in a BERTopic pipeline on a variety of corpora. Topic quality is evaluated using coherence and divergence metrics following Röder et al. (2015). Our results indicate that model size, ranging from 22 million to 13 billion parameters, has a negligible impact on topic quality, suggesting that smaller models can achieve performance comparable to larger models.

2605.28830 2026-05-29 cs.CL cs.AI cs.SE

Benchmarking Open-Source Safety Guard Models: A Comprehensive Evaluation

Reetu Raj Harsh, Bhaskarjit Sarmah, Stefano Pasquali

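AI summary Evaluates 14 open-source safety guard models across 8 NIST AI Risk Framework safety categories, finding that recall is the critical metric and that model size does not correlate with safety detection performance.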

Abstract

As Large Language Models (LLMs) are increasingly deployed in safety-critical applications, robust content moderation becomes essential. We present a comprehensive evaluation of 14 open-source safety guard models on a curated benchmark of 79,331 samples spanning 8 NIST AI Risk Framework safety categories. Our benchmark aggregates four diverse datasets (HarmBench, StrongREJECT, RealToxicityPrompts, and BeaverTails), filtered to focus exclusively on safety-relevant content (violence, hate speech, harassment, sexual content, suicide/self-harm, profanity, threats, and health misinformation). We find that recall is the critical metric for safety applications, as missing harmful content poses greater risk than false positives. Our evaluation reveals surprising results: Qwen Guard (4B parameters) achieves the highest recall (83.97%) while larger models like Llama Guard (12B) and GPT-OSS Safeguard (20B) exhibit conservative behavior, missing up to 75% of unsafe content. We demonstrate that model size does not correlate with safety detection performance and that general-purpose guard models outperform specialized ones. These findings provide practical guidance for selecting safety guard models in production deployments.

2605.28828 2026-05-29 cs.CL cs.AI

Micro-Macro Retrieval: Reducing Long-Form Hallucination in Large Language Models

Yujie Feng, Jian Li, Zhihan Zhou, Pengfei Xu, Yujia Zhang, Xiaoyu Li, Xiaohui Zhou, Alan Zhao, Xi Chen, Xiao-Ming Wu

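AI summary Proposes the Micro-Macro Retrieval (M2R) framework, which retrieves coarse-grained external evidence at the macro level and draws on a key information repository built during reasoning at the micro level, addressing hallucination caused by key information sitting too far from the output in long-form generation.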

Abstract

Large Language Models (LLMs) achieve impressive performance across many tasks but remain prone to hallucination, especially in long-form generation where redundant retrieved contexts and lengthy reasoning chains amplify factual errors. Recent studies highlight a critical phenomenon: the closer key information appears to the model outputs, the higher the factual accuracy. However, existing retrieval-augmented language models (RALMs) lack effective mechanisms to ensure this proximity: external evidence is injected into reasoning via multi-turn retrieval, but this cannot ensure that key information stays close to the outputs. We propose Micro-Macro Retrieval (M2R), a novel retrieve-while-generate framework to fill this gap. At the macro level, M2R retrieves coarse-grained evidence from external sources; at the micro level, it extracts essential results from a key information repository built during reasoning and reuses them while generating answers. This design directly addresses the key-information-to-output proximity bottleneck, effectively reducing hallucination in long-form tasks. M2R is trained with a curriculum learning-based reinforcement learning strategy using customized rule-based rewards, enabling stable acquisition of retrieval and grounding skills. Extensive experiments across different benchmarks demonstrate the effectiveness of M2R, especially in long-context settings.

2605.28827 2026-05-29 cs.CL cs.LG

RightNow-Arabic-0.5B-Turbo: An Open Sub-1B Arabic Language Model via Vocabulary Injection and Edge-First Deployment

Jaber Jaber, Osama Jaber

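AI summary Existing open Arabic models are either multilingual models with weak Arabic support or too large to deploy easily. This work proposes RightNow-Arabic-0.5B-Turbo, a 518M-parameter Arabic-specialized model built on Qwen2.5-0.5B using vocabulary injection, continued pretraining, and supervised fine-tuning; it reaches 35.9% mean accuracy on three Arabic benchmarks, ties Falcon-H1-1.5B on COPA-ar, and deploys efficiently on edge devices.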

Comments 12 pages, 7 tables, 4 figures, 1 algorithm. Weights: https://huggingface.co/RightNowAI/RightNow-Arabic-0.5B-Turbo

Abstract

Open Arabic large language models split into two classes: sub-1B multilingual models that treat Arabic as an afterthought (Qwen2.5-0.5B, Falcon-H1-0.5B), and 7B-70B Arabic-specialized models that require a server to run (Jais, AceGPT, ALLaM, SILMA). The one published attempt at a sub-2B Arabic-specialized model, Kuwain-1.5B, never released its weights. We present RightNow-Arabic-0.5B-Turbo, a 518M-parameter Arabic-specialized decoder LLM built on Qwen2.5-0.5B. The pipeline adds 27,032 Arabic tokens via mean-subtoken initialization, continues pretraining on 504M Arabic tokens on 8xH100 with FSDP, FlashAttention varlen packing, and Liger fused kernels, then applies supervised fine-tuning on 129,116 Arabic instruction pairs with response-only loss masking, direct preference optimization on 6,750 Arabic preference pairs, and weight soup merging across three checkpoints. On three lm-evaluation-harness Arabic benchmarks (COPA-ar, Arabic HellaSwag, ArabicMMLU) the merged model reaches 35.9% mean accuracy, beats every same-class open model, ties Falcon-H1-1.5B on COPA-ar (58.4%) at one-third the size, and recovers 67% of SILMA-9B's mean at 1/18 the parameters. The edge build quantizes to 398 MB (q4_k_m) and delivers 635 tokens/s at batch size 1 on a single H100 via llama.cpp. All code (5,555 lines across 25 scripts), weights (bf16, int8, and four GGUF quantizations), and benchmark scripts are released at https://huggingface.co/RightNowAI/RightNow-Arabic-0.5B-Turbo.
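
The mean-subtoken initialization mentioned in this abstract can be illustrated with a small sketch: a newly added token starts from the average of the embedding rows of the subtokens that the base tokenizer would have split it into. This is a toy illustration under that assumption, not the authors' code; the function name `mean_subtoken_init`, the embedding values, and the subtoken ids are all hypothetical.

```python
def mean_subtoken_init(embeddings, subtoken_ids):
    """Return a new embedding row as the element-wise mean of existing rows."""
    if not subtoken_ids:
        raise ValueError("a new token must decompose into at least one subtoken")
    dim = len(embeddings[0])
    rows = [embeddings[i] for i in subtoken_ids]
    return [sum(row[d] for row in rows) / len(rows) for d in range(dim)]

# Toy 4-token vocabulary with 3-dimensional embeddings (hypothetical values).
embeddings = [
    [0.0, 1.0, 2.0],
    [2.0, 3.0, 4.0],
    [4.0, 5.0, 6.0],
    [1.0, 1.0, 1.0],
]

# Suppose the base tokenizer splits a new Arabic word into subtokens 0 and 2.
new_row = mean_subtoken_init(embeddings, [0, 2])
embeddings.append(new_row)  # the new token takes the next free id
```

Starting from the subtoken mean keeps the new row inside the region of embedding space the model already uses, which is the usual motivation for preferring it over random initialization before continued pretraining.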

2605.28826 2026-05-29 cs.CL

From Context Shift to Stylistic Collapse: Why Training Objectives Matter More Than Scale

Rohan Mahapatra

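AI summary Analyzes 17 models (410M-100B+ parameters) on 24 linguistic probes and finds that instruction-tuned systems systematically collapse language entropy along discourse and structural dimensions; weak intervention worsens the collapse while strong control markedly improves it, revealing a structural limitation in how current alignment pipelines reallocate stylistic probability mass.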

Comments 26 pages, 13 tables, 2 figures. Planning to submit to NeurIPS 2026

Abstract

In modern LLMs, linguistic features function not as stylistic artifacts but as probes of probability mass, allocated under training alignment objectives. Language models trained with contemporary pipelines exhibit severe reshaping of linguistic features, leading to extreme language re-distribution. While previous stylometric analyses explored linguistic differences between AI-generated and human texts, we focus on the reshaping plaguing the LLM training pipeline itself. We analyze 17 models (410M-100B+ parameters) across 24 linguistically-motivated probes, documenting that instruction-tuned systems systematically collapse language entropy along discourse and structural dimensions (mean amplification: 1,949-16,853%, peaks: 5,181-209,675%), while selectively suppressing complex punctuation to 3.2-23.2% of baseline frequencies. These effects do not worsen under RLHF, as divergence patterns are statistically indistinguishable (p > 0.25) across matched base and instruction-tuned model pairs. Weak intervention (lambda=1.0) exacerbates collapse by 240%, while strong control (lambda=5.0) achieves 40.5% improvement and outperforms frontier models by 96.7-98.2% despite 200-1000x scale disadvantage. Additionally, lambda=5.0 delivers 15% higher distinct-4, 27% higher vocabulary diversity, and 78% lower repetition than moderate regularization, establishing that alignment requires sufficient control strength, not merely distributional smoothing. Our findings underscore how modern LLMs reallocate stylistic probability mass, despite RLHF and scale. More broadly, our work reveals a structural limitation of current alignment pipelines: preference optimization reshapes language distributions invisible to standard quality metrics yet detectable through distributional probes, with implications for AI detection, training data contamination, and long-term linguistic evolution.
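
The abstract reports a distinct-4 score without defining it. Under the commonly used definition (unique n-grams divided by total n-grams in the generated text), which is assumed here rather than taken from the paper, the metric can be sketched as follows; the function name `distinct_n` is hypothetical.

```python
def distinct_n(tokens, n):
    """Ratio of unique n-grams to total n-grams; 0.0 if the text is shorter than n."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

# A repetitive toy sequence: 6 four-grams in total, 4 of them unique.
repetitive = "a b c d a b c d a".split()
score = distinct_n(repetitive, 4)
```

Higher values mean less repetition, so a "15% higher distinct-4" corresponds to more varied four-word sequences in the sampled text.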

2605.28825 2026-05-29 cs.CL

MechELK: A Mechanistic Interpretability Framework for Eliciting Latent Knowledge in Large Language Models

Ji-jun Park, Soo-joon Choi, Jiwon Jeong, Taeyang Yoon, Ju-Wan Lee

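AI summary Proposes MechELK, a three-stage framework (Locate, Verify, Elicit) that uses sparse autoencoder feature analysis, causal probing, and related methods to extract hidden knowledge from large language models, reaching 84.7% average elicitation accuracy on TruthfulQA and other benchmarks.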

Abstract

Large language models (LLMs) frequently encode factual and reasoning knowledge in their internal representations that is not faithfully reflected in their surface-level outputs, a phenomenon known as latent knowledge. Existing approaches to eliciting latent knowledge, such as Contrastive Consistency Search (CCS), rely on contrastive activation patterns and struggle with complex multi-step reasoning tasks, while mechanistic interpretability tools have primarily been used to understand model behavior rather than to extract hidden knowledge. We present MechELK, a unified three-stage framework that bridges mechanistic interpretability and latent knowledge elicitation. MechELK operates through: (1) Locate: using Sparse Autoencoder (SAE) feature analysis and activation patching to identify knowledge-bearing representations; (2) Verify: employing causal probing to distinguish genuine latent knowledge from spurious correlations; and (3) Elicit: applying representation engineering to surface hidden knowledge without modifying model weights. Evaluated on TruthfulQA, a curated Deceptive Alignment benchmark, and the Quirky LM dataset, MechELK achieves an average elicitation accuracy of 84.7%, outperforming CCS by 6.2% and direct linear probing by 9.1%. Crucially, MechELK successfully identifies latent knowledge in 78.3% of cases where the model's surface output is incorrect or evasive, demonstrating its utility for AI safety applications including deceptive alignment detection.

2605.28824 2026-05-29 cs.CL

A Modular Architecture for Typologically Controlled Lexicon Generation

Sankalp Tattwadarshi Swain, Dhruv Kumar

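AI summary Proposes a modular framework that generates phonologically plausible and typologically realistic lexicons from PHOIBLE phoneme inventories, interchangeable phonological grammars, and a Swadesh-Leipzig-Jakarta ontology; experiments show that probabilistic grammars outperform deterministic and random baselines.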

Abstract

Constructing artificial lexicons that are pronounceable, typologically plausible, and semantically structured remains an open challenge in computational linguistics. Existing conlang generators either lack formal phonotactic guarantees or delegate generation to opaque, non-reproducible LLM-based pipelines. We propose a modular framework that samples phoneme inventories from PHOIBLE, generates word forms under interchangeable phonological grammars (deterministic, OT, and MaxEnt), and assigns meanings via a Swadesh-Leipzig-Jakarta ontology with explicit form-meaning alignment. Evaluation on character n-gram perplexity, log-likelihood, and KL divergence against PHOIBLE across lexicon sizes of 100-5,000 forms shows that probabilistic grammars consistently outperform deterministic and random baselines on both phonotactic coherence and typological realism.
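
One of the evaluation measures named above, KL divergence against a reference distribution, can be sketched as follows. The abstract does not say which distributions are compared or how they are smoothed, so the sketch assumes symbol frequencies over a fixed alphabet with additive smoothing; `distribution`, `kl_divergence`, and the toy strings are hypothetical.

```python
import math
from collections import Counter

def distribution(symbols, alphabet, eps=1e-6):
    """Smoothed relative frequencies of `symbols` over a fixed alphabet."""
    counts = Counter(symbols)
    total = sum(counts[s] + eps for s in alphabet)
    return {s: (counts[s] + eps) / total for s in alphabet}

def kl_divergence(p, q):
    """KL(P || Q) in nats for two distributions over the same keys."""
    return sum(p[s] * math.log(p[s] / q[s]) for s in p)

alphabet = "ptka"
generated = distribution("patakapa", alphabet)   # toy generated lexicon
reference = distribution("papatata", alphabet)   # toy reference frequencies
divergence = kl_divergence(generated, reference)
```

A value near zero means the generated lexicon uses symbols at about the same rates as the reference; the smoothing constant keeps the logarithm finite when a symbol is missing from one side.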

2605.28823 2026-05-29 cs.CL

What are They Thinking? Delineation, Probing and Tracking of Concepts in LLMs

Mohamed Abdelwahab, Michelle Yu Collins, Sihan Chen, Yi Cheng Zhao, Zafarullah Mahmood, Jiading Zhu, Soliman Ali, Jonathan Rose

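AI summary Proposes low-cost linear probes for detecting concepts in LLM embeddings and demonstrates concept delineation, probe training, and cross-context tracking, laying groundwork for large-scale model monitoring.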

Abstract

As the influence of LLMs expands, it is imperative to gain insight into their decisions. One way to do that is to develop probes that detect the presence or absence of a broad set of concepts within the embeddings computed in an LLM, which is what we might say a model is "thinking" about. Such probes should be low-cost and easily applicable to any LLM, so that monitoring for many concepts is possible during normal operation. In this paper, we take the first steps towards developing the capability of creating many such probes by defining and executing examples of the key tasks needed. First, we carefully delineate a concept by creating a dataset in which the concept is both present and absent. Then, we train and test a set of linear probes to detect the concept on any layer of an LLM, including an exploration of the complexity of the probe needed. Finally, we show that such probes can track concepts across larger contexts. We do this with four separate concepts and three different LLMs. When this process is scaled to many more concepts, it will make it easy to monitor new models.

2605.28822 2026-05-29 cs.CL

Lightweight Multimodal LLM-Enabled Cost-Effective Defect Grading of Power Transmission Equipment

Tao Wang, Lipeng Zhu, Jiayong Li, Feng Gao, Siwen Liang

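AI summary Proposes a defect grading framework based on multimodal large language models that maximizes the potential of commercial models through in-context learning and fine-tunes a lightweight model on chain-of-thought question-answer pairs, achieving accurate grading at low cost.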

Comments 9 pages, 6 figures

Abstract

Defect grading of power transmission equipment (DGPTE) is crucial to the stability of electric energy transmission. Although existing machine learning methods are strong at defect detection, in the finer-grained task of defect grading they struggle to integrate expert experience and to handle class imbalance. To address this issue, this paper introduces a novel defect grading framework based on a multimodal large language model (MLLM). Specifically, the approach maximizes the potential of commercial MLLMs on DGPTE through in-context learning and obtains a state-of-the-art (SOTA) model. By sending a secondary request to this model, a small number of chain-of-thought-based question-answer pairs (Q&As) are generated, which effectively reduces the cost of manual annotation. These high-quality, interpretable Q&As are then used to train Qwen3-VL-8B via Low-Rank Adaptation-based supervised fine-tuning (SFT). Experimental results on three DGPTE tasks demonstrate that fine-tuning only the language model layer yields SOTA performance. Furthermore, multi-task joint fine-tuning verifies the feasibility of handling multiple grading tasks with a single lightweight MLLM.

2605.28551 2026-05-29 cs.CV cs.GR cs.LG

Resolution-free neural surrogates for geometric parameterization and mapping with spatially varying fields

Yanwen Huang, Lok Ming Lui, Gary P. T. Choi

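AI summary Proposes a resolution-free neural surrogate that, through multi-resolution geometric encoding and unsupervised training with geometry-aware constraints (variational energies, diffusion-based density equalization, quasi-conformal theory), predicts mapped locations directly from spatially varying parameter fields on arbitrary structured or unstructured point sets.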

Abstract

Many imaging problems require computing spatial transformations induced by spatially varying intensity, feature, or density fields. Canonical examples include distortion correction, deformable image registration, atlas-based segmentation, and deformation-driven image analysis. These tasks can be formulated as geometric mapping problems in which the transformation is constrained to preserve local structure, control boundary behavior, or regulate angular distortion. Such formulations typically lead to variational models, diffusion processes, or elliptic partial differential equations. However, repeatedly solving high-resolution systems becomes computationally expensive when the underlying parameter fields vary across instances. In this work, we propose a resolution-free neural surrogate for geometric parameterization and mapping problems. Given a spatially varying parameter field $p:\Omega\to\mathbb{R}^m$ and query locations $\{x_i\}_{i=1}^N\subset\Omega$, the model predicts mapped locations $\{u(x_i)\}_{i=1}^N$ on arbitrary structured or unstructured point sets. To avoid dependence on a fixed grid, we use a multi-resolution geometric encoding strategy that conditions the network on coordinate-augmented samples of the parameter field. The model is trained without labeled solution data by enforcing geometry-aware constraints derived from variational energies, diffusion-based density equalization, and quasi-conformal theory. Experimental results on quasi-conformal mapping and density-equalizing mapping problems are presented to demonstrate the effectiveness of our proposed method.

2605.28293 2026-05-29 cs.LG cs.AI

ProRL: Effective Reinforcement Learning for Proactive Recommendation via Rectified Policy Gradient Estimation

Hongru Hou, Tiehua Mei, Denghui Geng, Jinhui Huang, Ao Xu, Hengrui Chen, Jiaqing Liang, Deqing Yang

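AI summary To address length-dependent bias and high variance in policy gradient estimation for proactive recommender systems, proposes the ProRL framework, which rectifies gradients through two mechanisms, Stepwise Reward Centering and Position-Specific Advantage Estimation, and significantly improves recommendation performance.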

Comments Accepted in ICML 2026

Abstract

Proactive Recommender Systems (PRSs) aim to guide user preference shift toward target items by generating paths of intermediate recommendations. Reinforcement learning (RL) provides a principled framework for optimizing such sequential decision tasks, as path rewards can naturally capture both short-term acceptance and long-term guidance effectiveness. However, naively applying policy gradients to PRS results in deficient gradient estimation. We identify two deficiencies: (1) path-level rewards decompose into step-level rewards with positive mean, creating a length-dependent bias that causes gradients to favor path extension over meaningful exploration; (2) weighting each step by the entire path-level reward ignores the decomposition structure, leading to high gradient variance. To rectify these two deficiencies, we propose an effective RL framework ProRL with two novel mechanisms for proactive recommendation. First, Stepwise Reward Centering subtracts expected rewards to neutralize length-dependent bias, ensuring that path extension yields zero expected gradient signal. Second, Position-Specific Advantage Estimation leverages the reward decomposition structure to compute step-dependent baselines, reducing gradient variance. Together, these mechanisms yield policy gradients that precisely target path quality. Our experiments on three real-world datasets demonstrate that ProRL significantly outperforms state-of-the-art PRSs. Our code is available at https://github.com/hongruhou89/ProRL.
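
The length-dependent bias described in this abstract can be seen in a toy example. The sketch below is one plausible reading of reward centering with position-specific baselines, not the ProRL estimators themselves (the abstract does not give their formulas): the baseline at each position is assumed to be the batch mean of step rewards at that position, and the advantage is the centered reward-to-go. The function names and reward values are hypothetical.

```python
def position_baselines(paths):
    """Mean step reward at each position, over the paths long enough to reach it."""
    max_len = max(len(p) for p in paths)
    baselines = []
    for t in range(max_len):
        at_t = [p[t] for p in paths if len(p) > t]
        baselines.append(sum(at_t) / len(at_t))
    return baselines

def centered_advantages(paths):
    """Per-step advantage: reward-to-go after subtracting the position baseline."""
    b = position_baselines(paths)
    out = []
    for p in paths:
        centered = [r - b[t] for t, r in enumerate(p)]
        # Suffix sums of the centered rewards give the reward-to-go at each step.
        adv, running = [], 0.0
        for c in reversed(centered):
            running += c
            adv.append(running)
        out.append(adv[::-1])
    return out

# Two paths of equal per-step quality but different lengths (hypothetical rewards).
paths = [[0.5, 0.5], [0.5, 0.5, 0.5, 0.5]]
raw_returns = [sum(p) for p in paths]
advantages = centered_advantages(paths)
```

With raw returns, the longer path scores 2.0 against 1.0 purely because it has more steps with positive mean reward; after centering, every advantage is zero, so extending a path by itself contributes no gradient signal.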

2605.28108 2026-05-29 cs.CL

Ask Now, Use Later: Benchmarking the Proactivity Gap in Long-Lived LLM Agents

Bin Wu, Guanyun Zou, Bingbing Wang, Huan Zhao, Chuan Shi

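AI summary Targets the proactivity gap that arises when long-lived LLM agents fail to proactively acquire user preferences across sessions; proposes ATRBench, a benchmark for Ask-to-Remember (ATR) that quantifies the gap by fixing user preferences as hidden ground truth, and diagnoses acquisition as the bottleneck.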

英文摘要

A long-lived LLM agent, such as OpenClaw, earns its value by acting on a user's preferences and constraints across sessions, not just the current request. Yet today's agents keep what a user volunteers but rarely ask for what stays unspoken, leaving a proactivity gap in long-lived LLM agents: an agent cannot act on a preference it never obtained. As users delegate more of their affairs to agents, the impact of this gap grows. We isolate one concrete, controllable slice of this gap as Ask-to-Remember (ATR): the agent decides whether to ask now for a reusable user preference that the current task does not need but a later session with the same user will. ATR is hard even to evaluate: the right question is underdetermined and its payoff deferred to tasks that may never arise. ATRBench, to the best of our knowledge the first ATR benchmark, makes it measurable by fixing each user's preferences as hidden ground truth, so success demands asking, not recall. Across eight frontier LLM agents, defaults fall at least 62 points below an oracle handed the relevant preference, and prompting closes little of it. Diagnostics identify acquisition as the bottleneck. ATRBench surfaces this proactivity gap in current agents and offers a diagnostic testbed for closing it.

2605.27995 2026-05-29 cs.AI

AsyncTool: Evaluating the Asynchronous Function Calling Capability under Multi-Task Scenarios

AsyncTool: 多任务场景下异步函数调用能力的评估

Kou Shi, Ziao Zhang, Shiting Huang, Avery Nie, Zhen Fang, Qiuchen Wang, Lin Chen, Huaian Chen, Zehui Chen, Feng Zhao

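AI summary Proposes the AsyncTool benchmark, which simulates a multi-task environment with tool response latency to evaluate how well LLM-based agents coordinate tasks and use time efficiently under asynchronous tool calling.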

Comments https://github.com/StoKou/repo-asynctool

详情
AI中文摘要

基于大语言模型的智能体在使用外部工具解决复杂任务方面展现出强大能力。然而,现有评估往往忽视了工具使用的时间维度,尤其是工具响应延迟的影响,并且通常局限于单任务场景。在实际应用中,多个任务通常需要并发执行,整体效率取决于智能体是否能在等待工具响应时利用空闲时间。我们将这种能力称为异步工具调用。为评估该能力,我们提出了AsyncTool,一个用于评估基于大语言模型的智能体在具有延迟工具反馈的交互式多任务工具使用环境中的基准。AsyncTool同时呈现多个异构任务,并在执行过程中模拟真实的工具响应延迟。利用混合数据演化策略,我们构建了一个覆盖多种场景和工具使用模式的多样化异步多任务数据集。我们在步骤、子任务和任务级别评估模型,并引入面向效率的指标来衡量任务协调和完成效率。大量实验表明,延迟的工具反馈对当前智能体构成了重大挑战,并导致明显的性能下降。能够更好地协调任务切换、依赖跟踪和状态维护的模型在AsyncTool上取得了更强的性能。我们的分析识别了当前工具使用智能体的关键失败模式,并为设计未来具有更强时间推理和协调能力的系统提供了实用见解。

英文摘要

Large language model (LLM)-based agents have shown strong capabilities in using external tools to solve complex tasks. However, existing evaluations often overlook the temporal dimension of tool use, especially the impact of tool response latency, and are usually limited to single-task settings. In real-world applications, multiple tasks often need to be executed concurrently, and overall efficiency depends on whether an agent can use idle time while waiting for tool responses. We refer to this capability as asynchronous tool calling. To evaluate it, we propose AsyncTool, a benchmark for assessing LLM-based agents in interactive multi-task tool-use environments with delayed tool feedback. AsyncTool presents multiple heterogeneous tasks simultaneously and simulates realistic tool response latency during execution. Using a hybrid data evolution strategy, we construct a diverse asynchronous multitasking dataset that covers multiple scenarios and tool-use patterns. We evaluate models at the step, sub-task, and task levels, and introduce efficiency-oriented metrics to measure task coordination and completion efficiency. Extensive experiments show that delayed tool feedback poses substantial challenges to current agents and leads to clear performance degradation. Models that better coordinate task switching, dependency tracking, and state maintenance achieve stronger performance on AsyncTool. Our analysis identifies key failure modes of current tool-using agents and provides practical insights for designing future systems with stronger temporal reasoning and coordination capabilities.
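The idea of using idle time while waiting for tool responses can be sketched with `asyncio`. This is a minimal illustration, not part of the AsyncTool benchmark. The tool names, the latencies, and the helper `call_tool` are hypothetical stand-ins for real tool calls.

```python
import asyncio
import time

async def call_tool(name, latency):
    """Hypothetical tool: sleeps to simulate response latency, then returns."""
    await asyncio.sleep(latency)
    return f"{name}:done"

async def run_sequential(tasks):
    # Wait for each tool response before issuing the next call.
    return [await call_tool(name, latency) for name, latency in tasks]

async def run_interleaved(tasks):
    # Issue all calls up front; the waiting periods overlap.
    return await asyncio.gather(*(call_tool(name, latency) for name, latency in tasks))

def timed(coro_fn, tasks):
    start = time.perf_counter()
    results = asyncio.run(coro_fn(tasks))
    return results, time.perf_counter() - start

TASKS = [("search", 0.05), ("weather", 0.05), ("calendar", 0.05)]
seq_results, seq_time = timed(run_sequential, TASKS)
par_results, par_time = timed(run_interleaved, TASKS)
```

Both runs return the same results in the same order, but the interleaved run takes roughly the longest single latency rather than the sum of all three. An agent faces the same trade-off when it decides whether to block on a pending tool call or switch to another task.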

2605.27975 2026-05-29 cs.LG stat.ML

Continual Learning in Modern Hopfield Networks with an Application to Diffusion Models

现代Hopfield网络中的持续学习及其在扩散模型中的应用

Ken Takeda, Masafumi Oizumi, Ryo Karakida

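AI summary Analyzes continual learning in diffusion models through the modern Hopfield energy, proves that high-energy, outlier-like samples are more easily forgotten, and selects replay samples by energy to mitigate forgetting.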

详情
AI中文摘要

生成模型(包括扩散模型)越来越多地被用作基础模型,并通过顺序微调进行适配,这使得持续学习成为一个关键问题设定。然而,此类生成模型中的持续学习仍未被充分理解:任务变化后,学习分布的哪些方面最容易丢失,以及应优先重放哪些样本?我们通过现代Hopfield能量来解决这些问题。现代Hopfield网络(MHN)与扩散模型之间的最新联系使得MHN中的分析可以迁移到扩散模型。我们引入内在遗忘作为任务变化后Hopfield能量的增加。在MHN的可处理设定中,我们证明高能量、类似异常值的样本比类似聚类的样本经历更大的能量增加,这意味着位于尖锐、孤立盆地中的样本更容易被遗忘。我们进一步分析了记忆重放,并表明重放对高能量样本特别有效,从而实现了基于能量的重放样本选择。我们在MHN和两种扩散模型(Stable Diffusion和像素空间DDPM)的持续学习设置实验中验证了这些预测。在这些扩散模型中,Hopfield能量追踪基于重建的遗忘,重放实验揭示了与MHN分析一致的能量依赖性遗忘缓解。

英文摘要

Generative models, including diffusion models, are increasingly used as foundation models and adapted through sequential fine-tuning, making continual learning an essential problem setting. However, continual learning in such generative models remains poorly understood: after a task change, what aspects of the learned distribution are most easily lost, and what replay samples should be prioritized? We address these questions through the modern Hopfield energy. Recent links between modern Hopfield networks (MHNs) and diffusion models allow analyses in MHNs to be transferred to diffusion models. We introduce intrinsic forgetting as an increase in Hopfield energy after the task change. In tractable settings in an MHN, we prove that high-energy, outlier-like samples undergo a larger energy increase than cluster-like samples, implying that samples located in sharp, isolated basins are more forgettable. We further analyze memory replay and show that replay is particularly effective for high-energy samples, enabling an energy-based selection of replay samples. We validate these predictions in experiments on MHNs and two diffusion models under continual-learning settings: Stable Diffusion and a pixel-space DDPM. In these diffusion models, Hopfield energy tracks reconstruction-based forgetting, and replay experiments reveal energy-dependent mitigation of forgetting that is consistent with the MHN analysis.
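For readers unfamiliar with the modern Hopfield energy, the sketch below evaluates one common form of it, E(q) = -(1/beta) * log(sum_i exp(beta * x_i . q)) + 0.5 * ||q||^2, with additive constants dropped. The abstract does not state the exact energy the authors use, so this form, the value of `beta`, and the toy patterns are all assumptions.

```python
import numpy as np

def hopfield_energy(query, patterns, beta=4.0):
    """Log-sum-exp energy of `query` given stored `patterns` (rows), constants omitted."""
    scores = beta * patterns @ query
    m = scores.max()  # subtract the maximum for numerical stability
    lse = (m + np.log(np.exp(scores - m).sum())) / beta
    return float(-lse + 0.5 * query @ query)

# Three unit-norm patterns clustered near angle 0, plus one isolated pattern.
angles = np.array([0.0, 0.1, -0.1, np.pi / 2])
patterns = np.stack([np.cos(angles), np.sin(angles)], axis=1)

e_cluster = hopfield_energy(patterns[0], patterns)
e_outlier = hopfield_energy(patterns[3], patterns)
```

In this toy setting the isolated pattern has higher energy than the clustered one, because fewer stored patterns contribute to its log-sum-exp term. That matches the sense in which the abstract calls outlier-like samples "high-energy".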

2605.27959 2026-05-29 cs.CV cs.AI

ROVER: Routing Object-Centric Visual Evidence for Grounded Multi-Image Reasoning

ROVER: 面向对象中心视觉证据的路由用于基于多图像推理

Guannan Lv, Ren Nie, Hongjian Dou, Tingting Gao

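AI summary Proposes ROVER, a lightweight, learnable plugin that aggregates reasoning context, distills intra-image cues via object-centric differential attention, and routes history-aware evidence, enabling efficient global visual evidence routing and improving answer and grounding accuracy in multi-image reasoning.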

详情
AI中文摘要

多模态大语言模型(MLLMs)越来越多地定位和交错视觉证据以进行审慎推理。基于定位的方法通常通过将裁剪的图像块或感兴趣区域(RoI)特定特征注入推理上下文来关注RoI。然而,这种设计可能削弱整体场景理解和对象间关系,同时导致解码成本随RoI数量和大小增加而增加。或者,自适应视觉特征选择通常需要细粒度监督或复杂启发式方法。为解决这些限制,我们提出ROVER(面向对象中心视觉证据的路由用于基于多图像推理),一种轻量级、可学习的插件,用于高效的全局视觉证据路由。在每次对象定位预测时,ROVER注入一个步骤特定的令牌三元组,以协同地:(i) 聚合正在进行的推理上下文,(ii) 通过对象中心差分注意力将图像内线索蒸馏到视觉工作空间中,以及(iii) 在该空间内跨对象和图像路由并整合历史感知证据以供后续推理。我们将ROVER集成到Qwen2.5-VL-7B中,并开发了一个交错的SFT到GRPO训练流程。严格遵循原始数据集和评估协议,我们的方法在MM-GCoT(+4.8%答案准确率,+14.6%定位准确率)和VideoEspresso(+8.6%答案准确率)上取得了最佳性能。在VideoEspresso上训练的模型表现出强大的迁移能力,在多个基准测试上平均比基础模型高出+4.7%。

英文摘要

Multimodal Large Language Models (MLLMs) have increasingly localized and interleaved visual evidence for deliberative reasoning. Grounding-based approaches typically focus on regions of interest (RoIs) by injecting cropped image patches or RoI-specific features into the reasoning context. However, such designs can weaken holistic scene understanding and inter-object relations, while incurring decoding costs that scale with the number and size of RoIs. Alternatively, adaptive visual feature selection often requires fine-grained supervision or complex heuristics. To address these limitations, we propose ROVER (Routing Object-centric Visual Evidence for grounded multi-image Reasoning), a lightweight, learnable plugin for efficient global visual evidence routing. Upon each object grounding prediction, ROVER injects a step-specific token triplet to synergistically: (i) aggregate the ongoing reasoning context, (ii) distill intra-image cues into a visual working space via object-centric differential attention, and (iii) route and integrate history-aware evidence across objects and images within this space for subsequent reasoning. We integrate ROVER into Qwen2.5-VL-7B and develop an interleaved SFT-to-GRPO training pipeline. Strictly adhering to the original datasets and evaluation protocols, our method achieves the best performance on MM-GCoT (+4.8% answer accuracy, +14.6% grounding accuracy) and VideoEspresso (+8.6% answer accuracy). The VideoEspresso-trained model demonstrates strong transferability, outperforming the base model by +4.7% on average across diverse benchmarks.

2605.27390 2026-05-29 cs.CL cs.AI

EvoSpec: Evolving Speculative Decoding via Real-Time Vocabulary and Parameter Adaptation

EvoSpec: 通过实时词汇和参数自适应进化推测解码

Shuyu Zhang, Lingfeng Pan, Qicheng Wang, Yaqi Shi, Yueyang Tan, Ruyu Yan, Jiaqi Chen, Lixing Du, Lu Wang

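AI summary Proposes the EvoSpec framework, which evolves the draft model in speculative decoding in real time through dynamic vocabulary and parameter adaptation, addressing the sharp drop in acceptance rate that static methods suffer in specialized domains and topic-switching scenarios; on EAGLE-3 it achieves a 1.13x speedup over FR-Spec with 27% lower memory overhead than standard online adaptation.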

详情
AI中文摘要

推测解码通过草稿-验证范式加速大型语言模型推理,但随着词汇表规模扩大,输出投影层成为瓶颈。现有的静态剪枝方法虽有效降低开销,但由于无法捕捉动态分布变化,在专业领域或主题切换场景中接受率骤降。为解决此问题,我们提出EvoSpec框架,通过动态词汇和参数自适应实现草稿模型的实时进化。与静态或纯检索方法不同,EvoSpec采用上下文感知机制,通过高效的语义和统计索引检索关键长尾词。此外,我们提出一种轻量级在线对齐策略,利用课程学习持续最小化草稿模型与目标模型之间的分布差距。在专业领域(编码、法律和医学)的广泛评估证实,EvoSpec克服了静态基线的局限性。在EAGLE-3上,它相比最先进的静态基线FR-Spec实现1.13倍加速,且内存开销比标准在线自适应低27%。

英文摘要

Speculative decoding accelerates Large Language Model inference via a draft-then-verify paradigm, yet the output projection layer becomes a bottleneck as vocabulary sizes scale. While existing static pruning methods effectively reduce this overhead, they suffer from precipitous drops in acceptance rate in specialized domains or topic-switching scenarios due to their inability to capture dynamic distribution shifts. To address this, we introduce EvoSpec, a framework that enables real-time evolution of the draft model through dynamic vocabulary and parameter adaptation. Unlike static or purely retrieval-based approaches, EvoSpec employs a context-aware mechanism that retrieves critical long-tail tokens via efficient semantic and statistical indexing. Furthermore, we propose a lightweight online alignment strategy utilizing curriculum learning to continually minimize the distributional gap between the draft and target models. Extensive evaluations across specialized domains (coding, law, and medicine) confirm that EvoSpec overcomes the limitations of static baselines. On EAGLE-3, it achieves a 1.13x speedup in these settings over the state-of-the-art static baseline FR-Spec, with 27% lower memory overhead than standard online adaptation.
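Why a statically pruned draft vocabulary loses acceptance rate under distribution shift can be seen from the standard speculative-sampling rule, under which a drafted token is accepted with probability min(1, p/q), so the expected acceptance rate is the sum over tokens of min(p, q). The sketch below is not EvoSpec itself. It assumes, for illustration only, that the draft distribution equals the target distribution renormalized over the kept tokens, and the five-token distributions are made up.

```python
import numpy as np

def prune_to_vocab(probs, keep):
    """Draft distribution: target mass restricted to `keep` and renormalized (assumption)."""
    q = np.zeros_like(probs)
    q[keep] = probs[keep]
    return q / q.sum()

def acceptance_rate(target, draft):
    """Expected acceptance probability of speculative sampling: sum of min(p, q)."""
    return float(np.minimum(target, draft).sum())

general = np.array([0.40, 0.30, 0.20, 0.05, 0.05])  # head tokens dominate
domain = np.array([0.10, 0.10, 0.10, 0.35, 0.35])   # long-tail tokens dominate

static_keep = [0, 1, 2]   # vocabulary pruned once, on general text
adapted_keep = [0, 3, 4]  # vocabulary refreshed for the current context

static_general = acceptance_rate(general, prune_to_vocab(general, static_keep))
static_domain = acceptance_rate(domain, prune_to_vocab(domain, static_keep))
adapted_domain = acceptance_rate(domain, prune_to_vocab(domain, adapted_keep))
```

In this toy case the static vocabulary accepts about 90% of drafts on general text but only about 30% once the mass moves to long-tail tokens. Swapping those tokens into the draft vocabulary recovers about 80%.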

2605.27387 2026-05-29 cs.CL cs.AI

From AR to Diffusion: Efficiently Adapting Large Language Models with Strictly Causal and Elastic Horizons

从自回归到扩散:利用严格因果与弹性视野高效适配大型语言模型

Xiangyu Ma, Teng Xiao, Zuchao Li, Lefei Zhang

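AI summary Proposes the FLUID framework, which efficiently adapts autoregressive models into diffusion models through strictly causal alignment and an elastic-horizon mechanism, enabling parallel text generation while sharply reducing training cost.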

Comments Accepted by ACL 2026

详情
AI中文摘要

扩散模型有望实现高效的并行文本生成,但其依赖双向注意力机制,与预训练的自回归(AR)模型存在结构不匹配。这种不兼容性阻碍了稳健AR先验的复用,需要从头开始进行代价高昂的预训练。为弥合这一差距,我们提出FLUID框架,该框架高效地将AR骨干网络适配到扩散范式。通过强制执行严格因果对齐,FLUID能够从标准GPT风格检查点无缝初始化,避免了大规模预训练。此外,我们引入弹性视野,这是一种基于局部信息密度而非固定调度动态调节去噪步长的熵驱动机制。实验表明,FLUID在将训练成本降低数个数量级的同时实现了最先进的性能,有效调和了成熟的AR基础与高效的并行生成。我们的代码可在https://github.com/Oli-lab-nun/FLUID/tree/main获取。

英文摘要

Diffusion models promise efficient parallel text generation but rely on bidirectional attention, creating a structural mismatch with pre-trained Autoregressive (AR) models. This incompatibility precludes reusing robust AR priors, necessitating prohibitive pre-training from scratch. To bridge this gap, we propose FLUID, a framework that efficiently adapts AR backbones to the diffusion paradigm. By enforcing Strictly Causal Alignment, FLUID enables seamless initialization from standard GPT-style checkpoints, circumventing the need for massive pre-training. Furthermore, we introduce Elastic Horizons, an entropy-driven mechanism that dynamically modulates denoising strides based on local information density rather than fixed schedules. Experiments demonstrate that FLUID achieves state-of-the-art performance while reducing training costs by orders of magnitude, effectively reconciling established AR foundations with efficient parallel generation. Our code is available at https://github.com/Oli-lab-nun/FLUID/tree/main.

2605.27379 2026-05-29 cs.AI cs.CL

Soro: A Lightweight Foundation Model and Chatbot for Tajik

Soro: 一种轻量级塔吉克语基础模型与聊天机器人

Stanislav Liashkov, Haitz Sáez de Ocáriz Borde, Azizjon Azimi, Khushbakht Shoymardonov, Shuhratjon Khalilbekov, Bonu Boboeva

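AI summary Targeting compute- and connectivity-constrained settings in Tajikistan, proposes Soro, a family of Tajik-specialized conversational LLMs built on Gemma 3 through continual pretraining and supervised fine-tuning; it substantially outperforms same-size baselines on Tajik benchmarks and supports quantized deployment.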

详情
AI中文摘要

我们提出了Soro,一个塔吉克语专用对话大语言模型(LLM)家族,专为在塔吉克斯坦计算和连接受限条件下的实际部署而设计。从开放权重的Gemma 3检查点开始,我们在一个精心策划的19亿词元语料库上进行了仅塔吉克语的持续预训练,该语料库涵盖过滤后的网络文本、PDF文档和符合课程的教育材料,随后在4万个塔吉克语教师风格示例上进行监督指令微调。为了在标准基准测试中塔吉克语覆盖有限的情况下实现严格评估,我们引入了一套塔吉克语基准测试,涵盖常识、语言能力以及中学和大学入学考试领域,并在Hugging Face上开源。在这些塔吉克语基准测试中,Soro显著优于同尺寸的Gemma 3基线,同时在标准数据集上保持了较强的英语性能。我们进一步表明,Soro的FP8和INT4量化保留了大部分塔吉克语增益,同时降低了边缘部署的内存需求,支持正在进行的教育领域试点和计划在塔吉克斯坦学校中的扩展。

英文摘要

We present Soro, a family of Tajik-specialized conversational large language models (LLMs) designed for real-world deployment under tight compute and connectivity constraints in Tajikistan. Starting from open-weight Gemma 3 checkpoints, we perform Tajik-only continual pretraining on a curated 1.9-billion-token corpus spanning filtered web text, PDF documents, and curriculum-aligned educational materials, followed by supervised instruction tuning on 40K Tajik teacher-style examples. To enable rigorous evaluation despite the limited coverage of Tajik in standard benchmarks, we introduce a suite of Tajik benchmarks covering general knowledge, linguistic competence, and school- and university entrance-exam domains, and we open-source them on Hugging Face. Across these Tajik benchmarks, Soro substantially outperforms same-size Gemma 3 baselines while retaining strong English performance on standard datasets. We further show that FP8 and INT4 quantization of Soro preserves most Tajik-language gains while reducing memory requirements for edge deployment, supporting an ongoing education-sector pilot and planned scale-out across schools in Tajikistan.

2605.27377 2026-05-29 cs.CL cs.AI cs.IR

Enhancing LLM Medical Coding with Structured External Knowledge

利用结构化外部知识增强LLM医学编码

Yidong Gan, David D. Nguyen, Yang Lin, Peter Zhong, Thanh Vu, Long Duong, Yuan-Fang Li

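AI summary Proposes RAG-Coding, a training-free method that strengthens LLM medical coding by encoding the ICD tabular list as a knowledge graph and distilling the guidelines into summaries; it outperforms LLM-based baselines on the MDACE and MDACE-2025 datasets.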

详情
AI中文摘要

准确的医学编码需要查阅权威资源,如ICD表格列表和编码指南。现有的基于LLM的自动化方法主要依赖LLM的内部知识,容易产生幻觉且无法跟上指南更新。我们引入了RAG-Coding,一种无需训练的智能体方法,通过结构化外部知识增强LLM:将表格列表编码为知识图谱,捕获层次化和指令性的代码关系;将指南提炼为简洁、代码特定的摘要,而非检索原始文本。为支持我们的研究,我们还引入了MDACE-2025,即根据2025年ICD-10-CM/PCS指南对MDACE数据集进行的专家重新标注,增加了代码排序和理由注释。在MDACE上,RAG-Coding在五个LLM骨干网络上以micro-F1指标超越最佳基于LLM的基线3-13%,并与监督式最先进方法达到相当的micro-和macro-F1,以更高的召回率(+11%)为代价,精确率降低(-6%)。在MDACE-2025上,RAG-Coding超越所有基线,展示了对更新指南的有效泛化。消融实验确认了逐步提升,强调了整合结构化外部知识对基于LLM的医学编码的重要性。

英文摘要

Accurate medical coding requires consulting authoritative resources such as the ICD tabular list and coding guidelines. Existing LLM-based automated methods largely rely on LLMs' internal knowledge, which is prone to hallucination and cannot keep pace with guideline updates. We introduce RAG-Coding, an agentic, training-free method that augments LLMs with structured external knowledge: the tabular list is encoded as a knowledge graph capturing hierarchical and instructional code relationships, and the guidelines are distilled into concise, code-specific summaries rather than retrieved as raw text. To enable our study, we also introduce MDACE-2025, expert re-annotations of the MDACE dataset under the 2025 ICD-10-CM/PCS guidelines, adding code sequencing and justification comments. On MDACE, RAG-Coding outperforms the best LLM-based baseline by 3-13% in micro-F1 across five LLM backbones, and achieves comparable micro- and macro-F1 to the supervised state-of-the-art, with higher recall (+11%) at the cost of precision (-6%). On MDACE-2025, RAG-Coding outperforms all baselines, demonstrating effective generalisation to updated guidelines. Ablations confirm stepwise gains, highlighting the importance of integrating structured external knowledge for LLM-based medical coding.
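The abstract reports both micro- and macro-F1, which can diverge sharply when some codes are rare. The sketch below computes both for multi-label code assignments. The three documents and their code sets are invented for illustration, and averaging macro-F1 over every code that appears in the gold or predicted sets is an assumption about the evaluation protocol.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(gold, pred):
    """gold, pred: lists of code sets, one set per document."""
    labels = sorted(set().union(*gold, *pred))
    counts = {code: [0, 0, 0] for code in labels}  # tp, fp, fn per code
    for g, p in zip(gold, pred):
        for code in labels:
            if code in g and code in p:
                counts[code][0] += 1
            elif code in p:
                counts[code][1] += 1
            elif code in g:
                counts[code][2] += 1
    tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
    micro = f1(tp, fp, fn)                                       # pools all decisions
    macro = sum(f1(*c) for c in counts.values()) / len(labels)  # averages per code
    return micro, macro

# Illustrative data only: three documents with ICD-10-CM style codes.
gold = [{"I10", "E11.9"}, {"I10"}, {"J45.909"}]
pred = [{"I10"}, {"I10", "E11.9"}, {"J45.909"}]
micro, macro = micro_macro_f1(gold, pred)
```

Here micro-F1 is 0.75 while macro-F1 is about 0.67, because the one code that is always wrong counts for a third of the macro average but only a quarter of the pooled decisions.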

2605.27276 2026-05-29 cs.AI cs.CL

SIA: Self Improving AI with Harness & Weight Updates

SIA: 具有框架与权重更新的自我改进AI

Prannay Hebbar, Yogendra Manawat, Samuel Verboomen, Alesia Ivanova, Selvam Palanimalai, Kunal Bhatia, Vignesh Baskaran

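AI summary Proposes the SIA framework, in which a Feedback-Agent updates both the harness and the weights of a task agent, outperforming harness-only iteration in three domains (Chinese legal charge classification, GPU kernel optimisation, and single-cell RNA denoising).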

详情
AI中文摘要

人类是构建和改进AI的瓶颈。无论是模型还是封装它们的智能体,都是由人类编写、调整和纠正的。一个能够自我改进的AI的长期目标仍然未实现。两条大致独立的研究路线试图解决这一瓶颈。框架更新学派让元智能体重写任务特定智能体的框架(其工具、提示、重试逻辑和搜索过程),而模型权重保持不变。测试时训练学派使用手写的强化学习流程,根据任务反馈更新模型自身的权重,而框架保持不变。这两个孤岛独立运作。我们提出SIA,一个自我改进循环,其中语言模型智能体(反馈智能体)同时更新任务特定智能体的框架和权重。我们在三个对比领域进行评估:中国法律罪名分类、底层GPU内核优化和单细胞RNA去噪。结合两个杠杆在所有三个基准上均优于仅迭代框架。SIA-W+H在LawBench上比先前SOTA高出25.1%,GPU内核比先前SOTA快12.4%(1017 vs 1161 μs),去噪性能比先前SOTA高出20.4%。框架更新使模型具有智能体性,塑造其搜索和行动方式,而权重更新构建了任何提示或框架都无法灌输的领域直觉。

英文摘要

Humans are the bottleneck in building and improving AI. Both the models and the agents that wrap them are written, tuned, and corrected by people. The long-horizon goal of an AI that can figure out how to improve itself remains open. Two largely disjoint research lines attack this bottleneck. The harness-update school has a meta-agent rewrite the scaffold of a task-specific agent (its tools, prompts, retry logic, and search procedure) while the model weights are held fixed. The test-time training school uses hand-written RL pipelines to update the model's own weights on task feedback while the harness is held fixed. These two silos operate in isolation. We propose SIA, a self-improving loop in which a language-model agent (the Feedback-Agent) updates both the harness and the weights of a task-specific agent. We evaluate across three contrasting domains: Chinese legal charge classification, low-level GPU kernel optimisation, and single-cell RNA denoising. Combining both levers outperforms scaffold iteration alone on all three benchmarks. SIA-W+H achieves 25.1% over prior SOTA on LawBench, 12.4% faster GPU kernels than prior SOTA (1,017 vs 1,161 μs), and 20.4% over prior SOTA on denoising. Harness updates make the model agentic, shaping how it searches and acts, while weight updates build the domain intuition that no prompt or scaffold can instil.
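As a quick check on the GPU kernel figure, the quoted 12.4% is consistent with a relative reduction in runtime computed from the two latencies in the abstract. The abstract does not say which convention is used, so reading it as a runtime reduction is an assumption.

```latex
\[
\frac{1161 - 1017}{1161} = \frac{144}{1161} \approx 0.124
\qquad\text{whereas}\qquad
\frac{1161}{1017} - 1 = \frac{144}{1017} \approx 0.142 .
\]
```

The first expression (time saved relative to the prior SOTA) reproduces 12.4%. The second (throughput gain) would give about 14.2%, so the reported number matches the first reading.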

2605.26994 2026-05-29 cs.CV

ChartAct: A Benchmark for Dynamic Chart Understanding

ChartAct: 动态图表理解基准

Muye Huang, Lin Wu, Lingling Zhang, Hang Yan, Zhiyuan Wang, Yumeng Fu, Zesheng Yang, Jun Liu

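AI summary Proposes the ChartAct benchmark, which collects 673 dynamic charts and 1,440 question-answer samples to evaluate multimodal models on interactive chart understanding, and finds that existing models remain limited.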

详情
AI中文摘要

图表广泛用于呈现复杂数据以支持分析和决策。现有的图表理解基准主要关注静态图表,但现实中的图表通常是动态且可交互的。关键信息可能仅在悬停、点击、缩放或拖拽等操作后出现。因此,动态图表理解要求模型识别可见内容、选择合适的交互方式,并在变化的图表状态中进行推理。为了评估这一能力,我们提出了ChartAct,一个用于动态图表理解的交互式基准。ChartAct从8个真实图表网站收集并筛选了673个动态图表,涵盖7种常见图表类型,并构建了1440个高质量问答样本。每个样本在两个环境(动态图表和仪表板图表)中实例化,以评估不同上下文下的动态图表理解能力。基于ChartAct,我们系统评估了11个先进的多模态模型和GUI智能体。实验结果表明,现有模型在动态图表理解方面仍存在明显局限。最强的模型Claude-Opus-4.7达到了84.5%的平均成功率,而大多数模型仍低于60%。我们还进行了详细的失败归因和案例分析。ChartAct为研究真实交互环境中的图表理解提供了新的基准。代码见https://github.com/wulin-wulin/OSWorld_Chart。

英文摘要

Charts are widely used to present complex data for analysis and decision making. Existing chart understanding benchmarks mainly focus on static charts, but real-world charts are often dynamic and interactive. Key information may only appear after actions such as hovering, clicking, zooming, or dragging. Dynamic chart understanding therefore requires models to identify visible content, choose proper interactions, and reason over changing chart states. To evaluate this ability, we propose ChartAct, an interactive benchmark for dynamic chart understanding. ChartAct collects and filters 673 dynamic charts from 8 real chart websites, covers 7 common chart types, and constructs 1,440 high-quality question-answer samples. Each sample is instantiated in two environments, Dynamic Chart and Dashboard Chart, to evaluate dynamic chart understanding under different contexts. Based on ChartAct, we systematically evaluate 11 advanced multimodal models and GUI agents. Experimental results show that existing models still have clear limitations in dynamic chart understanding. The strongest model, Claude-Opus-4.7, achieves an average success rate of 84.5%, while most models remain below 60%. We also conduct detailed failure attribution and case analysis. ChartAct provides a new benchmark for studying chart understanding in real interactive environments. Code is available at https://github.com/wulin-wulin/OSWorld_Chart.

2605.26755 2026-05-29 cs.CL

SEEK: Semantic Evidence Extraction via Adaptive ChunKing for Multilingual Fact-Checking

SEEK: 通过自适应分块进行多语言事实核查的语义证据提取

Babu Kumar, Gaurav Kumar, Ayush Garg, Aditya Kishore, Jasabanta Patro

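AI summary Proposes the SEEK framework, which builds coherent evidence chunks through adaptive semantic chunking and fine-tunes multilingual LLMs, improving macro-F1 by up to 20% in multilingual fact-checking.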

详情
AI中文摘要

多语言事实核查需要既相关又足够完整的证据,以实现可靠的事实性预测。然而,现有系统通常依赖搜索片段、句子级证据或局部切分的段落,这可能会遗漏关键上下文并产生碎片化的证据。为克服这些限制,我们提出SEEK,一种自适应分块的语义证据提取框架,通过识别语义主题转换并保留局部验证上下文,从完整的事实核查文章中构建连贯的证据块。构建的块使用多语言编码器进行编码,然后使用LoRA适配器微调多语言大模型进行真实性预测。在X-FACT和RU22Fact上的实验表明,与语义分块相比,SEEK将宏F1提高了最多10%,与句子分块相比提高了19%,与搜索片段基线相比提高了20%。证据完整性和显著性分析进一步表明,SEEK保留了更丰富的验证上下文,并实现了更可靠的多语言事实核查。

英文摘要

Multilingual fact verification requires evidence that is both relevant and sufficiently complete for reliable factuality prediction. However, existing systems often rely on search snippets, sentence-level evidence, or locally segmented passages, which can miss decisive context and produce fragmented evidence. To overcome these limitations, we propose SEEK (Semantic Evidence Extraction via adaptive chunKing), a framework that constructs coherent evidence chunks from full fact-checking articles by identifying semantic topic transitions and preserving local verification context. The constructed chunks are encoded using a multilingual encoder, and multilingual LLMs are then fine-tuned with LoRA adapters for veracity prediction. Experiments on X-FACT and RU22Fact show that SEEK improves macro-F1 by up to 10% over semantic chunking, 19% over sentence chunking, and 20% over search-snippet baselines. Evidence completeness and significance analyses further show that SEEK preserves richer verification context and enables more reliable multilingual fact-checking.
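Splitting an article at semantic topic transitions can be sketched as follows. This shows the generic similarity-threshold form of semantic chunking, not SEEK's adaptive procedure, whose details the abstract does not give. The two-dimensional "embeddings", the sentences, and the threshold of 0.5 are placeholders for a real multilingual encoder and a tuned cut-off.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def chunk_by_topic_shift(sentences, embeddings, threshold=0.5):
    """Start a new chunk whenever adjacent sentences fall below `threshold` similarity."""
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine(embeddings[i - 1], embeddings[i]) < threshold:
            chunks.append([])  # topic transition detected
        chunks[-1].append(sentences[i])
    return chunks

sentences = ["Claim about vaccines.", "Study details.", "Election turnout.", "Turnout sources."]
embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
chunks = chunk_by_topic_shift(sentences, embeddings)
```

The first two sentences and the last two end up in separate chunks, because the similarity between the second and third sentence embeddings is about 0.11, well under the threshold.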

2605.26428 2026-05-29 cs.CL cs.HC

Slide Deck Q&A Quality Assurance App: A Multi-Stage Pipeline for Pedagogical Question Generation

Slide Deck Q&A 质量保证应用:面向教学问题生成的多阶段流水线

Jim Salsman

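AI summary Proposes a Flask-based multi-stage LLM pipeline that extracts text and images from PDF slides and generates structured sets of pedagogical questions, improving question quality through four stages: window planning, deck synthesis, slide annotation, and reconciliation.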

Comments 15 pages, 3 research questions, 1 figure, 1 table, 6 references, 2 appendices

详情
AI中文摘要

从讲座幻灯片中生成高质量、具有教学意义的问题是困难的,因为重要的教学内容分布在文本和视觉元素中,而且有用的问题必须根据演示流程进行搭建,而不是孤立地逐张幻灯片生成。本文描述了Slide Deck Q&A质量保证(slidesqaqa),一个基于Flask的软件系统,它从PDF幻灯片中提取文本和渲染图像,并通过一个四阶段的大语言模型流水线进行处理,包括窗口规划、幻灯片合成、标注和协调。该系统联合考虑幻灯片模态和教学角色,分配有限的问题预算,并在幻灯片组级别修订草稿标注以减少冗余并提高覆盖率。最终输出是一个结构化的JSON标注,包含幻灯片组级目标、章节结构、幻灯片级摘要、问题集和评估分数。在两个技术讲座幻灯片上的初步实验表明,该流水线可以过滤非教学幻灯片,并为视觉复杂的内容生成高保真、教学连贯的问题。

英文摘要

Generating high-quality, pedagogically useful questions from lecture slide decks is difficult because important instructional content is distributed across both text and visual elements, and because useful questions must be scaffolded across the flow of a presentation rather than generated slide by slide in isolation. This paper describes Slide Deck Q&A Quality Assurance (slidesqaqa), a Flask-based software system that extracts text and rendered images from PDF slides and processes them through a four-stage large language model pipeline comprising window planning, deck synthesis, slide annotation, and reconciliation. The system reasons jointly about slide modality and pedagogical role, allocates bounded question budgets, and revises draft annotations at the deck level to reduce redundancy and improve coverage. The final output is a structured JSON annotation containing deck-level goals, section structure, slide-level summaries, question sets, and evaluation scores. Initial experiments on two technical lecture decks indicate that the pipeline can filter non-instructional slides and produce high-fidelity, pedagogically coherent questions for visually complex content. The working system is at https://slidesqaqa-974767694043.us-west1.run.app and the software repository is at https://github.com/blinding2submit/slidesqaqa.

2605.26408 2026-05-29 cs.LG stat.ME stat.ML

Function-Valued Causal Influence in Nonlinear Time Series

非线性时间序列中的函数值因果影响

Valentina V. Kuskova, Dmitry Zaytsev, Michael Coppedge

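AI summary Observing that the scalar scores commonly used in nonlinear time-series causal discovery obscure state-dependent functional effects, proposes a framework based on Individual Conditional Expectation that estimates causal response functions directly from Neural Additive Vector Autoregression models, revealing diverse functional behaviors that scalar scores cannot distinguish.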

Comments 26 pages, 6 tables, 8 figures

详情
AI中文摘要

时间序列中的因果发现越来越多地使用非线性机器学习模型进行,但由此产生的因果关系几乎总是通过标量边评分来总结。我们认为,这种做法掩盖了非线性自回归模型真正学习到的对象:一个状态依赖的函数,其效应随机制、幅度和上下文而变化。我们形式化了加性、贡献可分解架构的函数值因果影响,并表明标量因果评分构成了严重的信息瓶颈,将状态间变化与状态内残差噪声混为一谈。以神经加性向量自回归作为代表性架构,我们引入了一个基于个体条件期望的实用框架,直接从训练好的模型估计因果响应函数。通过受控的合成实验,我们证明了具有无法区分的标量评分的边可以表现出定性的不同函数行为,包括单调、阈值、饱和和符号变化效应。一个关于民主发展的应用案例进一步表明,函数值分析揭示了以评分为中心的方法系统性遗漏的特定于机制和非对称的因果结构。

英文摘要

Causal discovery in time series is increasingly performed using nonlinear machine-learning models, yet the resulting causal relationships are almost always summarized by scalar edge scores. We argue that this practice obscures the true object learned by nonlinear autoregressive models: a state-dependent function whose effect varies across regimes, magnitudes, and contexts. We formalize function-valued causal influence for additive, contribution-decomposable architectures and show that scalar causal scores constitute a severe information bottleneck, conflating between-state variation with within-state residual noise. Using Neural Additive Vector Autoregression as a representative architecture, we introduce a practical framework based on Individual Conditional Expectation for estimating causal response functions directly from trained models. Through controlled synthetic experiments, we demonstrate that edges with indistinguishable scalar scores can exhibit qualitatively different functional behaviors, including monotonic, thresholded, saturating, and sign-changing effects. An applied case study on democratic development further shows that function-valued analysis reveals regime-specific and asymmetric causal structure systematically missed by score-centric approaches.
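Individual Conditional Expectation (ICE) curves, which the framework builds on, are computed by sweeping one input over a grid while holding each sample's other inputs fixed. The sketch below shows the generic procedure on a toy predictor. `toy_model` is a hypothetical stand-in for a trained model, not the Neural Additive Vector Autoregression used in the paper.

```python
import numpy as np

def ice_curves(predict, X, feature, grid):
    """One curve per row of X: predictions as `feature` sweeps over `grid`."""
    curves = np.empty((X.shape[0], len(grid)))
    for j, value in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, feature] = value  # overwrite only the feature of interest
        curves[:, j] = predict(X_mod)
    return curves

def toy_model(X):
    # Thresholded effect of feature 0 plus a linear effect of feature 1.
    return np.maximum(X[:, 0] - 1.0, 0.0) + 0.5 * X[:, 1]

X = np.array([[0.0, 0.0], [0.0, 2.0]])
grid = np.array([0.0, 1.0, 2.0, 3.0])
curves = ice_curves(toy_model, X, feature=0, grid=grid)
```

Both curves stay flat up to a grid value of 1 and rise afterwards, which exposes the threshold effect of feature 0. A single scalar score, such as the average slope, would hide that shape.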

2605.26194 2026-05-29 cs.LG

On the Role of Inductive Bias in Time-Series Pretraining: A Case Study in Learning Generalizable Representations for Clinical Time Series

论归纳偏置在时间序列预训练中的作用:以临床时间序列学习通用表征的案例研究

Sharmita Dey, Diego Paez-Granados

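AI summary Using PathoFM, an encoder-centric transformer pretrained with three complementary objectives (Local Completion, Temporal Continuity, and Unsupervised In-Context Dynamics), studies how the inductive biases in pretraining objectives affect transfer across task types and subjects, and finds that dynamics-centric mixtures yield the most balanced transfer.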

详情
AI中文摘要

临床时间序列学习通常受限于小规模、异质性队列和协议漂移,而其下游应用涵盖分类(如病理诊断)和回归(如时间预测)。这些限制使得基础模型预训练具有吸引力,但提出了一个重要问题:预训练目标应施加何种归纳偏置,以使表征能够跨任务类型和受试者迁移。我们通过PathoFM研究脊髓损伤(SCI)的病理步态分析,PathoFM是一种以编码器为中心的Transformer,在多元步态窗口上使用三种互补目标进行预训练:局部补全(重建连续的掩码跨度以强制局部结构)、时间连续性(从观察到的前缀预测掩码的中期延续以强制平滑性和因果一致性)以及无监督上下文动力学(通过注意力基于受试者示例窗口进行支持-查询重建)。通过经验比较目标族(分组/对比、基于动力学和生成式重建),我们发现以动力学为中心的混合目标产生最平衡的迁移:分组目标有利于判别边界,但可能降低连续目标所需的幅度保真度,而仅重建目标保留波形结构但在分类上可能表现不佳。总体而言,将局部重建与时间连续性相结合,并在可获取示例时添加上下文条件,可产生稳健的受试者泛化表征。

英文摘要

Clinical time-series learning is routinely constrained by small, heterogeneous cohorts and protocol drift, while its downstream use spans both classification (e.g., pathology diagnosis) and regression (e.g., temporal forecasting). These constraints make foundation-model pretraining appealing but raise an important question: which inductive biases should the pretraining objective impose so that representations transfer across task types and subjects? We study this question in pathological gait analysis for spinal cord injury (SCI) via PathoFM, an encoder-centric transformer pretrained on multivariate gait windows with three complementary objectives: Local Completion (reconstruct contiguous masked spans to enforce local structure), Temporal Continuity (predict a masked mid-horizon continuation from an observed prefix to enforce smoothness and causal consistency), and Unsupervised In-Context Dynamics (support-query reconstruction conditioned on subject exemplar windows via attention). Empirically comparing objective families (grouping/contrastive, dynamics-based, and generative reconstruction), we find that dynamics-centric mixtures produce the most balanced transfer: grouping objectives favor discriminative margins but can degrade the magnitude fidelity needed for continuous targets, whereas reconstruction-only objectives preserve waveform structure but may underperform on classification. Overall, combining local reconstruction with temporal continuity, and adding in-context conditioning when exemplar access is realistic, yields robust subject-generalizing representations.

2605.26193 2026-05-29 cs.LG cs.AI

Bridging Classification and Reconstruction: Cooperative Time Series Anomaly Detection

桥接分类与重建:协同时间序列异常检测

Qideng Tang, Dai Chaofan, Wubin Ma, Yahui Wu, Haohao Zhou, Tao Zhang, Huan Li, Dalin Zhang

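AI summary Proposes the CoAD framework, in which a classification module generates probability-informed soft masks that guide a reconstruction module, combining the complementary strengths of the classification and reconstruction paradigms to detect subtle, complex anomalies and significantly outperform existing methods on benchmark datasets.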

Comments 15 pages, submitted to KDD 2026

详情
AI中文摘要

时间序列异常检测(TSAD)因其广泛应用而长期成为数据挖掘领域的热门研究课题。最近的研究挑战了流行的深度学习方法在TSAD中的有效性,指出它们无法检测细微和持久的异常。异常暴露(OE)和掩码自编码器(MAE)作为两种有前景的范式(分类和重建)出现,用于解决上述问题。然而,基于OE的方法受限于泛化能力差,而基于MAE的方法受限于掩码错位问题。为了解决这些局限性,本文提出了一种新颖的框架CoAD,该框架统一了两种范式,以利用它们的互补优势,同时减轻各自的弱点。在该框架中,分类模块为重建模块生成概率信息软掩码,这反过来又缓解了分类模块的泛化问题。这种协同设计使CoAD能够有效检测现有方法常常忽略的细微和复杂异常。此外,分类模块经过精心设计,以解决分类粒度不当和忽视频率信息的问题。在高质量基准数据集上,按照严格的评估协议进行的大量实验表明,CoAD显著优于最先进的深度学习和传统数据挖掘方法,突显了深度学习在TSAD中的潜力。此外,CoAD轻量级且速度远快于现有SOTA方法,展示了其在大规模实时应用中的实用价值。

英文摘要

Time series anomaly detection (TSAD) has long been a hot research topic in data mining due to its various applications. Recent studies challenge the effectiveness of popular deep learning methods for TSAD, suggesting their failure in detecting subtle and prolonged anomalies. Outlier Exposure (OE) and Masked Autoencoder (MAE) emerge as two promising paradigms (classification and reconstruction) for solving the above problems. However, OE-based methods are constrained by poor generalization, while MAE-based methods are limited by masking misalignment issues. To address these limitations, this paper proposes a novel framework, CoAD, which unifies the two paradigms to leverage their complementary strengths while mitigating their respective weaknesses. In this framework, the classification module generates probability-informed soft masks for the reconstruction module, which in turn alleviates the generalization problem of the classification module. This cooperative design enables CoAD to effectively detect subtle and complex anomalies that are often overlooked by existing methods. Additionally, the classification module is carefully designed to resolve issues related to improper classification granularity and the neglect of frequency information. Extensive experiments on high-quality benchmark datasets, conducted under rigorous evaluation protocols, demonstrate that CoAD significantly outperforms both state-of-the-art deep learning and traditional data mining methods, highlighting the potential of deep learning in TSAD. Moreover, CoAD is lightweight and substantially faster than existing SOTA methods, demonstrating its practical value for large-scale, real-time applications.

2605.26029 2026-05-29 cs.AI cs.CL

CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists

CausaLab:面向AI科学家的交互式因果发现可扩展环境

Junlin Yang, Dylan Zhang, Xiangchen Song, Qirun Dai, Xiao Liu, Yuen Chen, Aniket Vashishtha, Jing Shi, Chenhao Tan, Hao Peng

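AI summary Proposes the CausaLab environment, which uses synthetic laboratory tasks to evaluate both the predictive accuracy of LLM agents in causal discovery and their ability to recover causal mechanisms, and finds a substantial gap between the two.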

详情
AI中文摘要

我们介绍了CausaLab,一个用于评估LLM代理进行交互式因果发现的可扩展环境。与先前的评估不同,CausaLab既评估代理是否能够使用因果证据解决问题,也评估其答案是否基于忠实恢复的因果机制。每个回合将代理置于一个合成实验室中:它接收先前的测量记录,对操纵器晶体进行干预,并预测由相同机制控制的保留反应器晶体的共振频率。隐藏的数据生成过程是一个随机采样的结构因果模型(SCM),因此成功需要恢复因果图和结构方程,而不是回忆先验知识。实验表明,预测和机制恢复之间存在持续差距:在纯观测的6节点设置中,GPT-5.2-high达到92%的任务准确率,但全边$F_1$仅为0.471。混合观测-干预策略提高了结构保真度,而纯干预即使对强代理仍然困难。我们确定过早停止是一个主要弱点,并表明一致性验证可以缓解它。因此,CausaLab将预测成功与因果理解分开,并揭示了当前LLM代理作为实验因果推理者的局限性。

英文摘要

We introduce CausaLab, a scalable environment for evaluating interactive causal discovery by LLM agents. Unlike prior evaluations, CausaLab evaluates both whether an agent can solve a problem using causal evidence and whether its answer is grounded in a faithful recovered causal mechanism. Each episode places an agent in a synthetic laboratory: it receives prior measurement records, intervenes on a manipulator crystal, and predicts the resonance frequency of a held-out reactor crystal governed by the same mechanism. The hidden data-generating process is a randomly sampled structural causal model (SCM), so success requires recovering both a causal graph and structural equations rather than recalling prior knowledge. Experiments show a persistent gap between prediction and mechanism recovery: in the purely observational 6-node setting, GPT-5.2-high reaches 92% task accuracy but only 0.471 all-edge $F_1$. Mixed observation-intervention strategies improve structural fidelity, while pure intervention remains difficult even for strong agents. We identify premature stopping as a major weakness and show that consistency verification mitigates it. CausaLab therefore separates predictive success from causal understanding and exposes current LLM agents' limits as experimental causal reasoners.
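The gap between task accuracy and structural fidelity is easier to read with a concrete edge-level F1 in mind. The sketch below scores a recovered directed graph against the true one. The abstract does not define its all-edge F1, so treating it as exact-match F1 over directed edges is an assumption, and the three-node graphs are invented.

```python
def edge_f1(true_edges, pred_edges):
    """F1 over directed edges; a reversed edge counts as both a miss and a false alarm."""
    true_edges, pred_edges = set(true_edges), set(pred_edges)
    tp = len(true_edges & pred_edges)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_edges)
    recall = tp / len(true_edges)
    return 2 * precision * recall / (precision + recall)

true_graph = {("A", "B"), ("B", "C"), ("A", "C")}
recovered = {("A", "B"), ("C", "B")}  # one correct edge, one reversed, one missing
score = edge_f1(true_graph, recovered)
```

Here precision is 1/2 and recall is 1/3, giving an F1 of 0.4. An agent could still predict a downstream variable well from such a graph, which is the kind of mismatch the benchmark is designed to expose.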