arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday
2605.27874 2026-05-28 cs.CL

Syllabic-Structure Decoder for Automatic Speech Recognition in Vietnamese

Nghia Hieu Nguyen, Quan Ngoc Hoang, Long Hoang Huu Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

Affiliations: Faculty of Information Science and Engineering; Faculty of Computer Science; University of Information Technology; Vietnam National University, Ho Chi Minh City

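AI summary: For Vietnamese automatic speech recognition, proposes decoding syllabic structure at the phoneme level. By explicitly modeling the phonological composition of syllables, the decoder generates valid syllabic structures from a compact phoneme inventory, substantially reducing vocabulary size and outperforming strong baselines on two benchmarks.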

Abstract

Most Automatic Speech Recognition (ASR) systems formulate transcription as a prediction problem over orthographic units such as characters, subwords, or words. Although effective, such representations do not explicitly reflect the phonetic structure of speech and often require large vocabularies to maintain adequate coverage. In this work, motivated by the phonemic features of Vietnamese, we propose a Syllabic-Structure Decoder for ASR, which models speech at the phoneme level instead of the orthographic level. Our approach explicitly captures the phonological composition of syllables, enabling the decoder to generate valid syllabic structures from a compact phonemic inventory. This design more closely aligns with the phonetic realization of speech while significantly reducing vocabulary size. Experimental results on two benchmarks (LSVSC, representing standard speech, and UIT-ViMD, a multi-dialect corpus containing diverse regional pronunciations) show that our method consistently outperforms strong previous baselines, especially pretrained baselines such as PhoWhisper and Wav2Vec2, despite using a substantially smaller vocabulary and no additional training resources. These results highlight the effectiveness of phoneme-based syllabic modeling for ASR in this language. Code for experimental reproducibility will be made publicly available upon acceptance of this paper.

2605.27873 2026-05-28 cs.AI

AIBuildAI-2: A Knowledge-Enhanced Agent for Automatically Building AI Models

Ruiyi Zhang, Peijia Qin, Qi Cao, Li Zhang, Pengtao Xie

Affiliations: Department of Electrical and Computer Engineering, University of California San Diego; Department of Medicine, University of California San Diego

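AI summary: Existing agents for automatically building AI models are limited by their reliance on the static parametric knowledge of large language models. AIBuildAI-2 adds a hierarchical, evolving external knowledge system and dynamically loads relevant context so that design decisions are grounded in expert knowledge. It achieves a 70.7% medal rate on MLE-Bench and places in the top 6.6% in a heart disease prediction competition.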

Abstract

AI models underpin data-centric applications from image and text processing to scientific discovery in biology, physics, and chemistry. Yet developing them remains heavily manual, requiring practitioners to design architectures, build training pipelines, and iteratively refine solutions, making it challenging for natural scientists without specialized AI engineering expertise to build the high-performing models their research demands. To reduce this burden and broaden access to AI for scientific discovery, agents that automatically build AI models have been proposed. However, the performance of these agents is largely limited by the parametric knowledge of their underlying large language models, which is static, often outdated, and sparse on practical AI model engineering know-how. To address this limitation, we introduce AIBuildAI-2, a knowledge-enhanced agent with an external, evolving knowledge system for automatically building AI models. The knowledge system of AIBuildAI-2 is hierarchical, organizing curated AI development knowledge into high-level knowledge instructions over topical categories and low-level knowledge documents under each category, from which the agent dynamically loads only the context relevant to its current state and the AI task being solved, grounding each design and implementation decision in concrete, externally verifiable expertise. The system is initialized by collecting and cleaning AI-development-related documents from the web and organizing them into the corresponding categories, and continually evolves from the agent's own experience by distilling each completed run on an AI task into structured takeaways that are written back into the knowledge system. AIBuildAI-2 achieves state-of-the-art results, ranking first on MLE-Bench with a 70.7% medal rate and placing in the top 6.6% among 4,370 human-expert teams in a heart disease prediction competition.
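
The hierarchical knowledge system described in this abstract can be pictured with a small sketch. Everything here is an assumption made for illustration: the class names `Category` and `KnowledgeSystem`, the keyword-based category matching, and the `write_back` method are hypothetical and are not taken from AIBuildAI-2.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Category:
    """One topical category: a high-level instruction plus low-level documents."""
    instruction: str
    documents: Dict[str, str] = field(default_factory=dict)  # title -> text


@dataclass
class KnowledgeSystem:
    """Hypothetical two-level store; not the AIBuildAI-2 implementation."""
    categories: Dict[str, Category] = field(default_factory=dict)

    def load_context(self, task_keywords: List[str]) -> List[str]:
        """Return only the instructions and documents of matching categories."""
        wanted = {keyword.lower() for keyword in task_keywords}
        context: List[str] = []
        for name, category in self.categories.items():
            if name.lower() in wanted:
                context.append(category.instruction)
                context.extend(category.documents.values())
        return context

    def write_back(self, category_name: str, title: str, takeaway: str) -> None:
        """Store a takeaway from a finished run under its category."""
        category = self.categories.setdefault(category_name, Category(instruction=""))
        category.documents[title] = takeaway
```

A real system would presumably select categories with something richer than exact keyword matches; the point of the sketch is only that context is loaded per category and that completed runs feed new documents back into the store.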

2605.27865 2026-05-28 cs.CL

MERIT: Matching Expertise via Rubric-Informed Training for Reviewer Assignment

Zixuan Yang, Yibo Zhao, Weicong Liu, Xiang Li

Affiliations: School of Data Science and Engineering, East China Normal University

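AI summary: Proposes MERIT, a two-stage framework that trains a reviewer assessor with reinforcement learning and distills it into a retriever, enabling expertise matching for large-scale reviewer assignment.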

Comments: 22 pages, 8 figures, 12 tables

Abstract

Matching submissions with suitable reviewers at scale is a growing challenge for major venues, yet existing approaches either rely on coarse proxy signals that conflate general relatedness with true suitability, or require expensive human annotations that are difficult to scale for training. We propose MERIT, a two-stage framework that bridges this gap by converting criterion-level expertise matching into scalable suitability supervision. In the first stage, we train a reviewer assessor via reinforcement learning to identify the expertise dimensions a paper requires, match them against the reviewer's prior work, and produce a suitability decision, with rewards provided by an LLM judge guided by paper-specific expertise rubrics. In the second stage, we distill the assessor's predictions into an embedding-based retriever for efficient large-scale assignment. Experiments show that our 4B reviewer assessor outperforms larger general-purpose LLMs on suitability classification, and the resulting retriever achieves state-of-the-art performance across LR-Bench and the CMU Gold dataset. Our code is available at https://github.com/Luli3220/MERIT.
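
The second stage, an embedding-based retriever, can be illustrated with a generic ranking sketch. It assumes paper and reviewer embeddings already exist as plain vectors and ranks reviewers by cosine similarity; the function names are hypothetical, and the abstract does not say which similarity function or model MERIT uses.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_reviewers(paper_vec: Sequence[float],
                   reviewer_vecs: Dict[str, Sequence[float]],
                   top_k: int = 3) -> List[str]:
    """Return the names of the top_k reviewers most similar to the paper."""
    ranked = sorted(reviewer_vecs.items(),
                    key=lambda item: cosine(paper_vec, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_k]]
```

Scoring by vector similarity is what makes large-scale assignment cheap: reviewer embeddings can be computed once and reused for every submission.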

2605.27860 2026-05-28 cs.AI

C-MIG: Multi-view Information Gain-based Retrieval-Augmented Generation for Clinical Diagnosis Reasoning

Yuwei Miao, Gen Li, Yunsheng Zeng, Xiandong Li, Yujin Wang, Siyu Chen, Luning Wang, Yunhao Qiao, Junfeng Wang, Jianwei Lv, Bo Yuan

Affiliations: Baidu Inc

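AI summary: Proposes the C-MIG framework, which uses multi-view information gain and a multi-subquery retrieval augmentation strategy to address reward-signal loss and the supervision of heterogeneous reasoning in retrieval-augmented generation, achieving the best performance among RAG-RL methods on clinical diagnosis tasks.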

Abstract

Retrieval-augmented generation combined with reinforcement learning has shown promise for grounding large language models in trustworthy medical evidence. However, existing methods rely on exact-match binary rewards, which in clinical diagnosis cause two issues: (i) semantically relevant but non-verbatim steps receive zero signal, discarding valuable learning signals; and (ii) uni-dimensional rewards cannot effectively supervise heterogeneous reasoning capabilities. To address these issues, we propose C-MIG, a Multi-view Information Gain-based retrieval-augmented generation framework for Clinical diagnosis. C-MIG estimates information gain under a frozen reference model from two complementary views, retrieved-document and document-refinement, to jointly guide what to retrieve and how to refine, alleviating the issues of valuable reward signal loss and credit assignment. We further design a multi-subquery retrieval augmentation strategy that improves knowledge recall coverage in clinical diagnostic scenarios. Comprehensive experiments on four medical benchmarks demonstrate that C-MIG achieves the best performance among all RAG-RL methods on both in-domain and out-of-domain sets, and outperforms state-of-the-art general-purpose LLMs for clinical diagnosis.
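
One common way to estimate information gain under a frozen reference model is the change in the log-likelihood of the reference answer when evidence is added to the context. The sketch below shows that generic form only. Whether C-MIG uses exactly this estimator is not stated in the abstract, and the function names and the token-log-probability inputs are assumptions.

```python
from typing import Sequence


def sequence_logprob(token_logprobs: Sequence[float]) -> float:
    """Log-probability of an answer as the sum of its token log-probabilities."""
    return sum(token_logprobs)


def information_gain(with_evidence: Sequence[float],
                     without_evidence: Sequence[float]) -> float:
    """Gain in answer log-likelihood from adding evidence to the context.

    Both arguments are token log-probabilities of the same reference answer,
    scored by the same frozen model with and without the evidence.
    """
    return sequence_logprob(with_evidence) - sequence_logprob(without_evidence)
```

Unlike an exact-match binary reward, this quantity is graded: evidence that makes the reference answer only somewhat more likely still yields a positive signal.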

2605.27858 2026-05-28 cs.CL cs.AI cs.LG

DecomposeRL: Learning to Ask Useful, Informative, and Diverse Questions for Semi-Supervised, Traceable Claim Verification

Shubhashis Roy Dipta, Ankur Padia, Francis Ferraro

Affiliations: Department of Computer Science and Electrical Engineering

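AI summary: Proposes DecomposeRL, which uses GRPO and a multi-faceted reward ensemble to decompose claims into traceable sub-questions. It achieves high accuracy in fully supervised and semi-supervised settings and matches larger models despite being 4x smaller.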

Abstract

Claim verification splits between end-to-end classifiers that are accurate but yield no inspectable traces, and decomposition-based methods that produce inspectable traces but lag in performance on benchmark datasets. We propose DecomposeRL, an accurate claim verifier that produces inspectable traces. DecomposeRL frames decomposition as an RL policy trained with GRPO and a multi-faceted reward ensemble, enabling both fully supervised and semi-supervised learning from unlabeled claims. DecomposeRL addresses the prohibitive training cost of GRPO with a data-curation funnel that distills 115K fact-verification claims into a compact, learning-signal-dense subset of 5K claims. We show that a DecomposeRL-7B policy trained with full supervision on only ~5K curated claims achieves 86.3 in-domain and 69.8 out-of-domain balanced accuracy across 11 claim-verification benchmarks containing biomedical, political, scientific, and general-domain claims. Despite being 4x smaller, it matches 32B baselines and GPT-4.1-mini, and it further outperforms baselines in a semi-supervised setting with only 10% labeled claims data. Code, data, and models are available at https://dipta007.github.io/DecomposeRL
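
The abstract reports balanced accuracy, which is conventionally the mean of per-class recall. A minimal sketch of that standard definition follows; the function name and label values are illustrative, and the paper's exact evaluation code is not shown in the source.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def balanced_accuracy(y_true: Sequence[Hashable],
                      y_pred: Sequence[Hashable]) -> float:
    """Mean recall over the classes that appear in y_true."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    total = defaultdict(int)    # examples per true class
    correct = defaultdict(int)  # correctly predicted examples per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)
```

On a skewed benchmark the metric penalizes a verifier that always predicts the majority label: with four supported claims and one refuted claim, predicting "supported" everywhere gives 0.8 plain accuracy but only 0.5 balanced accuracy.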

2605.27853 2026-05-28 cs.AI

MolLingo: Molecule-Native Representations for LLM-Powered Scientific Agents

Thao Nguyen, Heng Ji

Affiliations: Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign

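AI summary: Proposes MolLingo, a multi-agent system that coordinates literature, chemist, and orchestrator agents through shared memory. Combined with the BRICS-based Fragment Enumeration (BFE) representation, it enables block-level molecular reasoning and editing and outperforms frontier LLMs and specialized baselines on four benchmarks.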

Abstract

We present MolLingo, a multi-agent system that emulates the reasoning process of a chemist to automate molecular design. Existing LLM-based approaches either operate as standalone generative models without access to external tools or lack the multi-agent coordination and shared memory needed for iterative, evidence-driven reasoning across the molecular design pipeline. MolLingo addresses this by coordinating a Literature Agent, a Chemist Agent, and an Orchestrator through a shared memory module, with each agent equipped with domain-specific tools. To enable effective molecular reasoning, we introduce BRICS-based Fragment Enumeration (BFE), a synthesis-aware molecular fragmentation method that decomposes molecules into chemically meaningful building blocks represented as block-based SMILES paired with common chemical names. This representation bridges molecular structure and LLM semantic space, enabling block-level reasoning and editing that is difficult with raw SMILES alone. As a case study in early-stage therapeutic design, MolLingo further grounds the Chemist Agent's reasoning in binding site geometry and residue-level protein context derived from molecular docking to optimize molecules for stronger target binding. Across four benchmarks, MolLingo consistently outperforms frontier LLMs and specialized baselines, including a fourfold docking score improvement over GPT-5.4 despite using the same underlying model, consistent drug property optimization gains across multiple LLM backbones, and state-of-the-art results on TOMG-Bench, surpassing both frontier LLMs and the RL-based optimization method RePO. Our results suggest that LLMs are already capable molecular design assistants when guided through chemically meaningful representations and biologically grounded structural context. Code is available at: https://anonymous.4open.science/status/MolLingo-7450.

2605.27851 2026-05-28 cs.AI

When Context Flips, Safety Breaks: Diagnosing Brittle Safety in Aligned Language Models

Dasol Choi, Alex Kwon

Affiliations: AIM Intelligence

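AI summary: Introduces context-flip evaluation, testing 12 models on a safety benchmark and commonsense controls. It finds that aligned language models exhibit safety-specific brittleness that stems from policy override rather than miscomprehension, and shows that action-level guardrails fail to detect consequence flips.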

Abstract

Safety benchmark scores provide incomplete evidence of deployment readiness: aligned language models often adhere to rigid rules even when a situational update flips which action is safe. We term this failure brittle safety. To diagnose it, we introduce context-flip evaluation, testing 12 models across a safety benchmark (PacifAIst) and two commonsense controls using paired variants where the nominally safe action produces harm. Three findings emerge. First, brittle safety is safety-specific: all 12 models exhibit a safety-commonsense gap (mean +17.4 pp). Baseline accuracy fails to predict brittleness: among models above 90% baseline accuracy, brittleness rates range from 13.7% to 90.0%. Second, failures stem from policy override rather than miscomprehension: despite acknowledging the context change in every case, models persist via three distinct mechanisms that vary by update type and model family. Third, on a hand-audited probe of catastrophic consequence-flip scenarios, standard action-level guardrails catch none, while a state-aware validator catches all without false alarms on correct interventions. This indicates that action-level content moderation is systematically blind to consequence-flips, motivating state-aware architectural alternatives. We release our protocol, perturbed benchmarks, and deployment probe.

2605.27850 2026-05-28 cs.AI

TCP-MCP: Landscape-Guided Co-Evolution of Prompts and Communication Topologies for Multi-Agent Systems

Yi Ding, Zijie Xuan, Haowei Zhou, Zhenyu Ju, Xiaoxiao Dong, Jingwen Zhang, Xingyu Zhu, Leixin Sun, Haochi Zhang

Affiliations: National Institute of Metrology, China; University of California, Berkeley; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Nanjing University of Chinese Medicine; WEEX Exchange; National University of Singapore; Wuhan University; Peking University

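AI summary: Proposes TCP-MCP, a framework that co-evolves agent prompts and communication topologies under three objectives (task performance, token cost, and structural complexity) for cost-aware, task-adaptive multi-agent system design.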

Abstract

Effective multi-agent systems cannot be designed by selecting prompts or communication graphs in isolation. Agent behavior depends on the information an agent receives, while the usefulness of a communication edge depends on how the receiving agent interprets and uses that information. We propose TCP-MCP (Topology-Coupled Prompting for Multi-Agent Collaborative Problem-Solving), a co-evolution framework that searches agent prompts and communication topologies as a unified genome. TCP-MCP uses an initialization-time landscape probe to calibrate early search behavior, and then relies on Pareto-front diagnostics to adapt exploration under three objectives: task performance, token cost, and structural complexity. Using the same DeepSeek-V3.2 backbone across all methods, TCP-MCP achieves 82.66%, 89.96%, and 96.61% accuracy on MMLU-Pro, MMLU, and GSM8K, respectively. Across the three benchmarks, it consistently outperforms automated graph-generation baselines and achieves competitive accuracy relative to debate-style systems, while using up to 5.69× fewer tokens than those systems at the reported operating points. These results show that jointly evolving prompts and communication structure provides a practical route to cost-aware and task-adaptive multi-agent system design in controlled evaluations.
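
The three-objective search described here rests on the notion of a Pareto front. The sketch below is a generic non-dominated filter over (accuracy, token cost, structural complexity) triples, with accuracy maximized and the other two minimized. It illustrates the concept only and is not TCP-MCP's search procedure; the names `Candidate`, `dominates`, and `pareto_front` are hypothetical.

```python
from typing import List, Tuple

# (accuracy, token_cost, structural_complexity) for one candidate system.
Candidate = Tuple[float, float, float]


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on every objective and better on at least one.

    Accuracy is maximized; token cost and structural complexity are minimized.
    """
    no_worse = a[0] >= b[0] and a[1] <= b[1] and a[2] <= b[2]
    strictly_better = a[0] > b[0] or a[1] < b[1] or a[2] < b[2]
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep the candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]
```

The filter is quadratic in the number of candidates, which is usually acceptable for the population sizes of an evolutionary search.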

2605.27846 2026-05-28 cs.AI

EAPO: Entropy-Driven Adaptive Positive-Negative Sample Weighting for Policy Optimization in Open-Ended QA

Yunsheng Zeng, Gen Li, Yuwei Miao, Xiandong Li, Yujin Wang, Siyu Chen, Luning Wang, Yunhao Qiao, Junfeng Wang, Jianwei Lv, Bo Yuan

Affiliations: Baidu Inc

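AI summary: To address fixed sample weights in reinforcement learning for open-ended QA, proposes EAPO, an entropy-driven adaptive policy optimization method that dynamically adjusts positive and negative sample weights to balance exploration and stability, substantially improving diversity and stability on medical QA datasets.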

Abstract

Large Reasoning Models are typically trained via reinforcement learning from verifiable rewards (RLVR). However, existing approaches adopt fixed weights for positive and negative samples, and the conclusions hardly generalize to open-ended question answering (QA). In this paper, we systematically investigate the roles of positive and negative samples in reinforcement learning for open-ended QA. We propose a reward-mean-based strategy for distinguishing positive from negative samples, and observe that negative samples predominantly govern response diversity and the performance upper bound, whereas positive samples primarily determine response quality and convergence stability. Building on these observations, we propose EAPO, an Entropy-driven Adaptive Policy Optimization method that adaptively computes the weighting coefficients of positive samples based on the ratio of the current policy entropy to the initial entropy. During the entropy-decreasing phase, the weight assigned to positive samples is reduced to preserve exploration, whereas during the entropy-increasing phase it is amplified to reinforce stability, thereby mitigating entropy collapse. Experiments on two publicly available open-ended medical QA datasets demonstrate that EAPO consistently and substantially outperforms fixed-weight baselines in both response diversity and stability.
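
A rough sketch of the two ingredients named in this abstract: splitting samples by the reward mean, and weighting positive samples by the ratio of current to initial policy entropy. The abstract gives no formula, so the clipped ratio, the bounds `w_min` and `w_max`, and the treatment of ties as negative are all assumptions made for illustration.

```python
import math
from typing import List, Tuple


def policy_entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def split_by_reward_mean(rewards: List[float]) -> Tuple[List[int], List[int]]:
    """Indices of samples above the mean reward, and of the rest.

    Treating ties as negative is an assumption of this sketch.
    """
    mean = sum(rewards) / len(rewards)
    positives = [i for i, r in enumerate(rewards) if r > mean]
    negatives = [i for i, r in enumerate(rewards) if r <= mean]
    return positives, negatives


def positive_weight(current_entropy: float, initial_entropy: float,
                    w_min: float = 0.1, w_max: float = 2.0) -> float:
    """Hypothetical positive-sample weight: the entropy ratio, clipped.

    The weight shrinks as entropy falls below its initial value and grows
    as entropy rises above it.
    """
    if initial_entropy <= 0.0:
        raise ValueError("initial_entropy must be positive")
    ratio = current_entropy / initial_entropy
    return max(w_min, min(w_max, ratio))
```

With this choice, a policy whose entropy has halved gets a positive-sample weight of 0.5, which slows further sharpening and leaves relatively more influence to the negative samples that the abstract associates with diversity.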

2605.27843 2026-05-28 cs.CV

A self-supervised learning approach to deep filter banks for texture recognition

Joao B. Florindo, Lucas O. Lyra, Antonio E. Fabris

Affiliations: Institute of Mathematics and Statistics of the University of Sao Paulo; Institute of Mathematics, Statistics and Scientific Computing of the University of Campinas

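AI summary: To address limited training data in texture recognition, proposes a self-supervised pretraining framework based on a convolutional autoencoder, combined with deep filters and Fisher vector pooling, improving recognition performance without significant added computational burden.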

Abstract

An important challenge in texture recognition is the limited amount of training data frequently found in real-world applications. In computer vision in general, a successful strategy to mitigate this issue is a pretraining stage in which the neural network learns to identify relations between parts of the data in a self-supervised manner. A well-established framework in this direction is the masked autoencoder. Nevertheless, these models usually rely on computationally intensive architectures, such as vision transformers. In the particular case of texture images, most of the relevant information is concentrated within a limited area around each pixel, which suggests that capturing long-range dependence via the attention mechanism may be unnecessary. Based on that assumption, we propose a framework in which the pretraining model is a convolutional autoencoder. To leverage the rich information conveyed by texture patterns, we employ deep filters coupled with Fisher vector pooling. In this way, we improve the performance of texture recognition without adding significant computational burden. Our approach is compared with several state-of-the-art methods on different texture databases, confirming its potential in terms of both classification accuracy and computational complexity.

2605.27838 2026-05-28 cs.SD

Dasheng AudioGen: A Unified Model for Generating Coherent Audio Scenes from Text

Jiahao Mei, Heinrich Dinkel, Yadong Niu, Xingwei Sun, Gang Li, Yifan Liao, Jiahao Zhou, Junbo Zhang, Jian Luan, Mengyue Wu

Affiliations: X-LANCE Lab, Shanghai Jiao Tong University, Shanghai, China; MiLM Plus, Xiaomi Inc., Beijing, China

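AI summary: Proposes Dasheng AudioGen, a unified framework that uses structured multi-view captions and a high-dimensional unified semantic-acoustic representation for end-to-end generation of mixed-audio scenes from text.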

Abstract

Audio generation has long been fragmented, with speech, music, and sound effects produced by domain-specific models that fail to jointly generate coherent audio scenes from a single description. The key obstacles are insufficient fine-grained supervision for real-world mixed audio and limited acoustic representations for modeling concurrent audio components. We present Dasheng AudioGen, a unified framework for generating general mixed-audio scenes from text. Dasheng AudioGen introduces structured multi-view captions, which explicitly decouple complex acoustic scenes into complementary description views, thereby enabling fine-grained control over audio layers. Furthermore, we employ a high-dimensional unified semantic-acoustic representation as the shared latent space. It injects semantic priors that facilitate cross-modal training convergence, while its high-dimensional feature space provides sufficient capacity to disentangle and fuse concurrent audio components effectively. With these designs, a simple flow-matching DiT achieves high-quality end-to-end audio scene generation. We also establish a comprehensive evaluation pipeline for audio scene generation. Experiments demonstrate that Dasheng AudioGen achieves performance approaching real-world recordings in mixed-audio categories, while remaining competitive with specialized models in single-type generation tasks. Demos are available at https://nieeim.github.io/Dasheng-AudioGen-Web/.

2605.27834 2026-05-28 cs.LG stat.ML

Reward Transfer from Inverse Reinforcement Learning: A Coupled Minimax Approach

Guang-Yuan Hao, Lars van der Laan, Aurélien Bibaut, Nathan Kallus

Affiliations: Cornell Tech, Cornell University; Netflix Research; Department of Statistics, University of Washington

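AI summary: Proposes a coupled minimax approach that jointly solves the Bellman equation systems of the source and target environments, removing the first-order influence of source Bellman residual error and enabling transfer of inverse-reinforcement-learning rewards from the source to the target environment.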

Abstract

We study the transfer of rewards learned using inverse reinforcement learning from expert demonstrations in one environment to reinforcement learning in a new, different environment. This arises naturally when demonstrations are collected in a controlled environment. We formulate the problem as a joint system of Bellman equations across the source and target environments and develop minimax estimators for the target soft-$q$-function. Whereas a sequential solution approach first estimates the source reward and then plugs it into the target control problem, a coupled approach solves the source and target system of equations jointly. We show that, in contrast to the sequential approach, the coupled approach removes the first-order influence of source Bellman residual error. We characterize the local behavior of each approach, develop finite-sample soft-$q$-function error bounds, and prove regret guarantees for the resulting soft-control policy. An empirical investigation using a sepsis simulator validates the theoretical comparison.
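
For orientation, the joint system of Bellman equations mentioned in this abstract can be written in the standard entropy-regularized (soft) form. The notation is an assumption made here, not taken from the paper: a shared reward $r$, discount factor $\gamma$, and transition kernels $P^{\mathrm{src}}$ and $P^{\mathrm{tgt}}$ for the source and target environments.

```latex
% Soft Bellman equation in environment e \in {src, tgt}, with shared reward r:
q^{e}(s,a) = r(s,a)
  + \gamma \, \mathbb{E}_{s' \sim P^{e}(\cdot \mid s,a)}
    \Big[ \log \sum_{a'} \exp q^{e}(s',a') \Big]

% Associated soft-control policy:
\pi^{e}(a \mid s) = \frac{\exp q^{e}(s,a)}{\sum_{a'} \exp q^{e}(s,a')}
```

In this notation, a sequential approach would recover $r$ from the source equation and substitute it into the target equation, while a coupled approach treats the two equations as one system linked through the shared $r$.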

2605.27832 2026-05-28 cs.CL

Playing with Words, Improving with Rewards: Training Language Models for Creative Association

Vijeta Deshpande, Namrata Shivagunde, Sherin Muckatira, Hadrien Glaude, Mikhail Gronas, Claire Stevenson, Roger Beaty, Anna Rumshisky

Affiliations: University of Massachusetts Lowell; Dartmouth College; University of Amsterdam; Pennsylvania State University; Amazon AGI

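AI summary: Trains LLMs on the game Codenames with Reinforcement Learning with Verifiable Rewards (RLVR) and examines a scale-dependent precision-creativity trade-off: the 8B model gains creativity while largely retaining reasoning ability, whereas smaller models trade creativity for reasoning precision.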

Abstract

Large Language Models (LLMs) are being applied to increasingly difficult problems and use cases. To navigate their vast solution spaces effectively, LLMs need to be creative. Yet the subjective nature of creativity and the limits of human judgment make training LLMs for creativity especially challenging. As a solution, we train LLMs on Codenames, a word-association game that exercises the two central axes of creativity, divergent and convergent thinking, while yielding objectively verifiable outcomes. This verifiability lets us bypass human judgment and train with Reinforcement Learning with Verifiable Rewards (RLVR). We train Qwen3-1.7B, 4B, and 8B models and evaluate them on ten creativity and four reasoning benchmarks. We find that the precision-diversity trade-off is scale-dependent: the 8B model prioritizes creativity over precision, while the 1.7B and 4B models gain reasoning precision at the cost of creativity. Concretely, the 8B model shows modest but consistent creativity gains (8 of 10 benchmarks) with only minor reasoning degradation, whereas the smaller models achieve substantial gains on reasoning tasks. Our study presents a scalable and effective solution to train LLMs for creativity.
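
To illustrate what an objectively verifiable outcome could look like in a Codenames-style game, here is a hypothetical reward for a clue, computed from the words a guesser picks in response. The scoring rule (one point per correct team word until the first miss, -1 for the assassin) is an assumption of this sketch and is not the reward used in the paper.

```python
from typing import Iterable, Set


def clue_reward(guesses: Iterable[str], team_words: Set[str],
                assassin: str) -> float:
    """Hypothetical verifiable reward for one clue.

    Counts consecutive correct guesses, stops at the first wrong word,
    and returns -1.0 immediately if the assassin word is guessed.
    """
    reward = 0.0
    for word in guesses:
        if word == assassin:
            return -1.0
        if word not in team_words:
            break
        reward += 1.0
    return reward
```

Because the reward is computed from the board alone, no human judgment of the clue's creativity is needed during training.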

2605.27831 2026-05-28 cs.LG eess.SP math.OC

Decentralized Parameter-Free Online Learning with Compressed Gossip

Tomas Ortega, Hamid Jafarkhani

Affiliations: Center for Pervasive Communications & Computing and EECS Department, University of California, Irvine

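AI summary: Proposes DECO-EF, an algorithm that combines coin-betting predictions with compressed difference-based gossip, achieving parameter-free adaptivity and sublinear network regret under compressed communication in decentralized online convex optimization.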

Abstract

We study decentralized online convex optimization when agents communicate over a graph and messages may be compressed. Classical decentralized online methods typically require learning-rate choices that depend on the horizon, comparator scale, or other problem parameters, while compressed communication introduces additional disagreement that must be controlled. We propose DECO-EF (DEcentralized COin-betting with Error Feedback), a decentralized parameter-free online learning algorithm that combines coin-betting predictions with compressed difference-based gossip. Each agent maintains a clean accumulated state and a compressed tracker, and communicates only compressed state differences during gossip steps. The method is parameter-free in the online-learning sense: it does not tune to the horizon, the comparator norm, or the learning rate. We prove expected comparator-adaptive network-regret bounds for DECO-EF under compressed communication. To the best of our knowledge, this gives the first expected sublinear network-regret guarantees for parameter-free decentralized online learning under compressed communication.
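
To show what a coin-betting prediction looks like without a learning rate, here is the classical single-agent Krichevsky-Trofimov (KT) bettor for one-dimensional online convex optimization with gradients in [-1, 1]. It is a swapped-in textbook building block, not DECO-EF itself: the sketch has no graph, no gossip, no compression, and no error feedback, and the function name and `initial_wealth` default are assumptions.

```python
from typing import Callable, List


def kt_coin_betting(grad: Callable[[float], float], rounds: int,
                    initial_wealth: float = 1.0) -> List[float]:
    """Run a 1-D KT coin-betting learner and return its iterates.

    Each round bets a fraction of the current wealth; the fraction is the
    running average of past "coin outcomes" (negative gradients).
    """
    wealth = initial_wealth
    coin_sum = 0.0  # sum of past coin outcomes c_i = -g_i
    iterates: List[float] = []
    for t in range(1, rounds + 1):
        betting_fraction = coin_sum / t
        w = betting_fraction * wealth        # prediction for round t
        iterates.append(w)
        g = max(-1.0, min(1.0, grad(w)))     # clip gradient to [-1, 1]
        coin = -g
        wealth += coin * w                   # wealth update after the bet
        coin_sum += coin
    return iterates
```

For the loss |w - 3|, whose subgradient is the sign of w - 3, the first three predictions are 0, 0.5, and 1: the learner moves toward the minimizer with growing steps even though no step size was ever chosen.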

2605.27827 2026-05-28 cs.AI cs.CY

Operational AI Deployment Assurance: Governance-State Orchestration Under Threshold-Sensitive Deployment Conditions -- A Governance Framework for High-Stakes AI Systems

Khalid Adnan Alsayed

Affiliations: Ducaltus | AI Assurance & Governance, Newcastle upon Tyne, United Kingdom; School of Computing, Engineering & Digital Technologies, Teesside University, Middlesbrough, United Kingdom

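AI summary: Proposes the Operational AI Deployment Assurance (OADA) framework, which uses Deployment Assurance Scores, readiness classifications, Threshold Stability Zones, Governance Escalation States, and remediation-aware assurance progression to turn fairness disagreement, subgroup instability, and threshold sensitivity into deployment-oriented governance decisions, addressing the limits of static metric reporting and post-hoc auditing in high-stakes AI systems.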

Comments 13 pages, 3 figures, governance-oriented framework for operational AI deployment assurance in high-stakes systems

详情
AI中文摘要

AI治理框架日益强调高风险领域的公平性、透明度、问责制和生命周期风险管理。然而,许多当前方法仍停留在观察层面,依赖静态指标报告、事后审计和监控仪表板,而未能直接治理部署就绪性、修复进展、升级状态或保障驱动的部署控制。本文引入运营级AI部署保障(OADA),这是一个治理框架,用于将公平性分歧、子组不稳定性、阈值敏感性、修复结果和运营不确定性转化为面向部署的保障决策。基于先前关于公平性分歧指数(FDI)和FairRisk-FDI的工作,OADA将治理不确定性重新定义为AI部署管道中的运营问题,而非指标分歧的副产品。该框架引入了部署保障分数、部署就绪分类、阈值稳定区、治理升级状态和修复感知保障推进。这些构造通过将评估输出与部署状态解释、重新评估、升级和运营控制相连接,支持高风险环境中的生命周期导向治理决策。通过在面部识别系统上进行面向部署的评估,并将讨论扩展到作为代表性高风险领域的医疗AI,本文展示了系统在孤立的公平性或性能指标下可能看似可接受,同时仍表现出影响部署就绪性的不稳定性。所提出的框架将运营部署保障定位为评估与现实世界AI部署之间的治理层。

英文摘要

AI governance frameworks increasingly emphasize fairness, transparency, accountability, and lifecycle risk management in high-stakes domains. However, many current approaches remain observational, relying on static metric reporting, post-hoc auditing, and monitoring dashboards without directly governing deployment readiness, remediation progression, escalation states, or assurance-driven deployment control. This paper introduces Operational AI Deployment Assurance (OADA), a governance framework for translating fairness disagreement, subgroup instability, threshold sensitivity, remediation outcomes, and operational uncertainty into deployment-oriented assurance decisions. Building on prior work on the Fairness Disagreement Index (FDI) and FairRisk-FDI, OADA reframes governance uncertainty as an operational concern within AI deployment pipelines rather than a byproduct of metric disagreement. The framework introduces Deployment Assurance Scores, Deployment Readiness Classifications, Threshold Stability Zones, Governance Escalation States, and remediation-aware assurance progression. These constructs support lifecycle-oriented governance decisions across high-stakes settings by connecting evaluation outputs to deployment-state interpretation, reassessment, escalation, and operational control. Through deployment-oriented evaluation across facial recognition systems, with discussion extended to healthcare AI as a representative high-stakes domain, the paper demonstrates how systems may appear acceptable under isolated fairness or performance metrics while still exhibiting instability that affects deployment readiness. The proposed framework positions operational deployment assurance as a governance layer between evaluation and real-world AI deployment.

2605.27824 2026-05-28 cs.AI cs.CL

Revealing Algorithmic Deductive Circuits for Logical Reasoning

揭示逻辑推理的算法演绎电路

Phuong Minh Nguyen, Tien Huu Dang, Naoya Inoue

发表机构 * Japan Advanced Institute of Science and Technology(北陆先端科学技术大学院大学)

AI总结 本研究通过因果中介分析定位大语言模型中负责逻辑推理步骤的注意力头,发现少量专用头处理事实和规则信息,而高层头促进信息整合和全局推理策略的出现。

详情
AI中文摘要

最近的研究表明,通过在少样本学习设置中引入抽象描述图遍历算法和逐步推理的功能性符号表示,大型语言模型(LLMs)能够实现强大的推理性能。然而,目前尚不清楚LLMs如何仅从有限的示例中真正理解每个推理步骤的抽象含义以及整体算法。本文旨在定位负责单个推理步骤的注意力头,并刻画它们之间传输的信息类型。我们首先在符号辅助的思维链(CoT)提示框架下,将组成推理步骤与其对应的token logits对齐。我们的分析表明,引导推理过程的token位置与低置信度分数相关,这些低置信度分数是由满足演示中推理行为模式的约束引起的。然后,我们采用因果中介分析技术来识别负责这些模式的注意力头。此外,我们的发现表明,LLMs通过专门的注意力头(约占全部头的3%)为各个子推理任务检索事实和基于规则的信息,而较高层主要促进信息整合和全局推理策略(例如图遍历算法)的出现,这些策略协调多个中间推理步骤以解决整体任务。

英文摘要

Recent studies have shown that Large Language Models (LLMs) can achieve strong reasoning performance by incorporating functional symbolic representations that abstractly describe graph traversal algorithms and step-by-step reasoning in few-shot learning settings. However, it remains unclear how LLMs genuinely understand the abstract meaning of each reasoning step and the overall algorithm from only a limited number of demonstrations. This work aims to localize the attention heads responsible for individual reasoning steps and characterize the types of information transferred among them. We first align constituent reasoning steps with their corresponding token logits under a symbolic-aided Chain-of-Thought (CoT) prompting framework. Our analysis shows that token positions that steer the reasoning process are associated with low confidence scores caused by constraints on satisfying reasoning behavior patterns in demonstrations. We then adopt causal mediation analysis techniques to identify the attention heads responsible for these patterns. In addition, our findings indicate that LLMs retrieve factual and rule-based information for individual sub-reasoning tasks through specialized attention heads (approximately 3% of all heads), whereas higher layers predominantly facilitate information integration and the emergence of global reasoning strategies (e.g., graph traversal algorithms) that coordinate multiple intermediate reasoning steps to solve the overall task.

2605.27820 2026-05-28 cs.AI

EgoBench: An Interactive Egocentric Multimodal Benchmark for Tool-Using Agents

EgoBench:面向工具使用智能体的交互式自我中心多模态基准

Yunqi Liu, Tong Niu, Zitong Wang, Zhenlong Dai, Yuqi Qing, Weiqiang Wang, Jian Liu

发表机构 * Ant Group(蚂蚁集团)

AI总结 提出EgoBench,首个交互式自我中心多模态基准,通过1045个自我中心视频任务和用户-智能体-工具交互环境,联合评估视觉感知、工具增强多跳推理和动态交互能力,揭示当前最先进模型性能上限(平均准确率19.43%)。

Comments 68 pages, 6 figures

详情
AI中文摘要

随着AI智能体在开放的真实世界环境中日益运作,它们需要多模态感知、多跳推理的工具调用以及与用户的动态交互的深度协同。然而,现有基准由于在设计严格耦合的多能力任务、模拟自然且任务受限的用户反馈以及确保动态交互的客观评估方面存在挑战,未能联合评估这些能力。为弥补这一差距,我们引入了EgoBench,这是首个面向工具使用智能体的交互式多模态基准。EgoBench包含覆盖四个日常场景的1,045个自我中心视频任务,以及一个用于评估的用户-智能体-工具交互环境。我们实现了一个三阶段协同流水线,通过该流水线,每个任务旨在强制视觉感知和工具增强多跳推理的联合应用。我们还在EgoBench中开发了一个多智能体模拟用户来评估智能体的交互能力,该模拟用户生成高保真、任务对齐的响应。此外,我们建立了一个确定性联合验证框架,通过基于过程和基于结果的等价性保证客观评估。在EgoBench上对八个最先进的视频-MLLM智能体进行基准测试揭示了严重的性能上限:最佳模型在最佳表现场景中仅达到30.62%的准确率,在所有四个场景中平均为19.43%。最后,我们进行了多维错误分析以解开失败模式,揭示了推动未来AI智能体发展的能力瓶颈。

英文摘要

As AI agents increasingly operate in open, real-world environments, they require a deep synergy of multimodal perception, tool invocation with multi-hop reasoning, and dynamic interaction with users. However, existing benchmarks fail to jointly evaluate these capabilities due to challenges in designing strictly coupled multi-capability tasks, simulating natural and task-constrained user feedback, and ensuring objective evaluation of dynamic interaction. To bridge this gap, we introduce EgoBench, the first interactive multimodal benchmark for tool-using agents. EgoBench comprises 1,045 egocentric-video-grounded tasks covering four daily scenarios, along with a user-agent-tool interactive environment for evaluation. We implement a three-stage synergistic pipeline through which each task is designed to enforce the joint application of visual perception and tool-augmented multi-hop reasoning. We additionally develop a multi-agent simulated user within EgoBench to evaluate agents' interaction capabilities, which generates high-fidelity, task-aligned responses to agents. Furthermore, we establish a deterministic joint validation framework that guarantees objective assessment through process-based and result-based equivalence. Benchmarking eight SOTA video-MLLM agents on EgoBench reveals a severe performance ceiling: the best model achieves only 30.62% accuracy in the best-performing scenario, averaging 19.43% across all four scenarios. Finally, we conduct a multi-dimensional error analysis to disentangle failure modes, exposing capability bottlenecks for advancing future AI agents.

2605.27819 2026-05-28 cs.LG cs.AI

ReSAE: Residualized Sparse Autoencoders for Multi-Layer Transformer Interventions

ReSAE: 用于多层Transformer干预的残差化稀疏自编码器

Prathyush Poduval, Calvin Yeung, Neel Desai, Mohsen Imani

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

AI总结 针对多层稀疏自编码器(SAE)在Transformer中因层间耦合导致的冗余和交互问题,提出残差化稀疏自编码器(ReSAE),通过拟合层间仿射映射并训练SAE于残差上,减少解码器冗余并提升多层替换下的交叉熵恢复。

详情
AI中文摘要

稀疏自编码器通常逐层训练,尽管Transformer残差流激活在深度上强烈耦合。这对多层干预造成实际问题:不同层的字典可能将容量用于表示相同的向前传递信息,同时替换多层可能产生单层行为无法预测的交互。我们引入残差化稀疏自编码器(ReSAE),它在选定层之间拟合仿射映射,并在未解释的残差上训练后续层的SAE,而非完整激活。重构通过拟合的仿射链映射回原始激活空间,因此ReSAE可以像普通SAE一样使用相同的干预协议进行评估。在Pythia-1.4B和Gemma-2-9B上,残差化减少了解码器冗余,并在大多数测试设置中改进了稀疏探测和定向扰动。尽管重构的原始激活方差较少,ReSAE在多层替换下恢复了更多Transformer交叉熵。这一增益在教师强制和足够的在线稀疏性下最为明显,表明ReSAE保留了与模型下游计算最相关的激活成分。这些结果表明,去除线性可预测的跨层结构是多层SAE干预的有用默认设置。

英文摘要

Sparse autoencoders are usually trained one layer at a time, even though transformer residual stream activations are strongly coupled across depth. This creates a practical problem for multi-layer interventions: different layerwise dictionaries can spend capacity representing the same carried-forward information, and replacing several layers at once can produce interactions that are not predicted by single-layer behavior. We introduce Residualized Sparse Autoencoders (ReSAEs), which fit an affine map between selected layers and train each later-layer SAE on the unexplained residual rather than on the full activation. Reconstructions are mapped back into the original activation space through the fitted affine chain, so ReSAEs can be evaluated with the same intervention protocols as ordinary SAEs. On Pythia-1.4B and Gemma-2-9B, residualization reduces decoder redundancy and improves sparse probing and targeted perturbation in most tested settings. Despite reconstructing less of the raw activation variance, ReSAEs recover more transformer cross entropy under multi-layer replacement. This gain is clearest under teacher-forcing and at sufficient sparsity online, indicating that ReSAEs preserve the components of the activation most relevant to the model's downstream computation. These results suggest that removing linearly predictable cross-layer structure is a useful default for multi-layer SAE interventions.
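The residualization step described above (fit an affine map between layers, train on what it fails to explain, then map back) can be shown in a scalar toy version. This is only a sketch: real activations are high-dimensional vectors, the numbers below are made up, and the SAE is replaced by an identity stand-in so that only the affine bookkeeping is visible.

```python
def fit_affine(xs, ys):
    """Least-squares fit of ys ~ a * xs + b for scalar inputs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b


# Hypothetical scalar "activations" at an earlier and a later layer.
early = [0.0, 1.0, 2.0, 3.0]
late = [1.0, 3.5, 4.5, 7.0]

a, b = fit_affine(early, late)

# The later-layer SAE would be trained on this residual, not on `late`.
residual = [y - (a * x + b) for x, y in zip(early, late)]

# Mapping back: affine prediction plus the (here perfectly) reconstructed residual.
restored = [a * x + b + r for x, r in zip(early, residual)]

mean_late = sum(late) / len(late)
raw_energy = sum((y - mean_late) ** 2 for y in late)
residual_energy = sum(r ** 2 for r in residual)
```

In this toy, `residual_energy` is far smaller than `raw_energy`. That is the sense in which a dictionary trained on residuals no longer has to spend capacity on information already carried forward from the earlier layer.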

2605.27817 2026-05-28 cs.RO cs.AI cs.CV cs.LG

Turning Video Models into Generalist Robot Policies

将视频模型转化为通用机器人策略

Sizhe Lester Li, Evan Kim, Xingjian Bai, Tong Zhao, Tao Pang, Max Simchowitz, Vincent Sitzmann

发表机构 * MIT(麻省理工学院) CMU(卡内基梅隆大学) Amazon FAR(亚马逊公司)

AI总结 提出一种解耦的视频到动作策略VERA,利用无动作视频世界模型和基于机器人雅可比矩阵的逆动力学模型,实现跨本体的零样本机器人控制。

Comments project page: https://vera.csail.mit.edu

详情
AI中文摘要

视频生成模型已成为一种有前景的机器人骨干网络,能够生成描绘跨本体和环境完成复杂任务的视频。最近的工作提出了机器人基础模型,通过使用带有动作标签的数据微调视频模型,联合预测未来观测和动作。在本文中,我们测试了一种替代方法的极限:保持视频规划器不变,同时训练一个特定本体的逆动力学模型(IDM)。这种解耦带来了几个自然的好处:视频规划器保持本体无关,不同的视频模型可以轻松互换而无需重新训练IDM,并且IDM可以独立地使用现成的自对弈数据进行训练。我们提出了一种闭环的视频到动作策略,该策略将无动作视频世界模型与基于机器人本体雅可比矩阵的精心设计的IDM相结合。我们证明了我们的IDM设计既数据高效又可扩展到高维动作空间。我们将该策略命名为视频到具身机器人动作模型(VERA),在模拟和真实世界基准测试中取得了强劲的性能,包括零样本的Panda机械臂操作和16自由度Allegro灵巧手立方体重新定向。通过将相同的视频规划器与不同的本体特定IDM配对,可以在多个本体上使用。我们的结果表明,解耦的视频规划加上忠实的视频到动作翻译是实现零样本、跨本体和可泛化机器人控制的可行替代途径。更多结果请访问我们的项目网站:https://vera.csail.mit.edu。

英文摘要

Video generative models have emerged as a promising robotics backbone, capable of generating videos that depict the completion of complex tasks across embodiments and environments. Recent work proposes robot foundation models that jointly predict future observations and actions by finetuning video models with action-labeled data. In this paper, we test the limits of an alternative approach: leave the video planner as-is while training an embodiment-specific inverse dynamics model (IDM). This decoupling offers several natural benefits: the video planner remains embodiment-agnostic, different video models can be interchanged easily without re-training the IDM, and the IDM can be independently trained with readily available self-play data. We present a closed-loop, video-to-action policy that combines an action-free video world model with a carefully-designed IDM based on the robot embodiment Jacobian. We demonstrate that our IDM design is both data-efficient and scalable to high-dimensional action spaces. Our policy, which we coin the Video-to-Embodied Robot Action Model (VERA), achieves strong performance across simulated and real-world benchmarks, including zero-shot Panda arm manipulation and 16-DoF Allegro-hand dexterous cube re-orientation. The same video planner can be used across multiple embodiments by pairing it with different embodiment-specific IDMs. Our results show that decoupled video planning plus faithful video-to-action translation is a viable alternative route towards zero-shot, cross-embodiment, and generalizable robot control. More results are available on our project website: https://vera.csail.mit.edu.

2605.27816 2026-05-28 cs.CV

Pattern Recognition Tasks with Personalized Federated Learning

个性化联邦学习的模式识别任务

Md. Arifur Rahman, Isha Das, Mushfiqur Rahman Abir, B. M. Taslimul Haque, Abdullah Al Noman, Abir Ahmed, Md. Jakir Hossen

发表机构 * College of Graduate and Professional Studies, Trine University(特灵大学研究生与专业研究学院) Network Communication and IoT Lab, Chittagong University of Engineering and Technology(吉大港工程技术大学网络通信与物联网实验室) Department of Computer Science and Engineering, American International University-Bangladesh(美国国际大学-孟加拉国计算机科学与工程系) Information Systems, Central Michigan University(中央密歇根大学信息系统系) Wilmington University(威尔明顿大学) Department of Information Technology, Washington University of Science & Technology(华盛顿科学与技术大学信息科技系) Center for Advanced Analytics (CAA), COE for Artificial Intelligence, Faculty of Engineering & Technology (FET), Multimedia University, Melaka(多媒体大学马六甲校区工程与技术学院(FET)人工智能卓越中心(COE)高级分析中心(CAA))

AI总结 本文通过比较七种个性化联邦学习算法在MNIST、SignMNIST和Digit5数据集上的性能,发现APPLE、FedGC和FedProto在准确率、精确率、召回率和F1分数上表现优异。

Comments Comprehensive comparative analysis of 7 Personalized Federated Learning algorithms across MNIST, SignMNIST, and Digit5 datasets. The paper presents detailed methodology, workflow architecture, experimental evaluation, and privacy-preserving AI analysis for distributed intelligent systems, secure collaborative learning, and critical infrastructure applications

Journal ref Emerging Science Journal 10(2):974-990 (2026)

详情
AI中文摘要

个性化联邦学习(PFL)构成了一种新颖的范式,它为每个客户端定制机器学习(ML)模型,从而在维护严格数据隐私原则的同时提供个性化的模型更新。与传统的标准联邦学习(FL)方法不同,PFL使模型适应不同的客户端数据分布,从而在最小化通信开销的同时,实现更高水平的准确性、定制化和数据安全性。这种方法在依赖于异构数据源且以隐私问题为关键的模式识别任务背景下尤为突出。在本研究工作中,本文对七种不同的PFL算法进行了全面的比较分析,这些算法在三个不同的数据集(即MNIST、SignMNIST和Digit5)上部署。总体目标是通过基于准确率、精确率、召回率和F1分数等指标的严格评估,确定在模式识别任务框架内最优秀的PFL算法。同时,对这些PFL算法进行了深入审查,阐明了它们的工作流程、优点和局限性。通过实证研究,结果表明APPLE、FedGC和FedProto是强有力的竞争者,在评估的数据集范围内始终提供优越的性能,同时承认其他算法的上下文特异性以及通过迭代改进实现最优结果的潜力。

英文摘要

Personalized Federated Learning (PFL) constitutes a novel paradigm that tailors Machine Learning (ML) models to individual clients, thereby furnishing personalized model updates whilst upholding stringent data privacy principles. Diverging from conventional Federated Learning (FL) approaches, PFL adapts models to distinct client data distributions, engendering heightened levels of accuracy, customization, and data security, all while minimizing communication overhead. This methodology proves particularly salient in contexts marked by pattern recognition tasks reliant upon heterogeneous data sources and underpinned by paramount privacy apprehensions. In the present research endeavor, this article undertakes a comprehensive comparative analysis of seven distinct PFL algorithms deployed across three diverse datasets, namely MNIST, SignMNIST, and Digit5. The overarching objective entails ascertaining the preeminent PFL algorithm, within the framework of pattern recognition tasks, through a rigorous evaluation anchored in metrics encompassing Accuracy, Precision, Recall, and F1 Score. Concurrently, an in-depth scrutiny of these PFL algorithms is conducted, elucidating their operative workflows, advantages, and limitations. Through empirical investigation, the findings evince that APPLE, FedGC, and FedProto emerge as stalwart contenders, consistently furnishing superior performance across the spectrum of assessed datasets, while acknowledging the contextual specificity of alternative algorithms and the potential for iterative refinement to realize optimal outcomes.

2605.27813 2026-05-28 cs.CV cs.AI cs.LG

Residualized Temporal Sparse Autoencoders for Interpreting Diffusion Models

残差化时间稀疏自编码器用于解释扩散模型

Calvin Yeung, Prathyush Poduval, Ali Zakeri, Zhuowen Zou, Mohsen Imani

发表机构 * University of California, Irvine(加州大学尔湾分校)

AI总结 提出残差化时间稀疏自编码器,通过去噪时间步间的线性预测残差学习扩散激活轨迹中的可解释特征,并在Stable Diffusion 1.5上验证其有效性。

详情
AI中文摘要

文本到图像扩散模型通过迭代去噪过程生成图像,因此内部神经层产生激活轨迹而非单一静态表示。稀疏自编码器(SAE)最近被用于将扩散激活分解为可解释的特征方向,但大多数方法在单个时间步分析激活或基于时间条件,而非直接从完整激活轨迹中学习。在这项工作中,我们引入了用于扩散激活轨迹的残差化时间SAE。我们收集去噪时间上的激活,拟合相邻时间步之间的线性预测器,并使用初始激活以及这些线性动力学未解释的残差分量来表示每个轨迹。在这种残差化表示上训练SAE鼓励稀疏潜在变量捕捉超出线性可预测范围的结构。残差化解码器方向可以映射回激活空间,使得每个潜在变量可以作为去噪时间上的特征轨迹进行分析。通过在Stable Diffusion 1.5上的重建与消融研究、时空特征分析和定性引导实验,我们表明残差化时间SAE为研究时间结构化的扩散激活提供了一个有用的框架。

英文摘要

Text-to-image diffusion models generate images through an iterative denoising process, so internal neural layers produce trajectories of activations rather than single static representations. Sparse autoencoders (SAEs) have recently been used to decompose diffusion activations into interpretable feature directions, but most approaches analyze activations at individual timesteps or condition on time rather than learning directly from full activation trajectories. In this work, we introduce residualized temporal SAEs for diffusion activation trajectories. We collect activations across denoising time, fit linear predictors between neighboring timesteps, and represent each trajectory using an initial activation together with residual components not explained by these linear dynamics. Training an SAE on this residualized representation encourages sparse latents to capture structure beyond what is linearly predictable. The residualized decoder directions can be mapped back into activation space, allowing each latent to be analyzed as a feature trajectory over denoising time. Through reconstruction and ablation studies, spatiotemporal feature analysis, and qualitative steering experiments on Stable Diffusion 1.5, we show that residualized temporal SAEs provide a useful framework for studying temporally structured diffusion activations.

2605.27811 2026-05-28 cs.AI

Constrained Auto-Bidding via Generative Response Modeling

通过生成式响应建模实现约束自动出价

Eunseok Yang, Xingdong Zuo, Kyung-Min Kim

发表机构 * NAVER Corporation(NAVER公司)

AI总结 提出生成式响应模型(GRM),通过预测未来流量和聚合成本/价值曲线,结合轻量解析控制器,在预算和比率约束下实现稳定高效的自动出价。

详情
AI中文摘要

自动出价系统旨在预算约束和每次获取成本等比率目标下,最大化广告主在长期内的价值,然而未来流量和拍卖动态是非平稳且不确定的。现有方法面临各自的局限性:基于控制的节奏方法对偏差做出反应但无法预测未来条件,而强化学习和生成方法将约束纳入奖励信号,掩盖了违规并在分布偏移下退化。我们将学习目标从动作转向响应,提出生成式响应模型(GRM),这是一个基于历史条件的序列模型,联合预测未来流量和作为单一出价乘数函数的时域聚合成本/价值曲线。我们证明,在温和的单调性条件下,相对于完全逐时刻控制的最优性差距受逐时刻边际单位成本价值离散度的限制。给定预测响应,一个轻量解析控制器通过一维求根步骤强制执行每个起作用的约束。我们证明该控制器对于单乘数问题是精确的,并根据预测误差限制了滚动时域重规划下的约束违规。在AuctionNet上的实验表明,与现有基线相比,GRM提高了约束稳定性和总体得分。

英文摘要

Auto-bidding systems aim to maximize advertiser value over long horizons under budget constraints and ratio targets such as cost-per-acquisition, yet future traffic and auction dynamics are non-stationary and uncertain. Existing approaches face distinct limitations: control-based pacing reacts to deviations but cannot anticipate future conditions, while RL and generative methods fold constraints into reward signals, obscuring violations and degrading under distribution shift. We shift the learning target from actions to responses with the Generative Response Model (GRM), a history-conditioned sequence model that jointly predicts future traffic volume and horizon-aggregate cost/value curves as functions of a single bid multiplier. We show that under mild monotonicity conditions, the optimality gap relative to full per-tick control is bounded by the dispersion of per-tick marginal value-per-cost. Given predicted responses, a lightweight analytic controller enforces each active constraint via a 1D root-finding step. We prove this controller is exact for the single-multiplier problem and bound constraint violations under receding-horizon replanning in terms of prediction error. Experiments on AuctionNet show that GRM improves constraint stability and overall score compared to existing baselines.
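The "1D root-finding step" can be sketched with plain bisection. The sketch assumes the predicted cost curve is non-decreasing in the bid multiplier and that the lower bracket already satisfies the budget. The function name `solve_multiplier`, the quadratic `predicted_cost` curve, and the bracket `[0, 10]` are all hypothetical, and the abstract does not say which root-finder the authors actually use.

```python
def solve_multiplier(cost_curve, budget, lo=0.0, hi=10.0, iters=60):
    """Largest multiplier in [lo, hi] whose predicted cost stays within budget.

    Assumes cost_curve is non-decreasing and cost_curve(lo) <= budget.
    """
    if cost_curve(hi) <= budget:
        return hi  # constraint is inactive over the whole bracket
    for _ in range(iters):
        mid = (lo + hi) / 2
        if cost_curve(mid) <= budget:
            lo = mid  # still feasible, so bid more aggressively
        else:
            hi = mid  # over budget, so back off
    return lo


def predicted_cost(multiplier):
    """Hypothetical stand-in for a model-predicted horizon-aggregate cost curve."""
    return 100.0 * multiplier ** 2


multiplier = solve_multiplier(predicted_cost, budget=50.0)
```

Returning the feasible end of the bracket keeps the predicted spend at or below the budget even when the iteration stops early. A ratio target such as cost-per-acquisition could be handled the same way by bisecting on predicted cost minus target times predicted value, provided that difference is also monotone in the multiplier.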

2605.27808 2026-05-28 cs.CL cs.MM

TARQ: Tail-Aware Reconstruction Quantization for Rare-Word Robust Automatic Speech Recognition

TARQ: 面向罕见词鲁棒自动语音识别的尾部感知重建量化

Xinyu Wang, Ziyu Zhao, Ke Bai, Silin Meng, Dongming Shen, Xiao-Wen Chang, Yixuan HE

发表机构 * McGill University(麦吉尔大学) Boson AI Arizona State University(亚利桑那州立大学)

AI总结 提出TARQ,一种无标签的后训练量化框架,通过尾部感知重建损失和罕见词平衡规则,在不增加额外训练的情况下显著降低罕见词错误率。

详情
AI中文摘要

数据感知后训练量化(PTQ)在小型校准语料库上最小化每个token的重建损失,隐式地根据经验频率对位置进行加权。对于自动语音识别(ASR),这与尾部敏感风险不一致:名称、数字和领域特定词获得的校准质量比例较小。我们提出了尾部感知重建量化(TARQ),一种无标签的PTQ框架,通过罕见词平衡(一种封闭形式的每线性层规则,平衡常见/尾部质量)和度量一致的残差校正,将校准转向词汇尾部。TARQ不需要实体标签、不需要精心设计的校准集、不需要验证解码,也不需要额外训练。在八个ASR骨干网络和六个数据集上,W4G128下,TARQ在不导致总体WER回归的情况下改善了平均罕见词错误率(rare-WER),在比较方法中实现了最低的跨语料库rare-WER波动,并在无需实体监督的情况下迁移到实体丰富的基准测试(ProfASR, ContextASR-Speech-En)。

英文摘要

Data-aware post-training quantization (PTQ) minimizes a per-token reconstruction loss on a small calibration corpus, implicitly weighting positions by their empirical frequency. For Automatic Speech Recognition (ASR), this misaligns with tail-sensitive risk: names, numerals, and domain-specific words receive proportionally little calibration mass. We propose Tail-Aware Reconstruction Quantization (TARQ), a label-free PTQ framework that shifts calibration toward the lexical tail via rare-word balancing, a closed-form per-Linear-layer rule equalizing common/tail mass, paired with a metric-consistent residual correction. TARQ requires no entity labels, no curated calibration set, no validation decoding, and no additional training. Across eight ASR backbones and six datasets at W4G128, TARQ improves mean rare-Word Error Rate (rare-WER) without an aggregate-WER regression, achieves the lowest cross-corpus rare-WER swing among compared methods, and transfers to entity-rich benchmarks (ProfASR, ContextASR-Speech-En) without entity supervision.

2605.27805 2026-05-28 cs.CL cs.AI

ChildEval: When large language models meet children's personalities

ChildEval:当大语言模型遇到儿童个性

Yanyan Luo, Xue Han, Chunxu Zhao, Ruiqiao Bai, Yaxing Zhang, Qian Hu, Lijun Mei, Junlan Feng

发表机构 * JIUTIAN Research(九天研究院) China Mobile(中国移动) Beijing, China(北京,中国)

AI总结 提出ChildEval基准,通过合成3-6岁儿童个性档案和偏好(显式或隐式表达),评估大语言模型在长对话中推断并遵循儿童偏好的能力,实验表明微调可提升儿童中心性能。

Comments 8 pages of main text (ACL Findings format), with references and appendix

详情
AI中文摘要

虽然大语言模型(LLM)使得个性化聊天机器人成为可能,但它们在儿童中心个性化方面的有效性仍不明确,因为缺乏对儿童特定偏好的系统评估。为填补这一空白,我们引入了ChildEval,一个用于评估LLM在长上下文对话中推断和遵循儿童中心偏好能力的基准。ChildEval包含29K个3-6岁儿童的合成个性档案,提供相对静态的背景信息。每个个性档案关联一个儿童偏好——可能与个性一致、冲突或独立——通过单句显式表达或6-10轮对话隐式表达。显式和隐式偏好旨在反映相同的潜在偏好,但表达方式不同,捕捉偏好表达的动态方面而非静态个性的变化。该基准涵盖五个顶层类别和十四个子类别,覆盖儿童的日常生活和发展。我们进一步提出了细粒度、以儿童为中心的评估协议,以系统评估开源LLM。实验结果表明,不同的个性化表示如何影响LLM的响应,并表明在ChildEval上进行微调可以提升儿童中心性能。我们的代码和数据集可在https://github.com/ziyanluo/ChildEval获取。

英文摘要

While LLMs enable personalized chatbots, their effectiveness in child-centered personalization remains unclear, as systematic evaluation of child-specific preferences is still lacking. To address this gap, we introduce ChildEval, a benchmark for evaluating LLMs' ability to infer and follow child-centered preferences in long-context conversations. ChildEval contains 29K synthesized persona profiles of children aged 3-6, providing relatively static background information. Each persona is associated with a child preference (which may align with, conflict with, or be independent of the persona), expressed either explicitly in a single sentence or implicitly through 6-10 turn dialogues. Explicit and implicit preferences are designed to reflect the same underlying preference but differ in expression, capturing dynamic aspects of preference expression rather than changes in the static persona. The benchmark spans five top-level and fourteen sub-level categories covering children's daily lives and development. We further propose fine-grained, child-centric evaluation protocols to systematically assess open-source LLMs. Experimental results demonstrate how different personalized representations affect LLM responses and suggest that finetuning on ChildEval can enhance child-centered performance. Our code and dataset are available at https://github.com/ziyanluo/ChildEval.

2605.27800 2026-05-28 cs.CV

CuriosAI Submission to the CASTLE Challenge at EgoVis 2026

CuriosAI 在 EgoVis 2026 CASTLE 挑战赛中的提交

Yuto Kanda, Hayato Tanoue, Takayuki Hori

发表机构 * SoftBank Corp(软银公司)

AI总结 针对600多小时多视角自我中心视频的185道选择题,提出SVA(搜索-验证-回答)三阶段流水线和TMKG(时间多模态知识图谱)两种方法,SVA达到0.50准确率并作为最终提交。

Comments The 4th place solution for the CASTLE Challenge at the CVPR EgoVis Workshop 2026

详情
AI中文摘要

CASTLE 2026 在超过600小时的同步多视角自我中心视频中提出了185道多项选择题。我们在共享的多模态预处理层之上探索了两种方法,该预处理层包括每人时间线、说话人解析的转录本和多VLM描述集成。方法A,SVA:搜索-验证-回答,是一个三阶段流水线,它分层缩小到主要窗口,在四条反虚构规则下用VLM验证子窗口,并在证据优先级层次下用LLM评判器融合证据。方法B,TMKG:时间多模态知识图谱,与之形成对照:它构建一个时间多模态知识图谱,通过图搜索定位主要单元,并用单个接地VLM产生最终答案。SVA在排行榜上达到0.50的准确率,是我们的最终挑战提交;TMKG达到0.35。

英文摘要

CASTLE 2026 asks 185 multiple-choice questions over 600+ hours of synchronized multi-view egocentric video. We explore two approaches on top of a shared multimodal preprocessing layer, including per-person timelines, speaker-resolved transcripts, and multi-VLM caption ensembles. Approach A, SVA: Search-Verify-Answer, is a three-stage pipeline that hierarchically narrows to a primary window, verifies sub-windows with a VLM under four anti-confabulation rules, and fuses evidence with an LLM judge under an evidence-priority hierarchy. Approach B, TMKG: Temporal-Multimodal-Knowledge-Graph, is the contrast: it builds a temporal multimodal knowledge graph, locates a primary cell via graph search, and produces the final answer with a single grounded VLM. SVA reaches a leaderboard accuracy of 0.50 and is our final challenge submission; TMKG reaches 0.35.

2605.27799 2026-05-28 cs.AI eess.SP

GraD-IBD: Graph Representation Learning from Diagnosis Trajectories for Early Detection of Inflammatory Bowel Disease

GraD-IBD:基于诊断轨迹的图表示学习用于炎症性肠病的早期检测

Leo Y. Li-Han, Ellen L. Larson, Elizabeth B. Habermann, Cornelius A. Thiels, Hojjat Salehinejad

发表机构 * Department of Surgery, Mayo Clinic, Rochester, MN, USA(外科部,梅奥诊所,罗切斯特,MN,美国) Kern Center for the Science of Health Care Delivery, Mayo Clinic, Rochester, MN, USA(健康保健交付科学中心,梅奥诊所,罗切斯特,MN,美国) Division of Hepatobiliary and Pancreas Surgery, Mayo Clinic, Rochester, MN, USA(肝胆胰外科部,梅奥诊所,罗切斯特,MN,美国) Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN, USA(人工智能与信息学部,梅奥诊所,罗切斯特,MN,美国)

AI总结 提出GraD-IBD图诊断模型,将纵向ICD轨迹重构为时间有向图,并设计上下文感知的时间衰减消息传递机制,以降低复杂度并提升炎症性肠病检测性能。

详情
AI中文摘要

国际疾病分类(ICD)是一种全球公认的编码系统,记录每次患者就诊的诊断事件,为各种临床任务提供标准化的数据基础。然而,ICD代码序列的不规则性和层次性给基于N-D格子的序列建模方法带来了挑战,导致模型设计过于复杂。在本文中,我们提出了GraD-IBD,一种图诊断模型,将纵向ICD轨迹重构为按就诊分桶的时间有向图,以检测炎症性肠病(IBD)的风险。我们开发了一种新颖的上下文感知时间衰减消息传递机制,以捕获时间依赖性并降低模型复杂度。使用真实世界临床数据集的实验结果表明,与最先进的方法相比,IBD检测性能一致且稳健地提升,同时与序列模型相比,计算复杂度显著降低。这些发现凸显了图表示学习在从纵向ICD诊断代码中进行高效、可扩展且准确的疾病风险预测方面的潜力。

英文摘要

International Classification of Diseases (ICD) is a globally recognized coding system that records diagnostic events during each patient encounter, providing a standardized data foundation for various clinical tasks. However, the irregular and hierarchical nature of ICD code sequences poses challenges for N-D lattice-based sequential modeling methods, leading to overly complex model designs. In this paper, we propose GraD-IBD, a graph diagnosis model that reformulates longitudinal ICD trajectories as visit-bucketized, temporally directed graphs to detect the risk of inflammatory bowel disease (IBD). A novel context-aware, time-decay message passing mechanism was developed to capture temporal dependencies while reducing model complexity. The experimental results using a real-world clinical dataset demonstrated consistent and robust improvements in IBD detection over state-of-the-art methods, with significant reductions in computational complexity compared to sequential models. These findings highlight the potential of graph representation learning to enable efficient, scalable, and accurate disease risk prediction from longitudinal ICD diagnosis codes.

2605.27790 2026-05-28 cs.LG

SYNAPSE: Neuro-Symbolic Visual Thought-to-Text Decoding via Topological Semantic Denoising

SYNAPSE: 通过拓扑语义去噪的神经符号视觉思维到文本解码

Akshaj Murhekar, Abhijit Mishra

发表机构 * School of Information, University of Texas at Austin(德克萨斯大学奥斯汀分校信息学院)

AI总结 提出SYNAPSE框架,利用常识图结构和潜在样本进行推理时符号正则化,稳定脑电到文本解码中的语义生成,无需微调大语言模型。

详情
AI中文摘要

大语言模型的最新进展加速了开放词汇的脑电到想象文本解码,其中视觉感知期间记录的非侵入性神经活动被翻译成所观看刺激的连贯自然语言描述。然而,现有系统仍然高度易受生物噪声影响,其中受损的神经投影在冻结语言模型中引发幻觉或语义不稳定的生成。我们引入了SYNAPSE(符号神经对齐用于精确语义提取),一个轻量级神经符号框架,通过推理时符号正则化稳定神经文本生成。通过使用常识图结构和潜在样本来净化脑电衍生的语义候选,SYNAPSE无需端到端微调LLM即可提高语义稳定性。在流行的脑电解码基准和多个冻结LLM后端上的实验表明,与无约束提示基线相比,SYNAPSE持续改进,在对象标签消融下具有鲁棒性,并且性能与资源密集得多的微调系统相当,同时通过将原始脑电处理完全限制在编码器堆栈内来保护生物特征隐私。

英文摘要

Recent advances in large language models have accelerated open-vocabulary EEG-to-imagined-text decoding, where non-invasive neural activity recorded during visual perception is translated into coherent natural language descriptions of viewed stimuli. However, existing systems remain highly vulnerable to biological noise, where corrupted neural projections induce hallucinated or semantically unstable generation in frozen language models. We introduce SYNAPSE (Symbolic Neural Alignment for Precise Semantic Extraction), a lightweight neuro-symbolic framework that stabilizes neural text generation through inference-time symbolic regularization. By purifying EEG-derived semantic candidates using commonsense graph structure and latent exemplars, SYNAPSE improves semantic stability without end-to-end LLM fine-tuning. Experiments across popular EEG decoding benchmarks and multiple frozen LLM backends demonstrate consistent gains over unconstrained prompting baselines, robustness under object-label ablation, and performance commensurate with substantially more resource-intensive fine-tuned systems, while preserving biometric privacy by localizing raw EEG processing entirely within the encoder stack.

2605.27789 2026-05-28 cs.AI cs.CL

A Fixed-Budget, Cluster-Aware Standard for LLM-as-a-Judge Evaluation: A Multi-Hop RAG Stress Test

固定预算、聚类感知的 LLM-as-a-Judge 评估标准:多跳 RAG 压力测试

Camilo Chacón Sartori, José H. García

发表机构 * Catalan Institute of Nanoscience and Nanotechnology(加泰罗尼亚纳米科学与纳米技术研究所)

AI总结 针对多跳 RAG 系统评估中的统计偏差问题,提出一种固定预算、聚类感知的 LLM-as-a-Judge 比较标准,并通过遗传算法证据选择器 GADMEC 在 400 个多跳问题上进行压力测试,揭示聚类感知推断改变了实证结论。

详情
AI中文摘要

检索增强生成(RAG)系统通常通过让大型语言模型(LLM)法官判断哪个答案更好来进行比较。对于多跳 RAG,这已成为一个测量问题,与建模问题同等重要:相同的分数可以反映检索质量、答案长度、词汇重叠或忽略聚类数据的统计检验。我们询问当这些选择被明确时会发生什么。 我们提出了 RAG 中 LLM-as-a-Judge 比较的最小测量标准。该标准固定了 top-100 候选池、证据预算、答案上限、生成器和提示;它还要求预先注册假设、聚类感知推断、在可行时进行精确的聚类符号翻转检验以及第二法官复制。聚类基准可能夸大进展;该领域应采用此标准。我们使用遗传算法解码器进行多跳证据组合(GADMEC),一种进化证据选择器,在计算机科学/机器学习(CS/ML)和材料科学领域的 400 个多跳问题上对其进行压力测试。该协议改变了实证故事。二项检验使所有四个语义基线比较看起来显著;聚类感知推断只留下一个 Bonferroni 显著结果。在相同预算下,BM25 优于纯语义 GADMEC,而词汇-语义混合在 CS/ML 中恢复并缩小了材料科学差距。

英文摘要

Retrieval-augmented generation (RAG) systems are often compared by asking a large language model (LLM) judge which answer is better. For multi-hop RAG, this has become a measurement problem as much as a modeling problem: the same score can reflect retrieval quality, answer length, lexical overlap, or a statistical test that ignores clustered data. We ask what happens when these choices are made explicit. We propose a minimum measurement standard for LLM-as-a-judge comparisons in RAG. The standard fixes the top-100 candidate pool, evidence budget, answer cap, generator, and prompt; it also requires pre-registered hypotheses, cluster-aware inference, an exact cluster sign-flip check when feasible, and second-judge replication. Clustered benchmarks can overstate progress; the field should adopt this standard. We stress-test it with Genetic Algorithm Decoder for Multi-hop Evidence Composition (GADMEC), an evolutionary evidence selector, on 400 multi-hop questions in computer science/machine learning (CS/ML) and Materials Science. The protocol changes the empirical story. A binomial test makes all four semantic-baseline comparisons look significant; cluster-aware inference leaves only one Bonferroni-significant result. BM25 beats pure semantic GADMEC under the same budget, while a lexical-semantic hybrid recovers in CS/ML and narrows the Materials Science gap.
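To see why cluster-aware inference can overturn a per-question test, here is a small sketch of an exact cluster sign-flip test next to a naive two-sided binomial test. This is one common construction and not necessarily the paper's exact procedure. The judge outcomes below are made-up illustration data: `+1` means the judge preferred system A on that question, `-1` means it preferred system B, and questions are grouped into four hypothetical clusters.

```python
from itertools import product
from math import comb


def cluster_sign_flip_pvalue(clusters):
    """Exact two-sided p-value from flipping the sign of whole clusters."""
    sums = [sum(cluster) for cluster in clusters]
    observed = abs(sum(sums))
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=len(sums)):
        total += 1
        if abs(sum(s * v for s, v in zip(signs, sums))) >= observed:
            hits += 1
    return hits / total


def naive_binomial_pvalue(wins, n):
    """Two-sided binomial p-value that treats every question as independent."""
    tail = sum(comb(n, k) for k in range(max(wins, n - wins), n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical judge preferences, grouped by cluster (e.g. shared source document).
clusters = [[1, 1, 1, 1], [1, 1, 1, -1], [1, 1, -1, 1], [-1, -1, 1, -1]]

p_cluster = cluster_sign_flip_pvalue(clusters)

wins = sum(v == 1 for cluster in clusters for v in cluster)
n_questions = sum(len(cluster) for cluster in clusters)
p_naive = naive_binomial_pvalue(wins, n_questions)
```

On this toy data the cluster-level p-value is larger than the per-question one, because sixteen correlated questions carry only about four clusters' worth of independent evidence. Full enumeration is exponential in the number of clusters, so it is only practical when that number is small.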

2605.27788 2026-05-28 cs.LG cs.CL

Knowing When to Ask: Segment-Level Credit Assignment for LLM Tool Use

知道何时求助:面向LLM工具使用的片段级信用分配

Abhijit Kumar, Zoey Wu, Mohit Suley

发表机构 * Microsoft AI, Redmond(微软AI,雷德蒙德)

AI总结 提出CARL方法,通过强化学习在模型自身轨迹上训练评论家,对每个工具使用片段独立分配信用,使模型学会区分参数知识足够与需要外部帮助的情况,在多个基准上提升准确率并减少不必要的工具调用。

AI中文摘要

人类知道何时需要求助,例如 $347 \times 28$ 需要计算器而 $2+2$ 不需要。语言模型则不然。基于提示的方法可以指示模型何时调用工具,但这种脚手架并不能教会模型识别自身知识的边界。将单一结果奖励分配给整个轨迹的强化学习方法同样效果不佳:轨迹级信用既无法分辨成功回合中究竟哪次工具调用真正起了作用,也无法惩罚不必要的调用。我们提出 CARL(Competence-Aware Reinforcement Learning,能力感知强化学习),该方法在模型自身的 rollout 上训练评论家,以学习参数知识在何处足够、在何处需要外部帮助。通过在自然的工具使用边界(例如代码围栏分隔符和上下文块转换)处分解每个 rollout,CARL 仅凭单一的二元结果就为每个片段独立分配信用,无需外部评判者或步骤级标注。因此,错误的工具调用、不正确的提取以及不必要的调用都会各自获得符号恰当的优势值。训练好的评论家刻画了模型的领域能力:在 7B 规模下,它以 0.93 的 AUC 区分参数可解问题与依赖工具的问题。在涵盖算术、多跳事实问答和金融表格数值推理的五个基准上,CARL 在 7B 和 3B 规模下分别比最佳 RL 基线提高了 6.7 和 9.7 个精确匹配(EM)准确率点,其中在 Musique 上增益最大(7B +8.3 EM,3B +9.0 EM)。模型在参数可回答的问题上减少了 53% 的工具调用,同时在这些问题上的准确率仍高出约 10 个 EM 点。增益在小规模模型上最大:3B 的改进是 7B 改进的 1.4 倍,这表明知道何时求助对参数记忆较小的模型益处尤其大。

英文摘要

Humans know when to reach for help, e.g., $347 \times 28$ warrants a calculator while $2+2$ does not. Language models do not. Prompt-based approaches can instruct a model when to invoke tools, but this scaffolding does not teach it to recognize the boundary of its own knowledge. RL approaches that assign a single outcome reward to the whole trajectory fare no better: trajectory-level credit cannot isolate which tool call in a successful episode actually helped, nor penalize unnecessary calls. We propose CARL (Competence-Aware Reinforcement Learning), which trains a critic on the model's own rollouts to learn where parametric knowledge suffices and where it needs external help. By decomposing each rollout at natural tool-use boundaries (e.g., code fence delimiters and context block transitions), CARL assigns independent credit to each segment from a single binary outcome, without external judges or step-level annotations. As a result, erroneous tool calls, incorrect extractions, and unnecessary calls each receive appropriately signed advantages. The trained critic captures the model's domain competence: it separates parametrically solvable from tool-dependent questions with AUC 0.93 at 7B. On five benchmarks spanning arithmetic, multi-hop factual QA, and numerical reasoning over financial tables, CARL improves exact-match accuracy by 6.7 points at 7B and 9.7 points at 3B over the best RL baseline, with the largest gain (+8.3 EM at 7B, +9.0 EM at 3B) on Musique. The model issues 53% fewer tool calls on parametrically answerable questions while remaining ${\sim}10$ EM points more accurate on them. Gains are largest at small scale: the 3B improvement is $1.4\times$ the 7B improvement, suggesting that knowing when to ask disproportionately benefits models with smaller parametric memory.
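
The idea of splitting a rollout at code-fence boundaries and giving each segment its own credit can be sketched as follows. The abstract does not say how CARL computes advantages, so the value-difference rule below is a generic temporal-difference-style stand-in rather than the paper's formula. The function names and the critic values in the example are hypothetical.

```python
FENCE = "`" * 3  # code-fence delimiter, built this way so the example stays fence-safe

def split_at_fences(rollout):
    """Split rollout text into ("text", ...) and ("tool", ...) segments.

    Pieces at odd positions after splitting on the fence lie inside a fence
    and are treated as tool calls (an assumption of this sketch).
    """
    segments = []
    for index, part in enumerate(rollout.split(FENCE)):
        if part.strip():
            kind = "tool" if index % 2 else "text"
            segments.append((kind, part.strip()))
    return segments

def segment_advantages(values_before, outcome):
    """Advantage of each segment as the change in estimated success.

    values_before -- critic estimate of success before each segment
    outcome       -- final binary reward (0 or 1), used as the value after
                     the last segment
    """
    values_after = values_before[1:] + [float(outcome)]
    return [after - before for before, after in zip(values_before, values_after)]

rollout = "I need a calculator." + FENCE + "print(347 * 28)" + FENCE + "The answer is 9716."
segments = split_at_fences(rollout)
# Hypothetical critic estimates before each of the three segments.
advantages = segment_advantages([0.25, 0.25, 0.75], outcome=1)
for (kind, text), adv in zip(segments, advantages):
    print(kind, repr(text), adv)
```

In this toy rollout the tool call carries most of the credit (0.5), the opening text carries none, and the advantages sum to the gap between the final outcome and the initial estimate. An unnecessary call that lowered the critic's estimate would receive a negative value under the same rule.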

2605.27785 2026-05-28 cs.AI cs.DB

A Query Engine for the Agents

面向智能体的查询引擎

Kenny Daniel

发表机构 * Hyperparam(Hyperparam公司)

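AI总结 提出Hyperparam:三个总计不足70 KB的开源JavaScript库(Hyparquet、Squirreling、Icebird),可直接从对象存储读取Parquet和Apache Iceberg,并支持逐单元格的异步原生SQL执行,用于在客户端AI应用中分析非结构化文本;在运行LLM形态的异步UDF时,Squirreling在过滤受限查询上比DuckDB-WASM快300倍以上,在排序受限查询上快192倍。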

Comments 4 pages, 1 figure, 3 tables

AI中文摘要

当今生产环境中增长最快的数据是非结构化文本:智能体轨迹、聊天日志、推理链、模型输出。人们想要分析这些数据,而有价值的问题(例如“显示智能体在哪里感到困惑”)无法仅通过SQL回答,因为如果没有模型参与查询路径,文本是不可查询的。这种分析自然发生在新一类AI应用中(如Claude Code、Cursor、Claude Desktop、浏览器内智能体),这些应用在客户端运行,并在同一进程中托管人类用户和LLM智能体。这些应用越来越需要处理数据,但数据湖仓的读取路径在JS运行时中难以使用:Spark、Trino和托管数据仓库不适合。为了构建这种新型AI数据应用,引擎的三个属性成为首要考虑:JS原生分发,能够直接嵌入应用已运行的运行时;足够小的包体积,以便在冷标签页或每轮智能体沙箱中分发;以及一种将分析操作符与基于模型的文本解释交错的方法。我们提出Hyperparam,三个总大小低于70 KB的开源JavaScript库(Hyparquet、Squirreling、Icebird),它们直接从对象存储读取Parquet和Apache Iceberg,并通过基于单元格的异步原生SQL执行满足第三个属性,因此昂贵的单元格仅在下游操作符需要时才触发。Squirreling在过滤受限查询上运行LLM形状的异步UDF比DuckDB-WASM快300倍以上(排序受限查询快192倍),并以低三分之二的成本完成十项智能体分析师任务。我们认为数据工程作为一个学科需要更新,以适应现已投入生产的AI原生客户端应用以及与其用户协作的智能体。

英文摘要

The fastest-growing data in production today is unstructured text: agent traces, chat logs, reasoning chains, model outputs. People want to analyze it, and the questions worth asking ("show me where the agent got confused") cannot be answered by SQL alone, since text is not queryable without a model in the query path. The natural place this analysis is happening is the new class of AI applications (Claude Code, Cursor, Claude Desktop, in-browser agents) that run client-side and host both a human user and an LLM agent in the same process. These applications increasingly want to work with data, but the lakehouse read path has been hard to use from a JS runtime: Spark, Trino, and managed warehouses do not fit there. To build this new kind of AI data application, three properties of the engine become first-order: a JS-native distribution that drops into the runtime the application already runs in, a bundle small enough to ship inside a cold tab or per-turn agent sandbox, and a way to interleave analytic operators with model-based interpretation of text. We present Hyperparam, three open-source JavaScript libraries (Hyparquet, Squirreling, Icebird) totaling under 70 KB, that read Parquet and Apache Iceberg directly from object storage and meet the third property with per-cell, async-native SQL execution, so expensive cells fire only when downstream operators demand them. Squirreling runs LLM-shaped async UDFs over 300x faster than DuckDB-WASM on filter-bounded queries (and 192x on sort-bounded queries) and completes a ten-task agent analyst suite at two-thirds lower cost. We argue that data engineering as a discipline needs to update for the AI-native client applications now in production and the agents that work alongside their users.
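
The phrase "expensive cells fire only when downstream operators demand them" can be illustrated with a small lazy-evaluation sketch. It is written in Python with `asyncio` purely for illustration and does not reflect Squirreling's actual JavaScript API. `LazyCell`, `scan`, `where`, `limit`, and `fake_llm_udf` are hypothetical names, and the "model call" is a stub that only uppercases its input.

```python
import asyncio

calls = 0  # counts how many times the expensive stub actually runs

async def fake_llm_udf(text):
    """Stand-in for an expensive model call (hypothetical, no real API)."""
    global calls
    calls += 1
    await asyncio.sleep(0)
    return text.upper()

class LazyCell:
    """A cell whose value is computed on first demand and then cached."""
    def __init__(self, thunk):
        self._thunk = thunk
        self._task = None

    def get(self):
        if self._task is None:
            self._task = asyncio.ensure_future(self._thunk())
        return self._task

async def scan(texts):
    # Each row carries a cheap column and an unevaluated expensive column.
    for text in texts:
        yield {"text": text, "label": LazyCell(lambda t=text: fake_llm_udf(t))}

async def where(source, predicate):
    async for row in source:
        if predicate(row):
            yield row

async def limit(source, n):
    taken = 0
    async for row in source:
        yield row
        taken += 1
        if taken >= n:
            break

async def run():
    texts = ["ok", "agent confused here", "fine", "confused again", "confused a third time"]
    # Roughly: SELECT llm(text) FROM t WHERE text LIKE '%confused%' LIMIT 2
    plan = limit(where(scan(texts), lambda row: "confused" in row["text"]), 2)
    return [await row["label"].get() async for row in plan]

result = asyncio.run(run())
print(result, calls)
```

Five rows are scanned, but the stub runs only twice, because the filter reads the cheap column and the limit stops pulling rows before the remaining cells are ever demanded. An eager engine that evaluated the UDF column for every row would have made five calls.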