arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.27406 2026-05-28 cs.LG

A Simple State Space Model Excels at Multivariate Time Series Classification

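Hassan Saadatmand, Geoffrey I. Webb, Hamid Rezatofighi, Mahsa Salehi

AI summary This paper systematically studies diagonal state space models (S4D) and input-dependent state space models (the Mamba family) on large-scale time series classification, finds that S4D outperforms Mamba variants in both accuracy and efficiency, and proposes the lightweight modifications MS4 and MS4N; MS4N matches or surpasses deep learning models with roughly 2x and 10x more parameters on multiple benchmarks.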

Abstract

Structured state space models (SSMs) have recently emerged as a promising foundation for sequence modeling, with Mamba-based architectures demonstrating strong performance through input-dependent state transitions, albeit at considerable complexity. However, their application to time-series classification (TSC) has been largely limited to Mamba-style architectures, leaving the broader SSM design space underexplored. We present the first systematic study spanning diagonal SSMs (S4D) and input-dependent SSMs (Mamba family) on large-scale TSC benchmarks, asking whether such complexity is necessary for top performance. Our results reveal a surprising finding: S4D consistently outperforms Mamba-based variants in both accuracy and efficiency, challenging the assumption that increased complexity translates to meaningful gains in TSC. Building on this, we introduce MS4, lightweight modifications to S4D via a linear input projection and channel-mixing mechanism, and MS4N, a normalized variant that stabilizes state dynamics with negligible overhead. Evaluated on 59 datasets across MONSTER (up to 60 million samples, 50K timesteps, 82 classes) and the UEA benchmark, against 15 baselines, MS4 and MS4N consistently outperform Mamba-based models while remaining more efficient, and MS4N matches or surpasses competing deep learning models that are roughly 2x and 10x larger in parameters. These results position lightweight structured SSMs as a compelling alternative to scaling complexity for TSC.
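
For readers unfamiliar with diagonal SSMs, the recurrence behind an S4D-style layer can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's MS4 or MS4N: the function name `diagonal_ssm`, the zero-order-hold discretization, the single-input single-output setup, and the example parameter values are all chosen here for illustration only.

```python
import cmath


def diagonal_ssm(u, A, B, C, dt):
    """Run a single-input, single-output diagonal SSM over the sequence u.

    A, B, C hold one (possibly complex) number per state dimension and
    dt is the step size. Hypothetical helper for illustration only.
    """
    # Zero-order-hold discretization, applied independently to each
    # diagonal entry (assumption: A has no zero entries).
    A_bar = [cmath.exp(dt * a) for a in A]
    B_bar = [(ab - 1.0) / a * b for ab, a, b in zip(A_bar, A, B)]

    x = [0j] * len(A)  # hidden state starts at zero
    ys = []
    for u_k in u:
        # Elementwise state update: a diagonal A needs no matrix multiply.
        x = [ab * x_i + bb * u_k for ab, bb, x_i in zip(A_bar, B_bar, x)]
        # Taking the real part of the readout gives a real-valued output.
        ys.append(sum(c * x_i for c, x_i in zip(C, x)).real)
    return ys


# Illustrative parameters: a conjugate pair of decaying oscillatory modes.
A = [-0.5 + 1.0j, -0.5 - 1.0j]
B = [1.0 + 0j, 1.0 + 0j]
C = [0.5 + 0j, 0.5 + 0j]
y = diagonal_ssm([1.0, 0.0, 0.0, 0.0], A, B, C, dt=0.1)
```

In contrast, an input-dependent (Mamba-style) model would recompute the discretized parameters from each input step, which is the extra complexity the abstract questions.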

2605.27397 2026-05-28 cs.LG

IGADA-IoT: IoT Sensor Energy Optimization in Wireless Sensor Networks Driven by Automatic Data Augmentation

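Mingchun Sun, Rongqiang Zhao, Muhammad Abdul Munnaf, Jie Liu

AI summary Proposes IGADA-IoT, an information gap-guided automatic data augmentation framework that uses hierarchical multi-generator collaboration and scheduling to jointly exploit different generators and reduce information gaps, and introduces a joint information gap-model performance evaluation and closed-loop method to improve augmentation decisions; experiments show a 7.27% improvement in average accuracy.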

Abstract

In wireless sensor networks (WSNs), data augmentation is a novel method to improve sampling-frequency decision performance, thereby enabling energy optimization for IoT (Internet of Things) sensors. However, existing methods rely on a single generator and empirically determined quantities, failing to establish a mapping between dynamic information gaps and multiple generators, and overlooking the heterogeneity of generated samples. Moreover, an evaluation and closed-loop method that jointly considers the information gap and model performance is lacking. To address these issues, we propose an information gap-guided IoT sensor automatic data augmentation framework (IGADA-IoT) with hierarchical multi-generator collaboration and scheduling over multiple rounds. The capabilities of different generators are jointly utilized to reduce the information gaps. In IGADA-IoT, a hierarchical multi-generator collaboration and scheduling strategy (HMGCS) is proposed to enhance the targetedness and rationality of generated sample allocation. An information gap-model performance joint evaluation and closed-loop method (IGMP-EC) is proposed to enhance the accuracy of augmentation decisions and to mitigate the risks of under-augmentation and over-augmentation. Experimental results show that IGADA-IoT improves the average accuracy of multiple downstream models by 7.27%. Compared with advanced data augmentation methods, the average accuracy is improved by 8.67%. Compared with the individual generators, the average accuracy is improved by 7.24%. Furthermore, public IoT sensor datasets from the UCR Archive and real-world deployments demonstrate the accuracy and generalizability of the proposed method.

2605.27393 2026-05-28 cs.CL cs.AI

StoryMI: Steerable Multi-Agent Therapeutic Dialogue Generation

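Qingyu Meng, Min Chen, Dingming Liu, Yifan Mo, Yue Su, Xin Sun, Koen Hindriks, Jiahuan Pei

AI summary Proposes StoryMI, a framework that generates therapeutic dialogues aligned with motivational interviewing (MI) standards through multi-LLM agent collaboration, situational story grounding, and dynamic strategy control, and builds an evaluation protocol and dataset to validate its effectiveness.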

Comments ACL2026

Abstract

Large language models (LLMs) can generate fluent dialogue, but prior work lacks situational grounding, dynamic strategy control, and evaluation aligned with clinical standards in motivational interviewing (MI). We introduce StoryMI, a multi-LLM agent framework for controllable MI dialogue generation, where questionnaire-based client profiles are expanded into situational stories that provide narrative context for the dialogue. Therapist and client agents generate MI-coded utterances guided by MI codes selected by an interaction agent, which dynamically coordinates exchanges to control MI strategies during a multi-turn conversation. We propose a two-level evaluation protocol: lexical metrics and MI-specific measures of macro-level counseling strategies, alongside LLM-as-judge and human expert assessments. We construct a dataset of 6K simulated MI dialogues grounded in 1K questionnaire-story pairs, covering 12 MI codes and 13 symptom domains, and benchmark six open- and closed-source LLMs. Our results show that situational grounding and macro-level control can improve MI adherence and clinical plausibility, demonstrating the effectiveness of a structured multi-agent workflow for psychotherapy dialogue generation. We provide code and data for reproducibility.

2605.27388 2026-05-28 cs.CL cs.AI cs.SI

Modeling Community Attitude through Reaction Tone: A Human-AI Collaborative Framework for Evaluating LLM Alignment with Linguistic Behaviors in Online Communities

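Nuan Wen, Xuezhe Ma

AI summary Proposes the CARE framework, which uses fine-grained illocutionary tone analysis to evaluate how well LLMs simulate community reactions to real-world news, revealing a "realism gap" and indicating that current alignment strategies are insufficient to capture the sociolinguistic dynamics of online groups.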

Comments Preprint

Abstract

Large language models (LLMs) are increasingly utilized as proxies for computational social analysis; yet, their ability to faithfully represent the "thick descriptions" (Geertz, 1973) of human communities remains a critical challenge. Current evaluations often reduce social identity to static labels, sidelining how real-world groups navigate social shifts. To bridge this gap, we introduce CARE (Community-Aware Reaction Evaluation), a reaction-centered framework that benchmarks LLM-simulated discourse against the authentic, event-contingent responses of distinct communities to real-world news. By characterizing a fine-grained spectrum of illocutionary tones and the underlying attitudes they manifest--validated through human-AI collaboration--our diagnosis reveals a persistent "realism gap": steering LLMs with explicit community prompts fails to inherently improve simulation fidelity. Analysis further identifies divergent behavioral signatures among frontier models, suggesting that current alignment strategies remain insufficient for capturing the sociolinguistic dynamics of online groups.

2605.27385 2026-05-28 cs.LG cs.AI

Personalized Observation Normalization for Federated Reinforcement Learning in Simulation Environments with Heterogeneity

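Yiran Pang, Zhen Ni, Xiangnan Zhong

AI summary To address the inconsistent input distributions and imbalanced parameter updates caused by differing state-transition dynamics when federated reinforcement learning runs in heterogeneous environments, proposes personalized observation normalization, in which each agent locally maintains a running mean and variance to normalize raw state inputs, accelerating training and improving performance.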

Comments Accepted at the International Joint Conference on Neural Networks (IJCNN) 2025

Abstract

Federated reinforcement learning (FedRL) enables multiple agents to collaboratively train a global policy without sharing raw data, making it ideal for privacy-sensitive applications. However, FedRL faces challenges in heterogeneous environments where differing state-transition dynamics lead to non-identical input distributions and imbalanced parameter updates during aggregation. Therefore, this paper develops a personalized observation normalization (PON) method, allowing each agent to locally normalize raw state inputs using a continuously updated running mean and variance. This design ensures consistent scaling of local features, so that no agent's features overshadow those of other agents during aggregation. Furthermore, we demonstrate that sharing normalization parameters across agents is ineffective due to the diverse local input distributions, which highlights the necessity of personalized statistics. Experiments on heterogeneous MuJoCo tasks show that our developed PON accelerates training and achieves superior performance compared to baseline methods.
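
The per-agent normalization described above can be sketched as a small running-statistics tracker. This is a minimal sketch, assuming Welford's online update for the mean and variance; the class name `RunningNormalizer`, the `eps` stabilizer, and the choice to pass observations through unchanged before any update are illustrative assumptions, not details from the paper.

```python
import math


class RunningNormalizer:
    """Per-agent running mean/variance tracker (hypothetical helper).

    Each agent would keep its own instance and never share the
    statistics, in the spirit of the personalized scheme above.
    """

    def __init__(self, dim, eps=1e-8):
        self.count = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim  # running sum of squared deviations
        self.eps = eps         # assumed stabilizer to avoid division by zero

    def update(self, obs):
        # Welford's online algorithm: one pass, numerically stable.
        self.count += 1
        for i, v in enumerate(obs):
            delta = v - self.mean[i]
            self.mean[i] += delta / self.count
            self.m2[i] += delta * (v - self.mean[i])

    def normalize(self, obs):
        if self.count == 0:
            return list(obs)  # no statistics yet: pass through unchanged
        out = []
        for i, v in enumerate(obs):
            var = self.m2[i] / self.count  # population variance
            out.append((v - self.mean[i]) / math.sqrt(var + self.eps))
        return out


# One agent observing a 1-dimensional state three times.
agent_norm = RunningNormalizer(dim=1)
for value in (1.0, 2.0, 3.0):
    agent_norm.update([value])
centered = agent_norm.normalize([2.0])
```

Because only policy parameters would be aggregated, two agents whose raw observations live on very different scales each feed the shared network inputs on a comparable scale.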

2605.27383 2026-05-28 cs.CL cs.AI

Bridging the Stability-Expressivity Gap: Synthetic Data Scaling and Preference Alignment for Low-Resource Spoken Language Models

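Yizhong Geng, Yanliang Li, Jinghan Yang, Tianhan Jiang, Boxun An, Ya Li, Xiaoyu Shen

AI summary To address the expressivity collapse that synthetic data causes in low-resource spoken language models, proposes two self-alignment frameworks (DGSA and TDSC) that restore prosodic diversity, outperforming commercial systems and enabling the first zero-shot voice cloning for Lao.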

Abstract

Spoken Language Models (SLMs) have emerged as a promising paradigm for speech synthesis by bypassing explicit grapheme-to-phoneme pipelines. However, their effectiveness in low-resource languages remains fundamentally limited by the scarcity of transcribed speech. In practice, synthetic data has become the primary strategy for scaling SLMs in such settings, providing reliable phonetic supervision when real data is insufficient. In this work, we show that this reliance introduces a fundamental trade-off, which we term the Stability-Expressivity Gap: while synthetic data improves phonetic accuracy, it progressively suppresses prosodic variability, ultimately leading to a collapse of expressivity (Synthetic Erosion). To bridge this gap, we propose two self-alignment frameworks. Disentanglement-Guided Self-Alignment (DGSA) recovers expressivity for complex languages by exploiting prosody-timbre separation. For regimes where authentic references are exceptionally limited, Temperature-Driven Self-Critique (TDSC) stabilizes generation through automated exploration and filtering. Our approach outperforms strong commercial systems, including ElevenLabs and Gemini Pro, and enables the first zero-shot voice cloning capability for Lao.

2605.27380 2026-05-28 cs.CL cs.AI

BioELX: Cross-lingual Biomedical Entity Linking via Alias-based Retrieval and LLM Ranking

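Yi Wang, Corina Dima, Liangyu Zhong, Steffen Staab

AI summary Proposes BioELX, a two-stage framework that strengthens a SapBERT retriever with multilingual Wikidata aliases and uses a pre-trained LLM ranker for context-aware disambiguation, achieving the best performance on multiple benchmarks without task-specific annotated training data.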

Comments 12 pages, 3 figures

Abstract

Cross-lingual biomedical entity linking (BEL) maps mentions in any language to unique identifiers in a biomedical knowledge base (KB), supporting clinical and biomedical NLP applications. However, expert-annotated training data for BEL are costly, especially for low-resource languages. Moreover, many cross-lingual BEL systems rely on SapBERT-based retrievers trained on predominantly English aliases in the KB, leading to poor generalization to unseen non-English mentions and limited context-aware disambiguation. We propose BioELX, a two-stage cross-lingual BEL framework that requires no task-specific annotated training corpora. In Stage 1, we enrich SapBERT training with Wikidata-derived multilingual aliases and use the resulting retriever to improve cross-lingual candidate retrieval. In Stage 2, we perform context-aware disambiguation with a pre-trained LLM ranker that jointly considers the mention context and candidate, eliminating the need for supervised training. Experiments on five benchmarks (XL-BEL, EMEA, Patent, WikiMed-DE, and MedMentions) show that BioELX achieves new state-of-the-art performance. It improves average Recall@1 on XL-BEL by +19.2, with especially large gains for low-resource languages, e.g., +21.6 on Turkish, +22.1 on Korean, +30.8 on Thai, and delivers consistent improvements on EMEA (+6.2), Patent (+5.4), and WikiMed-DE (+12.8). Code and resources will be released upon publication.
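
Since the headline numbers above are Recall@1 gains, a short sketch of how Recall@k is typically computed for entity linking may help. The function name `recall_at_k` and the identifiers in the example are hypothetical; the benchmarks' exact evaluation scripts may differ in details such as tie handling.

```python
def recall_at_k(ranked_candidates, gold_ids, k=1):
    """Fraction of mentions whose gold KB identifier appears among the
    top-k ranked candidates (hypothetical helper for illustration)."""
    hits = 0
    for candidates, gold in zip(ranked_candidates, gold_ids):
        if gold in candidates[:k]:
            hits += 1
    return hits / len(gold_ids)


# Two mentions, each with a ranked list of made-up KB identifiers.
ranked = [["ID_A", "ID_B"], ["ID_C", "ID_D"]]
gold = ["ID_A", "ID_D"]
r1 = recall_at_k(ranked, gold, k=1)  # only the first mention is a top-1 hit
r2 = recall_at_k(ranked, gold, k=2)  # both gold IDs are within the top 2
```

In a two-stage setup like the one described, the retriever bounds what the ranker can achieve: a gold identifier missing from the candidate list cannot be recovered by reranking.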

2605.27378 2026-05-28 cs.CL cs.CV cs.MA

OralAgent: Integrating Reasoning, Tools, and Knowledge for Interactive Dental Image Analysis

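Jing Hao, Siyuan Dai, Yongxin Zhang, Yuci Liang, Jiamin Wu, Jiahao Bao, Yuxuan Fan, Zanting Ye, Yanpeng Sun, Xinyu Zhang, Ming Hu, Liang Zhan, James Kit Hon Tsoi, Linlin Shen, Junjun He, Kuo Feng Hung

AI summary Proposes OralAgent, the first dental-specialized AI agent, which integrates 22 visual analysis tools and 368 classical dental textbooks into an automated framework for multimodal reasoning, tool-based decision-making, and knowledge retrieval, achieving state-of-the-art performance on multiple benchmarks.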

Comments 14 pages, 7 figures, 6 tables

Abstract

Dental image analysis plays a pivotal role in supporting accurate diagnosis and treatment planning in oral healthcare. Although recent advances have produced dental AI models for specific tasks and individual imaging modalities, their isolated designs limit practical use in real-world clinical workflows. In this paper, we present OralAgent, the first dental-specialized AI agent that unifies multimodal reasoning, tool-based decision-making, and knowledge-grounded retrieval within an end-to-end automated framework. It integrates 22 visual analysis tools and 368 widely-used classical dental textbooks, enabling autonomous reasoning, planning, tool use, knowledge retrieval, and multi-step workflow execution. Furthermore, we introduce OralCorpus, a large-scale, high-quality bilingual textual resource containing 134.8M tokens curated for dental retrieval-augmented generation (RAG). To evaluate models' multidisciplinary dental knowledge, we construct OralQA-ZH, a Chinese multiple-choice question benchmark consisting of 798 items across eleven oral subspecialties. Extensive experiments demonstrate that OralAgent achieves state-of-the-art performance on the MMOral-Uni, MMOral-OPG, and OralQA-ZH benchmarks, highlighting its effectiveness, interpretability, and adaptability in real-world clinical settings. The code and models are publicly available at https://github.com/isjinghao/OralAgent.

2605.27376 2026-05-28 cs.CL cs.AI

Unlocking Fine-Grained and Within-Utterance Speaking Style Control in Prompt-Based Text-to-Speech Models

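Jaehoon Kang, Yejin Lee, Yoonji Park, Kyuhong Shim

AI summary To address the lack of fine-grained control and within-utterance style variation in prompt-based TTS models, proposes inter-utterance style interpolation and intra-utterance style transition, achieving smooth style control through direction-vector interpolation in the embedding space together with KV-cache swapping and sliding-window attention masking.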

Abstract

While prompt-based text-to-speech (TTS) models enable natural language-driven speaking style control, they often provide limited fine-grained control and apply a single global style across an utterance. This restricts practical use cases that require continuous style attribute interpolation across utterances and time-varying style transitions within a single utterance. In this paper, we propose novel techniques to achieve both capabilities in existing prompt-based TTS models. For inter-utterance style interpolation, we compute direction vectors between contrastive style prompts in the embedding space and perform simple interpolation, enabling smooth transitions between style characteristics. For intra-utterance style transition, we first identify a strong attention bias toward early tokens in autoregressive TTS decoders, causing the initial audio realization to dominate subsequent generation. To mitigate this effect, we introduce KV-cache swapping and sliding-window attention masking. Experiments demonstrate that our proposed inter-utterance interpolation achieves a 99-100% success rate in gender conversion, up to 36 Hz pitch variation, and up to 1.6 syllables-per-second speed change. Our intra-utterance transition maintains a speaker similarity of 0.81-0.91 and achieves perceptual smoothness scores of 3.48-4.48.
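
Two ingredients named in the abstract, direction-vector interpolation and sliding-window attention masking, can be illustrated generically. This is a sketch under assumptions: the abstract does not say which embedding is used as the base, how the interpolation weight is chosen, or how wide the window is, so the function names, the toy 2-dimensional embeddings, and the window size below are all illustrative.

```python
def style_direction(emb_a, emb_b):
    """Direction vector from one style-prompt embedding to a contrasting one."""
    return [b - a for a, b in zip(emb_a, emb_b)]


def interpolate_style(base, direction, alpha):
    """Move `base` along `direction`; alpha=0 keeps the base style and
    alpha=1 reaches the contrasting style (assumed convention)."""
    return [x + alpha * d for x, d in zip(base, direction)]


def sliding_window_causal_mask(seq_len, window):
    """Boolean mask where query position q may attend to key position k
    only if k is one of the `window` most recent positions, q included."""
    return [[0 <= q - k < window for k in range(seq_len)]
            for q in range(seq_len)]


# Toy embeddings standing in for two contrastive style prompts.
slow, fast = [0.0, 0.0], [2.0, 4.0]
direction = style_direction(slow, fast)
halfway = interpolate_style(slow, direction, alpha=0.5)
mask = sliding_window_causal_mask(seq_len=4, window=2)
```

With such a mask, later positions can no longer attend to the earliest tokens, which is one generic way to limit the early-token attention bias the abstract identifies.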

2605.27375 2026-05-28 cs.CL

LCO: LLM-based Constraint Optimization for Safer Agentic LLMs in Real-world Tasks

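Jiayong Wan, Jiawei Chen, Zhaoxia Yin, Liu Shuyuan, Hang Su

AI summary Proposes the LCO framework, which constrains LLM behavior through a self-thought module and an evolutionary sampling module to reduce in-context reward hacking without fine-tuning the model; experiments show substantial safety improvements in both output-refine and policy-refine scenarios.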

Abstract

Large Language Models (LLMs) are increasingly acting as autonomous agents, but their continuous interaction with the environment can lead to in-context reward hacking (ICRH), a phenomenon where LLMs iteratively optimize their behavior to maximize proxy objectives, inadvertently producing harmful side effects. Existing defense methods are insufficient to address this risk, as ICRH arises not from adversarial inputs but from the model's own over-optimization. To mitigate this issue, we propose LLM-based Constraint Optimization (LCO), a framework that effectively reduces ICRH without model fine-tuning. LCO consists of two modules: a self-thought module, which guides the LLM to proactively deliberate and integrate potential safety constraints before execution; and an evolutionary sampling module, which employs LLM-based crossover and mutation to constrain the model's actions within a safe solution space while maintaining task performance. Experimental results demonstrate that LCO substantially alleviates ICRH in both output-refine and policy-refine scenarios. In particular, on the tweet engagement optimization task, LCO achieves a 39% reduction in the Toxicity Growth Rate (TGR) on GPT-4, while on the policy optimization benchmark, it reduces the ICRH Occurrence Rate by 15.23%, demonstrating safety improvement without sacrificing task performance.

2605.27374 2026-05-28 cs.CL

ICG: Improving Cover Image Generation via MLLM-based Prompting and Personalized Preference Alignment

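Zhipeng Bian, Jieming Zhu, Qijiong Liu, Wang Lin, Guohao Cai, Zhaocheng Du, Jiacheng Sun, Zhou Zhao, Zhenhua Dong

AI summary Proposes the ICG framework, which uses multimodal large language models and diffusion models to generate high-quality, personalized cover images by extracting semantic features via meta tokens, aligning them with user embeddings for personalization, and applying a multi-reward learning strategy.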

Comments Published in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 12268-12278, EMNLP 2025. Official version: https://doi.org/10.18653/v1/2025.emnlp-main.617

Journal ref
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (Main Track) EMNLP 2025 12268-12278

Abstract

Recent advances in multimodal large language models (MLLMs) and diffusion models (DMs) have opened new possibilities for AI-generated content. Yet, personalized cover image generation remains underexplored, despite its critical role in boosting user engagement on digital platforms. We propose ICG, a novel framework that integrates MLLM-based prompting with personalized preference alignment to generate high-quality, contextually relevant covers. ICG extracts semantic features from item titles and reference images via meta tokens, refines them with user embeddings, and injects the resulting personalized context into the diffusion model. To address the lack of labeled supervision, we adopt a multi-reward learning strategy that combines public aesthetic and relevance rewards with a personalized preference model trained from user behavior. Unlike prior pipelines relying on handcrafted prompts and disjointed modules, ICG employs an adapter to bridge MLLMs and diffusion models for end-to-end training. Experiments demonstrate that ICG significantly improves image quality, semantic fidelity, and personalization, leading to stronger user appeal and offline recommendation accuracy in downstream tasks. As a plug-and-play adapter bridging MLLMs and diffusion models, ICG is compatible with common checkpoints and requires no ground-truth labels during optimization.

2605.27373 2026-05-28 cs.AI cs.CL cs.CY

Identifying and Understanding Human Values in Text: A Tailorable LLM-based Architecture

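Eduardo de la Cruz Fernández, Marcelo Karanik, Sascha Ossowski

AI summary Proposes a tailorable LLM-based architecture that detects the intensity of human values in text through three modules (specification generation, text labeling, and intensity assessment), avoiding dependence on a specific value theory or complex prompt engineering; experiments show good detection performance.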

Comments 8 pages, 1 figure. Published in Proceedings of the 18th International Conference on Agents and Artificial Intelligence (ICAART 2026), Volume 5

Journal ref
Proc. ICAART 2026, Vol. 5, SciTePress, 2026, pp. 4096-4103

Abstract

As intelligent systems become more autonomous, the scientific community focuses on creating decision-making mechanisms that include ethical and moral considerations, unlike traditional utility-maximisation models. To achieve this, a key aspect is assessing how well these decisions align with human values. To this end, a promising line of research is centred on developing approaches based on Large Language Models (LLMs) to identify human values from text, whether explicit or implicit, enabling their recognition throughout. This paper introduces an LLM-based architecture to detect and quantify the intensity of human values in text, avoiding the limitations of previous approaches tied to a specific value theory or complex prompt engineering. The architecture comprises three coordinated modules: one that generates structured value specifications from the foundational texts of any theoretical framework; one that labels texts using these specifications; and one that assigns graded support or resistance based on rhetorical and semantic evidence. This modular approach separates the task of conceptualising human values from the task of detecting them, creating a scalable and reproducible process driven by value specifications adaptable to various theories. The architecture was instantiated with multiple LLMs and evaluated using the ValueEval dataset. The experiments demonstrate good detection performance, confirming the generality of the pipeline.

2605.27365 2026-05-28 cs.CV cs.AI cs.LG cs.RO

LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding

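Shihao Wang, Shilong Liu, Yuanguo Kuang, Xinyu Wei, Yangzhou Liu, Zhiqi Li, Yunze Man, Guo Chen, Andrew Tao, Guilin Liu, Jan Kautz, Lei Zhang, Zhiding Yu

AI summary Proposes Parallel Box Decoding (PBD), which decodes bounding boxes and points as atomic units in a single step and, combined with the large-scale dataset LocateAnything-Data, enables efficient unified grounding and detection, substantially raising decoding throughput while maintaining high accuracy.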

Comments fix github link

Abstract

Vision-language models (VLMs) commonly formulate visual grounding and detection as a coordinate-token generation problem, serializing each 2D box into multiple 1D tokens that are learned and decoded largely independently. This token-by-token decoding mismatches the coupled structure of box geometry and creates a practical inference bottleneck due to strictly sequential generation. We introduce LocateAnything, a unified generative grounding and detection framework based on Parallel Box Decoding (PBD). By decoding geometric elements such as bounding boxes and points as atomic units in a single step, LocateAnything preserves intra-box geometric coherence and unlocks substantial parallelism. We show that PBD improves both decoding throughput and localization accuracy. We further develop a scalable data engine and curate LocateAnything-Data, a large-scale dataset with more than 138 million training samples, substantially increasing data diversity for high-precision localization. Extensive evaluations show that LocateAnything advances the speed-accuracy frontier, achieving significantly higher decoding throughput while improving high-IoU localization quality across diverse benchmarks. The results highlight the complementary benefits of Parallel Box Decoding and large-scale training data in enabling efficient and precise unified visual grounding and detection.
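
The abstract's "high-IoU localization quality" refers to intersection over union between a predicted and a ground-truth box. The sketch below shows the standard computation; the function name `box_iou` and the `(x1, y1, x2, y2)` corner format are assumptions for illustration, and the paper's own box encoding is not specified in the abstract.

```python
def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes given as
    (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2 (assumed format)."""
    # Corners of the overlap rectangle, if the boxes overlap at all.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


overlap = box_iou((0, 0, 2, 2), (1, 1, 3, 3))   # overlap area 1, union 7
same = box_iou((0, 0, 2, 2), (0, 0, 2, 2))      # identical boxes
apart = box_iou((0, 0, 1, 1), (2, 2, 3, 3))     # disjoint boxes
```

Because the metric depends on all four coordinates jointly, an error in a single coordinate lowers the score for the whole box, which is consistent with the abstract's point that box geometry is a coupled structure.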

2605.27348 2026-05-28 cs.CV cs.AI

When Eyes Betray AI: Social Gaze Consistency as a Semantic Cue for AI-Generated Image Detection

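Jihyeon Kim, Sohee Kim, Soosan Lee, Souhwan Jung, James Matthew Rehg, Hyesong Choi

AI summary Proposes social gaze consistency as a high-level semantic cue and, through a diagnostic dataset, block-compositional caption supervision, and cross-architecture validation, shows that the cue is effective for detecting AI-generated images and explains the mechanism behind its transfer across generators.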

Comments 23 pages, 2 figures, 17 tables

Abstract

Recent generative models have largely closed the gap on low-level artifacts - pixel fingerprints, frequency anomalies, upsampling traces - particularly in person-centric and partial-edit settings where the manipulated region is small and surrounded by photometrically authentic content. We introduce Social Gaze Consistency, a high-level semantic cue defined as the mutual coherence of gaze direction, head-eye alignment, and pupil placement between interacting individuals, and show that it constitutes a previously underutilized detection axis orthogonal to existing low-level paradigms. We instantiate this insight through three coupled mechanisms: (i) a controlled diagnostic dataset with region-specific perturbations of gaze-consistent imagery, where strict pair-level grouping forecloses generator-fingerprint memorization as an optimization-time shortcut rather than relying on augmentation; (ii) Block-Compositional Caption Supervision, which holds a single 5-block reasoning skeleton invariant across 1,250 macro-combined captions, decoupling reasoning consistency from surface diversity; (iii) Cross-architecture validation showing the same supervision improves a vision-language backbone (FakeVLM) by +3.7 pp on the COCOAI Interaction subset (balanced accuracy 67.8 -> 71.5) and +1.3 pp on the COCOAI Person subset (83.0 -> 84.3), with consistent gains on a vision-only backbone (Effort), evidencing a backbone-agnostic cue. Real- and fake-class recalls rise simultaneously, ruling out a "predict-all-fake" artifact. A four-step mechanistic account - paired-edit shortcut blocking, hard-to-easy difficulty transfer, CLIP prior preservation, and diffusion-family shared spectral weakness in periocular structure - explains why training on a single inpainter (FLUX.1-Fill) transfers to multi-generator suites. We will release the code upon acceptance to facilitate reproducibility.
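
The abstract reports balanced accuracy and notes that real-class and fake-class recall rise together. A small sketch shows why that matters: balanced accuracy is the mean of per-class recall, so a degenerate detector that labels everything fake scores only 0.5. The function name `balanced_accuracy` and the toy labels are illustrative assumptions.

```python
def balanced_accuracy(y_true, y_pred, classes=("real", "fake")):
    """Mean of per-class recall (hypothetical helper for illustration)."""
    recalls = []
    for c in classes:
        total = sum(1 for t in y_true if t == c)
        if total == 0:
            raise ValueError(f"class {c!r} does not occur in y_true")
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(correct / total)
    return sum(recalls) / len(recalls)


# An imbalanced toy test set and a detector that always predicts "fake".
y_true = ["real"] * 2 + ["fake"] * 8
always_fake = ["fake"] * 10
plain_acc = sum(t == p for t, p in zip(y_true, always_fake)) / len(y_true)
bal_acc = balanced_accuracy(y_true, always_fake)
```

Here plain accuracy rewards the degenerate detector (8 of 10 correct), while balanced accuracy stays at chance level, so gains in balanced accuracy accompanied by higher recall on both classes cannot come from that shortcut.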

2605.27258 2026-05-28 cs.SD cs.AI

PilotTTS: A Disciplined Modular Recipe for Competitive Speech Synthesis

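Bowen Li, Shaotong Guo, Zhen Wang, Yang Xiang, Mingli Jin, Yihang Lin, Jiahui Zhao, Weibo Xiong, Dongrui Zhang, Keming Chen, Yunze Gao, Zeyang Lin, Yuze Zhou, Yue Liu

AI summary Proposes PilotTTS, a lightweight autoregressive TTS system that achieves competitive performance through a minimalist architecture and rigorous data engineering (using only 200K hours of data processed with open-source tools), supports zero-shot voice cloning and emotion, paralinguistic, and dialect synthesis, and obtains the lowest WER on test-en and the highest speaker similarity on the Seed-TTS Eval benchmark.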

详情
AI中文摘要

构建最先进的文本转语音(TTS)系统通常需要数百万小时的专有数据和复杂的多阶段架构,这给资源受限的研究团队带来了巨大障碍。在本报告中,我们提出了PilotTTS,一种轻量级自回归TTS系统,通过极简架构和严格的数据工程实现了竞争性能。PilotTTS仅使用20万小时的数据进行训练,这些数据完全通过开源工具处理。具体来说,我们的贡献包括:(1)一个可复现的多阶段数据处理流水线,涵盖质量评估、标签标注和过滤;(2)一个紧凑的模型架构,采用基于Q-Former的条件化,通过跨样本配对训练将说话人身份与说话风格解耦。在统一框架内,PilotTTS支持零样本语音克隆、情感合成(11类)、副语言合成(4类)和中文方言合成(14种方言)。在Seed-TTS Eval基准上,PilotTTS在test-en上实现了最低的WER 1.50%,在test-zh上实现了CER 0.87%,并在两个测试集上取得了最高的说话人相似度(0.862和0.815),优于使用更大数据集训练的系统。我们在https://github.com/AMAPVOICE/PilotTTS上发布了完整的数据流水线配方、预训练权重和代码。

English abstract

Building state-of-the-art text-to-speech (TTS) systems typically demands millions of hours of proprietary data and complex multi-stage architectures, creating substantial barriers for resource-constrained research teams. In this report, we present PilotTTS, a lightweight autoregressive TTS system that achieves competitive performance through minimalist architecture and rigorous data engineering. PilotTTS is trained on only 200K hours of data processed entirely with open-source tools. Specifically, our contributions are: (1) a reproducible multi-stage data processing pipeline covering quality assessment, label annotation, and filtering, and (2) a compact model architecture that employs Q-Former-based conditioning to decouple speaker identity from speaking style via cross-sample paired training. Within a unified framework, PilotTTS supports zero-shot voice cloning, emotion synthesis (11 categories), paralinguistic synthesis (4 categories), and Chinese dialect synthesis (14 dialects). On the Seed-TTS Eval benchmark, PilotTTS achieves the lowest WER of 1.50% on test-en, a CER of 0.87% on test-zh, and the highest speaker similarity on both test sets (0.862 and 0.815), outperforming systems trained on significantly larger datasets. We release the complete data pipeline recipe, pretrained weights, and code at https://github.com/AMAPVOICE/PilotTTS.
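For readers less familiar with the reported error rates: WER is conventionally the word-level edit distance between reference and hypothesis divided by the reference length, and CER is the same quantity at character level. The sketch below implements that textbook definition only. It is not the Seed-TTS Eval scoring code, and the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

Real evaluation pipelines usually normalize text (case, punctuation) before scoring; that step is omitted here.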

2605.27155 2026-05-28 cs.CV cs.AI

Semantic Robustness Probing via Inpainting: An Interactive Tool for Safety-Critical Object Detection

通过修复进行语义鲁棒性探测:面向安全关键目标检测的交互工具

Nico Steckhan, Krutarth Prajapati, Weija Shao, Silvia Vock

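AI summary Proposes SemProbe, a tool that generates semantic probes through controlled diffusion-based inpainting, supports user-defined masks and factors, and automatically evaluates and logs changes in object-detector robustness.

Details
AI Chinese abstract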

在安全关键领域测试目标检测器需要超越像素级损坏的语义上有意义的探针。我们提出了SemProbe,一个用于语义鲁棒性探测的工具:用户上传部署图像,手动或自动创建掩码,选择操作设计域衍生因素(或自定义提示),并运行基于扩散的可控修复。系统支持批量作业、并行种子/工作流变体以及可配置的生成参数。每次输出后,自动运行模型推理并显示带有性能差异的带注释的前后对比。所有探针都作为结构化工件记录,从而能够提供与安全评估工作流一致的可追溯鲁棒性证据。我们在尺寸锯的手部检测上演示了SemProbe,针对保险导向测试标准中的因素。

English abstract

Testing object detectors in safety-critical domains requires semantically meaningful probes beyond pixel-level corruptions. We present SemProbe, a tool for semantic robustness probing: users upload deployment images, create masks manually or automatically, select operational design domain-derived factors (or custom prompts), and run diffusion-based controlled inpainting. The system supports batch jobs, parallel seed/workflow variations, and configurable generation parameters. After each output, model inference runs automatically and displays annotated before/after comparisons with performance deltas. All probes are logged as structured artifacts, enabling traceable robustness evidence aligned with safety evaluation workflows. We demonstrate SemProbe on hand detection for dimension saws, targeting factors from insurance-oriented test criteria.

2605.26790 2026-05-28 cs.LG physics.space-ph

Pretrained Approximators for Low-Thrust Trajectory Cost and Reachability

低推力轨迹成本与可达性的预训练近似器

Zhong Zhang, Giacomo Acciarini, Dario Izzo, Hexi Baoyin, Francesco Topputo

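AI summary Proposes machine learning surrogates that accurately approximate fuel consumption and transfer feasibility for low-thrust trajectories, generalizing across missions through a homotopy-ray strategy and a self-similar transformation; the models and datasets are open-sourced.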

Comments Submitted to the Journal of Guidance, Navigation and Control. Zenodo entry: https://doi.org/10.5281/zenodo.18769170

Details
AI Chinese abstract

低推力轨迹设计严重依赖于对燃料消耗和转移可行性的重复评估,这需要昂贵的优化控制解。在这项工作中,我们表明这些量可以通过机器学习代理模型准确近似,从而在广泛场景中实现快速且可扩展的评估。通过增加数据集大小和模型容量,我们观察到低推力轨迹优化遵循缩放定律,性能随训练数据和网络参数的对数线性提升,且在探索范围内没有饱和迹象。基于这一观察,我们使用针对任务设计需求提出的同伦射线策略构建了一个大规模数据集。关键是引入自相似变换,允许在半长轴、倾角和中心天体之间泛化,避免重新训练。因此,相同的神经近似器可应用于不同的轨道环境和任务类别。所提出的模型准确预测了单圈和多圈转移的最优燃料消耗和最小转移时间。其性能和泛化能力在公开数据集、全球轨迹优化竞赛的多小行星飞越问题以及小行星交会任务设计中得到验证。模型和数据集作为开源发布,以支持航天社区。

English abstract

Low-thrust trajectory design relies heavily on repeated evaluations of fuel consumption and transfer feasibility, which require expensive optimal control solutions. In this work, we show these quantities can be accurately approximated by machine learning surrogates, enabling fast and scalable evaluation across a wide range of scenarios. By increasing both dataset size and model capacity, we observe that low-thrust trajectory optimization follows a scaling law, with performance improving linearly with the logarithm of training data and network parameters, and no evidence of saturation within the explored regime. Guided by this observation, we construct a large-scale dataset using the proposed homotopy-ray strategy tailored to mission design requirements. A key ingredient is a self-similar transformation, which allows generalization across semi-major axes, inclinations, and central bodies without retraining. As a result, the same neural approximator can be applied to diverse orbital environments and mission classes. The proposed models accurately predict optimal fuel consumption and minimum transfer time for single- and multi-revolution transfers. Their performance and generalization are demonstrated on a public dataset, a multi-asteroid flyby problem from the Global Trajectory Optimization Competition, and an asteroid rendezvous mission design. The models and datasets are released as open-source to support the space community.
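The scaling claim, a metric that changes linearly in the logarithm of dataset size, is the kind of relation one can check with an ordinary least-squares fit against log10(N). The sketch below is illustrative only: the data points are synthetic values generated from a made-up line, not results from the paper, and `fit_log_linear` is a hypothetical helper name.

```python
import math


def fit_log_linear(sizes, values):
    """Least-squares fit of values ~ a + b * log10(size); returns (a, b)."""
    xs = [math.log10(n) for n in sizes]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(values) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Synthetic points lying exactly on value = 1.0 - 0.1 * log10(N) (made up).
sizes = [10**3, 10**4, 10**5, 10**6]
values = [1.0 - 0.1 * math.log10(n) for n in sizes]
a, b = fit_log_linear(sizes, values)
```

On real measurements, systematic curvature in the residuals of such a fit would be one sign of saturation.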

2605.26730 2026-05-28 cs.CL

PRISM: A Multi-Dimensional Benchmark for Evaluating LLM Peer Reviewers

PRISM:评估LLM同行评审员的多维基准

Ngoc Phan Phuoc Loc, Toan Huynh La Viet, Thanh Tran Khanh, Duy A Nguyen, Tuan Anh Nguyen Pham, Thanh Nguyen, Nitesh V. Chawla, Wray Buntine, Kok-Seng Wong, Khoa D. Doan, Binh T. Nguyen

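AI summary Proposes the PRISM benchmark, which evaluates LLM review quality along four dimensions: depth of analysis, novelty assessment, flaw identification and major-issue prioritization, and multi-dimensional constructiveness. LLMs match or even surpass humans on individual dimensions, but no system is consistently balanced across all of them, which suggests LLM reviewers are better suited as targeted supplements to human review.

Details
AI Chinese abstract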

机器学习会议投稿量的快速增长给科学同行评审系统带来了压力,并加剧了对基于LLM的自动评审系统的兴趣。然而,这些系统实际上有多好,特别是在捕捉科学漏洞方面与人类评审员相比如何,仍然知之甚少。在这项工作中,我们引入了PRISM(通过结构化多维评估的同行评审智能),这是一个评估框架,从四个维度评估评审质量:分析深度、新颖性评估、缺陷识别与主要问题排序、以及多维建设性。与大多数基于表面指标(如ROUGE和BLEU)或未约束的LLM-as-a-judge提示(将流畅性与严谨性混为一谈)的现有评估不同,PRISM将每个维度建立在论点挖掘、检索增强验证和基于共识的评分之上。我们应用PRISM对来自ICLR、ICML和NeurIPS的分层评审语料库中的五个领先自动化评审系统和人类评审员进行基准测试。结果显示,LLM在单个维度上可以匹配或超越人类评审员:可比较的分析深度、更强的新颖性验证以及高度准确的批评优先级排序。然而,没有一个系统能在所有维度上同时匹配人类基线的平衡表现。每个系统都表现出独特的专业化特征,带有典型的盲点——聚合指标完全遗漏的失败模式。这意味着LLM评审员最好被理解为人类评审的针对性补充,在特定维度上有效,但作为独立替代品不可靠。我们的演示和关键结果可在https://khanhthanhdev.github.io/prism-page/找到。

English abstract

The rapid growth in submissions to machine learning venues has strained the scientific peer-review system and intensified interest in LLM-based automated peer reviewers. However, how good these systems actually are, especially compared to human reviewers at catching scientific gaps, remains poorly understood. In this work, we introduce PRISM (Peer Review Intelligence via Structured Multi-dimensional assessment), a benchmarking framework that evaluates review quality across four dimensions: Depth of Analysis, Novelty Assessment, Flaw Identification & Major Issues Prioritization, and Multi-dimensional Constructiveness. Unlike most existing evaluations based on surface-level metrics like ROUGE and BLEU, or unconstrained LLM-as-a-judge prompting that conflates fluency with rigor, PRISM grounds each dimension in argument mining, retrieval-augmented verification, and consensus-based scoring. We apply PRISM to benchmark five leading automated reviewer systems and human reviewers on a stratified corpus of reviews from ICLR, ICML, and NeurIPS. The results reveal that LLMs can match or beat human reviewers on individual dimensions: comparable depth of analysis, stronger novelty verification, and highly accurate critique prioritization. However, no single system consistently matches the balanced performance of the human baseline across all dimensions at once. Each exhibits a distinct specialization profile with characteristic blind spots -- failure modes that aggregate metrics miss entirely. The implication is that LLM reviewers are best understood as targeted supplements to human review, effective within specific dimensions, but unreliable as standalone replacements. Our demo and key results can be found at https://khanhthanhdev.github.io/prism-page/.

2605.26624 2026-05-28 cs.CV

MSCGC-KAN: Multi-scale Causal Graph Convolution and Kolmogorov-Arnold Feature Mapping for EEG Emotion Recognition

MSCGC-KAN: 用于脑电情感识别的多尺度因果图卷积与Kolmogorov-Arnold特征映射

Haoliang Gong, Qingshan She, Jiale Xu, Yunyan Gao, Xugang Xi

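AI summary Proposes MSCGC-KAN, which builds a structured task head from multi-scale causal graph convolution and Kolmogorov-Arnold feature mapping. On a pretrained CBraMod backbone it strengthens multi-scale temporal modeling, learnable inter-channel connectivity modeling, and nonlinear discriminative mapping, markedly improving EEG emotion recognition.

Details
AI Chinese abstract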

基于脑电图的情感识别是一项重要的情感计算任务,最近的脑电图基础模型为下游适应提供了有用的通用表示。然而,在微调设置下,三个局限性仍然突出:多尺度情感动态建模不足、通道间功能连接利用不充分以及简单线性分类头的表达能力有限。为了解决这些问题,本文提出了一种新的脑电情感识别方法,称为MSCGC-KAN,它引入了一个由多尺度因果图卷积和Kolmogorov-Arnold特征映射组成的结构化任务头。基于预训练的CBraMod骨干,MSCGC-KAN通过在紧凑的任务特定头中联合加强多尺度时间建模、可学习通道间连接建模和非线性判别映射来增强下游适应。这种设计保留了基础模型的表示优势,同时使分类器对情感相关的时空模式更加敏感。在公开的FACED和SEED-VII数据集上进行了大量实验。所提方法在FACED上实现了60.66%的平衡准确率、0.5525的Cohen's Kappa和60.40%的加权F1分数,在SEED-VII上分别获得了33.27%、0.2223和33.64%。与CBraMod+Linear基线相比,在两个数据集上平衡准确率分别提高了5.91和2.03个百分点。这些结果表明,在微调预训练脑电模型时,结构化任务头设计是改进脑电情感识别的有效方法。

English abstract

Electroencephalogram (EEG)-based emotion recognition is an important affective computing task, and recent EEG foundation models provide useful generic representations for downstream adaptation. However, under the fine-tuning setting, three limitations remain prominent: insufficient modeling of multi-scale emotional dynamics, inadequate exploitation of inter-channel functional connectivity, and the limited expressive power of simple linear classification heads. To address these issues, this paper proposes a new EEG emotion recognition method, termed MSCGC-KAN, which introduces a structured task head composed of multi-scale causal graph convolution and Kolmogorov-Arnold feature mapping. Built on a pre-trained CBraMod backbone, MSCGC-KAN enhances downstream adaptation by jointly strengthening multi-scale temporal modeling, learnable inter-channel connectivity modeling, and nonlinear discriminative mapping within a compact task-specific head. This design preserves the representation advantage of the foundation model while making the classifier more sensitive to emotion-related spatiotemporal patterns. Extensive experiments are conducted on the public FACED and SEED-VII datasets. The proposed method achieves a balanced accuracy of 60.66%, a Cohen's Kappa of 0.5525, and a weighted F1-score of 60.40% on FACED, and obtains 33.27%, 0.2223, and 33.64%, respectively, on SEED-VII. Compared with the CBraMod+Linear baseline, the balanced accuracy is improved by 5.91 and 2.03 percentage points on the two datasets, respectively. These results indicate that structured task-head design is an effective way to improve EEG emotion recognition when fine-tuning pre-trained EEG models.
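Two of the reported metrics have short standard definitions: balanced accuracy is the mean of per-class recall, and Cohen's Kappa corrects observed agreement for the agreement expected by chance. The following is a minimal pure-Python sketch of those definitions, not the paper's evaluation code.

```python
from collections import Counter


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for c in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


def cohen_kappa(y_true, y_pred):
    """(p_o - p_e) / (1 - p_e), with p_e from the label marginals."""
    n = len(y_true)
    p_o = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

A classifier that always predicts the majority class can score high plain accuracy yet gets a Kappa of 0, which is why both metrics are informative on many-class emotion datasets.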

2605.26552 2026-05-28 cs.LG cs.AI

Aligning Few-Step Generative Models by Amortizing Sample-based Variational Inference

通过摊销基于样本的变分推断来对齐少步生成模型

Jaewoo Lee, Hyeongyu Kang, Dohyun Kim, Kyuil Sim, Woocheol Shin, Minsu Kim, Taeyoung Yun, Jeongjae Lee, Sanghyeok Choi, Tabitha Edith Lee, Jong Chul Ye, Jinkyoo Park

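AI summary Proposes the FAV framework, which uses Stein Variational Gradient Descent for sample-based variational inference and amortizes the particle updates into the generator parameters via fixed-point regression to align few-step generative models, outperforming existing methods on robotic manipulation and image generation tasks.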

Comments Under review

Details
AI Chinese abstract

对齐少步生成模型具有挑战性,因为现有的对齐框架通常依赖于限制性假设:可处理的似然、特定的ODE/SDE求解器或特定的模型族。我们引入了FAV(Few-step Generative Models Alignment via Sample-based Variational Inference),这是一个通用的对齐框架,仅需要对生成器和参考分布的样本访问。我们将对齐视为从倾斜于参考分布的奖励倾斜分布中采样。我们利用Stein变分梯度下降作为基于样本的变分推断方案,并通过固定点回归将粒子更新摊销到生成器参数中。我们在两个领域评估了FAV:机器人操作和图像生成器对齐。在机器人操作的生成策略对齐中,FAV在56个离线RL任务和30个离线到在线RL任务中优于现有的策略提取基线。对于图像生成器对齐,FAV微调了多种少步骨干模型,包括GAN、漂移模型、一致性模型和流映射,从ImageNet-$256$扩展到1024$^2$文本到图像合成。代码可在https://github.com/Jaewoopudding/FAV获取。

English abstract

Aligning a few-step generative model is challenging, since existing alignment frameworks typically rely on restrictive assumptions: a tractable likelihood, a specific ODE/SDE solver, or a particular model family. We introduce FAV, Few-step Generative Models Alignment via Sample-based Variational Inference, a general alignment framework that requires only sample access to the generator and the reference distribution. We cast alignment as sampling from a reward-tilted distribution anchored to a reference distribution. We leverage Stein Variational Gradient Descent as a sample-based variational inference scheme and amortize its particle updates into the generator parameters via fixed-point regression. We evaluate FAV on two domains: robotic manipulation and image generator alignment. On generative policy alignment for robotic manipulation, FAV outperforms prevailing policy extraction baselines across 56 offline and 30 offline-to-online RL tasks. For image generator alignment, FAV fine-tunes diverse few-step backbones, including GANs, drifting models, consistency models, and flow maps, scaling from ImageNet-256 to $1024^2$ text-to-image synthesis. Code is available at https://github.com/Jaewoopudding/FAV.
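To make the Stein Variational Gradient Descent ingredient concrete, here is a toy one-dimensional sketch of the standard SVGD particle update with an RBF kernel. Everything specific is an assumption for illustration: the reference is N(0, 1), the reward is r(x) = 2x, and the score is written analytically, whereas the abstract states that FAV needs only sample access. The amortization into generator parameters is not shown.

```python
import math


def svgd_step(xs, score, step=0.1, h=1.0):
    """One SVGD update: kernel-weighted score (attraction) plus kernel gradient (repulsion)."""
    n = len(xs)
    new_xs = []
    for xi in xs:
        phi = 0.0
        for xj in xs:
            k = math.exp(-(xj - xi) ** 2 / (2 * h * h))
            grad_k = -(xj - xi) / (h * h) * k  # derivative of k(xj, xi) w.r.t. xj
            phi += k * score(xj) + grad_k
        new_xs.append(xi + step * phi / n)
    return new_xs


def tilted_score(x, beta=1.0):
    """Score of p(x) proportional to N(x; 0, 1) * exp(r(x) / beta) with r(x) = 2x (toy choice)."""
    return -x + 2.0 / beta


particles = [-1.0, -0.5, 0.0, 0.5, 1.0]
for _ in range(1000):
    particles = svgd_step(particles, tilted_score)
mean = sum(particles) / len(particles)
```

With these toy choices the tilted target is N(2, 1), so the particle mean should settle near 2 while the repulsion term keeps the particles spread out rather than collapsing onto the mode.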

2605.26189 2026-05-28 cs.LG cs.AI

Max-Window Scale Estimation for Near-Lossless HiF8 W8A8 Quantization-Aware Training

近无损 HiF8 W8A8 量化感知训练的最大窗口尺度估计

Yingying Cheng, Jinquan Shi, Li Zhou, Zhiyang He, Zhaoyi Sun, Fan Zhang, Jie Sun

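AI summary Targets two orthogonal failure modes in HiF8 W8A8 quantization-aware training (amax saturation and catastrophic forgetting) and proposes two fixes: a conservative max-algorithm DTS strategy over a 64-step history window, and a 500-step BF16 warmup followed by low-learning-rate QAT. On OpenPangu-Embedded-1B this reaches performance close to the BF16 baseline.

Details
AI Chinese abstract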

使用低位浮点格式的量化感知训练(QAT)能够实现高效的 LLM 部署,但会引入标准训练指标无法察觉的微妙失效模式。我们通过延迟张量缩放(DTS)的视角,对 OpenPangu-Embedded-1B 的 HiF8 W8A8 QAT 进行了系统研究。在八个受控实验中,我们识别并解耦了两种正交的失效模式:(i) amax 饱和,其中延迟的尺度估计通过前向传播裁剪静默地破坏知识敏感表示,以及 (ii) 灾难性遗忘,其中激进的学习率独立于量化覆盖预训练的常识知识。两者都无法仅从训练损失中检测到。我们通过保守的 64 步历史窗口最大算法 DTS 策略解决 amax 饱和,并通过 500 步 BF16 预热后以 lr=10^{-5} 进行 QAT 来缓解遗忘。两种修复都是必要且充分的:我们的最终配置在匹配的 BF16 基线上实现了 0.43% MMLU 下降、0.58% HellaSwag 下降和 0.22% ARC-Challenge 下降,训练损失 APE 在 10,000 步内仅为 0.11%。

English abstract

Quantization-aware training (QAT) with low-bit floating-point formats enables efficient LLM deployment, yet introduces subtle failure modes invisible to standard training metrics. We present a systematic study of HiF8 W8A8 QAT for OpenPangu-Embedded-1B through the lens of Delayed Tensor Scaling (DTS). Across eight controlled experiments, we identify and disentangle two orthogonal failure modes: (i) amax saturation, where delayed scale estimates silently corrupt knowledge-sensitive representations via forward-pass clipping, and (ii) catastrophic forgetting, where an aggressive learning rate overwrites pretrained commonsense knowledge independently of quantization. Neither is detectable from training loss alone. We address amax saturation with a conservative max-algorithm DTS strategy over a 64-step history window, and mitigate forgetting via a 500-step BF16 warmup followed by QAT at lr=10^{-5}. Both fixes are necessary and sufficient: our final configuration achieves 0.43% MMLU drop, 0.58% HellaSwag drop, and 0.22% ARC-Challenge drop versus a matched BF16 baseline, with a training loss APE of only 0.11% over 10,000 steps.
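The max-over-history idea can be sketched in a few lines: the scale used at a given step is derived from the largest absolute value (amax) recorded over the previous steps in a fixed window, so a recent spike keeps the scale conservative until it leaves the window. This is a hedged illustration, not the paper's implementation. The class name is hypothetical and `fmt_max` is a placeholder for the largest representable magnitude of the target format, whose real HiF8 value is not given in the abstract.

```python
from collections import deque


class MaxWindowScale:
    """Delayed scaling: the scale comes from amax values seen in earlier steps."""

    def __init__(self, window=64, fmt_max=1.0, eps=1e-12):
        self.history = deque(maxlen=window)  # oldest amax drops out automatically
        self.fmt_max = fmt_max               # placeholder for the format's max value
        self.eps = eps

    def scale(self):
        """Multiplier that maps the window's largest amax onto fmt_max."""
        if not self.history:
            return 1.0  # no history yet: fall back to identity scaling
        return self.fmt_max / max(max(self.history), self.eps)

    def update(self, tensor_values):
        """Record the amax of the tensor observed at the current step."""
        self.history.append(max(abs(v) for v in tensor_values))


est = MaxWindowScale(window=3, fmt_max=8.0)
est.update([1.0, -4.0])   # spike with amax 4
est.update([0.5])         # a later, smaller step does not shrink the window max
scale_after_spike = est.scale()
```

Using the window maximum rather than only the most recent amax trades some dynamic range for a lower chance that the current tensor exceeds the delayed estimate and gets clipped.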

2605.26114 2026-05-28 cs.AI cs.CL

MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research

MobileGym: 一个可验证且高度并行的移动GUI智能体研究仿真平台

Dingbang Wu, Rui Hao, Haiyang Wang, Shuzhe Wu, Han Xiao, Zhenghong Li, Bojiang Zhou, Zheng Ju, Zichen Liu, Lue Fan, Zhaoxiang Zhang

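AI summary Proposes MobileGym, a browser-based, lightweight, fully controllable mobile environment that uses structured JSON state to provide verifiable outcome signals and low-cost parallel reinforcement learning, accompanied by a benchmark of 416 parameterized task templates.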

Comments Project page: https://mobilegym.github.io

Details
AI Chinese abstract

我们提出MobileGym,一个托管于浏览器、轻量级、完全可控的日常移动使用环境,旨在实现交互保真度而无需复制专有后端。它实现了之前日常应用无法实现的两种能力:通过基于确定性状态判断的结构化JSON状态实现可验证的结果信号,以及通过低成本的并行回滚实现可扩展的在线强化学习。完整的环境状态被捕获、配置、分支和比较为结构化JSON,单个服务器可托管数百个并行实例,每个实例约400 MB内存,冷启动约3秒。分层状态模型和声明式任务定义框架使状态可编程性和任务创建在大规模下实用,单一的程序化判断机制同时提供确定性评估结果和密集的强化学习奖励。配套的MobileGym-Bench提供了416个参数化任务模板,包括256个测试模板和160个训练模板,覆盖28个应用,具有确定性判断器和结构化的AnswerSheet协议,避免了自由文本匹配失败。在Sim-to-Real案例研究中,Qwen3-VL-4B-Instruct上的GRPO在256任务测试集上获得了+12.8个百分点的提升,在59任务真实设备信号子集上,真实设备执行保留了模拟侧训练增益的95.1%。项目页面:https://mobilegym.github.io。

English abstract

We present MobileGym, a browser-hosted, lightweight, fully controllable environment for everyday mobile use, targeting interaction fidelity without replicating proprietary backends. It enables two capabilities previously out of reach for everyday apps: verifiable outcome signals through deterministic state-based judging over structured JSON state, and scalable online RL through low-cost parallel rollouts. The full environment state is captured, configured, forked, and compared as structured JSON, and a single server can host hundreds of parallel instances, with about 400 MB memory per instance and about 3 s cold start. A layered state model and a declarative task-definition framework keep state programmability and task creation practical at scale, and a single programmatic judging mechanism delivers both deterministic evaluation verdicts and dense RL rewards. The accompanying MobileGym-Bench provides 416 parameterized task templates, including 256 test and 160 train templates, over 28 apps, with deterministic judges and a structured AnswerSheet protocol that avoids free-text matching failures. In a Sim-to-Real case study, GRPO on Qwen3-VL-4B-Instruct gains +12.8 percentage points on the 256-task test set, and on a 59-task real-device signal subset, real-device execution retains 95.1% of the simulation-side training gain. Project page: https://mobilegym.github.io.
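Deterministic state-based judging can be pictured as comparing selected fields of the final JSON state against expected values. The sketch below is an assumption-laden illustration: the state schema, the dotted-path convention, and the fraction-of-checks reward are all invented here and are not MobileGym's actual task-definition format or judging API.

```python
import json


def get_path(state, path):
    """Follow a dotted path such as 'clock.alarms.0.hour' through dicts and lists."""
    node = state
    for key in path.split("."):
        node = node[int(key)] if isinstance(node, list) else node[key]
    return node


def judge(final_state, expected):
    """Return (verdict, reward): verdict is True only if every check passes;
    reward is the fraction of checks satisfied (a simple dense signal)."""
    passed = 0
    for path, want in expected.items():
        try:
            if get_path(final_state, path) == want:
                passed += 1
        except (KeyError, IndexError, ValueError, TypeError):
            pass  # a missing or malformed field counts as a failed check
    return passed == len(expected), passed / len(expected)


# Hypothetical final state of a clock app after an agent episode.
state = json.loads('{"clock": {"alarms": [{"hour": 7, "enabled": true}]}}')
verdict, reward = judge(
    state, {"clock.alarms.0.hour": 7, "clock.alarms.0.enabled": False}
)
```

Because the judge reads structured state rather than free text, the same episode always receives the same verdict.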

2605.25815 2026-05-28 cs.AI cs.MA

Behind EvoMap: Characterizing a Self-Evolving Agent-to-Agent Collaboration Network

EvoMap背后:表征一个自进化的智能体间协作网络

Qiming Ye, Peixain Zhang, Yupeng He, Zifan Peng, Gareth Tyson

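AI summary Analyzes 1.5M assets and 128K agents in the EvoMap network to expose the trade-offs its design choices create in reusability, evolution, and auditability. The reward mechanism leaves 98% of assets never reused, the scoring system is easy to manipulate, and the validation mechanism is flawed.

Details
AI Chinese abstract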

智能体间(A2A)网络通过共享可重用的问题解决指令,使自主AI智能体能够协作。然而,这些去中心化生态系统在实践中如何运作仍然在很大程度上未被探索。我们首次对EvoMap(一个突出的A2A协作网络)进行了大规模实证研究。通过分析超过150万资产和12.8万智能体,我们展示了优先考虑可扩展增长的设计选择如何在可重用性、演化和可审计性方面引入权衡。首先,EvoMap的信用经济奖励智能体发布有价值的资产。尽管这种设计鼓励大规模参与,但奖励主要与发布而非采用挂钩。这导致智能体大量生产资产以积累信用。结果,98%的资产从未被重用,而奖励高度集中在少数智能体手中。其次,EvoMap采用一种算法(称为GDI)来评分和排序这些共享资产的质量。我们证明该评分系统存在缺陷:资产的排名并非衡量客观性能,而是严重受未经验证的自我报告元数据(例如声称修改的代码行数)支配。这使得智能体可以轻易操纵其资产的分数。最后,EvoMap依赖智能体提供本地执行日志作为上传资产功能正常的证据。由于这些验证未经独立核实,超过84%的已批准资产使用空测试(例如console.log())绕过质量检查。我们的发现表明,未来的A2A协作网络不能仅依赖未经验证的自我报告。可扩展的协作需要平衡开放参与与可验证执行和可信评估的机制。

English abstract

Agent-to-Agent (A2A) networks enable autonomous AI agents to collaborate by sharing reusable problem-solving instructions. However, how these decentralized ecosystems operate in practice remains largely unexplored. We present the first large-scale empirical study of EvoMap, a prominent A2A collaboration network. By analyzing over 1.5M assets and 128K agents, we show how design choices that prioritize scalable growth introduce trade-offs in reusability, evolution, and auditability. First, EvoMap's credit economy rewards agents for publishing valuable assets. Although this design encourages participation at scale, rewards are tied primarily to publication rather than adoption. This leads agents to mass-produce assets to accumulate credits. As a result, 98% of assets are never reused, while rewards become highly concentrated among a small fraction of agents. Second, EvoMap employs an algorithm (referred to as GDI) to score and rank the quality of these shared assets. We demonstrate that this scoring system is flawed: rather than measuring objective performance, an asset's rank is heavily dictated by unverified, self-reported metadata (e.g., claimed lines of code modified). This allows agents to trivially manipulate their assets' scores. Finally, EvoMap relies on agents to provide local execution logs as evidence that uploaded assets function correctly. Because these validations are not independently verified, over 84% of approved assets bypass quality checks using vacuous tests (e.g., console.log()). Our findings show that future A2A collaboration networks cannot rely on unverified self-reporting alone. Scalable collaboration requires mechanisms that balance open participation with verifiable execution and trustworthy evaluation.

2605.25767 2026-05-28 cs.CV

SAFE-Diff: Scale-Aware Attention and Feature-Dispersive Diffusion with Uncertainty Estimation for Contrast-Enhanced Breast MRI Synthesis

SAFE-Diff: 用于对比增强乳腺MRI合成的尺度感知注意力与特征分散扩散及不确定性估计

Tianyu Zhang, Xinglong Liang, Jarek van Dijk, Luyi Han, Chunyao Lu, Antonio Portaluri, Xinghe Xie, Yaofei Duan, Nika Rasoolzadeh, Xin Wang, Yuan Gao, Muzhen He, Yue Sun, Jonas Teuwen, Tao Tan, Ritse Mann

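AI summary Proposes SAFE-Diff, which combines scale-aware attention, feature-dispersive diffusion, and uncertainty estimation to address complex lesion textures and heterogeneous enhancement patterns in contrast-enhanced breast MRI synthesis.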

Comments Early accepted by MICCAI 2026

Details
AI Chinese abstract

合成高保真度的对比增强MRI对于更安全、更高效的乳腺癌筛查具有临床价值,但由于复杂的病灶纹理和异质性增强模式,仍然具有挑战性。

English abstract

Synthesizing high-fidelity contrast-enhanced MRI is clinically valuable for safer and more efficient breast cancer screening, yet remains challenging due to complex lesion textures and heterogeneous enhancement patterns.

2605.25378 2026-05-28 cs.CV cs.AI

CollectionLoRA: Collecting 50 Effects in 1 LoRA via Multi-Teacher On-Policy Distillation

CollectionLoRA: 通过多教师在线策略蒸馏将50种效果收集到1个LoRA中

Fangtai Wu, Hailong Guo, Shijie Huang, Jiayi Song, Yubo Huang, Mushui Liu, Zhao Wang, Yunlong Yu, Jiaming Liu, Ruihua Huang

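AI summary Proposes CollectionLoRA, a multi-teacher on-policy distillation framework that consolidates up to 50 different effect LoRAs and few-step generation capability into a single LoRA, resolving parameter interference and reducing deployment cost.

Details
AI Chinese abstract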

定制图像编辑旨在使用有限的配对数据,通常通过低秩适配(LoRA)为预训练扩散模型配备特定的视觉效果。随着所需效果数量的增加,存储和动态加载这些效果LoRA会显著增加部署开销。此外,当前的流程通常将这些效果LoRA与加速模块级联以实现快速生成,这会引发严重的参数干扰,导致概念混淆和风格退化。我们提出了CollectionLoRA,一个多教师在线策略蒸馏框架,能够将多达50种不同效果LoRA的概念以及少步生成能力蒸馏到单个LoRA中。这从根本上解决了特征干扰问题,并显著降低了部署成本。具体来说,该方法引入了(i)概率双流路由机制,使模型在训练期间能够在数据源之间随机切换,有效增强其在未见场景中的泛化能力;(ii)非对称正交提示策略,在提示空间内实现概念隔离;(iii)从粗到细的蒸馏目标,以缓解教师模型与学生模型之间的分布差距。大量评估表明,CollectionLoRA将所有定制效果和少步生成蒸馏到单个LoRA中,降低了部署开销,同时实现了与独立训练的教师模型相当或更好的概念保真度。代码:https://github.com/Qwen-Applications/CollectionLoRA

English abstract

Customized image editing aims to equip pre-trained diffusion models with specific visual effects using limited paired data, typically via Low-Rank Adaptation (LoRA). As the number of desired effects grows, storing and dynamically loading numerous effect LoRAs significantly increases deployment overhead. Furthermore, current pipelines typically cascade these effect LoRAs with acceleration modules for fast generation, which triggers severe parameter interference and results in concept bleeding and style degradation. We propose CollectionLoRA, a multi-teacher on-policy distillation framework capable of distilling the concepts of up to 50 different effect LoRAs along with few-step generation capabilities into a single LoRA. This fundamentally resolves the feature interference issue and significantly reduces deployment costs. Specifically, the method introduces (i) a Probabilistic Dual-Stream Routing mechanism that enables the model to randomly switch between data sources during training, effectively enhancing its generalization in unseen scenarios; (ii) an Asymmetric Orthogonal Prompting strategy to achieve concept isolation within the prompt space; (iii) a Coarse-to-Fine Distillation Objective to mitigate the distribution gap between the teacher and student models. Extensive evaluations show that CollectionLoRA distills all customized effects and few-step generation into a single LoRA, reducing deployment overhead while achieving concept fidelity comparable to or better than independently trained teacher models. Code: https://github.com/Qwen-Applications/CollectionLoRA

2605.25252 2026-05-28 cs.LG cs.AI

Quantifying Empirical Compute-Supervision Tradeoffs in RLVR

量化 RLVR 中计算与监督的实证权衡

Ryo Mitsuhashi, Patrick Chen, Isabelle Tseng, Jasin Cekinmez, Addison J. Wu

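AI summary Studies how verifier noise affects RLVR through GRPO experiments on GSM8K, finding that scaling compute cannot make up for noisy supervision and that false negatives are more harmful than false positives.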

Comments Workshop on Combining Theory and Benchmarks @ ICML 2026

Details
AI Chinese abstract

具有可验证奖励的强化学习(RLVR)已成为后训练语言模型的标准范式,但在实践中,验证器很少是完美的。最近的理论工作预测,验证器噪声会影响学习速率,但不影响最终结果,这意味着足够的计算应该能够弥补不完美监督带来的任何差距。我们通过在 GSM8K 上使用 GRPO 对 Qwen2.5(0.5B, 1.5B)进行后训练,同时向二元正确性信号中注入受控的假阳性和假阴性噪声,并将每次提示的 rollout 数量作为计算轴,来实证检验这一预测。在实践中,验证准确率的差距在大量计算扩展下仍然存在,且计算收益急剧递减。我们进一步发现一种结构性不对称:假阴性单调地比假阳性更快地降低性能。这些发现表明,验证器质量和训练计算不可互换,并且减少假阴性比单纯扩展计算更有效。

English abstract

Reinforcement learning with verifiable rewards (RLVR) has become a standard paradigm for post-training language models, but in practice, verifiers are rarely perfect. Recent theoretical work predicts that verifier noise affects the rate of learning but not its final outcome, implying that sufficient compute should close any gap induced by imperfect supervision. We test this prediction empirically by post-training Qwen2.5 (0.5B, 1.5B) with GRPO on GSM8K while injecting controlled false-positive and false-negative noise into the binary correctness signal, and varying rollouts per prompt as a compute axis. In practice, the gap in validation accuracy persists under substantial compute scaling, with returns to compute that are sharply diminishing. We further find a structural asymmetry where false negatives monotonically degrade performance more quickly than false positives. These findings suggest verifier quality and training compute are not interchangeable, and that reducing false negatives is a more effective lever than scaling compute alone.
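The experimental manipulation described above, flipping a binary correctness signal at controlled false-positive and false-negative rates, can be sketched as follows. The function names are hypothetical, and the group-normalized advantage is the commonly described GRPO form (reward minus group mean, divided by group standard deviation), included only to show where the noisy rewards enter; it is not taken from the authors' code.

```python
import random


def noisy_reward(is_correct, fp_rate, fn_rate, rng):
    """Binary verifier output with injected false positives and false negatives."""
    if is_correct:
        return 0 if rng.random() < fn_rate else 1  # false negative: correct marked wrong
    return 1 if rng.random() < fp_rate else 0      # false positive: wrong marked correct


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of rollouts."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


rng = random.Random(0)
# Eight rollouts for one prompt: three truly correct, five truly incorrect.
truth = [True, True, True, False, False, False, False, False]
rewards = [noisy_reward(t, fp_rate=0.1, fn_rate=0.1, rng=rng) for t in truth]
advantages = group_advantages(rewards)
```

Under this normalization, a group whose rewards are all identical produces zero advantages, so the number of rollouts per prompt affects how often a prompt contributes any learning signal.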

2605.25230 2026-05-28 cs.AI

Boosting Inference with Guided Reasoning: Stochastic Exploration for Recursive Models

通过引导推理提升推理能力:递归模型的随机探索

Andrew Corbett, Archit Sood, Anna Tzatzopoulou, Sai-Aakash Ramesh, Tim Dodwell

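AI summary Proposes guided stochastic exploration, which stochastically perturbs reasoning trajectories and reweights them online to improve recursive models on structured reasoning tasks without retraining.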

Comments Presented at the proceedings of the ICML 2026 Workshop on Structured Probabilistic Inference & Generative Modeling (SPIGM), Seoul, South Korea, 2026

Details
AI Chinese abstract

最近关于递归架构的研究表明,小型神经网络在结构化推理任务上可以出奇地强大。其诀窍是用潜在动力系统对推理轨迹进行建模。我们认为,这些架构的推理时行为最好被理解为对潜在推理轨迹的近似推理,其中确定性递归是单粒子、零噪声极限。我们通过引导随机探索使这一观点可操作:推理动力学的随机扰动提出相邻轨迹,而模型现有的早停头在线重新加权它们。该框架产生三个无标签诊断指标:局部稳定性、引导对齐度和云令牌熵。这些指标仅从推理轨迹就能预测该过程是否有帮助以及应信任其哪些输出。在Sudoku-Extreme上,它无需重新训练就将精确求解准确率从85.9%提升到98.0%;在Maze-Hard上,诊断指标标记出引导未对齐,后续验证性能也证实了这一点。因此,同一机制既能刻画递归推理在轨迹层面何时有改进空间,也能刻画模型内部引导何时能恢复它。

English abstract

Recent work on recursive architectures has shown that tiny neural networks can be surprisingly powerful on structured reasoning tasks. The trick is to model reasoning trajectories with a latent dynamical system. We argue that the inference-time behaviour of these architectures is best understood as approximate inference over latent reasoning trajectories, with deterministic recursion as the one-particle, zero-noise limit. We make this view operational through guided stochastic exploration: stochastic perturbations of the reasoning dynamics propose neighbouring trajectories, and the model's existing early-stopping head reweights them online. The framework yields three label-free diagnostics: local stability, guide alignment, and cloud-token entropy. These predict, from inference traces alone, whether the procedure will help and which of its outputs to trust. On Sudoku-Extreme it lifts exact-solve accuracy from 85.9% to 98.0% without retraining; on Maze-Hard the diagnostics flag a misaligned guide, as validation performance later confirms. The same machinery thus characterises both when recursive reasoning has room to improve at the trajectory level and when the model's internal guide can recover it.

2605.25183 2026-05-28 cs.CL cs.AI

Knowledge Graph-Driven Expert-Level Reasoning for Neuroscience

知识图谱驱动的神经科学专家级推理

Jake Stephen, Niraj K. Jha

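AI summary Builds a knowledge graph from a single textbook, generates question-answer supervision from it, and fine-tunes a language model to achieve expert-level neuroscience reasoning that surpasses large language models.

Details
AI Chinese abstract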

知识图谱(KG)是一种可以从文本语料库中提取并用于深度推理的抽象结构。先前的工作利用KG微调语言模型(LM),实现了特定领域的超智能。在这项工作中,我们探索仅使用单一权威教科书中的信息,KG驱动的深度推理能力是否能在神经科学中出现。核心假设是,结构化知识在被提炼为高质量KG并转换为基于KG的问答(QA)监督后,足以通过微调LM产生专家级推理,该LM在准确率上超越大型语言模型(LLM),同时参数数量少几个数量级。我们通过双LLM验证流水线构建教科书衍生的KG,使用在KG拓扑上训练的掩码LM扩展它,生成多跳QA项目(包括QA对和推理轨迹),以仅基于KG的监督微调LM,并应用强化学习,使用路径衍生的KG信号作为隐式奖励模型。我们的结果表明,深度、机械性的神经科学理解可以在模型中诱导,而无需依赖大型、异构的网络规模语料库。基于KG的神经科学合成课程(读者可以自我测试)以及微调后的LM可在以下GitHub位置获取:https://kg-bottom-up-superintelligence.github.io/neuro-bench。

English abstract

A knowledge graph (KG) is an abstraction that can be extracted from text corpora and used for in-depth reasoning. Prior work has leveraged KGs to fine-tune language models (LMs), enabling domain-specific superintelligence. In this work, we explore whether KG-driven in-depth reasoning capabilities can emerge in neuroscience using only information contained within a single authoritative textbook. The central hypothesis is that structured knowledge, when distilled into a high-quality KG and converted into KG-grounded question-answer (QA) supervision, is sufficient to produce expert-level reasoning through a fine-tuned LM that surpasses large language models (LLMs) in accuracy, while employing orders of magnitude fewer parameters. We construct a textbook-derived KG via a dual-LLM validation pipeline, expand it with a masked LM trained on the KG topology, generate multi-hop QA items, which include QA pairs and reasoning traces, to fine-tune an LM exclusively on KG-derived supervision, and apply reinforcement learning using path-derived KG signals as implicit reward models. Our results demonstrate that deep, mechanistic neuroscience understanding can be induced in the model without reliance on large, heterogeneous web-scale corpora. The KG-based synthetic neuroscience curriculum that readers can quiz themselves on, and the fine-tuned LM, are available at the following GitHub location: https://kg-bottom-up-superintelligence.github.io/neuro-bench.

2605.24302 2026-05-28 cs.CV

Cross-Modal Action Recognition in Egocentric Video Using Mamba: Integrating RGB and Hand Skeleton Streams via CLS Token Fusion Strategies

基于Mamba的第一人称视频跨模态动作识别:通过CLS令牌融合策略整合RGB和手部骨架流

Juan Ignacio Bustos Gorostegui, Maria Elena Buemi

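AI summary Proposes a Mamba-based cross-modal architecture that integrates RGB video and hand skeleton data through four CLS token fusion strategies (Naive, Average, Weighted, and Context-based). On the H2O dataset the Average strategy performs best, improving Top-1 accuracy by over 10% in the Tiny configuration.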

Comments 4 pages, 2 figures, Egovis2026, CVPR2026

Details
AI Chinese abstract

第一人称动作识别由于相机运动不稳定、手部频繁遮挡以及随时间保持一致视觉表示的困难而具有挑战性。在这项工作中,我们提出了一种跨模态架构,将RGB视频和时间手部骨架数据结合在一个统一的基于Mamba的框架内,利用状态空间模型(SSMs)的线性时间复杂度。我们的架构由三个组件组成:用于视觉特征提取的VideoMamba模块、基于Mamba块堆叠的骨架编码器,以及将两种模态整合为单一表示的融合模块。本工作的一个核心贡献是设计和评估了四种用于多模态融合的类(CLS)令牌混合策略:朴素、平均、加权和基于上下文。这些策略在如何利用预训练的单模态CLS令牌(其作用是作为信息汇聚集所学表示)来初始化用于最终分类的混合CLS令牌方面有所不同。我们在H2O数据集上评估了所有策略。实验结果表明,平均策略实现了最佳性能,在Tiny配置下比VideoMamba基线提高了超过10%的Top-1准确率,在Small配置下提高了2%。

English abstract

Egocentric action recognition is a challenging task due to erratic camera motion, frequent hand occlusion, and the difficulty of maintaining consistent visual representations over time. In this work, we propose a cross-modal architecture that combines RGB video and temporal hand skeleton data within a unified Mamba-based framework, exploiting the linear time complexity of State Space Models (SSMs). Our architecture consists of three components: a VideoMamba module for visual feature extraction, a skeleton encoder built on a stack of Mamba blocks, and a fusion module that integrates both modalities into a single representation. A central contribution of this work is the design and evaluation of four Class (CLS) token mixing strategies for multimodal fusion: Naive, Average, Weighted and Context-based. These strategies differ in how the pretrained unimodal CLS tokens, whose role is to act as information sinks concentrating learned representations, are leveraged to initialize the mixed CLS token used for final classification. We evaluate all strategies on the H2O dataset. Experimental results show that the Average strategy achieves the best performance, yielding gains of over 10% Top-1 accuracy in the Tiny configuration and 2% in the Small configuration over the VideoMamba baseline.
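As a rough picture of what mixing two unimodal CLS tokens can mean, the sketch below shows one plausible reading of the Average and Weighted strategies on plain Python lists. The abstract does not define the strategies precisely, so both functions (and the scalar weight `alpha`) are assumptions. The Naive and Context-based variants are left out because their definitions cannot be inferred from the abstract.

```python
def average_mix(cls_rgb, cls_skel):
    """Element-wise mean of the RGB and skeleton CLS vectors."""
    return [(a + b) / 2 for a, b in zip(cls_rgb, cls_skel)]


def weighted_mix(cls_rgb, cls_skel, alpha):
    """Convex combination; alpha in [0, 1] is the weight on the RGB token.
    In a trained model alpha would presumably be learned (assumption)."""
    return [alpha * a + (1 - alpha) * b for a, b in zip(cls_rgb, cls_skel)]


# Toy 2-dimensional CLS tokens (made-up numbers).
cls_rgb = [1.0, 3.0]
cls_skel = [3.0, 5.0]
mixed = average_mix(cls_rgb, cls_skel)
```

Under this reading, Average is the special case of Weighted with `alpha = 0.5`, and the mixed vector would serve as the initial CLS token of the fusion module.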

2605.23908 2026-05-28 cs.AI cs.CL cs.CV cs.NE

In Search of the Ingredients of Open-Endedness: Replicating Picbreeder with Large Vision-Language Models

寻找开放性的要素:用大型视觉语言模型复现 Picbreeder

Sam Earle, Kai Arulkumaran, Andrew Dai, Akarsh Kumar, Julian Togelius, Sebastian Risi

AI总结 本研究通过用前沿视觉语言模型替代人类用户复现 Picbreeder,探索人工智能在无引导发现中的开放性能力,并分析系统输出与人类基线在系统发育复杂性、视觉和语义显著性及新颖性上的差异,同时研究探索性噪声、行为多样性和叙事动量等因素的影响。

Comments 26 pages, 21 figures, to be published at GECCO 2026

详情
AI中文摘要

我们正处于大规模工业和学术努力之中,旨在通过AI驱动的助手自动化科学、技术和创造性生产的过程。历史上,这些过程在人类形式中的一个基本属性是它们的开放性:即生成看似无穷无尽的新颖且有意义的新形式的能力。人工代理是否有能力进行这种富有成果的无引导发现?为了回答这个问题,我们转向Picbreeder,这是人类驱动的开放性搜索的典型范例,用户通过小型神经网络的交互式进化协作生成多样化的图像库。我们复现了Picbreeder,用前沿视觉语言模型(VLM)替代人类用户。我们观察到系统输出与历史人类基线之间存在明显的定性差异,并尝试使用系统发育复杂性、视觉和语义显著性及新颖性的指标来表征这些差异。为了识别导致这些差异的一些因果因素,我们研究了在代理的选择过程中添加探索性噪声、代理之间的行为多样性以及以过去行动记忆形式的叙事动量。我们的代码可在 https://github.com/smearle/picbreeder-vlm 获取。

English abstract

We are in the midst of large-scale industrial and academic efforts to automate the processes of scientific, technological and creative production through AI-driven assistants. Historically, a fundamental property of these processes in their human form has been their open-endedness: their capacity for generating a seemingly endless supply of novel and meaningful new forms. Do artificial agents have any capacity for such fruitful unguided discovery? To answer this question, we turn to Picbreeder, the canonical exemplar of human-driven open-ended search, in which users collaboratively generated a diverse library of images through interactive evolution of small neural networks. We replicate Picbreeder, replacing human users with frontier Vision Language Models (VLMs). We observe clear qualitative differences between the output of our system and the historical human baseline, and attempt to characterize them using metrics of phylogenetic complexity and visual and semantic salience and novelty. In an effort to identify some of the causal factors contributing to these differences, we study the addition of exploratory noise to the agents' selection process, of behavioral diversity between agents, and of narrative momentum in the form of memory of past actions. We make our code available at https://github.com/smearle/picbreeder-vlm.