arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.27899 2026-05-28 cs.AI

SKILLC: Learning Autonomous Skill Internalization in LLM Agents via Contrastive Credit Assignment

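Hongxiang Lin, Zhirui Kuai, Erpeng Xue, Lei Wang

AI summary: Proposes SkillC, a framework built on Contrastive Skill Credit Assignment (CSCA) that turns the skill-helpfulness contrast into a direct learning signal, enabling autonomous skill internalization in LLM agents; it surpasses the strongest baseline by 5.5% on ALFWorld and 4.4% on WebShop.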

Abstract

Structured skill prompts improve exploration in long-horizon agentic reinforcement learning (RL). Skill-augmented RL methods retain external skills at inference, while skill-internalization RL methods withdraw them during training to enable autonomous performance. However, existing internalization approaches only use skill-helpfulness contrast for curriculum control, leaving the policy update unchanged and unable to distinguish skill-dependent from autonomous success. We propose SkillC, a framework based on Contrastive Skill Credit Assignment (CSCA) that converts this contrast into a direct learning signal for internalization. SkillC samples paired skill-injected and skill-free rollouts for tasks from active skill types within the same policy update, and injects their task-level contrast into optimization via a dual-stream advantage estimator that preserves global ranking while applying a one-sided correction toward skill-free success. A smoothed validation-level signal further drives an adaptive curriculum over attribution strength, rollout allocation, and monotonic active-set pruning. Experiments on ALFWorld and WebShop show that, without runtime skill access, SkillC surpasses the strongest prior skill-internalization RL baseline by 5.5% and 4.4%, respectively, while remaining competitive with skill-augmented RL methods.

2605.27898 2026-05-28 cs.AI

A Unified Framework for the Evaluation of LLM Agentic Capabilities

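Pengyu Zhu, Lijun Li, Yaxing Lyu, Qianxin Luo, Jingyi Yang, Yi Liu, Tingfeng Hui, Xinyu Yuan, Li Sun, Sen Su, Jing Shao

AI summary: Proposes a unified framework that uses standardized configuration, a fixed ReAct architecture, and an offline setting to separate framework effects from environment effects, enabling fair evaluation of LLM agentic capabilities, with a large-scale empirical analysis across 7 benchmarks, 24 domains, and 15 models.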

Abstract

As LLMs are increasingly deployed as agents, reliable assessment of their agentic capabilities has become essential. However, reported benchmark scores often jointly reflect model capability and the implementation choices each benchmark is packaged with, making cross-benchmark results difficult to interpret as clean measurements of the underlying model. In this work, we present a unified framework for the fair evaluation of LLM agentic capabilities. Driven by a unified configuration system, the framework integrates diverse benchmarks into a standardized instruction-tool-environment format, executes agents through a fixed ReAct-style architecture within a controllable sandbox, and provides an optional offline setting that replaces volatile live environments with curated snapshots, so that framework effects and environment effects can be analyzed separately. Building on this, we unify the evaluation methodology under each benchmark's original task-success criteria, while introducing unified metrics for resource consumption and a taxonomy for decision- and execution-level failure attribution. Within this framework, we adapt 7 widely used benchmarks spanning 24 domains across single-agent, multi-agent, and safety-critical scenarios, and conduct a large-scale empirical analysis over 400K rollouts and 5B tokens on 15 models. The results show that scaffold choice and environmental volatility materially shift benchmark outcomes in both directions, allowing our framework to disentangle intrinsic LLM capabilities from framework- and environment-induced artifacts. We further demonstrate its extensibility as a secure testbed for safety-critical domains. Code and benchmarks are available at https://github.com/whfeLingYu/A-Unified-Framework-for-the-Evaluation-of-LLM-Agentic-Capabilities and https://huggingface.co/AgentFramework/Unified_Farmework.

2605.27896 2026-05-28 cs.CL cs.CE

FinBoardBench: Benchmarking Dynamic Wealth Management and Strategic Financial Reasoning of LLMs via Board Game Simulations

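Xuesi Hu, Peng Wang, Jinpeng Miao, Xilin Tao, Caiwei Li, Yue Ma, Jie He, Qiancheng Zhang, Yuntao Zou, Dagang Li

AI summary: Proposes FinBoardBench, an evaluation suite built on three classic financial board games that tests LLMs on combined financial skills such as dynamic wealth management, corporate investment and acquisition, and competitive negotiation; it finds that the models show basic planning ability but cannot turn static reasoning into successful dynamic decisions.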

Comments Preprint

Abstract

Recently, large language models (LLMs) have achieved strong performance in static financial reasoning and simple dynamic trading tasks. However, existing static financial benchmarks are insufficient to assess the dynamic wealth management and financial decision-making capabilities of LLMs in real-world environments. To bridge this gap, we present FinBoardBench, an evaluation suite based on three classic financial board games: Cashflow, Acquire, and Monopoly. FinBoardBench assesses a comprehensive set of financial skills, including personal cash flow management with debt balancing, corporate investment and acquisition forecasting, and competitive trade negotiations with asset auctions. Our experiments with 9 advanced LLMs reveal that, although the models exhibit basic long-term planning and investment logic, they fail to effectively leverage complex interactions for profit, and their strong static reasoning performance does not translate into successful dynamic decision-making. Notably, they tend to prioritize immediate asset acquisition over maintaining sufficient liquidity, making them vulnerable to financial crises triggered by random events. We hope that FinBoardBench can provide a valuable reference for more intelligent LLM-based decision-making systems in the future.

2605.27894 2026-05-28 cs.CV

Towards Unified Vision-Language Models with Incomplete Multi-Modal Inputs

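Xiang Fang, Wanlong Fang, Changshuo Wang, Keke Tang, Daizong Liu, Siyi Wang, Wei Ji

AI summary: Addresses the train-test inconsistency that video-language models face when sensor failures yield modality-incomplete data, proposing the first unified incomplete video-language model, usable as a plug-and-play module, to improve performance on multi-modal tasks.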

Comments Published in AAAI 2026

Abstract

Video-Language Models (VLMs) have demonstrated impressive multi-modal reasoning capabilities across diverse computer vision applications. However, these VLMs are task-specific and assume that both video and language inputs are complete. In practice, real-world VLM applications may face deactivated sensors (e.g., cameras that are unavailable due to data privacy), which yields modality-incomplete data and leads to inconsistency between training and testing data. While straightforward incomplete input can boost training generalization ability and lead to training failure, its potential risks to VLMs regarding safety and trustworthiness have been largely neglected. To this end, we make the first attempt to propose a unified incomplete video-language model to process incomplete multi-modal inputs. Extensive experimental results show that our method can serve as a plug-and-play module for previous works to improve their performance in various multi-modal tasks.

2605.27893 2026-05-28 cs.CV

SIGMA: Bridging Structural and Distributional Gaps for Vision Foundation Model Adaptation

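Lingyu Xiong, Jinjin Shi, Xuran Xu, Cong Luo, Runyu Shi, Ying Huang

AI summary: Proposes SIGMA, which uses scale-adaptive fusion and semantic modulation modules to fine-tune vision foundation models efficiently on dense prediction tasks with 1.72% trainable parameters, outperforming existing PEFT methods.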

Abstract

Vision Foundation Models (VFMs) have demonstrated impressive representational capabilities. However, adapting them to downstream tasks via full fine-tuning incurs prohibitive computational and storage overhead. Parameter-Efficient Fine-Tuning (PEFT) has emerged as a compelling alternative, aiming to achieve performance parity with full fine-tuning at minimal training costs. Nonetheless, applying PEFT to VFMs for dense prediction tasks remains challenging due to structural and distributional gaps. To bridge these gaps, we propose Scale-Integrated Global Modulation Adapter (SIGMA), a novel lightweight PEFT method that consists of two modules: scale-adaptive fusion and semantic modulation. Specifically, the scale-adaptive fusion module bridges structural gaps by enhancing the extraction of multi-granularity visual information. Furthermore, SIGMA introduces semantic modulation on the fused features to perform global feature alignment and further reduce the distributional gap. This design facilitates unified spatial and distributional adaptation, requiring only 1.72% trainable parameters relative to the VFM backbone. Comprehensive experiments across various downstream dense tasks and multiple VFM backbones demonstrate that SIGMA achieves consistent and superior performance over state-of-the-art PEFT methods.

2605.27892 2026-05-28 cs.LG

FedEHR-Gen: Federated Synthetic Time-Series EHR Generation via Latent Space Alignment and Distribution-Aware Aggregation

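Jun Bai, Ziyang Song, Yue Li

AI summary: Proposes FedEHR-Gen, the first federated framework for synthesizing time-series electronic health records across distributed hospitals. Two-stage learning (a federated autoencoder that aligns the latent space, then a federated TCVAE with distribution-aware aggregation) addresses high dimensionality, sparsity, and heterogeneity, reaching generation quality comparable to centralized training on eICU and MIMIC-III.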

Comments 8 pages main paper with 14 pages supplementary appendix

Abstract

Synthetic Electronic Health Record (EHR) generation provides a promising avenue for data augmentation and cross-hospital modeling in privacy-constrained healthcare settings. However, most existing EHR generative models are centralized and require pooling data across hospitals, which is often infeasible when real-world data sharing is restricted. While federated EHR generation offers a natural solution, direct federated modeling often collapses or diverges due to the high dimensionality, sparsity, and cross-hospital heterogeneity of EHR data. In this work, we propose FedEHR-Gen, the first federated framework for synthetic time-series EHR generation across distributed hospitals. FedEHR-Gen uses a two-stage learning paradigm. First, we introduce a federated autoencoder that projects high-dimensional and sparse EHR features onto a compact latent space. To ensure semantic consistency across hospitals, we develop a layer-wise matching aggregation mechanism that aligns local encoders into a unified global latent space. Second, operating on this aligned latent space, we train a federated temporal conditional variational autoencoder (TCVAE) with distribution-aware aggregation, enabling stable temporal generative modeling under severe cross-hospital heterogeneity. Extensive experiments on the eICU and MIMIC-III datasets demonstrate that FedEHR-Gen achieves generation fidelity, downstream utility, and privacy risk comparable to centralized training, while consistently outperforming the standard federated baseline.

2605.27891 2026-05-28 cs.CV cs.AI

SmartDirector: Keyframe-Conditioned Cinematic Video Generation with Narrative Pacing Control

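Zhida Zhang, Jie Ma, Zhan Peng, Haoxue Wu, Yang Han, Jun Liang, Jie Cao, Jing Li

AI summary: Proposes SmartDirector, a framework that controls narrative structure and temporal pacing in video generation through multi-keyframe conditioning. It uses two stages (Director-Gen generates a low-resolution video; Director-SR refines details using high-resolution keyframes) and substantially outperforms existing methods.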

Abstract

The narrative quality of a video fundamentally determines its perceptual value. Although existing video generation methods can produce visually appealing content, they predominantly rely on sparse conditioning signals such as text prompts or first/last frames, which limits precise control over narrative structure and temporal pacing. In this paper, we propose SmartDirector, a framework that enhances the narrative capacity of video generation models through multiple keyframes. SmartDirector supports flexible generation scenarios including single-shot generation, multi-shot narrative synthesis, and video extension. The framework operates in two stages: Director-Gen generates a low-resolution video conditioned on the provided keyframes, and Director-SR refines the output by exploiting high-resolution keyframes as semantic anchors to recover fine-grained details. To enable robust multi-keyframe training, we construct a data pipeline that curates single-shot and multi-shot sequences from movies. Extensive experiments demonstrate that SmartDirector substantially outperforms existing state-of-the-art approaches. We will release the code to facilitate further research.

2605.27886 2026-05-28 cs.RO

Tabero: Learning Gentle Manipulation with Closed-Loop Force Feedback from Vision, Touch, and Language

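Qiwei Wu, Rui Zhang, Xin Xiang, Tao Li, Weihua Zhang, Junjie Lai, Renjing Xu

AI summary: Existing vision-language-action models lack tactile feedback and so cannot manipulate gently. Tabero, a benchmark and model suite, generates vision-tactile-language tasks through a data-efficient pipeline and introduces the Tabero-VTLA architecture with a decoupled force-position command interface, cutting average grip force under gentle instructions by over 70% while maintaining high task success.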

Comments Code: https://github.com/NathanWu7/Tabero

Abstract

Tactile sensing is essential for robots to achieve human-like gentle manipulation. However, existing Vision-Language-Action (VLA) models struggle to exploit tactile feedback for gentle manipulation due to scarce aligned vision-tactile-language data and the lack of effective closed-loop force feedback mechanisms. To address these challenges, we introduce Tabero, a benchmark and model suite for gentle, language-conditioned robotic manipulation that demands fine-grained contact force perception. First, the Tabero benchmark addresses the scarcity of tactile data by presenting a data-efficient pipeline that repurposes open-source robot manipulation trajectories to generate diverse vision-tactile-language tasks, and establishes a multidimensional evaluation protocol that measures task success alongside physical interaction quality. Second, we propose Tabero-VTLA, an architecture with a decoupled force-position command interface; the resulting force-position commands are executed by a fixed hybrid controller to enable real-time, force-aware manipulation. Evaluated on Tabero, our model maintains high task success while reducing average grip force by over 70% under gentle instructions, demonstrating its ability to modulate interaction forces based on multimodal experience. Our code is publicly available at https://github.com/NathanWu7/Tabero.

2605.27885 2026-05-28 cs.CV

Reflective Dialogue between Teacher and Solver Agents for Video Question Answering

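Takuya Murakawa, Toru Tamaki

AI summary: Proposes an adaptation method that relies solely on inference-time context injection, using a Reflective Dialogue (RD) between Teacher and Solver agents to improve video question answering; it beats zero-shot and standard in-context learning on the EgoCross benchmark and placed 3rd in the Open-source Track of the Cross-Domain EgoCross Challenge at the CVPR 2026 EgoVis Workshop.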

Comments This paper serves as the technical report for the 1st Cross-Domain EgoCross Challenge @ EgoVis Workshop, CVPR 2026

Abstract

Various approaches have been proposed to adapt Vision-Language Models (VLMs) to specialized domains for Video Question Answering, including fine-tuning and in-context learning. However, acquiring task-specific knowledge at the inference phase from only a small labeled support set without fine-tuning remains a challenge. In this paper, we propose a method that achieves adaptation solely through inference-time context injection. Our method first constructs a Reflective Dialogue (RD), a multi-turn conversation between two agents in which the Teacher poses each support question and delivers correctness feedback, and the Solver answers and provides visual grounding explanations (or reflections) for both correct and incorrect answers. This dialogue history is then used as context at the inference phase. Experiments on the EgoCross benchmark demonstrate that our method outperforms both a baseline zero-shot setting and a standard in-context learning approach that passes support set examples directly, achieving 3rd place in the Open-source Track of the 1st Cross-Domain EgoCross Challenge at the CVPR 2026 EgoVis Workshop, for which this paper also serves as a technical report.
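
The Teacher/Solver loop described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `solver` callable, the message format, the feedback wording, the toy questions, and the choice to reveal the expected answer in the feedback are all assumptions made here, and video inputs are omitted entirely.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]
Solver = Callable[[List[Message]], str]  # hypothetical stand-in for a VLM call


def build_reflective_dialogue(support_set: List[Dict[str, str]], solver: Solver) -> List[Message]:
    """Build a Teacher/Solver dialogue history over a small labeled support set."""
    history: List[Message] = []
    for item in support_set:
        # Teacher poses the support question.
        history.append({"role": "teacher", "content": item["question"]})
        # Solver answers given the dialogue so far.
        answer = solver(history)
        history.append({"role": "solver", "content": answer})
        # Teacher delivers correctness feedback.
        # Revealing the expected answer here is an assumption of this sketch.
        if answer.strip() == item["answer"]:
            feedback = "Correct."
        else:
            feedback = f"Incorrect. The expected answer is {item['answer']}."
        history.append({"role": "teacher", "content": feedback + " Explain the visual evidence."})
        # Solver reflects on the feedback.
        history.append({"role": "solver", "content": solver(history)})
    return history


def stub_solver(history: List[Message]) -> str:
    """Placeholder solver that always answers 'A' (no real model is called)."""
    return "A"


# Invented toy support set, for illustration only.
demo_support = [
    {"question": "Which tool is in the left hand? (A) scalpel (B) forceps", "answer": "A"},
    {"question": "Which animal appears first? (A) zebra (B) giraffe", "answer": "B"},
]
demo_history = build_reflective_dialogue(demo_support, stub_solver)
```

At inference time, a history like `demo_history` would be prepended to the test question as context; how the original system formats that context is not specified in the abstract.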

2605.27884 2026-05-28 cs.CV

A Road-Conditioned Traffic Movie Prediction Network with Spatiotemporal and Structure-Consistent Learning

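Joshua Kofi Asamoah, Blessing Agyei Kyem, Armstrong Aboah

AI summary: Proposes RCSNet, a road-conditioned spatiotemporal network that improves the accuracy and structural consistency of cross-city traffic forecasting through topology-guided future-state generation and structure-consistent learning.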

Comments 22 pages (double column), 7 Tables, 11 Figures

Abstract

City-wide traffic forecasting is important for congestion management, route guidance, and intelligent transportation systems, but accurate prediction remains challenging when future traffic must be generated as spatial maps over an entire urban network. Existing traffic movie prediction methods have improved frame-level accuracy, yet many still treat forecasting mainly as image reconstruction. This can produce traffic maps that are numerically close to the ground truth but weakly constrained by road layout, connectivity, travel direction, and congestion propagation, especially in cross-city settings where both traffic behavior and road structure change. To address this limitation, this study proposes RCSNet, a road-conditioned spatiotemporal network that reformulates traffic movie prediction as topology-guided future-state generation. RCSNet extracts road-aware representations from static road maps, models multi-horizon traffic dynamics from historical observations, aligns directional traffic features with local road structure, and progressively generates future traffic maps for improved temporal consistency. A structure-consistent learning objective further encourages predictions to remain accurate, road-aligned, and spatially stable. Experiments across multiple cities show that RCSNet improves both forecasting accuracy and structural consistency. In same-city forecasting on Berlin, Antwerp, and Moscow, RCSNet reduces average MAE, MSE, and RMSE by 11.5%, 10.0%, and 5.1%, respectively, compared with the closest baseline. In cross-city testing on unseen Chicago and Bangkok, it reduces RMSE by 10.6% and 10.5% without target-city fine-tuning. Additional horizon-wise, road-structure, explainability, statistical, and efficiency analyses show that RCSNet produces more accurate, transferable, road-aligned, and computationally efficient traffic forecasts.
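
The percentages in this abstract are relative reductions in standard error metrics. The sketch below computes MAE, MSE, and RMSE on invented toy values (not data from the paper) and shows why, on a single evaluation set, a 10% MSE reduction corresponds to roughly a 5.1% RMSE reduction, since RMSE is the square root of MSE. The reported figures are averages across cities, so this relationship is only indicative there.

```python
import math
from typing import Sequence


def mae(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def rmse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Root mean squared error."""
    return math.sqrt(mse(pred, target))


def relative_reduction(baseline: float, model: float) -> float:
    """Fractional reduction of an error metric relative to a baseline."""
    return (baseline - model) / baseline


def rmse_reduction_from_mse(mse_reduction: float) -> float:
    """RMSE reduction implied by an MSE reduction on the same evaluation set."""
    return 1.0 - math.sqrt(1.0 - mse_reduction)


# Toy flattened traffic maps (made-up values).
target = [10.0, 20.0, 30.0, 40.0]
baseline_pred = [12.0, 17.0, 33.0, 44.0]
model_pred = [11.0, 18.0, 32.0, 43.0]

mae_gain = relative_reduction(mae(baseline_pred, target), mae(model_pred, target))
rmse_gain = relative_reduction(rmse(baseline_pred, target), rmse(model_pred, target))
implied = rmse_reduction_from_mse(0.10)
```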

2605.27882 2026-05-28 cs.CL cs.AI

VibeSearchBench: Benchmarking Long-horizon Proactive Search in the Wild

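Xiaohongshu Inc

AI summary: Existing search benchmarks use over-specified queries, single-turn interaction, and fixed-schema evaluation, which opens a gap between user experience and evaluation results. The paper proposes the VibeSearch paradigm and builds VibeSearchBench, testing frontier models with a progressive-disclosure user simulator and a graph-matching evaluation framework; all models remain markedly weak at long-context reasoning, proactive intent elicitation, and structured knowledge construction.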

Abstract

LLM-based agents score well on search benchmarks, yet real users consistently find results unsatisfying, revealing a persistent evaluation-experience gap. We attribute this gap to existing benchmarks' reliance on over-specified queries, single-turn interactions, and fixed-schema evaluation, none of which reflect real search behavior where users and agents collaboratively refine vague intent through multi-turn dialogue. We term this paradigm VibeSearch and introduce VibeSearchBench, a benchmark comprising 200 manually curated bilingual (Chinese and English) tasks across 20 domains, split into VibeSearch-Pro (professional) and VibeSearch-Daily (daily-life) subsets. Each task pairs a user persona with a schema-free ground-truth knowledge graph, and is evaluated through a progressive-disclosure user simulator and a graph-matching evaluation framework. We benchmark seven frontier models under both the ReAct framework and the OpenClaw agent harness. Results show that all models remain substantially inadequate for VibeSearch (best F1: 30.30), highlighting the need for fundamental advances in long-context reasoning, proactive intent elicitation, and structured knowledge construction.
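
To make the F1 figure concrete, here is a simplified sketch of scoring a predicted knowledge graph against a ground-truth graph. It assumes graphs are sets of (subject, relation, object) triples compared by exact match; the benchmark's actual graph-matching framework is schema-free and is not specified in the abstract, so this is only an illustration of the metric, and the triples are invented.

```python
from typing import Iterable, Tuple

Triple = Tuple[str, str, str]


def triple_f1(predicted: Iterable[Triple], gold: Iterable[Triple]) -> float:
    """F1 between two triple sets under exact matching (a simplifying assumption)."""
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Invented example: 3 predicted triples, 2 of which match the 4 gold triples.
gold_graph = [
    ("cafe_a", "located_in", "district_1"),
    ("cafe_a", "serves", "espresso"),
    ("cafe_a", "opens_at", "08:00"),
    ("cafe_a", "price_level", "mid"),
]
predicted_graph = [
    ("cafe_a", "located_in", "district_1"),
    ("cafe_a", "serves", "espresso"),
    ("cafe_a", "opens_at", "09:00"),
]
score = triple_f1(predicted_graph, gold_graph)  # precision 2/3, recall 2/4
```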

2605.27879 2026-05-28 cs.AI

Towards Faithful Agentic XAI: A Verification Method and an Open-World Benchmark for Better Model Faithfulness

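Jaechang Kim, Sunung Mun, Seungjoon Lee, Jaewoong Cho, Jungseul Ok

AI summary: Proposes the FAX framework, which decomposes explanations into claims and cross-checks them against faithful tools through explicit verification, and the open-world benchmark CRAFTER-XAI-Bench; in a reinforcement learning environment, simulation faithfulness rises from 0.20 to 0.46.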

Abstract

Explainable AI (XAI) helps users interpret model behavior and identify potential faults. Agentic XAI systems use Large Language Models (LLMs) to make explanations more accessible through natural-language interaction, but they can also produce plausible yet unfaithful explanations. This risk arises because unreliable XAI outputs for complex models can be amplified by LLMs and mislead users. We propose Faithful Agentic XAI (FAX), a framework that improves explanation faithfulness through explicit verification. FAX decomposes draft explanations into claims and cross-checks them against inherently faithful tools, filtering unsupported or contradictory claims before final generation. We also introduce CRAFTER-XAI-Bench, an open-world reinforcement learning benchmark with complex policies, diverse goals, and challenging scenarios for assessing model-specific faithfulness. On CRAFTER-XAI-Bench, FAX improves simulation faithfulness from 0.20 for the strongest baseline to 0.46 while maintaining high informativeness, relevance, and fluency. On three tabular benchmarks, FAX performs competitively with prior Agentic XAI baselines, but our analysis shows that these settings can conflate task accuracy with model-specific faithfulness. These findings show that explicit verification is essential for faithful Agentic XAI and that faithfulness benchmarks must be designed to test explanations against the behavior of the target model itself.
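
The decompose-then-verify step can be pictured with a small sketch. Everything below is hypothetical: the sentence-based claim splitter, the `is_supported` checker, and the toy fact table stand in for the LLM decomposition and the faithful tools that the abstract mentions but does not specify.

```python
from typing import Callable, Dict, List


def split_into_claims(draft: str) -> List[str]:
    """Naive decomposition: one claim per sentence (breaks on decimals; illustration only)."""
    return [part.strip() for part in draft.split(".") if part.strip()]


def verify_explanation(draft: str, is_supported: Callable[[str], bool]) -> str:
    """Keep only the claims that the checker supports, then rejoin them."""
    kept = [claim for claim in split_into_claims(draft) if is_supported(claim)]
    return ". ".join(kept) + "." if kept else ""


# Hypothetical verdicts, as if produced by querying a faithful tool about the policy.
tool_verdicts: Dict[str, bool] = {
    "the agent collects wood first": True,
    "the agent avoids water": False,
}


def lookup_checker(claim: str) -> bool:
    """Treat a claim as supported only if the tool verdict table says so."""
    return tool_verdicts.get(claim.lower(), False)


draft_explanation = "The agent collects wood first. The agent avoids water."
final_explanation = verify_explanation(draft_explanation, lookup_checker)
```

Claims with no verdict are dropped here as unsupported, which is one conservative design choice; a real system might instead flag them for further checking.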

2605.27878 2026-05-28 cs.CL

Narrative Flattening: How Post-Training Compresses Thematic, Affective, and Stylistic Variation in LLM Fiction

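Zehan Li, Yutong Zhu, Siyang Wu, Honglin Bao, James A. Evans

AI summary: Comparing continuations from four OLMo 32B checkpoints (Base, SFT, DPO, RLVR) across three story domains, the paper finds that post-training compresses thematic dynamics, emotional intensity, and linguistic diversity, producing narrative flattening, with professional literary fiction compressed most.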

Abstract

Large language models produce fluent fiction, yet their creative output is widely seen as flat. We ask where this quality originates in training and whether it affects different domains of human fiction equally. We construct a matched story-continuation paradigm across StoryStar (public-platform), TMAS (prompt-guided), and The New Yorker (professional literary), and compare continuations from four OLMo 32B checkpoints (Base, SFT, DPO, RLVR) against matched human text. Because these checkpoints share architecture, scale, tokenizer, and pretraining, the design isolates the post-training effect. We measure each continuation along three sentence-level dimensions: thematic motion, affective prevalence, and linguistic diversity. Across all three, post-training compresses dynamic variation: thematic transitions become more uniform, high-intensity emotions give way to neutrality, and stylistic diversity across stories shrinks. We term this progressive loss narrative flattening. The effect is directionally stable across story domains, but the gap size depends on the human baseline: professional literary fiction is compressed most, while public-platform and prompt-guided stories show smaller gaps, consistent with their human baselines sitting closer to the model's default rhythm. Post-trained endpoints converge across domains, suggesting alignment produces a continuation regime largely insensitive to the source domain's narrative texture.

2605.27877 2026-05-28 cs.LG cs.AI

SPAR: Support-Preserving Action Rectification

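Jiaxin Zhao, Weihang Pan, Xun Liang, Binbin Lin

AI summary: Proposes Support-Preserving Action Rectification (SPAR), which recasts global learning as local residual rectification and adds a latent self-imitation mechanism to resolve the conflict between value maximization and fitting the data distribution in offline policy improvement, achieving state-of-the-art performance on the D4RL benchmark.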

Abstract

Offline policy improvement faces an inherent conflict between maximizing value and fitting the data distribution. While in-sample weighted regression is stable, it suffers from over-conservatism that suppresses high-value actions in the distribution tail; conversely, gradient-based approaches often exhibit a fitting-optimization conflict of gradients, which drives the policy off the data manifold. To address this, we propose Support-Preserving Action Rectification (SPAR), which reframes global learning as a local residual rectification anchored to a frozen pure behavior cloning policy. This framework performs fine-grained fitting and local policy improvement in the residual space, thereby contracting the search space. We further introduce Latent Self-Imitation, utilizing a latent-sampling weighted-regression mechanism to address fitting-improvement gradient conflict in the residual space. Theoretically, we prove this mechanism eliminates the manifold-normal drift of standard value gradients, while extensive D4RL experiments show SPAR extracts significant gains from suboptimal baselines to achieve state-of-the-art performance.

2605.27874 2026-05-28 cs.CL

Syllabic-Structure Decoder for Automatic Speech Recognition in Vietnamese

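Nghia Hieu Nguyen, Quan Ngoc Hoang, Long Hoang Huu Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

AI summary: For Vietnamese automatic speech recognition, proposes phoneme-level syllabic-structure decoding that explicitly models the phonological composition of syllables and generates valid syllable structures from a compact phoneme set, substantially shrinking the vocabulary and beating strong baselines on two benchmarks.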

Abstract

Most Automatic Speech Recognition (ASR) systems formulate transcription as a prediction problem over orthographic units such as characters, subwords, or words. Although effective, such representations do not explicitly reflect the phonetic structure of speech and often require large vocabularies to maintain adequate coverage. In this work, motivated by the phonemic features of Vietnamese, we propose a Syllabic-Structure Decoder for ASR, which models speech at the phoneme level instead of the orthographic level. Our approach explicitly captures the phonological composition of syllables, enabling the decoder to generate valid syllabic structures from a compact phonemic inventory. This design more closely aligns with the phonetic realization of speech while significantly reducing vocabulary size. Experimental results on two benchmarks (LSVSC, representing standard speech, and UIT-ViMD, a multi-dialect corpus containing diverse regional pronunciations) show that our method consistently outperforms strong previous baselines, especially pretrained baselines such as PhoWhisper and Wav2Vec2, despite using a substantially smaller vocabulary and no additional training resources. These results highlight the effectiveness of phoneme-based syllabic modeling for Vietnamese ASR. Code for reproducing the experiments will be made publicly available upon acceptance of this paper.

2605.27873 2026-05-28 cs.AI

AIBuildAI-2: A Knowledge-Enhanced Agent for Automatically Building AI Models

AIBuildAI-2:一种用于自动构建AI模型的知识增强智能体

Ruiyi Zhang, Peijia Qin, Qi Cao, Li Zhang, Pengtao Xie

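AI summary Existing agents that automatically build AI models are limited by the static parametric knowledge of their underlying large language models. AIBuildAI-2 adds a hierarchical, evolving external knowledge system that dynamically loads relevant context, grounding design decisions in expert knowledge; it reaches a 70.7% medal rate on MLE-Bench and places in the top 6.6% in a heart disease prediction competition.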

详情
AI中文摘要

AI模型支撑着从图像和文本处理到生物学、物理学和化学科学发现的数据中心应用。然而,开发这些模型仍然高度依赖人工,需要从业者设计架构、构建训练流程并迭代优化解决方案,这使得缺乏专业AI工程专业知识的自然科学家难以构建其研究所需的高性能模型。为减轻这一负担并拓宽AI在科学发现中的可及性,已有研究提出自动构建AI模型的智能体。然而,这些智能体的性能很大程度上受限于其底层大语言模型的参数知识,这些知识是静态的、常常过时,且缺乏实用的AI模型工程诀窍。为解决这一局限,我们提出AIBuildAI-2,一种具有外部、可进化知识系统的知识增强智能体,用于自动构建AI模型。AIBuildAI-2的知识系统是分层的,将整理好的AI开发知识组织为按主题类别划分的高层知识指令和每个类别下的低层知识文档,智能体据此仅动态加载与当前状态及待解决AI任务相关的上下文,使每个设计和实现决策都基于具体、可外部验证的专业知识。该系统通过从网络收集和清洗AI开发相关文档并将其组织到相应类别进行初始化,并通过从智能体自身经验中提炼每次AI任务完成运行的结构化要点并写回知识系统而持续进化。AIBuildAI-2取得了最先进的结果,在MLE-Bench上以70.7%的奖牌率排名第一,并在一个心脏病预测竞赛中位列4370个人类专家团队的前6.6%。

英文摘要

AI models underpin data-centric applications from image and text processing to scientific discovery in biology, physics, and chemistry. Yet developing them remains heavily manual, requiring practitioners to design architectures, build training pipelines, and iteratively refine solutions, making it challenging for natural scientists without specialized AI engineering expertise to build the high-performing models their research demands. To reduce this burden and broaden access to AI for scientific discovery, agents that automatically build AI models have been proposed. However, the performance of these agents is largely limited by the parametric knowledge of their underlying large language models, which is static, often outdated, and sparse on practical AI model engineering know-how. To address this limitation, we introduce AIBuildAI-2, a knowledge-enhanced agent with an external, evolving knowledge system for automatically building AI models. The knowledge system of AIBuildAI-2 is hierarchical, organizing curated AI development knowledge into high-level knowledge instructions over topical categories and low-level knowledge documents under each category, from which the agent dynamically loads only the context relevant to its current state and the AI task being solved, grounding each design and implementation decision in concrete, externally verifiable expertise. The system is initialized by collecting and cleaning AI-development-related documents from the web and organizing them into the corresponding categories, and continually evolves from the agent's own experience by distilling each completed run on an AI task into structured takeaways that are written back into the knowledge system. AIBuildAI-2 achieves state-of-the-art results, ranking first on MLE-Bench with a 70.7% medal rate and placing in the top 6.6% among 4,370 human-expert teams in a heart disease prediction competition.

2605.27865 2026-05-28 cs.CL

MERIT: Matching Expertise via Rubric-Informed Training for Reviewer Assignment

MERIT: 通过基于评分标准的训练进行审稿人匹配的专业知识匹配

Zixuan Yang, Yibo Zhao, Weicong Liu, Xiang Li

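AI summary Proposes MERIT, a two-stage framework that trains a reviewer assessor with reinforcement learning and distills it into a retriever, enabling expertise matching for large-scale reviewer assignment.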

Comments 22 pages, 8 figures, 12 tables

详情
AI中文摘要

大规模地将投稿与合适的审稿人匹配是主要会议面临的日益严峻的挑战,然而现有方法要么依赖将一般相关性误认为真正适用性的粗略代理信号,要么需要难以扩展用于训练的昂贵人工标注。我们提出MERIT,一个两阶段框架,通过将标准级别的专业知识匹配转化为可扩展的适用性监督来弥合这一差距。在第一阶段,我们通过强化学习训练一个审稿人评估器,以识别论文所需的专业知识维度,将其与审稿人的先前工作匹配,并产生适用性决策,奖励由基于论文特定专业知识评分标准的LLM引导提供。在第二阶段,我们将评估器的预测蒸馏到基于嵌入的检索器中,以实现高效的大规模分配。实验表明,我们的4B审稿人评估器在适用性分类上优于更大的通用LLM,并且得到的检索器在LR-Bench和CMU Gold数据集上达到了最先进的性能。我们的代码可在https://github.com/Luli3220/MERIT获取。

英文摘要

Matching submissions with suitable reviewers at scale is a growing challenge for major venues, yet existing approaches either rely on coarse proxy signals that conflate general relatedness with true suitability, or require expensive human annotations that are difficult to scale for training. We propose MERIT, a two-stage framework that bridges this gap by converting criterion-level expertise matching into scalable suitability supervision. In the first stage, we train a reviewer assessor via reinforcement learning to identify the expertise dimensions a paper requires, match them against the reviewer's prior work, and produce a suitability decision, with rewards provided by an LLM judge guided by paper-specific expertise rubrics. In the second stage, we distill the assessor's predictions into an embedding-based retriever for efficient large-scale assignment. Experiments show that our 4B reviewer assessor outperforms larger general-purpose LLMs on suitability classification, and the resulting retriever achieves state-of-the-art performance across LR-Bench and the CMU Gold dataset. Our code is available at https://github.com/Luli3220/MERIT.

2605.27861 2026-05-28 cs.LG cs.AI q-bio.QM

From Detection to Mechanism: Cross-Attention Graph Neural Networks Enable Drug-Drug Interaction Type Prediction -- An Ablation Study with Acetylsalicylic Acid Validation

从检测到机制:跨注意力图神经网络实现药物相互作用类型预测——一项以乙酰水杨酸验证的消融研究

Juergen Dietrich

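AI summary A systematic ablation comparing three graph neural network architectures finds that cross-attention (CrossAtt) yields a much larger gain on drug-drug interaction type prediction (multi-class) than on binary detection, and achieves 10/10 correct predictions in the acetylsalicylic acid validation.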

Comments 12 pages, 1 figure

详情
AI中文摘要

预测两种药物是否相互作用(二元检测)与预测该相互作用的机制类型(多分类)是本质上不同的任务。本研究在包含38,337个正例对(涵盖86种相互作用类型)的公开基准数据集上,对三种图神经网络架构进行了系统的消融实验,用于药物相互作用预测。在相同训练条件下(n=61,339对)比较了三种架构:带有拼接的双消息传递神经网络(Concat)、带有四头跨注意力的双MPNN(CrossAtt)以及引入相互作用图的三元MPNN(Ternary)。CrossAtt在多分类F1-macro上比Concat绝对提升+0.186(+45%),而二元AUC仅提升+0.012(+1.3%),证实原子级分子间通信专门支持机制类型分类。尽管训练数据相同,三元架构表现不佳,其失败与训练不稳定性假设一致。在训练前保留的十个乙酰水杨酸药物对上的验证表明,CrossAtt实现了10/10正确的DDI类型预测,而Ternary为0/10。在所有架构中识别出两个一致的失败案例,与一项配套毒性研究中确立的结构限制相关。

英文摘要

Predicting whether two drugs interact (binary detection) is a substantially different task from predicting the mechanism type of that interaction (multi-class classification). This study presents a systematic ablation of three Graph Neural Network (GNN) architectures for drug-drug interaction (DDI) prediction on a publicly available benchmark dataset comprising 38,337 positive pairs across 86 interaction types. Three architectures are compared under identical training conditions (n = 61,339 pairs): a siamese dual Message Passing Neural Network (MPNN) with concatenation (Concat), a dual MPNN with four-head cross-attention (CrossAtt), and a ternary MPNN incorporating an interaction graph (Ternary). CrossAtt improves multi-class F1-macro by +0.186 absolute (+45%) over Concat, while improving binary AUC by only +0.012 (+1.3%), confirming that atom-level inter-molecular communication specifically enables mechanism-type classification. The ternary architecture underperforms despite equivalent training data, with its failure consistent with a training instability hypothesis. Validation on ten acetylsalicylic acid (ASA) drug pairs, held out prior to training, demonstrates 10/10 correct DDI-type predictions for CrossAtt versus 0/10 for Ternary. Two consistent failure cases are identified across all architectures, linking to structural limits established in a companion toxicity study.

2605.27860 2026-05-28 cs.AI

C-MIG: Multi-view Information Gain-based Retrieval-Augmented Generation for Clinical Diagnosis Reasoning

C-MIG:基于多视角信息增益的检索增强生成用于临床诊断推理

Yuwei Miao, Gen Li, Yunsheng Zeng, Xiandong Li, Yujin Wang, Siyu Chen, Luning Wang, Yunhao Qiao, Junfeng Wang, Jianwei Lv, Bo Yuan

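AI summary Proposes the C-MIG framework, which uses multi-view information gain and a multi-subquery retrieval augmentation strategy to address lost reward signal and the supervision of heterogeneous reasoning in retrieval-augmented generation, achieving the best performance among RAG-RL methods on clinical diagnosis tasks.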

详情
AI中文摘要

检索增强生成结合强化学习在将大型语言模型锚定于可信医学证据方面显示出前景。然而,现有方法依赖精确匹配的二元奖励,在临床诊断中导致两个问题:(i) 语义相关但非逐字匹配的步骤获得零信号,丢弃了有价值的学习信号;(ii) 单一维度的奖励无法有效监督异构推理能力。为解决这些问题,我们提出C-MIG,一种基于多视角信息增益的临床诊断检索增强生成框架。C-MIG在冻结参考模型下从两个互补视角——检索文档和文档精炼——估计信息增益,以联合指导检索什么以及如何精炼,缓解了有价值奖励信号丢失和信用分配问题。我们进一步设计了一种多重子查询检索增强策略,提高了临床诊断场景中的知识召回覆盖率。在四个医学基准上的综合实验表明,C-MIG在领域内和领域外数据集上均达到所有RAG-RL方法中的最佳性能,并在临床诊断上超越了最先进的通用大型语言模型。

英文摘要

Retrieval-augmented generation combined with reinforcement learning has shown promise for grounding large language models in trustworthy medical evidence. However, existing methods rely on exact-match binary rewards, which in clinical diagnosis cause two issues: (i) semantically relevant but non-verbatim steps receive zero signal, discarding valuable learning signals; and (ii) uni-dimensional rewards cannot effectively supervise heterogeneous reasoning capabilities. To address these issues, we propose C-MIG, a Multi-view Information Gain-based retrieval-augmented generation framework for Clinical diagnosis. C-MIG estimates information gain under a frozen reference model from two complementary views, retrieved-document and document-refinement, to jointly guide what to retrieve and how to refine, alleviating the issues of valuable reward signal loss and credit assignment. We further design a multi-subquery retrieval augmentation strategy that improves knowledge recall coverage in clinical diagnostic scenarios. Comprehensive experiments on four medical benchmarks demonstrate that C-MIG achieves the best performance among all RAG-RL methods on both in-domain and out-of-domain sets, and outperforms state-of-the-art general-purpose LLMs for clinical diagnosis.

2605.27858 2026-05-28 cs.CL cs.AI cs.LG

DecomposeRL: Learning to Ask Useful, Informative, and Diverse Questions for Semi-Supervised, Traceable Claim Verification

DecomposeRL: 学习提出有用、信息丰富且多样的问题以进行半监督、可追踪的声明验证

Shubhashis Roy Dipta, Ankur Padia, Francis Ferraro

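AI summary Proposes DecomposeRL, which uses GRPO and a multi-faceted reward ensemble to decompose claims into traceable sub-questions, achieving high accuracy in both fully supervised and semi-supervised settings and matching larger models despite being 4x smaller.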

详情
AI中文摘要

声明验证分为两类:端到端分类器准确但无法提供可检查的追踪,而基于分解的方法可产生可检查的追踪但在基准数据集上性能滞后。我们提出DecomposeRL,一种能产生可检查追踪的准确声明验证器。DecomposeRL将分解建模为使用GRPO和多面奖励集成训练的RL策略,支持从无标签声明进行完全监督和半监督学习。DecomposeRL通过数据筛选漏斗解决了GRPO高昂的训练成本,将115K事实验证声明提炼为包含密集学习信号的5K声明子集。我们表明,仅在约5K精选声明上使用完全监督训练的DecomposeRL-7B策略,在包含生物医学、政治、科学和通用领域声明的11个声明验证基准上,实现了86.3的域内和69.8的域外平衡准确率。尽管规模小4倍,它匹配了32B基线和GPT-4.1-mini,并且在仅10%标签声明数据的半监督设置中进一步优于基线。代码、数据和模型见https://dipta007.github.io/DecomposeRL。

英文摘要

Claim verification is split between end-to-end classifiers, which are accurate but yield no inspectable traces, and decomposition-based methods, which produce inspectable traces but lag in performance on benchmark datasets. We propose DecomposeRL, an accurate claim verifier that produces inspectable traces. DecomposeRL frames decomposition as an RL policy trained with GRPO and a multi-faceted reward ensemble, enabling both fully supervised and semi-supervised learning from unlabeled claims. DecomposeRL addresses the prohibitive training cost of GRPO with a data-curation funnel that distills 115K fact-verification claims into a compact, learning-signal-dense subset of 5K claims. We show that a DecomposeRL-7B policy trained with full supervision on only ~5K curated claims achieves 86.3 in-domain and 69.8 out-of-domain balanced accuracy across 11 claim-verification benchmarks containing biomedical, political, scientific, and general-domain claims. Despite being 4x smaller, it matches 32B baselines and GPT-4.1-mini, and it further outperforms baselines in a semi-supervised setting with only 10% of the claim data labeled. Code, data, and models are available at https://dipta007.github.io/DecomposeRL
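The abstract reports balanced accuracy without defining it. The sketch below assumes the standard definition (the unweighted mean of per-class recall), which may differ in detail from the paper's evaluation script; the function name and the label strings used in the note are illustrative only.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall over the classes present in y_true.

    Assumption: this is the standard definition; the paper's exact
    evaluation protocol is not given in the abstract.
    """
    correct = defaultdict(int)  # per-class count of correct predictions
    total = defaultdict(int)    # per-class count of gold examples
    for gold, pred in zip(y_true, y_pred):
        total[gold] += 1
        if gold == pred:
            correct[gold] += 1
    # Each class contributes equally, regardless of how many examples it has.
    return sum(correct[c] / total[c] for c in total) / len(total)
```

Unlike plain accuracy, this metric does not reward a verifier for always predicting the majority label: with three "supported" claims and one "refuted" claim, predicting "supported" everywhere gives perfect recall on one class and zero on the other.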

2605.27853 2026-05-28 cs.AI

MolLingo: Molecule-Native Representations for LLM-Powered Scientific Agents

MolLingo:面向LLM驱动的科学智能体的分子原生表示

Thao Nguyen, Heng Ji

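AI summary Proposes MolLingo, a multi-agent system that coordinates literature, chemist, and orchestrator agents through shared memory and, combined with the BRICS-based Fragment Enumeration (BFE) representation, enables block-level molecular reasoning and editing, outperforming frontier LLMs and specialized baselines on four benchmarks.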

详情
AI中文摘要

我们提出MolLingo,一个模拟化学家推理过程的多智能体系统,用于自动化分子设计。现有的基于LLM的方法要么作为独立的生成模型运行,无法访问外部工具,要么缺乏多智能体协调和共享内存,无法在分子设计流程中进行迭代、证据驱动的推理。MolLingo通过共享内存模块协调文献智能体、化学家智能体和编排智能体来解决这一问题,每个智能体配备领域特定工具。为了实现有效的分子推理,我们引入了基于BRICS的片段枚举(BFE),这是一种合成感知的分子碎片化方法,将分子分解为化学上有意义的构建块,表示为基于块的SMILES并配以常见化学名称。这种表示桥接了分子结构和LLM语义空间,实现了仅靠原始SMILES难以实现的块级推理和编辑。作为早期治疗设计的案例研究,MolLingo进一步将化学家智能体的推理基于结合位点几何和来自分子对接的残基级蛋白质上下文,以优化分子以实现更强的靶标结合。在四个基准上,MolLingo始终优于前沿LLM和专用基线,包括在相同底层模型下,对接分数比GPT-5.4提升四倍,在多个LLM骨干上一致的药物性质优化增益,以及在TOMG-Bench上达到最先进结果,超越了前沿LLM和基于RL的优化方法RePO。我们的结果表明,当通过化学上有意义的表示和生物学基础的上下文进行引导时,LLM已经能够成为有能力的分子设计助手。代码可在:https://anonymous.4open.science/status/MolLingo-7450 获取。

英文摘要

We present MolLingo, a multi-agent system that emulates the reasoning process of a chemist to automate molecular design. Existing LLM-based approaches either operate as standalone generative models without access to external tools or lack the multi-agent coordination and shared memory needed for iterative, evidence-driven reasoning across the molecular design pipeline. MolLingo addresses this by coordinating a Literature Agent, a Chemist Agent, and an Orchestrator through a shared memory module, with each agent equipped with domain-specific tools. To enable effective molecular reasoning, we introduce BRICS-based Fragment Enumeration (BFE), a synthesis-aware molecular fragmentation method that decomposes molecules into chemically meaningful building blocks represented as block-based SMILES paired with common chemical names. This representation bridges molecular structure and LLM semantic space, enabling block-level reasoning and editing that is difficult with raw SMILES alone. As a case study in early-stage therapeutic design, MolLingo further grounds the Chemist Agent's reasoning in binding site geometry and residue-level protein context derived from molecular docking to optimize molecules for stronger target binding. Across four benchmarks, MolLingo consistently outperforms frontier LLMs and specialized baselines, including a fourfold docking score improvement over GPT-5.4 despite using the same underlying model, consistent drug property optimization gains across multiple LLM backbones, and state-of-the-art results on TOMG-Bench, surpassing both frontier LLMs and the RL-based optimization method RePO. Our results suggest that LLMs are already capable molecular design assistants when guided through chemically meaningful representations and biologically grounded structural context. Code is available at: https://anonymous.4open.science/status/MolLingo-7450.

2605.27851 2026-05-28 cs.AI

When Context Flips, Safety Breaks: Diagnosing Brittle Safety in Aligned Language Models

当上下文翻转,安全失效:诊断对齐语言模型中的脆弱安全性

Dasol Choi, Alex Kwon

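AI summary Proposes context-flip evaluation, testing 12 models on a safety benchmark and commonsense controls; finds that aligned language models exhibit safety-specific brittleness that stems from policy override rather than miscomprehension, and shows that action-level guardrails fail to detect consequence flips.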

详情
AI中文摘要

安全基准分数提供的部署准备证据不完整:对齐语言模型通常遵循刚性规则,即使情境更新翻转了哪个动作是安全的。我们将这种失败称为脆弱安全性。为诊断它,我们引入上下文翻转评估,在安全基准(PacifAIst)和两个常识控制上测试12个模型,使用配对变体,其中名义上安全的动作产生伤害。出现三个发现。首先,脆弱安全性是安全特异性的:所有12个模型都表现出安全-常识差距(平均+17.4个百分点)。基线准确率无法预测脆弱性:在基线准确率高于90%的模型中,脆弱率从13.7%到90.0%不等。其次,失败源于策略覆盖而非理解错误:尽管在每个案例中都承认上下文变化,模型通过三种不同机制持续存在,这些机制因更新类型和模型系列而异。第三,在对灾难性后果翻转场景的手动审计探测中,标准动作级护栏未能检测到任何情况,而状态感知验证器在正确干预上无假警报地检测到所有情况。这表明动作级内容审核系统性地对后果翻转视而不见,激发了状态感知架构替代方案。我们发布我们的协议、扰动基准和部署探测。

英文摘要

Safety benchmark scores provide incomplete evidence of deployment readiness: aligned language models often adhere to rigid rules even when a situational update flips which action is safe. We term this failure brittle safety. To diagnose it, we introduce context-flip evaluation, testing 12 models across a safety benchmark (PacifAIst) and two commonsense controls using paired variants where the nominally safe action produces harm. Three findings emerge. First, brittle safety is safety-specific: all 12 models exhibit a safety-commonsense gap (mean +17.4 pp). Baseline accuracy fails to predict brittleness: among models above 90% baseline accuracy, brittleness rates range from 13.7% to 90.0%. Second, failures stem from policy override rather than miscomprehension: despite acknowledging the context change in every case, models persist via three distinct mechanisms that vary by update type and model family. Third, on a hand-audited probe of catastrophic consequence-flip scenarios, standard action-level guardrails catch none, while a state-aware validator catches all without false alarms on correct interventions. This indicates that action-level content moderation is systematically blind to consequence-flips, motivating state-aware architectural alternatives. We release our protocol, perturbed benchmarks, and deployment probe.

2605.27850 2026-05-28 cs.AI

TCP-MCP: Landscape-Guided Co-Evolution of Prompts and Communication Topologies for Multi-Agent Systems

TCP-MCP:面向多智能体系统的提示与通信拓扑的景观引导协同进化

Yi Ding, Zijie Xuan, Haowei Zhou, Zhenyu Ju, Xiaoxiao Dong, Jingwen Zhang, Xingyu Zhu, Leixin Sun, Haochi Zhang

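AI summary Proposes TCP-MCP, a framework that co-evolves agent prompts and communication topologies under three objectives (task performance, token cost, and structural complexity) to achieve cost-aware, task-adaptive multi-agent system design.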

详情
AI中文摘要

有效的多智能体系统不能通过孤立地选择提示或通信图来设计。智能体行为取决于其接收的信息,而通信边的有用性则取决于接收智能体如何解释和使用该信息。我们提出TCP-MCP(面向多智能体协作问题求解的拓扑耦合提示),这是一个将智能体提示和通信拓扑作为统一基因组进行搜索的协同进化框架。TCP-MCP使用初始化时的景观探针来校准早期搜索行为,然后依赖帕累托前沿诊断在三个目标(任务性能、令牌成本和结构复杂度)下自适应调整探索。在所有方法中使用相同的DeepSeek-V3.2骨干网络,TCP-MCP在MMLU-Pro、MMLU和GSM8K上分别达到82.66%、89.96%和96.61%的准确率。在三个基准测试中,它持续优于自动图生成基线,并在报告的操作点上达到与辩论式系统相当的准确率,同时使用的令牌数最多减少5.69倍。这些结果表明,联合进化提示和通信结构为受控评估中成本感知和任务自适应的多智能体系统设计提供了一条实用途径。

英文摘要

Effective multi-agent systems cannot be designed by selecting prompts or communication graphs in isolation. Agent behavior depends on the information an agent receives, while the usefulness of a communication edge depends on how the receiving agent interprets and uses that information. We propose TCP-MCP (Topology-Coupled Prompting for Multi-Agent Collaborative Problem-Solving), a co-evolution framework that searches agent prompts and communication topologies as a unified genome. TCP-MCP uses an initialization-time landscape probe to calibrate early search behavior, and then relies on Pareto-front diagnostics to adapt exploration under three objectives: task performance, token cost, and structural complexity. Using the same DeepSeek-V3.2 backbone across all methods, TCP-MCP achieves 82.66%, 89.96%, and 96.61% accuracy on MMLU-Pro, MMLU, and GSM8K, respectively. Across the three benchmarks, it consistently outperforms automated graph-generation baselines and achieves competitive accuracy relative to debate-style systems, while using up to 5.69× fewer tokens than those systems at the reported operating points. These results show that jointly evolving prompts and communication structure provides a practical route to cost-aware and task-adaptive multi-agent system design in controlled evaluations.
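To make the three-objective setting concrete, here is a minimal sketch of Pareto-front extraction over candidate designs. It is not the TCP-MCP search procedure, which the abstract does not specify; the `Candidate` fields, and the use of edge count as a stand-in for structural complexity, are assumptions made for illustration.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """A hypothetical evaluated design (prompts plus topology)."""
    name: str
    accuracy: float  # task performance, higher is better
    tokens: int      # token cost, lower is better
    edges: int       # assumed proxy for structural complexity, lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on every objective and better on at least one."""
    no_worse = a.accuracy >= b.accuracy and a.tokens <= b.tokens and a.edges <= b.edges
    better = a.accuracy > b.accuracy or a.tokens < b.tokens or a.edges < b.edges
    return no_worse and better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep the candidates that no other candidate dominates (O(n^2) scan)."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]
```

A design that is both less accurate and more expensive than another is dropped, while a cheap, sparse design with lower accuracy stays on the front because it represents a different trade-off.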

2605.27846 2026-05-28 cs.AI

EAPO: Entropy-Driven Adaptive Positive-Negative Sample Weighting for Policy Optimization in Open-Ended QA

EAPO: 面向开放问答的基于熵驱动的自适应正负样本加权策略优化

Yunsheng Zeng, Gen Li, Yuwei Miao, Xiandong Li, Yujin Wang, Siyu Chen, Luning Wang, Yunhao Qiao, Junfeng Wang, Jianwei Lv, Bo Yuan

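AI summary To address fixed sample weights in reinforcement learning for open-ended QA, proposes EAPO, an entropy-driven adaptive policy optimization method that dynamically adjusts positive and negative sample weighting to balance exploration and stability, significantly improving diversity and stability on medical QA datasets.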

详情
AI中文摘要

大型推理模型通常通过可验证奖励的强化学习(RLVR)进行训练。然而,现有方法对正负样本采用固定权重,且结论难以推广到开放问答(QA)。本文系统研究了开放问答中强化学习正负样本的作用。我们提出了一种基于奖励均值的策略来区分正负样本,并观察到负样本主要控制响应多样性和性能上限,而正样本主要决定响应质量和收敛稳定性。基于这些观察,我们提出了EAPO,一种基于熵驱动的自适应策略优化方法,该方法根据当前策略熵与初始熵的比率自适应计算正样本的加权系数。在熵减阶段,分配给正样本的权重降低以保持探索,而在熵增阶段则放大以增强稳定性,从而缓解熵崩溃。在两个公开的开放医学问答数据集上的实验表明,EAPO在响应多样性和稳定性方面一致且显著优于固定权重基线。

英文摘要

Large Reasoning Models are typically trained via reinforcement learning from verifiable rewards (RLVR). However, existing approaches adopt fixed weights for positive and negative samples, and their conclusions generalize poorly to open-ended question answering (QA). In this paper, we systematically investigate the roles of positive and negative samples in reinforcement learning for open-ended QA. We propose a reward-mean-based strategy for distinguishing positive from negative samples, and observe that negative samples predominantly govern response diversity and the performance upper bound, whereas positive samples primarily determine response quality and convergence stability. Building on these observations, we propose EAPO, an Entropy-driven Adaptive Policy Optimization method that adaptively computes the weighting coefficient of positive samples from the ratio of the current policy entropy to the initial entropy. During the entropy-decreasing phase, the weight assigned to positive samples is reduced to preserve exploration, whereas during the entropy-increasing phase it is amplified to reinforce stability, thereby mitigating entropy collapse. Experiments on two publicly available open-ended medical QA datasets demonstrate that EAPO consistently and substantially outperforms fixed-weight baselines in both response diversity and stability.
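The weighting rule can be sketched as follows. The abstract states only that the positive-sample weight depends on the ratio of current to initial policy entropy and that positives are separated from negatives by the reward mean; using the clipped ratio itself as the weight, the clipping bounds, and the function names are all assumptions for illustration, not the paper's formula.

```python
def positive_weight(entropy_now, entropy_init, lo=0.5, hi=2.0):
    """Weight for positive samples from the entropy ratio.

    Assumption: weight = clip(H_now / H_init, lo, hi). A ratio below 1
    (entropy has dropped) shrinks the weight; a ratio above 1 enlarges it.
    """
    ratio = entropy_now / entropy_init
    return max(lo, min(hi, ratio))


def weighted_advantages(rewards, entropy_now, entropy_init):
    """Mean-centred advantages with positive samples rescaled.

    A sample counts as positive when its reward exceeds the group mean,
    following the reward-mean-based split described in the abstract.
    """
    mean = sum(rewards) / len(rewards)
    w = positive_weight(entropy_now, entropy_init)
    advantages = []
    for r in rewards:
        adv = r - mean
        # Only positive samples are rescaled; negatives keep their advantage.
        advantages.append(w * adv if adv > 0 else adv)
    return advantages
```

When entropy has halved relative to its initial value, a positive sample's advantage is halved under this sketch, so the update pushes less hard toward already-favoured responses.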

2605.27843 2026-05-28 cs.CV

A self-supervised learning approach to deep filter banks for texture recognition

一种用于纹理识别的深度滤波器组的自监督学习方法

Joao B. Florindo, Lucas O. Lyra, Antonio E. Fabris

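AI summary To address limited training data in texture recognition, proposes a self-supervised pretraining framework based on a convolutional autoencoder, combined with deep filters and Fisher vector pooling, that improves recognition performance without significantly increasing computational cost.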

详情
AI中文摘要

纹理识别中的一个重要挑战是实际应用中经常遇到的训练数据有限。在计算机视觉中,缓解这一问题的一个成功策略是使用预训练阶段,其中神经网络以自监督方式学习识别数据各部分之间的关系。在这方面,一个成熟的框架是掩码自编码器。然而,这些模型通常依赖于计算密集型的架构,如视觉变换器。在纹理图像的特定情况下,大多数相关信息被压缩在每个像素周围的有限区域内,这表明通过注意力机制捕获长距离依赖可能是不必要的。基于这一假设,本文提出了一种预训练模型为卷积自编码器的框架。为了利用纹理模式传递的丰富信息,我们采用了深度滤波器与Fisher向量池化相结合的方法。通过这种方式,我们在不增加显著计算负担的情况下提高了纹理识别的性能。我们的方法与多个纹理数据库中的几种最先进方法进行了比较,证实了其在分类精度和计算复杂度方面的潜力。

英文摘要

An important challenge in texture recognition is the limited amount of training data frequently encountered in real-world applications. In computer vision in general, a successful strategy to mitigate this issue is a pretraining stage in which the neural network learns to identify relations between parts of the data in a self-supervised manner. A well-established framework in this direction is the masked autoencoder. Nevertheless, these models usually rely on computationally intensive architectures, such as vision transformers. In the particular case of texture images, most of the relevant information is concentrated within a limited area around each pixel, which suggests that capturing long-range dependence via the attention mechanism may be unnecessary. Based on that assumption, we propose a framework in which the pretraining model is a convolutional autoencoder. To leverage the rich information conveyed by texture patterns, we employ deep filters coupled with Fisher vector pooling. In this way, we improve the performance of texture recognition without adding significant computational burden. Our approach is compared with several state-of-the-art methods on different texture databases, confirming its potential in terms of both classification accuracy and computational complexity.

2605.27838 2026-05-28 cs.SD

Dasheng AudioGen: A Unified Model for Generating Coherent Audio Scenes from Text

Dasheng AudioGen: 从文本生成连贯音频场景的统一模型

Jiahao Mei, Heinrich Dinkel, Yadong Niu, Xingwei Sun, Gang Li, Yifan Liao, Jiahao Zhou, Junbo Zhang, Jian Luan, Mengyue Wu

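AI summary Proposes Dasheng AudioGen, a unified framework that uses structured multi-view captions and a high-dimensional unified semantic-acoustic representation to generate mixed-audio scenes from text end to end.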

详情
AI中文摘要

音频生成长期以来一直分散,语音、音乐和音效由特定领域的模型生成,无法从单一描述联合生成连贯的音频场景。关键障碍在于对真实世界混合音频缺乏细粒度监督,以及用于建模并发音频组件的声学表示有限。我们提出了Dasheng AudioGen,一个从文本生成通用混合音频场景的统一框架。Dasheng AudioGen引入了结构化多视角描述,将复杂声学场景显式解耦为互补的描述视角,从而实现对音频层的细粒度控制。此外,我们采用高维统一语义-声学表示作为共享潜在空间。它注入语义先验,促进跨模态训练收敛,同时其高维特征空间提供足够容量以有效解耦和融合并发音频组件。通过这些设计,一个简单的流匹配DiT实现了高质量端到端音频场景生成。我们还为音频场景生成建立了全面的评估流程。实验表明,Dasheng AudioGen在混合音频类别中实现了接近真实录音的性能,同时在单类型生成任务中与专门模型保持竞争力。演示可在https://nieeim.github.io/Dasheng-AudioGen-Web/获取。

英文摘要

Audio generation has long been fragmented, with speech, music, and sound effects produced by domain-specific models that fail to jointly generate coherent audio scenes from a single description. The key obstacles are insufficient fine-grained supervision for real-world mixed audio and limited acoustic representations for modeling concurrent audio components. We present Dasheng AudioGen, a unified framework for generating general mixed-audio scenes from text. Dasheng AudioGen introduces structured multi-view captions, which explicitly decouple complex acoustic scenes into complementary description views, thereby enabling fine-grained control over audio layers. Furthermore, we employ a high-dimensional unified semantic-acoustic representation as the shared latent space. It injects semantic priors that facilitate cross-modal training convergence, while its high-dimensional feature space provides sufficient capacity to disentangle and fuse concurrent audio components effectively. With these designs, a simple flow-matching DiT achieves high-quality end-to-end audio scene generation. We also establish a comprehensive evaluation pipeline for audio scene generation. Experiments demonstrate that Dasheng AudioGen achieves performance approaching real-world recordings in mixed-audio categories, while remaining competitive with specialized models in single-type generation tasks. Demos are available at https://nieeim.github.io/Dasheng-AudioGen-Web/.

2605.27834 2026-05-28 cs.LG stat.ML

Reward Transfer from Inverse Reinforcement Learning: A Coupled Minimax Approach

从逆强化学习中的奖励迁移:一种耦合极小极大方法

Guang-Yuan Hao, Lars van der Laan, Aurélien Bibaut, Nathan Kallus

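AI summary Proposes a coupled minimax approach that jointly solves the Bellman equations of the source and target environments, removing the first-order influence of source Bellman residual error and enabling effective transfer of inverse-reinforcement-learning rewards from the source to the target environment.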

详情
AI中文摘要

我们研究利用逆强化学习从专家演示中学习到的奖励从一个环境迁移到另一个不同环境的强化学习问题。当演示在受控环境中收集时,这自然发生。我们将问题表述为跨源和目标环境的贝尔曼方程联合系统,并开发了目标软$q$函数的极小极大估计器。顺序求解方法首先估计源奖励,然后将其代入目标控制问题,而耦合方法则联合求解源和目标系统方程。我们表明,与顺序方法相比,耦合方法消除了源贝尔曼残差误差的一阶影响。我们刻画了每种方法的局部行为,建立了有限样本软$q$函数误差界,并证明了所得软控制策略的遗憾保证。使用脓毒症模拟器的实证研究验证了理论比较。

英文摘要

We study the transfer of rewards learned using inverse reinforcement learning from expert demonstrations in one environment to reinforcement learning in a new, different environment. This arises naturally when demonstrations are collected in a controlled environment. We formulate the problem as a joint system of Bellman equations across the source and target environments and develop minimax estimators for the target soft-$q$-function. Whereas a sequential solution approach first estimates the source reward and then plugs it into the target control problem, a coupled approach solves the source and target system of equations jointly. We show that, in contrast to the sequential approach, the coupled approach removes the first-order influence of source Bellman residual error. We characterize the local behavior of each approach, develop finite-sample soft-$q$-function error bounds, and prove regret guarantees for the resulting soft-control policy. An empirical investigation using a sepsis simulator validates the theoretical comparison.

2605.27832 2026-05-28 cs.CL

Playing with Words, Improving with Rewards: Training Language Models for Creative Association

玩文字游戏,用奖励改进:训练语言模型进行创意联想

Vijeta Deshpande, Namrata Shivagunde, Sherin Muckatira, Hadrien Glaude, Mikhail Gronas, Claire Stevenson, Roger Beaty, Anna Rumshisky

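AI summary Trains LLMs on the Codenames game using reinforcement learning with verifiable rewards (RLVR) and explores a scale-dependent precision-creativity trade-off: the 8B model improves creativity while largely retaining reasoning ability, whereas smaller models trade creativity for reasoning precision.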

详情
AI中文摘要

大型语言模型(LLM)正被应用于日益困难的问题和用例。为了有效导航其广阔的解决方案空间,LLM需要具备创造力。然而,创造力的主观性和人类判断的局限性使得训练LLM的创造力尤其具有挑战性。作为解决方案,我们在Codenames(一个词联想游戏)上训练LLM,该游戏锻炼了创造力的两个核心轴——发散思维和收敛思维,同时产生客观可验证的结果。这种可验证性使我们能够绕过人类判断,并使用具有可验证奖励的强化学习(RLVR)进行训练。我们训练了Qwen3-1.7B、4B和8B模型,并在十个创造力和四个推理基准上评估它们。我们发现精确度-创造力权衡是规模依赖的:8B模型优先考虑创造力而非精确度,而1.7B和4B模型则以牺牲创造力为代价获得推理精确度。具体来说,8B模型在8个创造力基准上显示出适度但一致的提升,且推理能力仅略有下降,而较小的模型在推理任务上取得了显著提升。我们的研究提出了一种可扩展且有效的解决方案来训练LLM的创造力。

英文摘要

Large Language Models (LLMs) are being applied to increasingly difficult problems and use cases. To navigate their vast solution spaces effectively, LLMs need to be creative. Yet the subjective nature of creativity and the limits of human judgment make training LLMs for creativity especially challenging. As a solution, we train LLMs on Codenames, a word-association game that exercises the two central axes of creativity, divergent and convergent thinking, while yielding objectively verifiable outcomes. This verifiability lets us bypass human judgment and train with Reinforcement Learning with Verifiable Rewards (RLVR). We train Qwen3-1.7B, 4B, and 8B models and evaluate them on ten creativity and four reasoning benchmarks. We find that the precision-diversity trade-off is scale-dependent: the 8B model prioritizes creativity over precision, while the 1.7B and 4B models gain reasoning precision at the cost of creativity. Concretely, the 8B model shows modest but consistent creativity gains (8 of 10 benchmarks) with only minor reasoning degradation, whereas the smaller models achieve substantial gains on reasoning tasks. Our study presents a scalable and effective solution to train LLMs for creativity.

2605.27831 2026-05-28 cs.LG eess.SP math.OC

Decentralized Parameter-Free Online Learning with Compressed Gossip

基于压缩八卦的去中心化无参数在线学习

Tomas Ortega, Hamid Jafarkhani

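AI summary Proposes DECO-EF, an algorithm combining coin-betting predictions with compressed difference-based gossip, achieving parameter-free adaptivity and sublinear network regret under compressed communication in decentralized online convex optimization.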

详情
AI中文摘要

我们研究当智能体通过图通信且消息可能被压缩时的去中心化在线凸优化。经典的去中心化在线方法通常需要依赖于时间范围、比较器尺度或其他问题参数的学习率选择,而压缩通信引入了必须控制的额外不一致性。我们提出DECO-EF(带误差反馈的去中心化coin-betting),一种去中心化无参数在线学习算法,结合coin-betting预测与基于压缩差分的八卦。每个智能体维护一个干净的累积状态和一个压缩跟踪器,并在八卦步骤中仅通信压缩的状态差分。该方法在在线学习意义上是无参数的:它不调整时间范围、比较器范数或学习率。我们证明了在压缩通信下DECO-EF的期望比较器自适应网络遗憾界。据我们所知,这首次为压缩通信下的无参数去中心化在线学习提供了期望次线性网络遗憾保证。

英文摘要

We study decentralized online convex optimization when agents communicate over a graph and messages may be compressed. Classical decentralized online methods typically require learning-rate choices that depend on the horizon, comparator scale, or other problem parameters, while compressed communication introduces additional disagreement that must be controlled. We propose DECO-EF (DEcentralized COin-betting with Error Feedback), a decentralized parameter-free online learning algorithm that combines coin-betting predictions with compressed difference-based gossip. Each agent maintains a clean accumulated state and a compressed tracker, and communicates only compressed state differences during gossip steps. The method is parameter-free in the online-learning sense: it does not tune to the horizon, the comparator norm, or the learning rate. We prove expected comparator-adaptive network-regret bounds for DECO-EF under compressed communication. To the best of our knowledge, this gives the first expected sublinear network-regret guarantees for parameter-free decentralized online learning under compressed communication.
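For readers unfamiliar with coin-betting, the sketch below shows a single-agent, one-dimensional bettor using the Krichevsky-Trofimov betting fraction, a standard parameter-free building block. It is not DECO-EF: there is no graph, gossip, compression, or error feedback here, and the class name and interface are assumptions for illustration.

```python
class KTBettor:
    """1-D coin-betting online learner with the Krichevsky-Trofimov fraction.

    Assumes every (sub)gradient lies in [-1, 1]. No learning rate is set;
    the prediction scales with the wealth accumulated so far.
    """

    def __init__(self, initial_wealth=1.0):
        self.wealth = initial_wealth
        self.coin_sum = 0.0  # running sum of past "coins" (negative gradients)
        self.t = 0           # number of predictions made so far
        self.x = 0.0         # most recent prediction

    def predict(self):
        self.t += 1
        fraction = self.coin_sum / self.t  # KT betting fraction, in (-1, 1)
        self.x = fraction * self.wealth    # bet a signed fraction of current wealth
        return self.x

    def update(self, grad):
        coin = -grad                       # outcome of the round
        self.wealth += coin * self.x       # win or lose the amount that was bet
        self.coin_sum += coin
```

Feeding a constant gradient of -1 makes the predictions grow as 0, 0.5, then about 1.0, showing how the bettor moves toward a distant comparator without being told its scale.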

2605.27827 2026-05-28 cs.AI cs.CY

Operational AI Deployment Assurance: Governance-State Orchestration Under Threshold-Sensitive Deployment Conditions -- A Governance Framework for High-Stakes AI Systems

运营级AI部署保障:阈值敏感部署条件下的治理状态编排——高风险AI系统的治理框架

Khalid Adnan Alsayed

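AI summary Proposes the Operational AI Deployment Assurance (OADA) framework, which uses Deployment Assurance Scores, readiness classifications, Threshold Stability Zones, Governance Escalation States, and remediation-aware assurance progression to translate fairness disagreement, subgroup instability, and threshold sensitivity into deployment-oriented governance decisions, addressing the shortcomings of static metric reporting and post-hoc auditing in high-stakes AI systems.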

Comments 13 pages, 3 figures, governance-oriented framework for operational AI deployment assurance in high-stakes systems

详情
AI中文摘要

AI治理框架日益强调高风险领域的公平性、透明度、问责制和生命周期风险管理。然而,许多当前方法仍停留在观察层面,依赖静态指标报告、事后审计和监控仪表板,而未能直接治理部署就绪性、修复进展、升级状态或保障驱动的部署控制。本文引入运营级AI部署保障(OADA),这是一个治理框架,用于将公平性分歧、子组不稳定性、阈值敏感性、修复结果和运营不确定性转化为面向部署的保障决策。基于先前关于公平性分歧指数(FDI)和FairRisk-FDI的工作,OADA将治理不确定性重新定义为AI部署管道中的运营问题,而非指标分歧的副产品。该框架引入了部署保障分数、部署就绪分类、阈值稳定区、治理升级状态和修复感知保障推进。这些构造通过将评估输出与部署状态解释、重新评估、升级和运营控制相连接,支持高风险环境中的生命周期导向治理决策。通过在面部识别系统上进行面向部署的评估,并将讨论扩展到作为代表性高风险领域的医疗AI,本文展示了系统在孤立的公平性或性能指标下可能看似可接受,同时仍表现出影响部署就绪性的不稳定性。所提出的框架将运营部署保障定位为评估与现实世界AI部署之间的治理层。

英文摘要

AI governance frameworks increasingly emphasize fairness, transparency, accountability, and lifecycle risk management in high-stakes domains. However, many current approaches remain observational, relying on static metric reporting, post-hoc auditing, and monitoring dashboards without directly governing deployment readiness, remediation progression, escalation states, or assurance-driven deployment control. This paper introduces Operational AI Deployment Assurance (OADA), a governance framework for translating fairness disagreement, subgroup instability, threshold sensitivity, remediation outcomes, and operational uncertainty into deployment-oriented assurance decisions. Building on prior work on the Fairness Disagreement Index (FDI) and FairRisk-FDI, OADA reframes governance uncertainty as an operational concern within AI deployment pipelines rather than a byproduct of metric disagreement. The framework introduces Deployment Assurance Scores, Deployment Readiness Classifications, Threshold Stability Zones, Governance Escalation States, and remediation-aware assurance progression. These constructs support lifecycle-oriented governance decisions across high-stakes settings by connecting evaluation outputs to deployment-state interpretation, reassessment, escalation, and operational control. Through deployment-oriented evaluation across facial recognition systems, with discussion extended to healthcare AI as a representative high-stakes domain, the paper demonstrates how systems may appear acceptable under isolated fairness or performance metrics while still exhibiting instability that affects deployment readiness. The proposed framework positions operational deployment assurance as a governance layer between evaluation and real-world AI deployment.