arXivDaily: daily arXiv academic digest, updated Monday through Friday
2503.18288 2026-05-27 cs.CL

TFD: A Comprehensive Structured Tibetan Foundation Dataset for Low-Resource Language Processing and Large-Scale Modeling

Cheng Huang, Fan Gao, Nyima Tashi, Yutong Liu, Yadi Liu, Wenbin Wei, Xiangxiang Wang, Yongbin Yu

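Affiliations: University of Electronic Science and Technology of China; Xizang University; ZenWeave AI; The State Key Laboratory of Tibetan Intelligence; Nanyang Technological University

AI summary: No existing dataset covers the full development pipeline for Tibetan large language models (pre-training, instruction tuning, safety alignment, preference optimization, and reasoning supervision). TFD is presented as the first structured, large-scale, expert-curated Tibetan foundation dataset, comprising a unified corpus of over 11 billion tokens and a chain-of-thought dataset. Sun-Shine, a family of Tibetan models trained on it, shows substantial gains on understanding, safety, reasoning, and generation benchmarks.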

Abstract

Large Language Models (LLMs) have achieved remarkable success in high-resource languages, yet progress in Tibetan remains severely constrained. While recent efforts have begun to address pre-training data scarcity for Tibetan, a more fundamental gap persists: no existing resource supports the complete LLM development pipeline, spanning pre-training, instruction tuning, safety alignment, preference optimization, and reasoning supervision. We introduce the Tibetan Foundation Dataset (TFD), the first structured, large-scale, and expert-curated dataset covering all key stages of Tibetan large language modeling. TFD comprises TIBSTC, a unified corpus of over 11 billion tokens with curated sub-datasets for instruction tuning, safety alignment, and preference optimization, and TIBSTC-CoT, the first large-scale Tibetan chain-of-thought dataset. We demonstrate its utility by training the Sun-Shine family of Tibetan LLMs, achieving substantial improvements over strong baselines on understanding, safety, reasoning, and generation benchmarks. These results underscore that advancing low-resource language modeling requires not only scale, but a structurally complete data ecosystem. We release TFD to facilitate reproducible research and the development of robust, culturally aligned Tibetan LLMs. Code and data are available at https://github.com/Vicentvankor/sun-shine.

2602.12833 2026-05-27 cs.LG cs.AI cs.MA

Vital Trace: Protocol-Constrained Patient-State Reasoning for Longitudinal Clinical Trajectories

Zhan Qu, Michael Färber

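Affiliations: TU Dresden

AI summary: Proposes Vital Trace, a protocol-constrained multi-agent framework that addresses context drift and unstable reasoning over long clinical trajectories by combining a compact persistent patient-state memory with staged reasoning across four coordinated agents (Router, Reasoner, Auditor, Steward). On MIMIC-IV and eICU, evaluated on future vasopressor-support, respiratory-support, renal-support, and deterioration prediction, it improves temporal consistency, communication stability, calibration, and interpretability over free-form multi-agent baselines.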

Abstract

Longitudinal clinical reasoning over electronic health records requires tracking evolving physiological measurements, laboratory results, and interventions across extended patient trajectories. Existing LLM-based clinical reasoning systems often rely on repeatedly serializing patient histories or exchanging unconstrained textual agent messages, leading to context drift, unstable reasoning, and growing inference cost over long horizons. We present Vital Trace, a protocol-constrained multi-agent framework for future clinical risk prediction over evolving ICU trajectories. Instead of maintaining unbounded textual histories, Vital Trace uses a compact persistent patient-state memory together with staged reasoning performed by four coordinated agents: a Router, Reasoner, Auditor, and Steward. To support temporally coherent reasoning, we introduce a manually curated Global Protocol containing physiological state-transition rules and a dynamic patient-state representation that tracks hemodynamic, respiratory, renal, metabolic, and inflammatory instability over time. We evaluate Vital Trace on MIMIC-IV and eICU using future vasopressor-support, respiratory-support, renal-support, and deterioration prediction tasks. Results show that structured protocol-constrained reasoning improves temporal consistency, communication stability, calibration, and interpretability compared with free-form multi-agent baselines while achieving strong predictive performance across long ICU trajectories.

2602.11799 2026-05-27 cs.AI cs.IR

Hi-SAM: A Hierarchical Structure-Aware Multi-modal Framework for Large-Scale Recommendation

Pingjun Pan, Tingting Zhou, Peiyao Lu, Tingting Fei, Hongxiang Chen, Chuanjiang Luo

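Affiliations: Netease Cloud Music

AI summary: Addresses suboptimal tokenization and architecture-data mismatch in semantic-ID discretization for multi-modal recommendation. The proposed Hi-SAM framework combines a Disentangled Semantic Tokenizer with a Hierarchical Memory-Anchor Transformer and improves recommendation performance, especially in cold-start scenarios.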

Comments Accepted at ACM KDD 2026 ADS

Abstract

Multi-modal recommendation has gained traction as items possess rich attributes like text and images. Semantic ID-based approaches effectively discretize this information into compact tokens. However, two challenges persist: (1) Suboptimal Tokenization: existing methods (e.g., RQ-VAE) lack disentanglement between shared cross-modal semantics and modality-specific details, causing redundancy or collapse; (2) Architecture-Data Mismatch: vanilla Transformers treat semantic IDs as flat streams, ignoring the hierarchy of user interactions, items, and tokens. Expanding items into multiple tokens amplifies length and noise, biasing attention toward local details over holistic semantics. We propose Hi-SAM, a Hierarchical Structure-Aware Multi-modal framework with two designs: (1) Disentangled Semantic Tokenizer (DST): unifies modalities via geometry-aware alignment and quantizes them via a coarse-to-fine strategy. Shared codebooks distill consensus while modality-specific ones recover nuances from residuals, enforced by mutual information minimization; (2) Hierarchical Memory-Anchor Transformer (HMAT): splits positional encoding into inter- and intra-item subspaces via Hierarchical RoPE to restore hierarchy. It inserts Anchor Tokens to condense items into compact memory, retaining details for the current item while accessing history only through compressed summaries. Experiments on real-world datasets show consistent improvements over SOTA baselines, especially in cold-start scenarios. Deployed on a large-scale social platform serving millions of users, Hi-SAM achieved a 6.55% gain in the core online metric.
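
The idea of splitting positional encoding into inter- and intra-item subspaces can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the even split of rotary dimensions, and the example sequence are all assumptions made for illustration. Only the `rope_angles` formula follows the standard RoPE definition.

```python
from typing import List, Tuple


def hierarchical_positions(tokens_per_item: List[int]) -> List[Tuple[int, int]]:
    """Map each token of a flattened semantic-ID sequence to
    (inter-item index, intra-item index)."""
    positions = []
    for item_idx, n_tokens in enumerate(tokens_per_item):
        for token_idx in range(n_tokens):
            positions.append((item_idx, token_idx))
    return positions


def rope_angles(position: int, dim: int, base: float = 10000.0) -> List[float]:
    """Standard RoPE rotation angles for one position over dim // 2 frequency pairs."""
    return [position / (base ** (2 * i / dim)) for i in range(dim // 2)]


def hierarchical_rope_angles(item_idx: int, token_idx: int,
                             inter_dim: int, intra_dim: int) -> List[float]:
    """Assumed split: one block of rotary dimensions is driven by the item index,
    the other by the token's position inside its item."""
    return rope_angles(item_idx, inter_dim) + rope_angles(token_idx, intra_dim)


# Hypothetical history of three items, each expanded into three semantic-ID tokens.
interaction_history = [3, 3, 3]
positions = hierarchical_positions(interaction_history)
angles = [hierarchical_rope_angles(i, t, inter_dim=4, intra_dim=4)
          for i, t in positions]
```

Under this split, the first token of every item receives the same intra-item angles, so tokens in the same slot of different items differ only in the inter-item block. A flat position index does not expose that structure.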

2507.11486 2026-05-27 cs.LG

Exploring the robustness of TractOracle methods in RL-based tractography

Jeremi Levesque, Antoine Théberge, Maxime Descoteaux, Pierre-Marc Jodoin

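Affiliations: Department of Computer Science, Faculty of Science, University of Sherbrooke

AI summary: Extends the TractOracle-RL framework with recent advances in reinforcement learning and introduces Iterative Reward Training (IRT). Experiments show that oracle-based RL methods significantly outperform widely used tractography techniques in accuracy and anatomical validity.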

Comments 38 pages, 8 figures. Submitted to Medical Image Analysis

Journal ref Medical Image Analysis, December 2025

Abstract

Tractography algorithms leverage diffusion MRI to reconstruct the fibrous architecture of the brain's white matter. Among machine learning approaches, reinforcement learning (RL) has emerged as a promising framework for tractography, outperforming traditional methods in several key aspects. TractOracle-RL, a recent RL-based approach, reduces false positives by incorporating anatomical priors into the training process via a reward-based mechanism. In this paper, we investigate four extensions of the original TractOracle-RL framework by integrating recent advances in RL, and we evaluate their performance across five diverse diffusion MRI datasets. Results demonstrate that combining an oracle with the RL framework consistently leads to robust and reliable tractography, regardless of the specific method or dataset used. We also introduce a novel RL training scheme called Iterative Reward Training (IRT), inspired by the Reinforcement Learning from Human Feedback (RLHF) paradigm. Instead of relying on human input, IRT leverages bundle filtering methods to iteratively refine the oracle's guidance throughout training. Experimental results show that RL methods trained with oracle feedback significantly outperform widely used tractography techniques in terms of accuracy and anatomical validity.

2602.11460 2026-05-27 cs.CL

ADRD-Bench: A Preliminary LLM Benchmark for Alzheimer's Disease and Related Dementias

Guangxin Zhao, Jiahao Zheng, Malaz Boustani, Jarek Nabrzyski, Yiyu Shi, Meng Jiang, Zhi Zheng

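Affiliations: Electrical Engineering, University of Notre Dame; Computer Science and Engineering, University of Notre Dame; School of Medicine, Indiana University

AI summary: Existing benchmarks provide minimal coverage of Alzheimer's Disease and Related Dementias. ADRD-Bench addresses this with two components, ADRD Unified QA and ADRD Caregiving QA. An evaluation of 36 LLMs finds that top-tier models reach high accuracy but show inconsistent reasoning quality and stability.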

Comments Update article

Abstract

Large language models (LLMs) have shown great potential for healthcare applications. However, existing evaluation benchmarks provide minimal coverage of Alzheimer's Disease and Related Dementias (ADRD). To address this gap, we introduce ADRD-Bench, a preliminary ADRD-specific LLM benchmark. ADRD-Bench has two components: 1) ADRD Unified QA, a synthesis of 1,438 questions consolidated from seven established medical benchmarks, providing a unified assessment of clinical knowledge; and 2) ADRD Caregiving QA, a novel set of 149 questions derived from a nationally adopted, large clinical trials supported brain health management program, mitigating the lack of practical caregiving context in existing benchmarks. We evaluated 36 state-of-the-art LLMs on the proposed ADRD-Bench. Results showed that the accuracy of open-weight general models, open-weight medical models, and frontier closed-source general models ranged from 0.63 to 0.93 (mean: 0.77; std: 0.09), 0.47 to 0.93 (mean: 0.81; std: 0.14), and 0.83 to 0.93 (mean: 0.90; std: 0.03), respectively. While top-tier models achieved high accuracies (>0.9), case studies revealed inconsistent reasoning quality and stability, highlighting a critical need for domain-specific improvement to enhance LLMs' knowledge and reasoning grounded in daily caregiving data. The entire dataset is available at https://github.com/IIRL-ND/ADRD-Bench.

2602.10450 2026-05-27 cs.LG cs.AI math.OC

Constructing Industrial-Scale Optimization Modeling Benchmark

Zhong Li, Hongliang Lu, Tao Wei, Yuxuan Chen, Wenyu Liu, Yuan Lan, Fan Zhang, Zaiwen Wen

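Affiliations: Great Bay University; Peking University; Huawei Technologies Co., Ltd

AI summary: Introduces the MIPLIB-NL benchmark, built with a structure-aware reverse construction method that derives natural-language specifications and solver code from real mixed-integer linear programs, to evaluate large language models on industrial-scale optimization modeling.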

Comments This paper was accepted by ICML'26 for publication

Abstract

Optimization modeling underpins decision-making in logistics, manufacturing, energy, and finance, yet translating natural-language requirements into correct optimization formulations and solver-executable code remains labor-intensive. Although large language models (LLMs) have been explored for this task, evaluation is still dominated by toy-sized or synthetic benchmarks, masking the difficulty of industrial problems with $10^{3}$ to $10^{6}$ (or more) variables and constraints. A key bottleneck is the lack of benchmarks that align natural-language specifications with reference formulations/solver code grounded in real optimization models. To fill in this gap, we introduce MIPLIB-NL, built via a structure-aware reverse construction methodology from real mixed-integer linear programs in MIPLIB 2017. Our pipeline (i) recovers compact, reusable model structure from flat solver formulations, (ii) reverse-generates natural-language specifications explicitly tied to this recovered structure under a unified model-data separation format, and (iii) performs iterative semantic validation through expert review and human-LLM interaction with independent reconstruction checks. This yields 223 one-to-one reconstructions that preserve the mathematical content of the original instances while enabling realistic natural-language-to-optimization evaluation. Experiments show substantial performance degradation on MIPLIB-NL for systems that perform strongly on existing benchmarks, exposing failure modes invisible at toy scale.

2602.10104 2026-05-27 cs.CV cs.AI cs.LG

Olaf-World: Orienting Latent Actions for Video World Modeling

Yuxin Jiang, Yuchao Gu, Ivor W. Tsang, Mike Zheng Shou

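Affiliations: Show Lab, National University of Singapore Research (A STAR), Singapore

AI summary: Proposes SeqΔ-REPA, an alignment objective that anchors latent actions to temporal feature differences from a frozen self-supervised video encoder, enabling pretraining of world models with transferable action control from unlabeled video.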

Comments ICML 2026. Project page: https://showlab.github.io/Olaf-World/ Code: https://github.com/showlab/Olaf-World

Abstract

Scaling action-controllable world models is limited by the scarcity of action labels. While latent action learning promises to extract control interfaces from unlabeled video, learned latents often fail to transfer across contexts: they entangle scene-specific cues and lack a shared coordinate system. This occurs because standard objectives operate only within each clip, providing no mechanism to align action semantics across contexts. Our key insight is that although actions are unobserved, their semantic effects are observable and can serve as a shared reference. We introduce Seq$Δ$-REPA, a sequence-level control-effect alignment objective that anchors integrated latent action to temporal feature differences from a frozen, self-supervised video encoder. Building on this, we present Olaf-World, a pipeline that pretrains action-conditioned video world models from large-scale passive video. Extensive experiments demonstrate that our method learns a more structured latent action space, leading to stronger zero-shot action transfer and more data-efficient adaptation to new control interfaces than state-of-the-art baselines.

2602.09878 2026-05-27 cs.CV

MVISTA-4D: View-Consistent 4D World Model with Test-Time Action Inference for Robotic Manipulation

Jiaxu Wang, Yicheng Jiang, Tianlun He, Jingkai Sun, Qiang Zhang, Junhao He, Jiahang Cao, Zesen Gan, Mingyuan Sun, Qiming Shao, Xiangyu Yue

Affiliations: MMLab, The Chinese University of Hong Kong, Hong Kong SAR; The Hong Kong University of Science; The University of Hong Kong; Tsinghua University

AI summary: Proposes a world-model-based 4D scene generation method that uses multi-view RGBD prediction and test-time action optimization to achieve geometrically consistent 4D dynamics prediction and robotic manipulation.

Journal ref International Conference on Machine Learning 2026

Abstract

World-model-based imagine-then-act becomes a promising paradigm for robotic manipulation, yet existing approaches typically support either purely image-based forecasting or reasoning over partial 3D geometry, limiting their ability to predict complete 4D scene dynamics. This work proposes a novel embodied 4D world model that enables geometrically consistent, arbitrary-view RGBD generation: given only a single-view RGBD observation as input, the model imagines the remaining viewpoints, which can then be back-projected and fused to assemble a more complete 3D structure across time. To efficiently learn the multi-view, cross-modality generation, we explicitly design cross-view and cross-modality feature fusion that jointly encourage consistency between RGB and depth and enforce geometric alignment across views. Beyond prediction, converting generated futures into actions is often handled by inverse dynamics, which is ill-posed because multiple actions can explain the same transition. We address this with a test-time action optimization strategy that backpropagates through the generative model to infer a trajectory-level latent best matching the predicted future, and a residual inverse dynamics model that turns this trajectory prior into accurate executable actions. Experiments on three datasets demonstrate strong performance on both 4D scene generation and downstream manipulation, and ablations provide practical insights into the key design choices.

2602.08586 2026-05-27 cs.AI

DIANOIA: Diagnostic Decomposition and Joint Optimization for Multi-Agent Reasoning

Yiming Yang, Zhuoyuan Li, Fanxiang Zeng, Hao Fu, Yue Liu

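Affiliations: Alibaba Group

AI summary: Proposes the DIANOIA framework, which decomposes multi-agent reasoning gain into three measurable channels (coverage, fidelity, and synthesis), derives a diagnostic protocol and a matching system from that decomposition, and achieves better performance with fewer tokens on several benchmarks.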

Abstract

Multi-agent LLM systems consistently outperform single-agent baselines, yet practitioners still cannot predict which design works for a new task or diagnose why one fails. We argue this gap persists largely because the field lacks a diagnostic framework with measurable primitives and testable predictions. We introduce DIANOIA, a three-channel decomposition of multi-agent reasoning gain into coverage, fidelity, and synthesis, each of which is empirically measurable. From this decomposition, we derive a diagnostic protocol that identifies the bottleneck channels for any given task. We instantiate the protocol as a multi-agent system whose three components mirror the channels: role-diverse proposers for coverage, execution-grounded verification for fidelity, and iterative synthesis. On GSM8K, AIME-2025, MBPP, and BFCL-SP, our method outperforms strong multi-agent baselines under matched token budgets, dominating the Pareto frontier on MBPP at about 5x token savings and reaching +4.6 pp at matched cost. On every benchmark, the protocol picks the right bottleneck channels; the system we built around it leads across models. We release code, adapters, diagnostic metrics, and a Claude Code skill at https://anonymous.4open.science/r/DIANOIA4MAS. DIANOIA reframes multi-agent design as channel-aware resource allocation: diagnose which channel is the bottleneck for your task, then invest tokens accordingly.

2511.16449 2026-05-27 cs.CV cs.AI

Bridging the Semantic-Action Gap in Visual Token Pruning for Efficient VLA Inference

Ziyan Liu, Yeqiu Chen, Hongyi Cai, Tao Lin, Shuo Yang, Zheng Liu, Bo Zhao

Affiliations: School of AI, Shanghai Jiao Tong University; University of Science and Technology of China; Harbin Institute of Technology (Shenzhen); BAAI

AI summary: Proposes VLA-Pruner, which estimates visual-token importance from both semantic prefilling and temporally smoothed action relevance and applies a Combine-then-Filter strategy, achieving up to 1.99x speedup with comparable manipulation quality.

Abstract

Vision-Language-Action (VLA) models have shown great potential for embodied AI by integrating visual perception, language understanding, and action execution. In real-time deployment, these models must process continuous visual streams, incurring substantial computational overhead. Visual token pruning -- a mainstream technique for accelerating Vision-Language Models (VLMs) by retaining salient tokens while discarding redundant ones -- offers a natural candidate solution to this challenge. However, directly applying VLM-oriented pruning methods to VLA inference can cause severe degradation in manipulation performance. Our analysis attributes this degradation to a key mismatch: VLA inference exhibits distinct attention patterns between the vision-language prefill stage and the action-decode stage, so pruning based only on context-prefill semantic salience is biased toward semantic cues and may remove action-critical visual tokens. Motivated by this observation, we propose VLA-Pruner, an effective plug-and-play token pruning method grounded in the visual requirements of VLA inference, further exploiting the temporal continuity of robot manipulation. Specifically, VLA-Pruner estimates visual-token importance from both semantic prefilling and temporally smoothed action relevance, and then applies a Combine-then-Filter strategy to retain compact, non-redundant tokens under the compute budget. Experiments show that VLA-Pruner outperforms state-of-the-art approaches across multiple VLA architectures, achieving up to 1.99x speedup with comparable manipulation quality.
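
The two-signal scoring and the Combine-then-Filter step can be pictured with a short Python sketch. The abstract does not give the actual procedure, so every detail here is an assumption for illustration: the exponential moving average used for temporal smoothing, the union of per-signal top-k sets as the "combine" step, the cosine-similarity redundancy check as the "filter" step, and all function names and toy values.

```python
import math
from typing import List, Sequence, Set


def smooth_action_relevance(history: Sequence[Sequence[float]],
                            decay: float = 0.5) -> List[float]:
    """Exponential moving average over per-step action-attention scores (oldest first)."""
    smoothed = list(history[0])
    for step in history[1:]:
        smoothed = [decay * s + (1.0 - decay) * x for s, x in zip(smoothed, step)]
    return smoothed


def top_k(scores: Sequence[float], k: int) -> Set[int]:
    """Indices of the k highest-scoring tokens."""
    return set(sorted(range(len(scores)), key=lambda i: -scores[i])[:k])


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def combine_then_filter(semantic: Sequence[float], action: Sequence[float],
                        features: Sequence[Sequence[float]], budget: int,
                        max_sim: float = 0.95) -> List[int]:
    # Combine: candidates that either signal considers important.
    candidates = top_k(semantic, budget) | top_k(action, budget)
    ranked = sorted(candidates, key=lambda i: (-max(semantic[i], action[i]), i))
    # Filter: greedily keep tokens that are not near-duplicates of kept ones.
    kept: List[int] = []
    for i in ranked:
        if len(kept) == budget:
            break
        if all(cosine(features[i], features[j]) <= max_sim for j in kept):
            kept.append(i)
    return sorted(kept)


# Toy data: four visual tokens, two past decoding steps.
semantic_scores = [0.9, 0.8, 0.1, 0.2]
action_history = [[0.1, 0.1, 0.9, 0.2], [0.1, 0.1, 0.5, 0.3]]
token_features = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.5, 0.5]]

action_scores = smooth_action_relevance(action_history)
kept_tokens = combine_then_filter(semantic_scores, action_scores,
                                  token_features, budget=2)
```

In the toy data, token 1 is semantically salient but nearly identical to token 0, while token 2 matters only to the action signal. Selecting on semantic salience alone would keep tokens 0 and 1, whereas this sketch keeps tokens 0 and 2.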

2511.06625 2026-05-27 cs.CV cs.AI cs.LG

Explainable Cross-Disease Reasoning for Cardiovascular Risk Assessment from Low-Dose Computed Tomography

Yifei Zhang, Jiashuo Zhang, Mojtaba Safari, Xiaofeng Yang, Liang Zhao

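Affiliations: Department of Computer Science, Emory University; Department of Computer Science, Johns Hopkins University; Department of Radiation Oncology; Winship Cancer Institute, Emory University

AI summary: Proposes an explainable cross-disease reasoning framework that extracts pulmonary findings, reasons about cross-organ mechanisms grounded in medical knowledge, and combines the result with cardiac subvolume features to assess cardiovascular risk from low-dose chest CT, reaching an AUC of 0.919 for CVD screening on the NLST cohort.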

Abstract

Low-dose chest computed tomography (LDCT) captures pulmonary and cardiac structures in a single scan, enabling joint assessment of lung and cardiovascular health. Existing approaches typically model these domains independently and do not explicitly represent their physiological interactions. We propose an Explainable Cross-Disease Reasoning Framework for cardiovascular risk assessment from LDCT. The framework follows a constrained clinical-information pathway: it extracts pulmonary findings, grounds cross-organ mechanisms in medical knowledge, and produces a cardiovascular prediction with a natural-language rationale. It combines four components: a frozen lung-risk prior, a pulmonary perception module, an agentic reasoning module, and a cardiac subvolume feature extractor. Their outputs are fused to integrate localized cardiac evidence with mechanism-level pulmonary context. On the National Lung Screening Trial cohort, the framework achieves an AUC of 0.919 for CVD screening and up to 0.838 for CVD mortality prediction, outperforming cardiac-specific, single-disease, and foundation-model baselines. Targeted controls indicate that the gains are not explained by additional thoracic visual features alone, fixed rule propagation, or a single reasoning backend. The proposed framework thus provides an auditable approach to cross-disease cardiovascular risk assessment from LDCT.

2507.13428 2026-05-27 cs.CV cs.AI

"PhyWorldBench": A Comprehensive Evaluation of Physical Realism in Text-to-Video Models

Jing Gu, Xian Liu, Yu Zeng, Ashwin Nagarajan, Fangrui Zhu, Daniel Hong, Yue Fan, Qianqi Yan, Kaiwen Zhou, Ming-Yu Liu, Xin Eric Wang

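Affiliations: University of California, Santa Cruz; NVIDIA Research; Northeastern University; University of California, Santa Barbara

AI summary: Introduces the PhyWorldBench benchmark, which evaluates how well 12 video generation models follow physical laws using 1,050 prompts, adds an Anti-Physics category, and uses multimodal large language models for zero-shot evaluation.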

Comments 35 pages, 21 figures

Journal ref ICLR 2026 oral

Abstract

Video generation models have achieved remarkable progress in creating high-quality, photorealistic content. However, their ability to accurately simulate physical phenomena remains a critical and unresolved challenge. This paper presents PhyWorldBench, a comprehensive benchmark designed to evaluate video generation models based on their adherence to the laws of physics. The benchmark covers multiple levels of physical phenomena, ranging from fundamental principles such as object motion and energy conservation to more complex scenarios involving rigid body interactions and human or animal motion. Additionally, we introduce a novel Anti-Physics category, where prompts intentionally violate real-world physics, enabling the assessment of whether models can follow such instructions while maintaining logical consistency. Besides large-scale human evaluation, we also design a simple yet effective method that utilizes current multimodal large language models to evaluate physics realism in a zero-shot fashion. We evaluate 12 state-of-the-art text-to-video generation models, including five open-source and five proprietary models, with detailed comparison and analysis. Through systematic testing across 1050 curated prompts spanning fundamental, composite, and anti-physics scenarios, we identify pivotal challenges these models face in adhering to real-world physics. We further examine their performance under diverse physical phenomena and prompt types, and derive targeted recommendations for crafting prompts that enhance fidelity to physical principles.

2506.15199 2026-05-27 cs.LG stat.ML

Interpretability and Generalization Bounds for Learning Spatial Physics

Alejandro Francisco Queiruga, Theo Gutman-Solo, Shuai Jiang

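Affiliations: OpenAI; Google; Sandia National Laboratories

AI summary: Uses numerical analysis techniques to rigorously quantify the accuracy, convergence rates, and generalization bounds of ML models applied to linear differential equations for parameter discovery or solution finding, and introduces an interpretability lens for scientific models based on Green's function representations.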

Comments To appear in ICML 2026. 18 pages, 13 figures

Abstract

While there are many applications of ML to scientific problems that look promising, visuals can be deceiving. Using numerical analysis techniques, we rigorously quantify the accuracy, convergence rates, and generalization bounds of certain ML models applied to linear differential equations for parameter discovery or solution finding. Beyond the quantity and discretization of data, we identify that the function space of the data is critical to the generalization of the model. A similar lack of generalization is empirically demonstrated for commonly used models, including physics-specific techniques. Counterintuitively, we find that different classes of models can exhibit opposing generalization behaviors. Based on our theoretical analysis, we also introduce a new mechanistic interpretability lens on scientific models whereby Green's function representations can be extracted from the weights of black-box models. Our results inform a new cross-validation technique for measuring generalization in physical systems, which can serve as a benchmark.

2501.06708 2026-05-27 cs.LG cs.AI

Evaluating Sample Utility for Efficient Data Selection by Mimicking Model Weights

Tzu-Heng Huang, Manjot Bilkhu, John Cooper, Frederic Sala, Javier Movellan

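Affiliations: Apple

AI summary: Proposes the Mimic Score, a gradient- and geometry-based data-quality metric, and the Grad-Mimic framework, which re-weights samples online to accelerate training and builds data filters offline, improving data efficiency across six image datasets and enhancing CLIP models.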

Comments This work appears in the Proceedings of the 43rd International Conference on Machine Learning (ICML 2026) and was selected as an Oral paper at the ICML 2025 DataWorld Workshop

Abstract

Large-scale web-crawled datasets contain noise, bias, and irrelevant information, necessitating data selection techniques. Existing methods depend on hand-crafted heuristics, downstream datasets, or require expensive influence-based computations -- all of which limit scalability and introduce unwanted data dependencies. To address this, we introduce the Mimic Score, a simple and geometry-based data-quality metric that evaluates utility by measuring alignment between a sample's gradients and a target direction induced by a pre-trained reference model. This leverages readily available model weights, avoids needing validation datasets, and incurs minimal computational overheads. Building on this metric, we propose Grad-Mimic, a two-stage framework that re-weights samples online to accelerate training and aggregates sample utilities offline to construct effective data filters. Empirically, we show that using mimic scores to guide training improves data efficiency, accelerates convergence, yields consistent performance gains across six image datasets, and enhances CLIP models with 20.7% fewer training steps. Additionally, mimic score-based filters augment existing filtering techniques, enabling improved CLIP models trained with 4.7 million fewer samples.
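
A gradient-alignment score of this kind might look like the following Python sketch. The exact formula is not stated in the abstract, so the choices here are assumptions: the target direction is taken as the vector from the current weights to the reference weights, the score is the projection of the negative gradient onto that direction, and the online re-weighting is a softmax over the batch. The names `mimic_score` and `batch_weights` are hypothetical.

```python
import math
from typing import List, Sequence


def mimic_score(grad: Sequence[float], weights: Sequence[float],
                ref_weights: Sequence[float]) -> float:
    """Projection of the descent direction (-grad) onto the unit vector
    pointing from the current weights toward the reference weights."""
    direction = [r - w for r, w in zip(ref_weights, weights)]
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        return 0.0  # already at the reference weights; no direction to align with
    return sum(-g * d for g, d in zip(grad, direction)) / norm


def batch_weights(scores: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Assumed online re-weighting: softmax over per-sample scores in a batch."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy two-parameter model with three per-sample gradients.
weights = [0.0, 0.0]
ref_weights = [1.0, 0.0]
grads = [[-2.0, 0.0], [2.0, 0.0], [0.0, 3.0]]

scores = [mimic_score(g, weights, ref_weights) for g in grads]
sample_weights = batch_weights(scores)
```

A sample whose update would move the model toward the reference weights scores positive, one that moves it away scores negative, and an orthogonal update scores zero. Averaging such scores per sample across training steps is one plausible basis for an offline filter, though the paper's aggregation rule may differ.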

2602.07120 2026-05-27 cs.CL

Anchored Decoding: Provably Reducing Copyright Risk for Any Language Model

锚定解码:可证明降低任何语言模型的版权风险

Jacqueline He, Jonathan Hayase, Wen-tau Yih, Sewoong Oh, Luke Zettlemoyer, Pang Wei Koh

发表机构 * University of Washington(华盛顿大学) Allen Institute for Artificial Intelligence(艾伦人工智能研究所)

AI总结 提出锚定解码,一种即插即用的推理时方法,通过将生成内容约束在许可训练的安全模型附近,可证明地抑制语言模型逐字复制受版权保护的内容,实现可调的风险-效用权衡。

Comments Accepted to ICML 2026. 53 pages, 14 figures, 22 tables. Code is publicly available at https://github.com/jacqueline-he/anchored-decoding

详情
AI中文摘要

语言模型倾向于记忆其训练数据的部分内容并逐字生成。当底层来源敏感或受版权保护时,这种复现会引发创作者同意和补偿问题以及开发者合规风险。我们提出锚定解码,一种即插即用的推理时方法,用于抑制逐字复制:它通过将生成内容保持在许可训练的安全模型的有界邻近范围内,使得任何在混合许可数据上训练的有风险语言模型都能进行解码。锚定解码在生成轨迹上自适应地分配用户选择的信息预算,并强制执行每步约束,从而提供序列级别的保证,实现可调的风险-效用权衡。为使锚定解码实用化,我们引入了一个新的许可训练的安全模型(TinyComma 1.8B),以及锚定字节解码,这是我们方法的字节级变体,通过ByteSampler框架(Hayase等人,2025)实现跨词汇融合。在六个模型对上,针对复制风险和效用的长文本指标,锚定解码和锚定字节解码定义了新的帕累托前沿,在保持接近原始流畅性和事实性的同时,将有风险基线与安全参考之间的可测量复制差距缩小了高达75%,且推理开销适中。

英文摘要

Language models (LMs) tend to memorize portions of their training data and emit verbatim spans. When the underlying sources are sensitive or copyright-protected, such reproduction raises issues of consent and compensation for creators and compliance risks for developers. We propose Anchored Decoding, a plug-and-play inference-time method for suppressing verbatim copying: it enables decoding from any risky LM trained on mixed-license data by keeping generation in bounded proximity to a permissively trained safe LM. Anchored Decoding adaptively allocates a user-chosen information budget over the generation trajectory and enforces per-step constraints that yield a sequence-level guarantee, enabling a tunable risk-utility trade-off. To make Anchored Decoding practically useful, we introduce a new permissively trained safe model (TinyComma 1.8B), as well as Anchored$_{\mathrm{Byte}}$ Decoding, a byte-level variant of our method that enables cross-vocabulary fusion via the ByteSampler framework (Hayase et al., 2025). Across six model pairs on long-form metrics for copying risk and utility, Anchored and Anchored$_{\mathrm{Byte}}$ Decoding define a new Pareto frontier, preserving near-original fluency and factuality while closing up to 75% of the measurable copying gap between the risky baseline and a safe reference, at a modest inference overhead.

2602.04990 2026-05-27 cs.LG cs.GT

Position: Machine Learning for Heart Transplant Allocation Policy Optimization Should Account for Incentives

立场:机器学习用于心脏移植分配政策优化应考虑激励机制

Ioannis Anagnostides, Itai Zilberstein, Zachary W. Sollie, Arman Kilic, Tuomas Sandholm

发表机构 * Department of Computer Science, Carnegie Mellon University(卡内基梅隆大学计算机科学系) Department of Surgery, Division of Cardiothoracic Surgery, Medical University of South Carolina(南卡罗来纳医科大学外科系心胸外科分部) Strategy Robot, Inc., Strategic Machine, Inc., Optimized Markets, Inc.

AI总结 本文指出当前机器学习优化器官分配政策忽视了激励机制问题,提出下一代分配政策应具有激励意识,并呼吁整合机制设计、策略分类、因果推断和社会选择等研究。

Comments To appear at ICML 2026 (position paper track). V3 incorporates reviewers' feedback

详情
AI中文摘要

稀缺供体器官的分配构成了医疗保健中最具影响力的算法挑战之一。尽管该领域正迅速从僵化的、基于规则的系统转向机器学习和数据驱动的优化,我们认为当前的方法常常忽视了一个基本障碍:激励机制。在这篇立场论文中,我们强调器官分配不仅仅是一个优化问题,而是一个涉及器官获取组织、移植中心、临床医生、患者和监管机构的复杂博弈。聚焦于美国成人心脏移植分配,我们识别了决策流程中的关键激励错位,并展示了表明这些错位正在产生不良后果的数据。我们的主要立场是,下一代分配政策应具有激励意识。我们为机器学习社区概述了一个研究议程,呼吁整合机制设计、策略分类、因果推断和社会选择,以确保在面对各组成群体的策略行为时,系统具有鲁棒性、效率、公平性和信任度。

英文摘要

The allocation of scarce donor organs constitutes one of the most consequential algorithmic challenges in healthcare. While the field is rapidly transitioning from rigid, rule-based systems to machine learning and data-driven optimization, we argue that current approaches often overlook a fundamental barrier: incentives. In this position paper, we highlight that organ allocation is not merely an optimization problem, but rather a complex game involving organ procurement organizations, transplant centers, clinicians, patients, and regulators. Focusing on US adult heart transplant allocation, we identify critical incentive misalignments across the decision-making pipeline, and present data showing that they are having adverse consequences today. Our main position is that the next generation of allocation policies should be incentive aware. We outline a research agenda for the machine learning community, calling for the integration of mechanism design, strategic classification, causal inference, and social choice to ensure robustness, efficiency, fairness, and trust in the face of strategic behavior from the various constituent groups.

2512.06609 2026-05-27 cs.LG cs.CV

Training-Free Vector Quantization via Gaussian VAEs

基于高斯VAE的无训练向量量化

Tongda Xu, Wendi Zheng, Jiajun He, Jose Miguel Hernandez-Lobato, Yan Wang, Ya-Qin Zhang, Jie Tang

发表机构 * AIR, Tsinghua University(清华大学智能产业研究院) CST, Tsinghua University(清华大学计算机科学与技术系) University of Cambridge(剑桥大学)

AI总结 提出Gaussian Quant (GQ)方法,通过约束训练高斯VAE并直接转换为VQ-VAE,无需额外训练,在UNet和ViT架构上优于现有VQ-VAE。

详情
AI中文摘要

向量量化变分自编码器(VQ-VAEs)是将图像压缩为离散标记的离散自编码器。然而,由于离散化,它们难以训练。在本文中,我们提出了一种简单而有效的技术,称为Gaussian Quant (GQ),它首先在特定约束下训练高斯VAE,然后将其转换为VQ-VAE,无需额外训练。对于转换,GQ生成随机高斯噪声作为码本,并找到最接近后验均值的噪声向量。理论上,我们证明当码本大小的对数超过高斯VAE的bits-back编码率时,可以保证较小的量化误差。实际上,我们提出了一种启发式方法来训练高斯VAE以实现有效转换,称为目标散度约束(TDC)。实验上,我们表明GQ在UNet和ViT架构上均优于先前的VQ-VAE,如VQGAN、FSQ、LFQ和BSQ。此外,TDC还改进了先前的离散化方法,如TokenBridge。源代码见https://github.com/tongdaxu/VQ-VAE-from-Gaussian-VAE。

英文摘要

Vector-quantized variational autoencoders (VQ-VAEs) are discrete autoencoders that compress images into discrete tokens. However, they are difficult to train due to discretization. In this paper, we propose a simple yet effective technique dubbed Gaussian Quant (GQ), which first trains a Gaussian VAE under certain constraints and then converts it into a VQ-VAE without additional training. For conversion, GQ generates random Gaussian noise as a codebook and finds the closest noise vector to the posterior mean. Theoretically, we prove that when the logarithm of the codebook size exceeds the bits-back coding rate of the Gaussian VAE, a small quantization error is guaranteed. Practically, we propose a heuristic to train Gaussian VAEs for effective conversion, named the target divergence constraint (TDC). Empirically, we show that GQ outperforms previous VQ-VAEs, such as VQGAN, FSQ, LFQ, and BSQ, on both UNet and ViT architectures. Furthermore, TDC also improves previous Gaussian VAE discretization methods, such as TokenBridge. The source code is provided in https://github.com/tongdaxu/VQ-VAE-from-Gaussian-VAE.
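The conversion step described above (a random Gaussian codebook, then a nearest-neighbor lookup for the posterior mean) can be sketched as follows. This is an illustrative toy under stated assumptions, not the authors' implementation: the helper names `make_codebook` and `quantize`, the codebook size, the latent dimension, and the use of plain squared Euclidean distance are all assumptions made here.

```python
import random

def make_codebook(size, dim, seed=0):
    """Draw `size` i.i.d. standard Gaussian vectors; a fixed seed lets the
    encoder and decoder regenerate the same codebook."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(size)]

def quantize(mean, codebook):
    """Return the index of the codebook vector closest to the posterior mean
    (squared Euclidean distance is an assumption of this sketch)."""
    def sq_dist(code):
        return sum((m - c) ** 2 for m, c in zip(mean, code))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))

codebook = make_codebook(size=256, dim=4)
posterior_mean = [0.3, -1.2, 0.5, 0.0]   # hypothetical encoder output
index = quantize(posterior_mean, codebook)  # the discrete token
reconstruction = codebook[index]            # what a decoder would receive
```

No codebook parameters are learned in this sketch, which is the sense in which the conversion needs no additional training. A codebook of 256 entries carries log2(256) = 8 bits per token, the quantity the abstract compares against the bits-back coding rate.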

2602.04931 2026-05-27 cs.LG cs.AI

Emergent Causal-Geometric Dynamics Across Depth in Large Language Models

大型语言模型中跨深度的涌现因果几何动力学

Shahar Haim, Daniel C McNamee

发表机构 * Champalimaud Centre for the Unknown(查普拉米乌德未知中心)

AI总结 通过结合几何分析与因果干预,揭示了仅解码器(decoder-only)大型语言模型中从上下文处理到预测形成的跨层转变,并发现后期层中角度结构参数化下一词分布相似性并实现选择性因果控制。

详情
AI中文摘要

对大型语言模型(LLM)表征的几何分析揭示了跨深度的结构化变化,但就token预测的形成而言,这类分析本质上仍停留在相关性层面。同时,因果干预揭示了依赖于深度的效能曲线,但缺乏对其表征动力学的统一解释。对LLM功能的完整解释需要说明表征结构如何跨深度演化以因果性地产生预测。我们通过将几何分析与机制性干预相结合,明确将跨深度动力学作为解释LLM功能的组织轴,综合了这些视角。在仅解码器(decoder-only)LLM中,我们识别出从上下文处理到预测形成计算的急剧转变,并伴随着跨层表征几何更为渐进的重组。这种综合揭示了一种后期层几何编码,其中角度结构参数化下一词分布相似性,并能够对预测进行选择性因果控制,而表征范数编码的信息与预测基本解耦。总之,我们的结果提供了因果和几何视角的综合,给出了关于语言模型中与控制相关的跨深度几何动力学如何将上下文转化为预测的机制性解释。这一视角调和了先前令人困惑的发现,并表明逐层功能无法被孤立地理解或有效干预,而只能在网络涌现的全局动力学结构中加以理解。

英文摘要

Geometric analyses of large language model (LLM) representations reveal structured variation across depth but remain fundamentally correlational with respect to token prediction formation. Meanwhile, causal interventions expose depth-dependent efficacy profiles without a unifying account of their representational dynamics. A complete account of LLM function requires explaining how representational structure evolves across depth to causally produce predictions. We synthesize these perspectives by combining geometric analysis with mechanistic interventions, explicitly centralizing depth-wise dynamics as the organizing axis for interpreting LLM function. In decoder-only LLMs, we identify a sharp transition from context-processing to prediction-forming computation, accompanied by a more gradual reorganization of representational geometry across layers. This synthesis reveals a late-layer geometric code in which angular structure parameterizes next-token distributional similarity and enables selective causal control over predictions, while representation norms encode information largely decoupled from prediction. Together, our results provide a synthesis of causal and geometric perspectives, yielding a mechanistic account of how control-relevant geometric dynamics across depth transform context into prediction in language models. This perspective reconciles previously puzzling findings and implies that layer-wise function cannot be understood or effectively intervened upon in isolation, but only within the emergent global dynamical structure of the network.

2602.04599 2026-05-27 cs.LG

Stochastic Decision Horizons for Constrained Reinforcement Learning

约束强化学习的随机决策视界

Nikola Milosevic, Leonard Franz, Daniel Haeufle, Georg Martius, Nico Scherf, Pavel Kolev

发表机构 * Max Planck Institute for Human Cognitive and Brain Sciences(马克斯·普朗克人类认知与脑科学研究所) Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI)(可扩展数据分析与人工智能中心 (ScaDS.AI)) Hertie Institute for Clinical Brain Research & Center for Integrative Neuroscience(赫尔特临床脑研究所及整合神经科学中心) University of Tübingen(图宾根大学) Max Planck Institute for Intelligent Systems(马克斯·普朗克智能系统研究所)

AI总结 提出随机决策视界(SDH)框架,通过状态-动作延续概率实现每步约束满足,并开发了首个离策略和正则化算法(AS-SAC和VT-MPO),在90肌肉人形机器人上以4倍更少的环境步数达到最先进步态真实度。

详情
AI中文摘要

我们提出随机决策视界(SDH),这是一个有理论依据的框架,用于求解要求每一步都满足约束的约束强化学习问题,这是许多实际应用所期望的性质。在SDH中,约束违反通过状态-动作延续概率等效地缩短视界。借助“控制即推理”(Control as Inference),我们开发了首批面向即时约束RL的离策略且带正则化的算法。针对违反发生之后什么才算作一次决策,我们确定了两种有原则的语义。吸收状态语义终止决策过程,因此只有存活的决策支付熵成本,由此得到最大熵AS-SAC。虚拟终止语义保持决策过程继续进行,同时停止奖励信用分配,由此得到KL正则化的VT-MPO。为了连接SDH与CMDP,我们跟踪违反沿轨迹的累积方式(即违反深度剖面)。SDH实际上按每条轨迹总违反量的指数对其加权;当违反都发生在单一特征尺度上时,这与加性CMDP预算完全一致,我们也指出了二者无法一致的情形:罕见的深度违反与频繁的浅层违反混合出现时。实验验证了理论。在90块肌肉的H2190人形模型(Hyfydy)上,VT-MPO以4倍更少的环境步数和明显更稳定的训练达到了最先进的步态真实度。在Safety Gymnasium上,违反深度剖面正确识别了SDH能够提供良好奖励-违反权衡的区间。

英文摘要

We propose stochastic decision horizons (SDH), a theoretically grounded framework for solving constrained RL problems with every-step constraint satisfaction, a desirable property in many real-world applications. In SDH, a constraint violation yields an effective shortening of horizon via a state-action continuation probability. Using Control as Inference, we develop the first off-policy and regularized algorithms for RL with instantaneous constraints. We identify two principled semantics for what counts as a decision after a violation. Absorbing-state semantics end the decision process, so only surviving decisions pay entropy cost, yielding max-entropy AS-SAC. Virtual-termination keeps the decision process alive while stopping reward credit, yielding KL-regularized VT-MPO. To connect SDH with CMDPs, we track how violations accumulate along trajectories (their violation-depth profile). SDH effectively weights each trajectory by the exponential of its total violation; this matches an additive CMDP budget exactly when violations occur at a single characteristic scale, and we pinpoint where it cannot: when rare, deep violations mix with frequent, shallow ones. Experiments validate the theory. On the 90-muscle H2190 humanoid (Hyfydy), VT-MPO matches state-of-the-art gait realism with $4\times$ fewer environment steps and substantially more stable training. On Safety Gymnasium, violation-depth profiles correctly identify the regimes in which SDH delivers strong reward-violation trade-offs.
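As a rough illustration of how a continuation probability shortens the effective horizon, the sketch below weights each step's reward by the probability of having survived all earlier steps. The exponential form `exp(-beta * violation)`, the parameter `beta`, the choice to credit the reward of the violating step itself, and the function name `sdh_return` are assumptions of this sketch; the abstract does not specify them.

```python
import math

def sdh_return(rewards, violations, beta=1.0):
    """Survival-weighted return: reward at step t counts only with the
    probability that the process continued through steps 0..t-1."""
    survival = 1.0
    total = 0.0
    for reward, violation in zip(rewards, violations):
        total += survival * reward
        # Assumed continuation probability for this state-action pair.
        survival *= math.exp(-beta * violation)
    return total, survival

rewards = [1.0, 1.0, 1.0]
violations = [0.0, math.log(2.0), 0.0]  # one violation that halves survival
total, survival = sdh_return(rewards, violations)
```

With this choice the final survival probability equals `exp(-beta * sum(violations))`, which mirrors the abstract's statement that a trajectory is effectively weighted by the exponential of its total violation.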

2602.03545 2026-05-27 cs.AI

Persona Generators: Generating Diverse Synthetic Personas for Arbitrary Contexts

人格生成器:为任意上下文生成多样化的合成人格

Davide Paglieri, Logan Cross, William A. Cunningham, Joel Z. Leibo, Alexander Sasha Vezhnevets

发表机构 * Google DeepMind(谷歌DeepMind)

AI总结 提出Persona Generators,通过迭代进化优化生成覆盖广泛意见和偏好的多样化合成人格,在六个多样性指标上显著优于现有基线。

详情
AI中文摘要

评估与人类交互的AI系统需要理解它们在不同用户群体中的行为,但收集代表性人类数据通常成本高昂或不可行,特别是对于新技术或假设的未来场景。最近在生成式基于智能体建模方面的工作表明,大型语言模型可以高保真地模拟类似人类的合成人格,准确再现特定个体的信念和行为。然而,大多数方法需要关于目标群体的详细数据,并且通常优先考虑密度匹配(复制最可能的内容)而非支持覆盖(覆盖可能的内容),导致长尾行为未被充分探索。我们引入了Persona Generators,即能够为任意上下文生成多样化合成群体的函数。我们应用基于AlphaEvolve的迭代改进循环,使用大型语言模型作为变异算子,在数百次迭代中优化我们的Persona Generator代码。优化过程产生了轻量级的Persona Generators,能够自动将小规模描述扩展为多样化的合成人格群体,这些群体在相关多样性轴上最大化意见和偏好的覆盖。我们证明,进化后的生成器在保留上下文上的六个多样性指标上显著优于现有基线,产生了覆盖标准LLM输出中难以实现的罕见特征组合的群体。

英文摘要

Evaluating AI systems that interact with humans requires understanding their behavior across diverse user populations, but collecting representative human data is often expensive or infeasible, particularly for novel technologies or hypothetical future scenarios. Recent work in Generative Agent-Based Modeling has shown that large language models can simulate human-like synthetic personas with high fidelity, accurately reproducing the beliefs and behaviors of specific individuals. However, most approaches require detailed data about target populations and often prioritize density matching (replicating what is most probable) rather than support coverage (spanning what is possible), leaving long-tail behaviors underexplored. We introduce Persona Generators, functions that can produce diverse synthetic populations tailored to arbitrary contexts. We apply an iterative improvement loop based on AlphaEvolve, using large language models as mutation operators to refine our Persona Generator code over hundreds of iterations. The optimization process produces lightweight Persona Generators that can automatically expand small descriptions into populations of diverse synthetic personas that maximize coverage of opinions and preferences along relevant diversity axes. We demonstrate that evolved generators substantially outperform existing baselines across six diversity metrics on held-out contexts, producing populations that span rare trait combinations difficult to achieve in standard LLM outputs.

2602.03517 2026-05-27 cs.LG

Rank-Learner: Orthogonal Ranking of Treatment Effects

Rank-Learner:治疗效果的正交排序

Henri Arno, Dennis Frauen, Emil Javurek, Thomas Demeester, Stefan Feuerriegel

发表机构 * Ghent University - imec(根特大学 - imec) Munich Center for Machine Learning (MCML)(慕尼黑机器学习中心 (MCML))

AI总结 提出一种名为Rank-Learner的两阶段学习器,通过成对学习目标直接学习治疗效果排序,无需显式估计条件平均处理效应,具有Neyman正交性和模型无关性。

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

许多决策问题需要根据治疗效果对个体进行排序,而不是估计确切的效果大小。例如,优先考虑患者进行预防性护理干预,或根据广告的预期增量影响对客户进行排名。令人惊讶的是,尽管因果效应估计在文献中受到了广泛关注,但直接学习治疗效果排序的问题在很大程度上仍未得到探索。在本文中,我们介绍了Rank-Learner,一种新颖的两阶段学习器,它直接从观测数据中学习治疗效果的排序。我们首先表明,基于精确治疗效果估计的朴素方法解决了一个比排序所需更困难的问题,而我们的Rank-Learner优化了一个成对学习目标,该目标恢复了真实的治疗效果顺序,无需显式的CATE估计。我们进一步证明,我们的Rank-Learner是Neyman正交的,因此具有强大的理论保证,包括对 nuisance 函数估计误差的鲁棒性。此外,我们的Rank-Learner是模型无关的,可以用任意机器学习模型(例如神经网络)实例化。我们通过大量实验证明了我们方法的有效性,其中Rank-Learner始终优于标准的CATE估计器和非正交排序方法。总的来说,我们为从业者提供了一种新的、正交的两阶段学习器,用于按治疗效果对个体进行排序。

英文摘要

Many decision-making problems require ranking individuals by their treatment effects rather than estimating the exact effect magnitudes. Examples include prioritizing patients for preventive care interventions, or ranking customers by the expected incremental impact of an advertisement. Surprisingly, while causal effect estimation has received substantial attention in the literature, the problem of directly learning rankings of treatment effects has largely remained unexplored. In this paper, we introduce Rank-Learner, a novel two-stage learner that directly learns the ranking of treatment effects from observational data. We first show that naive approaches based on precise treatment effect estimation solve a harder problem than necessary for ranking, while our Rank-Learner optimizes a pairwise learning objective that recovers the true treatment effect ordering, without explicit CATE estimation. We further show that our Rank-Learner is Neyman-orthogonal and thus comes with strong theoretical guarantees, including robustness to estimation errors in the nuisance functions. In addition, our Rank-Learner is model-agnostic, and can be instantiated with arbitrary machine learning models (e.g., neural networks). We demonstrate the effectiveness of our method through extensive experiments where Rank-Learner consistently outperforms standard CATE estimators and non-orthogonal ranking methods. Overall, we provide practitioners with a new, orthogonal two-stage learner for ranking individuals by their treatment effects.

2602.03238 2026-05-27 cs.AI

The Necessity of a Unified Framework for LLM-Based Agent Evaluation

基于LLM的智能体评估统一框架的必要性

Pengyu Zhu, Li Sun, Philip S. Yu, Sen Su

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学) University of Illinois Chicago(伊利诺伊大学芝加哥分校) Chongqing University of Posts and Telecommunications(重庆邮电大学)

AI总结 针对当前LLM智能体评估受系统提示、工具集和环境动态等混杂因素影响的问题,提出标准化统一评估框架以提升公平性和可复现性。

详情
AI中文摘要

随着大型语言模型(LLM)的出现,通用智能体取得了根本性进展。然而,评估这些智能体带来了独特的挑战,使其区别于静态的问答基准。我们观察到,当前的智能体基准受到系统提示、工具集配置和环境动态等外部因素的严重混淆。现有评估通常依赖于碎片化的、研究者特定的框架,其中推理和工具使用的提示工程差异很大,使得难以将性能提升归因于模型本身。此外,缺乏标准化的环境数据导致不可追踪的错误和不可重复的结果。这种标准化的缺失给该领域带来了显著的不公平性和不透明性。我们提出,一个统一的评估框架对于智能体评估的严谨进展至关重要。为此,我们提出了一项旨在标准化智能体评估的建议。

英文摘要

With the advent of Large Language Models (LLMs), general-purpose agents have seen fundamental advancements. However, evaluating these agents presents unique challenges that distinguish them from static QA benchmarks. We observe that current agent benchmarks are heavily confounded by extraneous factors, including system prompts, toolset configurations, and environmental dynamics. Existing evaluations often rely on fragmented, researcher-specific frameworks where the prompt engineering for reasoning and tool usage varies significantly, making it difficult to attribute performance gains to the model itself. Additionally, the lack of standardized environmental data leads to untraceable errors and non-reproducible results. This lack of standardization introduces substantial unfairness and opacity into the field. We propose that a unified evaluation framework is essential for the rigorous advancement of agent evaluation. To this end, we introduce a proposal aimed at standardizing agent evaluation.

2602.02518 2026-05-27 cs.LG cs.AI cs.CL

GraphDancer: Training LLMs to Explore and Reason over Graphs via Two-Stage Curriculum Post-Training

GraphDancer:通过两阶段课程式后训练,训练LLM在图上探索与推理

Yuyang Bai, Zhuofeng Li, Ping Nie, Jianwen Xie, Yu Zhang

发表机构 * Texas A&M University(德克萨斯农工大学) University of Waterloo(滑铁卢大学) Lambda(Lambda公司) University of Oregon(俄勒冈大学)

AI总结 提出GraphDancer两阶段后训练框架,通过图感知课程逐步增加任务难度,使LLMs学会在异构图上进行自然语言推理与函数调用交织的探索与推理,仅用3B骨干模型即在跨域基准上超越更强基线。

Comments 15 pages, Project website: https://yuyangbai.com/graphdancer/

详情
AI中文摘要

大型语言模型(LLMs)越来越依赖外部知识来提高事实性,然而许多真实世界的知识源被组织为异构图而非纯文本。在此类图上进行推理要求模型通过精确的函数调用遵循模式定义的关系,并在多轮交互中聚合证据。我们提出GraphDancer,一个两阶段后训练框架,通过将自然语言推理与图函数执行交织来教导LLMs在图上的推理。第一阶段教导模型在基于规则的奖励下如何与图交互,而第二阶段进一步教导其偏好更基于事实且高效的交互轨迹。GraphDancer的关键创新在于一个图感知课程,该课程根据信息寻求轨迹的结构复杂性组织两个阶段,在训练期间逐步增加任务难度。我们在一个多领域基准上评估GraphDancer,仅在一个领域上训练,并在未见过的领域和分布外问题类型上进行测试。尽管仅使用3B骨干模型,GraphDancer仍优于配备更大/更强骨干的基线,展示了图探索和推理技能的强大跨域泛化能力。我们的代码可在https://github.com/leopoldwhite/GraphDancer找到。

英文摘要

Large language models (LLMs) increasingly rely on external knowledge to improve factuality, yet many real-world knowledge sources are organized as heterogeneous graphs rather than plain text. Reasoning over such graphs requires models to follow schema-defined relations through precise function calls and to aggregate evidence across multiple rounds of interaction. We propose GraphDancer, a two-stage post-training framework that teaches LLMs to reason over graphs by interleaving natural-language reasoning with graph function execution. The first stage teaches the model how to interact with the graph under rule-based rewards, while the second stage further teaches it to prefer more grounded and efficient interaction trajectories. The key novelty of GraphDancer is a graph-aware curriculum that organizes both stages by the structural complexity of information-seeking trajectories, progressively increasing task difficulty during training. We evaluate GraphDancer on a multi-domain benchmark by training on one domain only and testing on unseen domains and out-of-distribution question types. Despite using only a 3B backbone, GraphDancer outperforms baselines equipped with larger/stronger backbones, demonstrating robust cross-domain generalization of graph exploration and reasoning skills. Our code can be found at https://github.com/leopoldwhite/GraphDancer.

2602.01518 2026-05-27 cs.AI

Qrita: High-performance Top-k and Top-p using Pivot-based Truncation and Selection

Qrita:使用基于枢轴的截断和选择的高性能Top-k和Top-p

Jongseok Park, Sunga Kim, Alvin Cheung, Ion Stoica

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出Qrita算法,通过基于高斯sigma截断和四元枢轴搜索的枢轴方法,高效实现Top-k和Top-p采样,在保持与排序算法相同输出的同时,将端到端服务吞吐量提升至1.4倍并减少一半内存使用。

详情
AI中文摘要

尽管Top-k和Top-p算法在模型采样中很重要,但对于大词汇表的高效实现仍然是一个重大挑战。现有方法通常依赖于排序,这在GPU上会带来显著的计算和内存开销,或者依赖于改变算法输出的随机方法。在这项工作中,我们提出了Qrita,一种基于枢轴截断和选择的高效Top-k和Top-p算法。Qrita利用基于枢轴的搜索来实现Top-k和Top-p,并采用两种关键技术:1. 基于高斯的sigma截断,大大减少了词汇表的搜索空间;2. 具有重复处理能力的四元枢轴搜索,将枢轴搜索迭代次数减半并保证确定性输出。我们使用Triton实现了Qrita,并针对高性能LLM执行引擎(如SGLang和FlashInfer)的Top-k和Top-p内核评估了其性能,将端到端服务吞吐量提高了1.4倍,同时内存使用量减半,并提供了与基于排序算法相同的输出。Qrita现在是vLLM GPU执行路径的默认Top-k和Top-p采样器,Qrita的三元实现可在https://github.com/vllm-project/vllm/blob/main/vllm/v1/sample/ops/topk_topp_triton.py获取。

英文摘要

Despite their importance in model sampling, efficient implementation of Top-k and Top-p algorithms for large vocabularies remains a significant challenge. Existing approaches often rely on sorting, which incurs significant computation and memory overhead on GPUs, or on stochastic approaches that alter the algorithm's output. In this work, we propose Qrita, an efficient Top-k and Top-p algorithm based on a pivot-based truncation and selection. Qrita leverages pivot-based search for both Top-k and Top-p with two key techniques: 1. Gaussian-based sigma-truncation, which greatly reduces the search space of the vocabulary, and 2. Quaternary pivot search with duplication handling, which halves the number of pivot search iterations and guarantees deterministic output. We implement Qrita using Triton and evaluate its performance against the Top-k and Top-p kernels of high-performance LLM execution engines such as SGLang and FlashInfer, improving end-to-end serving throughput up to 1.4 times with half the memory usage, while providing the same output as the sorting-based algorithms. Qrita is now the default Top-k and Top-p sampler for the GPU execution path of vLLM, and a ternary implementation of Qrita is available at https://github.com/vllm-project/vllm/blob/main/vllm/v1/sample/ops/topk_topp_triton.py.
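To make the idea of pivot-based selection concrete, here is a minimal sketch that finds the Top-k threshold without sorting the full vocabulary. It is a plain binary pivot search followed by a sort of the small residual band, not Qrita's quaternary search or its Gaussian sigma-truncation, and it is unrelated to the vLLM Triton kernel. The function names and the `iters` parameter are assumptions of this sketch.

```python
def topk_threshold(logits, k, iters=16):
    """Return the k-th largest logit using pivot bisection on the value range."""
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and len(logits)")
    lo, hi = min(logits), max(logits)
    for _ in range(iters):
        pivot = (lo + hi) / 2.0
        above_pivot = sum(1 for x in logits if x > pivot)
        if above_pivot >= k:
            lo = pivot  # the k-th largest value lies strictly above the pivot
        else:
            hi = pivot  # the k-th largest value is at or below the pivot
    # Fewer than k values exceed hi; the k-th largest sits in [lo, hi].
    above = sum(1 for x in logits if x > hi)
    band = sorted((x for x in logits if lo <= x <= hi), reverse=True)
    return band[k - above - 1]

def topk_mask(logits, k):
    """Keep every logit at or above the threshold (ties are all kept)."""
    threshold = topk_threshold(logits, k)
    return [x >= threshold for x in logits]

logits = [2.0, -1.0, 0.5, 3.0, 0.1, -4.0]
mask = topk_mask(logits, 3)
```

Each bisection step is a counting pass, which maps naturally onto parallel reductions, and only the narrow band around the threshold is ever sorted. Unlike the deterministic duplicate handling mentioned in the abstract, this sketch simply keeps all tied values, so the mask can contain more than k entries when logits repeat.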

2602.00959 2026-05-27 cs.LG cs.CL

Probing the Knowledge Boundary: An Interactive Agentic Framework for Deep Knowledge Extraction

探测知识边界:一种用于深度知识提取的交互式智能体框架

Yuheng Yang, Siqi Zhu, Tao Feng, Ge Liu, Jiaxuan You

发表机构 * University of Illinois at Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Westlake University(西湖大学)

AI总结 提出一种交互式智能体框架,通过四种自适应探索策略和三阶段知识处理流水线,系统性地提取和量化大语言模型的知识,发现递归分类法最有效,并揭示了知识缩放定律、Pass@1与Pass@k的权衡以及训练数据对知识轮廓的影响。

Comments Homepage: https://ulab-uiuc.github.io/KnowledgeExtraction/

详情
AI中文摘要

大型语言模型(LLMs)可被视为压缩的知识库,但尚不清楚它们真正包含哪些知识以及其知识边界延伸多远。现有基准大多是静态的,对系统性知识探测的支持有限。本文提出一种交互式智能体框架,用于系统性地提取和量化LLMs的知识。我们的方法包括四种自适应探索策略,以不同粒度探测知识。为确保提取知识的质量,我们引入了一个三阶段知识处理流水线,结合基于向量的过滤以去除严格重复、基于LLM的裁决以解决模糊语义重叠,以及领域相关性审计以保留有效的知识单元。通过大量实验,我们发现递归分类法(Recursive Taxonomy)是最有效的探索策略。我们还观察到清晰的知识缩放定律,即更大的模型始终能恢复更多知识。此外,我们识别出Pass@1与Pass@k之间的权衡:领域专用模型初始准确率更高但退化迅速,而通用模型在长时间提取中保持稳定性能。最后,我们的结果表明,训练数据组成的差异导致不同模型家族具有独特且可测量的知识轮廓,反映了预训练如何塑造每个模型的参数化知识。

英文摘要

Large Language Models (LLMs) can be seen as compressed knowledge bases, but it remains unclear what knowledge they truly contain and how far their knowledge boundary extends. Existing benchmarks are mostly static and provide limited support for systematic knowledge probing. In this paper, we propose an interactive agentic framework to systematically extract and quantify the knowledge of LLMs. Our method includes four adaptive exploration policies to probe knowledge at different granularity. To ensure the quality of extracted knowledge, we introduce a three-stage knowledge processing pipeline that combines vector-based filtering to remove strict duplicates, LLM-based adjudication to resolve ambiguous semantic overlap, and domain relevance auditing to retain valid knowledge units. Through extensive experiments, we find that Recursive Taxonomy is the most effective exploration strategy. We also observe a clear knowledge scaling law, where larger models consistently recover more knowledge. In addition, we identify a Pass@1 versus Pass@k trade-off: domain-specialized models achieve higher initial accuracy but experience rapid degradation, while general-purpose models maintain stable performance over extended extraction. Finally, our results show that differences in training data composition lead to distinct and measurable knowledge profiles across model families, reflecting how pretraining shapes each model's parametric knowledge.

2602.00827 2026-05-27 cs.LG stat.ML

Over-Alignment vs Over-Fitting: The Role of Feature Learning Strength in Generalization

过度对齐 vs 过拟合:特征学习强度在泛化中的作用

Taesun Yeom, Taehyeok Ha, Jaeho Lee

发表机构 * Pohang University of Science and Technology (POSTECH)(浦项科技大学(POSTECH))

AI总结 本文通过实验和理论分析,揭示了深度网络中特征学习强度存在最优值,过大导致过度对齐、过小导致过拟合,从而影响泛化性能。

Comments ICML 2026

详情
AI中文摘要

特征学习强度(FLS),即模型有效输出缩放的倒数,在塑造神经网络的优化动态中起着关键作用。尽管其影响已在渐近区域(训练时间和FLS)得到广泛研究,但现有理论对FLS如何影响实际设置中的泛化(例如,当训练在达到目标训练风险时停止)提供的见解有限。在这项工作中,我们研究了在实际条件下FLS对深度网络泛化的影响。通过实证研究,我们首先发现了一个$\textit{最优FLS}$的存在(既不太小也不太大),它能带来显著的泛化收益。这一发现与更强的特征学习普遍改善泛化的主流直觉相悖。为了解释这一现象,我们开发了对使用逻辑损失训练的两层ReLU网络中的梯度流动力学的理论分析,其中FLS通过初始化尺度控制。我们的主要理论结果建立了最优FLS的存在性,它源于两种竞争效应之间的权衡:过大的FLS会导致$\textit{过度对齐}$现象,降低泛化性能,而过小的FLS则会导致$\textit{过拟合}$。

英文摘要

Feature learning strength (FLS), i.e., the inverse of the effective output scaling of a model, plays a critical role in shaping the optimization dynamics of neural nets. While its impact has been extensively studied under the asymptotic regimes -- both in training time and FLS -- existing theory offers limited insight into how FLS affects generalization in practical settings, such as when training is stopped upon reaching a target training risk. In this work, we investigate the impact of FLS on generalization in deep networks under such practical conditions. Through empirical studies, we first uncover the emergence of an $\textit{optimal FLS}$ -- neither too small nor too large -- that yields substantial generalization gains. This finding runs counter to the prevailing intuition that stronger feature learning universally improves generalization. To explain this phenomenon, we develop a theoretical analysis of gradient flow dynamics in two-layer ReLU nets trained with logistic loss, where FLS is controlled via initialization scale. Our main theoretical result establishes the existence of an optimal FLS arising from a trade-off between two competing effects: An excessively large FLS induces an $\textit{over-alignment}$ phenomenon that degrades generalization, while an overly small FLS leads to $\textit{over-fitting}$.

2502.03946 2026-05-27 cs.LG

CleanSurvival: Automated data preprocessing for time-to-event models using reinforcement learning

CleanSurvival:使用强化学习为时间事件模型自动数据预处理

Yousef Koka, David Selby, Gerrit Großmann, Kathan Pandya, Sebastian Vollmer

发表机构 * German University in Cairo(开罗德国大学) German Research Center for Artificial Intelligence (DFKI)(德国人工智能研究中心(DFKI)) University of Saarland(萨尔大学) University of Kaiserslautern–Landau (RPTU)(凯撒斯劳滕-兰道大学(RPTU))

AI总结 提出基于强化学习的CleanSurvival框架,自动优化生存分析的数据预处理流程,提升Cox、随机森林、神经网络等时间事件模型的预测性能。

Comments Resubmitted after Peer Review Feedback to BMC Medical Informatics and Decision Making

详情
AI中文摘要

在机器学习中,数据预处理往往被忽视,尽管它对模型性能有潜在的重大影响。虽然自动化机器学习管道开始认识到并将数据预处理集成到分类和回归任务的解决方案中,但对于更专业的任务(如针对删失数据的时间事件模型)却缺乏这种集成。因此,生存分析不仅面临数据预处理的一般挑战,还缺乏针对性的自动化解决方案。为填补这一空白,本文提出了CleanSurvival,一种基于强化学习的解决方案,用于优化预处理流程,并专门扩展到生存分析。该框架可处理连续和分类变量。它基于Learn2Clean的Q学习,选择数据插补、异常值检测和特征提取技术的组合,以针对Cox、随机森林、神经网络或用户提供的时间事件模型实现最佳性能。Python包可在GitHub上获取:https://github.com/datasciapps/CleanSurvival。在真实世界数据集上的实验基准表明,基于Q学习的数据预处理相对于简单基线可以提高预测性能,而运行时行为依赖于条件,在覆盖最好的基准单元中最清晰可解释。此外,模拟研究证明了在不同类型和水平的缺失和噪声下的有效性。随着机器学习的使用增加,将AutoML管道推广到包括生存分析在内的各种模型变得重要。像CleanSurvival这样集成生存分析预处理的工具,可以使生存研究更容易、更快速地进行,并使结果更稳健。

英文摘要

Data preprocessing is often paid little attention in machine learning, despite its potentially significant impact on model performance. While automated machine learning pipelines are starting to recognize and integrate data preprocessing into their solutions for classification and regression tasks, this integration is lacking for more specialized tasks like time-to-event models for censored data. As a result, survival analysis not only faces the general challenges of data preprocessing but also suffers from the lack of tailored, automated solutions in this area. To address this gap, this paper presents CleanSurvival, a reinforcement-learning-based solution for optimizing preprocessing pipelines, extended specifically for survival analysis. The framework can handle continuous and categorical variables. It builds upon Learn2Clean's Q-learning to select which combination of data imputation, outlier detection and feature extraction techniques achieves optimal performance for a Cox, random forest, neural network or user-supplied time-to-event model. The Python package is available on GitHub: https://github.com/datasciapps/CleanSurvival. Experimental benchmarks on real-world datasets show that the Q-learning-based data preprocessing can improve predictive performance relative to simple baselines, while runtime behavior is condition-dependent and most clearly interpretable in the best-covered benchmark cells. Furthermore, a simulation study demonstrates effectiveness across different types and levels of missingness and noise. With an increase in the use of machine learning, it becomes important to generalise AutoML pipelines to a variety of models now present, including survival analysis. Tools like CleanSurvival, which integrate preprocessing for survival analysis, can make survival studies easier and quicker to perform, as well as make the results more robust.

2602.00491 2026-05-27 cs.CL

From Knowledge to Inference: Formalizing Specialized Public Health Reasoning on GlobalHealthAtlas

从知识到推理:形式化GlobalHealthAtlas上的专业公共卫生推理

Zhaokun Yan, Shan Xu, Wuzheng Dong, Zhaohan Liu, Lijie Feng, Chengxiao Dai, Chen Tianqi, Binfan Liu, Yunpu Ma, Wenting Wei, Yingting Li, Yi Zhang, Tongning Wu

发表机构 * China Academy of Information and Communications Technology(中国信息通信研究院) CRRC Industrial Academy Co., Ltd.(中车工业研究院有限公司) The University of Sydney(悉尼大学) Ludwig Maximilian University of Munich(慕尼黑路德维希-马克西米利安大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) Beijing University of Posts and Telecommunications(北京邮电大学) Shanghai Institute of Infectious Disease and Biosecurity(上海传染病与生物安全研究院) School of Public Health, Fudan University(复旦大学公共卫生学院)

AI总结 为解决公共卫生推理缺乏结构化监督信号和基准的问题,提出大规模多语言数据集GlobalHealthAtlas(280,210实例,15领域,17语言),并构建LLM辅助的构建与质量控制流水线及领域对齐评估器,支持安全关键型公共卫生推理的LLM训练与评估。

Journal ref ICML 2026 regular

详情
AI中文摘要

公共卫生推理需要基于科学证据、专家共识和安全约束的群体层面推理。然而,作为一个结构化的机器学习问题,它仍然未被充分探索,且缺乏监督信号和基准。我们引入了GlobalHealthAtlas,一个大规模多语言数据集,包含280,210个实例,涵盖15个公共卫生领域和17种语言。我们进一步提出了一种大语言模型(LLM)辅助的构建和质量控制流水线,包括检索、去重、证据基础检查和标签验证,以提高大规模数据的一致性。最后,我们提出了一个从不同LLM的高置信度判断中提炼的领域对齐评估器,用于评估输出在六个维度上的表现:准确性、推理、完整性、共识一致性、术语规范性和洞察力。这些贡献共同使得LLM在安全关键的公共卫生推理中的可重复训练和评估成为可能,超越了传统的问答基准。我们公开发布了项目代码库、评估器和模型,网址为:https://github.com/Jan8217/GlobalHealthAtlas, https://huggingface.co/aerovane0/GlobalHealthAtlas_Public_Evaluator 和 https://huggingface.co/aerovane0/GlobalHealthAtlas_Public_Model。

English abstract

Public health reasoning requires population-level inference grounded in scientific evidence, expert consensus, and safety constraints. However, it remains underexplored as a structured machine learning problem, with limited supervised signals and benchmarks. We introduce GlobalHealthAtlas, a large-scale multilingual dataset of 280,210 instances spanning 15 public health domains and 17 languages. We further propose a large language model (LLM)-assisted construction and quality-control pipeline with retrieval, deduplication, evidence-grounding checks, and label validation to improve consistency at scale. Finally, we present a domain-aligned evaluator distilled from high-confidence judgments of diverse LLMs to assess outputs along six dimensions: Accuracy, Reasoning, Completeness, Consensus Alignment, Terminology Norms, and Insightfulness. Together, these contributions enable reproducible training and evaluation of LLMs for safety-critical public health reasoning beyond conventional QA benchmarks. We publicly release the project codebase, evaluator, and model at: https://github.com/Jan8217/GlobalHealthAtlas, https://huggingface.co/aerovane0/GlobalHealthAtlas_Public_Evaluator, and https://huggingface.co/aerovane0/GlobalHealthAtlas_Public_Model

2601.22384 2026-05-27 cs.LG cs.AI

Graph is a Substrate Across Data Modalities

Graphs as a Substrate Across Data Modalities

Ziming Li, Xiaoming Wu, Zehong Wang, Jiazheng Li, Yijun Tian, Jinhe Bi, Yunpu Ma, Yanfang Ye, Chuxu Zhang

Affiliations * University of Connecticut; University of Notre Dame; National University of Singapore

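AI summary Proposes G-Substrate, a framework that uses a unified structural schema and an interleaved role-based training strategy so that graph structure accumulates as a shared substrate across modalities and tasks, outperforming task-isolated and naive multi-task methods.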

Comments Graph structure across data modalities, accepted by ICML26

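Details
AI-generated abstract

Graphs provide a natural representation of relational structure that arises across diverse domains. Despite this ubiquity, graph structure is typically learned in a modality- and task-isolated manner: graph representations are built within a single task context and then discarded. As a result, structural regularities across modalities and tasks are repeatedly reconstructed rather than accumulated at the level of intermediate graph representations. This raises a representation-learning question: how should graph structure be organized so that it can persist and accumulate across heterogeneous modalities and tasks? We take a representation-centric view and treat graph structure as a structural substrate that persists across learning contexts. To instantiate this view, we propose G-Substrate, a graph substrate framework that organizes learning around shared graph structure. G-Substrate comprises two complementary mechanisms: a unified structural schema that ensures graph representations are compatible across heterogeneous modalities and tasks, and an interleaved role-based training strategy that exposes the same graph structure to multiple functional roles during learning. Experiments across multiple domains, modalities, and tasks show that G-Substrate outperforms task-isolated and naive multi-task learning methods. The codebase, models, and datasets are available at https://github.com/zmli6/G-Substrate.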

English abstract

Graphs provide a natural representation of relational structure that arises across diverse domains. Despite this ubiquity, graph structure is typically learned in a modality- and task-isolated manner, where graph representations are constructed within individual task contexts and discarded thereafter. As a result, structural regularities across modalities and tasks are repeatedly reconstructed rather than accumulated at the level of intermediate graph representations. This motivates a representation-learning question: how should graph structure be organized so that it can persist and accumulate across heterogeneous modalities and tasks? We adopt a representation-centric perspective in which graph structure is treated as a structural substrate that persists across learning contexts. To instantiate this perspective, we propose G-Substrate, a graph substrate framework that organizes learning around shared graph structures. G-Substrate comprises two complementary mechanisms: a unified structural schema that ensures compatibility among graph representations across heterogeneous modalities and tasks, and an interleaved role-based training strategy that exposes the same graph structure to multiple functional roles during learning. Experiments across multiple domains, modalities, and tasks show that G-Substrate outperforms task-isolated and naive multi-task learning methods. The codebase, model, and datasets are available at https://github.com/zmli6/G-Substrate.

2601.21845 2026-05-27 cs.LG

Constrained Meta Reinforcement Learning with Provable Test-Time Safety

Constrained Meta-Reinforcement Learning with Provable Test-Time Safety

Tingting Ni, Maryam Kamgarpour

Affiliations * Sycamore Lab, EPFL, Lausanne, Switzerland

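AI summary Proposes a constrained meta-reinforcement learning algorithm that learns a near-optimal policy on test tasks with provable safety and sample-complexity guarantees, and proves a sample-complexity lower bound.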

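Details
AI-generated abstract

Meta reinforcement learning lets an agent leverage experience from a distribution of tasks on which it can train at will, so that it learns optimal policies faster on new test tasks. Despite its success in improving sample complexity on test tasks, many real-world applications, such as robotics and healthcare, impose safety constraints during testing. Constrained meta reinforcement learning offers a promising framework for integrating safety into meta reinforcement learning. An open question in constrained meta reinforcement learning is how to ensure that the policy is safe on the real-world test task while reducing sample complexity and thereby learning optimal policies faster. To address this gap, we propose an algorithm that refines the policies learned during training, with provable safety and sample-complexity guarantees for learning a near-optimal policy on test tasks. We further derive a matching lower bound, showing that this sample complexity is tight.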

English abstract

Meta reinforcement learning (RL) allows agents to leverage experience across a distribution of tasks on which the agent can train at will, enabling faster learning of optimal policies on new test tasks. Despite its success in improving sample complexity on test tasks, many real-world applications, such as robotics and healthcare, impose safety constraints during testing. Constrained meta RL provides a promising framework for integrating safety into meta RL. An open question in constrained meta RL is how to ensure the safety of the policy on the real-world test task while reducing the sample complexity, thus enabling faster learning of optimal policies. To address this gap, we propose an algorithm that refines policies learned during training, with provable safety and sample complexity guarantees for learning a near-optimal policy on the test tasks. We further derive a matching lower bound, showing that this sample complexity is tight.
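
For readers new to the setting, safety constraints of this kind are commonly written as a constrained Markov decision process (CMDP). The block below is the standard textbook form, not the paper's own notation; the symbols (policy \pi, initial distribution \rho, reward r, cost c, threshold b, discount \gamma, confidence \delta, number of test-time policies K) are assumptions introduced here, and the last line is only one possible reading of "test-time safety".

```latex
% Discounted value of a signal f (reward r or cost c) under policy \pi
V_f^{\pi}(\rho) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} f(s_t, a_t)
    \,\middle|\, s_0 \sim \rho,\ \pi \right], \qquad f \in \{r, c\}

% Constrained objective on a single task
\max_{\pi} \; V_r^{\pi}(\rho)
\quad \text{subject to} \quad V_c^{\pi}(\rho) \le b

% One possible reading of test-time safety: every policy deployed
% while learning on the test task satisfies the constraint
\Pr\left[ V_c^{\pi_k}(\rho) \le b \ \text{ for all } k = 1, \dots, K \right] \ge 1 - \delta
```

Under this reading, sample complexity counts the test-task interactions needed before the deployed policy is near-optimal among the policies that satisfy the constraint.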