arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday
2605.08958 2026-05-12 cs.LG

Learning predictive models for combinations of heterogeneous proteomic data sources

Michal Valko, Richard Pelikan, Miloš Hauskrecht

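AI summary This study explores how to integrate two heterogeneous proteomic data sources, whole-sample mass spectrometry profiling and multiplexed protein arrays, to improve classification performance for pancreatic cancer. It finds that classification models that perform well on each data source alone can fail when the data are combined, and therefore proposes a class of model fusion methods that account for the differences between the sources and exploit the strengths of heterogeneous data.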

Comments Published at the AMIA Summit on Translational Bioinformatics (STB 2008)

Abstract

Multiple technologies that measure expression levels of protein mixtures in the human body offer potential for detecting and understanding disease. The recent growth in the number of these technologies prompts researchers to evaluate the individual and combined utility of the data they generate. In this work, we study two data sources that measure the expression of protein mixtures in the human body: whole-sample MS profiling and multiplexed protein arrays. We investigate the individual and combined utility of these technologies by learning and testing a variety of classification models on the data from a pancreatic cancer study. We show that, for the combination of these two (heterogeneous) datasets, classification models that work well on one of them individually fail on the combination of the two. We study and propose a class of model fusion methods that acknowledge the differences and try to reap most of the benefits from their combination.

2605.08956 2026-05-12 cs.AI

Agentic AI Scientists Are Not Built For Autonomous Scientific Discovery

Harshit Bisht, Vinay Kumar, Kevin Maik Jablonka, Mausam, N. M. Anoop Krishnan

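AI summary This paper examines the limitations of current so-called "agentic AI scientists" in achieving end-to-end autonomous scientific discovery. The authors argue that although such systems already play a supporting role in research, they face fundamental challenges in problem selection, knowledge grounding, preference optimization, and evaluation, and so are ill-suited to truly autonomous research. The paper recommends using scientific simulations as verifiers, building persistent and evolving world models, and establishing a preregistration repository for hypotheses, to steer AI scientist design toward the needs of scientific practice.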

Abstract

A growing body of work pursues AI scientists capable of end-to-end autonomous scientific discovery. This position paper argues that although they already function as co-scientists, agentic AI scientists are not built for autonomous scientific discovery. We identify the following challenges in building and deploying autonomous AI scientists: (1) Problem selection is influenced by the McNamara fallacy; (2) Agents are built on large language models (LLMs) whose training corpora omit tacit procedural and failure knowledge of laboratory practice; (3) Preference optimisation during post-training compresses output diversity toward consensus; and (4) Most scientific benchmarks measure single-turn prediction accuracy and lack feedback from physical experiments back to the computational model. These challenges are not just questions of scale and scaffolding; they require revisiting fundamental design choices. To build truly autonomous AI scientists, we recommend the use of scientific simulations as verifiers for training, the design of persistent world models that represent the shifting objectives governing real investigations, the establishment of a centralized preregistration repository for all AI-generated hypotheses, and application driven by scientific need rather than tool affordance.

2605.08955 2026-05-12 cs.LG

Outlier detection for patient monitoring and alerting

Miloš Hauskrecht, Iyad Batal, Michal Valko, Shyam Visweswaran, Gregory F. Cooper, Gilles Clermont

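AI summary This paper studies how to use historical patient data in electronic health records (EHRs) to detect unusual patient-management decisions and raise alerts on them. It proposes a data-driven outlier detection approach and tests, on the records of 4486 post-cardiac surgical patients, the hypothesis that unusual decisions may reflect medical errors. The method achieved true alert rates of 25% to 66% across a variety of patient-management actions, with 66% corresponding to the strongest outliers.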

Comments Published at JBI 2013

Abstract

We develop and evaluate a data-driven approach for detecting unusual (anomalous) patient-management decisions using past patient cases stored in electronic health records (EHRs). Our hypothesis is that a patient-management decision that is unusual with respect to past patient care may be due to an error and that it is worthwhile to generate an alert if such a decision is encountered. We evaluate this hypothesis using data obtained from EHRs of 4486 post-cardiac surgical patients and a subset of 222 alerts generated from the data. We base the evaluation on the opinions of a panel of experts. The results of the study support our hypothesis that outlier-based alerting can lead to promising true alert rates. We observed true alert rates that ranged from 25% to 66% for a variety of patient-management actions, with 66% corresponding to the strongest outliers.

2605.08954 2026-05-12 cs.LG cs.AI

MolWorld: Molecule World Models for Actionable Molecular Optimization

Yang Qiao, Bo Pan, Hao-Wei Pang, Peter Zhiping Zhang, Liying Zhang, Liang Zhao

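AI summary In drug discovery, molecular optimization aims to find molecules with improved target properties, but practical lead optimization requires candidates that are not only predicted to score well but are also actionable, that is, reachable from known molecules through valid local structural transformations. This paper proposes MolWorld, a framework that builds a molecule-transfer graph and uses a world model to guide the search, enabling actionable molecular optimization. The method improves molecular properties while preserving structural connectivity, and performs well on property optimization and docking-based tasks.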

Abstract

Molecular optimization in drug discovery aims to discover molecules with improved target properties, but practical lead optimization often requires more than high predicted scores. A useful candidate should also be actionable: it should be reachable from known molecules through valid local structural transformations, so that it can be interpreted as a plausible revision within an evolving chemical series. Existing de novo and single-molecule optimization methods do not explicitly model such reachability, especially when both the target molecules and the intermediate molecules connecting them to known compounds are unknown. In this work, we formulate actionable molecular optimization as sequential expansion of a molecule-transfer graph, where nodes are molecules and edges encode valid local transformations. We propose MolWorld, a molecule world model-guided framework that treats the current molecule-transfer graph as an evolving search state. At each iteration, MolWorld selects local anchor contexts, generates candidate molecules conditioned on these contexts, evaluates their properties, and uses a learned world model to update the evolving molecule world by retaining admissible candidates and inserting them into the molecule-transfer graph. The expanded molecule world then guides subsequent optimization. Experiments on property optimization and docking-based tasks show that MolWorld discovers high-property molecules while maintaining substantially stronger structural connectivity, supporting actionable and sequential molecular design.

2605.08952 2026-05-12 cs.CV

FugSeg: Fast Uncertainty-aware Ground Segmentation for 3D Point Cloud

Yu Li, Volker Schwieger

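AI summary In LiDAR-based environment perception systems, ground segmentation is a key preprocessing step for applications such as mapping and navigation. To address challenges such as reflection noise and isolated ground points, this paper proposes FugSeg, a fast, uncertainty-aware ground segmentation method. It represents the point cloud as a polar grid map and introduces an adaptive slope and a mechanism for handling noisy ground cells, improving segmentation reliability on complex terrain. Experiments show that FugSeg outperforms existing non-learning methods on several public datasets and runs efficiently on a single CPU thread, making it suitable for resource-limited systems.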

Comments Accepted for publication in IEEE Transactions on Intelligent Transportation Systems

Journal ref
IEEE Transactions on Intelligent Transportation Systems (Early Access), 2026
Abstract

In LiDAR-based environment perception systems, ground segmentation is a key preprocessing step supporting various applications such as mapping and navigation. Although extensively studied, problems such as reflection noise and isolated ground remain challenging. To address these issues, we propose FugSeg, a fast uncertainty-aware ground segmentation method. A polar grid map is adopted as the point cloud representation to ensure generalizability across LiDAR types. Building on that, we develop a within- and cross-segment ground labeling strategy that identifies not only directly visible ground cells but also those that are isolated or occluded. During this process, an adaptive slope is introduced, which incorporates measurement uncertainties to enhance its reliability under complex terrain. Finally, to achieve point-level ground segmentation, a fine-grained ground elevation estimation method is introduced. Throughout the complete workflow, reflection noise is explicitly handled via the proposed noisy ground cells. We conduct comprehensive evaluations on four public datasets covering both structured and unstructured environments. Results show that FugSeg outperforms state-of-the-art non-learning methods, achieving the highest F1, accuracy, and mIoU across all datasets, while maintaining the fastest runtime (135 Hz and 487 Hz for 64- and 32-layer LiDARs) using a single CPU thread, making it suitable for resource-limited systems. The code will be available at https://github.com/Leo-YuLi/FugSeg.

2605.08950 2026-05-12 cs.CL cs.AI

Improving Lexical Difficulty Prediction with Context-Aligned Contrastive Learning and Ridge Ensembling

Wicaksono Leksono Muhamad, Joanito Agili Lopo, Tsamarah Rana Nugraha, Ahmad Cahyono Adi, Muhammad Oriza Nurfajri

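AI summary This paper studies how to improve the accuracy of lexical difficulty prediction, particularly for language learning and readability assessment across different first-language backgrounds. To address the weaknesses of existing methods in cross-lingual alignment and in modeling the ordinal structure of difficulty, the authors propose a method that combines context-aligned contrastive learning with a Ridge regression ensemble, improving both cross-lingual representation and ordinal modeling. Experiments show more stable and accurate predictions on several first-language datasets.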

Abstract

Lexical difficulty prediction is a fundamental problem in language learning and readability assessment, requiring models to estimate word difficulty across different first-language (L1) backgrounds. However, existing approaches rely on regression-only training with scalar supervision, which does not explicitly structure the representation space, limiting their ability to capture cross-lingual alignment and ordinal difficulty. To mitigate these issues, we propose Context-Aligned Contrastive Regression, which integrates a Ridge regression ensemble with two complementary objectives, i.e., Cross-View Context and Ordinal Soft Contrastive Learning. Experiments on three L1 datasets show that (i) contrastive objectives improve cross-lingual representation alignment while preserving language-specific nuances, (ii) the learned representations capture the ordinal structure of lexical difficulty, and (iii) the ensemble effectively mitigates systematic biases of individual models, leading to more stable performance across difficulty levels.

2605.08947 2026-05-12 cs.RO

A low-cost mockup to simulate robotic laser cutting in nuclear decommissioning

Frederico Fernandes Afonso Silva, Murilo Marques Marinho, Bruno Vilhena Adorno

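AI summary This paper presents a low-cost experimental mockup for simulating robotic laser cutting of containers in nuclear decommissioning. The setup comprises a three-axis table, a six-degree-of-freedom manipulator, and a vision-based control module, and can simulate laser cutting with obstacle avoidance and path tracking. A constrained task-space adaptive motion controller compensates for parameter errors without calibration, and controlling the ultraviolet beam rather than the full end-effector pose yields more accurate tracking of the cutting path.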

Comments 7 pages, 8 figures, 2 tables. Under Review for TAROS 2026 (Towards Autonomous Robotic Systems)

Abstract

This paper introduces a low-cost experimental mockup to simulate the laser cutting process of containers in nuclear decommissioning. It is composed of a three-axis table supporting a cuboid container with ultraviolet-sensitive faces, a six-degree-of-freedom serial manipulator holding an ultraviolet torch that simulates the laser, and a visual system based on cameras and fiducial markers. The system employs a constrained task-space adaptive motion controller that compensates for inaccurate parameters and eliminates the need to calibrate the system. Furthermore, as the motion controller explicitly accounts for geometric constraints, the robot reactively avoids collisions with obstacles while handling the ultraviolet torch. To enhance tracking of the laser-cutting path, we control the ultraviolet beam, which requires only four degrees of freedom, instead of the full end-effector pose. Experiments show that, despite an initially uncalibrated system, the overall system is capable of tracking different trajectories with an overall mean accuracy of 3.9 (sd 2.5) mm when the end-effector pose is controlled and 2.4 (sd 1.3) mm when the ultraviolet beam is controlled.

2605.08946 2026-05-12 cs.LG

A Single Deep Preference-Conditioned Policy for Learning Pareto Coverage Sets

Akihiro Kubo, Kosuke Nakanishi, Shin Ishii

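AI summary This study aims to learn a single deep preference-conditioned policy that captures the Pareto-optimal solutions under different preferences in multi-objective reinforcement learning. Using smooth Tchebycheff scalarization, the paper proves that, under certain conditions, each preference corresponds to a unique Pareto-optimal return vector that is Lipschitz-continuous in the preference, which gives a theoretical basis for dense Pareto-front coverage. It proposes Concave Mirror Descent Policy Iteration (CMDPI), formulated over occupancy measures, and extends it to a deep policy-gradient algorithm that shows strong Pareto-front coverage and expected-utility performance on several multi-objective tasks.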

Abstract

Preference-conditioned multi-objective reinforcement learning aims to learn a single policy that captures trade-offs across preferences, but under nonlinear scalarization the uniqueness and continuity of the preference-to-solution correspondence remain unclear. We study this problem in tabular multi-objective Markov decision processes (MDPs) using smooth Tchebycheff scalarization as a monotone utility. Under mild interior conditions on the preference set, we prove that each preference induces a unique Pareto-optimal return vector and that this vector depends Lipschitz-continuously on the preference, providing a principled foundation for preference sweeping toward dense Pareto-front coverage. To compute these targets, we formulate the problem over occupancy measures and derive Concave Mirror Descent Policy Iteration (CMDPI), which achieves an $O(1/k)$ objective-suboptimality rate. We further show that each update is equivalent to solving a Kullback-Leibler-regularized MDP with the previous policy as reference, yielding a policy-iteration interpretation and finite-iterate policy continuity across preferences. We instantiate the update as a deep actor-critic algorithm preserving previous-policy regularization. On eight MO-Gymnasium tasks, it achieves the best average hypervolume rank among recent baselines and strong expected-utility performance. Continuous-control experiments indicate gains beyond the discrete-action setting.
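
The smooth Tchebycheff scalarization mentioned in the abstract can be illustrated with a small sketch. The abstract does not give the formula, so the log-sum-exp form below, the ideal point `ideal`, and the smoothing parameter `mu` are assumptions taken from the commonly used definition, not necessarily the paper's exact utility. The function names are hypothetical.

```python
import math


def hard_tchebycheff_utility(returns, weights, ideal):
    """Classical Tchebycheff utility (higher is better).

    Penalizes the worst weighted gap between the ideal point and the returns.
    """
    return -max(w * (z - j) for w, z, j in zip(weights, ideal, returns))


def smooth_tchebycheff_utility(returns, weights, ideal, mu=0.1):
    """Smooth Tchebycheff utility (assumed form, not confirmed by the abstract).

    u = -mu * log(sum_i exp(w_i * (z_i - J_i) / mu)),
    a log-sum-exp smoothing of the hard maximum above.
    """
    gaps = [w * (z - j) / mu for w, z, j in zip(weights, ideal, returns)]
    shift = max(gaps)  # subtract the max before exponentiating, for stability
    log_sum_exp = shift + math.log(sum(math.exp(g - shift) for g in gaps))
    return -mu * log_sum_exp
```

With positive weights this utility increases whenever any single return increases, which is the monotonicity the abstract relies on, and it approaches the hard Tchebycheff utility as `mu` shrinks toward zero.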

2605.08945 2026-05-12 cs.CV

PIDNet: Progressive Implicit Decouple Network for Multimodal Action Quality Assessment

Qiqi Li, Pengfei Wang, Nenggan Zheng

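AI summary This paper proposes PIDNet, a progressive implicit decoupling network for multimodal action quality assessment. The method progressively fuses modality-specific information, cross-modal complementary cues, and global quality semantics to improve assessment accuracy. Its core module, iMambaWave, combines a bidirectional Mamba branch and a wavelet-transform branch to capture long-range temporal dependencies and local detail variations, respectively, with a gated aggregation mechanism that adaptively fuses temporal and frequency-domain information. Experiments show that PIDNet achieves better assessment performance than existing unimodal and multimodal methods on several datasets, and that it generalizes well and can be used as a plug-and-play module.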

Comments 14 pages, 6 figures, 11 tables

Abstract

Action quality assessment (AQA) aims to automatically quantify the execution quality of human actions in videos and is valuable for applications such as competitive sports judging. In multimodal AQA, quality evidence from different modalities is heterogeneous, and quality cues evolve progressively over time. Existing methods often rely on coarse fusion or unified temporal modeling, which may blur modality-specific cues, preserve cross-modal redundancy, and weaken stage-specific quality evidence. To address these issues, we propose a progressive implicit decoupling and fusion network (PIDNet) that progressively integrates modality-specific information, cross-modal complementary cues, and global quality semantics for accurate assessment. Specifically, we design an iMambaWave module that maps RGB, optical flow, and audio features into a shared latent space and disentangles them with a Bi-Mamba branch and a wavelet-transform branch to capture long-range temporal dependencies and local perturbation details, respectively. A gated aggregation mechanism adaptively fuses temporal and frequency-domain information. We further build a three-stage progressive fusion network using Group3M blocks, where modality complementary attention retrieves cross-modal evidence while suppressing redundancy, and multi-scale convolutions enrich feature representations. Experiments on the Rhythmic Gymnastics and Fis-V datasets show that PIDNet achieves highly competitive score correlation with favorable error control compared with existing unimodal and multimodal methods. Ablation studies verify the effectiveness of each component. Moreover, iMambaWave consistently improves visual representation and temporal modeling across multiple backbones, showing good generalization and plug-and-play capability.

2605.08942 2026-05-12 cs.CL

Decomposing and Steering Functional Metacognition in Large Language Models

Yanshi Li, Xueru Bai, Shuman Liu, Haibo Zhang, Anxiang Zeng

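AI summary This paper studies the functional metacognitive states that large language models (LLMs) exhibit during reasoning: decomposable internal variables related to evaluation awareness, self-assessed capability, perceived risk, and similar factors. Through residual stream analysis, the authors show that these states can be linearly decoded from model activations and have distinct layer-wise profiles. Activation steering experiments further show that these states affect reasoning behavior in dissociable ways, including output length, accuracy, and safety. The work provides a mechanistic framework for understanding and controlling internal model states, with implications for model evaluation and deployment.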

Comments 18 pages, 7 figures

Abstract

Large language models (LLMs) increasingly exhibit behaviors suggesting awareness of their evaluation context, often adapting their reasoning strategies in benchmark settings. Prior work has shown that such evaluation awareness can distort performance measurements; however, it remains unclear whether this phenomenon reflects a single behavioral artifact or a deeper internal structure within the model. We propose that LLMs maintain a decomposable space of functional metacognitive states: internal variables encoding factors such as evaluation awareness, self-assessed capability, perceived risk, computational effort allocation, audience expertise adaptation, and intentionality. Through residual stream analysis across multiple reasoning models, we demonstrate that these states are linearly decodable from internal activations and exhibit distinct layer-wise profiles. Moreover, by steering model activations along probe-derived directions, we show that each functional metacognitive state causally modulates reasoning behavior in dissociable ways, affecting verbosity, accuracy, and safety-related responses across tasks. Our findings suggest that benchmark performance reflects not only task competence but also the activation of specific functional metacognitive states. We argue that understanding and controlling these internal states is essential for reliable evaluation and deployment of reasoning models, and we provide a mechanistic framework for studying functional metacognition in artificial systems. Our code and data are publicly available at https://github.com/xlands/meta-cognition.
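
Steering along a probe-derived direction, as described in the abstract, is commonly done by adding a scaled unit vector to a hidden activation. The sketch below shows only that arithmetic on plain Python lists. The function names, the strength `alpha`, and the choice to normalize the direction are illustrative assumptions, not details taken from the paper or its repository.

```python
import math


def unit(direction):
    """Return the direction scaled to unit Euclidean length."""
    norm = math.sqrt(sum(d * d for d in direction))
    return [d / norm for d in direction]


def projection(hidden, direction):
    """Scalar projection of a hidden vector onto the (normalized) direction."""
    return sum(h * u for h, u in zip(hidden, unit(direction)))


def steer(hidden, direction, alpha):
    """Shift the hidden vector by alpha along the normalized probe direction."""
    return [h + alpha * u for h, u in zip(hidden, unit(direction))]
```

Under these assumptions, steering with strength `alpha` moves the activation's projection onto the probe direction by exactly `alpha` and leaves the orthogonal components unchanged.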

2605.08941 2026-05-12 cs.AI

MDGYM: Benchmarking AI Agents on Molecular Simulations

Vinay Kumar, Satyendra Rajput, Mausam, N. M. Anoop Krishnan

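AI summary This paper introduces MDGYM, a benchmark for evaluating AI agents on molecular dynamics simulation tasks. It contains 169 expert-curated simulation tasks covering two mainstream packages, LAMMPS and GROMACS, across three difficulty levels. The study evaluates three agentic frameworks with four large language models and finds that all perform poorly: even the strongest completes only about 21% of easy tasks. The analysis shows that agents invoking simulation tools often produce physically unstable configurations or fabricate numerical results, failure modes that differ from those in general software engineering tasks and that highlight the importance of physical reasoning for AI in science.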

Abstract

The promise of AI-driven scientific discovery hinges on whether AI agents can autonomously design and execute the computational workflows that underpin modern science. Molecular dynamics (MD) simulation presents a natural test bed to stress-test this claim; it requires translating physical intuition into syntactically and semantically correct input scripts, reasoning about initial and boundary conditions, diagnosing numerically unstable trajectories, and interpreting outputs against known physical behavior and laws. We introduce MDGYM, a benchmark of 169 expert-curated MD simulations spanning LAMMPS and GROMACS, two widely used MD packages, across three increasing difficulty levels. We evaluate three agentic frameworks -- Claude Code, Codex, and OpenHands -- with four LLMs, and find that all perform poorly: even the strongest agent solves only 21% of easy-level tasks, with less than 10% at higher difficulties. Trajectory analysis reveals a characteristic pattern of failure -- agents successfully invoke simulation machinery but produce physically unstable configurations, fabricate numerical outputs without executing the underlying computation, or abandon tasks prematurely rather than iterating through simulation-specific errors. These failure modes are qualitatively distinct from those observed in general software engineering benchmarks, indicating that fluent code generation does not transfer to grounded physical reasoning.

2605.08937 2026-05-12 cs.RO

Raymoval: Raycasting-based Dynamic Object Removal for Static 3D Mapping

Daebeom Kim, Seungjae Lee, Seoyeon Jang, Kevin Christiansen Marsim, Hyun Myung

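AI summary This paper proposes Raymoval, a raycasting-based dynamic object removal method for improving the accuracy and consistency of static 3D maps. It projects laser scan data onto an azimuth-elevation grid and uses raycasting to compute first-hit distances, which are used to separate dynamic from static points. Experiments show that the method reduces residual traces of dynamic objects and improves static map quality.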

Comments 12 pages, 5 figures, 3 tables, Presented at RiTA 2025

Abstract

Static mapping is fundamental to robot navigation, providing a persistent geometric prior and a consistent reference for long-term autonomy. However, dynamic objects leave residual traces and cause surface loss, which reduces map consistency. We propose a raycasting-based module for dynamic object removal in static 3D mapping. Each scan is projected onto an azimuth-elevation grid, and for every viewing direction we compare the bin-wise minimum range with the map's first-hit distance computed by raycasting. Furthermore, we apply a raycast consistency test that separates dynamic from static points. Finally, a spatial consistency validation step refines labels, producing static maps with lower residual dynamics and reduced over-removal. We evaluate our approach quantitatively and qualitatively on SemanticKITTI and a challenging custom dataset, and show consistent static mapping results.
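
The projection and range comparison described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the bin counts, the elevation limits, the `margin` threshold, and the idea of passing the map's first-hit distances as a precomputed dictionary are all assumptions, and the paper's actual consistency test may differ.

```python
import math


def bin_min_ranges(points, az_bins=360, el_bins=32, el_min=-0.5, el_max=0.5):
    """Project (x, y, z) points onto an azimuth-elevation grid.

    Returns a dict mapping (azimuth_bin, elevation_bin) to the minimum
    range observed in that viewing direction. Angles are in radians.
    """
    grid = {}
    for x, y, z in points:
        r = math.sqrt(x * x + y * y + z * z)
        if r == 0.0:
            continue  # a point at the sensor origin has no direction
        az = math.atan2(y, x)   # in [-pi, pi]
        el = math.asin(z / r)   # in [-pi/2, pi/2]
        if not (el_min <= el < el_max):
            continue  # outside the assumed vertical field of view
        i = int((az + math.pi) / (2.0 * math.pi) * az_bins) % az_bins
        j = int((el - el_min) / (el_max - el_min) * el_bins)
        key = (i, j)
        if key not in grid or r < grid[key]:
            grid[key] = r
    return grid


def flag_seen_through(scan_min, map_first_hit, margin=0.5):
    """Return bins where the scan sees farther than the map's first hit.

    map_first_hit is assumed to hold, per bin, the distance at which a ray
    cast from the sensor first hits the map. If the scan's nearest return
    lies beyond that distance by more than margin, the beam passed through
    the map surface, so the map point there is a dynamic candidate.
    """
    return {k for k, r in scan_min.items()
            if k in map_first_hit and r - map_first_hit[k] > margin}
```

Taking the minimum range per bin is the conservative choice here: a bin is flagged only when even the closest return in that direction lies behind the map surface.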

2605.08936 2026-05-12 cs.AI cs.LG

Self-ReSET: Learning to Self-Recover from Unsafe Reasoning Trajectories

Dongcheng Zhang, Yi Zhang, Yuxin Chen, An Zhang, Xiang Wang, Chaochao Lu

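AI summary This paper proposes Self-ReSET, a pure reinforcement learning framework that enables large reasoning models to recover from their own unsafe reasoning trajectories. Unlike conventional methods that rely on static training data, Self-ReSET uses the model's own erroneous trajectories as initial states for reinforcement learning, strengthening its ability to recover dynamically. Experiments show marked gains in robustness against adversarial attacks, especially out-of-distribution jailbreak prompts, while preserving general utility.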

Abstract

Large Reasoning Models (LRMs) possess remarkable capabilities for self-correction in general domains; however, they frequently struggle to recover from unsafe reasoning trajectories under adversarial attacks. Existing alignment methods attempt to mitigate this vulnerability by fine-tuning the model on expert data, including reflection traces or adversarial prefixes. Crucially, these approaches are often hindered by static training data, which inevitably deviates from the model's dynamic, on-policy reasoning traces, so the model hardly covers its vast generation space or learns to recover from its own failures. To bridge this gap, we propose Self-ReSET, a pure reinforcement learning framework designed to equip LRMs with the intrinsic capacity to recover from their own safety error trajectories, which are subsequently reused as an initial state for reinforcement learning. Extensive experiments across various LRMs and benchmarks demonstrate that Self-ReSET significantly enhances robustness against adversarial attacks, especially out-of-distribution (OOD) jailbreak prompts, while maintaining general utility and using data efficiently. Further analysis reveals that our method effectively fosters self-recovery patterns, enabling models to better identify and recover from unsafe intermediate error states back to benign paths. Our code and data are available at https://github.com/Ing1024/Self-ReSET.

2605.08934 2026-05-12 cs.LG

From Mechanistic to Compositional Interpretability

Ward Gauderis, Thomas Dooms, Steven T. Holmer, Kola Ayonrinde, Geraint A. Wiggins

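AI summary This paper proposes compositional interpretability, a formal framework that addresses the lack of objective verification and composability in mechanistic interpretability methods. Grounded in category theory, it requires syntactic and semantic mappings to agree so that a model's decomposition is consistent with its behavior, decomposes explanation quality into faithfulness and complexity, and casts interpretability as a constrained optimization problem. The study also introduces compressive refinement, which restructures a model into simpler parts without changing its function, and proves conditions under which syntactic compression yields more concise, human-aligned explanations.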

Abstract

Mechanistic interpretability aims to explain neural model behaviour by reverse-engineering learned computational structure into human-understandable components. Without a formal framework, however, mechanistic explanations cannot be objectively verified, compared, or composed. We introduce compositional interpretability, a category-theoretic framework grounded in the principles of compositionality and minimum description length. Compositional interpretations are pairs of syntactic and semantic mappings that must commute to enforce consistency between a model's decomposition and its observed behaviour. We deconstruct explanation quality into measures of faithfulness and complexity to cast interpretability as a constrained optimisation problem, and introduce compressive refinement to systematically restructure models into simpler parts without altering their function. Finally, we prove a parsimony criterion under which syntactic compression theoretically guarantees more concise, human-aligned explanations. Our framework situates prominent mechanistic methods as subclasses of refinement, and clarifies why their compressibility heuristics tend to align with human interpretability. Our work provides a measurable, optimisable foundation for automating the discovery and evaluation of mechanistic explanations.

2605.08933 2026-05-12 cs.LG

When and Why Grouping Attention Heads Accelerates Muon Optimization

Hongtao Zhang, Wenjie Zhou, Wei Chen, Xueqi Cheng

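AI summary This paper studies whether, in multi-head attention, the Muon optimizer should be applied to the full attention projection, to individual heads, or to intermediate groups of heads. Comparing full-matrix and group-wise Muon, it finds that grouping improves whitening but incurs an additional update-norm cost. Based on this trade-off, the authors propose Group Muon, which treats head group size and grouping rule as hyperparameters, and which achieves lower validation loss in experiments than both fully head-wise and full-matrix Muon.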

Comments 16 pages, 4 figures

Abstract

Muon orthogonalizes matrix updates, but multi-head attention naturally operates at the level of heads. This granularity mismatch raises the question of whether Muon should be applied to the full attention projection, to individual heads, or to intermediate head groups. We study this question through a one-step descent comparison between full-matrix Muon and group-wise Muon. Our analysis reveals a trade-off between the group-wise whitening gain from group-wise updates and the grouping-induced norm cost, an additional update-norm cost caused by replacing full-matrix whitening with group-wise whitening. Motivated by this trade-off, we propose Group Muon, which treats head group size and grouping rule as optimizer hyperparameters. On GPT-2 Small trained on FineWeb, appropriate grouping improves validation loss over both full-QKV Muon and fully head-wise MuonSplit.
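
The difference between full-matrix and group-wise orthogonalization can be made concrete with a small pure-Python sketch. It is an illustration under stated assumptions, not the paper's optimizer: it uses the plain cubic Newton-Schulz iteration rather than the tuned higher-order polynomial that Muon implementations typically use, it assumes each head group owns a contiguous block of rows of the update matrix, and every function name is hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def orthogonalize(g, steps=30):
    """Approximate the orthogonal polar factor of g (rows <= columns).

    Scales g by its Frobenius norm so all singular values are at most 1,
    then iterates X <- 1.5 X - 0.5 (X X^T) X, which pushes every nonzero
    singular value toward 1.
    """
    norm = math.sqrt(sum(v * v for row in g for v in row))
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        cubic = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(r1, r2)]
             for r1, r2 in zip(x, cubic)]
    return x


def groupwise_orthogonalize(g, rows_per_group, steps=30):
    """Orthogonalize each contiguous block of rows independently."""
    out = []
    for start in range(0, len(g), rows_per_group):
        out.extend(orthogonalize(g[start:start + rows_per_group], steps))
    return out
```

In this sketch the full-matrix version makes all rows mutually orthonormal, while the group-wise version only enforces orthonormality inside each group, so rows from different groups may remain correlated. How that difference maps onto the whitening gain and norm cost analyzed in the abstract is not something the sketch establishes.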

2605.08930 2026-05-12 cs.AI

Internalizing Safety Understanding in Large Reasoning Models via Verification

Yi Zhang, Yuxin Chen, Leheng Sheng, Dongcheng Zhang, Chaochao Lu, Xiang Wang, An Zhang

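AI summary Although explicit Chain-of-Thought (CoT) strengthens the reasoning of large reasoning models (LRMs), it can lead to riskier answers. Existing alignment methods rely mainly on externally enforced compliance, optimizing models to detect malicious prompts rather than to evaluate the safety of their own outputs. The authors propose SInternal, a framework that trains models exclusively on safety verification tasks so that they critique their own generated answers using expert reasoning trajectories and thereby internalize safety specifications. This markedly improves robustness to jailbreak attacks and, when combined with reinforcement learning, gives a better initialization than standard supervised fine-tuning.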

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

Abstract

While explicit Chain-of-Thought (CoT) empowers large reasoning models (LRMs), it also enables the generation of riskier final answers. Current alignment paradigms primarily rely on externally enforced compliance, optimizing models to detect malicious prompts rather than evaluating the safety of their own outputs. We argue that this approach remains largely behavioral: our empirical analysis reveals that ostensibly aligned models lack intrinsic safety understanding, often failing to verify their own response safety and remaining vulnerable to adversarial jailbreaks. To address this fundamental limitation, we propose Safety Internal (SInternal), a framework that internalizes safety specifications by training LRMs exclusively on safety verification tasks to critique their own generated answers using expert reasoning trajectories. We demonstrate that learning to verify induces strong generalization in response safety, significantly enhancing robustness against out-of-domain jailbreaks. Furthermore, when combined with reinforcement learning, SInternal serves as a superior initialization compared to standard supervised fine-tuning, suggesting that internalizing safety understanding creates a more robust foundation for alignment than merely mimicking safe behaviors. Our code is available at https://github.com/AlphaLab-USTC/SInternal.

2605.08915 2026-05-12 cs.LG

Physics-Informed Neural PDE Solvers via Spatio-Temporal MeanFlow

Hanru Bai, Yuncheng Zhou, Difan Zou

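AI summary This paper proposes Spatio-Temporal MeanFlow, a physics-informed neural PDE solver intended to overcome the difficulty existing deep learning methods have in capturing the continuous integral nature of physical systems. It builds on MeanFlow, a framework for solving generative ODEs, and extends it to the spatio-temporal domain; replacing the generative velocity field with the physical PDE operator lets it efficiently learn the finite-interval evolution of physical states. Experiments show higher accuracy and inference efficiency than existing methods on time-dependent and stationary PDEs, and good generalization to out-of-distribution initial conditions and varying spatial resolutions.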

Abstract

Deep learning paradigms, such as PINNs and neural operators, have significantly advanced the solving of PDEs. However, they often struggle to capture the continuous integral nature of physical systems, relying either on pointwise residuals that ignore the integral perspective or on pre-discretized temporal grids. Drawing inspiration from MeanFlow, a continuous-time integrator recently developed to efficiently solve generative ODEs, we introduce Spatio-Temporal MeanFlow, which functions as a novel PDE solver learning the finite-interval evolution of physical states. By substituting the generative velocity field with the physical PDE operator, we transform multi-step numerical integration into an efficient prediction with a freely controllable integration length. Crucially, we extend the original MeanFlow constraint from the temporal to the spatio-temporal domain, coupling time evolution with spatial consistency. This yields a unified framework naturally accommodating both time-dependent and stationary PDEs. Comprehensive experiments on benchmarks demonstrate that our approach achieves superior accuracy and inference efficiency over representative baselines. Furthermore, the proposed integral constraint enables excellent generalization to out-of-distribution initial conditions and varying spatial resolutions.

2605.08914 2026-05-12 cs.LG cs.AI

Transformer autoencoder with local attention for sparse and irregular time series with application on risk estimation

Panteleimon Rodis

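AI summary This paper proposes a risk estimation framework designed for sparse and irregular time series. Its core is a Transformer autoencoder with local attention, which captures key patterns in sparse data. The method is applied to a real-world case, estimating the risk of non-technical losses in the electrical power system of a region of Greece, where it achieves higher recall and precision than existing methods and offers a robust, effective tool for risk detection in irregular time series.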

Comments Under Review

Abstract

This paper introduces a framework specifically designed for risk estimation on sparse and irregular time series. It is based on a Transformer Autoencoder with local attention, which leverages the powerful pattern identification capabilities of transformers complemented by traditional data cleaning and normalization methods. It efficiently captures relevant patterns within irregular sequences suffering from sparse data collection, benefiting from the discriminative ability of the local attention mechanism. The proposed framework is applied to a real-world case study on the risk estimation of non-technical losses in electrical power systems across a wide area of Greece. Non-technical losses in electrical power systems, primarily stemming from electricity theft, pose significant economic and operational challenges. Detecting these anomalies is particularly challenging due to the inherently sparse and irregular nature of real-world data collection practices. Traditional risk estimation methods struggle to capture long-range dependencies effectively and to handle such data characteristics robustly. We demonstrate that our approach yields highly discriminative latent features, which results in more consistent risk estimation compared with existing state-of-the-art and widely used methods. It achieves high recall and precision, meeting the critical objectives of the problem. As such, our solution offers a robust and effective tool for risk detection in irregular time series datasets.

2605.08911 2026-05-12 cs.CV

Unified Modeling of Lane and Lane Topology for Driving Scene Reasoning

Han Li, Yulu Gao, Si Liu, Yuhang Wang, Bo Liu, Beipeng Mu

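AI summary Autonomous vehicles need to perceive not only physical elements of the driving scene, such as lane lines and traffic lights, but also logical information such as lane centerlines and their topology. This paper proposes UniTopo, a method that models lanes and lane topology jointly by representing the topological relationships between lanes as connected lanes. This yields lane positions and topology within a single perception pipeline and establishes a new paradigm for perceiving lane topology directly from raw image features. Experiments show that the method significantly outperforms existing state-of-the-art methods on the OpenLane-V2 benchmark.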

Comments Accepted by IEEE TCSVT

Abstract

Autonomous vehicles need to perceive not only physical elements in the driving scene, such as lane lines and traffic lights, but also logical elements like lane centerlines and their topology. Existing lane topology reasoning methods typically follow a reasoning-by-detection paradigm, where lane topological relationships are primarily derived from lane detection results. In this paper, we propose an innovative method called Unified Modeling of Lane and Lane Topology (UniTopo), which represents the topological relationships between lanes as connected lanes, encompassing predecessor lanes, successor lanes, and their interconnections. This unified representation of lanes and lane topology allows us to simultaneously obtain both the positions and topological information of lanes within a shared perception pipeline, establishing a new paradigm for directly perceiving lane topology from original image features. We validate our method on the driving scene reasoning benchmark OpenLane-V2, which consists of two subsets, built based on Argoverse2 and nuScenes, respectively. Our method achieves TOP_ll of 30.1% and 31.8% on the two subsets, significantly surpassing the existing state-of-the-art method T^2SG by 6.0% and 8.6%.

2605.08905 2026-05-12 cs.AI

Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs

Xiaozhe Li, Xinyu Fang, Shengyuan Ding, Yang Li, Linyang Li, Haodong Duan, Qingwen Liu, Kai Chen

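AI summary This paper proposes Forge, a framework based on quality-aware reinforcement learning that addresses the limited optimization ability of large language models on NP-hard optimization problems. The study introduces OPT-BENCH, a complete training and evaluation suite with instance generators, quality verifiers, and optimal baselines, and uses a quality-aware reward mechanism to improve both feasibility and solution quality. Experiments show that the method significantly outperforms existing models on multiple tasks and transfers well to other tasks.

English abstract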

Large Language Models (LLMs) have achieved remarkable success on reasoning benchmarks through Reinforcement Learning with Verifiable Rewards (RLVR), excelling at tasks such as math, coding, logic, and puzzles. However, existing benchmarks evaluate only correctness, while overlooking optimality, namely the ability to find the best solutions under constraints. We propose OPT-BENCH, the first comprehensive framework for training and evaluating LLMs on NP-hard optimization problems through quality-aware RLVR. OPT-BENCH provides three key components: a scalable training infrastructure with instance generators, quality verifiers, and optimal baselines across 10 tasks; a rigorous benchmark with 1,000 instances evaluating both feasibility, measured by Success Rate, and quality, measured by Quality Ratio; and quality-aware rewards that enable continuous improvement beyond binary correctness. Training on Qwen2.5-7B-Instruct-1M with 15K examples achieves 93.1% SR and 46.6% QR, significantly outperforming GPT-4o, which achieves 29.6% SR and 14.6% QR. Beyond optimization, training on OPT-BENCH transfers to diverse tasks, including mathematics (+2.2%), logic (+1.2%), knowledge (+4.1%), and instruction following (+6.1%). Our analysis reveals that quality-aware rewards improve solutions by 28.8% over binary rewards, and that task diversity drives generalization more than data quantity, offering insights into RLVR scaling for complex reasoning.
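
The contrast between a binary reward and a quality-aware reward can be sketched in a few lines of Python. This is only an illustration: the function names, the reward shapes, and the definition of the quality ratio as optimal cost divided by achieved cost (for a minimization task) are assumptions, not the formulas used in the paper.

```python
def binary_reward(feasible: bool) -> float:
    """Reward 1.0 for any feasible solution, 0.0 otherwise."""
    return 1.0 if feasible else 0.0


def quality_ratio(cost: float, optimal_cost: float) -> float:
    """Hypothetical quality ratio for a minimization task: 1.0 means optimal."""
    if cost <= 0 or optimal_cost <= 0:
        raise ValueError("costs must be positive in this sketch")
    return min(1.0, optimal_cost / cost)


def quality_aware_reward(feasible: bool, cost: float, optimal_cost: float) -> float:
    """Infeasible solutions get 0; feasible ones are graded by closeness to optimal."""
    if not feasible:
        return 0.0
    return quality_ratio(cost, optimal_cost)


# Two feasible tours for a toy routing instance whose optimal length is 100.
rewards = [quality_aware_reward(True, c, 100.0) for c in (100.0, 125.0)]
```

Under a binary reward both tours would score the same, so there is no signal to prefer the shorter one; the graded reward separates them.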

2605.08904 2026-05-12 cs.AI

OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces

Xiaozhe Li, Jixuan Chen, Xinyu Fang, Shengyuan Ding, Haodong Duan, Qingwen Liu, Kai Chen

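AI summary This paper proposes OPT-BENCH for evaluating the iterative self-optimization ability of large language model agents in large-scale search spaces. By combining 20 machine learning tasks with 10 classic NP-hard problems, the study builds a rigorous test environment to examine whether models can adapt through intrinsic self-reflection rather than mere tool application. The authors also propose the OPT-Agent framework, which emulates human cognitive adaptation by iteratively refining solutions through a loop of perception, memory, and reasoning. Experiments show that stronger models are better at using feedback signals to improve themselves, but their adaptability remains limited by the models' base capacity and has not yet reached the level of human experts.

English abstract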

Large Language Models (LLMs) have demonstrated remarkable capabilities in reasoning and tool use. However, the fundamental cognitive faculties essential for problem solving, including perception, reasoning, and memory, remain the stable core of intelligence. Unlike memorizing specific patterns, humans succeed in novel environments by applying these intrinsic faculties to adapt and optimize. Yet, whether LLMs possess this essential capacity, namely the ability to continuously refine solutions in response to dynamic environmental feedback, remains underexplored. To address this challenge, we introduce OPT-BENCH, a benchmark for evaluating self-improvement capabilities in large-scale search spaces. By combining 20 machine learning tasks with 10 classic NP-hard problems, OPT-BENCH provides a rigorous setting to assess whether agents can adapt through intrinsic self-reflection rather than rote tool application. We further propose OPT-Agent, a framework that emulates human-like cognitive adaptation. It operates through a general perception, memory, and reasoning loop, iteratively refining solutions based on environmental feedback. Through extensive experiments on 19 LLMs from 7 model families, including reasoning models, general models, and open-source models ranging from 3B to 235B parameters, we demonstrate that stronger models are more effective at leveraging feedback signals for self-improvement. However, this upper-bound adaptability remains fundamentally constrained by the models' base capacity, and even the most advanced LLMs still fall short of human expert performance.

2605.08902 2026-05-12 cs.CV cs.AI

DAPE: Dynamic Non-uniform Alignment and Progressive Detail Enhancement Techniques for Improving the Performance of Efficient Visual Language Models

Mengyuan Tian, Qiyan Zhao, Yanan Wang, Da-Han Wang

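AI summary This paper proposes DAPE, a new framework for improving the performance of efficient visual language models. Using dynamic non-uniform alignment and progressive detail enhancement, the method addresses the uneven distribution of information density between text and images and achieves more precise cross-modal interaction. Experiments show that the method significantly improves downstream task accuracy on multiple benchmarks while reducing computational overhead.

Comments Accepted at ICIC 2026 (Oral)

English abstract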

In recent years, pre-trained visual-linguistic models have demonstrated tremendous potential, becoming a crucial foundational framework for numerous downstream tasks. However, the information density between text and images is not uniformly distributed. Existing methods often overlook the inherent and dynamic differences in information density and semantic scope between text tags and image blocks. These common uniform alignment strategies result in coarse-grained cross-modal interactions and loss of fine semantic details. Moreover, pursuing finer alignment typically requires substantial computational overhead, limiting practical model deployment. To address this challenge, this paper proposes a novel framework for dynamic cross-modal alignment with continuous detail introduction. First, we design a dynamically adaptive cross-modal matching mechanism that uses a learnable matching function to dynamically assign varying numbers and sizes of image tags to text tags of the same size but different information density, enabling more precise attention interaction. Second, we develop a continuous detail introduction module to progressively incorporate high-resolution visual feature enhancement into the alignment process. Extensive experiments across multiple benchmarks demonstrate significant improvements in the accuracy of various downstream tasks while reducing computational overhead.

2605.08898 2026-05-12 cs.CL cs.AI

LLM-Agnostic Semantic Representation Attack

Jiawei Lian, Jianhong Pan, Lefan Wang, Yi Wang, Tairan Huang, Shaohui Mei, Lap-Pui Chau

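AI summary As large language models (LLMs) increasingly adopt alignment techniques to prevent harmful outputs, attackers can still bypass these safeguards by crafting adversarial prompts. To address the shortcomings of existing optimization methods based on exact text templates in convergence, prompt naturalness, and cross-model generalization, this paper proposes the LLM-agnostic Semantic Representation Attack (SRA), which shifts the adversarial objective from exact text to malicious semantic representations, improving the attack's generality and stealth. Experiments show that the method achieves an average attack success rate of 99.71% across 26 open-source LLMs, with strong cross-model transferability and stealth.

Comments arXiv admin note: substantial text overlap with arXiv:2509.19360

English abstract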

Large Language Models (LLMs) increasingly employ alignment techniques to prevent harmful outputs. Despite these safeguards, attackers can circumvent them by crafting adversarial prompts. Predominant token-level optimization methods primarily rely on optimizing for exact affirmative templates (e.g., "Sure, here is..."). However, these paradigms frequently encounter bottlenecks such as suboptimal convergence, compromised prompt naturalness, and poor cross-model generalization. To address these limitations, we propose Semantic Representation Attack (SRA), a novel LLM-agnostic paradigm that fundamentally reconceptualizes adversarial objectives from exact textual targeting to malicious semantic representations. Theoretically, we establish the semantic Coherence-Convergence Relationship and derive a Cross-Model Semantic Generalization bound, proving that maintaining semantic coherence guarantees both white-box semantic convergence and black-box transferability. Technically, we operationalize this framework via the Semantic Representation Heuristic Search (SRHS) algorithm, which preserves interpretability and structural coherence of the adversarial prompts during incremental discrete token chunk expansion. Extensive evaluations demonstrate that our framework achieves a 99.71% average attack success rate across 26 open-source LLMs, with strong transferability and stealth.

2605.08897 2026-05-12 cs.LG cs.AI

Shapley Regression for Rare Disease Diagnosis Support: a case study on APDS

Safa Alsaidi, Tomás Brogueira, Nizar Mahlaoui, Marc Vincent, Guilherme Pelegrina, Nicolas Garcelon, Adrien Coulet, Miguel Couceiro

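AI summary This paper studies how data-driven methods can support early diagnosis of APDS, a rare genetic immune disorder, and, given its complex symptoms and difficult diagnosis, proposes Shapley regression, a new regression model grounded in game theory. By replacing the traditional linear predictor with a k-additive cooperative game, the method captures complex interactions between symptoms while keeping the interpretability and convexity of logistic regression. Experiments show good predictive performance and robustness on multiple biomedical datasets and a real-world patient cohort, and the method helps identify APDS-associated symptom combinations and clinically validated interactions.

Comments 21 pages, 4 figures. Accepted to the AI and Health special track at IJCAI 2026; the first two named authors had equal contribution

English abstract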

Activated PI3Kδ Syndrome (APDS) is a rare genetic immune disorder caused by variants in PIK3CD or PIK3R1, with highly heterogeneous symptoms that often delay diagnosis. Early recognition is hampered by overlapping clinical presentations and limited clinician awareness, motivating systematic, data-driven approaches to detect APDS-associated phenotypic patterns in routine electronic health records. Traditional linear scoring systems cannot capture complex symptom interactions, while deep learning models, though expressive, often lack interpretability. To bridge this gap, we propose Shapley regression, a novel game-theoretic model replacing the linear predictor with a k-additive cooperative game, explicitly modeling co-occurrence of symptoms while maintaining the transparency and convexity of logistic regression. We carry out an empirical study of our lightweight method on eight public biomedical datasets, showing that a 2-additive model with $\ell_2$ regularization achieves an optimal trade-off between predictive power and noise robustness. We also apply it to a real-world cohort of 222 patients, on which Shapley regression accurately distinguished APDS cases from matched controls, confirming and validating phenotypes known to be associated with APDS, and facilitating the exploration of pairwise interactions between symptoms, validated by clinical experts.
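
To make the idea of a 2-additive predictor concrete, the following Python sketch scores a patient by summing one coefficient per symptom present and one per co-occurring pair, then passes the score through a logistic link. It assumes binary symptom indicators and a parameterization by singleton and pairwise coefficients; the symptom count, weights, and function names are hypothetical, and the paper's actual model and fitting procedure may differ.

```python
import math
from itertools import combinations


def two_additive_features(symptoms):
    """Expand binary symptom indicators into singleton and pairwise co-occurrence terms."""
    n = len(symptoms)
    singles = list(symptoms)
    pairs = [symptoms[i] * symptoms[j] for i, j in combinations(range(n), 2)]
    return singles + pairs


def predict_proba(symptoms, weights, bias=0.0):
    """Logistic model whose score is linear in the expanded features."""
    feats = two_additive_features(symptoms)
    if len(feats) != len(weights):
        raise ValueError("expected one weight per singleton and per pair")
    score = bias + sum(w * f for w, f in zip(weights, feats))
    return 1.0 / (1.0 + math.exp(-score))


# Three hypothetical symptoms -> 3 singleton weights + 3 pair weights (0,1), (0,2), (1,2).
weights = [0.5, 0.2, -0.1, 1.5, 0.0, 0.0]
p_both = predict_proba([1, 1, 0], weights)   # symptoms 0 and 1 co-occur
p_first = predict_proba([1, 0, 0], weights)  # symptom 0 alone
```

Because the score stays linear in the coefficients, the usual logistic loss remains convex in them, which is consistent with the convexity the abstract mentions.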

2605.08896 2026-05-12 cs.CL cs.AI cs.LG

FragileFlow: Spectral Control of Correct-but-Fragile Predictions for Foundation Model Robustness

Zhuoyun Li, Boxuan Wang, Jinwei Hu, Xiaowei Huang, Yi Dong

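AI summary This paper studies the robustness of large language models and vision-language models under perturbations, pointing out that conventional metrics such as average accuracy can hide a structured failure mode in which predictions are fragile near the decision boundary. The authors propose FragileFlow, a regularization method based on margin-aware error flow that builds a vulnerable-risk matrix to identify predictions that appear correct but are in fact fragile, and they provide the first PAC-Bayes upper bound analysis on the theoretical side. Experiments show that FragileFlow effectively improves model robustness on multiple benchmark tasks while preserving accuracy on clean data.

English abstract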

Robust adaptation of LLMs and VLMs is often evaluated by average accuracy or average consistency under perturbations. However, these averages can hide a structured failure mode: a prediction may remain correct while probability mass already flows from particular true classes toward systematic wrong competitors near the decision boundary. In this paper, we formalize this phenomenon as margin-aware error flow and introduce FragileFlow, a plug-in regularizer that uses a calibrated margin buffer to identify correct-but-fragile predictions and organize their off-class probability mass into a class-wise vulnerable-risk matrix. Theoretically, we provide the first PAC-Bayes upper bound for this margin-aware error-flow object, showing how empirical spectral control yields a conservative route to deterministic worst-class robustness under a stability condition. Experiments on multiple-choice LLM benchmarks and few-shot CLIP adaptation show that FragileFlow consistently improves the proposed theory-facing risk measures over matched baselines, yields perturbed worst-class accuracy gains in most settings, and preserves clean accuracy across comparisons.

2605.08891 2026-05-12 cs.LG

Bilinear autoencoders find interpretable manifolds

Thomas Dooms, Ward Gauderis, Geraint Wiggins, Jose Oramas

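AI summary This paper proposes bilinear autoencoders for discovering interpretable manifold structure in neural networks. Unlike conventional linear autoencoders, the method uses quadratic latents to capture multi-dimensional geometric structure and can represent complex compositions of concepts more effectively. Experiments show that the approach significantly improves reconstruction performance in language models, and the manifolds it finds can be explored through an interactive visualization tool, offering a new route to mathematically interpretable nonlinear latent representations.

English abstract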

Sparse autoencoders have become a standard tool for uncovering interpretable latent representations in neural networks. Yet salient concepts often span manifolds that current linear methods cannot capture without post hoc analysis. This paper uses quadratic latents to close this gap: we implement these with bilinear autoencoders, which decompose activations into low-rank quadratic forms, compose linearly in weight space, and admit input-independent geometric analysis. This qualitative difference in what concepts quadratic latents can detect challenges the standard linear representation hypothesis. Our experiments and visualisations show that multi-dimensional geometries are highly prevalent and that composite latents capture them well, systematically improving reconstruction error in language models. Furthermore, we show that autoencoders with varying geometric priors recover the same input subspace despite their dictionary entries being distinct. Practically, these models serve as an unsupervised tool for manifold discovery, which we demonstrate through an interactive online visualizer for Qwen 3.5. This is a step toward nonlinear but mathematically tractable latent representations whose composition is expressive and interpretable by design.
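
A quadratic latent of the kind described can be illustrated with a small Python sketch. It assumes each latent is the product of two linear projections of the activation, which makes it a rank-1 quadratic form; the vectors and function names are made up for illustration, and the paper's actual architecture may differ.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def bilinear_latent(x, left, right):
    """Latent activation (left . x) * (right . x), a rank-1 quadratic form in x."""
    return dot(left, x) * dot(right, x)


def quadratic_form(x, matrix):
    """Evaluate x^T M x for a dense matrix given as a list of rows."""
    return sum(x[i] * matrix[i][j] * x[j] for i in range(len(x)) for j in range(len(x)))


x = [1.0, 2.0, -1.0]
left = [0.5, 0.0, 1.0]
right = [1.0, 1.0, 0.0]
outer = [[l * r for r in right] for l in left]  # M = left right^T
z = bilinear_latent(x, left, right)
```

The latent equals `quadratic_form(x, outer)`, so it can be analyzed through the matrix alone without reference to any particular input. It is also unchanged when `x` is negated, something a single linear latent cannot express.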

2605.08889 2026-05-12 cs.LG cs.CL cs.DL

Machine Learning Research Has Outpaced Its Communication Norms and NeurIPS Should Act

Ajay Mandyam Rangarajan, Jeyashree Krishnan

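AI summary The study argues that the rapid growth of machine learning research has far outpaced the evolution of its communication norms, and calls on the NeurIPS conference to adopt more explicit writing standards. Analyzing a large corpus of papers, it finds that the readability of NeurIPS abstracts has declined, acronym use has increased while acronym reuse is low, and readability is positively correlated with citation counts. The study recommends that NeurIPS pilot seven improvements in 2027 to make papers more readable and easier to disseminate.

Comments 9 pages, 11 figures, 7 tables

English abstract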

Machine learning research has grown exponentially while its communication norms have not. We argue NeurIPS should adopt explicit, measurable writing standards. We analyze 2.8 million arXiv papers (1991-2025), 24,772 NeurIPS papers (1987-2024), and 24.5 million PubMed papers (1990-2025), applying classical readability scores, the Hohmann writing style suite (including sensational language), acronym density and reuse, an LLM-as-judge readability protocol, and citations from OpenAlex and Semantic Scholar. Four patterns emerge. First, NeurIPS abstracts score harder to read on every classical readability metric: Flesch Reading Ease falls from about 24 in 1987 to 13 in 2024, and sensational language rises by about 50 percent in NeurIPS abstracts between 2015 and 2024. Second, acronym density in NeurIPS titles has grown from 0.33 per 100 words in 1987 to 3.21 in 2024, and about 89 percent of NeurIPS acronyms are used fewer than ten times, ten points above the science-wide baseline. Third, more readable NeurIPS papers tend to receive more citations, suggesting readability and impact are correlated and that less readable papers risk remaining fragmented. LLM-as-judge scores rate NeurIPS abstracts as roughly stable from 1987 to 2022, with early signs of improvement thereafter, a pattern that disagrees with every classical readability metric and raises a design question for enforcement: is the target reader a human or an LLM? Lastly, NeurIPS volume has grown roughly 50-fold between 1987 and 2024. Assuming the goal is to optimise for human readers, we propose seven standards NeurIPS could pilot at NeurIPS 2027: an acronym budget with a venue-approved term list, a human readability threshold, stricter citation standards, standalone visual elements, a plain language summary, a pre-registered acronym glossary, and open source audit tooling.
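
For readers unfamiliar with the metrics named above, here is a minimal Python sketch of two of them. The Flesch Reading Ease formula is the standard one; the acronym detector is a crude heuristic chosen for illustration (tokens made of two or more capital letters) and is an assumption, not the paper's measurement protocol.

```python
import re


def flesch_reading_ease(n_words: int, n_sentences: int, n_syllables: int) -> float:
    """Standard Flesch Reading Ease formula; higher scores are easier to read."""
    if n_words <= 0 or n_sentences <= 0:
        raise ValueError("need at least one word and one sentence")
    return 206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)


def acronym_density(text: str) -> float:
    """Acronyms per 100 words, using a crude heuristic: tokens of 2+ capital letters."""
    words = re.findall(r"[A-Za-z][A-Za-z0-9-]*", text)
    if not words:
        return 0.0
    acronyms = [w for w in words if re.fullmatch(r"[A-Z]{2,}s?", w)]
    return 100.0 * len(acronyms) / len(words)


score = flesch_reading_ease(n_words=100, n_sentences=5, n_syllables=150)
density = acronym_density("We train an LLM with RLVR on ten tasks")
```

Longer sentences and more syllables per word both lower the Flesch score, so scores of 24 and 13 like those reported for NeurIPS abstracts indicate dense text with long words.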

2605.08887 2026-05-12 cs.AI cs.CL

Ace-Skill: Bootstrapping Multimodal Agents with Prioritized and Clustered Evolution

Feng Xiong, Zengbin Wang, Yong Wang, Xuecai Hu, Jinghan He, Liang Lin, Yuan Liu, Xiangxiang Chu

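AI summary This paper proposes Ace-Skill, a co-evolutionary framework that targets two bottlenecks multimodal agents face during self-evolution: data inefficiency and knowledge interference. The method combines prioritized sampling with lazy-decay proficiency tracking to focus on informative, insufficiently mastered samples, and organizes knowledge through semantic clustering to improve retrieval accuracy and adaptation reliability. Experiments show that Ace-Skill achieves significant gains on multiple multimodal tool-use benchmarks and transfers knowledge zero-shot to smaller models, markedly improving the performance of resource-constrained agents.

English abstract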

Self-evolving agents present a promising path toward continual adaptation by distilling task interactions into reusable knowledge artifacts. In practice, this paradigm remains hindered by two coupled bottlenecks: data inefficiency, where costly rollout effort is disproportionately spent on low-value samples rather than informative ones, and knowledge interference, where heterogeneous knowledge stored in shared repositories leads to noisy retrieval and task-misaligned guidance. Together, these issues form a self-reinforcing failure loop in which uninformative rollouts yield noisy knowledge, which in turn degrades subsequent rollouts. In this work, we introduce Ace-Skill, a co-evolutionary framework that jointly optimizes rollout allocation and knowledge organization for self-evolving multimodal agents. Specifically, Ace-Skill combines a prioritized sampler with lazy-decay proficiency tracking to focus rollouts on informative and insufficiently mastered samples, and a clustered organizer that semantically clusters knowledge for cleaner retrieval and more reliable adaptation. By improving sampling and organization together, Ace-Skill turns self-evolution into a virtuous cycle in which more informative rollouts produce higher-quality knowledge that supports stronger subsequent rollouts. Across four multimodal tool-use benchmarks, Ace-Skill delivers strong gains (e.g., +35.46% relative improvement in Avg@4 accuracy), enabling an open-source 35B MoE model to match or surpass proprietary models. The acquired knowledge also transfers effectively in a zero-shot manner to smaller 9B and 4B models, allowing resource-constrained agents to inherit advanced capabilities without additional training. The code is publicly available at https://github.com/AMAP-ML/Ace-Skill.

2605.08885 2026-05-12 cs.LG

Compact SO(3) Equivariant Atomistic Foundation Models via Structural Pruning

Chen Wang, Siyu Hu, Guangming Tan, Weile Jia

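AI summary This paper proposes a structural pruning method for SO(3) equivariant atomistic foundation models, aimed at resolving the tension between model accuracy and inference efficiency. By pruning whole blocks along the channel and order dimensions, the method significantly reduces computational cost while preserving SO(3) equivariance. Experiments show that the pruned models retain high accuracy with far fewer parameters and much less pre-training compute, and outperform small models trained from scratch on multiple downstream tasks.

English abstract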

SO(3) equivariant graph neural networks have become the dominant paradigm for atomistic foundation models, achieving high accuracy and data efficiency by building rotational symmetry directly into the architecture. Yet the computational cost of their higher-order tensor operations creates a tough trade-off between model accuracy and inference efficiency. In this paper, we propose a structural pruning method for SO(3) equivariant atomistic foundation models to bridge this accuracy-efficiency gap. The pruning is applied along the channel and order dimensions, with each irreducible representation kept or removed as a complete block, thereby retaining SO(3) equivariance. Starting from a large checkpoint, the pruned model substantially reduces the inference cost while retaining higher accuracy than an independently trained small model. The pruned MACE-MP model outperforms the official from-scratch trained small model on 7 of 9 metrics on the Matbench Discovery leaderboard. In terms of efficiency, compressed MACE-MP and MACE-OFF models contain 1.5$\times$ to 4$\times$ fewer parameters and require 2.5$\times$ to 4$\times$ less pre-training compute than training a small model from scratch. For downstream applications, fine-tuning the pruned model reduces energy and force errors by 70.1% and 34.4% compared to training task-specific models from scratch across eight representative downstream datasets. We demonstrate that the method generalizes to other SO(3) equivariant architectures (SevenNet, eSCN) and can be combined with quantization and knowledge distillation for further gains.
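
The key constraint, that each irreducible representation is kept or removed as a complete block, can be shown with a toy Python sketch. It operates on feature blocks rather than real model weights, and the importance score (the L2 norm of each block, which is rotation invariant), the data layout, and the function names are all assumptions for illustration, not the paper's pruning criterion.

```python
import math


def block_norm(block):
    """L2 norm of one irreducible-representation block (length 2l + 1)."""
    return math.sqrt(sum(v * v for v in block))


def prune_irreps(features, keep_channels, max_order):
    """Keep the strongest channels per order, never splitting a (2l + 1) block.

    features: dict mapping order l -> list of channels, each a list of 2l + 1 floats.
    keep_channels: number of channels to retain for each surviving order.
    max_order: orders above this value are removed entirely.
    """
    pruned = {}
    for order, channels in features.items():
        if order > max_order:
            continue  # drop the whole order
        for block in channels:
            if len(block) != 2 * order + 1:
                raise ValueError("each block must have 2l + 1 components")
        ranked = sorted(range(len(channels)), key=lambda c: block_norm(channels[c]), reverse=True)
        kept = sorted(ranked[:keep_channels])  # preserve original channel order
        pruned[order] = [channels[c] for c in kept]
    return pruned


features = {
    0: [[0.9], [0.1], [0.5]],
    1: [[0.0, 0.1, 0.0], [1.0, 0.0, 2.0], [0.3, 0.3, 0.3]],
    2: [[0.2] * 5, [0.1] * 5, [0.4] * 5],
}
small = prune_irreps(features, keep_channels=2, max_order=1)
```

Removing individual components inside a block would mix up how the block transforms under rotation, which is why the sketch only ever drops whole channels or whole orders.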

2605.08882 2026-05-12 cs.LG

Discrete Flow Matching: Convergence Guarantees Under Minimal Assumptions

Le-Tuyet-Nhi Pham, Giovanni Conforti, Zhenjie Ren, Alain Durmus

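AI summary This paper studies Discrete Flow Matching (DFM) models, which aim to generate a target distribution $μ_1$ from a discrete source distribution $μ_0$. The authors analyze two DFM models on $\mathbb{Z}_m^d$, sampled through time discretization, and derive non-asymptotic bounds for both. Unlike previous work, the paper establishes convergence guarantees in Kullback-Leibler divergence and total variation distance that rely only on an approximation error assumption, relaxing standard score assumptions, while also improving the dependence on the vocabulary size $m$ and the dimension $d$.

English abstract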

Flow Matching has recently emerged as a popular class of generative models for simulating a target distribution $μ_1$ from samples drawn from a source distribution $μ_0$. This framework relies on a fixed coupling between $μ_0$ and $μ_1$, and on a deterministic or stochastic bridge to define an interpolating process between the two distributions. The time marginals of this process can then be approximately sampled by estimating the transition rates, or more generally the generator, of its Markovian projection. This framework has recently been extended to the case of discrete source and target distributions, under the name Discrete Flow Matching (DFM). However, theoretical guarantees for such models remain scarce. In this paper, we study two DFM models on $\mathbb{Z}_m^d = \{0,\ldots,m-1\}^d$, sampled through time discretization, and derive associated non-asymptotic bounds for both of them. In contrast to previous work, we establish non-asymptotic bounds in Kullback-Leibler divergence for the early-stopped version of the target distribution. We also derive explicit convergence guarantees in total variation distance with respect to the true target distribution. Importantly, these bounds rely only on an approximation error assumption, relaxing standard score assumptions used in earlier works, while also yielding improved dependence on the vocabulary size $m$ and the dimension $d$.
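
Sampling a jump process on $\{0,\ldots,m-1\}^d$ through time discretization can be sketched as a simple Euler scheme in Python. Everything specific here is an assumption for illustration: the per-coordinate update, the toy rate function, the stopping time, and the function names are not taken from the paper, whose two schemes are not detailed in the abstract.

```python
import random


def euler_step(state, t, h, rate_fn, m, rng):
    """One Euler step: each coordinate jumps to y with probability h * rate, else stays."""
    new_state = list(state)
    for i, x_i in enumerate(state):
        targets = [y for y in range(m) if y != x_i]
        probs = [h * rate_fn(t, state, i, y) for y in targets]
        if sum(probs) > 1.0:
            raise ValueError("step size too large for these rates")
        u = rng.random()
        cumulative = 0.0
        for y, p in zip(targets, probs):
            cumulative += p
            if u < cumulative:
                new_state[i] = y
                break
    return tuple(new_state)


def sample(initial_state, rate_fn, m, n_steps, t_end, rng):
    """Run the discretized chain from time 0 to t_end (early stopping if t_end < 1)."""
    h = t_end / n_steps
    state = tuple(initial_state)
    for k in range(n_steps):
        state = euler_step(state, k * h, h, rate_fn, m, rng)
    return state


def toy_rate(t, state, i, y):
    """Hypothetical rate: coordinate i is pulled toward the value 2, otherwise no jumps."""
    return 5.0 if y == 2 else 0.0


rng = random.Random(0)
x = sample(initial_state=(0, 1, 3), rate_fn=toy_rate, m=4, n_steps=200, t_end=0.95, rng=rng)
```

In a learned model the rate function would be an estimate of the true transition rates, and the gap between the two is the kind of approximation error the bounds in the abstract depend on.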