arXivDaily: arXiv daily academic digest, updated Monday through Friday
2603.09581 2026-05-26 cs.LG

Towards Understanding Adam Convergence on Highly Degenerate Polynomials

理解Adam在高度退化多项式上的收敛性

Zhiwei Bai, Jiajie Zhao, Zhangchen Zhou, Zhi-Qin John Xu, Yaoyu Zhang

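AI summary: This paper studies the automatic convergence of the Adam optimizer on highly degenerate polynomials, derives conditions for local asymptotic stability, proves a local linear convergence rate that outperforms the sub-linear rates of gradient descent and momentum, and characterizes the hyperparameter phase diagram.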

Comments Accepted to ICML 2026

AI Chinese abstract

Adam是深度学习中广泛使用的优化算法,然而其具有固有优势的目标函数类别仍未被充分探索。与先前需要外部调度器和$\beta_2$接近1才能收敛的研究不同,本文研究了Adam的“自然”自动收敛性质。我们识别了一类高度退化多项式,Adam无需额外调度器即可自动收敛。具体地,我们推导了退化多项式上局部渐近稳定性的理论条件,并展示了理论界限与实验结果之间的强一致性。我们证明Adam在这些退化函数上实现局部线性收敛,显著优于梯度下降和动量法的次线性收敛。这种加速源于第二矩$v_t$与平方梯度$g_t^2$之间的解耦机制,该机制指数级放大有效学习率。最后,我们刻画了Adam的超参数相图,识别出三种不同的行为区域:稳定收敛、尖峰和类似SignGD的振荡。

English abstract

Adam is a widely used optimization algorithm in deep learning, yet the specific class of objective functions where it exhibits inherent advantages remains underexplored. Unlike prior studies requiring external schedulers and $\beta_2$ near 1 for convergence, this work investigates the "natural" auto-convergence properties of Adam. We identify a class of highly degenerate polynomials where Adam converges automatically without additional schedulers. Specifically, we derive theoretical conditions for local asymptotic stability on degenerate polynomials and demonstrate strong alignment between theoretical bounds and experimental results. We prove that Adam achieves local linear convergence on these degenerate functions, significantly outperforming the sub-linear convergence of Gradient Descent and Momentum. This acceleration stems from a decoupling mechanism between the second moment $v_t$ and squared gradient $g_t^2$, which exponentially amplifies the effective learning rate. Finally, we characterize Adam's hyperparameter phase diagram, identifying three distinct behavioral regimes: stable convergence, spikes, and SignGD-like oscillation.
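
For readers who want to see the two update rules side by side, the following is a minimal sketch that runs plain gradient descent and textbook Adam (with bias correction) on the toy degenerate objective $f(x) = x^4$. The objective, starting point, step sizes, and step count are illustrative assumptions, not the paper's experimental setup, and the function names are hypothetical. Whether Adam converges stably, spikes, or oscillates depends on the hyperparameters, as the phase diagram in the abstract indicates.

```python
import math


def grad(x: float) -> float:
    """Gradient of the toy degenerate objective f(x) = x**4 (an assumed example)."""
    return 4.0 * x ** 3


def run_gd(x0: float, lr: float, steps: int) -> float:
    """Plain gradient descent: x <- x - lr * g."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x


def run_adam(x0: float, lr: float, steps: int,
             beta1: float = 0.9, beta2: float = 0.999, eps: float = 1e-8) -> float:
    """Textbook Adam with bias correction; m tracks g, v tracks g**2."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1.0 - beta1) * g        # first-moment estimate
        v = beta2 * v + (1.0 - beta2) * g * g    # second-moment estimate
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


# Illustrative settings only; try other beta2 values to explore different regimes.
x_gd = run_gd(1.0, 0.1, 1000)
x_adam = run_adam(1.0, 0.01, 1000)
print(f"GD   after 1000 steps: x = {x_gd:.6f}")
print(f"Adam after 1000 steps: x = {x_adam:.6f}")
```

On this objective the gradient descent iterate shrinks only like $1/\sqrt{t}$, which is the sub-linear behavior the abstract contrasts with Adam.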

2603.08072 2026-05-26 cs.LG

Hybrid Quantum Neural Network for Multivariate Clinical Time Series Forecasting

混合量子神经网络用于多变量临床时间序列预测

Irene Iele, Floriano Caprio, Paolo Soda, Matteo Tortora

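AI summary: Proposes a hybrid quantum-classical architecture that integrates a variational quantum circuit into a recurrent neural network for multi-horizon forecasting of multivariate physiological time series; on the BIDMC dataset it achieves accuracy comparable to the baselines with greater robustness.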

AI Chinese abstract

预测生理信号可以通过预期患者状态的临界变化来支持主动监测和及时的临床干预。在这项工作中,我们通过联合预测心率、血氧饱和度、脉搏率和呼吸率在15、30和60秒的预测时域上,解决了生理时间序列的多变量多步预测问题。我们提出了一种混合量子-经典架构,将变分量子电路(VQC)集成到循环神经骨干中。GRU编码器将历史观察窗口总结为潜在表示,然后将其投影到用于参数化VQC的量子角度上。量子层作为可学习的非线性特征混合器,在最终预测阶段之前建模跨变量交互。我们在BIDMC PPG和呼吸数据集上采用留一患者方案评估了所提出的方法。结果显示,与经典和深度学习基线相比,该方法具有竞争性的精度,同时对噪声和缺失输入具有更强的鲁棒性。这些发现表明,混合量子层可以为小队列临床环境中的生理时间序列预测提供有用的归纳偏置。代码可在https://github.com/arco-group/quantum-ml获取。

English abstract

Forecasting physiological signals can support proactive monitoring and timely clinical intervention by anticipating critical changes in patient status. In this work, we address multivariate multi-horizon forecasting of physiological time series by jointly predicting heart rate, oxygen saturation, pulse rate, and respiratory rate at forecasting horizons of 15, 30, and 60 seconds. We propose a hybrid quantum-classical architecture that integrates a Variational Quantum Circuit (VQC) within a recurrent neural backbone. A GRU encoder summarizes the historical observation window into a latent representation, which is then projected into quantum angles used to parameterize the VQC. The quantum layer acts as a learnable non-linear feature mixer, modeling cross-variable interactions before the final prediction stage. We evaluate the proposed approach on the BIDMC PPG and Respiration dataset under a Leave-One-Patient-Out protocol. The results show competitive accuracy compared with classical and deep learning baselines, together with greater robustness to noise and missing inputs. These findings suggest that hybrid quantum layers can provide useful inductive biases for physiological time series forecasting in small-cohort clinical settings. The code is available at https://github.com/arco-group/quantum-ml.

2603.06798 2026-05-26 cs.LG cs.DC stat.ML

NEST: Network- and Memory-Aware Device Placement For Distributed Deep Learning

NEST: 面向分布式深度学习的网络与内存感知设备放置

Irene Wang, Vishnu Varma Venkata, Arvind Krishnamurthy, Divya Mahajan

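AI summary: Proposes NEST, a framework that unifies model parallelism, topology modeling, and memory feasibility through structured dynamic programming, achieving up to 2.43 times higher throughput across diverse hardware and networks.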

Comments Accepted to MLSys 2026

AI Chinese abstract

深度学习规模的不断增长要求分布式训练框架能够联合考虑并行性、内存和网络拓扑。先前的工作通常依赖启发式或拓扑无关的搜索,分别处理通信和内存。由于缺乏每设备内存感知,这些方法通常事后通过将参数和激活分片到多个设备上来确保可行性,从而增加同步、扩大通信、降低计算利用率,限制了实际数据中心网络上的可扩展性和效率。我们提出了NEST,一个网络、计算和内存感知的设备放置框架,通过结构化动态规划统一了模型并行、拓扑建模和内存可行性。NEST的动态规划在具有张量和专家并行配置、跨层次或任意网络的显式allreduce延迟以及内存/计算轮廓的算子图上运行。通过跨张量、流水线、数据和专家维度分解并行性,NEST为混合策略定义了一个原则性的搜索空间,同时联合优化共置、网络延迟和内存可行性。在多种硬件和网络上的评估表明,与最先进的基线相比,NEST实现了高达2.43倍的吞吐量提升、更好的内存效率和可扩展性,为下一代AI基础设施的并行化策略和数据中心互连协同设计提供了基础。NEST的源代码可在https://github.com/scai-tech/Nest获取。

English abstract

The growing scale of deep learning demands distributed training frameworks that jointly reason about parallelism, memory, and network topology. Prior works often rely on heuristic or topology-agnostic search, handling communication and memory separately. Without per-device memory awareness, these methods typically ensure feasibility post hoc by sharding parameters and activations across many devices, which increases synchronization, inflates communication, and underutilizes compute, limiting scalability and efficiency on real datacenter networks. We present NEST, a network-, compute-, and memory-aware device placement framework that unifies model parallelism, topology modeling, and memory feasibility via structured dynamic programming. NEST's DP operates on operator graphs with tensor and expert parallel configurations, explicit allreduce latencies across hierarchical or arbitrary networks, and memory/compute profiles. By factoring parallelism across tensor, pipeline, data, and expert dimensions, NEST defines a principled search space for hybrid strategies while jointly optimizing co-location, network latency, and memory feasibility. Evaluations across diverse hardware and networks show NEST achieves up to 2.43 times higher throughput, better memory efficiency, and improved scalability over state-of-the-art baselines, providing a foundation for co-designing parallelization strategies and datacenter interconnects for next-generation AI infrastructure. The source code of NEST is available at https://github.com/scai-tech/Nest.

2603.06687 2026-05-26 cs.CV cs.CL cs.ET cs.MM cs.RO

TimeSpot: Benchmarking Geo-Temporal Understanding in Vision-Language Models in Real-World Settings

TimeSpot: 在真实世界场景中评估视觉语言模型的地理时间理解能力

Azmine Toushik Wasi, Shahriyar Zaman Ridoy, Koushik Ahamed Tonmoy, Kinga Tshering, S. M. Muhtasimul Hasan, Wahid Faisal, Tasnim Mohiuddin, Md Rizwan Parvez

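AI summary: Proposes the TimeSpot benchmark, which uses 1,455 images from around the world to evaluate how well vision-language models reason about temporal attributes (season, month, time of day, daylight phase) and geographic attributes (continent, country, climate zone, environment type, latitude-longitude); existing models perform poorly, especially on temporal reasoning.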

Comments Accepted to ICML 2026

AI Chinese abstract

地理时间理解,即仅从视觉输入推断位置、时间和上下文属性的能力,支撑着灾害管理、交通规划、具身导航、世界建模和地理教育等应用。尽管最近的视觉语言模型(VLM)利用地标和路标等线索在图像地理定位方面取得了进展,但它们推理时间信号和物理基础空间线索的能力仍然有限。为弥补这一差距,我们引入了TimeSpot,一个用于评估VLM在真实世界中进行地理时间推理的基准。TimeSpot包含来自80个国家的1,455张地面图像,要求直接从视觉证据中结构化预测时间属性(季节、月份、时段、日光相位)和地理属性(大洲、国家、气候带、环境类型、经纬度)。它还包括时空推理任务,测试在真实世界不确定性下的物理合理性。对最先进的开源和闭源VLM的评估显示性能低下,尤其是时间推理。虽然监督微调带来了改进,但结果仍不充分,凸显了需要新方法来实现稳健的、基于物理的地理时间理解。TimeSpot可在 https://TimeSpot-GT.github.io 获取。

English abstract

Geo-temporal understanding, the ability to infer location, time, and contextual properties from visual input alone, underpins applications such as disaster management, traffic planning, embodied navigation, world modeling, and geography education. Although recent vision-language models (VLMs) have advanced image geo-localization using cues like landmarks and road signs, their ability to reason about temporal signals and physically grounded spatial cues remains limited. To address this gap, we introduce TimeSpot, a benchmark for evaluating real-world geo-temporal reasoning in VLMs. TimeSpot comprises 1,455 ground-level images from 80 countries and requires structured prediction of temporal attributes (season, month, time of day, daylight phase) and geographic attributes (continent, country, climate zone, environment type, latitude-longitude) directly from visual evidence. It also includes spatial-temporal reasoning tasks that test physical plausibility under real-world uncertainty. Evaluations of state-of-the-art open- and closed-source VLMs show low performance, particularly for temporal inference. While supervised fine-tuning yields improvements, results remain insufficient, highlighting the need for new methods to achieve robust, physically grounded geo-temporal understanding. TimeSpot is available at https://TimeSpot-GT.github.io.

2603.06218 2026-05-26 cs.RO

Few-Shot Neural Differentiable Simulator: Real-to-Sim Rigid-Contact Modeling

少样本神经可微模拟器:真实到模拟的刚体接触建模

Zhenhao Huang, Siyuan Luo, Bingyang Zhou, Ziqiu Zeng, Jason Pho, Fan Shi

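AI summary: Proposes a few-shot real-to-sim method that combines the physical consistency of analytical formulations with the representational capacity of graph neural networks. A small amount of real-world data calibrates an analytical simulator that generates large-scale synthetic datasets, and a mesh-based graph neural network implicitly models rigid-body forward dynamics with surrogate gradients for collision detection, achieving full differentiability and improving simulation fidelity and policy-learning efficiency.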

Comments Accepted in ICRA 2026

AI Chinese abstract

精确的物理模拟对于机器人学习和控制至关重要,然而解析模拟器通常难以捕捉复杂的接触动力学,而基于学习的模拟器通常需要大量昂贵的真实世界数据。为弥合这一差距,我们提出了一种少样本真实到模拟方法,该方法结合了解析公式的物理一致性与基于图神经网络(GNN)模型的表示能力。仅使用少量真实世界数据,我们的方法校准解析模拟器以生成大规模合成数据集,捕捉多样的接触交互。在此基础上,我们引入了一种基于网格的GNN,隐式建模刚体前向动力学,并推导出碰撞检测的代理梯度,实现完全可微性。实验结果表明,我们的方法使基于学习的模拟器在复现真实世界轨迹方面优于可微基线。此外,可微设计支持基于梯度的优化,我们通过多物体交互场景中的基于模拟的策略学习验证了这一点。大量实验表明,我们的框架不仅以最小监督提高了模拟保真度,还提高了策略学习的效率。综上所述,这些发现表明,具有少样本真实世界基础的可微模拟为推进未来机器人操作和控制提供了有力方向。

English abstract

Accurate physics simulation is essential for robotic learning and control, yet analytical simulators often fail to capture complex contact dynamics, while learning-based simulators typically require large amounts of costly real-world data. To bridge this gap, we propose a few-shot real-to-sim approach that combines the physical consistency of analytical formulations with the representational capacity of graph neural network (GNN)-based models. Using only a small amount of real-world data, our method calibrates analytical simulators to generate large-scale synthetic datasets that capture diverse contact interactions. On this foundation, we introduce a mesh-based GNN that implicitly models rigid-body forward dynamics and derive surrogate gradients for collision detection, achieving full differentiability. Experimental results demonstrate that our approach enables learning-based simulators to outperform differentiable baselines in replicating real-world trajectories. In addition, the differentiable design supports gradient-based optimization, which we validate through simulation-based policy learning in multi-object interaction scenarios. Extensive experiments show that our framework not only improves simulation fidelity with minimal supervision but also increases the efficiency of policy learning. Taken together, these findings suggest that differentiable simulation with few-shot real-world grounding provides a powerful direction for advancing future robotic manipulation and control.

2603.05143 2026-05-26 cs.CL cs.LG

Feature Resemblance: Towards a Theoretical Understanding of Analogical Reasoning in Transformers

特征相似性:迈向对Transformer中类比推理的理论理解

Ruichen Xu, Wenjing Yan, Ying-Jun Angela Zhang

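AI summary: Using a minimal transformer abstraction, this paper proves that joint training and a specific curriculum order align entities in representation space, enabling property transfer through feature resemblance, that is, analogical reasoning.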

AI Chinese abstract

理解大型语言模型中的推理因评估混淆多种推理类型而变得复杂。我们分离出类比推理,即模型在共享已知属性的实体之间转移属性,并研究这种转移何时能从训练中涌现。为了使问题在分析上易于处理,我们研究了一个最小化的Transformer风格抽象,该抽象隔离了学习到的表示如何支持类比推理。在此设置中,我们证明了三个关键结果。首先,对相似性和属性前提的联合训练通过对齐表示实现类比推理。其次,顺序训练仅在相似性结构先于特定属性学习时成功,揭示了课程不对称性。第三,在我们的风格化设置中,两跳推理$(a \to b, b \to c \Rightarrow a \to c)$可被视为具有身份桥$(b=b)$的类比推理,这些身份桥在训练数据中明确出现。这些结果共同揭示了一个统一机制:具有共享属性的实体在表示空间中对齐,从而通过特征相似性实现属性转移。使用高达8B参数的架构进行的实验与理论定性一致,并表明表示几何在风格化模型之外的类比推理中扮演重要角色。

English abstract

Understanding reasoning in large language models is complicated by evaluations that conflate multiple reasoning types. We isolate analogical reasoning, where a model transfers an attribute between entities that share known properties, and study when such transfer can emerge from training. To make the problem analytically tractable, we study a minimal transformer-style abstraction that isolates how learned representations support analogical reasoning. Within this setting, we prove three key results. First, joint training on similarity and attribution premises enables analogical reasoning through aligned representations. Second, sequential training succeeds only when similarity structure is learned before specific attributes, revealing a curriculum asymmetry. Third, in our stylized setting, two-hop reasoning $(a \to b, b \to c \Rightarrow a \to c)$ can be viewed as analogical reasoning with identity bridges $(b=b)$, which appear explicitly in training data. Together, these results reveal a unified mechanism: entities with shared properties become aligned in representation space, enabling property transfer through feature resemblance. Experiments with architectures up to 8B parameters show qualitative agreement with the theory and suggest that representational geometry plays an important role in analogical reasoning beyond the stylized model.

2603.04114 2026-05-26 cs.CV

Any2Any: Unified Arbitrary Modality Translation for Remote Sensing

Any2Any: 统一任意模态遥感翻译

Haoyang Chen, Jing Zhang, Hebaixu Wang, Shiqin Wang, Pohsun Huang, Jiayuan Li, Haonan Guo, Di Wang, Zheng Wang, Bo Du

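AI summary: Proposes Any2Any, a unified latent diffusion framework that translates efficiently between arbitrary modalities through a shared latent space and lightweight residual adapters; on the new RST-1M dataset it outperforms pairwise methods and shows zero-shot generalization.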

Comments Accepted by ICML 2026

AI Chinese abstract

多模态遥感图像提供同一地理场景的互补观测,但在实际中这些观测往往不完整。现有的跨模态翻译方法将每个模态对视为独立任务,导致二次复杂度且对未见模态组合的泛化能力有限。我们将任意到任意翻译建模为场景共享潜表示上的推理,其中不同模态对应同一底层语义的部分观测。基于此公式,我们提出Any2Any,一个统一的潜扩散框架,将异构输入投影到几何对齐的潜空间。该结构通过共享骨干网络执行锚定潜回归,解耦模态特定表示学习与语义映射。此外,使用轻量级目标特定残差适配器来纠正系统性潜失配,而不增加推理复杂度。为了支持稀疏但连接监督下的学习,我们引入了RST-1M,首个百万级遥感数据集,包含五种感知模态的配对观测,为任意到任意翻译提供监督锚点。在14个翻译任务上的实验表明,Any2Any始终优于成对翻译方法,并对未见模态对展现出强大的零样本泛化能力。代码和模型可在https://github.com/MiliLab/Any2Any获取。

English abstract

Multi-modal remote sensing imagery provides complementary observations of the same geographic scene, yet such observations are frequently incomplete in practice. Existing cross-modal translation methods treat each modality pair as an independent task, resulting in quadratic complexity and limited generalization to unseen modality combinations. We formulate Any-to-Any translation as inference over a shared latent representation of the scene, where different modalities correspond to partial observations of the same underlying semantics. Based on this formulation, we propose Any2Any, a unified latent diffusion framework that projects heterogeneous inputs into a geometrically aligned latent space. Such structure performs anchored latent regression with a shared backbone, decoupling modality-specific representation learning from semantic mapping. Moreover, lightweight target-specific residual adapters are used to correct systematic latent mismatches without increasing inference complexity. To support learning under sparse but connected supervision, we introduce RST-1M, the first million-scale remote sensing dataset with paired observations across five sensing modalities, providing supervision anchors for any-to-any translation. Experiments across 14 translation tasks show that Any2Any consistently outperforms pairwise translation methods and exhibits strong zero-shot generalization to unseen modality pairs. Code and models are available at https://github.com/MiliLab/Any2Any.

2603.00857 2026-05-26 cs.LG cs.AI

MultiPUFFIN: A Multimodal Domain-Constrained Foundation Model for Molecular Property Prediction of Small Molecules

MultiPUFFIN:用于小分子性质预测的多模态领域约束基础模型

Idelfonso B. R. Nogueira, Carine M. Rebello, Mumin Enis Leblebici, Erick Giovani Sperandio Nascimento

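AI summary: Proposes MultiPUFFIN, a multimodal foundation model that fuses SMILES, 2D graphs, 3D conformers, and experimental conditions; with condition-aware refinement and thermodynamically constrained heads, it predicts thermophysical properties of small molecules and outperforms ChemBERTa-2 with far fewer labeled samples.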

AI Chinese abstract

MultiPUFFIN是一个领域信息多模态基础模型,用于预测小分子的热物理性质,填补了化学工程、药物发现和材料科学中的关键空白。现有的分子基础模型在数百万分子上预训练以学习通用表示,但其标准MLP输出层不施加物理约束,蒸汽压预测可能违反温度单调依赖性,粘度曲线可能缺乏过程模拟器所需的功能形式。保证热力学一致性的领域信息方法仍局限于单一性质和少量数据集,而多模态基础模型则侧重于生物活性而非热物理性质。MultiPUFFIN通过双向跨模态注意力和门控融合融合SMILES序列、2D分子图和3D构象几何,并辅以实验条件和分子描述符的辅助编码器,填补了这一空白。骨干网络使用三种互补的自监督目标在500,000个未标记的PubChem分子上预训练。一个条件感知精炼堆栈包含五个条件器(温度、pH、压力、多晶型和测量方法),将每个性质路由到一个四头锦标赛,选择该性质性能最佳的热力学信息头。MultiPUFFIN的平均测试R²为0.784,在所有九个性质上优于微调的ChemBERTa-2,尽管训练使用的标记分子数量少了约2000倍。

English abstract

MultiPUFFIN is a domain-informed multimodal foundation model for predicting thermophysical properties of small molecules, addressing a critical gap in chemical engineering, drug discovery, and materials science. Existing molecular foundation models pretrain on millions of molecules to learn general-purpose representations, but their standard MLP output layers impose no physical constraints: vapor pressure predictions may violate monotonic temperature dependence, and viscosity curves may lack the functional form required by process simulators. Domain-informed approaches that guarantee thermodynamic consistency have remained limited to single properties and small datasets, whereas multimodal foundation models have focused on biological activity rather than thermophysical properties. MultiPUFFIN fills this gap by fusing SMILES sequences, 2D molecular graphs, and 3D conformer geometries through bidirectional cross-modal attention and gated fusion, supplemented by auxiliary encoders for experimental conditions and molecular descriptors. The backbone is pretrained on 500,000 unlabelled PubChem molecules using three complementary self-supervised objectives. A condition-aware refinement stack of five conditioners (temperature, pH, pressure, polymorph, and measurement method) routes each property to a four-head tournament that selects the best-performing thermodynamically informed head for that property. MultiPUFFIN achieves a mean test $R^2$ of 0.784 and outperforms fine-tuned ChemBERTa-2 on all nine properties despite training on roughly 2,000 times fewer labeled molecules.

2602.21198 2026-05-26 cs.LG cs.AI cs.CL cs.CV cs.RO

Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs

从试错中学习:具身大语言模型的反思式测试时规划

Yining Hong, Huang Huang, Manling Li, Li Fei-Fei, Leonidas Guibas, Jiajun Wu, Yejin Choi

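AI summary: Proposes Reflective Test-Time Planning, which combines reflection-in-action and reflection-on-action with retrospective reflection so that embodied agents self-correct and accumulate experience at test time, significantly improving performance on long-horizon tasks.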

AI Chinese abstract

具身大语言模型赋予机器人高级任务推理能力,但它们无法反思错误原因,导致部署成为一系列独立尝试,错误重复而非积累经验。借鉴人类反思实践,我们引入反思式测试时规划,整合两种反思模式:行动中反思,代理在行动前利用测试时扩展生成并评分多个候选行动,基于内部反思;以及行动后反思,利用测试时训练,根据执行后的外部反思更新内部反思模型和行动策略。我们还包含回溯性反思,允许代理重新评估早期决策,并利用后见之明进行模型更新,实现适当的长程信用分配。在我们新设计的Long-Horizon Household基准和MuJoCo Cupboard Fitting基准上的实验表明,与基线模型相比有显著提升,并能零样本泛化到逼真的HM3D环境以及在Franka Panda机械臂上的真实机器人实验。消融实验证实,行动中反思和行动后反思相互依赖,且回溯性反思在较低计算开销下比逐步外部反馈实现更好的信用分配。定性分析进一步突出了通过反思进行的行为纠正。

English abstract

Embodied LLMs endow robots with high-level task reasoning, but they cannot reflect on what went wrong or why, turning deployment into a sequence of independent trials where mistakes repeat rather than accumulate into experience. Drawing upon human reflective practitioners, we introduce Reflective Test-Time Planning, which integrates two modes of reflection: reflection-in-action, where the agent uses test-time scaling to generate and score multiple candidate actions using internal reflections before execution; and reflection-on-action, which uses test-time training to update both its internal reflection model and its action policy based on external reflections after execution. We also include retrospective reflection, allowing the agent to re-evaluate earlier decisions and perform model updates with hindsight for proper long-horizon credit assignment. Experiments on our newly designed Long-Horizon Household benchmark and MuJoCo Cupboard Fitting benchmark show significant gains over baseline models, with zero-shot generalization to photorealistic HM3D environments and real-robot experiments on a Franka Panda arm. Ablations confirm that reflection-in-action and reflection-on-action are mutually dependent, and that retrospective reflection achieves better credit assignment than step-wise external feedback at lower computational overhead. Qualitative analyses further highlight behavioral correction through reflection.

2602.20210 2026-05-26 cs.LG cs.AI

Multimodal Crystal Flow: Any-to-Any Modality Generation for Unified Crystal Modeling

多模态晶体流:面向统一晶体建模的任意模态生成

Kiyoung Seong, Sungsoo Ahn, Sehui Han, Changyoung Park

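AI summary: Proposes Multimodal Crystal Flow (MCFlow), a unified multimodal flow model that realizes multiple crystal generation tasks via independent time variables for atom types and crystal structures, and is competitive with task-specific baselines on the MP-20 and MPTS-52 benchmarks.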

AI Chinese abstract

晶体建模涵盖一系列条件和非条件生成任务,包括晶体结构预测(CSP)和从头生成(DNG)。尽管最近的深度生成模型表现出有前景的性能,但它们仍然主要是任务特定的,缺乏跨任务共享晶体表示的统一框架。为了解决这一限制,我们提出了多模态晶体流(MCFlow),一种统一的多模态流模型,通过原子类型和晶体结构的独立时间变量将多种晶体生成任务实现为不同的推理轨迹。为了在标准Transformer模型中实现多模态流,我们引入了一种具有层次排列增强的组合和对称感知原子排序,无需显式结构模板即可注入组合和晶体学先验。在MP-20和MPTS-52基准上的实验表明,单个MCFlow模型在CSP、DNG和结构条件原子类型生成方面与任务特定基线具有竞争力。

English abstract

Crystal modeling spans a family of conditional and unconditional generation tasks, including crystal structure prediction (CSP) and de novo generation (DNG). While recent deep generative models have shown promising performance, they remain largely task-specific, lacking a unified framework that shares crystal representations across tasks. To address this limitation, we propose Multimodal Crystal Flow (MCFlow), a unified multimodal flow model that realizes multiple crystal generation tasks as distinct inference trajectories via independent time variables for atom types and crystal structures. To enable multimodal flow in a standard transformer model, we introduce a composition- and symmetry-aware atom ordering with hierarchical permutation augmentation, injecting compositional and crystallographic priors without explicit structural templates. Experiments on the MP-20 and MPTS-52 benchmarks show that a single MCFlow model is competitive with task-specific baselines across CSP, DNG, and structure-conditioned atom type generation.

2602.19333 2026-05-26 cs.CL cs.IR cs.SI

PerSoMed: A Large-Scale Balanced Dataset for Persian Social Media Text Classification

PerSoMed:用于波斯社交媒体文本分类的大规模平衡数据集

Isun Chehreh, Ebrahim Ansari

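AI summary: Builds the first large-scale balanced Persian social media text classification dataset, with 36,000 posts across nine categories, and benchmarks models including BiLSTM, XLM-RoBERTa, and TookaBERT; TookaBERT-Large performs best (F1-score 0.9621).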

Comments 10 pages, including 1 figure

AI Chinese abstract

本研究引入了首个大规模、良好平衡的波斯社交媒体文本分类数据集,专门用于解决该领域缺乏综合资源的问题。该数据集包含9个类别(经济、艺术、体育、政治、社会、健康、心理、历史、科技)的36,000条帖子,每个类别4,000个样本,以确保类别分布平衡。数据收集涉及来自多个波斯社交媒体平台的60,000条原始帖子,随后进行严格的预处理和混合标注,结合基于ChatGPT的少样本提示和人工验证。为了缓解类别不平衡,我们采用了带语义冗余移除的欠采样和结合词汇替换与生成提示的高级数据增强策略。我们对多个模型进行了基准测试,包括BiLSTM、XLM-RoBERTa(使用LoRA和AdaLoRA适配)、FaBERT、基于SBERT的架构以及波斯语专用TookaBERT(Base和Large)。实验结果表明,基于Transformer的模型始终优于传统神经网络,其中TookaBERT-Large取得了最佳性能(精确率:0.9622,召回率:0.9621,F1分数:0.9621)。按类别评估进一步证实了所有类别的稳健性能,尽管社会和政治文本由于固有歧义而得分略低。本研究提供了一个新的高质量数据集,并对前沿模型进行了全面评估,为波斯语自然语言处理的进一步发展奠定了坚实基础,包括趋势分析、社会行为建模和用户分类。该数据集公开可用,以支持未来的研究工作。

English abstract

This research introduces the first large-scale, well-balanced Persian social media text classification dataset, specifically designed to address the lack of comprehensive resources in this domain. The dataset comprises 36,000 posts across nine categories (Economic, Artistic, Sports, Political, Social, Health, Psychological, Historical, and Science & Technology), each containing 4,000 samples to ensure balanced class distribution. Data collection involved 60,000 raw posts from various Persian social media platforms, followed by rigorous preprocessing and hybrid annotation combining ChatGPT-based few-shot prompting with human verification. To mitigate class imbalance, we employed undersampling with semantic redundancy removal and advanced data augmentation strategies integrating lexical replacement and generative prompting. We benchmarked several models, including BiLSTM, XLM-RoBERTa (with LoRA and AdaLoRA adaptations), FaBERT, SBERT-based architectures, and the Persian-specific TookaBERT (Base and Large). Experimental results show that transformer-based models consistently outperform traditional neural networks, with TookaBERT-Large achieving the best performance (Precision: 0.9622, Recall: 0.9621, F1-score: 0.9621). Class-wise evaluation further confirms robust performance across all categories, though social and political texts exhibited slightly lower scores due to inherent ambiguity. This research presents a new high-quality dataset and provides comprehensive evaluations of cutting-edge models, establishing a solid foundation for further developments in Persian NLP, including trend analysis, social behavior modeling, and user classification. The dataset is publicly available to support future research endeavors.

2602.18640 2026-05-26 cs.AI

Decoding ML Decision: An Agentic Reasoning Framework for Large-Scale Ranking System

解码机器学习决策:面向大规模排序系统的智能体推理框架

Longfei Yun, Yihan Wu, Haoran Liu, Xiaoxuan Liu, Ziyun Xu, Yi Wang, Yang Xia, Pengfei Wang, Mingze Gao, Yunxiang Wang, Changfan Chen, Wenjie Fu, Hong Yan, Junfeng Pan

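AI summary: Proposes the GEARS framework, which encapsulates ranking expert knowledge in agent skills and recasts ranking optimization as an autonomous discovery process, letting operators steer systems through high-level intent while maintaining production reliability.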

Comments 12 pages, 5 figures

AI Chinese abstract

现代大规模排序系统在竞争目标、操作约束和不断变化的产品需求的复杂环境中运行。该领域的进展越来越受到工程上下文约束的瓶颈:将模糊的产品意图转化为合理、可执行、可验证的假设的艰巨过程,而不仅仅是建模技术本身。我们提出了GEARS(生成式智能体排序系统引擎),这是一个将排序优化重新定义为可编程实验环境中的自主发现过程的框架。GEARS不是将优化视为静态模型选择,而是利用专门智能体技能将排序专家知识封装为可复用的推理能力,使操作者能够通过高层意图(如氛围个性化)来引导系统。此外,为确保生产可靠性,该框架集成了验证钩子以强制执行统计稳健性,并过滤掉过度拟合短期信号的脆弱策略。跨不同产品表面的实验验证表明,GEARS通过协同算法信号与深度排序上下文,同时保持严格的部署稳定性,能够持续识别出接近帕累托最优的优越策略。

English abstract

Modern large-scale ranking systems operate within a sophisticated landscape of competing objectives, operational constraints, and evolving product requirements. Progress in this domain is increasingly bottlenecked by the engineering context constraint: the arduous process of translating ambiguous product intent into reasonable, executable, verifiable hypotheses, rather than by modeling techniques alone. We present GEARS (Generative Engine for Agentic Ranking Systems), a framework that reframes ranking optimization as an autonomous discovery process within a programmable experimentation environment. Rather than treating optimization as static model selection, GEARS leverages Specialized Agent Skills to encapsulate ranking expert knowledge into reusable reasoning capabilities, enabling operators to steer systems via high-level intent (for example, vibe personalization). Furthermore, to ensure production reliability, the framework incorporates validation hooks to enforce statistical robustness and filter out brittle policies that overfit short-term signals. Experimental validation across diverse product surfaces demonstrates that GEARS consistently identifies superior, near-Pareto-efficient policies by synergizing algorithmic signals with deep ranking context while maintaining rigorous deployment stability.

2602.17658 2026-05-26 cs.LG cs.AI cs.IT math.IT

MARS: Margin and Semantic-Aware Data Augmentation for Reward Modeling

MARS:面向奖励建模的边界与语义感知数据增强

Payel Bhattacharjee, Osvaldo Simeone, Ravi Tandon

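AI summary: Proposes the MARS framework, which prioritizes augmentation of low-margin preference pairs and refines the selection with semantic distance, improving reward-model quality and alignment performance.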

AI Chinese abstract

奖励建模是RLHF、RLAIF和基于PPO的策略优化等对齐流程的核心,但其可靠性受限于有限且异构的人类偏好数据,这些数据难以大规模收集。虽然合成增强可以扩展偏好监督,但现有方法通常均匀增强或在表示层面增强,而不针对奖励模型不确定或容易误排序的示例。在本文中,我们介绍了MARS(面向奖励建模的边界与语义感知数据增强),一种自适应增强框架,优先考虑低边界偏好对,并使用语义距离作为第二层细化,以增强选择响应和拒绝响应之间的对比。在多个偏好数据集、奖励模型骨干、下游对齐设置以及包括RewardBench和AlpacaEval在内的基准测试中,MARS在奖励模型质量和对齐性能上都优于现有基线。我们的结果表明,当同时由模型边界和语义结构引导时,奖励模型增强最为有效。

English abstract

Reward modeling is central to alignment pipelines such as RLHF, RLAIF, and PPO-based policy optimization, yet its reliability is constrained by limited and heterogeneous human preference data that are expensive to collect at scale. While synthetic augmentation can expand preference supervision, existing methods often augment uniformly or at the representation level, without targeting examples where the reward model is uncertain or prone to mis-ranking. In this paper, we introduce MARS (Margin and Semantic-Aware Data Augmentation for Reward Modeling), an adaptive augmentation framework that prioritizes low-margin preference pairs and uses semantic distance as a second layer for refinement to enhance the contrast between the chosen and rejected responses. Across multiple preference datasets, reward-model backbones, downstream alignment settings, and benchmarks including RewardBench and AlpacaEval, MARS improves both reward-model quality and alignment performance over existing baselines. Our results show that reward-model augmentation is most effective when guided by both model margins and semantic structure.
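
The margin-based prioritization described above can be illustrated with a small sketch. It ranks preference pairs by the reward margin between the chosen and rejected responses and keeps the lowest-margin pairs as augmentation candidates. The function names, the toy length-based reward, and the selection budget `k` are assumptions for illustration; the abstract does not specify MARS's actual scoring or its semantic-distance refinement, which is omitted here.

```python
from typing import Callable, List, Tuple

# A preference pair is (chosen_response, rejected_response).
Pair = Tuple[str, str]


def select_low_margin_pairs(pairs: List[Pair],
                            reward: Callable[[str], float],
                            k: int) -> List[Pair]:
    """Return the k pairs with the smallest margin reward(chosen) - reward(rejected).

    A small or negative margin means the reward model is uncertain about the
    pair or ranks it the wrong way round.
    """
    return sorted(pairs, key=lambda p: reward(p[0]) - reward(p[1]))[:k]


def toy_reward(text: str) -> float:
    """Hypothetical stand-in for a learned reward model: longer text scores higher."""
    return float(len(text))


pairs = [
    ("a detailed answer", "no"),      # margin 15: confidently ranked
    ("fine", "okay!"),                # margin -1: mis-ranked by the toy reward
    ("good reply", "bad reply"),      # margin  1: barely separated
]
picked = select_low_margin_pairs(pairs, toy_reward, k=2)
print(picked)
```

In a fuller pipeline the selected pairs would then be passed to whatever augmentation step is used; here they are only printed.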

2602.17234 2026-05-26 cs.AI cs.LG

All Leaks Count, Some Count More: Interpretable Temporal Contamination Detection and Mitigation in LLM Backtesting

所有泄漏都重要,有些泄漏更重要:LLM回测中可解释的时间污染检测与缓解

Zeyu Zhang, Ryan Chen, Bradly C. Stadie

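AI summary: Proposes Shapley-DCLR, a claim-level evaluation metric based on Shapley values, and TimeSPEC, an inference-time architecture, to detect and mitigate temporal contamination in LLM backtesting.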

Comments 8 pages plus appendix

AI Chinese abstract

对已解决事件进行回测的LLM假设模型仅基于截止前知识进行推理,然而预训练模型不可避免地泄漏截止后知识。我们引入了一个声明级评估框架,将预测理由分解为原子声明,并应用Shapley值量化每个声明的决策影响,从而得到Shapley-DCLR(Shapley加权的决策关键泄漏率),这是一个可解释的度量,用于衡量决策驱动推理中被污染的比例。我们进一步提出TimeSPEC(基于提取声明的时间监督预测),一种推理时架构,它将时间过滤的检索与声明级监督交织在一起,生成完全基于截止前证据的预测。在三个LLM上的消融实验证实了检索和监督共同必要;三项任务探测进一步说明,时间强制的性能成本与每个任务对截止后信息的依赖程度成正比。

English abstract

Backtesting LLMs on resolved events assumes models reason only from pre-cutoff knowledge, yet pretrained models inevitably leak post-cutoff knowledge. We introduce a claim-level evaluation framework that decomposes prediction rationales into atomic claims and applies Shapley values to quantify each claim's decision impact, yielding Shapley-DCLR (Shapley-weighted Decision-Critical Leakage Rate), an interpretable metric measuring what fraction of decision-driving reasoning is contaminated. We further propose TimeSPEC (Time-Supervised Prediction with Extracted Claims), an inference-time architecture that interleaves temporally filtered retrieval with claim-level supervision, producing predictions grounded entirely in pre-cutoff evidence. Across three LLMs, ablation experiments confirm that retrieval and supervision are jointly necessary, and a three-task probe further illustrates that the performance cost of temporal enforcement scales with each task's reliance on post-cutoff information.
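
To make the idea of Shapley-weighted leakage concrete, here is a minimal sketch that computes exact Shapley values for a handful of claims and then reports the share of attribution mass that falls on claims flagged as leaked. The value function, the claim names, and the ratio used in `leakage_rate` are assumptions for illustration; the abstract does not give the exact definition of Shapley-DCLR, so this should not be read as the authors' formula.

```python
from itertools import permutations
from typing import Callable, Dict, FrozenSet, List, Set


def shapley_values(claims: List[str],
                   value: Callable[[FrozenSet[str]], float]) -> Dict[str, float]:
    """Exact Shapley values: average each claim's marginal contribution over all orderings.

    Cost grows factorially with the number of claims, so this suits small examples only.
    """
    totals = {c: 0.0 for c in claims}
    orderings = list(permutations(claims))
    for order in orderings:
        seen: Set[str] = set()
        for c in order:
            before = value(frozenset(seen))
            seen.add(c)
            totals[c] += value(frozenset(seen)) - before
    return {c: totals[c] / len(orderings) for c in claims}


def leakage_rate(phi: Dict[str, float], leaked: Set[str]) -> float:
    """Assumed metric: share of absolute Shapley mass carried by leaked claims."""
    total = sum(abs(p) for p in phi.values())
    if total == 0.0:
        return 0.0
    return sum(abs(phi[c]) for c in leaked) / total


# Hypothetical additive value function: each claim adds a fixed amount of decision support.
weights = {"c1": 0.5, "c2": 0.3, "c3": 0.2}


def toy_value(subset: FrozenSet[str]) -> float:
    return sum(weights[c] for c in subset)


phi = shapley_values(list(weights), toy_value)
rate = leakage_rate(phi, leaked={"c2"})
print(phi, rate)
```

With an additive value function the Shapley value of each claim equals its weight, which makes the toy result easy to check by hand.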

2602.16229 2026-05-26 cs.LG

Factored Latent Action World Models

因子化潜在动作世界模型

Zizhao Wang, Chang Shi, Jiaheng Hu, Kevin Rohling, Roberto Martín-Martín, Amy Zhang, Peter Stone

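AI summary: Proposes the Factored Latent Action Model (FLAM), which decomposes a scene into independent factors and learns a latent action for each, improving the accuracy of multi-entity dynamics modeling and the quality of video generation from action-free video.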

AI Chinese abstract

从无动作视频中学习潜在动作已成为扩展可控世界模型学习的强大范式。潜在动作为用户迭代生成和操作视频提供了自然接口。然而,大多数现有方法依赖整体逆动态和正动态模型,学习单一潜在动作来控制整个场景,因此在多个实体同时行动的复杂环境中表现不佳。本文引入因子化潜在动作模型(FLAM),一种因子化动态框架,将场景分解为独立因子,每个因子推断自己的潜在动作并预测自己的下一步因子值。与整体模型相比,这种因子化结构能够更准确地建模复杂多实体动态,并提高无动作视频设置中的视频生成质量。基于模拟和真实世界多实体数据集的实验,我们发现FLAM在预测准确性和表示质量方面优于先前工作,并促进了下游策略学习,展示了因子化潜在动作模型的优势。

English abstract

Learning latent actions from action-free video has emerged as a powerful paradigm for scaling up controllable world model learning. Latent actions provide a natural interface for users to iteratively generate and manipulate videos. However, most existing approaches rely on monolithic inverse and forward dynamics models that learn a single latent action to control the entire scene, and therefore struggle in complex environments where multiple entities act simultaneously. This paper introduces Factored Latent Action Model (FLAM), a factored dynamics framework that decomposes the scene into independent factors, each inferring its own latent action and predicting its own next-step factor value. This factorized structure enables more accurate modeling of complex multi-entity dynamics and improves video generation quality in action-free video settings compared to monolithic models. Based on experiments on both simulation and real-world multi-entity datasets, we find that FLAM outperforms prior work in prediction accuracy and representation quality, and facilitates downstream policy learning, demonstrating the benefits of factorized latent action models.

2602.15811 2026-05-26 cs.CV cs.AI

CARL-CXR: Continual Adapter-Based Routing for Task-Unknown Chest Radiograph Classification

CARL-CXR:基于连续适配器路由的任务未知胸部X光片分类

Muthu Subash Kavitha, Anas Zafar, Amgad Muneer, Jia Wu

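AI summary: Proposes the CARL-CXR framework, which addresses incremental chest radiograph classification under task-unknown inference with a fixed high-capacity backbone, incrementally added lightweight task-specific adapters and classifier heads, and a latent task selector, significantly reducing catastrophic forgetting and improving routing accuracy.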

Comments 9 pages, 4 figures

详情
AI中文摘要

胸部X光片分类器的临床部署需要模型能够在新数据集可用时进行更新,而无需对先前观察到的数据进行重新训练或降低已验证的性能。我们研究了任务未知推理下的任务增量连续学习设置,其中异质的胸部X光数据集顺序到达,且在部署时任务身份不可用。我们提出了CARL-CXR,一个基于连续适配器的路由框架,该框架保持固定的高容量骨干网络,同时增量引入轻量级任务特定适配器和分类头。一个潜在任务选择器基于适配器条件特征进行操作,将每个输入动态路由到最相关的任务路径,利用紧凑的任务原型和特征级经验回放来在顺序更新中保留任务身份,而无需存储原始图像。在MIMIC-CXR和CheXpert两个具有不同患者群体、成像设备和注释流程的大规模数据集上的实验表明,CARL-CXR实现了最小的灾难性遗忘(AUROC下降0.012),比已建立的连续学习基线LwF和EWC分别减少了6倍和11倍,同时保持了具有竞争力的诊断性能(AUROC 0.74)。在任务未知部署下,CARL-CXR在路由准确性上比联合训练高出12.5个百分点(75.0% vs. 62.5%):与LwF和EWC不同,后者在推理时需要明确的任务标识符且不提供路由机制。

英文摘要

Clinical deployment of chest radiograph classifiers requires models that can be updated as new datasets become available without retraining on previously observed data or degrading validated performance. We study a task-incremental continual learning setting for chest radiograph classification under task-unknown inference, where heterogeneous chest X-ray datasets arrive sequentially and task identity is unavailable at deployment time. We propose CARL-CXR, a continual adapter-based routing framework that maintains a fixed high-capacity backbone while incrementally introducing lightweight task-specific adapters and classifier heads. A latent task selector operates on adapter-conditioned features to dynamically route each input to the most relevant task pathway, leveraging compact task prototypes and feature-level experience replay to preserve task identity across sequential updates without storing raw images. Experiments on MIMIC-CXR and CheXpert, two large-scale datasets with distinct patient populations, imaging devices, and annotation pipelines, demonstrate that CARL-CXR achieves minimal catastrophic forgetting (0.012 AUROC drop), a 6x and 11x reduction relative to the established continual learning baselines LwF and EWC, respectively, while maintaining competitive diagnostic performance (AUROC 0.74). Under task-unknown deployment, CARL-CXR outperforms joint training by 12.5 percentage points in routing accuracy (75.0% vs. 62.5%). LwF and EWC, by contrast, require explicit task identifiers at inference and provide no routing mechanism.
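The routing step described in this abstract can be pictured as a nearest-prototype selector. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: the function name `route_to_task`, the cosine-similarity scoring rule, and the toy two-dimensional prototypes are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def route_to_task(features_per_task, prototypes):
    """Pick the task whose stored prototype best matches that task's feature.

    features_per_task: dict task -> feature vector computed with that task's adapter
    prototypes: dict task -> compact prototype vector kept from training
    (Scoring by cosine similarity is an assumption made for this sketch.)
    """
    scores = {t: cosine(features_per_task[t], prototypes[t]) for t in prototypes}
    best = max(scores, key=scores.get)
    return best, scores

# Toy example: the input looks like MIMIC-CXR data under both adapters.
prototypes = {"mimic_cxr": [1.0, 0.0], "chexpert": [0.0, 1.0]}
features = {"mimic_cxr": [0.9, 0.1], "chexpert": [0.8, 0.2]}
task, scores = route_to_task(features, prototypes)
```

Because only prototypes and features are stored, a selector of this kind needs no raw images at inference, which matches the constraint the abstract states.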

2602.15620 2026-05-26 cs.CL cs.AI

STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens

STAPO:通过抑制稀有虚假标记稳定大语言模型的强化学习

Shiqi Liu, Zeyu He, Guojian Zhan, Letian Tao, Zhilong Zheng, Jiang Wu, Yinuo Wang, Yang Guan, Kehua Sheng, Bo Zhang, Keqiang Li, Jingliang Duan, Shengbo Eben Li

AI总结 针对强化学习微调大语言模型时因稀有虚假标记导致训练不稳定和性能崩溃的问题,提出STAPO方法,通过抑制这些标记的梯度扰动,在多个数学推理基准上实现稳定训练和性能提升。

详情
AI中文摘要

强化学习显著提升了大语言模型的推理能力,但现有的强化学习微调方法严重依赖熵正则化和重加权等启发式技术来维持稳定性。实践中,这些方法常遭遇后期性能崩溃,导致推理质量下降和训练不稳定。我们识别出这一不稳定的关键因素:一小部分标记(称为虚假标记,约占0.01%)对推理结果贡献甚微,但由于继承了完整的序列级奖励而获得不成比例放大的梯度更新。我们提出了一个统一框架,用于评估虚假风险、梯度范数和熵变化下标记级优化影响。基于对严重破坏优化的标记特征的分析,我们提出了抑制虚假标记(S2T)机制,以有效抑制其梯度扰动。将该机制融入基于组的目标中,我们提出了虚假标记感知策略优化(STAPO),促进了稳定有效的大规模模型优化。在使用Qwen 1.7B、8B和14B基础模型的六个数学推理基准上,STAPO一致展现出优越的熵稳定性,并在GRPO、20-Entropy和JustRL基础上平均性能提升11.49%($\rho_{\mathrm{T}}$=1.0, top-p=1.0)和3.73%($\rho_{\mathrm{T}}$=0.7, top-p=0.9)。

英文摘要

Reinforcement Learning (RL) has significantly improved large language model reasoning, but existing RL fine-tuning methods rely heavily on heuristic techniques such as entropy regularization and reweighting to maintain stability. In practice, they often suffer from late-stage performance collapse, leading to degraded reasoning quality and unstable training. We identify a key factor behind this instability: a small fraction of tokens, termed spurious tokens (around 0.01%), which contribute little to the reasoning outcome but receive disproportionately amplified gradient updates due to inheriting the full sequence-level reward. We present a unified framework for evaluating token-level optimization impacts across spurious risk, gradient norms, and entropy changes. Building on the analysis of token characteristics that severely disrupt optimization, we propose the Silencing Spurious Tokens (S2T) mechanism to efficiently suppress their gradient perturbations. Incorporating this mechanism into a group-based objective, we propose Spurious-Token-Aware Policy Optimization (STAPO), which promotes stable and effective large-scale model refinement. Across six mathematical reasoning benchmarks using Qwen 1.7B, 8B, and 14B base models, STAPO consistently demonstrates superior entropy stability and achieves an average performance improvement of 11.49% ($\rho_{\mathrm{T}}$=1.0, top-p=1.0) and 3.73% ($\rho_{\mathrm{T}}$=0.7, top-p=0.9) over GRPO, 20-Entropy, and JustRL.
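One way to picture "silencing" is a token-level policy-gradient loss that simply skips flagged tokens, so they no longer inherit the sequence-level reward. The sketch below is illustrative only: the abstract does not say how spurious tokens are detected, so the mask is taken as given, and averaging over the kept tokens is an assumption of this sketch rather than the paper's rule.

```python
def silenced_policy_loss(log_probs, advantage, spurious_mask):
    """Token-averaged policy-gradient loss that skips tokens flagged as spurious.

    log_probs: per-token log-probabilities of the sampled response
    advantage: sequence-level advantage shared by every token
    spurious_mask: per-token booleans, True where a token is silenced
                   (how tokens get flagged is outside this sketch)
    """
    kept = [lp for lp, flagged in zip(log_probs, spurious_mask) if not flagged]
    if not kept:
        return 0.0  # every token silenced: nothing to optimize
    return -advantage * sum(kept) / len(kept)

# The middle token is very unlikely (log-prob -6.0) and dominates the loss.
loss_all = silenced_policy_loss([-0.1, -6.0, -0.2], 1.0, [False, False, False])
loss_s2t = silenced_policy_loss([-0.1, -6.0, -0.2], 1.0, [False, True, False])
```

In the toy call, masking the single low-probability token removes most of the loss magnitude, which is the kind of disproportionate contribution the abstract attributes to a very small fraction of tokens.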

2602.11534 2026-05-26 cs.LG cs.AI

Krause Synchronization Transformers

Krause同步变换器

Jingkun Liu, Yisong Yue, Max Welling, Yue Song

AI总结 提出基于有界置信共识动力学的Krause注意力机制,通过局部化稀疏交互替代全局softmax归一化,缓解表示坍缩和注意力汇聚现象,实现线性复杂度并提升性能。

Comments ICML 2026, Project page: https://jingkun-liu.github.io/krause-sync-transformers/

详情
AI中文摘要

Transformer中的自注意力依赖于全局归一化的softmax权重,导致所有token在每一层竞争影响力。当跨深度组合时,这种交互模式会诱导强同步动力学,倾向于收敛到主导模式,这种行为与表示坍缩和注意力汇聚现象相关。我们引入了Krause注意力,一种受有界置信共识动力学启发的原则性注意力机制。Krause注意力将基于相似性的全局聚合替换为基于距离的、局部化的、选择性稀疏的交互,促进结构化的局部同步而非全局混合。我们将这种行为与最近将Transformer动力学建模为相互作用粒子系统的理论联系起来,并展示有界置信交互如何自然地调节注意力集中并缓解注意力汇聚。将交互限制在局部邻域还将运行时复杂度从序列长度的二次方降低到线性。实验上,我们在多种设置中验证了Krause注意力,包括视觉(CIFAR/ImageNet上的ViT)、自回归图像生成(MNIST/CIFAR-10)、大语言模型(Llama/Qwen)以及从零开始训练的多种规模(100M/200M)的语言模型。在这些领域中,Krause注意力在提高计算效率的同时实现了持续的性能提升,突显了有界置信动力学作为注意力的一种可扩展且有效的归纳偏置。

英文摘要

Self-attention in Transformers relies on globally normalized softmax weights, causing all tokens to compete for influence at every layer. When composed across depth, this interaction pattern induces strong synchronization dynamics that favor convergence toward a dominant mode, a behavior associated with representation collapse and attention sink phenomena. We introduce Krause Attention, a principled attention mechanism inspired by bounded-confidence consensus dynamics. Krause Attention replaces similarity-based global aggregation with distance-based, localized, and selectively sparse interactions, promoting structured local synchronization instead of global mixing. We relate this behavior to recent theory modeling Transformer dynamics as interacting particle systems, and show how bounded-confidence interactions naturally moderate attention concentration and alleviate attention sinks. Restricting interactions to local neighborhoods also reduces runtime complexity from quadratic to linear in sequence length. Empirically, we validate Krause Attention across diverse settings, including vision (ViT on CIFAR/ImageNet), autoregressive image generation (MNIST/CIFAR-10), large language models (Llama/Qwen), and language models trained from scratch at multiple scales (100M/200M). Across these domains, Krause Attention achieves consistent performance gains while improving computational efficiency, highlighting bounded-confidence dynamics as a scalable and effective inductive bias for attention.
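The bounded-confidence idea can be sketched as one averaging step in which each token only listens to nearby tokens whose representations lie within a fixed distance. This is a simplified Hegselmann-Krause-style update written for illustration, not the paper's attention layer: the function name, the uniform averaging weights, and the `radius` and `window` parameters are assumptions.

```python
import numpy as np

def bounded_confidence_attention(x, radius, window):
    """One bounded-confidence averaging step over a sequence of token vectors.

    x: (n, d) float array of token representations
    radius: confidence bound; only tokens within this distance interact
    window: number of positions on each side that a token may look at
    """
    n = x.shape[0]
    out = np.empty_like(x)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        neighbors = x[lo:hi]                        # local neighborhood only
        dist = np.linalg.norm(neighbors - x[i], axis=1)
        mask = dist <= radius                       # always includes token i itself
        out[i] = neighbors[mask].mean(axis=0)       # uniform average over trusted tokens
    return out

# Two well-separated groups of tokens stay separate instead of mixing globally.
tokens = np.array([[0.0], [0.1], [0.2], [5.0], [5.1]])
updated = bounded_confidence_attention(tokens, radius=0.5, window=2)
```

Each token touches at most `2 * window + 1` neighbors, so the cost of this sketch grows linearly with sequence length, and the distance cutoff keeps the two groups from collapsing onto a single mode.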

2602.11439 2026-05-26 cs.LG

Multi-Level Strategic Classification: Incentivizing Improvement through Promotion and Relegation Dynamics

多层级策略分类:通过晋升与降级动态激励改进

Ziyuan Huang, Lina Alkarmi, Mingyan Liu

AI总结 本文提出一种多层级晋升-降级框架,通过设计分类器阈值和难度递进来激励代理人诚实努力,并证明在温和条件下代理人可通过真实改进达到任意高水平。

Comments 9 pages, 4 figures, Accepted at ICML 2026

详情
AI中文摘要

策略分类研究自私个体或代理人操纵其响应以获得分类器有利决策结果的问题,通常当虚假行为成本低于真实努力时,他们会采取不诚实行为。虽然现有关于序列策略分类的研究主要关注优化动态分类器权重,但我们偏离这些以权重为中心的方法,分析了多层级晋升-降级框架中分类器阈值和难度递进的设计。我们的模型捕捉了由代理人的远见、技能保留以及资格与成就可自我强化的“助力效应”驱动的关键跨期激励。我们刻画了代理人的最优长期策略,并证明委托人可以设计一系列阈值来有效激励诚实努力。关键地,我们证明在温和条件下,该机制使代理人能够仅通过真实改进努力达到任意高水平。

英文摘要

Strategic classification studies settings in which self-interested individuals or agents manipulate their responses to obtain favorable decisions from classifiers, typically turning to dishonest actions when these are less costly than genuine effort. While existing studies on sequential strategic classification primarily focus on optimizing dynamic classifier weights, we depart from these weight-centric approaches by analyzing the design of classifier thresholds and difficulty progression within a multi-level promotion-relegation framework. Our model captures the critical inter-temporal incentives driven by an agent's farsightedness, skill retention, and a leg-up effect where qualification and attainment can be self-reinforcing. We characterize the agent's optimal long-term strategy and demonstrate that a principal can design a sequence of thresholds to effectively incentivize honest effort. Crucially, we prove that under mild conditions, this mechanism enables agents to reach arbitrarily high levels solely through genuine improvement efforts.

2602.08499 2026-05-26 cs.LG cs.AI

Contextual Rollout Bandits for Reinforcement Learning with Verifiable Rewards

上下文展开赌博机:面向可验证奖励的强化学习

Xiaodong Lu, Xiaohan Wang, Jiajun Chai, Guojun Yin, Wei Lin, Zhijun Chen, Yu Luo, Fuzhen Zhuang, Yikun Ban, Deqing Wang

AI总结 针对RLVR中展开使用无差别、短视导致的问题,提出上下文赌博机框架,自适应选择高价值展开,提升训练效率与性能。

详情
AI中文摘要

可验证奖励的强化学习(RLVR)是提升大型语言模型推理能力的有效范式。然而,现有RLVR方法以无差别和短视的方式使用展开:每个提示内不同质量的响应被统一对待,且历史展开在单次使用后被丢弃。这导致监督噪声大、样本效率低以及策略更新次优。我们通过将RLVR中的展开调度形式化为上下文赌博机问题,并提出一个统一的神经调度框架来解决这些问题,该框架在整个训练过程中自适应地选择高价值展开。每个展开被视为一个臂,其奖励由连续优化步骤之间诱导的性能增益定义。由此产生的调度器支持噪声感知的组内选择和历史展开的自适应全局重用,所有这些都在一个统一的原则性框架内。我们通过推导次线性遗憾界并证明扩大展开缓冲区可改善可实现性能上限,提供了理论依据。在六个数学推理基准上的实验表明,在多种RLVR优化方法中,性能和训练效率均有一致的提升。

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) is an effective paradigm for improving the reasoning capabilities of large language models. However, existing RLVR methods utilize rollouts in an indiscriminate and short-horizon manner: responses of heterogeneous quality within each prompt are treated uniformly, and historical rollouts are discarded after a single use. This leads to noisy supervision, poor sample efficiency, and suboptimal policy updates. We address these issues by formulating rollout scheduling in RLVR as a contextual bandit problem and proposing a unified neural scheduling framework that adaptively selects high-value rollouts throughout training. Each rollout is treated as an arm whose reward is defined by the induced performance gain between consecutive optimization steps. The resulting scheduler supports both noise-aware intra-group selection and adaptive global reuse of historical rollouts within a single principled framework. We provide theoretical justification by deriving sublinear regret bounds and showing that enlarging the rollout buffer improves the achievable performance upper bound. Experiments on six mathematical reasoning benchmarks demonstrate consistent gains in performance and training efficiency across multiple RLVR optimization methods.

2602.08426 2026-05-26 cs.CL cs.AI cs.CV

Prism: Spectral-Aware Block-Sparse Attention

Prism: 频谱感知的块稀疏注意力

Xinghao Wang, Pengyu Wang, Xiaoran Liu, Fangxu Liu, Jason Chu, Kai Song, Xipeng Qiu

AI总结 针对长上下文LLM预填充中块稀疏注意力的块选择效率瓶颈,提出无训练频谱感知方法Prism,通过高低频分支分解和能量温度校准恢复位置信号,实现纯块级重要性估计,在保持精度同时实现高达5.1倍加速。

Comments ICML 2026

详情
AI中文摘要

块稀疏注意力有望加速长上下文LLM的预填充,但高效识别相关块仍是瓶颈。现有方法通常采用粗粒度注意力作为块重要性估计的代理,但往往诉诸昂贵的令牌级搜索或评分,导致显著的选择开销。在本工作中,我们将通过均值池化的标准粗粒度注意力的不准确性追溯到一个理论根源:均值池化与旋转位置嵌入(RoPE)之间的交互。我们证明均值池化充当低通滤波器,在高频维度上引起破坏性干扰,有效造成局部位置信息(如斜线模式)的“盲点”。为解决此问题,我们引入Prism,一种无训练的频谱感知方法,将块选择分解为高频和低频分支。通过应用基于能量的温度校准,Prism直接从池化表示中恢复衰减的位置信号,使得仅使用块级操作即可进行块重要性估计,从而提高效率。大量评估证实,Prism在保持与全注意力精度相当的同时,实现了高达$\mathbf{5.1\times}$的加速。

英文摘要

Block-sparse attention is promising for accelerating long-context LLM pre-filling, yet identifying relevant blocks efficiently remains a bottleneck. Existing methods typically employ coarse-grained attention as a proxy for block importance estimation, but often resort to expensive token-level searching or scoring, resulting in significant selection overhead. In this work, we trace the inaccuracy of standard coarse-grained attention via mean pooling to a theoretical root cause: the interaction between mean pooling and Rotary Positional Embeddings (RoPE). We prove that mean pooling acts as a low-pass filter that induces destructive interference in high-frequency dimensions, effectively creating a "blind spot" for local positional information (e.g., slash patterns). To address this, we introduce Prism, a training-free spectral-aware approach that decomposes block selection into high-frequency and low-frequency branches. By applying energy-based temperature calibration, Prism restores the attenuated positional signals directly from pooled representations, enabling block importance estimation using purely block-level operations, thereby improving efficiency. Extensive evaluations confirm that Prism maintains accuracy parity with full attention while delivering up to $\mathbf{5.1\times}$ speedup.
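The low-pass claim can be checked numerically in a stripped-down setting. The sketch below assumes a single rotary frequency `theta` and a key that is identical at every position in the block, so that mean pooling reduces to averaging the unit rotations exp(i * theta * p); both simplifications and the function names are assumptions made for illustration, not the paper's analysis.

```python
import cmath
import math

def pooled_magnitude(theta, block_size):
    """Magnitude of the mean of unit rotations exp(i*theta*p), p = 0..block_size-1."""
    mean = sum(cmath.exp(1j * theta * p) for p in range(block_size)) / block_size
    return abs(mean)

def closed_form(theta, block_size):
    """Closed form of the same quantity: |sin(B*theta/2) / (B*sin(theta/2))|.

    Valid when sin(theta/2) != 0.
    """
    return abs(math.sin(block_size * theta / 2) / (block_size * math.sin(theta / 2)))

block = 64
low = pooled_magnitude(1e-4, block)   # slowly rotating dimension: almost fully preserved
high = pooled_magnitude(1.0, block)   # fastest RoPE dimension (theta = 1): mostly cancelled
```

With a block of 64 tokens, the slow dimension keeps nearly all of its magnitude while the fast one retains only a few percent, which is the destructive interference in high-frequency dimensions that the abstract describes.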

2602.08355 2026-05-26 cs.CV

E-VAds: An E-commerce Short Videos Understanding Benchmark for MLLMs

E-VAds:面向多模态大语言模型的电商短视频理解基准

Xianjie Liu, Yiman Hu, Liang Wu, Ping Hu, Yixiong Zou, Jian Xu, Bo Zheng

AI总结 提出电商短视频理解基准E-VAds,通过多模态信息密度评估框架量化领域复杂性,并构建多智能体生成的问答数据集,最后开发基于强化学习的推理模型E-VAds-R1,在商业意图推理上实现109.2%的性能提升。

Comments Accepted by ICML2026

详情
AI中文摘要

电商短视频代表了在线视频行业中高收入的细分领域,其特点是目标驱动的格式和密集的多模态信号。当前模型通常难以处理这些视频,因为现有基准主要关注通用任务,忽略了商业意图的推理。在这项工作中,我们首先提出了一个多模态信息密度评估框架,以量化该领域的复杂性。我们的评估显示,与主流数据集相比,电商内容在视觉、音频和文本模态上表现出显著更高的密度,为视频理解建立了更具挑战性的前沿。为了弥补这一差距,我们引入了电商视频广告基准(E-VAds),这是首个专门为电商短视频理解设计的基准。我们从淘宝精选了3,961个高质量视频,涵盖广泛的产品类别,并使用多智能体系统生成了19,785个开放式问答对。这些问题被组织成两个主要维度,即感知与认知和推理,包含五个不同的任务。最后,我们开发了E-VAds-R1,一个基于强化学习的推理模型,具有称为MG-GRPO的多粒度奖励设计。该策略为早期探索提供平滑指导,同时为专家级精度创造非线性激励。实验结果表明,E-VAds-R1在仅使用几百个训练样本的情况下,在商业意图推理上实现了109.2%的性能提升。

英文摘要

E-commerce short videos represent a high-revenue segment of the online video industry characterized by a goal-driven format and dense multi-modal signals. Current models often struggle with these videos because existing benchmarks focus primarily on general-purpose tasks and neglect the reasoning of commercial intent. In this work, we first propose a multi-modal information density assessment framework to quantify the complexity of this domain. Our evaluation reveals that e-commerce content exhibits substantially higher density across visual, audio, and textual modalities compared to mainstream datasets, establishing a more challenging frontier for video understanding. To address this gap, we introduce E-commerce Video Ads Benchmark (E-VAds), which is the first benchmark specifically designed for e-commerce short video understanding. We curated 3,961 high-quality videos from Taobao covering a wide range of product categories and used a multi-agent system to generate 19,785 open-ended Q&A pairs. These questions are organized into two primary dimensions, namely Perception and Cognition and Reasoning, which consist of five distinct tasks. Finally, we develop E-VAds-R1, an RL-based reasoning model featuring a multi-grained reward design called MG-GRPO. This strategy provides smooth guidance for early exploration while creating a non-linear incentive for expert-level precision. Experimental results demonstrate that E-VAds-R1 achieves a 109.2% performance gain in commercial intent reasoning with only a few hundred training samples.

2602.06717 2026-05-26 cs.LG cs.AI

F-GRPO: Don't Let Your Policy Learn the Obvious and Forget the Rare

F-GRPO: 别让你的策略学到显而易见的而忘记罕见的

Daniil Plyusov, Alexey Gorbatovski, Boris Shaposhnikov, Viacheslav Sinii, Alexey Malakhov, Daria Korotyshova, Daniil Gavrilov

AI总结 针对强化学习中有限采样组导致罕见正确轨迹被忽略的问题,提出基于Focal loss的难度感知缩放系数F-GRPO,在不增加组大小和计算成本下提升数学推理性能。

详情
AI中文摘要

基于可验证奖励的强化学习通常依赖组采样来估计优势并稳定策略更新。实践中,计算限制往往排除非常大的组,因此训练使用有限的rollout集合,这些集合只能强化它们暴露的正确行为。在实际组大小下,更新可能会遗漏罕见的正确轨迹,同时仍然包含混合奖励,将概率集中在更常见的采样解上。我们推导了这种提示局部尾部遗漏事件作为组大小函数的概率,展示了非单调行为,并在分类抽象中描述了未采样的正确质量如何在总正确质量增长时缩小。受此分析启发,我们提出了一种难度感知缩放系数,灵感来自Focal loss,它降低了高成功采样组的更新权重。经验上,分类模拟在分类设置中展示了相同效果,Maze提供了单解测试,LLM实验包括代表性的GRPO组大小扫描以及GRPO、DAPO和CISPO之间的固定N迁移。在Qwen2.5-7B上,N=8时,我们的方法将平均数学pass@256从64.1提高到70.3(GRPO),69.3提高到72.5(DAPO),73.2提高到76.8(CISPO);在所有三种情况下,OOD pass@256也得到改善,且不增加组大小或计算成本。

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) is commonly based on group sampling to estimate advantages and stabilize policy updates. In practice, computational limits often rule out very large groups, so training proceeds with finite rollout sets that can reinforce only the correct behavior they expose. At practical group sizes, updates can miss rare-correct trajectories while still containing mixed rewards, concentrating probability on more common sampled solutions. We derive the probability of such prompt-local tail-miss events as a function of group size, showing non-monotonic behavior, and in the categorical abstraction characterize how unsampled-correct mass can shrink even as total correct mass grows. Motivated by this analysis, we propose a difficulty-aware scaling coefficient, inspired by Focal loss, that down-weights updates on high-success sampled groups. Empirically, categorical simulation illustrates the same effect in the categorical setting, Maze provides a single-solution test, and LLM experiments include a representative GRPO group-size sweep together with fixed-$N$ transfer across GRPO, DAPO, and CISPO. On Qwen2.5-7B at $N{=}8$, our method improves average math pass@256 from 64.1 $\rightarrow$ 70.3 (GRPO), 69.3 $\rightarrow$ 72.5 (DAPO), and 73.2 $\rightarrow$ 76.8 (CISPO); OOD pass@256 also improves in all three cases, without increasing group size or computational cost.
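A difficulty-aware coefficient in the spirit of Focal loss could look like the sketch below. The abstract does not give the exact formula, so the form `(1 - success_rate) ** gamma`, the default `gamma`, and the function names are assumptions, and the advantages use the usual group-standardized form.

```python
def focal_weight(rewards, gamma=2.0):
    """Down-weight groups whose sampled success rate is already high.

    rewards: list of 0/1 verifiable rewards for one prompt's rollout group
    gamma: focusing exponent (hypothetical default, borrowed from Focal loss)
    """
    success_rate = sum(rewards) / len(rewards)
    return (1.0 - success_rate) ** gamma

def scaled_advantages(rewards, gamma=2.0):
    """Group-standardized advantages multiplied by the focal weight."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    if std == 0.0:                 # all-correct or all-wrong group: no learning signal
        return [0.0] * n
    w = focal_weight(rewards, gamma)
    return [w * (r - mean) / std for r in rewards]

easy = scaled_advantages([1, 1, 1, 1, 1, 1, 1, 0])   # 7 of 8 rollouts correct
hard = scaled_advantages([1, 0, 0, 0, 0, 0, 0, 0])   # 1 of 8 rollouts correct
```

Under this assumed form, the mostly solved prompt contributes a much smaller update than the hard prompt, so probability mass is pushed less aggressively toward solutions that are already common.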

2602.06508 2026-05-26 cs.RO

World-VLA-Loop: Closed-Loop Learning of Video World Model and VLA Policy

World-VLA-Loop: 视频世界模型与VLA策略的闭环学习

Xiaokang Liu, Zechen Bai, Hai Ci, Kevin Yuchen Ma, Mike Zheng Shou

AI总结 提出World-VLA-Loop框架,通过状态感知视频世界模型联合预测未来帧和二元奖励,并采用协同进化范式迭代优化VLA策略,减少对真实环境交互的依赖。

Comments 16 pages, 9 figures

详情
AI中文摘要

强化学习(RL)可以超越行为克隆,优化视觉-语言-动作(VLA)策略,但由于需要大量 rollout、重置、监督和安全风险,真实世界的RL仍然昂贵。基于动作条件的视频世界模型提供了在虚拟环境中训练的选项,但它们在精确的动作跟随方面表现不佳,尤其是在细微的接近成功失败情况下。此外,它们缺乏用于RL的原生奖励信号。基于不准确的视觉预测计算奖励仍然不可靠。我们引入了World-VLA-Loop,它围绕两个基础设计和一个更高级别的协同进化范式构建。我们首先策划了SANS,专门混合成功和接近成功的轨迹,以改善动作-结果对齐。然后,我们训练了一个状态感知视频世界模型,该模型从扩散潜变量中联合预测未来帧和二元奖励。它将奖励估计与生成器耦合,而不是单独模块,从而反过来有利于视觉预测。由于RL过程中VLA行为会发生变化,固定的模拟器可能与更新后的策略不对齐,因此World-VLA-Loop通过使用精炼的世界模型进行迭代VLA后训练,同时将每个改进策略的rollout反馈回来增强和微调世界模型,从而形成闭环。在仿真和真实机器人实验中,World-VLA-Loop显著提高了VLA性能,同时减少了对昂贵的物理交互的依赖。

英文摘要

Reinforcement learning (RL) can refine Vision-Language-Action (VLA) policies beyond behavior cloning, but real-world RL remains expensive due to extensive rollouts, resets, supervision, and safety risks. Action-conditioned video world models offer an option to train in virtual environments, yet they exhibit imprecise action following, particularly on subtle near-success failures. They also lack native reward signals for RL, and computing rewards from inaccurate visual predictions remains unreliable. We introduce World-VLA-Loop, structured around two foundational designs and a higher-level co-evolving paradigm. We first curate SANS, which deliberately mixes successful and near-success trajectories to improve action-outcome alignment. Then, we train a state-aware video world model that jointly predicts future frames and binary rewards from diffusion latents. Coupling reward estimation to the generator, rather than delegating it to a separate module, in turn benefits visual prediction. Since VLA behavior shifts during RL, a fixed simulator can misalign with the updated policy. World-VLA-Loop therefore closes the loop by using the refined world model for iterative VLA post-training while feeding rollouts from each improved policy back to augment and fine-tune the world model. Across simulation and real-robot experiments, World-VLA-Loop substantially improves VLA performance while reducing reliance on costly physical interaction.

2602.05052 2026-05-26 cs.LG

Learning, Solving and Optimizing PDEs with TensorGalerkin: an efficient high-performance Galerkin assembly algorithm

使用TensorGalerkin学习、求解和优化PDE:一种高效的高性能Galerkin组装算法

Shizheng Wen, Mingyuan Chi, Tianwei Yu, Ben Moseley, Mike Yan Michelis, Pu Ren, Hao Sun, Siddhartha Mishra

AI总结 提出基于Galerkin离散化的统一算法框架,通过张量化元素操作和稀疏矩阵乘法实现O(1)图规模的系统组装,高效求解、约束优化和物理信息学习变分PDE。

详情
AI中文摘要

我们提出了一个统一的算法框架,用于具有变分结构的PDE的数值求解、约束优化和物理信息学习。该框架基于底层变分形式的Galerkin离散化,其高效率源于一种新颖的高度优化且兼容GPU的TensorGalerkin框架,用于线性系统组装(刚度矩阵和载荷向量)。TensorGalerkin通过在Python级Map阶段张量化元素操作,然后使用稀疏矩阵乘法进行全局归约,该乘法在网格诱导的稀疏图上执行消息传递。Map和Reduce阶段在PyTorch的autograd内部协同设计,使得组装图包含O(1)个节点,无论元素数量和局部自由度如何缩放。我们通过将TensorGalerkin部署为i)高效的数值PDE求解器,ii)用于PDE约束优化的端到端可微框架,以及iii)用于PDE的物理信息算子学习算法,验证了这种O(1)图属性。通过多个基准测试,包括非结构化网格上的2D和3D椭圆、抛物线和双曲PDE,我们证明了所提出的框架在所有目标下游应用中相比各种基线提供了显著的计算效率和精度提升。

英文摘要

We present a unified algorithmic framework for the numerical solution, constrained optimization, and physics-informed learning of PDEs with a variational structure. Our framework is based on a Galerkin discretization of the underlying variational forms, and its high efficiency stems from a novel, highly optimized, and GPU-compliant TensorGalerkin framework for linear system assembly (stiffness matrices and load vectors). TensorGalerkin operates by tensorizing element-wise operations within a Python-level Map stage and then performing global reduction with a sparse matrix multiplication that carries out message passing on the mesh-induced sparsity graph. The Map and Reduce stages are co-designed inside PyTorch's autograd so that the assembly graph contains $O(1)$ nodes regardless of how the number of elements and local DoFs scale. We validate this $O(1)$-graph property by deploying TensorGalerkin downstream as i) a highly efficient numerical PDE solver, ii) an end-to-end differentiable framework for PDE-constrained optimization, and iii) a physics-informed operator learning algorithm for PDEs. With multiple benchmarks, including 2D and 3D elliptic, parabolic, and hyperbolic PDEs on unstructured meshes, we demonstrate that the proposed framework provides significant computational efficiency and accuracy gains over a variety of baselines in all the targeted downstream applications.
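The Map and Reduce stages can be illustrated on the smallest possible case, linear elements for -u'' on a 1D mesh. The sketch below uses NumPy rather than PyTorch and a dense 0/1 routing matrix where a real implementation would use a sparse one; the function name and the 1D setting are assumptions chosen for readability, not the authors' code.

```python
import numpy as np

def assemble_stiffness_1d(nodes):
    """Assemble the P1 stiffness matrix for -u'' on a 1D mesh via Map and Reduce."""
    nodes = np.asarray(nodes, dtype=float)
    n_nodes = nodes.size
    n_elem = n_nodes - 1
    # Connectivity: element e joins nodes e and e + 1, shape (E, 2).
    conn = np.stack([np.arange(n_elem), np.arange(1, n_nodes)], axis=1)

    # Map: all element matrices at once, shape (E, 2, 2), no loop over elements.
    h = nodes[conn[:, 1]] - nodes[conn[:, 0]]
    reference = np.array([[1.0, -1.0], [-1.0, 1.0]])
    k_local = reference[None, :, :] / h[:, None, None]

    # Reduce: one matrix product with a 0/1 routing matrix that sends each local
    # entry to its global (row, col) slot. Dense here for readability only.
    rows = np.repeat(conn, 2, axis=1).reshape(-1)   # [a, a, b, b] per element
    cols = np.tile(conn, (1, 2)).reshape(-1)        # [a, b, a, b] per element
    routing = np.zeros((n_nodes * n_nodes, 4 * n_elem))
    routing[rows * n_nodes + cols, np.arange(4 * n_elem)] = 1.0
    return (routing @ k_local.reshape(-1)).reshape(n_nodes, n_nodes)

K = assemble_stiffness_1d([0.0, 0.5, 1.0])
```

The whole assembly is two array expressions and one matrix product, so the number of recorded operations does not grow with the number of elements, which is the property the abstract calls the $O(1)$ graph.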

2602.04279 2026-05-26 cs.CL

ECG-R1: Protocol-Guided and Modality-Agnostic MLLM for Reliable ECG Interpretation

ECG-R1: 协议引导且模态无关的可靠心电图解读多模态大语言模型

Jiarui Jin, Haoyu Wang, Xingliang Wu, Xiaocheng Fang, Xiang Lan, Zihan Wang, Deyun Zhang, Bo Liu, Yingying Zhang, Xian Wu, Hongyan Li, Shenda Hong

AI总结 提出ECG-R1,通过协议引导数据生成、模态解耦架构和强化学习,实现可靠的心电图解读。

Comments Accepted to ICML 2026

详情
AI中文摘要

心电图(ECG)在临床实践中是一种不可或缺的诊断工具,然而现有的多模态大语言模型(MLLMs)在心电图解读方面仍不可靠,常常产生看似合理但临床错误的解读。为了解决这一问题,我们提出了ECG-R1,这是首个通过三项创新设计用于可靠心电图解读的推理型ECG MLLM。首先,我们利用\textit{协议引导的指令数据生成}构建解读语料库,将解读基于可测量的ECG特征以及专著定义的定量阈值和诊断逻辑。其次,我们提出了一种模态解耦架构,采用\textit{交错模态丢弃},以提高当ECG信号或ECG图像缺失时的鲁棒性和跨模态一致性。第三,我们提出了\textit{带有ECG诊断证据奖励的强化学习},以加强基于证据的ECG解读。此外,我们系统评估了专有、开源和医疗MLLM的心电图解读能力,并首次提供了定量证据表明严重的幻觉普遍存在,这表明公众不应在没有独立验证的情况下直接信任这些输出。代码可在\href{https://github.com/PKUDigitalHealth/ECG-R1}{此处}获取。

英文摘要

Electrocardiography (ECG) serves as an indispensable diagnostic tool in clinical practice, yet existing multimodal large language models (MLLMs) remain unreliable for ECG interpretation, often producing plausible but clinically incorrect analyses. To address this, we propose ECG-R1, the first reasoning ECG MLLM designed for reliable ECG interpretation via three innovations. First, we construct the interpretation corpus using \textit{Protocol-Guided Instruction Data Generation}, grounding interpretation in measurable ECG features and monograph-defined quantitative thresholds and diagnostic logic. Second, we present a modality-decoupled architecture with \textit{Interleaved Modality Dropout} to improve robustness and cross-modal consistency when either the ECG signal or ECG image is missing. Third, we present \textit{Reinforcement Learning with ECG Diagnostic Evidence Rewards} to strengthen evidence-grounded ECG interpretation. Additionally, we systematically evaluate the ECG interpretation capabilities of proprietary, open-source, and medical MLLMs, and provide the first quantitative evidence that severe hallucinations are widespread, suggesting that the public should not directly trust these outputs without independent verification. Code is available at \href{https://github.com/PKUDigitalHealth/ECG-R1}{here}.

2602.04139 2026-05-26 cs.LG physics.comp-ph

Generative Neural Operators through Diffusion Last Layer

通过扩散最后一层的生成式神经算子

Sungwon Park, Anthony Zhou, Hongjoong Kim, Amir Barati Farimani

AI总结 提出扩散最后一层(DLL)作为神经算子的概率输出头,通过Karhunen-Loève展开和系数空间的条件扩散模型实现高效分布建模,在随机PDE基准和确定性长时滚动任务中提升了分布保真度和不确定性估计。

Comments ICML 2026, code is available at https://github.com/sungwpark/dll-no

详情
AI中文摘要

神经算子为学习函数空间之间的离散化不变映射提供了强大框架,但标准确定性模型无法捕捉预测不确定性。我们引入了扩散最后一层(DLL),一种用于神经算子主干的模块化概率输出头。DLL通过受Karhunen-Loève展开启发的输入依赖低秩展开表示目标场,并在相应系数空间上学习条件扩散模型。这种设计使得在保留算子学习结构优势的同时实现高效的分布建模。在具有随机强迫的随机PDE基准测试中,DLL实现了强分布保真度,并与像素空间和传统潜在扩散基线竞争。在确定性长时滚动任务中,DLL提高了底层主干的滚动稳定性,并在复合自回归误差下提供了有用的预测不确定性估计。这些结果表明,在学习到的系数空间中进行扩散建模为不确定性感知神经算子提供了一条实用途径。

英文摘要

Neural operators provide a powerful framework for learning discretization-invariant mappings between function spaces, but standard deterministic models do not capture predictive uncertainty. We introduce diffusion last layer (DLL), a modular probabilistic output head for neural operator backbones. DLL represents target fields through an input-dependent low-rank expansion inspired by the Karhunen-Loève expansion and learns a conditional diffusion model over the corresponding coefficient space. This design enables efficient distributional modeling while preserving the structural advantages of operator learning. On stochastic PDE benchmarks with random forcing, DLL achieves strong distributional fidelity and performs competitively with pixel-space and conventional latent diffusion baselines. In deterministic long-horizon rollout tasks, DLL improves rollout stability over the underlying backbone and provides useful estimates of predictive uncertainty under compounding autoregressive errors. These results suggest that diffusion modeling in learned coefficient spaces offers a practical route to uncertainty-aware neural operators.
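Read loosely, the representation described in this abstract has the shape below. The notation is an assumption introduced for illustration and is not taken from the paper: $a$ is the input function, $r$ the rank, $\phi_k$ the input-dependent basis functions, $c_k$ the coefficients, and $\theta$ the parameters of the conditional diffusion model.

```latex
u(x) \;\approx\; \sum_{k=1}^{r} c_k \, \phi_k(x; a),
\qquad
c = (c_1, \dots, c_r) \sim p_\theta(c \mid a)
```

Under this reading, the backbone supplies the basis and the diffusion model only has to sample an $r$-dimensional coefficient vector rather than a full field, which would account for the efficiency the abstract claims.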

2602.04120 2026-05-26 cs.LG cs.AI cs.DC cs.SE

Scalable Explainability-as-a-Service (XaaS) for Edge AI Systems

面向边缘AI系统的可扩展可解释性即服务(XaaS)

Samaresh Kumar Singh, Joyjit Roy

AI总结 提出可解释性即服务(XaaS)分布式架构,通过解耦推理与解释生成、语义缓存、轻量验证和自适应引擎,在边缘设备上实现低延迟、高保真的可解释性,并在三个实际用例中降低38%延迟。

Comments 8 pages, 5 figures, 2 tables. This version updates metadata after publication in IEEE Xplore and publication by SoutheastCon 2026

详情
Journal ref
2026 IEEE SoutheastCon, Huntsville, AL, USA, 2026
AI中文摘要

尽管可解释人工智能(XAI)取得了显著进展,但其在边缘和物联网系统中的集成通常是临时且低效的。当前大多数方法以“耦合”方式运行,即解释生成与模型推理同时进行。因此,这些方法在异构边缘设备上部署时会产生冗余计算、高延迟和可扩展性差的问题。本文提出可解释性即服务(XaaS),一种将可解释性视为一等系统服务(而非模型特定功能)的分布式架构。我们提出的XaaS架构的关键创新在于解耦推理与解释生成,使边缘设备能够在资源和延迟约束下请求、缓存和验证解释。为此,我们引入三项主要创新:(1)基于语义相似性的分布式解释缓存检索方法,显著减少冗余计算;(2)轻量验证协议,确保缓存和新生成解释的保真度;(3)自适应解释引擎,根据设备能力和用户需求选择解释方法。我们在三个实际边缘AI用例上评估了XaaS的性能:(i)制造质量控制;(ii)自动驾驶车辆感知;(iii)医疗诊断。实验结果表明,XaaS在三个实际部署中延迟降低38%,同时保持高解释质量。总体而言,本工作使得在大规模异构物联网系统中部署透明和可问责的AI成为可能,并弥合了XAI研究与边缘实用性之间的差距。

英文摘要

Though Explainable AI (XAI) has advanced significantly, its integration into edge and IoT systems is typically ad hoc and inefficient. Most current methods are "coupled": they generate explanations simultaneously with model inference. As a result, these approaches incur redundant computation, high latency, and poor scalability when deployed across heterogeneous edge devices. In this work, we propose Explainability-as-a-Service (XaaS), a distributed architecture that treats explainability as a first-class system service rather than a model-specific feature. The key innovation of the XaaS architecture is that it decouples inference from explanation generation, allowing edge devices to request, cache, and verify explanations subject to resource and latency constraints. To achieve this, we introduce three main innovations: (1) a distributed explanation cache with a semantic-similarity-based retrieval method that significantly reduces redundant computation; (2) a lightweight verification protocol that ensures the fidelity of both cached and newly generated explanations; and (3) an adaptive explanation engine that chooses explanation methods based on device capability and user requirements. We evaluate XaaS on three real-world edge AI use cases: (i) manufacturing quality control; (ii) autonomous vehicle perception; and (iii) healthcare diagnostics. Experimental results show that XaaS reduces latency by 38% while maintaining high explanation quality across the three deployments. Overall, this work enables the deployment of transparent and accountable AI across large-scale, heterogeneous IoT systems and bridges the gap between XAI research and edge practicality.
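The first of the three components, a cache keyed by semantic similarity, can be sketched in a few lines. Everything specific here is hypothetical: the class name `ExplanationCache`, the cosine-similarity cutoff of 0.95, and the use of plain embedding vectors as keys are assumptions for illustration, not details from the paper.

```python
import math

class ExplanationCache:
    """Toy semantic cache: reuse a stored explanation when a new input is close enough."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold   # hypothetical cosine-similarity cutoff
        self.entries = []            # list of (embedding, explanation) pairs

    @staticmethod
    def _cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    def lookup(self, embedding):
        """Return the most similar cached explanation above the cutoff, or None."""
        best, best_sim = None, self.threshold
        for stored, explanation in self.entries:
            sim = self._cosine(embedding, stored)
            if sim >= best_sim:
                best, best_sim = explanation, sim
        return best

    def store(self, embedding, explanation):
        self.entries.append((list(embedding), explanation))

cache = ExplanationCache(threshold=0.95)
cache.store([1.0, 0.0, 0.0], "saliency map A")
hit = cache.lookup([0.99, 0.05, 0.0])   # near-duplicate input: served from cache
miss = cache.lookup([0.0, 1.0, 0.0])    # unrelated input: explanation must be generated
```

On a miss the device would fall back to generating a fresh explanation and storing it; the verification protocol the abstract mentions would sit between `lookup` and the consumer, and is not modeled here.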

2602.02979 2026-05-26 cs.CL cs.LG

CPMobius: Iterative Coach-Player Reasoning for Data-Free Reinforcement Learning

CPMobius: 无数据强化学习的迭代式教练-玩家推理

Ran Li, Zeyuan Liu, Yinghao Chen, Bingxiang He, Jiarui Yuan, Zixuan Fu, Weize Chen, Jinyi Hu, Chen Qian, Zhiyuan Liu, Maosong Sun

AI总结 提出CPMobius协作式教练-玩家范式,通过无外部数据的合作优化循环提升数学推理能力,在Qwen2.5-Math-7B-Instruct上总体准确率提升4.9%,OOD准确率提升5.4%。

Comments Accepted to the ICML 2026

详情
AI中文摘要

大型语言模型(LLMs)在复杂推理方面展现出强大潜力,但其进展仍从根本上受限于对大规模高质量人工策划任务和标签的依赖,无论是通过监督微调(SFT)还是基于推理特定数据的强化学习(RL)。这种依赖使得监督密集型训练范式日益不可持续,实践中已出现可扩展性减弱的迹象。为克服这一限制,我们引入了CPMöbius(CPMobius),一种用于推理模型无数据强化学习的协作式教练-玩家范式。与传统对抗性自博弈不同,CPMöbius受现实世界人类体育协作和多智能体协作启发,将教练和玩家视为独立但合作的角色。教练针对玩家的能力提出指令,并根据玩家表现的变化获得奖励,而玩家则因解决教练生成的越来越有指导性的任务而获得奖励。这种合作优化循环旨在直接提升玩家的数学推理能力。值得注意的是,CPMöbius在不依赖任何外部训练数据的情况下实现了显著改进,优于现有的无监督方法。例如,在Qwen2.5-Math-7B-Instruct上,我们的方法总体准确率平均提升4.9%,分布外(OOD)准确率平均提升5.4%,总体准确率超过RENT 1.5%,OOD准确率超过R-zero 4.2%。我们的代码库已在https://github.com/thunlp/CPMobius发布。

英文摘要

Large Language Models (LLMs) have demonstrated strong potential in complex reasoning, yet their progress remains fundamentally constrained by reliance on massive high-quality human-curated tasks and labels, whether through supervised fine-tuning (SFT) or reinforcement learning (RL) on reasoning-specific data. This dependence renders supervision-heavy training paradigms increasingly unsustainable, with signs of diminishing scalability already evident in practice. To overcome this limitation, we introduce CPMöbius (CPMobius), a collaborative Coach-Player paradigm for data-free reinforcement learning of reasoning models. Unlike traditional adversarial self-play, CPMöbius, inspired by real-world human sports collaboration and multi-agent collaboration, treats the Coach and Player as independent but cooperative roles. The Coach proposes instructions targeted at the Player's capability and receives rewards based on changes in the Player's performance, while the Player is rewarded for solving the increasingly instructive tasks generated by the Coach. This cooperative optimization loop is designed to directly enhance the Player's mathematical reasoning ability. Remarkably, CPMöbius achieves substantial improvement without relying on any external training data, outperforming existing unsupervised approaches. For example, on Qwen2.5-Math-7B-Instruct, our method improves accuracy by an overall average of +4.9 and an out-of-distribution average of +5.4, exceeding RENT by +1.5 on overall accuracy and R-zero by +4.2 on OOD accuracy. Our codebase has been released at https://github.com/thunlp/CPMobius.

2602.02495 2026-05-26 cs.CL cs.AI cs.LG

Reward-free Alignment for Conflicting Objectives

无奖励的冲突目标对齐

Peter Chen, Xiaopeng Li, Xi Chen, Tianyi Lin

AI总结 提出RACO框架,通过冲突规避梯度下降的裁剪变体直接利用成对偏好数据解决多目标冲突,实现帕累托最优对齐。

Comments Accepted to ICML 2026 (Oral)

详情
AI中文摘要

直接对齐方法越来越多地用于将大型语言模型(LLMs)与人类偏好对齐。然而,许多现实世界的对齐问题涉及多个相互冲突的目标,简单的偏好聚合可能导致训练不稳定和糟糕的权衡。特别是,加权损失方法可能无法识别同时改善所有目标的更新方向,而现有的多目标方法通常依赖显式奖励模型,增加了额外复杂性并扭曲了用户指定的偏好。本文的贡献有两方面。首先,我们提出了一种用于冲突目标的无奖励对齐框架(RACO),该框架直接利用成对偏好数据,并通过一种新颖的冲突规避梯度下降的裁剪变体解决梯度冲突。我们提供了收敛到尊重用户指定目标权重的帕累托临界点的保证,并进一步证明在双目标设置中裁剪可以严格改善收敛速度。其次,我们使用一些启发式方法改进了我们的方法,并进行了实验,以证明所提框架在LLM对齐中的兼容性。在多个LLM家族(Qwen 3、Llama 3、Gemma 3)上的多目标摘要和安全对齐任务的定性和定量评估表明,与现有的多目标对齐基线相比,我们的方法始终能实现更好的帕累托权衡。

英文摘要

Direct alignment methods are increasingly used to align large language models (LLMs) with human preferences. However, many real-world alignment problems involve multiple conflicting objectives, where naive aggregation of preferences can lead to unstable training and poor trade-offs. In particular, weighted loss methods may fail to identify update directions that simultaneously improve all objectives, and existing multi-objective approaches often rely on explicit reward models, introducing additional complexity and distorting user-specified preferences. The contributions of this paper are two-fold. First, we propose a Reward-free Alignment framework for Conflicted Objectives (RACO) that directly leverages pairwise preference data and resolves gradient conflicts via a novel clipped variant of conflict-averse gradient descent. We provide convergence guarantees to Pareto-critical points that respect user-specified objective weights, and further show that clipping can strictly improve convergence rate in the two-objective setting. Second, we improve our method using some heuristics and conduct experiments to demonstrate the compatibility of the proposed framework for LLM alignment. Both qualitative and quantitative evaluations on multi-objective summarization and safety alignment tasks across multiple LLM families (Qwen 3, Llama 3, Gemma 3) show that our method consistently achieves better Pareto trade-offs compared to existing multi-objective alignment baselines.