arXivDaily: arXiv daily academic digest, updated Monday to Friday
2512.23964 2026-05-12 cs.LG cs.AI

DUALFloodGNN: Physics-informed Graph Neural Network for Operational Flood Modeling

Carlo Malapad Acosta, Herath Mudiyanselage Viraj Vidura Herath, Jia Yu Lim, Abhishek Saha, Sanka Rasnayaka, Lucy Marshall

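AI summary This paper proposes DUALFloodGNN, a physics-informed graph neural network for operational flood modeling. The model embeds physical constraints at both global and local scales through explicit loss terms and jointly predicts water volume at nodes and flow along edges. Compared with standard graph neural networks and existing flood models, DUALFloodGNN predicts hydrologic variables (such as water volume, flow, and depth) with higher accuracy and computational efficiency, and its fast predictions make it suitable for practical disaster management.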

Comments Accepted for publication at the IJCAI-ECAI 2026 AI4Tech track

Abstract

Flood models inform strategic disaster management by simulating the spatiotemporal hydrodynamics of flooding. While physics-based numerical flood models are accurate, their substantial computational cost limits their use in operational settings where rapid predictions are essential. Models designed with graph neural networks (GNNs) provide both speed and accuracy while having the ability to process unstructured spatial domains. Given their flexible input and architecture, GNNs can be combined with physics-informed techniques with ease, significantly improving interpretability and generalizability. We introduce a novel flood GNN architecture, DUALFloodGNN, which embeds physical constraints at both global and local scales through explicit loss terms. The model jointly predicts water volume at nodes and flow along edges through a shared message-passing framework. To improve performance for autoregressive inference, model training is conducted with a multi-step loss enhanced with dynamic curriculum learning. Compared with standard GNN architectures and state-of-the-art GNN flood models, DUALFloodGNN achieves substantial improvements in predicting multiple hydrologic variables (e.g., water volume, flow, and depth) while maintaining high computational efficiency. The model is open sourced at https://github.com/acostacos/dual_flood_gnn. The dataset is open sourced at https://hdl.handle.net/2123/35293 with the DOI 10.25910/9xav-0s86.
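
To make the idea of physical constraints at local and global scales concrete, here is a minimal sketch of a mass-conservation penalty on node volumes and edge flows. It is not the loss used in DUALFloodGNN: the function names, the closed-domain assumption (no boundary inflow or outflow), and the equal weighting of the two terms are all assumptions made for illustration.

```python
def local_mass_residuals(vol_prev, vol_next, edges, edge_flow, source, dt):
    """Per-node mass-balance residual: volume change minus net inflow over one step."""
    net_inflow = list(source)          # e.g. rainfall as a volumetric rate per node
    for (src, dst), q in zip(edges, edge_flow):
        net_inflow[src] -= q           # flow leaves the source node of the edge
        net_inflow[dst] += q           # and enters the destination node
    return [(v1 - v0) - dt * n for v0, v1, n in zip(vol_prev, vol_next, net_inflow)]


def physics_loss(vol_prev, vol_next, edges, edge_flow, source, dt):
    """Mean squared local residual plus the squared global (whole-domain) residual."""
    local = local_mass_residuals(vol_prev, vol_next, edges, edge_flow, source, dt)
    local_term = sum(r * r for r in local) / len(local)
    # Assumed closed domain: internal edge flows cancel, so only sources change the total.
    global_residual = (sum(vol_next) - sum(vol_prev)) - dt * sum(source)
    return local_term + global_residual ** 2


# Two nodes joined by one edge carrying 2 units of flow per step, no rainfall.
edges = [(0, 1)]
consistent = physics_loss([10.0, 5.0], [8.0, 7.0], edges, [2.0], [0.0, 0.0], 1.0)
violating = physics_loss([10.0, 5.0], [8.0, 8.0], edges, [2.0], [0.0, 0.0], 1.0)
```

In this sketch a prediction that moves exactly the edge flow from one node to the other incurs no penalty, while a prediction that creates water is penalised by both terms.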

2512.19995 2026-05-12 cs.CL cs.AI cs.LG

Schoenfeld's Anatomy of Mathematical Reasoning by Language Models

Ming Li, Chenrui Fan, Yize Cheng, Soheil Feizi, Tianyi Zhou

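AI summary This study examines the structure of thought that large language models exhibit during mathematical reasoning. Using Schoenfeld's Episode Theory as its analytical framework, it proposes ThinkARM, a scalable method that abstracts reasoning traces into explicit reasoning steps such as analysis, exploration, and verification. The method reveals the dynamics and structural differences of reasoning across models, and case studies show that the exploration step is critical to correctness and that efficiency-oriented methods tend to suppress evaluative feedback steps rather than simply shortening responses. The work offers a new perspective for systematically analyzing how language models structure their reasoning.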

Comments ACL2026, camera-ready

Abstract

Large language models increasingly expose reasoning traces, yet their underlying cognitive structure and steps remain difficult to identify and analyze beyond surface-level statistics. We adopt Schoenfeld's Episode Theory as an inductive, intermediate-scale lens and introduce ThinkARM (Anatomy of Reasoning in Models), a scalable framework that explicitly abstracts reasoning traces into functional reasoning steps such as Analysis, Explore, Implement, Verify, etc. When applied to mathematical problem solving by diverse models, this abstraction reveals reproducible thinking dynamics and structural differences between reasoning and non-reasoning models, which are not apparent from token-level views. We further present two diagnostic case studies showing that exploration functions as a critical branching step associated with correctness, and that efficiency-oriented methods selectively suppress evaluative feedback steps rather than uniformly shortening responses. Together, our results demonstrate that episode-level representations make reasoning steps explicit, enabling systematic analysis of how reasoning is structured, stabilized, and altered in modern language models.

2512.17593 2026-05-12 cs.LG math.OC

A Unified Representation of Neural Networks Architectures

Christophe Prieur, Mircea Lazar, Bogdan Robu

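AI summary This paper studies the limiting case of neural network architectures as the number of neurons per hidden layer and the number of hidden layers tend to infinity, formalizes the limit as a continuum, and derives the corresponding approximation errors. The authors first consider single-hidden-layer networks and propose a generalized infinite-width integral neural representation, then extend it to deep residual continuous neural networks (CNNs) with a finite number of integral hidden layers and residual connections. Drawing on the relation between neural ODEs and deep residual networks, they propose a unified Distributed Parameter neural Network (DiPaNet) representation and show that most existing finite- and infinite-dimensional architectures relate to it through homogenization or discretization, offering a new perspective for the theoretical analysis of neural networks.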

Comments Typographical corrections and additional clarifications, remarks; few new relevant references added and acknowledgements; main results unchanged

Abstract

In this paper we consider the limiting case of neural network (NN) architectures when the number of neurons in each hidden layer and the number of hidden layers tend to infinity, thus forming a continuum, and we derive approximation errors as a function of the number of neurons and/or hidden layers. Firstly, we consider the case of neural networks with a single hidden layer and we derive an integral infinite-width neural representation that generalizes existing continuous neural network (CNN) representations. Then we extend this to deep residual CNNs that have a finite number of integral hidden layers and residual connections. Secondly, we revisit the relation between neural ODEs and deep residual NNs and we formalize approximation errors via discretization techniques. Then, we merge these two approaches into a unified homogeneous representation of NNs as a Distributed Parameter neural Network (DiPaNet) and we show that most of the existing finite- and infinite-dimensional NN architectures are related via homogenization/discretization with the DiPaNet representation. Our approach is purely deterministic and applies to general, uniformly continuous matrix weight functions. Relations with neural fields and other neural integro-differential equations are discussed along with further possible generalizations and applications of the DiPaNet framework.

2512.15977 2026-05-12 cs.CV

Are vision-language models ready to zero-shot replace supervised classification models in agriculture?

Earl Ranario, Mason J. Earles

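AI summary This study evaluates a range of open-source and closed-source vision-language models (VLMs) on agricultural image classification, covering 27 datasets, 162 classes, and 248,000 images. Zero-shot VLMs fall well short of the supervised baseline YOLO11 on most tasks and perform worse still under open-ended prompting, where methods such as semantic judging are needed to improve results. Although some open-source models such as Qwen-VL-72B approach closed-source performance, current VLMs are not yet capable of serving as standalone agricultural diagnostic systems and are better suited as assistive tools backed by constrained interfaces and domain knowledge.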

Abstract

Vision-language models (VLMs) are increasingly proposed as general-purpose solutions for visual recognition tasks, yet their reliability for agricultural decision support remains poorly understood. We benchmark a diverse set of open-source and closed-source VLMs on 27 agricultural image classification datasets from the AgML collection (https://github.com/Project-AgML), spanning 162 classes and 248,000 images across plant disease, pest and damage, and plant and weed species identification. Across all tasks, zero-shot VLMs substantially underperform a supervised task-specific baseline (YOLO11), which consistently achieves markedly higher accuracy than any foundation model. Under multiple-choice prompting, the best-performing VLM (Gemini-3 Pro) reaches approximately 62% average accuracy, while open-ended prompting yields much lower performance, with raw accuracies typically below 25%. Applying LLM-based semantic judging increases open-ended accuracy (e.g., from ~21% to ~30% for top models) and alters model rankings, demonstrating that evaluation methodology meaningfully affects reported conclusions. Among open-source models, Qwen-VL-72B performs best, approaching closed-source performance under constrained prompting but still trailing top proprietary systems. Task-level analysis shows that plant and weed species classification is consistently easier than pest and damage identification, which remains the most challenging category across models. Overall, these results indicate that current off-the-shelf VLMs are not yet suitable as standalone agricultural diagnostic systems, but can function as assistive components when paired with constrained interfaces, explicit label ontologies, and domain-aware evaluation strategies.

2512.13919 2026-05-12 cs.LG cs.NA math.NA

Adaptive digital twins for predictive decision-making: Online Bayesian learning of transition dynamics

Eugenio Varetti, Matteo Torzoni, Marco Tezzele, Andrea Manzoni

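AI summary This paper studies how adaptivity can enhance the value realization of digital twins in civil engineering, focusing on adapting the state transition models of digital twins represented as probabilistic graphical models. Dynamic Bayesian networks model the bi-directional interaction between the physical and virtual domains, and state transition probabilities are treated as random variables with conjugate priors, enabling hierarchical online learning through Bayesian updates. The method broadens the class of distributions handled by existing digital twin frameworks and solves parametric Markov decision processes with reinforcement learning, improving personalization, robustness, and cost-effectiveness. A case study on structural health monitoring and maintenance planning of a railway bridge demonstrates its effectiveness.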

Abstract

This work shows how adaptivity can enhance value realization of digital twins in civil engineering. We focus on adapting the state transition models within digital twins represented through probabilistic graphical models. The bi-directional interaction between the physical and virtual domains is modeled using dynamic Bayesian networks. By treating state transition probabilities as random variables endowed with conjugate priors, we enable hierarchical online learning of transition dynamics from one state to another through effortless Bayesian updates. We provide the mathematical framework to account for a larger class of distributions with respect to the current literature on digital twins. To compute dynamic policies with precision updates we solve parametric Markov decision processes through reinforcement learning. The proposed adaptive digital twin framework enjoys enhanced personalization, increased robustness, and improved cost-effectiveness. We assess our approach on a case study involving structural health monitoring and maintenance planning of a railway bridge.
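
The conjugate-prior update mentioned above can be illustrated with the simplest case, a Dirichlet prior on each row of a discrete transition matrix. This is only a sketch: the paper covers a larger class of distributions, and the function names and the uniform prior below are assumptions chosen for the example.

```python
def update_transition_counts(alpha, state, next_state):
    """Conjugate update: add one observed transition to the Dirichlet pseudo-counts."""
    updated = [row[:] for row in alpha]
    updated[state][next_state] += 1.0
    return updated


def posterior_mean(alpha):
    """Posterior mean transition matrix: each row of pseudo-counts, normalised."""
    return [[a / sum(row) for a in row] for row in alpha]


# Assumed uniform Dirichlet(1, 1, 1) prior on every row of a 3-state model.
alpha = [[1.0, 1.0, 1.0] for _ in range(3)]
for s, s_next in [(0, 0), (0, 1), (0, 1), (1, 2)]:
    alpha = update_transition_counts(alpha, s, s_next)
P = posterior_mean(alpha)
```

Each observation costs one addition, which is what makes this kind of update cheap enough to run online; rows with no observations stay at the prior mean.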

2512.13618 2026-05-12 cs.CL cs.LG

Temporal Tokenization Strategies for Event Sequence Modeling with Large Language Models

Zefang Liu, Nam H. Nguyen, Yinzhu Quan, Shi-Xiong Zhang

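AI summary This paper studies how to represent continuous time effectively when modeling event sequences with large language models (LLMs), a critical yet under-explored problem. A systematic comparison of temporal encoding strategies, including numeric strings, high-precision byte-level representations, calendar-semantic tokens, uniform binning, and adaptive residual quantization, finds that each performs differently on data with different statistical distributions. The study stresses that the temporal tokenization strategy should match the statistical properties of the data, and identifies temporal tokenization as a critical but often overlooked design dimension in LLM-based event modeling.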

Abstract

Representing continuous time is a critical and under-explored challenge in modeling temporal event sequences with large language models (LLMs). Various strategies like byte-level representations or calendar tokens have been proposed. However, the optimal approach remains unclear, especially given the diverse statistical distributions of real-world event data, which range from smooth log-normal to discrete, spiky patterns. This paper presents a systematic empirical study of temporal tokenization for modeling event sequences with LLMs, comparing distinct encoding strategies: naive numeric strings, high-precision byte-level representations, human-semantic calendar tokens, classic uniform binning, and adaptive residual scalar quantization. We evaluate these strategies by fine-tuning LLMs on real-world datasets that exemplify these diverse distributions. Our analysis reveals that no single strategy is universally superior; instead, prediction performance depends heavily on aligning the tokenizer with the data's statistical properties, highlighting temporal tokenization as a critical yet often overlooked design dimension in LLM-based event modeling.
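
As a rough illustration of how three of the strategies named above differ, the sketch below tokenizes a single inter-event time in three ways. The token formats, bin ranges, and step sizes are hypothetical, and the residual quantizer uses fixed step sizes rather than the adaptive codebooks the paper refers to.

```python
def numeric_string_tokens(dt, decimals=2):
    """Naive numeric string: one token per character of the formatted value."""
    return list(f"{dt:.{decimals}f}")


def uniform_bin_token(dt, max_dt=100.0, num_bins=10):
    """Uniform binning: map dt in [0, max_dt] to one of num_bins equal-width bins."""
    clipped = min(max(dt, 0.0), max_dt)
    index = min(int(clipped / max_dt * num_bins), num_bins - 1)
    return f"<bin_{index}>"


def residual_quantize(dt, step_sizes=(10.0, 1.0)):
    """Residual scalar quantization: quantize coarsely, then quantize what is left."""
    tokens, residual = [], dt
    for level, step in enumerate(step_sizes):
        code = int(residual // step)
        tokens.append(f"<q{level}_{code}>")
        residual -= code * step
    return tokens


dt = 37.2  # a non-negative inter-event time, in arbitrary units
as_string = numeric_string_tokens(dt)
as_bin = uniform_bin_token(dt)
as_residual = residual_quantize(dt)
```

The trade-off is visible even here: the string uses many tokens but keeps precision, the single bin token is compact but coarse, and the residual code sits in between.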

2512.06949 2026-05-12 cs.CV

Can We Go Beyond Visual Features? Neural Tissue Relation Modeling for Relational Graph Analysis in Non-Melanoma Skin Histology

Shravan Venkatraman, Muthu Subash Kavitha, Joe Dhanith P R, V Manikandarajan, Jia Wu

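AI summary Histopathology image segmentation is essential for identifying tissue structures in skin cancer diagnosis, but modeling spatial context and inter-tissue relationships remains challenging, especially where tissues overlap or look morphologically similar. The paper proposes Neural Tissue Relation Modeling (NTRM), a segmentation framework that adds a graph neural network to a convolutional neural network to model spatial and functional relationships between tissue types, improving the structural coherence of the segmentation. In experiments on a non-melanoma skin cancer segmentation dataset, NTRM clearly outperforms existing methods, raising the Dice similarity coefficient by 4.9% to 31.25%, which shows the potential of relational modeling for more accurate and interpretable segmentation.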

Comments CVPR 2026 Workshops

Abstract

Histopathology image segmentation is essential for delineating tissue structures in skin cancer diagnostics, but modeling spatial context and inter-tissue relationships remains a challenge, especially in regions with overlapping or morphologically similar tissues. Current convolutional neural network (CNN)-based approaches operate primarily on visual texture, often treating tissues as independent regions and failing to encode biological context. To this end, we introduce Neural Tissue Relation Modeling (NTRM), a novel segmentation framework that augments CNNs with a tissue-level graph neural network to model spatial and functional relationships across tissue types. NTRM constructs a graph over predicted regions, propagates contextual information via message passing, and refines segmentation through spatial projection. Unlike prior methods, NTRM explicitly encodes inter-tissue dependencies, enabling structurally coherent predictions in boundary-dense zones. On the benchmark Histopathology Non-Melanoma Skin Cancer Segmentation Dataset, NTRM outperforms state-of-the-art methods, achieving a robust Dice similarity coefficient that is 4.9% to 31.25% higher than the best-performing models among the evaluated approaches. Our experiments indicate that relational modeling offers a principled path toward more context-aware and interpretable histological segmentation, compared to local receptive-field architectures that lack tissue-level structural awareness. Our code is available at https://github.com/shravan-18/NTRM.
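
For readers unfamiliar with the reported metric, the Dice similarity coefficient for one tissue class is 2|A ∩ B| / (|A| + |B|), where A and B are the predicted and reference masks. The sketch below computes it on flat binary masks; the function name and the convention of returning 1.0 for two empty masks are assumptions, not taken from the NTRM code.

```python
def dice_coefficient(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks given as flat 0/1 lists."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


pred = [1, 1, 0, 0]
target = [1, 0, 1, 0]
score = dice_coefficient(pred, target)
```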

2512.06427 2026-05-12 cs.LG

A new initialisation to Control Gradients in Sinusoidal Neural network

Andrea Combette, Antoine Venaille, Nelly Pustelnik

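AI summary This paper proposes a new initialization for neural networks with sinusoidal activations (such as SIREN), aiming to better control gradients, mitigate vanishing or exploding gradients, and improve training and generalization. By analyzing the convergence of the pre-activation distribution and of the Jacobian variance, the authors derive a closed-form initialization that differs from the original SIREN scheme. Experiments show that it clearly outperforms existing methods on function fitting and image reconstruction, and does particularly well on physics-informed neural network tasks.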

Abstract

A proper initialisation strategy is of primary importance to mitigate gradient explosion or vanishing when training neural networks. Yet, the impact of initialisation parameters still lacks a precise theoretical understanding for several well-established architectures. Here, we propose a new initialisation for networks with sinusoidal activation functions such as SIREN, focusing on gradient control, the scaling of gradients with network depth, and their impact on training and generalization. To achieve this, we identify a closed-form expression for the initialisation of the parameters, differing from the original SIREN scheme. This expression is derived from fixed points obtained through the convergence of the pre-activation distribution and the variance of Jacobian sequences. Controlling gradients while targeting vanishing pre-activations helps prevent the emergence of inappropriate frequencies during estimation, thereby improving generalization. We further show that this initialisation strongly influences training dynamics through the Neural Tangent Kernel (NTK) framework. Finally, we benchmark SIREN with the proposed initialisation against the original scheme and other baselines on function fitting and image reconstruction. The new initialisation consistently outperforms state-of-the-art methods across a wide range of reconstruction tasks, including those involving physics-informed neural networks.

2512.04949 2026-05-12 cs.LG cs.AI cs.CL

CARL: Criticality-Aware Agentic Reinforcement Learning

Leyang Shen, Yang Zhang, Chun Kai Ling, Xiaoyan Zhao, Tat-Seng Chua

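AI summary This paper proposes CARL, a reinforcement learning algorithm that addresses the weak performance of conventional policy optimization in multi-step tasks, which stems from the assumption that every step contributes equally. CARL uses entropy as a proxy for state criticality and concentrates reward assignment on actions taken at critical states, improving training efficiency and effectiveness. Experiments show that CARL delivers stronger performance and higher efficiency across a variety of evaluation settings.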

Comments 18 pages, 6 figures

Abstract

Agents capable of accomplishing complex tasks through multiple interactions with the environment have emerged as a popular research direction. However, in such multi-step settings, the conventional group-level policy optimization algorithm becomes suboptimal because of its underlying assumption that each step holds equal contribution, which deviates significantly from reality. Our analysis reveals that only the action choices on a small fraction of states are critical in determining the final outcome. Building on this insight, we propose CARL, a criticality-aware reinforcement learning algorithm tailored for long-horizon agentic reasoning. CARL leverages entropy as a heuristic proxy for state criticality and achieves focused training by assigning rewards to actions taken from high-criticality states while excluding actions taken from low-criticality states from model updates, avoiding noisy credit assignment and redundant computation. Extensive experiments demonstrate that CARL achieves both stronger performance and higher efficiency across diverse evaluation settings. The source code will be publicly available.
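
The entropy-based selection described above can be sketched as follows. This is an illustration under stated assumptions, not CARL itself: the top-fraction selection rule, the `keep_fraction` parameter, and the function names are hypothetical, and the paper may define criticality and reward assignment differently.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def critical_step_mask(step_action_probs, keep_fraction=0.25):
    """Mark the highest-entropy steps as critical; other steps are left out of updates."""
    entropies = [entropy(p) for p in step_action_probs]
    num_keep = max(1, round(keep_fraction * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    keep = set(ranked[:num_keep])
    return [i in keep for i in range(len(entropies))]


def masked_step_rewards(mask, trajectory_reward):
    """Assign the trajectory reward only to actions taken at critical steps."""
    return [trajectory_reward if m else 0.0 for m in mask]


# Four steps of one trajectory; only the second step has a genuinely uncertain policy.
probs = [[0.98, 0.01, 0.01], [0.4, 0.3, 0.3], [0.9, 0.05, 0.05], [1.0, 0.0, 0.0]]
mask = critical_step_mask(probs)
rewards = masked_step_rewards(mask, trajectory_reward=1.0)
```

In a training loop the mask would also be used to drop the non-critical steps from the gradient computation, which is where the compute savings described in the abstract would come from.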

2511.23332 2026-05-12 cs.CV

UniGeoSeg: Towards Unified Open-World Segmentation for Geospatial Scenes

Shuo Ni, Di Wang, He Chen, Haonan Guo, Ning Zhang, Jing Zhang

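AI summary This paper proposes UniGeoSeg, a unified open-world segmentation framework for remote sensing scenes that targets two shortcomings of existing methods: fragmented task formulations and limited instruction data. The authors build the GeoSeg-1M dataset, which contains a large number of image-mask-instruction triplets, and design GeoSeg-Bench to evaluate understanding and reasoning in complex geospatial scenes. UniGeoSeg achieves multi-task learning through task-aware text enhancement, latent knowledge memory, and a progressive training strategy, performs strongly on multiple benchmarks, and shows strong zero-shot generalization.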

Comments Datasets and source code were released at https://github.com/MiliLab/UniGeoSeg ; Accepted by CVPR 2026

Abstract

Instruction-driven segmentation in remote sensing generates masks from guidance, offering great potential for accessible and generalizable applications. However, existing methods suffer from fragmented task formulations and limited instruction data, hindering effective understanding and generalization. To address these issues, we introduce GeoSeg-1M, the first million-scale dataset for remote sensing instruction-driven segmentation, constructed via an automatic mask filtering and instruction generation pipeline that synthesizes referring, interactive, and reasoning segmentation instructions from multiple public datasets. GeoSeg-1M contains 590K images, 117 categories, and 1.1M image-mask-instruction triplets. Building upon this foundation, we further curate GeoSeg-Bench, a challenging benchmark designed to evaluate contextual understanding and reasoning capabilities across diverse instruction-driven tasks and complex geospatial scenes. Furthermore, we present UniGeoSeg, a unified framework that serves as a strong baseline, incorporating task-aware text enhancement, latent knowledge memory, and a progressive training strategy to facilitate multi-task learning. Extensive experiments demonstrate the state-of-the-art performance of UniGeoSeg across GeoSeg-Bench and diverse public benchmarks, while exhibiting strong zero-shot generalization. Datasets and source code were released at https://github.com/MiliLab/UniGeoSeg.

2511.22963 2026-05-12 cs.RO cs.AI

Commanding Humanoid by Free-form Language: A Large Language Action Model with Unified Motion Vocabulary

Zhirui Liu, Kaiyang Ji, Ke Yang, Yahao Fan, Jingyi Yu, Ye Shi, Jingya Wang

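AI summary This paper studies how to make humanoid robots understand and execute free-form natural language commands, and proposes Humanoid-LLA, a large language action model that translates natural language directly into executable whole-body motions. By learning a unified human-humanoid motion vocabulary, the method aligns language semantics with physical control, and a two-stage fine-tuning framework that combines supervised learning with reinforcement learning improves the physical stability and robustness of the motions. Experiments show that the model generates diverse, physically plausible motions in both simulation and the real world and generalizes well to new language commands.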

Comments Project page: https://humanoidlla.github.io/

Abstract

Enabling humanoid robots to follow free-form natural language commands is a critical step toward seamless human-robot interaction and general-purpose embodied AI. However, existing methods remain limited, often constrained to simple instructions or forced to sacrifice motion diversity for physical plausibility. To address this gap, we present Humanoid-LLA, a Large Language Action model that translates unconstrained natural language directly into executable whole-body motions for humanoid robots. Our approach tackles two core challenges: paired language-humanoid motion data scarcity and physical instability. First, we bridge high-level language semantics with physically-grounded control by learning a unified human-humanoid motion vocabulary. Second, we introduce a novel two-stage fine-tuning framework that begins with supervised motion Chain-of-Thought learning, followed by reinforcement learning refined with physical feedback to ensure robustness and stability. Extensive evaluation in simulation and real-world cross-embodiment experiments demonstrates that Humanoid-LLA achieves superior generalization to novel language commands and diverse motion generation while maintaining high physical fidelity.

2511.22565 2026-05-12 cs.AI cs.DB cs.LG

Counting Still Counts: Understanding Neural Complex Query Answering Through Query Relaxation

Yannick Brunink, Daniel Daza, Yunjie He, Michael Cochez

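AI summary This paper studies how well neural networks handle complex query answering (CQA) over knowledge graphs. By comparing neural methods with a training-free query relaxation strategy, it exposes possible limitations in the reasoning patterns that neural models capture. Across multiple datasets and query structures, neural models do not consistently outperform query relaxation; the answers the two retrieve overlap little, and combining them improves performance. The results indicate that current neural CQA models do not fully cover the reasoning patterns captured by query relaxation, and underline the importance of non-neural baselines and of incorporating relaxation principles in future work.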

Comments Accepted in Transactions on Machine Learning Research (2026)

Abstract

Neural methods for Complex Query Answering (CQA) over knowledge graphs (KGs) are widely believed to learn patterns that generalize beyond explicit graph structure, allowing them to infer answers that are unreachable through symbolic query processing. In this work, we critically examine this assumption through a systematic analysis comparing neural CQA models with an alternative, training-free query relaxation strategy that retrieves possible answers by relaxing query constraints and counting resulting paths. Across multiple datasets and query structures, we find several cases where neural and relaxation-based approaches perform similarly, with no neural model consistently outperforming the latter. Moreover, a similarity analysis reveals that their retrieved answers exhibit little overlap, and that combining their outputs consistently improves performance. These results call for a re-evaluation of progress in neural query answering: despite their complexity, current models fail to subsume the reasoning patterns captured by query relaxation. Our findings highlight the importance of stronger non-neural baselines and suggest that future neural approaches could benefit from incorporating principles of query relaxation.
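
A toy version of "relaxing query constraints and counting resulting paths" is sketched below for a two-hop query. The relaxation rule used here (dropping a relation constraint by passing `None`), the function name, and the example triples are assumptions for illustration; the paper's relaxation strategy may differ.

```python
from collections import Counter


def two_hop_path_counts(triples, anchor, rel1=None, rel2=None):
    """Count anchor -rel1-> y -rel2-> x paths per candidate x; None relaxes that relation."""
    counts = Counter()
    for head1, r1, mid in triples:
        if head1 != anchor or (rel1 is not None and r1 != rel1):
            continue
        for head2, r2, tail in triples:
            if head2 == mid and (rel2 is None or r2 == rel2):
                counts[tail] += 1
    return counts


# Hypothetical knowledge graph.
triples = [
    ("a", "works_at", "u1"),
    ("u1", "located_in", "c1"),
    ("a", "studied_at", "u2"),
    ("u2", "located_in", "c2"),
    ("u2", "partner_of", "c1"),
]
strict = two_hop_path_counts(triples, "a", "works_at", "located_in")
relaxed_first = two_hop_path_counts(triples, "a", None, "located_in")
relaxed_both = two_hop_path_counts(triples, "a")
```

The strict query returns only the exact answer, while each relaxation adds candidates that are ranked by how many paths reach them, with no training involved.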

2511.07756 2026-05-12 cs.CV

Determinism of Randomness: Prompt-Residual Seed Shaping for Diffusion Generation

Song Yan, Wei Zhai, Chenfeng Wang, Xinliang Bi, Jian Yang, Yancheng Cai, Yusen Zhang, Yunwei Lan, Tao Zhang, GuanYe Xiong, Min Li, Zheng-Jun Zha

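AI summary Diffusion models start generation from an isotropic Gaussian latent space, yet changing only the random seed leads to marked differences in semantic faithfulness, composition, and visual quality. By analyzing the semantic map from initial noise to generated content, the paper gives a geometric explanation for seed sensitivity: most directions in latent space barely affect semantics, while semantic-sensitive variation is concentrated in a smaller subspace. Based on this finding, the authors propose a training-free prompt-residual seed-shaping method that injects a tangential component related to semantic variation and pulls the seed back to the shell of the original Gaussian distribution, improving alignment and quality while staying compatible with the prior.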

Abstract

Diffusion models start generation from an isotropic Gaussian latent, yet changing only the random seed can lead to large differences in prompt faithfulness, composition, and visual quality. We study this seed sensitivity through the semantic map from initial noise to generated meaning. Although the sampling flow is locally invertible, the subsequent semantic projection is many-to-one, inducing a degenerate pullback semi-metric on the latent space: most local directions are nearly semantic-invariant, while semantic-sensitive variation is concentrated in a much smaller horizontal subspace. This provides an explanatory geometric view of the seed lottery. Motivated by this view, we introduce a training-free prompt-residual seed-shaping procedure. Rather than claiming to recover the exact horizontal space, the method uses a single high-noise cold-start prompt residual as a model-coupled proxy, injects only its tangential component, and retracts the seed to the original Gaussian radius shell. This keeps the initialization prior-compatible while adding only one conditional/unconditional probe before standard sampling. Across multiple generation benchmarks, the method improves alignment and quality metrics over standard sampling, supporting both the practical value of the proxy and the explanatory relevance of semantic anisotropy.
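
The two geometric operations named above, injecting only the tangential component and retracting to the original radius shell, can be sketched on plain vectors. The sketch assumes the prompt residual has already been computed elsewhere and is passed in as `residual`; the function name and the `strength` parameter are hypothetical.

```python
import math


def shape_seed(seed, residual, strength=0.1):
    """Add the tangential part of `residual` to `seed`, then restore the seed's norm."""
    seed_norm = math.sqrt(sum(z * z for z in seed))
    # Radial part of the residual: its projection onto the seed direction.
    radial_coeff = sum(r * z for r, z in zip(residual, seed)) / (seed_norm ** 2)
    tangential = [r - radial_coeff * z for r, z in zip(residual, seed)]
    shifted = [z + strength * t for z, t in zip(seed, tangential)]
    # Retraction: rescale so the shaped seed lies on the original radius shell.
    shifted_norm = math.sqrt(sum(s * s for s in shifted))
    return [s * seed_norm / shifted_norm for s in shifted]


seed = [3.0, 4.0, 0.0]
residual = [1.0, 0.0, 2.0]
shaped = shape_seed(seed, residual, strength=0.5)
shaped_norm = math.sqrt(sum(s * s for s in shaped))
```

Because the norm is preserved, the shaped seed stays at the radius where Gaussian samples of that dimension concentrate, which is one way to read the abstract's claim that the initialization remains prior-compatible.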

2511.02623 2026-05-12 cs.CL

The Realignment Problem: When Right becomes Wrong in LLMs

Aakash Sen Sharma, Debdeep Sanyal, Manodeep Ray, Vivek Srivastava, Shirish Karande, Murari Mandal

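AI summary As policies and values change, the alignment objectives of large language models (LLMs) can drift away from real-world needs, creating an Alignment-Reality Gap. The paper proposes the TRACE framework, which realigns a model by analyzing alignment conflicts in existing data, with no need for re-annotation. The method uses a stronger model as a judge, optimizes alignment through a three-stage pipeline, and is validated for effectiveness and generality on several mainstream models.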

Comments ICML 2026

Abstract

Post-training alignment of large language models (LLMs) relies on large-scale human annotations guided by policy specifications that change over time. Cultural shifts, value reinterpretations, and regulatory or industrial updates make static alignment increasingly brittle. As policies evolve, deployed models can diverge from current alignment objectives, creating an Alignment-Reality Gap that is difficult to audit or correct. Existing remediation typically requires re-annotation under revised guidelines, which introduces systematic challenges, including guideline ambiguity, annotator interpretation drift, and reduced consistency at scale. We introduce TRACE (Triage and Re-align by Alignment Conflict Evaluation), a framework that transforms realignment into a structured optimization problem over existing data without requiring fresh human annotation. Leveraging a stronger model as a proxy judge, TRACE operates via a three-stage pipeline: (1) triaging preference pairs into inversion, suppression, or retention categories based on alignment conflicts; (2) computing an alignment impact score via bi-level optimization to prioritize high-leverage samples; and (3) executing updates using a hybrid objective that combines relational losses (e.g., IPO) for preference inversion and punitive losses (e.g., NPO) for response suppression. Experiments on Qwen2.5-7B, Gemma-2-9B, and Llama-3.1-8B demonstrate robust realignment on synthetic benchmarks and the PKU-SafeRLHF dataset without degrading general utility. This work provides a scalable approach for LLM realignment under evolving data annotation policies and alignment guidelines. We release our code: https://respailab.github.io/TRACE/

2511.01774 2026-05-12 cs.RO cs.SY eess.SY

MOBIUS: A Multi-Modal Bipedal Robot that can Walk, Crawl, Climb, and Roll

Alexander Schperberg, Yusuke Tanaka, Stefano Di Cairano, Dennis Hong

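AI summary This paper introduces the MOBIUS platform, a bipedal robot that can walk, crawl, climb, and roll. The robot has four limbs, two 6-DoF arms and two 4-DoF legs, and a hybrid architecture combining reinforcement learning with force control gives it seamless switching between locomotion modes and stable operation. Hardware experiments verify its adaptability and manipulation capability on complex terrain, and show the importance of tightly integrating morphology, high-level planning, and control for mobile manipulation and grasping tasks.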

Comments Paper is accepted at the Robotics: Science and Systems conference, held in Sydney, Australia, July 13th-17th, 2026. Alexander Schperberg and Yusuke Tanaka are co-first authors. Both were at the Robotics and Mechanisms Laboratory (RoMeLa) at UCLA when the work started, and are now with Mitsubishi Electric Research Laboratories and ETH Zurich (RSL) respectively

Abstract

This paper presents the MOBIUS platform, a bipedal robot capable of walking, crawling, climbing, and rolling. MOBIUS features four limbs, two 6-DoF arms with two-finger grippers for manipulation and climbing, and two 4-DoF legs for locomotion--enabling smooth transitions across diverse terrains without reconfiguration. A hybrid control architecture combines reinforcement learning for locomotion and force control for compliant contact interactions during manipulation. A high-level MIQCP planner autonomously selects locomotion modes to balance stability and energy efficiency. Hardware experiments demonstrate robust gait transitions, dynamic climbing, and full-body load support via pinch grasp. Overall, MOBIUS demonstrates the importance of tight integration between morphology, high-level planning, and control to enable mobile loco-manipulation and grasping, substantially expanding its interaction capabilities, workspace, and traversability.

2510.27527 2026-05-12 cs.LG cs.AI

TetraJet-v2: Accurate NVFP4 Training for Large Language Models with Oscillation Suppression and Outlier Control

Yuxiang Chen, Yifan Liu, Xiaoming Xu, Pengle Zhang, Michael Beyer, Martin Rapp, Jun Zhu, Jianfei Chen

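AI summary Training large language models (LLMs) is extremely expensive, so low-precision fully quantized training (FQT) has drawn wide attention. The paper proposes TetraJet-v2, an end-to-end 4-bit FQT method based on the NVFP4 format that quantizes activations, weights, and gradients. To handle weight oscillation and outliers in low-precision training, the method introduces unbiased double-block quantization, the OsciReset algorithm, and the OutControl algorithm, which improve training stability and accuracy. Experiments show that TetraJet-v2 achieves near-BF16 performance on several large-scale models while training 1.67x faster than FP8 methods.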

Journal ref
Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea. PMLR 306, 2026 (ICML 2026)
Abstract

Large language model (LLM) training is prohibitively expensive, driving interest in low-precision fully-quantized training (FQT). While novel 4-bit formats like NVFP4 offer substantial efficiency gains, achieving near-lossless training at such low precision remains challenging. We introduce TetraJet-v2, an end-to-end 4-bit FQT method that leverages NVFP4 for activations, weights, and gradients in all linear layers. We identify two critical issues hindering low-precision LLM training: weight oscillation and outliers. To address these, we propose: 1) an unbiased double-block quantization method for NVFP4 linear layers with practically optimal convergence in LLM training, 2) OsciReset, the first effective algorithm to suppress the weight oscillation bottleneck in LLMs, and 3) OutControl, a mixed-precision algorithm to retain outlier accuracy. TetraJet-v2 outperforms prior methods on FP4 pre-training for LLMs across models up to 370M parameters trained up to 212B tokens, reducing the performance gap to BF16 by an average of 51.3% while enabling a 1.67x end-to-end speedup over FP8. The code is available at https://github.com/thu-ml/TetraJet-v2-NVFP4Training.

2510.25372 2026-05-12 cs.CV cs.LG

Prompt Estimation from Prototypes for Federated Prompt Tuning of Vision Transformers

M Yashwanth, Sharannya Ghosh, Aditay Tripathi, Anirban Chakraborty

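AI summary This paper studies how to perform prompt tuning of Vision Transformers efficiently and generalizably in a federated learning setting. To address the poor generalization of global prompt tuning and the overfitting of personalized tuning, the authors propose the PEP-FedPT framework, which introduces a new method based on the Class-Contextualized Mixed Prompt (CCMP): class-specific prompts are combined dynamically using global class prototypes and client class priors, giving per-sample prompt personalization without storing client parameters. Experiments show that the method outperforms existing approaches on multiple datasets, offering an effective solution for federated Vision Transformer tuning.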

Comments Accepted to TMLR 2026

Abstract

Visual Prompt Tuning (VPT) of pre-trained Vision Transformers (ViTs) has proven highly effective as a parameter-efficient fine-tuning technique for adapting large models to downstream tasks with limited data. Its parameter efficiency makes it particularly suitable for Federated Learning (FL), where both communication and computation budgets are often constrained. However, global prompt tuning struggles to generalize across heterogeneous clients, while personalized tuning overfits to local data and lacks generalization. We propose PEP-FedPT (Prompt Estimation from Prototypes for Federated Prompt Tuning), a unified framework designed to achieve both generalization and personalization in federated prompt tuning of ViTs. Within this framework, we introduce the novel Class-Contextualized Mixed Prompt (CCMP), based on class-specific prompts maintained alongside a globally shared prompt. For each input, CCMP adaptively combines class-specific prompts using weights derived from global class prototypes and client class priors. This approach enables per-sample prompt personalization without storing client-dependent trainable parameters. The prompts are collaboratively optimized via the traditional federated averaging technique. Comprehensive evaluations on CIFAR-100, TinyImageNet, DomainNet, and iNaturalist datasets demonstrate that PEP-FedPT consistently surpasses the state-of-the-art baselines under diverse data heterogeneity scenarios, establishing a strong foundation for efficient and generalizable federated prompt tuning of Vision Transformers.

2510.18184 2026-05-12 cs.LG cs.AI

ActivationReasoning: Logical Reasoning in Latent Activation Spaces

Lukas Helff, Ruben Härle, Wolfgang Stammer, Felix Friedrich, Manuel Brack, Antonia Wüst, Hikaru Shindo, Patrick Schramowski, Kristian Kersting

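AI summary Large language models (LLMs) excel at generating fluent text, but their internal reasoning remains opaque and hard to control. The study proposes ActivationReasoning (AR), a framework that embeds explicit logical reasoning in the latent activation space of LLMs, giving the model systematic reasoning and behavior steering capabilities. The method has three stages: first, latent concept representations are identified and organized using sparse autoencoders (SAEs); second, concepts that activate at inference time are mapped to logical propositions; third, logical rules are applied to these propositions to derive higher-order structures, compose new concepts, and steer model behavior. Experiments show that AR is robust and generalizes well across several reasoning tasks, opening a path toward more transparent, controllable, and auditable AI.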

Comments Proceedings of the 14th International Conference on Learning Representations (ICLR 2026)

Abstract

Large language models (LLMs) excel at generating fluent text, but their internal reasoning remains opaque and difficult to control. Sparse autoencoders (SAEs) make hidden activations more interpretable by exposing latent features that often align with human concepts. Yet, these features are fragile and passive, offering no mechanism for systematic reasoning or model control. To address this, we introduce ActivationReasoning (AR), a framework that embeds explicit logical reasoning into the latent space of LLMs. It proceeds in three stages: (1) Finding latent representations: latent concept representations are first identified (e.g., via SAEs) and organized into a dictionary; (2) Activating propositions: at inference time, AR detects activating concepts and maps them to logical propositions; and (3) Logical reasoning: logical rules are applied over these propositions to infer higher-order structures, compose new concepts, and steer model behavior. We evaluate AR on multi-hop reasoning (PrOntoQA), abstraction and robustness to indirect concept cues (Rail2Country), reasoning over natural and diverse language (ProverQA), and context-sensitive safety (BeaverTails). Across all tasks, AR scales robustly with reasoning complexity, generalizes to abstract and context-sensitive tasks, and transfers across model backbones. These results demonstrate that grounding logical structure in latent activations not only improves transparency but also enables structured reasoning, reliable control, and alignment with desired behaviors, providing a path toward more reliable and auditable AI.
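
Stages (2) and (3) can be sketched as thresholding feature activations into propositions and then forward-chaining over rules. Everything concrete below is hypothetical: the concept dictionary, the threshold, the rule format, and the example rules are illustrative assumptions and are not taken from the AR implementation.

```python
def detect_propositions(activations, dictionary, threshold=0.5):
    """Map latent features whose activation exceeds a threshold to named propositions."""
    return {name for name, idx in dictionary.items() if activations[idx] > threshold}


def forward_chain(facts, rules):
    """Apply rules of the form (body, head) until no new proposition can be derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            if head not in derived and set(body) <= derived:
                derived.add(head)
                changed = True
    return derived


# Hypothetical dictionary from concept name to feature index, and feature activations.
dictionary = {"bridge": 0, "golden_gate": 1, "violence": 2}
activations = [0.9, 0.8, 0.1]
rules = [
    (("bridge", "golden_gate"), "san_francisco"),
    (("san_francisco",), "usa"),
]
facts = detect_propositions(activations, dictionary)
conclusions = forward_chain(facts, rules)
```

The second rule fires only because the first one derived a new proposition, which is the multi-hop behavior that plain feature thresholding cannot provide on its own.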

2510.13397 2026-05-12 cs.LG stat.ML

Assessing the robustness of heterogeneous treatment effects in survival analysis under informative censoring

Yuxin Wang, Dennis Frauen, Jonas Schweisthal, Maresa Schröder, Stefan Feuerriegel

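AI summary In clinical studies, patient dropout is common and may be related to survival time (informative censoring), which biases treatment effect estimates. This paper proposes an assumption-lean framework for assessing the robustness of conditional average treatment effect (CATE) estimates under informative censoring. It uses partial identification to derive bounds on the CATE, thereby identifying patient subgroups for whom treatment remains effective despite informative censoring. The authors also propose SurvB-learner, a novel model-agnostic meta-learner that can be combined with arbitrary machine-learning models and has favorable theoretical properties such as double robustness and quasi-oracle efficiency, and they validate it on simulated and real-world data.

English abstract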

Dropout is common in clinical studies, with up to half of patients leaving early due to side effects or other reasons. When dropout is informative (i.e., dependent on survival time), it introduces censoring bias, which in turn biases treatment effect estimates. In this paper, we propose an assumption-lean framework to assess the robustness of conditional average treatment effect (CATE) estimates in survival analysis when facing censoring bias. Unlike existing works that rely on strong assumptions, such as non-informative censoring, to obtain point estimates, we use partial identification to derive informative bounds on the CATE. Thereby, our framework helps to identify patient subgroups where treatment is effective despite informative censoring. We further propose a novel model-agnostic meta-learner, called SurvB-learner, to estimate the bounds; it can be used in combination with arbitrary machine-learning models and has favorable theoretical properties such as double-robustness and quasi-oracle efficiency. We finally demonstrate the effectiveness of our meta-learner across various experiments using both simulated and real-world data.

2510.11233 2026-05-12 cs.CL

CNSocialDepress: A Chinese Social Media Dataset for Depression Risk Detection and Structured Analysis

Jinyuan Xu, Tian Lan, Xintao Yu, Xue He, Hezhi Zhang, Ying Wang, Pierre Magistry, Mathieu Valette, Lei Li

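AI summary CNSocialDepress is a benchmark dataset for detecting and structurally analyzing depression risk on Chinese social media. It contains 44,178 posts from 233 users, in which psychological experts annotated 10,306 depression-related segments. It provides binary risk labels together with multidimensional psychological attributes, supporting fine-grained and interpretable analysis of depressive signals. Experiments show that the dataset is useful for tasks such as structured psychological profiling and fine-tuning large language models, making it an important resource for mental health research in Chinese-language settings.

English abstract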

Depression is a pressing global public health issue, yet publicly available Chinese-language resources for depression risk detection remain scarce and largely focus on binary classification. To address this limitation, we release CNSocialDepress, a benchmark dataset for depression risk detection on Chinese social media. The dataset contains 44,178 posts from 233 users; psychological experts annotated 10,306 depression-related segments. CNSocialDepress provides binary risk labels along with structured, multidimensional psychological attributes, enabling interpretable and fine-grained analyses of depressive signals. Experimental results demonstrate the dataset's utility across a range of NLP tasks, including structured psychological profiling and fine-tuning large language models for depression detection. Comprehensive evaluations highlight the dataset's effectiveness and practical value for depression risk identification and psychological analysis, thereby providing insights for mental health applications tailored to Chinese-speaking populations.

2510.10730 2026-05-12 cs.LG cs.AI stat.ML

Provable Anytime Ensemble Sampling Algorithms in Nonlinear Contextual Bandits

Jiazheng Sun, Weixin Wang, Pan Xu

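AI summary This paper proposes a unified algorithmic framework for ensemble sampling in nonlinear contextual bandits. For two common settings, generalized linear bandits and neural contextual bandits, it presents Generalized Linear Ensemble Sampling (GLM-ES) and Neural Ensemble Sampling (Neural-ES) respectively, and proves high-probability frequentist regret bounds for both. The methods maintain multiple estimators of the reward model parameters via maximum likelihood estimation on randomly perturbed data, and the analysis addresses theoretical challenges specific to nonlinear models. The paper also provides anytime versions of the algorithms that do not require a fixed time horizon. Experiments show strong performance in practice.

Comments 58 pages, 5 figures, 1 table

English abstract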

We provide a unified algorithmic framework for ensemble sampling in nonlinear contextual bandits and develop corresponding regret bounds for the two most common nonlinear contextual bandit settings: Generalized Linear Ensemble Sampling (GLM-ES) for generalized linear bandits and Neural Ensemble Sampling (Neural-ES) for neural contextual bandits. Both methods maintain multiple estimators for the reward model parameters via maximum likelihood estimation on randomly perturbed data. We prove high-probability frequentist regret bounds of $\widetilde{O}(d^{3/2} \sqrt{T} + d^{4})$ for GLM-ES and $\widetilde{O}(\widetilde{d}^{3/2} \sqrt{T})$ for Neural-ES, where $d$ is the dimension of feature vectors, $\widetilde{d}$ is the effective dimension of a neural tangent kernel (NTK) matrix and $T$ is the number of rounds. The regret bound of GLM-ES matches the state-of-the-art result of randomized exploration algorithms in the generalized linear bandit setting. In the theoretical analysis, we introduce techniques that address challenges specific to nonlinear models. Practically, we remove the fixed-time-horizon assumption by developing anytime versions of our algorithms, suitable when $T$ is unknown. Finally, we empirically evaluate GLM-ES, Neural-ES and their anytime variants, demonstrating strong performance. Overall, our results establish ensemble sampling as a provable and practical randomized exploration approach for nonlinear contextual bandits.
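The core loop of ensemble sampling (several estimators fit on randomly perturbed data, one of them drawn at each round to act greedily) can be sketched as follows. This is a simplified illustration, not GLM-ES or Neural-ES: it swaps the maximum likelihood step for regularized least squares on a linear reward model, and the class name, the Gaussian reward perturbation, and all default parameters are assumptions.

```python
import random
from typing import List


def solve(A: List[List[float]], b: List[float]) -> List[float]:
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


class LinearEnsembleSampler:
    """Linear least-squares stand-in for ensemble sampling (an assumption)."""

    def __init__(self, dim: int, n_models: int = 10, reg: float = 1.0,
                 noise_std: float = 0.5, seed: int = 0) -> None:
        self.rng = random.Random(seed)
        self.noise_std = noise_std
        # Shared regularized Gram matrix: reg * I + sum of x x^T.
        self.A = [[reg if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        # One response vector per ensemble member: sum of x * perturbed reward.
        self.b = [[0.0] * dim for _ in range(n_models)]

    def choose(self, arms: List[List[float]]) -> int:
        """Draw one ensemble member uniformly and act greedily under it."""
        m = self.rng.randrange(len(self.b))
        theta = solve(self.A, self.b[m])
        scores = [sum(t * x for t, x in zip(theta, arm)) for arm in arms]
        return max(range(len(arms)), key=scores.__getitem__)

    def update(self, x: List[float], reward: float) -> None:
        """Add the observation; each member sees an independently perturbed reward."""
        for i in range(len(x)):
            for j in range(len(x)):
                self.A[i][j] += x[i] * x[j]
        for b_m in self.b:
            perturbed = reward + self.rng.gauss(0.0, self.noise_std)
            for i in range(len(x)):
                b_m[i] += x[i] * perturbed
```

Because the sampler never needs the horizon $T$, this toy version is trivially anytime; how the paper's anytime variants set their perturbation scale without knowing $T$ is not described in the abstract.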

2510.10606 2026-05-12 cs.CV

ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models

Yuqi Liu, Liangyu Chen, Jiazhen Liu, Mingkang Zhu, Zhisheng Zhong, Bei Yu, Jiaya Jia

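AI summary ViSurf is a unified, single-stage fine-tuning method that aims to resolve the tension between knowledge injection and performance enhancement in large vision-and-language models. It combines the strengths of supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR) by injecting ground-truth labels directly into the RLVR process, so that external supervision and internal reinforcement are optimized simultaneously. ViSurf also introduces three new reward control strategies to keep training stable. Experiments show that it outperforms standalone SFT, standalone RLVR, and the traditional two-stage pipeline across multiple benchmarks.

English abstract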

Post-training Large Vision-and-Language Models (LVLMs) typically involves Supervised Fine-Tuning (SFT) for knowledge injection or Reinforcement Learning with Verifiable Rewards (RLVR) for performance enhancement. However, SFT often leads to sub-optimal performance, while RLVR remains constrained by the model's internal knowledge base. While a sequential SFT $\rightarrow$ RLVR pipeline can be used, it introduces significant computational overhead and suffers from catastrophic forgetting. To address these limitations, we propose ViSurf (Visual Supervised-and-Reinforcement Fine-Tuning), a unified, single-stage paradigm that integrates the strengths of both SFT and RLVR. By analyzing their training objectives, we establish a unified framework that injects ground-truth labels directly into RLVR rollouts, facilitating simultaneous external supervision and internal reinforcement. Furthermore, we introduce three novel reward control strategies to ensure training stability and optimization. Extensive experiments demonstrate that ViSurf consistently outperforms standalone SFT, RLVR, and the traditional two-stage pipeline across diverse benchmarks. In-depth analysis corroborates these findings, validating the derivation and design principles of ViSurf.
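One way to picture injecting ground-truth labels into RLVR rollouts is to append the label to the sampled group before computing group-normalized advantages. The sketch below assumes a GRPO-style normalization and an exact-match reward, neither of which is stated in the abstract; the function names are hypothetical, and the three reward control strategies are not modeled.

```python
from statistics import mean, pstdev
from typing import Callable, List, Sequence, Tuple


def group_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within a group (assumed GRPO-style: (r - mean) / std)."""
    mu = mean(rewards)
    sd = pstdev(rewards)
    return [(r - mu) / (sd + eps) for r in rewards]


def build_group(
    rollouts: Sequence[str],
    ground_truth: str,
    reward_fn: Callable[[str, str], float],
) -> Tuple[List[str], List[float]]:
    """Append the ground-truth answer to the sampled rollouts, then score the group."""
    samples = list(rollouts) + [ground_truth]
    rewards = [reward_fn(s, ground_truth) for s in samples]
    return samples, group_advantages(rewards)


def exact_match(sample: str, target: str) -> float:
    """A verifiable reward: 1.0 when the sample equals the target, else 0.0."""
    return 1.0 if sample == target else 0.0
```

The design point this illustrates: when every sampled rollout is wrong, the rewards are all zero and the normalized advantages vanish, so there is no learning signal. With the label in the group, the label receives a positive advantage and the failed rollouts a negative one.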

2510.07500 2026-05-12 cs.LG cs.IT math.IT

Black-Box Detection of LLM-Generated Text Using Generalized Jensen-Shannon Divergence

Shuangyi Chen, Ashish Khisti

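AI summary This paper studies black-box detection of machine-generated text under practical constraints: the scoring model may not match the unknown source model, and generating contrastive samples is costly. It proposes SurpMark, a reference-based detector that summarizes the dynamics of token surprisals in a text, builds a state-transition matrix from the discretized surprisals, and scores the text against fixed human and machine references using a generalized Jensen-Shannon (GJS) divergence. Experiments show that SurpMark performs strongly across multiple datasets and generator models, with good robustness across domains and generators.

Comments ICML 2026

English abstract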

We study black-box detection of machine-generated text under practical constraints: the scoring model (proxy LM) may mismatch the unknown source model, and per-input contrastive generation is costly. We propose SurpMark, a reference-based detector that summarizes a passage by the dynamics of its token surprisals. SurpMark discretizes surprisals into interpretable states, estimates a state-transition matrix for the test text, and scores it via a generalized Jensen-Shannon (GJS) gap between the test transitions and two fixed references (human vs. machine) built once from existing corpora. Theoretically, we derive design guidance for how the discretization bins should scale with data and provide a principled justification for our test statistic. Empirically, across multiple datasets, source models, and scenarios, SurpMark consistently matches or surpasses baselines, demonstrating strong robustness across domains and generators; our experiments on hyperparameter sensitivity exhibit trends that our theoretical results help to explain.
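The pipeline described here (discretize surprisals, estimate a transition matrix, compare against two references with a generalized Jensen-Shannon divergence) can be sketched in a few lines. The bin edges, the additive smoothing, the unweighted row average, and the fixed mixing weight are assumptions for illustration; the paper's exact test statistic may weight rows and mixtures differently.

```python
import math
from typing import List, Sequence


def discretize(surprisals: Sequence[float], edges: Sequence[float]) -> List[int]:
    """Map each surprisal to a state index: the number of bin edges it exceeds."""
    return [sum(s > e for e in edges) for s in surprisals]


def transition_matrix(states: Sequence[int], k: int, smoothing: float = 1.0) -> List[List[float]]:
    """Row-normalized counts of state-to-state transitions with additive smoothing."""
    counts = [[smoothing] * k for _ in range(k)]
    for a, b in zip(states, states[1:]):
        counts[a][b] += 1.0
    return [[c / sum(row) for c in row] for row in counts]


def entropy(p: Sequence[float]) -> float:
    return -sum(x * math.log(x) for x in p if x > 0)


def gjs(p: Sequence[float], q: Sequence[float], w: float = 0.5) -> float:
    """Generalized JS divergence: H(w p + (1-w) q) - w H(p) - (1-w) H(q)."""
    mix = [w * a + (1 - w) * b for a, b in zip(p, q)]
    return entropy(mix) - w * entropy(p) - (1 - w) * entropy(q)


def gjs_gap(test: List[List[float]], human_ref: List[List[float]],
            machine_ref: List[List[float]], w: float = 0.5) -> float:
    """Average row-wise divergence to the human reference minus that to the machine one.

    A positive gap means the test transitions sit closer to the machine reference.
    """
    d_human = sum(gjs(t, h, w) for t, h in zip(test, human_ref)) / len(test)
    d_machine = sum(gjs(t, m, w) for t, m in zip(test, machine_ref)) / len(test)
    return d_human - d_machine
```

In use, the surprisals would come from a proxy language model scoring the passage, and the two reference matrices would be built once from human and machine corpora, as the abstract describes.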

2510.04142 2026-05-12 cs.CV cs.AI cs.LG

Turning Drift into Constraint: Robust Reasoning Alignment in Non-Stationary Multi-Stream Environments

Xiaoyu Yang, En Yu, Wei Duan, Jie Lu

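AI summary This paper studies how to achieve robust reasoning alignment from multiple multi-modal large language models in non-stationary multi-stream environments. To handle the systematic bias introduced as the reasoning distributions of source models evolve over time, the authors propose Autonomous Preference Optimization (APO), a new constraint-satisfaction framework that treats inter-model divergences as dynamic negative constraints. Alignment proceeds in two stages: supervised bootstrapping first gives the target model the combined capabilities of the source models, and constraint-aware optimization then produces a consistent consensus manifold. Experiments on chest X-ray interpretation show superior robustness. The authors also release CXR-MAX, a benchmark containing reasoning trajectories from seven multi-modal large models.

Comments ICML 2026

English abstract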

This paper identifies a critical yet underexplored challenge in reasoning alignment from multiple multi-modal large language models (MLLMs): In non-stationary environments, the diverse reasoning distributions of source models often evolve unpredictably, transmitting systematic biases and drift to the target model. To address this, we formulate multi-source reasoning alignment as a constraint satisfaction problem under concept drift theory. We propose Autonomous Preference Optimization (APO), a novel framework that treats inter-model divergences not as noise, but as dynamic negative constraints. APO operates via a two-stage protocol: first, supervised bootstrapping projects the target model into the capability union of source models; second, constraint-aware optimization synthesizes a consistent consensus manifold by explicitly suppressing drifting trajectories via a multi-negative Plackett-Luce objective. Extensive experiments on chest X-ray interpretation demonstrate that our 7B model achieves superior robustness, outperforming even proprietary source models in average accuracy. Furthermore, we release CXR-MAX, a large-scale benchmark comprising 170,982 reasoning trajectories from seven large-scale MLLMs to facilitate research on reasoning alignment under drift. Code and data are available at: https://github.com/XiaoyuYoung/APO.
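A multi-negative Plackett-Luce objective can be read, in its simplest form, as the negative log-probability that one preferred item is ranked first against several rejected ones. The sketch below shows only that generic form; how APO scores trajectories, chooses the negatives, or orders them is not given in the abstract, so the scalar scores here are placeholders.

```python
import math
from typing import Sequence


def multi_negative_pl_loss(pos_score: float, neg_scores: Sequence[float]) -> float:
    """Negative log-probability that the positive is ranked above all negatives.

    Under a Plackett-Luce model the top-choice probability is
    exp(s_pos) / (exp(s_pos) + sum_j exp(s_neg_j)); the log-sum-exp is
    computed with a max shift for numerical stability.
    """
    scores = [pos_score] + list(neg_scores)
    shift = max(scores)
    log_z = shift + math.log(sum(math.exp(s - shift) for s in scores))
    return log_z - pos_score
```

With a single negative this reduces to the familiar pairwise logistic loss on the score difference, and pushing the positive score up or any negative score down lowers the loss.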

2510.03895 2026-05-12 cs.RO cs.CV

NoTVLA: Semantics-Preserving Robot Adaptation via Narrative Action Interfaces

Zheng Huang, Mingyu Liu, Xiaoyi Lin, Muzhi Zhu, Canyu Zhao, Zongze Du, Ye Lin, Xiaoman Li, Yiduo Jia, Hao Zhong, Hao Chen, Chunhua Shen

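AI summary This work proposes NoTVLA, a semantics-preserving robot adaptation framework that targets the catastrophic forgetting that vision-language-action (VLA) models face in real-world deployment. Its core idea is to focus on sparse trajectories rather than dense action sequences, combining temporal compression with spatial reasoning pruning to optimize trajectory planning and reduce compute requirements. In multi-task evaluation, NoTVLA outperforms the pi0 baseline while using far less compute and without relying on a wrist-mounted camera, and it supports cross-platform deployment and zero-shot generalization.

English abstract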

Vision-Language-Action (VLA) models represent a pivotal advance in embodied intelligence, yet they confront critical barriers to real-world deployment, most notably catastrophic forgetting. This issue stems from their overreliance on continuous action sequences or action chunks, which inadvertently create isolated data silos that disrupt knowledge retention across tasks. To tackle these challenges, we propose the Narrowing of Trajectory VLA (NoTVLA) framework: a novel approach that narrows its focus to sparse trajectories, thereby avoiding the catastrophic forgetting associated with dense trajectory fine-tuning. A key innovation of NoTVLA lies in its trajectory planning strategy: instead of centering on the target object's trajectory, it leverages temporal compression and spatial reasoning pruning specifically for the robot end effector's trajectory. Furthermore, training is conducted using these sparse trajectories rather than dense action trajectories, an optimization that delivers remarkable practical advantages, including better zero-shot performance. In multi-task evaluation scenarios, NoTVLA achieves superior performance and generalization compared to pi0 while operating under two critical constraints: it uses over an order of magnitude less computing power than pi0 and requires no wrist-mounted camera. This design ensures that NoTVLA's operational accuracy closely approximates that of single-task expert models. Crucially, it also preserves the model's inherent language capabilities, enabling zero-shot generalization in specific scenarios, supporting unified model deployment across multiple robot platforms, and fostering a degree of generalization even when perceiving tasks from novel perspectives.

2510.00883 2026-05-12 cs.LG cs.AI

GLAI: GreenLightningAI for Accelerated Training through Knowledge Decoupling

Jose I. Mestre, Alberto Fernández-Hernández, Cristian Pérez-Corral, Manuel F. Dolz, Jose Duato, Enrique S. Quintana-Ortí

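AI summary This paper proposes GreenLightningAI (GLAI), a new architectural block intended as an alternative to the conventional multilayer perceptron (MLP). It makes training more efficient by decoupling structural knowledge and quantitative knowledge, which are usually entangled during training. Once the structure has stabilized, GLAI fixes its activation paths and optimizes only the numerical parameters, retaining the universal approximation capability of MLPs while reducing training time by about 40% on average. The block is generic, can be used across many kinds of neural network architectures, and matches or exceeds MLP accuracy across a range of experimental setups.

Comments 20 pages, 2 figures

English abstract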

In this work we introduce GreenLightningAI (GLAI), a new architectural block designed as an alternative to conventional MLPs. The central idea is to separate two types of knowledge that are usually entangled during training: (i) *structural knowledge*, encoded by the stable activation patterns induced by ReLU activations; and (ii) *quantitative knowledge*, carried by the numerical weights and biases. By fixing the structure once stabilized, GLAI reformulates the MLP as a combination of paths, where only the quantitative component is optimized. This reformulation retains the universal approximation capabilities of MLPs, yet achieves a more efficient training process, reducing training time by ~40% on average across the cases examined in this study. Crucially, GLAI is not just another classifier, but a generic block that can replace MLPs wherever they are used, from supervised heads with frozen backbones to projection layers in self-supervised learning or few-shot classifiers. Across diverse experimental setups, GLAI consistently matches or exceeds the accuracy of MLPs with an equivalent number of parameters, while converging faster. Overall, GLAI establishes a new design principle that opens a direction for future integration into large-scale architectures such as Transformers, where MLP blocks dominate the computational footprint.
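The split between structural and quantitative knowledge can be pictured on a one-hidden-layer ReLU network: the on/off pattern of the hidden units is the structure, and the weights that flow through the active units are the quantities. The sketch below is an assumed minimal reading of that idea, not GLAI's actual formulation; the function names are hypothetical.

```python
from typing import List, Sequence


def pre_activations(x: Sequence[float], W1: Sequence[Sequence[float]],
                    b1: Sequence[float]) -> List[float]:
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]


def relu_mlp(x, W1, b1, w2, b2) -> float:
    """Ordinary forward pass: gates and values come from the same weights."""
    hidden = [max(0.0, z) for z in pre_activations(x, W1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2


def activation_pattern(x, W1, b1) -> List[float]:
    """Structural part: which hidden units are active for this input."""
    return [1.0 if z > 0 else 0.0 for z in pre_activations(x, W1, b1)]


def gated_mlp(x, gates, W1, b1, w2, b2) -> float:
    """Quantitative part: with the gates held fixed, no nonlinearity remains,
    so the output is a sum over the active paths."""
    hidden = [g * z for g, z in zip(gates, pre_activations(x, W1, b1))]
    return sum(w * h for w, h in zip(w2, hidden)) + b2
```

If the gates are computed from a frozen copy of the weights, the gated network reproduces the ReLU network at the moment of freezing, and later updates to the value weights leave the activation pattern unchanged.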

2509.25080 2026-05-12 cs.LG

Towards a Certificate of Trust: Task-Aware OOD Detection for Scientific AI

Bogdan Raonić, Siddhartha Mishra, Samuel Lanthaler

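AI summary In scientific AI, data-driven models are widely used for critical tasks such as weather forecasting and fluid dynamics, but they can fail on out-of-distribution (OOD) data, and detecting such failures in regression tasks remains a challenge. This paper proposes a method that estimates joint likelihoods with a score-based diffusion model, combining the input with the regression model's prediction to produce a task-aware reliability score. Experiments on multiple scientific datasets show that the score tracks prediction error well, providing a basis for a verifiable 'certificate of trust' and helping to assess the trustworthiness of scientific AI predictions.

English abstract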

Data-driven models are increasingly adopted in critical scientific fields like weather forecasting and fluid dynamics. These methods can fail on out-of-distribution (OOD) data, but detecting such failures in regression tasks is an open challenge. We propose a new OOD detection method based on estimating joint likelihoods using a score-based diffusion model. This approach considers not just the input but also the regression model's prediction, providing a task-aware reliability score. Across numerous scientific datasets, including PDE datasets, satellite imagery and brain tumor segmentation, we show that this likelihood strongly correlates with prediction error. Our work provides a foundational step towards building a verifiable 'certificate of trust', thereby offering a practical tool for assessing the trustworthiness of AI-based scientific predictions. Our code is publicly available at https://github.com/bogdanraonic3/OOD_Detection_ScientificML

2509.24244 2026-05-12 cs.AI

Model Merging Scaling Laws in Large Language Models

Yuanyi Wang, Yanggan Gu, Yiming Zhang, Qi Zhou, Zhaoyi Yan, Congkai Xie, Xinyao Wang, Jianbo Yuan, Hongxia Yang

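AI summary This paper studies scaling laws for model merging in large language models, measured by cross-entropy. The authors identify a compact power law linking model size and the number of experts: as model capacity grows, the floor of the merged loss decreases, while the gains from adding experts show diminishing returns. The law holds across domains and across several merging methods, explains why gains decay quickly and variability shrinks during merging, and provides a basis for predictive planning of merges, offering a predictable scaling principle for distributed generative AI systems.

Comments ICML 2026

English abstract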

We study empirical scaling laws for language model merging measured by cross-entropy. Despite its wide practical use, merging lacks a quantitative rule that predicts returns as we add experts or scale the model size. We identify a compact power law that links model size and expert number: the size-dependent floor decreases with model capacity, while the merging tail exhibits clear diminishing returns in the number of experts. The law holds in-domain and cross-domain, tightly fits measured curves across diverse architectures and methods (Average, TA, TIES, DARE), and explains two robust regularities: most gains arrive early, and variability shrinks as more experts are included. Building on this, we present a simple theory that explains why gains fall roughly as 1/k and links the floor and tail to properties of the base model and the diversity across domains. This law enables predictive planning: estimate how many experts are needed to reach a target loss, decide when to stop adding experts, and trade off scaling the base model versus adding experts under a fixed budget--turning merging from heuristic practice into a computationally efficient, plannable alternative to multitask training. This suggests a scaling principle for distributed generative AI: predictable gains can be achieved by composing specialists, offering a complementary path toward AGI-level systems.
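The planning use case can be made concrete under a deliberately simple assumed form. The abstract gives only the qualitative shape (a size-dependent floor plus a tail whose gains fall roughly as 1/k), so the sketch below takes loss(k) = floor + tail / k as a stand-in. The parameters `floor` and `tail` are hypothetical fitted constants for one base model, not values from the paper.

```python
import math
from typing import Optional


def merged_loss(k: int, floor: float, tail: float) -> float:
    """Assumed form of the merged cross-entropy after merging k >= 1 experts."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return floor + tail / k


def experts_needed(target_loss: float, floor: float, tail: float) -> Optional[int]:
    """Smallest k whose predicted loss is at or below the target.

    Returns None when the target lies at or below the floor, since no number
    of experts reaches it under this assumed form.
    """
    if target_loss <= floor:
        return None
    return max(1, math.ceil(tail / (target_loss - floor)))
```

Under this form the marginal gain from the k-th expert is tail / (k (k - 1)), which is why most of the improvement arrives with the first few experts; a target below the floor would call for scaling the base model instead.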

2509.21892 2026-05-12 cs.CL cs.AI cs.LG

Elastic MoE: Unlocking the Inference-Time Scalability of Mixture-of-Experts

Naibin Gu, Zhenyu Zhang, Yuchen Feng, Yilong Chen, Peng Fu, Zheng Lin, Shuohuan Wang, Yu Sun, Hua Wu, Weiping Wang, Haifeng Wang

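AI summary This paper studies how mixture-of-experts (MoE) models can adjust the number of activated experts at inference time to fit different hardware and workload requirements. Conventional MoE models fix the number of activated experts at both training and inference, which makes it hard to adapt to changing real-world conditions. The authors propose Elastic MoE (EMoE), a training framework that simultaneously trains experts to collaborate in diverse combinations and guides the router toward high-quality selections, so that the number of activated experts can be varied elastically at inference, markedly improving performance across budgets. Experiments on several large MoE architectures and benchmarks show a wider effective scaling range and higher peak performance.

English abstract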

Mixture-of-Experts (MoE) models typically fix the number of activated experts $k$ at both training and inference. However, real-world deployments often face heterogeneous hardware, fluctuating workloads, and diverse quality-latency requirements, while training separate models for each scenario is costly. Considering that MoE models already operate with sparse activation, adjusting the number of activated experts offers a natural path to serving diverse budgets with a single model. Yet, we find that activating more experts $k'$ ($> k$) at inference does not yield the expected gains. Instead, performance degrades rapidly after only a slight increase, a phenomenon we term the inference-time scaling wall. Further investigation reveals that this degradation stems from a lack of learned collaboration among experts. To address this, we introduce Elastic Mixture-of-Experts (EMoE), a novel training framework that enables MoE models to elastically vary the number of activated experts at inference. By simultaneously training experts to collaborate in diverse combinations and encouraging the router to make high-quality selections, EMoE ensures robust performance across inference budgets. Extensive experiments across four MoE architectures (7B--21B) and nine benchmarks show that EMoE significantly expands the effective scaling range to 2-3$\times$ the training-time $k$, while also achieving higher peak performance.
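What "varying the number of activated experts at inference" means mechanically can be shown with a toy router. The sketch assumes one common convention (select the top-k router logits, then softmax over the selected ones); the abstract does not say which routing normalization the evaluated architectures use, and the scalar "experts" are placeholders.

```python
import math
from typing import Callable, Dict, Sequence


def route_top_k(router_logits: Sequence[float], k: int) -> Dict[int, float]:
    """Pick the k largest logits and renormalize them with a softmax."""
    order = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    shift = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - shift) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_forward(x: float, router_logits: Sequence[float],
                experts: Sequence[Callable[[float], float]], k: int) -> float:
    """Weighted sum of the outputs of the k selected experts."""
    weights = route_top_k(router_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())
```

Calling `moe_forward` with a larger `k` than the model was trained with is exactly the setting in which the abstract reports degradation: the extra experts are mixed in with weights they were never trained to share.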

2509.21000 2026-05-12 cs.LG math.OC

Feature Augmentation of GNNs for ILPs: Local Uniqueness Suffices

Qingyu Han, Qian Li, Linxin Yang, Qian Chen, Qingjiang Shi, Ruoyu Sun

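AI summary This paper studies how to improve graph neural networks (GNNs) for solving integer linear programs (ILPs). Standard GNNs are limited in expressiveness because nodes lack unique identifiers, while adding globally unique identifiers (UIDs) hurts generalization. The authors therefore propose a Local-UID scheme that guarantees uniqueness only within each node's d-hop neighborhood, and build the ColorGNN and ColorUID models on top of it. Experiments show that the approach retains expressive power while substantially improving generalization on ILP tasks.

Comments 19 pages, 9 Tables

English abstract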

Integer Linear Programs (ILPs) are central to real-world optimizations but notoriously difficult to solve. Learning to Optimize (L2O) has emerged as a promising paradigm, with Graph Neural Networks (GNNs) serving as the standard backbone. However, standard anonymous GNNs are limited in expressiveness for ILPs, and the common enhancement of augmenting nodes with globally unique identifiers (UIDs) typically introduces spurious correlations that severely harm generalization. To address this tradeoff, we propose a parsimonious Local-UID scheme based on d-hop uniqueness coloring, which ensures identifiers are unique only within each node's d-hop neighborhood. Building on this scheme, we introduce ColorGNN, which incorporates color information via color-conditioned embeddings, and ColorUID, a lightweight feature-level variant. We prove that for d-layer networks, Local-UIDs achieve the expressive power of Global-UIDs while offering stronger generalization. Extensive experiments show that our approach yields substantial and robust gains across ILP benchmarks.
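A coloring in which identifiers are unique within every node's d-hop neighborhood can be obtained greedily: two nodes share some d-hop neighborhood exactly when they are at most 2d hops apart, so it suffices to give different colors to any pair within distance 2d. The sketch below is a generic greedy procedure under that observation, not necessarily the construction used in the paper; the adjacency-dict representation and function names are assumptions.

```python
from collections import deque
from typing import Dict, Hashable, List, Set


def nodes_within(adj: Dict[Hashable, List[Hashable]], src: Hashable, radius: int) -> Set[Hashable]:
    """Breadth-first search returning all nodes at distance <= radius from src."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if dist[u] == radius:
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)


def local_uid_coloring(adj: Dict[Hashable, List[Hashable]], d: int) -> Dict[Hashable, int]:
    """Greedy coloring so that any two nodes within 2d hops get different colors."""
    colors: Dict[Hashable, int] = {}
    for u in sorted(adj):
        taken = {colors[v] for v in nodes_within(adj, u, 2 * d) if v in colors}
        c = 0
        while c in taken:
            c += 1
        colors[u] = c
    return colors
```

On a path of six nodes with d = 1 this uses three colors instead of six global identifiers, which is the parsimony the Local-UID idea is after: the number of distinct identifiers depends on neighborhood size rather than on graph size.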