arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.11609 2026-05-13 cs.LG cs.AI cs.CL

Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information

Guobin Shen, Xiang Cheng, Chenxiao Zhao, Lei Huang, Jindong Li, Dongcheng Zhao, Xing Yu

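AI summary: This work addresses the weak performance of on-policy self-distillation on math reasoning tasks and proposes a reversed self-distillation method, AntiSD. A pointwise mutual information analysis shows that privileged context makes the teacher overconfident on tokens already implied by known structure while neglecting the key deliberation steps in reasoning. AntiSD maximizes the divergence between student and teacher, reversing the gradient direction of conventional self-distillation and improving reasoning more effectively. Experiments show that the method substantially reduces training steps and improves reasoning accuracy across several large language models.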

Abstract

On-policy self-distillation, where a student is pulled toward a copy of itself conditioned on privileged context (e.g., a verified solution or feedback), offers a promising direction for advancing reasoning capability without a stronger external teacher. Yet in math reasoning the gains are inconsistent, even when the same approach succeeds elsewhere. A pointwise mutual information analysis traces the failure to the privileged context itself: it inflates the teacher's confidence on tokens already implied by the solution (structural connectives, verifiable claims) and deflates it on deliberation tokens ("Wait", "Let", "Maybe") that drive multi-step search. We propose Anti-Self-Distillation (AntiSD), which ascends a divergence between student and teacher rather than descending it: this reverses the per-token sign and yields a naturally bounded advantage in one step. An entropy-triggered gate disables the term once the teacher entropy collapses, completing a drop-in replacement for default self-distillation. Across five models from 4B to 30B parameters on math reasoning benchmarks, AntiSD reaches the GRPO baseline's accuracy in 2 to 10x fewer training steps and improves final accuracy by up to 11.5 points. AntiSD opens a path to scalable self-improvement, where a language model bootstraps its own reasoning through its training signal.
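
The per-token quantities named in this abstract can be illustrated with a minimal sketch. It assumes the pointwise mutual information between a token and the privileged context is measured as the teacher's log-probability (with the context) minus the student's (without it). The function names, the `entropy_floor` threshold, and the use of the negated PMI as the signal are assumptions for illustration; the abstract does not specify the divergence or the exact advantage.

```python
import math

def token_pmi(teacher_logprob, student_logprob):
    """PMI(token; privileged context) = log p(token | x, c) - log p(token | x)."""
    return teacher_logprob - student_logprob

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def anti_sd_signal(teacher_logprob, student_logprob, teacher_probs,
                   entropy_floor=0.1):
    """Hypothetical per-token signal with the self-distillation sign reversed.

    The gate returns 0 once the teacher's next-token distribution has
    collapsed (entropy below the assumed floor).
    """
    if entropy(teacher_probs) < entropy_floor:
        return 0.0
    return -token_pmi(teacher_logprob, student_logprob)
```

Under this sketch, a token the teacher favors more than the student (positive PMI) receives a negative signal, while a deliberation token the teacher discounts (negative PMI) receives a positive one.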

2605.11608 2026-05-13 cs.CL cs.AI cs.LG

PRISM: A Geometric Risk Bound that Decomposes Drift into Scale, Shape, and Head

Chieh-Yen Lin, Shao-Hua Sun

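AI summary: PRISM is a geometric risk bound for analyzing representation drift in post-training LLM variants (such as quantized, LoRA-adapted, and distilled models), decomposing drift into three independently measurable axes: scale, shape, and output head. It uses the model's linear output head and near-isometric backbone structure to derive an upper bound on the cross-entropy risk gap between a target model and a variant, so it identifies the cause of degradation and not only its presence. Experiments show strong variant-ranking ability across several benchmarks, and the shape regularizer outperforms conventional methods such as experience replay at preventing catastrophic forgetting.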

Abstract

Comparing post-training LLM variants, such as quantized, LoRA-adapted, and distilled models, requires a diagnostic that identifies how a variant has drifted, not only whether it has degraded. Existing similarity scores such as CKA and SVCCA can flag degradation, but they do not directly link representation drift to risk or mechanism. We propose PRISM, Proxy Risk Inference via Structural Mapping, which exploits the linear output head of LLMs and the empirically near-isometric structure of their backbones to derive a closed-form upper bound on the cross-entropy risk gap between a target model and a post-training variant. The bound is calibrated for variant ranking and decomposes drift into three independently measurable axes: scale mismatch, shape mismatch, and head divergence. Each axis corresponds to a distinct failure mode, including shape distortion under low-bit quantization, scale separability under LoRA forgetting, and head divergence under GGUF k-quantization. As a result, the dominant axis suggests a remediation direction rather than merely raising a degradation flag. Because the shape term is differentiable, the same geometry can also serve as a training-time regularizer against catastrophic forgetting. Across two model families and five benchmarks, PRISM ranks variants with mean Spearman correlations of 0.820 for post-training quantization and 0.831 for LoRA forgetting, and its axis-guided shape regularizer outperforms experience replay in aggregate at mitigating downstream forgetting.
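
The ranking quality reported here is a Spearman correlation between the bound and the measured degradation. A minimal sketch of that statistic follows, assuming no tied values; the input numbers in the usage note are made up for illustration and do not come from the paper.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(x, y):
    """Spearman rho without ties: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))
```

For example, `spearman([0.1, 0.4, 0.2, 0.9], [0.05, 0.30, 0.35, 0.80])` is approximately 0.8: the two lists agree on the best and worst variant and swap the middle two.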

2605.11605 2026-05-13 cs.CV cs.AI

Keep What Audio Cannot Say: Context-Preserving Token Pruning for Omni-LLMs

Chaeyoung Jung, Kyeongha Rho, Joon Son Chung

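AI summary: Omnimodal large language models (Omni-LLMs) incur high computational overhead when processing multimodal inputs, so effective token reduction is needed. This paper proposes ContextGuard, an inference-time token pruning framework that preserves broad audio-visual context while removing cross-modal redundancy, reducing input tokens while maintaining performance. The method predicts coarse visual semantics from audio, prunes video tokens recoverable from audio, retains tokens carrying localized visual details that audio cannot express, and merges temporally similar video tokens for further compression. Experiments show that ContextGuard outperforms existing methods on several benchmarks and achieves high pruning ratios with strong performance without fine-tuning the downstream model.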

Abstract

Omnimodal Large Language Models (Omni-LLMs) incur substantial computational overhead due to the large number of multimodal input tokens they process, making token reduction essential for real-world deployment. Existing Omni-LLM pruning methods typically reduce this cost by selecting tokens that are important for the current query or strongly aligned with cross-modal cues. However, such strategies can discard evidence that falls outside these criteria, even when needed for different questions or for understanding context beyond aligned audio-visual cues. To address this limitation, we reframe Omni-LLM token reduction as preserving broad audio-visual context while removing cross-modal redundancy. We propose ContextGuard, an inference-time token pruning framework built on this principle. ContextGuard predicts coarse visual semantics from audio and prunes video tokens whose coarse semantics are likely recoverable from audio, while retaining additional video tokens to preserve localized visual details that audio alone cannot specify. For further compression, our method merges temporally similar video tokens. The framework requires no downstream LLM fine-tuning and uses only an independently trained lightweight predictor. On Qwen2.5-Omni and Video-SALMONN2+ at 3B and 7B scales across six audio-visual benchmarks, ContextGuard outperforms prior inference-time pruning methods while pruning more tokens. Notably, on Qwen2.5-Omni 7B, ContextGuard achieves full-token-level performance on five of six benchmarks while pruning 55% of input tokens.
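
The final compression step, merging temporally similar video tokens, can be sketched as a greedy pass over consecutive token embeddings. The cosine-similarity criterion, the `threshold` value, and the averaging rule are assumptions for illustration; the abstract does not state how ContextGuard measures similarity or combines tokens.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def merge_temporal(tokens, threshold=0.95):
    """Greedily average runs of consecutive tokens that stay similar
    to the running group mean (hypothetical merge rule)."""
    merged, group = [], []
    for tok in tokens:
        if group and cosine(mean_vector(group), tok) >= threshold:
            group.append(tok)
        else:
            if group:
                merged.append(mean_vector(group))
            group = [tok]
    if group:
        merged.append(mean_vector(group))
    return merged
```

Because only adjacent tokens are compared, the sketch keeps temporal order and never merges across a scene change that drops similarity below the threshold.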

2605.11603 2026-05-13 cs.AI

GAR: Carbon-Aware Routing for LLM Inference via Constrained Optimization

Disha Sheshanarayana, Rajat Subhra Pal, Manjira Sinha, Tirthankar Dasgupta

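AI summary: As large language model (LLM) deployments scale up, balancing response quality against computational cost across heterogeneous model pools becomes a key problem. This paper proposes Green-Aware Routing (GAR), a constrained-optimization framework that minimizes per-request carbon emissions subject to accuracy and latency constraints. GAR makes real-time routing decisions using adaptive constraint optimization and lightweight estimators, and combines an online algorithm with heuristic variants to reduce the carbon footprint while maintaining model performance, offering both theoretical grounding and a practical approach to sustainable LLM inference.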

Abstract

The growing deployment of large language models (LLMs) makes per-request routing essential for balancing response quality and computational cost across heterogeneous model pools. Current routing methods rarely consider sustainable energy use and CO2 emissions as optimization objectives, despite grid carbon intensity varying by time and region, and models differing significantly in energy consumption. To address this gap, we introduce Green-Aware Routing (GAR), a constrained multi-objective optimization framework that minimizes per-request CO2 emissions subject to explicit accuracy floors and p95-latency service-level objectives (SLOs). GAR employs adaptive constraint optimization through per-dataset floor tuning and incorporates lightweight estimators for correctness, tail latency, and carbon emissions, enabling real-time routing decisions without additional inference passes. We present GAR-PD, a practical online primal-dual routing algorithm for rolling carbon budgets, alongside heuristic variants that achieve high feasibility coverage while limiting accuracy degradation. Comprehensive experiments across standard NLP benchmarks with heterogeneous LLM pools (7B-70B) demonstrate that GAR achieves substantial carbon reductions while maintaining competitive accuracy and p95 latency guarantees, providing a practical, theoretically grounded approach to sustainable LLM inference.
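
The constrained objective described here (minimize per-request CO2 subject to an accuracy floor and a p95-latency SLO) can be sketched as a feasibility filter followed by an argmin. This is a simple greedy stand-in and not the GAR-PD primal-dual algorithm; the `Candidate` fields and all numbers in the usage note are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Candidate:
    name: str
    est_accuracy: float        # estimated probability of a correct answer
    est_p95_latency_ms: float  # estimated tail latency
    est_energy_kwh: float      # estimated energy per request
    grid_g_co2_per_kwh: float  # carbon intensity where the model is served

    @property
    def est_co2_g(self) -> float:
        return self.est_energy_kwh * self.grid_g_co2_per_kwh

def route(pool: List[Candidate], accuracy_floor: float,
          latency_slo_ms: float) -> Optional[Candidate]:
    """Return the lowest-emission candidate meeting both constraints,
    or None when no candidate is feasible."""
    feasible = [c for c in pool
                if c.est_accuracy >= accuracy_floor
                and c.est_p95_latency_ms <= latency_slo_ms]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c.est_co2_g)
```

With a pool in which a larger model runs on a cleaner grid, the router can prefer it over a smaller model on a dirtier grid whenever the latency SLO allows, which is the effect of letting carbon intensity vary by region.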

2605.11601 2026-05-13 cs.CL cs.AI

DiffScore: Text Evaluation Beyond Autoregressive Likelihood

Wen Lai, Yingli Shen, Dingnan Jin, Qing Cui, Jun Zhou, Maosong Sun, Alexander Fraser

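AI summary: This paper proposes DiffScore, a text evaluation method designed to overcome the limitations that positional bias imposes on autoregressive language models in text evaluation. Built on masked large diffusion language models, DiffScore scores every token with full bidirectional context, removing positional bias and establishing an evaluation hierarchy from local fluency to global coherence. It also introduces diagnostic tools such as multi-timestep quality profiles and bidirectional PMI decomposition, and experiments show that it outperforms conventional autoregressive models on multiple benchmarks.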

Abstract

Autoregressive language models are widely used for text evaluation; however, their left-to-right factorization introduces positional bias: early tokens are scored with only leftward context, conflating architectural asymmetry with true text quality. We propose masked reconstruction as an alternative paradigm, where every token is scored using full bidirectional context. We introduce DiffScore, an evaluation framework built on Masked Large Diffusion Language Models. By measuring text recoverability across continuous masking rates, DiffScore eliminates positional bias and naturally establishes an evaluation hierarchy from local fluency to global coherence. We further provide diagnostic tools unavailable to autoregressive frameworks: multi-timestep quality profiles that decompose scores across masking rates, and bidirectional PMI decomposition that disentangles fluency from faithfulness. Experiments across ten benchmarks show that DiffScore consistently outperforms autoregressive baselines in both zero-shot and fine-tuned settings. The code is released at: https://github.com/wenlai-lavine/DiffScore.

2605.11598 2026-05-13 cs.LG cs.AI cs.DB q-bio.QM

EpiCastBench: Datasets and Benchmarks for Multivariate Epidemic Forecasting

Madhurima Panja, Danny D'Agostino, Huitao Li, Tanujit Chakraborty, Nan Liu

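AI summary: With data-driven methods now widely used in public health decision-making, epidemic forecasting has become an important research area. To address the lack of high-quality multivariate forecasting benchmarks, this paper presents EpiCastBench, a large-scale benchmarking framework of 40 curated multivariate epidemic datasets covering many infectious diseases and geographic regions, with varying temporal granularity, series length, and sparsity. The study systematically compares 15 multivariate forecasting models under a unified evaluation setting, and all data and code are publicly available, supporting the development and validation of epidemic forecasting methods.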

Abstract

The increasing adoption of data-driven decision-making in public health has established epidemic forecasting as a critical area of research. Recent advances in multivariate forecasting models better capture complex temporal dependencies than conventional univariate approaches, which model individual series independently. Despite this potential, the development of robust epidemic forecasting methods is constrained by the lack of high-quality benchmarks comprising diverse multivariate datasets across infectious diseases and geographical regions. To address this gap, we present EpiCastBench, a large-scale benchmarking framework featuring 40 curated (correlated) multivariate epidemic datasets. These publicly available datasets span a wide range of infectious diseases and exhibit diverse characteristics in terms of temporal granularity, series length, and sparsity. We analyze these datasets to identify their global features and structural patterns. To ensure reproducibility and fair comparison, we establish standardized evaluation settings, including a unified forecasting horizon, consistent preprocessing pipelines, diverse performance metrics, and statistical significance testing. By leveraging this framework, we conduct a comprehensive evaluation of 15 multivariate forecasting models spanning statistical baselines to state-of-the-art deep learning and foundation models. All datasets and code are publicly available on Kaggle (https://www.kaggle.com/datasets/aimltsf/epicastbench) and GitHub (https://github.com/aimltsf/EpiCastBench).

2605.11595 2026-05-13 cs.AI

Native Explainability for Bayesian Confidence Propagation Neural Networks: A Framework for Trusted Brain-Like AI

Georgios Makridis, Georgios Fatouros, John Soldatos, George Katsis, Dimosthenis Kyriazis

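AI summary: In response to the transparency and trustworthiness requirements that the EU Artificial Intelligence Act places on high-risk AI systems, this paper proposes a native explainability framework for Bayesian Confidence Propagation Neural Networks (BCPNN). The framework establishes a BCPNN-specific explainability taxonomy and sixteen architecture-level explanation primitives to explain model decisions systematically, and introduces five configuration-level explanation primitives to support pre-deployment auditing. The work provides theoretical support for trustworthy BCPNN deployment on edge devices and advances the use of brain-like AI in industrial IoT.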

Comments 8 pages

Abstract

The EU Artificial Intelligence Act (Regulation 2024/1689), fully applicable to high-risk systems from August 2026, creates urgent demand for AI architectures that are simultaneously trustworthy, transparent, and feasible to deploy on resource-constrained edge devices. Brain-like neural networks built on the Bayesian Confidence Propagation Neural Network (BCPNN) formalism have re-emerged as a credible alternative to backpropagation-driven deep learning. They deliver state-of-the-art unsupervised representation learning, neuromorphic-friendly sparsity, and existing FPGA implementations that target edge deployment. Despite this momentum, no systematic framework exists for explaining BCPNN decisions -- a gap the present paper fills. We argue that BCPNN is, in the sense of Rudin's interpretable-by-design agenda, an inherently transparent model whose architectural primitives map directly onto established explainable-AI (XAI) families. We make four contributions. First, we propose the first XAI taxonomy for BCPNN. It maps weights, biases, hypercolumn posteriors, structural-plasticity usage scores, attractor dynamics, and input-reconstruction populations onto attribution, prototype, concept, counterfactual, and mechanistic explanation modalities. Second, we introduce sixteen architecture-level explanation primitives (P1--P16), several without analogue in standard ANNs. We provide closed-form algorithms for computing each from quantities the model already maintains. Third, we introduce five design-time Configuration-as-Explanation primitives (Config-P1 to Config-P5) that treat BCPNN hyperparameter choices as an auditable pre-deployment explanation artifact. Fourth, we sketch a roadmap for integration into industrial IoT deployments and discuss EU AI Act alignment, edge feasibility, and Industry 5.0 implications.

2605.11594 2026-05-13 cs.CV

PointForward: Feedforward Driving Reconstruction through Point-Aligned Representations

Cheng Chi, Xianqi Wang, Hongcheng Luo, Mingfei Tu, Gangwei Xu, Zehan Zhang, Bing Wang, Guang Chen, Hangjun Ye, Sida Peng, Xin Yang, Haiyang Sun

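AI summary: This paper proposes PointForward, a feedforward driving-scene reconstruction framework that uses point-aligned representations to address the shortcomings of existing methods in multi-view consistency and dynamic instance modeling. The method initializes sparse 3D query points in the world coordinate frame and fuses multi-view image information across space and time, achieving explicit cross-view consistency in a single feedforward pass. It also introduces scene graphs to organize dynamic instances explicitly and uses 3D bounding boxes for instance-level motion propagation, yielding temporally consistent dynamic reconstruction. Experiments show that PointForward achieves state-of-the-art performance on large-scale driving datasets.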

Abstract

High-fidelity reconstruction of driving scenes is crucial for autonomous driving. While recent feedforward 3D Gaussian Splatting (3DGS) methods enable fast reconstruction, their per-pixel Gaussian prediction paradigm often suffers from multi-view inconsistency and layering artifacts. Moreover, existing methods often model dynamic instances via dense flow prediction, which lacks explicit cross-view correspondence and instance-level consistency. In this paper, we propose PointForward, a feedforward driving reconstruction framework through point-aligned representations. Unlike pixel-aligned methods, we initialize sparse 3D queries in world space and aggregate multi-view image information via spatial-temporal fusion onto these queries, enforcing explicit cross-view consistency in a single feedforward pass. To handle scene dynamics, we introduce scene graphs that explicitly organize moving instances during reconstruction. By leveraging 3D bounding boxes, our method enables instance-level motion propagation and temporally consistent dynamic representations. Extensive experiments demonstrate that PointForward achieves state-of-the-art performance on large-scale driving benchmarks. The code will be available upon the publication of the paper.

2605.11592 2026-05-13 cs.LG cs.AI cs.CR

SoK: Unlearnability and Unlearning for Model Dememorization

Mengying Zhang, Derui Wang, Ruoxi Sun, Xiaoyu Xia, Shuang Hao, Minhui Xue

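AI summary: This paper systematically studies two key techniques for data dememorization in machine learning models, unlearnability and machine unlearning, both intended to prevent misuse of sensitive data. It reveals what the two approaches share and where they fall short with respect to shallow dememorization, mutual interaction, and theoretical guarantees, and presents the first integrated analysis comprising a unified taxonomy, an empirical evaluation, and a theoretical guarantee, providing a theoretical foundation and practical guidance for deeper dememorization.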

Comments The first two authors contributed equally

Abstract

Advanced model dememorization methods, including availability poisoning (unlearnability) and machine unlearning, are emerging as key safeguards against data misuse in machine learning (ML). At the training stage, unlearnability embeds imperceptible perturbations into data before release to reduce learnability. At the post-training stage, unlearning removes previously acquired information from models to prevent unauthorized disclosure or use. While both defenses aim to preserve the right to withhold knowledge, their vulnerabilities and shared foundations remain unclear. Specifically, both unlearnability and unlearning suffer from issues such as shallow dememorization, leading to falsely claimed data learnability reduction or forgetting in the presence of weight perturbations. Moreover, input perturbations may affect the effectiveness of downstream unlearning, while unlearning may inadvertently recover domain knowledge hidden by unlearnability. This interplay calls for deeper investigation. Finally, there is a lack of formal guarantees to provide theoretical insights into current defenses against shallow dememorization. In this Systematization of Knowledge, we present the first integrated analysis of model dememorization approaches leveraging unlearnability and unlearning. Our contributions are threefold: (i) a unified taxonomy of unlearnability and scalable unlearning methods; (ii) an empirical evaluation revealing the robustness, interplay, and shallow dememorization of leading methods; and (iii) the first theoretical guarantee on dememorization depth for models processed through certified unlearning. These results lay the foundation for unifying dememorization mechanisms across the ML lifecycle to achieve a deeper immemor state for sensitive knowledge.

2605.11591 2026-05-13 cs.CV

Logit-Attention Divergence: Mitigating Position Bias in Multi-Image Retrieval via Attention-Guided Calibration

Mingtao Xian, Yifeng Yang, Qinying Gu, Xinbing Wang, Nanyang Ye

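AI summary: Multimodal large language models perform well on multi-image cross-modal retrieval but suffer from severe position bias, where predictions depend on input order and not on semantic relevance. This paper identifies a phenomenon called Logit-Attention Divergence: output logits are biased while internal attention maps still align accurately with the relevant visual information, which exposes the limitations of existing calibration methods. Building on this, the authors propose a training-free, attention-guided debiasing framework that uses the model's internal attention signals for instance-level correction at inference time, requiring only a small calibration set and negligible computational overhead. Experiments show that the method substantially improves robustness to input order and achieves state-of-the-art performance on several benchmarks.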

Abstract

Multimodal Large Language Models (MLLMs) have shown strong performance in multi-image cross-modal retrieval, yet suffer from severe position bias, where predictions are dominated by input order rather than semantic relevance. Through empirical analysis, we identify a phenomenon termed Logit-Attention Divergence, in which output logits are heavily biased while internal attention maps remain well-aligned with relevant visual evidence. This observation reveals a fundamental limitation of existing logit-level calibration methods such as PriDe. Based on this insight, we propose a training-free, attention-guided debiasing framework that leverages intrinsic attention signals for instance-level correction at inference time, requiring only a minimal calibration set with negligible computational overhead. Experiments on MS-COCO-based benchmarks show that our method substantially improves permutation invariance and achieves state-of-the-art performance, enhancing accuracy by over 40% compared to baselines. Code is available at https://github.com/brightXian/LAD.

2605.11586 2026-05-13 cs.LG math.OC

Learning Weakly Communicating Average-Reward CMDPs: Strong Duality and Improved Regret

Kihyun Yu, Beomhan Baek, Dabeen Lee

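AI summary: This paper studies learning in infinite-horizon average-reward constrained Markov decision processes (CMDPs) under the weakly communicating assumption. The authors first establish strong duality over stationary policies for weakly communicating average-reward CMDPs with finite state and action spaces; even without a linear programming formulation and despite nonconvexity, they prove strong duality by analyzing the geometric structure of the occupation measure set. Building on this result, they propose a primal-dual clipped value iteration algorithm for learning weakly communicating average-reward linear CMDPs, which achieves $\widetilde{\mathcal{O}}(T^{2/3})$ bounds on regret and constraint violation, improving on the best known results, and uses the strong duality analysis to decompose the composite Lagrangian regret.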

Abstract

We study infinite-horizon average-reward constrained Markov decision processes (CMDPs) under the weakly communicating assumption. Our contributions are twofold. First, we establish strong duality for weakly communicating average-reward CMDPs over stationary policies with finite state and action spaces. Despite the absence of a linear programming formulation and the resulting nonconvexity under the weakly communicating setting, we show that strong duality still holds by carefully exploiting the geometric structure of the occupation measure set. Second, building on this result, we propose a primal--dual clipped value iteration algorithm for learning weakly communicating average-reward linear CMDPs. Our algorithm achieves regret and constraint violation bounds of $\widetilde{\mathcal{O}}(T^{2/3})$, improving upon the best known bounds, where $T$ denotes the number of interactions. Our approach extends clipped value iteration to the constrained setting and adapts it to a finite-horizon approximation, which stabilizes the dual variable and is crucial for achieving improved regret bounds. To analyze this, we develop a novel approach based on strong duality that enables the decomposition of the composite Lagrangian regret into separate bounds on regret and constraint violation.
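
For readers unfamiliar with the terminology, the strong duality statement can be written out in generic CMDP notation. The symbols below are assumptions, since the abstract fixes no notation: $J_r(\pi)$ is the average reward of a stationary policy $\pi \in \Pi_S$, $J_g(\pi)$ is the average constraint utility, and $b$ is the constraint threshold. The paper may use a different sign convention or several constraints.

```latex
% Constrained problem over stationary policies (assumed notation)
\max_{\pi \in \Pi_S} \; J_r(\pi)
\quad \text{subject to} \quad J_g(\pi) \ge b

% Lagrangian with dual variable \lambda \ge 0
L(\pi, \lambda) = J_r(\pi) + \lambda \bigl( J_g(\pi) - b \bigr)

% Strong duality: the max-min and min-max values coincide
\max_{\pi \in \Pi_S} \, \min_{\lambda \ge 0} \, L(\pi, \lambda)
\;=\;
\min_{\lambda \ge 0} \, \max_{\pi \in \Pi_S} \, L(\pi, \lambda)
```

The left-hand side equals the constrained optimum, because the inner minimum is $J_r(\pi)$ for a feasible policy and $-\infty$ otherwise. The equality is what lets a primal-dual method work on the right-hand side.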

2605.11585 2026-05-13 cs.CV cs.LG

A Mixture Autoregressive Image Generative Model on Quadtree Regions for Gaussian Noise Removal via Variational Bayes and Gradient Methods

Shota Saito, Yuta Nakahara, Kohei Horinouchi, Naoki Ichijo, Manabu Kobayashi, Toshiyasu Matsushima

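AI summary: This paper studies Gaussian noise removal for grayscale images. It proposes a probabilistic image generative model that combines a quadtree region-partitioning model with a mixture autoregressive model, and recasts denoising based on maximum a posteriori estimation as maximization of a variational lower bound. A new optimization algorithm alternates between variational Bayes and gradient methods, with a gradient update rule that can be computed analytically without numerical approximation. Experiments verify the algorithm's effectiveness and point to directions for further improvement.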

Abstract

This paper addresses the problem of image denoising for grayscale images. We propose a probabilistic image generative model that combines a quadtree region-partitioning model with a mixture autoregressive model, and propose a framework that reduces MAP (maximum a posteriori)-estimation-based denoising to the maximization of a variational lower bound. To maximize this lower bound, we develop an algorithm that alternately applies variational Bayes and gradient methods. We particularly demonstrate that the gradient-based update rule can be computed analytically without numerical computation or approximation. We carried out some experiments to verify that the proposed algorithm actually removes image noise and to identify directions for future improvement.

2605.11582 2026-05-13 cs.CL

Efficient LLM-based Advertising via Model Compression and Parallel Verification

Wenxin Dong, Chang Gao, Guanghui Yu, Xuewu Jiao, Mingqing Hu, Qiang Fu, Peng Xu, Penghui Wei, Hui Xu, Yue Xing, Shuanglong Li, Lin Liu

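AI summary: This paper studies how to deploy large language models (LLMs) efficiently in advertising scenarios, addressing their high inference latency and computational cost. It proposes an Efficient Generative Targeting framework that combines adaptive group quantization, layer-adaptive hierarchical sparsification, and prefix-tree parallel verification to accelerate LLM inference substantially while preserving generation quality. Experiments on two real-world advertising scenarios show significant speedups with controlled quality degradation, making the framework feasible for practical deployment.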

Comments 10 pages, 7 figures, industry paper

Abstract

Large language models (LLMs) have shown remarkable potential in advertising scenarios such as ad creative generation and targeted advertising. However, deploying LLMs in real-time advertising systems poses significant challenges due to their high inference latency and computational cost. In this paper, we propose an Efficient Generative Targeting framework that integrates adaptive group quantization, layer-adaptive hierarchical sparsification, and prefix-tree parallel verification to accelerate LLM inference while preserving generation quality. Extensive experiments on two real-world advertising scenarios demonstrate that our framework achieves significant speedup with acceptable quality degradation, making it operationally viable for practical deployments.

2605.11581 2026-05-13 cs.CL

Ada-MK: Adaptive MegaKernel Optimization via Automated DAG-based Search for LLM Inference

Wenxin Dong, Mingqing Hu, Guanghui Yu, Qiang Fu, Peng Xu, Hui Xu, Yue Xing, Xuewu Jiao, Shuanglong Li, Lin Liu

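AI summary: In commercial online advertising systems, real-time LLM inference requires tightly bounded end-to-end latency. To address the large kernel launch overhead in the decode phase, the work proposes Ada-MK, which optimizes the MegaKernel execution path through automated DAG-based search and combines a three-dimensional shared-memory constraint model with a heterogeneous hybrid inference engine. This reduces shared memory usage, eliminates runtime branching overhead, and substantially improves inference throughput and latency.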

Comments 10 pages, 8 figures

Abstract

When large language models (LLMs) serve real-time inference in commercial online advertising systems, end-to-end latency must be strictly bounded to the millisecond range. Yet every token generated during the decode phase triggers thousands of kernel launches, and kernel launch overhead alone can account for 14.6% of end-to-end inference time. MegaKernel eliminates launch overhead and inter-operator HBM round-trips by fusing multiple operators into a single persistent kernel. However, existing MegaKernel implementations face a fundamental tension between portability and efficiency on resource-constrained GPUs such as NVIDIA Ada: hand-tuned solutions are tightly coupled to specific architectures and lack portability, while auto-compiled approaches introduce runtime dynamic scheduling whose branch penalties are unacceptable in latency-critical settings. We observe that under a fixed deployment configuration, the optimal execution path of a MegaKernel is uniquely determined, and runtime dynamic decision-making can be entirely hoisted to compile time. Building on this insight, we propose Ada-MK: (1) a three-dimensional shared-memory constraint model combined with K-dimension splitting that reduces peak shared memory usage by 50%; (2) MLIR-based fine-grained DAG offline search that solidifies the optimal execution path, completely eliminating runtime branching; and (3) a heterogeneous hybrid inference engine that embeds MegaKernel as a plugin into TensorRT-LLM, combining high-throughput Prefill with low-latency Decode. On an NVIDIA L20, Ada-MK improves single-batch throughput by up to 23.6% over vanilla TensorRT-LLM and 50.2% over vLLM, achieving positive gains across all tested scenarios--the first industrial deployment of MegaKernel in a commercial online advertising system.

2605.11578 2026-05-13 cs.CV

The Midas Touch for Metric Depth

Yu Ma, Zizhan Guo, Zuyi Xiong, Haoran Zhang, Yi Feng, Hongbo Zhao, Hanli Wang, Rui Fan

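AI summary: This paper proposes MTD, a method that addresses the limits on practical use of relative depth estimation caused by missing metric scale, local inconsistencies, and low computational efficiency. It converts relative depth into metric depth using extremely sparse 3D data, applying a segment-wise recovery strategy followed by pixel-wise refinement based on a discontinuity-aware geodesic cost, which removes local scale inconsistencies. MTD generalizes well and substantially improves depth completion and depth estimation accuracy, and its lightweight, modular design eases deployment and integration in a range of downstream 3D tasks.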

Abstract

Recent advances have markedly improved the cross-scene generalization of relative depth estimation, yet its practical applicability remains limited by the absence of metric scale, local inconsistencies, and low computational efficiency. To address these issues, we present Midas Touch for Depth (MTD), a mathematically interpretable approach that converts relative depth into metric depth using only extremely sparse 3D data. To eliminate local scale inconsistencies, it applies a segment-wise recovery strategy via sparse graph optimization, followed by a pixel-wise refinement strategy using a discontinuity-aware geodesic cost. MTD exhibits strong generalization and achieves substantial accuracy improvements over previous depth completion and depth estimation methods. Moreover, its lightweight, plug-and-play design facilitates deployment and integration on diverse downstream 3D tasks. Project page is available at https://mias.group/MTD.

2605.11577 2026-05-13 cs.CL

BitLM: Unlocking Multi-Token Language Generation with Bitwise Continuous Diffusion

Shaobin Zhuang, Yuang Ai, Jiaming Han, Xiaohui Li, Huaibo Huang, Xiangyu Yue, Xuefeng Hu, Kun Xu, Yali Wang, Hao Chen

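AI summary: Conventional autoregressive language models generate text one token at a time, which makes it hard to capture the multi-token units of natural language and limits both expressiveness and inference efficiency. This paper proposes BitLM, which represents each token as a fixed-length binary code and denoises multiple tokens in parallel within each block, making joint lexical decisions within a block while preserving left-to-right causal attention. BitLM replaces the large-vocabulary softmax with bitwise denoising, reframing token generation as iterative commitment in a compact binary space, which substantially improves pre-training efficiency and inference speed. The results show that one-token-at-a-time generation is an interface choice and not a fundamental requirement of language models, pointing to a new direction for next-generation architectures.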

Comments 12 pages, 4 figures, 1 table

Abstract

Autoregressive language models generate text one token at a time, yet natural language is inherently structured in multi-token units, including phrases, n-grams, and collocations that carry meaning jointly. This one-token bottleneck limits both the expressiveness of the model during pre-training and its throughput at inference time. Existing remedies such as speculative decoding or diffusion-based language models either leave the underlying bottleneck intact or sacrifice the causal structure essential to language modeling. We propose BitLM, a language model that represents each token as a fixed-length binary code and employs a lightweight diffusion head to denoise multiple tokens in parallel within each block. Crucially, BitLM preserves left-to-right causal attention across blocks while making joint lexical decisions within each block, combining the reliability of autoregressive modeling with the parallelism of iterative refinement. By replacing the large-vocabulary softmax with bitwise denoising, BitLM reframes token generation as iterative commitment in a compact binary space, enabling more efficient pre-training and substantially faster inference without altering the causal foundation that makes language models effective. Our results demonstrate that the one-token-at-a-time paradigm is not a fundamental requirement but an interface choice, and that changing it can yield a stronger and faster language model. We hope BitLM points toward a promising direction for next-generation language model architectures.
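
The idea of representing each token as a fixed-length binary code can be sketched as follows. The plain base-2 encoding of the vocabulary index, the function names, and the 0.5 commitment threshold are assumptions for illustration; the abstract does not say which binary code BitLM uses or how its diffusion head commits bits.

```python
def token_to_bits(token_id, num_bits):
    """Encode a vocabulary index as a fixed-length list of bits (MSB first)."""
    if not 0 <= token_id < (1 << num_bits):
        raise ValueError("token_id does not fit in num_bits")
    return [(token_id >> shift) & 1 for shift in reversed(range(num_bits))]

def bits_to_token(bits):
    """Decode an MSB-first bit list back to a vocabulary index."""
    token_id = 0
    for bit in bits:
        token_id = (token_id << 1) | bit
    return token_id

def commit(soft_bits, threshold=0.5):
    """Round per-bit probabilities (e.g. a denoiser's output) to hard bits."""
    return [1 if p >= threshold else 0 for p in soft_bits]
```

One consequence of this interface is size: a vocabulary of V tokens needs only ceil(log2 V) bit outputs per token, so a 17-bit code already covers 131,072 entries, in place of a softmax over every entry.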

2605.11574 2026-05-13 cs.CL cs.AI cs.LG

Three Regimes of Context-Parametric Conflict: A Predictive Framework and Empirical Validation

Pruthvinath Jeripity Venkata

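AI summary: This paper studies three distinct situations in which large language models handle conflict between their training knowledge and a contradicting document, and proposes a three-regime predictive framework. The approach distinguishes two orthogonal dimensions, parametric strength and parametric uniqueness, and validates through extensive experiments how model behavior differs across task settings. The study finds that a model's reliance on the document changes markedly with task framing, and that parametric certainty plays the dominant role in factual tasks.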

Comments 10 pages, 13 tables, no figures. 9,970 API calls across five frontier models

Abstract

The literature on how large language models handle conflict between their training knowledge and a contradicting document presents a persistent empirical contradiction: some studies find models stubbornly retain their trained answers, ignoring provided documents nearly half the time, while others find models readily defer to the document, following context approximately 96% of the time. We argue these contradictions dissolve once one recognises that prior experiments have studied three qualitatively distinct processing situations without distinguishing them. We propose a three-regime framework: Regime 1 (single-source updating, dominant predictor: evidence coherence), Regime 2 (competitive integration, dominant predictor: parametric certainty), and Regime 3 (task-appropriate selection, dominant predictor: task knowledge requirement). We formalise a distinction between parametric strength (exposure frequency) and parametric uniqueness (encoding consistency), showing empirically that these are orthogonal dimensions (r = -0.002, p = .97) with strength as the operative predictor in stable factual domains. We validate the framework across Claude Sonnet 4.6, GPT-5.5, Gemini 2.5 Flash, Llama 4 Maverick, and DeepSeek V3 using 9,970 API calls in three experimental phases. GEE logistic regression confirms the predicted Regime 2 certainty gradient for all five models (beta = -0.38 to -0.50, all p <= .013, BH-FDR corrected). A Regime 3 ablation shows task framing alone flips context-following from near-100% (contextual knowledge condition) to 6-71% (parametric knowledge condition), with all five models significant (p < .001). The certainty gradient is robust to multinomial outcome modeling, sensitivity analyses for hedging responses, and FDR correction.
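
The "BH-FDR corrected" p-values mentioned here refer to the Benjamini-Hochberg procedure. A minimal sketch of the standard adjustment follows; the p-values in the usage note are made up for illustration and are not the paper's.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order.

    Each sorted p-value p_(k) is scaled by m / k, then a running minimum
    from the largest rank downward enforces monotonicity.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

For instance, `benjamini_hochberg([0.01, 0.04, 0.03, 0.005])` gives adjusted values of approximately `[0.02, 0.04, 0.04, 0.02]`, so all four tests would remain significant at a 0.05 false discovery rate.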

2605.11571 2026-05-13 cs.LG

FedOUI: OUI-Guided Client Weighting for Federated Aggregation

Alberto Fernández-Hernández, Jose I. Mestre, Cristian Pérez-Corral, Manuel F. Dolz, Jose Duato, Enrique S. Quintana-Ortí

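AI summary: This paper proposes FedOUI, a federated learning aggregation method based on the Overfitting-Underfitting Indicator (OUI). It uses the activation patterns of each client model on a fixed probe dataset to assess structural properties of its training, and adjusts each client's aggregation weight accordingly. The method requires no labels and improves aggregation quality under strong non-IID partitioning and noisy clients; experiments show the largest gains under strong heterogeneity, demonstrating the potential value of internal activation structure for federated learning.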

Abstract

Federated learning usually aggregates client updates using dataset size or gradient-level criteria, while overlooking internal signals about how each client model is organizing its input space during training. We introduce FedOUI, a simple aggregation rule based on the Overfitting-Underfitting Indicator (OUI), an activation-based and label-free metric. Each participating client sends its local update together with a OUI value computed on a fixed probe batch, and the server estimates the round-wise OUI distribution to assign lower weights to structurally atypical clients through a smooth reweighting rule. We evaluate FedOUI on CIFAR-10 under strong non-IID partitioning and noisy-client conditions, comparing it with FedAvg, FedProx, and a gradient-alignment baseline. The clearest gains appear under strong heterogeneity, where OUI-based weighting improves aggregation quality while remaining lightweight and interpretable. These results show that internal activation structure can provide useful information for federated aggregation beyond client size and gradient geometry.
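
The aggregation rule described here, down-weighting clients whose OUI value is atypical for the round, can be sketched as below. The Gaussian-shaped kernel around the round median, the `sharpness` parameter, and the function names are assumptions; the abstract says only that the reweighting is smooth and based on the round-wise OUI distribution, and it does not define how OUI itself is computed.

```python
import math
import statistics

def oui_weights(oui_values, sharpness=1.0):
    """Hypothetical smooth reweighting: clients far from the round's median
    OUI (in units of the round's standard deviation) get smaller weights."""
    center = statistics.median(oui_values)
    spread = statistics.pstdev(oui_values) or 1.0  # avoid division by zero
    raw = [math.exp(-sharpness * ((v - center) / spread) ** 2)
           for v in oui_values]
    total = sum(raw)
    return [r / total for r in raw]

def aggregate(updates, weights):
    """Weighted average of client update vectors."""
    dim = len(updates[0])
    return [sum(w * u[i] for w, u in zip(weights, updates))
            for i in range(dim)]
```

With OUI values `[0.50, 0.52, 0.48, 0.90]`, the fourth client receives a much smaller weight than the other three, so its update contributes little to the aggregated model.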

2605.11570 2026-05-13 cs.LG

OUI as a Structural Observable: Towards an Activation-Centric View of Neural Network Training

Alberto Fernández-Hernández, Jose I. Mestre, Cristian Pérez-Corral, Manuel F. Dolz, Jose Duato, Enrique S. Quintana-Ortí

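AI summary This paper proposes treating the Overfitting-Underfitting Indicator (OUI) as an observable of internal structural change during neural network training, and argues that training dynamics should be understood from the perspective of activation functions. The study finds that OUI, as an early, label-free, activation-based signal, reveals ahead of time whether training is heading into a good or a poor regime, and shows good predictive ability across supervised learning, reinforcement learning, and online control. These findings provide an empirical basis for building an activation-centric theory of training dynamics.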

English abstract

Activation functions are what make deep networks expressive: without them, the model collapses to a linear map. Yet we still evaluate training mostly from the outside, through loss, accuracy, return, or final calibration, while the internal structural evolution of the network remains largely unobserved. In this paper, we argue that the Overfitting-Underfitting Indicator (OUI) should be understood as a first practical observable of that internal structure. Across our recent results, OUI consistently appears as an early, label-free, activation-based signal that reveals whether a network is entering a poor or promising training regime before convergence. In supervised learning, it anticipates weight decay regimes; in reinforcement learning, it discriminates learning-rate regimes early in PPO actor-critic; and in online control, it can drive layer-wise weight decay adaptation. Read together with recent evidence that activation patterns tend to stabilize earlier than parameters, these results suggest a broader research direction: an activation-centric theory of training dynamics. OUI is becoming an empirical foothold toward this theory.

2605.11569 2026-05-13 cs.AI cs.LG

Dual-Temporal LSTM with Hybrid Attention for Airline Passenger Load Factor Forecasting: Integrating Intra-Flight and Inter-Flight Booking Dynamics

ASM Nazrul Islam, Md. Hasanul Kabir, Md. Liakot Ali, Joydeb Kumar Sana

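AI summary To address shortcomings in airline demand forecasting, this study proposes an LSTM model that combines dual temporal streams with a hybrid attention mechanism to predict flight load factors more accurately. The model processes both booking accumulation within a flight and booking patterns across flights, overcoming the information loss of traditional single-temporal-dimension modeling. Experiments on real data from Biman Bangladesh Airlines show high forecasting accuracy and good generalization across route types, and the airline has formally adopted the method in its operations.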

English abstract

Accurate short-term demand forecasting is crucial to airline revenue management, yet most existing systems fail to meet this need because current models treat booking data as a single temporal dimension, either the accumulation of bookings for a specific flight or the historical booking profile of the same route. This unidimensional view discards information carried by the other temporal stream, and forecasting absolute passenger counts introduces a further operational fragility when a change in planned aircraft type alters total seat capacity. This study addresses both limitations. A dual-stream Long Short-Term Memory (LSTM) framework integrated with attention is proposed that simultaneously processes two complementary input sequences: a horizontal sequence capturing intra-flight booking accumulation over the days preceding departure, and a vertical sequence capturing inter-flight booking patterns at fixed days-before-departure offsets across historical flights. Multiple dual-stream architectural variants, combining self-attention, cross-attention, and hybrid attention with concatenation, residual, and gated fusion strategies, are developed and evaluated. Experiments on real-world reservation data from the national airline of Bangladesh, Biman Bangladesh Airlines (BBA), demonstrate that the proposed hybrid model achieves a Mean Absolute Error of 2.8167 and a coefficient of determination ($R^{2}$) of 0.9495, outperforming single-stream baselines, tree-based models, and three prior dual-LSTM architectures applied to the same data. Validation across four flight category pairs (domestic versus international, direct versus transit, high versus low frequency, and short versus mid versus long haul) confirms that the model generalizes across operationally diverse route types. BBA has officially integrated this methodology into its operations.

2605.11564 2026-05-13 cs.RO

RIO: Flexible Real-Time Robot I/O for Cross-Embodiment Robot Learning

Pablo Ortega-Kral, Eliot Xing, Arthur Bucker, Vernon Luk, Junseo Kim, Owen Kwon, Angchen Xie, Nikhil Sobanbabu, Yifu Yuan, Megan Lee, Deepam Ameria, Bhaswanth Ayapilla, Jaycie Bussell, Guanya Shi, Jonathan Francis, Jean Oh

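AI summary This paper presents RIO, an open-source Python framework that aims to address infrastructure fragmentation in cross-embodiment robot learning. RIO provides flexible, lightweight components for robot control, teleoperation, data formatting, sensor configuration, and policy deployment across a range of hardware platforms and morphologies. The authors validate RIO on three robot morphologies and four hardware platforms, demonstrating its effectiveness for training and deploying generalist vision-language-action models and providing a foundation for accelerating learning on real robot hardware.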

Comments 14 pages, 12 figures, 5 tables. Accepted to Robotics: Science and Systems (RSS) 2026

English abstract

Despite recent efforts to collect multi-task, multi-embodiment datasets, to design recipes for training Vision-Language-Action models (VLAs), and to showcase these models on different robot platforms, generalist cross-embodiment robot capabilities remain a largely elusive ideal. Progress is limited by fragmented infrastructure: most robot code is highly specific to the exact setup the user decided on, which adds major overhead when attempting to reuse, recycle, or share artifacts between users. We present RIO (Robot I/O), an open-source Python framework that provides flexible, lightweight components for robot control, teleoperation, data formatting, sensor configuration, and policy deployment across diverse hardware platforms and morphologies. RIO provides abstractions that enable users to make any choice and to switch between choices with minimal reconfiguration effort. We validate RIO on VLA deployment workflows across three morphologies (single-arm, bimanual, humanoid) and four hardware platforms with varying grippers and cameras. Using teleoperated data collected with RIO, we fine-tune state-of-the-art VLAs including $π_{0.5}$ and GR00T on household tasks such as pick-and-place, folding, and bowl scrubbing. By open-sourcing all our efforts, we hope the community can accelerate its pace of robot learning on real-world robot hardware. Additional details at: https://robot-i-o.github.io

2605.11563 2026-05-13 cs.CV cs.AI

TCP-SSM: Efficient Vision State Space Models with Token-Conditioned Poles

Sara Shoouri, Morteza Tavakoli Taba, Hun-Seok Kim

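AI summary This paper proposes TCP-SSM, an efficient vision state space model that addresses the difficulty existing SSMs have in controlling state-dependent memory behavior in long-range vision tasks. The method introduces token-conditioned stable poles to model the recurrent dynamics explicitly, improving interpretability and controllability. TCP-SSM uses real poles and complex-conjugate poles to model monotone decay and damped oscillatory responses respectively, and combines grouped pole sharing with a lightweight input pathway to improve computational efficiency, reducing computational complexity by up to 44% relative to baseline models across several vision tasks.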

English abstract

State Space Models (SSMs) have emerged as a compelling alternative to attention models for long-range vision tasks, offering input-dependent recurrence with linear complexity. However, most efficient SSM variants reduce computation cost by modifying scan routes, resolutions, or traversal patterns, while largely leaving the recurrent dynamics implicit. Consequently, the model's state-dependent memory behavior is difficult to control, particularly in compact backbones where long scan paths can exceed the effective memory horizon. We propose Token-Conditioned Poles SSM (TCP-SSM), a structured selective SSM framework that improves efficiency while making recurrence dynamics explicit and interpretable through stable poles. TCP-SSM builds each scan operator with 1) real poles that model monotone or sign-alternating decay, and 2) complex-conjugate poles that capture damped oscillatory responses. Using bounded radius and angle modulation, TCP-SSM converts shared base poles into token-dependent poles, allowing each scan step to adapt its memory behavior to the current visual token while preserving pole stability. For practical scalability, we integrate grouped pole sharing with a lightweight low-rank input pathway, yielding an efficient scan operator that preserves linear-time scan complexity. Across image classification, semantic segmentation, and object detection, TCP-SSM reduces SSM computation complexity up to 44% in Vision Mamba-style models while maintaining or surpassing baseline accuracy.
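
The pole behaviors named in this abstract can be seen in a few lines of plain Python. The sketch below is not the TCP-SSM operator. It only shows, under stated assumptions, how a real pole gives monotone or sign-alternating decay, how a complex pole (tracked through the real part of a complex state, which is half the response of a conjugate pair) gives a damped oscillation, and how one hypothetical bounded modulation could keep a token-dependent pole inside the unit circle. The names `modulated_pole`, `scan_real`, and `scan_complex`, and the sigmoid and tanh squashing, are assumptions for illustration.

```python
import math


def modulated_pole(base_radius, base_angle, radius_shift, angle_shift,
                   max_radius=0.99, max_angle_shift=0.5):
    """Turn a shared base pole into a token-dependent pole (hypothetical rule).

    The radius passes through a scaled sigmoid, so it never exceeds
    max_radius < 1 whatever the token-driven shift is. Requires
    0 < base_radius < max_radius.
    """
    logit = math.log(base_radius / (max_radius - base_radius))
    radius = max_radius / (1.0 + math.exp(-(logit + radius_shift)))
    angle = base_angle + max_angle_shift * math.tanh(angle_shift)
    return radius, angle


def scan_real(xs, poles):
    """First-order recurrence h_t = p_t * h_{t-1} + x_t with real poles p_t."""
    h, out = 0.0, []
    for x, p in zip(xs, poles):
        h = p * h + x
        out.append(h)
    return out


def scan_complex(xs, poles):
    """Same recurrence with poles r * exp(i * theta); returns the real part."""
    h, out = 0j, []
    for x, (r, theta) in zip(xs, poles):
        h = complex(r * math.cos(theta), r * math.sin(theta)) * h + x
        out.append(h.real)
    return out


impulse = [1.0, 0.0, 0.0, 0.0]
monotone = scan_real(impulse, [0.5] * 4)      # positive pole: steady decay
alternating = scan_real(impulse, [-0.5] * 4)  # negative pole: sign flips each step
damped = scan_complex(impulse, [(0.9, math.pi / 2)] * 4)  # decaying oscillation
```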

2605.11559 2026-05-13 cs.CV cs.AI

When Looking Is Not Enough: Visual Attention Structure Reveals Hallucination in MLLMs

Fanpu Cao, Xin Zou, Xuming Hu, Hui Xiong

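AI summary Multimodal large language models (MLLMs) play an important role in visual reasoning and visually grounded question answering, yet they remain prone to visual hallucination, where generated responses contradict the image content or mention nonexistent objects. This paper finds that analyzing the high-frequency structure of visual attention (layer-wise Laplacian energy) reveals how attention changes when the model hallucinates. Building on this, it proposes LaSCD, a training-free decoding strategy that selects layers with high Laplacian energy and remaps next-token scores, effectively reducing hallucination while preserving the model's general capabilities.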

English abstract

Multimodal large language models (MLLMs) have become a key interface for visual reasoning and grounded question answering, yet they remain vulnerable to visual hallucinations, where generated responses contradict image content or mention nonexistent objects. A central challenge is that hallucination is not always caused by a simple lack of visual attention: the model may still assign substantial attention mass to image tokens while internally drifting toward an incorrect answer. In this paper, we show that the high-frequency structure of visual attention, measured by layer-wise Laplacian energy, reveals both the layer where hallucinated preferences emerge and the layer where the ground-truth answer transiently recovers. Building on this finding, we propose LaSCD (Laplacian-Spectral Contrastive Decoding), a training-free decoding strategy that selects informative layers via Laplacian energy and remaps next-token logits in closed form. Experiments on hallucination and general multimodal benchmarks show that LaSCD consistently reduces hallucination while preserving general capabilities, highlighting its potential as a faithful decoding paradigm. The code is available at https://github.com/macovaseas/LaSCD.

2605.11556 2026-05-13 cs.AI cs.LG

Hindsight Hint Distillation: Scaffolded Reasoning for SWE Agents from CoT-free Answers

Shengjie Wang, Guanghe Li, Zonghan Yang, Yang Gao

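AI summary This study proposes Hindsight Hint Distillation (HHD), a method for learning reasoning from question-answer pairs without chain-of-thought (CoT) annotations in order to solve complex long-horizon tasks. HHD generates "hindsight hints" from the model's own failed self-rollouts, uses them to guide the generation of successful trajectories, and improves the model's reasoning through self-distillation. Experiments show that HHD significantly outperforms existing methods on several benchmarks and generalizes well, particularly to unseen tasks.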

Comments 28 pages, 7 figures

English abstract

Solving complex long-horizon tasks requires strong planning and reasoning capabilities. Although datasets with explicit chain-of-thought (CoT) rationales can substantially benefit learning, they are costly to obtain. To address this challenge, we propose Hindsight Hint Distillation (HHD), which only requires easy-to-obtain question-answer pairs without CoT annotations. Inspired by how human teachers use student mistakes to provide targeted guidance, HHD synthesizes hindsight hints from the model's own failed self-rollouts and uses them to scaffold on-policy rollouts that successfully complete the tasks. The model then self-distills these scaffolded trajectories and generalizes to new problems without hint guidance. Experiments show that HHD significantly outperforms iterative RFT and trajectory-synthesis baselines, achieving an absolute improvement of 8\% on SWE-bench Verified, while all baselines improve by only around 2\%. Notably, the reasoning strategies induced by HHD generalize effectively to out-of-distribution tasks, yielding the largest gains on SWE-bench Multilingual despite no training on multilingual data. These results demonstrate that HHD can effectively synthesize expert-like reasoning from CoT-free data and substantially improve long-horizon performance.

2605.11554 2026-05-13 cs.LG

A Controlled Counterexample to Strong Proxy-Based Explanations of OOD Performance: in a Fixed Pretraining-and-Probing Setup

Hongmin Li

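AI summary This study examines whether, in a fixed pretraining-and-probing setup, explanations based on structure proxies accurately reflect differences in model performance on out-of-distribution (OOD) tasks. Through a controlled experiment, it shows that the structure-proxy ranking and the OOD probe accuracy ranking can disagree, indicating that structure proxies do not necessarily track the task structure that drives OOD performance. The counterexample exposes the limits of strong proxy-based explanations: under certain conditions, a proxy for total learned structure can fail to reflect task-relevant structure.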

Comments 19 pages, 3 figures

English abstract

Task-agnostic structure proxies are often used to interpret why one pretraining corpus transfers better than another, but such explanations require the proxy to track the structure that matters for the downstream task. We test this requirement in a fixed pretraining-and-probing setup motivated by computationally bounded notions of learned structure, including epiplexity. The core question is whether a proxy ranking of two pretraining datasets must agree with their ranking by OOD probe accuracy. We show that it need not. First, we give a controlled construction in which a formal structure quantity, its operational proxy, and the task-relevant structure for a target family separate. We then instantiate the same mechanism in a synthetic sequence-model experiment: under the primary all-sample evaluation, the OOD accuracy ranking reverses the proxy ranking in two of three seeds, with auxiliary diagnostics and ablations supporting the same interpretation. The counterexample does not reject structure-based explanations in general; it identifies a boundary on strong proxy-based explanations. A proxy for total learned structure can fail to track the task-relevant structure that drives OOD performance, even in a controlled setting.

2605.11551 2026-05-13 cs.LG cs.CV cs.IT math.IT

VNDUQE: Information-Theoretic Novelty Detection using Deep Variational Information Bottleneck

Aryan Gondkar, Hayder Radha, Yiming Deng

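AI summary This paper proposes VNDUQE, a novelty detection and uncertainty quantification method based on the Deep Variational Information Bottleneck (VIB), for detecting out-of-distribution (OOD) samples in neural networks. The method scores how anomalous a sample is with information-theoretic metrics such as KL divergence and prediction entropy, and is validated on MNIST. Experiments show that a parallel detection strategy combining KL divergence and prediction entropy outperforms the conventional baseline on both far-OOD and near-OOD samples, markedly improving detection performance and the reliability of uncertainty estimates.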

Comments 6 pages, 3 figures, Fall 2025 version

English abstract

Detecting out-of-distribution (OOD) samples is critical for safe deployment of neural networks in safety-critical applications. While maximum softmax probability (MSP) provides a simple baseline, it lacks theoretical grounding and suffers from miscalibration. We propose VNDUQE (VIB-based Novelty Detection and Uncertainty Quantification for Nondestructive Evaluation), which investigates novelty detection through the Deep Variational Information Bottleneck (VIB), which explicitly constrains information flow through learned representations. We train VIB models on MNIST with held-out digit classes and evaluate OOD detection using information-theoretic metrics: KL divergence and prediction entropy. Our results reveal complementary detection signals: KL divergence achieves perfect detection (100\% AUROC on noise) on far-OOD samples (noise, domain shift), while prediction entropy excels at near-OOD detection (94.7\% AUROC on novel digit classes). A parallel detection strategy combining both metrics achieves 95.3\% average AUROC and 92\% true positive rate at 5\% false positive rate, which is a 32 percentage point improvement over baseline MSP (85.0\% AUROC, 60.1\% TPR). Compression via the information bottleneck principle ($β=10^{-3}$) reduces Expected Calibration Error by 38\%, demonstrating that information-theoretic constraints produce fundamentally more reliable uncertainty estimates. These findings directly support active learning with expensive computational oracles, where well-calibrated novelty detection enables principled threshold selection for oracle queries.
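
The two signals this abstract combines have standard closed forms, sketched below in plain Python. The KL term assumes a diagonal-Gaussian encoder posterior measured against a standard normal prior, which is the usual VIB setup but is not stated in the abstract. Reading "parallel detection" as a logical OR of two thresholded scores is also an assumption, and the function names and threshold values are hypothetical.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) for one encoded sample."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def prediction_entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def is_novel(mu, log_var, probs, kl_threshold, entropy_threshold):
    """Flag a sample when either score exceeds its threshold (assumed OR rule)."""
    return (kl_to_standard_normal(mu, log_var) > kl_threshold
            or prediction_entropy(probs) > entropy_threshold)


# A far-OOD-like sample: the encoding sits far from the prior, so the KL
# term fires even though the classifier output is confident.
flag = is_novel([3.0, 3.0], [0.0, 0.0], [0.98, 0.01, 0.01],
                kl_threshold=1.0, entropy_threshold=0.5)
```

Under this reading, the KL score catches inputs whose encodings land far from the prior, while the entropy score catches inputs the classifier cannot decide between, which matches the complementary roles the abstract reports.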

2605.11550 2026-05-13 cs.CV

The DAWN of World-Action Interactive Models

Hongbo Lu, Liang Yao, Chenghao He, Haoyu Wang, Xiang Gu, Xianfei Li, Wenlong Liao, Tao He, Pai Peng

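AI summary This paper proposes DAWN, a world-action interactive model that addresses the mutual dependence between world evolution and action generation in autonomous driving. DAWN couples a world predictor with a world-conditioned action denoiser in a semantic latent space, recursively refining world prediction and action generation to support long-horizon trajectory generation in complex interactive scenes. Experiments show that DAWN achieves strong planning performance and safety results on multiple autonomous driving benchmarks, demonstrating the potential of interactive world-action generation for building truly actionable world models.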

English abstract

A plausible scene evolution depends on the maneuver being considered, while a good maneuver depends on how the scene may evolve. Existing World Action Models (WAMs) largely miss this reciprocity, treating world prediction and action generation as either isolated parallel branches or rigid predict-then-plan pipelines. We formalize this perspective as World-Action Interactive Models (WAIMs), and instantiate it in autonomous driving with \textbf{DAWN} (\textbf{D}enoising \textbf{A}ctions and \textbf{W}orld i\textbf{N}teractive model), a simple yet strong latent generative baseline. DAWN operates in a compact semantic latent space and couples a \emph{World Predictor} with a \emph{World-Conditioned Action Denoiser}: the predicted world hypothesis conditions action denoising, while the denoised action hypothesis is fed back to update the world prediction, so that both are recursively refined during inference. Rather than eliminating test-time world evolution altogether or rolling out the full future in pixel space, DAWN performs a short explicit latent rollout that is sufficient to support long-horizon trajectory generation in complex interactive scenes. Experiments show that DAWN achieves strong planning performance and favorable safety-related results across multiple autonomous driving benchmarks. More broadly, our results suggest that interactive world-action generation is a principled path toward truly actionable world models.

2605.11547 2026-05-13 cs.LG cs.AI

Sharpen Your Flow: Sharpness-Aware Sampling for Flow Matching

Aditi Gupta, Soon Hoe Lim, Annan Yu, N. Benjamin Erichson

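AI summary This paper proposes SharpEuler, a training-free sampling method that improves the generation efficiency and quality of flow matching models. It profiles a pretrained model offline to estimate where the velocity field changes most sharply, and from this generates a timestep grid for any inference budget, improving sampling while keeping the number of model evaluations unchanged. Experiments show that at a fixed compute budget SharpEuler reduces inter-mode leakage and increases mode coverage, offering a new approach to efficient generation.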

English abstract

Flow matching models generate samples by numerically integrating a learned velocity field, with each integration step requiring a neural network evaluation. Fast generation therefore requires using a small fixed evaluation budget effectively: the key question is not only how to integrate the flow, but where the sampler should spend its steps. We propose SharpEuler, a training-free sampler that profiles a pretrained model offline by estimating where the learned velocity field changes most rapidly along calibration trajectories. This finite-difference estimate defines a solver-aware sharpness profile, which is smoothed and converted by a quantile transform into a timestep grid for any desired inference budget. At test time, sampling remains ordinary Euler integration with the same number of model evaluations as a uniform schedule. We justify SharpEuler using three principles: a numerical principle identifying trajectory acceleration as the leading source of Euler discretization error, a variational principle deriving sharpness-based power-law timestep densities, and a statistical guarantee showing that the finite-sample calibrated sampler is stable at the terminal distribution level. Our experiments show that SharpEuler improves sample quality at fixed budgets, reducing inter-mode leakage and increasing mode coverage.
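
The profile-to-grid step described in this abstract can be sketched as a quantile transform: treat a power of the sharpness profile as an unnormalized density over time, then place grid points at equal quantiles of that density. The sketch assumes the profile is already available as nonnegative values on a uniform fine grid of [0, 1]. The exponent `power`, the `floor` constant, the function names, and the omission of the smoothing step are all simplifications made here, not details from the paper.

```python
def timestep_grid(sharpness, n_steps, power=0.5, floor=1e-6):
    """Place n_steps + 1 time points at equal quantiles of sharpness ** power.

    sharpness[i] is the nonnegative profile value on cell i of a uniform
    fine grid over [0, 1].
    """
    m = len(sharpness)
    density = [(s + floor) ** power for s in sharpness]
    total = sum(density)
    cdf = [0.0]  # cumulative mass at the fine-grid cell edges
    for d in density:
        cdf.append(cdf[-1] + d / total)
    grid, j = [], 0
    for k in range(n_steps + 1):
        q = k / n_steps
        while j < m - 1 and cdf[j + 1] < q:
            j += 1
        # Linear interpolation inside cell j, clipped against rounding error.
        frac = (q - cdf[j]) / (cdf[j + 1] - cdf[j])
        grid.append((j + min(max(frac, 0.0), 1.0)) / m)
    grid[0], grid[-1] = 0.0, 1.0
    return grid


def euler_sample(velocity, x0, grid):
    """Ordinary Euler integration of dx/dt = velocity(x, t) along the grid."""
    x = x0
    for t0, t1 in zip(grid, grid[1:]):
        x = x + (t1 - t0) * velocity(x, t0)
    return x


# A profile that is flat early and sharp late sends almost all steps to
# the second half of the interval.
grid = timestep_grid([0.0] * 5 + [10.0] * 5, n_steps=8, power=1.0)
```

The number of velocity evaluations depends only on `n_steps`, so a non-uniform grid costs the same as a uniform one. Only the location of the steps changes.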

2605.11541 2026-05-13 cs.CV

GeoR-Bench: Evaluating Geoscience Visual Reasoning

Yushuo Zheng, Zicheng Zhang, Huiyu Duan, Chunyi Li, Zijian Chen, Ziheng Jia, Yue Shi, Ke Gu, Xiongkuo Min, Guangtao Zhai

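AI summary GeoR-Bench is a benchmark for evaluating geoscience visual reasoning, intended to address the limited ability of current AI systems to understand and predict earth system changes. It contains 440 curated samples spanning 6 geoscience categories and 24 task types, and uses visual editing tasks to assess models' reasoning, consistency, and output quality. Results show that existing models still face a significant bottleneck in geoscience reasoning: the best model reaches only 42.7% overall accuracy and open-source models do worse, indicating substantial room for improvement in scientific accuracy.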

English abstract

Geoscience intelligence is expected to understand, reason about, and predict earth system changes to support human decision-making in critical domains such as disaster response, climate adaptation, and environmental protection. Although current research has shown promising progress on specific geoscience tasks, such as remote sensing interpretation and geographic question answering, existing benchmarks remain largely task-specific and fail to capture open-ended real-world geoscience problems. As a result, it remains unclear how far current AI systems are from achieving genuine geoscience intelligence. To address this gap, we present \textbf{GeoR-Bench}, a \underline{Bench}mark for evaluating \underline{Geo}science visual \underline{R}easoning through reasoning-informed visual editing tasks. GeoR-Bench contains 440 curated samples spanning 6 geoscience categories and 24 task types, covering earth observation imagery and structured scientific representations such as maps and diagrams. We evaluate outputs along three dimensions: reasoning, consistency, and quality. Benchmark results of 21 closed- and open-source multimodal models reveal that geoscience reasoning remains a critical bottleneck. The highest-performing model achieves 42.7\% overall strict accuracy, while the best open-source models reach only 10.3\%. Notably, the visual consistency and image quality of the outputs frequently surpass their scientific accuracy. Ultimately, these findings indicate that current models generate superficially plausible results but fail to capture underlying earth science processes.

2605.11538 2026-05-13 cs.CL cs.AI cs.LG

Taming Extreme Tokens: Covariance-Aware GRPO with Gaussian-Kernel Advantage Reweighting

Cheng Wang, Qin Liu, Wenxuan Zhou, Muhao Chen

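AI summary To address the balance between exploration and exploitation when training large language models, this paper proposes a covariance-aware variant of GRPO. The method uses a Gaussian kernel to dynamically down-weight extreme token updates, reducing training instability without losing useful learning signal. Experiments show that it outperforms the original GRPO on several reasoning benchmarks, improving downstream performance and stabilizing entropy during training.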

Comments ACL 2026

English abstract

Group Relative Policy Optimization (GRPO) has emerged as a promising approach for improving the reasoning capabilities of large language models. However, it struggles to effectively balance the tradeoff between exploration and exploitation during training, often resulting in suboptimal performance. Motivated by the theoretical insight that changes in entropy are governed by the covariance between token probabilities and their corresponding advantages, we propose a hyperparameter-free, covariance-weighted optimization method that dynamically down-weights extreme token-level updates via a Gaussian kernel. This approach automatically reduces the instability caused by the exploration-exploitation tradeoff while preserving informative learning signals. Extensive empirical evaluations show that our approach improves downstream performance across reasoning benchmarks compared with GRPO, and effectively stabilizes entropy as training progresses.
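
One way to picture the reweighting described in this abstract is sketched below. It is a guess at the general shape, not the paper's formula. Each token gets a covariance contribution (centered log-probability times centered advantage), and a Gaussian kernel centered on the batch mean of those contributions, with the batch variance as bandwidth, shrinks the tokens whose contribution is most extreme. Taking the bandwidth from batch statistics is one hypothetical way to avoid a tuned hyperparameter. The function names are illustrative, and `eps` is a numerical guard only.

```python
import math


def gaussian_kernel_weights(logprobs, advantages, eps=1e-8):
    """Per-token weights in (0, 1] that shrink tokens with extreme covariance terms."""
    n = len(logprobs)
    mean_lp = sum(logprobs) / n
    mean_adv = sum(advantages) / n
    # Per-token contribution to the batch covariance of log-prob and advantage.
    cov_terms = [(lp - mean_lp) * (a - mean_adv)
                 for lp, a in zip(logprobs, advantages)]
    mu = sum(cov_terms) / n
    var = sum((c - mu) ** 2 for c in cov_terms) / n
    return [math.exp(-((c - mu) ** 2) / (2.0 * (var + eps))) for c in cov_terms]


def reweighted_advantages(logprobs, advantages):
    """Advantages scaled by their kernel weights, ready for a policy-gradient loss."""
    weights = gaussian_kernel_weights(logprobs, advantages)
    return [w * a for w, a in zip(weights, advantages)]


# The last token has both an unusually low log-probability and a large
# advantage, so its covariance term is the outlier and it is shrunk most.
weights = gaussian_kernel_weights([-1.0, -1.1, -0.9, -1.0, -6.0],
                                  [0.1, -0.1, 0.2, -0.2, 3.0])
```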