arXivDaily: arXiv daily academic digest, updated Monday through Friday
2602.09850 2026-05-11 cs.CV

Towards Explainable Industrial Anomaly Detection via Knowledge-Guided Latent Reasoning

Peng Chen, Chao Huang, Yunkang Cao, Chengliang Liu, Wei Wang, Wenqiang Wang, Mingbo Yang, Li Shen, Wenqi Ren, Xiaochun Cao

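AI summary: Industrial defect detection requires precise reasoning over fine-grained defect patterns, but existing multimodal large language models pretrained on general-domain data fall short in capturing category-specific anomalies, which hurts detection accuracy and interpretability. The paper proposes Reason-IAD, a knowledge-guided dynamic latent reasoning framework for explainable industrial anomaly detection. It combines a retrieval-augmented knowledge module with an entropy-driven latent reasoning mechanism, improving detection performance and interpretability by introducing category-specific textual descriptions and optimizing the latent reasoning process. Experiments show that Reason-IAD outperforms existing state-of-the-art methods across multiple tasks.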

English abstract

Industrial anomaly detection demands precise reasoning over fine-grained defect patterns. However, existing multimodal large language models (MLLMs), pretrained on general-domain data, often struggle to capture category-specific anomalies, thereby limiting both detection accuracy and interpretability. To address these limitations, we propose Reason-IAD, a knowledge-guided dynamic latent reasoning framework for explainable industrial anomaly detection. Reason-IAD comprises two core components. First, a retrieval-augmented knowledge module incorporates category-specific textual descriptions into the model input, enabling context-aware reasoning over domain-specific defects. Second, an entropy-driven latent reasoning mechanism conducts iterative exploration within a compact latent space using optimizable latent think tokens, guided by an entropy-based reward that encourages confident and stable predictions. Furthermore, a dynamic visual injection strategy selectively incorporates the most informative image patches into the latent sequence, directing the reasoning process toward regions critical for anomaly detection. Extensive experimental results demonstrate that Reason-IAD consistently outperforms state-of-the-art methods across multiple tasks. The code will be publicly available at https://github.com/chenpeng052/Reason-IAD.

2602.09782 2026-05-11 cs.LG cs.AI cs.CL

Flexible Entropy Control in RLVR with a Gradient-Preserving Perspective

Kun Chen, Peng Shi, Fanfan Liu, Haibo Qiu, Zhixiong Zeng, Siqi Yang, Wenji Mao

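AI summary: This paper studies policy entropy collapse in reinforcement learning with verifiable rewards (RLVR) and proposes a flexible entropy control method based on gradient-preserving clipping. Through theoretical analysis and experiments, the authors identify how regions of the importance sampling ratio affect entropy change, and design a dynamic clipping-threshold mechanism to regulate entropy precisely. The proposed dynamic entropy control strategies effectively mitigate entropy collapse on multiple benchmarks and significantly improve model performance.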

Comments https://github.com/Kwen-Chen/Flexible-Entropy-Control

English abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a critical method for enhancing the reasoning capabilities of Large Language Models (LLMs). However, continuous training often leads to policy entropy collapse, characterized by a rapid decay in entropy that results in premature overconfidence, reduced output diversity, and vanishing gradient norms that inhibit learning. Gradient-Preserving Clipping is a primary factor influencing these dynamics, but existing mitigation strategies are largely static and lack a framework connecting clipping mechanisms to precise entropy control. This paper proposes reshaping entropy control in RL from the perspective of Gradient-Preserving Clipping. We first theoretically and empirically verify the contributions of specific importance sampling ratio regions to entropy growth and reduction. Leveraging these findings, we introduce a novel regulation mechanism using dynamic clipping thresholds to precisely manage entropy. Furthermore, we design and evaluate dynamic entropy control strategies, including increase-then-decrease, decrease-increase-decrease, and oscillatory decay. Experimental results demonstrate that these strategies effectively mitigate entropy collapse and achieve superior performance across multiple benchmarks.

2602.09229 2026-05-11 cs.LG cs.IR

When Does Embedding Magnitude Matter? A Cross-Task Functional-Symmetry Framework

Xincan Feng, Taro Watanabe

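AI summary: This paper studies how embedding magnitude affects different tasks and proposes a cross-task framework based on functional symmetry. By independently controlling query-side and document-side normalization, it exposes two previously unstudied intermediate variants (QNorm and DNorm). Experiments show that these unilateral normalization methods outperform conventional cosine similarity and dot product on retrieval and several downstream tasks. The study further finds that a task's functional symmetry determines the choice of normalization strategy, and validates the broad applicability of this mechanism across multiple task families.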

Comments Preliminary work. Under review

English abstract

Cosine similarity normalizes both sides; dot product normalizes neither. We propose a 2x2 framework that independently controls query-side and document-side normalization, exposing two intermediate variants (QNorm, DNorm) that have not been previously studied. On retrieval with four encoders, evaluated in-domain on MS MARCO and out-of-domain on BEIR, BRIGHT, and multi-hop QA, the unilateral variants outperform both cosine and dot product, with relative gains of up to +72% out-of-domain and +24% on downstream RAG. Cross-evaluation reveals the mechanism: document magnitude scales inference scores while query magnitude modulates training gradients, and the Fisher Information Matrix condition number predicts which side to normalize. We then classify tasks by functional symmetry, defined as whether the aggregate scoring procedure treats Q and C as interchangeable, and test whether the mechanism extends beyond retrieval. On five additional task families (semantic textual similarity, CLIP, knowledge graph completion, few-shot classification, recommender systems), the coarse prediction (cosine for symmetric, magnitude-preserving for asymmetric) holds in every case examined; the unilateral variants beat Cosine on recommendation, and on few-shot classification DNorm beats both Cosine and the standard Euclidean default of Prototypical Networks.
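The 2x2 grid of scoring functions described in this abstract can be sketched in a few lines. This is an illustration only, not the authors' code: it assumes that QNorm means normalizing the query side only and DNorm the document side only, and the names `l2_normalize`, `score`, and `VARIANTS` are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row of x to unit L2 norm (eps guards against zero vectors)."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def score(queries, docs, norm_query, norm_doc):
    """Inner-product scores with optional per-side normalization."""
    q = l2_normalize(queries) if norm_query else queries
    d = l2_normalize(docs) if norm_doc else docs
    return q @ d.T


# (normalize query?, normalize document?) for each cell of the 2x2 grid.
# The mapping of the names QNorm/DNorm to sides is an assumption.
VARIANTS = {
    "dot": (False, False),
    "cosine": (True, True),
    "qnorm": (True, False),
    "dnorm": (False, True),
}

rng = np.random.default_rng(0)
queries = rng.normal(size=(2, 8))
# Give documents deliberately different magnitudes.
docs = rng.normal(size=(5, 8)) * rng.uniform(0.5, 3.0, size=(5, 1))

scores = {name: score(queries, docs, *flags) for name, flags in VARIANTS.items()}
for name, s in scores.items():
    print(name, np.argsort(-s, axis=1)[0])  # document ranking for the first query
```

For a fixed query, dividing by the query norm rescales every document score by the same positive constant, so under this reading the inference-time ranking of `qnorm` matches `dot` and that of `dnorm` matches `cosine`. That is consistent with the abstract's statement that query magnitude acts through training gradients rather than inference scores.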

2602.06283 2026-05-11 cs.LG

SOCKET: SOft Collision Kernel EsTimator for Sparse Attention

Sahil Joshi, Agniva Chowdhury, Wyatt Bellinger, Amar Kanakamedala, Ekam Singh, Hoang Anh Duy Le, Aditya Desai, Anshumali Shrivastava

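AI summary: Exploiting sparsity is key to scaling large language models for long-context inference, where attention is the dominant cost of autoregressive decoding. This paper proposes SOCKET, a sparse attention method based on soft collision kernel estimation that replaces the hard bucket matching of traditional LSH with probabilistic, similarity-aware aggregation, substantially reducing memory consumption while preserving top-k ordering. SOCKET reframes LSH from a candidate generator into a principled scoring kernel, enabling efficient token selection, and matches or surpasses existing sparse attention methods on multiple long-context benchmarks.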

Comments 7 figures, 17 tables

English abstract

Exploiting sparsity during long-context inference is key to scaling large language models, as attention dominates the cost of autoregressive decoding. Sparse attention reduces this cost by restricting computation to a subset of tokens, but its effectiveness depends on efficient scoring and selection at inference time. We revisit Locality-Sensitive Hashing (LSH) and introduce SOCKET, a SOft Collision Kernel EsTimator that replaces hard bucket matches with probabilistic, similarity-aware aggregation. Traditional LSH yields binary collision signals that limit ranking quality and require substantial memory to perform well. In contrast, soft LSH accumulates graded collision evidence across hash tables, preserving top-k ordering with significantly less memory. This reframes LSH from a candidate generator into a principled scoring kernel for sparse attention. Leveraging this property, SOCKET enables efficient token selection without ad hoc voting and matches or surpasses prior sparse attention methods across multiple long-context benchmarks. With a custom CUDA scoring kernel and a Flash Decode Triton backend, SOCKET achieves up to 1.5$\times$ higher throughput than FlashAttention.

2602.05359 2026-05-11 cs.CV

Multimodal Latent Reasoning via Hierarchical Visual Cues Injection

Yiming Zhang, Qiangyu Yan, Borui Jiang, Kai Han

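AI summary: This paper addresses the inefficiency, redundancy, and hallucination that arise in the reasoning process of multimodal large language models, and proposes HIVE, a multimodal latent reasoning framework based on hierarchical visual cue injection. The method recursively extends transformer blocks to build an internal reasoning loop, and injects visual cues ranging from the global scene to fine-grained regions into the latent representations, enabling multi-step reasoning grounded in visual information. Experiments show that the method improves the model's understanding of complex scenes and better integrates visual knowledge during reasoning.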

English abstract

The advancement of multimodal large language models (MLLMs) has enabled impressive perception capabilities. However, their reasoning process often remains a "fast thinking" paradigm, reliant on end-to-end generation or explicit, language-centric chains of thought (CoT), which can be inefficient, verbose, and prone to hallucination. This work posits that robust reasoning should evolve within a latent space, integrating multimodal signals seamlessly. We propose multimodal latent reasoning via HIerarchical Visual cuEs injection (HIVE), a novel framework that instills deliberate, "slow thinking" without depending on superficial textual rationales. Our method recursively extends transformer blocks, creating an internal loop for iterative reasoning refinement. Crucially, it injectively grounds this process with hierarchical visual cues from global scene context to fine-grained regional details directly into the model's latent representations. This enables the model to perform grounded, multi-step inference entirely in the aligned latent space. Extensive evaluations demonstrate that test-time scaling is effective when incorporating vision knowledge, and that integrating hierarchical information significantly enhances the model's understanding of complex scenes.

2602.04556 2026-05-11 cs.CL cs.LG

Rethinking Weight Tying: Pseudo-Inverse Tying for LM Stable Training and Updates

Jian Gu, Aldeida Aleti, Chunyang Chen, Hongyu Zhang

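AI summary: This paper revisits weight tying, a technique widely used in language models, and proposes Pseudo-Inverse Tying (PIT) to improve training stability and update efficiency. PIT treats the embedding and unembedding layers as coupled projections of a shared latent token memory, maintaining a pseudo-inverse-consistent interface throughout training. It introduces an orthonormal shared memory and a symmetric positive definite transform, avoiding explicit pseudo-inverse computation and vocabulary-sized auxiliary parameters, so that it improves training stability while giving interpretability analyses a cleaner structural basis. Experiments show that PIT improves continued-pretraining stability on on-device models of different sizes and keeps the token interface consistent.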

Comments an early-stage version

English abstract

Weight tying is widely used in compact language models to reduce parameters by sharing the token table between the input embedding and the output projection. However, parameter sharing alone does not guarantee a stable token interface: during training, the correspondence between encoding tokens into hidden states and decoding hidden states into logits can drift, worsening optimization sensitivity and weakening explainability probes that rely on a meaningful vocabulary-space decoder. We propose Pseudo-Inverse Tying (PIT), which synchronizes embedding and unembedding as coupled projections of a shared latent token memory, guaranteeing a pseudo-inverse-consistent interface throughout training. PIT maintains an orthonormal shared memory, obtained by polar initialization from a source checkpoint for continued pretraining or by random orthonormal initialization for from-scratch pretraining, and introduces a learned symmetric positive definite hidden-space transform parameterized via a Cholesky factor. The output head applies this transform to hidden states before the vocabulary projection, while the embedding applies the inverse transform to token vectors using stable triangular solves, avoiding explicit pseudo-inverse recomputation and vocabulary-sized auxiliary parameters. Beyond improving training stability, PIT provides a cleaner substrate for logit-lens-style and vocabulary-space explainability probes by keeping the input and output token geometries synchronized. We evaluate PIT on on-device models spanning 256M-1.3B parameters. The results show that PIT improves continued-pretraining stability, enforces near-exact token-interface consistency across settings, and yields more predictable lightweight adaptation after continued pretraining, while from-scratch pretraining reveals a trade-off between strict interface consistency and unconstrained optimization.
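One way to read the construction in this abstract is as plain linear algebra. The sketch below is an illustration under assumptions, not the authors' implementation: the shapes, the random initialization, and the names `U`, `L`, `S`, `embed`, and `unembed` are hypothetical, and `np.linalg.solve` stands in for the dedicated triangular solves the abstract mentions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, hidden = 50, 8

# Shared token memory with orthonormal columns: U.T @ U = I (random init here).
U, _ = np.linalg.qr(rng.normal(size=(vocab, hidden)))

# Symmetric positive definite transform S = L @ L.T, parameterized by a
# lower-triangular Cholesky factor L with a strictly positive diagonal.
L = np.tril(rng.normal(size=(hidden, hidden)) * 0.1)
np.fill_diagonal(L, np.exp(rng.normal(size=hidden) * 0.1))
S = L @ L.T


def unembed(h):
    """Hidden states (n, hidden) -> logits (n, vocab): apply S, then project on U."""
    return (h @ S) @ U.T  # S is symmetric, so h @ S applies S to each row


def embed(token_ids):
    """Token ids -> input vectors: apply S^{-1} to rows of U via two solves with L."""
    u = U[token_ids].T             # (hidden, n)
    y = np.linalg.solve(L, u)      # solve L y = u
    x = np.linalg.solve(L.T, y)    # solve L^T x = y, so x = S^{-1} u
    return x.T


W_out = U @ S                      # output head as a (vocab, hidden) matrix
E = embed(np.arange(vocab))        # embedding table, equal to U @ S^{-1}

# Under these assumptions the embedding table is the transposed pseudo-inverse
# of the output head, without ever calling pinv during the forward pass.
print(np.allclose(E.T, np.linalg.pinv(W_out)))
```

Because `U` has orthonormal columns and `S` is invertible, the pseudo-inverse of `U @ S` is `S^{-1} @ U.T`, which is exactly what the two solves produce; the explicit `np.linalg.pinv` call appears only as a check.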

2602.04447 2026-05-11 cs.LG cs.AI

Mixture of Masters: Sparse Chess Language Models with Player Routing

Giacomo Frisoni, Lorenzo Molfetta, Davide Freddi, Gianluca Moro

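AI summary: This paper proposes Mixture of Masters (MoM), a sparse chess language model built from several small GPT expert networks, each imitating the style of a top player, with a learnable gating network that dynamically selects the most suitable expert for the current game state. The approach avoids the single-style behavior and loss of strategic diversity of conventional dense models. On standard game tests it outperforms both dense models and GPT baselines trained on aggregated data, while preserving generation variety, controllability, and interpretability.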

English abstract

Modern chess language models are dense transformers trained on millions of games played by thousands of high-rated individuals. However, these monolithic networks tend to collapse into mode-averaged behavior, where stylistic boundaries are blurred, and rare but effective strategies are suppressed. To counteract homogenization, we introduce Mixture-of-Masters (MoM), the first chess mixture-of-experts model with small-sized GPT experts emulating world-class grandmasters. For each move, a post-hoc learnable gating network selects the most appropriate persona to channel depending on the game state, allowing MoM to switch its style dynamically, e.g., Tal's offensive vocation or Petrosian's defensive solidity. When evaluated against Stockfish on unseen standard games, MoM outperforms both dense individual expert networks and popular GPT baselines trained on aggregated data, while ensuring generation variety, control, and interpretability.

2602.03331 2026-05-11 cs.LG

Bayesian Conformal Prediction as a Decision Risk Problem

Fanyi Wu, Veronika Lohmanova, Samuel Kaski, Michele Caprio

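AI summary: This paper proposes Bayesian Conformal Prediction (BCP), a framework that combines Bayesian posterior predictive distributions with PAC-style conformal risk control to guarantee finite-sample coverage of prediction sets. Unlike conventional methods based on a fixed quantile threshold, BCP formulates conformal prediction as a decision-risk optimization problem and produces optimized highest posterior density (HPD) prediction sets, which concentrate probability mass more efficiently under multimodal distributions. Experiments show that BCP substantially reduces prediction set size while maintaining coverage, and remains reliable under model misspecification.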

Comments 22 pages, 8 figures. A previous version was accepted at the EIML Workshop at NeurIPS 2025

English abstract

We propose Bayesian Conformal Prediction (BCP), a framework that combines Bayesian posterior predictive distributions with PAC-style conformal risk control to produce prediction sets with finite-sample coverage guarantees. Standard quantile-threshold conformal methods often construct prediction sets using a single fixed threshold, which typically yields connected prediction sets. While valid, such sets can be inefficient when the posterior predictive distribution is multimodal, since they may span low-density regions between separated modes. The main contribution of BCP is to formulate conformal prediction as a decision-risk optimisation problem, extending standard fixed quantile-threshold sets to optimised highest posterior density (HPD) prediction sets. These sets can be disjoint, concentrating probability mass on separated high-density regions. Validity is enforced using a PAC-style risk constraint, which provides coverage control even when the Bayesian model is misspecified. In standard nested-threshold settings, BCP recovers the smallest feasible threshold, aligning with existing PAC-based approaches. In the multimodal experiment, HPD geometry substantially improves efficiency, reducing mean prediction set size from $4.82$ to $2.07$ while satisfying the target PAC pass rate. Across regression, classification, and distribution-shift experiments, BCP maintains reliable coverage under model misspecification, whereas Bayesian credible intervals can fail to preserve nominal coverage.
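The efficiency argument for HPD sets under multimodality can be illustrated on a toy density. The sketch below is not BCP itself: it omits the PAC-style calibration entirely and only compares a connected central interval with a highest-density set on an assumed two-component Gaussian mixture; all names and parameter values are hypothetical.

```python
import numpy as np


def normal_pdf(y, mean, std):
    return np.exp(-0.5 * ((y - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))


# Toy bimodal predictive density on a grid (assumed mixture, not from the paper).
grid = np.linspace(-8.0, 8.0, 3201)
step = grid[1] - grid[0]
density = 0.5 * normal_pdf(grid, -3.0, 0.5) + 0.5 * normal_pdf(grid, 3.0, 0.5)
mass = density * step
mass /= mass.sum()  # probability mass per grid cell


def hpd_mask(mass, coverage):
    """Select the highest-mass cells until their total mass reaches `coverage`."""
    order = np.argsort(mass)[::-1]
    cumulative = np.cumsum(mass[order])
    k = np.searchsorted(cumulative, coverage) + 1
    mask = np.zeros(mass.shape, dtype=bool)
    mask[order[:k]] = True
    return mask


def central_interval_mask(mass, coverage):
    """Connected interval between the lower and upper tail quantiles."""
    cdf = np.cumsum(mass)
    tail = (1 - coverage) / 2
    lo = np.searchsorted(cdf, tail)
    hi = np.searchsorted(cdf, 1 - tail)
    mask = np.zeros(mass.shape, dtype=bool)
    mask[lo:hi + 1] = True
    return mask


coverage = 0.9
hpd = hpd_mask(mass, coverage)
interval = central_interval_mask(mass, coverage)
hpd_size = hpd.sum() * step
interval_size = interval.sum() * step
print(f"HPD set length: {hpd_size:.2f}, central interval length: {interval_size:.2f}")
```

The HPD set splits into two pieces around the modes and skips the low-density gap near zero, while the connected interval has to span it, which is the inefficiency the abstract attributes to single-threshold sets.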

2602.03201 2026-05-11 cs.LG

SLOPE: Optimistic Potential Landscape Shaping for Model-based Reinforcement Learning

Yao-Hui Li, Zeyu Wang, Xin Li, Wei Pang, Yingfang Yuan, Zhengkun Chen, Boya Zhang, Riashat Islam, Alex Lamb, Yonggang Zhang

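AI summary: This paper proposes SLOPE, a model-based reinforcement learning framework that addresses the lack of informative gradients in sparse-reward environments. It uses optimistic distributional regression to estimate high-confidence upper bounds on reward, amplifying rare success signals and producing a more informative potential landscape that guides effective exploration and planning. Experiments show that SLOPE outperforms existing state-of-the-art methods on multiple benchmarks and real-world robotic tasks, across sparse, semi-sparse, and dense reward settings.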

Comments Work in progress

English abstract

Model-based reinforcement learning (MBRL) is sample-efficient but struggles in sparse reward settings. A critical bottleneck arises from the lack of informative gradients in sparse settings, where standard reward models often yield flat landscapes that struggle to guide planning. To address this challenge, we propose Shaping Landscapes with Optimistic Potential Estimates (SLOPE), a novel framework that shifts reward modeling from predicting sparse scalars to constructing informative potential landscapes. SLOPE employs optimistic distributional regression to estimate high-confidence upper bounds, which amplifies rare success signals and ensures sufficient exploration gradients. Evaluations on 30+ tasks across 5 benchmarks and real-world robotic deployments demonstrate that SLOPE consistently outperforms leading baselines in fully sparse, semi-sparse, and dense reward settings.

2602.02739 2026-05-11 cs.LG cs.AI

TopoPrune: Robust Data Pruning via Unified Latent Space Topology

Arjun Roy, Prajna G. Malettira, Manish Nagaraj, Kaushik Roy

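AI summary: TopoPrune is a robust, topology-based data pruning method that addresses the instability of conventional geometric pruning methods under latent space perturbations. Using a unified latent space topology, it prunes data at both a global and a local scale, relying on a topology-aware manifold approximation and differentiable persistent homology respectively, which improves pruning accuracy and robustness. Experiments show that TopoPrune maintains strong performance at high pruning rates and is more stable under noise and in cross-architecture transfer.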

Comments Preprint. Under Review

English abstract

Geometric data pruning methods, while practical for leveraging pretrained models, are fundamentally unstable. Their reliance on extrinsic geometry renders them highly sensitive to latent space perturbations, causing performance to degrade during cross-architecture transfer or in the presence of feature noise. We introduce TopoPrune, a framework which resolves this challenge by leveraging topology to capture the stable, intrinsic structure of data. TopoPrune operates at two scales. (1) It uses a topology-aware manifold approximation to establish a global low-dimensional embedding of the dataset. (2) It then employs differentiable persistent homology to perform a local topological optimization on the manifold embeddings, ranking samples by their structural complexity. We demonstrate that our unified dual-scale topological approach ensures high accuracy and precision, particularly at significant dataset pruning rates (e.g., 90%). Furthermore, through the inherent stability properties of topology, TopoPrune (a) is exceptionally robust to noise perturbations of latent feature embeddings and (b) demonstrates superior transferability across diverse network architectures. This study demonstrates a promising avenue towards stable and principled topology-based frameworks for robust data-efficient learning.

2602.02320 2026-05-11 cs.CL cs.AI q-bio.BM

A Large-Scale Dataset for Molecular Structure-Language Description via a Rule-Regularized Method

Feiyang Cai, Guijuan He, Yi Hu, Jingjing Wang, Joshua Luo, Tianyu Zhu, Srikanth Pilla, Gang Li, Ling Liu, Feng Luo

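AI summary: This study proposes a rule-based automated annotation framework for generating natural-language descriptions that carry complete molecular structure information, addressing the difficulty of building large-scale, high-quality datasets of molecular structure-language pairs. It extends a chemical nomenclature parser to produce structured XML metadata, which guides large language models to generate precise descriptions, yielding a dataset of about 163k molecule-description pairs with a validated description precision of 98.6%. The dataset provides a reliable foundation for molecule-language alignment research and is applicable to a range of chemical tasks.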

English abstract

Molecular function is largely determined by structure. Accurately aligning molecular structure with natural language is therefore essential for enabling large language models (LLMs) to reason about downstream chemical tasks. However, the substantial cost of human annotation makes it infeasible to construct large-scale, high-quality datasets of structure-grounded descriptions. In this work, we propose a fully automated annotation framework for generating precise molecular descriptions that preserve complete structural details at scale. Our approach builds upon and extends a rule-based chemical nomenclature parser to interpret IUPAC names and construct enriched, structural XML metadata that explicitly encodes molecular structure. This metadata is then used to guide LLMs in producing accurate natural-language descriptions. Using this framework, we curate a large-scale dataset of approximately $163$k molecule--description pairs. A rigorous validation protocol combining LLM-based and expert human evaluation on a subset of $2,000$ molecules demonstrates a high description precision of $98.6$%. The proposed annotation framework is readily beneficial to broader chemical tasks that rely on structural descriptions, with the resulting dataset providing a reliable foundation for molecule--language alignment. The source code and dataset are hosted at https://github.com/TheLuoFengLab/MolLangData and https://huggingface.co/datasets/ChemFM/MolLangData, respectively.

2602.01752 2026-05-11 cs.CL cs.CR

WorldCup Sampling for Multi-bit LLM Watermarking

Yidan Wang, Yubing Ren, Yanan Cao, Li Guo

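AI summary: As text generated by large language models grows increasingly close to human language, watermarking has become an important means of reliable attribution. This paper proposes WorldCup, a multi-bit watermarking framework that models the sampling process as a structured communication channel and embeds message bits through a hierarchical competition mechanism guided by complementary signals, achieving robust message recovery while preserving generation quality. Experiments show that WorldCup strikes a good balance among message capacity, detectability, robustness, text quality, and decoding efficiency, outperforming existing methods and providing a scalable, principled foundation for multi-bit watermarking research.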

English abstract

As large language models (LLMs) generate increasingly human-like text, watermarking has emerged as a promising solution for reliable attribution beyond mere detection. While multi-bit watermarking enables richer provenance encoding, existing approaches typically extend zero-bit watermarking schemes by introducing static logit perturbations and counting-based decoding strategies, which can degrade text quality and compromise decoding robustness as the payload increases. In this paper, we propose WorldCup, a multi-bit watermarking framework for LLMs that models the sampling process as a structured communication channel and embeds message bits through a hierarchical competition mechanism guided by complementary signals. Moreover, WorldCup incorporates entropy-aware modulation to preserve generation quality and enables robust message recovery via confidence-aware decoding that accounts for token-level reliability. Comprehensive experiments demonstrate that WorldCup achieves a strong balance across message capacity, detectability, robustness, text quality, and decoding efficiency, consistently outperforming prior baselines. We believe that this work establishes a scalable and principled foundation for future research on multi-bit watermarking in LLMs.

2602.01642 2026-05-11 cs.LG cs.AI math.OC stat.CO stat.ML

The Effect of Mini-Batch Noise on the Implicit Bias of Adam

Matias D. Cattaneo, Boris Shigida

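AI summary: This paper studies how mini-batch noise affects the implicit bias of the Adam optimizer, in particular its tendency toward sharper or flatter regions of the loss landscape, which in turn affects generalization. It finds that with large batches, increasing β₂ strengthens the anti-regularization effect of the memory terms and hurts generalization, whereas with small batches the effect of β₂ on regularization is reversed; the monotonicity in β₁ shifts similarly, in the opposite direction. The theoretical analysis also relates the batch size at which the shift occurs to the critical batch size, and experiments support these conclusions.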

English abstract

With limited high-quality data and growing compute, multi-epoch training is regaining importance across sub-areas of deep learning. Adam(W), versions of which are go-to optimizers for many tasks such as next token prediction, has two momentum hyperparameters $(β_1, β_2)$ controlling memory and one very important hyperparameter, batch size, controlling (in particular) the amount of mini-batch noise. We introduce a theoretical framework to understand how mini-batch noise influences the implicit bias of memory in Adam (depending on $β_1$, $β_2$) towards sharper or flatter regions of the loss landscape, which is commonly observed to correlate with the generalization gap in multi-epoch training. We find that in the case of large batch sizes, higher $β_2$ increases the magnitude of anti-regularization by memory (hurting generalization), but as the batch size becomes smaller, the dependence of (anti-)regularization on $β_2$ is reversed. A similar monotonicity shift (in the opposite direction) happens in $β_1$. In particular, the common "default" pair $(β_1, β_2) = (0.9, 0.999)$ is a good choice if batches are small; for larger batches, in many settings moving $β_1$ closer to $β_2$ is much better in terms of validation accuracy in multi-epoch training. Moreover, our theoretical derivations connect the scale of the batch size at which the shift happens to the scale of the critical batch size. We illustrate this effect in experiments with small-scale data in the about-to-overfit regime.
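For readers who want the three hyperparameters in front of them, the following is a minimal sketch of the standard Adam update with bias correction, run on a toy quadratic loss. It only shows where $β_1$, $β_2$, and batch size enter; it does not reproduce the paper's analysis or experiments, and the toy data, the `run` helper, and the learning rate are assumptions made for illustration.

```python
import numpy as np


def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One standard Adam update; m and v are the two memory (moment) buffers."""
    m = beta1 * m + (1 - beta1) * grad           # first-moment memory, set by beta1
    v = beta2 * v + (1 - beta2) * grad ** 2      # second-moment memory, set by beta2
    m_hat = m / (1 - beta1 ** t)                 # bias correction for step t >= 1
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v


rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=1024)  # toy dataset (assumption)


def run(batch_size, steps=2000, beta1=0.9, beta2=0.999):
    """Minimize the mean of (x - a_i)^2 over the data with mini-batch gradients."""
    x, m, v = np.zeros(1), np.zeros(1), np.zeros(1)
    for t in range(1, steps + 1):
        batch = rng.choice(data, size=batch_size, replace=False)
        grad = 2 * (x - batch.mean())             # smaller batches give noisier gradients
        x, m, v = adam_step(x, grad, m, v, t, lr=1e-2, beta1=beta1, beta2=beta2)
    return x


final_small = run(batch_size=4)
final_large = run(batch_size=256)
print(final_small, final_large, data.mean())
```

Changing `batch_size` alters only the variance of `grad`, while `beta1` and `beta2` set how long that noise is remembered in `m` and `v`, which is the interaction the abstract analyzes.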

2602.01166 2026-05-11 cs.RO

Latent Reasoning VLA: Latent Thinking and Prediction for Vision-Language-Action Models

Shuanghao Bai, Jing Lyu, Wanqi Zhou, Zhe Li, Dakai Wang, Lei Xing, Xiaoguang Zhao, Pengwei Wang, Zhongyuan Wang, Cheng Chi, Badong Chen, Shanghang Zhang

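AI summary: This paper proposes LaRA-VLA, a unified vision-language-action framework that internalizes multimodal chain-of-thought (CoT) reasoning into continuous latent representations, addressing the limitations of existing methods in reasoning efficiency and in matching perception and control. The method performs reasoning and prediction jointly in latent space, avoiding the overhead of explicit CoT generation and enabling efficient action control. With a curriculum-based training strategy and structured CoT datasets, LaRA-VLA performs strongly on both simulated and real-robot manipulation tasks, reducing inference latency by up to 90% compared with explicit CoT methods.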

Comments Accepted by ICML 2026

English abstract

Vision-Language-Action (VLA) models benefit from chain-of-thought (CoT) reasoning, but existing approaches incur high inference overhead and rely on discrete reasoning representations that mismatch continuous perception and control. We propose Latent Reasoning VLA (LaRA-VLA), a unified VLA framework that internalizes multi-modal CoT reasoning into continuous latent representations for embodied action. LaRA-VLA performs unified reasoning and prediction in latent space, eliminating explicit CoT generation at inference time and enabling efficient, action-oriented control. To realize latent embodied reasoning, we introduce a curriculum-based training paradigm that progressively transitions from explicit textual and visual CoT supervision to latent reasoning, and finally adapts latent reasoning dynamics to condition action generation. We construct two structured CoT datasets and evaluate LaRA-VLA on both simulation benchmarks and long-horizon real-robot manipulation tasks. Experimental results show that LaRA-VLA consistently outperforms state-of-the-art VLA methods while reducing inference latency by up to 90% compared to explicit CoT-based approaches, demonstrating latent reasoning as an effective and efficient paradigm for real-time embodied control. Project Page: https://loveju1y.github.io/Latent-Reasoning-VLA/

2602.01003 2026-05-11 cs.LG cs.AI

ESSAM: A Novel Competitive Evolution Strategies Approach to Reinforcement Learning for Memory Efficient LLMs Fine-Tuning

Zhishen Sun, Sizhe Dang, Guang Dai, Haishan Ye

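AI summary: This paper proposes ESSAM, a method for efficiently fine-tuning large language models under limited GPU resources to improve their mathematical reasoning. ESSAM combines the zero-order search of evolution strategies with Sharpness-Aware Maximization to perform full-parameter fine-tuning, and on tasks such as GSM8K its performance is comparable to, and in some cases better than, reinforcement learning methods. Experiments show that ESSAM substantially reduces GPU memory usage while maintaining high accuracy, and exhibits stronger generalization.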

English abstract

Reinforcement learning (RL) has become a key training step for improving mathematical reasoning in large language models (LLMs), but it often has high GPU memory usage, which makes it hard to use in settings with limited resources. To reduce these issues, we propose Evolution Strategies with Sharpness-Aware Maximization (ESSAM), a full parameter fine-tuning framework that tightly combines the zero-order search in parameter space from Evolution Strategies (ES) with Sharpness-Aware Maximization (SAM) to improve generalization. We conduct fine-tuning experiments on the mainstream mathematical reasoning task GSM8K. The results show that ESSAM achieves an average accuracy of 78.27% across all models and its overall performance is comparable to RL methods. It surpasses the classic RL algorithm PPO (77.72% accuracy) and is comparable to GRPO (78.34% accuracy), even surpassing both on some models. Further generalization experiments show that the models trained with ESSAM exhibit stronger generalization ability. Their average performance achieves the best results on 5 out of 6 datasets, indicating that ESSAM can effectively improve the generalization performance of fine-tuned models. In terms of GPU memory usage, ESSAM reduces the average GPU memory usage by $18\times$ compared to PPO and by $10\times$ compared to GRPO, achieving an extremely low GPU memory usage. In addition, we design an accelerated variant of ESSAM, which achieves nearly a twofold speedup while maintaining the same GPU memory usage as ESSAM, and attains an average accuracy of 78.02% across all models, outperforming PPO. Code: https://github.com/szs777/ESSAM

2602.00513 2026-05-11 cs.LG

Minerva: Reinforcement Learning with Verifiable Rewards for Cyber Threat Intelligence LLMs

Md Tanvirul Alam, Aritran Piplai, Ionut Cardei, Nidhi Rastogi, Peter J Worth

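AI summary: This paper presents Minerva, a reinforcement learning approach with verifiable rewards for improving the structured-output ability of large language models for cyber threat intelligence (CTI). The study exploits the deterministic verification enabled by CTI standards to build a unified dataset and training pipeline spanning multiple subtasks, and designs MinervaRL, a self-training mechanism that mitigates reward sparsity. Experiments show that MinervaRL significantly improves model performance on multiple CTI benchmarks.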

English abstract

Cyber threat intelligence (CTI) analysts routinely convert noisy, unstructured security artifacts into standardized, automation-ready representations. Although large language models (LLMs) show promise for this task, existing approaches remain brittle when producing structured CTI outputs and have largely relied on supervised fine-tuning (SFT). In contrast, CTI standards and community-maintained resources define canonical identifiers and schemas that enable deterministic verification of model outputs. We leverage this structure to study reinforcement learning with verifiable rewards (RLVR) for CTI tasks. We introduce Minerva, a unified dataset and training pipeline spanning multiple CTI subtasks, each paired with task-specific verifiers that score structured outputs and identifier predictions. To address reward sparsity during rollout, we propose MinervaRL, a lightweight self-training mechanism that generates additional verified trajectories and distills them back into the model. Averaged across four backbones and 12 CTI benchmarks, MinervaRL improves the mean score by 15.8 percentage points over the corresponding base models and by 4.3 points over GRPO.

2601.22307 2026-05-11 cs.LG cs.NA math.NA

Exact Gaussian Moment Matching for Residual Networks: a Second-Order Method

Simon Kuang, Xinfan Lin

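AI summary: This paper studies how to accurately propagate the mean and covariance of a general multivariate Gaussian distribution through a deep (residual) neural network using layer-by-layer moment matching. The authors derive exact moment matching for the probit, GeLU, ReLU (as a limit of GeLU), Heaviside (as a limit of probit), and sine activation functions, for both feedforward and generalized residual networks. Experiments show orders-of-magnitude improvements in the KL divergence error metric over existing methods, up to a millionfold on random networks and a hundredfold on a variational Bayes neural network, and the paper gives a smooth-distance error bound showing that moment matching removes the leading low-variance errors under regularity conditions.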

Comments new theoretical result on higher-order accuracy

English abstract

We study the problem of propagating the mean and covariance of a general multivariate Gaussian distribution through a deep (residual) neural network using layer-by-layer moment matching. We close a longstanding gap by deriving exact moment matching for the probit, GeLU, ReLU (as a limit of GeLU), Heaviside (as a limit of probit), and sine activation functions; for both feedforward and generalized residual layers. On random networks, we find orders-of-magnitude improvements in the KL divergence error metric, up to a millionfold, over popular alternatives. On a variational Bayes neural network, we show that our method attains hundredfold improvements in KL divergence from Monte Carlo ground truth over a state-of-the-art deterministic inference method. We also give a smooth-distance error bound showing that, under regularity assumptions, moment matching removes the leading low-variance errors and propagates higher-order local accuracy through the layers of a network.
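To make "exact moment matching" concrete, the sketch below computes the exact mean and variance of ReLU(X) for a scalar Gaussian X using the well-known closed forms. It covers only the univariate case; the multivariate covariance propagation and the other activations in the abstract are not reproduced, and the function names are hypothetical.

```python
import math


def std_normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)


def std_normal_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def relu_moments(mean, std):
    """Exact mean and variance of max(X, 0) for X ~ N(mean, std^2), std > 0."""
    z = mean / std
    cdf, pdf = std_normal_cdf(z), std_normal_pdf(z)
    out_mean = mean * cdf + std * pdf                       # E[ReLU(X)]
    second = (mean ** 2 + std ** 2) * cdf + mean * std * pdf  # E[ReLU(X)^2]
    return out_mean, second - out_mean ** 2


# A layer-by-layer scheme would replace the post-activation distribution with a
# Gaussian carrying these two moments and pass it on to the next layer.
print(relu_moments(0.0, 1.0))
print(relu_moments(10.0, 1.0))
```

For a standard normal input the mean reduces to 1/sqrt(2π) and the variance to 1/2 − 1/(2π); for a mean far above zero the ReLU is effectively the identity and the input moments pass through unchanged.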

2601.21424 2026-05-11 cs.LG cs.CV cs.IT math.IT

Lossy Common Information in a Learnable Gray-Wyner Network

Anderson de Andrade, Alon Harell, Ivan V. Bajić

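AI summary: Many computer vision tasks share substantial overlapping information, yet conventional coding methods tend to ignore this, leading to redundant and inefficient representations. Inspired by the Gray-Wyner network from information theory, this paper proposes a learnable three-channel codec that separates shared information from task-specific information across multiple tasks. By introducing the notion of lossy common information, the study characterizes the theoretical limits of the approach and designs an optimization objective that balances the tradeoffs in learning. Experiments show that the method substantially reduces redundancy on multiple vision tasks and outperforms independent coding, demonstrating the practical value of classical information theory in modern machine learning.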

English abstract

Many computer vision tasks share substantial overlapping information, yet conventional codecs tend to ignore this, leading to redundant and inefficient representations. The Gray-Wyner network, a classical concept from information theory, offers a principled framework for separating common and task-specific information. Inspired by this idea, we develop a learnable three-channel codec that disentangles shared information from task-specific details across multiple vision tasks. We characterize the limits of this approach through the notion of lossy common information, and propose an optimization objective that balances inherent tradeoffs in learning such representations. Through comparisons of three codec architectures on two-task scenarios spanning six vision benchmarks, we demonstrate that our approach substantially reduces redundancy and consistently outperforms independent coding. These results highlight the practical value of revisiting Gray-Wyner theory in modern machine learning contexts, bridging classic information theory with task-driven representation learning.

2601.20599 2026-05-11 cs.LG cs.AI

R-GTD: A Geometric Analysis of Gradient Temporal-Difference Learning in Singular Regimes

Hyunjun Na, Donghwan Lee

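AI summary: This paper studies the convergence of gradient temporal-difference (GTD) learning algorithms when the feature interaction matrix (FIM) is singular. To remove the restrictive assumption in existing work that the FIM is nonsingular, the authors propose a regularized optimization objective by reformulating the minimization of the mean-square projected Bellman error, which yields a new regularized GTD algorithm (R-GTD). The method is guaranteed to converge to a unique solution even when the FIM is singular; a geometric analysis establishes theoretical convergence guarantees and error bounds, and experiments validate its effectiveness.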

Comments 32 pages, 8 figures

English abstract

Gradient temporal-difference (GTD) learning algorithms are widely used for off-policy policy evaluation with function approximation. However, existing convergence analyses rely on the restrictive assumption that the so-called feature interaction matrix (FIM) is nonsingular. In practice, the FIM can become singular and lead to instability or degraded performance. While some prior works have applied regularization to relax the nonsingularity assumption, their theoretical guarantees inevitably rely on other restrictive conditions. In this paper, we propose a regularized optimization objective by reformulating the mean-square projected Bellman error minimization. This formulation naturally yields a regularized GTD algorithm, referred to as R-GTD, which guarantees convergence to a unique solution even when the FIM is singular. We conduct a geometric analysis to establish theoretical convergence guarantees and explicit error bounds for the proposed method, and validate its effectiveness through empirical experiments.

2601.19831 2026-05-11 cs.LG cs.CL

Neural Neural Scaling Laws

Michael Y. Hu, Jane Pan, Ayush Rajesh Jhaveri, Nicholas Lourie, Kyunghyun Cho

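AI summary This paper studies how language model performance scales with increased training inputs, noting that conventional parametric methods based on validation loss struggle to describe the diverse scaling behaviors of different downstream tasks. The authors propose NeuNeu, a neural network that frames scaling law prediction as time-series extrapolation and combines accuracy trajectories with token-level losses. It achieves 1.99% mean absolute error across 66 downstream tasks, a 44% reduction relative to logistic scaling laws (3.56% MAE), and generalizes zero-shot to unseen models and tasks.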

Details
English abstract

Neural scaling laws predict how language model performance improves with increased training inputs. While aggregate metrics like validation loss can follow smooth power-law curves, individual downstream tasks exhibit diverse scaling behaviors: some improve monotonically, others plateau, and some even degrade with scale. We argue that predicting downstream performance from validation loss suffers from two limitations: averaging token-level losses obscures signal, and no simple parametric family can capture the full spectrum of scaling behaviors. To address this, we propose Neural Neural Scaling Laws (NeuNeu), a neural network that frames scaling law prediction as time-series extrapolation. NeuNeu combines temporal context from observed accuracy trajectories with token-level validation losses, learning to predict future performance without the limitations inherent in assuming a specific functional form. Trained entirely on open-source model checkpoints from HuggingFace, NeuNeu achieves 1.99% mean absolute error in predicting model accuracy on 66 downstream tasks -- a 44% reduction compared to logistic scaling laws (3.56% MAE). Furthermore, NeuNeu generalizes zero-shot to unseen model families, architectures, parameter counts, and downstream tasks. Our work suggests that predicting downstream scaling directly from data outperforms parametric alternatives.
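
The headline comparison can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the only numbers taken from the abstract are the two MAE values (3.56% and 1.99%).

```python
def mean_absolute_error(predicted, actual):
    """Average absolute gap between predicted and observed accuracies."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def relative_reduction(baseline, new):
    """Fractional reduction of `new` relative to `baseline`."""
    return (baseline - new) / baseline


# MAE values quoted in the abstract, in percentage points.
reduction = relative_reduction(baseline=3.56, new=1.99)
print(f"{reduction:.1%}")  # prints 44.1%
```

The 44% figure is therefore a relative reduction in error, not a gain of 44 accuracy points.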

2601.18744 2026-05-11 cs.AI cs.LG

TSRBench: A Comprehensive Multi-task Multi-modal Time Series Reasoning Benchmark for Generalist Models

Fangxu Yu, Xingang Guo, Lingzhi Yuan, Haoqiang Kang, Hongyu Zhao, Lianhui Qin, Furong Huang, Bin Hu, Tianyi Zhou

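AI summary TSRBench is a comprehensive multi-task, multi-modal time series reasoning benchmark for generalist models, designed to evaluate perception, reasoning, prediction, and decision-making over time series. It contains 4125 problems from 14 domains covering 15 core tasks, and extensive experiments evaluate more than 30 leading LLMs, vision-language models, and time series models. The study finds that current models still fall short on multimodal fusion and prediction tasks and reveals a decoupling between semantic understanding and numerical prediction, offering a useful reference for the development of generalist models.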

Comments Accepted to ICML 2026

Details
English abstract

Time series are ubiquitous in real-world scenarios and crucial for applications ranging from energy management to traffic control. Consequently, the ability to reason over time series is a fundamental skill for generalist models to solve complex problems. However, current benchmarks for generalist models largely overlook this dimension. To bridge this gap, we introduce TSRBench, a comprehensive multi-modal benchmark designed to stress-test the full spectrum of time series reasoning capabilities. TSRBench features: i) a diverse set of 4125 problems from 14 domains, categorized into 4 major dimensions: Perception, Reasoning, Prediction, and Decision-Making; and ii) 15 tasks from the 4 dimensions evaluating essential reasoning capabilities (e.g., numerical reasoning). Through extensive experiments, we evaluate over 30 leading proprietary and open-source LLMs, VLMs, and TSLLMs within TSRBench. Our findings reveal that: i) scaling laws hold for perception and reasoning but break down for prediction; ii) strong reasoning does not guarantee accurate context-aware forecasting, indicating a decoupling between semantic understanding and numerical prediction; and iii) despite the complementary nature of textual and visual forms of time series as inputs, current multimodal models fail to effectively fuse them for reciprocal performance gains. TSRBench provides a standardized evaluation platform that not only highlights existing challenges but also offers valuable insights to advance generalist models. Our code and dataset are available at https://tsrbench.github.io/.

2601.18700 2026-05-11 cs.AI

TEA-Bench: A Systematic Benchmarking of Tool-enhanced Emotional Support Dialogue Agent

Xingyu Sui, Yanyan Zhao, Yulin Hu, Jiahe Guo, Weixiang Zhao, Bing Qin

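AI summary TEA-Bench is a systematic benchmark for evaluating tool-enhanced emotional support dialogue agents, addressing the lack of external tool support and the tendency to hallucinate in existing multi-turn emotional support systems. It introduces realistic emotional scenarios and a tool environment, and uses process-level metrics to jointly assess the quality and factual accuracy of emotional support. Experiments show that tool augmentation improves emotional support quality, but the gains depend on model capability, and fine-tuning generalizes poorly.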

Details
English abstract

Emotional Support Conversation (ESC) requires not only affective expression but also grounded instrumental support to provide trustworthy guidance. However, existing ESC systems and benchmarks largely focus on affective support in text-only settings, overlooking how external tools can enable factual grounding and reduce hallucination in multi-turn emotional support. We introduce TEA-Bench, the first interactive benchmark for evaluating tool-augmented agents in ESC, featuring realistic emotional scenarios, an MCP-style tool environment, and process-level metrics that jointly assess the quality and factual grounding of emotional support. Experiments on nine LLMs show that tool augmentation generally improves emotional support quality and reduces hallucination, but the gains are strongly capacity-dependent: stronger models use tools more selectively and effectively, while weaker models benefit only marginally. We further release TEA-Dialog, a dataset of tool-enhanced ESC dialogues, and find that supervised fine-tuning improves in-distribution support but generalizes poorly. Our results underscore the importance of tool use in building reliable emotional support agents. Our code and data can be found at https://github.com/XingYuSSS/TEA-Bench.

2601.18681 2026-05-11 cs.LG cs.AI cs.SY eess.SY math.OC

ART for Diffusion Sampling: A Reinforcement Learning Approach to Timestep Schedule

Yilie Huang, Wenpin Tang, Xunyu Zhou

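AI summary This paper studies time discretization in score-based diffusion models, aiming to generate high-quality samples under a limited budget of time steps. It proposes Adaptive Reparameterized Time (ART), which controls the clock speed of a reparameterized time variable to reallocate computation while preserving the terminal time, so as to minimize the Euler discretization error. It further introduces ART-RL, which recasts ART as a continuous-time reinforcement learning problem and establishes a two-directional bridge between ART and Gaussian policies, giving a principled and efficient route to the deterministic timestep optimum. Experiments show that ART-RL improves generated image quality on multiple datasets.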

Comments 25 pages, 8 figures, 5 tables

Details
English abstract

We consider time discretization for score-based diffusion models to generate samples from a learned reverse-time dynamic on a finite grid. Uniform and hand-crafted grids can be suboptimal given a budget on the number of time steps. We introduce Adaptive Reparameterized Time (ART), which controls the clock speed of a reparameterized time variable to redistribute computation along the sampling trajectory while preserving the terminal time, with the objective of minimizing the aggregate Euler discretization error. We derive a randomized companion ART-RL that recasts ART as a continuous-time reinforcement learning problem with Gaussian policies, and prove a two-directional bridge between the two: the deterministic ART optimum lifts to an optimal Gaussian policy, and conversely any optimal Gaussian policy must recover the ART control through its mean. This bridge turns continuous-time actor-critic learning into a principled, rather than heuristic, route to the deterministic timestep optimum. Within the official EDM pipeline, ART-RL improves FID on CIFAR-10 across a wide range of budgets; after one-time offline training, the distilled deterministic schedule transfers without retraining to AFHQv2, FFHQ, and ImageNet at no extra inference cost.

2601.17942 2026-05-11 cs.AI cs.DB

LLM-Based SQL Generation: Prompting, Self-Refinement, and Adaptive Weighted Majority Voting

Yu-Jie Yang, Hung-Fu Chang, Po-An Chen

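AI summary This paper studies LLM-based natural-language-to-SQL generation. To address ambiguity in user queries, the complexity of schema linking, and limited generalization across SQL dialects, it proposes a Single-Agent Self-Refinement with Ensemble Voting (SSEV) framework that requires no ground-truth data and combines self-refinement with weighted majority voting to improve the accuracy of generated SQL. The follow-up ReCAPAgent-SQL framework uses multi-agent collaboration to handle complex enterprise databases and real-world Text-to-SQL tasks, supporting the construction of scalable natural language query systems.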

Comments 29 pages, 22 figures

Details
Journal ref
2026 International Conference on Information Management
English abstract

Text-to-SQL has emerged as a prominent research area, particularly with the rapid advancement of large language models (LLMs). By enabling users to query databases through natural language rather than SQL, this technology significantly lowers the barrier to data analysis. However, generating accurate SQL from natural language remains challenging due to ambiguity in user queries, the complexity of schema linking, limited generalization across SQL dialects, and the need for domain-specific understanding. In this study, we propose a Single-Agent Self-Refinement with Ensemble Voting (SSEV) pipeline built on PET-SQL that operates without ground-truth data, integrating self-refinement with Weighted Majority Voting (WMV) and its randomized variant (RWMA). Experimental results show that the SSEV achieves competitive performance across multiple benchmarks, attaining execution accuracies of 85.5% on Spider 1.0-Dev, 86.4% on Spider 1.0-Test, and 66.3% on BIRD-Dev. Building on insights from the SSEV pipeline, we further propose ReCAPAgent-SQL (Refinement-Critique-Act-Plan agent-based SQL framework) to address the growing complexity of enterprise databases and real-world Text-to-SQL tasks. The framework integrates multiple specialized agents for planning, external knowledge retrieval, critique, action generation, self-refinement, schema linking, and result validation, enabling iterative refinement of SQL predictions through agent collaboration. ReCAPAgent-SQL's WMA results achieve 31% execution accuracy on the first 100 queries of Spider 2.0-Lite, demonstrating significant improvements in handling real-world enterprise scenarios. Overall, our work facilitates the deployment of scalable Text-to-SQL systems in practical settings, supporting better data-driven decision-making at lower cost and with greater efficiency.
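
The voting step can be illustrated with the textbook weighted majority scheme. The sketch below is not the paper's pipeline: the function names, the `beta` penalty, the use of execution results as votes, and the choice to treat the voted answer as the reference (since no ground truth is available) are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def weighted_majority_vote(answers, weights):
    """Return the answer whose voters carry the largest total weight."""
    totals = defaultdict(float)
    for answer, weight in zip(answers, weights):
        totals[answer] += weight
    return max(totals, key=totals.get)


def randomized_vote(answers, weights, rng):
    """Randomized variant: follow one voter sampled in proportion to its weight."""
    return rng.choices(answers, weights=weights, k=1)[0]


def update_weights(answers, weights, reference, beta=0.5):
    """Multiplicative update: shrink voters that disagree with `reference`."""
    return [w if a == reference else w * beta for a, w in zip(answers, weights)]


# Hypothetical example: three candidate queries, each represented by its execution result.
answers = [("alice",), ("bob",), ("bob",)]
weights = [1.0, 1.0, 1.0]
winner = weighted_majority_vote(answers, weights)
weights = update_weights(answers, weights, winner)
```

Grouping candidates by execution result rather than by SQL text lets syntactically different but equivalent queries pool their weight.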

2601.16736 2026-05-11 cs.CV

A Step to Decouple Optimization in 3DGS

Renjie Ding, Yaonan Wang, Min Liu, Jialin Zhu, Jiazheng Wang, Jiahao Zhao, Wenting Shen, Feixiang He, Xiang Chen

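AI summary 3D Gaussian Splatting (3DGS) is a powerful technique for real-time novel view synthesis, but its optimization suffers from under-explored issues such as update step coupling and gradient coupling. This paper decouples the optimization process into Sparse Adam, Re-State Regularization, and Decoupled Attribute Regularization, and, based on extensive experiments, redesigns the optimization pipeline to propose the AdamW-GS optimizer, which improves both optimization efficiency and representation effectiveness.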

Comments Accepted by ICLR 2026 (fixed typo)

Details
English abstract

3D Gaussian Splatting (3DGS) has emerged as a powerful technique for real-time novel view synthesis. Because it is an explicit representation optimized through gradient propagation among primitives, 3DGS adopts optimization practices widely accepted in deep neural networks (DNNs), such as synchronous weight updating and Adam with adaptive gradients. However, considering the physical significance and specific design of 3DGS, two details of its optimization have been overlooked: (i) update step coupling, which induces optimizer state rescaling and costly attribute updates outside the viewpoints, and (ii) gradient coupling in the moment, which may lead to under- or over-effective regularization. Such complex coupling remains under-explored. After revisiting the optimization of 3DGS, we take a step to decouple it and recompose the process into Sparse Adam, Re-State Regularization, and Decoupled Attribute Regularization. Through extensive experiments under the 3DGS and 3DGS-MCMC frameworks, our work provides a deeper understanding of these components. Finally, based on the empirical analysis, we re-design the optimization and propose AdamW-GS by re-coupling the beneficial components, under which better optimization efficiency and representation effectiveness are achieved simultaneously.

2601.15884 2026-05-11 cs.CV

Contrast-X: A Multi-Modal Contrast Image Synthesis Benchmark and Universal Modality Flow Matching

Yifan Chen, Fei Yin, Hao Chen, Jia Wu, Chao Li

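AI summary This study presents Contrast-X, a multi-modal contrast image synthesis benchmark of paired data spanning 10 organs in CT (1,526 patients) and multi-phase breast DCE-MRI (1,116 patients), where every case carries radiologist-verified phase labels and tumor masks. To handle synthesis under arbitrary missing modalities, it introduces FlowMI, a single model that processes any combination of available modalities through a unified multi-modal latent space and flow matching. Experiments evaluate image quality, radiologist reader studies, and lesion analysis across a range of missing-modality configurations, and test the model's generalization across organs.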

Details
English abstract

Contrast-enhanced imaging is central to oncologic diagnosis, but contrast agents can be contraindicated for many of the patients who need them most. Synthesizing contrast scans from non-contrast inputs is the natural response. Two obstacles stand in the way: no benchmark provides paired contrast data with lesion-level evaluation, and no single model handles the arbitrary missing patterns seen in practice. We introduce Contrast-X, a benchmark of paired contrast-enhanced and non-contrast imaging spanning 10 organs in CT (1,526 patients) and multi-phase breast DCE-MRI (1,116 patients). Every case carries radiologist-verified phase labels and tumor masks. We further propose FlowMI, a single model that handles arbitrary subsets of available modalities through a unified multi-modal latent space and flow matching. We benchmark a range of missing-modality configurations, reporting standard image-quality metrics, radiologist reader studies, and downstream lesion analysis on the synthesized scans. We further evaluate cross-organ generalization to test whether the model has learned a transferable contrast-enhancement operation. Dataset, code, and leaderboard will be released. Our code is available at https://github.com/YifanChen02/Contrast-X.

2601.15507 2026-05-11 cs.CV

A Unified and Controllable Framework for Layered Image Generation with Visual Effects

Jinrui Yang, Qing Liu, Yijun Li, Mengwei Ren, Letian Zhang, Zhe Lin, Cihang Xie, Yuyin Zhou

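AI summary This paper proposes LASAGNA, a unified and controllable framework for layered image generation that addresses the identity drift existing image generation models introduce when editing specific elements. The method generates a background layer and a foreground layer with realistic visual effects (such as shadows and reflections) in a single forward pass, and supports a range of edits without further model processing, thereby avoiding identity drift. The work also releases the first public dataset of 48K layered images and the first standardized benchmark for layered generation.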

Details
English abstract

Recent image generation models produce impressive composites, but often fail to preserve the identity of user-provided content when editing specific elements: the surrounding scene may shift, and even the edited object's appearance can drift from the original. Layered representations offer a natural remedy--they allow users to independently manipulate individual elements--but existing layered methods typically produce transparent foregrounds without realistic visual effects such as shadows and reflections, forcing the use of a second harmonization model after every edit, which in turn introduces drift. To overcome these limitations, we present LASAGNA, which generates a photorealistic background (BG) and an RGBA foreground with compelling visual effects in a single forward pass. By treating object-associated visual effects as part of the foreground (FG) layer, LASAGNA supports the dominant class of consumer edits (e.g., translation, scaling, recoloring, duplication) via alpha compositing alone, without invoking any model post-edit, thereby eliminating identity drift introduced by cascade editing pipelines. This single-pass design contrasts with prior layered methods that rely on separate expert models for each task. LASAGNA handles diverse conditional inputs--text prompts, FG, BG, and location masks--within a unified architecture. We further release two community resources: LASAGNA-48K, the first public dataset of 48K layered image triplets with photorealistic visual effects, and LASAGNA-BENCH, the first standardized benchmark for layer-centric generation and editing, comprising 242 expert-annotated samples across six diverse sources. Experiments show that LASAGNA outperforms both general-purpose editors and prior layered methods across three generation modes, and supports a wide range of post-edits without any model re-inference.
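
To make the "alpha compositing alone" claim concrete, here is a minimal sketch of the standard "over" operator applied to a translated foreground layer. It is not taken from the paper: the function names are hypothetical, pixels are plain tuples of floats in [0, 1], and straight (non-premultiplied) alpha is assumed.

```python
def over(fg_rgba, bg_rgb):
    """Composite one straight-alpha RGBA pixel over an opaque RGB pixel."""
    r, g, b, a = fg_rgba
    return tuple(a * f + (1.0 - a) * c for f, c in zip((r, g, b), bg_rgb))


def composite(fg, bg, dx=0, dy=0):
    """Paste RGBA layer `fg` onto RGB background `bg`, shifted by (dx, dy) pixels."""
    out = [row[:] for row in bg]  # copy rows so the background stays untouched
    for y, row in enumerate(fg):
        for x, pixel in enumerate(row):
            ty, tx = y + dy, x + dx
            if 0 <= ty < len(bg) and 0 <= tx < len(bg[0]):
                out[ty][tx] = over(pixel, bg[ty][tx])
    return out


# A 1x2 white background and a 1x1 semi-transparent black foreground (e.g. a soft shadow).
background = [[(1.0, 1.0, 1.0), (1.0, 1.0, 1.0)]]
foreground = [[(0.0, 0.0, 0.0, 0.5)]]
moved = composite(foreground, background, dx=1)  # translate one pixel to the right
```

Because a shadow stored in the foreground's alpha channel travels with the object, a translation like this needs no second model pass, which is the property the abstract relies on.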

2601.15050 2026-05-11 cs.CL

Beyond Factual Accuracy: Evaluating Global Reasoning Integrity in RAG Systems with LogicScore

Zhichao Yan, Yunxiao Zhao, Jiapu Wang, Jiaoyan Chen, Xiaoli Li, Ru Li, Jeff Z. Pan

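AI summary Current evaluation methods for Retrieval Augmented Generation (RAG) systems focus too heavily on factual accuracy and neglect global logical integrity in long-form generation, so models produce answers that are factually correct yet contain logical gaps, redundancy, or inconsistencies. The study therefore proposes LogicScore, an evaluation method grounded in Horn Rules that systematically assesses global reasoning along three dimensions: Completeness, Essentiality, and Determinateness. Experiments show that although leading models achieve high factual precision, they still fall clearly short in logical reasoning, underscoring the need to prioritize reasoning coherence alongside factual grounding in LLM development.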

Details
English abstract

Current evaluation methods for Retrieval Augmented Generation (RAG) suffer from factual myopia: they relentlessly emphasize factual accuracy yet neglect global logical integrity in long-form answer generation. This drives models to force unnatural connections, producing factually grounded yet logically incoherent responses with unaddressed gaps, ambiguous links, or redundant premises. To mitigate this, we present LogicScore, shifting from local, fact-by-fact assessment to rigorous global reasoning scrutiny. Grounded in Horn Rules, our approach integrates a backward verification mechanism to systematically evaluate three key reasoning dimensions: Completeness (logically sound deduction), Essentiality (non-redundancy), and Determinateness (consistent answer entailment). Extensive experiments across three multi-hop QA datasets (HotpotQA, MusiQue, and 2WikiMultiHopQA) and over 20 LLMs (including GPT-5, Gemini-3-Pro, LLaMA3, and task-specific tuned models) reveal a critical capability gap: leading models often achieve high factual accuracy (e.g., 92.85% precision for Gemini-3 Pro) but struggle with global reasoning quality (e.g., 35.11% Essentiality for Gemini-3 Pro). Our work establishes a robust standard for logical evaluation, highlighting the need to prioritize reasoning coherence alongside factual grounding in LLM development.

2601.14958 2026-05-11 cs.CL cs.AI

Script Sensitivity: Benchmarking Language Models on Unicode, Romanized and Mixed-Script Sinhala

Minuri Rajapakse, Ruvan Weerasinghe

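AI summary This study examines how sensitive language models are to script variation when processing Sinhala, a low-resource, morphologically rich language, covering Unicode, Romanized, and mixed-script text. Using perplexity evaluation of 24 open-source language models across diverse text sources, it finds substantial performance differences between scripts, with median degradation exceeding 300 times from Unicode to Romanized text. It also reports no clear correlation between model size and script-handling ability, and offers practical guidance for model selection in multi-script low-resource settings.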

Comments Published at SCSE 2026 (9th IEEE International Research Conference on Smart Computing and Systems Engineering). Best Paper Award - Text Analytics Track

Details
Journal ref
2026 9th IEEE International Research Conference on Smart Computing and Systems Engineering (SCSE), vol. 9, pp. 1-6
English abstract

The performance of Language Models (LMs) on low-resource, morphologically rich languages like Sinhala remains largely unexplored, particularly regarding script variation in digital communication. Sinhala exhibits script duality, with Unicode used in formal contexts and Romanized text dominating social media, while mixed-script usage is common in practice. This paper benchmarks 24 open-source LMs on Unicode, Romanized and mixed-script Sinhala using perplexity evaluation across diverse text sources. Results reveal substantial script sensitivity, with median performance degradation exceeding 300 times from Unicode to Romanized text. Critically, model size shows no correlation with script-handling competence, as smaller models often outperform architectures 28 times larger. Unicode performance strongly predicts mixed-script robustness but not Romanized capability, demonstrating that single-script evaluation substantially underestimates real-world deployment challenges. These findings establish baseline LM capabilities for Sinhala and provide practical guidance for model selection in multi-script low-resource environments.
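
For readers unfamiliar with the metric, the sketch below shows how perplexity follows from per-token log-probabilities and how a script-degradation ratio could be formed from it. The function names and the ratio definition are assumptions for illustration; the abstract does not say exactly how the paper computes its 300-times figure.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood (natural log) per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def degradation_ratio(ppl_unicode, ppl_romanized):
    """How many times worse (higher) perplexity is on Romanized text than on Unicode text."""
    return ppl_romanized / ppl_unicode


# Hypothetical log-probabilities: the model assigns each of four tokens probability 0.5.
example = perplexity([math.log(0.5)] * 4)  # about 2.0
```

Lower perplexity is better, so a ratio far above 1 means the model finds the Romanized text much harder to predict than the same content in Unicode.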

2601.04731 2026-05-11 cs.AI cs.CL

Miner: Mining Intrinsic Mastery for Data-Efficient RL in Large Reasoning Models

Shuyang Jiang, Yuhao Wang, Ya Zhang, Yanfeng Wang, Yu Wang

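AI summary Current critic-free reinforcement learning methods for large reasoning models are inefficient when training on positive homogeneous prompts, where zero advantage estimates waste a large share of rollouts. This paper proposes Miner, which uses the policy's intrinsic uncertainty as a self-supervised reward signal and improves training efficiency without external supervision or additional inference cost. Miner introduces two key innovations: a token-level focal credit assignment mechanism and adaptive advantage calibration. Experiments show that Miner outperforms existing methods on multiple reasoning benchmarks.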

Comments 24 pages

Details
English abstract

Current critic-free RL methods for large reasoning models suffer from severe inefficiency when training on positive homogeneous prompts (where all rollouts are correct), resulting in waste of rollouts due to zero advantage estimates. We introduce a radically simple yet powerful solution to Mine intrinsic mastery (Miner) that repurposes the policy's intrinsic uncertainty as a self-supervised reward signal, with no external supervision, auxiliary models, or additional inference cost. Our method pioneers two key innovations: (1) a token-level focal credit assignment mechanism that dynamically amplifies gradients on critical uncertain tokens while suppressing overconfident ones, and (2) adaptive advantage calibration to seamlessly integrate intrinsic and verifiable rewards. Evaluated across six reasoning benchmarks on Qwen3-4B and Qwen3-8B base models, Miner achieves state-of-the-art performance compared with four other algorithms, yielding up to 4.58 absolute gains in Pass@1 and 6.66 gains in Pass@K compared to GRPO. Comparison with other methods targeted at exploration enhancement further discloses the superiority of the two newly proposed innovations. This demonstrates that latent uncertainty exploitation is both necessary and sufficient for efficient and scalable RL training of reasoning models. Code is available at https://github.com/pixas/Miner.
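
The zero-advantage problem mentioned at the start of the abstract is easy to reproduce with a group-normalized advantage of the kind used by GRPO-style methods. The sketch below is an illustration under assumptions: the function name, the use of population standard deviation, and the `eps` constant are not from the paper, and it shows only the baseline failure mode, not Miner's intrinsic reward.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward by the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Positive homogeneous prompt: every rollout is correct, so every advantage is zero.
all_correct = group_advantages([1, 1, 1, 1])

# Mixed prompt: correct rollouts get positive advantages, incorrect ones negative.
mixed = group_advantages([1, 0, 1, 0])
```

With every advantage equal to zero, the policy gradient for that prompt vanishes, so the rollouts spent on it contribute nothing to training.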