arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.07735 2026-05-11 cs.SD

TARNet: A Temporal-Aware Multi-Scale Architecture for Closed-Set Speaker Identification

Yassin Terraf, Youssef Iraqi

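AI summary This paper proposes TARNet, a lightweight temporal-aware multi-scale network for closed-set speaker identification. It explicitly models temporal information at different time scales with a multi-stage temporal encoder, then fuses the multi-scale features through an attentive statistics pooling module to produce discriminative speaker embeddings. Experiments show that TARNet outperforms existing state-of-the-art methods on the VoxCeleb1 and LibriSpeech datasets at low computational complexity, making it suitable for practical use.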

Comments Accepted at IEEE International Conference on Multimedia and Expo (ICME) 2026. Code available at: https://github.com/YassinTERRAF/TARNet

Abstract

Closed-set speaker identification aims to assign a speech utterance to one of a predefined set of enrolled speakers and requires robust modeling of speaker-specific characteristics across multiple temporal scales. While recent deep learning approaches have achieved strong performance, many existing architectures provide limited mechanisms for modeling temporal dependencies across different time scales, which can restrict the effective use of complementary short-, mid-, and long-term speaker characteristics. In this paper, we propose TARNet, a lightweight Temporal-Aware Representation Network for closed-set speaker identification. TARNet explicitly models temporal information at multiple time scales using a multi-stage temporal encoder with stage-specific dilation configurations. The resulting multi-scale representations are fused and aggregated via an Attentive Statistics Pooling (ASP) module to produce a discriminative utterance-level speaker embedding. Experiments on the VoxCeleb1 and LibriSpeech datasets show that TARNet outperforms state-of-the-art methods while maintaining competitive computational complexity, making it suitable for practical speaker identification systems. The code is publicly available at https://github.com/YassinTERRAF/TARNet.

2605.07727 2026-05-11 cs.LG cs.AI cs.RO

Drifting Field Policy: A One-Step Generative Policy via Wasserstein Gradient Flow

Juil Koo, Mingue Park, Jiwon Choi, Yunhong Min, Minhyuk Sung

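AI summary This paper proposes Drifting Field Policy (DFP), a non-ODE one-step generative policy built on the drifting model paradigm. The method frames the policy update as a reverse-KL Wasserstein-2 gradient flow toward a soft target policy, so that each update corresponds to a gradient step in probability space. The update decomposes into an ascent toward regions of higher action value and score matching with an anchor policy, which keeps policy updates stable and effective. Experiments show that DFP performs strongly on multiple manipulation tasks and outperforms ODE-based policies.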

Abstract

We propose Drifting Field Policy (DFP), a non-ODE one-step generative policy built on the drifting model paradigm. We frame the policy update as a reverse-KL Wasserstein-2 gradient flow toward a soft target policy, so that each DFP update corresponds to a gradient step in probability space. By construction, this gradient is decomposed into an ascent toward higher action-value regions and a score matching with the anchor policy as a trust region. We further derive a simple, tractable surrogate of the otherwise intractable update loss, akin to behavior cloning on top-K critic-selected actions. We find empirically that this mechanism uniquely benefits the drifting backbone owing to its non-ODE parameterization. With one-step inference, DFP achieves state-of-the-art performance on several manipulation tasks across Robomimic and OGBench, outperforming ODE-based policies.

2605.07725 2026-05-11 cs.CL cs.AI

SOD: Step-wise On-policy Distillation for Small Language Model Agents

Qiyong Zhong, Mao Zheng, Mingyang Song, Xin Lin, Jie Sun, Houcheng Jiang, Xiang Wang, Junfeng Fang

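AI summary This paper studies how to extend tool-integrated reasoning (TIR) effectively to small language models and proposes SOD, a step-wise on-policy distillation framework. To counter the error accumulation that existing methods suffer in long-horizon tool interactions, SOD dynamically adjusts distillation strength at each step, attenuating misleading guidance from the teacher model and improving the student's reasoning ability. Experiments show that SOD performs strongly on multiple math, science, and code benchmarks, clearly outperforming existing methods, and demonstrates the potential for efficient agentic reasoning on lightweight models.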

Abstract

Tool-integrated reasoning (TIR) is difficult to scale to small language models due to instability in long-horizon tool interactions and limited model capacity. Reinforcement learning methods like group relative policy optimization provide only sparse outcome-level rewards. Recently, on-policy distillation (OPD) has gained popularity by supplying dense token-level supervision from a teacher on student-generated trajectories. However, our experiments indicate that applying OPD to TIR leads to a critical failure mode: erroneous tool calls tend to cascade across subsequent reasoning steps, progressively amplifying student-teacher divergence and rendering the teacher's token-level supervision increasingly unreliable. To address this, we propose SOD, a step-wise on-policy distillation framework for small language model agents, which adaptively reweights distillation strength at each step based on step-level divergence. SOD can therefore attenuate potentially misleading teacher signals in high-divergence regions while preserving dense guidance in well-aligned states. Experiments on challenging math, science, and code benchmarks show that SOD achieves up to 20.86% improvement over the second-best baseline. Notably, our 0.6B student achieves 26.13% on AIME 2025, demonstrating effective transfer of agentic reasoning to lightweight models. Our code is available at https://github.com/YoungZ365/SOD.

2605.07719 2026-05-11 cs.LG cs.AI cs.PF

An Efficient Hybrid Sparse Attention with CPU-GPU Parallelism for Long-Context Inference

Feiyu Yao, Zhixiong Niu, Xiaqing Li, Yongqiang Xiong, Juan Fang, Qian Wang

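AI summary As long-context inference increasingly relies on CPU-resident KV caches, existing sparse attention methods still fall short in end-to-end efficiency. This paper proposes Fluxion, which optimizes CPU-GPU hybrid sparse attention through output-aware KV budget allocation, head-specific and granularity-aware sparse configuration, and cross-device coordinated execution. Experiments show that Fluxion preserves model quality while achieving a 1.5x to 3.7x speedup over a fixed sparse baseline.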

Abstract

Long-context inference increasingly operates over CPU-resident KV caches, either because decoding-time KV states exceed GPU memory capacity or because disaggregated prefill-decode systems place KV data in host memory. Although block-sparse attention reduces attention cost in this setting, sparsity alone is insufficient for end-to-end efficiency. GPU-only designs remain constrained by PCIe bandwidth and metadata memory overhead, while CPU-GPU hybrid designs still suffer from substantial GPU idle time and bottlenecks in CPU-side top-k selection and sparse attention computation. Fluxion is built on three key insights: output-aware KV budgeting, head-specific and granularity-aware sparse configuration, and cross-device coordinated execution for sparse attention over CPU-resident KV caches. Guided by these insights, Fluxion combines a lightweight head-property predictor, a granularity-budget selector, and a priority-based scheduler to jointly optimize budget allocation, sparse configuration, and CPU-GPU execution overlap. This co-design enables hybrid sparse attention to achieve both accuracy and system efficiency in long-context inference. Across 2 models, 3 benchmarks, and 40 tasks, Fluxion preserves quality well -- the worst average degradation is only -0.26 relative to FULL, while delivering 1.5$\times$-3.7$\times$ speedup over the strongest fixed sparse hybrid baseline, whose KV budget is only 0.05.

2605.07706 2026-05-11 cs.LG

Bayesian Fine-tuning in Projected Subspaces

Viktar Dubovik, Patryk Marszałek, Jacek Tabor, Tomasz Kuśmierczyk

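AI summary This paper proposes a parameter-efficient Bayesian fine-tuning framework that achieves effective uncertainty quantification in a low-dimensional subspace. By projecting the weight space into a low-dimensional space, the method improves calibration and generalization while maintaining computational efficiency. Experiments show that weight uncertainty can be modeled effectively in a low-dimensional space and that weight covariances are low-rank.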

Abstract

Low-Rank Adaptation (LoRA) enables parameter-efficient fine-tuning of large models by decomposing weight updates into low-rank matrices, significantly reducing storage and computational overhead. While effective, standard LoRA lacks mechanisms for uncertainty quantification, leading to overconfident and poorly calibrated models. Bayesian variants of LoRA address this limitation, but at the cost of a significantly increased number of trainable parameters, partially offsetting the original efficiency gains. Additionally, these models are harder to train and may suffer from unstable convergence. In this work, we propose a novel framework for parameter-efficient Bayesian fine-tuning, demonstrating that effective uncertainty quantification can be achieved in very low-dimensional parameter spaces. The proposed method achieves strong performance with improved calibration and generalization while maintaining computational efficiency. Our empirical findings show that, with an appropriate projection of the weight space, uncertainty can be effectively modeled in a low-dimensional space, and that weight covariances exhibit low rank.

2605.07703 2026-05-11 cs.AI cs.RO

Finite-Time Analysis of MCTS in Continuous POMDP Planning

Da Kong, Vadim Indelman

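AI summary This paper presents a finite-time analysis of Monte Carlo Tree Search (MCTS) in Partially Observable Markov Decision Processes (POMDPs), covering both discrete and continuous observation spaces and providing probabilistic concentration bounds. To handle the nonstationarity and the dependencies induced by heuristic action selection in MCTS, the work extends the polynomial exploration bonus and introduces an abstract partitioning framework for continuous observation spaces. Building on this, the authors design Voro-POMCPOW, an algorithm that adaptively partitions the continuous observation space with Voronoi cells, keeping the branching factor finite while providing theoretical guarantees; experiments show competitive performance.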

Comments 9 pages, 1 figure

Abstract

This paper presents a finite-time analysis for Monte Carlo Tree Search (MCTS) in Partially Observable Markov Decision Processes (POMDPs), with probabilistic concentration bounds in both discrete and continuous observation spaces. While MCTS-style solvers such as POMCP achieve empirical success in many applications, rigorous finite-time guarantees remain an open problem due to the nonstationarity and the interdependencies induced by heuristic action selection (e.g., UCB). In the discrete setting, we address these challenges by extending the polynomial exploration bonus to UCB in the POMDP setting, yielding polynomial concentration bounds for the empirical value estimation at the root node. For continuous observation spaces, we introduce an abstract partitioning framework and propose a finite-time bound on partitioning loss. Under mild conditions, we prove a high-probability bound on value estimates in POMDPs with continuous observation spaces. Specifically, we propose Voro-POMCPOW, a variant of POMCPOW with finite-time guarantees that adaptively partitions the continuous observation space using Voronoi cells. This approach maintains a finite branching factor while preserving the original observation generator. Empirical validation demonstrates that the proposed Voro-POMCPOW shows competitive performance while providing theoretical guarantees. Although our analysis focuses on continuous POMDPs, the techniques developed herein are also applicable to continuous MDPs, closing another gap on the MDP side.

2605.07701 2026-05-11 cs.CL

Guidance Is Not a Hyperparameter: Learning Dynamic Control in Diffusion Language Models

Fan Zhou, Tim Van de Cruys

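AI summary This paper studies how to dynamically adjust the guidance scale of Classifier-Free Guidance (CFG) in diffusion language models to better balance controllability and generation quality. The authors cast CFG scale selection as a sequential decision-making problem and learn dynamic guidance trajectories with reinforcement learning. Experiments show that the method outperforms fixed-scale guidance on several controlled natural language generation tasks and reveals interpretable guidance trajectories that differ across tasks.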

Comments ReALM-GEN@ICLR2026

Abstract

Classifier-Free Guidance (CFG) is a widely used mechanism for controlling diffusion-based generative models, yet its guidance scale is typically treated as a fixed hyperparameter throughout generation. This static design yields a suboptimal tradeoff between controllability and quality, as the optimal degree of guidance varies across tasks and across different stages of the diffusion process, especially in the NLP domain. We recast CFG scale selection as a sequential decision-making problem and propose to learn dynamic guidance trajectories via reinforcement learning. Specifically, we model the guidance scale as a discrete control action selected at each generation step based on the evolving diffusion state, and optimize a policy using Proximal Policy Optimization (PPO) under task-level rewards. Experiments on three controlled NLP generation tasks using discrete diffusion language models demonstrate that adaptive guidance consistently achieves a better balance between controllability and generation quality than fixed-scale strategies. Further analysis of the learned policies reveals distinct and interpretable guidance trajectories across tasks, underscoring the importance of treating guidance as a dynamic control process rather than a static design choice.
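
As a rough illustration of the control problem described in this abstract, the sketch below applies one common form of the CFG combination, uncond + scale * (cond - uncond), with a different scale at each step. The discrete action set `SCALES`, the toy logits, and the hard-coded `schedule` are hypothetical stand-ins for the paper's diffusion model and learned PPO policy, which the abstract does not specify.

```python
import numpy as np

def cfg_logits(cond, uncond, scale):
    """Common CFG combination: move from unconditional toward conditional logits."""
    return uncond + scale * (cond - uncond)

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Hypothetical discrete action set of guidance scales (not from the paper).
SCALES = (0.0, 1.0, 2.0, 4.0)

# Toy conditional / unconditional logits over a 3-token vocabulary.
cond = np.array([2.0, 0.0, -1.0])
uncond = np.array([0.5, 0.5, 0.0])

# Stand-in for a learned policy: one action index per generation step.
schedule = [3, 2, 1]
for step, action in enumerate(schedule):
    probs = softmax(cfg_logits(cond, uncond, SCALES[action]))
    print(step, SCALES[action], probs.round(3))
```

A fixed-scale baseline corresponds to a schedule that repeats the same action index at every step.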

2605.07699 2026-05-11 cs.CL cs.AI

DRIP-R: A Benchmark for Decision-Making and Reasoning Under Real-World Policy Ambiguity in the Retail Domain

Hsuvas Borkakoty, Sebastian Pohl, Cheng Wang, Bei Chen, Yufang Hou

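AI summary DRIP-R is a retail-domain benchmark for evaluating the decision-making and reasoning of large language models under real-world policy ambiguity. It builds tasks from policy ambiguities found in real retail scenarios and tests whether models can make reasonable judgments when no single correct answer exists. DRIP-R includes a multi-role conversational simulation, tool calling, and a multi-judge evaluation framework; experiments show that current frontier models disagree markedly on identical ambiguous policies, highlighting the systematic challenge that policy ambiguity poses to model decision-making.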

Comments 10 pages

Abstract

LLM-based agents are increasingly deployed for routine but consequential tasks in real-world domains, where their behavior is governed by inherently ambiguous domain policies that admit multiple valid interpretations. Despite the prevalence of such ambiguities in practice, existing agent benchmarks largely assume unambiguous, well-specified policies, leaving a critical evaluation gap. We introduce DRIP-R, a benchmark that systematically exploits real-world retail policy ambiguities to construct scenarios in which no single correct resolution exists. DRIP-R comprises a curated set of policy-ambiguous return scenarios paired with realistic customer personas, a full-duplex conversational simulation with tool-calling capabilities, and a multi-judge evaluation framework covering policy adherence, dialogue quality, behavioral alignment, and resolution quality. Our experiments show that frontier models fundamentally disagree on identical policy-ambiguous scenarios, confirming that ambiguity poses a genuine and systematic challenge to LLM decision-making.

2605.07698 2026-05-11 cs.LG cs.IT math.IT

Future Validity is the Missing Statistic: From Impossibility to $Φ$-Estimation for Grammar-Faithful Speculative Decoding

Wenhua Nie, Zijie Meng, Kun Zou, Zheng Lin, Ziwei Li, Haoran Zheng, Jyh-Shing Roger Jang, Hao Zhang

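AI summary This paper studies how to make speculative decoding in grammar-constrained generation match the grammar-conditional distribution that users intend. The authors show that existing methods actually sample from a locally projected distribution rather than the target grammar-conditional distribution, and they identify the future-validity function $Φ$ as the missing correction statistic. With $Φ$-based estimation, the paper samples the target distribution more accurately and validates the approach on several grammars, significantly improving generation quality.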

Abstract

Grammar-constrained generation is often combined with local vocabulary masking and speculative decoding, but the resulting sampling law is not the grammar-conditional distribution users usually intend. We show that any speculative decoder with local mask access, Leviathan rejection, and rollback soundness samples from the locally projected distribution $μ^{\mathrm{proj}}$ rather than the grammar-conditional distribution $μ^\star$. This extends the GAD impossibility result to speculative decoding; on Dyck grammars with Qwen3-8B, the total-variation gap can reach 0.996. We identify the future-validity function $Φ_t(y)=\Pr_p[\mathrm{valid\ completion}\mid y]$ as the missing correction statistic. The target distribution is a Doob transform of the base model with $h=Φ$, while local masking corresponds to setting $h$ to one. With exact $Φ$, our oracle decoder FVO-Spec samples exactly from $μ^\star$; with approximate $Φ$, we bound the resulting total-variation error. Because exact future validity is hard for general context-free grammars, we evaluate estimator hierarchies on tractable Dyck and finite JSON languages. OneStep reduces Dyck TV by 14% with under 1% throughput overhead, exact dynamic programming reduces it by 97%, and finite-language correction closes JSON gaps to numerical precision. All fidelity claims are scoped to enumerable grammars and token tries.
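
The gap between local masking and the grammar-conditional distribution can be seen on a tiny example. The sketch below is a toy under stated assumptions, not the paper's decoder: the two-token vocabulary, the fixed length, the finite "grammar" `VALID`, and the base model `p_next` are all hypothetical. It computes the future-validity function by enumeration, then compares decoding with h set to one on viable prefixes (local masking) against decoding with h = Φ (the Doob-transform view described in the abstract).

```python
from itertools import product

VOCAB = ("a", "b")
LENGTH = 2
VALID = {"aa", "ab", "bb"}  # toy finite "grammar" (hypothetical)

def p_next(prefix, tok):
    """Toy base model: first token uniform, second token prefers 'a'."""
    if len(prefix) == 0:
        return 0.5
    return 0.9 if tok == "a" else 0.1

def p_seq(y):
    """Base-model probability of emitting the token string y."""
    prob = 1.0
    for i, tok in enumerate(y):
        prob *= p_next(y[:i], tok)
    return prob

def phi(prefix):
    """Future validity: base-model probability that prefix ends in a valid string."""
    total = 0.0
    for tail in product(VOCAB, repeat=LENGTH - len(prefix)):
        y = prefix + "".join(tail)
        if y in VALID:
            total += p_seq(y) / p_seq(prefix)
    return total

def local_mask(prefix):
    """h = 1 on every prefix that still has some valid completion, else 0."""
    return 1.0 if phi(prefix) > 0 else 0.0

def sample_law(h):
    """Law over full strings when each step reweights p by h(prefix + tok)."""
    law = {}
    for toks in product(VOCAB, repeat=LENGTH):
        y = "".join(toks)
        prob = 1.0
        for i, tok in enumerate(y):
            weights = {t: p_next(y[:i], t) * h(y[:i] + t) for t in VOCAB}
            z = sum(weights.values())
            prob *= weights[tok] / z if z > 0 else 0.0
        law[y] = prob
    return law

mu_proj = sample_law(local_mask)  # locally projected distribution
mu_star = sample_law(phi)         # grammar-conditional distribution
tv = 0.5 * sum(abs(mu_proj[y] - mu_star[y]) for y in mu_proj)
print(mu_proj, mu_star, tv)
```

In this toy, local masking over-samples "bb" because it renormalizes after the prefix "b" without accounting for how unlikely that prefix is to complete validly under the base model.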

2605.07695 2026-05-11 cs.CV

OphEdit: Training-Free Text-Guided Editing of Ophthalmic Surgical Videos

Ritul Jangir, Arkya Jyoti Bagchi, Aiman Farooq, Mangalton Okram, Saurabh Seetaram Korgaonkar, Deepak Mishra

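AI summary OphEdit is a training-free, text-guided framework for editing ophthalmic surgical videos that makes precise modifications from text instructions, such as swapping surgical instruments or changing the procedural phase. It extracts attention value tensors from the original video through a deterministic second-order ODE inversion and injects them into the conditional Classifier-Free Guidance branch during denoising, enabling semantic edits while preserving the anatomical structure of the eye. Experiments show that OphEdit outperforms existing video editing tools in structural fidelity and temporal consistency, offering an efficient way to generate diverse annotated medical data without model fine-tuning.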

Abstract

High-fidelity surgical video generation can greatly improve medical training and the development of AI, but adapting these generative models for precise video editing remains a formidable challenge. Modifying surgical attributes, such as instrument-tissue interactions or procedural phases, is challenging due to the strict anatomical and temporal constraints. In this paper, we propose OphEdit, a novel training-free framework for the text-guided editing of ophthalmic surgical videos. Our approach leverages a deterministic second-order ODE inversion pipeline to capture Attention Value (V) tensors from the original video. By selectively injecting these stored tensors into the conditional Classifier-Free Guidance (CFG) branch during the denoising phase, OphEdit rigorously preserves the intricate anatomical geometry of the eye while seamlessly mapping text-driven semantic modifications onto the video stream. Clinical evaluations demonstrate that OphEdit effectively handles complex surgical transformations, such as instrument swaps and procedural variations, with superior structural fidelity and temporal consistency compared to natural-domain video editors. Our work represents the first application of training-free video editing in the ophthalmic surgical domain, offering a scalable solution for generating diverse, annotated medical datasets without the need for exhaustive manual recording or costly model fine-tuning. The code and prompts can be accessed at https://github.com/ophedit/OphEdit

2605.07693 2026-05-11 cs.LG

Toward Better Geometric Representations for Molecule Generative Models

Shaoheng Yan, Zian Li, Cai Zhou, Qiaojing Huang, Kai Liu, Muhan Zhang

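AI summary This paper studies how to improve molecule generative models based on geometric representations in order to raise generation efficiency and quality. The authors propose LENSEs, a framework that improves how pretrained molecular encoders are used during generation through multi-level representation extraction, a semantics-aware loss, and node-level representation alignment. Experiments show higher validity and stability on the GEOM-DRUG dataset and confirm that the resulting representations are smoother and more informative.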

Abstract

Geometric representation-conditioned molecule generation provides an effective paradigm that decouples molecule representation modeling from structure generation. By decoupling molecule generation into two stages (first generating a meaningful molecule representation, then generating a 3D molecule conditioned on this representation), the efficiency and quality of the generation process can be significantly enhanced. However, its effectiveness is fundamentally limited by the quality of the representation space: pretrained molecular encoders, such as UniMol, produce representations that are non-smooth and not fully exploited during the generative training process. In this work, we propose LENSEs, a framework that better exploits the potential of molecule representations in representation-conditioned generation methods. In particular, LENSEs introduces three complementary mechanisms: (1) a representation head, simultaneously trained during generative tasks, that extracts multi-level representations from the pretrained encoder; (2) a molecule perceptual loss that optimizes the generator in a semantic-informative representation space; and (3) a node-level representation alignment (REPA) loss that explicitly aligns the generator's hidden states with encoder representations, reducing the semantic gap between pretraining and generation. We demonstrate the effectiveness of these improvements through extensive molecule generation tasks. Specifically, on the challenging molecule generation dataset GEOM-DRUG, LENSEs achieves 97.28% validity and 98.51% molecule stability, surpassing existing advanced methods. Further analyses through Lipschitz constant reduction (4.6x) and QM9 probing tasks also demonstrate the smoother, more informative refined representations, establishing generative training with alignment objectives as a potential pretraining paradigm for molecular encoders.

2605.07692 2026-05-11 cs.AI

GASim: A Graph-Accelerated Hybrid Framework for Social Simulation

Xuan Zhou, Yanhui Sun, Hantao Yao, Allen He, Yongdong Zhang, Wu Liu

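AI summary GASim is a graph-accelerated hybrid framework for large-scale social simulation that targets the high latency caused by heavy memory retrieval and sequential execution in conventional hybrid methods. It introduces Graph-Optimized Memory (GOM) and Graph Message Passing (GMP) to speed up LLM-driven core agents and ordinary agents respectively, and uses Entropy-Driven Grouping (EDG) to dynamically identify key agents, enabling efficient parallel computation. Experiments show that GASim achieves a nearly 10-fold speedup over the conventional approach and markedly lowers compute cost while remaining closely aligned with real-world public opinion trends.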

Abstract

Large-scale social simulators are essential for studying complex social patterns. Prior work explores hybrid methods to scale up simulations, combining large language model (LLM)-based agents with numerical agent-based models (ABM). However, this incurs high latency due to expensive memory retrieval and sequential ABM execution. To address this challenge, we propose GASim, a graph-accelerated hybrid multi-agent framework for large-scale social simulations. For core agents driven by LLMs, GASim introduces Graph-Optimized Memory (GOM) to replace intensive LLM-based retrieval pipelines with lightweight propagation over a sparse memory graph. For the majority of ordinary agents, GASim employs Graph Message Passing (GMP), substituting sequential ABM execution with parallel updates via fine-grained feature aggregation and a Graph Attention Network. We further introduce Entropy-Driven Grouping (EDG) that coordinates this hybrid partitioning, leveraging information entropy to dynamically identify emergent core agents situated in information-diverse neighborhoods. Extensive experiments show that GASim not only delivers a substantial 9.94-fold end-to-end speedup over the traditional hybrid framework but also consumes less than 20% of baseline tokens, significantly reducing costs while preserving strong alignment with real-world public opinion trends. Our code is available at https://github.com/Jasmine0201/GASim.

2605.07690 2026-05-11 cs.LG

Fortifying Time Series: DTW-Certified Robust Anomaly Detection

Shijie Liu, Tansu Alpcan, Christopher Leckie, Sarah Erfani

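AI summary This paper studies robustness in time-series anomaly detection. To address the vulnerability of existing methods to adversarial attacks, it proposes a certifiably robust defense based on Dynamic Time Warping (DTW). By relating the $\ell_p$-norm to the DTW distance, it establishes the first robustness guarantee under the DTW metric. Experiments show strong defensive performance across multiple datasets and models, with markedly higher detection accuracy under adversarial attack.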

Journal ref
39th Conference on Neural Information Processing Systems (NeurIPS 2025)
Abstract

Time-series anomaly detection is critical for ensuring safety in high-stakes applications, where robustness is a fundamental requirement rather than a mere performance metric. Addressing the vulnerability of these systems to adversarial manipulation is therefore essential. Existing defenses are largely heuristic or provide certified robustness only under $\ell_p$-norm constraints, which are incompatible with time-series data. In particular, the $\ell_p$-norm fails to capture the intrinsic temporal structure in time series, causing small temporal distortions to significantly alter the $\ell_p$-norm measures. Instead, the similarity metric Dynamic Time Warping (DTW) is more suitable and widely adopted in the time-series domain, as DTW accounts for temporal alignment and remains robust to temporal variations. To date, however, there has been no certifiable robustness result in this metric that provides guarantees. In this work, we introduce the first DTW-certified robust defense in time-series anomaly detection by adapting the randomized smoothing paradigm. We develop this certificate by bridging the $\ell_p$-norm to DTW distance through a lower-bound transformation. Extensive experiments across various datasets and models validate the effectiveness and practicality of our theoretical approach. Results demonstrate significantly improved performance, e.g., up to 18.7% in F1-score under DTW-based adversarial attacks compared to traditional certified models.
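
The claim that a small temporal distortion moves an $\ell_p$ distance a lot while leaving DTW nearly unchanged can be checked with the textbook dynamic-programming form of DTW. The sketch below is that textbook recurrence with an absolute-difference local cost, not the paper's certification procedure; the two toy series are hypothetical.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(n*m) DTW with absolute-difference local cost."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match.
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

x = [0, 0, 1, 2, 1, 0, 0]
y = [0, 1, 2, 1, 0, 0, 0]  # same bump as x, shifted one step earlier

l1 = float(np.abs(np.array(x) - np.array(y)).sum())
print("l1 distance:", l1)
print("DTW distance:", dtw_distance(x, y))
```

Here the pointwise distance is nonzero because the bump is misaligned, whereas the warping path can realign the two series at no cost.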

2605.07689 2026-05-11 cs.LG

Gradient Starvation in Binary-Reward GRPO: Why Group-Mean Centering Fails and Why the Simplest Fix Works

Wenhua Nie, Jianan Wu, Junlin Liu, Ziwei Li, Zheng Lin, Zhang Zijian, Yilong Fan, Haoran Zheng, Jyh-Shing Roger Jang

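AI summary This paper studies gradient starvation in Group Relative Policy Optimization (GRPO) under binary rewards: when every response in a group is correct or every response is wrong, the centered advantage is zero and the policy cannot learn. The authors prove that the true degeneracy rate exceeds the i.i.d. Bernoulli prediction and observe substantial degeneracy in real training data. A simple fixed-reference Sign advantage, $A=2r-1$, restores the learning signal; experiments show it clearly outperforms the standard method on the GSM8K test set, with the gain coming mainly from search compression rather than capacity expansion.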

Abstract

Group Relative Policy Optimization (GRPO) is a standard algorithm for reinforcement learning from verifiable rewards, but its group-mean-centered advantage can fail under binary rewards. The failure mode is gradient starvation: when every response in a group is correct or every response is wrong, the centered advantage is exactly zero and the policy receives no learning signal. We prove that the true degeneracy rate always exceeds the i.i.d. Bernoulli prediction by Jensen's inequality, and observe a 0.69 degeneracy rate at group size four in logged Qwen3.5-9B GSM8K training. We then show that the fixed-reference Sign advantage, $A=2r-1$, performs pass@$G$ failure descent by increasing the probability that at least one sample in the group succeeds. On the full GSM8K test set across seven seeds, Sign reaches 73.8% accuracy versus 28.4% for standard normalized group-mean DrGRPO at group size four, a 45.4 point gain with $p<0.0001$. The effect is directionally consistent on Llama-3.1-8B and positive but underpowered on a MATH-500 transfer check. Pass@$k$ analysis indicates that the main benefit is search compression rather than large capacity expansion, aligning the empirical gains with recent RLVR ceiling observations.
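
The failure mode is easy to reproduce numerically. The sketch below contrasts the group-mean-centered advantage with the Sign advantage $A=2r-1$ on a degenerate group, and computes the i.i.d. Bernoulli degeneracy rate $p^G+(1-p)^G$. The standard-deviation normalization used in some GRPO variants is omitted, and the two-prompt mixture at the end is one assumed reading of the Jensen argument (per-prompt success probabilities that vary), not a reconstruction of the paper's proof.

```python
import numpy as np

def group_mean_advantage(rewards):
    """Group-mean-centered advantage: A_i = r_i - mean(r)."""
    r = np.asarray(rewards, dtype=float)
    return r - r.mean()

def sign_advantage(rewards):
    """Fixed-reference Sign advantage from the abstract: A = 2r - 1."""
    r = np.asarray(rewards, dtype=float)
    return 2.0 * r - 1.0

def iid_degeneracy_rate(p, group_size):
    """P(all correct or all wrong) for i.i.d. Bernoulli(p) rewards."""
    return p ** group_size + (1.0 - p) ** group_size

all_wrong = [0, 0, 0, 0]
print(group_mean_advantage(all_wrong))  # all zeros: no learning signal
print(sign_advantage(all_wrong))        # all -1: signal survives

# Assumed illustration of the Jensen gap: prompts with different success rates.
per_prompt_p = np.array([0.1, 0.9])
mixture_rate = iid_degeneracy_rate(per_prompt_p, 4).mean()
pooled_rate = iid_degeneracy_rate(per_prompt_p.mean(), 4)
print(mixture_rate, pooled_rate)
```

Because $p^G+(1-p)^G$ is convex in $p$, averaging the per-prompt rates can only give a value at least as large as the rate computed from the pooled success probability.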

2605.07687 2026-05-11 cs.RO

PhySPRING: Structure-Preserving Reduction of Physics-Informed Twins via GNN

Yixiong Jing, Xingyuan Chen, Guangming Wang, Olaf Wysocki, Haibing Wu, Brian Sheil

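AI summary PhySPRING is a differentiable graph neural network (GNN)-based method for structure-preserving reduction of physics-based digital twins such as spring-mass systems. It jointly learns a hierarchy of reduced graph topologies and the corresponding mechanical parameters from observations, cutting model complexity while preserving physical and visual fidelity. Experiments show that PhySPRING outperforms existing methods in prediction accuracy and computational efficiency and remains practical and robust in robot policy-evaluation tasks.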

Comments 16 pages and 6 pages, conference paper

Abstract

Physics-based digital twins aim to predict the dynamics of real-world objects under interaction, enabling real-to-sim-to-real applications in robotics. Current approaches reconstruct such twins as explicit physical models (such as spring-mass systems) to predict the dynamics, but the resulting models often inherit the resolution of the visual reconstruction rather than being reduced to the physical complexity required to reproduce task-relevant dynamics. This mismatch introduces redundant topology, making repeated forward-dynamics rollouts unnecessarily expensive. To address this challenge, we present PhySPRING, a fully differentiable GNN-based method to reduce complexity in spring-mass digital twins. PhySPRING jointly learns a hierarchy of coarsened graph topologies and their mechanical parameters from observations. At each reduction level, PhySPRING merges nodes with similar learned dynamic responses to optimize the topology, while maintaining every reduced layer as an explicit spring-mass system. On the PhysTwin benchmark, PhySPRING improves dense reconstruction and prediction accuracy over PhysTwin, while reduced models retain stable physical and visual fidelity with up to a 2.30 times speed-up. We further demonstrate the effectiveness of PhySPRING in a Real2Sim robot policy-evaluation pipeline, where the reduced models are substituted zero-shot into ACT and $π_0$ evaluations, maintaining comparable manipulation success rates across downsampling levels while improving action-sampling effectiveness. Together, PhySPRING enables efficient and structure-preserving spring-mass reduction without sacrificing fidelity or robotic utility.

2605.07686 2026-05-11 cs.LG

The Coupling Tax: How Shared Token Budgets Undermine Visible Chain-of-Thought Under Fixed Output Limits

Wenhua Nie, Junlin Liu, Jianan Wu, Zijie Meng, Yilong Fan, Zhang Zijian, Haoran Zheng, Jyh-Shing Roger Jang

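AI summary This paper studies how a shared token budget affects chain-of-thought performance under a fixed output-length limit and identifies a "coupling tax": when the reasoning trace and the final answer share one budget, an overly long trace can crowd out the answer and lower overall performance. Experiments show that on several tasks non-thinking mode often matches or outperforms thinking mode, and a budget-allocation strategy that decouples reasoning from answer generation markedly improves accuracy on hard tasks such as math.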

Comments 40 pages, 6 figures

Abstract

Chain-of-thought reasoning is often treated as a monotone way to improve language-model accuracy by letting a model think longer. We identify a countervailing effect, the coupling tax: when reasoning traces and final answers share one output-token budget, long traces can crowd out the answer they are meant to support. Across GSM8K, MATH-500, and five BIG-Bench Hard tasks with Qwen3 models at three scales, non-thinking mode matches or outperforms thinking mode on GSM8K and MATH-500 at every budget up to 2048 tokens, while harder tasks shift the crossover to larger budgets. We derive a truncation-waste decomposition, $\mathrm{Acc}_{\mathrm{think}}(b)=α_c F_L(b)+α_t(1-F_L(b))$, that predicts this crossover from chain-length and accuracy statistics and explains inverse scaling within the Qwen family. A DeepSeek-R1-Distill-Llama-8B replication shows the same pattern under a different thinking interface. As a mitigation, split-budget generation decouples reasoning and answer budgets; on full MATH-500, IRIS reaches 74.0% accuracy, a strengthened extraction variant reaches 78.8%, and a fixed non-oracle SC+IRIS gate reaches 83.6%. The results show that test-time reasoning should be evaluated as a budget-allocation problem, not only as a question of whether longer traces are available.
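
The decomposition $\mathrm{Acc}_{\mathrm{think}}(b)=α_c F_L(b)+α_t(1-F_L(b))$ can be evaluated directly from chain-length samples. The sketch below assumes one plausible reading of the symbols, which the abstract does not spell out: $F_L(b)$ is the fraction of reasoning chains that fit within budget $b$, $α_c$ is accuracy on completed chains, and $α_t$ is accuracy on truncated ones. The chain lengths and accuracy values are made-up numbers for illustration only.

```python
import numpy as np

def acc_think(budget, chain_lengths, alpha_c, alpha_t):
    """Predicted thinking-mode accuracy at a given output-token budget."""
    lengths = np.asarray(chain_lengths)
    f_l = np.mean(lengths <= budget)  # empirical F_L(b): share of chains that fit
    return alpha_c * f_l + alpha_t * (1.0 - f_l)

# Hypothetical chain lengths (tokens) and accuracies, not taken from the paper.
lengths = [300, 800, 1500, 2500, 4000]
predictions = {b: acc_think(b, lengths, alpha_c=0.9, alpha_t=0.1)
               for b in (512, 2048, 4096)}
for b, acc in predictions.items():
    print(b, round(acc, 3))
```

Under this reading, thinking mode loses to a non-thinking baseline at any budget where the predicted value falls below the baseline's accuracy, which is how a crossover budget would be located.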

2605.07676 2026-05-11 cs.LG

Structured Coupling for Flow Matching

Xavier Sumba, Carles Balsells-Rodas, Yingzhen Li

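AI summary This paper proposes Structured Coupling for Flow Matching (SCFM), a new framework that addresses the limited ability of flow matching models to learn interpretable latent structure. By introducing structured latent variables and exogenous noise, it combines flow matching with latent variable modeling and jointly learns a structured prior and a continuous transport map. Experiments show that SCFM learns meaningful latent structure while maintaining generation quality and performs well on tasks such as clustering and disentanglement.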

Abstract

Standard flow matching scales well but typically relies on an unstructured source distribution, limiting its ability to learn interpretable latent structure. Latent-variable models, by contrast, capture structure but often sacrifice generative quality. We bridge this gap by proposing Structured Coupling for Flow Matching (SCFM), a cooperative framework that augments flow matching with structured latent representation learning. By introducing structured latent variables and exogenous noise into the source, SCFM jointly learns a structured prior (via latent variable modeling) and a continuous transport map (via flow matching). It uses a shared time-dependent recognition network for both latent variable model variational inference and intermediate-time flow velocity estimation. This yields a structurally informed yet unconditional, simulation-free flow model, where the latent variable model can also assist flow sampling. Empirically, SCFM facilitates unsupervised latent representation learning for clustering, disentanglement and downstream tasks, while remaining competitive with flow matching in sample quality, showing that meaningful structure can be learned without sacrificing generative fidelity.

2605.07675 2026-05-11 cs.AI cs.LG

FactoryBench: Evaluating Industrial Machine Understanding

Yanis Merzouki, Coral Izquierdo, Matei Ignuta-Ciuncanu, Marcos Gomez-Bracamonte, Riccardo Maggioni, Alessandro Lombardi, Camilla Mazzoleni, Federico Martelli, Balazs Gunther, Jonas Petersen, Philipp Petersen

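AI summary This paper introduces FactoryBench, a benchmark for evaluating how well time-series models and large language models understand industrial robotic telemetry. The benchmark organizes Q&A pairs along four levels of causal reasoning and combines structured scoring with LLM-as-judge scoring. The work proposes a scalable Q&A generation framework and builds a large benchmark of more than 70,000 Q&A pairs from several industrial datasets, showing that current models still have considerable room to improve in industrial settings.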

Comments 9 pages, 4 figures, 14 tables; appendix with 24 pages

Abstract

We introduce FactoryBench, a benchmark for evaluating time-series models and LLMs on machine understanding over industrial robotic telemetry. Q&A pairs are organized along four causal levels (state, intervention, counterfactual, decision) instantiating Pearl's ladder of causation, and span five answer formats: four structured formats are scored deterministically and free-form answers are scored by an LLM-as-judge voting protocol. We propose a scalable Q&A generation framework built around structured question templates, present FactoryWave (a dense, multitask, multivariate sensor dataset collected from a UR3 cobot and a KUKA KR10 industrial arm), and construct FactoryBench as a large-scale benchmark of over 70k Q&A items grounded in roughly 15k normalized episodes from FactoryWave, AURSAD, and voraus-AD. Zero-shot evaluation of six frontier LLMs shows that no model exceeds 50% on structured levels or 18% on decision-making, revealing a wide gap between current models and operational machine understanding.

2605.07662 2026-05-11 cs.LG cs.NA math.NA

Direction-Preserving Number Representations

Bardia Zadeh, George A. Constantinides

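AI summary This paper studies how accurately the direction of a vector can be represented in low-precision number formats when its scalar elements are chosen from a finite alphabet. The authors introduce a geometric framework that quantifies the gap in directional coverage between product-structured codes and spherical codes, and prove that the common two's complement, fixed-point, and floating-point formats are suboptimal for representing direction. Experiments show that NVIDIA's E2M1 format is close to the optimal directional alphabet at four bits, giving a geometric explanation for its strong performance in low-precision machine learning.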

Comments 9 pages excluding appendices and references, 18 in total. 5 figures

Abstract

Low-precision number formats are widely used in modern machine learning systems due to their efficiency. Accurate direction representation is key to the accuracy of vector operations. This work precisely explores the extent to which the direction of a vector can be represented by selecting its scalar elements from a common finite alphabet of a given size. This is standard practice in machine learning, where low-precision significands may be narrow-width floating-point or integer values. A geometric framework is introduced for analyzing the directional coverage of such product-structured codes. This work analytically quantifies the suboptimality gap between such product-structured codes and spherical codes for the vector as a whole, in both low and asymptotically high dimensions. Furthermore, within the product code class, it is proven that the standard formats of two's complement, fixed-point, and floating-point are suboptimal, again with quantified gap, pointing to the potential to develop new scalar number formats. Such scalar alphabets are numerically optimized across multiple block dimensions for directional coverage, including the dimension used in NVIDIA's NVFP4 format. Experimental results are presented comparing the performance of standard formats and the optimized alphabet. We find that for four bits, NVIDIA's choice of E2M1 closely approximates the optimized alphabet, providing a geometric explanation for its strong performance in low-precision machine learning workloads and an analytical understanding of the link between that superiority and block size. We provide open-source formal proofs in Lean for the theorems in this work, along with the experimental code and the optimized alphabets obtained.
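
The directional error discussed in this abstract can be illustrated with a small experiment. The sketch below is not the authors' code: it assumes a simple scheme in which a block is scaled so its largest magnitude maps to the largest E2M1 value and each element is then rounded to the nearest representable value. The helper names and the block size of 16 (the size commonly cited for NVFP4) are assumptions made for illustration.

```python
import math
import random

# Magnitudes representable by the 4-bit E2M1 format (sign stored separately).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_ALPHABET = [-m for m in reversed(E2M1_MAGNITUDES[1:])] + E2M1_MAGNITUDES


def quantize_block(x, alphabet=E2M1_ALPHABET):
    """Scale the block so its peak maps to the largest alphabet value,
    then round every element to the nearest alphabet entry."""
    peak = max(abs(v) for v in x)
    if peak == 0.0:
        return [0.0] * len(x)
    scale = max(alphabet) / peak
    return [min(alphabet, key=lambda a: abs(a - v * scale)) for v in x]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def mean_angle_deg(dim, trials=2000, seed=0):
    """Average angle (degrees) between random Gaussian directions
    and their quantized versions."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        q = quantize_block(x)
        # Clamp to guard against rounding slightly above 1.0.
        total += math.degrees(math.acos(min(1.0, cosine(x, q))))
    return total / trials
```

Calling `mean_angle_deg(16)` gives a rough feel for how far a product-structured code moves a vector's direction. Swapping in a different `alphabet` would allow a like-for-like comparison, though the paper's own coverage metric may be defined differently.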

2605.07661 2026-05-11 cs.LG cs.CV

Stochastic Transition-Map Distillation for Fast Probabilistic Inference

George Rapakoulias, Peter Garud, Lingjiong Zhu, Panagiotis Tsiotras

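AI summary This paper proposes STMD, a teacher-free framework for accelerating diffusion model inference while preserving probabilistic sample generation. Unlike score-based diffusion models, which model only the mean of the posterior distribution, STMD uses a conditional Mean Flow model to learn the full transition map associated with the sampling stochastic differential equation, yielding a one-step or few-step stochastic sampler. The method requires no pretrained teacher model or complex optimization procedure, trains efficiently and scalably, and is validated on several image generation tasks.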

Details
English abstract

Diffusion models achieve strong generation quality, diversity, and distribution coverage, but their performance often comes with expensive inference. In this work, we propose Stochastic Transition-Map Distillation (STMD), a teacher-free framework for accelerating diffusion model inference while preserving probabilistic sample generation. In contrast to score-based diffusion models, whose denoising parametrization models the mean of the posterior distribution, STMD distills the full transition map associated with the sampling stochastic differential equation (SDE). We parameterize these SDE transitions with a conditional Mean Flow model, yielding a one- or few-step stochastic sampler that retains the transition structure of the underlying diffusion process. This perspective is especially useful for downstream tasks that require stochastic inference, such as diffusion posterior sampling, inverse problems, and energy-based fine-tuning. Compared to recent distillation methods, STMD requires no pretrained teacher, bi-level optimization, or trajectory simulation and caching, enabling efficient and scalable training. We derive convergence bounds for our method in the Wasserstein distance, providing a strong theoretical foundation for our approach, and validate STMD on various image generation examples on the MNIST, CIFAR-10, and CelebA datasets.

2605.07660 2026-05-11 cs.CL

Not All Tokens Learn Alike: Attention Entropy Reveals Heterogeneous Signals in RL Reasoning

Gengyang Li, Zheng-Fan Wu, Siqi Bao, Yunfang Wu

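AI summary This study examines the heterogeneity of token-level learning signals in reinforcement-learning-based post-training of large language models, using attention entropy to measure how concentrated the contextual support is for each response token. It finds that low-attention-entropy tokens (called anchors) have stable gradients and serve well as an optimization backbone but tend to plateau on harder tasks, whereas high-attention-entropy tokens (called explorers) capture more diffuse contextual information but have more volatile gradients. The study further shows that dynamic reweighting based on attention entropy improves reasoning performance, revealing optimization-relevant structure hidden in token-level RL signals.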

Details
English abstract

Reinforcement-learning-based post-training has become a key approach for improving the reasoning ability of large language models, but its token-level learning signals remain poorly understood. This work studies their heterogeneity through attention entropy, which measures how concentrated or diffuse the contextual support is for each response token. We first show that token-level RL objectives are sparsely estimable: uniformly random 20 percent token subsets preserve much of the full-token held-out performance, suggesting substantial redundancy in token-level updates. However, entropy-structured subsets behave very differently. Low-attention-entropy tokens, which we call anchors, rely on concentrated support, produce stable gradients aligned with full-token updates, and provide a reliable optimization backbone, but tend to plateau on harder benchmarks. High-attention-entropy tokens, which we call explorers, aggregate more diffuse context and induce larger but more volatile gradients. Explorer-only training is unstable on average, though rare successful runs suggest that these tokens may contain useful hard-reasoning signals when optimization remains stable. We support this anchor-explorer spectrum with evidence-gathering analyses, entropy dynamics, gradient-geometry diagnostics, and controls showing that position, predictive entropy, and loss normalization do not explain the observed asymmetry. Finally, a dynamic entropy-aware soft-reweighting intervention improves Qwen3-8B-Base from 34.39 to 37.40 held-out average in the strongest setting. These findings suggest that attention entropy reveals optimization-relevant structure in token-level RL signals, and that uniform token averaging can obscure meaningful heterogeneity in reasoning post-training.
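
As a rough illustration of the quantity this abstract is built on, the sketch below computes the Shannon entropy of one token's attention distribution and splits tokens into low-entropy and high-entropy groups. How the paper aggregates attention across heads and layers, and where it places the anchor/explorer cut-off, is not stated in the abstract. The single attention row per token and the `anchor_fraction` parameter are assumptions made for illustration.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one token's attention distribution.
    Weights are renormalized, and zero entries contribute nothing."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0.0]
    return -sum(p * math.log(p) for p in probs)


def split_anchors_explorers(attn_rows, anchor_fraction=0.5):
    """Rank tokens by attention entropy: the lowest-entropy share is
    labelled 'anchor' (concentrated support), the rest 'explorer'."""
    entropies = [attention_entropy(row) for row in attn_rows]
    order = sorted(range(len(entropies)), key=lambda i: entropies[i])
    n_anchor = int(len(order) * anchor_fraction)
    labels = ["explorer"] * len(order)
    for i in order[:n_anchor]:
        labels[i] = "anchor"
    return entropies, labels
```

A token that attends to a single position has entropy 0, while one that attends uniformly over n positions has entropy log n, so the two groups sit at opposite ends of that range.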

2605.07655 2026-05-11 cs.CV cs.AI

Towards Billion-scale Multi-modal Biometric Search

Arka Koner, Chetan S. Naik, Lokesh Kurre, Vivek Raghavan, Barada P. Sabut, Tanusree Deb Barma, Anoop M. Namboodiri, Anil K. Jain

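AI summary This paper studies a billion-scale multimodal biometric search system for national identity systems, focusing on efficient processing, accurate matching, and presentation attack detection at large scale. It presents Bharat ABIS, a multimodal biometric system based on open-source architectures that covers preprocessing, quality assessment, attack detection, and embedding generation for fingerprint, face, and iris, producing a fused template of 13.5KB per person. Experiments on a gallery of 220 million identities yield an FNIR of 0.3% at an FPIR of 0.5%, and the system reaches 100 searches per second on a 40M gallery on a single server.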

Details
English abstract

Searching a multi-biometric database of a billion records for a country-level identity system requires pushing the limits of all aspects of a biometric system, including acquisition, preprocessing, feature extraction, accuracy, matching speed, presentation attack detection, and handling of special cases (e.g., missing finger digits). This is the first paper that gives insights into such a large-scale multimodal biometric search system, called Bharat ABIS, based on open-source architectures. The end-to-end pipeline of Bharat ABIS processes fingerprint, face and iris modalities through modality-specific stages of preprocessing (segmentation), quality assessment, presentation attack detection, and learning an embedding (feature extraction), producing a concatenated template of 13.5KB per person. We present a detailed analysis of the modalities and how they are integrated to create an efficient and effective solution for 1:N search (de-duplication). Evaluations on a demographically stratified gallery of 220 million identities, randomly sampled from 1.55 billion records in India's Aadhaar database, yield an FNIR of 0.3% at an FPIR of 0.5%, for adult probes (over 18 years). We also compare the performance of Bharat ABIS against three state-of-the-art COTS systems on a 20M gallery. Our system achieves a throughput of 100 searches per second on a gallery of 40M on a single server (8xNvidia H100 GPUs, 2TB RAM).
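
The FNIR and FPIR figures quoted here are the standard error rates for 1:N identification. The sketch below shows how they are typically computed from search results; it is not taken from the paper. The data layout (candidate lists of `(gallery_id, score)` pairs) and the function names are assumptions made for illustration.

```python
def fnir(mated_searches, threshold):
    """False negative identification rate.
    mated_searches: list of (candidates, true_id), where candidates is a
    list of (gallery_id, score). A miss is a search whose true mate is
    not returned at or above the threshold."""
    misses = 0
    for candidates, true_id in mated_searches:
        hit = any(gid == true_id and score >= threshold
                  for gid, score in candidates)
        misses += not hit
    return misses / len(mated_searches)


def fpir(nonmated_searches, threshold):
    """False positive identification rate.
    nonmated_searches: candidate lists for probes with no enrolled mate.
    A false positive is any candidate at or above the threshold."""
    false_pos = sum(
        any(score >= threshold for _, score in candidates)
        for candidates in nonmated_searches
    )
    return false_pos / len(nonmated_searches)
```

Both rates depend on the same threshold, which is why results are reported as an FNIR at a fixed FPIR: raising the threshold lowers FPIR and raises FNIR.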

2605.07650 2026-05-11 cs.CV eess.IV

Breaking Spatial Uniformity: Prior-Guided Mamba with Radial Serialization for Lens Flare Removal

Zijia Fu, Yuanfei Huang, Lizhi Wang, Hua Huang

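AI summary This paper addresses lens flare removal. To overcome the limits of spatially uniform processing in existing methods, it proposes DeflareMambav2, a prior-guided Mamba framework. The method introduces a flare prior network to estimate flare regions and combines it with a radial serialization strategy to achieve non-uniform processing, which better preserves light-source regions, removes flare artifacts, and recovers background details. Experiments show state-of-the-art performance with fewer parameters.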

Details
English abstract

Lens flares, caused by complex optical aberrations, severely degrade image quality, especially in nighttime photography. Although recent restoration methods have made remarkable progress, most still rely on spatially uniform processing and fail to handle the region-dependent restoration demands of flare scenes, where saturated light sources should be preserved, flare artifacts removed, and background details recovered. To address this challenge, we propose DeflareMambav2, a prior-guided Mamba framework for lens flare removal. Specifically, we introduce a Flare Prior Network (FPN) to estimate flare priors and guide adaptive restoration. In addition, a novel radial serialization strategy breaks spatially homogeneous processing by performing flare-aware targeted sampling, and better supports long-range modeling in State Space Models (SSMs). Based on these priors, the backbone adopts a dual-level adaptive scheme. It explicitly preserves light-source regions to avoid over-processing, and applies curriculum-based restoration to the remaining contaminated areas while calibrating restoration intensity at the pixel level. Extensive experiments demonstrate that DeflareMambav2 achieves state-of-the-art performance with reduced parameter burden. Code is available at https://github.com/BNU-ERC-ITEA/DeflareMambav2.

2605.07648 2026-05-11 cs.LG

Learning Large-Scale Modular Addition with an Auxiliary Modulus

Hanato Kikuchi, Ryosuke Masuya, Kazuhiko Kawamoto, Hiroshi Kera

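AI summary This paper studies learning large-scale modular addition, a task made difficult by its high sensitivity to the input. To avoid the covariate shift caused by mismatched training and test distributions, the authors introduce an auxiliary modulus $Kq$ that reduces problem difficulty while keeping the input distribution the same. Experiments show strong scalability and sample efficiency for long inputs, large moduli, and small datasets, clearly outperforming the existing sparse method.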

Comments 10+11 pages, 5 figures

Details
English abstract

Learning parity functions, and more generally modular addition, is a challenging machine learning task due to its input sensitivity. A recent study substantially scaled modular addition learning in both the number of summands and the modulus. Its key idea is to increase zeros in training sequences, reducing the effective number of summands and thus controlling training difficulty; however, this induces covariate shift between training and test input distributions. This study theoretically and empirically analyzes this side effect and proposes a covariate-shift-free method for modular addition. Specifically, we introduce an auxiliary modulus $Kq$ during training, which reduces wrap-around frequency and problem difficulty while preserving the same input distribution across training and testing. Experiments show strong scalability and sample efficiency: even for large input length $N$, large modulus $q$, and small datasets -- where the sparse method fails to learn -- our method achieves equal or better match accuracy and relaxed $τ$-accuracy. For example, at $N=64$ and $q=974269$, our method trained on 100K samples achieves $97.0\%$ $τ$-accuracy at $τ=0.05$, while the sparse method achieves only $9.5\%$ with the same data size and $93.9\%$ even when extended to 1M samples.
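
The arithmetic behind the auxiliary modulus can be checked directly. The sketch below is one plausible reading of the idea, not the authors' training code. It relies on two facts: because $q$ divides $Kq$, reducing a sum modulo $Kq$ and then modulo $q$ gives the same result as reducing modulo $q$ directly, and the sum wraps around $Kq$ no more often than it wraps around $q$. The helper names and the small values of `q`, `K`, and `N` are assumptions made for illustration.

```python
import random


def make_example(n_terms, q, rng):
    """Draw summands uniformly from [0, q), the same way for training
    and testing, so the input distribution is unchanged."""
    return [rng.randrange(q) for _ in range(n_terms)]


def label(xs, modulus):
    """Target value: the sum of the inputs reduced by the given modulus."""
    return sum(xs) % modulus


def wraps(xs, modulus):
    """Number of times the total sum wraps around the modulus."""
    return sum(xs) // modulus


rng = random.Random(0)
q, K, N = 97, 8, 64
xs = make_example(N, q, rng)

aux = label(xs, K * q)           # target under the auxiliary modulus Kq
assert aux % q == label(xs, q)   # reducing mod q recovers the original target
assert wraps(xs, K * q) <= wraps(xs, q)  # fewer wrap-arounds under Kq
```

Only the target changes here: the inputs are drawn identically in both settings, which is the sense in which the approach avoids covariate shift.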

2605.07646 2026-05-11 cs.CL cs.AI cs.LG

MAVEN: Multi-Agent Verification-Elaboration Network with In-Step Epistemic Auditing

Yinsheng Yao, Jiehao Tang, Zhaozhen Yang, Dawei Cheng

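AI summary MAVEN is a multi-agent verification-elaboration network that uses a three-role Skeptic-Researcher-Judge loop to explicitly decompose and verify the reasoning process of large language models. Its in-step epistemic auditing improves the transparency and trustworthiness of reasoning, which matters most in high-stakes settings. Experiments show superior reasoning quality on several benchmarks, and the framework applies across different backbone models and transfers well.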

Comments 24 pages, 2 figures

Details
English abstract

While explicit reasoning trajectories enhance model interpretability, existing paradigms often rely on monolithic chains that lack intermediate verification, allowing early errors to cascade unchecked. This lack of modularity impedes granular auditing and compromises the epistemic trust required for high-stakes applications. We propose MAVEN (Multi-Agent Verification-Elaboration Network with In-Step Epistemic Auditing), a blackboard-inspired framework designed to transform LLMs into deliberate reasoners through explicit role-decoupling. At its core, MAVEN operationalizes an adversarial Skeptic-Researcher-Judge loop, simulating expert deliberation by functionally separating logical defense from factual grounding. Experiments on OpenBookQA, TruthfulQA, HALUEVAL and StrategyQA benchmarks demonstrate that MAVEN delivers superior reasoning quality across four fine-grained metrics. Notably, MAVEN consistently outperforms latent reasoning models such as GEMINI-3.1-Pro and consensus-based baselines (e.g., ReConcile) by generating explicitly structured, modular, and verifiable deliberation trajectories, rather than relying on implicit internal states or post-hoc consensus. Moreover, comprehensive evaluations confirm that MAVEN is fully model-agnostic, serving as a strong and transferable reasoning booster that yields substantial performance improvements across diverse backbone models.

2605.07642 2026-05-11 cs.CV

EggHand: A Multimodal Foundation Model for Egocentric Hand Pose Forecasting

Jaeyoung Choi, Hyeondong Kim, Yujin Kim, Daehee Park

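AI summary This paper proposes EggHand, a foundation-model-based framework for forecasting 3D hand pose sequences from an egocentric view. The method combines the action decoder of a vision-language-action model with an egocentric video-text encoder to jointly model complex hand motion and contextual information, without relying on body pose or external tracking. Experiments show state-of-the-art forecasting accuracy on the EgoExo4D dataset, robustness under severe viewpoint changes, and support for controllable prediction through language instructions.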

Comments CVPR Findings 2026

Details
English abstract

Forecasting future 3D hand pose sequences from egocentric video is essential for understanding human intention and enabling embodied applications such as AR/VR assistance and human-robot interaction. However, this task remains highly challenging because egocentric hand motion is driven by complex human intent, exhibits highly dexterous articulations, and is observed under drastic viewpoint shifts induced by ego-motion. In this work, we introduce EggHand, a foundation-model-based framework for egocentric hand pose forecasting that unifies multimodal semantic reasoning with dynamic motion modeling. Our approach couples an action decoder from a Vision-Language-Action (VLA) model, which captures the structured temporal dynamics of hand motion, with an egocentric video-text encoder that provides viewpoint-aware contextual information learned from large-scale first-person video. Together, these components overcome the brittleness of generic visual encoders under ego-motion and enable joint reasoning over motion, context, and high-level intent, without relying on body pose or external tracking. Experiments on the EgoExo4D dataset show that EggHand sets a new state of the art in forecasting accuracy, remains robust under severe ego-motion, and further enables controllable prediction via language-based task prompts. Project page: https://jyoun9.github.io/EggHand

2605.07640 2026-05-11 cs.CV cs.AI

LithoBench: Benchmarking Large Multimodal Models for Remote-Sensing Lithology Interpretation

Jun Wang, Fengpeng Li, Hang Dong, Tianjin Huang, Wei Han

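AI summary This paper proposes LithoBench, a multi-level benchmark for evaluating remote-sensing lithology interpretation, intended to advance the use of large multimodal models in geology. The benchmark contains 10,000 expert-annotated instances across 12 lithological categories, covers five cognitive levels from identification and description to comprehensive reasoning, and is built with an expert-in-the-loop semi-automated pipeline to improve geological validity and evaluation reliability. Experiments show that existing large vision-language models remain substantially limited on higher-order geological interpretation and reasoning tasks.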

Details
English abstract

Remote sensing lithology interpretation is fundamental to geological surveys, mineral exploration, and regional geological mapping. Unlike general land-cover recognition, lithology interpretation is a knowledge-intensive task that requires experts to infer rock types from various features, e.g., subtle visual, spectral, textural, geomorphological, and contextual cues, making reliable automated interpretation highly challenging. Geological knowledge-guided large multimodal models offer new opportunities, yet their evaluation remains constrained by the lack of benchmarks that capture lithological annotations, multi-level geological semantics, and expert-informed assessment. Here, we propose LithoBench, a multi-level benchmark for evaluating geological semantic understanding in remote sensing lithology interpretation. LithoBench contains 10,000 expert-annotated interpretation instances across 12 representative lithological categories, including 4,000 multiple-choice and 6,000 open-ended tasks organized into five cognitive levels: Identification and Description, Comparative Analysis, Mechanism Explanation, Practical Application, and Comprehensive Reasoning. We further develop an expert-in-the-loop, knowledge-grounded semi-automated construction pipeline, coupling multiple sub-processes (e.g., structured geological image descriptions) to enhance geological validity and evaluation reliability. Experiments with multiple large vision-language models reveal substantial limitations in geological semantic understanding, particularly on higher-order explanation, application, and reasoning tasks.

2605.07639 2026-05-11 cs.AI

Tacit Knowledge Extraction via Logic Augmented Generation and Active Inference

Lorenzo Lamazzi, Aldo Gangemi, Alessio Giberti, Andrea Giovanni Nuzzolese, Vittorio Andrea Rocca, Mattia Torta, Francesco Poggi

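AI summary This paper studies how to extract explicit, machine-interpretable, and reusable knowledge from tacit knowledge, particularly in domains that depend on procedures and experience. The authors propose a neuro-symbolic framework that combines Logic-Augmented Generation with active inference to build ontology-grounded knowledge graphs. The approach is evaluated in a knowledge transfer case study in manufacturing, where it improves the completeness and semantic quality of the knowledge representation, advancing neuro-symbolic knowledge engineering for industrial domains.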

Details
English abstract

Tacit knowledge plays a central role in human expertise, yet it remains difficult to capture, formalize, and reuse in machine-interpretable form. This challenge is especially relevant in procedural domains, where successful execution depends not only on explicit instructions, but also on implicit assumptions, contextual constraints, embodied skills, and experience-based judgments rarely documented. As a result, current knowledge engineering pipelines struggle to transform tacit and process-centric knowledge into formally specified, machine-interpretable representations that can be queried, validated, reasoned over, and reused. In this paper, we introduce a neuro-symbolic framework that combines Logic-Augmented Generation and an Active-Inference-inspired approach for ontology-grounded Knowledge Graph construction. We evaluate the approach in a knowledge transfer case study in manufacturing, using assembly-like repair procedures from instructional videos as a reproducible proxy domain. Results show that the proposed solution improves completeness and semantic quality, advancing neuro-symbolic knowledge engineering for industrial domains.

2605.07635 2026-05-11 cs.CL

Multi-Dimensional Evaluation of LLMs for Grammatical Error Correction

Adnan Labib, Qiao Wang, Yixuan Huang, Zheng Yuan

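AI summary This study addresses the insufficient evaluation of large language models (LLMs) for grammatical error correction (GEC). It evaluates latest-generation LLMs on edit precision, fluency preservation, and meaning retention, and finds that fine-tuned GPT-4o achieves state-of-the-art performance on all three. It also finds that different LLMs show highly similar error-correction patterns, and that reference-based metrics can underestimate the true performance of GEC systems: 73.76% of the GPT-4o corrections that differ from the gold standard are equally valid or better. These findings give educators guidance for choosing GEC tools that support students' language development.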

Comments 9 Pages

Details
English abstract

Automated assistants for Grammatical Error Correction are now embedded in educational platforms serving millions of learners, yet three critical gaps remain in this domain: (1) latest-generation Large Language Models (LLMs) lack comprehensive evaluation on grammar correction tasks; (2) whether combining these LLMs improves correction quality is unexplored; and (3) the extent to which reference-based metrics underestimate GEC system performance has not been adequately quantified. In this study, first, we evaluate latest-generation LLMs on edit precision, fluency preservation, and meaning retention, showing fine-tuned GPT-4o achieves state-of-the-art performance across all three dimensions. Second, through grammatical error type analysis we demonstrate that individual LLMs exhibit highly similar error correction patterns ($ρ=0.947$). Third, we show that reference-based metrics underestimate GEC performance with 73.76% of GPT-4o corrections different from gold standards being equally valid or even superior. These GEC evaluation findings equip educators with guidance for selecting GEC assistants that enhance rather than constrain student linguistic development. We make our data, code, and models publicly available.

2605.07631 2026-05-11 cs.AI

Inference Time Causal Probing in LLMs

Sadegh Khorasani, Saber Salehkaleybar, Negar Kiyavash, Matthias Grossglauser

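AI summary This paper studies inference-time causal probing in large language models, to analyze and control how internal representations influence model behavior. The authors propose Hidden-state Driven Margin Intervention (HDMI), a gradient-based method that needs no auxiliary classifier and adjusts hidden states directly using the model's own output, changing the generation probability of a targeted property. Experiments show that HDMI is more reliable than existing methods across several benchmark datasets and models.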

Comments 16 pages, 4 tables, 3 figures

Details
English abstract

Causal probing methods aim to test and control how internal representations influence the behavior of generative models. In causal probing, an intervention modifies hidden states so that a property takes on a different value. Most existing approaches define such interventions by training an auxiliary probe classifier, which ties the method to a specific task or model and risks misalignment with the model's predictive geometry. We propose Hidden-state Driven Margin Intervention (HDMI), a probe-free, gradient-based technique that directly steers hidden states using the model's native output. HDMI applies a margin objective that increases the probability of a target continuation while decreasing that of the source, without relying on probe classifiers. We further introduce a lookahead variant (LA-HDMI) for text editing that backpropagates through the softmax embeddings, modifying the current hidden state so that the likelihood of user-specified tokens increases in subsequent token generations while preserving fluency. To evaluate interventions, we measure completeness (whether the targeted property changes as intended) and selectivity (whether unrelated properties are preserved), and report their harmonic mean as an overall measure of reliability. HDMI consistently achieves higher reliability than prior methods on the LGD agreement corpus and the CausalGym benchmark, across Meta-Llama-3-8B-Instruct and Pythia-70M.
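
A toy version of the margin idea and the reliability score can be written in a few lines. The sketch below is not the HDMI implementation: it assumes a single linear output head (`logits = W h`), whereas the paper intervenes on hidden states inside a full language model. Under that assumption the log-probability margin between target and source tokens has a closed-form gradient. All names and the step-size settings are assumptions made for illustration.

```python
def logits(W, h):
    """Linear output head: one logit per row of W."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def margin_gradient(W, target, source):
    """For logits = W h, the margin log p(target) - log p(source) equals
    logit[target] - logit[source], because the softmax normalizer cancels.
    Its gradient with respect to h is therefore W[target] - W[source]."""
    return [t - s for t, s in zip(W[target], W[source])]


def steer(W, h, target, source, step=0.5, iters=10):
    """Plain gradient ascent on the margin, applied to the hidden state."""
    g = margin_gradient(W, target, source)
    for _ in range(iters):
        h = [x + step * gx for x, gx in zip(h, g)]
    return h


def reliability(completeness, selectivity):
    """Harmonic mean of completeness and selectivity."""
    if completeness + selectivity == 0:
        return 0.0
    return 2 * completeness * selectivity / (completeness + selectivity)
```

Because the harmonic mean is dominated by the smaller of its two inputs, an intervention that flips the target property but also disturbs unrelated properties scores poorly, which matches the abstract's framing of reliability.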