arXivDaily: arXiv daily academic digest, updated Monday through Friday
2604.21889 2026-05-25 cs.CL cs.AI cs.LG

TingIS: Real-time Risk Event Discovery from Noisy Customer Incidents at Enterprise Scale

Jun Wang, Ziyin Zhang, Rui Wang, Hang Yu, Peng Di, Rui Wang

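AI summary This paper introduces TingIS, an end-to-end system for discovering risk events in real time in large-scale enterprise environments. To cope with customer incident data that is noisy, semantically complex, and high in throughput, TingIS combines a multi-stage event linking engine with large language models to stably extract valid events from a small number of user descriptions, and it uses a cascaded routing mechanism and a multi-dimensional noise reduction pipeline to improve business attribution accuracy and signal quality. Experiments show that TingIS performs well on high-priority incident discovery rate and system response latency and significantly outperforms existing methods.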

Comments Accepted to ACL 2026 Industry Track (oral presentation)

Abstract

Real-time detection and mitigation of technical anomalies are critical for large-scale cloud-native services, where even minutes of downtime can result in massive financial losses and diminished user trust. While customer incidents serve as a vital signal for discovering risks missed by monitoring, extracting actionable intelligence from this data remains challenging due to extreme noise, high throughput, and semantic complexity of diverse business lines. In this paper, we present TingIS, an end-to-end system designed for enterprise-grade incident discovery. At the core of TingIS is a multi-stage event linking engine that synergizes efficient indexing techniques with Large Language Models (LLMs) to make informed decisions on event merging, enabling the stable extraction of actionable incidents from just a handful of diverse user descriptions. This engine is complemented by a cascaded routing mechanism for precise business attribution and a multi-dimensional noise reduction pipeline that integrates domain knowledge, statistical patterns, and behavioral filtering. Deployed in a production environment handling a peak throughput of over 2,000 messages per minute and 300,000 messages per day, TingIS achieves a P90 alert latency of 3.5 minutes and a 95% discovery rate for high-priority incidents. Benchmarks constructed from real-world data demonstrate that TingIS significantly outperforms baseline methods in routing accuracy, clustering quality, and Signal-to-Noise Ratio.

2604.19000 2026-05-25 cs.LG cs.AI

Decompose, Structure, and Repair: A Neuro-Symbolic Framework for Autoformalization via Operator Trees

Xiaoyang Liu, Zineng Dong, Yifan Bai, Yantao Li, Yuntian Liu, Tao Luo

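AI summary This paper proposes DSR, a neuro-symbolic framework for automatically formalizing natural-language mathematical problems into formal language. DSR decomposes mathematical statements into logical components and maps them to structured operator trees, using this topology to precisely localize and repair errors. The study also introduces the PRIME benchmark dataset, and experiments show that DSR outperforms existing methods under the same computational resources, achieving new state-of-the-art results.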

Comments Accepted to ICML 2026

Abstract

Statement autoformalization acts as a critical bridge between human mathematics and formal mathematics by translating natural language problems into formal language. While prior works have focused on data synthesis and diverse training paradigms to optimize end-to-end Large Language Models (LLMs), they typically treat formal code as flat sequences, neglecting the hierarchical logic inherent in mathematical statements. In this work, we introduce Decompose, Structure, and Repair (DSR), a neuro-symbolic framework that restructures autoformalization into a modular pipeline. DSR decomposes statements into logical components and maps them to structured operator trees, leveraging this topological blueprint to precisely localize and repair errors via sub-tree refinement. Furthermore, we introduce PRIME, a benchmark of 156 undergraduate and graduate-level theorems selected from canonical textbooks and expertly annotated in Lean 4. Experimental results demonstrate that DSR establishes a new state-of-the-art, consistently outperforming baselines under equivalent computational budgets. The datasets, model, and code are available at https://github.com/XiaoyangLiu-sjtu/DSR.

2604.13596 2026-05-25 cs.CV

VGGT-Segmentor: Geometry-Enhanced Cross-View Segmentation

Yulu Gao, Bohao Zhang, Zongheng Tang, Jitong Liao, Wenjun Wu, Si Liu

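AI summary This paper proposes VGGT-Segmentor (VGGT-S), a geometry-enhanced cross-view segmentation framework that targets instance-level object segmentation from first-person to third-person views. The method builds on the strong cross-view feature representation of the VGGT model and introduces a new Union Segmentation Head that produces high-precision pixel-level segmentation through multi-stage processing. It also adopts a single-image self-supervised training strategy that generalizes well without paired annotations, and it outperforms existing methods on the Ego-Exo4D benchmark.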

Abstract

Instance-level object segmentation across disparate egocentric and exocentric views is a fundamental challenge in visual understanding, critical for applications in embodied AI and remote collaboration. This task is exceptionally difficult due to severe changes in scale, perspective, and occlusion, which destabilize direct pixel-level matching. While recent geometry-aware models like VGGT provide a strong foundation for feature alignment, we find they often fail at dense prediction tasks due to significant pixel-level projection drift, even when their internal object-level attention remains consistent. To bridge this gap, we introduce VGGT-Segmentor (VGGT-S), a framework that unifies robust geometric modeling with pixel-accurate semantic segmentation. VGGT-S leverages VGGT's powerful cross-view feature representation and introduces a novel Union Segmentation Head. This head operates in three stages: mask prompt fusion, point-guided prediction, and iterative mask refinement, effectively translating high-level feature alignment into a precise segmentation mask. Furthermore, we propose a single-image self-supervised training strategy that eliminates the need for paired annotations and enables strong generalization. On the Ego-Exo4D benchmark, VGGT-S sets a new state-of-the-art, achieving 67.7% and 68.0% average IoU for Ego to Exo and Exo to Ego tasks, respectively, significantly outperforming prior methods. Notably, our correspondence-free pretrained model surpasses most fully-supervised baselines, demonstrating the effectiveness and scalability of our approach. Code is publicly available at: https://github.com/buaa-colalab/VGGT-S.

2604.11679 2026-05-25 cs.CV

Towards Brain MRI Foundation Models for the Clinic: Findings from the FOMO25 Challenge

Asbjørn Munk, Stefano Cerri, Vardan Nersesjan, Christian Hedeager Krag, Jakob Ambsdorf, Pablo Rocamora García, Julia Machnio, Peirong Liu, Suhyun Ahn, Nasrin Akbari, Yasmina Al Khalil, Kimberly Amador, Sina Amirrajab, Tal Arbel, Meritxell Bach Cuadra, Ujjwal Baid, Bhakti Baheti, Jaume Banus, Kamil Barbierik, Christoph Brune, Yansong Bu, Baptiste Callard, Yuhan Chen, Cornelius Crijnen, Corentin Dancette, Peter Drotar, Prasad Dutande, Nils D. Forkert, Saurabh Garg, Jakub Gazda, Matej Gazda, Benoît Gérin, Partha Ghosh, Weikang Gong, Pedro M. Gordaliza, Sam Hashemi, Tobias Heimann, Fucang Jia, Jiexin Jiang, Emily Kaczmarek, Chris Kang, Seung Kwan Kang, Mohammad Khazaei, Julien Khlaut, Petros Koutsouvelis, Jae Sung Lee, Yuchong Li, Mengye Lyu, Mingchen Ma, Anant Madabhushi, Klaus H. Maier-Hein, Pierre Manceron, Andrés Martínez Mora, Moona Mazher, Felix Meister, Nataliia Molchanova, Steven A. Niederer, Leonard Nürnberg, Jinah Park, Abdul Qayyum, Jonas Richiardi, Antoine Saporta, Branislav Setlak, Ning Shen, Justin Szeto, Constantin Ulrich, Puru Vaish, Vibujithan Vigneshwaran, Leroy Volmer, Zihao Wang, Siqi Wei, Anthony Winder, Jelmer M. Wolterink, Maxence Wynen, Chang Yang, Si Young Yie, Mostafa Mehdipour Ghazi, Akshay Pai, Espen Jimenez Solem, Sebastian Nørgaard Llambias, Mikael Boesen, Michael Eriksen Benros, Juan Eugenio Iglesias, Mads Nielsen

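AI summary Clinical deployment of automated brain MRI analysis is challenged by highly heterogeneous data and costly labels. By organizing the FOMO25 challenge, this paper provides the large-scale pretraining dataset FOMO60K and evaluates models on real clinical data in few-shot and cross-domain settings. The study finds that self-supervised pretraining effectively improves generalization to cross-domain data, that different pretraining objectives suit different tasks, and that small pretrained models already perform well, while further scaling of model size and training time did not bring consistent gains.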

Abstract

Clinical deployment of automated brain MRI analysis faces a fundamental challenge: clinical data is heterogeneous and noisy, and high-quality labels are prohibitively costly to obtain. Self-supervised learning (SSL) can address this by leveraging the vast amounts of unlabeled data produced in clinical workflows to train robust foundation models that adapt out-of-domain with minimal supervision. However, the development of foundation models for brain MRI has been limited by small pretraining datasets and in-domain benchmarking focused on high-quality, research-grade data. To address this gap, we organized the FOMO25 challenge as a satellite event at MICCAI 2025. FOMO25 provided participants with a large pretraining dataset, FOMO60K, and evaluated models on data sourced directly from clinical workflows in few-shot and out-of-domain settings. Tasks covered infarct classification, meningioma segmentation, and brain age regression, and considered both models trained on FOMO60K (method track) and models trained on any data (open track). Nineteen foundation models from sixteen teams were evaluated using a standardized containerized pipeline. Results show that (a) self-supervised pretraining improves generalization on clinical data under domain shift, with the strongest models trained out-of-domain surpassing supervised baselines trained in-domain; (b) no single pretraining objective benefits all tasks: MAE favors segmentation, while hybrid reconstruction-contrastive objectives favor classification; and (c) small pretrained models achieved strong performance, and scaling model size and training duration did not yield reliable benefits.

2604.09349 2026-05-25 cs.CV cs.AI cs.CL

Visually-Guided Policy Optimization for Multimodal Reasoning

Zengbin Wang, Feng Xiong, Liang Lin, Xuecai Hu, Yong Wang, Yanlin Wang, Man Zhang, Xiangxiang Chu

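AI summary This study addresses insufficient visual attention in vision-language models during multimodal reasoning and proposes a new framework, Visually-Guided Policy Optimization (VGPO). By introducing a Visual Attention Compensation mechanism and a dual-grained advantage re-weighting strategy, it strengthens the model's visual focus during reasoning. Experiments show that VGPO improves performance on mathematical multimodal reasoning and vision-dependent tasks and makes markedly better use of visual information.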

Comments Accepted to ACL 2026, https://github.com/wzb-bupt/VGPO

Abstract

Reinforcement learning with verifiable rewards (RLVR) has significantly advanced the reasoning ability of vision-language models (VLMs). However, the inherent text-dominated nature of VLMs often leads to insufficient visual faithfulness, characterized by sparse attention activation to visual tokens. More importantly, our empirical analysis reveals that temporal visual forgetting along reasoning steps exacerbates this deficiency. To bridge this gap, we propose Visually-Guided Policy Optimization (VGPO), a novel framework to reinforce visual focus during policy optimization. Specifically, VGPO initially introduces a Visual Attention Compensation mechanism that leverages visual similarity to localize and amplify visual cues, while progressively elevating visual expectations in later steps to counteract visual forgetting. Building on this mechanism, we implement a dual-grained advantage re-weighting strategy: the intra-trajectory level highlights tokens exhibiting relatively high visual activation, while the inter-trajectory level prioritizes trajectories demonstrating superior visual accumulation. Extensive experiments demonstrate that VGPO achieves better visual activation and superior performance in mathematical multimodal reasoning and visual-dependent tasks. The code has been released at https://github.com/wzb-bupt/VGPO.

2604.06885 2026-05-25 cs.CV

Time-driven Survival Analysis from FDG-PET/CT in Non-Small Cell Lung Cancer

Sambit Tarai, Ashish Chauhan, Elin Lundström, Johan Öfverstedt, Therese Sjöholm, Veronica Sanchez Rodriguez, Håkan Ahlström, Joel Kullberg

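AI summary This study proposes a deep regression framework based on FDG-PET/CT images for predicting overall survival (OS) in patients with non-small cell lung cancer, with a time variable added as input to enable time-driven survival analysis. The method extracts image features with ResNet-50 and fuses them with the temporal information to produce survival probability predictions that vary over time. Experiments show that the method outperforms the baseline model in AUC, and that an ensemble combining clinical and imaging features achieves the best performance, confirming the complementary value of multimodal data for survival prediction.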

Comments Under review

Journal ref Ann Biomed Eng (2026)

Abstract

Purpose: Automated medical image-based prediction of clinical outcomes, such as overall survival (OS), has great potential in improving patient prognostics and personalized treatment planning. We developed a deep regression framework using tissue-wise FDG-PET/CT projections as input, along with a temporal input representing a scalar time horizon (in days), to predict OS in patients with Non-Small Cell Lung Cancer (NSCLC). Methods: The proposed framework employed a ResNet-50 backbone to process input images and generate corresponding image embeddings. The embeddings were then combined with temporal data to produce OS probabilities as a function of time, effectively parameterizing the predictions based on time. The overall framework was developed using the U-CAN cohort (n = 556) and evaluated by comparing with a baseline method on the test set (n = 292). The baseline utilized the ResNet-50 architecture, processing only the images as input and providing OS predictions at pre-specified intervals, such as 2 or 5 years. Results: The incorporation of temporal data with image embeddings demonstrated an advantage in predicting OS, outperforming the baseline method with an improvement in AUC of 4.3%. The proposed model using clinical + IDP features achieved strong performance, and an ensemble of imaging and clinical + IDP models achieved the best overall performance (0.788), highlighting the complementary value of multimodal inputs. The proposed method also enabled risk stratification of patients into distinct categories (high vs low risk). Heat maps from the saliency analysis highlighted tumor regions as key structures for the prediction. Conclusion: Our method provides an automated framework for predicting OS as a function of time and demonstrates the potential of combining imaging and tabular data for improved survival prediction.

2604.03244 2026-05-25 cs.AI cs.CY cs.DB

AI Evaluation Should Require Standardized Item-Level Data Releases

Han Jiang, Susu Zhang, Dongyao Zhu, Yuzhuo Bai, Sang T. Truong, Xiaoyuan Yi, Sanmi Koyejo, Xing Xie, Ziang Xiao

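AI summary This paper argues that AI evaluation should adopt standardized item-level benchmark data as default infrastructure. Current evaluation practice suffers from underspecified item selection, construct misalignment, and poor generalization, rooted in an excessive focus on aggregate model scores. To build valid evaluations, the authors propose validating with empirical evidence from item-level model responses and establishing a standardized data release mechanism that improves the transparency, reproducibility, and auditability of evaluation. To this end, the study constructs the OpenEval dataset and shows how item-level data helps identify low-quality items, analyze construct misalignment, and validate benchmark structure.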

Abstract

This position paper argues that standardized item-level benchmark data should become the default infrastructure for AI evaluation. Current evaluations suffer from underspecified item selection, construct misalignment, and poor generalization. The root cause of these failures is a misplaced focus on aggregate model scores. Without item-level evidence, validity claims cannot be assessed, resulting in inflated capability claims, misdirected research, and unwarranted trust in deployed systems. Our position is that designing valid evaluations requires empirical evidence from item-level model responses, and the standardized release of such data should be treated as core AI evaluation infrastructure. Such a release, in addition, enables transparency, replicability, and auditability of evaluation results. To show the norm is both feasible and consequential, we construct OpenEval, an item-level archive of 10M responses across 155k items from widely-used benchmarks, under a unified schema that the AI evaluation community can develop upon. We demonstrate how item-level data can identify low-quality items, document construct misalignment, and recover validity evidence about benchmarks' internal structure. We address objections around contamination and author burden, and show each is tractable relative to the cost of decisions made on claims that cannot be trusted.

2604.00003 2026-05-25 cs.CL cs.AI cs.IR

Tabular PDF Information Extraction with Local LLMs and Layout-Aware Parsing: A Reliability Evaluation

Muhammad Anis Al Hilmi, Neelansh Khare, Noel Framil Iglesias, Kurnia Adi Cahyanto, Azhar Al Afghani, Musfi Yuliadi

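AI summary This study evaluates the reliability of extracting structured information from academic PDF documents, using Indonesian higher-education course registration forms (KRS) as a case study, and compares three approaches: LLM-only, hybrid deterministic-LLM (regular expressions combined with an LLM), and a Camelot-based pipeline with LLM fallback. Experiments show that the hybrid approach is more efficient for deterministic metadata, while the Camelot-based pipeline with LLM fallback performs best in accuracy and computational efficiency and is especially suited to computationally constrained environments.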

Comments 9 pages, 5 figures, 3 tables

Abstract

Extracting structured information from academic PDF documents is nontrivial: a single page typically combines free-text metadata with tabular regions, exhibits cross-program variation, and is susceptible to Unicode encoding artifacts that interfere with downstream parsing. This study evaluates the reliability of information extraction approaches for tabular PDF documents, using academic course registration documents (Kartu Rencana Studi or KRS) from Indonesian higher education as a case study. Three strategies are compared: LLM-only, Hybrid Deterministic-LLM (regex and LLM), and a Camelot-based pipeline with LLM fallback. Experiments were conducted on 140 documents for the LLM-based test and 860 documents for the Camelot-based pipeline evaluation, covering four study programs with varying data in tables and metadata. Three 12-14B LLMs (Gemma 3, Phi 4, and Qwen 2.5) were run locally using Ollama on a consumer-grade CPU without a GPU. Evaluations used exact match (EM) and Levenshtein similarity (LS) metrics with a threshold of 0.7. Although not applicable to all models, the results show that the hybrid approach can improve efficiency compared to LLM-only, especially for deterministic metadata. The Camelot-based pipeline with LLM fallback produced the best combination of accuracy (EM and LS up to 0.99-1.00) and computational efficiency (less than 1 second per PDF in most cases). The Qwen 2.5:14b model demonstrated the most consistent performance across all scenarios. These findings confirm that integrating deterministic and LLM-based methods is a reliable and efficient strategy for information extraction from tabular, text-based PDF documents in computationally constrained environments.
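
The two metrics named in this abstract, exact match and Levenshtein similarity, can be made concrete with a small sketch. The abstract does not say how the similarity is normalized or how the 0.7 threshold is applied, so the code below assumes similarity = 1 - distance / max(length) and counts a field as acceptable when its similarity reaches the threshold. The function names and sample strings are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the classic two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(pred: str, gold: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(pred), len(gold))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(pred, gold) / longest


def score_field(pred: str, gold: str, threshold: float = 0.7) -> dict:
    """Score one extracted field; the pass rule is an assumption."""
    ls = levenshtein_similarity(pred, gold)
    return {"em": pred == gold, "ls": ls, "ls_pass": ls >= threshold}


# Hypothetical extracted value with a one-character error.
result = score_field("Algoritma", "Algoritme")
```

Under these assumptions a one-character slip in a nine-character field fails exact match but still clears the similarity threshold, which is why reporting both metrics is informative.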

2603.24985 2026-05-25 cs.CV

Few-Shot Left Atrial Wall Segmentation in 3D LGE MRI via Meta-Learning

Yusri Al-Sanaani, Rebecca Thornhill, Pablo Nery, Elena Pena, Robert deKemp, Calum Redpath, David Birnie, Sreeraman Rajan

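AI summary This study addresses the challenge of left atrial wall segmentation in 3D late gadolinium enhancement MRI (LGE-MRI) and proposes a model-agnostic meta-learning framework with a 3D residual U-Net for segmentation from few samples (5, 10, or 20). Joint training on the left atrial wall and auxiliary left and right atrial cavity tasks, together with a boundary-aware composite loss, improves segmentation accuracy on thin structures. Experiments show that the method outperforms conventional fine-tuning in few-shot settings and remains robust across data domains, which may help reduce the annotation burden in cardiac remodeling assessment.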

Comments Accepted to IEEE EMBC 2026

Abstract

Segmenting the left atrial (LA) wall from late gadolinium enhancement magnetic resonance imaging (LGE-MRI) is challenging because of its thin geometry, low contrast, and limited expert annotations. We propose a model-agnostic meta-learning (MAML) framework with a 3D residual U-Net backbone for K-shot (K = 5, 10, 20) LA wall segmentation. The framework is meta-trained on LA wall tasks together with auxiliary LA and right atrial (RA) cavity tasks and uses a boundary-aware composite loss to improve thin-structure delineation. We evaluated MAML on a held-out clean test set and assessed its robustness under an unseen synthetic domain shift and on a local cohort. On the held-out clean test set, MAML outperformed the K-shot fine-tuning baseline at 5-shot, achieving Dice coefficient (DSC) = 0.54 versus 0.48 and Hausdorff distance (HD95) = 4.60 versus 6.40 mm. At 20-shot, MAML approached the fully supervised model trained from scratch, with DSC = 0.59 versus 0.61. Under unseen shifts, performance decreased relative to clean testing but improved consistently as K increased. At 5-shot, MAML achieved DSC = 0.52 and HD95 = 5.02 mm under the unseen synthetic shift, and DSC = 0.50 and HD95 = 5.43 mm on the local cohort. These results suggest that meta-learning can improve thin-wall delineation in low-shot adaptation and may reduce the annotation burden for atrial remodeling assessment.
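
A short sketch of the Dice coefficient helps explain why thin structures are hard to score well. It is a generic illustration on flattened binary masks, not the authors' evaluation code; the masks are hypothetical, and returning 1.0 for two empty masks is an assumed convention.

```python
def dice_coefficient(pred, target):
    """Dice similarity coefficient for two equal-length binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


# Hypothetical 1-D profiles through a thin wall (3 voxels) and a thick
# structure (10 voxels); each prediction is the truth shifted by one voxel.
thin_truth = [0] * 2 + [1] * 3 + [0] * 10
thin_pred = [0] * 3 + [1] * 3 + [0] * 9
thick_truth = [0] * 2 + [1] * 10 + [0] * 3
thick_pred = [0] * 3 + [1] * 10 + [0] * 2

thin_dice = dice_coefficient(thin_pred, thin_truth)     # 2 * 2 / 6
thick_dice = dice_coefficient(thick_pred, thick_truth)  # 2 * 9 / 20
```

The same one-voxel shift costs the thin structure far more Dice than the thick one, which is one reason absolute DSC values for wall segmentation tend to be lower than for cavity segmentation.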

2603.21437 2026-05-25 cs.CL cs.IR

Pooling and Semantic Shift: The Fundamental Challenges in Long Text Embedding and Retrieval

Hang Gao, Wujiang Xu, Kai Mei, Dimitris N. Metaxas

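AI summary This paper studies two fundamental challenges that Transformer-based embedding models face in long-text representation and retrieval: embedding collapse caused by pooling, and semantic shift. The authors argue that embedding quality degrades not simply because of text length or attention mechanisms but because of pooling operations acting together with internal semantic shift, and they establish a unified theoretical framework to prove it. Experiments confirm that semantic shift is the main cause of highly concentrated embeddings and show that anisotropy harms retrieval performance only under strong semantic shift, giving a theoretical basis for understanding the difficulty of long-text embedding.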

Abstract

Transformer-based embedding models frequently exhibit geometric pathologies, such as anisotropy and length-induced representation collapse, which can degrade downstream retrieval performance. While prior work often attributes these issues directly to text length or attention mechanisms, we argue that the fundamental drivers are instead the inherent pooling operations coupled with internal semantic shift. In this paper, we establish a unified theoretical framework proving that contextual pooling intrinsically causes embedding collapse. Specifically, we mathematically prove that pooling semantically diverse sentences inevitably leads to micro-level semantic dilution, and strictly reduces the Mean Pairwise Distance of the vector space, guaranteeing macro-level spatial concentration. Grounded in these geometric insights, we formally define semantic shift to capture the natural semantic evolution and dispersion within a text. Through carefully controlled experiments across diverse models and corpora, we disentangle text length from semantic content. We demonstrate that semantic shift is the primary predictor of severe embedding concentration. Crucially, our retrieval evaluations reveal that anisotropy is fundamentally harmful only when induced by strong semantic shifts, reconciling conflicting observations in prior literature and offering a principled explanation for the long-context challenges faced by modern embedding models.
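
The claim that pooling semantically diverse sentences shrinks the Mean Pairwise Distance can be seen on a toy example. The sketch below is only a numerical illustration, not the paper's proof: the 2-D vectors are hypothetical, and Mean Pairwise Distance is assumed to be the average Euclidean distance over all unordered pairs.

```python
from itertools import combinations
from math import dist


def mean_pool(vectors):
    """Average a list of equal-length vectors component-wise."""
    n = len(vectors)
    return tuple(sum(component) / n for component in zip(*vectors))


def mean_pairwise_distance(vectors):
    """Mean Euclidean distance over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(dist(u, v) for u, v in pairs) / len(pairs)


# Hypothetical 2-D "sentence embeddings" grouped into three documents
# whose sentences point in different semantic directions.
documents = [
    [(1.0, 0.0), (0.0, 1.0)],
    [(-1.0, 0.0), (0.0, -1.0)],
    [(1.0, 0.0), (0.0, -1.0)],
]

sentences = [sentence for doc in documents for sentence in doc]
pooled = [mean_pool(doc) for doc in documents]

mpd_sentences = mean_pairwise_distance(sentences)
mpd_pooled = mean_pairwise_distance(pooled)
```

In this toy setup the pooled document vectors sit closer together than the sentence vectors they were built from, which matches the direction of the concentration effect described in the abstract.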

2603.19812 2026-05-25 cs.LG

Eye Gaze-Informed and Context-Aware Pedestrian Trajectory Prediction in Shared Spaces with Automated Shuttles: A Virtual Reality Study

Danya Li, Yan Feng, Rico Krueger

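AI summary Using a virtual reality experiment, this study examines the value of pedestrians' eye gaze for predicting their trajectories in shared spaces, covering interactions between pedestrians and automated shuttles under different approach angles and traffic conditions. The study builds a multi-modal prediction model that fuses eye gaze, head orientation, and situational context, and finds that the contribution of eye gaze to trajectory prediction depends on angle and body coordination and is complementary to situational information. Experiments show that combining eye gaze with situational context reduces final displacement error by 8.47%, underscoring the importance of incorporating human perceptual signals into pedestrian behavior prediction.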

Abstract

To address this gap, we conduct a Virtual Reality experiment in which pedestrians interact with automated shuttles under varying approach angles (45°, 90°, 135°) and continuous-traffic conditions (single shuttle, two shuttles with 3 or 5-second gaps), collecting synchronized motion, eye gaze, and head orientation data. To investigate to what extent, under what conditions, and in what form fine-grained eye gaze is informative for pedestrian motion prediction, we develop a multi-modal prediction model that fuses these signals through modality-specific encoders, and systematically ablate gaze representations against head orientation and situational context. We report three main results. First, the predictive value of eye gaze is angle-dependent and tightly coupled with eye-head-body coordination: at acute angles where pedestrians actively redirect gaze to acquire the shuttle, eye gaze carries information that head orientation alone misses. Second, continuous gaze orientation outperforms categorical semantic fixation labels, with the optimal encoding frame (global or body-relative) depending on whether gaze is used alone or jointly with context. Third, eye gaze and situational context provide complementary predictive information: their combination reduces final displacement error (FDE) by 8.47%, close to the sum of their individual contributions. Together, these findings highlight the value of incorporating human perceptual signals into pedestrian behavior prediction and motivate a human-centered complement to vehicle-centric modeling approaches. Our code is available at https://github.com/danyayay/GazeX.git.
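
For readers unfamiliar with the metric, final displacement error (FDE) is commonly defined as the Euclidean distance between the predicted and true positions at the last time step. The sketch below follows that common definition and assumes 2-D positions; the trajectories and the baseline and improved FDE values are hypothetical and chosen only to show how a relative reduction such as 8.47% is computed.

```python
import math


def final_displacement_error(pred_traj, true_traj):
    """Distance between the last predicted and last true (x, y) position."""
    (px, py), (tx, ty) = pred_traj[-1], true_traj[-1]
    return math.hypot(px - tx, py - ty)


def mean_fde(pred_trajs, true_trajs):
    """Average FDE over a batch of trajectory pairs."""
    errors = [final_displacement_error(p, t)
              for p, t in zip(pred_trajs, true_trajs)]
    return sum(errors) / len(errors)


def relative_reduction(baseline, improved):
    """Fractional reduction of an error metric relative to a baseline."""
    return (baseline - improved) / baseline


# Hypothetical example: the prediction ends 5 m from the true endpoint.
fde = final_displacement_error([(0, 0), (1, 1), (3, 4)],
                               [(0, 0), (0, 0), (0, 0)])

# Hypothetical baseline and improved FDE values (metres).
reduction = relative_reduction(1.0, 0.9153)
```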

2603.19310 2026-05-25 cs.LG cs.AI

MemReward: Graph-Based Experience Memory for LLM Reward Prediction with Limited Labels

Tianyang Luo, Tao Feng, Zhigang Hua, Yan Xie, Shuang Yang, Ge Liu, Jiaxuan You

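AI summary This paper proposes MemReward, a graph-based experience memory framework that improves reward prediction for large language models (LLMs) when labeled data is limited. The method builds a heterogeneous graph containing the reasoning processes and answers generated by the initial policy and uses a graph neural network (GNN) to propagate the limited labeled rewards to unlabeled samples, so that rewards can be obtained efficiently during online policy optimization. Experiments show that, with only 20% labeled data, MemReward approaches the performance of an ideal reward model on tasks such as mathematics, question answering, and code generation.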

Abstract

Reinforcement learning has emerged as a powerful paradigm for improving large language model (LLM) reasoning, where rollouts are sampled from the policy and reward signals computed on those rollouts are used to update the policy. However, in data-scarce scenarios, obtaining ground-truth labels to verify rollouts at scale often requires expensive human annotation or labor-intensive expert verification. For instance, evaluating mathematical proofs demands expert review, and open-ended question answering lacks definitive ground truth. When ground-truth labels are scarce, the effectiveness of reinforcement learning fine-tuning is constrained. Inspired by the success of semi-supervised learning in propagating labels from labeled to unlabeled samples, we propose MemReward, a graph-based experience memory framework that integrates reward propagation directly into online policy optimization. MemReward stores rollouts (thinking processes and final answers) from an initial LLM policy as nodes in a heterogeneous graph connected by similarity and structural edges, over which a GNN propagates rewards from labeled to unlabeled rollouts. To train such a framework, we first warm up the GNN on labeled rollouts to predict rewards via heterogeneous aggregation over query, thinking, and answer nodes. During online RL fine-tuning, unlabeled rollouts are attached to the graph by query similarity, and the GNN predicts their rewards, yielding a hybrid reward acquisition strategy that combines ground-truth and GNN-predicted rewards. Experiments on Qwen2.5-1.5B and 3B in mathematics, question answering, and code generation demonstrate that MemReward, with ground-truth rewards on only 20% of rollouts, achieves 96.6% of Oracle performance on 1.5B and 97.3% on 3B, and closely approaches Oracle on out-of-domain tasks.
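
The hybrid reward idea, using the ground-truth reward when a rollout is labeled and a propagated estimate otherwise, can be sketched in a few lines. This is not MemReward itself: the GNN is replaced here by a much simpler similarity-weighted average over the k most similar labeled rollouts, and the embeddings, field names, and `k` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def propagate_reward(query_vec, memory, k=2):
    """Similarity-weighted mean reward of the k most similar labeled
    rollouts (a stand-in for the GNN described in the abstract)."""
    scored = sorted(((cosine(query_vec, vec), reward)
                     for vec, reward in memory), reverse=True)[:k]
    weights = [max(similarity, 0.0) for similarity, _ in scored]
    if sum(weights) == 0:
        return 0.0
    return sum(w * r for w, (_, r) in zip(weights, scored)) / sum(weights)


def hybrid_reward(rollout, memory):
    """Use the ground-truth label when present, else the propagated one."""
    if rollout.get("label") is not None:
        return rollout["label"]
    return propagate_reward(rollout["embedding"], memory)


# Hypothetical memory of labeled rollouts: (embedding, reward).
memory = [((1.0, 0.0), 1.0), ((0.0, 1.0), 0.0), ((1.0, 1.0), 1.0)]

labeled = hybrid_reward({"embedding": (1.0, 0.0), "label": 0.0}, memory)
near_good = hybrid_reward({"embedding": (1.0, 0.1), "label": None}, memory)
near_bad = hybrid_reward({"embedding": (0.1, 1.0), "label": None}, memory)
```

A labeled rollout keeps its own reward regardless of its neighbours, while an unlabeled one inherits a soft estimate from nearby labeled rollouts.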

2603.19167 2026-05-25 cs.CL

Evaluating Counterfactual Strategic Reasoning in Large Language Models

Dimitrios Georgousis, Maria Lymperaiou, Angeliki Dimitriou, Giorgos Filandrianos, Giorgos Stamou

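AI summary This study evaluates the strategic performance of large language models in repeated game settings to determine whether it rests on genuine reasoning or on memorized patterns. It introduces counterfactual variants of classic games (such as Prisoner's Dilemma and Rock-Paper-Scissors) that change the payoff structures and action labels, breaking the original symmetries and dominance relations. Through a multi-metric evaluation framework, the study reveals limitations of large language models in incentive sensitivity, structural generalization, and strategic reasoning in counterfactual environments.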

Comments Accepted at GEM@ACL 2026

英文摘要

We evaluate Large Language Models (LLMs) in repeated game-theoretic settings to assess whether strategic performance reflects genuine reasoning or reliance on memorized patterns. We consider two canonical games, Prisoner's Dilemma (PD) and Rock-Paper-Scissors (RPS), upon which we introduce counterfactual variants that alter payoff structures and action labels, breaking familiar symmetries and dominance relations. Our multi-metric evaluation framework compares default and counterfactual instantiations, showcasing LLM limitations in incentive sensitivity, structural generalization and strategic reasoning within counterfactual environments.

2603.17879 2026-05-25 cs.CV cs.AI

Anatomy-Guided Vision-Language Learning with Angular Prototype Separation for Multi-Label Video Capsule Endoscopy Classification Under Class Imbalance

解剖引导的视觉-语言学习与角度原型分离用于类别不平衡下的多标签视频胶囊内镜分类

Podakanti Satyajith Chary, Nagarajan Ganapathy

AI总结 本文提出了一种用于视频胶囊内镜(VCE)的多标签时间事件检测框架,针对Galar数据集中严重的类别不平衡问题,结合了角度分离损失和生物状态机解码器两个核心贡献。该框架基于BiomedCLIP模型,通过局部差分注意力模块融合连续帧以增强病理信号,并利用解剖上下文头结合软解剖激活进行病理预测。实验表明,该方法在RARE-VISION测试集上显著提升了检测性能,实现了更高的平均精度。

Comments 12 pages, 1 figure, ICPR 2026 RARE-VISION Competition

详情
AI中文摘要

本文提出一个多标签时间事件检测框架用于视频胶囊内镜(VCE),通过结合两个主要贡献来解决Galar数据集固有的极端类别不平衡问题:类原型上的角度分离损失和生物状态机时间解码器。主干网络保持为BiomedCLIP,一个生物医学视觉-语言基础模型。三个连续帧通过局部差分注意力模块融合,该模块通过抑制静态时间冗余来放大瞬态病理信号。然后,解剖上下文头将病理预测条件化于软解剖激活上,利用已知的胃肠道发现空间共现结构。可学习的文本特征提示和基于原型的logit增强与角度分离损失一起训练,该损失惩罚类原型之间的非对角线余弦相似度,防止在极端不平衡下影响罕见类的原型崩溃。为抵消倾斜的标签分布,训练方案结合了非对称焦点损失、逆频率加权采样、时间混合、指数移动平均和每类阈值校准。生物状态机解码器用基于解剖标签的生理学基础前向状态转换替代朴素间隙合并,消除了先前方法中每视频产生数百个虚假解剖事件的碎片化伪影,并将每视频解剖输出减少到2-3个临床现实事件。在包含三个NaviCam检查(161,025帧)的保留RARE-VISION测试集上,更新后的管道实现了整体时间mAP@0.5为0.3597,mAP@0.95为0.3399,相比先前提交分别相对提升46%和44%,总推理时间在单个GPU上约21分钟完成。

英文摘要

This work presents a multi-label temporal event detection framework for video capsule endoscopy (VCE) that addresses the extreme class imbalance inherent in the Galar dataset by combining two principal contributions: an Angular Separation Loss on class prototypes and a Biological State Machine temporal decoder. The backbone remains BiomedCLIP, a biomedical vision-language foundation model. Three consecutive frames are fused through a Local Differencing Attention module that amplifies transient pathological signals by suppressing static temporal redundancy. An Anatomy Context Head then conditions pathological predictions on soft anatomical activations, exploiting the known spatial co-occurrence structure of GI findings. Learnable text-feature prompts and prototype-based logit augmentation are trained alongside an Angular Separation Loss that penalizes off-diagonal cosine similarity between class prototypes, preventing the prototype collapse that afflicts rare classes under extreme imbalance. To counteract the skewed label distribution, the training regime combines asymmetric focal loss, inverse-frequency weighted sampling, temporal Mixup, Exponential Moving Average, and per-class threshold calibration. The Biological State Machine decoder replaces naive gap merging with a physiologically grounded forward-only state transition over anatomy labels, eliminating the fragmentation artefact that produced hundreds of spurious anatomy events per video in the prior approach and reducing per-video anatomy output to 2--3 clinically realistic events. On the held-out RARE-VISION test set comprising three NaviCam examinations (161,025 frames), the updated pipeline achieves an overall temporal mAP@0.5 of 0.3597 and mAP@0.95 of 0.3399, representing a relative improvement of 46% and 44% respectively over the prior submission, with total inference completed in approximately 21 minutes on a single GPU.
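The Angular Separation Loss described above penalizes off-diagonal cosine similarity between class prototypes. A minimal sketch of that idea follows, assuming a mean-squared penalty; the abstract does not give the exact functional form, so the squaring, the averaging, and the function name `angular_separation_loss` are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def angular_separation_loss(prototypes: np.ndarray) -> float:
    """Mean squared off-diagonal cosine similarity between class prototypes.

    Assumed form for illustration only; the paper's exact loss may differ.
    """
    # L2-normalise each prototype so dot products become cosine similarities
    norms = np.linalg.norm(prototypes, axis=1, keepdims=True)
    unit = prototypes / np.clip(norms, 1e-12, None)
    cos = unit @ unit.T                      # (C, C) cosine-similarity matrix
    off_diag = ~np.eye(len(prototypes), dtype=bool)
    return float(np.mean(cos[off_diag] ** 2))

# Well-separated prototypes versus nearly collapsed ones (made-up values)
orthogonal = np.eye(3)
collapsed = np.array([[1.0, 0.0, 0.0],
                      [0.99, 0.1, 0.0],
                      [0.98, 0.0, 0.1]])
print(angular_separation_loss(orthogonal))
print(angular_separation_loss(collapsed))
```

The penalty is zero when prototypes are mutually orthogonal and approaches one as they collapse onto a single direction, which is the failure mode the abstract attributes to rare classes.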

2603.16331 2026-05-25 cs.LG

Decoding the Critique Mechanism in Large Reasoning Models

解码大型推理模型中的批判机制

Hoang Phan, Quang H. Nguyen, Hung T. Q. Le, Xiusi Chen, Heng Ji, Khoa D. Doan

AI总结 本文研究了大推理模型(LRMs)在推理过程中如何通过内部机制纠正错误,提出了“隐藏的批评能力”这一概念。研究发现,即使模型在中间推理步骤中出现错误且未进行明确纠正,仍能最终得出正确答案,表明其具备某种隐式的错误检测与自我修正机制。通过特征空间分析,作者识别出一个可解释的“批评向量”,用于引导模型增强错误检测能力,提升推理性能,且无需额外训练成本。这一发现为理解与改进大模型的自我验证机制提供了新思路。

详情
AI中文摘要

大型推理模型(LRMs)展现出回溯和自我验证机制,使其能够修正中间步骤并达到正确解,在复杂逻辑基准上表现强劲。我们假设这种行为仅在模型具有足够强的“批判”能力来检测自身错误时才有益。本工作通过在中间推理步骤中插入算术错误,系统研究了当前LRMs如何从错误中恢复。值得注意的是,我们发现一个奇特但重要的现象:尽管错误在整个思维链(CoT)中传播且没有任何言语修正,模型在思考过程结束后仍能得出正确的最终答案。这种恢复暗示存在一种内部机制帮助模型检测错误并触发自我修正,我们称之为隐藏的批判能力。基于特征空间分析,我们识别出一个高度可解释的批判向量,代表这种行为。跨多个模型规模和系列的广泛实验表明,用该向量引导潜在表示可提升模型的错误检测能力,并在无需额外训练成本的情况下增强测试时扩展性能。我们的发现为LRMs的批判行为提供了有价值的理解,提示了控制和改进其自我验证机制的有前景方向。我们的代码可在 https://github.com/mail-research/lrm-critique-vectors 获取。

英文摘要

Large Reasoning Models (LRMs) exhibit backtracking and self-verification mechanisms that enable them to revise intermediate steps and reach correct solutions, yielding strong performance on complex logical benchmarks. We hypothesize that such behaviors are beneficial only when the model has sufficiently strong "critique" ability to detect its own mistakes. This work systematically investigates how current LRMs recover from errors by inserting arithmetic mistakes in their intermediate reasoning steps. Notably, we discover a peculiar yet important phenomenon: despite the error propagating throughout the entire chain-of-thought (CoT) without any verbalized correction, the model still reaches the correct final answer after the thinking process finishes. This recovery implies the existence of an internal mechanism helping the model to detect errors and trigger self-correction, which we refer to as the hidden critique ability. Building on feature space analysis, we identify a highly interpretable critique vector representing this behavior. Extensive experiments across multiple model scales and families demonstrate that steering latent representations with this vector improves the model's error detection capability and enhances the performance of test-time scaling at no extra training cost. Our findings provide a valuable understanding of LRMs' critique behavior, suggesting a promising direction to control and improve their self-verification mechanism. Our code is available at: https://github.com/mail-research/lrm-critique-vectors.

2603.14027 2026-05-25 cs.CL

SemEval-2026 Task 6: CLARITY -- Unmasking Political Question Evasions

SemEval-2026 任务 6:CLARITY——揭露政治问题回避

Konstantinos Thomas, Giorgos Filandrianos, Maria Lymperaiou, Chrysoula Zerva, Giorgos Stamou

AI总结 SemEval-2026任务6(CLARITY)旨在识别政治发言中对问题的回避性回答,即发言者在保持表面回应性的同时避免直接作答的现象。该任务包含两个子任务:一是对回答清晰度进行分类,二是对九种具体回避策略进行细粒度识别。该基准数据集基于美国总统采访构建,采用专家定义的分类体系。结果显示大语言模型提示和分层利用分类体系是有效方法,且回避策略分类(最佳宏F1为0.68,与最佳基线持平)比清晰度分类(最佳宏F1为0.89)更具挑战性。

Comments SemEval 2026 (Task organizers)

详情
AI中文摘要

政治演讲者常常在保持回应表象的同时避免直接回答问题。尽管这对公共话语至关重要,但这种策略性回避在自然语言处理中仍未得到充分探索。我们介绍了 SemEval-2026 任务 6,CLARITY,一个关于政治问题回避的共享任务,包含两个子任务:(i) 清晰度级别分类,分为明确回答(Clear Reply)、模棱两可(Ambivalent)和明确不回答(Clear Non-Reply);(ii) 回避级别分类,分为九种细粒度回避策略。该基准来自美国总统访谈,并遵循基于专家的回应清晰度和回避分类体系。该任务吸引了 124 个注册团队,他们提交了 946 个有效运行用于清晰度级别分类,539 个用于回避级别分类。结果显示两个子任务之间存在显著的难度差距:最佳系统在清晰度分类上达到了 0.89 的宏 F1,大幅超过最强基线,而顶级回避系统达到了 0.68 的宏 F1,与最佳基线持平。总体而言,大语言模型提示和分类体系的层级利用成为最有效的策略,顶级系统始终优于那些独立处理两个子任务的系统。CLARITY 将政治回应回避确立为计算话语分析的一个具有挑战性的基准,并突显了建模政治语言中策略性模糊的难度。

英文摘要

Political speakers often avoid answering questions directly while maintaining the appearance of responsiveness. Despite its importance for public discourse, such strategic evasion remains underexplored in Natural Language Processing. We introduce SemEval-2026 Task 6, CLARITY, a shared task on political question evasion consisting of two subtasks: (i) clarity-level classification into Clear Reply, Ambivalent, and Clear Non-Reply, and (ii) evasion-level classification into nine fine-grained evasion strategies. The benchmark is constructed from U.S. presidential interviews and follows an expert-grounded taxonomy of response clarity and evasion. The task attracted 124 registered teams, who submitted 946 valid runs for clarity-level classification and 539 for evasion-level classification. Results show a substantial gap in difficulty between the two subtasks: the best system achieved 0.89 macro-F1 on clarity classification, surpassing the strongest baseline by a large margin, while the top evasion-level system reached 0.68 macro-F1, matching the best baseline. Overall, large language model prompting and hierarchical exploitation of the taxonomy emerged as the most effective strategies, with top systems consistently outperforming those that treated the two subtasks independently. CLARITY establishes political response evasion as a challenging benchmark for computational discourse analysis and highlights the difficulty of modeling strategic ambiguity in political language.

2603.10688 2026-05-25 cs.RO cs.CV

MapGCLR: Geospatial Contrastive Learning of Representations for Online Vectorized HD Map Construction

MapGCLR: 用于在线矢量化高清地图构建的地理空间对比学习表示

Jonas Merkert, Alexander Blumberg, Jan-Hendrik Pauls, Christoph Stiller

AI总结 本文提出了一种名为 MapGCLR 的方法,旨在提升在线矢量化高精地图构建中鸟瞰图(BEV)特征网格的表示能力。通过在对比损失函数中引入地理空间一致性约束,该方法增强了重叠区域特征的一致性,并结合多遍历数据集划分策略,实现了半监督学习框架。实验表明,该方法在矢量化地图感知任务和特征空间可视化方面均优于传统监督方法。

详情
AI中文摘要

自动驾驶汽车依赖地图信息来理解周围环境。然而,离线高清地图的创建和维护成本仍然很高。一种更具可扩展性的替代方案是在线高清地图构建,它仅在训练时需要地图标注。为了进一步减少标注大量训练标签的需求,自监督训练提供了一种替代方案。本文通过在地理空间上强制重叠的鸟瞰图特征网格之间的一致性作为对比损失函数的一部分,专注于改进矢量化在线高清地图构建模型中的潜在鸟瞰图特征网格表示。为了确保对比对的地理空间重叠,我们引入了一种方法来分析给定数据集中遍历之间的重叠,并根据可调整的多遍历要求生成子数据集划分。我们使用减少的单遍历标注数据对同一模型进行监督训练,并在更广泛的未标注数据集上根据我们的多遍历要求进行自监督训练,有效实现了半监督方法。我们的方法在各个方面都优于监督基线,无论是在下游任务矢量化地图感知性能的定量评估上,还是在鸟瞰图特征空间的主成分分析可视化的分割定性评估上。

英文摘要

Autonomous vehicles rely on map information to understand the world around them. However, the creation and maintenance of offline high-definition (HD) maps remains costly. A more scalable alternative lies in online HD map construction, which only requires map annotations at training time. To further reduce the need for annotating vast training labels, self-supervised training provides an alternative. This work focuses on improving the latent birds-eye-view (BEV) feature grid representation within a vectorized online HD map construction model by enforcing geospatial consistency between overlapping BEV feature grids as part of a contrastive loss function. To ensure geospatial overlap for contrastive pairs, we introduce an approach to analyze the overlap between traversals within a given dataset and generate subsidiary dataset splits following adjustable multi-traversal requirements. We train the same model supervised using a reduced set of single-traversal labeled data and self-supervised on a broader unlabeled set of data following our multi-traversal requirements, effectively implementing a semi-supervised approach. Our approach outperforms the supervised baseline across the board, both quantitatively in terms of the downstream tasks vectorized map perception performance and qualitatively in terms of segmentation in the principal component analysis (PCA) visualization of the BEV feature space.

2603.10067 2026-05-25 cs.LG cs.AI

HTMuon: Improving Muon via Heavy-Tailed Spectral Correction

HTMuon:通过重尾谱校正改进Muon

Tianyu Pang, Yujie Fang, Zihang Liu, Shenyang Deng, Lei Hsiung, Shuhua Yu, Yaoqing Yang

AI总结 本文提出 HTMuon,一种改进 Muon 优化算法的方法,旨在提升大语言模型的训练效果。研究指出,Muon 的正交化更新规则抑制了重尾权重谱的出现,并过度强调沿噪声主导方向的训练;HTMuon 基于重尾自正则化(HT-SR)理论,在保留 Muon 捕捉参数相互依赖关系能力的同时,产生更重尾的更新并诱导更重尾的权重谱。实验表明,HTMuon 在语言模型预训练和图像分类任务中均优于现有方法,且可作为现有 Muon 变体的插件使用。

详情
AI中文摘要

Muon最近在LLM训练中显示出有希望的结果。在这项工作中,我们研究如何进一步改进Muon。我们认为Muon的正交化更新规则抑制了重尾权重谱的出现,并过度强调了沿噪声主导方向的训练。受重尾自正则化(HT-SR)理论的启发,我们提出了HTMuon。HTMuon保留了Muon捕捉参数相互依赖性的能力,同时产生更重尾的更新并诱导更重尾的权重谱。在LLM预训练和图像分类上的实验表明,HTMuon持续优于最先进的基线,并且可以作为现有Muon变体的插件使用。例如,在C4数据集上的LLaMA预训练中,与Muon相比,HTMuon将困惑度降低了高达0.98。我们进一步从理论上证明,HTMuon对应于Schatten-$q$范数约束下的最速下降,并提供了在光滑非凸环境下的收敛性分析。HTMuon的实现可在https://github.com/TDCSZ327/HTmuon获取。

英文摘要

Muon has recently shown promising results in LLM training. In this work, we study how to further improve Muon. We argue that Muon's orthogonalized update rule suppresses the emergence of heavy-tailed weight spectra and over-emphasizes the training along noise-dominated directions. Motivated by the Heavy-Tailed Self-Regularization (HT-SR) theory, we propose HTMuon. HTMuon preserves Muon's ability to capture parameter interdependencies while producing heavier-tailed updates and inducing heavier-tailed weight spectra. Experiments on LLM pretraining and image classification show that HTMuon consistently improves performance over state-of-the-art baselines and can also serve as a plug-in on top of existing Muon variants. For example, on LLaMA pretraining on the C4 dataset, HTMuon reduces perplexity by up to $0.98$ compared to Muon. We further theoretically show that HTMuon corresponds to steepest descent under the Schatten-$q$ norm constraint and provide convergence analysis in smooth non-convex settings. The implementation of HTMuon is available at https://github.com/TDCSZ327/HTmuon.
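The abstract relates HTMuon to steepest descent under a Schatten-$q$ norm constraint. The sketch below shows only the generic linear-algebra fact behind that statement: the maximizer of the inner product with a gradient over the unit Schatten-$q$ ball reweights singular values by the power $p-1$, where $1/p + 1/q = 1$. It is not the HTMuon update rule, which the abstract does not spell out; the function name and the use of a full SVD are assumptions made for clarity.

```python
import numpy as np

def schatten_steepest_direction(grad: np.ndarray, p: float) -> np.ndarray:
    """Maximiser of <grad, X> over the unit Schatten-q ball, with 1/p + 1/q = 1.

    p = 1 (q = infinity, the spectral norm) yields U @ V^T, i.e. a fully
    orthogonalised direction; p > 1 lets larger singular values keep more
    weight. Generic Hoelder-duality result, not the authors' code.
    """
    u, s, vt = np.linalg.svd(grad, full_matrices=False)
    powered = s ** (p - 1.0)
    if p > 1.0:
        # Normalise so the result has unit Schatten-q norm
        powered = powered / np.linalg.norm(s, ord=p) ** (p - 1.0)
    return (u * powered) @ vt

rng = np.random.default_rng(0)
g = rng.normal(size=(4, 3))
d_orth = schatten_steepest_direction(g, p=1.0)   # all singular values equal 1
d_frob = schatten_steepest_direction(g, p=2.0)   # equals g / ||g||_F
print(np.linalg.svd(d_orth, compute_uv=False))
```

With `p=1` every singular value of the direction is flattened to one, which is the behaviour the abstract says suppresses heavy-tailed spectra; intermediate values of `p` interpolate between that and the normalized raw gradient.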

2603.06610 2026-05-25 cs.LG

CapTrack: Multifaceted Evaluation of Forgetting in LLM Post-Training

CapTrack: 大语言模型后训练中遗忘的多方面评估

Lukas Thede, Stefan Winzeck, Zeynep Akata, Jonathan Richard Schwarz

AI总结 本文提出CapTrack,一个以能力为中心的框架,用于评估大型语言模型在微调过程中产生的遗忘现象。不同于传统的参数或事实知识丢失视角,CapTrack从行为和能力退化角度定义遗忘,并结合行为分类和能力特异性指标构建评估体系。通过大规模实验分析多种微调方法、领域和模型家族,研究发现遗忘不仅影响参数知识,还显著影响模型的鲁棒性和默认行为,不同微调方法对能力退化的程度也存在差异。

详情
AI中文摘要

大语言模型(LLM)后训练增强了潜在技能,解锁了价值对齐,提升了性能,并实现了领域适应。不幸的是,后训练已知会引发遗忘,尤其是在利用第三方预训练模型的普遍用例中,这通常被理解为参数或事实知识的损失。我们认为这种以准确性为中心的观点对于现代基础模型是不够的,而是将遗忘定义为系统性的模型漂移,它会降低行为和用户体验。在此背景下,我们引入了CapTrack,一个以能力为中心的框架,用于分析LLM中的遗忘,该框架结合了行为分类法和以能力特定指标为中心的评估套件。利用CapTrack,我们跨后训练算法、领域和模型家族(包括高达80B参数的模型)进行了大规模实证研究。我们发现遗忘超出了参数知识,在鲁棒性和默认行为方面出现了显著的漂移。指令微调引发了最强的相对漂移,而偏好优化更为保守,并且可以部分恢复丢失的能力。不同模型家族之间的差异持续存在,没有出现通用的缓解方法。

英文摘要

Large language model (LLM) post-training enhances latent skills, unlocks value alignment, improves performance, and enables domain adaptation. Unfortunately, post-training is known to induce forgetting, especially in the ubiquitous use-case of leveraging third-party pre-trained models, which is typically understood as a loss of parametric or factual knowledge. We argue that this accuracy-centric view is insufficient for modern foundation models and instead define forgetting as systematic model drift that degrades behavior and user experience. In this context, we introduce CapTrack, a capability-centric framework for analyzing forgetting in LLMs that combines a behavioral taxonomy with an evaluation suite centered on capability-specific metrics. Using CapTrack, we conduct a large-scale empirical study across post-training algorithms, domains, and model families, including models up to 80B parameters. We find that forgetting extends beyond parametric knowledge, with pronounced drift in robustness and default behaviors. Instruction fine-tuning induces the strongest relative drift, while preference optimization is more conservative and can partially recover lost capabilities. Differences across model families persist, and no universal mitigation emerges.

2603.02897 2026-05-25 cs.CV

ProGIC: Progressive and Lightweight Generative Image Compression with Residual Vector Quantization

ProGIC:基于残差向量量化的渐进式轻量级生成图像压缩

Hao Cao, Chengbin Liang, Wenqi Guo, Zhijin Qin, Jungong Han

AI总结 本文提出了一种名为 ProGIC 的渐进式轻量级生成图像压缩方法,基于残差向量量化(RVQ)构建,能够在保证感知质量的同时实现更高效的压缩。该方法通过多阶段的残差编码生成渐进式比特流,支持部分数据预览,并结合轻量化的深度可分离卷积和小注意力模块,提升了在低算力设备上的部署能力。实验表明,ProGIC 在 Kodak 数据集上相比现有方法实现了显著的码率节省,并在编码解码速度上也有明显提升。

Comments Accepted by CVPR 2026 Findings

详情
AI中文摘要

生成图像压缩(GIC)的最新进展在感知质量上取得了显著提升。然而,许多GIC依赖于大规模且刚性的模型,严重限制了其在低比特率场景下灵活传输和实际部署的实用性。为解决这些问题,我们提出了渐进式生成图像压缩(ProGIC),一种基于残差向量量化(RVQ)的紧凑编解码器。在RVQ中,一系列向量量化器逐级编码残差,每个量化器拥有自己的码本。生成的码字累加实现从粗到细的重建和渐进比特流,从而支持从部分数据预览。我们将其与基于深度可分离卷积和小型注意力模块的轻量级骨干网络配对,使得在GPU和仅CPU设备上均可实际部署。实验结果表明,ProGIC在压缩性能上与先前方法相当。在Kodak数据集上,与MS-ILLM相比,它在DISTS上节省高达57.57%的比特率,在LPIPS上节省58.83%。除了感知质量,ProGIC还支持渐进传输以提高灵活性,并且在GPU上编码解码速度比MS-ILLM快10倍以上。

英文摘要

Recent advances in generative image compression (GIC) have delivered remarkable improvements in perceptual quality. However, many GICs rely on large-scale and rigid models, which severely constrain their utility for flexible transmission and practical deployment in low-bitrate scenarios. To address these issues, we propose Progressive Generative Image Compression (ProGIC), a compact codec built on residual vector quantization (RVQ). In RVQ, a sequence of vector quantizers encodes the residuals stage by stage, each with its own codebook. The resulting codewords sum to a coarse-to-fine reconstruction and a progressive bitstream, enabling previews from partial data. We pair this with a lightweight backbone based on depthwise-separable convolutions and small attention blocks, enabling practical deployment on both GPUs and CPU-only devices. Experimental results show that ProGIC attains comparable compression performance compared with previous methods. It achieves bitrate savings of up to 57.57% on DISTS and 58.83% on LPIPS compared to MS-ILLM on the Kodak dataset. Beyond perceptual quality, ProGIC enables progressive transmission for flexibility, and also delivers over 10 times faster encoding and decoding compared with MS-ILLM on GPUs for efficiency.
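The residual vector quantization scheme described above (each stage quantizes the residual left by the previous stage, and the codewords sum to a coarse-to-fine reconstruction) can be sketched as follows. The codebooks here are random and each includes a zero codeword so that adding a stage can never increase the error; both choices are illustrative assumptions, not ProGIC's learned codebooks or architecture.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Return one codeword index per stage, quantising the running residual."""
    residual = x.copy()
    indices = []
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        indices.append(idx)
        residual = residual - cb[idx]
    return indices

def rvq_decode(indices, codebooks):
    """Sum the selected codewords; a prefix of indices gives a coarser preview."""
    out = np.zeros(codebooks[0].shape[1])
    for idx, cb in zip(indices, codebooks):
        out = out + cb[idx]
    return out

rng = np.random.default_rng(0)
dim, stages, size = 8, 4, 16
codebooks = []
for s in range(stages):
    cb = rng.normal(scale=0.5 ** s, size=(size, dim))  # finer codewords per stage
    cb[0] = 0.0   # hypothetical zero codeword: a stage may choose to add nothing
    codebooks.append(cb)

x = rng.normal(size=dim)
codes = rvq_encode(x, codebooks)
# Reconstruction error when only the first k stages have been received
errors = [float(np.linalg.norm(x - rvq_decode(codes[:k], codebooks)))
          for k in range(1, stages + 1)]
print(errors)
```

Decoding from `codes[:k]` corresponds to a preview from a partial bitstream: each additional index refines the reconstruction without requiring earlier stages to be resent.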

2603.02719 2026-05-25 cs.LG

An Empirical Analysis of Calibration and Selective Prediction in Multimodal Clinical Condition Classification

多模态临床状况分类中的校准与选择性预测的实证分析

L. Julián Lechuga López, Farah E. Shamout, Tim G. J. Rudner

AI总结 本研究针对多模态临床条件分类任务,实证分析了基于不确定性的选择性预测在可靠性方面的表现。研究发现,尽管模型在标准评估指标上表现良好,但选择性预测可能导致性能显著下降,其根本原因在于模型对不同类别存在严重的校准偏差,尤其在罕见临床条件下更为明显。研究强调了当前聚合评估指标可能掩盖这些问题,并指出在临床AI系统中需要引入校准感知的评估方法,以确保预测的安全性和鲁棒性。

Comments 40 pages, 14 figures, 16 tables. Accepted as a conference paper at AHLI Conference on Health, Inference, and Learning (CHIL) 2026

详情
AI中文摘要

随着人工智能系统向临床部署迈进,确保可靠的预测行为对于安全关键的决策任务至关重要。一种提议的安全保障是选择性预测,即模型可以将不确定的预测交由人类专家审查。在这项工作中,我们使用多模态ICU数据,实证评估了基于不确定性的选择性预测在多标签临床状况分类中的可靠性。在一系列最先进的单模态和多模态模型中,我们发现尽管标准评估指标表现强劲,但选择性预测可能会大幅降低性能。这种失败是由严重的类别依赖的误校准驱动的,即模型对正确预测赋予高不确定性,对错误预测赋予低不确定性,尤其是对于代表性不足的临床状况。我们的结果表明,常用的聚合指标可能掩盖这些效应,限制了它们评估该设置下选择性预测行为的能力。综合来看,我们的发现描述了多模态临床状况分类中选择性预测的任务特定失败模式,并强调了需要校准感知评估来为临床AI提供强有力的安全性和鲁棒性保证。

英文摘要

As artificial intelligence systems move toward clinical deployment, ensuring reliable prediction behavior is fundamental for safety-critical decision-making tasks. One proposed safeguard is selective prediction, where models can defer uncertain predictions to human experts for review. In this work, we empirically evaluate the reliability of uncertainty-based selective prediction in multilabel clinical condition classification using multimodal ICU data. Across a range of state-of-the-art unimodal and multimodal models, we find that selective prediction can substantially degrade performance despite strong standard evaluation metrics. This failure is driven by severe class-dependent miscalibration, whereby models assign high uncertainty to correct predictions and low uncertainty to incorrect ones, particularly for underrepresented clinical conditions. Our results show that commonly used aggregate metrics can obscure these effects, limiting their ability to assess selective prediction behavior in this setting. Taken together, our findings characterize a task-specific failure mode of selective prediction in multimodal clinical condition classification and highlight the need for calibration-aware evaluation to provide strong guarantees of safety and robustness in clinical AI.
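To make the failure mode concrete, here is a small sketch of uncertainty-based selective prediction: the model keeps only the fraction of predictions with the lowest uncertainty and defers the rest. The arrays are made-up toy values, not data from the paper, and `selective_accuracy` is a hypothetical helper name.

```python
import numpy as np

def selective_accuracy(correct, uncertainty, coverage):
    """Accuracy on the `coverage` fraction of samples with lowest uncertainty."""
    correct = np.asarray(correct, dtype=float)
    order = np.argsort(uncertainty, kind="stable")   # most confident first
    keep = max(1, int(round(coverage * len(correct))))
    return float(correct[order[:keep]].mean())

# 1 = prediction was correct, 0 = prediction was wrong (toy values)
correct = np.array([1, 1, 1, 1, 0, 0, 1, 1])
# Uncertainty that tracks errors: wrong predictions get high uncertainty
well_calibrated = np.array([0.1, 0.2, 0.1, 0.3, 0.9, 0.8, 0.2, 0.3])
# Miscalibrated: wrong predictions get the LOWEST uncertainty
miscalibrated = np.array([0.9, 0.8, 0.7, 0.9, 0.1, 0.2, 0.3, 0.4])

full = selective_accuracy(correct, well_calibrated, coverage=1.0)
good = selective_accuracy(correct, well_calibrated, coverage=0.5)
bad = selective_accuracy(correct, miscalibrated, coverage=0.5)
print(full, good, bad)
```

In this toy case, deferring half of the samples raises accuracy when uncertainty tracks errors but lowers it below the full-coverage accuracy when the model is confidently wrong, mirroring the degradation the abstract reports for miscalibrated classes.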

2603.01655 2026-05-25 cs.LG eess.SP

Transform-Invariant Generative Ray Path Sampling for Efficient Radio Propagation Modeling

变换不变生成射线路径采样用于高效无线电传播建模

Jérome Eertmans, Enrico M. Vitucci, Vittorio Degli-Esposti, Nicola Di Cicco, Laurent Jacques, Claude Oestges

AI总结 本文提出了一种基于生成流网络的智能采样框架,用于高效建模无线电波传播路径,以解决传统射线追踪方法计算复杂度过高的问题。该方法通过引入经验回放缓冲区、均匀探索策略和基于物理的动作掩码,提升了模型在复杂环境中的学习鲁棒性和路径探索效率。实验表明,该方法在保持高覆盖精度的同时,相比穷举搜索在GPU和CPU上分别实现了最高10倍和100倍的加速,但在真实城市环境中仍需进一步提升模型泛化能力。

Comments submitted to npj Wireless Technology, 30 pages, 16 figures

详情
AI中文摘要

射线追踪已成为精确无线电传播建模的标准方法,但其计算复杂度呈指数增长,因为候选路径数量按物体数量的交互阶数次幂增长。这一瓶颈限制了其在大规模或实时应用中的使用,迫使传统工具依赖启发式方法减少候选路径,代价是可能降低精度。为克服这一限制,我们提出了一种机器学习辅助框架,通过生成流网络进行智能采样,取代穷举路径搜索。将这些生成模型应用于该领域面临挑战,特别是有效路径稀少导致的稀疏奖励,这可能导致在复杂环境中评估高阶交互时收敛失败和平凡解。为确保鲁棒学习和高效探索,我们的框架包含三个关键组件。首先,经验回放缓冲区捕获并保留稀有的有效路径。其次,均匀探索策略提高了泛化能力,防止过拟合简单几何形状。第三,基于物理的动作掩蔽策略在模型考虑之前过滤掉物理上不可能的路径。在理想化街道峡谷场景上的验证表明,我们的模型相比穷举搜索实现了显著加速(GPU上最高10倍,CPU上最高100倍),同时保持高覆盖精度并成功发现复杂传播路径。然而,在真实曼哈顿街道几何形状上的分布外评估显示,泛化到显著不同的城市形态需要进一步提升模型容量或采用替代训练策略。源代码、测试和教程见https://github.com/jeertmans/sampling-paths。

英文摘要

Ray tracing has become a standard for accurate radio propagation modeling, but suffers from exponential computational complexity, as the number of candidate paths scales with the number of objects raised to the interaction order. This bottleneck limits its use in large-scale or real-time applications, forcing traditional tools to rely on heuristics that reduce path candidates at the cost of potentially reduced accuracy. To overcome this limitation, we propose a machine-learning-assisted framework that replaces exhaustive path searching with intelligent sampling via Generative Flow Networks. Applying these generative models to this domain presents challenges, particularly sparse rewards due to the rarity of valid paths, which can lead to convergence failures and trivial solutions when evaluating high-order interactions in complex environments. To ensure robust learning and efficient exploration, our framework incorporates three key components. First, an experience replay buffer captures and retains rare valid paths. Second, a uniform exploratory policy improves generalization and prevents overfitting to simple geometries. Third, a physics-based action masking strategy filters out physically impossible paths before the model considers them. Validated on idealized street-canyon scenarios, our model achieves substantial speedups over exhaustive search (up to 10× faster on GPU and 100× faster on CPU) while maintaining high coverage accuracy and successfully uncovering complex propagation paths. However, out-of-distribution evaluations on real-world Manhattan street geometries reveal that generalizing to substantially different urban morphologies requires further advancement in model capacity or alternative training strategies. Source code, tests, and a tutorial are available at https://github.com/jeertmans/sampling-paths.
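The scaling claim above (candidate paths grow as the number of objects raised to the interaction order) and the effect of action masking can be illustrated with a brute-force enumeration. The masking rule used here, forbidding two consecutive interactions with the same object, is an assumed example of a physically impossible path; the abstract does not list the rules the authors actually apply, and `candidate_paths` is a hypothetical helper.

```python
from itertools import product

def candidate_paths(num_objects, order, mask_repeats=False):
    """Enumerate ordered sequences of `order` interacting objects.

    With mask_repeats=True, sequences that hit the same object twice in a
    row are dropped (illustrative masking rule, assumed for this sketch).
    """
    paths = []
    for seq in product(range(num_objects), repeat=order):
        if mask_repeats and any(a == b for a, b in zip(seq, seq[1:])):
            continue
        paths.append(seq)
    return paths

n_objects, order = 5, 3
exhaustive = candidate_paths(n_objects, order)
masked = candidate_paths(n_objects, order, mask_repeats=True)
print(len(exhaustive))  # 5 ** 3 = 125
print(len(masked))      # 5 * 4 * 4 = 80
```

Even this simple mask removes over a third of the candidates at order three, and the exhaustive count still grows exponentially with the order, which is why a learned sampler is attractive for high-order interactions.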

2602.19174 2026-05-25 cs.CL

TurkicNLP: An NLP Toolkit for Turkic Languages

TurkicNLP:突厥语言的自然语言处理工具包

Sherzod Hakimov

AI总结 本文介绍了TurkicNLP,一个面向突厥语系的开源自然语言处理工具包,旨在解决该语系语言处理工具和资源分散的问题。该工具包支持四种文字体系,提供统一的NLP流程,包括分词、形态分析、词性标注、依存句法分析等功能,并采用模块化架构整合基于规则的模型和神经模型,支持自动文字检测及文字变体之间的路由。其输出遵循CoNLL-U标准,便于与其他系统兼容与扩展。

Comments The toolkit is available here: https://github.com/turkic-nlp/turkicnlp

详情
AI中文摘要

突厥语族由欧亚大陆超过2亿人使用,其自然语言处理仍然碎片化,大多数语言缺乏统一的工具和资源。我们提出TurkicNLP,一个开源的Python库,为四种文字体系(拉丁、西里尔、波斯-阿拉伯和古突厥如尼文)的突厥语言提供单一、一致的NLP流水线。该库通过一个语言无关的API覆盖分词、形态分析、词性标注、依存句法分析、命名实体识别、双向文字转写、跨语言句子嵌入和机器翻译。模块化多后端架构透明地集成了基于规则的有限状态转换器和神经模型,并具备自动文字检测和文字变体路由功能。输出遵循CoNLL-U标准,以实现完全互操作性和扩展性。代码和文档托管于https://github.com/turkic-nlp/turkicnlp。

英文摘要

Natural language processing for the Turkic language family, spoken by over 200 million people across Eurasia, remains fragmented, with most languages lacking unified tooling and resources. We present TurkicNLP, an open-source Python library providing a single, consistent NLP pipeline for Turkic languages across four script families: Latin, Cyrillic, Perso-Arabic, and Old Turkic Runic. The library covers tokenization, morphological analysis, part-of-speech tagging, dependency parsing, named entity recognition, bidirectional script transliteration, cross-lingual sentence embeddings, and machine translation through one language-agnostic API. A modular multi-backend architecture integrates rule-based finite-state transducers and neural models transparently, with automatic script detection and routing between script variants. Outputs follow the CoNLL-U standard for full interoperability and extension. Code and documentation are hosted at https://github.com/turkic-nlp/turkicnlp .
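Because the toolkit's outputs follow the CoNLL-U standard, downstream code can consume them with a plain reader for the ten tab-separated columns. The sketch below parses a hand-written Turkish sample; the sample annotation is illustrative and not produced by TurkicNLP, and no TurkicNLP API is used or assumed.

```python
# The ten CoNLL-U columns, in order
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

def parse_conllu(text):
    """Parse CoNLL-U text into a list of sentences (lists of token dicts)."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():          # blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):      # comment / metadata line
            continue
        current.append(dict(zip(FIELDS, line.split("\t"))))
    if current:
        sentences.append(current)
    return sentences

# Hand-written sample (hypothetical annotation, for illustration only)
rows = [
    ["1", "Kitap", "kitap", "NOUN", "_", "Case=Nom|Number=Sing", "2", "obj", "_", "_"],
    ["2", "okudum", "oku", "VERB", "_", "Number=Sing|Person=1|Tense=Past", "0", "root", "_", "SpaceAfter=No"],
    ["3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"],
]
sample = "# text = Kitap okudum.\n" + "\n".join("\t".join(r) for r in rows) + "\n\n"

sentences = parse_conllu(sample)
for token in sentences[0]:
    print(token["form"], token["upos"], token["head"], token["deprel"])
```

This minimal reader ignores multiword-token ranges and empty nodes; a full CoNLL-U consumer would need to handle IDs such as `1-2` and `1.1` as well.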

2602.18788 2026-05-25 cs.CL

BURMESE-SAN: Burmese NLP Benchmark for Evaluating Large Language Models

BURMESE-SAN: 评估大语言模型的缅甸语NLP基准

Thura Aung, Jann Railey Montalan, Jian Gang Ngui, Peerat Limkonchotiwat

AI总结 本文介绍了BURMESE-SAN,这是首个系统评估大型语言模型在缅甸语自然语言理解、推理和生成能力的综合性基准。该基准包含七项子任务,涵盖问答、情感分析、因果推理等多个领域,并通过严格的母语者参与流程构建,确保语言自然性和文化真实性。研究发现,缅甸语模型性能更依赖于架构设计、语言表示和指令微调,而非模型规模,并指出区域微调和新一代模型能显著提升效果。

详情
AI中文摘要

我们引入了BURMESE-SAN,这是第一个系统性评估大语言模型(LLM)在缅甸语上三种核心NLP能力:理解(NLU)、推理(NLR)和生成(NLG)的全面基准。BURMESE-SAN整合了涵盖这些能力的七个子任务,包括问答、情感分析、毒性检测、因果推理、自然语言推理、抽象摘要和机器翻译,其中多个任务此前在缅甸语中不可用。该基准通过严格的母语者驱动流程构建,以确保语言自然性、流畅性和文化真实性,同时最小化翻译引起的伪影。我们对开源和商业LLM进行了大规模评估,以考察缅甸语建模中因预训练覆盖有限、丰富形态和句法变异带来的挑战。我们的结果表明,缅甸语性能更多地依赖于架构设计、语言表示和指令微调,而非仅模型规模。特别是,东南亚区域微调和更新的模型世代带来了显著提升。最后,我们发布BURMESE-SAN作为公共排行榜,以支持缅甸语及其他低资源语言的系统评估和持续进步。https://leaderboard.sea-lion.ai/detailed/MY

英文摘要

We introduce BURMESE-SAN, the first holistic benchmark that systematically evaluates large language models (LLMs) for Burmese across three core NLP competencies: understanding (NLU), reasoning (NLR), and generation (NLG). BURMESE-SAN consolidates seven subtasks spanning these competencies, including Question Answering, Sentiment Analysis, Toxicity Detection, Causal Reasoning, Natural Language Inference, Abstractive Summarization, and Machine Translation, several of which were previously unavailable for Burmese. The benchmark is constructed through a rigorous native-speaker-driven process to ensure linguistic naturalness, fluency, and cultural authenticity while minimizing translation-induced artifacts. We conduct a large-scale evaluation of both open-weight and commercial LLMs to examine challenges in Burmese modeling arising from limited pretraining coverage, rich morphology, and syntactic variation. Our results show that Burmese performance depends more on architectural design, language representation, and instruction tuning than on model scale alone. In particular, Southeast Asia regional fine-tuning and newer model generations yield substantial gains. Finally, we release BURMESE-SAN as a public leaderboard to support systematic evaluation and sustained progress in Burmese and other low-resource languages. https://leaderboard.sea-lion.ai/detailed/MY

2602.18176 2026-05-25 cs.CL

Improving Sampling for Masked Diffusion Models via Information Gain

通过信息增益改进掩码扩散模型的采样

Kaisen Yang, Jayden Teoh, Kaicheng Yang, Yitong Zhang, Alex Lamb

AI总结 该论文研究了如何改进掩码扩散模型(MDMs)的采样过程,指出现有采样方法过于贪心,仅关注局部确定性而忽视了后续影响,导致生成结果不确定性增加。为此,作者提出了一种无需训练的解码方法——信息增益采样器(Info-Gain Sampler),通过利用MDMs的双向结构,在当前不确定性和剩余位置的信息增益之间取得平衡。实验表明,该方法在推理、编码、创意写作和图像生成等任务中均优于现有方法,显著提升了生成质量。

Comments https://github.com/yks23/Information-Gain-Sampler Accepted by ICML2026

详情
AI中文摘要

掩码扩散模型(MDMs)支持灵活的解码顺序,但现有采样器大多是贪婪的,仅选择局部确定的token而不考虑其下游影响。我们表明这种短视行为会增加累积不确定性并导致次优生成。为解决此问题,我们提出**Info-Gain采样器**,一种无需训练的解码方法,利用MDMs的双向结构平衡即时不确定性与剩余掩码位置获得的信息增益。在推理、编码、创意写作和图像生成任务中,Info-Gain采样器持续优于现有MDM采样器,平均推理准确率提升2.9--11.6个百分点,创意写作平均胜率达到62.8%。代码可在https://github.com/yks23/Information-Gain-Sampler获取。

英文摘要

Masked Diffusion Models (MDMs) enable flexible decoding orders, yet existing samplers remain largely greedy, selecting locally certain tokens without accounting for their downstream effects. We show that this myopia can increase cumulative uncertainty and lead to suboptimal generation. To address this, we propose the **Info-Gain Sampler**, a training-free decoding method that uses the bidirectional structure of MDMs to balance immediate uncertainty with the information gained over remaining masked positions. Across reasoning, coding, creative writing, and image generation tasks, Info-Gain Sampler consistently outperforms existing MDM samplers, improving average reasoning accuracy by 2.9--11.6 percentage points and achieving a 62.8% average win rate in creative writing. The code is available at https://github.com/yks23/Information-Gain-Sampler.

2602.17653 2026-05-25 cs.CL

Differences in Typological Alignment in Language Models' Treatment of Differential Argument Marking

语言模型处理差异论元标记中的类型学对齐差异

Iskar Deng, Nathalia Xu, Shane Steinert-Threlkeld

AI总结 本文研究语言模型在处理差异论元标记(DAM)时表现出的类型学对齐差异。通过在18个实现不同DAM系统的合成语料库上训练GPT-2模型,并使用最小对进行评估,研究发现模型在DAM的自然标记方向上表现出与人类语言相似的偏好,即更倾向于对语义不典型的论元进行显性标记,但未能复现人类语言中常见的宾语偏好(即显性标记更常针对宾语而非主语)。这一结果表明,不同类型学倾向可能源于不同的底层机制。

Comments 16 pages, 8 figures, 7 tables. To appear at CoNLL 2026

详情
AI中文摘要

近期研究表明,在合成语料上训练的语言模型可以展现出类似人类语言跨语言规律的类型学偏好,特别是对于语序等句法现象。本文将此范式扩展到差异论元标记(DAM),一种形态标记取决于语义显著性的语义许可系统。使用受控合成学习方法,我们在18个实现不同DAM系统的语料上训练GPT-2模型,并通过最小对评估其泛化能力。结果揭示了DAM的两个类型学维度之间的分离。模型可靠地展现出对自然标记方向的人类偏好,倾向于那些显性标记针对语义非典型论元的系统。相比之下,模型并未复现人类语言中强烈的宾语偏好,即在DAM中显性标记更常针对宾语而非主语。这些发现表明,不同的类型学倾向可能源于不同的潜在来源。

英文摘要

Recent work has shown that language models (LMs) trained on synthetic corpora can exhibit typological preferences that resemble cross-linguistic regularities in human languages, particularly for syntactic phenomena such as word order. In this paper, we extend this paradigm to differential argument marking (DAM), a semantic licensing system in which morphological marking depends on semantic prominence. Using a controlled synthetic learning method, we train GPT-2 models on 18 corpora implementing distinct DAM systems and evaluate their generalization using minimal pairs. Our results reveal a dissociation between two typological dimensions of DAM. Models reliably exhibit human-like preferences for natural markedness direction, favoring systems in which overt marking targets semantically atypical arguments. In contrast, models do not reproduce the strong object preference in human languages, in which overt marking in DAM more often targets objects rather than subjects. These findings suggest that different typological tendencies may arise from distinct underlying sources.

2602.15258 2026-05-25 cs.RO

SEG-JPEG: Simple Visual Semantic Communications for Remote Operation of Automated Vehicles over Unreliable Wireless Networks

SEG-JPEG: 用于在不可靠无线网络上远程操作自动驾驶车辆的简单视觉语义通信

Sebastian Donnelly, Ruth Anderson, George Economides, James Broughton, Peter Ball, Alexander Rast, Andrew Bradley

AI总结 本文研究了在不可靠无线网络环境下,如何通过视觉语义通信技术实现对自动驾驶车辆的远程操控。提出了一种名为SEG-JPEG的方法,通过在低分辨率灰度图像中用彩色高亮编码检测到的道路使用者分割信息,将所需数据率降低50%,同时保持视觉清晰度。实验表明,该方法能够在低带宽网络下实现低于200毫秒的端到端延迟,提升远程操作员的环境感知能力,为自动驾驶车辆的大规模远程部署提供了可行方案。

Comments 7 pages, 9 figures. Under minor revision for CSNDSP 2026

详情
AI中文摘要

远程操作被认为是快速部署自动驾驶车辆的关键。目前,将图像流传输到远程控制连接车辆需要可靠、高吞吐量的网络连接,而在依赖公共网络基础设施的实际远程操作部署中,这种连接可能受到限制。本文研究了如何应用计算机视觉辅助的语义通信来规避与传统图像压缩技术相关的数据丢失和损坏。通过将检测到的道路用户的分割编码为低分辨率灰度图像中的彩色高亮,与传统技术相比,所需数据速率可降低50%,同时保持视觉清晰度。这使得即使网络数据速率低于500 kbit/s,中位玻璃到玻璃延迟也能低于200 ms,同时清晰勾勒出显著的道路用户,以增强远程操作员的情境意识。该方法在4G移动连接变化的区域使用自动最后一英里配送车辆进行了演示。结果表明,即使在通常受限的公共4G/5G移动网络上,也有可能大规模部署远程操作的自动驾驶车辆,从而有可能加速自动驾驶车辆在全国范围内的推广。

英文摘要

Remote Operation is touted as being key to the rapid deployment of automated vehicles. Streaming imagery to control connected vehicles remotely currently requires a reliable, high throughput network connection, which can be limited in real-world remote operation deployments relying on public network infrastructure. This paper investigates how the application of computer vision assisted semantic communication can be used to circumvent data loss and corruption associated with traditional image compression techniques. By encoding the segmentations of detected road users into colour coded highlights within low resolution greyscale imagery, the required data rate can be reduced by 50% compared with conventional techniques, while maintaining visual clarity. This enables a median glass-to-glass latency of below 200 ms even when the network data rate is below 500 kbit/s, while clearly outlining salient road users to enhance situational awareness of the remote operator. The approach is demonstrated in an area of variable 4G mobile connectivity using an automated last-mile delivery vehicle. Results indicate that large-scale deployment of remotely operated automated vehicles could be possible even on the often constrained public 4G/5G mobile network, providing the potential to expedite the nationwide roll-out of automated vehicles.

2602.13985 2026-05-25 cs.AI

Bridging AI and Clinical Reasoning: Abductive Explanations for Alignment on Critical Symptoms

弥合AI与临床推理:针对关键症状对齐的溯因解释

Belona Sonna, Alban Grastien

AI总结 该研究旨在解决人工智能在临床诊断中与结构化临床推理不一致的问题,提出利用形式化溯因解释方法,以确保AI决策基于关键症状进行合理推理。通过识别最小充分特征集,该方法不仅提升了AI解释的透明度和可信度,还实现了与临床思维的对齐,为构建可信赖的医疗诊断AI系统提供了有效框架。

Comments Algorithm 1 is not entirely correct, which may also affect the results. We are rerunning the experiments and will upload a new version as soon as possible.

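Details
AI abstract (translated from Chinese)

AI has shown strong potential in clinical diagnosis, with accuracy that often matches or exceeds that of human experts. A key challenge, however, is that AI reasoning often deviates from structured clinical frameworks, which limits trust, interpretability, and adoption. Even when predictions are correct, AI models may overlook critical symptoms that are essential for rapid and accurate decision-making. Existing post hoc explanation methods offer limited transparency and lack formal guarantees. To address this, we use formal abductive explanations, which provide consistent, guaranteed reasoning over minimal sufficient feature sets. This gives a clear understanding of AI decisions and aligns them with clinical reasoning. Our approach preserves predictive accuracy while providing clinically actionable insights, establishing a robust framework for trustworthy AI in medical diagnosis.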

English abstract

Artificial intelligence (AI) has demonstrated strong potential in clinical diagnostics, often achieving accuracy comparable to or exceeding that of human experts. A key challenge, however, is that AI reasoning frequently diverges from structured clinical frameworks, limiting trust, interpretability, and adoption. Critical symptoms, pivotal for rapid and accurate decision-making, may be overlooked by AI models even when predictions are correct. Existing post hoc explanation methods provide limited transparency and lack formal guarantees. To address this, we leverage formal abductive explanations, which offer consistent, guaranteed reasoning over minimal sufficient feature sets. This enables a clear understanding of AI decision-making and allows alignment with clinical reasoning. Our approach preserves predictive accuracy while providing clinically actionable insights, establishing a robust framework for trustworthy AI in medical diagnosis.
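To make "minimal sufficient feature sets" concrete, the sketch below computes a subset-minimal abductive explanation for a toy classifier using the standard deletion-based procedure with a brute-force sufficiency check. It is not the paper's Algorithm 1 (which the authors note is under revision); the classifier `toy_model`, the feature names, and the alignment check against a "critical" feature are hypothetical.

```python
from itertools import product


def is_sufficient(model, instance, fixed, domains):
    """True if fixing the features in `fixed` to their values in `instance`
    forces the model's prediction for every completion of the free features."""
    target = model(instance)
    free = [i for i in range(len(instance)) if i not in fixed]
    for values in product(*(domains[i] for i in free)):
        candidate = list(instance)
        for i, v in zip(free, values):
            candidate[i] = v
        if model(candidate) != target:
            return False
    return True


def abductive_explanation(model, instance, domains):
    """Deletion-based search: drop each feature in turn and keep the drop
    whenever the remaining fixed features still guarantee the prediction.
    The result is subset-minimal, not necessarily smallest in size."""
    fixed = set(range(len(instance)))
    for i in range(len(instance)):
        trial = fixed - {i}
        if is_sufficient(model, instance, trial, domains):
            fixed = trial
    return sorted(fixed)


# Hypothetical binary features: 0 = fever, 1 = chest_pain, 2 = cough.
def toy_model(x):
    return int(x[1] and (x[0] or x[2]))


patient = [1, 1, 0]
binary_domains = [(0, 1)] * 3
explanation = abductive_explanation(toy_model, patient, binary_domains)
# Assumed notion of alignment: the critical symptom appears in the explanation.
aligned = 1 in explanation
```

Here the explanation is features 0 and 1 (fever and chest pain): once both are fixed, the prediction cannot change whatever the value of cough. The brute-force check is exponential in the number of free features, so real systems would typically replace it with a solver-based query.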

2602.13473 2026-05-25 cs.AI

NeuroWeaver: An Autonomous Evolutionary Agent for Exploring the Programmatic Space of EEG Analysis Pipelines

NeuroWeaver: An Autonomous Evolutionary Agent for Exploring the Program Space of EEG Analysis Pipelines

Guoan Wang, Shihao Yang, Jun-En Ding, Feng Liu

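AI summary This paper proposes NeuroWeaver, an autonomous evolutionary agent for exploring the programmatic space of EEG analysis pipelines. By recasting pipeline design as a discrete constrained optimization problem and combining domain-knowledge-guided initialization with multi-objective evolutionary optimization, the method balances performance, novelty, and efficiency. Experiments show that NeuroWeaver produces lightweight, efficient solutions with fewer parameters that outperform existing task-specific methods and are comparable to large-scale foundation models.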

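Details
AI abstract (translated from Chinese)

Although foundation models have achieved remarkable success in general domains, their application to electroencephalography (EEG) analysis is constrained by substantial data requirements and high parameterization. These factors lead to prohibitive computational costs, which impede deployment in resource-constrained clinical environments. Conversely, general-purpose automated machine learning frameworks are often ill-suited to this domain, because exploration in an unbounded programmatic space fails to incorporate essential neurophysiological priors and frequently yields solutions that lack scientific plausibility. To address these limitations, we propose NeuroWeaver, a unified autonomous evolutionary agent that aims to generalize across diverse EEG datasets and tasks by reformulating pipeline engineering as a discrete constrained optimization problem. Specifically, we use domain-informed subspace initialization to confine the search to neuroscientifically plausible manifolds, combined with multi-objective evolutionary optimization that dynamically balances performance, novelty, and efficiency through self-reflective refinement. Empirical evaluation on five heterogeneous benchmarks shows that, despite using significantly fewer parameters, the lightweight solutions synthesized by NeuroWeaver consistently outperform state-of-the-art task-specific methods and achieve performance comparable to large-scale foundation models.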

English abstract

Although foundation models have demonstrated remarkable success in general domains, the application of these models to electroencephalography (EEG) analysis is constrained by substantial data requirements and high parameterization. These factors incur prohibitive computational costs, thereby impeding deployment in resource-constrained clinical environments. Conversely, general-purpose automated machine learning frameworks are often ill-suited for this domain, as exploration within an unbounded programmatic space fails to incorporate essential neurophysiological priors and frequently yields solutions that lack scientific plausibility. To address these limitations, we propose NeuroWeaver, a unified autonomous evolutionary agent designed to generalize across diverse EEG datasets and tasks by reformulating pipeline engineering as a discrete constrained optimization problem. Specifically, we employ a Domain-Informed Subspace Initialization to confine the search to neuroscientifically plausible manifolds, coupled with a Multi-Objective Evolutionary Optimization that dynamically balances performance, novelty, and efficiency via self-reflective refinement. Empirical evaluations across five heterogeneous benchmarks demonstrate that NeuroWeaver synthesizes lightweight solutions that consistently outperform state-of-the-art task-specific methods and achieve performance comparable to large-scale foundation models, despite utilizing significantly fewer parameters.
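One common way to balance several objectives in evolutionary search is to keep only the non-dominated (Pareto-optimal) candidates in each generation. The sketch below shows that selection step for the three objectives named in the abstract. Whether NeuroWeaver uses Pareto selection is not stated here, so this is an assumed stand-in; the pipeline names and scores are made up, and all objectives are treated as higher-is-better.

```python
def dominates(a, b):
    """a dominates b if it is no worse on every objective and strictly
    better on at least one (all objectives are maximised)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates):
    """candidates: dict mapping a pipeline name to a tuple of
    (performance, novelty, efficiency). Returns the non-dominated names."""
    return sorted(
        name
        for name, score in candidates.items()
        if not any(
            dominates(other_score, score)
            for other, other_score in candidates.items()
            if other != name
        )
    )


# Hypothetical candidate pipelines with made-up objective scores.
population = {
    "bandpass_csp_lda": (0.90, 0.10, 0.50),
    "wavelet_small_cnn": (0.80, 0.30, 0.60),
    "raw_big_transformer": (0.70, 0.20, 0.40),
}
survivors = pareto_front(population)
```

In this toy population the third pipeline is worse than the second on every objective and is dropped, while the first two survive because each wins on at least one objective.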

2602.12579 2026-05-25 cs.LG cs.AI

VI-CuRL: Stabilizing Verifier-Independent RL Reasoning via Confidence-Guided Variance Reduction

VI-CuRL: Stabilizing Verifier-Independent Reinforcement Learning Reasoning via Confidence-Guided Variance Reduction

Xin-Qiang Cai, Masashi Sugiyama

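AI summary This paper proposes VI-CuRL, a verifier-independent reinforcement learning framework that addresses the scalability problem caused by the reliance of existing Reinforcement Learning with Verifiable Rewards (RLVR) on external verifiers. The method uses the model's own confidence to build a curriculum independent of external verifiers, controlling gradient variance and improving training stability. Theoretical analysis proves that the estimator is asymptotically unbiased, and experiments show that it outperforms a range of verifier-dependent and verifier-independent baselines on math and general reasoning tasks.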

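Details
AI abstract (translated from Chinese)

Reinforcement Learning with Verifiable Rewards (RLVR) has become the dominant paradigm for enhancing the reasoning of large language models (LLMs), but its reliance on external verifiers limits scalability. Recent work suggests that RLVR functions mainly by eliciting latent capabilities, which has motivated the development of verifier-free algorithms. In such settings, however, standard methods such as Group Relative Policy Optimization face a critical challenge: destructive gradient variance that often leads to training collapse. To address this, we introduce Verifier-Independent Curriculum Reinforcement Learning (VI-CuRL), a framework that uses the model's intrinsic confidence to construct a curriculum independent of external verifiers. By prioritizing high-confidence samples, VI-CuRL effectively manages the bias-variance trade-off, specifically targeting the reduction of action and problem variance. We provide a rigorous theoretical analysis proving that our estimator is guaranteed to be asymptotically unbiased. Empirically, VI-CuRL promotes stability and consistently outperforms both verifier-dependent and verifier-independent baselines on math and general reasoning benchmarks, with and without verifiers.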

English abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a dominant paradigm for enhancing the reasoning of Large Language Models (LLMs), yet its reliance on external verifiers limits its scalability. Recent findings suggest that RLVR primarily functions by eliciting latent capabilities, motivating the development of verifier-free algorithms. However, in such settings, standard methods like Group Relative Policy Optimization face a critical challenge: destructive gradient variance that often leads to training collapse. To address this issue, we introduce Verifier-Independent Curriculum Reinforcement Learning (VI-CuRL), a framework that leverages the model's intrinsic confidence to construct a curriculum independent of external verifiers. By prioritizing high-confidence samples, VI-CuRL effectively manages the bias-variance trade-off, specifically targeting the reduction of action and problem variance. We provide a rigorous theoretical analysis, proving that our estimator guarantees asymptotic unbiasedness. Empirically, VI-CuRL promotes stability and consistently outperforms both verifier-dependent and verifier-independent baselines across math and general reasoning benchmarks, with and without verifiers.
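The idea of prioritizing high-confidence samples can be sketched as a filter that keeps only the most confident fraction of a batch and widens that fraction as training proceeds. The abstract does not say how confidence is measured or how the curriculum is scheduled, so the choices below are assumptions: confidence is the length-normalised sequence likelihood, the retained fraction grows linearly from `start` to 1.0, and every function name is hypothetical.

```python
import math


def sequence_confidence(token_logprobs):
    """Length-normalised likelihood: exp(mean token log-probability)."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def keep_fraction(step, total_steps, start=0.25):
    """Linearly anneal the retained fraction from `start` to 1.0
    (an assumed schedule, not taken from the paper)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + (1.0 - start) * progress


def select_batch(samples, step, total_steps):
    """samples: list of (sample_id, token_logprobs).
    Returns the ids that are kept, most confident first."""
    ranked = sorted(samples, key=lambda s: sequence_confidence(s[1]), reverse=True)
    k = max(1, math.ceil(keep_fraction(step, total_steps) * len(ranked)))
    return [sample_id for sample_id, _ in ranked[:k]]


# Made-up rollouts with per-token log-probabilities.
rollouts = [
    ("a", [-0.1, -0.2]),
    ("b", [-2.0, -1.0]),
    ("c", [-0.5]),
    ("d", [-3.0]),
]
early = select_batch(rollouts, step=0, total_steps=10)
late = select_batch(rollouts, step=10, total_steps=10)
```

Early in training only the most confident rollout contributes to the update, and by the final step every rollout is used, which matches the intuition that a restricted early curriculum trades some bias for lower variance and that the bias vanishes as the filter opens up.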