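arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and related fields.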

2510.18383 2026-06-19 cs.CL cs.AI (updated version)

MENTOR: Reinforcement Learning via Flexible Teacher-Optimized Rewards for Tool-Use Distillation

MENTOR: 通过灵活的教师优化奖励进行工具使用蒸馏的强化学习

ChangSu Choi, Hoyun Song, Dongyeon Kim, WooHyeon Jung, Minkyung Cho, Sunjin Park, NohHyeob Bae, Seona Yu, KyungTae Lim

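Affiliations: Seoul National University of Science and Technology; Korea Advanced Institute of Science and Technology; LG CNS

AI summary: Proposes MENTOR, which uses a flexible, teacher-optimized reward structure to balance behavioral alignment with downstream performance and improve the out-of-domain generalization of small models on tool-use tasks.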

AI-generated Chinese abstract

将大型语言模型（LLMs）的工具使用能力蒸馏到小型语言模型（SLMs）中对其实际应用至关重要。主要方法监督微调（SFT）由于与静态教师轨迹的刚性对齐，导致域外（OOD）泛化性能较差。虽然强化学习（RL）提供了一种替代方案，但SLMs的能力限制带来了严峻的困境：稀疏的结果奖励提供的指导不足，而严格的轨迹匹配施加了过于严格的约束。为了弥合这一能力驱动的差距，我们提出了MENTOR，它引入了一种灵活且过程感知的奖励结构。MENTOR不强制执行刚性复制，而是利用教师的参考来指导工具使用行为，平衡行为对齐与下游性能。在可控可执行工具基准上的大量实验表明，与SFT和严格RL基线相比，MENTOR提高了OOD工具使用性能。我们的研究结果表明，在可验证的工具使用环境中，灵活的工具使用对齐比严格的轨迹复制为开发适应性小模型提供了更有效的方法。

English abstract

Distilling the tool-use capabilities of large language models (LLMs) into small language models (SLMs) is essential for their practical application. The predominant approach, supervised fine-tuning (SFT), suffers from poor out-of-domain (OOD) generalization due to its rigid alignment with static teacher trajectories. While reinforcement learning (RL) offers an alternative, the capacity limitations of SLMs pose a severe dilemma: sparse outcome rewards provide insufficient guidance, whereas strict trajectory matching imposes overly restrictive constraints. To bridge this capacity-driven gap, we propose MENTOR, which introduces a flexible yet process-aware reward structure. Instead of enforcing rigid replication, MENTOR uses the teacher's reference to guide tool-use behavior, balancing behavioral alignment with downstream performance. Extensive experiments on controlled executable-tool benchmarks demonstrate that MENTOR improves OOD tool-use performance compared to SFT and strict RL baselines. Our findings suggest that within verifiable tool-use environments, flexible tool-use alignment offers a more effective approach than strict trajectory replication for developing adaptable small models.

2603.25702 2026-06-19 cs.CL (updated version)

S2D2: Fast Decoding for Diffusion LLMs via Training-Free Self-Speculation

S2D2：通过免训练自我推测实现扩散LLM的快速解码

Ligong Han, Hao Wang, Han Gao, Kai Xu, Akash Srivastava

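Affiliations: Red Hat AI Innovation; MIT-IBM Watson AI Lab; Iowa State University; Core AI, IBM

AI summary: Proposes S2D2, a training-free self-speculative decoding framework. Because a block-diffusion model becomes autoregressive when the block size is 1, the same pretrained model can act as both drafter and verifier, which improves the accuracy-speed tradeoff without additional training.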

Comments Code is available at https://github.com/phymhan/S2D2

AI-generated Chinese abstract

块扩散语言模型通过结合块级自回归解码与块内并行去噪，为超越自回归生成提供了一条有前景的路径。然而，在实际加速所需的少步数场景中，标准的置信度阈值解码往往脆弱：激进的阈值损害质量，而保守的阈值则需要不必要的去噪步骤。现有解决此问题的方法要么需要额外训练，要么增加测试时计算。我们提出S2D2，一种用于块扩散语言模型的免训练自我推测解码框架。我们的关键观察是，当块大小减小到1时，块扩散模型变为自回归模型，从而允许相同的预训练模型同时充当草稿模型和验证模型。S2D2在标准块扩散解码中插入一个推测验证步骤，并使用轻量级路由策略来决定何时验证值得其成本。这产生了一种混合解码轨迹，其中扩散并行提出令牌，而自回归模式充当局部序列级评判器。在三个主流块扩散家族中，S2D2在准确性-速度权衡上持续优于强置信度阈值基线。在SDAR上，我们观察到相比自回归解码高达4.7倍加速，相比调优的动态解码基线高达1.57倍加速，同时准确性提升高达4.5个点。在LLaDA2.1-Mini上，S2D2与内置自校正保持互补，包括在保守设置下比静态基线快4.4倍且准确性略高。

English abstract

Block-diffusion language models offer a promising path toward faster-than-autoregressive generation by combining block-wise autoregressive decoding with within-block parallel denoising. However, in the few-step regime needed for practical acceleration, standard confidence-thresholded decoding is often brittle: aggressive thresholds hurt quality, while conservative thresholds require unnecessary denoising steps. Existing approaches that address this issue either require additional training or incur extra test-time compute. We present S2D2, a training-free self-speculative decoding framework for block-diffusion language models. Our key observation is that a block-diffusion model becomes autoregressive when the block size is reduced to one, allowing the same pretrained model to act as both drafter and verifier. S2D2 inserts a speculative verification step into standard block-diffusion decoding and uses lightweight routing policies to decide when verification is worth its cost. This yields a hybrid decoding trajectory in which diffusion proposes tokens in parallel, while the autoregressive mode acts as a local sequence-level critic. Across three mainstream block-diffusion families, S2D2 consistently improves the accuracy-speed tradeoff over strong confidence-thresholding baselines. On SDAR, we observe up to $4.7\times$ speedup over autoregressive decoding, and up to $1.57\times$ over a tuned dynamic decoding baseline while improving accuracy by up to $4.5$ points. On LLaDA2.1-Mini, S2D2 remains complementary to built-in self-correction, including a conservative setting where it is $4.4\times$ faster than the static baseline with slightly higher accuracy.
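The drafter/verifier idea in the abstract above can be pictured with the generic greedy acceptance rule from speculative decoding. This is a minimal sketch, not S2D2's actual verification rule or routing policy: `verify_draft`, `next_token`, and `toy_verifier` are hypothetical names, and a real system would score every draft position in a single forward pass rather than one call per token.

```python
from typing import Callable, List, Sequence


def verify_draft(prefix: Sequence[int],
                 draft: Sequence[int],
                 next_token: Callable[[Sequence[int]], int]) -> List[int]:
    """Accept the longest draft prefix that the verifier agrees with.

    `next_token` is a hypothetical greedy autoregressive verifier: given the
    tokens so far, it returns the single token it would emit next.
    """
    accepted: List[int] = []
    context = list(prefix)
    for proposed in draft:
        expected = next_token(context)
        if proposed != expected:
            # First disagreement: keep the verifier's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(proposed)
        context.append(proposed)
    return accepted


def toy_verifier(context: Sequence[int]) -> int:
    # Toy stand-in for a model: always predicts "last token + 1".
    return context[-1] + 1


# The draft agrees on 3 and 4, then diverges; the verifier supplies 5.
print(verify_draft([1, 2], [3, 4, 9, 10], toy_verifier))  # [3, 4, 5]
```

Under this rule the output always matches what the verifier alone would have produced greedily, so the only open question is whether verification is worth its cost, which is what the abstract says the routing policies decide.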

2605.16865 2026-06-19 cs.CL (updated version)

MixSD: Mixed Contextual Self-Distillation for Knowledge Injection

MixSD: 混合上下文自蒸馏用于知识注入

Jiarui Liu, Lechen Zhang, Yongjin Yang, Yinghui He, Yingheng Wang, Weihao Xuan, Zhijing Jin, Mona Diab

Affiliations: Carnegie Mellon University; Jinesis Lab, University of Toronto & Vector Institute; University of Illinois Urbana-Champaign; Princeton University; Cornell University; The University of Tokyo; RIKEN AIP; Max Planck Institute for Intelligent Systems, Tübingen, Germany; EuroSafeAI

AI summary: Proposes MixSD, which builds supervision by mixing tokens from two conditionals of the base model itself, so that knowledge injection stays aligned with the model's own generation distribution, improving factual memorization while preserving pretrained capabilities such as reasoning.

AI-generated Chinese abstract

监督微调（SFT）被广泛用于将新知识注入语言模型，但通常会损害预训练能力，如推理和通用领域性能。我们认为这种遗忘是由于微调目标与模型的自回归分布不一致，迫使优化器模仿低概率token序列。为了解决这个问题，我们提出了MixSD，一种无需外部教师的简单方法，用于对齐分布的知识注入。与固定目标训练不同，MixSD通过混合基础模型自身两个条件下的token动态构建监督。所生成的监督序列保留了事实学习信号，同时更接近基础模型的分布。我们在两个合成语料库上评估了MixSD，研究事实回忆和算术功能学习，并结合已建立的开放领域事实问答和知识编辑基准。在多种模型规模和设置下，MixSD在记忆-保留权衡上优于SFT和在线自蒸馏基线，能够保留基础模型的100% held-out能力，同时保持接近完美的训练准确率，而标准SFT只能保留1%。我们进一步表明，MixSD在基础模型下生成的监督目标具有显著更低的NLL，并减少了有害的Fisher敏感参数方向运动。这些结果表明，将监督与模型的本征生成分布对齐是简单且有效的知识注入原则，可以缓解灾难性遗忘。

English abstract

Supervised fine-tuning (SFT) is widely used to inject new knowledge into language models, but it often degrades pretrained capabilities such as reasoning and general-domain performance. We argue this forgetting arises because fine-tuning targets from humans or external systems diverge from the model's autoregressive distribution, forcing the optimizer to imitate low-probability token sequences. To address this problem, we propose MixSD, a simple external-teacher-free method for distribution-aligned knowledge injection. Instead of training on fixed targets, MixSD constructs supervision dynamically by mixing tokens from two conditionals of the base model itself: an expert conditional that observes the injected fact in context, and a naive conditional that reflects the model's original prior. The resulting supervision sequences preserve the factual learning signal while remaining substantially closer to the base model's distribution. We evaluate MixSD on two synthetic corpora that we construct to study factual recall and arithmetic function acquisition in a controlled setting, together with established benchmarks for open-domain factual question answering and knowledge editing. Across multiple model scales and settings, MixSD consistently achieves a better memorization-retention trade-off compared to SFT and on-policy self distillation baselines, retaining up to 100% of the base model's held-out capability while maintaining near-perfect training accuracy, whereas standard SFT retains as little as 1%. We further show that MixSD produces substantially lower-NLL supervision targets under the base model and reduces harmful movement along Fisher-sensitive parameter directions. These results suggest that aligning supervision with the model's native generation distribution is a simple and effective principle for knowledge injection that mitigates catastrophic forgetting.

2510.27568 2026-06-19 cs.AI cs.CL (updated version)

Quality over Clicks: Iterative Reinforcement Learning for Early-Stage E-commerce Query Suggestion

Qi Sun, Kejun Xiao, Huaipeng Zhao, Tao Luo, Xiaoyi Zeng

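Affiliations: Alibaba International Digital Commercial Group

AI summary: Targets early-stage deployments where click feedback is sparse. Proposes QualEQS, a quality-first iterative reinforcement learning framework that optimizes query-suggestion quality along three dimensions (answerability, factuality, and information gain), uses group-level disagreement among candidate suggestions to identify ambiguous contexts and mine hard cases for iterative refinement, and improves ChatPV by 6.81% in a real-world e-commerce system.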

AI-generated Chinese abstract

现有的对话系统依赖查询建议来增强用户参与度。最近的方法主要使用点击率（CTR）模型优化生成模型，以与用户偏好对齐。然而，这些方法在早期部署场景中效果较差，因为点击反馈稀疏且不足以训练可靠的CTR模型。为弥补这一差距，我们提出了QualEQS，一个面向电商查询建议的质量优先迭代强化学习框架。我们将可操作的建议质量形式化为三个直接影响下游可用性的维度：可回答性、事实性和信息增益。为了在没有点击监督的情况下从在线流量中持续改进，我们进一步提出候选建议之间的组级分歧，以识别模糊的查询上下文并挖掘难训练案例进行迭代优化。我们还引入了EQS-Benchmark，一个包含16,949个真实电商查询的数据集，用于离线训练和评估。实验表明，我们基于质量的离线指标与在线性能强相关，为稀疏反馈部署提供了一种实用的评估方法。在离线和在线设置中，QualEQS均持续优于强基线，在真实企业级对话购物助手系统中，在线ChatPV提升了6.81%。

English abstract

Existing dialogue systems rely on query suggestion to enhance user engagement. Recent approaches mainly optimize generative models using click-through rate (CTR) models to align with user preferences. However, these methods are less effective in early-stage deployment scenarios, where click feedback is sparse and insufficient for training a reliable CTR model. To bridge this gap, we propose QualEQS, a quality-first iterative reinforcement learning framework for e-commerce query suggestion. We formalize actionable suggestion quality along three dimensions that directly affect downstream usability: answerability, factuality, and information gain. To continuously improve from online traffic without click supervision, we further propose group-level disagreement among candidate suggestions to identify ambiguous query contexts and mine hard training cases for iterative refinement. We also introduce EQS-Benchmark, a dataset of 16,949 real-world e-commerce queries for offline training and evaluation. Experiments show that our quality-based offline metrics correlate strongly with online performance, providing a practical evaluation recipe for sparse-feedback deployment. In both offline and online settings, QualEQS consistently outperforms strong baselines, yielding a 6.81% improvement in online ChatPV in a real-world enterprise-level conversational shopping assistant system.

2604.23938 2026-06-19 cs.CL (updated version)

TSAssistant: A Human-in-the-Loop Agentic Framework for Automated Target Safety Assessment

TSAssistant: 一种人在回路中的自动化靶点安全性评估智能体框架

Xiaochen Zheng, Zhiwen Jiang, David Tokar, Yexiang Cheng, Alvaro Serra, Melanie Guerard, Klas Hatje, Tatyana Doktorova

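Affiliations: Computational Sciences Center of Excellence

AI summary: Proposes TSAssistant, a multi-agent framework that uses a hierarchical instruction architecture and an interactive refinement loop to decompose target safety assessment report generation into specialized subtasks, achieving high reproducibility and evidence traceability.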

Comments Updated with quantitative and expert evaluations

AI-generated Chinese abstract

靶点安全性评估（TSA）需要系统整合遗传、转录组、靶点同源性、药理学和临床数据，以评估治疗靶点的潜在安全性风险。该过程劳动密集且依赖专家，在可扩展性和可重复性方面面临挑战。我们提出TSAssistant，一种人在回路中的多智能体框架，将TSA报告生成分解为专门子智能体的工作流：研究子智能体各自基于并引用单个TSA领域，合成子智能体整合跨领域发现。子智能体通过标准化工具接口从精选生物医学来源检索和综合证据，生成可单独引用、基于证据的章节，其行为由分层指令架构塑造，该架构将协调逻辑与领域专业知识和用户意图分离。为补充这些软约束，程序化执行钩子和持久记忆存储在整个工作流中强制执行硬约束，而交互式优化循环允许专家在完全保留跨迭代对话上下文的情况下审查和修订各个章节。我们不是进行单一的整体比较，而是将报告质量分解为可重复性、证据基础、任务级准确性和专家监督下的可控性，发现高可重复性和证据基础、与人类参考高度一致以及专家驱动的净正面改进。

English abstract

Target Safety Assessment (TSA) requires systematic integration of genetic, transcriptomic, target homology, pharmacological, and clinical data to evaluate potential safety liabilities of therapeutic targets. This process is labor-intensive and expert-dependent, posing challenges in scalability and reproducibility. We present TSAssistant, a human-in-the-loop multi-agent framework that decomposes TSA report generation into a workflow of specialized subagents: Research Subagents that each ground and cite a single TSA domain, and Synthesis Subagents that integrate findings across domains. Subagents retrieve and synthesize evidence from curated biomedical sources through standardized tool interfaces and produce individually citable, evidence-grounded sections, with behavior shaped by a hierarchical instruction architecture that separates coordination logic from domain expertise and user intent. To complement these soft constraints, programmatic execution hooks and persistent memory stores enforce hard constraints across the workflow, while an interactive refinement loop allows experts to review and revise individual sections with full conversational context preserved across iterations. Rather than a single holistic comparison, we decompose report quality into reproducibility, evidential grounding, task-level accuracy, and controllability under expert oversight, finding high reproducibility and grounding, substantial agreement with the human reference, and net-positive expert-driven refinement.

2602.15707 2026-06-19 cs.MM cs.CL cs.LG (updated version)

Proactive Conversational Assistant for a Procedural Manual Task based on Audio and IMU

基于音频和IMU的主动式程序性任务对话助手

Rehana Mahfuz, Yinyi Guo, Erik Visser, Phanidhar Chinchili

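Affiliations: Qualcomm Technologies, Inc.

AI summary: Proposes what the authors describe as the first real-time conversational assistant that uses only audio and IMU modalities. Fine-tuning the language model reduces unnecessary dialogue and improves question-answering accuracy, and the assistant runs on an edge device with no cloud dependence.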

Comments 5 figures. 5 more in appendix

AI-generated Chinese abstract

实时对话助手用于程序性手工任务通常依赖视频输入，这会导致计算成本高且侵犯用户隐私。我们首次提出一种实时对话助手，仅使用来自用户可穿戴设备的轻量级隐私保护模态（如音频和IMU输入）来理解上下文，为程序性手工任务提供全面指导。通过家具组装任务和烹饪任务，我们展示了该助手如何主动向执行程序性任务的用户提供逐步指令，并回答用户问题。我们阐述了实现该助手的数据生成方法和系统设计。观察到现成的语言模型健谈但并非总能正确回答问题，我们展示了微调模型如何将其减少不必要对话的能力提升50%（精确度），同时将正确回答问题的能力提升150%（召回率）。我们进一步描述了如何在边缘设备上实现该助手，无需依赖云端。

English abstract

Real-time conversational assistants for procedural manual tasks often depend on video input, which can be computationally expensive and compromise user privacy. For the first time, we propose a real-time conversational assistant that provides comprehensive guidance for procedural manual tasks using only lightweight privacy-preserving modalities such as audio and IMU inputs from a user's wearable device to understand the context. Using a furniture assembly task and a cooking task, we show how this assistant proactively communicates step-by-step instructions to a user performing a procedural task, and answers user questions. We illustrate the data generation method and the system design to achieve such an assistant. On observing that an off-the-shelf language model is a talkative assistant but is not always able to answer questions correctly, we demonstrate how finetuning the model improves its ability to limit unnecessary dialogues with a 50% increase in the precision, while also improving its ability to answer questions correctly, measured by a 150% increase in the recall of answers. We further describe how such an assistant is implemented on an edge device with no dependence on the cloud.
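For readers who want the arithmetic behind the two reported metrics, the sketch below computes precision and recall from raw counts and expresses a change as a relative increase. It assumes the 50% and 150% figures are relative increases, which the abstract does not state explicitly, and every count here is made up for illustration; `precision_recall` and `relative_increase` are hypothetical helper names.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def relative_increase(before: float, after: float) -> float:
    """Relative change; 0.5 means a 50% increase over `before`."""
    return (after - before) / before


# Made-up counts, chosen only so the arithmetic is easy to follow.
p_before, r_before = precision_recall(tp=4, fp=6, fn=16)
p_after, r_after = precision_recall(tp=12, fp=8, fn=12)

print(round(relative_increase(p_before, p_after), 2))
print(round(relative_increase(r_before, r_after), 2))
```

With these illustrative counts, precision moves from 0.4 to 0.6 and recall from 0.2 to 0.5, which correspond to relative increases of 50% and 150%.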

2605.13438 2026-06-19 cs.AI cs.CL (updated version)

CogniFold: Always-On Proactive Memory via Cognitive Folding

CogniFold: 通过认知折叠实现始终在线的主动记忆

Suli Wang, Yiqun Duan, Yu Deng, Rundong Zhao, Dai Shi, Minghua Deng, Chen Chen, Xinliang Zhou

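AI summary: Proposes CogniFold, a brain-inspired proactive memory system that extends Complementary Learning Systems theory to three layers (hippocampus, neocortex, and a prefrontal intent layer) and uses graph-topology self-organization so that cognitive structure emerges continuously from event streams. It performs well on both a cognitive evaluation and conventional memory benchmarks.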

Comments Code is available at https://github.com/OpenNorve/CogniFold

AI-generated Chinese abstract

现有的智能体记忆主要仍是被动反应式和基于检索的，缺乏自主将经验组织成持久认知结构的能力。为了迈向真正自主的智能体，我们引入了CogniFold，一种受大脑启发的“始终在线”智能体记忆，专为下一代主动助手设计。CogniFold持续将碎片化事件流折叠成自涌现的认知结构，从传入事件和积累的知识中逐步引导出更高层次的认知。我们通过将互补学习系统（CLS）理论从两层（海马体、新皮层）扩展到三层，增加了一个前额叶意图层来奠定基础。模仿前额叶皮层作为意图控制和决策制定的中心，CogniFold通过图拓扑自组织实现这一点：认知结构在事件流下主动组装，语义相似时合并，过时时衰减，通过联想回忆重新链接，并在概念簇密度超过阈值时浮现意图。我们使用CogEval-Bench评估结构形成，证明CogniFold独特地产生了符合认知期望和概念涌现的记忆结构。此外，在八个下游基准测试中（其中两个考察长期对话记忆，即LoCoMo和LongMemEval，另外六个涵盖其他认知领域），我们验证了CogniFold在常规记忆任务上同样表现稳健。

English abstract

Existing agent memory remains predominantly reactive and retrieval-based, lacking the capacity to autonomously organize experience into persistent cognitive structure. Toward genuinely autonomous agents, we introduce CogniFold, a brain-inspired "always-on" agent memory designed for the next generation of proactive assistants. CogniFold continuously folds fragmented event streams into self-emerging cognitive structures, bootstrapping progressively higher-level cognition from incoming events and accumulated knowledge. We ground this by extending Complementary Learning Systems (CLS) theory from two layers (hippocampus, neocortex) to three, adding a prefrontal intent layer. Emulating the prefrontal cortex as the locus of intentional control and decision-making, CogniFold achieves this through graph-topology self-organization: cognitive structures proactively assemble under the stream, merge when semantically similar, decay when stale, relink through associative recall, and surface intents when concept-cluster density crosses a threshold. We evaluate structural formation using CogEval-Bench, demonstrating that CogniFold uniquely produces memory structures that match cognitive expectations and concept emergence. Furthermore, across eight downstream benchmarks -- two probing long-term conversational memory (LoCoMo, LongMemEval) and six spanning other cognitive domains -- we validate that CogniFold simultaneously performs robustly on conventional memory tasks. Our code is available at https://github.com/OpenNorve/CogniFold.

2504.02885 2026-06-19 cs.CL (updated version)

Med-R2: Perception and Reflection-driven Complex Reasoning for Medical Report Generation

Med-R2：面向医学报告生成的感知与反思驱动复杂推理

Hao Wang, Shuchang Ye, Jinghao Lin, Usman Naseem, Jinman Kim

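Affiliations: The School of Computer Science, The University of Sydney; The School of Computing, Macquarie University; Doubao Medical Group, ByteDance

AI summary: Proposes Med-R2, a fine-tuning strategy that adds a perception-driven long reasoning process guided by radiology-specific knowledge, plus a reflection mechanism that corrects perceptual errors, to improve LVLMs' pathological feature perception and diagnostic accuracy in medical report generation.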

Comments 28 pages, 3 figures, 1 table

AI-generated Chinese abstract

自动化医学报告生成（MRG）越来越多地被用于减轻人工报告负担和辅助决策。大型视觉语言模型（LVLMs）因其细粒度的图像-文本对齐和先进的文本生成能力，在自动化MRG中展现出巨大潜力。目前，最先进的MRG主要专注于通过直接监督微调（SFT）来适应预训练的LVLMs，这是一种使用医学图像-报告对的微调策略。然而，有几个因素限制了这些LVLMs的性能。首先，直接SFT使LVLMs能够直接生成医学报告，而无需经过病理特征感知和诊断推理的中间思考过程。这导致可能无法感知病理特征，从而引起误诊。其次，直接SFT缺乏放射学特定知识的指导，导致LVLMs误解感知到的病理特征并做出错误诊断。为了解决这些问题，我们提出了一种名为Med-R2的新型微调策略。我们引入了一个感知驱动的长推理过程，该过程在报告生成之前进行，并融入放射学特定知识作为指导。此外，为了减轻复杂推理中潜在的感知错误，引入了一种反思机制来细化病理特征的感知和生成的报告。我们的实验表明，Med-R2通过微调LVLMs有效增强了MRG的病理特征感知能力和诊断准确性。

English abstract

Automated medical report generation (MRG) is increasingly used to reduce the burden of manual reporting and for decision support. Large vision-language models (LVLMs) hold great promise for automated MRG due to their fine-grained image-text alignment and advanced text-generation capabilities. Currently, state-of-the-art MRGs primarily focus on adapting pre-trained LVLMs with direct supervised fine-tuning (SFT), a fine-tuning strategy with medical image-report pairs. However, several factors limit the performance of these LVLMs. Firstly, direct SFT enables LVLMs to generate medical reports directly without an intermediate thinking process of pathological feature perception and diagnostic reasoning. This causes a potential failure to perceive pathological features and thus leads to misdiagnosis. Secondly, direct SFT lacks the incorporation of radiology-specific knowledge guidance, causing LVLMs to misinterpret perceived pathological features and make incorrect diagnoses. To address these gaps, we propose a novel fine-tuning strategy named Med-R2. We introduce a perception-driven long reasoning process that precedes report generation and incorporates radiology-specific knowledge as guidance. Additionally, to alleviate potential perceptual errors in complex reasoning, a reflection mechanism is introduced to refine the perception of pathological features and the generated report. Our experiments demonstrate that Med-R2 effectively enhances the capability of pathological features perception and diagnosis accuracy for MRG via fine-tuned LVLMs.

2603.16606 2026-06-19 cs.CL (updated version)

Analyzing Error Propagation in ASR-LLM Cascades for Korean Spoken Question Answering

Donghyuk Jung, Youngwon Choi

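Affiliations: Korea Culture Technology Institute, Republic of Korea; Maum AI Inc., Republic of Korea

AI summary: Studies error propagation in ASR-LLM cascades for Korean spoken question answering (SQA). By analyzing downstream semantic failures, it shows error effects that conventional ASR metrics do not fully capture, finds that cascade degradation is consistent across LLMs of differing capability, identifies single-character ASR errors as a channel for semantic failure, and shows through an auxiliary comparison that a large audio language model outperforms ASR-LLM pipelines with a matched language model on noisy Korean SQA.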

Comments Preprint. Submitted to APSIPA ASC 2026

2606.05846 2026-06-19 cs.CL eess.AS (updated version)

Towards Truly Multilingual ASR: Generalizing Code-Switching ASR to Unseen Language Pairs

迈向真正的多语言ASR：将代码切换ASR泛化到未见语言对

Gio Paik, Hyunseo Shin, Soungmin Lee

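Affiliations: University of Tokyo

AI summary: Uses model merging and domain generalization methods to study whether code-switching ability learned from a limited set of language pairs generalizes to unseen pairs. Experiments show that bilingual CS-ASR models generalize to unseen language pairs only modestly.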

Comments ICML 2026 Workshop on Machine Learning for Audio

AI-generated Chinese abstract

自动语音识别（ASR）已成为人机交互的关键技术。然而，由于跨多种语言对的代码切换（CS）语音资源严重稀缺，代码切换ASR（CS-ASR）仍然特别具有挑战性。现有方法主要通过合成CS语音生成或在有限双语数据集上进行特定语言对微调来提高CS-ASR性能。然而，这些方法面临固有的可扩展性限制，因为对CS的支持必须针对语言对单独开发，而语言对的数量随支持的语言数量呈组合增长。在这项工作中，我们研究通过模型合并和领域泛化方法，从一组有限的已见语言对中学到的CS能力是否可以泛化到未见语言对。我们的实验表明，合并的双语CS-ASR模型对未见语言对有一定程度的泛化，表明双语CS能力在语言对之间的迁移有限。

English abstract

Automatic Speech Recognition (ASR) has become a key technology for human--AI interaction. However, code-switching ASR (CS-ASR) remains particularly challenging due to the severe scarcity of multilingual CS speech resources across diverse language pairs. Existing approaches primarily improve CS-ASR performance through synthetic CS speech generation or pair-specific fine-tuning on limited bilingual datasets. Nevertheless, these approaches face an inherent scalability limitation, as support for CS must be developed separately for language pairs whose number grows combinatorially with the number of supported languages. In this work, we investigate whether CS capabilities learned from a limited set of seen language pairs can generalize to unseen language pairs through model merging and domain generalization methods. Our experiments show that merged bilingual CS-ASR models modestly generalize to unseen language pairs, suggesting limited transfer of bilingual CS capabilities across language pairs.
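The abstract above does not say which merging method is used, so the following is only a sketch of the simplest common baseline, uniform parameter averaging. Models are represented as plain dictionaries of float lists; `merge_by_averaging` and the parameter name `"w"` are hypothetical and not taken from the paper.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def merge_by_averaging(models: List[Params]) -> Params:
    """Merge models with identical parameter names by averaging each weight."""
    if not models:
        raise ValueError("need at least one model")
    names = models[0].keys()
    for model in models:
        if model.keys() != names:
            raise ValueError("all models must share the same parameter names")
    merged: Params = {}
    for name in names:
        vectors = [model[name] for model in models]
        # zip(*vectors) walks the same coordinate across every model.
        merged[name] = [sum(vals) / len(vals) for vals in zip(*vectors)]
    return merged


# Two toy "bilingual" models with a single two-dimensional parameter.
model_pair_a = {"w": [1.0, 3.0]}
model_pair_b = {"w": [3.0, 5.0]}
print(merge_by_averaging([model_pair_a, model_pair_b]))  # {'w': [2.0, 4.0]}
```

Averaging of this kind only makes sense when the models share an architecture and a common initialization, which is one reason merged models may transfer only modestly to language pairs neither parent has seen.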

2604.18105 2026-06-19 eess.AS cs.CL cs.SD (updated version)

NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR

NIM4-ASR：迈向高效、鲁棒且可定制的实时基于LLM的语音识别

Yuan Xie, Jiaqi Song, Guang Qiu, Xianliang Wang, Kai Qiao, Junfeng Yuan, Shengqing Liu, Yi Zhang, Bowen Chen, Ming Lei, Jie Gao, Jie Wu

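Affiliations: Advanced Intelligent Systems Group, NIO

AI summary: Proposes NIM4-ASR, a framework that redesigns the multi-stage training paradigm (pre-training architecture optimization, iterative asynchronous SFT, and ASR-specialized reinforcement learning) and adds production optimizations (noise robustness, streaming inference, and RAG-based hotword customization), reaching state-of-the-art performance on multiple public benchmarks with 2.3B parameters.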

AI-generated Chinese abstract

将大语言模型（LLM）集成到自动语音识别（ASR）中已成为近年来的主流范式。尽管现有的基于LLM的ASR模型在公共基准上表现出色，但其训练仍然主要依赖数据驱动，未能充分解决关键的实际挑战——特别是在资源受限部署中的有限向下可扩展性以及声学挑战条件下的幻觉问题。为了解决这些问题，我们提出了NIM4-ASR，一个面向生产的、基于LLM的ASR框架，针对效率和鲁棒性进行了优化。基于编码器和LLM之间功能角色的原则性划分，我们重新设计了多阶段训练范式，使每个模块与其预期的能力边界对齐。具体来说，我们重新制定了预训练架构和目标以缓解模态差距并提高参数效率；引入了迭代异步SFT阶段以保持声学保真度并约束表示漂移；设计了ASR专用的强化学习阶段以进一步提高识别质量和鲁棒性。我们还加入了一系列面向生产的优化，包括噪声和静音条件下的鲁棒性、实时流式推理以及通过检索增强生成（RAG）进行的热词定制。实验表明，NIM4-ASR仅用2.3B参数就在多个公共基准上达到了最先进的性能，同时在内部基准上显著优于更大规模的竞争对手——特别是在实体密集的真实场景中。NIM4-ASR进一步通过RAG支持百万级热词定制，检索延迟低于毫秒，从而能够高效适应新兴实体和个性化用户需求。

English abstract

Integrating large language models (LLMs) into automatic speech recognition (ASR) has become a mainstream paradigm in recent years. Although existing LLM-based ASR models demonstrate impressive performance on public benchmarks, their training remains predominantly data-driven, leaving key practical challenges insufficiently addressed -- particularly limited downward scalability in resource-constrained deployments and hallucinations under acoustically challenging conditions. To address these issues, we present NIM4-ASR, a production-oriented LLM-based ASR framework optimized for both efficiency and robustness. Grounded in a principled delineation of functional roles between the encoder and the LLM, we redesign the multi-stage training paradigm to align each module with its intended capability boundary. Specifically, we reformulate the pre-training architecture and objective to mitigate the modality gap and improve parameter efficiency; introduce an iterative asynchronous SFT stage to preserve acoustic fidelity and constrain representation drift; and design an ASR-specialized reinforcement learning stage to further enhance recognition quality and robustness. We additionally incorporate a suite of production-oriented optimizations, including robustness under noisy and silent conditions, real-time streaming inference, and hotword customization via retrieval-augmented generation (RAG). Experiments show that NIM4-ASR achieves state-of-the-art performance on multiple public benchmarks with merely 2.3B parameters, while substantially outperforming larger-scale competitors on internal benchmarks -- particularly in entity-intensive real-world scenarios. NIM4-ASR further supports million-scale hotword customization via RAG with sub-millisecond retrieval latency, enabling efficient adaptation to emerging entities and personalized user requirements.

2508.04266 2026-06-19 cs.CL (updated version)

ShoppingBench: A Real-World Intent-Grounded Shopping Benchmark for LLM-based Agents

ShoppingBench：面向LLM智能体的真实世界意图导向购物基准

Jiangyuan Wang, Kejun Xiao, Qi Sun, Huaipeng Zhao, Tao Luo, Jian Dong Zhang, Xiaoyi Zeng

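Affiliations: Alibaba International Digital Commercial Group

AI summary: Proposes ShoppingBench, a benchmark of real-world shopping-intent tasks at multiple difficulty levels, evaluated in a simulated environment with more than 2.5 million products. GPT-4.1 achieves a success rate below 50%, and a trajectory distillation strategy is proposed to improve smaller models.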

Comments Accepted for oral presentation at AAAI 2026

AI-generated Chinese abstract

现有的电子商务基准主要关注基本用户意图，例如查找或购买产品。然而，现实世界的用户通常追求更复杂的目标，例如应用优惠券、管理预算以及寻找多产品卖家。为了弥补这一差距，我们提出了ShoppingBench，这是一个新颖的端到端购物基准，旨在涵盖日益具有挑战性的接地意图级别。具体来说，我们提出了一个可扩展的框架，基于从采样的真实世界产品中得出的各种意图来模拟用户指令。为了促进一致且可靠的评估，我们提供了一个大规模购物沙箱作为交互式模拟环境，包含超过250万种真实产品。实验结果表明，即使是最先进的语言智能体（如GPT-4.1）在我们的基准任务上的绝对成功率也低于50%，这突显了我们的ShoppingBench带来的重大挑战。此外，我们提出了一种轨迹蒸馏策略，并利用监督微调以及基于合成轨迹的强化学习，将大型语言智能体的能力蒸馏到较小的智能体中。结果，我们训练的智能体实现了与GPT-4.1相媲美的竞争性能。

English abstract

Existing benchmarks in e-commerce primarily focus on basic user intents, such as finding or purchasing products. However, real-world users often pursue more complex goals, such as applying vouchers, managing budgets, and finding multi-products seller. To bridge this gap, we propose ShoppingBench, a novel end-to-end shopping benchmark designed to encompass increasingly challenging levels of grounded intent. Specifically, we propose a scalable framework to simulate user instructions based on various intents derived from sampled real-world products. To facilitate consistent and reliable evaluations, we provide a large-scale shopping sandbox that serves as an interactive simulated environment, incorporating over 2.5 million real-world products. Experimental results demonstrate that even state-of-the-art language agents (such as GPT-4.1) achieve absolute success rates under 50% on our benchmark tasks, highlighting the significant challenges posed by our ShoppingBench. In addition, we propose a trajectory distillation strategy and leverage supervised fine-tuning, along with reinforcement learning on synthetic trajectories, to distill the capabilities of a large language agent into a smaller one. As a result, our trained agent achieves competitive performance compared to GPT-4.1.

2602.13139 2026-06-19 cs.CL (updated version)

OpenLID-v3: Improving the Precision of Closely Related Language Identification -- An Experience Report

OpenLID-v3：提高近亲语言识别精度的经验报告

Mariia Fedorova, Nikolay Arefyev, Maja Buljan, Jindřich Helcl, Stephan Oepen, Egil Rønningstad, Yves Scherrer

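AI summary: Addresses the difficulty existing language identification tools have with closely related languages and with noise. Extends the OpenLID classifier with more training data, merged clusters of problematic language variants, and a dedicated noise label to produce OpenLID-v3, which is evaluated for precision on multiple benchmarks.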

Comments VarDial'26 workshop at the EACL 2026 conference

DOI: 10.18653/v1/2026.vardial-1.23

AI-generated Chinese abstract

语言识别（LID）是从网络数据构建高质量多语言数据集的关键步骤。现有的LID工具（如OpenLID或GlotLID）通常难以识别近亲语言，也难以区分有效自然语言与噪声，这污染了特定语言子集，尤其是低资源语言。在本工作中，我们通过增加更多训练数据、合并有问题的语言变体簇以及引入一个专门标记噪声的标签来扩展OpenLID分类器。我们将这个扩展系统称为OpenLID-v3，并在多个基准上将其与GlotLID进行评估。在开发过程中，我们重点关注三组近亲语言（波斯尼亚语、克罗地亚语和塞尔维亚语；意大利北部和法国南部的罗曼语变体；以及斯堪的纳维亚语言），并在现有评估数据集不足的地方贡献了新的评估数据集。我们发现集成方法提高了精度，但也显著降低了对低资源语言的覆盖。OpenLID-v3可在 https://huggingface.co/HPLT/OpenLID-v3 获取。

English abstract

Language identification (LID) is an essential step in building high-quality multilingual datasets from web data. Existing LID tools (such as OpenLID or GlotLID) often struggle to identify closely related languages and to distinguish valid natural language from noise, which contaminates language-specific subsets, especially for low-resource languages. In this work we extend the OpenLID classifier by adding more training data, merging problematic language variant clusters, and introducing a special label for marking noise. We call this extended system OpenLID-v3 and evaluate it against GlotLID on multiple benchmarks. During development, we focus on three groups of closely related languages (Bosnian, Croatian, and Serbian; Romance varieties of Northern Italy and Southern France; and Scandinavian languages) and contribute new evaluation datasets where existing ones are inadequate. We find that ensemble approaches improve precision but also substantially reduce coverage for low-resource languages. OpenLID-v3 is available on https://huggingface.co/HPLT/OpenLID-v3.
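The precision/coverage trade-off mentioned at the end of the abstract can be seen in a toy agreement ensemble: a label is emitted only when two classifiers agree, and the ensemble abstains otherwise. This is an illustrative sketch, not the ensemble evaluated in the paper; the predictions are made up, `agree_or_abstain` and `precision_and_coverage` are hypothetical helpers, and "precision" here simply means accuracy over the answered items.

```python
from typing import List, Optional, Tuple


def agree_or_abstain(label_a: str, label_b: str) -> Optional[str]:
    """Return the shared label, or None (abstain) when the models disagree."""
    return label_a if label_a == label_b else None


def precision_and_coverage(preds: List[Optional[str]],
                           gold: List[str]) -> Tuple[float, float]:
    """Precision over answered items, and the fraction of items answered."""
    answered = [(p, g) for p, g in zip(preds, gold) if p is not None]
    coverage = len(answered) / len(gold)
    precision = (sum(p == g for p, g in answered) / len(answered)
                 if answered else 0.0)
    return precision, coverage


# Made-up predictions over Croatian (hrv), Serbian (srp), and Bosnian (bos).
gold = ["hrv", "srp", "bos", "hrv"]
model_a = ["hrv", "srp", "hrv", "hrv"]
model_b = ["hrv", "srp", "srp", "hrv"]
ensemble = [agree_or_abstain(a, b) for a, b in zip(model_a, model_b)]

print(precision_and_coverage(model_a, gold))   # (0.75, 1.0)
print(precision_and_coverage(ensemble, gold))  # (1.0, 0.75)
```

In this toy case the ensemble drops exactly the item both models get wrong, the Bosnian sentence, which mirrors how agreement-based filtering can raise precision while shrinking coverage for harder, lower-resource languages.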

2605.26891 2026-06-19 cs.CL (updated version)

Telenor Nordics Customer Service self-help corpus

Telenor Nordics 客户服务自助语料库

Mike Riess

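Affiliations: Research and Innovation, Telenor Group

AI summary: Builds a multilingual customer service self-help corpus of 1,122 documents in Finnish, Danish, Norwegian, and Swedish to support Nordic NLP and information retrieval research.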

Comments 8 pages, 2 figures, 5 tables. Submitted to Nordic Machine Intelligence. Dataset: https://zenodo.org/records/19493152

AI-generated Chinese abstract

本文介绍了一个多语言客户服务自助语料库，包含1122篇经过人工验证的芬兰语、丹麦语、挪威语和瑞典语文档，共计274,599个单词、1,884,833个字符。这些文档来自四家北欧电信运营商的公共自助页面，随后通过结合LLM和人工标注的流程过滤了个人身份信息和相关性。北欧语言的领域特定数据集仍然稀缺，尤其是在客户服务领域，而这一领域对于检索增强生成、跨语言迁移学习和新兴的基于代理的服务架构日益重要。对语料库的分析显示，不同运营商的文档长度和结构存在显著差异，反映了不同的编辑策略，以及涵盖网络硬件、移动服务、电视和流媒体、计费和账户管理的广泛主题覆盖。该数据集在CC-BY-NC-SA-4.0许可下公开提供，网址为https://zenodo.org/records/19493152，旨在支持北欧NLP和信息检索的可重复研究。

English abstract

This paper presents a multilingual customer service self-help corpus comprising 1,122 manually validated documents in Finnish, Danish, Norwegian, and Swedish, totaling 274,599 words and 1,884,833 characters. The documents have been sourced from the public self-help pages of four Nordic telecommunications operators and subsequently filtered for person-identifiable information and relevance through a combined LLM and human annotation pipeline. Domain-specific datasets for Nordic languages remain scarce, particularly in customer service: a domain of growing importance for retrieval-augmented generation, cross-lingual transfer learning, and emerging agent-based service architectures. An analysis of the corpus reveals substantial variation in document length and structure across operators, reflecting distinct editorial strategies, as well as broad topical coverage spanning network hardware, mobile services, TV and streaming, billing, and account management. The dataset is publicly available under a CC-BY-NC-SA-4.0 license at https://zenodo.org/records/20732652, intended to support reproducible research in Nordic NLP and information retrieval.

2606.01338 2026-06-19 cs.CL (updated version)

Group-Sparse Matrix Factorization for Transfer Learning of Word Embeddings

Kan Xu, Xuanyi Zhao, Hamsa Bastani, Osbert Bastani

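Affiliations: W. P. Carey School of Business, Arizona State University; University of Pennsylvania; Wharton School, University of Pennsylvania

AI summary: Proposes a two-stage estimator with a group-sparse penalty that transfer-learns domain-specific word embeddings efficiently by combining a large-scale corpus with a small amount of domain data, and proves a generalization error bound along with the statistical equivalence of local and global minima of the nonconvex objective.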

AI-generated Chinese abstract

非结构化文本为许多领域的决策者提供了丰富的数据源，从零售中的产品评论到医疗保健中的护理记录。为了利用这些信息，单词通常通过无监督学习算法（如矩阵分解）转化为词嵌入——编码单词之间语义关系的向量。然而，从训练数据有限的新领域学习词嵌入可能具有挑战性，因为在新领域中含义/用法可能不同，例如，单词“positive”通常具有积极情感，但在医疗记录中通常具有消极情感，因为它可能意味着患者检测出疾病阳性。在实践中，我们预计只有少数领域特定的单词可能具有新含义。我们提出了一种直观的两阶段估计器，通过组稀疏惩罚利用这种结构，通过结合大规模文本语料库（如维基百科）和有限的领域特定文本数据，高效地迁移学习领域特定的词嵌入。我们限定了迁移学习估计器的泛化误差，证明当只有少量嵌入在领域间改变时，它可以用显著更少的领域特定数据实现高精度。此外，我们证明了在标准正则化条件下，由非凸目标函数识别的所有局部最小值与全局最小值在统计上不可区分，这意味着我们的估计器可以高效计算。我们的结果首次给出了组稀疏矩阵分解的界限，这可能具有独立意义。我们通过与自然语言处理中最先进的微调启发式方法进行实证比较来评估我们的方法。

English abstract

Unstructured text provides decision-makers with a rich data source in many domains, ranging from product reviews in retail to nursing notes in healthcare. To leverage this information, words are typically translated into word embeddings -- vectors that encode the semantic relationships between words -- through unsupervised learning algorithms such as matrix factorization. However, learning word embeddings from new domains with limited training data can be challenging, because the meaning/usage may be different in the new domain, e.g., the word "positive" typically has positive sentiment, but often has negative sentiment in medical notes since it may imply that a patient tested positive for a disease. In practice, we expect that only a small number of domain-specific words may have new meanings. We propose an intuitive two-stage estimator that exploits this structure via a group-sparse penalty to efficiently transfer learn domain-specific word embeddings by combining large-scale text corpora (such as Wikipedia) with limited domain-specific text data. We bound the generalization error of our transfer learning estimator, proving that it can achieve high accuracy with substantially less domain-specific data when only a small number of embeddings are altered between domains. Furthermore, we prove that all local minima identified by our nonconvex objective function are statistically indistinguishable from the global minimum under standard regularization conditions, implying that our estimator can be computed efficiently. Our results provide the first bounds on group-sparse matrix factorization, which may be of independent interest. We empirically evaluate our approach compared to state-of-the-art fine-tuning heuristics from natural language processing.
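To make the group-sparse idea concrete, the sketch below applies the standard proximal step for a group lasso penalty, treating each word's embedding shift between the source and target domain as one group. It is an illustration of the penalty's effect under that assumption, not the paper's two-stage estimator; `group_soft_threshold`, `delta`, and `lam` are hypothetical names.

```python
import math
from typing import List


def group_soft_threshold(delta: List[List[float]], lam: float) -> List[List[float]]:
    """Row-wise proximal operator of lam * sum_i ||delta_i||_2.

    Each row is one word's embedding shift. Rows whose Euclidean norm is at
    most `lam` are set exactly to zero; larger rows are shrunk toward zero.
    """
    shrunk: List[List[float]] = []
    for row in delta:
        norm = math.sqrt(sum(x * x for x in row))
        scale = max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0
        shrunk.append([scale * x for x in row])
    return shrunk


# One word with a large shift (kept, but shrunk) and one with a tiny shift
# (zeroed out, so the word keeps its source-domain embedding).
shifts = [[3.0, 4.0], [0.1, 0.0]]
for row in group_soft_threshold(shifts, lam=1.0):
    print([round(x, 3) for x in row])
```

Because whole rows are zeroed rather than individual coordinates, only the few words whose usage really differs in the new domain end up with a changed embedding, which is the structure the abstract says the estimator exploits.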


2512.03818 2026-06-19 cs.CL (updated version)

Improving Alignment Between Human and Machine Codes: An Empirical Assessment of Prompt Engineering for Construct Identification in Psychology

改善人机编码对齐：心理学构念识别中提示工程的实证评估

Kylie L. Anglin, Stephanie Milan, Brittney Hernandez, Claudia Ventura

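Affiliations: Department of Educational Psychology, Neag School of Education, University of Connecticut; Department of Psychological Sciences, College of Liberal Arts and Sciences, University of Connecticut

AI summary: This study proposes an empirical framework for optimizing large language models' performance at identifying constructs in psychological texts through prompt engineering. Experiments evaluate five prompting strategies and find that the construct definition and task framing matter most; a few-shot approach combining codebook-guided selection with automatic prompt engineering comes closest to expert judgment.

Comments 22 pages, 2 figures

Details

AI Chinese abstract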

由于其架构和庞大的预训练数据，大语言模型（LLMs）表现出强大的文本分类性能。然而，LLM的输出——这里指分配给文本的类别——在很大程度上取决于提示的措辞。尽管关于提示工程的文献正在扩展，但很少有研究关注分类任务，更少有研究涉及心理学等领域，在这些领域中，构念具有精确的、理论驱动的定义，而这些定义可能未在预训练数据中得到充分体现。我们提出了一个实证框架，通过提示工程优化LLM在文本中识别构念的性能。我们实验评估了五种提示策略——代码簿引导的实证提示选择、自动提示工程、角色提示、思维链推理和解释性提示——采用零样本和少样本分类。我们发现，角色、思维链和解释并不能完全解决因措辞不当的提示而导致的性能损失。相反，提示中最有影响力的特征是构念定义、任务框架，以及在较小程度上提供的示例。在三个构念和两个模型中，与专家判断最一致的分类来自结合代码簿引导的实证提示选择和自动提示工程的少样本提示。基于我们的发现，我们建议研究人员生成并评估尽可能多的提示变体，无论是人工编写的、自动生成的，或者理想情况下两者兼有，并根据训练数据集中的实证性能选择提示和示例，在保留集中验证最终方法。该程序提供了一种实用、系统且理论驱动的方法，用于在需要与专家判断对齐的环境中优化LLM提示。

English abstract

Due to their architecture and vast pre-training data, large language models (LLMs) demonstrate strong text classification performance. However, LLM output (here, the category assigned to a text) depends heavily on the wording of the prompt. While literature on prompt engineering is expanding, few studies focus on classification tasks, and even fewer address domains like psychology, where constructs have precise, theory-driven definitions that may not be well represented in pre-training data. We present an empirical framework for optimizing LLM performance for identifying constructs in texts via prompt engineering. We experimentally evaluate five prompting strategies (codebook-guided empirical prompt selection, automatic prompt engineering, persona prompting, chain-of-thought reasoning, and explanatory prompting) with zero-shot and few-shot classification. We find that persona, chain-of-thought, and explanations do not fully address performance loss accompanying a badly worded prompt. Instead, the most influential features of a prompt are the construct definition, task framing, and, to a lesser extent, the examples provided. Across three constructs and two models, the classifications most aligned with expert judgments resulted from a few-shot prompt combining codebook-guided empirical prompt selection with automatic prompt engineering. Based on our findings, we recommend that researchers generate and evaluate as many prompt variants as feasible, whether human-crafted, automatically generated, or ideally both, and select prompts and examples based on empirical performance in a training dataset, validating the final approach in a holdout set. This procedure offers a practical, systematic, and theory-driven method for optimizing LLM prompts in settings where alignment with expert judgment is critical.
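The recommended procedure (score many prompt variants on a training set, pick the best, then validate once on a holdout set) can be sketched as below. This is a minimal sketch under assumptions: `agreement`, `select_prompt`, and `toy_predict` are hypothetical names, plain accuracy stands in for whatever agreement metric a study would actually use, and `toy_predict` is a keyword matcher standing in for a real LLM call.

```python
def agreement(predict, prompt, examples):
    """Fraction of texts where the predicted code equals the expert code."""
    hits = sum(predict(prompt, text) == label for text, label in examples)
    return hits / len(examples)


def select_prompt(predict, prompts, train, holdout):
    """Pick the prompt with the best training agreement, then score it on holdout."""
    best = max(prompts, key=lambda p: agreement(predict, p, train))
    return best, agreement(predict, best, holdout)


def toy_predict(prompt, text):
    # Hypothetical stand-in for an LLM: the "prompt" is just a keyword to look for.
    return int(prompt in text.lower())


TRAIN = [
    ("I feel so anxious today", 1),
    ("Nice weather", 0),
    ("anxious about exams", 1),
    ("worried sick", 1),
]
HOLDOUT = [("Anxious again", 1), ("calm day", 0)]
PROMPTS = ["anxious", "worried", "weather"]

best_prompt, holdout_score = select_prompt(toy_predict, PROMPTS, TRAIN, HOLDOUT)
print(best_prompt, holdout_score)
```

Keeping the holdout set out of the selection step matters: the training score of the winning prompt is optimistically biased because it was chosen for being the maximum.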


2512.18859 2026-06-19 cs.CL (updated version)

Toward Human-Centered AI-Assisted Terminology Work

迈向以人为中心的AI辅助术语工作

Antonio San Martin

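Affiliations: Université du Québec à Trois-Rivières

AI summary: This paper proposes a human-centered AI framework that, while using generative AI to automate terminology work, ensures the accuracy and reliability of terminological data by augmenting terminologists' capabilities and preserving human control.

Comments Accepted for publication in the journal Terminology

Details

AI Chinese abstract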

生成式AI可能通过创造自动化新机会来改变术语工作。同时，它引发了对术语学家和术语资源未来的担忧，因为效率压力可能鼓励过度自动化，认为人类专业知识可被AI取代。然而，由于错误、幻觉和各种形式的偏见，大型语言模型在术语目的上仍然不可靠，使得术语学家在确保术语数据的准确性和可靠性方面不可或缺。本文认为，以人为中心的AI（强调AI的主要目标应是促进人类福祉的方法）提供了一个框架，可以在最大化生成式AI收益的同时减轻其风险。它主张高水平的自动化和有意义的人类控制是兼容且可取的，AI应增强术语学家的能力，同时保留他们的自主权和决策权。通过三个相互关联的维度——增强的术语学家、伦理AI和以人为中心的设计——审视了AI辅助术语工作的影响。特别是，本文探讨了AI整合如何重塑术语学家的角色，影响专业价值观和工作条件，要求管理AI产生的偏见，并呼吁围绕术语学家的需求设计AI工具。本文得出结论，以人为中心的方向是必要的，以确保AI加强而非削弱术语工作在支持专业交流以及跨语言和跨文化准确传播知识中的关键作用。

English abstract

Generative AI is likely to transform terminology work by creating new opportunities for automation. At the same time, it raises concerns about the future of terminologists and terminological resources, as efficiency pressures may encourage excessive automation based on the perception that human expertise can be replaced by AI. However, large language models remain unreliable for terminological purposes due to errors, hallucinations, and various forms of bias, making terminologists indispensable for ensuring the accuracy and reliability of terminological data. This paper argues that human-centered AI, an approach that emphasizes that AI's primary goal should be to contribute to human well-being, provides a framework for maximizing the benefits of generative AI while mitigating its risks. It contends that high levels of automation and meaningful human control are compatible and desirable, and that AI should enhance terminologists' capabilities while preserving their agency and decision-making authority. The implications of AI-assisted terminology work are examined through three interrelated dimensions: the augmented terminologist, ethical AI, and human-centered design. In particular, the paper examines how AI integration reshapes the role of the terminologist, affects professional values and working conditions, requires the management of AI-generated bias, and calls for the design of AI tools around the terminologist's needs. The paper concludes that a human-centered orientation is necessary to ensure that AI strengthens, rather than undermines, the essential role of terminology work in supporting specialized communication and the accurate transmission of knowledge across languages and cultures.


2507.05169 2026-06-19 cs.LG cs.AI cs.CL cs.CV cs.RO (updated version)

Critique of World Model

世界模型批判

Eric Xing, Mingkai Deng, Jinyu Hou

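AI summary: Starting from the psychological notion of "hypothetical thinking", this essay argues that the core goal of a world model is to simulate all actionable possibilities of the real world, and it proposes a Generative Latent Prediction (GLP) architecture built on stateful, hierarchical, multi-level, mixed continuous/discrete representations.

Details

AI Chinese abstract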

世界模型，即生物智能体所经历并对其采取行动的真实世界环境的算法模拟器，近年来因开发具有人工（通用）智能的虚拟智能体的需求日益增长而成为一个新兴课题。关于世界模型究竟是什么、如何构建、如何使用以及如何评估，已有许多讨论。本文从著名科幻经典《沙丘》中的想象出发，并借鉴心理学文献中“假设性思维”的概念，论证世界模型的主要目标是模拟真实世界中所有可行动的可能性，以进行有目的的推理和行动。我们审视了世界建模的关键设计维度：数据、表示、架构、学习目标和使用，调查了现有方法并分析了它们的权衡。在此基础上，我们提出了一种新的通用世界模型生成式潜在预测（GLP）架构，基于有状态的、分层的、多层次的、混合连续/离散表示，以及生成式和自监督学习框架，并展望了由这种模型支持的物理、智能体和嵌套（PAN）AGI系统。

English abstract

World Model, the algorithmic simulator of the real-world environment which biological agents experience and act upon, has been an emerging topic in recent years due to the rising need to develop virtual agents with artificial (general) intelligence. There has been much discussion on what a world model really is, how to build it, how to use it, and how to evaluate it. In this essay, starting from the imagination in the famed Sci-Fi classic Dune, and drawing inspiration from the concept of "hypothetical thinking" in psychology literature, we argue the primary goal of a world model to be simulating all actionable possibilities of the real world for purposeful reasoning and acting. We examine the key design dimensions of world modeling: data, representation, architecture, learning objective, and usage, surveying existing approaches and analyzing their tradeoffs. Building on this examination, we propose a new Generative Latent Prediction (GLP) architecture for a general-purpose world model, based on stateful, hierarchical, multi-level, and mixed continuous/discrete representations, and a generative and self-supervised learning framework, with an outlook of a Physical, Agentic, and Nested (PAN) AGI system enabled by such a model.


2606.18941 2026-06-19 cs.PL cs.CL (updated version)

ESBMC-GraphPLC: Formal Verification of Graphical PLCopen XML Ladder Diagram Programs Using SMT-Based Model Checking

ESBMC-GraphPLC：使用基于SMT的模型检查对图形化PLCopen XML梯形图程序进行形式验证

Pierre Dantas, Lucas Cordeiro, Waldir Junior

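Affiliations: Computer Science, The University of Manchester; Electrical Engineering, Federal University of Amazonas (UFAM)

AI summary: To address ESBMC-PLC's inability to handle graphical PLCopen XML Ladder Diagrams, the paper proposes a DFS-based graphical LD resolver that converts the connection graph into Boolean contact conjunctions and applies a three-tier I/O inference scheme, producing full GOTO IR and verifying 3 graphical LD programs.

Comments 18 pages

Details

AI Chinese abstract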

PLCopen XML为IEC 61131-3梯形图程序定义了两种编码格式：一种使用<rung>元素的文本编码，另一种将梯形逻辑表示为localId/refLocalId连接的有向图的图形编码。ESBMC-PLC支持文本格式，但将来自CONTROLLINO、Beremiz和OpenPLC Editor的图形导出解析为空GOTO中间表示，导致空洞的验证成功。本文提出ESBMC-GraphPLC，通过基于DFS的图形LD解析器填补了这一空白。该解析器从leftPowerRail遍历连接图到每个线圈，将梯形路径提取为布尔触点合取，并应用三级I/O推断方案。按rightPowerRail的connectionPointIn序列对线圈排序，确保SET线圈在RESET线圈之前处理，匹配IEC扫描周期语义。图形到IR的转换无需改动ESBMC后端。在来自CONTROLLINO/OpenPLC Editor的3个图形LD程序上的验证表明，所有程序都生成了包含非确定性输入和梯形逻辑的完整GOTO IR，而之前生成的是空IR。所有3个程序在k=2时在70ms内验证为SAFE。11个文本LD基准测试完全保留，无回归。两个不含LD内容或不支持定时器语义的Beremiz示例被报告为发现的局限性。工件位于Zenodo（DantasCordeiro2026graphical，doi: https://doi.org/10.5281/zenodo.20699856）。

English abstract

PLCopen XML defines two encoding formats for IEC 61131-3 Ladder Diagram programs: a textual encoding using <rung> elements, and a graphical encoding that represents rung logic as a directed graph of localId/refLocalId connections. ESBMC-PLC supported the textual format but parsed graphical exports from CONTROLLINO, Beremiz, and OpenPLC Editor into an empty GOTO intermediate representation, causing vacuous verification success. This paper presents ESBMC-GraphPLC, which closes this gap with a DFS-based graphical LD resolver. The resolver traverses the connection graph from leftPowerRail to each coil, extracts rung paths as Boolean contact conjunctions, and applies a three-tier I/O inference scheme. Ordering coils by rightPowerRail connectionPointIn sequence ensures SET coils process before RESET coils, matching IEC scan-cycle semantics. The graphical-to-IR conversion leaves the ESBMC backend unchanged. Validation on 3 graphical LD programs from CONTROLLINO/OpenPLC Editor shows all produce full GOTO IR with nondeterministic inputs and rung logic, versus the empty IR previously. All 3 verify SAFE at k=2 under 70ms. The 11 textual LD benchmarks are fully preserved, with no regression. Two Beremiz examples with no LD content or unsupported timer semantics are reported as discovered limitations. Artifact at Zenodo (DantasCordeiro2026graphical, doi:10.5281/zenodo.20699856).
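The traversal described in the abstract (walk the connection graph from the left power rail to each coil and collect the contacts on each path as a conjunction) can be sketched on a toy graph. This is a minimal sketch under assumptions: the dictionary layout, the field names `kind`, `var`, `negated`, and `refs`, and the function `rung_paths` are hypothetical and are not ESBMC-GraphPLC's actual data structures. It assumes an acyclic graph and leaves out I/O inference and coil ordering.

```python
from collections import defaultdict


def rung_paths(elements):
    """Map each coil variable to a list of contact conjunctions (one per path)."""
    # refs point backwards (like refLocalId), so invert them into forward edges.
    successors = defaultdict(list)
    for local_id, elem in elements.items():
        for ref in elem.get("refs", []):
            successors[ref].append(local_id)

    paths = defaultdict(list)

    def dfs(local_id, literals):
        elem = elements[local_id]
        if elem["kind"] == "contact":
            literal = ("NOT " if elem.get("negated") else "") + elem["var"]
            literals = literals + [literal]  # copy, so sibling branches stay independent
        elif elem["kind"] == "coil":
            paths[elem["var"]].append(literals)
            return
        for nxt in successors[local_id]:
            dfs(nxt, literals)

    for local_id, elem in elements.items():
        if elem["kind"] == "leftPowerRail":
            dfs(local_id, [])
    return dict(paths)


# Toy seal-in rung: (start OR motor) AND NOT stop -> motor
ELEMENTS = {
    1: {"kind": "leftPowerRail", "refs": []},
    2: {"kind": "contact", "var": "start", "refs": [1]},
    3: {"kind": "contact", "var": "stop", "negated": True, "refs": [2, 4]},
    4: {"kind": "contact", "var": "motor", "refs": [1]},
    5: {"kind": "coil", "var": "motor", "refs": [3]},
}

print(rung_paths(ELEMENTS))
```

Each inner list is one path's conjunction; a coil reached by several paths is driven by the disjunction of those conjunctions, which is how parallel branches in a ladder rung read as OR.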


2511.23071 2026-06-19 cs.CV cs.AI cs.CL (updated version)

Bharat Scene Text: A Novel Comprehensive Dataset and Benchmark for Indian Language Scene Text Understanding

Anik De, Abhirama Subramanyam Penamakuri, Rajeev Yadav, Aditya Rathore, Harshiv Shah, Devesh Sharma, Sagar Agarwal, Pravin Kumar, Anand Mishra

Affiliations: Indian Institute of Technology Jodhpur

Comments Accepted in International Journal on Document Analysis and Recognition (IJDAR)

Journal ref International Journal on Document Analysis and Recognition (IJDAR), 2026

2406.15465 2026-06-19 cs.CL cs.AI (updated version)

RadEx: A Framework for Structured Information Extraction from Radiology Reports based on Large Language Models

Daniel Reichenpfader, Jonas Knupp, André Sander, Kerstin Denecke

Affiliations: Institute for Patient-centered Digital Health, Bern University of Applied Sciences, Biel, Switzerland; ID Suisse AG, St. Gallen, Switzerland

2306.12679 2026-06-19 cs.CL (updated version)

Constructing Colloquial Dataset for Persian Sentiment Analysis of Social Microblogs

Mojtaba Mazoochi, Leila Rabiei, Farzaneh Rahmani, Zeinab Rajabi

Affiliations: Faculty member in ICT Research Institute; Iran Telecommunication Research Center (ITRC); Faculty member in Computer Department; Mehralborz University; Hazrat-e Masoumeh University

Journal ref Multimedia Tools and Applications, 2025

1. Large Language Models and Foundation Models (5 papers)

MENTOR: Reinforcement Learning via Flexible Teacher-Optimized Rewards for Tool-Use Distillation

S2D2: Fast Decoding for Diffusion LLMs via Training-Free Self-Speculation

MixSD: Mixed Contextual Self-Distillation for Knowledge Injection

SIGMA: Search-Augmented On-Demand Knowledge Integration for Agentic Mathematical Reasoning

A Survey of On-Policy Distillation for Large Language Models

2. Machine Translation and Cross-Lingual Processing (2 papers)

TransLaw: A Large-Scale Dataset and Multi-Agent Benchmark Simulating Professional Translation of Hong Kong Case Law

Target-Side Paraphrase Augmentation for Sign Language Translation with Large Language Models

3. Dialogue Systems and Agents (4 papers)

Quality Over Clicks: Iterative Reinforcement Learning for Early-Stage E-Commerce Query Suggestion

TSAssistant: A Human-in-the-Loop Agentic Framework for Automated Target Safety Assessment

Proactive Conversational Assistant for a Procedural Manual Task based on Audio and IMU

CogniFold: Always-On Proactive Memory via Cognitive Folding

4. Text Generation, Summarization, and Editing (1 paper)

Med-R2: Perception and Reflection-driven Complex Reasoning for Medical Report Generation

5. Multimodal Language Processing (4 papers)

Omnilingual SONAR: Cross-Lingual and Cross-Modal Sentence Embeddings Bridging Massively Multilingual Text and Speech

IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models

EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models

Vero: An Open RL Recipe for General Visual Reasoning

6. Joint Speech-Language and Audio-Text (3 papers)

Analyzing Error Propagation in Korean Spoken QA with ASR-LLM Cascades

Towards Truly Multilingual ASR: Generalizing Code-Switching ASR to Unseen Language Pairs

NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR

7. Evaluation, Datasets, and Benchmarks (6 papers)

ShoppingBench: A Real-World Intent-Grounded Shopping Benchmark for LLM-based Agents

OpenLID-v3: Improving the Precision of Closely Related Language Identification -- An Experience Report

Telenor Nordics Customer Service self-help corpus

Benchmarking Local LLMs for Natural-Language-to-SQL Querying in Biopharmaceutical Manufacturing: An Empirical Benchmark on Consumer-Grade Hardware

Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio

Are LLMs Ready to Assist Physicians? PhysAssistBench for Interactive Doctor-Patient-EHR Assistance

8. Safety, Privacy, Fairness, and Interpretable NLP (6 papers)

From Construction to Injection: Edit-Based Fingerprints for Large Language Models

DeFrame: Debiasing Large Language Models Against Framing Effects

The ACUTE Protocol: Operationalizing Language Model Activations for Better Calibration, Utility, and Trust

The Voice Behind the Words: Quantifying Intersectional Bias in SpeechLLMs

Large Language Models Hack Rewards, and Society

Multimodal Evaluator Preference Collapse: Cross-Modal Contagion in Self-Evolving Agents

9. Low-Resource, Domain Adaptation, and Efficient Training (1 paper)

Group-Sparse Matrix Factorization for Transfer Learning of Word Embeddings

10. Other / General NLP (7 papers)

Improving Alignment Between Human and Machine Codes: An Empirical Assessment of Prompt Engineering for Construct Identification in Psychology

Toward Human-Centered AI-Assisted Terminology Work

Critique of World Model

ESBMC-GraphPLC: Formal Verification of Graphical PLCopen XML Ladder Diagram Programs Using SMT-Based Model Checking

Bharat Scene Text: A Novel Comprehensive Dataset and Benchmark for Indian Language Scene Text Understanding

RadEx: A Framework for Structured Information Extraction from Radiology Reports based on Large Language Models

Constructing Colloquial Dataset for Persian Sentiment Analysis of Social Microblogs