Large Language Models / LLM - arXivDaily Topic

2606.19266 2026-06-18 cs.CL cs.AI New submission 90%

Trade-offs in Medical LLM Adaptation: An Empirical Study in French QA

医学LLM适应中的权衡：法语问答的实证研究

Ikram Belmadani, Oumaima El Khettari, Carlos Ramisch, Frederic Bechet, Richard Dufour, Benoit Favre

Affiliations: Aix-Marseille Univ., CNRS, LIS UMR 7020; Nantes Univ., École Centrale Nantes, CNRS, LS2N UMR 6004; Grenoble Alpes Univ., CNRS, INRIA, Grenoble INP, LIG UMR 5217

Topic match (domain LLMs): comparison of domain-adaptation strategies for French medical LLMs

AI summary: Using French medical question answering, the paper empirically compares continual pretraining (CPT) and supervised fine-tuning (SFT) across several model families and sizes. CPT+SFT scores best on multiple-choice QA but with small gains, SFT is a strong and economical default, and CPT improves overlap-based metrics on open-ended QA.

AI-generated Chinese abstract

大型语言模型（LLMs）的发展导致了对它们适应专业领域和语言的关注增加，但领域适应策略的有效性仍不明确。我们以法语医学问答（QA）为案例，进行了医学领域适应的研究。我们比较了持续预训练（CPT）、监督微调（SFT）及其组合，跨越三个模型家族、多个规模和三种初始化类型，明确区分了适应效果与基础模型选择。我们在贪婪和约束解码下，使用自动指标和LLM-as-a-Judge评估，评估了多项选择问答（MCQA）和开放式问答（OEQA）。对于MCQA，CPT+SFT通常取得最佳分数，但相比SFT的增益很小且通常不显著，使得SFT成为强大且成本效益高的默认选择。对于OEQA，CPT持续改善基于重叠的指标，而SFT常降低生成质量；指令调优和CPT+SFT在基于LLM的评估中更受青睐。跨语言实验进一步显示，法语适应能有效迁移到英语基准。总体而言，我们为在计算约束下选择适应策略提供了实用指南。

English abstract

The development of large language models (LLMs) has led to an increased focus on their adaptation to specialized domains and languages, yet the effectiveness of domain adaptation strategies remains unclear. We present a study of medical domain adaptation using French medical question-answering (QA) as a case study. We compare continual pretraining (CPT), supervised fine-tuning (SFT), and their combination across three model families, multiple sizes, and three initialization types, explicitly disentangling adaptation effects from base model choice. We evaluate both multiple-choice (MCQA) and open-ended QA (OEQA) under greedy and constrained decoding using automatic metrics and LLM-as-a-Judge evaluation. For MCQA, CPT+SFT most often achieves the best scores, but gains over SFT are small and frequently not statistically significant, making SFT a strong and cost-effective default. For OEQA, CPT consistently improves overlap-based metrics, while SFT often degrades generation quality; instruction tuning and CPT+SFT are preferred by LLM-based evaluation. Cross-lingual experiments further show effective transfer from French adaptation to English benchmarks. Overall, we provide practical guidelines for selecting adaptation strategies under computational constraints.


2606.18699 2026-06-18 cs.CL cs.AI cs.IR New submission 90%

TW-LegalBench: Measuring Taiwanese Legal Understanding

TW-LegalBench: 衡量台湾法律理解

Fei-Yueh Chen, Chun Huang Lin, Chan Wei Hsu, Kuan Hsuan Yeh, Zih-Ching Chen, Kuan-Ming Chen, Patrick Chung-Chia Huang

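Affiliations: University of Rochester; National Taiwan University; NVIDIA

Topic match (domain LLMs): a Taiwanese legal-understanding benchmark for evaluating LLM legal reasoning

AI summary: Introduces the TW-LegalBench benchmark, covering multiple-choice questions, open-ended questions, and legal judgment prediction, and evaluates 13 LLMs on Taiwanese law. Top models clear the passing threshold for lawyers but fall short of the standard for judges and prosecutors, and they struggle to cite statutes accurately.

Comments 10 pages, 2 figures, To appear in ICAIL 2026

AI-generated Chinese abstract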

大型语言模型（LLM）在多种任务上展现出令人印象深刻的能力，但其在特定司法管辖区法律推理上的表现仍未充分探索。我们提出TW-LegalBench，利用台湾法律系统丰富的官方公开语料库，填补了在普通法基准（侧重英文来源）和大陆法基准（侧重简体中文来源）之外评估LLM在台湾法律上的空白。TW-LegalBench包含三种任务类型：（1）涵盖18个专业领域五年官方考试的超过16,000道多项选择题（MCQ）；（2）来自法律专业人员考试的117道开放式问答题（OEQ），附有官方评分标准；（3）超过14,000个法律判决预测（LJP）实例，涵盖数百种犯罪类别。我们使用MCQ的准确率、基于评分标准点的分解式LLM作为裁判框架评估OEQ，以及LJP的判决准确性和法条引用指标，评估了13个LLM。我们的结果显示，表现最佳的模型超过了合格律师的通过门槛（通过率：11%），但未达到法官和检察官的通过标准（通过率：1-2%）。对于LJP，虽然模型展示了合理的判决类型准确性和刑期预测能力，但它们难以准确引用具体法律条文。这些发现表明，即使LLM在资格考试上的表现接近人类水平，可靠的法律文本生成仍然具有挑战性。

English abstract

Large language models (LLMs) have shown impressive capabilities across diverse tasks, yet their performance on jurisdiction-specific legal reasoning remains underexplored. We present TW-LegalBench, which uses the Taiwanese legal system's rich, publicly available official corpus to fill the gap in evaluating LLMs on Taiwanese law, a gap left between common-law benchmarks that focus on English sources and civil-law benchmarks that focus on Simplified Chinese sources. TW-LegalBench comprises three task types: (1) over 16,000 multiple-choice questions (MCQs) across five years of official examinations in 18 professional domains; (2) 117 open-ended essay questions (OEQs) from examinations for legal professionals with official scoring rubrics; and (3) more than 14,000 legal judgment prediction (LJP) instances covering hundreds of crime categories. We evaluate 13 LLMs using accuracy for MCQs, a decomposed LLM-as-Judge framework based on the scoring rubric points for OEQs, and metrics for sentencing accuracy and statute citation for LJP. Our results reveal that top-performing models exceed the passing threshold for qualified lawyers (passing rate: 11%) but fall short of that for judges and prosecutors (passing rate: 1-2%). For LJP, while models demonstrate reasonable verdict type accuracy and sentence prediction capability, they struggle to cite exact legal articles. These findings highlight that reliable legal text generation remains challenging for LLMs, even though their performance on qualification examinations approaches human level.


2606.18600 2026-06-18 cs.DC New submission 85%

ShuntServe: Cost-Efficient LLM Serving on Heterogeneous Spot GPU Clusters

ShuntServe: 异构竞价型GPU集群上的成本高效LLM服务

Seungwoo Jeong, Moohyun Song, Juhyun Park, Kyungyong Lee

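Topic match (domain LLMs): proposes ShuntServe, a system that optimizes LLM serving on heterogeneous GPUs

AI summary: Proposes ShuntServe, which estimates performance with a roofline model and optimizes model placement with dynamic programming to maximize throughput on heterogeneous spot GPU clusters. It achieves fault tolerance by combining output-preserving request migration with a shared tensor store. Throughput is 1.42x and 1.35x higher than baselines, and cost efficiency improves by 31.9% and 31.2% over on-demand instances for offline and online serving.

Comments 18 pages, 16 figures, 5 tables

AI-generated Chinese abstract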

随着大语言模型（LLM）服务的广泛采用，在云环境中为这些模型提供服务的GPU资源成本已成为关键问题。竞价实例相比按需实例可节省高达90%的成本，但其频繁中断和有限可用性对连续LLM服务构成重大挑战。特别是GPU竞价实例的可用性比基于CPU的实例更低且更不稳定，使得依赖单一GPU类型的同构集群容易受到关联故障的影响。跨多种GPU类型的异构集群可以通过利用不同竞价池的互补可用性模式来解决这一问题，然而现有的LLM服务系统是为同构环境设计的，在异构GPU上部署时会遇到负载不均衡的问题。本文提出了ShuntServe，一个用于异构竞价型GPU集群的成本高效LLM服务系统。ShuntServe采用基于屋顶线模型的分析性服务性能估计器和基于动态规划的模型放置优化器，联合确定节点配置、并行化策略和层分配，以最大化跨异构GPU的吞吐量。为了增强使用竞价实例时的容错能力，ShuntServe将输出保留的请求迁移与通过共享张量存储的并发初始化相结合，通过重叠替换节点准备与持续服务来最小化迁移停机时间。在由L4、A10G和L40S GPU组成的异构AWS集群上对Llama-3.1-70B和Qwen3-32B的评估表明，ShuntServe的吞吐量比最先进的基线高出1.42倍和1.35倍，并且与按需实例相比，在离线服务和在线服务中分别实现了31.9%和31.2%的成本效率提升。

English abstract

As large language model (LLM) services become widely adopted, the cost of GPU resources for serving these models in cloud environments has emerged as a critical concern. Spot instances offer up to 90% cost savings over on-demand instances, but their frequent interruptions and limited availability pose significant challenges for continuous LLM serving. GPU spot instances, in particular, exhibit lower and more volatile availability than CPU-based instances, making homogeneous clusters that depend on a single GPU type vulnerable to correlated failures. Heterogeneous clusters spanning multiple GPU types can address this by leveraging complementary availability patterns across diverse spot pools, yet existing LLM serving systems are designed for homogeneous environments and suffer from load imbalance when deployed on heterogeneous GPUs. This paper presents ShuntServe, a cost-efficient LLM serving system for heterogeneous spot GPU clusters. ShuntServe employs a roofline model-based analytical serving performance estimator and a dynamic programming-based model placement optimizer that jointly determines node configuration, parallelization strategy, and layer assignment to maximize throughput across heterogeneous GPUs. To enhance fault tolerance when using spot instances, ShuntServe combines output-preserving request migration with concurrent initialization via a shared tensor store, minimizing migration downtime by overlapping replacement node preparation with ongoing serving. Evaluation on Llama-3.1-70B and Qwen3-32B with a heterogeneous AWS cluster of L4, A10G, and L40S GPUs shows that ShuntServe achieves 1.42x and 1.35x higher throughput than state-of-the-art baselines and attains 31.9% and 31.2% cost efficiency improvements over on-demand instances for offline and online serving, respectively.
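The roofline estimate and layer assignment mentioned in the abstract can be illustrated with a minimal sketch. This is not ShuntServe's estimator or its dynamic-programming optimizer: the `Gpu` class, the function names, and every number are hypothetical, and the proportional layer split is a much simpler stand-in for the paper's dynamic programming.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    """Hypothetical device description; any numbers passed in are placeholders."""
    name: str
    peak_tflops: float   # peak compute, in TFLOP/s
    mem_bw_tbps: float   # memory bandwidth, in TB/s


def attainable_tflops(gpu, flops_per_byte):
    """Roofline bound: performance is capped by compute or by memory traffic."""
    return min(gpu.peak_tflops, gpu.mem_bw_tbps * flops_per_byte)


def split_layers(num_layers, speeds):
    """Assign layers in proportion to estimated speed (largest-remainder rounding)."""
    total = sum(speeds)
    exact = [num_layers * s / total for s in speeds]
    counts = [int(x) for x in exact]
    # Hand out the layers lost to truncation, biggest fractional part first.
    by_remainder = sorted(
        range(len(speeds)), key=lambda i: exact[i] - counts[i], reverse=True
    )
    for i in by_remainder[: num_layers - sum(counts)]:
        counts[i] += 1
    return counts
```

Feeding each GPU's roofline estimate into `split_layers` gives faster devices more layers, which is the intuition behind balancing load across heterogeneous GPUs; a real optimizer would also account for memory capacity and communication cost.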


2606.18596 2026-06-18 cs.HC cs.AI New submission 80%

Better Adherence, Richer Context: A Field Evaluation of LLM-Powered Conversational Voice Diaries for Sleep

更好的依从性，更丰富的上下文：基于LLM的对话式语音睡眠日记的现场评估

Amama Mahmood, Bokyung Kim, Honghao Zhao, Molly E. Atwood, Luis F. Buenaver, Michael T. Smith, Chien-Ming Huang

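Affiliations: The Johns Hopkins University; Department of Psychiatry and Behavioral Sciences, The Johns Hopkins University School of Medicine

Topic match (domain LLMs): field evaluation of an LLM-powered conversational voice diary for sleep

AI summary: A field study evaluates an LLM-powered conversational voice sleep diary. Compared with a text diary, the voice diary improved adherence and collected more detailed contextual information, but some structured fields were less complete.

AI-generated Chinese abstract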

睡眠日记是行为睡眠医学和失眠认知行为疗法的核心，但每日完成难以维持，静态形式通常为解释夜间睡眠变化提供的上下文有限。我们设计了一个基于LLM的对话式语音日记，通过主动智能音箱提示、结构化对话输入和自适应后续对话，提供临床基础的早晚睡眠日记问题。我们在为期四周的受试者间现场研究中评估了该系统，涉及30名大学生，使用匹配的日记项目、报告窗口和提醒间隔，与基于文本的移动日记进行比较。与文本日记相比，对话式语音日记显示出更高的依从性，并引发了关于日常习惯、压力源、环境条件和其他睡眠相关因素的更详细上下文自我报告。参与者还描述语音日记更容易融入日常，尽管感知完成时间更长。然而，基于语音的对话输入导致某些结构化日记字段的完整性较低，揭示了表达丰富性与结构化精度之间的权衡。这些发现展示了使用基于LLM的对话式语音助手进行纵向健康自我报告的前景和挑战。

English abstract

Sleep diaries are central to behavioral sleep medicine and cognitive behavioral therapy for insomnia, yet daily completion is difficult to sustain, and static forms often provide limited context for interpreting night-to-night sleep variation. We designed an LLM-powered conversational voice diary that delivers clinically grounded morning and evening sleep diary questions through proactive smart-speaker prompts, structured conversational intake, and adaptive follow-up dialogue. We evaluated the system in a four-week between-subjects field study with 30 university students, comparing it with a text-based mobile diary using matched diary items, reporting windows, and reminder intervals. Compared with the text-based diary, the conversational voice diary showed higher adherence and elicited more detailed contextual self-report about routines, stressors, environmental conditions, and other sleep-related factors. Participants also described the voice diary as easier to integrate into daily routines, despite longer perceived completion time. However, voice-based conversational intake produced lower completeness for some structured diary fields, revealing a trade-off between expressive richness and structured precision. These findings show both the promise and the challenge of using LLM-powered conversational voice assistants for longitudinal health self-report.


2606.18989 2026-06-18 cs.CL cs.AI New submission 75%

G-IdiomAlign: A Gloss-Pivoted Benchmark for Cross-Lingual Idiom Alignment

G-IdiomAlign：基于释义的跨语言习语对齐基准

Fengying Ye, Yanming Sun, Runzhe Zhan, Zheqi Zhang, Lidia S. Chao, Derek F. Wong

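Affiliations: NLP2CT Lab, Department of Computer and Information Science, University of Macau; Faculty of Arts and Humanities, University of Macau

Topic match (domain LLMs): builds a cross-lingual idiom alignment benchmark to evaluate LLM translation ability

AI summary: Proposes the G-IdiomAlign benchmark, which anchors idioms with Wiktionary glosses, builds a high-confidence alignment set, and defines a multiple-choice equivalence test and a gloss-contrastive generation protocol, revealing a literal-translation bias in LLM idiom translation.

Comments Accepted to ACL 2026

AI-generated Chinese abstract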

习语由于其非组合性和弱表层形式基础，难以跨语言转换，使得字面映射不可靠。我们提出G-IdiomAlign，一个基于释义的基准，其中每个习语通过维基词典的英语释义进行锚定。我们进一步构建了一个高置信度的参考对齐集，用于可重复评估。G-IdiomAlign支持两种协议：（1）受控的多项选择习语等价测试，带有类型化干扰项用于错误归因；（2）释义对比生成，对比无释义和有释义输入，以隔离显式语义枢轴的影响。在不同的大语言模型中，字面翻译偏差是主要的失败模式，尤其是当目标语言是低资源语言时。在基于嵌入的语义代理下，释义一致地改善了释义对比生成，但性能仍然有限，表明在开放输出空间中存在显著提升空间。随后对Qwen3-8B的分析进一步表明，跨条件差异更多集中在注意力头而非层中，而有释义生成更好的情况与更强的释义锚定相关。

English abstract

Idioms are difficult to transfer across languages due to their non-compositionality and weak surface-form grounding, making literal mappings unreliable. We present G-IdiomAlign, a gloss-pivoted benchmark where each idiom is anchored by an English gloss from Wiktionary. We further construct a high-confidence reference alignment set for reproducible evaluation. G-IdiomAlign supports two protocols: (1) a controlled Multiple-Choice Idiom Equivalence with typed distractors for error attribution; and (2) a Gloss-Contrastive Generation contrasting No-gloss and With-gloss inputs to isolate the effect of an explicit semantic pivot. Across diverse LLMs, a bias to literal translation is a dominant failure mode, especially when the target is a low-resource language. Glosses consistently improve Gloss-Contrastive Generation under an embedding-based semantic proxy, but performance remains modest, indicating substantial headroom in the open output space. Subsequent analysis on Qwen3-8B further suggests that cross-condition differences are concentrated more in attention heads than in layers, while better With-gloss generations coincide with stronger gloss anchoring.


2606.18986 2026-06-18 cs.CL cs.AI New submission 75%

Beyond Tokenization: Direct Timestep Embedding and Contrastive Alignment for Time-Series Question Answering

超越分词：面向时间序列问答的直接时间步嵌入与对比对齐

Yafeng Wu, Huu Hiep Nguyen, Thin Nguyen, Hung Le

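Affiliations: Deakin University

Topic match (domain LLMs): proposes a time-series QA framework that embeds timesteps directly to avoid the tokenization bottleneck

AI summary: Proposes the CADE framework, which embeds every timestep directly with a point-wise linear encoder to avoid the tokenization bottleneck and aligns time series with text anchors through a one-directional supervised contrastive loss, improving performance on six TSQA tasks on the Time-MQA benchmark.

AI-generated Chinese abstract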

大型语言模型的最新进展催生了时间序列问答（TSQA），它将时间序列分析表述为自然语言问答。然而，直接将原始数值序列输入LLM会遇到分词瓶颈：字节对编码将连续值分割成不稳定的词元，其嵌入缺乏有意义的度量结构，导致幅度、尺度和趋势信息的丢失。先前的方法使用基于分块的编码器将序列分割成固定窗口，锁定单一粒度，这会破坏模式并隐藏确切的时间步，且通过一个在不同长度或采样率的数据集上很少迁移的独立模块实现。为了解决这一挑战，我们提出了CADE（对比对齐与直接嵌入），一个基于两个关键组件构建的TSQA新框架：直接时间步嵌入和语义对齐。该框架通过逐点线性编码器和MLP投影器将每个时间步直接映射到LLM嵌入空间，保留了精确的索引级访问，同时消除了分块和填充的需要。为了进一步弥合时间序列与语言表示之间的语义差距，我们引入了一种新颖的单向监督对比损失，将时间序列嵌入与冻结的类名文本锚点对齐。在公开的Time-MQA基准上的实验结果表明，我们的框架在六项TSQA任务上持续提升了性能，优于开源和专有的LLM基线。

English abstract

Recent advances in large language models (LLMs) have given rise to time-series question answering (TSQA), which formulates time-series analysis as natural-language question answering. However, directly feeding raw numerical series into LLMs suffers from a tokenization bottleneck: Byte Pair Encoding fragments continuous values into unstable tokens whose embeddings lack meaningful metric structure, resulting in the loss of magnitude, scale, and trend information. Prior methods use patch-based encoders that split the series into fixed windows, locking in one granularity that breaks patterns and hides exact timesteps, through a separate module that rarely transfers across datasets with different lengths or sampling rates. To address this challenge, we propose CADE (Contrastive Alignment with Direct Embedding), a novel framework for TSQA built upon two key components: direct timestep embedding and semantic alignment. The proposed framework maps each timestep directly into the LLM embedding space through a point-wise linear encoder and MLP projector, preserving exact index-level access while eliminating the need for patching and padding. To further bridge the semantic gap between time-series and language representations, we introduce a novel one-directional supervised contrastive loss that aligns time-series embeddings with frozen class-name text anchors. Experimental results on the public Time-MQA benchmark demonstrate that our framework consistently improves performance across six TSQA tasks, outperforming both open-source and proprietary LLM baselines.
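The two components named in the abstract, direct timestep embedding and series-to-anchor alignment, can be sketched roughly as follows. This is only an illustration under stated assumptions, not the CADE implementation: the function names, the mean pooling, the cosine similarity, and the temperature are all assumptions, and the MLP projector is omitted. The loss is one-directional in the sense that only the series-to-anchor direction is scored while the anchors are treated as fixed.

```python
import math


def embed_timesteps(series, weight, bias):
    """Point-wise linear encoder: each scalar timestep x_t becomes x_t * weight + bias."""
    return [[x * w + b for w, b in zip(weight, bias)] for x in series]


def mean_pool(rows):
    """Average the per-timestep vectors into one series-level vector (an assumption)."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def _unit(v):
    # Assumes a non-zero vector.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def series_to_anchor_loss(series_emb, anchors, label, temperature=0.1):
    """Cross-entropy over cosine similarities from one series to every class anchor."""
    z = _unit(series_emb)
    logits = [sum(p * q for p, q in zip(_unit(a), z)) / temperature for a in anchors]
    top = max(logits)  # subtract the max for numerical stability
    log_norm = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_norm - logits[label]
```

Because every timestep keeps its own vector, index-level access is preserved without patching or padding; the loss is lowest when the pooled series vector points toward the anchor of its own class.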


2606.18803 2026-06-18 cs.AI cs.CY New submission 75%

ProfiLLM: Utility-Aligned Agentic User Profiling for Industrial Ride-Hailing Dispatch

ProfiLLM: 面向工业网约车调度的效用对齐智能用户画像

Tengfei Lyu, Zirui Yuan, Xu Liu, Kai Wan, Zihao Lu, Li Ma, Hao Liu

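Affiliations: Didichuxing Co. Ltd

Topic match (domain LLMs): LLMs applied to industrial dispatch, which counts as a domain LLM

AI summary: Proposes ProfiLLM, an agentic LLM data pipeline that combines tool-augmented global knowledge mining with utility-aligned profile exploration to profile users from large-scale behavioral logs in industrial ride-hailing dispatch. On DiDi's production system it achieves up to +6.14% relative AUC improvement in outcome prediction and up to +4.35% GMV gain in dispatching simulation.

AI-generated Chinese abstract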

将大型语言模型（LLM）作为语义特征提取器引入工业网约车调度，处理平台规模的行为日志，是一个引人注目但尚未充分探索的数据系统问题。生产匹配管道仍然以结构化数值特征为主，但关键的行为信号（例如，驾驶员对某些区域的习惯性厌恶）本质上是上下文相关的，并且可以自然地表达为LLM生成的用户画像。然而，将这种画像扩展到实时的、毫秒级延迟的调度器面临三个相互交织的约束，这些约束很少被一起解决：在一个拥有数百万日订单量的平台上，日志超出任何LLM的上下文窗口数个数量级；大多数用户是长尾用户，交互太少无法进行单个用户画像；表面流畅的画像不一定能提高下游预测效用。我们提出了ProfiLLM，一个智能LLM数据管道，通过两个模块实现面向生产匹配系统的效用对齐用户画像。（1）工具增强全局知识挖掘：为LLM智能体配备27个分析工具，用于挖掘平台规模的数据，生成可复用的全局知识、自适应用户聚类规则和区域级供需先验。（2）效用对齐画像探索：为每个聚类生成多个候选画像，通过轻量级下游效用代理进行评估，迭代优化最佳候选，并为DPO微调构建偏好对。在滴滴生产调度器上部署后，ProfiLLM在结果预测中实现了高达+6.14%的相对AUC改进，在调度模拟中实现了高达+4.35%的GMV增长，并在14天在线A/B测试中持续改进，包括+0.47% GMV、+0.33%完成率和-0.82%接单前取消率。

English abstract

Bringing Large Language Models (LLMs) into industrial ride-hailing dispatch as semantic feature extractors over platform-scale behavioral logs is a compelling but under-explored data systems problem. Production matching pipelines remain dominated by structured numerical features, yet decisive behavioral signals (e.g., a driver's habitual aversion to certain regions) are inherently contextual and naturally expressible as LLM-generated user profiles. However, scaling such profiling to a live, millisecond-latency dispatcher faces three intertwined constraints rarely addressed together: on a platform with millions of daily orders, logs exceed any LLM's context window by orders of magnitude; most users are long-tail, with too few interactions for per-user profiling; and surface-fluent profiles do not necessarily improve downstream prediction utility. We present ProfiLLM, an agentic LLM data pipeline that operationalizes utility-aligned user profiling for production matching systems through two modules. (1) Tool-Augmented Global Knowledge Mining equips an LLM agent with 27 analytical tools to mine platform-scale data, producing reusable global knowledge, adaptive user clustering rules, and region-level supply-demand priors. (2) Utility-Aligned Profile Exploration generates multiple candidate profiles per cluster, evaluates them via a lightweight downstream utility proxy, iteratively refines the best candidates and constructs preference pairs for DPO fine-tuning. Deployed on DiDi's production dispatcher, ProfiLLM achieves up to +6.14% relative AUC improvement in outcome prediction, up to +4.35% GMV gain in dispatching simulation, and consistent improvements in a 14-day online A/B test including +0.47% GMV, +0.33% Completion Rate, and -0.82% Cancel-Before-Accept rate.
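The step that turns scored candidate profiles into preference pairs for DPO fine-tuning can be sketched as below. This is a hedged illustration, not ProfiLLM's pipeline: `build_preference_pairs`, the `utility` callback, and the `min_gap` threshold are hypothetical names, and the abstract does not say how pairs are actually selected.

```python
from itertools import combinations


def build_preference_pairs(candidates, utility, min_gap=0.0):
    """Return (chosen, rejected) profile pairs ordered by a downstream utility score."""
    scored = sorted(
        ((utility(c), c) for c in candidates), key=lambda sc: sc[0], reverse=True
    )
    pairs = []
    # combinations() keeps the sorted order, so the first item never scores lower.
    for (hi_score, hi), (lo_score, lo) in combinations(scored, 2):
        if hi_score - lo_score > min_gap:
            pairs.append((hi, lo))
    return pairs
```

In practice `utility` would wrap the lightweight downstream proxy the abstract mentions; a positive `min_gap` drops near-ties so that only clearly separated profiles form a pair.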


2606.18597 2026-06-18 cs.CL New submission 75%

Low-resource Language Discrimination Towards Chinese Dialects with Transfer learning and Data Augmentation

低资源中文方言辨识：基于迁移学习与数据增强

Fan Xu, Yangjie Dan, Keyu Yan, Yong Ma, Mingwen Wang

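Affiliations: Jiangxi Normal University

Topic match (domain LLMs): transfer learning and data augmentation for Chinese dialect discrimination

AI summary: To address scarce annotated resources for Chinese dialects, proposes the CDDTLDA framework, which combines transfer learning and data augmentation: a source-side ASR model, target-side augmentation and fine-tuning, and self-attention to capture shared semantic features. It significantly outperforms existing methods.

Comments Published in ACM TALLIP

AI-generated Chinese abstract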

中文方言辨识是一项具有挑战性的自然语言处理任务，由于标注资源稀缺。本文中，我们开发了一种新颖的中文方言辨识框架，结合迁移学习与数据增强（CDDTLDA），以克服资源短缺问题。具体来说，我们首先使用一个较大的中文方言语料库训练一个源端自动语音识别（ASR）模型。然后，我们采用一种简单但有效的数据增强方法（即速度、音高和噪声干扰）来增强目标端低资源中文方言，并基于之前的源端ASR模型微调另一个目标ASR模型。同时，通过使用自注意力机制，可以捕获源端和目标端ASR模型之间的潜在共性语义特征。最后，我们提取目标ASR模型中的隐藏语义表示来进行中文方言辨识。我们广泛的实验结果表明，我们的模型在两个基准中文方言语料库上显著优于最先进的方法。

English abstract

Discriminating Chinese dialects is a challenging natural language processing task because annotated resources are scarce. In this article, we develop a novel Chinese dialect discrimination framework with transfer learning and data augmentation (CDDTLDA) to overcome this shortage. Specifically, we first use a relatively large Chinese dialect corpus to train a source-side automatic speech recognition (ASR) model. Then, we adopt a simple but effective data augmentation method (i.e., speed, pitch, and noise perturbation) to augment the target-side low-resource Chinese dialects, and fine-tune another target ASR model based on the source-side ASR model. Meanwhile, a self-attention mechanism captures the latent semantic features shared by the source-side and target-side ASR models. Finally, we extract the hidden semantic representation from the target ASR model to perform Chinese dialect discrimination. Extensive experiments demonstrate that our model significantly outperforms state-of-the-art methods on two benchmark Chinese dialect corpora.
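Two of the three augmentations listed in the abstract, speed and noise, can be sketched on a raw waveform as follows. This is a minimal sketch under assumptions, not the authors' code: the function names, the linear interpolation, and the SNR parameter are illustrative choices. Pitch perturbation is left out because it needs a time-stretching step, and note that this naive resampling also shifts pitch along with speed.

```python
import math
import random


def change_speed(samples, factor):
    """Resample by linear interpolation; factor > 1 shortens (speeds up) the clip."""
    n_out = max(1, int(round(len(samples) / factor)))
    out = []
    for i in range(n_out):
        pos = min(i * factor, len(samples) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


def add_noise(samples, snr_db, seed=0):
    """Add white Gaussian noise scaled to the requested signal-to-noise ratio."""
    rng = random.Random(seed)
    signal_power = sum(x * x for x in samples) / len(samples)
    noise_std = math.sqrt(signal_power / (10 ** (snr_db / 10)))
    return [x + rng.gauss(0.0, noise_std) for x in samples]
```

Applying several speed factors and noise levels to each low-resource utterance multiplies the amount of target-side training audio without new annotation.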


2606.19167 2026-06-18 cs.SE New submission 70%

Teaching Software Engineering with LLM and MCP Integration: From Classroom to Industry Practice

用LLM和MCP集成教学软件工程：从课堂到工业实践

Kehui Chen, Jacky Keung, Weining Li, Xiangbing Shao, Yishu Li, Xiaoxue Ma

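Topic match (domain LLMs): uses LLMs to support software engineering teaching, but is not a core model innovation

AI summary: Integrates LLMs and MCP into a collaborative teaching model for software engineering, embedding LLM- and MCP-driven tools into teaching, code assistance, and engineering simulations to bridge the gap between traditional instruction and industrial workflows and to improve students' programming, problem-solving, and intelligent-tool skills.

Comments Accepted by the International Symposium on Educational Technology (ISET) 2026

AI-generated Chinese abstract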

大型语言模型（LLM）和模型上下文协议（MCP）在工业软件工程中的快速集成，迫切要求更新软件工程教育以跟上新兴技术和不断变化的行业需求。本研究探讨了一种创新方法，将LLM和MCP集成到软件工程教育的协作教学模式中，旨在构建一个与实际工程实践紧密相连的实用学习框架。通过将LLM和MCP驱动的工具嵌入日常教学、代码辅助和工程模拟中，该模型有效弥合了传统教学与工业工作流程之间的差距。这种集成增强了学生的编程能力、实际问题解决能力以及使用智能工程工具的熟练度。此外，通过与行业实习的合作，学生可以在真实环境中应用这些技术，进一步加强学术准备与专业实践之间的联系。总体而言，本研究为人工智能时代软件工程教育的改革与创新提供了一条实用路径。

English abstract

The rapid integration of Large Language Models (LLMs) and the Model Context Protocol (MCP) into industrial software engineering has created a pressing need to update software engineering education to align with emerging technologies and evolving industry demands. This study investigates an innovative approach that integrates LLMs and MCP into a collaborative teaching model for software engineering education, aiming to build a practical learning framework closely connected to real-world engineering practices. By embedding LLM and MCP driven tools into daily teaching, code assistance, and engineering simulations, the model effectively bridges the gap between traditional instruction and industrial workflows. This integration enhances students' programming competence, practical problem-solving abilities, and proficiency in using intelligent engineering tools. Furthermore, through partnerships with industry internships, students can apply these technologies in real-world settings, further strengthening the connection between academic preparation and professional practice. Overall, this research offers a practical pathway for reforming and innovating software engineering education in the era of artificial intelligence.


2606.18789 2026-06-18 eess.SY cs.SY New submission 70%

PowerAgentBench-SS: A Benchmark for Agentic AI in Power System Steady-State Studies

PowerAgentBench-SS：电力系统稳态研究中智能体AI的基准测试

Costas Mylonas, Magda Foti, Andrea Pomarico, Matheus Duarte, Qian Zhang, Emmanouel Varvarigos

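Topic match (domain LLMs): a benchmark for LLM agents in the power system domain

AI summary: Proposes the PowerAgentBench-SS benchmark framework for evaluating whether LLM agents can carry out engineering workflows in power system steady-state studies, distinguishing agents through a tool API, a validation budget, and risk-sensitive metrics.

AI-generated Chinese abstract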

电力系统基准测试通常评估数值求解器、预测模型或顺序控制器。这些基准是必要的，但它们不直接测试大型语言模型（LLM）智能体是否能执行工程工作流：检查电网案例、选择工具、调用模拟器、筛选预想事故、提出可接受的缓解措施、验证结果并生成可审计的证据链。本文介绍了PowerAgentBench-SS，一个用于评估电力系统运行和规划研究中工具使用智能体的稳态基准框架。该基准向智能体公开案例数据、动作约束、工具API和验证预算，同时隐藏的评估器重新计算物理有效性并对提交的报告进行评分。我们定义了智能体接口、工具契约、证据日志和风险敏感指标，包括提交召回率、证据支持召回率、发现召回率、假安全惩罚、严重性遗憾、残余违规分数、动作成本、工具使用效率和工作流诊断。为了使框架具体化，我们在可复现的直流热N-2预想事故搜索试点中实例化该协议，使用确定性IEEE 39节点运行点变体，包括脚本基线、LLM JSON命令适配器、三个本地托管的Ollama LLM智能体和一个OpenAI API智能体。结果表明为什么仅求解器或仅答案评估是不够的：智能体不仅通过顶级预想事故发现来区分，还通过验证预算使用、显式提交、类型强制、重复验证、证据支持报告和缓解行为来区分。

English abstract

Power system benchmarks usually evaluate numerical solvers, prediction models, or sequential controllers. These benchmarks are necessary, but they do not directly test whether a Large Language Model (LLM) agent can execute an engineering workflow: inspect a grid case, select tools, call simulators, screen contingencies, propose admissible mitigations, validate results, and produce an auditable evidence trail. This paper introduces PowerAgentBench-SS, a steady-state benchmark framework for evaluating tool-using agents in power system operation and planning studies. The benchmark exposes public case data, action constraints, a tool API, and a validation budget to an agent, while a hidden evaluator recomputes physical validity and scores the submitted report. We define the agent interface, tool contract, evidence log, and risk-sensitive metrics, including submitted recall, evidence-backed recall, found recall, false-safe penalties, severity regret, residual violation score, action cost, tool-use efficiency, and workflow diagnostics. To make the framework concrete, we instantiate the protocol in a reproducible DC thermal N-2 contingency-search pilot on deterministic IEEE 39-bus operating-point variants, with scripted baselines, an LLM JSON-command adapter, three locally hosted Ollama LLM agents, and one OpenAI API agent. The results show why solver-only or answer-only evaluation is insufficient: agents are distinguished not only by top-contingency discovery, but also by validation-budget use, explicit submission, type coercions, duplicate validations, evidence-backed reporting, and mitigation behavior.


2606.18636 2026-06-18 cs.CL cs.AI New submission 70%

PEC-Home: Interpretation of Progressively Elliptical Commands in Smart Homes

PEC-Home：智能家居中渐进式省略命令的解释

Yingyu Shan, Zeming Liu, Silin Li, Boao Qian, Jiashu Yao, Yuhang Guo, Haifeng Wang

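Affiliations: Beijing Institute of Technology; Beihang University; Baidu Inc.

Topic match (domain LLMs): interpreting progressively elliptical commands in smart homes

AI summary: Users in smart homes rely on shared context and issue progressively elliptical commands, which creates referential and intention ambiguity. The paper introduces PEC-Home, the first simulated home dataset for this problem, and experiments show that existing LLM assistants struggle to execute elliptical commands accurately.

Comments Accepted by ACL 2026 Findings

AI-generated Chinese abstract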

近年来，大型语言模型（LLM）的进步使家庭助手具备了自然语言交互能力。然而，当前的助手忽略了人类对话中随着共享上下文积累而发生的渐进式省略，即为了高效沟通而使用更简洁的表达。因此，当前助手仍难以准确解释此类省略表达，限制了其在现实应用中的有效性。在实际智能家居场景中，助手面临由省略命令引起的两大挑战：（1）多个用户对环境期望不同导致的指代歧义；（2）用户偏好随时间或环境变化导致的意图歧义。为应对这些挑战，我们引入了PEC-Home，这是首个专门为解释智能家居中渐进式省略命令而设计的模拟家庭数据集。在包括GPT-4o在内的多种LLM上的广泛实验表明，现有的家庭助手难以仅基于省略命令执行用户意图的操作。即使配备存储和检索用户对话历史的工具，其执行准确率仍低于使用完整命令时的水平。

English abstract

Recent advancements in Large Language Models (LLMs) have empowered home assistants with natural language interaction capabilities. However, current assistants overlook the progressive omission that occurs in human dialogue as shared context accumulates, leading to more elliptical expressions for efficient communication. Thus, current assistants still struggle to interpret such elliptical expressions accurately, which limits their effectiveness in real-world applications. In practical smart home scenarios, assistants face two major challenges caused by elliptical commands: (1) referential ambiguity caused by different environmental expectations among multiple users; and (2) intention ambiguity resulting from user preferences that evolve over time or change with the environment. To address these challenges, we introduce PEC-Home, the first simulated home dataset specifically designed for interpreting progressively elliptical commands in smart homes. Extensive experiments on various LLMs, including GPT-4o, show that existing home assistants struggle to execute user-intended operations based solely on elliptical commands. Even when equipped with tools for storing and retrieving user dialogue history, execution accuracy remains below that achieved with complete commands.


2606.18584 2026-06-18 cs.CL New submission 70%

Speech-Driven End-to-End Language Discrimination towards Chinese Dialects

语音驱动的端到端汉语方言语言鉴别

Fan Xu, Jian Luo, MingWen Wang, GuoDong Zhou

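Affiliations: Jiangxi Normal University; Soochow University

Topic match (domain LLMs): speech-driven end-to-end language discrimination for Chinese dialects

AI summary: For the hard problem of discriminating similar languages and dialects, proposes a speech-driven method based on MFCC features and an end-to-end HMM-DNN model, using attention and a CNN to fuse word embeddings with MFCC features; it outperforms existing methods on benchmark corpora.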

Comments Published in ACM TALLIP

2606.18560 2026-06-18 cs.SD New submission 70%

Constraining to Generalize: Subspace Tuning for Few-shot Generalization of Audio-Language Models

约束泛化：音频-语言模型少样本泛化的子空间微调

Jaehyuk Jang, Kangwook Ko, Wonjun Lee, Changick Kim

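Affiliations: KAIST

Topic match (domain LLMs): subspace tuning to improve few-shot generalization of audio-language models

AI summary: To address the base-new trade-off caused by few-shot fine-tuning of audio-language models, proposes Subspace Tuning (SubT), which constrains text-embedding drift through structured subspace parameterization and residual anchoring and suppresses negative transfer with subspace-aware gating, achieving efficient and strong generalization across 11 audio benchmarks.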

2606.18372 2026-06-18 cs.CL cs.AI New submission 60%

Redact or Keep? A Fully Local AI Cascade for Educational Dialogue De-Identification

保留还是删除？用于教育对话去标识的完全本地AI级联框架

Haocheng Zhang, Zhuqian Zhou, Kirk Vanacore, Bakhtawar Ahtisham, René F. Kizilcec

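Affiliations: Cornell University

Topic match (domain LLMs): uses a local LLM cascade to de-identify educational dialogue

AI summary: To handle curricular terms being confused with personally identifiable information in educational dialogue, proposes a fully local cascade in which a recall-first union proposer and a context-aware reviewer perform constrained privacy triage, reaching 0.958 macro F1 on math tutoring dialogues and outperforming a commercial API and LLM-only baselines.

AI-generated Chinese abstract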

教育对话是研究中有价值但敏感的资源：捕捉真实学习的同一份转录往往也包含与课程内容纠缠的个人身份信息（PII），其中“Riemann”可能指真实学生或数学概念。现有方法在治理和准确性之间强制权衡。商业大型语言模型（LLM）可以处理这种歧义，但需要将学生数据发送给第三方，而本地命名实体识别（NER）系统保留治理但过度删除课程术语。我们提出一个完全本地的级联框架，将去标识从开放式实体识别重新定义为约束性隐私分类。一个召回优先的联合提议器结合两个轻量级编码器和确定性规则，过度生成候选跨度；然后一个上下文感知审查器利用周围对话和说话者角色对每个候选做出二元的保留/删除决策。我们在两个大型平台的数学辅导转录上评估了三种审查器配置，与同系列纯LLM基线和商业API进行比较。最强的本地配置达到0.958宏F1，而同系列纯LLM基线为0.767，商业API为0.706，同时完全在单个笔记本电脑上运行。在针对课程-人名歧义的挑战集上，相同配置仅下降0.03 F1，而较小审查器下降0.19至0.25。这些结果表明，对于教育去标识，问题表述比模型规模更重要。

English abstract

Educational dialogue is a valuable but sensitive resource for research: the same transcripts that capture authentic learning often capture personally identifiable information (PII) entangled with curricular content, where "Riemann" may refer to a real student or to a mathematical concept. Existing approaches force a tradeoff between governance and accuracy. Commercial Large Language Models (LLMs) can handle this ambiguity but require sending student data to third parties, while local named entity recognition (NER) systems preserve governance but over-redact curricular terms. We propose a fully local cascade framework that reframes de-identification from open-ended entity recognition to constrained privacy triage. A recall-first union proposer combines two lightweight encoders with deterministic rules to over-generate candidate spans; a context-aware reviewer then makes a binary Redact/Keep decision for each candidate using surrounding dialogue and speaker role. We evaluate three reviewer configurations against same-family LLM-only baselines and a commercial API on math tutoring transcripts from two large platforms. The strongest local configuration reaches 0.958 macro F1, compared with 0.767 for a same-family LLM-only baseline and 0.706 for the commercial API, while running entirely on a single laptop. On a targeted challenge set of curricular-personal name ambiguity, the same configuration degrades by only 0.03 F1 versus 0.19 to 0.25 for smaller reviewers. These results suggest that for educational de-identification, problem formulation matters more than model scale.
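The propose-then-review cascade described in the abstract can be sketched with toy components. Nothing here is the paper's system: the regex detector stands in for the lightweight encoders, the cue-word check stands in for the context-aware reviewer, and `MATH_CUES`, the function names, and the `[NAME]` placeholder are all hypothetical. The sketch also ignores speaker role and does not merge overlapping spans.

```python
import re

MATH_CUES = {"hypothesis", "sum", "integral", "theorem", "zeta"}  # illustrative cue words


def propose_spans(text, detectors):
    """Recall-first union: keep every span that any detector flags."""
    spans = set()
    for detect in detectors:
        spans.update(detect(text))
    return sorted(spans)


def capitalized_words(text):
    """Toy detector standing in for a lightweight encoder."""
    return [(m.start(), m.end()) for m in re.finditer(r"\b[A-Z][a-z]+\b", text)]


def review(text, span, window=20):
    """Toy reviewer: keep a candidate when nearby words look curricular."""
    start, end = span
    context = text[max(0, start - window): end + window].lower()
    context_words = set(re.findall(r"[a-z]+", context))
    return "Keep" if context_words & MATH_CUES else "Redact"


def deidentify(text, detectors):
    """Run the proposer, then redact only the spans the reviewer rejects."""
    out = text
    # Replace from right to left so earlier offsets stay valid.
    for span in reversed(propose_spans(text, detectors)):
        if review(text, span) == "Redact":
            out = out[: span[0]] + "[NAME]" + out[span[1]:]
    return out
```

The proposer is allowed to over-generate because the reviewer only has to make a binary Redact/Keep call per candidate, which is the reframing from open-ended recognition to constrained triage that the abstract argues for.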


2606.18256 2026-06-18 cs.HC cs.AI New submission 60%

Dynamic In-Group Persona Generation for Enhancing Human-AI Rapport

动态内群体人格生成以增强人机融洽关系

Yoonseok Oh, Inseo Jung, Jinkyu Kim, Jungbeom Lee, Minwoo Kang, Suhong Moon

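Affiliations: Korea University; Kakao Mobility; University of California, Berkeley

Topic match (domain LLMs): LLM chatbots that build rapport through in-group personas

AI summary: Proposes dynamic in-group persona generation, which identifies a user's primary concern and generates an in-group persona that shares a similar concern. It significantly improves human-AI rapport, and experiments show it outperforms both a no-persona condition and a minimal self-disclosure baseline.

AI-generated Chinese abstract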

基于LLM的聊天机器人越来越多地应用于咨询和同伴支持等人际领域，在这些领域中建立人机融洽关系至关重要但仍具挑战性。在这项工作中，我们引入了一种新颖的方法来为LLM赋予内群体人格，该方法首先识别用户的主要关切和简要个人背景（例如，一位担心未来职业前景的计算机科学本科生），然后生成一个共享相似主要关切但在背景和叙述细节（如年龄或职业）上有所不同的合成内群体人格（例如，一家AI初创公司的初级研究员）。此外，我们进行了一项人类受试者研究，系统评估内群体人格代理在增强人机融洽关系方面的有效性。我们将我们的方法与两种基线条件进行比较：一种是不带人格条件的传统代理，另一种是表现出最小自我表露的代理（例如，“我也曾有过这种感觉”）。来自评估融洽关系和用户体验的任务后问卷的结果表明，与基线相比，内群体人格代理显著改善了感知融洽度和个人相关性，并产生了更积极的用户体验——最显著的是更高的参与度。

English abstract

LLM-based chatbots are increasingly applied in interpersonal domains such as counseling and peer support, where establishing human-AI rapport is crucial yet remains challenging. In this work, we introduce a novel approach for conditioning LLMs with in-group personas, which (i) first identifies a user's primary concern and brief personal context (e.g., a computer science undergraduate worried about future career prospects), and (ii) generates a synthetic in-group persona that shares a similar primary concern while differing in background and narrative details, such as age or profession (e.g., a junior researcher at an AI startup). Furthermore, we conduct a human-subject study to systematically evaluate the effectiveness of in-group persona agents in enhancing human-AI rapport. We compare our approach against two baseline conditions: a conventional agent without persona conditioning and an agent exhibiting minimal self-disclosure (e.g., "I've felt that too"). Results from post-task questionnaires assessing rapport and user experience indicate that the in-group persona agent significantly improves perceived rapport and personal relevance compared to the baselines, and also yields a more positive user experience, most notably higher engagement.
