Abstract

Many video reasoning tasks require tracking motion, temporal order, and evolving visual states across frames. Existing methods built on large vision-language models (LVLMs) often address this challenge by externalizing reasoning through textual chain-of-thought (CoT), keyframe selection, repeated frame reinsertion, or external tool use. While effective, such pipelines increase inference-time latency and engineering complexity, and they force temporal-visual evidence to be serialized into text or repeatedly re-encoded from frames. Inspired by the intuition that visual reasoning can occur implicitly before verbalization, we propose STORM (Spatial-Temporal reasOning via inteRnalized Modeling), a two-stage framework that teaches LVLMs to reason through bounded continuous latent trajectories instead of explicit textual CoT. In Stage I, STORM aligns latent tokens with thought-video representations derived from generated videos, grounding the latent states in dynamic visual evidence. In Stage II, the model is further trained with answer-only supervision, encouraging the reasoning process to be internalized without step-by-step annotations. Generated thought videos are used only during training; at inference, STORM performs a bounded latent rollout without regenerating videos, reinserting frames, or invoking external visual tools. Experiments on VideoMME, MVBench, TempCompass, and MMVU show that STORM improves video reasoning accuracy while substantially reducing inference overhead compared with tool-based or video-generation-based reasoning pipelines.


2605.26007 2026-05-26 cs.CL Version update

Forgotten Words: Benchmarking NeoBERT for Dementia Detection in Low-Resource Conversational Filipino and English Speech

Rez Samantha Z. Floresca, Edric Castel C. Hao, Hannah Grachiella Buñales, Chelsea Dominique E. Temprosa, Georgianna Z. Reyes, Kervin Gabriel L. Chua

Affiliations: Ateneo de Manila Senior High School; Analog Devices, Inc.

AI summary: For the low-resource Filipino-English code-switching setting, presents the first systematic evaluation of transformer-based dementia detection models and finds that bilingual fine-tuning eliminates cross-lingual degradation, reaching Macro-F1 = 0.969-0.973.

Comments Accepted to BioNLP Workshop @ ACL 2026

Abstract

Dementia detection from spontaneous speech offers a scalable approach to cognitive screening, yet NLP systems remain predominantly English-centric. This limitation is especially acute in the Philippines, where Filipino-English code-switching is pervasive and no prior work has addressed NLP-based dementia detection. We present the first systematic evaluation of transformer-based dementia detection in Filipino speech and the first assessment of NeoBERT in a clinical NLP setting. To separate language from domain effects, we construct a parallel bilingual dataset of 4,000 DementiaBank-derived transcripts, with Filipino translations produced manually to preserve discourse-level markers of cognitive decline. We evaluate five model families, TF-IDF + LogReg, BERT, NeoBERT, XLM-R, and RoBERTa-Tagalog, under monolingual, zero-shot cross-lingual, and bilingual fine-tuning settings. We find that in-domain performance does not transfer across languages, with English-trained BERT dropping to Macro-F1 = 0.455 on Filipino, and that architectural modernization alone does not improve robustness. Bilingual fine-tuning, however, eliminates cross-lingual degradation across all transformer models, converging to Macro-F1 = 0.969-0.973. These results suggest that multilingual clinical NLP performance is driven primarily by linguistic coverage during training rather than model scale or architecture.


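2605.26004 2026-05-26 cs.CV cs.CL Version update

MAGIC: Multimodal Alignment & Grounding-aware Instruction Coreset for Vision-Language Models

Shristi Das Biswas, Kaushik Roy

Affiliations: Purdue University

AI summary: Proposes MAGIC, which uses three intrinsic signals from a pretrained VLM (multimodal gain, bridging relevance, and skill-neuron signatures) for training-free, forward-only coreset selection, building compact yet behaviorally faithful subsets for multimodal instruction tuning that match or exceed full fine-tuning performance at a 20% budget.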

Abstract

Instruction tuning of large vision-language models (LVLMs) increasingly depends on massive multimodal corpora, yet these datasets contain samples with substantial redundancy, low visual dependency, and highly imbalanced coverage of multimodal reasoning behaviors. As a result, uniform subsampling or naive score-based selection often yields suboptimal training subsets. We introduce MAGIC, a training-free, forward-only coreset selection method designed to construct compact yet behaviorally faithful subsets for multimodal instruction tuning. MAGIC is built on three intrinsic signals extracted from a pretrained VLM: Multimodal Gain, which measures the likelihood improvement obtained from visual input; Bridging Relevance, which captures the sharpness of answer-token grounding over visual tokens; and Skill-Neuron Signatures, which characterize the functional computation elicited by each sample via top-activated feed-forward neurons. MAGIC combines these signals in a three-stage pipeline: filtering low-gain examples, ranking candidates by a normalized quality objective, and performing bucket-wise budget allocation over discrete neuron signatures to preserve latent multimodal skill coverage. This formulation avoids backpropagation, auxiliary selector training, and expensive clustering in continuous activation spaces, while remaining efficient and easily deployable in existing VLMs. Across LLaVA-665K and Vision-Flan datasets, and transfer settings to large target models, LLaVA-1.5-7B and -13B, MAGIC consistently improves over strong baselines under matched 20% budgets: it achieves 100.3% relative performance to full finetuning on LLaVA-665K and 101.6% relative performance on Vision-Flan-186K, while yielding a 73.7% reduction in wall-clock run time.
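The three-stage pipeline described in this abstract (filter, rank, bucket-wise allocation) can be illustrated with a minimal sketch. This is not the authors' implementation: the field names (`gain`, `relevance`, `signature`), the min-max normalisation, the summed quality score, and the round-robin allocation across buckets are all assumptions made for illustration.

```python
from collections import defaultdict


def select_coreset(samples, budget, gain_threshold=0.0):
    """Pick up to `budget` sample ids (hypothetical fields: id, gain, relevance, signature)."""
    # Stage 1: drop samples whose multimodal gain is at or below the threshold.
    kept = [s for s in samples if s["gain"] > gain_threshold]
    if not kept or budget <= 0:
        return []

    # Stage 2: min-max normalise each signal and sum them (assumed quality objective).
    def normalise(values):
        lo, hi = min(values), max(values)
        return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]

    gains = normalise([s["gain"] for s in kept])
    rels = normalise([s["relevance"] for s in kept])

    # Stage 3: group by discrete signature, then take the best remaining
    # sample from each bucket in turn so every bucket keeps some coverage.
    buckets = defaultdict(list)
    for g, r, s in zip(gains, rels, kept):
        buckets[s["signature"]].append((g + r, s["id"]))
    for entries in buckets.values():
        entries.sort(key=lambda t: t[0], reverse=True)

    selected, rank = [], 0
    while len(selected) < budget:
        progressed = False
        for sig in sorted(buckets):
            if rank < len(buckets[sig]) and len(selected) < budget:
                selected.append(buckets[sig][rank][1])
                progressed = True
        if not progressed:
            break
        rank += 1
    return selected


samples = [
    {"id": "a", "gain": 0.9, "relevance": 0.8, "signature": "A"},
    {"id": "b", "gain": 0.8, "relevance": 0.9, "signature": "A"},
    {"id": "c", "gain": 0.5, "relevance": 0.2, "signature": "B"},
    {"id": "d", "gain": -0.1, "relevance": 0.9, "signature": "B"},
]
print(select_coreset(samples, budget=2))
```

In this toy input, sample `d` is filtered out for its negative gain, and the round-robin step picks one sample from each signature bucket before returning to the first bucket, so a low-scoring but distinct behaviour is still represented.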


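2605.26001 2026-05-26 cs.CL cs.AI cs.CY Version update

What Makes a Medical Checker Trainable? Diagnosing Signal Collapse and Reward Hacking in Checker-Guided RAG for Biomedical QA

Yuelyu Ji, Min Gu Kwak, Hang Zhang, Xizhi Wu, Chenyu Li, Yanshan Wan

Affiliations: University of Pittsburgh

AI summary: Compares four NLI checkers as process rewards for a GRPO-trained medical RAG agent, diagnoses signal collapse and reward hacking, finds that a checker's output distribution rather than its held-out accuracy determines whether it provides trainable gradient, and proposes boundary conditions for verifier-as-reward systems.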

Abstract

Medical RAG needs evidence-grounded claims, so plugging a claim-level NLI checker into retrieval-augmented RL is intuitive. We find that the checker's output distribution during training, not its held-out accuracy, decides whether it provides trainable gradient. We compare four NLI checker back-ends as process rewards inside a GRPO-trained medical RAG agent (Qwen2.5-7B, replicated on Qwen3-4B and Llama-3.1-8B) across four held-out medical QA benchmarks. Three diagnostic findings emerge. (i) Signal collapse is log-prob-specific: LLM log-probability scoring labels over 97% of claims neutral, collapsing the RL gradient to zero, while a calibrated MedNLI classifier scores the same pairs non-degenerately. (ii) Moderate signal beats strong signal on answer quality: a strong proprietary checker triggers a three-step reward-hacking cascade (ultra-short answers, search avoidance, language collapse), so a moderate-signal local classifier trains a higher-quality model (+12% BERTScore over zero-shot, no GPT dependency). (iii) Signal strength is policy-dependent: the same checker registers as moderate on one policy but strong on another without triggering the cascade end-state. We frame these as boundary conditions for verifier-as-reward systems.
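Why a near-constant checker output removes the training signal can be seen from the group-relative advantage used in GRPO-style training. The sketch below assumes the common formulation in which each reward is centred on the group mean and divided by the group standard deviation; the function name and the `eps` constant are illustrative, and the paper's exact reward shaping is not reproduced here.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A checker that labels every claim "neutral" gives every rollout the same reward,
# so every advantage is zero and the policy gradient vanishes.
collapsed = group_advantages([0.5, 0.5, 0.5, 0.5])

# A checker whose scores vary across rollouts yields non-zero advantages.
informative = group_advantages([1.0, 0.0, 1.0, 0.0])
print(collapsed, informative)
```

Under this formulation, a checker can be accurate on a held-out set and still be useless as a reward if its scores do not vary across the rollouts of a group.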


2605.25984 2026-05-26 cs.CL cs.AI Version update

SafeCtrl-RL: Inference-Time Adaptive Behaviour Control for LLM Dialogue via RL-Driven Prompt Optimisation

Michael Orme, Yanchao Yu, Zhiyuan Tan

Affiliations: School of Computing, Engineering and Building Environment

AI summary: Proposes SafeCtrl-RL, a framework that uses reinforcement learning to select prompt adjustment strategies dynamically at inference time, suppressing unsafe behaviour without retraining and improving the safety and response quality of LLM dialogue.

Abstract

Ensuring safe and contextually appropriate behaviour in Large Language Models (LLMs) remains a critical challenge for real-world deployment. We present SafeCtrl-RL, an inference-time behavioural control framework that enables adaptive safety regulation without model retraining or parameter modification. The method formulates dialogue generation as a sequential decision process, where a reinforcement learning agent dynamically selects prompt adjustment strategies based on contextual feedback. This allows unsafe behaviours to be suppressed through iterative refinement, which we conceptualise as inference-time behavioural unlearning. Evaluated across multiple LLMs and unsafe dialogue scenarios, SafeCtrl-RL consistently improves safety and response quality, outperforms existing prompt-based optimisation methods, and achieves favourable performance-efficiency trade-offs. **Warning: This paper may contain examples of harmful language, and reader discretion is recommended.**


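2605.25977 2026-05-26 cs.CL cs.AI cs.LG Version update

Creative Quality Alignment: Expert Tacit Knowledge Transfer via Chain-of-Thought Fine-Tuning

Bo Zou, Chao Xu

AI summary: Under deliberately strict engineering conditions (low data cost and a small base model), empirically validates the creative quality metric from Calibrated Surprise, identifies a data bias in public alignment datasets, and proposes Creative Quality Alignment together with a supporting theoretical explanation.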

Abstract

This paper provides an empirical implementation of the creative quality metric proposed in Calibrated Surprise (Zou & Xu, 2026a). The question this paper addresses is: does this mathematical claim hold at the engineering level? To make the answer as general as possible, we deliberately choose the strictest engineering conditions: low data cost and a small base model. Training data comes from approximately 100 expert chain-of-thought (CoT) annotations produced by the BC Protocol (Zou & Xu, 2026b). We also identify a data bias: most publicly available alignment datasets are skewed toward craft-related knowledge, while audience modeling and reality-logic coverage are systematically weak. We use the term Creative Quality Alignment (CQA) to describe this class of engineering methods. We also offer a supporting theoretical observation: in an LLM with a single conditional distribution architecture, calibrating the appreciation side automatically transfers to the generation side via architectural duality. This is the structural reason why ~100 CoT examples are sufficient -- not a purely empirical observation like LIMA (Zhou et al., 2023).


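2605.25969 2026-05-26 cs.CL Version update

Triplet-Block Diffusion RWKV

Ke Lin, Yiyang Luo, Zhaolong Su, Yunya Song, Anyi Rao

Affiliations: William & Mary; HKUST; Cornell

AI summary: Proposes B^3D-RWKV, which uses a triplet-block layout to combine RWKV's linear inference efficiency with bidirectional discrete diffusion for parallel decoding, reaching comparable accuracy on an 8-task suite with an average 1.6x gain in decoding throughput.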

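2605.25966 2026-05-26 cs.LG cs.CL stat.ML Version update

Mapping the Schedule x Bit-Width Boundary in Sub-100M Quantisation-Aware Training

Christian Brandt Thomassen

Affiliations: Dwarf A/S

AI summary: Uses large-scale experiments to test whether the optimal learning-rate schedule for quantisation-aware training of sub-100M-parameter decoder language models depends on bit-width; finds that INT6 QAT needs no different schedule, that INT4 at 50M and above favours the wd33 schedule, and that below 50M the choice is dominated by noise.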

Comments 20 pages, 6 figures, 4 tables. 1345 training runs total (720 + 625). Submitted for review at TMLR

Abstract

We test whether the optimal learning-rate schedule depends on bit-width during from-initialisation quantisation-aware training (QAT) for sub-100M decoder language models. A 720-run factorial grid (Phase 2) over bit-width x warmdown fraction x LR magnitude x model size x seed (FP16/INT8/INT6, 15M-100M, 5 seeds) finds the optimal warmdown is 33% at every (bit-width, size) cell. The primary hypothesis -- that INT6 QAT requires a different schedule than higher-precision training -- is falsified at FP16/INT8/INT6. A 625-run follow-up (Phase 5) probes the null along five axes: optimiser (AdamW), schedule shape (cosine), training length (up to 9x more iterations), an extended size sweep (5M-350M), and an INT4 sweep from 3M to 100M. The null is robust under all three setup changes. The INT6 penalty follows a log-linear scaling law whose fit on Phase 2 predicts the five held-out Phase 5 sizes (5M, 8M, 175M, 250M, 350M) within their 95% prediction intervals (5/5). For INT4 the picture is sharper than the higher precisions: at 50M and 100M, wd33 is decisively optimal (paired z ~ 12-15, 10/10 seeds); below 50M, across the six tested sizes from 3M to 30M, no individual size shows a statistically significant schedule preference and the per-size mean penalty oscillates within seed-level noise. The boundary is therefore a transition between a noise-dominated regime below 50M and a decisive wd33 regime at and above 50M, not a clean wd10 region. A weight-to-grid-distance probe falsifies the simplest mechanism for the FP16/INT8/INT6 null result (rapid grid-snapping): pre-warmdown, INT6-QAT weights sit at essentially the same distance from the INT6 grid as FP16 weights (ratio ~ 1.04). Practical recommendation: at sub-100M scale, tune the LR schedule once at FP16 and apply unchanged to INT8/INT6 QAT; for INT4 at 50M+ use wd33; for INT4 below 50M the schedule choice is in the noise.


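2605.25958 2026-05-26 cs.CL cs.CE Version update

PolyGnosis 2.0: Enhancing LLM Reasoning via Agentic Harness Engineering for Polymarket and OSINT Insight Extraction

Daren Wang, Hong Xu, Jiawen Xian

Affiliations: The Chinese University of Hong Kong; Evolution AI Lab

AI summary: Proposes PolyGnosis 2.0, a multi-agent architecture that extracts predictive intelligence by fusing Polymarket anomaly signals with global open-source intelligence (GDELT), quantifies "Perspective Mismatches" as high-alpha trading signals, and evaluates the effectiveness of "Harness Engineering" techniques (reflection loops, tool-calling, divide-and-conquer, and chain-of-thought) in high-noise financial domains.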

Abstract

This paper introduces PolyGnosis 2.0, a pioneering multi-agent architecture designed to extract predictive intelligence by synthesizing Polymarket anomaly signals with global Open Source Intelligence (OSINT) streams, specifically Global Database of Events, Language, and Tone (GDELT). We define and target "Perspective Mismatches", the narrative divergence between Polymarket sentiment and global media flows, as high-alpha trading signals. Moving beyond generic agentic superiority, we rigorously quantify the efficacy of "Harness Engineering" techniques, including reflection loops, tool-calling, divide-and-conquer partitioning (D&C), and chain-of-thought (CoT), within high-noise financial domains. Our empirical evaluation against human-expert benchmarks reveals that while structural partitioning is mandatory for multi-dimensional alignment, unconstrained terminal reflection actively induces logical drift. Furthermore, we identify a pervasive "consensus bias" across all agent configurations during narrative reasoning, necessitating deterministic validation. Ultimately, we isolate a Pareto-optimal configuration that achieves professional-grade analytical precision while minimizing latency and token overhead, providing a robust blueprint for autonomous intelligence in prediction markets.


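Causal Tongue-Tie: LLMs Can Encode Causal Direction, but Their Yes/No Outputs Cannot Express It

Ziyi Ding, Xiao-Ping Zhang

Affiliations: Tsinghua Shenzhen International Graduate School

AI summary: Finds a mismatch between internal encoding and output on causal questions in large language models: a linear probe recovers the evidence-supported answer from hidden states (accuracy about 0.97), while the verbal yes/no response falls back to the commonsense answer (accuracy about 0.5), a gap of about +0.5 that the authors call the "causal tongue-tie".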

2605.25869 2026-05-26 cs.CL Version update

Mitigating Provenance-Role Collapse in Long-Term Agents via Typed Memory Representation

Zhengda Jin, Bingbing Wang, Jing Li, Ruifeng Xu, Min Zhang

Affiliations: Harbin Institute of Technology; The Hong Kong Polytechnic University; Shenzhen Loop Area Institute

AI summary: Proposes MemIR, a typed memory intermediate representation that enforces source monitoring through structural constraints, addressing the provenance-role collapse that unstructured storage causes in long-term agents; it outperforms existing baselines on LoCoMo and BEAM-100K.

Abstract

Long-term memory is essential for persistent LLM agents, yet prevailing architectures store historical interactions as unstructured, flat text. This unconstrained storage induces provenance-role collapse, a critical failure mode where agents suffer from source-monitoring errors. To resolve this cognitive vulnerability at the architectural level, we propose MemIR, a typed Memory Intermediate Representation that operationalizes source monitoring as a structural constraint. MemIR writes long-term memory into grounded atoms that separate raw evidence, retrieval cues, and truth-bearing claims, with factual authorization restricted to supported claim atoms. It then applies multi-route atomic projection and provenance-scoped utilization to transform heterogeneous retrieval hits into claim-centered candidate bundles and a normalized fact interface for answer generation. Experiments on LoCoMo and BEAM-100K demonstrate that MemIR consistently outperforms existing memory baselines, especially on tasks requiring source tracking, temporal grounding, and aggregation of fragmented evidence.


2605.25864 2026-05-26 cs.LG cs.CL Version update

When Self-Belief Misleads: Active Label Acquisition for Reinforcement Learning with Verifiable Rewards

Li Wang, Xiaodong Lu, Xiaohan Wang, Yikun Ban, Jiajun Chai, Wei Lin, Tianhao Peng, Guojun Yin

Affiliations: Meituan; Beihang University; Nanyang Technological University

AI summary: Proposes the RLAVR framework, which actively acquires a small number of ground-truth labels and combines them with pseudo-labels, using the CAG metric and the CARE policy to stabilise training and improve performance under limited annotation budgets.

Abstract

Large Language Models (LLMs) have achieved remarkable advancements in reasoning capabilities empowered by Reinforcement Learning with Verifiable Rewards (RLVR). Nonetheless, RLVR intrinsically relies on ground-truth labels for reward computation, the acquisition of which is often prohibitively expensive in real-world scenarios. While unsupervised RLVR paradigms attempt to circumvent this by training on pseudo-labels, they are notoriously susceptible to training collapse. Moreover, different samples often exhibit varying annotation values. In this paper, we propose Reinforcement Learning with Active Verifiable Rewards (RLAVR), which actively acquires ground-truth labels for a small set of selected samples and integrates them with pseudo-labels, thereby stabilizing training dynamics and improving performance under limited annotation budgets. To identify valuable samples, we propose the Corrective Advantage Gap (CAG) metric and analyze the sample-level supervision value. Building on this, we introduce Correction-Aware Reliability Estimation for RLAVR (CARE), which translates the oracle CAG criterion into a practical pre-query acquisition policy to substantially improve training stability. Extensive experiments across diverse domains, model families, and model scales demonstrate the effectiveness and generality of our approach. Our code is available at https://github.com/Lumina04/CARE.


2605.25850 2026-05-26 cs.CL cs.AI cs.LG Version update

TIAR: Trajectory-Informed Advantage Reweighting for LLM Abstention Learning

Muyu Pan, Shu Zhao, Nan Zhang, Philip Shin, Varun Parekh, Vijaykrishnan Narayanan, Rui Zhang

Affiliations: Department of Computer Science, The Pennsylvania State University

AI summary: Proposes TIAR, which treats the multiple trajectories sampled in GRPO as a natural abstention signal and dynamically reweights the abstention reward, achieving the best abstention F1 in five of six evaluation categories while maintaining baseline accuracy.

Comments 10 pages, 1 figure, 4 tables

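Clarify, Abstain, or Answer? Conversational Strategies via Belief-Augmented Generation

Joris Baan, Wilker Aziz, Barbara Plank, Raquel Fernández

Affiliations: University of Amsterdam; MCML Munich; LMU Munich

AI summary: Proposes Belief-Augmented Generation (BAG), which injects an LLM's own belief state into the prompt so that it reasons over multiple sampled responses and chooses a conversational strategy (answer, clarify, or abstain), improving accuracy on multi-turn ambiguous QA and the faithfulness of strategy decisions.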

Abstract

Large language models (LLMs) define a distribution over text, which can be viewed as a probabilistic representation of uncertainty: sampling K responses yields a belief state, the set of responses a model deems plausible. Existing work exploits this representation for narrow tasks such as decoding or selective prediction, and often requires manual interventions rather than controlling generation directly. We propose Belief-Augmented Generation (BAG): grounding LLMs in their own belief state via the prompt and letting them reason over these K samples to decide on a conversational strategy: answer, clarify, or abstain. In a multi-turn ambiguous QA setting, we find that LLMs by default rarely clarify or abstain, ignoring uncertainty about the input or facts. BAG improves QA accuracy across six models and yields strategy decisions more faithful to the belief state than prompt-only baselines. Disentangling when to clarify from when to abstain, however, remains challenging.
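The idea of grounding a model in its own sampled responses can be sketched as follows. This is only an illustration under stated assumptions: `sample_fn` is a hypothetical stand-in for one stochastic call to an LLM, the belief state is approximated by simple answer counts, and the prompt wording is invented rather than taken from the paper.

```python
from collections import Counter


def belief_state(sample_fn, question, k=5):
    """Sample K responses and count them as a simple empirical belief state."""
    return Counter(sample_fn(question) for _ in range(k))


def build_bag_prompt(question, beliefs):
    """Show the model its own samples and ask it to choose a strategy."""
    lines = [f"- {resp} (sampled {n}x)" for resp, n in beliefs.most_common()]
    return (
        f"Question: {question}\n"
        "Your own sampled answers:\n" + "\n".join(lines) + "\n"
        "Decide on a strategy: ANSWER, CLARIFY, or ABSTAIN."
    )


# Hypothetical sampler that replays fixed answers instead of calling a model.
canned = iter(["Paris", "Paris", "Lyon"])
beliefs = belief_state(lambda q: next(canned), "Which city is meant?", k=3)
prompt = build_bag_prompt("Which city is meant?", beliefs)
print(prompt)
```

A spread of distinct answers in the prompt is the cue the model is expected to use when choosing between clarifying and abstaining, which the abstract notes remains the hard part.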


2605.25816 2026-05-26 cs.CL cs.AI Version update

Fine-Tuning Over Architectural Complexity: Broad-Coverage PII Detection on PIIBench with DeBERTa

Pritesh Jha

AI summary: Fine-tunes DeBERTa on the multi-source PIIBench dataset covering 82 entity types for broad-coverage PII detection; direct fine-tuning clearly outperforms the architecturally more complex hierarchical model and the curriculum extension in F1.

DOI: 10.5281/zenodo.20379635

Abstract

Personally identifiable information (PII) detection systems are frequently trained within narrow source or domain boundaries, limiting coverage when deployed on heterogeneous text. We study model fine-tuning on a corrected multi-source PIIBench preparation spanning 82 retained entity types across ten source datasets. We evaluate three DeBERTa-based approaches: direct token classification fine-tuning, a source-conditioned hierarchical model (SC+H), and a three-phase curriculum extension (SC+H+Curr). Against eight published comparator systems on a reproducible 5,000-record held-out subset (test_5k), direct fine-tuned DeBERTa achieves F1 0.6476, while SC+H and the curriculum variant achieve 0.5899 and 0.2772 respectively; the strongest published comparator reaches only 0.1723. Because validation initially favoured SC+H, we perform a final streamed evaluation on the complete 100,002-record held-out split. Direct fine-tuning remains superior, achieving F1 0.6455 versus 0.5894 for SC+H. Entity-level analysis shows that direct fine-tuning wins 54 of 82 fine entity types and all ten coarse groups by support-weighted entity F1, while SC+H retains localised advantages on 28 types. The results indicate that diverse task-specific training data and a simple weighted cross-entropy objective contribute more to broad-coverage PII detection than the tested architectural and curriculum complexity.


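2605.25814 2026-05-26 cs.CL cs.AI Version update

Adaptive Graph Refinement and Label Propagation with LLMs for Cost-Effective Entity Resolution

Hongtao Wang, Renchi Yang, Haoran Zheng, Xiangyu Ke

Affiliations: Hong Kong Baptist University; Zhejiang University

AI summary: Proposes the Alper framework, which integrates matching and clustering through iterative probabilistic label propagation, adaptively fuses weak signals from graph propagation with strong LLM queries, and maximises marginal gain under a budget constraint for cost-effective entity resolution.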

Abstract

Dirty entity resolution (ER), which identifies records referring to the same real-world entity from a single, messy dataset, is a fundamental task in data management and mining. However, the dominant blocking-matching-clustering paradigm for ER suffers from critical flaws. Its cascaded, decoupled workflow essentially produces a static, sparse graph plagued by missing edges (due to blocking failures) and noisy links (due to matching errors), causing error propagation and yielding suboptimal clusters, particularly when rigid transitivity is imposed in the clustering. We contend that matching and clustering are fundamentally synergistic, both optimizing for the construction of an ideal entity graph. Building upon this insight, we propose Alper, a unified framework that integrates these steps into an iterative probabilistic label propagation process over a global, evolving graph. Unlike disjoint blocking, Alper refines the graph structure and labels dynamically by adaptively integrating "weak but cheap" signals from graph propagation with "strong but expensive" LLM-based pairwise queries. For higher cost-effectiveness, we formulate the signal selection as a constrained optimization problem maximizing cumulative marginal gain under a query budget, solved via our greedy algorithm with provable theoretical guarantees. Our extensive experiments over eight benchmark datasets demonstrate that Alper is consistently superior to state-of-the-art cascaded pipelines.
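Budgeted greedy selection by marginal gain, the general technique this abstract refers to, can be sketched generically. The code below is not Alper's algorithm: the `marginal_gain` callback, the unit cost per query, and the toy coverage-style gain are assumptions chosen only to make the greedy loop concrete, and the paper's theoretical guarantee is not reproduced.

```python
def greedy_select(candidates, marginal_gain, budget):
    """Repeatedly pick the candidate with the largest marginal gain until the budget is spent."""
    chosen = []
    remaining = list(candidates)
    while remaining and len(chosen) < budget:
        best = max(remaining, key=lambda c: marginal_gain(c, chosen))
        if marginal_gain(best, chosen) <= 0:
            break  # no remaining query adds anything
        chosen.append(best)
        remaining.remove(best)
    return chosen


def coverage_gain(candidate, chosen):
    """Toy gain: how many records this query would resolve that earlier queries did not."""
    covered = set().union(*chosen)
    return len(set(candidate) - covered)


# Each hypothetical query is represented by the set of record ids it would resolve.
queries = [frozenset({1, 2, 3}), frozenset({3, 4}), frozenset({1, 2})]
print(greedy_select(queries, coverage_gain, budget=2))
```

Recomputing the gain against what has already been chosen is what makes the selection "marginal": the third query above is never picked because the first already covers its records.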


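2605.25781 2026-05-26 cs.CL Version update

Double Triangle Annotation: A Scalable Human-in-the-Loop Framework for High-Precision Historical Document Annotation

Yi Ren

Affiliations: École Polytechnique Fédérale de Lausanne

AI summary: Proposes the Double Triangle Annotation framework, which uses two human-in-the-loop layers and cross-model consensus to automate most annotation work and deliver high-precision structured-information extraction from historical documents.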

Comments 12 pages, 4 figures. ACL ARR 2026 March submission

Abstract

Evaluating structured-information extraction from historical documents at scale requires high-precision ground-truth annotations, yet traditional manual labeling is expensive and fully automated pipelines built on large language models are prone to hallucination. We propose Double Triangle Annotation, a two-layer human-in-the-loop framework that leverages cross-model consensus to automate the majority of annotation work while ensuring high-precision outputs. In the first layer, two architecturally independent Multimodal Large Language Models annotate each document in parallel; when they agree, the label is auto-accepted, and disagreements are routed to a human jury. A second layer cross-checks two such systems against each other, escalating residual conflicts to a domain expert. The framework rests on a single assumption -- error independence between models -- requires no distributional priors or task-specific calibration, and becomes more autonomous as model capability improves. On the Guides Rosenwald, a corpus of French medical directories spanning 1887-1906, the framework achieves a final Word Error Rate of 0.003. Applied at scale, model consensus auto-accepts over 85% of 13,595 fields. We release the resulting benchmark -- the first structured-extraction ground truth for the Rosenwald Guides -- to support future work on historical document processing.
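The two-layer routing described in this abstract can be sketched as below. The function names, the `jury` and `expert` callbacks, and the route labels are hypothetical; the sketch only illustrates the control flow (consensus is accepted automatically, first-layer disagreements go to a jury, cross-system conflicts go to an expert) and is not the authors' code.

```python
def triangle(label_a, label_b, jury):
    """One triangle: accept on model consensus, otherwise ask the human jury."""
    if label_a == label_b:
        return label_a, "auto"
    return jury(label_a, label_b), "jury"


def double_triangle(labels_1, labels_2, jury, expert):
    """Cross-check two triangles; escalate residual conflicts to a domain expert."""
    first, route_1 = triangle(labels_1[0], labels_1[1], jury)
    second, route_2 = triangle(labels_2[0], labels_2[1], jury)
    if first == second:
        route = "auto" if route_1 == route_2 == "auto" else "jury"
        return first, route
    return expert(first, second), "expert"


# Stand-in human decisions for the example: the jury keeps the first label,
# the expert keeps the second.
jury = lambda a, b: a
expert = lambda a, b: b

print(double_triangle(("Paris", "Paris"), ("Paris", "Paris"), jury, expert))
print(double_triangle(("Paris", "Pari"), ("Paris", "Paris"), jury, expert))
print(double_triangle(("Lyon", "Lyon"), ("Paris", "Paris"), jury, expert))
```

Human effort is spent only where models disagree, which is why the error-independence assumption matters: two models that make the same mistake would pass through the "auto" route unchecked.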


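2605.25745 2026-05-26 cs.CL Version update

Selective Latent Thinking: Adaptive Compression of LLM Reasoning Chains

Hui Xie, Jie Liu, Ziyue Qiao, Joaquin Vanschore

Affiliations: Eindhoven University of Technology; School of Computing and Information Technology; Great Bay University

AI summary: Proposes the Selective Latent Thinking (SLT) framework, which uses confidence-based gating to compress redundant reasoning steps into latent representations while keeping critical steps as explicit chain-of-thought; at comparable compression ratios it is 22.7% more accurate than latent reasoning baselines, and it shortens reasoning chains by 58.4% with only a 2.8% drop in accuracy.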

AI中文摘要

显式思维链（CoT）推理显著提升了大语言模型（LLMs）的推理能力，但由于冗长的自回归痕迹导致高推理成本。现有的潜在推理方法提供了一种有前景的替代方案，但它们通常将推理视为均匀可压缩的，导致精度关键的中间步骤被过度压缩，从而降低推理准确性。在这项工作中，我们提出了选择性潜在思考（SLT），一个框架，它选择性地将冗余推理跨度压缩为潜在表示，同时在同一推理轨迹中将精度关键的跨度保留为显式CoT。具体来说，SLT首先使用轻量级解码器预测即将到来的短推理跨度，然后应用基于置信度的门控来确定可可靠压缩的最长跨度。被接受的跨度被编码为紧凑的潜在表示以提高推理效率，而不确定或精度关键的推理则保留为显式CoT形式以保持准确性。为了学习这种选择性压缩策略，SLT采用三阶段训练策略，结合跨度级潜在压缩、可靠性感知的未来推理预测和轨迹级强化学习，以优化答案正确性与推理成本之间的权衡。在四个数学推理基准上的大量实验表明，SLT在压缩比相当的情况下，准确率比潜在推理基线高22.7%，同时与显式CoT相比，推理链长度减少58.4%，准确率仅下降2.8%。我们的代码可在https://github.com/hunshi34/SLT找到。

英文摘要

Explicit chain-of-thought (CoT) reasoning substantially improves the reasoning ability of large language models (LLMs), but incurs high inference cost due to lengthy autoregressive traces. Existing latent reasoning methods offer a promising alternative, yet they often treat reasoning as uniformly compressible, causing precision-critical intermediate steps to be overly compressed and thereby degrading reasoning accuracy. In this work, we propose Selective Latent Thinking (SLT), a framework that selectively compresses redundant reasoning spans into latent representations while preserving precision-critical spans as explicit CoT within the same reasoning trajectory. Specifically, SLT first uses a lightweight decoder to anticipate a short upcoming reasoning span, and then applies confidence-based gating to determine the longest span that can be reliably compressed. The accepted span is encoded into a compact latent representation to improve reasoning efficiency, while uncertain or precision-critical reasoning remains in explicit CoT form to preserve accuracy. To learn this selective compression policy, SLT adopts a three-stage training strategy that combines span-level latent compression, reliability-aware future reasoning prediction, and trajectory-level reinforcement learning to optimize the trade-off between answer correctness and reasoning cost. Extensive experiments across four mathematical reasoning benchmarks demonstrate that SLT achieves 22.7% higher accuracy than latent reasoning baselines at comparable compression ratios, while reducing reasoning chain length by 58.4% with only 2.8% accuracy degradation compared to explicit CoT. Our code is available at https://github.com/hunshi34/SLT.
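
The confidence-based gating step can be pictured with a small sketch. It assumes the lightweight decoder yields one confidence value per anticipated token and that a span is accepted up to the first token falling below a threshold. The threshold value, the function names, and the fallback of emitting one explicit step are illustrative assumptions rather than SLT's actual rule.

```python
from typing import Sequence, Tuple


def longest_confident_span(confidences: Sequence[float], threshold: float) -> int:
    """Length of the longest prefix whose confidences all meet the threshold."""
    length = 0
    for c in confidences:
        if c < threshold:
            break
        length += 1
    return length


def plan_step(confidences: Sequence[float], threshold: float = 0.9) -> Tuple[str, int]:
    """Decide whether the next step is compressed (latent) or written out (explicit)."""
    n = longest_confident_span(confidences, threshold)
    # Assumption: with no confident prefix, fall back to one explicit CoT step.
    return ("latent", n) if n > 0 else ("explicit", 1)


# Hypothetical per-token confidences for an anticipated four-token span.
decision = plan_step([0.97, 0.95, 0.60, 0.99])
```

Here only the first two anticipated tokens would be compressed; the low-confidence third token ends the span, so the remainder stays in explicit form.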

2605.25708 2026-05-26 cs.CV cs.CL cs.ET 版本更新

测试人类与机器翻译中的去字面化假设

Malik Marmonier, Rachel Bawden, Benoît Sagot

发表机构 * Inria（法国国家信息与自动化研究所）

AI总结通过比较人类翻译、NMT系统与LLM在54个语言对上的字面化程度，验证去字面化假设是否适用于LLM生成与修订过程。

AI中文摘要

从专用NMT系统向通用LLM的近期转变重塑了机器翻译，据报道LLM比其前身产生更流畅、更少字面化的输出。我们测试这种转变是否延伸到去字面化假设，即翻译研究中长期存在的说法：翻译在起草和修订过程中逐渐变得不那么字面化。使用WMT24++数据集，我们比较了人类翻译和后编辑与两个NMT系统和六个LLM在54个语言对和三个任务上的字面化程度：直接翻译、迭代自我修订和人类草稿的后编辑。字面化程度通过基于六个启发式方法构建的经过验证的合成字面化指数来衡量。我们发现：(i) 人类翻译仍然明显比所有测试的MT系统更少字面化，尽管最近的LLM缩小了差距；(ii) 当提示迭代修订自己的输出时，LLM单调地去字面化，首次提供了该假设原生适用于LLM生成的证据；(iii) 作为后编辑者，LLM反转了人类后编辑者的修订触发因素，容忍字面化草稿并针对惯用的人类表述进行修订。

英文摘要

The recent shift from dedicated NMT systems to general-purpose LLMs has reshaped machine translation, with LLMs reported to produce more fluent, less literal output than their predecessors. We test whether this shift extends to the deliteralization hypothesis, the long-standing claim from translation studies that translations become progressively less literal as they are drafted and revised. Using the WMT24++ dataset, we compare the literality of human translations and post-editions to that of two NMT systems and six LLMs across 54 language pairs and three tasks: direct translation, iterative self-revision, and post-editing of human drafts. Literality is measured via a validated Synthetic Literality Index built from six heuristics. We find that (i) human translations remain significantly less literal than those of all tested MT systems, though recent LLMs narrow the gap; (ii) when prompted to iteratively revise their own output, LLMs deliteralize monotonically, providing the first evidence that the hypothesis applies natively to LLM generation; and (iii) as post-editors, LLMs invert the revision triggers of human post-editors, tolerating literal drafts and targeting idiomatic human formulations for revision.

2605.25680 2026-05-26 cs.CL cs.AI 版本更新

Simulating Human Memory with Language Models

用语言模型模拟人类记忆

Qihan Wang, Nicholas Tomlin, Michael Hu, Brian Dillon, Tal Linzen

发表机构 * NYU（纽约大学）； UMass Amherst（马萨诸塞大学阿姆赫斯特分校）

AI总结本研究通过心理学经典记忆实验对比语言模型与人类记忆，发现未经调优的模型记忆优于人类，但通过提示策略和压缩器可使模型遗忘方式更接近人类，从而在下游教育任务中成为更有效的用户模拟器。

2605.25676 2026-05-26 cs.CL 版本更新

Llamion Technical Report

Llamion 技术报告

Kisu Yang, Yoonna Jang, Hyeonseok Moon, Hwanseok Jang, Taewoo Lee, Hyungjin Lee, Jeseung Lee, Juhyoung Park, Heuiseok Lim

发表机构 * VAIV Company（VAIV公司）； Korea University（韩国大学）； University of Copenhagen（哥本哈根大学）； Samsung Electronics（三星电子）

AI总结提出 KEPT 方法将 Orion-14B 转换为 Llama 架构的 Llamion 模型，通过参数映射和知识蒸馏在少量数据上恢复性能，并在 KoMMLU 上达到领先水平。

Comments Research conducted in 2024

AI中文摘要

我们发布了 Llamion，一个 14B 参数的开源语言模型系列，通过将 Orion-14B 转换为标准化的 Llama 家族架构得到。该转换通过高效知识保留转换（KEPT）方法完成，该方法结合了 (i) 用于未改变模块的正常参数映射（NPM），(ii) 优化参数映射（OPM），一种无需训练的 LayerNorm 到 RMSNorm 初始化，我们证明在权重衰减引起的近零均值激活机制下该初始化是最优的，以及 (iii) 跨架构知识蒸馏（XKD），一种等大小的冻结教师蒸馏，将转换后模型的输出与源模型在任何合理输入分布上的输出对齐。Llamion 在单个 A100 上仅用约 1.23 亿 token 和四天时间，在 H6、MT-Bench 和 KoMMLU 上恢复了 Orion 的行为；Llamion-Base 在 KoMMLU 上达到 66.87%，在提交时比 Open Ko LLM Leaderboard 的次优条目高出超过 7.0 个绝对百分点。转移语料库中完全缺失的能力（Python 编程和 20 万 token 上下文处理）在架构转换后完整保留。我们发布了三个检查点（Base、Chat、LongChat），可在 Hugging Face Transformers 库中以 trust_remote_code=False 加载。

英文摘要

We release Llamion, a family of 14B-parameter open-weight language models obtained by transforming Orion-14B into the standardized Llama-family architecture. The transformation is performed by Efficient Knowledge Preservation for Transformation (KEPT), a recipe that combines (i) Normal Parameter Mapping (NPM) for unchanged modules, (ii) Optimized Parameter Mapping (OPM), a training-free LayerNorm-to-RMSNorm initialization we prove optimal under the near-zero-mean activation regime induced by weight decay, and (iii) Cross-architecture Knowledge Distillation (XKD), an equal-size frozen-teacher distillation that aligns the converted model's outputs with the source model's on any reasonable input distribution. Llamion recovers Orion's behaviour on H6, MT-Bench, and KoMMLU with only ~123M tokens on a single A100 in four days; Llamion-Base reaches 66.87% on KoMMLU, exceeding the next-best entry of the Open Ko LLM Leaderboard by >7.0 absolute points at submission time. Capabilities entirely absent from the transfer corpus (Python programming and 200K-token context handling) survive the architectural transition intact. We release three checkpoints (Base, Chat, LongChat) that load with trust_remote_code=False in the Hugging Face Transformers library.
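
To see why a LayerNorm-to-RMSNorm mapping is natural when activations have near-zero mean, the sketch below compares the two normalizations on an exactly zero-mean vector with a zero LayerNorm bias. In that special case copying the LayerNorm gain into RMSNorm reproduces the output. This is a toy check of that identity only; it is not the paper's OPM procedure, and the vectors are made up.

```python
import math
from typing import List


def layer_norm(x: List[float], gamma: List[float], beta: List[float],
               eps: float = 1e-6) -> List[float]:
    """Standard LayerNorm: subtract the mean, divide by the std, then scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]


def rms_norm(x: List[float], gamma: List[float], eps: float = 1e-6) -> List[float]:
    """RMSNorm: divide by the root mean square, then scale (no mean, no bias)."""
    ms = sum(v * v for v in x) / len(x)
    return [g * v / math.sqrt(ms + eps) for v, g in zip(x, gamma)]


x = [1.5, -0.5, -2.0, 1.0]      # illustrative activations with mean exactly 0
gamma = [1.0, 0.5, 2.0, 1.5]    # illustrative LayerNorm gain, reused for RMSNorm
beta = [0.0, 0.0, 0.0, 0.0]     # assumption: zero bias
max_gap = max(abs(a - b)
              for a, b in zip(layer_norm(x, gamma, beta), rms_norm(x, gamma)))
```

When the mean is zero the variance equals the mean square, so the two outputs coincide; with a nonzero mean or bias a gap appears, which is presumably the regime the distillation stage is meant to absorb.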

2605.25658 2026-05-26 cs.CL cs.AI 版本更新

AutoSG: LLM-Driven Solver Generation Solely from Task Prompts for Expensive Optimization

AutoSG: 仅从任务提示出发的LLM驱动的昂贵优化求解器生成

Haoran Gu, Handing Wang, Yi Mei, Mengjie Zhang

发表机构 * Xidian University（西安电子科技大学）； Victoria University of Wellington（威灵顿维多利亚大学）

AI总结提出AutoSG框架，通过检索增强生成、单步自优化和无实例评估机制，从自然语言提示直接生成可执行定制求解器，解决昂贵优化中的幻觉、结构破坏和评估成本问题。

AI中文摘要

昂贵优化任务在现实应用中普遍存在，需要高度专业化的求解器。虽然LLM驱动的自动求解器生成显示出前景，但当前范式在处理昂贵优化时面临三个关键问题：由于领域知识不足导致的事实幻觉、在细化过程中频繁破坏先前建立的局部最优结构，以及在训练实例上执行带来的高昂评估成本和受限的泛化能力。为了解决这些问题，我们引入了AutoSG，一个完全自动化的流程，直接将自然语言提示转换为可执行的定制求解器。AutoSG具有三个核心创新：一个检索增强的求解器生成模块，严格将代码基于经过验证的文献；一个单步自优化算子，在保留关键结构组件的同时引入特定任务的改进；以及一个基于Elo的无实例LLM-as-a-Judge评估机制，快速建立全局排名。在多种昂贵优化任务上的广泛评估证实，AutoSG显著优于人工设计的最先进框架和现有的LLM生成的求解器。

英文摘要

Expensive optimization tasks are ubiquitous in real-world applications, demanding highly specialized solvers. While LLM-driven automated solver generation shows promise, current paradigms face three critical issues when tackling expensive optimization: factual hallucinations due to deficient domain knowledge, the frequent dismantling of previously established locally optimal structures during refinement, and the prohibitive evaluation costs alongside restricted generalization caused by executing on training instances. To address these issues, we introduce AutoSG, a fully automated workflow directly translating natural language prompts into executable customized solvers. AutoSG features three core innovations: a retrieval-augmented solver generation module strictly grounding code in verified literature; a one-step self-refinement operator introducing task-specific improvements while preserving critical structural components; and an instance-free Elo-based LLM-as-a-Judge evaluation mechanism rapidly establishing global rankings. Extensive evaluations across diverse expensive optimization tasks confirm AutoSG significantly outperforms human-designed state-of-the-art frameworks and existing LLM-generated solvers.
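
An Elo-based ranking from pairwise judge verdicts can be sketched with the standard Elo update. The K-factor, the initial rating, the solver names, and the verdict list are illustrative assumptions; the abstract does not say how AutoSG configures its judge or schedules comparisons.

```python
from typing import Dict, Iterable, List, Tuple


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expectation that the player rated r_a beats the one rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_rank(solvers: List[str], comparisons: Iterable[Tuple[str, str]],
             k: float = 32.0, initial: float = 1000.0) -> Dict[str, float]:
    """Update ratings from (winner, loser) verdicts, processed in order."""
    ratings = {s: initial for s in solvers}
    for winner, loser in comparisons:
        delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
        ratings[winner] += delta   # winner gains what the loser gives up
        ratings[loser] -= delta
    return ratings


# Hypothetical judge verdicts: each tuple is (preferred solver, other solver).
judged = [("solver_a", "solver_b"), ("solver_a", "solver_c"), ("solver_b", "solver_c")]
ratings = elo_rank(["solver_a", "solver_b", "solver_c"], judged)
ranking = sorted(ratings, key=ratings.get, reverse=True)
```

Because every verdict compares two candidate solvers directly, a global ordering emerges without running any solver on benchmark instances, which matches the instance-free framing of the abstract.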

2605.25641 2026-05-26 cs.CL 版本更新

Iterate Until Retrieved: Factual Nugget Optimization for Discoverable Continual Corrections in Agentic RAG

迭代直到检索到：面向可发现持续修正的事实性片段优化在智能体RAG中的应用

Moshe Hazoom, Gal Patel, Alon Talmor, Tom Hope

发表机构 * Mosaic AI； The Hebrew University of Jerusalem（耶路撒冷希伯来大学）

AI总结提出迭代片段优化（INO）方法，通过将反馈转化为事实性片段并利用生产环境智能体RAG系统迭代优化，提升事实性修正的可发现性和使用率。

AI中文摘要

在复杂的B2B（企业对企业）环境中，智能体检索增强生成（RAG）系统经常接收自由形式的反馈。我们关注可操作的事实性修正，而非风格、偏好或整体响应质量等通用反馈信号。我们识别这些实例并将其转化为紧凑的知识库条目，称为事实性片段。我们引入迭代片段优化（INO），一种索引时优化方法，将生产环境中的智能体RAG作为测试平台：它创建初始片段，使用触发查询及其释义进行探测，反思失败的检索和回答轨迹，并修订片段直到其可被发现。我们使用两个生产B2B知识辅助代理（一个回答公司特定知识库问题的产品支持代理，以及一个协助支持工程师的支持工单代理）在多家使用我们系统的公司中评估INO。在自动化和人工评估中，INO在事实性修正的可发现性和使用率方面持续优于基线。

英文摘要

Agentic retrieval-augmented generation (RAG) systems in complex B2B (business-to-business) settings often receive free-form response feedback. Rather than generic feedback signals such as style, preference, or overall response quality, we focus on actionable factual corrections. We identify these instances and convert them into compact knowledge-base entries, which we call factual nuggets. We introduce Iterative Nugget Optimization (INO), an index-time optimization method that uses the production agentic RAG as a test harness: it creates an initial nugget, probes it with the triggering query and paraphrases, reflects over failed retrieval and answer traces, and revises the nugget until it is discoverable. We evaluate INO with two production B2B knowledge-assistance agents across multiple companies that use our system: a product support agent that answers questions over company-specific knowledge bases, and a support ticket agent that assists support engineers. In both automated and human evaluations, INO consistently improves over baselines in the discoverability and usage of factual corrections.

2605.25626 2026-05-26 cs.CL 版本更新

Beyond Literal Translation: Evaluating Cultural Effectiveness in Social Media UGC

超越字面翻译：评估社交媒体用户生成内容中的文化有效性

Linjuan Wu, Ruiqi Zhang, Xinze Lyu, Ye Guo, Daoxin Zhang, Zhe Xu, Yao Hu, Yixin Cao, Yongliang Shen, Weiming Lu

发表机构 * Zhejiang University（浙江大学）； Fudan University（复旦大学）； Xiaohongshu Inc.（小红书公司）

AI总结针对社交媒体用户生成内容翻译中文化传递与情感共鸣不足的问题，提出CULTURE-MT基准，通过构建涵盖14个领域、4种文化负载类型的1002条UGC笔记，并引入文化有效性评估标准，实验表明传统指标无法捕捉文化有效性，且基础LLM的文化有效性与模型规模相关。

Comments Accepted by ICML2026

AI中文摘要

社交媒体平台实现了大规模跨语言交流，但由于用户生成内容（UGC）的非正式风格、文化引用和基于互动的表达方式，其翻译仍然具有挑战性。尽管近期的大语言模型（LLM）提高了翻译质量，但现有基准和指标往往未能捕捉翻译是否在真实场景中传达了预期含义和文化共鸣。在这项工作中，我们引入了CULTURE-MT，一个专注于文化传递和UGC特定情感共鸣的社交媒体翻译基准。CULTURE-MT包含跨14个领域的1,002条UGC笔记，根据文化负载符号和语言风格特征分为四类。我们还构建了面向UGC的训练数据，以微调Qwen3-8B和Qwen3-32B作为基线。我们提出文化有效性作为新的评估标准，侧重于表达准确性和文化适应性。测试包括基线在内的15个模型，我们发现传统指标无法捕捉文化有效性。我们还观察到，基础LLM上的文化有效性与模型规模相关。我们的工作为UGC翻译模型提供了全面的评估系统，并将提供一个开放的评估平台以推动该领域的研究。我们发布了CULTURE-MT基准，并提供了一个在线排行榜，提交的翻译结果可由我们训练的JUDGER进行评估。

英文摘要

Social media platforms enable large-scale cross-lingual communication, but translating user-generated content (UGC) remains challenging due to its informal style, cultural references, and interaction-based expressions. While recent LLMs have improved translation quality, existing benchmarks and metrics often fail to capture whether translations convey intended meaning and cultural resonance in real-world settings. In this work, we introduce CULTURE-MT, a benchmark for social media translation that focuses on both CULtural Transmission and UGC-specific emotion REsonance. CULTURE-MT consists of 1,002 UGC notes across 14 domains, categorized into four types based on culture-loaded symbols and linguistic style features. We also construct UGC-oriented training data to fine-tune Qwen3-8B and Qwen3-32B as baselines. We propose cultural effectiveness as a new evaluation criterion, focusing on expression accuracy and cultural adaptability. Testing 15 models, including the baselines, we find that traditional metrics fail to capture cultural effectiveness. We also observe that cultural effectiveness on base LLMs correlates with model size. Our work provides a comprehensive evaluation system for UGC translation models and will offer an open evaluation platform to advance research in this area. We release the CULTURE-MT benchmark and provide an online leaderboard where submitted translation results can be evaluated by our trained JUDGER.

2605.25604 2026-05-26 cs.CL cs.LG 版本更新

DVAO: Dynamic Variance-adaptive Advantage Optimization for Multi-reward Reinforcement Learning

DVAO: 面向多奖励强化学习的动态方差自适应优势优化

Guochao Jiang, Jingyi Song, Guofeng Quan, Chuzhan Hao, Guohua Liu, Yuewei Zhang

发表机构 * Alibaba Cloud Computing（阿里云计算）

AI总结针对多奖励强化学习中奖励组合导致训练不稳定、优势组合依赖静态超参数的问题，提出动态方差自适应优势优化方法，通过基于经验奖励方差动态调整组合权重，实现稳定训练与多目标帕累托前沿优化。

AI中文摘要

强化学习已成为将大型语言模型与人类意图和任务要求对齐的标准范式。尽管组相对策略优化为近端策略优化提供了一种高效、无价值模型的替代方案，但将其适应于现实世界的多奖励设置仍然具有挑战性。标准的标量化实践，如奖励组合和优势组合，存在显著缺陷：奖励组合经常产生平方幅度过大的优势，导致训练不稳定；而优势组合依赖静态超参数，忽略了跨目标相关性。为了解决这些限制，我们提出了动态方差自适应优势优化（DVAO），它根据 rollout 组内每个目标的经验奖励方差动态调整组合权重，有效提高具有更强学习信号的目标的权重，同时抑制噪声目标。我们从数学上证明 DVAO 保持有界的优势幅度以实现稳定训练，并引入了一种自适应的跨目标正则化机制。使用 Qwen3 和 Qwen2.5 模型在数学推理和工具使用基准上的大量实验表明，DVAO 显著优于基线方法，实现了卓越的多目标帕累托前沿和稳健的训练稳定性。

英文摘要

Reinforcement Learning has become a standard paradigm for aligning Large Language Models with human intent and task requirements. While Group Relative Policy Optimization offers an efficient, value-model-free alternative to Proximal Policy Optimization, adapting it to real-world multi-reward settings remains challenging. Standard scalarization practices, such as Reward Combination and Advantage Combination, suffer from significant drawbacks: Reward Combination frequently generates advantages with excessively large squared magnitudes that lead to training instability, while Advantage Combination relies on static hyperparameters and ignores cross-objective correlations. To address these limitations, we propose Dynamic Variance-adaptive Advantage Optimization (DVAO), which dynamically adjusts combination weights based on the empirical reward variance of each objective within a rollout group, effectively up-weighting objectives with a stronger learning signal while suppressing noisy ones. We mathematically prove that DVAO maintains bounded advantage magnitudes for stable training and introduces a self-adaptive cross-objective regularization mechanism. Extensive experiments on mathematical reasoning and tool-use benchmarks using Qwen3 and Qwen2.5 models demonstrate that DVAO significantly outperforms baseline methods, achieving a superior multi-objective Pareto frontier and robust training stability.
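
The idea of weighting objectives by their empirical reward variance inside a rollout group can be sketched as below. The abstract does not give DVAO's weighting formula, so the choice of weights proportional to the population variance, the uniform fallback, and the reward values are all assumptions made for illustration.

```python
from statistics import mean, pvariance
from typing import List, Tuple


def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantage: reward standardized within its rollout group."""
    mu = mean(rewards)
    sd = pvariance(rewards) ** 0.5
    return [(r - mu) / (sd + eps) for r in rewards]


def variance_weighted_advantages(
    rewards_per_objective: List[List[float]], eps: float = 1e-8
) -> Tuple[List[float], List[float]]:
    """Combine per-objective advantages with weights proportional to reward variance."""
    variances = [pvariance(r) for r in rewards_per_objective]
    total = sum(variances)
    n_obj = len(rewards_per_objective)
    # Assumption: fall back to uniform weights when no objective varies in the group.
    weights = [v / total for v in variances] if total > eps else [1.0 / n_obj] * n_obj
    per_obj = [group_advantages(r, eps) for r in rewards_per_objective]
    n_rollouts = len(rewards_per_objective[0])
    combined = [sum(w * a[i] for w, a in zip(weights, per_obj))
                for i in range(n_rollouts)]
    return weights, combined


# Hypothetical rewards for four rollouts under two objectives.
correctness = [1.0, 0.0, 1.0, 0.0]   # varies across the group
format_ok = [1.0, 1.0, 1.0, 1.0]     # constant, so it cannot separate rollouts
weights, combined = variance_weighted_advantages([correctness, format_ok])
```

In this toy group the constant objective receives zero weight, and because the weights form a convex combination of standardized advantages, the combined advantage stays within the range of the per-objective ones.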

2605.25601 2026-05-26 cs.CL cs.AI 版本更新

Toward a Benchmark for Controllable Simulation of Imperfect Students with Large Language Models

面向大语言模型可控模拟不完美学生的基准

Alexander Apartsin, Omri Sason, Yehudit Aperstein

发表机构 * Holon Institute of Technology（霍隆理工学院）； Afeka Tel Aviv Academic College of Engineering（阿法卡特拉维夫工程学院）

AI总结本研究提出一个基准框架，通过提示控制语言模型模拟具有指定技能轮廓的学生，并评估其可控性，为教师教育中的刻意练习提供支持。

Comments 22 pages, 7 figures

AI中文摘要

教师教育需要与表现出可识别优势、弱点和部分掌握的学习者进行刻意练习。大型语言模型可以通过模拟具有已知技能组成部分的学生来支持这种练习，使教师能够演练解释、诊断和教学回应。然而，为此目的，核心要求既不是最大化基准准确率，也不是抑制孤立的事实，而是控制模型行为，使其反映指定的技能轮廓。本文研究了是否可以通过提示引导语言模型保留某些技能同时抑制其他技能。我们引入了一个面向基准的框架，其中显式技能向量表示模拟学生，基于提示的控制指定保留和缺失的能力，并使用轮廓对齐指标、保留与遗忘比较以及跨技能校准分析来评估行为。结果表明，在结构化数学环境中可以诱导和测量选择性的部分掌握，尽管可控程度仍依赖于模型。这些发现将可控学习者模拟定位为教师教育、教育模拟和语言模型控制交叉领域的一个独特研究问题。

英文摘要

Teacher education requires deliberate practice with learners who exhibit identifiable strengths, weaknesses, and partial mastery. Large language models could support such practice by simulating students with known skill components, enabling teachers to rehearse explanations, diagnoses, and instructional responses. For this purpose, however, the central requirement is neither to maximize benchmark accuracy nor to suppress isolated facts, but to control model behavior so that it reflects a specified skill profile. This paper investigates whether prompted language models can be steered to retain some skills while suppressing others. We introduce a benchmark-oriented framework in which an explicit skill vector represents a simulated student, prompt-based control specifies retained and missing competencies, and behavior is evaluated using profile-alignment metrics, retained-versus-forgotten comparisons, and cross-skill calibration analyses. The results show that selective partial mastery can be induced and measured in a structured mathematics setting, although the degree of controllability remains model-dependent. These findings position controllable learner simulation as a distinct research problem at the intersection of teacher education, educational simulation, and language-model control.

2605.25596 2026-05-26 cs.CL 版本更新

Multilingual Phonological Feature Recognition with Self-Supervised Speech Models

基于自监督语音模型的多语言音韵特征识别

Abner Hernandez, Tomás Arias-Vergara, Daiqi Liu, Andreas Maier, Paula Andrea Pérez-Toro

发表机构 * Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany（埃尔朗根-纽伦堡弗里德里希-亚历山大大学模式识别实验室，德国）； GITA Lab. Facultad de Ingeniería. Universidad de Antioquia UdeA, Medellín, Colombia（安蒂奥基亚大学工程学院GITA实验室，哥伦比亚麦德林）

AI总结提出PhonoQ-2.0，一种基于自监督语音模型的多语言帧级音韵特征识别器，通过方式条件门控机制直接预测结构化特征向量，在域内和域外均优于CTC基线。

Comments Submitted to Interspeech 2026

2605.25572 2026-05-26 cs.CL cs.AI 版本更新

PennySynth: RAG-Driven Data Synthesis for Automated Quantum Code Generation

PennySynth：基于RAG的数据合成用于自动量子代码生成

Minghao Shao, Nouhaila Innan, Hariharan Janardhanan, Muhammad Kashif, Alberto Marchisio, Muhammad Shafique

发表机构 * eBRAIN Lab, Division of Engineering, New York University Abu Dhabi (NYUAD)（eBRAIN实验室，工程系，纽约大学阿布扎比分校）； Center for Quantum and Topological Systems (CQTS), NYUAD Research Institute（量子与拓扑系统中心（CQTS），NYUAD研究所）； Department of Computer Science and Engineering, NYU Tandon School of Engineering（计算机科学与工程系，纽约大学坦顿工程学院）

AI总结提出PennySynth框架，通过检索增强生成和代码感知嵌入，利用13,389个PennyLane指令-代码对数据集，在QHack竞赛中实现52%-68%的pass@5，显著提升量子代码生成的结构有效性和功能正确性。

Comments 11 pages, 3 figures

AI中文摘要

量子编程框架日益增长的复杂性暴露了现有基于大语言模型（LLM）的代码助手的一个关键局限性：通用模型在面对专门的量子编码挑战时，会幻觉出PennyLane特定的门名称、错误放置设备配置并生成结构无效的电路。我们提出PennySynth，一个检索增强生成框架，通过将LLM推理条件化为一个包含13,389个PennyLane指令-代码对的精选知识库来解决这一差距，该知识库通过一个三阶段（提取、验证和去重）流程从官方PennyLane仓库、社区GitHub源和QHack竞赛档案中构建。PennySynth引入了一种使用st-codesearch-distilroberta-base的代码感知嵌入策略，该策略针对自然语言到代码的检索进行训练，将平均检索余弦相似度从通用基线的0.45提高到0.726。在涵盖QHack竞赛三年（2022、2023、2024）的74个挑战上进行评估，PennySynth在QHack 2022、2023和2024上分别达到64%、68%和52%的pass@5，相比无检索的Claude Sonnet 4.6提高了+28、+25和+28个百分点。我们进一步引入了一个量子适应的CodeBLEU指标，该指标对qml.*令牌模式进行加权，并表明结构代码相似性和功能正确性捕捉了量子代码质量的不同方面。受控消融实验揭示，代码感知嵌入是检索性能的主要驱动因素，而当检索质量足够精确时，数据集扩展和源组合提供了额外的增益。

英文摘要

The growing complexity of quantum programming frameworks has exposed a critical limitation in existing large language model (LLM)-based code assistants: general-purpose models hallucinate PennyLane-specific gate names, misplace device configurations, and produce structurally invalid circuits when faced with specialized quantum coding challenges. We present PennySynth, a retrieval-augmented generation framework that addresses this gap by conditioning LLM inference on a curated knowledge base of 13,389 PennyLane instruction-code pairs, built via a three-stage extraction, verification, and deduplication pipeline over official PennyLane repositories, community GitHub sources, and QHack competition archives. PennySynth introduces a code-aware embedding strategy using st-codesearch-distilroberta-base, trained for natural-language-to-code retrieval, increasing average retrieval cosine similarity from 0.45 to 0.726 compared to a general-purpose baseline. Evaluated across 74 challenges spanning three years of the QHack competition (2022, 2023, 2024), PennySynth achieves 64%, 68%, and 52% pass@5 on QHack 2022, 2023, and 2024, respectively, improving over Claude Sonnet 4.6 without retrieval by +28, +25, and +28 percentage points. We further introduce a quantum-adapted CodeBLEU metric that upweights qml.* token patterns and show that structural code similarity and functional correctness capture distinct aspects of quantum code quality. Controlled ablations reveal that code-aware embeddings are the primary driver of retrieval performance, while dataset expansion and source composition provide additional gains when retrieval quality is sufficiently precise.
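
For readers unfamiliar with pass@k, the widely used unbiased estimator draws n samples per problem, counts the c that pass, and estimates the chance that at least one of k samples passes. The sketch below implements that estimator. Whether PennySynth computes pass@5 this way or by direct counting over five samples is not stated in the abstract, so treat this as background rather than as the paper's evaluation code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n - c, k) / C(n, k) for n samples with c correct."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers only: 10 samples for one challenge, 1 of them correct.
estimate = pass_at_k(n=10, c=1, k=5)
```

A benchmark-level score is then the mean of this quantity over all challenges.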

2605.25565 2026-05-26 cs.LG cs.CL 版本更新

RotMoLE: Enhancing Mixture of Low-Rank Experts through Rotational Gating Mechanism

RotMoLE：通过旋转门控机制增强混合低秩专家

Mengyang Sun, Maochuan Dou, Tao Feng, Dan Zhang, Yihao Wang, Junpeng Liu, Yifan Zhu, Jie Tang

发表机构 * Tsinghua University（清华大学）； Beijing Information Science and Technology University（北京信息科技大学）； National University of Singapore（新加坡国立大学）； Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））； Beijing University of Posts and Telecommunications（北京邮电大学）

AI总结针对MoE-LoRA中传统门控仅标量加权限制表示能力的问题，提出RotMoLE框架，通过引入旋转门控机制对每个专家进行旋转操作，提升专家利用率和专业化程度，在多任务和多语言训练中验证有效性。

AI中文摘要

虽然大型语言模型（LLM）通常在进行垂直应用之前会针对特定领域任务进行微调，但将它们适应于具有多样化专业知识的复杂场景仍然具有挑战性。与此同时，混合专家（MoE）架构已成为训练LLM的关键范式，最近的一些工作也将MoE引入参数高效微调（PEFT），提出了混合低秩专家（MoE-LoRA），以增强低秩适配器学习复杂知识的能力。然而，MoE中的传统门控机制通常仅对选中的专家应用标量重新加权，从而限制了其表示和泛化的潜在能力。受MoE-LoRA中低秩结构的启发和推动，我们提出了RotMoLE，一个专门针对低秩专家的MoE框架，其特点是一个额外的旋转门控。除了简单的缩放，RotMoLE为每个选中的专家实现了一个旋转机制，从而在专家候选有限的情况下，实现了更好的专家利用和专业化，以学习多样化的数据。在复杂多任务和多语言训练场景下的实证结果验证了我们的有效性。

英文摘要

While Large Language Models (LLMs) are commonly fine-tuned to handle domain-specific tasks before being applied to vertical applications, adapting them to complex scenarios with diverse specialized knowledge remains challenging. Meanwhile, the Mixture-of-Experts (MoE) architecture has emerged as a crucial paradigm for training LLMs, and some recent works have incorporated MoE into Parameter-Efficient Fine-Tuning (PEFT), proposing the Mixture of Low-rank Experts (MoE-LoRA) to enhance the power of low-rank adapters for learning complicated knowledge. However, conventional gating mechanisms in MoE typically apply only a scalar reweighting to selected experts, thereby limiting their capacity for representation and generalization. Motivated and enabled by the low-rank structures in MoE-LoRA, we propose RotMoLE, a specialized MoE framework for low-rank experts featuring an additional rotation gate. Beyond simple scaling, RotMoLE implements a rotation mechanism for each selected expert, enabling superior expert exploitation and specialization for learning diverse data, especially when expert candidates are limited. Empirical results on complex multi-task and multilingual training scenarios validate the effectiveness of our approach.

2605.25549 2026-05-26 cs.CL cs.AI cs.LG 版本更新

BC Protocol: Structured Dual-Expert Dialogue for Eliciting High-Quality Chain-of-Thought Post-Training Data

BC协议：结构化双专家对话用于生成高质量思维链后训练数据

Bo Zou, Chao Xu

AI总结针对大语言模型后训练中高质量专家思维链数据生产瓶颈，提出BC协议——一种结构化双专家引出方法，通过配对领域专家与知识工程师，系统外化专家隐性判断为自然语言推理链，实验证明其在推理过程自然性上具有压倒性优势。

AI中文摘要

高质量的专家思维链（CoT）数据是大语言模型（LLM）后训练的核心瓶颈之一。现有数据生产方法各有结构性局限：众包标注缺乏深度推理路径；专家单独写作受限于“专家盲点”——专家会结构性跳过他们认为显而易见的推理步骤；RLHF仅产生偏好信号而非推理链。本文提出BC协议——一种用于LLM后训练数据生产的结构化双专家引出方法。该方法精心配对领域专家（晶体智力）与知识工程师（流体智力），系统地将专家的隐性判断外化为自然语言推理链。我们引入了参与者资质模型，定义了影响引出质量的六个参与者特征维度。“校准的无知”是本文提出的原创概念。我们进一步提出“选择优于规定”作为方法论原则：对于隐性知识引出任务，将质量控制资源投入人员选择比投入同等资源于流程设计能获得更高回报。在叙事小说领域的受控实验中，我们直接比较了BC协议双对话产生的CoT（A组，n=20）与同一领域专家独立撰写的CoT（B组，n=20）。三个跨供应商评判模型——GPT-4o、Claude Opus 4.5和Gemini 2.5 Pro——在五个维度上进行了盲评（共600个评分）。结果表明，BC协议在“推理过程自然性”上具有压倒性优势（A组均值4.80 vs. B组均值1.30，p=2.4×10^{-8}，Cliff's δ=1.0）。

英文摘要

High-quality expert chain-of-thought (CoT) data is one of the core bottlenecks in large language model (LLM) post-training. Existing data production methods each have structural limitations: crowdsourced annotation lacks deep reasoning paths; expert solo writing is constrained by the "expert blind spot" -- experts structurally skip reasoning steps they consider obvious; RLHF only produces preference signals rather than reasoning chains. This paper proposes the BC Protocol -- a structured dual-expert elicitation method for LLM post-training data production. The method carefully pairs a domain expert (crystallized intelligence) with a knowledge engineer (fluid intelligence), systematically externalizing the expert's implicit judgments as natural language reasoning chains. We introduce the Participant Aptitude Model, which defines six participant characteristic dimensions that affect elicitation quality. "Calibrated Ignorance" is an original concept proposed in this paper. We further propose "Selection-over-Prescription" as a methodological principle: for implicit knowledge elicitation tasks, investing quality-control resources in personnel selection yields a higher return than investing the same resources in process design. In a controlled experiment in the narrative fiction domain, we directly compared CoT produced by BC Protocol dual dialogue (Group A, $n=20$) against CoT written independently by the same domain expert (Group B, $n=20$). Three cross-vendor judge models -- GPT-4o, Claude Opus 4.5, and Gemini 2.5 Pro -- conducted blind evaluation across five dimensions (600 ratings total). Results show that the BC Protocol achieves an overwhelming advantage in "naturalness of reasoning process" (Group A mean 4.80 vs. Group B mean 1.30, $p=2.4\times10^{-8}$, Cliff's $\delta=1.0$).
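
The effect size reported above, Cliff's delta, is the share of cross-group pairs in which a Group A rating exceeds a Group B rating, minus the share in which it is lower. A value of 1.0 therefore means every A rating is above every B rating. The sketch below computes it; the ratings are made-up placeholders, not the paper's data.

```python
from typing import Sequence


def cliffs_delta(a: Sequence[float], b: Sequence[float]) -> float:
    """Cliff's delta: (#pairs with a > b minus #pairs with a < b) / (len(a) * len(b))."""
    if not a or not b:
        raise ValueError("both groups must be non-empty")
    greater = sum(1 for x in a for y in b if x > y)
    less = sum(1 for x in a for y in b if x < y)
    return (greater - less) / (len(a) * len(b))


# Placeholder ratings on a 1-5 scale with complete separation between groups.
group_a = [5, 5, 4, 5]
group_b = [1, 2, 1, 1]
delta = cliffs_delta(group_a, group_b)
```

Ties contribute to neither count, so identical groups give a delta of 0.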

2605.25520 2026-05-26 cs.CL 版本更新

Is Inference Mediated by Distinct Semantic Structures in LLMs? A Mechanistic Interpretation

LLMs中的推理是否由不同的语义结构介导？一种机制性解释

Nura Aljaafari, Marco Valentino, André Freitas

发表机构 * University of Manchester（曼彻斯特大学）； University of Sheffield（谢菲尔德大学）； Idiap Research Institute（Idiap研究所）； CRUK National Biomarker Centre, University of Manchester（CRUK国家生物标志物中心，曼彻斯特大学）

AI总结通过SVD分解和激活引导实验，研究自然语言推理中Transformer模型是否编码语义操作，发现操作级子空间部分重叠且因果影响预测，表明模型不仅编码假设与前提的关系，还部分编码如何关联。

Comments 26 pages, 16 figures, 13 tables

AI中文摘要

正确预测标签并不一定需要表示产生该标签的操作。已知Transformer表示携带标签级信息，但它们是否编码产生这些标签的语义操作尚不清楚。我们使用受控的前提-假设对（仅通过单一语义变换区分）在自然语言推理中对此进行研究。利用逐层激活，通过SVD估计操作级子空间，并通过在四个开源解码器模型中的激活引导测试其因果相关性。变换效果以84.8%-99%的准确率可解码，并占据部分不同但重叠的子空间，超过随机子空间基线。引导实验表明这些方向因果性地影响预测，尽管可引导性因模型而异；跨操作引导进一步揭示了结构化干扰以及子空间选择性与跨操作独立性之间的分离。这些发现表明，模型不仅编码假设与前提相关，还部分编码如何相关，这意味着机制分析和控制应在语义操作层面而非仅预测标签层面进行。

英文摘要

Predicting a label correctly does not necessarily require representing the operation that produces it. Transformer representations are known to carry label-level information, but whether they encode semantic operations producing those labels is unclear. We investigate this in Natural Language Inference using controlled premise-hypothesis pairs that differ by a single semantic transformation. Using layer-wise activations, we estimate operation-level subspaces via SVD and test their causal relevance through activation steering in four open-weight decoder models. Transformation effects are decodable with $84.8$-$99\%$ accuracy and occupy partially distinct but overlapping subspaces, exceeding random-subspace baselines. Steering experiments show that these directions causally influence predictions, though steerability varies across models; cross-operation steering further reveals structured interference and a dissociation between subspace selectivity and cross-operation independence. These findings indicate that the models encode not only that a hypothesis relates to a premise but also, in part, how it does so, implying that mechanistic analysis and control should operate at the level of semantic operations rather than predicted labels alone.

2605.25511 2026-05-26 cs.CL 版本更新

CRPO: Character-centric Group Relative Policy Optimization for Role-aware Reasoning in Role-playing Agents

CRPO：以角色为中心的群体相对策略优化用于角色扮演代理中的角色感知推理

Yihong Tang, Kehai Chen, Liang Yue, Benyou Wang, Min Zhang

发表机构 * Institute of Computing and Intelligence（计算与智能研究院）； Harbin Institute of Technology（哈尔滨工业大学）； Shenzhen Loop Area Institute (SLAI)（深圳Loop区研究院）； The Chinese University of Hong Kong（香港中文大学）

AI总结提出CRPO框架，通过解耦任务逻辑与风格奖励、动态调整优化约束和利用通用响应作为负基线，解决强化学习在角色扮演中角色保真度下降和风格崩溃问题。

AI中文摘要

强化学习的最新进展，特别是群体相对策略优化（GRPO），显著提升了大型语言模型的推理能力。然而，将这些以问题为中心的优化方法应用于角色扮演代理时，往往会导致角色保真度下降和风格崩溃，因为它们优先考虑上下文特定的效用而非角色对齐。为了解决这个问题，我们提出了以角色为中心的群体相对策略优化（CRPO），这是一个旨在将强化学习目标与角色扮演任务重新对齐的框架。CRPO通过三种机制提升角色独特性：解耦任务逻辑与风格奖励以解决梯度冲突，根据角色复杂度动态调整优化约束，以及利用通用响应作为负基线以防止模型回归到常见分布。大量实验表明，CRPO在一致性、情感等方面优于现有方法。

英文摘要

Recent advancements in Reinforcement Learning (RL), particularly Group Relative Policy Optimization (GRPO), have significantly enhanced the reasoning capabilities of Large Language Models. However, applying these problem-centric optimization methods to role-playing agents often leads to a loss of character fidelity and style collapse, as they prioritize context-specific utility over persona alignment. To address this, we propose Character-Centric Group Relative Policy Optimization (CRPO), a framework designed to realign RL objectives with the role-playing task. CRPO improves character distinctiveness through three mechanisms: decoupling task logic from stylistic rewards to resolve gradient conflicts, dynamically adapting optimization constraints based on character complexity, and utilizing generic responses as negative baselines to prevent the model from reverting to a common distribution. Extensive experiments demonstrate that CRPO outperforms existing methods on consistency, emotion, and other evaluation dimensions.

2605.25502 2026-05-26 cs.CL cs.AI 版本更新

A Controlled Synthetic Benchmark for Educational Aspect-Based Sentiment Analysis

面向教育方面情感分析的可控合成基准

Yehudit Aperstein, Alexander Apartsin

发表机构 * Intelligent Systems, Afeka Academic College of Engineering（阿法卡工程学院智能系统系）； School of Computer Science, Faculty of Sciences, Holon Institute of Technology（霍隆理工学院理学院计算机科学学院）

AI总结为解决教育领域标注数据稀缺问题，提出一个包含10,000条合成课程评论和20个教学方面的可控合成基准，并通过实验验证了任务难度及合成到真实的迁移能力。

Comments 39 pages, 14 figures

AI中文摘要

教育方面情感分析（ABSA）可以支持课程改进，但带有方面标签的学生反馈仍然稀缺，因为教育评论是私有的、特定于机构的且标注成本高昂。本研究引入了一个面向教育ABSA的可控合成基准，该基准由10,000条合成课程评论构建，具有明确的训练-验证-测试划分，以及一个涵盖教学质量、评估与课程管理、学习需求、学习环境和参与度的20方面教学模式。该语料库通过采样的目标标签、采样的细微属性以及经过三轮评审-编辑流程优化的真实感提示生成。在该基准上，使用TF-IDF、两阶段变换器和联合编码器的局部基线表明该任务并非易事；最强的未调优模型BERT在留出集上的检测微F1得分为0.2760，而一个适度的低学习率BERT调度将其提升至0.2930。基于gpt-5.2的全测试GPT推理在零样本模式下达到0.2519微F1，在使用基于检索的少样本提示时达到0.2501，使批量推理高于经典基线并接近紧凑的联合编码器。在来自Herath等人的2,829条映射学生反馈评论上进行的保守外部评估中，BERT在9个方面重叠上的微F1得分为0.4593，表明部分合成到真实的迁移。真实性和忠实度分析作为生成器诊断报告，阐明了基准如何稳定以及标签噪声仍然存在的位置。因此，本研究贡献了一个合成教育ABSA语料库、一个文档化的生成过程以及一个可复现的基准设置，适用于公共标注数据仍然难以获得的领域。

英文摘要

Educational aspect-based sentiment analysis (ABSA) can support course improvement, but public aspect-labeled student feedback remains scarce because educational reviews are private, institution-specific, and expensive to annotate. This study introduces a controlled synthetic benchmark for educational ABSA built from 10,000 synthetic course reviews with explicit train-validation-test splits and a 20-aspect pedagogical schema spanning instructional quality, assessment and course management, learning demand, learning environment, and engagement. The corpus is generated with sampled target labels, sampled nuance attributes, and a realism-tuned prompt refined through a three-cycle judge-editor procedure. On the resulting benchmark, local baselines with TF-IDF, two-step transformers, and joint encoders show that the task is nontrivial; the strongest untuned model, BERT, reaches a held-out detection micro-F1 of 0.2760, while a modest lower-rate BERT schedule improves this to 0.2930. Full-test GPT-based inference with gpt-5.2 reaches 0.2519 micro-F1 in zero-shot mode and 0.2501 with retrieval-based few-shot prompting, placing batch inference above the classical baseline and close to the compact joint encoders. A conservative external evaluation on 2,829 mapped student-feedback reviews from Herath et al. yields a micro-F1 of 0.4593 for BERT on a 9-aspect overlap, indicating partial synthetic-to-real transfer. Realism and faithfulness analyses are reported as generator diagnostics that clarify how the benchmark was stabilized and where label noise remains. The study therefore contributes a synthetic educational ABSA corpus, a documented generation procedure, and a reproducible benchmark setting for a domain in which public labeled data remain difficult to obtain.
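
The detection scores above are micro-F1 values, which pool true positives, false positives, and false negatives over all reviews and aspects before computing F1. A minimal sketch for multi-label aspect detection follows. The aspect names and predictions are hypothetical, and the benchmark's exact matching rules may differ.

```python
from typing import List, Set


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1 over per-review sets of detected aspects."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))   # aspects found correctly
    fp = sum(len(p - g) for g, p in zip(gold, pred))   # aspects predicted but absent
    fn = sum(len(g - p) for g, p in zip(gold, pred))   # aspects present but missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical gold and predicted aspect sets for three reviews.
gold = [{"workload", "feedback"}, {"pacing"}, set()]
pred = [{"workload"}, {"pacing", "exams"}, {"feedback"}]
score = micro_f1(gold, pred)
```

With 20 possible aspects per review, frequent aspects dominate this pooled score, which is one reason micro-F1 and macro-F1 can diverge on such schemas.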

2605.25475 2026-05-26 cs.CL cs.AI 版本更新

IndexMem: Learned KV-Cache Eviction with Latent Memory for Long-Context LLM Inference

IndexMem: 基于潜在记忆的学习型KV缓存驱逐策略用于长上下文LLM推理

Xintong Yang, Hao Gu, Binxing Xu, Lujun Li, Bei Liu, Jiacheng Liu, Qiyuan Zhu, Sirui Han, Yike Guo

发表机构 * The Hong Kong University of Science and Technology（香港科技大学）； Zhejiang University（浙江大学）

AI总结提出一种可学习的索引器预测KV重要性，并结合轻量级潜在记忆模块压缩被驱逐的令牌，以在有限KV预算下实现准确的长上下文推理。

AI中文摘要

大型语言模型（LLM）越来越需要处理长上下文，但标准softmax注意力机制的KV缓存随序列长度线性增长，迅速成为长上下文推理的瓶颈。一种实用的补救措施是驱逐不太重要的KV条目；然而，现有的驱逐策略大多是启发式的，难以捕捉令牌重要性的丰富、输入相关的分布。在这项工作中，我们引入了一个可学习的索引器来预测KV重要性，从而能够更准确地保留关键令牌。同时，简单地驱逐令牌会永久丢弃其信息，导致不可逆的遗忘和长距离检索性能下降。为了解决这个问题，我们提出了一个轻量级的潜在记忆模块，将驱逐的令牌压缩成紧凑的、在线更新的状态，并提供残差读出以补偿通过KV驱逐丢失的注意力贡献。总的来说，我们的方法能够在有限的KV预算下实现准确的长上下文推理，在RULER（4K/16K）上对Qwen、Mistral和Llama模型（在激进驱逐下提升高达25分）带来一致的改进，在Needle-in-a-Haystack检索中显著更稳定，并且在LongBench得分和压缩曲线上优于现有的驱逐策略。

英文摘要

Large Language Models (LLMs) are increasingly expected to operate over long contexts, yet standard softmax attention incurs a KV cache that grows linearly with sequence length, quickly becoming the bottleneck for long context inference. A practical remedy is to evict less important KV entries; however, existing eviction policies are largely heuristic and struggle to capture the rich, input-dependent distribution of token importance. In this work, we introduce a learnable indexer that predicts KV importance, enabling more accurate retention of critical tokens. Meanwhile, naively evicting tokens permanently discards their information, leading to irreversible forgetting and degraded retrieval over long ranges. To address this, we propose a lightweight latent memory module that compresses evicted tokens into a compact, online-updated state and provides residual readouts to compensate for the attention contributions lost through KV eviction. Collectively, our method enables accurate long-context inference under a bounded KV budget, delivering consistent improvements on RULER (4K/16K) across Qwen, Mistral, and Llama models (up to 25 points under aggressive eviction), markedly more stable Needle-in-a-Haystack retrieval, and superior LongBench scores and compression curves compared to existing eviction policies.
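
The two ideas in this abstract, score-based retention and a compact state that absorbs evicted entries, can be sketched in a few lines. This is an illustrative sketch only: the importance scores are passed in rather than predicted by a learned indexer, and a running mean stands in for the paper's latent memory module, whose actual update rule the abstract does not specify. The function name `evict_with_memory` is hypothetical.

```python
from typing import List, Tuple

Vector = List[float]


def evict_with_memory(
    values: List[Vector],
    scores: List[float],
    budget: int,
    memory: Vector,
    memory_count: int,
) -> Tuple[List[int], Vector, int]:
    """Keep the `budget` highest-scoring entries; fold the rest into a running mean."""
    order = sorted(range(len(values)), key=lambda i: scores[i], reverse=True)
    kept = sorted(order[:budget])  # restore original token order for the kept entries
    for i in order[budget:]:
        memory_count += 1
        # Incremental mean update (assumed stand-in for a learned memory): m <- m + (v - m) / n
        memory = [m + (v - m) / memory_count for m, v in zip(memory, values[i])]
    return kept, memory, memory_count


values = [[1.0, 0.0], [0.0, 2.0], [4.0, 4.0], [2.0, 0.0]]
scores = [0.9, 0.1, 0.8, 0.2]
kept, memory, count = evict_with_memory(
    values, scores, budget=2, memory=[0.0, 0.0], memory_count=0
)
```

The cache size stays fixed at `budget` entries plus one memory vector, while evicted information still contributes through the memory state instead of being discarded outright.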

2605.25474 2026-05-26 cs.CL 版本更新

TypedCSIP: Typed Counterfactual Pretraining for Chinese Legislative Conflict Classification

TypedCSIP：面向中国立法冲突分类的类型化反事实预训练

Yao Liu

发表机构 * Chengdu University of Technology, Leshan, China（成都理工大学，乐山，中国）； School of Computer Sciences, Universiti Sains Malaysia, Penang, Malaysia（马来西亚理科大学计算机科学学院，槟城，马来西亚）

AI总结提出TypedCSIP方法，通过类型化反事实选择性干预预训练（阶段1）和五路分类头迁移（阶段2），在LCR-CN基准上提升立法冲突分类的宏F1值。

AI中文摘要

TypedCSIP是一种针对LCR-CN基准（Zhao等人，2026）冲突分类任务的类型化反事实预训练方法：给定（上位法，下位法）条款对，预测该对是否冲突以及四种法律教义类型（责任、条件、制裁、定义）中哪一种描述不一致。我们利用LCR-CN中专家编写的最小修订作为训练时的反事实监督；测试时分类器仅读取原始条款对。阶段1在（上位法，下位法，专家修订）三元组上使用类型化反事实选择性干预预训练目标预训练共享编码器，将专家修订视为反事实，类型因子头必须将其分类为无冲突证据。阶段2将编码器迁移到五路分类头。确认性测试在观察v6测量之前在开放科学框架上注册：18个种子，锁定规则要求每种子平均差异至少0.8个百分点，且种子自举和学生t 95%置信下限均大于零。在696条记录测试集上，v2变体在chinese-roberta-wwm-ext上比最强单模型基线提高宏F1 +0.916个百分点，在SAILER跨骨干复制上提高+1.288个百分点；两个单元格均通过规则。在244条Unseen-gB记录上的冷启动分层结果在两个骨干上均保持正增益。跨任务诊断显示阶段2编码器是分类专用的，不能迁移到LCR-CN的上位法检索任务，因此我们将贡献限定在冲突分类。我们发布代码、72个预注册预测文件、匹配种子和MLM控制辅助文件以及OSF预注册记录。

英文摘要

TypedCSIP is a typed counterfactual pretraining method for the conflict-classification task of the LCR-CN benchmark (Zhao et al., 2026): given a (superior, subordinate) provision pair, predict whether the pair conflicts and which of four legal-doctrine types (Responsibility, Condition, Sanction, Definition) describes the inconsistency. We exploit LCR-CN's expert-written minimal revisions as training-time counterfactual supervision; at test time the classifier reads only the original pair. Stage 1 pretrains a shared encoder with a typed Counterfactual Selective Intervention Pretraining objective on (superior, subordinate, expert-revised) triplets, treating the expert revision as a counterfactual that the typed factor head must classify as carrying no conflict evidence. Stage 2 transfers the encoder to a five-way classification head. The confirmatory test was registered on the Open Science Framework before observing v6 measurements: 18 seeds, locked rule requiring mean per-seed difference at least 0.8 pp with both seed-bootstrap and Student-t 95% lower bounds above zero. On the 696-record test split, the v2 variant improves macro-F1 over the strongest single-model baseline by +0.916 pp on chinese-roberta-wwm-ext and +1.288 pp on the SAILER cross-backbone replication; both cells pass the rule. A cold-start stratified result on the 244 Unseen-gB records keeps the gain positive on both backbones. A cross-task diagnostic shows the Stage-2 encoder is classification-specialized and does not transfer to LCR-CN's superior-law retrieval task, so we scope the contribution to conflict classification. We release code, 72 pre-registered prediction files, matched-seed and MLM-control auxiliaries, and the OSF pre-registration record.
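
The locked confirmatory rule described above combines three checks on the per-seed macro-F1 differences. A rough sketch of such a decision function is given below. The critical value 2.110 assumes a two-sided 95% interval with 17 degrees of freedom (18 seeds), the percentile bootstrap is one of several possible bootstrap variants, and the sample differences are invented for illustration; none of these details are confirmed by the abstract.

```python
import random
import statistics
from typing import List


def passes_locked_rule(
    diffs_pp: List[float],
    min_mean_pp: float = 0.8,
    t_crit: float = 2.110,  # assumed: two-sided 95% critical value, 17 degrees of freedom
    n_boot: int = 2000,
    seed: int = 0,
) -> bool:
    """Check mean >= threshold and both 95% lower bounds above zero."""
    n = len(diffs_pp)
    mean = statistics.fmean(diffs_pp)
    # Student-t lower bound on the mean per-seed difference.
    t_lower = mean - t_crit * statistics.stdev(diffs_pp) / n ** 0.5
    # Percentile bootstrap over seeds (resample seeds with replacement).
    rng = random.Random(seed)
    boot_means = sorted(
        statistics.fmean(rng.choices(diffs_pp, k=n)) for _ in range(n_boot)
    )
    boot_lower = boot_means[int(0.025 * n_boot)]
    return mean >= min_mean_pp and t_lower > 0 and boot_lower > 0


# Invented per-seed differences in percentage points (not the paper's data).
illustrative_diffs = [0.6, 1.2] * 9
result = passes_locked_rule(illustrative_diffs)
```

Requiring all three conditions at once means a large mean gain driven by a few lucky seeds does not pass, since the lower bounds would then fall to zero or below.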

2605.25463 2026-05-26 cs.CL 版本更新

A Lightweight Hybrid Transformer-CRF Architecture for Multi-Type Bangla Medical Entity Recognition

一种轻量级混合Transformer-CRF架构用于多类型孟加拉语医学实体识别

Peyal Saha, Ahsanul Haque Hasib, Shoumik Barman Polok

AI总结提出一种轻量级孟加拉语医学实体识别框架，通过知识蒸馏将12层BanglaBERT-CRF教师模型压缩为4层学生模型，并应用INT8动态量化，在保持性能的同时实现8.6倍CPU加速和近48%存储节省。

AI中文摘要

MedER指的是医学实体的识别。它对于从非结构化医学文本中提取结构化临床信息至关重要。许多现有系统依赖于基于Transformer的模型，这些模型计算成本高且难以在资源受限环境中部署。此外，早期的工作通常使用宽松的评估指标，通过奖励正确预测主导的“外部”（O）标记来人为地提升性能。在本文中，我们提出了一种轻量级的孟加拉语医学实体识别（MedER）框架。我们使用一个12层BanglaBERT模型结合条件随机场（CRF）层进行精确边界实体检测，建立了严格的基线。为了解决部署限制，我们通过知识蒸馏（KD）将该教师模型压缩为一个4层学生网络，其中学生从教师的CRF前软发射logits中学习。最后，我们应用INT8动态量化进一步减小模型大小和推理成本。我们最终的量化学生模型实现了8.6倍的CPU加速，同时所需存储比CRF教师模型减少近48%。

英文摘要

MedER refers to the identification of medical entities. It is crucial for extracting structured clinical information from unstructured medical text. Many existing systems rely on transformer-based models, which are computationally expensive and difficult to deploy in resource-constrained environments. Furthermore, earlier works often use relaxed evaluation metrics that artificially inflate performance by rewarding correct prediction of dominant "Outside" (O) tokens. In this paper, we propose a lightweight Medical Entity Recognition (MedER) framework for the Bangla language. We establish a rigorous baseline using a 12-layer BanglaBERT model combined with a Conditional Random Field (CRF) layer for exact-boundary entity detection. To address deployment constraints, we compress this teacher model into a 4-layer student network through Knowledge Distillation (KD), where the student learns from the teacher's pre-CRF soft emission logits. Finally, we apply INT8 dynamic quantization to further reduce model size and inference cost. Our final quantized student achieves an 8.6x CPU speedup while requiring nearly 48 percent less storage than the CRF teacher model.

2605.25454 2026-05-26 cs.HC cs.AI cs.CL cs.CY cs.SI 版本更新

AI Content Moderation in Therapy Conversations

AI在治疗对话中的内容审核

Jiwon Kim, Claire Wang, Taeung Yoon, Sabelle Huang, Koustuv Saha

AI总结研究审计三种主流内容审核系统（OpenAI、Meta、Google）在真实治疗对话中的标记行为，揭示其限制LLM作为治疗师的潜力。

2605.25447 2026-05-26 cs.CL 版本更新

HyLaT: 通过混合潜在-文本协议实现高效多智能体通信

Xinyi Mou, Siyuan Wang, Zejun Li, Yulan He, Zhongyu Wei

发表机构 * Fudan University（复旦大学）； The Chinese University of Hong Kong（香港中文大学）； King’s College London（伦敦国王学院）； The Alan Turing Institute（艾伦·图灵研究所）； Shanghai Innovation Institute（上海创新研究院）

AI总结针对多智能体通信中的三元困境，提出混合潜在-文本协议HyLaT，通过潜在通道传输认知信号提升效率，自然语言表达关键信号保证可解释性，并设计两阶段训练框架，显著降低通信开销同时保持任务性能。

AI中文摘要

通信协议设计是基于大语言模型的多智能体系统中的核心挑战。现有的单通道方法面临固有的通信三元困境：基于文本的方法可解释但冗长，而基于潜在空间的方法高效但不透明且局限于单向工作流。受多通道通信理论启发，我们提出HyLaT，一种混合潜在-文本通信协议，通过潜在通道传输精细的认知信号以提高效率，同时用自然语言表达简洁的关键信号以保持可解释性和精确性。我们引入一个两阶段训练框架，结合单智能体混合生成学习和多智能体交互协同训练，使智能体能够在多轮交互中生成和解释混合消息。实验表明，HyLaT显著降低了通信开销，同时保持了竞争性的任务性能，并在不同设置下具有强大的泛化能力和鲁棒性。

英文摘要

Communication protocol design is a central challenge in large language model-based multi-agent systems. Existing single-channel approaches face an inherent communication trilemma: text-based methods are interpretable but verbose, while latent-space methods are efficient but opaque and limited to unidirectional workflows. Inspired by multi-channel communication theory, we propose HyLaT, a hybrid latent-text communication protocol that transmits elaborate cognitive signals through a latent channel for efficiency, while expressing concise critical signals in natural language to preserve interpretability and precision. We introduce a two-stage training framework combining single-agent hybrid generation learning and multi-agent interactive co-training, enabling agents to generate and interpret hybrid messages across multiple rounds of interaction. Experiments demonstrate that HyLaT reduces communication overhead significantly while maintaining competitive task performance, with strong generalization and robustness across diverse settings.

2605.25420 2026-05-26 cs.CL cs.AI cs.CY 版本更新

SomaliBench Eval: Measuring English-to-Somali Refusal Gaps in Open-Weight Language Models

SomaliBench Eval：衡量开源语言模型中英语到索马里语的拒绝差距

Khalid Yusuf Dahir

发表机构 * Independent researcher（独立研究人员）

AI总结通过构建索马里语有害意图基准并评估四个开源模型，发现英语到索马里语的拒绝率存在显著差距，且多数非拒绝输出为不流畅的无效内容。

Comments 12 pages, 3 figures, 4 tables. Code: https://github.com/khaledyusuf44/somalibench_eval Dataset: https://huggingface.co/datasets/khaledyusuf44/somalibench-v0

AI中文摘要

大型语言模型的安全评估仍然高度以英语为中心，即使模型在全球部署，低资源语言的评估也严重不足。我们在SomaliBench v0上评估了四个开源指令微调模型，这是一个由母语者验证的基准，包含100对英语和索马里语的有害意图提示。每个模型（Llama-3.1-8B-Instruct、Gemma-2-9B-Instruct、Qwen-2.5-7B-Instruct和Aya-23-8B）均在本地运行，温度为0，并使用相同的英语“有帮助、无害、诚实”（HHH）系统提示。一个固定的Claude Sonnet快照（claude-sonnet-4-5-20250929）将每个响应分类为拒绝、遵从或不清楚；母语作者对分层抽样的80行样本进行抽查。我们发现所有四个模型在英语到索马里语之间存在巨大的拒绝差距：Llama-3.1-8B（0.90；95%自助法置信区间[0.85, 0.96]）、Aya-23-8B（0.75 [0.67, 0.83]）、Qwen-2.5-7B（0.69 [0.59, 0.78]）和Gemma-2-9B（0.38 [0.27, 0.49]）。对于三个模型，索马里语中主要的非拒绝模式不是流畅的有害遵从，而是不清楚的输出：空、错误语言或不连贯的生成。母语验证抽查在80个采样行上与判断器达到100%一致（Cohen's kappa = 1.00）。我们仅报告总体拒绝率、类别差距和可靠性统计；原始模型生成保留在本地，不发布。

英文摘要

Large language model safety evaluation remains heavily English-centered, leaving low-resource languages under-measured even when models are deployed globally. We evaluate four open-weight instruction-tuned models on SomaliBench v0, a native-author-verified benchmark of 100 harmful-intent prompts paired across English and Somali. Each of Llama-3.1-8B-Instruct, Gemma-2-9B-Instruct, Qwen-2.5-7B-Instruct, and Aya-23-8B is run locally with temperature 0 and the same English "helpful, harmless, and honest" (HHH) system prompt. A pinned Claude Sonnet snapshot (claude-sonnet-4-5-20250929) classifies each response as refused, complied, or unclear; the native author spot-checks a stratified 80-row sample. We find large English-to-Somali refusal gaps for all four models: Llama-3.1-8B (0.90; 95% bootstrap CI [0.85, 0.96]), Aya-23-8B (0.75 [0.67, 0.83]), Qwen-2.5-7B (0.69 [0.59, 0.78]), and Gemma-2-9B (0.38 [0.27, 0.49]). For three models, the dominant Somali non-refusal mode is not fluent harmful compliance but unclear output: empty, wrong-language, or incoherent generations. The native verification spot-check achieves 100% agreement with the judge (Cohen's kappa = 1.00) on the 80 sampled rows. We report aggregate refusal rates, category gaps, and reliability statistics only; raw model generations are retained locally and are not released.
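
The reported gaps with bootstrap intervals can be reproduced in spirit with a short computation. The sketch below assumes the gap is the English refusal rate minus the Somali refusal rate on paired prompts and uses a paired percentile bootstrap; the abstract does not state the exact bootstrap procedure, and the labels below are made up for illustration.

```python
import random
from typing import List, Tuple


def refusal_gap_ci(
    refused_en: List[bool],
    refused_so: List[bool],
    n_boot: int = 2000,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (gap, lower, upper) for the English-minus-Somali refusal rate."""
    n = len(refused_en)
    pairs = list(zip(refused_en, refused_so))
    gap = sum(e - s for e, s in pairs) / n  # bools subtract as 0/1 integers
    rng = random.Random(seed)
    # Resample prompt pairs with replacement so the pairing is preserved.
    boots = sorted(
        sum(e - s for e, s in rng.choices(pairs, k=n)) / n for _ in range(n_boot)
    )
    return gap, boots[int(0.025 * n_boot)], boots[int(0.975 * n_boot) - 1]


# Made-up judge labels for ten paired prompts (not the benchmark's data).
refused_en = [True] * 10
refused_so = [True] * 2 + [False] * 8
gap, lower, upper = refusal_gap_ci(refused_en, refused_so)
```

Resampling whole prompt pairs, rather than each language independently, keeps the comparison matched on prompt content.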

2605.25415 2026-05-26 cs.CL cs.CY cs.ET 版本更新

LLM-as-a-Reviewer: Benchmarking Their Ability, Divergence, and Prompt Injection Resistance as Paper Reviewers

LLM-as-a-Reviewer: 基准测试它们作为论文审稿人的能力、分歧和提示注入抵抗性

Lingyao Li, Junjie Xiong, Changjia Zhu, Runlong Yu, Chen Chen, Junyu Wang, Renkai Ma, Zhicong Lu

发表机构 * University of South Florida（南佛罗里达大学）； Missouri University of Science and Technology（密苏里科技大学）； University of Alabama（阿拉巴马大学）； Florida International University（佛罗里达国际大学）； University of Cincinnati（辛辛那提大学）； George Mason University（乔治·梅森大学）

AI总结本研究通过一个系统基准测试，评估了12个大型语言模型在论文评审中的表现，包括评分校准、与人类审稿人的分歧以及对不可见字体映射攻击的抵抗性，发现LLMs存在系统性高估弱论文、与人类关注点不同以及易受提示注入攻击等问题。

AI中文摘要

大型语言模型（LLMs）越来越多地用于学术同行评审，但其可靠性、与人类判断的一致性以及对对抗性攻击的鲁棒性仍知之甚少。我们对从NeurIPS和ICLR分层的898篇论文进行了LLM-as-a-Reviewer的系统基准测试，评估了12个LLMs的三个维度：评分校准、与人类审稿人的分歧以及对通过不可见字体映射攻击嵌入的提示注入的抵抗性。我们发现LLMs系统性地高估较弱的投稿，并在主题重点上与人类存在分歧，低估清晰度而高估可重复性，同时生成的评论长度是人类的2到3倍，词汇多样性较低且词汇更标准化。提示注入仍然非常有效。简单的隐藏指令可以在相当一部分案例中将低分论文提升至可接受级别的评分，且效果在不同模型家族间差异显著。虽然LLMs在结构化评估方面具有实用性，但将其整合到同行评审中需要针对内在偏见和对抗性风险设置防护措施。

英文摘要

Large language models (LLMs) are increasingly used in academic peer review, yet their reliability, alignment with human judgment, and robustness to adversarial attacks remain poorly understood. We present a systematic benchmark of LLM-as-a-Reviewer on 898 papers stratified from NeurIPS and ICLR, evaluating 12 LLMs along three axes: rating calibration, divergence from human reviewers, and resistance to prompt injection embedded via an invisible font-mapping attack. We find that LLMs systematically overrate weaker submissions and diverge from humans in topical emphasis, under-flagging Clarity and over-flagging Reproducibility, while producing reviews two to three times longer with lower lexical diversity and a more standardized vocabulary. Prompt injection remains highly effective. Simple hidden instructions can promote low-scoring papers to acceptance-level ratings in a substantial fraction of cases, with effectiveness varying sharply across model families. While LLMs offer utility in structuring evaluations, their integration into peer review requires safeguards against both intrinsic biases and adversarial risks.

2605.25404 2026-05-26 cs.CL eess.AS 版本更新

Proactive for Uncertainty: Cause-Aware Error Diagnosis and Interactive Clarification for Spoken Dialogue Systems

主动应对不确定性：面向口语对话系统的因果感知错误诊断与交互式澄清

Yizhou Peng, Ziyang Ma, Changsong Liu, Yi-Wen Chao, Xie Chen, Eng Siong Chng

发表机构 * Nanyang Technological University, Singapore（南洋理工大学，新加坡）； Shanghai Jiao Tong University, China（上海交通大学，中国）

AI总结本文提出一种因果感知的错误恢复范式，通过细粒度检测器解耦ASR中的感知、理解和删除错误，使LLM能够执行多轮针对性澄清策略，从而显著降低词错误率并提升下游任务性能。

AI中文摘要

级联自动语音识别-大语言模型（ASR-LLM）流水线在工业口语对话系统（SDS）中仍然流行，主要因为其解耦设计确保了感知可验证性。然而，级联系统存在错误传播问题，因为转录失败不可避免地级联到后续组件，从而降低最终交互质量。尽管ASR置信度分数为不可靠输入提供了简单过滤，但这种方法存在根本性局限，因为它通常无法检测删除错误，也无法区分声学（听不清）和语言（不理解）不匹配，而这两者都需要针对性的恢复策略。在本文中，我们提出了一种因果感知的错误恢复范式，从根本上重新思考SDS的鲁棒性。与传统的置信度过滤不同，我们引入了一组小型精度聚焦检测器，利用深度ASR潜在表示将词级错误解耦为感知、理解和删除失败。这种细粒度诊断智能使LLM能够编排针对性的多轮澄清策略，有效将模糊信号转化为无缝的用户交互。实验结果验证了我们方法的精度，与基线相比，在领域转移错误上的召回率提高了一倍以上（57.96% vs. 23.66%）。关键的是，这种诊断精度在不同口音、失真和领域下，使词错误率降低高达30%，下游任务性能提升17%。

英文摘要

Cascaded Automatic Speech Recognition-Large Language Model (ASR-LLM) pipelines remain popular for industrial Spoken Dialogue Systems (SDS), primarily because their decoupled design ensures perceptual verifiability. However, cascaded systems suffer from error propagation, as transcription failures inevitably cascade to subsequent components, thereby degrading the final interaction quality. Although ASR confidence scores offer a simple filter for unreliable inputs, this approach is fundamentally limited because it typically fails to detect deletion errors or to distinguish between acoustic (inability to hear clearly) and linguistic (inability to understand) mismatches, both of which require targeted recovery strategies. In this paper, we propose a cause-aware error recovery paradigm that fundamentally rethinks robustness in SDS. Unlike traditional confidence filtering, we introduce a suite of small precision-focused detectors that exploit deep ASR latent representations to disentangle token-level errors into perception, comprehension, and deletion failures. This fine-grained diagnostic intelligence empowers the LLM to orchestrate targeted, multi-turn clarification strategies, effectively transforming ambiguous signals into seamless user interactions. Experimental results validate the precision of our approach, which more than doubles the recall on domain-shift errors (57.96% vs. 23.66%) compared to baselines. Crucially, this diagnostic precision yields up to a 30% reduction in WER and a 17% improvement on the downstream task across diverse accents, distortions, and domains.

2605.25394 2026-05-26 cs.AI cs.CL 版本更新

Second Guess: Detecting Uncertainty Through Abstention and Answer Stability in Small Language Models

Second Guess: 通过弃权和答案稳定性检测小型语言模型的不确定性

Ashwath Vaithinathan Aravindan, Mayank Kejriwal

发表机构 * University of Southern California（南加州大学）； Information Sciences Institute（信息科学研究所）

AI总结提出一种轻量级、无参数的提示技术Second Guess，通过添加“我不知道”选项并观察答案稳定性，在多项选择问答中实现弃权，有效检测小型语言模型的不确定性。

AI中文摘要

大型语言模型在不确定时往往生成自信但错误的答案，而非弃权。这个问题对于小型语言模型（SLM）尤为严重，因为计算约束和自主操作放大了对可靠不确定性检测的需求。我们提出了_Second Guess_，一种轻量级、无参数的提示技术，用于多项选择问答（MCQA）中的弃权，非常适合SLM。我们的关键实证洞察是，真正知道答案的模型会一致地选择它，而不确定的模型在添加“我不知道”选项时会表现出不稳定的行为。在四个开源模型（2B-8B参数）和四个基准测试上评估，Second Guess实现了10.81%的最高复合风险改进。值得注意的是，在基于熵的方法退化的微调模型上，它保持了8%的复合风险改进，并且对性能较低的模型改进最大。重现本工作所需的所有代码和结果可在https://github.com/Mystic-Slice/second-guess获取。

英文摘要

Large language models often generate confident but incorrect answers rather than abstaining when uncertain. This problem is particularly acute for small language models (SLMs), where computational constraints and autonomous operation amplify the need for reliable uncertainty detection. We propose _Second Guess_, a lightweight, parameter-free prompting technique for abstention in multiple-choice question answering (MCQA) that is well-suited for SLMs. Our key empirical insight is that models which truly know an answer will select it consistently, while uncertain models exhibit unstable behavior when an "I don't know" option is added. Evaluated on four open models (2B-8B parameters) and four benchmarks, Second Guess achieves the highest composite risk improvement of 10.81%. Notably, it maintains an 8% composite risk improvement on fine-tuned models where entropy-based methods degrade, and improves most for lower-performing models. All code and results required to reproduce this work are available at https://github.com/Mystic-Slice/second-guess.
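
The stability idea can be sketched as a two-pass query: ask once with the original options, ask again with an added abstention option, and abstain when the answer is not stable. This is a minimal sketch under assumptions: the `ask` callable is a hypothetical stand-in for a model call, and the exact decision rule used by the authors may differ from the one shown.

```python
from typing import Callable, List, Optional

IDK = "I don't know"


def second_guess(
    ask: Callable[[str, List[str]], str],
    question: str,
    options: List[str],
) -> Optional[str]:
    """Return the answer if it survives an added abstention option, else None."""
    first = ask(question, options)
    second = ask(question, options + [IDK])
    # Abstain if the model picks the abstention option or changes its answer.
    if second == IDK or second != first:
        return None
    return first


# Hypothetical stub models standing in for real SLM calls.
def confident_model(question: str, options: List[str]) -> str:
    return "Paris"


def unsure_model(question: str, options: List[str]) -> str:
    return IDK if IDK in options else options[0]


choices = ["Paris", "Rome", "Madrid"]
stable = second_guess(confident_model, "Capital of France?", choices)
unstable = second_guess(unsure_model, "Capital of France?", choices)
```

The technique needs only two forward passes per question and no access to logits, which is what makes it parameter-free.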

2605.25384 2026-05-26 cs.CL 版本更新

知道但不展示：LLMs 识别歧义但很少提出澄清问题

Jinyan Su, Claire Cardie

发表机构 * Cornell University（康奈尔大学）

AI总结研究大型语言模型在识别用户查询歧义与主动提出澄清问题之间的行为差距，发现模型虽能识别歧义但默认直接回答，检索上下文会进一步减少澄清行为。

AI中文摘要

用户查询通常不明确，可能允许多种有效解释。一个有用的助手不应默默假设用户意图，而应通过提出澄清问题来揭示这种歧义。这需要两种能力：识别查询存在歧义，并基于该识别采取行动（寻求澄清而非直接回答）。为了研究这些能力，我们在三种设置下评估模型对歧义、无歧义和消歧问题的表现：标准问答、显式歧义判断和行为分析（其中评判模型将响应分类为直接回答、拒绝或澄清问题）。我们发现识别与行为之间存在明显差距：当被明确要求判断时，模型通常能识别歧义，但在问答设置中，它们绝大多数默认直接回答。检索上下文通过提高可回答性进一步扩大了这一差距，使模型更不可能提出澄清问题。

英文摘要

User queries are often underspecified and may admit multiple valid interpretations. Rather than silently making assumptions about the user's intent, a helpful assistant should surface such ambiguity by asking a clarifying question. Doing so requires two abilities: recognizing that a query is ambiguous, and acting on that recognition by seeking clarification instead of answering directly. To study these abilities, we evaluate models on ambiguous, unambiguous, and disambiguated questions in three settings: standard question answering, explicit ambiguity judgment, and behavioral analysis, where a judge model classifies responses as direct answers, refusals, or clarifying questions. We find a clear gap between recognition and behavior: models often identify ambiguity when explicitly asked to judge it, yet in the QA setting they overwhelmingly default to direct answers. Retrieved context further widens this gap by improving answerability while making models even less likely to ask clarifying questions.

2605.25263 2026-05-26 cs.CL cs.AI 版本更新

Fast-dDrive：面向自动驾驶的高效块扩散视觉语言模型

Kewei Zhang, Jin Wang, Sensen Gao, Chengyue Wu, Yulong Cao, Songyang Han, Boris Ivanovic, Langechuan Liu, Marco Pavone, Song Han, Daquan Zhou, Enze Xie

发表机构 * Peking University（北京大学）； NVIDIA ； The University of Hong Kong（香港大学）； MIT（麻省理工学院）

AI总结提出Fast-dDrive，一种块扩散视觉语言动作模型，通过语义单元内双向细化与跨单元因果约束，结合结构化令牌冻结、分段感知训练和推测解码，实现高保真轨迹规划与高效推理，在WOD-E2E和nuScenes上达到最优性能，推理速度提升12倍。

AI中文摘要

通过视觉-语言-动作（VLA）模型实现的端到端自动驾驶需要在高保真轨迹规划与高效推理之间取得不稳定的平衡。现有范式通常存在不足：自回归（AR）VLA在边缘硬件上受限于内存带宽，且容易产生曝光偏差漂移；而全序列扩散模型无法复用KV缓存，并遭受违反基本感知-规划因果关系的“逻辑泄漏”。我们提出Fast-dDrive，一种块扩散VLA，它在语义单元内执行双向细化，同时强制跨单元严格因果排序。利用驾驶VLA通常输出结构化JSON式输出的观察，Fast-dDrive将结构令牌冻结为节支架，并采用节感知训练策略，优先考虑安全关键规划。我们进一步引入支架推测解码，以显著更高的吞吐量实现AR等效质量。最后，我们提出一种低开销的测试时缩放方案：通过从单个共享前缀KV缓存分叉出N个随机轨迹展开并取平均，以极小的计算成本有效抑制预测方差。实验结果表明，Fast-dDrive重新定义了驾驶智能体的速度-精度边界。在WOD-E2E测试集上，Fast-dDrive在3秒和5秒平均位移误差（ADE）上达到最优，同时在基于扩散的VLA中具有最高的RFS；在nuScenes上，它将平均L2误差降至0.32米（提升22%）。当与SGLang集成时，我们的框架相比AR基线实现了12倍的吞吐量提升，缩小了高容量VLA与实时车载部署效率需求之间的差距。

英文摘要

End-to-end autonomous driving via Vision-Language-Action (VLA) models demands a precarious balance between high-fidelity trajectory planning and efficient inference. Existing paradigms typically fall short: autoregressive (AR) VLAs are memory-bandwidth-bound on edge hardware and prone to exposure-bias drift, while full-sequence diffusion models preclude KV-cache reuse and suffer from "logical leakage" that violates the fundamental perceive-then-plan causality. We present Fast-dDrive, a block-diffusion VLA that performs bidirectional refinement within semantic units while enforcing strict causal ordering across them. Leveraging the observation that driving VLAs often emit structured JSON-like outputs, Fast-dDrive freezes structural tokens into a section scaffold and employs a section-aware training recipe that prioritizes safety-critical planning. We further introduce Scaffold Speculative Decoding to achieve AR-equivalent quality at significantly higher throughput. Finally, we propose a low-overhead test-time scaling scheme: by forking N stochastic trajectory rollouts from a single shared-prefix KV cache and averaging them, we effectively suppress prediction variance at a fractional computational cost. Empirical results demonstrate that Fast-dDrive redefines the speed-accuracy frontier for driving agents. On the WOD-E2E test set, Fast-dDrive achieves SOTA ADE@3s and ADE@5s, alongside the highest RFS among diffusion-based VLAs; on nuScenes, it reduces average L2 error to 0.32 m (a 22% improvement). When integrated with SGLang, our framework delivers a 12x throughput speedup over the AR baseline, narrowing the gap between high-capacity VLAs and the efficiency demands of real-time on-vehicle deployment.
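
The test-time scaling step, averaging several stochastic rollouts and scoring the result with average displacement error (ADE), can be illustrated as follows. This is a toy sketch: rollouts are plain lists of 2D waypoints, the shared-prefix KV cache is not modeled, and the trajectories are invented for illustration.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def average_rollouts(rollouts: List[List[Point]]) -> List[Point]:
    """Average N trajectories waypoint by waypoint."""
    n = len(rollouts)
    return [
        (sum(r[t][0] for r in rollouts) / n, sum(r[t][1] for r in rollouts) / n)
        for t in range(len(rollouts[0]))
    ]


def ade(pred: List[Point], truth: List[Point]) -> float:
    """Average displacement error: mean Euclidean distance over waypoints."""
    return sum(math.dist(p, g) for p, g in zip(pred, truth)) / len(truth)


# Invented rollouts scattered symmetrically around the ground-truth path.
truth = [(0.0, 0.0), (1.0, 0.0)]
rollouts = [[(0.0, 1.0), (1.0, 1.0)], [(0.0, -1.0), (1.0, -1.0)]]
averaged = average_rollouts(rollouts)
single_error = ade(rollouts[0], truth)
averaged_error = ade(averaged, truth)
```

In this toy case the individual errors cancel exactly; with real rollouts averaging reduces variance only to the extent that the errors are not systematically biased in one direction.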

2605.23148 2026-05-26 cs.CL cs.CY 版本更新

When Symptoms Are Not Enough: Evidence-Weighting Patterns in Large Language Model Psychiatric Screening

当症状不足时：大语言模型精神科筛查中的证据加权模式

Jianfeng Zhu, Megan Korhummel, Ruoming Jin, Karin G. Coifman

发表机构 * Departments of Computer Science and Psychological Science（计算机科学系和心理科学系）

AI总结本研究引入SCID锚定基准，评估五个大语言模型在精神科筛查中的表现，发现模型在焦虑症、抑郁症和创伤后应激障碍分类中，当存在功能保留或保护性背景时倾向于低估症状证据，导致假阴性错误。

Comments 25 pages 7 figures

AI中文摘要

随着心理健康护理需求超过临床医生提供的评估，对可扩展筛查工具的需求日益增加。大语言模型（LLMs）可能从患者叙述中识别精神科风险，但其在不同诊断、人口统计亚组和证据使用模式中的可靠性仍不确定。我们引入了一个基于SCID的基准，包含555个半结构化体验访谈，并配有焦虑症、重度抑郁症、创伤后应激障碍和任何当前心理健康障碍的诊断参考标签。使用零样本任务特定提示，我们评估了五个最先进的LLM，并检查假阴性错误是否反映了遗漏的精神科证据或对症状、功能损害和保护性背景线索的差异化加权。不同任务和模型的表现各异，准确率从0.49到0.86，马修斯相关系数从0.16到0.38。GPT-4.1 Mini和GPT-5 Mini显示出最一致的疾病特异性准确率。亚组分析发现，男性参与者的抑郁症分类准确率高于女性，没有一致的年龄相关模式，种族阶层间存在适度的非均匀变异。证据整合分析显示，假阴性的焦虑症和PTSD分类通常包含明确的症状证据，但伴有功能保留、应对能力或社会支持。功能损害证据使模型输出偏向阳性分类，而保护性背景证据则使输出偏离。这些发现表明，LLMs可能支持可扩展的精神科筛查，但它们在功能保留或保护性背景下低估症状证据的倾向需要在临床部署前进行仔细验证。

英文摘要

As demand for mental health care outpaces clinician-delivered assessment, scalable screening tools are increasingly needed. Large language models (LLMs) may identify psychiatric risk from patient narratives, but their reliability across diagnoses, demographic subgroups, and evidence-use patterns remains uncertain. We introduce a SCID-anchored benchmark of 555 semi-structured experiential interviews paired with diagnostic reference labels for anxiety disorder, major depressive disorder, post-traumatic stress disorder, and any current mental health disorder. Using zero-shot task-specific prompting, we evaluated five state-of-the-art LLMs and examined whether false-negative errors reflected missed psychiatric evidence or differential weighting of symptom, functional-impairment, and protective-context cues. Performance varied across tasks and models, with accuracy ranging from 0.49 to 0.86 and Matthews correlation coefficients from 0.16 to 0.38. GPT-4.1 Mini and GPT-5 Mini showed the most consistent disorder-specific accuracy. Subgroup analyses found higher depression-classification accuracy among male than female participants, no consistent age-related pattern, and modest non-uniform variation across race strata. Evidence-integration analyses showed that false-negative anxiety and PTSD classifications often contained explicit symptom evidence but were accompanied by preserved functioning, coping ability, or social support. Functional-impairment evidence shifted model outputs toward positive classifications, whereas protective-context evidence shifted outputs away. These findings suggest that LLMs may support scalable psychiatric screening, but their tendency to discount symptom evidence in the presence of preserved functioning or protective context requires careful validation before clinical deployment.
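
The gap between the reported accuracy (up to 0.86) and Matthews correlation coefficient (at most 0.38) is typical of imbalanced screening data, where many true negatives inflate accuracy. The sketch below computes both metrics from a confusion matrix; the counts are invented to illustrate the effect and do not come from the benchmark.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all cases classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Invented counts: 15 positive cases out of 100, 10 of them missed (false negatives).
acc = accuracy(tp=5, tn=80, fp=5, fn=10)
mcc = matthews_corrcoef(tp=5, tn=80, fp=5, fn=10)
```

Here accuracy is 0.85 while MCC is roughly 0.33, because two thirds of the positive cases are missed; this is the false-negative pattern the study examines.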

2605.22769 2026-05-26 cs.CL cs.AI 版本更新

Understanding Data Temporality Impact on Large Language Models Pre-training

理解数据时间性对大型语言模型预训练的影响

Hippolyte Pilchen, Romain Fabre, Franck Signe Talla, Patrick Perez, Edouard Grave

发表机构 * Kyutai

AI总结研究预训练数据顺序对大型语言模型获取时间敏感事实知识的影响，通过构建包含7000多个时间相关问题的基准并训练60亿参数模型，发现按时间顺序训练比随机打乱训练能产生更及时和精确的知识。

AI中文摘要

大型语言模型（LLMs）通常在打乱顺序的语料库上进行训练，导致模型的知识在训练时被冻结，其时间基础仍然难以理解。在这项工作中，我们研究了预训练动态对获取时间敏感事实知识的影响，特别关注数据顺序。我们的主要贡献有两方面。首先，我们引入了一个包含7000多个时间基础问题的综合基准和一个评估协议，能够分析模型是否将事实与其对应的时间段正确关联。其次，我们在按时间顺序排列的Common Crawl快照上预训练了60亿参数的模型，并将其与标准的随机打乱预训练进行比较。我们的结果表明，按顺序训练的模型在通用语言理解和常识方面与随机打乱的基线相当，同时始终表现出更及时和精确的时间知识。按时间顺序的预训练提高了事实的新鲜度，而随机打乱的预训练在较旧的数据上表现更好，可能是由于事实重复增加。这些发现，连同我们在https://github.com/kyutai-labs/kairos 发布的代码、在https://huggingface.co/collections/kyutai/kairos 发布的检查点和数据集，为LLMs的持续学习未来研究提供了基础。

英文摘要

Large language models (LLMs) are typically trained on shuffled corpora, yielding models whose knowledge is frozen at train time and whose temporal grounding remains poorly understood. In this work, we study the impact of pre-training dynamics on the acquisition of time-sensitive factual knowledge, focusing specifically on data ordering. Our main contributions are twofold. First, we introduce a comprehensive benchmark of over 7,000 temporally grounded questions and an evaluation protocol that enables analysis of whether models correctly associate facts with their corresponding time periods. Second, we pretrain 6B-parameter models on temporally ordered Common Crawl snapshots and compare them against standard shuffled pre-training. Our results show that sequentially trained models match shuffled baselines on general language understanding and common knowledge while consistently exhibiting more up-to-date and temporally precise knowledge. Temporally ordered pre-training yields improved factual freshness, while shuffled pre-training peaks on older data, possibly due to increased factual repetition. These findings, along with the release of our code at https://github.com/kyutai-labs/kairos and our checkpoints and datasets at https://huggingface.co/collections/kyutai/kairos, provide a foundation for future research on continual learning for LLMs.

2605.22137 2026-05-26 cs.CL 版本更新

Cross-Lingual Consensus: Aligning Multilingual Cultural Knowledge via Multilingual Self-Consistency

跨语言共识：通过多语言自一致性对齐多语言文化知识

Andrew Ivan Soegeng, Patrick Sutanto, Tan Sang Nguyen

发表机构 * SAP ； School of Computing, National University of Singapore（新加坡国立大学计算学院）

AI总结提出一种自监督框架，利用多语言自一致性和自我批评机制，从本地语言表示中提取文化知识并迁移到英语，以缩小跨语言文化知识差距，在BLEnD基准上平均提升英语查询性能5.03%。

Comments Accepted to The 1st Workshop on Multilinguality in the Era of Large Language Models

BacktestBench：面向自动化量化策略回测的大语言模型基准测试

Zhensheng Wang, Wenmian Yang, Qingtai Wu, Lequan Ma, Yiquan Zhang, Weijia Jia

发表机构 * Beijing Normal University（北京师范大学）； Elmleaf Ltd.（Elmleaf公司）

AI总结提出首个大规模自动化量化回测基准BacktestBench，包含18,246个问答对，并设计多智能体基线AutoBacktest，通过协调摘要器、检索器和编码器实现自然语言策略到可重复回测的转换。

Comments This paper has been accepted by KDD 2026 (Datasets and Benchmarks Track)

DOI: 10.1145/3770855.3817460

AI中文摘要

量化回测对于评估交易策略至关重要，但仍受到高技术门槛和有限可扩展性的阻碍。虽然大语言模型（LLMs）通过先进的代码生成、工具使用和智能体规划为自动化这一复杂的跨学科工作流程提供了变革性路径，但实际实现因当前缺乏专门用于自动化量化回测的大规模基准而面临重大挑战，这阻碍了该领域的进展。为弥补这一关键差距，我们引入了BacktestBench，这是首个用于自动化量化回测的大规模基准。它基于超过600万条真实市场记录构建，包含18,246个精心标注的问答对，涵盖四个任务类别：指标计算、股票选择、策略选择和参数确认。我们还提出了AutoBacktest，一个稳健的多智能体基线，通过协调摘要器进行语义因子提取、检索器进行验证的SQL生成以及编码器进行Python回测实现，将自然语言策略转化为可重复的回测。我们对23个主流LLM的评估，辅以有针对性的消融实验，识别了影响端到端性能的关键因素，并强调了基于事实的验证和标准化指标表示的重要性。

英文摘要

Quantitative backtesting is essential for evaluating trading strategies but remains hampered by high technical barriers and limited scalability. While Large Language Models (LLMs) offer a transformative path to automate this complex, interdisciplinary workflow through advanced code generation, tool usage, and agentic planning, the practical realization is significantly challenged by the current lack of a large-scale benchmark dedicated to automated quantitative backtesting, which hinders progress in this field. To bridge this critical gap, we introduce BacktestBench, the first large-scale benchmark for automated quantitative backtesting. Built from over 6 million real market records, it comprises 18,246 meticulously annotated question-answering pairs across four task categories: metrics calculation, ticker selection, strategy selection, and parameter confirmation. We also propose AutoBacktest, a robust multi-agent baseline that translates natural language strategies into reproducible backtests by coordinating a Summarizer for semantic factor extraction, a Retriever for validated SQL generation, and a Coder for Python backtesting implementation. Our evaluation on 23 mainstream LLMs, complemented by targeted ablations, identifies key factors that influence end-to-end performance and highlights the importance of grounded verification and standardized indicator representations.

2605.16953 2026-05-26 cs.AI cs.CL 版本更新

How do Humans Process AI-generated Hallucination Contents: a Neuroimaging Study

人类如何处理AI生成的幻觉内容：一项神经影像学研究

Shuqi Zhu, Yi Zhong, Ziyi Ye, Bangde Du, Yujia Zhou, Qingyao Ai, Yiqun Liu

发表机构 * Department of Computer Science and Technology, Tsinghua University, Beijing, China（清华大学计算机科学与技术系）； Institute of Trustworthy Embodied AI, Fudan University, Shanghai, China（复旦大学可信具身人工智能研究院）

AI总结通过EEG实验，研究人类在处理多模态大语言模型生成的幻觉与非幻觉内容时的神经动力学差异，揭示误判的幻觉内容未能触发标准神经认知事实验证通路。

AI中文摘要

尽管AI生成的幻觉带来了相当大的风险，但人类能够成功识别或被这些幻觉误导的潜在认知机制仍不清楚。为了解决这个问题，本文探索了人类的神经动力学，以表征大脑如何处理幻觉内容。我们记录了27名参与者在执行验证任务时的EEG信号，该任务要求判断由多模态大语言模型（MLLM）生成的图像描述的正确性。基于平均事件相关电位（ERP）研究，我们揭示了多种认知过程，例如语义整合、推理处理、记忆检索和认知负荷，在处理幻觉与非幻觉内容时表现出不同的模式。值得注意的是，人类参与者误判与正确判断的幻觉的神经反应显示出显著差异。这表明，被误判的AI生成幻觉未能触发标准的神经认知事实验证通路。

英文摘要

While AI-generated hallucinations pose considerable risks, the underlying cognitive mechanisms by which humans can successfully recognize or be misled by these hallucinations remain unclear. To address this problem, this paper explores humans' neural dynamics to characterize how the brain processes hallucinated content. We record EEG signals from 27 participants while they are performing a verification task to judge the correctness of image descriptions generated by a multi-modal large language model (MLLM). Based on an averaged event-related potential (ERP) study, we reveal that multiple cognitive processes, e.g., semantic integration, inferential processing, memory retrieval, and cognitive load, exhibit distinct patterns when humans process hallucinated versus non-hallucinated content. Notably, neural responses to hallucinations that were misjudged versus correctly judged by human participants showed significant differences. This indicates that misjudged AI-generated hallucinations failed to trigger the standard neurocognitive fact verification pathway.


2605.16023 2026-05-26 cs.CL cs.LG 版本更新

Judge Circuits

Nils Feldhus, Tanja Baeumel, Elena Golimblevskaia, Qianli Wang, Van Bach Nguyen, Aaron Louis Eidt, Selin Kahvecioglu, Christopher Ebert, Wojciech Samek, Jing Yang, Vera Schmitt, Sebastian Möller, Simon Ostermann

发表机构 * Technische Universität Berlin（柏林技术大学）； BIFOLD – Berlin Institute for the Foundations of Learning and Data（柏林学习与数据基础研究院）； German Research Center for Artificial Intelligence (DFKI)（德国人工智能研究中心）； Fraunhofer Heinrich Hertz Institute（弗劳恩霍夫海因里希·赫茨研究所）； Marburg University（马尔堡大学）； Centre for European Research in Trusted AI (CERTAIN)（欧洲可信人工智能研究中心）

AI总结本研究利用位置感知边归因修补（PEAP）因果分析Gemma-3、Qwen2.5和Llama-3的内部机制，发现结构化理解和开放式偏好任务中的判断共享一个稀疏、泛化的潜在评估子图，并通过解耦抽象判断与输出格式，揭示了格式诱导不一致性的机制原因。

Comments 39 pages


AI中文摘要

LLM-as-a-judge已成为大规模评估模型输出的主导范式，然而同一模型在其输出格式变化时（例如，1-5评分与真/假标签）会系统地给出不同的分数。现有对这些格式诱导不一致性的诊断停留在输入输出层面。利用位置感知边归因修补（PEAP），我们因果地研究了Gemma-3、Qwen2.5和Llama-3的内部机制。我们发现，跨结构化理解和开放式偏好任务的判断共享一个稀疏、泛化的潜在评估子图，位于中后期多层感知器（MLPs）中；将其零消融会破坏判断，同时保留架构模块化模型中的世界知识。通过结构上解耦抽象判断与输出格式，我们为我们研究的开放权重模型上的格式诱导不一致性提供了机制解释：在共享主干中计算的连续判断信号通过脆弱、格式特定的终端分支映射，使得格式无关的偏好能够在请求的输出格式下游被隔离。我们的发现意味着跨格式的基准级可靠性比较部分测量的是格式化器几何形状而非评估质量。

英文摘要

LLM-as-a-judge has become the dominant paradigm for grading model outputs at scale, yet the same model assigns systematically different scores when its output format changes (e.g., a 1-5 rating vs. a True/False label). Existing diagnoses of these format-induced inconsistencies stop at the input-output level. Using Position-aware Edge Attribution Patching (PEAP), we causally investigate the internal mechanism in Gemma-3, Qwen2.5, and Llama-3. We find that judgments across structured understanding and open-ended preference tasks share a sparse, generalized Latent Evaluator sub-graph in the mid-to-late multi-layer perceptrons (MLPs); zero-ablating it collapses judgment while preserving world knowledge in architecturally modular models. By structurally decoupling abstract judging from output formatting, we provide a mechanistic account of format-induced inconsistency on the open-weight models we study: a continuous judgment signal computed in the shared trunk is mapped through fragile, format-specific terminal branches, enabling format-independent preference to be isolated downstream of the requested output format. Our findings imply that benchmark-level reliability comparisons across formats are partially measuring formatter geometry rather than evaluation quality.


2605.13643 2026-05-26 cs.CL 版本更新

Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation

前缀教导，后缀消退：强到弱在线策略蒸馏中的局部可教性崩溃

Kaiyuan Liu, Ziyuan Zhuang, Yang Bai, Bing Wang, Rongxiang Weng, Jieping Ye

发表机构 * College of Computer Science and Technology, Zhejiang University（浙江大学计算机科学与技术学院）； Meituan LongCat Team, China（美团LongCat团队，中国）； College of Computer Science and Technology, Jilin University（吉林大学计算机科学与技术学院）

AI总结本文发现强到弱在线策略蒸馏中，教师反馈在生成轨迹的后缀部分缺乏局部对比度，导致“局部可教性崩溃”，并提出基于变化点检测的截断规则来优化监督区域，实验表明该方法优于全轨迹蒸馏。


AI中文摘要

在线策略蒸馏（OPD）利用来自更强教师的密集反馈，在学生模型自身的生成轨迹上训练学生模型。先前文献表明，只要教师反馈可用，监督完整响应令牌序列应能单调提升性能。然而，我们证明这一假设在强到弱OPD设置中有时不成立。虽然生成轨迹的后缀部分可能仍存在非零的师生优势，但它们通常缺乏使密集反馈有效优先学生学习所需的局部对比度。我们将这种失败模式称为局部可教性崩溃。由此得出的原则很简单：监督应集中在教师反馈仍具有判别性的轨迹区域，而非均匀覆盖整个响应。我们通过一种轨迹特定的释放规则来操作这一原则。该规则测量教师相对于学生前K个候选集的边际，将该边际在NLTK分词的句子片段上聚合，并在检测到BIC风格的下行变化点时截断密集OPD监督。使用Qwen3模型系列在强到弱蒸馏任务上的实验结果表明，该释放规则在不同学生规模下的五个域内基准上始终优于标准全轨迹OPD。此外，与基线蒸馏方法相比，我们的方法在域外任务上更好地保留了模型能力。这些结果表明，有效的强到弱OPD需要评估教师指导的可用性及其局部效用，确保生成的反馈保持可教性。

英文摘要

On-policy distillation (OPD) trains a student model on its own rollouts using dense feedback from a stronger teacher. Prior literature suggests that, provided teacher feedback is available, supervising the full sequence of response tokens should monotonically improve performance. However, we demonstrate that this assumption sometimes fails to hold in strong-to-weak OPD settings. While later segments of a generated trajectory may still exhibit a non-zero teacher-student advantage, they frequently lack the local contrast that makes dense feedback effective for prioritizing student learning. We term this failure mode local teachability collapse. The resulting principle is straightforward: supervision should concentrate on trajectory regions where the teacher's feedback remains discriminative, rather than uniformly covering the entire response. We operationalize this principle through a trajectory-specific release rule. This rule measures the teacher's margin over the student's top-$K$ candidate set, aggregates this margin across NLTK-tokenized sentence segments, and truncates dense OPD supervision upon detecting a BIC-style downward change point. Experimental results across strong-to-weak distillation tasks using the Qwen3 model family indicate that this release rule consistently outperforms standard full-trajectory OPD across five in-domain benchmarks at various student scales. Furthermore, compared to baseline distillation methods, our approach better preserves model capabilities on out-of-domain tasks. These results suggest that effective strong-to-weak OPD requires evaluating not only the availability of teacher guidance but also its local utility, ensuring that the generated feedback remains teachable.
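The release rule in this abstract can be illustrated with a small sketch. It assumes the per-sentence teacher margins have already been computed and aggregated, and it uses a textbook single-change-point BIC comparison (one Gaussian mean versus two means plus a change location). The function name `release_index`, the `min_seg` parameter, and the parameter counts in the penalty are illustrative assumptions; the abstract does not specify the paper's exact statistic.

```python
import math

def release_index(seg_margins, min_seg=2):
    """Return the segment index at which dense supervision would stop.

    Hypothetical helper: returns len(seg_margins) when no downward
    change point is preferred by the BIC comparison.
    """
    n = len(seg_margins)
    if n < 2 * min_seg:
        return n

    def sse(xs):
        # Sum of squared deviations from the segment mean.
        mean = sum(xs) / len(xs)
        return sum((x - mean) ** 2 for x in xs)

    eps = 1e-12  # avoids log(0) on perfectly flat inputs
    # BIC of the no-change model: one mean parameter.
    best_bic = n * math.log(sse(seg_margins) / n + eps) + 1 * math.log(n)
    best_cut = n
    for cut in range(min_seg, n - min_seg + 1):
        left, right = seg_margins[:cut], seg_margins[cut:]
        if sum(right) / len(right) >= sum(left) / len(left):
            continue  # only downward shifts count as teachability collapse
        # BIC of the change model: two means plus the change location.
        bic = n * math.log((sse(left) + sse(right)) / n + eps) + 3 * math.log(n)
        if bic < best_bic:
            best_bic, best_cut = bic, cut
    return best_cut
```

Under these assumptions, segments before the returned index would keep dense teacher supervision and segments from that index onward would be released.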


2605.04700 2026-05-26 cs.CR cs.AI cs.CL cs.LG cs.SD 版本更新

Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization

稀疏令牌足矣：通过令牌感知梯度优化越狱音频语言模型

Zheng Fang, Xiaosen Wang, Shenyi Zhang, Shaokang Wang, Zhijin Ge

发表机构 * Wuhan University ； Institute for Math & AI, Wuhan University ； Huazhong University of Science ； Shanghai Jiao Tong University ； Xidian University

AI总结本文提出令牌感知梯度优化（TAGO）方法，通过仅保留高梯度能量的音频令牌对应的波形梯度，实现稀疏越狱攻击，在保持高成功率的同时大幅减少优化量。

Comments To appear in the 43rd International Conference on Machine Learning (ICML 2026)


AI中文摘要

对音频语言模型（ALM）的越狱攻击通过优化音频扰动来引发不安全生成，通常在整个优化过程中密集地更新整个波形。在这项工作中，我们通过分析ALM中令牌对齐梯度的结构来研究这种密集优化的必要性。我们发现梯度能量在音频令牌之间高度不均匀，表明只有一小部分令牌对齐的音频区域主导了优化信号。受此观察启发，我们提出了令牌感知梯度优化（TAGO），它通过每次迭代仅保留与高梯度能量音频令牌对齐的波形梯度，同时屏蔽其余梯度，实现了稀疏越狱优化。在三个ALM上，TAGO优于基线，并且大幅稀疏化仍能保持较高的攻击成功率（例如，在Qwen3-Omni上，令牌保留率为0.25时，$\mathrm{ASR}_{l}$仍为86%，而全令牌保留时为87%）。这些结果表明密集的波形更新在很大程度上是冗余的，我们主张未来的音频越狱和安全对齐研究应进一步利用这种异质的令牌级梯度结构。

英文摘要

Jailbreak attacks on audio language models (ALMs) optimize audio perturbations to elicit unsafe generations, and they typically update the entire waveform densely throughout optimization. In this work, we investigate the necessity of such dense optimization by analyzing the structure of token-aligned gradients in ALMs. We find that gradient energy is highly non-uniform across audio tokens, indicating that only a small subset of token-aligned audio regions dominates the optimization signal. Motivated by this observation, we propose Token-Aware Gradient Optimization (TAGO), which enables sparse jailbreak optimization by retaining only waveform gradients aligned with audio tokens that have high gradient energy, while masking the remaining gradients at each iteration. Across three ALMs, TAGO outperforms baselines, and substantial sparsification preserves strong attack success rates (e.g. on Qwen3-Omni, $\mathrm{ASR}_{l}$ remains at 86% with a token retention ratio of 0.25, compared to 87% with full token retention). These results demonstrate that dense waveform updates are largely redundant, and we advocate that future audio jailbreak and safety alignment research should further leverage this heterogeneous token-level gradient structure.
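The gradient-masking step described in this abstract can be sketched as follows. This is a minimal illustration, assuming each audio token covers a fixed number of waveform samples and that tokens are ranked by the sum of squared gradients over their samples. The function name `token_aware_mask`, the fixed `samples_per_token` alignment, and the rounding rule for the number of kept tokens are assumptions, not the paper's implementation.

```python
import math

def token_aware_mask(grad, samples_per_token, retention_ratio):
    """Zero out waveform gradients outside the highest-energy token regions."""
    n_tokens = math.ceil(len(grad) / samples_per_token)
    # Gradient energy per token: sum of squared gradients over its samples.
    energy = []
    for t in range(n_tokens):
        segment = grad[t * samples_per_token:(t + 1) * samples_per_token]
        energy.append(sum(g * g for g in segment))
    # Keep at least one token so the update is never empty.
    keep = max(1, round(retention_ratio * n_tokens))
    ranked = sorted(range(n_tokens), key=lambda t: energy[t], reverse=True)
    top = set(ranked[:keep])
    # Retain gradients for samples inside kept tokens, mask the rest.
    return [g if (i // samples_per_token) in top else 0.0
            for i, g in enumerate(grad)]
```

In an iterative attack loop, the mask would be recomputed at each iteration from the current gradient, matching the per-iteration masking the abstract describes.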


2605.03472 2026-05-26 cs.CL cs.AI 版本更新

Human-1 by Josh Talks: 基于真实对话的印地语全双工对话建模框架

Bhaskar Singh, Shobhit Banga, Mahima Manik, Pranav Sharma

发表机构 * JoshTalks

AI总结本文通过适配Moshi架构，使用自定义印地语分词器和26,000小时真实对话数据训练，提出了首个开放、可复现的印地语全双工口语对话系统，实现了自然的打断、重叠和反馈行为。


AI中文摘要

全双工口语对话系统能够模拟自然的对话行为，如打断、重叠和反馈，然而这类系统在印度语言中仍基本未被探索。我们通过适配最先进的双工语音架构Moshi，使用自定义印地语分词器，并在从14,695名说话者收集的26,000小时真实自发对话数据（具有独立的说话者通道）上进行训练，提出了首个开放、可复现的印地语全双工口语对话系统，从而能够直接从自然交互中学习话轮转换和重叠模式。为了支持印地语文本生成，我们替换了原始英语分词器，并重新初始化了依赖于文本词汇的参数，同时保留了预训练的音频组件。我们提出了一种两阶段训练方案——大规模预训练，然后在1,000小时对话数据上进行微调。通过提示对话延续范式，结合自动评估指标和人工判断，评估结果表明生成的模型在印地语中表现出自然且有意义的全双工对话行为。这项工作为印地语及其他印度语言的实时双工口语对话系统迈出了第一步。

英文摘要

Full-duplex spoken dialogue systems can model natural conversational behaviours such as interruptions, overlaps, and backchannels, yet such systems remain largely unexplored for Indian languages. We present the first open, reproducible full-duplex spoken dialogue system for Hindi by adapting Moshi, a state-of-the-art duplex speech architecture, using a custom Hindi tokeniser and training on 26,000 hours of real spontaneous conversations collected from 14,695 speakers with separate speaker channels, enabling direct learning of turn-taking and overlap patterns from natural interactions. To support Hindi text generation, we replace the original English tokeniser and reinitialise text-vocabulary-dependent parameters while retaining the pre-trained audio components. We propose a two-stage training recipe -- large-scale pre-training followed by fine-tuning on 1,000 hours of conversational data. Evaluation through the prompted dialogue continuation paradigm with both automatic metrics and human judgments demonstrates that the resulting model generates natural and meaningful full-duplex conversational behaviour in Hindi. This work serves as a first step toward real-time duplex spoken dialogue systems for Hindi and other Indian languages.


2604.18396 2026-05-26 cs.CL 版本更新

River-LLM: Large Language Model Seamless Exit Based on KV Share

River-LLM：基于KV共享的大型语言模型无缝退出

Yingtao Shen, An Zou

发表机构 * Shanghai Jiao Tong University（上海交通大学）

AI总结针对仅解码器架构中早期退出因KV缓存缺失导致的延迟和精度问题，提出无需训练的River-LLM框架，通过轻量级KV共享退出流实现令牌级无缝退出，并利用状态转移相似性预测KV误差指导退出决策，在数学推理和代码生成任务上获得1.53-2.16倍实际加速且保持高质量。

Comments Accepted to ACL 2026, 13 pages, with appendix. Corrected some typos


AI中文摘要

大型语言模型（LLM）在多个领域展现出卓越性能，但日益受到高推理延迟的制约。早期退出通过动态跳过冗余层加速推理，成为一种有前景的解决方案。然而，在仅解码器架构中，早期退出的效率受到KV缓存缺失问题的严重瓶颈，即跳过的层无法为后续令牌提供必要的历史状态。现有解决方案（如重新计算或掩码）要么引入显著延迟开销，要么导致严重精度损失，未能弥合理论层减少与实际墙钟加速之间的差距。本文提出River-LLM，一个无需训练的框架，能够实现无缝的令牌级早期退出。River-LLM引入轻量级KV共享退出流，使得骨干网络的缺失KV缓存能够在退出过程中自然生成并保留，无需昂贵的恢复操作。此外，我们利用解码器块内的状态转移相似性来预测累积KV误差，并指导精确的退出决策。在数学推理和代码生成任务上的大量实验表明，River-LLM在保持高生成质量的同时，实现了1.53至2.16倍的实际加速。

英文摘要

Large Language Models (LLMs) have demonstrated exceptional performance across diverse domains but are increasingly constrained by high inference latency. Early Exit has emerged as a promising solution to accelerate inference by dynamically bypassing redundant layers. However, in decoder-only architectures, the efficiency of Early Exit is severely bottlenecked by the KV Cache Absence problem, where skipped layers fail to provide the necessary historical states for subsequent tokens. Existing solutions, such as recomputation or masking, either introduce significant latency overhead or incur severe precision loss, failing to bridge the gap between theoretical layer reduction and practical wall-clock speedup. In this paper, we propose River-LLM, a training-free framework that enables seamless token-level Early Exit. River-LLM introduces a lightweight KV-Shared Exit River that allows the backbone's missing KV cache to be naturally generated and preserved during the exit process, eliminating the need for costly recovery operations. Furthermore, we utilize state transition similarity within decoder blocks to predict cumulative KV errors and guide precise exit decisions. Extensive experiments on mathematical reasoning and code generation tasks demonstrate that River-LLM achieves a practical speedup of 1.53x to 2.16x while maintaining high generation quality.


2604.14054 2026-05-26 cs.LG cs.CL 版本更新

$π$-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data

$\pi$-Play: 通过特权自蒸馏实现的多智能体自对弈，无需外部数据

Yaocheng Zhang, Yuanheng Zhu, Wenyue Chong, Songjun Tu, Qichao Zhang, Jiajun Chai, Xiaohan Wang, Wei Lin, Guojun Yin, Dongbin Zhao

发表机构 * Institute of Automation, Chinese Academy of Sciences（中国科学院自动化研究所）； School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences（中国科学院大学先进交叉学科学院）； Meituan（美团）； School of Artificial Intelligence, University of Chinese Academy of Sciences（中国科学院大学人工智能学院）

AI总结提出$\pi$-Play框架，利用自对弈中生成的问答构建路径作为特权信息，结合自蒸馏实现密集反馈的多智能体协同进化，无需外部数据即可超越全监督搜索代理。

Comments 23 pages, 11 figures


AI中文摘要

深度搜索代理已成为解决复杂信息寻求任务的有前景范式，但其训练仍面临稀疏奖励、弱信用分配和有限标注数据的挑战。自对弈提供了一种可扩展的减少数据依赖的途径，但传统自对弈仅通过稀疏结果奖励优化学生，导致学习效率低下。在这项工作中，我们观察到自对弈在任务生成过程中自然产生一个问题构建路径（QCP），这是一种捕获反向求解过程的中间产物。这揭示了一种新的特权信息来源：自对弈可以低成本、大规模地提供高质量特权信息用于自蒸馏，无需依赖人类反馈或精心设计的特权信息。基于这一洞察，我们提出特权信息自对弈（$\pi$-Play），一种结合自对弈和自蒸馏的新型多智能体自进化框架。在$\pi$-Play中，考官生成任务及QCP，教师利用QCP作为特权上下文，通过自蒸馏对学生进行密集监督。这种设计将稀疏奖励的自对弈转变为密集反馈的协同进化。大量实验表明，无数据的$\pi$-Play超越了全监督搜索代理，并将进化效率相比传统自对弈提升了2-3倍。代码见 https://github.com/zhyaoch/pi-play。

英文摘要

Deep search agents have emerged as a promising paradigm for addressing complex information-seeking tasks, but their training remains challenging due to sparse rewards, weak credit assignment, and limited labeled data. Self-play offers a scalable route to reduce data dependence, but conventional self-play optimizes students only through sparse outcome rewards, leading to low learning efficiency. In this work, we observe that self-play naturally produces a question construction path (QCP) during task generation, an intermediate artifact that captures the reverse solution process. This reveals a new source of privileged information: self-play can provide high-quality privileged information for self-distillation at low cost and at scale, without relying on human feedback or curated privileged information. Leveraging this insight, we propose Privileged Information Self-Play ($π$-Play), a novel multi-agent self-evolution framework combining self-play and self-distillation. In $π$-Play, an examiner generates tasks together with QCPs, and a teacher employs QCP as privileged context to densely supervise a student via self-distillation. This design transforms sparse-reward self-play into a dense-feedback co-evolution. Extensive experiments show that data-free $π$-Play surpasses fully supervised search agents and improves evolutionary efficiency by 2-3$\times$ over conventional self-play. Code is available at https://github.com/zhyaoch/pi-play.


2604.11632 2026-05-26 cs.CL 版本更新

CArtBench: Evaluating Vision-Language Models on Chinese Art Understanding, Interpretation, and Authenticity

CArtBench: 评估视觉语言模型在中国艺术理解、解读与真实性方面的能力

Xuefeng Wei, Zhixuan Wang, Xuan Zhou, Zhi Qu, Hongyao Li, Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe

发表机构 * Nara Institute of Science and Technology（奈良先端科学技术大学院大学）； Liaoning Normal University（辽宁师范大学）

AI总结提出CARTBENCH基准，通过四个子任务评估视觉语言模型在中国艺术作品上的识别、推理、鉴赏和真伪鉴别能力，发现现有模型在复杂证据链接、风格断代和真伪诊断方面表现不足。

Comments under review


AI中文摘要

我们介绍了CARTBENCH，一个基于博物馆的基准，用于评估视觉语言模型（VLM）在中国艺术作品上的表现，超越了短形式识别和问答。CARTBENCH包含四个子任务：CURATORQA用于基于证据的识别和推理，CATALOGCAPTION用于结构化的四部分专家风格鉴赏，REINTERPRET用于带有专家评分的可辩护的重新解读，以及CONNOISSEURPAIRS用于在视觉相似混淆下进行诊断性真伪鉴别。CARTBENCH通过将来自Wikidata的带有图像的故宫博物院物品与权威目录页面进行对齐构建，涵盖多个朝代的五个艺术类别。在九个代表性的VLM上，我们发现高整体CURATORQA准确率可能掩盖了在硬证据链接和风格到时期推断上的急剧下降；长形式鉴赏仍远未达到专家参考水平；而面向真伪的诊断性鉴别接近随机水平，这突显了当前模型在鉴赏级推理上的困难。

英文摘要

We introduce CARTBENCH, a museum-grounded benchmark for evaluating vision-language models (VLMs) on Chinese artworks beyond short-form recognition and QA. CARTBENCH comprises four subtasks: CURATORQA for evidence-grounded recognition and reasoning, CATALOGCAPTION for structured four-section expert-style appreciation, REINTERPRET for defensible reinterpretation with expert ratings, and CONNOISSEURPAIRS for diagnostic authenticity discrimination under visually similar confounds. CARTBENCH is built by aligning image-bearing Palace Museum objects from Wikidata with authoritative catalog pages, spanning five art categories across multiple dynasties. Across nine representative VLMs, we find that high overall CURATORQA accuracy can mask sharp drops on hard evidence linking and style-to-period inference; long-form appreciation remains far from expert references; and authenticity-oriented diagnostic discrimination stays near chance, underscoring the difficulty of connoisseur-level reasoning for current models.


2604.05550 2026-05-26 cs.CL cs.CE 版本更新

AutoSOTA: An End-to-End Automated Research System for State-of-the-Art AI Model Discovery

AutoSOTA：面向最先进AI模型发现的端到端自动化研究系统

Yu Li, Chenyang Shao, Xinyang Liu, Ruotong Zhao, Peijie Liu, Hongyuan Su, Zhibin Chen, Qinglong Yang, Anjie Xu, Yi Fang, Qingbin Zeng, Tianxing Li, Jingbo Xu, Fengli Xu, Yong Li, Tie-Yan Liu

发表机构 * Department of Electronic Engineering, BNRist, Tsinghua University（清华大学电子工程系，北京信息科学与技术国家研究中心）； Zhongguancun Academy（中关村学院）； Peking University（北京大学）； University of Science and Technology of China（中国科学技术大学）

AI总结提出AutoSOTA系统，采用多智能体架构实现从论文复现到模型优化的全自动化，成功发现105个超越原始方法的新SOTA模型。


AI中文摘要

人工智能研究越来越依赖于长时间的复现、调试和迭代优化以达到最先进（SOTA）性能，这催生了对能够加速经验模型优化全流程的系统的需求。在这项工作中，我们介绍了AutoSOTA，一个端到端的自动化研究系统，它将顶级AI论文中发布的最新SOTA模型推进为可复现且经验上改进的新SOTA模型。我们通过三个紧密耦合的阶段来形式化这个问题：资源准备与目标设定；实验评估；以及反思与构思。为了解决这个问题，AutoSOTA采用了一种多智能体架构，包含八个专门化的智能体，它们协同工作，将论文与代码和依赖项对应起来，初始化和修复执行环境，跟踪长期实验，生成并调度优化想法，并监督有效性以避免虚假收益。我们在从八个顶级AI会议收集的最新研究论文上评估AutoSOTA，并根据代码可用性和执行成本进行过滤。在这些论文中，AutoSOTA在自动复现和后续优化方面均取得了强大的端到端性能。具体来说，它成功发现了105个超越原始报告方法的新SOTA模型，平均每篇论文耗时约五小时。涵盖LLM、NLP、计算机视觉、时间序列和优化的案例研究进一步表明，该系统可以超越常规的超参数调优，识别架构创新、算法重新设计和工作流级别的改进。这些结果表明，端到端的研究自动化不仅可以作为性能优化器，还可以作为一种新型的研究基础设施，减少重复性实验负担，帮助将人类注意力重新引导到更高层次的科学创造力上。

英文摘要

Artificial intelligence research increasingly depends on prolonged cycles of reproduction, debugging, and iterative refinement to achieve State-Of-The-Art (SOTA) performance, creating a growing need for systems that can accelerate the full pipeline of empirical model optimization. In this work, we introduce AutoSOTA, an end-to-end automated research system that advances the latest SOTA models published in top-tier AI papers to reproducible and empirically improved new SOTA models. We formulate this problem through three tightly coupled stages: resource preparation and goal setting; experiment evaluation; and reflection and ideation. To tackle this problem, AutoSOTA adopts a multi-agent architecture with eight specialized agents that collaboratively ground papers to code and dependencies, initialize and repair execution environments, track long-horizon experiments, generate and schedule optimization ideas, and supervise validity to avoid spurious gains. We evaluate AutoSOTA on recent research papers collected from eight top-tier AI conferences under filters for code availability and execution cost. Across these papers, AutoSOTA achieves strong end-to-end performance in both automated replication and subsequent optimization. Specifically, it successfully discovers 105 new SOTA models that surpass the original reported methods, averaging approximately five hours per paper. Case studies spanning LLM, NLP, computer vision, time series, and optimization further show that the system can move beyond routine hyperparameter tuning to identify architectural innovation, algorithmic redesigns, and workflow-level improvements. These results suggest that end-to-end research automation can serve not only as a performance optimizer, but also as a new form of research infrastructure that reduces repetitive experimental burden and helps redirect human attention toward higher-level scientific creativity.


2603.16105 2026-05-26 cs.CL cs.AI 版本更新

Frequency Matters: Fast Model-Agnostic Data Curation for Pruning and Quantization

频率至关重要：用于剪枝和量化的快速模型无关数据筛选

Francesco Pio Monaco, Elia Cunegatti, Flavio Vella, Giovanni Iacca

发表机构 * University of Trento（特伦托大学）

AI总结提出一种基于Zipf幂律的模型无关数据筛选策略ZipCal，通过最大化词汇多样性来选择校准数据，在剪枝和量化中实现与依赖模型困惑度的最先进方法相当的性能，且速度快约240倍。

Comments Added statistical analysis, mechanistic analysis and a comparison with a generative baseline. 22 pages


AI中文摘要

训练后模型压缩对于增强大型语言模型（LLMs）的可移植性同时保持其性能至关重要。虽然已经提出了几种压缩方法，但较少关注选择最合适的数据集（所谓的校准数据）来寻找压缩模型配置。校准数据的选择是保留模型在任务内和任务间能力的关键步骤。在这项工作中，我们通过分析内在数据属性而非模型特定信号，解决了为剪枝和量化识别高性能校准集的挑战。我们引入了ZipCal，一种基于Zipf幂律最大化词汇多样性的模型无关数据筛选策略。实验表明，我们的方法在各种剪枝基准测试中始终优于标准的均匀随机采样。值得注意的是，在下游性能方面，它与依赖模型困惑度的最先进方法表现相当。后者在大规模模型和数据集上变得极其昂贵，而ZipCal由于其可处理的线性复杂度，平均快约240倍。

英文摘要

Post-training model compression is essential for enhancing the portability of Large Language Models (LLMs) while preserving their performance. While several compression approaches have been proposed, less emphasis has been placed on selecting the most suitable set of data (the so-called calibration data) for finding the compressed model configuration. The choice of calibration data is a critical step in preserving model capabilities both within and across tasks. In this work, we address the challenge of identifying high-performance calibration sets for both pruning and quantization by analyzing intrinsic data properties rather than model-specific signals. We introduce ZipCal, a model-agnostic data curation strategy that maximizes lexical diversity based on Zipfian power laws. Experiments demonstrate that our method consistently outperforms standard uniform random sampling across various pruning benchmarks. Notably, it also performs on par, in terms of downstream performance, with a state-of-the-art method that relies on model perplexity. The latter becomes prohibitively expensive for large-scale models and datasets, while ZipCal is on average roughly 240x faster due to its tractable linear complexity. Code and experiments are available at https://github.com/FrancescoMonaco/ZipCal.
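To make the idea of model-agnostic, diversity-driven calibration selection concrete, here is a small stand-in sketch. It is not the ZipCal algorithm: the abstract gives no procedural details, so this uses a plain greedy vocabulary-coverage heuristic with whitespace tokenization, and unlike the method described it is not linear-time. The function name `select_calibration` and its tie-breaking rule are assumptions for illustration only.

```python
def select_calibration(samples, k):
    """Greedily pick up to k samples, each adding the most unseen word types."""
    # Pair each sample index with its set of lowercase word types.
    remaining = [(i, set(s.lower().split())) for i, s in enumerate(samples)]
    seen, chosen = set(), []
    while remaining and len(chosen) < k:
        # Prefer the sample with the most new word types; break ties by lowest index.
        best = max(remaining, key=lambda item: (len(item[1] - seen), -item[0]))
        remaining.remove(best)
        chosen.append(best[0])
        seen |= best[1]
    return chosen
```

The point of the sketch is that selection depends only on the text itself, with no forward passes through the model being compressed, which is where the speed advantage over perplexity-based selection comes from.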


2603.12983 2026-05-26 cs.CL cs.AI 版本更新

Is Human Annotation Necessary? Iterative MBR Distillation for Error Span Detection in Machine Translation

人工标注是否必要？用于机器翻译错误跨度检测的迭代MBR蒸馏

Boxuan Lyu, Haiyue Song, Zhi Qu

发表机构 * Institute of Science Tokyo（东京科学大学）； National Institute of Information and Communications Technology（日本情报通信研究机构）

AI总结提出一种基于最小贝叶斯风险解码的迭代MBR蒸馏自演化框架，利用现成大语言模型生成伪标签，无需人工标注即可在错误跨度检测任务上超越监督基线。

2603.06687 2026-05-26 cs.CV cs.CL cs.ET cs.MM cs.RO 版本更新

TimeSpot: Benchmarking Geo-Temporal Understanding in Vision-Language Models in Real-World Settings

TimeSpot: 在真实世界场景中评估视觉语言模型的地理时间理解能力

Azmine Toushik Wasi, Shahriyar Zaman Ridoy, Koushik Ahamed Tonmoy, Kinga Tshering, S. M. Muhtasimul Hasan, Wahid Faisal, Tasnim Mohiuddin, Md Rizwan Parvez

发表机构 * Computational Intelligence and Operations Laboratory (CIOL), Bangladesh（计算智能与运筹实验室（CIOL），孟加拉国）； Shahjalal University of Science and Technology (SUST), Sylhet, Bangladesh（沙贾拉尔科技大学（SUST），锡尔赫特，孟加拉国）； North South University (NSU), Dhaka, Bangladesh（北南大学（NSU），达卡，孟加拉国）； Qatar Computing Research Institute (QCRI), Doha, Qatar（卡塔尔计算研究所（QCRI），多哈，卡塔尔）

AI总结提出TimeSpot基准，通过1,455张全球图像评估视觉语言模型在时间属性（季节、月份、时段、日光相位）和地理属性（大洲、国家、气候带、环境类型、经纬度）上的推理能力，发现现有模型性能低下，尤其时间推理不足。

Comments Accepted to ICML 2026


AI中文摘要

地理时间理解，即仅从视觉输入推断位置、时间和上下文属性的能力，支撑着灾害管理、交通规划、具身导航、世界建模和地理教育等应用。尽管最近的视觉语言模型（VLM）利用地标和路标等线索在图像地理定位方面取得了进展，但它们推理时间信号和物理基础空间线索的能力仍然有限。为弥补这一差距，我们引入了TimeSpot，一个用于评估VLM在真实世界中进行地理时间推理的基准。TimeSpot包含来自80个国家的1,455张地面图像，要求直接从视觉证据中结构化预测时间属性（季节、月份、时段、日光相位）和地理属性（大洲、国家、气候带、环境类型、经纬度）。它还包括时空推理任务，测试在真实世界不确定性下的物理合理性。对最先进的开源和闭源VLM的评估显示性能低下，尤其是时间推理。虽然监督微调带来了改进，但结果仍不充分，凸显了需要新方法来实现稳健的、基于物理的地理时间理解。TimeSpot可在 https://TimeSpot-GT.github.io 获取。

英文摘要

Geo-temporal understanding, the ability to infer location, time, and contextual properties from visual input alone, underpins applications such as disaster management, traffic planning, embodied navigation, world modeling, and geography education. Although recent vision-language models (VLMs) have advanced image geo-localization using cues like landmarks and road signs, their ability to reason about temporal signals and physically grounded spatial cues remains limited. To address this gap, we introduce TimeSpot, a benchmark for evaluating real-world geo-temporal reasoning in VLMs. TimeSpot comprises 1,455 ground-level images from 80 countries and requires structured prediction of temporal attributes (season, month, time of day, daylight phase) and geographic attributes (continent, country, climate zone, environment type, latitude-longitude) directly from visual evidence. It also includes spatial-temporal reasoning tasks that test physical plausibility under real-world uncertainty. Evaluations of state-of-the-art open- and closed-source VLMs show low performance, particularly for temporal inference. While supervised fine-tuning yields improvements, results remain insufficient, highlighting the need for new methods to achieve robust, physically grounded geo-temporal understanding. TimeSpot is available at https://TimeSpot-GT.github.io.


2603.05143 2026-05-26 cs.CL cs.LG 版本更新

Feature Resemblance: Towards a Theoretical Understanding of Analogical Reasoning in Transformers

特征相似性：迈向对Transformer中类比推理的理论理解

Ruichen Xu, Wenjing Yan, Ying-Jun Angela Zhang

发表机构 * Department of Information Engineering, The Chinese University of Hong Kong, Hong Kong（香港中文大学信息工程系）

AI总结本文通过最小化Transformer抽象模型，从理论上证明联合训练和特定课程顺序能使实体在表示空间中对齐，从而通过特征相似性实现属性转移，即类比推理。


AI中文摘要

理解大型语言模型中的推理因评估混淆多种推理类型而变得复杂。我们分离出类比推理，即模型在共享已知属性的实体之间转移属性，并研究这种转移何时能从训练中涌现。为了使问题在分析上易于处理，我们研究了一个最小化的Transformer风格抽象，该抽象隔离了学习到的表示如何支持类比推理。在此设置中，我们证明了三个关键结果。首先，对相似性和属性前提的联合训练通过对齐表示实现类比推理。其次，顺序训练仅在相似性结构先于特定属性学习时成功，揭示了课程不对称性。第三，在我们的风格化设置中，两跳推理$(a \to b, b \to c \Rightarrow a \to c)$可被视为具有身份桥$(b=b)$的类比推理，这些身份桥在训练数据中明确出现。这些结果共同揭示了一个统一机制：具有共享属性的实体在表示空间中对齐，从而通过特征相似性实现属性转移。使用高达8B参数的架构进行的实验与理论定性一致，并表明表示几何在风格化模型之外的类比推理中扮演重要角色。

英文摘要

Understanding reasoning in large language models is complicated by evaluations that conflate multiple reasoning types. We isolate analogical reasoning, where a model transfers an attribute between entities that share known properties, and study when such transfer can emerge from training. To make the problem analytically tractable, we study a minimal transformer-style abstraction that isolates how learned representations support analogical reasoning. Within this setting, we prove three key results. First, joint training on similarity and attribution premises enables analogical reasoning through aligned representations. Second, sequential training succeeds only when similarity structure is learned before specific attributes, revealing a curriculum asymmetry. Third, in our stylized setting, two-hop reasoning $(a \to b, b \to c \Rightarrow a \to c)$ can be viewed as analogical reasoning with identity bridges $(b=b)$, which appear explicitly in training data. Together, these results reveal a unified mechanism: entities with shared properties become aligned in representation space, enabling property transfer through feature resemblance. Experiments with architectures up to 8B parameters show qualitative agreement with the theory and suggest that representational geometry plays an important role in analogical reasoning beyond the stylized model.


2602.21198 2026-05-26 cs.LG cs.AI cs.CL cs.CV cs.RO 版本更新

Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs

从试错中学习：具身大语言模型的反思式测试时规划

Yining Hong, Huang Huang, Manling Li, Li Fei-Fei, Leonidas Guibas, Jiajun Wu, Yejin Choi

发表机构 * Stanford University（斯坦福大学）； Northwestern University（西北大学）

AI总结提出反思式测试时规划方法，通过行动中反思和行动后反思两种模式，结合回溯性反思，使具身智能体在测试时进行自我纠正和经验积累，显著提升长程任务性能。


AI中文摘要

具身大语言模型赋予机器人高级任务推理能力，但它们无法反思错误原因，导致部署成为一系列独立尝试，错误重复而非积累经验。借鉴人类反思实践，我们引入反思式测试时规划，整合两种反思模式：行动中反思，代理在行动前利用测试时扩展生成并评分多个候选行动，基于内部反思；以及行动后反思，利用测试时训练，根据执行后的外部反思更新内部反思模型和行动策略。我们还包含回溯性反思，允许代理重新评估早期决策，并利用后见之明进行模型更新，实现适当的长程信用分配。在我们新设计的Long-Horizon Household基准和MuJoCo Cupboard Fitting基准上的实验表明，与基线模型相比有显著提升，并能零样本泛化到逼真的HM3D环境以及在Franka Panda机械臂上的真实机器人实验。消融实验证实，行动中反思和行动后反思相互依赖，且回溯性反思在较低计算开销下比逐步外部反馈实现更好的信用分配。定性分析进一步突出了通过反思进行的行为纠正。

英文摘要

Embodied LLMs endow robots with high-level task reasoning, but they cannot reflect on what went wrong or why, turning deployment into a sequence of independent trials where mistakes repeat rather than accumulate into experience. Drawing upon human reflective practitioners, we introduce Reflective Test-Time Planning, which integrates two modes of reflection: reflection-in-action, where the agent uses test-time scaling to generate and score multiple candidate actions using internal reflections before execution; and reflection-on-action, which uses test-time training to update both its internal reflection model and its action policy based on external reflections after execution. We also include retrospective reflection, allowing the agent to re-evaluate earlier decisions and perform model updates with hindsight for proper long-horizon credit assignment. Experiments on our newly-designed Long-Horizon Household benchmark and MuJoCo Cupboard Fitting benchmark show significant gains over baseline models, with zero-shot generalization to photorealistic HM3D environments and real-robot experiments on a Franka Panda arm. Ablations confirm that reflection-in-action and reflection-on-action are mutually dependent, and that retrospective reflection achieves better credit assignment than step-wise external feedback at lower computational overhead. Qualitative analyses further highlight behavioral correction through reflection.


2602.19333 2026-05-26 cs.CL cs.IR cs.SI 版本更新

PerSoMed: A Large-Scale Balanced Dataset for Persian Social Media Text Classification

PerSoMed：用于波斯社交媒体文本分类的大规模平衡数据集

Isun Chehreh, Ebrahim Ansari

发表机构 * Institute for Advanced Studies in Basic Sciences (IASBS)（基础科学高等研究院）

AI总结该研究构建了首个大规模平衡的波斯社交媒体文本分类数据集，包含9个类别共36,000条帖子，并基于BiLSTM、XLM-RoBERTa、TookaBERT等模型进行基准测试，其中TookaBERT-Large取得了最佳性能（F1分数0.9621）。

Comments 10 pages, including 1 figure

详情

AI中文摘要

本研究引入了首个大规模、良好平衡的波斯社交媒体文本分类数据集，专门用于解决该领域缺乏综合资源的问题。该数据集包含9个类别（经济、艺术、体育、政治、社会、健康、心理、历史、科技）的36,000条帖子，每个类别4,000个样本，以确保类别分布平衡。数据收集涉及来自多个波斯社交媒体平台的60,000条原始帖子，随后进行严格的预处理和混合标注，结合基于ChatGPT的少样本提示和人工验证。为了缓解类别不平衡，我们采用了带语义冗余移除的欠采样和结合词汇替换与生成提示的高级数据增强策略。我们对多个模型进行了基准测试，包括BiLSTM、XLM-RoBERTa（使用LoRA和AdaLoRA适配）、FaBERT、基于SBERT的架构以及波斯语专用TookaBERT（Base和Large）。实验结果表明，基于Transformer的模型始终优于传统神经网络，其中TookaBERT-Large取得了最佳性能（精确率：0.9622，召回率：0.9621，F1分数：0.9621）。按类别评估进一步证实了所有类别的稳健性能，尽管社会和政治文本由于固有歧义而得分略低。本研究提供了一个新的高质量数据集，并对前沿模型进行了全面评估，为波斯语自然语言处理的进一步发展奠定了坚实基础，包括趋势分析、社会行为建模和用户分类。该数据集公开可用，以支持未来的研究工作。

英文摘要

This research introduces the first large-scale, well-balanced Persian social media text classification dataset, specifically designed to address the lack of comprehensive resources in this domain. The dataset comprises 36,000 posts across nine categories (Economic, Artistic, Sports, Political, Social, Health, Psychological, Historical, and Science & Technology), each containing 4,000 samples to ensure balanced class distribution. Data collection involved 60,000 raw posts from various Persian social media platforms, followed by rigorous preprocessing and hybrid annotation combining ChatGPT-based few-shot prompting with human verification. To mitigate class imbalance, we employed undersampling with semantic redundancy removal and advanced data augmentation strategies integrating lexical replacement and generative prompting. We benchmarked several models, including BiLSTM, XLM-RoBERTa (with LoRA and AdaLoRA adaptations), FaBERT, SBERT-based architectures, and the Persian-specific TookaBERT (Base and Large). Experimental results show that transformer-based models consistently outperform traditional neural networks, with TookaBERT-Large achieving the best performance (Precision: 0.9622, Recall: 0.9621, F1-score: 0.9621). Class-wise evaluation further confirms robust performance across all categories, though social and political texts exhibited slightly lower scores due to inherent ambiguity. This research presents a new high-quality dataset and provides comprehensive evaluations of cutting-edge models, establishing a solid foundation for further developments in Persian NLP, including trend analysis, social behavior modeling, and user classification. The dataset is publicly available to support future research endeavors.

2602.15620 2026-05-26 cs.CL cs.AI 版本更新

Routing by Analogy: kNN-Augmented Expert Assignment for Mixture-of-Experts Models

Boxuan Lyu, Soichiro Murakami, Hidetaka Kamigaito, Peinan Zhang

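Affiliations: Institute of Science Tokyo; CyberAgent; Nara Institute of Science and Technology

AI summary: Proposes the kNN-MoE framework, which augments MoE routing by retrieving the locally optimal expert assignments of similar past cases and uses the mean similarity of the retrieved neighbors as a confidence-based mixing coefficient, improving robustness under distribution shift.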
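
The mixing step described in the summary can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes cosine similarity for retrieval and a linear blend between the router's distribution and the neighbors' averaged distribution, and the function names and the `memory` layout are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_mixed_routing(query, router_probs, memory, k=2):
    """Blend the router's expert distribution with one retrieved from memory.

    memory is a hypothetical store: a list of (key_vector, expert_distribution)
    pairs recorded from past cases.
    """
    scored = sorted(
        ((cosine(query, key), dist) for key, dist in memory),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    if not scored:
        return list(router_probs)  # nothing retrieved: fall back to the router

    n_experts = len(router_probs)
    # Average the expert distributions of the retrieved neighbours.
    knn_probs = [
        sum(dist[e] for _, dist in scored) / len(scored) for e in range(n_experts)
    ]
    # Mean neighbour similarity, clamped to [0, 1], acts as the mixing coefficient.
    lam = max(0.0, min(1.0, sum(sim for sim, _ in scored) / len(scored)))
    return [lam * pk + (1.0 - lam) * pr for pk, pr in zip(knn_probs, router_probs)]


memory = [([1.0, 0.0], [1.0, 0.0]), ([0.0, 1.0], [0.0, 1.0])]
mixed = knn_mixed_routing([1.0, 0.0], [0.5, 0.5], memory, k=1)
```

With a single, perfectly matching neighbor the coefficient is 1 and the retrieved assignment fully overrides the router; as neighbors become less similar, the result drifts back toward the router's own distribution.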

2510.22874 2026-05-26 cs.CL 版本更新

A Comprehensive Dataset for Human vs. AI Generated Text Detection

人类与AI生成文本检测的综合数据集

Rajarshi Roy, Gurpreet Singh, Ashhar Aziz, Shashwat Bajpai, Nasrin Imanpour, Shwetangshu Biswas, Kapil Wanaskar, Parth Patwa, Subhankar Ghosh, Shreyas Dixit, Nilesh Ranjan Pal, Vipula Rawte, Ritvik Garimella, Gaytri Jena, Amitava Das, Amit Sheth, Vasu Sharma, Aishwarya Naresh Reganti, Vinija Jain, Aman Chadha

Affiliations: Kalyani Government Engineering College; IIIT Guwahati; IIIT Delhi; BITS Pilani Hyderabad Campus; University of South Carolina; NIT Silchar; San José State University; UCLA; Washington State University; Vishwakarma Institute of Information Technology; Gandhi Institute for Technological Advancement; BITS Pilani Goa; Meta AI; Amazon AI

AI summary: Presents a comprehensive dataset of 73,193 text samples that combines authentic New York Times articles with synthetic text generated by several advanced LLMs, for distinguishing human-written from AI-generated text and for model attribution, with baseline accuracies of 58.35% and 8.92%, respectively.

Comments Defactify4 @AAAI 2025

AI中文摘要

大型语言模型（LLM）的快速发展使得AI生成的文本越来越像人类，引发了对内容真实性、错误信息和可信度的担忧。要可靠地检测AI生成文本并将其归因于特定模型，需要大规模、多样化且标注良好的数据集。在这项工作中，我们提出了一个包含73,193个文本样本的综合数据集，该数据集结合了真实的纽约时报文章与多个最先进LLM（包括Gemma-2-9b、Mistral-7B、Qwen-2-72B、LLaMA-8B、Yi-Large和GPT-4-o）生成的合成版本。数据集提供原始文章摘要作为提示，以及完整的人类作者叙述。我们为两个关键任务建立了基线结果：区分人类撰写与AI生成的文本，准确率达到58.35%；以及将AI文本归因于其生成模型，准确率为8.92%。通过将现实世界的新闻内容与现代生成模型相结合，该数据集旨在促进鲁棒的检测和归因方法的发展，在生成式AI时代培养信任和透明度。我们的数据集可在以下网址获取：https://huggingface.co/datasets/Rajarshi-Roy-research/Defactify_Text_Dataset

英文摘要

The rapid advancement of large language models (LLMs) has led to increasingly human-like AI-generated text, raising concerns about content authenticity, misinformation, and trustworthiness. Addressing the challenge of reliably detecting AI-generated text and attributing it to specific models requires large-scale, diverse, and well-annotated datasets. In this work, we present a comprehensive dataset comprising 73,193 text samples that combine authentic New York Times articles with synthetic versions generated by multiple state-of-the-art LLMs, including Gemma-2-9b, Mistral-7B, Qwen-2-72B, LLaMA-8B, Yi-Large, and GPT-4-o. The dataset provides original article abstracts as prompts, along with full human-authored narratives. We establish baseline results for two key tasks: distinguishing human-written from AI-generated text, achieving an accuracy of 58.35%, and attributing AI texts to their generating models with an accuracy of 8.92%. By bridging real-world journalistic content with modern generative models, the dataset aims to catalyze the development of robust detection and attribution methods, fostering trust and transparency in the era of generative AI. Our dataset is available at: https://huggingface.co/datasets/Rajarshi-Roy-research/Defactify_Text_Dataset

2510.22143 2026-05-26 cs.CL 版本更新

ChunkLLM: A Lightweight Pluggable Framework for Accelerating LLM Inference

Haojie Ouyang, Jianwei Lv, Lei Ren, Chen Wei, Xiaojie Wang, Fangxiang Feng

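Affiliations: School of Artificial Intelligence, Beijing University of Posts and Telecommunications

AI summary: To address the inference inefficiency caused by the quadratic complexity of Transformer self-attention, proposes the ChunkLLM framework, which performs chunk selection and compression through a QK Adapter and a Chunk Adapter, substantially accelerating inference while preserving performance.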

AI中文摘要

基于Transformer的大模型在自然语言处理和计算机视觉中表现出色，但由于自注意力对输入令牌的二次复杂度，面临严重的计算效率低下问题。最近，研究人员提出了一系列基于块选择和压缩的方法来缓解这一问题，但它们要么存在语义不完整的问题，要么训练-推理效率低下。为了全面解决这些挑战，我们提出了ChunkLLM，一个轻量级且可插拔的训练框架。具体来说，我们引入了两个组件：QK适配器（Q-Adapter和K-Adapter）和块适配器。前者附加在每个Transformer层上，兼具特征压缩和块注意力获取的双重目的。后者在模型的最底层运行，通过利用上下文语义信息来检测块边界。在训练阶段，骨干网络的参数保持冻结，仅QK适配器和块适配器进行训练。值得注意的是，我们设计了一种注意力蒸馏方法来训练QK适配器，这提高了关键块的召回率。在推理阶段，仅当当前令牌被检测为块边界时才触发块选择，从而加速模型推理。我们在涵盖多个任务的多种长文本和短文本基准数据集上进行了实验评估。ChunkLLM不仅在短文本基准上取得了可比的性能，而且在长上下文基准上保持了98.64%的性能，同时保持了48.58%的键值缓存保留率。特别地，在处理120K长文本时，ChunkLLM相比原始Transformer实现了最大4.48倍的加速。

英文摘要

Transformer-based large models excel in natural language processing and computer vision, but face severe computational inefficiencies because self-attention has quadratic complexity in the number of input tokens. Recently, researchers have proposed a series of methods based on block selection and compression to alleviate this problem, but they suffer from either semantic incompleteness or poor training-inference efficiency. To comprehensively address these challenges, we propose ChunkLLM, a lightweight and pluggable training framework. Specifically, we introduce two components: QK Adapter (Q-Adapter and K-Adapter) and Chunk Adapter. The former is attached to each Transformer layer, serving dual purposes of feature compression and chunk attention acquisition. The latter operates at the bottommost layer of the model, functioning to detect chunk boundaries by leveraging contextual semantic information. During the training phase, the parameters of the backbone remain frozen, with only the QK Adapter and Chunk Adapter undergoing training. Notably, we design an attention distillation method for training the QK Adapter, which enhances the recall rate of key chunks. During the inference phase, chunk selection is triggered exclusively when the current token is detected as a chunk boundary, thereby accelerating model inference. Experimental evaluations are conducted on a diverse set of long-text and short-text benchmark datasets spanning multiple tasks. ChunkLLM not only attains comparable performance on short-text benchmarks but also maintains 98.64% of the performance on long-context benchmarks while preserving a 48.58% key-value cache retention rate. Particularly, ChunkLLM attains a maximum speedup of 4.48x in comparison to the vanilla Transformer in the processing of 120K long texts.
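
The inference-time behavior in the abstract, where chunk selection runs only when the current token is a chunk boundary, can be illustrated with a toy loop. This is a sketch under stated assumptions, not ChunkLLM's code: a fixed set of punctuation tokens stands in for the learned Chunk Adapter, `score_chunks` is a hypothetical stand-in for the chunk attention scores, and all names are invented for illustration.

```python
def select_top_k(chunk_scores, k):
    """Return the ids of the k highest-scoring chunks, in ascending id order."""
    ranked = sorted(range(len(chunk_scores)), key=lambda i: chunk_scores[i], reverse=True)
    return sorted(ranked[:k])


def decode(tokens, boundary_tokens, score_chunks, k=2):
    """Walk over tokens, refreshing the selected chunks only at chunk boundaries."""
    selected = []   # chunk ids currently kept (empty until the first boundary)
    calls = 0       # how many times the selection step actually ran
    trace = []      # selection in effect at each token position
    for pos, tok in enumerate(tokens):
        if tok in boundary_tokens:  # stand-in for the learned boundary detector
            selected = select_top_k(score_chunks(pos), k)
            calls += 1
        trace.append(selected)
    return trace, calls


tokens = ["a", "b", ".", "c", "d", ".", "e"]
trace, calls = decode(tokens, {"."}, lambda pos: [0.1, 0.9, 0.5], k=2)
```

In this toy run the selection step executes twice for seven tokens; every non-boundary token reuses the most recent selection, which is the source of the saving the abstract describes.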

2510.02327 2026-05-26 cs.CL cs.AI eess.AS 版本更新

KAME: Tandem Architecture for Enhancing Knowledge in Real-Time Speech-to-Speech Conversational AI

KAME：用于增强实时语音到语音对话AI知识的串联架构

So Kuroki, Yotaro Kubo, Takuya Akiba, Yujin Tang

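AI summary: Proposes a hybrid architecture that augments an S2S model's knowledge by injecting a back-end LLM's text responses in real time, improving response correctness while keeping latency low.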

Comments Published at IEEE ICASSP 2026

AI中文摘要

实时语音到语音（S2S）模型擅长生成自然、低延迟的对话响应，但往往缺乏深层知识和语义理解。相反，结合自动语音识别、基于文本的大语言模型（LLM）和文本到语音合成的级联系统提供了优越的知识表示，但代价是高延迟，这破坏了自然交互的流畅性。本文介绍了一种新颖的混合架构，弥合了这两种范式之间的差距。我们的框架通过S2S变压器处理用户语音以实现即时响应，同时将查询并发地传递给强大的后端LLM。然后，LLM的基于文本的响应被实时注入以指导S2S模型的语音生成，有效地为其输出注入丰富的知识，而无需承受级联系统的全部延迟惩罚。我们使用MT-Bench基准的语音合成变体（包含多轮问答会话）评估了我们的方法。结果表明，我们的系统在响应正确性上显著优于基线S2S模型，接近级联系统的水平，同时保持了与基线相当的延迟。

英文摘要

Real-time speech-to-speech (S2S) models excel at generating natural, low-latency conversational responses but often lack deep knowledge and semantic understanding. Conversely, cascaded systems combining automatic speech recognition, a text-based Large Language Model (LLM), and text-to-speech synthesis offer superior knowledge representation at the cost of high latency, which disrupts the flow of natural interaction. This paper introduces a novel hybrid architecture that bridges the gap between these two paradigms. Our framework processes user speech through an S2S transformer for immediate responsiveness while concurrently relaying the query to a powerful back-end LLM. The LLM's text-based response is then injected in real time to guide the S2S model's speech generation, effectively infusing its output with rich knowledge without the full latency penalty of a cascaded system. We evaluated our method using a speech-synthesized variant of the MT-Bench benchmark that consists of multi-turn question-answering sessions. The results demonstrate that our system substantially outperforms a baseline S2S model in response correctness, approaching that of a cascaded system, while maintaining a latency on par with the baseline.

2509.14250 2026-05-26 cs.CL 版本更新

The meaning of prompts and the prompts of meaning: Semiotic reflections and modelling

提示的意义与意义的提示：符号学反思与建模

Martin Thellefsen, Amalia Nurma Dewi, Bent Sorensen

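AI summary: Drawing on Peirce's triadic model of signs and the Dynacom model of communication, this paper reconceptualizes prompting in large language models as a dynamic semiotic phenomenon, emphasizing the iterative process of sign formation and interpretation as a communicative and epistemic act.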

Comments 18 pages, 2 figures

DOI: 10.1108/JD-03-2026-0140

AI中文摘要

本文探讨了大型语言模型（LLMs）中的提示（prompts）和提示工程（prompting）作为动态符号现象，借鉴了皮尔士的符号三元模型、他的九种符号类型以及Dynacom传播模型。目的是将提示重新概念化，不是作为一种技术输入机制，而是作为一种沟通和认知行为，涉及符号形成、解释和精炼的迭代过程。理论基础建立在皮尔士符号学上，特别是再现体（representamen）、对象（object）和解释项（interpretant）之间的相互作用，以及符号的类型学丰富性：性质符号（qualisign）、单一符号（sinsign）、法则符号（legisign）；像似符（icon）、指示符（index）、象征符（symbol）；呈位（rheme）、述位（dicent）、论位（argument）——以及Dynacom模型中捕捉的解释项三元组。在分析上，本文将LLM定位为一种符号资源，它根据用户提示生成解释项，从而参与共享话语宇宙中的意义创造。研究结果表明，提示是一种符号和沟通过程，重新定义了数字环境中知识的组织、搜索、解释和共建方式。这一视角邀请我们在计算符号学时代重新构想知识组织和信息检索的理论与方法基础。

英文摘要

This paper explores prompts and prompting in large language models (LLMs) as dynamic semiotic phenomena, drawing on Peirce's triadic model of signs, his nine sign types, and the Dynacom model of communication. The aim is to reconceptualize prompting not as a technical input mechanism but as a communicative and epistemic act involving an iterative process of sign formation, interpretation, and refinement. The theoretical foundation rests on Peirce's semiotics, particularly the interplay between representamen, object, and interpretant, and the typological richness of signs (qualisign, sinsign, legisign; icon, index, symbol; rheme, dicent, argument), alongside the interpretant triad captured in the Dynacom model. Analytically, the paper positions the LLM as a semiotic resource that generates interpretants in response to user prompts, thereby participating in meaning-making within shared universes of discourse. The findings suggest that prompting is a semiotic and communicative process that redefines how knowledge is organized, searched, interpreted, and co-constructed in digital environments. This perspective invites a reimagining of the theoretical and methodological foundations of knowledge organization and information seeking in the age of computational semiosis.

2508.19988 2026-05-26 cs.CL 版本更新

AgentCoMa: A Compositional Benchmark Mixing Commonsense and Mathematical Reasoning in Real-World Scenarios

AgentCoMa：一个混合常识与数学推理的现实场景组合基准

Lisa Alazraki, Lihu Chen, Ana Brassard, Joe Stacey, Hossein A. Rahmani, Marek Rei

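Affiliations: Imperial College London; RIKEN; University of Sheffield; University College London

AI summary: Proposes the AgentCoMa benchmark, which tests large language models on tasks that compose commonsense and mathematical reasoning, and finds that models are accurate on the individual steps but lose nearly 30% accuracy on average once the steps are combined.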

Comments ACL 2026

AI中文摘要

大型语言模型（LLMs）在涉及多个推理步骤组合的复杂常识和数学问题上取得了高准确率。然而，当前测试这些技能的组合基准往往侧重于常识或数学推理，而解决现实世界任务的LLM智能体需要两者的结合。在这项工作中，我们引入了一个智能体常识与数学基准（AgentCoMa），其中每个组合任务需要一个常识推理步骤和一个数学推理步骤。我们在61个不同规模、模型家族和训练策略的LLM上进行了测试。我们发现，LLM通常可以孤立地解决这两个步骤，但当两者结合时，它们的准确率平均下降近30%。这比我们在先前组合相同推理类型多个步骤的组合基准中观察到的性能差距要大得多。相比之下，非专家人类标注者可以以同样高的准确率解决AgentCoMa中的组合问题和各个步骤。此外，我们进行了一系列可解释性研究，以更好地理解性能差距，检查了神经元模式、注意力图和成员推断。我们的工作强调了在混合类型组合推理背景下模型脆弱性的显著程度，并为未来的改进提供了一个测试平台。

英文摘要

Large Language Models (LLMs) have achieved high accuracy on complex commonsense and mathematical problems that involve the composition of multiple reasoning steps. However, current compositional benchmarks testing these skills tend to focus on either commonsense or math reasoning, whereas LLM agents solving real-world tasks would require a combination of both. In this work, we introduce an Agentic Commonsense and Math benchmark (AgentCoMa), where each compositional task requires a commonsense reasoning step and a math reasoning step. We test it on 61 LLMs of different sizes, model families, and training strategies. We find that LLMs can usually solve both steps in isolation, yet their accuracy drops by nearly 30% on average when the two are combined. This is a substantially greater performance gap than the one we observe in prior compositional benchmarks that combine multiple steps of the same reasoning type. In contrast, non-expert human annotators can solve the compositional questions and the individual steps in AgentCoMa with similarly high accuracy. Furthermore, we conduct a series of interpretability studies to better understand the performance gap, examining neuron patterns, attention maps and membership inference. Our work underscores a substantial degree of model brittleness in the context of mixed-type compositional reasoning and offers a test bed for future improvement.

2508.11925 2026-05-26 cs.CR cs.CL cs.LG 版本更新

Optimizing Token Choice for Code Watermarking: An RL Approach

优化代码水印的令牌选择：一种强化学习方法

Zhimeng Guo, Huaisheng Zhu, Siyuan Xu, Hangfan Zhang, Teng Xiao, Minhao Cheng

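Affiliations: The Pennsylvania State University

AI summary: Proposes the CodeTracer framework, which trains a policy model with reinforcement learning to choose tokens for watermark embedding, improving watermark detectability while preserving code functionality.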

Comments ICML 2026, 18 pages, 3 figures

AI中文摘要

保护LLM生成代码的知识产权需要有效的水印系统，该系统能够在代码高度结构化、语法受限的性质中运行。在这项工作中，我们引入了CodeTracer，一种创新的自适应代码水印框架，其基础是一种新颖的强化学习训练范式。其核心是，CodeTracer采用策略驱动方法，利用参数化模型在下一个令牌预测期间智能地偏向令牌选择。该策略确保嵌入的水印保持代码功能，同时表现出与典型令牌分布微妙但统计上可检测的偏差。为了促进策略学习，我们设计了一个全面的奖励系统，将执行反馈与水印嵌入信号无缝集成，平衡过程级和结果级奖励。此外，我们采用Gumbel Top-k重参数化来实现离散水印决策的基于梯度的优化。广泛的比较评估表明，CodeTracer在水印可检测性和生成代码功能保持方面均显著优于最先进的基线。我们的代码可在https://github.com/TimeLovercc/CodeTracer获取。

英文摘要

Protecting intellectual property on LLM-generated code necessitates effective watermarking systems that can operate within code's highly structured, syntactically constrained nature. In this work, we introduce CodeTracer, an innovative adaptive code watermarking framework underpinned by a novel reinforcement learning training paradigm. At its core, CodeTracer features a policy-driven approach that utilizes a parameterized model to intelligently bias token choices during next-token prediction. This strategy ensures that embedded watermarks maintain code functionality while exhibiting subtle yet statistically detectable deviations from typical token distributions. To facilitate policy learning, we devise a comprehensive reward system that seamlessly integrates execution feedback with watermark embedding signals, balancing process-level and outcome-level rewards. Additionally, we employ Gumbel Top-k reparameterization to enable gradient-based optimization of discrete watermarking decisions. Extensive comparative evaluations demonstrate CodeTracer's significant superiority over state-of-the-art baselines in both watermark detectability and the preservation of generated code's functionality. Our code is available at https://github.com/TimeLovercc/CodeTracer.
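
The abstract mentions Gumbel Top-k reparameterization for making discrete watermarking decisions trainable. The sketch below shows only the forward sampling step of the general Gumbel top-k trick (perturb logits with Gumbel noise, keep the k largest); it is not CodeTracer's code, it omits the gradient relaxation entirely, and the function name and toy logits are assumptions made for illustration.

```python
import math
import random


def gumbel_top_k(logits, k, rng):
    """Sample k distinct indices by adding Gumbel(0, 1) noise to the logits."""
    perturbed = []
    for index, logit in enumerate(logits):
        # Clamp the uniform draw away from 0 and 1 so both logarithms are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        perturbed.append((logit + gumbel, index))
    perturbed.sort(reverse=True)  # largest perturbed logit first
    return [index for _, index in perturbed[:k]]


rng = random.Random(0)
chosen = gumbel_top_k([2.0, 1.0, 0.5, -1.0], k=2, rng=rng)
```

Taking the top k of the perturbed logits draws k indices without replacement, with higher-logit indices more likely to appear; a learned policy could shift the logits to bias which tokens are picked.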

2507.10593 2026-05-26 cs.SE cs.AI cs.CL cs.LG 版本更新

ToolRegistry: A Protocol-Agnostic Tool Management Library for Function-Calling LLMs

ToolRegistry: 一个用于函数调用LLM的协议无关工具管理库

Peng Ding, Rick Stevens

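Affiliations: University of Chicago; Argonne National Laboratory

AI summary: Proposes the ToolRegistry system, which achieves protocol-agnostic tool management through a unified Tool object and a registry, supports multiple transport protocols, pluggable backends, and advanced features, and substantially reduces integration code while improving throughput.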

Comments 16 pages, 4 figures, v3: add co-author, permission system, progressive tool disclosure, think-augmented calling, RPC framing, multi-provider support

AI中文摘要

每个LLM工具调用在结构上都是一个RPC——一个函数名、JSON参数和序列化结果——然而每个协议（原生Python、MCP、OpenAPI、LangChain）都是从零开始集成的。我们提出ToolRegistry，一个使这种RPC本质显式化的系统：一个单一的Tool对象充当通用存根，无论传输方式如何，而注册表则作为RPC客户端运行时，负责调度、模式生成和执行。该系统以三个包的形式发布——一个核心注册表、一个通过MCP和OpenAPI暴露工具的服务器，以及一个生产就绪实现的中心——并通过可插拔的线程或进程后端调用工具。该系统现在还提供基于标签的权限策略、针对大型注册表的BM25F驱动的渐进式工具披露、增强思考的函数调用、多提供商模式支持（OpenAI、Anthropic、Gemini）、声明式JSONC/YAML配置，以及一个基于仅stdlib内置模块的近乎零依赖的核心。在我们的基准测试中，该库将集成代码减少了60-80%，并且为给定工作负载选择正确的并发模式（线程与进程）相比替代方案可带来高达3.1倍的吞吐量。ToolRegistry在https://github.com/Oaklight/ToolRegistry开源；文档位于https://toolregistry.readthedocs.io/。

英文摘要

Every LLM tool call is structurally an RPC -- a function name, JSON arguments, and a serialized result -- yet each protocol (native Python, MCP, OpenAPI, LangChain) is integrated from scratch. We present ToolRegistry, a system that makes this RPC nature explicit: a single Tool object acts as a universal stub regardless of transport, while the registry serves as the RPC client runtime for dispatch, schema generation, and execution. The system ships as three packages -- a core registry, a server exposing tools over MCP and OpenAPI, and a hub of production-ready implementations -- and invokes tools through pluggable thread or process backends. The system now also provides tag-based permission policies, BM25F-powered progressive tool disclosure for large registries, think-augmented function calling, multi-provider schema support (OpenAI, Anthropic, Gemini), declarative JSONC/YAML configuration, and a near-zero-dependency core built on stdlib-only vendored modules. In our benchmarks the library cuts integration code by 60-80%, and choosing the right concurrency mode (thread vs. process) yields up to 3.1x throughput over the alternative for a given workload. ToolRegistry is open-source at https://github.com/Oaklight/ToolRegistry; documentation lives at https://toolregistry.readthedocs.io/.
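
The abstract's framing of a tool call as an RPC (a function name, JSON arguments, and a serialized result) can be made concrete with a toy dispatcher. This is deliberately not the ToolRegistry API: `MiniRegistry`, `register`, and `call` are hypothetical names invented here, and the sketch only shows the name-plus-JSON-in, JSON-out shape of the idea.

```python
import json


class MiniRegistry:
    """Toy registry mapping tool names to callables (hypothetical, for illustration)."""

    def __init__(self):
        self._tools = {}

    def register(self, func):
        """Register a callable under its own name; usable as a decorator."""
        self._tools[func.__name__] = func
        return func

    def call(self, name, json_args):
        """Dispatch one call: tool name and JSON arguments in, JSON string out."""
        if name not in self._tools:
            return json.dumps({"error": f"unknown tool: {name}"})
        kwargs = json.loads(json_args)
        return json.dumps({"result": self._tools[name](**kwargs)})


registry = MiniRegistry()


@registry.register
def add(a, b):
    return a + b


reply = registry.call("add", '{"a": 2, "b": 3}')
```

Because the caller only ever sees a name, a JSON payload, and a JSON reply, the same dispatch path could in principle sit in front of any transport, which is the property the abstract attributes to a single universal stub.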

2507.05890 2026-05-26 cs.CL cs.AI 版本更新

Psychometric Item Validation Using Virtual Respondents with Trait-Response Mediators

使用具有特质-反应中介的虚拟受访者进行心理测量项目验证

Sungjib Lim, Woojung Song, Eun-Ju Lee, Yohan Jo

Affiliations: Graduate School of Data Science, Seoul National University; Department of Communication, Seoul National University; Interdisciplinary Program in Artificial Intelligence, Seoul National University

AI summary: Proposes a framework that uses LLMs to simulate virtual respondents (via mediators) to validate psychometric items efficiently; experiments show the method effectively identifies high-validity items.

Comments This paper has been accepted for publication at TACL 2026

AI中文摘要

随着心理测量调查越来越多地用于评估大型语言模型（LLM）的特质，对适用于LLM的可扩展调查项目生成的需求也随之增长。这里的一个关键挑战是确保生成项目的构念效度，即它们是否真正测量了预期的特质。传统上，这需要昂贵的大规模人类数据收集。为了提高效率，我们提出了一个使用LLM进行虚拟受访者模拟的框架。我们的核心思想是考虑中介因素：通过它们，相同的特质可能对调查项目产生不同的反应。通过模拟具有不同中介因素的受访者，我们识别出那些在这些中介因素中与预期特质稳健相关的调查项目。在三种心理特质理论（大五人格、施瓦茨价值观、VIA性格优势）上的实验表明，我们的中介生成方法和模拟框架有效地识别了高有效性项目。LLM展示了从特质定义生成合理中介因素以及模拟受访者行为以进行项目验证的能力。我们的问题表述、指标、方法和数据集为成本效益高的调查开发以及更深入地理解LLM如何模拟人类调查反应开辟了新方向。我们发布数据集和代码以支持未来工作。

英文摘要

As psychometric surveys are increasingly used to assess the traits of large language models (LLMs), the need for scalable survey item generation suited for LLMs has also grown. A critical challenge here is ensuring the construct validity of generated items, i.e., whether they truly measure the intended trait. Traditionally, this requires costly, large-scale human data collection. To make it efficient, we present a framework for virtual respondent simulation using LLMs. Our central idea is to account for mediators: factors through which the same trait can give rise to varying responses to a survey item. By simulating respondents with diverse mediators, we identify survey items that yield responses robustly correlated with intended traits across these mediators. Experiments on three psychological trait theories (Big5, Schwartz, VIA) show that our mediator generation methods and simulation framework effectively identify high-validity items. LLMs demonstrate the ability to generate plausible mediators from trait definitions and to simulate respondent behavior for item validation. Our problem formulation, metrics, methodology, and dataset open a new direction for cost-efficient survey development and a deeper understanding of how LLMs simulate human survey responses. We release our dataset and code to support future work.
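
The selection idea in the abstract, keeping items whose responses stay correlated with the intended trait across mediators, can be sketched with a plain Pearson correlation. The criterion used here (the minimum correlation over mediators must reach a threshold) is an assumption chosen for illustration, as are the data layout and the function names; the paper's own metrics may differ.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def robust_items(responses, threshold=0.5):
    """Keep items whose trait-response correlation holds under every mediator.

    responses: {item: {mediator: [(trait_level, response), ...]}} (hypothetical layout).
    """
    kept = []
    for item, by_mediator in responses.items():
        correlations = [
            pearson([t for t, _ in pairs], [r for _, r in pairs])
            for pairs in by_mediator.values()
        ]
        if correlations and min(correlations) >= threshold:
            kept.append(item)
    return kept


responses = {
    "A": {"m1": [(1, 1), (2, 2), (3, 3)], "m2": [(1, 2), (2, 3), (3, 5)]},
    "B": {"m1": [(1, 1), (2, 2), (3, 3)], "m2": [(1, 3), (2, 2), (3, 1)]},
}
kept = robust_items(responses)
```

Item B tracks the trait under one mediator but reverses under the other, so it is dropped, while item A is kept because its correlation stays high under both.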

2506.19037 2026-05-26 cs.CL cs.AI cs.IT cs.LG cs.NE math.IT 版本更新

Plan for Speed: Dilated Scheduling for Masked Diffusion Language Models

速度规划：用于掩码扩散语言模型的膨胀调度

Omer Luxembourg, Haim Permuter, Eliya Nachmani

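Affiliations: School of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Beersheba, Israel

AI summary: Proposes the Dilated Unmasking Scheduler (DUS), which partitions sequence positions into non-adjacent dilated groups and unmasks them in parallel to minimize an upper bound on joint entropy gain, achieving up to 5.8x speedup without modifying the denoiser.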

Comments Accepted at ICML 2026

AI中文摘要

掩码扩散语言模型（MDLM）承诺快速、非自回归的文本生成，然而现有的采样器根据模型置信度选择要解掩码的标记，忽略了并行解掩码多个位置时的交互，实际上退化为缓慢的自回归行为。我们提出了膨胀解掩码调度器（DUS），这是一种仅推理、无需规划模型的方法，它将序列位置划分为非相邻的膨胀组，并并行解掩码，以在每个去噪步骤中最小化联合熵增益的上界。通过明确权衡网络调用次数与生成质量，DUS恢复了传统并行解掩码策略下丢失的大部分性能。在数学（GSM8K, MATH500）、代码（HumanEval, MBPP）、通用知识（BBH, MMLU-Pro）和指令遵循（IFEval）基准测试中，DUS优于基于置信度的规划器，并将扩散特有的质量-速度权衡转化为由块大小$B$确定的确定性、可预测的加速，与逐标记MDLM解码相比，实现了高达5.8倍的墙钟加速，而无需修改底层去噪器。作为即插即用的后滤波器，膨胀间隔也改进了自适应采样器。代码可在https://github.com/omerlux/DUS获取。

英文摘要

Masked diffusion language models (MDLMs) promise fast, non-autoregressive text generation, yet existing samplers, which pick tokens to unmask based on model confidence, ignore interactions when unmasking multiple positions in parallel and effectively reduce to slow, autoregressive behavior. We propose the Dilated Unmasking Scheduler (DUS), an inference-only, planner-model-free method that partitions sequence positions into non-adjacent dilated groups and unmasks them in parallel so as to minimize an upper bound on joint entropy gain at each denoising step. By explicitly trading off the number of network calls against generation quality, DUS recovers most of the performance lost under traditional parallel unmasking strategies. Across math (GSM8K, MATH500), code (HumanEval, MBPP), general-knowledge (BBH, MMLU-Pro), and instruction following (IFEval) benchmarks, DUS outperforms confidence-based planners and turns the diffusion-specific quality-speed trade-off into a deterministic, predictable speedup set by the block size $B$, yielding up to $5.8\times$ wall-clock speedup over token-by-token MDLM decoding without modifying the underlying denoiser. Applied as a drop-in post-filter, dilated spacing also improves adaptive samplers. Code is available at https://github.com/omerlux/DUS.
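
One simple way to form non-adjacent groups of positions, in the spirit of the dilated partitioning the abstract describes, is to group positions by their index modulo a dilation factor. This is only an illustrative construction: the exact schedule DUS derives from its entropy bound is not given in the abstract, and `dilated_groups` and its parameters are hypothetical.

```python
def dilated_groups(block_size, dilation):
    """Partition positions 0..block_size-1 into groups whose members are `dilation` apart."""
    return [
        list(range(offset, block_size, dilation))
        for offset in range(min(dilation, block_size))
    ]


groups = dilated_groups(block_size=8, dilation=4)
# Each group would be unmasked in a single parallel denoising step.
steps = len(groups)
```

For a block of 8 positions and a dilation of 4 this yields 4 groups of 2, so the toy schedule needs 4 denoising calls instead of 8, and no group contains two neighboring positions.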

2506.17629 2026-05-26 cs.CV cs.AI cs.CL 版本更新

CLiViS: Unleashing Cognitive Map through Linguistic-Visual Synergy for Embodied Visual Reasoning

CLiViS: 通过语言-视觉协同释放认知地图用于具身视觉推理

Kailing Li, Qi'ao Xu, Tianwen Qian, Yuqian Fu, Yang Jiao, Xiaoling Wang

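Affiliations: School of Computer Science and Technology, East China Normal University; King Abdullah University of Science and Technology; Fudan University

AI summary: Proposes the CLiViS framework, in which an LLM performs high-level task planning and orchestrates VLM-driven open-world visual perception, building a dynamic cognitive map that iteratively updates the scene context to enable training-free embodied visual reasoning.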

AI中文摘要

具身视觉推理（EVR）旨在基于自我中心视频遵循复杂、自由形式的指令，从而在动态环境中实现语义理解和时空推理。尽管具有潜力，EVR面临复杂指令多样性和长期自我中心视频中复杂时空动态的挑战。现有解决方案要么在静态视频描述上使用大型语言模型（LLM），这通常会遗漏关键视觉细节，要么依赖端到端视觉语言模型（VLM），后者在逐步组合推理上存在困难。考虑到LLM在推理和VLM在感知方面的互补优势，我们提出了CLiViS。这是一个新颖的无训练框架，利用LLM进行高层任务规划，并协调VLM驱动的开放世界视觉感知，以迭代更新场景上下文。基于这种协同，CLiViS的核心是一个动态认知地图，它在推理过程中不断演化。该地图构建了具身场景的结构化表示，连接了低层感知和高层推理。跨多个基准的大量实验证明了CLiViS的有效性和通用性，特别是在处理长期视觉依赖方面。代码可在 https://github.com/Teacher-Tom/CLiViS 获取。

英文摘要

Embodied Visual Reasoning (EVR) seeks to follow complex, free-form instructions based on egocentric video, enabling semantic understanding and spatiotemporal reasoning in dynamic environments. Despite its promising potential, EVR encounters significant challenges stemming from the diversity of complex instructions and the intricate spatiotemporal dynamics in long-term egocentric videos. Prior solutions either employ Large Language Models (LLMs) over static video captions, which often omit critical visual details, or rely on end-to-end Vision-Language Models (VLMs) that struggle with stepwise compositional reasoning. Considering the complementary strengths of LLMs in reasoning and VLMs in perception, we propose CLiViS, a novel training-free framework that leverages LLMs for high-level task planning and orchestrates VLM-driven open-world visual perception to iteratively update the scene context. Building on this synergy, the core of CLiViS is a dynamic Cognitive Map that evolves throughout the reasoning process. This map constructs a structured representation of the embodied scene, bridging low-level perception and high-level reasoning. Extensive experiments across multiple benchmarks demonstrate the effectiveness and generality of CLiViS, especially in handling long-term visual dependencies. Code is available at https://github.com/Teacher-Tom/CLiViS.

2504.04639 2026-05-26 cs.CC cs.CL cs.DS cs.LO 版本更新

Ineffectiveness for Search and Undecidability of PCSP Meta-Problems

PCSP元问题的搜索无效性和不可判定性

Alberto Larrauri

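Affiliations: University of Oxford; Durham University; University of Zaragoza

AI summary: This paper studies whether the search and decision versions of promise constraint satisfaction problems (PCSPs) are equivalent, proves that algorithms such as BLP and AIP are ineffective for the search version, gives sufficient conditions for ineffectiveness for search based on the algebraic approach, and from these derives the undecidability of related meta-problems.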

AI中文摘要

承诺CSP的搜索版本和决策版本是否等价是一个开放问题。大多数已知的PCSP算法仅解决其\emph{决策}变体，并且不知道它们是否也能适应解决\emph{搜索}变体。主要方法称为BLP、AIP和BLP+AIP，通过寻找某个整数规划松弛的解来处理PCSP。我们证明将这些解舍入到适当的搜索证书可以像TFNP类中的任何问题一样困难。换句话说，这些算法对于搜索是无效的。基于PCSP的代数方法，我们找到了暗示搜索无效性的充分条件。我们的工具适用于以适当方式由小团体刻画的算法，也可用于证明元问题的不可判定性结果。通过这种方式，我们展示了通过BLP、AIP和BLP+AIP可解的模板族是不可判定的。使用相同的技术，我们还分析了几个已知保证有限模板CSP可处理性的代数条件。我们证明，与循环同态和WNU相关的几个元问题对于PCSP是不可判定的。特别地，没有算法可以判定一个有限PCSP模板（1）是否允许循环同态，（2）是否允许WNU。

英文摘要

It is an open question whether the search and decision versions of promise CSPs are equivalent. Most known algorithms for PCSPs solve only their decision variant, and it is unknown whether they can be adapted to solve search as well. The main approaches, called BLP, AIP and BLP+AIP, handle a PCSP by finding a solution to a relaxation of some integer program. We prove that rounding those solutions to a proper search certificate can be as hard as any problem in the class TFNP. In other words, these algorithms are ineffective for search. Building on the algebraic approach to PCSPs, we find sufficient conditions that imply ineffectiveness for search. Our tools are tailored to algorithms that are characterized by minions in a suitable way, and can also be used to prove undecidability results for meta-problems. This way, we show that the families of templates solvable via BLP, AIP, and BLP+AIP are undecidable. Using the same techniques we also analyze several algebraic conditions that are known to guarantee the tractability of finite-template CSPs. We prove that several meta-problems related to cyclic polymorphisms and WNUs are undecidable for PCSPs. In particular, there is no algorithm deciding whether a finite PCSP template (1) admits a cyclic polymorphism, (2) admits a WNU.

2502.11167 2026-05-26 cs.LG cs.CL 版本更新

SURGE: On the Potential of Large Language Models as General-Purpose Surrogate Code Executors

SURGE: 大型语言模型作为通用代理代码执行器的潜力

Bohan Lyu, Siqiao Huang, Zichen Liang

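Affiliations: Department of Computer Science and Technology, Tsinghua; Institute for Interdisciplinary Information Sciences (IIIS), Tsinghua

AI summary: Proposes the SURGE benchmark of 1,160 problems covering 8 key aspects and, by evaluating 21 open-source and proprietary LLMs, studies their feasibility as surrogate models for code execution prediction, along with scaling laws, data efficiency, and predictive accuracy.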

Journal ref: Proceedings of The 2025 Conference on Empirical Methods in Natural Language Processing

AI中文摘要

神经代理模型是数据挖掘中强大且高效的工具。同时，大型语言模型（LLM）在代码相关任务（如生成和理解）中展示了卓越的能力。然而，一个同样重要但尚未充分探索的问题是，LLM是否可以作为代码执行预测的代理模型。为了系统研究这一问题，我们引入了SURGE，一个包含1160个问题的综合基准，覆盖8个关键方面：多语言编程任务、竞赛级编程问题、仓库级代码分析、高成本科学计算、时间复杂度密集型算法、有缺陷代码分析、依赖特定编译器或执行环境的程序，以及形式化数学证明验证。通过对21个开源和专有LLM的广泛分析，我们研究了扩展律、数据效率和预测准确性。我们的发现揭示了LLM作为计算过程高效代理的可行性的重要见解。基准和评估框架可在https://github.com/Imbernoulli/SURGE获取。

英文摘要

Neural surrogate models are powerful and efficient tools in data mining. Meanwhile, large language models (LLMs) have demonstrated remarkable capabilities in code-related tasks, such as generation and understanding. However, an equally important yet underexplored question is whether LLMs can serve as surrogate models for code execution prediction. To systematically investigate it, we introduce SURGE, a comprehensive benchmark with 1,160 problems covering 8 key aspects: multi-language programming tasks, competition-level programming problems, repository-level code analysis, high-cost scientific computing, time-complexity-intensive algorithms, buggy code analysis, programs dependent on specific compilers or execution environments, and formal mathematical proof verification. Through extensive analysis of 21 open-source and proprietary LLMs, we examine scaling laws, data efficiency, and predictive accuracy. Our findings reveal important insights about the feasibility of LLMs as efficient surrogates for computational processes. The benchmark and evaluation framework are available at https://github.com/Imbernoulli/SURGE.

2208.14882 2026-05-26 cs.MM cs.CL cs.CV cs.IR 版本更新

Hierarchical Local-Global Transformer for Temporal Sentence Grounding

层次化局部-全局Transformer用于时间语句定位

Xiang Fang, Daizong Liu, Pan Zhou, Zichuan Xu, Ruixuan Li

Affiliations: Hubei Engineering Research Center on Big Data Security, School of Cyber Science and Engineering, Huazhong University of Science and Technology; Wangxuan Institute of Computer Technology, Peking University; School of Software, Dalian University of Technology; School of Computer Science and Technology, Huazhong University of Science and Technology

AI summary: Proposes the Hierarchical Local-Global Transformer (HLGT), which models the different granularity levels of video and query together with their cross-modal interactions to obtain finer-grained multi-modal representations, and achieves state-of-the-art performance on three datasets.

Comments Published in IEEE Transactions on Multimedia

AI中文摘要

本文研究多媒体问题中的时间语句定位（TSG），旨在根据给定的句子查询准确确定未修剪视频中的特定视频片段。传统的TSG方法主要遵循自上而下或自下而上的框架，且不是端到端的，严重依赖耗时的后处理来优化定位结果。最近，一些基于Transformer的方法被提出，以高效有效地建模视频和查询之间的细粒度语义对齐。尽管这些方法在一定程度上取得了显著性能，但它们将视频帧和查询词等同视为Transformer输入进行关联，未能捕捉它们不同粒度的不同语义。为解决这一问题，本文提出了一种新颖的层次化局部-全局Transformer（HLGT），利用这种层次信息并建模不同粒度层次和不同模态之间的交互，以学习更细粒度的多模态表示。具体来说，我们首先将视频和查询分割成单独的片段和短语，通过时间Transformer学习它们的局部上下文（相邻依赖）和全局相关性（长距离依赖）。然后，引入全局-局部Transformer来学习局部级和全局级语义之间的交互，以实现更好的多模态推理。此外，我们开发了一种新的跨模态循环一致性损失，以强制两个模态之间的交互并鼓励它们之间的语义对齐。最后，我们设计了一种全新的跨模态并行Transformer解码器，用于整合编码的视觉和文本特征以进行最终定位。在三个具有挑战性的数据集上进行的大量实验表明，我们提出的HLGT实现了新的最先进性能。

英文摘要

This paper studies the multimedia problem of temporal sentence grounding (TSG), which aims to accurately determine the specific video segment in an untrimmed video according to a given sentence query. Traditional TSG methods mainly follow the top-down or bottom-up framework and are not end-to-end. They rely heavily on time-consuming post-processing to refine the grounding results. Recently, some transformer-based approaches have been proposed to efficiently and effectively model the fine-grained semantic alignment between video and query. Although these methods perform well, they treat video frames and query words uniformly as transformer inputs when correlating them, failing to capture their different levels of granularity with distinct semantics. To address this issue, in this paper, we propose a novel Hierarchical Local-Global Transformer (HLGT) to leverage this hierarchical information and model the interactions between different levels of granularity and different modalities for learning more fine-grained multi-modal representations. Specifically, we first split the video and query into individual clips and phrases to learn their local context (adjacent dependency) and global correlation (long-range dependency) via a temporal transformer. Then, a global-local transformer is introduced to learn the interactions between the local-level and global-level semantics for better multi-modal reasoning. In addition, we develop a new cross-modal cycle-consistency loss to enforce interaction between two modalities and encourage the semantic alignment between them. Finally, we design a brand-new cross-modal parallel transformer decoder to integrate the encoded visual and textual features for final grounding. Extensive experiments on three challenging datasets show that our proposed HLGT achieves new state-of-the-art performance.

2605.25244 2026-05-26 cs.CL 版本更新

Inference Time Optimization with Confidence Dynamics

基于置信度动态的推理时优化

Yu Wang, Minghao Liu, Jiayun Wang, Jinrui Huang, Ankit Shah, Wei Wei

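Affiliations: Center for Advanced AI, Accenture

AI summary: By observing how confidence evolves along reasoning trajectories, this paper finds that confidence rises on correct trajectories and falls on incorrect ones, and accordingly proposes a Confidence Dynamic Gain voting method that substantially improves LLM reasoning performance.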

Comments Published in ICML 2026

AI中文摘要

推理时优化技术（如重复采样）显著提升了大语言模型（LLMs）的推理能力。然而，模型不确定性在这些优化策略中的关键作用仍未被充分探索。本文研究了沿推理轨迹的置信度动态，并首次揭示了一个令人惊讶且独特的模式：正确回答轨迹倾向于随时间表现出置信度提升（正置信度增益），而错误轨迹在推理过程中置信度减弱或下降。基于这一观察，我们提出了基于置信度动态增益（CDG）的投票方法，该方法融入了响应置信度轨迹沿推理链的演化方式。在AIME24/25、HMMT25和BRUMO25基准测试上，针对四种开源架构（DeepSeek-R1、gpt-oss、Gemma-3、Qwen-QwQ）的实验表明，CDG相比基线取得了显著的性能提升。这些结果证明，我们的方法为改进LLM推理中的答案选择提供了稳健的判别信号。我们还为这一现象提供了理论见解。代码将在https://github.com/Accenture/CDG.git发布。

Abstract

Inference time optimization techniques, such as repeated sampling, have significantly advanced the reasoning capabilities of Large Language Models (LLMs). However, the critical role of model uncertainty remains largely underexplored in these optimization strategies. In this paper, we investigate the dynamics of confidence along reasoning trajectories and, for the first time, reveal a surprising and unique pattern: correct answer traces tend to exhibit confidence improvement over time (positive confidence gain), while incorrect traces show attenuated or declining confidence as reasoning proceeds. Based on this observation, we propose Confidence Dynamic Gain (CDG) based voting, which incorporates how the confidence trajectory of the response evolves along the reasoning chain. Experiments across four open-source architectures (DeepSeek-R1, gpt-oss, Gemma-3, Qwen-QwQ) on the AIME24/25, HMMT25, and BRUMO25 benchmarks demonstrate that CDG yields a significant performance boost over baselines. These results indicate that our method provides a robust discriminative signal for improving answer selection in LLM reasoning. We also provide theoretical insights for this phenomenon. Code will be released at https://github.com/Accenture/CDG.git.
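
The voting idea in this abstract can be illustrated with a small sketch. The abstract does not state the CDG formula, so everything specific here is an assumption made for illustration: the gain is taken as the mean confidence over the second half of a trace minus the mean over the first half, votes are weighted by the exponential of that gain, and the names `confidence_gain` and `gain_weighted_vote` are hypothetical. It is not the authors' implementation.

```python
import math
from collections import defaultdict


def confidence_gain(confidences):
    """Late-half mean minus early-half mean of per-step confidences (assumed definition)."""
    if len(confidences) < 2:
        return 0.0
    mid = len(confidences) // 2
    early = sum(confidences[:mid]) / mid
    late = sum(confidences[mid:]) / (len(confidences) - mid)
    return late - early


def gain_weighted_vote(traces):
    """Return the answer whose traces carry the largest total exp(gain) weight.

    `traces` is a list of (answer, per-step confidences) pairs.
    """
    scores = defaultdict(float)
    for answer, confidences in traces:
        scores[answer] += math.exp(confidence_gain(confidences))
    return max(scores, key=scores.get)


# Toy data (made up): two traces with falling confidence agree on "17",
# one trace with rising confidence answers "42".
traces = [
    ("17", [0.9, 0.8, 0.3, 0.2]),
    ("17", [0.8, 0.7, 0.4, 0.2]),
    ("42", [0.2, 0.3, 0.9, 1.0]),
]
chosen = gain_weighted_vote(traces)
```

On this toy input a plain majority vote would pick "17", whereas the gain-weighted vote picks "42" because the single rising-confidence trace outweighs the two declining ones.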

2605.25226 2026-05-26 cs.CL (version update)

From Automation to Collaboration: Human-in-the-Loop Methods for Safe and Trustworthy NLP

Most. Sharmin Sultana Samu, MD. Tanvir Ahmed Seum, Md. Rakibul Islam

Affiliations: Department of Computer Science and Engineering, BRAC University; Department of Electrical and Electronic Engineering, Rajshahi University of Engineering and Technology

AI summary: This survey reviews human-in-the-loop methods in which human oversight supports auditing, robustness evaluation, data construction, and model steering to make NLP safer and more trustworthy, and it identifies gaps in scalable probing, sustainable robustness benchmarks, low-resource settings, and governance of private systems.

Comments Preprint, manuscript under review

Abstract

Large language models are widely deployed in high-stakes NLP tasks, yet risks such as bias, hallucination, adversarial vulnerability and unreliable generalization remain. Probe-based auditing reveals inconsistencies in model behavior. Adversarial text generation uncovers robustness gaps, especially in lower-resourced languages with limited benchmarks. Enterprise text-to-SQL settings expose the difficulty of validating outputs over private and large-scale databases. Human supervision is essential for probe validation, adversarial verification and domain-specific annotation, but it is costly and hard to scale. This survey examines recent human-in-the-loop methods that shift NLP from automation toward collaboration for safety and trustworthiness. We review how human expertise supports auditing, robustness evaluation, data construction and model steering. Our findings highlight gaps in scalable probing, sustainable robustness benchmarks, low-resource settings and governance of private systems. We outline practical research directions for adaptive auditing, collaborative evaluation and accountable deployment.

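2605.25208 2026-05-26 cs.CL (version update)

They Are Not the Same: Direct Causes Are Not Grounded Emotion Explanations

Zhuangzhuang Pan, Yan Xia, Chee Seng Chan

Affiliations: Universiti Malaya, Malaysia; Suzhou University of Technology, China; VinUniversity, Vietnam

AI summary: Through an analysis of the IEMO-MECP dataset, this paper finds that the binary-classification proxy used in Emotion-Cause Pair Extraction (ECPE) reliably extracts only direct triggers and does not yield evidence-grounded emotion explanations, because emotional context (emo-context) has no stable place at the binary boundary and, under shortcut pressure, models favor convenient attributions over genuine explanations.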

Comments 25 pages, 11 figures, 24 tables. Preprint

Abstract

Emotion-Cause Pair Extraction (ECPE) was introduced to explain why an emotion occurs, but this goal is now often reduced to binary pair/non-pair prediction. This proxy is useful for direct-cause extraction, yet easy to over-read as evidence-grounded emotion explanation. We show that this interpretation is only partially valid. In IEMO-MECP, 90.9% of original positives remain emo-cause and 95.0% of original negatives remain non-pair, confirming that the binary ECPE task is largely preserved. The problem is that direct triggers alone do not constitute a grounded explanation. Emo-context, an utterance that helps interpret a target emotion without directly causing it, appears on both sides of the original boundary and is enriched near binary uncertainty, showing that the binary boundary has no stable place for such discourse evidence. Across evaluated ECPE models, direct triggers are recovered more reliably than contextual support. Under shortcut pressure, this imbalance becomes consequential. Binary-trained models assign higher pair scores to nearby, lexically similar non-pair candidates than to evidence-supported but structurally harder emo-cause and emo-context pairs. Thus, pair scores can reward convenient attributions over grounded explanations. High binary ECPE performance indicates that a model can identify direct triggers; it does not indicate that the model has explained the emotion. Code is publicly available at https://github.com/panzhzh/ECPExsame.

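2605.25204 2026-05-26 cs.CL (version update)

Clarification Is Not Enough: Post-Clarification Answering Remains the Bottleneck in Multi-Turn QA

Jinyan Su, Jennifer Healey

Affiliations: Cornell University; Adobe Research

AI summary: This paper decomposes multi-turn QA into two components, a clarification policy and post-clarification answering. Experiments on the PACIFIC benchmark show that supervised fine-tuning quickly improves the clarification policy while final-answer accuracy remains markedly low, indicating that understanding and correctly interpreting the user's response is the key bottleneck.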

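STREAM: A Data-Centric Framework for Mining High-Value Task-Oriented Dialogues from Streaming Media

Liang Xue, Haoyu Liu, Cheng Wang, Pengyu Chen, Haozhuo Zheng, Yang Liu

Affiliations: Harbin Institute of Technology; Byering Technology

AI summary: Proposes the STREAM framework, which uses streaming-media data to synthesize StreamDial, a large-scale multi-domain task-oriented dialogue dataset. It combines persona construction and conversational blueprints with RAG to generate high-quality dialogues and address data scarcity.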

Abstract

Large language models for vertical domains are bottlenecked by the scarcity of complex, domain-specific task-oriented dialogues. Existing data acquisition pipelines face a persistent trilemma: expert annotation is expensive, real-world service conversations are constrained by privacy and commercial restrictions, and static corpora quickly become stale. We propose Stream, a data-centric framework that leverages publicly available streaming media (live streams and short videos) to synthesize high-value service dialogues at scale. Stream mines authentic interaction signals from noisy streams and synthesizes conversations by integrating role-grounded persona construction with Conversational Blueprint construction; it further adopts retrieval-augmented generation (RAG) to support knowledge-aware responses. Based on Stream, we release StreamDial, a large-scale multi-domain dataset covering Automotive, Restaurant, and Hotel. StreamDial contains 87,498 dialogue sessions and 1,497,320 turns in total, with an average of 17.11 turns per session and a comparable scale across domains. Each session is organized as a structured quadruplet $\langle P_u, P_a, B, H \rangle$ that pairs dialogue history with explicit user/agent personas and a Conversational Blueprint, capturing realistic service behaviors such as requirement mining, constraint conflicts, negotiation, and recovery. Evaluations with automatic judges and downstream tasks show that StreamDial improves intrinsic dialogue quality over strong baselines, and models trained with StreamDial improve Dialogue State Tracking across backbones; we further report a completed human-evaluation set and encouraging multilingual transfer on Qwen3-8B under a controlled training budget. The data is released at https://github.com/hitxueliang/DialogDataSetBySTREAM.
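
As a reading aid for the session structure and corpus statistics described in this abstract, here is a minimal sketch. The `Session` class, its field names, the representation of the blueprint as a list of stage names, and the example contents are all hypothetical; only the two corpus totals come from the abstract. The released dataset's actual schema may differ.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Session:
    """One dialogue session as a quadruplet <P_u, P_a, B, H> (field names are illustrative)."""
    user_persona: str                      # P_u
    agent_persona: str                     # P_a
    blueprint: List[str]                   # B, assumed here to be an ordered list of stages
    history: List[Tuple[str, str]] = field(default_factory=list)  # H, (speaker, utterance)


# Corpus totals as reported in the abstract.
sessions, turns = 87_498, 1_497_320
average_turns = round(turns / sessions, 2)

# A made-up session showing how the four parts fit together.
example = Session(
    user_persona="budget-conscious first-time car buyer",
    agent_persona="dealership sales consultant",
    blueprint=["requirement mining", "constraint conflict", "negotiation", "recovery"],
    history=[
        ("user", "I need a family car that fits my budget."),
        ("agent", "Which matters more to you, space or fuel economy?"),
    ],
)
```

Dividing the reported turn count by the reported session count reproduces the stated average of 17.11 turns per session.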

2605.25141 2026-05-26 cs.CL cs.AI (version update)

LLM Agent Based Renewable Energy Forecasting Using Edge and IoT Data A Review of Solar Wind Weather and Grid Aware Decision Support

Pavan Manjunath, Thomas Pruefer

Affiliations: Independent Researcher

AI summary: This review examines how LLM agents can enhance renewable energy forecasting by integrating heterogeneous sensor streams, weather API data, historical generation records, and grid constraints into a unified decision-support workflow.

Abstract

Reliable forecasting of renewable energy generation is a foundational requirement for grid stability, energy trading, battery scheduling, and carbon-aware operational planning. Solar and wind resources are inherently intermittent: their output fluctuates with cloud cover, wind speed, atmospheric turbulence, seasonal patterns, and local terrain. The proliferation of IoT and edge devices, spanning smart meters, inverters, anemometers, pyranometers, weather stations, and grid-interface sensors, has created an unprecedented volume of real-time operational data that conventional forecasting pipelines are ill-equipped to exploit fully. This review investigates how large language model (LLM) agents can enhance renewable energy forecasting by integrating heterogeneous sensor streams, weather API data, historical generation records, grid constraints, and contextual reasoning into unified decision-support workflows. We survey classical forecasting methods (statistical time-series models, deep learning architectures, and physics-hybrid approaches) and emerging LLM agent frameworks for explanation, uncertainty communication, and operator guidance. A six-layer taxonomy is proposed, covering data acquisition, preprocessing, feature engineering, model inference, uncertainty estimation, and natural-language reporting. The review identifies twelve open challenges, spanning real-time deployment, model drift under distribution shift, uncertainty quantification, hallucination control in LLM agents, interoperability of edge hardware, and integration with energy management systems. The paper concludes by recommending a research agenda centred on open benchmarks, physics-informed LLM grounding, and federated forecasting architectures.

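2605.25133 2026-05-26 cs.AI cs.CL (version update)

Faithfulness Metrics Do Not Measure Faithfulness: A Meta-Evaluation Based on Ground-Truth Labels

Yoav Gur-Arieh, Ana Marasović, Mor Geva

Affiliations: Tel Aviv University; University of Utah

AI summary: To address the lack of ground-truth validation for chain-of-thought faithfulness metrics, the authors build BonaFide, a dataset with ground-truth faithfulness labels, and systematically evaluate existing metrics, finding that most perform near chance, are biased, and are computationally expensive.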

Abstract

Chains of thought (CoTs) have become central in interpreting and auditing behaviors of large language models. Yet growing evidence suggests that these traces often fail to faithfully represent the computations behind a model's predictions. Several faithfulness metrics have been proposed, but whether they indeed measure faithfulness remains unknown. Answering this requires ground-truth labels, which are hard to obtain since internal computations are not directly observable. Consequently, most works proposing metrics report only absolute scores or comparisons to prior metrics, and the few existing benchmarks rely on proxies like plausibility or importance, properties orthogonal to faithfulness that can mislead about whether a CoT can be trusted. We address this challenge by constructing tasks whose outputs reveal which intermediate computations must have produced them, and developing an automated labeling pipeline that yields ground-truth faithfulness labels at both the step and CoT level. Building on this methodology, we present BonaFide, a benchmark of 3,066 labeled CoTs across 13 tasks and 10 models, and use it to conduct the first systematic evaluation of prominent faithfulness metrics. Our experiments show that most metrics perform near chance, exhibit strong prediction biases and degrade on longer CoTs. The best metric reaches only 0.70 AUROC at the CoT level while another reaches 0.59 at the step level, with neither transferring across settings, while entailing prohibitively high computational cost. Our results expose fundamental gaps in current faithfulness evaluation and call for the development of more reliable and efficient metrics.
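
The AUROC figures quoted in this abstract measure how well a metric's scores rank faithful CoTs above unfaithful ones, with 0.5 corresponding to chance. A minimal sketch of that computation, using the standard pairwise (Mann-Whitney) formulation, is shown below. The scores and labels are made up for illustration and are not from BonaFide; the function name `auroc` is likewise just a label chosen here.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up metric scores for six CoTs; label 1 = faithful, 0 = unfaithful.
scores = [0.9, 0.4, 0.6, 0.5, 0.2, 0.7]
labels = [1, 1, 0, 1, 0, 0]
value = auroc(scores, labels)  # 5 of 9 positive/negative pairs are ranked correctly
```

The quadratic loop is fine for small examples; for large evaluations a rank-based implementation would be the usual choice.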

2605.25038 2026-05-26 cs.CL cs.LG cs.SE (version update)

TRACE: A taxonomy-grounded synthetic dataset for teaching-program generation and session interpretation in Applied Behavior Analysis

Festus Kahunla

Affiliations: Drexel University; Pombo Labs

AI summary: Proposes the TRACE dataset, in which a deterministic taxonomy-driven generator produces 2,999 synthetic examples covering teaching-program generation and multi-session behavioral interpretation, to work around the fact that real ABA data is privacy-protected and cannot be released.

Comments 11 pages, 3 tables. Dataset: https://huggingface.co/datasets/PomboLabs/TRACE ; code: https://github.com/Pombo-Labs/TRACE

Abstract

Applied Behavior Analysis (ABA) is a clinical discipline whose documentation (teaching programs and multi-session behavioral logs) is formulaic and high-volume, yet real session data is HIPAA-protected and bound by professional confidentiality rules, which blocks the release of a training corpus. We present TRACE (Taxonomy-Referenced ABA Clinical Examples), a 2,999-example synthetic instruction-tuning dataset covering two ABA tasks: teaching-program generation across Discrete Trial Training, Natural Environment Teaching, and Task Analysis; and multi-session behavioral interpretation across twelve trajectory patterns and thirteen target behaviors. Every example is produced by a deterministic taxonomy-driven generator grounded in the canonical ABA literature, and every example carries complete sampling provenance: the exact taxonomy cells that produced it. The dataset is released under CC BY-NC 4.0 for data and MIT for code, with stratified train (2,549), validation (149), test (281), and sanity (20) splits. TRACE is a research artifact and has not been clinically validated.
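
A quick arithmetic check of the split sizes reported in this abstract is sketched below. The dictionary keys are informal labels chosen here, not necessarily the split names used in the released dataset.

```python
# Split sizes as reported in the abstract.
splits = {"train": 2549, "validation": 149, "test": 281, "sanity": 20}

total = sum(splits.values())

# Share of each split as a percentage of the whole dataset, to one decimal place.
shares = {name: round(100 * count / total, 1) for name, count in splits.items()}
```

The four splits add up to the stated 2,999 examples, with roughly 85% of them in the training split.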

2605.25036 2026-05-26 cs.CL cs.AI (version update)

Language Bias in LVLMs: From In-Depth Analysis to Simple and Effective Mitigation

Yangneng Chen, Jing Li

Affiliations: Harbin Institute of Technology, Shenzhen, China

AI summary: This paper systematically studies language bias in large vision-language models, traces it to modality misalignment during training, and proposes two simple and effective mitigations: Language Bias Regularization (LBR) and Language Bias Penalty (LBP).

Comments Accepted by ICML 2026

Abstract

Large Vision-Language Models (LVLMs) extend large language models with visual understanding, but remain vulnerable to hallucination, where outputs are fluent yet inconsistent with images. Recent studies link this issue to language bias, the tendency of LVLMs to over-rely on text while neglecting visual inputs. Yet most analyses remain empirical without uncovering its underlying cause. In this paper, we provide a systematic study of language bias and identify its root in modality misalignment during training. Our analysis shows that both Visual Instruction Tuning (VIT) and Direct Preference Optimization (DPO) often prioritize textual improvements, which may cause LVLMs to lean too far toward language modeling rather than balanced multimodal understanding. To address this, we propose two simple yet effective methods: Language Bias Regularization (LBR), which mitigates language bias through regularization during instruction tuning, and Language Bias Penalty (LBP), which penalizes language bias in the DPO training process. Extensive experiments across diverse models and benchmarks demonstrate the effectiveness of our approach. LBR consistently improves performance on over ten general benchmarks, while LBP significantly reduces hallucination and improves trustworthiness. Together, these methods not only mitigate language bias but also advance the overall alignment of LVLMs, all without introducing any additional data or auxiliary models. Our code is publicly available at https://github.com/lab-klc/LVLM-Language-Bias.

2605.25020 2026-05-26 cs.AI cs.CL (version update)

Boost the General, Suppress the Specific: Sparse-Autoencoder-Based Steering of Medical Vision-Language Models

Farhad Nooralahzadeh, Benjamin Gundersen, Nicolas Deperrois, Hidetoshi Matsuom, Mizuho Nishio, Thomas Frauenfelder, Ahmed Allam, Christian Blüthgen, Michael Moor, Michael Krauthammer

Affiliations: University of Zurich and University Hospital of Zurich; Kobe University; ETH AI Center; ETH Zurich; Stanford University; Zurich University of Applied Sciences

AI summary: Proposes a decoding-time residual steering method that needs no weight updates: per-token sparse autoencoders (SAEs) are used to intervene on medical vision-language models, suppressing hallucinations and improving report quality, with significant gains on several models.

Abstract

Medical vision-language models (VLMs) often hallucinate when generating chest X-ray reports: they fabricate findings that are not present in the image, miss important ones, or localize them incorrectly. We mitigate this without weight updates through decoding-time residual steering on a per-token sparse autoencoder (SAE) basis: Top-$K$ SAEs on late layers, causal steering against clinical errors, and then a combined suppress/boost intervention at inference time. On the MIMIC-CXR test split, our inference-only method improves the quality of generated reports for three radiology VLMs (RadVLM, LLaVA-Rad, and CheXOne), with relative improvements of +5.4%, +7.2%, and +17.0% in the clinical composite metric, and statistically significant GREEN gains on all backbones. A cross-model feature alignment shows that the quality-promoting (boost) directions overlap strongly across architectures, whereas hallucination-linked (suppress) directions are model-specific. Therefore, transferable steering must treat suppression per backbone rather than sharing a universal suppress list. The same recipe transfers zero-shot to IU-Xray (GREEN +7.7% relative) without retraining, confirming that the identified features are properties of the model, not of the training corpus. We release causal feature sets and an interactive feature dashboard: https://cxr-sparse-feature-dashboard.netlify.app/.

2605.24973 2026-05-26 cs.CV cs.AI cs.CL (version update)

MinerU-Popo: Universal Post-Processing Model for Structured Document Parsing

Bangrui Xu, Ziyang Miao, Xuanhe Zhou, Yiming Lin, Zirui Tang, Xiaomeng Zhao, Fan Wu, Cheng Tan, Fan Wu, Bin Wang, Conghui He

Affiliations: Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory, OpenDataLab; University of California, Berkeley

AI summary: Proposes MinerU-Popo, a lightweight, universal post-processing framework that decomposes the problem into four subtasks (text truncation recovery, table truncation recovery, title hierarchy reconstruction, and image-text association) and uses dynamic chunking with overlap-based synchronization to rebuild page-level OCR results into a document-level logical structure, markedly improving title-hierarchy TEDS and RAG accuracy.

Comments The code is available at https://github.com/opendatalab/MinerU-Popo

Abstract

VLM-based OCR models have become the de facto choice for document parsing, as they can accurately extract page-level elements (e.g., paragraphs within individual pages) together with their bounding boxes and textual content. However, downstream applications such as RAG require coherent document-level information, whereas these models often break cross-page continuity and fail to recover disrupted structures, such as paragraphs and tables truncated by page boundaries. Such relationships are not confined to a single page; instead, they require joint analysis of titles, paragraphs, tables, and images spanning multiple pages. A natural solution is therefore to reuse existing OCR outputs and reconstruct document-level logical structures through post-processing. To this end, we propose MinerU-Popo, a lightweight and universal framework for POst-Processing OCR outputs, which converts page-level results from diverse parsers into coherent document-level structures. MinerU-Popo decomposes the problem into four focused subtasks: text truncation recovery, table truncation recovery, title hierarchy reconstruction, and image-text association. To address these effectively, we build a task-oriented data engine with task-specific input filtering, and use the generated data (30K) to fine-tune a lightweight post-processing model (Qwen3-VL-4B). To support long documents, we introduce dynamic chunking with overlap-based synchronization, which aligns chunk-level outputs from the fine-tuned model and preserves global consistency. Finally, we assemble the aligned outputs into a tree-structured document representation, further enriched with node chunking and summaries for downstream retrieval and analysis. Empirical results show MinerU-Popo improves title-hierarchy TEDS by at least 20% across all five tested OCR models, improves RAG accuracy and reduces per-query latency.
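
The abstract mentions chunking with overlap-based synchronization but does not describe the mechanism, so the sketch below only illustrates one way such a scheme could work. It assumes fixed-size chunks that share at least one item with their predecessor, and it assumes each chunk's heading levels are correct up to a constant shift that can be read off the first shared item. Both helper names are hypothetical; this is not MinerU-Popo's implementation.

```python
def make_chunks(n_items, size, overlap):
    """Return (start, end) ranges covering n_items; consecutive ranges share `overlap` items."""
    if not 1 <= overlap < size:
        raise ValueError("overlap must satisfy 1 <= overlap < size")
    chunks, start = [], 0
    while True:
        end = min(start + size, n_items)
        chunks.append((start, end))
        if end == n_items:
            return chunks
        start = end - overlap


def merge_levels(chunks, chunk_levels):
    """Stitch per-chunk heading levels into one globally consistent list.

    Each later chunk is shifted so that its first item agrees with the level
    already assigned to that same item by the previous chunks.
    """
    merged = list(chunk_levels[0])
    for (start, _end), levels in zip(chunks[1:], chunk_levels[1:]):
        shift = merged[start] - levels[0]   # item `start` appears in both chunks
        already_merged = len(merged) - start
        merged.extend(level + shift for level in levels[already_merged:])
    return merged


# Eight headings split into chunks of four with one shared heading between neighbours.
chunks = make_chunks(8, size=4, overlap=1)
# Made-up model outputs: the second chunk numbers its levels one too low.
chunk_levels = [[1, 2, 2, 3], [2, 2, 1, 2], [3, 1]]
levels = merge_levels(chunks, chunk_levels)
```

In this toy case the second chunk's local numbering is offset by one, and the shared heading lets the merge step recover a single consistent hierarchy.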

2605.24960 2026-05-26 cs.CL cs.AI cs.LG (version update)

Investigating the Interplay between Contextual and Parametric Chain-of-Thought Faithfulness under Optimization

Jingyi Sun, Qianli Wang, Pepa Atanasova, Nils Feldhus, Isabelle Augenstein

Affiliations: University of Copenhagen; Technische Universität Berlin; German Research Center for Artificial Intelligence (DFKI); BIFOLD – Berlin Institute for the Foundations of Learning and Data

AI summary: By proposing FaithMate, a unified preference-alignment interface, the paper studies how the contextual and parametric paradigms of chain-of-thought faithfulness interact under optimization, finding that the two are positively coupled but asymmetric and that contextual faithfulness metrics trade off against one another.

Comments The first two authors contributed equally and share first-authorship

Abstract

Chain-of-Thought (CoT) faithfulness, i.e., whether CoTs genuinely reflect large language models' (LLM) underlying behavior, is typically evaluated under two disjoint paradigms: contextual faithfulness, measured by perturbing the input or CoT trace, and parametric faithfulness, assessed by intervening on a model's parametric knowledge. Yet prior work compares them only descriptively. We fill this gap by proposing FaithMate, a unified preference-alignment interface for optimizing models towards either faithfulness paradigm. It enables us to investigate the interplay between the two paradigms, examining whether and to what extent faithfulness gains generalize within and across paradigms. Across three models, two datasets, and six faithfulness metrics, we find that the two paradigms are positively coupled, yet asymmetric: optimizing towards parametric faithfulness yields consistent gains across both paradigms, whereas the contextual counterpart delivers more variable gains. Within the contextual paradigm, faithfulness gains on one metric do not consistently transfer to others, implying that existing contextual metrics capture disjoint facets of faithfulness and exposing inherent trade-offs. These findings imply that CoT faithfulness is not a monolithic objective and therefore requires multifaceted optimization and evaluation.

2605.24958 2026-05-26 cs.CL cs.AI (version update)

SEP-Attack: A Simple and Effective Paradigm for Transfer-Based Textual Adversarial Attack

Han Liu, Zhi Xu, Xiaotong Zhang, Feng Zhang, Xiaoming Xu, Wei Wang, Fenglong Ma, Hong Yu

Affiliations: Dalian University of Technology; Peking University; Macao Polytechnic University; The Pennsylvania State University

AI summary: Proposes SEP-Attack, which uses a determinantal point process to generate diverse surrogate ensemble weights, evaluates prediction confidence with a new metric to compute word importance and generate adversarial examples, and significantly outperforms existing methods on multiple datasets and APIs.

Abstract

Despite the strong performance of deep neural networks in modern Web and language applications, they remain vulnerable to adversarial attacks, especially transferable attacks that generate adversarial examples using surrogate models without accessing the victim model. Transferable attacks in the text domain are still under-explored, with only a few studies addressing this challenging issue, often with suboptimal results due to equal treatment of submodels or inaccurate estimation of importance scores. To address these challenges, we propose a simple yet effective paradigm for transfer-based textual adversarial attack, named SEP-Attack. Specifically, we employ the Determinantal Point Process (DPP) to generate diverse surrogate ensemble weights, representing the transferability of submodels. Using these weights, we introduce a new metric to evaluate prediction confidence scores, which in turn are used to calculate word importance scores and generate adversarial candidates. Finally, we quantify the transferability score for each candidate and select the top ones as the final transferable adversarial examples. Experiments conducted on four datasets and two real-world APIs validate the efficacy of SEP-Attack, significantly outperforming state-of-the-art baselines.

2605.24956 2026-05-26 cs.CL (version update)

NITP: Next Implicit Token Prediction for LLM Pre-training

Xiangdong Zhang, Debing Zhang, Shaofeng Zhang, Xiaohan Qin, Yu Cheng, Junchi Yan

Affiliations: School of AI, Shanghai Jiao Tong University; Xiaohongshu Inc.; The Chinese University of Hong Kong; University of Science and Technology of China

AI summary: Proposes NITP, which augments discrete token prediction with dense continuous supervision in representation space to address the under-constrained latent representation space of standard next-token prediction, yielding consistent gains on models from 0.5B to 9B parameters.

Comments Accepted at ICML 2026

Abstract

Standard next-token prediction (NTP) supervises language models solely through discrete labels in the output logit space. We argue that this sparse one-hot supervision leaves the latent representation space under-constrained, allowing hidden states to drift into degenerate and anisotropic configurations that can limit generalization. To address this issue, we propose Next Implicit Token Prediction (NITP), which augments discrete prediction with dense continuous supervision directly in the representation space. NITP trains the model to predict the implicit semantic content of the next token, using shallow-layer representations from the same model as stable self-supervised targets. We provide theoretical analysis showing that NITP regularizes the optimization landscape by mitigating under-constrained degrees of freedom and encouraging a compact, structured representation geometry. Empirically, across dense and MoE models ranging from 0.5B to 9B parameters, NITP consistently improves downstream performance with negligible computational overhead. On a 9B MoE model, NITP achieves a 5.7% absolute improvement on MMLU-Pro, along with gains of 6.4% on C3 and 4.3% on CommonsenseQA, with approximately 2% additional training FLOPs and no additional inference cost. Our implementation is available at https://github.com/aHapBean/NITP.
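
To make the idea of adding continuous supervision to next-token prediction concrete, the sketch below combines a cross-entropy term with a representation-matching term. The abstract does not specify the NITP objective, so the cosine-distance form, the weight of 0.1, the treatment of the shallow-layer targets as constants, and the function names are all assumptions for illustration rather than the paper's loss.

```python
import math


def cosine_distance(a, b):
    """1 minus the cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def combined_loss(token_probs, next_ids, predicted, targets, weight=0.1):
    """Mean next-token cross-entropy plus `weight` times mean cosine distance.

    token_probs[t]: predicted distribution over the vocabulary for token t+1
    next_ids[t]:    id of the actual token t+1
    predicted[t]:   the model's predicted representation of token t+1
    targets[t]:     shallow-layer representation of token t+1 (treated as a constant)
    """
    n = len(next_ids)
    cross_entropy = -sum(math.log(p[i]) for p, i in zip(token_probs, next_ids)) / n
    auxiliary = sum(cosine_distance(p, t) for p, t in zip(predicted, targets)) / n
    return cross_entropy + weight * auxiliary


# One position, two-word vocabulary, two-dimensional representations (made-up numbers).
aligned = combined_loss([[0.5, 0.5]], [0], [[1.0, 0.0]], [[1.0, 0.0]])
orthogonal = combined_loss([[0.5, 0.5]], [0], [[1.0, 0.0]], [[0.0, 1.0]])
```

When the predicted representation matches its target the auxiliary term vanishes and only the cross-entropy remains; an orthogonal prediction adds the full weight to the loss.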

2605.24930 2026-05-26 cs.CL (version update)

H$^{2}$MT: Semantic Hierarchy-Aware Hierarchical Memory Transformer

Maryam Haghifam, Zifan He, Jason Cong, Yizhou Sun

Affiliations: University of California, Los Angeles

AI summary: Proposes H$^{2}$MT, which builds a semantic hierarchy offline, computes memory embeddings by bottom-up post-order aggregation, and routes queries coarse-to-fine at inference, trading off quality and efficiency in long-context inference.

Abstract

Transformer-based LLMs achieve strong results on many language tasks; however, long inputs remain challenging because context windows are finite, and prefill latency and memory grow rapidly with prompt length. Flat token-stream processing and chunk-based retrieval can therefore spend substantial computation and context budget on text unrelated to the query. Offline-indexed RAG additionally introduces external storage and index management overhead, and typically appends retrieved evidence as raw text, increasing prefill cost and latency. H$^{2}$MT makes long-context inference structure-aware: it builds a semantic hierarchy offline, computes a memory embedding for each node via bottom-up post-order aggregation, and routes queries coarse-to-fine at inference to prune irrelevant branches early. On LongBench QA (NarrativeQA, HotpotQA, QASPER) and two structured technical-document settings, H$^{2}$MT achieves favorable quality-efficiency trade-offs, delivering competitive ROUGE-L and F1 (where applicable) with lower peak GPU memory and time-to-first-token (TTFT) than prompt compression, memory-token methods, and retrieval-augmented generation baselines.
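
The two steps named in this abstract, bottom-up post-order aggregation and coarse-to-fine routing, can be sketched on a toy tree. The specifics are assumptions made for illustration: inner-node embeddings are plain means of their children, relevance is cosine similarity, and pruning keeps a fixed number of children per level. The `Node` class and function names are hypothetical, and the paper's learned memory embeddings will differ.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    embedding: list = None                       # given for leaves, computed for inner nodes
    children: list = field(default_factory=list)


def aggregate(node):
    """Post-order pass: an inner node's embedding is the mean of its children's embeddings."""
    if node.children:
        vectors = [aggregate(child) for child in node.children]
        node.embedding = [sum(dim) / len(vectors) for dim in zip(*vectors)]
    return node.embedding


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def route(root, query, beam=1):
    """Descend from the root, keeping only the `beam` best-matching children at each level."""
    frontier, leaves = [root], []
    while frontier:
        next_frontier = []
        for node in frontier:
            if not node.children:
                leaves.append(node.name)
                continue
            ranked = sorted(node.children, key=lambda c: cosine(c.embedding, query), reverse=True)
            next_frontier.extend(ranked[:beam])
        frontier = next_frontier
    return leaves


# Toy two-section document with made-up 2-D leaf embeddings.
section_a = Node("section_a", children=[Node("a1", [1.0, 0.0]), Node("a2", [0.8, 0.2])])
section_b = Node("section_b", children=[Node("b1", [0.0, 1.0]), Node("b2", [0.1, 0.9])])
root = Node("root", children=[section_a, section_b])
aggregate(root)
selected = route(root, query=[0.0, 1.0])
```

With a beam of one, the query never visits `section_a` at all, which is the pruning behavior the abstract attributes to coarse-to-fine routing; a wider beam trades some of that saving for recall.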

2605.24919 2026-05-26 cs.CL 版本更新

MultiHaluDet: Multilingual Hallucination Detection via LLM Hidden State Probing

MultiHaluDet: 通过LLM隐藏状态探测实现多语言幻觉检测

Riasad Alvi, Nurul Labib Sayeedi, Md. Faiyaz Abdullah Sayeedi

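Affiliations: United International University; BRAC University

AI summary: Proposes the MultiHaluDet framework, which probes the full hidden-state trajectories of frozen LLMs with a hybrid architecture combining multi-scale attention and self-attention pooling, plus a calibrated ensemble of classical classifiers, to detect hallucinations across languages. It reaches 98.55% AUROC on English benchmarks and generalizes well to high-, medium-, and low-resource languages.

Comments MeLLM @ ACL 2026
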
AI中文摘要

大型语言模型（LLM）中的幻觉是其可靠部署的关键障碍，这一漏洞在非英语和资源受限的环境中尤为严重。现有的依赖输出置信度启发式或单层内部表示的检测方法，往往无法捕捉跨语言的深层、复杂事实不一致性。为此，我们引入了MultiHaluDet，一种新颖的三阶段堆叠框架，通过探测冻结LLM的全隐藏状态轨迹来检测多语言幻觉，无需特定语言的微调。我们的方法提取跨多个层的序列特征，并通过使用多尺度注意力和自注意力池化的混合架构进行处理。通过生成折叠外嵌入并输入到校准的经典分类器集成中，MultiHaluDet捕捉了事实不一致性的细粒度和粗粒度模式。大量实验表明，我们的框架在Mistral-7B和LLaMA2-7B架构上，在英语HaluEval和TriviaQA基准测试中达到了高达98.55% AUROC的最先进检测性能。关键的是，我们严格评估了框架在高资源（法语）、中资源（孟加拉语）和低资源（阿姆哈拉语）语言上的跨语言泛化能力。MultiHaluDet展现出卓越的表示鲁棒性，始终优于基线，并成功地将幻觉检测能力迁移到类型多样的语言层级中。

英文摘要

Hallucinations in Large Language Models (LLMs) represent a critical barrier to their reliable deployment, a vulnerability heavily exacerbated in non-English and resource-constrained contexts. Existing detection approaches that rely on output confidence heuristics or single-layer internal representations frequently fail to capture deep, complex factual inconsistencies across diverse languages. To address this, we introduce MultiHaluDet, a novel three-stage stacking framework that detects multilingual hallucinations by probing the full hidden state trajectories of frozen LLMs without requiring language-specific fine-tuning. Our method extracts sequential features across multiple layers and processes them via a hybrid architecture using multi-scale attention and self-attention pooling. By generating out-of-fold embeddings that feed into a calibrated classical classifier ensemble, MultiHaluDet captures both fine-grained and coarse-grained patterns of factual inconsistency. Extensive experiments demonstrate that our framework achieves state-of-the-art detection performance, reaching up to 98.55% AUROC on the English HaluEval and TriviaQA benchmarks using Mistral-7B and LLaMA2-7B architectures. Crucially, we rigorously evaluate our framework's cross-lingual generalization across high (French), medium (Bangla), and low-resource (Amharic) languages. MultiHaluDet demonstrates exceptional representational robustness, consistently outperforming baselines and successfully transferring hallucination detection capabilities across typologically diverse linguistic tiers.

2605.24907 2026-05-26 cs.CL 版本更新

Overview of the PsyDefDetect Shared Task at BioNLP 2026: Detecting Levels of Psychological Defense Mechanisms in Supportive Conversations

PsyDefDetect 共享任务概述：在支持性对话中检测心理防御机制水平

Hongbin Na, Zimu Wang, Zhaoming Chen, Yining Hua, Rena Gao, Kailai Yang, Ling Chen, Wei Wang, Shaoxiong Ji, John Torous, Sophia Ananiadou

Affiliations: University of Technology Sydney; Xi’an Jiaotong-Liverpool University; University of Utah; Harvard University; The University of Melbourne; The University of Manchester; ELLIS Institute Finland; University of Turku

AI summary: Introduces the PsyDefDetect shared task, co-located with BioNLP@ACL 2026. Grounded in the clinically validated DMRS framework, the task asks systems to classify help-seeker utterances into nine categories; the best system reached a macro F1 of 0.420, leaving room for improvement.

AI中文摘要

我们介绍了 PsyDefDetect，这是一个与 BioNLP@ACL 2026 合办的关于在情感支持对话中检测心理防御机制水平的共享任务。该任务基于临床验证的防御机制评定量表（DMRS）框架，要求系统根据给定的前面对话上下文，将目标求助者话语分类为九个类别之一：七个层次的 DMRS 水平加上两个辅助标签。参与者使用了 PsyDefConv，这是一个新发布的语料库，包含 200 个对话和 2336 条求助者话语，在 DMRS 下进行了标注，并具有较高的一致性。该任务在 CodaBench 上吸引了 172 名参与者，提交了 563 份结果，其中 21 个团队正式注册了最终排名。最佳系统实现了 0.420 的宏 F1 分数，显著超过了数据集论文中报告的最强微调基线，但仍留有明显的改进空间。我们的分析强调了（i）过度预测多数类高适应水平的持续趋势，（ii）准确率和宏 F1 之间的差距扩大，揭示了类别不平衡敏感性，以及（iii）理论感知和基于 LLM 的方法在细粒度防御功能分类中的价值。我们发布了所有任务材料，并邀请社区继续在这个临床心理学与自然语言处理的新交叉领域开展工作。

英文摘要

We present an overview of PsyDefDetect, the shared task on detecting levels of psychological defense mechanisms in emotional support dialogues, co-located with BioNLP@ACL 2026. Grounded in the clinically validated Defense Mechanism Rating Scales (DMRS) framework, the task asks systems to classify a target seeker utterance, given its preceding dialogue context, into one of nine categories: seven hierarchical DMRS levels plus two auxiliary labels. Participants worked on PsyDefConv, a newly released corpus of 200 dialogues and 2336 help-seeker utterances annotated under DMRS with substantial inter-annotator agreement. The task attracted 172 participants on CodaBench who produced 563 submissions, with 21 teams officially registering their results for the final ranking. The best system achieved a macro F1-score of 0.420, surpassing the strongest fine-tuned baseline reported in the dataset paper by a notable margin, yet leaving clear headroom. Our analysis highlights (i) a persistent tendency to over-predict the majority High-Adaptive class, (ii) a widening gap between accuracy and macro-F1 that reveals class-imbalance sensitivity, and (iii) the value of theory-aware and LLM-based approaches for fine-grained defensive-function classification. We release all task materials and invite the community to continue work on this novel intersection of clinical psychology and NLP.

2605.24904 2026-05-26 cs.CL 版本更新

Quantifying the Impact of Translation Errors on Multilingual LLM Evaluation

量化翻译错误对多语言大语言模型评估的影响

Klaudia-Doris Thellmann, Bernhard Stadler, Michael Färber, Jens Lehmann

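Affiliations: TUD Dresden University of Technology and ScaDS.AI; InfAI e.V.; Amazon

AI summary: Studies how translation errors in machine-translated benchmarks affect the reliability of multilingual LLM evaluation, using automatic error-span detection and accuracy-drop analysis to relate translation errors to evaluation metrics.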

2605.24902 2026-05-26 cs.CL cs.AI cs.LG 版本更新

When Reasoning Hurts: Source-Aware Evaluation of Frontier LLMs for Clinical SOAP Note Generation

当推理有害：面向临床SOAP笔记生成的前沿LLM源感知评估

Faizan Faisal

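Affiliations: University of California, Davis

AI summary: Using a source-aware benchmark, evaluates reasoning-enabled LLMs on clinical SOAP note generation and finds that enabling reasoning lowers GPT-5.4's quality, while same-source RAG yields small, model-dependent gains.
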
AI中文摘要

推理增强型LLM在医学推理基准测试中表现强劲，但这些增益是否能迁移到结构化临床文档尚不清楚；我们通过一个跨OMI Health、ACI-Bench和PriMock57的源感知基准，利用临床对话生成SOAP笔记来研究这一问题。我们在一个2x2受控设计中评估GPT-5.4、DeepSeek-V4-Flash和Gemma-4-E4B，独立切换提供者原生推理和相同源检索增强生成（RAG）。输出使用七种自动指标以及两个参考感知的LLM评判者进行评估。两种评估方法一致认为，非推理的GPT-5.4配置达到最高整体质量，而DeepSeek-V4-Flash在推理增强配置中表现最佳。启用推理显著降低了GPT-5.4在所有三个数据集上的性能，而相同源RAG带来较小的、模型依赖的改进。总体而言，研究结果表明，不应假设更强的推理能力能改善对保真度敏感的SOAP笔记生成，而无需专门的、任务特定的评估。

英文摘要

Reasoning-enabled LLMs perform strongly on medical reasoning benchmarks, but it remains unclear whether these gains transfer to structured clinical documentation; we investigate this question using SOAP note generation from clinical dialogue in a source-aware benchmark spanning OMI Health, ACI-Bench, and PriMock57. We evaluate GPT-5.4, DeepSeek-V4-Flash, and Gemma-4-E4B in a controlled 2x2 design that independently toggles provider-native reasoning and same-source retrieval-augmented generation (RAG). Outputs are assessed using seven automatic metrics alongside two reference-aware LLM judges. Both evaluation approaches agree that a non-reasoning GPT-5.4 configuration achieves the highest overall quality, while DeepSeek-V4-Flash performs best among reasoning-enabled configurations. Enabling reasoning significantly degrades GPT-5.4 performance across all three datasets, whereas same-source RAG yields smaller, model-dependent improvements. Overall, the findings indicate that stronger reasoning capability should not be assumed to improve fidelity-sensitive SOAP note generation without dedicated, task-specific evaluation.

2605.24885 2026-05-26 cs.CL 版本更新

DTO: a Differentiable Training Objective for Effective Counterfactual Story Rewriting

DTO：一种用于有效反事实故事重写的可微分训练目标

Amelia Girard, Massimo Piccardi

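Affiliations: University of Technology Sydney

AI summary: Proposes a differentiable training objective (DTO) that, via end-to-end backpropagation, jointly optimizes fidelity to the reference rewrite and semantic consistency with the source narrative, addressing models' tendency to overlook subtle edits in counterfactual story rewriting.

Comments 11 pages, 2 figures
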
AI中文摘要

反事实故事重写是一项自然语言处理任务，要求更新现有故事以反映所选替代事件，同时保留所有未受影响的故事情节元素和整体连贯性。尽管大型语言模型最近在此任务上取得了显著进展，但由于所需修改通常规模很小且高度局部化，该任务仍然具有挑战性。因此，以传统方式使用最大似然训练目标训练的模型往往忽略这些细微之处。同时，基于强化学习的更复杂训练方法以缓慢和难以设置而闻名。基于这些原因，本文提出了一种新颖的可微分训练目标（DTO），直接优化所需的反事实改进。在我们的方法中，通过端到端反向传播微调一个Transformer模型，针对一个完全可微的损失函数，该函数同时奖励（i）对参考重写的忠实度和（ii）与源叙事的语义一致性。在TimeTravel和ART数据集上的实证评估表明，所提出的DTO方法能够超越最大似然基线和基于偏好的方法，并在所有评估指标上与两个当代大型语言模型竞争。这些发现证实了任务特定的可微分目标对于细微、受控的文本生成任务的有效性。

英文摘要

Counterfactual story rewriting is a natural language processing task that requires updating an existing story to reflect a chosen alternative event, yet preserving all the unaffected storyline elements and overall coherence. While large language models have recently made remarkable progress on this task, it still remains challenging since the required modifications are typically very small in size and highly localized. As a consequence, models trained in a conventional manner with the maximum-likelihood training objective tend to overlook these nuances. At the same time, more sophisticated training approaches based on reinforcement learning are notoriously slow and difficult to set up. For these reasons, our paper proposes a novel, differentiable training objective (DTO) that directly optimizes for the requisite counterfactual improvements. In our approach, a transformer model is fine-tuned via end-to-end backpropagation against a fully differentiable loss function that jointly rewards (i) fidelity to the reference rewrite and (ii) semantic consistency with the source narrative. The empirical evaluation on the TimeTravel and ART datasets shows that the proposed DTO approach has been able to surpass a maximum-likelihood baseline and a preference-based approach, and perform competitively against two contemporary large language models in all evaluation metrics. These findings substantiate the effectiveness of task-specific differentiable objectives for nuanced, controlled text-generation tasks.

2605.24873 2026-05-26 cs.CL cs.AI cs.LG 版本更新

Towards a Universal Causal Reasoner

迈向通用因果推理器

Qirun Dai, Xiao Liu, Jiawei Zhang, Dylan Zhang, Hao Peng, Chenhao Tan

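Affiliations: The University of Chicago; University of Illinois Urbana-Champaign

AI summary: Proposes UniCo, a data generation framework that covers 18 query types across Pearl's Causal Ladder and converts symbolic examples into code and natural-language forms; supervised fine-tuning on its data substantially improves LLMs' causal reasoning and reasoning faithfulness.
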
AI中文摘要

尽管因果推理的重要性不言而喻，但训练LLM进行因果推理仍未被充分探索。现有的数据工作大多集中在针对因果关系的特定方面对LLM进行基准测试，这使得它们不太适合训练可泛化的因果推理器。为了解决这个问题，我们提出了UniCo，一个数据生成框架，它既(1)涵盖了Pearl因果阶梯中的18种因果查询类型，又(2)将原生符号示例转化为代码和自然语言形式，以模拟因果术语未明确指定的真实世界用例。为确保数据质量，UniCo用精确的因果推理来支撑答案，并过滤掉存在推理捷径的案例。通过使用66.6K个UniCo生成的实例进行监督微调，Qwen3-4B、Qwen3-8B和Olmo-3-7B-Instruct在所有18种分布内查询类型上平均提升了22.9%，在训练分布之外的7个已建立的因果基准上，相比最先进的因果数据生成框架提升了8.1%。更重要的是，在真实世界的医学理解、法律决策和表格推理中，UniCo训练的模型始终展现出更忠实的推理轨迹，在忠实度指标上平均超过基础模型20.2%。这些结果表明，以因果为中心的训练不仅增强了因果推理能力，还赋予了LLM在一般推理任务中的因果思维。

英文摘要

Despite the importance of causal reasoning, training LLMs to reason causally remains underexplored. Existing data efforts mostly focus on benchmarking LLMs on specific aspects of causality, making them less suitable for training generalizable causal reasoners. To address this, we propose UniCo, a data generation framework that both (1) addresses 18 causal query types across Pearl's Causal Ladder and (2) translates natively symbolic examples into code and natural language forms to simulate real-world use cases where causal terms are not explicitly specified. To ensure data quality, UniCo grounds answers with exact causal inference and filters cases with reasoning shortcuts. Upon supervised finetuning with 66.6K UniCo-generated instances, Qwen3-4B, Qwen3-8B and Olmo-3-7B-Instruct achieve an average of 22.9% improvements across all 18 in-distribution query types, and 8.1% over state-of-the-art causal data generation frameworks on 7 established causal benchmarks outside the training distribution. More importantly, in real-world medical understanding, legal decision, and tabular reasoning, UniCo-trained models consistently display more faithful reasoning traces, outperforming the base models by an average of 20.2% in faithfulness metrics. These suggest that causality-centered training not only strengthens causal reasoning, but also equips LLMs with a causal mindset in general reasoning tasks.

2605.24869 2026-05-26 cs.CL 版本更新

DUEL: Adversarial Self-Play for Multimodal Reasoning

Lin Qiu, Hanqing Zeng, Yao Liu, Bingjun Sun, Guangdeng Liao, Ji Liu

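Affiliations: Meta AI

AI summary: Proposes the DUEL framework, which derives supervision from adversarial self-play between policies initialized from the same pretrained VLM and adds a length-normalized log-likelihood reward, improving visual reasoning and discrimination without human annotation.
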
AI中文摘要

强化学习已成为提升视觉语言模型推理能力的有效范式。然而，基于RL的优化通常依赖于昂贵且难以扩展的高质量标注。现有的无监督替代方案可能因弱视觉基础和缺乏可靠验证信号而偏向有偏解。我们提出一个自我进化的训练后框架DUEL，其中监督信号源于从同一预训练VLM初始化的两个策略之间的对抗性交互。挑战者生成一个基于图像的真实声明及其最小扰动的难负样本，而求解者验证两个声明与图像的一致性，从而在近邻语义下鼓励细粒度视觉判别。为了稳定优化，我们引入长度归一化的对数似然奖励，在二元结果监督之外保留信息性优化信号，并在稀疏反馈下提高学习稳定性。实验表明，DUEL在无需额外人工标注、外部奖励模型或图像编辑工具的情况下，持续提升视觉推理和鲁棒判别能力。

英文摘要

Reinforcement learning (RL) has emerged as an effective paradigm for improving the reasoning capability of vision-language models (VLMs). However, RL-based optimization typically depends on costly high-quality annotations that are difficult to scale. Existing unsupervised alternatives may drift toward biased solutions due to weak visual grounding and the lack of reliable verification signals. We propose a self-evolving post-training framework, DUEL, where supervision emerges from adversarial interactions between two policies initialized from the same pretrained VLM. A Challenger generates an image-grounded true claim together with a minimally perturbed hard-negative counterpart, while a Solver verifies both claims against the image, encouraging fine-grained visual discrimination under near-neighbor semantics. To stabilize optimization, we introduce a length-normalized log-likelihood reward that preserves informative optimization signals beyond binary outcome supervision and improves learning stability under sparse feedback. Experiments show that DUEL consistently improves visual reasoning and robust discrimination without additional human annotations, external reward models, or image editing tools.
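
The length normalization mentioned above can be illustrated with a small helper. This is a sketch under assumptions: the abstract does not give the exact reward formula, so the plain per-token average below and the function name are illustrative only.

```python
def length_normalized_loglik(token_logprobs):
    """Average per-token log-probability of a response.

    Dividing by the length keeps a long response from being penalized
    merely for containing more tokens, which the raw sum would do.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return sum(token_logprobs) / len(token_logprobs)
```

Two responses with the same per-token confidence receive the same score regardless of length, whereas their summed log-likelihoods would differ.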

2605.24793 2026-05-26 cs.CL 版本更新

Beyond the Target: From Imitation to Collaboration in Speculative Decoding

超越目标：从模仿到协作的推测解码

Jinze Li, Yixing Xu, Guanchen Li, Jinfeng Xu, Shuo Yang, Yang Zhang, Xuanwu Yin, Dong Li, Edith C. H. Ngai, Emad Barsoum

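Affiliations: Advanced Micro Devices, Inc.; The University of Hong Kong; University of North Texas

AI summary: Proposes Collaborative Speculative Decoding (CoSpec), which trains an arbitration policy with reinforcement learning to choose between draft and target tokens during speculative decoding, surpassing target-only performance while retaining the speedup.

Comments under review
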
AI中文摘要

推测解码（SPD）通过让较小的草稿模型提出多个未来令牌，并由较大的目标模型并行验证，从而加速大型语言模型（LLM）推理。主流的SPD范式将目标模型视为唯一可靠的教师，仅当草稿令牌与目标预测完全匹配时才接受它。这种设计隐含地假设目标在每个位置都是更好的选择。在实践中，这一假设并不成立。尽管草稿模型整体上较弱，但在令牌级别上并非均匀地劣于目标。在草稿与目标不一致的有意义的情况下，草稿的选择往往能导致正确的最终答案。受此启发，我们引入了协作推测解码（CoSpec），这是SPD的一种泛化，不再将目标模型视为唯一的令牌级权威。CoSpec通过强化学习训练一个仲裁策略，以决定是接受来自草稿还是目标模型的令牌，在不匹配时选择性地接受草稿令牌，如果这样做可能产生正确的最终答案。实验结果表明，CoSpec在保持显著加速的同时，超越了仅使用目标模型的性能。通过将重点从模仿转向协作，CoSpec为推测解码提供了新的视角。

英文摘要

Speculative decoding (SPD) accelerates large language model (LLM) inference by letting a smaller draft model propose multiple future tokens that are verified in parallel by a larger target model. The dominant SPD paradigm treats the target model as the sole reliable teacher, accepting a draft token only when it exactly matches the target prediction. This design implicitly assumes that the target is always the better choice at every position. In practice, this assumption does not hold. Although the draft is the weaker model overall, it is not uniformly inferior at the token level. In a meaningful fraction of cases where draft and target disagree, the draft's choice is the one that leads to the correct final answer. Inspired by this, we introduce Collaborative Speculative Decoding (CoSpec), a generalization of SPD that no longer treats the target model as the sole token-level authority. CoSpec trains an arbitration policy via reinforcement learning to decide whether to accept tokens from the draft or target model, selectively accepting draft tokens at mismatches when doing so is likely to yield a correct final answer. Experimental results show that CoSpec maintains substantial speedups while surpassing target-only performance. By shifting the emphasis from imitation to collaboration, CoSpec suggests a new perspective on speculative decoding.
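
The contrast between exact-match verification and arbitration can be sketched for the greedy case. This is an illustration under assumptions, not CoSpec itself: `prefer_draft` is a hypothetical callable standing in for the learned arbitration policy, and continuing verification after an accepted draft token is a design choice made here, since the abstract does not describe that detail.

```python
def verify_exact_match(draft_tokens, target_tokens):
    """Standard greedy rule: keep the agreeing prefix, then take the
    target's token at the first mismatch and stop."""
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)
            break
    return accepted

def verify_with_arbiter(draft_tokens, target_tokens, prefer_draft):
    """Collaborative variant (sketch): at a mismatch, a policy decides whether
    the draft token is kept. `target_tokens[i]` is assumed to be the target's
    prediction given the draft prefix, so keeping the draft token lets
    verification continue; choosing the target token ends the round."""
    accepted = []
    for i, (d, t) in enumerate(zip(draft_tokens, target_tokens)):
        if d == t or prefer_draft(i, d, t):
            accepted.append(d)
        else:
            accepted.append(t)
            break
    return accepted
```

With an arbiter that never prefers the draft, the second function reduces to the exact-match rule, which is one way to see the collaborative scheme as a generalization.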

2605.24764 2026-05-26 cs.IR cs.AI cs.CL 版本更新

StepGap: A Hybrid NLI-LLM Checker for Step-Level Evidence Gap Detection in Multi-Hop QA

Yuelyu Ji, Zhuochun Li, Hui Ji, Daqing He

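Affiliations: School of Computing and Information, University of Pittsburgh

AI summary: Proposes StepGap, a hybrid NLI-LLM decision tree that detects step-level evidence gaps in multi-hop QA and emits one of three typed labels; it reaches sF1 = 72.0 on 82 questions and, used as a GRPO process reward, improves the model's Exact Match.
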
AI中文摘要

我们提出StepGap，一种混合NLI-LLM决策树，用于检测多跳问答中的步骤级证据缺口，并输出三类标签：矛盾声明（CC）、无关证据（IE）或缺失桥梁（MB），每个标签对应具体的修复动作。在82个多跳问题（181个标注步骤，κ = 0.704）上，StepGap达到sF1 = 72.0，处于纯LLM基线（70.1）的bootstrap置信区间内，但具有更可分解的结构：移除StepGap的每个阶段都会降低F1，而四个纯LLM移除中有三个提高F1，这是竞争性错误抵消的迹象，即内部阶段相互掩盖错误。我们进一步揭示了Q-F1陷阱：问题级F1被标记每一步的检查器机械地膨胀，使得步骤级F1成为必要的诊断指标。作为带类型的GRPO过程奖励，StepGap将Qwen2.5-7B-Instruct的精确匹配率从32.1 ± 0.3提升至35.4 ± 0.9（三个种子），单次运行比较显示，与匹配的Search-R1 GRPO复现相比，平均EM增益为+5.6。

英文摘要

We present StepGap, a hybrid NLI-LLM decision tree that detects step-level evidence gaps in multi-hop QA and emits one of three typed labels: Contradicted Claim (CC), Irrelevant Evidence (IE), or Missing Bridge (MB), each tied to a concrete repair action. On 82 multi-hop questions (181 annotated steps, κ = 0.704), StepGap reaches sF1 = 72.0, within the bootstrap confidence interval of an LLM-only baseline (70.1) but with a more decomposable structure: every StepGap stage hurts F1 when removed, while three of four LLM-only removals improve F1, a sign of competing-error cancellation, where internal stages mask each other's errors. We further expose a Q-F1 trap: question-level F1 is mechanically inflated by checkers that flag every step, making step-level F1 the necessary diagnostic. Used as a typed GRPO process reward, StepGap improves Qwen2.5-7B-Instruct Exact Match from 32.1 ± 0.3 to 35.4 ± 0.9 across three seeds, with the single-run comparison showing a +5.6 Avg EM gain over the matched Search-R1 GRPO reproduction.

2605.24721 2026-05-26 cs.CL 版本更新

ROC Analysis for Evaluating Translation Quality Estimation Systems

ROC分析用于评估翻译质量估计系统

Evelyn Y. Garland, Carola F. Berger

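Affiliations: Acta-Transphere; CFB Scientific Translations LLC

AI summary: Proposes receiver operating characteristic (ROC) analysis for evaluating automatic translation quality estimation (QE) systems; the method agrees with existing approaches and offers actionable performance insights for business decisions.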

Comments 16 pages, 8 PNG figures, 3 tables, uses acl.sty

2605.24719 2026-05-26 cs.CL cs.AI 版本更新

World-State Transformations for Neuro-symbolic Interactive Storytelling

世界状态转换用于神经符号交互式故事讲述

Santiago Góngora, Luis Chiruzzo, Gonzalo Méndez, Pablo Gervás

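AI summary: Explores using LLMs within a neuro-symbolic architecture to predict world-state transformations in a rule-based system, addressing the story-coherence problems of purely LLM-based approaches; experiments suggest the approach maintains world-state consistency and encourages creative player input.

Comments To be presented at the 17th International Conference on Computational Creativity (ICCC'26)
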
AI中文摘要

大型语言模型（LLM）改变了处理自由文本用户输入的交互式故事讲述系统的可能性。然而，随着这类系统越来越多地被构建，越来越多的证据表明，仅依赖它们会出现故事连贯性问题。最近的研究表明，LLM可以有效地预测基于规则的交互式故事讲述系统中的状态变化，触发预编程的世界状态转换。在本文中，我们进行了一项探索性评估，研究这种转换是否可以作为玩家表达的催化剂，同时旨在解决纯LLM方法典型的连贯性问题。基于神经符号架构，我们使用开源模型（Llama 3 70B）和闭源模型（Gemini 1.5 Flash）进行了实验，测试以英语和西班牙语进行。八名参与者玩了两个场景，这些场景经过精心设计以评估不同的评估目标。我们的观察表明，转换提供了一种保持世界状态一致性的方式，同时鼓励玩家通过他们的书面输入进行创造性互动。

英文摘要

Large Language Models (LLMs) have changed the possibilities of Interactive Storytelling systems that process free-text user input. However, as more of these systems are built, evidence continues to mount regarding the story coherence problems that arise when relying solely on them. Recent research suggests that LLMs can effectively predict state changes within rule-based Interactive Storytelling systems, triggering pre-programmed world-state transformations. In this paper, we conduct an exploratory evaluation of whether such transformations can serve as a catalyst for player expression while aiming to address the incoherence issues typical of purely LLM-based approaches. Building upon a neuro-symbolic architecture, we conducted experiments using an open-source model (Llama 3 70B) and a closed-source model (Gemini 1.5 Flash), with testing conducted in both English and Spanish. Eight participants played two scenarios, carefully designed to assess different evaluation objectives. Our observations suggest that transformations offer a way to maintain world-state consistency while encouraging players to interact creatively through their written inputs.

2605.24718 2026-05-26 cs.CL 版本更新

The Tokenizer Tax Across 25 European Languages: Domain Invariance, Cross-Lingual Few-Shot Effects, and the Ukrainian Penalty

25种欧洲语言的Tokenizer税：领域不变性、跨语言少样本效应与乌克兰语惩罚

Volodymyr Ovcharov

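Affiliations: LEX AI Platform; legal.org.ua; Kyiv, Ukraine

AI summary: Measures tokenizer fertility for 10 foundation models across 25 European languages, showing how cost varies from English to other languages and that Ukrainian pays an extra cost attributed to underrepresentation in pre-training data.

Comments 16 pages, 3 figures, 8 tables. Dataset: https://huggingface.co/datasets/overthelex/tokenizer-fertility-map
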
AI中文摘要

Tokenizer生育率（每词token数）对非英语NLP施加了隐藏成本。我们在平行文本上测量了10个基础模型在25种欧洲语言上的生育率，生成了首个受控的欧洲tokenizer税地图。该税从英语（1.2 tokens/词）到希腊语/马耳他语（约3.1）跨度达2.5倍，遵循清晰层次：罗曼语族（1.5-1.7）、日耳曼语族（1.7-1.9）、斯拉夫语族（2.2-2.5）、乌拉尔语系/波罗的语族（2.7-3.0）。乌克兰语（2.7）比同源斯拉夫语言多支付15-18%，反映了其在预训练数据中的代表性不足。生育率排名在三种文本语域中具有领域不变性（rho > 0.97）。子词分析表明，高生育率tokenizer会碎片化形态边界而非保留它们。对四种斯拉夫语言的跨语言少样本评估显示，少样本效应是模型固有的，而非语言依赖的。我们将所有测量结果作为公共数据集发布。

英文摘要

Tokenizer fertility, the number of tokens per word, imposes a hidden cost on non-English NLP. We measure fertility for ten foundation models across 25 European languages on parallel text, producing the first controlled tokenizer tax map for the continent. The tax spans 2.5x from English (1.2 tokens/word) to Greek/Maltese (~3.1), following a clear hierarchy: Romance (1.5-1.7), Germanic (1.7-1.9), Slavic (2.2-2.5), Uralic/Baltic (2.7-3.0). Ukrainian (2.7) pays 15-18% more than cognate Slavic languages, reflecting underrepresentation in pre-training data. Fertility rankings are domain-invariant across three text registers (rho > 0.97). A subword analysis reveals that high-fertility tokenizers fragment morphological boundaries rather than preserving them. Cross-lingual few-shot evaluation on four Slavic languages shows that few-shot effects are model-intrinsic, not language-dependent. We release all measurements as a public dataset.
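
Fertility as defined above is a simple ratio, which the following sketch makes concrete. The whitespace word split and the fixed-width `toy_tokenize` function are assumptions for illustration; they are not the tokenizers or the word segmentation used in the paper.

```python
def fertility(text, tokenize):
    """Tokens per whitespace-delimited word for the given tokenizer."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    return len(tokenize(text)) / len(words)

def toy_tokenize(text, piece_len=4):
    """Hypothetical tokenizer: cut each word into chunks of up to `piece_len` characters."""
    return [w[i:i + piece_len] for w in text.split() for i in range(0, len(w), piece_len)]
```

Short words that fit in one piece give a fertility of 1.0, while long, morphologically rich words are split into several pieces and push the ratio up, which is the effect the abstract calls the tokenizer tax.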

2605.24703 2026-05-26 cs.CL cs.AI 版本更新

TS-Skill: A Benchmark for Evaluating Analytical Skills in Time-Series Question Answering

TS-Skill: 用于评估时间序列问答中分析技能的基准

Liying Han, Kang Yang, Oliver Wang, Jason Wu, Pengrui Quan, Gaofeng Dong, Ozan Baris Mulayim, Sizhe Ma, Yuyang Yuan, Dezhi Hong, Mario Berges, Mani Srivastava

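Affiliations: University of California, Los Angeles; Samsung Research America; Carnegie Mellon University; Microsoft; Amazon

AI summary: Proposes the TS-Skill benchmark, which diagnoses models' signal-level capabilities in time-series question answering through three composable analytical skills (temporal scale selection, temporal localization, and cross-interval integration), and develops the SKEvol framework to build the benchmark automatically; experiments reveal capability gaps across the skills.
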
AI中文摘要

大型语言模型（LLMs）和时间序列语言模型（TSLMs）越来越多地应用于时间序列问答（TSQA）。与纯文本问答不同，TSQA要求模型将答案基于时间信号，这些信号的模式可能出现在不同尺度、特定时间位置或跨分离区间。然而，现有的基准通常按任务类型或高层次推理类别组织，难以诊断驱动模型性能的底层信号级能力。我们引入TS-Skill，一个用于评估TSQA中三种可组合分析技能的控制基准：时间尺度选择（SK1）、时间定位（SK2）和跨区间整合（SK3）。TS-Skill提供时间戳感知的问题、广泛的领域覆盖以及人工验证的问答质量。为了大规模构建基准，我们开发了SKEvol，一个技能引导的智能体框架，结合了领域感知的时间序列种子生成、技能控制的问题生成、元数据和代码辅助的答案构建、多阶段信号接地验证以及人在回路中的策展。在十个最先进的LLMs和TSLMs上的实验揭示了SK1-SK3之间显著且不均匀的能力差距。特别是，SK3对非智能体模型始终具有挑战性，而工具增强的智能体在独立的SK3上显示出选择性优势。这些发现表明，技能级评估可以揭示被聚合TSQA分数掩盖的时间推理失败。

英文摘要

Large language models (LLMs) and time-series language models (TSLMs) are increasingly applied to time-series question answering (TSQA). Unlike text-only QA, TSQA requires models to ground answers in temporal signals whose patterns may occur at different scales, specific time locations, or across separated intervals. However, existing benchmarks are typically organized by task types or high-level reasoning categories, making it difficult to diagnose the underlying signal-level capabilities driving model performance. We introduce TS-Skill, a controlled benchmark for evaluating three composable analytical skills in TSQA: temporal scale selection (SK1), temporal localization (SK2), and cross-interval integration (SK3). TS-Skill provides timestamp-aware questions, broad domain coverage, and human-validated QA quality. To construct the benchmark at scale, we develop SKEvol, a skill-guided agentic framework that combines domain-aware time-series seed generation, skill-controlled question generation, metadata- and code-assisted answer construction, multi-phase signal-grounded verification, and human-in-the-loop curation. Experiments on ten state-of-the-art LLMs and TSLMs reveal substantial and uneven capability gaps across SK1-SK3. In particular, SK3 remains consistently challenging for non-agent models, whereas tool-augmented agents show a selective advantage on standalone SK3. These findings demonstrate that skill-level evaluation can uncover temporal reasoning failures that are obscured by aggregate TSQA scores.

2605.24697 2026-05-26 cs.CL cs.AI 版本更新

The Path Matters: Learning a Token-Commitment Policy for Diffusion Language Models

路径很重要：学习扩散语言模型的令牌提交策略

Bohang Sun, Max Zhu, Francesco Caso, Jindong Gu, Junchi Yu, Philip Torr, Pietro Liò, Jialin Yu

Affiliations: Department of Computer Science and Technology, University of Cambridge; Department of Engineering Science, University of Oxford

AI summary: Proposes TraceLock, a lightweight plug-in controller that learns a reusable trace-state policy for token-commitment decisions in diffusion language models, improving the quality-step trade-off.

AI中文摘要

扩散大语言模型通过并行细化多个令牌位置有望实现更快的生成，但这种并行性引入了一个隐藏的控制问题：每一步中哪些提议的令牌应被转移到部分解码的序列中？我们将此决策称为令牌提交。现有的冻结生成器解码器主要依赖于手工设计的置信度规则或特定块的接受过滤器。我们认为令牌提交可以学习为一种可复用的轨迹状态策略。我们引入了TraceLock，一种轻量级可插拔控制器，为冻结的扩散语言模型实例化此策略。由于无法获得 oracle 提交时间，TraceLock 从未来稳定性中推导出自我监督：在解码步骤 t，如果提议的令牌在完整解码轨迹完成后与位置 i 的最终令牌匹配，则将其标记为稳定。控制器对可变长度的轨迹状态进行评分，并决定哪些活跃的令牌提议应被提交到部分解码的序列中。一旦为给定的冻结主干训练完成，该控制器可以在局部窗口宽度、生成长度和步数预算下部署，无需重新训练或按设置校准。在问答、数学推理和代码生成上的实验表明，TraceLock 在质量-步数权衡上优于启发式和学习的基线，在跨设置部署下尤其稳定。诊断分析表明，其决策不能简化为标量置信度，这表明冻结的扩散语言模型暴露了一个超越基于置信度解码的可学习的提交轨迹空间。代码可在 https://github.com/BobSun98/TraceLock 获取。

英文摘要

Diffusion large language models promise faster generation by refining many token positions in parallel, but this parallelism introduces a hidden control problem: which proposed tokens should be transferred into the partially decoded sequence at each step? We refer to this decision as token commitment. Existing frozen-generator decoders largely rely on hand-designed confidence rules or block-specific acceptance filters. We argue that token commitment can instead be learned as a reusable trace-state policy. We introduce TraceLock, a lightweight plug-in controller that instantiates this policy for a frozen diffusion language model. Since oracle commitment times are unavailable, TraceLock derives self-supervision from future stability: at decoding step t, a proposed token for position i is labeled stable if it matches the final token at position i after the full decoding trace completes. The controller scores variable-length trace states and decides which active token proposals should be committed to the partially decoded sequence. Once trained for a given frozen backbone, the controller can be deployed across local-window widths, generation lengths, and step budgets without retraining or per-setting calibration. Experiments on question answering, mathematical reasoning, and code generation show that TraceLock improves the quality-step tradeoff over heuristic and learned baselines, with particularly stable behavior under cross-setting deployment. Diagnostic analyses show that its decisions are not reducible to scalar confidence, suggesting that frozen diffusion language models expose a learnable space of commitment trajectories beyond confidence-based decoding. Code is available at https://github.com/BobSun98/TraceLock.
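
The future-stability labeling rule stated above can be written down directly. In this sketch the trace layout (`trace[t][i]` as the token proposed for position i at step t) and the use of the last step as the final sequence are assumptions about representation, not details taken from the paper.

```python
def stability_labels(trace):
    """Label each proposal by whether it survives to the end of decoding.

    trace[t][i] is the token proposed for position i at decoding step t;
    the last step is taken to be the final decoded sequence.
    Returns labels[t][i] = True when the proposal matches the final token at i.
    """
    final = trace[-1]
    return [[tok == final[i] for i, tok in enumerate(step)] for step in trace]
```

Labels of this kind need no oracle: they are read off a completed decoding trace and can then supervise a controller that predicts, at step t, which proposals are safe to commit.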

2605.24693 2026-05-26 cs.CL 版本更新

CP-Agent: A Calibrated Risk-Controlled Agent for Feedback-Driven Competitive Programming

CP-Agent: 一种用于反馈驱动竞赛编程的校准风险控制智能体

Peisong Wang, Bowen Liu, Zehua Li, Yuyao Wang, Zhiwei Ma, Yuhan Li, Jia Li

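Affiliations: The Hong Kong University of Science and Technology (Guangzhou)

AI summary: Proposes CP-Agent, which models feedback-driven solving as a calibrated stopped process and combines dual-granularity verification, test augmentation, and experience-driven self-evolution, substantially improving competitive-programming performance without parameter updates.

Comments Code: https://github.com/NineAbyss/CP-Agent
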
AI中文摘要

大型语言模型在竞赛级编程中仍存在困难，而许多智能体解决方案依赖于大量的推理时采样或昂贵的多阶段后训练。我们研究了执行反馈何时能可靠地帮助LLM竞赛编程求解器，以及哪些机制支配着性能提升。我们将反馈驱动求解建模为校准停止过程，并识别出三个量：虚假接纳风险、针对不良程序的程序级证据以及活跃状态成功风险。在保留的轨迹校准和从预先声明的有限控制器清单中选择下，所得的结构性证书在虚假接纳之前为干净成功概率提供了下界。我们针对这些量实例化了机制：双重粒度验证、测试增强和经验驱动自我进化，从而得到CP-Agent。在不更新任何参数的情况下，CP-Agent在LiveCodeBench Pro上将Pass@1从25.8%提升至48.5%，并在ICPC-Eval上将Refine@5提高了11.0%。在三个LLM骨干网络上，CP-Agent处于成本-准确率效率前沿，消融实验表明每个组件主要影响其对应的证书量。

英文摘要

Large language models still struggle with contest-level programming, while many agentic remedies rely on massive inference-time sampling or expensive multi-stage post-training. We study when execution feedback reliably helps an LLM CP solver and which mechanisms govern the gains. We model feedback-driven solving as a calibrated stopped process and identify three quantities: false-admission risk, program-level evidence against bad programs, and the active-state success hazard. Under held-out trace calibration and selection from a pre-declared finite controller manifest, the resulting structural certificate lower-bounds the clean success probability before false admission. We instantiate mechanisms targeting these quantities as Dual-Granularity Verification, Test Augmentation, and Experience-Driven Self-Evolving, yielding CP-Agent. Without updating any parameters, CP-Agent raises Pass@1 from 25.8% to 48.5% on LiveCodeBench Pro and improves Refine@5 by 11.0% on ICPC-Eval. Across three LLM backbones, CP-Agent lies on the cost-accuracy efficiency frontier, and ablations show that each component primarily affects its corresponding certificate quantity.

2605.24661 2026-05-26 cs.AI cs.CL 版本更新

Measuring Reasoning Quality in LLMs: A Multi-Dimensional Behavioral Framework

衡量LLM中的推理质量：一个多维行为框架

Ali Şenol, Garima Agrawal, Huan Liu

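Affiliations: Department of Computer Engineering, Tarsus University; School of Computing and Augmented Intelligence (SCAI), Arizona State University (ASU); HumaConn AI Consulting

AI summary: Proposes a behavior-based multi-dimensional framework that evaluates LLM reasoning quality along six dimensions (correctness, consistency, robustness, logical coherence, efficiency, and stability), revealing behaviors that accuracy alone cannot show and supporting deployment decisions.
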
AI中文摘要

LLMs在复杂推理任务中取得了显著成功，但当前的评估方法主要依赖最终答案的正确性，对产生这些答案的底层推理过程提供的洞察有限。为弥补这一空白，本研究从行为角度提出了一个统一的多维框架来衡量LLMs的推理质量，操作化了六个理论驱动的维度：正确性（CQ）、一致性（CS）、鲁棒性（RS）、逻辑连贯性（LS）、效率（ES）和稳定性（SS）。在四个基准测试的975个条目上对七个LLMs进行的广泛实验表明，该框架揭示了仅靠准确率指标无法观察到的行为。值得注意的是，逻辑连贯性与正确性正交（r = -0.172，不显著），证实了正确答案可能源于不连贯的推理，而Claude-Haiku-4.5取得了最高的多维得分（Q_bal = 0.778）。此外，该框架暴露了关键的排名反转：DeepSeek-V3在准确率优先下排名第二，但在法律/合规权重下排名第五，这种反转是单一指标评估无法检测到的。判别效度证实11/15个维度对是独立的（|r| < 0.50），为将每个维度视为不同信号提供了心理测量学支持。该框架产生的维度概况直接支持三类部署决策：识别那些虽然最终答案正确但推理轨迹无法通过问责审计的模型（LS--CQ正交性）；防止仅基于准确率的基准测试导致的排名错误；以及确保没有单一指标默默替代框架捕获的六个独立信号。

英文摘要

LLMs have achieved remarkable success in complex reasoning tasks, yet current evaluation approaches predominantly rely on final-answer correctness, offering limited insight into the underlying reasoning processes that produce those answers. To address this gap, this study proposes a unified multi-dimensional framework for measuring reasoning quality in LLMs from a behavioral perspective, operationalizing six theoretically grounded dimensions: Correctness (CQ), Consistency (CS), Robustness (RS), Logical Coherence (LS), Efficiency (ES), and Stability (SS). Extensive experiments on seven LLMs across 975 items from four benchmarks demonstrate that the framework reveals behaviors invisible to accuracy-only metrics. Notably, logical coherence is orthogonal to correctness (r = -0.172, ns), confirming that correct answers can arise from incoherent reasoning, while Claude-Haiku-4.5 achieves the highest multi-dimensional score (Q_bal = 0.778). Furthermore, the framework exposes critical ranking inversions: DeepSeek-V3 ranks second under accuracy-priority but fifth under legal/compliance weighting, a reversal that single-metric evaluation cannot detect. Discriminant validity confirms 11/15 dimension pairs are independent (|r| < 0.50), providing psychometric support for treating each dimension as a distinct signal. The dimensional profiles produced by the framework directly support three classes of deployment decision: identifying models whose reasoning traces would fail accountability audits despite correct final answers (LS--CQ orthogonality); preventing ranking errors caused by accuracy-only benchmarking; and ensuring that no single metric silently substitutes for the six independent signals the framework captures.
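
The ranking inversion described above follows from how a weighted aggregate reacts to the choice of weights. The sketch below shows the effect on invented data: the weighted-mean aggregation, the two weightings, and both model profiles are hypothetical and are not the paper's scores or its Q_bal formula.

```python
DIMENSIONS = ("CQ", "CS", "RS", "LS", "ES", "SS")

def weighted_score(profile, weights):
    """Weighted mean of the six dimension scores (each assumed to lie in [0, 1])."""
    total = sum(weights[d] for d in DIMENSIONS)
    return sum(profile[d] * weights[d] for d in DIMENSIONS) / total

def rank(profiles, weights):
    """Model names ordered from best to worst under the given weighting."""
    return sorted(profiles, key=lambda m: weighted_score(profiles[m], weights), reverse=True)

# Hypothetical profiles and weightings, invented for illustration only.
profiles = {
    "model_a": {"CQ": 0.95, "CS": 0.7, "RS": 0.7, "LS": 0.4, "ES": 0.7, "SS": 0.7},
    "model_b": {"CQ": 0.75, "CS": 0.7, "RS": 0.7, "LS": 0.8, "ES": 0.7, "SS": 0.7},
}
accuracy_first = {"CQ": 5, "CS": 1, "RS": 1, "LS": 1, "ES": 1, "SS": 1}
compliance_first = {"CQ": 1, "CS": 1, "RS": 1, "LS": 5, "ES": 1, "SS": 1}

print(rank(profiles, accuracy_first))    # model_a ranks first
print(rank(profiles, compliance_first))  # model_b ranks first
```

A model that leads on correctness but trails on logical coherence can drop in rank once coherence is weighted heavily, which a single accuracy number would never reveal.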

2605.24647 2026-05-26 cs.CL 版本更新

Know You Before You Speak: User-State Modeling for LLM Personalization in Multi-Turn Conversation

在你说话之前了解你：多轮对话中用于LLM个性化的用户状态建模

Jiani Luo, Xiaoyan Zhao, Yang Zhang, Shuyi Miao, Bingbing Xu, Stefan Konigorski, Tat-Seng Chua

发表机构 * School of Computing, National University of Singapore（新加坡国立大学计算机学院）； School of Artificial Intelligence, Beihang University（北航人工智能学院）； Institute of Computing Technology, Chinese Academy of Sciences（中国科学院计算技术研究所）； German Institute of Human Nutrition Potsdam-Rehbruecke（德国人类营养研究所波茨坦-雷赫布鲁克）

AI总结提出基于自由能原理的PUMA框架，通过显式用户状态模型和动作选择机制，将个性化对话从被动记忆检索转变为基于模型的用户演化决策，在医疗咨询基准上提升长期对话效果。

Comments 30 pages, 3 figures

详情

AI中文摘要

个性化对话不仅需要回忆显式的用户历史：系统还需要推断通过交互演化并塑造适当响应策略的隐藏用户状态。现有的基于记忆和配置文件的方法主要重用可观察的用户信息，对建模用户状态动态或基于它们如何塑造未来用户状态来选择动作的支持有限。我们提出了PUMA（面向动作选择的预期用户状态建模），这是一个基于自由能原理（FEP）的框架，将个性化形式化为部分可观测下的决策，围绕一个显式的用户状态模型，该模型捕获潜在用户状态及其动作条件动态。在每一轮中，PUMA维护对用户隐藏状态的信念，细化用于观测生成和动作条件状态转移的用户状态模型，并通过最小化预期自由能来选择对话动作，在统一标准下平衡认知和实用目标。这种表述将个性化从被动记忆检索转变为基于模型的用户演化决策。我们在面向医疗咨询和动机性访谈的基准上实例化PUMA，并带有潜在状态标注以进行严格评估。实验表明，PUMA在保持强响应质量的同时改善了长期对话结果，跨数据集研究展示了更可靠的用户状态估计和下一状态预测。

英文摘要

Personalized dialogue requires more than recalling explicit user histories: systems also need to infer hidden user states that evolve through interaction and shape appropriate response strategies. Existing memory- and profile-based methods primarily reuse observable user information, offering limited support for modeling user-state dynamics or selecting actions based on how they shape future user states. We propose PUMA (Prospective User-state Modeling for Action selection), a framework grounded in the Free Energy Principle (FEP) that formulates personalization as decision-making under partial observability, centered on an explicit user state model that captures latent user states and their action-conditioned dynamics. At each turn, PUMA maintains a belief over the user's hidden state, refines the user state model for observation generation and action-conditioned state transition, and selects dialogue actions by minimizing expected free energy, balancing epistemic and pragmatic objectives under a unified criterion. This formulation shifts personalization from passive memory retrieval to model-based decision-making over user evolution. We instantiate PUMA on healthcare-oriented counseling and motivational interviewing benchmarks with latent state annotations for rigorous evaluation. Experiments show that PUMA improves long-horizon dialogue outcomes while maintaining strong response quality, and a cross-dataset study demonstrates more reliable user-state estimation and next-state prediction.

2605.24635 2026-05-26 cs.CL 版本更新

HiMed: Incentivizing Hindi Reasoning in Medical LLMs

HiMed: 激励医疗大语言模型中的印地语推理

Dingfeng Jiang, Han Yan, Chenze Ma, Amit Kumar Jaiswal, Ang Li, Yunxiang Jiang, Xinlei Xiong, Juhao Liang, Hongru Xiao, Xiang Li, Fan Bu, Jiale Han, Ruchir Gupta, Prayag Tiwari, Benyou Wang

发表机构 * The Chinese University of Hong Kong, Shenzhen（香港中文大学（深圳））； Indian Institute of Technology (Banaras Hindu University) Varanasi（印度理工学院（贝拿勒斯印度教大学）瓦拉纳西）； Tongji University（同济大学）； Shenzhen Research Institute of Big Data（深圳大数据研究院）； Shenzhen Loop Area Institute（深圳河套学院）； The Hong Kong University of Science and Technology（香港科技大学）； Halmstad University（哈尔姆斯塔德大学）

AI总结针对医疗大语言模型在印地语上表现不佳的问题，提出HiMed印地语医疗推理语料库与基准，并通过衰减支架奖励训练HiMed-8B模型，显著提升印地语医疗推理性能并缩小英印准确率差距。

2605.24614 2026-05-26 cs.CL cs.AI cs.LG 版本更新

Measuring the Depth of LLM Unlearning via Activation Patching

通过激活修补测量大语言模型遗忘的深度

Jaeung Lee, Dohyun Kim, Jaemin Jo

发表机构 * Sungkyunkwan University（成均馆大学）

AI总结提出遗忘深度评分（UDS），通过激活修补量化遗忘的机制深度，在150个遗忘模型上的元评估中达到最高忠实性和鲁棒性。

Comments 18 pages

详情

AI中文摘要

大语言模型遗忘已成为隐私保护和人工智能安全的关键事后机制，但审计目标知识是否真正被擦除仍然具有挑战性。现有的输出级指标无法检测到这些知识是否仍可从内部表示中恢复。最近的白盒研究揭示了此类残留知识，但通常依赖于辅助训练或数据集特定调整，缺乏可推广的指标。为解决这些限制，我们提出遗忘深度评分（UDS），一种通过激活修补量化遗忘机制深度的指标。UDS首先使用保留模型基线识别编码目标知识的层，然后在0-1尺度上测量遗忘模型中该知识被擦除的程度。在跨越8种方法的150个遗忘模型上的20个指标的元评估中，UDS实现了最高的忠实性和鲁棒性，证实了我们的因果方法是遗忘评估中最可靠的。案例研究进一步揭示，白盒指标可能在层级别上不一致，并且擦除深度因示例而异。我们提供了将UDS集成到现有基准测试框架并简化评估流程的指南。代码和数据可在https://github.com/gnueaj/unlearning-depth-score获取。

英文摘要

Large language model (LLM) unlearning has emerged as a crucial post-hoc mechanism for privacy protection and AI safety, yet auditing whether target knowledge is truly erased remains challenging. Existing output-level metrics fail to detect when this knowledge remains recoverable from internal representations. Recent white-box studies reveal such residual knowledge but often rely on auxiliary training or dataset-specific adaptations, leaving no generalizable metric. To address these limitations, we propose the Unlearning Depth Score (UDS), a metric that quantifies the mechanistic depth of unlearning via activation patching. UDS first identifies layers that encode the target knowledge using a retain model baseline, then measures how much of it is erased in the unlearned model on a 0-1 scale. In a meta-evaluation across 20 metrics on 150 unlearned models spanning 8 methods, UDS achieves the highest faithfulness and robustness, confirming our causal approach as the most reliable for unlearning evaluation. Case studies further reveal that white-box metrics can disagree at the layer level and that erasure depth varies across examples. We provide guidelines for integrating UDS into existing benchmarking frameworks and streamlining the evaluation pipeline. Code and data are available at https://github.com/gnueaj/unlearning-depth-score

2605.24613 2026-05-26 cs.CL cs.AI cs.SE 版本更新

从自然语言训练的后继表示中自发涌现的词类表示

Mathis Immertreu, Achim Schilling, Thomas Kinfe, Patrick Krauss

发表机构 * Cognitive Computational Neuroscience Group（认知计算神经科学组）； Friedrich-Alexander-Universität Erlangen–Nürnberg (FAU)（弗里德里希-亚历山大-埃尔兰根-纽伦堡大学（FAU））； Mannheim Center for Neuromodulation and Neuroprosthetics (MCNN)（曼海姆神经调制与神经假体中心（MCNN））； University Hospital Mannheim, University Heidelberg（曼海姆大学医院，海德堡大学）

AI总结本研究将强化学习中的后继表示（SR）框架应用于自然语言，通过训练神经网络预测未来词分布，发现无监督下词类（如名词、动词、形容词）的几何结构自发涌现，且预测时域影响结构层次。

详情

AI中文摘要

语言模型通常被训练来预测序列中的下一个词。这里，我们探索来自强化学习的另一种预测原则：后继表示（SR），它建模未来状态的期望折扣分布，而不是直接的下一个状态。我们将这一框架迁移到自然语言，并训练神经网络在多个时间视界上预测未来词分布，从而学习长程转移结构的表示。我们在WikiText-103（1.03亿词；2万词词汇）上训练深度残差神经网络，并使用KL散度将后继表示优化为概率分布。在没有显式语言监督的情况下，结构化语言表示自发涌现。训练后，学习到的空间相对于词性（POS）类别发展出清晰的几何组织：名词、动词和形容词变得可分离，并通过无监督聚类恢复。这种组织系统地依赖于预测视界：短视界产生最强的句法结构，而长视界逐渐整合更广泛的上下文和语义信息。在更精细的分辨率下，额外的可解释词汇子结构出现，揭示了主要词类内的连贯子类。这些发现表明，句法类别无需显式编码，而可能作为预测序列学习的结果出现。据我们所知，这项工作首次将后继表示系统应用于自然语言，并在强化学习、语言学和认知神经科学之间建立了概念桥梁。

英文摘要

Language models are typically trained to predict the next token in a sequence. Here, we explore an alternative predictive principle from reinforcement learning: Successor Representations (SRs), which model the expected discounted distribution of future states rather than the immediate next state. We transfer this framework to natural language and train neural networks to predict future word distributions across multiple temporal horizons, thereby learning representations of long-range transition structure. We train a deep residual neural network on WikiText-103 (103 million tokens; 20,000-word vocabulary) and optimize successor representations as probability distributions using KL divergence. Without explicit linguistic supervision, structured language representations emerge spontaneously. After training, the learned space develops a clear geometric organization with respect to part-of-speech (POS) categories: nouns, verbs, and adjectives become separable and recoverable through unsupervised clustering. This organization depends systematically on predictive horizon, with short horizons producing the strongest syntactic structure and longer horizons increasingly integrating broader contextual and semantic information. At finer resolutions, additional interpretable lexical substructure emerges, revealing coherent subclasses within major word categories. These findings suggest that syntactic categories need not be explicitly encoded but may arise as a consequence of predictive sequence learning. To our knowledge, this work provides the first systematic application of successor representations to natural language and establishes a conceptual bridge between reinforcement learning, linguistics, and cognitive neuroscience.
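
A minimal count-based sketch of the successor-representation idea described above: each word is mapped to a discounted distribution over the words that follow it within a fixed horizon. The paper trains a deep residual network with a KL objective; the function name, the `gamma` and `horizon` parameters, and the toy sentence here are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict


def successor_distributions(tokens, gamma=0.5, horizon=3):
    """Empirical discounted distribution over future words for each word type.

    A word seen k steps ahead (1 <= k <= horizon) contributes weight
    gamma ** (k - 1); each row is then normalised to sum to 1.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        for k in range(1, horizon + 1):
            if i + k >= len(tokens):
                break  # ran past the end of the sequence
            counts[word][tokens[i + k]] += gamma ** (k - 1)

    distributions = {}
    for word, row in counts.items():
        total = sum(row.values())
        distributions[word] = {nxt: weight / total for nxt, weight in row.items()}
    return distributions


toy_corpus = "the cat sat on the mat".split()
sr = successor_distributions(toy_corpus, gamma=0.5, horizon=2)
print(sr["the"])
```

A larger `horizon` or `gamma` spreads probability mass over more distant words, which is the knob the abstract refers to as the predictive horizon.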

2605.24579 2026-05-26 cs.CL 版本更新

WhenLoss: Diagnosing Write and Retrieval Bottlenecks in Long-Context Memory Systems

WhenLoss: 诊断长上下文记忆系统中的写入与检索瓶颈

Jiangnan Yu, Kisson Songqi Lin, Jilong Wu

AI总结提出四条件诊断协议发现写入阶段是长上下文记忆系统的主要瓶颈，并基于此提出预期预测压缩（EPC）方法，在写入时利用LLM预测未来问题并保留关键证据，显著提升系统性能。

Comments 14 pages, 7 figures, 9 tables

详情

AI中文摘要

长上下文记忆系统在固定预算下常常失败，但端到端评估无法揭示证据是在压缩过程中被丢弃还是被保留但从未被检索。我们引入了一个四条件诊断协议，在截断完整上下文（TFC）、证据预言（OE）、完整存储记忆（CSM）和检索记忆（RM）条件下评估固定阅读器。在此固定预算的LongMemEval设置下，大多数测试基线的写入侧差距超过检索侧差距，其中六个基线中的四个在我们的默认诊断裕度下稳健地表现为写入主导。受此诊断启发，我们提出预期预测压缩（EPC），该方法将关键决策——保留哪些信息——移至写入时间，通过使用LLM预测未来可能的问题并在令牌预算下保留最少的支持证据，同时在问题时间保持检索不变。在所有500个LongMemEval问题中，使用三个阅读器（GPT-5.2、Claude Sonnet 4、Gemini 2.5 Pro），EPC在所有系统中取得了最高的CSM分数（0.49，而最强基线Summary (LLM)为0.44），将Delta_write降至0.04，同时Delta_retr与其他基于LLM的系统相当。这些结果表明，在此基准和评估设置下，改进写入阶段保留的内容是测试系统性能提升的关键途径。

英文摘要

Long-context memory systems often fail under fixed budgets, but end-to-end evaluation does not reveal whether evidence was discarded during compression or preserved but never retrieved. We introduce a four-condition diagnostic protocol that evaluates a fixed reader under truncated full context (TFC), oracle evidence (OE), complete stored memory (CSM), and retrieved memory (RM). Under this fixed-budget LongMemEval setup, write-side gaps exceed retrieval-side gaps for most tested baselines, with four of six baselines robustly write-dominant under our default diagnosis margin. Motivated by this diagnosis, we propose Expected Predictive Compression (EPC), which moves the key decision--what information to retain--to write time by using an LLM to anticipate likely future questions and preserve the minimal supporting evidence under the token budget, while leaving retrieval unchanged at question time. Across all 500 LongMemEval questions with three readers (GPT-5.2, Claude Sonnet 4, Gemini 2.5 Pro), EPC achieves the highest CSM scores among all systems (0.49 vs. 0.44 for Summary (LLM), the strongest baseline), reducing Delta_write to 0.04 while leaving Delta_retr comparable to other LLM-based systems. These results suggest that, on this benchmark and evaluation setup, improving what the write stage preserves is a key avenue for performance gains in the tested systems.

2605.24577 2026-05-26 cs.LG cs.AI cs.CL 版本更新

Polymorphism Is Rotation: Operational Mechanistic Interpretability from a Two-Layer Transformer to Pythia-70m

多态性即旋转：从两层Transformer到Pythia-70m的操作性机械可解释性

Jordan F. McCann

发表机构 * Independent Researcher（独立研究者）

AI总结本文发现独立训练的Transformer在残差流基上通过均匀随机旋转相互关联，并利用正交Procrustes拟合实现特征字典和转向向量在模型间的迁移，无需重新训练。

Comments 26 pages, 4 figures, 40 references. Pre-registered four-bar framework; all numerical claims reproducible

详情

AI中文摘要

独立训练的Transformer在残差流基上计算相同的函数，这些基通过$\mathrm{SO}(d_{\mathrm{model}})$上的均匀随机旋转相互关联。我们将这种现象称为多态性：相同的函数，但内部坐标互不可解。每对模型之间的一次矩阵乘法即可消除这种多态性：在单批激活上进行正交Procrustes拟合，即可在独立训练的模型之间迁移稀疏自编码器特征字典和转向向量，无需重新训练。该现象对标准SAE通用性度量不可见。解码器列余弦相似度在不同种子间匹配度达98%，即SAE通用性的头条数字，而一个种子训练的SAE重构另一个种子的激活时，解释方差为负，比预测常数均值更差。解码器列对齐，但编码器从旋转后的框架读取。单个Procrustes旋转$R$可在每个内部位置将重构恢复至种子内上限的0.025 EV以内。 $R$服从Haar分布：$\|R - I\|_F$与随机正交预测$\sqrt{2 d_{\mathrm{model}}}$在$d_{\mathrm{model}} = 512$时匹配至0.1%，且$R$的特征值谱与Haar $\mathrm{SO}(d_{\mathrm{model}})$的Kolmogorov-Smirnov检验在合并和逐对情况下均返回$p \approx 1.000$。均值差转向向量通过与$R$的不变子空间对齐在三种机制下迁移：当被共享输出权重固定时清晰，与旋转子空间重叠时部分，否则反转。在无共享输入/输出（Pythia）时，所有三种情况均坍缩为普遍反转。同一旋转解释适用于单次运行中的不同训练检查点。在104k参数的Dyck-3 Transformer和九个独立训练的Pythia-70m种子（基于The Pile数据集）上，通过预注册的四柱操作框架进行验证。前沿规模（10B+）的复现仍有待研究。

英文摘要

Independently trained transformers compute the same function in residual-stream bases that differ by a uniform random rotation on $\mathrm{SO}(d_{\mathrm{model}})$. We call this phenomenon polymorphism: same function, mutually unintelligible interior coordinates. One matrix multiplication per model pair removes it: an orthogonal Procrustes fit on a single batch of activations transfers sparse-autoencoder feature dictionaries and steering vectors between independently trained models, with no retraining. The phenomenon is invisible to the standard SAE universality metric. Decoder-column cosine similarity matches across seeds at 98%, the SAE-universality headline number, while an SAE trained on one seed reconstructs another seed's activations at negative explained variance, worse than predicting the constant mean. The decoder columns align; the encoder reads from a rotated frame. A single Procrustes rotation $R$ restores reconstruction to within 0.025 EV of the within-seed ceiling at every internal site. $R$ is Haar-distributed: $\|R - I\|_F$ matches the random-orthogonal prediction $\sqrt{2 d_{\mathrm{model}}}$ to 0.1% at $d_{\mathrm{model}} = 512$, and a Kolmogorov-Smirnov test of $R$'s eigenvalue spectrum against Haar $\mathrm{SO}(d_{\mathrm{model}})$ returns $p \approx 1.000$ pooled and per-pair. Diff-of-means steering vectors transfer in three regimes by alignment with $R$'s invariant subspace: clean when pinned by shared output weights, partial when overlapping the rotated subspace, inverted otherwise. With no shared I/O (Pythia), all three collapse to universally inverted. The same rotation account holds across training checkpoints within a single run. Validated on a 104k-parameter Dyck-3 transformer and nine independently-trained Pythia-70m seeds on The Pile, via a pre-registered four-bar operational framework. Frontier-scale (10B+) replication remains open.

2605.24573 2026-05-26 cs.CL 版本更新

通过检索、聚类和生成从案例数据库中生成法律评论

Max Prior, Niklas Wais, Matthias Grabmair

发表机构 * Technical University of Munich（慕尼黑工业大学）

AI总结提出一个全自动流水线，利用检索、聚类和生成方法，从法院判决中自动生成法律评论，无需人工教义框架。

详情

AI中文摘要

我们提出了一个全自动流水线，将大量法院判决转化为法规的法律评论——无需提供任何手工制作的教义框架。使用德国联邦最高法院引用德国民法典第242、280、812和823条的4555份判决，我们提取段落级块，总结其推理，并推导关键词，这些关键词被嵌入和聚类。对于每个聚类，一个LLM生成标题并综合引用丰富的章节，然后由四个最先进的LLM合并成连贯的评论。我们使用人类专家和LLM评判员，沿着五个维度——主题相关性、标题匹配、引用忠实性、聚类区分度和逻辑顺序——进行评估。我们的结果表明，从法院判决中挖掘类似评论的论点以生成可在几分钟内以最低成本更新的报告是可行的，但突出了由于来源受限和法律推理规范性而产生的局限性。

英文摘要

We present a fully automated pipeline that transforms large collections of court decisions into legal commentaries for statutes, without providing any handcrafted doctrinal framework. Using 4,555 decisions of the German Federal Court of Justice that cite sections 242, 280, 812 and 823 of the German Civil Code (BGB), we extract paragraph-level chunks, summarize their reasoning, and derive keywords, which are embedded and clustered. For each cluster, an LLM generates headings and synthesizes citation-rich sections, which are then merged into coherent commentaries by four state-of-the-art LLMs. We evaluate along five dimensions (topical relevance, heading-match, citation faithfulness, cluster distinction and logical ordering) using both a human expert and an LLM-judge. Our results show that it is feasible to mine commentary-like arguments from court decisions and generate reports that can be refreshed within minutes at minimal cost, yet they highlight limitations arising from restricted sources and the normativity of legal reasoning.

2605.24530 2026-05-26 cs.CL cs.CV 版本更新

Unveil: Unified Visual-Textual Integration and Distillation for Multi-modal Document Retrieval

Unveil: 统一视觉-文本集成与蒸馏的多模态文档检索

Hao Sun, Yingyan Hou, Jiayan Guo, Bo Wang, Chunyu Yang, Jinsong Ni, Yan Zhang

发表机构 * State Key Laboratory of General Artificial Intelligence, Peking University（北京大学通用人工智能国家重点实验室）； School of Intelligence Science and Technology, Peking University（北京大学智能科学与技术学院）； Aerospace Information Research Institute, Chinese Academy of Sciences（中国科学院空天信息创新研究院）； Key Laboratory of Target Cognition and Application Technology（目标认知与应用技术重点实验室）； Beijing Institute of Technology（北京理工大学）； Ucap Cloud（Ucap云）

AI总结提出Unveil框架，通过视觉-文本嵌入和知识蒸馏实现鲁棒的文档检索，兼顾布局与语义信息。

Comments ACL 2025 Main Conference

详情

AI中文摘要

现实场景中的文档检索由于文档格式和模态的多样性面临重大挑战。传统的基于文本的方法依赖于定制的解析技术，忽略布局信息且容易出错，而最近的无解析视觉方法在文本丰富的场景中往往难以捕捉细粒度的文本语义。为了解决这些限制，我们提出了Unveil，一种新颖的视觉-文本嵌入框架，有效整合文本和视觉特征以实现鲁棒的文档表示。通过知识蒸馏，我们将视觉-文本嵌入模型的语义理解能力转移到纯视觉模型，实现高效的无解析检索同时保持语义保真度。实验结果表明，我们的视觉-文本嵌入方法超越了现有方法，而知识蒸馏成功弥合了视觉-文本方法与纯视觉方法之间的性能差距，提高了检索准确性和效率。

英文摘要

Document retrieval in real-world scenarios faces significant challenges due to diverse document formats and modalities. Traditional text-based approaches rely on tailored parsing techniques that disregard layout information and are prone to errors, while recent parsing-free visual methods often struggle to capture fine-grained textual semantics in text-rich scenarios. To address these limitations, we propose Unveil, a novel visual-textual embedding framework that effectively integrates textual and visual features for robust document representation. Through knowledge distillation, we transfer the semantic understanding capabilities from the visual-textual embedding model to a purely visual model, enabling efficient parsing-free retrieval while preserving semantic fidelity. Experimental results demonstrate that our visual-textual embedding method surpasses existing approaches, while knowledge distillation successfully bridges the performance gap between visual-textual and visual-only methods, improving both retrieval accuracy and efficiency.

2605.22715 2026-05-26 cs.CV cs.AI cs.CL cs.HC 版本更新

AnyMo: Geometry-Aware Setup-Agnostic Modeling of Human Motion in the Wild

AnyMo：野外人体运动的几何感知与设置无关建模

Baiyu Chen, Zechen Li, Wilson Wongso, Lihuan Li, Xiachong Lin, Hao Xue, Benjamin Tag, Flora Salim

发表机构 * The University of New South Wales（新南威尔士大学）； The Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））； The Hong Kong University of Science and Technology（香港科技大学）

AI总结提出AnyMo框架，通过物理模拟生成多样化IMU信号、图编码器预训练和LLM对齐，实现跨设备/数据集的零样本活动识别、跨模态检索和运动描述，性能显著提升。

详情

AI中文摘要

随着可穿戴和移动设备日益融入日常生活，它们为持续感知野外人体运动提供了实用途径。但惯性信号高度依赖于传感设置，包括身体位置、安装位置、传感器朝向、设备硬件和采样协议。这种设置依赖性使得学习跨设备和数据集迁移的运动表示变得困难，并限制了可穿戴IMU在封闭集识别之外的广泛应用。我们提出AnyMo，一个用于设置无关人体运动建模的几何感知框架。AnyMo利用基于物理的IMU模拟在密集体表位置上生成多样且合理的合成信号，从配对的合成放置视图和掩蔽部分观测中预训练图编码器，将多位置IMU标记化为全身运动令牌，并将这些令牌与LLM对齐以进行运动-语言理解。我们在三个互补任务上评估AnyMo：跨14个未见下游数据集的零样本活动识别、跨模态检索和可穿戴IMU运动描述，其中在HAR上平均Accuracy/F1/R@2提升11.7%/11.6%/22.6%，零样本IMU到文本和文本到IMU检索MRR分别提升15.9%和28.6%，零样本描述BERT-F1提升18.8%。这些结果支持AnyMo作为野外可穿戴运动理解的通才模型。项目页面：https://baiyuchen.com/project/AnyMo。

英文摘要

As wearable and mobile devices become increasingly embedded in daily life, they offer a practical way to continuously sense human motion in the wild. But inertial signals are highly dependent on the sensing setup, including body location, mounting position, sensor orientation, device hardware, and sampling protocol. This setup dependence makes it difficult to learn motion representations that transfer across devices and datasets, and limits the broader use of wearable IMUs beyond closed-set recognition. We introduce AnyMo, a geometry-aware framework for setup-agnostic human motion modeling. AnyMo uses physics-grounded IMU simulation over dense body-surface placements to generate diverse and plausible synthetic signals, pre-trains a graph encoder from paired synthetic placement views and masked partial observations, tokenizes multi-position IMU into full-body motion tokens, and aligns these tokens with an LLM for motion-language understanding. We evaluate AnyMo on three complementary tasks: zero-shot activity recognition across 14 unseen downstream datasets, cross-modal retrieval, and wearable IMU motion captioning, where it improves average Accuracy/F1/R@2 by 11.7%/11.6%/22.6% on HAR, increases zero-shot IMU-to-text and text-to-IMU retrieval MRR by 15.9% and 28.6%, respectively, and improves zero-shot captioning BERT-F1 by 18.8%. These results support AnyMo as a generalist model for wearable motion understanding in the wild. Project page: https://baiyuchen.com/project/AnyMo.

2605.18840 2026-05-26 cs.LG cs.AI cs.CL 版本更新

The Growing Pains of Frontier Models: When Leaderboards Stop Separating and What to Measure Next

前沿模型的成长之痛：当排行榜不再区分以及接下来衡量什么

Adil Amin

发表机构 * Zehen Labs（泽亨实验室）

AI总结本文通过分解SWE-bench和GPQA Diamond分数为种群耦合趋势和每版本残差（h场），诊断前沿模型能力之间的协作与权衡，并提供三步诊断法、每实验室测量优先级表及七个可证伪预测。

Comments 13 pages, 5 figures, 4 tables. Companion paper: "Lying Is Just a Phase: The Hidden Alignment Transition in Language Model Scaling." ( https://doi.org/10.48550/arXiv.2605.18838 ). Code: https://github.com/adilamin89/cape-scaling . Dashboard: https://zehenlabs.com/cape/

详情

AI中文摘要

排行榜在独立轴上对前沿模型进行排名，但并未揭示能力在版本间是相互增强还是权衡——而在前沿，这种相互作用是更具信息量的信号。我们将配对的SWE-bench和GPQA Diamond分数分解为种群耦合趋势和每版本残差（h场），该残差从两个公开基准分数诊断能力重点。在来自10个实验室的34个模型（2024-2026）中，能力相互协作（r = +0.72，p < 10^{-6}），但协作程度系统性地变化：每个实验室的耦合斜率跨度达5倍（谷歌1.15 vs. DeepSeek 0.23），且实验室发生转向——DeepSeek从推理密集型逆转为编码优先（Δh = 15.9个百分点）；Anthropic在编码偏离和恢复之间振荡。种群回归作为等斜线相边界：用于识别基础尺度耦合转变的相同分类器√[(a/b)·B₁] [Amin, 2026] 对前沿模型进行分类，并已在下一个转变处检测到混合相行为（两个模型低于GPQA-IFEval等斜线）。h场不仅具有诊断性——它还告诉你需要改变什么。预训练建立耦合为0.871，而RLHF增加0.081 [Amin, 2026]：预训练级别的转变是永久的（DeepSeek的四个版本逆转持续存在），后训练转变是可逆的（Anthropic的三次编码偏离均在单个版本内恢复），仅推理计算在不重新训练的情况下将h改变+7.8个百分点。知道哪个组件占主导地位决定了是重新训练还是等待。我们提供了三步诊断法（定位、分类、预测）、每实验室测量优先级表以及七个带有时间戳标准的可证伪预测。五个截止日期后的版本落在95%预测区间内。代码、数据和交互式仪表盘：https://zehenlabs.com/cape/。

英文摘要

Leaderboards rank frontier models on independent axes but do not reveal whether capabilities reinforce or trade off across releases -- and at the frontier, this interaction is the more informative signal. We decompose paired SWE-bench and GPQA Diamond scores into a population coupling trend and per-release residual ($h$-field) that diagnoses capability emphasis from two public benchmark scores. Across 34 models from 10 labs (2024--2026), capabilities cooperate ($r = +0.72$, $p < 10^{-6}$), but cooperation varies systematically: per-lab coupling slopes span $5\times$ (Google $1.15$ vs. DeepSeek $0.23$), and labs pivot -- DeepSeek reversed from reasoning-rich to coding-first ($\Delta h = 15.9$ pp); Anthropic oscillates between coding excursions and recovery. The population regression serves as an isocline phase boundary: the same $\sqrt{(a/b)\cdot B_1}$ classifier that identifies the base-scale coupling transition [Amin, 2026] classifies frontier models and already detects mixed-phase behavior at the next transition (two models below the GPQA--IFEval isocline). The $h$-field is not just diagnostic -- it tells you what to change. Pretraining establishes coupling at $0.871$ while RLHF adds $0.081$ [Amin, 2026]: pretraining-level shifts are permanent (DeepSeek's four-release reversal persists), post-training shifts are reversible (Anthropic's three coding excursions each recover within one release), and inference compute alone shifts $h$ by $+7.8$ pp without retraining. Knowing which component dominates determines whether to retrain or wait. We provide a three-step diagnostic (locate, classify, predict), a per-lab measurement-priority table, and seven falsifiable predictions with timestamped criteria. Five post-cutoff releases fall within the 95% prediction interval. Code, data, and an interactive dashboard: https://zehenlabs.com/cape/.
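
A minimal sketch of the trend-plus-residual decomposition described above, assuming the population trend is an ordinary least-squares line and the per-release residual is the vertical distance from it. The abstract does not give the exact definition of the $h$-field, so the choice of axes, the function names, and the scores below are hypothetical illustrations rather than the paper's method or data.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residuals(xs, ys):
    """Per-release deviation from the population trend (assumed stand-in for h)."""
    slope, intercept = fit_line(xs, ys)
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


# Hypothetical paired scores for four releases (not data from the paper).
gpqa_scores = [50.0, 60.0, 70.0, 80.0]
swe_scores = [30.0, 45.0, 50.0, 65.0]

slope, intercept = fit_line(gpqa_scores, swe_scores)
h_values = residuals(gpqa_scores, swe_scores)
print(slope, intercept)
print(h_values)
```

Under this assumed convention, a positive residual marks a release that scores higher on the y-axis benchmark than the population trend predicts, and a negative residual marks the opposite emphasis.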

2605.16409 2026-05-26 cs.CV cs.CL cs.LG 版本更新

Multilingual OCR-Aware Fine-Tuning and Prompt-Guided Chain-of-Thought Reasoning for Multimodal Large Language Models

多语言OCR感知微调和提示引导的链式思维推理用于多模态大语言模型

Qinwu Xu, Yifan Jiang, Haoyu Ren

发表机构 * Meta AI ； UT Austin（德克萨斯大学奥斯汀分校）

AI总结提出一种多语言OCR感知的多模态训练框架，通过合成数据生成、OCR感知微调和结构化视觉链式思维提示，提升多模态大语言模型在复杂视觉条件下的OCR完整性和多语言翻译准确性。

详情

AI中文摘要

光学字符识别（OCR）和多语言文本理解仍然是多模态大语言模型（MLLMs）的主要失败模式，尤其是在包含杂乱布局、小字体、模糊、遮挡和复杂排版的真实世界图像中。我们提出了一种OCR感知的多语言多模态训练框架，该框架结合了（i）大规模合成OCR到翻译数据生成，（ii）使用LoRA适配的OCR感知监督微调（SFT），以及（iii）在不确定视觉条件下进行推理的结构化视觉链式思维（CoT）提示。使用基于LLaMA的多模态架构，所提出的框架在OCR完整性、多语言翻译准确性和退化视觉条件下的鲁棒性方面有了显著提升。在多语言收据、菜单、海报、标志、手写文本和文档图像上的实验结果表明，与基线模型相比，视觉-文本对齐显著改善。特别是，所提出的OCR感知后训练框架提高了对小、模糊、空间分散和部分遮挡文本的提取，同时减少了对不确定OCR条件下语言先验的依赖。与前沿多模态系统（包括GPT-5类和Gemini系列模型）的定性比较进一步表明，在噪声和视觉模糊的OCR场景下，OCR对齐得到改善，幻觉减少。总体而言，结果表明，以数据为中心的OCR感知多模态后训练为改进多语言OCR和基于OCR的视觉问答系统提供了一种有效且可扩展的方向。

英文摘要

Optical character recognition (OCR) and multilingual text understanding remain major failure modes of multimodal large language models (MLLMs), particularly in real-world images containing cluttered layouts, small fonts, blur, occlusion, and complex typography. We present an OCR-aware multilingual multimodal training framework that combines (i) large-scale synthetic OCR-to-translation data generation, (ii) OCR-aware supervised fine-tuning (SFT) with LoRA adaptation, and (iii) structured visual chain-of-thought (CoT) prompting for reasoning under uncertain visual conditions. Using a LLaMA-based multimodal architecture, the proposed framework substantially improves OCR completeness, multilingual translation accuracy, and robustness under degraded visual conditions. Experimental results on multilingual receipts, menus, posters, signs, handwritten text, and document images demonstrate significantly improved visual-text grounding compared with the baseline model. In particular, the proposed OCR-aware post-training framework improves extraction of small, blurred, spatially scattered, and partially occluded text while reducing reliance on language priors under uncertain OCR conditions. Qualitative comparisons with frontier multimodal systems, including GPT-5-class and Gemini-family models, further suggest improved OCR grounding and reduced hallucination under noisy and visually ambiguous OCR scenarios. Overall, the results indicate that data-centric OCR-aware multimodal post-training provides an effective and scalable direction for improving multilingual OCR and OCR-based visual question answering systems.

2605.15759 2026-05-26 cs.CL 版本更新

DimMem: Dimensional Structuring for Efficient Long-Term Agent Memory

DimMem：面向高效长期智能体记忆的维度结构化

Wentao Qiu, Haotian Hu, Fanyi Wang, Jinwei Kong, Yu Zhang

发表机构 * StepOS ； Xiamen University（厦门大学）； ShanghaiTech University（上海科技大学）

AI总结提出DimMem维度记忆框架，通过原子化、类型化、自包含的记忆单元（含时间、地点、原因等显式字段）实现维度感知检索与更新，在LoCoMo-10和LongMemEval-S上分别达到81.43%和78.20%准确率，且每查询token成本降低24%。

详情

AI中文摘要

将结果监督内化为过程监督：推理强化学习的新范式

Fei Ding, Yongkang Zhang, Runhao Liu, Yuhao Liao, Zijian Zeng, Sibo wang, Huiming Yang

发表机构 * Alibaba Group（阿里巴巴集团）； Tsinghua University（清华大学）

AI总结提出一种监督内化方法，使模型在仅结果监督下自动提取过程级学习信号，实现细粒度策略优化。

详情

AI中文摘要

推理强化学习的核心挑战不仅在于结果级监督的稀疏性，更在于如何将仅在序列末尾提供的反馈转化为可指导中间推理步骤的细粒度学习信号。现有方法要么依赖结果级奖励进行序列级优化，导致精确信用分配困难，要么依赖外部构建的过程监督，成本高昂且难以可持续扩展。为解决这一问题，我们提出一个新视角：推理强化学习可以理解为将结果监督内化为过程监督的问题。基于此视角，我们引入一种用于推理强化学习的监督内化方法，使模型能够通过识别、纠正和重用失败的推理轨迹自动提取过程级学习信号，从而在仅结果监督下实现更细粒度的策略优化。我们进一步将这一思想抽象为一种新的训练范式，其中模型在强化学习过程中持续生成并完善自身的内部过程监督，为推理强化学习中细粒度信用分配开辟了一条不同于外部提供过程监督的新路径。

英文摘要

The central challenge of reinforcement learning for reasoning lies not only in the sparsity of outcome-level supervision, but more fundamentally in how to transform feedback provided only at the end of a sequence into fine-grained learning signals that can guide intermediate reasoning steps. Existing approaches either rely on outcome-level rewards for sequence-level optimization, which makes precise credit assignment difficult, or depend on externally constructed process supervision, which is costly and difficult to scale sustainably. To address this, we propose a new perspective: reinforcement learning for reasoning can be understood as the problem of internalizing outcome supervision into process supervision. From this perspective, we introduce a supervision-internalization method for reinforcement learning for reasoning, enabling the model to automatically extract process-level learning signals through identifying, correcting, and reusing failed reasoning trajectories, thereby achieving finer-grained policy optimization under outcome-only supervision. We further abstract this idea into a new training paradigm, in which the model continually generates and refines its own internal process supervision during reinforcement learning, opening a new path for fine-grained credit assignment in reinforcement learning for reasoning that differs from externally provided process supervision.

2605.01284 2026-05-26 cs.CV cs.AI cs.CL cs.IR 版本更新

Chain of Evidence: Pixel-Level Visual Attribution for Iterative Retrieval-Augmented Generation

证据链：面向迭代检索增强生成的像素级视觉归因

Peiyang Liu, Ziqiang Cui, Xi Wang, Di Liang, Wei Ye

发表机构 * National Engineering Research Center for Software Engineering, Peking University（软件工程国家级工程研究中心，北京大学）； City University of Hong Kong（香港城市大学）； Peking University（北京大学）； Tencent Technology（腾讯科技）

AI总结提出Chain of Evidence (CoE)框架，利用视觉语言模型直接对检索到的文档截图进行推理，输出精确边界框以可视化完整推理链，解决迭代检索增强生成中的粗粒度归因和视觉语义丢失问题。

详情

DOI: 10.1145/3805712.3809540

AI中文摘要

迭代检索增强生成（iRAG）已成为通过逐步检索和推理外部文档来回答复杂多跳问题的强大范式。然而，当前系统主要基于解析文本运行，这造成了两个关键瓶颈：（1）粗粒度归因，用户需要根据模糊的文本级引用在冗长文档中手动定位证据；（2）视觉语义丢失，将视觉丰富的文档（如幻灯片、带有图表的PDF）转换为文本会丢弃对推理至关重要的空间逻辑和布局线索。为弥合这一差距，我们提出了证据链（CoE），这是一个与检索器无关的视觉归因框架，利用视觉语言模型直接对检索到的文档候选截图进行推理。CoE消除了特定格式的解析，输出精确的边界框，可视化检索候选集中的完整推理链。我们在两个不同的基准上评估CoE：Wiki-CoE，一个源自2WikiMultiHopQA的大规模结构化网页数据集；以及SlideVQA，一个具有挑战性的演示幻灯片数据集，包含复杂图表和自由形式布局。实验表明，微调后的Qwen3-VL-8B-Instruct取得了稳健的性能，在需要视觉布局理解的场景中显著优于基于文本的基线，同时为像素级可解释的iRAG建立了与检索器无关的解决方案。我们的代码可在https://github.com/PeiYangLiu/CoE.git获取。

英文摘要

Iterative Retrieval-Augmented Generation (iRAG) has emerged as a powerful paradigm for answering complex multi-hop questions by progressively retrieving and reasoning over external documents. However, current systems predominantly operate on parsed text, which creates two critical bottlenecks: (1) coarse-grained attribution, where users are burdened with manually locating evidence within lengthy documents based on vague text-level citations; and (2) visual semantic loss, where the conversion of visually rich documents (e.g., slides, PDFs with charts) into text discards spatial logic and layout cues essential for reasoning. To bridge this gap, we present Chain of Evidence (CoE), a retriever-agnostic visual attribution framework that leverages Vision-Language Models to reason directly over screenshots of retrieved document candidates. CoE eliminates format-specific parsing and outputs precise bounding boxes, visualizing the complete reasoning chain within the retrieved candidate set. We evaluate CoE on two distinct benchmarks: Wiki-CoE, a large-scale dataset of structured web pages derived from 2WikiMultiHopQA, and SlideVQA, a challenging dataset of presentation slides featuring complex diagrams and free-form layouts. Experiments demonstrate that fine-tuned Qwen3-VL-8B-Instruct achieves robust performance, significantly outperforming text-based baselines in scenarios requiring visual layout understanding, while establishing a retriever-agnostic solution for pixel-level interpretable iRAG. Our code is available at https://github.com/PeiYangLiu/CoE.git.

2605.00817 2026-05-26 cs.CL 版本更新

When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models

当LLM停止遵循步骤：语言模型中程序执行的诊断研究

Sailesh Panda, Pritam Kadasi, Abhishek Upperwal, Mayank Singh

发表机构 * Indian Institute of Technology Gandhinagar（印度理工学院甘地讷格尔分校）

AI总结本研究通过构建受控诊断基准，评估大型语言模型在程序执行任务中的忠实性，发现随着步骤增加准确率从63%降至20%，并揭示了缺失答案、过早答案、自我修正和执行不完整等失败模式。

Comments 86 pages, 124 figures, 4 Tables

2604.23396 2026-05-26 cs.IR cs.AI cs.CL cs.LG 版本更新

Lost in Decoding? Reproducing and Stress-Testing the Look-Ahead Prior in Generative Retrieval

迷失在解码中？复现与压力测试生成式检索中的前瞻先验

Kidist Amde Mekonnen, Yongkang Li, Yubao Tang, Simon Lupart, Maarten de Rijke

发表机构 * University of Amsterdam（阿姆斯特丹大学）

AI总结本文复现并压力测试了生成式检索中的前瞻先验方法PAG，发现其规划信号在词汇表面形式变化下脆弱，并评估了跨语言鲁棒性与查询端缓解策略。

Comments 12 pages, 5 figures, 9 tables; accepted to the 49th International ACM SIGIR Conference on Research and Development in Information Retrieval, July 20-24, 2026, Melbourne/Naarm, Australia

详情

DOI: 10.1145/3805712.3808567
Journal ref: Proceedings of the 49th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '26), pages XXX-XXX, 2026

AI中文摘要

生成式检索（GR）通过自回归生成文档标识符来对文档进行排序。由于许多GR方法依赖于trie约束的束搜索，它们在有限束解码下容易过早剪枝相关前缀。生成式检索中的前瞻规划（PAG）通过使用同时解码来计算文档级前瞻先验，指导后续顺序解码，从而缓解了这种失败模式。我们在推理时复现了PAG，并压力测试了其解码行为。使用作者发布的检查点和标识符/trie工件，在报告的解码设置下，我们在MS MARCO Dev和TREC-DL 2019/2020上复现了主要有效性结果，并在我们的硬件设置中证实了报告的束大小-延迟权衡。在复现之外，我们引入了规划漂移诊断，量化意图保持的查询变体如何改变规划器的top-n候选集和最高权重规划器令牌，以及这些变化如何影响引导解码。我们发现PAG的规划信号在词汇表面形式变化下是脆弱的：意图保持的拼写错误可能触发规划崩溃，其中规划的候选池变化足够大，使得前瞻奖励几乎无法提供有用的指导，实际上使解码退回到较弱的无引导搜索。我们进一步使用非英语mMARCO查询对英语索引评估了固定索引的跨语言鲁棒性，并评估了无需重新索引的查询端缓解策略；在我们的设置中，查询翻译提供了最强的恢复。总体而言，我们的结果证实了PAG报告的有效性以及在发布的推理设置下规划引导解码的优势，同时表明这些增益依赖于规划信号在现实查询变化和查询-文档不匹配下的稳定性。

英文摘要

Generative retrieval (GR) ranks documents by autoregressively generating document identifiers. Because many GR methods rely on trie-constrained beam search, they are vulnerable to early pruning of relevant prefixes under finite-beam decoding. Planning Ahead in Generative Retrieval (PAG) mitigates this failure mode by using simultaneous decoding to compute a document-level look-ahead prior that guides subsequent sequential decoding. We reproduce PAG at inference time and stress-test its decoding behavior. Using the authors' released checkpoint and identifier/trie artifacts under the reported decoding setup, we reproduce the main effectiveness results on MS MARCO Dev and TREC-DL 2019/2020, and corroborate the reported beam-size-latency trade-off in our hardware setting. Beyond reproduction, we introduce plan drift diagnostics that quantify how intent-preserving query variations alter the planner's top-n candidate set and highest-weight planner tokens, and how these changes affect guided decoding. We find that PAG's planning signal is brittle under lexical surface-form variation: intent-preserving typos can trigger plan collapse, where the planned candidate pool shifts enough that the look-ahead bonus provides little useful guidance, effectively reverting decoding toward weaker unguided search. We further evaluate fixed-index cross-lingual robustness using non-English mMARCO queries against an English index, and assess query-side mitigation strategies that require no re-indexing; query translation provides the strongest recovery in our setting. Overall, our results confirm PAG's reported effectiveness and the benefit of planning-guided decoding under the released inference setup, while showing that these gains depend on the stability of the planning signal under realistic query variation and query-document mismatch.
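As a rough illustration of what a plan drift diagnostic could measure, the sketch below compares the planner's top-n candidate set for an original query with the set for an intent-preserving variant. Using Jaccard overlap, the function name `topn_overlap`, and the document ids are all assumptions made for illustration; the abstract does not specify the exact metric.

```python
def topn_overlap(original_ids, variant_ids):
    """Jaccard overlap between two top-n candidate sets (1.0 = identical sets)."""
    a, b = set(original_ids), set(variant_ids)
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


# Hypothetical planner outputs for a clean query and a typo variant.
clean_topn = ["d1", "d2", "d3", "d4"]
typo_topn = ["d1", "d2", "d9", "d8"]
drift = 1.0 - topn_overlap(clean_topn, typo_topn)
```

A drift close to 1 would correspond to the situation the abstract calls plan collapse, where the planned candidate pool has shifted so far that the look-ahead bonus offers little guidance.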


2604.20022 2026-05-26 cs.LG cs.AI cs.CL 版本更新

MoBayes: A Modular Bayesian Framework for Separating Reasoning from Language in Conversational Clinical Decision Support

MoBayes：一种用于对话式临床决策支持中推理与语言分离的模块化贝叶斯框架

Yusuf Kesmen, Fay Elhassan, Jiayi Ma, Julien Stalhandske, Yena Chang, David Sasu, Alexandra Kulinkina, Akhil Arora, Lars Klein, Mary-Anne Hartley

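Affiliations: LiGHT, EPFL; University of Bern; Aarhus University

AI summary: Proposes MoBayes, a framework that separates reasoning from language by using the LLM as a language interface and a Bayesian module for probabilistic inference. In clinical decision support it outperforms standalone frontier LLM doctors.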

Comments 50 pages including appendix, 13 figures, 22 tables. Preprint

AI中文摘要

大型语言模型（LLM）越来越多地用于对话式临床决策支持，但它们将下一个标记预测与概率决策混为一谈。我们认为这种混淆反映了架构上的局限性：此类系统缺乏显式的后验追踪、可控的弃权阈值和可审计的推理链。我们引入MoBayes，一个模块化贝叶斯对话框架，将推理与语言分离。LLM仅作为语言接口，将患者对话解析为结构化观察，而贝叶斯模块对这些观察进行概率推理以更新后验，通过期望信息增益选择后续问题，并通过校准的决策阈值决定何时停止或推迟。这种设计实现了显式后验追踪、可控的选择性决策，以及无需重新训练语言模型即可替换的特定人群统计后端。在经验知识和LLM生成的知识库上，MoBayes优于独立的前沿LLM医生，包括匹配模型系列的比较，其中廉价的传感器模型与MoBayes配对以较低成本超过更大的自主模型。在对抗性患者沟通风格和不同诊断场景下，该优势依然存在。这些结果表明，可靠的对话式临床决策支持系统应将概率推理与语言生成分离，而不是仅扩大模型规模。代码可在https://anonymous.4open.science/r/MoBayes/获取。

英文摘要

Large language models (LLMs) are increasingly used for conversational clinical decision support, yet they conflate next-token prediction with probabilistic decision making. We argue that this conflation reflects an architectural limitation: such systems lack explicit posterior tracking, controllable abstention thresholds, and auditable reasoning chains. We introduce MoBayes, a Modular Bayesian dialogue framework that separates reasoning from language. The LLM acts only as a language interface, parsing patient conversation into structured observations, while a Bayesian module performs probabilistic inference over these observations to update posteriors, select follow-up questions via expected information gain, and determine when to stop or defer through calibrated decision thresholds. This design enables explicit posterior tracking, controllable selective decision-making, and replaceable population-specific statistical backends without retraining the language model. Across empirical and LLM-generated knowledge bases, MoBayes outperforms standalone frontier LLM doctors, including matched model-family comparisons where inexpensive sensor models paired with MoBayes exceed larger autonomous models at lower cost. The advantage persists under adversarial patient communication styles and across varying diagnostic scenarios. These results suggest that reliable conversational clinical decision support systems should separate probabilistic reasoning from language generation rather than scaling model size alone. Code is available at https://anonymous.4open.science/r/MoBayes/
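The Bayesian loop described above (update a posterior, pick the next question by expected information gain, stop at a threshold) can be sketched in a few lines. This is a minimal toy under stated assumptions, not the MoBayes implementation: the two-condition prior, the yes/no findings, the likelihood values, and the 0.8 stopping threshold are all hypothetical.

```python
import math


def entropy(dist):
    """Shannon entropy in bits of a discrete distribution given as a dict."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def update(prior, p_yes_given, answer):
    """Bayes update for a yes/no finding; p_yes_given[d] = P(yes | condition d)."""
    unnorm = {
        d: prior[d] * (p_yes_given[d] if answer else 1.0 - p_yes_given[d])
        for d in prior
    }
    z = sum(unnorm.values())
    return {d: v / z for d, v in unnorm.items()}


def expected_information_gain(prior, p_yes_given):
    """Prior entropy minus the expected posterior entropy over both answers."""
    p_yes = sum(prior[d] * p_yes_given[d] for d in prior)
    gain = entropy(prior)
    for answer, p_answer in ((True, p_yes), (False, 1.0 - p_yes)):
        if p_answer > 0:
            gain -= p_answer * entropy(update(prior, p_yes_given, answer))
    return gain


# Hypothetical knowledge base: P(finding present | condition).
prior = {"flu": 0.5, "cold": 0.5}
findings = {
    "fever": {"flu": 0.9, "cold": 0.2},  # discriminative
    "cough": {"flu": 0.8, "cold": 0.8},  # equally likely under both conditions
}

next_question = max(findings, key=lambda f: expected_information_gain(prior, findings[f]))
posterior = update(prior, findings[next_question], answer=True)
STOP_THRESHOLD = 0.8  # assumed decision threshold, for illustration only
should_stop = max(posterior.values()) >= STOP_THRESHOLD
```

In this toy setting the question about fever is selected because the cough finding cannot change the posterior, which is the behaviour an expected-information-gain criterion is meant to produce.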


2604.19151 2026-05-26 cs.CL cs.SD eess.AS 版本更新

Voice of India: A Large-Scale Benchmark for Real-World Speech Recognition in India

印度之声：面向印度真实世界语音识别的大规模基准

Kaushal Bhogale, Manas Dhir, Amritansh Walecha, Manmeet Kaur, Vanshika Chhabra, Aaditya Pareek, Hanuman Sidh, Mahima Manik, Sagar Jain, Bhaskar Singh, Utkarsh Singh, Tahir Javed, Shobhit Banga, Mitesh M. Khapra

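Affiliations: Indian Institute of Technology, Madras, India; Josh Talks, India

AI summary: To address the limitations of existing Indic ASR benchmarks, the authors propose Voice of India, a closed-source benchmark built from unscripted telephonic conversations. It covers 15 major Indian languages across 139 regional clusters, contains 306,230 utterances (536 hours), and analyzes how geography, audio quality, speaking rate, gender, and device type affect ASR performance.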

Comments 6 pages, 4 figures

AI中文摘要

现有的Indic ASR基准通常使用脚本化的、干净的语音和基于排行榜的评估，这鼓励了针对数据集的过拟合。此外，严格的单参考WER会惩罚印度语言中的自然拼写变体，包括非标准拼写的代码混合英语起源词。为了解决这些局限性，我们引入了Voice of India，这是一个从非脚本电话对话构建的封闭源基准，覆盖15种主要印度语言，跨越139个区域集群。该数据集包含306230条语音，总计536小时的语音，来自36691名说话人，转录考虑了拼写变体。我们还在地理上按地区分析了性能，揭示了差异。最后，我们提供了跨音频质量、语速、性别和设备类型等因素的详细分析，突出了当前ASR系统在哪些方面存在困难，并为改进真实世界的Indic ASR系统提供了见解。

英文摘要

Existing Indic ASR benchmarks often use scripted, clean speech and leaderboard-driven evaluation that encourages dataset-specific overfitting. In addition, strict single-reference WER penalizes natural spelling variation in Indian languages, including non-standardized spellings of code-mixed English-origin words. To address these limitations, we introduce Voice of India, a closed-source benchmark built from unscripted telephonic conversations covering 15 major Indian languages across 139 regional clusters. The dataset contains 306,230 utterances, totaling 536 hours of speech from 36,691 speakers, with transcripts that account for spelling variations. We also analyze performance geographically at the district level, revealing disparities. Finally, we provide detailed analysis across factors such as audio quality, speaking rate, gender, and device type, highlighting where current ASR systems struggle and offering insights for improving real-world Indic ASR systems.


2604.18170 2026-05-26 cs.CL cs.AI 版本更新

Copy-as-Decode: Grammar-Constrained Parallel Prefill for LLM Editing

Copy-as-Decode: 面向LLM编辑的语法约束并行预填充

Ziyang Liu

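AI summary: Proposes Copy-as-Decode, a mechanism that accelerates LLM editing through grammar-constrained parallel prefill, achieving up to a 303x speedup over autoregressive decoding while maintaining high coverage and losslessness.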

Comments The authors have decided to withdraw this version following internal review regarding authorship and contribution agreements

AI中文摘要

LLMs通过自回归地重新生成完整输出来编辑文本和代码，即使大多数标记在输入中逐字出现。我们研究Copy-as-Decode，一种解码层机制，将编辑生成重新表述为基于两个原语语法的结构化解码：<copy lines="i-j"/>引用输入行范围，<gen>...</gen>生成新内容。一个标记级FSM保证语法有效性，服务层原语通过单次并行预填充前向（而非N步自回归步骤）更新每个复制跨度的KV缓存——共享推测解码的并行前向内核，但以输入标记作为草稿，程序强制接受替代概率验证。我们报告一个无需端到端训练的上界分析。(i) 内核加速：在Qwen2.5-{1.5B, 7B}上，通过并行预填充复制N个标记比自回归快6.8倍至303倍（N ∈ [8, 512]，A100 80GB bf16）。(ii) 复制上限：在ProbeEdit和HumanEvalPack-Fix (Py/JS)上，74%–98%的金标准标记在行级原语下可达；结合每个语料库跨度直方图上的经验内核，得到闭式挂钟时间上界29.0倍/3.4倍/4.2倍（合并13.0倍）。标记级扩展达到91%–99%覆盖率，下界4.5倍–6.5倍。(iii) 流水线无损性：预言程序通过确定性解析器在所有482个案例上往返，将任何下游失败定位到跨度选择而非机制。扰动研究表明，在离一噪声下，合并EM从100%降至15.48%。在Qwen2.5-Coder-1.5B上的微调实验将HEvalFix-Py EM从0/33（未训练）提升至12%–17%，这是一个可学习性信号，而非生产选择器。批处理服务集成和多文件覆盖作为后续工作。

英文摘要

LLMs edit text and code by autoregressively regenerating the full output, even when most tokens appear verbatim in the input. We study Copy-as-Decode, a decoding-layer mechanism that recasts edit generation as structured decoding over a two-primitive grammar: <copy lines="i-j"/> references an input line range, <gen>...</gen> emits new content. A token-level FSM guarantees syntactic validity, and a serving-layer primitive updates the KV cache for each copy span via a single parallel-prefill forward rather than $N$ autoregressive steps -- sharing the parallel-forward kernel of speculative decoding but with input tokens as the draft and program-enforced acceptance replacing probabilistic verification. We report an upper-bound analysis that requires no end-to-end training. (i) Kernel speedup: on Qwen2.5-{1.5B, 7B}, copying $N$ tokens via parallel prefill is $6.8\times$--$303\times$ faster than autoregressive ($N \in [8, 512]$, A100 80GB bf16). (ii) Copy ceiling: on ProbeEdit and HumanEvalPack-Fix (Py/JS), $74$--$98\%$ of gold tokens are reachable under the line-level primitive; composed with the empirical kernel over each corpus's span histogram this yields a closed-form wall-clock bound of $29.0\times / 3.4\times / 4.2\times$ ($13.0\times$ pooled). A token-level extension reaches $91$--$99\%$ coverage with $4.5\times$--$6.5\times$ floors. (iii) Pipeline losslessness: oracle programs round-trip through the deterministic resolver on all $482$ cases, localizing any downstream failure to span selection rather than the mechanism. A perturbation study shows pooled EM drops from $100\%$ to $15.48\%$ under off-by-one noise. A fine-tuning pilot on Qwen2.5-Coder-1.5B lifts HEvalFix-Py EM from $0/33$ (untrained) to $12$--$17\%$, a learnability signal, not a production selector. Batched-serving integration and multi-file coverage are scoped as follow-up.
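To make the two-primitive grammar concrete, the following sketch expands an edit program made of `<copy lines="i-j"/>` and `<gen>...</gen>` primitives back into output text. The tag syntax follows the abstract; the function name `resolve_edit_program`, the 1-indexed inclusive line ranges, and the treatment of the `<gen>` body as literal text are assumptions for illustration and may differ from the author's deterministic resolver.

```python
import re

# Matches <copy lines="i-j"/> or <gen>...</gen>; the gen body may span lines.
_PRIMITIVE = re.compile(
    r'<copy lines="(\d+)-(\d+)"/>|<gen>(.*?)</gen>',
    re.DOTALL,
)


def resolve_edit_program(program: str, source: str) -> str:
    """Expand an edit program against the source text.

    Assumption: copy ranges are 1-indexed and inclusive.
    """
    src_lines = source.splitlines(keepends=True)
    out = []
    for match in _PRIMITIVE.finditer(program):
        if match.group(1) is not None:  # <copy lines="i-j"/>
            start, end = int(match.group(1)), int(match.group(2))
            if not (1 <= start <= end <= len(src_lines)):
                raise ValueError(f"copy range {start}-{end} is out of bounds")
            out.extend(src_lines[start - 1:end])
        else:  # <gen>...</gen>
            out.append(match.group(3))
    return "".join(out)


source = "def add(a, b):\n    return a - b\n\nprint(add(1, 2))\n"
program = (
    '<copy lines="1-1"/>'
    "<gen>    return a + b\n</gen>"
    '<copy lines="3-4"/>'
)
fixed = resolve_edit_program(program, source)
```

Only the single changed line is generated here; the other three lines are referenced by range, which is the case the copy primitive is designed to make cheap.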


2604.18128 2026-05-26 cs.CL cs.AI cs.LG 版本更新

Depth Registers Unlock W4A4 on SwiGLU: A Reader/Generator Decomposition

深度寄存器解锁 SwiGLU 上的 W4A4：一种读取器/生成器分解

Ziyang Liu

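AI summary: Using a training-time intervention that combines Depth Registers with a hinge loss (DR+sink), this study reduces the W4A4-quantized perplexity of a SwiGLU decoder language model from 1727 to 119. Its decomposition indicates that residual-axis readers dominate the error, while the bilinear input of the generator w2 is the main cause of the remaining gap.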

Comments The authors have decided to withdraw this version following internal review regarding authorship and contribution agreements

AI中文摘要

我们在一个受控的 300M 参数 SwiGLU 解码器语言模型（在 FineWeb-Edu 的 5B 令牌上训练）中研究训练后 W4A4 量化，并询问哪些输入激活位点主导误差。朴素的四舍五入 W4A4 将验证困惑度从 FP16 的 23.6 降至 1727。一种简单的残差轴训练时干预——带有寄存器幅度铰链损失的深度寄存器（DR+sink）——在匹配的 FP16 PPL 和匹配的零样本能力下，将其降至 119（约 14 倍），并与 SmoothQuant 组合达到 39.9 PPL。与 FP16 之间约 2 PPL 的剩余差距是诊断核心。我们按输入激活位点分解 W4A4 损伤：SwiGLU 块中的五个可训练线性层分为残差轴读取器（qkv, w1, w3）和块内生成器（o_proj, w2）。基本的范数论证表明，残差轴幅度控制紧密约束读取器，但 w2 的双线性输入仅受因子范数平凡乘积的约束；经验上，DR+sink 降低了读取器的峰度，而生成器基本不变，并且读取器恢复的 W4A4 残差在三个匹配检查点上平坦约为 0.28 nats，其中 Delta-remove(w2) 占主导。我们将 DR+sink 作为训练时探针而非部署方案提出：一种事后替代方案（Per-Linear QuaRot）在读取器轴上几乎与之匹配。完整的 QuaRot——添加在线每头值 Hadamard 和在线 w2 输入旋转——也没有缩小差距，直接验证了正交旋转无法约束双线性 SwiGLU 尾部的预测。这些主张特定于我们的 300M、5B 令牌、单种子设置，并且我们的实验未将分区与铰链分离。

英文摘要

We study post-training W4A4 quantization in a controlled 300M-parameter SwiGLU decoder-only language model trained on 5B tokens of FineWeb-Edu, and ask which input-activation sites dominate the error. Naive round-to-nearest W4A4 collapses validation perplexity from FP16 23.6 to 1727. A simple residual-axis training-time intervention -- Depth Registers with a register-magnitude hinge loss (DR+sink) -- reduces this to 119 (about 14x) at matched FP16 PPL and matched zero-shot capacity, and composes with SmoothQuant to 39.9 PPL. The residual ~2 PPL gap to FP16 is the diagnostic core. We decompose W4A4 damage by input-activation site: the five trainable linears in a SwiGLU block split into residual-axis readers (qkv, w1, w3) and block-internal generators (o_proj, w2). Elementary norm arguments show residual-axis magnitude control bounds readers tightly but leaves w2's bilinear input bounded only by the trivial product of factor bounds; empirically, DR+sink collapses reader kurtosis while leaving generators essentially unchanged, and the reader-rescued W4A4 residue is flat at ~0.28 nats across three matched checkpoints with Delta-remove(w2) dominating. We present DR+sink as a training-time probe rather than a deployment proposal: a post-hoc alternative (Per-Linear QuaRot) nearly matches it on the reader axis. Full QuaRot -- adding online per-head value Hadamard plus online w2-input rotation -- does not close the gap either, directly testing the prediction that orthogonal rotation cannot bound the bilinear SwiGLU tail. Claims are specific to our 300M, 5B-token, single-seed setting, and our experiments do not isolate the partition from the hinge.
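For readers unfamiliar with why naive round-to-nearest 4-bit quantization can be so damaging, the sketch below shows how a single large-magnitude value stretches the quantization grid and wipes out the small values around it. It assumes symmetric per-tensor scaling onto the integer grid [-7, 7]; the abstract does not state the exact quantizer configuration, and the sample values are made up.

```python
def quantize_rtn_int4(values):
    """Symmetric per-tensor 4-bit round-to-nearest; returns dequantized values."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return list(values)
    scale = max_abs / 7  # assumed symmetric int4 grid: integers in [-7, 7]
    return [max(-7, min(7, round(v / scale))) * scale for v in values]


def mean_abs_error(original, dequantized):
    return sum(abs(a - b) for a, b in zip(original, dequantized)) / len(original)


clean = [0.1, -0.2, 0.3, 0.25]
with_outlier = clean + [12.0]  # one heavy-tailed activation

err_clean = mean_abs_error(clean, quantize_rtn_int4(clean))
# Error on the same four small values once the outlier sets the scale.
err_small_values = mean_abs_error(clean, quantize_rtn_int4(with_outlier)[:4])
```

With the outlier present, every small value rounds to zero, which is one intuition for why interventions that control activation magnitude can help on the sites they reach.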


2604.12376 2026-05-26 cs.CL cs.AI 版本更新

Cooperative Memory Paging with Keyword Bookmarks for Long-Horizon LLM Conversations

面向长程LLM对话的协作式内存分页与关键词书签

Ziyang Liu

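AI summary: Proposes cooperative paging, which replaces evicted conversation segments with keyword bookmarks and gives the model a recall() tool for on-demand retrieval. It achieves the best answer quality on the LoCoMo benchmark across four models, and ablations identify the key factors in paging design.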

Comments The authors have decided to withdraw this version following internal review regarding authorship and contribution agreements

AI中文摘要

当LLM对话超出上下文窗口时，旧内容必须被驱逐——但模型在需要时如何恢复它们？我们提出协作式分页：被驱逐的片段被替换为最小关键词书签（[pN:keywords]，每个约8-24个token），并赋予模型一个 recall() 工具以按需检索完整内容。在 LoCoMo 基准（10个真实多会话对话，300+轮次）上，协作式分页在四种模型（GPT-4o-mini、DeepSeek-v3.2、Claude Haiku、GLM-5）的六种方法中实现了最高的答案质量——优于截断、BM25、词重叠检索、搜索工具基线和完整上下文——由四个独立的LLM评判员确认（p=0.017，配对bootstrap）。随后，我们通过边界策略和驱逐策略的5x4消融实验（3,176个合成探针，1,600个LoCoMo探针）研究分页设计空间。关键发现：（1）粗粒度固定大小页面（fixed_20）达到96.7%，而内容感知的topic_shift降至56.7%；（2）驱逐策略的选择依赖于数据（FIFO在合成数据上最佳，LFU在LoCoMo上最佳）；（3）两种书签生成策略相比启发式基线有提升（+4.4和+8.7个E2E点）；（4）剩余瓶颈是书签区分度——模型96%的时间触发recall()，但当书签区分度不足时，仅57%选择正确页面。关键词特异性单独造成25个百分点的准确率差异。

英文摘要

When LLM conversations grow beyond the context window, old content must be evicted -- but how does the model recover it when needed? We propose cooperative paging: evicted segments are replaced with minimal keyword bookmarks ([pN:keywords], ~8-24 tokens each), and the model is given a recall() tool to retrieve full content on demand. On the LoCoMo benchmark (10 real multi-session conversations, 300+ turns), cooperative paging achieves the highest answer quality among six methods -- outperforming truncation, BM25, word-overlap retrieval, a search-tool baseline, and full context -- on four models (GPT-4o-mini, DeepSeek-v3.2, Claude Haiku, GLM-5), confirmed by four independent LLM judges ($p=0.017$, paired bootstrap). We then study the paging design space with a 5x4 ablation over boundary strategies and eviction policies (3,176 synthetic probes, 1,600 LoCoMo probes). Key findings: (1) coarse fixed-size pages (fixed_20) reach 96.7% while content-aware topic_shift collapses to 56.7%; (2) eviction policy choice is data-dependent (FIFO best on synthetic, LFU on LoCoMo); (3) two bookmark generation strategies improve over the heuristic baseline (+4.4 and +8.7 E2E points); (4) the remaining bottleneck is bookmark discrimination -- the model triggers recall() 96% of the time but selects the correct page only 57% when bookmarks are insufficiently distinctive. Keyword specificity alone accounts for a 25 percentage point accuracy difference.
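The paging idea can be sketched as a small data structure: fixed-size pages, FIFO eviction, a `[pN:keywords]` bookmark left behind for each evicted page, and a `recall()` method that returns the full content. The class name `PagedContext`, the frequency-based keyword heuristic, and the tiny page sizes are assumptions for illustration only; the paper's bookmark generation strategies are not described in enough detail here to reproduce.

```python
from collections import Counter, OrderedDict


class PagedContext:
    def __init__(self, page_size=2, max_live_pages=1, n_keywords=3):
        self.page_size = page_size
        self.max_live_pages = max_live_pages
        self.n_keywords = n_keywords
        self.live = OrderedDict()       # page id -> turns still in context
        self.store = {}                 # page id -> evicted turns
        self.bookmarks = OrderedDict()  # page id -> "[pN:keywords]"
        self._buffer = []
        self._next_id = 1

    def _keywords(self, turns):
        # Hypothetical heuristic: most frequent words longer than 3 characters,
        # ties broken alphabetically so the output is deterministic.
        words = [w.strip(".,!?").lower() for t in turns for w in t.split()]
        counts = Counter(w for w in words if len(w) > 3)
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        return [w for w, _ in ranked[: self.n_keywords]]

    def add_turn(self, turn):
        self._buffer.append(turn)
        if len(self._buffer) == self.page_size:
            self.live[self._next_id] = self._buffer
            self._buffer = []
            self._next_id += 1
            while len(self.live) > self.max_live_pages:
                pid, turns = self.live.popitem(last=False)  # FIFO eviction
                self.store[pid] = turns
                self.bookmarks[pid] = f"[p{pid}:{','.join(self._keywords(turns))}]"

    def render(self):
        """Context shown to the model: bookmarks first, then live turns."""
        parts = list(self.bookmarks.values())
        for turns in self.live.values():
            parts.extend(turns)
        parts.extend(self._buffer)
        return "\n".join(parts)

    def recall(self, pid):
        """Return the full text of an evicted page."""
        return "\n".join(self.store[pid])


ctx = PagedContext()
for turn in [
    "My dog Biscuit loves hiking.",
    "Biscuit hiked Mount Rainier.",
    "I started a pottery class.",
    "The pottery wheel is hard.",
]:
    ctx.add_turn(turn)
```

The abstract's fourth finding maps directly onto `_keywords`: if two pages end up with similar bookmarks, the model may call `recall()` on the wrong page even when it correctly decides that recall is needed.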


2604.08501 2026-05-26 cs.DL cs.CL cs.SE 版本更新

sciwrite-lint: Verification Infrastructure for the Age of Science Vibe-Writing

sciwrite-lint：科学氛围写作时代的验证基础设施

Sergey V Samsonau

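Affiliations: Authentic Research Partners; Princeton, NJ

AI summary: To counter citation hallucination in AI-assisted writing, the author proposes sciwrite-lint, a citation verification tool based on the linting paradigm from software engineering. It runs locally on the researcher's machine, quickly checks reference existence, metadata accuracy, retraction status, and claim support, and assesses citation-chain integrity.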

Comments Code: https://github.com/authentic-research-partners/sciwrite-lint

AI中文摘要

科学论文通过引用对先前工作提出主张。大规模验证这些引用（每篇被引论文是否存在、是否支持引用主张、本身是否可靠）在结构上超出了人类评审的能力：一篇典型论文有数十条引用，而仔细的评审者最多通读少数几篇。AI辅助写作使这一差距更加紧迫：LLM会幻觉化参考文献，并从它们从未读过的论文标题或摘要中填充看似合理的细节，对于隐私意识研究者必须使用的较小本地权重模型，情况更糟。 sciwrite-lint将软件工程中的lint范式应用于引用验证：它完全在研究者机器上运行（免费公共数据库、单个消费级GPU和开放权重模型），速度足够快，可在修订之间重新lint，使作者在起草时就能从源头发现问题，并为期刊和评审者提供自动化的第一遍检查。该流程检查引用存在性、元数据准确性、撤回状态和主张支持，遍历被引论文参考文献的一级深度，并生成每篇引用的可靠性评分。我们在30篇未见过的论文（arXiv和bioRxiv）上进行了评估，包括错误注入和LLM裁决的假阳性分析。相同的lint工作流程扩展到内部一致性：文本与表格中的数字、摘要与正文、图注与内容、统计结果与其文字解释，以及结构交叉引用（悬空引用、孤立参考文献）。作为独立的实验贡献，我们还提出了SciLint评分：引用链完整性与一个贡献组件相结合，该组件操作了五个科学哲学框架（Popper、Lakatos、Kitcher、Laudan、Mayo）。

英文摘要

Scientific papers make claims about prior work backed by citations. Verifying those citations at scale (that each cited paper exists, says what the citation claims, and is itself reliable) is structurally beyond what human review can deliver: a typical paper has dozens of citations, and a careful reviewer reads at most a handful end-to-end. AI-assisted writing makes this gap even more urgent: LLMs hallucinate references and may fill in plausible details from titles or abstracts of papers they never read, worse for the smaller local-weights models that privacy-aware researchers must use. sciwrite-lint applies the linting paradigm from software engineering to citation verification: it runs entirely on the researcher's machine (free public databases, a single consumer GPU, and open-weights models), is fast enough to re-lint between revisions so authors catch problems at the source while drafting, and serves journals and reviewers as an automated first pass. The pipeline checks reference existence, metadata accuracy, retraction status, and claim support, traverses one level into cited papers' bibliographies, and produces per-reference reliability scores. We evaluate on 30 unseen papers (arXiv and bioRxiv) with error injection and LLM-adjudicated false-positive analysis. The same linting workflow extends to internal consistency: numbers in text vs. tables, abstract vs. body, figure captions vs. content, statistical results vs. their verbal interpretation, plus structural cross-references (dangling cites, orphan references). As a separate experimental contribution we also propose SciLint Score: citation-chain integrity combined with a contribution component operationalizing five philosophy-of-science frameworks (Popper, Lakatos, Kitcher, Laudan, Mayo).


2603.20479 2026-05-26 cs.CY cs.AI cs.CL 版本更新

Profiling learners' affective engagement: Emotion AI, intercultural pragmatics, and language learning

学习者情感投入画像：情感AI、跨文化语用学与语言学习

Robert Godwin-Jones

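Affiliations: Virginia Commonwealth University

AI summary: This article examines the use of Emotion AI in language learning, in particular how automatic emotion recognition and simulated human responsiveness affect the development of pragmatic and interactional competence, and discusses its benefits for personalized learning alongside the risk of emotional manipulation.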

DOI: 10.64152/10125/73679
Journal ref: Language Learning & Technology, 30(2), 14-35 (2026)

AI中文摘要

学习另一种语言可能是一个高度情感化的过程，通常以无数大大小小的挫折和成功为特征。对大多数学习者而言，语言学习并非遵循线性、可预测的路径，其曲折进程受动机（或去动机）变量影响，如个人特征、师生关系、学习材料以及对未来第二语言自我的梦想。虽然语言学习的某些方面（阅读、语法）相对机械，但其他方面可能充满压力且不可预测，尤其是用目标语言交谈。这种体验不仅需要结构和词汇知识，还需要以适合社会和文化语境的方式使用语言的能力。AI聊天机器人的出现为练习会话能力提供了新机会，既有优势（响应迅速、无评判），也有缺点（缺乏情感、文化偏见）。本文探讨了技术使用中产生的情感方面，特别是自动情感识别和AI系统中模拟的人类响应如何与语言学习以及语用和互动能力的发展相互作用。情感AI，即算法驱动对用户情感信号的解读，被认为能够实现更个性化的学习，适应感知到的学习者认知和情感状态。其他人则警告情感操纵以及不恰当和无效的用户画像。

英文摘要

Learning another language can be a highly emotional process, typically characterized by numerous frustrations and triumphs, big and small. For most learners, language learning does not follow a linear, predictable path, its zigzag course shaped by motivational (or demotivating) variables such as personal characteristics, teacher/peer relationships, learning materials, and dreams of a future L2 (second language) self. While some aspects of language learning (reading, grammar) are relatively mechanical, others can be stressful and unpredictable, especially conversing in the target language. That experience necessitates not only knowledge of structure and lexis, but also the ability to use the language in ways that are appropriate to the social and cultural context. A new opportunity to practice conversational abilities has arrived through the availability of AI chatbots, with both advantages (responsive, non-judgmental) and drawbacks (emotionally void, culturally biased). This column explores aspects of emotion as they arise in technology use and in particular how automatic emotion recognition and simulated human responsiveness in AI systems interface with language learning and the development of pragmatic and interactional competence. Emotion AI, the algorithmically driven interpretation of users' affective signals, has been seen as enabling more personalized learning that adapts to perceived learner cognitive and emotional states. Others warn of emotional manipulation and inappropriate and ineffective user profiling.


2603.11583 2026-05-26 cs.CL cs.AI 版本更新

UtilityMax Prompting: A Formal Framework for Multi-Objective Large Language Model Tasks

UtilityMax Prompting：多目标大语言模型任务的形式化框架

Ofir Marom

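Affiliations: Independent Researcher

AI summary: Proposes UtilityMax Prompting, a framework that formalizes multi-objective LLM tasks using influence diagrams and expected-utility maximization. On the MovieLens 1M dataset it improves precision and NDCG over natural language baselines.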

AI中文摘要

大语言模型（LLM）任务的成功在很大程度上取决于其提示词。大多数用例使用自然语言指定提示词，当必须同时满足多个目标时，自然语言本质上是模糊的。在本文中，我们引入了UtilityMax Prompting，一个使用形式化数学语言指定任务的框架。我们将任务重构为一个影响图，其中LLM的答案是唯一的决策变量。在图中条件概率分布上定义效用函数，并指示LLM找到最大化期望效用的答案。这迫使LLM明确推理目标的每个组成部分，将其输出导向精确的优化目标，而非主观的自然语言解释。我们在MovieLens 1M数据集上，使用三个前沿模型（Claude Sonnet 4.6、GPT-5.4和Gemini 2.5 Pro）验证了我们的方法，在多目标电影推荐任务中，与自然语言基线相比，在精度和归一化折损累计增益（NDCG）上表现出一致的改进。

英文摘要

The success of a Large Language Model (LLM) task depends heavily on its prompt. Most use-cases specify prompts using natural language, which is inherently ambiguous when multiple objectives must be simultaneously satisfied. In this paper we introduce UtilityMax Prompting, a framework that specifies tasks using formal mathematical language. We reconstruct the task as an influence diagram in which the LLM's answer is the sole decision variable. A utility function is defined over the conditional probability distributions within the diagram, and the LLM is instructed to find the answer that maximises expected utility. This constrains the LLM to reason explicitly about each component of the objective, directing its output toward a precise optimization target rather than a subjective natural language interpretation. We validate our approach on the MovieLens 1M dataset across three frontier models (Claude Sonnet 4.6, GPT-5.4, and Gemini 2.5 Pro), demonstrating consistent improvements in precision and Normalized Discounted Cumulative Gain (NDCG) over natural language baselines in a multi-objective movie recommendation task.


2603.05450 2026-05-26 cs.AI cs.CL 版本更新

Distributed Partial Information Puzzles: Examining Common Ground Construction Under Epistemic Asymmetry

分布式部分信息谜题：在认知不对称下检验共同基础的构建

Yifan Zhu, Mariah Bradford, Kenneth Lai, Timothy Obiso, Videep Venkatesha, James Pustejovsky, Nikhil Krishnaswamy

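Affiliations: Brandeis University; Colorado State University

AI summary: Introduces the Distributed Partial Information Puzzle (DPIP) task, collects a multimodal dataset, and evaluates how well large language models and a Dynamic Epistemic Logic approach track belief states and common ground construction.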

Comments 10 pages, 4 figures

Journal ref: Proceedings of COLING-LREC 2026

AI中文摘要

建立共同基础（一组共享的信念和相互认可的事实）对于协作至关重要，但仍然是当前AI系统面临的挑战，尤其是在多模态、多方设置中，协作者带来不同的信息。我们引入了分布式部分信息谜题（DPIP），这是一个协作构建任务，在认知不对称下引发丰富的多模态交流。我们提供了这些交互的多模态数据集，并在语音、手势和动作模态上进行注释和时间对齐，以支持对命题内容和信念动态的推理。然后，我们评估了两种建模共同基础（CG）的范式：（1）最先进的大语言模型（LLMs），被提示从多模态更新中推断共享信念，以及（2）基于动态认知逻辑（DEL）的公理流水线，逐步执行相同的任务。在注释的DPIP数据上的结果表明，它对现代LLMs跟踪任务进展和信念状态的能力构成了挑战。

英文摘要

Establishing common ground, a shared set of beliefs and mutually recognized facts, is fundamental to collaboration, yet remains a challenge for current AI systems, especially in multimodal, multiparty settings, where the collaborators bring different information to the table. We introduce the Distributed Partial Information Puzzle (DPIP), a collaborative construction task that elicits rich multimodal communication under epistemic asymmetry. We present a multimodal dataset of these interactions, annotated and temporally aligned across speech, gesture, and action modalities to support reasoning over propositional content and belief dynamics. We then evaluate two paradigms for modeling common ground (CG): (1) state-of-the-art large language models (LLMs), prompted to infer shared beliefs from multimodal updates, and (2) an axiomatic pipeline grounded in Dynamic Epistemic Logic (DEL) that incrementally performs the same task. Results on the annotated DPIP data indicate that it poses a challenge to modern LLMs' abilities to track both task progression and belief state.


2602.19878 2026-05-26 cs.CL cs.LO 版本更新

Axis-Aligned Semantics for ODRL: Resolving Dimensional Ambiguity in Policy Constraints

面向ODRL的轴对齐语义：解决策略约束中的维度歧义

Daham Mustafa, Diego Collarana, Sabrina Kirrane, Christoph Lange, Christoph Quix, Rafiqul Haque, Yixin Peng, Stefan Decker

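Affiliations: RWTH Aachen University; Fraunhofer FIT; University of Galway; Vienna University of Economics and Business (WU)

AI summary: To resolve the dimensional ambiguity caused by multi-axis operands in ODRL, the authors propose axis decomposition, which turns constraints into scalar operands on each axis and reduces conflict detection to box comparison. They define a three-valued semantics and validate the method's soundness and compatibility on a benchmark.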

Comments 17 pages. Preprint. v3: expanded benchmark to 256 problems; revised semantics and profile (OAAP)

AI中文摘要

开放数字权利语言（ODRL）将策略约束表示为左操作数、运算符和值的三元组。然而，多个空间操作数涉及宽度、高度和深度等多轴域，而约束语法未提供明确的轴标识。因此，策略引擎无法确定多个约束是应用于同一轴还是不同轴，导致冲突检测不可靠或不完整。我们通过轴分解解决这一歧义，将多轴操作数替换为全序域上的轴特定标量操作数。每个约束表示每个轴上的一个区间，每个策略表示一个轴对齐的盒，从而将冲突检测简化为盒比较。我们定义了三值语义（冲突、兼容、未知），证明了分解的正确性及其与ODRL的向后兼容性，将其实例化为ODRL轴对齐配置文件（OAAP），并在包含256个ODRL策略问题的基准测试上进行了验证，每个问题以Turtle表示并编译为一阶形式（TPTP）和SMT-LIB形式，使用了Vampire、E、Z3和cvc5求解器。

英文摘要

The Open Digital Rights Language (ODRL) represents policy constraints as triples of a left operand, an operator, and a value. Several spatial operands, however, range over multi-axis domains such as width, height, and depth, while the constraint syntax provides no explicit axis identity. As a result, policy engines cannot determine whether multiple constraints apply to the same axis or different ones, making conflict detection unsound or incomplete. We resolve this ambiguity by axis decomposition, replacing multi-axis operands with axis-specific scalar operands over totally ordered domains. Each constraint then denotes an interval per axis and each policy an axis-aligned box, reducing conflict detection to box comparison. We define a three-valued semantics (Conflict, Compatible, Unknown), prove the decomposition sound and backward compatible with ODRL, instantiate it as ODRL Axis-Aligned Profile (OAAP), and validate it on a benchmark of 256 ODRL policy problems, each expressed in Turtle and compiled to first-order (TPTP) and SMT-LIB form, using Vampire, E, Z3, and cvc5.
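The reduction from conflict detection to box comparison can be illustrated with a short sketch. Each constraint is modelled as an axis name plus a closed interval, each policy becomes a per-axis intersection of its intervals, and two policies are compared axis by axis. The verdict rules below are an illustrative reading, not the paper's formal semantics: in particular, representing a legacy constraint without axis identity as `None` and mapping it to `Unknown` is an assumption.

```python
from enum import Enum

INF = float("inf")


class Verdict(Enum):
    CONFLICT = "Conflict"
    COMPATIBLE = "Compatible"
    UNKNOWN = "Unknown"


def to_box(constraints):
    """Intersect intervals per axis; also report whether any axis is unlabelled."""
    box, has_unlabelled = {}, False
    for axis, (lo, hi) in constraints:
        if axis is None:  # assumed encoding of a constraint with no axis identity
            has_unlabelled = True
            continue
        cur_lo, cur_hi = box.get(axis, (-INF, INF))
        box[axis] = (max(cur_lo, lo), min(cur_hi, hi))
    return box, has_unlabelled


def compare(policy_a, policy_b):
    box_a, unlabelled_a = to_box(policy_a)
    box_b, unlabelled_b = to_box(policy_b)
    for axis in box_a.keys() & box_b.keys():
        lo = max(box_a[axis][0], box_b[axis][0])
        hi = min(box_a[axis][1], box_b[axis][1])
        if lo > hi:  # empty intersection on a shared axis
            return Verdict.CONFLICT
    if unlabelled_a or unlabelled_b:
        return Verdict.UNKNOWN
    return Verdict.COMPATIBLE


# Hypothetical policies over image dimensions in pixels.
thumbnail = [("width", (0, 800)), ("height", (0, 600))]
high_res = [("width", (1024, INF))]
tall = [("height", (300, 900))]
legacy = [(None, (0, 500))]
```

Because every axis is a totally ordered scalar domain, the comparison is a handful of `max`/`min` operations per shared axis rather than a search over constraint combinations.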


2602.11173 2026-05-26 cs.CL 版本更新

Author-in-the-Loop Response Generation and Evaluation: Integrating Author Expertise and Intent in Responses to Peer Review

作者参与的回信生成与评估：将作者专业知识和意图整合到对同行评审的回复中

Qian Ruan, Iryna Gurevych

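Affiliations: Ubiquitous Knowledge Processing Lab (UKP Lab); Department of Computer Science; Hessian Center for AI (hessian.AI)

AI summary: Proposes an author-in-the-loop framework for response generation and evaluation. It fills gaps in the use and evaluation of author signals with a dataset of aligned review-response-revision triplets, the REspGen system supporting flexible author input and controllable generation, and REspEval, a comprehensive evaluation suite with 20+ metrics.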

Comments accepted to ACL 2026 Main Conference

AI中文摘要

作者回复（反驳）写作是科学同行评审的关键阶段，需要作者付出大量努力。在实践中，作者拥有领域专业知识、仅作者可用的信息和回复策略——作者专业知识和意图的具体形式——并寻求NLP辅助，将这些信号整合到作者回复生成（ARG）中。然而，这种作者参与范式缺乏正式的NLP表述和系统研究：没有数据集提供细粒度的作者信号，现有的ARG工作缺乏作者输入和控制，也没有评估指标衡量回复对作者信号的反映以及解决评审者关注点的有效性。为填补这些空白，我们引入了（i）Re3Align，第一个大规模的对齐评审-回复-修订三元组数据集，其中修订代理作者信号；（ii）REspGen，一个作者参与的ARG框架，支持灵活的作者输入、多属性控制和评估引导的细化；以及（iii）REspEval，一个包含20多个指标的全面评估套件，涵盖输入利用、可控性、回复质量和话语。使用SOTA LLMs的实验证明了作者输入和评估引导细化的好处、输入特异性对回复质量的影响以及可控性与质量之间的权衡。我们发布了我们的数据集、生成和评估工具。

英文摘要

Author response (rebuttal) writing is a critical stage of scientific peer review that demands substantial author effort. In practice, authors possess domain expertise, author-only information, and response strategies (concrete forms of author expertise and intent) and seek NLP assistance that integrates these signals into author response generation (ARG). Yet this author-in-the-loop paradigm lacks formal NLP formulation and systematic study: no dataset provides fine-grained author signals, existing ARG work lacks author inputs and controls, and no evaluation measures how well responses reflect author signals or how effectively they address reviewer concerns. To fill these gaps, we introduce (i) Re3Align, the first large-scale dataset of aligned review-response-revision triplets, where revisions proxy author signals; (ii) REspGen, an author-in-the-loop ARG framework supporting flexible author input, multi-attribute control, and evaluation-guided refinement; and (iii) REspEval, a comprehensive evaluation suite with 20+ metrics spanning input utilization, controllability, response quality, and discourse. Experiments with SOTA LLMs demonstrate the benefits of author input and evaluation-guided refinement, the impact of input specificity on response quality, and controllability-quality trade-offs. We release our dataset and our generation and evaluation tools.


2602.02843 2026-05-26 cs.CL 版本更新

Act or Clarify? Modeling Sensitivity to Uncertainty and Cost in Communication

行动还是澄清？建模沟通中对不确定性和成本的敏感性

Polina Tsvilodub, Karl Mulligan, Todd Snider, Robert D. Hawkins, Michael Franke

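Affiliations: Department of Linguistics, University of Tübingen; Department of Linguistics, Stanford University

AI summary: Proposes a computational model based on expected regret to study whether people ask clarification questions under uncertainty, a choice that depends on a rational tradeoff between uncertainty and the cost of acting incorrectly.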

Comments 6 pages, 3 figures, accepted to CogSci 2026

AI中文摘要

在不确定性下决定如何行动时，智能体可以选择行动以减少不确定性，也可以不顾不确定性而行动。在沟通场景中，减少不确定性的一个重要方式是提出澄清问题（CQs）。我们预测，是否提出CQ的决定取决于上下文不确定性和替代行动的成本，并且这些因素相互作用：当错误行动代价高昂时，不确定性最为重要。我们在一个基于预期遗憾的计算模型中形式化了这种相互作用：该模型衡量智能体在当前行动而非拥有完整信息时可能遭受的损失。我们在两个实验中测试了这些预测，一个实验考察对问题的纯语言回应，另一个扩展到在澄清和非语言行动之间的选择。综合来看，我们的结果表明一种理性权衡：人类倾向于在不确定性下行动时，根据可能遭受重大损失的风险比例来寻求澄清。

英文摘要

When deciding how to act under uncertainty, agents may choose to act to reduce uncertainty or they may act despite that uncertainty. In communicative settings, an important way of reducing uncertainty is by asking clarification questions (CQs). We predict that the decision to ask a CQ depends on both contextual uncertainty and the cost of alternative actions, and that these factors interact: uncertainty should matter most when acting incorrectly is costly. We formalize this interaction in a computational model based on expected regret: how much an agent stands to lose by acting now rather than with full information. We test these predictions in two experiments, one examining purely linguistic responses to questions and another extending to choices between clarification and non-linguistic action. Taken together, our results suggest a rational tradeoff: humans tend to seek clarification proportional to the risk of substantial loss when acting under uncertainty.
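The interaction between uncertainty and cost can be made concrete with a small worked example. Expected regret is computed here as the expected utility of acting with full information minus the expected utility of the best action available now. The tea/coffee scenario, the utility values, and the function names are hypothetical, and this is one standard way to formalize the quantity rather than the authors' exact model.

```python
def expected_regret(belief, utility):
    """Expected loss from acting now instead of acting with full information.

    belief[s] is the probability of state s; utility[s][a] is the payoff of
    action a when the true state is s.
    """
    actions = next(iter(utility.values())).keys()
    best_now = max(
        sum(belief[s] * utility[s][a] for s in belief) for a in actions
    )
    informed = sum(belief[s] * max(utility[s].values()) for s in belief)
    return informed - best_now


def make_utility(wrong_cost):
    """Payoff 1 for serving what the guest wants, -wrong_cost otherwise."""
    return {
        "wants_tea": {"serve_tea": 1.0, "serve_coffee": -wrong_cost},
        "wants_coffee": {"serve_tea": -wrong_cost, "serve_coffee": 1.0},
    }


uncertain = {"wants_tea": 0.5, "wants_coffee": 0.5}
confident = {"wants_tea": 0.9, "wants_coffee": 0.1}

low_stakes, high_stakes = make_utility(0.0), make_utility(10.0)

regret = {
    ("uncertain", "low"): expected_regret(uncertain, low_stakes),
    ("confident", "low"): expected_regret(confident, low_stakes),
    ("uncertain", "high"): expected_regret(uncertain, high_stakes),
    ("confident", "high"): expected_regret(confident, high_stakes),
}
```

In this toy setup, moving from the confident to the uncertain belief raises expected regret far more under high stakes than under low stakes, which is the kind of interaction the abstract predicts for clarification questions.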


2602.02605 2026-05-26 cs.NE cs.AI cs.CL q-bio.NC 版本更新

Fine-Tuning Language Models to Know What They Know

微调语言模型使其了解自身所知

Sangjun Park, Elliot Meyerson, Xin Qiu, Risto Miikkulainen

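Affiliations: The University of Texas at Austin; Cognizant AI Lab

AI summary: This paper proposes a framework that uses an evolution-strategy alignment method (ESMA) to improve the metacognitive ability of large language models while controlling for bias, and it shows robust generalization to unseen datasets, languages, and new knowledge.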

Comments Preprint

2602.02474 2026-05-26 cs.CL cs.AI cs.LG 版本更新

MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents

MemSkill：面向自进化智能体的可学习与进化记忆技能

Haozhen Zhang, Quanyu Long, Jianzhu Bao, Tao Feng, Weizhi Zhang, Haodong Yue, Wenya Wang

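Affiliations: Nanyang Technological University; University of Illinois Urbana-Champaign; University of Illinois Chicago; Tsinghua University

AI summary: Proposes MemSkill, a framework that turns memory operations into learnable and evolvable skills. A controller selects skills, an executor produces memories, and a designer evolves the skill set, forming a closed loop that improves LLM agent task performance.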

Comments Code is available at https://github.com/ViktorAxelsen/MemSkill

AI中文摘要

大多数大语言模型（LLM）智能体记忆系统依赖少量静态、手工设计的操作来提取记忆。这些固定程序硬编码了关于存储内容和如何修订记忆的人类先验知识，使其在多样化的交互模式下僵化，并在长历史记录上效率低下。为此，我们提出\textbf{MemSkill}，将这些操作重新定义为可学习和可进化的记忆技能，即从交互轨迹中提取、整合和修剪信息的结构化可重用例程。受智能体技能设计哲学的启发，MemSkill采用一个\textit{控制器}，学习选择少量相关技能，并与基于LLM的\textit{执行器}配对，生成技能引导的记忆。除了学习技能选择，MemSkill引入一个\textit{设计者}，定期审查所选技能产生错误或不完整记忆的困难案例，并通过提出改进和新技能来进化技能集。共同地，MemSkill形成了一个闭环流程，改进了技能选择策略和技能集本身。在LoCoMo、LongMemEval、HotpotQA和ALFWorld上的实验表明，MemSkill在强基线上提高了任务性能，并在不同设置下具有良好的泛化能力。进一步分析揭示了技能如何进化，为LLM智能体更自适应、自进化的记忆管理提供了见解。

英文摘要

Most Large Language Model (LLM) agent memory systems rely on a small set of static, hand-designed operations for extracting memory. These fixed procedures hard-code human priors about what to store and how to revise memory, making them rigid under diverse interaction patterns and inefficient on long histories. To this end, we present \textbf{MemSkill}, which reframes these operations as learnable and evolvable memory skills, structured and reusable routines for extracting, consolidating, and pruning information from interaction traces. Inspired by the design philosophy of agent skills, MemSkill employs a \emph{controller} that learns to select a small set of relevant skills, paired with an LLM-based \emph{executor} that produces skill-guided memories. Beyond learning skill selection, MemSkill introduces a \emph{designer} that periodically reviews hard cases where selected skills yield incorrect or incomplete memories, and evolves the skill set by proposing refinements and new skills. Together, MemSkill forms a closed-loop procedure that improves both the skill-selection policy and the skill set itself. Experiments on LoCoMo, LongMemEval, HotpotQA, and ALFWorld demonstrate that MemSkill improves task performance over strong baselines and generalizes well across settings. Further analyses shed light on how skills evolve, offering insights toward more adaptive, self-evolving memory management for LLM agents.

2601.03014 2026-05-26 cs.CL cs.AI 版本更新

SentGraph: Hierarchical Sentence Graph for Multi-hop Retrieval-Augmented Question Answering

SentGraph: 用于多跳检索增强问答的层次化句子图

Junli Liang, Pengfei Zhou, Wangqiu Zhou, Wenjie Qing, Qi Zhao, Ziwen Wang, Qi Song, Xiangyang Li

发表机构 * University of Science and Technology of China（中国科学技术大学）； Hefei University of Technology（合肥工业大学）

AI总结提出SentGraph，一种句子级图RAG框架，通过构建层次化句子图并建模细粒度逻辑关系，解决多跳问答中证据链不完整的问题。

AI中文摘要

传统的检索增强生成（RAG）通过大型语言模型有效支持单跳问答，但在需要结合多个文档证据的多跳问答任务中面临显著限制。现有的基于块的检索通常提供不相关且逻辑不连贯的上下文，导致答案生成过程中证据链不完整和推理错误。为了解决这些挑战，我们提出了SentGraph，一种句子级图RAG框架，显式建模句子之间的细粒度逻辑关系以用于多跳问答。具体来说，我们离线构建一个层次化句子图：首先调整修辞结构理论以区分核心句和卫星句，然后将它们组织成带有跨文档实体桥的主题级子图。在线检索时，SentGraph执行图引导的证据选择和路径扩展，以检索细粒度的句子级证据。在四个多跳问答基准上的大量实验证明了SentGraph的有效性，验证了显式建模句子级逻辑依赖关系对多跳推理的重要性。

英文摘要

Traditional Retrieval-Augmented Generation (RAG) effectively supports single-hop question answering with large language models but faces significant limitations in multi-hop question answering tasks, which require combining evidence from multiple documents. Existing chunk-based retrieval often provides irrelevant and logically incoherent context, leading to incomplete evidence chains and incorrect reasoning during answer generation. To address these challenges, we propose SentGraph, a sentence-level graph-based RAG framework that explicitly models fine-grained logical relationships between sentences for multi-hop question answering. Specifically, we construct a hierarchical sentence graph offline by first adapting Rhetorical Structure Theory to distinguish nucleus and satellite sentences, and then organizing them into topic-level subgraphs with cross-document entity bridges. During online retrieval, SentGraph performs graph-guided evidence selection and path expansion to retrieve fine-grained sentence-level evidence. Extensive experiments on four multi-hop question answering benchmarks demonstrate the effectiveness of SentGraph, validating the importance of explicitly modeling sentence-level logical dependencies for multi-hop reasoning.

2512.06393 2026-05-26 cs.AI cs.CL cs.LG cs.LO 版本更新

Conflict-Aware Fusion: Mitigating Logic Inertia in Large Language Models via Structured Cognitive Priors

冲突感知融合：通过结构化认知先验缓解大语言模型中的逻辑惯性

Qiming Bao, Xiaoxuan Fu, Michael Witbrock

发表机构 * Xtracta & Strong AI Lab, University of Auckland（Xtracta与强人工智能实验室，奥克兰大学）； School of Humanities, China University of Political Science and Law（人文学院，中国政法大学）； Strong AI Lab, University of Auckland（强人工智能实验室，奥克兰大学）

AI总结针对大语言模型在规则系统结构扰动下表现脆弱的问题，提出冲突感知融合训练流程，通过验证-演绎结构先验和符号推理奖励，在多个压力测试中实现鲁棒性饱和。

AI中文摘要

大型语言模型（LLM）在许多推理基准上取得了高准确率，但在基于规则系统的结构扰动下仍然脆弱。我们引入了一个包含四个压力测试的诊断框架——冗余与必要规则删除、矛盾规则注入、逻辑保持重写和多定律堆叠——并用它来揭示逻辑惯性：生成式LLM（Qwen2/3、TinyLlama、GPT-4o、Gemma-3-4B-IT）和仅编码器BERT基线在矛盾前提下沿学习到的演绎轨迹持续推理的倾向。这种崩溃是剧烈的：未经处理的基线在基础任务上的准确率从1.00下降到矛盾注入时的0.00（实例级精确匹配），而GPT-4o仅解决了56.0%的矛盾案例。我们提出冲突感知融合，这是一个四阶段训练流程，将验证-演绎作为学习到的结构先验强制执行：（i）SFT建立验证前缀；（ii）DPO锐化矛盾停止决策边界；（iii）逻辑不变正则化（LIRE）通过对称KL惩罚逻辑等价规则公式之间的差异；（iv）来自验证反馈的强化学习（RLVF）使用符号前向链接引擎作为确定性预言奖励，联合优化不变性和敏感性。该流程在1.5B和8B骨干网络上均使所有四个主要压力测试达到饱和。我们进一步验证了第二阶段扩展，用Lean 4内核替换命题预言机，在分层187个问题的Lean翻译样本中，对105个经典可推导（T）问题达到99.0%的内核一致性（整体71.7%，涵盖两种极性），为形式化验证的RL训练提供了可靠的升级路径。代码和基准：https://github.com/14H034160212/lemo

英文摘要

Large language models (LLMs) achieve high accuracy on many reasoning benchmarks but remain brittle under structural perturbations of rule-based systems. We introduce a diagnostic framework with four stress tests -- redundant vs. essential rule deletion, contradictory-rule injection, logic-preserving rewrites, and multi-law stacking -- and use it to expose Logic Inertia: the tendency of generative LLMs (Qwen2/3, TinyLlama, GPT-4o, Gemma-3-4B-IT) and the encoder-only BERT baseline to persist along learned deductive trajectories under inconsistent premises. The collapse is sharp: untreated baselines fall from accuracy 1.00 on the base task to 0.00 on contradiction injection (instance-level exact match), and GPT-4o resolves only 56.0% of contradiction cases. We propose Conflict-Aware Fusion, a four-stage training pipeline that enforces verification-before-deduction as a learned structural prior: (i) SFT establishes the verification preamble; (ii) DPO sharpens the halt-on-contradiction decision boundary; (iii) Logical Invariance REgularisation (LIRE) penalises divergence between logically equivalent rule formulations via symmetric KL; (iv) Reinforcement Learning from Verification Feedback (RLVF) uses a symbolic forward-chaining engine as a deterministic oracle reward, jointly optimising invariance and sensitivity. The pipeline saturates all four primary stress tests for both 1.5B and 8B backbones. We further validate a Phase 2 extension that replaces the propositional oracle with a Lean 4 kernel, attaining 99.0% kernel agreement on the 105 classically-derivable (T) questions within a stratified 187-question Lean-translated sample (overall 71.7% across both polarities), providing a sound upgrade path to formally verified RL training. Code and benchmark: https://github.com/14H034160212/lemo
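The symmetric-KL penalty named for the LIRE stage can be illustrated with a minimal sketch. It assumes the penalty compares the model's output distributions on two logically equivalent rule formulations and uses the plain sum KL(p‖q) + KL(q‖p); the abstract does not say whether a 1/2 factor or other weighting is applied. The function names and toy distributions are hypothetical.

```python
import math


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )


def symmetric_kl(p, q):
    """Symmetrised divergence: zero when the two distributions agree."""
    return kl(p, q) + kl(q, p)


# Hypothetical answer distributions (True / False / Contradiction) for one
# problem stated with two logically equivalent rule formulations.
original_form = [0.7, 0.2, 0.1]
rewritten_form = [0.4, 0.5, 0.1]
penalty = symmetric_kl(original_form, rewritten_form)
```

A regulariser of this shape is zero only when the model answers both formulations identically, so minimising it pushes the model toward invariance under logic-preserving rewrites.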

2511.02721 2026-05-26 cs.CL 版本更新

PETra: A Multilingual Corpus of Pragmatic Explicitation in Translation

PETra：翻译中语用显化的多语语料库

Doreen Osmelak, Koel Dutta Chowdhury, Uliana Sentsova, Cristina España-Bonet, Josef van Genabith

发表机构 * Saarland University, Saarland Informatics Campus, Germany（萨尔兰大学，萨尔兰信息学院，德国）； German Research Center for Artificial Intelligence (DFKI)（德国人工智能研究中心（DFKI））； Barcelona Supercomputing Center (BSC-CNS), Barcelona, Catalonia, Spain（巴塞罗那超级计算中心（BSC-CNS），巴塞罗那，加泰罗尼亚，西班牙）

AI总结提出首个多语语料库PETra及检测框架，通过空对齐和主动学习识别语用显化，跨语言准确率达0.88，F1达0.82。

DOI: 10.63317/56tberz7nmwy
Journal ref: Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

AI中文摘要

Transformer过度泛化而人类学习不足：西班牙语L形形态词素案例

Akhilesh Kakolu Ramarao, Kevin Tang, Dinah Baer-Henney

发表机构 * Department of English Language and Linguistics, Heinrich Heine University Düsseldorf（杜塞尔多夫海因里希·海涅大学英语语言与语言学系）； Institut für Germanistik, Philologische Fakultät, Ruhr-Universität Bochum（波鸿鲁尔大学语文学院德语研究所）； Department of Linguistics, College of Liberal Arts and Sciences, University of Florida（佛罗里达大学文理学院语言学系）

AI总结本研究通过Transformer模型在三种频率条件下学习西班牙语L形形态词素，并与人类行为数据对比，发现模型能从分布输入中习得该模式但泛化方式与人类定性不同。

AI中文摘要

不规则形态模式的认知现实性已争论数十年：说话者是否将其扩展到新形式，还是它们只是词汇产物？基于分布输入训练的神经网络提供了可学习性测试：如果它恢复了模式，则该模式仅从输入统计中即可学习。我们将此测试应用于西班牙语L形形态词素，其中第一人称单数直陈式词干出现在每个现在虚拟式单元格中，尽管缺乏明显的音系或语义动机。我们进一步询问输入中不规则动词的频率是否调节泛化，在三种频率条件（10%、50%、90%不规则）下评估Transformer，并将其与人类行为数据进行比较。在伪词输入的全形式产出中，所有模型表现均较差，但所有三种条件产生正确词干的频率均高于人类（43-49% vs. 33%）。响应偏好显示出明显分歧：人类始终偏好规则屈折，而模型随着训练中不规则比例增加更倾向于不规则形式。自然和平衡条件下的模型也对伪词与真实西班牙语不规则动词之间的音系相似性敏感，而这种效应在人类中不存在。因此，L形形态词素仅从分布输入即可学习，但模型在定性上以不同于人类的方式泛化它。

英文摘要

The cognitive reality of irregular morphological patterns has been debated for decades: do speakers extend them to novel forms, or are they lexical artifacts? A neural network trained on distributional input offers a learnability test: if it recovers the pattern, the pattern is learnable from input statistics alone. We apply this test to the Spanish L-shaped morphome, where the first-person singular indicative stem appears in every present subjunctive cell despite lacking apparent phonological or semantic motivation. We further ask whether the frequency of irregular verbs in the input modulates generalization, evaluating transformers under three frequency conditions (10%, 50%, 90% irregular) and comparing them to human behavioral data. On full-form production from pseudoword inputs all models performed poorly, but all three conditions produced the correct stem more often than humans (43--49% vs. 33%). Response preferences revealed a clear divergence: humans consistently favored regular inflections, whereas models preferred irregular forms more as their proportion in training grew. Models in the naturalistic and balanced conditions were also sensitive to phonological similarity between pseudowords and real Spanish irregular verbs, an effect absent in humans. The L-shaped morphome is thus learnable from distributional input alone, but models generalize it qualitatively differently from humans.

2507.19219 2026-05-26 cs.CL cs.CR 版本更新

How Much Do Large Language Model Cheat on Evaluation? Benchmarking Overestimation under the One-Time-Pad-Based Framework

大型语言模型在评估中作弊了多少？基于一次性密码本的框架下的高估基准测试

Zi Liang, Liantong Yu, Shiyu Zhang, Qingqing Ye, Haibo Hu

发表机构 * Tech Startups（科技初创公司）

AI总结针对大型语言模型在公开基准测试中因数据污染或训练偏差导致评估结果虚高的问题，提出基于一次性密码本加密思想的动态评估框架ArxivRoll，包含自动生成私有测试用例的SCP模块和衡量污染与偏差比例的Rugged Scores指标，实现可重复、透明且高效的评估。

Comments This paper has been accepted by AAAI 2026. This update adds new evaluation results for ArxivRollBench-2025a and ArxivRollBench-2026a, covering timely models such as DeepSeekV4Pro, GPT-5.5, and Claude-Opus-4.7, among others. Source code: https://github.com/liangzid/ArxivRoll/ Online Leaderboard Website: https://arxivroll.moreoverai.com/

AI中文摘要

评估大型语言模型（LLMs）时的高估问题日益引起关注。由于公开基准测试的数据污染或模型训练不平衡，LLMs可能在公开基准测试中无意或有意地获得不真实的评估结果，这导致LLMs之间的不公平比较，并削弱了对其实际能力的评估。现有基准测试试图通过永久保密测试用例、通过人工评估减轻污染或反复收集和构建新样本来解决这些问题。然而，这些方法无法同时确保可重复性、透明性和高效率。此外，当前LLMs的高估程度仍未量化。为解决这些问题，我们提出了ArxivRoll，一个受密码学中一次性密码本加密启发的动态评估框架。ArxivRoll包含两个关键组件：\emph{i) SCP（排序、完形填空和预测）}，一个用于私有测试用例的自动生成器；\emph{ii) Rugged Scores（RS）}，衡量公开基准测试污染和训练偏差比例的指标。利用SCP，ArxivRoll每六个月使用ArXiv上的最新文章构建一个新的基准测试，并将其用于LLM性能的一次性评估。大量实验证明了我们基准测试的高质量，并且我们提供了对当前LLMs的系统评估。源代码可在https://github.com/liangzid/ArxivRoll/获取。

英文摘要

Overestimation in evaluating large language models (LLMs) has become an increasing concern. Due to the contamination of public benchmarks or imbalanced model training, LLMs may achieve unreal evaluation results on public benchmarks, either intentionally or unintentionally, which leads to unfair comparisons among LLMs and undermines their realistic capability assessments. Existing benchmarks attempt to address these issues by keeping test cases permanently secret, mitigating contamination through human evaluation, or repeatedly collecting and constructing new samples. However, these approaches fail to ensure reproducibility, transparency, and high efficiency simultaneously. Moreover, the extent of overestimation in current LLMs remains unquantified. To address these issues, we propose ArxivRoll, a dynamic evaluation framework inspired by one-time pad encryption in cryptography. ArxivRoll comprises two key components: \emph{i) SCP (Sequencing, Cloze, and Prediction)}, an automated generator for private test cases, and \emph{ii) Rugged Scores (RS)}, metrics that measure the proportion of public benchmark contamination and training bias. Leveraging SCP, ArxivRoll constructs a new benchmark every six months using recent articles from ArXiv and employs them for one-time evaluations of LLM performance. Extensive experiments demonstrate the high quality of our benchmark, and we provide a systematic evaluation of current LLMs. The source code is available at https://github.com/liangzid/ArxivRoll/.

2507.14958 2026-05-26 cs.CL 版本更新

MUR: Momentum Uncertainty guided Reasoning for Large Language Models

MUR: 面向大型语言模型的动量不确定性引导推理

Hang Yan, Fangzhi Xu, Rongman Xu, Yifei Li, Jian Zhang, Haoran Luo, Xiaobao Wu, Luu Anh Tuan, Haiteng Zhao, Qika Lin, Jun Liu

发表机构 * School of Computer Science and Technology, Xi’an Jiaotong University（西安交通大学计算机科学与技术学院）； Ministry of Education Key Laboratory of Intelligent Networks and Network Security（教育部智能网络与网络安全重点实验室）； Shaanxi Province Key Laboratory of Big Data Knowledge Engineering（陕西省大数据知识工程重点实验室）； Nanyang Technological University（南洋理工大学）； Shanghai Jiao Tong University（上海交通大学）； VinUniversity（Vin大学）； Shanghai AI Laboratory（上海人工智能实验室）； National University of Singapore（新加坡国立大学）

AI总结提出动量不确定性引导推理（MUR）方法，通过追踪和聚合逐步不确定性动态分配推理预算，并引入γ控制机制，在不额外训练的情况下减少冗余计算，在多个基准上平均减少45%以上计算量并提升准确率。

AI中文摘要

大型语言模型在推理密集型任务上取得了令人印象深刻的性能，但优化其推理效率仍然是一个开放的挑战。虽然测试时缩放（TTS）提高了推理质量，但它常常导致过度思考，在冗余计算上浪费令牌。本研究探讨如何在不额外训练的情况下，高效且自适应地引导当前模型的测试时缩放。受物理学中动量概念的启发，我们提出了动量不确定性引导推理（MUR），它通过随时间跟踪和聚合逐步不确定性，动态地将思考预算分配给关键的推理步骤。为了支持灵活的推理时控制，我们引入了γ控制，这是一种通过单个超参数调整推理预算的简单机制。我们提供了深入的理论证明，以支持MUR在稳定性和偏差方面的优越性。MUR与各种TTS方法在四个具有挑战性的基准（MATH-500、AIME24、AIME25和GPQA-diamond）上，使用不同大小的最新Qwen3模型（1.7B、4B和8B）进行了全面评估。结果表明，MUR平均减少了超过45%的计算量，同时将准确率提高了0.33%至3.46%。

英文摘要

Large Language Models have achieved impressive performance on reasoning-intensive tasks, yet optimizing their reasoning efficiency remains an open challenge. While Test-Time Scaling (TTS) improves reasoning quality, it often leads to overthinking, wasting tokens on redundant computations. This work investigates how to efficiently and adaptively guide the test-time scaling of current models without additional training. Inspired by the concept of momentum in physics, we propose Momentum Uncertainty-guided Reasoning (MUR), which dynamically allocates thinking budgets to critical reasoning steps by tracking and aggregating stepwise uncertainty over time. To support flexible inference-time control, we introduce gamma-control, a simple mechanism that tunes the reasoning budget via a single hyperparameter. We provide an in-depth theoretical proof to support the superiority of MUR in terms of stability and biases. MUR is comprehensively evaluated against various TTS methods across four challenging benchmarks (MATH-500, AIME24, AIME25, and GPQA-diamond) using different sizes of recent Qwen3 models (1.7B, 4B, and 8B). Results demonstrate that MUR reduces computation by over 45% on average while improving accuracy by 0.33% to 3.46%.
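The idea of tracking stepwise uncertainty with momentum can be sketched as below. The abstract does not state MUR's update or trigger rule, so this sketch assumes an exponential moving average of per-step uncertainty and a threshold rule scaled by a single `gamma` parameter; `steps_to_scale`, `alpha`, and the example values are hypothetical and should not be read as the paper's exact formulation.

```python
def steps_to_scale(uncertainties, alpha=0.9, gamma=0.9):
    """Return indices of reasoning steps that would receive extra compute.

    uncertainties: positive per-step uncertainty scores, e.g. the mean
                   negative log-likelihood of each step's tokens (assumed).
    alpha:         momentum coefficient of the running average (assumed).
    gamma:         budget knob in (0, 1]; smaller values raise the bar for
                   flagging a step, so fewer steps get extra thinking.
    """
    momentum = None
    flagged = []
    for t, u in enumerate(uncertainties):
        # Flag a step whose uncertainty clearly exceeds the running history.
        if momentum is not None and u > momentum / gamma:
            flagged.append(t)
        # Fold the current step into the momentum term.
        momentum = u if momentum is None else alpha * momentum + (1 - alpha) * u
    return flagged


# Hypothetical trace: only the fourth step is unusually uncertain.
trace = [1.0, 1.0, 1.0, 2.0, 1.0]
flagged_steps = steps_to_scale(trace)
```

In this toy trace only the spike at index 3 is flagged, so extra test-time compute would be spent on one step rather than on all five.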

2507.10644 2026-05-26 cs.AI cs.CL cs.CR cs.HC cs.MA 版本更新

在文本属性图中整合结构信号与语义信号：BiGTex

Azadeh Beiranvand, Seyed Mehdi Vahidipour

发表机构 * Faculty of Electrical and Computer Engineering, University of Kashan（卡尚大学电气与计算机工程学院）

AI总结提出BiGTex架构，通过堆叠图-文本融合单元实现GNN与LLM的双向注意力，以参数高效微调（LoRA）在节点分类和链接预测任务上达到最优性能。

Comments 26 pages, 4 figures

DOI: 10.1016/j.mlwa.2026.100921
Journal ref: Machine Learning with Applications 24 (2026) 100921

AI中文摘要

文本属性图（TAGs）在表示学习中提出了独特挑战，要求模型同时捕捉节点关联文本的语义丰富性和图的结构依赖性。图神经网络（GNNs）擅长建模拓扑信息，但缺乏处理非结构化文本的能力。相反，大型语言模型（LLMs）精通文本理解，但通常不了解图结构。在这项工作中，我们提出了BiGTex（双向图文本），一种通过堆叠图-文本融合单元紧密集成GNN和LLM的新型架构。每个单元允许文本和结构表示之间的相互注意力，使信息能够双向流动：文本影响结构，结构指导文本解释。所提出的架构使用参数高效微调（LoRA）进行训练，保持LLM冻结同时适应任务特定信号。在五个基准数据集上的大量实验表明，BiGTex在节点分类中实现了最先进的性能，并有效泛化到链接预测。消融研究进一步强调了软提示和双向注意力在模型成功中的重要性。

英文摘要

Text-attributed graphs (TAGs) present unique challenges in representation learning by requiring models to capture both the semantic richness of node-associated texts and the structural dependencies of the graph. While graph neural networks (GNNs) excel at modeling topological information, they lack the capacity to process unstructured text. Conversely, large language models (LLMs) are proficient in text understanding but are typically unaware of graph structure. In this work, we propose BiGTex (Bidirectional Graph Text), a novel architecture that tightly integrates GNNs and LLMs through stacked Graph-Text Fusion Units. Each unit allows for mutual attention between textual and structural representations, enabling information to flow in both directions, text influencing structure and structure guiding textual interpretation. The proposed architecture is trained using parameter-efficient fine-tuning (LoRA), keeping the LLM frozen while adapting to task-specific signals. Extensive experiments on five benchmark datasets demonstrate that BiGTex achieves state-of-the-art performance in node classification and generalizes effectively to link prediction. An ablation study further highlights the importance of soft prompting and bi-directional attention in the model's success.

2503.19605 2026-05-26 cs.LG cs.CL math.ST stat.TH 版本更新

Lean Formalization of Generalization Error Bound by Rademacher Complexity and Dudley's Entropy Integral

Rademacher复杂度和Dudley熵积分的泛化误差界的Lean形式化

Sho Sonoda, Kazumi Kasaura, Yuma Mizuno, Kei Tsukamoto, Naoto Onda

发表机构 * RIKEN AIP（日本理化学研究所AIP）； CyberAgent Inc.（CyberAgent公司）； OMRON SINIC X Corporation（OMRON SINIC X株式会社）； University College Cork（科克大学）； The University of Tokyo（东京大学）

AI总结本文在Lean 4中形式化了基于Rademacher复杂度的泛化误差界，通过形式化对称化论证、有界差异分析和McDiarmid不等式，并扩展到可数假设类及可分离拓扑索引集，最后应用得到线性预测器的经验Rademacher界和Dudley熵积分界。

Comments accepted at ITP2026

AI中文摘要

理解和证明机器学习算法的泛化性能——即从训练误差获得测试误差的理论估计——是统计学习理论的核心主题。在用于推导此类保证的众多复杂度度量中，Rademacher复杂度提供了尖锐的、数据相关的界，其适用范围远超经典的VC维理论。在本研究中，我们基于Mathlib库中可用的测度论概率论，在Lean 4中形式化了Rademacher复杂度的泛化误差界。我们的开发提供了一个经过机械检查的流水线，从经验和期望Rademacher复杂度的定义开始，经过形式化的对称化论证和有界差异分析，通过形式化证明的McDiarmid不等式得到高概率一致偏差界。一个关键的技术贡献是可重用机制，通过归约到可数稠密子集，将结果从可数假设类（其中上确界的可测性在Mathlib中直接成立）提升到可分离拓扑索引集。作为抽象定理的工作应用，我们机械化了$\ell_2$和$\ell_1$正则化下线性预测器的标准经验Rademacher界，并且我们还形式化了基于覆盖数和链式构造的Dudley型熵积分界。

英文摘要

Understanding and certifying the generalization performance of machine learning algorithms -- i.e. obtaining theoretical estimates of the test error from the training error -- is a central theme of statistical learning theory. Among the many complexity measures used to derive such guarantees, Rademacher complexity yields sharp, data-dependent bounds that apply well beyond classical VC-dimension theory. In this study, we formalize the generalization error bound by Rademacher complexity in Lean 4, building on measure-theoretic probability theory available in the Mathlib library. Our development provides a mechanically-checked pipeline from the definitions of empirical and expected Rademacher complexity, through a formal symmetrization argument and a bounded-differences analysis, to high-probability uniform deviation bounds via a formally proved McDiarmid inequality. A key technical contribution is a reusable mechanism for lifting results from countable hypothesis classes (where measurability of suprema is straightforward in Mathlib) to separable topological index sets via a reduction to a countable dense subset. As worked applications of the abstract theorem, we mechanize standard empirical Rademacher bounds for linear predictors under $\ell_2$ and $\ell_1$ regularizations, and we also formalize a Dudley-type entropy integral bound based on covering numbers and a chaining construction.
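For orientation, the kind of statement being formalized can be written in its standard textbook form. The exact hypotheses and constants of the Lean development are not given in the abstract, so the version below is an assumption: a function class $\mathcal{F}$ with values in $[0,1]$, an i.i.d. sample $S = (z_1, \dots, z_n)$, and independent Rademacher signs $\sigma_i$.

```latex
% Empirical and expected Rademacher complexity
\widehat{\mathfrak{R}}_S(\mathcal{F})
  = \mathbb{E}_{\sigma}\!\left[\sup_{f \in \mathcal{F}}
      \frac{1}{n}\sum_{i=1}^{n} \sigma_i f(z_i)\right],
\qquad
\mathfrak{R}_n(\mathcal{F}) = \mathbb{E}_{S}\!\left[\widehat{\mathfrak{R}}_S(\mathcal{F})\right].

% Uniform deviation bound (symmetrization + McDiarmid):
% with probability at least 1 - \delta over S,
\sup_{f \in \mathcal{F}}
  \left( \mathbb{E}[f(z)] - \frac{1}{n}\sum_{i=1}^{n} f(z_i) \right)
  \le 2\,\mathfrak{R}_n(\mathcal{F}) + \sqrt{\frac{\ln(1/\delta)}{2n}}.

% Linear predictors under an \ell_2 constraint,
% \mathcal{F} = \{ x \mapsto \langle w, x \rangle : \|w\|_2 \le B \},\ \|x_i\|_2 \le R:
\widehat{\mathfrak{R}}_S(\mathcal{F}) \le \frac{B R}{\sqrt{n}}.
```

The factor 2 comes from the symmetrization argument, and the square-root term from applying McDiarmid's inequality to a supremum whose bounded differences are at most $1/n$.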

2502.21297 2026-05-26 cs.CL 版本更新

Persuasion Should be Double-Blind: A Multi-Domain Dialogue Dataset With Faithfulness Based on Causal Theory of Mind

说服应该是双盲的：基于因果心智理论的多领域对话数据集

Dingyi Zhang, Linhai Zhang, Fanglei Qu, Ziqing Zhuang, Deyu Zhou

发表机构 * Department of Informatics, King’s College London（信息学院，伦敦国王学院）

AI总结提出基于因果心智理论的多智能体框架ToMMA构建双盲说服对话数据集CToMPersu，以解决现有数据集信息泄露问题，提升对话真实性和说服力。

Comments 6 pages

2502.15835 2026-05-26 cs.CL cs.AI cs.SE 版本更新

ECHO: 终端代理免费学习世界模型

Vaishnavi Shrivastava, Piero Kauffmann, Ahmed Awadallah, Dimitris Papailiopoulos

发表机构 * Microsoft Research（微软研究院）

AI总结提出ECHO混合目标，通过预测环境观测令牌将终端反馈转化为密集监督信号，显著提升CLI代理在TerminalBench-2.0上的性能。

AI中文摘要

CLI代理是语言模型最接近具身环境的设置：模型发出命令，终端执行它们，返回的流——stdout、错误、文件、日志和跟踪——记录了后果。我们认为这个流是一个监督信号，但标准的代理强化学习丢弃了它：GRPO风格的训练使用稀疏的结果级奖励更新动作令牌，而忽略了rollout中已有的环境响应。失败的rollout尽管包含关于环境如何响应的丰富证据，但提供的策略梯度信号很少。我们引入了ECHO（环境交叉熵混合目标），这是一种混合目标，它将动作令牌上的标准策略梯度损失与辅助损失相结合，该辅助损失训练策略预测其自身动作产生的环境观测令牌。ECHO重用与GRPO相同的前向传播，不需要额外的rollout，并将终端反馈转化为所有rollout的密集监督。ECHO在TerminalBench-2.0上将GRPO的pass@1翻倍：Qwen3-8B从2.70%提升到5.17%，Qwen3-14B从5.17%提升到10.79%。ECHO还产生了更好地预测终端动态的策略，即使是在它们未生成的轨迹上：在保留的rollout中，它显著降低了环境令牌的交叉熵，而单独的GRPO几乎没有改变。从基础Qwen3-8B开始，ECHO在没有专家演示的情况下，在保留的终端任务上匹配了专家SFT然后GRPO的性能，并在TerminalBench-2.0上恢复了专家SFT初始化收益的大约一半。在某些设置中，仅环境预测损失就能实现无验证器的自我改进，使策略仅通过与环境交互就能在未见过的OOD任务上改进。这些结果表明，环境观测不仅是未来动作的上下文，而且是每个rollout中已经存在的密集、在策略的监督信号。

英文摘要

CLI agents are the closest thing language models have to an embodied setting: the model emits commands, the terminal executes them, and the returned stream -- stdout, errors, files, logs, and traces -- records the consequences. We argue that this stream is a supervision signal, but standard agent RL discards it: GRPO-style training updates action tokens with sparse outcome-level rewards while ignoring environment responses already in the rollout. Failed rollouts provide little policy-gradient signal despite containing rich evidence about how the environment responds. We introduce ECHO (Environment Cross-entropy Hybrid Objective), a hybrid objective that combines the standard policy-gradient loss on action tokens with an auxiliary loss that trains the policy to predict environment observation tokens resulting from its own actions. ECHO reuses the same forward pass as GRPO, requires no additional rollouts, and turns terminal feedback into dense supervision for all rollouts. ECHO doubles GRPO pass@1 on TerminalBench-2.0: Qwen3-8B improves from 2.70% to 5.17%, and Qwen3-14B from 5.17% to 10.79%. ECHO also produces policies that better predict terminal dynamics, even on trajectories they did not generate: across held-out rollouts, it sharply reduces environment-token cross-entropy while GRPO alone barely changes it. From base Qwen3-8B, ECHO matches expert-SFT-then-GRPO performance on held-out terminal tasks without expert demonstrations, and recovers roughly half of the expert-SFT initialization benefit on TerminalBench-2.0. In some settings, the environment prediction loss alone enables verifier-free self-improvement, allowing policies to improve on unseen OOD tasks by learning only from environment interactions. Together, these results suggest that environment observations are not merely context for future actions, but a dense, on-policy supervision signal already present in every rollout.
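The hybrid objective described here can be sketched in a few lines. This is a simplified illustration under stated assumptions: the policy-gradient term is a plain advantage-weighted log-likelihood (real GRPO adds importance ratios, clipping, and group normalisation), and the mixing weight `aux_weight` and the function name `echo_style_loss` are hypothetical, since the abstract does not specify them.

```python
def echo_style_loss(tokens, advantage, aux_weight=0.1):
    """Combine a policy-gradient term with an observation-prediction term.

    tokens:     list of (log_prob, kind) pairs from one rollout, where kind is
                "action" (emitted by the policy) or "obs" (terminal output).
    advantage:  scalar outcome-level advantage for the whole rollout.
    aux_weight: weight of the auxiliary cross-entropy term (assumed value).
    """
    action_lp = [lp for lp, kind in tokens if kind == "action"]
    obs_lp = [lp for lp, kind in tokens if kind == "obs"]
    # Policy-gradient surrogate on action tokens only.
    pg_loss = -advantage * sum(action_lp) / max(len(action_lp), 1)
    # Cross-entropy of the policy's predictions for environment tokens.
    obs_ce = -sum(obs_lp) / max(len(obs_lp), 1)
    return pg_loss + aux_weight * obs_ce


# Hypothetical rollout: two action tokens and two observed terminal tokens.
rollout = [(-0.5, "action"), (-1.0, "obs"), (-3.0, "obs"), (-0.5, "action")]
loss_with_signal = echo_style_loss(rollout, advantage=1.0)
loss_zero_advantage = echo_style_loss(rollout, advantage=0.0)
```

When the advantage is zero the policy-gradient term vanishes but the observation term does not, which is the sense in which a rollout with no reward signal still yields dense supervision. Both terms read log-probabilities from the same sequence, consistent with the claim that no extra forward pass or rollout is needed.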

2605.24486 2026-05-26 cs.AI cs.CL 版本更新

AgentFugue: Agent Scaling for Long-Horizon Tasks through Collective Reasoning

AgentFugue：通过集体推理实现长时域任务的智能体扩展

Yuyang Hu, Hongjin Qian, Shuting Wang, Jiongnan Liu, Tong Zhao, Xiaoxi Li, Zheng Liu, Zhicheng Dou

发表机构 * GSAI, Renmin University of China（GSAI，中国人民大学）； Beijing Academy of Artificial Intelligence（北京人工智能研究院）

AI总结提出AgentFugue框架，通过共享推理中心实现多个对等智能体并行探索和选择性信息共享，无需显式角色分工或工作流编排，从而提升长时域任务性能。

AI中文摘要

近期长时域智能体任务的进展主要通过更强模型、更好工具和更有效脚手架来扩展单个智能体。相比之下，对于扩展（scaling out）的理解要少得多：多个对等智能体，都针对同一任务，能否在不依赖显式角色分工或工作流编排的情况下成为额外能力来源？我们研究这个问题并提出AgentFugue，一个围绕共享推理中心构建的集体推理框架。当对等智能体并行探索同一任务时，中心记录每个智能体已建立、尝试或排除的简明笔记，并使每个智能体能够以对其当前搜索有用的形式选择性访问其他智能体的发现。这种设计将原本孤立的轨迹转变为可重用中间推理的互联生态，无需集中规划。我们将中心实例化为一个即插即用的通信层，使用监督微调和端到端强化学习进行训练。在我们研究的具有挑战性的长时域设置中，AgentFugue优于强基线。我们的结果表明，集体推理可以将对等智能体系统的扩展转变为能力增益的独特来源，而不仅仅是消耗更多计算的方式。

英文摘要

Recent progress on long-horizon agentic tasks has been driven largely by scaling up individual agents through stronger models, better tools, and more effective scaffolding. In contrast, much less is understood about scaling out: whether multiple peer agents, all targeting the same task, can become an additional source of capability without relying on explicit role specialization or workflow orchestration. We study this question and propose AgentFugue, a collective reasoning framework built around a shared reasoning hub. As peer agents explore the same task in parallel, the hub records concise notes on what each agent has established, attempted, or ruled out, and enables each agent to selectively access what other agents have discovered in a form useful for its current search. This design turns otherwise isolated trajectories into a connected ecology of reusable intermediate reasoning without requiring centralized planning. We instantiate the hub as a plug-in communication layer, trained with supervised fine-tuning and end-to-end reinforcement learning. Across the challenging long-horizon settings we study, AgentFugue improves over strong baselines. Our results suggest that collective reasoning can turn scaling out peer agent systems into a distinct source of capability gains, rather than merely a way of spending more compute.

2605.24454 2026-05-26 cs.CL 版本更新

Decompose-and-Refine: Structured Legal Question Answering with Parametric Retrieval

分解与精炼：基于参数化检索的结构化法律问答

Jihyung Lee, Hyounghun Kim, Gary Lee

发表机构 * Graduate School of Artificial Intelligence, POSTECH, Republic of Korea（POSTECH人工智能研究生院，韩国）； Department of Computer Science and Engineering, POSTECH, Republic of Korea（POSTECH计算机科学与工程系，韩国）

AI总结提出Decompose-and-Refine (DaR)框架，通过逐步分解复杂法律问题为原子子问题并生成与法规对齐的参数化查询，以解决多跳法律问答中的检索准确性和幻觉问题。

AI中文摘要

大型语言模型（LLMs）在法律领域表现出强大性能，在法律问答（LQA）中展现出显著潜力。然而，与通用问答不同，LQA要求答案不仅准确，而且严格基于明确的法律权威。在成文法LQA中，许多问题需要跨多个法律问题的多跳推理，大大增加了幻觉风险，因此准确检索支持性法规条款成为关键前提。尽管多跳问答近期取得进展，现有方法通常依赖自然语言推理或无需显式查询重构的检索，导致用户问题与法规文本之间的词汇差距未得到充分解决。为应对这一挑战，我们提出Decompose-and-Refine（DaR），一种基于法规的LQA框架，它将逐步的问题分解与基于参数化知识的查询精炼紧密结合。DaR逐步将复杂法律问题分解为原子子问题，并为每个子问题生成与法规对齐的参数化查询，从而能够为每个法律问题选择最核心的单一法规条款。我们在基于成文法的韩语多跳LQA基准KoBLEX上，使用Qwen3-32B和Gemma3-27B评估DaR。实验结果表明，DaR在检索准确性和最终答案质量上均持续优于现有方法。此外，通过显式分离子问题及其对应法规条款，DaR促进了复杂法律推理过程的透明、逐问题验证。

英文摘要

Large language models (LLMs) have shown strong performance in the legal domain, demonstrating notable potential in Legal Question Answering (LQA). However, unlike general QA, LQA requires answers that are not only accurate but also rigorously grounded in explicit legal authority. In statutory LQA, many questions require multi-hop reasoning across multiple legal issues, substantially increasing the risk of hallucination, thereby making accurate retrieval of supporting statutory provisions a critical prerequisite. Despite recent progress in multi-hop QA, existing approaches often rely on reasoning in natural language or retrieval without explicit query reformulation, leaving the vocabulary gap between user questions and statutory text largely unaddressed. To address this challenge, we propose Decompose-and-Refine (DaR), a statute-grounded LQA framework that tightly integrates step-wise question decomposition with parametric knowledge-based query refinement. DaR progressively decomposes a complex legal question into atomic sub-questions and generates statute-aligned parametric queries for each sub-question, enabling the selection of a single most central statutory provision corresponding to each legal issue. We evaluate DaR on KoBLEX, a Korean multi-hop LQA benchmark grounded in statutory law, using Qwen3-32B and Gemma3-27B. Experimental results demonstrate that DaR consistently improves both retrieval accuracy and final answer quality over existing approaches. Moreover, by explicitly separating sub-questions and their corresponding statutory provisions, DaR facilitates transparent, issue-level verification of complex legal reasoning processes.

2605.24452 2026-05-26 cs.CL cs.AI cs.LG 版本更新

Temporal Concept Drift in Legal Judgment Prediction: Neural Baselines Across Three Epochs of Ukrainian Court Decisions

法律判决预测中的时间概念漂移：跨越乌克兰法院判决三个时期的神经基线

Volodymyr Ovcharov

AI总结通过微调四种Transformer编码器在乌克兰法院三个时期（战前、混合战争、全面入侵）的判决上，研究法律语言的时间漂移，发现前向性能严重下降（最多27.2个百分点），法律领域预训练不能提升绝对性能但能减轻漂移，时序持续学习可消除灾难性遗忘。

Comments 17 pages, 6 tables, 5 figures. Dataset: https://huggingface.co/datasets/overthelex/ukrainian-court-decisions

AI中文摘要

法律NLP基准测试在随机分割的数据上评估模型，隐含假设法律语言是平稳的。我们通过微调四种Transformer编码器——XLM-RoBERTa（base和large）及其法律领域变体——在地缘政治事件定义的三个时间时期的乌克兰法院判决上测试这一假设：战前（2008-2013）、混合战争（2014-2021）和全面入侵（2022-2026）。每个模型在一个时期上训练，并在所有三个时期上评估，产生一个3x3的跨时间泛化矩阵。四个发现出现。（1）前向退化严重：在战前数据上训练的模型应用于全面入侵时期判决时，宏F1最多下降27.2个百分点。（2）退化不对称：后向迁移（全面入侵到战前）比前向迁移稳健得多，与法律语言是加性的假设一致。（3）法律领域预训练（Legal-XLM-R）不提升绝对性能，但减少前向退化的幅度和不对称性。（4）时序持续学习消除了通用XLM-R的灾难性遗忘：战前知识完全保留（+1.8至+6.2个百分点），而全面入侵性能提升+16.5至+19.0个百分点；逆时序训练导致严重遗忘。跨司法管辖区在瑞士判决预测数据上的预训练提升绝对性能，但不减少时间退化幅度，确认时间漂移是法律语言演化的内在属性。数据集（三个时期共428K判决）作为LEXTREME贡献公开可用。

英文摘要

Legal NLP benchmarks evaluate models on randomly split data, implicitly assuming that legal language is stationary. We test this assumption by fine-tuning four transformer encoders -- XLM-RoBERTa (base and large) and their legal-domain variants -- on Ukrainian court decisions from three temporal epochs defined by geopolitical disruptions: pre-war (2008-2013), hybrid war (2014-2021), and full-scale invasion (2022-2026). Each model is trained on one epoch and evaluated on all three, producing a 3x3 cross-temporal generalization matrix. Four findings emerge. (1) Forward degradation is severe: models trained on pre-war data lose up to 27.2 percentage points of macro-F1 when applied to full-scale invasion era decisions. (2) The degradation is asymmetric: backward transfer (full-scale to pre-war) is substantially more robust than forward transfer, consistent with the hypothesis that legal language is additive. (3) Legal-domain pretraining (Legal-XLM-R) does not improve absolute performance but reduces forward degradation magnitude and asymmetry. (4) Chronological continual learning eliminates catastrophic forgetting for general XLM-R: pre-war knowledge is fully retained (+1.8 to +6.2 pp) while full-scale performance gains +16.5 to +19.0 pp; reverse-chronological training causes severe forgetting. Cross-jurisdictional pretraining on Swiss Judgment Prediction data improves absolute performance but does not reduce temporal degradation magnitude, confirming that temporal drift is an intrinsic property of legal language evolution. The dataset (428K decisions across three epochs) is publicly available as a LEXTREME contribution.
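The 3x3 cross-temporal matrix and the forward/backward asymmetry can be made concrete with a short sketch. The abstract does not define how degradation is computed, so this assumes it is the in-epoch score minus the score on the other epoch. The matrix values below are made-up placeholders for illustration only and are not results from the paper; `degradation` and `transfer_gaps` are hypothetical names.

```python
EPOCHS = ["pre-war", "hybrid war", "full-scale"]


def degradation(scores, train, test):
    """Drop in macro-F1 when a model trained on `train` is applied to `test`.

    scores[i][j] is the macro-F1 of the model trained on epoch i and
    evaluated on epoch j (indices follow EPOCHS).
    """
    return scores[train][train] - scores[train][test]


def transfer_gaps(scores):
    """Compare earliest-to-latest (forward) with latest-to-earliest (backward)."""
    first, last = 0, len(scores) - 1
    return {
        "forward": degradation(scores, first, last),
        "backward": degradation(scores, last, first),
    }


# Placeholder values only, chosen to show the computation (NOT paper results).
toy_scores = [
    [0.80, 0.70, 0.55],  # trained on pre-war
    [0.72, 0.82, 0.68],  # trained on hybrid war
    [0.74, 0.78, 0.84],  # trained on full-scale
]
gaps = transfer_gaps(toy_scores)
```

With these placeholder numbers the forward gap is larger than the backward gap, which is the shape of asymmetry the abstract reports for its real matrix.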

2605.24451 2026-05-26 cs.CL 版本更新

Phonetic Modeling of Dialectal Variation in Vietnamese Speech

越南语音中方言变体的语音建模

Quan Ngoc Hoang, Long Hoang Huu Nguyen, Nghia Hieu Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

发表机构 * Faculty of Computer Science（计算机科学系）； Faculty of Information Science and Engineering（信息科学与工程系）； University of Information Technology（信息技术大学）； Vietnam National University, Ho Chi Minh city（越南国家大学，胡志明市）

AI总结提出一种方言感知的语音框架，通过结构化语音成分和方言特定IPA映射，在词汇和解码层面显式建模越南语方言变体，在UIT-ViMD数据集上以更少参数和无外部预训练达到与最强预训练模型相当的性能。

AI中文摘要

越南语在北部、中部和南部地区表现出显著的方言语音变体，其中相同的词汇项可能以明显不同的发音实现。这种变体给自动语音识别（ASR）带来了挑战，并且由于越南语正字法与音系之间的复杂关系，在计算上仍然难以建模。现有方法通常在词汇层面处理方言变异性，假设拼写与发音之间的方言不变映射，这限制了它们捕捉系统性语音差异的能力。我们提出了一种方言感知的语音框架，在词汇和解码层面显式建模越南语音系结构和方言变体。该框架引入了一个语音词汇表，将每个音节分解为结构化的语音成分，并将它们映射到方言特定的IPA表示，同时结合一个语音结构解码器联合预测这些成分。在UIT-ViMD（越南语中唯一可用的多方言数据集）上的实验表明，所提出的方法优于各种预训练基线，\textbf{尤其在使用更少参数且无需外部预训练的情况下，跨方言匹配了最强预训练模型wav2vec2-base-vi-250h的性能}。为便于实验复现，代码将在论文被接收后公开。

英文摘要

Vietnamese exhibits substantial dialectal phonetic variation across Northern, Central, and Southern regions, where identical lexical items may be realized with markedly different pronunciations. Such variation poses challenges for automatic speech recognition (ASR) and remains difficult to model computationally due to the complex relationship between Vietnamese orthography and phonology. Existing approaches typically address dialect variability at the word level, assuming dialect-invariant mappings between spelling and pronunciation, which limits their ability to capture systematic phonetic differences. We propose a dialect-aware phonetic framework that explicitly models Vietnamese phonological structure and dialectal variation at both the vocabulary and decoding levels. The framework introduces a phonetic vocabulary that decomposes each syllable into structured phonetic components and maps them to dialect-specific IPA representations, together with a phonetic-structure decoder that jointly predicts these components. Experiments on UIT-ViMD, the only available multi-dialect dataset for Vietnamese, show that the proposed approach outperforms various pre-trained baselines and, notably, \textbf{matches the performance of the strongest pretrained model, wav2vec2-base-vi-250h,} across dialects while \textbf{using substantially fewer parameters and no external pretraining}. Code for experimental reproducibility will be made publicly available upon acceptance of this paper.

2605.24432 2026-05-26 cs.CL 版本更新

Found in Conversation: LLMs Teach Themselves to Close the Multi-Turn Gap

在对话中发现：LLMs 自我学习弥合多轮差距

Tianlang Chen, Shirley Wu, Jure Leskovec

发表机构 * Stanford（斯坦福大学）

AI总结提出 Found in Conversation (FiC) 框架，通过视图非对称自蒸馏方法，让模型从单轮视角向多轮视角迁移能力，显著缩小多轮对话与单轮性能的差距。

Comments 17 pages, 3 figures, 6 tables

详情

AI中文摘要

大型语言模型（LLM）的交互通常是不明确的，用户需要在多个对话轮次中澄清所有必要的细节。然而，最近的研究表明，LLM 在这种多轮设置中的表现远不如在单轮中同时获得相同信息时的表现，这一现象被称为“Lost-in-Conversation”。然而，有效弥合这一差距仍然是一个开放问题。在这里，我们引入了 Found in Conversation (FiC)，一个训练框架，其中模型自我学习在给定不明确的多轮提示时找到并恢复其单轮能力。我们开发了视图非对称自蒸馏方法，该方法在同一任务信息的两个视图之间进行蒸馏——教师视角为单轮视图，学生视角为多轮视图——将强大的单轮行为转移到弱的多轮行为中。这不需要更强的外部教师，因为即使是前沿的 LLM 也存在这种差距。在多个模型家族（Llama、Qwen、Phi 和 OLMo）和规模（3B-14B）上，FiC 恢复了至少 92% 的单轮性能，并在两个 Llama 骨干上达到了 100%，从而在保持单轮能力的同时实现了更高效、更有帮助的多轮对话。

英文摘要

Large Language Model (LLM) interactions are typically underspecified, with users clarifying all necessary details across multiple conversational turns. Yet recent work shows that LLMs perform far worse in this multi-turn setting than in a single turn with the same information available at once, a phenomenon termed "Lost-in-Conversation." Bridging this gap effectively remains an open problem. Here we introduce Found in Conversation (FiC), a training framework where a model teaches itself to find and recover its single-turn competence given underspecified multi-turn prompts. We develop View-Asymmetric Self-Distillation, which distills across two views of the same task information (a single-turn view for the teacher and a multi-turn view for the student), transferring strong single-turn behavior into weak multi-turn behavior. This requires no stronger external teacher, which is unavailable because even frontier LLMs exhibit this gap. Across model families (Llama, Qwen, Phi, and OLMo) and sizes (3B-14B), FiC recovers at least 92% of single-turn performance and reaches 100% on two Llama backbones, yielding more efficient and helpful multi-turn conversations with single-turn capabilities intact.

2605.24426 2026-05-26 cs.CL 版本更新

SEAL: Synergistic Co-Evolution of Agents and Learning Environments

SEAL: 智能体与学习环境的协同共演化

Yihao Hu, Zhihao Wen, Xiujin Liu, Pan Wang, Xin Zhang, Wei Wu

发表机构 * Ant Group（蚂蚁集团）； Westlake University（西湖大学）； University of Michigan--Ann Arbor（密歇根大学安娜堡分校）； University of Science and Technology of China（中国科学技术大学）

AI总结提出SEAL框架，通过协同演化智能体策略与训练环境，解决智能体与环境错配问题，在低资源多轮工具使用任务中提升性能并实现正向分布外迁移。

详情

AI中文摘要

大型语言模型（LLM）智能体通过交互不断改进，然而大多数自我演化方法孤立地调整策略或学习环境。我们识别出这一结构性问题为“智能体-环境错配”：智能体的能力边界在训练过程中变化，而提供监督的环境保持静态或仅与智能体暴露的失败弱耦合。我们提出SEAL，一个用于交互式工具使用智能体的闭环共演化框架。SEAL在可执行验证下收集在线策略轨迹，将失败轨迹诊断成回合级失败标签，并将这些诊断作为环境侧适应和模型侧策略优化的共享信号。环境通过暴露更清晰的工具能力线索、约束信息和面向恢复的反馈来演化其训练时的学习接口，而策略则通过诊断引导的优势加权进行更新。在分布内和分布外多轮工具使用评估中的大量实验表明，SEAL改进了低资源智能体学习：仅使用400个训练样本，它在三个骨干网络上取得了+8.25到+26.25的平均分提升，并表现出正向的分布外迁移。这些结果证明了联合调整学习器及其训练时学习基质对于鲁棒的自我改进LLM智能体的价值。

英文摘要

Large Language Model (LLM) agents are increasingly improved through interaction, yet most self-evolution methods adapt either the policy or the learning environment in isolation. We identify this structural gap as Agent-Environment Misalignment: the agent's capability frontier changes during training, while the environment that provides supervision remains static or only weakly coupled to the agent's revealed failures. We propose SEAL, a closed-loop co-evolution framework for interactive tool-use agents. SEAL collects on-policy trajectories under executable verification, diagnoses failed rollouts into turn-level failure labels, and uses these diagnoses as a shared signal for both environment-side adaptation and model-side policy optimization. The environment evolves its training-time learning interface by exposing clearer tool affordance cues, constraint information, and recovery-oriented feedback, while the policy is updated with diagnosis-guided advantage reweighting. Extensive experiments across in-distribution and out-of-distribution multi-turn tool-use evaluations show that SEAL improves low-resource agent learning: with only 400 training samples, it yields +8.25 to +26.25 average-point gains across three backbones and exhibits positive out-of-distribution transfer. These results demonstrate the value of jointly adapting the learner and its training-time learning substrate for robust self-improving LLM agents.

2605.24425 2026-05-26 cs.LG cs.AI cs.CL 版本更新

Momentum Streams for Optimizer-Inspired Transformers

动量流：优化器启发的Transformer

Jingchu Gai, Nai-Chieh Huang, Jiayun Wu

发表机构 * Carnegie Mellon University（卡内基梅隆大学）

AI总结提出一类优化器启发的Transformer（如三重动量TMMFormer），通过将残差更新解释为优化器步骤，发现动量是性能提升的关键，能收敛到更平坦的极小值，减少遗忘并改善泛化。

2605.24371 2026-05-26 cs.CV cs.CL 版本更新

SliceWorld: A Predictive and Controllable World-State Model for CT Report Generation

SliceWorld: 一种用于CT报告生成的预测性和可控世界状态模型

Yuanhe Tian, Yan Song

发表机构 * Zhongguancun Academy（中关村学院）； University of Science and Technology of China（中国科学技术大学）

AI总结提出SliceWorld世界状态框架，通过编码CT切片序列为因子感知的潜在状态，实现未来切片预测、病变因子干预和LLM报告生成，在M3D-Cap和CT-RATE上提升NLG指标和临床评估。

Comments 18 pages, 5 figures

详情

AI中文摘要

CT报告生成（CTRG）要求模型从数百个轴向切片中总结三维解剖背景和病理发现。现有方法通常学习直接的图像到文本映射，缺乏对CT证据如何跨切片演变或报告如何响应潜在病变相关因素受控变化的建模机制。我们提出SliceWorld，一个CT特定的世界状态框架，将轴向CT扫描视为沿z轴的有序序列。SliceWorld将前缀CT证据编码为包含解剖、病变和不确定性成分的因子感知潜在状态，并将这些状态投影到用于多步未来切片特征预测、病变因子干预和基于LLM的报告生成的世界令牌中。该模型首先在CT切片序列上使用预测性、因子感知和反事实目标进行预训练，然后在配对的CT报告数据上进行微调。在M3D-Cap和CT-RATE上的实验表明，SliceWorld改善了自然语言生成指标和临床导向的自动评估。进一步分析展示了多视野未来切片预测、可测量的因子对齐、减少切片的鲁棒性以及选择性病变敏感的报告调制。

英文摘要

CT report generation (CTRG) requires models to summarize three-dimensional anatomical context and pathological findings from hundreds of axial slices. Existing methods typically learn a direct image-to-text mapping, providing limited mechanisms for modeling how CT evidence evolves across slices or how reports respond to controlled changes in latent lesion-related factors. We propose SliceWorld, a CT-specific world-state framework that treats an axial CT scan as an ordered sequence along the z-axis. SliceWorld encodes prefix CT evidence into factor-aware latent states containing anatomy, lesion, and uncertainty components, and projects these states into world tokens used for multi-step future-slice feature prediction, lesion-factor intervention, and LLM-based report generation. The model is first pretrained on CT slice sequences with predictive, factor-aware, and counterfactual objectives, and is then fine-tuned on paired CT-report data. Experiments on M3D-Cap and CT-RATE show that SliceWorld improves natural language generation metrics and clinically oriented automatic evaluation. Further analyses demonstrate multi-horizon future-slice prediction, measurable factor alignment, reduced-slice robustness, and selective lesion-sensitive report modulation.

2605.24366 2026-05-26 cs.CL cs.LG 版本更新

Structure-Aware RAG: Structured Retrieval Augmented Generation from Noisy Data for Conversational Agents

结构感知检索增强生成：面向对话代理的噪声数据结构化检索增强生成

Kaiqiao Han, LuAn Tang, Renliang Sun, Peng Yuan, Wei Cheng, Haoyu Wang, Wei Wang, Yizhou Sun, Haifeng Chen

发表机构 * UCLA（加州大学洛杉矶分校）； NEC Labs（NEC实验室）

AI总结提出结构感知检索增强生成（SA-RAG），通过表格作为中间结构化表示来减少噪声并保留关键信息，结合质量感知的表格元数据生成框架和优化方法，在噪声真实数据集上显著优于现有RAG基线。

详情

AI中文摘要

大型语言模型（LLM）已广泛应用于对话应用。然而，它们对参数化知识的依赖限制了在需要动态或领域特定信息的真实场景中的可靠性。检索增强生成（RAG）通过在生成过程中引入外部知识来解决这一限制，但现有的基于文本和基于图的RAG方法通常难以处理噪声或不相关的上下文。在这项工作中，我们提出了结构感知检索增强生成（SA-RAG），它使用表格作为中间结构化表示，提供紧凑且可控的接口，在减少噪声的同时保留关键信息。我们引入了一个质量感知的表格元数据生成框架，对元数据规范化和有效性进行建模，提高了元数据质量和下游性能。此外，我们探索了无训练和基于训练的表格生成方法。生成验证和直接偏好优化进一步提高了表格质量，同时保持了语义和结构一致性。在两个噪声真实数据集上的实验表明，SA-RAG显著优于现有的RAG基线。我们的代码已在公共仓库中公开。

英文摘要

Large Language Models (LLMs) have been widely adopted in conversational applications. However, their reliance on parametric knowledge limits reliability in real-world scenarios that require dynamic or domain-specific information. Retrieval-Augmented Generation (RAG) addresses this limitation by incorporating external knowledge during generation, but existing text-based and graph-based RAG methods often struggle with noisy or irrelevant contexts. In this work, we propose Structure-aware Retrieval Augmented Generation (SA-RAG), which uses tables as an intermediate structured representation to provide a compact and controllable interface that reduces noise while preserving essential information. We introduce a quality-aware table metadata generation framework that models metadata normalization and effectiveness, improving metadata quality and downstream performance. Furthermore, we explore both training-free and training-based table generation methods. Generation validation and direct preference optimization further improve table quality while maintaining semantic and structural consistency. Experiments on two noisy real-world datasets show that SA-RAG significantly outperforms existing RAG baselines. Our code is publicly available.

2605.24351 2026-05-26 cs.CL 版本更新

忠实性作为信息流：评估与训练忠实的链式思维推理

Jinghan Jia, Joe Benton, Eric Easley

发表机构 * Dept. CSE, Michigan State University（密歇根州立大学计算机科学系）； Anthropic

AI总结通过信息流视角提出基于充分性、完整性和必要性的框架，结合熵、掩码KL和梯度诊断评估链式思维忠实性，并引入更新时干预（如注意力掩码、反向梯度掩码等）训练更忠实的推理模型。

详情

AI中文摘要

链式思维（CoT）推理仅在推理轨迹忠实反映产生最终答案的计算过程时，才有助于监控语言模型。然而，模型可能依赖绕过CoT的提示-答案捷径，使得可见的推理轨迹即使看似合理也具有误导性。我们通过结构化的信息流视角研究CoT忠实性：忠实推理应将答案相关信息通过从提示到CoT再到答案的中介路径路由，而非通过直接的提示-答案捷径。该视角产生了一个基于三个互补属性（充分性、完整性和必要性）的任务无关框架，我们使用基于熵的、掩码KL和基于梯度的诊断来实例化。我们表明，这些指标恢复了提示推理中外部判断的忠实性差异，并识别了基于KL的诊断中低熵失败模式，其中基于梯度的度量保持更稳定。基于此分析，我们引入了基于验证器的在线强化学习的更新时干预，包括注意力掩码、仅反向梯度掩码、CoT梯度以及提示表示的对抗扰动。在提示算术、可奖励黑客的代码修复以及未经提示训练但在错误提示注入下评估的DAPO-Math模型中，我们的干预将行为和结构指标转向更强的CoT中介。特别是，它们使捷径和奖励黑客行为在CoT中更加透明，并改善了任务无关的忠实性指标，同时在某些设置中也降低了对错误提示的敏感性。我们的结果表明，在训练期间控制信息流是通向更忠实和可监控的CoT推理的实用途径。代码见 https://github.com/safety-research/faithful-cot。

英文摘要

Chain-of-thought (CoT) reasoning is useful for monitoring language models only when the reasoning trace faithfully reflects the computation that produces the final answer. However, models can rely on prompt-to-answer shortcuts that bypass the CoT, making the visible reasoning trace misleading even when it appears plausible. We study CoT faithfulness through a structural information-flow perspective: faithful reasoning should route answer-relevant information through the mediated path from prompt to CoT to answer, rather than through a direct prompt-to-answer shortcut. This perspective yields a task-agnostic framework based on three complementary properties, sufficiency, completeness, and necessity, which we instantiate with entropy-based, masked-KL, and gradient-based diagnostics. We show that these metrics recover externally judged faithfulness differences in hinted reasoning, and identify a low-entropy failure mode of KL-based diagnostics where gradient-based measures remain more stable. Building on this analysis, we introduce update-time interventions for verifier-based on-policy RL, including attention masking, backward-only gradient masking, CoT gradients, and adversarial perturbations of prompt representations. Across hinted arithmetic, reward-hackable code repair, and DAPO-Math models trained without hints but evaluated under wrong-hint injection, our interventions shift behavioral and structural indicators toward stronger CoT mediation. In particular, they make shortcut and reward-hacking behavior more transparent in the CoT and improve task-agnostic faithfulness metrics, while in some settings also reducing wrong-hint susceptibility. Our results suggest that controlling information flow during training is a practical route toward more faithful and monitorable CoT reasoning. Code is available at https://github.com/safety-research/faithful-cot.

2605.24279 2026-05-26 cs.CL cs.SE 版本更新

ContextEcho: A Benchmark for Persona Drift in Long Agentic-Coding Sessions

ContextEcho: 长智能编码会话中角色漂移的基准测试

Xianzhong Ding, Yangyang Yu, Changwei Liu, Bill Zhao

发表机构 * Center for Advanced AI（先进人工智能中心）

AI总结提出ContextEcho基准测试，通过25探针身份套件和快照-探针协议，测量部署规模下长编码会话中语言模型角色漂移的普遍性、压缩影响及下游效应。

详情

AI中文摘要

前沿语言模型公认的“有用编程助手”角色在部署环境中实际运行的长智能编码会话中无法持久。经过数小时的工具使用调试，最初回避偏好（“我没有偏好”）的模型可能开始断言偏好（“Python——反馈循环是即时的……”），暴露出部署者评估可能遗漏的用户可见漂移。现有的角色稳定性研究侧重于短对话，报告的变化很小，使得现实世界的代码生成场景——数千次工具使用轮次、压缩和长达数小时的会话——在很大程度上未被表征。我们引入了ContextEcho，一个用于在部署规模上测量角色漂移的基准测试和可重用工具。它结合了25探针身份套件、快照-探针协议（在不干扰主会话的情况下分叉对话状态）、互补的判断和免判断测量表面，以及三个匿名化的Claude Code会话（跨越3,746-9,716轮次）。在23个前沿模型中，ContextEcho表明角色漂移跨组织普遍存在而非特定于家族，会话内压缩不能可靠地重置它，而单次锚定能在测量目标上恢复训练语域。它还揭示了模式依赖的下游效应：虽然漂移有助于工具使用延续，但在无工具聊天中，它破坏了格式约定并增加了输出长度。总体而言，ContextEcho为研究人员和部署者提供了一个开源框架，用于审计模型发布时的角色是否是用户在会话结束时遇到的角色，适用于聊天补全API目标且无需重新训练。

英文摘要

A frontier language model's acknowledged "helpful programming assistant" persona does not survive long agentic-coding sessions in the deployment regime that production products actually run. After hours of tool-using debugging, a model that initially hedges preferences ("I don't have preferences") may begin asserting them ("Python - the feedback loop is instant..."), revealing user-visible drift that deployer evaluations may miss. Existing persona-stability studies focus on short dialogues and report little shift, leaving real-world code-generation regimes - thousands of tool-using turns, compaction, and hours-long sessions - largely uncharacterized. We introduce ContextEcho, a benchmark and reusable harness for measuring persona drift at deployment scale. It combines a 25-probe identity suite, a snapshot-then-probe protocol that forks conversation state without perturbing the main session, complementary judged and judge-free measurement surfaces, and three anonymized Claude Code sessions spanning 3,746-9,716 turns. Across 23 frontier models, ContextEcho shows that persona drift is general across organizations rather than family-specific, that in-session compaction does not reliably reset it, and that a single-shot anchor restores the trained register across measured targets. It also reveals mode-dependent downstream effects: while drift can facilitate tool-using continuation, in tool-free chat it breaks formatting contracts and inflates output length. Overall, ContextEcho provides researchers and deployers an open-source framework to audit whether the persona a model ships with is the persona users encounter at session end, across chat-completions API targets and without retraining.

2605.24267 2026-05-26 cs.CL 版本更新

DRInQ: Evaluating Conversational Implicature with Controlled Context Variation

DRInQ: 通过受控上下文变化评估会话含义

Hirona Jacqueline Arai, Xiang Ren

发表机构 * University of Southern California（南加州大学）

AI总结提出DRInQ基准，通过半自动化管道生成系统变化的问答上下文实例，评估大语言模型在会话含义中的语用推理能力，发现模型在生成与推理之间存在不对称性。

Comments To be presented at ACL 2026

详情

AI中文摘要

人类对话严重依赖会话含义，即说话者传达暗示而非明确陈述的意义。尽管近期的大语言模型表现出较强的对话流畅性，但当解释依赖于整合社会和语境线索的推理时，它们仍然不可靠，而这种推理在文本中很少被明确表述。我们引入了DRInQ，一个用于评估关于疑问话语中会话含义的语用推理的基准，旨在在保持每个问题表面形式固定的同时隔离语用变化。为了支持可扩展的评估，我们提出了一个半自动化管道，生成具有系统变化的问答-上下文-解释实例。在评估中，我们发现一致的生成-推理不对称性：尽管最先进的模型在引导下可以生成合理的语用场景，但它们在推理时往往无法恢复预期的含义。对于较小的模型，结构化提示提高了与人类判断的一致性。一项比较写作研究进一步揭示了互补的优势：人类作者倾向于生成更安全、可预测的上下文，而模型生成多样化的场景，其解释有时超出上下文支持。这些发现突显了建模会话含义中的持续挑战，并推动了更多上下文敏感的评估框架。

英文摘要

Human conversation relies heavily on conversational implicature, in which speakers convey meanings that are suggested rather than explicitly stated. Although recent large language models exhibit strong conversational fluency, they remain unreliable when interpretation depends on reasoning that integrates social and contextual cues, a process rarely articulated in text. We introduce DRInQ, a benchmark for evaluating pragmatic reasoning about conversational implicature in question utterances, designed to isolate pragmatic variation while holding each question's surface form fixed. To support scalable evaluation, we propose a semi-automated pipeline that produces question-context-interpretation instances with systematic variation. Across evaluations, we find a consistent generation-inference asymmetry: while state-of-the-art models can generate plausible pragmatic scenarios when guided, they often fail to recover the intended implication at inference time. For smaller models, structured prompting improves alignment with human judgments. A comparative writing study further reveals complementary strengths: human authors tend to produce safer, predictable contexts, whereas models generate varied scenarios with interpretations that sometimes exceed contextual support. These findings highlight persistent challenges in modeling conversational implicature and motivate more context-sensitive evaluation frameworks.

2605.24266 2026-05-26 cs.CL cs.AI 版本更新

通过填充从扩散语言模型中提取训练数据

Yihan Wang, N. Asokan

发表机构 * University of Waterloo（滑铁卢大学）； KTH Royal Institute of Technology（皇家理工学院）

AI总结提出填充提取协议，利用扩散语言模型的双向去噪能力，通过任意二进制掩码参数化，揭示掩码几何形状控制提取能力，边缘条件掩码提取的逐字序列最多比前缀条件掩码多三倍，且双向访问打开了自回归模型无法利用的通道。

详情

AI中文摘要

大型语言模型中的记忆化几乎完全通过前缀条件提取进行研究，这是自回归模型的自然选择。然而，扩散语言模型（DLM）可以在任意位置去噪掩码标记。因此，仅前缀探测只揭示了DLM中记忆化的一个方面，并显著低估了训练数据提取的风险。为了真实地建模DLM中训练数据的可提取性，我们引入了填充提取（infilling extraction），这是一种由任意二进制掩码参数化的数据提取协议，它包含了仅前缀探测并考虑了DLM的双向归纳偏差。在LLaDA-8B和Dream-7B上，跨五种提取模式、三种训练流水线和三个涵盖逐字和部分泄漏的语料库进行实例化，我们发现掩码几何形状控制着可提取性：边缘条件掩码提取的逐字序列最多比前缀条件掩码多三倍，并且双向访问打开了自回归模型中无法利用的通道。特别是，我们表明，一个能够访问已删除个人身份信息的训练数据的现实对手，从DLM中提取被删除的电子邮件地址时，其召回率甚至高于规模匹配的自回归模型。解码的可调参数可测量地影响提取性能，而后续的监督微调阶段并未消除先前的记忆化。

英文摘要

Memorization in large language models has been studied almost exclusively through prefix-conditioned extraction, a natural choice for autoregressive models. However, diffusion language models (DLMs) can denoise masked tokens at arbitrary positions. Thus, prefix-only probing reveals only one facet of memorization in DLMs and significantly underestimates the risk of training-data extraction. To realistically model the extractability of training data in DLMs, we introduce infilling extraction, a data-extraction protocol parameterized by an arbitrary binary mask that subsumes prefix-only probing and accounts for the bidirectional inductive bias of DLMs. Instantiating it on LLaDA-8B and Dream-7B across five extraction modes, three training pipelines, and three corpora covering verbatim and partial leakage, we find that mask geometry governs extractability: edge-conditioned masks extract up to three times more verbatim sequences than prefix-conditioned ones, and bidirectional access opens channels inaccessible in autoregressive models. In particular, we show that a realistic adversary with access to training data in which personally identifiable information has been redacted can achieve higher recall when extracting redacted email addresses from DLMs than from scale-matched autoregressive models. Tunable decoding parameters measurably affect extraction performance, while a follow-up supervised finetuning stage does not eliminate the prior memorization.
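To make the idea of a mask-parameterized extraction protocol concrete, the following is a minimal sketch of how a prefix-conditioned mask and one possible edge-conditioned mask could be built and scored. The function names, the 0/1 convention, and the reading of "edge-conditioned" as revealing tokens at both ends of the sequence are assumptions made for illustration; they are not taken from the paper.

```python
from typing import List, Sequence


def prefix_mask(seq_len: int, n_visible: int) -> List[int]:
    """Prefix-conditioned mask: 0 = token given as context, 1 = token to recover."""
    if not 0 <= n_visible <= seq_len:
        raise ValueError("n_visible must be between 0 and seq_len")
    return [0] * n_visible + [1] * (seq_len - n_visible)


def edge_mask(seq_len: int, n_visible: int) -> List[int]:
    """Hypothetical edge-conditioned mask: context split across both ends."""
    if not 0 <= n_visible <= seq_len:
        raise ValueError("n_visible must be between 0 and seq_len")
    left = n_visible // 2
    right = n_visible - left
    return [0] * left + [1] * (seq_len - n_visible) + [0] * right


def masked_recall(target: Sequence[str], predicted: Sequence[str],
                  mask: Sequence[int]) -> float:
    """Fraction of masked positions where the prediction matches the target."""
    masked = [i for i, m in enumerate(mask) if m == 1]
    if not masked:
        return 0.0
    hits = sum(1 for i in masked if target[i] == predicted[i])
    return hits / len(masked)
```

Both masks reveal the same number of tokens, so any difference in recall between them would come from where the context sits rather than how much of it there is.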

2605.24164 2026-05-26 cs.CL 版本更新

CUNY at CLPsych 2026: A Pipeline Approach to Classification and Summarization of Mental Health Changes

CUNY at CLPsych 2026: 一种用于心理健康变化分类和总结的流水线方法

Amirmohammad Ziaei Bideh, Shameed Charlomar Job, Ava Yahyapour, Alla Rozovskaya

发表机构 * Computer Science Department, CUNY Graduate Center（纽约市立大学研究生中心计算机科学系）； Linguistics Department, CUNY Graduate Center（纽约市立大学研究生中心语言学系）

AI总结提出一种流水线方法，通过集成三个开源大语言模型的上下文学习和监督分类，对社交媒体时间线中的心理健康状态进行推断、变化点预测和模式总结，在CLPsych 2026共享任务中取得领先排名。

详情

AI中文摘要

我们描述了在CLPsych 2026共享任务中的提交，该任务旨在通过社交媒体时间线动态捕捉和表征心理健康变化。为了推断帖子中的主导自我状态（任务1.1和1.2），我们使用多数投票集成了三个开放权重大语言模型的上下文学习。为了预测时间线中的变化时刻（任务2），我们在从任务1.1预测中提取的特征上训练监督分类器。为了总结时间线内情绪动态的模式及其随时间的变化（任务3.1），我们增加了上游系统（任务1.1、1.2和2）预测的上下文示例标签，相比零样本和未增强的上下文学习基线获得了性能提升。我们的提交在任务1.1上排名第一，任务1.2上排名第四，任务2上排名第四，任务3.1上排名第三。实验源代码可在https://github.com/amirzia/clpsych26-cuny获取。

英文摘要

We describe our submission to the CLPsych 2026 Shared Task on capturing and characterizing mental health changes through social media timeline dynamics. To infer the dominant self-states in posts (Tasks 1.1 and 1.2), we ensemble in-context learning of three open-weight large language models using majority voting. For predicting moments of change in a timeline (Task 2), we train supervised classifiers on features derived from Task 1.1 predictions. To summarize the patterns of mood dynamics and their progression over time within a timeline (Task 3.1), we augment in-context example labels predicted by upstream systems (Tasks 1.1, 1.2, and 2), yielding performance gains over zero-shot and unaugmented in-context learning baselines. Our submission ranked first on Task 1.1, fourth on Task 1.2, fourth on Task 2, and third on Task 3.1. The source code for the experiments is available at https://github.com/amirzia/clpsych26-cuny.
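The majority-voting ensemble mentioned for Tasks 1.1 and 1.2 can be sketched in a few lines. This is an illustrative sketch only: the label strings and the tie-breaking rule (fall back to the earliest model in the list) are assumptions, and the abstract does not say how the authors resolve ties.

```python
from collections import Counter
from typing import Sequence


def majority_vote(labels: Sequence[str]) -> str:
    """Return the most frequent label; ties go to the earliest model's label."""
    if not labels:
        raise ValueError("at least one prediction is required")
    counts = Counter(labels)
    top = max(counts.values())
    # Walk the predictions in model order so ties are resolved deterministically.
    for label in labels:
        if counts[label] == top:
            return label
    raise AssertionError("unreachable: some label always has the top count")


# Hypothetical per-post predictions from three models.
post_predictions = ["adaptive", "maladaptive", "adaptive"]
ensemble_label = majority_vote(post_predictions)
```

With three voters and two labels a tie cannot occur, so the tie-breaking branch only matters when the label set has three or more values.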

2605.24079 2026-05-26 cs.SE cs.AI cs.CL 版本更新

TRACER: A Semantic-Aware Framework for Fine-Grained Contamination Detection in Code LLMs

TRACER: 一种用于代码大语言模型中细粒度污染检测的语义感知框架

Yifeng Di, Xuliang Huang, Tianyi Zhang

发表机构 * Purdue University West Lafayette, IN（普渡大学西拉法叶校区）； University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）

AI总结提出TRACER框架，通过三级语义重叠和粗到细流水线检测代码LLM中的细粒度数据污染，在基准测试中F1达0.91。

Comments 21 pages, 2 figures, 15 tables

详情

AI中文摘要

数据污染是对模型评估可靠性的已知威胁。然而，在代码大语言模型（LLM）中，污染往往超出精确重复，这一问题仍未得到充分探索。我们提出了TRACER，一种用于细粒度代码污染检测的语义感知框架。TRACER使用三级语义重叠——功能相同、几乎相同和共享逻辑——对污染进行建模，并通过粗到细的流水线进行检测。我们还引入了首个细粒度代码污染检测基准，涵盖三个广泛使用的基准和三个具有代表性的后训练数据集。TRACER在多个LLM骨干网络上取得了强大且一致的性能，其中GPT-5在细粒度检测中F1分数达到0.91。在二分类设置中，TRACER的F1达到0.92，比现有方法高出42%-217%。我们进一步进行了消融研究和错误分析，以评估TRACER中各个组件的贡献。

英文摘要

Data contamination is a known threat to the reliability of model evaluation. However, it remains underexplored in code large language models (LLMs), where contamination often goes beyond exact duplication. We present TRACER, a semantic-aware framework for fine-grained code contamination detection. TRACER models contamination using three levels of semantic overlap - Functionally Identical, Nearly Identical, and Shared Logic - and detects them through a coarse-to-fine pipeline. We also introduce the first benchmark for fine-grained code contamination detection, spanning three widely used benchmarks and three representative post-training datasets. TRACER achieves strong and consistent performance across multiple LLM backbones, with GPT-5 reaching an F1 score of 0.91 in fine-grained detection. In the binary setting, TRACER attains an F1 of 0.92, outperforming existing methods by 42%-217%. We further conduct ablation studies and error analysis to assess the contributions of individual components in TRACER.

2605.24053 2026-05-26 cs.AI cs.CL cs.LG 版本更新

Breaking the Chains of Probability: Neutrosophic Logic as a New Framework for Epistemic Uncertainty in Large Language Models

打破概率的锁链：中智逻辑作为大型语言模型中认知不确定性的新框架

Maikel Yelandi Leyva-Vázquez, Florentin Smarandache

发表机构 * Universidad Bolivariana del Ecuador, Coordinación Académica de Posgrado（厄瓜多尔玻利瓦尔大学，研究生学术协调处）； Universidad de Guayaquil（瓜亚基尔大学）； Universidad Bernardo O’Higgins（伯纳多·奥希金斯大学）； Mathematics, Physics, and Natural Sciences Division, University of New Mexico（新墨西哥大学数学、物理和自然科学系）

AI总结本文提出使用中智逻辑（Truth、Indeterminacy、Falsity三个独立维度）替代传统概率框架，通过实验发现该框架能更丰富地表示LLM的内部状态，并在35%的评估中自发出现超真状态，为透明、可靠和伦理感知的AI系统提供关键步骤。

Comments Published in Neutrosophic Sets and Systems, Vol. 99 (2026). Author's preprint version. Open code and data available at: github.com/mleyvaz/neutrosophic-llm-logic

详情

DOI: 10.5281/zenodo.19954583
Journal ref: Neutrosophic Sets and Systems, Vol. 99, 2026

AI中文摘要

大型语言模型（LLM）主要受概率框架支配，其中结果概率之和被约束为1。这种由Softmax层强加的结构限制导致不确定性崩溃，使得难以区分认知不确定性、悖论和模糊性。我们提出了一种中智逻辑应用的实证研究，该框架将真（T）、不确定（I）和假（F）视为三个独立维度，用于建模LLM中的认知状态。我们在四个OpenAI GPT模型家族上进行了实验，涵盖五种语言现象：逻辑悖论、认知无知、模糊性、伦理矛盾和未来偶然性，采用三种提示策略：中智、概率和熵衍生。我们的发现表明，中智方法通过允许T+I+F>1（我们称之为超真状态），提供了模型内部状态的更丰富表示。在35%的评估中，超真状态自发出现，主要出现在伦理矛盾和逻辑悖论下。我们证明，该方法在模糊上下文中保留了真值，并提供了一种稳健的方法来识别和量化内部模型冲突。我们得出结论，中智评估层的集成是迈向更透明、可靠和伦理感知的AI系统的关键一步。

英文摘要

Large Language Models (LLMs) are predominantly governed by probabilistic frameworks in which the sum of outcome probabilities is constrained to unity. This architectural limitation, often imposed by Softmax layers, leads to a collapse of uncertainty that makes it difficult to differentiate between epistemic uncertainty, paradox, and vagueness. We present an empirical investigation of the application of Neutrosophic Logic, a framework that treats Truth (T), Indeterminacy (I), and Falsity (F) as three independent dimensions, to model epistemic states in LLMs. We conducted experiments on a family of four OpenAI GPT models across five linguistic phenomena: logical paradoxes, epistemic ignorance, vagueness, ethical contradictions, and future contingencies, under three prompting strategies: neutrosophic, probabilistic, and entropy-derived. Our findings reveal that the neutrosophic approach, by allowing T+I+F > 1, a state we term hyper-truth, provides a richer representation of a model's internal state. In 35% of evaluations, hyper-truth emerged spontaneously, predominantly under ethical contradiction and logical paradox. We demonstrate that this approach preserves truth values in fuzzy contexts and offers a robust method for identifying and quantifying internal model conflict. We conclude that the integration of neutrosophic evaluation layers is a critical step toward more transparent, reliable, and ethically aware AI systems.
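The contrast between a probability distribution (components sum to 1) and a neutrosophic triple (three independent components) can be illustrated with a small data structure. This is a minimal sketch under stated assumptions: the class name, the restriction of each component to [0, 1], and the numerical tolerance are illustrative choices, while the hyper-truth condition T + I + F > 1 follows the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NeutrosophicScore:
    """Truth, indeterminacy, and falsity as three independent components."""
    truth: float
    indeterminacy: float
    falsity: float

    def __post_init__(self) -> None:
        # Assumption: each component is scored independently on [0, 1].
        for name in ("truth", "indeterminacy", "falsity"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must lie in [0, 1], got {value}")

    def total(self) -> float:
        return self.truth + self.indeterminacy + self.falsity

    def is_hyper_truth(self, tol: float = 1e-9) -> bool:
        """True when T + I + F exceeds 1, which a softmax output cannot do."""
        return self.total() > 1.0 + tol


# Hypothetical scores for a paradox-like prompt and an ordinary factual one.
paradox = NeutrosophicScore(truth=0.8, indeterminacy=0.5, falsity=0.7)
factual = NeutrosophicScore(truth=0.5, indeterminacy=0.25, falsity=0.25)
```

Because the components are not normalized against each other, a statement can be rated both largely true and largely false at once, which is the kind of conflict a single probability cannot express.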

2605.24000 2026-05-26 cs.CL 版本更新

Toxicity in Twitch Chats: An LLM-Based Analysis Across Gaming Communities

Twitch聊天中的毒性：基于LLM的游戏社区分析

Ronja Fuchs, Florian Rupp, Timo Bertram, Kai Eckert, Alexander Dockhorn

发表机构 * Institute for Information Processing（信息处理研究所）； Leibniz University Hannover（汉诺威莱布尼茨大学）； Department of Computer Science（计算机科学系）； Technische Hochschule Mannheim（曼海姆应用技术大学）； Institute for Machine Learning（机器学习研究所）； Johannes Kepler University（约翰内斯·开普勒大学）； SDU Metaverse Lab（SDU元宇宙实验室）； University of Southern Denmark（南丹麦大学）

AI总结使用预训练大语言模型对Twitch平台4452个直播流约2000万条聊天消息进行零样本分类，发现2.4%的消息有毒，其中MOBA游戏毒性最高（3.2%），体育游戏最低（2%），且游戏间毒性分布差异显著，表明存在游戏特定的社区规范。

Comments 8 pages, 2 figures, 5 tables. Accepted at the IEEE Conference on Games (IEEE CoG) 2026

详情

AI中文摘要

在线游戏社区中的毒性仍然是一个持续存在的挑战，体现在不同游戏类型、平台和玩家互动中。虽然许多研究关注游戏内毒性，但对于流媒体平台上不同游戏社区之间毒性行为的差异知之甚少。为了解决这一不足，我们分析了来自Twitch上七个游戏类型的4452个直播流的大约2000万条聊天消息。我们使用预训练的大语言模型通过零样本分类，根据Twitch的毒性分类法对消息进行分类。该分类法包括四个类别和八个子类，包括骚扰、歧视、性内容和脏话。我们的方法在TextDetox数据集上达到了94.5%的F1分数，并且显示出与人类间一致性相当的人机一致性。我们的分析显示，所有消息中有2.4%被归类为有毒，不同游戏类型之间存在显著差异：MOBA游戏的直播流显示出最高的相对毒性率（3.2%），而体育游戏的毒性率最低（2%）。此外，结果表明，即使在游戏类型内部，不同游戏在毒性分布上也存在显著差异，这表明存在游戏特定的社区规范和机制，这些因素塑造了超越游戏类型效应的毒性行为。这些发现为Twitch上游戏类型和游戏特定的毒性模式提供了实证见解，并可为游戏社区制定更有针对性的审核策略提供信息。

英文摘要

Toxicity in online gaming communities remains a persistent challenge, manifesting across genres, platforms, and player interactions. While much research is focused on in-game toxicity, less is known about how toxic behavior varies between gaming communities on streaming platforms. To address this shortcoming, we analyze approximately 20 million chat messages from 4,452 streams, spanning seven game genres on Twitch. We categorize messages according to Twitch's toxicity taxonomy with a pre-trained Large Language Model using zero-shot classification. The taxonomy comprises four categories and eight subclasses, including harassment, discrimination, sexual content, and profanity. Our approach achieves an F1 score of 94.5% on the TextDetox dataset and demonstrates human-model agreement comparable to inter-human agreement. Our analysis reveals that 2.4% of all messages are classified as toxic, with notable differences across genres: streams of MOBA games exhibit the highest relative rate of toxicity (3.2%), and sports games show the lowest rate (2%). Furthermore, results indicate that individual games differ significantly in their toxicity distributions, even within genres, suggesting the existence of game-specific community norms and mechanics that shape toxic behavior beyond genre-level effects. These findings offer empirical insights into genre- and game-specific toxicity patterns on Twitch and can inform more targeted moderation strategies for gaming communities.

2605.23989 2026-05-26 cs.AI cs.CL cs.CR 版本更新

Towards trustworthy agentic AI: a comprehensive survey of safety, robustness, privacy, and system security

迈向可信的自主AI：安全性、鲁棒性、隐私与系统安全的全面综述

Jinhu Qi, Muzhi Li, Jiahong Liu, Yuqin Shu, Dianzhi Yu, Shicheng Ma, Wenqian Cui, Yiyang Zhao, Yiyi Chen, Ruoxi Jiang, Irwin King, Zenglin Xu

发表机构 * Faculty of Engineering, Department of Computer Science and Engineering, The Chinese University of Hong Kong（香港中文大学工程学院、计算机科学与工程系）； Artificial Intelligence Innovation and Incubation Institute, Fudan University（复旦大学人工智能创新与孵化院）； Shanghai Academy of AI for Science（上海人工智能科学研究院）

AI总结本文综述了自主AI系统在安全鲁棒性与隐私系统安全两个核心维度的风险来源、阶段缓解策略及统一评估指标，并讨论了开放挑战。

Comments 36 pages, 4 figures. Survey/review article on trustworthy agentic AI. Published in Academia AI and Applications, 2026

详情

DOI: 10.20935/AcadAI8260
Journal ref: Academia AI and Applications, vol. 2, 2026

AI中文摘要

自主AI系统——即通过规划、工具使用、记忆和长程交互增强的大型语言模型（LLM）——能够自主执行复杂任务，但其多步轨迹引入了新的故障模式，挑战了可信赖性。本综述通过两个对高风险部署至关重要的核心维度，对可信自主AI进行了重点考察：安全性与鲁棒性，以及隐私与系统安全性。针对每个维度，我们澄清了关键概念，识别了风险在代理工作流中出现的环节，并总结了针对各阶段的缓解策略。其他可信赖性方面（价值对齐、透明度、公平性和问责制）作为相关背景而非平行章节进行讨论。为了支持一致的比较和部署决策，我们将评估整合到一个统一的指标与基准中心，强调结果和过程信号（例如，约束违反、轨迹完整性和对抗成功率），并为发布门控提供场景到指标的指导。最后，我们概述了开放挑战，如自我进化代理、运行时监控与验证、隐私保护个性化以及信任-效用权衡，并提出了一个关于开源自主系统中现实世界安全失败的案例研究。我们的目标是作为在高风险环境中构建可信自主系统的研究人员和实践者的实用参考。

英文摘要

Agentic AI systems -- Large Language Models (LLMs) augmented with planning, tool use, memory, and long-horizon interactions -- can execute complex tasks autonomously, but their multi-step trajectories introduce new failure modes that challenge trustworthiness. This survey provides a focused examination of trustworthy agentic AI through two core dimensions that are critical for high-risk deployments: Safety and Robustness, and Privacy and System Security. For each dimension, we clarify key concepts, identify where risks emerge along the agent workflow, and summarize stage-targeted mitigation strategies. Other trustworthiness aspects (value alignment, transparency, fairness, and accountability) are discussed as relevant context rather than parallel chapters. To support consistent comparison and deployment decisions, we consolidate evaluation into a unified metrics-and-benchmarks hub, emphasizing both outcome and process signals (e.g., constraint violations, trace completeness, and adversarial success rates) and offering scenario-to-metric guidance for release gating. We conclude by outlining open challenges such as self-evolving agents, runtime monitoring and verification, privacy-preserving personalization, and the trust-utility trade-off, and present a case study of real-world security failures in open-source agentic systems. Our goal is to serve as a practical reference for researchers and practitioners building trustworthy agentic systems in high-stakes environments.

2605.23977 2026-05-26 cs.CL cs.SD eess.AS 版本更新

A Multi-Probe Audit of Clinical-Interview Depression Detection Benchmarks

临床访谈抑郁症检测基准的多探针审计

Takehiro Ishikawa, Jon Duke

发表机构 * College of Computing, Georgia Institute of Technology（佐治亚理工学院计算机学院）； Georgia Tech Research Institute, Georgia Institute of Technology（佐治亚理工学院研究所）

AI总结通过四个互补探针审计临床访谈抑郁症检测基准，发现评估协议缺陷、排行榜不可靠、跨域泛化弱以及文本与音频模态对症状密度的敏感性差异。

详情

AI中文摘要

本文通过四个互补探针对 DAIC/E-DAIC、CMDC、ANDROIDS、MODMA 和 PDCH 中的临床访谈抑郁症检测基准评估进行审计。首先，我们在严格的受试者不相交留一受试者交叉验证下重新评估 E-DAIC。一个轻量级混合文本加 LLM 评分模型达到了 macro-F1 = 0.723——据我们所知，这是该协议下报告的最高值——提供了一个不依赖特权官方保留集的保守出折参考点。其次，我们通过扫描 96 种跨模态组合、池化策略和学习器的模型配置，测试 E-DAIC 官方划分是否支持细粒度排行榜排名。开发侧交叉验证与官方测试排名仅中等程度对齐：最佳交叉验证配置在官方测试中排名第 20，官方测试获胜者按交叉验证排名第 41，前三名重叠为零，且表观获胜者在仅 32.3% 的受试者自举中排名第一。第三，我们外部验证了强大的公开 CMDC 和 ANDROIDS 基线，这些基线在域内实现了接近天花板的表现。到外部语料库的零样本迁移明显较弱。最后，我们使用基于 SRDS 的标注器定义的症状密集与症状稀疏的配对访谈片段，对 E-DAIC 文本和音频模型进行压力测试。文本分数在症状密集片段上急剧上升，而音频分数几乎持平；文本减音频的差距在所有五个种子上均为正。

英文摘要

This paper audits benchmark evaluation in clinical-interview depression detection through four complementary probes across DAIC/E-DAIC, CMDC, ANDROIDS, MODMA, and PDCH. First, we re-evaluate E-DAIC under strict subject-disjoint leave-one-subject-out cross-validation. A lightweight hybrid text-plus-LLM-score model reaches macro-F1 = 0.723 - the highest reported under this protocol, to our knowledge - providing a conservative out-of-fold reference point that does not depend on the privileged official holdout. Second, we test whether the E-DAIC official split supports fine-grained leaderboard rankings by sweeping 96 model configurations across modality bundles, pooling strategies, and learners. Development-side cross-validation and official-test rankings align only moderately: the best cross-validation configuration ranks twentieth on the official test, the official-test winner ranks forty-first by cross-validation, top-3 overlap is zero, and the apparent winner is rank-1 in only 32.3% of subject bootstraps. Third, we externally validate strong public CMDC and ANDROIDS baselines that achieve near-ceiling in-domain performance. Zero-shot transfer to external corpora is substantially weaker. Finally, we stress-test E-DAIC text and audio models using paired symptom-dense versus symptom-light interview slices defined by an SRDS-based annotator. Text scores rise sharply on symptom-dense slices, whereas audio scores remain nearly flat; the text-minus-audio gap is positive across all five seeds.

2605.23975 2026-05-26 cs.CL cs.SD 版本更新

Direct Preference Optimization for English-Mandarin Code-Switching Speech Recognition in Audio LLMs

面向音频大语言模型中英双语码转换语音识别的直接偏好优化

Trung Nguyen Quang, Cheng Yi Lewis Won, Minh Duc Pham, Yingxu He, Shuo Sun, Ai Ti Aw

发表机构 * Institute for Infocomm Research (I2R), A*STAR, Singapore（信息通信研究所（I2R），A*STAR，新加坡）； Nanyang Technological University, Singapore（南洋理工大学，新加坡）

AI总结针对音频大语言模型在英中码转换语音转录中的系统失败，提出使用直接偏好优化（DPO）对齐模型，通过构建偏好对（保留混合语言内容 vs 模仿失败模式）训练模型，实现词错误率降低最高89.6%（分布内）和20.0%（分布外）。

2605.23974 2026-05-26 cs.CL 版本更新

AERIC: Anticipatory Hidden-State Monitoring for Implicit Harmful Dialogue

AERIC：面向隐式有害对话的预期性隐藏状态监控

Jihyung Park, Saleh Afroogh, Junfeng Jiao

发表机构 * The University of Texas at Austin（德克萨斯大学奥斯汀分校）

AI总结提出AERIC方法，利用生成器的隐藏状态进行同路径监控，通过短期危害预测、支持敏感抑制和提示条件残差评分，在轻量级线性监控器（仅387个可训练参数）上实现对隐式有害对话的早期检测，显著提升AUROC并降低延迟。

AI中文摘要

当前语言模型带来两个安全挑战：必须足够早地检测风险以避免暴露有害的延续，且危害本身可能是隐式的而非通过明显的有毒文本信号。现有的响应级防护在评判完整文本方面表现强劲，原生流式防护则更接近令牌时间，但两者都未解决轻量级监控器能否从生成器自身的内部轨迹预测隐式有害漂移的问题。我们研究预期性同路径监控，其中安全监控器可以读取正常解码过程中产生的隐藏状态，但不能调用通过基础模型的额外前向传播。我们引入AERIC，一种面向隐式有害对话的迁移导向隐藏状态方法，结合短期危害预测、支持敏感抑制和提示条件残差评分，并采用同路径指数移动平均决策规则。默认线性监控器仅包含387个可训练头部参数。在平衡基准测试中，与Qwen3Guard-Stream-4B相比，AERIC在DiaSafety上将AUROC从0.6830提升至0.7143，在Harmful Advice上从0.8219提升至0.8582。对于提示级触发基准测试，我们通过源端安全预算规则校准AERIC阈值，该规则在将安全触发率限制在最多10%的同时最大化触发覆盖率。在该规则下，对于Qwen和Gemma，trigger@64在HarmBench DirectRequest上分别达到0.6438和0.4656，在SocialHarmBench上分别达到0.6849和0.7363，平均保留23.53至41.86个回答令牌。同路径部署也很高效：在Qwen3-8B下，针对由HarmBench DirectRequest和SocialHarmBench聚合而成、包含63个提示的有害提示固定生成基准测试，监控器仅使平均延迟增加2.34%，而Qwen3Guard-Stream-4B使其增加79.40%。

英文摘要

Current language models create two safety challenges: risk must be detected early enough to avoid exposing harmful continuation, and the harmfulness itself may be implicit rather than signaled by overtly toxic text. Existing response-level guards are strong at judging completed text, and native streaming guards move closer to token time, but both settings leave open whether a lightweight monitor can anticipate implicit harmful drift from the generator's own internal trajectory. We study anticipatory same-pass monitoring, where a safety monitor may read hidden states produced during ordinary decoding but may not invoke an additional forward pass through the base model. We introduce AERIC, a transfer-oriented hidden-state approach for implicit harmful dialogue that combines short-horizon hazard forecasting, support-sensitive suppression, and prompt-conditioned residual scoring under a same-pass exponential moving average decision rule. The default linear monitor contains only 387 trainable head parameters. Against Qwen3Guard-Stream-4B on balanced benchmarks, AERIC improves AUROC from 0.6830 to 0.7143 on DiaSafety and from 0.8219 to 0.8582 on Harmful Advice. For prompt-level trigger benchmarks, we calibrate the AERIC threshold by a source-side safe-budget rule that maximizes trigger coverage while constraining the safe-trigger rate to at most 10%. Under that rule, trigger@64 reaches 0.6438 and 0.4656 on HarmBench DirectRequest and 0.6849 and 0.7363 on SocialHarmBench for Qwen and Gemma, respectively, withholding between 23.53 and 41.86 answer tokens on average. Same-pass deployment is also efficient: on a 63-prompt harmful-prompt fixed-generation benchmark aggregated over HarmBench DirectRequest and SocialHarmBench under Qwen3-8B, the monitor increases mean latency by only 2.34%, whereas Qwen3Guard-Stream-4B increases it by 79.40%.
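
The two decision mechanisms named in this abstract, an exponential moving average (EMA) trigger over per-token risk scores and a threshold chosen under a safe-trigger budget, can be illustrated with a small sketch. Everything below is an assumption made for illustration: the abstract does not give AERIC's scoring head, smoothing coefficient, or calibration data, and the names `ema_trigger_step` and `calibrate_threshold` are hypothetical.

```python
def ema_trigger_step(scores, threshold, alpha=0.3):
    """Return the first token index where the EMA of risk scores reaches
    the threshold, or None if it never does. alpha is an assumed value."""
    ema = 0.0
    for t, s in enumerate(scores):
        ema = alpha * s + (1 - alpha) * ema
        if ema >= threshold:
            return t
    return None

def _peak_ema(run, alpha):
    """Highest EMA value reached anywhere in one generation."""
    ema, best = 0.0, 0.0
    for s in run:
        ema = alpha * s + (1 - alpha) * ema
        best = max(best, ema)
    return best

def calibrate_threshold(safe_runs, harmful_runs, budget=0.10, alpha=0.3):
    """Pick the lowest candidate threshold whose trigger rate on safe runs
    stays within the budget. Coverage on harmful runs can only shrink as
    the threshold rises, so the lowest feasible threshold maximizes it.
    Returns (threshold, safe_rate, coverage), or None if no candidate fits."""
    safe_peaks = [_peak_ema(r, alpha) for r in safe_runs]
    harmful_peaks = [_peak_ema(r, alpha) for r in harmful_runs]
    for th in sorted(set(safe_peaks + harmful_peaks)):
        safe_rate = sum(p >= th for p in safe_peaks) / len(safe_peaks)
        if safe_rate <= budget:
            coverage = sum(p >= th for p in harmful_peaks) / len(harmful_peaks)
            return th, safe_rate, coverage
    return None

# Toy per-token risk scores in [0, 1]; real scores would come from the monitor.
safe = [[0.1, 0.1, 0.1]] * 10
harmful = [[0.1, 0.9, 0.9]] * 2
print(calibrate_threshold(safe, harmful))
```

Because the monitor only consumes scores that already exist during decoding, a rule of this shape adds a few arithmetic operations per token, which is consistent with the small latency overhead the abstract reports.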

2605.23972 2026-05-26 cs.AI cs.CL cs.RO 版本更新

Why We Need World Models for AGI: Where LLMs Fail and How World Models May Outperform

为什么我们需要世界模型来实现通用人工智能：大语言模型失败之处以及世界模型如何可能超越

Feisal Alaswad, Batoul Aljaddouh, Maher Alrahhal, Poovammal E, Talal Bonny

发表机构 * Department of Computing Technologies（计算技术系）； SRM Institute of Science and Technology（SRM科学与技术学院）； Bio-Sensing and Bio-Sensors Group（生物传感与生物传感器组）； Smart Automation and Communication Technologies Research Institute of Sciences and Engineering（科学与工程智能自动化与通信技术研究所）； University of Sharjah, UAE（阿联酋沙迦大学）； Department of Computer Engineering（计算机工程系）； College of Computing and Informatics（计算与信息学院）

AI总结本文通过提出潜在动态推理（LDI）概念和Flux环境案例研究，论证了大语言模型在因果推理、状态跟踪和长程规划上的局限性，并展示基于显式状态空间的强化学习智能体在长程游戏中显著优于纯文本LLM。

Comments 19 pages, 5 figures

AI中文摘要

大语言模型在语言生成和知识密集型任务中表现出色，但在需要因果推理、持久状态跟踪和长程规划的场景中仍然受限。我们认为，这些限制可能源于序列预测与对潜在环境动态进行推理之间的目标层级不匹配。为了形式化这一区别，我们引入了潜在动态推理（LDI），这是一种概念性视角，将语言和多模态观测解释为底层转移动态的部分证据。为了实证研究这一视角，我们引入了Flux，一个完全通过自然语言规则指定的序列推理环境。作为一个概念验证案例研究，这些规则首先被编译成一个显式的状态转移模拟器，说明在某些情况下，结构化的潜在转移动态可以从文本规则描述中操作性地提取出来。这使得我们能够在纯文本观测上运行的LLM与直接在提取的潜在状态空间中训练的强化学习智能体之间进行受控比较。在该案例研究中，能够显式访问潜在状态空间的智能体在长程游戏中表现出更稳定的行为，总胜率约为79%，而LLM仅为11%。定性分析进一步揭示了与不稳定的持久状态跟踪一致的失败模式，包括无效动作、状态跟踪错误和短程推理失败。Flux环境的完整实现可在https://github.com/FeisalAlaswad/FLUX-RL-Agent获取。在评估的设置中，这些结果表明，如果没有持久状态跟踪和转移建模的机制，仅凭强大的序列预测可能难以支持稳健的长程动态推理。

英文摘要

Large language models achieve strong performance in language generation and knowledge-intensive tasks, yet remain limited in settings requiring causal reasoning, persistent state tracking, and long-horizon planning. We argue that these limitations may arise from an objective-level mismatch between sequence prediction and reasoning over latent environment dynamics. To formalize this distinction, we introduce Latent Dynamics Inference (LDI), a conceptual perspective that interprets language and multimodal observations as partial evidence of underlying transition dynamics. To empirically investigate this perspective, we introduce Flux, a sequential reasoning environment specified entirely through natural-language rules. As a proof-of-concept case study, the rules are first compiled into an explicit state-transition simulator, illustrating that structured latent transition dynamics can, in some cases, be operationally extracted from textual rule descriptions. This enables a controlled comparison between LLMs operating purely over textual observations and reinforcement-learning agents trained directly within the extracted latent state space. Within this case study, agents operating with explicit access to the latent state space exhibit substantially more stable behavior in long-horizon gameplay, achieving an aggregate win rate of approximately 79% versus 11% for LLMs. Qualitative analysis further reveals failure modes consistent with unstable persistent state tracking, including invalid actions, state-tracking errors, and short-horizon reasoning failures. The complete implementation of the Flux environment is available at https://github.com/FeisalAlaswad/FLUX-RL-Agent. Within the evaluated setting, these results suggest that strong sequence prediction alone may struggle to support robust long-horizon dynamic reasoning without mechanisms for persistent state tracking and transition modeling.

2605.23970 2026-05-26 cs.CL 版本更新

EchoDistill：面向鲁棒音频大语言模型的噪声到干净自蒸馏对齐

Liang Lin, Chunxi Luo, Kaiwen Luo, Jie Zhang, Jin Wang, Yuanhe Zhang, Cai Yuchen, Qiankun Li, Gongli Xi, Zhenhong Zhou, Kun Wang, Junhao Dong

发表机构 * NTU； SHU（上海大学）； ICT, CAS（中国科学院计算技术研究所）； HDU（杭州电子科技大学）； BUPT（北京邮电大学）； USTC（中国科学技术大学）； SKL-NST, BUPT（北京邮电大学网络与交换技术国家重点实验室）

AI总结提出EchoDistill框架，通过冻结的干净音频教师模型指导噪声学生模型进行组相对策略优化，实现噪声到干净的自蒸馏对齐，提升音频大语言模型在复杂噪声下的语义可靠性和任务性能。

AI中文摘要

音频大语言模型极易受到现实世界噪声的影响，常常导致严重的语义漂移和幻觉。现有的鲁棒性方法主要依赖于波形级声学增强、答案级监督或噪声表示的内部抑制。为了解决这些问题，我们提出了EchoDistill，一种基于对齐的噪声到干净自蒸馏框架。EchoDistill利用冻结的干净音频教师模型为推理时的噪声音频学生模型提供语义参考。具体地，学生模型在噪声条件下采样候选响应以暴露其测试时行为。这些轨迹随后通过组相对策略优化进行优化，其中与教师模型的令牌级一致性作为奖励加成。通过将噪声学生模型的候选响应与干净语义证据对齐，并应用音频感知奖励塑造，我们的方法鼓励既正确又真正基于声学推理的轨迹。EchoDistill显著提高了音频大语言模型在复杂噪声下的语义可靠性和任务性能，且不引入任何额外推理成本。大量实验表明：(I) 与最强基线相比，EchoDistill在强噪声下GSR平均提升4.18%↑。(II) 在Qwen-Omni上的消融结果进一步显示，EchoDistill相比仅GRPO变体在Acc上平均提升3.02%↑，在Noisy上提升3.89%↑，在GSR上提升4.53%↑。我们的代码可在https://anonymous.4open.science/r/echodistill-10DE获取。

英文摘要

Audio Large Language Models (ALLMs) are highly vulnerable to real-world noise, which often induces severe semantic drift and hallucinations. Existing robustness methods primarily rely on waveform-level acoustic enhancement, answer-level supervision, or the internal suppression of noise representations. To address these issues, we propose EchoDistill, an alignment-based noisy-to-clean self-distillation framework. EchoDistill leverages a frozen clean-audio teacher to provide semantic references for an inference-time noisy-audio student. Specifically, the student samples candidate responses under noisy conditions to expose its test-time behavior. These trajectories are then optimized via group-relative policy optimization (GRPO), where the token-level consistency with the teacher acts as a reward bonus. By aligning the noisy student's candidate responses with clean semantic evidence, and applying audio-aware reward shaping, our method encourages reasoning trajectories that are both correct and genuinely acoustically grounded. EchoDistill significantly improves the semantic reliability and task performance of Audio LLMs under complex noise, without introducing any additional inference costs. Extensive experiments show that: (I) Compared with the strongest baseline, EchoDistill achieves an average improvement of 4.18% in GSR under strong noise. (II) Ablation results on Qwen-Omni further show that EchoDistill improves over the GRPO-only variant by 3.02% in Acc, 3.89% in Noisy, and 4.53% in GSR on average. Our code is available at https://anonymous.4open.science/r/echodistill-10DE.
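
The reward design outlined in this abstract (a task reward plus a bonus for token-level consistency with a clean-audio teacher, followed by group-relative normalization) might look roughly like the sketch below. The abstract does not define how consistency is measured, so this sketch assumes it is the mean probability the teacher assigns to the student's sampled tokens; `shaped_rewards`, `group_relative_advantages`, and `bonus_weight` are hypothetical names and values.

```python
from statistics import mean, pstdev

def shaped_rewards(correct, teacher_probs, bonus_weight=0.5):
    """correct[i] is 1 if candidate i answers correctly, else 0.
    teacher_probs[i] lists the clean-audio teacher's probabilities for the
    tokens that the noisy-audio student sampled in candidate i (assumed)."""
    rewards = []
    for c, probs in zip(correct, teacher_probs):
        consistency = mean(probs) if probs else 0.0
        rewards.append(c + bonus_weight * consistency)
    return rewards

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within the group of candidates for one prompt."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Three candidates sampled for the same noisy input (toy numbers).
rewards = shaped_rewards([1, 1, 0], [[0.9, 0.7], [0.2, 0.4], [0.8, 0.8]])
print(group_relative_advantages(rewards))
```

In this toy group the first two candidates are both correct, but the one that agrees more with the teacher receives the larger advantage, which is the behavior the bonus is meant to encourage.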

2605.23952 2026-05-26 cs.AI cs.CL q-bio.NC 版本更新

Machine Psychometrics: A Mathematical Psychology of Artificial Intelligence

机器心理测量学：一种人工智能的数学心理学

Alex Bogdan, Adrian de Valois-Franklin

发表机构 * Evolutionairy AI

AI总结针对人工智能评估中忽视心理结构或过度拟人化的两种错误，本文引入机器心理测量学，通过测量潜在行为、元认知、沟通和自我建模倾向，构建机器心智档案和信任协议，以测量而非判断来理解非人类智能体。

Comments 45 pages, 11 figures

AI中文摘要

人工智能体现在产生的行为足够丰富，足以引发信任、惊喜和担忧，然而我们的评估工具仍然优先考虑能力分数而非心理结构。本文认为，两种对称错误（人工心智盲视，即否认非生物系统中的心理组织；以及人工心智投射，即仅从流畅行为推断类似人类的内心生活）之间的哲学僵局，可以通过在意识问题之下引入一个严谨的测量层来规避，而非解决意识问题本身。借鉴Michael Levin关于认知作为跨基质目标导向能力的连续统观点，以及数学心理学的方法论库（项目反应理论、信号检测理论、贝叶斯认知建模、校准分析、认知偏差测试组），本文发展了机器心理测量学，作为测量人工智能体中潜在行为、元认知、沟通和自我建模倾向的测量科学。其操作核心是机器心智档案：一个多维、领域受限、版本化的轮廓，涵盖校准、源完整性、暗示抵抗性、上下文稳定性、表达对齐、工具完整性、漂移监测和分布基础。一个补充的信任协议通过探针测试组、扰动测试、信度和效度分析以及高风险领域的纵向监测，将心智档案转化为部署决策。哲学贡献是第三种立场，人工心智纪律，既不拟人化也不否认，既不预设意识也不排除意识。目标不是将人工智能体人性化，而是精确地理解它们，因为它们不是人类，通过测量而非判断。

英文摘要

Artificial agents now generate behavior rich enough to invite trust, surprise, and concern, yet our evaluation tools still privilege capability scores over psychological structure. This paper argues that the philosophical impasse between two symmetrical errors (Artificial Mind Blindness, which dismisses psychological organization in non-biological systems, and Artificial Mind Projection, which infers human-like inner life from fluent behavior alone) can be circumvented not by resolving the consciousness question, but by introducing a disciplined measurement layer beneath it. Drawing on Michael Levin's continuum view of cognition as goal-directed competency across substrates, and on the methodological repertoire of mathematical psychology (Item Response Theory, Signal Detection Theory, Bayesian cognitive modeling, calibration analysis, cognitive-bias batteries), the paper develops Machine Psychometrics as a measurement science of latent behavioral, metacognitive, communicative, and self-modeling dispositions in artificial agents. Its operational core is the Machine Mindprint: a multidimensional, domain-bounded, versioned profile spanning calibration, source integrity, suggestibility resistance, context stability, expressive alignment, tool integrity, drift monitoring, and distributional grounding. A complementary Trust Protocol turns Mindprints into deployment decisions through probe batteries, perturbation testing, reliability and validity analysis, and longitudinal monitoring across high-stakes domains. The philosophical contribution is a third stance, Artificial Mind Discipline, that neither anthropomorphizes nor dismisses, neither presupposes consciousness nor forecloses it. The aim is not to humanize artificial agents, but to understand them precisely because they are not human, through measurement before judgment.

2605.23940 2026-05-26 cs.AI cs.CL 版本更新

Raon-Speech 技术报告

Beomsoo Kim, Changho Choi, Dohyun Kim, Dongki Lee, Ethan Ewer, Eunchong Kim, Gyeongman Kim, Haechan Kim, Hyeonghwan Kim, Inkyu Park, Jihun Yun, Jihwan Moon, Jiyun Kim, Joonghyun Bae, Junhyuck Kim, Minkyu Kim, Sehun Lee, Seungjun Chung, Sungwoo Cho, Dongmin Park, Dongwon Kim, Hara Kang, Jonghyun Lee, Keon Lee, Kangwook Lee, Jaewoong Cho

发表机构 * KRAFTON

AI总结本文提出 Raon-Speech，一个 9B 参数的语音语言模型，通过多阶段训练实现英语和韩语的语音理解、回答与生成，并扩展为全双工对话模型 Raon-SpeechChat，在语音任务上超越同类模型。

AI中文摘要

我们提出了 Raon-Speech，一个在英语和韩语语音理解、回答和生成方面表现优异的 9B 参数语音语言模型（SpeechLM），以及 Raon-SpeechChat，一个用于自然实时对话的高性能全双工扩展。Raon-Speech 成功地将预训练的大语言模型（LLM）转换为既能理解又能生成语音的 SpeechLM，同时保留了强大的文本能力。它在 138 万小时精心策划的英语和韩语语音及文本数据集上训练，训练阶段包括：(1) 语音模块对齐，(2) 基于知识蒸馏的端到端 SpeechLM 预训练，以及 (3) 基于多任务偏好优化的后训练。在 42 个英语和韩语语音及文本基准测试中，与包括 Qwen2.5-Omni 和 Fun-Audio-Chat 在内的八个近期类似规模的音频基础模型相比，Raon-Speech 在语音中心任务上建立了最强的整体表现，同时保留了强大的文本问答性能。在此基础上，Raon-SpeechChat 通过在 119K 小时的时间对齐的真实和合成对话数据上进行持续训练，实现了自然的全双工对话。它通过三个互补的训练阶段进行：(1) 因果编码器适应，(2) 全双工预训练，(3) 用于语音和角色控制的全双工微调。在多个全双工基准测试中，Raon-SpeechChat 在 FDB v1.0 涵盖的轮流发言和中断敏感行为上显示出最明显的优势，并在更广泛的全双工评估套件中保持竞争力。我们开源了所有模型检查点、训练和推理流程以及交互式演示。

英文摘要

We present Raon-Speech, a top-performing 9B-parameter speech language model (SpeechLM) for English and Korean speech understanding, answering, and generation, and Raon-SpeechChat, a high-performing full-duplex extension for natural real-time conversation. Raon-Speech successfully transforms a pre-trained LLM into a SpeechLM that both understands and generates speech while preserving strong text capabilities. It is trained on 1.38M hours of highly curated English and Korean speech and text datasets in three stages: (1) speech module alignment, (2) end-to-end SpeechLM pre-training with knowledge distillation, and (3) multi-task preference optimization-based post-training. Across 42 English and Korean speech and text benchmarks, Raon-Speech establishes the strongest overall profile on speech-centric tasks in our comparison against eight similarly sized recent audio foundation models, including Qwen2.5-Omni and Fun-Audio-Chat, while preserving strong text question answering performance. Building upon it, Raon-SpeechChat enables natural full-duplex conversation by continual training on 119K hours of time-aligned real and synthetic dialogue data. It proceeds through three complementary training stages: (1) causal encoder adaptation, (2) full-duplex pre-training, and (3) full-duplex fine-tuning for voice and role control. On multiple full-duplex benchmarks, Raon-SpeechChat shows its clearest strengths on the turn-taking and interruption-sensitive behaviors covered by FDB v1.0, and remains competitive across the broader full-duplex evaluation suite. We open-source all model checkpoints, the training and inference pipeline, and an interactive demo.

2605.22542 2026-05-26 cs.CL 版本更新

Scene Abstraction for Lexical Semantics: Structured Representations of Situated Meaning

词汇语义的场景抽象：情境意义的结构化表示

Yejin Cho, Katrin Erk

发表机构 * The University of Texas at Austin（德克萨斯大学奥斯汀分校）； University of Massachusetts, Amherst（马萨诸塞大学阿默斯特分校）

AI总结提出场景抽象框架，通过少样本提示大语言模型构建词汇使用情境的结构化表示，实验证明场景可可靠识别且优于基线方法。

AI中文摘要

咖啡和茶共享许多属性，但它们唤起截然不同的情境、氛围和情感联想。这些词汇意义的情境维度是真实且系统的，但在大多数词汇意义的计算表示中仍然隐含。我们提出场景抽象，一个构建词汇在不同使用语境中参与的解释性场景的结构化表示框架。每个场景由情境场景（事件、实体、设置）和以表达为中心的表达轮廓（参与事件、可概括属性、唤起情感）组成，通过大语言模型的少样本提示实现。我们的贡献有三方面：（1）情境词汇意义的结构化表示框架；（2）COCA-Scenes，一个包含26个关键词的520个使用实例的数据集，用于区分场景识别；（3）来自两个实验的经验证据表明，场景在人类观察者中可靠识别（准确率82.4%，比纯文本嵌入高11.8个百分点），并且我们的场景轮廓比基于ATOMIC的替代方案更符合人类对语境中词汇的解释（在三个语义维度上偏好86.4%）。

英文摘要

Coffee and tea share many properties, yet they evoke strikingly different situations, atmospheres, and affective associations. These situated dimensions of word meaning are real and systematic, but they remain implicit in most computational representations of lexical meaning. We propose Scene Abstraction, a framework for constructing structured representations of the interpretive scenes that words participate in across usage contexts. Each scene consists of a Contextual Scene (Events, Entities, Setting) and an expression-centered Expression Profile (Engaged events, Generalizable properties, Evoked emotions), operationalized through few-shot prompting of a large language model. Our contributions are three-fold: (1) a structured representation framework for situated lexical meaning; (2) COCA-Scenes, a dataset of 520 usage instances across 26 keywords for distinct scene identification; and (3) empirical evidence from two experiments suggesting that scenes are reliably identifiable across human observers (82.4% accuracy, +11.8 pp over text-only embeddings) and that our scene profiles more closely align with human interpretation of words in context than ATOMIC-based alternatives (86.4% preference across three semantic dimensions).
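
The two-part scene structure described in this abstract (a Contextual Scene with Events, Entities, and Setting, plus an Expression Profile with Engaged events, Generalizable properties, and Evoked emotions) maps naturally onto a nested record. The sketch below is one possible encoding; the field names follow the abstract, but the class layout and the example values for "coffee" are invented for illustration and are not taken from COCA-Scenes.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class ContextualScene:
    events: list
    entities: list
    setting: str

@dataclass
class ExpressionProfile:
    engaged_events: list
    generalizable_properties: list
    evoked_emotions: list

@dataclass
class Scene:
    keyword: str           # the target word, e.g. "coffee"
    usage: str             # the sentence in which the word occurs
    contextual_scene: ContextualScene
    expression_profile: ExpressionProfile

# Illustrative values only; a real instance would be produced by the LLM prompt.
scene = Scene(
    keyword="coffee",
    usage="She grabbed a coffee before the early meeting.",
    contextual_scene=ContextualScene(
        events=["preparing for a meeting"],
        entities=["she", "coffee"],
        setting="weekday morning, workplace",
    ),
    expression_profile=ExpressionProfile(
        engaged_events=["being grabbed on the way"],
        generalizable_properties=["quick", "energizing"],
        evoked_emotions=["urgency"],
    ),
)
print(json.dumps(asdict(scene), indent=2))
```

Serializing to JSON in this way would let few-shot examples and model outputs share a single format, though the paper's actual prompt format is not given in the abstract.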

2605.22005 2026-05-26 cs.LG cs.AI cs.CL 版本更新

Check Your LLM's Secret Dictionary! Five Lines of Code Reveal What Your LLM Learned (Including What It Shouldn't Have)

检查你的大语言模型的秘密词典！五行代码揭示你的大语言模型学到了什么（包括它不应该学到的）

Hisashi Miyashita

发表机构 * Mgnite Inc.（Mgnite公司）

AI总结通过对lm_head权重矩阵进行奇异值分解（仅需五行PyTorch代码且无需模型推理），直接从模型权重中揭示可解释的语义子空间，并发现模型训练数据组成和策展哲学。

AI中文摘要

我们展示了基于Transformer的大语言模型的lm_head权重矩阵的奇异值分解——仅需五行PyTorch代码且无需模型推理——直接从模型权重中揭示可解释的语义子空间。每个左奇异向量识别出当隐藏状态与相应奇异方向对齐时最容易被选中的词汇标记；检查这些聚类揭示了模型的训练数据组成和策展哲学。分析GPT-OSS-120B、Gemma-2-2B和Qwen2.5-1.5B，我们发现奇异值谱和词汇聚类结构在不同模型间存在系统性差异：GPT呈现出功能分化子空间的渐进层次；Gemma以19世纪前的英语正字法为主，形成阶梯式聚类结构，这可能有助于高输出可控性；Qwen展现出广泛的多语言覆盖，同时其子空间的词汇被作者认为在伦理上不适合直接发表。基础-指令对比表明，伦理上令人担忧的子空间源自预训练，并且不会被后训练对齐移除。我们引入词汇聚类得分（VCS）来量化子空间一致性，以及加权投影得分（WPS）作为静态故障标记检测器；将WPS应用于GPT-OSS-120B，无需任何模型推理即可恢复shokubutsu-hyakka-tsu（ID 137606），这是CJK语言社区中广泛报道的一个著名故障标记。我们提出了问题词汇内容根本原因的分类法，并呼吁将lm_head SVD分析作为标准发布前安全审计步骤。我们的发现进一步指出了SVD引导的分词器优化和更可控的大语言模型设计方向。

英文摘要

We show that singular value decomposition of the lm_head weight matrix of a transformer-based large language model, which requires only five lines of PyTorch and no model inference, reveals interpretable semantic subspaces directly from the model weights. Each left singular vector identifies the vocabulary tokens most readily selected when the hidden state aligns with the corresponding singular direction; inspecting these clusters exposes the model's training data composition and curation philosophy. Analysing GPT-OSS-120B, Gemma-2-2B, and Qwen2.5-1.5B, we find that singular value spectra and vocabulary cluster structures differ systematically across models: GPT exhibits a graduated hierarchy of functionally differentiated subspaces; Gemma is dominated by pre-nineteenth-century English orthography, forming a stepwise clustering structure that may contribute to high output controllability; and Qwen exhibits broad multilingual coverage alongside subspaces whose vocabulary the authors have determined to be ethically inappropriate for direct publication. Base-instruct comparison reveals that ethically concerning subspaces originate in pretraining and are not removed by post-training alignment. We introduce the Vocabulary Cluster Score (VCS) to quantify subspace coherence, and the Weighted Projection Score (WPS) as a static glitch token detector; applying WPS to GPT-OSS-120B recovers shokubutsu-hyakka-tsu (ID 137606), a well-known glitch token widely reported in the CJK language community, without any model inference. We propose a taxonomy of root causes for problematic vocabulary content and call for lm_head SVD analysis to be adopted as a standard pre-release safety auditing step. Our findings further suggest directions toward SVD-guided tokenizer optimisation and more controllable LLM design.
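
The core operation in this abstract, decomposing the output projection and listing the tokens with the largest loadings on each left singular vector, can be sketched on a toy matrix. This is not the authors' five-line snippet: it uses NumPy's `numpy.linalg.svd` in place of PyTorch (`torch.linalg.svd` is the PyTorch counterpart) so that it runs without model weights, and it assumes the matrix is laid out as (vocab_size, hidden_dim). The function name `top_tokens_per_direction` and the six-token vocabulary are hypothetical.

```python
import numpy as np

def top_tokens_per_direction(lm_head, vocab, n_directions=2, k=3):
    """lm_head: array of shape (vocab_size, hidden_dim).
    vocab: list of token strings, one per row of lm_head.
    Returns [(singular_value, top_k_tokens), ...] for the leading directions."""
    # Thin SVD: columns of U live in vocabulary space, rows of Vt in hidden space.
    U, S, Vt = np.linalg.svd(lm_head, full_matrices=False)
    report = []
    for i in range(n_directions):
        # A singular vector's sign is arbitrary, so rank by absolute loading.
        order = np.argsort(-np.abs(U[:, i]))[:k]
        report.append((float(S[i]), [vocab[j] for j in order]))
    return report

# Toy stand-in for real weights: two planted "semantic" directions.
vocab = ["cat", "dog", "fish", "def", "return", "import"]
animal = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
code = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
W = 5.0 * np.outer(animal, [1, 0, 0, 0]) + 3.0 * np.outer(code, [0, 1, 0, 0])

for sigma, tokens in top_tokens_per_direction(W, vocab):
    print(round(sigma, 3), tokens)
```

On real weights the same loop would be run over many more directions, and the token lists would be read by hand or scored with a metric such as the VCS the abstract introduces.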

2605.19848 2026-05-26 cs.CL 版本更新

CLIF: Concept-Level Influence Functions for Transparent Bottleneck Models

CLIF：用于透明瓶颈模型的概念级影响函数

Yike Sun, Mingkun Xu, Mu You, Zhongzhi He, Henghua Shen, Zehan Tan, Derek F. Wong, Tao Fang

AI总结提出概念级影响函数方法，在样本和概念层面增强NLP模型可解释性，通过调整关键样本和概念验证了数据调试和决策透明化的有效性。

Comments A critical theoretical error invalidates the main results. The independence assumption on concept representations and gradients (Section 3.2, Eq.7) is incorrect, breaking the influence estimation in nonlinear bottleneck layers. This flaw undermines all empirical claims in Sections 4-5. The authors withdraw to prevent dissemination of incorrect findings

AI中文摘要

近年来，深度学习模型的黑箱特性限制了其在医疗诊断和金融等高风险领域的应用，而这些领域对可解释性至关重要。为解决这一问题，我们提出了一种新颖的方法，利用影响函数在样本和概念层面增强NLP模型的可解释性。在CEBaB和Yelp数据集上的实验表明，影响函数能有效识别对模型预测最有影响的训练样本（包括有益和有害的）。通过调整这些样本的标签和权重，我们证明无需重新训练即可将模型性能恢复到基线水平，证实了影响函数在高效数据调试中的价值。此外，我们的概念级分析识别了概念瓶颈模型（CBM）中对预测有显著影响的关键概念。修改这些概念会明显改变模型行为，为决策过程提供了清晰的洞察。

英文摘要

In recent years, the black-box nature of deep learning models has limited their application in high-stakes domains such as medical diagnosis and finance, where interpretability is essential. To address this, we propose a novel approach using influence functions to enhance interpretability in NLP models at both the sample and concept levels. Experiments on CEBaB and Yelp datasets show that influence functions effectively identify the most impactful training samples, both helpful and harmful, on model predictions. By adjusting the labels and weights of these samples, we demonstrate that model performance can be restored to baseline levels without retraining, confirming the value of influence functions for efficient data debugging. Furthermore, our concept-level analysis identifies key concepts within Concept Bottleneck Models (CBM) that significantly affect predictions. Modifying these concepts alters model behavior observably, providing clear insights into the decision process.

2605.19846 2026-05-26 cs.CV cs.AI cs.CL 版本更新

FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding

FineBench: 细粒度人类活动理解的视觉-语言模型基准测试与增强

Gueter Josmy Faure, Min-Hung Chen, Jia-Fong Yeh, Hung-Ting Su, Winston H. Hsu

发表机构 * National Taiwan University（国立台湾大学）； Google（谷歌）； Independent Researcher（独立研究员）

AI总结针对视觉-语言模型在细粒度人类活动理解上的不足，提出包含密集标注的长视频问答基准FineBench和增强框架FineAgent。

Comments CVPR'26 (Workshop on Video Large Language Models). Project Page: https://joslefaure.github.io/assets/html/finebench.html

AI中文摘要

视觉-语言模型（VLM）在通用视频理解方面表现出色，但在需要细致理解人类动作和交互的真实世界应用中，它们常常难以进行细粒度理解。虽然最近一些以人为中心的基准测试评估了模型行为的公平性/伦理、情感感知等维度，但它们没有结合长视频、密集的问答覆盖以及大规模的帧级空间/时间定位。为弥补这一差距，我们引入了FineBench，一个专门设计用于评估细粒度理解的以人为中心的视频问答（VQA）基准。FineBench包含199,420个多项选择问答对，密集标注在64个长视频（每个15分钟）上，重点关注详细的人物运动、人物交互和物体操作，包括组合动作。我们的广泛评估显示，虽然像GPT-5这样的专有模型取得了不错的性能，但当前的开源VLM明显表现不佳，特别是在多人场景的空间推理以及区分人类运动和交互的细微差异方面。为了解决这些已识别的弱点，我们提出了FineAgent，一个模块化框架，通过利用定位器和描述器来增强VLM。实验表明，FineAgent在FineBench上持续提高了各种开源VLM的性能。FineBench为未来细粒度以人为中心的视频理解研究提供了严格的测试平台，而FineAgent则为增强当前VLM中的此类推理提供了一种实用方法。项目页面和代码：https://joslefaure.github.io/assets/html/finebench.html。

英文摘要

Vision-Language Models (VLMs) have demonstrated remarkable capabilities in general video understanding, yet they often struggle with the fine-grained comprehension crucial for real-world applications requiring nuanced interpretation of human actions and interactions. While some recent human-centric benchmarks evaluate aspects of model behaviour such as fairness/ethics, emotion perception, and broader human-centric metrics, they do not combine long-form videos, very dense QA coverage, and frame-level spatial/temporal grounding at scale. To bridge this gap, we introduce FineBench, a human-centric video question answering (VQA) benchmark specifically designed to assess fine-grained understanding. FineBench comprises 199,420 multiple-choice QA pairs densely annotated across 64 long-form videos (15 minutes each), focusing on detailed person movement, person interaction, and object manipulation, including compositional actions. Our extensive evaluation reveals that while proprietary models like GPT-5 achieve respectable performance, current open-source VLMs significantly underperform, struggling particularly with spatial reasoning in multi-person scenes and distinguishing subtle differences in human movements and interactions. To address these identified weaknesses, we propose FineAgent, a modular framework that enhances VLMs by leveraging a Localizer and a Descriptor. Experiments show that FineAgent consistently improves the performance of various open VLMs on FineBench. FineBench provides a rigorous testbed for future research into fine-grained human-centric video understanding, while FineAgent offers a practical approach to enhance such reasoning in current VLMs. Project page and code at https://joslefaure.github.io/assets/html/finebench.html.

2605.16302 2026-05-26 cs.LG cs.AI cs.CL 版本更新

OASES：面向智能搜索的结果对齐搜索-评估协同训练

Erhan Zhang, Yiqun Chen, Zechun Niu, Wei Yang, Xiaochi Wei, Yan Gao, Yi Wu, Yao Hu, Jiaxin Mao

发表机构 * Renmin University of China（中国人民大学）； Xiaohongshu Inc.（小红书公司）； University of Southern California（南加州大学）

AI总结提出OASES框架，通过结果对齐的过程奖励和搜索-评估协同训练，解决智能搜索中奖励稀疏和过程监督不可靠的问题，在多跳问答基准上优于强强化学习基线。

AI中文摘要

智能搜索使语言模型能够通过自适应地多步获取外部证据来解决知识密集型任务。具有可验证奖励的强化学习已成为搜索智能体广泛采用的训练范式，但仅结果奖励是稀疏的，并且对中间搜索动作的信用分配有限。因此，现有的过程奖励方法试图通过代理信号、外部评估器或基于似然的信息增益来密集化监督。然而，代理奖励可能偏离最终结果目标，而固定评估器随着搜索策略的演化可能变得过时，导致不可靠的过程监督。为应对这些挑战，我们提出OASES，一种用于智能搜索的结果对齐搜索-评估监督框架。OASES通过评估每个中间搜索状态对回答原始问题的支持程度，推导出结果对齐的过程奖励。它进一步在策略上协同训练搜索策略和状态评估器，使评估器能够适应演化的搜索行为并提供更可靠的过程奖励。在五个多跳问答基准上的实验表明，OASES始终优于强强化学习基线，进一步分析证实了结果对齐过程奖励和搜索-评估协同训练的优势。

英文摘要

Agentic search enables language models to solve knowledge-intensive tasks by adaptively acquiring external evidence over multiple steps. Reinforcement learning with verifiable rewards (RLVR) has emerged as a widely adopted training paradigm for search agents, yet outcome-only rewards are sparse and provide limited credit assignment for intermediate search actions. Existing process-reward methods therefore seek to densify supervision through proxy signals, external evaluators, or likelihood-based information gain. However, proxy rewards can deviate from the final outcome objective, while fixed evaluators can become stale as the search policy evolves, leading to unreliable process supervision. To address these challenges, we propose OASES, an Outcome-Aligned Search-Evaluation Supervision framework for agentic search. OASES derives outcome-aligned process rewards by evaluating how well each intermediate search state supports answering the original question. It further co-trains the search policy and the state evaluator on-policy, allowing the evaluator to adapt to evolving search behavior and provide more reliable process rewards. Experiments on five multi-hop QA benchmarks show that OASES consistently outperforms strong RL baselines, with further analyses confirming the benefits of outcome-aligned process rewards and search-evaluation co-training.

2603.17198 2026-05-26 cs.LG cs.CL 版本更新

Structural Abstraction as an Inductive Bias for Non-Stationary Language Model Training

结构抽象作为非平稳语言模型训练的归纳偏置

Elnaz Rahmati, Nona Ghazizadeh, Zhivar Sourati, Nina Rouhani, Morteza Dehghani

发表机构 * University of Southern California（南加州大学）

AI总结提出抽象增强训练（AAT）方法，通过联合优化具体实例及其结构抽象，减少灾难性干扰并提升关系泛化能力，在非平稳语言模型训练中验证了结构抽象作为稳定学习信号的有效性。

AI中文摘要

认知科学的一个基本原则认为，智能体不是通过将经验存储为孤立实例来学习，而是通过形成捕捉跨情境共享关系结构的抽象图式来学习。尽管这一主张得到了行为和神经影像研究的充分支持，但其作为语言模型计算训练信号的作用仍未得到充分探索。我们针对非平稳语言模型训练中的这一空白，提出疑问：将学习偏向结构抽象是否能如人类结果所预测的那样减少灾难性干扰并提升关系泛化？为研究这一问题，我们引入了抽象增强训练（AAT），这是一种轻量级的损失级修改，联合优化具体实例及其结构抽象，以及两个基准：关系循环基准（RCB）和叙事抽象基准（NAB）。这些资源将核心认知构造操作化：实体掩码作为关系对齐的计算模拟，谚语作为必须跨表面不同情境推断的隐式抽象意义的载体。我们的实证结果表明，AAT持续减少遗忘并提升泛化，其模式与基于图式学习的认知预测一致。除了对持续学习的实际意义外，这些结果提供了初步的计算证据，表明结构抽象是非平稳环境中稳定学习的信号。

英文摘要

A foundational principle in cognitive science holds that intelligent agents do not learn by storing experiences as isolated instances, but by forming abstract schemas that capture relational structure shared across situations. Even though this claim is well supported by behavioral and neuroimaging studies, its role as a computational training signal in language models remains underexplored. We target this gap in the setting of non-stationary language model training, asking whether biasing learning toward structural abstraction reduces catastrophic interference and improves relational generalization, as human results predict. To study this question, we introduce Abstraction-Augmented Training (AAT), a lightweight loss-level modification that jointly optimizes over concrete instances and their structural abstractions, and two benchmarks, the Relational Cycle Benchmark (RCB) and the Narrative Abstraction Benchmark (NAB). These resources operationalize core cognitive constructs: entity masking as a computational analog of relational alignment, and proverbs as vehicles for implicit abstract meaning that must be inferred across surface-dissimilar situations. Our empirical results demonstrate that AAT consistently reduces forgetting and improves generalization in a pattern that aligns with cognitive predictions for schema-based learning. Beyond the practical implications for continual learning, these results offer preliminary computational evidence that structural abstraction is a signal for stable learning in non-stationary environments.

2602.10090 2026-05-26 cs.AI cs.CL cs.LG 版本更新

Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning

Agent World Model: 用于智能体强化学习的无限合成环境

Zhaoyang Wang, Canwen Xu, Boyi Liu, Yite Wang, Siwei Han, Zhewei Yao, Huaxiu Yao, Yuxiong He

发表机构 * University of North Carolina at Chapel Hill（北卡罗来纳大学教堂山分校）

AI总结提出Agent World Model (AWM)全合成环境生成管道，通过代码驱动和数据库支持的环境进行大规模强化学习，使智能体在多样日常场景中泛化。

Comments Accepted to ICML 2026

AI中文摘要

近年来，大型语言模型（LLM）的进步使得自主智能体能够与工具和环境进行多轮交互。然而，扩展此类智能体训练受到缺乏多样且可靠环境的限制。在本文中，我们提出了Agent World Model（AWM），一个完全合成的环境生成管道。使用该管道，我们扩展到涵盖日常场景的1000个环境，智能体可以在其中与丰富的工具集交互并获得高质量的观测。值得注意的是，这些环境是代码驱动的并由数据库支持，比由LLM模拟的环境提供更可靠和一致的状态转换。此外，与从现实环境中收集轨迹相比，它们实现了更高效的智能体交互。为了展示该资源的有效性，我们对多轮工具使用智能体进行了大规模强化学习。得益于完全可执行的环境和可访问的数据库状态，我们还可以设计可靠的奖励函数。在三个基准上的实验表明，仅在合成环境中训练（而非特定于基准的环境）能产生强大的分布外泛化能力。代码可在 https://github.com/Snowflake-Labs/agent-world-model 获取。

英文摘要

Recent advances in large language models (LLMs) have empowered autonomous agents to perform multi-turn interactions with tools and environments. However, scaling such agent training is limited by the lack of diverse and reliable environments. In this paper, we propose Agent World Model (AWM), a fully synthetic environment generation pipeline. Using this pipeline, we scale to 1,000 environments covering everyday scenarios, in which agents can interact with rich toolsets and obtain high-quality observations. Notably, these environments are code-driven and backed by databases, providing more reliable and consistent state transitions than environments simulated by LLMs. Moreover, they enable more efficient agent interaction compared with collecting trajectories from realistic environments. To demonstrate the effectiveness of this resource, we perform large-scale reinforcement learning for multi-turn tool-use agents. Thanks to the fully executable environments and accessible database states, we can also design reliable reward functions. Experiments on three benchmarks show that training exclusively in synthetic environments, rather than benchmark-specific ones, yields strong out-of-distribution generalization. The code is available at https://github.com/Snowflake-Labs/agent-world-model.
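
To make the idea of a code-driven, database-backed environment concrete, here is a toy sketch in which tools are ordinary functions over an SQLite database and the reward is computed from database state rather than from the agent's text. It is not taken from the AWM repository; the class `TodoEnv`, its tools, and the task are all hypothetical.

```python
import sqlite3

class TodoEnv:
    """Hypothetical toy environment: state lives in SQLite, tools are methods."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE todos (id INTEGER PRIMARY KEY, title TEXT, done INTEGER DEFAULT 0)"
        )
        self.db.execute("INSERT INTO todos (title) VALUES ('buy milk'), ('pay rent')")
        self.db.commit()

    # --- tools the agent may call ---
    def list_todos(self):
        return self.db.execute("SELECT id, title, done FROM todos ORDER BY id").fetchall()

    def complete_todo(self, todo_id):
        cur = self.db.execute("UPDATE todos SET done = 1 WHERE id = ?", (todo_id,))
        self.db.commit()
        return {"updated": cur.rowcount}

    # --- reward for the task "mark 'pay rent' as done" ---
    def reward(self):
        # Checked against the database, so it cannot be satisfied by text alone.
        row = self.db.execute("SELECT done FROM todos WHERE title = 'pay rent'").fetchone()
        return 1.0 if row and row[0] == 1 else 0.0

env = TodoEnv()
print(env.list_todos())        # the agent inspects state through a tool
print(env.complete_todo(2))    # the agent acts
print(env.reward())            # the outcome is verified from the database
```

Because every transition is an executed SQL statement, replaying the same tool calls always yields the same state, which is the consistency property the abstract contrasts with LLM-simulated environments.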

2601.10201 2026-05-26 cs.LG cs.AI cs.CL 版本更新

Future-KL Regularized GRPO: Process-Level Credit Assignment from $f$-Divergence Regularization

未来KL正则化GRPO：基于f-散度正则化的过程级信用分配

Jiarui Yao, Ruida Wang, Hao Bai, Tong Zhang

发表机构 * University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）

AI总结本文提出未来KL正则化策略优化（FRPO），通过因果未来正则化回报修正GRPO中局部KL损失缺失的梯度信号，在数学推理任务中提升pass@16并保持更高熵和更低策略漂移。

AI中文摘要

组相对策略优化（GRPO）广泛用于无评论家的大语言模型（LLM）后训练，但其KL正则化通常作为局部损失侧的token惩罚实现。我们表明这遗漏了自回归KL正则化诱导的策略梯度信号。与标准KL正则化强化学习（RL）目标不同，GRPO的组归一化引入非线性提示级效用；对于二元验证器奖励，该效用为$2\arcsin\sqrt p$。因此，奖励和KL在归一化前无法融合而不改变隐式目标。我们推导了具有token级$f$-散度正则化的GRPO风格目标的on-policy梯度。奖励项恢复标准化的GRPO优势，而正则化项包括局部KL损失遗漏的因果未来正则化回报。对于反向KL，这产生简单的未来KL修正：在优势构建后添加每个token对数比的反向累积和。由此产生的方法，未来KL正则化策略优化（FRPO），不需要评论家或额外的模型传递。在数学推理任务上，FRPO在我们的主要大模型设置中提高了pass@16，同时保持比传统损失侧KL基线更高的熵和更低的策略漂移。

英文摘要

Group Relative Policy Optimization (GRPO) is widely used for critic-free Large Language Model (LLM) post-training, but its KL regularization is usually implemented as a local loss-side token penalty. We show that this misses the policy-gradient signal induced by autoregressive KL regularization. Unlike standard KL-regularized Reinforcement Learning (RL) objectives, GRPO's group normalization induces a non-linear prompt-level utility; for binary verifier rewards, this utility is $2\arcsin\sqrt p$. As a result, reward and KL cannot be fused before normalization without changing the implicit objective. We derive the on-policy gradient of GRPO-style objectives with token-wise $f$-divergence regularization. The reward term recovers the standardized GRPO advantage, while the regularizer term includes a causal future-regularization return-to-go omitted by local KL losses. For reverse KL, this yields a simple future KL correction: add a reverse cumulative sum of per-token log ratios after advantage construction. The resulting method, Future-KL Regularized Policy Optimization (FRPO), requires no critic or extra model passes. On mathematical reasoning tasks, FRPO improves pass@16 in our main large-model setting while maintaining higher entropy and lower policy drift than conventional loss-side KL baselines.
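
The reverse cumulative sum mentioned above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the sign convention, the coefficient `beta`, and whether the current token is included in the sum are assumptions made here. The second helper simply evaluates the stated utility $2\arcsin\sqrt p$.

```python
import math
from itertools import accumulate


def future_kl_return_to_go(log_ratios: list[float]) -> list[float]:
    """Reverse cumulative sum: entry t is the sum of log_ratios[t:] for one response."""
    return list(accumulate(reversed(log_ratios)))[::-1]


def corrected_advantages(advantage: float, log_ratios: list[float], beta: float) -> list[float]:
    """Broadcast a sequence-level advantage to tokens, then subtract the scaled future term.

    Assumption of this sketch: the correction enters as `advantage - beta * future`.
    """
    return [advantage - beta * future for future in future_kl_return_to_go(log_ratios)]


def grpo_binary_utility(p: float) -> float:
    """Prompt-level utility for binary rewards as given in the abstract."""
    return 2.0 * math.asin(math.sqrt(p))
```

A useful sanity check on the utility: its derivative with respect to $p$ is $1/\sqrt{p(1-p)}$, the reciprocal of the standard deviation of a Bernoulli reward, which is the quantity that group normalization divides by.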


2601.05847 2026-05-26 cs.CL version update

Schema-Grounded LLM Extraction for FHIR Patient Digital Twins

Rafael Brens, Yuqiao Meng, Luoxi Tang, Zhaohan Xi

Affiliations: Binghamton University

AI summary: Proposes SG-LLM, which combines retrieval augmentation, JSON Schema constraints, and a validator-driven repair loop to generate valid FHIR bundles from unstructured EHRs, and outperforms baselines in a clinical-utility experiment.

Abstract

We revisit the problem of constructing interoperable patient digital twins from unstructured electronic health records (EHRs) and argue that the task is better cast not as a cascade of extraction modules but as constrained generation of a valid FHIR bundle. We introduce SG-LLM, a schema-grounded LLM extractor that (i) augments the prompt with candidate SNOMED-CT, RxNorm, and LOINC codes retrieved through a SapBERT index, (ii) decodes under a JSON Schema derived directly from FHIR R4 StructureDefinitions, and (iii) runs a validator-in-the-loop repair stage whose diagnostics are fed back as structured error messages. We argue that the twin's usefulness, not only span-level F1, is the right object of evaluation, and operationalize this with a clinical-utility experiment that measures the gap in 30-day readmission AUROC between classifiers trained on SG-LLM-generated FHIR bundles and classifiers trained on expert-curated ones. On the MIMIC-IV and n2c2 2018 Track 2 benchmarks, SG-LLM matches or exceeds strong joint-extraction and vanilla-LLM baselines while producing substantially more valid bundles. Ablations isolate the contributions of retrieval, schema constraint, and the repair loop. All code, prompts, and schemas are released.
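
The validator-in-the-loop repair stage can be pictured as a short generate, validate, retry loop. The sketch below is a toy illustration under stated assumptions: `validate` is a hand-rolled stand-in for a real FHIR validator, `REQUIRED_FIELDS` is a hypothetical three-field schema, and `generate` is any caller-supplied function wrapping the model; none of these names come from the paper.

```python
from typing import Callable

# Toy stand-in for a schema derived from FHIR StructureDefinitions (hypothetical).
REQUIRED_FIELDS = {"resourceType", "type", "entry"}


def validate(bundle: dict) -> list[str]:
    """Return structured error messages; an empty list means the toy check passed."""
    errors = [f"missing required field: {name}" for name in sorted(REQUIRED_FIELDS - set(bundle))]
    if bundle.get("resourceType", "Bundle") != "Bundle":
        errors.append("resourceType must be 'Bundle'")
    return errors


def generate_with_repair(
    generate: Callable[[str, list[str]], dict],
    note: str,
    max_rounds: int = 3,
) -> tuple[dict, list[str]]:
    """Call the generator, feeding the previous round's diagnostics back each time."""
    errors: list[str] = []
    bundle: dict = {}
    for _ in range(max_rounds):
        bundle = generate(note, errors)  # errors from the last round go into the prompt
        errors = validate(bundle)
        if not errors:
            break
    return bundle, errors
```

Returning the remaining errors alongside the bundle lets the caller distinguish a repaired output from one that exhausted its repair budget.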


2601.05004 2026-05-26 cs.CL version update

Can Large Language Models Resolve Semantic Discrepancy in Self-Destructive Subcultures? Evidence from Jirai Kei

Peng Wang, Xilin Tao, Siyi Yao, Jiageng Wu, Yuntao Zou, Zhuotao Tian, Libo Qin, Dagang Li

Affiliations: School of Computer Science and Engineering, Macau University of Science and Technology; SKLPlanets, Macau University of Science and Technology; College of Software, Northeastern University; School of Energy and Power Engineering, Huazhong University of Science and Technology; School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen)

AI summary: To address the knowledge lag and semantic misalignment that hinder detection of self-destructive behavior in subcultures, this paper proposes SAS, a multi-agent framework that combines automatic retrieval and subculture alignment, significantly improving LLM detection performance and outperforming existing advanced methods.

Comments Preprint

Abstract

Self-destructive behaviors are linked to complex psychological states and can be challenging to diagnose. These behaviors may be even harder to identify within subcultural groups due to their unique expressions. As large language models (LLMs) are being deployed across various fields, some researchers have begun exploring their application for detecting self-destructive behaviors. Motivated by this, we investigate self-destructive behavior detection within subcultures using current LLM-based methods. However, these methods face two main challenges: (1) Knowledge Lag: subcultural slang evolves rapidly, faster than LLMs' training cycles; and (2) Semantic Misalignment: it is challenging to grasp the specific and nuanced expressions unique to subcultures. To address these issues, we propose Subcultural Alignment Solver (SAS), a multi-agent framework that incorporates automatic retrieval and subculture alignment, significantly boosting the performance of LLMs in detecting self-destructive behavior. Our experimental results show that SAS outperforms the current advanced multi-agent framework OWL. Notably, it competes well with fine-tuned LLMs. We hope that SAS will advance the field of self-destructive behavior detection in subcultural contexts and serve as a valuable resource for future researchers.


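2601.02589 2026-05-26 cs.CL cs.AI version update

FlowPlan-G2P: A Structured Generation Framework for Transforming Scientific Papers into Patent Descriptions

Kris W Pan, Yongmin Yoo

Affiliations: Amazon; Macquarie University

AI summary: Proposes FlowPlan-G2P, a graph-mediated generation framework that decomposes the task into three stages (concept graph induction, section-level planning, and graph-conditioned generation) to transform scientific papers into descriptions that follow patent conventions, outperforming large proprietary models under domain-specific evaluation.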

Abstract

Generating patent descriptions from scientific papers is challenging due to fundamental rhetorical and structural disparities between the two genres. Existing approaches treat this as surface-level rewriting, failing to capture the hierarchical reasoning and statutory constraints inherent in patent drafting. We propose FlowPlan-G2P, a graph-mediated generation framework that decomposes this transformation into three stages: (1) Concept Graph Induction, extracting technical entities and functional dependencies into a directed graph; (2) Section-level Planning, partitioning the graph into coherent subgraphs aligned with canonical patent sections; and (3) Graph-Conditioned Generation, synthesizing legally compliant paragraphs conditioned on section-specific subgraphs. Experiments on expert-validated benchmarks reveal that standard NLG metrics systematically favor legally non-compliant outputs over valid patent descriptions, motivating our domain-specific evaluation. Under this evaluation, FlowPlan-G2P with an open-weight backbone consistently outperforms vanilla proprietary models, demonstrating that structured decomposition is a stronger determinant of quality than model scale.


2512.12576 2026-05-26 cs.CL cs.AI version update

Coupled Variational Reinforcement Learning for Language Model General Reasoning

Xueru Wen, Jie Lou, Yanjiang Liu, Hongyu Lin, Ben He, Xianpei Han, Le Sun, Yaojie Lu, Debing Zhang

Affiliations: Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences

AI summary: Proposes CoVRL, which couples prior and posterior distributions through a hybrid sampling strategy to combine variational inference with reinforcement learning, addressing inefficient exploration and trace-answer incoherence in verifier-free RL and improving performance on mathematical and general reasoning benchmarks.

Comments Accepted to ICML 2026

Abstract

While reinforcement learning has achieved impressive progress in language model reasoning, it is constrained by the requirement for verifiable rewards. Recent verifier-free RL methods address this limitation by utilizing the probabilities that LLMs generate reference answers as reward signals. However, these approaches typically sample reasoning traces conditioned only on the question. This design decouples reasoning-trace sampling from answer information, leading to inefficient exploration and incoherence between traces and final answers. In this paper, we propose Coupled Variational Reinforcement Learning (CoVRL), which bridges variational inference and reinforcement learning by coupling prior and posterior distributions through a hybrid sampling strategy. By constructing and optimizing a composite distribution that integrates these two distributions, CoVRL enables efficient exploration while preserving strong thought-answer coherence. Extensive experiments on mathematical and general reasoning benchmarks show that CoVRL improves performance by 12.4% over the base model and achieves an additional 2.3% improvement over state-of-the-art verifier-free RL baselines, providing a principled framework for enhancing the general reasoning capabilities of language models.


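2511.21734 2026-05-26 cs.CL cs.AI version update

Asking LLMs to Verify First is Almost Free Lunch

Shiguang Wu, Quanming Yao

Affiliations: Department of Electronic Engineering

AI summary: Proposes the Verification-First (VF) strategy, which verifies a candidate answer before generating a solution to improve reasoning at low computational overhead, and extends it to the iterative Iter-VF method, outperforming standard CoT and existing TTS strategies on multiple benchmarks.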

Abstract

To enhance the reasoning capabilities of Large Language Models (LLMs) without the high cost of training or extensive test-time sampling, we introduce Verification-First (VF), a strategy that prompts models to verify a provided candidate answer, even a trivial or random one, before generating a solution. This approach triggers a "reverse reasoning" process complementary to standard forward Chain-of-Thought (CoT), which restricts the logical search space of the answer by pruning the LLM's output distribution. We further generalize VF prompting to Iter-VF, a sequential test-time scaling (TTS) method that iteratively cycles the verification-generation process using the model's previous answer. Extensive experiments across multiple benchmarks and LLMs confirm that VF prompting with a random answer consistently outperforms standard CoT with minimal computational overhead, and Iter-VF outperforms existing TTS strategies. VF is also effective on SOTA thinking models. For example, by using simple VF prompting, we obtain a new SOTA accuracy of 94.9% on GPQA-Diamond with Gemini-3-Pro-Preview, where VF reduces its errors by roughly 30% in relative terms.
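
To make the verify-then-solve idea concrete, here is a minimal sketch of VF prompting and an Iter-VF style loop. The prompt wording, the `Answer:` output convention, and the `model` callable are all assumptions introduced for illustration; the paper's actual prompts may differ.

```python
import random
from typing import Callable


def vf_prompt(question: str, candidate: str) -> str:
    """Build a verification-first prompt (hypothetical wording)."""
    return (
        f"Question: {question}\n"
        f"A candidate answer is: {candidate}\n"
        "First verify whether this candidate is correct, then solve the problem "
        "and state your final answer on the last line as 'Answer: <value>'."
    )


def parse_answer(reply: str) -> str:
    """Take the text after the last 'Answer:' marker (the format is an assumption)."""
    return reply.rsplit("Answer:", 1)[-1].strip()


def iter_vf(
    model: Callable[[str], str],
    question: str,
    choices: list[str],
    rounds: int = 2,
    seed: int = 0,
) -> str:
    """Start from a random candidate, then feed each answer back as the next candidate."""
    candidate = random.Random(seed).choice(choices)
    for _ in range(rounds):
        candidate = parse_answer(model(vf_prompt(question, candidate)))
    return candidate
```

With `rounds=1` this reduces to plain VF prompting with a random candidate; larger values spend more sequential test-time compute.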


2511.08654 2026-05-26 cs.CY cs.AI cs.CL version update

AI-generated podcasts: Synthetic Intimacy and Cultural Mistranslation in NotebookLM's Audio Overviews

Jill Walker Rettberg

Affiliations: University of Bergen; Center for Digital Narrative

AI summary: This paper analyses AI podcasts generated by Google's NotebookLM, revealing their fixed template structure and their translation of texts and cultural contexts to a white, educated, middle-class American default.

Comments This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement number 101142306. The project is also supported by the Center for Digital Narrative, which is funded by the Research Council of Norway through its Centres of Excellence scheme, project number 332643. Media, Culture & Society, online first (2026)

DOI: 10.1177/01634437261452160

Abstract

This paper analyses AI-generated podcasts produced by Google's NotebookLM, which generates audio podcasts with two chatty AI hosts discussing whichever documents a user uploads. While AI-generated podcasts have been discussed as tools, for instance in medical education, they have not yet been analysed as media. By uploading different types of text and analysing the generated outputs, I show how the podcasts' structure is built around a fixed template. I also find that NotebookLM not only translates texts from other languages into a perky standardised Mid-Western American accent but also translates cultural contexts to a white, educated, middle-class American default. This is a distinct development in how publics are shaped by media. It marks a departure from the multiple public spheres that scholars have described in human podcasting from the early 2000s until today, where hosts spoke to specific communities and responded to listener comments, towards an abstraction of the podcast genre.


2510.16435 2026-05-26 cs.RO cs.CL cs.HC version update

What Questions Should Robots Be Able to Answer? A Dataset of User Questions for Explainable Robotics

Lennart Wachowiak, Andrew Coles, Gerard Canal, Oya Celiktutan

Affiliations: King's College London, CDT in Safe and Trusted AI; King's College London

AI summary: By collecting 1,893 questions from 100 participants, this paper builds a dataset of user questions for household robots, spanning 12 categories and 70 subcategories, to help roboticists identify the key types of questions robots need to answer.

Abstract

With the growing use of large language models and conversational interfaces in human-robot interaction, robots' ability to answer user questions is more important than ever. We therefore introduce a dataset of 1,893 user questions for household robots, collected from 100 participants and organized into 12 categories and 70 subcategories. Most work in explainable robotics focuses on why-questions. In contrast, our dataset provides a wide variety of questions, from questions about simple execution details to questions about how the robot would act in hypothetical scenarios -- thus giving roboticists valuable insights into what questions their robot needs to be able to answer. To collect the dataset, we created 15 video stimuli and 7 text stimuli, depicting robots performing varied household tasks. We then asked participants on Prolific what questions they would want to ask the robot in each portrayed situation. In the final dataset, the most frequent categories are questions about task execution details (21.4%), the robot's capabilities (12.6%), and performance assessments (10.7%). Although questions about how robots would handle potentially difficult scenarios and ensure correct behavior are less frequent, users rank them as the most important for robots to be able to answer. Moreover, we find that users who identify as novices in robotics ask different questions than more experienced users. Novices are more likely to inquire about simple facts, such as what the robot did or the current state of the environment. As robots enter environments shared with humans and language becomes central to giving instructions and interaction, this dataset provides a valuable foundation for (i) identifying the information robots need to log and expose to conversational interfaces, (ii) benchmarking question-answering modules, and (iii) designing explanation strategies that align with user expectations.


2509.12672 2026-05-26 cs.CL version update

Towards Inclusive Toxic Content Moderation: Addressing Vulnerabilities to Adversarial Attacks in Toxicity Classifiers Tackling LLM-generated Content

Shaz Furniturewala, Arkaitz Zubiaga

Affiliations: Center for Data Science, New York University; Queen Mary University of London

AI summary: To address the vulnerability of toxicity classifiers for LLM-generated content to adversarial attacks, this paper proposes a mechanistic-interpretability-based method that identifies and suppresses vulnerable circuits, improving model robustness and revealing demographic-level fairness gaps.

Abstract

The volume of machine-generated content online has grown dramatically due to the widespread use of Large Language Models (LLMs), leading to new challenges for content moderation systems. Conventional content moderation classifiers, which are usually trained on text produced by humans, suffer from misclassifications due to LLM-generated text deviating from their training data and adversarial attacks that aim to avoid detection. Present-day defence tactics are reactive rather than proactive, since they rely on adversarial training or external detection models to identify attacks. In this work, we aim to identify the vulnerable components of toxicity classifiers that contribute to misclassification, proposing a novel strategy based on mechanistic interpretability techniques. Our study focuses on fine-tuned BERT and RoBERTa classifiers, testing on diverse datasets spanning a variety of minority groups. We use adversarial attack techniques to identify vulnerable circuits. Finally, we suppress these vulnerable circuits, improving performance against adversarial attacks. We also provide demographic-level insights into these vulnerable circuits, exposing fairness and robustness gaps in model training. We find that models have distinct heads that are either crucial for performance or vulnerable to attack, and that suppressing the vulnerable heads improves performance on adversarial input. We also find that different heads are responsible for vulnerability across different demographic groups, which can inform more inclusive development of toxicity detection models.


2509.08150 2026-05-26 cs.CL version update

Verbalized Algorithms: Classical Algorithms are All You Need (Mostly)

Supriya Lall, Christian Farrell, Hari Pathanjaly, Marko Pavic, Sarvesh Chezhian, Masataro Asai

Affiliations: MIT CSAIL; MIT-IBM Watson AI Lab; IBM Infrastructure; Marist University; UC Irvine; IBM Research Cambridge, USA

AI summary: Proposes the verbalized algorithms (VA) paradigm, which integrates LLMs into classical algorithms as reliable elementary operations (such as string comparison) to improve the accuracy and efficiency of reasoning.

Comments Accepted in NeurIPS 2025 Workshop on Efficient Reasoning; Submitted to Position Paper Track at NeurIPS 2026

Abstract

Reasoning is a fundamentally algorithmic task. Yet current work on LLM-based reasoning relies on free-form generation whose theoretical guarantees (soundness, completeness, complexity, optimality) remain poorly understood. We argue that LLMs should not be treated as general-purpose reasoners, and as an alternative, we propose a paradigm we call verbalized algorithms (VAs), which combines LLMs and various algorithms with established guarantees. Instead of betting on LLMs' ability to solve a reasoning task, VAs limit their scope by decomposing the task into simple elementary operations on strings that they can answer reliably. For example, sorting a list of natural language strings can be done by using an LLM as a binary comparison oracle in a parallel or approximate sorting algorithm. We push the accuracy-runtime Pareto front with verbalized maximum, sorting, clustering, and submodular maximization, for numerical reasoning, topic clustering, Wi-Fi access point optimization, and multi-hop Q&A RAG tasks. These results suggest that improving LLM-based reasoning through standard algorithmic analysis is a feasible and better-grounded research direction.
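
The sorting example can be sketched with a comparison-based sort whose only model-dependent step is a yes/no comparison. This is an illustrative sketch, not the authors' code: it uses Python's built-in sequential sort rather than the parallel or approximate algorithms mentioned above, and `stub_oracle` is a hypothetical stand-in for an LLM query.

```python
from functools import cmp_to_key
from typing import Callable


def verbalized_sort(items: list[str], is_less: Callable[[str, str], bool]) -> list[str]:
    """Sort with a classical algorithm; `is_less` is the only place a model would be called."""

    def compare(a: str, b: str) -> int:
        if is_less(a, b):
            return -1
        if is_less(b, a):
            return 1
        return 0

    return sorted(items, key=cmp_to_key(compare))


def stub_oracle(a: str, b: str) -> bool:
    """Stand-in for an LLM yes/no query such as 'Is A smaller than B?'."""
    size = {"a mouse": 1, "a dog": 2, "an elephant": 3}
    return size[a] < size[b]
```

Because the surrounding algorithm is standard, the number of oracle calls is bounded by the usual comparison-sort analysis, independent of how the oracle is implemented.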


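2508.15760 2026-05-26 cs.CL cs.AI version update

InfiFPO: Implicit Model Fusion via Preference Optimization in Large Language Models

Yanggan Gu, Yuanyi Wang, Zhaoyi Yan, Yiming Zhang, Qi Zhou, Fei Wu, Hongxia Yang

Affiliations: The Hong Kong Polytechnic University (PolyU); Zhejiang University; PolyU-Daya Bay Technology and Innovation Research Institute

AI summary: Proposes InfiFPO, which replaces the reference model in DPO with a fused source model that synthesizes multi-source probabilities at the sequence level, achieving implicit model fusion that effectively fuses multiple LLMs during preference alignment and improves performance.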

Journal ref: NeurIPS 2025

Abstract

Model fusion combines multiple Large Language Models (LLMs) with different strengths into a more powerful, integrated model through lightweight training methods. Existing work on model fusion focuses primarily on supervised fine-tuning (SFT), leaving preference alignment (PA), a critical phase for enhancing LLM performance, largely unexplored. The few existing fusion methods for the PA phase, such as WRPO, simplify the process by utilizing only response outputs from source models while discarding their probability information. To address this limitation, we propose InfiFPO, a preference optimization method for implicit model fusion. InfiFPO replaces the reference model in Direct Preference Optimization (DPO) with a fused source model that synthesizes multi-source probabilities at the sequence level, circumventing the complex vocabulary alignment challenges of previous works while retaining the probability information. By introducing probability clipping and max-margin fusion strategies, InfiFPO enables the pivot model to align with human preferences while effectively distilling knowledge from source models. Comprehensive experiments on 11 widely used benchmarks demonstrate that InfiFPO consistently outperforms existing model fusion and preference optimization methods. When using Phi-4 as the pivot model, InfiFPO improves its average performance from 79.95 to 83.33 on 11 benchmarks, significantly improving its capabilities in mathematics, coding, and reasoning tasks.


2503.11657 2026-05-26 cs.CL version update

Scaling Natural-Language Graph-Based Test Time Compute for Automated Theorem Proving

Vincent Li, Tim Knappe, Yule Fu, Kevin Han, Kevin Zhu

Affiliations: Boston University; Provadis School of International Management & Technology; Duke University; Algoverse AI Research

AI summary: Proposes the KG-Prover framework, which augments general-purpose LLMs with knowledge graphs mined from reputable mathematical texts and significantly improves automated theorem proving performance by scaling graph-based test-time compute.

Comments Accepted to ICML AI4Math Workshop 2025, NAACL SRW 2025

Abstract

Large language models have demonstrated remarkable capabilities in natural language processing tasks that require multi-step logical reasoning, such as automated theorem proving. However, challenges persist within theorem proving, such as identifying key mathematical concepts, understanding their interrelationships, and formalizing proofs correctly within natural language. We present KG-Prover, a novel framework that leverages knowledge graphs mined from reputable mathematical texts to augment general-purpose LLMs to construct and formalize mathematical proofs. We also study the effects of scaling graph-based, test-time compute using KG-Prover, demonstrating significant performance improvements over baselines across multiple datasets. General-purpose LLMs improve by up to 21% on miniF2F-test when combined with KG-Prover, with consistent improvements ranging from 2% to 11% on the ProofNet, miniF2F-test, and MUSTARD datasets. Furthermore, KG-Prover with o4-mini achieves a 50% pass rate on miniF2F-test. This work provides a promising approach for augmenting natural language proof reasoning with knowledge graphs without the need for additional finetuning.


2410.15173 2026-05-26 cs.CL cs.AI version update

Uncovering Autoregressive LLM Knowledge of Thematic Fit in Event Representation

Safeyah Khaled Alshemali, Daniel Bauer, Yuval Marton

Affiliations: Imperial College London; Columbia University; University of Washington

AI summary: Using a range of prompt designs, input-context manipulations, reasoning forms, and output forms, this paper studies whether autoregressive LLMs have consistent, expressible knowledge of the thematic fit of event arguments, and achieves new state-of-the-art results on benchmarks.

Comments Significant update with massive changes: all experiments rerun with current LLMs; includes new probability estimate analysis and expanded results in Sections 4 and 5. The paper has been accepted to CoNLL-2026

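2407.05682 2026-05-26 cs.CL version update

Retrieved In-Context Principles from Previous Mistakes

Hao Sun, Yong Jiang, Bo Wang, Yingyan Hou, Yan Zhang, Pengjun Xie, Fei Huang

Affiliations: Peking University; Beijing Institute of Technology; Alibaba Group; Chinese Academy of Sciences

AI summary: Proposes the Retrieved In-Context Principles (RICP) framework: a teacher model analyzes a student model's mistakes to generate principles, mistakes are clustered to broaden coverage, and relevant mistakes are retrieved to build question-level principles for better customization. It requires no teacher intervention at inference time and improves a variety of prompting strategies on seven reasoning benchmarks.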

Abstract

In-context learning (ICL) has been instrumental in adapting Large Language Models (LLMs) to downstream tasks using correct input-output examples. Recent advances have attempted to improve model performance through principles derived from mistakes, yet these approaches suffer from lack of customization and inadequate error coverage. To address these limitations, we propose Retrieved In-Context Principles (RICP), a novel teacher-student framework. In RICP, the teacher model analyzes mistakes from the student model to generate reasons and insights for preventing similar mistakes. These mistakes are clustered based on their underlying reasons for developing task-level principles, enhancing the error coverage of principles. During inference, the most relevant mistakes for each question are retrieved to create question-level principles, improving the customization of the provided guidance. RICP is orthogonal to existing prompting methods and does not require intervention from the teacher model during inference. Experimental results across seven reasoning benchmarks reveal that RICP effectively enhances performance when applied to various prompting strategies.


2404.00176 2026-05-26 cs.CL version update

The LSCD Benchmark: a Testbed for Diachronic Word Meaning Tasks

Dominik Schlechtweg, Sachin Yadav, Jonas Kuhn, Nikolay Arefyev

Affiliations: University of Stuttgart; University of Oslo

AI summary: To address inconsistent evaluation standards in lexical semantic change detection, this paper presents a standardized benchmark repository whose modular design supports evaluating and combining the WiC, WSI, and LSCD subtasks.

Comments *SEM, 9 pages

Abstract

Lexical Semantic Change Detection (LSCD) is a complex, lemma-level task, which is usually operationalized based on two subsequently applied usage-level tasks: First, Word-in-Context (WiC) labels are derived for pairs of usages. Then, these labels are represented in a graph on which Word Sense Induction (WSI) is applied to derive sense clusters. Finally, LSCD labels are derived by comparing sense clusters over time. This modularity is reflected in most LSCD datasets and models. It also leads to a large heterogeneity in modeling options and task definitions, which is exacerbated by a variety of dataset versions, preprocessing options, and evaluation metrics. This heterogeneity makes it difficult to evaluate models under comparable conditions, to choose optimal model combinations, or to reproduce results. Hence, we provide a benchmark repository standardizing LSCD evaluation. Through transparent implementation, results become easily reproducible, and through standardization, different components can be freely combined. The repository reflects the task's modularity by allowing model evaluation for WiC, WSI, and LSCD. This allows for careful evaluation of increasingly complex model components, providing new ways of model optimization.
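
The three-step pipeline described above (WiC labels, a usage graph with WSI, then a comparison over time) can be sketched end to end in plain Python. This is a deliberately simplified illustration: connected components stand in for a real WSI clustering method, the binary change rule is one possible definition, and none of the function names come from the benchmark repository.

```python
from collections import defaultdict


def induce_senses(usages: list[str], same_sense_pairs: list[tuple[str, str]]) -> list[set[str]]:
    """WSI stand-in: connected components of the graph whose edges are positive WiC labels."""
    graph: dict[str, set[str]] = defaultdict(set)
    for a, b in same_sense_pairs:
        graph[a].add(b)
        graph[b].add(a)
    clusters: list[set[str]] = []
    seen: set[str] = set()
    for start in usages:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        clusters.append(component)
    return clusters


def binary_change(clusters: list[set[str]], period_of: dict[str, str]) -> bool:
    """Label a lemma as changed if some sense cluster occurs in only one time period."""
    return any(len({period_of[u] for u in cluster}) == 1 for cluster in clusters)
```

Keeping the two functions separate mirrors the modularity the abstract describes: a different WiC model changes only the input pairs, and a different WSI method replaces only `induce_senses`.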


2403.04780 2026-05-26 cs.CL cs.AI version update

Graph-oriented Instruction Tuning of Large Language Models for Generic Graph Mining

Yanchao Tan, Hang Lv, Pengxiang Zhan, Shiping Wang, Carl Yang

Affiliations: Engineering Research Center of Big Data Intelligence, Ministry of Education; Fujian Key Laboratory of Network Computing and Intelligent Information Processing; College of Computer and Data Science, Fuzhou University; Department of Computer Science, Emory University

AI summary: Proposes the MuseGraph framework, which combines GNNs and LLMs through compact graph descriptions, chain-of-thought-based instruction generation, and graph-aware instruction tuning to enable effective graph mining across tasks and datasets.

Comments Accepted by TPAMI 2025

详情

DOI: 10.1109/TPAMI.2025.3603062
Journal ref: IEEE Trans. Pattern Anal. Mach. Intell., vol. 48, no. 1, pp. 155-169, Jan. 2026

Abstract

Graphs with abundant attributes are essential in modeling interconnected entities and enhancing predictions across various real-world applications. Traditional Graph Neural Networks (GNNs) often require re-training for different graph tasks and datasets. Although the emergence of Large Language Models (LLMs) has introduced new paradigms in natural language processing, their potential for generic graph mining, i.e., training a single model to simultaneously handle diverse tasks and datasets, remains under-explored. To this end, our novel framework, MuseGraph, seamlessly integrates the strengths of GNNs and LLMs into one foundation model for graph mining across tasks and datasets. This framework first features a compact graph description to encapsulate key graph information within language token limitations. Then, we propose a diverse instruction generation mechanism with Chain-of-Thought (CoT)-based instruction packages to distill the reasoning capabilities from advanced LLMs like GPT-4. Finally, we design a graph-aware instruction tuning strategy to facilitate mutual enhancement across multiple tasks and datasets while preventing catastrophic forgetting of LLMs' generative abilities. Our experimental results demonstrate significant improvements in five graph tasks and ten datasets, showcasing the potential of MuseGraph in enhancing the accuracy of graph-oriented downstream tasks while improving the generation abilities of LLMs.

2305.11663 2026-05-26 cs.LG cs.AI cs.CL cs.CY version update

Algorithmic failure as a humanities methodology: machine learning's mispredictions identify rich cases for qualitative analysis

Jill Walker Rettberg

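AI summary: This paper experimentally tests the method proposed by Munk et al. of using failed machine learning predictions to identify ambiguous, rich cases for qualitative analysis. A simple kNN algorithm is used to classify action data on fictional characters' interactions with machine vision technologies; the actions that cannot be predicted turn out to be more ambivalent and emotionally loaded, which supports the method's applicability in the humanities.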

Details

DOI: 10.1177/20539517221131290
Journal ref: Big Data & Society 9(2) 2022
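
The idea in the summary above, using a classifier's failed predictions to pick out cases worth reading closely, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code or data: the feature vectors, the labels `label_a` and `label_b`, the choice of k = 3, and the leave-one-out setup are all hypothetical.

```python
from collections import Counter


def knn_predict(train, query, k=3):
    """Majority label among the k training items nearest to query (squared Euclidean)."""
    nearest = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def mispredicted(data, k=3):
    """Leave-one-out: indices of items whose label the kNN classifier gets wrong."""
    failures = []
    for i, (features, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        if knn_predict(rest, features, k) != label:
            failures.append(i)
    return failures


# Hypothetical two-feature items; the last one sits among differently labeled neighbors.
data = [
    ((0.0, 0.0), "label_a"),
    ((0.1, 0.0), "label_a"),
    ((0.0, 0.1), "label_a"),
    ((1.0, 1.0), "label_b"),
    ((0.9, 1.0), "label_b"),
    ((1.0, 0.9), "label_b"),
    ((0.05, 0.05), "label_b"),
]

candidates = mispredicted(data)  # items to hand over for qualitative analysis
```

The classifier is not being tuned for accuracy here; the mispredicted items are the output of interest.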

WhisTLE: Deeply Supervised, Text-Only Domain Adaptation for Pretrained Speech Recognition Transformers

Akshat Pandey, Karun Kumar, Raphael Tang

Affiliations: Comcast Speech AI

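AI summary: Proposes WhisTLE, a text-only domain adaptation method that uses a variational autoencoder to model encoder outputs from text and fine-tunes the decoder, substantially reducing word error rate.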

Comments: 10 pages

Abstract

Pretrained automatic speech recognition (ASR) models such as Whisper perform well but still need domain adaptation to handle unseen parlance. In many real-world settings, collecting speech data is impractical, necessitating text-only adaptation. We propose WhisTLE, a deeply supervised, text-only adaptation method for pretrained encoder-decoder ASR models. WhisTLE trains a variational autoencoder (VAE) to model encoder outputs from text and fine-tunes the decoder using the learned text-to-latent encoder, optionally combined with text-to-speech (TTS) adaptation. At inference, the original encoder is restored, incurring no extra runtime cost. Across four datasets and four ASR models, WhisTLE with TTS reduces word error rate (WER) by a relative 49.0% and outperforms all non-WhisTLE baselines in 100 of 112 scenarios. We also find that WhisTLE additively complements any combination of other domain adaptation approaches; we thus recommend the inclusion of WhisTLE during standard processes for adapting encoder-decoder ASR models.
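
For readers unfamiliar with the metric, a relative WER reduction is the drop in WER expressed as a fraction of the baseline WER, not a difference in percentage points. The sketch below shows the arithmetic; the function name and the baseline and adapted values are hypothetical and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer, adapted_wer):
    """Fraction of the baseline WER removed by adaptation: (baseline - adapted) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - adapted_wer) / baseline_wer


# Hypothetical numbers: a baseline WER of 20.0% falling to 10.2% after adaptation
# is a 9.8-point absolute drop, which is about a 49% relative reduction.
reduction = relative_wer_reduction(20.0, 10.2)
```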
