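arXivDaily daily academic digest: synced with the full arXiv dataset, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and other fields.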

2605.28067 2026-05-28 cs.AI

BlazeEdit: Generalist Image Editing on Mobile Devices with Image-to-Image Diffusion Models

Fei Deng, Yanwu Xu, Zhipeng Bao, Zhixing Zhang, Haolin Jia, Karthik Raveendran, Jianing Wei

AI summary: Proposes BlazeEdit, a lightweight generalist image-editing diffusion model with only 195M parameters that drops the text-conditioning components and adopts a multi-task architecture, enabling fast, privacy-preserving image editing on mobile devices.

Comments: Accepted to CVPR 2026 EDGE Workshop

Abstract

The remarkable generation quality of modern diffusion models often comes at the cost of massive parameter counts, which necessitate server-side inference with significant computational costs and potential privacy risks. Consequently, there is growing momentum toward developing efficient on-device alternatives. While recent efforts have optimized text-to-image models for mobile hardware, they remain relatively bulky, typically ranging from 0.5B to 1B parameters. We present BlazeEdit, a highly efficient, generalist image-to-image diffusion model tailored for on-device deployment. By identifying that many practical image editing tasks do not require text-based guidance, we eliminate the text-conditioning components and develop a multi-task architecture that consolidates object removal, outpainting, tone correction, relighting, and sticker generation into a single, compact model of only 195M parameters. BlazeEdit achieves a substantial reduction in download size and memory overhead while maintaining competitive generation quality. It completes a full inference pass in just 290ms on a Pixel 10, delivering a seamless, privacy-preserving, and lightning-fast experience for generalist image editing on the edge.

2605.28065 2026-05-28 cs.AI

Verifiable Benchmarking of Long-Horizon Spatial Biology

Ian Diks, Harihara Muralidharan, Tim Proctor, Kenny Workman

AI summary: Introduces the SpatialBench-Long benchmark, whose 24 evaluations test whether AI agents can derive scientific conclusions from raw spatial data; the best current model-harness pairs reach only an 11.1% success rate.

Abstract

AI agents are increasingly useful for biological data analysis, but existing benchmarks mostly test broad biological knowledge, executable workflows, or localized analysis steps rather than end-to-end scientific reasoning over spatial measurements. We introduce SpatialBench-Long, a benchmark for long-horizon spatial biology in which agents must recover biological claims from raw or near-raw data and calibrated experimental context without prescribed methods. SpatialBench-Long contains 24 evaluations across primary pancreatic ductal adenocarcinoma (PDAC), engineered glioblastoma organoids and in vivo tumors, Cas9 lineage-traced lung adenocarcinoma, and mouse optic nerve aging/intervention systems, spanning CosMx, Visium, Xenium, multiplexed error-robust fluorescence in situ hybridization (MERFISH), single-cell RNA sequencing (scRNA-seq), Slide-seq, Slide-tags, histology, and lineage-recording data. Candidate claims are hardened through reproduction, independent scientist review, and trajectory inspection. Final answers are graded deterministically over controlled vocabularies and symbols with companion rubrics capturing progress through key analysis chokepoints. Across the SpatialBench-Long benchmark, three model-harness pairs tie at 8/72 runs (11.1%): Gemini 3.5 Flash / Pi terminal coding harness, GPT-5.5 / Pi, and GPT-5.5 / OpenAI Codex. SpatialBench-Long tests whether agents can move beyond executing procedural analysis to deriving accurate scientific conclusions from complex spatial measurements.

2605.28063 2026-05-28 cs.SD cs.AI cs.MM

Unified Synthesis of Compositional Speech and Sound from Free-Form Text Prompts

Yuyue Wang, Xihua Wang, Xin Cheng, Yijing Chen, Ruihua Song

AI summary: Proposes PlanAudio, a framework that uses the reasoning ability of large language models and a semantic latent chain-of-thought mechanism to generate unified audio containing speech and sound directly from free-form text.

Abstract

Audio generation has made significant progress, yet synthesizing unified audio where speech and sounds are naturally composited remains a challenge. Current methods either rely on disjoint pipelines, which fail to capture fine-grained interactions, or require structured inputs and external text rewriting, which limits the flexibility of free-form text prompts. In this paper, we introduce a new task: Free-Form-Text-Prompt-to-Unified-Audio generation, which aims to directly synthesize unified audio containing speech, sound, and their composites from unconstrained natural language. To address this task, we propose PlanAudio, a unified, autoregressive LLM-based framework. First, it simplifies the model architecture by leveraging intrinsic LLM reasoning capability instead of traditional text encoders. Second, it introduces a semantic latent chain-of-thought mechanism, an implicit planning mechanism that bridges high-level semantic understanding and low-level acoustic synthesis. Furthermore, we create PlanAudio-Bench, a specialized benchmark for evaluating composite audio scenarios. We perform evaluations in the scenarios of speech, sound, and their composites. The results demonstrate that PlanAudio generally outperforms the existing pipeline and unified baselines, while staying competitive with models designed for a single scenario. Our analysis further reveals the superiority of semantic latent CoT over other CoT mechanisms and highlights the importance of continuous multi-scenario training curricula.

2605.28062 2026-05-28 cs.CL cs.IR

ConvMemory: A Lightweight Learned Memory Reranker, a Negative Attribution Result, and a Research-Preview Conflict Editor

Taiheng Pan

AI summary: Presents ConvMemory, a 3.6M-parameter learned reranker for conversational long-term memory retrieval, trained with cross-encoder teacher supervision over fused dense and lexical features, and reports a negative attribution result along with CCGE-LA, a research-preview conflict editor.

Comments: 15 pages. Technical report

Abstract

We describe ConvMemory, a small 3.6M-parameter learned reranker for conversational long-term memory retrieval, trained with cross-encoder teacher supervision over fused dense and lexical features. On the LongMemEval memory family, ConvMemory operates above the BGE-large cross-encoder in Recall@10 at 12-47x lower latency, remains within 0.025 Recall@10 of mxbai-rerank-large-v1 on Clean500 while running 28x cheaper; under Stress1000 distractors the Recall@10 gap widens to 0.081 but ConvMemory still operates at 117x lower latency; these LongMemEval numbers are single-run or single-seed and are reported as indicative cost-frontier evidence, not benchmark-grade. We then publish a rigorous negative attribution result on a previously claimed mechanism: a five-seed retrained ablation with paired bootstrap shows that ConvMemory's learned temporal window is statistically significant on aggregate but not temporally specific, with the largest effects on hard non-temporal controls and no significant effect on multi-hop temporal queries. The honest description of the mechanism is cheap cross-encoder distillation in a fused dense+lexical feature space, not temporal-structure exploitation. We additionally release CCGE-LA, a low-amplitude conflict-aware candidate-set editor over ConvMemory, as a research preview with modest but consistent gains on supersession and stale/rescue slices on LoCoMo. All results are retrieval-stage; ConvMemory does not match mxbai-rerank-large-v1 in absolute LoCoMo MRR, and the report is single-author and not yet independently audited.
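
For readers unfamiliar with the retrieval metrics quoted above, the following is a minimal sketch of Recall@k and reciprocal rank (averaging the latter over queries gives MRR). It is not ConvMemory's evaluation code: the function names are hypothetical, and the Recall@k convention used here (fraction of relevant memories found in the top k) is an assumption, since LongMemEval and LoCoMo may define the metrics slightly differently.

```python
from typing import Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int = 10) -> float:
    """Fraction of relevant items that appear among the top-k ranked items."""
    if not relevant_ids:
        return 0.0
    top_k = set(ranked_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


def reciprocal_rank(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(rankings, relevant_sets) -> float:
    """Average reciprocal rank over paired (ranking, relevant set) queries."""
    scores = [reciprocal_rank(r, rel) for r, rel in zip(rankings, relevant_sets)]
    return sum(scores) / len(scores) if scores else 0.0
```

A Recall@10 gap of 0.025, as reported on Clean500, would then mean the two rerankers differ by 2.5 percentage points under whichever Recall@10 convention the report uses.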

2605.28060 2026-05-28 cs.CL

Challenges in Explaining Pretrained Clinical Text Classifiers

Kristian Miok, Matej Klemen, Blaz Škrlj, Marko Robnik Šikonja

AI summary: Using a hospital length-of-stay prediction task, this paper shows the limitations of post-hoc explanation methods such as LIME and SHAP on clinical narratives, including overemphasis on non-informative tokens, unstable attributions, and high-confidence predictions for incoherent inputs, and underscores the need for explanation strategies that are clinically meaningful, semantically grounded, and robust to linguistic noise.

DOI: 10.1007/978-3-032-19105-2_22
Journal ref: Machine Learning and Principles and Practice of Knowledge Discovery in Databases. ECML PKDD 2025. Communications in Computer and Information Science, vol 2842, pp. 314-322. Springer, Cham (2026)
Comments: 9 pages, 7 figures. Accepted at the First Workshop on Responsible Healthcare using Machine Learning (RHCML 2025), co-located with ECML PKDD 2025

Abstract

Explaining the predictions of neural models in clinical NLP remains a significant challenge, especially for complex tasks involving long, unstructured medical texts. While post-hoc methods like LIME and SHAP are widely used, they often fall short when applied to clinical narratives. In this paper, we identify core limitations of token-level and perturbation-based explanation techniques through targeted demonstrations on a hospital length-of-stay prediction task. Our findings reveal issues such as overemphasis on non-informative tokens, instability in attributions, and high-confidence predictions for incoherent input variants. These results underscore the need for explanation strategies that are clinically meaningful, semantically grounded, and robust to linguistic noise.

2605.28058 2026-05-28 cs.CL

Prompting Is All You Need: Multi-view Prompting Large Language Models for Aspect-Based Sentiment Analysis

Nils Constantin Hellwig, Niklas Donhauser, Jakob Fehle, Udo Kruschwitz, Christian Wolff

AI summary: Proposes LLM-MvP, which combines multi-view prompting, schema-constrained decoding, and prefix batching so that few-shot large language models reach performance competitive with or superior to fine-tuned models while reducing computational overhead.

Abstract

Recent work explored the capabilities of Large Language Models (LLMs) in Aspect-Based Sentiment Analysis (ABSA) through few-shot prompting, requiring substantially fewer annotated examples while achieving notable improvements over zero-shot baselines. However, a performance gap remained compared to models fine-tuned on hundreds of examples, and the computational costs of LLM inference present practical barriers to deployment. We introduce LLM-based Multi-View Prompting (LLM-MvP), which adapts the multi-view principle of considering multiple element orderings to LLM prompting. By combining schema-constrained decoding with a context-free grammar and prefix batching, LLM-MvP achieves performance competitive or superior to fine-tuned approaches while substantially reducing computational overhead. Extensive experiments across five benchmark datasets demonstrate that LLM-MvP closes the gap between few-shot prompting and fine-tuned models, offering a practical and efficient solution for ABSA.
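
The multi-view idea of "considering multiple element orderings" can be illustrated with a small sketch. Everything below is an assumption for illustration: the element names, the number of views, and the vote-based aggregation are hypothetical and are not taken from the LLM-MvP paper, which may select and combine views differently.

```python
from collections import Counter
from itertools import permutations

# Hypothetical sentiment-element names; the abstract does not list them.
ELEMENTS = ("aspect", "category", "opinion", "sentiment")


def element_orderings(elements=ELEMENTS, num_views=5):
    """Return the first num_views orderings of the sentiment elements."""
    return list(permutations(elements))[:num_views]


def majority_vote(view_predictions, min_votes):
    """Keep tuples that were predicted under at least min_votes views.

    view_predictions is a list (one entry per view) of lists of tuples,
    each tuple already normalised to one canonical element order.
    """
    counts = Counter(t for view in view_predictions for t in set(view))
    return {t for t, c in counts.items() if c >= min_votes}
```

Each ordering would correspond to one prompt that asks the model to emit the elements in that order; the aggregation step then keeps only tuples that several views agree on.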

2605.28056 2026-05-28 cs.CV

CogPortrait: Fine-Grained Eye-Region Control in Portrait Animation via Hierarchical Agent Planning

He Feng, Yongjia Ma, Donglin Di, Lei Fan, Tonghua Su

AI summary: Proposes CogPortrait, a two-stage framework in which multimodal large language model agents generate keypoints from high-level labels and a DiT video generation backbone then synthesizes the animation, enabling fine-grained eye-region control; also introduces the EMH benchmark for evaluation.

Abstract

Portrait animation methods have achieved substantial visual quality and lip synchronization, but fine-grained manipulation of the eye region still faces a trade-off between input granularity and motion accuracy. Existing methods using emotion labels or coarse text prompts are insufficient for describing subtle ocular dynamics, whereas approaches based on Action Units or driving videos provide higher fidelity at the cost of a heavier input burden. These limitations are still restrictive for beyond-emotion states (e.g., thinking) and drowsiness. In light of the above, we propose CogPortrait, a two-stage framework that generates portrait animations from high-level labels. In the first stage, three chain-of-thought Multimodal Large Language Models (MLLMs) agents compile high-level labels into facial keypoints through temporal event planning, prototype retrieval, and composition from a real-behavior library, and semantic-physiological constraint enforcement. In the second stage, a DiT-based video generation backbone synthesizes the final animation conditioned on the keypoints, reference portrait, audio, and text prompt, enhanced by a dynamic classifier-free guidance strategy with eye-region-aware reweighting and KTO-based refinement for boundary cases. We further introduce the EMH benchmark covering diverse emotions and beyond-emotion categories with two AU-level metrics for evaluating fine-grained eye-region and head-motion control. Extensive experiments on HDTF and the EMH benchmark demonstrate that CogPortrait achieves more precise eye-region control than existing methods while maintaining superior visual quality and identity consistency.

2605.28053 2026-05-28 cs.LG

RW-TTT: Batched Serving for Request-Owned Test-Time Training State

Jian Yang, Zhizhuo Kou, Yao Tian, Hao Zhang, Han Chen, Sirui Han, Yike Guo

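AI summary: Proposes RW-TTT, a framework that tags each decode step with its owner, version, and read/write effects to enable efficient batched LLM serving with request-owned test-time training state, reaching 274.61 tok/s aggregate throughput on a single GPU, 9.31x higher than sequential serving.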

2605.28051 2026-05-28 cs.CV

Beyond Surrogate Gradients: Fully Differentiable Token Pruning for Vision-Language Models

Landi He, Mingde Yao, Shawn Young, Lijian Xu

AI summary: Proposes DiffPrune, which reformulates pruning as continuous control of token information rather than discrete selection learning and uses an Information Throttler to modulate tokens, making the learning of token importance fully differentiable; it retains 96.5% of full-model accuracy while accelerating LLM prefill by 2.85x.

Abstract

Visual token pruning reduces the computational cost of Vision-Language Models (VLMs) by removing redundant visual tokens. Existing methods typically rely on Gumbel-Softmax to approximate discrete selection during training. However, the optimization is driven by surrogate gradients rather than the true selection process, leading to unreliable learning of token importance. In this paper, we propose DiffPrune, which reformulates pruning as continuous control of token information instead of discrete selection learning. Specifically, we introduce an Information Throttler that modulates each token using variance-preserving noise conditioned on importance scores, where higher scores induce less information suppression during training. This design directly operates on token representations, naturally providing a fully differentiable optimization path for learning token importance. At inference, tokens are removed via hard thresholding on the learned scores. Across ten VLM benchmarks, DiffPrune retains 96.5% of full-model accuracy while accelerating LLM prefill by 2.85x, with only 0.69 ms of inference overhead.
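
One way to picture score-conditioned, variance-preserving noise is the mixing rule below. This is a sketch under stated assumptions, not DiffPrune's Information Throttler: the square-root mixing coefficients, the unit-variance Gaussian noise, the score range [0, 1], and the 0.5 threshold are all hypothetical choices made for illustration.

```python
import math
import random


def throttle(token, score, rng):
    """Mix a token vector with Gaussian noise; a higher score keeps more signal.

    Because sqrt(score)**2 + sqrt(1 - score)**2 == 1, a unit-variance input
    keeps unit variance after mixing (the variance-preserving assumption).
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    keep = math.sqrt(score)
    noise = math.sqrt(1.0 - score)
    return [keep * x + noise * rng.gauss(0.0, 1.0) for x in token]


def prune(tokens, scores, threshold=0.5):
    """Inference-time hard thresholding: drop tokens scored below threshold."""
    return [t for t, s in zip(tokens, scores) if s >= threshold]
```

In a real training setup the mixing would be applied to tensors inside an autodiff framework so that gradients flow to the scores; plain Python lists are used here only to keep the sketch dependency-free.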

2605.28048 2026-05-28 cs.RO

SAFEVPR: Patch-Based Conformal Verification for Safe Cross-Condition Sequence Visual Place Recognition

Ha Sier, Jiaqiang Zhang, Zhuo Zou, Xianjia Yu, Tomi Westerlund

AI summary: Proposes SAFEVPR, a training-free verification-and-calibration pipeline that uses mutual-nearest-neighbour patch-matching scores and Mondrian conformal LTT calibration to target finite-sample FDR control for sequence VPR under cross-condition deployment; it is empirically valid in all 23 cross-condition setups tested.

Abstract

Sequence-based visual place recognition (VPR) for SLAM and robot relocalization must decide whether the retrieved top-1 candidate is safe to accept. Conformal prediction is a natural framework for this accept/reject decision, but its finite-sample guarantees rely on exchangeability between calibration and deployment (test) data, which is violated under cross-condition deployment. We introduce SAFEVPR, a non-trainable verification-and-calibration pipeline for safe cross-condition sequence VPR. SAFEVPR replaces the standard backbone cosine similarity with a mutual-nearest-neighbour (MNN) patch-matching score computed from frozen DINOv2 ViT features, and replaces flat Learn-Then-Test calibration with Mondrian conformal LTT, fitting separate Bonferroni-corrected thresholds across score bins. Under exchangeability, these thresholds would provide finite-sample false-discovery-rate (FDR) control; under condition shift, we evaluate empirical validity per deployment. Across 23 cross-condition setups from Oxford RobotCar, NCLT, and St Lucia datasets, using three frozen VPR backbones, SAFEVPR is empirically valid on 23/23 setups at target FDR alpha = 0.10, achieving mean accepted FDR 0.014 and mean true-positive rate (TPR) 0.75. The results show that raw discrimination alone is not sufficient for conformal validity: AnyLoc-VLAD and SuperPoint+LightGlue reach comparable area under the receiver operating characteristic curve (AUROC) but fail more setups under the same calibration. On textureless repetitive scenery, SAFEVPR safely abstains rather than accepting unreliable matches. Code is available at https://github.com/Hasar12139/SafeVPR.
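
A mutual-nearest-neighbour patch score can be sketched as follows. This is an illustration only, not the code in the SafeVPR repository: the function names are hypothetical, and normalising the mutual-match count by the number of query patches is an assumption, since the abstract does not say how the MNN matches are turned into a score.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mnn_score(query_patches, ref_patches):
    """Fraction of query patches whose best reference patch picks them back."""
    if not query_patches or not ref_patches:
        return 0.0
    sim = [[cosine(q, r) for r in ref_patches] for q in query_patches]
    # Best reference index for every query patch.
    best_ref = [row.index(max(row)) for row in sim]
    # Best query index for every reference patch.
    best_query = []
    for j in range(len(ref_patches)):
        column = [sim[i][j] for i in range(len(query_patches))]
        best_query.append(column.index(max(column)))
    mutual = sum(1 for i, j in enumerate(best_ref) if best_query[j] == i)
    return mutual / len(query_patches)
```

An accept/reject rule would then compare this score against a calibrated threshold; the calibration itself (Mondrian conformal LTT in the paper) is not sketched here.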

2605.28047 2026-05-28 cs.CL

Knowledge Dependency Estimation for Reliable Question Answering

Chaodong Tong, Qi Zhang, Nannan Sun, Lei Jiang, Yanbing Liu

AI summary: Proposes Knot, which uses subset-level counterfactual supervision and coverage modeling over latent dependency factors to estimate how sensitive a black-box QA model is to different knowledge units, identifying the key knowledge dependencies.

Comments: 12 tables, 9 figures

Abstract

Reliable question answering requires identifying not only whether an answer is correct, but also which available knowledge the prediction depends on. In realistic LLM-based QA, this knowledge may come from context, retrieval, decomposition, or intermediate reasoning, forming a noisy and redundant candidate space rather than a clean gold evidence set. We study knowledge dependency estimation: estimating the sensitivity of a fixed black-box QA model to different candidate knowledge units. The challenge is to obtain fine-grained dependency scores without exhaustive test-time perturbation while modeling redundancy, substitutability, and complementarity. We propose Knot, a structured rank-aware knowledge dependency estimator. Knot learns from subset-level counterfactual supervision, models subset sensitivity through coverage over latent dependency factors, and derives rank-aware unit scores to identify influential candidates. Across multiple-choice and generative QA benchmarks, Knot outperforms all compared baselines in subset-sensitivity prediction and produces more faithful unit rankings than deployable baselines without extra QA-model calls; when used for practical risk screening, its dependency scores help flag error-prone QA predictions early.

2605.28046 2026-05-28 cs.AI cs.CL

MemCog: From Memory-as-Tool to Memory-as-Cognition in Conversational Agents

Zihan Li, Xingyu Fan, Feifei Li, Wenhui Que

AI summary: Proposes MemCog, a system that integrates memory access into the reasoning process through a navigable memory store, a cross-dimensional navigation interface, and a proactive reasoning protocol, reaching state-of-the-art results on passive QA benchmarks and substantially outperforming baselines on a proactive memory-triggering benchmark.

Abstract

Existing agent memory systems universally follow what we term a Memory-as-Tool paradigm where a single query triggers one-shot retrieval of flat passage lists, suffering from passive invocation, reasoning-retrieval decoupling, and structural mismatch between retrieved fragments and the agent's navigational needs. We propose MemCog, a Memory-as-Cognition system that makes memory access an integral part of the reasoning process. MemCog organizes user knowledge as Navigable Memory Store with associative link graphs, exposes Cross-Dimensional Navigation Interface for multi-step reasoning-driven traversal, and employs Proactive Reasoning Protocol that drives agents to spontaneously initiate memory exploration from conversational context. We additionally construct ProactiveMemBench, the first benchmark for evaluating proactive memory triggering. Experiments show that MemCog achieves state-of-the-art on passive QA benchmarks (92.98 on LoCoMo, 95.8 on LongMemEval) while substantially outperforming baselines on ProactiveMemBench, demonstrating the advantage of Memory-as-Cognition.

2605.28044 2026-05-28 cs.AI

Relevant Is Not Warranted: Evidence-Force Calibration for Cited RAG

Pin Qian, Su Wang, Xiaoyuan Wang, Yihang Chen, Wenxuan Xu, Qiaolin Yu, Shuhuai Lin, Sipeng Zhang, Junxian You, Xinpeng Wei

AI summary: Addresses insufficient evidential force in cited RAG by introducing FORCEBENCH, a benchmark that contrasts evidence-calibrated claims with force-raised variants across five operational axes to evaluate evaluator monotonicity, and finds that standard support prompting is insufficient for calibrating evidence force.

Abstract

Cited RAG evaluation often treats visible sources as a grounding signal, but a real, topically relevant citation can still under-warrant the attached wording. We study this diagnostic failure as citation laundering: a related source is presented as warrant for an over-strong claim. We introduce FORCEBENCH, a contrastive stress test for evidence-force calibration. Each item holds a cited passage fixed and pairs an evidence-calibrated claim with a localized force-raised variant across five operational axes: relation, modality, scope, temporal validity, and numeric specificity. A calibrated evaluator should score the evidence-calibrated claim higher. Headline experiments use a fixed, locality-filtered 198-pair evaluation set. A citation-presence sanity check is uninformative by design; token and entity overlap still violate monotonicity on 32.8-36.4% of pairs. Across four reported model judges, standard generic support prompting is insufficient for this force-calibration stress test (aggregate MVR 47.2%), while explicit warrant-strength prompting lowers MVR to 24.5% but remains imperfect. We release the benchmark, prompts, outputs, and plug-in pipeline so citation evaluators can report monotonicity violation rate and force sensitivity alongside conventional support metrics.
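
The monotonicity violation rate (MVR) reported above can be read as the share of pairs on which an evaluator fails to score the evidence-calibrated claim higher than its force-raised variant. The sketch below follows that reading; it is not the released FORCEBENCH pipeline, the function name is hypothetical, and counting ties as violations is an assumption drawn from the word "higher" in the abstract.

```python
def monotonicity_violation_rate(pairs):
    """Share of (calibrated_score, force_raised_score) pairs that violate
    the expectation calibrated_score > force_raised_score.

    Ties are counted as violations (an assumption; see the lead-in).
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one pair is required")
    violations = sum(1 for calibrated, raised in pairs if calibrated <= raised)
    return violations / len(pairs)
```

For example, an evaluator that scores four pairs as (0.9, 0.4), (0.5, 0.5), (0.2, 0.8), and (0.7, 0.1) violates monotonicity on the second and third, giving an MVR of 0.5.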

2605.28042 2026-05-28 cs.CL cs.AI cs.LG

Extracting Small Translation Specialists from LLMs by Aggressively Pruning Experts

Liu O. Martin, Lucas Bandarkar, Nanyun Peng

AI summary: Proposes a method that aggressively prunes translation-irrelevant experts from mixture-of-experts LLMs, substantially compressing the MoE blocks without significantly degrading translation quality.

Abstract

Modern large language models (LLMs) achieve state-of-the-art machine translation performance, but they do so as broad generalists largely trained for many tasks and capabilities unrelated to translation. Thus, they are heavily overparameterized for this task, resulting in excessive memory and compute requirements. In this paper, we present a method for aggressively pruning experts from modern mixture-of-experts LLMs while incurring negligible degradation in translation quality. Our approach exploits expert specialization and the separability of multilingual capabilities in LLMs to identify experts irrelevant to translation. And because of the modular nature of MoEs, these can be easily pruned without any training. Without retraining, we are able to prune half of all experts with negligible degradation and 70% with only minor losses. With a very short SFT, we prune 75% of experts while recovering baseline performance, and in some settings remove nearly 90% while maintaining reasonable translation quality. Overall, our results show that translation requires only a fraction of the LLM, enabling substantial compression of the MoE blocks that contain over 90% of parameters.
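
Training-free expert pruning can be pictured as ranking the experts in an MoE layer by some relevance statistic and keeping only the top fraction. The sketch below uses routing counts on translation data as that statistic; this criterion, the function name, and the example counts are hypothetical, since the abstract does not state how translation-irrelevant experts are identified.

```python
def select_experts(usage_counts, keep_fraction):
    """Return the sorted indices of the experts to keep in one MoE layer.

    usage_counts[i] is how often expert i was routed to on translation data
    (a hypothetical relevance signal). At least one expert is always kept.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must lie in (0, 1]")
    n_keep = max(1, round(len(usage_counts) * keep_fraction))
    ranked = sorted(range(len(usage_counts)),
                    key=lambda i: usage_counts[i], reverse=True)
    return sorted(ranked[:n_keep])


# Example: an 8-expert layer pruned to half and to a quarter of its experts.
counts = [5, 100, 0, 42, 7, 3, 60, 1]
kept_half = select_experts(counts, 0.5)
kept_quarter = select_experts(counts, 0.25)
```

Pruning half of the experts corresponds to `keep_fraction=0.5`, and the 75% setting reported with a short SFT corresponds to `keep_fraction=0.25`; the router would also need to be restricted to the kept indices.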

2605.28037 2026-05-28 cs.CL

Personality, Role, and Expressive Style in Large Language Models: An Interactionist Analysis

Moe Nagao, Koichiro Terao, Mikio Nakano, Naoto Iwahashi

AI summary: From an interactionist perspective, this study uses a factorial experiment to analyze how personality traits, dialogue roles, and expressive styles jointly affect the perceived expression of Big Five personality traits in dialogue generated by large language models.

Comments: 26 pages

Abstract

Prompt-based personality control is a key technique for designing large language model (LLM) dialogue agents that behave consistently across social contexts. However, specifying Big Five personality traits (BFTs) in a prompt does not ensure that the intended traits are expressed in generated utterances. This paper investigates this mismatch from an interactionist perspective, viewing personality expression as a context-dependent outcome shaped by the interplay between trait specification and situational factors. We analyze how perceived BFT expression in LLM-generated dialogue is influenced by three prompt factors: personality traits, dialogue roles, and expressive styles. Using a factorial design that combines six personality conditions, three roles, and three expressive-style conditions, we generate 1,080 LLM-agent dialogues in each of English and Japanese. We then evaluate the target agent's utterances using an LLM-as-a-judge framework to estimate expressed Big Five traits. The results show that expressed personality is shaped not only by explicit trait specification, but also by dialogue role and expressive style. These effects are trait-specific: dialogue role strongly influences Openness, expressive style substantially shapes Conscientiousness and Agreeableness, and explicit trait specification dominates Neuroticism. Even without explicit personality-trait specification, social and expressive conditions induce distinct personality-like impressions. Cross-linguistic comparisons show broadly similar patterns between English and Japanese dialogues, with noticeable differences only under specific combinations of personality, role, and expressive style. These findings suggest that personality control in LLM agents should be understood not as a direct consequence of trait prompting, but as a context-dependent process involving personality specification, social role, and expressive style.
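
The size of the factorial design can be checked with a short enumeration. The condition labels below are placeholders, because the abstract gives only the counts (six personality conditions, three roles, three expressive styles), and the even split of dialogues across cells is an assumption rather than something the abstract states.

```python
from itertools import product

# Placeholder labels; only the counts (6 x 3 x 3) come from the abstract.
PERSONALITIES = [f"personality_{i}" for i in range(1, 7)]
ROLES = [f"role_{i}" for i in range(1, 4)]
STYLES = [f"style_{i}" for i in range(1, 4)]


def design_cells():
    """Enumerate every (personality, role, style) combination."""
    return list(product(PERSONALITIES, ROLES, STYLES))


def dialogues_per_cell(total_dialogues, cells):
    """Dialogues per cell if the total is split evenly (an assumption)."""
    per_cell, remainder = divmod(total_dialogues, len(cells))
    if remainder:
        raise ValueError("total does not divide evenly across cells")
    return per_cell
```

The design has 6 x 3 x 3 = 54 cells, so 1,080 dialogues per language would amount to 20 dialogues per cell under an even split.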

2605.28036 2026-05-28 cs.CV cs.LG

Stay Fair! Ensuring Group Fairness in Diffusion Models Across Guidance Scales

Myeongsoo Kim, Eunji Kim, Minwoo Chae, Sangwoo Mo

AI summary: Proposes StayFair, which decomposes total bias into model bias and guidance bias, extends Strong Demographic Parity to the guidance process, and designs fair guidance algorithms so that diffusion models maintain group fairness across guidance scales.

Comments: 28 pages, 18 figures

Abstract

Diffusion models steer conditional generation with a tunable guidance scale to trade off prompt alignment and diversity. However, existing debiasing techniques are optimized for a single scale, degrading fairness when users adjust this parameter. We trace this behavior to a previously overlooked source by decomposing total bias into two components: a model bias and a guidance bias. While prior work primarily targets the former, we show that the guidance bias grows monotonically with the guidance scale, eventually dominating the high-guidance regimes users prefer. To address this, we extend Strong Demographic Parity to guidance and derive a condition under which the target distribution retains its group ratio across guidance scales. We propose StayFair, which leverages this condition to design fair guidance algorithms in both regimes. For classifier guidance, it equalizes the classifier's output distributions across groups; for classifier-free guidance, it shifts the null embedding by a prompt-dependent offset. Because StayFair modifies only the guidance step, it is orthogonal to model debiasing and can be layered onto existing fair diffusion models to extend their fairness across guidance scales. Across class-conditional and text-to-image generation, StayFair decouples fairness from the guidance scale without sacrificing image quality.
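
As background for the guidance scale discussed above, here is a sketch of the classifier-free guidance combination in one common convention, where a scale of 1 recovers the purely conditional prediction. It is not StayFair's algorithm: the helper that shifts the null embedding only shows where such an offset would enter, and how the prompt-dependent offset is computed is not described in the abstract, so it is left as a hypothetical input.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: eps = eps_uncond + scale * (eps_cond - eps_uncond).

    Some implementations use (1 + scale) instead; this convention is an assumption.
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def shift_null_embedding(null_embedding, offset):
    """Add a (hypothetical) prompt-dependent offset to the null embedding.

    The shifted embedding would be used for the unconditional branch that
    produces eps_uncond; computing the offset itself is not sketched here.
    """
    return [n + o for n, o in zip(null_embedding, offset)]
```

Larger values of `scale` push the output further from the unconditional prediction, which is the regime where the abstract reports guidance bias becoming dominant.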

2605.28035 2026-05-28 cs.AI cs.MM cs.SD

MTAVG-Bench 2.0: Diagnosing Failure Modes of Cinematic Expressiveness in Multi-Talker Audio-Video Generation

Haitian Li, Yanghao Zhou, Heyan Huang, Liangji Chen, YiMing Cheng, Xu Liu, Dian Jin, Jiajun Xu, Jingyun Liao, Tian Lan, Ziqin Zhou, Yueying Liu, Yu Bai, Changsen Yuan, Jinxing Zhou, Xian-Ling Mao, Xuefeng Chen, Yousheng Feng

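AI summary: To address the inadequate evaluation of cinematic expressiveness in multi-talker audio-video generation, proposes the MTAVG-Bench 2.0 benchmark, which builds a high-level failure taxonomy spanning acting, narrative, atmosphere, and audio-visual language, together with more than 10,000 question-answering instances, to systematically evaluate how well omni large language models diagnose complex audio-visual failures.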

Abstract

In recent years, Multi-Talker Audio-Video Generation (MTAVG) models have shown promising performance on fundamental metrics such as lip-sync and audio-visual alignment. However, these metrics remain insufficient for assessing cinematic expressiveness in scene-level generation. In multi-character scenes, generation models must go beyond audio-visual realism to convey coherent character performance and other higher-level cinematic qualities. To fill this gap, we introduce MTAVG-Bench 2.0, a benchmark for diagnosing failure modes of cinematic expressiveness in multi-talker audio-video generation. Unlike prior settings that mainly focus on the quality of basic multi-turn dialogue, MTAVG-Bench 2.0 targets short-drama and scene-level generation, and establishes a high-level failure taxonomy spanning acting, narrative, atmosphere, and audio-visual language. Based on this taxonomy, we construct more than 10,000 question-answering evaluation instances, together with subsets for short-drama-level assessment and temporal localization of failure modes, to systematically evaluate the ability of omni large language models to diagnose high-level audio-visual failures. Experimental results show that commercial omni models such as Gemini substantially outperform other evaluators, yet even the strongest models continue to struggle with complex failures in our benchmark. These results demonstrate that MTAVG-Bench 2.0 provides a systematic benchmark for failure diagnosis in cinematic multi-talker audio-video generation.

2605.28034 2026-05-28 cs.AI

Clark Hash: Stateless Sparse Johnson-Lindenstrauss Quantization for Neural Embeddings

Stanislav Kirdey, Clark Labs Inc

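AI summary: Proposes Clark Hash, which uses normalization, a sparse signed projection, and fixed-width scalar quantization to compress 384-dimensional sentence embeddings to 48 bytes with no training, achieving 32x storage compression while keeping high correlation with dense cosine similarity.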

Comments: First Autoresearch publication. Code available at https://github.com/clark-labs-inc/clark-hash. GPT-5.5 Pro was used for drafting and editing assistance

Abstract

Clark Hash is a small method for storing neural embeddings in less space. It normalizes each database vector, applies a deterministic sparse signed Johnson-Lindenstrauss projection, clips the result, and stores a fixed-width scalar-quantized code. Queries stay in floating point and are scored against the stored sketches. In the default 384-dimensional sentence-embedding setting, Clark Hash stores a cosine-search vector in 48 bytes instead of 1536 bytes for dense f32 storage. This is 32x smaller. The method does not need a training pass, learned codebooks, rotations, or corpus statistics before new vectors can be stored. We describe the codec, the Rust implementation, and a multilingual sentence-similarity evaluation on 9,304 labeled pairs from 29 subsets. With a multilingual MiniLM encoder, the 48-byte sketches reached 0.910 and 0.946 macro Pearson correlation with dense cosine scores on STS17 and STS22. Clark Hash is not a new Johnson-Lindenstrauss theorem and it is not a replacement for approximate nearest-neighbor indexes. It is a simple stateless codec for compact embedding storage.
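
The pipeline described above (normalize, sparse signed projection, clip, fixed-width quantization, float queries scored against sketches) can be sketched in pure Python. This is an illustration under stated assumptions, not the authors' Rust codec: the split of 48 bytes into 48 one-byte codes, the clipping range `CLIP`, and the SHA-256-based choice of bucket and sign are all hypothetical choices made so the example is stateless and deterministic.

```python
import hashlib
import math
from typing import List, Tuple

DIM_IN = 384    # embedding dimension from the abstract
DIM_OUT = 48    # assumption: 48 one-byte codes -> 48 bytes per vector
CLIP = 0.5      # assumption: projected values are clipped to [-CLIP, CLIP]


def bucket_and_sign(i: int) -> Tuple[int, int]:
    """Deterministically map input coordinate i to (output bucket, +/-1 sign)."""
    digest = hashlib.sha256(i.to_bytes(4, "little")).digest()
    bucket = int.from_bytes(digest[:4], "little") % DIM_OUT
    sign = 1 if digest[4] & 1 else -1
    return bucket, sign


def project(vec: List[float]) -> List[float]:
    """L2-normalize, then apply a sparse signed projection (one nonzero per input)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    out = [0.0] * DIM_OUT
    for i, x in enumerate(vec):
        bucket, sign = bucket_and_sign(i)
        out[bucket] += sign * x / norm
    return out


def encode(vec: List[float]) -> bytes:
    """Clip each projected value and quantize it to one unsigned byte."""
    codes = []
    for p in project(vec):
        clipped = max(-CLIP, min(CLIP, p))
        codes.append(round((clipped + CLIP) / (2 * CLIP) * 255))
    return bytes(codes)


def score(query: List[float], code: bytes) -> float:
    """Dot product of the float-projected query with the dequantized sketch."""
    dequantized = [b / 255 * 2 * CLIP - CLIP for b in code]
    return sum(q * d for q, d in zip(project(query), dequantized))
```

The storage arithmetic matches the abstract: 384 float32 values take 384 x 4 = 1536 bytes, and 1536 / 48 = 32. Because the projection depends only on coordinate indices, a new vector can be encoded without any training pass or corpus statistics.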

2605.28033 2026-05-28 cs.RO

How Should We Teach Robots? A Comparison of Kinesthetic, Joystick, and Gesture-Based Teaching

Petr Vanc, Jan Kristof Behrens, Václav Hlaváč, Karla Stepanova

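AI summary: A user study compares three demonstration modalities (kinesthetic guidance, joystick teleoperation, and gesture-based teaching) and evaluates their success rates, workload, and common errors in manipulation tasks.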

2605.28032 2026-05-28 cs.AI

PetroBench: A Benchmark for Large Language Models in Petroleum Engineering

Xiang Wang, Tingting Zhang, Sen Wang, Ying Wu, Heng Meng, Peng Zhou, Peng Li

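AI summary: Builds a standardized bank of 1,200 petroleum-engineering questions and evaluates eight mainstream large language models, finding that models perform better on subjective than objective questions, that Chinese models have an edge on multiple-choice questions, and that international models are slightly better on short-answer questions.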

Abstract

Large Language Models are increasingly applied in the petroleum industry, highlighting the need for a domain-specific evaluation framework. This study develops a benchmark for LLMs in petroleum engineering, including a three-stage process of data preprocessing, quality filtering, and multi-model validation. Using expert review, a standardized question bank with strong domain relevance and discriminative capability was constructed. The benchmark covers production, reservoir, and drilling engineering, with 1,200 questions across multiple-choice, true or false, term definition, and short-answer formats. Eight mainstream LLMs were evaluated under a unified API environment. Results show that models performed better on subjective than objective questions, indicating weaknesses in factual knowledge discrimination. The highest accuracies for multiple-choice and true or false questions were 65.3% and 74.3%, respectively. Gemini-3-Pro, Kimi-K2.5, and Claude-Opus-4.6-Thinking achieved the best overall scores of 72%-74%. Models performed best in production engineering and weakest in reservoir engineering. Chinese models showed advantages in multiple-choice questions, while international models performed slightly better in short-answer questions. The benchmark provides a reproducible and practical reference for evaluating and deploying LLMs in petroleum engineering.

2605.28030 2026-05-28 cs.LG cs.AI cs.CR

SPARD: Defending Harmful Fine-Tuning Attack via Safety Projection with Relevance-Diversity Data Selection

Shuhao Chen, Weisen Jiang, Yeqi Gong, Shengda Luo, Chengxiang Zhuo, Zang Li, James T. Kwok, Yu Zhang

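AI summary: Proposes the SPARD framework, which combines safety-projected alternating optimization with relevance-diversity data selection to defend against harmful fine-tuning attacks, substantially lowering attack success rates while maintaining task accuracy.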

Comments: Accepted by ICML 2026

Abstract

Fine-tuning large language models often undermines their safety alignment, a problem further amplified by harmful fine-tuning attacks in which adversarial data removes safeguards and induces unsafe behaviors. We propose SPARD, a defense framework that integrates Safety-Projected Alternating optimization with Relevance-Diversity-aware data selection. SPARD employs SPAG, which alternates between utility updates and explicit safety projections, using a set of safe data to enforce safety constraints. To curate safe data, we introduce a Relevance-Diversity Determinantal Point Process to select compact safe data, balancing task relevance and safety coverage. Experiments on GSM8K and OpenBookQA under four harmful fine-tuning attacks demonstrate that SPARD consistently achieves the lowest average attack success rates, substantially outperforming state-of-the-art defense methods, while maintaining high task accuracy. Code is available at https://github.com/shuhao02/SPARD.
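
As an illustration of how a determinantal point process can trade relevance against diversity, here is a small greedy selection sketch in Python. It assumes the common quality-diversity kernel `L[i][j] = relevance[i] * similarity[i][j] * relevance[j]` and a greedy determinant-maximizing selection; both are generic DPP conventions and are assumptions here, not the kernel or inference procedure SPARD necessarily uses. The function names are hypothetical.

```python
from typing import List


def determinant(matrix: List[List[float]]) -> float:
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [list(row) for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def greedy_dpp(relevance: List[float], similarity: List[List[float]], k: int) -> List[int]:
    """Greedily add the item that maximizes det of the selected kernel submatrix."""
    selected: List[int] = []
    candidates = set(range(len(relevance)))
    while candidates and len(selected) < k:
        def subset_det(i: int) -> float:
            idx = selected + [i]
            sub = [[relevance[p] * similarity[p][q] * relevance[q] for q in idx]
                   for p in idx]
            return determinant(sub)
        best = max(sorted(candidates), key=subset_det)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With two near-duplicate, highly relevant items and one less relevant but distinct item, the greedy step picks the most relevant item first and then prefers the distinct one, because near-duplicates make the kernel submatrix almost singular.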

2605.28028 2026-05-28 cs.LG

BPPO: Binary Prefix Policy Optimization for Efficient GRPO-Style Reasoning RL with Concise Responses

Qingfei Zhao, Huan Song, Shuyu Tian, Jiawei Shao, Xuelong Li

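AI summary: To address the high update cost of GRPO and its tendency to produce verbose reasoning, proposes BPPO, which uses only the shortest correct and shortest incorrect completions as the update unit and focuses optimization on prefixes, achieving about 6x speedup and 30-50% shorter responses.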

Abstract

Group Relative Policy Optimization (GRPO) is widely used for training reasoning models, but updating all sampled completions in each group incurs substantial cost and can reinforce verbose reasoning trajectories. In this paper, we study whether all completions provide equally useful update signals in GRPO-style reasoning RL. Our gradient-similarity analysis shows that, within the same prompt group, same-class completions often induce highly similar update directions, whereas correct-incorrect pairs provide more distinct contrastive signals. Motivated by this observation, we propose Binary Prefix Policy Optimization (BPPO), which uses the shortest correct completion and the shortest incorrect completion as a compact update unit while preserving full-group advantage normalization. BPPO further improves efficiency with adaptive completion scheduling and prefix-focused optimization; by updating only response prefixes, it avoids reinforcing redundant suffixes and encourages more concise responses. Experiments on GSM8K, MATH, and Geo3K show that BPPO achieves up to 6.08x speedup over GRPO while maintaining competitive accuracy, and reduces mean response length by approximately 30-50% without modifying the reward with an explicit length penalty.
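
The selection step described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes binary rewards, normalizes advantages with the population standard deviation over the full group, and uses a fixed `prefix_len`, whereas the abstract mentions adaptive completion scheduling whose details are not given. All names are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List, Optional, Tuple


@dataclass
class Completion:
    tokens: List[str]
    correct: bool


def group_advantages(group: List[Completion], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages, normalized over the full sampled group."""
    rewards = [1.0 if c.correct else 0.0 for c in group]
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def select_binary_pair(group: List[Completion]) -> Optional[Tuple[int, int]]:
    """Indices of the shortest correct and shortest incorrect completion, if both exist."""
    correct = [i for i, c in enumerate(group) if c.correct]
    incorrect = [i for i, c in enumerate(group) if not c.correct]
    if not correct or not incorrect:
        return None
    length = lambda i: len(group[i].tokens)
    return min(correct, key=length), min(incorrect, key=length)


def prefix_mask(length: int, prefix_len: int) -> List[int]:
    """1 for tokens in the updated prefix, 0 for the suffix left untouched."""
    return [1 if t < prefix_len else 0 for t in range(length)]
```

Only the two selected completions would be passed to the policy-gradient update, while their advantages are still computed from all sampled rewards, which is what "preserving full-group advantage normalization" refers to.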

2605.28025 2026-05-28 cs.AI cs.CL cs.CY

MIRA: A Bilingual Benchmark for Medical Information Response Audit

Mengyu Xu, Qiaoxin Yang, Qianqian Wang, Xiwei Dai, Weiyi Wu, Chongyang Gao

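AI summary: Proposes the bilingual MIRA benchmark, which uses 4,320 prompts to evaluate whether large language models provide consistent medical information across different user phrasings, finds that low health-literacy prompts lead to Differential Information Dilution (DID), and proposes a knowledge-guided mitigation method.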

Abstract

Large language models (LLMs) are increasingly used to provide public-facing health information, yet existing safety evaluations overlook whether responses preserve comparable medical information across different user phrasings of the same question. To address this, we introduce the Medical Information Response Audit (MIRA), a bilingual, controlled benchmark that assesses whether LLMs provide comparable medical information across user-side language, register, and health literacy signals. MIRA contains 4,320 prompts built from 60 medically reviewed, low-risk health questions. Across five mainstream LLMs, models answered all medical questions, but responses to low health-literacy signals consistently omitted more key information, provided fewer concrete next steps, and offered less support for independent judgment. We term this pattern Differential Information Dilution (DID). Language effects are model-specific rather than uniformly worse for non-English prompts. A comparison with 300 real-world health queries provides preliminary evidence of rank-order validity. A knowledge-guided mitigation prompt reduces information dilution for most models, with the largest reductions in underinformative simplification observed for Claude (~8%) and Qwen (~6%).

2605.28023 2026-05-28 cs.CV cs.AI cs.CL cs.MM

VCap: Hypergeometric Rewards for Weak-to-Strong Visual Captioning

Xingyu Lu, Jinpeng Wang, Yi-Fan Zhang, Yankai Yang, Yancheng Long, Yiyang Fan, Xuanyu Zheng, Haonan Fan, Kaiyu Jiang, Tianke Zhang, Changyi Liu, Bin Wen, Fan Yang, Tingting Gao, Han Li, Chun Yuan

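AI summary: Proposes VCap, a Witness-Adjudicator reward that verifies, grounded in the visual signal, the factual consistency between reference captions and policy-generated captions with hypergeometric-distribution-level precision, enabling weak-to-strong generalization and outperforming SOTA models on multiple image and video captioning benchmarks.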

Comments: 28 pages, 8 figures

Abstract

Visual captioning requires models to capture visual content faithfully while minimizing both omission and hallucination. As the dominant paradigm for captioning, multimodal large language models (MLLMs) have achieved strong performance through scaling and high-quality data. Recently, reinforcement learning (RL) has emerged as a key route to driving MLLMs toward higher precision and broader coverage; however, existing reward designs for captioning fail to provide fine-grained and reliable signals for factual verification, limiting their effectiveness. To address this, we propose VCap, a Witness-Adjudicator reward that pairs the reference caption (a witness) with the visual signal (an adjudicator). By explicitly verifying factual consistency between the reference and policy-generated captions grounded in the visual signal, VCap delivers a reward signal with hypergeometric-distribution-level precision for caption quality verification. This design enables effective learning even from imperfect references, facilitating weak-to-strong generalization in RL training. In our experiments, an 8B model trained with VCap outperforms open- and closed-source SOTA models on multiple image and video captioning benchmarks. Human evaluation further confirms its strong alignment with factual correctness. Additionally, VCap improves MLLM perceptual capability, generalizes across tasks, and surpasses best-of-N distillation, challenging prior assumptions about RLVR.

2605.28022 2026-05-28 cs.CL cs.SE

Beyond pass@k: Redundancy-Aware RLVR for Multi-Sample Code Generation

Le Bronnec Florian, Alexandre Verine, Rio Yokota, Benjamin Negrevergne

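AI summary: To address redundancy under repeated-sampling evaluation in code generation, augments RLVR with an anti-redundancy reward based on JPlag similarity, improving executable correctness under a finite budget.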

Comments: Preprint under review

Abstract

LLMs for code generation are commonly evaluated in repeated-sampling settings using Pass@k, where multiple candidate programs are executed against unit tests under a finite sampling budget. While recent verifier-based reinforcement learning (RLVR) methods improve executable correctness, how these objectives affect redundancy among sampled programs remains poorly understood. In this work, we study implementation-level redundancy in code generation using JPlag, a plagiarism-detection system for code. Across models and benchmarks, we show that correctness-only RLVR often concentrates generations around repeated implementations, whereas Pass@k-aware objectives maintain lower redundancy and improve larger-budget performance. Motivated by these observations, we augment RLVR with direct anti-redundancy rewards based on JPlag similarity. Across 3 models and 3 benchmarks, discouraging near-duplicate generations reliably improves finite-budget executable performance, often matching or outperforming specialized Pass@k-aware objectives.
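
For readers unfamiliar with the metric, the sketch below shows the widely used unbiased Pass@k estimator (given `n` samples of which `c` pass, the probability that at least one of `k` drawn samples passes), followed by a toy anti-redundancy reward. The reward shape, the `penalty` weight, and the use of `difflib.SequenceMatcher` as the similarity measure are assumptions for illustration only; the paper uses JPlag similarity, and its exact reward formula is not stated in the abstract.

```python
from difflib import SequenceMatcher
from math import comb
from typing import List


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate: 1 - C(n - c, k) / C(n, k)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def anti_redundancy_rewards(programs: List[str], correct: List[bool],
                            penalty: float = 0.5) -> List[float]:
    """Toy reward: correctness minus a penalty on the most similar sibling sample.

    SequenceMatcher is a stand-in for JPlag; the formula itself is hypothetical.
    """
    rewards = []
    for i, prog in enumerate(programs):
        sims = [SequenceMatcher(None, prog, other).ratio()
                for j, other in enumerate(programs) if j != i]
        rewards.append(float(correct[i]) - penalty * max(sims, default=0.0))
    return rewards
```

The intuition is that a near-duplicate of an already sampled program adds little to Pass@k, since it tends to pass or fail the same tests, so penalizing similarity pushes the sampling budget toward distinct implementations.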

2605.28021 2026-05-28 cs.LG

AOE: Exhaustive Out-of-Distribution Detection via Recalibrating Outlier Labels

Fengqiang Wan, Qing-Yuan Jiang, Yang Yang

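AI summary: Proposes Adaptive Confidence Outlier Exposure (AOE), which recalibrates outlier labels through temperature scaling and uses adaptive soft targets to preserve the semantic relations between OOD samples and ID categories, thereby enlarging the separation margin and improving OOD detection performance.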

Abstract

Out-of-distribution (OOD) detection is essential for deploying machine learning models in open-world and safety-critical scenarios, where test inputs may deviate from the training distribution and overconfident predictions on unknown samples can lead to unreliable decisions. Outlier Exposure (OE) has emerged as a promising OOD detection paradigm by introducing auxiliary outliers during training to enlarge the margin between in-distribution (ID) and OOD samples. Existing OE-based methods typically enlarge this margin by employing uniform labels to maximize the entropy of OOD samples over ID categories. However, we theoretically show that uniform labels inevitably disregard the relations between OOD samples and ID categories, termed the over-softening effect, leading to a suboptimal margin bound. Our theoretical analysis further reveals that explicitly exploiting such relations can instead yield improved OOD detection performance. Motivated by this insight, we propose Adaptive Confidence OE (AOE), a simple yet effective method that leverages temperature scaling to recalibrate outlier labels. Specifically, AOE generates adaptive soft targets from temperature-scaled model predictions for OOD samples, where the learnable temperature smooths the prediction distribution without fully erasing class-wise relational information. By supervising OOD samples with these adaptive soft targets, AOE preserves the semantic proximity between OOD samples and ID categories while encouraging the softened targets to approach a high-entropy distribution, thereby suppressing overconfident OOD predictions and enlarging the separation margin. Extensive experiments across diverse benchmarks demonstrate the effectiveness of AOE.
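
The contrast between uniform outlier labels and temperature-scaled soft targets can be seen in a few lines of Python. This is a minimal sketch under simplifying assumptions: the temperature is a fixed number here, whereas the abstract describes it as learnable, and the function names are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax of logits divided by a temperature."""
    m = max(logits)
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(p: List[float]) -> float:
    """Shannon entropy in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)


def uniform_target(num_classes: int) -> List[float]:
    """Classic Outlier Exposure label: the same probability for every ID class."""
    return [1.0 / num_classes] * num_classes


def adaptive_soft_target(logits: List[float], temperature: float) -> List[float]:
    """Soft target from the model's own temperature-scaled prediction."""
    return softmax(logits, temperature)
```

For logits such as `[4.0, 1.0, 0.0]`, a temperature above 1 yields a target whose entropy lies between that of the raw prediction and that of the uniform label, while the class ranking is unchanged. That retained ranking is the relational information a uniform label discards.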

2605.28020 2026-05-28 cs.CL

The Missing Piece in Pre-trained Model Evaluation: Reward-Guided Decoding Unlocks Task-Oriented Behavior Without Parameter Updates

Shaobo Wang, Guo Chen, Ziyue Wang, Zhengyang Tang, Qingyang Liu, Xingzhang Ren, Dayiheng Liu, Linfeng Zhang

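AI summary: Proposes EBD, a training-free, reward-guided decoding framework that adjusts the output distribution with an external lightweight reward model to activate task-oriented behavior in frozen pre-trained models, enabling fairer inference-time evaluation.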

Comments: 26 pages, 5 figures, 8 tables

Abstract

With the rapid progress of large language models (LLMs), reliably evaluating the capabilities of pre-trained LLMs has become increasingly important. The challenge is that base pre-trained models are optimized for next-token prediction and often fail to follow instructions or produce well-formed answers under standard prompting and direct decoding. As a result, benchmark performance can conflate model capability with decoding-induced failures to produce task-oriented outputs, while exposing such behavior often relies on costly post-training. Recent decoding-only approaches attempt to reshape output distributions, but such methods can be inefficient and brittle across open-ended tasks. To address these limitations, we propose Energy-Based Decoding (EBD), a training-free, reward-guided framework for activating task-oriented behaviors from frozen pre-trained LLMs across both open-ended and objective tasks. EBD augments decoding with an external lightweight reward model, steering generations toward high-utility responses while anchoring them to the pre-trained model prior through a reward-tilted target distribution. We show that EBD shifts base-model outputs toward more instruction-following behavior, increasing behavioral similarity to post-trained counterparts and enabling a fairer inference-time evaluation of accessible pre-trained-model behavior. Empirically, EBD outperforms baselines across five models and six benchmarks, improving Qwen3-8B-Base on AlpacaEval2.0 from 8.8 to 44.5, reducing Mistral-7B Math500 latency by 18.9x relative to prior decoding work, and remaining robust to reward-model size.
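
A reward-tilted target distribution is commonly written as p*(y) proportional to p_base(y) * exp(r(y) / beta). The Python sketch below computes that distribution over a finite set of candidate responses. The functional form and the temperature-like parameter `beta` are standard conventions assumed here for illustration; the abstract does not spell out EBD's exact formulation or its sampling procedure.

```python
import math
from typing import List


def reward_tilted_distribution(base_logprobs: List[float],
                               rewards: List[float],
                               beta: float) -> List[float]:
    """Normalize p_base(y) * exp(r(y) / beta) over a finite candidate set."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    scores = [lp + r / beta for lp, r in zip(base_logprobs, rewards)]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]
```

A small `beta` lets the reward model dominate, while a large `beta` keeps the result close to the frozen base model's own distribution, which is one way to read "anchoring" generations to the pre-trained prior.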

2605.28018 2026-05-28 cs.CV

Dual-branch Distilled Transformer for Efficient Asymmetric UAV Tracking

Hongtao Yang, Bineng Zhong, Qihua Liang, Yaozong Zheng, Xiantao Hu, Yuanliang Xue, Shuxiang Song

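AI summary: Proposes the EATrack framework, which uses a teacher-guided dual-branch distillation strategy to strengthen feature expressiveness in a lightweight student model, balancing accuracy and speed in UAV tracking.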

Comments: CVPR2026 Highlight

Abstract

Given the real-time demands of UAV tracking, many methods simplify the backbone to reduce computation, but this often weakens feature representation and degrades performance in complex scenarios. To alleviate this issue, we propose EATrack, an efficient and asymmetric UAV tracking framework centered around a teacher-guided dual-branch distillation strategy that enhances the feature expressiveness of the lightweight student model. Specifically, EATrack investigates two complementary perspectives of knowledge transfer: spatially focused feature-level distillation that compensates for weakened representations by guiding the student to learn strong target representations, and prediction-level distillation that enhances spatial localization by learning the teacher's capability for accurate target localization. Furthermore, to enhance robustness against appearance variations, we introduce a fine-grained target-aware distillation strategy that selectively transfers the teacher's target modeling capacity to the student. A temporal adaptation module is incorporated at inference to enhance robustness over time. Experiments on five UAV benchmarks demonstrate that EATrack achieves a favorable balance between accuracy and speed. Code: https://github.com/GXNU-ZhongLab/EATrack

2605.28016 2026-05-28 cs.CV physics.med-ph

Enhancing Ultra-low-field MRI with Segmentation-guided Adversarial Learning

James Grover, Andrew Phair, Michael Ferraro, David E. J. Waddington

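AI summary: Proposes combining anatomically conditioned segmentation priors with model ensembling: Swin UNETR generates tissue-segmentation priors, and two enhancement networks, CycleGAN and T-REX, synthesize 3T-like MRI, effectively improving the image quality of 64 mT ultra-low-field MRI.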

2605.28014 2026-05-28 cs.CL cs.LG

ROSD: Reflective On-Policy Self-Distillation for Language Model Reasoning across Domains

Ziqi Zhao, Xinyu Ma, Liu Yang, Yujie Feng, Daiting Shi, Jingzhou He, Xin Xin, Zhaochun Ren, Xiao-Ming Wu

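AI summary: Proposes the Reflective On-policy Self-Distillation (ROSD) framework, which uses reflection-guided, error-localized distillation to turn reference-solution imitation into targeted reasoning correction, improving in-domain reasoning and out-of-domain generalization.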

Comments: Preprint

Abstract

On-policy self-distillation (OPSD) improves the reasoning performance of large language models (LLMs) by providing dense token-level supervision for on-policy rollouts. However, existing OPSD methods often yield limited gains on in-domain reasoning and generalize poorly to out-of-domain problems. We identify two key causes: conditioning the self-teacher on a verified solution encourages imitation of training-domain reference trajectories rather than error-specific correction, and applying distillation to the full response can overwrite valid reasoning prefixes and reinforce overfitting. We propose Reflective On-policy Self-Distillation (ROSD), a framework that turns reference-solution imitation into targeted reasoning correction through reflection-guided, error-localized distillation. For each rollout, ROSD uses a self-reflector to extract a corrective idea and locate the first erroneous span. The corrective idea guides the self-teacher toward targeted supervision, while the localized error span restricts distillation to where correction is needed. This design corrects flawed reasoning while preserving valid prefixes. Experiments on multiple in-domain and out-of-domain reasoning benchmarks show that ROSD yields stronger in-domain reasoning performance overall and substantially better out-of-domain generalization than standard OPSD. Code is available at https://github.com/ZiqiZhao1/ROSD.
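
The idea of restricting distillation to the located error span can be illustrated with a token mask. The sketch below is a simplified assumption about how such masking might look: it takes per-token distillation losses as given and averages only those inside a half-open span `[span_start, span_end)`. The function names and the span representation are hypothetical, and the self-reflector that finds the span is not modeled.

```python
from typing import List


def error_span_mask(num_tokens: int, span_start: int, span_end: int) -> List[int]:
    """1 for tokens inside [span_start, span_end), 0 elsewhere."""
    return [1 if span_start <= t < span_end else 0 for t in range(num_tokens)]


def masked_distillation_loss(token_losses: List[float], mask: List[int]) -> float:
    """Mean per-token loss over masked-in tokens only; 0.0 if nothing is selected."""
    selected = [loss for loss, m in zip(token_losses, mask) if m]
    return sum(selected) / len(selected) if selected else 0.0
```

Tokens before the span receive no gradient from this term, which corresponds to the abstract's goal of preserving valid reasoning prefixes rather than overwriting them.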
