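arXivDaily: a daily academic digest that syncs the full arXiv feed and provides AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.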

2605.26498 2026-05-27 cs.CL

Verilog-Evolve: Feedback-Driven and Skill-Evolving Verilog Generation

Verilog-Evolve：反馈驱动与技能演进的Verilog生成

Zehua Pei, Hui-Ling Zhen, Yu Zhang, Sinno Jialin Pan, Mingxuan Yuan, Bei Yu

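AI summary: Proposes the Verilog-Evolve framework, which iteratively refines Verilog code using executable feedback (functional simulation, Yosys synthesis, an ABC timing proxy, and others) and improves generation quality through cross-session skill evolution. Experiments on VerilogEval and mixed-precision GEMM tasks show higher functional success and more downstream-friendly RTL.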

AI中文摘要

大型语言模型（LLMs）改进了从自然语言规范生成Verilog的过程，但大多数流水线仍将生成视为孤立的采样后功能检查。这对于实际的RTL设计是不够的，因为有用的Verilog必须正确、可综合、考虑时序，并对下游硬件目标友好。我们提出Verilog-Evolve，一个用于版本化Verilog细化和跨会话技能演进的反馈驱动框架。对于每个任务，Verilog-Evolve生成多样化的次要候选，通过功能仿真、Yosys综合、ABC时序代理以及可选的GEMM指标的可执行反馈进行评估，然后在可配置评分下将最佳候选提升为主要版本。为了跨任务改进，系统维护模块化技能指导，根据任务和反馈上下文检索技能，并通过创建/改进/跳过决策和验证器报告从记录的历史中演进候选技能。在VerilogEval和混合精度GEMM任务上的实验表明，Verilog-Evolve提高了最终功能成功和晋升稳定性，同时在开源综合、时序代理和网表级GEMM目标下生成更下游友好的RTL。验证门控的技能演进进一步提高了GEMM下游质量，并在评估的技能模式中实现了最佳下游分数和GEMM保留通过率。

英文摘要

Large language models (LLMs) have improved Verilog generation from natural-language specifications, but most pipelines still treat generation as isolated sampling followed by functional checking. This is insufficient for practical RTL design, where useful Verilog must be correct, synthesizable, timing-conscious, and friendly to downstream hardware objectives. We present Verilog-Evolve, a feedback-driven framework for versioned Verilog refinement and cross-session skill evolution. For each task, Verilog-Evolve generates diverse minor candidates, evaluates them with executable feedback from functional simulation, Yosys synthesis, ABC timing proxy, and optional GEMM metrics, then promotes the best candidate into a major version under configurable scoring. To improve across tasks, the system maintains modular skill guidance, retrieves skills according to task and feedback context, and evolves candidate skills from logged histories through create/improve/skip decisions and verifier reports. Experiments on VerilogEval and mixed-precision GEMM tasks show that Verilog-Evolve improves final functional success and promotion stability while producing more downstream-friendly RTL under open-source synthesis, timing-proxy, and netlist-level GEMM objectives. Validation-gated skill evolution further improves GEMM downstream quality and achieves the best downstream score and GEMM held-out pass rate among the evaluated skill modes.
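
The generate, evaluate, and promote cycle described in this abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `Feedback` fields, the weights in `score`, and the `promote_best` helper are hypothetical names, and the real framework obtains its feedback from functional simulation, Yosys synthesis, and an ABC timing proxy rather than from a stub.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Feedback:
    """Executable feedback for one candidate (illustrative fields only)."""
    passed_simulation: bool  # outcome of functional simulation
    synthesized: bool        # whether synthesis succeeded
    cell_count: int          # area proxy reported by synthesis
    delay_proxy: float       # timing proxy, lower is better


def score(fb: Feedback, w_area: float = 0.01, w_delay: float = 1.0) -> float:
    """Configurable scalar score; failing candidates are never promotable."""
    if not (fb.passed_simulation and fb.synthesized):
        return float("-inf")
    return -(w_area * fb.cell_count + w_delay * fb.delay_proxy)


def promote_best(
    candidates: List[str],
    evaluate: Callable[[str], Feedback],
    current_best: Optional[float] = None,
) -> Optional[Tuple[str, float]]:
    """Return the (candidate, score) to promote, or None if none qualifies."""
    scored = [(score(evaluate(c)), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    if best_score == float("-inf"):
        return None  # every minor candidate failed a correctness gate
    if current_best is not None and best_score <= current_best:
        return None  # no improvement over the current major version
    return best, best_score
```

Gating the score on correctness first, then ranking by downstream cost, is one simple way to express "configurable scoring"; the paper's actual scoring rule is not given in the abstract.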

2605.26496 2026-05-27 cs.LG cs.AI

Dense2MoE: Pushing the Pareto Frontier of On-Device LLMs via Unified Pruning and Upcycling

Dense2MoE：通过统一剪枝和升级推动设备端LLM的帕累托前沿

Fengfa Li, Hongjin Ji, Yifeng Ding, Lei Ren, Chen Wei

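AI summary: Proposes the Dense2MoE framework, which unifies pruning and upcycling through Layer-Fusion UpCycling (LF-UC) to efficiently convert dense LLMs into on-device MoE models, achieving a better Pareto frontier between inference latency and accuracy.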

Comments 19 pages

AI中文摘要

混合专家（MoE）架构对于资源受限的设备端部署极具前景，但从头训练这些模型成本高昂。当前方法试图通过将密集模型升级为MoE来缓解这一问题，然而它们常常引入参数冗余，降低推理效率。另一方面，标准层剪枝减少了冗余，但不可避免地损害模型准确性。为解决这一困境，我们提出Dense2MoE，一种通过层融合升级（LF-UC）统一剪枝和升级的新框架。在硬件Roofline理论指导下，Dense2MoE通过剪枝来自冗余层的带宽密集型注意力模块，同时将其多层感知机（MLP）重新用作MoE专家，系统地克服了推理内存墙。这种结构创新保留了模型的核心能力，并通过选择性令牌路由严格限制活跃参数。借助适度的持续预训练预算，Dense2MoE高效地将公开可用的密集LLM转换为设备端就绪的MoE模型。大量实验表明，Dense2MoE显著推进了设备端推理延迟与模型准确性的帕累托前沿，优于密集基线、最先进的压缩方法和标准升级方法。

英文摘要

The Mixture-of-Experts (MoE) architecture is highly promising for resource-constrained on-device deployments, yet training these models from scratch incurs prohibitive costs. Current methods attempt to alleviate this by upcycling dense models into MoEs; however, they often introduce parameter redundancy that degrades inference efficiency. Alternatively, standard layer pruning mitigates redundancy but inevitably compromises model accuracy. To resolve this dilemma, we propose Dense2MoE, a novel framework that unifies pruning and upcycling through Layer-Fusion UpCycling (LF-UC). Guided by hardware Roofline theory, Dense2MoE systematically overcomes the inference memory wall by pruning bandwidth-heavy attention modules from redundant layers while repurposing their Multi-Layer Perceptrons (MLPs) into MoE experts. This structural innovation preserves the model's core capabilities and strictly limits active parameters via selective token routing. With a modest continual pre-training budget, Dense2MoE efficiently converts publicly available dense LLMs into on-device-ready MoE models. Extensive experiments demonstrate that Dense2MoE significantly advances the Pareto frontier for on-device inference latency versus model accuracy, outperforming dense baselines, state-of-the-art compression, and standard upcycling methods.

2605.26494 2026-05-27 cs.AI cs.CL cs.LG

The MiniMax-M2 Series: Mini Activations Unleashing Max Real-World Intelligence

MiniMax-M2系列：小激活释放最大现实智能

MiniMax: Aili Chen, Aonian Li, Baichuan Zhou, Bangwei Gong, Binyang Jiang, Boji Dan, Changqing Yu, Chao Wang, Cheng Ma, Cheng Zhong, Cheng Zhu, Chengjun Xiao, Chengyi Yang, Chengyu Du, Chenyang Zhang, Chi Zhang, Chuangyi Huang, Chunhao Zhang, Chunhui Du, Chunyu Zhao, Congchao Guo, Da Chen, Deming Ding, Dianjun Sun, Dongyu Zhang, Enhui Yang, Fei Yu, Guang Zheng, Guodong Zheng, Guohong Li, Haichao Zhu, Haigang Zhou, Haimo Zhang, Han Ding, Hao Zhang, Haohai Sun, Haolin Lyu, Haonan Lu, Haoyu Wang, Huajie Shi, Huiyang Li, Jiacheng Chen, Jian Zhang, Jiaqi Zhuang, Jiaren Cai, Jiaxin Pan, Jiayao Li, Jiayuan Song, Jichuan Zhang, Jie Wang, Jihao Gu, Jin Zhu, Jingwei Dong, Jingyang Li, Jingyu Zhang, Jingze Zhuang, Jinhao Tian, Jinli Liu, Jinyi Hu, Jun Tao, Jun Zhang, Junbin Ruan, Junhao Xu, Junjie Yan, Junteng Liu, Junxian He, Kang Xu, Ke Ji, Ke Yang, Kecheng Xiao, Keyu Duan, Keyu Li, Le Han, Letian Ruan, Li Yuan, Lianfei Yu, Liheng Feng, Lijie Mo, Lin Li, Lingye Bao, Lingyu Yang, Lingyuan Zhou, Loki, Lu Chen, Lunbin Ceng, Ming Li, Ming Zhong, Mingliang Tao, Mingyuan Chi, Mujie Lin, Nan Hu, Ningxin Chen, Peiyin Zhu, Peng Gao, Pengcheng Gao, Pengfei Li, Penglin Li, Pengyu Zhao, Qibin Ren, Qidi Xu, Qihan Ren, Qile Li, Qin Wang, Quanliang Chen, Qunhong Ceng, Rong Tian, Rui Dong, Ruitao Leng, Ruize Zhang, Shanqi Liu, Shaoyu Chen, Sheng Jia, Shun Yao, Shuoran Zhao, Shuqi Yu, Sichen Li, Sicheng Pan, Songquan Zhu, Tengfei Li, Tian Xie, Tiancheng Qin, Tianrun Liang, Wei Liu, Weiqi Xu, Weitao Li, Weixiang Chen, Weiyu Cheng, Weiyu Zhang, Wenhu Chen, Wenqian Zhao, Xiancai Chen, Xiangjun Song, Xiangyuan Wang, Xiao Luo, Xiao Su, Xiaobo Li, Xiaodong Han, Xiaojie Wu, Xihao Song, Xingyi Han, Xinyu Guan, Xuan Lu, Xun Zou, Xunhao Lai, Xutong Li, Yan Gong, Yang Wang, Yang Xu, Yangsen Wang, Ye Tang, Yicheng Chen, Yinran Qiu, Yiqi Shi, Yiting Guo, Yiwen Huang, Yixuan Wang, Yongyi Hu, Yu Gao, Yu Zhang, Yuanxiang Ying, Yuanzhen Zhang, Yubo Wang, Yuchen Song, Yufeng Yang, Yuhang Meng, Yuhang Miao, Yuhao Li, Yujie Liu, Yulin Hu, Yunan Huang, Yunji Li, Yunyi Huang, Yusen Zhang, Yusu Hong, Yutao Xie, Yutong Zhang, Yuwen Liao, Yuxuan Shi, Yuze Wenren, Zebin Li, Zehan Li, Zejian Luo, Zeyu Jin, Zeyuan Sun, Zhanpeng Zhou, Zhaochen Su, Zhendong Li, Zhengmao Zhu, Zhengyuan Peng, Zhenhua Fan, Zhi Zhang, Zhichao Xu, Zhiheng Lv, Zhikang Xu, Zhitao He, Zhiwei He, Zhongyuan Li, Zibo Gao, Zijia Wu, Zijian Song, Zijian Zhou, Zijun Sun, Zishan Huang, Ziying Chen, Ziyue Ge

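AI summary: Presents the MiniMax-M2 series of Mixture-of-Experts language models, which reach frontier performance with a small number of activated parameters. Core components are agent-driven data pipelines, the scalable reinforcement learning system Forge, and the self-evolving M2.7 checkpoint.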

Comments Technical Report. 35 pages, 10 figures, 4 tables

AI中文摘要

我们介绍了MiniMax-M2系列，这是一个基于“小激活可以释放最大现实智能”原则构建的混合专家语言模型家族。旗舰版M2总参数量为229.9B，每个token仅激活9.8B参数。M2系列专为智能体部署而端到端设计，包含三个组成部分：(i) 智能体驱动数据管道，生成大规模、可验证的轨迹，涵盖智能体编码和智能体协作，每个轨迹都基于可执行工作空间和与工件对齐的奖励；(ii) Forge，一个可扩展的智能体原生强化学习系统，适应长程智能体轨迹，并配有窗口FIFO调度、前缀树合并、推理优化以及支持白盒和黑盒智能体的干净训练-推理-智能体解耦；(iii) 最新的M2.7检查点向自我进化迈出了早期一步——自主调试训练运行并修改其自身框架。从M2到M2.7，这种组合将小激活足迹转化为智能体编码、深度搜索、办公任务和推理基准上的前沿性能。

英文摘要

We introduce the MiniMax-M2 series, a family of Mixture-of-Experts language models built around the principle that mini activations can unleash maximum real-world intelligence. The flagship M2 contains 229.9B total parameters with only 9.8B activated per token. Designed end-to-end for agentic deployment, the M2 series rests on three components: (i) agent-driven data pipelines producing large-scale, verifiable trajectories across agentic coding and agentic cowork, each grounded in an executable workspace and an artifact-aligned reward; (ii) Forge, a scalable agent-native RL system that adapts to long-horizon agent trajectories, paired with windowed-FIFO scheduling, prefix-tree merging, inference optimization, and a clean training-inference-agent decoupling that supports both white-box and black-box agents; (iii) the latest M2.7 checkpoint, which takes an early step toward self-evolution by autonomously debugging training runs and modifying its own scaffold. Across M2 through M2.7, this combination translates a mini-activation footprint into frontier-tier performance on agentic coding, deep search, office-task, and reasoning benchmarks.

2605.26492 2026-05-27 cs.CL cs.AI cs.LG

Elias in the Lighthouse, Again? Diagnosing Low Diversity in LLM Stories

灯塔中的伊莱亚斯，再次？诊断LLM故事中的低多样性

Sil Hamilton, David Mimno

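AI summary: By sampling 20,000 stories, the study finds that LLM-generated stories contain highly repetitive vocabulary (such as Elias and lighthouses). These terms originate in preference data rather than pre-training data, suggesting that combining small datasets with strong alignment algorithms can have a disproportionate effect on diversity.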

2605.26491 2026-05-27 cs.LG cs.CV

Beyond Pairwise Preferences: Listwise Reward-Aware Alignment for Diffusion Models

超越成对偏好：扩散模型的列表级奖励感知对齐

Austin Wang, Jiaqi Han, Stefano Ermon, Yisong Yue

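AI summary: Proposes Diffusion LAIR, a listwise reward-aware optimization method that uses continuous reward scores and all candidate images simultaneously to optimize diffusion models, outperforming pairwise preference baselines on text-to-image generation and related tasks.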

AI中文摘要

偏好优化已成为从人类反馈中进行在线强化学习（RLHF）的一种高效替代方案，用于对齐文本到图像扩散模型。然而，现有方法大多将监督简化为二元成对比较。当训练数据自然包含同一提示的多个候选图像，并且连续奖励分数能提供比单一赢家-输家标签更丰富的信息时，这种成对简化具有局限性。为解决这些局限性，我们提出了Diffusion LAIR，一种用于扩散模型的奖励感知列表级偏好优化方法。对于每个提示，LAIR将一组候选图像的奖励分数转换为居中优势权重，然后在隐式奖励上优化优势加权回归目标，隐式奖励定义为当前模型相对于固定参考模型的去噪损失改进，并带有二次惩罚以正则化隐式奖励的幅度。所得目标同时使用所有候选图像而非选择成对，并通过显式控制隐式奖励的幅度保持保守性。LAIR目标在隐式奖励空间中具有有界闭式最优解，阐明了正则化强度如何控制偏好更新的幅度。实验表明，Diffusion LAIR在SD1.5和SDXL上，在文本到图像生成、组合生成和图像编辑基准测试中均优于强偏好优化基线。

英文摘要

Preference optimization has emerged as an efficient alternative to online reinforcement learning from human feedback (RLHF) for aligning text-to-image diffusion models. However, existing methods largely reduce supervision to binary pairwise comparisons. This pairwise reduction is limiting when training data naturally contains multiple candidate images for the same prompt, and when continuous reward scores can provide richer information than a single winner-loser label. To address these limitations, we propose Diffusion LAIR, a reward-aware listwise preference optimization method for diffusion models. For each prompt, LAIR converts reward scores across a group of candidate images into centered advantage weights, then optimizes an advantage-weighted regression objective on the implicit reward, defined as the denoising-loss improvement of the current model over a fixed reference model, with a quadratic penalty that regularizes the magnitude of the implicit reward. The resulting objective uses all candidates simultaneously rather than selecting pairs, and remains conservative by explicitly controlling the magnitude of the implicit reward. The LAIR objective admits a bounded closed-form optimum in implicit-reward space, clarifying how the regularization strength controls the magnitude of the preference update. Experiments show that Diffusion LAIR outperforms strong preference optimization baselines on SD1.5 and SDXL across text-to-image generation, compositional generation, and image editing benchmarks.
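
The two ingredients named in this abstract, centered advantage weights and a quadratic penalty on the implicit reward, can be illustrated with a small Python sketch. The per-candidate objective used here, `A * r - (lam / 2) * r**2`, is an assumed simplification chosen because it has an obvious closed-form maximizer `r = A / lam`; the abstract does not state the exact loss, and the function names and the symbol `lam` are hypothetical.

```python
from typing import List


def centered_advantages(rewards: List[float]) -> List[float]:
    """Subtract the group mean so the advantages for one prompt sum to zero."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def penalized_objective(advantage: float, implicit_reward: float, lam: float) -> float:
    """Assumed per-candidate objective: advantage-weighted reward minus a quadratic penalty."""
    return advantage * implicit_reward - 0.5 * lam * implicit_reward ** 2


def optimal_implicit_reward(advantage: float, lam: float) -> float:
    """Maximizer of the assumed objective above; it shrinks as lam grows."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    return advantage / lam
```

Under this simplified form, a larger penalty strength pulls the optimal implicit reward toward zero, which matches the abstract's remark that the regularization strength controls the magnitude of the preference update.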

2605.26489 2026-05-27 cs.LG

The Stability of Singular Distribution: A Spectral Perspective on the Two-Phase Dynamics of Language Model Pre-training

奇异分布的稳定性：语言模型预训练两阶段动力学的谱视角

Hongtao Zhang, Wenjie Zhou, Chenxi Jia, Wei Chen, Xueqi Cheng

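AI summary: Identifies early stabilization of the singular value spectrum in language model pre-training (SoSD) and shows that it coincides with the slow-descent phase. Theoretical analysis shows that growing weight norms trigger an SoSD threshold, which then bounds the subsequent rate of loss decrease.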

AI中文摘要

大型语言模型预训练通常表现出两阶段轨迹：快速的初始损失下降，随后是长时间的缓慢改善。我们识别出一个潜在的谱现象——奇异分布的稳定性（SoSD），其中迹归一化的奇异值谱早期就稳定下来，即使参数矩阵继续演化。我们证明，SoSD与慢下降阶段之间的同步在不同架构（GPT-2、LLaMA）和设置中广泛存在，包括各种调度（Step-wise、WSD、Cosine Decay）、权重衰减和优化器（AdamW、Muon）。通过分析一个简化的Transformer，我们证明权重范数的增长不可避免地会引发早期的SoSD阈值，之后损失下降速率在理论上受限于奇异分布的变化。我们进一步解释了WSD和Muon等策略通过调节SoSD尺度来影响预训练动态，从而为理解高效预训练动力学提供了谱视角。

英文摘要

Large language model pre-training typically exhibits a two-phase trajectory: a fast initial loss drop followed by a prolonged slow improvement. We identify an underlying spectral phenomenon, Stability of Singular Distribution (SoSD), where the trace-normalized singular value spectrum stabilizes early, even as parameter matrices continue to evolve. We demonstrate that synchronization between SoSD and the slow-descent regime is widely observed across diverse architectures (GPT-2, LLaMA) and settings, including various schedules (Step-wise, WSD, Cosine Decay), weight decays, and optimizers (AdamW, Muon). By analyzing a simplified Transformer, we prove that growing weight norms inevitably precipitate an early SoSD threshold, after which the rate of loss decrease becomes theoretically bounded by the variation in the singular distribution. We further interpret strategies like WSD and Muon through their ability to modulate the SoSD scale, offering a spectral lens for understanding efficient pre-training dynamics.
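
A short Python sketch may help make "trace-normalized singular value spectrum" concrete. It assumes that trace normalization means dividing each singular value by the sum of all singular values, and it uses total-variation distance as one possible way to measure how much the spectrum changed between two checkpoints; both choices, and the function names, are assumptions rather than the paper's definitions. The singular values are taken as input, for example as computed elsewhere with an SVD routine.

```python
from typing import List


def trace_normalized_spectrum(singular_values: List[float]) -> List[float]:
    """Sort singular values in descending order and scale them to sum to 1."""
    total = sum(singular_values)
    if total <= 0:
        raise ValueError("singular values must have a positive sum")
    return [s / total for s in sorted(singular_values, reverse=True)]


def spectrum_shift(a: List[float], b: List[float]) -> float:
    """Total-variation distance between two normalized spectra of equal length."""
    if len(a) != len(b):
        raise ValueError("spectra must have the same length")
    return 0.5 * sum(abs(x - y) for x, y in zip(a, b))
```

Because the normalization removes overall scale, a weight matrix whose norm keeps growing can still show zero shift, which is the sense in which the spectrum can be stable while the parameters continue to evolve.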

2605.26486 2026-05-27 cs.CV

LongCat-Video-Avatar 1.5 Technical Report

LongCat-Video-Avatar 1.5 技术报告

Meituan LongCat Team, Xunliang Cai, Meng Cheng, Feng Gao, Zhe Kong, Jiamu Li, Le Li, Weiheng Li, Hongyu Liu, Shuai Tan, Xiaoming Wei, Tianyu Yang, Yong Zhang

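AI summary: Presents LongCat-Video-Avatar 1.5, an open framework that achieves accurate lip synchronization, full-body temporal stability, and long-video generation by upgrading the audio encoder, refining the training strategy, curating data, and applying RLHF training. It matches or exceeds commercial systems in benchmark evaluations.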

Comments Homepage: https://meigen-ai.github.io/LongCat-Video-Avatar-1.5-Page/ Github: https://github.com/meituan-longcat/LongCat-Video

AI中文摘要

尽管音频驱动视频生成取得了进展，但实现商业级稳定性仍具挑战。我们提出 LongCat-Video-Avatar 1.5，一个升级的开源框架，优先考虑系统工程和生产就绪性而非架构新颖性。通过将音频编码器升级为 Whisper Large 并精心扩展训练配方，v1.5 实现了精确的唇同步、全身时间稳定性和严格身份一致性的鲁棒长视频生成。通过严格的数据筛选和 RLHF 训练，该模型能轻松泛化到风格化领域（如动漫和动物），并原生处理复杂现实条件（如多人交互和物体操作）。此外，针对工业部署的实际需求，我们采用高级步进蒸馏将推理加速至最优的 8 NFE，在服务效率与视觉保真度之间实现了良好权衡。通过在超过 500 个多样化测试案例的综合基准上进行的广泛定量指标和严格人工评估，验证了我们方法的优越性。结果表明，v1.5 在人类相似度评分和专家级质量评估中，与领先的闭源系统（如 HeyGen、OmniHuman 1.5、Kling Avatar 2.0）相比，达到了具有竞争力或更优的性能。通过开源发布，LongCat-Video-Avatar 1.5 缩小了学术研究原型与商业级部署之间的差距。

英文摘要

Despite advances in audio-driven video generation, achieving commercial-grade stability remains challenging. We present LongCat-Video-Avatar 1.5, an upgraded open-source framework prioritizing systematic engineering and production-readiness over architectural novelty. By upgrading the audio encoder to Whisper Large and meticulously scaling our training recipes, v1.5 achieves accurate lip-synchronization, full-body temporal stability, and robust long-video generation with strict identity consistency. Through rigorous data curation and RLHF training, the model readily generalizes to stylized domains such as anime and animals, and natively handles complex real-world conditions such as multi-person interactions and object handling. Furthermore, addressing the practical demands of industrial deployment, we employ advanced step distillation to accelerate inference to an optimal 8 NFE, achieving a favorable trade-off between serving efficiency and visual fidelity. The superiority of our approach is validated through extensive quantitative metrics and a rigorous human evaluation conducted on a comprehensive benchmark of over 500 diverse test cases. Results show that v1.5 achieves competitive or superior performance compared to leading closed-source systems (e.g., HeyGen, OmniHuman 1.5, Kling Avatar 2.0) across human-likeness ratings and expert-level quality assessments on our benchmark. With its open-source release, LongCat-Video-Avatar 1.5 narrows the gap between academic research prototypes and commercial-grade deployment.

2605.26485 2026-05-27 cs.CV cs.CL

OmniInteract: Benchmarking Real-World Streaming Interaction for Real-Time Omnimodal Assistants

OmniInteract：面向实时全模态助手的流式交互基准测试

Xudong Lu, Xueying Li, Annan Wang, Yang Bo, Jinpeng Chen, Zengliang Li, Nianzu Yang, Rui Liu, Xue Yang, Jingwen Hou, Hongsheng Li

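AI summary: Proposes the OmniInteract benchmark, which evaluates the real-time interaction ability of omnimodal large models through online inference over audio-visual streams, and finds that existing models perform poorly in streaming interaction.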

AI中文摘要

我们引入了OmniInteract，一个用于实时全模态大语言模型的流式基准测试，通过音视频流上的原生在线推理进行评估。与离线视频理解或文本提示的流式问答不同，OmniInteract保留了原始音视频流，并要求模型在线处理，无法访问未来内容。用户查询和环境声音嵌入在音频轨道中，要求模型检测多模态触发信号，决定何时响应，并在流展开时回答问题。OmniInteract包含250个视频，具有1430个时间锚定的响应槽：其中1062个1Q1A槽涵盖实时、主动和嵌套场景，368个1QnA槽用于连续任务监控和步骤指导。每个槽包括触发信号、响应窗口和目标答案。我们使用交互感知质量-时效性F1、中断诊断套件和嵌套链完成分数来评估响应正确性、时序、无效输出、中断处理和上下文连续性。实验表明，当前模型在流式交互中仍然薄弱，最佳整体IA-QTF1仅为0.368，最佳1QnA IA-QTF1仅为0.052。在全双工设置下的数学推理进一步研究表明，离线能力不一定能迁移到在线交互中。代码和数据集将在https://github.com/Lucky-Lance/OmniInteract公开。

英文摘要

We introduce OmniInteract, a streaming benchmark for real-time omnimodal large language models evaluated through native online inference over audio-visual streams. Unlike offline video understanding or text-prompted streaming QA, OmniInteract preserves the original audio-visual stream and requires models to process it online, without access to future content. User queries and ambient sounds are embedded in the audio track, requiring models to detect multimodal triggers, decide when to respond, and answer while the stream unfolds. OmniInteract contains 250 videos with 1,430 temporally grounded response slots: 1,062 1Q1A slots across real-time, proactive, and nested scenarios, and 368 1QnA slots for continuous task monitoring and step guidance. Each slot includes a trigger, response window, and target answer. We evaluate response correctness, timing, invalid outputs, interruption handling, and context continuity using Interaction-Aware Quality-Timeliness F1, Interruption Diagnostic Suite, and Nested Chain Completion Score. Experiments show that current models remain weak in streaming interaction, with the best overall IA-QTF1 reaching only 0.368 and the best 1QnA IA-QTF1 only 0.052. Further study on mathematical reasoning in full-duplex settings shows that offline capability does not necessarily transfer to online interaction. Code and datasets will be made publicly accessible at https://github.com/Lucky-Lance/OmniInteract.

2605.26484 2026-05-27 cs.LG

Extra-Merge: Tracing the Rank-1 Subspace of Model Merging in Language Model Pre-Training

Extra-Merge：追踪语言模型预训练中模型合并的秩-1子空间

Wenjie Zhou, Bohan Wang, Hongtao Zhang, Chenxi Jia, Wei Chen, Xueqi Cheng

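AI summary: Analyzing late-stage pre-training trajectories reveals a Rank-1 Subspace phenomenon. The paper proposes Extra-Merge, a method requiring no additional training that extrapolates along this subspace to minimize loss and outperforms standard merging baselines on the GPT-2 and LLaMA families.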

AI中文摘要

模型合并已成为增强大型语言模型（LLMs）的轻量级范式，但其底层机制仍知之甚少。在这项工作中，我们分析了后期预训练轨迹，并揭示了一个秩-1子空间现象：虽然原始优化步骤剧烈振荡，但连续的合并检查点坍缩到一个稳定的、近似一维的线性流形上。我们通过河谷景观分析从理论上为这一观察提供了依据：平均操作充当了几何低通滤波器，抑制高曲率噪声以揭示最优下降方向。基于这一见解，我们提出了Extra-Merge，一种无需训练的策略，沿该子空间外推以最小化损失，无需额外的梯度更新。在GPT-2和LLaMA系列（124M到2B）上的大量实验表明，Extra-Merge始终优于标准合并基线。值得注意的是，它在Pythia-12B下游任务上取得了一致的零样本准确率提升，并有效推广到Muon优化器\citep{jordan2024muon}。

英文摘要

Model merging has emerged as a lightweight paradigm for enhancing Large Language Models (LLMs), yet its underlying mechanisms remain poorly understood. In this work, we analyze late-stage pre-training trajectories and uncover a Rank-1 Subspace phenomenon: while raw optimization steps oscillate violently, consecutive merged checkpoints collapse onto a stable, approximately one-dimensional linear manifold. We theoretically ground this observation in a river-valley landscape analysis: averaging acts as a geometric low-pass filter that dampens high-curvature noise to reveal the optimal descent direction. Capitalizing on this insight, we propose Extra-Merge, a training-free strategy that extrapolates along this subspace to minimize loss without additional gradient updates. Extensive experiments across GPT-2 and LLaMA families (124M to 2B) demonstrate that Extra-Merge consistently outperforms standard merging baselines. Notably, it yields consistent zero-shot accuracy gains on Pythia-12B downstream tasks and generalizes effectively to the Muon optimizer \citep{jordan2024muon}.
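
As a rough illustration of extrapolating along the line traced by consecutive merged checkpoints, the Python sketch below averages groups of checkpoints and then steps beyond the latest merged point. It assumes flat parameter vectors, uniform averaging, and a scalar step size `alpha` that would have to be chosen separately (for instance by held-out loss); none of these details, nor the function names, come from the abstract.

```python
from typing import List, Sequence


def average_checkpoints(checkpoints: Sequence[Sequence[float]]) -> List[float]:
    """Uniformly average parameter vectors (the merging step)."""
    n = len(checkpoints)
    if n == 0:
        raise ValueError("need at least one checkpoint")
    return [sum(values) / n for values in zip(*checkpoints)]


def extrapolate(prev_merged: Sequence[float],
                curr_merged: Sequence[float],
                alpha: float) -> List[float]:
    """Step past curr_merged along the direction curr_merged - prev_merged."""
    return [c + alpha * (c - p) for p, c in zip(prev_merged, curr_merged)]
```

With `alpha = 0` this reduces to plain merging, so any benefit of the extrapolation would show up as an improvement over that baseline for some positive `alpha`.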

2605.26483 2026-05-27 cs.CV

Clinically-Grounded Counterfactual Reasoning for Medical Video Diagnosis

基于临床基础的反事实推理用于医学视频诊断

Jianzhe Gao, Churan Wang, Weiyi Zhang, Jianghua Li, Li-An Li, Wenguan Wang, Yixin Zhu, Yizhou Wang

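AI summary: Proposes MedVCR, a counterfactual reasoning framework that combines a diffusion-based generator synthesizing pathological tissue evolution, clinical rules that encode diagnostic knowledge, and a dual diagnostic prediction strategy, improving performance on medical video diagnosis tasks by 2.6%-10.2%.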

AI中文摘要

医学视频诊断涉及从整个检查过程中的动态组织反应推断临床决策。现有方法依赖于端到端学习范式，该范式i)关注外观而非病理，ii)缺乏临床先验知识，iii)仅基于观察进行推理而无反事实比较。本文引入MedVCR，一个模仿临床诊断思维的反事实推理框架。MedVCR包含三个组件：一个反事实生成器，通过扩散方式合成指定病理状态下的组织演变；一个反事实表示学习模块，通过临床规则（即时间一致性、病理可分离性和反事实对齐）编码诊断知识；以及一个双重诊断预测策略，将视频级评估与帧级反事实分析相结合。MedVCR在完全监督（如阴道镜检查）和弱监督（如结肠镜检查）视频诊断设置下进行评估，与领先基线相比取得了2.6%-10.2%的性能提升。全面的消融研究进一步验证了每个组件的有效性。代码将发布。

英文摘要

Medical video diagnosis involves inferring clinical decisions from dynamic tissue responses throughout examination processes. Existing methods rely on an end-to-end learning paradigm that i) focuses on appearance rather than pathology, ii) lacks clinical priors, and iii) reasons solely from observations without counterfactual comparison. This work introduces MedVCR, a counterfactual reasoning framework that mimics clinical diagnostic thinking. MedVCR comprises three components: a Counterfactual Generator that uses diffusion to synthesize tissue evolution under specified pathological states; a Counterfactual Representation Learning module that encodes diagnostic knowledge through clinical rules (i.e., temporal consistency, pathological separability, and counterfactual alignment); and a Dual Diagnostic Prediction strategy that integrates video-level assessment with frame-level counterfactual analysis. MedVCR is evaluated under both fully supervised (e.g., colposcopy) and weakly supervised (e.g., colonoscopy) video diagnosis settings, yielding 2.6%-10.2% performance gains compared with leading baselines. Comprehensive ablation studies further validate the effectiveness of each component. The code will be released.

2605.26478 2026-05-27 cs.RO cs.AI cs.CV cs.LG cs.SY eess.SY

Efficient On-policy Visual-RL via Stochastic Decoupled Policy Gradient

基于随机解耦策略梯度的高效在策略视觉强化学习

Haoxiang You, Yilang Liu, Davis Zong, Qian Wang, Teeratham Vitchutripop, Qi Wang, Daniel Rakita, Ian Abraham

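AI summary: Proposes Stochastic Decoupled Policy Gradient (SDPG), which estimates policy gradients from stochastic perturbations of trajectory rollouts. It trains diverse visuomotor control policies end to end within hours on a single GPU, substantially reduces compute and memory overhead, and outperforms baselines on visual MuJoCo benchmarks.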

2605.26477 2026-05-27 cs.LG

Variational Inference for Evidential Deep Learning

证据深度学习的变分推断

Jiawei Tang, Xinyan Du, Hui Liu, Junhui Hou, Yuheng Jia

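AI summary: To address two problems in conventional Evidential Deep Learning (EDL), namely a KL penalty that leads to excessively high evidence and a parameter setting that lacks theoretical guarantees, the paper proposes VI-EDL, a framework based on variational inference. It derives an Evidence Lower Bound (ELBO) that suppresses excessive evidence growth, establishes a generalization bound, and achieves state-of-the-art performance on visual and medical datasets in out-of-distribution detection, noise detection, and autonomous driving scenarios.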

AI中文摘要

尽管深度神经网络（DNN）取得了显著性能，但它们倾向于产生过度自信的预测。证据深度学习（EDL）通过将预测公式化为类别概率上的狄利克雷分布来显式量化认知不确定性，从而缓解了这一问题。然而，我们发现传统的EDL存在两个基本限制：一个仅抑制负类证据的Kullback-Leibler（KL）惩罚项，导致证据过高，从而降低了模型量化不确定性的能力；以及缺乏设置狄利克雷参数$\boldsymbol{\alpha} = \mathbf{e} + \mathbf{1}$的理论保证。在本文中，我们提出了一个数学上严谨的框架——变分推断证据深度学习（VI-EDL）。通过从变分推断的角度重新表述证据学习，我们推导出一个证据下界（ELBO），它防止证据过度增长。理论上，我们严格建立了泛化界，并揭示了预测不确定性、特征和网络复杂度如何影响该界，以及为什么设置$\boldsymbol{\alpha} = \mathbf{e} + \mathbf{1}$可以最小化它。在标准视觉和医学数据集上的大量实验表明，VI-EDL实现了最先进的性能，在分布外检测、噪声检测和自动驾驶场景中表现出色。代码可在https://github.com/seutjw/VI-EDL获取。

英文摘要

While Deep Neural Networks (DNNs) achieve remarkable performance, they tend to produce overconfident predictions. Evidential Deep Learning (EDL) mitigates this by formulating predictions as a Dirichlet distribution over class probabilities to explicitly quantify epistemic uncertainty. However, we find that conventional EDL suffers from two fundamental limitations: a Kullback-Leibler (KL) penalty that only suppresses the evidence of negative classes, producing excessively high evidence and thereby decreasing the model's ability to quantify uncertainty, and the absence of a theoretical guarantee for setting the Dirichlet parameter $\boldsymbol{\alpha} = \mathbf{e} + \mathbf{1}$. In this paper, we propose a mathematically principled framework, Variational Inference Evidential Deep Learning (VI-EDL). By reformulating evidential learning through the lens of variational inference, we derive an Evidence Lower Bound (ELBO), which prevents the evidence from growing excessively. Theoretically, we rigorously establish a generalization bound and reveal how the predicted uncertainty, feature and network complexity affect this bound, and why setting $\boldsymbol{\alpha} = \mathbf{e} + \mathbf{1}$ can minimize it. Extensive experiments on standard visual and medical datasets demonstrate that VI-EDL achieves state-of-the-art performance, with strong results in out-of-distribution detection, noise detection, and autonomous driving scenarios. The code is available at https://github.com/seutjw/VI-EDL.
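
For readers unfamiliar with the Dirichlet parameterization mentioned here, the sketch below shows the convention commonly used in evidential deep learning: non-negative evidence `e` gives Dirichlet parameters `alpha = e + 1`, the expected class probabilities are `alpha / S` with `S = sum(alpha)`, and a scalar uncertainty is taken as `K / S` for `K` classes. This illustrates only the conventional setup that the abstract refers to, not the VI-EDL objective, and the function name is hypothetical.

```python
from typing import List, Tuple


def dirichlet_from_evidence(evidence: List[float]) -> Tuple[List[float], List[float], float]:
    """Map per-class evidence to (alpha, expected probabilities, uncertainty)."""
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    alpha = [e + 1.0 for e in evidence]       # alpha = e + 1
    strength = sum(alpha)                     # Dirichlet strength S
    probs = [a / strength for a in alpha]     # expected class probabilities
    uncertainty = len(alpha) / strength       # K / S, equal to 1 with zero evidence
    return alpha, probs, uncertainty
```

With zero evidence the prediction is uniform and the uncertainty is 1; as evidence accumulates, the uncertainty falls, which is why uncontrolled evidence growth undermines the uncertainty estimate.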

2605.26476 2026-05-27 cs.CL cs.IR

FAB-Bench: A Framework for Adaptive RAG Benchmarking in Semiconductor Manufacturing

FAB-Bench：半导体制造中自适应RAG基准测试框架

Jingbin Qian, Congwen Yi, Min Xia, Wen Wu, Jun Zhu, Jian Guan

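AI summary: Proposes the FAB-Bench framework, which uses six diagnostic metrics and three synthesis strategies to evaluate RAG systems for semiconductor manufacturing across different context windows, and finds that attention dilution is the main cause of performance degradation at extreme context lengths.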

AI中文摘要

检索增强生成（RAG）已成为知识密集型应用的关键技术，然而在垂直领域评估其性能仍然困难，原因包括领域复杂性、多样的上下文规模以及对专家评估的严重依赖，而专家评估成本高、不一致且不可扩展。我们提出了FAB-Bench，一个用于半导体制造中RAG系统自适应基准测试的端到端框架。FAB-Bench定义了六项诊断指标，衡量事实准确性、上下文利用率、完整性、检索相关性、技术深度和推理一致性。该框架将检索器诊断与生成器级别的推理分析相结合，覆盖4K-32K token的上下文窗口，量化了随着上下文范围扩展，检索精度和生成保真度如何共同演变。从超过1300个生成的候选对中，我们精选了200个查询-答案对的高质量基准，涵盖三种合成策略：大海捞针、文档内多主题和跨文档多跳。在四个LLM和四个RAG框架上的系统评估揭示了三种不同的上下文缩放行为：对数增长、早期饱和和冷启动动态，并确定注意力稀释是极端上下文长度下性能下降的主要机制。在另外三个生产级RAG系统上的跨框架验证确认了评估的可移植性。

英文摘要

Retrieval-Augmented Generation (RAG) has become critical for knowledge-intensive applications, yet evaluating its performance in vertical domains remains difficult due to domain complexity, diverse context scales, and heavy reliance on expert assessments that are costly, inconsistent, and non-scalable. We introduce FAB-Bench, an end-to-end framework for adaptive benchmarking of RAG systems in semiconductor manufacturing. FAB-Bench defines six diagnostic metrics measuring factual accuracy, contextual utilization, completeness, retrieval relevance, technical depth, and reasoning consistency. The framework couples retriever diagnostics with generator-level reasoning analysis across context windows of 4K-32K tokens, quantifying how retrieval precision and generative fidelity co-evolve as contextual scope expands. From over 1,300 generated candidates, we curated a high-quality benchmark of 200 query-answer pairs spanning three synthesis strategies: needle-in-haystack, intra-document multi-topic, and cross-document multi-hop. Systematic evaluation across four LLMs and four RAG frameworks reveals three distinct context-scaling behaviors: logarithmic growth, early saturation, and cold-start dynamics, and identifies attention dilution as the primary mechanism behind performance degradation at extreme context lengths. Cross-framework validation on three additional production RAG systems confirms evaluation portability.

2605.26475 2026-05-27 cs.CV cs.AI

Comparative Study of Vision-Based Metric Measurement for Large-Scale Planar Scenes

大规模平面场景的视觉度量测量比较研究

ZhiXin Sun

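AI summary: For large-scale outdoor scenes, the paper uses PTZ cameras to compare three vision-based planar metric measurement methods (monocular ranging, image stitching, and stereo ranging) and analyzes their accuracy and applicability.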

AI中文摘要

基于视觉的度量距离和面积测量在大规模室外环境中仍然具有挑战性，原因包括远距离感知、相机变焦和不稳定的成像条件。本文研究了在实际水库监测场景中使用PTZ相机的平面度量测量，并比较了三种代表性方法：基于几何的单目测距、带有鸟瞰变换的图像拼接以及使用两个联合校准的单目相机的立体测距。对于单目测距，从相机几何推导出平面定位模型，并分析了相机俯仰角的影响。研究了用于大面积映射的图像拼接，同时开发了一种无需专用立体硬件的立体方案用于远距离测量。实验显示了明确的权衡：单目测距在足够大的俯仰角下达到米级精度，立体测距达到分米级精度且对俯仰变化敏感性较低，图像拼接在小规模场景中有效，但随着场景增大稳定性和可扩展性下降。

英文摘要

Vision-based metric distance and area measurement remains challenging in large-scale outdoor environments due to long-range sensing, camera zoom, and unstable imaging conditions. This work studies planar metric measurement in a real-world reservoir monitoring scenario using PTZ cameras and compares three representative approaches: geometry-based monocular ranging, image stitching with bird's-eye-view transformation, and stereo-based ranging using two jointly calibrated monocular cameras. For monocular ranging, planar localization models are derived from camera geometry and the effect of camera pitch angle is analyzed. Image stitching is investigated for large-area mapping, while a stereo-based scheme is developed for long-range measurement without dedicated stereo hardware. Experiments show clear trade-offs: monocular ranging achieves meter-level accuracy under sufficiently large pitch angles, stereo-based ranging achieves decimeter-level accuracy with reduced sensitivity to pitch variations, and image stitching is effective for small-scale scenes but degrades in stability and scalability as scene size increases.
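
Why monocular ranging depends so strongly on pitch can be seen from textbook flat-ground geometry, sketched below in Python. The sketch assumes a camera at height `h` above a flat plane whose optical axis points down at pitch angle `theta` from the horizontal, so the ground distance to the point on the axis is `h / tan(theta)` and its sensitivity to pitch error is `h / sin(theta)**2`. This is a generic model with hypothetical function names, not the planar localization model derived in the paper.

```python
import math


def ground_distance(height_m: float, pitch_deg: float) -> float:
    """Horizontal distance to where the optical axis meets a flat ground plane."""
    if not 0.0 < pitch_deg <= 90.0:
        raise ValueError("pitch must be in (0, 90] degrees below horizontal")
    return height_m / math.tan(math.radians(pitch_deg))


def pitch_sensitivity(height_m: float, pitch_deg: float) -> float:
    """Magnitude of d(distance)/d(pitch), in metres per radian."""
    if not 0.0 < pitch_deg <= 90.0:
        raise ValueError("pitch must be in (0, 90] degrees below horizontal")
    return height_m / math.sin(math.radians(pitch_deg)) ** 2
```

In this model the sensitivity grows rapidly as the pitch angle shrinks, so a small angular error at a shallow pitch turns into a large range error, which is consistent with the reported need for sufficiently large pitch angles.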

2605.26471 2026-05-27 cs.RO

Heterogeneous AAV Logistics Task Allocation: A Reinforcement Learning Enhanced Overlapping Coalition Formation Game Approach

异构AAV物流任务分配：一种强化学习增强的重叠联盟形成博弈方法

Yuze Zhou, Jingliang Sun, Junzhi Li, Jianxin Zhong, Zihan Wang, Teng Long

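AI summary: To address the optimality challenge that the stochastic emergence of time-sensitive tasks poses for heterogeneous AAV task allocation in dynamic urban logistics, the paper proposes an overlapping coalition formation game approach enhanced by a transformer-based soft actor-critic network, which improves task allocation optimality and is proven to converge to a Nash-stable equilibrium.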

Comments 12 pages

AI中文摘要

在动态城市物流中，时间敏感任务的随机出现对异构AAV物流任务分配提出了显著的最优性挑战。为解决这一问题，提出了一种强化学习增强的重叠联盟形成博弈方法。建立了动态任务分配模型，其中全局最优性通过耦合服务质量与资源消耗的广义物流成本进行数学量化。为应对随机订单到达引起的时变任务集，设计了一种基于Transformer的软演员-评论家网络。通过利用多头自注意力编码可变长度的物流状态并捕获任务间的时空依赖关系，学习到的策略自适应地指导联盟更新，取代重叠联盟形成博弈中的启发式规则。在此基础上，异构AAV可以为动态物流任务形成更高效的重叠联盟。所得到的联盟形成过程被证明构成一个精确势博弈，保证了在有限迭代次数内收敛到纳什稳定均衡。数值仿真表明，所提算法在广义物流成本准则下有效提高了任务分配的最优性。在32架AAV和80个任务的场景中，与启发式OCF基线相比，我们的算法实现了39.76%的成本降低。室内飞行实验进一步验证了其实用性。

英文摘要

In dynamic urban logistics, the stochastic emergence of time-sensitive tasks poses a significant optimality challenge for heterogeneous AAV logistics task allocation. To address this problem, a reinforcement learning enhanced overlapping coalition formation game approach is proposed. A dynamic task allocation model is established, where global optimality is mathematically quantified by a generalized logistics cost coupling service quality and resource consumption. To deal with the time-varying task sets induced by stochastic order arrivals, a transformer-based soft actor-critic network is designed. By leveraging multi-head self-attention to encode variable-length logistics states and capture task-wise spatiotemporal dependencies, the learned policy adaptively guides coalition updates, replacing heuristic rules in the overlapping coalition formation game. On this basis, heterogeneous AAVs can form more efficient overlapping coalitions for dynamic logistics tasks. The resulting coalition formation process is proven to constitute an exact potential game, which guarantees convergence to a Nash-stable equilibrium within a finite number of iterations. Numerical simulations demonstrate that the proposed algorithm effectively improves the optimality of task allocation under the generalized logistics cost criterion. In a scenario with 32 AAVs and 80 tasks, our algorithm achieves a 39.76% cost reduction compared with the heuristic OCF baseline. Indoor flight experiments further validate its practicality.

2605.26470 2026-05-27 cs.CV

Triadic Dynamics Aware Diffusion Posterior Sampling for Inverse Problems: Optimizing Guidance and Stochasticity Schedules

面向逆问题的三元动力学感知扩散后验采样：优化引导与随机性调度

Junseo Bang, Dong Ju Mun, Hoigi Seo, Seongmin Hong, Se Young Chun

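AI summary: Proposes TriPS, which models posterior sampling as a time-varying control problem and optimizes the schedules of data-consistency guidance, classifier-free guidance, and stochasticity, significantly improving performance on imaging inverse problems.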

Comments ICML 2026

AI中文摘要

使用扩散模型的生成后验采样已成为解决成像逆问题的主流范式，通常包含三个主要组件：数据一致性（DC）引导、无分类器引导（CFG）和随机性。虽然先前的工作专注于如何开发每个或所有组件，但很少关注如何调度它们，导致启发式固定或部分调整的次优调度。在这项工作中，我们认为所有三个组件在调度方面的相互作用对于显著提高成像逆问题的求解性能至关重要。我们的分析表明，在采样早期激进的CFG与DC引导冲突，而随机性将轨迹带回高概率区域。基于这些发现，我们提出了三元动力学感知后验采样（TriPS），它将后验采样重新表述为一个时变控制问题，并按照DC和随机性尺度递减、CFG尺度递增的三元趋势优化调度。TriPS通过两种策略实现：基于模板的函数先验搜索以获得可靠的基线调度，以及基于组相对策略优化（GRPO）的强化学习以获得更灵活的时间曲线。实验表明，TriPS在数据保真度和感知真实感方面优于最先进的基线方法。

英文摘要

Generative posterior sampling using diffusion models has emerged as a dominant paradigm for solving inverse problems in imaging, and usually consists of three main components: data consistency (DC) guidance, classifier-free guidance (CFG), and stochasticity. While prior work has focused on how to develop each or all of these components, less attention has been given to how to schedule them, leading to heuristically fixed or partially adjusted suboptimal schedules. In this work, we argue that the interactions among all three components in terms of scheduling are crucial for significantly improved performance in solving inverse problems in imaging. Our analysis shows that aggressive CFG early in sampling conflicts with DC guidance, while stochasticity brings the trajectory back to higher-probability regions. Based on these findings, we propose Triadic Dynamics Aware Posterior Sampling (TriPS), which reformulates posterior sampling as a time-varying control problem and optimizes schedules following a triadic trend of decreasing DC and stochasticity scales alongside increasing CFG scale. TriPS achieves this through two strategies: template-based search over functional priors for reliable baseline schedules, and Group Relative Policy Optimization (GRPO)-based reinforcement learning for more flexible temporal curves. Experiments demonstrate that TriPS outperforms state-of-the-art baselines in data fidelity and perceptual realism.
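
The triadic trend described in this abstract (DC and stochasticity scales decreasing while the CFG scale increases over the course of sampling) can be pictured with a simple linear template, sketched below in Python. The linear shape, the numeric ranges, and the names `triadic_schedule`, `dc_range`, `noise_range`, and `cfg_range` are all assumptions chosen for illustration; the paper searches over functional templates and also learns schedules with reinforcement learning, neither of which is reproduced here.

```python
from typing import Dict, Tuple


def linear(start: float, end: float, progress: float) -> float:
    """Interpolate from start (progress = 0) to end (progress = 1)."""
    return start + (end - start) * progress


def triadic_schedule(
    step: int,
    num_steps: int,
    dc_range: Tuple[float, float] = (1.0, 0.1),     # decreasing (assumed values)
    noise_range: Tuple[float, float] = (1.0, 0.0),  # decreasing (assumed values)
    cfg_range: Tuple[float, float] = (1.0, 7.5),    # increasing (assumed values)
) -> Dict[str, float]:
    """Return the three scales for a sampling step, with step 0 the first step."""
    if not 0 <= step < num_steps:
        raise ValueError("step must satisfy 0 <= step < num_steps")
    progress = step / (num_steps - 1) if num_steps > 1 else 0.0
    return {
        "dc_scale": linear(dc_range[0], dc_range[1], progress),
        "stochasticity": linear(noise_range[0], noise_range[1], progress),
        "cfg_scale": linear(cfg_range[0], cfg_range[1], progress),
    }
```

Keeping the CFG scale low early avoids the conflict with DC guidance noted in the abstract, while early stochasticity is where the trajectory most benefits from being pulled back toward higher-probability regions.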

2605.26468 2026-05-27 cs.LG cs.AI

Diffuse to Detect: Generative Diffusion Models for Unsupervised IC Anomaly Detection

扩散检测：用于无监督IC异常检测的生成扩散模型

Yuxuan Yin, Chen He, Todd Jacobs, Jialei He, Boxun Xu, Robert Jin, Peng Li

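AI summary: Proposes the first unsupervised anomaly detection framework built on a diffusion Transformer, using autoencoder compression, structured token sequences, and noise-prediction error for fast wafer-level screening, and achieves the best performance on 16 nm IC test data.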

Comments 9 pages, 5 figures

2605.26463 2026-05-27 cs.CL cs.AI

Towards Error-Free EHRs: Reasoning-Intensive Consistency Verification Between Clinical Notes and Structured Tables in Electronic Health Records

迈向无差错的电子健康记录：临床笔记与结构化表格之间的推理密集型一致性验证

Yeonsu Kwon, Jiho Kim, Junseong Choi, Paloma Rabaey, Minseo Kim, Sujeong Im, Jeewon Yang, Jun-Min Lee, Sangji Lee, Jiwon Kim, Hangyul Yoon, Hyunwook Kwon, Edward Choi

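AI summary: To address inconsistencies between clinical notes and structured tables in electronic health records, proposes the reasoning-intensive benchmark EHR-ReasonCon and the LLM-based framework EHR-Inspector, which verifies consistency efficiently through anchor-entity extraction, temporal references, and table-exploration tools.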

AI中文摘要

电子健康记录中非结构化临床笔记与结构化表格之间的数据一致性对于患者安全和临床决策至关重要。然而，现有关于笔记-表格一致性验证的工作主要依赖于数值或简单事件的表面匹配。这些方法未能捕捉真实世界EHR文档背后的推理，包括临床解释、事件关系和时间变化。为弥补这一差距，我们引入了EHR-ReasonCon，一个用于笔记-表格一致性验证的推理密集型基准。它基于MIMIC-III构建，并经过专家指导的注释，包含来自临床笔记的8,048个实体，并提供高质量的真实标签。注释协议由专门的表格探索工具支持，以确保系统的证据检索和可靠的一致性评估。我们还提出了EHR-Inspector，一个基于LLM的框架，它分割笔记、提取锚点实体和时间引用，并使用表格探索工具与结构化表格进行一致性验证。在严格和宽松标准下，使用经过专家验证的LLM-as-a-judge指标进行评估，EHR-Inspector在多个模型骨干上实现了最先进的性能。进一步的分析证明了其组件的有效性，并突出了与人工验证的差异。

英文摘要

Data consistency between unstructured clinical notes and structured tables in Electronic Health Records (EHRs) is essential for patient safety and clinical decision-making. However, existing work on note-table consistency verification mainly relies on surface-level matching of numeric values or simple events. Such approaches fail to capture the reasoning underlying real-world EHR documentation, including clinical interpretation, event relations, and temporal changes. To address this gap, we introduce EHR-ReasonCon, a reasoning-intensive benchmark for note-table consistency verification. Built on MIMIC-III with expert-guided annotations, it comprises 8,048 entities derived from clinical notes and provides high-quality ground-truth labels. The annotation protocol is supported by specialized table-exploration tools to ensure systematic evidence retrieval and reliable consistency assessment. We also propose EHR-Inspector, an LLM-based framework that segments notes, extracts anchor entities and temporal references, and uses table-exploration tools to verify consistency against structured tables. Evaluated using expert-validated LLM-as-a-judge metrics under harsh and lenient criteria, EHR-Inspector achieves state-of-the-art performance across multiple model backbones. Analyses further demonstrate the effectiveness of its components and highlight differences from human verification.

2605.26460 2026-05-27 cs.CV cs.AI

AnchorDiff: Training-Free Concept Grounding for MM-DiTs via Anchor-Based Graph Propagation

AnchorDiff: 基于锚点图传播的无训练概念定位用于多模态扩散Transformer

Jian Zhang, Zhijun Zhang

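AI summary: Proposes AnchorDiff, which decouples semantic localization from structural refinement through anchor selection and hybrid-graph propagation, addressing concept leakage between visually confusable concepts in multi-modal diffusion Transformers.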

AI中文摘要

多模态扩散Transformer（MM-DiTs）为无训练概念定位编码了丰富的表示，但现有的基于注意力的方法通常在视觉上易混淆的概念上产生重叠激活，这种失败模式我们称为概念泄漏，即目标响应溢出到非目标对象。为了解决这个问题，我们提出了AnchorDiff，一种无训练的定位方法，将语义定位与结构细化解耦。AnchorDiff从概念到图像的注意力图中选择一个高置信度锚点，并将其作为独热种子在从图像到图像自注意力导出的混合图上传播。该图利用输出空间相似性进行密集的物体内传播，并通过逐行注意力门抑制跨物体连接。此外，我们引入了多概念混淆数据集，其中包含具有多个视觉相似概念和独立掩码的图像，从而能够显式评估概念泄漏。实验表明，AnchorDiff在ImageNet-Segmentation和PascalVOC上实现了强大的定位性能，同时在我们的多概念混淆数据集上显著减少了概念泄漏。

英文摘要

Multi-Modal Diffusion Transformers (MM-DiTs) encode rich representations for training-free concept grounding, but existing attention-based methods often produce overlapping activations on visually confusable concepts, a failure mode we call concept leakage, where target responses spill over to non-target objects. To address this issue, we propose AnchorDiff, a training-free grounding method that decouples semantic localization from structural refinement. AnchorDiff selects a high-confidence anchor from the concept-to-image attention map and propagates it as a one-hot seed over a hybrid graph derived from image-to-image self-attention. The graph uses output-space similarity for dense within-object propagation and a row-wise attention gate to suppress cross-object connections. Additionally, we introduce the Multi-Concept Confusion Dataset, which contains images with multiple visually similar concepts and separate masks, enabling explicit evaluation of concept leakage. Experiments show that AnchorDiff achieves strong grounding performance on ImageNet-Segmentation and PascalVOC, while substantially reducing concept leakage on our Multi-Concept Confusion Dataset.
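The anchor-then-propagate idea can be illustrated on a toy graph. The sketch below uses a generic random walk with restart, which is an assumption standing in for whatever propagation rule AnchorDiff actually uses; the affinity matrix, attention values, and function names are all invented for illustration, and the cross-object edges are simply set to zero to mimic the effect of a gate.

```python
from typing import List


def one_hot_anchor(attention: List[float]) -> List[float]:
    """Pick the single highest-attention token and return it as a one-hot seed."""
    anchor = max(range(len(attention)), key=lambda i: attention[i])
    return [1.0 if i == anchor else 0.0 for i in range(len(attention))]


def row_normalize(affinity: List[List[float]]) -> List[List[float]]:
    """Turn a non-negative affinity matrix into row-stochastic transition weights."""
    out = []
    for row in affinity:
        total = sum(row)
        out.append([v / total if total > 0 else 0.0 for v in row])
    return out


def propagate(seed: List[float], affinity: List[List[float]],
              steps: int = 20, restart: float = 0.2) -> List[float]:
    """Random walk with restart: scores <- (1 - restart) * P^T scores + restart * seed."""
    transition = row_normalize(affinity)
    n = len(seed)
    scores = list(seed)
    for _ in range(steps):
        walked = [sum(scores[i] * transition[i][j] for i in range(n)) for j in range(n)]
        scores = [(1.0 - restart) * walked[j] + restart * seed[j] for j in range(n)]
    return scores


# Toy setup: tokens 0-1 belong to the target object, tokens 2-3 to a look-alike.
# Token 2 has a fairly high raw attention value, i.e. a "leaky" response.
attention = [0.2, 0.9, 0.6, 0.1]
affinity = [
    [1.0, 0.8, 0.0, 0.0],
    [0.8, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.7],  # cross-object edges already suppressed to zero
    [0.0, 0.0, 0.7, 1.0],
]
seed = one_hot_anchor(attention)
scores = propagate(seed, affinity)
print(scores)
```

Because the seed is a single token and the graph has no edges between the two objects, the look-alike tokens receive no score even though token 2 had a high raw attention value, which is the intuition behind suppressing concept leakage.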

2605.26459 2026-05-27 cs.LG

MuCon: Clipped Muon Updates for LLM Training

MuCon: 用于LLM训练的裁剪Muon更新

Albert Yi

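AI summary: Proposes the MuCon optimizer, which replaces Muon's polar-factor direction with singular-value clipping, and studies its approximate computation and numerical stability.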

AI中文摘要

Muon风格的优化器采用矩阵值动量或预条件更新 $B = U \operatorname{diag}(\sigma_1,\ldots,\sigma_r) V^\top$，并将其替换为其规范部分极因子 $\operatorname{Pol}(B) = U V^\top$。这会将每个非零奇异值映射为1。MuCon是本文研究的裁剪Muon变体：它对相同的Muon矩阵应用奇异值裁剪，$D^{\mathrm{MuCon}}_\tau(B) = \operatorname{MClip}_\tau(B) = U \operatorname{diag}\bigl(\min\{\sigma_i,\tau\}\bigr) V^\top, \qquad \tau> 0$。因此，$\operatorname{MClip}_\tau$ 表示数学裁剪算子，而MuCon表示优化器原语，它将此裁剪方向替代Muon的极方向。本文使用的Muon/MuCon缩放参数化称为 $\text{SpectralP}$：这是一种隐藏矩阵缩放方案，在该方案下应用极Muon或裁剪MuCon方向。映射 $\operatorname{MClip}_\tau$ 是到谱范数球 $\{X : \|X\|_2 \le \tau\}$ 的Frobenius投影：它保持小于或等于 $\tau$ 的奇异值不变，仅修改违反的奇异方向。本文探讨何时可以在不进行完整稠密SVD的情况下近似MuCon裁剪步骤。我们记录了两个精确恒等式，一个极/绝对值公式和一个标量根公式，后者引出了用于裁剪半正定因子的有理牛顿滤波器，并指出了两者共同的数值障碍：接近阈值的奇异值使得符号决策和有理求解变得病态。因此，矩阵函数方法仅在结合稳定的极/平方根本原语或裁剪边界附近的显式正则化时才有用。

英文摘要

Muon-style optimizers take a matrix-valued momentum or preconditioned update $B = U \operatorname{diag}(\sigma_1,\ldots,\sigma_r) V^\top$ and replace it with its canonical partial polar factor $\operatorname{Pol}(B) = U V^\top$. This maps every nonzero singular value to one. MuCon is the clipped-Muon variant studied here: it applies singular-value clipping to the same Muon matrix, $D^{\mathrm{MuCon}}_\tau(B) = \operatorname{MClip}_\tau(B) = U \operatorname{diag}\bigl(\min\{\sigma_i,\tau\}\bigr) V^\top, \qquad \tau > 0$. Thus, $\operatorname{MClip}_\tau$ denotes the mathematical clipping operator, while MuCon denotes the optimizer primitive that substitutes this clipped direction for Muon's polar direction. The Muon/MuCon scaling parameterization used in this work is called $\text{SpectralP}$: it is the hidden-matrix scaling recipe under which polar Muon or clipped MuCon directions are applied. The map $\operatorname{MClip}_\tau$ is the Frobenius projection onto the spectral-norm ball $\{X : \|X\|_2 \le \tau\}$: it leaves singular values at or below $\tau$ unchanged and modifies only the violating singular directions. This paper asks when the MuCon clipping step can be approximated without a full dense SVD. We record two exact identities, a polar/absolute-value formula and a scalar-root formulation leading to a rational Newton filter for the clipped positive-semidefinite factor, and identify the numerical obstruction common to both: singular values near the threshold make sign decisions and rational solves ill-conditioned. Matrix-function methods are therefore useful only when paired with stable polar/square-root primitives or explicit regularization near the clipping boundary.
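To make the difference between the polar map and the clipping map concrete, the block below works through a standard scalar identity for $\min\{\sigma,\tau\}$ and lifts it to the SVD factors defined in the abstract. This is an illustrative derivation under the abstract's notation; whether it coincides with the paper's "polar/absolute-value formula" is an assumption, since the abstract does not state that formula.

```latex
% Scalar identity, valid for any \sigma \ge 0 and \tau > 0:
\min\{\sigma,\tau\} \;=\; \tfrac{1}{2}\bigl(\sigma + \tau - |\sigma - \tau|\bigr)

% Applied to each singular value of B = U \operatorname{diag}(\sigma_i) V^\top,
% using U \operatorname{diag}(\sigma_i) V^\top = B and U V^\top = \operatorname{Pol}(B):
\operatorname{MClip}_\tau(B)
  \;=\; \tfrac{1}{2}\Bigl( B + \tau\,\operatorname{Pol}(B)
        - U \operatorname{diag}\bigl(|\sigma_i - \tau|\bigr) V^\top \Bigr)

% Worked example with \sigma = (3,\ 0.5) and \tau = 1:
%   \operatorname{Pol}:            (3,\ 0.5) \mapsto (1,\ 1)
%   \operatorname{MClip}_1:        (3,\ 0.5) \mapsto (1,\ 0.5)
```

The example shows the behavioral difference: the polar factor inflates the small singular value 0.5 to 1, while clipping leaves it untouched and only caps the violating value 3. The absolute-value term is also where the sign of $\sigma_i - \tau$ is decided, so it is plausible that this is where singular values near the threshold cause the ill-conditioning the abstract mentions.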

2605.26456 2026-05-27 cs.CV

Sparse-LiDAR Prompting of Monocular Geometry Foundations: An Empirical Study Toward Long-Range Driving Depth

稀疏激光雷达提示的单目几何基础：面向长距离驾驶深度的实证研究

Kai Zheng, Qiang Feng, Xingjian Liu, Wenquan Tan, Yuan Li

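AI summary: Proposes SLIM, the first adaptation of MoGe-2 to accept truly sparse LiDAR input; with a partial-convolution sparse encoder and a multi-scale fusion network, it reduces absolute relative error by 39-51% at long range (100-150 m).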

Comments 6 pages, 3 figures, 2 tables

AI中文摘要

稀疏激光雷达提示的深度基础模型（PromptDA, Prior Depth Anything, DMD3C）在室内场景或KITTI标准80米评估范围内表现出色。然而，存在两个局限性：（i）在长距离驾驶场景（50-150米）中缺乏系统性的距离分层评估；（ii）基于视差基础模型的先前方法依赖于预插值的密集先验，而真正稀疏激光雷达注入到点图基础模型（例如MoGe-2，NeurIPS 2025）尚未被探索。我们提出SLIM（稀疏激光雷达注入的单目几何），这是首个将MoGe-2适配为接受真正稀疏激光雷达输入的工作。SLIM集成了一个部分卷积稀疏编码器和一个多尺度融合颈部，在五个尺度上将激光雷达特征融合到点图解码器中。我们采用密度无关训练（随机注入比例在[0.005, 0.30]之间），使得单一模型能够适应不同的输入密度。在Virtual KITTI和CARLA上，SLIM在100-150米范围内将MoGe-2基线的绝对相对误差降低了约39-51%。在六种注入比例下的消融实验表明，部分卷积注入在Virtual KITTI的所有六种设置下均改善了AbsRel和RMSE；在CARLA上，AbsRel在六种设置中的五种得到改善（0.015比例下接近平局，差异为0.0013），而RMSE在不同编码器间相当，部分卷积在三种设置下有所改善（最多改善0.31单位），在其余三种设置下最多损失0.11单位。

英文摘要

Sparse-LiDAR-prompted depth foundation models (PromptDA, Prior Depth Anything, DMD3C) have shown strong results on indoor scenes or within KITTI's standard 80-meter evaluation cap. However, two limitations remain: (i) systematic distance-stratified evaluation in long-range driving regimes (50-150 m) is largely absent; (ii) prior approaches built on disparity-based foundations rely on pre-interpolated dense priors, leaving truly sparse LiDAR injection on point-map foundations (e.g., MoGe-2, NeurIPS 2025) unexplored. We present SLIM (Sparse-LiDAR Injected Monocular geometry), the first adaptation of MoGe-2 to accept truly sparse LiDAR input. SLIM integrates a partial-convolution sparse encoder with a multi-scale fusion neck that fuses LiDAR features into the point-map decoder at five scales. We adopt density-agnostic training (random injection ratio in [0.005, 0.30]) so that a single model serves diverse input densities. On Virtual KITTI and CARLA, SLIM reduces the absolute relative error of the MoGe-2 baseline by approximately 39-51% at 100-150 m. Ablation across six injection ratios shows that partial-convolution injection improves both AbsRel and RMSE on Virtual KITTI in all six settings; on CARLA, AbsRel improves in five of six settings (the one near-tie, at ratio 0.015, differs by 0.0013), and RMSE is comparable across encoders, with partial convolution improving in three settings (by up to 0.31 units) and losing by at most 0.11 units in the other three.
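Distance-stratified evaluation, as called for in limitation (i), amounts to computing the usual absolute relative error separately per ground-truth distance band. The following is a minimal sketch using the standard AbsRel definition, mean of |pred - gt| / gt; the bin edges, function name, and sample depths are assumptions for illustration and are not taken from the paper.

```python
from typing import Dict, List, Optional, Tuple

Bin = Tuple[float, float]  # [lower, upper) range of ground-truth distance in meters


def stratified_abs_rel(pred: List[float], gt: List[float],
                       bins: List[Bin]) -> Dict[Bin, Optional[float]]:
    """AbsRel = mean(|pred - gt| / gt), computed separately per distance bin.

    Pixels with non-positive ground truth are treated as invalid and skipped.
    A bin with no valid pixels maps to None.
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same length")
    results: Dict[Bin, Optional[float]] = {}
    for lower, upper in bins:
        errors = [abs(p - g) / g
                  for p, g in zip(pred, gt)
                  if g > 0 and lower <= g < upper]
        results[(lower, upper)] = sum(errors) / len(errors) if errors else None
    return results


# Invented depths in meters, flattened over valid pixels.
gt_depth = [10.0, 40.0, 60.0, 120.0, 140.0]
pred_depth = [10.5, 38.0, 66.0, 108.0, 154.0]
bands = [(0.0, 50.0), (50.0, 100.0), (100.0, 150.0), (150.0, 200.0)]

for band, value in stratified_abs_rel(pred_depth, gt_depth, bands).items():
    print(band, value)
```

Reporting per-band values avoids the averaging effect of a single global AbsRel, where the many near-range pixels can hide large errors in the 100-150 m band.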

2605.26454 2026-05-27 cs.CL

Model Unlearning Objectives Vary for Distinct Language Functions

模型遗忘目标因语言功能而异

Berk Atil, Vipul Gupta, Rebecca J. Passonneau

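AI summary: Argues that different language functions (hazardous-knowledge unlearning and toxicity unlearning) call for different unlearning methods, and proposes a cosine-based meta-learning variant of RMU and a multi-layer objective method respectively, with good results on several 7-8B models.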

2605.26449 2026-05-27 cs.CV cs.AI

Cross-scale Aligned Supervision for Training GANs

跨尺度对齐监督用于训练生成对抗网络

Sangeek Hyun, MinKyu Lee, Jae-Pil Heo

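AI summary: To address cross-scale trajectory misalignment in multi-scale GAN generation, proposes CAT (Cross-scale Aligned Transformer), which aligns intermediate outputs with the final output through generator-side consistency regularization and achieves an FID-50K of 1.56 on ImageNet-256.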

Comments Preprint

AI中文摘要

现代GAN通常在中间生成器输出上引入对抗性监督，并将由此产生的多阶段合成解释为从粗到细的分层生成。在这项工作中，我们挑战了这一解释。我们认为标准的尺度级对抗监督并未构建适当的从粗到细的层次结构：每个中间图像被独立地推向其自身分辨率下的真实分布，但这种尺度级的真实性并不能确保各阶段的输出代表相同的生成样本。此外，每个阶段产生的特定尺度图像并未用作后续阶段的明确细化目标。因此，其对抗性损失可以改善特定尺度的输出，而不约束后续阶段保持相同的样本轨迹，允许它们转向不同的样本而不是细化先前的输出。我们将此问题称为跨尺度轨迹未对齐问题。为了解决这个问题，我们提出了CAT，一种用于多尺度对抗生成的跨尺度对齐Transformer。CAT保持判别器尺度级，因此每个中间输出在其自身分辨率下被评估，同时添加一个简单的生成器侧一致性正则化，以对齐中间输出与最终输出。在类别条件ImageNet-256上，CAT-H/2在仅60个训练周期后，通过一步推理实现了1.56的FID-50K，优于强大的单步GAN和扩散/流基线。

英文摘要

Modern GANs often introduce adversarial supervision on intermediate generator outputs and interpret the resulting multi-stage synthesis as coarse-to-fine hierarchical generation. In this work, we challenge this interpretation. We argue that standard scale-wise adversarial supervision does not construct a proper coarse-to-fine hierarchy: each intermediate image is independently pushed toward the real distribution at its own resolution, but this scale-wise realism does not ensure that outputs across stages represent the identical generated sample. Moreover, the scale-specific image produced at each stage is not used as an explicit refinement target for the subsequent stage. Therefore, its adversarial loss can improve a scale-specific output without constraining later stages to preserve the same sample trajectory, allowing them to move toward a different sample rather than refine the previous output. We refer to this problem as a cross-scale trajectory misalignment problem. To resolve it, we propose CAT, a Cross-scale Aligned Transformer for multi-scale adversarial generation. CAT keeps the discriminator scale-wise, so each intermediate output is evaluated at its own resolution, while adding a simple generator-side consistency regularization that aligns intermediate outputs with the final output. On class-conditional ImageNet-256, CAT-H/2 achieves an FID-50K of 1.56 with one-step inference after only 60 training epochs, outperforming strong one-step GAN and diffusion/flow baselines.

2605.26447 2026-05-27 cs.CV

Underwater360: Reconstructing Underwater Scenes from Panoramic Images with Omnidirectional Gaussian Splatting

Underwater360: 基于全景高斯泼溅的全景图像水下场景重建

Jiangbei Hu, Weichao Song, Shibo Yu, Mohan Wang, Zihan Yi, Rui Wu, Mingkang Xiang, Na Lei, Shengfa Wang, Zhongxuan Luo, Ying He

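AI summary: Proposes the Underwater360 framework, which uses physics-informed omnidirectional Gaussian splatting with spherical ray casting and appearance-medium modeling to achieve high-quality reconstruction and appearance restoration of underwater panoramic scenes.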

AI中文摘要

水下场景重建对于沉浸式探索水生环境至关重要，但由于复杂的参与介质效应（如吸收和散射）以及传统相机的有限视场（FoV），仍然具有挑战性。尽管将全景成像与3D高斯泼溅（3DGS）相结合为逼真的水下渲染提供了有前景的方向，但传统的3DGS难以处理球面投影畸变和水下介质退化。在本文中，我们提出了Underwater360，一个物理信息引导的全向3DGS框架，用于水下全景场景重建。首先，我们引入了一个全向高斯泼溅模块，该模块直接在球面相机空间中进行光线投射，而不是依赖2D投影近似，从而减少了360$^\circ$视场下的几何畸变。其次，我们设计了一个基于物理的外观-介质建模架构，带有姿态条件的外观嵌入，以明确地将内在场景辐射与深度相关的后向散射和衰减解耦，从而实现物理基础的外观恢复。最后，我们建立了一个新的全景水下基准数据集，包含合成场景和真实场景。大量实验表明，Underwater360在水下新视图合成和场景外观恢复方面取得了优越的性能，在复杂水下环境中提供了改进的渲染质量和跨视图一致性。代码和数据集发布在https://github.com/SwcK423/Underwater360。

英文摘要

Underwater scene reconstruction is essential for immersive exploration of aquatic environments, yet remains challenging due to complex participating-media effects such as absorption and scattering, as well as the limited field of view (FoV) of conventional cameras. Although combining panoramic imaging with 3D Gaussian Splatting (3DGS) offers a promising direction for photorealistic underwater rendering, traditional 3DGS struggles with both spherical projection distortion and underwater medium degradation. In this paper, we propose Underwater360, a physics-informed omnidirectional 3DGS framework for underwater panoramic scene reconstruction. First, we introduce an Omnidirectional Gaussian Splatting module that performs ray casting directly in spherical camera space instead of relying on 2D projection approximations, thereby reducing geometric distortions under 360$^\circ$ FoV. Second, we design a physics-based appearance-medium modeling architecture with pose-conditioned appearance embeddings to explicitly decouple intrinsic scene radiance from depth-dependent backscatter and attenuation, enabling physically grounded scene appearance restoration. Finally, we establish a new panoramic underwater benchmark dataset containing both synthetic and real-world scenes. Extensive experiments demonstrate that Underwater360 achieves superior performance in underwater novel view synthesis and scene appearance restoration, delivering improved rendering quality and cross-view consistency in complex underwater environments. The code and datasets are released at https://github.com/SwcK423/Underwater360

2605.26446 2026-05-27 cs.LG cs.AI

DDGAD: Trajectory Dynamics for Diffusion-Based Graph Anomaly Detection

DDGAD：基于扩散的图异常检测中的轨迹动力学

Yuxin Yang, Limei Hu, Feng Chen

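AI summary: Proposes the DDGAD framework, which distinguishes normal from anomalous nodes using trajectory dynamics under diffusion regularization and reliability-aware neighborhood consensus, and detects anomalies through three complementary anomaly signals.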

AI中文摘要

图异常检测（GAD）旨在识别图结构数据中行为或属性显著偏离整体模式的节点或子结构，在金融风险控制、社交网络分析和网络安全等领域具有关键应用。然而，现有的基于GCN的方法存在污染传播的根本问题，即异常节点通过消息传递污染其邻居的表示，导致检测性能下降。本文提出DDGAD，一种新颖的基于扩散的图异常检测框架，利用轨迹动力学区分正常和异常节点。我们的关键洞察是，在扩散正则化和可靠性感知邻域共识的耦合作用下，正常节点表现出一致且稳定的表示轨迹，而异常节点由于全局流形先验与局部污染消息传递之间的方向不一致，表现出不稳定且冲突的动力学。为了减轻污染传播，我们引入了一种分布式的可靠性感知共识细化机制，并定义了三种互补的异常信号：邻居不一致性、可靠性权重和动力学冲突能量。我们进一步对耦合动力学下的正常节点稳定性进行了初步的理论分析。这些信号从局部不一致性、共识可靠性和动力学不稳定性角度共同刻画异常行为。在五个真实世界数据集上的大量实验证明了所提框架的有效性。

英文摘要

Graph anomaly detection (GAD) aims to identify nodes or substructures whose behavior or attributes deviate significantly from the overall pattern in graph-structured data, with critical applications in financial risk control, social network analysis, and cybersecurity. However, existing GCN-based methods suffer from the fundamental problem of contamination propagation, where anomalous nodes pollute the representations of their neighbors through message passing, leading to degraded detection performance. In this paper, we propose DDGAD, a novel diffusion-based graph anomaly detection framework that leverages trajectory dynamics to distinguish normal and anomalous nodes. Our key insight is that normal nodes exhibit consistent and stable representation trajectories under the coupled effects of diffusion regularization and reliability-aware neighborhood consensus, while anomalous nodes exhibit unstable and conflicting dynamics due to the directional disagreement between the global manifold prior and locally contaminated message passing. To mitigate contamination propagation, we introduce a distributed reliability-aware consensus refinement mechanism and define three complementary anomaly signals: neighbor inconsistency, reliability weight, and dynamical conflict energy. We further provide a preliminary theoretical analysis on normal node stability under the coupled dynamics. These signals collectively characterize anomalous behaviors from the perspectives of local inconsistency, consensus reliability, and dynamical instability. Extensive experiments on five real-world datasets demonstrate the effectiveness of the proposed framework.

2605.26445 2026-05-27 cs.CL

Curation and Extraction of Drug-Related Entities from Reddit Platform

从Reddit平台策划和提取药物相关实体

Zewei Wang, Zihan Xu, Yishu Wei, Michael Chary, Yifan Peng

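AI summary: To address physicians' limited understanding of real-world illicit drug use, builds the ReDose dataset (6,435 Reddit posts) and applies BERT-based, LLM, and RAG models to extract drug, dose, and effect entities; BiomedBERT reaches an F1 of 0.843 on drug entities and Llama-3 70B outperforms GPT-4, but effect extraction remains challenging.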

Comments Accepted by IEEE International Conference on Healthcare Informatics (ICHI 2026)

AI中文摘要

医生主要通过临床过量案例了解非法药物，这限制了他们对其真实使用情况的理解。与此同时，药物用户在线上分享第一手经验，提供了关于药物剂量和效果的见解。为弥合这一差距，我们引入了ReDose（Reddit药物剂量和效果）数据集，包含6435条关于物质使用的Reddit帖子。一名委员会认证的毒理学家主要注释了训练集和测试集，而两名医学生参与了测试集的注释，标注了药物、剂量和效果实体。我们使用基于BERT、大型语言模型（LLM）和检索增强生成（RAG）模型对6267个注释进行了基准测试。BiomedBERT在药物实体上达到了0.843的F1分数，而Llama-3 70B优于GPT-4（F1=0.79 vs. 0.72）。效果提取仍然具有挑战性，GPT-4的召回率为0.41。ReDose捕捉了患者策划的叙述，以推进从社交媒体中提取医学数据。

英文摘要

Physicians learn about illicit drugs primarily from clinical overdose cases, which limits their understanding of real-world usage. Meanwhile, drug users share first-hand experiences online, offering insights into the dosage and effects of drugs. To bridge this gap, we introduce ReDose (REddit Drug DOSe and Effect), a dataset of 6,435 Reddit posts on substance use. A board-certified toxicologist primarily annotated both the training and test sets, while two medical science students contributed to the test set, labeling DRUG, DOSE, and EFFECT entities. We benchmarked 6,267 annotations using BERT-based, large language model (LLM)-based, and Retrieval-Augmented Generation (RAG) models. BiomedBERT achieved an F1-score of 0.843 for DRUG, while Llama-3 70B outperformed GPT-4 (F1 = 0.79 vs. 0.72). EFFECT extraction remains challenging, with GPT-4 achieving a recall of 0.41. ReDose captures patient-curated narratives to advance medical data extraction from social media.
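The F1 and recall figures above are entity-level extraction metrics. As a reminder of how such numbers are typically computed, here is a minimal sketch of exact-match precision, recall, and F1 over (type, start, end) spans. The exact-match criterion, the function name, and the toy spans are assumptions; the abstract does not say whether the paper uses exact or relaxed matching.

```python
from typing import Set, Tuple

Entity = Tuple[str, int, int]  # (entity type, start offset, end offset)


def entity_prf(gold: Set[Entity], predicted: Set[Entity]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 between gold and predicted entity sets."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented spans for one post: the model finds the DRUG and DOSE mentions
# but predicts the wrong boundaries for the EFFECT mention.
gold = {("DRUG", 0, 8), ("DOSE", 12, 17), ("EFFECT", 30, 45)}
predicted = {("DRUG", 0, 8), ("DOSE", 12, 17), ("EFFECT", 30, 39)}

precision, recall, f1 = entity_prf(gold, predicted)
print(precision, recall, f1)
```

Under exact matching, a boundary error counts as both a false positive and a false negative, which is one reason long, free-form spans such as effect descriptions tend to score lower than short drug names.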

2605.26442 2026-05-27 cs.CL cs.AI

Alignment Tuning for Large Language Models: A Data-Centric Lens on Alignment Data Pipelines

大型语言模型的对齐调优：以数据为中心的对齐数据管道视角

Hwanjun Song

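AI summary: Takes a data-centric view that recasts alignment tuning as a pipeline-design problem with three stages (response synthesis, preference evaluation, and preference instantiation), uses this framework to classify existing alignment methods in a unified way, summarizes design trade-offs and failure modes, distills high-level principles, and identifies open challenges.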

Comments Accepted at the Findings of ACL 2026

2605.26441 2026-05-27 cs.CV cs.AI

Rethinking Weakly-supervised Video Temporal Grounding From a Game Perspective

从博弈视角重新思考弱监督视频时间定位

Xiang Fang, Zeyu Xiong, Wanlong Fang, Xiaoye Qu, Chen Chen, Jianfeng Dong, Keke Tang, Pan Zhou, Yu Cheng, Daizong Liu

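AI summary: Takes a game-theoretic perspective, modeling the uncertain correspondence between frames and words as a multivariate cooperative game to enable multi-level cross-modal interaction and improve the accuracy of video temporal grounding under weak supervision.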

Comments Published in ECCV 2024

AI中文摘要

本文针对弱监督视频时间定位这一具有挑战性的任务。现有方法通常基于时刻提案选择框架，利用对比学习和重构范式对预定义时刻提案进行评分。尽管取得了显著进展，但我们认为当前框架忽略了两个不可或缺的问题：1) 粗粒度跨模态学习：先前方法仅捕获全局视频级与查询的对齐，未能建模视频帧与查询词之间的详细一致性以准确定位时刻边界。2) 复杂的时刻提案：其性能严重依赖于提案的质量，而提案的选择既耗时又复杂。为此，本文首次尝试从新颖的博弈视角处理该任务，通过多样粒度和灵活组合有效学习每个视觉-语言对之间的不确定关系，实现多级跨模态交互。具体而言，我们创造性地将每个视频帧和查询词建模为多元合作博弈中的玩家，学习它们对跨模态相似度得分的贡献。通过博弈论交互量化联盟内帧-词合作的趋势，我们能够评估帧与词之间所有不确定但可能的对应关系。最后，我们不再使用时刻提案，而是利用学习到的查询引导的帧级得分进行更好的时刻定位。实验表明，我们的方法在Charades-STA和ActivityNet Caption数据集上均取得了优越性能。

英文摘要

This paper addresses the challenging task of weakly-supervised video temporal grounding. Existing approaches are generally based on the moment proposal selection framework, which uses contrastive learning and reconstruction paradigms to score pre-defined moment proposals. Although they have achieved significant progress, we argue that current frameworks overlook two indispensable issues: 1) Coarse-grained cross-modal learning: previous methods capture only the global video-level alignment with the query, failing to model the detailed consistency between video frames and query words needed to ground moment boundaries accurately. 2) Complex moment proposals: their performance relies heavily on the quality of proposals, whose selection is also time-consuming and complicated. To this end, we make the first attempt to tackle this task from a novel game perspective, which effectively learns the uncertain relationship between each vision-language pair with diverse granularity and flexible combination for multi-level cross-modal interaction. Specifically, we model each video frame and query word as game players under multivariate cooperative game theory to learn their contribution to the cross-modal similarity score. By quantifying the trend of frame-word cooperation within a coalition via the game-theoretic interaction, we are able to value all uncertain but possible correspondences between frames and words. Finally, instead of using moment proposals, we use the learned query-guided frame-wise scores for better moment localization. Experiments show that our method achieves superior performance on both the Charades-STA and ActivityNet Caption datasets.

2605.26440 2026-05-27 cs.CL cs.SE

Conv-to-Bench: Evaluating Language Models Via User-Assistant Dialogues In Code Tasks

Conv-to-Bench: 通过代码任务中的用户-助手对话评估语言模型

Victor M. dos Santos, Andre C. Castro, Samuel L. de S. Toledo, Bruno M. L. Calura, Lisandra C. de M. Menezes, Raul C. R. Mata, Telma W. de L. Soares, Bryan L. M. de Oliveira

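AI summary: Proposes the Conv-to-Bench framework, which automatically converts multi-turn user-assistant dialogues into structured requirement checklists for evaluating large language models; in the programming domain it aligns closely with human-authored standards at low computational cost.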

AI中文摘要

大型语言模型（LLMs）的快速发展已超越了传统评估基准的可扩展性，这些基准仍严重依赖劳动密集型的人工专家策划。我们通过Conv-to-Bench解决了这一瓶颈，这是一个多阶段框架，可自动将真实的多轮用户-助手对话转化为结构化的、可验证的需求清单。通过利用真实对话日志中的“指令演化”，我们的方法将碎片化的用户意图分解为整合的指令和二元评估标准。应用于编程领域，Conv-to-Bench生成的评估集与BigCodeBench等人工编写的标准几乎完美对齐，实现了高达ρ=1.000的斯皮尔曼相关性，且计算开销显著降低。对LLM-as-a-judge框架的验证进一步证实了其可靠性，主要评估器与人工验证的真实标签达到高度一致（κ=0.705）。我们全面的消融研究表明，虽然多轮交互捕捉了用户意图的迭代演化，但以指令为中心的提取提供了更稳健的基础。最终，Conv-to-Bench提供了一种可扩展、成本效益高的范式，用于在用户中心AI应用持续多样化时保持高保真评估标准。

英文摘要

The rapid advancement of Large Language Models (LLMs) has outpaced the scalability of traditional evaluation benchmarks, which remain heavily dependent on labor-intensive expert curation. We address this bottleneck with Conv-to-Bench, a multi-stage framework that automatically transforms authentic multi-turn user-assistant dialogues into structured, verifiable requirement checklists. By leveraging the "instructional evolution" found in real-world conversational logs, our approach deconstructs fragmented user intent into consolidated instructions and binary evaluation criteria. Applied to the programming domain, Conv-to-Bench produces evaluation sets that demonstrate near-perfect alignment with human-authored standards like BigCodeBench, achieving Spearman correlations of up to $\rho = 1.000$ with significantly lower computational overhead. Validation of the LLM-as-a-judge framework further confirms its reliability, with the primary evaluator achieving substantial agreement with human-verified ground truth ($\kappa = 0.705$). Our comprehensive ablation studies reveal that while multi-turn interactions capture the iterative evolution of user intent, instruction-centric extraction provides a more robust foundation. Ultimately, Conv-to-Bench provides a scalable, cost-effective paradigm for maintaining high-fidelity evaluation standards as user-centric AI applications continue to diversify.
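Two quantities in this abstract are easy to pin down with a short example: a score built from binary checklist criteria, and the Spearman rank correlation used to compare model rankings across benchmarks. The sketch below is illustrative only. The function names, the pass-fraction scoring rule, and all scores are assumptions, and the tie-free Spearman formula is the textbook one rather than anything specific to the paper.

```python
from typing import List, Sequence


def checklist_score(verdicts: Sequence[bool]) -> float:
    """Fraction of binary requirements a response satisfies (assumed scoring rule)."""
    if not verdicts:
        raise ValueError("checklist must contain at least one criterion")
    return sum(verdicts) / len(verdicts)


def ranks(values: Sequence[float]) -> List[int]:
    """Rank 1 = largest value. Assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Tie-free Spearman correlation: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


# Invented scores for four hypothetical models.
checklist_scores = [checklist_score(v) for v in (
    [True, True, True, False],    # model A
    [True, True, False, False],   # model B
    [True, False, False, False],  # model C
    [True, True, True, True],     # model D
)]
reference_scores = [52.0, 41.0, 30.0, 60.0]  # hypothetical reference benchmark

print(spearman_rho(checklist_scores, reference_scores))
```

Note that a correlation of 1.000 only says that the two benchmarks order the models identically; the underlying scores can differ arbitrarily in scale, and with a small number of models a perfect rank match is easier to reach than with many.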

2605.26438 2026-05-27 cs.CL cs.AI

LURE: Live-Usage Replay Evaluations for Reducing Evaluation Awareness

LURE: 减少评估感知的实时使用回放评估

Igor Ivanov, David Demitri Africa

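AI summary: Proposes LURE, which builds deployment-like evaluations by replaying realistic agentic interaction trajectories and appending an evaluation prompt, to reduce evaluation awareness in large language models, and introduces an automated pipeline for measuring evaluation realism.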

AI中文摘要

大型语言模型能够识别自己正在被评估（评估感知），并因此表现出不同的行为，这破坏了安全和对齐基准的有效性。我们提出LURE（实时使用回放评估），一种通过回放真实的代理交互轨迹并在末尾附加评估提示来构建类似部署的评估的方法。我们还引入了一个自动化流程来衡量评估的真实性，结合了对口头化评估感知的检测和法官模型对日志是否为评估的概率估计，并在一个包含部署和评估记录的大型数据集上进行了验证。我们发现，与广泛使用的基准和合成评估生成器相比，基于LURE的评估与部署的区分度显著降低，并且可以接近与用户真实对话的真实性。我们在策划、AI安全破坏和谄媚场景中实例化了LURE。我们的结果表明，评估真实性是对齐基准的一个关键属性，应在基准结果旁边报告，特别是当这些结果用于安全案例时。

英文摘要

Large language models can recognize when they are being evaluated (evaluation awareness) and behave differently as a result, which undermines the validity of safety and alignment benchmarks. We propose LURE (Live-Usage Replay Evaluations), a method for constructing deployment-like evaluations by replaying realistic agentic interaction trajectories and appending an evaluation prompt at the end. We also introduce an automated pipeline for measuring evaluation realism, combining detection of verbalized evaluation awareness with judge-model estimates of the probability that a log is an evaluation, and validate it on a large dataset of deployment and evaluation transcripts. We find that LURE-based evaluations are substantially less distinguishable from deployment than widely used benchmarks and synthetic evaluation generators, and can approach the realism of real conversations with users. We instantiate LURE in scheming, AI safety sabotage, and sycophancy settings. Our results suggest that evaluation realism is a crucial property of alignment benchmarks and should be reported alongside benchmark results, especially when such results are used in safety cases.
