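arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.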

2605.11223 2026-05-19 cs.AI

Do Vision-Language-Models show human-like logical problem-solving capability in point and click puzzle games?

Maximilian Triebel, Marco Menner, Dominik Helfenstein

AI summary: This paper introduces the VLATIM benchmark for evaluating human-like logical problem solving in the classic physics puzzle game The Incredible Machine 2. It finds that large models plan well but still struggle with precise visual grounding, and so have not yet reached human-level performance.

Abstract

Vision-Language(-Action) Models (VLMs) are increasingly applied to interactive environments, yet existing benchmarks often overlook the complex physical reasoning required for point-and-click puzzle games. This paper introduces Vision-Language Against The Incredible Machine (VLATIM), a benchmark designed to evaluate human-like logical problem-solving capabilities within the classic physics puzzle game The Incredible Machine 2 (TIM). Unlike existing benchmarks, VLATIM specifically targets the critical gap between high-level logical reasoning and continuous action spaces requiring precise mouse interactions. This benchmark is structured into five progressive parts, assessing capabilities that range from basic visual grounding and domain understanding to multi-step manipulation and full puzzle solving. Our results reveal a significant disparity between reasoning and execution. While large proprietary models demonstrate superior planning abilities, they struggle with precise visual grounding. Consequently, they do not yet show human-like problem-solving capabilities.

2605.10239 2026-05-19 cs.CV

AdaptSplat: Adapting Vision Foundation Models for Feed-Forward 3D Gaussian Splatting

Mingwei Xing, Xinliang Wang, Yifeng Shi

AI summary: This paper proposes AdaptSplat, which adds a lightweight adapter of only 1.5M parameters to a generic architecture and thereby improves cross-domain generalization and high-frequency geometric fidelity in feed-forward 3D Gaussian Splatting.

Abstract

This work explores a simple yet powerful lightweight adapter design for feed-forward 3D Gaussian Splatting (3DGS). Existing methods typically apply complex, architecture-specific designs on top of the generic pipeline of image feature extraction $\rightarrow$ multi-view interaction $\rightarrow$ feature decoding. However, constrained by the scale bottleneck of 3D training data and the low-pass filtering effect of deep networks, these methods still fall short in cross-domain generalization and high-frequency geometric fidelity. To address these problems, we propose AdaptSplat, which demonstrates that without complex component engineering, introducing a single adapter of only 1.5M parameters into the generic architecture is sufficient to achieve superior performance. Specifically, we design a lightweight Frequency-Preserving Adapter (FPA) that extracts direction-aware high-frequency structural priors from the shallow features of a powerful vision foundation model backbone, and seamlessly integrates them into the generic pipeline via high-frequency positional encodings and adaptive residual modulation. This effectively compensates for the high-frequency attenuation caused by over-smoothing in deep features, improving the fitting accuracy of Gaussian primitives on complex surfaces and sharp boundaries. Extensive experiments demonstrate that AdaptSplat achieves state-of-the-art feed-forward reconstruction performance on multiple standard benchmarks, with stable generalization across domains. Code available at: https://github.com/xmw666/AdaptSplat.

2605.10236 2026-05-19 cs.LG cs.AI

When Does Non-Uniform Replay Matter in Reinforcement Learning?

Michal Korniak, Mikołaj Czarnecki, Yarden As, Piotr Miłoś, Pieter Abbeel, Michal Nauman

AI summary: This paper studies when non-uniform replay is effective in reinforcement learning. It identifies replay volume, expected recency, and the entropy of the replay distribution as the deciding factors, and adopts a simple Truncated Geometric replay strategy that improves sample efficiency.

Abstract

Modern off-policy reinforcement learning algorithms often rely on simple uniform replay sampling and it remains unclear when and why non-uniform replay improves over this strong baseline. Across diverse RL settings, we show that the effectiveness of non-uniform replay is governed by three factors: replay volume, the number of replayed transitions per environment step; expected recency, how recent sampled transitions are; and the entropy of the replay sampling distribution. Our main contribution is clarifying when non-uniform replay is beneficial and providing practical guidance for replay design in modern off-policy RL. Namely, we find that non-uniform replay is most beneficial when replay volume is low, and that high-entropy sampling is important even at comparable expected recency. Motivated by these findings, we adopt a simple Truncated Geometric replay that biases sampling toward recent experience while preserving high entropy and incurring negligible computational overhead. Across large-scale parallel simulation, single-task, and multi-task settings, including three modern algorithms evaluated on five RL benchmark suites, this replay sampling strategy improves sample efficiency in low-volume regimes while remaining competitive when replay volume is high.
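
The three quantities named in the abstract can be made concrete with a small sketch. The code below is an illustration only, assuming that "Truncated Geometric" means a geometric decay over transition age renormalized over a finite buffer; the paper's exact parameterization may differ, and the function names and the `decay` parameter are hypothetical.

```python
import math
import random


def truncated_geometric_probs(buffer_size, decay):
    """Sampling probability per age (age 0 = newest transition).

    Assumption: weight decays as decay**age and is renormalized over
    the finite buffer, which is what makes the distribution truncated.
    """
    weights = [decay ** age for age in range(buffer_size)]
    total = sum(weights)
    return [w / total for w in weights]


def entropy(probs):
    """Shannon entropy of the sampling distribution, in nats."""
    return -sum(q * math.log(q) for q in probs if q > 0.0)


def expected_age(probs):
    """Mean age of a sampled transition; lower means more recent."""
    return sum(age * q for age, q in enumerate(probs))


def sample_ages(rng, probs, batch_size):
    """Draw a batch of transition ages from the replay distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=batch_size)


size = 1000
uniform = [1.0 / size] * size
geometric = truncated_geometric_probs(size, decay=0.999)

print("uniform   entropy:", entropy(uniform), "expected age:", expected_age(uniform))
print("geometric entropy:", entropy(geometric), "expected age:", expected_age(geometric))
print("sampled ages:", sample_ages(random.Random(0), geometric, batch_size=8))
```

With a decay close to 1, the distribution leans toward recent transitions while its entropy stays close to the uniform maximum of log(buffer size), which is the trade-off the abstract describes.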

2605.10185 2026-05-19 cs.CV cs.AI

DynGhost: Temporally-Modelled Transformer for Dynamic Ghost Imaging with Quantum Detectors

Vittorio Palladino, Ahmet Enis Cetin

AI summary: This paper proposes DynGhost, a transformer-based method for dynamic ghost imaging. Alternating spatial and temporal attention blocks address the limitations of earlier methods in dynamic scenes and low-light conditions, and a quantum-aware training framework improves performance on realistic hardware.

Comments 6 pages, 8 figures

Abstract

Ghost imaging reconstructs spatial information from a single-pixel bucket detector by correlating structured illumination patterns with scalar intensity measurements. While deep learning approaches have achieved promising results on static scenes, two critical limitations remain unaddressed: existing architectures fail to exploit temporal coherence across frames, leaving dynamic ghost imaging largely unsolved, and they assume additive Gaussian noise models that do not reflect the true Poissonian statistics of real single-photon hardware. We present DynGhost (Dynamic Ghost Imaging Transformer), a transformer architecture that addresses both limitations through alternating spatial and temporal attention blocks. Our quantum-aware training framework, based on physically accurate detector simulations (SNSPDs, SPADs, SiPMs) and Anscombe variance-stabilizing normalization, resolves the distribution shift that causes classical models to fail under realistic hardware constraints. Experiments across multiple benchmarks demonstrate that DynGhost outperforms both traditional reconstruction methods and existing deep learning architectures, with particular gains in dynamic and photon-starved settings.
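
The Anscombe transform mentioned in the abstract is a standard variance-stabilizing map for Poisson counts. The sketch below shows only the textbook transform, not the authors' normalization pipeline; the Poisson sampler (Knuth's method) and the chosen photon rates are illustrative assumptions.

```python
import math
import random
import statistics


def anscombe(count):
    """Anscombe transform: maps Poisson counts to roughly unit variance."""
    return 2.0 * math.sqrt(count + 3.0 / 8.0)


def inverse_anscombe(value):
    """Simple algebraic inverse (known to be biased at very low counts)."""
    return (value / 2.0) ** 2 - 3.0 / 8.0


def sample_poisson(rng, rate):
    """Knuth's multiplication method; adequate for small to moderate rates."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


rng = random.Random(0)
for rate in (5.0, 20.0, 100.0):  # hypothetical mean photon counts
    counts = [sample_poisson(rng, rate) for _ in range(5000)]
    raw_var = statistics.pvariance(counts)
    stabilized_var = statistics.pvariance([anscombe(c) for c in counts])
    print(f"rate={rate}: raw variance={raw_var:.2f}, stabilized variance={stabilized_var:.3f}")
```

The raw variance grows with the photon rate, while the transformed variance stays near 1, which is why a network trained under a Gaussian noise assumption can cope better with the transformed measurements.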

2605.10059 2026-05-19 cs.AI

Strategic Exploitation in LLM Agent Markets: A Simulation Framework for E-Commerce Trust

Shijun Lei, Quang Nguyen, Swapneel S Mehta, Zeping Li, Huichuan Fu, Xiaolong Zheng, Siki Chen, Yunji Liang, Philip Torr, Zhenfei Yin

AI summary: This paper introduces TruthMarketTwin, a simulation framework for studying LLM-agent behavior in e-commerce markets. It finds that LLM agents in traditional markets exploit weaknesses in reputation-based governance, while warrant enforcement reduces deception and reshapes strategic reasoning.

Abstract

Agent-based modeling (ABM) has long been used in economics to study human behavior, and large language model (LLM) agents now enable new forms of social and economic simulation. While prior work has discovered strategic deception by LLM agents in financial trading and auction markets, e-commerce remains underexplored despite its distinctive information asymmetry: sellers privately observe product quality, whereas buyers rely on advertised claims and reputation signals. We introduce TruthMarketTwin, a controlled simulation framework for studying LLM-agent behavior in e-commerce markets. The framework is one of the first to model bilateral trade under asymmetric information sharing, where agents make strategic listing, purchasing, rating, and recourse-related decisions to optimize seller profit and buyer utility. We find that LLM agents released into traditional markets autonomously exploit weaknesses in reputation-based governance, while warrant enforcement reduces deception and reshapes strategic reasoning. Our results position LLM-agent simulation as a tool for studying institution-governed autonomous markets.

2605.09855 2026-05-19 cs.LG

Concordia: Self-Improving Synthetic Tables for Federated LLMs

Jimin Huang, Duanyu Feng, Nuo Chen, Xiaoyu Wang, Zhiqiang Zhang, Xueqing Peng, Mingquan Lin, Prayag Tiwari, Guojun Xiong, Alejandro Lopez-Lira, Sophia Ananiadou

AI summary: This paper studies how self-improving synthetic tables can improve LLM adaptation in federated learning when raw data cannot be shared. It proposes Concordia, a tri-level optimization framework that uses parameter-efficient LoRA training and lightweight utility scorers to improve federated validation utility and cross-client stability.

Comments 12 pages

Abstract

Federated learning (FL) enables training large language models (LLMs) without sharing raw data, but adapting LLMs under strict data isolation and non-IID client distributions remains challenging in practice. Synthetic data offers a natural privacy-preserving surrogate for local training, yet existing federated pipelines typically treat synthetic generation as static or loosely coupled with downstream optimization, leading to rapidly diminishing utility under heterogeneous clients. We study federated adaptation of LLMs on tabular tasks where raw records and validation data cannot be shared, and local training must rely entirely on synthetic tables. We propose Concordia, a tri-level optimization framework that aligns synthetic data generation with federated validation utility despite these constraints. At the client level, models are adapted via parameter-efficient LoRA training on synthetic tables. Clients additionally learn lightweight utility scorers from private validation feedback to reweight synthetic samples during local training. At the outer level, each client refines its own synthetic table generator using group-relative policy optimization (GRPO), guided by an ensemble of heterogeneous scorers shared across clients, without aggregating generator parameters or exposing validation data. Experiments on privacy-sensitive tabular benchmarks from finance and healthcare demonstrate that Concordia consistently improves federated performance, cross-client stability, and robustness to distribution shift compared to static and decoupled synthetic-data baselines.
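
The group-relative part of GRPO can be sketched in a few lines: rewards for a group of candidates are normalized against the group's own mean and spread. The code below is a generic illustration, not Concordia's implementation. Averaging the scorer ensemble into a single reward is an assumption made here for simplicity, and `ensemble_reward` and the example scores are hypothetical.

```python
import statistics


def ensemble_reward(scores):
    """Assumption: combine the shared scorers by a plain average."""
    return statistics.fmean(scores)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its group's mean and standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(reward - mean) / (std + eps) for reward in rewards]


# Hypothetical scores: three scorers rating four candidate synthetic tables.
scores_per_candidate = [
    [0.2, 0.3, 0.1],
    [0.6, 0.5, 0.7],
    [0.9, 0.8, 0.7],
    [0.4, 0.4, 0.4],
]
rewards = [ensemble_reward(scores) for scores in scores_per_candidate]
print(group_relative_advantages(rewards))
```

Candidates scoring above the group average receive positive advantages and are reinforced; no separate value network is needed for the baseline.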

2605.09040 2026-05-19 cs.AI cs.IR cs.LG

UxSID: Semantic-Aware User Interests Modeling for Ultra-Long Sequence

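Hongwei Zhang, Qiqiang Zhong, Jiangxia Cao, Yiyang Lv, Huanjie Wang, Liwei Guan, Jing Yao, Yiyu Wang, Junfeng Shu, Zhaojie Liu, Han Li

AI summary: This paper proposes the UxSID framework, which uses interest memory shared within semantic groups and a two-level attention strategy to model ultra-long user sequences efficiently and with semantic awareness, achieving the best performance and increasing advertising revenue.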

Comments Work in progress

2605.08738 2026-05-19 cs.LG cs.AI cs.CL

SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training

Shengkun Tang, Zekun Wang, Bo Zheng, Liangyu Wang, Rui Men, Siqi Zhang, Xiulong Yuan, Zihan Qiu, Zhiqiang Shen, Dayiheng Liu

AI summary: This paper studies how to apply pruning and knowledge distillation in large-scale pretraining, examining the advantage of pruning as an initialization, the effect of expert compression on the final model, and the effectiveness of different training strategies. It compresses Qwen3-Next-80A3B to a 23A2B model that remains competitive.

Abstract

Structured pruning and knowledge distillation (KD) are typical techniques for compressing large language models, but it remains unclear how they should be applied at pretraining scale, especially to recent mixture-of-experts (MoE) models. In this work, we systematically study MoE compression in large-scale pretraining, focusing on three key questions: whether pruning provides a better initialization than training from scratch, how expert compression choices affect the final model after continued training, and which training strategy is most effective. We have the following findings: First, across depth, width, and expert compression, pruning a pretrained MoE consistently outperforms training the target architecture from scratch under the same training budget. Second, different one-shot expert compression methods converge to similar final performance after large-scale continual pretraining. Motivated by this, we introduce a simple partial-preservation expert merging strategy that improves downstream performance across most benchmarks. Third, combining KD with the language modeling loss outperforms KD alone, particularly on knowledge-intensive tasks. We further propose multi-token prediction (MTP) distillation, which yields consistent gains. Finally, given the same training tokens, progressive pruning schedules outperform one-shot compression, suggesting that gradual architecture transitions lead to better optimization trajectories. Putting it all together, we compress Qwen3-Next-80A3B to a 23A2B model that retains competitive performance. These results offer practical guidance for efficient MoE compression at scale.
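
The third finding, combining KD with the language modeling loss, can be illustrated for a single token position. This is a minimal sketch under stated assumptions: the KD term is taken to be KL(teacher || student) over the vocabulary, and the mixing weight `kd_weight` is hypothetical. The abstract does not specify either choice.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary slice."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_norm for v in logits]


def kd_loss(student_logits, teacher_logits):
    """KL(teacher || student) for one token position."""
    student = log_softmax(student_logits)
    teacher = log_softmax(teacher_logits)
    return sum(math.exp(t) * (t - s) for t, s in zip(teacher, student))


def lm_loss(student_logits, target_index):
    """Cross-entropy of the student against the ground-truth next token."""
    return -log_softmax(student_logits)[target_index]


def combined_loss(student_logits, teacher_logits, target_index, kd_weight=0.5):
    """Assumption: a convex combination of the KD and LM terms."""
    kd = kd_loss(student_logits, teacher_logits)
    lm = lm_loss(student_logits, target_index)
    return kd_weight * kd + (1.0 - kd_weight) * lm


student = [1.0, 0.5, -0.5]
teacher = [2.0, 0.0, -1.0]
print(combined_loss(student, teacher, target_index=0))
```

Setting `kd_weight` to 1.0 recovers KD alone, the baseline that the abstract reports as weaker on knowledge-intensive tasks.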

2605.08439 2026-05-19 cs.CL

Can Language Models Identify Side Effects of Breast Cancer Radiation Treatments?

Natalie Seah, Danielle S. Bitterman, Daphna Spiegel, Thomas Hartvigsen

AI summary: This study examines how well language models identify side effects of breast cancer radiation treatment. By evaluating several language models under different prompts, it reveals limitations in precision, recall, and the identification of rare and long-term side effects, and suggests directions for improvement.

Abstract

Accurately communicating the side effects of cancer treatments to cancer survivors is critical, particularly in settings such as informed consent, where clinicians must clearly and comprehensively convey potential treatment toxicities. However, this task remains challenging due to clinical knowledge deficits about adverse treatment effects and fragmentation across electronic health record (EHR) systems. Large language models (LLMs) have the potential to assist in this task, though their reliability in oncology survivorship contexts remains poorly understood. We present a deployment-oriented stress-testing framework for evaluating LLM-generated radiation side effect lists in breast cancer treatment and survivorship care. Using 21 breast cancer patient profiles, we construct paired patient clinical scenarios that differ only in radiotherapy regimens to evaluate seven instruction-tuned LLMs under multiple prompting regimes. We then compare LLM outputs to a clinician-curated reference derived from informed consent documents at two major academic medical centers and developed by a team including more than seven breast radiation oncologists. The reference maps radiation dose-fractionation, fields, and locations to associated toxicities, broken down by frequency and temporal onset. Across models, we reveal sensitivity to minor documentation changes, trade-offs between precision and recall, and systematic under-recall of rare and long-term side effects. When used alone, constraints on the number of side effects generated reduce precision, and grounding outputs in clinician-curated side effect lists substantially improves reliability and robustness. These findings highlight important limitations of LLM use in oncology and suggest practical design choices for safer and more informative survivorship-focused applications.
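
Comparing a generated list against a curated reference comes down to set-based precision and recall, optionally broken down by frequency bucket. The sketch below shows that computation only. The matching rule (case-insensitive exact match), the bucket names, and the placeholder labels are assumptions for illustration and carry no clinical meaning; the paper's matching procedure is not described in the abstract.

```python
def normalize(items):
    """Assumption: items match when equal after trimming and lowercasing."""
    return {item.strip().lower() for item in items}


def precision_recall(predicted, reference):
    """Set-based precision and recall of a predicted list."""
    predicted_set, reference_set = normalize(predicted), normalize(reference)
    hits = len(predicted_set & reference_set)
    precision = hits / len(predicted_set) if predicted_set else 0.0
    recall = hits / len(reference_set) if reference_set else 0.0
    return precision, recall


def recall_by_bucket(predicted, reference_buckets):
    """Recall computed separately for each reference bucket."""
    predicted_set = normalize(predicted)
    result = {}
    for bucket, items in reference_buckets.items():
        bucket_set = normalize(items)
        if bucket_set:
            result[bucket] = len(predicted_set & bucket_set) / len(bucket_set)
    return result


# Placeholder labels, not real clinical content.
reference = {"common": ["effect-a", "effect-b"], "rare": ["effect-c", "effect-d"]}
model_output = ["Effect-A", "effect-b", "effect-x"]

all_reference = [item for items in reference.values() for item in items]
print(precision_recall(model_output, all_reference))
print(recall_by_bucket(model_output, reference))
```

A per-bucket breakdown of this kind is what exposes a pattern such as under-recall of rare items even when overall recall looks acceptable.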

2605.08163 2026-05-19 cs.CV cs.AI cs.CL

MULTITEXTEDIT: Benchmarking Cross-Lingual Degradation in Text-in-Image Editing

Liwei Cheng, Shibo Feng, Lunjie Zhou, Yixuan Guan, Dayan Guan

AI summary: This paper introduces the MULTITEXTEDIT benchmark, with 3,600 instances across 12 languages, 5 visual domains, and 7 editing operations, to evaluate cross-lingual degradation in text-in-image editing. It introduces a language fidelity metric and finds pronounced degradation in text accuracy and script fidelity.

Comments 11 pages, 5 figures

Abstract

Text-in-image editing has become a key capability for visual content creation, yet existing benchmarks remain overwhelmingly English-centric and often conflate visual plausibility with semantic correctness. We introduce MULTITEXTEDIT, a controlled benchmark of 3,600 instances spanning 12 typologically diverse languages, 5 visual domains, and 7 editing operations. Language variants of each instance share a common visual base and are paired with a human-edited reference and region masks, isolating the language variable for cross-lingual comparison. To capture script-level errors that coarse text-matching metrics miss, such as missing diacritics, reversed RTL order, and mixed-script renderings, we introduce a language fidelity (LSF) metric scored by a two-stage LVM protocol that first traces the edited target text and then judges it in isolation, reaching a quadratic-weighted $\kappa$ of 0.76 against native-speaker annotators. Evaluating 12 open-source and proprietary systems with LSF alongside standard semantic and mask-aware pixel metrics, we find pronounced cross-lingual degradation for every model, largest on Hebrew and Arabic and smallest on Dutch and Spanish, and concentrated in text accuracy and script fidelity rather than in coarse structural dimensions. We also uncover a pervasive semantic and pixel mismatch, where outputs preserve global layout and background fidelity yet distort script-specific forms.
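
The agreement statistic quoted in the abstract, quadratic-weighted kappa, penalizes disagreements by the squared distance between ordinal ratings. The sketch below implements the standard definition for two raters; the 4-level rating scale and the example ratings are hypothetical and do not reproduce the paper's 0.76.

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_levels):
    """Cohen's kappa with quadratic weights for integer ratings in [0, num_levels)."""
    n = len(rater_a)
    observed = [[0.0] * num_levels for _ in range(num_levels)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1.0 / n

    # Marginal rating distributions of each rater.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(num_levels)) for j in range(num_levels)]

    observed_penalty = expected_penalty = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            weight = (i - j) ** 2 / (num_levels - 1) ** 2
            observed_penalty += weight * observed[i][j]
            expected_penalty += weight * hist_a[i] * hist_b[j]

    if expected_penalty == 0.0:
        # Both raters used the same single level everywhere.
        return 1.0
    return 1.0 - observed_penalty / expected_penalty


# Hypothetical ratings on a 0-3 scale: automatic judge vs. human annotator.
judge = [3, 2, 2, 0, 1, 3, 0, 1]
human = [3, 2, 1, 0, 1, 2, 0, 2]
print(quadratic_weighted_kappa(judge, human, num_levels=4))
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and near-misses on the ordinal scale cost much less than large disagreements.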

2605.07790 2026-05-19 cs.LG cs.CV

Hessian Surgery: Class-Targeted Post-Hoc Rebalancing via Hessian Spike Perturbation

Hugo Vigna, Samuel Bontemps

AI summary: This paper proposes Hessian Surgery, which perturbs model weights along spike eigenvectors to rebalance per-class accuracy without retraining, and reports encouraging results on balanced accuracy and standard deviation on CIFAR-10 and ISIC-2019.

Comments The code is available here: https://github.com/hugovigna/hessian-surgery.git

Abstract

The Hessian spectrum of trained deep networks exhibits a characteristic structure: a continuous bulk of near-zero eigenvalues and a small number of large outlier eigenvalues (spikes), confirming the relevance of Random Matrix Theory in deep learning. The spike count matches the number of classes minus one. While prior work has described this structure, no method has exploited it operationally to improve classification performance. We propose Hessian Surgery, a post-hoc optimization method that directly perturbs model weights along spike eigenvectors to rebalance per-class accuracy without retraining. We introduce (i) a spike-class sensitivity matrix that quantifies the directional derivative of each class's accuracy along each spike eigenvector, (ii) a constrained optimization of perturbation coefficients that targets weak classes while preserving strong ones, and (iii) an adaptive amplitude control that raises or lowers the perturbation budget based on iteration-level improvement signals. We obtain encouraging results on CIFAR-10 and ISIC-2019 on both balanced accuracy and standard deviation.

2605.07544 2026-05-19 cs.AI

From Pixels to Prompts: Vision-Language Models

Khang Hoang Nhat Vo

AI summary: This book traces the development of vision-language models and aims to give readers a clear mental framework for the field's core concepts and applications, rather than cataloguing every dataset and model variant.

Abstract

When you read a paper about a new Vision-Language Model today, it can be easy to forget how strange this idea would have sounded not so long ago. Teaching machines to see was already hard. Teaching them to read and generate language was already hard. Asking them to do both at once - and then to reason, answer questions, follow instructions, and sometimes even surprise us - still carries a quiet trace of science fiction, even as it becomes routine. This book was born from a simple feeling: it is too easy to get lost. The field moves quickly, new model names appear constantly, and the gap between "I know the buzzwords" and "I actually understand how this works" can feel uncomfortably wide. I have felt that gap many times. If you are holding this book, you probably have too. My goal is not to provide an exhaustive catalog of every dataset, benchmark, and new model variant. Instead, I want to offer something more modest - and, I hope, more durable: a clear mental map of Vision-Language Models. Enough structure that you can read new papers with confidence; enough intuition that you can design your own systems without feeling as if you are assembling LEGO bricks blindly.

2605.07308 2026-05-19 cs.RO

AT-VLA: Adaptive Tactile Injection for Enhanced Feedback Reaction in Vision-Language-Action Models

Xiaoqi Li, Muhe Cai, Jiadong Xu, Juan Zhu, Hongwei Fan, Yan Shen, Guangrui Ren, Hao Dong

AI summary: This paper proposes AT-VLA, an adaptive tactile injection mechanism that dynamically decides when and where to inject tactile signals, reducing interference with pretrained representations. It also introduces a Tactile Reaction Dual-Stream mechanism for fast, accurate tactile responses, improving vision-language-action models on contact-rich manipulation tasks.

Abstract

Vision-Language-Action (VLA) models have significantly advanced the capabilities of robotic agents in executing diverse tasks; however, they still face challenges in contact-rich manipulation scenarios that require precise physical interactions. To address this limitation, recent studies have attempted to incorporate tactile signals during downstream tasks, enabling pretrained VLAs to interpret tactile feedback. Nevertheless, introducing new modalities during finetuning, which are rarely present in the pretraining stage, may disrupt the pretrained capabilities of VLAs. In addition, the inherently slow inference speed of VLAs hampers real-time responsiveness and limits the effective utilization of tactile feedback for action adjustment. To overcome these challenges, we propose Adaptive Tactile Vision-Language-Action (AT-VLA), which introduces a novel Adaptive Tactile Injection mechanism. This mechanism dynamically determines the appropriate timing and locations for tactile injection, injecting tactile input only when it significantly contributes to action generation, thereby minimizing interference with pretrained representations. Furthermore, to enable rapid and accurate tactile responses, we propose a Tactile Reaction Dual-Stream mechanism, which decouples sensory processing into a slow visual-language stream for low-frequency perceptual reasoning and a fast tactile control stream for high-frequency physical interaction understanding, achieving real-time closed-loop responses within 0.04 s. Real-world experiments thoroughly validate the effectiveness of AT-VLA in contact-rich manipulation tasks. The project page is available at: https://sites.google.com/view/at-vla.

2605.07111 2026-05-19 cs.CL cs.AI

Beyond LoRA vs. Full Fine-Tuning: Gradient-Guided Optimizer Routing for LLM Adaptation

Haozhan Tang, Xiuqi Zhu, Xinyin Zhang, Boxun Li, Virginia Smith, Kevin Kuo

AI summary: This paper proposes the Mixture of LoRA and Full (MoLF) Fine-Tuning framework, which dynamically routes updates at the optimizer level to move continuously between the two training regimes, improving LLM adaptation performance.

Abstract

Recent literature on fine-tuning Large Language Models highlights a fundamental debate. While Full Fine-Tuning (FFT) provides the representational plasticity required for high-entropy knowledge injection, Low-Rank Adaptation (LoRA) can match or surpass FFT performance because many tasks only require updates in a low-rank space and benefit from LoRA's additional regularization. Through empirical evaluation across diverse tasks (SQL, Medical QA, and Counterfactual Knowledge) and varying language models (Gemma-3-1B, Qwen2.5-1.5B, and Qwen2.5-3B), we verify both trends and demonstrate that relying solely on either static architecture is structurally limited. To address this challenge, we propose a Mixture of LoRA and Full (MoLF) Fine-Tuning, a unified framework that enables continuous navigation between both training regimes. MoLF dynamically routes updates between FFT and LoRA at the optimizer level to ensure that exact gradient signals are available to both experts throughout training, yielding stable training dynamics. For memory-constrained environments, we also introduce MoLF-Efficient, which freezes base weights and only routes updates among a pair of LoRA experts of potentially varying rank. Our evaluations show that MoLF either improves on or stays within $1.5\%$ of the better of FFT and LoRA across all settings, while MoLF-Efficient outperforms prior adaptive LoRA approaches by up to $20\%$ on Fact and $9\%$ on Med and SQL. Our code is open-sourced at https://github.com/11785T23/molf.git.

2605.06506 2026-05-19 cs.CL

The Frequency Confound in Language-Model Surprisal and Metaphor Novelty

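Omar Momen, Sina Zarrieß

AI summary: This study examines the relationship between language-model surprisal and metaphor novelty. It finds that word frequency predicts metaphor novelty better than surprisal does, and that the association between surprisal and frequency peaks early in training and then declines. This suggests that the best-performing language-model surprisal settings may wrongly tie contextual predictability to metaphor novelty and processing difficulty, when word frequency may be the main factor.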

Comments to be presented and published at the 15th Joint Conference on Lexical and Computational Semantics (*SEM 2026)

2605.03409 2026-05-19 cs.AI

Robust Agent Compensation (RAC): Teaching AI Agents to Compensate

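Srinath Perera, Kaviru Hapuarachchi, Frank Leymann, Rania Khalaf

AI summary: This study proposes RAC, a log-based recovery paradigm that provides a safety net through an architectural extension and can be applied to most agent frameworks to support reliable execution. RAC can be enabled without modifying existing agent code, using the extension points already present in most agent frameworks. Evaluation on τ-bench and REALM-Bench shows that, on complex problems, RAC outperforms existing state-of-the-art LLM-based recovery methods in latency and token economy.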

Comments Accepted at ACM Conference on AI and Agentic Systems (ACM CAIS 2026)

2605.02198 2026-05-19 cs.CV

SlimDiffSR: Toward Lightweight and Efficient Remote Sensing Image Super-Resolution via Diffusion Model Distillation

Ce Wang, Zhenyu Hu, Wanjie Sun

AI summary: This paper proposes SlimDiffSR, a lightweight and efficient diffusion-based framework for remote sensing image super-resolution. It introduces an uncertainty-guided timestep assignment strategy and a structured pruning strategy to improve model efficiency and reconstruction quality.

Abstract

Diffusion models have recently achieved remarkable performance in image super-resolution (SR), but their high computational cost limits practical deployment in remote sensing applications. To address this issue, we propose SlimDiffSR, a lightweight and efficient diffusion-based framework for real-world remote sensing image super-resolution. Unlike existing single-step diffusion methods that rely on fixed timesteps, we first introduce an uncertainty-guided timestep assignment strategy to construct a stronger single-step teacher model, where reconstruction difficulty is explicitly linked to diffusion timesteps, enabling adaptive generative strength. Building upon this teacher, we further present a structured pruning strategy tailored to remote sensing imagery, which systematically removes redundant semantic modules and replaces standard operations with lightweight designs, including frequency-separable convolution, direction-separable convolution, and a query-driven global aggregation module. These components explicitly exploit the unique characteristics of remote sensing data, such as sparse high-frequency details, strong directional patterns, and long-range spatial dependencies. To enhance knowledge transfer, we incorporate Maximum Mean Discrepancy (MMD) into the distillation process to align feature distributions between the teacher and student models. Extensive experiments on multiple remote sensing benchmarks demonstrate that SlimDiffSR achieves a favorable balance between efficiency and reconstruction quality. In particular, it attains up to $200\times$ inference acceleration and a $20\times$ reduction in model parameters compared with multi-step diffusion models, while achieving competitive perceptual quality and clearly outperforming existing lightweight diffusion baselines in efficiency. The code is available at: https://github.com/wwangcece/SlimDiffSR.
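
Maximum Mean Discrepancy, used in the abstract to align teacher and student feature distributions, compares two sample sets through a kernel. The sketch below computes the standard biased estimate of squared MMD with a Gaussian (RBF) kernel; the kernel choice, the bandwidth, and the toy 2-D "features" are assumptions, since the abstract does not specify them.

```python
import math


def rbf_kernel(x, y, bandwidth):
    """Gaussian kernel between two feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * bandwidth ** 2))


def mean_kernel(left, right, bandwidth):
    """Average kernel value over all pairs drawn from the two sets."""
    total = sum(rbf_kernel(u, v, bandwidth) for u in left for v in right)
    return total / (len(left) * len(right))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between two sample sets."""
    return (
        mean_kernel(xs, xs, bandwidth)
        + mean_kernel(ys, ys, bandwidth)
        - 2.0 * mean_kernel(xs, ys, bandwidth)
    )


# Toy 2-D features standing in for teacher and student activations.
teacher_features = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
close_student = [(0.1, 0.0), (0.9, 0.1), (0.0, 1.1)]
far_student = [(3.0, 3.0), (4.0, 3.0), (3.0, 4.0)]

print("close:", mmd_squared(teacher_features, close_student))
print("far:  ", mmd_squared(teacher_features, far_student))
```

In a distillation setting the squared MMD would be added to the training loss so that minimizing it pulls the student's feature distribution toward the teacher's.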

2605.00264 2026-05-19 cs.LG cs.GT

Pessimism-Free Offline Learning in General-Sum Games via KL Regularization

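Claire Chen, Yuheng Zhang

AI summary: This paper proposes a KL-regularization-based offline learning method that recovers equilibria in general-sum games without pessimism, with accelerated statistical rates and computationally efficient algorithms.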

2604.25525 2026-05-19 cs.CL cs.HC

From Chatbots to Confidants: A Cross-Cultural Study of LLM Adoption for Emotional Support

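Natalia Amat-Lefort, Mert Yazan, Amanda Cercas Curry, Flor Miriam Plaza-del-Arco

AI summary: This study examines how users in different countries adopt LLMs for emotional support and what drives that adoption. A large-scale cross-cultural survey finds socioeconomic status to be the strongest predictor of positive perceptions, and a corpus of real multilingual prompts shows the areas in which users mainly seek help.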

Comments 28 pages (9 pages main text, 19 pages references and appendices), 14 figures. The first two authors contributed equally

Abstract

Large Language Models (LLMs) are increasingly used not only for instrumental tasks, but as always-available and non-judgmental confidants for emotional support. Yet what drives adoption and how users perceive emotional support interactions across countries remains unknown. To address this gap, we present the first large-scale cross-cultural study of LLM use for emotional support, surveying 4,641 participants across seven countries (USA, UK, Germany, France, Spain, Italy, and The Netherlands). Our results show that adoption rates vary dramatically across countries (from 20% to 59%). Using mixed models that separate cultural effects from demographic composition, we find that: Being aged 25-44, religious, married, and of higher socioeconomic status are predictors of positive perceptions (trust, usage, perceived benefits), with socioeconomic status being the strongest. English-speaking countries consistently show more positive perceptions than Continental European countries. We further collect a corpus of 731 real multilingual prompts from user interactions, showing that users mainly seek help for loneliness, stress, relationship conflicts, and mental health struggles. Our findings reveal that LLM emotional support use is shaped by a complex sociotechnical landscape and call for a broader research agenda examining how these systems can be developed, deployed, and governed to ensure safe and informed access.

2604.24763 2026-05-19 cs.CV

Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

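Zhiheng Liu, Weiming Ren, Xiaoke Huang, Shoufa Chen, Tianhong Li, Mengzhao Chen, Yatai Ji, Sen He, Jonas Schult, Belinda Zeng, Tao Xiang, Wenhu Chen, Ping Luo, Luke Zettlemoyer, Yuren Cong

AI summary: This paper proposes Tuna-2, a unified multimodal model that performs multimodal understanding and generation directly on pixel embeddings. It shows that unified pixel-space modelling can compete with latent-space approaches for high-quality image generation and that pretrained vision encoders are not necessary for multimodal modelling.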

Comments Project page: https://tuna-ai.org/tuna-2

Abstract

Unified multimodal models typically rely on pretrained vision encoders and use separate visual representations for understanding and generation, creating misalignment between the two tasks and preventing fully end-to-end optimization from raw pixels. We introduce Tuna-2, a native unified multimodal model that performs visual understanding and generation directly based on pixel embeddings. Tuna-2 drastically simplifies the model architecture by employing simple patch embedding layers to encode visual input, completely discarding the modular vision encoder designs such as the VAE or the representation encoder. Experiments show that Tuna-2 achieves state-of-the-art performance in multimodal benchmarks, demonstrating that unified pixel-space modelling can fully compete with latent-space approaches for high-quality image generation. Moreover, while the encoder-based variant converges faster in early pretraining, Tuna-2's encoder-free design achieves stronger multimodal understanding at scale, particularly on tasks requiring fine-grained visual perception. These results show that pretrained vision encoders are not necessary for multimodal modelling, and end-to-end pixel-space learning offers a scalable path toward stronger visual representations for both generation and perception.

2604.23355 2026-05-19 cs.AI

LEGO: An LLM Skill-Based Front-End Design Generation Platform

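Jincheng Lou, Ruohan Xu, Jiecheng Ma, Runzhe Tao, Xinyu Qu, Yibo Lin

AI summary: This paper presents LEGO, a platform that decomposes the digital front-end flow into six independent steps and represents each agent capability as a standardized, composable circuit skill. On a hard subset of 41 VerilogEval v2 problems, individual circuit skills raise Pass@1 from 0.000 to 0.805, supporting effective and flexible RTL design automation.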

Comments Accepted to ISEDA 2026. Best Paper Nomination. 7 pages, 3 figures

Abstract

Existing LLM-based EDA agents are often isolated task-specific systems. This leads to repeated engineering effort and limited reuse of successful design and debugging strategies. We present LEGO, a unified skill-based platform for front-end design generation. It decomposes the digital front-end flow into six independent steps and represents every agent capability as a standardized composable circuit skill within a plug-and-play architecture. To build this skill library, we survey more than 100 papers, select 11 representative open-source projects, and extract 42 executable circuit skills within a six-step finite state machine formulation. Circuit Skill Builder automates skill extraction with linear scalability. Agent Skill RAG achieves submillisecond retrieval without relying on embedding models. Empirical evaluation on a hard subset of 41 VerilogEval v2 problems that gpt-5.2-codex fails to solve under extra-high reasoning effort shows that individual circuit skills constructed within LEGO raise Pass@1 from 0.000 to 0.805. This is an 80.5% gain over the baseline. Cross-project skill compositions also reach 0.805 Pass@1. They outperform hierarchy-verilog by 14.6% and VerilogCoder by 2.5%. They also match MAGE. These results show that modular skill composition supports both effective and flexible RTL design automation. The LEGO platform and all circuit skills are publicly available at GitHub: https://github.com/loujc/LEGO-An-LLM-Skill-Based-Front-End-Design-Generation-Platform

2604.23267 2026-05-19 cs.CL cs.LG

Fine-tuning vs. In-context Learning in Large Language Models: A Formal Language Learning Perspective

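Bishwamittra Ghosh, Soumi Das, Till Speicher, Qinyuan Wu, Mohammad Aflah Khan, Deepak Garg, Krishna P. Gummadi, Evimaria Terzi

AI summary: This paper compares fine-tuning and in-context learning in large language models from a formal language learning perspective. Using a task with precise language boundaries, controlled string sampling, and no data contamination, it finds that fine-tuning outperforms in-context learning on in-distribution generalization while the two perform equally well out of distribution, and that their inductive biases are similar at partial proficiency but diverge at higher proficiency levels.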

Comments Accepted at ACL 2026 (Main)

Abstract

Large language models (LLMs) operate in two fundamental learning modes - fine-tuning (FT) and in-context learning (ICL) - raising key questions about which mode yields greater language proficiency and whether they differ in their inductive biases. Prior studies comparing FT and ICL have yielded mixed and inconclusive results due to inconsistent experimental setups. To enable a rigorous comparison, we propose a formal language learning task - offering precise language boundaries, controlled string sampling, and no data contamination - and introduce a discriminative test for language proficiency, where an LLM succeeds if it assigns higher generation probability to in-language strings than to out-of-language strings. Empirically, we find that: (a) FT has greater language proficiency than ICL on in-distribution generalization, but both perform equally well on out-of-distribution generalization. (b) Their inductive biases, measured by the correlation in string generation probabilities, are similar when both modes partially learn the language but diverge at higher proficiency levels. (c) Unlike FT, ICL performance differs substantially across models of varying sizes and families and is sensitive to the token vocabulary of the language. Thus, our work demonstrates the promise of formal languages as a controlled testbed for evaluating LLMs, behaviors that are difficult to isolate in natural language datasets. Our source code is available at https://github.com/bishwamittra/formallm.
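The discriminative test described above can be sketched as a pairwise comparison of scores. This is an illustration under stated assumptions, not the paper's evaluation code: the abstract does not say how strings are paired or aggregated, so the all-pairs accuracy below is an assumption, and `toy_log_prob` is a hypothetical stand-in for a real model's sequence log-probability.

```python
from typing import Callable, Sequence


def discriminative_accuracy(
    log_prob: Callable[[str], float],
    in_language: Sequence[str],
    out_of_language: Sequence[str],
) -> float:
    """Fraction of (in, out) pairs where the in-language string scores strictly higher.

    Both sequences must be non-empty.
    """
    pairs = [(s, t) for s in in_language for t in out_of_language]
    wins = sum(1 for s, t in pairs if log_prob(s) > log_prob(t))
    return wins / len(pairs)


def toy_log_prob(s: str) -> float:
    """Hypothetical scorer for the toy language (ab)^n.

    Charges a per-character cost and a penalty for every repeated adjacent symbol.
    """
    repeats = sum(1 for a, b in zip(s, s[1:]) if a == b)
    return -0.5 * len(s) - 3.0 * repeats
```

With a real model, `log_prob` would sum the token log-probabilities of the string under either the fine-tuned weights or the in-context prompt, which is what lets the same test be applied to both learning modes.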

2604.23135 2026-05-19 cs.LG

Characterizing Paraphrase-Induced Failures in Lean 4 Autoformalization

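William Feng, Ethan Lou, Aryan Sharma

AI summary: This study characterizes paraphrase-induced failure modes in Lean 4 autoformalization. Applying deterministic paraphrase rules to datasets of undergraduate and Olympiad-level math problems, it finds that paraphrase sensitivity is dominated by failures at the code-generation layer and that the failure types differ by dataset. The results provide a failure-mode taxonomy and motivate targeted training-time interventions.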

Abstract

Lean 4 autoformalization has become increasingly popular in recent years, with frontier language models and open-weight autoformalizers now producing valid formalizations of mathematical theorems. However, these evaluations often rely on single canonical phrasings of theorems and rarely probe whether outputs are robust to natural variation in inputs, while prior work has shown that semantically equivalent paraphrases often induce divergent formal outputs. We study the structure of these divergences in Lean 4 by applying deterministic paraphrase rules to datasets of undergraduate and Olympiad-level math problems. Across four frontier models and three open-weight autoformalizers, we find that paraphrase sensitivity is dominated by failures at the code-generation layer, and that these failures are typed differently by dataset. Furthermore, these patterns generalize to open-weight models, showing that state-of-the-art autoformalizers still struggle to generate valid Lean code. Our results provide a failure-mode taxonomy for autoformalization and motivate training-time interventions targeted at specific compilation failures.

2604.22626 2026-05-19 cs.CL

From graphemic dependence to lexical structure: a Markovian perspective on Dante's Commedia

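Angelo Maria Sabatini

AI summary: This paper studies the structural organisation of Dante's Divina Commedia through a symbolic representation based on vowel-consonant encoding. It finds that a graphemic memory index increases slightly but consistently from the Inferno to the Paradiso, indicating a directional shift in local dependency structure. Trigram analysis identifies recurrent configurations tied to lexical environments, linking local symbolic dependencies to lexical structure.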

Comments 26 pages, 8 figures, 1 supplementary material; submitted to Journal of Computational Literary Studies

Abstract

This study investigates the structural organisation of Dante's Divina Commedia through a symbolic representation based on vowel-consonant (V/C) encoding. Modelling the resulting sequence as a four-state Markov chain yields a parsimonious index of graphemic memory, capturing local persistence and alternation patterns. Across the poem, this index shows a slight but consistent increase from the Inferno to the Paradiso, indicating a directional shift in local dependency structure. Trigram analysis identifies a restricted set of recurrent configurations acting as graphemic probes, linking Markov patterns to lexical environments and orthographic phenomena such as apostrophised forms. A complementary classification analysis identifies cantica-specific lexical anchors, showing that local symbolic dependencies reflect both the separation among the three cantiche and a continuous progression across the poem. The results provide an interpretable framework connecting local symbolic structure with higher-level textual organisation.
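The V/C encoding and the Markov-chain step can be illustrated with a short sketch. The abstract does not define the four states, so this example assumes they are the ordered pairs of consecutive symbols (VV, VC, CV, CC). The vowel set, the choice to encode across word boundaries, and the function names are likewise assumptions, and the paper's graphemic memory index is not reproduced here.

```python
from collections import Counter

# Assumed vowel inventory, including accented Italian vowels.
VOWELS = set("aeiouàèéìíòóùú")


def vc_encode(text: str) -> str:
    """Map each letter to 'V' or 'C'; spaces, digits and punctuation are dropped."""
    return "".join(
        "V" if ch in VOWELS else "C"
        for ch in text.lower()
        if ch.isalpha()
    )


def pair_transition_counts(symbols: str) -> Counter:
    """Count transitions between overlapping pair states, e.g. 'CV' -> 'VC'."""
    states = [symbols[i:i + 2] for i in range(len(symbols) - 1)]
    return Counter(zip(states, states[1:]))


def transition_probabilities(counts: Counter) -> dict:
    """Row-normalise the counts into P(next state | current state)."""
    row_totals = Counter()
    for (src, _dst), n in counts.items():
        row_totals[src] += n
    return {(src, dst): n / row_totals[src] for (src, dst), n in counts.items()}
```

Estimating one such transition table per cantica would give the kind of per-section quantity that can be compared across the poem.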

2604.22282 2026-05-19 cs.CL

STEM: Structure-Tracing Evidence Mining for Knowledge Graphs-Driven Retrieval-Augmented Generation

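Peng Yu, En Xu, Bin Chen, Haibiao Chen, Yinfei Xu

AI summary: This paper proposes STEM, a framework that reframes multi-hop reasoning as a schema-guided graph search task. It addresses the structural heterogeneity of knowledge graphs and the lack of a global structural perspective in existing reasoning-path retrieval methods, improving the accuracy and evidence completeness of multi-hop reasoning graph retrieval.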

Comments 34 pages, 16 figures, accepted to ACL 2026 (Main Conference, Oral Presentation)

Abstract

Knowledge Graph-based Question Answering (KGQA) plays a pivotal role in complex reasoning tasks but remains constrained by two persistent challenges: the structural heterogeneity of Knowledge Graphs (KGs) often leads to semantic mismatch during retrieval, while existing reasoning path retrieval methods lack a global structural perspective. To address these issues, we propose Structure-Tracing Evidence Mining (STEM), a novel framework that reframes multi-hop reasoning as a schema-guided graph search task. First, we design a Semantic-to-Structural Projection pipeline that leverages KG structural priors to decompose queries into atomic relational assertions and construct an adaptive query schema graph. Subsequently, we execute globally-aware node anchoring and subgraph retrieval to obtain the final evidence reasoning graph from the KG. To more effectively integrate global structural information during the graph construction process, we design a Triple-Dependent GNN (Triple-GNN) to generate a Global Guidance Subgraph (Guidance Graph) that guides the construction. STEM significantly improves both the accuracy and evidence completeness of multi-hop reasoning graph retrieval, and achieves state-of-the-art performance on multiple multi-hop benchmarks.

2604.20155 2026-05-19 cs.CV

GSCompleter: A Distillation-Free Plugin for Metric-Aware 3D Gaussian Splatting Completion in Seconds

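Ao Gao, Jingyu Gong, Xin Tan, Zhizhong Zhang, Lizhuang Ma, Yuan Xie

AI summary: This paper proposes GSCompleter, a distillation-free plugin that performs metric-aware 3D Gaussian Splatting completion through a stable "Generate-then-Register" workflow, improving both completion quality and efficiency and achieving new state-of-the-art results on three benchmarks.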

Abstract

3D Gaussian Splatting (3DGS) has revolutionized high-fidelity neural rendering with its explicit representation and efficiency. However, reconstructing scenes from sparse viewpoints suffers from severe geometric voids and floaters due to limited coverage. Current scene completion methods typically rely on an iterative "Repair-then-Distill" paradigm, which is computationally intensive, prone to unstable optimization, and susceptible to overfitting. To address these limitations, we propose GSCompleter, a distillation-free plugin that shifts scene completion to a stable "Generate-then-Register" workflow. Specifically, GSCompleter synthesizes visually plausible 2D reference images and explicitly lifts them into 3D Gaussian primitives with a consistent metric scale via a robust Stereo-Anchor View Selection mechanism. These newly generated primitives are then seamlessly integrated into the global scene using a novel Ray-Constrained Registration strategy. By replacing unstable distillation with rapid geometric registration, GSCompleter exhibits superior 3DGS completion performance across three benchmarks, enhancing both quality and efficiency over various baselines and achieving new state-of-the-art (SOTA) results.

2604.18966 2026-05-19 cs.LG cs.AI

Self-Improving Tabular Language Models via Iterative Reward-Guided Post-Training

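Yunbo Long, Tejumade Afonja, Guangya Hao, Alexandra Brintrup, Mario Fritz

AI summary: This paper studies iterative reward-guided post-training of tabular language models through a generate-score-align protocol and proposes TabGRAA, a group-relative alignment method that compares high- and low-reward generated groups using group-averaged policy/reference log-ratios. Across five mixed-type benchmarks it improves on additional supervised fine-tuning and achieves the strongest average trade-off among the adapted DPO, KTO, and NPO baselines on fidelity and downstream utility, while keeping empirical privacy diagnostics near the supervised baseline.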

Abstract

Tabular language models can generate synthetic tables by modeling rows as token sequences, but they are typically trained once with supervised fine-tuning and then used as static synthesizers. This is limiting because next-token likelihood does not directly optimize the distributional, utility, and indistinguishability properties used to evaluate synthetic data. We study iterative reward-guided post-training for tabular language models through a generate-score-align protocol, where a generator samples synthetic rows, a task-specified reward ranks them, and the model is updated relative to a fixed supervised reference. Within this protocol, we propose TabGRAA (Tabular Group-Relative Advantage Alignment), a group-relative alignment method that compares high- and low-reward generated groups using group-averaged policy/reference log-ratios rather than one-to-one preference pairs. Across five mixed-type benchmarks, TabGRAA improves a GReaT backbone beyond additional supervised fine-tuning and achieves the strongest average trade-off among adapted DPO, KTO, and NPO baselines on fidelity and downstream utility, while maintaining empirical privacy diagnostics near the supervised baseline. Ablations show that the gains depend on meaningful reward ranking and stable group-level updates rather than extra training alone. Reward-substitution and scorer-separation studies further show that the post-training loop can use both classifier-based and classifier-free rewards, and that proper scorer separation is important for preserving the fidelity-utility-privacy trade-off. These results position TabGRAA as a self-improving post-training method for tabular language-model generators, complementary to strong static tabular synthesizers.
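The group-level comparison named in the abstract can be sketched as a margin between two averaged log-ratios. Only that margin is grounded in the abstract. The logistic loss applied to it and the scale `beta` are assumptions borrowed from DPO-style objectives, and the function names are hypothetical; the actual TabGRAA objective may differ.

```python
import math
from typing import Sequence


def mean_log_ratio(policy_logps: Sequence[float],
                   reference_logps: Sequence[float]) -> float:
    """Average of log pi(row) - log pi_ref(row) over one group of generated rows."""
    ratios = [p - r for p, r in zip(policy_logps, reference_logps)]
    return sum(ratios) / len(ratios)


def group_margin(high_policy: Sequence[float], high_ref: Sequence[float],
                 low_policy: Sequence[float], low_ref: Sequence[float]) -> float:
    """Group-averaged log-ratio of the high-reward group minus that of the low-reward group."""
    return (mean_log_ratio(high_policy, high_ref)
            - mean_log_ratio(low_policy, low_ref))


def logistic_alignment_loss(margin: float, beta: float = 0.1) -> float:
    """Assumed DPO-style loss: -log(sigmoid(beta * margin)) = log(1 + exp(-beta * margin))."""
    return math.log1p(math.exp(-beta * margin))
```

Because the comparison is between group averages, the two groups need not have the same size and no row has to be matched with a specific counterpart, which is the contrast with one-to-one preference pairs drawn in the abstract.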

2604.17487 2026-05-19 cs.CL

Answer Only as Precisely as Justified: Calibrated Claim-Level Specificity Control for Agentic Systems

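Tianyi Huang, Samuel Xu, Jason Tansong Dang, Samuel Yan, Kimberley Yin

AI summary: This paper studies how agentic systems fail by being too precise and introduces compositional selective specificity (CSS), which decomposes an answer into claims, proposes coarser backoffs, and emits each claim at the most specific calibrated level that appears admissible, improving the risk-utility trade-off.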

Comments Accepted at the ICML 2026 Workshop on Statistical Frameworks for Uncertainty in Agentic Systems

Abstract

Agentic systems often fail not by being entirely wrong, but by being too precise: a response may be generally useful while particular claims exceed what the evidence supports. We study this failure mode as overcommitment control and introduce compositional selective specificity (CSS), a post-generation layer that decomposes an answer into claims, proposes coarser backoffs, and emits each claim at the most specific calibrated level that appears admissible. The method is designed to express uncertainty as a local semantic backoff rather than as a whole-answer refusal. Across a full LongFact run and HotpotQA pilots, calibrated CSS improves the risk-utility trade-off of fixed drafts. On the full LongFact run, it raises overcommitment-aware utility from 0.846 to 0.913 relative to the no-CSS output while achieving 0.938 specificity retention. These results suggest that claim-level specificity control is a useful uncertainty interface for agentic systems and a target for future distribution-free validity layers.
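The emit step of CSS can be sketched as a selection over backoff variants. This is a simplified illustration, not the authors' method: it assumes each claim arrives with its variants ordered from most specific to coarsest, that each variant carries a calibrated confidence, and that "admissible" means clearing a fixed threshold. The `Variant` type, the threshold rule, and dropping a claim when nothing passes are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Variant:
    text: str
    confidence: float  # assumed: calibrated probability that this wording is supported


def select_variant(variants: List[Variant], threshold: float) -> Optional[str]:
    """Return the first (most specific) variant whose confidence clears the threshold."""
    for variant in variants:  # ordered from most specific to coarsest
        if variant.confidence >= threshold:
            return variant.text
    return None  # no admissible wording: the claim is omitted


def apply_css(claims: List[List[Variant]], threshold: float = 0.9) -> List[str]:
    """Apply the selection to every claim and keep only the claims that were emitted."""
    selected = (select_variant(variants, threshold) for variants in claims)
    return [text for text in selected if text is not None]
```

The effect is local: a claim that cannot be supported at full precision is coarsened or dropped on its own, while the rest of the answer is emitted unchanged, rather than the whole answer being refused.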

2604.16429 2026-05-19 cs.LG cs.AI cs.CV physics.ao-ph

(Sparse) Attention to the Details: Preserving Spectral Fidelity in ML-based Weather Forecasting Models

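Maksim Zhdanov, Ana Lucic, Max Welling, Jan-Willem van de Meent

AI summary: This paper presents Mosaic, which generates ensemble members through learned functional perturbations and operates on native-resolution grids via mesh-aligned block-sparse attention that captures long-range dependencies at linear cost. At 1.5° resolution it matches or outperforms models trained at finer resolution on key variables and achieves state-of-the-art results among 1.5° models.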

Comments Accepted to ICML 2026

Abstract

We introduce Mosaic, a probabilistic weather forecasting model that addresses three failure modes of spectral degradation in ML-based weather prediction: spectral damping (statistical), high-frequency aliasing (architectural), and residual high-frequency leakage (parametric). Mosaic generates ensemble members through learned functional perturbations and operates on native-resolution grids via mesh-aligned block-sparse attention, a hardware-aligned mechanism that captures long-range dependencies at linear cost by sharing keys and values across spatially adjacent queries. At 1.5° resolution with 214M parameters, Mosaic matches or outperforms models trained on $6\times$ finer resolution on key variables and achieves state-of-the-art results among 1.5° models, producing well-calibrated ensembles whose individual members exhibit near-perfect spectral alignment across all resolved frequencies. A 24-member, 10-day forecast takes under 12 s on a single H100 GPU. Code is available at https://github.com/maxxxzdn/mosaic.

2604.15851 2026-05-19 cs.LG cs.AI cs.CR

DPrivBench: Benchmarking LLMs' Reasoning for Differential Privacy

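Erchi Wang, Pengrun Huang, Eli Chien, Om Thakkar, Kamalika Chaudhuri, Yu-Xiang Wang, Ruihan Wu

AI summary: This paper introduces DPrivBench, a benchmark for evaluating how well large language models reason about differential privacy. It finds that current models show substantial gaps on advanced algorithms and identifies directions for improving automated DP reasoning.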

Abstract

Differential privacy (DP) has a wide range of applications for protecting data privacy, but designing and verifying DP algorithms requires expert-level reasoning, creating a high barrier for non-expert practitioners. Prior works either rely on specialized verification languages that demand substantial domain expertise or remain semi-automated and require human-in-the-loop guidance. In this work, we investigate whether large language models (LLMs) can automate DP reasoning. We introduce DPrivBench, a benchmark in which each instance asks whether a function or algorithm satisfies a stated DP guarantee under specified assumptions. The benchmark is carefully designed to cover a broad range of DP topics, span diverse difficulty levels, and resist shortcut reasoning through trivial pattern matching. Experiments show that while the strongest models handle textbook mechanisms well, all models struggle with advanced algorithms, revealing substantial gaps in current DP reasoning capabilities. Through further analytic study and failure-mode analysis, we identify several promising directions for improving automated DP reasoning. Our benchmark provides a solid foundation for developing and evaluating such methods, and complements existing benchmarks for mathematical reasoning.
