arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.15177 2026-05-19 cs.AI

OpenDeepThink: Parallel Reasoning via Bradley-Terry Aggregation

Shang Zhou, Wenhao Chai, Kaiyuan Liu, Huanzhi Mao, Qiuyang Mang, Jingbo Shang

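Affiliations: UC San Diego; Princeton University; University of Washington; UC Berkeley

AI summary: This work proposes OpenDeepThink, a framework that scales test-time compute in breadth to improve LLM reasoning. It samples candidates in parallel and removes the selection bottleneck with pairwise Bradley-Terry aggregation, improving performance on benchmarks such as Codeforces.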

Comments 19 pages, 4 figures

Abstract

Test-time compute scaling is a primary axis for improving LLM reasoning. Existing methods primarily scale depth by extending a single reasoning trace. Scaling breadth by sampling multiple candidates in parallel is straightforward, but introduces a selection bottleneck: choosing the best candidate without a ground-truth verifier, since pointwise LLM judging is noisy and biased. To address this, we introduce OpenDeepThink, a population-based test-time compute framework that selects via pairwise Bradley-Terry comparison. Each generation, the LLM judges random pairs of candidates and aggregates votes via Bradley-Terry into a global ranking; top-ranked candidates are preserved and the top three quarters are mutated using the natural-language critiques produced during comparison; the bottom quarter is discarded. OpenDeepThink raises Gemini 3.1 Pro's effective Codeforces Elo by +405 points in eight sequential LLM-call rounds (~27 minutes wall-clock). The pipeline transfers across weaker and stronger models without retuning, and on the multi-domain HLE benchmark, gains appear concentrated in objectively verifiable domains and reverse in subjective ones. We release CF-73, a curated set of 73 expert-rated Codeforces problems with International Grandmaster annotation and 99% local-evaluation agreement against the official verdict.
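
The ranking step described above can be illustrated with a minimal sketch, assuming the judge's votes arrive as (winner, loser) index pairs. The function names, the smoothing `prior`, and the iteration count are illustrative assumptions, not details from the paper; the update itself is the standard minorization-maximization fit for a Bradley-Terry model.

```python
def bradley_terry(n_items, comparisons, iters=200, prior=0.1):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Requires n_items >= 2. `prior` (an assumption of this sketch) adds a
    small virtual win in each direction for every pair, so candidates with
    no real wins still get a positive strength.
    """
    wins = [prior * (n_items - 1)] * n_items
    games = [[0.0 if i == j else 2 * prior for j in range(n_items)]
             for i in range(n_items)]
    for winner, loser in comparisons:
        wins[winner] += 1
        games[winner][loser] += 1
        games[loser][winner] += 1

    p = [1.0] * n_items
    for _ in range(iters):
        new_p = []
        for i in range(n_items):
            denom = sum(games[i][j] / (p[i] + p[j])
                        for j in range(n_items) if j != i)
            new_p.append(wins[i] / denom)
        total = sum(new_p)
        # Strengths are only defined up to scale, so renormalize each round.
        p = [x * n_items / total for x in new_p]
    return p


def rank(strengths):
    """Candidate indices ordered from strongest to weakest."""
    return sorted(range(len(strengths)), key=lambda i: strengths[i], reverse=True)


def drop_bottom_quarter(order):
    """Keep the top three quarters of a ranked population."""
    return order[: len(order) - len(order) // 4]


votes = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
order = rank(bradley_terry(4, votes))
survivors = drop_bottom_quarter(order)
```

Aggregating sparse random pairs into one global ranking is what lets the population be compared without judging every pair; how the survivors are then mutated from the critiques is outside this sketch.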

2605.14133 2026-05-19 cs.AI

ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents

Yuxiang Lai, Peng Xia, Haonian Ji, Kaiwen Xiong, Kaide Zeng, Jiaqi Liu, Fang Wu, Jike Zhong, Zeyu Zheng, Cihang Xie, Huaxiu Yao

Affiliations: University of North Carolina at Chapel Hill; Stanford University; University of California, Berkeley; University of Southern California; University of California, Santa Cruz

AI summary: ClawForge generates executable interactive benchmarks to reconcile scalable construction with realistic workflow evaluation, and systematically tests how agents handle pre-existing state conflicts.

Abstract

Interactive agent benchmarks face a tension between scalable construction and realistic workflow evaluation. Hand-authored tasks are expensive to extend and revise, while static prompt evaluation misses failures that only appear when agents operate over persistent state. Existing interactive benchmarks have advanced agent evaluation significantly, but most initialize tasks from clean state and do not systematically test how agents handle pre-existing partial, stale, or conflicting artifacts. We present ClawForge, a generator-backed benchmark framework for executable command-line workflows under state conflict. The framework compiles scenario templates, grounded slots, initialized state, reference trajectories, and validators into reproducible task specifications, and evaluates agents step by step over persistent workflow surfaces using normalized end state and observable side effects rather than exact trajectory matching. We instantiate this framework as ClawForge-Bench (17 scenarios, 6 ability categories). Results across seven frontier models show that the best model reaches only 45.3% strict accuracy, wrong-state replacement remains below 17% for all models, and the widest model separation (17% to 90%) is driven by whether agents inspect existing state before acting. Partial-credit and step-efficiency analyses further reveal that many failures are near-miss closures rather than early breakdowns, and that models exhibit qualitatively different failure styles under state conflict.

2605.14038 2026-05-19 cs.AI

Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use

Yize Cheng, Chenrui Fan, Mahdi JafariRaviz, Keivan Rezaei, Soheil Feizi

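Affiliations: University of Maryland, College Park

AI summary: This paper studies when LLMs actually need external tools. It proposes a model-adaptive definition of tool necessity grounded in each model's own performance, compares four models on arithmetic and factual QA datasets, and finds a substantial mismatch between tool necessity and actual tool-call behavior, revealing a knowing-doing gap in LLM tool use.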

Abstract

Large language models (LLMs) increasingly act as autonomous agents that must decide when to answer directly and when to invoke external tools. Prior work on adaptive tool use has largely treated tool necessity as a model-agnostic property, annotated by human or LLM judges, and has mostly covered cases where the answer is obvious (e.g., fetching the weather vs. paraphrasing text). However, tool necessity in the wild is more nuanced because capability boundaries diverge across models: a problem solvable by a strong model on its own may still require tools for a weaker one. In this work, we introduce a model-adaptive definition of tool necessity, grounded in each model's empirical performance. Following this definition, we compare the necessity against observed tool-call behavior across four models on arithmetic and factual QA datasets, and find substantial mismatches of 26.5-54.0% and 30.8-41.8%, respectively. To diagnose the failure, we decompose tool use into two stages: an internal cognition stage that reflects whether a model believes a tool is necessary, and an execution stage that determines whether the model actually makes a tool-call action. By probing the LLM hidden states, we find that both signals are often linearly decodable, yet their probe directions become nearly orthogonal in the late-layer, last-token regime that drives the next-token action. By tracing the trajectory of samples through the two-stage process, we further discover that the majority of the mismatch is concentrated in the cognition-to-action transition, not in cognition itself. These results reveal a knowing-doing gap in LLM tool use: improving tool-use reliability requires not only better recognition of when tools are needed, but also better translation of that recognition into action.

2605.14005 2026-05-19 cs.CL cs.LG

Mistletoe: Stealthy Acceleration-Collapse Attacks on Speculative Decoding

Shuoyang Sun, Chang Dai, Hao Fang, Kuofeng Gao, Xinhao Zhong, Yi Sun, Fan Mo, Shu-Tao Xia, Bin Chen

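Affiliations: Harbin Institute of Technology, Shenzhen; South China University of Technology; Tsinghua Shenzhen International Graduate School, Tsinghua University; Huawei Technology

AI summary: This paper proposes Mistletoe, an attack that jointly optimizes a degradation objective and a semantic-preservation objective to stealthily reduce the accepted length τ of speculative decoding, collapsing the speedup while preserving output quality.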

Abstract

Speculative decoding has become a widely adopted technique for accelerating large language model (LLM) inference by drafting multiple candidate tokens and verifying them with a target model in parallel. Its efficiency, however, critically depends on the average accepted length $τ$, i.e., how many draft tokens survive each verification step. In this work, we identify a new mechanism-level vulnerability in model-based speculative decoding: the drafter is trained to approximate the target model distribution, but this approximation is inevitably imperfect. Such a drafter-target mismatch creates a hidden attack surface where small perturbations can preserve the target model's visible behavior while substantially reducing draft-token acceptability. We propose Mistletoe, a stealthy acceleration-collapse attack against speculative decoding. Mistletoe directly targets the acceptance mechanism of speculative decoding. It jointly optimizes a degradation objective that decreases drafter-target agreement and a semantic-preservation objective that constrains the target model's output distribution. To resolve the conflict between these objectives, we introduce a null-space projection mechanism, where degradation gradients are projected away from the local semantic-preserving direction, suppressing draft acceptance while minimizing semantic drift. Experiments on various speculative decoding systems show that Mistletoe substantially reduces average accepted length $τ$, collapses speedup, and lowers averaged token throughput, while preserving output quality and perplexity. Our work highlights that speculative decoding introduces a mechanism-level attack surface beyond existing output robustness, calling for more robust designs of LLM acceleration systems.
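
The projection idea can be sketched generically, assuming the "local semantic-preserving direction" is represented by a single vector (for example, the gradient of the semantic objective). That reading, and the name `project_out`, are assumptions of this sketch; the paper's actual projection may act on a higher-dimensional subspace.

```python
def project_out(grad, direction):
    """Remove from `grad` its component along `direction`.

    The result is orthogonal to `direction`, so a small step along it
    leaves a quantity whose gradient is `direction` unchanged to first order.
    """
    norm_sq = sum(d * d for d in direction)
    if norm_sq == 0.0:
        return list(grad)  # nothing to project against
    scale = sum(g * d for g, d in zip(grad, direction)) / norm_sq
    return [g - scale * d for g, d in zip(grad, direction)]


degradation_grad = [3.0, 4.0]   # hypothetical gradient of the degradation objective
semantic_dir = [1.0, 0.0]       # hypothetical semantic-sensitive direction
update = project_out(degradation_grad, semantic_dir)
```

The design intent is that the update still reduces drafter-target agreement through whatever part of the gradient does not disturb the target model's output distribution.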

2605.13415 2026-05-19 cs.CL cs.AI cs.LG

KIT-TIP-NLP at MultiPride: Continual Learning with Multilingual Foundation Model

Barathi Ganesh HB, Michal Ptaszynski, Rene Melendez, Juuso Eronen

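Affiliations: Text Information Processing Lab, Kitami Institute of Technology, Kitami, Hokkaido 090-0015, Japan

AI summary: This paper presents a multi-stage framework for detecting reclaimed slurs in multilingual social media. It addresses the challenge of distinguishing reclamatory from non-reclamatory use of LGBTQ+-related slurs in English, Spanish, and Italian tweets, combining data-driven model selection, semantic-preserving augmentation, inductive transfer learning, and domain-specific knowledge injection.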

Comments Final Workshop of the 9th evaluation campaign EVALITA 2026

Abstract

This paper presents a multi-stage framework for detecting reclaimed slurs in multilingual social media discourse. It addresses the challenge of identifying reclamatory versus non-reclamatory usage of LGBTQ+-related slurs across English, Spanish, and Italian tweets. The framework handles three intertwined methodological challenges: data scarcity, class imbalance, and cross-linguistic variation in sentiment expression. It integrates data-driven model selection via cross-validation, semantic-preserving augmentation through back-translation, inductive transfer learning with dynamic epoch-level undersampling, and domain-specific knowledge injection via masked language modeling. Eight multilingual embedding models were evaluated systematically, with XLM-RoBERTa selected as the foundation model based on macro-averaged F1 score. Data augmentation via GPT-4o-mini back-translation to alternate languages effectively tripled the training corpus while preserving semantic content and class distribution ratios. The framework produces four final runs for evaluation: RUN 1 uses inductive transfer learning with augmentation and undersampling; RUN 2 uses masked language modeling pre-training; RUN 3 and RUN 4 refine the earlier predictions with language-specific decision thresholds optimized via ROC analysis. Language-specific threshold refinement reveals that optimal decision boundaries vary significantly across languages. This reflects distributional differences in model confidence scores and linguistic variation in reclamatory language usage. The threshold-based optimization yields a 2-5% absolute F1 improvement without requiring model retraining. The methodology is fully reproducible, with all code and experimental setup available at https://github.com/rbg-research/MultiPRIDE-Evalita-2026.
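
Per-language threshold selection can be sketched as follows. The abstract only says thresholds were optimized via ROC analysis; using Youden's J (true positive rate minus false positive rate) as the criterion is an assumption here, and the scores and labels are made-up toy data.

```python
def best_threshold(scores, labels):
    """Return the score threshold that maximizes Youden's J = TPR - FPR.

    `labels` are 0/1 and must contain at least one example of each class.
    A score s is predicted positive when s >= threshold.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        j = tp / pos - fp / neg
        if j > best_j:
            best_t, best_j = t, j
    return best_t


# Hypothetical validation scores per language: (model confidences, gold labels).
validation = {
    "en": ([0.2, 0.3, 0.6, 0.9], [0, 0, 1, 1]),
    "it": ([0.1, 0.15, 0.3, 0.5], [0, 0, 1, 1]),
}
thresholds = {lang: best_threshold(s, y) for lang, (s, y) in validation.items()}
```

Because only the decision threshold changes, this kind of refinement needs no retraining, which matches the abstract's description of RUN 3 and RUN 4.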

2605.11975 2026-05-19 cs.LG

Stochastic Minimum-Cost Reach-Avoid Reinforcement Learning

Jingduo Pan, Taoran Wu, Yiling Xue, Bai Xue

Affiliations: Key Laboratory of System Software (Chinese Academy of Sciences), Institute of Software, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China; School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences, Beijing, China

AI summary: This paper studies stochastic minimum-cost reach-avoid reinforcement learning: minimizing expected cumulative cost while satisfying a reach-avoid specification with probability at least p. It introduces reach-avoid probability certificates (RAPCs) and a contraction-based Bellman formulation that integrates reach-avoid considerations into reinforcement learning and enables cost optimization under probabilistic constraints.

Comments Accepted at the Forty-third International Conference on Machine Learning (ICML 2026)

Abstract

We study stochastic minimum-cost reach-avoid reinforcement learning, where an agent must satisfy a reach-avoid specification with probability at least $p$ while minimizing expected cumulative costs in stochastic environments. Existing safe and constrained reinforcement learning methods typically fail to jointly enforce probabilistic reach-avoid constraints and optimize cost in the learning setting in stochastic environments. To address this challenge, we introduce reach-avoid probability certificates (RAPCs), which identify states from which stochastic reach-avoid constraints are satisfiable. Building on RAPCs, we develop a contraction-based Bellman formulation that serves as a principled surrogate for integrating reach-avoid considerations into reinforcement learning, enabling cost optimization under probabilistic constraints. We establish almost sure convergence of the proposed algorithms to locally optimal policies with respect to the resulting objective. Experiments in the MuJoCo simulator demonstrate improved cost performance and consistently higher reach-avoid satisfaction rates.

2605.11854 2026-05-19 cs.CL

Self-Distilled Trajectory-Aware Boltzmann Modeling: Bridging the Training-Inference Discrepancy in Diffusion Language Models

Kecheng Chen, Ziru Liu, Xijia Tao, Hui Liu, Yibing Liu, Xinyu Fu, Shi Wu, Suiyun Zhang, Dandan Tu, Lingpeng Kong, Rui Liu, Haoliang Li

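Affiliations: City University of Hong Kong; Huawei Research; The University of Hong Kong

AI summary: This paper asks whether self-distilled trajectories can drive genuine knowledge acquisition rather than only faster inference. It proposes TABOM, a framework that models the inference unmasking preference with a Boltzmann distribution, improving diffusion language models in new domains and mitigating catastrophic forgetting.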

Comments Project website: https://tonyckc.github.io/TABOM-web/

Abstract

Diffusion Language Models (DLMs) have recently emerged as a promising alternative to autoregressive language models, offering stronger global awareness and highly parallel generation. However, post-training DLMs with standard Negative Evidence Lower Bound (NELBO)-based supervised fine-tuning remains inefficient: training reconstructs randomly masked tokens in a single step, whereas inference follows a confidence-guided, multi-step easy-to-hard denoising trajectory. Recent trajectory-based self-distillation methods exploit such inference trajectories mainly for sampling-step compression and acceleration, often improving decoding efficiency without substantially enhancing the model's underlying capability, and may even degrade performance under full diffusion decoding. In this work, we ask whether self-distilled trajectories can be used not merely for faster inference, but for genuine knowledge acquisition. Although these trajectories lie on the pretrained DLM's own distributional manifold and thus offer a potentially lower optimization barrier, we find that naively fine-tuning on them with standard NELBO objectives yields only marginal gains. To address this limitation, we propose Trajectory-Aligned optimization via Boltzmann Modeling (TABOM), a self-distilled trajectory-based post-training framework that aligns training with the easy-to-hard structure of inference. TABOM models the inference unmasking preference as a Boltzmann distribution over predictive entropies and derives a tractable pairwise ranking objective to align the model's certainty ordering with the observed decoding trajectory. Empirically, TABOM achieves substantial gains in new domains, expands the effective knowledge boundary of DLMs, and significantly mitigates catastrophic forgetting compared with standard SFT.
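
One way to read "a Boltzmann distribution over predictive entropies" is that a masked position with entropy H is unmasked with probability proportional to exp(-H / T). The sketch below follows that reading; the temperature `T`, the function names, and the exact form of the pairwise loss are assumptions, not the paper's definitions. Under this reading, the probability that position i is unmasked before position j reduces to a sigmoid of (H_j - H_i) / T, which gives a simple pairwise ranking loss.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one position's predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def unmask_distribution(entropies, temperature=1.0):
    """Boltzmann weights over positions: lower entropy, higher probability."""
    logits = [-h / temperature for h in entropies]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


def pairwise_ranking_loss(h_earlier, h_later, temperature=1.0):
    """-log P(earlier position is unmasked before the later one).

    Equals -log sigmoid((h_later - h_earlier) / temperature).
    """
    margin = (h_later - h_earlier) / temperature
    return math.log1p(math.exp(-margin))


entropies = [token_entropy([0.97, 0.02, 0.01]), token_entropy([0.4, 0.3, 0.3])]
probs = unmask_distribution(entropies)
loss = pairwise_ranking_loss(entropies[0], entropies[1])
```

The loss is small when the position that was decoded earlier in the observed trajectory is also the one the model is more certain about, which is the alignment the abstract describes.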

2605.11817 2026-05-19 cs.RO cs.CV

See What Matters: Differentiable Grid Sample Pruning for Generalizable Vision-Language-Action Model

Yixu Feng, Zinan Zhao, Yanxiang Ma, Chenghao Xia, Chengbin Du, Yunke Wang, Chang Xu

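Affiliations: The University of Sydney; City University of Hong Kong

AI summary: This paper proposes a compression method for vision-language-action models based on differentiable grid sampling. Continuous token resampling preserves key spatial information while keeping fewer than 10% of the original visual tokens and cutting FLOPs by 76% with no loss in success rate.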

Journal ref Proceedings of the Forty-third International Conference on Machine Learning, 2026

Abstract

Vision-Language-Action (VLA) models have shown remarkable promise in robotic manipulation, yet their high computational cost hinders real-time deployment. Existing token pruning methods suffer from a fundamental trade-off: aggressive compression using pruning inevitably discards critical geometric details like contact points, leading to severe performance degradation. This forces a compromise, limiting the achievable compression rate and thus the potential speedup. We argue that breaking this trade-off requires rethinking compression as geometry-aware, continuous token resampling in the vision encoder. To this end, we propose the Differentiable Grid Sampler (GridS), a plug-and-play module that performs task-aware, continuous resampling of visual tokens in VLA. By adaptively predicting a minimal set of salient coordinates and extracting features via differentiable interpolation, GridS preserves essential spatial information while achieving drastic compression (with fewer than 10% of the original visual tokens). Experiments on both the LIBERO benchmark and a real robotic platform validate the lowest feasible visual token count reported to date: GridS achieves a 76% reduction in FLOPs with no degradation in success rate. The code is available at https://github.com/Fediory/Grid-Sampler.
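
The "differentiable interpolation" step can be illustrated with plain bilinear sampling at continuous coordinates. This is a generic sketch with scalar features and a hypothetical function name, not the GridS implementation; in an autograd framework the same arithmetic would let gradients flow back to the predicted coordinates.

```python
def bilinear_sample(feature_map, x, y):
    """Sample a 2D feature map (list of rows) at continuous pixel coords.

    Coordinates are clamped to the map, then the four surrounding cells
    are blended with weights that vary linearly in x and y.
    """
    height, width = len(feature_map), len(feature_map[0])
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    dx, dy = x - x0, y - y0
    top = (1 - dx) * feature_map[y0][x0] + dx * feature_map[y0][x1]
    bottom = (1 - dx) * feature_map[y1][x0] + dx * feature_map[y1][x1]
    return (1 - dy) * top + dy * bottom


features = [[0.0, 1.0],
            [2.0, 3.0]]
# A handful of (hypothetical) salient coordinates replaces the full token grid.
salient_coords = [(0.5, 0.5), (1.0, 0.0)]
resampled = [bilinear_sample(features, x, y) for x, y in salient_coords]
```

Unlike hard pruning, which can only keep or drop existing tokens, sampling between grid cells lets a small number of points land exactly where the geometry matters.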

2605.11567 2026-05-19 cs.CV

Dynamic Execution Commitment of Vision-Language-Action Models

Feng Chen, Xianghui Wang, Yuxuan Chen, Boying Li, Yefei He, Zeyu Zhang, Yicheng Wu

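Affiliations: University of Adelaide; Sichuan University; Shanghai Jiao Tong University; Monash University; Zhejiang University; Imperial College London

AI summary: This paper proposes A3, a mechanism that reframes dynamic execution commitment as a self-speculative prefix verification problem, improving the trade-off between execution robustness and inference throughput for vision-language-action models in dynamic or out-of-distribution settings.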

Comments code is available at https://inceptionwang.github.io/A3/

Abstract

Vision-Language-Action (VLA) models predominantly adopt action chunking, i.e., predicting and committing to a short horizon of consecutive low-level actions in a single forward pass, to amortize the inference cost of large-scale backbones and reduce per-step latency. However, committing these multi-step predictions to real-world execution requires balancing success rate against inference efficiency, a decision typically governed by fixed execution horizons tuned per task. Such heuristics ignore the state-dependent nature of predictive reliability, leading to brittle performance in dynamic or out-of-distribution settings. In this paper, we introduce A3, an Adaptive Action Acceptance mechanism that reframes dynamic execution commitment as a self-speculative prefix verification problem. A3 first computes a trajectory-wise consensus score of actions via group sampling, then selects a representative draft and prioritizes downstream verification. Specifically, it enforces: (1) consensus-ordered conditional invariance, which validates low-consensus actions by judging whether they remain consistent when re-decoded conditioned on high-consensus actions; and (2) prefix-closed sequential consistency, which guarantees physical rollout integrity by accepting only the longest continuous sequence of verified actions starting from the beginning. Consequently, the execution horizon emerges as the longest verifiable prefix satisfying both internal model logic and sequential execution constraints. Experiments across diverse VLA models and benchmarks demonstrate that A3 eliminates the need for manual horizon tuning while achieving a superior trade-off between execution robustness and inference throughput.
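
The prefix-closed acceptance rule can be sketched in a few lines, assuming each action in the drafted chunk has already received a boolean verification result. How A3 computes those results (consensus scoring and conditional re-decoding) is not modeled here, and the function names are illustrative.

```python
def accepted_prefix_length(verified):
    """Length of the longest run of verified actions starting at index 0."""
    count = 0
    for ok in verified:
        if not ok:
            break
        count += 1
    return count


def actions_to_execute(draft, verified):
    """Commit only the verified prefix of a drafted action chunk."""
    return draft[: accepted_prefix_length(verified)]


draft = ["reach", "grasp", "lift", "place"]   # hypothetical action chunk
verified = [True, True, False, True]           # hypothetical per-action checks
committed = actions_to_execute(draft, verified)
```

A later action that passes verification is still dropped if an earlier one fails, since executing it would require skipping a step in the physical rollout; the execution horizon therefore varies with the state instead of being fixed per task.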

2605.11461 2026-05-19 cs.AI cs.LG

Breaking Winner-Takes-All: Cooperative Policy Optimization Improves Diverse LLM Reasoning

Haoxuan Chen, Tianming Liang, Wei-Shi Zheng, Jian-Fang Hu

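Affiliations: ISEE Lab, Sun Yat-sen University

AI summary: This paper proposes Group Cooperative Policy Optimization (GCPO), which shifts the training paradigm from rollout competition to team cooperation and improves both the accuracy and the solution diversity of LLMs on reasoning tasks.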

Abstract

Reinforcement learning with verifiers (RLVR) has become a central paradigm for improving LLM reasoning, yet popular group-based optimization algorithms like GRPO often suffer from exploration collapse, where the models prematurely converge on a narrow set of high-scoring patterns and lose the ability to explore new solutions. Recent efforts attempt to alleviate this by adding entropy regularization or a diversity bonus. However, these approaches do not change the winner-takes-all nature of training, where rollouts still compete for individual advantage rather than cooperating to maximize global diversity. In this work, we propose Group Cooperative Policy Optimization (GCPO), which shifts the training paradigm from rollout competition to team cooperation. Specifically, GCPO replaces independent rollout scoring with team-level credit assignment: a rollout is rewarded by how much it contributes to the team's valid solution coverage, rather than its individual accuracy. This coverage is described as a determinant volume over reward-weighted semantic embeddings, where only correct and non-redundant rollouts contribute to this volume. During advantage estimation, GCPO redistributes the collective team reward to each single rollout according to its average marginal contribution to the team. This cooperative training paradigm routes optimization toward non-redundant correct reasoning paths. Experiments across multiple reasoning benchmarks demonstrate that GCPO significantly improves both reasoning accuracy and solution diversity over existing approaches. Code will be released at https://github.com/bradybuddiemarch/gcpo.

2605.11223 2026-05-19 cs.AI

Do Vision-Language-Models show human-like logical problem-solving capability in point and click puzzle games?

Maximilian Triebel, Marco Menner, Dominik Helfenstein

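Affiliations: Institute of Artificial Intelligence, University of Stuttgart, Stuttgart, Germany

AI summary: This paper introduces the VLATIM benchmark for evaluating human-like logical problem solving in the classic physics puzzle game The Incredible Machine 2. Large models plan well but struggle with precise visual grounding, and do not yet reach human-level performance.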

Abstract

Vision-Language(-Action) Models (VLMs) are increasingly applied to interactive environments, yet existing benchmarks often overlook the complex physical reasoning required for point-and-click puzzle games. This paper introduces Vision-Language Against The Incredible Machine (VLATIM), a benchmark designed to evaluate human-like logical problem-solving capabilities within the classic physics puzzle game The Incredible Machine 2 (TIM). Unlike existing benchmarks, VLATIM specifically targets the critical gap between high-level logical reasoning and continuous action spaces requiring precise mouse interactions. This benchmark is structured into five progressive parts, assessing capabilities that range from basic visual grounding and domain understanding to multi-step manipulation and full puzzle solving. Our results reveal a significant disparity between reasoning and execution. While large proprietary models demonstrate superior planning abilities, they struggle with precise visual grounding. Consequently, they do not yet show human-like problem-solving capabilities.

2605.10239 2026-05-19 cs.CV

AdaptSplat: Adapting Vision Foundation Models for Feed-Forward 3D Gaussian Splatting

Mingwei Xing, Xinliang Wang, Yifeng Shi

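Affiliations: Ke Holdings Inc.

AI summary: This paper proposes AdaptSplat, which adds a lightweight adapter with only 1.5M parameters to a generic architecture and improves the cross-domain generalization and high-frequency geometric fidelity of feed-forward 3D Gaussian Splatting.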

Abstract

This work explores a simple yet powerful lightweight adapter design for feed-forward 3D Gaussian Splatting (3DGS). Existing methods typically apply complex, architecture-specific designs on top of the generic pipeline of image feature extraction → multi-view interaction → feature decoding. However, constrained by the scale bottleneck of 3D training data and the low-pass filtering effect of deep networks, these methods still fall short in cross-domain generalization and high-frequency geometric fidelity. To address these problems, we propose AdaptSplat, which demonstrates that without complex component engineering, introducing a single adapter of only 1.5M parameters into the generic architecture is sufficient to achieve superior performance. Specifically, we design a lightweight Frequency-Preserving Adapter (FPA) that extracts direction-aware high-frequency structural priors from the shallow features of a powerful vision foundation model backbone, and seamlessly integrates them into the generic pipeline via high-frequency positional encodings and adaptive residual modulation. This effectively compensates for the high-frequency attenuation caused by over-smoothing in deep features, improving the fitting accuracy of Gaussian primitives on complex surfaces and sharp boundaries. Extensive experiments demonstrate that AdaptSplat achieves state-of-the-art feed-forward reconstruction performance on multiple standard benchmarks, with stable generalization across domains. Code available at: https://github.com/xmw666/AdaptSplat.

2605.10236 2026-05-19 cs.LG cs.AI

When Does Non-Uniform Replay Matter in Reinforcement Learning?

Michal Korniak, Mikołaj Czarnecki, Yarden As, Piotr Miłoś, Pieter Abbeel, Michal Nauman

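Affiliations: ETH Zurich; University of Warsaw; UC Berkeley; Amazon FAR

AI summary: This paper studies when non-uniform replay helps in reinforcement learning, identifies replay volume, expected recency, and the entropy of the replay distribution as the governing factors, and adopts a simple Truncated Geometric replay strategy that improves sample efficiency.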

Abstract

Modern off-policy reinforcement learning algorithms often rely on simple uniform replay sampling and it remains unclear when and why non-uniform replay improves over this strong baseline. Across diverse RL settings, we show that the effectiveness of non-uniform replay is governed by three factors: replay volume, the number of replayed transitions per environment step; expected recency, how recent sampled transitions are; and the entropy of the replay sampling distribution. Our main contribution is clarifying when non-uniform replay is beneficial and providing practical guidance for replay design in modern off-policy RL. Namely, we find that non-uniform replay is most beneficial when replay volume is low, and that high-entropy sampling is important even at comparable expected recency. Motivated by these findings, we adopt a simple Truncated Geometric replay that biases sampling toward recent experience while preserving high entropy and incurring negligible computational overhead. Across large-scale parallel simulation, single-task, and multi-task settings, including three modern algorithms evaluated on five RL benchmark suites, this replay sampling strategy improves sample efficiency in low-volume regimes while remaining competitive when replay volume is high.
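
A truncated geometric sampler over transition age can be sketched as below, together with two of the quantities the abstract names: expected recency and entropy. The parameterization (success probability `p`, age 0 meaning the newest transition) and the function names are assumptions of this sketch; the paper's exact definition may differ.

```python
import math
import random


def truncated_geometric_probs(buffer_size, p):
    """P(age = k) proportional to (1 - p)**k for k in 0..buffer_size-1."""
    weights = [(1.0 - p) ** age for age in range(buffer_size)]
    total = sum(weights)
    return [w / total for w in weights]


def entropy(probs):
    """Shannon entropy (nats) of the sampling distribution."""
    return -sum(q * math.log(q) for q in probs if q > 0)


def expected_age(probs):
    """Mean age of a sampled transition; smaller means more recent."""
    return sum(age * q for age, q in enumerate(probs))


def sample_ages(probs, batch_size, rng=random):
    """Draw a batch of ages; index the buffer from its newest end with them."""
    return rng.choices(range(len(probs)), weights=probs, k=batch_size)


probs = truncated_geometric_probs(buffer_size=1000, p=0.005)
batch = sample_ages(probs, batch_size=32, rng=random.Random(0))
```

A small `p` keeps the distribution close to uniform (high entropy) while still tilting it toward recent data; comparing `entropy(probs)` and `expected_age(probs)` against the uniform values `log(buffer_size)` and `(buffer_size - 1) / 2` shows how far a given setting departs from uniform replay.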

2605.10185 2026-05-19 cs.CV cs.AI

DynGhost: Temporally-Modelled Transformer for Dynamic Ghost Imaging with Quantum Detectors

Vittorio Palladino, Ahmet Enis Cetin

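Affiliations * Politecnico di Milano; University of Illinois at Chicago; University of Illinois Urbana-Champaign

AI summary This paper proposes DynGhost, a Transformer-based method for dynamic ghost imaging. It uses alternating spatial and temporal attention blocks to address the limitations of existing methods in dynamic scenes and low-light conditions, and a quantum-aware training framework to improve performance under realistic hardware constraints.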

Comments 6 pages, 8 figures

Details
English abstract

Ghost imaging reconstructs spatial information from a single-pixel bucket detector by correlating structured illumination patterns with scalar intensity measurements. While deep learning approaches have achieved promising results on static scenes, two critical limitations remain unaddressed: existing architectures fail to exploit temporal coherence across frames, leaving dynamic ghost imaging largely unsolved, and they assume additive Gaussian noise models that do not reflect the true Poissonian statistics of real single-photon hardware. We present DynGhost (Dynamic Ghost Imaging Transformer), a transformer architecture that addresses both limitations through alternating spatial and temporal attention blocks. Our quantum-aware training framework, based on physically accurate detector simulations (SNSPDs, SPADs, SiPMs) and Anscombe variance-stabilizing normalization, resolves the distribution shift that causes classical models to fail under realistic hardware constraints. Experiments across multiple benchmarks demonstrate that DynGhost outperforms both traditional reconstruction methods and existing deep learning architectures, with particular gains in dynamic and photon-starved settings.
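
Editor's sketch: the abstract names Anscombe variance-stabilizing normalization without saying how it is wired into training. The textbook transform maps a Poisson count x to 2·sqrt(x + 3/8), which has approximately unit variance for moderate counts. The function names are illustrative assumptions.

```python
import math


def anscombe(counts):
    """Map Poisson-distributed counts to values with approximately unit variance."""
    return [2.0 * math.sqrt(c + 3.0 / 8.0) for c in counts]


def inverse_anscombe(values):
    """Simple algebraic inverse; it is known to be biased at very low counts."""
    return [(v / 2.0) ** 2 - 3.0 / 8.0 for v in values]
```

Applying such a transform before a network lets a Gaussian-style loss act on data whose noise is closer to constant variance. Whether DynGhost uses exactly this form is not stated in the abstract.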

2605.10059 2026-05-19 cs.AI

Strategic Exploitation in LLM Agent Markets: A Simulation Framework for E-Commerce Trust

Shijun Lei, Quang Nguyen, Swapneel S Mehta, Zeping Li, Huichuan Fu, Xiaolong Zheng, Siki Chen, Yunji Liang, Philip Torr, Zhenfei Yin

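Affiliations * Northwestern Polytechnical University; Boston University; Fudan University; Wuhan University; Chinese Academy of Sciences; University of Oxford

AI summary This paper proposes TruthMarketTwin, a simulation framework for studying LLM-agent behavior in e-commerce markets. It finds that LLM agents in traditional markets exploit weaknesses in reputation-based governance, while warrant enforcement reduces deception and reshapes strategic reasoning.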

Details
English abstract

Agent-based modeling (ABM) has long been used in economics to study human behavior, and large language model (LLM) agents now enable new forms of social and economic simulation. While prior work has discovered strategic deception by LLM agents in financial trading and auction markets, e-commerce remains underexplored despite its distinctive information asymmetry: sellers privately observe product quality, whereas buyers rely on advertised claims and reputation signals. We introduce TruthMarketTwin, a controlled simulation framework for studying LLM-agent behavior in e-commerce markets. The framework is one of the first to model bilateral trade under asymmetric information sharing, where agents make strategic listing, purchasing, rating, and recourse-related decisions to optimize seller profit and buyer utility. We find that LLM agents released into traditional markets autonomously exploit weaknesses in reputation-based governance, while warrant enforcement reduces deception and reshapes strategic reasoning. Our results position LLM-agent simulation as a tool for studying institution-governed autonomous markets.

2605.09855 2026-05-19 cs.LG

Concordia: Self-Improving Synthetic Tables for Federated LLMs

Jimin Huang, Duanyu Feng, Nuo Chen, Xiaoyu Wang, Zhiqiang Zhang, Xueqing Peng, Mingquan Lin, Prayag Tiwari, Guojun Xiong, Alejandro Lopez-Lira, Sophia Ananiadou

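Affiliations * University of Manchester; National University of Singapore; New York University; University of Minnesota; Halmstad University; Harvard University; University of Florida

AI summary This paper studies how self-improving synthetic tables can improve the adaptation of large language models in federated learning when raw data cannot be shared. It proposes Concordia, a tri-level optimization framework that uses parameter-efficient LoRA training and lightweight utility scorers to improve federated validation utility and cross-client stability.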

Comments 12 pages

Details
English abstract

Federated learning (FL) enables training large language models (LLMs) without sharing raw data, but adapting LLMs under strict data isolation and non-IID client distributions remains challenging in practice. Synthetic data offers a natural privacy-preserving surrogate for local training, yet existing federated pipelines typically treat synthetic generation as static or loosely coupled with downstream optimization, leading to rapidly diminishing utility under heterogeneous clients. We study federated adaptation of LLMs on tabular tasks where raw records and validation data cannot be shared, and local training must rely entirely on synthetic tables. We propose Concordia, a tri-level optimization framework that aligns synthetic data generation with federated validation utility despite these constraints. At the client level, models are adapted via parameter-efficient LoRA training on synthetic tables. Clients additionally learn lightweight utility scorers from private validation feedback to reweight synthetic samples during local training. At the outer level, each client refines its own synthetic table generator using group-relative policy optimization (GRPO), guided by an ensemble of heterogeneous scorers shared across clients, without aggregating generator parameters or exposing validation data. Experiments on privacy-sensitive tabular benchmarks from finance and healthcare demonstrate that Concordia consistently improves federated performance, cross-client stability, and robustness to distribution shift compared to static and decoupled synthetic-data baselines.
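
Editor's sketch: the group-relative part of GRPO standardizes each sample's reward against the other samples drawn for the same prompt. The code below shows that step, with the reward taken as a plain average over an ensemble of scorers. The averaging rule and all names are assumptions; the abstract does not say how Concordia combines scorer outputs.

```python
import statistics


def ensemble_reward(sample, scorers):
    """Average the scores that several scorer callables assign to one sample (assumed rule)."""
    return sum(scorer(sample) for scorer in scorers) / len(scorers)


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Samples that score above their group's mean receive positive advantages and are reinforced; when every sample in a group scores the same, all advantages are zero and the group contributes no gradient.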

2605.09040 2026-05-19 cs.AI cs.IR cs.LG

UxSID: Semantic-Aware User Interests Modeling for Ultra-Long Sequence

Hongwei Zhang, Qiqiang Zhong, Jiangxia Cao, Yiyang Lv, Huanjie Wang, Liwei Guan, Jing Yao, Yiyu Wang, Junfeng Shu, Zhaojie Liu, Han Li

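Affiliations * Kuaishou Technology

AI summary This paper proposes UxSID, a framework that uses semantic-group shared interest memory and a dual-level attention strategy to model ultra-long user sequences efficiently and with semantic awareness, achieving state-of-the-art performance and an advertising revenue lift.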

Comments Work in progress

Details
English abstract

Modeling ultra-long user sequences involves a difficult trade-off between efficiency and effectiveness. While current paradigms rely on either item-specific search or item-agnostic compression, we propose UxSID, a framework exploring a third path: semantic-group shared interest memory. By utilizing Semantic IDs (SIDs) and a dual-level attention strategy, UxSID captures target-aware preferences without the heavy cost of item-specific models. This end-to-end architecture balances computational parsimony with semantic awareness, achieving state-of-the-art performance and a 0.337% revenue lift in a large-scale advertising A/B test.

2605.08738 2026-05-19 cs.LG cs.AI cs.CL

SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training

Shengkun Tang, Zekun Wang, Bo Zheng, Liangyu Wang, Rui Men, Siqi Zhang, Xiulong Yuan, Zihan Qiu, Zhiqiang Shen, Dayiheng Liu

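Affiliations * Qwen Team, Alibaba Inc.; MBZUAI; KAUST

AI summary This paper studies how to apply pruning and knowledge distillation in large-scale pretraining, examining the initialization advantage of pruning, the effect of expert compression on the final model, and the effectiveness of training strategies. It compresses Qwen3-Next-80A3B to a 23A2B model that remains competitive.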

Details
English abstract

Structured pruning and knowledge distillation (KD) are typical techniques for compressing large language models, but it remains unclear how they should be applied at pretraining scale, especially to recent mixture-of-experts (MoE) models. In this work, we systematically study MoE compression in large-scale pretraining, focusing on three key questions: whether pruning provides a better initialization than training from scratch, how expert compression choices affect the final model after continued training, and which training strategy is most effective. We have the following findings: First, across depth, width, and expert compression, pruning a pretrained MoE consistently outperforms training the target architecture from scratch under the same training budget. Second, different one-shot expert compression methods converge to similar final performance after large-scale continual pretraining. Motivated by this, we introduce a simple partial-preservation expert merging strategy that improves downstream performance across most benchmarks. Third, combining KD with the language modeling loss outperforms KD alone, particularly on knowledge-intensive tasks. We further propose multi-token prediction (MTP) distillation, which yields consistent gains. Finally, given the same training tokens, progressive pruning schedules outperform one-shot compression, suggesting that gradual architecture transitions lead to better optimization trajectories. Putting it all together, we compress Qwen3-Next-80A3B to a 23A2B model that retains competitive performance. These results offer practical guidance for efficient MoE compression at scale.
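
Editor's sketch: a per-token objective that combines a knowledge-distillation term with the language modeling loss, as the third finding describes. The convex weighting `alpha`, the forward KL direction, and the function names are assumptions; the abstract does not specify the exact form used for SlimQwen.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kd_plus_lm_loss(student_logits, teacher_logits, target, alpha=0.5):
    """alpha * cross-entropy on the true token + (1 - alpha) * KL(teacher || student)."""
    p_student = softmax(student_logits)
    p_teacher = softmax(teacher_logits)
    lm_loss = -math.log(p_student[target])
    kd_loss = sum(t * (math.log(t) - math.log(s))
                  for t, s in zip(p_teacher, p_student) if t > 0)
    return alpha * lm_loss + (1.0 - alpha) * kd_loss
```

With `alpha = 0` this reduces to distillation alone, the baseline that the combined loss is reported to outperform.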

2605.08439 2026-05-19 cs.CL

Can Language Models Identify Side Effects of Breast Cancer Radiation Treatments?

Natalie Seah, Danielle S. Bitterman, Daphna Spiegel, Thomas Hartvigsen

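Affiliations * University of Virginia; Mass General Brigham, Dana-Farber Cancer Institute, Harvard Medical School

AI summary This study examines how well language models identify side effects of breast cancer radiation treatments. Evaluating several language models under different prompting regimes, it reveals limitations in precision, recall, and the identification of rare and long-term side effects, and suggests directions for improvement.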

Details
English abstract

Accurately communicating the side effects of cancer treatments to cancer survivors is critical, particularly in settings such as informed consent, where clinicians must clearly and comprehensively convey potential treatment toxicities. However, this task remains challenging due to clinical knowledge deficits about adverse treatment effects and fragmentation across electronic health record (EHR) systems. Large language models (LLMs) have the potential to assist in this task, though their reliability in oncology survivorship contexts remains poorly understood. We present a deployment-oriented stress-testing framework for evaluating LLM-generated radiation side effect lists in breast cancer treatment and survivorship care. Using 21 breast cancer patient profiles, we construct paired patient clinical scenarios that differ only in radiotherapy regimens to evaluate seven instruction-tuned LLMs under multiple prompting regimes. We then compare LLM outputs to a clinician-curated reference derived from informed consent documents at two major academic medical centers and developed by a team including more than seven breast radiation oncologists. The reference maps radiation dose-fractionation, fields, and locations to associated toxicities, broken down by frequency and temporal onset. Across models, we reveal sensitivity to minor documentation changes, trade-offs between precision and recall, and systematic under-recall of rare and long-term side effects. When used alone, constraints on the number of side effects generated reduce precision, and grounding outputs in clinician-curated side effect lists substantially improves reliability and robustness. These findings highlight important limitations of LLM use in oncology and suggest practical design choices for safer and more informative survivorship-focused applications.
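
Editor's sketch: how precision and recall of a generated side-effect list could be scored against a reference list. Exact matching on normalized strings is an assumption made for brevity; the abstract does not describe the matching procedure, which may well be clinician-adjudicated.

```python
def precision_recall(predicted, reference):
    """Score a predicted list against a reference list using case-insensitive exact match."""
    pred = {item.strip().lower() for item in predicted}
    ref = {item.strip().lower() for item in reference}
    hits = len(pred & ref)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(ref) if ref else 0.0
    return precision, recall
```

Low recall on rare or late-onset items in the reference corresponds to the under-recall pattern the abstract reports.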

2605.08163 2026-05-19 cs.CV cs.AI cs.CL

MULTITEXTEDIT: Benchmarking Cross-Lingual Degradation in Text-in-Image Editing

Liwei Cheng, Shibo Feng, Lunjie Zhou, Yixuan Guan, Dayan Guan

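Affiliations * Harbin Institute of Technology

AI summary This paper proposes the MULTITEXTEDIT benchmark, which uses 3,600 instances spanning 12 languages, 5 visual domains, and 7 editing operations to evaluate cross-lingual degradation in text-in-image editing. It introduces a language fidelity metric and finds pronounced degradation in text accuracy and script fidelity.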

Comments 11 pages, 5 figures

Details
English abstract

Text-in-image editing has become a key capability for visual content creation, yet existing benchmarks remain overwhelmingly English-centric and often conflate visual plausibility with semantic correctness. We introduce MULTITEXTEDIT, a controlled benchmark of 3,600 instances spanning 12 typologically diverse languages, 5 visual domains, and 7 editing operations. Language variants of each instance share a common visual base and are paired with a human-edited reference and region masks, isolating the language variable for cross-lingual comparison. To capture script-level errors that coarse text-matching metrics miss, such as missing diacritics, reversed RTL order, and mixed-script renderings, we introduce a language fidelity (LSF) metric scored by a two-stage LVM protocol that first traces the edited target text and then judges it in isolation, reaching a quadratic-weighted κ of 0.76 against native-speaker annotators. Evaluating 12 open-source and proprietary systems with LSF alongside standard semantic and mask-aware pixel metrics, we find pronounced cross-lingual degradation for every model, largest on Hebrew and Arabic and smallest on Dutch and Spanish, and concentrated in text accuracy and script fidelity rather than in coarse structural dimensions. We also uncover a pervasive semantic and pixel mismatch, where outputs preserve global layout and background fidelity yet distort script-specific forms.
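
Editor's sketch: how a quadratic-weighted Cohen's kappa, the agreement statistic quoted above, is computed for two raters on an ordinal scale. The integer rating scale and the function name are assumptions; the benchmark's actual scale is not given in the abstract.

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_levels):
    """Cohen's kappa with quadratic disagreement weights.

    Ratings are integers in [0, num_levels). Assumes num_levels >= 2 and that
    the ratings are not all concentrated in a single level.
    """
    n = len(rater_a)
    observed = [[0.0] * num_levels for _ in range(num_levels)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1.0 / n
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(num_levels))
              for j in range(num_levels)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            weight = (i - j) ** 2 / (num_levels - 1) ** 2
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * hist_a[i] * hist_b[j]
    return 1.0 - weighted_observed / weighted_expected
```

A value of 1 means perfect agreement and 0 means chance-level agreement. The quadratic weights penalize large rating gaps more heavily than near misses.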

2605.07790 2026-05-19 cs.LG cs.CV

Hessian Surgery: Class-Targeted Post-Hoc Rebalancing via Hessian Spike Perturbation

Hugo Vigna, Samuel Bontemps

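Affiliations * CentraleSupélec – Université Paris-Saclay; ESILV – Léonard de Vinci

AI summary This paper proposes Hessian Surgery, a method that perturbs model weights along spike eigenvectors to rebalance per-class accuracy without retraining, and reports encouraging results on balanced accuracy and standard deviation on CIFAR-10 and ISIC-2019.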

Comments The code is available here: https://github.com/hugovigna/hessian-surgery.git

Details
English abstract

The Hessian spectrum of trained deep networks exhibits a characteristic structure: a continuous bulk of near-zero eigenvalues and a small number of large outlier eigenvalues (spikes), confirming the relevance of Random Matrix Theory in deep learning. The spike count matches the number of classes minus one. While prior work has described this structure, no method has exploited it operationally to improve classification performance. We propose Hessian Surgery, a post-hoc optimization method that directly perturbs model weights along spike eigenvectors to rebalance per-class accuracy without retraining. We introduce (i) a spike-class sensitivity matrix that quantifies the directional derivative of each class's accuracy along each spike eigenvector, (ii) a constrained optimization of perturbation coefficients that targets weak classes while preserving strong ones, and (iii) an adaptive amplitude control that raises or lowers the perturbation budget based on iteration-level improvement signals. We obtain encouraging results on CIFAR-10 and ISIC-2019 on both balanced accuracy and standard deviation.
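
Editor's sketch: the two quantities the results are reported on, computed from labels and predictions. Here "standard deviation" is read as the spread of per-class accuracies, which is an interpretation of the abstract rather than a confirmed definition; the function names are illustrative.

```python
import statistics


def per_class_accuracy(y_true, y_pred):
    """Return {class: fraction of that class's examples predicted correctly}."""
    accuracy = {}
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        accuracy[cls] = sum(y_pred[i] == cls for i in idx) / len(idx)
    return accuracy


def balanced_accuracy_and_spread(y_true, y_pred):
    """Mean and population standard deviation of the per-class accuracies."""
    values = list(per_class_accuracy(y_true, y_pred).values())
    return statistics.fmean(values), statistics.pstdev(values)
```

A rebalancing method aims to raise the first number while shrinking the second, so that weak classes improve without strong classes degrading.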

2605.07544 2026-05-19 cs.AI

From Pixels to Prompts: Vision-Language Models

Khang Hoang Nhat Vo

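Affiliations * MBZUAI

AI summary This book discusses the development of vision-language models and aims to give readers a clear mental map of the field's core concepts and applications, rather than cataloging every dataset and model variant.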

Details
English abstract

When you read a paper about a new Vision-Language Model today, it can be easy to forget how strange this idea would have sounded not so long ago. Teaching machines to see was already hard. Teaching them to read and generate language was already hard. Asking them to do both at once - and then to reason, answer questions, follow instructions, and sometimes even surprise us - still carries a quiet trace of science fiction, even as it becomes routine. This book was born from a simple feeling: it is too easy to get lost. The field moves quickly, new model names appear constantly, and the gap between "I know the buzzwords" and "I actually understand how this works" can feel uncomfortably wide. I have felt that gap many times. If you are holding this book, you probably have too. My goal is not to provide an exhaustive catalog of every dataset, benchmark, and new model variant. Instead, I want to offer something more modest - and, I hope, more durable: a clear mental map of Vision-Language Models. Enough structure that you can read new papers with confidence; enough intuition that you can design your own systems without feeling as if you are assembling LEGO bricks blindly.

2605.07308 2026-05-19 cs.RO

AT-VLA: Adaptive Tactile Injection for Enhanced Feedback Reaction in Vision-Language-Action Models

Xiaoqi Li, Muhe Cai, Jiadong Xu, Juan Zhu, Hongwei Fan, Yan Shen, Guangrui Ren, Hao Dong

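Affiliations * School of Computer Science, Peking University; PrimeBot PKU Lab

AI summary This paper proposes AT-VLA, whose Adaptive Tactile Injection mechanism dynamically decides when and where to inject tactile signals, reducing interference with pretrained representations. It also introduces a Tactile Reaction Dual-Stream mechanism for fast and accurate tactile responses, improving vision-language-action models on contact-rich manipulation tasks.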

Details
English abstract

Vision-Language-Action (VLA) models have significantly advanced the capabilities of robotic agents in executing diverse tasks; however, they still face challenges in contact-rich manipulation scenarios that require precise physical interactions. To address this limitation, recent studies have attempted to incorporate tactile signals during downstream tasks, enabling pretrained VLAs to interpret tactile feedback. Nevertheless, introducing new modalities during finetuning that are rarely present in the pretraining stage may disrupt the pretrained capabilities of VLAs. In addition, the inherently slow inference speed of VLAs hampers real-time responsiveness and limits the effective utilization of tactile feedback for action adjustment. To overcome these challenges, we propose Adaptive Tactile Vision-Language-Action (AT-VLA), which introduces a novel Adaptive Tactile Injection mechanism. This mechanism dynamically determines the appropriate timing and locations for tactile injection, incorporating tactile signals only when they significantly contribute to action generation, thereby minimizing interference with pretrained representations. Furthermore, to enable rapid and accurate tactile responses, we propose a Tactile Reaction Dual-Stream mechanism, which decouples sensory processing into a slow visual-language stream for low-frequency perceptual reasoning and a fast tactile control stream for high-frequency physical interaction understanding, achieving real-time closed-loop responses within 0.04 s. Real-world experiments thoroughly validate the effectiveness of AT-VLA in contact-rich manipulation tasks. The project page is available at: https://sites.google.com/view/at-vla.

2605.07111 2026-05-19 cs.CL cs.AI

Beyond LoRA vs. Full Fine-Tuning: Gradient-Guided Optimizer Routing for LLM Adaptation

Haozhan Tang, Xiuqi Zhu, Xinyin Zhang, Boxun Li, Virginia Smith, Kevin Kuo

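Affiliations * Carnegie Mellon University; Tsinghua University; Infinigence AI

AI summary This paper proposes Mixture of LoRA and Full (MoLF) Fine-Tuning, a framework that dynamically routes updates at the optimizer level to navigate continuously between the two training regimes, improving LLM adaptation performance.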

Details
English abstract

Recent literature on fine-tuning Large Language Models highlights a fundamental debate. While Full Fine-Tuning (FFT) provides the representational plasticity required for high-entropy knowledge injection, Low-Rank Adaptation (LoRA) can match or surpass FFT performance because many tasks only require updates in a low-rank space and benefit from LoRA's additional regularization. Through empirical evaluation across diverse tasks (SQL, Medical QA, and Counterfactual Knowledge) and varying language models (Gemma-3-1B, Qwen2.5-1.5B, and Qwen2.5-3B), we verify both trends and demonstrate that relying solely on either static architecture is structurally limited. To address this challenge, we propose Mixture of LoRA and Full (MoLF) Fine-Tuning, a unified framework that enables continuous navigation between both training regimes. MoLF dynamically routes updates between FFT and LoRA at the optimizer level to ensure that exact gradient signals are available to both experts throughout training, yielding stable training dynamics. For memory-constrained environments, we also introduce MoLF-Efficient, which freezes base weights and only routes updates among a pair of LoRA experts of potentially varying rank. Our evaluations show that MoLF either improves on or stays within 1.5% of the better of FFT and LoRA across all settings, while MoLF-Efficient outperforms prior adaptive LoRA approaches by up to 20% on Fact and 9% on Med and SQL. Our code is open-sourced at https://github.com/11785T23/molf.git.

2605.06506 2026-05-19 cs.CL

The Frequency Confound in Language-Model Surprisal and Metaphor Novelty

Omar Momen, Sina Zarrieß

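Affiliations * CRC 1646 – Linguistic Creativity in Communication

AI summary This study examines the relationship between language-model surprisal and metaphor novelty. It finds that word frequency predicts metaphor novelty better than surprisal does, and that the surprisal-novelty association peaks early in training and then declines as the surprisal-frequency association rises. This suggests that commonly reported optimal LM surprisal settings may incorrectly link contextual predictability to metaphor novelty and processing difficulty, whereas word frequency may be the main underlying factor.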

Comments to be presented and published at the 15th Joint Conference on Lexical and Computational Semantics (*SEM 2026)

Details
English abstract

Language-model (LM) surprisal is widely used as a proxy for contextual predictability and has been reported to correlate with metaphor novelty judgments. However, surprisal is tightly intertwined with lexical frequency. We explore this interaction on metaphor novelty ratings using two different word frequency measures. We analyse surprisal estimates from eight Pythia model sizes and 154 training checkpoints. Across settings, word frequency is a stronger predictor of metaphor novelty than surprisal. Across training stages, the surprisal--novelty association peaks at an early stage and then falls again, mirroring a similarly timed increase in the surprisal--frequency association. These results suggest that the often-reported optimal LM surprisal settings may incorrectly associate contextual predictability with metaphor novelty and processing difficulty, whereas lexical frequency may be the major underlying factor.
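
Editor's sketch: surprisal is the negative log-probability a language model assigns to a word in context. The snippet uses base 2 (bits) and hypothetical helper names; obtaining the probabilities from a Pythia checkpoint is out of scope here.

```python
import math


def surprisal_bits(probability):
    """Surprisal of one word given its in-context probability, in bits."""
    return -math.log2(probability)


def mean_surprisal(probabilities):
    """Average surprisal over the words (or subword tokens) of a span."""
    return sum(surprisal_bits(p) for p in probabilities) / len(probabilities)
```

Rare words tend to receive low probabilities in any context, which is one reason surprisal and lexical frequency are hard to separate.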

2605.03409 2026-05-19 cs.AI

Robust Agent Compensation (RAC): Teaching AI Agents to Compensate

Srinath Perera, Kaviru Hapuarachchi, Frank Leymann, Rania Khalaf

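Affiliations * University of Stuttgart

AI summary This work proposes RAC, a log-based recovery paradigm implemented as an architectural extension that provides a safety net and can be applied to most agent frameworks to support reliable execution. RAC can be enabled without modifying existing agent code, using the extension points that most agent frameworks already expose. Validated on τ-bench and REALM-Bench, it is reported to outperform state-of-the-art LLM-based recovery approaches in latency and token economy when solving complex problems.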

Comments Accepted at ACM Conference on AI and Agentic Systems (ACM CAIS 2026)

Details
English abstract

We present Robust Agent Compensation (RAC), a log-based recovery paradigm (providing a safety net) implemented through an architectural extension that can be applied to most agent frameworks to support reliable executions (avoiding unintended side effects). Users can choose to enable RAC without changing their current agent code (e.g., LangGraph agents). The proposed approach can be implemented in most existing agent frameworks via their existing extension points. We present an implementation based on LangChain, demonstrate its viability through τ-bench and REALM-Bench, and show that when solving complex problems, RAC is 1.5 to 8 times better, or more, in both latency and token economy compared to state-of-the-art LLM-based recovery approaches.
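
Editor's sketch: the general idea of log-based compensation, where each completed action is logged with a function that undoes it and a failure triggers the undo functions in reverse order. This is a generic pattern with hypothetical names, not RAC's API or its LangChain integration, neither of which the abstract describes.

```python
class CompensationLog:
    """Append-only log of completed actions and how to undo each one."""

    def __init__(self):
        self._entries = []

    def run(self, name, action, compensate):
        """Execute an action, then record its compensating function."""
        result = action()
        self._entries.append((name, compensate))
        return result

    def rollback(self):
        """Undo recorded actions in reverse order; return the names undone."""
        undone = []
        while self._entries:
            name, compensate = self._entries.pop()
            compensate()
            undone.append(name)
        return undone


# Hypothetical usage: two side effects on an in-memory inventory.
inventory = {"widgets": 10}
log = CompensationLog()
log.run("reserve",
        lambda: inventory.update(widgets=inventory["widgets"] - 2),
        lambda: inventory.update(widgets=inventory["widgets"] + 2))
log.run("ship",
        lambda: inventory.update(shipped=2),
        lambda: inventory.pop("shipped"))
undone = log.rollback()  # suppose a later step failed, so undo everything
```

Because recovery replays logged compensations rather than asking a model to reason about the failure, it needs no extra LLM calls, which is consistent with the latency and token savings the abstract reports.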

2605.02198 2026-05-19 cs.CV

SlimDiffSR: Toward Lightweight and Efficient Remote Sensing Image Super-Resolution via Diffusion Model Distillation

Ce Wang, Zhenyu Hu, Wanjie Sun

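Affiliations * School of Remote Sensing and Information Engineering, Wuhan University

AI summary This paper proposes SlimDiffSR, a lightweight and efficient diffusion-based framework for remote sensing image super-resolution that uses an uncertainty-guided timestep assignment strategy and a structured pruning strategy to improve efficiency and reconstruction quality.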

Details
English abstract

Diffusion models have recently achieved remarkable performance in image super-resolution (SR), but their high computational cost limits practical deployment in remote sensing applications. To address this issue, we propose SlimDiffSR, a lightweight and efficient diffusion-based framework for real-world remote sensing image super-resolution. Unlike existing single-step diffusion methods that rely on fixed timesteps, we first introduce an uncertainty-guided timestep assignment strategy to construct a stronger single-step teacher model, where reconstruction difficulty is explicitly linked to diffusion timesteps, enabling adaptive generative strength. Building upon this teacher, we further present a structured pruning strategy tailored to remote sensing imagery, which systematically removes redundant semantic modules and replaces standard operations with lightweight designs, including frequency-separable convolution, direction-separable convolution, and a query-driven global aggregation module. These components explicitly exploit the unique characteristics of remote sensing data, such as sparse high-frequency details, strong directional patterns, and long-range spatial dependencies. To enhance knowledge transfer, we incorporate Maximum Mean Discrepancy (MMD) into the distillation process to align feature distributions between the teacher and student models. Extensive experiments on multiple remote sensing benchmarks demonstrate that SlimDiffSR achieves a favorable balance between efficiency and reconstruction quality. In particular, it attains up to 200× inference acceleration and a 20× reduction in model parameters compared with multi-step diffusion models, while achieving competitive perceptual quality and clearly outperforming existing lightweight diffusion baselines in efficiency. The code is available at: https://github.com/wwangcece/SlimDiffSR.
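
Editor's sketch: a squared Maximum Mean Discrepancy between two sets of feature vectors, as could be used to align teacher and student features. The RBF kernel, the bandwidth, and the biased estimator are assumptions; the abstract does not state which kernel or estimator SlimDiffSR uses.

```python
import math


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between samples xs and ys."""
    def mean_kernel(a, b):
        return sum(rbf_kernel(u, v, bandwidth) for u in a for v in b) / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)
```

The estimate is zero when the two sample sets coincide and grows as their distributions move apart, so minimizing it pulls student features toward the teacher's distribution rather than toward individual teacher outputs.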

2604.25525 2026-05-19 cs.CL cs.HC

From Chatbots to Confidants: A Cross-Cultural Study of LLM Adoption for Emotional Support

Natalia Amat-Lefort, Mert Yazan, Amanda Cercas Curry, Flor Miriam Plaza-del-Arco

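Affiliations * Leiden University; Hogeschool van Amsterdam; Independent Researcher

AI summary This study examines the adoption of LLMs for emotional support across countries and the factors that drive it. A large-scale cross-cultural survey finds that socioeconomic status is the strongest predictor, and a corpus of multilingual prompts shows what users mainly seek help with.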

Comments 28 pages (9 pages main text, 19 pages references and appendices), 14 figures. The first two authors contributed equally

Details
English abstract

Large Language Models (LLMs) are increasingly used not only for instrumental tasks, but as always-available and non-judgmental confidants for emotional support. Yet what drives adoption and how users perceive emotional support interactions across countries remains unknown. To address this gap, we present the first large-scale cross-cultural study of LLM use for emotional support, surveying 4,641 participants across seven countries (USA, UK, Germany, France, Spain, Italy, and The Netherlands). Our results show that adoption rates vary dramatically across countries (from 20% to 59%). Using mixed models that separate cultural effects from demographic composition, we find that being aged 25-44, religious, married, and of higher socioeconomic status are predictors of positive perceptions (trust, usage, perceived benefits), with socioeconomic status being the strongest. English-speaking countries consistently show more positive perceptions than Continental European countries. We further collect a corpus of 731 real multilingual prompts from user interactions, showing that users mainly seek help for loneliness, stress, relationship conflicts, and mental health struggles. Our findings reveal that LLM emotional support use is shaped by a complex sociotechnical landscape and call for a broader research agenda examining how these systems can be developed, deployed, and governed to ensure safe and informed access.

2604.24763 2026-05-19 cs.CV

Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

Tuna-2:像素嵌入在多模态理解和生成中优于视觉编码器

Zhiheng Liu, Weiming Ren, Xiaoke Huang, Shoufa Chen, Tianhong Li, Mengzhao Chen, Yatai Ji, Sen He, Jonas Schult, Belinda Zeng, Tao Xiang, Wenhu Chen, Ping Luo, Luke Zettlemoyer, Yuren Cong

Affiliations * Meta AI; The University of Hong Kong; University of Waterloo

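AI summary This paper proposes Tuna-2, a unified multimodal model built on pixel embeddings. By using pixel embeddings directly for multimodal understanding and generation, it shows that unified pixel-space modeling can compete with latent-space methods on high-quality image generation, and that pretrained vision encoders are not necessary for multimodal modeling.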

Comments Project page: https://tuna-ai.org/tuna-2

Details
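AI abstract (translated from Chinese)

Unified multimodal models usually rely on pretrained vision encoders and use separate visual representations for understanding and generation, which creates inconsistency between the two tasks and prevents end-to-end optimization from raw pixels. We introduce Tuna-2, a native unified multimodal model that performs visual understanding and generation directly on pixel embeddings. Tuna-2 greatly simplifies the model architecture by encoding visual input with simple patch embedding layers, completely discarding modular vision encoder designs such as the VAE or the representation encoder. Experiments show that Tuna-2 achieves state-of-the-art performance on multimodal benchmarks, demonstrating that unified pixel-space modeling can compete with latent-space methods on high-quality image generation. In addition, although the encoder-based variant converges faster in early pretraining, Tuna-2's encoder-free design achieves stronger multimodal understanding at scale, especially on tasks that require fine-grained visual perception. These results indicate that pretrained vision encoders are not necessary for multimodal modeling, and that end-to-end pixel-space learning offers a scalable path toward stronger visual representations for generation and perception.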

English abstract

Unified multimodal models typically rely on pretrained vision encoders and use separate visual representations for understanding and generation, creating misalignment between the two tasks and preventing fully end-to-end optimization from raw pixels. We introduce Tuna-2, a native unified multimodal model that performs visual understanding and generation directly based on pixel embeddings. Tuna-2 drastically simplifies the model architecture by employing simple patch embedding layers to encode visual input, completely discarding the modular vision encoder designs such as the VAE or the representation encoder. Experiments show that Tuna-2 achieves state-of-the-art performance in multimodal benchmarks, demonstrating that unified pixel-space modelling can fully compete with latent-space approaches for high-quality image generation. Moreover, while the encoder-based variant converges faster in early pretraining, Tuna-2's encoder-free design achieves stronger multimodal understanding at scale, particularly on tasks requiring fine-grained visual perception. These results show that pretrained vision encoders are not necessary for multimodal modelling, and end-to-end pixel-space learning offers a scalable path toward stronger visual representations for both generation and perception.
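The abstract says Tuna-2 encodes visual input with "simple patch embedding layers" instead of a pretrained encoder. As a rough illustration only, the sketch below shows what a generic patch embedding does: it cuts an image into non-overlapping patches, flattens each one, and applies a single linear projection. The patch size, embedding width, random weights, and function names are all assumptions made for this example and are not taken from the paper.

```python
import random


def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    flat.extend(image[y][x])  # append the C channel values of one pixel
            patches.append(flat)
    return patches


def embed_patches(patches, weight, bias):
    """Apply one linear projection per patch; weight has shape dim x patch_len."""
    return [
        [sum(w * v for w, v in zip(row, p)) + b for row, b in zip(weight, bias)]
        for p in patches
    ]


def demo():
    """Embed a random 4x4 RGB image with 2x2 patches into 8-dimensional tokens."""
    rng = random.Random(0)
    height, width, channels, patch, dim = 4, 4, 3, 2, 8  # hypothetical sizes
    image = [
        [[rng.random() for _ in range(channels)] for _ in range(width)]
        for _ in range(height)
    ]
    patch_len = patch * patch * channels
    # Hypothetical, randomly initialised projection (a trained model would learn it).
    weight = [[rng.gauss(0.0, 0.02) for _ in range(patch_len)] for _ in range(dim)]
    bias = [0.0] * dim
    return embed_patches(patchify(image, patch), weight, bias)


if __name__ == "__main__":
    tokens = demo()
    print(len(tokens), "tokens of width", len(tokens[0]))
```

In this sketch the only learned parameters between raw pixels and the token sequence are the projection weights, which is the general sense in which such a design is encoder-free; the actual Tuna-2 layers may differ.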

2604.23355 2026-05-19 cs.AI

LEGO: An LLM Skill-Based Front-End Design Generation Platform

LEGO: 一个基于LLM技能的前端设计生成平台

Jincheng Lou, Ruohan Xu, Jiecheng Ma, Runzhe Tao, Xinyu Qu, Yibo Lin

Affiliations * School of IC, Peking University; School of EECS, Peking University; School of Microelectronics, Xidian University; Institute of EDA, Peking University; Beijing Advanced Innovation Center for IC

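AI summary This paper proposes the LEGO platform, which decomposes the digital front-end flow into six independent steps and represents each agent capability as a standardized, composable circuit skill. This enables efficient front-end design generation, with reported Pass@1 rising from 0.000 to 0.805 on a hard subset of 41 VerilogEval v2 problems.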

Comments Accepted to ISEDA 2026. Best Paper Nomination. 7 pages, 3 figures

Details
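AI abstract (translated from Chinese)

Existing LLM-based EDA agents are often isolated, task-specific systems. This leads to duplicated engineering effort and limited reuse of successful design and debugging strategies. We present LEGO, a unified skill-based platform for front-end design generation. It decomposes the digital front-end flow into six independent steps and represents each agent capability as a standardized, composable circuit skill within a plug-and-play architecture. To build this skill library, we surveyed more than 100 papers, selected 11 representative open-source projects, and extracted 42 executable circuit skills within a six-step finite state machine formulation. Circuit Skill Builder automates skill extraction with linear scalability. Agent Skill RAG achieves sub-millisecond retrieval without relying on embedding models. Empirical evaluation on a hard subset of 41 VerilogEval v2 problems shows that individual circuit skills built within LEGO raise Pass@1 from 0.000 to 0.805. This is an 80.5% gain over the baseline. Cross-project skill compositions also reach a Pass@1 of 0.805. They outperform hierarchy-verilog by 14.6% and VerilogCoder by 2.5%. They also match MAGE. These results indicate that modular skill composition supports effective and flexible RTL design automation. The LEGO platform and all circuit skills are publicly available on GitHub: https://github.com/loujc/LEGO-An-LLM-Skill-Based-Front-End-Design-Generation-Platform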

English abstract

Existing LLM-based EDA agents are often isolated task-specific systems. This leads to repeated engineering effort and limited reuse of successful design and debugging strategies. We present LEGO, a unified skill-based platform for front-end design generation. It decomposes the digital front-end flow into six independent steps and represents every agent capability as a standardized composable circuit skill within a plug-and-play architecture. To build this skill library, we survey more than 100 papers, select 11 representative open-source projects, and extract 42 executable circuit skills within a six-step finite state machine formulation. Circuit Skill Builder automates skill extraction with linear scalability. Agent Skill RAG achieves submillisecond retrieval without relying on embedding models. Empirical evaluation on a hard subset of 41 VerilogEval v2 problems that gpt-5.2-codex fails to solve under extra-high reasoning effort shows that individual circuit skills constructed within LEGO raise Pass@1 from 0.000 to 0.805. This is an 80.5% gain over the baseline. Cross-project skill compositions also reach 0.805 Pass@1. They outperform hierarchy-verilog by 14.6% and VerilogCoder by 2.5%. They also match MAGE. These results show that modular skill composition supports both effective and flexible RTL design automation. The LEGO platform and all circuit skills are publicly available at GitHub: https://github.com/loujc/LEGO-An-LLM-Skill-Based-Front-End-Design-Generation-Platform
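The Pass@1 figures in the abstract can be sanity-checked with a little arithmetic. The sketch below assumes, as a simplification the abstract does not confirm, that Pass@1 here is the fraction of the 41 problems solved with one attempt each. Under that assumption it finds which solved counts round to 0.805, and it expresses the change from 0.000 to 0.805 in percentage points, since a relative gain over a zero baseline is undefined. The function names are hypothetical.

```python
def consistent_solved_counts(total, reported, digits=3):
    """Return every solved count k whose k/total rounds to the reported Pass@1."""
    return [
        k
        for k in range(total + 1)
        if abs(round(k / total, digits) - reported) < 1e-9
    ]


def percentage_point_gain(before, after):
    """Absolute change between two rates, expressed in percentage points."""
    return (after - before) * 100


if __name__ == "__main__":
    # Assumption: one attempt per problem, so Pass@1 = solved / 41.
    print(consistent_solved_counts(41, 0.805))          # [33]
    print(round(percentage_point_gain(0.0, 0.805), 1))  # 80.5
```

Read this way, 0.805 corresponds to 33 of 41 problems, and the reported "80.5% gain" is most naturally understood as 80.5 percentage points.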