arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.30925 2026-06-01 cs.CV cs.GR

MultiAct: Text-to-Motion Generation from Composite Text via Tailored Attention Guidance

Nathan Sala, Ofir Abramovich, Ariel Shamir, Daniel Cohen-Or, Andreas Aristidou, Sigal Raab

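AI summary: Proposes MultiAct, an inference-time framework that requires no retraining or architectural changes and addresses incomplete semantic coverage in composite text-to-motion generation by adaptively amplifying the cross-attention scores of underrepresented prompt components.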

Comments: Accepted to SIGGRAPH 2026 conference. Project page: https://natsala13.github.io/multiact.github.io

Abstract

Text-to-motion generation has progressed rapidly in recent years, offering an expressive interface for animation and human-computer interaction. However, current models remain brittle when handling prompts that describe multiple actions occurring at the same time. Rather than realizing all components of a composite description, models frequently prioritize a single dominant action and neglect the rest, leading to incomplete or ambiguous motion. We present MultiAct, an unpaired, inference-time framework for compositional text-to-motion synthesis that operates directly on pretrained motion generators without retraining or architectural modification. Our method counteracts semantic collapse by adaptively amplifying cross-attention scores associated with underrepresented prompt components. We note that effective modulation depends on prompt-specific choices, such as which tokens and layers to target, and introduce a lightweight auxiliary decision scheme that determines the most effective attention-strengthening parametrization. Extensive quantitative and qualitative evaluations demonstrate that MultiAct consistently outperforms existing baselines on composite prompts, achieving improved semantic coverage while preserving motion realism. Project page: https://natsala13.github.io/multiact.github.io.
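
As a rough illustration of what amplifying cross-attention scores for an underrepresented prompt component can look like, the sketch below adds a bias to the pre-softmax scores of selected text tokens. The function names, the additive form of the boost, and the fixed `boost` value are assumptions made for illustration; the abstract does not specify how MultiAct modulates attention or how its decision scheme chooses tokens and layers.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def boosted_attention(scores, target_tokens, boost=1.0):
    """Add `boost` to the pre-softmax scores of the chosen text tokens.

    scores: cross-attention logits of one motion frame over the prompt tokens.
    target_tokens: indices of tokens judged underrepresented (hypothetical input).
    """
    adjusted = [s + boost if i in target_tokens else s
                for i, s in enumerate(scores)]
    return softmax(adjusted)

# Prompt tokens: ["walk", "while", "waving"]; "walk" dominates the raw scores.
raw = [2.0, 0.1, 0.5]
before = softmax(raw)
after = boosted_attention(raw, target_tokens={2}, boost=1.5)
```

In this toy case the boosted token gains attention mass at the expense of the dominant one, and the weights still sum to one because the boost is applied before the softmax.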

2605.30924 2026-06-01 cs.CL

EMBGuard: Constructing Hazard-Aware Guardrails for Safe Planning in Embodied Agents

Dongwook Choi, Taeyoon Kwon, Bogyung Jeong, Minju Kim, Yeonjun Hwang, Hyojun Kim, Byungchul Kim, Young Kyun Jang, Jinyoung Yeo

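AI summary: Proposes EMBGuard, the first MLLM-based safety guardrail for embodied agents. It decouples physical risk reasoning from the agent policy, evaluates (visual observation, action) pairs to identify hazardous configurations, and explains the risks in natural language. The work also contributes the EMBHazard training dataset and the EMBGuardTest benchmark; despite its compact model size, EMBGuard performs competitively with proprietary MLLMs while lowering false-positive rates.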

Comments: Accepted at ICML 2026

Abstract

MLLM-powered embodied agents deployed in real-world environments encounter physical hazards. However, existing approaches lack explicit mechanisms for identifying hazards and reasoning about action-conditioned risks, leading agents to either miss risky interactions or over-identify risks. To address this, we propose EMBGuard, the first MLLM-based safety guardrail for embodied agents designed to decouple physical risk reasoning from agent policy. By evaluating a (visual observation, action) pair, EMBGuard identifies hazardous configurations and provides natural language explanations of potential risks. Alongside EMBGuard, we contribute EMBHazard, a training dataset of 15.1K action-conditioned pairs, and EMBGuardTest, a benchmark of 329 manually curated real-world scenarios spanning seven physical risk categories. Through compositional variation of hazards and actions, we generate diverse risky and benign scenarios that agents may encounter during planning. Despite its compact size (2B, 4B), EMBGuard achieves performance competitive with proprietary MLLMs (e.g., GPT-5.1, Gemini-2.5-Pro) while significantly reducing the false-positive rates that hinder real-time deployment. We make the code, data, and models publicly available at https://github.com/dongwxxkchoi/EMBGuard

2605.30920 2026-06-01 cs.LG

Unsupervised Diffusion Solver for Combinatorial Optimization via Combinatorial Adjoint Matching

Shengyu Feng, Tarun Suresh, Yiming Yang

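AI summary: Proposes the Combinatorial Adjoint Matching (CAM) framework, which uses discrete adjoint dynamics and a stochastic control formulation to train discrete diffusion solvers without supervision, reaching performance competitive with supervised methods on a range of combinatorial optimization problems.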

Comments: ICML26

Abstract

Diffusion-based neural solvers have shown strong promise for combinatorial optimization (CO), but existing methods typically rely on supervised training with large collections of near-optimal solutions. In this work, we extend adjoint-based trajectory optimization methods to discrete combinatorial domains. We formulate diffusion-based CO as a stochastic control problem over Continuous-Time Markov Chains and introduce discrete adjoint dynamics for propagating optimization signals through discrete generative trajectories. Building on this formulation, we propose Combinatorial Adjoint Matching (CAM), an unsupervised training framework for discrete diffusion solvers with structured and low-variance trajectory-level optimization signals. Empirically, CAM consistently outperforms existing unsupervised diffusion baselines and achieves performance competitive with strong supervised diffusion solvers and even traditional solvers across diverse combinatorial optimization problems. Our code is available at https://github.com/Shengyu-Feng/CAM.

2605.30919 2026-06-01 cs.LG cs.AI

De-attribute to Forget for LLM Unlearning

Xinyang Lu, Jiabao Pan, Rachael Hwee Ling Sim, See-Kiong Ng, Anthony Kum Hoe Tung, Bryan Kian Hsiang Low

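AI summary: Proposes DareU, an LLM unlearning framework based on data attribution rewards, which uses reinforcement learning to lower the attribution score between generated responses and the forget data, achieving effective unlearning while balancing model utility.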

Abstract

The rapid development of large language models (LLMs) has raised concerns about the use of inappropriate data for training, which has led to growing interest in LLM unlearning. Many existing LLM unlearning approaches rely on optimizing prediction losses, such as maximizing the loss on the forget set, but often face critical issues like over-forgetting and poor model utility. To address these issues, this paper instead frames the optimization objective for LLM unlearning as one of zeroing out data attribution. In particular, we propose DareU, the first LLM unlearning framework based on data attribution rewards, which uses reinforcement learning to update the LLM by reducing the attribution score of its generated responses to the forget data owners (i.e., de-attributing). Empirical evaluation using an LLM classifier as an efficient approximation of attribution shows that DareU outperforms existing baselines, achieving effective unlearning while balancing forget quality and model utility well.

2605.30917 2026-06-01 cs.IR cs.CV

Inference-Free Multimodal Learned Sparse Retrieval for Production-Scale Visual Document Search

Gyu-Hwung Cho, Youngjune Lee, Kiyoon Jeong, Siyoung Lee, Sanggyu Han, Hervé Dejean, Stéphane Clinchant, Seung-won Hwang

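AI summary: Proposes V-SPLADE, an inference-free sparse retriever that addresses the lexical grounding problem in visual sparse representations through caption-gated token supervision and reaches dense-level effectiveness in visual-document retrieval.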

Comments: 12 pages, 5 figures, 12 tables, preprint

Abstract

As large-scale visual-document corpora such as arXiv papers and enterprise PDFs continue to grow, visual-document retrieval has gained increasing attention; yet it still lacks a deployable system that lexically indexes visual documents to serve queries without neural encoding at scale. Existing methods either achieve strong retrieval quality with VLM-based dense or multi-vector models but require neural query encoding at serving time, or avoid query encoding with OCR- or caption-based BM25 at the cost of time-consuming text extraction or generation. To fill this missing serving regime, we present V-SPLADE, an inference-free sparse retriever for visual-document retrieval. However, such inference-free multimodal learned sparse retrieval systems remain underexplored and have not yet shown dense-level effectiveness under high sparsity. We attribute this limitation to a lexical grounding problem: visual sparse representations often fail to capture the lexical content embedded in document images. To address this problem, we introduce caption-gated token supervision, a training-only signal that uses VLM-generated captions as lexical cues to activate retrieval-relevant vocabulary dimensions. With this supervision, V-SPLADE improves average NDCG@5 across six visual-document retrieval benchmarks by +13.8pp over the same-scale dense baseline and by up to +6.3pp over OCR- or caption-based BM25 baselines. On an 18.7M-document corpus, it more than doubles R@5 over the same-scale dense baseline and further improves competing retrievers through score fusion by up to +2.4pp R@5. Code will be released soon at https://github.com/naver/v-splade.
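
To make the "inference-free" serving regime concrete, the sketch below scores a query against an inverted index by looking up stored term weights, with no neural model at query time. The document weights are hand-written stand-ins for what an offline document encoder might produce, and the whitespace tokenizer and function names are assumptions of this sketch, not details taken from V-SPLADE.

```python
from collections import defaultdict

def build_inverted_index(doc_term_weights):
    """Map each vocabulary term to a list of (doc_id, weight) postings."""
    index = defaultdict(list)
    for doc_id, weights in doc_term_weights.items():
        for term, weight in weights.items():
            index[term].append((doc_id, weight))
    return index

def search(index, query, top_k=5):
    """Score documents by summing the stored weights of the query's terms.

    The query is only split on whitespace: no neural model runs at query time.
    """
    scores = defaultdict(float)
    for term in query.lower().split():
        for doc_id, weight in index.get(term, []):
            scores[doc_id] += weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Hypothetical sparse vectors that an offline document encoder might produce.
docs = {
    "invoice_page": {"invoice": 2.1, "total": 1.4, "tax": 0.9},
    "paper_page": {"retrieval": 1.8, "sparse": 1.6, "total": 0.2},
}
index = build_inverted_index(docs)
results = search(index, "sparse retrieval")
```

All learned computation sits on the document side and happens before serving, which is why the query path can reuse standard inverted-index infrastructure.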

2605.30916 2026-06-01 cs.LG cs.GT econ.TH

Welfare, Improvability, and Variance: A Principal-Agent Approach to Optimal Benchmark Item Aggregation

Andreas Haupt, Justin Hartenstein, Anka Reuel, Mykel Kochenderfer, Sanmi Koyejo

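AI summary: Models benchmarking as a multitask principal-agent game, evaluates items along three dimensions (welfare, improvability, and variance), and applies the framework to OLMES to identify Pareto-inferior items.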

Abstract

AI benchmarks have well-documented limitations, with prior work examining contamination, saturation, and construct underspecification. Aggregation has received far less attention: benchmarks are typically summarized by uniformly averaging item-level scores, implicitly treating every test item as equally valuable. We model benchmarking as a multitask principal-agent game and show that the welfare loss from a benchmark is determined jointly by three item-level primitives: alignment with normative welfare priorities, marginal improvability, and performance variance. We translate the theory into an audit framework that ranks items along each of these three axes, and apply it to OLMES items using WORKBank for welfare, the EvoLM 4B suite for improvability, and the PolyPythias 410M panel for variance. The framework surfaces items that are Pareto-inferior within OLMES subject to a pro-worker welfare operationalization. All code is available at https://github.com/stair-lab/principal-agent-benchmarks.
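
One way to read "Pareto-inferior" items is as items that some other item matches or beats on all three axes at once. The sketch below implements that dominance check on made-up scores. The direction of each axis (higher welfare alignment, higher improvability, and lower variance count as better) and all item names and values are assumptions of this illustration, not the paper's audit procedure.

```python
def dominates(a, b):
    """True if item a is at least as good as b on every axis and differs on one.

    Each item is (welfare, improvability, variance); the direction of each
    axis (higher, higher, lower is better) is an assumption of this sketch.
    """
    a_key = (a[0], a[1], -a[2])
    b_key = (b[0], b[1], -b[2])
    return all(x >= y for x, y in zip(a_key, b_key)) and a_key != b_key

def pareto_inferior(items):
    """Return the names of items dominated by at least one other item."""
    return sorted(
        name for name, score in items.items()
        if any(dominates(other, score)
               for other_name, other in items.items() if other_name != name)
    )

# Made-up scores for three hypothetical benchmark items.
items = {
    "item_a": (0.9, 0.6, 0.05),
    "item_b": (0.4, 0.3, 0.20),  # worse than item_a on all three axes
    "item_c": (0.2, 0.8, 0.10),
}
inferior = pareto_inferior(items)
```

Here only `item_b` is flagged: `item_c` has low welfare alignment but the highest improvability, so no other item dominates it.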

2605.30914 2026-06-01 cs.LG cs.SE

Automating Formal Verification with Reinforcement Learning and Recursive Inference

Max Tan

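AI summary: Studies how reinforcement learning from verifiable rewards and verifier-guided inference-time search improve LLM generation of verified programs and proofs, reporting notable gains on Dafny and Lean.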

Comments: Master's thesis, 140 pages, 16 figures, 17 tables

Abstract

Automated formal verification remains challenging for large language models because data for proof assistants and verification-aware languages is scarce, and correctness depends on satisfying precise machine-checkable specifications rather than producing plausible code. This thesis studies how verifier environments can improve LLM generation of verified programs and proofs through reinforcement learning from verifiable rewards (RLVR) and verifier-guided inference-time search. First, we train open-source models in Dafny with RLVR using Group Relative Policy Optimization (GRPO) and related variants, assembling generated candidates into complete programs and scoring them with compiler and verifier outcomes. Initial experiments on an APPS-derived Dafny dataset increased verified reward from 2.2% to 58.1%, but revealed specification hacking, where models exploit weak formal specifications instead of implementing the intended solutions. After filtering underspecified and vulnerable tasks, multi-turn RLVR on the refined benchmark improves the verified pass rate from 9.7% to 31.1%. Second, we develop a verifier-guided inference scaffold in Lean that treats proof generation as structured search over decomposed subgoals, verifier feedback, diagnostics, and repair. With a fixed base model, the full scaffold with proof reviser improves pass rate on an initial VeriCoding pilot set from 46.2% under direct repair to 69.2%. On the larger VERINA dataset, whole-task decomposition plus proof reviser solves 7 of 42 previously unsolved tasks. We also introduce Dalek-Bench, a repository-scale Lean benchmark derived from the Rust $\texttt{curve25519-dalek}$ verification project; preliminary results remain weak, indicating that stronger progress evaluation and task-specific tool-use policies are still needed.
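
The core of Group Relative Policy Optimization is that each sampled response is scored relative to the other responses for the same task. The sketch below shows that standardization step under simple assumptions: a binary reward (1.0 if the verifier accepts the program, 0.0 otherwise) and a small `eps` to avoid division by zero. The thesis scores candidates with compiler and verifier outcomes, so the exact reward design here is illustrative only.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of samples for the same task."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four candidate programs for one task; 1.0 means the verifier accepted it.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where every candidate fails carries no learning signal.
all_failed = group_relative_advantages([0.0, 0.0, 0.0])
```

Verified candidates get positive advantages and rejected ones negative, while a group with identical rewards yields all zeros. This also hints at why weak specifications are dangerous: a program that passes verification without solving the task is rewarded exactly like a correct one.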

2605.30913 2026-06-01 cs.CL cs.AI cs.CY cs.HC

Toxic HallucinAItions: Perturbing Prompts and Tracing LLM Circuits

Soorya Ram Shimgekar, Agam Goyal, Amruta Parulekar, Joshua Chen, Yian Wang, Navin Kumar, Hari Sundaram, Eshwar Chandrasekharan, Koustuv Saha

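AI summary: Studies how toxic-language perturbations affect the factual reliability of LLMs, finds that toxic wording lowers accuracy and increases uncertainty, and uses attribution-graph analysis to examine the internal mechanisms.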

Abstract

Large language models (LLMs) are increasingly deployed in conversational settings where user tone ranges from polite to adversarial or toxic, yet less is known about whether toxic language in otherwise semantically equivalent prompts can degrade factual reliability. We study how lexical and tone-based prompt perturbations affect the factual reliability of LLMs. Using controlled prompt variations across polite, random, and three toxicity levels, we evaluate five LLMs on ARC-Easy, GSM8K, and MMLU. We find that toxic lexical perturbations consistently reduce factual accuracy and increase uncertainty, while polite phrasing yields limited and inconsistent changes. To examine whether these answer inconsistencies correspond to internal changes, we conduct attribution-graph analyses of model activations and influences. We find that increasing toxicity selectively amplifies perturbation-sensitive variant nodes while relatively stable core reasoning nodes remain more invariant. These findings position prompt tone as a critical dimension of LLM reliability and provide behavioral and mechanistic evidence that surface-level lexical variation can alter factual outputs and internal computation.

2605.30912 2026-06-01 cs.CV cs.CL

Attend to Evidence: Evidence-Anchored Spatial Attention Supervision for Multimodal RLVR

Ruina Hu, Chen Wang, Lai Wei, Jionghao Bai, Bin Yu, Weiran Huang, Kai Wang, Yue Wang

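AI summary: Proposes EASE, which converts annotated evidence regions into a smoothed visual-token target and uses it to guide response-to-image attention during multimodal RL training, improving vision-language models on perception, hallucination, visual math, and multimodal reasoning benchmarks.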

Abstract

Reinforcement learning with verifiable rewards (RLVR) improves vision-language models (VLMs) by optimizing outcome rewards derived from final answers. However, such outcome-only rewards do not tell the model which image regions justify an answer. For questions that require visual grounding, these rewards cannot distinguish responses supported by relevant visual evidence from those produced by language-prior shortcuts or lucky guesses. We introduce EASE (Evidence-Anchored Spatial Attention), which augments multimodal RLVR with visual-evidence process supervision. EASE converts annotated evidence regions into a smoothed visual-token target and uses it to guide response-to-image attention during RL training, but only on high-reward trajectories. The annotations are used solely as privileged training labels, while inference requires only the original image and question. Across Qwen2.5-VL-7B, Qwen3-VL-4B, and Qwen3-VL-8B, EASE raises average scores over DAPO by 2.5 to 3.1 points on perception, hallucination, visual math, and multimodal reasoning benchmarks. Diagnostics and ablations show that EASE better aligns visual attention with annotated evidence regions.
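
The phrase "smoothed visual-token target" can be pictured as a probability distribution over image patches that concentrates on the annotated region but leaves a little mass everywhere. The sketch below builds such a target from a box given in patch coordinates. The box format, the uniform smoothing, and the `smoothing` value are assumptions for illustration; the abstract does not describe how EASE constructs or smooths its target.

```python
def evidence_target(grid_h, grid_w, box, smoothing=0.1):
    """Turn a box over patch coordinates into a distribution over visual tokens.

    box: (row0, col0, row1, col1), inclusive patch indices of the evidence region.
    smoothing: share of probability mass spread uniformly over all patches.
    """
    row0, col0, row1, col1 = box
    inside = [
        row0 <= r <= row1 and col0 <= c <= col1
        for r in range(grid_h) for c in range(grid_w)
    ]
    n_inside = sum(inside)
    n_total = grid_h * grid_w
    if n_inside == 0:
        raise ValueError("box covers no patches")
    return [
        (1 - smoothing) * (1 / n_inside if hit else 0.0) + smoothing / n_total
        for hit in inside
    ]

# A 4x4 patch grid with evidence in the top-left 2x2 block.
target = evidence_target(4, 4, box=(0, 0, 1, 1))
```

A target like this could be compared with the model's response-to-image attention through a divergence term during training; at inference time nothing of the sort is needed, consistent with the abstract's statement that annotations are training-only labels.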

2605.30911 2026-06-01 cs.CV cs.AI

What Makes LVLMs Hallucinate Less? Unveiling the Architectural Factors Behind Hallucination Robustness

Yusheng He, Jizhe Zhou, Xia Du, Zheng Lin, Jun Luo, Jiancheng Lv

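AI summary: Decomposes architecture design into three dimensions (linguistic foundation, visual representation, and semantic alignment) and introduces the CoSimUE benchmark to study systematically how architectural factors affect LVLM hallucination robustness. It finds that scaling model parameters has limited effect, whereas stronger visual encoders, language foundations, and semantic alignment each reduce a different type of hallucination.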

Abstract

Hallucination remains one of the key challenges undermining the reliability of Large Vision-Language Models (LVLMs). But what makes an LVLM hallucinate less? Many existing efforts focus on improving internal components of the model. We argue that hallucination fundamentally stems from how the model architecture is designed. To investigate this, we factor the architecture design into three dimensions: Linguistic Foundation (LF), Visual Representation (VR), and Semantic Alignment (SA), and categorize hallucinations into Co-occurrence, Similarity, and previously overlooked Uncertainty types. Building on this formulation, we propose CoSimUE, a benchmark that creates fine-grained hallucination scenarios through controlled textual perturbations and random perturbations, enabling mapping between design choices and hallucination behaviors. Experiments across 7 design aspects show that: 1) the widely emphasized scaling of model parameters has only limited impact on reducing all three types of hallucinations; 2) larger and better-trained language foundations can reduce co-occurrence hallucinations; 3) stronger visual encoders and higher resolutions mitigate similarity errors; 4) effective alignment strategies alleviate uncertainty hallucinations; and 5) cross-dimensional analysis reveals that jointly enhancing visual fidelity and alignment quality yields the most comprehensive improvements. This study provides the first systematic exploration linking architecture-level design to hallucination robustness, offering practical guidance for developing reliable and efficient LVLMs.

2605.30910 2026-06-01 cs.LG

PINNs Failure Modes are Overfitting

Nigel T. Andersen, Takashi Matsubara

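AI summary: Shows by visualizing residuals that the failure modes of physics-informed neural networks stem from overfitting, and proposes an approach based on regularization and double backpropagation that removes them, achieving state-of-the-art performance on standard equations with fewer collocation points.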

Abstract

Physics-Informed Neural Networks (PINNs) are a common class of machine learning-based partial differential equation (PDE) solvers which train a network to represent a solution by minimizing a residual loss that encodes the PDE. Despite their successes, they are known to fail on certain simple equations, converging to an incorrect solution despite low loss. These failure modes have garnered significant attention in the literature over the past several years, motivating both architectural and optimization-based solutions. By directly visualizing the residual, we show that failure modes are the result of overfitting: the loss is minimized on the collocation points, but not elsewhere. Applying regularization causes the failure modes to vanish. Finally, we extend double backpropagation over the full set of residuals, and use it to achieve state-of-the-art performance on four standard failure mode equations with up to $23\times$ fewer collocation points and a vanilla architecture.
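
The claim that the loss is minimized on the collocation points but not elsewhere can be reproduced with a toy example that involves no neural network at all. For the equation $u'(x) = 0$ on $[0, 1]$, the candidate $u(x) = -\cos(2\pi N x) / (2\pi N)$ has residual $u'(x) = \sin(2\pi N x)$, which vanishes exactly at the evenly spaced points $x_i = i/N$ and nowhere in between. The equation, the candidate, and the value of `N` are choices made for this illustration and do not come from the paper.

```python
import math

N = 10  # number of collocation intervals (illustrative choice)

def residual(x):
    """Residual of u'(x) = 0 for the candidate u(x) = -cos(2*pi*N*x) / (2*pi*N)."""
    return math.sin(2 * math.pi * N * x)

def mean_squared_residual(points):
    return sum(residual(x) ** 2 for x in points) / len(points)

collocation = [i / N for i in range(N + 1)]    # where the loss is evaluated
held_out = [(i + 0.25) / N for i in range(N)]  # points between them

train_loss = mean_squared_residual(collocation)
test_loss = mean_squared_residual(held_out)
```

The training loss is zero up to floating-point error while the held-out residual is close to one, the same signature as overfitting in supervised learning: the candidate is far from a constant function even though the loss reports a perfect fit.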

2605.30907 2026-06-01 cs.SE cs.AI cs.CL cs.LG

BlueFin: Benchmarking LLM Agents on Financial Spreadsheets

Srivatsa Kundurthy, Clara Na, Colton Moraine, Anoushka Mohta, Case Winter, George Fang, John Ling, Emma Strubell, Zach Kirshner

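AI summary: Introduces the BlueFin benchmark, which evaluates LLM agents' synthesis, manipulation, and comprehension abilities on 131 realistic financial spreadsheet tasks, and validates the agreement between the LM judge and human experts.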

Comments: 26 pages

Abstract

We present BlueFin, a benchmark that tasks large language model (LLM) agents with synthesis, manipulation, and comprehension tasks over spreadsheet workbooks in the professional finance domain. Though estimates of the global population of paying users of spreadsheet software range in the hundreds of millions (an order of magnitude more than the estimated global population of professional developers), comparatively fewer resources have been devoted to exploring and expanding LLM capabilities in the spreadsheet domain, with fewer still dedicated to mirroring real occupational tasks encountered by those in professional finance roles. In response, we curate a set of 131 challenging, complex tasks with real-world relevance in the domain, containing 3,225 granular rubric criteria; notably, our rubric criteria and LM judge evaluations are validated by a team of expert human annotators, resulting in high-quality, granular evaluations of complex tasks that are difficult to verify programmatically but can be reliably evaluated by an LM judge agent. Our judge achieves parity with expert consensus ($\alpha=0.826$) with a macro-F1 score of 0.839. Frontier LLMs demonstrate poor performance on the challenging benchmark, with the strongest LLMs achieving less than 50% average scores across tasks; models exhibit particular weaknesses in dynamic correctness. Our contributions include a dataset of examples across three categories of spreadsheet tasks, an open source harness and agentic evaluation framework, and a characterization of existing frontier models' performance on our benchmark.

2605.30906 2026-06-01 cs.RO cs.SY eess.SY

Trajectory Planning for Non-Communicating Mobile Robots using Inverse Optimal Control

Nina Majer, Yannick Epple, Xin Ye, Stefan Schwab, Sören Hohmann

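AI summary: For efficient interaction of non-communicating mobile robots in collision avoidance scenarios, proposes a combined trajectory planning and prediction algorithm based on inverse optimal control, which estimates unknown goal states and solves a joint prediction problem so that the robots reach their goals sooner.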

Abstract

To enable efficient interaction of non-communicating mobile robots in collision avoidance scenarios, we present a novel combined trajectory planning and prediction algorithm. Inverse optimal control is used to estimate unknown goal states of all robots based on observed past trajectories. Each robot also takes the perspective of other robots in considering self-prediction and solves a joint prediction problem using the estimated goal states. The resulting predictions are then considered for planning. Simulation results of scenarios with 2-8 robots show that the median of the durations until all vehicles reach their goals is 9.8% shorter compared to planning with goal states estimated under a constant-acceleration assumption. Moreover, the proposed approach never leads to the solver being unable to find a solution to the planning or prediction problem.

2605.30905 2026-06-01 math.OC cs.LG

A Unifying View of Anchoring via Operator-Side Tikhonov Regularization

Zihao Chen

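AI summary: Shows that anchored fixed-point and monotone-equation methods can be constructed in a unified way by adding a vanishing Tikhonov regularization term to the operator queried by the base method, and analyzes residual convergence rates for four variants.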

Abstract

Anchored fixed point and monotone equation methods, including Halpern iteration, extra anchored gradient, and their relatives, add a vanishing pull toward a reference point to obtain last-iterate guarantees. Existing anchored variants often achieve sharp last-iterate guarantees, but from the update-level perspective the placement of the anchor can be algorithm-specific and conceptually opaque. We show that anchoring admits a single operator-side construction: regularize the operator queried by the base method with a vanishing Tikhonov term, then run the unmodified base method. Applied to the Picard iteration, this recipe reproduces the Halpern iteration; applied to the forward step, extragradient (EG), and past extragradient (PEG, also known as Popov's method), it yields three variants whose anchor placements inherit the base method's query pattern. The forward-step instantiation gives a new residual convergence guarantee, while the EG and PEG instantiations give new regularized variants. The four analyses share a residual recurrence, recovering the $O(1/k)$ Halpern residual-norm convergence rate, giving $O(1/\sqrt{k})$ for the regularized forward step, and giving $O(1/k)$ for the regularized EG and PEG variants in the unconstrained monotone Lipschitz setting.
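
For readers unfamiliar with anchoring, the sketch below contrasts the plain Picard iteration $x_{k+1} = T(x_k)$ with the classical Halpern iteration $x_{k+1} = \beta_k x_0 + (1 - \beta_k) T(x_k)$, using the common anchor weight $\beta_k = 1/(k+2)$. The test operator (a 90-degree rotation of the plane, which is nonexpansive with the origin as its only fixed point), the step count, and the helper names are choices made for this illustration. The code shows only the textbook Halpern update, not the paper's operator-side Tikhonov construction.

```python
import math

def rotate90(x):
    """A nonexpansive map on R^2 whose only fixed point is the origin."""
    return (-x[1], x[0])

def residual(T, x):
    """Fixed-point residual norm ||x - T(x)||."""
    tx = T(x)
    return math.hypot(x[0] - tx[0], x[1] - tx[1])

def picard(T, x0, steps):
    x = x0
    for _ in range(steps):
        x = T(x)
    return x

def halpern(T, x0, steps):
    """x_{k+1} = b_k * x0 + (1 - b_k) * T(x_k) with anchor weight b_k = 1 / (k + 2)."""
    x = x0
    for k in range(steps):
        b = 1.0 / (k + 2)
        tx = T(x)
        x = (b * x0[0] + (1 - b) * tx[0], b * x0[1] + (1 - b) * tx[1])
    return x

x0 = (1.0, 0.0)
picard_res = residual(rotate90, picard(rotate90, x0, 1000))
halpern_res = residual(rotate90, halpern(rotate90, x0, 1000))
```

Picard simply cycles around the unit circle, so its residual never shrinks, while the vanishing pull toward `x0` drives the Halpern residual toward zero, in line with the $O(1/k)$ last-iterate rate mentioned in the abstract.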

2605.30904 2026-06-01 cs.CV

MergeTok: Unified Continuous and Discrete Visual Tokenization via Token Merging

Luyuan Zhang, Siyuan Li, Zedong Wang, Qingsong Xie, Cheng Tan, Anna Wang, Yanhao Zhang, Chen Chen, Haonan Lu, Haoqian Wang

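AI summary: Proposes MergeTok, a unified tokenizer that jointly optimizes continuous VAE and discrete VQ tokenizers through token merging, combining high-fidelity reconstruction with semantically controllable discrete representations.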

Comments: 11 pages (main text), 7 figures. Preprint. Under review at NeurIPS 2026

Abstract

Most visual tokenizers for image generation are bifurcated into two families with complementary limitations: continuous VAEs offer high-fidelity reconstruction but suffer from dense, entangled latents that are poorly suited for semantic control, whereas discrete VQ-based models enable autoregressive generation yet struggle with gradient sparsity, unstable training, and codebook collapse. In this work, we introduce MergeTok, a unified tokenizer that jointly optimizes continuous (VAE) and discrete (VQ) tokenizers within an encoder-decoder architecture, leveraging token merging techniques as a semantic bridge. By clustering similar tokens during encoding, MergeTok establishes a structural prior that provides dual supervision signals: (i) it imposes merged-token semantic alignment in the VAE branch, regularizing its latent space toward disentangled, semantic-aware representations; (ii) it derives group-wise constraints, promoting intra-group diversity and inter-group exclusivity that stabilize VQ training. MergeTok shows competitive reconstruction and generation performance on ImageNet-256, with substantially lower rFID than strong VAE and VQ models under matched token budgets, while producing semantically organized token representations compatible with both autoregressive and diffusion generators. This shows that a single architecture can endow visual tokenizers with robust semantic organization and generator-friendly discreteness.

2605.30903 2026-06-01 cs.LG cs.AI

Inverse Reinforcement Learning without an Optimal Demonstrator: A Feasible Reward Set Approach

Kihyun Kim, Shripad Deshmukh, Nikos Vlassis, Jiawei Zhang

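AI summary: For data from multiple non-optimal demonstrators, proposes a feasible-reward-set framework in which linear constraints define a joint feasible set that shrinks monotonically, and provides recovery guarantees and an offline algorithm for high-dimensional environments.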

Abstract

Inverse reinforcement learning (IRL) typically assumes demonstrations from a single optimal demonstrator, but in many applications data come from multiple imperfect demonstrators with heterogeneous suboptimality levels. We study reward learning in this setting through a feasible-reward-set framework: for each demonstrator, we encode its declared suboptimality level as a linear constraint and intersect the resulting feasible sets across demonstrators. Our theoretical analysis shows that the joint feasible set shrinks monotonically as data are added, and we give an exact characterization of when a new demonstrator strictly tightens it. We further establish two recovery guarantees for the feasible reward set of the ground-truth optimal demonstrator: one bound depends on closeness to the optimal occupancy, while the other requires only sufficient coverage and no near-optimal demonstrator. On the practical side, we introduce strategies to address the inherent reward ambiguity in the obtained reward set and provide an offline algorithm with function approximation for high-dimensional environments. Experiments in tabular grid-world and large language model (LLM) fine-tuning settings are consistent with the theoretical predictions and demonstrate the effectiveness of the proposed framework over baselines.

2605.30901 2026-06-01 cs.LG

Density-Guided Robust Counterfactual Explanations on Tabular Data under Model Multiplicity

模型多重性下表格数据的密度引导鲁棒反事实解释

Jun Tan, Qing Guo, Zicheng Xu, Jinglin Li, Qi Fang, Ning Gui

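AI summary: Proposes DensityFlow, a generative framework that uses Neural ODEs and a density score to construct robust counterfactual explanations that avoid low-density regions and remain valid under model multiplicity.

Details
Comments
26 pages, 11 figures, accepted by ICML 2026
AI Chinese abstract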

反事实解释(CEs)对于可操作的补救措施至关重要,但其可靠性在低密度区域常常受到损害,因为分类器在这些区域表现出高方差。与依赖昂贵的集成交集来定义稳定性的现有方法不同,我们提出了DensityFlow,一种生成框架,通过遵循高置信度数据流形来构建鲁棒的反事实解释。具体来说,我们将反事实生成建模为由神经ODE参数化的连续时间动力学,并由可微密度评分引导,以主动避免不确定的低密度区域。该密度评分通过噪声对比估计学习,有效利用$(K{+}1)$路判别器来估计密度比。对于黑盒设置,我们引入了一种局部代理蒸馏机制,该机制在CE生成的轨迹内严格地将轻量级代理与目标模型对齐,从而实现高效的基于梯度的优化,且查询次数最少。实验表明,与基于集成的基线相比,DensityFlow在模型多重性下实现了优越的有效性,同时显著降低了查询成本。我们的实现可在https://github.com/G-AILab/DensityFlow获取。

English abstract

Counterfactual explanations (CEs) are essential for actionable recourse, yet their reliability is often compromised in low-density regions, where classifiers exhibit high variance. Unlike existing methods that rely on expensive ensemble intersections to define stability, we propose DensityFlow, a generative framework that constructs robust CEs by adhering to the high-confidence data manifold. Specifically, we model the counterfactual generation as continuous-time dynamics parameterized by Neural ODE, guided by a differentiable density score to actively avoid uncertain, low-density areas. This density score is learned via Noise Contrastive Estimation, effectively leveraging a $(K{+}1)$-way discriminator to estimate density ratios. For black-box settings, we introduce a local proxy distillation mechanism that aligns a lightweight surrogate with the target model strictly within the trajectory of CE generation, enabling efficient gradient-based optimization with minimal queries. Experiments demonstrate that DensityFlow achieves superior validity under model multiplicity while significantly reducing query costs compared to ensemble-based baselines. Our implementation is available at https://github.com/G-AILab/DensityFlow.

2605.30900 2026-06-01 cs.AI physics.app-ph

BilliardPhys-Bench: Benchmarking Physical Reasoning and Visual Dynamics of Multimodal LLMs

BilliardPhys-Bench: 多模态大语言模型的物理推理与视觉动力学基准测试

Ben Wang, Xiaogang Li, Ruochen Gao, Peiyao Xiao, Chengliang Xu, Zeyu Wang, Zichao Chen, Bing Zhao, Hu Wei

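AI summary: Proposes the BilliardPhys-Bench benchmark, which uses synthetic billiards environments to evaluate the physical reasoning of multimodal LLMs (collisions, bounces, and final-position prediction), and finds that models exhibit a "stasis bias" and that performance degrades with simulation time and scene complexity.

Details
AI Chinese abstract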

当前多模态模型在静态图像识别方面表现良好,但直观的物理推理仍是弱点。从单张图像预测物体如何运动及相互作用对这些系统而言仍然困难。我们提出了BilliardPhys-Bench,一个用于合成台球环境中物理推理的基准测试。其程序化引擎生成带有摩擦和弹性碰撞的随机场景。该基准测试三种能力:(1) 预测球与球之间的碰撞,(2) 推理墙壁反弹,(3) 估计运动停止后球的最终位置。我们评估了来自GPT、Claude、Gemini和Qwen系列的最新MLLMs。随着模拟时间增加和场景几何复杂度提高,性能下降。我们还观察到一个一致的失败模式,称为“静态偏差”:当正确的物理结果更难推断时,模型倾向于预测无交互。这些发现揭示了当前MLLMs在视觉动力学上的不足之处,并指出了在多模态架构中需要更好的物理归纳偏置。

English abstract

Current multimodal models handle static image recognition well, but intuitive physical reasoning remains a weakness. Predicting how objects will move and interact from a single image is still difficult for these systems. We present BilliardPhys-Bench, a benchmark for physical reasoning in synthetic billiards environments. Its procedural engine generates randomized scenarios with friction and elastic collisions. The benchmark tests three abilities: (1) predicting ball-to-ball collisions, (2) reasoning about wall bounces, and (3) estimating final ball positions after motion stops. We evaluate recent MLLMs from the GPT, Claude, Gemini, and Qwen families. Performance drops as simulation time increases and scene geometry grows more complex. We also observe a consistent failure mode we call "stasis bias": when the correct physical outcome is harder to infer, models tend to predict no interaction. These findings show where current MLLMs break down on visual dynamics and point toward the need for better physical inductive biases in multimodal architectures.
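To make the "final ball position" task concrete, here is a minimal sketch of how a ground-truth answer could be computed for a single ball. It is not the benchmark's procedural engine. It assumes a point-sized ball, a rectangular table with perfectly elastic cushions, constant friction deceleration, and no ball-to-ball collisions. The function names are hypothetical. Under those assumptions the path can be "unfolded" into a straight line and each coordinate mirrored back into the table.

```python
import math
from typing import Tuple


def fold(coord: float, length: float) -> float:
    """Map an unfolded coordinate back into [0, length] by mirror reflection."""
    period = 2.0 * length
    coord = coord % period  # Python's % is non-negative for a positive period
    return coord if coord <= length else period - coord


def final_position(
    pos: Tuple[float, float],
    vel: Tuple[float, float],
    decel: float,
    width: float,
    height: float,
) -> Tuple[float, float]:
    """Resting position of a point ball under constant deceleration and elastic walls."""
    speed = math.hypot(vel[0], vel[1])
    if speed == 0.0:
        return pos
    distance = speed ** 2 / (2.0 * decel)  # stopping distance: v^2 / (2a)
    ux, uy = vel[0] / speed, vel[1] / speed
    return (fold(pos[0] + ux * distance, width), fold(pos[1] + uy * distance, height))


# Ball at (1, 1) moving right at speed 2 on a 2 x 2 table with deceleration 1:
# it travels 2 units in total, bouncing once off the right cushion.
print(final_position((1.0, 1.0), (2.0, 0.0), 1.0, 2.0, 2.0))
```

A "no interaction" guess would place the ball at its unfolded endpoint or leave it at the start; the mirrored result differs whenever at least one bounce occurs.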

2605.30899 2026-06-01 eess.AS cs.AI cs.SD

A Unified and Reproducible Experimentation Framework for Speech Understanding

语音理解的统一可复现实验框架

Jing Peng, Junhao Du, Chenghao Wang, Hanqi Li, Yi Yang, Yixuan Wang, Xiaoyu Gu, Guanyu Chen, Yucheng Wang, Jiang Li, Zhangjie Zhao, Haoran Wang, Wenming Tu, Haoyu Li, Duo Ma, Lirong Qian, Yu Xi, Wen Wen, Jiaqi Guo, Hui Zhang, Shuai Fan, Wenbin Jiang, Shuai Wang, Kai Yu

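AI summary: Proposes the SURE framework, which improves the comparability and reproducibility of speech understanding models in deployment scenarios through standardized prediction formats, normalization, and scoring, together with an agent-assisted training conversion flow.

Details
Comments
This paper is submitted to INTERSPEECH 2026
AI Chinese abstract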

语音基础模型和语音大语言模型推动了语音理解的发展,但面向部署的模型选择受到非可比评估的阻碍,这些评估由不匹配的后处理以及跨数据规模和流水线难以复现的训练结果导致。我们提出了SURE,一个统一的实验框架,标准化了预测格式、归一化和评分。SURE评估了从传统流水线到语音大语言模型的各种范式下的强系统,在代表性任务上施加了现实声学和语言压力。除了评估,SURE还引入了一种代理辅助的训练转换流程,该流程将论文和代码映射到统一协议下、基于匹配开放数据子集的版本化、可运行训练流水线。总体而言,SURE提高了面向部署评估的可比性和可复现性。

English abstract

Speech foundation models and Speech LLMs have advanced speech understanding, yet deployment-oriented model selection is hindered by non-comparable evaluations caused by mismatched post-processing, and by training results that are hard to reproduce across data scales and pipelines. We present SURE, a unified experimentation framework that standardizes prediction formats, normalization, and scoring. SURE evaluates strong systems across paradigms, from conventional pipelines to Speech LLMs, on representative tasks under realistic acoustic and linguistic stressors. Beyond evaluation, SURE introduces an agent-assisted training conversion flow that maps paper and code into versioned, runnable training pipelines under a unified protocol on matched open-data subsets. Overall, SURE improves comparability and reproducibility for deployment-oriented evaluation.

2605.30898 2026-06-01 cs.AI cs.CL

UniScale: Adaptive Unified Inference Scaling via Online Joint Optimization of Model Routing and Test-Time Scaling

UniScale: 通过模型路由和测试时扩展的在线联合优化实现自适应统一推理扩展

Kaiyu Huang, Xingyu Wang, Mingze Kong, Zhubo Shi, Yuqian Hou, Hong Xu, Zhongxiang Dai, Minchen Yu, Qingjiang Shi

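AI summary: Proposes the UniScale framework, which unifies model routing and test-time scaling as a contextual multi-armed bandit problem and learns inference policies online via LinUCB, achieving a fine-grained and better quality-cost trade-off.

Details
Comments
Accepted at the 43rd International Conference on Machine Learning (ICML 2026)
AI Chinese abstract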

在大语言模型(LLM)的实际部署中,平衡推理质量和计算成本已成为核心挑战。现有方法沿着两个大致独立的维度处理这一权衡:模型路由(在不同规模的模型之间切换以匹配请求复杂度)和测试时扩展(TTS,在固定模型内调整推理时计算以实现细粒度控制)。然而,这种解耦设计引入了固有限制。由于模型规模稀疏,模型路由产生粗粒度的离散性能变化,而单模型TTS通常遇到能力上限,并随着计算增加出现收益递减。此外,将两种机制分开处理限制了动态推理环境中的适应性。为克服这些限制,我们引入统一推理扩展(UIS),将模型路由和TTS统一到单个优化空间中。基于此公式,我们提出UniScale,一个在线框架,将自适应UIS建模为上下文多臂老虎机问题,并通过LinUCB学习推理策略。该框架包含效率感知学习和成本建模,以确保在高维动作空间上的稳定和可扩展优化。评估表明,UniScale有效利用UIS空间中的协同作用,在多样化的动态推理场景中提供细粒度且持续更优的质量-成本权衡。

English abstract

In real-world deployments of large language models (LLMs), balancing inference quality and computational cost has become a central challenge. Existing approaches tackle this trade-off along two largely independent dimensions: model routing, which switches among models of different scales to match request complexity, and test-time scaling (TTS), which adjusts inference-time compute within a fixed model for fine-grained control. However, this decoupled design introduces inherent limitations. Model routing yields coarse-grained, discrete performance changes due to the sparse set of model scales, while single-model TTS often encounters capacity ceilings and exhibits diminishing returns as compute increases. Moreover, treating the two mechanisms separately restricts adaptability in dynamic inference environments. To overcome these limitations, we introduce Unified Inference Scaling (UIS), which unifies model routing and TTS in a single optimization space. Building on this formulation, we propose UniScale, an online framework that models adaptive UIS as a contextual multi-armed bandit problem and learns inference policies via LinUCB. The framework incorporates efficiency-aware learning and cost modeling to ensure stable and scalable optimization over high-dimensional action spaces. Evaluation shows that UniScale effectively exploits the synergy in the UIS space to deliver a fine-grained and consistently better quality-cost trade-off across diverse, dynamic inference scenarios.
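The bandit formulation above can be sketched with a textbook disjoint LinUCB learner whose arms are (model, test-time budget) pairs. This is only an illustration of the general technique: the action list, the feature vector, and the reward definition are assumptions made here, and the paper's efficiency-aware learning and cost modeling are not reproduced.

```python
import math
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


class LinUCBArm:
    """Disjoint LinUCB statistics for one action."""

    def __init__(self, dim: int, alpha: float = 0.5) -> None:
        self.alpha = alpha
        # Inverse of A = I + sum(x x^T), kept up to date via Sherman-Morrison.
        self.a_inv = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        self.b = [0.0] * dim

    def ucb(self, x: Sequence[float]) -> float:
        """Estimated reward plus an exploration bonus for context x."""
        a_inv_x = [dot(row, x) for row in self.a_inv]
        theta = [dot(row, self.b) for row in self.a_inv]
        return dot(theta, x) + self.alpha * math.sqrt(max(dot(x, a_inv_x), 0.0))

    def update(self, x: Sequence[float], reward: float) -> None:
        """Rank-one update of A^-1 and b after observing a reward."""
        a_inv_x = [dot(row, x) for row in self.a_inv]
        denom = 1.0 + dot(x, a_inv_x)
        for i in range(len(x)):
            for j in range(len(x)):
                self.a_inv[i][j] -= a_inv_x[i] * a_inv_x[j] / denom
        self.b = [bi + reward * xi for bi, xi in zip(self.b, x)]


# Hypothetical unified action space: (model name, number of test-time samples).
ACTIONS = [("small", 1), ("small", 8), ("large", 1), ("large", 8)]


def select_action(arms: List[LinUCBArm], x: Sequence[float]) -> int:
    """Pick the arm with the highest upper confidence bound."""
    scores = [arm.ucb(x) for arm in arms]
    return max(range(len(arms)), key=scores.__getitem__)
```

In use, one would build `arms = [LinUCBArm(dim) for _ in ACTIONS]`, call `select_action` on each request's feature vector, run the chosen configuration, and feed back a scalar such as quality minus a weighted cost (an assumed reward shape) through `update`.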

2605.30896 2026-06-01 cs.LG

Zero Collapse: A Failure Mode of Policy Gradient Methods in Discontinuous Reward Environments

零坍塌:策略梯度方法在不连续奖励环境中的一种失败模式

Nishant Kumar, Enrique Areyan Viqueira, Amy Greenwald

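AI summary: Identifies "zero collapse," a failure mode of policy gradient methods in discontinuous-reward environments such as auctions, in which the policy becomes trapped in a zero-reward region because the gradient signal vanishes, and proposes mitigation strategies.

Details
Comments
20 pages, 7 figures; includes Appendix
AI Chinese abstract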

重复拍卖中的竞价是强化学习(RL)的一个核心挑战,它结合了连续控制与数字广告的策略复杂性。尽管策略梯度和基于值的方法似乎适合这些设置,但它们常常难以应对拍卖奖励景观的不连续、“悬崖状”特性。例如,在首价拍卖中,竞拍者在达到特定阈值之前获得零奖励,之后奖励随出价增加而减少。这形成了由尖锐边界分隔的平坦零奖励区域。我们识别出这种设置中一个基本的失败模式,称为“零坍塌”。我们表明,随机探索和基于梯度的更新可能导致策略越过最优高奖励区域,进入平坦的零奖励区域。一旦进入,由于缺乏信息性的梯度信号,恢复变得极其样本低效,有效地困住了智能体。我们发现演员-评论家方法特别容易受到影响,因为偏差的值估计会加速向不稳定区域的移动。我们的贡献包括:(1)对不连续奖励如何导致信号消失和零坍塌的机制解释;(2)对策略随机性和步长之间相互作用的分析;(3)在REINFORCE和演员-评论家变体上对该现象的经验演示。我们提出了涉及初始化和架构选择的实用缓解策略以提高稳定性。最后,我们引入了一个正式的拍卖环境RL框架,突出了其独特的结构特性。

English abstract

Bidding in repeated auctions is a central challenge for reinforcement learning (RL), combining continuous control with the strategic complexities of digital advertising. While policy gradient and value-based methods seem well-suited for these settings, they often struggle with the discontinuous, "cliff-like" nature of auction reward landscapes. In a first-price auction, for example, a bidder receives zero reward until they cross a specific threshold, after which the reward decreases as the bid increases. This creates a landscape of flat, zero-reward regions separated by sharp boundaries. We identify a fundamental failure mode in this setting termed "zero collapse." We show that stochastic exploration and gradient-based updates can cause policies to overshoot optimal high-reward regions and enter flat, zero-reward regimes. Once there, the lack of an informative gradient signal makes recovery extremely sample-inefficient, effectively trapping the agent. We find that actor-critic methods are particularly susceptible, as biased value estimates can accelerate this movement toward unstable regions. Our contributions include: (1) a mechanistic explanation of how discontinuous rewards lead to vanishing signals and zero collapse; (2) an analysis of the interaction between policy stochasticity and step size; and (3) an empirical demonstration of this phenomenon across REINFORCE and actor-critic variants. We propose practical mitigation strategies involving initialization and architectural choices to improve stability. Finally, we introduce a formal RL framework for auction environments highlighting their unique structural properties.
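The vanishing-signal mechanism can be reproduced in a few lines. The sketch below is an illustration under simplifying assumptions, not the paper's experimental setup: a single bidder with known value faces a fixed highest competing bid (`threshold`), and bids come from a Gaussian policy whose mean is the only learned parameter. All function names are hypothetical.

```python
import random


def first_price_reward(bid: float, value: float, threshold: float) -> float:
    """Utility of a first-price bidder facing a fixed highest competing bid."""
    return value - bid if bid > threshold else 0.0


def reinforce_mean_gradient(
    mean: float,
    std: float,
    value: float,
    threshold: float,
    n_samples: int,
    rng: random.Random,
) -> float:
    """Monte Carlo REINFORCE estimate of d E[reward] / d mean for a Gaussian bid policy."""
    total = 0.0
    for _ in range(n_samples):
        bid = rng.gauss(mean, std)
        score = (bid - mean) / std ** 2  # d/d(mean) of log N(bid; mean, std)
        total += first_price_reward(bid, value, threshold) * score
    return total / n_samples


rng = random.Random(0)
# Policy near the cliff: some sampled bids win, so the estimate carries signal.
print(reinforce_mean_gradient(0.52, 0.05, 1.0, 0.5, 1000, rng))
# Policy deep in the flat region: every sampled reward is zero, so the
# estimate is zero and the update gives no direction back toward the cliff.
print(reinforce_mean_gradient(0.10, 0.01, 1.0, 0.5, 1000, rng))
```

Above the threshold the reward falls as the bid rises, so gradient steps push the mean down toward the cliff edge; once a step or a narrow policy lands below it, the second call shows why recovery depends on rare exploratory samples.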

2605.30894 2026-06-01 cs.CV

SteerFace: Debiasing Synthetic Face Generation via Adaptive Residue Perturbation

SteerFace: 通过自适应残差扰动消除合成人脸生成中的偏差

Yuxi Mi, Qiuyang Yuan, Jianqing Xu, Yichun Zhou, Xuan Zhao, Jun Wang, Rizen Guo, Shuigeng Zhou

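AI summary: To address the visual-tendency gap between the distributions of synthetic and real face data, proposes the SteerFace framework, which perturbs identity embeddings toward random orthogonal directions as a regularizer, suppressing the generator's reliance on non-identity visual cues and narrowing the synthetic-real gap.

Details
AI Chinese abstract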

人脸识别训练中合法合规数据的短缺引发了人们对使用合成数据作为替代方案的日益关注。虽然最近的扩散方法能够生成具有强身份一致性和数据多样性的逼真人脸图像,但其下游识别性能仍然存在显著的合成-真实差距。本文识别出视觉倾向(visual tendency)作为一个此前未被充分探索的限制因素,即合成数据表现出不切实际的视觉属性普遍性,从而偏离真实数据分布。视觉倾向可归因于生成器对身份嵌入的条件化,通过这种条件化,共现的残留视觉线索被无意中吸收到学习到的身份语义中。为了阻止生成器利用此类视觉线索,本文提出SteerFace,一个简单高效的训练框架,通过将身份嵌入向嵌入超球面上的随机正交方向引导来扰动身份嵌入。该扰动作为一种身份保持正则化项,惩罚生成器对非身份成分的依赖,理论分析支持了这一点。本文进一步引入一种自适应策略,学习具有样本级偏好和有利总体统计的扰动强度。大量实验表明,SteerFace有效缓解了视觉倾向,在下游人脸识别中优于先前方法,并且在不同训练数据集和生成流程中具有良好的泛化能力。

English abstract

The shortage of legally compliant data for face recognition training has sparked growing interest in using synthetic data as an alternative. While recent diffusion-based methods enable the generation of photorealistic face images with strong identity adherence and data diversity, their downstream recognition performance still exhibits a significant synthetic-real gap. This paper identifies visual tendency as a previously underexplored limitation, whereby synthetic data exhibit an unrealistic prevalence of visual attributes and thus deviate from the real-data distribution. Visual tendency can be attributed to the generator's conditioning on identity embeddings, through which co-occurring residual visual cues are unintentionally absorbed into learned identity semantics. To discourage the generator from exploiting such visual cues, this paper proposes SteerFace, a simple and efficient training framework that perturbs identity embeddings by steering them toward random orthogonal directions on the embedding hypersphere. The perturbation serves as an identity-preserving regularizer that penalizes the generator's reliance on non-identity components, as supported by theoretical analysis. This paper further introduces an adaptive strategy that learns perturbation strengths with both sample-wise preference and favorable overall statistics. Extensive experiments show that SteerFace effectively mitigates visual tendency, outperforms prior methods in downstream face recognition, and generalizes well across different training datasets and generation pipelines.
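One simple way to steer a unit embedding toward a random orthogonal direction while staying on the hypersphere is a rotation by a chosen angle. The sketch below shows that geometric operation only. Parametrizing the perturbation strength as an angle is an assumption made here, the names are hypothetical, and the paper's adaptive strength learning is not modeled.

```python
import math
import random
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def steer_embedding(embedding: Sequence[float], angle: float, rng: random.Random) -> List[float]:
    """Rotate a unit embedding by `angle` radians toward a random orthogonal direction.

    Requires at least two dimensions so that an orthogonal direction exists.
    """
    e = normalize(embedding)
    noise = [rng.gauss(0.0, 1.0) for _ in e]
    along = dot(noise, e)
    # Remove the component along e, leaving a random direction orthogonal to it.
    u = normalize([n - along * ei for n, ei in zip(noise, e)])
    # cos/sin mixing of two orthonormal vectors stays on the unit sphere.
    return [math.cos(angle) * ei + math.sin(angle) * ui for ei, ui in zip(e, u)]


rng = random.Random(0)
identity = [3.0, 4.0, 0.0, 0.0]
perturbed = steer_embedding(identity, 0.3, rng)
print(dot(perturbed, normalize(identity)))  # cosine similarity to the original
```

By construction the cosine similarity between the original and perturbed embedding equals `cos(angle)`, so the angle gives direct control over how much identity signal is retained.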

2605.30893 2026-06-01 cs.CV

Foundation VAEs for 3D CT Reconstruction, Augmentation, and Generation

用于3D CT重建、增强和生成的基础VAE

Qi Chen, Shuhan Ding, Yu Gu, Nan Liu, Jiang Bian, Alan Yuille, Zongwei Zhou, Jingjing Fu

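AI summary: Finds that a Foundation VAE pretrained on natural images can be used directly for CT reconstruction, augmentation, and generation without training or fine-tuning; with the encoder and decoder frozen, it preserves anatomy and suppresses noise, and yields significant gains on segmentation and generation tasks.

Details
Comments
ICML 2026 Accepted
AI Chinese abstract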

变分自编码器(VAE)将高分辨率CT体积压缩为紧凑的潜在表示,同时保留临床相关结构。然而,从头训练或大量微调CT专用VAE会带来巨大的计算和工程成本,并且在异构扫描仪、协议和疾病下性能常会下降。本文通过一个关键观察向免训练的医学VAE迈出了渐进的一步:一个在自然图像和视频上大规模预训练的基础VAE可以作为CT重建、增强和生成的统一接口。在编码器和解码器均冻结的情况下,基础VAE重建CT体积时保留了解剖结构,同时抑制了采集噪声;在这些重建上训练分割模型,对于胰腺肿瘤和肺肿瘤,表面准确度平均提高了3.9% NSD。在相同的基础VAE潜在空间中,条件潜在扩散模型实现了平均FVD降低3.9%,CT CLIP分数提高36.2%,并在18种疾病的多疾病生成忠实度上提高了2.76% AUC。这些结果表明基础VAE可作为可扩展的CT表示重用和忠实CT生成的实用接口。我们的代码和演示可在 https://github.com/qic999/Foundation-VAE 获取。

English abstract

Variational autoencoders (VAEs) compress high-resolution CT volumes into compact latents while preserving clinically relevant structure. However, training CT-specific VAEs from scratch or heavily fine-tuning them incurs substantial computational and engineering cost, and often degrades under heterogeneous scanners, protocols, and diseases. This paper makes a progressive stride toward training-free medical VAEs by leveraging a critical observation: a single Foundation VAE, pretrained at scale on natural images and videos, can serve as a unified interface for CT Reconstruction, Augmentation, and Generation. With both encoder and decoder frozen, the Foundation VAE reconstructs CT volumes with preserved anatomy while suppressing acquisition noise; training segmentation models on these reconstructions improves surface accuracy by 3.9% NSD on average for pancreatic and lung tumors. Within the same Foundation VAE latent space, a conditional latent diffusion model achieves 3.9% lower average FVD with 36.2% higher CT CLIP score, and improves multi-disease generation faithfulness across 18 types by 2.76% AUC. These results demonstrate Foundation VAEs as a practical interface for scalable CT representation reuse and faithful CT generation. Our code and demo are available at https://github.com/qic999/Foundation-VAE.

2605.30892 2026-06-01 cs.LG

Bandwidth Allocation with Device Partitioning for Federated Learning over Industrial IoT networks

面向工业物联网联邦学习的设备分区带宽分配

Kangmin Kim, Jaeyoung Song

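AI summary: To address the communication bottleneck of federated learning in the Industrial IoT, proposes a bandwidth allocation policy that partitions devices by computing capability and sequentially grants each subset the full bandwidth to minimize training time; the policy is proven to outperform schemes without partitioning and also reduces uplink energy consumption.

Details
AI Chinese abstract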

我们考虑一个联邦学习(FL)系统,其中工业物联网(IIoT)设备通过无线信道协作训练全局模型,而不共享本地数据。在此类系统中,通信时间是制约整体训练效率的主要瓶颈。与优先考虑个体服务质量需求的传统网络不同,FL系统旨在尽可能高效地收敛到最优全局模型,这需要一种根本不同的带宽分配方法。本文提出一种新颖的带宽分配策略,利用设备计算能力的异构性来最小化总训练时间。该策略并非同时将所有选定设备的带宽分配出去,而是将参与设备划分为有序子集,并依次授予每个子集全带宽的独占访问权。我们正式证明,无论底层调度算法如何,这种基于分区的策略都能实现比任何无分区带宽分配方案更低的训练时间。此外,通过减少每台设备的传输持续时间,该策略还最小化了上行能耗,这对电池受限的IIoT设备尤其有利。在真实数据集(包括工业表面缺陷基准GC10-Det和标准图像分类基准CIFAR-10)上的大量实验表明,与现有带宽分配方案相比,所提策略持续降低了训练时间和能耗,接近轮次时间的理论下界。

English abstract

We consider a federated learning (FL) system in which Industrial Internet-of-Things (IIoT) devices collaboratively train a global model over wireless channels without sharing local data. In such systems, communication time is a primary bottleneck that constrains overall training efficiency. Unlike conventional networks that prioritize individual quality-of-service requirements, FL systems collectively aim to converge to an optimal global model as efficiently as possible, which calls for a fundamentally different approach to bandwidth allocation. In this paper, we propose a novel bandwidth allocation policy that exploits the heterogeneity of device computing capabilities to minimize total training time. Rather than distributing bandwidth among all selected devices simultaneously, the proposed policy partitions the participating devices into ordered subsets and sequentially grants each subset exclusive access to the full bandwidth. We formally prove that this partitioning-based policy achieves a strictly lower training time than any bandwidth allocation scheme without partitioning, irrespective of the underlying scheduling algorithm. Furthermore, by reducing per-device transmission duration, the proposed policy also minimizes uplink energy consumption, which is particularly beneficial for battery-constrained IIoT devices. Extensive experiments on real-world datasets - including GC10-Det, an industrial surface defect benchmark, and CIFAR-10, a standard image classification benchmark - demonstrate that the proposed policy consistently reduces training time and energy consumption compared to existing bandwidth allocation schemes, approaching the theoretical lower bound on round time.
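A toy timing model shows why sequential full-band access can shorten a round. The sketch below rests on strong simplifications that are assumptions of this example, not of the paper: upload rate scales linearly with the bandwidth share, every device sends the same payload, each ordered subset contains a single device, and the non-partitioned baseline gives each device a fixed equal share. Function names are hypothetical.

```python
from typing import List


def shared_round_time(compute_times: List[float], upload_time: float) -> float:
    """All devices upload concurrently, each on a fixed 1/N share of the band.

    `upload_time` is the time one upload takes with the full band, so a 1/N
    share stretches it to N * upload_time.
    """
    n = len(compute_times)
    return max(c + n * upload_time for c in compute_times)


def partitioned_round_time(compute_times: List[float], upload_time: float) -> float:
    """Devices upload one at a time with the full band, fastest computers first."""
    finish = 0.0
    for c in sorted(compute_times):
        # A device starts once it has finished computing and the band is free.
        finish = max(finish, c) + upload_time
    return finish


compute = [1.0, 2.0, 5.0]  # local training times (arbitrary units)
print(shared_round_time(compute, 1.0))       # slowest device also uploads slowly
print(partitioned_round_time(compute, 1.0))  # early uploads hide behind later compute
```

In this model each device also transmits for `upload_time` instead of `N * upload_time`, so with a fixed transmit power (another assumption) its uplink energy would drop by the same factor.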

2605.30889 2026-06-01 physics.chem-ph cs.LG

MLIPilot: LLM-Driven Auto-Research for Machine-Learned Interatomic Potentials

MLIPilot:面向机器学习原子间势的LLM驱动自动研究

Etinosa Osaro, Santosh Adhikari, Stamatia Zavitsanou, Kelsey Parker, Dario Rocca

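AI summary: Proposes the MLIPilot framework, in which large language models automatically propose hypotheses, edit training code, and optimize machine-learned interatomic potentials against a physically constrained scorecard; its effectiveness is validated on QM7 and Cu EMT datasets.

Details
AI Chinese abstract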

构建生产质量的机器学习原子间势(MLIP)需要在单个训练损失无法捕捉的约束下平衡精度、动力学稳定性和计算吞吐量。我们引入了MLIPilot,一个自动研究框架,其中工具调用的大语言模型提出假设、编辑MLIP训练代码、启动HPC作业,并使用固定的、受物理约束的评分卡接受或回退更改。我们在MACE势优化上评估了MLIPilot,使用了商业和开源权重LLM代理,包括GPT-5.5、GPT-4.1、Mistral-24B和Qwen3-32B。基准测试涵盖分子和周期性设置:一个QM7衍生数据集(我们为其生成了B3LYP/6-31G(d)能量和力),以及一个Cu EMT数据集(包含由ASE有效介质理论计算器标记的周期性铜超胞)。在这些基准测试中,最强的代理通过发现有用的训练策略(包括输出归一化、损失函数更改、渐进训练计划和模型容量调整),将最初违反约束的基线模型转变为可接受的模型。这些结果表明,当LLM代理的搜索受到领域特定验证标准的约束时,它们可以作为科学机器学习工作流的自主操作者,将MLIP开发从手动试错转向可审计的自动化实验。

English abstract

Constructing production-quality machine-learned interatomic potentials (MLIPs) requires balancing accuracy, dynamical stability, and computational throughput under constraints that are not captured by a single training loss. We introduce MLIPilot, an auto-research framework in which tool-calling large language models propose hypotheses, edit MLIP training code, launch HPC jobs, and accept or revert changes using a fixed, physically constrained scorecard. We evaluate MLIPilot on MACE potential optimization using both commercial and open-weight LLM agents, including GPT-5.5, GPT-4.1, Mistral-24B, and Qwen3-32B. The benchmarks span molecular and periodic settings: a QM7-derived dataset for which we generated B3LYP/6-31G(d) energies and forces, and a Cu EMT dataset with periodic copper supercells labeled by ASE's Effective Medium Theory calculator. Across these benchmarks, the strongest agents move initially constraint-violating baselines to accepted models by discovering useful training strategies, including output normalization, loss-function changes, progressive training schedules, and model-capacity adjustments. These results suggest that LLM agents can serve as autonomous operators for scientific machine-learning workflows when their search is constrained by domain-specific validation criteria, shifting part of MLIP development from manual trial-and-error toward auditable, automated experimentation.

2605.30888 2026-06-01 cs.CL

The Flip Side of RLHF: On-Policy Feedback for Reward Model Self-Supervised Improvement

RLHF的另一面:用于奖励模型自监督改进的在线策略反馈

Xiaobo Wang, Tong Wu, Min Tang, Jiaqi Li, Qi Liu, Zilong Zheng

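AI summary: Proposes the SAVE framework, which uses a value function to generate on-policy feedback and updates the reward model via contrastive learning, outperforming existing methods on six benchmarks.

Details
AI Chinese abstract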

构建用于语言模型对齐的强大奖励模型(RM)受到从人工标注或评判模型获取多样且可靠偏好数据的成本和难度的瓶颈限制。随着策略超越静态RM训练,这一问题变得更加严重。因此,我们提出SAVE(基于价值锚定的在线策略反馈自监督奖励模型改进),一个通过使用价值函数进行在线策略RM训练的框架,对在线策略响应进行评分作为反馈。SAVE自然地利用提示特定的价值头作为自适应锚点,将奖励评分的在线策略响应转化为监督信号。它计算RM优势并过滤模糊样本,通过对比目标更新RM。通过六个不同基准的严格实证评估,SAVE在增强RM训练方面的有效性得到了强烈验证。它在所有数据集上取得了优于现有方法的结果,同时在三种RL算法(GRPO、RLOO、GSPO)和不同策略骨干上保持一致的改进。

English abstract

Building strong reward models (RMs) for language model alignment is bottlenecked by the cost and difficulty of acquiring diverse and reliable preference data from human annotation or judge models. The problem becomes dramatically worse as the policy evolves beyond the static RM's training data. We therefore propose SAVE (Self-supervised reward model improvement via Value-Anchored On-policy feedback), a framework for on-policy RM training that uses the value function to grade on-policy responses as feedback. SAVE converts the reward-graded on-policy responses into supervision, using a prompt-specific value head as an adaptive anchor. It computes RM advantages and filters ambiguous samples to update the RM via a contrastive objective. Empirical evaluation across six diverse benchmarks validates the effectiveness of SAVE for enhancing RM training. It outperforms existing methods on all datasets while maintaining consistent improvements across three RL algorithms (GRPO, RLOO, GSPO) and different policy backbones.

2605.30884 2026-06-01 cs.CV

GUI-C$^2$: Coarse-to-Fine GUI Grounding via Difficulty-Aware Reinforcement Learning

GUI-C$^2$:基于难度感知强化学习的由粗到细GUI定位

Junlong Li, Chao Hao, Lap-Pui Chau, Yi Wang

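AI summary: Proposes the GUI-C$^2$ framework, which uses difficulty-aware data selection and a coarse-to-fine reinforcement learning mechanism to address uneven training-sample difficulty and the visual-region cropping trade-off in GUI grounding, achieving state-of-the-art performance.

Details
AI Chinese abstract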

现有的用于GUI定位的智能体强化学习方法在数据层面和策略层面存在局限性。在数据层面,当前方法通常平等对待所有训练样本,尽管它们对基线模型的训练价值随难度而变化。忽视这一点会大大降低训练效率甚至导致崩溃。在策略层面,现有框架难以平衡裁剪较大区域以获取足够上下文和较小区域以减少冗余之间的权衡,这是工具增强定位代理固有的张力。此外,过于复杂的决策对于小参数模型来说难以处理,并显著增加推理时间。为了解决这些问题,在数据层面,我们提出了GUI-D,一个数据挖掘和难度评分流程,通过适当的测试识别值得训练的样本,并分配难度分数以指导后续训练权重。在策略层面,我们提出了GUI-C$^2$,它采用区域门控的由粗到细细化机制,通过模型内部不确定性信号逐步缩小视野,自适应地为大目标保留上下文,同时增强对小目标的精度,并通过改进感知的阶段奖励进行强化,确保每次细化真正提升定位。同时,我们简化了决策过程,大大减少了额外的推理时间。最后,大量实验表明,我们的方法达到了最先进的性能。代码和数据将公开。

English abstract

Existing agentic reinforcement learning methods for GUI grounding have limitations at two levels. At the data level, current approaches typically treat all training samples equally, although their training value to the baseline model varies with difficulty. Overlooking this can greatly reduce training efficiency or even cause collapse. At the strategy level, existing frameworks struggle to balance the trade-off between cropping larger regions for sufficient context and smaller ones for reduced redundancy, a tension inherent to tool-augmented grounding agents. In addition, overly complex decision-making is difficult for small-parameter models and significantly increases inference time. To address these issues, at the data level, we propose GUI-D, a data mining and difficulty scoring pipeline that identifies the training-worthy samples by proper testing and assigns difficulty scores to guide subsequent training weights. At the strategy level, we propose GUI-C$^2$, which employs an area-gated coarse-to-fine refinement mechanism that progressively narrows the visual field via model-internal uncertainty signals, adaptively reserving context for large targets while amplifying precision for small ones, reinforced by improvement-aware stage rewards that ensure each refinement genuinely advances grounding. Meanwhile, we simplify the decision-making process to greatly reduce additional inference time. Finally, extensive experiments show that our method achieves state-of-the-art performance. The code and data will be publicly available.

2605.30880 2026-06-01 cs.CL cs.AI

PatchWorld: Gradient-Free Optimization of Executable World Models

PatchWorld:可执行世界模型的免梯度优化

Jiaxin Bai, Yue Guo, Yifei Dong, Jiaxuan Xiong, Tianshi Zheng, Yixia Li, Tianqing Fang, Yufei Li, Yisen Gao, Haoyu Huang, Zhongwei Xie, Hong Ting Tsang, Zihao Wang, Lihui Liu, Jeff Pan, Yangqiu Song

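AI summary: Proposes the PatchWorld framework, which turns offline trajectories into executable Python world models through counterexample-guided code repair, producing symbolic belief-state programs without gradient-based optimization and reaching 76.4% macro success in AgentGym environments.

Details
Comments
40 pages
AI Chinese abstract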

文本智能体环境通常被建模为部分可观察马尔可夫决策过程(POMDP),假设模拟器的潜在状态和转移动态对智能体隐藏。然而,很少有工作研究是否可以通过归纳可执行代码来作为部分可观察性下的预测和规划的世界模型。我们引入了 PatchWorld,一个免梯度框架,通过反例引导的代码修复将离线轨迹转化为可执行的 Python 世界模型。PatchWorld 不是用黑盒模型预测下一个观察,而是归纳出符号信念状态程序,其动作更新可以被检查、重放和局部修补。在七个 AgentGym 环境中,PatchWorld-Simple 在评估方法中取得了最高的基于代码的规划分数,在实时一步前瞻中达到 76.4% 的宏观成功率,同时在世界模型预测模块本身内不调用任何 LLM。我们进一步发现,人类指定的残差记忆偏差提高了表面观察保真度,但削弱了决策效用。这暴露了可执行世界模型中的权衡,因为提高观察保真度可能以牺牲动作判别动态为代价,反之亦然。代码可在 https://github.com/HKBU-KnowComp/PatchWorld 获取。

English abstract

Text-agent environments are typically modeled as partially observable Markov decision processes (POMDPs), assuming that the simulator's latent state and transition dynamics are hidden from the agent. Yet little work has examined whether executable code can be induced to serve as a world model for prediction and planning under partial observability. We introduce PatchWorld, a gradient-free framework that turns offline trajectories into executable Python world models through counterexample-guided code repair. Instead of predicting the next observation with a black-box model, PatchWorld induces symbolic belief-state programs whose action updates can be inspected, replayed, and locally patched. Across seven AgentGym environments, PatchWorld-Simple achieves the highest code-based planning score among evaluated methods, reaching 76.4% macro success in live one-step lookahead while invoking no LLM calls inside the world-model prediction module itself. We further find that a human-specified residual-memory bias improves surface observation fidelity but weakens decision utility. This exposes a tradeoff in executable world models, since improving observation fidelity can come at the expense of action-discriminative dynamics, and vice versa. Code is available at https://github.com/HKBU-KnowComp/PatchWorld.

2605.30876 2026-06-01 cs.CL

dMoE: dLLMs with Learnable Block Experts

dMoE: 具有可学习块专家的扩散大语言模型

Sicheng Feng, Zigeng Chen, Gongfan Fang, Xinyin Ma, Xinchao Wang

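AI summary: To address the inference memory bottleneck caused by the mismatch between block-parallel decoding and token-level expert selection when diffusion LLMs are combined with Mixture-of-Experts architectures, proposes the dMoE framework, which aggregates token-level expert distributions within a block into a unified block-level expert distribution to reduce the number of activated experts, substantially lowering memory usage and latency while preserving performance.

Details
Comments
Work in progress. Code is available at: https://github.com/fscdc/dMoE
AI Chinese abstract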

扩散大语言模型(dLLMs)最近作为自回归模型的有前途的替代方案出现,在自然支持并行解码的同时提供了有竞争力的性能。然而,随着dLLMs越来越多地与混合专家(MoE)架构集成以扩展模型容量,块并行解码与令牌级专家选择之间出现了根本性的不匹配。具体来说,每次dLLM前向传递处理多个具有双向依赖关系的令牌,而传统的MoE层独立路由每个令牌。这种不匹配显著增加了唯一激活专家的数量,使推理越来越受内存限制。为了解决这个问题,我们提出了dMoE,一个简单而有效的块级MoE框架。dMoE的核心思想是将每个块内的令牌级专家分布聚合成统一的块级专家分布,然后以更连贯的方式指导专家路由。通过这种方式,dMoE在不牺牲性能的情况下显著减少了推理期间唯一激活专家的数量,从而缓解了内存瓶颈。在各种基准上的大量实验证明了dMoE的有效性。平均而言,dMoE将唯一激活专家的数量从69.5减少到14.6,同时保留了原始性能的99.11%。同时,它将内存使用减少了76.64%到79.84%,并实现了1.14倍到1.66倍的端到端延迟加速。代码可在https://github.com/fscdc/dMoE获取。

English abstract

Diffusion Large Language Models (dLLMs) have recently emerged as a promising alternative to autoregressive models, offering competitive performance while naturally supporting parallel decoding. However, as dLLMs are increasingly integrated with Mixture-of-Experts (MoE) architectures to scale model capacity, a fundamental mismatch arises between block parallel decoding and token-level expert selection. Specifically, each dLLM forward pass processes multiple tokens with bidirectional dependencies, whereas conventional MoE layers route each token independently. This mismatch substantially increases the number of uniquely activated experts, making inference increasingly memory-bound. To address this, we propose dMoE, a simple yet effective block-level MoE framework. The central idea of dMoE is to aggregate token-level expert distributions within each block into a unified block-level expert distribution, which is then used to guide expert routing in a more coherent manner. In this way, dMoE substantially reduces the number of uniquely activated experts during inference without sacrificing performance, thereby mitigating the memory-bound bottleneck. Extensive experiments across a variety of benchmarks demonstrate the effectiveness of dMoE. On average, dMoE reduces the number of uniquely activated experts from 69.5 to 14.6 while retaining 99.11% of the original performance. Meanwhile, it reduces memory usage by 76.64% to 79.84% and achieves 1.14$\times$ to 1.66$\times$ end-to-end latency speedup. Code is available at: https://github.com/fscdc/dMoE
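The effect of block-level routing on the number of uniquely activated experts can be seen in a small example. The sketch below uses a plain mean over token-level router distributions as the aggregation rule. That choice is an assumption for illustration (the title suggests the paper's block experts are learnable), and the router probabilities are made-up numbers.

```python
from typing import List, Sequence, Set


def top_k(probs: Sequence[float], k: int) -> List[int]:
    """Indices of the k largest probabilities (ties keep the lower index first)."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]


def token_level_experts(router_probs: List[Sequence[float]], k: int) -> Set[int]:
    """Union of experts activated when every token routes independently."""
    active: Set[int] = set()
    for probs in router_probs:
        active.update(top_k(probs, k))
    return active


def block_level_experts(router_probs: List[Sequence[float]], k: int) -> Set[int]:
    """Experts activated when the whole block shares one averaged distribution."""
    n_tokens = len(router_probs)
    n_experts = len(router_probs[0])
    block_probs = [sum(tok[e] for tok in router_probs) / n_tokens for e in range(n_experts)]
    return set(top_k(block_probs, k))


# One block of 4 tokens routed over 6 experts (illustrative values).
ROUTER_PROBS = [
    [0.50, 0.30, 0.10, 0.05, 0.03, 0.02],
    [0.30, 0.10, 0.40, 0.10, 0.05, 0.05],
    [0.10, 0.35, 0.05, 0.40, 0.05, 0.05],
    [0.40, 0.05, 0.05, 0.05, 0.40, 0.05],
]

print(len(token_level_experts(ROUTER_PROBS, k=2)))  # experts loaded with per-token routing
print(len(block_level_experts(ROUTER_PROBS, k=2)))  # experts loaded with block routing
```

Per-token routing touches five distinct experts for this block, while block routing touches two, which is the kind of reduction that eases the memory-bound regime described above.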

2605.30873 2026-06-01 cs.LG cs.AI cs.DC

Federated Variational Preference Alignment with Gumbel-Softmax Prior for Personalized User Preferences

联邦变分偏好对齐与Gumbel-Softmax先验用于个性化用户偏好

Jabin Koo, Hoyoung Kim, Minwoo Jang, Jungseul Ok

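AI summary: Proposes the FedVPA-GP framework, which uses a Federated Mixture Prior and an Orthogonal Loss to address conflicting user preferences and personalization in federated learning, outperforming monolithic models on the HH-RLHF dataset.

Details
Comments
21 pages, 4 figures. Accepted to ICML 2026
AI Chinese abstract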

联邦学习(FL)为对齐大型语言模型(LLMs)提供了一条保护隐私的途径;然而,现有框架通常强制使用单一奖励模型,不可避免地平均了本质上相互冲突的用户偏好(例如,有用性与无害性)。虽然变分偏好学习(VPL)提供了一条个性化的途径,但将其适应于去中心化设置面临一个基本挑战:由严重的局部数据稀缺性和异质性驱动的后验坍塌。在本文中,我们提出了具有Gumbel-Softmax先验的联邦变分偏好对齐(FedVPA-GP),这是一个旨在在不牺牲隐私的情况下解耦多样偏好的框架。为了稳定变分推断,我们引入了一个联邦混合先验,使客户端能够利用聚合的总体分布作为动态先验。此外,我们加入了一个正交损失,明确强制在潜在空间中分离偏好原型。在HH-RLHF数据集上的实验表明,FedVPA-GP显著优于单一基线,成功解耦了冲突的用户意图,并实现了动态偏好切换。

English abstract

Federated Learning (FL) offers a privacy-preserving pathway for aligning Large Language Models (LLMs); however, existing frameworks typically enforce a monolithic reward model, inevitably averaging out inherently conflicting user preferences (e.g., helpfulness vs. harmlessness). While Variational Preference Learning (VPL) offers a pathway to personalization, adapting it to decentralized settings presents a fundamental challenge: posterior collapse driven by severe local data scarcity and heterogeneity. In this paper, we propose Federated Variational Preference Alignment with Gumbel-Softmax Prior (FedVPA-GP), a framework designed to disentangle diverse preferences without compromising privacy. To stabilize variational inference, we introduce a Federated Mixture Prior that enables clients to leverage the aggregate population distribution as a dynamic prior. Furthermore, we incorporate an Orthogonal Loss that explicitly enforces the separation of preference prototypes in the latent space. Experiments on the HH-RLHF dataset demonstrate that FedVPA-GP significantly outperforms monolithic baselines, successfully disentangling conflicting user intents and enabling dynamic preference switching.
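The abstract names a Gumbel-Softmax prior without spelling out its form, so the sketch below shows only the standard Gumbel-Softmax relaxation: a way to draw a soft, near-one-hot sample over a small set of categories, here imagined as preference prototypes. The prototype interpretation, the function name, and the example probabilities are assumptions of this example.

```python
import math
import random
from typing import List, Sequence


def gumbel_softmax_sample(
    log_probs: Sequence[float], temperature: float, rng: random.Random
) -> List[float]:
    """Draw one relaxed categorical sample: softmax((log_probs + Gumbel noise) / temperature)."""
    noisy = []
    for lp in log_probs:
        u = max(rng.random(), 1e-12)  # guard against log(0)
        gumbel = -math.log(-math.log(u))
        noisy.append((lp + gumbel) / temperature)
    peak = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in noisy]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
# Mixing weights over three hypothetical preference prototypes.
log_weights = [math.log(0.7), math.log(0.2), math.log(0.1)]
print(gumbel_softmax_sample(log_weights, temperature=0.5, rng=rng))
```

Lower temperatures push the sample toward a single prototype and higher ones toward a smoother mixture. In an autodiff framework the same expression stays differentiable with respect to `log_probs`.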