arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.08647 2026-05-12 cs.CL cs.AI cs.LG

AgentCollabBench: Diagnosing When Good Agents Make Bad Collaborators

Aritra Mazumder, Shubhashis Roy Dipta, Nusrat Jahan Lia, Tanzila Khan, Kainat Raisa Hossain, Nehaa Shri, Shubhrangshu Debsarkar, Humayra Tasnim, Gour Gupal Talukder Shawon, Debjoty Mitra, Sumaiya Ahmed Rani, Al Jami Islam Anik, Al Nafeu Khan

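AI summary: AgentCollabBench is a benchmark for diagnosing why capable agents can make poor collaborators, designed to expose latent reasoning-chain failures in multi-agent systems. The study builds a benchmark of 900 human-validated tasks and evaluates four modern large language models for vulnerability to instruction decay, false-belief contagion, context leakage, and loss of tracer durability, finding that communication topology is a key factor in the reliability of multi-hop information transfer. The authors argue that multi-agent reliability is fundamentally a structural problem and that improving model capability alone is not enough to make collaboration safe.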

Abstract

Multi-agent systems achieve state-of-the-art outcomes through peer collaboration. However, when an agent in the pipeline silently drops a constraint, the system's final output may look correct even though the reasoning chain was quietly corrupted, and existing outcome-based evaluations are blind to such multi-hop process failures. To make these vulnerabilities measurable before deployment, we introduce AgentCollabBench, a diagnostic benchmark of 900 human-validated tasks spanning software engineering, DevOps, and data engineering. Each task isolates one of four behavioral risks: instruction decay (does a constraint survive peer pressure?), false-belief contagion (does a falsehood spread through consensus?), context leakage (does information bleed between tasks?), and tracer durability (does marked data reach the final agent?). Evaluating four modern LLMs (GPT 4.1 mini, Gemini 2.5 Flash Lite, Qwen-3.5-35B-A3B, and Llama 3.1 8B Instruct), we expose model-specific vulnerability profiles invisible to outcome-only evaluation; Qwen-3.5-35B-A3B, for example, leads on tracer durability and instruction stability, while GPT 4.1 mini leads on leakage containment and false-belief resistance. Beyond per-model differences, communication topology emerges as a primary risk factor that explains 7-40% of the variance in multi-hop information survival. The effect traces to a synthesis bottleneck specific to converging-DAG nodes: an agent weighing competing parent inputs discards constraints carried by a minority branch, a bottleneck structurally absent from linear chains. AgentCollabBench demonstrates that suboptimal topology can silently erase the safeguards of highly capable models, arguing that multi-agent reliability is fundamentally a structural problem and that scaling model intelligence alone is no substitute for architecture.

2605.08646 2026-05-12 cs.LG cs.CL cs.DC

PAAC: Privacy-Aware Agentic Device-Cloud Collaboration

Liangqi Yuan, Wenzhi Fang, Shiqiang Wang, Christopher G. Brinton

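AI summary: PAAC is a privacy-aware agentic device-cloud collaboration framework that addresses the tension between privacy protection and reasoning capability in large language model agents. It aligns the planner-executor decomposition with the device-cloud boundary so that the division of roles itself becomes the privacy mechanism, improving overall performance while preserving privacy. Experiments show that on several agentic benchmarks under strict privacy settings, PAAC substantially outperforms existing methods in both accuracy and leakage control.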

Abstract

Large language model (LLM) agents face a structural tension: cloud agents provide strong reasoning but expose user data, while on-device agents preserve privacy at the cost of overall capability. Existing device-cloud designs treat this boundary as a compute split rather than a trust boundary suited to agentic workloads, and existing sanitizers force a choice between policy flexibility and the structural fidelity tool calls require. In this work, we develop PAAC, a privacy-aware agentic framework that aligns planner-executor decomposition with the device-cloud boundary so that role specialization itself becomes the privacy mechanism. The cloud agent reasons over typed placeholder tokens that preserve each sensitive value's reasoning role while discarding its content, while the on-device agent identifies sensitive spans and distills each step's execution outcome into compact key findings. Sanitization confines the on-device LLM to proposing which spans to mask, while a deterministic registry performs all substitution and reversal, keeping actions directly executable on device. On three agentic benchmarks under strict privacy settings, PAAC dominates the Pareto frontier of privacy and accuracy, improving average accuracy by 15-36% and reducing average leakage by 2-6× over state-of-the-art device-cloud baselines, with the largest margins on privacy targets outside fixed entity taxonomies. We find consistent improvements on 17 additional benchmarks spanning 10 domains, including math, science, and finance.
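
The deterministic registry described in the abstract can be pictured with a small sketch. This is not PAAC's implementation: the class name `PlaceholderRegistry`, the `<TYPE_n>` token format, and the example spans are all hypothetical, and the span list stands in for what the on-device LLM would propose.

```python
import re

class PlaceholderRegistry:
    """Hypothetical two-way map between sensitive values and typed placeholders."""

    def __init__(self):
        self._to_token = {}  # (type, value) -> placeholder
        self._to_value = {}  # placeholder -> original value
        self._counts = {}    # type -> number of placeholders issued so far

    def mask(self, text, spans):
        """Replace each proposed (value, type) span with a typed placeholder."""
        for value, kind in spans:
            key = (kind, value)
            if key not in self._to_token:
                self._counts[kind] = self._counts.get(kind, 0) + 1
                token = f"<{kind}_{self._counts[kind]}>"
                self._to_token[key] = token
                self._to_value[token] = value
            text = text.replace(value, self._to_token[key])
        return text

    def unmask(self, text):
        """Restore original values so the action can be executed on the device."""
        return re.sub(r"<[A-Z]+_\d+>",
                      lambda m: self._to_value.get(m.group(0), m.group(0)),
                      text)

registry = PlaceholderRegistry()
# Spans would be proposed by the on-device model; here they are hard-coded.
masked = registry.mask(
    "Email alice@example.com about invoice 4417",
    [("alice@example.com", "EMAIL"), ("4417", "ID")],
)
# The cloud agent only ever sees `masked` and plans over the placeholders.
cloud_plan = 'send_email(to="<EMAIL_1>", subject="Invoice <ID_1>")'
executable = registry.unmask(cloud_plan)
```

Because substitution and reversal are plain dictionary lookups, the language model never rewrites the text itself, which is one way to read the abstract's point that tool calls stay structurally intact.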

2605.08640 2026-05-12 cs.CV

FlowADMM: Plug-and-play ADMM with Flow-based Renoise-Denoise Priors

Hendrik Sommerhoff, Michael Moeller

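AI summary: This paper proposes FlowADMM, a plug-and-play ADMM algorithm based on flow models for solving inverse problems. It formalizes the deterministic renoise-denoise operation in flow models and integrates it into the classical ADMM framework, improving the algorithm's convergence and stability. Experiments show that FlowADMM performs well on denoising, deblurring, super-resolution, and inpainting while requiring fewer data consistency evaluations.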

Abstract

Plug-and-play (PnP) methods for solving inverse problems have recently achieved strong performance by leveraging denoising priors based on powerful generative diffusion and flow models. However, existing diffusion- and flow-based PnP methods typically rely on stochastic renoise-denoise operations, which complicate the analysis of their convergence behavior. In this work, we identify and formalize the deterministic renoise-denoise operator underlying flow-based plug-and-play methods. This perspective reveals that these methods implicitly define a deterministic operator given by the expectation of a denoiser over the latent noise distribution. Building on this insight, we propose FlowADMM, a PnP algorithm that integrates the renoise-denoise operator into the classical alternating direction method of multipliers (ADMM) framework. We establish convergence guarantees for FlowADMM under weak Lipschitz conditions on the underlying flow network, and extend the analysis to non-stationary time schedules. Empirically, FlowADMM achieves state-of-the-art performance among flow-based PnP methods on a range of inverse problems, including denoising, deblurring, super-resolution, and inpainting, while requiring fewer data consistency evaluations than prior approaches.
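
A toy sketch may help show where an expectation-of-a-denoiser operator sits inside ADMM. Everything specific here is an assumption rather than the paper's method: the linear renoising `(1 - t) * x + t * eps`, the moving-average `toy_denoiser` standing in for a flow network, the quadratic data term, and the values of `rho`, `t`, and `samples`. The expectation is only approximated by a Monte Carlo average.

```python
import random

def toy_denoiser(x):
    """Hypothetical stand-in for a flow network: 3-tap moving average."""
    n = len(x)
    return [(x[max(i - 1, 0)] + x[i] + x[min(i + 1, n - 1)]) / 3.0 for i in range(n)]

def renoise_denoise(x, rng, t=0.2, samples=8):
    """Monte Carlo estimate of E_eps[ denoise((1 - t) * x + t * eps) ]."""
    acc = [0.0] * len(x)
    for _ in range(samples):
        noisy = [(1 - t) * xi + t * rng.gauss(0.0, 1.0) for xi in x]
        for i, v in enumerate(toy_denoiser(noisy)):
            acc[i] += v / samples
    return acc

def pnp_admm(y, rho=1.0, iters=20, seed=0):
    """Scaled-form ADMM for the data term 0.5 * ||x - y||^2 with a plug-in prior step."""
    rng = random.Random(seed)
    z = list(y)
    u = [0.0] * len(y)
    x = list(y)
    for _ in range(iters):
        # x-update: closed-form proximal step of the quadratic data term.
        x = [(yi + rho * (zi - ui)) / (1 + rho) for yi, zi, ui in zip(y, z, u)]
        # z-update: the renoise-denoise operator replaces the regularizer's prox.
        z = renoise_denoise([xi + ui for xi, ui in zip(x, u)], rng)
        # Dual update on the constraint x = z.
        u = [ui + xi - zi for ui, xi, zi in zip(u, x, z)]
    return x

x_hat = pnp_admm([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])
```

The data term is touched only in the x-update, which is where the abstract's count of data consistency evaluations would apply.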

2605.08639 2026-05-12 cs.LG

ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning

Chao Jin, Xinming Wei, Yinmin Zhong, Chengxu Yang, Bingyang Wu, Ruidong Zhu, Zili Zhang, Yuliang Liu, Xin Jin

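AI summary: In reinforcement learning for large language models, Mixture-of-Experts (MoE) training suffers from severe load imbalance, especially when hot experts shift frequently across micro-batches. This paper proposes ReLibra, an MoE load-balancing system based on routing replay, which exploits the consistency between rollout and training in RL to balance load at micro-batch granularity. ReLibra applies expert reordering at the inter-batch timescale and expert replication at the intra-batch timescale, improving training throughput; experiments show it outperforms existing methods.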

Abstract

Load imbalance is a long-standing challenge in Mixture-of-Experts (MoE) training and is exacerbated in reinforcement learning (RL) for LLMs, where hot experts can shift frequently across micro-batches. Existing MoE training systems rely on historical loads to predict future expert demand, making them less effective under sharp fluctuations. We propose ReLibra, an MoE RL training system that exploits a unique opportunity in RL's rollout-training workflow, routing replay, to enable fine-grained load balancing at micro-batch granularity. Because rollout and training process the same tokens with the same MoE parameters, the token-to-expert routing decisions are known before training starts. Leveraging this information, ReLibra places two MoE load-balancing mechanisms at inter- and intra-batch timescales, matching their communication patterns to hierarchical network bandwidths. At the inter-batch timescale, ReLibra performs expert reordering to redistribute experts for batch-level cross-node balancing; at the intra-batch timescale, it dynamically performs expert replication within a node to absorb micro-batch-level load fluctuations. Experiments on diverse MoE LLMs and RL workloads show that ReLibra improves training throughput by up to 1.6× over Megatron-LM and by up to 1.2× over EPLB, even when EPLB is given oracle loads. Moreover, ReLibra remains within 6%-10% of the throughput of an idealized balanced baseline.
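
To see why knowing the routing in advance helps, and why placement alone cannot absorb a hot expert, here is a minimal sketch. It is not ReLibra's algorithm: the greedy heaviest-first placement, the even splitting of the hottest expert, the function names, and the token counts are all illustrative assumptions, and the sketch ignores the node hierarchy the abstract describes.

```python
import heapq

def place_experts(expert_loads, num_gpus):
    """Greedy placement: the heaviest remaining expert goes to the lightest GPU."""
    heap = [(0, gpu, []) for gpu in range(num_gpus)]
    heapq.heapify(heap)
    for expert, load in sorted(expert_loads.items(), key=lambda kv: -kv[1]):
        total, gpu, experts = heapq.heappop(heap)
        heapq.heappush(heap, (total + load, gpu, experts + [expert]))
    return {gpu: (total, experts) for total, gpu, experts in heap}

def replicate_hottest(expert_loads, copies):
    """Split the hottest expert's tokens evenly across `copies` replicas."""
    hot = max(expert_loads, key=expert_loads.get)
    load = expert_loads[hot]
    out = {e: l for e, l in expert_loads.items() if e != hot}
    for i in range(copies):
        out[f"{hot}#{i}"] = load // copies + (1 if i < load % copies else 0)
    return out

def peak_load(placement):
    """Token count on the busiest GPU, which bounds the step time."""
    return max(total for total, _ in placement.values())

# Hypothetical per-expert token counts for one micro-batch, known from rollout routing.
loads = {"e0": 900, "e1": 100, "e2": 120, "e3": 80}
without_replication = peak_load(place_experts(loads, 2))
with_replication = peak_load(place_experts(replicate_hottest(loads, 2), 2))
```

In this made-up micro-batch the single hot expert pins the peak at 900 tokens regardless of placement, while two replicas bring it to 630 against an ideal of 600.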

2605.08638 2026-05-12 cs.RO cs.AI

Geometry Guided Self-Consistency for Physical AI

Yinwei Dai, Zhuofu Chen, Lijie Yang, Ravi Netravali

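AI summary: This paper proposes KeyStone, an inference-time self-consistency method for improving diffusion-based action generation in physical AI. It generates multiple candidate action chunks in parallel, clusters them in continuous action space, and returns the medoid of the largest cluster, with no additional training or model parameters. KeyStone exploits the geometric structure of action trajectories, in which Euclidean distance directly reflects physical similarity, to make selection efficient and judge-free, and it markedly improves task success rates across a range of vision-language-action and world-action models.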

Abstract

State-of-the-art physical AI models generate a chunk of actions per inference through diffusion or flow matching, iteratively refining an initial noise sample into an action trajectory. Because this inference process is inherently stochastic, committing to a single trajectory per round is brittle, and this brittleness compounds across the many sequential rounds that comprise a complete episode. We introduce KeyStone, an inference-time self-consistency method for diffusion-based action generation that draws K candidate action chunks in parallel from a shared model context, clusters them in continuous action space, and returns the medoid of the largest cluster, with no additional model required. Two properties make this practical. First, the compact nature of action trajectories makes diffusion inference memory-bandwidth bound, leaving spare compute capacity to run K chains in parallel with no additional wall-clock latency. Second, unlike token or pixel spaces where distance carries no semantic meaning and selection requires a learned judge, action chunks are geometrically structured such that Euclidean distance directly reflects physical similarity, making selection principled and judge-free. Across diverse vision-language-action models (VLAs) and world-action models (WAMs), KeyStone improves task success rates by up to 13.3% over single-trajectory sampling with negligible latency overhead, while having on par accuracy with model-based selectors at no training cost. We open source KeyStone at https://github.com/dywsjtu/keystone.
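
The cluster-then-medoid selection step can be sketched in a few lines. The abstract does not say which clustering algorithm KeyStone uses, so the threshold-based single-linkage grouping, the `radius` parameter, and the flattened two-dimensional candidates below are assumptions for illustration only.

```python
import math

def select_action_chunk(candidates, radius):
    """Group candidates by Euclidean distance and return the largest group's medoid."""
    n = len(candidates)
    dist = [[math.dist(a, b) for b in candidates] for a in candidates]

    # Single-linkage grouping via union-find: link candidates within `radius`.
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= radius:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    largest = max(clusters.values(), key=len)

    # Medoid: the member with the smallest total distance to the other members.
    medoid = min(largest, key=lambda i: sum(dist[i][j] for j in largest))
    return candidates[medoid]

# Four hypothetical flattened action chunks; the last one is an outlier sample.
samples = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (5.0, 5.0)]
chosen = select_action_chunk(samples, radius=0.5)
```

Returning a medoid rather than a mean keeps the output an action chunk the model actually generated, so the selected trajectory is never an average of incompatible modes.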

2605.08636 2026-05-12 cs.CL

EdgeFlowerTune: Evaluating Federated LLM Fine-Tuning Under Realistic Edge System Constraints

Jiaxiang Geng, Yiyi Lu, Lunyu Zhao, Yan Gao, Nicholas D. Lane, Bing Luo

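AI summary: This paper proposes EdgeFlowerTune, a benchmark for federated LLM fine-tuning under realistic edge-device constraints, intended to assess the feasibility and performance of federated fine-tuning on real edge systems. The benchmark jointly considers model quality and system costs, including communication, latency, memory, energy consumption, and robustness to dynamic conditions, and introduces three complementary evaluation protocols for comparing methods on effectiveness, efficiency, and robustness. Experiments show that evaluating on accuracy alone can lead to misleading conclusions, and EdgeFlowerTune offers a reproducible platform for system-aware research on federated fine-tuning at the edge.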

Comments 30 pages, 10 figures

Abstract

Federated fine-tuning offers a promising paradigm for adapting large language models (LLMs) on edge devices by leveraging the rich, diverse, and continuously generated data from smartphones and IoT devices without compromising user data privacy. Such edge-side adaptation can improve model personalization, robustness, and responsiveness to local contexts. However, the practical feasibility of federated LLM fine-tuning on real edge devices remains unclear, as most existing work focuses on cross-silo or simulation-based settings, overlooking the resource and runtime constraints that determine whether a method is deployable on real edge systems. We present EdgeFlowerTune, a deployment-oriented benchmark for federated LLM fine-tuning under realistic edge-system constraints. EdgeFlowerTune jointly evaluates model quality and system costs, including communication, wall-clock latency, memory usage, energy consumption, and robustness to dynamic edge conditions. To compare methods in terms of effectiveness, efficiency, and robustness, EdgeFlowerTune introduces three complementary protocols: Quality-under-Budget, Cost-to-Target, and Robustness. We instantiate EdgeFlowerTune as a real-device platform built on Flower and MobileFineTuner, spanning commercial Android smartphones and NVIDIA edge development boards. Our benchmark results show that accuracy-only evaluation can lead to misleading conclusions: methods with similar final quality may differ substantially in deployability once realistic system constraints are considered. EdgeFlowerTune provides a reproducible benchmark for system-aware evaluation of federated LLM fine-tuning at the edge.
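
The abstract names the Quality-under-Budget and Cost-to-Target protocols without defining them. The sketch below is one plausible reading, not the benchmark's definition: the function signatures, the choice of energy as the cost, and the per-round logs are all hypothetical.

```python
def quality_under_budget(history, budget):
    """Best quality among rounds whose cumulative cost stays within the budget."""
    feasible = [quality for cost, quality in history if cost <= budget]
    return max(feasible) if feasible else None

def cost_to_target(history, target):
    """Cumulative cost of the first round that reaches the target quality."""
    for cost, quality in history:
        if quality >= target:
            return cost
    return None

# Hypothetical per-round logs: (cumulative energy in joules, accuracy).
method_a = [(50, 0.40), (100, 0.55), (150, 0.62), (200, 0.64)]
method_b = [(120, 0.50), (240, 0.63), (360, 0.65)]

qub_a = quality_under_budget(method_a, budget=150)
qub_b = quality_under_budget(method_b, budget=150)
ctt_a = cost_to_target(method_a, target=0.60)
ctt_b = cost_to_target(method_b, target=0.60)
```

In these made-up logs the two methods end within one accuracy point of each other, yet under a 150 J budget they are twelve points apart, which is the kind of gap an accuracy-only comparison would hide.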

2605.08635 2026-05-12 cs.CV

Kinematics-Driven Gaussian Shape Deformation for Blurry Monocular Dynamic Scenes

Yeon-Ji Song, Kiyoung Kwon, Junoh Lee, Jin-Hwa Kim, Byoung-Tak Zhang

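AI summary: This paper studies how to reconstruct dynamic 3D scenes from blurry monocular video. To address the way motion blur entangles motion and geometry, it proposes Kinematics-GS, a kinematics-based Gaussian shape deformation framework. The method treats blur as motion-aligned deformation and introduces a kinematic prior to reparameterize Gaussian shapes, avoiding shape degeneration without auxiliary motion supervision. It also decomposes scenes into dynamic and static parts using temporal deformation variance and adopts a coarse-to-fine deformation strategy, improving reconstruction stability and detail; experiments show it clearly outperforms existing methods on real-world scenes.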

Comments 20 pages, 9 figures, 13 tables

Abstract

Reconstructing dynamic 3D scenes from blurry monocular videos is challenging as motion-induced blur entangles object motion and geometry, hindering geometric consistency. We present Kinematics-GS, a kinematics-aware framework that models blur as motion-aligned deformation and introduces a kinematic prior to reparameterize Gaussian shapes along motion trajectories, thereby mitigating degenerate shape collapse without auxiliary motion supervision. To stabilize optimization, we decompose scenes into dynamic and static components using temporal deformation variance and employ a coarse-to-fine deformation strategy to capture both global motion and fine-grained details. We also introduce a challenging real-world dataset of deformable and elastic objects exhibiting non-rigid motion with spatially non-uniform motion blur that obscures geometric cues. Extensive experiments on real-world benchmarks with realistic motion blur demonstrate that Kinematics-GS outperforms prior methods by a clear margin in monocular dynamic scene reconstruction, highlighting its effectiveness in handling complex and non-rigid motion scenarios.

2605.08632 2026-05-12 cs.CL cs.AI

PARD-2: Target-Aligned Parallel Draft Model for Dual-Mode Speculative Decoding

Zihao An, Taichi Liu, Ziqiong Liu, Dong Li, Ruofeng Liu, Emad Barsoum

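AI summary: This paper proposes PARD-2, a dual-mode speculative decoding framework for speeding up large language model inference. It redesigns the draft model's optimization objective, shifting the focus from per-token prediction accuracy to maximizing the length of consecutively accepted tokens, which better matches what inference actually requires. PARD-2 introduces Confidence-Adaptive Token (CAT) optimization, allows a single draft model to support both target-dependent and target-independent modes, and achieves up to 6.94× lossless acceleration across multiple models and tasks.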

Abstract

Speculative decoding accelerates Large Language Model (LLM) inference by using a lightweight draft model to propose candidate tokens that are verified in parallel by the target model. However, existing draft model training objectives are not directly aligned with the inference-time goal of maximizing consecutive token acceptance. To address this issue, we reformulate the draft model optimization objective, shifting the focus from token prediction accuracy to the overall acceptance length. In this paper, we build upon PARD to propose PARD-2, a dual-mode speculative decoding framework with Confidence-Adaptive Token (CAT) optimization. This approach adaptively reweights each token to better align with the verification process. Notably, PARD-2 enables a single draft model to support both target-dependent and target-independent modes. Experiments across diverse models and tasks demonstrate that PARD-2 achieves up to 6.94× lossless acceleration, surpassing EAGLE-3 by 1.9× and PARD by 1.3× on Llama3.1-8B. Our code is available at https://github.com/AMD-AGI/PARD.
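
The gap between token accuracy and acceptance length is easy to see in code. The sketch assumes greedy verification, where a draft token is accepted only if it equals the target model's own choice at that position; the token IDs are made up, and this shows the metric rather than PARD-2's training objective.

```python
def acceptance_length(draft_tokens, target_tokens):
    """Count the leading draft tokens that match the target model's choices."""
    accepted = 0
    for draft, target in zip(draft_tokens, target_tokens):
        if draft != target:
            break  # Everything after the first mismatch is discarded.
        accepted += 1
    return accepted

def token_accuracy(draft_tokens, target_tokens):
    """Fraction of positions where the draft matches, ignoring order effects."""
    matches = sum(d == t for d, t in zip(draft_tokens, target_tokens))
    return matches / len(draft_tokens)

target = [5, 9, 4, 7]
early_miss = [8, 9, 4, 7]  # wrong at position 0, right afterwards
late_miss = [5, 9, 4, 1]   # right until the last position

results = {
    "early_miss": (token_accuracy(early_miss, target), acceptance_length(early_miss, target)),
    "late_miss": (token_accuracy(late_miss, target), acceptance_length(late_miss, target)),
}
```

Both drafts have the same 75% token accuracy, but the early miss yields zero accepted tokens while the late miss yields three, which is why an objective weighted toward early positions can matter more than average accuracy.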

2605.08627 2026-05-12 cs.CV

DRNet: All-in-One Image Restoration via Prior-Guided Dynamic Reparameterization

Ao Li, Xiaoning Liu, Sheng Li, Yapeng Du, Zhen Long, Lei Luo, Le Zhang, Ce Zhu

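AI summary: This paper proposes DRNet, an image restoration framework that handles multiple degradations with a single model. It introduces a dynamic reparameterization mechanism combined with a Task-Specific Modulator and a Continuous Wavelet Transform Encoder to address high computational overhead, the difficulty of optimizing across heterogeneous tasks, and inefficient encoder design. Experiments show that DRNet reaches state-of-the-art performance on five restoration tasks with good parameter efficiency, and that it can serve either as a foundation model for blind restoration or as a user-guided specialist.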

Comments Accepted by IEEE TMM

Abstract

All-in-one image restoration aims to handle diverse degradations within a single model. However, existing methods often suffer from three key limitations: 1) per-input computational overhead from dynamic degradation estimation; 2) optimization challenges due to task heterogeneity; and 3) inefficient, frequency-agnostic encoder designs. To overcome these, we introduce the Dynamic Reparameterization Network (DRNet), a novel framework operating on an initialization-stage reconfiguration paradigm that fundamentally eliminates per-input overhead. At its core is a Dynamic Reparameterization MLP (DRMLP) guided by a Task-Specific Modulator (TSM), which effectively mitigates task heterogeneity by orchestrating both specific restoration goals and a versatile general-purpose mode within a unified architecture. Furthermore, we incorporate a Continuous Wavelet Transform Encoder (CWTE) that explicitly leverages frequency characteristics via wavelet decomposition for a lightweight yet powerful design. Extensive experiments demonstrate that DRNet achieves state-of-the-art performance across five restoration tasks with superior parameter efficiency. Crucially, it showcases unique flexibility, excelling as both a highly competitive foundation model for blind restoration and a top-performing user-guided specialist.

2605.08625 2026-05-12 cs.LG cs.AI

Reasoning-Aware Training for Time Series Forecasting

Md Atik Ahamed, Mihir Parmar, Palash Goyal, Chun-Liang Li, Qiang Cheng, Tomas Pfister, Jinsung Yoon

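AI summary: Time Series Foundation Models (TSFMs) do well at numerical forecasting but lack qualitative reasoning, while applying large language models (LLMs) directly to time series data runs into a modality gap. The authors propose STRIDE, a framework that injects LLM reasoning into the continuous embedding space of TSFMs through distilled embeddings, adding interpretability while preserving numerical forecasting performance. Experiments show that STRIDE achieves leading forecasting results on several benchmarks and markedly improves both numerical and reasoning performance in-domain and out-of-domain.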

Abstract

Time Series Foundation Models (TSFMs) excel at numerical forecasting but operate as black boxes lacking qualitative reasoning. Conversely, applying LLMs directly to temporal data introduces a modality gap: text tokenizers fragment continuous numerical values, degrading mathematical relationships and exploding sequence lengths, leading to computational overhead. To resolve this, we introduce STRIDE (Strategic Time-series Reasoning Injected via Distilled Embeddings), a novel framework natively integrating LLM reasoning into the continuous embedding space of TSFMs. Instead of discrete tokens, STRIDE distills reasoning traces into a lightweight LLM, dynamically projecting its mean-pooled hidden states as a cross-modal prior into the target numerical encoder. The architecture is jointly optimized using cross-entropy and quantile losses. Evaluations demonstrate STRIDE establishes state-of-the-art numerical forecasting on GIFT-Eval (0.674 MASE, 0.454 CRPS) compared to TSFMs and exhibits superior in-domain and out-of-domain numerical as well as reasoning performance on TFRBench. Specifically, STRIDE acts as a plug-and-play enhancement, consistently improving diverse TSFMs (e.g., Chronos-2, Timer-S1) across various LLM configurations. Thus, injecting semantic reasoning as a continuous prior equips TSFMs with human-interpretable reasoning while fundamentally improving predictive accuracy.
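
For readers unfamiliar with the quantile loss mentioned in the abstract, the standard pinball form is sketched below. How STRIDE weights it against the cross-entropy term is not stated, so only the per-quantile loss is shown, and the numbers are illustrative.

```python
def quantile_loss(y_true, y_pred, q):
    """Pinball loss for one forecast at quantile level q, with 0 < q < 1."""
    diff = y_true - y_pred
    # Under-prediction (diff > 0) costs q per unit; over-prediction costs 1 - q.
    return max(q * diff, (q - 1) * diff)

def mean_quantile_loss(y_true, preds_by_quantile):
    """Average pinball loss over several quantile forecasts of the same value."""
    losses = [quantile_loss(y_true, pred, q) for q, pred in preds_by_quantile.items()]
    return sum(losses) / len(losses)

# For the 0.9 quantile, predicting too low is penalized more than too high.
under = quantile_loss(10.0, 8.0, q=0.9)
over = quantile_loss(10.0, 12.0, q=0.9)
```

The asymmetry is what pushes each output head toward its own quantile of the predictive distribution, which is how a model trained this way can be scored with probabilistic metrics such as CRPS.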

2605.08618 2026-05-12 cs.CV cs.LG

Beyond Toy Benchmarks: A Systematic Evaluation of OOD Detection Methods For Plant Pathology Classification

Devesh Shah

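AI summary: This study systematically evaluates six out-of-distribution (OOD) detection methods on plant pathology classification, focusing on distribution shift in realistic settings. Experiments on the Plant Pathology 2021 dataset show that energy-based fine-tuning improves OOD detection while preserving in-distribution accuracy, with the gains coming from a restructured embedding space and a calibrated scoring function. The study also documents training instabilities that can arise when constrained optimization methods are applied to moderate-sized datasets, offering a useful reference for practical use.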

Abstract

Out-of-distribution (OOD) detection is essential for reliable deployment of deep learning systems, yet the majority of existing methods are evaluated on small, visually homogeneous benchmarks. In this work, we study six OOD detection methods spanning post-hoc scoring, auxiliary objectives, energy-based models, and constrained optimization on the Plant Pathology 2021 dataset, a fine-grained task with natural distribution shifts. Energy-based fine-tuning performs best across OOD settings, improving detection over the softmax baseline while preserving in-distribution accuracy. Analysis shows these gains stem from both a restructuring of the embedding space and calibration of the scoring function. We further document practical training instabilities that arise when scaling constrained optimization methods to moderate-sized datasets, findings that are largely absent from existing literature. Our results demonstrate that principled OOD detection is achievable on real-world domain-specific data and that benchmark evaluations alone may not capture the challenges that emerge in practice.
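
The two scoring functions contrasted in the abstract can be written down directly. The sketch uses the common definitions (maximum softmax probability, and free energy as the negative log-sum-exp of the logits); the fine-tuning procedure itself is not shown, and the logits are invented for illustration.

```python
import math

def softmax_confidence(logits):
    """Maximum softmax probability, the usual post-hoc baseline score."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    return max(exps) / sum(exps)

def energy_score(logits, temperature=1.0):
    """E(x) = -T * log(sum_k exp(logit_k / T)); lower suggests in-distribution."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)  # subtract the max for numerical stability
    return -temperature * (m + math.log(sum(math.exp(v - m) for v in scaled)))

# Two hypothetical logit vectors that differ only by a constant shift.
weak = [3.0, 0.0, 0.0]
strong = [8.0, 5.0, 5.0]

same_confidence = math.isclose(softmax_confidence(weak), softmax_confidence(strong))
energy_gap = energy_score(weak) - energy_score(strong)
```

Softmax cannot tell these two inputs apart because it is invariant to adding a constant to every logit, while the energy differs by exactly that constant; this is the extra signal an energy-based objective can shape during fine-tuning.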

2605.08616 2026-05-12 cs.LG

Robust Server Defense Against Unreliable Clients in One-Shot Fair Collaborative Machine Learning

Chia-Yuan Wu, Frank E. Curtis, Daniel P. Robinson

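AI summary: This paper studies how to defend against the impact of unreliable clients on global model fairness in one-shot collaborative machine learning. The authors propose a server-side defense framework based on bilevel optimization that learns client weights to reduce the influence of biased data and uses a small trusted dataset at the server to enforce fairness constraints. Experiments show that the method improves fairness with little loss in accuracy and remains robust even when unreliable clients are in the majority.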

Comments Accepted at the 2nd International Conference on Federated Learning and Intelligent Computing Systems (FLICS 2026)

Abstract

Collaborative machine learning (CML) enables multiple clients to train a global model jointly in a data-distributed setting. To address data privacy and communication efficiency, one-shot CML has been increasingly adopted, where clients communicate with the server only once by sharing synthetic or processed proxy data. This single-round communication, however, eliminates the possibility of iterative correction at the server, making the learning process particularly vulnerable to client unreliability. In this setting, unreliable clients, whether malicious or non-malicious, may provide biased proxy data that favors certain groups, thereby degrading the fairness of the global model and harming minority or unprivileged groups. In this work, we propose a server-side defense framework based on a bilevel optimization formulation. The proposed approach learns client-level weights to mitigate the influence of biased client proxy data while enforcing fairness constraints by using a very small trusted root dataset available at the server. Experimental results on benchmark datasets show that our method improves fairness with little accuracy loss under biased proxy data contributions from unreliable clients. Moreover, the proposed approach remains effective even when unreliable clients make up a majority of the system, consistently outperforming other existing methods.

2605.08614 2026-05-12 cs.AI

DiagnosticIQ: A Benchmark for LLM-Based Industrial Maintenance Action Recommendation from Symbolic Rules

Devin Yasith De Silva, Dhaval Patel, Christodoulos Constantinides, Shuxin Lin, Nianjun Zhou, Paul J Adams, Sal Rosato, Nicolas Constantinides, Deborah L. McGuinness, Jayant Kalagnanam

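AI summary: This study presents DiagnosticIQ, a benchmark for evaluating LLM-based industrial maintenance action recommendation, where the core problem is translating expert-authored symbolic rules into concrete maintenance steps. The benchmark contains 6,690 expert-validated multiple-choice questions built from 118 rule-action pairs across 16 industrial asset types, and the work contributes a pipeline for converting symbolic rules into multiple-choice questions, five variants that probe distinct failure modes, and a comparison of 29 LLMs and 4 embedding baselines. Experiments show that even state-of-the-art models degrade substantially under rule perturbation and condition inversion, indicating that robustness and calibration remain the key challenges for deployment.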

Comments 43 pages, 25 figures

Abstract

Monitoring complex industrial assets relies on engineer-authored symbolic rules that trigger based on sensor conditions and prompt technicians to perform corrective actions. The bottleneck is not detection but response: translating rules into maintenance steps requires asset-specific knowledge gained through years of practice. We investigate whether LLMs can serve as decision support for this rule-to-action step and introduce DiagnosticIQ, a benchmark of 6,690 expert-validated multiple-choice questions from 118 rule-action pairs across 16 asset types. We contribute (i) a symbolic-to-MCQA pipeline normalizing rules to Disjunctive Normal Form with embedding-based distractor sampling, (ii) five variants probing distinct failure modes (Pro, Pert, Verbose, Aug, Rationale), and (iii) a benchmark of 29 LLMs and 4 embedding baselines. A human evaluation (9 practitioners, mean 45.0%) confirms DiagnosticIQ requires specialist knowledge beyond operational experience. Three findings stand out. The frontier has closed: the top three LLMs lie within one Macro point, with Bradley-Terry Elo placing claude-opus-4-6 30 points above the next model. Yet DiagnosticIQ Pro exposes brittleness, with every model losing 13-60% relative accuracy under distractor expansion. DiagnosticIQ Aug exposes pattern-matching: under condition inversion, frontier models still select the original answer 49-63% of the time. The deployment bottleneck is not capability but calibration: frontier models handle template-style fault detection but break under structural perturbation.
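
To put the 30-point Elo gap in perspective, the conventional Elo expected-score formula can be applied. This assumes the paper reports ratings on the usual base-10, 400-point scale, which the abstract does not state.

```python
def elo_win_probability(rating_gap, scale=400.0):
    """Expected win rate of the higher-rated model under the standard Elo formula."""
    return 1.0 / (1.0 + 10 ** (-rating_gap / scale))

# Gap quoted in the abstract; the 400-point scale is an assumption.
top_vs_next = elo_win_probability(30)
```

On that scale a 30-point lead corresponds to an expected head-to-head win rate of roughly 54%, which is consistent with the abstract's description of a frontier that has closed.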

2605.08613 2026-05-12 cs.AI cs.IT cs.MA math.IT

Generalization Bounds of Emergent Communications for Agentic AI Networking

Yong Xiao, Jingxuan Chai, Guangming Shi, Ping Zhang

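AI summary: This paper studies the generalization of emergent communication in agentic AI networking (AgentNet), aiming to address the inflexibility of traditional communication protocols in 6G networks. The authors propose an information-theoretic emergent communication framework that jointly optimizes decision-making functions and the learning of communication signaling, enabling collaborative task-solving among heterogeneous agents. The method builds on multi-agent, multi-task distributed information bottleneck theory, provides theoretical generalization bounds for the communication protocol under unseen environmental states, and is validated on a hardware prototype.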

Comments Accepted at IEEE ISIT Workshop, Guangzhou, China, June 2026

Abstract

The evolution of 6G networking toward agentic AI networking (AgentNet) systems requires a shift from traditional data pipelines to task-aware, agentic AI-native communication solutions. Emergent communication, a novel communication paradigm in which autonomous agents learn their own signaling protocols through interaction, is increasingly viewed as a promising solution to address the challenges posed by existing rigid, predefined protocol-based networking architecture. However, most existing emergent communication frameworks fail to account for physical networking constraints, such as bandwidth and computational complexity, and often lack a rigorous information-theoretical foundation. To address these challenges, this paper introduces a novel emergent communication framework that facilitates collaborative task-solving among heterogeneous agents through an information-theoretic lens. We propose a novel joint loss function that unifies the optimization of decision-making functions and the learning of communication signaling. Our proposed solution is grounded on the multi-agent and multi-task distributed information bottleneck (DIB) theory, which allows the quantification of the fundamental trade-off between task-relevant information representation and computational complexity. We further provide theoretical generalization bounds of the emergent communication protocol during decentralized inference across unseen environmental states. Experimental validation on a real-world hardware prototype confirms that our proposed framework significantly improves generalization performance, compared to the state-of-the-art solutions.

2605.08612 2026-05-12 cs.RO

ATAAT: Adaptive Threat-Aware Adversarial Tuning Framework against Backdoor Attacks on Vision-Language-Action Models

Kewei Chen, Yayu Long, Shuai Li, Mingsheng Shang

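AI summary: This paper studies backdoor attacks on Vision-Language-Action (VLA) models and proposes ATAAT, an adaptive threat-aware adversarial tuning framework that addresses the failure of traditional attack methods caused by gradient interference during end-to-end training. Through a "Threat-Method Adaptive Mapping" mechanism, the framework selects the best gradient decoupling strategy for the adversary's capabilities, markedly improving attack success and stealth. Experiments show that ATAAT achieves a robust Targeted Attack Success Rate (TASR > 80%) at a poisoning rate of only 5%, handles semantic-level triggers, and achieves implicit decoupled attacks in data poisoning scenarios for the first time.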

Comments Accepted to Findings of ACL 2026

Abstract

Addressing the escalating security vulnerabilities in Vision-Language-Action (VLA) models, this study investigates backdoor attacks targeting the visual pathway. We identify a core obstacle causing the failure of traditional attack paradigms: "Gradient Interference." This phenomenon represents an optimization failure triggered by conflicting strategies during end-to-end training. To resolve this, we propose an Adaptive Threat-Aware Adversarial Tuning (ATAAT) framework. Through its core "Threat-Method Adaptive Mapping" mechanism, ATAAT intelligently selects the optimal gradient decoupling strategy based on the adversary's capabilities. Extensive experiments demonstrate that ATAAT exhibits significant advantages, achieving a highly robust Targeted Attack Success Rate (TASR > 80%) while maintaining extreme stealthiness with merely a 5% poisoning rate. It efficiently handles complex semantic-level triggers and achieves implicit decoupled attacks in data poisoning scenarios for the first time. This work reveals a critical security vulnerability in VLAs and provides theoretical and methodological support for future defense architectures.

2605.08611 2026-05-12 cs.AI

The Echo Amplifies the Knowledge: Somatic Marker Analogues in Language Models via Emotion Vector Re-Injection

Jared Glover

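AI summary: Current language model memory systems store the content of events but not the emotional experience of them. This paper introduces an emotion vector re-injection mechanism for language models that mimics the influence of emotional markers on decision-making in the brain, bridging the gap between semantic memory and emotional memory. Using pretrained sparse autoencoders, the study identifies psychologically meaningful emotion features and re-injects the corresponding emotion vectors during recall based on context similarity. This markedly changes the model's behavior on emotional-orientation and decision tasks, supporting the claim that emotional markers amplify the use of knowledge.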

Abstract

Current language model memory systems store what happened but not how it felt. This distinction, between semantic memory (knowing about a past event) and episodic memory (re-experiencing it), was identified by Tulving as the difference between noetic and autonoetic consciousness. Damasio demonstrated that humans with intact knowledge but absent emotional markers exhibit impaired decision-making. We bridge this gap for language models. Using Gemma 3 1B-IT with pretrained Gemma Scope 2 sparse autoencoders, we identify 310 emotion-exclusive features at layer 22 with psychologically valid geometry. We construct distinctive-feature emotion vectors during experience and partially re-inject them during recall, triggered by context similarity at layer 7. We test four conditions paralleling Damasio's framework: A (no memory), B (semantic labels), C (emotion echo), and BC (semantic + echo). For emotional orientation, the echo alone steepens the threat-safety gradient: the regression slope of threat rating on contextual similarity is 0.80 for C vs 0.56 for A (p=0.011, permutation test). For decisions, the echo amplifies knowledge into action: BC=80% good choices vs B=52% (z=+2.60, p<0.01), while the echo alone has no effect (C=22%, n.s.). The echo changes how the model feels independently, but changes what it does only when combined with knowledge, replicating Damasio's core finding. The echo amplifies knowledge. It does not replace it.
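
The similarity-triggered re-injection can be sketched on plain vectors. This collapses the paper's two-layer setup (trigger at layer 7, emotion features at layer 22) into a single function, and the cosine trigger, the `threshold` and `strength` values, and the toy vectors are all assumptions rather than the author's procedure.

```python
import math

def cosine(a, b):
    """Cosine similarity, returning 0.0 when either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def recall_with_echo(hidden, emotion_vector, context_key, stored_key,
                     threshold=0.8, strength=0.5):
    """Add a scaled emotion vector when the current context resembles the stored one."""
    if cosine(context_key, stored_key) < threshold:
        return list(hidden)  # Dissimilar context: no echo.
    # Partial re-injection: only a fraction of the stored vector is added back.
    return [h + strength * e for h, e in zip(hidden, emotion_vector)]

hidden_state = [1.0, 0.0]
stored_emotion = [0.0, 2.0]
similar = recall_with_echo(hidden_state, stored_emotion, [1.0, 0.0], [1.0, 0.1])
dissimilar = recall_with_echo(hidden_state, stored_emotion, [1.0, 0.0], [0.0, 1.0])
```

Gating on similarity means the echo fires only in contexts resembling the original experience, which fits the abstract's finding that the threat rating rises more steeply with contextual similarity under condition C.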

2605.08606 2026-05-12 cs.CV

Egocentric Whole-Body Human Mesh Recovery with Prior-Guided Learning

Soyeon Na, Seung Young Noh, Ju Yong Chang

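AI summary: This paper studies egocentric whole-body human mesh recovery from a monocular head-mounted camera. To address the lack of accurate parametric body model annotations and the difficulty existing methods have in recovering details such as hands and face, it proposes a prior-guided learning framework. The method builds more accurate optimization-based pseudo-labels and combines an exocentric HMR foundation model with a diffusion-based pose prior to improve reconstruction accuracy, and it adds a deterministic undistortion module to handle fisheye lens distortion. Experiments show better whole-body mesh reconstruction than existing methods on multiple egocentric datasets.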

Comments Accepted to ICIP 2026. This is the author-formatted version of the paper

Abstract

Egocentric human mesh recovery (HMR) from monocular head-mounted cameras is increasingly important for AR/VR applications, but remains challenging due to the lack of reliable ground-truth (GT) annotations based on parametric human body models such as SMPL and SMPL-X for real egocentric images. Existing egocentric HMR methods typically rely on pseudo-GT and focus on body pose estimation, which limits their ability to recover fine-grained whole-body details such as hands and face. We study egocentric whole-body human mesh recovery and propose a prior-guided learning framework that reconstructs whole-body meshes from a single egocentric image. We construct more accurate optimization-based pseudo-GT aligned with 3D joint supervision, and leverage multiple priors by adapting an exocentric HMR foundation model together with a diffusion-based pose prior. A deterministic undistortion module is further adopted to handle fisheye distortions in egocentric images. Experiments across multiple egocentric benchmarks demonstrate improved whole-body reconstruction compared to state-of-the-art methods, and show that our optimization-based pseudo-GT is substantially more accurate than existing regression-based pseudo-GT. To facilitate reproducibility, the code and dataset annotations are publicly available at https://github.com/naso06/EgoSMPLX.

2605.08605 2026-05-12 cs.LG cs.AI cs.LO

Lattice Deduction Transformers

Liam Davis, Leopold Haller, Alberto Alfarano, Mark Santolucito

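AI summary: This paper proposes the Lattice Deduction Transformer (LDT), a recurrent transformer that approximates logically sound deduction by projecting its latent state through a lattice between forward passes. The model is trained on-policy in a process that mirrors a search-based constraint solver and is supervised through a domain-agnostic, abstract-interpretation-based approximation of the solution set. Experiments show strong results on hard reasoning tasks such as Sudoku-Extreme and Maze-Hard, with soundness preserved and training cost well below that of prior small recurrent reasoners.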

详情
英文摘要

We introduce the Lattice Deduction Transformer (LDT), a recurrent transformer that approximates logically sound deduction by projecting its latent state through a lattice between forward passes. We train on-policy in a process that mirrors deduction in a search-based constraint solver and supervise training via a domain-agnostic, abstract-interpretation-based approximation of the set of solution candidates. An 800K-parameter LDT achieves 100% accuracy on Sudoku-Extreme and Snowflake Sudoku, at a fraction of the training cost of prior small recurrent reasoners, while remaining empirically sound: the model returns a correct answer or abstains. A 1.8M-parameter variant reaches 99.9% accuracy on Maze-Hard. Frontier LLMs score 0% on all three benchmarks.

2605.08599 2026-05-12 cs.AI

What Will Happen Next: Large Models-Driven Deduction for Emergency Instances

Zhengqing Hu, Dong Chen, Junkun Yuan, Liang Liu, Hua Wang, Zhao Jin, Yingchaojie Feng, Wei Chen, Mingliang Xu

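AI Summary Traditional simulation methods reproduce past emergencies from preset scenarios to support risk assessment and emergency decision-making, but their lack of randomness and diversity makes it hard to fully explore potential risks. This paper proposes the World Line Divergence System (WLDS), driven by large models, which introduces controllable randomness into its generation strategy and combines it with factual and logical calibration mechanisms to enable diversified deduction and visualization of emergencies across domains. The method improves the accuracy and logical rigor of the simulation, and its visualization module, which combines text and images, improves interpretability. Experiments indicate that it generates high-quality emergency deduction data in several specific domains, supporting decision-making in similar future scenarios.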

Abstract

Traditional simulation methods reproduce past emergency instances from preset scenarios to assist risk assessment and emergency decision-making. However, because they lack randomness and diversity and emergency instances are scarce, existing simulation systems struggle to fully explore potential risks. In contrast, Large Models (LMs) can dynamically adjust generation strategies to introduce controllable randomness, while also possessing extensive prior knowledge and cross-domain knowledge transfer capabilities. Inspired by this, we propose the LMs-driven World Line Divergence System (WLDS), which enables diversified visualization and deduction of emergency instances in different domains. WLDS leverages LMs to deduce emergency instances along various development directions, and introduces factual calibration and logical calibration mechanisms to ensure factual accuracy and logical rigor during the deduction process. The interactive module can independently select deduction directions to avoid potential hallucinations that are difficult for the system to identify. Furthermore, by introducing the visualization module, WLDS produces simulation and deduction that combine text and images, which enhances interpretability. Extensive experiments conducted on the proposed Emergency Instances Deduction (EID) benchmark dataset demonstrate that WLDS achieves high-precision and high-fidelity simulation and deduction of emergency instances in multiple specific domains. Further experiments demonstrate that WLDS can generate more emergency-instance deduction data for users and support better decision-making in similar emergency instances in the future.

2605.08592 2026-05-12 cs.CV

Cross-Modal RGB-D Fusion Transformer for 6D Pose Estimation of Non-Cooperative Spacecraft with Stereo-Derived Depth

Yongliang Zhen, Bo LÜ, Hang Yang, Xiaotian WU

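AI Summary This paper studies on-orbit 6-DOF pose estimation of non-cooperative spacecraft. To address the depth ambiguity of monocular methods and their failures under harsh illumination, it proposes a fusion Transformer approach based on passive stereo vision. A stereo matching network, TSCA-Stereo, handles the weak textures, specular highlights, and lighting variations typical of space imagery, and a cross-modal fusion Transformer adaptively combines RGB features with stereo depth features to make pose estimation more robust. Experiments on a purpose-built multimodal dataset show strong performance, supporting the method's effectiveness and reliability in demanding space environments.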

Abstract

On-orbit servicing and active debris removal involving non-cooperative spacecraft require reliable pose estimation to supply accurate position and orientation data for autonomous visual navigation. Learning-based monocular methods have seen widespread adoption in spacecraft pose estimation, yet they suffer from an intrinsic depth ambiguity problem and tend to fail under the harsh illumination conditions routinely encountered in orbit. Active depth sensors could in principle address the geometric ambiguity, but their power and mass requirements make them poorly suited to most spacecraft platforms. This work addresses these issues through a passive stereo vision framework for six-degree-of-freedom (6-DOF) pose estimation of non-cooperative spacecraft. A binocular stereo matching network called TSCA-Stereo is developed to cope with weak-texture surfaces, specular highlights, and severe lighting variations typical of space imagery. A cross-modal fusion Transformer is introduced to combine RGB appearance information with stereo depth features in an adaptive manner, supporting reliable pose recovery. A synthetic binocular multimodal dataset is also built for the experiments, covering stereo disparity maps and 6-DOF pose annotations across a range of illumination scenarios, attitude configurations, and noise levels. Experimental results show that TSCA-Stereo outperforms the baseline across every evaluated metric on this space-specific dataset. The full pose estimation pipeline achieves a mean translation error of 0.0419 m and a mean orientation error of 0.8632° under varied imaging conditions, confirming that the passive stereo approach is both effective and resilient when operating under the demanding visual conditions of the space environment.

2605.08589 2026-05-12 cs.CV

S2FT: Parameter-Efficient Fine-Tuning in Sparse Spectrum Domain

Baoquan Zhang, Zhehao Yu, Lisai Zhang, Kenghong Lin, Tianran Chen, Yuxi Sun, Yunming Ye, Yao He

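AI Summary This paper proposes S2FT, a parameter-efficient fine-tuning method that fine-tunes in a sparse spectrum domain, substantially reducing the number of tuned parameters. Whereas existing methods assume that the weight change has a sparse spectrum, the authors find that its spectrum is closer to a power-uniform distribution, so tuning only a few spectral coefficients cannot model the weight change accurately. They therefore propose an invertible transformation that maps a latent spatial-domain matrix with a sparse spectrum to the weight change, obtained via nearest-neighbor search. Experiments show that S2FT achieves superior performance with only 0.08% training parameters.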

Comments Accepted by CVPR 2026

Abstract

Parameter-Efficient Fine-Tuning (PEFT) is a key technique for adapting a large pretrained model to downstream tasks by fine-tuning only a small number of parameters. Recent methods based on Fourier transforms have further reduced the number of fine-tuned parameters by fine-tuning only a few spectral coefficients. Their basic assumption is that the weight change δW is a spatial-domain matrix with a sparse spectrum. However, in this paper, we observe that the spectrum of the weight change is not sparse, but is instead distributed approximately power-uniformly. This implies that fine-tuning only a few spectral coefficients is insufficient to accurately model a weight change with a uniform spectrum. To address this issue, we propose to seek an invertible transformation that maps a latent spatial-domain matrix with a sparse spectrum to the weight change, and then perform PEFT in this sparse spectrum domain with few spectral coefficients; we call the method S2FT. To find such a transformation, we first pre-estimate a coarse weight change as a prior. Then, inspired by the observation that sparse spectra often correspond to locally smooth spatial structures, we regard this transformation as a row and column rearrangement of the pre-estimated weight change that smooths spatial structures while keeping the structural information of neurons. Finally, we solve the rearrangement search problem with a simple nearest-neighbor search, thereby obtaining the invertible transformation. Extensive results show that S2FT achieves superior performance using only 0.08% training parameters.

2605.08587 2026-05-12 cs.LG cs.AI

Kaczmarz Linear Attention

Jiaxuan Zou, Ruifeng Ren, Yong Liu

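AI Summary This paper proposes Kaczmarz Linear Attention (KLA), a linear attention mechanism aimed at the quadratic cost of Transformer attention in long-context language modeling. Building on Gated DeltaNet (GDN), it introduces a dynamic step-size coefficient derived from the Kaczmarz projection method to improve the state update, raising model quality while keeping linear computational complexity. Experiments show that KLA outperforms GDN on several tasks, with higher retrieval accuracy, stronger associative recall, and faster decoding.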

Abstract

Long-context language modeling remains central to modern sequence modeling, but the quadratic cost of Transformer attention makes scaling computationally prohibitive. Linear recurrent models address this bottleneck by compressing the context into a fixed-size state, making the rule that forgets, writes, and edits information a central design problem. To address state maintenance, Gated DeltaNet (GDN) combines gated state decay with delta-rule residual writes, using a learnable coefficient to balance forgetting and update magnitude. However, this coefficient is learned empirically rather than derived from the underlying objective, which can lead to suboptimal update magnitudes. We revisit the online-regression objective underlying GDN and, inspired by the Kaczmarz projection method, derive the key-norm-normalized dynamic step size $\beta_t = \eta_t / (\|k_t\|_2^2 + \epsilon)$ for residual updates. We propose Kaczmarz Linear Attention (KLA), a one-scalar modification of GDN that preserves the state shape, gates, linear recurrence, and chunkwise parallel algorithm. At the 0.4B scale with a 1B-token budget, KLA achieves the lowest validation perplexity among evaluated linear-time baselines, 8.09 versus 8.50 for GDN, and remains stable up to 65K tokens. On controlled tasks, KLA reaches 100% on single-needle-in-a-haystack retrieval, improves 8x multi-query associative recall by 7.03 points over GDN, and delivers 2.1x higher decode throughput at 32K context. These results suggest that the key-norm-normalized Kaczmarz coefficient is a first-order design axis for delta-rule sequence models: it improves accuracy, extrapolation, and decoding efficiency without changing the recurrent state or hardware kernel.
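The step-size rule quoted in the abstract can be illustrated with a minimal sketch in plain Python. Only the formula for the step size comes from the abstract; the function names, the scalar `decay` gate, and the exact order of decay and write are assumptions made for illustration and may differ from the paper's recurrence.

```python
def kaczmarz_step_size(k, eta=1.0, eps=1e-6):
    """Step size eta / (||k||_2^2 + eps), following the formula in the abstract."""
    return eta / (sum(x * x for x in k) + eps)


def read(S, k):
    """Read from the state: returns the vector S k."""
    return [sum(s * x for s, x in zip(row, k)) for row in S]


def delta_rule_update(S, k, v, beta, decay=1.0):
    """One simplified gated delta-rule write (assumed form, not the paper's kernel).

    The state is first scaled by `decay`, then corrected by
    beta * (v - S k) k^T so that reading with k moves toward v.
    """
    S = [[decay * s for s in row] for row in S]
    residual = [vi - pi for vi, pi in zip(v, read(S, k))]
    return [[s + beta * r * x for s, x in zip(row, k)]
            for row, r in zip(S, residual)]


# Hypothetical toy values: a 2x3 state, one key and one value.
S0 = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
k = [3.0, 0.0, 4.0]          # squared norm is 25
v = [1.0, -2.0]

S_norm = delta_rule_update(S0, k, v, kaczmarz_step_size(k))
S_fixed = delta_rule_update(S0, k, v, beta=1.0)
```

With the normalized step size and `eta = 1`, a single write makes the state return approximately `v` when read with `k`, which is the Kaczmarz projection property. With a fixed step of 1 and a key of squared norm 25, the same write overshoots by a factor of 25.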

2605.08585 2026-05-12 cs.CV cs.AI

PromptDx: Differentiable Prompt Tuning for Multimodal In-Context Alzheimer's Diagnosis

Lujia Zhong, Yihao Xia, Shuo Huang, Jianwei Zhang, Yonggang Shi

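AI Summary This study proposes PromptDx, a diagnostic framework that performs multimodal in-context diagnosis of Alzheimer's disease through analogical reasoning. Its core is a Differentiable Prompt Tuning (DPT) mechanism that integrates a pre-trained TabPFN model with multimodal representations, addressing the gradient fracture and prior mismatch that arise when conventional approaches handle heterogeneous multimodal data. Experiments on the ADNI dataset show higher diagnostic accuracy, and with only 1% context samples the method reaches the performance that conventional methods attain with 30%, indicating efficient use of data.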

Abstract

Deep learning models in medical imaging typically operate as parametric memory, diagnosing patients by recalling fixed knowledge learned during training. This contrasts sharply with clinical practice, where physicians employ analogical reasoning to diagnose new cases by referencing similar records from past exemplars. While In-Context Learning (ICL) frameworks such as Tabular Prior-Fitted Networks (TabPFN) offer a promising diagnosis-by-reference paradigm, they are designed with tabular-specific inductive priors and rely on non-differentiable preprocessing pipelines, leading to manifold mismatch and gradient fracture when applied to heterogeneous multimodal data. To address these limitations, we propose PromptDx, a novel diagnosis-by-reference framework that leverages a pre-trained TabPFN as an ICL engine while enabling seamless integration with multimodal representations. Our core contribution is a Differentiable Prompt Tuning (DPT) mechanism that aligns a Masked Multimodal Modeling module with the pre-trained ICL engine. By training a lightweight adapter as a differentiable surrogate for the engine's non-differentiable preprocessors, we enable an end-to-end optimization of multimodal prompts within the ICL paradigm. We validate our method on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset using 3D MRI and tabular biomarkers. Experiments demonstrate that our approach outperforms traditional parametric baselines. Notably, our method achieves superior performance using only 1% context samples compared to 30% in standard ICL, demonstrating exceptional manifold condensation ability. We further validate the generalizability of our DPT framework across six tabular datasets with diverse scales. Overall, our method offers a more data-efficient and clinically aligned paradigm for Alzheimer's Disease diagnosis.

2605.08583 2026-05-12 cs.CL

Source or It Didn't Happen: A Multi-Agent Framework for Citation Hallucination Detection

Mingzhe Li, Zhiqiang Lin, Shiqing Ma

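AI Summary This paper studies fabricated citations produced by large language models in scientific writing and proposes a multi-agent framework for citation hallucination detection. It builds a taxonomy of 12 citation codes and develops a detector, CiteTracer, which assesses citation authenticity through structured extraction, multi-source evidence retrieval, and class-specialist judgment. Experiments show high detection accuracy on both the synthetic and the real-world datasets.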

Abstract

Large language models are increasingly used in scientific writing, yet they can fabricate citation-shaped references that appear plausible but fail bibliographic verification. Existing detectors often reduce verification to binary found/not-found decisions and rely on brittle parsing or incomplete retrieval, offering little field-level signal to auditors. We reframe citation hallucination detection as taxonomy-aligned field-level adjudication and introduce a 12-code taxonomy spanning Real, Potential, and Hallucinated citations. Based on this taxonomy, we build CiteTracer, a cascading multi-agent detector that extracts structured citations from PDF and BibTeX, retrieves evidence through cache lookup, URL fetch, scholar connectors, and web search, applies deterministic field matching, and routes ambiguous cases to class-specialist judgers. We release a benchmark of 2,450 synthetic citations built from real seeds with controlled LLM mutations, paired with 957 real-world fabricated citations drawn from desk-rejected submissions to ICLR 2026 and an anonymous conference. CiteTracer reaches 97.1% accuracy on the synthetic benchmark, with class-level F1 scores of 97.0, 95.8, and 98.5 for Real, Potential, and Hallucinated, respectively, and detects 97.1% of fabrications on the real-world set without abstaining. Code: https://github.com/aaFrostnova/CiteTracer.

2605.08581 2026-05-12 cs.LG

PRISM: Fast Online LLM Serving via Scheduling-Memory Co-design

Xingyu Qu, Tianhao Lin, Yiqi Li, Zhiyu Chen, Sheng Wang

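AI Summary In modern online LLM services such as RAG and agent systems, user requests often exhibit prompt segmentation and hotspot skew. Existing methods do not exploit the two jointly, which leads to repeated prefill of hot segments and longer time to first token (TTFT). This paper proposes PRISM, which co-designs scheduling and memory management, introducing a query-aware scheduler (QAS) and a demand-aware radix tree (DART) to align request admission with exact-prefix KV-cache retention. Experiments show that PRISM reduces TTFT and raises the cache hit rate across multiple models.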

Comments 25 pages, 9 figures, Preprint

Abstract

Modern online large language model (LLM) services, such as Retrieval-Augmented Generation (RAG) and agent systems, increasingly expose two prominent characteristics: prompt segmentation (e.g., system instructions, retrieved passages, tool outputs) and hotspot skew, where a small set of these segments recurs frequently across user requests. Failing to jointly exploit these patterns could lead to repeated prefill of hot segments and prolonged TTFT, undermining both throughput and user-perceived responsiveness. However, existing work tackles these patterns independently: KV-cache management mainly exploits segment reuse while scheduling reorders requests to improve cache locality, yet neither aligns request admission with KV-cache retention. To address this gap, we first analyze how scheduling and KV-cache management jointly affect TTFT. Guided by this, we present PRISM (Prefix Reuse Optimization Integrated Scheduling and Memory), which co-designs a query-aware scheduler (QAS) with a demand-aware radix tree (DART) to align request admission with exact-prefix KV retention. Our evaluation results show that, versus the strongest baseline, PRISM reduces average per-QPS P99 TTFT by 23.3% and 37.1% while increasing exact-prefix KV-cache hit rate by 5.9 and 12.2 percentage points on 4B and 13B models, respectively.

2605.08578 2026-05-12 cs.LG cs.AI

Probing the Impact of Scale on Data-Efficient, Generalist Transformer World Models for Atari

Jooyeon Kim

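AI Summary This paper studies how model scale affects data-efficient, generalist Transformer world models on Atari. Using a minimalist Transformer world model trained on fixed offline datasets, the author analyzes scaling behavior across environments and finds that environments respond to model scale in markedly different ways. Joint training stabilizes scaling dynamics so that every environment improves in the overparameterized regime, and gains in model fidelity translate directly into better downstream control.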

Abstract

Developing generalist systems that retain human-like data efficiency is a central challenge. While world models (WMs) offer a promising path, existing research often conflates architectural mechanisms with the independent impact of model scale. In this work, we use a minimalist transformer world model to analyze scaling behaviors on the Atari 100k benchmark, using fixed offline datasets derived from a presupposed expert policy. Our results reveal that environments fundamentally fall into distinct scaling regimes, even when constrained by identical offline data budgets and model capacities. For individual tasks, some environments naturally allow models to pass the interpolation threshold, yielding monotonic improvements in the overparameterized regime, while others remain trapped in the classical regime, where larger world models degrade fidelity. In the unified setting, i.e., a single transformer trained on a suite of 26 Atari environments, we uncover that joint training stabilizes scaling dynamics, ensuring monotonic gains across all environments, regardless of their distinct inherent scaling regimes. Finally, we demonstrate that improved fidelity translates directly to downstream control, with policies learned entirely within the simulated dynamics achieving a median expert-random-normalized score of 0.770. Our findings suggest that future progress lies as much in precise scaling strategies as in architectural innovation.
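The abstract reports a median expert-random-normalized score but does not spell out the formula. A minimal sketch follows, assuming the usual convention by analogy with human-normalized Atari scores: 0 corresponds to a random policy and 1 to the expert policy. The function names and the example numbers are hypothetical and are not results from the paper.

```python
from statistics import median


def expert_random_normalized(score, random_score, expert_score):
    """Assumed normalization: (score - random) / (expert - random)."""
    return (score - random_score) / (expert_score - random_score)


def median_normalized(results):
    """Median over environments of the normalized score.

    `results` maps an environment name to a tuple
    (agent_score, random_score, expert_score).
    """
    return median(expert_random_normalized(*triple)
                  for triple in results.values())


# Hypothetical raw scores for three made-up environments.
example = {
    "env_a": (50.0, 0.0, 100.0),    # normalized 0.5
    "env_b": (10.0, 10.0, 20.0),    # normalized 0.0 (random level)
    "env_c": (300.0, 100.0, 300.0), # normalized 1.0 (expert level)
}
summary = median_normalized(example)
```

Taking the median rather than the mean keeps a single environment with an extreme ratio from dominating the aggregate, which is presumably why the abstract reports a median.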

2605.08577 2026-05-12 cs.CV cs.LG

Improving Generative Adversarial Networks with Self-Distillation

Antoni Nowinowski, Krzysztof Krawiec

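AI Summary This paper proposes SD-GAN, an improvement to generative adversarial networks that uses the exponential moving average (EMA) generator as a teacher to guide the generator being trained (the student), distilling knowledge through a perceptual loss. The authors prove local asymptotic stability in the Dirac-GAN setting and show that the method dampens the parasitic cycling common in conventional GANs. Experiments show that SD-GAN improves generation quality on several image quality metrics, makes optimization more stable, and is also effective for fine-tuning pretrained GANs.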

Abstract

In modern GANs, maintaining an Exponential Moving Average (EMA) of the generator's weights is a standard practice, as such an averaged model consistently outperforms the actively trained generator. However, the EMA generator is used for final deployment only and does not influence the training process. To address this missed opportunity, we introduce Self-Distilled GAN (SD-GAN) that employs the EMA generator as a teacher to guide the active generator (student) via perceptual loss. We prove the local asymptotic stability of SD-GAN in the Dirac-GAN setting and show that it dampens the parasitic cycling behavior that plagues the conventional GANs. Empirical evaluations across established architectures and datasets demonstrate that SD-GAN improves the final image quality on several metrics (FID and random-FID in particular), stabilizes the optimization trajectory and provides additional learning guidance that is not trivially correlated with the conventional adversarial loss. It also proves effective for fine-tuning pretrained GAN models.
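The teacher-student loop described in the abstract can be sketched in plain Python on flat lists of numbers. The EMA update is the standard one; the squared-distance term is only a stand-in for the perceptual loss, which the abstract does not specify, and the names `decay` and `weight` are illustrative assumptions.

```python
def ema_update(teacher, student, decay=0.999):
    """Standard EMA of weights: teacher <- decay * teacher + (1 - decay) * student."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


def distill_loss(student_out, teacher_out):
    """Mean squared distance between outputs.

    Stand-in for a perceptual loss, which would compare features from a
    pretrained network rather than raw outputs.
    """
    return sum((a - b) ** 2
               for a, b in zip(student_out, teacher_out)) / len(student_out)


def student_objective(adv_loss, student_out, teacher_out, weight=1.0):
    """Adversarial loss plus a weighted distillation term toward the EMA teacher."""
    return adv_loss + weight * distill_loss(student_out, teacher_out)


# Hypothetical toy values.
teacher_w = ema_update([1.0, 2.0], [3.0, 2.0], decay=0.5)
loss = student_objective(0.5, [1.0, 3.0], [1.0, 1.0], weight=0.25)
```

In this sketch the teacher never receives gradients: it only tracks the student through `ema_update`, while the student is pulled toward the teacher's outputs through the extra loss term.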

2605.08575 2026-05-12 cs.LG cs.AI

Uncovering Intra-expert Activation Sparsity for Efficient Mixture-of-Expert Model Execution

Jongseok Park, Sunga Kim, Zhenyu Gu, Ion Stoica, Alvin Cheung

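AI Summary This paper studies how intra-expert activation sparsity in sparsely activated Mixture-of-Experts (MoE) models can be used to speed up inference. An analysis of several pre-trained MoE models finds up to 90% intra-expert sparsity without modifying the model parameters or activation function, which substantially reduces computation. The authors integrate this into the vLLM inference framework by skipping computation for inactive neurons, achieving up to 2.5x faster MoE layer execution and up to 1.2x end-to-end speedup without significant accuracy loss.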

Abstract

The Mixture of Experts (MoE) architecture has become the standard for state-of-the-art large language models, owing to its computational efficiency through sparse expert activation. However, sparsity through finer expert granularity is becoming increasingly difficult to achieve due to fundamental training challenges such as expert collapse and load imbalance. In this work, we explore and leverage intra-expert activation sparsity as a complementary and underexplored dimension of sparsity in MoE models. Surprisingly, substantial intra-expert sparsity is readily available in existing pre-trained MoE models, without any modification to the activation function or model parameters, providing up to 90% sparsity within each expert without significant accuracy loss. We explore intra-expert activation sparsity across eight off-the-shelf MoE models ranging from 1B to 400B parameters, and extend the MoE execution pipeline of vLLM to leverage intra-expert activation sparsity by skipping the computations of inactive neurons, on top of its existing optimizations, achieving up to 2.5 times speedup in MoE layer execution and 1.2 times end-to-end speedup compared to the original dense vLLM baseline.
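The idea of skipping inactive neurons can be illustrated with a minimal plain-Python sketch of an expert's down-projection. The abstract does not say how inactive neurons are identified, so the magnitude threshold below is an assumption, as are the function name and the data layout. This is not the vLLM implementation.

```python
def sparse_down_projection(hidden, down_rows, threshold=0.0):
    """Down-project an expert's hidden activations, skipping inactive neurons.

    hidden:     activation of each intermediate neuron (after the gate).
    down_rows:  down_rows[i] is the output contribution of neuron i per unit
                of activation (row i of the down-projection, assumed layout).
    threshold:  neurons with |activation| <= threshold are treated as inactive
                (assumed criterion).
    Returns the output vector and the fraction of neurons skipped.
    """
    out = [0.0] * len(down_rows[0])
    skipped = 0
    for h, row in zip(hidden, down_rows):
        if abs(h) <= threshold:
            skipped += 1
            continue  # no multiply-adds are spent on this neuron
        for j, w in enumerate(row):
            out[j] += h * w
    return out, skipped / len(hidden)


# Hypothetical toy expert with four intermediate neurons and two outputs.
hidden = [0.0, 2.0, 0.0, -1.0]
down_rows = [[1.0, 1.0], [0.5, -1.0], [3.0, 3.0], [2.0, 0.0]]
output, sparsity = sparse_down_projection(hidden, down_rows)
```

With a threshold of 0 the result equals the dense computation, since only exactly-zero activations are skipped. A positive threshold trades a small approximation error for more skipped work, which is the kind of trade-off the abstract's "without significant accuracy loss" refers to.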

2605.08574 2026-05-12 cs.CV cs.LG

Post-hoc Selective Classification for Reliable Synthetic Image Detection

Kaixiang Zheng, Jacob H. Seidman

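AI Summary As synthetic images become increasingly realistic, reliable synthetic image detection is important for preventing misuse. Detectors based on deep neural networks perform well in distribution but are unreliable under covariate shift. This paper proposes ReSIDe, a post-hoc selective classification framework that generalizes the notion of logits to intermediate layers and optimizes the confidence estimate, substantially improving selective classification performance under covariate shift.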

Abstract

As synthetic images become increasingly realistic, reliable synthetic image detection techniques are of pressing need to prevent their misuse. Despite satisfactory in-distribution performance, deep neural network-based synthetic image detectors (SIDs) lack reliability in deployment and often fail in the presence of common covariate shifts, resulting in poor detection accuracy. To avoid the risk caused by potential errors, we adopt a selective classification (SC) strategy by allowing SIDs to abstain from making low confidence predictions. For practicality, we focus on post-hoc methods which perform confidence estimation on a given SID without retraining. However, we show that conventional logit-based confidence score functions (CSFs) exhibit pathological behavior under covariate shifts, leading to SC performance close to or even worse than random guessing. To address this, we propose a simple yet effective SC framework for Reliable Synthetic Image Detection (ReSIDe). First, we generalize the notion of logits to an SID's intermediate layers from a centroid matching perspective, extending the use of logit-based CSFs to any layer of an SID. Then, we introduce a preference optimization algorithm that aggregates confidence scores extracted from different layers to a final confidence estimate by minimizing an upper bound of the area under the risk-coverage curve (AURC). Extensive experimental results show that ReSIDe significantly boosts the SC performance of various logit-based CSFs under common covariate shifts, achieving up to 69.55% AURC reduction.
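The evaluation metric named in the abstract, the area under the risk-coverage curve (AURC), can be sketched with a common empirical estimator in plain Python. This only computes the metric; it is not the paper's preference optimization algorithm, which minimizes an upper bound of AURC. Function names and the toy values are illustrative assumptions.

```python
def risk_coverage_curve(confidences, errors):
    """Risk at each coverage level when the most confident samples are kept first.

    confidences: one confidence score per sample (higher means more confident).
    errors:      1 if the detector's prediction is wrong, else 0.
    Returns a list of (coverage, selective_risk) pairs.
    """
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    curve, wrong = [], 0
    for kept, i in enumerate(order, start=1):
        wrong += errors[i]
        curve.append((kept / len(order), wrong / kept))
    return curve


def aurc(confidences, errors):
    """Empirical AURC: the mean selective risk over all coverage levels."""
    curve = risk_coverage_curve(confidences, errors)
    return sum(risk for _, risk in curve) / len(curve)


# Hypothetical toy data: the same two errors, ranked well and ranked badly.
errors = [0, 0, 1, 1]
good = aurc([0.9, 0.8, 0.3, 0.1], errors)   # errors get low confidence
bad = aurc([0.1, 0.3, 0.8, 0.9], errors)    # errors get high confidence
```

Lower is better: a confidence score that ranks mistakes last lets the detector abstain on them first, so risk stays low across coverage levels. The `bad` ranking above is worse than random ordering, which mirrors the pathological behavior the abstract attributes to logit-based scores under covariate shift.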

2605.08572 2026-05-12 cs.CV

Enhancing Consistency Models for Multi-Agent Trajectory Prediction

Alen Mrdovic, Qingze Liu, Danrui Li, Mathew Schwartz, Kaidong Hu, Sejong Yoon, Mubbasir Kapadia, Vladimir Pavlovic

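AI Summary This paper studies how to improve consistency models for multi-agent trajectory prediction. To address the inference latency that iterative denoising causes in diffusion models, it proposes improved training and generation for consistency models. A student-teacher consistency training framework strengthens supervision with ground-truth trajectory information, and the direct denoising property is used for multi-shot generation, improving both prediction accuracy and inference speed. The method achieves competitive results on the large-scale Argoverse 2 dataset.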

Abstract

Diffusion models for multi-agent trajectory prediction are limited by iterative denoising, which causes inference latency that hinders their use in time-critical settings like autonomous driving. Fast-sampling variants using DDIM and informed initial noise distributions partially alleviate this issue, but they either fail to achieve true single-step generation or are constrained by the chosen noise distribution. Consistency Models (CMs) offer high-quality one-step generation by mapping noise directly to data, but are difficult to train from scratch. We propose ECTraj, an enhanced CM pipeline with improved training and conditional generation for trajectory prediction. Our framework extends the student-teacher consistency training scheme: the student produces standard outputs, while the teacher explicitly fuses its predictions with parts of the ground truth to give stronger supervision. We also exploit CMs' direct denoising for top-K multi-shot generation during training. Combining conditional generation with this enhanced consistency objective yields faster inference and improved prediction accuracy, establishing competitive new benchmarks on the large-scale Argoverse 2 dataset.