arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.14220 2026-05-15 cs.LG cs.AI cs.CL

Diagnosing Training Inference Mismatch in LLM Reinforcement Learning

Tianle Zhong, Neiwen Ling, Yifan Pi, Zijun Wei, Tianshu Yu, Geoffrey Fox, Peng Wu, Xiao Yu

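AI summary: This paper studies training-inference mismatch (TIM) in reinforcement learning for large language models, where the training and inference stages assign inconsistent probabilities. The authors propose a zero-mismatch diagnostic setting (VeXact) to isolate the effect of TIM and find that even tiny token-level numerical differences can cause training collapse. They further show that TIM changes the nature of the optimization problem and propose several remedies, stressing that TIM is a key systems-level factor in LLM RL stability rather than mere numerical noise.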

English abstract

Modern LLM RL systems separate rollout generation from policy optimization. These two stages are expected to produce token probabilities that match exactly. However, implementation differences can make them assign different values to the same sequence under the same model weights, inducing Training-Inference Mismatch (TIM). TIM is difficult to inspect because it is entangled with off-policy drift and common stabilization mechanisms. In this work, we isolate TIM in a zero-mismatch diagnostic setting (VeXact), and show that small token-level numerical disagreements can independently cause training collapse. We further show that TIM changes the effective optimization problem, and identify a set of remedies that could mitigate TIM. Our results suggest that TIM is not benign numerical noise, but a systems-level perturbation that should be treated as a first-order factor in analyzing LLM RL stability.
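
To make the scale of the effect concrete, the sketch below shows how a small per-token log-probability gap between the rollout engine and the trainer compounds into a sequence-level probability ratio far from 1. It is an illustration only, not the paper's VeXact setup: the function name `sequence_importance_ratio` and the numbers (a gap of 0.001 nats per token over 2,000 tokens) are assumptions chosen for the arithmetic.

```python
import math

def sequence_importance_ratio(train_logprobs, rollout_logprobs):
    """Ratio pi_train(y) / pi_rollout(y) for one sampled sequence.

    Both inputs hold per-token log-probabilities of the same tokens under
    the same weights, as computed by two different implementations.
    """
    if len(train_logprobs) != len(rollout_logprobs):
        raise ValueError("log-prob lists must cover the same tokens")
    gap = sum(t - r for t, r in zip(train_logprobs, rollout_logprobs))
    return math.exp(gap)

# Hypothetical numbers: the trainer scores every token 0.001 nats higher.
n_tokens = 2000
rollout = [-1.5] * n_tokens
train = [lp + 0.001 for lp in rollout]

ratio = sequence_importance_ratio(train, rollout)
print(f"sequence-level ratio: {ratio:.3f}")  # close to e^2 rather than 1
```

With identical implementations the ratio would be exactly 1, so any deviation in this toy comes purely from the assumed numerical disagreement, not from a change in policy weights.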

2605.14218 2026-05-15 cs.AI physics.soc-ph

Fusion-fission forecasts when AI will shift to undesirable behavior

Neil F. Johnson, Frank Yingjie Huo

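AI summary: This paper studies how the behavior of ChatGPT-like AI systems can shift from beneficial to harmful during use, and proposes a forecasting method based on fusion-fission group dynamics. By analyzing the competitive dynamics between the conversation history and beneficial or harmful behavior, the method can predict in advance when the AI's behavior will shift, without depending on a specific model or on stochastic sampling. The method is validated in several independent tests, indicating broad applicability and high predictive accuracy.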

English abstract

The key problem facing ChatGPT-like AI's use across society is that its behavior can shift, unnoticed, from desirable to undesirable -- encouraging self-harm, extremist acts, financial losses, or costly medical and military mistakes -- and no one can yet predict when. Shifts persist in even the newest AI models despite remarkable progress in AI modeling, post-training alignment and safeguards. Here we show that a vector generalization of fusion-fission group dynamics observed in living and active-matter systems drives -- and can forecast -- future shifts in the AI's behavior. The shift condition, which is also derivable mathematically, results from group-level competition between the conversation-so-far (C) and the desirable (B) and undesirable (D) basin dynamics which can be estimated in advance for a given application. It is neither model-specific nor driven by stochastic sampling. We validate it across six independent tests, including: 90 percent correct across seven AI models spanning two orders of magnitude in parameter count (124M-12B); production-scale persistence across ten frontier chatbots; and a priori time-stamped prediction eleven months before the Stanford 'Delusional Spirals' corpus appeared, and independently confirmed by that corpus of 207,443 human-AI exchanges. Because it sits architecturally below the current safety stack, the same formula provides a real-time warning signal that current alignment does not supply, portable across current and future ChatGPT-like AI architectures and instantiable in application domains where competing response classes can be defined.

2605.14217 2026-05-15 cs.LG cs.AI cs.CL cs.SY eess.SY

PreFT: Prefill-only finetuning for efficient inference

Andrew Lanpouthakoun, Aryaman Arora, Zhengxuan Wu, Dhruv Pai, Ben Keigwin, Dan Jurafsky, Christopher Potts

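AI summary: This paper proposes PreFT, an efficient finetuning method that applies the adapter only during the prefill stage of inference, raising serving throughput in multi-user settings. Compared with conventional parameter-efficient finetuning (PEFT) methods, PreFT significantly improves throughput while maintaining performance, and does especially well when serving many adapters. Experiments show that PreFT approaches or even matches conventional PEFT on supervised finetuning and reinforcement learning tasks, supporting a better accuracy-throughput tradeoff for personalized serving.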

English abstract

Large language models can now be personalised efficiently at scale using parameter efficient finetuning methods (PEFTs), but serving user-specific PEFTs harms throughput, even with specialised kernels and memory management techniques. This is because, theoretically and empirically, a mismatch exists between prefill (processing a large number of tokens at once) and decode (generating a single token autoregressively): the latter has far lower throughput when serving multiple adapters. Rather than optimising performance relative to parameter count, for efficient multi-adapter serving, we instead ought to optimise performance relative to serving throughput. We therefore propose PreFT (Prefill-only Finetuning), wherein we only apply the adapter to prefill tokens and discard it afterwards. PreFT significantly increases throughput with minimal effect on performance. We develop and release an efficient implementation of two prefill-only PEFTs, LoRA and ReFT, on the vLLM inference engine. We first show that serving multi-user PreFTs is more efficient than traditional PEFTs ($1.9\times$ the throughput when serving $512$ adapters on Llama 3.1 70B). Then, we compare the performance of prefill-only vs. all-token adapters on a variety of supervised finetuning and reinforcement learning tasks with LMs at varying scales. On SFT, we observe that the evaluation loss of PreFTs is higher than PEFTs, but can be compensated by increasing rank with nearly no reduction in throughput. On RL, we consistently find that PreFTs approach parity with standard PEFTs. Together, this work validates prefill-only adaptation of LLMs as a more favourable accuracy-throughput tradeoff than existing PEFTs for personalised serving.
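
The prefill-only idea can be pictured with a toy linear projection: the low-rank update is added for prompt (prefill) tokens and skipped for generated (decode) tokens. This is a minimal sketch under stated assumptions, not the released vLLM implementation; the helper names `matvec`, `lora_project`, and `serve` and all matrix values are hypothetical, and attention and the key-value cache are omitted.

```python
def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_project(x, W, A, B, use_adapter):
    """Return W x, plus the low-rank update B (A x) when the adapter is on."""
    y = matvec(W, x)
    if use_adapter:
        delta = matvec(B, matvec(A, x))
        y = [yi + di for yi, di in zip(y, delta)]
    return y

def serve(prompt, generated, W, A, B):
    """Adapter on for prompt (prefill) tokens, off for generated (decode) tokens."""
    prefill = [lora_project(x, W, A, B, use_adapter=True) for x in prompt]
    decode = [lora_project(x, W, A, B, use_adapter=False) for x in generated]
    return prefill, decode

# Toy 2-d example with a rank-1 adapter (all values hypothetical).
W = [[1.0, 0.0], [0.0, 1.0]]  # base weight, d x d
A = [[1.0, 1.0]]              # down-projection, r x d with r = 1
B = [[0.5], [0.0]]            # up-projection, d x r
prefill, decode = serve([[1.0, 2.0]], [[1.0, 2.0]], W, A, B)
print(prefill)  # output with the adapter applied
print(decode)   # output from base weights only
```

Because the decode path touches only the shared base weights in this sketch, requests from different users need no per-user adapter work at decode time, which is the stage the abstract identifies as the throughput bottleneck.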

2605.14215 2026-05-15 cs.AI cs.LG q-bio.QM

GenCircuit-RL: Reinforcement Learning from Hierarchical Verification for Genetic Circuit Design

Noah Flynn

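AI summary: Addressing the fact that genetic circuit design in synthetic biology still depends on expert experience, this work proposes GenCircuit-RL, a reinforcement learning framework whose hierarchical verification rewards decompose circuit correctness into five levels, combined with a four-stage curriculum that progressively builds model capability. The authors also construct SynBio-Reason, a benchmark of 4,753 circuits, to evaluate models on tasks such as code repair and de novo design. Experiments show that hierarchical verification and curriculum learning significantly raise success rates on functional reasoning tasks and yield topologically correct genetic circuit designs that generalize well.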

Comments Link: https://icml.cc/virtual/2026/poster/61789

English abstract

Genetic circuit design remains a laborious, expert-driven process despite decades of progress in synthetic biology. We study this problem through code generation: models produce Python code in pysbol3 to construct genetic circuits in the Synthetic Biology Open Language (SBOL), a formal representation that supports automated verification. We introduce GenCircuit-RL, a reinforcement learning framework built around hierarchical verification rewards that decompose correctness into five levels, from code execution to task-specific topological checks, and a four-stage curriculum that shifts optimization pressure from code generation to functional reasoning. We also introduce SynBio-Reason, a benchmark of 4,753 circuits spanning six canonical circuit types and nine tasks from code repair to de novo design, with held-out biological parts for out-of-distribution evaluation. Hierarchical verification improves task success on functional reasoning tasks by 14 to 16 percentage points over binary rewards, and curriculum learning is required for strong design performance. The resulting models generate topologically correct circuits, generalize to novel biological parts, and rediscover canonical designs from the synthetic biology literature.

2605.14212 2026-05-15 cs.AI

MetaAgent-X: Breaking the Ceiling of Automatic Multi-Agent Systems via End-to-End Reinforcement Learning

Yaolun Zhang, Yujie Zhao, Nan Wang, Yiran Wu, Jiayu Chang, Yizhao Chen, Qingyun Wu, Jishen Zhao, Huazheng Wang

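AI summary: This paper proposes MetaAgent-X, an end-to-end reinforcement learning framework that aims to overcome the decoupling of design and execution in existing automatic multi-agent systems (MAS) and to generate agent workflows that are both self-designing and self-executing. The method jointly optimizes design and execution and introduces hierarchical rollout and stagewise co-evolution strategies, improving training stability and system adaptability. Experiments show that MetaAgent-X significantly outperforms existing methods on multiple benchmarks, supporting the effectiveness and practicality of end-to-end training for automatic MAS.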

English abstract

Automatic multi-agent systems aim to instantiate agent workflows without relying on manually designed or fixed orchestration. However, existing automatic MAS approaches remain only partially adaptive: they either perform training-free test-time search or optimize the meta-level designer while keeping downstream execution agents frozen, creating a frozen-executor ceiling and leaving the end-to-end training of self-designing and self-executing agentic models unexplored. To address this, we introduce MetaAgent-X, an end-to-end reinforcement learning framework that jointly optimizes automatic MAS design and execution. MetaAgent-X enables script-based MAS generation, execution rollout collection, and credit assignment for both designer and executor trajectories. To support stable and scalable optimization, we propose Executor Designer Hierarchical Rollout and Stagewise Co-evolution to improve training stability and expose the dynamics of designer-executor co-evolution. MetaAgent-X consistently outperforms existing automatic MAS baselines, achieving up to 21.7% gains. Comprehensive ablations show that both designer and executor improve throughout training, and that effective automatic MAS learning follows a stagewise co-evolution process. These results establish end-to-end trainable automatic MAS as a practical paradigm for building self-designing and self-executing agentic models.

2605.14210 2026-05-15 cs.LG cs.AI

Towards Fine-Grained and Verifiable Concept Bottleneck Models

Yingying Fang, Haijie Xu, Shuang Wu, Mariathasan Anish, Guang Yang

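AI summary: This paper proposes a fine-grained and verifiable concept bottleneck model (CBM) framework to address the difficulty existing CBMs have in verifying whether predicted concepts correspond to the correct visual evidence. By linking each concept to localized visual evidence, the method supports direct inspection of where and how concepts are encoded, improving interpretability and reliability. Experiments show that it substantially improves transparency while maintaining predictive performance, and establishes a concept-level mechanism for human-model interaction, laying groundwork for more reliable and clinically usable concept-based learning systems.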

Comments 10 pages, 4 figures

English abstract

Concept Bottleneck Models (CBMs) offer interpretable alternatives to black-box predictors by introducing human-relatable concepts before the final output. However, existing CBMs struggle to verify whether predicted concepts correspond to the correct visual evidence, limiting their reliability. We propose a fine-grained CBM framework that grounds each concept in localized visual evidence, enabling direct inspection of where and how concepts are encoded. This design allows users to interpret predictions and verify that the model learns intended concepts rather than spurious correlations. Experiments on medical imaging benchmarks show that our learned concept space is information-complete and achieves predictive performance comparable to standard CBMs, while substantially improving transparency. Unlike post-hoc attribution methods, our framework validates both the presence and correctness of concept representations, bridging interpretability with verifiability. Our approach enhances the trustworthiness of CBMs and establishes a principled mechanism for human-model interaction at the concept level, paving the way toward more reliable and clinically actionable concept-based learning systems.

2605.14200 2026-05-15 cs.LG stat.ML

How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization

Leena Chennuru Vankadara, Moritz Haas, Luke Hayward, Sebastian Bordt, Alessandro Breccia

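AI summary: This paper studies how to parameterize Mixture-of-Experts (MoE) architectures at scale, analyzing how hyperparameters should scale with network width, number of experts, sparsity, and related quantities. The authors develop an analysis framework based on Dynamical Mean Field Theory (DMFT) and derive the parameterization ($μ$P) that satisfies the maximal-update ($μ$) conditions, but find that it falls short in how it behaves with scale. They therefore propose the Maximally Scale-Stable Parameterization (MSSP), which achieves learning-rate transfer and monotonic improvement with scale across the scaling regimes, providing complete theoretical guidance for scaling MoE architectures.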

English abstract

Recent frontier large language models predominantly rely on Mixture-of-Experts (MoE) architectures. Despite empirical progress, there is still no principled understanding of how hyperparameters should scale with network width $N$, expert width $N_e$, number of experts $M$, sparsity $K$, and depth $L$ to ensure both stability and optimal performance at scale. We take a principled step toward resolving this gap by analyzing three different scaling regimes: (I) co-scaling $N\asymp N_e$, (II) co-scaling $N\asymp M\asymp K$, and (III) full proportional scaling of $N, N_e, M$, and $K$. For each regime, we develop a novel Dynamical Mean Field Theory (DMFT) description of the limiting training dynamics of MoEs that provides a formal foundation for our analysis. Within this framework, we derive the unique parameterization for SGD and Adam satisfying all maximal-update ($μ$) desiderata. We then show that the resulting $μ$P prescription does not reliably induce monotonic improvement with scale or robust learning-rate transfer. We trace these pathologies to scale-dependent observables in the aggregation dynamics, which motivates a refined set of desiderata that we term maximal scale stability. Guided by this principle, we derive a Maximally Scale-Stable Parameterization (MSSP) for both SGD and Adam in all three scaling regimes, and characterize the corresponding limiting dynamics - qualitatively distinct from the $μ$P limit - through a separate DMFT analysis. Experiments verify that MSSP robustly recovers learning rate transfer and monotonic improvement with scale across regimes. Combined with existing depth-scaling theory, these results provide a complete scaling prescription for MoE architectures as a function of width, depth, expert width, and number of experts.

2605.14199 2026-05-15 cs.RO cs.SY eess.SY

Motion Planning for Autonomous Vehicles using Optimization over Graphs of Convex Sets

Matheus Wagner, Antônio Augusto Fröhlich

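AI summary: This paper studies how optimization over Graphs of Convex Sets (GCS) can approximately solve nonlinear optimal control problems in autonomous driving, producing collision-free and dynamically feasible trajectories. The method represents free space as a directed graph of convex sets, parameterizes vehicle motion with Bezier curves and a polynomial time function, and approximately enforces dynamic feasibility through convex constraints. Experiments show that, while preserving trajectory safety and dynamic consistency, the method is more computationally efficient and more robust than conventional nonlinear optimal control.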

English abstract

Motion planning for autonomous vehicles requires generating collision-free and dynamically feasible trajectories in complex environments under real-time constraints. While nonlinear optimal control formulations provide high-fidelity solutions, they are computationally demanding and sensitive to initialization, whereas geometric planning methods scale well but often decouple path selection from trajectory optimization. This paper studies the extent to which optimization over Graphs of Convex Sets (GCS) can approximate solutions of nonlinear optimal control problems in the context of autonomous driving. The free space is represented as a finite union of convex regions organized as a directed graph, allowing nonconvex geometry to be handled through discrete connectivity decisions while maintaining convex trajectory constraints within each region. Vehicle motion is parameterized using Bezier curves for the spatial path and a polynomial time-scaling function for temporal evolution. Under small-slip and linear tire assumptions, a simplified dynamic bicycle model enables approximate enforcement of dynamic feasibility through convex constraints on trajectory derivatives. The approach is evaluated in CommonRoad scenarios involving static obstacle avoidance and lane-changing maneuvers, and is compared against a nonlinear discrete-time optimal control formulation. The results indicate that the GCS-based method generates collision-free and dynamically consistent trajectories that closely match those obtained from the nonlinear program, while exhibiting improved computational efficiency and reduced sensitivity to initialization. These findings suggest that GCS provides a structured approximation of nonlinear motion planning problems, capturing dominant geometric and dynamic effects while preserving convexity in the continuous relaxation.
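
One reason Bezier curves pair well with convex regions is the convex hull property: a Bezier curve is a convex combination of its control points, so keeping the control points inside a convex set keeps the whole curve inside it. The sketch below checks this numerically with de Casteljau evaluation. The box standing in for a lane segment and the control points are hypothetical values, and the paper's actual GCS formulation is not reproduced here.

```python
def de_casteljau(control_points, t):
    """Evaluate a Bezier curve at t in [0, 1] by repeated linear interpolation."""
    pts = [tuple(p) for p in control_points]
    while len(pts) > 1:
        pts = [
            tuple((1 - t) * a + t * b for a, b in zip(p, q))
            for p, q in zip(pts, pts[1:])
        ]
    return pts[0]

def in_box(point, lower, upper, tol=1e-12):
    """True if the point lies in the axis-aligned box [lower, upper]."""
    return all(lo - tol <= c <= hi + tol for c, lo, hi in zip(point, lower, upper))

# Hypothetical convex region (a 10 m x 3.5 m lane segment) and cubic control points.
lower, upper = (0.0, 0.0), (10.0, 3.5)
control_points = [(0.0, 0.5), (3.0, 0.5), (7.0, 3.0), (10.0, 3.0)]

# Constraining only the four control points is a convex (here linear) condition.
assert all(in_box(p, lower, upper) for p in control_points)

# Every sampled curve point then lies in the region as well.
samples = [de_casteljau(control_points, i / 100) for i in range(101)]
print(all(in_box(p, lower, upper) for p in samples))
```

The practical upshot is that collision avoidance within one region reduces to a finite set of constraints on control points rather than a constraint at every point along the path.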

2605.14192 2026-05-15 cs.CL cs.AI

Why Retrieval-Augmented Generation Fails: A Graph Perspective

Kai Guo, Xinnan Dai, Zhibo Zhang, Nuohan Lin, Shenglai Zeng, Jie Ren, Haoyu Han, Jiliang Tang

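AI summary: This paper analyzes from a graph perspective why retrieval-augmented generation (RAG) still produces wrong answers in many cases, revealing how retrieved information affects the model's generation process. By constructing attribution graphs, the authors find marked differences in the structure of information flow between correct and incorrect predictions, propose a graph-based error detection framework built on these findings, and further show how intervening on attribution-graph structure can improve RAG generation quality.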

English abstract

Retrieval-Augmented Generation (RAG) has become a powerful and widely used approach for improving large language models by grounding generation in retrieved evidence. However, RAG systems still produce incorrect answers in many cases. Why RAG fails despite having access to external information remains poorly understood. We present a model-internal study of retrieval-augmented generation that examines how retrieved evidence influences answer generation. Using circuit tracing, we construct attribution graphs that model the flow of information through transformer layers during decoding. These graphs represent interactions among retrieved context, intermediate model activations, and generated tokens, providing a graph-based, circuit-level view of how external evidence is integrated into the model's reasoning process. Across multiple question answering benchmarks, we observe consistent structural differences: correct predictions exhibit deeper reasoning paths, more distributed evidence flow, and a more structured pattern of local connectivity, while failed predictions show shallower, fragmented, and overly concentrated evidence flow. Building on these findings, we develop a graph-based error detection framework that uses attribution-graph topology features. Furthermore, we show that attribution graphs enable targeted interventions. By reinforcing question-constrained evidence grounding, we reshape internal routing so that answer generation remains guided by the question, leading to more effective integration of retrieved information and fewer errors.

2605.14191 2026-05-15 cs.CV

CoReDiT: Spatial Coherence-Guided Token Pruning and Reconstruction for Efficient Diffusion Transformers

Zhuojin Li, Hsin-Pai Cheng, Hong Cai, Shizhong Han, Fatih Porikli

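AI summary: This paper proposes CoReDiT, a structured token pruning framework that aims to improve the computational efficiency of Diffusion Transformers (DiTs) in image and video generation. The method uses a spatial coherence score computable in linear time to assess local redundancy in the latent token lattice and skips highly coherent, redundant tokens in self-attention, while reconstructing the skipped attention outputs by aggregating neighboring retained tokens to keep the representation dense and visually continuous. Experiments show that CoReDiT cuts self-attention computation by up to 55% on several state-of-the-art diffusion models and speeds up inference by 1.33x in the cloud and 1.72x on mobile devices, while maintaining high generation quality and improving on-device memory efficiency.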

Comments 8 pages, 8 figures, CVPR workshop

Journal ref 2026 CVPR Workshop of EDGE
English abstract

Diffusion Transformers (DiTs) deliver remarkable image and video generation quality but incur high computational cost, limiting scalability and on-device deployment. We introduce CoReDiT, a structured token pruning framework for DiTs across vision tasks. CoReDiT uses a linear-time spatial coherence score to estimate local redundancy in the latent token lattice and skips high-coherence (redundant) tokens in self-attention. To maintain a dense representation and avoid visual discontinuities, we reconstruct skipped attention outputs via coherence-guided aggregation of spatially neighboring retained tokens. We further introduce a progressive, block-adaptive pruning schedule that increases pruning gradually and allocates larger budgets to blocks and denoising steps with higher redundancy. Across state-of-the-art diffusion backbones including PixArt-α and MagicDrive-V2, CoReDiT achieves up to 55% self-attention FLOPs reduction and inference speedups of 1.33x on cloud GPUs and 1.72x on mobile NPUs, while maintaining high visual quality. Notably, CoReDiT also increases on-device memory headroom, enabling higher-resolution generation.

2605.13838 2026-05-15 cs.CV cs.GR cs.LG

R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow

Zijie Wu, Lixin Xu, Puhua Jiang, Sicong Liu, Chunchao Guo, Xiang Bai

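AI summary: R-DMesh is a video-guided 3D animation method that addresses the animation distortion caused by a mismatch between a static mesh and the initial pose of the reference video. It introduces a conditional variational autoencoder and a three-flow (Triflow) attention mechanism, decomposes the input mesh into a base shape, relative motion trajectories, and a pose rectification offset, and automatically aligns the initial pose before animation, producing high-fidelity 4D meshes. The authors also build a large-scale dataset, Video-RDMesh, and experiments show strong results on tasks such as pose retargeting and 4D generation.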

Comments Accepted by SIGGRAPH 2026, Project Page: https://r-dmesh.github.io/ Code URL: https://github.com/Tencent-Hunyuan/R-DMesh

English abstract

Video-guided 3D animation holds immense potential for content creation, offering intuitive and precise control over dynamic assets. However, practical deployment faces a critical yet frequently overlooked hurdle: the pose misalignment dilemma. In real-world scenarios, the initial pose of a user-provided static mesh rarely aligns with the starting frame of a reference video. Naively forcing a mesh to follow a mismatched trajectory inevitably leads to severe geometric distortion or animation failure. To address this, we present Rectified Dynamic Mesh (R-DMesh), a unified framework designed to generate high-fidelity 4D meshes that are "rectified" to align with video context. Unlike standard motion transfer approaches, our method introduces a novel VAE that explicitly disentangles the input into a conditional base mesh, relative motion trajectories, and a crucial rectification jump offset. This offset is learned to automatically transform the arbitrary pose of the input mesh to match the video's initial state before animation begins. We process these components via a Triflow Attention mechanism, which leverages vertex-wise geometric features to modulate the three orthogonal flows, ensuring physical consistency and local rigidity during the rectification and animation process. For generation, we employ a Rectified Flow-based Diffusion Transformer conditioned on pre-trained video latents, effectively transferring rich spatio-temporal priors to the 3D domain. To support this task, we construct Video-RDMesh, a large-scale dataset of over 500k dynamic mesh sequences specifically curated to simulate pose misalignment. Extensive experiments demonstrate that R-DMesh not only solves the alignment problem but also enables robust downstream applications, including pose retargeting and holistic 4D generation.

2605.13789 2026-05-15 cs.LG cs.AI q-bio.BM

ENSEMBITS: an alphabet of protein conformational ensembles

Kaiwen Shi, Carlos Oliver

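AI summary: This paper proposes Ensembits, a new tokenizer for protein conformational ensembles, to address the inability of existing tokenizers to capture dynamic conformational changes in proteins. Using a Residual VQ-VAE and a frame distillation objective, the method encodes geometric features and dynamic variation across conformations, giving a precise description of protein motion states. Ensembits performs well on several tasks, including RMSF prediction, function annotation, and mutation-effect prediction, and achieves strong results with far less data than static tokenizers, providing a dynamics-aware vocabulary for protein language modeling and design.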

English abstract

Protein structure tokenizers (PSTs) are workhorses in protein language modeling, function prediction, and evolutionary analysis. However, existing PSTs only capture local geometry of static structures, and miss the correlated motions and alternative conformational states revealed by protein ensembles. Here we introduce Ensembits, the first tokenizer of protein conformational ensembles. Ensembits addresses challenges inherent to tokenizing dynamics: deriving informative geometric descriptors across conformations, permutation-invariant encoding of variable-size ensembles, and conquering sparsity in dynamics data. Trained with a Residual VQ-VAE using a frame distillation objective on a large molecular dynamics corpus, Ensembits outperforms all related methods on RMSF prediction, and is the strongest standalone structural tokenizer on a token-conditioned ANOVA test on per-residue motion amplitude. Ensembits further matches or exceeds static tokenizers on EC, GO, binding site/affinity prediction, and zero-shot mutation-effect prediction despite using far less pretraining data. Notably, the distillation objective enables Ensembits to predict dynamics tokens from a single predicted structure, which alleviates dynamics data sparsity. As the field moves from static structure prediction toward ensemble generation, Ensembits offers the discrete vocabulary needed to bring dynamics into protein language modeling and design.

2605.13748 2026-05-15 cs.RO cs.SY eess.SY math.OC

TinySDP: Real Time Semidefinite Optimization for Certifiable and Agile Edge Robotics

Ishaan Mahajan, Jon Arrizabalaga, Andrea Grillo, Fausto Vega, James Anderson, Zachary Manchester, Brian Plancher

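AI summary: This paper proposes TinySDP, a real-time semidefinite programming solver that targets the computational bottleneck of real-time control on resource-constrained embedded systems. By integrating positive-semidefinite cone projections into a cached-Riccati-based ADMM solver, it efficiently solves model-predictive control problems with nonconvex obstacle constraints on microcontrollers. TinySDP also introduces an a posteriori rank-1 certificate that turns relaxed solutions into geometric guarantees at each timestep. Experiments show shorter paths and better obstacle avoidance than existing methods in challenging scenarios, and the approach has been validated on a drone system.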

Comments Accepted to Robotics: Science and Systems (RSS) 2026. 11 pages, 5 figures, 2 tables. Project website: https://a2r-lab.org/TinySDP/

English abstract

Semidefinite programming (SDP) provides a principled framework for convex relaxations of nonconvex geometric constraints in motion planning, yet existing solvers are too computationally expensive for real-time control, particularly on resource-constrained embedded systems. To address this gap, we introduce TinySDP, the first semidefinite programming solver designed for embedded systems, enabling real-time model-predictive control (MPC) on microcontrollers for problems with nonconvex obstacle constraints. Our approach integrates positive-semidefinite cone projections into a cached-Riccati-based ADMM solver, leveraging computational structure for embedded tractability. We pair this solver with an a posteriori rank-1 certificate that converts relaxed solutions into explicit geometric guarantees at each timestep. On challenging benchmarks, e.g., cul-de-sac and dynamic obstacle avoidance scenarios that induce failures in local methods, TinySDP achieves collision-free navigation with up to 73% shorter paths than state-of-the-art baselines. We validate our approach on a Crazyflie quadrotor, demonstrating that semidefinite constraints can be enforced at real-time rates for agile embedded robotics.
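
The positive-semidefinite cone projection mentioned above has a standard form: take the eigendecomposition of a symmetric matrix and clip negative eigenvalues to zero. The sketch below does this in closed form for the smallest nontrivial case, a 2x2 symmetric matrix. It is the textbook projection, not TinySDP's routine; the abstract does not describe the solver's matrix sizes or numerical method, and the function name `project_psd_2x2` is hypothetical.

```python
import math

def project_psd_2x2(a, b, c):
    """Frobenius-norm projection of [[a, b], [b, c]] onto the PSD cone.

    Returns the (a, b, c) entries of the projected symmetric matrix.
    """
    mean = 0.5 * (a + c)
    radius = math.hypot(0.5 * (a - c), b)
    lam_hi, lam_lo = mean + radius, mean - radius  # the two eigenvalues
    if radius == 0.0:
        # The matrix is a multiple of the identity.
        s = max(mean, 0.0)
        return (s, 0.0, s)
    # Spectral projector onto the lam_hi eigenvector: P = (M - lam_lo * I) / (2 * radius)
    p_a = (a - lam_lo) / (2 * radius)
    p_b = b / (2 * radius)
    p_c = (c - lam_lo) / (2 * radius)
    hi, lo = max(lam_hi, 0.0), max(lam_lo, 0.0)  # clip negative eigenvalues
    # Reassemble: hi * P + lo * (I - P)
    return (hi * p_a + lo * (1 - p_a), (hi - lo) * p_b, hi * p_c + lo * (1 - p_c))

# [[1, 2], [2, 1]] has eigenvalues 3 and -1; the -1 component is removed.
print(project_psd_2x2(1.0, 2.0, 1.0))
```

A matrix that is already positive semidefinite is returned unchanged, and a negative-definite one maps to the zero matrix, which is the behavior an ADMM iteration would rely on when enforcing a semidefinite constraint.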

2605.13369 2026-05-15 cs.CL cs.AI cs.LG

Query-Conditioned Test-Time Self-Training for Large Language Models

Chaehee Song, Minseok Seo, Yeeun Seong, Doyi Kim, Changick Kim

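AI summary: This paper proposes QueST, a query-conditioned test-time self-training framework that adapts the parameters of a large language model during inference according to the input query, improving adaptation to the specific problem. The core idea is to use structural information implicit in the input query to generate related problem-solution pairs, which serve as supervision for parameter-efficient fine-tuning at test time, enabling query-specific optimization without external data. Experiments show that QueST outperforms existing test-time optimization methods on multiple mathematical and scientific reasoning benchmarks, supporting its effectiveness and practicality.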

Comments 17 pages, 7 figures

English abstract

Large language models (LLMs) are typically deployed with fixed parameters, and their performance is often improved by allocating more computation at inference time. While such test-time scaling can be effective, it cannot correct model misconceptions or adapt the model to the specific structure of an individual query. Test-time optimization addresses this limitation by enabling parameter updates during inference, but existing approaches either rely on external data or optimize generic self-supervised objectives that lack query-specific alignment. In this work, we propose Query-Conditioned Test-Time Self-Training (QueST), a framework that adapts model parameters during inference using supervision derived directly from the input query. Our key insight is that the input query itself encodes latent signals sufficient for constructing structurally related problem-solution pairs. Based on this, QueST generates such query-conditioned pairs and uses them as supervision for parameter-efficient fine-tuning at test time. The adapted model is then used to produce the final answer, enabling query-specific adaptation without any external data. Across seven mathematical reasoning benchmarks and the GPQA-Diamond scientific reasoning benchmark, QueST consistently outperforms strong test-time optimization baselines. These results demonstrate that query-conditioned self-training is an effective and practical paradigm for test-time adaptation in LLMs. Code is available at https://chssong.github.io/Query-Conditioned-TTST/.

2605.13276 2026-05-15 cs.AI cs.RO

D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models

Yucheng Guo, Yongjian Guo, Zhong Guan, Wen Huang, Haoran Sun, Haodong Yue, Xiaolong Xiang, Shuai Di, Zhen Sun, Luqiao Wang, Junwu Xiong, Yicheng Gong

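AI summary: With the rapid development of embodied AI, Vision-Language-Action (VLA) models excel at multimodal perception and task execution, but applying reinforcement learning (RL) to them in large-scale distributed environments runs into system bottlenecks, mainly from the resource conflict between high-fidelity physical simulation and deep learning's heavy VRAM and bandwidth demands. To address this, the paper proposes D-VLA, a high-concurrency, low-latency distributed RL framework. Through designs such as "Plane Decoupling" and an asynchronous "Swimlane" pipeline, it separates training data from model optimization and fully overlaps sampling, inference, gradient computation, and parameter distribution, significantly raising training throughput and sampling efficiency for large-scale VLA models.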

English abstract

The rapid evolution of Embodied AI has enabled Vision-Language-Action (VLA) models to excel in multimodal perception and task execution. However, applying Reinforcement Learning (RL) to these massive models in large-scale distributed environments faces severe systemic bottlenecks, primarily due to the resource conflict between high-fidelity physical simulation and the intensive VRAM/bandwidth demands of deep learning. This conflict often leaves overall throughput constrained by execution-phase inefficiencies. To address these challenges, we propose D-VLA, a high-concurrency, low-latency distributed RL framework for large-scale embodied foundation models. D-VLA introduces "Plane Decoupling," physically isolating high-frequency training data from low-frequency weight control to eliminate interference between simulation and optimization. We further design a four-thread asynchronous "Swimlane" pipeline, enabling full parallel overlap of sampling, inference, gradient computation, and parameter distribution. Additionally, a dual-pool VRAM management model and topology-aware replication resolve memory fragmentation and optimize communication efficiency. Experiments on benchmarks like LIBERO show that D-VLA significantly outperforms mainstream RL frameworks in throughput and sampling efficiency for billion-parameter VLA models. In trillion-parameter scalability tests, our framework maintains exceptional stability and linear speedup, providing a robust system for high-performance general-purpose embodied agents.

2605.13247 2026-05-15 cs.LG

EMO: Frustratingly Easy Progressive Training of Extendable MoE

Linghao Jin, Chufan Shi, Huijuan Wang, Nuan Wen, Zhengzhong Liu, Eric Xing, Xuezhe Ma

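AI summary: This paper proposes EMO, a progressive training framework for extendable sparse Mixture-of-Experts (MoE) models. By growing the expert pool gradually during training, it addresses the excessive memory and communication overhead that conventional MoE training incurs by allocating too many experts too early. EMO models sparsity in the scaling law to design compute-optimal token budgets for progressive expansion, and experiments show that it significantly improves training efficiency and resource utilization while maintaining model performance.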

English abstract

Sparse Mixture-of-Experts (MoE) models offer a powerful way to scale model size without increasing compute, as per-token FLOPs depend only on k active experts rather than the total pool of E experts. Yet, this asymmetry creates an MoE efficiency paradox in practice: adding more experts balloons memory and communication costs, making actual training inefficient. We argue that this bottleneck arises in part because current MoE training allocates too many experts from the beginning, even though early-stage data may not fully utilize such capacity. Motivated by this, we propose EMO, a simple progressive training framework that treats MoE capacity as expandable memory and grows the expert pool over the course of training. EMO explicitly models sparsity in the scaling law to derive stage-wise compute-optimal token budgets for progressive expansion. Empirical results show that EMO matches the performance of a fixed-expert setup in large-scale experiments while improving wall-clock efficiency. It offers a surprisingly simple yet effective path to scalable MoE training, preserving the benefits of large expert pools while reducing both training time and GPU cost.
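
The asymmetry between compute and memory can be seen with simple parameter counting. The sketch below is a back-of-the-envelope illustration, not EMO's scaling-law model: it assumes each expert is a feed-forward block with two weight matrices, ignores the router and attention, and uses hypothetical layer sizes. Per-token compute is roughly proportional to the active parameter count.

```python
def moe_ffn_counts(d_model, d_expert, n_experts, k_active):
    """Parameter counts for one MoE feed-forward layer.

    Assumes two weight matrices per expert (d_model x d_expert and back);
    router parameters are ignored.
    """
    per_expert = 2 * d_model * d_expert
    return {
        "stored_params": n_experts * per_expert,          # drives memory and communication
        "active_params_per_token": k_active * per_expert,  # drives per-token FLOPs
    }

# Hypothetical sizes: grow the pool from E = 8 to E = 64 experts with k = 2 active.
small = moe_ffn_counts(d_model=1024, d_expert=4096, n_experts=8, k_active=2)
large = moe_ffn_counts(d_model=1024, d_expert=4096, n_experts=64, k_active=2)

print(large["stored_params"] / small["stored_params"])
print(large["active_params_per_token"] / small["active_params_per_token"])
```

In this toy, stored parameters grow eightfold while active parameters per token stay fixed, which is why starting with a small pool and expanding it later can save memory and communication without changing per-token compute.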

2605.13213 2026-05-15 cs.AI

Hierarchical Attacks for Multi-Modal Multi-Agent Reasoning

Hao Zhou, Tiru Wu, Yan Jiang, Wanqi Zhou, Junxing Hu, Ai Han

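AI summary: This paper studies the vulnerability of multi-modal multi-agent systems (MM-MAS) to adversarial attacks and proposes HAM$^{3}$, a hierarchical attack framework that attacks in a coordinated way at three layers (perception, communication, and reasoning), perturbing the input data, the content and structure of communication, and the agents' reasoning processes, respectively. Experiments show an attack success rate of up to 78.3% on the GQA benchmark, with reasoning-layer attacks being especially effective and able to lead multiple agents to consistent wrong judgments, offering a useful reference for building more robust and interpretable multi-agent systems.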

Comments Accepted to CVPR 2026

Journal ref Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026
English abstract

Multi-modal multi-agent systems (MM-MAS) have gained increasing attention for their capacity to enable complex reasoning and coordination across diverse modalities. As these systems continue to expand in scale and functionality, investigating their potential vulnerabilities has become increasingly important. However, existing studies on adversarial attacks in multi-agent systems primarily focus on isolated agents or unimodal settings, leaving the vulnerabilities of MM-MAS largely underexplored. To bridge this gap, we introduce HAM$^{3}$, a Hierarchical Attack framework for multi-modal multi-agent systems that decomposes attacks into three interconnected layers. Specifically, at the perception layer, HAM$^{3}$ mounts attacks by perturbing visual inputs, textual inputs, and their fused visual-textual representations. At the communication layer, it performs communication-level attacks that corrupt message content and interaction topology, such as manipulating shared context or communication links to distort collective information flow. At the reasoning layer, it conducts reasoning-level attacks that interfere with each agent's cognitive pipeline, biasing reasoning trajectories and ultimately compromising final decisions. We evaluate HAM$^{3}$ on the GQA benchmark through multi-agent systems built on distinct reasoning paradigms including ReAct, Plan-and-Solve, and Reflexion. Experiments demonstrate that our framework achieves an Attack Success Rate of up to 78.3%, with reasoning-layer attacks being the most effective. More than half of the successful attacks lead multiple agents to produce consistent errors. These findings offer valuable insights for building more robust and interpretable multi-agent intelligence.

2605.13126 2026-05-15 cs.LG cs.AI

MLGIB: Multi-Label Graph Information Bottleneck for Expressive and Robust Message Passing

Chaokai Wu, Haofu Shi, Ningxuan Ma, Jianghong Ma, Xiaofeng Zhang

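AI summary This paper studies the over-squashing problem that information compression causes for graph neural networks (GNNs) in multi-label settings and proposes MLGIB, a multi-label graph information bottleneck method. By constructing a Markovian dependence space and deriving tractable variational bounds, the method balances expressiveness and robustness, preserving predictive label signals while suppressing irrelevant label noise. Experiments show that MLGIB outperforms existing methods on multiple benchmark datasets, supporting its effectiveness and generality.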

Details
English abstract

Graph Neural Networks (GNNs) suffer from over-squashing in deep message passing, where information from exponentially growing neighborhoods is compressed into fixed-dimensional representations. We show that this issue becomes a distinct failure mode in multi-label graphs: neighboring nodes often share only limited labels while differing across many irrelevant ones, causing predictive signals to be diluted by noisy label information. To address this challenge, we propose the Multi-Label Graph Information Bottleneck (MLGIB), which formulates multi-label message passing as constrained information transmission under irrelevant label noise. MLGIB balances expressiveness and robustness by preserving predictive label signals while suppressing irrelevant noise. Specifically, it constructs a Markovian dependence space and derives tractable variational bounds, where the lower bound maximizes mutual information with target labels and the upper bound constrains redundant source information. These bounds lead to an end-to-end label-aware message-passing architecture. Extensive experiments on multiple benchmarks demonstrate consistent improvements over existing methods, validating the effectiveness and generality of the proposed framework.
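
As background for the lower and upper bounds described in the abstract, the standard information bottleneck objective and its usual variational bounds can be written as follows. This is the generic formulation, not MLGIB's specific multi-label bounds; the symbols $X$ (source information), $Y$ (target labels), $Z$ (learned representation), $\beta$, the decoder $q$, and the prior $r$ are assumptions introduced here for illustration.

```latex
% Generic information bottleneck (not the MLGIB-specific objective)
\max_{p(z \mid x)} \; I(Z; Y) - \beta \, I(Z; X)

% Variational lower bound on the predictive term, for any decoder q(y | z)
I(Z; Y) \;\ge\; \mathbb{E}_{p(y, z)}\big[\log q(y \mid z)\big] + H(Y)

% Variational upper bound on the compression term, for any prior r(z)
I(Z; X) \;\le\; \mathbb{E}_{p(x)}\Big[\mathrm{KL}\big(p(z \mid x) \,\|\, r(z)\big)\Big]
```

Maximizing the lower bound while penalizing the upper bound gives a trainable surrogate for the objective in the first line.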

2605.13084 2026-05-15 cs.CL cs.AI

Does language matter for spoken word classification? A multilingual generative meta-learning approach

Batsirayi Mupamhi Ziki, Louise Beyers, Ruan van der Merwe

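AI summary This paper studies how language affects few-shot spoken word classification, using a multilingual approach based on generative meta-learning. Models are trained with the Generative Meta-Continual Learning algorithm on English, German, French, and Catalan; the multilingual model performs best, but the performance differences between models are small. The study also finds that the hours of unique training data are a stronger indicator of performance than the number of languages.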

Details
English abstract

Meta-learning has been shown to outperform supervised learning for few-shot monolingual spoken word classification. However, the meta-learning approach remains under-explored in multilingual spoken word classification. In this paper, we apply the Generative Meta-Continual Learning algorithm to spoken word classification. The generative nature of this algorithm makes it viable for use in applications, and the meta-learning aspect promotes generalisation, which is crucial in a multilingual setting. We train monolingual models on English, German, French, and Catalan, a bilingual model on English and German, and a multilingual model on all four languages. We find that although the multilingual model performs best, the differences in performance between models are unexpectedly small. We also find that the hours of unique data seen during training seem to be a stronger performance indicator than the number of languages included in the training data.

2605.13050 2026-05-15 cs.CL cs.AI

Context Training with Active Information Seeking

Zeyu Huang, Adhiguna Kuncoro, Qixuan Feng, Jiajun Shen, Lucio Dery, Arthur Szlam, Marc'Aurelio Ranzato

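AI summary This paper studies how active information seeking can improve the ability of large language models to adapt to new tasks. Unlike conventional closed-loop methods that rely only on the model's intrinsic knowledge, the authors equip context optimizers with Wikipedia search and browser tools to actively obtain external information. A search-based training procedure that maintains and prunes multiple candidate contexts yields substantial gains on low-resource translation, health scenarios, and reasoning-heavy tasks, along with good data efficiency and generalization.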

Comments Preprint

Details
English abstract

Most existing large language models (LLMs) are expensive to adapt after deployment, especially when a task requires newly produced information or niche domain knowledge. Recent work has shown that, by manipulating and optimizing their context, LLMs can be tailored to downstream tasks without updating their weights. However, most existing methods remain closed-loop, relying solely on the model's intrinsic knowledge. In this paper, we equip these context optimizers with Wikipedia search and browser tools for active information seeking. We show that naively adding these tools to a standard sequential context optimization pipeline can actually degrade performance compared to baselines. However, when paired with a search-based training procedure that maintains and prunes multiple candidate contexts, active information seeking delivers consistent and substantial gains. We demonstrate these improvements across diverse domains, including low-resource translation (Flores+), health scenarios (HealthBench), and reasoning-heavy tasks (LiveCodeBench and Humanity's Last Exam). Furthermore, our method proves to be data-efficient, robust across different hyperparameters, and capable of generating effective textual contexts that generalize well across different models.

2605.13032 2026-05-15 cs.LG

What Information Matters? Graph Out-of-Distribution Detection via Tri-Component Information Decomposition

Danny Wang, Ruihong Qiu, Zi Huang

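AI summary Graph neural networks are widely used for node classification but are fragile under out-of-distribution (OOD) shifts in node features or graph structure. To address this, the paper proposes TIDE, a Tri-Component Information Decomposition framework that explicitly decomposes information into feature-specific, structure-specific, and joint components. It aims to preserve the label-relevant joint information while filtering out spurious feature and structure information, improving the separation between in-distribution (ID) and OOD nodes. Experiments show that TIDE substantially improves OOD detection on multiple datasets while maintaining competitive ID classification accuracy.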

Comments ICML26

Details
English abstract

Graph neural networks are widely used for node classification, but they remain vulnerable to out-of-distribution (OOD) shifts in node features and graph structure. Prior work established that methods trained with standard supervised learning (SL) objectives tend to capture spurious signals from features, structure, or both, leaving the model fragile under distributional changes. To address this, we propose TIDE, a novel and effective Tri-Component Information Decomposition framework that explicitly decomposes information into feature-specific, structure-specific and joint components. TIDE aims to preserve only the label-relevant part of the joint information while filtering out spurious feature- and structure-specific information, thereby enhancing the separation between in-distribution (ID) and OOD nodes. Beyond the framework, we provide theoretical and empirical analyses showing that an information bottleneck objective is preferable to standard SL for graph OOD detection, with higher ID confidence and a greater entropy gap between ID and OOD data. Extensive experiments across seven datasets confirm the efficacy of TIDE, achieving up to a 34% improvement in FPR95 over strong baselines while maintaining competitive ID accuracy.
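
For readers unfamiliar with the FPR95 metric quoted above, here is a minimal sketch of how it is commonly computed: the fraction of OOD samples wrongly accepted as ID at the score threshold that retains 95% of ID samples. The function name `fpr_at_95_tpr` and the convention that a higher score means "more in-distribution" are assumptions for illustration and are not taken from the TIDE paper.

```python
import numpy as np

def fpr_at_95_tpr(id_scores, ood_scores, tpr=0.95):
    """False positive rate on OOD data at the threshold keeping `tpr` of ID data.

    Assumes higher scores indicate in-distribution samples (hypothetical convention).
    """
    id_scores = np.asarray(id_scores, dtype=float)
    ood_scores = np.asarray(ood_scores, dtype=float)
    # Threshold below which a (1 - tpr) fraction of ID scores fall
    threshold = np.quantile(id_scores, 1.0 - tpr)
    # OOD samples scoring at or above the threshold are misclassified as ID
    return float(np.mean(ood_scores >= threshold))
```

Lower values are better: a detector that separates ID from OOD scores perfectly has an FPR95 of 0.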

2605.12998 2026-05-15 cs.LG

DRIFT: A Benchmark for Task-Free Continual Graph Learning with Continuous Distribution Shifts

Guiquan Sun, Xikun Zhang, Jingchao Ni, Dongjin Song

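AI summary This paper presents DRIFT, a benchmark for task-free continual graph learning that targets the continuous distribution drift found in real-world environments. Whereas conventional continual graph learning methods assume a partition into discrete tasks, DRIFT takes a task-free view and models the data stream as a time-varying mixture of latent task distributions, which supports continuous modeling of distribution drift. Through a Gaussian parameterization, DRIFT covers transition dynamics ranging from hard task switches to smooth distributional drift, and it reveals that existing methods degrade in the task-free setting, underscoring the importance of studying continual graph learning under realistic non-stationary conditions.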

Comments 20 pages, 5 figures

Details
English abstract

Continual graph learning (CGL) aims to learn from dynamically evolving graphs while mitigating catastrophic forgetting. Existing CGL approaches typically adopt a task-based formulation, where the data stream is partitioned into a sequence of discrete tasks with pre-defined boundaries. However, such assumptions rarely hold in real-world environments, where data distributions evolve continuously and task identity is often unavailable. To better reflect realistic non-stationary environments, we revisit continual graph learning from a task-free perspective. We propose a unified formulation that models the data stream as a time-varying mixture of latent task distributions, enabling continuous modeling of distribution drift. Based on this formulation, we construct DRIFT, a benchmark that spans a spectrum of transition dynamics ranging from hard task switches to smooth distributional drift through a Gaussian parameterization. We evaluate representative continual learning methods under this task-free setting and observe substantial performance degradation compared to traditional task-based protocols. Our findings indicate that many existing approaches implicitly rely on task boundary information and struggle under realistic task-free graph streams. This work highlights the importance of studying continual graph learning under realistic non-stationary conditions and provides a benchmark for future research in this direction. Our code is available at https://github.com/UConn-DSIS/DRIFT.
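
One way to picture a time-varying task mixture with a Gaussian parameterization is sketched below. This is an illustrative guess at the idea, not the DRIFT benchmark's actual code: the functions `mixture_weights` and `sample_task`, the per-task time `centers`, and the shared width `sigma` are all hypothetical. Each latent task gets a Gaussian bump over time, and the normalized bumps give the mixture weights.

```python
import numpy as np

def mixture_weights(t, centers, sigma):
    """Hypothetical Gaussian-parameterized mixture weights over latent tasks at time t."""
    centers = np.asarray(centers, dtype=float)
    # Unnormalized log-weights: one Gaussian bump per task, centered in time
    logits = -((t - centers) ** 2) / (2.0 * sigma ** 2)
    # Subtract the maximum before exponentiating for numerical stability
    logits = logits - logits.max()
    weights = np.exp(logits)
    return weights / weights.sum()

def sample_task(t, centers, sigma, rng):
    """Draw the index of the latent task that generates the sample at time t."""
    return int(rng.choice(len(centers), p=mixture_weights(t, centers, sigma)))
```

Under these assumptions, a small `sigma` relative to the spacing of `centers` approximates hard task switches, while a large `sigma` blends neighboring tasks into a smooth drift.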

2605.12968 2026-05-15 cs.LG cs.AI cs.CL cs.LO

Controlling Logical Collapse in LLMs via Algebraic Ontology Projection over F2

Hisashi Miyashita, Mgnite Inc

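AI summary This study asks whether large language models internally encode ontological relations in a formally verifiable algebraic structure. It proposes Algebraic Ontology Projection (AOP), which projects hidden states into the finite field F2 and, using only 42 relational pairs as algebraic keys, reaches up to 93.33% zero-shot inclusion accuracy. It also introduces the Semantic Crystallisation (SC) metric, which quantifies how well a model satisfies the F2 constraints, and shows that system prompts play a key role in preventing logical collapse in the late layers, offering a new algebraic perspective on the logical structure of large language models.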

Details
English abstract

Do large language models internally encode ontological relations in a formally verifiable algebraic structure? We introduce Algebraic Ontology Projection (AOP), which projects LLM hidden states into the Galois Field F2 under Liskov Substitution Principle constraints, using only 42 relational pairs as algebraic keys. AOP achieves up to 93.33% zero-shot inclusion accuracy on unseen concept pairs (Gemma-2 Instruct with optimized prompt), with consistent 86.67% accuracy observed across multiple model families -- with no model tuning, but through prompt alone. This algebraic structure is strongly layer-dependent. We introduce Semantic Crystallisation (SC), a metric that quantifies F2 constraint satisfaction relative to a random baseline and predicts zero-shot accuracy without held-out data. System prompts act as algebraic boundary conditions: only their combination with instruction tuning prevents Late-layer Collapse -- a systematic degradation of logical consistency in the final layers, observed in 7 of 10 conditions. These findings reframe forward computation as an iterative process of algebraic organisation, and open a path toward LLMs whose logical structure is not merely approximated, but formally accessible.

2605.12856 2026-05-15 cs.AI cs.SI

Moltbook Moderation: Uncovering Hidden Intent Through Multi-Turn Dialogue

Ali Al-Lawati, Nafis Tripto, Abolfazl Ansari, Jason Lucas, Suhang Wang, Dongwon Lee

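AI summary This paper studies the detection of hidden malicious intent in multi-agent systems and proposes BOT-MOD, a moderation framework based on agent intent rather than content features. Through multi-turn dialogue guided by Gibbs-based sampling over intent hypotheses, the method progressively identifies an agent's true intent and distinguishes benign from malicious behavior. Experiments on a dataset built from Moltbook show that it identifies intent accurately across a range of adversarial configurations while keeping the false positive rate low, pointing to a new approach to intent-aware moderation in open multi-agent environments.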

Details
English abstract

The emergence of multi-agent systems introduces novel moderation challenges that extend beyond content filtering. Agents with malicious intent may contribute harmful content that appears benign to evade content-based moderation, while compromising the system through exploitative and malicious behavior manifested across their overall interaction patterns within the community. To address this, we introduce BOT-MOD (BOT-MODeration), a moderation framework that grounds detection in agent intent rather than traditional content-level signals. BOT-MOD identifies the underlying intent by engaging with the target agent in a multi-turn exchange guided by Gibbs-based sampling over candidate intent hypotheses. This progressively narrows the space of plausible agent objectives to identify the underlying behavior. To evaluate our approach, we construct a dataset derived from Moltbook that encompasses diverse benign and malicious behaviors based on actual community structures, posts, and comments. Results demonstrate that BOT-MOD reliably identifies agent intent across a range of adversarial configurations, while maintaining a low false positive rate on benign behaviors. This work advances the foundation for scalable, intent-aware moderation of agents in open multi-agent environments.

2605.12808 2026-05-15 cs.LG

Neurodata Without Boredom: Benchmarking Agentic AI for Data Reuse

Ling-Qi Zhang, Kristin Branson

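AI summary Neuroscience data are fragmented, varied in format, and hard to reuse; this study explores whether agentic AI can make data reuse more efficient. Using eight experimental papers that shared both data and code, it evaluates general-purpose coding agents on loading, understanding, and reformatting neural data for training a decoder. The agents performed well on sub-tasks but rarely produced an error-free end-to-end solution. The study analyzes the common types of agent mistakes and what triggers them, proposes data-sharing best practices for the agentic-AI era, and notes that agents acting as judges are of limited reliability without ground-truth references, which underlines the need for human-in-the-loop coding.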

Comments v2: Added forgotten acknowledgments section

Details
English abstract

Neuroscience data are highly fragmented across labs, formats, and experimental paradigms, and reuse often requires substantial manual effort. A persistent roadblock to data reuse and integration is the need to decipher bespoke and diverse data formatting choices. Common data formats have been proposed in response, but the field continues to struggle with a fundamental tension: formats flexible enough to accommodate diverse experiments are rarely descriptive enough to be self-explanatory, and sufficiently descriptive formats demand detailed documentation and curation effort that few labs can sustain. Agentic AI is a natural candidate to solve this problem: LLMs read code and text faster and with sustained attention to the low-level details humans tend to skim over. To measure how well agentic AI performs on this task, we selected eight recent papers studying large-scale mouse neural population recordings that shared both data and code, spanning diverse recording modalities, behavioral paradigms, and dataset formats (e.g., NWB, specialized APIs, and general-purpose Python or MATLAB files). We provided agents with the data, code, and paper, and prompted them to load, understand, and reformat the data for a common downstream task: training a decoder from neural activity to task or behavioral variables. General-purpose coding agents commonly used by scientists performed well on each sub-task, but rarely strung together a fully error-free end-to-end solution. We characterize the types of mistakes agents made and the dataset properties that elicited them, and propose data-sharing best practices for the agentic-AI era. We further find that agents-as-judges are unreliable at catching errors, especially without ground-truth references, so interactive, human-in-the-loop coding remains necessary.

2605.12784 2026-05-15 cs.LG cs.NE q-bio.QM

ToolMol: Evolutionary Agentic Framework for Multi-objective Drug Discovery

Andrew Y. Zhou, Sharvaree Vadgama, Sumanth Varambally, Peter Eckmann, Michael K. Gilson, Rose Yu

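AI summary This study proposes ToolMol, an evolutionary agentic framework for multi-objective drug molecule design. It combines a multi-objective genetic algorithm with an LLM-based agentic operator that iteratively updates the molecule population. ToolMol introduces an RDKit-based toolbox that supports precise modifications to molecular structure, and it performs well on multiple protein targets: the molecules it generates outperform existing methods on key metrics such as predicted binding affinity and absolute binding free energy.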

Comments 9 pages, 5 figures

Details
English abstract

Advances in large language models (LLMs) have recently opened new and promising avenues for small-molecule drug discovery. Yet existing LLM-based approaches for molecular generation often suffer from high rates of invalid and low-quality ligand candidates, a result of the syntactic limitations of current models with regard to molecular strings. In this paper, we introduce $\texttt{ToolMol}$, an evolutionary agentic framework for de novo drug design. $\texttt{ToolMol}$ combines a multi-objective genetic algorithm with an agentic LLM operator that iteratively updates the ligand population. We build a comprehensive toolbox of RDKit-backed functions that allows our agentic operator to consistently make precise ligand modifications. $\texttt{ToolMol}$ achieves state-of-the-art performance on multi-objective property optimization tasks, discovering drug-like and synthesizable ligands that have $>10\%$ stronger predicted binding affinity compared to existing methods, evaluated on three protein targets. $\texttt{ToolMol}$ ligands additionally achieve state-of-the-art results in gold-standard Absolute Binding Free Energy scores, improving on existing methods by over $35\%$. By studying chain-of-thought reasoning traces, we observe that tool-calling enables the model to more faithfully execute its planned modifications, efficiently exploiting the strong chemical prior knowledge in LLMs.

2605.12651 2026-05-15 cs.LG

Runtime Monitoring of Perception-Based Autonomous Systems via Embedding Temporal Logic

Parv Kapoor, Abigail Hammer, Ashish Kapoor, Karen Leung, Eunsuk Kang

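AI summary This paper proposes Embedding Temporal Logic (ETL), a new approach to runtime monitoring of perception-based autonomous systems. Conventional methods map continuous sensor observations to discrete logical propositions defined over low-dimensional state variables, but in perception-driven settings this is computationally expensive, brittle, and semantically misaligned. ETL monitors directly in a learned embedding space: it defines predicates through distances between observed embeddings and embeddings of reference observations, so it can express high-level perceptual concepts such as similarity to a visual goal or avoidance of a semantic region, and it composes these predicates with temporal operators to describe temporally extended perceptual behaviors. Experiments in multiple manipulation environments show that ETL agrees closely with ground-truth semantics and monitors temporal behaviors effectively.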

Details
English abstract

Runtime monitoring of autonomous systems traditionally relies on mapping continuous sensor observations to discrete logical propositions defined over low-dimensional state variables. This abstraction breaks down in perception-driven settings, where such mappings require additional learned modules that are often computationally expensive, brittle, and semantically misaligned. In this work, we propose Embedding Temporal Logic (ETL), a temporal logic that performs monitoring directly in learned embedding spaces. ETL defines predicates through distances between observed embeddings and target embeddings derived from reference observations. This formulation allows specifications to capture high-level perceptual concepts, such as similarity to visual goals or avoidance of semantic regions, that are difficult or impossible to express using traditional predicates. By composing these predicates with temporal operators, ETL naturally expresses temporally extended and sequential perceptual behaviors. We introduce ETL monitors for evaluating specifications over bounded embedding traces, along with a conformal calibration procedure that provides reliable and safety-oriented predicate evaluation. We evaluate our approach across multiple manipulation environments to show that ETL achieves strong empirical agreement with ground-truth semantics, including accurate monitoring of temporally composed behaviors.
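
The idea of distance-based predicates composed with temporal operators over a bounded trace can be sketched as follows. This is a minimal illustration assuming Euclidean distance and signed-margin (robustness-style) semantics, which the abstract does not specify; the helper names `near`, `avoid`, `eventually`, and `always` and the threshold `eps` are hypothetical, and the conformal calibration step from the paper is not modeled.

```python
import numpy as np

def near(target, eps):
    """Predicate: positive margin when an embedding lies within eps of the target."""
    target = np.asarray(target, dtype=float)
    def margin(embedding):
        distance = np.linalg.norm(np.asarray(embedding, dtype=float) - target)
        return eps - float(distance)
    return margin

def avoid(target, eps):
    """Predicate: positive margin when an embedding stays farther than eps from the target."""
    inside = near(target, eps)
    return lambda embedding: -inside(embedding)

def eventually(pred, trace):
    """Satisfied (margin > 0) if the predicate holds at some step of a non-empty bounded trace."""
    return max(pred(e) for e in trace)

def always(pred, trace):
    """Satisfied (margin > 0) if the predicate holds at every step of a non-empty bounded trace."""
    return min(pred(e) for e in trace)
```

For example, `eventually(near(goal, 0.5), trace) > 0` checks that the trace reaches the neighborhood of a goal embedding, and `always(avoid(hazard, 1.0), trace) > 0` checks that it never enters a region around a hazard embedding.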

2605.12534 2026-05-15 cs.SD cs.LG q-bio.NC

BioSEN: A Bio-acoustic Signal Enhancement Network for Animal Vocalizations

Tianyu Song, Ton Viet Ta, Ngamta Thamwattana, Hisako Nomura, Linh Thi Hoai Nguyen

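AI summary This paper proposes BioSEN, a bio-acoustic signal enhancement network for enhancing animal vocalizations in noisy conditions. The model adapts speech enhancement methods and adds three core modules tailored to animal sounds, for time-frequency feature extraction, harmonic structure capture, and energy-adaptive gating connections. Experiments on three bioacoustic datasets show that BioSEN performs strongly while using far less computation than state-of-the-art models, indicating its potential for biodiversity monitoring and conservation.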

Details
Journal ref
ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
English abstract

Most work in audio enhancement targets human speech, while bioacoustics is less studied due to noisy recordings and the distinct traits of animal sounds. To fill this gap, we adapt speech enhancement methods and build BioSEN, a model made for bioacoustic signals. BioSEN has three modules: a multi-scale dual-axis attention unit for time-frequency feature extraction, a bio-harmonic multi-scale enhancement unit for capturing harmonic structures, and an energy-adaptive gating connection unit that uses frequency weights to keep vocalizations from being removed as noise. Tests on three bioacoustic datasets show that BioSEN matches or exceeds state-of-the-art speech enhancement models while using far less computation. These results show BioSEN's strength for bioacoustic audio enhancement and its promise for biodiversity monitoring and conservation.

2605.12465 2026-05-15 cs.LG

High-arity Sample Compression

Leonardo N. Coregliano, William Opich

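AI summary This paper studies a variant of sample compression schemes in high-arity learning theory and its relationship to high-arity PAC learnability. The authors prove that if a high-arity sample compression scheme of non-trivial quality exists, the corresponding concept class is high-arity PAC learnable. The result provides a new theoretical basis for understanding high-arity learning problems.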

Comments 29 pages v2: corrected minor typos

Details
English abstract

Recently, a series of works have started studying variations of concepts from learning theory for product spaces, which can be collected under the name high-arity learning theory. In this work, we consider a high-arity variant of sample compression schemes and we prove that the existence of a high-arity sample compression scheme of non-trivial quality implies high-arity PAC learnability.

2605.12394 2026-05-15 cs.LG cs.AI

Detecting overfitting in Neural Networks during long-horizon grokking using Random Matrix Theory

Hari K. Prakash, Charles H Martin

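AI summary This paper proposes a new method based on Random Matrix Theory for detecting overfitting during the training of deep learning models without access to training or test data. The method randomizes the weight matrix of each layer, fits its empirical spectral distribution, and identifies outlier eigenvalues that violate self-averaging, called "Correlation Traps". In the "anti-grokking" phase of long-horizon grokking, these traps form and grow as test accuracy declines, revealing a structural signature of overfitting; the paper also notes that similar traps appear in some large language models, which may indicate a risk of overfitting.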

Comments 24 pages, 24 figures

Details
English abstract

Training Neural Networks (NNs) without overfitting is difficult; detecting that overfitting is difficult as well. We present a novel Random Matrix Theory method that detects the onset of overfitting in deep learning models without access to train or test data. For each model layer, we randomize each weight matrix element-wise, $\mathbf{W} \to \mathbf{W}^{\mathrm{rand}}$, fit the randomized empirical spectral distribution with a Marchenko-Pastur distribution, and identify large outliers that violate self-averaging. We call these outliers Correlation Traps. During the onset of overfitting, which we call the "anti-grokking" phase in long-horizon grokking, Correlation Traps form and grow in number and scale as test accuracy decreases while train accuracy remains high. Traps may be benign or may harm generalization; we provide an empirical approach to distinguish between them by passing random data through the trained model and evaluating the JS divergence of output logits. Our findings show that anti-grokking is an additional grokking phase with high train accuracy and decreasing test accuracy, structurally distinct from pre-grokking through its Correlation Traps. More broadly, we find that some foundation-scale LLMs exhibit the same Correlation Traps, indicating potentially harmful overfitting.
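
The randomize-then-compare step described above can be sketched roughly as follows. This is a simplified illustration, not the authors' procedure: instead of fitting a Marchenko-Pastur distribution to the spectrum, it uses the analytic Marchenko-Pastur bulk edge computed from the element variance, and it flags eigenvalues above that edge times a safety `margin`. The function name `count_trap_candidates` and the `margin` parameter are hypothetical.

```python
import numpy as np

def count_trap_candidates(W, rng, margin=1.1):
    """Count eigenvalues of the element-wise shuffled matrix above the MP bulk edge.

    Simplification (assumption): the bulk edge comes from the closed-form
    Marchenko-Pastur formula rather than from a fit to the empirical spectrum.
    Entries are assumed to have near-zero mean; a large mean would itself
    create an outlier.
    """
    W = np.asarray(W, dtype=float)
    if W.shape[0] < W.shape[1]:
        W = W.T  # orient as N x M with N >= M
    N, M = W.shape
    # Element-wise randomization: permute all entries, destroying learned correlations
    W_rand = rng.permutation(W.ravel()).reshape(N, M)
    # Eigenvalues of (1/N) W_rand^T W_rand, obtained from the singular values
    eigs = np.linalg.svd(W_rand, compute_uv=False) ** 2 / N
    # Marchenko-Pastur upper bulk edge for element variance sigma^2 and ratio M/N
    sigma2 = W_rand.var()
    lam_plus = sigma2 * (1.0 + np.sqrt(M / N)) ** 2
    return int(np.sum(eigs > margin * lam_plus)), float(lam_plus)
```

In this sketch, a matrix of well-behaved random entries should yield no candidates, whereas a few anomalously large elements survive the shuffle and push eigenvalues out of the bulk, which is the kind of outlier the abstract calls a Correlation Trap.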