arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.06841 2026-05-11 cs.AI cs.LG

AGWM: Affordance-Grounded World Models for Environments with Compositional Prerequisites

Qinshi Zhang, Weipeng Deng, Zhihan Jiang, Jiaming Qu, Qianren Li, Weitao Xu, Ray LC

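AI summary In model-based learning, an agent learns behaviors from trajectories predicted by a world model, but conventional world models often ignore action preconditions, which causes errors to accumulate over multi-step predictions. This paper proposes AGWM (Affordance-Grounded World Model), which learns a dependency graph (DAG) of action prerequisites and explicitly tracks the dynamic executability of actions, so it can judge more accurately whether an action is feasible in the current state. Experiments show that AGWM improves multi-step prediction error, generalization to novel scenarios, and interpretability.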

Comments 16 pages, 3 figures, 4 tables. Appendix on pages 11-16 (main text is self-contained)

English abstract

In model-based learning, the agent learns behaviors by simulating trajectories based on world model predictions. Standard world models typically learn a stationary transition function that maps states and actions to next states. When an action and an outcome frequently co-occur in training data, the model tends to internalize this correlation as a general causal rule while ignoring action preconditions. In interactive environments, however, agent actions can reshape the future affordance space. At each timestep, an action may become executable only after its prerequisites are met, or non-executable when they are destroyed. We term such events structure-changing events (SC events). As a result, a conventional world model often fails to determine whether a given action is executable in the current state, especially in multi-step predictions. Each imagined step is conditioned on an incorrect affordance state, and therefore the prediction error compounds over the rollout horizon. In this paper, we propose AGWM (Affordance-Grounded World Model), which learns an abstract affordance structure represented as a DAG of prerequisite dependencies to explicitly track the dynamic executability of actions. Experiments on game-based simulated environments demonstrate the effectiveness of our method by achieving lower multi-step prediction error, better generalization to novel configurations, and improved interpretability.
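
The idea of tracking executability through a prerequisite DAG can be illustrated with a small Python sketch. This is not the AGWM implementation: the class name `AffordanceGraph`, the fact and action names, and the rule that an action is executable exactly when all of its prerequisites currently hold are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class AffordanceGraph:
    """Toy prerequisite DAG: action -> set of facts that must hold first."""
    prerequisites: dict[str, set[str]]
    satisfied: set[str] = field(default_factory=set)

    def executable(self, action: str) -> bool:
        # An action is executable only if every prerequisite currently holds.
        return self.prerequisites.get(action, set()) <= self.satisfied

    def establish(self, fact: str) -> None:
        # A structure-changing event that can make actions executable.
        self.satisfied.add(fact)

    def destroy(self, fact: str) -> None:
        # A structure-changing event that can make actions non-executable.
        self.satisfied.discard(fact)


# Hypothetical crafting-style environment.
graph = AffordanceGraph({
    "craft_pickaxe": {"has_wood", "has_table"},
    "mine_stone": {"has_pickaxe"},
})
graph.establish("has_wood")
before = graph.executable("craft_pickaxe")   # the table is still missing
graph.establish("has_table")
after = graph.executable("craft_pickaxe")    # all prerequisites now hold
graph.destroy("has_table")
revoked = graph.executable("craft_pickaxe")  # a prerequisite was destroyed
```

A world model that consults such a structure before each imagined step would not predict the outcome of `craft_pickaxe` in states where the action cannot run, which is the failure mode the abstract describes.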

2605.06835 2026-05-11 cs.LG cs.AI

On Privacy Leakage in Tabular Diffusion Models: Influential Factors, Attacker Knowledge, and Metrics

Masoumeh Shafieinejad, D. B. Emerson, Behnoosh Zamanlooy, Elaheh Bassak, Fatemeh Tavakoli, Sara Kodeiri, Marcelo Lotif, Xi He

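AI summary This paper studies privacy leakage in tabular diffusion models (TDMs), analyzing the key factors that influence leakage, the knowledge an attacker needs, and the validity of related privacy metrics. Using membership inference attacks in black-box and white-box settings, the study quantifies how training configuration, synthesis strategy, and attacker knowledge affect privacy risk, and shows that attackers can succeed without full knowledge of training details or large compute resources. It also exposes the limitations of some heuristic privacy metrics for assessing leakage.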

Comments 23 pages, 11 Figures, 12 Tables

English abstract

Tabular data plays an important role in many fields and industries, including those with elevated privacy considerations and risks. As such, there is a rising interest in generating high-quality synthetic proxies for real tabular data as a means of reducing privacy risk and proprietary data exposure. With tabular diffusion models (TDMs) demonstrating leading performance in synthesizing such data, understanding and measuring the privacy risks associated with these models is imperative. Leveraging state-of-the-art membership inference attacks for TDMs in both black- and white-box settings, this work quantifies the impact of training setup, synthesis choices, and attacker knowledge on privacy leakage. Moreover, the results demonstrate that adversaries need not have perfect knowledge of the training setup, identical data distributions, or massive compute resources to construct successful attacks. Finally, the pitfalls associated with applying heuristic privacy metrics, such as distance-to-closest record, are revealed.

2605.06834 2026-05-11 cs.LG

Attribution-Based Neuron Utility for Plasticity Restoration in Deep Networks

Patrick Elisii, Lucas Beauchemin, Dawer Jamshed

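AI summary This paper addresses the training difficulties that deep networks face in continual learning as plasticity declines. It proposes gradient times difference from reference (GXD), a neuron utility measure based on gradient attribution, to guide adaptive resets that restore trainability. The measure is theoretically motivated: it estimates the functional cost of replacing a neuron, which makes reset interventions more reliable. Experiments show that GXD restores continual learning ability more effectively in settings where existing reset criteria fail.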

English abstract

Continual learning research attempts to conserve two fundamental capabilities: new knowledge acquisition and the preservation of previously acquired knowledge. While knowledge in this case can be measured through performance over an implicit or explicit task space, model plasticity generally concerns adaptability as data distributions evolve. Though much of the literature has focused on catastrophic forgetting, deep networks can also suffer from loss of plasticity, becoming progressively harder to update under continued training. Recent research has identified multiple mechanisms underlying this phenomenon, including neuron saturation, parameter norm growth, and loss of useful curvature directions. Adaptive reset-based interventions, which selectively reinitialize low-utility network parameters, have emerged as practical solutions to restore trainability. Existing utility measures used to guide resets, such as activation magnitude, contribution utility, or gradient-based activity, rely on proxy signals that can become misaligned with the intervention they are meant to guide. In this paper, we introduce gradient times difference from reference (GXD), a theoretically motivated utility measure based on reference-based gradient attribution that estimates the first-order functional cost of replacing a unit. Our results show that utility measures aligned with the functional cost of the reset can make interventions more reliable in settings where existing reset criteria degrade. GXD reframes adaptive resetting as an intervention cost estimation problem, providing a practical path toward more robust continual learning systems.
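
As a rough illustration of a "gradient times difference from reference" score, the sketch below computes a first-order estimate of the loss change when a unit's activation is replaced by a reference value. The exact GXD definition is not given in the abstract, so the absolute value, the zero reference, the toy readout, and all numbers here are assumptions.

```python
def first_order_reset_cost(grads, activations, references):
    """Per-unit |dL/da_i * (a_i - ref_i)|: a first-order estimate of how much
    the loss changes if unit i's activation is replaced by its reference."""
    return [abs(g * (a - r)) for g, a, r in zip(grads, activations, references)]


# Toy layer feeding a linear readout with squared-error loss (made-up numbers).
weights = [0.5, -2.0, 0.01]
activations = [1.0, 0.8, 3.0]
references = [0.0, 0.0, 0.0]      # assumed post-reset activation of each unit
target = 0.0

prediction = sum(w * a for w, a in zip(weights, activations))
# dL/da_i for L = (prediction - target)^2
grads = [2.0 * (prediction - target) * w for w in weights]

costs = first_order_reset_cost(grads, activations, references)
cheapest_unit = min(range(len(costs)), key=costs.__getitem__)
```

In this toy case the unit with the largest activation (index 2) is also the cheapest to replace, because its outgoing weight is tiny. That is the kind of mismatch between a proxy signal such as activation magnitude and the actual cost of a reset that the abstract points to.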

2605.06832 2026-05-11 cs.CL cs.AI cs.LG

IntentGrasp: A Comprehensive Benchmark for Intent Understanding

Yuwei Yin, Chuyuan Li, Giuseppe Carenini

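AI summary This paper introduces IntentGrasp, a comprehensive benchmark for evaluating the intent understanding of large language models (LLMs). Built from 49 high-quality, open-source datasets, it contains a large-scale training set and two evaluation sets. Tests on 20 mainstream LLMs show poor performance on intent understanding, far below human level. In response, the paper proposes Intentional Fine-Tuning (IFT), which substantially improves performance on intent understanding tasks and generalizes well across domains.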

Comments IntentGrasp data is available on [Hugging Face](https://huggingface.co/datasets/yuweiyin/IntentGrasp), and the code is released on [GitHub](https://github.com/YuweiYin/IntentGrasp)

English abstract

Accurately understanding the intent behind speech, conversation, and writing is crucial to the development of helpful Large Language Model (LLM) assistants. This paper introduces IntentGrasp, a comprehensive benchmark for evaluating the intent understanding capability of LLMs. Derived from 49 high-quality, open-licensed corpora spanning 12 diverse domains, IntentGrasp is constructed through source dataset curation, intent label contextualization, and task format unification. IntentGrasp contains a large-scale training set of 262,759 instances and two evaluation sets: an All Set of 12,909 test cases and a more balanced and challenging Gem Set of 470 cases. Extensive evaluations on 20 LLMs across 7 families (including frontier models such as GPT-5.4, Gemini-3.1-Pro, and Claude-Opus-4.7) demonstrate unsatisfactory performance, with scores below 60% on All Set and below 25% on Gem Set. Notably, 17 out of 20 tested models perform worse than a random-guess baseline (15.2%) on Gem Set, while the estimated human performance is ~81.1%, showing substantial room for improvement. To enhance this ability, this paper proposes Intentional Fine-Tuning (IFT), which fine-tunes the models on the training set in IntentGrasp, yielding significant gains of 30+ F1 points on All Set and 20+ points on Gem Set. Tellingly, the leave-one-domain-out (Lodo) experiments further demonstrate the strong cross-domain generalizability of IFT, verifying that it is a promising approach to substantially enhancing the intent understanding of LLMs. Overall, by benchmarking and boosting intent understanding ability, this study sheds light on a promising path towards more intentional, capable, and safe AI assistants for human benefit and social good.

2605.06830 2026-05-11 cs.LG cs.CL

ProtSent: Protein Sentence Transformers

Dan Ofer, Oriel Perets, Michal Linial, Nadav Rappoport

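AI summary This paper proposes ProtSent, a protein sentence embedding model that aims to improve how protein language models (pLMs) represent functional, evolutionary, and structural similarity. ProtSent uses a contrastive fine-tuning framework trained on multiple protein-pair datasets, which markedly improves embedding quality. Experiments show strong results on a range of downstream tasks, with notable gains on remote homology detection and structural retrieval, without any task-specific supervision.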

Comments 9 figures, appendix, 2 figures, open code and models

English abstract

Protein language models (pLMs) produce per-residue representations that capture evolutionary and structural information, yet their mean-pooled sequence embeddings are not explicitly trained to reflect functional, evolutionary or structural similarity between proteins. We present Protein Sentence Transformers (ProtSent), a contrastive fine-tuning framework for adapting pLMs into general-purpose embedding models. ProtSent trains with MultipleNegativesRankingLoss across five protein-pair datasets: Pfam families, structurally derived hard negatives, AlphaFold DB structural pairs, StringDB protein-protein interactions, and Deep Mutational Scanning data. We evaluate on 23 downstream tasks using frozen embeddings with a k-nearest-neighbor probe to measure embedding neighborhood quality. On ESM-2 150M, ProtSent improves 15 of 23 tasks, with gains of +105% on remote homology detection, +17% on variant effect prediction, and +19.9% Recall@1 on SCOPe-40 structural retrieval. The 35M variant improves 16 of 23 tasks with +40.5% on remote homology and +15.5% Recall@1 on SCOPe-40. Contrastive fine-tuning restructures the embedding space to better capture protein function and structure, without any task-specific supervision. We release the models, public data, and training recipe and code.
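
For readers unfamiliar with this family of losses, the sketch below shows an in-batch-negatives ranking loss of the kind that `MultipleNegativesRankingLoss` refers to: each anchor should score its own positive above every other positive in the batch. It is a plain-Python illustration rather than the ProtSent training code; the function names, the `scale` value, and the toy 2-D embeddings are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def in_batch_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]
    and every other positive in the batch serves as a negative."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        peak = max(logits)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_norm - logits[i]
    return total / len(anchors)


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]     # each positive is close to its own anchor
swapped = [[0.1, 0.9], [0.9, 0.1]]     # positives paired with the wrong anchor
loss_aligned = in_batch_ranking_loss(anchors, aligned)
loss_swapped = in_batch_ranking_loss(anchors, swapped)
```

The loss is small when paired embeddings are already neighbors and large when they are not, which is why minimizing it reshapes the embedding neighborhoods that a k-nearest-neighbor probe later measures.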

2605.06829 2026-05-11 cs.LG cs.CV cs.ET cs.IT cs.NE math.IT

A Unified Measure-Theoretic View of Diffusion, Score-Based, and Flow Matching Generative Models

Aditya Ranganath, Mukesh Singhal

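AI summary This paper unifies diffusion models, score-based generative models, and flow matching from a measure-theoretic perspective, treating each as learning a time-dependent vector field that transports a simple reference distribution to the data distribution. It presents a unified framework that exposes the structure these methods share under the continuity and Fokker-Planck equations, and analyzes their practical tradeoffs in sampling, stability, and computation. It also compares objectives, sampling schemes, and discretization errors across methods, and discusses connections to Schrödinger bridges and entropic optimal transport.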

Comments 62 pages, 1 figure, jmlr preprint

English abstract

We survey continuous-time generative modeling methods based on transporting a simple reference distribution to a data distribution via stochastic or deterministic dynamics. We present a unified framework in which diffusion models, score-based generative models, and flow matching are instances of learning a time-dependent vector field that induces a family of marginals $(ρ_t)_{t \in [0,1]}$ governed by continuity and Fokker-Planck equations. Such a unified theory is timely because these methods are converging methodologically, yet fragmented notation and competing derivations continue to obscure their shared structure and the practical tradeoffs governing sampling, stability, and computation. Within this framework, we (i) derive reverse-time sampling for diffusion and score-based models as controlled stochastic dynamics, (ii) show that the probability flow ODE yields identical marginals and connects diffusion to likelihood-based normalizing flows, and (iii) interpret flow matching as direct regression of the velocity field under a chosen interpolation, clarifying when it coincides with or differs from score-based training. We compare objectives, sampling schemes, and discretization errors under unified notation, discuss connections to Schrödinger bridges and entropic optimal transport, and summarize theoretical guarantees and open problems on approximation, stability, and scalability.
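
For orientation, the equations named in the abstract have the following standard textbook forms. The symbols $v_t$ (velocity field), $f$ (drift), and $g$ (diffusion coefficient) are assumed here and may differ from the survey's own notation.

```latex
% Continuity equation for a deterministic flow with velocity field v_t
\partial_t \rho_t + \nabla \cdot (\rho_t v_t) = 0

% Fokker-Planck equation for the SDE  dX_t = f(X_t, t)\,dt + g(t)\,dW_t
\partial_t \rho_t = -\nabla \cdot (f \rho_t) + \tfrac{1}{2} g(t)^2 \Delta \rho_t

% Probability flow ODE sharing the marginals \rho_t of the SDE above
\frac{dx}{dt} = f(x, t) - \tfrac{1}{2} g(t)^2 \nabla_x \log \rho_t(x)
```

Substituting $v_t = f - \tfrac{1}{2} g^2 \nabla \log \rho_t$ into the continuity equation recovers the Fokker-Planck equation, since $\nabla \cdot (\rho_t \nabla \log \rho_t) = \Delta \rho_t$. This is why the ODE and the SDE share the same marginals.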

2605.06825 2026-05-11 cs.AI cs.RO

Randomness is sometimes necessary for coordination

Rohan Patil, Jai Malegaonkar, Henrik I. Christensen

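AI summary In cooperative multi-agent reinforcement learning, deterministic policies cannot differentiate roles when agents have symmetric observations. To address this, the paper proposes Diamond Attention, a randomness-based coordination mechanism in which each agent samples a random number at every timestep, producing a transient rank ordering that drives attention masking and coordination. The method realizes a random-bit coordination protocol in a single broadcast round and supports zero-shot deployment to teams of different sizes. Experiments show that it outperforms deterministic baselines on symmetric tasks and control coordination tasks, and confirm that structured randomness plays a key role in coordination.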

English abstract

Full parameter sharing is standard in cooperative multi-agent reinforcement learning (MARL) for homogeneous agents. Under permutation-symmetric observations, however, a shared deterministic policy outputs identical action distributions for every agent, making role differentiation impossible. This failure can theoretically be resolved using symmetry breaking among anonymous identical processors, which requires randomness. We propose Diamond Attention, a cross-attention architecture in which each agent samples a scalar random number per timestep, inducing a transient rank ordering that masks lower-ranked peers from agent-to-agent attention while leaving task attention fully unmasked. This realizes a random-bit coordination protocol in a single broadcast round, and the set-based attention enables zero-shot deployment to teams of different sizes. We evaluate across three regimes that isolate when structured randomness matters. On the perfectly symmetric XOR game, our method achieves $1.0$ success while all deterministic baselines plateau near $0.5$. On control coordination tasks, a policy trained on $N=4$ generalizes zero-shot to $N \in [2,8]$. On SMACLite cross-scenario transfer, we achieve zero-shot transfer where standard baselines cannot transfer due to structural limitations. Furthermore, replacing the structured mask with standard dropout-based randomness results in a 0% win rate, confirming that protocol-space structure, not stochastic noise, is the operative ingredient. https://anonymous.4open.science/r/randomness-137A/
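
The rank-based masking idea can be sketched in a few lines of Python. This is one possible reading of the abstract, not the Diamond Attention architecture itself: the function name `rank_mask`, the convention that an agent may attend only to peers with a higher draw, and the exclusion of self-attention from the peer mask are all assumptions.

```python
import random


def rank_mask(num_agents, rng):
    """Return (draws, mask) where mask[i][j] is True if agent i may attend to peer j.

    Each agent draws one scalar; under the assumed convention, agent i sees
    only peers whose draw is higher than its own.
    """
    draws = [rng.random() for _ in range(num_agents)]
    mask = [[j != i and draws[j] > draws[i] for j in range(num_agents)]
            for i in range(num_agents)]
    return draws, mask


rng = random.Random(0)
draws, mask = rank_mask(4, rng)
# Number of visible peers per agent; with distinct draws these are all different.
visible_counts = sorted(sum(row) for row in mask)
```

Because the draws are distinct with probability one, every agent ends up with a different set of visible peers even when all observations are identical. That gives a shared deterministic policy an input on which agents can differ.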

2605.06822 2026-05-11 cs.LG

SHARP: A Self-Evolving Human-Auditable Rubric Policy for Financial Trading Agents

Xiwen Chen, Wenhui Zhu, Songzhu Zheng, Kashif Rasul, Yueyue Deng, Huayu Li

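AI summary In financial trading, large language models (LLMs) must adapt continuously to noisy, non-stationary markets. Existing self-improving methods rely on unconstrained prompt optimization, which tends to cause policy drift when the signal-to-noise ratio is low and rewards are delayed. This paper proposes SHARP, a self-evolving, auditable rubric policy framework that confines agent reasoning to structured condition-action rules and uses cross-sample reasoning to locate rule failures, enabling targeted policy corrections. Experiments show that SHARP significantly improves model performance and makes policies more transparent and auditable.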

English abstract

Large language models (LLMs) are increasingly deployed for autonomous financial trading, a domain requiring continuous adaptation to noisy, non-stationary markets. Existing self-improving agents typically address this through unbounded free-form prompt optimization. However, in low signal-to-noise environments with delayed scalar rewards (P&L), this unstructured approach exacerbates the fundamental credit assignment problem: optimizers cannot reliably distinguish systematic logic flaws from stochastic market variance, inevitably leading to policy drift. To overcome this bottleneck, we introduce the Self-Evolving Human-Auditable Rubric Policy (SHARP), a neuro-symbolic framework that replaces unconstrained text mutation with structured, symbolic policy optimization. SHARP confines the agent's reasoning to a bounded, human-readable rubric of explicit condition-action rules. When sub-optimal trades occur, an attribution agent reasons across multiple samples to isolate specific rule failures. This enables targeted, atomic policy edits that are subsequently regularized through strict walk-forward validation. Evaluated across three diverse equity sectors and four LLM backbones, SHARP consistently transforms generic initial heuristics into highly robust strategies, lifting the empirical performance of compact models (e.g., GPT-4o-mini) by 10 to 20 percentage points on average. Ultimately, SHARP demonstrates that LLMs can achieve dynamic and efficient adaptation while significantly enhancing the structural transparency and auditability demanded by institutional finance.

2605.06821 2026-05-11 cs.LG cs.AI math.OC stat.ML

A Rod Flow Model for Adam at the Edge of Stability

Eric Regis, Sinho Chewi

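AI summary This paper studies the behavior of the Adam optimizer at the edge of stability and proposes a continuous-time model called rod flow. The approach models consecutive iterates in the joint phase space of parameters and first moment as an extended one-dimensional object, a "rod", and treats the second moment as a smooth auxiliary variable. The model applies to Adam and extends to several other momentum-based optimizers. On representative machine learning tasks, it tracks the discrete iterates through the edge-of-stability regime more accurately.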

English abstract

Cohen et al. (arXiv:2207.14484) observed that adaptive gradient methods such as Adam operate at the edge of stability. While there has been significant work on continuous-time modeling of gradient descent at the edge of stability, extending these models to momentum methods remains underdeveloped. In the gradient descent setting, Regis et al. (arXiv:2602.01480) introduced rod flow, which models consecutive iterates as an extended one-dimensional object, a "rod." Here we extend rod flow to Adam by working in the joint phase space of parameters and first moment $(w, m)$ and treating the second moment $ν$ as a smooth auxiliary variable. We also develop rod flows for heavy ball momentum, Nesterov momentum, and scalar and per-component versions of RMSProp, Adam, and NAdam. For all eight optimizers, we empirically evaluate rod flow on representative machine learning architectures, where it tracks the discrete iterates through the edge-of-stability regime significantly more accurately than the corresponding stable flow.

2605.06819 2026-05-11 cs.LG

A Theory of Online Learning with Autoregressive Chain-of-Thought Reasoning

Ilan Doron-Arad, Idan Mehalel, Elchanan Mossel

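AI summary This paper studies the theory of online learning with autoregressive chain-of-thought reasoning, focusing on mistake bounds for learning the final output of an unknown autoregressive generator. It distinguishes two forms of feedback: in the End-to-End model the learner observes only the final generated token, while in the Chain-of-Thought model it sees the full generation trajectory. It then examines how the number of generation steps $M$ affects the mistake bound. In the End-to-End model the mistake bound grows at most logarithmically with $M$, whereas in the Chain-of-Thought model it is independent of $M$, which shows how much intermediate information helps learning efficiency.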

English abstract

Autoregressive generation lies at the heart of the mechanism of large language models. It can be viewed as the repeated application of a next-token generator: starting from an input string (prompt), the generator is applied for $M$ steps, and the last generated token is taken as the final output. [Joshi et al., 2025] proposed a PAC model for studying the learnability of the input-output maps arising from this process. We develop an online analogue of this framework, focusing on the mistake bound of learning the final output induced by an unknown next-token generator. We distinguish between two forms of feedback. In the End-to-End model, after each round the learner observes only the final token produced after $M$ autoregressive steps. In the Chain-of-Thought model, the learner is additionally shown the entire $M$-step trajectory. Our goal is to understand how the optimal mistake bound depends on the generation horizon $M$, and to what extent observing intermediate tokens can reduce this dependence. Our main results show that the online theory of autoregressive learning exhibits a qualitative picture analogous to the statistical one found by [Hanneke et al., 2026], but with a different scale of dependence on the generation horizon. In the End-to-End model, we prove a taxonomy of possible mistake-bound growth rates in the generation horizon $M$: essentially any rate between constant and logarithmic can arise. We further show that this logarithmic ceiling is unavoidable. In the Chain-of-Thought model, we show that access to the full generated trajectory eliminates the dependence on $M$ altogether. We also analyze autoregressive linear threshold classes, and prove optimal mistake bounds, as well as a new lower bound for the statistical setting. Along the way, our results resolve several questions left open by [Joshi et al., 2025].

2605.06815 2026-05-11 cs.AI cs.CV

Uneven Evolution of Cognition Across Generations of Generative AI Models

Isaac Galatzer-Levy, Daniel McDuff, Xin Liu, Jed McGiffin

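AI summary This study examines the uneven development of cognitive abilities across generations of generative AI models and proposes a psychometric framework for assessing the cognitive profile of generative AI and tracking its evolution. Using tasks adapted from the Wechsler Adult Intelligence Scale, it finds that leading multimodal models perform near the top of the human range in verbal comprehension and working memory but near the bottom in perceptual reasoning, a markedly unbalanced cognitive profile. The study also develops the AIQ Benchmark, which reveals significantly different developmental trajectories across modalities: current generative models are progressing quickly in language-based symbolic processing but remain clearly limited in areas such as visually presented abstract reasoning.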

Comments 25 pages, 5 Figures, 3 Tables

English abstract

The pursuit of artificial general intelligence necessitates robust methods for evaluating the cognitive capabilities of models beyond narrow task performance. Here, we introduce a psychometric framework to assess the cognitive profiles of generative AI, comparing them to human norms and tracking their evolution across generations. Initial evaluation of leading multimodal models using tasks adapted from the Wechsler Adult Intelligence Scale revealed a profoundly uneven cognitive architecture: near-ceiling performance in verbal comprehension and working memory (>$98^{\text{th}}$ percentile) contrasted with near-floor performance in perceptual reasoning (<$1^{\text{st}}$ percentile). To track developmental trajectories beyond human-normed limits, we developed the Artificial Intelligence Quotient (AIQ) Benchmark and applied it to six generations and two model families, revealing significant but asymmetric performance gains. Notably, we uncovered a sharp dissociation between modalities; abstract quantitative reasoning matured far more rapidly when presented linguistically compared to a visually analogous format, indicating an architectural bias towards language-based symbolic manipulation. While abstract visual reasoning improved, visual-perceptual organization remained largely stagnant. Collectively, these findings demonstrate that the cognitive abilities of generative models are evolving unevenly, suggesting that scaling and optimization approaches to AGI development alone may be insufficient to overcome fundamental architectural limitations in achieving balanced, human-like general intelligence.

2605.06814 2026-05-11 cs.LG

From Model to Data (M2D): Shifting Complexity from GNNs to Graphs for Transparent Graph Learning

Debolina Halder Lina, Arlei Silva

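AI summary This paper proposes M2D (Model-to-Data), a model distillation framework that aims to make graph neural networks (GNNs) more transparent. By shifting model complexity into the graph data, M2D expresses the behavior of a complex model in an interpretable way through an augmented graph structure, allowing a simple model to reach similar performance. The approach helps explain performance differences between GNN architectures and reveals key mechanisms such as fairness objectives and attention-based aggregation, improving interpretability and transparency.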

English abstract

Graph Neural Networks (GNNs) achieve high performance but can be opaque to humans, making it difficult to understand and compare the many proposed architectures. While existing explainability methods attribute individual predictions to nodes, edges, or features, they do not provide architectural transparency or explain the fundamental performance gap between simple and more complex models. To address this limitation, we introduce Model-to-Data (M2D) distillation, a new framework that increases transparency by transferring model complexity into the data space. M2D distills the teacher model into an augmented graph with enriched features and structure, enabling a simple student to match the teacher's performance. By materializing model behavior in the data, our approach allows humans to inspect architectural advantages directly. We show that M2D reveals underlying mechanisms such as fairness objectives and attention-based aggregation in an interpretable way, enhancing GNN transparency while preserving performance.

2605.06812 2026-05-11 cs.AI

Towards Security-Auditable LLM Agents: A Unified Graph Representation

Chaofan Li, Lyuye Zhang, Jintao Zhai, Siyue Feng, Xichun Yang, Huahao Wang, Shihan Dou, Yu Ji, Yutao Hu, Yueming Wu, Yang Liu, Deqing Zou

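AI summary As LLM-based agentic systems take on increasingly complex autonomous tasks, auditing their security becomes a major challenge. This paper proposes Agent-BOM, a unified graph representation that models both the static capability base and the dynamic runtime state of an agentic system, closing the gap in semantic-level security auditing. By turning executions into queryable audit paths, Agent-BOM can identify stealthy attacks including memory poisoning, tool misuse, and multi-agent system hijacking, providing a traceable, unified foundation for security analysis of complex agent ecosystems.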

English abstract

LLM-based agentic systems are rapidly evolving to perform complex autonomous tasks through dynamic tool invocation, stateful memory management, and multi-agent collaboration. However, this semantics-driven execution paradigm creates a severe semantic gap between low-level physical events and high-level execution intent, making post-hoc security auditing fundamentally difficult. Existing representation mechanisms, including static SBOMs and runtime logs, provide only fragmented evidence and fail to capture cognitive-state evolution, capability bindings, persistent memory contamination, and cascading risk propagation across interacting agents. To bridge this gap, we propose Agent-BOM, a unified structural representation for agent security auditing. Agent-BOM models an agentic system as a hierarchical attributed directed graph that separates static capability bases, such as models, tools, and long-term memory, from dynamic runtime semantic states, such as goals, reasoning trajectories, and actions. These layers are connected through semantic edges and security attributes, transforming fragmented execution traces into queryable audit paths. Building on Agent-BOM, we develop a graph-query-based paradigm for path-level risk assessment and instantiate it with the OWASP Agentic Top 10. We further implement an auditing plugin in the OpenClaw environment to construct Agent-BOM from live executions. Evaluation on representative real-world agentic attack scenarios shows that Agent-BOM can reconstruct stealthy attack chains, including cross-session memory poisoning and tool misuse, capability supply-chain hijacking and unexpected code execution, multi-agent ecosystem hijacking, and privilege and trust abuse. These results demonstrate that Agent-BOM provides a unified and auditable foundation for root-cause analysis and security adjudication in complex agentic ecosystems.

2605.06809 2026-05-11 cs.CV cs.LG

LookWhen? Fast Video Recognition by Learning When, Where, and What to Compute

Ali Salamatian, Anthony Fuller, Pritam Sarkar, James R. Green, Leonid Sigal, Evan Shelhamer

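AI summary This paper proposes LookWhen, a video recognition framework that targets the high computational cost of conventional Transformer models on video. Its core idea is to factor video recognition into when, where, and what to compute: a shallow selector quickly picks out the important video tokens, and a deep extractor processes those tokens to produce the video representation. With novel pre-training strategies, the method improves computational efficiency and achieves a better accuracy-computation trade-off than existing efficient models on multiple video datasets.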

English abstract

Transformers dominate video recognition. They split videos into tokens, and processing them has expensive superlinear computational cost. Yet videos are filled with redundancy, so we can question the need for this expense. We introduce LookWhen, a selector-extractor framework that factorizes video recognition into learning when, where, and what to compute. Our shallow selector gets a scaled-down video and quickly scores all tokens across space-time, while our deep extractor gets the top-K selected tokens to approximate full-video representations without actually processing all the tokens. A key challenge is defining effective supervision for selection and extraction. For selection pre-training, we introduce a score on representations that ranks tokens by uniqueness using a simple nearest-neighbor distance. For extraction pre-training, we distill both a video teacher and an image teacher, for which we normalize its frame-wise representations to learn what changes within videos. Through these strategies, our selector-extractor learns general and efficient representations for feature extraction or fine-tuning to a task. Through experiments on Kinetics-400, SSv2, Epic-Kitchens, Diving48, Jester, and Charades, we show that LookWhen achieves a better accuracy-computation trade-off than efficient models and upgraded baselines of similar size. LookWhen Pareto-dominates in accuracy-FLOPs on 9 of 12 cases (6 tasks x 2 settings) and roughly matches on 3. In accuracy-throughput, measuring time in practice, LookWhen is more efficient still at 6.7x faster than InternVideo2-B at equal accuracy.
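
The selection step described above, scoring tokens by nearest-neighbor uniqueness and keeping the top K, can be sketched as follows. This illustrates the general idea only and is not the LookWhen code: the function names, the Euclidean distance, and the toy 2-D token representations are assumptions.

```python
import math


def uniqueness_scores(tokens):
    """Score each token by the Euclidean distance to its nearest other token."""
    return [min(math.dist(t, other) for j, other in enumerate(tokens) if j != i)
            for i, t in enumerate(tokens)]


def top_k_indices(scores, k):
    """Indices of the k highest-scoring tokens, returned in their original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Three near-duplicate tokens (e.g. a static background) and two distinctive ones.
tokens = [[0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [5.0, 5.0], [-4.0, 3.0]]
scores = uniqueness_scores(tokens)
kept = top_k_indices(scores, k=2)
```

Redundant tokens sit close to a neighbor and receive low scores, so only the distinctive tokens survive. A deep extractor would then process just the `kept` subset.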

2605.06797 2026-05-11 cs.LG

MIND: Monge Inception Distance for Generative Models Evaluation

Quentin Berthet, Yu-Han Wu, Clement Crepy, Romuald Elie, Klaus Greff, Michael Eli Sander

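AI summary This paper proposes the Monge Inception Distance (MIND), a new metric for evaluating generative models that addresses key problems with the widely used Fréchet Inception Distance (FID). MIND uses the sliced Wasserstein distance, averaging one-dimensional optimal transport distances that are computed efficiently by sorting. This avoids FID's need to estimate high-dimensional means and covariance matrices, which is the source of its poor sample complexity and vulnerability to adversarial attacks. Experiments show that MIND clearly outperforms FID in sample efficiency, computation speed, and adversarial robustness, and that 5k samples suffice to match the evaluation quality of FID with 50k samples.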

English abstract

We propose the Monge Inception Distance (MIND), a metric for evaluating generative models that addresses key limitations of the widely adopted Fréchet Inception Distance (FID). The MIND metric leverages the sliced Wasserstein distance to compare distributions by averaging one-dimensional optimal transport distances, efficiently computed via sorting. This approach circumvents the estimation of high-dimensional means and covariance matrices, which underlie FID's poor sample complexity and vulnerability to adversarial attacks. We empirically demonstrate three primary advantages: (i) it is more sample-efficient by one order of magnitude, (ii) it is faster to compute by two orders of magnitude, (iii) it is more robust to adversarial attacks such as moment-matching. We show that MIND with 5k samples can replace the evaluation performance of FID with 50k samples, providing high correlation with this standard benchmark and superior discriminative performance. We further demonstrate that even smaller sample sizes (e.g., 1k or 2k) remain highly informative for rapid model iteration.
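
The sorting-based computation mentioned above can be made concrete with a generic sliced Wasserstein estimator. This is a minimal sketch of the standard technique, not the MIND implementation: it works on raw 2-D points rather than Inception features, and the function name, the number of projections, and the choice of the 2-Wasserstein variant are assumptions.

```python
import math
import random


def sliced_wasserstein(xs, ys, num_projections=64, seed=0):
    """Monte Carlo sliced 2-Wasserstein distance between equal-sized point sets."""
    assert len(xs) == len(ys), "sorting-based matching assumes equal sample sizes"
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(num_projections):
        # Draw a random unit direction by normalizing a Gaussian vector.
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(d * d for d in direction))
        direction = [d / norm for d in direction]
        # Project to 1-D; optimal transport there pairs the sorted samples.
        px = sorted(sum(a * d for a, d in zip(x, direction)) for x in xs)
        py = sorted(sum(a * d for a, d in zip(y, direction)) for y in ys)
        total += sum((a - b) ** 2 for a, b in zip(px, py)) / len(px)
    return math.sqrt(total / num_projections)


rng = random.Random(1)
real = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
close = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
shifted = [[rng.gauss(3, 1), rng.gauss(3, 1)] for _ in range(200)]
d_close = sliced_wasserstein(real, close)
d_shifted = sliced_wasserstein(real, shifted)
```

Each projection costs only a sort, and no mean or covariance matrix is estimated, which is the property the abstract contrasts with FID.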

2605.06788 2026-05-11 cs.LG cs.MA

Conformal Agent Error Attribution

Naihe Feng, Yi Sui, Shiyi Hou, Ga Wu, Jesse C. Cresswell

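AI summary This paper studies how to pinpoint where the decisive error occurred when a multi-agent system (MAS) fails, so that the system can recover automatically. To handle the long interaction traces produced by LLM-based MAS, the authors propose an error attribution framework based on conformal prediction (CP) that provides finite-sample, distribution-free coverage guarantees. The method introduces new algorithms for sequential data that predict contiguous error intervals, enabling efficient rollback and debugging, and is validated on a variety of agents and datasets.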

Comments 10 pages

English abstract

When multi-agent systems (MAS) fail, identifying where the decisive error occurred is the first step for automated recovery to an earlier state. Error attribution remains a fundamental challenge due to the long interaction traces that large language model-based MAS generate. This paper presents a framework for error attribution based on conformal prediction (CP), which provides finite-sample, distribution-free coverage guarantees. We introduce new algorithms for filtration-based CP designed for sequential data such as agent trajectories. Unlike existing CP algorithms, our approach predicts sets that are contiguous sequences to enable efficient recovery and debugging. We verify our theoretical guarantees on a variety of agents and datasets, show that errors can be precisely isolated, then use prediction sets to roll back MAS to correct their own errors. Our overall approach is model-agnostic, and offers a principled uncertainty layer for MAS error attribution. We release code at https://github.com/layer6ai-labs/conformal-agent-error-attribution.
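
To show how a conformal guarantee can yield a contiguous set of candidate steps, here is a sketch of ordinary split conformal prediction applied to step indices. It is not the paper's filtration-based algorithm: the residual-based score, the function names, and the calibration values are assumptions chosen for illustration.

```python
import math


def conformal_radius(calibration_residuals, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration residuals."""
    n = len(calibration_residuals)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(calibration_residuals)[rank - 1]


def error_step_interval(predicted_step, radius, trajectory_length):
    """Contiguous range of step indices predicted to contain the decisive error."""
    if math.isinf(radius):
        return range(0, trajectory_length)
    low = max(0, predicted_step - radius)
    high = min(trajectory_length - 1, predicted_step + radius)
    return range(low, high + 1)


# |predicted error step - true error step| on held-out calibration traces (made up).
residuals = [0, 1, 0, 2, 1, 3, 0, 1, 2, 1, 0, 4, 1, 2, 0, 1, 1, 2, 0]
radius = conformal_radius(residuals, alpha=0.1)
interval = error_step_interval(predicted_step=7, radius=radius, trajectory_length=12)
```

Under exchangeability of calibration and test traces, an interval built this way contains the true error step with probability at least `1 - alpha`. A rollback procedure could then restart from the step just before `interval.start`.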

2605.06772 2026-05-11 cs.AI cs.HC hep-ph hep-th

When Does Critique Improve AI-Assisted Theoretical Physics? SCALAR: Structured Critic-Actor Loop for Agentic Reasoning

Vasilis Niarchos, Constantinos Papageorgakis, Alexander G. Stapleton, Sokratis Trifinopoulos

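AI summary This paper studies how the interaction between researchers and AI agents affects results in theoretical physics research. It proposes SCALAR, a structured critic-actor loop framework for agentic reasoning on quantum field theory and string theory problems. The framework has three components: an Actor, a Critic, and an independent Judge. Experiments comparing multi-turn dialogue and different feedback strategies show how role pairings and prompting strategies affect reasoning. Well-designed critic feedback can significantly improve model performance, but the effect depends on how the Actor and Critic are paired.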

Comments 17 pages; 9 figures

English abstract

As large language models (LLMs) show increasing promise on research-level physics reasoning tasks and agentic AI becomes more common, a practical question emerges: How does the interaction between researchers and agents affect the results? We study this using SCALAR (Structured Critic-Actor Loop for AI Reasoning), an Actor-Critic-Judge pipeline applied to quantum field theory and string theory problems. The Actor proposes solutions, the Critic provides iterative feedback, and an independent Judge evaluates the transcript against reference solutions. We vary the Actor persona, the Critic feedback strategy, and the Actor model family and scale. Multi-turn dialogue improves over single-shot attempts throughout, but both the mechanism of improvement and the value of different prompting choices depend strongly on the Actor-Critic pairing. Increasing the scale within one model family (e.g. from the 8B-parameter DeepSeek-R1 variant to DeepSeek-R1 70B) improves some easier-problem behavior, but does not remove the hardest bottleneck we observe. Critic feedback strategy matters most clearly in the asymmetric Actor-Critic setting (e.g., a lightweight Haiku Actor guided by a stronger Sonnet Critic), where constructive feedback improves mean-score outcomes. In same-family Actor-Critic settings, strategy effects are weaker: lenient feedback is sometimes favored, while strict and adversarial feedback are not beneficial. Taken together, SCALAR provides a controlled testbed for evaluating which interaction structures help or hinder AI-driven scientific discovery.

2605.06765 2026-05-11 cs.CL cs.AI

VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

Jiacheng Xu, Heting Gao, Liufei Xie, Zhenchuan Yang, Lijiang Li, Yiting Chen, Bin Zhang, Meng Chen, Chaoyu Fu, Weifeng Zhao, Wenjiang Zhou

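AI summary VITA-QinYu is the first end-to-end spoken language model to support role-playing and singing generation, producing expressive speech that goes beyond natural conversation. The model adopts a hybrid speech-text paradigm with multi-codebook audio tokens, enabling richer paralinguistic expression while keeping the modalities separate to avoid interference. The authors also build a comprehensive data generation pipeline that synthesizes 15.8K hours of training data. The model achieves superior results on multiple benchmarks and state-of-the-art conversational accuracy and fluency.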

Comments https://tme-lyra-lab.github.io/VITA-QinYu/

Details
English abstract

Human speech conveys expressiveness beyond linguistic content, including personality, mood, or performance elements, such as a comforting tone or humming a song, which we formalize as role-playing and singing. We present VITA-QinYu, the first expressive end-to-end (E2E) spoken language model (SLM) that goes beyond natural conversation to support both role-playing and singing generation. VITA-QinYu adopts a hybrid speech-text paradigm that extends interleaved text-audio modeling with multi-codebook audio tokens, a design enabling richer paralinguistic representation while preserving a clear separation between modalities to avoid interference. We further develop a comprehensive data generation pipeline to synthesize a total of 15.8K hours of natural conversation, role-playing, and singing data for training. VITA-QinYu demonstrates superior expressiveness, outperforming peer SLMs by 7 percentage points on objective role-playing benchmarks, and surpassing peer models by 0.13 points on a 5-point MOS scale for singing. Simultaneously, it achieves state-of-the-art conversational accuracy and fluency, exceeding prior SLMs by 1.38 and 4.98 percentage points on the C3 and URO benchmarks, respectively. We open-source our code and models and provide an easy-to-use demo with full-stack support for streaming and full-duplex interaction.

2605.06764 2026-05-11 cs.LG cs.AI

Revisiting Adam for Streaming Reinforcement Learning

Florin Gogianu, Adrian Catalin Lutu, Razvan Pascanu

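AI Summary This paper revisits the Adam optimizer in streaming reinforcement learning, studying how to achieve efficient and stable updates without storing interaction data. By analyzing how established algorithms such as DQN and C51 behave in this online setting, the authors find that a bounded derivative of the objective and variance-adjusted weight updates are the keys to robust performance. Building on these findings, they propose Adaptive Q(λ), a variance-adjusted algorithm based on eligibility traces, which performs strongly on a subset of Atari games and surpasses existing methods.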

Details
English abstract

Learning from a sequence of interactions, as soon as observations are perceived and acted upon, without explicitly storing them, holds the promise of simpler, more efficient and adaptive algorithms. For over a decade, however, deep reinforcement learning walked the contrary path, augmenting agents with replay buffers or parallel sampling routines, in an effort to tame learning instability. Recently, this topic has been revisited by Elsayed et al. (2024), focusing on update computation through eligibility traces and modifications to the optimisation routine, resulting in the StreamQ algorithm. In this work we take a step back, investigating the efficacy of established updates, such as those implemented by DQN and C51 within this online setting. Not only do we find that they perform well, but through analysing how the optimisation algorithm generally, and Adam in particular, interacts with these updates, we contend that two properties are essential for robust performance: i) the derivative of the objective is to be bounded and ii) weight updates are variance-adjusted. Rigorous and exhaustive experimentation demonstrates that C51, which exhibits both characteristics, is competitive with StreamQ across a subset of 55 Atari games. Using these insights, we derive a variance-adjusted algorithm based on eligibility traces, termed Adaptive Q$(λ)$, which approaches double the human baseline on the same subset, surpassing existing methods by all performance metrics.
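To make property (ii) concrete, the following minimal sketch implements the textbook Adam update in NumPy. It is not the paper's code; the function name `adam_step` and the hyperparameter values are illustrative assumptions. It shows that the size of the first update is almost independent of the gradient's scale, which is one sense in which Adam's weight updates are variance-adjusted.

```python
import numpy as np

def adam_step(grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One textbook Adam update. Returns (parameter delta, new m, new v)."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad**2     # second-moment estimate
    m_hat = m / (1 - beta1**t)                # bias correction
    v_hat = v / (1 - beta2**t)
    delta = -lr * m_hat / (np.sqrt(v_hat) + eps)
    return delta, m, v

# Hypothetical gradients: the same direction at two very different scales.
g = np.array([0.5, -2.0, 3.0])
small, _, _ = adam_step(g, np.zeros(3), np.zeros(3), t=1)
large, _, _ = adam_step(1000.0 * g, np.zeros(3), np.zeros(3), t=1)

print(small)  # each entry has magnitude close to lr
print(large)  # nearly identical despite the 1000x larger gradient
```

Dividing by the running root-mean-square of the gradient is what removes the scale here. A plain SGD step on the same two gradients would differ by a factor of 1000.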

2605.06761 2026-05-11 cs.AI cs.CV cs.LG

Weblica: Scalable and Reproducible Training Environments for Visual Web Agents

Oğuzhan Fatih Kar, Roman Bachmann, Yuanzheng Gong, Anders Boesen Lindbo Larsen, Afshin Dehghan

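AI Summary This paper proposes Weblica, a framework for building reproducible and scalable training environments for visual web agents, addressing the difficulty of collecting training data at scale from a web that is complex and constantly changing. Weblica combines HTTP-level caching with LLM-based environment synthesis: the former replays stable visual states while preserving interactive behavior, and the latter synthesizes diverse training environments grounded in real websites and core navigation skills. The framework supports reinforcement learning across thousands of distinct tasks and environments. Its best model, Weblica-8B, outperforms open-weight models of similar size on multiple web navigation benchmarks while using fewer inference steps and scaling favorably with additional test-time compute.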

Comments 28 pages, 19 figures

Details
English abstract

The web is complex, open-ended, and constantly changing, making it challenging to scale training data for visual web agents. Existing data collection attempts remain limited to offline trajectories for supervised fine-tuning or a handful of simulated environments for RL training, thus failing to capture web diversity. We propose Weblica (Web Replica), a framework for constructing reproducible and scalable web environments. Our framework leverages 1) HTTP-level caching to capture and replay stable visual states while preserving interactive behavior and 2) LLM-based environment synthesis grounded in real-world websites and core web navigation skills. Using this framework, we scale RL training to thousands of diverse environments and tasks. Our best model, Weblica-8B, outperforms open-weight baselines of similar size across multiple web navigation benchmarks while using fewer inference steps, scales favorably with additional test-time compute, and is competitive with API models.

2605.06759 2026-05-11 cs.RO

An Aerial Manipulator for Perception-Driven Flower Targeting Toward Contactless Pollination in Vertical Farming

Chenzhe Jin, Zhuohang Wu, Yifan Cai, Xiangqi Li, Jan Ming Kevin Tan, Narsimlu Kemsaram, Valerio Modugno

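AI Summary As natural pollinators decline, controlled indoor agriculture such as vertical farming faces a pollination problem. This paper presents a perception-driven aerial manipulator system for flower localization and precise approach, aimed at contactless pollination. The system integrates RGBD-based perception, flight control based on model predictive path integral (MPPI) control, and a lightweight 2DoF manipulator, and achieves stable flight, reliable localization, and centimeter-level end-effector accuracy in both simulation and real-world lab experiments. The study validates the aerial manipulator as a reliable carrier and positioning framework for future contactless pollination systems.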

Comments This paper has been accepted for publication in the Proceedings of the 2026 4th International Conference on Robotics, Control and Vision Engineering (RCVE 2026), 10-12 July, 2026, Tokyo, Japan

Details
English abstract

The decline of natural pollinators has created a major challenge for crop production in controlled indoor agriculture, particularly in vertical farming environments where natural insect pollination is absent. This motivates the development of robotic systems capable of performing precise flower targeting tasks while minimizing physical interference with delicate floral structures. This paper presents an aerial manipulator platform for perception-driven flower detection, localization, and approach in vertical farming environments. The proposed system integrates onboard RGBD-based perception, model predictive path integral (MPPI)-based unmanned aerial vehicle (UAV) control on a PX4 platform, and a lightweight 2DoF manipulator for precise end-effector positioning. The platform is evaluated in both MuJoCo simulation and UAV lab experiments using a flower-targeting testbed. The experimental results demonstrate stable UAV flight, reliable flower localization, and centimeter-level end-effector positioning accuracy. In simulation, the proposed controller achieves consistent trajectory convergence and accurate target alignment. In the real-world UAV lab environment, the integrated perception-control-manipulation framework enables stable flower-targeted positioning and end-effector alignment under constrained aerial operation. These results validate the proposed aerial manipulator as a robust robotic carrier and positioning framework for future contactless pollination systems. While the current study focuses on perception-guided targeting and positioning, the developed platform provides a practical foundation for integrating advanced contactless end effectors, including acoustic-based pollen manipulation modules, in future work.

2605.06756 2026-05-11 cs.LG cs.SY eess.SY

Physics-based Digital Twins for Integrated Thermal Energy Systems Using Active Learning

Umme Mahbuba Nabila, Paul Seurin, Linyu Lin, Majdi I. Radaideh

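AI Summary This paper proposes an active-learning framework for physics-based digital twins, intended for real-time monitoring and control of integrated thermal energy systems. The method couples system-level Modelica simulations with four simpler physics-informed and data-driven surrogate models and uses model-specific active learning strategies to improve accuracy and efficiency. On a thermal energy distribution system, the framework matches the predictive accuracy of random sampling while using far fewer simulation trajectories; the GRU model gives the best predictive performance, while SINDyC is the most computationally efficient and interpretable.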

Comments 23 pages, 12 figures, and 2 tables

Details
English abstract

Real-time supervisory control of thermal energy distribution systems requires digital twins that are accurate, interpretable, and uncertainty-aware, yet remain data- and computationally efficient. High-fidelity simulations alone are costly, while purely data-driven surrogates often lack robustness. To address these challenges, this work proposes an active learning (AL) framework that couples system-level Modelica simulations with four simpler physics-informed and data-driven surrogate modeling approaches: deterministic Sparse Identification of Nonlinear Dynamics with Control (SINDyC), its probabilistic multivariate-Gaussian extension (MvG-SINDyC), a feedforward neural network (FNN), and a gated recurrent unit (GRU) network. Tailored to each surrogate, model-specific AL query strategies are employed, including Mahalanobis-distance sampling in coefficient space for MvG-SINDyC and error-based sampling in prediction space for SINDyC, FNN, and GRU, allowing the learning process to prioritize dynamically informative trajectories. The proposed approach is demonstrated on the glycol heat exchanger (GHX) subsystem of the Thermal Energy Distribution System (TEDS) at Idaho National Laboratory. Across key GHX outputs (the bypass mass flow rate $\dot{m}_{\mathrm{GHX}}$ and heat transfer rate $Q_{\mathrm{GHX}}$), the AL framework achieves comparable predictive accuracy using as few as one-fifth of the simulation trajectories required by random sampling. Among the evaluated surrogates, the GRU achieves the highest predictive fidelity, while SINDyC remains the most computationally efficient and interpretable. The probabilistic MvG-SINDyC surrogate further enables uncertainty quantification and exhibits the largest computational gains under AL.
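The abstract does not spell out how the Mahalanobis-distance query works, so the following is only a hedged sketch of the general idea. It assumes that each candidate trajectory is summarized by a coefficient vector, that the current surrogate holds a Gaussian belief (`mean`, `cov`) over coefficients, and that the candidate farthest from that belief is queried next. The names `mahalanobis` and `select_query` and all numbers are hypothetical.

```python
import numpy as np

def mahalanobis(c, mean, cov):
    """Mahalanobis distance of coefficient vector c from N(mean, cov)."""
    diff = c - mean
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

def select_query(candidates, mean, cov):
    """Return the index of the candidate farthest from the current belief."""
    scores = [mahalanobis(c, mean, cov) for c in candidates]
    return int(np.argmax(scores)), scores

# Hypothetical belief over two coefficients: the first is tightly
# determined (variance 0.04), the second is loose (variance 0.25).
mean = np.array([1.0, -0.5])
cov = np.array([[0.04, 0.0],
                [0.0, 0.25]])

# Hypothetical coefficient vectors implied by three candidate trajectories.
candidates = np.array([[1.1, -0.5],
                       [1.0, 0.5],
                       [1.5, -0.5]])

best, scores = select_query(candidates, mean, cov)
print(best, scores)
```

In this toy setup the third candidate is selected even though the second is farther away in Euclidean terms, because a deviation in a tightly determined coefficient is more surprising than a larger deviation in a loosely determined one.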

2605.06755 2026-05-11 cs.LG cs.AI

Gradient Extrapolation-Based Policy Optimization

Ismam Nur Swapnil, Aranya Saha, Tanvir Ahmed Khan, Mohammad Ariful Haque, Ser-Nam Lim

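AI Summary This paper proposes Gradient Extrapolation-Based Policy Optimization (GXPO), which aims to improve GRPO-style reinforcement learning on reasoning tasks for large language models. GXPO approximates a longer local lookahead using only three backward passes, guiding the policy update more accurately without requiring new rollouts or reward computation. Experiments on math reasoning show that GXPO outperforms GRPO and SFPO and reaches GRPO's peak accuracy in fewer steps and less wall-clock time.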

Comments 26 pages, 9 figures

Details
English abstract

Reinforcement learning is widely used to improve the reasoning ability of large language models, especially when answers can be automatically checked. Standard GRPO-style training updates the model using only the current step, while full multi-step lookahead can give a better update direction but is too expensive because it needs many backward passes. We propose Gradient Extrapolation-Based Policy Optimization (GXPO), a plug-compatible policy-update rule for GRPO-style reasoning RL. GXPO approximates a longer local lookahead using only three backward passes during an active phase. It reuses the same batch of rollouts, rewards, advantages, and GRPO loss, so it does not require new rollouts or reward computation at the lookahead points. GXPO takes two fast optimizer steps, measures how the gradients change, predicts a virtual K-step lookahead point, moves the policy partway toward that point, and then applies a corrective update using the true gradient at the new position. When the lookahead signal becomes unstable, GXPO automatically switches back to standard single-pass GRPO. We also give a plain-gradient-descent surrogate analysis that explains when the extrapolation is exact and where its local errors come from. Across Qwen2.5 and Llama math-reasoning experiments, GXPO improves the average sampled pass@1 by +1.65 to +5.00 points over GRPO and by +0.14 to +1.28 points over the strongest SFPO setting, while keeping the active-phase cost fixed at three backward passes. It also achieves up to 4.00x step speedup, 2.33x wall-clock speedup, and 1.33x backward-pass speedup in reaching GRPO's peak accuracy.
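The sequence of steps described above can be illustrated on a toy problem. The sketch below is not the paper's update rule, which the abstract does not specify. It assumes plain gradient descent on a diagonal quadratic loss and a per-coordinate geometric extrapolation of the gradient, a setting in which the extrapolated lookahead point is exact. The function names, the mixing weight `alpha`, and all numeric values are hypothetical.

```python
import numpy as np

H = np.array([1.0, 4.0, 10.0])   # curvatures of the toy loss 0.5 * sum(H * theta**2)

def loss(theta):
    return 0.5 * float(np.sum(H * theta**2))

def grad(theta):
    return H * theta             # one call stands in for one backward pass

def gd_steps(theta, lr, k):
    """Reference: k real gradient-descent steps (k backward passes)."""
    for _ in range(k):
        theta = theta - lr * grad(theta)
    return theta

def extrapolated_update(theta0, lr=0.05, k=8, alpha=0.5):
    g0 = grad(theta0)                        # backward pass 1
    theta1 = theta0 - lr * g0                # fast step 1
    g1 = grad(theta1)                        # backward pass 2
    theta2 = theta1 - lr * g1                # fast step 2
    ratio = g1 / g0                          # per-coordinate gradient change (needs g0 != 0)
    geo = (1.0 - ratio**k) / (1.0 - ratio)   # sum of ratio**j for j < k
    theta_k = theta0 - lr * g0 * geo         # virtual k-step lookahead point
    theta_mid = theta2 + alpha * (theta_k - theta2)   # move partway toward it
    g_mid = grad(theta_mid)                  # backward pass 3 (true gradient)
    return theta_mid - lr * g_mid, theta_k   # corrective update

theta0 = np.array([1.0, -2.0, 0.5])
theta_new, theta_k = extrapolated_update(theta0)
print(theta_k, gd_steps(theta0, 0.05, 8))    # the two lookahead points coincide here
print(loss(theta0), loss(theta_new))
```

On a quadratic loss each gradient coordinate shrinks by a constant factor per step, so two gradients are enough to predict the k-step point. For a real policy loss the ratio drifts, which is presumably where local extrapolation error would enter.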

2605.06747 2026-05-11 cs.CV cs.RO

HumanNet: Scaling Human-centric Video Learning to One Million Hours

Yufan Deng, Daquan Zhou

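AI Summary This work presents HumanNet, a one-million-hour human-centric video dataset that addresses the lack of large-scale, diverse, and finely annotated data for learning physical interaction. HumanNet covers fine-grained activities, human-object interactions, tool use, and long-horizon behaviors from both first-person and third-person perspectives, and provides interaction-centric annotations, including motion descriptions and hand and body signals, that support motion-aware and interaction-aware learning. The work also introduces a systematic data curation paradigm, built on design principles such as human-centric filtering, temporal structuring, and viewpoint diversity, that turns unstructured web video into a scalable substrate for learning. In a vision-language-action experiment, training on egocentric HumanNet video outperforms training on real-robot data.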

Comments Github: https://github.com/DAGroup-PKU/HumanNet Project website: https://dagroup-pku.github.io/HumanNet/

Details
English abstract

Progress in embodied intelligence increasingly depends on scalable data infrastructure. While vision and language have scaled with internet corpora, learning physical interaction remains constrained by the lack of large, diverse, and richly annotated human activity data. We present HumanNet, a one-million-hour human-centric video corpus that captures how humans interact with the physical world at scale. HumanNet spans both first-person and third-person perspectives and covers fine-grained activities, human-object interactions, tool use, and long-horizon behaviors across diverse real-world environments. Beyond raw video, the dataset provides interaction-centric annotations, including captions, motion descriptions, and hand and body-related signals, enabling motion-aware and interaction-aware learning. Beyond scale, HumanNet introduces a systematic data curation paradigm for embodied learning, where human-centric filtering, temporal structuring, viewpoint diversity, and annotation enrichment are treated as first-class design principles. This design transforms unstructured internet video into a scalable substrate for representation learning, activity understanding, motion generation, and human-to-robot transfer. We conduct a first-step validation of the value of this design through a controlled vision-language-action ablation: under a fixed set of validation data, continued training from the Qwen VLM with 1000 hours of egocentric video drawn from HumanNet surpasses continued training with 100 hours of real-robot data from Magic Cobot, indicating that egocentric human video could be a scalable and cost-effective substitute for robot data. By building this project, we aim to explore the opportunity to scale embodied foundation models using human-centric videos, rather than relying solely on robot-specific data.

2605.06741 2026-05-11 cs.LG

A Closed-Form Upper Bound for Admissible Learning-Rate Steps in Belief-Space Dynamics

Zixi Li, Youzhen Li

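AI Summary This paper studies the upper bound on admissible learning-rate steps in belief-space dynamics, treating the step size as the quantity that determines whether an update is contractive. By modeling the update as a projected forward step on the probability simplex, the authors derive a closed-form upper bound on the largest step that still guarantees contractivity. This gives a theoretical basis for setting the learning rate rather than relying on empirical tuning alone.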

Details
English abstract

Learning-rate steps are usually treated as hyperparameters. This paper isolates a local belief-space calculation: when an update is modeled as a projected forward step on the probability simplex, admissibility means contractivity in the natural KL/Bregman geometry. Under this model, the upper bound of an admissible step is not a tuning slogan but a formula.

2605.06740 2026-05-11 cs.LG cs.AI

Geometric Kolmogorov--Arnold Network (GeoKAN)

Abhijit Sen, Bikram Keshari Parida, Giridas Maiti, Mahima Arya, Denys I. Bondar

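AI Summary This paper proposes the Geometric Kolmogorov-Arnold Network (GeoKAN), a geometry-aware model that performs function approximation in learned, geometry-adapted coordinates, improving its ability to represent complex function structure. GeoKAN learns a diagonal Riemannian metric that warps the input space, introducing a geometric inductive bias through local length scaling and volume distortion, which is particularly relevant to physics-informed learning. The authors also develop several GeoKAN variants that reallocate representational resolution according to the task, making them well suited to the sharp, stiff, localized, and strongly non-uniform problems that arise in scientific machine learning.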

Comments 46 pages, 24 figures, 13 tables

Details
English abstract

We introduce Geometric Kolmogorov--Arnold Networks (GeoKANs), a family of geometry-aware KAN-type models in which approximation is carried out in learned, geometry-adapted coordinates rather than in fixed Euclidean input coordinates. GeoKAN achieves this by learning a diagonal Riemannian metric that warps the input before basis expansion and feature mixing. The learned metric provides a geometric inductive bias through local length scaling and volume distortion, and in physics-informed settings it also affects the differential structure seen by the model. Within this framework, we develop three main variants, namely GeoKAN-NNMetric, GeoKAN-$γ$, and LM-KAN. For LM-KAN, we further consider three basis-specific versions, LM-KAN-RBF, LM-KAN-Wav, and LM-KAN-Fourier. These variants allow us to study geometry-aware KAN models both as general function approximators and as surrogates in physics-informed learning. By stretching regions with rapid variation and compressing smoother regions, GeoKAN reallocates representational resolution in a task-dependent manner, allowing the model to place capacity where it is most needed. As a result, GeoKAN is well suited to sharp, stiff, localized, and strongly non-uniform regimes arising in scientific machine learning and differential-equation problems.

2605.06736 2026-05-11 cs.LG cs.AI cs.HC

STDA-Net: Spectrogram-Based Domain Adaptation for cross-dataset Sleep Stage Classification

Unaza Tallal, Shruti Kshirsagar, Ankita Shukla

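AI Summary Cross-dataset sleep stage classification is difficult because of differences in EEG channel montages, sampling rates, recording environments, and subject populations. This paper proposes STDA-Net, a spectrogram-based unsupervised domain adaptation framework that combines a convolutional neural network for spectral feature extraction, a bidirectional LSTM for modeling sleep dynamics, and a domain-adversarial neural network for aligning source and target features, without requiring labeled target-domain data. Experiments on several public datasets show that the method outperforms conventional 1D EEG baselines, with greater stability and reproducibility.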

Comments submitted to IEEE SMC conference

Details
English abstract

Accurate sleep stage classification across datasets remains challenging due to variability in EEG channel montages, sampling rates, recording environments, and subject populations. Although deep learning has shown considerable promise for automated sleep staging, most existing cross-dataset methods rely on one-dimensional EEG signal representations, whereas the use of two-dimensional spectrogram-based inputs within an unsupervised domain adaptation framework has remained largely unexplored. Here, we propose STDA-Net (Spectrogram-based Temporal Domain Adaptation Network), a framework that combines a convolutional neural network (CNN) for spectrogram-based feature extraction, a bidirectional long short-term memory (BiLSTM) module for temporal modeling of sleep dynamics, and a domain-adversarial neural network (DANN) for source-to-target feature alignment without requiring any labeled target-domain data during training. Experiments are conducted on three publicly available datasets (Sleep-EDF, SHHS-1, and SHHS-2) under six cross-dataset transfer settings. Results show that the proposed framework achieves an average accuracy of 89.03% and an average macro F1-score of 87.64%, consistently outperforming existing 1D baseline methods in terms of balanced classification performance, with substantially lower variance across five independent runs, indicating improved stability and reproducibility. Overall, these findings demonstrate that 2D spectrogram-based representations, combined with temporal modeling and adversarial domain adaptation, provide a robust and competitive alternative to conventional 1D EEG inputs for cross-dataset sleep staging.

2605.06733 2026-05-11 cs.LG cs.AI

Beyond Factor Aggregation: Gauge-Aware Low-Rank Server Representations for Federated LoRA

Jinqian Chen, Chang Liu, Jihua Zhu

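AI Summary Federated LoRA enables parameter-efficient adaptation of large language models under decentralized data and limited client resources. Existing methods, however, average LoRA factors directly, which creates a semantic mismatch because the same update admits many equivalent factorizations. The paper proposes GLoRA, a federated LoRA method that estimates a consensus update subspace from client projectors and aggregates updates in shared reference coordinates, so that the semantic update is represented entirely in low-rank form. Experiments show that GLoRA outperforms existing methods under data, resource, and task heterogeneity and achieves a favorable trade-off between efficiency and performance.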

Details
English abstract

Federated LoRA enables parameter-efficient adaptation of large language models under decentralized data and limited client resources. However, directly averaging LoRA factors is representation-dependent: the same intrinsic update admits infinitely many gauge-equivalent factorizations, so factor-level aggregation can change under arbitrary coordinate choices while the underlying update remains unchanged. This reveals a semantic mismatch in existing federated LoRA aggregation rules. We propose GLoRA, a gauge-aware server representation for federated LoRA. Instead of aggregating raw factors, GLoRA estimates a consensus update subspace from client projectors and aggregates client updates in shared reference coordinates, thereby representing semantic update aggregation entirely in low-rank form. To support heterogeneous client capacities, GLoRA further provides a rank-compatible readout that instantiates adapters of different ranks from the same server state without dense update reconstruction. Experiments on GLUE and SuperNI show that GLoRA consistently outperforms federated LoRA baselines under data, resource, and task heterogeneity, including heterogeneous client ranks, sparse participation, larger backbones, and unseen-task evaluation. GLoRA also achieves a favorable efficiency--performance trade-off, suggesting that effective federated LoRA requires not merely averaging low-rank factors, but defining a semantically meaningful server-side representation for aggregation.
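The gauge dependence described above can be checked numerically. The sketch below is an illustration of the problem, not of GLoRA itself: it builds two hypothetical client updates of the form $B A$, re-expresses one of them through an invertible matrix $G$ (so that $(BG)(G^{-1}A) = BA$), and compares factor-level averaging with averaging the updates themselves. All shapes, names, and values are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 6, 5, 2

# Hypothetical LoRA factors for two clients; each update is B @ A.
B1, A1 = rng.normal(size=(d_out, r)), rng.normal(size=(r, d_in))
B2, A2 = rng.normal(size=(d_out, r)), rng.normal(size=(r, d_in))

# Re-gauge client 2 with an invertible r x r matrix G.
G = np.array([[2.0, 1.0],
              [0.0, 0.5]])
B2g, A2g = B2 @ G, np.linalg.inv(G) @ A2

def factor_average(Bs, As):
    """Average the factors separately, then multiply."""
    return np.mean(Bs, axis=0) @ np.mean(As, axis=0)

def update_average(Bs, As):
    """Average the dense updates B @ A (reference only)."""
    return np.mean([B @ A for B, A in zip(Bs, As)], axis=0)

same_update = np.allclose(B2 @ A2, B2g @ A2g)
fa_before = factor_average([B1, B2], [A1, A2])
fa_after = factor_average([B1, B2g], [A1, A2g])
ua_before = update_average([B1, B2], [A1, A2])
ua_after = update_average([B1, B2g], [A1, A2g])

print(same_update)                        # client 2's update is unchanged
print(np.allclose(ua_before, ua_after))   # update-level average is gauge-invariant
print(np.allclose(fa_before, fa_after))   # factor-level average is not
```

The dense update average is used here only as a gauge-invariant reference. According to the abstract, GLoRA aggregates in low-rank form without dense update reconstruction, which this sketch does not attempt.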

2605.06730 2026-05-11 cs.LG

Semantic State Abstraction Interfaces for LLM-Augmented Portfolio Decisions: Multi-Axis News Decomposition and RL Diagnostics

Likhita Yerra, Remi Uttejitha Allam

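AI Summary This paper proposes Semantic State Abstraction Interfaces (SSAI), a method for mapping sparse unstructured text into a structured representation with auditable, named coordinates, so that representation hypotheses can be separated from optimization variance in sequential decision systems. The study instantiates SSAI with four axes (sentiment, risk, confidence, volatility forecast) on a US-equity portfolio dataset and evaluates it with factor portfolios, supervised regression models, and reinforcement learning agents. Although the four-factor portfolio achieves a high cumulative return, its advantage is not statistically robust. The authors stress that SSAI's main contribution is a diagnostic framework for interpretability and performance together with a reusable evaluation protocol, not a claim of superiority over denser representations.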

Comments 18 pages, 3 figures. NeurIPS 2024 manuscript style (preprint)

Details
English abstract

We introduce Semantic State Abstraction Interfaces (SSAI): a methodological template for mapping sparse unstructured text into $K$ auditable, named coordinates with neutral defaults on no-news days, designed to separate representation hypotheses from optimisation variance in sequential decision systems. Our contribution is the framework and its evaluation protocol, not a claim that SSAI outperforms denser alternatives. We instantiate SSAI with $K=4$ axes (sentiment, risk, confidence, volatility forecast) on a US-equity panel (30 NASDAQ-100 names, FNSPID news, 2019--2023 test), and evaluate it across direct factor portfolios, supervised ridge forecasters, and RL agents (DP-PPO, SAC) that share the same fixed $ϕ$. The four-factor factor portfolio reaches 307.2% cumulative return and Sharpe 1.067, but apparent gains versus buy-and-hold (243.6%) fail coverage-stratified controls, reverse at $\geq 0.2$% costs, and are statistically fragile versus a sentiment-only baseline; a PC1 composite and a FinBERT portfolio baseline are stronger ranking signals in this setting. Ridge and RL blocks diagnose representation versus optimiser effects. We position SSAI as an interpretability-performance diagnostic and reusable protocol for sparse-text decision systems.

2605.06729 2026-05-11 cs.LG cs.AI

The E$Δ$-MHC-Geo Transformer: Adaptive Geodesic Operations with Guaranteed Orthogonality

Arash Shahmansoori

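AI Summary This paper proposes the E$Δ$-MHC-Geo Transformer, a new neural network architecture that combines Manifold-Constrained Hyper-Connections, Deep Delta Learning, and the Cayley transform to obtain residual connections that are input-adaptive and unconditionally orthogonal. The model introduces a hybrid of data-dependent Cayley rotation and Householder reflection, which addresses the Cayley transform's inability to represent the eigenvalue $-1$ case, and uses a gating mechanism to select the appropriate orthogonal operation. In matched-parameter experiments, the model outperforms several baselines on long-horizon stability, near-$π$ rotation loss, norm preservation, and reflection alignment while using fewer layers.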

Comments 21 pages, 8 figures; code will be available at https://github.com/arash-shahmansoori/edelta

Details
English abstract

We present the E$Δ$-MHC-Geo Transformer, a novel architecture that unifies Manifold-Constrained Hyper-Connections (mHC), Deep Delta Learning (DDL), and the Cayley transform to obtain input-adaptive, unconditionally orthogonal residual connections. Unlike DDL, whose Householder operator is orthogonal only at $β\in \{0,2\}$, our Data-Dependent Cayley rotation $Q(x)=(I+(β/2)A(x))^{-1}(I-(β/2)A(x))$ preserves orthogonality for all $β$ and all inputs. To handle negation, an eigenvalue $-1$ case that Cayley provably excludes, we introduce the E$Δ$-MHC-Geo Hybrid, which combines Cayley rotation with Householder reflection via a learned operator-selection gate $X'=γ(X)Q(X)X+(1-γ(X))H_2(X)X$. A midpoint-collapse regularizer, $4γ(1-γ)$, encourages boundary gate decisions, where each selected component is orthogonal. In matched-parameter comparisons, with approximately 1.79M parameters per model and mean +/- standard deviation over 3 seeds, against four baselines including the concurrent JPmHC, E$Δ$-MHC-Geo achieves the best long-horizon stability, 1.9x over JPmHC and 3.8x over GPT; the best near-$π$ rotation loss, 4.5x over JPmHC on single-plane; strong norm preservation, with 0.001 mean deviation; and 0.96 negation cosine alignment in a diagnostic reflection probe, all with 33% fewer layers. While JPmHC's wider representation excels on pure rotation, its finite Cayley residual mixer excludes an exact $λ=-1$ operator and has no reflection branch, motivating our hybrid approach for accessing both connected components of $O(n)$.
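The orthogonality claims above can be verified numerically in a few lines. The sketch below is not the authors' implementation. It assumes that $A$ is a fixed skew-symmetric matrix (the abstract's $A(x)$ is data-dependent) and that the Householder-type operator has the form $I - β k k^\top$ with unit vector $k$, which is consistent with the abstract's statement that it is orthogonal only at $β \in \{0, 2\}$. All names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
I = np.eye(n)

M = rng.normal(size=(n, n))
A = M - M.T                       # skew-symmetric generator (assumed form of A(x))

def cayley(A, beta):
    # Q = (I + (beta/2) A)^{-1} (I - (beta/2) A)
    return np.linalg.solve(I + 0.5 * beta * A, I - 0.5 * beta * A)

k = rng.normal(size=n)
k /= np.linalg.norm(k)            # unit direction for the Householder-type operator

def householder(beta):
    return I - beta * np.outer(k, k)

def orth_error(Q):
    """Largest entry of |Q^T Q - I|; zero for an orthogonal matrix."""
    return float(np.max(np.abs(Q.T @ Q - I)))

cayley_errors = [orth_error(cayley(A, b)) for b in (0.3, 1.0, 5.0)]
hh_errors = {b: orth_error(householder(b)) for b in (0.0, 1.0, 2.0)}

# Cayley never produces eigenvalue -1; the reflection at beta = 2 does.
min_dist_to_minus_one = float(np.min(np.abs(np.linalg.eigvals(cayley(A, 1.0)) + 1.0)))
reflection_eigs = np.linalg.eigvalsh(householder(2.0))

def gate_penalty(gamma):
    """Midpoint-collapse regularizer 4*gamma*(1-gamma): 1 at 0.5, 0 at the boundaries."""
    return 4.0 * gamma * (1.0 - gamma)

print(cayley_errors)              # all near machine precision
print(hh_errors)                  # near zero only for beta = 0 and beta = 2
print(min_dist_to_minus_one, reflection_eigs)
```

The eigenvalue check illustrates why a reflection branch is needed: a Cayley eigenvalue has the form $(1 - iμ)/(1 + iμ)$ for real $μ$, which can approach but never equal $-1$, whereas the $β = 2$ reflection has $-1$ as an exact eigenvalue.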