arXivDaily arXiv每日学术速递 周一至周五更新
重置
全部学科分类 1981
2605.14305 2026-05-15 cs.CL

Factorization-Error-Free Discrete Diffusion Language Model via Speculative Decoding

Xun Fang, Yunchen Li, Hang Yuan, Zhou Yu

AI总结 本文提出了一种无因子化误差的离散扩散语言模型(FeF-DLLM),旨在解决传统方法中因独立预测清洁令牌而导致的因子化误差问题。该方法通过精确的前缀条件因子化替代独立预测,更有效地保留令牌间的依赖关系,并结合推测解码技术,在保持并行预测能力的同时提升推理速度。实验表明,该方法在多个基准数据集上平均提升了5.04个百分点的准确性,同时实现了3.86倍的加速。

详情
英文摘要

Discrete diffusion language models improve generation efficiency through parallel token prediction, but standard $X_0$ prediction methods introduce factorization errors by approximating the clean token posterior with independent token-wise distributions. This paper proposes Factorization-Error-Free Discrete Diffusion Language Modeling (FeF-DLLM), which replaces independent clean-token prediction with an exact prefix-conditioned factorization of the clean posterior to better preserve token dependencies. To reduce the sequential cost introduced by prefix conditioning, FeF-DLLM further incorporates speculative decoding within diffusion denoising, accelerating inference while maintaining the parallel prediction and re-masking properties of DLLMs. Theoretically, we prove that FeF-DLLM generates from the true joint distribution and derive its expected acceleration ratio. Experiments on GSM8K, MATH, HumanEval, and MBPP demonstrate that our method improves accuracy by an average of 5.04 percentage points while achieving an average inference speedup of $3.86\times$.

2605.14304 2026-05-15 cs.LG cs.AI

Matrix-Space Reinforcement Learning for Reusing Local Transition Geometry

Zuyuan Zhang, Carlee Joe-Wong, Tian Lan

AI总结 该研究提出了一种名为矩阵空间强化学习(MSRL)的新方法,旨在通过复用已有轨迹片段中的局部转移几何结构,提升强化学习中的组合泛化能力。MSRL 使用正定矩阵描述符来捕捉轨迹片段的一阶和二阶统计特性,从而在抽象的矩阵空间中实现代数组合与知识迁移。实验表明,该方法在有限预算下取得了优于现有方法的性能,展示了其在跨任务学习中的有效性。

详情
英文摘要

Compositional generalization in sequential decision-making requires identifying which parts of prior rollouts remain useful for new tasks. Existing methods reuse skills or predictive models, but often overlook rich local transition geometry and dynamics. We propose Matrix-Space Reinforcement Learning (MSRL), a geometric abstraction that represents trajectory segments through positive semidefinite matrix descriptors aggregating first- and second-order statistics of lifted one-step transitions. These descriptors expose shared hidden structure, support algebraic composition in an abstract matrix space, and reveal opportunities for transfer. We prove that the descriptor is well defined up to coordinate gauge, complete for the induced low-order additive signal class, additive under valid segment composition, and minimally sufficient among admissible additive descriptors. We further show that conditioning value functions on the trajectory-segment matrix yields a first-order smooth approximation of action values, enabling source-learned matrix-to-value mappings to bootstrap learning in new tasks. MSRL is plug-in compatible with standard model-free and model-based methods, while obstruction filtering rejects implausible compositions. Empirically, MSRL achieves the best average finite-budget target AUC of 0.73, outperforming MSRL from scratch (0.65), TD-MPC-PT+FT (0.63), and TD-MPC (0.57).

2605.14301 2026-05-15 cs.LG stat.ML

Language-Induced Priors for Domain Adaptation

Qiyuan Chen, Jiayu Zhou, Raed Al Kontar

AI总结 在领域适应中,当目标域数据稀缺时,传统统计方法难以区分相关与不相关的源域,导致负迁移。本文提出利用目标域的专家文本描述,构建语言诱导先验(LIP),将其与期望最大化算法结合,以识别相关源域。该方法兼容多种参数模型,能够在目标信号弱时引导源域选择,并随着数据积累逐步优化,理论分析表明其在正确先验下具有接近理想冷启动性能,并保持渐近一致性。实验验证了该框架在估计、预测和决策任务中的有效性。

详情
英文摘要

Domain adaptation faces a fundamental paradox in the cold-start regime. When target data is scarce, statistical methods fail to distinguish relevant source domains from irrelevant ones, which often leads to negative transfer. In this paper, we address this challenge by leveraging expert textual descriptions of the target domain, a resource that is often available but overlooked. We propose a probabilistic framework that translates these semantic descriptions into a choice model, namely a Language-Induced Prior (LIP), that learns the preferences from a pretrained Large Language Model (LLM). The LIP is then integrated into an Expectation-Maximization algorithm to identify source relevance. Methodologically, this framework is compatible with any parametric model where a likelihood is available. It allows the LIP to guide the selection of sources when target signals are weak, while gradually refining these choices as samples accumulate. Theoretically, we prove that the estimator roughly matches an oracle cold-start MSE under a correct prior, while remaining asymptotically consistent regardless of the quality of the LIP. Empirically, we validated the framework on a descriptive (Gaussian estimation), a predictive (C-MAPSS dataset), and a prescriptive task (MuJoCo hopper).

2605.14297 2026-05-15 cs.LG cs.AI math.OC stat.ML

Policy Optimization in Hybrid Discrete-Continuous Action Spaces via Mixed Gradients

Matias Alvo, Daniel Russo, Yash Kanoria

AI总结 本文研究了在混合离散-连续动作空间中的强化学习问题,这类问题常见于机器人控制和优化领域。为了解决传统策略梯度方法在高维空间中梯度质量差的问题,作者提出了混合策略优化(HPO)方法,通过结合路径梯度和得分函数梯度,实现无偏混合梯度估计,从而有效应对离散动作和非光滑动态带来的挑战。实验表明,HPO在库存控制和切换线性二次调节器等任务中显著优于PPO算法,且在连续动作维度增加时优势更加明显。

详情
英文摘要

We study reinforcement learning in hybrid discrete-continuous action spaces, such as settings where the discrete component selects a regime (or index) and the continuous component optimizes within it -- a structure common in robotics, control, and operations problems. Standard model-free policy gradient methods rely on score-function (SF) estimators and suffer from severe credit-assignment issues in high-dimensional settings, leading to poor gradient quality. On the other hand, differentiable simulation largely sidesteps these issues by backpropagating through a simulator, but the presence of discrete actions or non-smooth dynamics yields biased or uninformative gradients. To address this, we propose Hybrid Policy Optimization (HPO), which backpropagates through the simulator wherever smoothness permits, using a mixed gradient estimator that combines pathwise and SF gradients while maintaining unbiasedness. We also show how problems with action discontinuities can be reformulated in hybrid form, further broadening its applicability. Empirically, HPO substantially outperforms PPO on inventory control and switched linear-quadratic regulator problems, with performance gaps increasing as the continuous action dimension grows. Finally, we characterize the structure of the mixed gradient, showing that its cross term -- which captures how continuous actions influence future discrete decisions -- becomes negligible near a discrete best response, thereby enabling approximate decentralized updates of the continuous and discrete components and reducing variance near optimality. All resources are available at github.com/MatiasAlvo/hybrid-rl.

2605.14294 2026-05-15 cs.AI cs.LG

Precise Verification of Transformers through ReLU-Catalyzed Abstraction Refinement

Hengjie Liu, Zhenya Zhang, Jianjun Zhao

AI总结 随着Transformer模型在安全关键领域的广泛应用,其形式化验证变得尤为重要。与传统神经网络相比,Transformer的推理过程涉及复杂的计算,如自注意力层中的点积操作,使得验证极具挑战性。本文提出了一种基于ReLU催化的抽象细化方法,通过精确表示点积的非线性边界,结合凸松弛技术,提升了验证精度,并在两种经典验证方法的基础上扩展出适用于Transformer的高效且精确的验证框架,实验表明该方法在保持较高效率的同时显著提升了验证精度。

Comments 32 pages, 6 figures, the full version of the paper accepted by CAV 2026

详情
英文摘要

Formal verification of transformers has become increasingly important due to their widespread deployment in safety-critical applications. Compared to classic neural networks, the inferences of transformers involve highly complex computations, such as dot products in self-attention layers, rendering their verification extremely difficult. Existing approaches explored over-approximation methods by constructing convex constraints to bound the output ranges of transformers, which can achieve high efficiency. However, they may sacrifice verification precision, and consequently introduce significant approximation error that leads to frequent occurrences of false alarms. In this paper, we propose a transformer verification approach that can achieve improved precision. At the core of our approach is a novel usage of ReLU, by which we represent a precise but non-linear bound for dot products such that we can further exploit the rich body of literature for convex relaxation of ReLU to derive precise bounds. We extend two classic approaches to the context of transformers, a rule-based one and an optimization-based one, resulting in two new frameworks for efficient and precise verification. We evaluate our approaches on different model architectures and robustness properties derived from two datasets about sentiment analysis, and compare with the state-of-the-art baseline approach. Compared to the baseline, our approach can achieve significant precision improvement for most of the verification tasks with acceptable compromise of efficiency, which demonstrates the effectiveness of our approach.

2605.14289 2026-05-15 cs.LG cs.AI cs.CL cs.CR

MetaMoE: Diversity-Aware Proxy Selection for Privacy-Preserving Mixture-of-Experts Unification

Weisen Jiang, Shuhao Chen, Sinno Jialin Pan

AI总结 本文提出了一种隐私保护的混合专家(MoE)统一框架MetaMoE,旨在解决分布式数据环境下专家模型无法共享训练数据的问题。该方法通过选择与客户端领域相关且多样化的公共代理数据,替代无法获取的私有数据,从而有效指导路由器学习并提升专家协调能力。实验表明,MetaMoE在计算机视觉和自然语言处理任务中优于现有的隐私保护MoE统一方法。

Comments Accepted by ICML 2026

详情
英文摘要

Mixture-of-Experts (MoE) models scale capacity by combining specialized experts, but most existing approaches assume centralized access to training data. In practice, data are distributed across clients and cannot be shared due to privacy constraints, making unified MoE training challenging. We propose MetaMoE, a privacy-preserving framework that unifies independently trained, domain-specialized experts into a single MoE using public proxy data as surrogates for inaccessible private data. Central to MetaMoE is diversity-aware proxy selection, which selects client-domain-relevant and diverse samples from public data to effectively approximate private data distributions and supervise router learning. These proxies are further used to align expert training, improving expert coordination at unification time, while a context-aware router enhances expert selection across heterogeneous inputs. Experiments on computer vision and natural language processing benchmarks demonstrate that MetaMoE consistently outperforms recent privacy-preserving MoE unification methods. Code is available at https://github.com/ws-jiang/MetaMoE.

2605.14280 2026-05-15 cs.LG stat.ML

TILT: Target-induced loss tilting under covariate shift

Kakei Yamamoto, Martin J. Wainwright

AI总结 本文提出了一种名为TILT的无监督域适应方法,用于处理协变量偏移问题。该方法通过引入一个新颖的目标函数,将源域预测器分解为两个部分,并在有标签的源域数据上拟合这两个部分,同时在无标签的目标域数据上对辅助部分施加惩罚,最终得到的主预测器用于目标域预测。理论分析表明,该方法在总体层面能够隐式地诱导相对重要性加权,并且具有良好的稳定性与泛化能力。实验结果表明,TILT在多个任务中优于仅使用源域训练、精确重要性加权以及相对密度比等基线方法。

Comments 32 pages, 17 figures. Submitted to NeurIPS 2026

详情
英文摘要

We introduce and analyze Target-Induced Loss Tilting (TILT) for unsupervised domain adaptation under covariate shift. It is based on a novel objective function that decomposes the source predictor as $f+b$, fits $f+b$ on labeled source data while simultaneously penalizing the auxiliary component $b$ on unlabeled target inputs. The resulting fit $f$ is deployed as the final target predictor. At the population level, we show that this target-side penalty implicitly induces relative importance weighting at the population level, but in terms of an estimand $b^*_f$ that is self-localized to the current error, and remains uniformly bounded for any source-target pair (even those with disjoint supports). We prove a general finite-sample oracle inequality on the excess risk, and use it to give an end-to-end guarantee for training with sparse ReLU networks. Experiments on controlled regression problems and shifted CIFAR-100 distillation show that TILT improves target-domain performance over source-only training, exact importance weighting, and relative density-ratio baselines, with a stable dependence on the regularization parameter.

2605.14278 2026-05-15 cs.CV

KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration

Ruicheng Zhang, Kaixi Cong, Jun Zhou, Zhizhou Zhong, Zunnan Xu, Shuiyang Mao, Wei Liu, Xiu Li

AI总结 本文提出了一种名为KVPO的ODE原生在线组相对策略优化框架,用于通过键值语义探索对流式自回归视频生成器进行对齐。该方法通过将多样性探索的来源从随机噪声转移到历史键值缓存,构建语义多样且保持数据流形的生成分支,从而提升长期一致性。同时,KVPO引入基于轨迹速度能量的替代策略,实现了与ODE原生形式完全一致的奖励加权对比目标,在多个实验设置中显著提升了视频的视觉质量、运动质量和文本-视频对齐效果。

详情
英文摘要

Aligning streaming autoregressive (AR) video generators with human preferences is challenging. Existing reinforcement learning methods predominantly rely on noise-based exploration and SDE-based surrogate policies that are mismatched to the deterministic ODE dynamics of distilled AR models, and tend to perturb low-level appearance rather than the high-level semantic storyline progression critical for long-horizon coherence. To address these limitations, we present KVPO, an ODE-native online Group Relative Policy Optimization (GRPO) framework for aligning streaming video generators. For diversity exploration, KVPO introduces a causal-semantic exploration paradigm that relocates the source of variation from stochastic noise to the historical KV cache. By stochastically routing historical KV entries, it constructs semantically diverse generation branches that remain strictly on the data manifold. For policy modeling, KVPO introduces a velocity-field surrogate policy based on Trajectory Velocity Energy (TVE), which quantifies branch likelihood in flow-matching velocity space and yields a reward-weighted contrastive objective fully consistent with the native ODE formulation. Experiments on multiple distilled AR video generators demonstrate consistent gains in visual quality, motion quality, and text-video alignment across both single-prompt short-video and multi-prompt long-video settings.

2605.14277 2026-05-15 cs.AI cs.GT

Parallelizing Counterfactual Regret Minimization

Juho Kim, Tuomas Sandholm

AI总结 本文研究了如何将反事实遗憾最小化(CFR)算法并行化,以加速求解大规模不完美信息博弈。作者将CFR重新表述为一系列线性代数操作,从而能够利用现有的并行计算技术提升其效率。该方法适用于多种CFR变体,如CFR+、折扣CFR和预测型CFR。实验表明,基于GPU的实现比CPU上的现有实现快达四千倍。

Comments This paper contains and extends ideas that were originally in arxiv:2408.14778

详情
英文摘要

Parallelization has played an instrumental role in the field of artificial intelligence (AI), drastically reducing the time taken to train and evaluate large AI models. In contrast to its impact in the broader field of AI, applying parallelization to computational game solving is relatively unexplored, despite its great potential. In this paper, we parallelize the family of counterfactual regret minimization (CFR) algorithms, which were central to important breakthroughs for solving large imperfect-information games. We present a generalized parallelization framework, reframing CFR as a series of linear algebra operations. Then, existing techniques for parallelizing linear algebra operations can be applied to accelerate CFR. We also describe how our technique can be applied to other tabular members of the CFR family of algorithms, including the state-of-the-art, such as CFR+, discounted CFR, and predictive variants of CFR. Experimentally, we show that our CFR implementation on a GPU is up to four orders of magnitude faster than Google DeepMind OpenSpiel's CFR implementations on a CPU.

2605.14274 2026-05-15 cs.CV

CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL

Zhenyang Ni, Yijiang Li, Ruochen Jiao, Simon Sinong Zhan, Sipeng Chen, Zhenfei Yin, Minshuo Chen, Philip Torr, Zhaoran Wang, Qi Zhu

AI总结 该论文提出了一种名为CreFlow的在线强化学习框架,用于改进稀疏奖励下的具身视频生成模型。研究针对现有视频强化学习奖励机制无法准确反映任务逻辑的问题,引入了基于组合逻辑约束的奖励模型,将任务要求转化为线性时序逻辑约束,从而提供更准确的奖励信号和局部错误信息。CreFlow通过两个关键设计——信用感知的NFT损失和校正重流损失,有效提升了高维视频生成的训练效率与稳定性,实验表明其在双臂操作任务中的执行成功率提升了23.8个百分点。

详情
英文摘要

Video generation models trained on heterogeneous data with likelihood-surrogate objectives can produce visually plausible rollouts that violate physical constraints in embodied manipulation. Although reinforcement-learning post-training offers a natural route to adapting VGMs, existing video-RL rewards often reduce each rollout to a low-level visual metric, whereas manipulation video evaluation requires logic-based verification of whether the rollout satisfies a compositional task specification. To fill this gap, we introduce a compositional constraint-based reward model for post-training embodied video generation models, which automatically formulates task requirements as a composition of Linear Temporal Logic constraints, providing faithful rewards and localized error information in generated videos. To achieve effective improvement in high-dimensional video generation using these reward signals, we further propose CreFlow, a novel online RL framework with two key designs: i) a credit-aware NFT loss that confines the RL update to reward-relevant regions, preventing perturbations to unrelated regions during post-training; and ii) a corrective reflow loss that leverages within-group positive samples as an explicit estimate of the correction direction, stabilizing and accelerating training. Experiments show that CreFlow yields reward judgments better aligned with human and simulator success labels than existing methods and improves downstream execution success by 23.8 percentage points across eight bimanual manipulation tasks.

2605.14269 2026-05-15 cs.CV cs.AI

PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation

Yidong Huang, Zun Wang, Han Lin, Dong-Ki Kim, Shayegan Omidshafiei, Jaehong Yoon, Jaemin Cho, Yue Zhang, Mohit Bansal

AI总结 生成真实的人类运动是视频生成中的核心挑战之一。为了解决现有奖励信号无法准确评估运动真实性的难题,本文提出PhyMotion,一种基于物理模拟的结构化运动奖励机制,通过评估运动的运动学合理性、接触与平衡一致性以及动力学可行性等多个维度,实现对生成视频中人体运动质量的精细评价。实验表明,PhyMotion相比现有方法能更准确地反映人类判断,并在基于强化学习的后训练中显著提升了运动真实性和生成质量。

Comments First two authors contributed equally, website: https://phy-motion.github.io/

详情
英文摘要

Generating realistic human motion is a central yet unsolved challenge in video generation. While reinforcement learning (RL)-based post-training has driven recent gains in general video quality, extending it to human motion remains bottlenecked by a reward signal that cannot reliably score motion realism. Existing video rewards primarily rely on 2D perceptual signals, without explicitly modeling the 3D body state, contact, and dynamics underlying articulated human motion, and often assign high scores to videos with floating bodies or physically implausible movements. To address this, we propose PhyMotion, a structured, fine-grained motion reward that grounds recovered 3D human trajectories in a physics simulator and evaluates motion quality along multiple dimensions of physical feasibility. Concretely, we recover SMPL body meshes from generated videos, retarget them onto a humanoid in the MuJoCo physics simulator, and evaluate the resulting motion along three axes: kinematic plausibility, contact and balance consistency, and dynamic feasibility. Each component provides a continuous and interpretable signal tied to a specific aspect of motion quality, allowing the reward to capture which aspects of motion are physically correct or violated. Experiments show that PhyMotion achieves stronger correlation with human judgments than existing reward formulations. These gains carry over to RL-based post-training, where optimizing PhyMotion leads to larger and more consistent improvements than optimizing existing rewards, improving motion realism across both autoregressive and bidirectional video generators under both automatic metrics and blind human evaluation (+68 Elo gain). Ablations show that the three axes provide complementary supervision signals, while the reward preserves overall video generation quality with only modest training overhead.

2605.14267 2026-05-15 cs.CV cs.AI

Image Restoration via Diffusion Models with Dynamic Resolution

Yang Zheng, Wen Li, Zhaoqiang Liu

AI总结 该研究针对扩散模型在图像修复任务中计算开销大的问题,提出了一种基于动态分辨率扩散模型的图像修复方法。通过将数据投影到低维子空间,有效降低了计算负担,并在原有像素空间方法的基础上改进,提出了SubDPS和SubDAPS两种新方法,其中SubDAPS++进一步提升了修复效率和质量。实验表明,该方法在多个数据集和任务上优于现有基于扩散模型的图像修复方法。

Comments Accepted by ICML 2026

详情
英文摘要

Diffusion models (DMs) have exhibited remarkable efficacy in various image restoration tasks. However, existing approaches typically operate within the high-dimensional pixel space, resulting in high computational overhead. While methods based on latent DMs seek to alleviate this issue by utilizing the compressed latent space of a variational autoencoder, they require repeated encoder-decoder inference. This introduces significant additional computational burdens, often resulting in runtime performance that is even inferior to that of their pixel-space counterparts. To mitigate the computational inefficiency, this work proposes projecting data into lower-dimensional subspaces using dynamic resolution DMs to accelerate the inference process. We first fine-tune pre-trained DMs for dynamic resolution priors and adapt DPS and DAPS, which are two widely used pixel-space methods for general image restoration tasks, into the proposed framework, yielding methods we refer to as SubDPS and SubDAPS, respectively. Given the favorable inference speed and reconstruction fidelity of SubDAPS, we introduce an enhanced variant termed SubDAPS++ to further boost both reconstruction efficiency and quality. Empirical evaluations across diverse image datasets and various restoration tasks demonstrate that the proposed methods outperform recent DM-based approaches in the majority of experimental scenarios. The code is available at https://github.com/StarNextDay/SubDAPS.git.

2605.14266 2026-05-15 cs.AI cs.CY

Agentic AI Ecosystems in Higher Education: A Perspective on AI Agents to Emerging Inclusive, Agentic Multi-Agent AI Framework for Learning, Teaching and Institutional Intelligence

Vidya K Sudarshan, Anushka Sisodia, Reshma A Ramachandra, Sia Batra, Josephine Chong Leng Leng

AI总结 本文探讨了人工智能代理在高等教育中的应用前景,提出构建一个集成化的多智能体AI框架,以支持教学、学习和机构管理的协同运作。当前AI工具多为单一任务导向且缺乏整合,难以满足教育生态系统复杂需求,本文通过文献分析指出现有研究在跨功能整合与包容性设计方面的不足,并强调构建协调、适应性强的多智能体系统对于实现公平、包容教育的重要意义。

Comments 50 pages, 14 figures, 3 tables

详情
英文摘要

Integration of artificial intelligent (AI) agents in higher education is transforming teaching, learning and administrative processes. Although existing AI agents effectively support individual tasks, their implementation remains fragmented and inefficient for handling the complexity of educational institutions. This highlights a significant research gap: the lack of integrated eco-system-level agentic multi-agent AI platform capable of coordinated planning, reasoning, and adaptive decision-making across multiple educational functions. This paper presents a forward-looking perspective on agentic multi-agent AI platform in higher education, consisting interconnected autonomous, goal driven agents that support learning, teaching, and institutional operations. It addresses timely and critical questions: Can agentic AI represent the next generation of intelligent systems in tertiary education? Can they collectively support seamless coordinated operations across teaching, learning and administrative support? To what extent can such systems foster inclusive and equitable learning for diverse learners with special educational needs? To ground this perspective, a thematic analysis of existing literature identifies four dominant themes: task-specific fragmented AI tools, the transition from single-agent to multi-agent systems, limited cross-functional integration, and insufficient focus on inclusivity and accessibility. Findings reveal a clear gap between current AI implementations and the needs of holistic, learner-centered educational ecosystem. The paper synthesizes challenges and outlines future research directions for scalable human-aligned, and inclusive agentic AI platform. The significant contribution is the incorporation of inclusive learning perspectives, highlighting how coordinated agentic multi-agent platform can support diverse learners through adaptive, multimodal interventions.

2605.14262 2026-05-15 cs.RO cs.HC

Distill: Uncovering the True Intent behind Human-Robot Communication

Ting Li, David Porfirio

AI总结 随着机器人越来越多地融入日常生活,自然语言和用户端编程等直观的沟通方式成为指定机器人自主行为的重要手段。然而,这些方法难以准确捕捉用户的真正意图。为此,本文提出了一种名为Distill的通信方法,通过去除冗余步骤、概括单个步骤的含义以及放宽步骤间的顺序约束,有效提炼和优化用户的初始任务描述,从而更准确地理解用户的真实需求。

Comments 17 pages

详情
英文摘要

As robots become increasingly integrated into everyday environments, intuitive communication paradigms such as natural language and end-user programming have become indispensable for specifying autonomous robot behavior. However, these mechanisms are ineffective at fully capturing user intent: natural language is imprecise and ambiguous, whereas end-user programming can be overly specific. As a result, understanding what users truly mean when they interact with robots remains a central challenge for human-AI communication systems. To address this issue, we propose the Distill approach for human-robot communication interfaces. Given a task specification provided by the user, Distill (1) removes unnecessary steps; (2) generalizes the meaning behind individual steps; and (3) relaxes ordering constraints between steps. We implemented Distill on a web interface and, through a crowdsourcing study, demonstrated its ability to elicit and refine user intent from initial task specifications.

2605.14261 2026-05-15 cs.AI cs.GT

Heuristic Pathologies and Further Variance Reduction via Uncertainty Propagation in the AIVAT Family of Techniques

Juho Kim, Tuomas Sandholm

AI总结 本文研究了在多智能体环境中如何在样本量有限或试验成本高昂的情况下评估智能体的性能,提出了AIVAT方法族以降低估计方差。文章指出,AIVAT中的启发式价值函数选择和不确定性处理缺乏指导,进而揭示了该方法在梯度下降应用下的潜在问题,并提出应在观察评估数据前固定启发式函数。此外,作者展示了如何传播启发式不确定性以进一步降低方差,尽管这可能牺牲无偏性。实验表明,该方法在扑克数据集上有效减少了达到统计结论所需的样本数量。

详情
英文摘要

How should an agent's performance in a multiagent environment be evaluated when there is a limited sample size or a high cost of running a trial? The AIVAT family of variance reduction techniques was proposed to address this challenge by introducing unbiased low-variance estimators of agents' expected payoffs. An important component of AIVAT is a heuristic value function that discriminates between potentially low- and high-value counterfactual histories. A notable gap in the literature is that there is little to no constraint or guideline on how the heuristic value function should be chosen or how uncertainty in its output should be handled. In our first contribution, we parameterize the heuristic value function to highlight AIVAT's potential vulnerabilities: a) the sample variance can be set pathologically low by directly applying gradient descent on the sample variance, and b) one can p-hack to draw a desired statistical conclusion via gradient descent/ascent on the test statistic. The main takeaway is that the heuristic value function should be fixed prior to observing the evaluation data! In our second contribution, we show how the heuristic uncertainty can be propagated to quantify the uncertainty of AIVAT estimates. It is then possible to further reduce the variance using inverse-variance weighted averaging, but AIVAT's unbiasedness guarantee may have to be sacrificed. In our experiments, we use a dataset of 10,000 poker hands to demonstrate our heuristic pathology and uncertainty results, with the latter yielding a 43.0% reduction in the number of samples (poker hands) needed to draw statistical conclusions.

2605.14258 2026-05-15 cs.LG cs.AI

Dynamics of the Transformer Residual Stream: Coupling Spectral Geometry to Network Topology

Jesseba Fernando, Grigori Guitchounts

AI总结 本文研究了大型语言模型中残差流的动态特性,揭示了训练过程中谱几何与网络拓扑之间的耦合关系。通过全雅可比矩阵的特征分解,作者发现训练使得模型深度方向上形成单调的谱梯度,并伴随着维度压缩现象,这些特性是学习得到的而非由模型结构决定。研究进一步表明,网络中图社区的拓扑位置决定了雅可比矩阵对其扰动的放大或抑制作用,这一关系在模型初始化时并不存在。

详情
英文摘要

Large language models are remarkably capable, yet how computation propagates through their layers remains poorly understood. A growing line of work treats depth as discrete time and the residual stream as a dynamical system, where each layer's nonlinear update has a local linear description. However, previous analyses have relied on scalar summaries or approximate linearizations, leaving the full spectral geometry of trained LLMs unknown. We perform full Jacobian eigendecomposition across three production--scale LLMs and show that training installs a monotonic spectral gradient through depth -- from non-normal, rotation-dominated early layers to near--symmetric late layers -- together with a cumulative low-rank bottleneck that funnels perturbations into a small fraction of the residual stream's effective dimensions. Our experiments reveal that this gradient and the dimensional collapse are learned rather than architectural, and is largely dissolved when structured non-normality is removed. We further show that the topological positioning of graph communities predicts whether the Jacobian amplifies or suppresses them, with the sign of the coupling determined by the local operator type, a relationship absent at initialization. These results map a learned spectral geometry in LLMs that links perturbation propagation and compression to the network's functional topology.

2605.14253 2026-05-15 cs.CV cs.LG

Towards Real-Time Autonomous Navigation: Transformer-Based Catheter Tip Tracking in Fluoroscopy

Harry Robertshaw, Yanghe Hao, Weiyuan Deng, Benjamin Jackson, S. M. Hadi Sadati, Nikola Fischer, Tom Vercauteren, Alejandro Granados, Thomas C. Booth

AI总结 本文旨在开发一种基于荧光透视图像的实时导管尖端跟踪系统,以支持基于强化学习的自主机械取栓手术导航。研究提出了一种多线程处理框架,结合深度学习分割模型与后处理算法,有效应对图像对比度低、噪声大及设备遮挡等挑战。实验表明,该方法在分割精度上优于现有方法,为未来自主导航系统的实现提供了可靠高效的解决方案。

Comments Harry Robertshaw and Yanghe Hao contributed equally to this work. Published in the International Journal of Computer Assisted Radiology and Surgery

详情
Journal ref
Int J CARS (2026)
英文摘要

Purpose: Mechanical thrombectomy (MT) improves stroke outcomes, but is limited by a lack of local treatment access. Widespread distribution of reinforcement learning (RL)-based robotic systems can be used to alleviate this challenge through autonomous navigation, but current RL methods require live device tip coordinate tracking to function. This paper aims to develop and evaluate a real-time catheter tip tracking pipeline under fluoroscopy, addressing challenges such as low contrast, noise, and device occlusion. Methods: A multi-threaded pipeline was designed, incorporating frame reading, preprocessing, inference, and post-processing. Deep learning segmentation models, including U-Net, U-Net+Transformer, and SegFormer, were trained and benchmarked using two-class and three-class formulations. Post-processing involved two-step component filtering, one-pixel medial skeletonization, and greedy arc-length path following with contour fall-back. Results: On manually-labeled moderate complexity fluoroscopic video data, the two-class SegFormer achieved a mean absolute error of 4.44 mm, outperforming U-Net (4.60 mm), U-Net+Transformer (6.20 mm) and all three-class models (5.19-7.74 mm). On segmentation benchmarks, the system exceeded state-of-the-art CathAction results with improvements of up to +5% in Dice scores for three-segmentation. Conclusion: The results demonstrate that the proposed multi-threaded tracking framework maintains stable performance under challenging imaging conditions, outperforming prior benchmarks, while providing a reliable and efficient foundation for RL-based autonomous MT navigation.

2605.14252 2026-05-15 cs.LG cs.AI

Not All Timesteps Matter Equally: Selective Alignment Knowledge Distillation for Spiking Neural Networks

Kai Sun, Peibo Duan, Yongsheng Huang, Guowei Zhang, Benjamin Smith, Nanxu Gong, Levin Kuhlmann

AI总结 本文研究了脉冲神经网络(SNN)与人工神经网络(ANN)之间的性能差距问题,提出了一种新的知识蒸馏方法——选择性对齐知识蒸馏(SeAl-KD)。该方法突破了传统方法对所有时间步进行统一对齐的假设,通过识别错误时间步并针对性地进行校正,同时保留有用的时序动态,从而更有效地提升SNN的性能。实验表明,该方法在静态图像和神经形态事件数据集上均优于现有蒸馏方法。

详情
英文摘要

Spiking neural networks (SNNs), which are brain-inspired and spike-driven, achieve high energy efficiency. However, a performance gap between SNNs and artificial neural networks (ANNs) still remains. Knowledge distillation (KD) is commonly adopted to improve SNN performance, but existing methods typically enforce uniform alignment across all timesteps, either from a teacher network or through inter-temporal self-distillation, implicitly assuming that per-timestep predictions should be treated equally. In practice, SNN predictions vary and evolve over time, and intermediate timesteps need not all be individually correct even when the final aggregated output is correct. Under such conditions, effective distillation should not force every timestep toward the same supervision target, but instead provide corrective guidance to erroneous timesteps while preserving useful temporal dynamics. To address this issue, we propose Selective Alignment Knowledge Distillation (SeAl-KD), which selectively aligns class-level and temporal knowledge by equalizing competing logits at erroneous timesteps and reweighting temporal alignment based on confidence and inter-timestep similarity. Extensive experiments on static image and neuromorphic event-based datasets demonstrate consistent improvements over existing distillation methods. The code is available at https://github.com/KaiSUN1/SeAl

2605.14251 2026-05-15 cs.CV

Generative Deep Learning for Computational Destaining and Restaining of Unregistered Digital Pathology Images

Aarushi Kulkarni, Alarice Lowe, Pratik Shah

AI总结 该研究探讨了基于条件生成对抗网络(cGAN)的数字病理图像去染色与再染色方法,并针对不同机构间未对齐的全切片图像(WSI)进行了评估。为减少领域偏移影响,研究提出了一种预处理流程,包括基于直方图的染色归一化和通道强度校准。实验结果表明,即使在无图像配准的情况下,该方法仍能实现较好的染色还原效果,并在多个指标上优于直接染色方法,验证了预处理对模型性能的重要影响。

详情
英文摘要

Conditional generative adversarial networks (cGANs) have enabled high-fidelity computational staining and destaining of hematoxylin and eosin (H&E) in digital pathology whole-slide images (WSI). However, their ability to generalize to out-of-distribution WSI across institutions without retraining remains insufficiently characterized. Previously developed cGAN models trained on 102 registered prostate core biopsy WSIs from Brigham and Women's Hospital were evaluated on 82 spatially unregistered WSIs acquired at Stanford University. To mitigate domain shift without retraining, a preprocessing pipeline consisting of histogram-based stain normalization for H&E-stained WSIs and channel-wise intensity calibration for unstained WSIs was developed. Because image registration was intentionally omitted for real-world deployment conditions, the reported quantitative results are conservative lower bounds reflecting both model performance and limited spatial alignment. Under these conditions, virtual destaining achieved a Pearson correlation coefficient (PCC) of 0.854, structural similarity index measure (SSIM) of 0.699, and peak signal-to-noise ratio (PSNR) of 18.41 dB. H&E restaining from computationally destained outputs outperformed direct staining from ground-truth unstained inputs across all metrics (PCC: 0.798 vs. 0.715; SSIM: 0.756 vs. 0.718; PSNR: 20.08 vs. 18.51 dB), suggesting that preprocessing quality may be more limiting than model capacity. Qualitative pathological review indicated preservation of benign glandular structures while showing that malignant glands were often rendered with vessel-like morphologies. These findings support the feasibility of applying cGAN-based computational H&E staining and destaining generative models to external WSI datasets using preprocessing-based adaptation alone while defining specific morphological targets for future domain adaptation.

2605.14249 2026-05-15 cs.LG

EnergyLens: Predictive Energy-Aware Exploration for Multi-GPU LLM Inference Optimization

Zhiye Song, Kyungmi Lee, Eun Kyung Lee, Xin Zhang, Tamar Eilam, Anantha P. Chandrakasan

AI总结 本文提出了一种端到端的EnergyLens框架,用于实现面向能效的大型语言模型(LLM)推理优化。该方法通过直观的einsum接口捕捉模型的融合、并行性和计算-通信重叠等特性,并结合负载不平衡感知的MoE建模和多GPU通信能耗模型,有效预测和优化多GPU环境下的能耗。实验表明,EnergyLens在多个模型和配置上实现了较高的能耗预测精度,并揭示了不同配置下显著的能效差异,为分布式推理优化提供了重要指导。

详情
英文摘要

We present EnergyLens, an end-to-end framework for energy-aware large language model (LLM) inference optimization. As LLMs scale, predicting and reducing their energy footprint has become critical for sustainability and datacenter operations, yet existing approaches either require production-level code and expensive profiling or fail to accurately capture multi-GPU energy behavior. As a result, practitioners lack tools for deciding which optimizations to prioritize and for selecting among existing deployment configurations when exhaustive profiling is impractical. EnergyLens addresses this gap with an intuitive einsum-based interface that captures LLM specifications including fusion, parallelism, and compute-communication overlap, combined with load-imbalance-aware MoE modeling and an empirically driven communication energy model for multi-GPU settings. We validate EnergyLens on Llama3 and Qwen3-MoE across tensor-parallel and expert-parallel configurations, achieving mean absolute percentage errors (MAPEs) between 9.25% and 13.19% for multi-GPU prefill and decode energy, and 12.97% across SM allocations for Megatron-style overlap. Our energy-driven exploration reveals up to 1.47x and 52.9x energy variation across configurations in prefill and decode efficiency and motivates distributed serving. We further show that compute-communication overlap is difficult to optimize with intuition alone, but EnergyLens correctly identifies Pareto-optimal overlap configurations.

2605.14246 2026-05-15 cs.LG cs.AI cs.SY eess.SY

Action-Conditioned Risk Gating for Safety-Critical Control under Partial Observability

Yushen Liu, Yin-Jen Chen, Ziyi Chen, Tao Wang, Heng Huang, Xugui Zhou, Yanfu Zhang

AI总结 该研究针对部分可观测环境下安全关键控制问题,提出了一种基于动作条件风险门控的强化学习方法,用于在不完全观测情况下平衡任务性能与安全风险。方法通过构建有限历史的紧凑代理状态,并学习动作条件的短期安全违规预测,将预测风险用于价值学习中的风险惩罚和决策时的风险门控,从而在保证安全的同时提升控制性能。实验表明,该方法在血糖调节和安全导航等任务中相比传统方法具有更优的奖励-成本平衡和运行效率。

详情
英文摘要

Many safety-critical control problems are modeled as risk-sensitive partially observable Markov decision processes, where the controller must make decisions from incomplete observations while balancing task performance against safety risk. Although belief-space planning provides a principled solution, maintaining and planning over beliefs can be computationally costly and sensitive to model specification in practical domains. We propose a lightweight risk-gated reinforcement learning approximation for risk-sensitive control under partial observability. The method constructs a compact finite-history proxy state and learns an action-conditioned predictor of near-term safety violation. This predicted candidate-action risk is used in two complementary ways: as a risk penalty during value learning, and as a decision-time gate that interpolates between optimistic and conservative ensemble value estimates. As a result, low-risk actions are evaluated closer to reward-seeking estimates, while high-risk actions are evaluated more conservatively. We evaluate the approach in two safety-critical partially observable domains: automated glucose regulation and safety-constrained navigation. Across adult and adolescent glucose-control cohorts, the method improves overall glycemic tradeoffs and substantially reduces runtime relative to a belief-space planning baseline. On Safety-Gym navigation benchmarks, it achieves a more favorable reward-cost balance than unconstrained RL and several standard safe-RL baselines. These results suggest that action-conditioned near-term risk can provide an effective local signal for approximate risk-sensitive POMDP control when full belief-space planning is impractical.

2605.14242 2026-05-15 cs.LG cs.AI

Artificial Intelligence-Assistant Cardiotocography: Unified Model for Signal Reconstruction, Fetal Heart Rate Analysis, and Variability Assessment

Xiaohua Wang, Kai Yu, XuXiao Liang, Liang Wang, Chao Han

AI总结 该研究提出了一种基于人工智能的卡iotocography(CTG)模型,用于胎儿心率信号重建、心率分析及变异性评估。该模型通过大规模未标注数据预训练,并结合专家审核数据进行微调,有效提升了信号重建精度和分析可靠性。研究引入了交叠标签(IOL)方法验证胎儿心率,模型在检测关键心率减速和加速方面表现出高灵敏度和特异性,并在临床指标评估中取得了优异的AUC成绩。

详情
英文摘要

The monitoring of fetal heart rate (FHR) and the assessment of its variability are crucial for preventing fetal compromise and adverse outcomes. However, traditional methods encounter limitations arising from equipment performance, data transmission, and subjective assessments by doctors. We have developed a tailored AI-based FHrCTG model specifically for FHR monitoring, which effectively mitigates noise interference and precisely reconstructs signals. Our model was pre-trained on a massive dataset consisting of 558,412 unlabeled data points and further refined using 7,266 expert-reviewed entries. To validate FHR, we introduced the Intersection Overlapping Labels (IOL) approach, which transforms rate analysis into categorical judgments. Testing revealed that our model demonstrates high sensitivity and specificity in detecting critical FHR decelerations (89.13% and 87.78%, respectively) and accelerations (62.5% and 92.04%, respectively). Furthermore, based on Fischer's criteria for clinical application, our model achieved impressive AUC scores of 0.7214 and 0.9643 for verifying FHR periodicity and amplitude variation, respectively.

2605.14240 2026-05-15 cs.LG

Paraphrasing Attack Resilience of Various AI-Generated Text Detection Methods

Andrii Shportko, Inessa Verbitsky

AI总结 本文研究了多种AI生成文本检测方法在面对改写攻击时的鲁棒性,评估了包括微调RoBERTa、Binoculars和文本特征分析等方法及其随机森林集成的效果。研究发现,包含Binoculars的集成方法性能最强,但在攻击下表现下降也最明显,揭示了AI文本检测中性能与鲁棒性之间的矛盾,挑战了当前对先进检测技术可靠性的认知。

Comments NAACL 2025

详情
英文摘要

The recent large-scale emergence of LLMs has left an open space for dealing with their consequences, such as plagiarism or the spread of false information on the Internet. Coupling this with the rise of AI detector bypassing tools, reliable machine-generated text detection is in increasingly high demand. We investigate the paraphrasing attack resilience of various machine-generated text detection methods, evaluating three approaches: fine-tuned RoBERTa, Binoculars, and text feature analysis, along with their ensembles using Random Forest classifiers. We discovered that Binoculars-inclusive ensembles yield the strongest results, but they also suffer the most significant losses during attacks. In this paper, we present the dichotomy of performance versus resilience in the world of AI text detection, which complicates the current perception of reliability among state-of-the-art techniques.

2605.14239 2026-05-15 cs.CV

Implicit spatial-frequency fusion of hyperspectral and lidar data via kolmogorov-arnold networks

Zekun Long, Judy X. Yang, Jing Wang, Ali Zia, Guanyiman Fu, Jun Zhou

AI总结 本文研究了高光谱图像(HSI)与激光雷达(LiDAR)数据的融合问题,旨在提升复杂场景下的分类性能。针对现有方法在建模结构不连续性和光谱特征方面存在的不足,作者提出了一种基于Kolmogorov-Arnold网络(KAN)的隐式频域-几何融合网络(IFGNet),通过可学习的样条函数自适应捕捉高光谱与LiDAR特征之间的高度非线性关系,并在空间和频域引入LiDAR引导的隐式聚合模块,增强几何感知的表示能力。实验表明,IFGNet在多个基准数据集上显著优于现有方法,具有更高的分类精度和效率。

Comments 6 pages, 1 figure, conference

详情
英文摘要

Hyperspectral image (HSI) classification is challenging in complex scenes due to spectral ambiguity, spatial heterogeneity, and the strong coupling between material properties and geometric structures. Although LiDAR provides complementary elevation information, most HSI-LiDAR fusion methods rely on CNNs or MLPs with fixed activation functions and linear weights. These methods struggle to model structural discontinuities in LiDAR data, intricate spectral features of HSI, and their interactions. In addition, fusion of the two modalities in both spatial and frequency domains with LiDAR guidance remains underexplored. To address these issues, we propose the Implicit Frequency-Geometry Fusion Network (IFGNet), which leverages Kolmogorov-Arnold Networks (KANs) with learnable spline-based functions to adaptively capture highly nonlinear relationships between hyperspectral and LiDAR features. Furthermore, IFGNet introduces a LiDAR-guided implicit aggregation module in both spatial and frequency domains, enhancing geometry-aware spatial representations while capturing global structural patterns. Experiments on the Houston 2013 and MUUFL benchmarks demonstrate that IFGNet consistently outperforms existing fusion methods in overall accuracy, average accuracy, and Cohen's Kappa, while maintaining an efficient architecture.

2605.14237 2026-05-15 cs.AI

Good to Go: The LOOP Skill Engine That Hits 99% Success and Slashes Token Usage by 99% via One-Shot Recording and Deterministic Replay

Xiaohua Wang, Kai Yu, XuXiao Liang, Liang Wang, Chao Han

AI总结 本文提出了一种名为LOOP SKILL ENGINE的系统,旨在解决AI代理执行重复性任务时的高失败率和高计算成本问题。该系统通过一次性的任务执行记录和确定性回放机制,实现了99%的任务成功率,并将令牌使用量减少了99%。其核心方法是将首次运行中记录的工具调用轨迹转化为参数化的确定性执行计划,后续任务直接回放该计划,无需再次调用大语言模型,从而大幅降低开销并保证执行的可预测性。

Comments 8 pages, 5 tables

详情
英文摘要

Deploying AI agents for repetitive periodic tasks exposes a critical tension: Large Language Models (LLMs) offer unmatched flexibility in tool orchestration, yet their inherent stochasticity causes unpredictable failures, and repeated invocations incur prohibitive token costs. We present the LOOP SKILL ENGINE, a system that achieves a combined 99% success rate and 99% token reduction for periodic agent tasks through a one-shot recording, deterministic replay paradigm. On its first run, the agent executes the task with full LLM reasoning while the system transparently intercepts and records the complete tool-call trajectory. A greedy length-descending template extraction algorithm then converts this recording into a parameterized, branch-free Loop Skill -- a deterministic execution plan that captures the task's functional intent while parameterizing time-dependent and result-dependent variables. All subsequent executions bypass the LLM entirely: the engine resolves template variables against real-time values and replays the tool sequence deterministically. We prove two theorems: (1) Replay Determinism -- the step sequence of a validated Loop Skill is invariant across all future executions; (2) Write Safety -- concurrent access to persistent configuration is serialized through reentrant locks and atomic file replacement. Across a benchmark of periodic agent tasks spanning intervals from 5 minutes to 24 hours, the Loop Skill Engine reduces monthly token consumption by 93.3%--99.98% and cuts execution latency by 8.7x while eliminating output non-determinism. A multi-layer degradation strategy guarantees that tasks never stall. We release the engine as part of the buddyMe open-source agent framework.

2605.14235 2026-05-15 cs.LG cs.MA quant-ph

Quantum Advantage in Multi Agent Reinforcement Learning

Simranjeet Singh Dahia, Claudia Szabo

AI总结 本文研究了量子纠缠在多智能体强化学习中的协调优势,提出了一种基于可变量子电路的去中心化量子多智能体强化学习框架。通过在CHSH游戏中验证,纠缠的量子智能体能够接近理论上的Tsirelson极限,展现出明确的量子优势,而无纠缠的量子电路则与经典方法表现相当。实验还表明,特定的纠缠结构对协调性能有显著影响,并在合作导航任务中展示了量子智能体与经典方法相比的性能提升。

Comments 19 pages

详情
英文摘要

We present an empirical evaluation of quantum entanglement in agent coordination within quantum multi agent reinforcement learning (QMARL). While QMARL has attracted growing interest recently, most prior work evaluates quantum policies without provable baselines, making it impossible to rigorously distinguish quantum advantage from algorithmic coincidence. We address this directly by evaluating a decentralized QMARL framework with variational quantum circuit (VQC) actors with shared entangled states. In the CHSH game, which has a mathematically proven classical performance ceiling of 0.75 win rate, we show that entangled QMARL agents approach the Tsirelson limit of 0.854, providing clear evidence of their quantum advantage. We show that unentangled quantum circuits match the classical baseline, confirming that entanglement and not the quantum circuit itself is the active coordination mechanism. We also explore the effect of specific entanglement structures, as some Bell states enable coordination gains while others actively harm performance. On cooperative navigation (CoopNav), QMARL without entanglement achieves $\sim2\times$ improvement in success rate over classical MAA2C ($\sim$0.85 versus $\sim$0.40), with the hybrid configuration, quantum actor paired with a classical centralised critic, outperforming both fully classical and fully quantum solutions. We present our experimental analysis and discuss future work.

2605.14232 2026-05-15 cs.RO

Reactive Planning based Control for Mobile Robots in Obstacle-Cluttered Environments

Li Tan, Junlin Xiong, Yan Wang, Wei Ren

AI总结 本文研究了移动机器人在障碍物密集环境中进行避障运动控制的问题。针对机器人仅有部分环境信息的情况,提出了一种基于反应式规划的控制策略(RPCS),通过构建参考轨迹并结合局部轨迹调整实现避障,同时设计了自适应跟踪控制方法以提高轨迹跟踪性能。该方法有效结合了反应式规划与自适应控制,实验结果验证了其优越性与实用性。

Comments 7 pages, 7 figures

详情
英文摘要

This paper addresses the motion control problem for mobile robots in obstacle-cluttered environments. The mobile robot has partial environment information only, and aims to move from an initial position to a target position without collisions. For this purpose, a reactive planning based control strategy (RPCS) is proposed. First, the initial and target positions are connected as a reference trajectory. Then, a reactive planning strategy (RPS) is developed to ensure the collision avoidance by modifying the reference trajectory locally based on the partial environment information. Next, an adaptive tracking control strategy (ATCS) is proposed to track the reference trajectory with potentially local modifications via the discretization techniques. Finally, the RPS and ATCS are combined to establish the RPCS, whose efficacy and advantages are illustrated by numerical examples.

2605.14231 2026-05-15 cs.LG cs.AI cs.SD

AudioMosaic: Contrastive Masked Audio Representation Learning

Hanxun Huang, Qizhou Wang, Xingjun Ma, Cihang Xie, Christopher Leckie, Sarah Erfani

AI总结 本文提出了一种基于对比学习的音频编码器 AudioMosaic,用于通用音频理解任务。该方法通过结构化时频掩码生成正样本对,降低内存消耗并支持高效的大批量训练。与生成式方法相比,AudioMosaic 能够学习更具判别性的语句级表示,在不同数据集、领域和声学条件下表现出优异的迁移能力,并在多个标准音频基准测试中取得了最先进的性能。

Comments ICML2026

详情
英文摘要

Audio self-supervised learning (SSL) aims to learn general-purpose representations from large-scale unlabeled audio data. While recent advances have been driven mainly by generative reconstruction objectives, contrastive approaches remain less explored, partly due to the difficulty of designing effective audio augmentations and the large batch sizes required for contrastive pre-training. We introduce \textbf{AudioMosaic}, a contrastive learning-based audio encoder for general audio understanding. During pre-training, AudioMosaic constructs positive pairs by applying structured time-frequency masking to spectrogram patches, which reduces memory usage and enables efficient large-batch training. Compared with generative approaches, the AudioMosaic encoder learns more discriminative utterance-level representations that demonstrate strong transferability across datasets, domains, and acoustic conditions. Extensive experiments show that AudioMosaic achieves state-of-the-art performance on several standard audio benchmarks under both linear probing and fine-tuning. We further show that integrating the pretrained AudioMosaic encoder into audio-language models improves performance on audio-language tasks. The code is publicly available in our \href{https://github.com/HanxunH/AudioMosaic}{GitHub repository}.

2605.14227 2026-05-15 cs.LG cs.CL

DT-Transformer: A Foundation Model for Disease Trajectory Prediction on a Real-world Health System

Yunying Zhu, Andrew R Weckstein, Kueiyu Joshua Lin, Jie Yang

AI总结 该研究提出了一种名为DT-Transformer的基础模型,旨在基于真实世界医疗系统中的电子健康记录(EHR)预测疾病轨迹。模型在麻省总医院布里格姆健康系统(MGB)的170万名患者、5710万条结构化EHR数据上进行训练,能够准确预测多种疾病的未来发展情况。实验表明,该模型在多种疾病类别上的预测性能优异,为基于真实临床数据的疾病预测提供了新的方法和有力支持。

Comments Work in Progress

详情
英文摘要

Accurate disease trajectory prediction is critical for early intervention, resource allocation, and improving long-term outcomes. While electronic health records (EHRs) provide a rich longitudinal view of patient health in clinical environments, models trained on curated research cohorts may not reflect routine deployment settings, and those trained on single-hospital datasets capture only fragments of each patient's trajectory. This highlights the importance of leveraging large, multi-hospital health systems for training and validation to better reflect real-world clinical complexity. In this work, we develop DT-Transformer, a foundation model trained on 57.1M structured EHR entries over 1.7M patients from Mass General Brigham (MGB), spanning 11 hospitals and a broad network of outpatient clinics. DT-Transformer achieves strong discrimination in both held-out and prospective validation settings. Next-event prediction achieves a median age- and sex-stratified AUC of 0.871 across 896 disease categories, with all categories exceeding AUC 0.5. These results support health system-scale training as a path toward foundation models suited to real-world clinical forecasting.

2605.14221 2026-05-15 cs.CV

Automatic Landmark-Based Segmentation of Human Subcortical Structures in MRI

Ahmed Rekik, R. Jarrett Rushmore, Sylvain Bouix, Linda Marrakchi-Kacem

AI总结 本文研究了如何在磁共振成像(MRI)中精确分割人脑皮下结构的问题,提出了一个基于标志点引导的三维脑分割方法。该方法模仿哈佛-牛津图谱的手动分割流程,通过全局到局部网络自动检测16个关键标志点,并结合语义分割模型和标志点驱动的后处理步骤,将12个粗略解剖标签分割为26个独立结构,显著提升了分割边界的一致性和准确性。

Comments 7 pages, 5 figures. Accepted for presentation at the 48th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC 2026)

详情
英文摘要

Precise segmentation of brain structures in magnetic resonance imaging (MRI) is essential for reliable neuroimaging analysis, yet voxel-wise deep models often yield anatomically inconsistent results that diverge from expert-defined boundaries. In this research, we propose a landmark-guided 3D brain segmentation approach that explicitly mimics the manual segmentation protocol of the Harvard--Oxford Atlas. A Global-to-Local network automatically detects 16 landmarks representing key subcortical reference points. Then, a semantic segmentation model produces a coarse segmentation of 12 anatomical labels, each grouping multiple subcortical regions. Finally, a landmark-driven post-processing step separates these 12 labels into 26 distinct structures by enforcing local anatomical constraints. Experimental results demonstrate consistent improvements in boundary accuracy. Overall, integrating learned landmarks aligns segmentations more closely with manual protocols.