arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.08302 2026-05-12 cs.LG cs.AI

SGC-RML: A reliable and interpretable longitudinal assessment for PD in real-world DNS

Wenbin Wei, Ruixiang Gao, Suyuan Yao, Xuanzhen Zhao, Cheng Huang, Hen-Wei Huang

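AI summary This paper proposes SGC-RML, a reliable and interpretable method for longitudinal assessment of Parkinson's disease that addresses heterogeneous multimodal data, device bias, and incomplete labels in real-world settings. The method builds a shared 8-dimensional symptom node space that unifies the representation of motor and non-motor symptoms, and introduces uncertainty estimation, conformal calibration, and selective decision routing to support symptom prediction, assessment rejection, and retest recommendations. Experiments show that SGC-RML performs strongly on multiple real-world datasets, demonstrating its potential for accurate, calibrated, and interpretable longitudinal PD assessment under incomplete multimodal conditions.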

Comments Preprint. The first five authors contributed equally. Corresponding author: Hen-Wei Huang. 9 pages main text + appendix; 4 figures, 5 tables in main text

English abstract

Real-world digital Parkinson's disease assessment faces challenges such as heterogeneous modalities, cross-device bias, and incomplete labeling. Existing methods often focus on average predictive performance, lacking the reliability mechanisms needed for retrospective reliability-aware assessment, namely determining when the model is reliable, when to reject an assessment, when to retest, and on which symptom dimensions the predictions are based. This paper proposes SGC-RML, which maps speech, gait, wearable motion, mobility tasks, and clinical variables to a shared 8-dimensional symptom node space (7 clinical symptom nodes and 1 reliability_state auxiliary node), unifying motor and non-motor representations through a symptom atlas. By jointly introducing uncertainty estimation, conformal calibration, and selective decision routing, the model can not only predict symptoms and severity but also reject assessments or suggest retests when evidence is insufficient. We validate this framework on five real-world PD datasets, covering classification, regression, event detection, and longitudinal severity prediction. Experiments show that SGC-RML achieves an MAE of 4.579 / R^2 of 0.772 on PPMI, an AUC of 0.953 on mPower, and an AUC of 0.825 on PADS. Under leak-free temporal anchoring, as few as 5 subject-specific anchors transform UCI from an essentially non-predictive subject-independent setting (motor MAE 8.38, CCC 0.02) into a calibrated longitudinal assessment (motor MAE 3.24, CCC 0.756) with split-conformal coverage held at the 0.80 target. Under the Daphnet LOSO protocol, it achieves an F1 of 0.803 / AUC of 0.872. These results demonstrate that SGC-RML provides a unified paradigm for accurate, calibrated, auditable, and symptom-interpretable retrospective longitudinal assessment of PD under incomplete multimodal conditions.
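
The abstract mentions split-conformal coverage at a 0.80 target together with reject/retest routing. The following is a minimal sketch of the generic split-conformal recipe, not the SGC-RML implementation: the function names, the calibration residuals, and the `max_width` retest rule are all assumptions made for illustration.

```python
import math

def conformal_radius(calib_residuals, coverage=0.80):
    """Split-conformal radius: the ceil((n + 1) * coverage)-th smallest |residual|."""
    n = len(calib_residuals)
    rank = math.ceil((n + 1) * coverage)
    if rank > n:
        return math.inf  # too few calibration points for this coverage level
    return sorted(abs(r) for r in calib_residuals)[rank - 1]

def route(prediction, radius, max_width):
    """Hypothetical routing rule: suggest a retest when the interval is too wide."""
    if 2 * radius > max_width:
        return ("retest", None)
    return ("accept", (prediction - radius, prediction + radius))

# Made-up held-out residuals (prediction minus clinical score).
residuals = [0.5, -1.0, 2.0, -0.2, 1.5, 3.0, -0.8, 0.1, 1.2]
radius = conformal_radius(residuals, coverage=0.80)
decision = route(prediction=10.0, radius=radius, max_width=5.0)
```

With nine calibration residuals the radius is the 8th smallest absolute residual, so the interval is only as tight as the held-out errors allow; a stricter `max_width` turns more predictions into retest suggestions.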

2605.08301 2026-05-12 cs.LG cs.AI

Priming: Hybrid State Space Models From Pre-trained Transformers

Aditya Chattopadhyay, Elvis Nunez, Prannay Kaul, Benjamin Bowman, Evan Becker, Luca Zancato, David Thomas, Wei Xia, Stefano Soatto

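AI summary This work proposes Priming, a method that initializes hybrid state space models (Hybrid SSMs) from pre-trained Transformers, turning hybrid architecture design from a train-from-scratch problem into a knowledge-transfer problem and substantially reducing training cost. Priming recovers downstream task performance using less than 0.5% of the pre-training token budget and applies across Transformer families and scales. Experiments show that primed hybrid models perform well on long-context reasoning tasks, with up to 2.3x higher decode throughput than Transformers.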

English abstract

Hybrid State-Space models combine Attention with recurrent State-Space Model (SSM) layers, balancing eidetic memory from Attention with compressed fading memory from SSMs. This yields smaller Key-Value caches and faster decoding than Transformers, along with a richer architectural design space. Exploring that design space at scale has so far required training from scratch, a barrier that has kept most large-model Hybrid research within a narrow range of architectures. We introduce Priming, a method that turns Hybrid architecture design from a pre-training problem into a knowledge transfer one. Priming initializes a Hybrid model from a pre-trained Transformer and, through short alignment and post-training phases, recovers downstream quality using less than 0.5% of the source model's pre-training token budget. Priming is agnostic to the source Transformer family (e.g., Qwen, Llama, Mistral), model class (dense or Mixture-of-Experts), and model scale. Priming enables us to run the first controlled comparison of SSM layer types at scale under identical conditions. We evaluate Gated KalmaNet (GKA), Gated DeltaNet (GDN), and Mamba-2, and show that their expressiveness hierarchy, GKA>GDN>Mamba-2, directly predicts downstream performance on long-context reasoning tasks. We scale Priming to 8B/32B reasoning models with native 128K contexts. Our Hybrid GKA 32B improves over its source Qwen3-32B by +3.8 average reasoning points, while staying within 1% of a Transformer post-trained on the same data and enabling up to 2.3x higher decode throughput. To foster research on Hybrid architectures, we release a model zoo of primed Hybrid models for long-context reasoning and instruction following, together with the Priming training and inference code (Sequence Parallelism algorithms for long-context training, optimized GKA kernels, and vLLM serving plugin), all under the Apache 2.0 License.

2605.08300 2026-05-12 cs.LG cs.AI cs.CL

mHC-SSM: Manifold-Constrained Hyper-Connections for State Space Language Models with Stream-Specialized Adapters

Abdulvahap Mutlu, Şengül Doğan, Türker Tuncer

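AI summary This paper proposes mHC-SSM, a state space language model architecture that introduces manifold-constrained hyper-connections, restricting residual-stream mixing matrices to the manifold of doubly stochastic matrices to improve stability. The method expands the residual stream around the SSM block into multiple parallel streams, aggregates and redistributes information across streams through simplex-constrained pre-mixing and post-mixing, and adds stream-specialized adapters to increase expressive capacity. Experiments on WikiText-2 show improved validation loss and perplexity, with predictable efficiency trade-offs.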

Comments 28 Pages, 3 Figures, all implementation code available at: https://github.com/abdulvahapmutlu/mhc-slm

English abstract

Manifold-Constrained Hyper-Connections (mHC) introduce a stability-motivated variant of multi-stream residual mixing by constraining residual stream mixing matrices to the manifold of doubly stochastic matrices via Sinkhorn-Knopp projection. In this work, we study whether mHC-style constrained multi-stream residual topology transfers effectively to state space model (SSM) language modeling. We implement a static mHC mechanism around an SSM block by expanding the residual stream into multiple parallel streams, aggregating streams into a single SSM input through simplex-constrained pre-mixing, scattering the SSM output back to streams through simplex-constrained post-mixing, and applying Sinkhorn-projected residual stream mixing at each layer. We further introduce stream-specialized adapters that add lightweight stream-specific capacity through a shared bottleneck with per-stream scaling, applied both before stream aggregation and after the SSM output prior to scattering. We evaluate baseline single-stream SSM, static mHC SSM, and mHC SSM with adapters on WikiText-2 using identical training settings and report checkpoint-based validation loss, perplexity, throughput, and peak GPU memory. Under the reported fair checkpoint evaluation, static mHC improves validation loss from 6.3507 to 6.2448 and reduces perplexity from 572.91 to 515.35, while mHC with adapters further improves validation loss to 6.1353 and perplexity to 461.88. These gains are accompanied by modest throughput reductions from 1025.52 to 964.81 and 938.90 tokens per second, and increased peak memory from 2365 MB to 2568 MB and 3092 MB. The results suggest that mHC-inspired constrained multi-stream residual mixing can yield measurable quality improvements in SSM language models and that stream-specialized adapter capacity can further enhance performance with predictable efficiency tradeoffs.
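
To make the Sinkhorn-Knopp projection concrete, here is a minimal sketch of the classical iteration on a small positive matrix. It is not taken from the linked repository: the function name, the iteration count, and the example matrix are assumptions, and a real implementation would typically operate on batched tensors of exponentiated logits.

```python
def sinkhorn_knopp(matrix, iterations=50):
    """Alternately normalize rows and columns of a strictly positive square matrix.

    The iterates approach a doubly stochastic matrix (all row and column sums
    equal to 1), which is the constraint placed on the stream-mixing matrix.
    """
    m = [row[:] for row in matrix]  # work on a copy
    n = len(m)
    for _ in range(iterations):
        for i in range(n):  # row normalization
            row_sum = sum(m[i])
            m[i] = [v / row_sum for v in m[i]]
        for j in range(n):  # column normalization
            col_sum = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= col_sum
    return m

# Hypothetical unnormalized mixing weights for three residual streams.
mixing = sinkhorn_knopp([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 10.0]])
```

Because each row and column of the result sums to 1, mixing streams with it redistributes the residual signal across streams without rescaling its total, which is one way to read the stability motivation in the abstract.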

2605.08298 2026-05-12 cs.LG cs.AI

What Cohort INRs Encode and Where to Freeze Them

Vasiliki Sideri-Lampretsa, Sophie Starck, Robbie Holland, Julian McGinnis, Daniel Rueckert

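AI summary This study examines which encoder layers of cohort-trained implicit neural representations (INRs) are transferable and what information those layers encode. Experiments show that freezing the shared encoder at the layer with the highest weight stable rank gives the best performance, matching or exceeding the standard fine-tuning recipe. The study then decomposes INR activations with sparse autoencoders (SAEs) and finds that SIREN and FFMLP reach similar cohort-fitting quality but learn fundamentally different dictionary atoms: SIREN's atoms are localized, whereas FFMLP's atoms span the whole image and trace the contours of memorized signals. The results give a mechanistic account of what transfers in INRs and point toward architectures designed for generalization.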

Comments 9 content pages plus appendix

English abstract

Reusing the early layers of cohort-trained INRs as initialization for new signals has been shown to accelerate and improve signal fitting, yet it remains unclear which layers of the shared encoder learn transferable representations and what those representations encode. We address both questions for two standard backbones, SIREN and Fourier-feature MLPs (FFMLP). First, sweeping the freeze depth across the shared encoder at test time, we find that the optimum coincides with the layer of highest weight stable rank. Moreover, freezing at this depth matches or improves on the standard fine-tuning recipe across all our experiments. Second, identifying which layer transfers does not characterize what that layer encodes. To address this we adopt sparse autoencoders (SAEs), the dominant tool in mechanistic interpretability, and present the first SAE decomposition of INR activations into sparse dictionary atoms. Interestingly, SIREN and FFMLP achieve comparable cohort-fitting quality, but learn qualitatively different dictionaries. Cohort SIREN's atoms are localized, tiling the coordinate plane such that each atom fires in a confined region independent of cohort content. Cohort FFMLP's atoms are image-spanning, tracing the contours of memorized cohort signals. Single-atom ablations confirm causal use of these dictionaries: a single FFMLP atom out of 4096 can drop PSNR by up to 10.6 dB across the image, while SIREN ablations remain confined to where the atom fires. Together, these results give the first mechanistic account of what transfers in cohort-trained INRs and turn their activations into inspectable dictionary atoms. These tools open a path towards characterizing what INRs encode and towards architectures designed for generalization rather than memorization.
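
The freeze-depth criterion relies on the stable rank of a weight matrix, commonly defined as the squared Frobenius norm divided by the squared spectral norm. The sketch below computes it in plain Python with power iteration; the function name and iteration count are assumptions, and the abstract does not say how the authors compute it.

```python
import math

def stable_rank(w, iterations=200):
    """Stable rank ||W||_F^2 / ||W||_2^2 of a nonzero matrix given as nested lists."""
    rows, cols = len(w), len(w[0])
    fro_sq = sum(v * v for row in w for v in row)
    # Power iteration on W^T W to approximate the top right singular vector.
    # Assumes the all-ones start vector is not orthogonal to that vector.
    x = [1.0] * cols
    for _ in range(iterations):
        y = [sum(w[i][j] * x[j] for j in range(cols)) for i in range(rows)]  # W x
        z = [sum(w[i][j] * y[i] for i in range(rows)) for j in range(cols)]  # W^T (W x)
        norm = math.sqrt(sum(v * v for v in z))
        x = [v / norm for v in z]
    y = [sum(w[i][j] * x[j] for j in range(cols)) for i in range(rows)]
    spec_sq = sum(v * v for v in y)  # ||W x||^2 with ||x|| = 1
    return fro_sq / spec_sq

identity_rank = stable_rank([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
rank_one = stable_rank([[1.0, 2.0], [2.0, 4.0]])
```

The stable rank lies between 1 and the ordinary rank: a rank-one matrix scores 1, while a matrix whose singular values are all equal scores its full rank, so a high value indicates that energy is spread across many directions.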

2605.08297 2026-05-12 cs.LG cs.AI

A Qualitative Test-Risk Mechanism for Scaling Behavior in Normalized Residual Networks

Daning Cheng, Zeyu Liu, Jun Sun, Fen Xia, Boyang Zhang, Dongping Liu, Yunquan Zhang

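AI summary This paper studies the mechanism by which test performance improves as model scale grows in normalized residual networks, asking when increasing network depth yields a provable improvement in test risk. The authors propose a unified analytical framework that decomposes the question into representational gain, optimization gain, and generalization transfer. Under a first-order descent condition near zero initialization, they prove that the expanded model class contains an auxiliary model with strictly smaller population risk. Using norm control tailored to the normalized residual structure, they also establish a Rademacher complexity upper bound for the expanded model class, yielding two complementary test-risk guarantees and a theoretical basis for improving test performance through residual depth expansion.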

English abstract

The scaling behavior, in which test performance often improves as model size and data increase, is a central empirical phenomenon in modern deep learning, yet its theoretical basis remains incomplete. In this paper, we study depth expansion in normalized residual networks: starting from a trained model in an old hypothesis class, we insert a new residual block at an intermediate layer and ask when such an expansion can yield a provable improvement in test risk. We develop a unified framework that decomposes this question into representational gain, optimization gain, and generalization transfer. First, under a first-order descent condition near zero initialization, we prove that the expanded hypothesis class contains an auxiliary jumpboard model with strictly smaller population risk than the original model. Second, under norm control tailored to post-normalized residual architectures, we establish a norm-based Rademacher complexity bound for the expanded model class. These ingredients lead to two complementary test-risk guarantees: one route passes through population risk and is tighter when a positive population margin is available, while the other works directly at the train/test level, avoids Hoeffding transfer, and is more robust in degenerate regimes. Together, these results provide a theorem-driven mechanism under which residual depth expansion can improve test performance in normalized residual networks. More broadly, they suggest that scaling is inherently joint: depth creates new improving directions, width enhances the finite-sample observability of weak signals, and data determines whether the statistical cost of expansion can be controlled.

2605.08296 2026-05-12 cs.CV eess.SP

BenchHAR: Benchmarking Self-Supervised Learning for Generalizable Sensor-based Activity Recognition

Yize Cai, Rui Feng, Anlan Yu, Baoshen Guo, Zhiqing Hong

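AI summary This paper presents BenchHAR, a unified benchmark framework for evaluating how well self-supervised learning methods generalize in sensor-based human activity recognition (HAR). To address the heterogeneity of wearable sensor data and the scarcity of labeled data, BenchHAR builds a large-scale dataset and systematically evaluates eight representative self-supervised learning methods across twelve encoder-classifier architectures. The study finds that hybrid methods combining reconstruction and contrastive pretraining perform best overall, and it reveals how data scale, device type, and body position affect generalization, providing guidance for building more generalizable HAR systems.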

Comments 25 pages

English abstract

Human Activity Recognition (HAR) from wearable sensors supports broad healthcare and behavior science applications. However, data heterogeneity and the scarcity of labeled data limit its real-world generalization. Recent advances in self-supervised learning (SSL) in vision and language domains have shown strong capability for learning generalizable representations from unlabeled data. Yet, few studies have systematically compared the generalization performance of SSL methods or explored how to adapt them for generalizable HAR. To address these gaps, we present BenchHAR, a unified framework for evaluating the generalization capability of SSL methods for sensor-based HAR on unseen target distributions. BenchHAR curates a large-scale dataset (~258K samples) and evaluates eight representative SSL methods across 12 encoder-classifier architectures. Our results reveal that existing SSL methods struggle to achieve satisfactory generalization performance. We find that: (1) For HAR models, the hybrid paradigm (combining reconstruction and contrastive pretraining) achieves the best overall performance. The CNN encoder exhibits the strongest ability to learn generalizable representations, while more expressive classifier architectures further improve generalization. (2) For data scale, increasing the amount of pretraining data from downstream activity classes consistently improves generalization, while adding more labeled data yields limited gains. Interestingly, incorporating unlabeled data from non-downstream activity classes does not improve generalization. (3) Sensor data collected from custom-grade devices generalizes better than that from research-grade devices, and data from limb transfers more effectively to trunk positions. BenchHAR provides a unified benchmark and actionable insights for generalizable sensor-based HAR systems. Our code is available at https://github.com/saiketa/HAR-Bench.

2605.08295 2026-05-12 cs.LG cs.AI cs.CL

In-Context Fixation: When Demonstrated Labels Override Semantics in Few-Shot Classification

Ming Liu

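AI summary This paper studies over-reliance on demonstration labels in few-shot classification. It finds that when the demonstration labels are homogeneous, even if semantically valid, classification accuracy collapses to 12% or lower. Through experimental analysis, the author shows that the model generates answers mainly from the label vocabulary in the demonstrations rather than from semantic understanding, a phenomenon termed "in-context fixation". Using activation patching and the logit lens, the study also localizes the responsible circuit and confirms that the effect appears across models and tasks.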

Comments 12 pages (10 main + 2 appendix), 4 figures, 5 tables

English abstract

While random demonstration labels barely hurt in-context learning (Min et al., 2022), we show that homogeneous labels--even semantically valid ones--collapse accuracy to <=12% across six models (Pythia, Llama, Qwen; 0.8B--8B) and four tasks. The trigger is label-slot content: the model treats tokens occupying the label position as an exhaustive answer vocabulary, with homogeneity as the maximally collapsed case. A novel set-level fixation finding confirms this: when demonstrations carry varied nonsense tokens from {foo,bar,vex,nit,orb}, the model places 42--67% of probability on the demonstrated set while P(dog) remains below 0.2%. This is inconsistent with latent-concept Bayesian accounts (Xie et al., 2022) and reveals that ICL output is constrained vocabulary retrieval--the model binds its output to the demonstrated token inventory regardless of semantic plausibility. The effect generalizes to 4-way classification (0% accuracy across three models, 1B--8B) and multi-token verbalizers ("very positive"), where we decompose fixation into format-level (template adoption) and content-level (polarity override) components that are experimentally dissociable. Mechanistically, per-item paired activation patching on Pythia-1B recovers 98.4% of the gap (95% CI [84%, 112%]), localizing fixation to a layer-7-centered circuit (rank 2/560, 99.8th percentile; 4-fold CV mean 103%). Cross-architecture logit lens on Llama-3.2-1B replicates the encode-then-override trajectory with causal confirmation (top-5 layers: 89% recovery).

2605.08292 2026-05-12 cs.LG cs.AI math.OC

Hierarchical Mixture-of-Experts with Two-Stage Optimization

Gleb Molodtsov, Alexander Miasnikov, Aleksandr Beznosikov

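AI summary This paper proposes Hi-MoE, a hierarchical mixture-of-experts model that targets the fundamental router trade-off in sparse MoE models between load balancing and expert specialization. The method decomposes routing control into two levels, inter-group load balancing and intra-group specialization, which promotes complementary expert behaviors and prevents within-group collapse. Experiments show that Hi-MoE outperforms existing sparse-routing and grouped-MoE models on NLP and vision tasks, and achieves better performance and expert balance in large-scale pre-training.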

English abstract

Sparse Mixture-of-Experts (MoE) models scale capacity by routing each token to a small subset of experts. However, their routers exhibit a fundamental trade-off: strong load balancing can suppress expert specialization, while aggressive diversity often causes routing collapse. We propose Hi-MoE, a grouped MoE framework that decomposes routing control into two coupled levels: (i) inter-group balancing that enforces fair traffic across expert groups, and (ii) intra-group specialization that promotes complementary expert behaviors while preventing within-group collapse. Our analysis provides a principled explanation of how our hierarchical objectives reshape the router, thereby promoting stable specialization and mitigating collapse. We observe consistent improvements over recent sparse-routing and grouped-MoE baselines across NLP and vision benchmarks, and confirm robustness via scaling studies (model size, expert count) and targeted ablations. In large-scale pre-training on 58B tokens, Hi-MoE-7B achieves a 5.6% perplexity reduction and a 40% improvement in expert balance over OLMoE-7B across diverse evaluation domains.

2605.08291 2026-05-12 cs.LG cs.AI cs.AR

Graph Computation Meets Circuit Algebra: A Task-Aligned Analysis of Graph Neural Networks for Electronic Design Automation

Hyunmog Kim

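AI summary This paper examines the suitability of graph neural networks (GNNs) for electronic design automation (EDA) tasks, arguing that different EDA tasks have distinct algebraic structures and that successful GNN methods align with the algebra of their target task. By analyzing tasks including timing analysis, placement, and routing congestion, the paper reviews the GNN architectural toolkit relevant to circuits, clarifies how circuit graphs differ from generic graphs, and identifies where current methods are limited by algebra-architecture mismatch, along with key challenges for future research.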

English abstract

EDA problems are graph-structured, but not all graph-structured problems call for the same GNN computation. We argue that successful GNN-for-EDA methods are those whose propagation, aggregation, and supervision align with the native algebra of the target task. Concretely: static timing analysis is a max-plus/min-plus recurrence on a topologically ordered DAG, structurally aligned with asynchronous DAG-GNNs; placement is governed by hypergraph wirelength and density penalties and is exploited by differentiable placers rather than by message-passing GNNs alone; routing congestion is a sparse demand-supply field over a layout grid; switching-activity propagation is a probabilistic recurrence on a directed netlist; IR drop is a linear system on the power-delivery network; and analog symmetry extraction is a discrete constraint-prediction problem on schematic graphs. Through these task-by-task alignments we (i) review the GNN architectural toolkit relevant to circuits, (ii) formalize how circuit graphs differ from generic graphs (directed, heterogeneous, multi-scale, with sequential and clock structure), (iii) characterize where current methods succeed and where the algebra-architecture mismatch limits them, and (iv) identify failure modes--stage leakage, proxy-to-signoff gap, calibration, and design-distribution shift--that we believe are likely to dominate the next phase of work. We position the paper as a GNN-for-EDA, task-aligned analysis rather than a comprehensive AI-for-chip-design survey. Continuous SE(3)-equivariant geometric GNNs are usually mismatched to Manhattan digital layout, and LLM-for-RTL, HLS, and RL/diffusion-based topology generation are outside our scope.
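
The claim that static timing analysis is a max-plus recurrence on a topologically ordered DAG can be illustrated with a short sketch. The netlist, delays, and function name below are invented for illustration and ignore everything a signoff tool handles (slew, rise/fall, clocking); only the recurrence itself is the point.

```python
from graphlib import TopologicalSorter

def arrival_times(fanin, delay, primary_inputs):
    """Max-plus recurrence: arrival[v] = max over predecessors u of arrival[u] + delay[(u, v)].

    fanin maps each node to its predecessor nodes; nodes with no predecessors
    are primary inputs whose arrival time comes from `primary_inputs`.
    """
    arrival = {}
    for node in TopologicalSorter(fanin).static_order():  # predecessors come first
        preds = fanin.get(node, [])
        if not preds:
            arrival[node] = primary_inputs.get(node, 0.0)
        else:
            arrival[node] = max(arrival[p] + delay[(p, node)] for p in preds)
    return arrival

# Hypothetical three-gate netlist with two primary inputs.
fanin = {"g1": ["a", "b"], "g2": ["g1", "b"], "out": ["g2"]}
delay = {("a", "g1"): 1.0, ("b", "g1"): 2.0, ("g1", "g2"): 1.5,
         ("b", "g2"): 0.5, ("g2", "out"): 0.25}
arrival = arrival_times(fanin, delay, {"a": 0.0, "b": 0.5})
```

Each node's value depends only on already-finalized predecessors, which is why the abstract pairs this task with DAG-GNNs that propagate in topological order rather than with synchronous message passing.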

2605.08290 2026-05-12 cs.LG cs.AI

Toward Optimal Regret in Robust Pricing: Decoupling Corruption and Time

Kalana Kalupahana, Francesco Emanuele Stradi, Matteo Castiglioni, Alberto Marchesi

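AI summary This paper studies how to achieve optimal regret bounds in dynamic pricing under adversarial corruption, proposing a new algorithm that decouples the corruption level $C$ from the time horizon $T$. Based on a robust variant of binary search, the algorithm achieves regret $\mathcal{O}(C + \log T)$ when the corruption is known and $\mathcal{O}(C + \log^2 T)$ when it is unknown, improving on prior results and giving stronger theoretical guarantees for robust dynamic pricing.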

English abstract

We design the first regret guarantees for robust dynamic pricing that decouple the dependence on the corruption $C$ and the time horizon $T$. In dynamic pricing, a seller with unlimited supply of a good interacts with a stream of buyers over $T$ rounds, with the goal of maximizing revenue. At each round $t$, the seller posts a price $p_t$, and the buyer purchases the good only if their unknown valuation $v^\star$ is at least this price. The seller observes only the binary feedback $\mathbb{I} \left\{ p_t \leq v^\star \right\}$, indicating whether a sale occurred. In the \emph{robust} pricing setting, a malicious adversary is allowed to corrupt this feedback in at most $C$ rounds. Even if the learner knows the corruption $C$, the best known regret bound is $\mathcal{O}(C\log\log T)$ by Gupta et al. [2025]. This leaves open the problem of decoupling the dependence on $C$ and $T$. In this work, we resolve this open problem. In particular, we develop a robust variant of binary search that achieves regret $\mathcal{O}(C+\log T)$ when the corruption $C$ is known and $\mathcal{O}(C+\log^2 T)$ when the corruption is unknown.
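
As a point of reference for the feedback model, the sketch below runs plain bisection in the uncorrupted case ($C = 0$). It is not the paper's robust algorithm: the function name, the price range $[0, 1]$, and the stop-at-width-$1/T$ rule are assumptions chosen to show where a $\log T$ term can come from.

```python
import math

def price_with_bisection(valuation, horizon):
    """Bisect on the price until the interval is narrower than 1/horizon, then exploit.

    Feedback is the binary signal I{price <= valuation}; no corruption is modeled.
    Returns the final interval and the cumulative regret against always posting
    the true valuation.
    """
    low, high, regret = 0.0, 1.0, 0.0
    for _ in range(horizon):
        exploring = (high - low) > 1.0 / horizon
        price = (low + high) / 2 if exploring else low  # `low` always sells
        sold = price <= valuation
        regret += valuation - (price if sold else 0.0)
        if exploring:
            if sold:
                low = price
            else:
                high = price
    return low, high, regret

low, high, regret = price_with_bisection(valuation=0.37, horizon=1000)
```

The exploration phase lasts about $\log_2 T$ rounds, and each later round loses at most $1/T$, so the total regret in this clean setting is on the order of $\log T$. A single corrupted answer would send this naive bisection into the wrong half permanently, which is the failure a robust variant has to guard against.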

2605.08289 2026-05-12 cs.LG cs.AI

What If We Let Forecasting Forget? A Sparse Bottleneck for Cross-Variable Dependencies

Fan Zhang, Shiming Fan, Hua Wang

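AI summary Multivariate time series forecasting is critical in many real-world systems, and modeling cross-variable dependencies is key. This paper proposes MS-FLOW, a sparse-bottleneck framework that explicitly models inter-variable interaction as capacity-limited information flow, reducing redundant connections and the propagation of spurious correlations. Experiments show that the method achieves state-of-the-art forecasting accuracy on 12 real-world datasets while learning fewer but more reliable cross-variable dependencies, shifting from "more interaction" to "more effective interaction".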

English abstract

Multivariate time series forecasting is critical in many real-world systems, and thus modeling cross-channel dependencies is essential. Although existing methods improve overall accuracy by enhancing representations and cross-channel interactions, it remains challenging to reliably capture inter-variable dependencies under specific conditions. We observe that dependencies in real data are often state-dependent and noisy; in such cases, dense interactions can amplify spurious correlations and lead to representation over-smoothing, which may yield unreliable predictions in certain scenarios. Motivated by this, we propose MS-FLOW, a sparse-bottleneck framework that explicitly models inter-variable interaction as capacity-limited information flow. Specifically, MS-FLOW replaces fully connected communication with selective sparse routing, retaining only a few critical dependency paths and injecting cross-variable signals under a strict communication budget, thereby suppressing redundant connections and spurious-correlation propagation. Extensive experiments demonstrate that MS-FLOW learns more reliable multivariate correlations, achieving state-of-the-art forecasting accuracy on 12 real-world benchmarks while producing fewer yet more reliable dependencies, shifting multivariate forecasting from "more interaction" to "more effective interaction".
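
One simple way to picture a strict communication budget is top-k selection over a dependency score matrix. The sketch below is only an illustration of that idea: the abstract does not specify how MS-FLOW scores or selects paths, so the function name, the score matrix, and the per-variable budget are all assumptions.

```python
def sparse_routes(scores, budget):
    """For each target variable, keep only the `budget` highest-scoring source variables.

    scores[target][source] is a hypothetical dependency strength; self-links are skipped.
    """
    routes = {}
    for target, row in enumerate(scores):
        candidates = [(s, src) for src, s in enumerate(row) if src != target]
        candidates.sort(key=lambda item: item[0], reverse=True)
        routes[target] = [src for _, src in candidates[:budget]]
    return routes

# Made-up scores for four variables.
scores = [
    [0.0, 0.9, 0.1, 0.3],
    [0.2, 0.0, 0.8, 0.1],
    [0.5, 0.4, 0.0, 0.7],
    [0.1, 0.6, 0.2, 0.0],
]
routes = sparse_routes(scores, budget=2)
```

With four variables and a budget of two, each variable receives signals from two others instead of three, so the weakest links (for example 0.1 into variable 0) never propagate.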

2605.08288 2026-05-12 cs.LG cs.AI cs.CR cs.DC

UMEDA: Unified Multi-modal Efficient Data Fusion for Privacy-Preserving Graph Federated Learning via Spectral-Gated Attention and Diffusion-Based Operator Alignment

Shih-Yu Lai, Hirozumi Yamaguchi, Shang-Tse Chen, Yu-Lun Liu, Bing-Yu Chen

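AI summary UMEDA is a privacy-preserving graph federated learning framework for localization that fuses wireless and visual signals from heterogeneous sensor devices. It combines a spectral-gated attention mechanism with diffusion-based operator alignment to fuse cross-modal data efficiently while preserving privacy. UMEDA maintains model performance while handling device heterogeneity, data distribution shift, and privacy noise, and shows superior accuracy and communication efficiency on multiple benchmarks.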

English abstract

Device-free localization trains models from heterogeneous wireless and visual sensors (e.g., Wi-Fi, LiDAR) distributed across edge devices. Federated learning offers a privacy-respecting framework, but is brittle when clients differ in sensor modality and resolution, when their data distributions drift, and when privacy noise destroys the structural signal needed for localization. We propose UMEDA, a graph federated learning framework in which clients form nodes of a global graph that share a continuous integral operator, and aggregation is reformulated as spectral signal processing on this operator. Each client encodes its local sensors with a linear-attention layer whose kernel spectrum is low-rank filtered, suppressing modality-specific residuals so clients with different sensors align in a common low-rank subspace. The server then aggregates client updates via a diffusion model over the kernel's spectral coefficients, treating updates as discretizations of a shared operator rather than topology-bound weights -- this absorbs varying graph sizes and missing modalities without node-wise correspondence. To balance privacy and utility, we add an anisotropic differential-privacy mechanism that projects noise preferentially into the null space of the signal subspace, preserving dominant eigendirections while ensuring formal $(ε, δ)$-DP under gradient clipping. On MM-Fi and the RELI11D out-of-distribution benchmark, UMEDA outperforms state-of-the-art federated baselines in accuracy, convergence, and communication efficiency, particularly under high modality heterogeneity and tight privacy budgets.

2605.08287 2026-05-12 cs.LG cs.AI

Multi-Armed Bandits With Best-Action Queries

Francesco Bacchiocchi, Matteo Castiglioni, Alberto Marchesi, Francesco Emanuele Stradi

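AI summary This paper studies multi-armed bandits augmented with an oracle that the learner can query to reveal the best action in the current round. Under the more realistic bandit-feedback model, the authors fully resolve the problem: with stochastic i.i.d. rewards, best-action queries reduce regret to $\widetilde{\mathcal{O}}(\min\{T/k,\sqrt{T-k}\})$, with a matching lower bound, giving a complete characterization of the benefit of best-action queries in this model.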

English abstract

We study \emph{multi-armed bandits} (MABs) augmented with \emph{best-action queries}, in which the learner may additionally query an oracle that reveals the best arm in the current round. This setting was recently characterized by Russo et al. [2024] in the \emph{full-feedback} model, where the learner observes the rewards of all arms after each round. They show that, in both \emph{stochastic} and \emph{adversarial} environments, $k$ best-action queries reduce the optimal $\widetilde{\mathcal{O}}(\sqrt{T})$ regret to $\widetilde{\mathcal{O}}(\min\{T/k,\sqrt{T}\})$. Whether this improvement extends to the more realistic \emph{bandit-feedback} model -- where the learner observes only the reward of the played arm -- was left as an open problem. We fully resolve this question. When rewards are stochastic but correlated among arms, we show that the full-feedback result does not carry over: any algorithm must incur regret at least $Ω(\sqrt{T-k})$. This lower bound directly extends to adversarial environments. On the positive side, we show that $\widetilde{\mathcal{O}}(\min\{T/k,\sqrt{T-k}\})$ regret is still achievable when rewards are stochastic and i.i.d., and establish a matching lower bound, up to logarithmic factors. Together, these results provide a complete characterization of the benefits of \emph{best-action queries} in the \emph{bandit-feedback} model.
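
To get a feel for the rate $\min\{T/k,\sqrt{T-k}\}$, the sketch below evaluates it for a few query budgets. It drops all constants and logarithmic factors, so the numbers indicate scaling only; the function name is an assumption.

```python
import math

def regret_rate(horizon, queries):
    """Evaluate min{T/k, sqrt(T - k)}, with sqrt(T) as the no-query baseline."""
    if queries == 0:
        return math.sqrt(horizon)
    return min(horizon / queries, math.sqrt(horizon - queries))

T = 10_000
rates = {k: regret_rate(T, k) for k in (0, 10, 100, 5_000)}
```

For $T = 10{,}000$ the two terms cross near $k \approx \sqrt{T} = 100$: with fewer queries the $\sqrt{T-k}$ term dominates and queries barely help, while beyond that the $T/k$ term takes over and the rate falls quickly.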

2605.08286 2026-05-12 cs.LG cs.AI

Diagnosing Spectral Ceilings in Equivariant Neural Force Fields

Hyunmog Kim

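AI summary This paper proposes a spectral-injection diagnostic for assessing how well equivariant neural force field models preserve information at different angular frequencies. By injecting controlled angular-frequency perturbations into a molecular force field and analyzing them with a lightweight Spectral Prediction Network (SPN), the study finds that equivariant models exhibit a sharp performance drop at a specific frequency boundary. Experiments show that this is not explained by parameter count alone but is closely tied to the model's spectral expressive capacity, revealing a spectral ceiling in equivariant neural networks for modeling complex molecular systems.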

English abstract

We introduce a spectral-injection diagnostic for measuring which angular frequencies a trained equivariant force-field backbone preserves: inject a controlled angular-frequency perturbation into a molecular force field, attach a lightweight Spectral Prediction Network (SPN) to the frozen backbone, and read off which frequencies are recoverable. On aspirin, a quadratic SPN attached to an L = 2 NequIP backbone recovers the boundary signal at l = 4 but collapses at l = 5: an 11.7x cliff at the predicted drL boundary, with p dropping from 0.913 to 0.078. The same boundary-vs-above contrast persists across n = 4 independently trained backbones (raw-gain delta contrast, hierarchical cluster bootstrap) and is corroborated by a denominator-free injected-residual metric (R2_inj(4) = 0.374 versus R2_inj(5) = 0.006). A finite-degree span theorem calibrates the diagnostic: for a single marked direction, degree-d polynomials of degree-L spherical-harmonic features span exactly $H_{\le dL}$ with multiplicity-one saturation at the boundary (scoped to single-direction degree-bounded probes, not a function-class upper bound on multi-atom MPNNs). A synthetic C5 calibration plus capacity, activation, and cross-architecture controls rule out parameter count alone as the explanation.

2605.08285 2026-05-12 cs.LG cs.CE

Exactness Matters for Physical Rule Enforcement

Bum Jun Kim

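AI summary This paper studies how exactness affects constraint enforcement in scientific forecasting models subject to physical rules, asking when stronger enforcement improves forecast accuracy and when it causes distribution shift. Through an operator-exactness analysis, it compares constraint methods on fluid dynamics and other tasks and finds that exact projection substantially improves accuracy in periodic systems, whereas in inexact settings over-enforcement can backfire. The study also examines a set of strategies for more robust forecasting under approximate constraints.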

Comments 28 pages, 6 figures

English abstract

Autoregressive scientific forecasters often enforce physical or structural constraints by repairing each predicted state before feeding it back into the model. However, it remains unclear when stronger physical rule enforcement becomes reliable and when it becomes a source of distribution shift. We study this question through operator exactness, meaning whether the repair map is the identity on the target manifold and is aligned with the target geometry. We compare raw forecasting, post hoc repair, and in-loop repair across periodic incompressible Navier--Stokes, non-periodic CFDBench flows, and a hierarchical-forecasting support task. In the exact periodic regime, Fourier projection substantially improves rollout accuracy. On the NS-128 benchmark, a strong Raw-FNO has a final-step rollout MSE at horizon 100 of $(9.390 \pm 6.290)\times 10^{-5}$, and post hoc and in-loop projection reduce it to $(1.130 \pm 0.165)\times 10^{-6}$ and $(5.370 \pm 0.113)\times 10^{-7}$. However, once an exact projection is unavailable and only approximate boundary-preserving cleanup is available, the ordering changes. Across cavity, tube, dam, and cylinder flow, stronger Poisson-based cleanup can reduce divergence while worsening rollout error; target-distortion MSE predicts this harm far better than a linear-system residual. Controlled mismatch, screened cleanup, adaptive gating, and external-backbone checks show that the best approximate-regime operating point can be raw or near-identity. Hierarchical forecasting gives the same broader pattern. Exact forecast reconciliation is a stable baseline, whereas blended top-down repair, a validation-tuned interpolation toward historical-proportion top-down reconciliation, is dataset-dependent. Thus, constraint enforcement should be benchmarked by operator--data alignment before enforcement strength.
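
Operator exactness, in the sense of a repair map that is the identity on the target manifold, can be seen in the smallest hierarchical-forecasting case. The sketch below uses the orthogonal (least-squares) projection onto the constraint that a total equals the sum of its parts; this is a standard reconciliation choice picked for illustration, and the function name and numbers are assumptions rather than the paper's setup.

```python
def reconcile(total, parts):
    """Orthogonally project (total, parts) onto the set where total == sum(parts).

    The constraint normal is (1, -1, ..., -1), so the gap is split evenly
    across the total and every part.
    """
    gap = total - sum(parts)
    share = gap / (len(parts) + 1)
    return total - share, [p + share for p in parts]

coherent = reconcile(10.0, [4.0, 6.0])    # already consistent forecast
repaired = reconcile(12.0, [4.0, 5.0])    # total exceeds the parts by 3
repaired_twice = reconcile(*repaired)     # applying the repair again
```

A coherent forecast passes through unchanged and repairing twice equals repairing once, which are the two properties that make the map exact. An approximate cleanup that lacks either property can move valid states, which is the source of distribution shift the abstract describes.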

2605.08283 2026-05-12 cs.LG cs.AI cs.CL

HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control

Xincheng Yao, Ruoqi Li, Cheng Chen, Daoxin Zhang, Yi Wu, Yao Hu, Chongyang Zhang

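AI summary This work addresses the exploration-exploitation balance in reinforcement learning for large language models, proposing HTPO, a policy optimization method based on hierarchical token-level objective control. HTPO groups response tokens along three dimensions (difficulty, answer correctness, and entropy) and designs a targeted optimization objective for each group, giving fine-grained guidance to the different roles tokens play during reasoning. Experiments show that HTPO significantly outperforms existing methods on several challenging reasoning benchmarks, validating its effectiveness for improving reasoning ability.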

Comments 29 pages

English abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a pivotal technique for enhancing the reasoning capabilities of Large Language Models (LLMs). However, the de facto practice of mainstream RL algorithms is to treat all tokens of one response equally and assign the same optimization objective to each token, failing to provide granular guidance for the reasoning process. In Chain-of-Thought (CoT) reasoning, however, different tokens usually play distinct roles. Therefore, the current RL algorithms lack an effective mechanism to dynamically balance the exploration-exploitation trade-off during learning. To this end, we propose Hierarchical Token-level Objective Control Policy Optimization (HTPO), a novel RL algorithm that takes the divide-and-conquer idea to hierarchically partition the response tokens into specific functional groups from three aspects (i.e., prompt difficulty, answer correctness, and token entropy). Within each group, according to the contributions to exploration or exploitation, we design specialized optimization objectives to facilitate the effective execution of each token's expected functionality. In this way, HTPO can achieve a more balanced exploration-exploitation trade-off. Extensive experiments on challenging reasoning benchmarks validate the superiority of our HTPO algorithm, which significantly outperforms the strong DAPO baseline (e.g., +8.6% and +6.7% on AIME'24 and AIME'25, respectively). When scaling test-time compute, the HTPO-trained model maintains a consistent performance advantage over the DAPO baseline, and the gap widens as the sampling budget increases, validating that our adaptive token-level control method fosters effective exploration without sacrificing exploitation performance. Code will be available at https://github.com/xcyao00/HTPO.
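
The three-way partition (prompt difficulty, answer correctness, token entropy) can be sketched as a grouping step that runs before any objective is applied. Everything below is hypothetical: the abstract does not give thresholds, group names, or the per-group objectives, so the entropy cutoff and the tuple keys are placeholders.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def group_tokens(token_probs, prompt_is_hard, answer_is_correct, entropy_threshold=0.5):
    """Bucket token positions by (prompt difficulty, correctness, entropy level).

    The first two attributes are shared by every token in a response; only the
    entropy level varies per token. The threshold is an assumed placeholder.
    """
    groups = {}
    for idx, probs in enumerate(token_probs):
        level = "high_entropy" if token_entropy(probs) > entropy_threshold else "low_entropy"
        key = (prompt_is_hard, answer_is_correct, level)
        groups.setdefault(key, []).append(idx)
    return groups

# Made-up distributions for a three-token response to a hard prompt, answered correctly.
groups = group_tokens([[1.0, 0.0], [0.5, 0.5], [0.9, 0.1]],
                      prompt_is_hard=True, answer_is_correct=True)
```

A training loop could then look up a different loss weighting per key, for example treating high-entropy tokens as exploration points; how HTPO actually does this is not stated in the abstract.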

2605.08281 2026-05-12 cs.CV

Is Class Signal Clustered or Routed in Task-Induced Implicit Neural Representation Weight Spaces?

Xinyi Guo, Mingyi He, Haobin Ding, Weiming Chen, Xinrui Chen, Jiawen Li, Di Zhang, Minxi Ouyang, Yizhi Wang, Xitong Ling

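AI summary This paper asks whether the class signal in task-induced implicit neural representation (INR) weight spaces is clustered or routed. In experiments under the SIREN-based Meta Weight Transformer (MWT) framework, it finds that classification does not rely on geometric clustering in weight space; instead, the class signal is routed through the reader. The study further identifies the bias column of the SIREN weights as a key causal route for classification performance, and proposes interventions such as strengthening reader routing or adding an explicit bias route to improve performance.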

Details
English abstract

Implicit neural representations (INRs) encode images as neural-network weights, making image classification a problem of weight-space classifiability. A natural geometric hypothesis is that classifier feedback should make image-specific weights cluster by class in the shared-anchor coordinate. We test this hypothesis in the SIREN-based Meta Weight Transformer (MWT) regime, where end-to-end training meta-learns a shared initialization and inner-loop update schedule for fitting image-specific SIRENs. We find that this prediction fails. Exposed weight-space geometry and supervised clustering pressure do not reliably track trained-reader accuracy; clustering can even make local neighborhoods more class-consistent while making the trained reader worse. Crucially, the reader constructs rather than inherits class-aligned geometry: token-flow diagnostics show that class-aligned neighborhoods become strongly predictive of trained-reader accuracy only after late reader interactions, not in the input coordinate. We further identify the native SIREN bias column in the augmented weight token as a low-dimensional, sample-dependent causal readout route for the trained reader; targeted controls rule out generic scalar-column and marginal-distribution artifacts. The diagnosis motivates interventions that strengthen reader routing, add an explicit bias route, or use denser inner-loop fitting; under the lane-specific training conventions used here, route-directed variants often outperform the shared-anchor baseline but interact non-additively. Task-induced INR weights are classifiable not because they form raw geometric clusters, but because their class signal is routed through the reader.

2605.08280 2026-05-12 cs.LG cs.AI

Beyond the False Trade-off: Adaptive EWC for Stealthy and Generalizable T2I Backdoors

Lu Bowen, Xinyu Tang, Yin Yin Low, Shu-Min Leong

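AI summary This paper studies how to balance attack success rate against model fidelity in text-to-image (T2I) backdoor attacks. Existing methods such as LwF rely on output distillation, which regularizes only weakly, so the authors introduce parameter-based Elastic Weight Consolidation (EWC) to improve fidelity. Because standard EWC with a fixed regularization weight degrades performance, they propose a dynamic scheme built on a cosine-based semantic utility and adaptive scheduling, which improves the balance between attack success rate and fidelity and is more robust on out-of-domain datasets.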

Details
English abstract

Preserving model fidelity is essential for stealthy text-to-image (T2I) backdoor attacks. Existing methods such as Learning without Forgetting (LwF) rely on output-based distillation, which provides limited regularization. We introduce Elastic Weight Consolidation (EWC) as a parameter-based alternative for preserving fidelity in backdoor learning. While stronger in principle, we show that standard static EWC with a fixed regularization weight lambda and mean-squared utility loss creates an artificial trade-off between attack success rate (ASR) and fidelity, particularly degrading performance on weak triggers. To address this, we propose Cosine-Aware Adaptive EWC, which dynamically adjusts EWC regularization using a cosine-based semantic utility and adaptive scheduling. This approach transforms EWC from a fixed penalty into a context-sensitive constraint, maintaining high ASR while preserving model fidelity. Experiments demonstrate improved ASR-fidelity balance and enhanced robustness on out-of-domain (OOD) datasets compared to existing baselines.
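
For reference, the static EWC baseline mentioned in this abstract adds a quadratic penalty, weighted by a diagonal Fisher information estimate, that pulls parameters back toward their pre-attack values. The sketch below shows only that standard fixed-lambda form; the function names are hypothetical, and the paper's cosine-aware adaptive schedule is not specified in the abstract and is not reproduced here.

```python
def ewc_penalty(params, anchor_params, fisher_diag, lam):
    """Static EWC penalty: (lam / 2) * sum_i F_i * (theta_i - theta_star_i) ** 2.

    params        -- current parameter values (flat list of floats)
    anchor_params -- parameter values of the clean model, theta_star
    fisher_diag   -- diagonal Fisher information estimate, one value per parameter
    lam           -- fixed regularization weight (lambda)
    """
    if not (len(params) == len(anchor_params) == len(fisher_diag)):
        raise ValueError("all inputs must have the same length")
    return 0.5 * lam * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher_diag)
    )

def total_loss(task_loss, params, anchor_params, fisher_diag, lam):
    """Task loss (here: the backdoor objective) plus the static EWC penalty."""
    return task_loss + ewc_penalty(params, anchor_params, fisher_diag, lam)
```

With a fixed `lam`, parameters that have high Fisher values are held equally tightly at every training step, which is the rigidity the abstract argues creates an artificial trade-off.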

2605.08279 2026-05-12 cs.LG cs.AI

LaWM: Least Action World Models for Long-Horizon Physical Consistency from Visual Observations

Qixin Xiao, Maani Ghaffari

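AI summary This paper proposes LaWM, a latent world-model framework that learns predictive models with long-horizon physical consistency from visual observations. The method implements the principle of least action in latent space, using a learned Lagrangian action functional to govern the generation of future states rather than relying on an unconstrained neural transition function. The core technical contribution is a latent variational integrator: it learns generalized coordinates and a discrete Lagrangian, constructs a discrete action functional, and solves the corresponding integration condition, which preserves physical structure over long rollouts. Experiments show that LaWM improves physical invariance, background consistency, and motion smoothness across several physics and robotics tasks.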

Details
English abstract

Learning predictive world models from visual observations is a core problem in embodied AI, with applications to model-based reinforcement learning and robotic planning. Existing latent world models typically generate future states with unconstrained neural transition functions, while modern video generation systems often prioritize perceptual plausibility or introduce physical structure through auxiliary losses, external guidance, or separate dynamics modules. As a result, long-horizon rollouts can remain weakly grounded in the physical principles that govern real dynamics, leading to compounding error, energy drift, and physically inconsistent futures. We propose Least Action World Models (LaWM), a latent world-modeling framework that operationalizes the Principle of Least Action in learned visual latent space: future rollouts are governed by a learned Lagrangian action functional rather than produced only by an unconstrained transition predictor. Our main technical realization is a latent variational integrator: LaWM encodes observations into learned generalized coordinates, learns a latent discrete Lagrangian over consecutive latent states, constructs a discrete action functional, and advances prediction by solving the corresponding discrete integration condition. Thus, physical structure is not merely used to score, regularize, or constrain a completed trajectory; it defines the latent transition rule itself. Because the transition is induced by a discrete variational principle, LaWM provides a structure-preserving bias for long-horizon visual prediction. Across physics-clean synthetic dynamics and embodied robot interaction benchmarks, LaWM improves physical invariance, background consistency, motion smoothness, and appearance and geometric prediction metrics over video-generation and world-model baselines.
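
For readers unfamiliar with variational integrators, the textbook construction behind a "discrete action functional" and its stationarity condition is given below. Here $q_k$ denotes the generalized coordinates at step $k$, $L_d$ a discrete Lagrangian, and $D_1$, $D_2$ the partial derivatives with respect to its first and second argument. That LaWM's "discrete integration condition" takes exactly this form is an assumption; the abstract does not state the equation.

```latex
S_d(q_0, \dots, q_N) = \sum_{k=0}^{N-1} L_d(q_k, q_{k+1}),
\qquad
D_2 L_d(q_{k-1}, q_k) + D_1 L_d(q_k, q_{k+1}) = 0, \quad k = 1, \dots, N-1 .
```

Given $q_{k-1}$ and $q_k$, the second equation (the discrete Euler-Lagrange equation) is solved for $q_{k+1}$, so the transition rule itself comes from the variational principle rather than from a free-form predictor.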

2605.08276 2026-05-12 cs.CV

Beyond ViT Tokens: Masked-Diffusion Pretrained Convolutional Pathology Foundation Model for Cell-Level Dense Prediction

Weiming Chen, Xitong Ling, Zhenyang Cai, Xidong Wang, Jiawen Li, Tian Guan, Benyou Wang, Yonghong He

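AI summary Cell-level dense prediction is central to computational pathology but remains difficult because of fine-grained tissue structures, large domain shifts, and the high cost of dense annotation. Existing ViT-based pathology foundation models use patch tokenization, which disrupts spatial continuity and weakens local morphological detail. To address this, the paper proposes CMD, a masked-diffusion convolutional foundation model that uses a fully convolutional ConvNeXt-UNet backbone, performs masked-diffusion pretraining in pixel space, and injects frozen pathology foundation model features through adaptive normalization. Experiments show that CMD outperforms existing ViT models on several pathology dense prediction tasks and, while fine-tuning only a small number of parameters, even surpasses state-of-the-art end-to-end segmentation methods, with stronger robustness and generalization when annotations are limited.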

Details
English abstract

Cell-level dense prediction is central to computational pathology, but remains challenging due to fine-grained histological structures, strong domain shifts, and costly dense annotations. Existing ViT-based pathology foundation models rely on patch tokenization, which can disrupt spatial continuity and weaken local morphological details needed for cell-level prediction. To address this, we propose Masked-Diffusion Convolutional Foundation Models, termed ConvNeXt Masked-Diffusion (CMD), a self-supervised convolutional generative pretraining framework for dense pathology representation learning. CMD uses a fully convolutional ConvNeXt-UNet backbone, performs masked-diffusion pretraining in pixel space, and incorporates frozen pathology foundation model features through adaptive normalization. Experimental results demonstrate that CMD consistently outperforms existing ViT-based pathology foundation models and even surpasses state-of-the-art end-to-end segmentation methods while fine-tuning only a small number of task-specific parameters across multiple pathology dense prediction tasks. The advantage is particularly pronounced under limited annotation settings, where CMD exhibits stronger robustness and generalization ability. Our findings suggest that purely convolutional architectures can also serve as competitive pathology foundation models for cell-level dense prediction, achieving leading performance within the current ViT-dominated paradigm and providing a scalable, high-performance solution that better preserves histological structural priors for fine-grained pathology understanding.

2605.08273 2026-05-12 cs.LG cs.AI

Efficient Prompt Learning for Traffic Forecasting

Qianru Zhang, Xinyi Gao, Alexander Zhou, Reynold Cheng, Siu-Ming Yiu, Hongzhi Yin

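AI summary This paper studies how to improve the generalization of spatio-temporal graph neural networks for traffic forecasting under the distribution shifts caused by spatio-temporal dynamics. The authors propose SimpleST, an efficient, model-agnostic prompt learning framework whose lightweight prompt mechanism adapts a pre-trained model to new distributions without changing the model parameters. Experiments on five real-world urban spatio-temporal datasets show better prediction accuracy and computational efficiency.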

Comments 24 pages. This paper is accepted by VLDBJ

Details
Journal ref
The VLDB Journal of 2026
English abstract

Accurate traffic prediction is essential for optimizing transportation systems, enhancing resource allocation, and improving overall urban administration. Spatio-temporal graph neural networks (GNNs) have achieved state-of-the-art performance and have been widely used in various spatio-temporal prediction scenarios. However, these prediction methods often exhibit low generalization ability, struggling with distribution shifts caused by spatio-temporal dynamics. To address this challenge, we propose an approach to enhance the generalization and adaptation of spatio-temporal GNNs through efficient prompting. Specifically, we introduce a lightweight and model-agnostic prompt tuning framework for spatio-temporal GNNs, named SimpleST. It facilitates adapting pre-trained spatio-temporal GNNs to novel distributions while keeping the model parameters fixed. This prompt mechanism reduces the overhead and complexity of adaptation, enabling efficient utilization of pre-trained models for out-of-distribution generalization. Extensive experiments conducted on five real-world urban spatio-temporal datasets demonstrate the superiority of our approach in terms of prediction accuracy and computational efficiency.

2605.08271 2026-05-12 cs.CV cs.AI

Bridging Modalities, Spanning Time: Structured Memory for Ultra-Long Agentic Video Reasoning

Jiazheng Li, Chi-Hao Wu, Yunze Liu, Kaize Ding, Jundong Li, Chuxu Zhang

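AI summary This study targets the understanding of ultra-long videos (such as egocentric recordings, live streams, or surveillance footage), where existing models hit context-window limits and lose information when the content spans days to weeks. The authors propose MAGIC-Video, a training-free framework that builds a multimodal memory graph and an interleaved narrative chain to support cross-modal retrieval and long-horizon narrative summarization. The method consistently outperforms strong existing baselines on several benchmarks.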

Details
English abstract

Understanding ultra-long videos, such as egocentric recordings, live streams, or surveillance footage spanning days to weeks, remains a challenge for current multimodal LLMs: even with million-token context windows, frame budgets cover only tens of minutes of densely sampled video, and most evidence is discarded before inference begins. Memory-augmented and agentic approaches help with scale, but their retrieval remains fragmented across modalities and lacks long-range narrative summaries that span days or weeks. We propose MAGIC-Video, a training-free framework built around a multimodal memory graph with an interleaved narrative chain: the graph unifies episodic, semantic, and visual content through six typed edges and supports cross-modal retrieval, while the chain distils long-horizon entity biographies and recurring activity events. At inference time, an agentic loop interleaves graph retrieval with narrative fact injection, covering both the modality and time dimensions of ultra-long video in a single retrieval pipeline. On EgoLifeQA, Ego-R1 and MM-Lifelong, MAGIC-Video consistently outperforms strong general-purpose, long-video, and agentic baselines, with gains of 10.1, 7.4, and 5.9 points over the prior best agentic system on each benchmark. Code is available at https://github.com/lijiazheng0917/MAGIC-video.

2605.08269 2026-05-12 cs.RO cs.SY eess.SY

Anatomical Landmark-Guided Deep Reinforcement Learning for Autonomous Gastric Navigation

Haoxuan Wu, Sishen Yuan, Haitao Gao, Zhen Li, Xiuli Zuo, Hongliang Ren

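AI summary This study proposes an anatomical landmark-guided deep reinforcement learning framework for autonomous gastric navigation, aimed at improving the diagnostic value of wireless capsule endoscopy. Using a lightweight module that fuses edge, contour, and depth information, the policy makes decisions on low-dimensional anatomical landmark coordinates, which helps bridge the sim-to-real gap. In simulation across several patient-derived models the method achieves over 97% coverage, and in ex-vivo experiments it cuts procedure time by 53% compared with manual control.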

Details
English abstract

Wireless capsule endoscopy (WCE) enables painless visualization of the gastrointestinal tract, but its diagnostic potential is limited by incomplete mucosal coverage and poor transferability of existing navigation methods across patient anatomies. We propose a transferable, anatomical landmark-guided deep reinforcement learning (AL-DRL) framework for autonomous gastric navigation. Leveraging a lightweight edge-contour-depth fusion module, our policy operates on stable, low-dimensional landmark coordinates rather than high-dimensional video streams, effectively bridging the sim-to-real gap. In simulations across eight patient-derived models, the method achieves over 97% coverage within 50 seconds, significantly outperforming vanilla PPO, SAC, and DQN agents. A two-stage sim-to-real pipeline with an adaptive dynamic programming controller actively mitigates physical disturbances. Ex-vivo experiments demonstrate a mean coverage of 87% and a 53% reduction in procedure time compared with expert manual control.

2605.08255 2026-05-12 cs.LG cond-mat.mtrl-sci cs.AI

Can LLMs Predict Polymer Physics Just by Reading Synthesis and Processing Prose?

Yuchu Liu, Rui Zhu, Jingwei Xiong, Haixu Tang

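AI summary This study asks whether large language models can predict the physical and mechanical properties of polymers purely by reading unstructured scientific text. Conventional models typically rely on chemical structure representations and ignore key experimental information such as synthesis route and processing conditions. The study proposes PolyLM, a natural-language-based framework that predicts material properties directly from full-text literature, and builds a large-scale training dataset covering 185,000 papers and over 276,000 polymer samples. Experiments show strong predictive accuracy across 22 properties, with $R^2$ above 0.80 for several of them, supporting natural language as a promising interface for materials property prediction.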

Details
English abstract

Can large language models predict physical and mechanical polymer properties simply by reading unstructured scientific prose? Polymer performance is rarely determined by chemical structure alone; identical nominal polymers can exhibit drastically different behaviors depending on their synthesis route, processing history, morphology, and testing conditions. Yet, state-of-the-art polymer property models typically rely on structure-only representations -- such as SMILES or molecular graphs -- which strip away this vital experimental context. In this work, we introduce PolyLM, a natural-language-only, process- and condition-aware framework that predicts materials performance directly from full-text literature. By circumventing structural inputs entirely, PolyLM preserves the nuanced, unstructured descriptions of synthesis and processing reported by domain scientists. To train this framework, we curated an unprecedented, literature-scale dataset encompassing 185,000 scientific papers and over 276,400 unique polymer samples across 22 physical, mechanical, and thermal properties. We fine-tuned a massive 9-billion-parameter language model (Qwen3.5-9B) using Low-Rank Adaptation (LoRA) and task-level uncertainty weighting. Evaluated on 68,283 held-out observations, the model achieves remarkably high predictive accuracy, establishing new state-of-the-art benchmarks for complex properties. Across the 22 diverse targets, the model achieves a median $R^2$ of 0.74, with predictions for key thermal, mechanical, and physicochemical properties frequently surpassing an $R^2$ of 0.80. These results unequivocally demonstrate that natural language is a powerful, highly scalable interface for realistic materials performance prediction.

2605.08254 2026-05-12 cs.LG cs.AI

HyperTransport: Amortized Conditioning of T2I Generative Models

Valentino Maiorca, Eleonora Gualdoni, Xavier Suau, Marco Cuturi, Luca Zappella, Pau Rodríguez

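AI summary As foundation models grow more capable, controlling their behavior efficiently and reliably becomes critical. This paper proposes HyperTransport, a hypernetwork-based framework that maps embeddings from a pre-trained encoder (such as CLIP) directly to intervention parameters, giving fast and stable control over generative model behavior. Trained end-to-end with an optimal transport loss, it needs only a single forward pass per intervention, which greatly improves efficiency, and it also performs well on unseen concepts. It combines amortized control over open-ended concept sets, continuous and interpretable strength control, and cross-modal conditioning.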

Details
English abstract

As foundation models grow in capability, the ability to efficiently and reliably control their behavior becomes critical. Fine-tuning these models can be costly, and while prompting can be practical for controllability, it remains fragile due to models' high sensitivity to exact prompt wording and structure. This brittleness has driven interest in activation steering techniques that offer more stable and predictable control over model behavior. However, existing activation steering methods require per-concept optimization, which makes them ill-suited to deployment scenarios where the concept set is large, evolving, or only specified at request time: each new concept incurs at least minutes of optimization on the target model. We propose HyperTransport, a hypernetwork framework that amortizes this cost by mapping embeddings from a pretrained encoder (CLIP in our instantiation) directly to intervention parameters, trained end-to-end using an optimal transport loss. Once trained, HyperTransport produces each new intervention in a single hypernetwork forward pass, 3600-7000x faster than per-concept fitting. On concepts unseen during training, it matches the strongest per-concept baselines at inducing the target concept. By decoupling concept representation from intervention prediction, HyperTransport combines three capabilities that no existing approach offers as a set: amortized steering for open-ended concept sets, continuous interpretable strength control, and cross-modal conditioning where reference images can directly steer text-based generation. We validate HyperTransport on DMD2 and Nitro-1-PixArt across 167 held-out test concepts via CLIP-based metrics, a VLM-as-a-judge evaluation, and a user study. In pairwise comparisons, both human and VLM judges prefer HyperTransport over prompting ~2x as often.

2605.08252 2026-05-12 cs.CV

Multimodal Emotion Recognition via Causal-Diffusion Bridge (Affect-Diff)

Ankit Sanjyal

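AI summary This study addresses severe class imbalance in multimodal emotion recognition and proposes Affect-Diff, a causal-diffusion bridge model. It re-weights modality contributions with a learned causal graph, regularizes latent compression, and structures the latent space with a diffusion prior, which improves recognition of minority emotions such as fear, disgust, and surprise. On CMU-MOSEI, Affect-Diff improves validation balanced accuracy by 18% relative to the strongest baseline, and its deterministic-encoder variant detects all six emotion classes.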

Comments 10 Pages, 12 Figures, 6 Tables

Details
English abstract

Multimodal emotion recognition on CMU-MOSEI faces an extreme imbalance as Happy accounts for 65.9% of samples while three Ekman categories collectively represent under 7%, causing standard fusion models to maximize accuracy by ignoring minority emotions entirely. We present Affect-Diff, a Causal-Diffusion Bridge that addresses this through three jointly trained mechanisms: a NOTEARS-learned causal graph that re-weights modality contributions before fusion, a beta-VAE bottleneck for regularized latent compression, and a stop-gradiented 1D DDPM prior that structures the latent space against majority-class collapse. On 3,292 aligned CMU-MOSEI samples, Affect-Diff achieves validation balanced accuracy 0.384, an 18% relative improvement over the strongest baseline (TETFN: 0.324), while all evaluated baselines produce zero F1 on Fear, Disgust, and Surprise. Ablation studies confirm independent, non-redundant contributions from the diffusion prior (-24% without it) and causal graph (-13%). Notably, only the deterministic-encoder variant detects all six emotion classes, revealing KL regularization strength as a direct lever for minority-class sensitivity.

2605.08250 2026-05-12 cs.CV cs.AI

Why Do DiT Editors Drift? Plug-and-Play Low Frequency Alignment in VAE Latent Space

Xiaoce Wang, Sifan Zhou, Kaifei Wang, Leli Xu, Xuerui Qiu, Tao He, Ming Li

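AI summary Diffusion transformers (DiTs) have recently performed well at single-turn image editing, but multi-turn editing often suffers from semantic drift and quality degradation. Taking a latent-space frequency perspective, this paper decomposes the editing process into its VAE and DiT components and finds that the DiT introduces low-frequency semantic drift that accumulates across rounds, while the VAE mainly contributes a stable reconstruction bias. Based on this finding, the authors propose VAE-LFA, a training-free, plug-and-play low-frequency alignment method that suppresses semantic drift through low-pass filtering and statistical alignment in the VAE latent space, improving semantic consistency and visual quality in multi-turn editing.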

Comments 9 pages main paper, 12 figures, 25 pages in total

Details
English abstract

Recent advances in diffusion transformers (DiTs) have enabled promising single-turn image editing capabilities. However, multi-turn editing often leads to progressive semantic drift and quality degradation. In this work, we study this problem from a latent-space frequency perspective by decomposing the editing process into two functional components: VAE and DiT. Through systematic analysis in the VAE latent space, we uncover that the DiT introduces dominant low-frequency drift that accumulates as semantic misalignment across editing rounds, while the VAE contributes comparatively stable reconstruction bias. Based on this insight, we propose VAE-LFA (Low Frequency Alignment), a training-free, plug-and-play method that performs alignment in VAE latent space. VAE-LFA decomposes latent discrepancies across editing rounds via low-pass filtering, and aligns low-frequency statistics to an exponential moving average of previous rounds, effectively suppressing accumulated semantic drift while preserving high-frequency details. Our method requires no retraining, ground-truth priors, or access to diffusion parameters, making it applicable to both white-box and black-box DiT editors. For white-box models, VAE-LFA is seamlessly integrated into the editing pipeline by eliminating redundant VAE round trips; for black-box models, it operates via an off-the-shelf VAE to perform inter-round latent alignment. Extensive experiments demonstrate that VAE-LFA improves semantic consistency and visual fidelity across diverse multi-turn editing scenarios, including both controlled and in-the-wild images.
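
A rough sketch of the mechanism this abstract describes: split a latent into low- and high-frequency parts, re-standardize the low-frequency part to an exponential moving average (EMA) of earlier rounds, and leave the high-frequency part untouched. The ideal FFT mask, the choice of per-channel mean and standard deviation as the "low-frequency statistics", the momentum value, and the class and function names are all assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def low_pass(latent, keep_frac=0.25):
    """Keep only the lowest spatial frequencies of a (C, H, W) latent (ideal FFT mask)."""
    _, h, w = latent.shape
    fy = np.abs(np.fft.fftfreq(h))[:, None]   # cycles per sample, in [0, 0.5]
    fx = np.abs(np.fft.fftfreq(w))[None, :]
    mask = (fy <= 0.5 * keep_frac) & (fx <= 0.5 * keep_frac)
    spectrum = np.fft.fft2(latent, axes=(-2, -1))
    return np.fft.ifft2(spectrum * mask, axes=(-2, -1)).real

class LowFreqAligner:
    """Align per-channel low-frequency mean/std to an EMA of previous editing rounds."""

    def __init__(self, momentum=0.9, keep_frac=0.25):
        self.momentum = momentum
        self.keep_frac = keep_frac
        self.ema_mean = None
        self.ema_std = None

    def __call__(self, latent):
        low = low_pass(latent, self.keep_frac)
        high = latent - low                           # high-frequency detail, left as-is
        mean = low.mean(axis=(1, 2), keepdims=True)
        std = low.std(axis=(1, 2), keepdims=True) + 1e-6
        if self.ema_mean is None:                     # first round: nothing to align to yet
            self.ema_mean, self.ema_std = mean, std
            return latent
        aligned_low = (low - mean) / std * self.ema_std + self.ema_mean
        m = self.momentum                             # update the EMA with the raw statistics
        self.ema_mean = m * self.ema_mean + (1 - m) * mean
        self.ema_std = m * self.ema_std + (1 - m) * std
        return aligned_low + high
```

In use, one aligner instance would be called once per editing round on the round's latent, so a constant offset introduced between rounds is pulled back toward the running statistics while fine detail passes through unchanged.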

2605.08249 2026-05-12 cs.CV eess.IV eess.SP

Dimensional Coactivation for Representational Consistency in Frozen Vision Foundation Models

Izaldein Al-Zyoud, Abdulmotaleb El Saddik

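AI summary This paper studies whether a frozen vision foundation model represents a single input sample consistently, and proposes Dimensional Coactivation (DCA), a new measure of whether the model keeps a coherent representational structure across semantic regions. DCA assesses consistency by analyzing how feature dimensions coactivate across regions, and it avoids operations such as centering and normalization that classical similarity measures use, which suits intra-sample analysis in a fixed coordinate system. Experiments show that DCA performs well on deepfake detection, exposing the representational break between semantic regions in synthetic images.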

Details
English abstract

Frozen vision foundation models do not merely extract features; they organize images through a learned coordinate system. We ask whether that coordinate system remains internally coherent within a single input. This leads to Representational Consistency: the study of whether a frozen foundation model represents one sample coherently across its semantic subregions. We introduce Dimensional Coactivation (DCA), a per-dimension instrument for measuring this coherence. DCA compares semantic regions by asking whether the same feature dimensions coactivate across them. Unlike classical similarity measures, it deliberately avoids centering, L2 normalization, and full Gram coupling. These operations are useful when comparing different models or distributions, but they are mismatched to the intra-sample setting, where the coordinate system is fixed and raw magnitude carries signal. Deepfake detection provides a natural validation task. Synthetic faces may reproduce plausible eyes, noses, and mouths while breaking the representational structure that links those regions in real faces. Using frozen DINOv3 features, DCA exposes this break: an eyes-mouth-nose fingerprint achieves 0.9106 AUC on CelebDF-v2 and 0.9289 on DFD under FF++ c23 cross-dataset transfer. The design is also sharply validated by ablation: reintroducing centering collapses CelebDF-v2 AUC to 0.459, L2 normalization reduces it to 0.862, and cross-dimension coupling reduces it to 0.478. Finally, replacing DINOv3 with FaRL collapses CelebDF-v2 AUC to 0.582. DCA therefore depends on a stable per-dimension coordinate system, not on region extraction alone. These results position DCA as an instrument for measuring intra-sample representational coherence in frozen foundation models, with deepfake detection as the first validation task.

2605.08246 2026-05-12 cs.CV cs.CR cs.LG

Smart Railway Obstruction Detection System using IoT and Computer Vision

Pravin Kumar, Mritunjay Shall Peelam, Ramakant Kumar, Sanjay Kumar, Vinay Chamola

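AI summary This paper proposes NETRA, an intelligent railway obstruction detection system based on IoT and computer vision, to address the safety risks that wildlife incursions and deliberate obstructions pose to Indian Railways. The system runs on low-cost Raspberry Pi edge devices and uses probabilistic sensor fusion of an infrared sensor and an ultrasonic sensor to lower the false alarm rate and cut unnecessary visual processing. Experiments indicate that NETRA outperforms existing solutions in detection accuracy, response time, and deployment cost, offering an efficient and economical approach to railway safety.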

Details
English abstract

Railway track intrusions pose a critical safety challenge for Indian Railways, encompassing wildlife incursions and deliberate malicious obstructions. The December 2025 collision in Assam, in which seven elephants were killed by the Rajdhani Express, underscores the urgency of effective real-time detection. Existing solutions such as the optical fiber-based Gajraj system suffer from prohibitive costs ($1000/km) and high false alarm rates, limiting deployment to only 20 of India's 101 elephant corridors. This paper proposes NETRA, a cost-effective, internet-independent intrusion detection system deployed on Raspberry Pi Zero W and Raspberry Pi 4 edge platforms. NETRA employs probabilistic sensor fusion integrating a PIR motion sensor and an HC-SR04 ultrasonic distance sensor with a tunable threshold (tau_c = 0.65), enabling event-driven camera activation that reduces unnecessary visual processing by 52%. Upon confirmed intrusion, edge-AI classification using MobileNet-SSD (Pi Zero) or YOLOv5 ONNX (Pi 4) identifies threats including humans, large animals, and track obstructions. Confirmed threats are transmitted via LoRa (868 MHz) to alert the locomotive driver within 2.4 seconds end-to-end. Experimental evaluation across 113 motion events demonstrated 95% detection accuracy with zero false alarms through probabilistic fusion, compared to 85% for binary methods. Raspberry Pi 4 with YOLOv5 achieved 83.5% elephant F1-score, a 5.6x improvement over Pi Zero's heuristic approach (14.8%). LoRa communication achieved 100% packet delivery across 1-2 km in field trials. NETRA reduces deployment cost by 75% ($247/km vs $1000/km for Gajraj) while providing unified detection of both wildlife and obstruction threats.
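
A minimal sketch of threshold-gated sensor fusion of the kind this abstract describes, where the camera is only woken when a combined PIR and ultrasonic confidence reaches tau_c. Only the threshold value 0.65 comes from the abstract. The linear distance-to-confidence mapping, the equal weights, the 400 cm range limit, and all function names are assumptions; the paper's actual fusion rule is not given here.

```python
TAU_C = 0.65  # fusion threshold quoted in the abstract

def ultrasonic_confidence(distance_cm, max_range_cm=400.0):
    """Map an ultrasonic reading to [0, 1]; closer objects score higher (assumed linear).

    distance_cm is None when no echo was received.
    """
    if distance_cm is None or distance_cm >= max_range_cm:
        return 0.0
    return 1.0 - max(distance_cm, 0.0) / max_range_cm

def fused_confidence(pir_triggered, distance_cm, w_pir=0.5, w_ultra=0.5):
    """Weighted combination of the two sensor confidences (weights are hypothetical)."""
    pir_conf = 1.0 if pir_triggered else 0.0
    return w_pir * pir_conf + w_ultra * ultrasonic_confidence(distance_cm)

def should_activate_camera(pir_triggered, distance_cm, tau_c=TAU_C):
    """Event-driven gate: wake the camera only when fused confidence reaches tau_c."""
    return fused_confidence(pir_triggered, distance_cm) >= tau_c
```

Under these assumed weights neither sensor alone can cross the threshold, so a stray PIR trigger with no ultrasonic echo does not start visual processing.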

2605.08241 2026-05-12 cs.CV cs.AI

TinySSL: Distilled Self-Supervised Pretraining for Sub-Megabyte MCU Models

Bibin Wilson

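AI summary This paper proposes TinySSL, a self-supervised pretraining method for efficient representation learning in microcontroller (MCU)-class models with fewer than 500K parameters. It identifies three key obstacles at this scale and overcomes them by combining knowledge distillation, multi-scale feature distillation, and a progressive augmentation curriculum, improving performance on image classification and object detection. Experiments show that the method keeps the model lightweight while beating several existing methods on accuracy, with a very small memory footprint and no added inference overhead at deployment.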

Details
English abstract

Self-supervised learning (SSL) has transformed representation learning for large models, yet remains unexplored for microcontroller (MCU)-class models with fewer than 500K parameters. We identify three obstacles at this scale -- projection head dominance, representation bottleneck, and augmentation sensitivity -- and propose Capacity-Aware Distilled Self-Supervised Learning (CA-DSSL), a teacher-guided framework that overcomes them without labels or text supervision. CA-DSSL combines asymmetric distillation from a frozen DINO ViT-S/16 teacher, multi-scale feature distillation for spatial representations, and a progressive augmentation curriculum. On a MobileNetV2-0.35 backbone (396K parameters) pretrained on CIFAR-100, CA-DSSL reaches 62.7 ± 0.5% linear-probe accuracy (3-seed mean) -- surpassing SimCLR-Tiny by 18 pp, matching SEED (61.7%) with 10× fewer projection parameters (426K vs. 3.15M), and reaching 94.0% of a supervised upper bound. Standard SSL methods (BYOL-Tiny, DINO-Tiny) collapse entirely at this scale. On Pascal VOC detection, CA-DSSL achieves 2.3× the mAP of random initialization and +3 pp over SEED, though SimCLR-Tiny matches CA-DSSL on detection mAP. The deployed backbone occupies 378 KB (INT8) with no inference overhead from pretraining. Preliminary ImageNet-100 experiments reveal that CA-DSSL's advantage is specific to small-data regimes; scaling to ImageNet-1K is discussed as future work.