arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.09771 2026-05-12 cs.AI

Marrying Generative Model of Healthcare Events with Digital Twin of Social Determinants of Health for Disease Reasoning

Ziquan Wei, Tingting Dan, Guorong Wu

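AI summary: This study aims to make disease prediction and reasoning more personalized by combining a generative model with a digital twin of social determinants of health (SDoH), addressing the weak modeling of social factors in existing models. It proposes a conditioned latent diffusion framework built on ICD-coded proxies that jointly models multi-organ sensor data and the temporal evolution of healthcare events, and introduces a geometric diffusion model for complex data such as brain networks. Experiments on the UK Biobank dataset show that the method significantly outperforms existing disease generative models and imaging-trait generative baselines.

Comments 21 pages, 8 figures, ICML 2026

Abstract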

Despite the central role of sensor-derived measurements such as imaging traits and plasma biomarkers in biomedical research and clinical practice, existing generative models for disease prediction largely depend on event-level representations from hospital and registry data. Given the multi-factorial nature of human disease, the absence of explicit modeling of social determinants of health (SDoH), even in the limited form of ICD-coded proxies (chapters Z and V-Y in ICD-10), limits the capacity for personalized disease modeling and clinical decision support. To address this limitation, we propose a generative model with ICD-coded proxies of SDoH for in silico modeling of disease reasoning: a conditioned latent diffusion framework that connects multi-organ sensor data with tokenized healthcare events. Specifically, we introduce a novel geometric diffusion model to characterize the temporal evolution of complex data representations such as brain networks (region-to-region connectivity encoded in a graph), in parallel with diffusion models for tabular data from other organ systems. Together, we integrate the generative model with digitalized SDoH proxies for simulated intervention and reasoning about future disease trajectories. We conduct extensive experiments on the UK Biobank (UKB) dataset, which contains organ-specific imaging traits, including brain (44,834), heart (23,987), liver (28,722), and kidney (32,155), along with nearly 500k medical history sequences (age range: 25 to 89 years). Our model achieves significant improvements over state-of-the-art human disease autoregressive models and imaging trait generative baselines.

2605.09765 2026-05-12 cs.LG cs.AI

WISTERIA: Learning Clinical Representations from Noisy Supervision via Multi-View Consistency in Electronic Health Records

Ruan Dong, Yuanyun Zhang, Shi Li

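AI summary: This paper proposes WISTERIA, a weakly supervised representation learning framework for learning clinical representations from electronic health records (EHR). The method treats clinical labels as stochastic observations of a latent clinical state and learns robustly from noisy labels by constructing multiple weak supervision operators and enforcing consistency across their label distributions. It also adds ontology-based regularization to strengthen the semantic structure of the label space. Experiments show that WISTERIA achieves better predictive performance, stronger noise robustness, and better cross-institutional generalization on multiple EHR benchmark tasks.

Abstract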

Representation learning in electronic health records (EHR) has largely followed paradigms inherited from natural language processing, relying on sequence modeling and reconstruction-based objectives that treat clinical labels as ground truth. However, real-world clinical supervision is inherently weak, arising from heterogeneous, noisy, and institution-specific labeling processes such as billing codes, heuristic phenotypes, and incomplete annotations. In this work, we propose WISTERIA, a weakly supervised representation learning framework that models labels as stochastic observations of an underlying latent clinical state. Instead of optimizing against a single supervision signal, WISTERIA constructs multiple weak supervision operators and learns representations by enforcing consistency across their induced label distributions. This multi-view formulation induces an implicit denoising mechanism, allowing the model to recover clinically meaningful structure by reconciling disagreement between noisy labelers. We further incorporate ontology-aware regularization in the label space to impose semantic structure over supervision signals. Empirically, WISTERIA improves predictive performance across standard EHR benchmarks, demonstrates strong robustness to label noise, and exhibits superior cross-institutional generalization compared to sequence-based pretraining objectives. These results suggest that explicitly modeling the supervision process, rather than treating labels as fixed targets, provides a more appropriate inductive bias for learning robust and clinically meaningful representations from EHR data.

2605.09760 2026-05-12 cs.CL

ConFit v3: Improving Resume-Job Matching with LLM-based Re-Ranking

Xiao Yu, Ruize Xu, Chengyuan Xue, Junyu Chen, Matthew So, Shijun Ma, Bo Liu, Xiangye Liang, Zhou Yu

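AI summary: This paper proposes ConFit v3, a large language model (LLM)-based re-ranking method for improving resume-job matching. It analyzes the training pipeline of LLM re-rankers for person-job fit and proposes several optimizations: multi-pass re-ranking, listwise reinforcement learning, denoising, and knowledge distillation from a stronger LLM. Trained on real-world recruiting data with these improvements, ConFit v3 significantly outperforms the best existing systems and mainstream large models.

Abstract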

A reliable resume-job matching system helps a company find suitable candidates from a pool of resumes and helps a job seeker find relevant jobs from a list of job posts. While recent advances in embedding-based methods such as ConFit and ConFit v2 can efficiently retrieve candidates at scale, the lack of controllability and explainability limits their real-world adaptations. LLM-based re-rankers can address these limitations through reasoning, but existing training recipes are developed on short-document benchmarks and do not account for noise in real-world recruiting data. In this work, we first conduct a systematic analysis over the LLM re-ranker training pipeline for person-job fit, covering inference algorithm design, RL algorithm selection, data processing, and SFT distillation. We find that using multi-pass re-ranking, training with listwise RL objectives, removing noisy samples, and distilling from a stronger LLM before RL significantly improves re-ranking performance. We then aggregate these findings to train ConFit v3 with Qwen3-8B and Qwen3-32B on real-world person-job fit datasets, and find significant improvements over existing best person-job fit systems as well as strong LLMs such as GPT-5 and Claude Opus-4.5. We hope our findings provide useful insights for future research on adapting LLM-based re-rankers to person-job fit systems.

2605.09757 2026-05-12 cs.LG stat.ML

On Uniform Error Bounds for Kernel Regression under Non-Gaussian Noise

Johannes Teutsch, Oleksii Molodchyk, Marion Leibold, Timm Faulwasser, Armin Lederer

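AI summary: This paper studies non-conservative uncertainty quantification for kernel-regression function estimates under non-Gaussian noise and proposes new non-asymptotic probabilistic uniform error bounds. Unlike earlier bounds that apply only to sub-Gaussian noise, these bounds cover a broader class of noise distributions, including sub-Gaussian, bounded, sub-exponential, and variance/moment-bounded noise, and hold for both correlated and uncorrelated noise. Comparisons with existing results on uncertainty regions and safe-control performance confirm the tightness of the proposed bounds.

Comments This paper has been accepted at the 43rd International Conference on Machine Learning (ICML) 2026

Abstract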

Providing non-conservative uncertainty quantification for function estimates derived from noisy observations remains a fundamental challenge in statistical machine learning, particularly for applications in safety-critical domains. In this work, we propose novel non-asymptotic probabilistic uniform error bounds for kernel-based regression. Compared to related bounds in the literature that are restricted to (conditionally) independent sub-Gaussian noise, our bounds accommodate a broad class of non-Gaussian distributions, such as sub-Gaussian, bounded, sub-exponential, and variance/moment-bounded noise. Moreover, our results apply to both correlated and uncorrelated noise. We compare our proposed error bounds with existing results in terms of the induced uncertainty region and their performance in safe control, demonstrating the tightness of the proposed bounds.

2605.09751 2026-05-12 cs.CL

Language Models Without a Trainable Input Embedding Table: Learning from Fixed Minimal Binary Token Codes

A. Bochkov

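AI summary: This paper asks whether language models need a trainable input embedding table. The author replaces the usual embedding matrix with fixed minimal binary codes, lifted to model width by a zero-parameter transform. Experiments show that the approach removes a large number of trainable parameters while keeping validation perplexity comparable, indicating that a trainable input embedding table is not necessary for language modeling.

Abstract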

Trainable input embedding tables are a standard component of modern language models. We ask whether they are actually necessary at the input interface. For a vocabulary of size $V$, exact token identity requires only $K=\lceil \log_2 V\rceil$ bits. We replace the usual trainable $V\times d_{\text{model}}$ input embedding matrix with fixed minimal binary token codes and a zero-parameter lift to model width. In our main setting, $V=65{,}536$, so $K=16$, and tokens are represented by fixed 16-dimensional binary codes tiled to $d_{\text{model}}=1024$. We also evaluate a fully table-free variant in which codes are generated from token IDs on the fly and randomly recoded by an invertible affine transform over $\mathbb{F}_2^K$. Across matched 32-layer decoder-only models trained on approximately 17B tokens and evaluated over three independent training seeds, fixed minimal codes achieve comparable held-out validation perplexity to a standard learned-input baseline while removing 67.1M trainable input parameters. The fixed-code runs have a lower mean validation perplexity in our experiments, 2.36 versus 2.44, but the observed gap is within the measured seed-to-seed variation of 4.8\%; we therefore interpret the result as evidence that the trainable input table is not necessary, rather than as a statistically resolved superiority claim. The table-free affine-recoded variant remains close at 2.39 despite a slightly shorter training run. These results show that, in this regime, a trainable input embedding table is not necessary for useful language modeling. The output projection remains standard and trainable.
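
The encoding described in this abstract can be illustrated with a minimal sketch. The bit order, the use of raw 0/1 values, and the function names `binary_code` and `lift_by_tiling` are assumptions made for illustration; the abstract only states that 16-dimensional binary codes are tiled to $d_{\text{model}}=1024$.

```python
def binary_code(token_id: int, num_bits: int) -> list[int]:
    """Fixed binary code of a token ID, least significant bit first (assumed order)."""
    return [(token_id >> bit) & 1 for bit in range(num_bits)]


def lift_by_tiling(code: list[int], d_model: int) -> list[float]:
    """Zero-parameter lift: repeat the code until it fills d_model dimensions."""
    if d_model % len(code) != 0:
        raise ValueError("d_model must be a multiple of the code length")
    return [float(bit) for bit in code] * (d_model // len(code))


vocab_size = 65_536
# Equals ceil(log2(V)) for V >= 2, computed without floating-point rounding.
num_bits = (vocab_size - 1).bit_length()
d_model = 1024

vector = lift_by_tiling(binary_code(12345, num_bits), d_model)
# Size of the trainable V x d_model table that this encoding replaces.
removed_parameters = vocab_size * d_model
```

Here `removed_parameters` evaluates to 67,108,864, which agrees with the 67.1M trainable input parameters quoted in the abstract.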

2605.09750 2026-05-12 cs.CV

Fetal Brain Imaging: A Composite Neural Network Approach for Keyframe Detection in Ultrasound Videos

Aleksander Zamojski, Kacper Jarczak, Radoslaw Roszczyk

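AI summary: This paper proposes a new method for keyframe detection in fetal brain ultrasound videos, aiming to improve the efficiency and accuracy of fetal brain image analysis. It uses a composite neural network that combines a convolutional neural network (CNN) with a recurrent neural network (RNN): the CNN extracts local spatial features from video frames, and the RNN captures temporal dependencies between frames in a sequence. The model may support earlier detection and diagnosis of certain fetal brain conditions and thus more timely treatment planning.

Abstract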

This article presents a novel approach to keyframe detection in ultrasound videos, with a particular focus on fetal brain imaging. The proposed model is a composite neural network architecture that combines a Convolutional Neural Network (CNN) with a Recurrent Neural Network (RNN). The CNN extracts spatial features from individual video frames, while the RNN captures temporal dependencies between consecutive frames within each video sequence. The proposed model may improve the efficiency and accuracy of fetal brain ultrasound analysis, thereby supporting earlier detection, diagnosis, and treatment planning for selected fetal brain conditions.

2605.09749 2026-05-12 cs.AI

Primal-Dual Guided Decoding for Constrained Discrete Diffusion

Federico Tomasi, Dmitrii Moor, Alice Wang, Mounia Lalmas

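AI summary: Discrete diffusion models generate structured sequences by progressively unmasking tokens, but satisfying global property constraints during generation remains a challenge. This paper proposes primal-dual guided decoding, which formulates constrained generation at inference time as a KL-regularized optimization problem and solves it online with adaptive Lagrangian multipliers. The method adjusts token logits with a constraint-dependent bias so that the generated distribution stays as close as possible to the unconstrained one while satisfying the constraints; it needs no extra training or model evaluations, handles multiple constraints at once, and comes with theoretical bounds on constraint violation. Experiments on topical text generation, molecular design, and music playlist generation show improved constraint satisfaction while preserving domain-specific quality metrics.

Abstract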

Discrete diffusion models generate structured sequences by progressively unmasking tokens, but enforcing global property constraints during generation remains an open challenge. We propose primal-dual guided decoding, an inference-time method that formulates constrained generation as a KL-regularised optimisation problem and solves it online via adaptive Lagrangian multipliers. At each denoising step, the method modifies token logits through an additive, constraint-dependent bias, with multipliers updated by mirror descent based on constraint violation. The bias arises as the optimal KL-regularised projection of the constraint, so the constrained distribution remains as close as possible to the model's unconstrained distribution while still satisfying the constraint. The method requires no retraining and no additional model evaluations beyond standard sampling, supports multiple simultaneous constraints, and provides formal bounds on constraint violation. We evaluate our approach on topical text generation, molecular design, and music playlist generation, showing that a single algorithm instantiated via domain-specific scoring functions improves constraint satisfaction while preserving relevant domain-specific quality metrics.
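
A rough sketch of the idea in this abstract, under several assumptions that are not taken from the paper: a single constraint of the form "expected cost at most `budget`", a per-token cost vector, and an exponentiated (multiplicative) multiplier update as one common instance of mirror descent. The name `guided_step` and all numbers are hypothetical, and the loop repeats the update on one fixed distribution only to show the dynamics; in the described method the multipliers would be updated across denoising steps.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def guided_step(logits, costs, multiplier, budget, step_size):
    """One primal-dual step: bias the logits, then update the multiplier.

    Primal: the KL-closest distribution under a cost penalty is proportional
    to p * exp(-multiplier * cost), i.e. an additive bias on the logits.
    Dual: an exponentiated update (assumed) raises the multiplier while the
    expected cost exceeds the budget and lowers it otherwise.
    """
    biased = [l - multiplier * c for l, c in zip(logits, costs)]
    probs = softmax(biased)
    violation = sum(p * c for p, c in zip(probs, costs)) - budget
    return probs, multiplier * math.exp(step_size * violation)


logits = [2.0, 1.0, 0.0]   # hypothetical model logits for three tokens
costs = [1.0, 0.0, 0.0]    # hypothetical constraint cost per token
multiplier = 1.0           # must start positive for the multiplicative update
for _ in range(200):
    probs, multiplier = guided_step(logits, costs, multiplier, budget=0.2, step_size=1.0)
expected_cost = sum(p * c for p, c in zip(probs, costs))
```

In this toy setting the expected cost settles near the budget while the two zero-cost tokens keep their original ratio, which is the sense in which the biased distribution stays close to the unconstrained one.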

2605.09746 2026-05-12 cs.LG cs.AI

Sequential Feature Selection for Efficient Landslide Segmentation from Multi-Spectral Data

Arsalaan Ahmad, Oktay Karakus, Paul L. Rosin

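AI summary: This study addresses redundant input features in efficient landslide segmentation from multispectral satellite data. It proposes an explainable feature selection framework based on Sequential Forward Floating Selection (SFFS) that combines Sentinel-2 multispectral data with ALOS PALSAR terrain data and iteratively builds and prunes the feature set, finding that only 8 channels are needed to match the segmentation performance obtained with 30 channels. The approach improves model efficiency and reveals which spectral and topographic features landslide models actually rely on, offering principled guidance for input design in Earth observation.

Comments In Process of Submission to Frontiers in Remote Sensing. Keywords: landslide segmentation, multispectral remote sensing, feature selection, explainability, Landslide4Sense

Abstract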

Landslide detection from satellite imagery has advanced through deep learning, yet most models rely on large, highly correlated spectral-topographic inputs whose contributions remain poorly understood. The question of which channels are actually necessary has received surprisingly little attention. This matters: redundant or correlated inputs obscure physical interpretability, inflate computational overhead, and can actively degrade model performance through the Hughes Phenomenon. We present a systematic, explainable channel-selection framework for the Landslide4Sense benchmark, combining Sentinel-2 multispectral and ALOS PALSAR terrain data with 16 engineered spectral and structural indices. Rather than relying on conventional single-band drop tests, which evaluate channels in isolation and miss interaction effects, we apply Sequential Forward Floating Selection (SFFS) to iteratively build and prune a candidate feature pool using a lightweight U-Net++ proxy model. Beyond identifying a compact 8-channel subset that matches or exceeds the segmentation F1 of configurations using up to 30 channels, we use the selection process itself to interrogate which spectral and topographic features landslide models genuinely rely on, and what this reveals about the physical cues driving their predictions. We argue that SFFS represents a principled feature selection approach to input design in Earth observation, in contrast to the prevailing practice of appending every available band and hoping the model learns what to ignore.
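
For readers unfamiliar with SFFS, the following is a generic sketch of the textbook algorithm, not the authors' pipeline. The `score` callable stands in for "train the proxy model on this channel subset and return validation F1"; `toy_score`, the channel names, and the gain values are hypothetical.

```python
from typing import Callable, Dict, FrozenSet, List, Sequence


def sffs(features: Sequence[str],
         score: Callable[[FrozenSet[str]], float],
         k: int) -> List[str]:
    """Sequential Forward Floating Selection of k features."""
    if not 0 < k <= len(features):
        raise ValueError("k must be between 1 and the number of features")
    selected: FrozenSet[str] = frozenset()
    best: Dict[int, float] = {}  # best score seen so far for each subset size
    while len(selected) < k:
        # Forward step: add the single feature that gives the best subset.
        remaining = [f for f in features if f not in selected]
        selected = max((selected | {f} for f in remaining), key=score)
        size = len(selected)
        best[size] = max(score(selected), best.get(size, float("-inf")))
        # Floating step: drop features while that beats the best smaller subset.
        while len(selected) > 2:
            reduced = max((selected - {f} for f in selected), key=score)
            if score(reduced) > best[len(reduced)]:
                selected = reduced
                best[len(reduced)] = score(reduced)
            else:
                break
    return sorted(selected)


# Hypothetical per-channel gains; a real score would come from a trained model.
GAIN = {"red": 0.30, "nir": 0.28, "ndvi": 0.35, "slope": 0.25, "elevation": 0.10}


def toy_score(subset: FrozenSet[str]) -> float:
    total = sum(GAIN[f] for f in subset)
    if {"red", "nir", "ndvi"} <= subset:
        total -= 0.40  # NDVI is derived from red and NIR, so all three are redundant
    return total


subset = sffs(list(GAIN), toy_score, k=3)
```

Because every candidate subset is scored jointly, the search can account for interaction effects such as the redundancy penalty above, which a single-band drop test evaluates only in isolation.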

2605.09745 2026-05-12 cs.LG cs.AI cs.IT math.IT

Entropy-informed Decoding: Adaptive Information-Driven Branching

Benjamin Patrick Evans, Sumitra Ganesh, Leo Ardon

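AI summary: This paper proposes EDEN, an entropy-driven decoding framework for improving the generation quality of large language models. The method adjusts the branching factor dynamically according to the uncertainty (entropy) of the model's output, generating more candidates in high-entropy regions and following a greedier strategy in low-entropy regions, which improves computational efficiency. Experiments show that EDEN outperforms conventional decoding methods on complex tasks such as mathematical reasoning and code generation, achieving a better trade-off between accuracy and expansion cost.

Comments Accepted at ICML 2026

Abstract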

Large language models (LLMs) achieve remarkable generative performance, yet their output quality is dependent on the decoding strategy. While sampling-based methods (e.g., top-k, nucleus) and search-and-select based methods (e.g., beam search, best-of-n, majority voting) can improve upon greedy decoding, both approaches suffer from limitations: sampling generally commits to a single path, while search often expends excessive computation regardless of task complexity. To address these, we introduce Entropy-informed decoding (EDEN), a plug-and-play, model-agnostic decoding framework that adaptively allocates computation based on the model's own uncertainty, approximating higher-width beam search with fewer expansions. At each generation step, EDEN estimates the entropy of the output token distribution and adjusts the branching factor monotonically with the entropy, expanding more candidates in high-entropy regions and following a greedier path in low-entropy regions, improving token efficiency. Experiments across complex tasks, including mathematical reasoning, code generation, and scientific questions, demonstrate that EDEN consistently improves output quality over existing decoding strategies, achieving better accuracy-expansion trade-offs than fixed-width beam search. By treating next-token selection as a noisy maximisation problem, we prove that branching factors monotone in entropy are guaranteed to find better (i.e. more probable) continuations than any fixed branching factor within the same total expansion budget, and derive explicit regret rates characterising the benefit of the adaptive allocation.
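
The entropy-to-branching rule can be pictured with a small sketch. The abstract only requires the branching factor to be monotone in the entropy; the linear mapping of normalised entropy used here, the function names, and the `max_branches` parameter are assumptions for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def branching_factor(probs, max_branches):
    """Map entropy to an integer in [1, max_branches], monotone in the entropy."""
    if len(probs) < 2:
        return 1
    # Normalise by the maximum possible entropy, log(vocabulary size).
    normalised = entropy(probs) / math.log(len(probs))
    normalised = min(1.0, max(0.0, normalised))  # guard against rounding error
    return 1 + round(normalised * (max_branches - 1))


confident = [1.0, 0.0, 0.0, 0.0]   # low entropy: follow the greedy path
uncertain = [0.25, 0.25, 0.25, 0.25]  # high entropy: expand more candidates
narrow = branching_factor(confident, max_branches=4)
wide = branching_factor(uncertain, max_branches=4)
```

A decoder built on this rule would expand `narrow` candidates where the model is confident and `wide` candidates where it is not, spending its expansion budget where the distribution is flattest.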

2605.09742 2026-05-12 cs.LG cs.AI

TIDES: Implicit Time-Awareness in Selective State Space Models

Taylan Soydan, Miguel A. Bessa, Dirk Mohr, Rui Barreira

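AI summary: This paper proposes TIDES, a selective state space model designed to overcome the limitations of existing models on irregular time series. Unlike conventional models, TIDES moves input-dependence off the time step and onto the diagonal of the state matrix, so the time step $\tilde{\Delta}$ keeps its physical meaning and the model handles irregular timestamps while retaining high expressivity. Experiments show strong results on multiple benchmarks, including new state-of-the-art results on time-series classification and regression tasks.

Comments Preprint submitted for peer-review

Abstract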

Selective state space models (SSMs), such as Mamba, achieve strong per-token expressivity by making the time discretization step $\tilde{\Delta}$ a learned function of the input. However, in doing so, $\tilde{\Delta}$ ceases to represent a physical sampling interval, limiting its irregular time series modeling capability. Continuous-time SSMs, such as S5, preserve the physical meaning of $\tilde{\Delta}$ and handle irregular timestamps natively ($\tilde{\Delta} \equiv \Delta$), but their dynamics remain linear time-invariant (LTI), limiting per-token expressivity. We propose TIDES, a selective SSM variant that reconciles selective and continuous architectures by moving input-dependence off the step size and onto the diagonal state matrix. As a result, $\tilde{\Delta}$ retains its physical meaning, tied to the state discretization, allowing the model to handle irregular timestamps natively without sacrificing the per-token expressivity that makes selective SSMs effective. We show this on a novel Fading Flash experimental benchmark, a compact controlled diagnostic for sequence models that jointly tests input-dependence and extrapolation to out-of-distribution $\Delta$ values, and isolates the distinct failure modes of current state-of-the-art architectures that TIDES avoids by construction. On large-scale benchmarks, TIDES sets the new state-of-the-art average rank on UEA time-series classification and the Physiome-ODE regression benchmark. Code available at: https://github.com/TaylanSoydan/TIDES.

2605.09739 2026-05-12 cs.CL cs.AI

The Silent Vote: Improving Zero-Shot LLM Reliability by Aggregating Semantic Neighborhoods

Sanket Badhe, Priyanka Tiwari, Deep Shah

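AI summary: This paper studies "renormalization bias", which arises from constrained decoding when large language models are used for zero-shot classification, and proposes Semantic Softmax, a method that recovers the lost probability mass by aggregating information from the semantic neighborhood of each target label, improving calibration and classification performance. Experiments on multiple datasets show lower expected calibration error and Brier score together with higher AUROC and Macro-F1, providing a more accurate and reliable approach to zero-shot classification.

Comments Accepted at GEM Workshop @ ACL 2026

Abstract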

Large Language Models are increasingly used as zero-shot classifiers in complex reasoning tasks. However, standard constrained decoding suffers from a phenomenon we define as Renormalization Bias. When a model is restricted to a small set of target labels, the standard softmax operation discards the probability mass assigned to semantic synonyms in the original distribution. This loss of information, which we call the Silent Vote, results in artificial overconfidence and poor calibration. We propose Semantic Softmax, an inference-time layer that recovers this lost information by aggregating the scores of the semantic neighborhood surrounding each target label. We evaluate this approach on Qwen-3 and Phi-4-mini models using GoEmotions and Civil Comments datasets. Our results demonstrate consistent improvements across all evaluation metrics: Semantic Softmax substantially reduces Expected Calibration Error (ECE) and Brier Score, while simultaneously enhancing discriminative performance in terms of AUROC and Macro-F1. By accounting for linguistic nuances, our method provides a more calibrated and accurate alternative for zero-shot classification.
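
One way to picture the aggregation described in this abstract is sketched below. Pooling each neighborhood with log-sum-exp (equivalent to summing the probability mass of its tokens) is an assumption, as are the function names, the example tokens, and the logit values; the abstract does not say how neighborhoods are constructed or how scores are combined.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a non-empty list."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def semantic_softmax(logits, neighborhoods):
    """Label probabilities after pooling each label's semantic neighborhood.

    logits: token -> logit from the unrestricted vocabulary distribution.
    neighborhoods: label -> non-empty list of tokens counted toward that label
    (the label itself plus its synonyms).
    """
    pooled = {label: logsumexp([logits[token] for token in tokens])
              for label, tokens in neighborhoods.items()}
    norm = logsumexp(list(pooled.values()))
    return {label: math.exp(score - norm) for label, score in pooled.items()}


# Hypothetical logits: the two target labels tie, but "happy" has synonym support.
logits = {"happy": 2.0, "joyful": 1.5, "glad": 1.0, "sad": 2.0}
plain = semantic_softmax(logits, {"happy": ["happy"], "sad": ["sad"]})
pooled = semantic_softmax(logits, {"happy": ["happy", "joyful", "glad"], "sad": ["sad"]})
```

With single-token neighborhoods the function reduces to the standard restricted softmax, so `plain` is an even split; in `pooled` the mass assigned to "joyful" and "glad" is counted toward "happy" instead of being discarded.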

2605.09737 2026-05-12 cs.LG

CALYREX: Cross-Attention LaYeR EXtended Transformers for System Prompt Anchoring

Li Lixing

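AI summary: Modern large language models rely on system prompts to set behavioral constraints and safety rules, but conventional causal self-attention treats privileged instructions and user content alike, leaving models vulnerable to prompt injection and instruction erosion over long contexts. This paper proposes CALYREX, an extended Transformer that structurally isolates and anchors rules through cross-attention between the input and the system prompt. Experiments show clear gains in instruction following and multi-turn instruction adherence and a lower success rate for prompt attacks, with the advantage growing as model scale increases.

Comments Preprint. 25 pages, 4 figures, 9 tables

Abstract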

Modern large language models (LLMs) rely on system prompts to establish behavioral constraints and safety rules. Standard causal self-attention treats privileged instructions and untrusted user content with equal structural priority -- a mismatch that leaves models vulnerable to prompt injection and instruction erosion over extended contexts. We propose CALYREX (Cross-Attention LaYeR EXtended transformers), which utilizes cross-attention between input and system prompt to structurally isolate and anchor the rule. A placement ablation on a 1.5B backbone identifies insertion at the final eighth of layers as optimal, confirmed by mechanistic activation analysis showing behavioral constraints are naturally concentrated there. At 8B scale, controlling for training data, backbone, and parameter budget, CALYREX yields $+7.4\%$ on instruction-following (IFEval) and $+16.3\%$ on multi-turn instruction adherence, while reducing many-shot jailbreaking attack success rate by $13\%$. This advantage appears to widen with model scale, consistent with larger models more effectively utilizing the dedicated routing pathway.

2605.09727 2026-05-12 cs.LG cs.AI

One for All: A Non-Linear Transformer can Enable Cross-Domain Generalization for In-Context Reinforcement Learning

Bowen He, Juncheng Dong, Lin Lin, Xiang Cheng

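AI summary: This paper studies how non-linear transformers can enable cross-domain generalization of in-context learning in reinforcement learning. Taking a kernel-based perspective, the authors connect non-linear transformers with kernel-based temporal difference learning and interpret the transformer as performing regression in a reproducing kernel Hilbert space, which lets value functions from different domains share weights. Experiments on multiple MetaWorld tasks show convergence of the temporal-difference objective, offering a new theoretical perspective on cross-task generalization in reinforcement learning.

Abstract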

A central challenge in reinforcement learning (RL) is to learn models that generalize beyond the tasks on which they are trained, a goal traditionally pursued through multi-task and meta RL. Recently, transformer architectures have emerged as a promising approach, enabling adaptation to new tasks via in-context learning without explicit parameter updates. From a functional perspective, a transformer can be viewed as a functional operator that maps a context to a task-specific function. It is thus fundamental to understand and design this operator to support stronger generalization in RL. In this work, we address this resulting question of generalization from a kernel-based perspective by establishing a connection between non-linear transformers and kernel-based temporal difference learning. By interpreting the transformer as performing regression in a Reproducing Kernel Hilbert Space (RKHS), we show that value functions from different domains can be represented using a shared set of weights, provided they lie within the same RKHS. Experiments on multiple MetaWorld domains support this interpretation, demonstrating convergence of the temporal-difference objective.

2605.09724 2026-05-12 cs.LG

Model Capacity Determines Grokking through Competing Memorisation and Generalisation Speeds

Yiding Song, Hanming Ye

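AI summary: This study examines how model capacity affects grokking, the phenomenon in which a model suddenly generalizes after overfitting the training set. It argues that model capacity does not directly determine when grokking appears but shapes it through a competition between memorization speed and generalization speed. Using an information-theoretic framework and experiments on modular arithmetic, it shows that grokking occurs near the parameter scale at which the memorization and generalization timescales intersect, clarifying the link between model capacity, data complexity, and learning dynamics.

Comments 23 pages, 10 figures, 12 tables

Abstract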

Existing accounts of grokking explain the phenomenon in terms of mechanistic frameworks such as circuit efficiency or lazy-to-rich transitions. However, despite a known dependence between grokking and model size, how model capacity shapes grokking remains an open question. We give an information-theoretic account of this relationship on the task of modular arithmetic, showing that grokking does not immediately occur when a model becomes large enough to memorise the training set, but rather emerges as the outcome of a competition between two measurable timescales: a memorisation speed $T_{\text{mem}}(P)$ and a generalisation speed $T_{\text{gen}}(P)$, both of which are functions of model parameter count $P$. Adapting the information capacity framework of Morris et al. (2025), we estimate $T_{\text{mem}}(P)$ on random-label data of equivalent complexity and $T_{\text{gen}}(P)$ on the modular task itself, and show that grokking emerges close to the parameter scale where these timescales intersect. The framework also suggests an empirical model for predicting memorisation speed given model capacity and dataset complexity, recovering the previously reported empirical observation that larger models memorise faster. Overall, we motivate the formalisation of different learning timescales as important abstractions to study when explaining how model capacity shapes grokking on algorithmic tasks.

2605.09722 2026-05-12 cs.LG

Benchmarking Transformer and xLSTM for Time-Series Forecasting of Heat Consumption

Marja Wahl, Daniel R. Bayer, Sven Rausch, Marco Pruckner

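AI summary: This paper evaluates Transformer and xLSTM models for short-term heat demand forecasting, using hourly heat consumption data from 25 German buildings and forecasting horizons of 3 and 24 hours. xLSTM achieves the best RMSE and the Temporal Fusion Transformer the best MAE, but these models have many parameters and long training times, which calls their sustainability into question. The paper further analyzes the trade-off between forecasting accuracy and computational resource demand and notes that low-parameter models such as a traditional fully connected network also forecast well, so the small accuracy gains of the newer models may come at a large resource cost.

Comments Version submitted to IEEE SusTech, 2026

Abstract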

Accurate short-term forecasting of heat demand is essential for operating district heating networks cost-efficiently and reliably. Heat consumption time series at the building level are highly dependent on exogenous variables such as outdoor temperature and individual usage patterns, making forecasting in this context a challenging task. This paper therefore benchmarks novel Transformer-based and xLSTM architectures for short-term heat-demand forecasting. Using hourly data from 25 German buildings (2017-2025), we compare three-hour and 24-hour forecasting horizons relevant for intraday control and day-ahead scheduling. We establish a multi-building benchmark that tests whether models trained on pooled, heterogeneous building data are able to generalize across diverse building stock. The results show that the xLSTM achieves the lowest RMSE (19.88 kWh for three-hour, 21.47 kWh for 24-hour forecasts), while the Temporal Fusion Transformer attains the best MAE (9.16 kWh for three-hour forecasts). As xLSTMs and Transformers require long training times and have a huge number of trainable parameters, their sustainability remains questionable. Therefore, this paper further investigates the trade-off between predictive accuracy and computational resource demand of the evaluated forecasting models. The findings indicate that low-parameter models such as a traditional fully connected network also achieve good predictive results, highlighting that the marginal accuracy gains of the novel prediction models come at substantial resource expense for this use case.

2605.09719 2026-05-12 cs.CV cs.AI

Distilling 3D Spatial Reasoning into a Lightweight Vision-Language Model with CoT

Alaa Asfour, Christopher Indris, Leihan Chen, Tejas Vyas, Guanghui Wang

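AI summary: This study proposes a knowledge distillation framework that transfers spatial reasoning from a large 3D vision-language model to a lighter model, substantially reducing computational cost. Using learnable latent reasoning tokens (Hidden CoT) and a multi-task distillation strategy, the method retains 54-72% of the teacher's performance while shrinking the model by 3x and cutting inference latency by 8.7x. The work is the first to apply latent reasoning in a distilled 3D vision-language model, enabling efficient 3D scene question answering.

Abstract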

Large-scale 3D vision-language models (VLMs) like LLaVA-3D offer strong spatial reasoning but are difficult to deploy due to high computational costs. We propose a knowledge distillation framework that transfers spatial reasoning from a 7B teacher to a 2.29B student model. Our approach achieves 8.7x lower inference latency and a 3x reduction in model size while retaining 54-72% of the teacher's performance. The framework utilizes VGGT as the vision encoder and a multi-task distillation pipeline with uncertainty-aware loss weighting. To improve reasoning without chain-of-thought (CoT) data, we introduce "Hidden CoT": learnable latent tokens that serve as an internal scratchpad before answer generation. This is the first use of latent scratchpad reasoning in distilled 3D VLMs. The student model jointly performs spatial description, depth estimation, and object detection. Experiments on ScanNet and 3D-FRONT show strong spatial understanding, reaching 68-72% accuracy on proximity and contact tasks. Our framework enables efficient 3D scene QA on resource-constrained platforms.

2605.09716 2026-05-12 cs.AI

Medical Model Synthesis Architectures: A Case Study

Katherine M. Collins, Marlene Berke, Ilia Sucholutsky, Ayman Ali, Adrian Weller, Timothy J. O'Donnell, Tyler Brooke-Wilson, Lionel Wong, Joshua B. Tenenbaum

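AI summary: This paper studies how to build AI systems that perform transparent, verifiable clinical reasoning under uncertainty to support doctors' clinical decisions. The authors propose MedMSA, a framework that uses language models to retrieve relevant medical knowledge and constructs a formal probabilistic model to support calibrated reasoning under uncertainty. In a preliminary experiment the method produces an uncertainty-weighted differential diagnosis list, showing its potential for clinical use and pointing toward safe clinical collaboration.

Comments Working paper

Abstract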

Medicine is rife with high-stakes uncertainty. Doctors routinely make clinical judgments and decisions that juggle many fundamental unknowns, like predictions about what might be causing a patient's symptoms or decisions about what treatment to try next. Despite increasing interest in developing AI systems that aid or even replace doctors in clinical settings, current systems struggle with calibrated reasoning under uncertainty, and are often deeply opaque about their reasoning. We propose a framework for AI systems that can make practically useful but formally transparent clinical predictions under uncertainty. Given a clinical situation, our framework (MedMSA) uses language models to retrieve relevant prior knowledge, but constructs a formal probabilistic model to support calibrated and verifiable inferences under uncertainty. We show how an initial proof-of-concept of this framework can be used for differential diagnosis, producing an uncertainty-weighted list of potential diagnoses that could explain a patient's symptoms, and discuss future applications and directions for applying this framework more generally for safe clinical collaborations.

2605.09708 2026-05-12 cs.LG cs.AI cs.DC

Metal-Sci: A Scientific Compute Benchmark for Evolutionary LLM Kernel Search on Apple Silicon

Víctor Gallego

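AI summary: This paper presents Metal-Sci, a scientific compute benchmark for evaluating evolutionary large language model (LLM) kernel search on Apple Silicon, comprising ten tasks across six optimization regimes. The benchmark comes with a lightweight harness that automatically compiles candidate kernels and measures their performance, feeding structured diagnostics back to a frozen LLM that drives the evolutionary search. Kernel search on an M1 Pro with Claude, Gemini, and GPT models yields speedups of up to 10.7x, and the paper proposes a scoring function based on a held-out configuration for detecting performance regressions at unseen sizes.

Comments Preprint

Abstract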

We present Metal-Sci, a 10-task benchmark of scientific Apple Silicon Metal compute kernels spanning six optimization regimes (stencils, all-pairs in $n$-body problems, multi-field Boltzmann, neighbor-list molecular dynamics, multi-kernel PDE, FFT). Each task ships a CPU reference, a roofline-anchored fitness function, and a held-out generalization size. We pair the benchmark with a lightweight harness for automatic kernel search that runtime-compiles each candidate, scores it against the roofline across multiple sizes, and feeds structured compile and per-size correctness diagnostics back to a frozen LLM driving a $(1{+}1)$ evolutionary loop. We report matched single-model sweeps of Claude Opus 4.7, Gemini 3.1 Pro, and GPT 5.5 on M1 Pro: in-distribution self-speedups span $1.00\times$ to $10.7\times$. Beyond raw speedup, our central methodological claim is structural: the held-out gate scoring function $Φ_\mathcal{T}$ (evaluated once at end-of-run on a configuration the agent never sees during search) functions as a cheap mechanical oversight primitive on this automatic search loop, catching e.g. an Opus template <uint D> HMC win that returns wrong samples at unseen dimensions, and a GPT FFT3D best that wins in-distribution at $2.95\times$ speedup but collapses to $0.23\times$ on a $256^3$ held-out cube, a silent regression that the in-distribution score alone cannot see. Code at https://github.com/vicgalle/metal-sci-kernels

2605.09707 2026-05-12 cs.LG cs.AI

Adaptive Data Harvesting for Efficient Neural Network Learning with Universal Constraints

Siteng Kang, Xinhua Zhang

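AI summary: This paper studies the difficulty of training neural networks that satisfy universal constraints over continuous domains, such as Lyapunov neural networks and physics-informed neural networks, where analytical solutions are usually unavailable or overly restrictive. The authors propose a reinforcement-learning-based adaptive data harvesting method that learns from data and experience to adjust the samples dynamically, improving training efficiency and constraint satisfaction. The method is validated on several tasks, demonstrating broad applicability to training settings that require adaptive input selection.

Comments Preprint

Abstract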

Training neural networks to satisfy universal constraints over continuous domains poses unique challenges. Common examples include Lyapunov Neural Networks (Lyapunov NNs) and Physics-Informed Neural Networks (PINNs), where analytical solutions are generally either unavailable or overly restrictive. Sample-based methods are therefore commonly used to enforce these constraints, and the choice of samples has a substantial impact on convergence speed, stability, and solution quality. Most existing methods rely on fixed heuristics or handcrafted rules, and are suboptimal in practice. In this paper, we aim to improve upon them by learning, from data and experience, how to dynamically and iteratively adjust the samples in response to the model's evolving learning performance. Trained by reinforcement learning, the learned policy improves empirical constraint satisfaction on test problems while significantly improving efficiency. We validate the approach on both Lyapunov NNs and PINNs, and demonstrate its broader applicability to domains where adaptive input selection is essential for effective training.

2605.09703 2026-05-12 cs.CV

MOTOR-Bench: A Real-world Dataset and Multi-agent Framework for Zero-shot Human Mental State Understanding

Xiaoyu Yuan, Niklas Heikkala, Tiina Törmänen, Hanna Järvenoja, Guoying Zhao, Haoyu Chen

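AI summary This paper presents MOTOR-Bench, a real-world dataset and multi-agent framework for zero-shot human mental state understanding. The dataset contains 1,440 multimodal video clips from collaborative learning scenarios, each annotated by educational experts based on self-regulated learning theory, to support structured analysis of complex interpersonal interactions. To address the weakness of existing methods in reasoning from observable behavior to deeper mental states, the study proposes the MOTOR-MAS multi-agent framework, which uses a structured coordination mechanism to improve prediction of behavior, cognition, and emotion labels; experiments show that it significantly outperforms existing methods on several metrics.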

Comments Accepted by CVPR 2026 workshop AI4RWC

Abstract

Understanding human mental states from natural behavior is crucial for intelligent systems in the real world. However, most current research focuses on predicting isolated mental state labels and lacks structured annotations of complex interpersonal interactions. To support structured analysis, we introduce MOTOR-Bench, a carefully designed benchmark built on a real-world dataset, MOTOR-dataset, which contains 1,440 multimodal video clips from collaborative learning scenarios and reflects key real-world data challenges including natural class imbalance, visual noise, and domain-specific language. Each sample is labeled by educational experts based on self-regulated learning theory. We further evaluate several state-of-the-art multimodal large language models and multi-agent systems in a zero-shot setting on MOTOR-Bench. Their performance on this task remains limited, suggesting that existing methods still struggle with structured reasoning from observable behavior to deeper mental states. To address this challenge, we propose a reasoning multi-agent framework named MOTOR-MAS. It coordinates multiple agents through a structured agent coordination mechanism to infer explicit behaviors, internal cognitions, and psychological emotions. Experimental results show that MOTOR-MAS outperforms the best single-model baseline by 15.93 points in Macro-F1 across the three labels of behavior, cognition, and emotion, and outperforms the general multi-agent baseline by 10.2 points in internal cognition prediction.

2605.09701 2026-05-12 cs.CV

DriveFuture: Future-Aware Latent World Models for Autonomous Driving

Yufeng Hong, Xiaotian Zhou, Yingyan Li, Xiangpo Zhou, Lin Liu, Yadan Luo, Shaoqing Xu, Lei Yang, Ziying Song

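AI summary DriveFuture is a future-aware latent world model for autonomous driving. Its core idea is to condition the modeling of the current latent state on future world states, thereby explicitly learning planning-oriented foresight. During training, the method predicts and refines future latent states, which serve as an explicit condition for a diffusion-based trajectory planner, and it achieves leading performance on several public benchmarks. The results indicate that conditioning current decisions on future states improves autonomous driving more than merely predicting future states.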

Comments 24 pages, 7 figures

Abstract

Existing latent world models for autonomous driving have opened a promising path toward future-aware driving intelligence. However, they typically treat future latent states as prediction targets or auxiliary signals, rather than directly conditioning trajectory planning. This can entangle current and future features in latent space. In this work, we propose DriveFuture, a future-aware latent world modeling framework for autonomous driving that explicitly learns planning-oriented foresight by conditioning the current latent state modeling process on future world states. Specifically, during training, the model first predicts future latent world states from the current latent state and ego action, and then refines the prediction against the ground-truth future latent state via cross-attention. The resulting future-aware latent serves as an explicit condition for a diffusion-based trajectory planner. During inference, DriveFuture conditions on the predicted future latent state instead of the ground-truth future state. DriveFuture achieves SOTA performance on the public NAVSIM benchmarks, reaching 55.5 EPDMS on NAVSIM-v2 navhard, 89.9 EPDMS on NAVSIM-v2 navtest, and 90.7 PDMS on NAVSIM-v1 navtest, respectively. These results suggest that the key to latent world modeling lies not merely in simulating future states, but more importantly in conditioning current decision-making on future states. Notably, as of April 2026, DriveFuture ranks 1st on the NAVSIM-v2 navhard leaderboard (https://huggingface.co/spaces/AGC2025/e2e-driving-navhard) and achieves SOTA performance on NAVSIM-v1 navtest (https://huggingface.co/spaces/AGC2024-P/e2e-driving-navtest).

2605.09698 2026-05-12 cs.AI

Ambig-DS: A Benchmark for Task-Framing Ambiguity in Data-Science Agents

Josefa Lia Stoisser, Marc Boubnovski Martell, Sidsel Boldsen, Kaspar Märtens, Robert Kitchen

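AI summary As data-science agents shift from assistive tools to autonomous systems, silent errors in task framing become a key failure mode. This paper proposes the Ambig-DS benchmark for evaluating data-science agents when the prediction target or evaluation objective is ambiguous; it comprises two diagnostic suites built on DSBench and MLE-bench respectively. The study finds that agents often submit wrong answers without the task having been clarified, rather than making execution errors, and that performance improves markedly when they are allowed to ask questions. However, agents struggle to judge when to ask, which reflects how current evaluations overlook the ability to recognize task framing.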

Abstract

As data-science agents shift from co-pilots to auto-pilots, silent misframing becomes a critical failure mode. Agents quietly commit to plausible but unintended task framings, producing clean, executable artifacts that hide their incorrect assessment of the task. Existing benchmarks score whether the pipeline runs, ignoring whether the agent recognized the task was underspecified. We introduce Ambig-DS, two diagnostic suites: one for prediction-target ambiguity (Ambig-DS-Target, 51 tasks built on DSBench, a tabular modeling benchmark) and one for evaluation-objective ambiguity (Ambig-DS-Objective, 61 tasks built on MLE-bench, a Kaggle-style ML competition benchmark), constructed so that scoring uses each source benchmark's original evaluator. For every task we pair the original, fully specified version with an ambiguous variant produced by controlled edits; a human-and-LLM verification pipeline confirms each variant admits multiple plausible interpretations with decision-relevant consequences. The suites are analyzed independently and ambiguity lowers performance in both. Across five agents spanning efficient to frontier-class models, we find in our controlled diagnostic setting: (i) failures are silent commitments: wrong-target submissions on Target, wrong-metric or non-committal baseline submissions on Objective, rather than execution errors; (ii) allowing the agent to ask one clarifying question recovers much of the loss under idealized conditions, suggesting missing framing information drives a substantial part of the observed degradation; but (iii) agents cannot reliably tell when to use it: permissive prompts induce over-asking on clear tasks, while conservative prompts induce silent defaulting on ambiguous ones. Recognizing target and objective underspecification, not pipeline execution, is the bottleneck missing from standard DS-agent evaluations.

2605.09696 2026-05-12 cs.LG cs.NE cs.SC

Discovery of Nonlinear Dynamics with Automated Basis Function Generation

Mohammad Amin Basiri, Charles Nicholson

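AI summary Discovering governing equations from observational data is a fundamental challenge in scientific modeling, especially when the underlying mathematical structure is unknown. This paper proposes AutoSINDy, a hybrid framework that combines the exploratory power of symbolic regression with the sparsity-promoting ability of SINDy, and uses staged automatic basis-function generation and filtering to improve the accuracy and robustness of model discovery. Experiments show that the method still recovers the ground-truth equations efficiently under high noise and significantly outperforms traditional methods.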

Comments 53 pages, 17 figures. Code available at https://github.com/mabasiri95/AutoSINDy

Abstract

Discovering governing equations from observational data remains a fundamental challenge in scientific modeling, particularly when the underlying mathematical structure is unknown. Traditional sparse identification methods like SINDy excel at discovering parsimonious models but require researchers to specify candidate basis functions a priori, a limitation that often leads to model failure when critical terms are omitted or when systems exhibit unconventional dynamics. Purely symbolic regression approaches offer unlimited flexibility but struggle with noise sensitivity and frequently produce overly complex, unstable equations. We present AutoSINDy, a hybrid Discovery-then-Solve framework that combines the exploratory power of symbolic regression with the robust sparsity-promoting capabilities of SINDy. Our method operates in three stages: (1) PySR-based symbolic regression discovers candidate functional forms from bootstrapped data chunks; (2) a curation pipeline decomposes, expands, and filters these expressions using collinearity analysis to construct a minimal yet comprehensive library; and (3) SINDy identifies sparse governing equations from this custom-tailored library. Extensive experiments across canonical nonlinear systems demonstrate that AutoSINDy consistently recovers ground-truth equations even under high observational noise, achieving a ground-truth recovery rate of 92.8% across all trials. Compared with standard SINDy using enriched libraries and standalone symbolic regression, AutoSINDy achieves higher predictive accuracy, superior generalization to unseen trajectories, and substantially lower symbolic complexity.
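Stage (3) of the pipeline above, sparse regression over a candidate library, can be illustrated with a small self-contained sketch. This is not the AutoSINDy code: it uses sequentially thresholded least squares, a commonly used SINDy optimizer (whether AutoSINDy uses this exact optimizer is an assumption), and the library `[1, x, x^2]` is written by hand here rather than discovered by symbolic regression. All function names are hypothetical.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def least_squares(theta, dxdt, active):
    """Fit dxdt using only the active library columns (normal equations)."""
    ata = [[sum(row[i] * row[j] for row in theta) for j in active] for i in active]
    atb = [sum(row[i] * y for row, y in zip(theta, dxdt)) for i in active]
    return solve_linear(ata, atb)

def stlsq(theta, dxdt, threshold=0.1, iters=10):
    """Sequentially thresholded least squares: fit, prune small terms, refit."""
    n_terms = len(theta[0])
    active = list(range(n_terms))
    coef = [0.0] * n_terms
    for _ in range(iters):
        if not active:
            break
        fit = least_squares(theta, dxdt, active)
        coef = [0.0] * n_terms
        for idx, c in zip(active, fit):
            coef[idx] = c
        kept = [i for i in active if abs(coef[i]) >= threshold]
        if kept == active:
            break
        active = kept
    # Zero out any coefficient whose term was pruned.
    return [c if i in active else 0.0 for i, c in enumerate(coef)]

# Noise-free toy data from dx/dt = -2x, with candidate library [1, x, x^2].
xs = [0.5 + 0.1 * k for k in range(16)]
theta = [[1.0, x, x * x] for x in xs]
dxdt = [-2.0 * x for x in xs]
coef = stlsq(theta, dxdt)
print(coef)
```

On this toy problem only the coefficient of `x` should survive thresholding. The quality of the result depends entirely on whether the library contains the right terms, which is the gap the discovery and curation stages are meant to close.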

2605.09693 2026-05-12 cs.CV cs.AI cs.LG

Do multimodal models imagine electric sheep?

Santhosh Kumar Ramakrishnan, Carl Vondrick, Raja Giryes, Philipp Krähenbühl, Vladlen Koltun

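AI summary This study investigates whether multimodal models form mental imagery when solving spatial puzzles, and finds that large multimodal models do develop an imagination-like process when solving tasks such as jigsaw and block puzzles, even "imagining" sheep when solving sheep-related puzzles. By fine-tuning a Qwen3.5 vision-language model on a variety of visual reasoning tasks, the study finds that the model spontaneously forms visual representations of intermediate states while carrying out actions. Building on this finding, it proposes two methods to sharpen and exploit the model's internal visual representations, significantly improving solve rates.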

Abstract

Yes. We find that large multimodal models develop mental imagery when solving spatial puzzles, and they do imagine sheep when solving sheep puzzles. We fine-tune a Qwen3.5 VLM to solve twelve diverse visual reasoning tasks -- including tangram, jigsaw, sokoban, 3D mental rotation, and rush hour -- that require understanding geometry, spatial relationships, and the consequences of actions. By supervising the model to predict the open-loop sequence of actions to solve a puzzle from an initial state, we show that the model's activations after each action encode meaningful visual information about the intermediate state. This finding suggests that an imperfect visual world model begins to form as a byproduct of learning to select correct actions, in the absence of any explicit visual supervision. Building on this observation, we propose two ways to sharpen and use the mental images formed by the model. We find that integrating as few as sixteen visual tokens per step into the chain of thought improves the average solve rate from 83% to 89%, with particularly strong gains on reasoning-heavy tasks such as jigsaw and 3D mental rotation.

2605.09691 2026-05-12 cs.LG

Quantum Circuit Simulation of Compartmental Drug Dynamics: Leveraging Variational Algorithms for Nonlinear Mixed-Effects Population Pharmacokinetics

Isshaan Singh, Nandan Patel

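AI summary This paper reformulates a traditional pharmacokinetic/pharmacodynamic (PK/PD) model as an open quantum system and simulates it with quantum circuits to improve the statistical performance of population pharmacokinetic modeling. Four pharmacological compartments are encoded using twelve qubits, and controlled quantum operations emulate the stochastic transitions between compartments. Experiments show that the quantum approach achieves better log-likelihood values than the classical approach while keeping parameter estimates consistent, which supports the model's statistical fit and numerical stability and offers a new hybrid quantum-classical modeling approach for biomedicine.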

Abstract

Population pharmacokinetic/pharmacodynamic (PK/PD) modeling traditionally relies on classical ordinary differential equations to simulate drug dynamics. In this work, we reformulate a compartmental PK/PD model as an open quantum system and implement it using quantum circuits developed in PennyLane. Four pharmacological compartments (central, peripheral, effect-site, and response) are encoded using twelve qubits, with inter-compartmental transitions represented through controlled quantum operations that emulate stochastic dynamics. The framework is evaluated on Phase 1 clinical data using a quantum-enhanced stochastic approximation expectation-maximization (SAEM) approach. Compared with the classical implementation, the quantum model achieves substantially improved log-likelihood values, indicating stronger statistical fit while preserving identical parameter estimates, thereby validating numerical consistency and model interpretability. The quantum-based optimization converges faster in terms of iterations, although total runtime is increased due to current simulation overhead. The study demonstrates stable large-scale simulation performance and establishes a hybrid quantum-classical approach that maintains biological fidelity while improving statistical modeling capacity. The dataset and problem statement originate from the Quantum Innovation Challenge 2025, and additional details are provided via the associated link.

2605.09688 2026-05-12 cs.CV

ConFixGS: Learning to Fix Feedforward 3D Gaussian Splatting with Confidence-Aware Diffusion Priors in Driving Scenes

Rui Song, Tianhui Cai, Markus Gross, Xingcheng Zhou, Zewei Zhou, Zhiyu Huang, Olaf Wysocki, Jiaqi Ma

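AI summary This paper proposes ConFixGS, a method for repairing feedforward 3D Gaussian Splatting (3DGS) reconstructions in driving scenes. Using confidence-aware diffusion priors, it generates local pseudo-targets and cross-checks them by reprojection against support views, making reconstructed details more reliable and suppressing inconsistent information. Experiments show that ConFixGS significantly improves novel view synthesis on several datasets, with PSNR gains of up to 3.68 dB and FID reduced by nearly half, demonstrating its effectiveness for robust reconstruction in driving scenes.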

Comments 28 pages, 12 figures

Abstract

Feedforward 3D Gaussian Splatting (3DGS) often struggles in trajectory-based sparse-view driving scenes. Existing Gaussian repair methods mainly target optimization-based 3DGS, while diffusion-based repair is typically restricted to iterative refinement near observed viewpoints, leaving feedforward 3DGS repair underexplored. We propose ConFixGS, a plug-and-play method that learns to fix feedforward 3DGS with confidence-aware diffusion priors. Starting from a pretrained feedforward model, ConFixGS generates diffusion-enhanced local pseudo-targets and validates them through reprojection-based cross-checking against support views. The resulting dense confidence maps guide refinement, enhancing reliable details while suppressing hallucinated or inconsistent evidence. On Waymo, nuScenes, and KITTI, ConFixGS improves challenging novel view synthesis, with PSNR gains of up to 3.68 dB and FID reduced by nearly half. Our results highlight confidence-aware fusion of generative priors and support-view consistency as a key principle for robust feedforward 3D driving scene reconstruction.

2605.09687 2026-05-12 cs.CV

Spatial-Frequency Gated Swin Transformer for Remote Sensing Single-Image Super-Resolution

Md Aminur Hossain, Parekh Valkesh, Ayush V. Patel, Yogesh Jethani, Sanjay K. Singh, Biplab Banerjee

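AI summary This paper studies remote sensing single-image super-resolution, which aims to reconstruct high-resolution images from low-resolution observations while preserving fine spatial structures. To address the weakness of existing Swin Transformer models in detail reconstruction, the authors propose a Spatial-Frequency Gated Swin Transformer (SFG-SwinSR), which introduces a spatial-frequency gating module into the feed-forward network to separate low-frequency structural content from high-frequency residual detail and thereby improve reconstruction quality. Experiments show that the method achieves better PSNR and SSIM on several remote sensing datasets and effectively enhances detail in high-resolution images.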

Comments 15 pages

Abstract

Remote Sensing (RS) single-image super-resolution aims to reconstruct high-resolution imagery from low-resolution observations while preserving fine spatial structures. Recent Swin Transformer-based models, including Swin2SR, provide strong spatial context modeling through shifted-window self-attention, but their feed-forward networks remain generic channel-mixing modules and do not separate low-frequency structural content from high-frequency residual detail. To address this limitation, we propose SFG-SwinSR, a Spatial-Frequency Gated Swin Transformer for single-image super-resolution in remote sensing. SFG-SwinSR modifies the original Swin2SR attention block by replacing each transformer block's standard feed-forward network with a lightweight Spatial-Frequency Gated Feed-Forward Network (SFG-FFN). The module estimates low-frequency content via a depthwise-blur branch, extracts high-frequency residuals by subtraction, refines them with a lightweight spatial branch, and adaptively injects detail through a bottleneck gate. Experiments on SpaceNet and SEN2VENμS show that SFG-SwinSR improves reconstruction quality under the evaluated settings. On SpaceNet, it achieves 45.19 dB PSNR and 0.9852 SSIM, indicating effective enhancement of high-frequency details. This demonstrates that spatial-frequency transformation within the transformer feed-forward network improves detail reconstruction in RS super-resolution.
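The blur-then-subtract frequency split described in this abstract can be illustrated on a 1D signal. This is a rough sketch in plain Python, not the authors' SFG-FFN: the box blur stands in for the depthwise-blur branch, the fixed scalar `gate` stands in for the learned bottleneck gate, the spatial refinement branch is omitted, and adding the gated residual back to the input is an assumed form of "injecting detail". All names are hypothetical.

```python
def box_blur(signal, radius=1):
    """Low-frequency estimate: mean over a window, truncated at the edges."""
    n = len(signal)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        window = signal[lo:hi]
        out.append(sum(window) / len(window))
    return out

def split_and_gate(signal, gate=0.5, radius=1):
    """Split into low/high frequency parts, then add back gated detail."""
    low = box_blur(signal, radius)                      # smooth structure
    high = [x - l for x, l in zip(signal, low)]         # residual detail
    out = [x + gate * h for x, h in zip(signal, high)]  # gated injection
    return low, high, out

# A step edge: the residual is nonzero only near the transition.
signal = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
low, high, out = split_and_gate(signal, gate=0.5)
print("low: ", [round(v, 3) for v in low])
print("high:", [round(v, 3) for v in high])
print("out: ", [round(v, 3) for v in out])
```

Because the residual is computed by subtraction, `low + high` reproduces the input exactly, so the gate controls only how much extra edge detail is added and never removes structural content.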

2605.09685 2026-05-12 cs.LG cs.AI

Learning Unified Representations of Normalcy for Time Series Anomaly Detection

Prithul Sarker, Sushmita Sarker, Nicholas G. Murray, Alireza Tavakkoli

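AI summary This paper studies the core problem in unsupervised time series anomaly detection: how to learn a robust representation that distinguishes the normal data distribution without prior knowledge of anomaly characteristics. The authors propose a new unified unsupervised anomaly detection framework, $\text{U}^2\text{AD}$, which learns the underlying distribution of normal data with score-based generative modeling and introduces a time-dependent score network and a unified training objective to capture both local and global temporal context. Experiments show that the method outperforms existing state-of-the-art methods in both detection accuracy and early identification of anomalies.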

Abstract

The core challenge in unsupervised anomaly detection is identifying abnormal patterns without prior knowledge of their characteristics. While existing methods have addressed aspects of this problem, they often struggle to learn a robust representation of the normal data distribution that is distinct from anomalous patterns. In this paper, we present a novel framework, Unified Unsupervised Anomaly Detection ($\text{U}^2\text{AD}$), that comprehensively addresses anomaly detection in multivariate time series. Our approach learns the underlying data distribution of normal samples by utilizing score-based generative modeling. We introduce a novel time-dependent score network and a unified training objective that together delineate the manifold of normal data while considering both local and global temporal contexts. Reconstruction is then performed via a deterministic sampling process using an ordinary differential equation solver. Our extensive experimental evaluations demonstrate that $\text{U}^2\text{AD}$ not only outperforms current state-of-the-art methods in detection accuracy but also identifies anomalies at significantly earlier stages of their occurrence.

2605.09681 2026-05-12 cs.CV

Forcing-KV: Hybrid KV Cache Compression for Efficient Autoregressive Video Diffusion Models

Yicheng Ji, Zhizhou Zhong, Jun Zhang, Qin Yang, XiTai Jin, Ying Qin, Wenhan Luo, Shuiyang Mao, Wei Liu, Huan Li

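AI summary This paper targets the high attention complexity and memory overhead caused by redundant key-value (KV) caches in autoregressive video diffusion models and proposes Forcing-KV, a hybrid KV cache compression method. By analyzing the functional characteristics of attention heads in mainstream models, it divides heads into static heads, which attend to intra-frame detail and transitions between chunks, and dynamic heads, which govern inter-frame motion and consistency, and applies structured pruning and segment-similarity-based dynamic pruning to them respectively. While preserving generation quality, the method significantly speeds up generation and reduces memory use, reaching over 29 frames per second on a single NVIDIA H200 GPU.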

Comments 10 pages

详情
英文摘要

Autoregressive (AR) video diffusion models adopt a streaming generation framework, enabling long-horizon video generation with real-time responsiveness, as exemplified by the Self Forcing training paradigm. However, existing AR video diffusion models still suffer from significant attention complexity and severe memory overhead due to the redundant key-value (KV) caches across historical frames, which limits scalability. In this paper, we tackle this challenge by introducing KV cache compression into autoregressive video diffusion. We observe that attention heads in mainstream AR diffusion models exhibit markedly distinct attention patterns and functional roles that remain stable across samples and denoising steps. Building on our empirical study of head-wise functional specialization, we divide the attention heads into two categories: static heads, which focus on transitions across autoregressive chunks and intra-frame fidelity, and dynamic heads, which govern inter-frame motion and consistency. We then propose Forcing-KV, a hybrid KV cache compression strategy that performs structured static pruning for static heads and dynamic pruning based on segment-wise similarity for dynamic heads. While maintaining output quality, our method achieves a generation speed of over 29 frames per second on a single NVIDIA H200 GPU along with 30% cache memory reduction, delivering up to 1.35x and 1.50x speedups on LongLive and Self Forcing at 480P resolution, and further scaling to 2.82x speedup at 1080P resolution. Code and demo videos are provided at https://zju-jiyicheng.github.io/Forcing-KV-Page.

2605.09679 2026-05-12 cs.CV cs.AI

DeepTumorVQA: A Hierarchical 3D CT Benchmark for Stage-Wise Evaluation of Medical VLMs and Tool-Augmented Agents

Yixiong Chen, Wenjie Xiao, Pedro R. A. S. Bassi, Boyan Wang, Liang He, Xinze Zhou, Sezgin Er, Ibrahim Ethem Hamamci, Zongwei Zhou, Alan Yuille

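AI summary DeepTumorVQA is a hierarchical 3D CT benchmark for medical imaging, designed for stage-wise evaluation of medical vision-language models (VLMs) and tool-augmented agents. The benchmark decomposes the reasoning process in tumor diagnosis into four stages (recognition, measurement, visual reasoning, and medical reasoning) so that model performance at each level can be evaluated independently. The study also introduces a tool-interaction environment that lets models call external tools for segmentation, measurement, and medical knowledge, bringing the evaluation closer to real clinical settings. Experiments show that tool augmentation significantly improves model performance on complex medical reasoning tasks.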

Abstract

Medical vision-language models (VLMs) and AI agents have made significant progress in learning to analyze and reason about clinical images. However, existing medical visual question answering (VQA) benchmarks collapse model capabilities into a single accuracy score, obscuring where and why models fail. We propose DeepTumorVQA, a hierarchical benchmark that follows the multi-stage evidence chain in tumor diagnosis and decomposes 3D CT reasoning into four stages: recognition, measurement, visual reasoning, and medical reasoning. Higher-level questions remain independently scorable, while their ground-truth evidence chains are defined over lower-level primitives. The benchmark contains 476K questions across 42 clinical subtypes on 9,262 3D CT volumes. In addition to a direct reasoning mode for VLMs, DeepTumorVQA provides tool-interaction environments for agent evaluation, where a model can call external tools, including segmentation models, measurement programs, and medical knowledge modules, before answering the question. Evaluating over 30 model configurations, we find that reliable quantitative measurement is the primary bottleneck, making later-stage visual and medical reasoning harder for VLMs, while tool augmentation substantially mitigates this issue. When tools are available, leveraging medical knowledge and tools to reason about medical images becomes a new challenge. We further show that ground-truth step-by-step tool-use traces from DeepTumorVQA can supervise agents and reduce tool-use and reasoning failures. This stage-wise progression from recognition to measurement to visual and medical reasoning provides a concrete roadmap for future medical VLM and AI agent studies. All data and code are released at https://github.com/Schuture/DeepTumorVQA.