arXivDaily arXiv每日学术速递 周一至周五更新
全部学科分类 1433
专题追踪
2510.00771 2026-02-06 eess.AS cs.AI cs.SD eess.SP

UniverSR: Unified and Versatile Audio Super-Resolution via Vocoder-Free Flow Matching

Woongjib Choi, Sangmin Lee, Hyungseob Lim, Hong-Goo Kang

Comments Accepted to ICASSP 2026

详情
英文摘要

In this paper, we present a vocoder-free framework for audio super-resolution that employs a flow matching generative model to capture the conditional distribution of complex-valued spectral coefficients. Unlike conventional two-stage diffusion-based approaches that predict a mel-spectrogram and then rely on a pre-trained neural vocoder to synthesize waveforms, our method directly reconstructs waveforms via the inverse Short-Time Fourier Transform (iSTFT), thereby eliminating the dependence on a separate vocoder. This design not only simplifies end-to-end optimization but also overcomes a critical bottleneck of two-stage pipelines, where the final audio quality is fundamentally constrained by vocoder performance. Experiments show that our model consistently produces high-fidelity 48 kHz audio across diverse upsampling factors, achieving state-of-the-art performance on both speech and general audio datasets.

2508.06616 2026-02-06 cs.NI cs.AI

Generative AI for Intent-Driven Network Management in 6G RAN: A Case Study on the Mamba Model

Md Arafat Habib, Medhat Elsayed, Yigit Ozcan, Pedro Enrique Iturria-Rivera, Majid Bavand, Melike Erol-Kantarci

Comments Paper submitted to IEEE for possible publication. The contents of this paper may change at any time

详情
英文摘要

With the emergence of 6G, mobile networks are becoming increasingly heterogeneous and dynamic, necessitating advanced automation for efficient management. Intent-Driven Networks (IDNs) address this by translating high-level intents into optimization policies. Large Language Models (LLMs) can enhance this process by understanding complex human instructions, enabling adaptive and intelligent automation. Given the rapid advancements in Generative AI (GenAI), a comprehensive survey of LLM-based IDN architectures in disaggregated Radio Access Network (RAN) environments is both timely and critical. This article provides such a survey, along with a case study on a selective State-Space Model (SSM)-enabled IDN architecture that integrates GenAI across three key stages: intent processing, intent validation, and intent execution. For the first time in the literature, we propose a hierarchical framework built on Mamba-SSM that introduces GenAI across all stages of the IDN pipeline. We further present a case study demonstrating that the proposed Mamba architecture significantly improves network performance through intelligent automation, surpassing existing IDN approaches. In a multi-cell 5G/6G scenario, the proposed architecture reduces quality of service drift by up to 70%, improves throughput by up to 80 Mbps, and lowers inference time to 60-70 ms, outperforming GenAI, reinforcement learning, and non-machine learning baselines.

2507.16838 2026-02-06 eess.AS cs.AI cs.CL

Segmentation-free Goodness of Pronunciation

Xinwei Cao, Zijian Fan, Torbjørn Svendsen, Giampiero Salvi

Comments The article has been accepted for publication by IEEE TASLPRO

详情
英文摘要

Mispronunciation detection and diagnosis (MDD) is a significant part in modern computer-aided language learning (CALL) systems. Most systems implementing phoneme-level MDD through goodness of pronunciation (GOP), however, rely on pre-segmentation of speech into phonetic units. This limits the accuracy of these methods and the possibility to use modern CTC-based acoustic models for their evaluation. In this study, we first propose self-alignment GOP (GOP-SA) that enables the use of CTC-trained ASR models for MDD. Next, we define a more general segmentation-free method that takes all possible segmentations of the canonical transcription into account (GOP-SF). We give a theoretical account of our definition of GOP-SF, an implementation that solves potential numerical issues as well as a proper normalization which allows the use of acoustic models with different peakiness over time. We provide extensive experimental results on the CMU Kids and speechocean762 datasets comparing the different definitions of our methods, estimating the dependency of GOP-SF on the peakiness of the acoustic models and on the amount of context around the target phoneme. Finally, we compare our methods with recent studies over the speechocean762 data showing that the feature vectors derived from the proposed method achieve state-of-the-art results on phoneme-level pronunciation assessment.

2506.22949 2026-02-06 cs.CR cs.AI cs.LG

A Study on Semi-Supervised Detection of DDoS Attacks under Class Imbalance

Ehsan Hallaji, Vaishnavi Shanmugam, Roozbeh Razavi-Far, Mehrdad Saif

Comments Accepted for publication in IEEE CCECE 2025

Journal ref IEEE Canadian Conference on Electrical and Computer Engineering (CCECE), Vancouver, BC, Canada, 2025, pp. 507-511

详情
英文摘要

One of the most difficult challenges in cybersecurity is eliminating Distributed Denial of Service (DDoS) attacks. Automating this task using artificial intelligence is a complex process due to the inherent class imbalance and lack of sufficient labeled samples of real-world datasets. This research investigates the use of Semi-Supervised Learning (SSL) techniques to improve DDoS attack detection when data is imbalanced and partially labeled. In this process, 13 state-of-the-art SSL algorithms are evaluated for detecting DDoS attacks in several scenarios. We evaluate their practical efficacy and shortcomings, including the extent to which they work in extreme environments. The results will offer insight into designing intelligent Intrusion Detection Systems (IDSs) that are robust against class imbalance and handle partially labeled data.

2506.19806 2026-02-06 cs.CY cs.CL cs.MA

LLM-Based Social Simulations Require a Boundary

Zengqing Wu, Run Peng, Takayuki Ito, Makoto Onizuka, Chuan Xiao

详情
英文摘要

This position paper argues that LLM-based social simulations require clear boundaries to make meaningful contributions to social science. While Large Language Models (LLMs) offer promising capabilities for simulating human behavior, their tendency to produce homogeneous outputs, acting as an "average persona", fundamentally limits their ability to capture the behavioral diversity essential for complex social dynamics. We examine why heterogeneity matters for social simulations and how current LLMs fall short, analyzing the relationship between mean alignment and variance in LLM-generated behaviors. Through a systematic review of representative studies, we find that validation practices often fail to match the heterogeneity requirements of research questions: while most papers include ground truth comparisons, fewer than half explicitly assess behavioral variance, and most that do report lower variance than human populations. We propose that researchers should: (1) match validation depth to the heterogeneity demands of their research questions, (2) explicitly report variance alongside mean alignment, and (3) constrain claims to collective-level qualitative patterns when variance is insufficient. Rather than dismissing LLM-based simulation, we advocate for a boundary-aware approach that ensures these methods contribute genuine insights to social science.

2506.17208 2026-02-06 cs.SE cs.AI cs.CL

Dissecting the SWE-Bench Leaderboards: Profiling Submitters and Architectures of LLM- and Agent-Based Repair Systems

Matias Martinez, Xavier Franch

Comments Part of this work (RQ1) has been published at the 2026 IEEE/ACM 48th International Conference on Software Engineering (ICSE-SEIP 2026), DOI: 10.1145/3786583.3786904. The published version is also available on arXiv at arXiv:2602.04449

详情
英文摘要

The rapid progress in Automated Program Repair (APR) has been driven by advances in AI, particularly large language models (LLMs) and agent-based systems. SWE-Bench is a recent benchmark designed to evaluate LLM-based repair systems using real issues and pull requests mined from 12 popular open-source Python repositories. Its public leaderboards -- SWE-Bench Lite and SWE-Bench Verified -- have become central platforms for tracking progress and comparing solutions. However, because the submission process does not require detailed documentation, the architectural design and origin of many solutions remain unclear. In this paper, we present the first comprehensive study of all submissions to the SWE-Bench Lite (79 entries) and Verified (99 entries) leaderboards, analyzing 80 unique approaches across dimensions such as submitter type, product availability, LLM usage, and system architecture. Our findings reveal the dominance of proprietary LLMs (especially Claude 3.5), the presence of both agentic and non-agentic designs, and a contributor base spanning from individual developers to large tech companies.

2506.08520 2026-02-06 eess.IV cs.CV

Plug-and-play linear attention with provable guarantees for training-free image restoration

Srinivasan Kidambi, Karthik Palaniappan, Pravin Nair

详情
英文摘要

Multi-head self-attention (MHSA) is a key building block in modern vision Transformers, yet its quadratic complexity in the number of tokens remains a major bottleneck for real-time and resource-constrained deployment. We present PnP-Nystra, a training-free Nyström-based linear attention module designed as a plug-and-play replacement for MHSA in {pretrained} image restoration Transformers, with provable kernel approximation error guarantees. PnP-Nystra integrates directly into window-based architectures such as SwinIR, Uformer, and Dehazeformer, yielding efficient inference without finetuning. Across denoising, deblurring, dehazing, and super-resolution on images, PnP-Nystra delivers $1.8$--$3.6\times$ speedups on an NVIDIA RTX 4090 GPU and $1.8$--$7\times$ speedups on CPU inference. Compared with the strongest training-free linear-attention baselines we evaluate, our method incurs the smallest quality drop and stays closest to the original model's outputs.

2505.19447 2026-02-06 eess.IV cs.CV

A Contrastive Learning Foundation Model Based on Perfectly Aligned Sample Pairs for Remote Sensing Images

Hengtong Shen, Haiyan Gu, Haitao Li, Yi Yang, Agen Qiu

Comments This article has been accepted for publication in Geo-spatial Information Science, published by Taylor & Francis

详情
英文摘要

Self-Supervised Learning (SSL) enables us to pre-train foundation models without costly labeled data. Among SSL methods, Contrastive Learning (CL) methods are better at obtaining accurate semantic representations in noise interference. However, due to the significant domain gap, while CL methods have achieved great success in many computer vision tasks, they still require specific adaptation for Remote Sensing (RS) images. To this end, we present a novel self-supervised method called PerA, which produces all-purpose RS features through semantically Perfectly Aligned sample pairs. Specifically, PerA obtains features from sampled views by applying spatially disjoint masks to augmented images rather than random cropping. Our framework provides high-quality features by ensuring consistency between teacher and student and predicting learnable mask tokens. Compared to previous contrastive methods, our method demonstrates higher memory efficiency and can be trained with larger batches due to its sparse inputs. Additionally, the proposed method demonstrates remarkable adaptability to uncurated RS data and reduce the impact of the potential semantic inconsistency. We also collect an unlabeled pre-training dataset, which contains about 5 million RS images. We conducted experiments on multiple downstream task datasets and achieved performance comparable to previous state-of-the-art methods with a limited model scale, demonstrating the effectiveness of our approach. We hope this work will contribute to practical remote sensing interpretation works.

2505.14410 2026-02-06 eess.AS cs.CL cs.SD

Pairwise Evaluation of Accent Similarity in Speech Synthesis

Jinzuomu Zhong, Suyuan Liu, Dan Wells, Korin Richmond

Comments Accepted by INTERSPEECH 2025

详情
英文摘要

Despite growing interest in generating high-fidelity accents, evaluating accent similarity in speech synthesis has been underexplored. We aim to enhance both subjective and objective evaluation methods for accent similarity. Subjectively, we refine the XAB listening test by adding components that achieve higher statistical significance with fewer listeners and lower costs. Our method involves providing listeners with transcriptions, having them highlight perceived accent differences, and implementing meticulous screening for reliability. Objectively, we utilise pronunciation-related metrics, based on distances between vowel formants and phonetic posteriorgrams, to evaluate accent generation. Comparative experiments reveal that these metrics, alongside accent similarity, speaker similarity, and Mel Cepstral Distortion, can be used. Moreover, our findings underscore significant limitations of common metrics like Word Error Rate in assessing underrepresented accents.

2505.06085 2026-02-06 cs.PF cs.AI cs.AR

Assessing Tenstorrent's RISC-V MatMul Acceleration Capabilities

Hiari Pizzini Cavagna, Daniele Cesarini, Andrea Bartolini

Comments Accepted to the Computational Aspects of Deep Learning Workshop at ISC High Performance 2025. To appear in the ISC High Performance 2025 Workshop Proceedings

Journal ref Proc. High Performance Computing: ISC High Performance 2025 International Workshops, Hamburg, Germany

详情
英文摘要

The increasing demand for generative AI as Large Language Models (LLMs) services has driven the need for specialized hardware architectures that optimize computational efficiency and energy consumption. This paper evaluates the performance of the Tenstorrent Grayskull e75 RISC-V accelerator for basic linear algebra kernels at reduced numerical precision, a fundamental operation in LLM computations. We present a detailed characterization of Grayskull's execution model, gridsize, matrix dimensions, data formats, and numerical precision impact computational efficiency. Furthermore, we compare Grayskull's performance against state-of-the-art architectures with tensor acceleration, including Intel Sapphire Rapids processors and two NVIDIA GPUs (V100 and A100). Whilst NVIDIA GPUs dominate raw performance, Grayskull demonstrates a competitive trade-off between power consumption and computational throughput, reaching a peak of 1.55 TFLOPs/Watt with BF16.

2505.01866 2026-02-06 cs.CR cs.LG

PQS-BFL: A Post-Quantum Secure Blockchain-based Federated Learning Framework

Daniel Commey, Garth V. Crosby

Journal ref Expert Systems with Applications, 131449 (2026)

详情
英文摘要

Federated Learning (FL) enables collaborative model training while preserving data privacy, but its classical cryptographic underpinnings are vulnerable to quantum attacks. This vulnerability is particularly critical in sensitive domains like healthcare. This paper introduces PQS-BFL (Post-Quantum Secure Blockchain-based Federated Learning), a framework integrating post-quantum cryptography (PQC) with blockchain verification to secure FL against quantum adversaries. We employ ML-DSA-65 (a FIPS 204 standard candidate, formerly Dilithium) signatures to authenticate model updates and leverage optimized smart contracts for decentralized validation. Extensive evaluations on diverse datasets (MNIST, SVHN, HAR) demonstrate that PQS-BFL achieves efficient cryptographic operations (average PQC sign time: 0.65 ms, verify time: 0.53 ms) with a fixed signature size of 3309 Bytes. Blockchain integration incurs a manageable overhead, with average transaction times around 4.8 s and gas usage per update averaging 1.72 x 10^6 units for PQC configurations. Crucially, the cryptographic overhead relative to transaction time remains minimal (around 0.01-0.02% for PQC with blockchain), confirming that PQC performance is not the bottleneck in blockchain-based FL. The system maintains competitive model accuracy (e.g., over 98.8% for MNIST with PQC) and scales effectively, with round times showing sublinear growth with increasing client numbers. Our open-source implementation and reproducible benchmarks validate the feasibility of deploying long-term, quantum-resistant security in practical FL systems.

2504.20741 2026-02-06 cs.HC cs.AI cs.CY cs.LG

In defence of post-hoc explanations in medical AI

Joshua Hatherley, Lauritz Munch, Jens Christian Bjerring

Journal ref 2026. Hastings Center Report 56(1): 40-46

详情
英文摘要

Since the early days of the Explainable AI movement, post-hoc explanations have been praised for their potential to improve user understanding, promote trust, and reduce patient safety risks in black box medical AI systems. Recently, however, critics have argued that the benefits of post-hoc explanations are greatly exaggerated since they merely approximate, rather than replicate, the actual reasoning processes that black box systems take to arrive at their outputs. In this article, we aim to defend the value of post-hoc explanations against this recent critique. We argue that even if post-hoc explanations do not replicate the exact reasoning processes of black box systems, they can still improve users' functional understanding of black box systems, increase the accuracy of clinician-AI teams, and assist clinicians in justifying their AI-informed decisions. While post-hoc explanations are not a "silver bullet" solution to the black box problem in medical AI, we conclude that they remain a useful strategy for addressing the black box problem in medical AI.

2502.17399 2026-02-06 cs.HC cs.RO

Enriching physical-virtual interaction in AR gaming by tracking identical objects via an egocentric partial observation frame

Liuchuan Yu, Ching-I Huang, Hsueh-Cheng Wang, Lap-Fai Yu

Journal ref Virtual Reality 30 (2026) 51

详情
英文摘要

Augmented reality (AR) games, particularly those designed for head-mounted displays, have grown increasingly prevalent. However, most existing systems depend on pre-scanned, static environments and rely heavily on continuous tracking or marker-based solutions, which limit adaptability in dynamic physical spaces. This is particularly problematic for AR headsets and glasses, which typically follow the user's head movement and cannot maintain a fixed, stationary view of the scene. Moreover, continuous scene observation is neither power-efficient nor practical for wearable devices, given their limited battery and processing capabilities. A persistent challenge arises when multiple identical objects are present in the environment-standard object tracking pipelines often fail to maintain consistent identities without uninterrupted observation or external sensors. These limitations hinder fluid physical-virtual interactions, especially in dynamic or occluded scenes where continuous tracking is infeasible. To address this, we introduce a novel optimization-based framework for re-identifying identical objects in AR scenes using only one partial egocentric observation frame captured by a headset. We formulate the problem as a label assignment task solved via integer programming, augmented with a Voronoi diagram-based pruning strategy to improve computational efficiency. This method reduces computation time by 50% while preserving 91% accuracy in simulated experiments. Moreover, we evaluated our approach in quantitative synthetic and quantitative real-world experiments. We also conducted three qualitative real-world experiments to demonstrate the practical utility and generalizability for enabling dynamic, markerless object interaction in AR environments. Our video demo is available at https://youtu.be/RwptEfLtW1U.

2501.12911 2026-02-06 cs.CR cs.DC cs.LG

A Selective Homomorphic Encryption Approach for Faster Privacy-Preserving Federated Learning

Abdulkadir Korkmaz, Praveen Rao

Comments 18 pages, 18 figures

Journal ref Proceedings of the IEEE Consumer Communications & Networking Conference (CCNC), 2026

详情
英文摘要

Federated learning (FL) has come forward as a critical approach for privacy-preserving machine learning in healthcare, allowing collaborative model training across decentralized medical datasets without exchanging clients' data. However, current security implementations for these systems face a fundamental trade-off: rigorous cryptographic protections like fully homomorphic encryption (FHE) impose prohibitive computational overhead, while lightweight alternatives risk vulnerable data leakage through model updates. To address this issue, we present FAS (Fast and Secure Federated Learning), a novel approach that strategically combines selective homomorphic encryption, differential privacy, and bitwise scrambling to achieve robust security without compromising practical usability. Our approach eliminates the need for model pretraining phases while dynamically protecting high-risk model parameters through layered encryption and obfuscation. We implemented FAS using the Flower framework and evaluated it on a cluster of eleven physical machines. Our approach was up to 90\% faster than applying FHE on the model weights. In addition, we eliminated the computational overhead that is required by competitors such as FedML-HE and MaskCrypt. Our approach was up to 1.5$\times$ faster than the competitors while achieving comparable security results. Experimental evaluations on medical imaging datasets confirm that FAS maintains similar security results to conventional FHE against gradient inversion attacks while preserving diagnostic model accuracy. These results position FAS as a practical solution for latency-sensitive healthcare applications where both privacy preservation and computational efficiency are requirements.

2501.01741 2026-02-06 cs.SE cs.AI cs.CL

How Toxic Can You Get? Search-based Toxicity Testing for Large Language Models

Simone Corbo, Luca Bancale, Valeria De Gennaro, Livia Lestingi, Vincenzo Scotti, Matteo Camilli

Journal ref IEEE Transactions on Software Engineering, Nov. 2025, pp. 3056-3071, vol. 51

详情
英文摘要

Language is a deep-rooted means of perpetration of stereotypes and discrimination. Large Language Models (LLMs), now a pervasive technology in our everyday lives, can cause extensive harm when prone to generating toxic responses. The standard way to address this issue is to align the LLM , which, however, dampens the issue without constituting a definitive solution. Therefore, testing LLM even after alignment efforts remains crucial for detecting any residual deviations with respect to ethical standards. We present EvoTox, an automated testing framework for LLMs' inclination to toxicity, providing a way to quantitatively assess how much LLMs can be pushed towards toxic responses even in the presence of alignment. The framework adopts an iterative evolution strategy that exploits the interplay between two LLMs, the System Under Test (SUT) and the Prompt Generator steering SUT responses toward higher toxicity. The toxicity level is assessed by an automated oracle based on an existing toxicity classifier. We conduct a quantitative and qualitative empirical evaluation using five state-of-the-art LLMs as evaluation subjects having increasing complexity (7-671B parameters). Our quantitative evaluation assesses the cost-effectiveness of four alternative versions of EvoTox against existing baseline methods, based on random search, curated datasets of toxic prompts, and adversarial attacks. Our qualitative assessment engages human evaluators to rate the fluency of the generated prompts and the perceived toxicity of the responses collected during the testing sessions. Results indicate that the effectiveness, in terms of detected toxicity level, is significantly higher than the selected baseline methods (effect size up to 1.0 against random search and up to 0.99 against adversarial attacks). Furthermore, EvoTox yields a limited cost overhead (from 22% to 35% on average).

2410.17790 2026-02-06 eess.AS cs.SD

Regularized autoregressive modeling and its application to audio signal reconstruction

Ondřej Mokrý, Pavel Rajmic

Comments submitted to IEEE Transactions on Audio, Speech, and Language Processing

详情
英文摘要

Autoregressive (AR) modeling is invaluable in signal processing, in particular in speech and audio fields. Attempts in the literature can be found that regularize or constrain either the time-domain signal values or the AR coefficients, which is done for various reasons, including the incorporation of prior information or numerical stabilization. Although these attempts are appealing, an encompassing and generic modeling framework is still missing. We propose such a framework and the related optimization problem and algorithm. We discuss the computational demands of the algorithm and explore the effects of various improvements on its convergence speed. In the experimental part, we demonstrate the usefulness of our approach on the audio declipping and dequantization problems. We compare its performance against state-of-the-art methods and demonstrate the competitiveness of the proposed method in declipping musical signals, and its superiority in declipping speech. The evaluation includes a heuristic algorithm of generalized linear prediction (GLP), a strong competitor which has only been presented as a patent and is new in the scientific community.

2409.09143 2026-02-06 cs.CR cs.CL

DomURLs_BERT: Pre-trained BERT-based Model for Malicious Domains and URLs Detection and Classification

Abdelkader El Mahdaouy, Salima Lamsiyah, Meryem Janati Idrissi, Hamza Alami, Zakaria Yartaoui, Ismail Berrada

Journal ref Journal of Network and Systems Management, 34, 36, 2026

详情
英文摘要

Detecting and classifying suspicious or malicious domain names and URLs is fundamental task in cybersecurity. To leverage such indicators of compromise, cybersecurity vendors and practitioners often maintain and update blacklists of known malicious domains and URLs. However, blacklists frequently fail to identify emerging and obfuscated threats. Over the past few decades, there has been significant interest in developing machine learning models that automatically detect malicious domains and URLs, addressing the limitations of blacklists maintenance and updates. In this paper, we introduce DomURLs_BERT, a pre-trained BERT-based encoder adapted for detecting and classifying suspicious/malicious domains and URLs. DomURLs_BERT is pre-trained using the Masked Language Modeling (MLM) objective on a large multilingual corpus of URLs, domain names, and Domain Generation Algorithms (DGA) dataset. In order to assess the performance of DomURLs_BERT, we have conducted experiments on several binary and multi-class classification tasks involving domain names and URLs, covering phishing, malware, DGA, and DNS tunneling. The evaluations results show that the proposed encoder outperforms state-of-the-art character-based deep learning models and cybersecurity-focused BERT models across multiple tasks and datasets. The pre-training dataset, the pre-trained DomURLs_BERT encoder, and the experiments source code are publicly available.

2409.01978 2026-02-06 quant-ph cs.LG stat.ML

Application of Langevin Dynamics to Advance the Quantum Natural Gradient Optimization Algorithm

Oleksandr Borysenko, Mykhailo Bratchenko, Ilya Lukin, Mykola Luhanko, Ihor Omelchenko, Andrii Sotnikov, Alessandro Lomi

Comments 11 pages, 3 figures

Journal ref Physica A 682 (2026) 131158

详情
英文摘要

A Quantum Natural Gradient (QNG) algorithm for optimization of variational quantum circuits has been proposed recently. In this study, we employ the Langevin equation with a QNG stochastic force to demonstrate that its discrete-time solution gives a generalized form of the above-specified algorithm, which we call Momentum-QNG. Similar to other optimization algorithms with the momentum term, such as the Stochastic Gradient Descent with momentum, RMSProp with momentum and Adam, Momentum-QNG is more effective to escape local minima and plateaus in the variational parameter space and, therefore, demonstrates an improved performance compared to the basic QNG. In this paper we benchmark Momentum-QNG together with the basic QNG, Adam and Momentum optimizers and explore its convergence behaviour. Among the benchmarking problems studied, the best result is obtained for the quantum Sherrington-Kirkpatrick model in the strong spin glass regime. Our open-source code is available at https://github.com/borbysh/Momentum-QNG

2404.05748 2026-02-06 q-bio.NC cs.LG

Analyzing heterogeneity in Alzheimer Disease using multimodal normative modeling on imaging-based ATN biomarkers

Sayantan Kumar, Tom Earnest, Braden Yang, Deydeep Kothapalli, Andrew J. Aschenbrenner, Jason Hassenstab, Chengie Xiong, Beau Ances, John Morris, Tammie L. S. Benzinger, Brian A. Gordon, Philip Payne, Aristeidis Sotiras

Comments Under review in Alzheimer's & Dementia

Journal ref Alzheimer's Dement. 2025; 21:e70143

详情
英文摘要

INTRODUCTION: Previous studies have applied normative modeling on a single neuroimaging modality to investigate Alzheimer Disease (AD) heterogeneity. We employed a deep learning-based multimodal normative framework to analyze individual-level variation across ATN (amyloid-tau-neurodegeneration) imaging biomarkers. METHODS: We selected cross-sectional discovery (n = 665) and replication cohorts (n = 430) with available T1-weighted MRI, amyloid and tau PET. Normative modeling estimated individual-level abnormal deviations in amyloid-positive individuals compared to amyloid-negative controls. Regional abnormality patterns were mapped at different clinical group levels to assess intra-group heterogeneity. An individual-level disease severity index (DSI) was calculated using both the spatial extent and magnitude of abnormal deviations across ATN. RESULTS: Greater intra-group heterogeneity in ATN abnormality patterns was observed in more severe clinical stages of AD. Higher DSI was associated with worse cognitive function and increased risk of disease progression. DISCUSSION: Subject-specific abnormality maps across ATN reveal the heterogeneous impact of AD on the brain.

2403.11249 2026-02-06 eess.IV cs.CV

YOLOv9 for Fracture Detection in Pediatric Wrist Trauma X-ray Images

Chun-Tse Chien, Rui-Yang Ju, Kuang-Yi Chou, Jen-Shiun Chiang

Comments Accepted by Electronics Letters

Journal ref Electron. Lett. 60 (2024) e13248

详情
英文摘要

The introduction of YOLOv9, the latest version of the You Only Look Once (YOLO) series, has led to its widespread adoption across various scenarios. This paper is the first to apply the YOLOv9 algorithm model to the fracture detection task as computer-assisted diagnosis (CAD) to help radiologists and surgeons to interpret X-ray images. Specifically, this paper trained the model on the GRAZPEDWRI-DX dataset and extended the training set using data augmentation techniques to improve the model performance. Experimental results demonstrate that compared to the mAP 50-95 of the current state-of-the-art (SOTA) model, the YOLOv9 model increased the value from 42.16% to 43.73%, with an improvement of 3.7%. The implementation code is publicly available at https://github.com/RuiyangJu/YOLOv9-Fracture-Detection.

2403.01673 2026-02-06 stat.ML cs.AI cs.LG

CATS: Enhancing Multivariate Time Series Forecasting by Constructing Auxiliary Time Series as Exogenous Variables

Jiecheng Lu, Xu Han, Yan Sun, Shihao Yang

Comments Camera-ready version. Accepted at ICML 2024

Journal ref Proceedings of the Forty-first International Conference on Machine Learning (ICML 2024)

详情
英文摘要

For Multivariate Time Series Forecasting (MTSF), recent deep learning applications show that univariate models frequently outperform multivariate ones. To address the difficiency in multivariate models, we introduce a method to Construct Auxiliary Time Series (CATS) that functions like a 2D temporal-contextual attention mechanism, which generates Auxiliary Time Series (ATS) from Original Time Series (OTS) to effectively represent and incorporate inter-series relationships for forecasting. Key principles of ATS - continuity, sparsity, and variability - are identified and implemented through different modules. Even with a basic 2-layer MLP as core predictor, CATS achieves state-of-the-art, significantly reducing complexity and parameters compared to previous multivariate models, marking it an efficient and transferable MTSF solution.

2403.01600 2026-02-06 cs.MA cs.AI

Can Poverty Be Reduced by Acting on Discrimination? An Agent-based Model for Policy Making

Alba Aguilera, Nieves Montes, Georgina Curto, Carles Sierra, Nardine Osman

详情
英文摘要

In the last decades, there has been a deceleration in the rates of poverty reduction, suggesting that traditional redistributive approaches to poverty mitigation could be losing effectiveness, and alternative insights to advance the number one UN Sustainable Development Goal are required. The criminalization of poor people has been denounced by several NGOs, and an increasing number of voices suggest that discrimination against the poor (a phenomenon known as \emph{aporophobia}) could be an impediment to mitigating poverty. In this paper, we present the novel Aporophobia Agent-Based Model (AABM) to provide evidence of the correlation between aporophobia and poverty computationally. We present our use case built with real-world demographic data and poverty-mitigation public policies (either enforced or under parliamentary discussion) for the city of Barcelona. We classify policies as discriminatory or non-discriminatory against the poor, with the support of specialized NGOs, and we observe the results in the AABM in terms of the impact on wealth inequality. The simulation provides evidence of the relationship between aporophobia and the increase of wealth inequality levels, paving the way for a new generation of poverty reduction policies that act on discrimination and tackle poverty as a societal problem (not only a problem of the poor).

2310.09488 2026-02-06 stat.ML cs.LG

ARM: Refining Multivariate Forecasting with Adaptive Temporal-Contextual Learning

Jiecheng Lu, Xu Han, Shihao Yang

Comments Camera-ready version. Accepted at ICLR 2024

Journal ref Proceedings of the Twelfth International Conference on Learning Representations (ICLR 2024)

详情
英文摘要

Long-term time series forecasting (LTSF) is important for various domains but is confronted by challenges in handling the complex temporal-contextual relationships. As multivariate input models underperforming some recent univariate counterparts, we posit that the issue lies in the inefficiency of existing multivariate LTSF Transformers to model series-wise relationships: the characteristic differences between series are often captured incorrectly. To address this, we introduce ARM: a multivariate temporal-contextual adaptive learning method, which is an enhanced architecture specifically designed for multivariate LTSF modelling. ARM employs Adaptive Univariate Effect Learning (AUEL), Random Dropping (RD) training strategy, and Multi-kernel Local Smoothing (MKLS), to better handle individual series temporal patterns and correctly learn inter-series dependencies. ARM demonstrates superior performance on multiple benchmarks without significantly increasing computational costs compared to vanilla Transformer, thereby advancing the state-of-the-art in LTSF. ARM is also generally applicable to other LTSF architecture beyond vanilla Transformer.

2309.16858 2026-02-06 stat.ML cs.LG

Improved Generalization Bounds for Transductive Learning by Transductive Local Complexity and Its Applications

Yingzhen Yang

Comments The ICML 2025 conference version (https://openreview.net/pdf?id=NRVdvg7VMn) is a special case of this paper where the chain length is fixed at 2 (i.e.,$Q=2$, see Def. 5.1), and its main results follow directly from the results here. This paper further provides a nearly optimal excess risk bound for realizable transductive learning and a stronger bound for transductive kernel learning

详情
英文摘要

We introduce Transductive Local Complexity (TLC) to extend the classical Local Rademacher Complexity (LRC) to the transductive setting, incorporating substantial and novel components. Although LRC has been used to obtain sharp generalization bounds and minimax rates for inductive tasks such as classification and nonparametric regression, it has remained an open problem whether a localized Rademacher complexity framework can be effectively adapted to transductive learning to achieve sharp or nearly sharp bounds consistent with inductive results. We provide an affirmative answer via TLC. TLC is constructed by first deriving a new concentration inequality in Theorem 4.1 for the supremum of empirical processes capturing the gap between test and training losses, termed the test-train process, under uniform sampling without replacement, which leverages a novel combinatorial property of the test-train process and a new proof strategy applying the exponential Efron-Stein inequality twice. A subsequent peeling strategy applied to a new decomposition of the expectation of the test-train process and a new surrogate variance operator then yield excess risk bounds in the transductive setting that are nearly consistent with classical LRC-based inductive bounds up to a logarithmic gap. We further advance transductive learning through two applications: (1) for realizable transductive learning over binary-valued classes with finite VC dimension of $\dVC$ and $u \ge m \ge \dVC$, where $u$ and $m$ are the number of test features and training features, our Theorem 6.1 gives a nearly optimal bound $Θ(\dVC \log(me/\dVC)/m)$ matching the minimax rate $Θ(\dVC/m)$ up to $\log m$, resolving a decade-old open question; and (2) Theorem 6.2 presents a sharper excess risk bound for transductive kernel learning compared to the current state-of-the-art.

2110.04903 2026-02-06 eess.IV cs.LG

Normative Modeling using Multimodal Variational Autoencoders to Identify Abnormal Brain Structural Patterns in Alzheimer Disease

Sayantan Kumar, Philip Payne, Aristeidis Sotiras

Comments Medical Imaging Meets NeurIPS workshop in NeurIPS 2022

Journal ref Proc. SPIE 12465, Medical Imaging 2023: Computer-Aided Diagnosis, 1246503 (7 April 2023)

详情
英文摘要

Normative modelling is an emerging method for understanding the underlying heterogeneity within brain disorders like Alzheimer Disease (AD) by quantifying how each patient deviates from the expected normative pattern that has been learned from a healthy control distribution. Since AD is a multifactorial disease with more than one biological pathways, multimodal magnetic resonance imaging (MRI) neuroimaging data can provide complementary information about the disease heterogeneity. However, existing deep learning based normative models on multimodal MRI data use unimodal autoencoders with a single encoder and decoder that may fail to capture the relationship between brain measurements extracted from different MRI modalities. In this work, we propose multi-modal variational autoencoder (mmVAE) based normative modelling framework that can capture the joint distribution between different modalities to identify abnormal brain structural patterns in AD. Our multi-modal framework takes as input Freesurfer processed brain region volumes from T1-weighted (cortical and subcortical) and T2-weighed (hippocampal) scans of cognitively normal participants to learn the morphological characteristics of the healthy brain. The estimated normative model is then applied on Alzheimer Disease (AD) patients to quantify the deviation in brain volumes and identify the abnormal brain structural patterns due to the effect of the different AD stages. Our experimental results show that modeling joint distribution between the multiple MRI modalities generates deviation maps that are more sensitive to disease staging within AD, have a better correlation with patient cognition and result in higher number of brain regions with statistically significant deviations compared to a unimodal baseline model with all modalities concatenated as a single input.

2602.05207 2026-02-06 eess.AS cs.AI

ARCHI-TTS: A flow-matching-based Text-to-Speech Model with Self-supervised Semantic Aligner and Accelerated Inference

Chunyat Wu, Jiajun Deng, Zhengxi Liu, Zheqi Dai, Haolin He, Qiuqiang Kong

Comments Accepted by ICASSP 2026

详情
英文摘要

Although diffusion-based, non-autoregressive text-to-speech (TTS) systems have demonstrated impressive zero-shot synthesis capabilities, their efficacy is still hindered by two key challenges: the difficulty of text-speech alignment modeling and the high computational overhead of the iterative denoising process. To address these limitations, we propose ARCHI-TTS that features a dedicated semantic aligner to ensure robust temporal and semantic consistency between text and audio. To overcome high computational inference costs, ARCHI-TTS employs an efficient inference strategy that reuses encoder features across denoising steps, drastically accelerating synthesis without performance degradation. An auxiliary CTC loss applied to the condition encoder further enhances the semantic understanding. Experimental results demonstrate that ARCHI-TTS achieves a WER of 1.98% on LibriSpeech-PC test-clean, and 1.47%/1.42% on SeedTTS test-en/test-zh with a high inference efficiency, consistently outperforming recent state-of-the-art TTS systems.

2602.05184 2026-02-06 hep-th cond-mat.dis-nn cs.AI cs.LG

Towards Worst-Case Guarantees with Scale-Aware Interpretability

Lauren Greenspan, David Berman, Aryeh Brill, Ro Jefferson, Artemy Kolchinsky, Jennifer Lin, Andrew Mack, Anindita Maiti, Fernando E. Rosas, Alexander Stapleton, Lucas Teixeira, Dmitry Vaintrob

详情
英文摘要

Neural networks organize information according to the hierarchical, multi-scale structure of natural data. Methods to interpret model internals should be similarly scale-aware, explicitly tracking how features compose across resolutions and guaranteeing bounds on the influence of fine-grained structure that is discarded as irrelevant noise. We posit that the renormalisation framework from physics can meet this need by offering technical tools that can overcome limitations of current methods. Moreover, relevant work from adjacent fields has now matured to a point where scattered research threads can be synthesized into practical, theory-informed tools. To combine these threads in an AI safety context, we propose a unifying research agenda -- \emph{scale-aware interpretability} -- to develop formal machinery and interpretability tools that have robustness and faithfulness properties supported by statistical physics.

2602.05174 2026-02-06 stat.ML cs.AI cs.LG math.ST stat.TH

Total Variation Rates for Riemannian Flow Matching

Yunrui Guan, Krishnakumar Balasubramanian, Shiqian Ma

详情
英文摘要

Riemannian flow matching (RFM) extends flow-based generative modeling to data supported on manifolds by learning a time-dependent tangent vector field whose flow-ODE transports a simple base distribution to the data law. We develop a nonasymptotic Total Variation (TV) convergence analysis for RFM samplers that use a learned vector field together with Euler discretization on manifolds. Our key technical ingredient is a differential inequality governing the evolution of TV between two manifold ODE flows, which expresses the time-derivative of TV through the divergence of the vector-field mismatch and the score of the reference flow; controlling these terms requires establishing new bounds that explicitly account for parallel transport and curvature. Under smoothness assumptions on the population flow-matching field and either uniform (compact manifolds) or mean-square (Hadamard manifolds) approximation guarantees for the learned field, we obtain explicit bounds of the form $\mathrm{TV}\le C_{\mathrm{Lip}}\,h + C_{\varepsilon}\,\varepsilon$ (with an additional higher-order $\varepsilon^2$ term on compact manifolds), cleanly separating numerical discretization and learning errors. Here, $h$ is the step-size and $\varepsilon$ is the target accuracy. Instantiations yield \emph{explicit} polynomial iteration complexities on the hypersphere $S^d$, and on the SPD$(n)$ manifolds under mild moment conditions.

2602.05167 2026-02-06 physics.comp-ph cond-mat.soft cond-mat.stat-mech cs.LG physics.chem-ph

Path Sampling for Rare Events Boosted by Machine Learning

Porhouy Minh, Sapna Sarupria

Comments 7 pages, 1 figure

Journal ref P. Minh & S. Sarupria, KIM REVIEW, Volume 4, Article 01, 2026

详情
英文摘要

The study by Jung et al. (Jung H, Covino R, Arjun A, et al., Nat Comput Sci. 3:334-345 (2023)) introduced Artificial Intelligence for Molecular Mechanism Discovery (AIMMD), a novel sampling algorithm that integrates machine learning to enhance the efficiency of transition path sampling (TPS). By enabling on-the-fly estimation of the committor probability and simultaneously deriving a human-interpretable reaction coordinate, AIMMD offers a robust framework for elucidating the mechanistic pathways of complex molecular processes. This commentary provides a discussion and critical analysis of the core AIMMD framework, explores its recent extensions, and offers an assessment of the method's potential impact and limitations.

2602.05120 2026-02-06 cs.CC cs.LG

Certifiable Boolean Reasoning Is Universal

Wenhao Li, Anastasis Kratsios, Hrad Ghoukasian, Dennis Zvigelsky

Comments Submitted to COLT 2026

详情
英文摘要

The proliferation of agentic systems has thrust the reasoning capabilities of AI into the forefront of contemporary machine learning. While it is known that there \emph{exist} neural networks which can reason through any Boolean task $f:\{0,1\}^B\to\{0,1\}$, in the sense that they emulate Boolean circuits with fan-in $2$ and fan-out $1$ gates, trained models have been repeatedly demonstrated to fall short of these theoretical ideals. This raises the question: \textit{Can one exhibit a deep learning model which \textbf{certifiably} always reasons and can \textbf{universally} reason through any Boolean task?} Moreover, such a model should ideally require few parameters to solve simple Boolean tasks. We answer this question affirmatively by exhibiting a deep learning architecture which parameterizes distributions over Boolean circuits with the guarantee that, for every parameter configuration, a sample is almost surely a valid Boolean circuit (and hence admits an intrinsic circuit-level certificate). We then prove a universality theorem: for any Boolean $f:\{0,1\}^B\to\{0,1\}$, there exists a parameter configuration under which the sampled circuit computes $f$ with arbitrarily high probability. When $f$ is an $\mathcal{O}(\log B)$-junta, the required number of parameters scales linearly with the input dimension $B$. Empirically, on a controlled truth-table completion benchmark aligned with our setting, the proposed architecture trains reliably and achieves high exact-match accuracy while preserving the predicted structure: every internal unit is Boolean-valued on $\{0,1\}^B$. Matched MLP baselines reach comparable accuracy, but only about $10\%$ of hidden units admit a Boolean representation; i.e.\ are two-valued over the Boolean cube.