arXivDaily arXiv每日学术速递 周一至周五更新
全部学科分类 1433
专题追踪
2412.09333 2026-02-06 cs.CV cond-mat.mtrl-sci eess.IV

MaskTerial: A Foundation Model for Automated 2D Material Flake Detection

Jan-Lucas Uslu, Alexey Nekrasov, Alexander Hermans, Bernd Beschoten, Bastian Leibe, Lutz Waldecker, Christoph Stampfer

Comments 9 pages, 5 figures

Journal ref Digital Discovery 4, 3744 (2025)

详情
英文摘要

The detection and classification of exfoliated two-dimensional (2D) material flakes from optical microscope images can be automated using computer vision algorithms. This has the potential to increase the accuracy and objectivity of classification and the efficiency of sample fabrication, and it allows for large-scale data collection. Existing algorithms often exhibit challenges in identifying low-contrast materials and typically require large amounts of training data. Here, we present a deep learning model, called MaskTerial, that uses an instance segmentation network to reliably identify 2D material flakes. The model is extensively pre-trained using a synthetic data generator, that generates realistic microscopy images from unlabeled data. This results in a model that can to quickly adapt to new materials with as little as 5 to 10 images. Furthermore, an uncertainty estimation model is used to finally classify the predictions based on optical contrast. We evaluate our method on eight different datasets comprising five different 2D materials and demonstrate significant improvements over existing techniques in the detection of low-contrast materials such as hexagonal boron nitride.

2412.08419 2026-02-06 cs.LG

Energy Guided smoothness to improve Robustness in Graph Classification

Farooq Ahmad Wani, Maria Sofia Bucarelli, Andrea Giuseppe Di Francesco, Oleksandr Pryymak, Fabrizio Silvestri

详情
英文摘要

Graph Neural Networks (GNNs) are powerful at solving graph classification tasks, yet applied problems often contain noisy labels. In this work, we study GNN robustness to label noise, demonstrate GNN failure modes when models struggle to generalise on low-order graphs, low label coverage, or when a model is over-parameterized. We establish both empirical and theoretical links between GNN robustness and the reduction of the total Dirichlet Energy of learned node representations, which encapsulates the hypothesized GNN smoothness inductive bias. Finally, we introduce two training strategies to enhance GNN robustness: (1) by incorporating a novel inductive bias in the weight matrices through the removal of negative eigenvalues, connected to Dirichlet Energy minimization; (2) by extending to GNNs a loss penalty that promotes learned smoothness. Importantly, neither approach negatively impacts performance in noise-free settings, supporting our hypothesis that the source of GNNs robustness is their smoothness inductive bias.

2411.12788 2026-02-06 cs.CV

Efficient Scene Modeling via Structure-Aware and Region-Prioritized 3D Gaussians

Guangchi Fang, Bing Wang

详情
英文摘要

Reconstructing 3D scenes with high fidelity and efficiency remains a central pursuit in computer vision and graphics. Recent advances in 3D Gaussian Splatting (3DGS) enable photorealistic rendering with Gaussian primitives, yet the modeling process remains governed predominantly by photometric supervision. This reliance often leads to irregular spatial distribution and indiscriminate primitive adjustments that largely ignore underlying geometric context. In this work, we rethink Gaussian modeling from a geometric standpoint and introduce Mini-Splatting2, an efficient scene modeling framework that couples structure-aware distribution and region-prioritized optimization, driving 3DGS into a geometry-regulated paradigm. The structure-aware distribution enforces spatial regularity through structured reorganization and representation sparsity, ensuring balanced structural coverage for compact organization. The region-prioritized optimization improves training discrimination through geometric saliency and computational selectivity, fostering appropriate structural emergence for fast convergence. These mechanisms alleviate the long-standing tension among representation compactness, convergence acceleration, and rendering fidelity. Extensive experiments demonstrate that Mini-Splatting2 achieves up to 4$\times$ fewer Gaussians and 3$\times$ faster optimization while maintaining state-of-the-art visual quality, paving the way towards structured and efficient 3D Gaussian modeling.

2410.23059 2026-02-06 cs.RO

FilMBot: A High-Speed Soft Parallel Robotic Micromanipulator

Jiangkun Yu, Houari Bettahar, Hakan Kandemir, Quan Zhou

Comments 13 pages, 16 figures

Journal ref IEEE Transactions on Robotics, 2026

详情
英文摘要

Soft robotic manipulators are generally slow despite their great adaptability, resilience, and compliance. This limitation also extends to current soft robotic micromanipulators. Here, we introduce FilMBot, a 3-DOF film-based, electromagnetically actuated, soft kinematic robotic micromanipulator achieving speeds up to 2117 °/s and 2456 °/s in α and \{beta} angular motions, with corresponding linear velocities of 1.61 m/s and 1.92 m/s using a 4-cm needle end-effector, 0.54 m/s along the Z axis, and 1.57 m/s during Z-axis morph switching. The robot can reach ~1.50 m/s in path-following tasks, with an operational bandwidth below ~30 Hz, and remains responsive at 50 Hz. It demonstrates high precision (~6.3 μm, or ~0.05% of its workspace) in path-following tasks, with precision remaining largely stable across frequencies. The novel combination of the low-stiffness soft kinematic film structure and strong electromagnetic actuation in FilMBot opens new avenues for soft robotics. Furthermore, its simple construction and inexpensive, readily accessible components could broaden the application of micromanipulators beyond current academic and professional users.

2410.13431 2026-02-06 cs.LG cs.AI

Solving Prior Distribution Mismatch in Diffusion Models via Optimal Transport

Zhanpeng Wang, Shenghao Li, Jiameng Che, Chen Wang, Shangling Jui, Na Lei, Zhongxuan Luo

详情
英文摘要

Diffusion Models (DMs) have achieved remarkable progress in generative modeling. However, the mismatch between the forward terminal distribution and reverse initial distribution introduces prior error, leading to deviations of sampling trajectories from the true distribution and severely limiting model performance. This issue further triggers cascading problems, including non-zero Signal-to-Noise Ratio, accumulated denoising errors, degraded generation quality, and constrained sampling efficiency. To address this issue, this paper proposes a prior error elimination framework based on Optimal Transport (OT). Specifically, an OT map from the reverse initial distribution to the forward terminal distribution is constructed to achieve precise matching of the two distributions. Meanwhile, the upper bound of the prior error is quantified using the Wasserstein distance, proving that the prior error can be effectively eliminated via the OT map. Additionally, by deriving the asymptotic consistency between dynamic OT and probability flow, this method is revealed to be highly compatible with the intrinsic mechanism of the diffusion process. Experimental results demonstrate that the proposed method completely eliminates the prior error both theoretically and practically, providing a universal and rigorous solution for optimizing the performance of DMs.

2410.04560 2026-02-06 cs.LG stat.ML

GAMformer: Bridging Tabular Foundation Models and Interpretable Machine Learning

Andreas Mueller, Julien Siems, Harsha Nori, David Salinas, Arber Zela, Rich Caruana, Frank Hutter

Comments 22 pages, 15 figures

详情
英文摘要

While interpretability is crucial for machine learning applications in safety-critical domains and for regulatory compliance, existing tabular foundation models like TabPFN lack transparency. Generalized Additive Models (GAMs) provide the needed interpretability through their additive structure, but traditional GAM methods rely on iterative learning algorithms (such as splines, boosted trees, or neural networks) that are fundamentally incompatible with the in-context learning paradigm of foundation models. In this paper, we introduce GAMformer, the first tabular foundation model for GAMs that bridges the gap between the power of foundation models and the interpretability requirements of critical real-world applications. GAMformer estimates GAM shape functions in a single forward pass using in-context learning, representing a significant departure from conventional iterative approaches. Building on previous research on tabular foundation models, we train GAMformer exclusively on synthetically generated tables to prevent data leakage. Our experiments demonstrate that GAMformer performs comparably to other leading GAMs across various classification benchmarks.

2410.02103 2026-02-06 cs.CV

MVGS: Multi-view Regulated Gaussian Splatting for Novel View Synthesis

Xiaobiao Du, Yida Wang, Xin Yu

Comments Project Page:https://xiaobiaodu.github.io/mvgs-project/

详情
英文摘要

Recent works in volume rendering, \textit{e.g.} NeRF and 3D Gaussian Splatting (3DGS), significantly advance the rendering quality and efficiency with the help of the learned implicit neural radiance field or 3D Gaussians. Rendering on top of an explicit representation, the vanilla 3DGS and its variants deliver real-time efficiency by optimizing the parametric model with single-view supervision per iteration during training which is adopted from NeRF. Consequently, certain views are overfitted, leading to unsatisfying appearance in novel-view synthesis and imprecise 3D geometries. To solve aforementioned problems, we propose a new 3DGS optimization method embodying four key novel contributions: 1) We transform the conventional single-view training paradigm into a multi-view training strategy. With our proposed multi-view regulation, 3D Gaussian attributes are further optimized without overfitting certain training views. As a general solution, we improve the overall accuracy in a variety of scenarios and different Gaussian variants. 2) Inspired by the benefit introduced by additional views, we further propose a cross-intrinsic guidance scheme, leading to a coarse-to-fine training procedure concerning different resolutions. 3) Built on top of our multi-view regulated training, we further propose a cross-ray densification strategy, densifying more Gaussian kernels in the ray-intersect regions from a selection of views. 4) By further investigating the densification strategy, we found that the effect of densification should be enhanced when certain views are distinct dramatically. As a solution, we propose a novel multi-view augmented densification strategy, where 3D Gaussians are encouraged to get densified to a sufficient number accordingly, resulting in improved reconstruction accuracy.

2410.01031 2026-02-06 cs.CV

Pediatric Wrist Fracture Detection Using Feature Context Excitation Modules in X-ray Images

Rui-Yang Ju, Chun-Tse Chien, Enkaer Xieerke, Jen-Shiun Chiang

Comments arXiv admin note: text overlap with arXiv:2407.03163

Journal ref IET Image Process. 20 (2026) e70269

详情
英文摘要

Children often suffer wrist trauma in daily life, while they usually need radiologists to analyze and interpret X-ray images before surgical treatment by surgeons. The development of deep learning has enabled neural networks to serve as computer-assisted diagnosis (CAD) tools to help doctors and experts in medical image diagnostics. Since YOLOv8 model has obtained the satisfactory success in object detection tasks, it has been applied to various fracture detection. This work introduces four variants of Feature Contexts Excitation-YOLOv8 (FCE-YOLOv8) model, each incorporating a different FCE module (i.e., modules of Squeeze-and-Excitation (SE), Global Context (GC), Gather-Excite (GE), and Gaussian Context Transformer (GCT)) to enhance the model performance. Experimental results on GRAZPEDWRI-DX dataset demonstrate that our proposed YOLOv8+GC-M3 model improves the mAP@50 value from 65.78% to 66.32%, outperforming the state-of-the-art (SOTA) model while reducing inference time. Furthermore, our proposed YOLOv8+SE-M3 model achieves the highest mAP@50 value of 67.07%, exceeding the SOTA performance. The implementation of this work is available at https://github.com/RuiyangJu/FCE-YOLOv8.

2409.12636 2026-02-06 cs.CV cs.LG eess.IV

Image inpainting for corrupted images by using the semi-super resolution GAN

Mehrshad Momen-Tayefeh, Mehrdad Momen-Tayefeh, Amir Ali Ghafourian Ghahramani

详情
英文摘要

Image inpainting is a valuable technique for enhancing images that have been corrupted. The primary challenge in this research revolves around the extent of corruption in the input image that the deep learning model must restore. To address this challenge, we introduce a Generative Adversarial Network (GAN) for learning and replicating the missing pixels. Additionally, we have developed a distinct variant of the Super-Resolution GAN (SRGAN), which we refer to as the Semi-SRGAN (SSRGAN). Furthermore, we leveraged three diverse datasets to assess the robustness and accuracy of our proposed model. Our training process involves varying levels of pixel corruption to attain optimal accuracy and generate high-quality images.

2409.09098 2026-02-06 cs.SD cs.CL eess.AS

AccentBox: Towards High-Fidelity Zero-Shot Accent Generation

Jinzuomu Zhong, Korin Richmond, Zhiba Su, Siqi Sun

Comments Accepted by ICASSP 2025

详情
英文摘要

While recent Zero-Shot Text-to-Speech (ZS-TTS) models have achieved high naturalness and speaker similarity, they fall short in accent fidelity and control. To address this issue, we propose zero-shot accent generation that unifies Foreign Accent Conversion (FAC), accented TTS, and ZS-TTS, with a novel two-stage pipeline. In the first stage, we achieve state-of-the-art (SOTA) on Accent Identification (AID) with 0.56 f1 score on unseen speakers. In the second stage, we condition a ZS-TTS system on the pretrained speaker-agnostic accent embeddings extracted by the AID model. The proposed system achieves higher accent fidelity on inherent/cross accent generation, and enables unseen accent generation.

2409.07253 2026-02-06 cs.LG cs.CV

Alignment of Diffusion Models: Fundamentals, Challenges, and Future

Buhua Liu, Shitong Shao, Bao Li, Lichen Bai, Zhiqiang Xu, Haoyi Xiong, James Kwok, Sumi Helal, Zeke Xie

Comments Accepted at ACM Computing Surveys. 35 pages, 5 figures, 4 tables. Paper List: github.com/xie-lab-ml/awesome-alignment-of-diffusion-models

详情
英文摘要

Diffusion models have emerged as the leading paradigm in generative modeling, excelling in various applications. Despite their success, these models often misalign with human intentions and generate results with undesired properties or even harmful content. Inspired by the success and popularity of alignment in tuning large language models, recent studies have investigated aligning diffusion models with human expectations and preferences. This work mainly reviews alignment of diffusion models, covering advancements in fundamentals of alignment, alignment techniques of diffusion models, preference benchmarks, and evaluation for diffusion models. Moreover, we discuss key perspectives on current challenges and promising future directions on solving the remaining challenges in alignment of diffusion models. To the best of our knowledge, our work is the first comprehensive review paper for researchers and engineers to comprehend, practice, and research alignment of diffusion models.

2407.18442 2026-02-06 cs.CL

Guidance-Based Prompt Data Augmentation in Specialized Domains for Named Entity Recognition

Hyeonseok Kang, Hyein Seo, Jeesu Jung, Sangkeun Jung, Du-Seong Chang, Riwoo Chung

Journal ref Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024), 2024

详情
英文摘要

While the abundance of rich and vast datasets across numerous fields has facilitated the advancement of natural language processing, sectors in need of specialized data types continue to struggle with the challenge of finding quality data. Our study introduces a novel guidance data augmentation technique utilizing abstracted context and sentence structures to produce varied sentences while maintaining context-entity relationships, addressing data scarcity challenges. By fostering a closer relationship between context, sentence structure, and role of entities, our method enhances data augmentation's effectiveness. Consequently, by showcasing diversification in both entity-related vocabulary and overall sentence structure, and simultaneously improving the training performance of named entity recognition task.

2406.17835 2026-02-06 cs.LG cs.AI

The Use of AI-Robotic Systems for Scientific Discovery

Alexander H. Gower, Konstantin Korovin, Daniel Brunnsåker, Filip Kronström, Gabriel K. Reder, Ievgeniia A. Tiukova, Ronald S. Reiserer, John P. Wikswo, Ross D. King

Comments 23 pages, book chapter

详情
英文摘要

The process of developing theories and models and testing them with experiments is fundamental to the scientific method. Automating the entire scientific method then requires not only automation of the induction of theories from data, but also experimentation from design to implementation. This is the idea behind a robot scientist -- a coupled system of AI and laboratory robotics that has agency to test hypotheses with real-world experiments. In this chapter we explore some of the fundamentals of robot scientists in the philosophy of science. We also map the activities of a robot scientist to machine learning paradigms, and argue that the scientific method shares an analogy with active learning. We demonstrate these concepts using examples from previous robot scientists, and also from Genesis: a next generation robot scientist designed for research in systems biology, comprising a micro-fluidic system with 1000 computer-controlled micro-bioreactors and interpretable models based in controlled vocabularies and logic.

2406.09643 2026-02-06 cs.LG

A Policy Gradient-Based Sequence-to-Sequence Method for Time Series Prediction

Qi Sima, Xinze Zhang, Yukun Bao, Siyue Yang, Liang Shen

详情
英文摘要

Sequence-to-sequence architectures built upon recurrent neural networks have become a standard choice for multi-step-ahead time series prediction. In these models, the decoder produces future values conditioned on contextual inputs, typically either actual historical observations (ground truth) or previously generated predictions. During training, feeding ground-truth values helps stabilize learning but creates a mismatch between training and inference conditions, known as exposure bias, since such true values are inaccessible during real-world deployment. On the other hand, using the model's own outputs as inputs at test time often causes errors to compound rapidly across prediction steps. To mitigate these limitations, we introduce a new training paradigm grounded in reinforcement learning: a policy gradient-based method to learn an adaptive input selection strategy for sequence-to-sequence prediction models. Auxiliary models first synthesize plausible input candidates for the decoder, and a trainable policy network optimized via policy gradients dynamically chooses the most beneficial inputs to maximize long-term prediction performance. Empirical evaluations on diverse time series datasets confirm that our approach enhances both accuracy and stability in multi-step forecasting compared to conventional methods.

2406.04897 2026-02-06 cs.LG

From Link Prediction to Forecasting: Addressing Challenges in Batch-based Temporal Graph Learning

Moritz Lampert, Christopher Blöcker, Ingo Scholtes

Comments 46 pages (12 pages main text), 19 figures. Published in Transactions on Machine Learning Research (2026)

Journal ref Lampert, M., Blöcker, C. & Scholtes, I., (2026). From Link Prediction to Forecasting: Addressing Challenges in Batch-based Temporal Graph Learning. Transactions on Machine Learning Research, February 2026

详情
英文摘要

Dynamic link prediction is an important problem considered in many recent works that propose approaches for learning temporal edge patterns. To assess their efficacy, models are evaluated on continuous-time and discrete-time temporal graph datasets, typically using a traditional batch-oriented evaluation setup. However, as we show in this work, a batch-oriented evaluation is often unsuitable and can cause several issues. Grouping edges into fixed-sized batches regardless of their occurrence time leads to information loss or leakage, depending on the temporal granularity of the data. Furthermore, fixed-size batches create time windows with different durations, resulting in an inconsistent dynamic link prediction task. In this work, we empirically show how traditional batch-based evaluation leads to skewed model performance and hinders the fair comparison of methods. We mitigate this problem by reformulating dynamic link prediction as a link forecasting task that better accounts for temporal information present in the data.

2405.14982 2026-02-06 cs.LG cs.AI cs.CL stat.ML

In-context Time Series Predictor

Jiecheng Lu, Yan Sun, Shihao Yang

Comments Camera-ready version. Accepted at ICLR 2025

Journal ref Proceedings of the Thirteenth International Conference on Learning Representations (ICLR 2025)

详情
英文摘要

Recent Transformer-based large language models (LLMs) demonstrate in-context learning ability to perform various functions based solely on the provided context, without updating model parameters. To fully utilize the in-context capabilities in time series forecasting (TSF) problems, unlike previous Transformer-based or LLM-based time series forecasting methods, we reformulate "time series forecasting tasks" as input tokens by constructing a series of (lookback, future) pairs within the tokens. This method aligns more closely with the inherent in-context mechanisms, and is more parameter-efficient without the need of using pre-trained LLM parameters. Furthermore, it addresses issues such as overfitting in existing Transformer-based TSF models, consistently achieving better performance across full-data, few-shot, and zero-shot settings compared to previous architectures.

2404.03208 2026-02-06 cs.LG

HiMAL: A Multimodal Hierarchical Multi-task Auxiliary Learning framework for predicting and explaining Alzheimer disease progression

Sayantan Kumar, Sean Yu, Andrew Michelson, Thomas Kannampallil, Philip Payne

Comments Currently under review in Journal of Medical Informatics Association (JAMIA). 6 figures, 3 tables

Journal ref JAMIA Open, Volume 7, Issue 3, October 2024, ooae087

详情
英文摘要

Objective: We aimed to develop and validate a novel multimodal framework HiMAL (Hierarchical, Multi-task Auxiliary Learning) framework, for predicting cognitive composite functions as auxiliary tasks that estimate the longitudinal risk of transition from Mild Cognitive Impairment (MCI) to Alzheimer Disease (AD). Methods: HiMAL utilized multimodal longitudinal visit data including imaging features, cognitive assessment scores, and clinical variables from MCI patients in the Alzheimer Disease Neuroimaging Initiative (ADNI) dataset, to predict at each visit if an MCI patient will progress to AD within the next 6 months. Performance of HiMAL was compared with state-of-the-art single-task and multi-task baselines using area under the receiver operator curve (AUROC) and precision recall curve (AUPRC) metrics. An ablation study was performed to assess the impact of each input modality on model performance. Additionally, longitudinal explanations regarding risk of disease progression were provided to interpret the predicted cognitive decline. Results: Out of 634 MCI patients (mean [IQR] age : 72.8 [67-78], 60% men), 209 (32%) progressed to AD. HiMAL showed better prediction performance compared to all single-modality singe-task baselines (AUROC = 0.923 [0.915-0.937]; AUPRC= 0.623 [0.605-0.644]; all p<0.05). Ablation analysis highlighted that imaging and cognition scores with maximum contribution towards prediction of disease progression. Discussion: Clinically informative model explanations anticipate cognitive decline 6 months in advance, aiding clinicians in future disease progression assessment. HiMAL relies on routinely collected EHR variables for proximal (6 months) prediction of AD onset, indicating its translational potential for point-of-care monitoring and managing of high-risk patients.

2403.19254 2026-02-06 cs.CV

Imperceptible Protection against Style Imitation from Diffusion Models

Namhyuk Ahn, Wonhyuk Ahn, KiYoon Yoo, Daesik Kim, Seung-Hun Nam

Comments IEEE Transactions on Multimedia

详情
英文摘要

Recent progress in diffusion models has profoundly enhanced the fidelity of image generation, but it has raised concerns about copyright infringements. While prior methods have introduced adversarial perturbations to prevent style imitation, most are accompanied by the degradation of artworks' visual quality. Recognizing the importance of maintaining this, we introduce a visually improved protection method while preserving its protection capability. To this end, we devise a perceptual map to highlight areas sensitive to human eyes, guided by instance-aware refinement, which refines the protection intensity accordingly. We also introduce a difficulty-aware protection by predicting how difficult the artwork is to protect and dynamically adjusting the intensity based on this. Lastly, we integrate a perceptual constraints bank to further improve the imperceptibility. Results show that our method substantially elevates the quality of the protected image without compromising on protection efficacy.

2402.09329 2026-02-06 cs.CV

YOLOv8-AM: YOLOv8 Based on Effective Attention Mechanisms for Pediatric Wrist Fracture Detection

Chun-Tse Chien, Rui-Yang Ju, Kuang-Yi Chou, Enkaer Xieerke, Jen-Shiun Chiang

Journal ref IEEE Access 13 (2025) 52461-52477

详情
英文摘要

Wrist trauma and even fractures occur frequently in daily life, particularly among children who account for a significant proportion of fracture cases. Before performing surgery, surgeons often request patients to undergo X-ray imaging first and prepare for it based on the analysis of the radiologist. With the development of neural networks, You Only Look Once (YOLO) series models have been widely used in fracture detection as computer-assisted diagnosis (CAD). In 2023, Ultralytics presented the latest version of the YOLO models, which has been employed for detecting fractures across various parts of the body. Attention mechanism is one of the hottest methods to improve the model performance. This research work proposes YOLOv8-AM, which incorporates the attention mechanism into the original YOLOv8 architecture. Specifically, we respectively employ four attention modules, Convolutional Block Attention Module (CBAM), Global Attention Mechanism (GAM), Efficient Channel Attention (ECA), and Shuffle Attention (SA), to design the improved models and train them on GRAZPEDWRI-DX dataset. Experimental results demonstrate that the mean Average Precision at IoU 50 (mAP 50) of the YOLOv8-AM model based on ResBlock + CBAM (ResCBAM) increased from 63.6% to 65.8%, which achieves the state-of-the-art (SOTA) performance. Conversely, YOLOv8-AM model incorporating GAM obtains the mAP 50 value of 64.2%, which is not a satisfactory enhancement. Therefore, we combine ResBlock and GAM, introducing ResGAM to design another new YOLOv8-AM model, whose mAP 50 value is increased to 65.0%. The implementation code for this study is available on GitHub at https://github.com/RuiyangJu/Fracture_Detection_Improved_YOLOv8.

2312.00992 2026-02-06 cs.LG

Improving Normative Modeling for Multi-modal Neuroimaging Data using mixture-of-product-of-experts variational autoencoders

Sayantan Kumar, Philip Payne, Aristeidis Sotiras

Comments IEEE Internattional Symposium in Biomedical Imaging 2024

Journal ref 2024 IEEE International Symposium on Biomedical Imaging (ISBI)

详情
英文摘要

Normative models in neuroimaging learn the brain patterns of healthy population distribution and estimate how disease subjects like Alzheimer's Disease (AD) deviate from the norm. Existing variational autoencoder (VAE)-based normative models using multimodal neuroimaging data aggregate information from multiple modalities by estimating product or averaging of unimodal latent posteriors. This can often lead to uninformative joint latent distributions which affects the estimation of subject-level deviations. In this work, we addressed the prior limitations by adopting the Mixture-of-Product-of-Experts (MoPoE) technique which allows better modelling of the joint latent posterior. Our model labelled subjects as outliers by calculating deviations from the multimodal latent space. Further, we identified which latent dimensions and brain regions were associated with abnormal deviations due to AD pathology.

2307.01930 2026-02-06 cs.LG cs.AI cs.CV stat.AP stat.ML

Learning ECG Signal Features Without Backpropagation Using Linear Laws

Péter Pósfay, Marcell T. Kurbucz, Péter Kovács, Antal Jakovác

Comments 35 pages, 3 figures, 3 tables

Journal ref Machine Learning: Science and Technology 6, 035001 (2025)

详情
英文摘要

This paper introduces LLT-ECG, a novel method for electrocardiogram (ECG) signal classification that leverages concepts from theoretical physics to automatically generate features from time series data. Unlike traditional deep learning approaches, LLT-ECG operates in a forward manner, eliminating the need for backpropagation and hyperparameter tuning. By identifying linear laws that capture shared patterns within specific classes, the proposed method constructs a compact and verifiable representation, enhancing the effectiveness of downstream classifiers. We demonstrate LLT-ECG's state-of-the-art performance on real-world ECG datasets from PhysioNet, underscoring its potential for medical applications where speed and verifiability are crucial.

2305.15793 2026-02-06 cs.LG cs.AI cs.CE stat.CO

Feature space reduction method for ultrahigh-dimensional, multiclass data: Random forest-based multiround screening (RFMS)

Gergely Hanczár, Marcell Stippinger, Dávid Hanák, Marcell T. Kurbucz, Olivér M. Törteli, Ágnes Chripkó, Zoltán Somogyvári

Comments 9 pages, 2 figures, 2 tables

Journal ref Machine Learning: Science and Technology 4, 045012 (2023)

详情
英文摘要

In recent years, numerous screening methods have been published for ultrahigh-dimensional data that contain hundreds of thousands of features; however, most of these features cannot handle data with thousands of classes. Prediction models built to authenticate users based on multichannel biometric data result in this type of problem. In this study, we present a novel method known as random forest-based multiround screening (RFMS) that can be effectively applied under such circumstances. The proposed algorithm divides the feature space into small subsets and executes a series of partial model builds. These partial models are used to implement tournament-based sorting and the selection of features based on their importance. To benchmark RFMS, a synthetic biometric feature space generator known as BiometricBlender is employed. Based on the results, the RFMS is on par with industry-standard feature screening methods while simultaneously possessing many advantages over these methods.

2304.14211 2026-02-06 cs.LG cs.AI cs.CV cs.MS stat.ML

LLT: An R package for Linear Law-based Feature Space Transformation

Marcell T. Kurbucz, Péter Pósfay, Antal Jakovác

Comments 15 pages, 5 figures, 1 table

Journal ref SoftwareX 25, 101623 (2024)

详情
英文摘要

The goal of the linear law-based feature space transformation (LLT) algorithm is to assist with the classification of univariate and multivariate time series. The presented R package, called LLT, implements this algorithm in a flexible yet user-friendly way. This package first splits the instances into training and test sets. It then utilizes time-delay embedding and spectral decomposition techniques to identify the governing patterns (called linear laws) of each input sequence (initial feature) within the training set. Finally, it applies the linear laws of the training set to transform the initial features of the test set. These steps are performed by three separate functions called trainTest, trainLaw, and testTrans. Their application requires a predefined data structure; however, for fast calculation, they use only built-in functions. The LLT R package and a sample dataset with the appropriate data structure are publicly available on GitHub.

2304.05071 2026-02-06 cs.CV

Fracture Detection in Pediatric Wrist Trauma X-ray Images Using YOLOv8 Algorithm

Rui-Yang Ju, Weiming Cai

Comments Scientific Reports

Journal ref Sci Rep 13 (2023) 20077

详情
英文摘要

Hospital emergency departments frequently receive lots of bone fracture cases, with pediatric wrist trauma fracture accounting for the majority of them. Before pediatric surgeons perform surgery, they need to ask patients how the fracture occurred and analyze the fracture situation by interpreting X-ray images. The interpretation of X-ray images often requires a combination of techniques from radiologists and surgeons, which requires time-consuming specialized training. With the rise of deep learning in the field of computer vision, network models applying for fracture detection has become an important research topic. In this paper, we use data augmentation to improve the model performance of YOLOv8 algorithm (the latest version of You Only Look Once) on a pediatric wrist trauma X-ray dataset (GRAZPEDWRI-DX), which is a public dataset. The experimental results show that our model has reached the state-of-the-art (SOTA) mean average precision (mAP 50). Specifically, mAP 50 of our model is 0.638, which is significantly higher than the 0.634 and 0.636 of the improved YOLOv7 and original YOLOv8 models. To enable surgeons to use our model for fracture detection on pediatric wrist trauma X-ray images, we have designed the application "Fracture Detection Using YOLOv8 App" to assist surgeons in diagnosing fractures, reducing the probability of error analysis, and providing more useful information for surgery.

2303.09190 2026-02-06 cs.CV

Resolution Enhancement Processing on Low Quality Images Using Swin Transformer Based on Interval Dense Connection Strategy

Rui-Yang Ju, Chih-Chia Chen, Jen-Shiun Chiang, Yu-Shian Lin, Wei-Han Chen, Chun-Tse Chien

Journal ref Multimed Tools Appl 83 (2024) 14839-14855

详情
英文摘要

The Transformer-based method has demonstrated remarkable performance for image super-resolution in comparison to the method based on the convolutional neural networks (CNNs). However, using the self-attention mechanism like SwinIR (Image Restoration Using Swin Transformer) to extract feature information from images needs a significant amount of computational resources, which limits its application on low computing power platforms. To improve the model feature reuse, this research work proposes the Interval Dense Connection Strategy, which connects different blocks according to the newly designed algorithm. We apply this strategy to SwinIR and present a new model, which named SwinOIR (Object Image Restoration Using Swin Transformer). For image super-resolution, an ablation study is conducted to demonstrate the positive effect of the Interval Dense Connection Strategy on the model performance. Furthermore, we evaluate our model on various popular benchmark datasets, and compare it with other state-of-the-art (SOTA) lightweight models. For example, SwinOIR obtains a PSNR of 26.62 dB for x4 upscaling image super-resolution on Urban100 dataset, which is 0.15 dB higher than the SOTA model SwinIR. For real-life application, this work applies the lastest version of You Only Look Once (YOLOv8) model and the proposed model to perform object detection and real-life image super-resolution on low-quality images. This implementation code is publicly available at https://github.com/Rubbbbbbbbby/SwinOIR.

2211.16098 2026-02-06 cs.CV

Three-stage binarization of color document images based on discrete wavelet transform and generative adversarial networks

Rui-Yang Ju, Yu-Shian Lin, Yanlin Jin, Chih-Chia Chen, Chun-Tse Chien, Jen-Shiun Chiang

Comments Accepted by Knowledge-Based Systems

Journal ref Knowl.-Based Syst. 304 (2024) 112542

详情
英文摘要

The efficient extraction of text information from the background in degraded color document images is an important challenge in the preservation of ancient manuscripts. The imperfect preservation of ancient manuscripts has led to different types of degradation over time, such as page yellowing, staining, and ink bleeding, seriously affecting the results of document image binarization. This work proposes an effective three-stage network method to image enhancement and binarization of degraded documents using generative adversarial networks (GANs). Specifically, in Stage-1, we first split the input images into multiple patches, and then split these patches into four single-channel patch images (gray, red, green, and blue). Then, three single-channel patch images (red, green, and blue) are processed by the discrete wavelet transform (DWT) with normalization. In Stage-2, we use four independent generators to separately train GAN models based on the four channels on the processed patch images to extract color foreground information. Finally, in Stage-3, we train two independent GAN models on the outputs of Stage-2 and the resized original input images (512x512) as the local and global predictions to obtain the final outputs. The experimental results show that the Avg-Score metrics of the proposed method are 77.64, 77.95, 79.05, 76.38, 75.34, and 77.00 on the (H)-DIBCO 2011, 2013, 2014, 2016, 2017, and 2018 datasets, which are at the state-of-the-art level. The implementation code for this work is available at https://github.com/abcpp12383/ThreeStageBinarization.

2204.00943 2026-02-06 cs.CV

Efficient Convolutional Neural Networks on Raspberry Pi for Image Classification

Rui-Yang Ju, Ting-Yu Lin, Jia-Hao Jian, Jen-Shiun Chiang

Journal ref J Real-Time Image Proc 20 (2023) 21

详情
英文摘要

With the good performance of deep learning algorithms in the field of computer vision (CV), the convolutional neural network (CNN) architecture has become a main backbone of the computer vision task. With the widespread use of mobile devices, neural network models based on platforms with low computing power are gradually being paid attention. However, due to the limitation of computing power, deep learning algorithms are usually not available on mobile devices. This paper proposes a lightweight convolutional neural network, TripleNet, which can operate easily on Raspberry Pi. Adopted from the concept of block connections in ThreshNet, the newly proposed network model compresses and accelerates the network model, reduces the amount of parameters of the network, and shortens the inference time of each image while ensuring the accuracy. Our proposed TripleNet and other state-of-the-art (SOTA) neural networks perform image classification experiments with the CIFAR-10 and SVHN datasets on Raspberry Pi. The experimental results show that, compared with GhostNet, MobileNet, ThreshNet, EfficientNet, and HarDNet, the inference time of TripleNet per image is shortened by 15%, 16%, 17%, 24%, and 30%, respectively. The detail codes of this work are available at https://github.com/RuiyangJu/TripleNet.

2202.00009 2026-02-06 cs.LG

Identifying Dementia Subtypes with Electronic Health Records

Sayantan Kumar, Zachary Abrams, Suzanne Schindler, Nupur Ghoshal, Philip Payne

Comments ACM Conference on Bioinformatics, Computational Biology, and Health Informatics 13 pages, 7 figures, 3 tables

Journal ref PLoS ONE 19(11): e0313425, 2024

详情
英文摘要

Dementia is characterized by a decline in memory and thinking that is significant enough to impair function in activities of daily living. Patients seen in dementia specialty clinics are highly heterogeneous with a variety of different symptoms that progress at different rates. In this work, we used an unsupervised data-driven K-Means clustering approach on the component scores of the Clinical Dementia Rating (CDR) score to identify dementia subtypes and used the gap-statistic to identify the optimal number of clusters. Our goal was to characterize the identified dementia subtypes in terms of their cognitive performance and analyze how patient transitions between subtypes relate to disease progression. Our results indicate both inter-subtype variability, which indicates the variability amongst dementia subtypes for a particular component score even with the same CDR and (ii) intra-subtype variability, which indicates the variation in the 6 component scores within a particular dementia subtype. We observed that dementia subtypes that represented individuals with very mild dementia (CDR 0.5) had widely varying rates of transition to other subtypes. Future work includes testing the generalizability of our proposed pipeline on additional datasets, and using a larger volume of EHR data to estimate probabilistic estimates of the variability between dementia subtypes both in terms of cognitive profile and disease progression.

2201.03013 2026-02-06 cs.CV

ThreshNet: An Efficient DenseNet Using Threshold Mechanism to Reduce Connections

Rui-Yang Ju, Ting-Yu Lin, Jia-Hao Jian, Jen-Shiun Chiang, Wei-Bin Yang

Comments IEEE Access

Journal ref IEEE Access 10 (2022) 82834-82843

详情
英文摘要

With the continuous development of neural networks for computer vision tasks, more and more network architectures have achieved outstanding success. As one of the most advanced neural network architectures, DenseNet shortcuts all feature maps to solve the model depth problem. Although this network architecture has excellent accuracy with low parameters, it requires an excessive inference time. To solve this problem, HarDNet reduces the connections between the feature maps, making the remaining connections resemble harmonic waves. However, this compression method may result in a decrease in the model accuracy and an increase in the parameters and model size. This network architecture may reduce the memory access time, but its overall performance can still be improved. Therefore, we propose a new network architecture, ThreshNet, using a threshold mechanism to further optimize the connection method. Different numbers of connections for different convolution layers are discarded to accelerate the inference of the network. The proposed network has been evaluated with image classification using CIFAR 10 and SVHN datasets under platforms of NVIDIA RTX 3050 and Raspberry Pi 4. The experimental results show that, compared with HarDNet68, GhostNet, MobileNetV2, ShuffleNet, and EfficientNet, the inference time of the proposed ThreshNet79 is 5%, 9%, 10%, 18%, and 20% faster, respectively. The number of parameters of ThreshNet95 is 55% less than that of HarDNet85. The new model compression and model acceleration methods can speed up the inference time, enabling network models to operate on mobile devices.

2110.04598 2026-02-06 cs.LG

Self-explaining Neural Network with Concept-based Explanations for ICU Mortality Prediction

Sayantan Kumar, Sean C. Yu, Thomas Kannampallil, Zachary Abrams, Andrew Michelson, Philip R. O. Payne

Comments Workshop on Interpretable ML in Healthcare at International Conference on Machine Learning (ICML 2022)

Journal ref BCB '22: Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics Article No.: 8, Pages 1 - 9, 2022

详情
英文摘要

Complex deep learning models show high prediction tasks in various clinical prediction tasks but their inherent complexity makes it more challenging to explain model predictions for clinicians and healthcare providers. Existing research on explainability of deep learning models in healthcare have two major limitations: using post-hoc explanations and using raw clinical variables as units of explanation, both of which are often difficult for human interpretation. In this work, we designed a self-explaining deep learning framework using the expert-knowledge driven clinical concepts or intermediate features as units of explanation. The self-explaining nature of our proposed model comes from generating both explanations and predictions within the same architectural framework via joint training. We tested our proposed approach on a publicly available Electronic Health Records (EHR) dataset for predicting patient mortality in the ICU. In order to analyze the performance-interpretability trade-off, we compared our proposed model with a baseline having the same set-up but without the explanation components. Experimental results suggest that adding explainability components to a deep learning framework does not impact prediction performance and the explanations generated by the model can provide insights to the clinicians to understand the possible reasons behind patient mortality.