This study applied machine learning (ML) and deep learning methods to analyze features of the whole blood transcriptome and mucosal microbiome of patients with primary immunodeficiency (PID). The aim was to better understand disease mechanisms and to further develop an effective diagnostic test for PID (referred to as PrimDx) based on blood gene expression. Early diagnosis of PID is crucial, as diagnostic delays are associated with increased morbidity and mortality. Whole blood RNA and buccal swabs for microbial DNA were collected from patients with a range of antibody deficiencies (n = 62, aged 2–67 years, 32 female) and from age- and sex-matched healthy controls (n = 71). RNA sequencing was used to characterize the blood transcriptome of each participant, detecting and quantifying over 15,000 genes. ML approaches were applied iteratively to training (70%) and blinded test (30%) sets. Across the eight models tested, diagnostic accuracy for identifying PID ranged from 85% to 95%, with Deep Neural Networks achieving the highest accuracy of 95% and an F1 score of 95% (ROC 99%). Feature selection with the least absolute shrinkage and selection operator (LASSO) identified 13 key genes as significant predictive features, most of which are not well characterized but show expression restricted to lymphocytes. To investigate links between PID and the microbiome, ML models were also applied to 16S rRNA profiles of the mucosal microbiome, achieving diagnostic accuracy for PID of 57% to 78%. The highest performance was observed for the Random Forest and LightGBM models, which achieved 78% accuracy and an F1 score of 82% (ROC 82%).
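The evaluation scheme described above (a 70%/30% split, with accuracy and F1 score reported on the held-out set) can be illustrated with a minimal sketch. This is not the study's pipeline: the function names `stratified_split` and `accuracy_and_f1`, the use of class-stratified sampling, the random seed, and the coding of PID as the positive class are all assumptions made for illustration. Only the cohort sizes (62 patients, 71 controls) and the 70/30 proportions come from the text.

```python
import random


def stratified_split(labels, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx), holding out test_fraction of each class.

    Stratifying by class is an assumption; the text only states a 70/30 split.
    """
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


def accuracy_and_f1(y_true, y_pred, positive=1):
    """Compute accuracy and F1 score, treating `positive` as the PID class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Cohort sizes from the text: 62 PID patients (label 1), 71 controls (label 0).
labels = [1] * 62 + [0] * 71
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

In practice, a trained classifier's predictions for the samples in `test_idx` would be passed to `accuracy_and_f1` together with the true labels; F1 is reported alongside accuracy because it is sensitive to missed PID cases and false alarms, not only to overall agreement.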
In conclusion, a diagnostic test (PrimDx) based on whole blood transcriptomic data combined with a predictive algorithm demonstrated high accuracy in identifying PID patients. PrimDx has the potential to support early diagnosis of PID, enabling timely treatment and improved patient outcomes. Features of the mucosal microbiome also had predictive power for diagnosing PID, indicating links between immune function and the microbiome. Ongoing studies are expanding the transcriptome reference database to improve the prediction models and are investigating the 13 key predictive genes, with the goal of enhancing the effectiveness and accessibility of this diagnostic approach.