Date of Award
2025
Document Type
Open Access Dissertation
Degree Name
Doctor of Philosophy in Statistics (PhD)
Administrative Home Department
Department of Mathematical Sciences
Advisor 1
Kui Zhang
Committee Member 1
Qiuying Sha
Committee Member 2
Hairong Wei
Committee Member 3
Xiao Zhang
Abstract
Transcriptome-wide association studies (TWAS) have emerged as a powerful strategy for bridging genome-wide association studies (GWAS) and gene regulatory mechanisms by integrating genotypic data with gene expression data. While early TWAS methods typically rely on linear models and single-tissue expression references, recent advances underscore the need for flexible, multi-tissue approaches that can capture heterogeneous regulatory architectures and tissue-specific expression patterns. This dissertation presents a three-part research project that advances multi-tissue TWAS along complementary axes of methodology, statistical power, and modelling flexibility.
In Chapter One, we introduce TWAS-CTL, a two-stage cross-tissue learner that trains any user-chosen single-tissue learners (STLs, the per-tissue expression imputers) and then fuses their predictions with an empirical utility weight function that down-weights poorly transferring tissues. Extensive simulations show that TWAS-CTL controls the type I error rate and exceeds the unified test for molecular signatures (UTMOST), one of the leading methods in this field, in power while cutting computational time by more than half. In the analysis of a GWAS cohort, it recovers more trait-relevant genes than established benchmarks such as PrediXcan (a foundational approach in the development of TWAS) and UTMOST.
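The second-stage fusion described above can be illustrated with a minimal Python sketch. The abstract does not specify the utility weight function, so the choice here (squared, non-negative correlation between each tissue's prediction and observed target-tissue expression, normalized to sum to one) is purely an assumption, as are the function names `utility_weights` and `fuse_predictions`.

```python
import numpy as np

def utility_weights(preds, y_target):
    """Hypothetical utility weights for single-tissue predictions.

    preds: array (n_tissues, n_samples) of held-out predictions of the
           target-tissue expression, one row per single-tissue learner.
    y_target: array (n_samples,) of observed target-tissue expression.
    Assumption: weight = max(correlation, 0)^2, normalized to sum to 1.
    """
    preds = np.asarray(preds, dtype=float)
    y_target = np.asarray(y_target, dtype=float)
    w = np.zeros(preds.shape[0])
    for k, p in enumerate(preds):
        # Constant predictions carry no information and keep weight 0.
        if np.std(p) > 0 and np.std(y_target) > 0:
            r = np.corrcoef(p, y_target)[0, 1]
            w[k] = max(r, 0.0) ** 2
    total = w.sum()
    if total <= 0:
        # Fall back to equal weights if no tissue transfers at all.
        return np.full(preds.shape[0], 1.0 / preds.shape[0])
    return w / total

def fuse_predictions(preds, weights):
    """Weighted average of the single-tissue predictions (one value per sample)."""
    return np.asarray(weights) @ np.asarray(preds, dtype=float)

# Toy illustration with simulated data (not from the dissertation).
rng = np.random.default_rng(0)
y = rng.normal(size=500)
toy_preds = np.vstack([
    y + rng.normal(scale=0.3, size=500),  # tissue that transfers well
    y + rng.normal(scale=2.0, size=500),  # tissue that transfers poorly
    rng.normal(size=500),                 # unrelated tissue
])
toy_weights = utility_weights(toy_preds, y)
toy_fused = fuse_predictions(toy_preds, toy_weights)
```

In this toy setting the well-transferring tissue receives most of the weight and the unrelated tissue receives almost none, which is the qualitative behavior the abstract attributes to the utility weight function.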
In Chapter Two, the GWAS-boosted cross-tissue learner (G-Boost-CTL) extends this framework by re-weighting the STLs with genotypic information extracted directly from the GWAS cohort (e.g., the cross-sample variability of the imputed expression), so that tissues carrying stronger association signals are automatically emphasized. The dual weighting scheme preserves appropriate type I error rates yet delivers marked power gains over penalized linear and covariance-based tools across a wide spectrum of tissue-sharing scenarios. G-Boost-CTL also outperforms existing multi-tissue TWAS approaches in a real-data analysis, uncovering more statistically significant and biologically plausible disease loci.
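As a rough illustration of the dual weighting idea, the sketch below multiplies a set of reference-panel utility weights by the cross-sample variance of each tissue's imputed expression in the GWAS cohort and renormalizes. The multiplicative form and the name `gwas_boosted_weights` are assumptions made for illustration; the abstract names cross-sample variability only as one example of GWAS-derived information and does not state how it is combined with the first-stage weights.

```python
import numpy as np

def gwas_boosted_weights(utility_w, imputed_gwas):
    """Hypothetical dual weighting of single-tissue learners.

    utility_w: array (n_tissues,) of non-negative first-stage weights.
    imputed_gwas: array (n_tissues, n_gwas_samples) of expression imputed
                  into the GWAS cohort by each single-tissue learner.
    Assumption: boosted weight is proportional to
                utility weight * cross-sample variance.
    """
    utility_w = np.asarray(utility_w, dtype=float)
    variability = np.var(np.asarray(imputed_gwas, dtype=float), axis=1)
    raw = utility_w * variability
    total = raw.sum()
    if total <= 0:
        # No usable signal: fall back to equal weights.
        return np.full(utility_w.shape[0], 1.0 / utility_w.shape[0])
    return raw / total

# Two tissues with equal utility weights; the second varies more across
# GWAS samples (variance 4 versus 1) and is therefore emphasized.
demo_weights = gwas_boosted_weights(
    [0.5, 0.5],
    [[-1.0, 1.0, -1.0, 1.0],
     [-2.0, 2.0, -2.0, 2.0]],
)
```

Here `demo_weights` comes out as 0.2 and 0.8: the tissue whose imputed expression varies more across the cohort is up-weighted even though both tissues start with the same utility weight.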
In Chapter Three, we replace the linear imputers that dominate TWAS with two non-linear engines: gradient-boosted trees and deep learning. Using data from 49 tissues in the Genotype-Tissue Expression (GTEx) project, we show, through large-scale simulation and real-data analysis, that these learners maintain appropriate type I error rates while boosting discovery, with improved power for gradient-boosted trees, and with deep learning methods (e.g., deep neural networks) revealing additional complementary, tissue-specific signals.
Collectively, these studies demonstrate that (i) adaptive cross-tissue weighting, (ii) incorporation of GWAS-derived information, and (iii) non-linear machine learning and deep learning imputers each confer substantial and largely orthogonal benefits. Taken together, they outline a scalable, modular blueprint for multi-tissue TWAS that more faithfully captures the complex, heterogeneous architecture of gene regulation and offers deeper insight into the molecular basis of complex human disease.
Recommended Citation
Billah, Md Mutasim, "METHODS IN STATISTICS, MACHINE LEARNING, AND DEEP LEARNING FOR COMBINING MULTI-OMICS DATASET", Open Access Dissertation, Michigan Technological University, 2025.
https://digitalcommons.mtu.edu/etdr/1959
Included in
Applied Statistics Commons, Bioinformatics Commons, Biostatistics Commons, Genetics Commons, Genomics Commons, Statistical Methodology Commons, Statistical Models Commons