Date of Award


Document Type

Open Access Dissertation

Degree Name

Doctor of Philosophy in Statistics (PhD)

Administrative Home Department

Department of Mathematical Sciences

Advisor 1

Qiuying Sha

Committee Member 1

Shuanglin Zhang

Committee Member 2

Kui Zhang

Committee Member 3

Weihua Zhou


This dissertation includes five Chapters. A brief description of each chapter is organized as follows.

In Chapter One, we propose a signed bipartite genotype and phenotype network (GPN) by linking phenotypes and genotypes based on the statistical associations. It provides a new insight to investigate the genetic architecture among multiple correlated phenotypes and explore where phenotypes might be related at a higher level of cellular and organismal organization. We show that multiple phenotypes association studies by considering the proposed network are improved by incorporating the genetic information into the phenotype clustering.

In Chapter Two, we first illustrate the proposed GPN to GWAS summary statistics. Then, we assess contributions to constructing a well-defined GPN with a clear representation of genetic associations by comparing the network properties with a random network, including connectivity, centrality, and community structure. The network topology annotations based on the sparse representations of GPN can be used to understand the disease heritability for the highly correlated phenotypes. In applications of phenome-wide association studies, the proposed GPN can identify more significant pairs of genetic variant and phenotype categories.

In Chapter Three, a powerful and computationally efficient gene-based association test is proposed, aggregating information from different gene-based association tests and also incorporating expression quantitative trait locus information. We show that the proposed method controls the type I error rates very well and has higher power in the simulation studies and can identify more significant genes in the real data analyses.

In Chapter Four, we develop six statistical selection methods based on the penalized regression for inferring target genes of a transcription factor (TF). In this study, the proposed selection methods combine statistics, machine learning , and convex optimization approach, which have great efficacy in identifying the true target genes. The methods will fill the gap of lacking the appropriate methods for predicting target genes of a TF, and are instrumental for validating experimental results yielding from ChIP-seq and DAP-seq, and conversely, selection and annotation of TFs based on their target genes.

In Chapter Five, we propose a gene selection approach by capturing gene-level signals in network-based regression into case-control association studies with DNA sequence data or DNA methylation data, inspired by the popular gene-based association tests using a weighted combination of genetic variants to capture the combined effect of individual genetic variants within a gene. We show that the proposed gene selection approach have higher true positive rates than using traditional dimension reduction techniques in the simulation studies and select potentially rheumatoid arthritis related genes that are missed by existing methods.