Date of Award


Document Type

Open Access Dissertation

Degree Name

Doctor of Philosophy in Statistics (PhD)

Administrative Home Department

Department of Mathematical Sciences

Advisor 1

Qiuying Sha

Committee Member 1

Shuanglin Zhang

Committee Member 2

Kui Zhang

Committee Member 3

Jingfeng Jiang


Genome-wide association studies (GWAS) have successfully detected tens of thousands of robust SNP-trait associations. Earlier research focused primarily on associations between genetic variants and well-defined functions or phenotypic traits. Emerging evidence suggests that pleiotropy, the phenomenon in which one genetic variant affects multiple phenotypes, is widespread, especially in complex human diseases. Therefore, analyzing phenotypes one at a time may lose statistical power to identify the underlying genetic mechanism. In contrast to single-phenotype analysis, joint analysis of multiple phenotypes exploits the correlations between phenotypes and aggregates multiple weak marginal effects, and is therefore likely to provide new insights into the functional consequences of genetic variation. This dissertation comprises two papers, corresponding to the two primary research projects I carried out during my Ph.D. study, each presented in its own chapter.

Chapter 1 proposes an innovative method, referred to as HC-CLC, for joint analysis of multiple phenotypes using a Hierarchical Clustering (HC) approach followed by a Clustering Linear Combination (CLC) method. The HC step partitions the phenotypes into clusters. The CLC method then tests the association between the genetic variant and all phenotypes by combining the individual test statistics while taking full advantage of the clustering information from the HC step. Extensive simulations, together with an analysis of the COPDGene data, are used to assess the Type I error rates and the power of the proposed method. The simulation results demonstrate that the Type I error rates of HC-CLC are effectively controlled in a range of realistic settings. HC-CLC either outperforms all the other methods with which it is compared or has statistical power very close to that of the most powerful of them. In addition, the real data analysis shows that HC-CLC is an appropriate method for GWAS.
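The two-stage idea described above (cluster the phenotypes, then combine the per-phenotype test statistics using the cluster structure) can be illustrated with a minimal sketch. This is not the dissertation's implementation. It assumes average-linkage clustering on 1 minus the phenotype correlation, a fixed number of clusters supplied by the user, and one commonly used form of a cluster-wise linear combination statistic, (Wz)'(WΣW')^{-1}(Wz) with W = B'Σ^{-1}, where B is the phenotype-by-cluster indicator matrix. The function names `cluster_phenotypes` and `clc_statistic` are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from scipy.stats import chi2


def cluster_phenotypes(corr, n_clusters):
    """Group phenotypes by average-linkage hierarchical clustering.

    Dissimilarity is taken to be 1 - correlation (an assumption of this sketch).
    Returns one integer cluster label per phenotype.
    """
    dist = 1.0 - np.asarray(corr, dtype=float)
    np.fill_diagonal(dist, 0.0)
    tree = linkage(squareform(dist, checks=False), method="average")
    return fcluster(tree, t=n_clusters, criterion="maxclust")


def clc_statistic(z, corr, labels):
    """Combine per-phenotype z-scores using the cluster structure.

    z      : K per-phenotype test statistics for one variant
    corr   : K x K correlation matrix of those statistics under the null
    labels : K cluster labels
    Returns the statistic and a chi-square p-value with L degrees of freedom,
    where L is the number of clusters (treated as fixed in this sketch).
    """
    z = np.asarray(z, dtype=float)
    corr = np.asarray(corr, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    # B[k, l] = 1 if phenotype k belongs to cluster l, else 0.
    B = (labels[:, None] == clusters[None, :]).astype(float)
    W = B.T @ np.linalg.inv(corr)
    u = W @ z
    stat = float(u @ np.linalg.solve(W @ corr @ W.T, u))
    return stat, float(chi2.sf(stat, df=len(clusters)))
```

With every phenotype in its own cluster the statistic reduces to the familiar quadratic form z'Σ^{-1}z, and with a single cluster it reduces to a one-degree-of-freedom weighted sum. How the number of clusters is chosen is left open here and is passed in as a plain argument.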

Chapter 2 redesigns PheCLC (a phenome-wide association study method that uses the CLC method), which was previously developed by our research group. The refined method, which we name UKB-PheCLC, is then applied to data from the UK Biobank, a large cohort study across the United Kingdom, to test its validity and understand its limitations. UKB-PheCLC is an electronic health record (EHR)-based PheWAS method. In the first step, it classifies the whole phenome into different phenotypic categories according to the UK Biobank ICD codes. In the second step, the CLC method is applied to each phenotypic category to derive a CLC-based p-value for testing the association between the genetic variant of interest and all phenotypes in that category. In the third step, the CLC-based p-values of all categories are combined using a strategy resembling that of the Adaptive Fisher's Combination (AFC) method. Overall, UKB-PheCLC harnesses the powerful resource of the UK Biobank and accounts for the possibility that phenotypes can be grouped into different phenotypic categories, which is very common in EHR-based PheWAS. Moreover, UKB-PheCLC can handle both qualitative and quantitative phenotypes, and it does not require raw phenotype information. The real data analysis results confirm that UKB-PheCLC is more powerful than the existing methods with which we compared it. Thus, UKB-PheCLC can serve as a compelling method for phenome-wide association studies.
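The third step, combining per-category p-values, can be illustrated with a small sketch built on Fisher's combination. It is not the UKB-PheCLC procedure itself. It assumes independent p-values for the closed-form chi-square tail, and it interprets the "adaptive" element loosely as trying Fisher's combination on the m smallest p-values for every m and keeping the best. The function names `fisher_combine` and `best_subset_fisher` are hypothetical.

```python
import math


def fisher_combine(p_values):
    """Fisher's combination of k independent p-values.

    The statistic -2 * sum(log p) is chi-square with 2k degrees of freedom
    under the null. For even degrees of freedom the upper tail has the
    closed form exp(-x/2) * sum_{i<k} (x/2)^i / i!, so no external
    library is needed.
    """
    k = len(p_values)
    if k == 0 or any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("need at least one p-value, each in (0, 1]")
    half = -sum(math.log(p) for p in p_values)  # half of the Fisher statistic
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return 2.0 * half, min(1.0, math.exp(-half) * total)


def best_subset_fisher(p_values):
    """Apply Fisher's combination to the m smallest p-values for every m.

    Returns (smallest combined p-value, m at which it occurs). The minimum
    over m is a test statistic, not a calibrated p-value: its significance
    would still have to be assessed, for example by permutation or
    simulation, which this sketch does not do.
    """
    ordered = sorted(p_values)
    return min(
        (fisher_combine(ordered[:m])[1], m) for m in range(1, len(ordered) + 1)
    )
```

For example, `best_subset_fisher([0.001, 0.9, 0.8])` selects m = 1, because adding the two large p-values only dilutes the single strong signal. That is the situation in which combining over a subset of categories is preferable to combining over all of them.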