9 Genome-Wide Association Study

Now that you have learned how to perform QC, you can easily run a GWAS and execute some downstream visualization and analyses. Let’s do this with a dummy dataset.

9.1 Exploring the data

Even though someone says that the QC was done, it is still wise and good practice to run some of the commands above to get a ‘feeling’ about the data. So let’s do this.

First, we’ll clean up what we did before - we don’t need it anyways.

clear
rm -v dummy_project/*

With the command clear you create a blank Terminal screen. With rm -v you remove all files in the dummy_project directory, where the -v-flag means that you get a verbose output of what is removed.

plink --bfile ~/data/shared/gwas/gwa --freq --out dummy_project/gwa
plink --bfile ~/data/shared/gwas/gwa --missing --out dummy_project/gwa
plink --bfile ~/data/shared/gwas/gwa --hardy --out dummy_project/gwa

Let’s visualize the results. First we should load in all the results.

Question: Load the data using R. [Hint: use and adapt the examples from the previous chapters.]

We can plot the per-stratum HWE p-values.

Question: Plot the per-stratum HWE p-values using R. [Hint: use and adapt the examples from the previous chapters.]

library("ggpubr")
ggpubr::gghistogram(gwas_HWE, x = "logP",
                    add = "mean",
                    add.params = list(color = "#595A5C", linetype = "dashed", size = 1),
                    rug = TRUE,
                    # color = "#1290D9", fill = "#1290D9",
                    color = "TEST", fill = "TEST",
                    palette = "lancet",
                    facet.by = "TEST",
                    bins = 50,
                    xlab = "HWE -log10(P)") +
  ggplot2::geom_vline(xintercept = 5, linetype = "dashed",
                      color = "#E55738", size = 1)
ggplot2::ggsave("data/dummy_project/gwas-hwe.png",
       plot = last_plot())
Per stratum HWE p-values.

Figure 9.1: Per stratum HWE p-values.

We will want to see what the distribution of allele frequencies looks like.

Question: Plot the allele frequencies using R. [Hint: use and adapt the examples from the previous chapters.]

ggpubr::gghistogram(gwas_FRQ, x = "MAF",
                    add = "mean", add.params = list(color = "#595A5C", linetype = "dashed", size = 1),
                    rug = TRUE,
                    color = "#1290D9", fill = "#1290D9",
                    xlab = "minor allele frequency") +
  ggplot2::geom_vline(xintercept = 0.05, linetype = "dashed",
                      color = "#E55738", size = 1)
ggplot2::ggsave("data/dummy_project/gwas-freq.png",
       plot = last_plot())
Minor allele frequencies.

Figure 9.2: Minor allele frequencies.

We will want to identify samples that have poor call rates.

Question: Plot the per-sample call rates using R. [Hint: use and adapt the examples from the previous chapters.]

ggpubr::gghistogram(gwas_IMISS, x = "callrate",
                    add = "mean", add.params = list(color = "#595A5C", linetype = "dashed", size = 1),
                    rug = TRUE, bins = 50,
                    color = "#1290D9", fill = "#1290D9",
                    xlab = "per sample call rate") +
  ggplot2::geom_vline(xintercept = 0.95, linetype = "dashed",
                      color = "#E55738", size = 1)
ggplot2::ggsave("data/dummy_project/gwas-sample-call-rate.png",
       plot = last_plot())
Per sample call rates.

Figure 9.3: Per sample call rates.

We also need to know what the per SNP call rates are.

Question: Plot the per-SNP call rates using R. [Hint: use and adapt the examples from the previous chapters.]

ggpubr::gghistogram(gwas_LMISS, x = "callrate",
                    add = "mean", add.params = list(color = "#595A5C", linetype = "dashed", size = 1),
                    rug = TRUE, bins = 50,
                    color = "#1290D9", fill = "#1290D9",
                    xlab = "per SNP call rate") +
  ggplot2::geom_vline(xintercept = 0.95, linetype = "dashed",
                      color = "#E55738", size = 1)
ggplot2::ggsave("data/dummy_project/gwas-snp-call-rate.png",
       plot = last_plot())
Per SNP call rates.

Figure 9.4: Per SNP call rates.

9.2 Genetic models

A simple chi-square test of association can be done.

plink --bfile ~/data/shared/gwas/gwa --model --out dummy_project/data

Genotypic, dominant and recessive tests will not be conducted if any one of the cells in the table of case-control by genotype counts contains less than five observations. This is because the chi-square approximation may not be reliable when cell counts are small. For SNPs with MAFs < 5%, a sample of more than 2,000 cases and controls would be required to meet this threshold and more than 50,000 would be required for SNPs with MAF < 1%.

You can change this default behaviour by adding the flag --cell, e.g., we could lower the threshold to 3.

plink --bfile ~/data/shared/gwas/gwa --model --cell 3 --out dummy_project/data

Let’s review the contents of the results.

gwas_model <- data.table::fread("dummy_project/data.model")

dim(gwas_model)

N_SNPS = length(gwas_model$SNP)

gwas_model[1:10, 1:10]

It contains 1,530,510 rows, one for each SNP, and each type of test (genotypic, trend, allelic, dominant, and recessive) and the following columns:

  • chromosome [CHR],
  • the SNP identifier [SNP],
  • the minor allele [A1] (remember, PLINK always codes the A1-allele as the minor allele!),
  • the major allele [A2],
  • the test performed [TEST]:
    • GENO (genotypic association);
    • TREND (Cochran-Armitage trend);
    • ALLELIC (allelic as- sociation);
    • DOM (dominant model); and
    • REC (recessive model)],
  • the cell frequency counts for cases [AFF],
  • the cell frequency counts for controls [UNAFF],
  • the chi-square test statistic [CHISQ],
  • the degrees of freedom for the test [DF],
  • and the asymptotic P value [P] of association.

Question: Do you know which model, i.e. TEST is most commonly used and reported? And why is that, do think?

9.3 Logistic regression

We can also perform a test of association using logistic regression. In this case we might want to correct for covariates/confounding factors, for example age, sex, ancestral background, i.e. principal components, and other study specific covariates (e.g. hospital of inclusion, genotyping centre etc.). In that case each of these P values is adjusted for the effect of the covariates.

When running a regression analysis, be it linear or logistic, PLINK assumes a multiplicative model. By default, when at least one male and one female is present, sex (male = 1, female = 0) is automatically added as a covariate on X chromosome SNPs, and nowhere else. The sex flag causes it to be added everywhere, while no-x-sex excludes it.

plink --bfile ~/data/shared/gwas/gwa --logistic sex --covar ~/data/shared/gwas/gwa.covar --out dummy_project/data

Let’s examine the results.

gwas_assoc <- data.table::fread("dummy_project/data.assoc.logistic")
dim(gwas_assoc)
## [1] 918306      9
gwas_assoc[1:9, 1:9]
##      CHR       SNP      BP     A1   TEST NMISS     OR     STAT      P
##    <int>    <char>   <int> <char> <char> <int>  <num>    <num>  <num>
## 1:     1 rs3934834  995669      T    ADD  3818 1.0290  0.38120 0.7031
## 2:     1 rs3934834  995669      T    AGE  3818 1.0020  1.11800 0.2635
## 3:     1 rs3934834  995669      T    SEX  3818 1.0120  0.19090 0.8486
## 4:     1 rs3737728 1011278      A    ADD  3982 1.0190  0.38670 0.6990
## 5:     1 rs3737728 1011278      A    AGE  3982 1.0020  1.09800 0.2721
## 6:     1 rs3737728 1011278      A    SEX  3982 1.0060  0.09898 0.9212
## 7:     1 rs6687776 1020428      T    ADD  3915 0.9692 -0.33330 0.7389
## 8:     1 rs6687776 1020428      T    AGE  3915 1.0020  1.04000 0.2984
## 9:     1 rs6687776 1020428      T    SEX  3915 1.0150  0.23690 0.8127

Question: How come there are more lines in this file than there are variants?

If no model option is specified, the first row for each SNP corresponds to results for a multiplicative test of association. The C >= 0 subsequent rows for each SNP correspond to separate tests of significance for each of the C covariates included in the regression model. We can remove the covariate-specific lines from the main report by adding the hide-covar flag.

The columns in the association results are:

  • the chromosome [CHR],
  • the SNP identifier [SNP],
  • the base-pair location [BP],
  • the minor allele [A1],
  • the test performed [TEST]: ADD (multiplicative model or genotypic model testing additivity),
    • GENO_2DF (genotypic model),
    • DOMDEV (genotypic model testing deviation from additivity),
    • DOM (dominant model), or
    • REC (recessive model)],
  • the number of missing individuals included [NMISS],
  • the OR relative to the A1, i.e. minor allele,
  • the coefficient z-statistic [STAT], and
  • the asymptotic P-value [P] of association.

We need to calculate the standard error and confidence interval from the z-statistic. We can modify the effect size (OR) to output the beta by adding the beta flag.

Question: Can you write down the mathematical relation between beta and OR?

9.4 Let’s get visual

Looking at numbers is important, but it won’t give you a perfect overview. We should turn to visualizing our results in Chapter 10.