13 WTCCC1: quality control
As usual, we start by exploring the data in hand.
First, let’s make a clean slate and create a working directory for the WTCCC1 data.
clear
mkdir -v ~/wtccc1
Next, we’ll run a few plink
commands
plink --bfile ~/data/shared/wtccc1/CADn1871_500Kb37fwd --bmerge ~/data/shared/wtccc1/UKBSn1397_500Kb37fwd --make-bed --out wtccc1/wtccc1 && \
plink --bfile wtccc1/wtccc1 --freq --out wtccc1/wtccc1 && \
plink --bfile wtccc1/wtccc1 --hardy --out wtccc1/wtccc1 && \
plink --bfile wtccc1/wtccc1 --missing --out wtccc1/wtccc1 && \
plink --bfile wtccc1/wtccc1 --test-missing --out wtccc1/wtccc1
cat wtccc1/wtccc1.missing | awk '$5 < 0.00001' | awk '{ print $2 }' > wtccc1/wtccc1-fail-diffmiss-qc.txt
library("data.table")
wtccc1_HWE <- data.table::fread("wtccc1/wtccc1.hwe")
wtccc1_FRQ <- data.table::fread("wtccc1/wtccc1.frq")
wtccc1_IMISS <- data.table::fread("wtccc1/wtccc1.imiss")
wtccc1_LMISS <- data.table::fread("wtccc1/wtccc1.lmiss")
wtccc1_HWE$logP <- -log10(wtccc1_HWE$P)
wtccc1_LMISS$callrate <- 1 - wtccc1_LMISS$F_MISS
wtccc1_IMISS$callrate <- 1 - wtccc1_IMISS$F_MISS
Let’s investigate the HWE p-value in the whole cohort, and per stratum (cases and controls) with the code below.
library("ggpubr")
ggpubr::gghistogram(wtccc1_HWE, x = "logP",
add = "mean",
add.params = list(color = "#595A5C", linetype = "dashed", size = 1),
rug = TRUE,
# color = "#1290D9", fill = "#1290D9",
color = "TEST", fill = "TEST",
palette = "lancet",
facet.by = "TEST",
bins = 50,
xlab = "HWE -log10(P)") +
ggplot2::geom_vline(xintercept = 5, linetype = "dashed",
color = "#E55738", size = 1)
ggplot2::ggsave("wtccc1/wtccc1-hwe.png", plot = last_plot())
This will result in Figure 13.1.

Figure 13.1: Stratified HWE p-values.
We should also inspect the allele frequencies. Note that by default PLINK (whether v0.7, v1.9, or v2.0) stores the alleles as minor (A1) and major (A2), and therefore --maf
always calculates the frequency of the minor allele (A1).
ggpubr::gghistogram(wtccc1_FRQ, x = "MAF",
add = "mean", add.params = list(color = "#595A5C", linetype = "dashed", size = 1),
rug = TRUE,
color = "#1290D9", fill = "#1290D9",
xlab = "minor allele frequency") +
ggplot2::geom_vline(xintercept = 0.05, linetype = "dashed",
color = "#E55738", size = 1)
ggplot2::ggsave("wtccc1/wtccc1-freq.png", plot = last_plot())
This will result in Figure 13.2.

Figure 13.2: Minor allele frequencies.
There could be sample with very poor overall call rate, where for many SNPs there is no data. We will want to identify these samples and exclude them.
ggpubr::gghistogram(wtccc1_IMISS, x = "callrate",
add = "mean", add.params = list(color = "#595A5C", linetype = "dashed", size = 1),
rug = TRUE, bins = 50,
color = "#1290D9", fill = "#1290D9",
xlab = "per sample call rate") +
ggplot2::geom_vline(xintercept = 0.95, linetype = "dashed",
color = "#E55738", size = 1)
ggplot2::ggsave("wtccc1/wtccc1-sample-call-rate.png", plot = last_plot())
This will result in Figure 13.3.

Figure 13.3: Per sample call rate.
Question: What do you notice in the ‘per sample call rate’ graph? Can you think of a reason why this is? And how would you deal with this?
Lastly, we must inspect the per SNP call rate; we need to know if there are SNPs that have no data for many samples. We will want to identify such SNPs and exclude these.
ggpubr::gghistogram(wtccc1_LMISS, x = "callrate",
add = "mean", add.params = list(color = "#595A5C", linetype = "dashed", size = 1),
rug = TRUE, bins = 50,
color = "#1290D9", fill = "#1290D9",
xlab = "per SNP call rate") +
ggplot2::geom_vline(xintercept = 0.95, linetype = "dashed",
color = "#E55738", size = 1)
ggplot2::ggsave("wtccc1/wtccc1-hwe.png", plot = last_plot())
This will result in Figure 13.4.

Figure 13.4: Per SNP call rate.
13.1 Quality control
Now that we have handle on the data, we can filter it.
plink --bfile wtccc1/wtccc1 --exclude wtccc1/wtccc1-fail-diffmiss-qc.txt --maf 0.01 --geno 0.05 --hwe 0.00001 --make-bed --out wtccc1/wtccc1_clean
Question: Do you have any thoughts on that? Do you agree with the filters I set below? How would you do it differently and why?
13.2 Ancestral background
If these individuals are all from the United Kingdom, we are certain there will be admixture from other populations given UK’s history. Let’s project the WTCCC1 data on 1000G phase 1 populations.
We will face the same issue as before with our dummy dataset with respect to EIGENSOFT
. So I created the data for you to skip to the Plotting PCA section immediately. Regardless, in the Preparing PCA and Running PCA sections I show you how to get there.
13.2.1 Preparing PCA
Filtering WTCCC1
For PCA we need to perform extreme clean.
plink --bfile wtccc1/wtccc1_clean --maf 0.1 --geno 0.1 --indep-pairwise 100 50 0.2 --exclude ~/data/shared/support/exclude_problematic_range.txt --make-bed --out wtccc1/wtccc1_temp
plink --bfile wtccc1/wtccc1_temp --exclude wtccc1/wtccc1_temp.prune.out --make-bed --out wtccc1/wtccc1_extrclean
rm -fv wtccc1/wtccc1_temp*
cat wtccc1/wtccc1_extrclean.bim | awk '{ print $2 }' > wtccc1/wtccc1_extrclean.variants.txt
cat wtccc1/wtccc1.bim | grep "rs" > wtccc1/all.variants.txt
Notice that you are using real world data: there are thousands of variants ‘pruned’ due to the --indep-pairwise 100 50 0.2
-flag.
Merging WTCCC1 with 1000G phase 1
Now we are ready to extract the WTCCC1 variants from the 1000G phase 1 reference
plink --bfile ~/data/shared/ref_1kg_phase1_all/1kg_phase1_all --extract wtccc1/all.variants.txt --make-bed --out wtccc1/1kg_phase1_wtccc1
Extracting the A/T and C/G SNPs as well.
cat wtccc1/1kg_phase1_wtccc1.bim | \
awk '($5 == "A" && $6 == "T") || ($5 == "T" && $6 == "A") || ($5 == "C" && $6 == "G") || ($5 == "G" && $6 == "C")' | awk '{ print $2, $1, $4, $3, $5, $6 }' \
> wtccc1/all.1kg_wtccc1.atcg.variants.txt
plink --bfile wtccc1/1kg_phase1_wtccc1 --exclude wtccc1/all.1kg_wtccc1.atcg.variants.txt --make-bed --out wtccc1/1kg_phase1_wtccc1_no_atcg
plink --bfile wtccc1/1kg_phase1_wtccc1_no_atcg --extract wtccc1/wtccc1_extrclean.variants.txt --make-bed --out wtccc1/1kg_phase1_raw_no_atcg_wtccc1
Finally we will merge the datasets.
plink --bfile wtccc1/wtccc1_extrclean --bmerge wtccc1/1kg_phase1_raw_no_atcg_wtccc1 --maf 0.1 --geno 0.1 --exclude ~/data/shared/support/exclude_problematic_range.txt --make-bed --out wtccc1/wtccc1_extrclean_1kg
13.2.2 Running PCA
Great, we’ve prepared our dummy project data and merged this with 1000G phase 1. Let’s execute the PCA using --pca
in PLINK
.
plink --bfile wtccc1/wtccc1_extrclean_1kg --pca --out wtccc1/wtccc1_extrclean_1kg
13.2.3 Plotting PCA
If all is peachy, you were able to run the PCA for the WTCCC1 data against 1000G phase 1. Using --pca
in plink
we have calculated principal components (PCs) and we can now start plotting them. Let’s create a scatter diagram of the first two principal components just like we did with the dummy data.
And we should visualize the PCA results: are these individuals really all from European (UK) ancestry?
PCA_eigenval <- data.table::fread("wtccc1/wtccc1_extrclean_1kg.eigenval")
PCA_eigenvec <- data.table::fread("wtccc1/wtccc1_extrclean_1kg.eigenvec")
ref_pop_raw <- data.table::fread("~/data/shared/ref_1kg_phase1_all/1kg_phase1_all.pheno")
wtccc1_pop <- data.table::fread("wtccc1/wtccc1.fam")
# Rename some
names(PCA_eigenval)[names(PCA_eigenval) == "V1"] <- "Eigenvalue"
names(PCA_eigenvec)[names(PCA_eigenvec) == "V1"] <- "FID"
names(PCA_eigenvec)[names(PCA_eigenvec) == "V2"] <- "IID"
names(PCA_eigenvec)[names(PCA_eigenvec) == "V3"] <- "PC1"
names(PCA_eigenvec)[names(PCA_eigenvec) == "V4"] <- "PC2"
names(PCA_eigenvec)[names(PCA_eigenvec) == "V5"] <- "PC3"
names(PCA_eigenvec)[names(PCA_eigenvec) == "V6"] <- "PC4"
names(PCA_eigenvec)[names(PCA_eigenvec) == "V7"] <- "PC5"
names(PCA_eigenvec)[names(PCA_eigenvec) == "V8"] <- "PC6"
names(PCA_eigenvec)[names(PCA_eigenvec) == "V9"] <- "PC7"
names(PCA_eigenvec)[names(PCA_eigenvec) == "V10"] <- "PC8"
names(PCA_eigenvec)[names(PCA_eigenvec) == "V11"] <- "PC9"
names(PCA_eigenvec)[names(PCA_eigenvec) == "V12"] <- "PC10"
names(PCA_eigenvec)[names(PCA_eigenvec) == "V13"] <- "PC11"
names(PCA_eigenvec)[names(PCA_eigenvec) == "V14"] <- "PC12"
names(PCA_eigenvec)[names(PCA_eigenvec) == "V15"] <- "PC13"
names(PCA_eigenvec)[names(PCA_eigenvec) == "V16"] <- "PC14"
names(PCA_eigenvec)[names(PCA_eigenvec) == "V17"] <- "PC15"
names(PCA_eigenvec)[names(PCA_eigenvec) == "V18"] <- "PC16"
names(PCA_eigenvec)[names(PCA_eigenvec) == "V19"] <- "PC17"
names(PCA_eigenvec)[names(PCA_eigenvec) == "V20"] <- "PC18"
names(PCA_eigenvec)[names(PCA_eigenvec) == "V21"] <- "PC19"
names(PCA_eigenvec)[names(PCA_eigenvec) == "V22"] <- "PC20"
names(wtccc1_pop)[names(wtccc1_pop) == "V1"] <- "Family_ID"
names(wtccc1_pop)[names(wtccc1_pop) == "V2"] <- "Individual_ID"
names(wtccc1_pop)[names(wtccc1_pop) == "V5"] <- "Gender"
names(wtccc1_pop)[names(wtccc1_pop) == "V6"] <- "Phenotype"
wtccc1_pop$V3<- NULL
wtccc1_pop$V4<- NULL
wtccc1_pop$Population <- wtccc1_pop$Phenotype
wtccc1_pop$Population[wtccc1_pop$Population == 2] <- "Case"
wtccc1_pop$Population[wtccc1_pop$Population == 1] <- "Control"
# we subset the data we need
ref_pop <- subset(ref_pop_raw, select = c("Family_ID", "Individual_ID", "Gender", "Phenotype", "Population"))
rm(ref_pop_raw)
# we combine the reference and dummy information
ref_wtccc1_pop <- rbind(wtccc1_pop, ref_pop)
PCA_1kG <- merge(PCA_eigenvec,
ref_wtccc1_pop,
by.x = "IID",
by.y = "Individual_ID",
sort = FALSE,
all.x = TRUE)
# Population Description Super population Code Counts
# ASW African Ancestry in Southwest US AFR 4 #49A01D
# CEU Utah residents with Northern and Western European ancestry EUR 7 #E55738
# CHB Han Chinese in Bejing, China EAS 8 #9A3480
# CHS Southern Han Chinese, China EAS 9 #705296
# CLM Colombian in Medellin, Colombia MR 10 #8D5B9A
# FIN Finnish in Finland EUR 12 #2F8BC9
# GBR British in England and Scotland EUR 13 #1290D9
# IBS Iberian populations in Spain EUR 16 #1396D8
# JPT Japanese in Tokyo, Japan EAS 18 #D5267B
# LWK Luhya in Webuye, Kenya AFR 20 #78B113
# MXL Mexican Ancestry in Los Angeles, California AMR 22 #F59D10
# PUR Puerto Rican in Puerto Rico AMR 25 #FBB820
# TSI Toscani in Italy EUR 27 #4C81BF
# YRI Yoruba in Ibadan, Nigeria AFR 28 #C5D220
PCA_1kGplot <- ggpubr::ggscatter(PCA_1kG,
x = "PC1",
y = "PC2",
color = "Population",
palette = c("#49A01D",
"#595A5C",
"#E55738",
"#9A3480",
"#705296",
"#8D5B9A",
"#A2A3A4",
"#2F8BC9",
"#1290D9",
"#1396D8",
"#D5267B",
"#78B113",
"#F59D10",
"#FBB820",
"#4C81BF",
"#C5D220"),
xlab = "principal component 1", ylab = "principal component 2") +
ggplot2::geom_vline(xintercept = 0.0023, linetype = "dashed",
color = "#E55738", size = 1)
p2 <- ggpubr::ggpar(PCA_1kGplot,
title = "Principal Component Analysis",
subtitle = "Reference population: 1000 G, phase 1",
legend.title = "Populations", legend = "right")
ggplot2::ggsave("wtccc1/wtccc1-qc-pca-1000g.png", plot = p2)
p2
rm(p2)
We expect most individuals from the WTCCC to be 100% British, but a substantial group will have a different ancestral background as shown in the Figure 13.5 you just made.

Figure 13.5: PCA - WTCCC1 vs. 1000G
13.2.4 Removing samples
In a similar fashion as in the example gwas and rawdata datasets, you should consider to remove the samples below the threshold based on this PCA (Figure 13.5).
Go ahead, try that.
You’re code would be something like below:
cat wtccc1/wtccc1_extrclean_1kg.eigenvec | \
awk '$3 < 0.0023' | awk '{ print $1, $2 }' > wtccc1/fail-ancestry-QC.txt
Next we filter these samples and get a final fully QC’d dataset.
plink --bfile wtccc1/wtccc1_clean --exclude wtccc1/fail-ancestry-QC.txt --make-bed --out wtccc1/wtccc1_qc
13.3 Summary
You have now explored the WTCCC1 genotype data by inspecting call rates, the HWE p-values and frequencies. You have also projected the WTCCC1 data on the 1000G phase 1 reference panel and removed samples that did not match the expected ancestry. This resulted in a fully quality controlled dataset (wtccc1/wtccc1_qc
).
Now you’re ready for the actual GWAS on coronary artery disease that led to this seminal publication. On to Chapter 14.