6 Steps in a Genome-Wide Association Study

Now that you understand a bit of the navigation in Unix-systems, you’re ready to start this practical primer. We will make use of a dummy dataset containing cases and controls. We will explain and execute the following steps:

convert raw data to a more memory-efficient format
apply extensive quality control on samples and SNPs
assess the ancestral background of your study population
perform association testing
visualize association results

6.1 Introduction

The format in which genotype data are returned to investigators varies among genome-wide SNP platforms and genotyping centers. Usually genotypes have been called by a genotyping center and returned in the standard PED and MAP file formats designed for PLINK.

A PED file is a white space (space or tab)-delimited file in which each line represents one individual and the first six columns are mandatory and in the following order:

‘Family ID’,
‘Individual ID’,
‘Paternal ID’,
‘Maternal ID’,
‘Sex (1=male, 2=female, 0=missing)’, and
‘Phenotype (1=unaffected, 2=affected, 0=missing)’.

The subsequent columns denote genotypes that can be any character (e.g., 1, 2, 3, 4 or A, C, G, T). Zero denotes a missing genotype. Each SNP must have two alleles (i.e., both alleles are either present or absent). The order of SNPs in the PED file is given in the MAP file, in which each line denotes a single marker and the four white-space–separated columns are chromosome (1–22, X, Y or 0 for unplaced), marker name (typically an rs number), genetic distance in Morgans (this can be fixed to 0) and base-pair position (bp units).

Let’s start by using PLINK to converting the datasets to a lighter, binary form (a .bed-file). This file saves data in a more memory- and time-efficient manner (in a ‘binary’-format) to facilitate the analysis of large-scale data sets (Purcell S. 2007). The marker-information is stored in the .bim-file and the family information in the .fam-file. PLINK creates a .log file (named raw-GWA-data.log) that details (among other information) the implemented commands, the number of cases and controls in the input files, any excluded data and the genotyping rate in the remaining data. This file is very useful for checking whether the software is successfully completing commands.

6.1.1 Setting the stage

Make sure you are in the right directory. Do you remember how to get there?

In the Terminal window you type the following:

cd $HOME

You know that you’re ‘home’ when you type pwd and see path with a lot of numbers and letters like this /projects/e66af6fa-60ee-4448-972f-daaf235ca9da - yours will be a different sequence of numbers and letters.

Start by creating a project directory.

mkdir -v dummy_project

As I mentioned, all the data is stored in the ~/data/shared directory. So, if you do the following:

ls ~/data/shared

You will see this:

dummy_project  gwas  rawdata  readme.txt  readme.wtccc1.txt  ref_1kg_phase1_all  scripts  support  wtccc1

(in blue the directories, in black the files)

The dummy_project in ~/data/shared contains all the results from this practical primer. You’re not going to use that, if all is well, because you’re going to do it yourself in the following Chapters.
The gwas directory contains the data from the dummy GWAS we will perform.
The rawdata directory contains the dummy data we will use.
The ref_1kg_phase1_all directory contains the reference data from the 1000 Genomes Project.
The wtccc1 directory contains the WTCCC1 data.
The support directory contains some additional files that are used in the practical primer.
The scripts directory contains the scripts we will use.
The readme.txt and readme.wtccc1.txt files contain some additional information.

6.2 Converting the data

Let’s convert the .ped/.map files to the binary-format.

plink --file ~/data/shared/rawdata/raw-GWA-data --make-bed --out dummy_project/rawdata

Let’s review the .log-file for a bit. It should look something like this:

PLINK v1.90b7.2 64-bit (11 Dec 2023)           www.cog-genomics.org/plink/1.9/
(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to dummy_project/rawdata.log.
Options in effect:
  --file rawdata/raw-GWA-data
  --make-bed
  --out dummy_project/rawdata

16384 MB RAM detected; reserving 8192 MB for main workspace.
.ped scan complete (for binary autoconversion).
Performing single-pass .bed write (317503 variants, 2000 people).
--file: dummy_project/rawdata-temporary.bed +
dummy_project/rawdata-temporary.bim + dummy_project/rawdata-temporary.fam
written.
317503 variants loaded from .bim file.
2000 people (997 males, 1003 females) loaded from .fam.
2000 phenotype values loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 2000 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Warning: 11440 het. haploid genotypes present (see dummy_project/rawdata.hh );
many commands treat these as missing.
Total genotyping rate is 0.985682.
317503 variants and 2000 people pass filters and QC.
Among remaining phenotypes, 1023 are cases and 977 are controls.
--make-bed to dummy_project/rawdata.bed + dummy_project/rawdata.bim +
dummy_project/rawdata.fam ... done.

So, there are 317,503 variants included for 2,000 people, 997 males and 1,003 females. All of these individuals are ‘founders’. There are 1,023 cases and 977 controls. The genotyping rate is about 98.6% which is pretty good. Lastly, there are 11,440 heterozygous haploid genotypes present.

Question: Can you think off what the ‘11,440 heterozygous haploid genotypes present’ represent?

6.3 Quality control

We are ready for some quality control and quality assurance, heavily inspired by Anderson et al. (Anderson C. A. 2010) and Laurie et al. (Laurie C. C. 2010). In general, we should check out a couple of things regarding the data quality on two levels:

samples
variants

So, we will investigate the following:

Are the sexes based on genetic data matching the ones given by the phenotype file?
Identify individuals that are outliers in terms of missing data (call rate) or heterozygosity rates. This could indicate a genotyping error or sample swap.
Identify duplicated or related individuals.
Identify individuals with divergent ancestry.
What are the allele frequencies?
What is the per-SNP call rate?
In the case of a case-control study (which is the case here), we need to check differential missingness between cases and controls.

Question: Can you think of other scenarios in which you may want to extend the check on differential missingness beyond a check between cases and controls?

6.4 Let’s get our hands dirty

All clear? Let’s start the work. On to step 1 of the QC for GWAS: filter samples of poor quality in Chapter 7.

References

Anderson C. A., et al. 2010. “Data QC in Genetic Case-Control Association Studies.” https://www.ncbi.nlm.nih.gov/pubmed/21085122.

Laurie C. C., et al. 2010. “Quality Control and Quality Assurance in Genotypic Data for Genome-Wide Association Studies.” https://www.ncbi.nlm.nih.gov/pubmed/20718045.

Purcell S., et al. 2007. “PLINK, a Tool Set for Whole-Genome Association and Population-Based Linkage Analyses.” https://www.ncbi.nlm.nih.gov/pubmed/17701901.