uwe.menzel@medsci.uu.se

Introduction

The pipeline conducts a Genome-Wide Association Study (GWAS). It runs a linear or logistic regression to identify genetic variants that are associated with a quantitative or a binary trait. The pipeline also includes two methods (GCTA-COJO and PLINK-CLUMP) for identifying independent genetic variants, i.e. variants that are in linkage equilibrium.

The pipeline consists of Bash shell scripts and R scripts. Some R scripts also make use of Rmarkdown documents for reporting and visualization of results.

The main programs of the pipeline are automatically submitted to the SLURM (“Simple Linux Utility for Resource Management”) workload manager when they are started. After submission, jobs can be monitored using the command

squeue   

This command displays job number, job name, job status, and elapsed time. A job (or all jobs connected to your login name) can be canceled using

scancel jobnumber
scancel -u username
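To list only your own jobs, squeue accepts the same -u option as scancel. The following sketch wraps both commands in two small helper functions. The function names my_jobs and cancel_job are made up for illustration and are not part of the pipeline; the sketch also assumes plain numeric job numbers, so array job identifiers such as 123_4 are rejected.

```bash
#!/usr/bin/env bash
# Hypothetical helpers around the standard SLURM commands (not part of the pipeline).

# List the jobs of one user (default: the current login name).
my_jobs() {
  squeue -u "${1:-$USER}"
}

# Cancel one job by its job number, refusing anything that is not a plain number.
cancel_job() {
  case "$1" in
    ''|*[!0-9]*) echo "ERROR: '$1' is not a job number" >&2; return 1 ;;
  esac
  scancel "$1"
}
```

After sourcing these definitions, my_jobs lists your jobs and cancel_job followed by a job number cancels a single job.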

Some scripts can be run in an interactive session which can be started on the Linux console using the interactive command, e.g.:

interactive -n 16 -t 6:00:00 -A sens2019016 

This example command requests a whole node (16 cores) for 6 hours.

Large programs that are not submitted to SLURM should always be run in an interactive session, because running them on the login node can slow down computation substantially. Only a few (auxiliary) scripts can be run on the login node; where this is the case, a note is added to the program description.
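A script can guard itself against being started on the login node by accident. The sketch below is not part of the pipeline; it assumes that the variable SLURM_JOB_ID is set inside batch jobs and interactive sessions, which is standard SLURM behaviour but should be verified on your cluster. The function names are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: warn when a script seems to run outside a SLURM allocation.
# Assumption: SLURM_JOB_ID is set inside batch jobs and interactive sessions.

in_slurm_allocation() {
  [ -n "${SLURM_JOB_ID:-}" ]
}

warn_if_login_node() {
  if ! in_slurm_allocation; then
    echo "WARNING: no SLURM allocation detected; start an interactive session first." >&2
    return 1
  fi
}
```

A long-running script could call warn_if_login_node at the top and stop if it returns a non-zero status.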

R scripts only work correctly if the corresponding modules are loaded. The vast majority of the scripts load the modules automatically. Only a few (less important) scripts require manual loading. For these scripts, please run

module load R_packages/3.6.1 

before starting the R script. If you forget to load the module, you will probably be notified that some R libraries cannot be found. It may also be important to load exactly release 3.6.1 of the R modules, because newer releases might not be compatible with the scripts.
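Whether the right release is loaded can be checked before an R script is started. The following sketch assumes that the module system records the loaded modules in the colon-separated variable LOADEDMODULES (Environment Modules and Lmod both do this); module_is_loaded is a hypothetical helper, not a pipeline command.

```bash
#!/usr/bin/env bash
# Sketch: check that the expected R_packages release is loaded.
# Assumption: loaded modules are listed in the colon-separated
# variable LOADEDMODULES.

required_module="R_packages/3.6.1"

module_is_loaded() {
  case ":${LOADEDMODULES:-}:" in
    *":$1:"*) return 0 ;;
    *)        return 1 ;;
  esac
}

if module_is_loaded "$required_module"; then
  echo "$required_module is loaded"
else
  echo "Please run: module load $required_module"
fi
```

Matching the full name including the release number means that a newer R_packages release does not count as loaded.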

Many programs of the pipeline require numerous parameters. Parameters that are not likely to change very often are stored in so-called “settings files” in order to reduce typing. The settings files are read by the associated program at startup. They must be located in your home folder, see the installation instructions and the description of the settings files. You can edit these files to change the parameters permanently, but please stick to Bash syntax in the files with suffix .sh and to R syntax in the files with suffix .R.

Parameters that presumably change more often or have to be unique must be entered on the command line. Some parameters are defined in the settings files but can also be entered on the command line. As a general rule, parameters provided on the command line override definitions in the settings files.

If a program name is typed without parameters, a concise message showing the command line parameters is displayed.

The main programs use named parameters of the form --arg value or -a value. That means that you can use either the short argument identifier (starting with a single minus sign) or the long argument name (starting with two minus signs). See the examples below for clarification. Other scripts (often R scripts) use positional parameters. For these programs, it is important to enter the parameters in the correct order.
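The interplay of settings files and named command line parameters can be sketched as follows. This is an illustration of the precedence rule only, not the code of the pipeline: the parameter names minutes and partition, their values, and the function parse_parameters are hypothetical, and the settings file is a temporary file instead of a file in the home folder.

```bash
#!/usr/bin/env bash
# Sketch of the precedence rule: settings file first, command line second.
# All parameter names and values here are hypothetical.

settings_file="$(mktemp)"
cat > "$settings_file" <<'EOF'
# Bash syntax: no spaces around "="
minutes=120
partition="node"
EOF

parse_parameters() {
  # Defaults come from the settings file ...
  source "$settings_file"
  # ... and are overridden by named command line parameters.
  while [ "$#" -gt 0 ]; do
    if [ "$#" -lt 2 ]; then
      echo "ERROR: parameter '$1' needs a value" >&2
      return 1
    fi
    case "$1" in
      -m|--minutes)   minutes="$2" ;;
      -p|--partition) partition="$2" ;;
      *) echo "ERROR: unknown parameter '$1'" >&2; return 1 ;;
    esac
    shift 2
  done
}

parse_parameters --minutes 30     # long form; "-m 30" is equivalent
echo "minutes=$minutes partition=$partition"
```

Here minutes is taken from the command line, while partition keeps the value from the settings file because it was not given on the command line.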

It is important to emphasize that the pipeline uses sensitive data originating from the UK Biobank. It is therefore not allowed to share these data with researchers who are not registered with UK Biobank. In particular, data containing IDs of the participating individuals must not be copied from the server to any other storage device. See the UK Biobank website for more information.


Installation

The following commands have to be carried out just once.

Add the following lines of code to the file .bashrc, which is located in your home directory:

export SCRIPT_FOLDER="/proj/sens2019016/GWAS_SCRIPTS"
export PATH="$SCRIPT_FOLDER:$PATH"

The SCRIPT_FOLDER environment variable defines the location of the scripts.
General information regarding .bashrc files can be found on this web site.

Replace the project name above (sens2019016) with your own project identifier. Source ~/.bashrc after making the changes:

source ~/.bashrc

Copy all files located in the folder /proj/sens2019016/DEFAULT_SETTINGS to your home folder (which can be addressed using a tilde: ~):

cd /proj/sens2019016/DEFAULT_SETTINGS 
cp -i *.sh *.R ~/ 

These files are referred to as “settings files” throughout this document. More about the settings files can be found in the section “Settings files”.
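The result of the installation steps can be verified with a few shell tests. The sketch below is a hypothetical helper (check_install is not part of the pipeline). It assumes that the settings files follow the *_settings.sh naming pattern used in this document and only checks that SCRIPT_FOLDER is set, that it is on the PATH, and that at least one such file is present in the home folder.

```bash
#!/usr/bin/env bash
# Sketch: verify the one-time installation steps.
# check_install is a hypothetical helper, not part of the pipeline.

check_install() {
  local status=0
  if [ -z "${SCRIPT_FOLDER:-}" ]; then
    echo "SCRIPT_FOLDER is not set; check ~/.bashrc" >&2
    status=1
  else
    case ":$PATH:" in
      *":$SCRIPT_FOLDER:"*) ;;
      *) echo "SCRIPT_FOLDER is not on PATH" >&2; status=1 ;;
    esac
  fi
  # Settings files are expected in the home folder.
  if ! ls "$HOME"/*_settings.sh >/dev/null 2>&1; then
    echo "no *_settings.sh files found in $HOME" >&2
    status=1
  fi
  return "$status"
}
```

Running check_install in a fresh login shell shows whether the changes to .bashrc have taken effect.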


Settings files

Settings files store program parameters and are sourced by the scripts at startup, so that there is no need to enter these parameters on the command line.

A typical “settings file” looks like this:

The settings file for run_gwas


Different settings files provide input parameters for different programs:

Settings file             Programs reading that file
gwas_settings.sh          run_gwas.sh ; gwas_chr.sh ; run_gwas_single.sh
cojo_settings.sh          run_cojo.sh ; cojo_pheno ; cojo_convert.sh ; cojo_chr.sh ; cojo_collect.sh ; cojo_clean.sh
clump_settings.sh         run_clump.sh ; clump_pheno.sh ; clump_chr.sh
review_settings.sh        review_gwas.sh
review_settings.R         review_gwas.R
archive_settings.sh       archive_gwas.sh ; retrieve_gwas.sh ; tar_gwas.sh ; untar_gwas.sh
convert_settings.sh       convert_genotype.sh ; convert_genotype_chr.sh
diagnose_settings.sh      gwas_diagnose.sh ; gwas_diagnose_INT.sh ; gwas_diagnose_nomarker.sh ; extract_genotype.sh
diagnose_settings.R       deprecated
extract_settings.sh       extract_samples.sh ; extract_samples_chr.sh ; extract_snps.sh ; extract_snps_chr.sh
remove_settings.sh        remove_samples.sh ; remove_samples_chr.sh
extract_raw_settings.sh   extract_raw.sh
fetch_settings.sh         fetch_pheno.sh
linkage_settings.sh       linkage_pair.sh
stats_settings.sh         get_stats.sh

Note that the scripts are always called without the corresponding suffix because softlinks have been created, e.g.

ln -s run_gwas.sh run_gwas

As a consequence, you can just call “run_gwas” instead of “run_gwas.sh”.
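For illustration, soft links of this kind could be created for a whole folder of scripts in one loop. The sketch below is not needed for using the pipeline, because its links already exist; link_scripts is a hypothetical helper that skips names which are already taken.

```bash
#!/usr/bin/env bash
# Sketch: create a suffix-free soft link for every .sh script in a folder.
# link_scripts is a hypothetical helper; the pipeline's links already exist.

link_scripts() {
  local folder="$1" script link
  for script in "$folder"/*.sh; do
    [ -e "$script" ] || continue      # folder without .sh files
    link="${script%.sh}"              # e.g. run_gwas.sh -> run_gwas
    # Relative link target, as in "ln -s run_gwas.sh run_gwas".
    [ -e "$link" ] || ln -s "$(basename "$script")" "$link"
  done
}
```

Because the link target is stored as a relative name, the links stay valid if the whole folder is moved.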

You can change the settings files to suit your needs. However, it is important to stick to Bash syntax in the settings files with suffix .sh and to R syntax in the settings files with suffix .R. Text following a hash sign (#) is a comment and is ignored by the code. Feel free to add your own comments.
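After editing a settings file with suffix .sh, a quick syntax check can catch mistakes such as an unclosed quotation mark. The sketch below uses bash -n, which parses a file without executing it; check_settings_syntax is a hypothetical helper and not part of the pipeline.

```bash
#!/usr/bin/env bash
# Sketch: check an edited .sh settings file for Bash syntax errors.
# "bash -n" parses the file without executing it.

check_settings_syntax() {
  if bash -n "$1" 2>/dev/null; then
    echo "OK: $1"
  else
    echo "ERROR: syntax problem in $1" >&2
    return 1
  fi
}
```

Note that this check only finds errors the Bash parser can see. An assignment written with spaces around the equals sign is syntactically a valid command and therefore passes the check, although it does not set the parameter.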


Input verification

Most of the scripts conduct a number of checks to make sure that the input is correct before the program is started (i.e. before it is submitted to SLURM).

Validation is made regarding:

In case of any inconsistency, an ERROR message is shown containing the script name and a description of the failure.
Moreover, the auxiliary program check_file can check whether a file is correctly formatted.
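The general pattern of such checks can be sketched as follows. The two checks shown (a chromosome number between 1 and 22, and an existing, non-empty input file) are hypothetical examples chosen for illustration; they are not taken from the pipeline, and the function names are made up.

```bash
#!/usr/bin/env bash
# Sketch of pre-submission input checks; the concrete checks are hypothetical.

# Print an ERROR message with script name and failure description.
report_error() {
  echo "ERROR (${0##*/}): $1" >&2
}

# Example check: the chromosome must be an integer between 1 and 22.
check_chromosome() {
  case "$1" in
    ''|*[!0-9]*) report_error "chromosome '$1' is not a number"; return 1 ;;
  esac
  if [ "$1" -lt 1 ] || [ "$1" -gt 22 ]; then
    report_error "chromosome '$1' is out of range 1-22"
    return 1
  fi
}

# Example check: an input file must exist and be non-empty.
check_input_file() {
  if [ ! -s "$1" ]; then
    report_error "file '$1' is missing or empty"
    return 1
  fi
}
```

Running all checks before submission means that a wrong input is reported immediately instead of after the job has waited in the queue.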


GWAS pipeline

Workflow

An overview of the workflow is shown in the scheme below. The most important part is the linear or logistic regression conducted by the script run_gwas (and its subprograms). You can traverse the pipeline along any path indicated by the red arrows. For instance, you can just run the regression and then immediately proceed to review_gwas in order to get a visual summary of the regression results. Alternatively, you can insert GCTA-COJO or PLINK-CLUMP (or both) in order to identify markers which are in linkage equilibrium.


Each of the programs calls a number of sub-programs (e.g. for each phenotype and for each chromosome), see below.


Regression: PLINK2.0

The linear or logistic regression is conducted using PLINK2.0.

Parameters

The name of the main script running the regression is run_gwas, which has the following command line parameters:


Command line options for run_gwas

The screenshot shows what you see if the program name is typed without any parameter.