GWAS Diagnostics

1. Input data survey
2. Linear Regression results obtained using lm() in R
3. Diagnostic plots regarding the residuals
4. Influential observations and outliers
5. Inverse response plot
6. Links to databases
7. Technical information

1. Input data survey

The GWAS diagnostics were run for rs188247550_T_C based on 27,243 observations (the number of samples common to the response variable, the predictor variable, and the covariates; see the Venn diagram below).

1.1. Venn diagram based on sample IDs for phenotype, genotype, and covariates

The Venn diagram displays which samples are common to the response and predictor variables and the covariates.
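The overlap shown in the Venn diagram amounts to a set intersection of sample IDs. A minimal R sketch of that step follows; the vectors `pheno_ids`, `geno_ids`, and `covar_ids` are made-up placeholders, not names or data from the actual pipeline.

```r
# Hypothetical sample IDs; the real pipeline reads these from its input files.
pheno_ids <- c("S1", "S2", "S3", "S4", "S5")
geno_ids  <- c("S2", "S3", "S4", "S5", "S6")
covar_ids <- c("S1", "S3", "S4", "S5", "S7")

# Samples present in all three inputs (the central region of the Venn diagram).
common_ids <- Reduce(intersect, list(pheno_ids, geno_ids, covar_ids))

# Samples of each input that are missing from at least one other input.
dropped <- lapply(
  list(phenotype = pheno_ids, genotype = geno_ids, covariates = covar_ids),
  setdiff, y = common_ids
)

print(common_ids)
print(lengths(dropped))
```

Only the samples in `common_ids` would enter the regression, which is why the number of observations can be smaller than the size of any single input.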

1.2. Heatmap for predictor and covariate correlations

The heatmap indicates the strength of the correlations between the independent variables (use the script examine_covariates to get detailed information).
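As a rough illustration of what the heatmap is built from, the sketch below computes a correlation matrix for a few simulated predictors and reports the most strongly correlated pair. The data frame `X` and its values are invented for the example; the script examine_covariates may work differently.

```r
set.seed(1)
n <- 200

# Simulated stand-ins for the genotype and some covariates.
X <- data.frame(
  geno = rbinom(n, 2, 0.3),
  PC1  = rnorm(n),
  age  = rnorm(n, 55, 8)
)
X$PC2 <- 0.8 * X$PC1 + rnorm(n, sd = 0.5)  # deliberately correlated with PC1

# Pairwise Pearson correlations between all independent variables.
cor_mat <- cor(X)

# Locate the strongest off-diagonal correlation.
off_diag <- cor_mat
diag(off_diag) <- 0
idx <- which(abs(off_diag) == max(abs(off_diag)), arr.ind = TRUE)[1, ]
strongest_pair <- rownames(cor_mat)[idx]

print(round(cor_mat, 2))
print(strongest_pair)
# heatmap(cor_mat, symm = TRUE) would draw a basic heatmap of this matrix.
```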

1.3. Variance inflation factors

Variance inflation factors (VIFs) are calculated to detect multicollinearity. The VIF of an independent variable is obtained by regressing it on all other independent variables and computing \(1 / (1 - R^2)\) from the coefficient of determination \(R^2\) of that regression. As a rule of thumb, no VIF should exceed 10; otherwise, highly correlated variables should be removed from the model.
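The VIF calculation can be sketched in base R as follows. The predictors are simulated and only loosely modelled on the covariates of this report; the helper vector `vif_manual` is a name chosen for the example.

```r
set.seed(2)
n <- 500

# Simulated predictors; PC2 is deliberately made collinear with PC1.
X <- data.frame(PC1 = rnorm(n), sex = rbinom(n, 1, 0.5))
X$PC2 <- 0.9 * X$PC1 + rnorm(n, sd = 0.3)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
# predictor j on all remaining predictors.
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})

print(round(vif_manual, 2))
```

In this toy setting the two collinear variables should show clearly inflated values, while the unrelated variable should stay close to 1.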

2. Linear Regression results obtained using `lm()` in R

2.1. Linear regression model:

liver_fat_a ~ geno + PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 + PC9 + PC10 + array + sex + age

2.2. Phenotype vs. genotype plot

The scatterplot shows the genotype on the x-axis and the phenotype on the y-axis. The x-values are jittered for better visibility. Hypothetical outliers are marked blue, while hypothetical influential observations (i.e. observations with high values of Cook’s D) are marked red. Note that at least the six observations with the highest values of Cook’s D are marked in the plot, regardless of whether they exceed the calculated cutoff for being influential. For details regarding Cook’s D, see the section “Cook’s distance” below.

2.3. Some metrics obtained using `lm()`:

sigma is the estimated standard deviation of the noise term
Fstat is the value of the F-statistic
Rsquared is the coefficient of determination
Rsq.adj is the coefficient of determination adjusted for the number of predictors
CI_low is the lower limit of the confidence interval for \(\beta\)
CI_up is the upper limit of the confidence interval for \(\beta\)
AIC is the Akaike information criterion (which estimates the quality of a regression model relative to others)
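The metrics listed above can all be read off a fitted `lm` object. The sketch below shows one way to do so on simulated data; it assumes that \(\beta\) refers to the genotype coefficient and that a 95% confidence level is used, neither of which is stated in the report.

```r
set.seed(3)
n <- 300

# Simulated data with a reduced set of covariates.
d <- data.frame(
  geno = rbinom(n, 2, 0.3),
  sex  = rbinom(n, 1, 0.5),
  age  = rnorm(n, 55, 8)
)
d$pheno <- 1 + 0.5 * d$geno + 0.02 * d$age + rnorm(n)

fit <- lm(pheno ~ geno + sex + age, data = d)
s   <- summary(fit)
ci  <- confint(fit, "geno", level = 0.95)  # assumed: CI for the genotype effect

metrics <- c(
  sigma    = s$sigma,                        # residual standard error
  Fstat    = unname(s$fstatistic["value"]),  # overall F-statistic
  Rsquared = s$r.squared,
  Rsq.adj  = s$adj.r.squared,
  CI_low   = ci[1, 1],
  CI_up    = ci[1, 2],
  AIC      = AIC(fit)
)
print(metrics)
```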

2.4. Regression coefficients obtained with `lm()`:

2.5. Comparison of plink results with those obtained using `lm()`

3. Diagnostic plots regarding the residuals

3.1. Histogram for the magnitudes of the residuals and the phenotype

Under the assumptions of the linear model, the residuals should be normally distributed, so the histogram below should approximately resemble a normal distribution. The residuals have been saved to diagnose_residuals_rs188247550_T_C_allele_T.RData.

3.2. Plot of the residuals vs. the fitted values

This plot shows whether the residuals have non-linear patterns (which should not be the case). Ideally, the residuals are spread evenly around the horizontal line without distinct patterns. The p-value for the Non-constant Variance Score Test (ncvTest in R) is 1.623053e-69, which argues against constant residual variance.
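To make the score test less of a black box, the sketch below hand-computes a Breusch-Pagan / Cook-Weisberg type statistic in base R, with the variance modelled as a function of the fitted values. `ncvTest` belongs to the car package; this is an illustrative re-implementation on simulated data under that assumption, not the code used for the report.

```r
set.seed(4)
n <- 500
x <- runif(n, 1, 5)
y <- 2 + x + rnorm(n, sd = x)   # noise grows with x: heteroscedastic by construction
fit <- lm(y ~ x)

e <- residuals(fit)
u <- e^2 / (sum(e^2) / n)       # squared residuals scaled by the ML variance estimate

# Auxiliary regression of the scaled squared residuals on the fitted values.
aux    <- lm(u ~ fitted(fit))
reg_ss <- sum((fitted(aux) - mean(u))^2)

chisq   <- reg_ss / 2           # score statistic, chi-squared with 1 df under H0
p_value <- pchisq(chisq, df = 1, lower.tail = FALSE)

print(c(chisq = chisq, p_value = p_value))
```

The null hypothesis is constant error variance, so a small p-value points to heteroscedasticity. With very large samples even mild departures can yield extremely small p-values, so the plot itself remains worth inspecting.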

3.3. Normal Q-Q plot of the residuals

This plot shows whether the residuals are normally distributed. Ideally, the points representing the residuals lie close to the straight line.
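A Q-Q plot pairs the sorted residuals with the corresponding quantiles of the standard normal distribution. The following sketch computes those pairs for simulated residuals and uses their correlation as a crude summary of how straight the plot is; the variable names and the skewed comparison sample are illustrative only.

```r
set.seed(5)
res <- rnorm(400)                        # simulated stand-in for model residuals

theo <- qnorm(ppoints(length(res)))      # theoretical normal quantiles
samp <- sort(res)                        # ordered sample values

qq_cor     <- cor(theo, samp)                 # near 1 if the points follow a line
skewed_cor <- cor(theo, sort(rexp(400)))      # same summary for a skewed sample

print(c(normal = qq_cor, skewed = skewed_cor))
# qqnorm(res); qqline(res) would draw the corresponding plot.
```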

3.4. Scale-Location plot

The plot shows whether the residuals are spread equally along the range of the fitted values, which allows the equal-variance (homoscedasticity) assumption to be checked. Ideally, the plot shows a horizontal line with equally (randomly) spread points.

3.5. Autocorrelation of the residuals

According to the assumptions of the linear model, the residuals should be independent. This means that the autocorrelation for any lag should be small. Ideally, all vertical lines representing the magnitudes of the autocorrelation lie well inside the blue dashed lines displayed in the plot. The p-value for the Durbin-Watson test is 0.02.
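The quantities behind this check can be sketched in base R: the sample autocorrelations, the approximate 95% bounds that R's default autocorrelation plot draws as dashed lines, and the Durbin-Watson statistic. The residuals here are simulated as independent noise, and the p-value computation of the Durbin-Watson test is not reproduced.

```r
set.seed(6)
n <- 1000
e <- rnorm(n)                                      # simulated independent residuals

# Sample autocorrelations for lags 1..20 (lag 0 is dropped).
ac <- acf(e, lag.max = 20, plot = FALSE)$acf[-1]

# Approximate 95% bounds under independence.
bound <- qnorm(0.975) / sqrt(n)

# Durbin-Watson statistic: values near 2 indicate no lag-1 autocorrelation.
dw <- sum(diff(e)^2) / sum(e^2)

print(c(max_abs_acf = max(abs(ac)), bound = bound, DW = dw))
```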

4. Influential observations and outliers

4.1. Cook’s distance

Cook’s distance quantifies the influence of each observation on the regression results. It is computed by recalculating the regression after removing a single observation from the input dataset and summarizes how much the results change when that observation is removed. The cutoff for Cook’s distance is 0.9528287 (calculated as the median of the F-distribution with 14 and 27229 degrees of freedom). According to this cutoff, 0 observations are influential.
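The cutoff rule can be illustrated in base R. With the degrees of freedom quoted above, `qf(0.5, 14, 27229)` should reproduce the reported cutoff. The second part applies the same rule to a small simulated dataset with one planted problematic point; that dataset and the variable names are assumptions made for the example.

```r
# Median of the F-distribution with the degrees of freedom quoted in the report.
cutoff <- qf(0.5, df1 = 14, df2 = 27229)
print(cutoff)

set.seed(7)
n <- 200
d <- data.frame(x = rnorm(n))
d$y <- 1 + 2 * d$x + rnorm(n)
d$x[1] <- 8                      # plant one point far from the other x values ...
d$y[1] <- -10                    # ... that also lies far from the regression line

fit <- lm(y ~ x, data = d)
cd  <- cooks.distance(fit)

# Same rule as above, with the degrees of freedom of the toy model.
p <- length(coef(fit))
sim_cutoff  <- qf(0.5, p, n - p)
influential <- which(cd > sim_cutoff)

print(influential)
```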

4.2. Plot of the residuals vs. leverage

The plot supports the identification of influential observations. Influential observations are located in the upper right or lower right corner of this plot. Cases outside the dashed line (Cook’s distance) might be influential, i.e. the regression results will be altered if these observations are excluded from the model. (Note that the dashed line indicating Cook’s distance may not be visible in the plot if Cook’s D is below the cutoff for all observations.)
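The two axes of this plot, leverage and standardized residuals, jointly determine Cook’s distance. The sketch below shows the relation on simulated data; the leverage threshold of \(2p/n\) is a common rule of thumb and an assumption here, not something taken from the report.

```r
set.seed(8)
n <- 100
x <- c(rnorm(n - 1), 6)          # the last point lies far from the other x values
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)

h <- hatvalues(fit)              # leverage (x-axis of the plot)
r <- rstandard(fit)              # standardized residuals (y-axis of the plot)
p <- length(coef(fit))

# Rule-of-thumb threshold for high leverage (assumed for this example).
high_leverage <- which(h > 2 * p / n)

# Cook's D expressed through both axes: D = r^2 / p * h / (1 - h).
d_manual <- r^2 / p * h / (1 - h)

print(high_leverage)
print(all.equal(unname(d_manual), unname(cooks.distance(fit))))
```

A point can have high leverage without being influential: if it lies close to the regression line, its standardized residual is small and so is its Cook’s distance.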

4.3. Outlier Test

Outliers were calculated using the function outlierTest in R. The number of hypothetical outliers obtained by this function was 169.
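`outlierTest` is part of the car package and is based on Bonferroni-adjusted p-values for studentized residuals. The following base R sketch re-implements that idea on simulated data with one planted outlier; the significance threshold of 0.05 is an assumption, since the report does not state which cutoff was used.

```r
set.seed(9)
n <- 300
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
y[5] <- y[5] + 10                       # planted outlier
fit <- lm(y ~ x)

t_i <- rstudent(fit)                    # externally studentized residuals
df  <- df.residual(fit) - 1             # one df less: each point is left out in turn

# Two-sided p-values, then Bonferroni correction over all n observations.
p_unadj <- 2 * pt(abs(t_i), df, lower.tail = FALSE)
p_bonf  <- pmin(1, n * p_unadj)

outliers <- which(p_bonf < 0.05)        # assumed threshold
print(outliers)
```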

5. Inverse response plot

The inverse response plot displays the response variable (i.e. the phenotype) on the x-axis and the fitted values on the y-axis. A relationship between these variables of the form \(Y_{fitted} = \beta_0 + \beta_1 \cdot Y_{response}^\lambda\) is fitted using the nls function in R. The estimated \(\lambda\) for the model considered here is -0.9750917.
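The estimation of \(\lambda\) can be sketched as follows. Instead of `nls`, this example profiles \(\lambda\): for each candidate value the model is linear in \(\beta_0\) and \(\beta_1\), so `lm` gives the residual sum of squares and `optimize` searches for the minimizing \(\lambda\). This swap is made only to avoid dependence on starting values; the data, the search interval, and the variable names are assumptions.

```r
set.seed(10)
n <- 300
y_resp <- runif(n, 1, 5)                               # simulated positive response
y_fit  <- 2 + 3 * y_resp^(-1) + rnorm(n, sd = 0.01)    # simulated with lambda = -1

# Residual sum of squares of the linear fit for a fixed lambda.
rss <- function(lambda) {
  sum(residuals(lm(y_fit ~ I(y_resp^lambda)))^2)
}

# One-dimensional search for the lambda with the smallest RSS.
lambda_hat <- optimize(rss, interval = c(-3, 3))$minimum

# Refit at the estimated lambda to obtain beta_0 and beta_1.
fit <- lm(y_fit ~ I(y_resp^lambda_hat))

print(lambda_hat)
print(coef(fit))
```

A call such as `nls(y_fit ~ b0 + b1 * y_resp^lambda, start = list(b0 = ..., b1 = ..., lambda = ...))` would fit the same model directly, provided reasonable starting values are supplied. An estimate near 1 would suggest that no transformation of the response is needed, whereas an estimate near -1 points towards a reciprocal transformation.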

6. Links to databases

The position of the marker rs188247550 is 19,396,616 on chromosome 19.

Link to marker at Phenoscanner
Link to marker at Ensembl

7. Technical information

GWAS workfolder: /castor/project/proj/GWAS_DEV3/liver10