1. The Random Forest (RF) machine learning approach
- Breiman, Leo (2001). “Random Forests”. Machine Learning 45 (1): 5-32
- Leo Breiman’s & Adele Cutler’s Website
- RF is an ensemble classifier which creates many decision trees (typically 1000)
Figure: mathworks.com
- RF is favourable when the number of variables (predictors) is higher than the number of observations (samples)
- RF builds a predictive machine learning model by aggregating multiple deep (unpruned) trees
- for each tree, only a subset of the data is used:
- variance is reduced by sampling with replacement from the observations (samples)
- trees are further decorrelated, and additional informative predictors can be identified, by selecting a random subset of the variables (predictors)
- this prevents overfitting of the model to a (possibly insufficient) training set
- Figure:
- about \(1/3\) of the observations (samples) are ignored as a consequence of sampling with replacement (red lines = ignored samples = Out-Of-Bag samples); see the short simulation after this list
- randomly choose \(\sqrt{N}\) of the \(N\) variables (predictors) (blue lines = discarded variables)
- here, the predictors are expression levels (RPKM) of genes of a species of fish
- … and the samples are individuals of that species
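- a minimal simulation, not part of the original tutorial, illustrating both numbers: the probability that a given sample is never drawn in \(n\) draws with replacement is \((1 - 1/n)^n \approx 1/e \approx 0.37\), and the default number of variables tried per split (mtry) in randomForest classification is \(\lfloor\sqrt{N}\rfloor\):
set.seed(1)
n <- 45                                 # number of samples, as in the fish data
oob.frac <- replicate(1000, {
  boot <- sample(n, n, replace = TRUE)  # bootstrap sample for one tree
  sum(!(seq_len(n) %in% boot)) / n      # fraction of samples never drawn (Out-Of-Bag)
})
mean(oob.frac)                          # close to exp(-1), i.e. about 1/3
floor(sqrt(1000))                       # default 'mtry' for the 1000 predictors used below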
2. Classification using a Random Forest model
2.2. Create a Decision Tree
- not necessary for running the RF algorithm but (possibly) instructive
- see also the Decision Tree example on this webpage
- package RWeka will be used
- the J48 classifier in RWeka is an implementation of the C4.5 algorithm developed by Ross Quinlan (see the Wikipedia article)
- Java must be installed (check with ‘java -version’ on the command prompt) in order to run RWeka
colnames(data)[996:1001]
## [1] "Gene_996" "Gene_997" "Gene_998" "Gene_999" "Gene_1000" "group"
library(RWeka)
## Warning: package 'RWeka' was built under R version 3.5.3
tree <- J48(group ~ ., data = data)
print(tree)
## J48 pruned tree
## ------------------
##
## Gene_412 <= 12.928452
## | Gene_55 <= 8.684733: G1 (14.0)
## | Gene_55 > 8.684733: G2 (16.0/1.0)
## Gene_412 > 12.928452: G3 (15.0)
##
## Number of Leaves : 3
##
## Size of the tree : 5
library(partykit) # for plotting the tree (plot.Weka_tree requires partykit)
plot(tree)
- annotation on the x-axis (‘G1’) is misleading, should be ‘G1, G2, or G3’
- expression values of only a few genes are sufficient to build the decision tree (see the quick check after this list)
- however, the genes identified here might not be the only suitable candidates for effective splitting
- RF might identify more genes suitable for classification, see below
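- a quick cross-check, not part of the original tutorial, that the first split printed above indeed separates G3 from G1/G2 (threshold for Gene_412 taken from the tree output):
table(Gene_412.low = data$Gene_412 <= 12.928452, group = data$group) # counts on each side of the split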
2.3. Create the Random Forest model for classification
- package randomForest
- subdivide into matrix of predictors and response vector:
predictors = data[, 1:1000]
response = data[, 1001]
levels(response) # 'response' is a factor --> classification
## [1] "G1" "G2" "G3"
- the response vector is a factor which means that classification will be conducted when using this package
suppressMessages(library(randomForest))
## Warning: package 'randomForest' was built under R version 3.5.3
forest <- randomForest(predictors, response, ntree = 1000, importance = TRUE, proximity = TRUE, keep.forest = TRUE, do.trace = 100)
## ntree OOB 1 2 3
## 100: 0.00% 0.00% 0.00% 0.00%
## 200: 0.00% 0.00% 0.00% 0.00%
## 300: 0.00% 0.00% 0.00% 0.00%
## 400: 0.00% 0.00% 0.00% 0.00%
## 500: 0.00% 0.00% 0.00% 0.00%
## 600: 0.00% 0.00% 0.00% 0.00%
## 700: 0.00% 0.00% 0.00% 0.00%
## 800: 0.00% 0.00% 0.00% 0.00%
## 900: 0.00% 0.00% 0.00% 0.00%
## 1000: 0.00% 0.00% 0.00% 0.00%
# keep.forest = TRUE to enable prediction
# names(forest) # elements of the list
class(forest)
## [1] "randomForest"
methods(class = "randomForest") # what can we do with the object 'forest' ?
## [1] grow importance margin outlier partialPlot plot
## [7] predict print
## see '?methods' for accessing help and source code
forest$confusion
## G1 G2 G3 class.error
## G1 15 0 0 0
## G2 0 15 0 0
## G3 0 0 15 0
- the confusion matrix gives an estimate of how well the classification algorithm performs
- the confusion matrix is calculated by predicting the response of the Out-Of-Bag samples
- it shows, for each response group, how many of its members are actually classified as belonging to that group
- a diagonal matrix (as above) indicates that the model performs very well, e.g. all 15 G1 samples are also predicted to be G1 (a cross-check follows below)
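- as a cross-check (not in the original), the confusion matrix can be reproduced manually: calling predict() without a 'newdata' argument returns the Out-Of-Bag predictions
oob.pred <- predict(forest)                       # OOB predictions, one per training sample
table(observed = response, predicted = oob.pred)  # should match forest$confusion (without class.error)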
2.4. MDS plot based on RF proximity
- the RF algorithm can calculate the proximity (vicinity) of the samples (use argument proximity = TRUE above)
- RF proximity: samples are “close” if they end up in the same leaf frequently (in many trees)
- the proximity of samples can be visualized using a Multi-Dimensional Scaling (MDS) plot
- the MDS-plot can be created by utilizing the built-in function cmdscale
- first argument of cmdscale must be a distance matrix, so 1 - proximity can be used
- the color of the points is chosen according to the true response values (G1, G2, or G3)
mds <- cmdscale(1 - forest$proximity, eig = TRUE)
x <- mds$points[,1]
y <- mds$points[,2]
color = as.character(response)
color = sub("G1", "red", color)
color = sub("G2", "blue", color)
color = sub("G3", "green", color)
unique(cbind(response, color)) # cbind coerces the factor to its integer codes
## response color
## [1,] "1" "red"
## [2,] "2" "blue"
## [3,] "3" "green"
plot(x, y, col = color, pch = 18, xlab = "", ylab = "", main = "MDS plot based on RF proximity", font.main = 1)
legend("topright", c("G1", "G2", "G3"), col=c("red", "blue", "green"), pch = 18)
- the discrimination of the three groups in this example appears to be very good
- don’t expect this with real data - I admit I fiddled a little bit with the input data here …
2.5. Prediction using RF models
- suppose we have a new dataset (i.e. the predictors) but the response (age group) is unknown
- we can use the model ‘forest’ calculated above to predict the response (age group)
- different types of output are possible, controlled by the 'type' argument
newdata = get(load("expr_data_new.RData"))
dim(newdata) # 10 samples, 1000 genes
## [1] 10 1000
pred1 <- predict(forest, newdata, type = "response")
pred1
## Sample_1 Sample_2 Sample_3 Sample_4 Sample_5 Sample_6 Sample_7
## G1 G1 G1 G1 G1 G1 G1
## Sample_8 Sample_9 Sample_10
## G1 G1 G1
## Levels: G1 G2 G3
pred2 <- predict(forest, newdata, type = "prob")
pred2
## G1 G2 G3
## Sample_1 0.586 0.367 0.047
## Sample_2 0.605 0.352 0.043
## Sample_3 0.614 0.367 0.019
## Sample_4 0.600 0.377 0.023
## Sample_5 0.620 0.350 0.030
## Sample_6 0.638 0.338 0.024
## Sample_7 0.599 0.340 0.061
## Sample_8 0.606 0.346 0.048
## Sample_9 0.621 0.320 0.059
## Sample_10 0.576 0.395 0.029
## attr(,"class")
## [1] "matrix" "votes"
- with type = "prob", each row contains the fraction of trees voting for each class (the rows sum to 1)
- all samples are predicted to belong to group 'G1'
- this is correct - I intended this when I faked the data ;-)
3. Variable importance in RF models
- the variable importance is a measure of the explanatory power of a variable with regard to the response
- a variable splits efficiently when the response groups can be cleanly separated by this variable (as in the tree plotted above)
- the randomForest package provides two measures of importance
- Permutation importance:
- a training set variable is scrambled (along a column) and the change in the Out-Of-Bag (OOB) error is recorded
- if scrambling a variable does not (much) change the performance of the model, the variable was not important (a simplified hand-rolled sketch follows after this list)
- Impurity importance:
- measures the decrease of the Gini index which quantifies the inequality among values
- it is recorded how much the Gini index decreases from a parent to a daughter node when a variable makes a split
- the total decrease in node impurities caused by a variable's splits, averaged over all trees, is calculated
- lower values of the Gini Index indicate better classification (purer nodes)
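- below, a simplified hand-rolled sketch of the permutation idea, not the original algorithm: it scrambles one column and measures the accuracy lost on the training data, whereas randomForest scrambles per tree and evaluates on the OOB samples only
perm.imp <- function(rf, X, y, var) {
  acc <- mean(predict(rf, X) == y)  # baseline accuracy
  Xp <- X
  Xp[[var]] <- sample(Xp[[var]])    # scramble one column
  acc - mean(predict(rf, Xp) == y)  # accuracy lost by scrambling
}
perm.imp(forest, predictors, response, "Gene_412") # importance of a single gene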
# head(forest$importance)
imp.perm = sort(forest$importance[,"MeanDecreaseAccuracy"], decreasing = T) # Permutation importance
imp.gini = sort(forest$importance[,"MeanDecreaseGini"], decreasing = T) # Impurity importance
barplot(imp.perm[1:30], width=0.6, ylab="Importance", main = "Permutation Importance of genes", legend.text=F, col="red", font.main=1, las=3, cex.names=0.7) # best 30 genes
barplot(imp.gini[1:30], width=0.6, ylab="Importance", main = "Impurity Importance of genes", legend.text=F, col="red", font.main=1, las=3, cex.names=0.7)
- variables with greater importance are more relevant for predicting the response group
- in the fish example, variables with high importance can be considered candidate biomarkers
4. Regression models using RF
4.1. Loading data for RF regression
- the package randomForest conducts RF regression when the response vector is numerical
- columns 1 .. 1000: expression levels (RPKM) of 1000 genes for 45 samples
- column 1001: numerical values of lifespan (fish)
- load some (faked) example data:
data = get(load("expr_data_regr.RData"))
dim(data)
## [1] 45 1001
data[1:6,1:6] # top left
## Gene_1 Gene_2 Gene_3 Gene_4 Gene_5 Gene_6
## Sample_1 2.5693765 6.059599 2.995293 6.836092 7.804588 3.250978
## Sample_2 8.1141287 7.836250 10.607745 8.097445 5.398739 11.096624
## Sample_3 5.2441530 4.134373 3.650618 7.111028 5.062770 4.331364
## Sample_4 0.4371232 5.414329 8.150474 6.956259 3.567373 7.600040
## Sample_5 6.5659180 9.174541 8.055494 5.243102 9.292768 4.809160
## Sample_6 9.6341325 5.651336 7.344717 5.390121 6.936642 6.204089
data[40:45,996:1001] # bottom right
## Gene_996 Gene_997 Gene_998 Gene_999 Gene_1000 age
## Sample_40 16.610444 13.44591 14.49645 16.03593 11.56931 13.88256
## Sample_41 16.087456 17.36023 17.54690 11.29153 16.21904 15.20492
## Sample_42 16.460808 14.03001 19.10219 12.26169 15.23282 13.98376
## Sample_43 19.145770 15.25750 12.96063 11.26993 11.67638 14.08816
## Sample_44 18.214603 15.17865 14.65838 15.54203 17.31798 13.23555
## Sample_45 8.844086 14.15185 12.41995 15.00920 10.17885 15.51807
4.2. Create the Random Forest model for regression
- regression is carried out when the 2nd input argument is a numerical vector:
predictors = data[, 1:1000]
response1 = data[, 1001]
str(response1) # numerical --> regression
## num [1:45] 9.23 9.27 7.36 5.43 2.89 ...
- build the RF regression model:
forest <- randomForest(predictors, response1, ntree = 1000, importance = TRUE, proximity = TRUE, keep.forest = TRUE, do.trace = 100)
## | Out-of-bag |
## Tree | MSE %Var(y) |
## 100 | 5.104 43.50 |
## 200 | 5.151 43.90 |
## 300 | 5.177 44.12 |
## 400 | 5.082 43.31 |
## 500 | 5.079 43.29 |
## 600 | 5.117 43.61 |
## 700 | 5.175 44.11 |
## 800 | 5.218 44.47 |
## 900 | 5.178 44.13 |
## 1000 | 5.169 44.05 |
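- a quick look at the fit, not in the original: for regression, forest$predicted contains the Out-Of-Bag predictions, which can be plotted against the true lifespan values
plot(response1, forest$predicted, xlab = "true lifespan", ylab = "OOB-predicted lifespan", main = "RF regression: OOB predictions", font.main = 1)
abline(0, 1, lty = 2)            # perfect predictions would fall on this line
cor(response1, forest$predicted) # rough measure of the model fit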
- the MDS plot can also be created with the function MDSplot from the randomForest package; its 2nd argument must be a factor used for colouring the points (here, the group factor from the classification example is reused):
MDSplot(forest, response)
5. RF in unsupervised mode
- used when no training set with a known response vector is available
- the model must be built exclusively on the basis of the predictors
- internally, randomForest creates a synthetic second class by sampling from the marginal distributions of the predictors and classifies real vs. synthetic samples; this is why the trace below reports two classes, and the resulting proximities can be used for clustering
- unsupervised learning: Wikipedia article
forest <- randomForest(predictors, ntree = 1000, importance = TRUE, proximity = TRUE, keep.forest = TRUE, do.trace = 100)
## ntree OOB 1 2
## 100: 34.44% 22.22% 46.67%
## 200: 34.44% 31.11% 37.78%
## 300: 36.67% 31.11% 42.22%
## 400: 42.22% 33.33% 51.11%
## 500: 40.00% 33.33% 46.67%
## 600: 42.22% 33.33% 51.11%
## 700: 41.11% 33.33% 48.89%
## 800: 42.22% 33.33% 51.11%
## 900: 40.00% 33.33% 46.67%
## 1000: 42.22% 33.33% 51.11%
mds <- cmdscale(1 - forest$proximity, eig = TRUE)
x <- mds$points[,1]
y <- mds$points[,2]
plot(x, y, col = "red", pch = 18, xlab = "", ylab = "", main = "MDS plot - RF proximity", font.main = 1)
lab = rep("", 45)
lab[5] = "S5"
lab[16] = "S16"
lab[31] = "S31"
# text(x, y, label = names(x), cex = 0.8, pos = 1) # too dense when all samples are labeled
text(x, y, label = lab, cex = 0.8, pos = 1) # 3 labels only
# identify(x, y, labels = names(x)) # works in interactive mode
- the plot shows how the samples cluster together
- even with no response vector available, we can identify 3 clusters and the samples therein
- in interactive mode, the identify command displays the sample name when a data point is clicked
6. RF models with variable selection
- Goal of variable selection: identify the variables (here: genes) that are crucial for building the RF model
- variable selection can be performed using the varSelRF package
- the number of variables is reduced by iteratively eliminating the least important variables (using the calculated importance values)
- the number of variables dropped in each iteration is controlled by the argument vars.drop.frac
library(varSelRF)
## Warning: package 'varSelRF' was built under R version 3.5.3
## Loading required package: parallel
selected <- varSelRF(predictors, response, ntree = 500, ntreeIterat = 300, vars.drop.frac = 0.2)
selected
##
## Backwards elimination on random forest; ntree = 500 ; mtryFactor = 1
##
## Selected variables:
## [1] "Gene_227" "Gene_412" "Gene_475" "Gene_609" "Gene_798" "Gene_878"
##
## Number of selected variables: 6
plot(selected)
# selected$selec.history # selection history
selected$selected.vars # selected variables
## [1] "Gene_227" "Gene_412" "Gene_475" "Gene_609" "Gene_798" "Gene_878"
names(selected) # available components of the list
## [1] "selec.history" "rf.model"
## [3] "selected.vars" "selected.model"
## [5] "best.model.nvars" "initialImportances"
## [7] "initialOrderedImportances" "ntree"
## [9] "ntreeIterat" "mtryFactor"
## [11] "firstForest"
- again, in our example, the selected variables can be considered candidate biomarkers (a possible follow-up is sketched below)
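- a possible follow-up, not in the original: refit a forest on the selected variables only (assuming 'predictors' and 'response' hold the classification data from section 2.3) and check that the classification performance is retained
forest.sel <- randomForest(predictors[, selected$selected.vars], response, ntree = 1000)
forest.sel$confusion # compare with the confusion matrix of the full model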
uwe.menzel@matstat.org