1. The Random Forest (RF) machine learning approach
- Breiman, Leo (2001). “Random Forests”. Machine Learning 45 (1): 5-32
- Leo Breiman’s & Adele Cutler’s Website
- RF is an ensemble classifier which creates many decision trees (typically 1000)
Figure: mathworks.com
- RF is favourable when the number of variables (predictors) is higher than the number of observations (samples)
- RF builds a predictive machine learning model by aggregating multiple deep (unpruned) trees
- for each tree, only a subset of the data is used:
- variance is reduced by sampling with replacement from the observations (samples)
- trees are further decorrelated, and additional informative predictors can be identified, by selecting a random subset of the variables (predictors)
- this prevents overfitting of the model to a (possibly insufficient) training set
- Figure:
- about \(1/3\) of the observations (samples) are ignored as a consequence of sampling with replacement (red lines = ignored samples = Out-Of-Bag samples); see the short simulation after this list
- randomly choose \(\sqrt{N}\) of the \(N\) variables (predictors) (blue lines = discarded variables)
- here, the predictors are expression levels (RPKM) of genes of a species of fish
- … and the samples are individuals of that species
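- a minimal simulation, not part of the original tutorial, illustrating both numbers: the probability that a given sample is never drawn in \(n\) draws with replacement is \((1 - 1/n)^n \approx 1/e \approx 0.37\), and the default number of variables tried per split (mtry) in randomForest classification is \(\lfloor\sqrt{N}\rfloor\):
set.seed(1)
n <- 45                                 # number of samples, as in the fish data
oob.frac <- replicate(1000, {
  boot <- sample(n, n, replace = TRUE)  # bootstrap sample for one tree
  sum(!(seq_len(n) %in% boot)) / n      # fraction of samples never drawn (Out-Of-Bag)
})
mean(oob.frac)                          # close to exp(-1), i.e. about 1/3
floor(sqrt(1000))                       # default 'mtry' for the 1000 predictors used below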
2. Classification using a Random Forest model
2.2. Create a Decision Tree
- not necessary for running the RF algorithm but (possibly) instructive
- see also the Decision Tree example on this webpage
- package RWeka will be used
- the J48 classifier in RWeka is an implementation of the C4.5 algorithm developed by Ross Quinlan (see the Wikipedia article)
- Java must be installed (check with ‘java -version’ on the command prompt) in order to run RWeka
colnames(data)[996:1001]
## [1] "Gene_996" "Gene_997" "Gene_998" "Gene_999" "Gene_1000" "group"
library(RWeka)
## Warning: package 'RWeka' was built under R version 3.5.3
tree <- J48(group ~ ., data = data)
print(tree)
## J48 pruned tree
## ------------------
##
## Gene_412 <= 12.928452
## | Gene_55 <= 8.684733: G1 (14.0)
## | Gene_55 > 8.684733: G2 (16.0/1.0)
## Gene_412 > 12.928452: G3 (15.0)
##
## Number of Leaves : 3
##
## Size of the tree : 5
library(partykit) # for plotting the tree (plot.Weka_tree requires partykit)
plot(tree)
- annotation on the x-axis (‘G1’) is misleading, should be ‘G1, G2, or G3’
- expression values of only a few genes are sufficient to build the decision tree (see the quick check after this list)
- however, the genes identified here might not be the only suitable candidates for effective splitting
- RF might identify more genes suitable for classification, see below
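- a quick cross-check, not part of the original tutorial, that the first split printed above indeed separates G3 from G1/G2 (threshold for Gene_412 taken from the tree output):
table(Gene_412.low = data$Gene_412 <= 12.928452, group = data$group) # counts on each side of the split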
2.3. Create the Random Forest model for classification
- package randomForest
- subdivide into matrix of predictors and response vector:
predictors = data[, 1:1000]
response = data[, 1001]
levels(response) # 'response' is a factor --> classification
## [1] "G1" "G2" "G3"
- the response vector is a factor which means that classification will be conducted when using this package
suppressMessages(library(randomForest))
## Warning: package 'randomForest' was built under R version 3.5.3
forest <- randomForest(predictors, response, ntree = 1000, importance = TRUE, proximity = TRUE, keep.forest = TRUE, do.trace = 100)
## ntree OOB 1 2 3
## 100: 0.00% 0.00% 0.00% 0.00%
## 200: 0.00% 0.00% 0.00% 0.00%
## 300: 0.00% 0.00% 0.00% 0.00%
## 400: 0.00% 0.00% 0.00% 0.00%
## 500: 0.00% 0.00% 0.00% 0.00%
## 600: 0.00% 0.00% 0.00% 0.00%
## 700: 0.00% 0.00% 0.00% 0.00%
## 800: 0.00% 0.00% 0.00% 0.00%
## 900: 0.00% 0.00% 0.00% 0.00%
## 1000: 0.00% 0.00% 0.00% 0.00%
# keep.forest = TRUE to enable prediction
# names(forest) # elements of the list
class(forest)
## [1] "randomForest"
methods(class = "randomForest") # what can we do with the object 'forest' ?
## [1] grow importance margin outlier partialPlot plot
## [7] predict print
## see '?methods' for accessing help and source code
forest$confusion
## G1 G2 G3 class.error
## G1 15 0 0 0
## G2 0 15 0 0
## G3 0 0 15 0
- the confusion matrix gives an estimate of how well the classification algorithm performs
- the confusion matrix is calculated by predicting the response of the Out-Of-Bag samples
- it shows, for each response group, how many of its members are actually classified as belonging to that group
- a diagonal matrix (as above) indicates that the model performs very well, e.g. all 15 G1 samples are also predicted to be G1 (a cross-check follows below)
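- as a cross-check (not in the original), the confusion matrix can be reproduced manually: calling predict() without a 'newdata' argument returns the Out-Of-Bag predictions
oob.pred <- predict(forest)                       # OOB predictions, one per training sample
table(observed = response, predicted = oob.pred)  # should match forest$confusion (without class.error)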
2.4. MDS plot based on RF proximity
- the RF algorithm can calculate the proximity (vicinity) of the samples (use argument proximity = TRUE above)
- RF proximity: samples are “close” if they end up in the same leaf frequently (in many trees)
- the proximity of samples can be visualized using a Multi-Dimensional Scaling (MDS) plot
- the MDS-plot can be created by utilizing the built-in function cmdscale
- first argument of cmdscale must be a distance matrix, so 1 - proximity can be used
- the color of the points is chosen according to the true response values (G1, G2, or G3)
mds <- cmdscale(1 - forest$proximity, eig = TRUE)
x <- mds$points[,1]
y <- mds$points[,2]
color = as.character(response)
color = sub("G1", "red", color)
color = sub("G2", "blue", color)
color = sub("G3", "green", color)
unique(cbind(response, color)) # cbind coerces the factor to its integer codes
## response color
## [1,] "1" "red"
## [2,] "2" "blue"
## [3,] "3" "green"
plot(x, y, col = color, pch = 18, xlab = "", ylab = "", main = "MDS plot based on RF proximity", font.main = 1)
legend("topright", c("G1", "G2", "G3"), col=c("red", "blue", "green"), pch = 18)
- the discrimination of the three groups in this example appears to be very good
- don’t expect this with real data - I admit I fiddled a little bit with the input data here …
2.5. Prediction using RF models
- suppose we have a new dataset (i.e. the predictors) but the response (age group) is unknown
- we can use the model ‘forest’ calculated above to predict the response (age group)
- different types of output are possible, controlled by the 'type' argument
newdata = get(load("expr_data_new.RData"))
dim(newdata) # 10 samples, 1000 genes
## [1] 10 1000
pred1 <- predict(forest, newdata, type = "response")
pred1
## Sample_1 Sample_2 Sample_3 Sample_4 Sample_5 Sample_6 Sample_7
## G1 G1 G1 G1 G1 G1 G1
## Sample_8 Sample_9 Sample_10
## G1 G1 G1
## Levels: G1 G2 G3
pred2 <- predict(forest, newdata, type = "prob")
pred2
## G1 G2 G3
## Sample_1 0.586 0.367 0.047
## Sample_2 0.605 0.352 0.043
## Sample_3 0.614 0.367 0.019
## Sample_4 0.600 0.377 0.023
## Sample_5 0.620 0.350 0.030
## Sample_6 0.638 0.338 0.024
## Sample_7 0.599 0.340 0.061
## Sample_8 0.606 0.346 0.048
## Sample_9 0.621 0.320 0.059
## Sample_10 0.576 0.395 0.029
## attr(,"class")
## [1] "matrix" "votes"
- with type = "prob", each row contains the fraction of trees voting for each class (the rows sum to 1)
- all samples are predicted to belong to group 'G1'
- this is correct - I intended this when I faked the data ;-)
3. Variable importance in RF models
- the variable importance is a measure of the explanatory power of a variable with regard to the response
- a variable splits efficiently when the response groups can be cleanly separated by this variable (as in the tree plotted above)
- the randomForest package provides two measures of importance
- Permutation importance:
- a training set variable is scrambled (along a column) and the change in the Out-Of-Bag (OOB) error is recorded
- if scrambling a variable does not (much) change the performance of the model, the variable was not important (a simplified hand-rolled sketch follows after this list)
- Impurity importance:
- measures the decrease of the Gini index which quantifies the inequality among values
- it is recorded how much the Gini index decreases from a parent to a daughter node when a variable makes a split
- the total decrease in node impurities caused by a variable's splits, averaged over all trees, is calculated
- lower values of the Gini Index indicate better classification (purer nodes)
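- below, a simplified hand-rolled sketch of the permutation idea, not the original algorithm: it scrambles one column and measures the accuracy lost on the training data, whereas randomForest scrambles per tree and evaluates on the OOB samples only
perm.imp <- function(rf, X, y, var) {
  acc <- mean(predict(rf, X) == y)  # baseline accuracy
  Xp <- X
  Xp[[var]] <- sample(Xp[[var]])    # scramble one column
  acc - mean(predict(rf, Xp) == y)  # accuracy lost by scrambling
}
perm.imp(forest, predictors, response, "Gene_412") # importance of a single gene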
# head(forest$importance)
imp.perm = sort(forest$importance[,"MeanDecreaseAccuracy"], decreasing = T) # Permutation importance
imp.gini = sort(forest$importance[,"MeanDecreaseGini"], decreasing = T) # Impurity importance
barplot(imp.perm[1:30], width=0.6, ylab="Importance", main = "Permutation Importance of genes", legend.text=F, col="red", font.main=1, las=3, cex.names=0.7) # best 30 genes
barplot(imp.gini[1:30], width=0.6, ylab="Importance", main = "Impurity Importance of genes", legend.text=F, col="red", font.main=1, las=3, cex.names=0.7)
- variables with greater importance are more relevant for predicting the response group
- in the fish example, variables with high importance can be considered candidate biomarkers
4. Regression models using RF
4.1. Loading data for RF regression
- the package randomForest conducts RF regression when the response vector is numerical
- columns 1 .. 1000: expression levels (RPKM) of 1000 genes for 45 samples
- column 1001: numerical values of lifespan (fish)
- load some (faked) example data:
data = get(load("expr_data_regr.RData"))
dim(data)
## [1] 45 1001
data[1:6,1:6] # top left
## Gene_1 Gene_2 Gene_3 Gene_4 Gene_5 Gene_6
## Sample_1 2.5693765 6.059599 2.995293 6.836092 7.804588 3.250978
## Sample_2 8.1141287 7.836250 10.607745 8.097445 5.398739 11.096624
## Sample_3 5.2441530 4.134373 3.650618 7.111028 5.062770 4.331364
## Sample_4 0.4371232 5.414329 8.150474 6.956259 3.567373 7.600040
## Sample_5 6.5659180 9.174541 8.055494 5.243102 9.292768 4.809160
## Sample_6 9.6341325 5.651336 7.344717 5.390121 6.936642 6.204089
data[40:45,996:1001] # bottom right
## Gene_996 Gene_997 Gene_998 Gene_999 Gene_1000 age
## Sample_40 16.610444 13.44591 14.49645 16.03593 11.56931 13.88256
## Sample_41 16.087456 17.36023 17.54690 11.29153 16.21904 15.20492
## Sample_42 16.460808 14.03001 19.10219 12.26169 15.23282 13.98376
## Sample_43 19.145770 15.25750 12.96063 11.26993 11.67638 14.08816
## Sample_44 18.214603 15.17865 14.65838 15.54203 17.31798 13.23555
## Sample_45 8.844086 14.15185 12.41995 15.00920 10.17885 15.51807
4.2. Create the Random Forest model for regression
- regression is carried out when the 2nd input argument is a numerical vector:
predictors = data[, 1:1000]
response1 = data[, 1001]
str(response1) # numerical --> regression
## num [1:45] 9.23 9.27 7.36 5.43 2.89 ...
- build the RF regression model:
forest <- randomForest(predictors, response1, ntree = 1000, importance = TRUE, proximity = TRUE, keep.forest = TRUE, do.trace = 100)
## | Out-of-bag |
## Tree | MSE %Var(y) |
## 100 | 5.104 43.50 |
## 200 | 5.151 43.90 |
## 300 | 5.177 44.12 |
## 400 | 5.082 43.31 |
## 500 | 5.079 43.29 |
## 600 | 5.117 43.61 |
## 700 | 5.175 44.11 |
## 800 | 5.218 44.47 |
## 900 | 5.178 44.13 |
## 1000 | 5.169 44.05 |
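- a quick look at the fit, not in the original: for regression, forest$predicted contains the Out-Of-Bag predictions, which can be plotted against the true lifespan values
plot(response1, forest$predicted, xlab = "true lifespan", ylab = "OOB-predicted lifespan", main = "RF regression: OOB predictions", font.main = 1)
abline(0, 1, lty = 2)            # perfect predictions would fall on this line
cor(response1, forest$predicted) # rough measure of the model fit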
- the MDS plot can also be created with the function MDSplot from the randomForest package; its 2nd argument must be a factor used for colouring the points (here, the group factor from the classification example is reused):
MDSplot(forest, response)
5. RF in unsupervised mode
- used when no training set with a known response vector is available
- the model must be built exclusively on the basis of the predictors
- internally, randomForest creates a synthetic second class by sampling from the marginal distributions of the predictors and classifies real vs. synthetic samples; this is why the trace below reports two classes, and the resulting proximities can be used for clustering
- unsupervised learning: Wikipedia article
forest <- randomForest(predictors, ntree = 1000, importance = TRUE, proximity = TRUE, keep.forest = TRUE, do.trace = 100)
## ntree OOB 1 2
## 100: 34.44% 22.22% 46.67%
## 200: 34.44% 31.11% 37.78%
## 300: 36.67% 31.11% 42.22%
## 400: 42.22% 33.33% 51.11%
## 500: 40.00% 33.33% 46.67%
## 600: 42.22% 33.33% 51.11%
## 700: 41.11% 33.33% 48.89%
## 800: 42.22% 33.33% 51.11%
## 900: 40.00% 33.33% 46.67%
## 1000: 42.22% 33.33% 51.11%
mds <- cmdscale(1 - forest$proximity, eig = TRUE)
x <- mds$points[,1]
y <- mds$points[,2]
plot(x, y, col = "red", pch = 18, xlab = "", ylab = "", main = "MDS plot - RF proximity", font.main = 1)
lab = rep("", 45)
lab[5] = "S5"
lab[16] = "S16"
lab[31] = "S31"
# text(x, y, label = names(x), cex = 0.8, pos = 1) # too dense when all samples are labeled
text(x, y, label = lab, cex = 0.8, pos = 1) # 3 labels only
# identify(x, y, labels = names(x)) # works in interactive mode
- the plot shows how the samples cluster together
- even with no response vector available, we can identify 3 clusters and the samples therein
- in interactive mode, the identify command displays the sample name when a data point is clicked
6. RF models with variable selection
- Goal of variable selection: identify the variables (here: genes) that are crucial for building the RF model
- variable selection can be performed using the varSelRF package
- the number of variables is reduced by iteratively eliminating the least important variables (using the calculated importance values)
- the number of variables dropped in each iteration is controlled by the argument vars.drop.frac
library(varSelRF)
## Warning: package 'varSelRF' was built under R version 3.5.3
## Loading required package: parallel
selected <- varSelRF(predictors, response, ntree = 500, ntreeIterat = 300, vars.drop.frac = 0.2)
selected
##
## Backwards elimination on random forest; ntree = 500 ; mtryFactor = 1
##
## Selected variables:
## [1] "Gene_227" "Gene_412" "Gene_475" "Gene_609" "Gene_798" "Gene_878"
##
## Number of selected variables: 6
plot(selected)
# selected$selec.history # selection history
selected$selected.vars # selected variables
## [1] "Gene_227" "Gene_412" "Gene_475" "Gene_609" "Gene_798" "Gene_878"
names(selected) # available components of the list
## [1] "selec.history" "rf.model"
## [3] "selected.vars" "selected.model"
## [5] "best.model.nvars" "initialImportances"
## [7] "initialOrderedImportances" "ntree"
## [9] "ntreeIterat" "mtryFactor"
## [11] "firstForest"
- again, in our example, the selected variables can be considered candidate biomarkers (a possible follow-up is sketched below)
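- a possible follow-up, not in the original: refit a forest on the selected variables only (assuming 'predictors' and 'response' hold the classification data from section 2.3) and check that the classification performance is retained
forest.sel <- randomForest(predictors[, selected$selected.vars], response, ntree = 1000)
forest.sel$confusion # compare with the confusion matrix of the full model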
uwe.menzel@matstat.org