1. The Random Forest (RF) machine learning approach

  • Random Forest (Breiman, 2001) builds an ensemble of decision trees, each grown on a bootstrap sample of the data
  • at each split of a tree, only a random subset of the variables is considered, which de-correlates the trees
  • classification: majority vote across the trees; regression: average of the tree predictions
  • samples not drawn for a tree ("Out-Of-Bag", OOB) provide a built-in estimate of the prediction error

Figure: Iris versicolor (source: mathworks.com)



2. Classification using a Random Forest model

2.1. Input data

  • data frame with 1001 columns
  • columns 1 .. 1000: expression levels (RPKM) of 1000 genes for 45 samples
  • column 1001: lifespan group of a fish species, with three levels
    • G1: short-lived
    • G2: medium lifespan
    • G3: long-lived
  • note that we have (many) more variables (genes) than samples (individuals)
data = get(load("expr_data.RData"))   # load() returns the object's name, get() fetches the object itself
dim(data)
## [1]   45 1001
data[1:6,1:6]           # top left
##             Gene_1   Gene_2    Gene_3   Gene_4   Gene_5    Gene_6
## Sample_1 2.5693765 6.059599  2.995293 6.836092 7.804588  3.250978
## Sample_2 8.1141287 7.836250 10.607745 8.097445 5.398739 11.096624
## Sample_3 5.2441530 4.134373  3.650618 7.111028 5.062770  4.331364
## Sample_4 0.4371232 5.414329  8.150474 6.956259 3.567373  7.600040
## Sample_5 6.5659180 9.174541  8.055494 5.243102 9.292768  4.809160
## Sample_6 9.6341325 5.651336  7.344717 5.390121 6.936642  6.204089
data[40:45,996:1001]    # bottom right
##            Gene_996 Gene_997 Gene_998 Gene_999 Gene_1000 group
## Sample_40 16.610444 13.44591 14.49645 16.03593  11.56931    G3
## Sample_41 16.087456 17.36023 17.54690 11.29153  16.21904    G3
## Sample_42 16.460808 14.03001 19.10219 12.26169  15.23282    G3
## Sample_43 19.145770 15.25750 12.96063 11.26993  11.67638    G3
## Sample_44 18.214603 15.17865 14.65838 15.54203  17.31798    G3
## Sample_45  8.844086 14.15185 12.41995 15.00920  10.17885    G3


2.2. Create a Decision Tree

  • not necessary for running the RF algorithm but (possibly) instructive
  • see also the Decision Tree example on this webpage
  • package RWeka will be used
  • RWeka implements the C4.5 algorithm developed by Ross Quinlan (see Wikipedia)
  • Java must be installed (check with ‘java -version’ at the command prompt) in order to run RWeka
colnames(data)[996:1001]
## [1] "Gene_996"  "Gene_997"  "Gene_998"  "Gene_999"  "Gene_1000" "group"
library(RWeka)
## Warning: package 'RWeka' was built under R version 3.5.3
tree <- J48(group ~ ., data = data)
print(tree)
## J48 pruned tree
## ------------------
## 
## Gene_412 <= 12.928452
## |   Gene_55 <= 8.684733: G1 (14.0)
## |   Gene_55 > 8.684733: G2 (16.0/1.0)
## Gene_412 > 12.928452: G3 (15.0)
## 
## Number of Leaves  :  3
## 
## Size of the tree :   5
# the plot method for Weka trees relies on the package 'partykit' (must be installed)
plot(tree)  

  • the annotation on the x-axis (‘G1’) is misleading; it should read ‘G1, G2, or G3’
  • expression values of only a few genes are sufficient to build the decision tree
  • however, the genes identified here might not be the only suitable candidates for effective splitting
    • RF might identify more genes suitable for classification, see below
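
A perfect-looking tree can be over-optimistic. As a minimal sketch, the tree can be cross-validated with RWeka's function evaluate_Weka_classifier (numFolds = 10 is an arbitrary choice):

# 10-fold cross-validation of the J48 tree
cv <- evaluate_Weka_classifier(tree, numFolds = 10)
cv    # prints the percentage of correctly classified instances, the Kappa statistic, etc.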


2.3. Create the Random Forest model for classification

  • package randomForest
  • split the data into a matrix of predictors and a response vector:
predictors = data[, 1:1000]
response = data[, 1001]
levels(response)  # 'response' is a factor --> classification 
## [1] "G1" "G2" "G3"
  • the response vector is a factor, which tells randomForest to perform classification
suppressMessages(library(randomForest)) 
## Warning: package 'randomForest' was built under R version 3.5.3
forest <- randomForest(predictors, response, ntree = 1000, importance = TRUE, proximity = TRUE, keep.forest = TRUE, do.trace = 100) 
## ntree      OOB      1      2      3
##   100:   0.00%  0.00%  0.00%  0.00%
##   200:   0.00%  0.00%  0.00%  0.00%
##   300:   0.00%  0.00%  0.00%  0.00%
##   400:   0.00%  0.00%  0.00%  0.00%
##   500:   0.00%  0.00%  0.00%  0.00%
##   600:   0.00%  0.00%  0.00%  0.00%
##   700:   0.00%  0.00%  0.00%  0.00%
##   800:   0.00%  0.00%  0.00%  0.00%
##   900:   0.00%  0.00%  0.00%  0.00%
##  1000:   0.00%  0.00%  0.00%  0.00%
# keep.forest = TRUE  to enable prediction
# names(forest)     # elements of the list
class(forest)
## [1] "randomForest"
methods(class = "randomForest")  # what can we do with the object 'forest' ?
## [1] grow        importance  margin      outlier     partialPlot plot       
## [7] predict     print      
## see '?methods' for accessing help and source code
forest$confusion
##    G1 G2 G3 class.error
## G1 15  0  0           0
## G2  0 15  0           0
## G3  0  0 15           0
  • the confusion matrix gives an estimate of how well the classification algorithm performs
  • the confusion matrix is calculated by predicting the response of the Out-Of-Bag samples
  • how many members of a particular response group are actually classified as being members of this group ?
  • a diagonal matrix (as above) indicates that the model performs very well, e.g. all 15 G1 samples are also predicted to be G1
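
As a single summary number, the overall OOB error rate can be derived from the confusion matrix; a minimal sketch using the objects defined above:

cm <- forest$confusion[, 1:3]        # drop the class.error column
1 - sum(diag(cm)) / sum(cm)          # overall OOB error rate (0 for this model)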


2.4. MDS plot based on RF proximity

  • the RF algorithm can calculate the proximity (vicinity) of the samples (use the argument proximity = TRUE above)
  • RF proximity: samples are “close” if they end up in the same leaf frequently (in many trees)
  • the proximity of samples can be visualized using a Multi-Dimensional Scaling (MDS) plot
  • the MDS-plot can be created by utilizing the built-in function cmdscale
  • first argument of cmdscale must be a distance matrix, so 1 - proximity can be used
  • the color of the points is chosen according to the true response values (G1, G2, or G3)
mds <- cmdscale(1 - forest$proximity, eig = TRUE)
x <- mds$points[,1]
y <- mds$points[,2]
color = as.character(response)
color = sub("G1", "red", color)
color = sub("G2", "blue", color)
color = sub("G3", "green", color)
unique(cbind(response, color))   # cbind coerces the factor to its integer codes
##      response color  
## [1,] "1"      "red"  
## [2,] "2"      "blue" 
## [3,] "3"      "green"
plot(x, y, col = color, pch = 18, xlab = "", ylab = "", main = "MDS plot based on RF proximity", font.main = 1)
legend("topright",  c("G1", "G2", "G3"), col=c("red", "blue", "green"), pch = 18)

  • the discrimination of the three groups in this example appears to be very good
  • don’t expect this with real data - I admit I fiddled a little bit with the input data here …
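
As a purely stylistic alternative, the three sub() calls above can be replaced by a single named-vector lookup:

group.colors <- c(G1 = "red", G2 = "blue", G3 = "green")   # named lookup vector
color <- group.colors[as.character(response)]              # same result as the sub() chain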


2.5. Prediction using RF models

  • suppose we have a new dataset (i.e. the predictors) but the response (age group) is unknown
  • we can use the model ‘forest’ calculated above to predict the response (age group)
  • different types of output are possible, controlled by the ‘type’ argument
newdata = get(load("expr_data_new.RData"))
dim(newdata)   # 10 samples, 1000 genes
## [1]   10 1000
pred1 <- predict(forest, newdata, type = "response")
pred1
##  Sample_1  Sample_2  Sample_3  Sample_4  Sample_5  Sample_6  Sample_7 
##        G1        G1        G1        G1        G1        G1        G1 
##  Sample_8  Sample_9 Sample_10 
##        G1        G1        G1 
## Levels: G1 G2 G3
pred2 <- predict(forest, newdata, type = "prob")
pred2
##              G1    G2    G3
## Sample_1  0.586 0.367 0.047
## Sample_2  0.605 0.352 0.043
## Sample_3  0.614 0.367 0.019
## Sample_4  0.600 0.377 0.023
## Sample_5  0.620 0.350 0.030
## Sample_6  0.638 0.338 0.024
## Sample_7  0.599 0.340 0.061
## Sample_8  0.606 0.346 0.048
## Sample_9  0.621 0.320 0.059
## Sample_10 0.576 0.395 0.029
## attr(,"class")
## [1] "matrix" "votes"
  • all samples are predicted to belong to group ‘G1’
  • this is correct - I intended this when I faked the data ;-)
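
The two output types are consistent: type = "response" returns the class with the largest vote fraction. A small sketch verifies this (max.col picks the column of the row-wise maximum):

pred.from.prob <- colnames(pred2)[max.col(pred2)]   # class with the maximal vote fraction
all(pred.from.prob == as.character(pred1))          # expected: TRUE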


3. Variable importance in RF models

  • with importance = TRUE (above), two importance measures are stored for every gene
  • permutation importance (MeanDecreaseAccuracy): loss of OOB accuracy when the values of the gene are randomly permuted
  • impurity importance (MeanDecreaseGini): total decrease of the Gini impurity achieved by splits on the gene, averaged over all trees
# head(forest$importance)
imp.perm = sort(forest$importance[,"MeanDecreaseAccuracy"], decreasing = T)     # Permutation importance
imp.gini = sort(forest$importance[,"MeanDecreaseGini"], decreasing = T)         # Impurity importance
barplot(imp.perm[1:30],  width=0.6, ylab="Importance", main = "Permutation Importance of genes", legend.text=F, col="red", font.main=1, las=3, cex.names=0.7)   # best 30 genes

barplot(imp.gini[1:30], width=0.6, ylab="Importance", main = "Impurity Importance of genes", legend.text=F, col="red", font.main=1, las=3, cex.names=0.7)
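
The two rankings usually overlap but are not identical; a quick comparison intersects the top 30 genes of both measures:

# genes that rank among the top 30 by both importance measures
intersect(names(imp.perm)[1:30], names(imp.gini)[1:30])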


4. Regression models using RF


4.1. Loading data for RF regression

  • the package randomForest conducts RF regression when the response vector is numerical
  • columns 1 .. 1000: expression levels (RPKM) of 1000 genes for 45 samples
  • column 1001: numerical values of lifespan (fish)


4.2. Create the Random Forest model for regression

  • load some (faked) example data:
data = get(load("expr_data_regr.RData"))
dim(data)
## [1]   45 1001
data[1:6,1:6]           # top left
##             Gene_1   Gene_2    Gene_3   Gene_4   Gene_5    Gene_6
## Sample_1 2.5693765 6.059599  2.995293 6.836092 7.804588  3.250978
## Sample_2 8.1141287 7.836250 10.607745 8.097445 5.398739 11.096624
## Sample_3 5.2441530 4.134373  3.650618 7.111028 5.062770  4.331364
## Sample_4 0.4371232 5.414329  8.150474 6.956259 3.567373  7.600040
## Sample_5 6.5659180 9.174541  8.055494 5.243102 9.292768  4.809160
## Sample_6 9.6341325 5.651336  7.344717 5.390121 6.936642  6.204089
data[40:45,996:1001]    # bottom right
##            Gene_996 Gene_997 Gene_998 Gene_999 Gene_1000      age
## Sample_40 16.610444 13.44591 14.49645 16.03593  11.56931 13.88256
## Sample_41 16.087456 17.36023 17.54690 11.29153  16.21904 15.20492
## Sample_42 16.460808 14.03001 19.10219 12.26169  15.23282 13.98376
## Sample_43 19.145770 15.25750 12.96063 11.26993  11.67638 14.08816
## Sample_44 18.214603 15.17865 14.65838 15.54203  17.31798 13.23555
## Sample_45  8.844086 14.15185 12.41995 15.00920  10.17885 15.51807
  • regression is carried out when the 2nd input argument is a numerical vector:
predictors = data[, 1:1000]
response1 = data[, 1001]
str(response1)  # numerical --> regression
##  num [1:45] 9.23 9.27 7.36 5.43 2.89 ...
  • build the RF regression model:
forest <- randomForest(predictors, response1, ntree = 1000, importance = TRUE, proximity = TRUE, keep.forest = TRUE, do.trace = 100)
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  100 |    5.104    43.50 |
##  200 |    5.151    43.90 |
##  300 |    5.177    44.12 |
##  400 |    5.082    43.31 |
##  500 |    5.079    43.29 |
##  600 |    5.117    43.61 |
##  700 |    5.175    44.11 |
##  800 |    5.218    44.47 |
##  900 |    5.178    44.13 |
## 1000 |    5.169    44.05 |
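
The %Var(y) column shows that roughly 44 % of the response variance remains unexplained. A quick visual check of the fit plots the OOB predictions (stored in forest$predicted) against the observed lifespans; a minimal sketch:

# OOB-predicted vs. observed lifespan
plot(response1, forest$predicted, pch = 18, xlab = "observed age", ylab = "OOB-predicted age", main = "RF regression: OOB predictions", font.main = 1)
abline(0, 1, lty = 2)                        # perfect predictions would fall on this line
cor(response1, forest$predicted)^2           # squared correlation as a rough measure of fit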
  • the MDS plot can also be created with the function MDSplot from the randomForest package:
MDSplot(forest, response)   # color the points by the known group factor from section 2

5. RF in unsupervised mode

  • when the response is omitted, randomForest runs in unsupervised mode
  • a synthetic second class is generated by sampling from the product of the marginal distributions of the predictors, and a two-class RF is grown
  • the OOB errors printed below refer to this artificial two-class problem; the proximities of the real samples can still be used, e.g. for an MDS plot
forest <- randomForest(predictors, ntree = 1000, importance = TRUE, proximity = TRUE, keep.forest = TRUE, do.trace = 100)
## ntree      OOB      1      2
##   100:  34.44% 22.22% 46.67%
##   200:  34.44% 31.11% 37.78%
##   300:  36.67% 31.11% 42.22%
##   400:  42.22% 33.33% 51.11%
##   500:  40.00% 33.33% 46.67%
##   600:  42.22% 33.33% 51.11%
##   700:  41.11% 33.33% 48.89%
##   800:  42.22% 33.33% 51.11%
##   900:  40.00% 33.33% 46.67%
##  1000:  42.22% 33.33% 51.11%
mds <- cmdscale(1 - forest$proximity, eig = TRUE)
x <- mds$points[,1]
y <- mds$points[,2]
plot(x, y, col = "red", pch = 18, xlab = "", ylab = "", main = "MDS plot - RF proximity", font.main = 1)
lab = rep("", 45)
lab[5] = "S5"
lab[16] = "S16"
lab[31] = "S31"
# text(x, y, label = names(x), cex = 0.8, pos = 1)   # too dense when all samples are labeled 
text(x, y, label = lab, cex = 0.8, pos = 1)          # 3 labels only

# identify(x, y, labels = names(x))                  # works in interactive mode
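
The proximities from the unsupervised run can be fed into other methods as well; as a minimal sketch, the samples can be clustered hierarchically on the distance 1 - proximity:

# hierarchical clustering of the samples based on the unsupervised RF proximities
hc <- hclust(as.dist(1 - forest$proximity), method = "average")
plot(hc, cex = 0.7, main = "Clustering on 1 - RF proximity", font.main = 1)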


6. RF models with variable selection

  • the package varSelRF performs a backwards elimination of variables (genes), guided by the OOB error
  • in each iteration, the fraction vars.drop.frac of the least important variables is dropped
library(varSelRF)
## Warning: package 'varSelRF' was built under R version 3.5.3
## Loading required package: parallel
selected <- varSelRF(predictors, response, ntree = 500, ntreeIterat = 300, vars.drop.frac = 0.2)    
selected
## 
## Backwards elimination on random forest; ntree =  500 ;  mtryFactor =  1 
## 
##  Selected variables:
## [1] "Gene_227" "Gene_412" "Gene_475" "Gene_609" "Gene_798" "Gene_878"
## 
##  Number of selected variables: 6
plot(selected)

# selected$selec.history     # selection history
selected$selected.vars      # selected variables
## [1] "Gene_227" "Gene_412" "Gene_475" "Gene_609" "Gene_798" "Gene_878"
names(selected)             # available components of the list
##  [1] "selec.history"             "rf.model"                 
##  [3] "selected.vars"             "selected.model"           
##  [5] "best.model.nvars"          "initialImportances"       
##  [7] "initialOrderedImportances" "ntree"                    
##  [9] "ntreeIterat"               "mtryFactor"               
## [11] "firstForest"
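
The selected genes can be used to refit a much smaller model; a minimal sketch (ntree = 1000 is an arbitrary choice):

# refit the classification RF using only the 6 selected genes
forest.sel <- randomForest(predictors[, selected$selected.vars], response, ntree = 1000)
forest.sel$confusion     # compare with the confusion matrix of the full model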

uwe.menzel@matstat.org