Decision Trees

Goal: predict some feature (attribute) of objects on grounds of data collected for these objects
a Decision Tree ist trained using training data sets, and is then applied to test data sets
Learning: the Decision Tree is constructed using data with known attributes, called training set
Prediction: the Decision Tree is used to predict the unknown attributes of objects in the test set
Training set:
- must be a dataset where the attributes are known
- is used to construct the Decision Tree
Test set:
- here, we have only the data but not the corresponding attributes
- we want to infer the attributes on grounds of the data, with the help of the beforehand constructed Decision Tree
we will use the package RWeka here

Training set: Edgar Anderson’s Iris Data

sepal length, sepal width, petal length and petal width for 50 flowers from each of the 3 species of iris
the species are Iris setosa, Iris versicolor, and Iris virginica
Iris dataset: Wikipedia
this is a training set, i.e. we have the attribute (= species) corresponding to each dataset (consisting of the 4 measures)

str(iris)

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

head(iris, 10)  # data: column 1 to 4 ;  attribute: column 5

##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1           5.1         3.5          1.4         0.2  setosa
## 2           4.9         3.0          1.4         0.2  setosa
## 3           4.7         3.2          1.3         0.2  setosa
## 4           4.6         3.1          1.5         0.2  setosa
## 5           5.0         3.6          1.4         0.2  setosa
## 6           5.4         3.9          1.7         0.4  setosa
## 7           4.6         3.4          1.4         0.3  setosa
## 8           5.0         3.4          1.5         0.2  setosa
## 9           4.4         2.9          1.4         0.2  setosa
## 10          4.9         3.1          1.5         0.1  setosa

levels(iris[,5])

## [1] "setosa"     "versicolor" "virginica"

Constructing the Decision Tree

using the C4.5 algorithm developed by Ross Quinlan, Wikipedia
Java must be installed (check with ‘java -version’ on the command prompt) in order to run the package RWeka

library(RWeka)

## Warning: package 'RWeka' was built under R version 3.5.3

tree <- J48(Species ~ ., data = iris)
print(tree)

## J48 pruned tree
## ------------------
## 
## Petal.Width <= 0.6: setosa (50.0)
## Petal.Width > 0.6
## |   Petal.Width <= 1.7
## |   |   Petal.Length <= 4.9: versicolor (48.0/1.0)
## |   |   Petal.Length > 4.9
## |   |   |   Petal.Width <= 1.5: virginica (3.0)
## |   |   |   Petal.Width > 1.5: versicolor (3.0/1.0)
## |   Petal.Width > 1.7: virginica (46.0/1.0)
## 
## Number of Leaves  :  5
## 
## Size of the tree :   9

library(rpart)  # for plotting the tree
plot(tree)

Interpretation of the results

annotation on the x-axis (‘setosa’) is misleading, should be ‘setosa, versicolor, virginica’
the tree shows the decision criteria on the vertices, e.g. \(\leq 0.6\)
the graph shows how the actual dataset (with known attributes) would we classified using the calculated classification criteria
the final nodes give information on how well the actual data set is classified using the constructed tree
- node 2 is a pure node only containing all 50 samples of setosa (which can be checked easily, see below)
- node 5 contains 47 versicolor and 1 virginica
- node 7 contains 3 virginica (small but pure node)
- node 8 contains 2 versicolor and 1 virginica
- node 9 contains 45 virginica and 1 versicolor
the classification is not perfect because we don’t get exclusively pure nodes
however, a bigger tree would probably overfit the tree (see below)
if we assume that our training set represents the whole population of potential datasets, new test sets can be run down the tree and conclusions can be made:
- if a test set (consisting of the 4 measures) ends up in node 2, we can be quite sure that it is setosa
- if a test set ends up in node 5, we can almost be sure that it is versicolor
- if a test set ends up in node 9, we can be quite sure that it is virginica
- if a test sets end up in nodes 7 or 8, we are rather unsure what species it is (because the sample numbers in these nodes are low or not even unique)

Checking the content of node 2:

criterion = iris[,4] <= 0.6                # logical array
table(iris[criterion,]$Species)

## 
##     setosa versicolor  virginica 
##         50          0          0

Checking the content of node 5:

criterion = (iris$Petal.Width > 0.6) & (iris$Petal.Width <= 1.7) & (iris$Petal.Length <= 4.9)
table(iris[criterion,]$Species)

## 
##     setosa versicolor  virginica 
##          0         47          1

Checking the content of node 7:

with(iris,
    {criterion = (Petal.Width > 0.6) & (Petal.Width <= 1.7) & (Petal.Length > 4.9) & (Petal.Width <= 1.5)
    table(iris[criterion,]$Species)} 
)

## 
##     setosa versicolor  virginica 
##          0          0          3

Checking the content of node 8:

with(iris,
    {criterion = (Petal.Width > 0.6) & (Petal.Width <= 1.7) & (Petal.Length > 4.9) & (Petal.Width > 1.5)
    table(iris[criterion,]$Species)} 
)

## 
##     setosa versicolor  virginica 
##          0          2          1

Checking the content of node 9:

criterion = iris$Petal.Width > 1.7 
table(iris[criterion,]$Species)

## 
##     setosa versicolor  virginica 
##          0          1         45

Validation of the model

find out how good the predictive model (the tree) is
the original data set is partitioned into a training set and a test set:
- the training set builds the model (the tree), then
- the attributes for the test set are predicted using this model, then
- it is checked how many of the predictions were correct.
\(k\)-fold cross-validation:
- the original sample is randomly partitioned into \(k\) equal-sized subsets
- \(k-1\) subsets are used as training set
- one subset is used as test set
- the cross-validation is then repeated \(k\) times, so that each subset serves as test set exactly once
- the total amount of correct predictions in all \(k\) repetitions is calculated
often, 10-fold cross-validation is conducted (\(k=10\))

# summary(tree)  # also possible
valid <- evaluate_Weka_classifier(tree, numFolds = 10)
valid

## === 10 Fold Cross Validation ===
## 
## === Summary ===
## 
## Correctly Classified Instances         140               93.3333 %
## Incorrectly Classified Instances        10                6.6667 %
## Kappa statistic                          0.9   
## Mean absolute error                      0.0533
## Root mean squared error                  0.2028
## Relative absolute error                 11.996  %
## Root relative squared error             43.0252 %
## Total Number of Instances              150     
## 
## === Confusion Matrix ===
## 
##   a  b  c   <-- classified as
##  49  1  0 |  a = setosa
##   0 45  5 |  b = versicolor
##   0  4 46 |  c = virginica

in total, 96% of the samples were correctly predicted (144 out of 150)
the Confusion Matrix reports how many items of each species were correctly classified, e.g.:
- 49 setosa were (correctly) classified as setosa, 1 setosa was (mistakenly) classified as versicolor
- 47 versicolor were (correctly) classified as versicolor, 3 versicolor were (mistakenly) classified as virginica
- 48 virginica were (correctly) classified as virginica, 2 virginica were (mistakenly) classified as versicolor

A new dataset for prediction

some faked data with unknown attribute:

dim(iris[,1:4])

## [1] 150   4

trafo = matrix(runif(600, min = 0.5, max = 1.5), ncol = 4)
new.data = as.data.frame(iris[,1:4] * trafo)
# sum(new.data < 0) == 0  # TRUE
head(new.data, 10)

##    Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1      4.832716    5.122523    2.0991858  0.17564576
## 2      2.854215    1.874564    1.6270647  0.24534582
## 3      5.478056    2.248085    1.6337735  0.25635555
## 4      4.412058    2.095338    1.5231784  0.15873155
## 5      3.936727    5.347617    1.8457398  0.20578437
## 6      2.951850    4.738964    2.2712688  0.45398926
## 7      3.752095    2.697692    1.3381425  0.19033576
## 8      4.587077    3.397883    1.9045512  0.25682642
## 9      6.030708    1.860872    0.7206633  0.11685948
## 10     6.396816    3.502716    2.0242492  0.09565466

Prediction using the Decision tree

for each row of the test set, we predict the attribute ‘species’:

prediction <- predict(tree, newdata = new.data)
str(prediction)

##  Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

result1 = cbind(new.data, prediction)
result1

##     Sepal.Length Sepal.Width Petal.Length Petal.Width prediction
## 1       4.832716    5.122523    2.0991858  0.17564576     setosa
## 2       2.854215    1.874564    1.6270647  0.24534582     setosa
## 3       5.478056    2.248085    1.6337735  0.25635555     setosa
## 4       4.412058    2.095338    1.5231784  0.15873155     setosa
## 5       3.936727    5.347617    1.8457398  0.20578437     setosa
## 6       2.951850    4.738964    2.2712688  0.45398926     setosa
## 7       3.752095    2.697692    1.3381425  0.19033576     setosa
## 8       4.587077    3.397883    1.9045512  0.25682642     setosa
## 9       6.030708    1.860872    0.7206633  0.11685948     setosa
## 10      6.396816    3.502716    2.0242492  0.09565466     setosa
## 11      5.943058    4.072902    2.2274814  0.29316926     setosa
## 12      3.016985    1.821694    2.3393281  0.16772238     setosa
## 13      4.804658    1.826393    1.2735984  0.06542385     setosa
## 14      4.860173    3.780216    1.1170545  0.05857827     setosa
## 15      6.570960    3.681125    1.0500925  0.24184318     setosa
## 16      3.530287    2.724209    1.6194209  0.28702872     setosa
## 17      4.986281    5.834168    0.9886306  0.56685548     setosa
## 18      3.286190    2.676965    0.8913875  0.21118476     setosa
## 19      7.687908    4.891066    1.7007642  0.36118353     setosa
## 20      2.738755    3.124861    0.8124252  0.17246026     setosa
## 21      5.062289    3.345152    1.2242594  0.24436398     setosa
## 22      4.058749    5.204812    0.8351260  0.49736829     setosa
## 23      2.850975    5.251906    1.3364728  0.10125522     setosa
## 24      2.779614    3.081571    2.1900976  0.38634188     setosa
## 25      2.937637    2.331701    1.4850430  0.17474287     setosa
## 26      5.315325    3.624403    1.4020883  0.20023908     setosa
## 27      6.166012    3.409298    2.1961946  0.21572545     setosa
## 28      6.244989    1.873612    0.9203357  0.24105894     setosa
## 29      3.639132    1.885118    1.8450813  0.15824337     setosa
## 30      5.853805    4.299255    0.9571473  0.19648223     setosa
## 31      5.919919    2.759024    2.0954649  0.29200994     setosa
## 32      2.833008    1.901636    1.8998995  0.45850322     setosa
## 33      4.992551    3.225702    1.0551993  0.06531902     setosa
## 34      6.593400    5.713773    2.0325360  0.20341695     setosa
## 35      3.111231    2.841053    2.1874752  0.20881520     setosa
## 36      7.004825    3.311882    0.6576265  0.23245626     setosa
## 37      4.031835    3.261548    0.7743620  0.21463489     setosa
## 38      6.141885    2.077962    0.7858660  0.09718561     setosa
## 39      3.777757    4.234628    1.4579466  0.13455255     setosa
## 40      7.051590    4.191850    1.0865406  0.29689719     setosa
## 41      5.765006    4.701749    0.9577343  0.30544897     setosa
## 42      5.779574    1.235240    0.9728323  0.30412761     setosa
## 43      4.279601    4.330575    1.2167328  0.23049953     setosa
## 44      2.591314    2.878865    2.3927535  0.58061333     setosa
## 45      6.763196    4.990708    2.2109211  0.22348202     setosa
## 46      7.186208    2.160478    1.7440791  0.26029225     setosa
## 47      5.282986    3.812588    1.4221341  0.27726634     setosa
## 48      3.265226    1.956757    1.8479365  0.12266365     setosa
## 49      4.762972    3.543531    0.9891561  0.19297598     setosa
## 50      3.345149    3.448123    1.9302926  0.24637028     setosa
## 51      8.742657    3.841667    5.1695992  1.25692878  virginica
## 52      3.844931    2.386025    5.6622839  0.84952889  virginica
## 53      8.718737    2.379241    5.2242200  2.13433535  virginica
## 54      4.234440    3.037695    4.4805097  1.10476524 versicolor
## 55      4.959848    2.769440    4.2522363  2.18465440  virginica
## 56      6.045163    2.012805    3.0917925  0.86792880 versicolor
## 57      3.737477    3.915816    3.5859034  1.73440388  virginica
## 58      5.670866    2.028810    2.6499562  0.61750846 versicolor
## 59      4.202744    1.856623    3.6645251  1.68217684 versicolor
## 60      4.379822    2.718736    4.9990942  1.99128899  virginica
## 61      3.033290    2.297334    4.7318999  1.11315215 versicolor
## 62      8.786806    3.850833    4.3541196  1.47663198 versicolor
## 63      8.240229    3.285175    2.8399583  0.76385861 versicolor
## 64      6.446854    3.641839    3.7240133  0.82757080 versicolor
## 65      7.001825    2.267892    5.0228502  1.32203644  virginica
## 66      5.451719    2.590259    6.1478981  2.03299208  virginica
## 67      3.405662    1.791855    6.0851748  1.95148766  virginica
## 68      5.723218    3.768938    3.3760824  1.13247269 versicolor
## 69      4.780760    2.790003    2.6960507  2.20740653  virginica
## 70      5.272699    1.851365    3.8916031  0.98068134 versicolor
## 71      8.816352    3.132248    4.9889977  2.63767410  virginica
## 72      5.350125    1.986364    4.7478489  0.67025457 versicolor
## 73      4.203775    1.654432    2.5738289  1.58543463 versicolor
## 74      6.409963    2.610098    6.2135779  0.93882719  virginica
## 75      8.240457    3.964222    2.1895989  0.95691883 versicolor
## 76      8.762775    3.996914    3.9247762  0.87483287 versicolor
## 77      7.255909    2.319953    6.8492307  0.82336750  virginica
## 78      8.099363    4.080294    4.0517736  0.96917620 versicolor
## 79      4.820584    2.860442    5.6377012  1.86838030  virginica
## 80      6.318633    3.722013    2.0366081  0.79574199 versicolor
## 81      8.203643    3.057393    2.6312645  0.65870745 versicolor
## 82      4.517657    1.725634    3.4318489  1.46491500 versicolor
## 83      6.165700    3.630446    3.6515013  0.85046646 versicolor
## 84      8.763337    1.801799    6.5221238  1.37847329  virginica
## 85      7.194513    2.411695    3.4572395  1.84675892  virginica
## 86      6.482665    4.755928    3.7126390  2.28194465  virginica
## 87      6.467284    2.148698    5.7219687  1.84450568  virginica
## 88      8.776032    2.581382    5.2524546  1.46669481  virginica
## 89      8.262017    4.040443    5.2959714  0.83156829  virginica
## 90      3.284570    3.080284    2.3295676  0.81942775 versicolor
## 91      6.778975    1.682780    4.6253215  0.90405001 versicolor
## 92      8.115952    4.251712    5.1677275  1.32486388  virginica
## 93      6.299611    2.183056    3.9932239  0.61342497 versicolor
## 94      6.430682    2.853464    3.1478331  0.50764713     setosa
## 95      4.034101    2.948159    2.6495131  1.77721366  virginica
## 96      4.741147    3.573355    3.5776308  0.84312842 versicolor
## 97      3.677493    1.619995    2.4604212  1.52472772 versicolor
## 98      5.631430    4.214066    5.8615448  1.34765434  virginica
## 99      4.670305    3.000035    3.8492299  1.08718694 versicolor
## 100     3.406271    2.603243    4.7887187  1.83460089  virginica
## 101     5.412494    2.768269    3.9176295  1.82598282  virginica
## 102     4.174296    3.220575    5.3573717  2.50845081  virginica
## 103     8.916831    3.496705    5.3443078  2.69107179  virginica
## 104     3.408714    2.119770    7.6363355  2.42231125  virginica
## 105     7.693003    1.578378    7.5433542  3.11401597  virginica
## 106     6.943272    1.502315    9.4540851  2.33190844  virginica
## 107     6.761995    3.610559    4.9122007  2.11011187  virginica
## 108     9.880241    1.521159    4.2071940  1.68435920 versicolor
## 109     5.090855    2.899962    5.2489287  0.97091766  virginica
## 110     7.846517    2.219277    4.7105649  3.27590621  virginica
## 111     5.982181    2.407672    4.2231867  2.93446542  virginica
## 112     8.293309    1.937653    6.9035730  2.71335159  virginica
## 113     5.760098    2.710019    7.9684022  2.76157754  virginica
## 114     5.466427    2.674704    2.9973433  2.48488719  virginica
## 115     5.937860    3.563594    3.5523909  1.69759229 versicolor
## 116     3.546129    4.297026    4.9848704  2.38481172  virginica
## 117     8.346896    4.192600    4.8138415  1.48472573 versicolor
## 118     6.070196    4.624252    9.6513670  1.29250449  virginica
## 119     5.432680    1.780398    8.4202676  1.33046022  virginica
## 120     6.727776    2.604435    6.4902592  0.76100906  virginica
## 121     5.040791    3.815913    7.7680044  2.81599959  virginica
## 122     5.568607    1.552100    3.7793098  2.44377175  virginica
## 123     5.008661    2.306757    6.9837526  1.60726194 versicolor
## 124     9.442820    3.073266    4.4182416  1.10672933 versicolor
## 125     8.048954    3.674011    5.8599089  2.43866647  virginica
## 126     4.753932    2.620837    5.8422255  2.38860098  virginica
## 127     3.167409    3.566129    5.8160231  1.78234452  virginica
## 128     8.856527    1.842868    5.7695312  1.58478304 versicolor
## 129     6.757927    1.775758    6.2410990  1.59086761 versicolor
## 130     6.097694    2.542240    7.6771751  1.83585701  virginica
## 131     6.231776    2.412246    6.9606010  1.67937293 versicolor
## 132     7.042629    2.485692    7.6061105  1.74705897  virginica
## 133     8.894703    3.150081    5.6677920  2.95348354  virginica
## 134     7.775730    2.251343    3.5754494  1.15741537 versicolor
## 135     6.967865    3.566389    3.7665705  0.97070322 versicolor
## 136    11.123091    4.148667    5.1356627  2.09005688  virginica
## 137     5.032044    2.578277    8.0075962  3.40275984  virginica
## 138     5.656349    3.831401    4.3073856  1.62953249 versicolor
## 139     7.479488    3.118329    4.4863036  2.17516611  virginica
## 140     3.989887    1.930024    2.7528994  2.47162337  virginica
## 141     6.985691    1.743139    5.7631909  2.69427229  virginica
## 142     8.166098    2.387197    4.4143853  2.80877725  virginica
## 143     6.274991    3.370998    7.1657572  1.86624777  virginica
## 144     3.728006    2.090939    6.5733810  2.26185388  virginica
## 145     9.950141    4.274248    6.3119399  1.89759800  virginica
## 146     4.015230    3.306560    5.6319216  2.86518737  virginica
## 147     5.332751    1.825646    7.3790063  1.07590510  virginica
## 148     4.642727    2.002926    7.5637778  2.69859480  virginica
## 149     4.399559    2.964567    4.4050450  3.23821013  virginica
## 150     6.397060    1.586019    2.6698864  2.16664270  virginica

result2 <- predict(tree, newdata = new.data, type = "probability")
result2

##     setosa versicolor  virginica
## 1        1 0.00000000 0.00000000
## 2        1 0.00000000 0.00000000
## 3        1 0.00000000 0.00000000
## 4        1 0.00000000 0.00000000
## 5        1 0.00000000 0.00000000
## 6        1 0.00000000 0.00000000
## 7        1 0.00000000 0.00000000
## 8        1 0.00000000 0.00000000
## 9        1 0.00000000 0.00000000
## 10       1 0.00000000 0.00000000
## 11       1 0.00000000 0.00000000
## 12       1 0.00000000 0.00000000
## 13       1 0.00000000 0.00000000
## 14       1 0.00000000 0.00000000
## 15       1 0.00000000 0.00000000
## 16       1 0.00000000 0.00000000
## 17       1 0.00000000 0.00000000
## 18       1 0.00000000 0.00000000
## 19       1 0.00000000 0.00000000
## 20       1 0.00000000 0.00000000
## 21       1 0.00000000 0.00000000
## 22       1 0.00000000 0.00000000
## 23       1 0.00000000 0.00000000
## 24       1 0.00000000 0.00000000
## 25       1 0.00000000 0.00000000
## 26       1 0.00000000 0.00000000
## 27       1 0.00000000 0.00000000
## 28       1 0.00000000 0.00000000
## 29       1 0.00000000 0.00000000
## 30       1 0.00000000 0.00000000
## 31       1 0.00000000 0.00000000
## 32       1 0.00000000 0.00000000
## 33       1 0.00000000 0.00000000
## 34       1 0.00000000 0.00000000
## 35       1 0.00000000 0.00000000
## 36       1 0.00000000 0.00000000
## 37       1 0.00000000 0.00000000
## 38       1 0.00000000 0.00000000
## 39       1 0.00000000 0.00000000
## 40       1 0.00000000 0.00000000
## 41       1 0.00000000 0.00000000
## 42       1 0.00000000 0.00000000
## 43       1 0.00000000 0.00000000
## 44       1 0.00000000 0.00000000
## 45       1 0.00000000 0.00000000
## 46       1 0.00000000 0.00000000
## 47       1 0.00000000 0.00000000
## 48       1 0.00000000 0.00000000
## 49       1 0.00000000 0.00000000
## 50       1 0.00000000 0.00000000
## 51       0 0.00000000 1.00000000
## 52       0 0.00000000 1.00000000
## 53       0 0.02173913 0.97826087
## 54       0 0.97916667 0.02083333
## 55       0 0.02173913 0.97826087
## 56       0 0.97916667 0.02083333
## 57       0 0.02173913 0.97826087
## 58       0 0.97916667 0.02083333
## 59       0 0.97916667 0.02083333
## 60       0 0.02173913 0.97826087
## 61       0 0.97916667 0.02083333
## 62       0 0.97916667 0.02083333
## 63       0 0.97916667 0.02083333
## 64       0 0.97916667 0.02083333
## 65       0 0.00000000 1.00000000
## 66       0 0.02173913 0.97826087
## 67       0 0.02173913 0.97826087
## 68       0 0.97916667 0.02083333
## 69       0 0.02173913 0.97826087
## 70       0 0.97916667 0.02083333
## 71       0 0.02173913 0.97826087
## 72       0 0.97916667 0.02083333
## 73       0 0.97916667 0.02083333
## 74       0 0.00000000 1.00000000
## 75       0 0.97916667 0.02083333
## 76       0 0.97916667 0.02083333
## 77       0 0.00000000 1.00000000
## 78       0 0.97916667 0.02083333
## 79       0 0.02173913 0.97826087
## 80       0 0.97916667 0.02083333
## 81       0 0.97916667 0.02083333
## 82       0 0.97916667 0.02083333
## 83       0 0.97916667 0.02083333
## 84       0 0.00000000 1.00000000
## 85       0 0.02173913 0.97826087
## 86       0 0.02173913 0.97826087
## 87       0 0.02173913 0.97826087
## 88       0 0.00000000 1.00000000
## 89       0 0.00000000 1.00000000
## 90       0 0.97916667 0.02083333
## 91       0 0.97916667 0.02083333
## 92       0 0.00000000 1.00000000
## 93       0 0.97916667 0.02083333
## 94       1 0.00000000 0.00000000
## 95       0 0.02173913 0.97826087
## 96       0 0.97916667 0.02083333
## 97       0 0.97916667 0.02083333
## 98       0 0.00000000 1.00000000
## 99       0 0.97916667 0.02083333
## 100      0 0.02173913 0.97826087
## 101      0 0.02173913 0.97826087
## 102      0 0.02173913 0.97826087
## 103      0 0.02173913 0.97826087
## 104      0 0.02173913 0.97826087
## 105      0 0.02173913 0.97826087
## 106      0 0.02173913 0.97826087
## 107      0 0.02173913 0.97826087
## 108      0 0.97916667 0.02083333
## 109      0 0.00000000 1.00000000
## 110      0 0.02173913 0.97826087
## 111      0 0.02173913 0.97826087
## 112      0 0.02173913 0.97826087
## 113      0 0.02173913 0.97826087
## 114      0 0.02173913 0.97826087
## 115      0 0.97916667 0.02083333
## 116      0 0.02173913 0.97826087
## 117      0 0.97916667 0.02083333
## 118      0 0.00000000 1.00000000
## 119      0 0.00000000 1.00000000
## 120      0 0.00000000 1.00000000
## 121      0 0.02173913 0.97826087
## 122      0 0.02173913 0.97826087
## 123      0 0.66666667 0.33333333
## 124      0 0.97916667 0.02083333
## 125      0 0.02173913 0.97826087
## 126      0 0.02173913 0.97826087
## 127      0 0.02173913 0.97826087
## 128      0 0.66666667 0.33333333
## 129      0 0.66666667 0.33333333
## 130      0 0.02173913 0.97826087
## 131      0 0.66666667 0.33333333
## 132      0 0.02173913 0.97826087
## 133      0 0.02173913 0.97826087
## 134      0 0.97916667 0.02083333
## 135      0 0.97916667 0.02083333
## 136      0 0.02173913 0.97826087
## 137      0 0.02173913 0.97826087
## 138      0 0.97916667 0.02083333
## 139      0 0.02173913 0.97826087
## 140      0 0.02173913 0.97826087
## 141      0 0.02173913 0.97826087
## 142      0 0.02173913 0.97826087
## 143      0 0.02173913 0.97826087
## 144      0 0.02173913 0.97826087
## 145      0 0.02173913 0.97826087
## 146      0 0.02173913 0.97826087
## 147      0 0.00000000 1.00000000
## 148      0 0.02173913 0.97826087
## 149      0 0.02173913 0.97826087
## 150      0 0.02173913 0.97826087

the first result assigns the most probable attribute to each of the 150 samples
the second result gives the probability for each sample to be setosa, versicolor, or virginica

Drawback of Decision Tree models

overfitting, Bias-variance dilemma
overfitting means that the decision tree is too much adapted to the training set it was built with
overfitting occurs if the training set is too small and/or does not represent the whole universe of datasets which potentially are to be investigated
overfitting is partly prevented by pruning the tree, i.e. by allowing for smaller trees with impure end nodes (as above)
a better solution to prevent overfitting is establishing a Random Forest model:
- constructing many trees, each based on slightly modified input data
- see the talk: ‘Identification of longevity biomarkers by Random Forest analysis’

uwe.menzel@matstat.org

Decision Trees

Uwe Menzel

Decision Trees

Training set: Edgar Anderson’s Iris Data

Constructing the Decision Tree

Interpretation of the results

Checking the content of node 2:

Checking the content of node 5:

Checking the content of node 7:

Checking the content of node 8:

Checking the content of node 9:

Validation of the model

A new dataset for prediction

Prediction using the Decision tree

Drawback of Decision Tree models