In this computer module, you will learn:

  1. how to create training and test datasets
  2. how to build a model using Random Forest and k-Nearest Neighbors approaches and to predict classes on a new dataset
  3. how to use cross-validation to optimize random forest and k-nearest neighbors models
  4. how to assess the quality of the classification from all of these models

1. Install the different packages and import the data

Let’s first install (if you haven’t done so before) and load the packages we will need in this computer module:

install.packages(c("caret","class","randomForest","e1071"))
library(caret)
library(class)
library(randomForest)
library(e1071)

Now that the packages are installed and loaded, we need to define our working directory and then import the data we will use. If you work on your own computer/laptop, make sure you set the working directory yourself.

# First, set your working directory! The working directory is the directory where your files are located.
# Click Session >> Set Working Directory >> Choose Directory
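
Alternatively, the working directory can be set from the console; the path below is only a placeholder that you should adapt to your own computer:

# setwd("C:/path/to/your/folder")   # placeholder path, adapt it to your own folders
getwd()                             # check the current working directory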

Q1a: Import the data “data_AD.csv” using the function read.csv2. Don’t forget to specify the decimal (“,”) and separator (“;”) characters, and to set header to TRUE.
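
A minimal sketch, assuming the file “data_AD.csv” is located in your working directory and that the resulting object is called data (this name is used in the rest of the module):

data <- read.csv2("data_AD.csv", header = TRUE, sep = ";", dec = ",")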

This dataset is from a recently published manuscript by Bakker et al. (Clinical & Experimental Allergy 2019). The aim of the study was to identify immunological biomarkers (proteins) in the blood of Atopic Dermatitis patients to predict which patients will be difficult-to-treat (i.e. won’t respond to treatment). Blood samples were taken before treatment and were measured once the outcome of the treatment was known (group = 2: difficult-to-treat, group = 1: controlled disease). A total of 143 markers were measured using Luminex technology.

Q1b: Formulate a research question.

Q1c: Have a look at the data. Describe 5 of the proteins that have been measured and look up online what their function is.
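
A few generic commands to get a first overview of the data (which markers you choose to describe is up to you):

dim(data)             # number of patients (rows) and variables (columns)
str(data)             # type of each column
head(colnames(data))  # names of the first columns/markers
table(data$group)     # number of patients per outcome group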

2. Create train and test sets and use the Random Forest and k-nearest neighbors approaches

2.1 Create train and test set

We can now create the training and test set that will be used in our classification model.

Q2: Why do we need to create a training and a test set?

Q3: Which fraction of patients should we include in the training set? Replace the “p=0.50” by the corresponding value.

train_nb <- createDataPartition(
  y = data$group,  ## the outcome data are needed
  p = 0.50,   ## The fraction of data in the training set
  list = FALSE
)
# train_nb contains the row numbers kept for the training set

## create the train and test sets
train.data <- data[train_nb,]
test.data <- data[-train_nb,]
## extract the reference for accuracy
train.category <- as.character(train.data$group)
test.category <- as.character(test.data$group)
## remove the reference from both train and test sets
train.data <- train.data[,3:ncol(train.data)]
test.data <- test.data[,3:ncol(test.data)]
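
Before modelling, it can be useful to check that both groups are represented in the training and the test set; a quick optional check:

table(train.category)  # number of patients per group in the training set
table(test.category)   # number of patients per group in the test set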

2.2 Build a classification model using the Random Forest approach and make predictions on the test set

Let’s first use a random forest approach to build the model using the training set. Here we propose to use 500 trees. You can try different values to see the impact on the accuracy (a short sketch follows the model code below). Note that more trees will take more computational time.

Q4a: Explain with your own words what a Random Forest approach is.

model <- randomForest(x = train.data, y = as.factor(train.category), 
                     importance = T, ntree = 500, 
                     proximity = F)
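
If you want to explore the effect of ntree, one possible sketch is given below; the values 100, 500 and 1000 are arbitrary examples, and the out-of-bag (OOB) error reported by randomForest is used as a quick quality measure:

## Compare the OOB error for a few (arbitrary) numbers of trees
for (nt in c(100, 500, 1000)) {
  m <- randomForest(x = train.data, y = as.factor(train.category), ntree = nt)
  cat("ntree =", nt, " OOB error rate =",
      round(m$err.rate[nt, "OOB"] * 100, 1), "%\n")
}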

We can print the output of the model with the following commands.

model

Q4b: Run the following line. What is reported by model$votes? Explain in your own words.

model$votes # probability for classification in each group

Q4c: Run the following line. What is reported? Look up in the help pages or online what MeanDecreaseAccuracy and MeanDecreaseGini are.

model$importance # importance of each marker
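
To see which markers contribute most, one optional possibility is to sort the importance table; varImpPlot() from the randomForest package shows the same information graphically:

## Ten markers with the largest MeanDecreaseAccuracy
imp <- model$importance
head(imp[order(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE), ], 10)
# varImpPlot(model)  # graphical alternative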

We can now use this model on the test set and compute some quantitative quality indicators such as Sensitivity and Specificity.

# Use the model to predict on the test set.
prediction <- predict(model, newdata=test.data)
proba <- predict(model, newdata=test.data,type="prob")
# Calculate the overall accuracy.
correct <- prediction == test.category
print(paste("% of predicted classifications correct", mean(correct)*100))
conf_matrix <- table(prediction, test.category) # rows = predictions, columns = reference (the orientation expected by the caret functions below)
conf_matrix
sensitivity(conf_matrix)
specificity(conf_matrix)
posPredValue(conf_matrix)
negPredValue(conf_matrix)

## This function divides the number of correct predictions by the total number of
## predictions, which tells us how accurate the model is.
accuracy <- function(x){ sum(diag(x)) / sum(x) * 100 }
accuracy(conf_matrix)
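
Note that caret’s confusionMatrix() function reports most of these indicators (accuracy, sensitivity, specificity, PPV, NPV) in a single call. A minimal sketch; by default the positive class is the first factor level, here "1" (controlled disease):

confusionMatrix(data = prediction,
                reference = factor(test.category, levels = levels(prediction)))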

Q5: What do the sensitivity and specificity tell you? How do you interpret these numbers?

2.3 Build a classification model using the k-nearest neighbors (knn) approach and make predictions on the test set

We will now repeat the same steps but using a k-nearest neighbors approach. The function knn takes the train and test sets directly, and its output is immediately the prediction on the test set.

Q6: Explain with your own words what a k-nearest neighbors approach is.

pred_knn <- knn(train.data,test.data,cl=train.category,k=6)   
##create confusion matrix by comparing pred_knn with test.category
conf_matrix <- table(pred_knn,test.category)
conf_matrix
## This function divides the number of correct predictions by the total number of predictions.
accuracy <- function(x){ sum(diag(x)) / sum(x) * 100 }
accuracy(conf_matrix)

Q7: What happens to the accuracy if you change the number of nearest neighbors (k)? How would you choose the best value for “k”?
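
One way to explore this is to recompute the test-set accuracy for a few values of k (the values below are arbitrary examples):

for (k in c(1, 3, 5, 7, 9, 11)) {
  pred_k <- knn(train.data, test.data, cl = train.category, k = k)
  cat("k =", k, " accuracy =", round(mean(pred_k == test.category) * 100, 1), "%\n")
}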

Q8: Determine the Sensitivity, Specificity, PPV and NPV for this model. How do they compare to the ones using the random forest approach? Which approach gives you the best predictive model?
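
As a starting point, the caret functions from section 2.2 can be reused on the knn confusion matrix (conf_matrix from section 2.3 already has the predictions in the rows and the reference in the columns, as these functions expect):

sensitivity(conf_matrix)
specificity(conf_matrix)
posPredValue(conf_matrix)
negPredValue(conf_matrix)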

3. Cross-validation

As explained during the lecture, a cross-validation approach has several advantages, such as defining the best parameters for the model (the number of variables randomly sampled as candidates, mtry, or the number of nearest neighbors, k).

Q9: Explain with your own words the principle of cross-validation. Why is it useful?

3.1 k-fold cross-validation

In this case, we will split the data into 5 folds. Each fold will be used once as the test set while the others are merged and used as the training set.

# Define training control
set.seed(123) 
train.control <- trainControl(method = "cv", number = 5)

3.1.1 Random Forest

# Train the model (the dot "." represents all other columns of data)
model_rf <- train(as.character(group) ~., data = data, method = "rf",
               trControl = train.control)
# Summarize the results
print(model_rf)
model_rf$finalModel
model_rf$resample # accuracy per fold (resample)

Q10: What is the optimal number of variables randomly sampled (mtry)?
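
The value retained by caret can also be read directly from the train object:

model_rf$results    # accuracy for each value of mtry that was tried
model_rf$bestTune   # the mtry value used for the final model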

# confusion matrix for the 5-fold CV
confusionMatrix(model_rf,reference = as.character(data$group))

3.1.2 k-nearest neighbors

# Train the model
model_knn <- train(as.character(group) ~., data = data, method = "knn",
              trControl = train.control)
# Summarize the results
print(model_knn)
model_knn$finalModel
model_knn$resample # accuracy per fold (resample)

Q11: What is the optimal number of nearest neighbors?
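
As for the random forest, the tuning results can be inspected directly from the train object:

model_knn$results   # accuracy for each value of k that was tried
model_knn$bestTune  # the k value used for the final model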

# confusion matrix for the 5-fold CV
confusionMatrix(model_knn,reference = as.character(data$group))

Q12: Repeat the analyses of 3.1.1 and 3.1.2 but change the number of folds (the “number” argument of trainControl defined in 3.1) and compare the accuracy of the models; a sketch is given below.
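
A minimal sketch of what Q12 asks for, assuming we try 10 folds instead of 5 (the value 10 is only an example; the knn model can be re-trained in the same way):

train.control10 <- trainControl(method = "cv", number = 10)
model_rf10 <- train(as.character(group) ~ ., data = data, method = "rf",
                    trControl = train.control10)
print(model_rf10)  # compare the accuracy with the 5-fold model of 3.1.1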

3.2 Repeated k-fold cross-validation

The k-fold cross-validation can be improved by repeating it a certain number of times. In the following section we will investigate how this impacts the accuracy of our models. To limit the computational time, we chose here to repeat the cross-validation 3 times: we will hence create 3 times 5 folds, and each fold will be used once as the test set while the others are merged into the training set.

Note: You can change the number of folds and the number of repetitions, but the computational time will increase.

train.control <- trainControl(method = "repeatedcv", 
                              number = 5, repeats = 3)

3.2.1 Random Forest

model_rf3 <- train(as.character(group) ~., data = data, method = "rf",
               trControl = train.control)
# Summarize the results
print(model_rf3)
model_rf3$finalModel
model_rf3$resample # accuracy per fold and repetition
# confusion matrix for the 5-fold CV with 3 repetitions
confusionMatrix(model_rf3,reference = as.character(data$group))

3.2.2 k-nearest neighbors

model_knn3 <- train(as.character(group) ~., data = data, method = "knn",
                   trControl = train.control)
# Summarize the results
print(model_knn3)
model_knn3$finalModel
model_knn3$resample # accuracy per fold and repetition
# confusion matrix for the 5-fold CV with 3 repetitions
confusionMatrix(model_knn3,reference = as.character(data$group))

Q13: Compare these models (accuracy and optimal parameters) to the ones of the previous section (3.1.1 and 3.1.2). What can you conclude?

3.3 Leave-one-out

We will now investigate the impact on the accuracy of our predictive model if we do a “leave-one-out” cross-validation (LOOCV). In this case, every sample is used once as the test set while all the other samples form the training set.

train.control <- trainControl(method = "LOOCV")

3.3.1 Random Forest

model_rf_olo <- train(as.character(group) ~., data = data, method = "rf",
               trControl = train.control)
# Summarize the results
print(model_rf_olo)
confusionMatrix(model_rf_olo,reference = as.character(data$group))

3.3.2 k-nearest neighbors

model_knn_olo <- train(as.character(group) ~., data = data, method = "knn",
               trControl = train.control)
# Summarize the results
print(model_knn_olo)
confusionMatrix(model_knn_olo,reference = as.character(data$group))

Q14: Reflect on all the different approaches you used in this computer module. What is the best approach to predict which Atopic Dermatitis patients will be difficult-to-treat (i.e. won’t respond to treatment)? Explain your answer.

Summary

In this module we have learned how to define training and test sets and how to build predictive models using two classification approaches (random forest and k-nearest neighbors). We also used different sorts of cross-validation and applied them to both classification approaches. Finally, we computed some quantitative quality indicators for these models (accuracy, sensitivity, specificity, PPV and NPV) using the confusion matrix.