In this computer module, you will learn:
- how to define training and test sets;
- how to build predictive models using two classification approaches (random forest and k-nearest neighbors);
- how to apply different sorts of cross-validation to these approaches;
- how to compute quantitative quality indicators (accuracy, sensitivity, specificity, PPV and NPV) from the confusion matrix.
Let’s first install the packages we will need in this computer module (if you haven’t done so before) and load them:
install.packages(c("caret","class","randomForest","e1071"))
library(caret)
library(class)
library(randomForest)
library(e1071)
Now that the packages are installed and loaded, we can define our working directory and import the data we will use. (If you work on your own computer/laptop, you will need to set the working directory yourself.)
# First, set your working directory! The working directory is the directory where your files are located.
# Click Session >> Set Working Directory >> Choose Directory
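If you prefer to do this in code, you can call setwd() instead of using the menu; the path below is only a placeholder, replace it with the folder that contains your files.
setwd("~/my_course_folder") # hypothetical path, adjust to your own machine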
Q1a: Import the data “data_AD.csv” using the function read.csv2. Don’t forget to specify the decimal (“,”) and separator (“;”) characters, and to set header to TRUE.
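A possible call (assuming the file is in your working directory) could look like this; note that read.csv2 already uses “,” and “;” as its defaults, but we state them explicitly here:
data <- read.csv2("data_AD.csv", dec = ",", sep = ";", header = TRUE)
str(data) # quick check that the columns were imported correctly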
This dataset is from a recently published manuscript by Bakker et al. (Clinical & Experimental Allergy 2019). The aim of the study was to identify immunological biomarkers (proteins) in the blood of Atopic Dermatitis patients to predict which patients will be difficult-to-treat (i.e. won’t respond to treatment). Blood samples were taken before treatment and were analysed once the outcome of the treatment was known (group = 2: difficult-to-treat, group = 1: controlled disease). A total of 143 markers were measured using Luminex technology.
Q1b: Formulate a research question.
Q1c: Have a look at the data. Describe 5 of the proteins that have been measured and look up online what their function is.
We can now create the training and test sets that will be used in our classification model.
Q2: Why do we need to create a training and a test set?
Q3: Which fraction of patients should we include in the training set? Replace the “p=0.50” with the corresponding value.
train_nb <- createDataPartition(
y = data$group, ## the outcome data are needed
p = 0.50, ## The fraction of data in the training set
list = FALSE
)
# train_nb contains the row numbers kept for the training set
## create the train and test sets
train.data <- data[train_nb,]
test.data <- data[-train_nb,]
## extract the reference for accuracy
train.category <- as.character(train.data$group)
test.category <- as.character(test.data$group)
## remove the reference from both train and test sets
train.data <- train.data[,3:ncol(train.data)]
test.data <- test.data[,3:ncol(test.data)]
Let’s first use a random forest approach to build the model using the training set. Here we propose to use 500 trees. You can try different values to see the impact on the accuracy. Note that more trees will take more computational time.
Q4a: Explain in your own words what a random forest approach is.
model <- randomForest(x = train.data, y = as.factor(train.category),
                      importance = TRUE, ntree = 500,
                      proximity = FALSE)
We can print the output of the model with the following commands.
model
Q4b: Run the following line. What is reported by running model$votes? Explain in your own words.
model$votes # probability for classification in each group
Q4c: Run the following line. What is reported? Look up in the help or online what MeanDecreaseAccuracy and MeanDecreaseGini are.
model$importance # importance of each marker
We can now use this model on the test set and compute some quantitative quality indicators such as Sensitivity and Specificity.
# Use the model to predict on the test set.
prediction <- predict(model, newdata=test.data)
proba <- predict(model, newdata=test.data, type="prob") # class probabilities per group
# Calculate the overall accuracy.
correct <- prediction == test.category
print(paste("% of predicted classifications correct", mean(correct)*100))
conf_matrix <- table(prediction, test.category) # predictions in rows, reference in columns (the orientation caret expects)
conf_matrix
sensitivity(conf_matrix)
specificity(conf_matrix)
posPredValue(conf_matrix)
negPredValue(conf_matrix)
## This function divides the number of correct predictions by the total number of
## predictions, which tells us how accurate the model is.
accuracy <- function(x){sum(diag(x)) / sum(x) * 100}
accuracy(conf_matrix)
Q5: What do the sensitivity and specificity tell you? How do you interpret these numbers?
We will now repeat the same steps, but using a k-nearest neighbors approach. With the function knn we can directly pass the training and test sets, and the output of the model will be the prediction on the test set.
Q6: Explain in your own words what a k-nearest neighbors approach is.
pred_knn <- knn(train.data, test.data, cl = train.category, k = 6)
## Create the confusion matrix by comparing pred_knn with test.category
conf_matrix <- table(pred_knn, test.category)
conf_matrix
## Reuse the accuracy function defined above.
accuracy(conf_matrix)
Q7: What happens to the accuracy if you change the number of nearest neighbors (k)? How do you choose the best value for “k”?
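If you want to explore this systematically, here is a minimal sketch that loops over a range of k values (the range 1:15 is an arbitrary choice) and prints the overall accuracy for each:
for (k in 1:15) {
  pred_k <- knn(train.data, test.data, cl = train.category, k = k)
  print(paste("k =", k, ": accuracy =", round(mean(pred_k == test.category) * 100, 1)))
}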
Q8: Determine the Sensitivity, Specificity, PPV and NPV for this model. How do they compare to the ones using the random forest approach? Which approach gives you the best predictive model?
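Since conf_matrix for the kNN model already has the predictions in the rows and the reference in the columns (the orientation the caret functions expect), you can reuse the same functions as above, for example:
sensitivity(conf_matrix)
specificity(conf_matrix)
posPredValue(conf_matrix)
negPredValue(conf_matrix)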
As explained during the lecture, a cross-validation approach has several advantages, such as helping to define the best parameters for the model (the number of variables randomly sampled as candidates at each split, mtry, or the number of nearest neighbors, k).
Q9: Explain in your own words the principle of cross-validation. Why is it useful?
In this case, we will create 5 sets of data. Each one will be used once as the test set while the others are merged and used as the training set.
# Define training control
set.seed(123)
train.control <- trainControl(method = "cv", number = 5)
# Train the model (the dot "." represents all the other columns of data)
model_rf <- train(as.character(group) ~., data = data, method = "rf",
trControl = train.control)
# Summarize the results
print(model_rf)
model_rf$finalModel
model_rf$resample # accuracy per resample (fold)
Q10: What is the optimal number of variables randomly sampled?
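Hint: the value selected by the cross-validation is stored in the fitted object:
model_rf$bestTune # mtry retained by the cross-validation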
# confusion matrix for the 5-fold CV
confusionMatrix(model_rf,reference = as.character(data$group))
# Train the model
model_knn <- train(as.character(group) ~., data = data, method = "knn",
trControl = train.control)
# Summarize the results
print(model_knn)
model_knn$finalModel
model_knn$resample # accuracy per resample (fold)
Q11: What is the optimal number of nearest neighbors?
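As for the random forest, the selected value is stored in the fitted object:
model_knn$bestTune # k retained by the cross-validation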
# confusion matrix for the 5-fold CV
confusionMatrix(model_knn,reference = as.character(data$group))
Q12: Repeat the analyses of 3.1.1 and 3.1.2, but change the number of train/test splits (the argument number) in the “train.control” object defined in 3.1 and compare the accuracy of the models.
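A minimal sketch, using 10 train/test splits instead of 5 (the value 10 and the object name model_rf10 are arbitrary choices):
train.control <- trainControl(method = "cv", number = 10)
model_rf10 <- train(as.character(group) ~., data = data, method = "rf",
trControl = train.control)
print(model_rf10)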
The k-fold cross-validation can be improved by repeating it a certain number of times. We will investigate in the following section how this can impact the accuracy of our models. To limit computational time, we choose here to repeat the cross-validation 3 times: we will hence create 3 times 5 sets of data, and each one will be used once as the test set while the others are merged into the training set.
Note: You can change the number of cross-validation folds and the number of repetitions, but the computational time will increase.
train.control <- trainControl(method = "repeatedcv",
number = 5, repeats = 3)
model_rf3 <- train(as.character(group) ~., data = data, method = "rf",
trControl = train.control)
# Summarize the results
print(model_rf3)
model_rf3$finalModel
model_rf3$resample # accuracy per fold and repetition
# confusion matrix for the 5-fold CV with 3 repetitions
confusionMatrix(model_rf3,reference = as.character(data$group))
model_knn3 <- train(as.character(group) ~., data = data, method = "knn",
trControl = train.control)
# Summarize the results
print(model_knn3)
model_knn3$finalModel
model_knn3$resample # accuracy per fold and repetition
# confusion matrix for the 5-fold CV with 3 repetitions
confusionMatrix(model_knn3,reference = as.character(data$group))
Q13: Compare these models (accuracy and optimal parameters) to the ones of the previous section (3.1.1 and 3.1.2). What can you conclude?
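One convenient way to compare models that were trained with the same train.control object is caret’s resamples() function; a minimal sketch (the object name results is an arbitrary choice):
results <- resamples(list(rf = model_rf3, knn = model_knn3))
summary(results) # accuracy and kappa for each model across all resamples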
We will now investigate the impact on the accuracy of our predictive models when we use a “leave-one-out” cross-validation (LOOCV). In this case, every sample is used once as the test set.
train.control <- trainControl(method = "LOOCV")
model_rf_olo <- train(as.character(group) ~., data = data, method = "rf",
trControl = train.control)
# Summarize the results
print(model_rf_olo)
confusionMatrix(model_rf_olo,reference = as.character(data$group))
model_knn_olo <- train(as.character(group) ~., data = data, method = "knn",
trControl = train.control)
# Summarize the results
print(model_knn_olo)
confusionMatrix(model_knn_olo,reference = as.character(data$group))
Q14: Reflect on all the different approaches you used in this computer module. What is the best approach to predict which Atopic Dermatitis patients will be difficult-to-treat (i.e. won’t respond to treatment)? Explain your answer.
In this module we have learned how to define training and test sets and how to build predictive models using two classification approaches (random forest and k-nearest neighbors). We also used different sorts of cross-validation and applied them to both classification approaches. Finally, we computed some quantitative quality indicators for these models (accuracy, sensitivity, specificity, PPV and NPV) using the confusion matrix.