COO2 - Introduction R - importing, selecting, ordering, and missing data

In the second computer module, you will learn how to import files into R. In addition, you will learn how to select and order data within objects.

1 Importing data into R

We can import files with data into RStudio and start working with them. Common file types are text files, csv (comma-separated values) files, tab delimited files or SPSS files. The first row of these files generally contains the column names of the data frame in R. Data files can be imported through the RStudio interface, as explained in the lecture. However, we recommend importing a file through a line of code in your script. For this purpose, you need to specify the file path that indicates the folder in which the file is located on your computer. You can do this by explicitly specifying the full file path each time you use a function to import a file, or by setting the ‘working directory’ (i.e. the folder on your computer in which the file is stored) at the very top of your script. If you set the working directory, you tell R in which folder it has to look for (and save) files by default. This helps you to organize your project by working from one folder on your computer in which the original data files are stored and the output generated in your project can be saved (this will be explained later in this course).

1.1 Setting the working directory

To practise setting the working directory, first create a new folder on your computer (for instance called Intro_R_COO2), in which you will store the datasets required for this course. Next, the working directory can be set by selecting ‘Session’ >> ‘Set working directory’ >> ‘Choose directory’ in the menu bar at the top of RStudio, as shown in the lecture. After you select the desired folder, a command similar to the following appears in your Console, specifying the path to the selected folder:

setwd("C:/Users/[yourname]/Documents/Intro_R_COO2")

You can copy this code from the Console to your R script, so the working directory can be easily specified the next time you run your code.

If you run the function getwd(), RStudio will show you in the Console the working directory in which you are currently working.

getwd()
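
As a minimal sketch of how setwd() and getwd() work together, the following uses a temporary folder as a stand-in for your own course folder. The names old_wd, course_dir and example.txt are illustrative assumptions and not part of the course material.

```r
# Remember the current working directory so it can be restored afterwards
old_wd <- getwd()

# A temporary folder stands in for a real course folder (assumption)
course_dir <- file.path(tempdir(), "Intro_R_COO2")
dir.create(course_dir, showWarnings = FALSE)

setwd(course_dir)  # R now reads and writes files in this folder by default
getwd()            # prints the path of the new working directory

# A file name without a path is resolved relative to the working directory
writeLines("test", "example.txt")
file_found <- file.exists(file.path(course_dir, "example.txt"))
file_found

setwd(old_wd)      # restore the original working directory
```

In your own script you would normally call setwd() once, at the top, with the path to your course folder.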

We will use different datasets. Therefore, you first need to download the following files:

Save the files in the folder you just created. You can do this by right-clicking on the link and picking ‘Save link as …’ to directly save the file. Note: opening the file and then saving it (e.g. using Excel) may alter the contents! The R scripts that you have created can be saved in this folder as well. Use a proper name for your scripts to prevent chaos. Use for example the date (yymmdd) + file name + version (e.g. 210907_RscriptCOO2_version1).

1.2 Reading a table file with read.table()

The general R function to import data files is the read.table() function, which reads a file and imports it as a data frame. For certain file types, variations of this function are also available in R, for example the read.csv() function to import a csv file. ‘csv’ indicates that the values in the document are comma-separated. If you open the original .txt file, you can see that each line contains a row of values, which can be numbers or letters, and each value is separated by a comma. In addition, read.csv() by default uses the first line in the text file as column names for the data frame in RStudio. Let’s first use the manual way to import data as shown in the lecture: File > Import Dataset > From Text (base) > Import.

1. Import the Mammogram dataset manually. Note: unfortunately, newer versions of RStudio give a warning about default.stringsAsFactors. This is a bug in RStudio, but does not affect this COO. You can ignore this warning.

In the Console panel, you can see the command RStudio used to import the data. As you can see, the function RStudio used is read.csv(). The only argument in the function is the path to your text file, including the name and extension (.txt) of your file. R also automatically assigned the data to an object with the name of the file (Mammogram <- read.csv()).

Now, let’s import another dataset by using the more general function read.table() in your script. The advantages of scripting instead of manually importing are that you can find the command again when you need to, and you automatically open the data the next time you run this script.

2. Import the Heart_disease data by copying and editing the commands from your Console.
Change the object name, the name of the .txt file and change the function to read.table().

When you view your new data (either by clicking on the object in the Environment panel or using the View() function), you will see that it is loaded as a data frame with one column, in which each cell contains multiple values separated by commas. read.table() by default uses whitespace (spaces or tabs) in the text file to separate columns. Since our data is separated by commas instead of whitespace, the data file is not imported properly. In addition, the first row of the file is not used for the column names. Therefore, we need to define some new arguments (as shown in the lecture).

3. Adapt the following commands to correctly load the Heart_disease data:

Heart_disease <- read.table("Heart_disease.txt",
                            header = TRUE,
                            sep = ",")
View(Heart_disease)

To view the dataset, you can also click on the object Heart_disease in the Working and History panel under the Environment tab.
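
The effect of the header and sep arguments can be sketched with a small made-up file. The file contents and the object names tmp, wrong and right below are assumptions for illustration; this is not the Heart_disease data.

```r
# Write a small comma-separated toy file to a temporary location
tmp <- tempfile(fileext = ".txt")
writeLines(c("age,sex,chol",
             "63,1,233",
             "41,0,204"), tmp)

# Default settings: columns are split on whitespace and no header is detected,
# so every line ends up as one value in a single column
wrong <- read.table(tmp)
dim(wrong)

# With header and separator specified, the file is split into proper columns
right <- read.table(tmp, header = TRUE, sep = ",")
dim(right)
names(right)
```

Comparing dim(wrong) with dim(right) is a quick check of whether an import went as intended.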

The next dataset we are going to import is the Countries dataset. This dataset does not only contain numerical data (the other datasets did), but also text data, such as country names.

4. What R data type is used to store text?

As explained before, many arguments in a function are set to default settings. One of the default settings of the read.table() function is that strings (data type ‘character’) are converted to factors in older R versions (before R 4.0.0).

5. What is the difference between the data types ‘character’ and ‘factor’?
6. Suppose a column in our data frame contains seven countries and the data type is factor. How many levels will this column have?
7. What will happen if we want to add another country to this column?

To keep our options open, we will import character strings as character data and not as factors. Therefore, we make sure to set the argument stringsAsFactors to FALSE. We can call our data frame df.countries. ‘df’ is short for data frame, which helps us remember that the object with this name is a data frame.

8. Import the dataset with the following commands:

df.countries <- read.table("Countries.txt",
                         header = TRUE,
                         sep = ",",
                         stringsAsFactors = FALSE)
df.countries
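
To see what the stringsAsFactors argument changes, the sketch below imports the same made-up file twice. The file contents and the object names as_char and as_factor are illustrative assumptions, not course data.

```r
# Toy file with one text column and one numeric column
tmp <- tempfile(fileext = ".txt")
writeLines(c("Fruit,Price",
             "apple,1.20",
             "pear,0.90",
             "apple,1.10"), tmp)

# Same file, imported with and without conversion of strings to factors
as_char   <- read.table(tmp, header = TRUE, sep = ",", stringsAsFactors = FALSE)
as_factor <- read.table(tmp, header = TRUE, sep = ",", stringsAsFactors = TRUE)

class(as_char$Fruit)    # text column kept as character data
class(as_factor$Fruit)  # text column converted to a factor
```

Setting the argument explicitly makes your script behave the same in R versions before and after 4.0.0.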

1.3 Reading a tab delimited file

Now, take a look at the document “CountriesTD.txt” and try to find what looks different in this document compared to the document “Countries”. You can see that the text is not separated anymore by commas but now by tabs. We can still use the same functions as before but we must change the argument sep to "\t" which means that this file is tab delimited.

countriestd <- read.table("CountriesTD.txt", 
                          header = TRUE, 
                          sep = "\t", 
                          stringsAsFactors = FALSE) 
head(countriestd)

Notice how this gives the exact same data frame as before. For tab delimited files, a dedicated function is also available that already has the right default settings (i.e., header and separator) for this file type: the read.delim() function.

countriestd2 <- read.delim("CountriesTD.txt", 
                           stringsAsFactors = FALSE)
countriestd2
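
As a rough check that both approaches give the same result, the sketch below reads one made-up tab delimited file with each function. The file contents and the names via_table and via_delim are assumptions for illustration.

```r
# Toy tab delimited file; "\t" inside the string is written as a real tab
tmp <- tempfile(fileext = ".txt")
writeLines(c("Name\tValue",
             "a\t1",
             "b\t2"), tmp)

# read.table() needs header and sep to be specified ...
via_table <- read.table(tmp, header = TRUE, sep = "\t", stringsAsFactors = FALSE)

# ... while read.delim() uses header = TRUE and sep = "\t" by default
via_delim <- read.delim(tmp, stringsAsFactors = FALSE)

all.equal(via_table, via_delim)  # compares the two data frames
```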

1.4 Looking at your data

There are different ways to get information on our data. With these functions we can do some quality control, which is very important.
We can view the data frame in our Console by running the following command in our script:

df.countries

We can see the whole data frame at once since this data frame is not too big. But this is not always the case. With bigger data frames, it is easier to use:

head(df.countries) # view the first six rows of the data frame

We can then check the class of the df.countries object:

class(df.countries)

With the function dim() we can see how many rows and columns there are in this data frame or we can see them separately using the nrow() and ncol() functions. The dim() function is different from the length() function used in the previous course (COO1) to determine the number of elements that are present in a vector/object. The length() function is typically applied to a one-dimensional vector or array and returns the number of elements. The dim() function, on the other hand, is applied to objects with multiple dimensions and gives you the number of rows and columns instead of the number of elements.

dim(df.countries) # the first being the number of rows, the second being the number of columns

nrow(df.countries)

ncol(df.countries)
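
The difference between dim() and length() can be illustrated with a small made-up data frame (toy is a hypothetical name). Note that length() applied to a data frame counts its columns.

```r
# Toy data frame with four rows and two columns
toy <- data.frame(x = 1:4, y = c("a", "b", "c", "d"))

dim(toy)       # number of rows, then number of columns
nrow(toy)      # rows only
ncol(toy)      # columns only
length(toy)    # for a data frame, length() returns the number of columns
length(toy$x)  # for a single column (a vector), it returns the number of elements
```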

Now we have an idea of how big the data set is. We can see the columns included in the data set:

names(df.countries)

The head() function only gives us the first six rows. We can also narrow it down to the exact number of rows we want to see:

head(df.countries, 3) # Shows the first three rows of the data frame

We can do the same with the last couple of rows by using the tail() function:

tail(df.countries, 2) # Shows the last two rows of the data frame

Another useful function to study your data frame is the str() function. It compactly displays the structure of your data object.

str(df.countries)

The summary() function is a powerful tool in R for quickly getting an overview of the key characteristics of your data. When applied to an object, such as a data frame or a numerical vector, the summary() function provides a summary of the data’s central tendency (i.e., mean, median), distribution (i.e., interquartile range, range) and extreme values (i.e., min, max). For example, if you have a data frame named Heart_disease, you can use the summary() function to generate a summary of its variables:

summary(Heart_disease)

The summary() function provides output such as mean, median, minimum, maximum and quartiles for numerical variables, as well as counts and levels for factor variables. This function is particularly useful for an initial exploration of your data and can help you quickly identify potential issues, missing values (NA), outliers, or trends.

2 Selecting data

To select data from any data structure, use the index operator []. In one-dimensional data, such as a vector, you provide one index. So, to select the third value of a vector, you use vector[3]. For two-dimensional data, if you want to select one value you need to indicate both the row and the column. Therefore, you need two indices. For example, dataframe[2, 5] selects the value from the second row, fifth column.

2.1 Selecting a subset of a vector

Use the following commands to create a vector and select some values from this vector:

a <- c(2, 3, 4, 5)
a

Select the third element of vector a:

a[3] # The third element of vector a.

Select the first and third elements:

a[c(1, 3)] # Returning the first and third elements of vector a.

You can also do calculations with selected elements in a vector:

a[1 + 3] # Returning the fourth element of vector a.
a[1] + a[3] # Adding the first to the third element, returning the sum.

9. Now, select the first two elements of vector a.

You can also select multiple values in a row:

b <- 10:1
b[2:4] # Returning the second to fourth element of vector b

You can also select a subset of a vector using logical values.

b[b > 3] # Returning values in the vector of b that are greater than 3

Note that the output is different if you don’t use the index operator [ ] properly.

b > 3 # Which values in the vector of b are greater than 3?

Instead of selecting a subset, you now ask R which values in vector b are greater than 3. The values 10 down to 4 are all greater than 3 and therefore return TRUE; the values 3, 2 and 1 are smaller than or equal to 3 and therefore return FALSE.
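
A logical vector like this can also be stored and reused. The sketch below is one way to do that; the name is_large is an illustrative assumption.

```r
b <- 10:1

is_large <- b > 3  # logical vector with one TRUE/FALSE per element of b
b[is_large]        # the values of b for which the condition is TRUE
sum(is_large)      # TRUE counts as 1, so this is the number of values above 3
which(is_large)    # the positions (indices) of those values in b
```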

Lastly, you can select by excluding values from your vector by using a minus: [-1] selects every value except the first one.

b[-1] # Returning all values except the first one

10. Select every value of b except the second and sixth.

We can assign new values to a specific subset that we selected. For example we can select the fifth element of vector b and replace it with 0, using the assignment operator <-.

b[5] <- 0

We can also replace larger parts.

b[6:10] <- 9

2.2 Selecting a subset of a matrix

As explained before, to select from a matrix with two dimensions you need to use two indices. You can also select a whole column or row. In that case, we have to specify whether we want to select either the rows or columns by carefully placing commas. The first index is for rows, the second for columns [rows, columns].

First, we make a new matrix with the following vectors:

vector.a <- c(1, 2, 3) # vector 1
vector.b <- c(4, 5, 6) # vector 2
vector.c <- c(7, 8, 9) # vector 3

11. Combine these vectors as columns of a new matrix, called matrix.a.

Now, you can select the first row:

matrix.a[1,] # select row 1 from matrix a

This shows us the first row of the matrix. We can also see both the first and second row by using c(1,2).

matrix.a[c(1,2),] # select rows 1 and 2 from matrix a

Put your indices after the comma to select columns:

matrix.a[,c(1,3)] # select columns 1 and 3 from matrix a

12. Now, select the value in the third row, second column of matrix.a.

13. Can you describe what this next line of code does to matrix.a?

matrix.a[2,1] <- 999

2.3 Selecting a subset of a data frame

There are multiple ways to select data from a data frame. You can use the index operator [], you can use the dollar sign $, or you can provide the name of a column as character data:

df.countries[4,] # selecting row 4

df.countries[,2] # selecting column 2

df.countries[,"Capital"] # selecting the information in the column called Capital

df.countries$Capital # selecting the information in the column called Capital

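Note that $Capital, [,2] and [,"Capital"] all return the information in the column as a vector (with character data for Capital), because the [ , ] operator simplifies a single selected column to a vector by default. Selecting a row, as in [4,], returns a data frame with one row. To keep a single column as a data frame with one column, use df.countries["Capital"] or df.countries[, "Capital", drop = FALSE].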
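
How the different access styles behave in base R can be checked on a small made-up data frame. The object toy and its values are assumptions for illustration; the same pattern applies to any data frame created with read.table().

```r
toy <- data.frame(Country = c("A", "B"),
                  Capital = c("X", "Y"),
                  stringsAsFactors = FALSE)

class(toy$Capital)                     # a character vector
class(toy[, "Capital"])                # also a vector: a single column is simplified
class(toy["Capital"])                  # a data frame with one column
class(toy[, "Capital", drop = FALSE])  # drop = FALSE keeps the data frame
```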

15. Use the class() function to return the class for each of the data frame access commands above.
One example is provided:

class(df.countries[4,])

We can also provide some statistics of our data frame with these selection commands. Some basic statistic functions are mean(), min(), and max().

16. Calculate the mean number of inhabitants in the df.countries data frame, using the summary function.
17. Calculate the minimum and maximum age in the Heart_disease data frame, using the dollar sign ($) to get suggestions for the column names.

We can also select data using logical operators, explained in the previous COO. The expressions created with logical operators are also called Boolean expressions. To get an overview of the countries with more than 12,000,000 inhabitants, you can use:

df.countries["Inhabitants"] > 12000000

This creates a Boolean expression that shows us that the second and sixth countries do not have more than 12,000,000 inhabitants (displayed as FALSE). We can also use this information to select data from another column. To do so, we use the Boolean expression we created to select rows. You can do this with the following commands:

many_inhabitants <- df.countries$Inhabitants > 12000000
df.countries[many_inhabitants, "Countries"]

We first created a vector containing the logical information [1] TRUE FALSE TRUE TRUE TRUE FALSE TRUE. Then, we wanted to produce the country names with many inhabitants, so we used "Countries" to select the column and our logical vector to select the rows. It then selected the information in the first, third, fourth, fifth and seventh row of the column Countries, because those were TRUE. It left out the names of the countries with fewer inhabitants, because those were FALSE.

18. Produce the country names of the countries with fewer than 12,000,000 inhabitants in a similar manner.

3 Sorting and ordering data

In this next bit, we will discuss the sort() and order() functions. The sort function returns the input data from lowest to highest value or in alphabetically ascending order. The order function returns the index order in which the data is present in the object from lowest to highest value or in alphabetically ascending order. So for alphabet <- c("c", "a", "b"), sort(alphabet) will return: "a" "b" "c" and order(alphabet) would return 2 3 1, because the second value ("a") is the lowest in the alphabet, then the third value ("b"), and then the first value ("c").
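
The alphabet example from the paragraph above can be run directly:

```r
alphabet <- c("c", "a", "b")

sort(alphabet)             # the values themselves, in ascending order
order(alphabet)            # the positions of the values, in ascending order
alphabet[order(alphabet)]  # using those positions as indices reproduces sort()
```

The last line shows why order() is the function to use for data frames: the indices it returns can be used to rearrange whole rows, not just a single column.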

So, we can (temporarily) sort the countries in our data frame with:

sort(df.countries$Countries)

sort(df.countries$Countries, decreasing = TRUE)

As you can see in the df.countries data frame, the actual order of the countries in the column ‘Countries’ has not changed. The sort function has just returned the column in alphabetical order in the Console panel.

We can use the order function to create a new data frame with our data sorted alphabetically on country name:

countries_sorted1 <- df.countries[order(df.countries$Countries),] # creating a new data frame with alphabetically ordered countries 
countries_sorted1

The order we used is the following:

order(df.countries$Countries)

We use this order to indicate the rows to be used for our new data frame. So, our new data frame is constructed as follows: for the first row of countries_sorted1, the second row of the df.countries data frame is used (containing Belgium), the second row is then filled with the third row of df.countries (containing France), and so on.

19. Now, create a data frame called countries_sorted2, which is sorted on number of inhabitants.

In the lecture, the iris data frame was sorted on two values. To try this yourself:

20. Import the iris_dataset.csv as a data frame.

The following commands were used in the lecture:

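volgorde <- order(iris_dataset$sepal_width) # row indices in increasing order of sepal width ('volgorde' is Dutch for 'order')
iris_sorted <- iris_dataset[volgorde,] # rearrange the rows using these indices
iris_sorted <- iris_dataset[order(iris_dataset$sepal_width),] # the same, in one step
iris_sorted <- iris_dataset[order(iris_dataset$flower, iris_dataset$sepal_width),] # sort on flower first, then on sepal width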

iris_sorted <- iris_dataset[order(iris_dataset$flower, iris_dataset$sepal_width,
                                  decreasing = c(FALSE, TRUE), method = "radix"),]

The final iris_sorted data frame contains the iris_dataset data sorted first in increasing order on flower name, and then in decreasing order on sepal width.
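
The same pattern can be tried on a small made-up data frame, where the result is easy to verify by eye. The object toy, its columns and the name idx are assumptions for illustration, not the iris data.

```r
toy <- data.frame(group = c("b", "a", "b", "a"),
                  value = c(1, 5, 3, 2))

# Sort on group in increasing order, then on value in decreasing order;
# a separate direction per column requires method = "radix"
idx <- order(toy$group, toy$value,
             decreasing = c(FALSE, TRUE), method = "radix")
idx        # the row numbers of toy in the new order
toy[idx, ] # the rearranged data frame
```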

Now, let’s try a similar sorting for our Heart_disease data frame.

21. Create a new data frame in which you sort the Heart_disease data first in decreasing order on sex, and then in increasing order on age.

4 Missing values

It sometimes happens that you use a data set with missing information. Missing data will be displayed with an NA value (not available). Missing values can be difficult to work with because they can hinder certain calculations. For example, if you calculate the mean of a column that contains an NA, this will return NA (instead of the mean). We must be cautious when drawing conclusions if our data frame contains NAs and, for instance, check whether the data has been imported well. We can also remove observations with NA values.
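
A minimal sketch of this behaviour, using a made-up vector x (an assumption for illustration):

```r
x <- c(2.5, NA, 3.5)

is.na(x)               # marks which elements are missing
mean(x)                # returns NA, because one value is missing
mean(x, na.rm = TRUE)  # ignores the NA and averages the remaining values
```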

Let’s work with the iris data set with missing values from the lecture. We will use a new variable name for our incomplete dataset.

22. Import the iris_data_incomplete.txt as a data frame called iris_data_incomplete.

23. Run the commands from the lecture:

summary(iris_data_incomplete)
which(is.na(iris_data_incomplete$sepal_width)) # which row contains NA?
iris_clean <- na.exclude(iris_data_incomplete) # removes whole row containing NA
mean(iris_data_incomplete$sepal_width) # includes the NA value
mean(iris_data_incomplete$sepal_width, na.rm = TRUE) # excludes the NA value for this calculation

Note that the na.exclude() function does not change the original data frame. It returns a copy without the rows containing NA, which is only kept if you assign it to a new object. This new object contains the original data set minus the row containing NA, so it has one observation (row) less than the original data set.
You can also check the number of NA values with the sum(is.na()) function.
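
Both points can be sketched on a small made-up data frame. The names toy and toy_clean and the values are assumptions for illustration, not the iris data.

```r
toy <- data.frame(id = 1:3, width = c(3.1, NA, 2.9))

toy_clean <- na.exclude(toy)  # copy of toy without the row that contains NA

nrow(toy)                # the original data frame still has all its rows
nrow(toy_clean)          # the copy has one row less
sum(is.na(toy))          # number of NA values in the original
sum(is.na(toy_clean))    # number of NA values after cleaning
```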

24. Check the number of NA values in the iris_dataset, iris_data_incomplete and iris_clean data frames.

5 Saving Output and R script

Saving your R script and output in R can be very helpful for future use or sharing.

To save your data and output in the desired folder, it is important to set the working directory, as previously discussed. You can additionally check this with the getwd() function. If the working directory is not set, you need to provide the entire path to the folder in which you aim to save the files.
It is essential to manage your working directory properly to ensure that R can access the files it needs.

5.1 Write data frame to a file

The write.table() or write.csv() functions are used to save data frames or matrices as text files. This is helpful for sharing data with others or importing it to other software programmes. For example:

mydata <- data.frame(Name = c("Alice", "Bob", "Charlie"),
                     Age = c(25, 30, 28))
write.csv(mydata, file = "mydata.csv", row.names = FALSE)

This saves the mydata data frame as a CSV file named mydata.csv. You can import your .csv file for example in Excel via ‘import data’.
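
To check what was written, you can read the file back into R. The sketch below writes to a temporary folder instead of the working directory; the names out_file and mydata_back are illustrative assumptions.

```r
mydata <- data.frame(Name = c("Alice", "Bob", "Charlie"),
                     Age = c(25, 30, 28),
                     stringsAsFactors = FALSE)

# Write to a temporary folder (in your project you would use the working directory)
out_file <- file.path(tempdir(), "mydata.csv")
write.csv(mydata, file = out_file, row.names = FALSE)

readLines(out_file)  # the raw text lines of the file, starting with the header

# Import the file again and compare it with the original data frame
mydata_back <- read.csv(out_file, stringsAsFactors = FALSE)
all.equal(mydata, mydata_back)
```

With row.names = FALSE the file contains only the data columns, so no extra column of row numbers appears when the file is read back.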

5.2 Saving R script

R scripts, which you write in RStudio’s script editor, can be saved with an .R extension. Saving your R script is important for reproducibility and sharing with others. To save your R script, click the ‘disk icon’ at the top of the script window, or click 'File' >> 'Save As'. The keyboard shortcut for saving is Ctrl + S.

6 Extra assignment

Now, take a look at the Grades.txt file. Use scripting to answer the following questions.
A1. Import the Grades.txt file as a data frame and check whether it worked correctly.
A2. How many rows and columns does the data frame have?
A3. What are the column names?
A4. What does the first part of this data frame look like?
A5. Provide a summary of the data.
A6. Are there errors in your data?
A7. Calculate the number of NA values.
A8. Calculate the mean of each exam. Exclude the NA values in your calculation.
A9. Make a new data frame that excludes the rows that contain NA values.
A10. For both exams, create a logical vector that shows which grades are bigger than 10.
A11. Use the vector that contains a ‘TRUE’ to return the grade that was bigger than 10.
A12. For both exams, create a logical vector that shows which grades are lower than 0.
A13. Use the vector that contains a ‘TRUE’ to return the grade that was lower than 0.
Since grades can vary from 0 to 10, we are quite certain that the grade which is higher than 10 contains a typo and should be 7.5, and that the grade lower than 0 accidentally got a minus in front of it. You check this on the original exams, and indeed the grades should be 7.5 and 6.5 respectively.
A14. Check whether the grades you want to change are indeed in the 10th and 9th row of your data frame. Then, change the grades with the following commands, replacing grades_complete with the name of your own data frame if it is different:

grades_complete[10, "Exam1"] <- 7.5
grades_complete[9, "Exam2"] <- 6.5

A15. Create a vector of the grades in Exam1 sorted with the lowest grade first.
A16. Create a vector of the grades in Exam2 sorted with the highest grade first.
A17. Create a new, sorted data frame which first sorts on the first exam and then on the second exam.
A18. Calculate the average grade again for both exams.
A19. What is the highest grade for the first exam?
A20. Create a logical vector that shows which row(s) contain the highest grade.
Hint: use the code we used before for the number of inhabitants as an example of creating a logical vector:
many_inhabitants <- df.countries$Inhabitants > 12000000
A21. Now, use this vector to get the name of the person with the highest grade.
A22. Now, produce the name of the person with the lowest grade for the second exam.