In the second computer module, you will learn how to import files into R. In addition, you will learn how to select and order data within objects.
We can import files with data into RStudio and start working with them. Common file types are text files, csv (comma-separated values) files, tab-delimited files and SPSS files. The first row of these file types generally corresponds to the column names of the data frame in R. Data files can be imported through the RStudio interface, as explained in the lecture. However, it is preferable to import a file through a line of code in your script. For this purpose, you need to specify the file path that indicates the folder in which the file is located on your computer. You can do this by explicitly specifying the full file path each time you use a function to import a file, or by setting the ‘working directory’ (i.e. the folder on your computer in which the file is stored) at the very top of your script. If you set the working directory, you tell R in which folder it has to look for (and save) files by default. This helps you to organize your project by working from one folder on your computer in which the original data files are stored and the output generated in your project can be saved (this will be explained later in this course).
To practise setting the working directory, you first need to create a new folder on your computer (for instance called Intro_R_COO2), in which you will store the datasets required for this course. Next, the working directory can be set by selecting ‘Session’ >> ‘Set working directory’ >> ‘Choose directory’ in the menu bar at the top of RStudio, as shown in the lecture. After you select the desired folder, code similar to the following appears in your Console, specifying the path to the selected folder:
setwd("C:/Users/[yourname]/Documents/Intro_R_COO2")
You can copy this code from the Console to your R script, so the working directory can be easily specified the next time you run your code.
If you run the function getwd(), RStudio will show you in the Console the working directory in which you are currently working.
getwd()
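The difference between giving a full path and relying on the working directory can be sketched as follows. This is only an illustration: it uses a temporary folder and a hypothetical file called example.txt (both assumptions, not part of the course material) so that it runs on any computer.

```r
# Create a temporary folder that stands in for the course folder (hypothetical)
demo_dir <- tempfile("Intro_R_demo")
dir.create(demo_dir)

# Write a small text file into that folder (hypothetical file name)
writeLines(c("x,y", "1,2"), file.path(demo_dir, "example.txt"))

# Option 1: specify the full path every time
full_path <- file.path(demo_dir, "example.txt")
exists_full <- file.exists(full_path)

# Option 2: set the working directory once, then use the file name only
old_wd <- setwd(demo_dir)   # setwd() returns the previous working directory
exists_relative <- file.exists("example.txt")
setwd(old_wd)               # restore the original working directory

exists_full
exists_relative
```

Both checks find the same file; the second one only works because the working directory points at the folder that contains it.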
We will use different datasets. Therefore, you first need to download the following files:
Save the files in the folder you just created. You can do this by right-clicking on the link and picking ‘Save link as …’ to directly save the file. Note: opening the file and then saving it (e.g. using Excel) may alter the contents! The R scripts that you create can be saved in this folder as well. Use a proper name for your scripts to prevent chaos, for example the date (yymmdd) + file name + version (e.g. 210907_RscriptCOO2_version1).
The general R function to import data files is the read.table() function, which reads a file and imports it as a data frame. For certain file types, variations of this function are also available in R, for example the read.csv() function to import a csv file.
‘csv’ indicates that the values in the document are comma-separated. If you open the original .txt file, you can see that each line contains a row of values, which can be numbers or letters, and each value is separated by a comma. In addition, read.csv() by default uses the first line in the text file as column names for the data frame in R.
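As a small illustration of these defaults, the sketch below writes a made-up comma-separated text file to a temporary location and reads it back with read.csv(). The file contents and the object name demo are assumptions for the example only.

```r
# Write a small comma-separated text file to a temporary location (made-up data)
csv_file <- tempfile(fileext = ".txt")
writeLines(c("id,age,group",
             "1,54,A",
             "2,61,B"), csv_file)

# read.csv() uses header = TRUE and sep = "," by default
demo <- read.csv(csv_file)

names(demo)  # column names are taken from the first line of the file
nrow(demo)   # the header line is not counted as a data row
```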
Let’s first use the manual way to import data as shown in the lecture:
File > Import Dataset > From Text (base) > Import.
1. Import the Mammogram dataset manually. Note: unfortunately, newer versions of RStudio give a warning about default.stringsAsFactors. This is a bug in RStudio, but does not affect this COO. You can ignore this warning.
In the Console panel, you can see the command RStudio used to import the data. As you can see, the function RStudio used is read.csv(). The only argument in the function is the path to your text file, including the name and extension (.txt) of your file. R automatically assigned the data to an object with the name of the file (Mammogram <- read.csv()).
Now, let’s import another dataset by using the more general function read.table() in your script. The advantages of scripting instead of manually importing are that you can find the command again when you need to, and that the data is imported automatically the next time you run this script.
2. Import the Heart_disease data by copying and editing the commands from your Console. Change the object name and the name of the .txt file, and change the function to read.table().
When you view your new data (either by clicking on the object in the Environment panel or using the View() function), you see that it is loaded as a data frame with one column, with multiple values separated by commas in one cell. read.table() by default uses white space (spaces or tabs) in the text file to separate columns. Since our data is separated by commas instead of spaces, the data file is not imported properly. In addition, the first row of the file is not used for the column names. Therefore we need to define some additional arguments (as shown in the lecture).
3. Adapt the following commands to correctly load the Heart_disease data:
Heart_disease <- read.table("Heart_disease.txt",
header = TRUE,
sep = ",")
View(Heart_disease)
To view the dataset, you can also click on the object Heart_disease in the Working and History panel under the Environment tab.
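The effect of the header and sep arguments can be checked on a small made-up file. The sketch below uses hypothetical column names and values, not the actual Heart_disease data.

```r
# A small comma-separated file with made-up contents
txt_file <- tempfile(fileext = ".txt")
writeLines(c("age,chol", "63,233", "67,286"), txt_file)

# Default settings: white space as separator, first line treated as data
wrong <- read.table(txt_file)
dim(wrong)   # 3 rows and 1 column: every line ends up in a single cell

# Arguments matched to the layout of the file
right <- read.table(txt_file, header = TRUE, sep = ",")
dim(right)   # 2 rows and 2 columns
```

Comparing dim() before and after is a quick way to confirm that an import worked as intended.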
The next dataset we are going to import is the Countries dataset. This dataset does not only contain numerical data (the other datasets did), but also text data, such as country names.
4. What R data type is used to store text?
As explained before, many arguments in a function are set to default settings. In older R versions (before R 4.0.0), one of the default settings of the read.table() function is that strings (data type ‘character’) are converted to factors.
5. What is the difference between the data types ‘character’ and ‘factor’?
6. Suppose a column in our data frame contains seven countries and the data type is factor. How many levels will this column have?
7. What will happen if we want to add another country to this column?
To keep our options open, we will import character strings as character data and not as factors. Therefore, we make sure to set the argument stringsAsFactors to FALSE. We can call our data frame df.countries. ‘df’ is short for data frame, which helps us remember that the object with this name is a data frame.
8. Import the dataset with the following commands:
df.countries <- read.table("Countries.txt",
header = TRUE,
sep = ",",
stringsAsFactors = FALSE)
df.countries
Now, take a look at the document “CountriesTD.txt” and try to find what looks different in this document compared to the document “Countries”. You can see that the text is no longer separated by commas but by tabs. We can still use the same function as before, but we must change the argument sep to "\t", which means that this file is tab delimited.
countriestd <- read.table("CountriesTD.txt",
header = TRUE,
sep = "\t",
stringsAsFactors = FALSE)
head(countriestd)
Notice how this gives exactly the same data frame as before. For tab-delimited files, a dedicated function with suitable default settings (i.e., header and separator) is also available: the read.delim() function.
countriestd2 <- read.delim("CountriesTD.txt",
stringsAsFactors = FALSE)
countriestd2
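To see that the two approaches lead to the same result, a small tab-delimited file with made-up contents can be read both ways. This is a sketch under the assumption of a simple file; read.delim() also differs from read.table() in a few other defaults (quote, fill and comment.char), which can matter for messier files.

```r
# A small tab-delimited file with made-up contents
tab_file <- tempfile(fileext = ".txt")
writeLines(c("Country\tInhabitants",
             "Aland\t100",
             "Bland\t250"), tab_file)

# General function with explicit arguments
via_table <- read.table(tab_file, header = TRUE, sep = "\t",
                        stringsAsFactors = FALSE)

# Dedicated function: header = TRUE and sep = "\t" are the defaults
via_delim <- read.delim(tab_file, stringsAsFactors = FALSE)

# Compare the two data frames
all.equal(via_table, via_delim)
```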
There are different ways to get information on our data. With these
functions we can do some quality control, which is very important.
We can view the data frame in our Console by running the following
command in our script:
df.countries
We can see the whole data frame at once since this data frame is not too big. But this is not always the case. With bigger data frames, it is easier to use:
head(df.countries) # view the first six rows of the data frame
We can then check the class of the df.countries object:
class(df.countries)
With the function dim() we can see how many rows and columns there are in this data frame, or we can see them separately using the nrow() and ncol() functions. The dim() function is different from the length() function used in the previous course (COO1) to determine the number of elements that are present in a vector/object. The length() function is typically applied to a one-dimensional vector or array and returns the number of elements. The dim() function, on the other hand, is applied to objects with multiple dimensions and gives you the number of rows and columns instead of the number of elements.
dim(df.countries) # the first value is the number of rows, the second is the number of columns
nrow(df.countries)
ncol(df.countries)
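One detail worth knowing is what length() does when it is applied to a data frame anyway. The sketch below uses a small made-up data frame (the name demo and its columns are assumptions for illustration).

```r
# A small made-up data frame with 4 rows and 3 columns
demo <- data.frame(id = 1:4,
                   score = c(7.5, 6, 8, 9),
                   passed = c(TRUE, TRUE, TRUE, TRUE))

dim(demo)            # number of rows, then number of columns
length(demo)         # for a data frame, length() counts the columns
length(demo$score)   # for a single column (a vector), it counts the elements
```

So for data frames, dim(), nrow() and ncol() are the safer choice when you want to know the size of the data.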
Now we have an idea of how big the data set is. We can see the columns included in the data set:
names(df.countries)
The head() function only gives us the first six rows. We can also narrow it down to the exact number of rows we want to see:
head(df.countries, 3) # Shows the first three rows of the data frame
We can do the same with the last couple of rows by using the tail() function:
tail(df.countries, 2) # Shows the last two rows of the data frame
Another useful function to study your data frame is the str() function. It compactly displays the structure of your data object.
str(df.countries)
The summary() function is a powerful tool in R for quickly getting an overview of the key characteristics of your data. When applied to an object, such as a data frame or a numerical vector, the summary() function provides a summary of the data’s central tendency (mean, median), spread (first and third quartiles) and extreme values (minimum, maximum). For example, if you have a data frame named Heart_disease, you can use the summary() function to generate a summary of its variables:
summary(Heart_disease)
The summary() function provides output such as mean, median, minimum, maximum and quartiles for numerical variables, as well as counts and levels for factor variables. This function is particularly useful for an initial exploration of your data and can help you quickly identify potential issues, missing values (NA), outliers, or trends.
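How summary() reports numerical variables, factors and missing values can be seen on a small made-up data frame. The column names and values below are assumptions for illustration, not taken from the Heart_disease data.

```r
# Made-up data: one numerical column with a missing value, one factor
demo <- data.frame(age = c(40, 55, NA, 63),
                   sex = factor(c("F", "M", "M", "F")))

summary(demo)                     # quartiles and mean for age, counts per level for sex

age_summary <- summary(demo$age)  # summary of a single numerical column
age_summary["Median"]             # median of the available values
age_summary["NA's"]               # number of missing values in the column
```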
To select data from any data type, use the index operator []. In one-dimensional data, such as a vector, you provide one index. So, to select the third value of a vector, you use vector[3]. For two-dimensional data, if you want to select one value you need to indicate both the row and the column. Therefore, you need two indices. dataframe[2, 5] selects the value from the second row, fifth column.
Use the following commands to create a vector and select some values from this vector:
a <- c(2, 3, 4, 5)
a
Select the third element of vector a:
a[3] # The third element of vector a.
Select the first and third elements:
a[c(1, 3)] # Returning the first and third elements of vector a.
You can also do calculations with selected elements in a vector:
a[1 + 3] # Returning the fourth element of a vector.
a[1] + a[3] # Adding the first to the third element, returning the sum.
9. Now, select the first two elements of vector a.
You can also select multiple values in a row:
b <- 10:1
b[2:4] # Returning the second to fourth element of vector b
You can also select a subset of a vector using logical values.
b[b > 3] # Returning values in the vector of b that are greater than 3
Note that the output is different if you don’t use the index operator [ ] properly.
b > 3 # Which values in the vector of b are greater than 3?
Instead of selecting a subset, you now ask R which values in vector b are greater than 3. The values 10 to 4 are all greater than 3 and therefore return TRUE; 1, 2 and 3 are smaller than or equal to 3 and therefore return FALSE.
Lastly, you can select by excluding values from your vector by using a minus sign: [-1] selects every value except the first one.
b[-1] # Returning all values except the first one
10. Select every value of b except the second and sixth.
We can assign new values to a specific subset that we selected. For example, we can select the fifth element of vector b and replace it with 0, using the assignment operator <-.
b[5] <- 0
We can also replace larger parts.
b[6:10] <- 9
As explained before, to select from a matrix with two dimensions you need to use two indices. You can also select a whole column or row. In that case, we have to specify whether we want to select rows or columns by carefully placing commas. The first index is for rows, the second for columns: [rows, columns].
First, we make a new matrix with the following vectors:
vector.a <- c(1, 2, 3) # vector 1
vector.b <- c(4, 5, 6) # vector 2
vector.c <- c(7, 8, 9) # vector 3
11. Combine these vectors as columns of a new matrix, called matrix.a.
Now, you can select the first row:
matrix.a[1,] # select row 1 from matrix a
This shows us the first row of the matrix. We can also see both the first and second row by using c(1,2).
matrix.a[c(1,2),] # select rows 1 and 2 from matrix a
Put your indices after the comma to select columns:
matrix.a[,c(1,3)] # select columns 1 and 3 from matrix a
12. Now, select the value in the third row, second column of matrix.a.
13. Can you describe what this next line of code does to matrix.a?
matrix.a[2,1] <- 999
There are multiple ways to select data from a data frame. You can use the index operator [], you can use the dollar sign $, or you can provide the name of a column as character data:
df.countries[4,] # selecting row 4
df.countries[,2] # selecting column 2
df.countries[,"Capital"] # selecting the information in the column called Capital
df.countries$Capital # selecting the information in the column called Capital
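Note that $Capital, [,2] and [,"Capital"] all return the information in the column as a vector (with character data in the case of Capital), because R simplifies a single selected column of a data frame to a vector by default. Selecting a row with [4,] returns a data frame with one row. To keep a single column as a data frame with one column, use df.countries["Capital"] or add the argument drop = FALSE, as in df.countries[,"Capital", drop = FALSE].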
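What each way of selecting returns in base R can be checked with class() on a small made-up data frame. The object demo and its contents are hypothetical and only serve as an illustration.

```r
# Made-up data frame with two character columns
demo <- data.frame(Countries = c("Aland", "Bland"),
                   Capital = c("Acity", "Bcity"),
                   stringsAsFactors = FALSE)

class(demo$Capital)                     # "character"
class(demo[, "Capital"])                # "character": a single column is simplified to a vector
class(demo["Capital"])                  # "data.frame": one index without a comma keeps the data frame
class(demo[, "Capital", drop = FALSE])  # "data.frame": drop = FALSE switches the simplification off
class(demo[1, ])                        # "data.frame": a selected row stays a data frame
```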
15. Use the class() function to return the class for each of the data frame access commands above.
One example is provided:
class(df.countries[4,])
We can also provide some statistics of our data frame with these selection commands. Some basic statistical functions are mean(), min(), and max().
16. Calculate the mean number of inhabitants in the df.countries data frame, using the summary function.
17. Calculate the minimum and maximum age in the Heart_disease data frame, using the dollar sign ($) to get suggestions for the column names.
We can also select data using logical operators, explained in the previous COO. The expressions created with logical operators are also called Boolean expressions. To get an overview of the countries with more than 12,000,000 inhabitants, you can use:
df.countries["Inhabitants"] > 12000000
This creates a Boolean expression that shows us that the second and sixth countries have 12,000,000 inhabitants or fewer (displayed as FALSE). We can also use this information to select data from another column. To do so, we use the Boolean expression we created to select rows. You can do this with the following commands:
many_inhabitants <- df.countries$Inhabitants > 12000000
df.countries[many_inhabitants, "Countries"]
We first created a vector containing the logical information [1] TRUE FALSE TRUE TRUE TRUE FALSE TRUE. Then, we wanted to produce the names of the countries with many inhabitants, so we used "Countries" to select the column and our logical vector to select the rows. This selected the information in the first, third, fourth, fifth and seventh row of the column Countries, because those were TRUE. It left out the names of the countries with fewer inhabitants, because those were FALSE.
18. Produce the country names of the countries with fewer than 12,000,000 inhabitants in a similar manner.
In this next bit, we will discuss the sort() and order() functions. The sort function returns the input data from lowest to highest value or in alphabetically ascending order. The order function returns the indices of the elements in the order that would sort the data from lowest to highest value or in alphabetically ascending order. So for alphabet <- c("c", "a", "b"), sort(alphabet) will return "a" "b" "c" and order(alphabet) will return 2 3 1, because the second value ("a") comes first in the alphabet, then the third value ("b"), and then the first value ("c").
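The relation between the two functions can be verified directly: using the result of order() as an index gives the same result as sort(). A minimal sketch with the alphabet vector from the text:

```r
alphabet <- c("c", "a", "b")

sort(alphabet)             # the values themselves, in ascending order
order(alphabet)            # the positions of the values, in ascending order

# Indexing with the positions from order() reproduces the sorted vector
alphabet[order(alphabet)]
identical(alphabet[order(alphabet)], sort(alphabet))
```

This is why order() is the function to use for sorting a whole data frame: the positions it returns can be used to rearrange all columns at once.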
So, we can (temporarily) sort the countries in our data frame with:
sort(df.countries$Countries)
sort(df.countries$Countries, decreasing = TRUE)
As you can see in the df.countries data frame, the actual order of the countries in the column ‘Countries’ has not changed. The sort function has just returned the column in alphabetical order in the Console panel.
We can use the order function to create a new data frame with our data sorted alphabetically on country name:
countries_sorted1 <- df.countries[order(df.countries$Countries),] # creating a new data frame with alphabetically ordered countries
countries_sorted1
The order we used is the following:
order(df.countries$Countries)
We use this order to indicate the rows to be used for our new data frame. So, our new data frame is constructed as follows: for the first row of countries_sorted1, the second row of the df.countries data frame is used (containing Belgium), the second row is then filled with the third row of df.countries (containing France), and so on.
19. Now, create a data frame called countries_sorted2, which is sorted on number of inhabitants.
In the lecture, the iris data frame was sorted on two values. To try this yourself:
20. Import the iris_dataset.csv as a data frame.
The following commands were used in the lecture:
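volgorde <- order(iris_dataset$sepal_width) # row indices that sort on sepal width ('volgorde' is Dutch for 'order')
iris_sorted <- iris_dataset[volgorde,] # rearrange the rows with these indices
iris_sorted <- iris_dataset[order(iris_dataset$sepal_width),] # the same in one step
iris_sorted <- iris_dataset[order(iris_dataset$flower, iris_dataset$sepal_width),] # sort on flower, then on sepal width
iris_sorted <- iris_dataset[order(iris_dataset$flower, iris_dataset$sepal_width,
decreasing = c(FALSE, TRUE), method = "radix"),] # flower increasing, sepal width decreasing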
The iris_sorted data frame has sorted the iris_dataset data first in increasing order on flower name, and then in decreasing order on sepal width.
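The same two-key sorting can be traced by hand on a very small example. The data frame flowers below is made up for illustration (it only borrows the column names from the lecture code), so that the resulting row order is easy to check.

```r
# Made-up data with two flower names and four sepal widths
flowers <- data.frame(flower = c("setosa", "virginica", "setosa", "virginica"),
                      sepal_width = c(3.0, 2.8, 3.5, 3.2),
                      stringsAsFactors = FALSE)

# First key increasing (flower), second key decreasing (sepal_width).
# A separate decreasing value per key requires method = "radix".
idx <- order(flowers$flower, flowers$sepal_width,
             decreasing = c(FALSE, TRUE), method = "radix")
idx

flowers_sorted <- flowers[idx, ]
flowers_sorted
```

Within each flower name, the widest sepal comes first; the flower names themselves are in alphabetical order.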
Now, let’s try a similar sorting for our Heart_disease data frame.
21. Create a new data frame in which you sort the Heart_disease data first in decreasing order on sex, and then in increasing order on age.
It sometimes happens that you use a data set with missing information. Missing data is displayed as an NA value (not available). Missing values can be difficult to work with because they can hinder certain calculations. For example, if you calculate the mean of a column that contains an NA, this will return NA (instead of the mean). We must be cautious when drawing conclusions if our data frame contains NAs and, for instance, check whether the data has been imported correctly. We can also remove observations with NA values.
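The effect of a single NA on a calculation can be seen with a short made-up vector (the name widths and its values are assumptions for illustration):

```r
# A made-up vector with one missing value
widths <- c(2, NA, 4, 6)

mean(widths)                # returns NA, because one value is missing
mean(widths, na.rm = TRUE)  # ignores the NA and averages the three available values
is.na(widths)               # TRUE at the position of the missing value
```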
Let’s work with the iris data set with missing values from the lecture. We will use a new variable name for our incomplete dataset.
22. Import the iris_data_incomplete.txt as a data frame called iris_data_incomplete.
23. Run the commands from the lecture:
summary(iris_data_incomplete)
which(is.na(iris_data_incomplete$sepal_width)) # which row contains NA?
iris_clean <- na.exclude(iris_data_incomplete) # removes whole row containing NA
mean(iris_data_incomplete$sepal_width) # includes the NA value
mean(iris_data_incomplete$sepal_width, na.rm = TRUE) # excludes the NA value for this calculation
Note that the na.exclude() function does not change the original data frame: it returns a copy without the rows containing NA, which is only kept if you assign it to a new object. This new object contains the original data set minus the row containing NA, so it has one observation (row) fewer than the original data set.
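That the original object stays untouched can be confirmed by counting rows before and after. The sketch below uses a small made-up data frame rather than the iris data.

```r
# Made-up data frame with one missing value in the second row
demo <- data.frame(id = 1:3, width = c(3.1, NA, 2.9))

na.exclude(demo)                # shows the data without the NA row, but stores nothing
nrow(demo)                      # the original still has all its rows

demo_clean <- na.exclude(demo)  # assign the result to keep it
nrow(demo_clean)                # one row fewer than the original
```

The closely related function na.omit() removes rows with NA values in the same way.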
You can also check the number of NA values by combining the sum() and is.na() functions: sum(is.na()).
24. Check the number of NA values in the iris_dataset, iris_data_incomplete and iris_clean data frames.
Saving your R script and output in R can be very helpful for future use or sharing.
To save your data and output in the desired folder, it is important to set the working directory, as previously discussed. You can additionally check this with the getwd() function. If the working directory is not set, you need to provide the entire path to the folder in which you aim to save the files. It is essential to manage your working directory properly to ensure that R can access the files it needs.
The write.table() and write.csv() functions are used to save data frames or matrices as text files. This is helpful for sharing data with others or importing it into other software programmes. For example:
mydata <- data.frame(Name = c("Alice", "Bob", "Charlie"),
Age = c(25, 30, 28))
write.csv(mydata, file = "mydata.csv", row.names = FALSE)
This saves the mydata data frame as a CSV file named mydata.csv. You can then import your .csv file in, for example, Excel via ‘import data’.
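A simple way to check that a file was saved correctly is to read it back into R. The sketch below writes to a temporary folder (an assumption made so the example does not depend on your working directory) and compares the result with the original.

```r
mydata <- data.frame(Name = c("Alice", "Bob", "Charlie"),
                     Age = c(25, 30, 28))

# Save to a temporary folder instead of the working directory
out_file <- file.path(tempdir(), "mydata.csv")
write.csv(mydata, file = out_file, row.names = FALSE)

# Read the file back and inspect it
mydata_again <- read.csv(out_file, stringsAsFactors = FALSE)
names(mydata_again)
nrow(mydata_again)
```

Setting row.names = FALSE prevents write.csv() from adding an extra first column with row numbers, so the file contains only the Name and Age columns.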
R scripts, in which you write your code in RStudio’s script editor, can be saved with an .R extension. Saving your R script is important for reproducibility and sharing with others. At the top of your window, click the ‘disk icon’ to save your R script, or click 'File' >> 'Save As'. The keyboard shortcut is Ctrl + S.
Now, take a look at the Grades.txt file. Use scripting to answer the following questions.
A1. Import the Grades.txt file as a data frame and check whether it worked correctly.
A2. How many rows and columns does the data frame have?
A3. What are the column names?
A4. What does the first part of this data frame look like?
A5. Provide a summary of the data.
A6. Are there errors in your data?
A7. Calculate the number of NA values.
A8. Calculate the mean of each exam. Exclude the NA values in your calculation.
A9. Make a new data frame that excludes the rows that contain NA values.
A10. For both exams, create a logical vector that shows which grades are bigger than 10.
A11. Use the vector that contains a ‘TRUE’ to return the grade that was bigger than 10.
A12. For both exams, create a logical vector that shows which grades are lower than 0.
A13. Use the vector that contains a ‘TRUE’ to return the grade that was lower than 0.
Since grades can vary from 0 to 10, we are quite certain that the grade which is higher than 10 contains a typo and should be 7.5, and that the grade lower than 0 accidentally got a minus sign in front of it. You check this on the original exams, and indeed the grades should be 7.5 and 6.5 respectively.
A14. Change the name of the data frame and check whether the grades you want to change are indeed in the 10th and 9th row of your data frame. Then, change the grades with the following commands:
grades_complete[10, "Exam1"] <- 7.5
grades_complete[9, "Exam2"] <- 6.5
A15. Create a vector of the grades in Exam1 sorted with the lowest grade first.
A16. Create a vector of the grades in Exam2 sorted with the highest grade first.
A17. Create a new, sorted data frame which first sorts on the first exam and then on the second exam.
A18. Calculate the average grade again for both exams.
A19. What is the highest grade for the first exam?
A20. Create a logical vector that shows which row(s) contain the highest grade.
Hint: use the code we used before for the number of inhabitants as an example of creating a logical vector:
many_inhabitants <- df.countries$Inhabitants > 12000000
A21. Now, use this vector to get the name of the person with the highest grade.
A22. Now, produce the name of the person with the lowest grade for the second exam.