You need to do this only once!
The script below checks whether the listed CRAN packages are installed
on your computer. If they are not, they will be automatically
installed.
## First specify the packages of interest
packages = c("devtools","hdf5r","dplyr","ggplot2","stringr",
"RColorBrewer","useful","readr","BiocManager")
## Now load or install&load all
package.check <- lapply(
packages,
FUN = function(x) {
if (!require(x, character.only = TRUE)) {
install.packages(x, dependencies = TRUE)
library(x, character.only = TRUE)
}
}
)Seurat for scRNAseq data analysis
The following is necessary for interaction with hdf5 objects, and can only be installed using BiocManager installer.
In this COO, you will learn how to analyse single cell RNA sequencing data (scRNAseq) using the well established Seurat pipeline.
This pipeline includes functions that do most of the work ‘behind the scenes’. This makes the tool very user friendly and suitable for todays tutorial. There is extensive documentation on the use of the pipeline. The following tutorial is the closest to what we will do today: https://satijalab.org/seurat/articles/pbmc3k_tutorial.html
You may need to run this on the terminal, since in RStudio, the working directory is set to your home directory after the chunk is run! An alternative is to use the pull down menu at the top of the bottom-right window on RStudio. Click Files/More to see the option
For today’s tutorial, we will use the scRNAseq atlas of the adolescent mouse cortex published by Saunders et al 2018.
\(~\)
Why? A major use of the scRNAseq is cell type identification. To achieve
this, you need to perform quality control steps and cluster analysis. To
visualize the data, you will perform dimensionality reduction. Finally,
you can plot marker genes that you find from the literature to reveal
the cell type identity of clusters \(~\)
We will use the Seurat object that we have uploaded. This object made specially for the Seurat pipeline has a lot of ‘slots’ where information can be stored, and the architecture is a bit complicated. You do not need it for this tutorial, except what is mentioned
The data (‘Linnerson_cortex_10X22.rds’) downloaded and processed into a ‘Seurat object’ to prevent technical errors caused by the ‘loom’ file format in several computers.
You can download them from Brightspace.
Load the Seurat object that was saved as rds.
Note: The object contains data from 6658 cells (samples) and 27998 features (genes). There is 1 assay (RNA).
** We skip this part for time reasons **
** We skip this part for time reasons **
In Seurat, standard preprocessing workflow is replaced by a single command. However, it is good to see this part to learn each step.
First, we will normalize the data. This is to get rid of the differences in total RNA counts between cells. In other words, we will equalize the total count number in each cell to a fixed number (e.g. 10000 RNA molecules per cell).
We want to find out ‘informative genes’ that will explain biological differences to use in some of hte downstream applications. If a gene is expressed everywhere, that doesnt tell us much. However, if a gene is expressed in a subset of cells, this will cause ‘variation’.
We can detect these genes using FindVariableFeatures() function.
Go through the code: What is happening?
\(~\)
Identify the 10 most highly variable genes.
Now visualise.
## Plot
# plot variable features with and without labels
plot1 <- VariableFeaturePlot(dataset)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
plot2Note:Dispersion indicates variation, while red color shows ‘significantly variable’ genes
Select the top 1000 highly variable genes. For downstream applications, we will use these.
Check the dispersion table: What do you see?
What do we mean by variation?
\(~\)
We will use the ScaleData() function of hte Seurat package. The confounding factors can be discarded using ‘vars.to.regress’.
Performing Dimensionality reduction in Seurat is very simple!
It is also easy to find out genes that contribute to the + and - direction of the top PCs! We can also plot the level of contribution.
What can we learn from the PC1 versus PC2 plot?
What do the genes represent?
What can we learn from by looking at these genes?
\(~\)
We can use an integrated function of Seurat to plot heatmaps to visualize genes that drive different principle components.
Plot some of these genes on PCA plots to see how their expression is distributed along different PCs. For this, we will look at the first 4 PCs, indentify genes that are highest on + and - axis and plot them on PC1/PC2 and PC3/PC4 plots.
The first two components:
Cldn5 and Mog genes look interesting:
plot_f1 <- FeaturePlot(dataset,features = c('Cldn5','Mog'),reduction = 'pca',dims = c(1,2),cols = c('gray','blue','red'))
plot_f2 <- FeaturePlot(dataset,features = c('Cldn5','Mog'),reduction = 'pca',dims = c(3,4),cols = c('gray','blue','red'))
plot_f1 / plot_f2Now components 5 and 8:
Pprm and Tmem119 genes look interesting:
We will use a graph-based clustering algorithm discussed at the lecture.
We need to build a neighborhood graph. In this network graph, each cell will be a node, and their similarity in high dimensional space will become their edges.
One could say that cells closest to each other reside in a neighborhood.
We will use the top PCs to calculate the umap and tSNE coordinates. You can change the numnber of PCs based on your observation in Practice 2.
Looking at the clusters that you have identified, what differences do
you see between the dimentionality reduction methods? Do you have a
preference to use one instead of the other? Why?
\(~\)
Check Use Google/ChatGPT to find marker genes for neurons, inhibitory neurons, astrocytes and oligodendrocytes.
Hint: The following papers have marker genes for relevant cell types:
Plot the expression of these marker genes using the
VlnPLot() function.
** What do you see? Can you identify different cell types clearly, or is there anything missing? ** \(~\)
The following parts are extras. If you have time, please follow them :::
The following is a question from teh last year. We didn’t go into details of k-Means this year, thus the question may look out of context. But it will give you an idea of what can be expected. A indicated the answer
Please list four important facts about the k-means algorithm on the following topics: i) What is it used for? > A: For classification of data into groups/clusters
Please explain the important steps > A: Determine the expected number of clusters. Take random points in the data and calculate the distance of each point to these random points to assign clusters. Then calculate the center of the cluster. Finally, repeat this process until there is no change in the centeral point, meaning that a stability is reached.
Name one of the drawbacks > A: needs estimation of the number of clusters beforehand. Cannot work on all different ‘shapes’ of data. Cannot find outliers. Does not show how similar each cluster is.
If you run the k-means algorithm on the same data two different times, do you always get the same results? Why/why not? > A: No. The process is stochastic, starts randomly and is repeated many times until stability is reached. The final results will, in most cases, be different.
Here is another question for you to think about:
Which steps of the single cell RNA sequencing analysis aims at getting rid of the differences in number of reads between cells (e.g. a cell with 1000 features and another with 5000 features)?
A: I wont provide the answer for this one
The images for info boxes are taken from https://www.cleanpng.com
Also see: https://umap-learn.readthedocs.io/en/latest/how_umap_works.html
https://www.oreilly.com/content/an-illustrated-introduction-to-the-t-sne-algorithm