Alignment viewers let us inspect our sequence alignments directly, which gives a clearer picture of the data. In these exercises, we will use the Integrative Genomics Viewer (IGV).
We will be using publicly available Illumina sequence data from the HCC1143 cell line, which was generated from a 52-year-old Caucasian woman with breast cancer. Additional information is available for both samples: HCC1143 (tumor, TNM stage IIA, grade 3, primary ductal carcinoma) and HCC1143/BL (matched normal EBV-transformed lymphoblast cell line). Reads were mapped against the human reference genome GRCh37/hg19. To reduce file sizes, the reads from these cell lines have been filtered to chr21:19,000,000-20,000,000.
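Every navigation step in these exercises uses a locus string in the same form as the filtered region above. As a minimal sketch (the function names `parse_locus` and `is_within` are hypothetical helpers, not part of IGV), the following Python shows how such a string can be parsed and how to check that a locus falls inside the filtered region, which is where the BAM file actually contains reads.

```python
import re

# Pattern for locus strings such as "chr21:19,479,321" or
# "chr21:19,000,000-20,000,000" (commas are thousands separators).
LOCUS_RE = re.compile(r"^(?P<chrom>[^:]+):(?P<start>[\d,]+)(?:-(?P<end>[\d,]+))?$")


def parse_locus(locus):
    """Return (chrom, start, end); a single position gives start == end."""
    match = LOCUS_RE.match(locus.strip())
    if match is None:
        raise ValueError(f"Not a valid locus string: {locus!r}")
    start = int(match.group("start").replace(",", ""))
    end_text = match.group("end")
    end = int(end_text.replace(",", "")) if end_text else start
    if end < start:
        raise ValueError(f"End before start in locus: {locus!r}")
    return match.group("chrom"), start, end


def is_within(locus, region):
    """True if the locus lies entirely inside the region."""
    chrom, start, end = parse_locus(locus)
    region_chrom, region_start, region_end = parse_locus(region)
    return chrom == region_chrom and region_start <= start and end <= region_end


FILTERED_REGION = "chr21:19,000,000-20,000,000"
print(is_within("chr21:19,480,041-19,480,386", FILTERED_REGION))  # True
print(is_within("chr21:20,500,000", FILTERED_REGION))             # False
```

If you navigate to a locus for which this check is false, an empty alignment track is expected and does not indicate a loading problem.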
You will need these two files
The Integrative Genomics Viewer (IGV) is a high-performance, easy-to-use, interactive tool for the visual exploration of genomic data. It supports flexible integration of all the common types of genomic data and metadata, investigator-generated or publicly available, loaded from local or cloud sources. https://software.broadinstitute.org/software/igv/
You can install IGV or run it inside your web browser with the IGV-Web application.
The written instructions assume you are using the IGV-Web version.
To interpret the alignments properly, you must use the same reference against which you aligned your sequencing data. For these exercises, we will be using Human (GRCh37/hg19). You can pick the genome from the drop-down menu in the upper-left corner.
We will be using the breast cancer cell line HCC1143 to visualize alignments. For speed, only a small portion of chr21 will be loaded (chr21:19,000,000-20,000,000).
Copy the files (see Data Set section at top of this document) to your local drive, and in IGV choose Tracks > Local File ..., select the bam file AND the bam.bai file at the same time, and click OK. Note that the bam and index files must be in the same directory for IGV to load these properly.
Navigate to a narrow window on chromosome 21: chr21:19,480,041-19,480,386.
You will see reads represented by grey or white bars stacked on top of each other, where they were aligned to the reference genome. The reads are pointed to indicate their orientation (i.e. the strand on which they are mapped). Click on any read and notice that a lot of information is available.
3. Click on a random read and see if you can explain what information is reported. You may want to check the SAM file format documentation to look up certain abbreviations and terms: the Sequence Alignment/Map Format Specification and the Optional Fields Specification.
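The information IGV reports for a read comes from the fields of its SAM record. The sketch below splits one record into the eleven mandatory fields defined in the SAM specification and decodes the bitwise FLAG. The example record is made up for illustration and is not taken from the HCC1143 data; `parse_sam_line` and `decode_flag` are hypothetical helper names.

```python
# The eleven mandatory SAM fields, in column order.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

# Meaning of each FLAG bit according to the SAM specification.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper pair",
    0x4: "unmapped",
    0x8: "mate unmapped",
    0x10: "reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "fails quality checks",
    0x400: "duplicate",
    0x800: "supplementary alignment",
}


def parse_sam_line(line):
    """Split one alignment line into mandatory fields and optional tags."""
    columns = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, columns[:11]))
    for name in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[name] = int(record[name])
    tags = {}
    for field in columns[11:]:
        tag, value_type, value = field.split(":", 2)  # e.g. "NM:i:0"
        tags[tag] = int(value) if value_type == "i" else value
    record["TAGS"] = tags
    return record


def decode_flag(flag):
    """List the properties whose bits are set in FLAG."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


# Hypothetical record, invented for this example.
example = "\t".join(["read_1", "99", "chr21", "19480100", "60", "10M",
                     "=", "19480300", "210", "ACGTACGTAC", "IIIIIIIIII",
                     "NM:i:0"])
record = parse_sam_line(example)
print(record["RNAME"], record["POS"], record["MAPQ"], record["CIGAR"])
print(decode_flag(record["FLAG"]))
```

Comparing the decoded fields with the pop-up shown for a real read is a quick way to learn which IGV label corresponds to which SAM column.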
Enter chr21:19,479,321 directly into the location field to center the view on the C/T SNP.
On the HCC1143.normal.21.19M-20M.bam track, at the exact SNP position chr21:19,479,321, click Sort by... base.
Click on the cog wheel on the right of the HCC1143.normal.21.19M-20M.bam track and click Color by: read strand.
Click on the coverage track of the HCC1143.normal.21.19M-20M.bam track at the exact SNP position chr21:19,479,321 for a summary of the bases mapped to this exact position.
4. How does Color by read strand help determine if this is a real variant?
5. Also investigate another nearby SNP at chr21:19,479,731 and determine the frequencies of A,C,G,T at that location.
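The base summary and the read-strand coloring can be thought of as two tabulations of the same pileup column. The sketch below is a minimal illustration under that assumption: it takes a list of (base, strand) observations at one position and computes the per-base frequencies and the counts split by strand. The observations are invented for the example and do not come from the HCC1143 data.

```python
from collections import Counter


def base_frequencies(observations):
    """Fraction of reads showing each of A, C, G, T at one position."""
    counts = Counter(base for base, _strand in observations)
    total = sum(counts.values())
    if total == 0:
        return {base: 0.0 for base in "ACGT"}
    return {base: counts.get(base, 0) / total for base in "ACGT"}


def counts_by_strand(observations):
    """Base counts separately for forward (+) and reverse (-) reads."""
    table = {"+": Counter(), "-": Counter()}
    for base, strand in observations:
        table[strand][base] += 1
    return table


# Hypothetical pileup column: ten reads covering a single position.
pileup = ([("C", "+")] * 3 + [("C", "-")] * 2 +
          [("T", "+")] * 2 + [("T", "-")] * 3)

print(base_frequencies(pileup))
for strand, counts in counts_by_strand(pileup).items():
    print(strand, dict(counts))
```

Reading the numbers from the IGV pop-up into the same two tables makes it easier to compare positions with each other.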
Navigate to position chr21:19,611,925-19,631,555. Note that the range contains areas where coverage drops to zero in a few places.
Click on the cog wheel on the right of the HCC1143.normal.21.19M-20M.bam track and click on Display mode: squish.
Also click Color by: pair orientation & insert size (TLEN).
6. Why are some reads throughout the alignments highlighted deep blue or red?
Navigate to region chr21:19,666,833-19,667,007.
Click on the cog wheel on the right of the HCC1143.normal.21.19M-20M.bam track and click on Display mode: expand.
Sort by base (at position chr21:19,666,901).
7. Are the two (non-reference) single-nucleotide variants linked together (existing on the same haplotype) or not at all? How can you tell?
Navigate to region chr21:19,800,320-19,818,162.
Load the Repeat Masker track (Tracks > Annotations > select Repeat Masker > OK).
8. Why are so many reads highlighted in white (instead of gray)?
9. Can you explain the poor mapping quality in these regions, when you look at the features indicated on the Repeat Masker track?
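The mapping quality referred to in question 9 is the MAPQ field of the SAM record. The SAM specification defines it as -10 log10 of the probability that the mapping position is wrong, rounded to the nearest integer, with 255 meaning that no value is available. The sketch below simply inverts that definition; the function name is a hypothetical helper.

```python
def mapq_to_error_probability(mapq):
    """Probability that the reported mapping position is wrong."""
    if mapq == 255:
        # 255 is reserved in the SAM specification for "not available".
        raise ValueError("MAPQ 255 means the mapping quality is unavailable")
    return 10 ** (-mapq / 10)


for mapq in (0, 10, 20, 60):
    probability = mapq_to_error_probability(mapq)
    print(f"MAPQ {mapq:>2}: P(wrong position) = {probability:g}")
```

A MAPQ of 0 corresponds to a probability of 1, that is, the aligner could not prefer the reported position over at least one alternative.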
Navigate to region chr21:19,324,469-19,331,468.
Click on the cog wheel on the right of the HCC1143.normal.21.19M-20M.bam track and click View as Pairs and Display mode: expand.
Color by pair orientation & insert size (TLEN).
On the HCC1143.normal.21.19M-20M.bam track, click Sort by... insert size.
10. What is the insert size of the read pairs indicated in red?
11. What is the typical insert size if you look at the surrounding read pairs?
12. How large is the deletion?
13. Why does this section indicate a homozygous deletion in the subject genome and not a heterozygous deletion?
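Questions 10 to 12 amount to a small calculation: read pairs that span a deletion appear further apart on the reference than they were in the sequenced fragment, so the extra distance approximates the deletion length. The sketch below shows that arithmetic using medians of the absolute TLEN values. All numbers are invented for illustration and are not the answers for this region; the function name is hypothetical.

```python
from statistics import median


def estimate_deletion_size(spanning_tlens, background_tlens):
    """Approximate deletion length from insert sizes (TLEN values).

    spanning_tlens:   TLEN of pairs with an unusually large insert size
    background_tlens: TLEN of ordinary pairs from the surrounding region
    """
    observed = median(abs(tlen) for tlen in spanning_tlens)
    typical = median(abs(tlen) for tlen in background_tlens)
    return observed - typical


# Hypothetical values; TLEN is negative for the rightmost read of a pair.
spanning = [2800, -2850, 2790]
background = [300, -310, 290, 305, -295]

print(estimate_deletion_size(spanning, background))  # 2500
```

The result is only an estimate, because the true fragment length of each spanning pair is unknown and varies around the typical value.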
Navigate to region chr21:19,102,154-19,103,108.
Color by pair orientation & insert size (TLEN) and note the reads highlighted in diverse colors.
14. What do the diverse color highlights indicate?
15. Where can you find a legend of these colors?
16. Can you find out which feature on the Repeat Masker could be causing the mapping issues in this region?
17. Explain in your own words how this element causes these mapping issues.
Navigate to region chr21:19,089,694-19,095,362.
Color by pair orientation and try to find the three read pairs highlighted in green.
18. What is unusual about these three read pairs indicated in green?
19. How can you explain that pattern?
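Pair orientation can be derived from fields you have already seen in the read pop-up: the strand bits of the FLAG and the start positions of the read and its mate. The sketch below is a simplified illustration that assumes both reads map to the same chromosome and uses start positions only. It labels a pair with the two-letter scheme used in the IGV documentation, where LR is the expected orientation for Illumina paired-end reads (leftmost read on the forward strand, rightmost read on the reverse strand). The function name and the example values are hypothetical.

```python
REVERSE = 0x10       # FLAG bit: this read maps to the reverse strand
MATE_REVERSE = 0x20  # FLAG bit: the mate maps to the reverse strand


def pair_orientation(flag, pos, pnext):
    """Return 'LR', 'RL', 'LL' or 'RR' for a pair on one chromosome."""
    read_forward = not (flag & REVERSE)
    mate_forward = not (flag & MATE_REVERSE)
    # Decide which of the two reads is leftmost on the reference.
    if pos <= pnext:
        left_forward, right_forward = read_forward, mate_forward
    else:
        left_forward, right_forward = mate_forward, read_forward
    # A forward-strand read is written as L, a reverse-strand read as R.
    return ("L" if left_forward else "R") + ("L" if right_forward else "R")


# Hypothetical FLAG and position combinations.
print(pair_orientation(99, 100, 300))   # LR
print(pair_orientation(83, 100, 300))   # RL
print(pair_orientation(65, 100, 300))   # LL
print(pair_orientation(113, 100, 300))  # RR
```

Applying the same logic to the FLAG and position values of an unusually colored pair shows which orientation class it belongs to.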
Adv1. What software was used to map these reads?
Adv2. What would actually happen if you load the data, but use the wrong reference genome, say Human (hg18)?
Adv3. A SAM file reports the reference sequence name in the header, but does not explicitly contain the reference sequence. Could we reconstruct the reference sequence from just the alignments in the SAM file?
*Adapted from Genomic Visualization and Interpretations - Griffith Lab