Sequence Mapping, Part 1: Reference Based Assembly
Reference Based Transcriptome Analysis (Genome Mapping)
In this write-up, I am going to demonstrate Reference Based Transcriptome Analysis also known as genome mapping. This is going to be a lengthy demo (and reading) so take your time and take breaks in between. This process is broken down into 3 sections (total of 7 steps): Mapping, Quantification and Differential Expression Analysis.
Preliminary Preparation
Before we start, there are several tools you need to download (hyperlinked): HISAT2, Cufflinks, Samtools, and Bowtie2. make sure you have the correct binary and the latest version when downloading and follow the manual instructions for installations. Make sure all these tools are added to your PATH environment.
Note: I run my program on Linux/Ubuntu.
Step 1. Genome Data Preparation
As you recall from my previous article, genome mapping requires reference genome data. You can get the genome data from databases such as NCBI, UCSC genome browser, Ensembl, etc. The data we need is a genome sequence file (genome.fa
) and an annotation file (genes.gtf
).
You can obtain those files for your reference model species, e.g. Homo sapien or Mus musculus etc., from the Illumina website here. You should download this on your remote server rather than on your local PC due to the nature of the file size.
For this demonstration, my model species, or reference if you want to call it that way, is Mus musculus (mouse). I will be using the Mus musculus data from UCSC database. To download the data, use the following command below replacing the bracket with the correct information:
wget [past the link address of UCSC Mus musculus]
Once you have completed the download, use tar -xvzf [name of the file including '.tar.gz']
command by replacing the bracket with the correct file name to unzip the file. As a result, a new directory has been generated called Mus musculus
.
Step 2. Indexing
Before mapping the reads, we need to index the reference genome. Why are we doing this? By indexing, it will reduce the time and memory capacity while mapping the genome.
- Locate the
genome.fa
file in theMus musulus
directory usingcd
command. For example, mine looks like this:cd sra_data/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta
Note: TheWholeGenomeFasta
is the end directory where thegenome.fa
file is located. - Make a new directory name
genome_index.
- We will now commence indexing creating the output file in the new directory,
genome_index:
cd genome_index
hisat2-build -p 8 /home/compbio/sra_data/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa [Name of Index file]
### Note: ###
# -p: number of cpu to use when indexing
Note: You should replace the bracket with the output file name of your choice. I’m going to call my index file as musculus_index
for this demonstration.
There should be a total of 8 index files in .ht2
format generated when the process has completed.
Step 3. Genome Mapping
To map the sequence reads, we will be using the HISAT2 tool and the command format looks like below, however, there is an extra step to add on:
hisat2 -p 8 -x [Index File Name] -1 [forward paired fastq file] -2 [reverse paired fastq file] -S [output file name].sam --dta-cufflinks
Note: the Fastq files it’s referring to are the Paired Fastq Files (forward and reverse) we generated from Trimmomatic.
While HISAT2 is mapping, an alignment summary will be generated for each sample/file. This report is not saved automatically but rather created as a standard error (“stderr”).
Therefore we need to use the following command 2>&1 |tee [name of alignment file].summary
to save the stdeer report and the mapped files. So the final, complete command format for this step looks like below:
hisat2 -p 8 -x musculus_index -1 [forward paired fastq file] -2 [reverse paired fastq file] -S [output file name].sam --dta-cufflinks 2>&1 |tee [name of alignment file].summary
To give you an example how I did it: hisat2 -p 8 -x musculus_index -1 /home/compbio/sra_data/SRR8238941_1_P.fastq.gz -2 /home/compbio/sra_data/SRR8238941_2_P.fastq.gz -S SRR8238941.sam --dta-cufflinks 2>&1 |tee SRR8238941.summary
As a result, you should have generated a .sam
file and a .summary
file. Once the alignment summary report has been generated, you will need to calculate the Overall Alignment Rate which I will demonstrate at the end of the article.
Step 4. Bam Sorting
This is the last step of genome mapping which we will be converting the .sam
file into a Bam file format using the following command:
samtools sort -@8 -o [output file name].sort.bam [file name from mapping].sam
Note: -@
command is a parameter for the number of CPUs to use during this step. To demonstrate one example your code should look similar to this: samtools sort -@ 8 -o SRR8238941.sort.bam SRR8238941.sam
The next couple steps from here will be Quantification of assembled transcriptome.
Step 5. Generating Transcript Assembly
For this step, it will be based on the Tuxedo protocol. Use the following command format to commence transcription assembly.
cufflinks -p 8 -G [pathway of the Annotation file] -o [output_directory_name] [file name from Bam sorting].sort.bam
### Note: ###
# -p: number of CPU
# -G: Annotation file (should write the pull pathway where this file is located)
# -o: cufflinks creates a new directory to store the output files therefore you just need to name it
To give you an example, mine would look like this: cufflinks -p 8 -G /home/compbio/sra_data/Mus_musculus/UCSC/mm10/Annotation/Genes/genes.gtf -o SRR8238941_trspt SRR8238941.sort.bam
Note: you will need to write a full pathway for -G
command since the annotation file (genes.gtf
) is in a separate directory. Unless you move the files into the same directories then you do not need to state the pathway to the file.
Step 6. Merging Assemblies
We will be merging the cufflink-assembled files into a single annotation file so that we can compare the expression level between samples. You also need Python version 2.7 installed or the tool will not work, so make sure you have Conda on your Linux/Ubuntu environment beforehand.
- Compile the pathways for the assembled files (
transcripts.gtf
) from the previous step using the following command:
vi assemblies.txt
As an example for one file, mine will look like this: /home/compbio/sra_data/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome_index/gene_alignment/bam/SRR8238941_trspt/transcripts.gtf
You should have total of 6 lines in your assemblies.txt
since there are 6 data files. Make sure you save the text file using the correct vi editor commands.
2. Execute the below command to merge the assembled files.
cuffmerge -p 8 -g [pathway to genes.gtf] -s [pathway to genome.fa] -o merged_asm assemblies.txt
# -p: number of cpu
# -g Annotation file
# -s: genome sequence file
# -o: directory where cuffmerge files will be stored
To demonstrate, my command will look like this:cuffmerge -p 8 -g /home/compbio/sra_data/Mus_musculus/UCSC/mm10/Annotation/Genes/genes.gtf -s genome.fa -o merged_asm assemblies.txt
Note: I wrote the full pathway to genes.gtf
file because it’s located in a different directory. You will need to provide a full pathway if the file(s) you are working with is not located in the same directory.
3. Once it’s done, the cuffmerge
will create a merged_asm
directory, might be named merge_assemble
, where the merge.gtf
file is generated, which is the transcriptome assembly file.
Step 7. Differential Expression Analysis
At last, we will be finding the differentially expressed genes (DEGs) between samples using cuffdiff command. The command format looks like this:
cuffdiff -o diff_out -b genome.fa -p 8 -L [list of samples] -u merged.gtf file1.sort.bam file2.sort.bam
# -o: name of output directory where your DEG file(s) will be stored
# -b: genome sequence file
# -p: number of cpu
# -L: sample name
# -u: multi-read-correction;reads that are mapped on multiple locations in the genome are calculated using weight read mapping
If your data (or sample) has replications you will need to use this file naming format with commas: [file_1 name].sort.bam,[file_1-1 name].sort.bam [file_2 name].sort.bam,[file_2-1 name].sort.bam [file_3 name].sort.bam,[file_3-1 name].sort.bam
To figure out what the sample names are for each corresponding file go back to the journal article or here. Each file should tell you whether it is a control or different sample(s).
Here is an example how mine looks:cuffdiff -o diff_out -b /home/compbio/sra_data/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome_index/genome.fa -p 8 -L Ctrl,Ra,Rv -u /home/compbio/sra_data/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome_index/merge_assemble/merged.gtf SRR8238941.sort.bam,SRR8238942.sort.bam SRR8238943.sort.bam,SRR8238944.sort.bam
Since the data has 3 different samples (Control, Ra, Rv), stated in the journal article, I specified the sample name as Ctrl, Ra, Rv. Also notice how the bam files are grouped by same sample types. For instance SRR8238941.sort.bam,SRR8238942.sort.bam
are grouped and separated by commas, not spaced, because they are control group samples.
Once the process has been completed, you will find your several output files in the diff_out
directory.
If you are focused on gene expression level, the files you want to refer to are gene_exp.diff
and genes.read_group_tracking
. You may want to export these files into your local PC or copy these files and store it in a separate directory where you can work on it.
gene_exp.diff
: contains gene epxression level for each sample, p & q values, etc of differentially expressed genes. genes.read_group_tracking
file contains expression levels of each replicate.
Overall Alignment Rate Calculation
Referring back to Step 3. Gene Mapping after alignment summary reports ( [file name].summary
) have been generated, the next step is to calculate the overall alignment rate. When taking a look at one of the alignment summary files you should see a report like below:
37537336 reads; of these:
37537336 (100.00%) were paired; of these:
986245 (2.63%) aligned concordantly 0 times
33036211 (88.01%) aligned concordantly exactly 1 time
3514880 (9.36%) aligned concordantly >1 times
— —
986245 pairs aligned concordantly 0 times; of these:
112422 (11.40%) aligned discordantly 1 time
— —
873823 pairs aligned 0 times concordantly or discordantly; of these:
1747646 mates make up the pairs; of these:
1077358 (61.65%) aligned 0 times
579867 (33.18%) aligned exactly 1 time
90421 (5.17%) aligned >1 times
98.56% overall alignment rate
<<End of Report>>
The following is the calculation how we derived 98.65% overall alignment rate, refer to bolded section in the report :
98.56% = {33036211 + 3514880 + 112422 + (579867 + 90421) / 2} / 37537336 x 100
Congrats, you have made it!
In the next article I will be demonstrating De Novo Assembly. In the meantime, carefully go over each step and try to run through the workflow with other dataset.
Thanks for reading my article, I hope you are enjoying them so far. Let me know in the comment below or consider buying me a coffee (no pressure): https://www.buymeacoffee.com/shortlongseq
I also publish my articles at https://shortlongseq.hashnode.dev/ and https://seqbioinformatics.substack.com/ if you prefer a different mode of newsletter delivery, they are free.