Sequence Mapping, Part 1: Reference Based Assembly

7 min readJan 16, 2023

Reference Based Transcriptome Analysis (Genome Mapping)

In this write-up, I am going to demonstrate Reference Based Transcriptome Analysis also known as genome mapping. This is going to be a lengthy demo (and reading) so take your time and take breaks in between. This process is broken down into 3 sections (total of 7 steps): Mapping, Quantification and Differential Expression Analysis.

Preliminary Preparation

Before we start, there are several tools you need to download (hyperlinked): HISAT2, Cufflinks, Samtools, and Bowtie2. make sure you have the correct binary and the latest version when downloading and follow the manual instructions for installations. Make sure all these tools are added to your PATH environment.
Note: I run my program on Linux/Ubuntu.

Step 1. Genome Data Preparation

As you recall from my previous article, genome mapping requires reference genome data. You can get the genome data from databases such as NCBI, UCSC genome browser, Ensembl, etc. The data we need is a genome sequence file (genome.fa) and an annotation file (genes.gtf).

You can obtain those files for your reference model species, e.g. Homo sapien or Mus musculus etc., from the Illumina website here. You should download this on your remote server rather than on your local PC due to the nature of the file size.

For this demonstration, my model species, or reference if you want to call it that way, is Mus musculus (mouse). I will be using the Mus musculus data from UCSC database. To download the data, use the following command below replacing the bracket with the correct information:

wget [past the link address of UCSC Mus musculus]

Once you have completed the download, use tar -xvzf [name of the file including '.tar.gz'] command by replacing the bracket with the correct file name to unzip the file. As a result, a new directory has been generated called Mus musculus .

Step 2. Indexing

Before mapping the reads, we need to index the reference genome. Why are we doing this? By indexing, it will reduce the time and memory capacity while mapping the genome.

Locate the genome.fa file in the Mus musulus directory using cd command. For example, mine looks like this: cd sra_data/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta
Note: The WholeGenomeFasta is the end directory where the genome.fa file is located.
Make a new directory name genome_index.
We will now commence indexing creating the output file in the new directory, genome_index:

cd genome_index
hisat2-build -p 8 /home/compbio/sra_data/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa [Name of Index file]

### Note: ###
# -p: number of cpu to use when indexing

Note: You should replace the bracket with the output file name of your choice. I’m going to call my index file as musculus_index for this demonstration.
There should be a total of 8 index files in .ht2 format generated when the process has completed.

Step 3. Genome Mapping

To map the sequence reads, we will be using the HISAT2 tool and the command format looks like below, however, there is an extra step to add on:

hisat2 -p 8 -x [Index File Name] -1 [forward paired fastq file] -2 [reverse paired fastq file] -S [output file name].sam --dta-cufflinks

Note: the Fastq files it’s referring to are the Paired Fastq Files (forward and reverse) we generated from Trimmomatic.

While HISAT2 is mapping, an alignment summary will be generated for each sample/file. This report is not saved automatically but rather created as a standard error (“stderr”).

Therefore we need to use the following command 2>&1 |tee [name of alignment file].summary to save the stdeer report and the mapped files. So the final, complete command format for this step looks like below:

hisat2 -p 8 -x musculus_index -1 [forward paired fastq file] -2 [reverse paired fastq file] -S [output file name].sam --dta-cufflinks 2>&1 |tee [name of alignment file].summary

To give you an example how I did it:
hisat2 -p 8 -x musculus_index -1 /home/compbio/sra_data/SRR8238941_1_P.fastq.gz -2 /home/compbio/sra_data/SRR8238941_2_P.fastq.gz -S SRR8238941.sam --dta-cufflinks 2>&1 |tee SRR8238941.summary

As a result, you should have generated a .sam file and a .summary file. Once the alignment summary report has been generated, you will need to calculate the Overall Alignment Rate which I will demonstrate at the end of the article.

Step 4. Bam Sorting

This is the last step of genome mapping which we will be converting the .sam file into a Bam file format using the following command:

samtools sort -@8 -o [output file name].sort.bam [file name from mapping].sam

Note: -@ command is a parameter for the number of CPUs to use during this step. To demonstrate one example your code should look similar to this: samtools sort -@ 8 -o SRR8238941.sort.bam SRR8238941.sam

The next couple steps from here will be Quantification of assembled transcriptome.

Step 5. Generating Transcript Assembly

For this step, it will be based on the Tuxedo protocol. Use the following command format to commence transcription assembly.

cufflinks -p 8 -G [pathway of the Annotation file] -o [output_directory_name] [file name from Bam sorting].sort.bam

### Note: ###
# -p: number of CPU
# -G: Annotation file (should write the pull pathway where this file is located)
# -o: cufflinks creates a new directory to store the output files therefore you just need to name it

To give you an example, mine would look like this:
cufflinks -p 8 -G /home/compbio/sra_data/Mus_musculus/UCSC/mm10/Annotation/Genes/genes.gtf -o SRR8238941_trspt SRR8238941.sort.bam
Note: you will need to write a full pathway for -G command since the annotation file (genes.gtf ) is in a separate directory. Unless you move the files into the same directories then you do not need to state the pathway to the file.

Step 6. Merging Assemblies

We will be merging the cufflink-assembled files into a single annotation file so that we can compare the expression level between samples. You also need Python version 2.7 installed or the tool will not work, so make sure you have Conda on your Linux/Ubuntu environment beforehand.

Compile the pathways for the assembled files (transcripts.gtf ) from the previous step using the following command:

vi assemblies.txt

As an example for one file, mine will look like this: /home/compbio/sra_data/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome_index/gene_alignment/bam/SRR8238941_trspt/transcripts.gtf
You should have total of 6 lines in your assemblies.txt since there are 6 data files. Make sure you save the text file using the correct vi editor commands.

2. Execute the below command to merge the assembled files.

cuffmerge -p 8 -g [pathway to genes.gtf] -s [pathway to genome.fa] -o merged_asm assemblies.txt

# -p: number of cpu
# -g Annotation file 
# -s: genome sequence file
# -o: directory where cuffmerge files will be stored

To demonstrate, my command will look like this:
cuffmerge -p 8 -g /home/compbio/sra_data/Mus_musculus/UCSC/mm10/Annotation/Genes/genes.gtf -s genome.fa -o merged_asm assemblies.txt
Note: I wrote the full pathway to genes.gtf file because it’s located in a different directory. You will need to provide a full pathway if the file(s) you are working with is not located in the same directory.

3. Once it’s done, the cuffmerge will create a merged_asm directory, might be named merge_assemble , where the merge.gtf file is generated, which is the transcriptome assembly file.

Step 7. Differential Expression Analysis

At last, we will be finding the differentially expressed genes (DEGs) between samples using cuffdiff command. The command format looks like this:

cuffdiff -o diff_out -b genome.fa -p 8 -L [list of samples] -u merged.gtf file1.sort.bam file2.sort.bam

# -o: name of output directory where your DEG file(s) will be stored
# -b: genome sequence file
# -p: number of cpu
# -L: sample name 
# -u: multi-read-correction;reads that are mapped on multiple locations in the genome are calculated using weight read mapping

If your data (or sample) has replications you will need to use this file naming format with commas: [file_1 name].sort.bam,[file_1-1 name].sort.bam [file_2 name].sort.bam,[file_2-1 name].sort.bam [file_3 name].sort.bam,[file_3-1 name].sort.bam

To figure out what the sample names are for each corresponding file go back to the journal article or here. Each file should tell you whether it is a control or different sample(s).

Here is an example how mine looks:
cuffdiff -o diff_out -b /home/compbio/sra_data/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome_index/genome.fa -p 8 -L Ctrl,Ra,Rv -u /home/compbio/sra_data/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome_index/merge_assemble/merged.gtf SRR8238941.sort.bam,SRR8238942.sort.bam SRR8238943.sort.bam,SRR8238944.sort.bam

Since the data has 3 different samples (Control, Ra, Rv), stated in the journal article, I specified the sample name as Ctrl, Ra, Rv. Also notice how the bam files are grouped by same sample types. For instance SRR8238941.sort.bam,SRR8238942.sort.bam are grouped and separated by commas, not spaced, because they are control group samples.

Once the process has been completed, you will find your several output files in the diff_out directory.

If you are focused on gene expression level, the files you want to refer to are gene_exp.diff and genes.read_group_tracking. You may want to export these files into your local PC or copy these files and store it in a separate directory where you can work on it.

gene_exp.diff : contains gene epxression level for each sample, p & q values, etc of differentially expressed genes.
genes.read_group_tracking file contains expression levels of each replicate.

Overall Alignment Rate Calculation

Referring back to Step 3. Gene Mapping after alignment summary reports ( [file name].summary ) have been generated, the next step is to calculate the overall alignment rate. When taking a look at one of the alignment summary files you should see a report like below:

37537336 reads; of these:

37537336 (100.00%) were paired; of these:
986245 (2.63%) aligned concordantly 0 times
33036211 (88.01%) aligned concordantly exactly 1 time
3514880 (9.36%) aligned concordantly >1 times

— —

986245 pairs aligned concordantly 0 times; of these:
112422 (11.40%) aligned discordantly 1 time

— —

873823 pairs aligned 0 times concordantly or discordantly; of these:

1747646 mates make up the pairs; of these:
1077358 (61.65%) aligned 0 times
579867 (33.18%) aligned exactly 1 time
90421 (5.17%) aligned >1 times

98.56% overall alignment rate
<<End of Report>>

The following is the calculation how we derived 98.65% overall alignment rate, refer to bolded section in the report :

98.56% = {33036211 + 3514880 + 112422 + (579867 + 90421) / 2} / 37537336 x 100

Congrats, you have made it!
In the next article I will be demonstrating De Novo Assembly. In the meantime, carefully go over each step and try to run through the workflow with other dataset.

Thanks for reading my article, I hope you are enjoying them so far. Let me know in the comment below or consider buying me a coffee (no pressure): https://www.buymeacoffee.com/shortlongseq
I also publish my articles at https://shortlongseq.hashnode.dev/ and https://seqbioinformatics.substack.com/ if you prefer a different mode of newsletter delivery, they are free.