4. Alignment of reads to a reference genome

The metagenomic screenig of the shotgun library detected reads assigned to Yersinia pestis. The following step is to ascertain that these molecules are authentic. You can do that by mapping your pre-processed fastq files (merged and trimmed) to the Yersinia pestis CO92 strain reference sequence, available in the RefSeq NCBI database. Here, in order to obtain an optimal coverage for the subsequent variant call, we will run the alignment on a different fastq file that we prepared to simulate an enriched library.

4.1. Preparation of the reference sequence

4.1.1. Index the reference sequence with bwa

To align the reads to the reference sequence we will use the program BWA, in particular the BWA aln algorithm. BWA first needs to construct the FM-index for the reference genome, with the command BWA index. FM-indexing in Burrows-Wheeler transform is used to efficiently find the number of occurrences of a pattern within a compressed text, as well as locate the position of each occurrence. It is an essential step for querying the DNA reads to the reference sequence. This command generates five files with different extensions: amb, ann, bwt, pac, sa.

bwa index -a is reference.fasta

Note

The option -a indicates the algorithm to use for constructing the index. For genomes smaller than < 2 Gb use the is algorithm. For larger genomes (>2 Gb), use the bwtsw algorithm.

4.1.2. Create a reference dictionary

A dictionary file (dict) is necessary to run later in the pipeline GATK RealignerTargetCreator. A sequence dictionary contains the sequence name, sequence length, genome assembly identifier, and other information about the sequences. To create the dict file we use Picard.

picard CreateSequenceDictionary R= referece.fasta O= reference.dict

Note

In our server environment we can call Picard just by typing the program name. In other environments (including your laptop) you may have to call Picard by providing the full path to the java file jar of the program:

java -jar /path/to/picard.jar CreateSequenceDictionary R= referece.fasta O= ref.dict

4.1.3. Index the reference sequence with Samtools

The reference sequence has to be indexed in order to run later in the pipeline GATK IndelRealigner. To do that, we will use Samtools, in particular the tool samtools faidx, which enables efficient access to arbitrary regions within the reference sequence. The index file typically has the same filename as the corresponding reference sequece, with the extension fai appended.

samtools faidx reference.fasta

4.2. Alignment of the reads to the reference sequence

4.2.1. Alignment of pre-processed reads to the reference genome with BWA aln

To align the reads to the reference genome we will use BWA aln, which supports an end-to-end alignment of reads to the reference sequence. The alternative algorithm, BWA mem supports also local (portion of the reads) and chimeric alignments (resulting in a larger number of mapped reads than BWA aln). BWA aln is more suitable for aliging short reads, like expected for ancient DNA samples. The following comand will generate a sai file.

bwa aln reference.fasta filename.fastq.gz -n 0.1 -l 1000 > filename.sai

Some of the options available in BWA aln:

Option Function
-n number Maximum edit distance if the value is an integer. If the value is float the edit distance is automatically chosen for different read lengths (default=0.04)
-l integer Seed length. If the value is larger than the query sequence, seeding will be disabled.
-o integer Maximum number of gap opens. For aDNA, tolerating more gaps helps mapping more reads (default=1).

Note

  • Due to the particular damaged nature of ancient DNA molecules, carrying deaminations at the molecules ends, we deactivate the seed-length option by giving it a high value (e.g. -l 1000).
  • Here we are aligning reads to a bacterial reference genome. To reduce the impact of spurious alignemnts due to presence bacterial species closely related to the one that we are investigating, we will adopt stringent conditions by decreasing the maximum edit distance option (-n 0.1). For alignment of DNA reads to the human reference sequence, less stringent conditions can be used (-n 0.01 and -o 2).

Once obtained the sai file, we align the reads (fastq file) to the reference (fasta file) using BWA samse, to generate the alignment file sam.

bwa samse reference.fasta filename.sai filename.fastq.gz -f filename.sam

4.2.2. Converting sam file to bam file

For the downstream analyses we will work with the binary (more compact) version of the sam file, called bam. To convert the sam file in bam we will use Samtools view.

samtools view -Sb filename.sam > filename.bam

Note

The conversion from sam to bam can be piped (|) in one command right after the alignment step:

bwa samse reference.fasta filename.sai filename.fastq.gz | samtools view -Sb - > filename.bam

To view the content of a sam file we can just use standard commands like head, tail, less, while to view the content of a bam file (binary format of sam) we have to use Samtools view:

samtools view filename.bam

You may want to display on the screen one read/line (scrolling with the spacebar):

samtools view filename.bam | less -S

while to display just the header of the bam file:

samtools view -H filename.bam

4.2.3. Sorting and indexing the bam file

To go on with the analysis, we have to sort the reads aligned in the bam file by leftmost coordinates (or by read name when the option -n is used) with Samtools sort. The option -o is used to provide an output file name:

samtools sort filename.bam -o filename.sort.bam

The sorted bam files are then indexed with Samtools index. Indexes allow other programs to retrieve specific parts of the bam file without reading through each sequence. The following command generates a bai file, a companion file of the bam which contains the indexes:

samtools index filename.sort.bam

4.2.4. Adding Read Group tags and indexing bam files

A number of predefined tags may be appropriately assigned to specific set of reads in order to distinguish samples, libraries and other technical features. To do that we will use Picard. You may want to use RGLB (library ID) and RGSM (sample ID) tags at your own convenience based on the experimental design. Remember to call Picard from the path of the jar file.

picard AddOrReplaceReadGroups INPUT= filename.sort.bam OUTPUT= filename.RG.bam RGID=rg_id RGLB=lib_id RGPL=platform RGPU=plat_unit RGSM=sam_id VALIDATION_STRINGENCY=LENIENT

Note

  • In some instances, Picard may stop running and return error messages due to conflicts with sam specifications produced by BWA (e.g. “MAPQ should be 0 for unmapped reads”). To suppress this error and allow Picard to continue, we pass the VALIDATION_STRINGENCY=LENIENT options (default is STRICT).
  • Read Groups may be also added during the alignment with BWA using the option -R.

Once added the Read Group tags, we index again the bam file:

samtools index filename.RG.bam

4.2.5. Marking and removing duplicates

Amplification through PCR of genomic libraries leads to duplication formation, hence reads originating from a single fragment of DNA. The MarkDuplicates tool of Picard marks the reads as duplicates when the 5’-end positions of both reads and read-pairs match. A metric file with various statistics is created, and reads are removed from the bam file by using the REMOVE_DUPLICATES=True option (the default option is False, which simply ‘marks’ duplicate reads keep them in the bam file).

picard MarkDuplicates I= filename.RG.bam O= filename.DR.bam M=output_metrics.txt REMOVE_DUPLICATES=True VALIDATION_STRINGENCY=LENIENT &> logFile.log

Once removed the duplicates, we index again the bam file:

samtools index filename.DR.bam

4.2.6. Local realignment of reads

The presence of insertions or deletions (indels) in the genome may be responsible of misalignments and bases mismatches that are easily mistaken as SNPs. For this reason, we locally realign reads to minimize the number of mispatches around the indels. The realignment process is done in two steps using two different tools of GATK called with the -T option. We first detect the intervals which need to be realigned with the GATK RealignerTargetCreator, and save the list of these intevals in a file that we name target.intervals:

gatk -T RealignerTargetCreator -R reference.fasta -I filename.DR.bam -o target.intervals

Note

Like Picard, in some server environment you can call GATK just by typing the program name. In other environments (also in this server) you have to call GATK by providing the full path to the java jar file. Here, the absolute path to the file is ~/Share/Paleogenomics/programs/GenomeAnalysisTK.jar:

java -jar ~/Share/Paleogenomics/programs/GenomeAnalysisTK.jar -T RealignerTargetCreator -h

Warning

In version 4 of GATK the indel realigment tools have been retired from the best practices (they are unnecessary if you are using an assembly based caller like Mutect2 or HaplotypeCaller). To use the indel realignment tools make sure to install version 3 of GATK.

Then, we realign the reads over the intervals listed in the target.intervals file with the option -targetIntervals of the tool IndelRealigner in GATK:

java -jar ~/Share/Paleogenomics/programs/GenomeAnalysisTK.jar  -T IndelRealigner -R reference.fasta -I filename.RG.DR.bam -targetIntervals target.intervals -o filename.final.bam --filter_bases_not_stored

Note

  • If you want, you can redirect the standard output of the command into a log file by typing at the end of the command &> logFile.log
  • The option --filter_bases_not_stored is used to filter out reads with no stored bases (i.e. with * where the sequence should be), instead of failing with an error

The final bam file has to be sorted and indexed as previously done:

samtools sort filename.final.bam -o filename.final.sort.bam
samtools index filename.final.sort.bam

4.2.7. Generate flagstat file

We can generate a file with useful information about our alignment with Samtools flagstat. This file is a final summary report of the bitwise FLAG fields assigned to the reads in the sam file.

samtools flagstat filename.final.sort.bam > flagstat_filename.txt

Note

  • You could generate a flagstat file for the two bam files before and after refinement and see the differences.

  • You can decode each FLAG field assigned to a read on the Broad Institute website.

4.2.8. Visualization of reads alignment

Once generated the final bam file, you can compare the bam files before and after the refinement and polishing process (duplicates removal, realignment around indels and sorting). To do so, we will use the program IGV, in which we will first load the reference fasta file from Genomes –> Load genome from file and then we will add one (or more) bam files with File –> Load from file:

_images/igv-bam_bam.png

4.3. Create mapping summary reports

We will use Qualimap to create summary reports from the generated bam files. As mentioned in the website, Qualimap examines sequencing alignment data in sam/bam files according to the features of the mapped reads and provides an overall view of the data that helps to detect biases in the sequencing and/or mapping of the data and eases decision-making for further analysis.

qualimap bamqc -c -bam input.bam

Here are some screenshots of the outputs:

_images/qualimap.png

At this stage we have created different type of summary report using FastQC and Qualimap. To create a unique summary that integrate and compare all the generated reports, we will use MultiQC. If all the reports are in the same directory and its sub-directories, you can run MultiQC as follows:

multiqc .

A list of programs that generate output files recognized by MultiQC are availble here: https://github.com/ewels/MultiQC

Multiqc will create a summary report in html format that will let you compare all the summary reports for each of your samples:

_images/multiqc.png

4.4. Damage analysis and quality rescaling of the BAM file

To authenticate our analysis we will assess the post-mortem damage of the reads aligned to the reference sequence. We can track the post-portem damage accumulated by DNA molecules in the form of fragmentation due to depurination and cytosine deamination, which generates the typical pattern of C->T and G->A variation at the 5’- and 3’-end of the DNA molecules. To assess the post-mortem damage patterns in our bam file we will use mapDamage, which analyses the size distribution of the reads and the base composition of the genomic regions located up- and downstream of each read, generating various plots and summary tables. To start the analysis we need the final bam and the reference sequence:

mapDamage -i filename.final.sort.bam -r reference.fasta

mapDamage creates a new folder where the output files are created. One of these files, is named Fragmisincorporation_plot.pdf which contains the following plots:

_images/damage.png

If DNA damage is detected, we can run mapDamage again using the --rescale-only option and providing the path to the results folder that has been created by the program (option -d). This command will downscale the quality scores at positions likely affected by deamination according to their initial quality values, position in reads and damage patterns. A new rescaled bam file is then generated.

mapDamage -i filename.final.sort.bam -r reference.fasta --rescale-only -d results_folder

You can also rescale the bam file directly in the first command with the option --rescale:

mapDamage -i filename.final.sort.bam -r reference.fasta --rescale

Note

Another useful tool for estimating post-mortem damage (PMD) is PMDTools. This program uses a model incorporating PMD, base quality scores and biological polymorphism to assign a PMD score to the reads. PMD > 0 indicates support for the sequence being genuinely ancient. PMDTools filters the damaged reads (based on the selected score) in a separate bam file which can be used for downstream analyses (e.g. variant call).

The rescaled bam file has to be indexed, as usual.

samtools index filename.final.sort.rescaled.bam

4.5. Edit Distance

The edit distance defines the number of nucleotide changes that have to be made to one read sequence for it to be identical to the reference sequence. To be more confident about the quality and authenticity of your sequencing data, you need to align your reads againt your reference sequence and the genome of a closely related species. Here we will align our fastq file against the Yersinia pseudotuberculosis genome, following all the steps from 4.1 to 4.4. The edit distance must be lower when aligning the reads to the reference sequence compared to the closely related species.

_images/EditDistance.png

To calculate the edit distance we will use BAMStats, a tool for summarising Next Generation Sequencing alignments. The commands to generate summary-charts, including the edit distance is:

BAMStats -i filename.rescaled.bam -v html -d -q -o outfile.html
Option Function
-i filename SAM or BAM input file (must be sorted).
-v html/simple View option for output format (currently accepts ‘simple’ or ‘html’; default, simple).
-d If selected, edit distance statistics will also be displayed as a separate table (optional).
-q If selected, mapping quality (MAPQ) statistics will also be displayed as a separate table (optional).