Melbourne Bioinformatics logo
GVL logo



Introduction to Variant Calling using Galaxy


Tutorial Overview

In this tutorial we cover the concepts of detecting small variants (SNVs and indels) in human genomic DNA using a small set of reads from chromosome 22.

Note: The tutorial is designed to introduce the tools, datatypes and workflow of variation detection. We filter the variations manually to understand what is actually happening in variant calling. In practice the datasets would be much larger and you would use more sophisticated tools to call, annotate and filter variants.


Learning Objectives

At the end of this tutorial you should:

  • Be familiar with the FASTQ format and base quality scores
  • Be able to align reads to generate a BAM file and subsequently generate a pileup file
  • Be able to visualise BAM files using IGV and identify likely SNVs and indels by eye

Background

Some background reading material - background

Where is the data in this tutorial from?

The workshop is based on analysis of short read data from the exome of chromosome 22 of a single human individual. There are one million 76bp reads in the dataset, produced on an Illumina GAIIx from exome-enriched DNA. This data was generated as part of the 1000 genomes Genomes project.


1. Preparation

  1. Make sure you have an instance of Galaxy ready to go.
  2. Import data for the tutorial.

    • In this case, we are uploading a FASTQ file.
    • Method 1
      Paste/Fetch data from a URL to Galaxy.
      1. In the Galaxy tools panel (left), click on Get Data and choose Upload File.
      2. Click Paste/Fetch data and paste the following URL into the box
        https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/VariantDet_BASIC/NA12878.GAIIx.exome_chr22.1E6reads.76bp.fastq
      3. Select Type as fastqsanger and click Start.
      4. Once the upload status turns green, it means the upload is complete. You should now be able to see the file in the Galaxy history panel (right).

    Alternatively, if you have a local file to upload (For the purpose of this tutorial we can stick with the option above):

    • Method 2
      Upload data to Galaxy.
      1. In the Galaxy tools panel (left), click on Get Data and choose Upload File.
      2. From Choose local file select the downloaded FASTQ file. Select Type as fastqsanger and click Start.
      3. Once the upload status turns green, it means the upload is complete. You should now be able to see the file in the Galaxy history panel (right).

Summary:
So far, we have started a Galaxy instance, got hold of our data and uploaded it to the Galaxy instance.
Now we are ready to perform our analysis.


2. Quality Control

The first step is to evaluate the quality of the raw sequence data. If the quality is poor, then adjustments can be made - e.g. trimming the short reads, or adjusting your expectations of the final outcome!

1. Take a look at the FASTQ file

  1. Click on the eye icon to the top right of the fastq file to view the a snippet of the file.
  2. Note that each read is represented by 4 lines:
    • read identifier
    • short read sequence
    • separator
    • short read sequence quality scores
e.g.
identifier:    @61CC3AAXX100125:7:72:14903:20386/1
read sequence: TTCCTCCTGAGGCCCCACCCACTATACATCATCCCTTCATGGTGAGGGAGACTTCAGCCCTCAATGCCACCTTCAT
separator:     +
quality score: ?ACDDEFFHBCHHHHHFHGGCHHDFDIFFIFFIIIIHIGFIIFIEEIIEFEIIHIGFIIIIIGHCIIIFIID?@<6

For more details see FASTQ.

2. Assessing read quality from the FASTQ files

  1. From the Galaxy tools panel, select
    NGS: QC and manipulation > FastQC: Comprehensive QC

    The input FASTQ file will be selected by default. Keep the other defaults and click execute

Tip Tip: Note the batch processing interface of Galaxy:

  • grey = waiting in queue
  • yellow = running
  • green = finished
  • red = tried to run and failed

When the job has finished, click on the eye icon to view the newly generated data (in this case a set of quality metrics for the FASTQ data). Look at the various quality scores. The data looks pretty good - high Per base sequence quality (avg. above 30).

Note that the 'Sequence Duplication Levels' are marked as high. Normally we would run another tool to remove duplicates (technical PCR artifacts) but for the sake of brevity we will omit this step.


3. Alignment to the reference - (FASTQ to BAM)

The basic process here to map individual reads - from the input sample FASTQ file - to a matching region on the reference genome.

1. Align the reads with BWA

  1. Map/align the reads with the BWA tool to Human reference genome 19 (hg19) UCSC hg19. From the Galaxy tools panel, select

    NGS: Mapping > Map with BWA-MEM [3-5mins]

    From the options:
    Using reference genome: set to hg19.
    Single or Paired-end reads: set to Single

    Keep other options as default and click execute

    Note: This is the longest step in the workshop and will take a few minutes, possibly more depending on how many people are also scheduling mappings

  2. Sort the BAM file. From the Galaxy tools panel, select

    NGS: SAM Tools > Sort BAM dataset

    From the options:
    BAM File: set to the output from the alignment BAM file
    Sort by: Chromosomal coordinates

    Keep other options as default and click execute

2. Examine the alignment

  1. To examine the output sorted BAM file, we need to first convert it into readable SAM format. From the Galaxy tools panel, select

    NGS: SAM Tools > BAM-to-SAM

    From the options:
    BAM File to Convert: set to the output to the sorted BAM file

    Keep other options as default and click execute

  2. Examine the generated Sequence Alignment Map (SAM) file.

    • Click the eye icon next to the newly generated file
    • Familiarise yourself with the SAM format
    • Note that some reads have mapped to non-chr22 chromosomes (see column 3).

    This is the essence of alignment algorithms - the aligner does the best it can, but because of compromises in accuracy vs performance and repetitive sequences in the genome, not all the reads will necessarily align to the ‘correct’ sequence or could this be suggesting the presence of a structural variant?

    Tip Tip: Galaxy auto-generates a name for all outputs. Therefore, it is advisable to choose a more meaningful name to these outputs.

    This can be done as follows:

    Click on the pencil icon (edit attributes) and change Name e.g. Sample.bam or Sample.sam or Sample.sorted.bam etc.

3. Assess the alignment data

We can generate some mapping statistics from the BAM file to assess the quality of our alignment.

  1. Run IdxStats. From the Galaxy tools panel, select

    NGS: SAM Tools > IdxStats

    From the options:
    The BAM: select the sorted BAM file

    Keep other options as default and click execute

    IdxStats generates an tab-delimited output with four columns. Each line consists of a reference sequence name (e.g. a chromosome), reference sequence length, number of mapped reads and number of placed but unmapped reads.

    We can see that most of the reads are aligning to chromosome 22 as expected.

  2. Run Flagstat. From the Galaxy tools panel, select

    NGS: Sam Tools > Flagstat

    From the options:
    The BAM: select the sorted BAM file

    Keep other options as default and click execute

    Note that in this case the statistics are not very informative. This is because the dataset has been generated for this workshop and much of the noise has been removed (and in fact we just removed a lot more noise in the previous step); also we are using single ended read data rather than paired-end so some of the metrics are not relevant.


4. Visualize the BAM file.

Download the sorted BAM file. From the Galaxy history panel,

Click on the sorted BAM file.
Click on the disk icon > Download dataset
Click on the disk icon > Download bam_index

This will result in two files being downloaded:

  • a bam file (*.bam), and
  • a bam index file (*.bai)

Now open [IGV] browser and open the downloaded BAM file.

  • Select chromosome 22 using the drop-down menu on the top toolbar.
  • IGV will be too zoomed out to view any reads.
  • Zoom in further or type a gene of interest and now you should see aligned reads.

IGV view 1

Try looking at region chr22:36,006,744-36,007,406

Can you see a few variants?

IGV view 1

Don't close IGV yet as we'll be using it later.


5. Calling single nucleotide variations (SNVs)

1. Generate a pileup file

Generate a pileup file from the aligned reads (sorted bam file previous step 2). A pileup is essentially a column wise representation of the aligned read - at the base level - to the reference. The pileup file summarises all data from the reads at each genomic region that is covered by at least one read.

From the Galaxy tools panel, select

NGS: SAMtools > Generate Pileup

From the options:
Call consensus according to MAQ model = Yes
This generates a called 'consensus base' for each chromosomal position.

Keep other options as default and click execute

For each output file, Galaxy tries to assign a datatype attribute to every file. In this case, you'll need to manually assign the datatype to pileup.

  1. First, rename the output to something more meaningful by clicking on the pencil icon
  2. To change the datatype, click on the Datatype link from the top tab while you're editing attributes.
  3. Although it is a tabular file, for downstream processing we want to tell Galaxy that this is a pileup file.
  4. From the drop-down, select Pileup and click save.

Tip Tip: Get familiar with the Pileup format. The pileup file we generated has 10 columns:
1. chromosome
2. position
3. current reference base
4. consensus base from the mapped reads
5. consensus quality
6. SNV quality
7. maximum mapping quality
8. coverage
9. bases within reads
10. quality values

Further information on (9):
Each character represents one of the following (the longer this string, higher the coverage):

  • . = match on forward strand for that base
  • , = match on reverse strand
  • ACGTN = mismatch on forward
  • acgtn = mismatch on reverse
  • +[0-9]+[ACGTNacgtn]+' = insertion between this reference position and the next
  • -[0-9]+[ACGTNacgtn]+' = deletion between this reference position and the next
  • ^ = start of read
  • $ = end of read
  • BaseQualities = one character per base in ReadBases, ASCII encoded Phred scores

2. Filtering pileup to get a list of SNVs

The pileup from the above steps described all aligned positions. For the purpose of calling SNVs, we are after:

  • aligned positions where the consensus base is different from the reference base
  • the different consensus base is covered by minimum amount of reads (e.g. at least 10 reads)
  • removing sites with variants with poor quality base scores

We can get these variants by filtering.

  1. From the Galaxy tools panel, select:

    NGS: SAM Tools > Filter Pileup

    From the options:
    which contains = Pileup with ten columns (with consensus)
    Do not report positions with coverage lower than = 10
    Convert coordinates to intervals = Yes

    Keep other options as default and click execute

    How many variants (intervals) do you see?

    If you see ~30,000 variants - do not be surprised. Our filtering criteria was not very stringent.

    Tip Tip: How many SNVs should we be expecting?
    Generally you can expect 1 SNV per 1000 bases and the exome of chromosome 22 is about 6-700kb (~2% of the 33500kb length of chr22). Therefore, we should expect to see ~700 SNVs.

  2. To further filter the dataset from above, we can use the Filter and sort tool.

    From the Galaxy tools panel, select

    Filter and Sort > Filter

    From the options:
    With following condition = c7>50 (filter on SNV quality (column 7)).
    This is a score generated by a combination of reads and base quality etc.

    Keep other options as default and click execute

    Note that the number of SNVs has gone down by ~95%, and we're down to 1059, which is not too far off the expected number.

    Tip Tip: Interval file description
    1. Chromosome
    2. Start position (0-based)
    3. End position (1-based)
    4. Reference base at that position
    5. Coverage (# reads aligning over that position)
    6. Bases within reads
    7. Quality values (phred33 scale, see Galaxy wiki for more)
    8. Number of A variants
    9. Number of C variants
    10. Number of G variants
    11. Number of T variants
    12. Quality adjusted coverage
    13. Total number of deviants (if Convert coordinates to intervals? is set to yes)

    To visualize these variants we need to go back to IGV or any other genome browser of your choice.

  3. Convert the variants file to BED format.

    • Start with renaming the variant file to e.g. 'NA12878.filtered.snps'
    • Change the filetype (under the pencil - see tips above) to bed.

    Tip Tip: Bed files are similar to Interval files, but follow a stricter structural format

    • Download the BED file by clicking on the disc icon and saving to local disk.
    • Assuming the alignment (bam file) is open, load the downloaded bed file.
    • Select chr22 from the drop-down to see all SNVs.
    • The track - NA12878.filtered.snps.bed - now shows an zoomed out view of all SNVs.
    • Take a look again at the same region as earlier:
    • chr22:36,006,744-36,007,406

    IGV SNV view

    Is this a heterozygous variant?

    • Now zoom into the region chr22:35,947,503-35,947,667.

    Is this a homozygous variant?

    Don't close IGV just yet as we'll be using it later.


6. Calling small insertions and deletions (indels)

  1. Generate a pileup file from the aligned reads containing only indels. From the Galaxy tools panel, select

    NGS: SAM Tools > Generate pileup

    From the options:
    Select the BAM file to generate the pileup file for = sorted bam file
    Whether or not to print only output pileup lines containing indels = Print only lines containing indels
    Call consensus according to MAQ model? = yes

    Keep other options as default and click execute

  2. Filtering pileup to get a list of indels

    Unlike the previous pileup, this file will only report putative indels. Like previously, we will need to filter the ~4000 indels down. Therefore, we will filter on:

    • removing poor quality bases (column 7)
    • the indel is covered by at least e.g. 10 reads (column 11)

    From the Galaxy tools panel, select

    Filter and Sort > Filter

    From the options:
    With following condition = c7>50 and c11>20

    keep other options as default and click execute

    This filtering strategy reduces the call set by 83% and we are left with ~700 small indels.

  3. To visualize these indels, we need to convert from tabular to bed. This is two step process.

    Firstly,

    • Click the pencil icon.
    • Under the Datatype tab: choose Interval and save
    • Under Attributes tab: make sure End column = 2

    Next, we can convert the Interval file to BED format.

    • Click the pencil icon.
    • Under Convert Format tab: choose Convert Genomic Interval to BED.
    • Rename this to indels.filtered
  4. Download the bed file and open it using IGV genome browser.

    Try looking at region chr22:31,854,409-31,854,460

    Can you see any indels?

    Also try looking at other regions by zooming out.

IGV indel