Short-read sequence alignment

The Sequence Read Archive (SRA) is an international public archive of raw short-read sequencing data from next-generation sequencing platforms, established under the guidance of the International Nucleotide Sequence Database Collaboration (INSDC). The SRA metadata storage model can be found here.

Use of SRA in Ensembl Genomes

In Ensembl Genomes, expression data from various studies are mapped to the genome, and the resulting BAM files are configured for display in the genome browser. Re-sequencing data from the SRA are used for SNP calling and displayed as variation data in Ensembl Genomes.

Generation of alignments

The reads from the SRA are downloaded from the European Nucleotide Archive (ENA) in FASTQ format and mapped to the genome using GSNAP or the Burrows-Wheeler Aligner (BWA). The mapping results are stored in Sequence Alignment/Map (SAM) format. SAMtools is then used to convert them into the binary format (BAM).
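
The mapping and conversion steps described above can be sketched as a short Python script that assembles the command lines. This is a minimal sketch under stated assumptions, not the production pipeline: it assumes BWA with the `bwa mem` sub-command (the text does not say which BWA algorithm is used), it adds coordinate sorting and indexing of the BAM (commonly needed for browser display, but not mentioned above), and all file names and the `alignment_steps` / `run_steps` helpers are hypothetical.

```python
import subprocess


def alignment_steps(reference, fastq_files, prefix, threads=4):
    """Return (command, stdout_file) pairs for mapping one set of reads.

    stdout_file is None when the tool writes its own output files.
    All paths are illustrative placeholders.
    """
    sam = f"{prefix}.sam"
    bam = f"{prefix}.bam"
    sorted_bam = f"{prefix}.sorted.bam"
    return [
        # Index the reference genome (only needed once per genome).
        (["bwa", "index", reference], None),
        # Map the FASTQ reads; bwa mem writes SAM to standard output.
        (["bwa", "mem", "-t", str(threads), reference, *fastq_files], sam),
        # Convert SAM to its binary form, BAM.
        (["samtools", "view", "-b", "-o", bam, sam], None),
        # Assumption: sort by coordinate and index the result.
        (["samtools", "sort", "-o", sorted_bam, bam], None),
        (["samtools", "index", sorted_bam], None),
    ]


def run_steps(steps):
    """Run each step in order, stopping at the first failure."""
    for command, stdout_file in steps:
        if stdout_file is None:
            subprocess.run(command, check=True)
        else:
            with open(stdout_file, "w") as out:
                subprocess.run(command, check=True, stdout=out)
```

Calling `alignment_steps("genome.fa", ["reads_1.fastq", "reads_2.fastq"], "sample")` returns the list of commands without running anything, so it can be inspected first. Passing that list to `run_steps` executes the commands and requires `bwa` and `samtools` to be installed.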

SNP Calling

BAM files obtained by mapping the re-sequencing data with the above method are also used for SNP calling. SAMtools uses the mapped data to call sequence variants, which are stored in Variant Call Format (VCF). The VCF is then imported into the Ensembl Variation database; the documentation can be found here.
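
To illustrate what a VCF file contains, the following minimal sketch reads the fixed leading columns of VCF records (CHROM, POS, ID, REF, ALT, QUAL) and keeps only single-nucleotide variants. It is not the Ensembl import code: the `Variant` type, the `parse_vcf` and `is_snp` helpers, and the three example records are invented for illustration.

```python
from typing import Iterable, Iterator, List, NamedTuple, Optional


class Variant(NamedTuple):
    chrom: str
    pos: int              # 1-based position on the reference
    id: str
    ref: str
    alts: List[str]       # one or more alternative alleles
    qual: Optional[float]


def parse_vcf(lines: Iterable[str]) -> Iterator[Variant]:
    """Yield one Variant per data line, skipping '#' header lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        chrom, pos, vid, ref, alt, qual = line.split("\t")[:6]
        yield Variant(chrom, int(pos), vid, ref, alt.split(","),
                      None if qual == "." else float(qual))


def is_snp(variant: Variant) -> bool:
    """True if REF and every ALT allele is a single nucleotide."""
    bases = {"A", "C", "G", "T"}
    return (variant.ref.upper() in bases
            and all(a.upper() in bases for a in variant.alts))


# Invented example records: a SNP, a deletion, and a multi-allelic SNP.
EXAMPLE_VCF = """##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
1\t1042\t.\tA\tG\t50\tPASS\tDP=18
1\t2310\t.\tCT\tC\t33.5\tPASS\tDP=12
2\t77\t.\tG\tA,T\t.\t.\tDP=9
"""

snps = [v for v in parse_vcf(EXAMPLE_VCF.splitlines()) if is_snp(v)]
for v in snps:
    print(v.chrom, v.pos, v.ref, ",".join(v.alts))
```

The deletion at position 2310 is filtered out because its REF allele spans two bases. The two remaining records are reported as SNPs.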

Re-submission to ENA

The BAM and VCF files obtained using the above method are then resubmitted to the European Nucleotide Archive as 'Analysis' objects, with reference to the original Study and the assembly version of the genome to which the reads were mapped.

Data Visualization

The variation data obtained from mapping SRA data can be visualized using the Ensembl Variation Browser and API. The expression data mapped in BAM format can be visualized in the Ensembl Genomes browser as custom tracks.