Data download

The data in Ensembl Genomes can be downloaded in bulk from the Ensembl Genomes FTP server in a variety of formats (see below). To facilitate storage and download, all datasets are compressed with GZip (*.gz), which is natively supported on most operating systems.
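As a minimal sketch of working with the compressed dumps, the Python snippet below reads a gzip file line by line without unpacking it to disk first. The helper name iter_gzip_lines and the in-memory sample are illustrative assumptions, not part of Ensembl Genomes; in practice you would pass the path of a downloaded *.gz file.

```python
import gzip
import io


def iter_gzip_lines(source):
    """Yield text lines (newline stripped) from a gzip path or binary file object."""
    with gzip.open(source, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")


# Illustrative in-memory data standing in for a downloaded *.gz file.
payload = gzip.compress(b"line one\nline two\n")
lines = list(iter_gzip_lines(io.BytesIO(payload)))
```

Because the file is streamed, this approach also suits dumps that are too large to hold in memory at once.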

Please note: Highly customisable tables of data can be downloaded using the BioMart data mining tool, which may be simpler than extracting information from our data dumps.

There is an FTP downloads page for each Ensembl Genomes division:

About the data

The following types of data dumps are available on the FTP site.

FASTA

FASTA format files containing sequence for gene, transcript and protein models. Since the FASTA format does not permit sequence annotation, these files are mainly intended for use with local sequence similarity search algorithms. Each directory has a README file with a detailed description of the header line format and the file naming conventions.

  • DNA - Masked and unmasked genome sequences associated with the assembly (contigs, chromosomes etc.).
  • cDNA - cDNA sequences for protein-coding genes.
  • Peptides - Protein sequences for protein-coding genes.
  • RNA - Non-coding RNA gene predictions.
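For readers who want to inspect these files programmatically, here is a minimal sketch of a FASTA reader in Python. It assumes only the general FASTA layout (a header line starting with ">" followed by sequence lines); the function name read_fasta and the sample records are made up for illustration, and the actual header fields are those described in each directory's README.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    # Emit the final record, if any.
    if header is not None:
        yield header, "".join(chunks)


# Illustrative records; real headers follow the README conventions.
sample = [">seq1 demo", "ACGT", "TTGA", ">seq2", "MKV"]
records = list(read_fasta(sample))
```

The function accepts any iterable of lines, so it can be fed directly from an open text-mode gzip handle.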

Flatfile

The sequence and annotated features provided by Ensembl Genomes are available in the following flatfile formats: EMBL and GenBank.

Note that EMBL and GenBank files are not available for Ensembl Bacteria.

MySQL

All Ensembl Genomes MySQL databases are available in text format, as are the SQL table definition files. These can be imported into any SQL database for a local installation of a mirror site. Generally, the FTP directory tree contains one directory per database. For more information about these databases and their Application Programming Interfaces (APIs), see the API section.

GTF (General Transfer Format)

Gene sets for each genome. These files include annotations of both coding and non-coding genes. This file format is described here.
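A GTF row has nine tab-separated columns, with the ninth holding key "value"; pairs. The following Python sketch splits one such row into a dictionary. The function name parse_gtf_line and the sample row are illustrative assumptions; the attribute keys present in a given file depend on the dump.

```python
import re

GTF_COLUMNS = ("seqname", "source", "feature", "start", "end",
               "score", "strand", "frame")
# Matches: key "quoted value";  or  key bare_value;
ATTRIBUTE_PATTERN = re.compile(r'(\S+)\s+(?:"([^"]*)"|([^";\s]+))\s*(?:;|$)')


def parse_gtf_line(line):
    """Parse one GTF data line into a dict with an 'attributes' sub-dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    record = dict(zip(GTF_COLUMNS, fields[:8]))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes = {}
    for match in ATTRIBUTE_PATTERN.finditer(fields[8]):
        key, quoted, bare = match.groups()
        # A repeated key overwrites the earlier value in this sketch.
        attributes[key] = quoted if quoted is not None else bare
    record["attributes"] = attributes
    return record


# Made-up example row, not taken from a real dump.
sample = '1\tdemo\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; exon_number 2;'
record = parse_gtf_line(sample)
```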

GFF3 (General Feature Format v3)

Gene and feature sets for each genome. These files include annotations of both coding and non-coding genes. This file format is described here.
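In GFF3 the ninth column uses a different convention from GTF: semicolon-separated key=value pairs, comma-separated multiple values, and percent-encoding for reserved characters. A minimal Python sketch of decoding that column is shown here; the function name parse_gff3_attributes and the sample string are hypothetical.

```python
from urllib.parse import unquote


def parse_gff3_attributes(column):
    """Decode a GFF3 attributes column into {key: [values, ...]}."""
    attributes = {}
    for pair in column.strip().split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        key, _, value = pair.partition("=")
        # Split on commas first, then undo percent-encoding per value.
        attributes[unquote(key)] = [unquote(v) for v in value.split(",")]
    return attributes


# Made-up attribute string for illustration.
parsed = parse_gff3_attributes("ID=gene:G1;Name=demo%3Bgene;Parent=a,b")
```

Every value is returned as a list so that single-valued and multi-valued attributes can be handled uniformly.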

GVF (Genome Variation Format)

Variation features for each genome with variation data. This file format is described here.

VCF (Variant Call Format)

Variation features for each genome with variation data. This file format is described here.
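As a rough illustration of the VCF layout, the sketch below parses the eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) of a single data line in Python. The function name parse_vcf_line, the sample line and its INFO keys are invented for the example; a dedicated VCF library is likely a better choice for production use.

```python
def parse_vcf_line(line):
    """Parse one VCF data line into a dict; return None for header lines."""
    if line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 columns, got {len(fields)}")
    chrom, pos, var_id, ref, alt, qual, filt, info = fields[:8]
    info_map = {}
    if info != ".":
        for item in info.split(";"):
            key, sep, value = item.partition("=")
            # Keys without '=' are flags and are stored as True.
            info_map[key] = value if sep else True
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": var_id,
        "ref": ref,
        "alt": alt.split(","),
        "qual": qual,
        "filter": filt,
        "info": info_map,
    }


# Made-up variant line; the INFO keys are hypothetical.
sample = "1\t12345\tdemo_var\tA\tG,T\t.\t.\tTYPE=SNV;FLAG"
variant = parse_vcf_line(sample)
```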

EMF flatfile dumps (variation and comparative data)

Alignments of resequencing data are available for several species as Ensembl Multi Format (EMF) flat file dumps. The accompanying README file describes the file format.

The same format is also used to dump whole-genome multiple alignments (where available), as well as the gene-based multiple alignments and phylogenetic trees used to infer Ensembl orthologues and paralogues. These files are available in the ensembl_compara database, which can be found in the compara directory.

Note: the EMF directories for compara also contain trees in PhyloXML, NH and Newick formats.

TSV

Tab-separated files containing selected data for individual species and from comparative genomics, provided for convenience. All files are gzipped and contain header lines detailing the contents of each file. The files available are:

  • Per species
    <species_name>...uniprot.tsv - Provides mappings from Gene, Transcript and Translation stable identifiers to UniProtKB accessions with reports as to the % identity of the hit where applicable.
  • Per compara
    Compara.homologies..tsv - Homologies and identity between proteins in different species retrieved from Compara.
    Compara.stableid_to_genetreeid..tsv - Provides mappings from compara gene tree stable identifiers to the component gene and translation stable identifiers.
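Since these files are gzipped and tab-separated, they can be loaded with the Python standard library alone. The sketch below assumes a single header row naming the columns; the function name read_tsv_gz and the column names in the sample are hypothetical, so check the header of the real file before relying on specific keys.

```python
import csv
import gzip
import io


def read_tsv_gz(source):
    """Yield one dict per row from a gzipped TSV path or binary file object."""
    with gzip.open(source, mode="rt", encoding="utf-8", newline="") as handle:
        # Assumes the first line is a single header row of column names.
        yield from csv.DictReader(handle, delimiter="\t")


# In-memory stand-in for a downloaded file; column names are made up.
payload = gzip.compress(
    b"gene_stable_id\tuniprot_accession\n"
    b"G1\tP00001\n"
    b"G2\tP00002\n"
)
rows = list(read_tsv_gz(io.BytesIO(payload)))
```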

Metadata

Data files containing metadata for Ensembl Genomes from release 15 onwards can be found in the root directory or the appropriate division directory of each release, e.g. ftp://ftp.ensemblgenomes.org/pub/current/ or ftp://ftp.ensemblgenomes.org/pub/current/fungi/.

The following files are provided: