Ensembl Bacteria pipeline
Ensembl Bacteria contains genomes from annotated INSDC records that are loaded into Ensembl multi-species databases using the INSDC annotation import pipeline. This document describes how this pipeline is used for bulk loading of bacterial genomes and provides tips for handling the data.
Ensembl Bacteria consists of completely sequenced genomes from eubacteria and archaea whose sequences have been deposited in INSDC. We query the ENA Genomes Assembly Database to find assembly sets for eubacteria and archaea that have at least 50 CDS annotations. Assembly sets are grouped into collections of up to 250 genomes. There are no strict rules guiding the assignment of genomes to particular collections, but a given genome will usually stay in the same collection. The collections of genomes are then passed to the INSDC annotation import pipeline for loading into Ensembl.
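The selection and grouping rules above can be illustrated with a minimal Python sketch. It is not the pipeline's actual code: the record layout (dictionaries with "accession" and "cds_count" keys), the function names and the use of integer collection indexes are all assumptions made for illustration. The only values taken from the text are the 50 CDS threshold and the 250 genome limit. Genomes that already have a collection keep it, and new genomes go to the first collection with room.

```python
from collections import Counter

MIN_CDS = 50                # threshold stated in the text
MAX_COLLECTION_SIZE = 250   # collection size limit stated in the text


def eligible(assemblies, min_cds=MIN_CDS):
    """Keep assembly records with at least min_cds CDS annotations.

    Each record is assumed (hypothetically) to be a dict with
    'accession' and 'cds_count' keys.
    """
    return [a for a in assemblies if a["cds_count"] >= min_cds]


def assign_collections(accessions, previous=None, max_size=MAX_COLLECTION_SIZE):
    """Map each accession to a collection index.

    Accessions found in 'previous' keep their old collection, so a
    genome tends to stay where it was. New accessions are placed in
    the lowest-numbered collection that still has room.
    """
    previous = previous or {}
    assignment = {acc: previous[acc] for acc in accessions if acc in previous}
    sizes = Counter(assignment.values())

    next_index = 0
    for acc in accessions:
        if acc in assignment:
            continue
        # Skip over collections that are already full.
        while sizes[next_index] >= max_size:
            next_index += 1
        assignment[acc] = next_index
        sizes[next_index] += 1
    return assignment
```

Calling assign_collections again for a later release, with the earlier result passed as previous, leaves existing genomes in place and only allocates the new ones. The sketch does not rebalance collections when genomes are removed.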
Genome names and identifiers
Genomes in Ensembl Bacteria are assigned the names and NCBI taxonomy identifiers obtained from the corresponding ENA Genomes Assembly Database record. However, these names and identifiers are not always unique, and exact names may change between releases. For some genomes, the most stable identifier is the unique assembly accession assigned by the ENA Genomes Assembly Database, e.g. GCA_000005845.1. Each assembly also has an externally assigned name, also obtained from the ENA Genomes Assembly Database, e.g. ASM584v1. This name can be used to disambiguate genomes whose organism names are identical (e.g. Borrelia afzelii PKo (ASM22283v1) vs. Borrelia afzelii PKo (ASM16559v2)).
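As an illustration of this disambiguation, the following Python sketch appends the assembly name only when an organism name occurs more than once. The function name and the record layout (dictionaries with "accession", "name" and "assembly_name" keys) are hypothetical and do not reflect the Ensembl schema or API. The output format follows the Borrelia afzelii PKo example above.

```python
from collections import Counter


def display_names(genomes):
    """Return a dict mapping assembly accession to a display name.

    Each genome is assumed (hypothetically) to be a dict with
    'accession', 'name' and 'assembly_name' keys. The assembly name
    is appended in parentheses only when the organism name is shared
    by more than one genome in the input.
    """
    name_counts = Counter(g["name"] for g in genomes)
    result = {}
    for g in genomes:
        if name_counts[g["name"]] > 1:
            result[g["accession"]] = f'{g["name"]} ({g["assembly_name"]})'
        else:
            result[g["accession"]] = g["name"]
    return result
```

Keying the result on the assembly accession reflects the point made above that accessions are the most stable identifiers, whereas names may change between releases.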
Dealing with identifier problems
For older records (and other special cases), locus_tag and/or protein_id qualifiers can be either missing or duplicated within or between records. This may reflect errors in annotation or submission, or systematic problems such as the reuse of locus tag prefixes between different entries. We try to resolve identifier problems as follows:
Other data sources
In addition to the annotation loaded using INSDC, the following data sources are also used to enrich the annotation of our genomes:
Due to the number of genomes included in Ensembl Bacteria, gene trees and whole genome alignments are not calculated for all genomes, but gene families are still populated based on InterPro annotation. A large number of bacterial genomes are included in the Ensembl Genomes pan-taxonomic compara, for which gene trees are built using selected genomes from across the taxonomy.