RNA gene annotation

Data Sources

If the genesets that are imported by Ensembl Genomes include RNA genes as well as protein-coding genes, these RNA genes will be displayed without modification. However, often RNA genes are not provided, or only a subset of biotypes will have been annotated (e.g. tRNA genes). In the case of missing or partial RNA gene annotation, Ensembl Genomes performs RNA gene prediction.

We are currently in transition between two versions of our RNA gene annotation pipeline. As of release 36, Ensembl Metazoa uses the "New" method below; the other divisions still use the "Old" method, and will move to the "New" method in subsequent releases.

Ensembl Genomes annotation pipeline: New

The RNA gene annotation uses three sources:

  • miRBase (release 21) [1] provides genomic locations of precursor microRNA genes for a subset of species. These are loaded without any filtering.
  • tRNAscan-SE (version 1.23) [2] is used to predict tRNA genes. tRNA annotation is sometimes difficult in repeat-rich species, leading to over-prediction with the default tRNAscan score threshold (=40). We therefore analyse the distribution of predicted tRNA genes for each division, to determine a score threshold that provides a balance between sensitivity and specificity. For Ensembl Metazoa we use a score threshold of 65, and tRNA genes are not annotated if they would overlap either repeat regions or protein-coding exons.
  • Rfam (version 12.2) [3] covariance model alignments are used to annotate RNA genes of all classes except tRNA. The annotation relies on taxonomically appropriate alignments made with cmscan, from the Infernal (version 1.1.2) software suite [4]; these alignments are available as a separate track in the genome browser.
    Each alignment must pass a set of filters for it to be annotated as an RNA gene:
    • Rfam is not just a database of RNA genes; it also contains structures for regulatory elements, for example. Alignments to families with non-gene biotypes are therefore filtered out of the potential gene set.
    • E-values must be lower (i.e. more significant) than a certain threshold, set on a per-division basis. The threshold for Ensembl Metazoa is 1e-6.
    • Some RNA sequences are inherently palindromic, due to the base pairing required for hairpin structures; this means that credible hits are often found on both strands. However, one hit usually has a much better E-value than the other, so in the case of overlapping alignments (on either the same or different strands), we use the E-value to determine which alignment to convert to a gene.
    • RNA genes are not annotated if they would overlap either protein-coding exons or RNA genes from miRBase.
    • It is possible for alignments to partially represent an RNA structure, if there is an assembly gap, for example. Such alignments would make partial genes, so are not annotated as RNA genes.
    • Skewed GC content makes false positives more likely, because the chance of spurious base pairing increases. RNA genes are not annotated in areas of high GC or AT content.
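
The Rfam alignment filters listed above can be sketched in a few lines of Python. This is only an illustration, not the pipeline's actual code: the Region and Alignment classes, the function names and the set of gene biotypes are all hypothetical, and the GC content filter is left out for brevity. Only the 1e-6 E-value threshold comes from the description above; treating it as a strict "less than" comparison is an assumption.

```python
from dataclasses import dataclass

# Only the 1e-6 value (Ensembl Metazoa) comes from the text above.
EVALUE_THRESHOLD = 1e-6
# Illustrative subset of gene biotypes; the real list is an assumption.
GENE_BIOTYPES = {"snRNA", "snoRNA", "rRNA", "miRNA"}


@dataclass(frozen=True)
class Region:
    """A hypothetical genomic interval (1-based, inclusive coordinates)."""
    seq_region: str
    start: int
    end: int


@dataclass(frozen=True)
class Alignment(Region):
    """A hypothetical cmscan alignment with the fields the filters need."""
    strand: int = 1
    biotype: str = ""
    evalue: float = 1.0
    partial: bool = False  # True if only part of the model is aligned


def overlaps(a, b):
    """True if two features share at least one base, on either strand."""
    return a.seq_region == b.seq_region and a.start <= b.end and b.start <= a.end


def select_genes(alignments, excluded=()):
    """Return the alignments that pass the filters, sorted by location.

    `excluded` holds regions that RNA genes must not overlap, such as
    protein-coding exons or miRBase genes.
    """
    candidates = [
        a for a in alignments
        if a.biotype in GENE_BIOTYPES
        and a.evalue < EVALUE_THRESHOLD
        and not a.partial
        and not any(overlaps(a, x) for x in excluded)
    ]
    kept = []
    # Visit the best (lowest) E-value first, so that a weaker alignment
    # overlapping an already accepted one is discarded.
    for a in sorted(candidates, key=lambda a: a.evalue):
        if not any(overlaps(a, k) for k in kept):
            kept.append(a)
    return sorted(kept, key=lambda a: (a.seq_region, a.start))
```

Because overlapping candidates are resolved greedily from the best E-value downwards, a strong hit on one strand suppresses the weaker palindromic hit on the other strand.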

Ensembl Genomes annotation pipeline: Old

For all ncRNA genes except tRNA and rRNA genes, models are predicted by aligning the genomic sequence against Rfam sequences [5] using BLASTN. The BLAST hits are then used to seed Infernal searches of the aligned regions with the corresponding Rfam covariance models. This reduces the search space, because scanning the entire genome with all the Rfam covariance models would be extremely CPU-intensive.
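
The idea of seeding a slower search with fast BLAST hits can be illustrated as follows. This is a sketch under stated assumptions, not the pipeline's code: the function name, the flanking padding and its default value are hypothetical, and the hits are assumed to be (start, end) pairs on one sequence for one Rfam family.

```python
def seed_regions(hits, padding=100, seq_length=None):
    """Merge (start, end) hits into padded, non-overlapping search windows.

    Coordinates are 1-based and inclusive. `padding` is a hypothetical
    flank added to each hit so that the covariance model search can
    extend beyond the edges of the BLAST alignment.
    """
    windows = []
    for start, end in sorted(hits):
        start = max(1, start - padding)
        end = end + padding
        if seq_length is not None:
            end = min(seq_length, end)
        if windows and start <= windows[-1][1]:
            # Overlaps the previous window: extend it rather than add a new one.
            windows[-1][1] = max(windows[-1][1], end)
        else:
            windows.append([start, end])
    return [tuple(w) for w in windows]
```

Only the returned windows would then need to be searched with the covariance model, rather than the whole genome.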

tRNA genes are predicted using tRNAscan-SE [2]. Version 1.23 of the program was used, configured for the appropriate superkingdom (superregnum).

rRNA genes are predicted using RNAmmer [6]. Version 1.2 of the program was used, configured for the appropriate superkingdom (superregnum).

Stable identifiers

Historically, RNA genes annotated by Ensembl Genomes have had identifiers that encoded the division and species name, e.g. EMBMOG00000000027, where "EM" means Ensembl Metazoa, "BMO" means Bombyx mori, and "G" indicates a gene (as opposed to "T" for a transcript). Starting with Ensembl Metazoa in release 36 (and other divisions in subsequent releases) we are switching to a simpler format, applicable across all divisions, with an "ENSRNA" prefix, where "ENS" is short for Ensembl, and "RNA" is, hopefully, self-explanatory. For example, ENSRNA022711053 is a gene stable identifier, and transcripts of that gene have a suffix consisting of "T" for transcript and a number, e.g. ENSRNA022711053-T1.
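
As an illustration of the two formats, the following Python sketch recognises both kinds of identifier. The function name and the returned dictionary are hypothetical. The old-style pattern (a two-letter division code, a three-letter species code, "G" or "T", then digits) is inferred from the single example above and may not hold for every division or species.

```python
import re

# New-style IDs: "ENSRNA" plus digits, with an optional "-T<number>"
# suffix for transcripts.
NEW_ID = re.compile(r"^(ENSRNA\d+)(?:-T(\d+))?$")

# Old-style IDs: layout assumed from the one example EMBMOG00000000027.
OLD_ID = re.compile(r"^([A-Z]{2})([A-Z]{3})([GT])(\d+)$")


def parse_stable_id(stable_id):
    """Return a dict describing the identifier, or None if unrecognised."""
    match = NEW_ID.match(stable_id)
    if match:
        gene_id, transcript_number = match.groups()
        return {
            "format": "new",
            "type": "transcript" if transcript_number else "gene",
            "gene_id": gene_id,
            "transcript_number": int(transcript_number) if transcript_number else None,
        }
    match = OLD_ID.match(stable_id)
    if match:
        division, species, kind, _number = match.groups()
        return {
            "format": "old",
            "type": "gene" if kind == "G" else "transcript",
            "division": division,
            "species": species,
        }
    return None
```

For instance, parse_stable_id("ENSRNA022711053-T1") reports a new-format transcript belonging to gene ENSRNA022711053.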

Stable identifiers have been mapped, so that the ID History Converter tool can be used to find new IDs which correspond to old IDs (and vice versa). However, note that the new pipeline for RNA gene annotation is more rigorous in its filtering than the old pipeline, so old, unsupported genes may have been deleted and have no mapping to a new ID.

References

  1. Kozomara A and Griffiths-Jones S (2014) miRBase: annotating high confidence microRNAs using deep sequencing data Nucl. Acids Res. 42 D68-73

  2. Lowe TM and Eddy SR (1997) tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence Nucl. Acids Res. 25 955-964

  3. Nawrocki EP et al. (2015) Rfam 12.0: updates to the RNA families database Nucl. Acids Res. 43 D130-37

  4. Nawrocki EP and Eddy SR (2013) Infernal 1.1: 100-fold faster RNA homology searches Bioinformatics 29 2933-2935

  5. Burge SW et al. (2013) Rfam 11.0: 10 years of RNA families Nucl. Acids Res. 41 D226-32

  6. Lagesen K et al. (2007) RNAmmer: consistent annotation of rRNA genes in genomic sequences Nucl. Acids Res. 35 3100-3108