ensembldb: an R package to create and use Ensembl-based annotation resources

Abstract Summary Bioinformatics research frequently involves handling gene-centric data such as exons, transcripts, proteins and their positions relative to a reference coordinate system. The ensembldb Bioconductor package retrieves and stores Ensembl-based genetic annotations and positional information, and furthermore offers identifier conversion and coordinates mappings for gene-associated data. In support of reproducible research, data are tied to Ensembl releases and are kept separately from the software. Premade data packages are available for a variety of genomes and Ensembl releases. Three examples demonstrate typical use cases of this software. Availability and implementation ensembldb is part of Bioconductor (https://bioconductor.org/packages/ensembldb). Supplementary information Supplementary data are available at Bioinformatics online.


1
Query for helix-loop-helix transcription factors on chromosome 21 Down syndrome is a genetic disorder characterized by the presence of all or parts of a third copy of chromosome 21. It is associated, among other, with characteristic facial features and mild to moderate intellectual disability. The phenotypes are currently believed to be the result from a gene dosage-dependent increased expression of the genes encoded on chromosome 21 (Lana-Elola et al. 2011). Compared to other gene classes, transcription factors are more likely to have an immediate impact, even due to a moderate over-expression (which might be the result from gene duplication). One of the largest dimerizing transcription factor families is characterized by a basic helix-loop-helix domain (Massari and Murre 2000), a protein structural motif facilitating DNA binding.
The example below aims at identifying transcription factors with a basic helix-loop-helix domain (Pfam ID PF00010) that are encoded on chromosome 21. To this end we first load an R-library providing human annotations from Ensembl release 86 and pass the loaded EnsDb object along with a filter expression to the genes method that retrieves the corresponding genes. Filter expressions have to be written in the form~<field> <condition> <value> with <field> representing the database column to be used for the filter. Several such filter expressions can be concatenated with standard R logical expressions (such as & or |). To get a list of all available filters and their corresponding fields, the supportedFilters(edb) function could be used. The function returned a GRanges object with the genomic position of the genes and additional gene-related annotations stored in metadata columns. Three transcription factors with a helix-loop-helix domain are encoded on chromosome 21: SIM2, which is a master regulator of neurogenesis and is thought to contribute to some specific phenotypes of Down syndrome (Gardiner and Costa 2006) and the two genes OLIG1 and OLIG2 for which genetic triplication was shown to cause developmental brain defects (Chakrabarti et al. 2010). To visualize the exonic regions encoding the helix-loop-helix domain of these genes we next retrieve their transcript models and the positions of all Pfam protein domains within the amino acid sequences encoded by these transcripts. We process SIM2 separately from OLIG1 and OLIG2 because the latter are encoded in a narrow region on chromosome 21 and can thus be visualized easily within the same plot. We extract the transcript models for OLIG1 and OLIG2 that encode the protein domain using the getGeneRe gionTrackForGviz function which returns the data in a format that can be directly passed to functions from the Gviz Bioconductor package (Hahne and Ivanek 2016) for plotting. Since Gviz expects UCSC-style chromosome names instead of the Ensembl chromosome names (e.g. chr21 instead of 21), we change the format in which chromosome names are returned by ensembldb with the seqlevelsStyle method. All subsequent queries to the EnsDb database will return chromosome names in UCSC format. Next we fetch the coordinates of all Pfam protein domains encoded by these transcripts with the proteins method, asking for columns "prot _ dom _ start", "prot _ dom _ end" and "pro tein _ domain _ id" to be returned by the function. Note that we restrict the results in addition to protein domains defined in Pfam by using an additional filter. We next map these protein-relative positions to the genome. We define first an IRanges object with the coordinates and submit this to the proteinToGenome function for mapping. Besides coordinates, the function requires also the respective protein identifiers which we supply as names.
pdoms _ rng <-IRanges(start = pdoms$prot _ dom _ start, end = pdoms$prot _ dom _ end, names = pdoms$protein _ id) pdoms _ gnm <-proteinToGenome(pdoms _ rng, edb) The result is a list of GRanges objects with the genomic coordinates at which the protein domains are encoded, one for each of the input protein domains. Additional information such as the protein ID, the encoding transcript and the exons of the respective transcript in which the domain is encoded are provided as metadata columns. Column cds _ ok in the result object indicates whether the length of the CDS of the encoding transcript matches the length of the protein sequence. For transcripts with unknown 3' and/or 5' CDS ends these will differ. The mapping result has to be re-organized before being plotted: Gviz expects a single GRanges object, with specific metadata columns for the grouping of the individual genomic regions. This is performed in the code block below.
Next we repeat the analysis for SIM2 by first fetching all of its transcript variants encoding the PF00010 Pfam protein domain from the database. Subsequently we retrieve all Pfam protein domains encoded in these transcripts.

## Fetch all SIM2 transcripts encoding PF00010
txs <-getGeneRegionTrackForGviz(edb, filter =~genename == "SIM2" & protein _ domain _ id == "PF00010") ## Fetch all Pfam protein domains within these transcripts pdoms <-proteins(edb, filter =~tx _ id %in% txs$transcript & protein _ domain _ source == "pfam", columns = c("protein _ domain _ id", "prot _ dom _ start", "prot _ dom _ end")) At last we have to map the protein domain coordinates to the genome and prepare the data for the plot. Since the code is essentially identical to the one for OLIG1 and OLIG2 it is not displayed.  The SIM2 transcript encodes a protein with in total 4 protein domains. The helix-loop-helix domain PF00010 is encoded in its first exon.

2
Mapping of genomic coordinates to protein-relative positions One of the known mutations for human red hair color is located at position 16:89920138 (dbSNP ID rs1805009) on the human genome (version GRCh38). Below we map this genomic coordinate to the respective coordinate within the protein sequence encoded at that location using the genomeToProtein function. Note that we use "chr16" as the name of the chromosome, since we changed the chromosome naming style to UCSC in the previous example.
gnm _ pos <-GRanges("chr16", IRanges(89920138, width = 1)) prt _ pos <-genomeToProtein(gnm _ pos, edb) The genomic position could thus be mapped to the amino acid 294 in each of the 3 proteins listed above. Using the select function we retrieve the HGNC symbol of the gene for these 3 proteins.  The plot above visualizes the expanded genomic region of the variant (indicated with a vertical red line) including all transcripts (from both strands) encoded in the region. In total 4 genes are present in the region: MC1R, the MC1R-TUBB3 read-through gene RP11-566K11.2, the non-coding gene RP11-566K11.4 located on the reverse strand and TUBB3. Exons of transcripts from the former 3 genes overlap the genomic position of the variant. Using the proteins method we next extract the sequences of the proteins encoded by the 3 transcripts (two of MC1R and one of RP11-566K11.2 ) and determine the amino acid at position 294 in these. To retrieve the results in a format most suitable for the representation of amino acid sequences we specify return.type = "AAStringSet" in the proteins call. The amino acid at position 294 is for all an aspartic acid ("D") which is in agreement with the reference amino acid of mutation Asp294His (Valverde et al. 1995) described by the dbSNP ID of this example.

Using ensembldb in a standard gene-level RNA-seq workflow
This workflow is based on the RNA-seq workflow (Love et al. 2015) and shows how the ensembldb package can be used for all annotation-related tasks in that analysis. In brief, the workflow describes the analysis of an RNA-seq data set after alignment of the reads to the genome with the aim to identify genes that are differentially expressed in airway smooth muscle cells after treatment with the synthetic glucocorticoid dexamethasone, a drug with anti-inflammatory effects. Glucocorticoids are used, for example, by people with asthma to reduce inflammation of the airways. Note that here we focus only on the parts of the analysis in which ensembldb functionality and data are used, i.e. the gene quantification step and the annotation of the analysis results. Other topics from the workflow such as the visualization and exploratory data analysis are not covered here. Also, please refer to the original workflow for background information and detailed descriptions of the methodology used.
Below we load the airway data package that provides all required files for the present analysis (including BAM files with a subset of the alignment results of the raw reads to the genome GRCh37) and load a text file with sample descriptions from this package.

Gene quantification
After alignment of the reads to the genome, gene abundance estimates can be generated by counting the number of reads falling within the exon boundaries of a gene. The BAM files with the alignment results of the present data set are provided by the airway package. Below we load these into a BamFileList object defined in the Rsamtools package which enables reading alignment information from BAM files.

yieldSize = 2000000)
For gene quantification we use the summarizeOverlaps function from the GenomicAlignments package (Lawrence et al. 2013) that requires, apart from the aligned reads, also the genomic positions of all exons of all genes. In contrast to the original workflow, we retrieve these directly from the appropriate ensembldb EnsDb database that we load in the code block below.
Since the reads in this RNA-seq data set were aligned against GRCh37, we load an ensembldb resource for an Ensembl release matching that genome version (i.e. release 75). The genomic positions of all exons grouped by gene are extracted from the database with the exonsBy function.
library(ensembldb) library(EnsDb.Hsapiens.v75) ebg <-exonsBy(EnsDb.Hsapiens.v75, "gene") With this we can now proceed to the read counting with the summarizeOverlaps function using the same settings as in the original workflow. At last we add the sample metadata to the result object (a SummarizedExperiment) and ensure that untrt is the reference level for the variable specifying the treatment of the samples (i.e. dex).

Differential expression analysis
We use DESeq2 (Love, Huber, and Anders 2014) for the differential expression analysis to identify genes that are deregulated upon treatment with the synthetic glucocorticoid dexamethasone (dex). The SumarizedExperiment with the gene quantifications generated in the previous section contains only reads aligned against chromosome 1. Thus, as in the original workflow, we load a prepared SummarizedExperiment containing read counts for genes from all chromosomes (which is also provided in the airway package).
library(DESeq2) dds <-DESeqDataSet(se, design =~cell + dex) Before proceeding to the differential expression analysis we remove rows (genes) with only zeros or very low read counts.

dds <-dds[rowSums(counts(dds)) > 1, ]
As we have already specified an experimental design when creating the DESeqDataSet, we can run the differential expression analysis on the raw counts with a single call to the function

Annotating results
As a last step we want to annotate our result table. In contrast to the original workflow, we extract annotations for the Ensembl gene identifiers from the EnsDb database we have already used in the gene quantification step above. This simplifies the analysis and additionally ensures that both annotations as well as positional information used for the read counting are from the same data release.
Below we use ensembldb's genes function to extract all gene annotations available in the EnsDb database and return the results as a DataFrame.
anns <-genes(EnsDb.Hsapiens.v75, return.type = "DataFrame") Benchmarking the filter framework Filters in ensembldb are translated into the underlying SQL commands which leads to a significant performance gain on data retrieval if only subsets of the data are fetched. In this section we compare the performances of data retrieval in the classical way, i.e. by first extracting the full data from the database with subsequent sub-setting in R, and data retrieval using the ensembldb filter framework. The first use case is to retrieve all gene-related annotations for the SIM2 gene from Section 1. The classical approach is to get all gene annotations and subset the results to those for which the gene name matches SIM2. We see a clear performance advantage of the filter framework. With the classical approach it took on average 1 second to extract the relevant annotation, while filtering the data with the filter framework reduced this to less than 20 milliseconds which represents an about 60-fold performance increase.
In the next example we want to get the exon positions, grouped by gene, for genes that are encoded on chromosome X. First we use again the classical approach fetching all information and sub-setting later. The performance gain when using the filter framework was again remarkable (on average about 13 seconds with the classical approach compared to approximately 0.25 seconds with the filter framework).