Keywords
Bioconductor, R Package, Collective Data Access, Gene expression, Gene Enrichment Analysis
This article is included in the Bioconductor gateway.
Bioconductor, R Package, Collective Data Access, Gene expression, Gene Enrichment Analysis
Gene expression levels influence the behavior of cells, the functionality of tissues, and a wide range of processes from development and aging to physiology or behavior. It is of particular importance that researchers are able to take advantage of the vast amounts of publicly available gene expression datasets to reproduce and validate results, or to investigate new research questions1–3.
To that purpose, one should be able to easily query and import gene expression datasets generated using different technologies, and their associated metadata. The R environment4 has now become a standard for bioinformatics and statistical analysis of gene expression data, through the Bioconductor framework and its many open source packages5,6. It is thus desirable to provide access to gene expression datasets programmatically and directly in R. For example, the Bioconductor packages ArrayExpress7, GEOquery8 and SRAdb9 provide access to the reference databases ArrayExpress10, GEO11 and SRA12 respectively.
However, such databases are primary archives aiming at comprehensiveness. They include gene expression datasets and other functional genomics data, generated from diverse experimental conditions, of diverse quality. The data provided are heterogeneous, with some datasets including only unprocessed raw data, and others including only data processed using specific analysis pipelines. For instance, over the 44,177 RNA array assay experiments stored in ArrayExpress with processed data available as of October 2016, 7,520 do not include the raw data. Metadata are often provided as free-text information that is difficult to query. For instance, the GEO database encourages submitters of high-throughput sequencing experiments to provide MINSEQE elements, but does not enforce this practice (see, e.g., GEO submission guidelines, and GEO Excel template for submissions). Unless the user needs to retrieve a specific known dataset from its accession number, it can be difficult to identify relevant available datasets. This can ultimately constitute an obstacle to data reuse.
One response to this diversity of primary archives is topical databases1. They can be useful for researchers of specialized fields, and even more so if they propose an R package for data access. For example, the BrainStars Bioconductor package allows access to microarray data of mouse brain regions samples from the BrainStars project13,14. The ImmuneSpaceR Bioconductor package allows access to the gene expression data generated by the Human Immunology Project Consortium15. Such efforts allow better control of the data and annotation quality, but by nature they include a limited number of conditions, which only fit the needs of specialized projects. Similarly, numerous “ExperimentData” packages are available on the Bioconductor repository, which each include a single curated and well-formatted expression dataset (see http://www.bioconductor.org/packages/release/BiocViews.html#___ExpressionData). But these packages are rarely updated and are mostly meant to be used as examples in software packages vignettes, for teaching, or as supplementary data for publications.
Finally, added-value databases aim at filtering, annotating, and possibly reprocessing all or some of the datasets available from the primary archives1. For example, a Bioconductor package was recently released to access the Expression Atlas, which includes a selection of microarray and RNA-seq datasets from ArrayExpress that are re-annotated and reprocessed16,17. Similarly, the recount Bioconductor package provides access to a dataset of 2,040 reanalyzed human RNA-seq samples from SRA (see https://jhubiostatistics.shinyapps.io/recount/)18–20.
The Bgee database (http://bgee.org/)21 is another added-value database, which currently offers access to reprocessed gene expression datasets from 17 animal species. Bgee aims to compare gene expression patterns across tissues, developmental stages, ages and species. It provides manually curated annotations to ontology terms, describing precisely the experimental conditions used. It integrates expression data generated with multiple technologies: RNA-Seq, Affymetrix microarrays, in situ hybridization, and expressed sequence tags (ESTs). An important characteristic of Bgee is that all datasets are manually curated to retain only “normal” healthy wild-type samples, i.e., excluding gene knock-out, treatments or diseases. Finally, Bgee datasets are carefully checked for quality issues, and reprocessed to produce normalized expression level, calls of presence/absence of expression, and of differential expression. Bgee thus provides a reference of high-quality and reusable gene expression datasets that are relevant for biological insights into normal conditions of gene expression. Release 13 of Bgee includes 526 RNA-seq libraries, 12,736 Affymetrix chips, 349,613 results from 46,619 in situ hybridization experiments and 3,185 EST libraries. Release 14 of Bgee is in preparation and will notably include 5,746 RNA-seq libraries from 29 animal species, including 4,860 human libraries from the GTEx project22,23.
Until recently the Bgee database lacked programmatic access to data through an R package, a shortcoming that we have addressed with the release of the BgeeDB Bioconductor package, available at http://www.bioconductor.org/packages/BgeeDB/. The package provides functions for fast extraction of data and metadata. The data structures used in the package can be easily incorporated with other Bioconductor packages, offering a wide range of possibilities for downstream analyses.
Moreover, in BgeeDB we introduce the possibility to run TopAnat analyses, i.e., anatomical expression enrichment tests on gene lists provided by the user. This functionality is based on the topGO package24,25, modified to use Bgee data (A. Alexa, personal communication). TopAnat is similar to the widely used Gene Ontology enrichment test26–28. But in our case, the enrichment test is applied to terms from an anatomical ontology, mapped to genes by expression patterns. As a result, TopAnat allows for discovery of tissues where a set of genes is preferentially expressed. This feature is available as a web-tool at http://bgee.org/?page=top_anat, but the R package offers more flexibility in the choice of input data and analysis parameters, and possibilities of inclusion within programs or pipelines.
In the following sections we provide some typical examples of usage of the BgeeDB package.
The first step of data retrieval is to initialize a new Bgee reference class object, for a targeted species and data type. Normalized expression levels are currently available in the BgeeDB package for two data types: Affymetrix microarrays and Illumina RNA-seq. The list of species available in the Bgee database for each data type, along with their NCBI taxonomy IDs and common names can be obtained with the listBgeeSpecies() function. By default, data will be downloaded from the latest Bgee release, but this can be changed with the release argument.
Next, the functions getAnnotation(), getData(), and formatData() can be called to respectively download the annotations of datasets, download the actual expression data, and reformat the expression data for more convenient use. Of note, BgeeDB creates a directory to store the downloaded annotation files and datasets, by default in the user’s R working directory, but this can be changed with the pathToData argument. These versioned cached files make it faster for the user to return to previously used data and allow for offline work.
Microarray dataset retrieval. In the following example, we look for a microarray dataset in mouse (Mus musculus), spanning multiple early developmental stages, including zygote. At the time of publication the latest Bgee release is 13.2, so if one needs to strictly reproduce the output of the code below in the future, the release="13.2" argument needs to be added when creating the Bgee object (see Supplementary file S1 and Supplementary file S2).
# specify species and data type
bgee.affymetrix <- Bgee$new(species="Mus_musculus", dataType="affymetrix")
# retrieve annotation of all mouse affymetrix datasets in Bgee
annotation.bgee.mouse.affymetrix <- getAnnotation(bgee.affymetrix)
str(annotation.bgee.mouse.affymetrix)
This creates a list of two data frames, one including the annotation of experiments, and one including the annotation of each individual sample, i.e., hybridized microarray chip. For mouse, there are 694 Affymetrix experiments and 6,077 samples available in Bgee release 13. Anatomical structures and developmental stages are annotated using the Uberon ontology29,30. Below, we are selecting the experiments for which at least one sample is annotated to the zygote stage (UBERON:0000106).
# retrieve annotations of samples and experiments
sample.annotation <- annotation.bgee.mouse.affymetrix$sample.annotation
experiment.annotation <- annotation.bgee.mouse.affymetrix$experiment.annotation
# list experiments including a zygote sample
selected.experiments <- unique(sample.annotation$Experiment.ID[sample.annotation$Stage.ID == "UBERON:0000106"])
experiment.annotation[experiment.annotation$Experiment.ID %in% selected.experiments,]
# stages sampled in each of these experiments
unique(sample.annotation[sample.annotation$Experiment.ID %in% selected.experiments, c(1,6)])
This yields three microarray experiments, with accessions GSE1749, E-MEXP-51 and GSE18290. Among these, the accession E-MEXP-51, submitted to ArrayExpress by Wang and colleagues31, includes samples from more developmental stages than the other two, so we use this in the next steps. For this experiment, raw data were available from ArrayExpress, so samples were fully normalized with gcRMA32 version 2.40.0 through the Bgee pipeline.
# List all samples from E-MEXP-51 in Bgee
sample.annotation[sample.annotation$Experiment.ID == "E-MEXP-51",]
The experiment includes 35 samples that passed Bgee quality controls. They originate from 12 developmental stages: primary and secondary oocyte, zygote, early, mid and late 2-cells embryo, 4-cells embryo, 8-cells embryo, 16-cells embryo, early, mid and late blastocyst, although the developmental stages ontology used is not precise enough yet to differentiate some of these conditions: the early, mid and late 2-cells stages are annotated as Theiler stage 2 embryo, and the 4-cells and 8-cells stages are annotated as Theiler stage 3 embryo. All samples were hybridized to the Affymetrix GeneChip Murine Genome U74Av2 microarray. Let us download the normalized probesets intensities measured for all samples.
data.E.MEXP.51 <- getData(bgee.affymetrix, experimentId="E-MEXP-51")
head(data.E.MEXP.51)
The resulting data frame lists for each sample (column “Chip.ID”), the 9,017 probesets on the microarray (column “Probeset.ID”), their mapping to Ensembl gene IDs33 (column “Gene.ID”), their logged normalized intensities (column “Log.of.normalized.signal.intensity”), and a presence/absence call and quality (columns “Detection.flag” and “Detection.quality”).
As this format might not be the most convenient for downstream processing of an expression dataset, we offer the formatData() function, which creates an ExpressionSet object including the expression data matrix, the probesets annotation to Ensembl genes and the samples' anatomical structure and stage annotation into (assayData, featureData and phenoData slots respectively). This object class is of standard use in numerous Bioconductor packages.
data.E.MEXP.51.formatted <- formatData(bgee.affymetrix, data.E.MEXP.51, callType="all", stats="intensities")
data.E.MEXP.51.formatted
# matrix of expression intensities
head(exprs(data.E.MEXP.51.formatted))
# annotation of samples
pData(data.E.MEXP.51.formatted)
# annotation of probesets
head(fData(data.E.MEXP.51.formatted))
The callType option of the formatData() function could alternatively be set to present or present high quality to display only the intensities of probesets detected as actively expressed.
The result is a nicely formatted Bioconductor object including expression data and their annotations, ready to be used for downstream analysis with other Bioconductor packages.
RNA-seq dataset retrieval. We now search Bgee for a RNA-seq dataset sampling brain and liver tissues (Uberon Ids UBERON:0000955 and UBERON:0002107 respectively) in macaque (Macaca mulatta), and including multiple biological replicates for each tissue.
# specify species and data type
bgee.rnaseq <- Bgee$new(species="Macaca_mulatta", dataType="rna_seq")
# retrieve annotations of RNA-seq samples and experiments
annotation.bgee.macaque.rna.seq <- getAnnotation(bgee.rnaseq)
sample.annotation <- annotation.bgee.macaque.rna.seq$sample.annotation
experiment.annotation <- annotation.bgee.macaque.rna.seq$experiment.annotation
# list experiments including both brain and liver samples
selected.experiments <- intersect(unique(sample.annotation$Experiment.ID[sample.annotation$Anatomical.entity.ID == "UBERON:0000955"]),
unique(sample.annotation$Experiment.ID[sample.annotation$Anatomical.entity.ID == "UBERON:0002107"]))
experiment.annotation[experiment.annotation$Experiment.ID %in% selected.experiments,]
# check whether experiments include biological replicates
sample.annotation[sample.annotation$Experiment.ID %in%
selected.experiments & (sample.annotation$Anatomical.entity.ID == "UBERON:0000955" |
sample.annotation$Anatomical.entity.ID == "UBERON:0002107"), 1:6]
Accessions GSE4163734 and GSE3035235 both include biological replicates for brain and liver. We focus on GSE41637 for the next steps since it includes three replicates of each tissue, vs. only two for GSE30352. We download the dataset and reformat it to obtain an ExpressionSet including counts of mapped reads on each Ensembl gene for each sample.
data.GSE41637 <- getData(bgee.rnaseq, experimentId="GSE41637")
data.GSE41637.formatted <- formatData(bgee.rnaseq, data.GSE41637, callType="all", stats="counts")
data.GSE41637.formatted
Instead of mapped read counts, it is also possible to fill the data matrix with expression levels in the RPKM unit (reads per kilobase per million reads), using the option stats="rpkm". In the next Bgee release (release 14), it will be possible to obtain expression levels in the TPM unit (transcript per million)36,37 from pseudo-mapping of reads computed in Bgee using the Kallisto software38.
Presence/absence calls retrieval. It is often difficult to compare expression levels across species39, and even within species, across datasets generated by different experimenters or laboratories40–42. Batch effects have indeed been shown to impact extensively gene expression levels, confounding biological signal differences.
Encoding gene expression as present or absent in a sample allows a more robust comparison across such conditions. In addition to retrieving RNA-seq and Affymetrix quantitative expression levels, BgeeDB also allows to retrieve calls of presence or absence of expression computed in the Bgee database for each gene (RNA-seq) or probeset (Affymetrix), in the column “Detection.flag” of the data.E.MEXP.51 and data.GSE41637 objects created above. And interestingly, expression calls are also available in Bgee for ESTs and in situ hybridization data, as well as for the consensus of the four data types for each triplet “gene / tissue / developmental stage”.
A powerful use of these expression calls is the anatomical expression enrichment test “TopAnat”. TopAnat uses a similar approach to Gene Ontology enrichment tests26, but genes are associated to the anatomical structures where they display expression, instead of to their functional classification. These tests allow detecting where a set of genes is preferentially expressed as compared to a background universe (Roux J., Seppey M., Sanjeev K., Rech de Laval V., Moret P., Artimo P., Duvaud S., Ioannidis V., Stockinger H., Robinson-Rechavi M., Bastian F.B.; unpublished report). We show an example of such an analysis in the section “Anatomical expression enrichment analysis” below.
Of note, the expression calls imported from BgeeDB can also be used for other downstream analyses. For example, when studying protein-protein interaction datasets, it might be biologically relevant to retain only interactions for which both members are expressed in the same tissues43,44.
Clustering analysis. A variety of downstream analyses can be performed on the imported expression data. Below we detail an example of gene expression clustering analysis on the developmental time-series microarray experiment imported above. The analysis, performed with the Mfuzz package45,46 (version 2.34.0 for this paper), aims at uncovering genes with similar expression profiles across development. We can readily start with the ExpressionSet object previously created.
# for simplicity, keep only one sample per condition
data.E.MEXP.51.formatted <- data.E.MEXP.51.formatted[, !duplicated(pData(data.E.MEXP.51.formatted)[2:5])]
# order developmental stages
data.E.MEXP.51.formatted <- data.E.MEXP.51.formatted[, c(5,8,9,3,2,1,4,7,6)]
# filter out rows with no variance
data.E.MEXP.51.formatted <- data.E.MEXP.51.formatted[apply(exprs(data.E.MEXP.51.formatted), 1, sd) != 0, ]
# Mfuzz clustering
biocLite("Mfuzz")
library(Mfuzz)
# standardize matric of expression data
z.mat <- standardise(data.E.MEXP.51.formatted)
# cluster data into 16 clusters
clusters <- mfuzz(z.mat, centers=16, m=1.25)
# visualizing clusters
mfuzz.plot2(z.mat, cl=clusters, mfrow=c(4,4), colo="fancy",
time.labels=row.names(pData(z.mat)), las=2, xlab="", ylab="Standardized expression level", x11=FALSE)
The resulting plot can be seen in Figure 1.
Differential expression analysis. Below, we detail a differential expression analysis, with the package edgeR47,48 (version 3.16.1 for this paper), on the previously imported RNA-seq dataset of macaque tissues. We aim at isolating genes differentially expressed between brain and liver.
# differential expression analysis with edgeR
biocLite("edgeR")
library(edgeR)
# subset the dataset to brain and liver
brain.liver <- data.GSE41637.formatted[, pData(data.GSE41637.formatted)$Anatomical.entity.name %in% c("brain", "liver")]
# filter out very lowly expressed genes
brain.liver.filtered <- brain.liver[rowSums(cpm(brain.liver) > 1) > 3, ]
# create edgeR DGElist object
dge <- DGEList(counts=brain.liver.filtered,
group=pData(brain.liver.filtered)$Anatomical.entity.name)
dge <- calcNormFactors(dge)
dge <- estimateCommonDisp(dge)
dge <- estimateTagwiseDisp(dge)
de <- exactTest(dge, pair=c("brain","liver"))
de.genes <- topTags(de, n=nrow(de))$table
# MA plot with DE genes highlighted
plotSmear(dge, de.tags=rownames(de.genes)[de.genes$FDR < 0.01], cex=0.3)
The resulting plot can be seen in Figure 2.
The loadTopAnatData() function loads the names of anatomical structures, and relationships between them, from the Uberon anatomical ontology (based on parent-child “is_a” and “part_of” relationships). It also loads a mapping from genes to anatomical structures, based on the presence calls of the genes in the targeted species. These calls come from a consensus of all data types specified in the input Bgee class object. We recommend to use all available data types (RNA-seq, Affymetrix, EST and in situ hybridization) for both genomic coverage and anatomical precision, which is the default behavior if no dataType argument is specified when the Bgee class object is created.
By default, presence calls of both high and low quality are used, which can be changed with the confidence argument of the loadTopAnatData() function. Finally, it is possible to specify the developmental stage under consideration, with the stage argument. By default expression calls generated from samples of all developmental stages are used, which is equivalent to specifying stage="UBERON:0000104" (“life cycle”, the root of the stage ontology). Data are stored in versioned tab-separated cached files that will be read again if a query with the exact same parameters is launched later, to save time and server resources, and to work offline.
In this example, we use expression calls for zebrafish genes using all sources of expression data.
bgee.topanat <- Bgee$new(species="Danio_rerio")
myTopAnatData <- loadTopAnatData(bgee.topanat)
str(myTopAnatData)
We look at the expression localization of the genes with an annotated phenotype related to pectoral fin (i.e., genes which upon knock-out or knock-down led to abnormal phenotypes of pectoral fin or its components). Zebrafish phenotypic data are available from the ZFIN database49 and integrated into the Ensembl database50. We thus retrieve the targeted genes using the biomaRt51 Bioconductor package (version 2.30.0 for this paper).
biocLite("biomaRt")
library(biomaRt)
# zebrafish data in Ensembl 85 (stable link)
ensembl <- useMart("ENSEMBL_MART_ENSEMBL",
dataset="drerio_gene_ensembl", host="jul2016.archive.ensembl.org")
# get the mapping of Ensembl genes to phenotypes
genesToPhenotypes <- getBM(filters=c("phenotype_source"), value=c("ZFIN"),
attributes=c("ensembl_gene_id","phenotype_description"), mart=ensembl)
# select phenotypes related to pectoral fin
myPhenotypes <- grep("pectoral fin", unique(genesToPhenotypes$phenotype_description), value=T)
# select the genes annotated to select phenotypes
myGenes <- unique(genesToPhenotypes$ensembl_gene_id[genesToPhenotypes$phenotype_description %in% myPhenotypes])
This gives a list of 150 zebrafish genes implicated in the development and function of pectoral fin. The next step of the analysis relies on the topGO Bioconductor package. We prepare a modified topGOdata object allowing to handle the Uberon anatomical ontology instead of the Gene Ontology, and perform a GO-like enrichment test for anatomical terms. As for a classical topGO analysis, we need to prepare a vector including all background genes, and with values 0 or 1 depending if genes are part of the foreground or not. The choice of background is very important since the wrong background can lead to spurious results in enrichment tests52. Here we choose as background all zebrafish Ensembl genes with an annotated phenotype from ZFIN.
# prepare the gene list vector
geneList <- factor(as.integer(unique(genesToPhenotypes$ensembl_gene_id) %in% myGenes))
names(geneList) <- unique(genesToPhenotypes$ensembl_gene_id)
summary(geneList)
# prepare the topAnat object based on topGO
myTopAnatObject <- topAnat(myTopAnatData, geneList)
At this step, expression calls are propagated through the whole ontology (e.g., expression in the forebrain will also be counted as expression in the brain, the nervous system, etc). This can take some time, especially if the gene list is large.
Finally, we launch an enrichment test for anatomical terms. The functions of the topGO package can directly be used at this step. See the vignette of this package for more details25. Here we use a Fisher test, coupled with the “weight” decorrelation algorithm.
results <- runTest(myTopAnatObject, algorithm='weight', statistic='fisher')
Finally, we implement a function to display results in a formatted table. By default anatomical structures are sorted by their test p-value, which is displayed along with the associated false discovery rate (FDR53) and the enrichment fold. Sorting on other columns of the table (e.g., on decreasing enrichment folds) is possible with the ordering argument. Of note, it is debated whether a FDR correction is relevant on such enrichment test results, since tests on different terms of the ontologies are not independent. An interesting discussion can be found in the vignette of the topGO package.
# retrieve anatomical structures enriched at a 1% FDR threshold tableOver <- makeTable(myTopAnatData, myTopAnatObject, results, cutoff=0.01)
The 22 anatomical structures displaying a significant enrichment at a FDR threshold of 1% are show in Table 1. The first term is “paired limb/fin bud”, and the second “pectoral fin”. Other terms in the list, especially those with high enrichment folds, are clearly related to pectoral fins (e.g., “pectoral appendage cartilage tissue”), substructures of fins (e.g., “fin bone”), or located next to them (e.g., “ceratohyal cartilage”). This analysis shows that genes with phenotypic effects on pectoral fins are specifically expressed in or next to these structures. More generally, it proves the pertinence of TopAnat analysis for the characterization of lists of genes.
In summary, the BgeeDB package serves as a bridge between data from the Bgee database and the R/Bioconductor environment, facilitating access to high-quality curated and re-analyzed gene expression datasets, and significantly reducing time for downstream analyses of the datasets. Moreover, it provides access to TopAnat, a new enrichment that makes sense of lists of genes by uncovering their preferential localization of expression in anatomical structures. The TopAnat workflow is straightforward; for users already using topGO in their analysis pipelines, performing a TopAnat analysis on the same gene list only requires 6 additional lines of code.
Software available from: http://www.bioconductor.org/packages/BgeeDB/
Latest source code: https://github.com/BgeeDB/BgeeDB_R
Archived source code as at the time of publication: https://doi.org/10.5281/zenodo.16376854
AK and JR contributed equally to this work. AK developed the initial BgeeDB R package and made it available in Bioconductor. JR implemented the enrichment analyses, and refined the data download part. FBB developed the server-side responses. MRR and FBB tested and commented on the package development. AK and JR wrote the manuscript. All authors discussed the results and implications and commented on the manuscript at all stages.
This work was supported by SIB Swiss Institute of Bioinformatics project Bgee, Swiss National Science Foundation grant 31003A_153341, SystemsX.ch project AgingX, and Etat de Vaud.
R markdown file including code from the paper.
Click here to access the data.
PDF file including the results of execution of the code from File S1.
Views | Downloads | |
---|---|---|
F1000Research | - | - |
PubMed Central
Data from PMC are received and updated monthly.
|
- | - |
References
1. Bastian F, Parmentier G, Roux J, Moretti S, et al.: Bgee: Integrating and Comparing Heterogeneous Transcriptome Data Among Species. 2008. 124-131 Publisher Full TextCompeting Interests: No competing interests were disclosed.
References
1. Morin A, Urban J, Sliz P: A quick guide to software licensing for the scientist-programmer.PLoS Comput Biol. 2012; 8 (7): e1002598 PubMed Abstract | Publisher Full TextCompeting Interests: No competing interests were disclosed.
References
1. Kuleshov MV, Jones MR, Rouillard AD, Fernandez NF, et al.: Enrichr: a comprehensive gene set enrichment analysis web server 2016 update.Nucleic Acids Res. 2016; 44 (W1): W90-7 PubMed Abstract | Publisher Full TextCompeting Interests: No competing interests were disclosed.
Alongside their report, reviewers assign a status to the article:
Invited Reviewers | |||
---|---|---|---|
1 | 2 | 3 | |
Version 2 (revision) 07 Aug 18 |
read | ||
Version 1 23 Nov 16 |
read | read | read |
Provide sufficient details of any financial or non-financial competing interests to enable users to assess whether your comments might lead a reasonable person to question your impartiality. Consider the following examples, but note that this is not an exhaustive list:
Sign up for content alerts and receive a weekly or monthly email with all newly published articles
Already registered? Sign in
The email address should be the one you originally registered with F1000.
You registered with F1000 via Google, so we cannot reset your password.
To sign in, please click here.
If you still need help with your Google account password, please click here.
You registered with F1000 via Facebook, so we cannot reset your password.
To sign in, please click here.
If you still need help with your Facebook account password, please click here.
If your email address is registered with us, we will email you instructions to reset your password.
If you think you should have received this email but it has not arrived, please check your spam filters and/or contact for further assistance.
Comments on this article Comments (0)