ArrayExpress: Discover functional genomics data quickly and easily

Beginner
2 hours

ArrayExpress is a database of functional genomics data. This course will give you an overview of how these data are stored in ArrayExpress and will teach you how to effectively search and retrieve data from the ArrayExpress website [2].

Describe what ArrayExpress is and when to use it
Evaluate how functional genomics data is stored within ArrayExpress
Search ArrayExpress to find information on functional genomics data
Retrieve/download data from ArrayExpress

Overview
Figure 1 (below) explains what we will cover during this course.


Anja Füllgrabe [1] Gene Expression

What is ArrayExpress?
ArrayExpress [2] is one of the major public repositories for functional genomics [3] datasets. Most of the data is genome-wide gene expression [4] data, measured on microarray [5] or next-generation sequencing (NGS) platforms. A range of DNA assays are also hosted by ArrayExpress, such as ChIP-seq or genotyping.
The main object in ArrayExpress is the experiment. An experiment usually groups several assays belonging to one study or publication. Each experiment contains metadata [6] describing the biological specimen and experimental procedures, as well as resulting data files (Figure 2). The definition of an assay [7] depends on the experiment type. For microarray experiments an assay represents one hybridisation [8] (of biological sample

Submission process
The main route for direct submissions to ArrayExpress is via the submission tool Annotare (5 [9]). Users supply their files and metadata in a webform, from where the data is exported and stored in MAGE-TAB [12] format (6 [9]) (Figure 3). Data sets imported from other functional genomics databases are also converted into MAGE-TAB files for archiving.

The two main MAGE-TAB spreadsheets

IDF
The IDF contains an overview of the whole experiment, including the title, the submitter's contact details, publication information, protocols and the experimental variables.
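As a rough illustration of the IDF layout, the sketch below parses a few IDF-style rows with Python's standard library. It assumes the row-oriented, tab-delimited layout in which the first cell of each row is a field name and the remaining cells are its values. The sample content and the helper name parse_idf are made up for illustration and are not taken from a real experiment.

```python
import csv
import io

# Hypothetical IDF excerpt: one field per row, values in the following cells.
IDF_TEXT = (
    "Investigation Title\tExample study of muscle tissue\n"
    "Experimental Factor Name\tdisease\tdose\n"
    "Person Last Name\tDoe\n"
)

def parse_idf(text):
    """Return a dict mapping each IDF field name to its list of non-empty values."""
    fields = {}
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row or not row[0].strip():
            continue  # skip blank rows
        values = [cell for cell in row[1:] if cell.strip()]
        fields[row[0].strip()] = values
    return fields

idf = parse_idf(IDF_TEXT)
print(idf["Investigation Title"][0])
print(idf["Experimental Factor Name"])
```

A real IDF downloaded from an experiment page could be passed to the same function after reading the file into a string.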

SDRF
The SDRF describes all the sample characteristics (e.g. cell type) or any treatment that the sample has been subjected to (e.g. growth in low oxygen conditions), and links each sample to its corresponding data file. The structure of the SDRF, i.e. the order of the columns, reflects the experimental workflow from source material, through intermediate steps (e.g. labelling of nucleic acids, preparation of sequencing libraries, running of sequencing assays) to raw and processed data.

Metadata for microarray experiments
There is a third type of MAGE-TAB file, which is relevant for microarray experiments. All microarray experiments are additionally associated with an Array Design Format (ADF) file (Figure 5). Each position of the array or "probe" is annotated with information like the gene ID or the genomic position for which the probe is specific. This information is crucial for analysis of microarray raw data.
For commercially available microarrays, this file is provided by the array manufacturer and is often already archived in ArrayExpress.
Custom array designs need to be submitted to ArrayExpress before they can be associated with an experiment.

Raw and processed data files
ArrayExpress stores raw data files, and processed data files or matrices (usually in the form of tab-delimited text files) for microarray experiments (Figure 6).
For NGS experiments, all processed sequence files (e.g. BAM [14] alignment files) and derived data (e.g. normalised read count matrices) are stored directly in the ArrayExpress database. The raw sequence files are deposited in the European Nucleotide Archive [15] (ENA) (7 [9]) and appear as direct links in ArrayExpress. The format of raw data files varies depending on the platform used (array manufacturer, sequencer machine, etc.).

The experimental variable
An important element that is defined in the metadata of each experiment is the "experimental variable". The experimental variable is usually one or several of the sample attribute categories. It describes the factors that differ between the test and the control samples, which you are investigating (Figure 7). At least one experimental variable must be specified for each experiment in ArrayExpress because it helps to identify the specific experimental conditions you are interested in. You will see how this is useful for retrieving precise search results in the next section.
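To make the relationship between sample attributes and experimental variables more concrete, here is a minimal sketch that lists which attributes differ between samples. The sample table, its values and the function name varying_attributes are hypothetical. Note that an attribute that varies is only a candidate: whether it is an experimental variable depends on what the study set out to compare, which is declared by the submitter.

```python
# Hypothetical sample annotation: one dict per sample.
samples = [
    {"organism": "Homo sapiens", "age": "45 years", "disease": "normal"},
    {"organism": "Homo sapiens", "age": "52 years", "disease": "diabetes mellitus"},
    {"organism": "Homo sapiens", "age": "52 years", "disease": "diabetes mellitus"},
]

# Hypothetical declaration made by the submitter of the experiment.
declared_variables = {"disease"}

def varying_attributes(rows):
    """Return the attribute names that take more than one value across samples."""
    names = sorted({name for row in rows for name in row})
    return [n for n in names if len({row.get(n) for row in rows}) > 1]

candidates = varying_attributes(samples)
not_declared = [n for n in candidates if n not in declared_variables]

print(candidates)     # attributes that differ between samples
print(not_declared)   # differ, but are not the subject of the study
```

In this toy table "age" differs between samples but is not declared, because the hypothetical study compares disease states rather than age groups.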

How to search ArrayExpress
Start your search by typing directly into the search box, or view all experiments in ArrayExpress by clicking on "Browse" (Figure 8).

The experiments overview
Let's start browsing all available experiments in ArrayExpress. Clicking on "Browse" opens the list of experiments with a short overview of the study and the available files in each row (Figure 9).

Try it for yourself
Open ArrayExpress [2] in another window/tab. Click on "Browse" and familiarise yourself with the interface and the information in the table. By default the experiments are sorted by release date. Click on "Assays" to see the experiments with the most assays at the top of the list. Try to display the top 100 most viewed experiments on one page.

Filter options
The filter box lets you select experiments by:
species, e.g. only human material, or only Drosophila
the type of analysed material: DNA, RNA, protein, metabolite
technology type: microarray [5], sequencing, mass spectrometry [18]
array: the specific name of the array
Ticking the box "ArrayExpress data only" shows only curated experiments that were directly submitted to ArrayExpress.

Free-text search
In the free-text search box, you can enter keywords to start a query or enter an accession [16] number (e.g. E-MTAB-3682) to go directly to the desired experiment page.
Put quotes around multiple keywords if you want to find experiments where these words are found next to each other, e.g. "breast cancer". Entering multiple words without quotes will retrieve experiments where both keywords are found but they are not necessarily adjacent, e.g. mouse leukemia [19].
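The two kinds of input described above can be sketched in a few lines of Python. This is only an illustration of the quoting rule, not code used by ArrayExpress: the function names are hypothetical, and the accession pattern is an assumption based on the accessions quoted in this course (for example E-MTAB-3682 and E-GEOD-6285).

```python
import re

# Assumption: accessions look like "E-" + four capital letters + "-" + digits,
# as in the examples used in this course.
ACCESSION_PATTERN = re.compile(r"^E-[A-Z]{4}-\d+$")

def is_accession(text):
    """Return True if the text looks like an experiment accession number."""
    return bool(ACCESSION_PATTERN.match(text.strip()))

def free_text_query(keywords, adjacent=False):
    """Join keywords; wrap them in quotes if they must appear next to each other."""
    joined = " ".join(keywords)
    if adjacent and len(keywords) > 1:
        return f'"{joined}"'
    return joined

print(is_accession("E-MTAB-3682"))
print(free_text_query(["breast", "cancer"], adjacent=True))
print(free_text_query(["mouse", "leukemia"]))
```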

Ontologies in ArrayExpress
Experiments in ArrayExpress are annotated with ontology [20] terms from the Experimental Factor Ontology [21] (EFO) (8 [9]). When you start typing in the search field, you will be shown suggestions of matching or similar terms from the EFO (Figure 10).
Ontologies help to make the experimental metadata [6] clearer and bring them into a standardised form. Ontologies facilitate searching by matching synonyms (e.g. "human" = "Homo sapiens") and expanding the search to include "child terms" (e.g. "cancer cell line" will find "HeLa [22]", "MCF7" and other related terms). In the free-text search drop-down list, you can reveal child terms by clicking the "+" sign and select a more specific term for your search.
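The idea of synonym matching and child-term expansion can be sketched with a toy ontology. The two small dictionaries below are hypothetical excerpts built from the examples in the text; they are not the real EFO, and the function expand_term is not how ArrayExpress implements its search.

```python
# Toy ontology: each term maps to its direct child terms (hypothetical excerpt).
CHILD_TERMS = {
    "cancer cell line": ["HeLa", "MCF7"],
    "HeLa": [],
    "MCF7": [],
}

# Toy synonym table (hypothetical excerpt).
SYNONYMS = {"human": "Homo sapiens"}

def expand_term(term):
    """Return the term (resolved through synonyms) followed by all its descendants."""
    root = SYNONYMS.get(term, term)
    expanded, stack = [], [root]
    while stack:
        current = stack.pop()
        if current in expanded:
            continue  # already visited
        expanded.append(current)
        # Reverse so that children are visited in the order they are listed.
        stack.extend(reversed(CHILD_TERMS.get(current, [])))
    return expanded

print(expand_term("cancer cell line"))
print(expand_term("human"))
```

A search for the expanded list of terms would then match experiments annotated with any of the child terms, even if the parent term itself never appears in the metadata.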

The results table
The search results can be sorted by clicking on any of the column headings. For example, clicking on "Processed" will show all the experiments that have processed data [23] files at the top of the list.
The last column, "Atlas", marks the experiments which have been selected for the Expression Atlas [24] at EMBL-EBI (9 [25]), which presents analysed results from our in-house standard statistical pipeline. The search terms are highlighted according to their relation to the original search term (Figure 11).
Figure 11 The search results page lists experiments matching the search criteria.

Try it for yourself
Start typing the word "cancer" into the search box and browse through the suggested ontology [20] terms. Click on the "+"-box next to the terms to see all "child terms", and try to find "acute myeloid leukemia [19]" (or select any type of cancer). Use the search filters to limit the results to display only sequencing data from human material.

Advanced search functions
You might notice that the results of your cancer search (from the previous page) are not very specific yet. By default, all experimental fields which include this search term are listed, including the experiment description, sample annotation, publication title, submitter's email address, protocol description, etc. Let's look at an example:

Question: "I would like to find transcriptomics experiments which compare patients with diabetes mellitus to healthy individuals."
Use the field name "efv:" (experimental factor value) to specify the disease:
efv:diabetes AND efv:normal
To make sure you find transcription profiling data from human samples, you can add:
organism:"Homo sapiens" AND exptype:"Transcription profiling"
So the complete query would be:
efv:diabetes AND efv:normal AND organism:"Homo sapiens" AND exptype:"Transcription profiling"
Not constraining the search terms in the above example could result in false matches, e.g. experiments merely mentioning "diabetes" in the description field, when in fact they are studying a different disease (e.g. hypertension).
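If you build such queries often, the pattern of combining field-constrained terms with AND can be written as a small helper. The sketch below only assembles the query string shown above; the function name build_query is hypothetical, and the code does not contact ArrayExpress.

```python
def build_query(constraints):
    """Join (field, value) pairs into 'field:value' terms combined with AND."""
    terms = []
    for field, value in constraints:
        # Quote values containing spaces so they are treated as one phrase.
        if " " in value:
            value = f'"{value}"'
        terms.append(f"{field}:{value}")
    return " AND ".join(terms)

query = build_query([
    ("efv", "diabetes"),
    ("efv", "normal"),
    ("organism", "Homo sapiens"),
    ("exptype", "Transcription profiling"),
])
print(query)
```

The printed string can be pasted into the search box as it is.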
To search for a certain experiment type, the field code "exptype:" is very useful. The search term that follows should be from the list of experiment types [27]. The exact terms (in quotations) can be used or parts of them, e.g.

Practise searching
Let's have a go at doing some search exercises.
If you need help to complete this section you can look in the 'Need some help?' and 'Want to know how we did it?' sections.

Exercise 1
Can you find the two experiments comparing human cafeteria diet, human fast food diet and chimpanzee diet? What species were the test subjects?

Want to know how we did it?
Using ef:diet (experimental factor [28] should be "diet") and for example the terms "fast food" or "chimpanzee" will find E-GEOD-6285 and E-GEOD-6297.
The experiments were conducted using mice (Mus musculus).

Exercise 2
Can you find the Arabidopsis gene expression profiling by array experiment with experimental factor value "space flight"? How many assays does this experiment have?

Need some help?
Use the "experimental factor [28] value" and "organism" fields in your search query.

Want to know how we did it?
Published on EMBL-EBI Train online (https://www.ebi.ac.uk/training/online)
Using efv:"space flight" AND organism:Arabidopsis limits the search to one experiment.
The accession number of the experiment is E-MTAB-2518 and it contains six assays which correspond to six biological samples.

Exercise 3
Search for an area of research you are interested in by typing it in the search box (e.g. "skin cancer"). Then try limiting the results to match different constraints and see what the results are. You could try for example "only RNAseq experiments" (using the filters) and "only samples from females" (using sa:female).

Need some help?
Try using the filter options to narrow down your search results.

Want to know how we did it?
1. An example: I searched for experiments about "skin cancer". Strikingly, searching for the phrase without quotation marks actually returns fewer experiments. This is because the complete quoted phrase is expanded to its child terms in the Experimental Factor Ontology [21]: with quotation marks, my search also finds experiments that mention a type of skin cancer such as "melanoma" but do not contain the words "skin" and "cancer".
2. To find e.g. mouse models of this disease, I selected the organism "Mus musculus" from the filter option "By organism". I also selected "Array assay [7]" to only find array-based experiments. Adding "ef:diet" will show me experiments that have tested the effect of diet on skin cancer.

Experiment and Sample information
The Experiment page
Most of the information displayed here is stored in the IDF file, which can also be downloaded at the bottom of the page.

Figure 12 The experiment overview.
The first block of links, at the top of the page, takes you to the overview of the samples, the array design (for microarray [5] experiments only), and the list of protocols.
At the bottom of the page you will find the score of the different parts of the experiment, which have been automatically evaluated for compliance with the MIAME [10] or MINSEQE [11] guidelines. This is followed by an overview and links to all the downloadable Files related to this experiment. For sequencing-based experiments, the raw data [30] files are not available here but on the ENA website (7 [9]). The Links section contains links to this experiment on other resources, e.g. the ENA page of the experiment (for sequencing experiments) and Expression Atlas [24] (9 [9]) (if the experiment has been analysed).
To find out more about the samples in this experiment, click on the first link on the page next to Samples.
The number in the brackets indicates the number of biological replicates in the whole study.Note that this is not necessarily equal to the number of assays or sequencing libraries, e.g. in two-colour microarray experiments there is often only one assay [7] (hybridisation [8]) for every pair of samples.
The Array field tells you the name of the array platform that was used in this experiment and the accession of the ADF file.Follow the link if you would like to download the full ADF file that contains details about the probes on the array.
The Protocols link provides you with an overview of the experimental procedures that were used to derive the sample material, perform the hybridisation or sequencing assay, and generate the data files.


Exercise 1
Go to experiment E-MTAB-4294 [31] and find out which of the sample attributes are experimental variables and which are not. Does this make sense?
Want to know how we did it?
The sample attributes are: The experimental variables are "disease", "compound" and "dose", as they differ between the samples and are the subject of the study (investigating the effect of the compound on the disease state).
"Age" also differs between the samples, and it's clear that pigs with induced injury are older than the healthy controls. However, the point of the experiment was not to compare older vs. younger pigs, hence "age" is not declared as an experimental variable. "Compound" and "dose" were not listed as sample attributes as they are not intrinsic properties of the pigs or the synovial membranes.
Exercise 2
E-MTAB-2548 [32] is a sequencing experiment. Have a look at the samples table. Each row with sample information has been duplicated. Can you guess why this is?
Hint: Open the extended samples table and find "Comment[LIBRARY_LAYOUT]" and "Scan Name" in the header.

Want to know how we did it?
This is a "paired-end" sequencing experiment. This sequencing technique results in two raw data files per sequencing assay, denoted with the suffix "_1" and "_2". The ArrayExpress samples table shows one row per data file. Hence, each row of sample annotation is duplicated to link both files to each sample.
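The duplicated rows can be collapsed again by grouping the data files per sample. The sketch below assumes rows with a "Source Name" column identifying the sample and a "Scan Name" column holding the file name; the rows and file names are hypothetical and do not come from E-MTAB-2548.

```python
from collections import defaultdict

# Hypothetical rows from an extended samples table (one row per data file).
rows = [
    {"Source Name": "sample_A", "Scan Name": "sample_A_1.fastq.gz"},
    {"Source Name": "sample_A", "Scan Name": "sample_A_2.fastq.gz"},
    {"Source Name": "sample_B", "Scan Name": "sample_B_1.fastq.gz"},
    {"Source Name": "sample_B", "Scan Name": "sample_B_2.fastq.gz"},
]

def files_per_sample(table_rows):
    """Group file names by sample so each sample lists both paired-end files."""
    grouped = defaultdict(list)
    for row in table_rows:
        grouped[row["Source Name"]].append(row["Scan Name"])
    return {sample: sorted(files) for sample, files in grouped.items()}

pairs = files_per_sample(rows)
for sample, files in pairs.items():
    print(sample, files)
```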
Hint: Check the "Label" column in the SDRF file or extended samples table.

Want to know how we did it?
E-MEXP-2004 is a two-colour experiment. Cy3 and Cy5 are the two most common fluorescent labels used for this hybridisation technique. The column "Hybridisation Name" in the SDRF tells you which two samples were hybridised together on the same chip.

Files and Download
Most of the experiments in ArrayExpress contain raw data files, and many of these have additional processed data files (but not all experiments have both raw data and processed data). There are several types of data files associated with each experiment that can be downloaded.
Figure 15 The SDRF file is a representation of the "Samples and Data" information in a downloadable text file.

Metadata
Looking at this file will help you find which files belong to each sample and the experimental conditions to be compared, e.g. which files belong to the control group and which to the test group.Sometimes the file names do not contain any clues, so the information in this file will be vital.

Exercises
Go to experiment E-MTAB-3420 [36], and have a look at the Samples overview. Find the Experimental variable and the two conditions that were compared.
Open the SDRF file and figure out which file belongs to each category.

Want to know how we did it?
Step 1. Understand the experiment
The description text of E-MTAB-3420 and the sample attributes indicate that skeletal muscle tissue was taken from patients with diabetes and healthy individuals in order to identify gene sets associated with insulin resistance.
Thus, the experimental variable in E-MTAB-3420 is "disease" and the two conditions that are compared are "diabetes mellitus" and "normal".

Step 2. Download and open SDRF
To download and view the SDRF file, click on the link at the bottom of the experiment page under "Files" and select "E-MTAB-3420.sdrf.txt". If your browser opens the text file directly, choose the option to save the page/file in the browser menu. Alternatively, right-click on the link and select "Save link/file as…". Go to your downloads folder and open E-MTAB-3420.sdrf.txt using right-click, "Open with…", and choose your favourite spreadsheet software.

Step 3. Group files
Find the columns "Factor Value[disease]" (the last column) and "Array Data File" (the third last column). Now you should be able to categorise the data files according to the disease condition. The correct annotation of the raw data files is essential for any type of analysis.
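The grouping in Step 3 can also be done programmatically instead of in a spreadsheet. The following is a minimal sketch, assuming a tab-delimited SDRF with the two columns named above. The embedded table is a simplified, hypothetical stand-in with made-up sample and file names, not the real content of E-MTAB-3420.sdrf.txt.

```python
import csv
import io
from collections import defaultdict

# Simplified, hypothetical SDRF content (a real SDRF has many more columns).
SDRF_TEXT = (
    "Source Name\tArray Data File\tFactor Value[disease]\n"
    "patient_1\tpatient_1.CEL\tdiabetes mellitus\n"
    "patient_2\tpatient_2.CEL\tdiabetes mellitus\n"
    "control_1\tcontrol_1.CEL\tnormal\n"
)

def group_files_by_factor(sdrf_text, factor_column, file_column):
    """Map each factor value to the list of data files annotated with it."""
    groups = defaultdict(list)
    reader = csv.DictReader(io.StringIO(sdrf_text), delimiter="\t")
    for row in reader:
        groups[row[factor_column]].append(row[file_column])
    return dict(groups)

groups = group_files_by_factor(
    SDRF_TEXT, "Factor Value[disease]", "Array Data File"
)
for condition, files in groups.items():
    print(condition, files)
```

For a downloaded SDRF you would read the file into a string first, for example with open("E-MTAB-3420.sdrf.txt").read(), and pass that string to the same function.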

Raw data
Raw data [30] files can be downloaded individually from the "Samples and Data" page [37] in the rightmost column (Figure 16).For microarray [5] experiments, a zipped archive file can be downloaded from the "Experiment" page [38] (Figure 16), which contains all raw data files belonging to the study.Sometimes multiple zip files are created to keep the downloadable files below 200 MB.

Figure 16 Where to find links to download raw/processed data files.
The types of raw data files vary depending on the platform (array manufacturer or sequencing machine).For Affymetrix [39] microarrays, CEL files are common, while Agilent arrays often generate tab-delimited text matrices.Most sequencing-based experiments provide FASTQ [40] files, containing raw sequence reads.
We will look more at the different data types in the next section.
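To give a first impression of what a FASTQ file contains, here is a minimal reader written with the standard library only. It assumes the common layout of four lines per read (identifier line starting with "@", sequence, separator line starting with "+", quality string) with no line wrapping; the two reads in the example are made up. For real analyses a dedicated library is usually a better choice.

```python
import io

# Two hypothetical reads in four-line FASTQ layout.
FASTQ_TEXT = (
    "@read_1\n"
    "ACGTACGT\n"
    "+\n"
    "IIIIHHHH\n"
    "@read_2\n"
    "TTGCA\n"
    "+\n"
    "IIHHG\n"
)

def read_fastq(handle):
    """Yield (identifier, sequence, quality) for each four-line FASTQ record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # separator line starting with "+"
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError(f"Unexpected record header: {header!r}")
        yield header[1:], sequence, quality

records = list(read_fastq(io.StringIO(FASTQ_TEXT)))
for identifier, sequence, quality in records:
    print(identifier, len(sequence), "bases")
```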

Processed data
Before re-using processed data [23] files, it is advisable to check the protocols for "normalisation data transformation" to understand what these files represent (Figure 17).
Figure 17 Finding protocol information describing how the raw data was processed to derive the processed data matrix.
In the next section, you will find a short overview of how to open and process some of the most common data formats.

Next steps (towards data analysis)
In the next section, we will give you a brief overview of common analysis methods performed on functional genomics data (Figure 18).

Opening and processing raw data files
How to open and process data files will depend on whether you have a microarray [44] or a sequencing [45] (e.g. NGS) experiment. We will cover both of these in the next two sections.

Microarray experiments
Array manufacturers often provide software to open and analyse their raw data [30] files (Table 1).These programs may not always be available or may not be flexible enough for your needs.There are several free software tools that are suitable for the downstream processing of microarray [5]

Figure 1 An overview of the ArrayExpress course.
Each experiment is represented by two MAGE-TAB spreadsheets: the Investigation Description Format (IDF) file and the Sample Data Relationship Format (SDRF) file (Figure 4).

Figure 4 Representation of experiment metadata in Investigation Description Format (IDF) and Sample Data Relationship Format (SDRF) spreadsheets.

Figure 5 The Array Design Format describes the layout of the array used for microarray experiments in ArrayExpress. The ADF describes how a microarray was manufactured and what was printed or synthesised on the array.

Figure 6 Types of experimental data that are stored in ArrayExpress.

Figure 7 Examples of experimental variables.

Figure 9 A list of all experiments archived in ArrayExpress with key information about each experiment.
Most of the time you are interested in finding experiments matching certain criteria. Two options are available for searching and filtering [17] the experiment list: the filter box and the search box (Figure 10). The option you choose depends on what kind of criteria you would like to use to find interesting experiments. The two methods can of course be combined. As a general rule, use the search box for attributes not listed in the filter box. For example, use the filters to select data sets from one species, and use the search box to specify the experimental variable(s) you are interested in. Examples are shown in Figure 10, below.

Figure 10 The filter and free-text search in ArrayExpress.
If you want to search for experiments with a specific design, you can limit your search to certain fields. This can be done by writing the query in the format fieldname:value. Common "fields" that can be queried directly include:
the organism of the source material with organism:
the experimental factor (experimental variable) with ef:
the value of an experimental factor with efv:
the assay technology with exptype:
any attribute of the biological sample with sa:
(See the full list of field names [26].)
To see the experiment overview page for an individual study (Figure 10), click on the accession [16] number in the left column of the search results / browse experiments table (Figure 9 [29]).

Figure 14 The IDF file is a representation of the experiment and author information in a downloadable text file.
Name: SDRF file
Type: Tab-delimited text file
Job: Represents the relationship between samples and data files

Figure 18 Analysis of functional genomics data.
The nature of processed data can vary between different submissions because "processing" can mean different things, ranging from background removal, log2 transformation [41] and data normalisation [42] of single hybridisations, to fold change values between two conditions. For sequencing experiments, there are multiple stages in the data analysis pipeline that can generate processed data files, e.g. trimming or filtering [17] of read sequences, reference genome alignment, and normalised "reads per kilobase of transcript per million mapped reads" (RPKM [43]) values.
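Two of the transformations mentioned above are simple enough to write down. The sketch below computes an RPKM value from a raw read count and a log2 fold change between two conditions. The function names, the example numbers and the pseudocount of 1 (added here only to avoid taking the logarithm of zero) are assumptions for illustration; the processing actually applied to a given data set is described in its protocols.

```python
import math

def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)

def log2_fold_change(test_value, control_value, pseudocount=1.0):
    """log2 ratio of test over control; the pseudocount is an assumption."""
    return math.log2((test_value + pseudocount) / (control_value + pseudocount))

# Hypothetical numbers: 500 reads on a 2,000 bp transcript, 10 million mapped reads.
print(rpkm(500, 2000, 10_000_000))
# Hypothetical expression values in test and control samples.
print(log2_fold_change(7.0, 1.0))
```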