Analysis of the Human Tissue-specific Expression by Genome-wide Integration of Transcriptomics and Antibody-based Proteomics*

Global classification of the human proteins with regards to spatial expression patterns across organs and tissues is important for studies of human biology and disease. Here, we used a quantitative transcriptomics analysis (RNA-Seq) to classify the tissue-specific expression of genes across a representative set of all major human organs and tissues and combined this analysis with antibody-based profiling of the same tissues. To present the data, we launch a new version of the Human Protein Atlas that integrates RNA and protein expression data corresponding to ∼80% of the human protein-coding genes with access to the primary data for both the RNA and the protein analysis on an individual gene level. We present a classification of all human protein-coding genes with regards to tissue-specificity and spatial expression pattern. The integrative human expression map can be used as a starting point to explore the molecular constituents of the human body.

Central questions in human biology relate to how cells, tissues, and organs differ in the expression of genes and proteins and what consequences the global expression pattern has for the phenotype of various cells with different functions in the body. Therefore, the annotation of the human protein-coding genes with regards to the spatial, temporal, and functional space represents one of the greatest challenges in human biology (1). Important questions related to this are how many of the genes actually code for functional proteins, how many are expressed in a tissue-specific manner, and how many proteins have "housekeeping" functions and are therefore expressed in all cells? These questions have a major impact not only on efforts to try to understand human biology, but also for applied medical research, such as pharmaceutical drug development and biomarker discovery in the field of translational medicine.
Several efforts have been initiated in the aftermath of the genome project to systematically annotate the putative protein-coding part of the human genome. Genome annotation efforts, such as Ensembl (2) and RefSeq (3), have provided an increasingly accurate map with at present ϳ20,000 proteincoding genes. Similarly, the ENCODE consortium has been launched to provide an integrated encyclopedia of DNA ele-ments in the human genome (4). On the protein level, the UniProt consortium (5) has annotated 20,255 human genes as coding for proteins (release 2013_05), including a large number of isoforms. In addition, the protein distribution in human tissues have been explored using antibodies generating more than 13 million manually annotated immunohistochemistry images in the Human Protein Atlas (6). On the RNA level, the FANTOM consortium has been initiated to map the transcriptional space of the human genome and several gene expression atlases for RNA expression data have been launched, such as the original work to create a gene atlas by integrating mouse and human expression data from multiple tissues using micro arrays (7), the BioGPS portal with expression data from numerous sources (8), the repository ArrayExpress (9) and the RNA-Seq Atlas (10), with transcriptomics data based on deep sequencing from eleven normal human tissues. These resources are important tools for comparisons of gene expression patterns in different normal and disease tissues and applications span from in-depth analysis of specific genes to more global systems biology studies to understand human biology and disease.
Here, we describe an extension of these efforts by integrating protein expression data with a transcriptomics analysis based on deep sequencing (RNA-Seq) 1 of tissue samples. Because the selection of samples cover a large portion of the major tissues in the human body, we have used the quantitative RNA expression analysis to allow a tissue-specific classification of all the human protein-coding genes. A new version of the Human Protein Atlas (www.proteinatlas.org) is presented with RNA and protein expression data corresponding to 91 and 80%, respectively, of the putative protein-coded genes.

MATERIALS AND METHODS
Transcript Profiling (RNA-seq)-The use of human tissue samples was approved by the Uppsala Ethical Review Board (Reference #2011/473). Tissues samples, collected within the infrastructure of an established biobank, were embedded in Optimal Cutting Temperature (O.C.T.) compound and stored at Ϫ80°C. A hematoxylin-eosin (HE) stained frozen section (4 m) was prepared from each sample using a cryostat and the CryoJane® Tape-Transfer System (Instrumedics, St. Louis, MO, USA). Each slide was examined by a pathologist to ensure proper tissue morphology. Three sections (10 m) were cut from each frozen tissue block and collected into a tube for subsequent RNA extraction. The tissue was homogenized mechanically using a 3 mm steel grinding ball (VWR). Total RNA was extracted from cell lines and tissue samples using the RNeasy Mini Kit (Qiagen, Hilden, Germany) according to the manufacturer's instructions. The extracted RNA samples were analyzed using either an Experion automated electrophoresis system (Bio-Rad Laboratories, Hercules, CA, USA) with the standard-sensitivity RNA chip or an Agilent 2100 Bioanalyzer system (Agilent Biotechnologies, Palo Alto, USA) with the RNA 6000 Nano Labchip Kit. Only samples of high-quality RNA (RNA Integrity Number Ն7.5) were used in the following mRNA sample preparation for sequencing. mRNA sequencing was performed on Illumina HiSeq2000 and 2500 machines (Illumina, San Diego, CA, USA) using the standard Illumina RNA-seq protocol with a read length of 2 ϫ 100 bases . The samples were sequenced multiplexed 15 per lane,  producing an average of 18 million mappable read pairs per sample. The raw reads obtained from the sequencing system were trimmed for low quality ends with the software sickle (11), using a phred quality threshold of 20. All reads shorter than 54 bp after the trimming were discarded. The processed reads were mapped to the GRCh37 version of the human genome with Tophat v2.0.3 (12). Potential PCR duplicates were eliminated using the MarkDuplicates module of Picard 1.77 (13). To obtain quantification scores for all human genes and transcripts, FPKM (fragments per kilobase of exon model per million mapped reads) values were calculated using Cufflinks v2.0.2 (14), which corrects for transcript length and the total number of mapped reads from the library to compensate for different read depths for different samples. The gene models from Ensembl build 69 (2) were used in Cufflinks.
Sample Selection-At least three samples (from different individuals) of each tissue were selected for sequencing in order to ensure proper statistical power, in accordance to common best practice recommendations for RNA-seq. Five samples that did not group with other samples from the same tissue in the hierarchical clustering were discarded after closer investigation of the tissue cuts and, if possible, replaced with newly prepared samples. However, for three tissues (ovary, duodenum and pancreas), only two replicates are included in the final analyses.
Barcode "Leakage"-As has been previously observed (15) multiplexing of samples on a single lane on the Illumina platform can sometimes lead to misidentification of barcodes, leading to what looks like a cross-contamination between samples. Here we observe a small quantity of misidentified reads (ϳ0.1%) for samples sequenced multiplexed on the same lane in the same run. Effectively this makes genes with very high expression in a certain tissue appear to be have low expression in the other samples run on the same lane.
This introduces a minor bias in the data, which could potentially mislead analyses. However, because most analyses here are looking at relative differences (fold-changes) compared the tissue with highest expression, a leakage of 0.1% hardly affects the analyses.
Specificity Classification-For each tissue, the average FPKM value of all individual samples was used to estimate the gene expression level. A cutoff value of 1 FPKM was used as the detection limit. Each of the 20,050 genes were classified into one of eight categories based on the FPKM levels in 27 tissues: 1. "Not detected" -Ͻ 1 FPKM in all 27 tissues; "Tissue specific" -50-fold higher FPKM level in one tissue compared with all other tissues; "Tissue enriched" -fivefold higher FPKM level in one tissue compared with all other tissues; "Group enriched" -fivefold higher average FPKM level in a group of 2-7 tissues compared with all other tissues; "Mixed low" -detected in 1-26 tissues and at least one tissue Ͻ 10 FPKM; "Mixed high"detected in 1-26 tissues and all detected tissues Ͼ 10 FPKM; "Expressed in all low" -detected in 27 tissues and at least one tissue Ͻ 10 FPKM; "Expressed in all high" -detected in 27 tissues and all tissues Ͼ10 FPKMЈ. For analyses performed in this study where a log2-scale of the data was used, pseudo-counts of ϩ1 were added to the data set.
Hierarchical Clustering-FPKM values for all 20,050 genes were used to calculate a correlation matrix based on Spearman's rank correlation coefficient for each pairwise combination of individual samples (n ϭ 95). The correlation matrix was used for unsupervised hierarchical clustering analyses of individual samples using the average linkage method to measure distances between clusters. The results were visualized in a heatmap of all pair-wise correlation coefficients.
Gene Ontology Analysis-A gene ontology (16) analysis was performed using the GOrilla tool (17) to determine overrepresented GO categories in the gene set of not detected genes. Genes were assigned to terms in order of overexpression of the terms. Meaning that genes placed in the first category (Sensory perception) were not assigned to less overexpressed terms even though they were associated with the term. A list of all genes analyzed in this study was used as the background list in GOrilla.
Network Analysis-A network analysis of all genes in the tissue enriched and group enriched categories was performed using Cytoscape 3.0 (19). The resulting network includes only group enriched nodes with at least three expressed genes and a maximum of five connections.
Tissue Profiling-Tissue microarrays (TMA) containing triplicate 1-mm cores of 44 different types of normal tissue and duplicate 1-mm cores of 216 different cancer tissues representing the 20 most common forms of human cancer were generated as previously described (18). TMA sections were immunostained as previously described (20). Briefly, slides were deparaffinized in xylene, hydrated in graded alcohols and blocked for endogenous peroxidase in 0.3% hydrogen peroxide diluted in 95% ethanol. For antigen retrieval, a Decloaking chamber® (Biocare Medical, Walnut Creek, CA) was used. Slides were immersed and boiled in Citrate buffer®, pH6 (Lab Vision, Freemont, CA) for 4 min at 125C and then allowed to cool to 90C. Automated IHC was performed essentially as previously described (21) in brief, using an Autostainer 480 instrument® (Lab Vision). Primary antibodies and a dextran polymer visualization system (UltraVision LP HRP polymer®, Lab Vision) were incubated for 30 min each at room temperature and slides were developed for 10 min using Diaminobenzidine (Lab Vision) as chromogen. All incubations were followed by rinse in wash buffer® (Lab Vision). Slides were counterstained in Mayers hematoxylin (Histolab) and cover slipped using Pertex® (Histolab) as mounting medium. Incubation with PBS instead of primary antibody served as negative control. TheAperioScanScope XT Slide Scanner (Aperio Technologies, Vista, CA) system was used to capture digital whole slide images with a 20ϫ objective. Slides were dearrayed to obtain individual cores. The outcome of IHC stainings in the screening phase, that included various normal and cancer tissues, was manually evaluated and scored by certified pathologists using a web-based annotation system (unpublished). In brief, the manual score of IHC-based protein expression was determined as the fraction of positive cells defined in different tissues: 0 ϭ 0 -1%, 1 ϭ 2-25%, 2 ϭ 26 -75%, 3 Ͼ 75% and intensity of immunoreactivity: 0 ϭ negative, 1 ϭ weak, 2 ϭ moderate, and 3 ϭ strong staining. All of the tissues used as donor blocks were acquired from the archives at the Department of Pathology of Uppsala University Hospital in agreement with approval from the Research Ethics Committee at Uppsala University (Uppsala, Sweden)(Ups 02-577).
Antibodies Data Availability-All the data (FPKM values for all the samples) are available for downloads without any restrictions (www.proteinatlas. org/about/download). The primary data (reads) are available through the Array Express Archive (www.ebi.ac.uk/arrayexpress/) under the accession number: E-MTAB-1733. The transcript profiling data (FPKM values) for each gene in each cell and tissue will also be available in the next version of the Human Protein Atlas (www.proteinatlas.org). The classification according to Fig. 2C will be included on the Protein Atlas gene summary page for each of the genes. FPKM values for all 20,050 genes included in the study are provided as downloads to facilitate comparative studies with other "omics" data.

RESULTS
The transcriptomes of 27 different human organs and tissues were analyzed using next generation sequencing based on specimens from altogether 95 individuals. These tissues were selected to represent tissue types with specialized body functions and to include all major organs and tissues (Fig. 1A). All tissues were microscopically examined to ensure the representation of normal tissue and to estimate the fraction of normal cell types in each sample. The transcriptome of each sample was quantified using RNA-Seq to determine the normalized mRNA abundance, calculated as FPKM-values (22). In these analyses we have used a cutoff of FPKM 1, roughly representing one mRNA molecule per average cell in the sample (23). The distribution of average FPKM values in the various organ and tissues samples range from 1 to more than 10,000, as exemplified by 60,000 for some of the transcripts encoding digestive enzymes in the pancreas (chymotrypsinogen, amylase and protease), to yield a dynamic range of more than 10 4 within most tissues. Analysis of biological replicates across all tissues showed high reproducibility with a Spearman correlation between samples from the same tissue, but different individuals, of 0.94 to 0.98 (supplemental Table S1 and supplemental Fig. S1). The high correlation coefficients suggest good technical reproducibility between samples and low biological inter-individual variability between different individuals.
The global expression profiles were investigated using hierarchical clustering and a correlation heat map including all the 95 biological replicates for the 27 organs and tissues (Fig.  1B). The results reveal the relationship between the samples with distinct clusters for each tissue type and larger clusters representing the similar tissue from the gastro-intestinal tract (colon, duodenum, stomach, and small intestine) and tissues dominated by cells from the blood and immune system (lymph node, appendix, spleen, and bone marrow).
The Human Transcriptome-The total number of genes with detected transcripts was calculated for each sample and the results are shown in Fig. 2A. The genes were categorized based on mRNA abundance to reveal the number of genes in each category. With a cut-off of 1 FPKM, the number of detected genes ranged from 11,520 in the bone marrow to 15,224 in the testis. The overlap between genes identified in this study with putative human protein-coding genes annotated by UniProt (5) and the consensus gene consortium (CCDS) are shown in Fig. 2B. The Venn-diagram shows that more than 17,000 genes of the genes identified as transcribed here have previously been proposed as protein-coding by both the UniProt and CCDS consortia, whereas transcripts corresponding to 619 genes that were detected in the tissues analyzed in the present study have not yet been annotated as protein-coding genes by either UniProt or CCDS. These are important genes for further in-depth studies to establish if they indeed code for proteins.
Classification of all Protein-coding Genes Based on Transcriptomics Analysis-The transcriptomics analysis across the 27 organs and tissues allowed us to classify all the protein-coding genes (n ϭ 20,050) according to pattern of tissuespecific expression (Fig. 2C). The definition of the various classes used in this study is shown in supplemental Table S2 with four major classes further subdivided into subclasses. The largest group of genes (46%) is the "housekeeping genes" that are expressed in all tissues, supporting earlier results (24) that many genes are ubiquitously expressed in human cells. The second class (28%) consists of the mixed categories with gene expression detected in 2-26 of the tissues, but not identified as tissue-specific. The third class is the tissue-specific genes that consist of two major subclasses; the tissue-enriched genes with at least fivefold higher expression in a specified tissue as compared with all other tissues and the group-enriched genes with at least fivefold higher expression in a small group of tissues as compared with all other tissues. The tissue-enriched genes constitute ϳ12% of all genes (n ϭ 2,473), with ϳ3% highly tissue enriched with at least 50-fold higher mRNA level than any other tissue, and 9% moderately tissue enriched with at least a fivefold higher expression than any other tissue. Another 5% (n ϭ 1026) of all genes are classified as group enriched in 2-7 tissues. Eight percent of all genes were not detected in any of the tissues analyzed here and a functional analysis (supplemental Fig. S2) shows that many of the undetected genes belong to olfactory receptors, keratin-associated proteins and genes involved in development. Thus, these genes will most likely be found when more specialized tissues and different developmental stages are included and they will most likely be added to the tissue-specific class of proteins. Almost half of the genes in the undetected group were unknown according to the Gene Ontology (GO) analysis and the question remains if these genes are true protein-coding genes or if they should be re-classified as pseudogenes or noncoding genes.
Integration of the Transcriptomics Data on the Human Protein Atlas-The transcriptomics data has been imported into the Human Protein Atlas to allow for visualization of the RNA expression patterns integrated with the results from more than 13 million individual antibody-based immunostainings (6) corresponding to more 80% of the protein coding genes. A new summary page for all genes has been designed to provide for comparative analysis of RNA/protein expression on an individual gene basis. In Fig. 4, two examples of summary pages are shown for a "house-keeping" gene (RPL24) and a tissue-specific gene (CYP2A13) with the RNA and protein level as assessed from the RNA-Seq data and antibodybased protein profiling data across the same 27 tissues. Note that the antibody for CYP2A13, in addition to the strong staining in liver, also stains one of the tissues in the GI-tract (stomach) weakly, whereas the RNA-Seq data show no transcripts for this gene in this tissue suggesting that the protein staining might be because of technical issues related to off-target binding. The comparative analysis of both RNA and protein levels can therefore be used as a validation tool to generate a protein atlas with higher reliability in the future.
In the new version of the Human Protein Atlas portal, the comparative analysis is shown for more than 80% of the putative protein-coding genes, including a visualization of the underlying primary data for the RNA-Seq with the number of reads corresponding to each base in the exons and introns of the analyzed gene. The lists of tissue-specific genes (supplemental Table S3) are available in the new version of the Human Protein Atlas (www.proteinatlas.org) portal to allow convenient exploration of the tissue-specific genes in various parts of the human body. The raw sequencing reads as well as the calculated mRNA levels (FPKM values) in each tissue, are also available at ArrayExpress (http://www.ebi.ac.uk/ arrayexpress/) under the accession number E-MTAB-1733. All the data including the classification of human protein-coding genes can also be downloaded from the Human Protein Atlas portal.
Antibody-based Profiling of the Tissue-specific Proteins-The analysis has given us lists of potential tissue-specific proteins, defined as highly or moderately tissue enriched, or group enriched. In order to validate the results, we have used the results from more than 13 million individual antibodybased protein profiles available in the Human Protein Atlas (6) to confirm the tissue-specificity also on the protein level. Immunohistochemistry allows a high resolution spatial mapping on a single cell level in the composite tissues and organs, to yield a more precise map of human protein expression. The various tissue-specific proteomes, such as cerebral cortex, testis, liver,  kidney, fat, and pancreas, with accompanied protein data, will be published separately, but here we give a few examples illustrating the usefulness of combining the transcriptomics analysis with the antibody-based protein profiling.
The first example is from the kidney-specific proteome for which 65 tissue-enriched genes were defined in our study. Immunohistochemistry was available for 62 of the corresponding proteins and the results show a strong bias for membrane-bound localization. In Fig. 5A, three examples of such proteins are shown, localized to distinct compartments in the human kidney, such as glomeruli (NPHS2), proximal tubuli (SLC22A13), and distal tubuli (TMEM72). Of the highly tissue enriched proteins, 18 out of 20 are transmembrane proteins as predicted by bioinformatics analysis and 16 of these are shown to localize to various subcompartments of the renal nephron by the immunohistochemistry analysis. Twelve of these proteins have previously been annotated as members of different solute carrier proteins (SLC), which me- diate trans-membrane movement of electrolytes, endogenous metabolites, nutrients, micronutrients, and vitamins (25). Among the 65 kidney-enriched genes, we identified 15 genes previously not described as expressed in the kidney and we were able to confirm the localization of the corresponding proteins in the kidney by immunohistochemistry for a majority of proteins, as exemplified by TMEM72.
The second example is from the liver-specific proteome for which 179 tissue-enriched proteins were defined in our study. Immunohistochemistry-based protein profiling data was available for 141 of these proteins. Of the highly tissue enriched proteins, half (27 out of 54) are plasma proteins as predicted by bioinformatics analysis and 25 of these show plasma positivity by the immunohistochemistry analysis, including serpin peptidase inhibitors, haptoglobulin, and fibrinogens. Many of the other liver specific proteins are associated with known liver specific functions, such as drug, retinol and xenobiotics metabolism as well as primary bile acid biosynthesis. The antibody-based profiling allows a more in depth analysis of the protein expression pattern in the liver and in Fig.  5B, examples of detoxification enzymes with a heterogeneous protein expression pattern in the hepatocytes are displayed. A majority of the enzymes show a gradient-like expression pattern with high expression in hepatocytes sur-rounding the central vein and low expression adjacent to the portal zone, whereas CYP2A13 shows the opposite expression pattern.
The third example is from the testis. This male organ has by far the largest number of tissue-specific proteins (n ϭ 1,378), accounting for more than one third of all identified tissuespecific proteins. Immunohistochemistry was available for 1061 of the testis tissue-specific proteins and a preference toward cytoplasmic and nuclear localization was observed, which is not surprising because many of the genes are involved in intracellular events, such as meiosis. Among the highest expressed genes are PRM1, PRM2, and TNP1, which are nuclear proteins known to be involved in the condensation of sperm chromatin during the haploid phase of spermatogenesis (26). We identified several previously uncharacterized proteins shown to be involved in various stages of spermatogenesis, such as SPATA3, ZPBP, HMGB4, RBMXL2, TSPYL6, and SOCS7 (Fig. 5C). The majority of the tissue-enriched genes belong to cell types representing various stages of spermatogenesis, whereas only a few proteins were localized in Sertoli and Leydig cells. ent proteins might yield evidence of additional tissue specificity. An example of this is displayed in Fig. 6, showing the transcript data for Claudin 18 (CLDN18), which is a gene encoding an integral membrane protein in tight junction strands. Tight junction strands serve as a physical barrier to prevent solutes and water from passing freely through the paracellular space between epithelial or endothelial cell sheets, and also play critical roles in maintaining cell polarity and signal transductions. The RNA-Seq data (Fig. 6) show that alternative initial exons are used in lung and stomach, respectively. A detailed analysis of the corresponding coding regions reveal that, although similar, the N-terminal region of the proteins (coded by the initial exons) differ in almost 1/3 of the residues between lung and stomach. Further studies are needed to elucidate how wide-spread this phenomenon is and to gain a more in-depth understanding of the molecular consequences in these two tissues and other similar cases.

DISCUSSION
The comprehensive analysis presented here has identified ϳ3500 human genes that display a tissue-or group-enriched expression pattern across the human body. Functional analysis of these tissue-specific proteins identified in our analysis is well in line with the function of the respective tissue or organ. Thus, the kidney specific proteome consists of many membrane-bound transport proteins, whereas the most abundant tissue-specific proteins in liver are secreted plasma proteins, e.g. albumin and haptoglobin, and detoxification proteins, such as alcohol dehydrogenase and a large number of proteins belonging to the cytochrome P450 superfamily of enzymes. The most abundant pancreas-specific proteins are digestive enzymes, whereas many of the fat-specific proteins are involved in lipid metabolism. The analysis identified a multitude of tissue-specific genes with no or little previous evidence on the protein level and the combined RNA-and antibody-based profiling can thus be used to confirm the functional existence of these protein-coding genes lacking previous annotation. These proteins are of course interesting starting points for further in-depth studies to gain better molecular understanding of the cellular phenotypes that define the function of each respective tissue and organ.
The data presented here has allowed us to build a new resource with integration of RNA and protein expression data. As expected, the RNA expression levels as measured by RNA-Seq and the protein levels detected by staining with immunohistochemistry display low correlation for many genes. This is not surprising, because immunohistochemistry based on enzymatic amplification technology is not quantitative (27) and in addition often yields off-target binding to unrelated proteins. An important task for the future will there-fore be to investigate the genes with low correlation between the RNA and protein levels. In many cases, this lack of correlation could also be because of biology, such as secretion of the protein out of the expressed cell type or post-translational events because of protein turn-over or other proteolytic processing, but the integrative RNA and protein data can also be used to indicate genes with potential technical issues, such as a context-depending off-target binding of the antibody or erroneous mapping of the RNA-Seq reads to the genome. The antibody-based protein profiling can be used for more precise localization patterns within the mixture of cell populations in a tissue, as shown here for the subcompartments of the renal nephron in the kidney, well-defined stages of spermatogenesis in the testis or gradient-like expression patterns in the liver.
The analysis presented here can be expanded in many different directions to provide an even more refined atlas of protein expression in the human body. Integration of this data with similar global efforts, such as the Cancer Genome Atlas program (28), the ENCODE (4) and the UniProt (5) efforts, will allow for a knowledge-base for molecular studies of the individual components of human proteome. Analysis of protein isoforms, in particular splice variant analysis of mRNA molecules, is an attractive extension. The results also facilitate comparative studies where the involvement of tissue-specific proteins are explored to identify targets for drug development and diagnostic assays for personalized medicine, including stratification of patients.
The integration of protein and RNA expression data is aimed toward the generation of a global, high-resolution expression map covering most of the tissues and cell types in the human body and this will be facilitated by alternative efforts based on transcriptomics analysis of specialized tissues, cells of different development origin and inclusion of cells and tissues originating from patients with different diseases. The classification of genes described here can thus be seen as step toward integration of RNA and protein data from various sources, including proteomics (29), to generate a high-resolution knowledge-base resource to allow for in-depth studies on individual genes and their protein counterparts, as well as more global studies using systems biology approaches.