Applying computation biology and “big data” to develop multiplex diagnostics for complex chronic diseases such as osteoarthritis

Abstract The data explosion in the last decade is revolutionizing diagnostics research and the healthcare industry, offering both opportunities and challenges. High-throughput “omics” techniques have generated more scientific data in the last few years than in the entire history of mankind. Here we present a brief summary of how “big data” have influenced early diagnosis of complex diseases. We will also review some of the most commonly used “omics” techniques and their applications in diagnostics. Finally, we will discuss the issues brought by these new techniques when translating laboratory discoveries to clinical practice.


Computational techniques in early diagnosis
The ability to provide effective treatments in the early stages of a disease tends to lead to significantly better outcomes for the patient than providing the same treatment at a much later stage of progression. This is particularly true for diseases such as cancer and cardiovascular disease, where any time lost can be a matter of life or death. However, early diagnosis of these diseases may be difficult using traditional biochemical methods because of their asymptomatic nature and the lack of efficient detection technologies. In the last decade, various high-throughput “omic” technologies have produced an exponentially increasing amount of data, and we have now effectively entered the era of “big data”. Although they require massive computational resources and advanced data processing and analysis methods, “big data” approaches to diagnostic medicine and biomarker development have been successfully applied to the early detection of complex chronic diseases. In some cases, this has given us a deeper understanding of the molecular pathogenesis of disease. In this review, we will discuss the current state of “big data” and computational techniques for early-stage disease diagnosis and how advances in these techniques may promote a better understanding of complex diseases.

“Big data” in disease diagnosis
Although great progress has been made within the last few decades, classical biomedical research methodology still faces challenges in the diagnosis of complex diseases. These diseases are typically associated with the effects of multiple genes in combination with lifestyle and environmental factors. One reason for the difficulty of early diagnosis (or prediction) is that changes in traditional biomarkers can be too subtle at the asymptomatic stage to efficiently distinguish patients from normal individuals (Chen et al., 2012), and useful information can often be masked by the “noise” generated by naturally occurring variation within a given population. Therefore, many groups have suggested that diagnosis should be considered in a more comprehensive manner. Hampel et al. (2011) suggested that a combination of multiple biomarkers, genetic predisposition, and environmental factors should all be taken into account for early diagnosis and personalized therapy of complex diseases such as Alzheimer's disease. However, such studies require large-scale measurements on a large number of individuals to reduce over-fitting of predictive models. With the development of high-throughput “omics” techniques and the reduction in price per sample, these types of analyses are now a reality. An enormous amount of data has been generated, providing a global view with rich information on diseases and their diagnosis.
One of the largest projects is The Cancer Genome Atlas (TCGA, http://cancergenome.nih.gov/), which contains clinical information, histopathology slide images, and molecular information from over 8000 tissue samples of 34 types of cancer. The goal of TCGA is to improve early detection and treatment of cancer by understanding how DNA mutations interact to drive cancers. However, the interpretation of such rich information is a “big data problem”. The meaning of “big data” varies between fields. In biomedical research, it essentially refers to computational analyses that help scientists make sense of extremely large experimental and clinical data sets. Big data are already influencing disease diagnosis. For example, by studying a large sample set, Chen et al. (2011) achieved considerably high specificity (98.9% and 91.9%) for non-invasive prenatal diagnosis of trisomy 13 and trisomy 18 using maternal plasma DNA sequencing.
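To make the performance measures quoted for such diagnostic tests concrete, the following minimal sketch computes sensitivity and specificity from confusion-matrix counts. The function name and all counts are hypothetical and chosen only for illustration; they are not taken from any study cited here.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # fraction of affected cases correctly flagged
    specificity = tn / (tn + fp)  # fraction of unaffected cases correctly cleared
    return sensitivity, specificity


# Hypothetical counts (assumption, illustration only).
sens, spec = sensitivity_specificity(tp=45, fn=5, tn=940, fp=10)
print(f"sensitivity = {sens:.3f}, specificity = {spec:.3f}")
```

A test can have high specificity and still miss affected cases, which is why both values are normally reported together.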
Big data in disease diagnosis shares the same IT challenges as in other fields, including data storage, transfer, access control, and management (Marx, 2013; Schadt et al., 2010). Another challenge is the computational modeling of complex biological systems. Because of the large scale and diversity of the data, non-optimized models may amount to non-deterministic polynomial-time hard (NP-hard) problems, for which no polynomial-time algorithm is known and whose run time can therefore grow explosively with data size (Schadt et al., 2010). Moreover, sampling bias should not be neglected. According to Kaplan et al. (2014), bigger data are not always better, since large-sample studies can sometimes magnify biases resulting from sampling or study design.

Computational “omics” techniques
Nine of the ten leading causes of death in the United States have an identifiable genetic component (Hoyert & Xu, 2012). A positive association between genetic variation and disease may not only help diagnose diseases at an early stage but also predict disease onset before the initiation of pathogenesis. Genome-wide association study (GWAS) is one of the most common statistical approaches; it involves rapidly scanning millions of markers (single-nucleotide polymorphisms, SNPs) across genomes at the same time to find genetic variations associated with a common complex disease (Visscher et al., 2012; Wellcome Trust Case Control Consortium, 2007). Liu et al. (2014) reported that including GWAS genetic variant data significantly improved their naïve Bayes diagnostic model for breast cancer. As technological improvements continue to decrease DNA sequencing costs, whole genome sequencing (WGS) and whole exome sequencing (WES, which sequences protein-coding genes only) are becoming more practical for clinical applications and might be a potential alternative to GWAS, as they provide more information on whole genomes (Berg et al., 2011). However, WGS/WES generates large quantities of data that require tremendous computational capacity for analysis steps such as sequence alignment, variant calling, filtering, and identification of disease susceptibility genes. In fact, sequence data are produced significantly faster than current computational resources can handle (Stein, 2010). Thus, more efficient algorithms and/or more powerful hardware need to be developed in the future (Ding et al., 2014). However, this may lead to an “arms race” between hardware and software, resulting in increased rates of obsolescence in the field. Therefore, it is clear that data acquisition (hardware) and analysis (software) cannot be pursued independently of each other.
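The core statistical step of a GWAS, testing each SNP for association with case/control status and then correcting for the number of tests, can be sketched as follows. This is a simplified illustration under stated assumptions: it uses an allelic Pearson chi-square test on a 2x2 table with a Bonferroni correction, and the SNP names and allele counts are hypothetical. Real studies also handle covariates, population stratification, and quality control, which are omitted here.

```python
import math


def allelic_chi2(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-square (1 df) on a 2x2 table of allele counts; returns (chi2, p)."""
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    total = case_alt + case_ref + ctrl_alt + ctrl_ref
    row = [sum(r) for r in table]
    col = [table[0][j] + table[1][j] for j in range(2)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p


# Hypothetical allele counts per SNP: (case_alt, case_ref, ctrl_alt, ctrl_ref).
snps = {
    "snp_A": (300, 700, 200, 800),
    "snp_B": (250, 750, 245, 755),
    "snp_C": (120, 880, 110, 890),
}
alpha = 0.05 / len(snps)  # Bonferroni-corrected significance threshold
results = {name: allelic_chi2(*counts) for name, counts in snps.items()}
significant = [name for name, (_, p) in results.items() if p < alpha]

for name, (chi2, p) in results.items():
    print(f"{name}: chi2 = {chi2:.2f}, p = {p:.3g}")
print("significant after Bonferroni:", significant)
```

With millions of markers the corrected threshold becomes very strict (a genome-wide significance level of 5 × 10^-8 is commonly used), which is one reason why large cohorts are needed.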
Gene expression (transcriptomics) profiling provides an opportunity for accurate, definitive diagnosis (Wen et al., 2013; Wiseman et al., 2013). High-throughput mRNA sequencing (RNA-Seq) is one of the most popular techniques in transcriptomics because it allows both the investigation of known transcripts and the discovery of new ones. Since transcripts (RNAs) need to be converted to cDNA and then sequenced, RNA sequence assembly algorithms that can handle short, low-quality reads without a reference are required (Martin & Wang, 2011). Microarrays have a number of limitations compared with RNA-Seq, which offers unbiased detection of transcripts, increased dynamic range, increased specificity/sensitivity, and increased detection of rare/low-abundance transcripts; nevertheless, microarrays can be used to measure the expression levels of large numbers of genes simultaneously. In addition to regular clinical diagnosis, many recent articles have reported the successful application of microarrays in prenatal diagnosis (Shaffer et al., 2012; Wapner et al., 2012).
Proteomics can also be used to detect biomarkers of early-stage diseases such as cancer (Mehrotra & Gupta, 2011; Rahman et al., 2011), cardiovascular disease (Delles et al., 2010; Gerszten et al., 2011), Alzheimer's disease (Craig-Schapiro et al., 2011), and other chronic diseases (Good et al., 2010; Zurbig et al., 2012). Mass spectrometry (MS)-based proteomics can help identify all differentially expressed proteins and their post-translational modifications during disease progression, which can be used as biomarkers for early diagnosis and for monitoring disease treatment (Colinge & Bennett, 2007). MS data processing relies heavily on open-access public proteomics databases. Both our own group and others in the field have used high-throughput ELISA technologies such as Luminex and Meso Scale to examine panels of proteins (typically numbering between 20 and 60) in chronic diseases such as osteoarthritis (Heard et al., 2013) and traumatic injuries (Helmy et al., 2012).
Metabolomics, while a younger field than the others, is rapidly expanding in diagnostics in the “post-genomic era”. Metabolic characteristics and changes in patients are influenced not only by which genes are transcribed but also by the composition of the material that cells obtain from their microenvironment. Many reviews have discussed the application of metabolomics in diagnostics using high-throughput techniques such as nuclear magnetic resonance (NMR) spectroscopy and MS. Madsen et al. (2010) provided a comprehensive summary of metabolomics in the diagnosis of cancer, diabetes, cardiovascular disease, and other complex diseases. Zhang et al. (2012) pointed out that saliva metabolomics is a potential method for personalized therapy and treatment monitoring. However, the type of data analysis is crucial for metabolomics-based diagnosis: in some cases, a single marker from the metabolic profile might be sufficient to detect the disease specifically, but in most cases, machine learning techniques are applied to recognize and classify metabolic profiles or fingerprints of normal and disease states. The most widely used are linear discriminant analysis (LDA), artificial neural networks (ANN), and support vector machines (SVM). Principal component analysis (PCA) is often employed for dimensionality reduction before model training in order to lower the chance of over-fitting the model. Another way to guard against over-fitting is to apply cross-validation techniques at the model training step.
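The role of cross-validation can be illustrated with a short, self-contained sketch. To keep it dependency-free, a simple nearest-centroid classifier stands in for LDA, ANN, or SVM; this substitution, the synthetic “metabolic profiles”, and all function names are assumptions made for illustration only. The point is the evaluation scheme: each fold is scored by a model that never saw it during training.

```python
import random


def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fit(X, y):
    """Nearest-centroid 'model': one mean profile per class label."""
    return {label: centroid([x for x, lab in zip(X, y) if lab == label])
            for label in set(y)}


def predict(model, x):
    """Assign x to the class whose centroid is closest."""
    return min(model, key=lambda label: sq_dist(model[label], x))


def k_fold_accuracy(X, y, k=5, seed=0):
    """Mean held-out accuracy over k folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict(model, X[i]) == y[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / k


# Synthetic profiles (assumption): 3 features, two well-separated classes.
rng = random.Random(42)
X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(30)] + \
    [[rng.gauss(4, 1) for _ in range(3)] for _ in range(30)]
y = ["normal"] * 30 + ["disease"] * 30

accuracy = k_fold_accuracy(X, y)
print(f"5-fold cross-validated accuracy: {accuracy:.2f}")
```

If a dimension-reduction step such as PCA is used, it should be fitted inside each training fold rather than on the full data set; otherwise information from the held-out samples leaks into the model.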

Systems biology
Not until the completion of the Human Genome Project was it realized that gene sequence alone is insufficient to identify all the biological origins of a disease. The function of each protein and the complexities of protein-protein interactions are critical for understanding physiological processes. In addition, recent studies show that non-coding parts of the genome produce small conserved ribonucleic acids (non-coding RNA, ncRNA) that control molecular and cellular processes (Alexander et al., 2010; Tutar, 2012). Thus, in order to develop effective diagnostic techniques and disease treatments, genomics, transcriptomics, and proteomics should be studied together as an integrated system.
Using a systems-based approach, Lusis et al. integrated genomic, molecular, and physiological data with traditional genetic and biochemical methods to study complex diseases including diabetes and cardiovascular disease. They pointed out that analyzing the individual components of the whole system is far from sufficient, since in reality these components interact with each other and these interactions play crucial roles in the development of disease (Lusis et al., 2008).
A number of recent studies have successfully applied network models to describe and simplify such complex systems (Akutekwe & Seker, 2014; Barabási et al., 2011; Gilman et al., 2011; O'Roak et al., 2012; Vandin et al., 2011; Vidal et al., 2011). In these studies, network topology is used to investigate biological networks, including metabolic networks, protein-protein interaction networks, gene regulatory networks, and transcriptional profiling networks, as well as their interactions. For example, in the gene network clusters created by Gilman et al. (2011) using NETBAG (network-based analysis of genetic associations), many proteins were found to be involved in the development of autism. These proteins may become new biomarkers for the diagnosis of autism. In another study, Akutekwe & Seker (2014) described a biomarker identification method that used a dynamic Bayesian network to model the temporal relationships among stratified features for early diagnosis of ovarian cancer. Gstaiger & Aebersold (2009) tried to bridge the gap between genotype and phenotype by studying the inference of genetically perturbed molecular networks based on a combination of genomics, proteomics, and phenomics data. All these innovative strategies may provide a deeper understanding of disease development and help us discover new indicators for early-stage diseases.

Early diagnosis of osteoarthritis
Osteoarthritis (OA), one of the leading causes of chronic disability worldwide, is a form of arthritic disease characterized by the progressive destruction of articular cartilage. The pathogenesis of OA is multifactorial: aging, injury, and genetic predisposition may all be contributing factors that cause joint cartilage degeneration. Currently, clinical diagnosis of OA relies on radiographic assessment, pain symptoms, and mobility of the joint. Unfortunately, OA develops asymptomatically in its early stages, and by the time it becomes detectable, extensive and irreversible deterioration of the joint has already occurred. Therefore, there is a need for new diagnostic methods, such as new specific biological markers, to detect OA before such deterioration happens. However, without understanding the biological mechanisms of OA, the search for effective early biomarkers among billions of molecules is like finding a “needle in a haystack”. In the past few years, developments in “omics” and bioinformatics techniques have influenced the study of the etiology and diagnosis of complex diseases like OA. High-throughput screening of biomarkers at the whole-“omic” level has become a reality. As an example, we describe recent progress and challenges in early-stage OA diagnosis using these high-throughput techniques.

Genomics in OA diagnosis
Genome-wide association studies have examined the association between OA and thousands of SNPs across the whole genome. So far, approximately 15 OA susceptibility loci have been identified by GWAS, although some of them are gender- or race-specific (Tsezou, 2014). Elliott et al. (2013) found significant overlap between OA and height, and between OA and body mass index (BMI), by comparing OA and BMI GWAS data, suggesting that OA and obesity may share a genetic background. In a more comprehensive meta-analysis, Rodriguez-Fontenla et al. (2014) summarized nine GWAS of OA and identified two genes (COL11A1 and VEGF) that are significantly associated with hip OA development.
In order to find the rare variants that are missed in common GWAS, Boer et al. conducted a whole-exome sequencing study of 1524 participants, of whom 199 had hip OA. Besides three genes already identified in previous GWAS, they found that the fibroblast growth factor 3 gene (FGF3) may contribute to hip OA by suppressing endochondral bone formation (Boer et al., 2014). Unfortunately, to our knowledge, this is the only OA-related whole genome/exome sequencing study published to date. To obtain a better understanding of the genomic architecture of OA, additional large-scale whole-genome next-generation sequencing (NGS) studies on various cohorts should be undertaken.
Several recent genome-wide DNA epigenetic studies using high-throughput arrays have revealed new potential OA biomarkers. DNA methylation (one of the common DNA epigenetic modifications in promoter regions of genomic DNA) may influence DNA stability and chromatin structure and regulate gene expression. Several studies have examined the genome-wide DNA methylation profile of human articular chondrocytes in cartilage and trabecular bone samples from OA patients and healthy controls to identify profiles of DNA methylation in OA (Delgado-Calle et al., 2013; Fernandez-Tajes et al., 2014). All these studies found significantly different methylation levels in certain genes between the patient and normal groups, and it is possible that these methylation sites and the genes that contain them could be used as new diagnostic markers for OA.

Transcriptomics in OA diagnosis
Many microarray-based gene expression studies on various tissue types from OA patients have identified differentially expressed genes and profiles that could contribute to the development of new biomarkers. For example, Blom et al. (2014) identified approximately 200 differentially expressed genes (fold change ± 2) in synovium, whereas in peripheral blood, 86 genes were expressed with at least a 1.5-fold difference (Ramos et al., 2014). As increasing evidence indicates that the subchondral bone plays a major role in the initiation and progression of OA, Chou et al. (2013) performed a whole-genome gene expression study of subchondral bone. They found a total of 972 genes that were differentially expressed (fold change ± 2) between normal and OA bone samples. Interestingly, these studies identified very few of the same differentially expressed genes, suggesting that in OA, disease-related gene expression changes with time or may be highly tissue and/or patient specific. Although a few molecular models can explain a small portion of tissue-dependent gene expression regulation, the full regulatory mechanisms in different tissues are not clear (Fu et al., 2012). Nevertheless, it is essential that we consider the complex (and in some cases, non-canonical) roles of genes and their pathways in diverse tissue and cell types. Hence, it is important that different studies use expression data from the same tissues to maintain comparability when assessing the association between genes and disease.
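The fold-change thresholds quoted above can be read as a simple filter on the ratio of mean expression between groups. The sketch below shows one common way to apply such a filter on a log2 scale so that up- and down-regulation are treated symmetrically. The gene names, expression values, and function name are hypothetical, and the cited studies may have used different normalization and statistical testing.

```python
import math


def differentially_expressed(control, disease, min_fold=2.0):
    """Return {gene: log2 fold change} for genes whose mean expression differs
    by at least min_fold in either direction (disease mean / control mean)."""
    threshold = math.log2(min_fold)
    hits = {}
    for gene in control.keys() & disease.keys():
        mean_c = sum(control[gene]) / len(control[gene])
        mean_d = sum(disease[gene]) / len(disease[gene])
        log2_fc = math.log2(mean_d / mean_c)
        if abs(log2_fc) >= threshold:
            hits[gene] = log2_fc
    return hits


# Hypothetical expression values in arbitrary units (assumption).
control = {
    "gene_1": [10.0, 12.0, 11.0],
    "gene_2": [50.0, 48.0, 52.0],
    "gene_3": [8.0, 8.0, 8.0],
}
disease = {
    "gene_1": [44.0, 40.0, 48.0],
    "gene_2": [55.0, 50.0, 51.0],
    "gene_3": [2.0, 2.0, 2.0],
}

hits = differentially_expressed(control, disease)
for gene, lfc in sorted(hits.items()):
    print(f"{gene}: log2 fold change = {lfc:+.2f}")
```

A fold-change cut-off alone says nothing about variability between samples, so in practice it is usually combined with a statistical test and a multiple-testing correction.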

Proteomics and metabolomics in OA diagnosis
Although proteomics and metabolomics approaches in OA diagnostic studies are relatively new, they have already identified a great number of potential disease biomarkers. A broad-ranging investigation of proteomic profiles in different tissues has been conducted, including femoral head, humeral head, meniscus, explants, etc. (Hsueh et al., 2014). Other studies have focused on human body fluids, as their collection is comparatively non-invasive and consequently easier to translate to clinical practice. Serum and urine are the most commonly used body fluids for proteomic analysis of OA (Takinami et al., 2013). However, since these fluids are spatially removed from the affected tissues, it is possible that some key proteins may be diluted. Synovial fluid (SF), although sometimes difficult to obtain, can be studied as a compromise between non-invasiveness and sensitivity (Balakrishnan et al., 2014). A metabolomics analysis of synovial fluid has successfully classified OA phenotypes into two metabolically distinct subgroups using the concentration of acylcarnitine, which may be related to the carnitine metabolism pathway. These types of studies will help to unravel the complex pathogenesis of OA and simplify new biomarker discovery by dividing OA into several subtypes.
A problem with proteomic and metabolomic studies of early OA is that abnormal protein or metabolite expression is relatively dynamic compared with gene mutations. Usually samples are obtained from patients who are already clinically diagnosed with OA; therefore, the proteomic and metabolomic profiles can only represent the status of the patients at the advanced or even end stage of the disease. Without knowing how biomarker profiles change during OA progression, we should be careful in assuming that differentially expressed proteins or metabolites in late OA are also potential biomarkers for early OA diagnosis. Takinami et al. (2013) conducted a study that followed knee OA patients for 2 years to overcome this problem. However, OA is known to have a much longer pathogenic course in some patients (even up to decades), and some evidence shows that cartilage degeneration that could ultimately lead to OA can start in youth. Therefore, it is essential to develop long-term follow-up studies now, so that the next generation will be able to benefit from these types of diagnostic studies in OA.

Systems biology in OA diagnosis
Extensive “omics” data have been screened so far and many biomarkers have been proposed, but their sensitivity or specificity is not high enough for clinical use and their reliability varies among studies (Table 1). One possible explanation is the multifactorial pathogenesis of OA: aging, injury, and genetic predisposition may all act as contributing factors, and consequently, single-biomarker diagnostics are not sufficient to classify all early-stage OA patients of various etiologies. Although systems biology is an effective approach to complex disease research, very few such studies have been conducted on OA. Olex et al. (2014) integrated time-course microarray gene expression data from a mouse model into a protein-protein interaction (PPI) network. However, mice are known to have a very different genetic response from humans following an injury, and mouse models might be poor representatives of the human inflammatory response (Seok et al., 2013). Nacher et al. (2014) applied a PageRank-based diffusion algorithm to identify OA-related proteins in a chondrocyte protein network and found that protein Q6EEV6 could play a key role in OA development. In a similar study, some of the top hub genes in the PPI network were also differentially expressed, indicating that these genes may be potential targets for OA diagnosis and treatment (Wang et al., 2014).
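The general idea behind PageRank-style diffusion on a protein network can be sketched as follows. This is a generic personalized PageRank (random walk with restart to a set of seed proteins) on a small undirected graph, not the specific algorithm of the study cited above. The protein names P1 to P6, the edges, the seed set, and the damping value are all hypothetical.

```python
def personalized_pagerank(edges, seeds, damping=0.85, iterations=100):
    """Power iteration for PageRank with restart to `seeds` on an undirected graph.
    Every seed is assumed to appear in at least one edge."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    nodes = sorted(neighbors)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    score = dict(restart)
    for _ in range(iterations):
        # With probability (1 - damping) the walker jumps back to a seed.
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        # Otherwise it moves to a uniformly chosen neighbor.
        for n in nodes:
            share = damping * score[n] / len(neighbors[n])
            for m in neighbors[n]:
                nxt[m] += share
        score = nxt
    return score


# Hypothetical interaction network and seed protein (assumptions).
edges = [("P1", "P2"), ("P1", "P3"), ("P2", "P3"),
         ("P3", "P4"), ("P4", "P5"), ("P5", "P6")]
scores = personalized_pagerank(edges, seeds={"P1"})

for protein, value in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{protein}: {value:.3f}")
```

Proteins that score highly without being seeds are close, in network terms, to the known disease-related proteins and are therefore candidates for follow-up; such candidates still need experimental confirmation.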
All these studies share some common limitations. First, further genetic and experimental studies are needed to rule out false-positive results from computational analysis. Second, all these studies try to find one or several biomarkers, which departs from the original purpose of systems biology in complex disease: to study complex intracellular and intercellular networks as a whole. The lack of effective methods for interpreting biological network results might be one of the reasons. Pilot studies are needed to put computational analysis into perspective in the future.

Importance of patient cohort characterization
Although high-throughput “omics” platforms coupled with complex bioinformatics approaches have had a number of successes in identifying potential biomarkers in complex diseases such as cancer (Wang et al., 2009; Zhang et al., 2013), sepsis (Lukaszewski et al., 2008), arthritis (Heard et al., 2014; Swan et al., 2013), and others, it is important to realize that some, if not all, complex diseases have numerous associated co-morbidities and risk factors. Therefore, it is essential to have extremely well-characterized patient cohorts to be sure that we are not identifying biomarkers associated with those co-morbidities and/or risk factors rather than with the disease itself. This is particularly important in diseases where no early diagnostic tests exist to assist in the confirmation/validation of the novel biomarkers.

Conclusion
High-throughput “omics” techniques bring new energy to diagnostics, offering a comprehensive data resource from micro (e.g. genomics) to macro (e.g. phenomics). Facing the “big data” generated by such techniques, more powerful computational resources and more efficient models or algorithms are needed for data storage, transfer, and mining. Systems biology is one of the most successful methods for studying biological processes and integrating multiple data resources. Many studies have applied network models to describe etiopathogenesis and immune responses, which may help the discovery of novel biomarkers for early diagnosis. However, we should be careful when applying such models, especially when there is uncertainty regarding the bias of clinical data and no other diagnostic tests are available for validation.